718 lines
50 KiB
Plaintext
|
{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"## **Laboratory Work No. 2**\n",
"\n",
"Loading and analyzing three datasets"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"## **Dataset Descriptions**\n",
"\n",
"\n",
"## First dataset: Diamonds Prices\n",
"\n",
"**Description:** This dataset describes about 53,940 round-cut diamonds (the file loaded below contains 53,943 rows) through 10 characteristics plus a row identifier. Most of the variables are numeric, while cut, color and clarity are categorical. Prices are given in US dollars.\n",
"\n",
"**Object of study:** Round-cut diamonds whose characteristics affect their price.\n",
"\n",
"**Object attributes:** \n",
"1. **id** - Unique identifier of each diamond (it appears as `Unnamed: 0` in the loaded file).\n",
"\n",
"2. **carat** - Weight of the diamond in carats. A carat is a unit of mass equal to 0.2 grams.\n",
"\n",
"3. **cut** - Cut grade of the diamond, which affects how well it reflects light.\n",
"\n",
"4. **color** - Color of the diamond, graded on a scale where higher grades mean less yellow tint and a higher value. \n",
"\n",
"5. **clarity** - Clarity of the diamond, measured by the number and size of internal inclusions or external blemishes.\n",
"\n",
"6. **depth** - Total depth of the diamond, expressed as a percentage of its average diameter.\n",
"\n",
"7. **table** - Width of the flat top facet of the diamond (the \"table\"), expressed as a percentage of its average diameter.\n",
"\n",
"8. **price** - Price of the diamond in US dollars.\n",
"\n",
"9. **x** - Length of the diamond in millimeters.\n",
"\n",
"10. **y** - Width of the diamond in millimeters.\n",
"\n",
"11. **z** - Depth of the diamond in millimeters.\n",
"\n",
"**Research goal:** Analyze the relationships between the characteristics of diamonds (such as carat, clarity and cut) and their price. This analysis can show which attributes affect the price most and provide a basis for predicting prices from these parameters.\n",
"\n",
"**Dataset link:** https://www.kaggle.com/datasets/nancyalaswad90/diamonds-prices\n",
|
|||
|
"\n",
|
|||
"## Second dataset: Forbes 2022 Billionaires data\n",
"\n",
"**Description:** This dataset contains the annual ranking of the world's wealthiest billionaires compiled by Forbes magazine. It includes each person's net worth, estimated in US dollars from documented assets minus debts. People whose wealth cannot be documented are excluded, as are royalty and dictators whose wealth depends on their position. The data collection methodology includes interviewing billionaires, analyzing public records and valuing assets at market prices.\n",
"\n",
"**Object of study:** The documented fortunes of billionaires worldwide as of 2022.\n",
"\n",
"**Object attributes:** \n",
"1. **Rank** - Position in the Forbes billionaires list, ordered by net worth starting with the richest person.\n",
"\n",
"2. **Name** - First and last name of the billionaire.\n",
"\n",
"3. **Networth** - Net worth of the billionaire in billions of US dollars.\n",
"\n",
"4. **Age** - Age of the billionaire when the ranking was compiled.\n",
"\n",
"5. **Country** - The billionaire's country, indicating citizenship or primary place of residence.\n",
"\n",
"6. **Source** - Source of the fortune: the main companies, industries or types of business through which the wealth was accumulated.\n",
"\n",
"7. **Industry** - Industry of the billionaire's main source of income. \n",
"\n",
"**Research goal:** Determine how net worth is distributed and how factors such as the source of the fortune and the geographic region relate to its size. This helps reveal trends in the distribution of wealth and the key factors associated with billionaires' fortunes.\n",
"\n",
"**Dataset link:** https://www.kaggle.com/datasets/surajjha101/forbes-billionaires-data-preprocessed\n",
|
|||
|
"\n",
|
|||
"## Third dataset: Tesla Insider Trading\n",
"\n",
"**Description:** This dataset is a small sample of insider trading in Tesla shares and contains records of large transactions from November 2021 through July 2022. It includes the person who made the trade, their position, the transaction type (purchase, sale or option), the cost and number of shares, the date and the total value of the trade. It also gives the date the report was filed with the SEC (Form 4).\n",
"\n",
"**Object of study:** Transactions in Tesla shares made by company insiders between November 2021 and July 2022.\n",
"\n",
"**Object attributes:** \n",
"1. **Insider Trading** - The person who made the transaction.\n",
"\n",
"2. **Relationship** - That person's role in the company.\n",
"\n",
"3. **Date** - Date on which the transaction was completed.\n",
"\n",
"4. **Transaction** - Type of transaction.\n",
"\n",
"5. **Cost** - Cost of the shares in this transaction.\n",
"\n",
"6. **Shares** - Number of shares involved in the transaction.\n",
"\n",
"7. **Value ($)** - Total value of the transaction.\n",
"\n",
"8. **Shares Total** - Total number of shares the person holds at that point.\n",
"\n",
"9. **SEC Form 4** - Date on which the transaction was filed.\n",
"\n",
"**Research goal:** Analyze insider trades, identify trends in insiders' purchases and sales of shares, and assess the potential effect of these actions on the Tesla share price. The results may be useful for forecasting the share price and for making investment decisions based on the actions of large shareholders.\n",
"\n",
"**Dataset link:** https://www.kaggle.com/datasets/ilyaryabov/tesla-insider-trading"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"## **Working with the datasets**\n",
"\n",
"Let's load the three datasets and inspect their structure."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"RangeIndex: 53943 entries, 0 to 53942\n",
|
|||
|
"Data columns (total 11 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 Unnamed: 0 53943 non-null int64 \n",
|
|||
|
" 1 carat 53943 non-null float64\n",
|
|||
|
" 2 cut 53943 non-null object \n",
|
|||
|
" 3 color 53943 non-null object \n",
|
|||
|
" 4 clarity 53943 non-null object \n",
|
|||
|
" 5 depth 53943 non-null float64\n",
|
|||
|
" 6 table 53943 non-null float64\n",
|
|||
|
" 7 price 53943 non-null int64 \n",
|
|||
|
" 8 x 53943 non-null float64\n",
|
|||
|
" 9 y 53943 non-null float64\n",
|
|||
|
" 10 z 53943 non-null float64\n",
|
|||
|
"dtypes: float64(6), int64(2), object(3)\n",
|
|||
|
"memory usage: 4.5+ MB\n",
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"RangeIndex: 2600 entries, 0 to 2599\n",
|
|||
|
"Data columns (total 7 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 Rank 2600 non-null int64 \n",
|
|||
|
" 1 Name 2600 non-null object \n",
|
|||
|
" 2 Networth 2600 non-null float64\n",
|
|||
|
" 3 Age 2600 non-null int64 \n",
|
|||
|
" 4 Country 2600 non-null object \n",
|
|||
|
" 5 Source 2600 non-null object \n",
|
|||
|
" 6 Industry 2600 non-null object \n",
|
|||
|
"dtypes: float64(1), int64(2), object(4)\n",
|
|||
|
"memory usage: 142.3+ KB\n",
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"RangeIndex: 156 entries, 0 to 155\n",
|
|||
|
"Data columns (total 9 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 Insider Trading 156 non-null object \n",
|
|||
|
" 1 Relationship 156 non-null object \n",
|
|||
|
" 2 Date 156 non-null object \n",
|
|||
|
" 3 Transaction 156 non-null object \n",
|
|||
|
" 4 Cost 156 non-null float64\n",
|
|||
|
" 5 Shares 156 non-null object \n",
|
|||
|
" 6 Value ($) 156 non-null object \n",
|
|||
|
" 7 Shares Total 156 non-null object \n",
|
|||
|
" 8 SEC Form 4 156 non-null object \n",
|
|||
|
"dtypes: float64(1), object(8)\n",
|
|||
|
"memory usage: 11.1+ KB\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd \n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//DiamondsPrices2022.csv\")\n",
|
|||
|
"df2 = pd.read_csv(\"..//static//csv//ForbesBillionaires.csv\")\n",
|
|||
|
"df3 = pd.read_csv(\"..//static//csv//TSLA.csv\")\n",
|
|||
|
"\n",
|
|||
|
"df.info()\n",
|
|||
|
"df2.info()\n",
|
|||
|
"df3.info()"
|
|||
|
]
|
|||
|
},
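{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `info()` output above lists `Shares`, `Value ($)`, `Shares Total` and both date columns of the Tesla dataset as `object`, so numeric summaries skip them. Below is a minimal sketch of one possible conversion. It runs on a toy frame rather than the real file, and the value formats it uses (comma thousands separators, ISO dates) are assumptions that the source does not confirm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: the real TSLA.csv formats may differ\n",
"toy = pd.DataFrame({\n",
"    'Shares': ['1,200', '35,000'],\n",
"    'Date': ['2022-03-22', '2021-11-15'],\n",
"})\n",
"\n",
"# Drop the assumed thousands separators, then convert the text to numbers\n",
"toy['Shares'] = pd.to_numeric(toy['Shares'].str.replace(',', '', regex=False))\n",
"\n",
"# Parse the assumed ISO date strings into datetime values\n",
"toy['Date'] = pd.to_datetime(toy['Date'])\n",
"\n",
"toy.info()"
]
},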
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### **Checking for missing data**\n",
"\n",
"**Types of missing data:**\n",
"- **None** - the representation of missing data in Python\n",
"- **NaN** - the representation of missing data in Pandas\n",
"- **''** - an empty string (not counted as missing by `isnull()`)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
"Diamond prices\n",
"Unnamed: 0 0\n",
"carat 0\n",
"cut 0\n",
"color 0\n",
"clarity 0\n",
"depth 0\n",
"table 0\n",
"price 0\n",
"x 0\n",
"y 0\n",
"z 0\n",
"dtype: int64\n",
"\n",
"Unnamed: 0 False\n",
"carat False\n",
"cut False\n",
"color False\n",
"clarity False\n",
"depth False\n",
"table False\n",
"price False\n",
"x False\n",
"y False\n",
"z False\n",
"dtype: bool\n",
"\n",
"Forbes billionaires ranking\n",
"Rank 0\n",
"Name 0\n",
"Networth 0\n",
"Age 0\n",
"Country 0\n",
"Source 0\n",
"Industry 0\n",
"dtype: int64\n",
"\n",
"Rank False\n",
"Name False\n",
"Networth False\n",
"Age False\n",
"Country False\n",
"Source False\n",
"Industry False\n",
"dtype: bool\n",
"\n",
"Tesla insider trading\n",
"Insider Trading 0\n",
"Relationship 0\n",
"Date 0\n",
"Transaction 0\n",
"Cost 0\n",
"Shares 0\n",
"Value ($) 0\n",
"Shares Total 0\n",
"SEC Form 4 0\n",
"dtype: int64\n",
"\n",
"Insider Trading False\n",
"Relationship False\n",
"Date False\n",
"Transaction False\n",
"Cost False\n",
"Shares False\n",
"Value ($) False\n",
"Shares Total False\n",
"SEC Form 4 False\n",
"dtype: bool\n",
"\n"
]
}
],
"source": [
"# Printed title for each DataFrame to check\n",
"datasets = {\n",
"    \"Diamond prices\": df,\n",
"    \"Forbes billionaires ranking\": df2,\n",
"    \"Tesla insider trading\": df3,\n",
"}\n",
"\n",
"for title, data in datasets.items():\n",
"    print(title)\n",
"    # Number of missing values per column\n",
"    print(data.isnull().sum())\n",
"    print()\n",
"    # Whether each column contains any missing values\n",
"    print(data.isnull().any())\n",
"    print()\n",
"    # Percentage of missing values, printed only for columns that have any\n",
"    for column in data.columns:\n",
"        null_rate = data[column].isnull().sum() / len(data) * 100\n",
"        if null_rate > 0:\n",
"            print(f\"{column} missing values: {null_rate:.2f}%\\n\")"
|
|||
|
]
|
|||
|
},
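{
"cell_type": "markdown",
"metadata": {},
"source": [
"`isnull()` counts `None` and `NaN` but not empty strings, so the zero counts above do not rule out blank text fields. The sketch below shows one way to treat empty strings as missing before counting. The toy frame and its column names are hypothetical and are not taken from the three datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with one None, one NaN and one empty string\n",
"toy = pd.DataFrame({'name': ['a', '', None], 'value': [1.0, np.nan, 3.0]})\n",
"\n",
"# The empty string is not counted as missing here\n",
"print(toy.isnull().sum())\n",
"\n",
"# Replace empty strings with NaN so that they are counted as well\n",
"cleaned = toy.replace('', np.nan)\n",
"print(cleaned.isnull().sum())"
]
},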
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"The check found no missing values in any of the three datasets."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"Let's examine the numeric columns of each dataset to spot anomalous distributions."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
" Unnamed: 0 carat depth table price \\\n",
|
|||
|
"count 53943.000000 53943.000000 53943.000000 53943.000000 53943.000000 \n",
|
|||
|
"mean 26972.000000 0.797935 61.749322 57.457251 3932.734294 \n",
|
|||
|
"std 15572.147122 0.473999 1.432626 2.234549 3989.338447 \n",
|
|||
|
"min 1.000000 0.200000 43.000000 43.000000 326.000000 \n",
|
|||
|
"25% 13486.500000 0.400000 61.000000 56.000000 950.000000 \n",
|
|||
|
"50% 26972.000000 0.700000 61.800000 57.000000 2401.000000 \n",
|
|||
|
"75% 40457.500000 1.040000 62.500000 59.000000 5324.000000 \n",
|
|||
|
"max 53943.000000 5.010000 79.000000 95.000000 18823.000000 \n",
|
|||
|
"\n",
|
|||
|
" x y z \n",
|
|||
|
"count 53943.000000 53943.000000 53943.000000 \n",
|
|||
|
"mean 5.731158 5.734526 3.538730 \n",
|
|||
|
"std 1.121730 1.142103 0.705679 \n",
|
|||
|
"min 0.000000 0.000000 0.000000 \n",
|
|||
|
"25% 4.710000 4.720000 2.910000 \n",
|
|||
|
"50% 5.700000 5.710000 3.530000 \n",
|
|||
|
"75% 6.540000 6.540000 4.040000 \n",
|
|||
|
"max 10.740000 58.900000 31.800000 \n",
|
|||
|
" Rank Networth Age\n",
|
|||
|
"count 2600.000000 2600.000000 2600.000000\n",
|
|||
|
"mean 1269.570769 4.860750 64.271923\n",
|
|||
|
"std 728.146364 10.659671 13.220607\n",
|
|||
|
"min 1.000000 1.000000 19.000000\n",
|
|||
|
"25% 637.000000 1.500000 55.000000\n",
|
|||
|
"50% 1292.000000 2.400000 64.000000\n",
|
|||
|
"75% 1929.000000 4.500000 74.000000\n",
|
|||
|
"max 2578.000000 219.000000 100.000000\n",
|
|||
|
" Cost\n",
|
|||
|
"count 156.000000\n",
|
|||
|
"mean 478.785641\n",
|
|||
|
"std 448.922903\n",
|
|||
|
"min 0.000000\n",
|
|||
|
"25% 50.522500\n",
|
|||
|
"50% 240.225000\n",
|
|||
|
"75% 934.107500\n",
|
|||
|
"max 1171.040000\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print(df.describe())\n",
|
|||
|
"print(df2.describe())\n",
|
|||
|
"print(df3.describe())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"We will look for anomalous values using the Z-score. The Z-score shows how many standard deviations a value lies from the mean. Values with a Z-score above 3 or below -3 are usually considered anomalous."
|
|||
|
]
|
|||
|
},
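{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the definition above, the Z-score can be computed by hand with pandas. The series below is made up for the example, and `ddof=0` is chosen to match the default of `scipy.stats.zscore`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up sample: eleven similar values and one much larger value\n",
"values = pd.Series([10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 50])\n",
"\n",
"# Z-score: distance from the mean in population standard deviations\n",
"z = (values - values.mean()) / values.std(ddof=0)\n",
"\n",
"# Keep the values whose absolute Z-score exceeds the usual threshold of 3\n",
"print(values[z.abs() > 3].tolist())"
]
},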
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Аномалии в наборе данных Diamonds Prices:\n",
|
|||
|
"В атрибуте 'carat' обнаружены аномалии: [2.22, 2.27, 2.49, 3.0, 2.22, 2.25, 2.32, 2.72, 2.23, 2.25, 2.27, 2.3, 2.31, 2.5, 3.01, 3.0, 2.33, 2.68, 2.25, 2.5, 2.34, 2.5, 2.74, 2.28, 2.25, 2.68, 2.43, 3.11, 3.01, 2.52, 2.77, 2.63, 3.05, 2.46, 3.02, 2.63, 2.22, 2.38, 3.01, 3.0, 2.24, 2.32, 2.3, 2.25, 2.25, 2.34, 2.34, 2.26, 2.36, 2.29, 3.0, 2.72, 2.33, 3.65, 2.45, 2.45, 2.24, 2.4, 2.5, 2.54, 2.31, 2.36, 3.24, 2.3, 2.38, 2.31, 2.4, 2.3, 2.58, 2.22, 3.22, 3.5, 2.23, 2.54, 2.28, 2.23, 2.29, 2.26, 2.48, 2.28, 2.23, 2.27, 2.3, 2.34, 2.44, 2.23, 2.26, 2.75, 2.29, 3.0, 2.33, 2.5, 2.5, 2.33, 2.22, 2.3, 2.41, 2.22, 2.25, 2.32, 2.29, 2.22, 2.22, 2.29, 2.29, 2.22, 2.61, 2.26, 2.24, 2.31, 2.52, 2.26, 2.28, 2.28, 2.26, 2.23, 2.3, 2.3, 2.22, 2.29, 2.23, 2.36, 2.35, 2.5, 2.51, 2.3, 2.28, 3.01, 2.27, 2.51, 2.23, 2.22, 2.7, 2.25, 2.28, 2.55, 2.27, 2.35, 2.22, 2.54, 2.29, 2.5, 2.51, 2.51, 2.3, 2.22, 2.38, 2.5, 2.53, 2.22, 2.32, 2.32, 2.33, 2.54, 2.43, 2.23, 2.3, 2.58, 2.55, 2.25, 2.33, 2.31, 2.37, 2.23, 3.0, 2.27, 2.47, 2.43, 2.48, 2.8, 2.22, 2.33, 2.35, 2.53, 2.22, 4.01, 4.01, 2.56, 2.5, 2.41, 2.25, 2.24, 2.41, 2.31, 2.51, 3.04, 2.29, 2.24, 2.54, 2.25, 2.28, 2.75, 2.26, 2.47, 2.3, 2.24, 2.22, 2.26, 2.27, 2.24, 2.46, 2.48, 2.52, 2.27, 2.34, 2.4, 2.24, 2.39, 2.4, 2.5, 3.4, 4.0, 2.5, 2.3, 2.53, 2.25, 2.37, 2.23, 3.01, 2.28, 2.37, 2.3, 2.24, 2.24, 2.31, 2.38, 2.43, 2.28, 2.3, 2.28, 3.67, 2.58, 2.42, 2.42, 2.42, 2.32, 2.66, 2.66, 2.26, 2.26, 2.38, 2.4, 2.51, 2.24, 2.4, 2.65, 2.3, 2.35, 2.54, 2.54, 2.28, 2.22, 2.51, 2.28, 2.59, 2.46, 2.47, 2.44, 3.01, 2.22, 2.24, 2.28, 2.25, 2.45, 2.28, 2.38, 2.24, 2.39, 2.4, 2.22, 2.36, 2.24, 2.53, 2.48, 2.51, 2.37, 2.39, 2.4, 2.31, 2.23, 2.48, 2.42, 2.51, 2.26, 2.63, 2.49, 2.53, 2.37, 2.28, 2.5, 2.28, 3.0, 3.0, 2.41, 2.27, 2.31, 2.26, 2.6, 2.5, 2.4, 2.27, 2.32, 2.31, 2.53, 2.57, 2.22, 2.25, 2.71, 2.27, 2.22, 2.22, 2.51, 2.74, 2.42, 2.42, 2.74, 2.56, 2.3, 2.6, 2.61, 2.31, 2.52, 2.25, 2.42, 2.35, 2.35, 2.26, 4.13, 2.54, 2.39, 2.5, 2.64, 2.48, 2.31, 2.51, 
2.44, 2.33, 2.32, 2.52, 2.36, 2.29, 2.36, 2.53, 2.48, 2.52, 2.39, 2.29, 2.28, 2.32, 2.52, 2.31, 2.56, 2.72, 2.36, 2.43, 2.32, 2.28, 2.48, 2.39, 2.41, 2.57, 2.4, 2.39, 2.24, 2.54, 2.35, 2.25, 5.01, 2.51, 2.32, 2.32, 2.26, 2.51, 2.25, 2.25, 2.51, 2.28, 2.29, 2.32, 2.51, 2.3, 2.45, 2.33, 2.23, 2.52, 2.3, 2.3, 3.01, 3.01, 3.01, 3.01, 3.01, 2.52, 2.53, 2.51, 2.24, 2.3, 2.51, 2.5, 2.49, 2.27, 2.22, 2.6, 2.32, 2.4, 2.26, 2.29, 2.44, 2.22, 2.5, 2.57, 2.66, 2.32, 2.37, 2.38, 2.22, 4.5, 2.32, 2.4, 2.4, 3.04, 2.38, 3.01, 2.29, 2.42, 2.29, 2.67, 2.43, 2.48, 3.51, 2.22, 3.01, 3.01, 2.36, 2.61, 2.55, 2.8, 2.29, 2.29]\n",
|
|||
|
"В атрибуте 'depth' обнаружены аномалии: [56.9, 55.1, 66.3, 67.9, 57.2, 55.1, 67.4, 67.3, 66.4, 68.1, 55.0, 56.0, 53.1, 57.2, 66.7, 53.3, 53.0, 67.8, 67.9, 66.1, 56.9, 55.8, 67.6, 68.2, 67.7, 66.3, 69.5, 56.9, 56.6, 56.3, 66.9, 67.0, 56.0, 67.1, 66.4, 69.3, 66.2, 55.4, 66.8, 66.8, 66.1, 66.1, 68.1, 67.0, 66.6, 55.9, 57.3, 57.3, 57.3, 68.2, 57.4, 66.6, 67.0, 67.0, 68.3, 57.2, 66.8, 57.3, 66.3, 66.1, 68.5, 69.3, 57.4, 56.2, 57.4, 66.1, 56.3, 56.5, 56.1, 56.2, 66.5, 67.6, 66.1, 68.4, 66.2, 69.7, 56.9, 66.6, 57.1, 66.4, 68.7, 66.5, 57.4, 56.7, 66.1, 56.6, 66.8, 66.4, 66.1, 68.6, 71.6, 68.7, 67.3, 66.7, 56.7, 67.7, 66.6, 43.0, 68.8, 66.7, 66.1, 67.1, 67.4, 66.5, 66.1, 57.1, 57.1, 67.8, 67.5, 57.2, 66.4, 67.7, 66.3, 69.0, 57.4, 55.2, 68.2, 68.9, 66.8, 57.4, 56.7, 69.6, 57.4, 56.9, 57.3, 57.4, 66.1, 57.0, 56.4, 56.0, 57.4, 66.1, 68.3, 66.5, 56.8, 66.8, 57.3, 44.0, 57.4, 67.0, 66.1, 56.8, 57.2, 56.4, 56.9, 56.8, 56.8, 57.1, 66.2, 56.9, 66.9, 56.5, 55.2, 66.3, 56.8, 66.3, 56.3, 57.3, 67.8, 56.7, 66.8, 57.1, 66.2, 66.3, 56.8, 66.4, 56.4, 67.2, 57.4, 55.9, 68.8, 66.2, 68.3, 67.1, 70.1, 57.0, 71.3, 66.2, 57.4, 70.6, 69.8, 66.9, 56.7, 66.5, 69.8, 56.5, 71.8, 66.6, 66.4, 66.8, 66.8, 57.0, 57.1, 66.6, 66.9, 67.5, 57.1, 57.4, 69.7, 68.4, 67.6, 56.5, 56.9, 55.9, 66.3, 66.1, 69.7, 66.7, 66.9, 57.2, 43.0, 53.8, 66.1, 66.4, 56.3, 66.1, 56.7, 57.4, 69.5, 66.8, 66.7, 55.9, 56.9, 56.9, 66.6, 56.9, 53.2, 67.0, 56.5, 56.8, 56.3, 70.0, 67.3, 66.2, 55.9, 56.8, 57.2, 56.9, 57.4, 68.5, 57.4, 68.9, 68.9, 57.2, 66.6, 69.4, 57.2, 55.8, 66.6, 56.6, 67.6, 57.3, 66.6, 56.7, 66.7, 67.8, 67.4, 55.9, 57.3, 56.3, 67.0, 67.6, 57.3, 68.0, 69.8, 66.9, 57.0, 67.3, 66.6, 69.0, 57.4, 57.0, 67.0, 66.8, 57.4, 68.1, 57.4, 57.4, 57.3, 66.3, 70.2, 57.1, 66.4, 68.0, 68.0, 66.3, 66.9, 57.1, 67.7, 66.1, 68.5, 70.1, 66.2, 67.4, 56.7, 67.7, 50.8, 66.5, 55.6, 70.5, 68.2, 68.0, 57.4, 56.5, 67.6, 56.9, 57.4, 55.9, 68.6, 71.0, 68.4, 67.4, 55.6, 56.1, 67.2, 68.8, 55.8, 56.8, 69.6, 57.1, 56.3, 66.1, 56.7, 66.3, 56.6, 66.5, 
66.8, 67.1, 66.8, 67.3, 66.5, 68.3, 56.2, 66.1, 67.1, 69.1, 56.0, 56.3, 67.5, 56.9, 68.7, 57.2, 57.4, 57.2, 57.2, 57.2, 67.1, 68.3, 66.6, 66.9, 57.0, 66.3, 67.7, 55.2, 66.2, 66.5, 66.9, 56.7, 56.1, 66.7, 57.0, 67.9, 66.7, 66.4, 55.3, 56.9, 57.0, 56.2, 57.2, 67.1, 56.8, 56.2, 67.4, 57.1, 56.8, 66.3, 57.3, 66.3, 56.3, 57.0, 56.7, 57.3, 57.2, 56.5, 67.0, 66.5, 66.6, 66.9, 56.7, 67.2, 70.2, 56.2, 66.7, 66.3, 66.1, 66.8, 66.9, 56.9, 56.8, 67.7, 67.5, 67.6, 56.9, 70.6, 55.2, 67.6, 56.3, 66.2, 66.2, 68.6, 56.9, 56.5, 66.4, 57.0, 66.8, 57.1, 55.3, 56.7, 54.2, 57.0, 67.9, 56.5, 66.8, 68.6, 56.2, 57.1, 56.6, 66.3, 55.2, 57.1, 67.0, 51.0, 66.9, 66.8, 67.3, 67.3, 66.7, 57.4, 57.2, 56.0, 66.9, 56.6, 56.3, 57.4, 68.3, 66.9, 66.3, 70.8, 66.5, 56.4, 66.5, 54.2, 57.1, 66.9, 54.6, 53.2, 54.0, 54.4, 66.3, 56.9, 66.3, 66.5, 55.2, 70.8, 66.5, 56.7, 56.8, 57.3, 67.5, 66.8, 56.3, 56.6, 66.2, 68.6, 66.5, 67.1, 66.2, 56.5, 66.5, 57.1, 56.5, 67.1, 66.6, 66.3, 57.3, 57.1, 68.6, 56.3, 66.5, 56.5, 66.9, 52.3, 67.3, 66.1, 57.2, 56.8, 57.2, 66.4, 66.1, 55.5, 66.5, 57.2, 56.3, 56.8, 56.7, 66.9, 56.3, 57.4, 67.6, 78.2, 57.4, 66.1, 71.2, 55.8, 67.0, 56.2, 52.7, 66.5, 57.3, 57.2, 66.7, 57.4, 56.3, 55.8, 54.3, 57.4, 66.1, 66.8, 55.3, 56.4, 66.3, 57.0, 55.3, 69.3, 66.9, 57.2, 57.4, 56.1, 56.4, 57.2, 56.7, 66.4, 67.0, 57.1, 56.3, 56.9, 56.9, 55.1, 56.0, 66.1, 66.2, 67.6, 71.6, 66.9, 69.7, 69.2, 68.0, 56.9, 66.9, 57.3, 66.1, 67.8, 68.9, 55.8, 56.6, 57.2, 67.2, 68.5, 73.6, 56.9, 68.6, 56.2, 67.4, 67.2, 66.1, 57.0, 55.0, 70.6, 56.2, 57.1, 55.3, 52.2, 67.4, 57.4, 68.4, 67.3, 57.4, 56.1, 67.3, 67.6, 57.0, 57.1, 66.9, 57.2, 57.1, 56.6, 57.4, 67.3, 57.0, 57.2, 55.5, 57.1, 69.9, 56.6, 68.4, 57.4, 57.0, 55.8, 53.4, 56.9, 56.7, 66.3, 57.0, 66.7, 68.5, 57.3, 68.0, 66.1, 66.5, 68.8, 57.1, 70.2, 55.9, 68.2, 67.2, 66.4, 66.5, 56.1, 57.4, 67.8, 67.3, 55.6, 66.3, 66.7, 66.6, 67.8, 56.2, 66.3, 66.2, 66.4, 55.9, 67.1, 57.2, 66.4, 67.7, 57.3, 72.2, 57.4, 56.8, 57.2, 56.9, 57.0, 66.2, 56.9, 56.7, 66.4, 67.3, 55.5, 67.8, 
69.0, 66.6, 66.8, 66.3, 57.4, 66.2, 57.1, 66.1, 66.5, 66.5, 79.0,
|
|||
|
"В атрибуте 'table' обнаружены аномалии: [65.0, 69.0, 67.0, 66.0, 70.0, 66.0, 68.0, 67.0, 67.0, 65.0, 70.0, 69.0, 65.0, 66.0, 67.0, 67.0, 66.0, 65.0, 66.0, 67.0, 66.0, 66.0, 65.0, 66.0, 65.0, 65.0, 67.0, 65.0, 66.0, 65.0, 68.0, 65.0, 66.0, 66.0, 66.0, 50.1, 65.0, 65.0, 66.0, 66.0, 65.0, 65.0, 67.0, 65.0, 65.0, 65.0, 65.0, 65.0, 66.0, 65.0, 65.0, 67.0, 66.0, 68.0, 65.0, 65.0, 65.0, 66.0, 66.0, 65.0, 49.0, 65.0, 66.0, 66.0, 67.0, 67.0, 65.0, 67.0, 66.0, 65.0, 66.0, 67.0, 65.0, 65.0, 65.0, 65.0, 50.0, 65.0, 66.0, 67.0, 65.0, 68.0, 66.0, 65.0, 65.0, 65.0, 65.0, 66.0, 65.0, 67.0, 66.0, 66.0, 66.0, 68.0, 65.0, 67.0, 65.0, 66.0, 66.0, 65.0, 68.0, 65.0, 43.0, 65.0, 65.0, 66.0, 67.0, 65.0, 65.0, 65.0, 65.0, 65.0, 65.0, 68.0, 67.0, 66.0, 65.0, 65.0, 67.0, 65.0, 66.0, 65.0, 65.0, 65.0, 65.0, 67.0, 65.0, 66.0, 67.0, 66.0, 69.0, 65.0, 65.0, 65.0, 66.0, 68.0, 66.0, 66.0, 65.0, 65.0, 69.0, 65.0, 66.0, 65.0, 65.0, 66.0, 67.0, 66.0, 49.0, 68.0, 65.0, 70.0, 66.0, 65.0, 67.0, 68.0, 65.0, 66.0, 65.0, 68.0, 68.0, 66.0, 65.0, 66.0, 69.0, 66.0, 65.0, 95.0, 66.0, 65.0, 66.0, 50.0, 65.0, 66.0, 66.0, 65.0, 65.0, 66.0, 65.0, 65.0, 65.0, 66.0, 69.0, 65.0, 65.0, 65.0, 65.0, 65.0, 66.0, 65.0, 66.0, 66.0, 65.0, 67.0, 66.0, 67.0, 65.0, 65.0, 66.0, 65.0, 44.0, 68.0, 65.0, 65.0, 67.0, 67.0, 66.0, 65.0, 65.0, 66.0, 65.0, 65.0, 66.0, 66.0, 65.0, 69.0, 65.0, 65.0, 66.0, 67.0, 68.0, 65.0, 68.0, 65.0, 70.0, 66.0, 65.0, 65.0, 66.0, 66.0, 65.0, 68.0, 65.0, 64.3, 65.0, 66.0, 69.0, 65.0, 66.0, 65.0, 70.0, 65.0, 65.0, 65.0, 71.0, 66.0, 67.0, 68.0, 67.0, 67.0, 67.0, 66.0, 66.0, 70.0, 67.0, 67.0, 65.0, 67.0, 65.0, 67.0, 65.0, 66.0, 66.0, 65.0, 66.0, 66.0, 66.0, 64.2, 68.0, 66.0, 66.0, 66.0, 66.0, 65.0, 66.0, 65.0, 65.0, 66.0, 65.0, 73.0, 66.0, 65.0, 65.0, 65.0, 67.0, 65.0, 65.0, 68.0, 65.0, 66.0, 65.4, 65.0, 65.0, 65.0, 66.0, 79.0, 65.0, 68.0, 70.0, 66.0, 65.0, 65.0, 67.0, 66.0, 65.0, 65.0, 76.0, 73.0, 65.0, 65.0, 66.0, 66.0, 65.0, 65.0, 65.0, 65.0, 70.0, 65.0, 66.0, 65.0, 70.0, 69.0, 67.0, 67.0, 73.0, 73.0, 
66.0, 68.0, 66.0, 65.0, 65.0, 67.0, 67.0, 65.0, 65.0, 65.0]\n",
|
|||
|
"В атрибуте 'price' обнаружены аномалии: [15907, 15908, 15913, 15915, 15917, 15917, 15917, 15919, 15919, 15919, 15920, 15922, 15923, 15928, 15930, 15930, 15931, 15934, 15937, 15938, 15939, 15939, 15941, 15941, 15941, 15942, 15946, 15948, 15948, 15949, 15949, 15952, 15955, 15957, 15959, 15959, 15962, 15964, 15965, 15966, 15968, 15970, 15970, 15974, 15977, 15977, 15983, 15984, 15984, 15984, 15984, 15987, 15987, 15990, 15991, 15992, 15992, 15992, 15992, 15992, 15992, 15993, 15996, 15996, 16003, 16004, 16013, 16018, 16021, 16023, 16025, 16031, 16036, 16037, 16041, 16043, 16048, 16049, 16052, 16055, 16059, 16062, 16062, 16064, 16064, 16064, 16068, 16068, 16073, 16073, 16075, 16077, 16080, 16082, 16085, 16086, 16086, 16087, 16091, 16092, 16097, 16098, 16100, 16104, 16104, 16111, 16112, 16112, 16116, 16123, 16126, 16128, 16129, 16130, 16131, 16137, 16140, 16146, 16147, 16148, 16149, 16149, 16151, 16169, 16169, 16169, 16170, 16171, 16171, 16174, 16179, 16181, 16183, 16187, 16187, 16188, 16189, 16190, 16191, 16192, 16193, 16195, 16198, 16198, 16198, 16206, 16210, 16215, 16219, 16220, 16223, 16224, 16231, 16231, 16232, 16234, 16235, 16235, 16237, 16239, 16239, 16239, 16240, 16240, 16241, 16241, 16241, 16242, 16253, 16253, 16256, 16256, 16261, 16262, 16273, 16274, 16277, 16278, 16280, 16280, 16286, 16287, 16287, 16287, 16290, 16291, 16294, 16294, 16295, 16297, 16300, 16300, 16304, 16304, 16304, 16309, 16309, 16311, 16314, 16316, 16316, 16316, 16319, 16319, 16319, 16323, 16329, 16336, 16337, 16339, 16340, 16340, 16343, 16353, 16353, 16357, 16357, 16358, 16363, 16364, 16364, 16368, 16369, 16370, 16378, 16380, 16383, 16384, 16386, 16389, 16390, 16392, 16392, 16395, 16397, 16397, 16398, 16400, 16402, 16404, 16406, 16407, 16407, 16409, 16410, 16412, 16420, 16422, 16425, 16426, 16427, 16427, 16427, 16431, 16437, 16439, 16442, 16446, 16450, 16451, 16459, 16462, 16462, 16465, 16466, 16466, 16469, 16472, 16479, 16479, 16483, 16484, 16485, 16492, 16499, 16499, 16505, 16506, 16506, 
16507, 16512, 16512, 16513, 16518, 16519, 16520, 16521, 16530, 16532, 16533, 16538, 16544, 16544, 16545, 16547, 16547, 16551, 16558, 16558, 16558, 16560, 16562, 16564, 16565, 16570, 16575, 16575, 16580, 16582, 16582, 16583, 16587, 16589, 16592, 16593, 16599, 16601, 16603, 16611, 16613, 16616, 16617, 16618, 16624, 16626, 16626, 16626, 16626, 16628, 16628, 16629, 16629, 16629, 16632, 16636, 16641, 16642, 16643, 16643, 16650, 16650, 16650, 16656, 16657, 16665, 16669, 16670, 16677, 16683, 16687, 16687, 16688, 16689, 16690, 16693, 16694, 16694, 16700, 16703, 16704, 16704, 16707, 16709, 16709, 16709, 16715, 16716, 16716, 16716, 16717, 16718, 16718, 16723, 16723, 16728, 16731, 16733, 16733, 16733, 16733, 16733, 16733, 16736, 16737, 16742, 16747, 16750, 16754, 16768, 16769, 16776, 16776, 16778, 16778, 16778, 16778, 16778, 16778, 16778, 16779, 16779, 16779, 16783, 16783, 16783, 16783, 16786, 16787, 16789, 16789, 16789, 16790, 16791, 16792, 16793, 16793, 16797, 16800, 16801, 16803, 16804, 16805, 16807, 16808, 16811, 16813, 16817, 16819, 16820, 16823, 16824, 16826, 16842, 16842, 16854, 16857, 16861, 16872, 16872, 16874, 16878, 16879, 16881, 16881, 16889, 16896, 16900, 16900, 16900, 16901, 16904, 16914, 16914, 16914, 16914, 16915, 16916, 16921, 16922, 16929, 16931, 16934, 16937, 16941, 16942, 16944, 16945, 16948, 16954, 16955, 16955, 16956, 16956, 16957, 16960, 16960, 16960, 16969, 16970, 16970, 16975, 16985, 16985, 16987, 16988, 16992, 16994, 16996, 17000, 17001, 17003, 17005, 17006, 17009, 17010, 17012, 17014, 17016, 17017, 17019, 17024, 17024, 17027, 17028, 17028, 17028, 17029, 17036, 17038, 17039, 17041, 17042, 17045, 17045, 17049, 17049, 17050, 17051, 17051, 17052, 17053, 17057, 17057, 17062, 17063, 17065, 17066, 17068, 17068, 17068, 17068, 17068, 17073, 17073, 17076, 17078, 17079, 17081, 17084, 17094, 17095, 17095, 17096, 17099, 17100, 17103, 17108, 17111, 17114, 17114, 17115, 17116, 17118, 17123, 17125, 17126, 17127, 17136, 17138, 17141, 17143, 17143, 17146, 17149, 
17151, 17153, 17153, 17156, 17160, 17162, 17164, 17166, 17168, 17168, 17
|
|||
|
"В атрибуте 'x' обнаружены аномалии: [0.0, 0.0, 0.0, 9.23, 9.1, 9.11, 9.15, 9.24, 9.26, 9.11, 9.54, 9.38, 9.17, 9.53, 9.44, 9.49, 9.65, 0.0, 9.42, 9.44, 9.32, 10.14, 10.02, 9.14, 0.0, 9.42, 10.01, 9.25, 9.86, 9.3, 9.13, 10.0, 10.74, 0.0, 9.36, 10.23, 9.51, 9.44, 9.66, 9.35, 9.41, 0.0, 0.0]\n",
|
|||
|
"В атрибуте 'y' обнаружены аномалии: [0.0, 0.0, 9.25, 9.38, 9.31, 9.48, 58.9, 9.4, 9.42, 9.59, 0.0, 9.26, 9.37, 9.19, 10.1, 9.94, 0.0, 9.34, 9.94, 9.2, 9.81, 9.85, 10.54, 0.0, 9.31, 10.16, 9.46, 9.38, 9.63, 9.22, 9.32, 31.8, 0.0, 0.0]\n",
|
|||
|
"В атрибуте 'z' обнаружены аномалии: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.07, 0.0, 5.77, 5.76, 5.67, 5.97, 1.41, 5.98, 5.73, 5.66, 5.91, 5.79, 6.38, 8.06, 5.85, 5.92, 6.03, 0.0, 0.0, 6.17, 6.24, 5.75, 0.0, 6.16, 0.0, 6.27, 6.31, 5.69, 6.13, 5.86, 5.72, 0.0, 6.43, 6.98, 0.0, 0.0, 5.9, 5.9, 5.77, 5.77, 6.72, 6.03, 0.0, 31.8, 0.0, 0.0, 0.0]\n",
|
|||
|
"\n",
|
|||
|
"Аномалии в наборе данных Forbes Billionaires:\n",
|
|||
|
"В атрибуте 'Networth' обнаружены аномалии: [219.0, 171.0, 158.0, 129.0, 118.0, 111.0, 107.0, 106.0, 91.4, 90.7, 90.0, 82.0, 81.2, 74.8, 67.3, 66.2, 65.7, 65.3, 65.0, 65.0, 60.0, 60.0, 59.6, 55.1, 50.0, 49.2, 47.3, 47.1, 44.8, 43.6, 41.4, 40.4, 37.3, 37.2]\n",
|
|||
|
"В атрибуте 'Age' обнаружены аномалии: [19]\n",
|
|||
|
"\n",
|
|||
|
"Аномалии в наборе данных Tesla:\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"from scipy import stats\n",
"# Compute Z-scores for the numeric columns only\n",
"df_zscores = df.select_dtypes(include=['float64', 'int64']).apply(stats.zscore, nan_policy='omit')\n",
"df2_zscores = df2.select_dtypes(include=['float64', 'int64']).apply(stats.zscore, nan_policy='omit')\n",
"df3_zscores = df3.select_dtypes(include=['float64', 'int64']).apply(stats.zscore, nan_policy='omit')\n",
"\n",
"# Threshold for flagging anomalies\n",
"threshold = 3\n",
"\n",
"# Find anomalies and print a message for each affected column\n",
"def find_anomalies(zscores, data):\n",
"    for column in zscores.columns:\n",
"        # Select the values whose absolute Z-score exceeds the threshold\n",
"        anomalies = data[column][(zscores[column].abs() > threshold)]\n",
"        if not anomalies.empty:\n",
"            # The message reads: Anomalies found in attribute '<column>': [...]\n",
"            print(f\"В атрибуте '{column}' обнаружены аномалии: {anomalies.tolist()}\")\n",
"\n",
"# Report anomalies per dataset (the Russian titles are kept to match the recorded output)\n",
"try:\n",
"    print(\"Аномалии в наборе данных Diamonds Prices:\")\n",
"    find_anomalies(df_zscores, df)\n",
"\n",
"    print(\"\\nАномалии в наборе данных Forbes Billionaires:\")\n",
"    find_anomalies(df2_zscores, df2)\n",
"\n",
"    print(\"\\nАномалии в наборе данных Tesla:\")\n",
"    find_anomalies(df3_zscores, df3)\n",
"\n",
"except Exception as e:\n",
"    # The message reads: An error occurred: <details>\n",
"    print(f\"Произошла ошибка: {e}\")\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Теперь выполним 10 пункт, разобьем данные на выборки"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 33,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"df2['networth_segment'] = pd.cut(df2['Networth'], bins=[0,10,80,250], labels=['Ultra High Networth','High Networth','Medium Networth'], include_lowest=True)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 34,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Набор данных Diamonds Prices:\n",
|
|||
|
"Обучающая выборка:\n",
|
|||
|
"cut\n",
|
|||
|
"Ideal 0.399523\n",
|
|||
|
"Premium 0.255689\n",
|
|||
|
"Very Good 0.223989\n",
|
|||
|
"Good 0.090953\n",
|
|||
|
"Fair 0.029847\n",
|
|||
|
"Name: proportion, dtype: float64\n",
|
|||
|
"\n",
|
|||
|
"Контрольная выборка:\n",
|
|||
|
"cut\n",
|
|||
|
"Ideal 0.399518\n",
|
|||
|
"Premium 0.255654\n",
|
|||
|
"Very Good 0.223953\n",
|
|||
|
"Good 0.091027\n",
|
|||
|
"Fair 0.029848\n",
|
|||
|
"Name: proportion, dtype: float64\n",
|
|||
|
"\n",
|
|||
|
"Тестовая выборка:\n",
|
|||
|
"cut\n",
|
|||
|
"Ideal 0.399444\n",
|
|||
|
"Premium 0.255792\n",
|
|||
|
"Very Good 0.224096\n",
|
|||
|
"Good 0.090825\n",
|
|||
|
"Fair 0.029842\n",
|
|||
|
"Name: proportion, dtype: float64\n",
|
|||
|
"\n",
|
|||
|
"Набор данных Forbes Billionaires:\n",
|
|||
|
"Обучающая выборка:\n",
|
|||
|
"networth_segment\n",
|
|||
|
"Ultra High Networth 0.924519\n",
|
|||
|
"High Networth 0.070192\n",
|
|||
|
"Medium Networth 0.005288\n",
|
|||
|
"Name: proportion, dtype: float64\n",
|
|||
|
"\n",
|
|||
|
"Контрольная выборка:\n",
|
|||
|
"networth_segment\n",
|
|||
|
"Ultra High Networth 0.926923\n",
|
|||
|
"High Networth 0.069231\n",
|
|||
|
"Medium Networth 0.003846\n",
|
|||
|
"Name: proportion, dtype: float64\n",
|
|||
|
"\n",
|
|||
|
"Тестовая выборка:\n",
|
|||
|
"networth_segment\n",
|
|||
|
"Ultra High Networth 0.923077\n",
|
|||
|
"High Networth 0.073077\n",
|
|||
|
"Medium Networth 0.003846\n",
|
|||
|
"Name: proportion, dtype: float64\n",
|
|||
|
"\n",
|
|||
|
"Набор данных Tesla:\n",
|
|||
|
"Обучающая выборка:\n",
|
|||
|
"Transaction\n",
|
|||
|
"Sale 0.637097\n",
|
|||
|
"Option Exercise 0.362903\n",
|
|||
|
"Name: proportion, dtype: float64\n",
|
|||
|
"\n",
|
|||
|
"Контрольная выборка:\n",
|
|||
|
"Transaction\n",
|
|||
|
"Sale 0.625\n",
|
|||
|
"Option Exercise 0.375\n",
|
|||
|
"Name: proportion, dtype: float64\n",
|
|||
|
"\n",
|
|||
|
"Тестовая выборка:\n",
|
|||
|
"Transaction\n",
|
|||
|
"Sale 0.625\n",
|
|||
|
"Option Exercise 0.375\n",
|
|||
|
"Name: proportion, dtype: float64\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"def split_data(data, target_column, test_size=0.2, random_state=42):\n",
|
|||
|
" # Разделяем данные на обучающую и временную выборки\n",
|
|||
|
" X_train, X_temp, y_train, y_temp = train_test_split(data.drop(columns=[target_column]), \n",
|
|||
|
" data[target_column], \n",
|
|||
|
" test_size=test_size, \n",
|
|||
|
" random_state=random_state, \n",
|
|||
|
" stratify=data[target_column])\n",
|
|||
|
" # Делим временную выборку на контрольную и тестовую\n",
|
|||
|
" X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, \n",
|
|||
|
" test_size=0.5, \n",
|
|||
|
" random_state=random_state, \n",
|
|||
|
" stratify=y_temp)\n",
|
|||
|
" \n",
|
|||
|
" return X_train, X_val, X_test, y_train, y_val, y_test\n",
|
|||
|
"\n",
|
|||
|
"# Для набора данных neo\n",
|
|||
|
"df_train, df_val, df_test, df_train_labels, df_val_labels, df_test_labels = split_data(df, 'cut')\n",
|
|||
|
"\n",
|
|||
|
"# Для набора данных healthcare\n",
|
|||
|
"df2_train, df2_val, df2_test, df2_train_labels, df2_val_labels, df2_test_labels = split_data(df2, 'networth_segment')\n",
|
|||
|
"\n",
|
|||
|
"# Для набора данных diabetes\n",
|
|||
|
"df3_train, df3_val, df3_test, df3_train_labels, df3_val_labels, df3_test_labels = split_data(df3, 'Transaction')\n",
|
|||
|
"def check_balance(y_train, y_val, y_test):\n",
|
|||
|
" print(\"Обучающая выборка:\")\n",
|
|||
|
" print(y_train.value_counts(normalize=True))\n",
|
|||
|
" print(\"\\nКонтрольная выборка:\")\n",
|
|||
|
" print(y_val.value_counts(normalize=True))\n",
|
|||
|
" print(\"\\nТестовая выборка:\")\n",
|
|||
|
" print(y_test.value_counts(normalize=True))\n",
|
|||
|
"\n",
|
|||
|
"print(\"Набор данных Diamonds Prices:\")\n",
|
|||
|
"check_balance(df_train_labels, df_val_labels, df_test_labels)\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nНабор данных Forbes Billionaires:\")\n",
|
|||
|
"check_balance(df2_train_labels, df2_val_labels, df2_test_labels)\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nНабор данных Tesla:\")\n",
|
|||
|
"check_balance(df3_train_labels, df3_val_labels, df3_test_labels)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Приращение методами выборки с избытком (oversampling) и выборки с недостатком (undersampling)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 43,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Oversampling для Diamonds Prices:\n",
|
|||
|
"cut\n",
|
|||
|
"Ideal 21551\n",
|
|||
|
"Premium 21551\n",
|
|||
|
"Good 21551\n",
|
|||
|
"Very Good 21551\n",
|
|||
|
"Fair 21551\n",
|
|||
|
"Name: count, dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Undersampling для Forbes Billionaires:\n",
|
|||
|
"networth_segment\n",
|
|||
|
"Ultra High Networth 13\n",
|
|||
|
"High Networth 13\n",
|
|||
|
"Medium Networth 13\n",
|
|||
|
"Name: count, dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Oversampling для Tesla:\n",
|
|||
|
"Transaction\n",
|
|||
|
"Sale 99\n",
|
|||
|
"Option Exercise 99\n",
|
|||
|
"Name: count, dtype: int64\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from imblearn.over_sampling import RandomOverSampler\n",
|
|||
|
"from imblearn.under_sampling import RandomUnderSampler\n",
|
|||
|
"\n",
|
|||
|
"# Пример Oversampling для Diamonds Prices\n",
|
|||
|
"X_df = df.drop('cut', axis=1) \n",
|
|||
|
"y_df = df['cut'] \n",
|
|||
|
"\n",
|
|||
|
"# Oversampling\n",
|
|||
|
"ros_df = RandomOverSampler(random_state=42)\n",
|
|||
|
"X_df_resampled, y_df_resampled = ros_df.fit_resample(X_df, y_df)\n",
|
|||
|
"df_resampled = pd.DataFrame(X_df_resampled, columns=X_df.columns)\n",
|
|||
|
"df_resampled['cut'] = y_df_resampled\n",
|
|||
|
"\n",
|
|||
|
"print(\"Oversampling для Diamonds Prices:\")\n",
|
|||
|
"print(df_resampled['cut'].value_counts())\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"X_df2 = df2.drop('networth_segment', axis=1)\n",
|
|||
|
"y_df2 = df2['networth_segment']\n",
|
|||
|
"\n",
|
|||
|
"# Пример Undersampling для Forbes Billionaires\n",
|
|||
|
"rus_df2 = RandomUnderSampler(random_state=42)\n",
|
|||
|
"X_df2_resampled_under, y_df2_resampled_under = rus_df2.fit_resample(X_df2, y_df2)\n",
|
|||
|
"df2_resampled_under = pd.DataFrame(X_df2_resampled_under, columns=X_df2.columns)\n",
|
|||
|
"df2_resampled_under['networth_segment'] = y_df2_resampled_under\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nUndersampling для Forbes Billionaires:\")\n",
|
|||
|
"print(df2_resampled_under['networth_segment'].value_counts())\n",
|
|||
|
"\n",
|
|||
|
"# Пример Oversampling для Tesla\n",
|
|||
|
"X_df3 = df3.drop('Transaction', axis=1)\n",
|
|||
|
"y_df3 = df3['Transaction']\n",
|
|||
|
"\n",
|
|||
|
"# Oversampling\n",
|
|||
|
"ros_df3 = RandomOverSampler(random_state=42)\n",
|
|||
|
"X_df3_resampled, y_df3_resampled = ros_df3.fit_resample(X_df3, y_df3)\n",
|
|||
|
"df3_resampled = pd.DataFrame(X_df3_resampled, columns=X_df3.columns)\n",
|
|||
|
"df3_resampled['Transaction'] = y_df3_resampled\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nOversampling для Tesla:\")\n",
|
|||
|
"print(df3_resampled['Transaction'].value_counts())"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "aimenv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.13.0"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|
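The Forbes Billionaires segmentation cell builds `networth_segment` with `pd.cut`, passing three labels for the bin edges `[0, 10, 80, 250]`. `pd.cut` assigns labels to intervals in ascending order of the edges, so the first label goes to the lowest interval. The sketch below illustrates this on made-up values. The series contents and the label names `low`, `middle`, and `top` are assumptions for illustration only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical net worth values in billions of USD (not from the dataset)
networth = pd.Series([1.2, 10.0, 45.0, 80.0, 219.0])

# Same bin edges as the notebook; the label names are illustrative only.
# Intervals are right-closed: [0, 10], (10, 80], (80, 250]
segments = pd.cut(
    networth,
    bins=[0, 10, 80, 250],
    labels=["low", "middle", "top"],
    include_lowest=True,
)

# Convert the categorical result to plain strings for inspection
result = segments.astype(str).tolist()
print(result)
```

Because the intervals are right-closed, a value that sits exactly on an edge (10.0 or 80.0 here) falls into the lower of the two neighbouring segments.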