AIM-PIbd-32-Shabunov-O-A/lab_2/lab2.ipynb

2024-10-25 22:30:53 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data collection and preparation\n",
"\n",
"## Dataset No. 1. House sales\n",
"\n",
"[**Link**](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction)\n",
"\n",
"**Problem domain**: This house sales dataset belongs to the real estate domain.\n",
"\n",
"**Objects of observation**: Each object of observation is an individual house (a unit of real estate) that was sold. Each row of the table represents one house.\n",
"\n",
"**Object attributes:**\n",
"- `id` — unique identifier of the house.\n",
"- `date` — date of sale.\n",
"- `price` — sale price of the house (target variable).\n",
"- `bedrooms` — number of bedrooms.\n",
"- `bathrooms` — number of bathrooms.\n",
"- `sqft_living` — total living area (in square feet).\n",
"- `sqft_lot` — lot area.\n",
"- `floors` — number of floors.\n",
"- `waterfront` — whether the house has waterfront access.\n",
"- `view` — whether the house has a view from its windows.\n",
"- `condition` — condition of the house (rating).\n",
"- `grade` — quality grade of the house.\n",
"- `sqft_above` — area of the house above ground.\n",
"- `sqft_basement` — basement area.\n",
"- `yr_built` — year built.\n",
"- `yr_renovated` — year of the last renovation.\n",
"- `zipcode` — postal code.\n",
"- `lat` and `long` — geographic coordinates of the house.\n",
"- `sqft_living15` and `sqft_lot15` — average living area and lot area of the 15 nearest houses.\n",
"\n",
"**Business goal**: Better price forecasting helps sellers set competitive prices and helps buyers make better-informed purchase decisions. It also lets realtors read the market more accurately and optimize their sales strategy.\n",
"\n",
"**Technical goal**: Predicting house prices.\n",
"\n",
"**Input data**: Historical house sales data, including all features (number of rooms, area, condition, location, etc.).\n",
"\n",
"**Target variable**: Price (`price`)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>date</th>\n",
" <th>price</th>\n",
" <th>bedrooms</th>\n",
" <th>bathrooms</th>\n",
" <th>sqft_living</th>\n",
" <th>sqft_lot</th>\n",
" <th>floors</th>\n",
" <th>waterfront</th>\n",
" <th>view</th>\n",
" <th>...</th>\n",
" <th>grade</th>\n",
" <th>sqft_above</th>\n",
" <th>sqft_basement</th>\n",
" <th>yr_built</th>\n",
" <th>yr_renovated</th>\n",
" <th>zipcode</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>sqft_living15</th>\n",
" <th>sqft_lot15</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7129300520</td>\n",
" <td>20141013T000000</td>\n",
" <td>221900.0</td>\n",
" <td>3</td>\n",
" <td>1.00</td>\n",
" <td>1180</td>\n",
" <td>5650</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>1180</td>\n",
" <td>0</td>\n",
" <td>1955</td>\n",
" <td>0</td>\n",
" <td>98178</td>\n",
" <td>47.5112</td>\n",
" <td>-122.257</td>\n",
" <td>1340</td>\n",
" <td>5650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>6414100192</td>\n",
" <td>20141209T000000</td>\n",
" <td>538000.0</td>\n",
" <td>3</td>\n",
" <td>2.25</td>\n",
" <td>2570</td>\n",
" <td>7242</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>2170</td>\n",
" <td>400</td>\n",
" <td>1951</td>\n",
" <td>1991</td>\n",
" <td>98125</td>\n",
" <td>47.7210</td>\n",
" <td>-122.319</td>\n",
" <td>1690</td>\n",
" <td>7639</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5631500400</td>\n",
" <td>20150225T000000</td>\n",
" <td>180000.0</td>\n",
" <td>2</td>\n",
" <td>1.00</td>\n",
" <td>770</td>\n",
" <td>10000</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>6</td>\n",
" <td>770</td>\n",
" <td>0</td>\n",
" <td>1933</td>\n",
" <td>0</td>\n",
" <td>98028</td>\n",
" <td>47.7379</td>\n",
" <td>-122.233</td>\n",
" <td>2720</td>\n",
" <td>8062</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2487200875</td>\n",
" <td>20141209T000000</td>\n",
" <td>604000.0</td>\n",
" <td>4</td>\n",
" <td>3.00</td>\n",
" <td>1960</td>\n",
" <td>5000</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>1050</td>\n",
" <td>910</td>\n",
" <td>1965</td>\n",
" <td>0</td>\n",
" <td>98136</td>\n",
" <td>47.5208</td>\n",
" <td>-122.393</td>\n",
" <td>1360</td>\n",
" <td>5000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1954400510</td>\n",
" <td>20150218T000000</td>\n",
" <td>510000.0</td>\n",
" <td>3</td>\n",
" <td>2.00</td>\n",
" <td>1680</td>\n",
" <td>8080</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>8</td>\n",
" <td>1680</td>\n",
" <td>0</td>\n",
" <td>1987</td>\n",
" <td>0</td>\n",
" <td>98074</td>\n",
" <td>47.6168</td>\n",
" <td>-122.045</td>\n",
" <td>1800</td>\n",
" <td>7503</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" id date price bedrooms bathrooms sqft_living \\\n",
"0 7129300520 20141013T000000 221900.0 3 1.00 1180 \n",
"1 6414100192 20141209T000000 538000.0 3 2.25 2570 \n",
"2 5631500400 20150225T000000 180000.0 2 1.00 770 \n",
"3 2487200875 20141209T000000 604000.0 4 3.00 1960 \n",
"4 1954400510 20150218T000000 510000.0 3 2.00 1680 \n",
"\n",
" sqft_lot floors waterfront view ... grade sqft_above sqft_basement \\\n",
"0 5650 1.0 0 0 ... 7 1180 0 \n",
"1 7242 2.0 0 0 ... 7 2170 400 \n",
"2 10000 1.0 0 0 ... 6 770 0 \n",
"3 5000 1.0 0 0 ... 7 1050 910 \n",
"4 8080 1.0 0 0 ... 8 1680 0 \n",
"\n",
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
"0 1955 0 98178 47.5112 -122.257 1340 \n",
"1 1951 1991 98125 47.7210 -122.319 1690 \n",
"2 1933 0 98028 47.7379 -122.233 2720 \n",
"3 1965 0 98136 47.5208 -122.393 1360 \n",
"4 1987 0 98074 47.6168 -122.045 1800 \n",
"\n",
" sqft_lot15 \n",
"0 5650 \n",
"1 7639 \n",
"2 8062 \n",
"3 5000 \n",
"4 7503 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\".//static//csv//kc_house_data.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Drop the redundant `id` feature, which will not be used in the analysis of this dataset."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df.drop(columns=['id'], inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking for missing data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"date 0\n",
"price 0\n",
"bedrooms 0\n",
"bathrooms 0\n",
"sqft_living 0\n",
"sqft_lot 0\n",
"floors 0\n",
"waterfront 0\n",
"view 0\n",
"condition 0\n",
"grade 0\n",
"sqft_above 0\n",
"sqft_basement 0\n",
"yr_built 0\n",
"yr_renovated 0\n",
"zipcode 0\n",
"lat 0\n",
"long 0\n",
"sqft_living15 0\n",
"sqft_lot15 0\n",
"dtype: int64\n",
"\n",
"date False\n",
"price False\n",
"bedrooms False\n",
"bathrooms False\n",
"sqft_living False\n",
"sqft_lot False\n",
"floors False\n",
"waterfront False\n",
"view False\n",
"condition False\n",
"grade False\n",
"sqft_above False\n",
"sqft_basement False\n",
"yr_built False\n",
"yr_renovated False\n",
"zipcode False\n",
"lat False\n",
"long False\n",
"sqft_living15 False\n",
"sqft_lot15 False\n",
"dtype: bool\n",
"\n"
]
}
],
"source": [
"# Number of missing values per feature\n",
"print(df.isnull().sum())\n",
"\n",
"print()\n",
"\n",
"# Whether each feature has any missing values\n",
"print(df.isnull().any())\n",
"\n",
"print()\n",
"\n",
"# Percentage of missing values per feature (printed only when non-zero)\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f\"{i}: {null_rate:.2f}% missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**No missing data** was found in the dataset."
]
},
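{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three checks above can also be condensed into one table. The following is a minimal sketch, assuming a pandas DataFrame; the helper name `missing_summary` and the toy frame are illustrative and not part of the original notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def missing_summary(frame):\n",
"    # Count and percentage of missing values per column, worst first\n",
"    counts = frame.isnull().sum()\n",
"    percent = counts / len(frame) * 100\n",
"    summary = pd.DataFrame({'missing': counts, 'percent': percent})\n",
"    return summary.sort_values('missing', ascending=False)\n",
"\n",
"# Toy frame standing in for the housing data (hypothetical values)\n",
"toy = pd.DataFrame({'price': [1.0, np.nan, 3.0, 4.0], 'bedrooms': [3, 2, 4, 3]})\n",
"print(missing_summary(toy))\n",
"```\n",
"\n",
"Calling `missing_summary(df)` on the housing data would list every column with a count of zero, which matches the conclusion above."
]
},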
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking the dataset for outliers"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of outliers in column 'price': 1146\n",
"Number of outliers in column 'sqft_living': 572\n",
"Number of outliers in column 'bathrooms': 571\n",
"Number of outliers in column 'yr_built': 0\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABdIAAAPdCAYAAACOcJpIAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeXhU5fn/8c/MJDMJWQmShChg3EGxKFqJKyolRfQnlWq1VFFxLahgqy3fuiBaUSpKRRS3ArbytdqvtRUVQVCsZVFRXEBxQ9myCFmGbLNkzu+PyRwYCSHLzJyZzPt1XXOFnPPknPskts+ce+5zPzbDMAwBAAAAAAAAAIBW2a0OAAAAAAAAAACAeEYiHQAAAAAAAACANpBIBwAAAAAAAACgDSTSAQAAAAAAAABoA4l0AAAAAAAAAADaQCIdAAAAAAAAAIA2kEgHAAAAAAAAAKANJNIBAAAAAAAAAGgDiXQAAAAAAAAAANpAIh2IgoMPPliXX3651WF0e3/60590yCGHyOFwaPDgwVE911tvvSWbzaa33norqucBAHQOc29sxHLu3dN7772nk08+WRkZGbLZbFq3bl2XjvfD/166Ms8PGzZMw4YN61I8AJDsmMdjo73z+LBhw3TMMcfEJKbLL79cmZmZMTkX0FUk0oH9mD9/vmw2m95///1W90dqgnn11Vc1derULh8nWSxZskS33nqrTjnlFM2bN0/33nuv1SEBACKEuTc+WTX3+nw+XXjhhaqqqtJDDz2kv/71r+rfv78effRRzZ8/PyYxAADaj3k8Pll5D93Q0KCpU6dSmIaEl2J1AEB3tHHjRtntHfuc6tVXX9WcOXN4I9BOy5cvl91u19NPPy2n0xn1851++ulqbGyMybkAAB3H3Bt9sZ57Q77++mt99913evLJJ3XVVVeZ2x999FEdcMABEalg7Mo8v2TJki6fHwCSHfN49Fk1j0vBRPpdd90lSTzFhYRGRToQBS6XS6mpqVaH0SH19fVWh9AhlZWVSk9Pj/obgKamJgUCAdntdqWlpXX4zR0AIDaYe6MvVnNva+eVpNzc3KidoyvzvNPp5IN2AOgi5vHos2oejya/3y+v12t1GEgiZISAKPhhfzefz6e77rpLhx9+uNLS0tSrVy+deuqpWrp0qaRgT7A5c+ZIkmw2m/kKqa+v129+8xv17dtXLpdLRx55pB544AEZhhF23sbGRt1444064IADlJWVpf/3//6ftm3bJpvNFvYp/dSpU2Wz2bRhwwb98pe/VM+ePXXqqadKkj7++GNdfvnlOuSQQ5SWlqbCwkJdeeWV2rlzZ9i5Qsf44osv9Ktf/Uo5OTnq3bu3br/9dhmGoS1btuj8889Xdna2CgsLNXPmzHb97vx+v+6++24deuihcrlcOvjgg/U///M/8ng85hibzaZ58+apvr7e/F219Wh36NHBtWvX6uSTT1Z6erqKi4s1d+7csHGh/qjPPfecbrvtNh144IHq0aOH3G73PnunrlmzRuecc4569uypjIwMHXvssfrzn/8cNubzzz/Xz3/+c+Xl5SktLU0nnHCC/v3vf7fr9wEAaB/m3viae7/88kuNGTNGhYWFSktL00EHHaSLL75YtbW15hiPx6PJkyerd+/e5u9u69atYb+7yy+/XGeccYYk6cILL5TNZtOwYcN08MEHa/369VqxYoUZT1cq3H44z0+cOFGZmZlqaGjYa+wll1yiwsJCNTc3S9q7R3roWM8//7z++Mc/6qCDDlJaWprOPvtsffXVV3sdb86cOTrkkEOUnp6uH//4x/rPf/5D33UASYd5PL7m8ZD93UN7vV7dcccdGjJkiHJycpSRkaHTTjtNb775pjnm22+/Ve/evSVJd911l3n+Hz5JsG3bNo0ePVqZmZnq3bu3fvvb35pzbeg4NptNDzzwgGbNmmVe74YNGyQFK+5PO+00ZWRkKDc3V+eff74+++yzva7pww8/1MiRI5
Wdna3MzEydffbZWr16ddiYUHuid955RzfeeKN69+6t3NxcXXvttfJ6vaqpqdFll12mnj17qmfPnrr11lv3+m/rueee05AhQ5SVlaXs7GwNGjRor1wBEg+tXYB2qq2t1Y4dO/ba7vP59vuzU6dO1fTp03XVVVfpxz/+sdxut95//3198MEH+slPfqJrr71W27dv19KlS/XXv/417GcNw9D/+3//T2+++abGjx+vwYMH6/XXX9ctt9yibdu26aGHHjLHXn755Xr++ed16aWXaujQoVqxYoVGjRq1z7guvPBCHX744br33nvN/9NfunSpvvnmG11xxRUqLCzU+vXr9cQTT2j9+vVavXp12JsTSfrFL36hAQMG6L777tMrr7yie+65R3l5eXr88cd11lln6f7779ezzz6r3/72tzrxxBN1+umnt/m7uuqqq7RgwQL9/Oc/129+8xutWbNG06dP12effaZ//vOfkqS//vWveuKJJ/Tuu+/qqaeekiSdfPLJbR63urpa55xzji666CJdcsklev7553X99dfL6XTqyiuvDBt79913y+l06re//a08Hs8+P7FfunSpzj33XPXp00c33XSTCgsL9dlnn2nRokW66aabJEnr16/XKaecogMPPFC///3vlZGRoeeff16jR4/W//3f/+lnP/tZm3EDQDJj7k3Mudfr9aq0tFQej0c33HCDCgsLtW3bNi1atEg1NTXKyckxz/u3v/1Nv/zlL3XyySdr+fLle/3urr32Wh144IG69957deONN+rEE09UQUGB6uvrdcMNNygzM1N/+MMfJEkFBQVtXmdH/OIXv9CcOXP0yiuv6MILLzS3NzQ06OWXX9bll18uh8PR5jHuu+8+2e12/fa3v1Vtba1mzJihsWPHas2aNeaYxx57TBMnTtRpp52myZMn69tvv9Xo0aPVs2dPHXTQQRG7HgCwAvN4Ys7jIe25h3a73Xrqqad0ySWX6Oqrr9auXbv09NNPq7S0VO+++64GDx6s3r1767HHHtP111+vn/3sZ7rgggskSccee6x5rubmZpWWluqkk07SAw88oDfeeEMzZ87UoYcequuvvz4srnnz5qmpqUnXXHONXC6X8vLy9MYbb2jkyJE65JBDNHXqVDU2Nmr27Nk65ZRT9MEHH+jggw+WFLw/P+2005Sdna1bb71VqampevzxxzVs2DCtWLFCJ510Uti5Qu9j7rrrLq1evVpPPPGEcnNztXLlSvXr10/33nuvXn31Vf3pT3/SMccco8suu8z8b+KSSy7R2Wefrfvvv1+S9Nlnn+m///2vmStAgjIAtGnevHmGpDZfRx99dNjP9O/f3xg3bpz5/Y9+9CNj1KhRbZ5nwoQJRmv/k3zppZcMScY999wTtv3nP/+5YbPZjK+++sowDMNYu3atIcmYNGlS2LjLL7/ckGTceeed5rY777zTkGRccskle52voaFhr23/+7//a0gy3n777b2Occ0115jb/H6/cdBBBxk2m8247777zO3V1dVGenp62O+kNevWrTMkGVdddVXY9t/+9reGJGP58uXmtnHjxhkZGRltHi/kjDPOMCQZM2fONLd5PB5j8ODBRn5+vuH1eg3DMIw333zTkGQccsghe/0eQvvefPNN81qLi4uN/v37G9XV1WFjA4GA+e+zzz7bGDRokNHU1BS2/+STTzYOP/zwdsUPAMmGuTex594PP/zQkGS88MIL+z3vr3/967Dtv/zlL/f63YXm4B8e7+ijjzbOOOOM/cbTmh/+9/LDeT4QCBgHHnigMWbMmLCfe/755/f6u5xxxhlhcYSONWDAAMPj8Zjb//znPxuSjE8++cQwjOB7kV69ehknnnii4fP5zHHz5883JHX62gDAaszjiT2PG0b776H9fn/YXBeKvaCgwLjyyivNbd9///1ev9M945JkTJs2LWz7cccdZwwZMsT8ftOmTYYkIzs726isrAwbG4pr586d5raPPvrIsNvtxmWXXW
ZuGz16tOF0Oo2vv/7a3LZ9+3YjKyvLOP30081tof+GS0tLw+7vS0pKDJvNZlx33XXmttDfcM95+6abbjKys7MNv9+
"text/plain": [
"<Figure size 1500x1000 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Columns to analyse\n",
"columns_to_check = ['price', 'sqft_living', 'bathrooms', 'yr_built']\n",
"\n",
"# Count outliers per column using the 1.5 * IQR rule\n",
"def count_outliers(df, columns):\n",
"    outliers_count = {}\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Count the values outside the bounds\n",
"        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
"        outliers_count[col] = len(outliers)\n",
"\n",
"    return outliers_count\n",
"\n",
"# Count the outliers\n",
"outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"# Plot the histograms\n",
"plt.figure(figsize=(15, 10))\n",
"for i, col in enumerate(columns_to_check, 1):\n",
"    plt.subplot(2, 3, i)\n",
"    sns.histplot(df[col], kde=True)\n",
"    plt.title(f'Histogram of {col}')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `yr_built` feature has no outliers, while `price`, `sqft_living` and `bathrooms` need outlier treatment. Because the dataset is fairly large (more than 21 thousand observations), we handle the outliers by removing the affected observations:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of removed rows: 1520\n"
]
}
],
"source": [
"# Columns to clean\n",
"columns_to_clean = ['price', 'sqft_living', 'bathrooms']\n",
"\n",
"# Remove outliers column by column using the 1.5 * IQR rule\n",
"def remove_outliers(df, columns):\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Keep only the rows that lie within the bounds\n",
"        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]\n",
"\n",
"    return df\n",
"\n",
"# Remove the outliers\n",
"df_cleaned = remove_outliers(df, columns_to_clean)\n",
"\n",
"# Print the number of removed rows\n",
"print(f\"Number of removed rows: {len(df) - len(df_cleaned)}\")\n",
"\n",
"df = df_cleaned"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us re-evaluate the outliers in the sample after removing these observations:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of outliers in column 'price': 211\n",
"Number of outliers in column 'sqft_living': 34\n",
"Number of outliers in column 'bathrooms': 0\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABdEAAAISCAYAAAAjjoaeAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3hUZfrG8XsmvU4KqRAg9C4ICEFUSiQqFpS1LSpYKAoouOrqbxUFu64rFhTbiu6Kru6urorSUVQ6SOi9E9LLpE7KnN8fIaORBCEkOSnfz3Xl2s05Z865z5Ddd+aZd57XYhiGIQAAAAAAAAAAcAqr2QEAAAAAAAAAAGioKKIDAAAAAAAAAFANiugAAAAAAAAAAFSDIjoAAAAAAAAAANWgiA4AAAAAAAAAQDUoogMAAAAAAAAAUA2K6AAAAAAAAAAAVIMiOgAAAAAAAAAA1aCIDgAAAAAAAABANSiiA/Wgbdu2GjdunNkxmrwXX3xR7dq1k5ubm3r37l2n1/ruu+9ksVj03Xff1el1AABnhrG2ftTnWPtr69ev16BBg+Tn5yeLxaLNmzef0/l++/dyLuP6kCFDNGTIkHPKAwDNDeN2/TjTcXvIkCHq0aNHvWQaN26c/P396+VaQG2iiA6cpXnz5slisWjDhg1V7q+tweebb77RE088cc7naS4WL16shx56SBdeeKHef/99PfPMM2ZHAgDUEGNtw2TWWFtSUqLrr79emZmZevnll/WPf/xDbdq00RtvvKF58+bVSwYAQPUYtxsmM98jFxQU6IknnmDSGZoUd7MDAM3B7t27ZbWe3WdW33zzjebMmcOLhDO0fPlyWa1Wvffee/L09Kzz61188cUqLCysl2sBAH4fY23dq++xtsL+/ft1+PBhvfPOO7rrrrtc29944w21aNGiVmYynsu4vnjx4nO+PgA0N4zbdc+scVsqL6LPnDlTkvi2FpoMZqID9cDLy0seHh5mxzgr+fn5Zkc4K6mpqfLx8anzFwdFRUVyOp2yWq3y9vY+6xd+AIC6wVhb9+prrK3qupIUFBRUZ9c4l3Hd09OTD9UB4Cwxbtc9s8btulRaWqri4mKzY6CZovoD1IPf9nsrKSnRzJkz1bFjR3l7eys0NFSDBw/WkiVLJJX3CJszZ44kyWKxuH4q5Ofn609/+pNiYmLk5eWlzp07669//asMw6h03cLCQt17771q0aKFAgICdPXVV+v48eOyWCyVPr1/4oknZLFYtGPHDv3xj39UcHCwBg8eLEnasmWLxo0bp3bt2snb21uRkZG64447lJGRUelaFefYs2ePbrnlFtlsNoWFhemxxx6TYRg6evSorrnmGgUGBioyMlIvvfTSGT13paWlevLJJ9W+fXt5eXmpbdu2+r//+z85HA7XMRaLRe+//77y8/Ndz9Xpvt5d8XXCjRs3atCgQfLx8VFsbKzmzp1b6biK/qiffPKJHn30UbVs2VK+vr6y2+3V9k5du3atrrjiCgUHB8vPz0+9evXSK6+8UumYXbt26Q9/+INCQkLk7e2tfv366csvvzyj5wMAUDXG2oY11u7du1ejR49WZGSkvL291apVK910003KyclxHeNwODR9+nSFhYW5nrtjx45Veu7GjRunSy65RJJ0/fXXy2KxaMiQIWrbtq22b9+u77//3pXnXGa6/XZcnzJlivz9/VVQUHDKsTfffLMiIyNVVlYm6dSe6BXn+vTTT/X000+rVatW8vb21vDhw7Vv375Tzjdnzhy1a9dOPj4+uuCCC/TDDz/QZx1Ak8e43bDG7Qq/9x65uLhYM2bMUN++fWWz2eTn56eLLrpIK1ascB1z6NAhhYWFSZJmzpzpuv5vv0Fw/PhxjRo1Sv7+/goLC9MDDzzgGlsrzmOxWPTXv/5Vs2fPdt3vjh07JJXPtL/ooovk5+enoKAgXXPNNdq5c+cp9/Tzzz/r8ssvV2BgoPz9/TV8+HCtWbOm0jEVLY
l+/PFH3XvvvQoLC1NQUJAmTpyo4uJiZWdn67bbblNwcLCCg4P10EMPnfK39cknn6hv374KCAhQYGCgevbseUotAI0b7VyAGsrJyVF6evop20tKSn73sU888YSeffZZ3XXXXbrgggtkt9u1YcMGbdq0SZdeeqkmTpyopKQkLVmyRP/4xz8qPdYwDF199dVasWKF7rzzTvXu3VuLFi3Sgw8+qOPHj+vll192HTtu3Dh9+umnuvXWWzVw4EB9//33GjlyZLW5rr/+enXs2FHPPPOMa0BYsmSJDhw4oNtvv12RkZHavn273n77bW3fvl1r1qyp9MJFkm688UZ17dpVzz33nBYsWKCnnnpKISEheuuttzRs2DA9//zz+uijj/TAAw+of//+uvjii0/7XN1111364IMP9Ic//EF/+tOftHbtWj377LPauXOnPv/8c0nSP/7xD7399ttat26d3n33XUnSoEGDTnverKwsXXHFFbrhhht0880369NPP9Xdd98tT09P3XHHHZWOffLJJ+Xp6akHHnhADoej2k/ylyxZoiuvvFJRUVG67777FBkZqZ07d+rrr7/WfffdJ0navn27LrzwQrVs2VIPP/yw/Pz89Omnn2rUqFH6z3/+o2uvvfa0uQGgOWGsbZxjbXFxsRISEuRwODR16lRFRkbq+PHj+vrrr5WdnS2bzea67j//+U/98Y9/1KBBg7R8+fJTnruJEyeqZcuWeuaZZ3Tvvfeqf//+ioiIUH5+vqZOnSp/f3/95S9/kSRFRESc9j7Pxo033qg5c+ZowYIFuv76613bCwoK9NVXX2ncuHFyc3M77Tmee+45Wa1WPfDAA8rJydELL7ygMWPGaO3ata5j3nzzTU2ZMkUXXXSRpk+frkOHDmnUqFEKDg5Wq1atau1+AKA+MG43znG7wpm8R7bb7Xr33Xd18803a/z48crNzdV7772nhIQErVu3Tr1791ZYWJjefPNN3X333br22mt13XXXSZJ69erlulZZWZkSEhI0YMAA/fWvf9XSpUv10ksvqX379rr77rsr5Xr//fdVVFSkCRMmyMvLSyEhIVq6dKkuv/xytWvXTk888YQKCwv12muv6cILL9SmTZvUtm1bSeXvvy+66CIFBgbqoYcekoeHh9566y0NGTJE33//vQYMGFDpWhWvW2bOnKk1a9bo7bffVlBQkFatWqXWrVvrmWee0TfffKMXX3xRPXr00G233eb6m7j55ps1fPhwPf/885KknTt36qeffnLVAtAEGADOyvvvv29IOu1P9+7dKz2mTZs2xtixY12/n3feecbIkSNPe53JkycbVf1P9IsvvjAkGU899VSl7X/4wx8Mi8Vi7Nu3zzAMw9i4caMhyZg2bVql48aNG2dIMh5//HHXtscff9yQZNx8882nXK+goOCUbR9//LEhyVi5cuUp55gwYYJrW2lpqdGqVSvDYrEYzz33nGt7VlaW4ePjU+k5qcrmzZsNScZdd91VafsDDzxgSDKWL1/u2jZ27FjDz8/vtOercMkllxiSjJdeesm1zeFwGL179zbCw8ON4uJiwzAMY8WKFYYko127dqc8DxX7VqxY4brX2NhYo02bNkZWVlalY51Op+u/Dx8+3OjZs6dRVFRUaf+gQYOMjh07nlF+AGjqGGsb91j7888/G5KMzz777Heve88991Ta/sc//vGU565izP3t+bp3725ccsklv5unKr/9e/ntuO50Oo2WLVsao0ePrvS4Tz/99JR/l0suuaRSjopzde3a1XA4HK7tr7zyiiHJ2Lp1q2EY5a89QkNDjf79+xslJSWu4+bNm2dIqvG9AUB9Y9xu3OO2YZz5e+TS0tJKY1tF9oiICOOOO+5wbUtLSzvlOf11LknGrFmzKm3v06eP0bdvX9fvBw8eNCQZgYGBRmpqaqVjK3JlZGS4tiUmJhpWq9W47bbbXNtGjRpleHp6Gvv373dtS0pKMgICAoyLL77Yta3ibzghIaHS+/
e4uDjDYrEYkyZNcm2r+Df89Th93333GYGBgUZpaekp94umg3YuQA3NmTNHS5YsOeXn15+uVicoKEjbt2/X3r17z/q
"text/plain": [
"<Figure size 1500x1000 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"columns_to_check = ['price', 'sqft_living', 'bathrooms']\n",
"\n",
"# Count the outliers\n",
"outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"# Plot the histograms\n",
"plt.figure(figsize=(15, 10))\n",
"for i, col in enumerate(columns_to_check, 1):\n",
"    plt.subplot(2, 3, i)\n",
"    sns.histplot(df[col], kde=True)\n",
"    plt.title(f'Histogram of {col}')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
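{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging by the counts and histograms above, the number of outliers has dropped considerably and stays within acceptable limits. The values that are still flagged appear because the IQR bounds are recomputed on the reduced sample."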
]
},
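{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `remove_outliers` filters column by column, so the bounds for each later column are computed on data that has already been reduced. One alternative, sketched here under the assumption that all bounds should come from the original sample, is to build a single boolean mask first. The name `iqr_mask`, the parameter `k` and the toy data are hypothetical and not part of the original notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def iqr_mask(frame, columns, k=1.5):\n",
"    # True for rows that lie inside the IQR fences of every listed column\n",
"    q1 = frame[columns].quantile(0.25)\n",
"    q3 = frame[columns].quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    lower = q1 - k * iqr\n",
"    upper = q3 + k * iqr\n",
"    inside = (frame[columns] >= lower) & (frame[columns] <= upper)\n",
"    return inside.all(axis=1)\n",
"\n",
"# Toy data with one obvious price outlier (hypothetical values)\n",
"toy = pd.DataFrame({'price': [10, 11, 12, 13, 1000], 'sqft_living': [5, 6, 7, 8, 9]})\n",
"kept = toy[iqr_mask(toy, ['price', 'sqft_living'])]\n",
"print(len(toy) - len(kept))\n",
"```\n",
"\n",
"With this variant the order of the columns no longer affects which rows are removed, although a repeated check on the cleaned data can still flag new points, since the quartiles shift once rows are dropped."
]
},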
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting the dataset into samples"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean price in the training sample: 470646.91546391754\n",
"Mean price in the validation sample: 468579.52654280025\n",
"Mean price in the test sample: 471598.4402786994\n",
"\n",
"Standard deviation of price in the training sample: 202445.70321089853\n",
"Standard deviation of price in the validation sample: 202183.1175619316\n",
"Standard deviation of price in the test sample: 206393.61053704965\n",
"\n",
"Price quartiles (training):\n",
"0.25 313500.0\n",
"0.50 435000.0\n",
"0.75 594000.0\n",
"Name: price, dtype: float64\n",
"\n",
"Price quartiles (validation):\n",
"0.25 313612.5\n",
"0.50 429925.0\n",
"0.75 595000.0\n",
"Name: price, dtype: float64\n",
"\n",
"Price quartiles (test):\n",
"0.25 312500.0\n",
"0.50 428500.0\n",
"0.75 595375.0\n",
"Name: price, dtype: float64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+0AAAIjCAYAAAB20vpjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3iN9//H8efJXhIjW0mUELNGtUIVrT2LDi016lutokOntnRXpy6qGy2qVdWpFNVWNWbNilFFjCDHishO7t8f9y+njgRJJO6M1+O67ivn3ONzv+8z8z6fZTMMw0BERERERERESh0XqwMQERERERERkfwpaRcREREREREppZS0i4iIiIiIiJRSStpFRERERERESikl7SIiIiIiIiKllJJ2ERERERERkVJKSbuIiIiIiIhIKaWkXURERERERKSUUtIuIiLl3okTJ/jnn3/IysqyOhQRkXLDMAyOHTvGzp07rQ5FpFxT0i4iIuVOZmYmr7zyCldccQWenp5UqVKFqKgoli5danVoZcKWLVv45ptvHPc3bNjAjz/+aF1AUqL279/P9OnTHff37NnDrFmzrAtISvV78NSpUzz55JPUq1cPDw8PqlWrRt26ddm+fbvVoYmUW25WByBS0UyfPp1hw4Y57nt6elKzZk06d+7M+PHjCQkJsTA6kbIvPT2dzp07s3LlSu6++26ee+45fHx8cHV1pUWLFlaHVyacOnWKu+66i9DQUKpVq8Z9991Ht27d6NGjh9WhSQmw2WyMGjWKsLAw6tWrxyOPPELVqlUZOHCg1aFVWKX1PXj06FHatWtHfHw8Y8aMoU2bNnh4eODu7k5kZKSlsYmUZ0raRSzy7LPPUqtWLdLS0vjjjz+YOnUqCxYsYMuWLfj4+FgdnkiZ9fLLL7Nq1SoWLVpE+/btrQ6nTIqJiXEsAHXr1uXOO++0OCopKdWrV+fOO++ka9euAISFhfHrr79aG1QFV1rfgw8//DAJCQnExsbSsGFDq8MRqTBshmEYVgchUpHk1rSvWbOGK6+80rH+wQcfZNKkScyePZtbb73VwghFyq6srCyCg4MZOXIkL7zwgtXhlHlbt24lNTWVxo0b4+HhYXU4UsJ27dqF3W6nUaNG+Pr6Wh2OULreg0eOHCEsLIz33nuvVPyAIFKRqE+7SClx3XXXAbB7924Ajh07xkMPPUTjxo3x8/PD39+fbt26sXHjxjzHpqWl8fTTT1O3bl28vLwICwujX79+7Nq1CzD7J9pstnMuZ9ZG/vrrr9hsNr744gsef/xxQkND8fX1pXfv3uzbty/PuVetWkXXrl0JCAjAx8eHdu3asWLFinyvsX379vme/+mnn86z78yZM2nRogXe3t5UrVqVAQMG5Hv+813bmXJycnjzzTdp2LAhXl5ehISEcNddd3H8+HGn/SIjI+nZs2ee84wePTpPmfnF/uqrr+Z5TMFssv3UU09Rp04dPD09qVGjBo888gjp6en5PlZnat++PY0aNcqz/rXXXsNms7Fnzx6n9SdOnOD++++nRo0aeHp6UqdOHV5++WVycnIc++Q+bq+99lqechs1apTva+Krr746Z4xDhw4tUNPIyMhIx/Pj4uJCaGgot9xyC/Hx8Rc8FuDdd9+lYcOGeHp6Eh4ezqhRozhx4oRj+/bt2zl+/DiVKlWiXbt2+Pj4EBAQQM+ePdmyZYtjv2XLlmGz2Zg/f36ec8yePRubzUZsbKwj5qFDhzrtk/uYnFkbuXz5cm666SZq1qzpeI4feOABUlNTnY59+umn87yWZs2aRdOmTfHy8qJatWrceuuteR6ToUOH4ufn57Tuq6++yhMHgJ+fX56YoWDvq/bt2zue/wYNGtCiRQs2btyY7/uqoKZPn57ntfr3339TpUoVevbs6TRA4L///stNN91E1apV8fHxoVWrVnn68p7vNXnmteee93xLbl/u3Mf333//pUuXLvj6+hIeHs6zzz7L2fUbp0
+f5sEHH3S8x+rVq8drr72WZ7/zxXDmeyx3n7Vr1573cczvNQDnfh3MnTvX8XwHBgYyaNAgDhw4kKfM3Pdu7dq1ufrqqzl27Bje3t75fr7kF9PZ7/19+/YV6PihQ4de8Pk58/iffvqJtm3b4uvrS6VKlejRowd///13nnK3bdvGzTffTFBQEN7e3tSrV48nnngC+O/9d77lzMexoI/hmcdXqVKF9u3bs3z58jyxXegzDC7+PXj2d21gYCA9evRw+gwE8zts9OjR5yzn7PftmjVryMnJISMjgyuvvPK8n1cAv/zyi+P5qly5Mn369CEuLs5pn9znI/c58/f3d3QHSEtLyxPvmd+5WVlZdO/enapVq7J161bH+mnTpnHdddcRHByMp6cnDRo0YOrUqXlic3FxYcKECU7rcz//z95fxGpqHi9SSuQm2NWqVQPMf1y/+eYbbrrpJmrVqsXhw4d5//33adeuHVu3biU8PByA7OxsevbsydKlSxkwYAD33Xcfp06dYvHixWzZsoXatWs7znHrrbfSvXt3p/OOGzcu33heeOEFbDYbjz76KEeOHOHNN9+kY8eObNiwAW9vb8D80uvWrRstWrTgqaeewsXFxfFluXz5cq666qo85V522WVMnDgRgOTkZEaOHJnvucePH8/NN9/M//73PxITE3nnnXe49tprWb9+PZUrV85zzIgRI2jbti0AX3/9dZ5k7K677nK0crj33nvZvXs3kydPZv369axYsQJ3d/d8H4fCOHHihOPazpSTk0Pv3r35448/GDFiBPXr12fz5s288cYb7Nixw2mwoYuVkpJCu3btOHDgAHfddRc1a9bkzz//ZNy4cSQkJPDmm28W27mKqm3btowYMYKcnBy2bNnCm2++ycGDB/P9B/dMTz/9NM888wwdO3Zk5MiRbN++nalTp7JmzRrHc3j06FHAfF1HRUXxzDPPkJaWxpQpU2jTpg1r1qyhbt26tG/fnho1ajBr1iz69u3rdJ5Zs2ZRu3ZtR7PUgpo7dy4pKSmMHDmSatWqsXr1at555x3279/P3Llzz3nc7NmzGTRoEFdccQUTJ07k6NGjvP322/zxxx+sX7+ewMDAQsVxLkV5X+V69NFHiyWGXPv27aNr165ER0fz5Zdf4uZm/jty+PBhWrduTUpKCvfeey/VqlVjxowZ9O7dm6+++irPc3Uh1157LZ999pnjfm7ri9wEDqB169aO29nZ2XTt2pVWrVrxyiuvsHDhQp566imysrJ49tlnAXO07N69e7Ns2TKGDx9O06ZNWbRoEQ8//DAHDhzgjTfeyDeWN954w/FcXopWILmfdy1btmTixIkcPnyYt956ixUrVlzw+Z4wYUKehKkwCnr8XXfdRceOHR33b7/9dvr27Uu/fv0c64KCggD47LPPGDJkCF26dOHll18mJSWFqVOncs0117B+/XrHDwebNm2ibdu2uLu7M2LECCIjI9m1axfff/89L7zwAv369aNOnTqO8h944AHq16/PiBEjHOvq168PFO4xDAwMdDz3+/fv56233qJ79+7s27fPsV9BPsPOpbDvwejoaJ544gkMw2DXrl1MmjSJ7t27F/gH0vzkfr6OHj2aFi1a8NJLL5GYmJjv59WSJUvo1q0bl19+OU8//TSpqam88847tGnThr/++ivPDz0333wzkZGRTJw4kZUrV/L2229z/PhxPv3003PG87///Y9ff/2VxYsX06BBA8f6qVOn0rBhQ3r37o2bmxvff/8999xzDzk5OYwaNQowK0ruueceJk6cyA033EDz5s1JSEhgzJgxdOzYkbvvvrvIj5NIiTBE5JKaNm2aARhLliwxEhMTjX379hlz5swxqlWrZnh7exv79+83DMMw0tLSjOzsbKdjd+/ebXh6ehrPPvusY90nn3xiAMakSZPynCsnJ8dxHGC8+uqrefZp2LCh0a5dO8f9ZcuWGYBRvXp1IykpybH+yy+/NADjrbfecpQdFRVldO
nSxXEewzCMlJQUo1atWkanTp3ynKt169ZGo0aNHPcTExMNwHjqqacc6/bs2WO4uroaL7zwgtOxmzdvNtzc3PKs37l
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# 70% training, 15% validation, 15% test\n",
"train_data, temp_data = train_test_split(df, test_size=0.3, random_state=42)\n",
"val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)\n",
"\n",
"# Mean price\n",
"print(\"Mean price in the training sample:\", train_data['price'].mean())\n",
"print(\"Mean price in the validation sample:\", val_data['price'].mean())\n",
"print(\"Mean price in the test sample:\", test_data['price'].mean())\n",
"print()\n",
"\n",
"# Standard deviation of price\n",
"print(\"Standard deviation of price in the training sample:\", train_data['price'].std())\n",
"print(\"Standard deviation of price in the validation sample:\", val_data['price'].std())\n",
"print(\"Standard deviation of price in the test sample:\", test_data['price'].std())\n",
"print()\n",
"\n",
"# Compare the distributions via the price quartiles\n",
"print(\"Price quartiles (training):\")\n",
"print(train_data['price'].quantile([0.25, 0.5, 0.75]))\n",
"print()\n",
"print(\"Price quartiles (validation):\")\n",
"print(val_data['price'].quantile([0.25, 0.5, 0.75]))\n",
"print()\n",
"print(\"Price quartiles (test):\")\n",
"print(test_data['price'].quantile([0.25, 0.5, 0.75]))\n",
"\n",
"# Plot a histogram for each sample\n",
"plt.figure(figsize=(12, 6))\n",
"\n",
"sns.histplot(train_data['price'], color='blue', label='Train', kde=True)\n",
"sns.histplot(val_data['price'], color='green', label='Validation', kde=True)\n",
"sns.histplot(test_data['price'], color='red', label='Test', kde=True)\n",
"\n",
"plt.legend()\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Price distribution in the training, validation and test samples')\n",
"plt.show()"
]
},
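{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a continuous target such as `price`, one way to keep the three samples comparable is to stratify the split on quantile bins of the target. This is a minimal sketch on synthetic data, not the method used in this notebook; the helper name `stratified_three_way`, the choice of 4 bins and the log-normal toy prices are assumptions made for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"def stratified_three_way(frame, target, bins=4, seed=42):\n",
"    # Bin the continuous target into quantiles and stratify on the bins\n",
"    strata = pd.qcut(frame[target], q=bins, labels=False)\n",
"    train, temp = train_test_split(\n",
"        frame, test_size=0.3, random_state=seed, stratify=strata)\n",
"    val, test = train_test_split(\n",
"        temp, test_size=0.5, random_state=seed, stratify=strata.loc[temp.index])\n",
"    return train, val, test\n",
"\n",
"# Synthetic prices standing in for the housing data (hypothetical values)\n",
"rng = np.random.default_rng(0)\n",
"toy = pd.DataFrame({'price': rng.lognormal(mean=13, sigma=0.5, size=1000)})\n",
"train, val, test = stratified_three_way(toy, 'price')\n",
"print(len(train), len(val), len(test))\n",
"```\n",
"\n",
"Each sample then receives roughly the same share of every price quartile, so their means and quartiles should stay close by construction."
]
},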
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The samples turned out to be **insufficiently balanced**. We apply two data augmentation methods, *oversampling* and *undersampling*:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+0AAAIjCAYAAAB20vpjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3yNd//H8dfJ3iEyzRCxqaJa1Ko9W10URavlbmmr+9ZF2/tXnbp03neLFtVqUR0oiqoapbbYIkZIIoTsdf3+uJpTR4IkElcS7+fjcR7nOte5xueM6zrnc32XzTAMAxEREREREREpc5ysDkBERERERERECqakXURERERERKSMUtIuIiIiIiIiUkYpaRcREREREREpo5S0i4iIiIiIiJRRStpFREREREREyigl7SIiIiIiIiJllJJ2ERERERERkTJKSbuISDly+vRp9u3bR3Z2ttWhiIhUGIZhkJiYyN69e60ORUQkHyXtIiJlWFZWFq+//jrXXHMN7u7uVK5cmcjISJYtW2Z1aOXC9u3bmT9/vv3x5s2b+emnn6wLSErVkSNHmDZtmv1xdHQ0M2fOtC4gKdPH4NmzZ3nuueeoX78+bm5uVKlShXr16rF7926rQ7vievfuzf333291GFKO7Ny5ExcXF7Zv3251KFcFF6sDkPJr2rRp3HPPPfbH7u7u1KxZk+7du/P8888TEhJiYXQi5V9GRgbdu3dn7dq1/Otf/+Lll1/Gy8sLZ2dnWrZsaXV45cLZs2cZPXo0oaGhVKlShUceeYRevXrRp08fq0OTUmCz2RgzZgxhYWHUr1+fp556ioCAAIYMGWJ1aFetsnoMnjx5ko4dOxITE8NDDz1Eu3btcHNzw9XVlfDwcEtju9JWr17NL7/8wq5du6wORcqRRo0a0adPH1544QXmzp1rdTgVnpJ2uWwvvfQStWvXJj09nd9//52PPvqIn3/+me3bt+Pl5WV1eCLl1muvvca6detYvHgxnTp1sjqccqlNmzb2G0C9evVUmlSBVatWjfvvv5+ePXsCEBYWxooVK6wN6ipXVo/BJ598ktjYWNasWUPjxo2tDsdSb7zxBl26dKFu3bpWhyLlzL/+9S969+7N/v37iYiIsDqcCs1mGIZhdRBSPuWVtP/555+0atXKPv/xxx9n8uTJzJo1i7vuusvCCEXKr+zsbIKDg3nggQf4v//7P6vDKfd27txJWloaTZs2xc3NzepwpJTt37+fhIQEmjRpgre3t9XhCGXrGIyLiyMsLIyPP/64TFxAsFJcXBzVqlXj448/ZuTIkVaHI+VMVlYWISEhjB07lpdeesnqcCo0tWmXEnfTTTcBcPDgQQASExN54oknaNq0KT4+Pvj5+dGrVy+2bNmSb9309HQmTpxIvXr18PDwICwsjFtvvZX9+/cDZvtEm812wdu5pZErVqzAZrPx9ddf88wzzxAaGoq3tzf9+/fn8OHD+fa9bt06evbsib+/P15eXnTs2JHVq1cX+Bo7depU4P4nTpyYb9kZM2bQsmVLPD09CQgIYNCgQQXu/2Kv7Vy5ubm88847NG7cGA8PD0JCQhg9ejSnTp1yWC48PJy+ffvm28/YsWPzbbOg2N9444187ymYVbYnTJhA3bp1cXd3p0aNGjz11FNkZGQU+F6dq1OnTjRp0iTf/DfffBObzUZ0dLTD/NOnTzNu3Dhq1KiBu7s7devW5bXXXiM3N9e+TN779uabb+bbbpMmTQr8Tnz77bcXjHHEiBGFqhoZHh5u/3ycnJwIDQ1l4MCBxMTEXHJdgA8//JDGjRvj7u5O1apVGTNmDKdPn7Y/v3v3bk6dOoWvry8dO3bEy8sLf39/+vbt69B+bPny5dhsNubNm5dvH7NmzcJms7FmzRp7zCNGjHBYJu89Obc0ctWqVdxxxx3UrFnT/hk/+uijpKWlOaw7ceLEfN+lmTNn0rx5czw8PKhSpQp33XVXvvdkxIgR+Pj4OM
z79ttv88UB4OPjky9mKNxx1alTJ/vn36hRI1q2bMmWLVsKPK4Ka9q0afm+qzt27KBy5cr07dvXoYPAAwcOcMcddxAQEICXlxc33HBDvra8F/tOnvva8/Z7sVteW+689/fAgQP06NEDb29vqlatyksvvcT51+lTUlJ4/PHH7cdY/fr1efPNN/Mtd7EYzj3G8pbZsGHDRd/Hgr4DcOHvwZw5c+yfd2BgIEOHDuXo0aP5tpl37EZERHD99deTmJiIp6dngeeXgmI6/9g/fPhwodYfMWLEJT+fc9dfuHAh7du3x9vbG19fX/r06cOOHTvybXfXrl3ceeedBAUF4enpSf369Xn22WeBf46/i93OfR8L+x6eu37lypXp1KkTq1atyhfbpc5hcPnH4Pm/tYGBgfTp0ydfG1qbzcbYsWMvuJ3zj9s///yT3NxcMjMzadWq1UXPVwC//vqr/fOqVKkSN998M1FRUQ7L5H0eeZ+Zn5+fvTlAenp6vnjP/c3Nzs6md+/eBAQEsHPnTvv8qVOnctNNNxEcHIy7uzuNGjXio48+yhebk5MTL7zwgsP8vPP/+cuf76effiI7O5uuXbs6zC/sf7fCnsPOdaHjpaBlC3OsFOWcl5uby7vvvkvTpk3x8PAgKCiInj175jtnFfY3xmazccstt+SLe/To0dhsNof/PEX5D3up9+rc709Rzl1F/W9Y0O0///mPfRlXV1c6derE999/n2+bUrJUPV5KXF6CXaVKFcD84zp//nzuuOMOateuzYkTJ/jkk0/o2LEjO3fupGrVqgDk5OTQt29fli1bxqBBg3jkkUc4e/YsS5YsYfv27Q7Vbu666y569+7tsN/x48cXGM///d//YbPZePrpp4mLi+Odd96ha9eubN68GU9PT8D80evVqxctW7ZkwoQJODk52X8sV61aRevWrfNtt3r16kyaNAmA5ORkHnjggQL3/fzzz3PnnXdy3333ER8fz/vvv0+HDh3YtGkTlSpVyrfOqFGjaN++PQBz587Nl4yNHj3aXsvh4Ycf5uDBg0yZMoVNmzaxevVqXF1dC3wfiuL06dP213au3Nxc+vfvz++//86oUaNo2LAh27Zt4+2332bPnj0OnQ1drtTUVDp27MjRo0cZPXo0NWvW5I8//mD8+PHExsbyzjvvlNi+iqt9+/aMGjWK3Nxctm/fzjvvvMOxY8cK/IN7rokTJ/Liiy/StWtXHnjgAXbv3s1HH33En3/+af8MT548CZjf68jISF588UXS09P54IMPaNeuHX/++Sf16tWjU6dO1KhRg5kzZzJgwACH/cycOZOIiAh7tdTCmjNnDqmpqTzwwANUqVKF9evX8/7773PkyBHmzJlzwfVmzZrF0KFDueaaa5g0aRInT57kvffe4/fff2fTpk0EBgYWKY4LKc5xlefpp58ukRjyHD58mJ49e9KgQQO++eYbXFzMn9UTJ07Qtm1bUlNTefjhh6lSpQrTp0+nf//+fPvtt/k+q0vp0KEDX375pf1xXu2LvAQOoG3btvbpnJwcevbsyQ033MDrr7/OokWLmDBhAtnZ2fbSEMMw6N+/P8uXL2fkyJE0b96cxYsX8+STT3L06FHefvvtAmN5++237Z/llagFkne+u+6665g0aRInTpzg3XffZfXq1Zf8vF944YV8CVNRFHb90aNHOyQ9d999NwMGDODWW2+1zwsKCgLgyy+/ZPjw4fTo0YPXXnuN1NRUPvroI2688UY2bdpk//O9detW2rdvj6urK6NGjSI8PJz9+/fzww8/8H//93/ceuutDtWZH330URo2bMioUaPs8xo2bAgU7T0MDAy0f/ZHjhzh3XffpXfv3hw+fNi+XGHOYRdS1GOwQYMGPPvssxiGwf79+5k8eTK9e/cu9AXSguSdX8eOHUvLli159dVXiY+PL/B8tXTpUnr16kWdOnWYOHEiaWlpvP/++7Rr146//vorX7J05513Eh4ezqRJk1i7di3vvfcep0
6d4osvvrhgPPfddx8rVqxgyZIlNGrUyD7/o48+onHjxvTv3x8XFxd++OEHHnzwQXJzcxkzZgxgFpQ8+OCDTJo0iVt
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"def oversample(df, target_column):\n",
"    # Separate the features from the target column\n",
"    X = df.drop(target_column, axis=1)\n",
"    y = df[target_column]\n",
"\n",
"    oversampler = RandomOverSampler(random_state=42)\n",
"    x_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([x_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"def undersample(df, target_column):\n",
"    # Separate the features from the target column\n",
"    X = df.drop(target_column, axis=1)\n",
"    y = df[target_column]\n",
"\n",
"    undersampler = RandomUnderSampler(random_state=42)\n",
"    x_resampled, y_resampled = undersampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([x_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"train_df_oversampled = oversample(train_data, 'price')\n",
"val_df_oversampled = oversample(val_data, 'price')\n",
"test_df_oversampled = oversample(test_data, 'price')\n",
"\n",
"train_df_undersampled = undersample(train_data, 'price')\n",
"val_df_undersampled = undersample(val_data, 'price')\n",
"test_df_undersampled = undersample(test_data, 'price')\n",
"\n",
"# Histograms for each undersampled set\n",
"plt.figure(figsize=(12, 6))\n",
"\n",
"sns.histplot(train_df_undersampled['price'], color='blue', label='Train', kde=True)\n",
"sns.histplot(val_df_undersampled['price'], color='green', label='Validation', kde=True)\n",
"sns.histplot(test_df_undersampled['price'], color='red', label='Test', kde=True)\n",
"\n",
"plt.legend()\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Price distribution in the training, validation, and test sets (undersampling)')\n",
"plt.show()\n",
"\n",
"# Histograms for each oversampled set\n",
"plt.figure(figsize=(12, 6))\n",
"\n",
"sns.histplot(train_df_oversampled['price'], color='blue', label='Train', kde=True)\n",
"sns.histplot(val_df_oversampled['price'], color='green', label='Validation', kde=True)\n",
"sns.histplot(test_df_oversampled['price'], color='red', label='Test', kde=True)\n",
"\n",
"plt.legend()\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Price distribution in the training, validation, and test sets (oversampling)')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset No. 2. Near-Earth objects\n",
"\n",
"[**Link**](https://www.kaggle.com/datasets/sameepvani/nasa-nearest-earth-objects)\n",
"\n",
"**Problem domain**: This dataset covers observations of objects that pass close to Earth.\n",
"\n",
"**Objects of observation**: The observed objects are near-Earth objects: asteroids or comets that pass relatively close to Earth's orbit.\n",
"\n",
"**Object attributes:**\n",
"- `id`: unique identifier of the object.\n",
"- `name`: name or designation of the object (for example, a proper name or a discovery date).\n",
"- `est_diameter_min`: minimum estimated diameter of the object (in kilometers).\n",
"- `est_diameter_max`: maximum estimated diameter of the object (in kilometers).\n",
"- `relative_velocity`: velocity of the object relative to Earth, in km/h.\n",
"- `miss_distance`: distance between the object and Earth at the moment of closest approach (in kilometers).\n",
"- `orbiting_body`: the celestial body the object is recorded as orbiting (Earth for every record in this dataset).\n",
"- `sentry_object`: boolean flag (True/False) indicating whether the object is tracked by the Sentry system, which assesses possible future collisions.\n",
"- `absolute_magnitude`: absolute magnitude of the object, which characterizes its brightness and, indirectly, its size.\n",
"- `hazardous`: boolean flag (True/False) indicating whether the object is potentially hazardous to Earth (the assessment takes into account its size, velocity, and distance).\n",
"\n",
"**Business goal**: Developing a planetary defense strategy and protective technologies, which may lead to increased investment in the aerospace industry and related research.\n",
"\n",
"**Technical goal**: Optimizing strategies for deflecting or destroying hazardous objects.\n",
"\n",
"**Input data**: Data on space objects, including all features (object diameter, distance between the object and Earth, and so on).\n",
"\n",
"**Target variable**: Hazard flag (`hazardous`)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>name</th>\n",
" <th>est_diameter_min</th>\n",
" <th>est_diameter_max</th>\n",
" <th>relative_velocity</th>\n",
" <th>miss_distance</th>\n",
" <th>orbiting_body</th>\n",
" <th>sentry_object</th>\n",
" <th>absolute_magnitude</th>\n",
" <th>hazardous</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2162635</td>\n",
" <td>162635 (2000 SS164)</td>\n",
" <td>1.198271</td>\n",
" <td>2.679415</td>\n",
" <td>13569.249224</td>\n",
" <td>5.483974e+07</td>\n",
" <td>Earth</td>\n",
" <td>False</td>\n",
" <td>16.73</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2277475</td>\n",
" <td>277475 (2005 WK4)</td>\n",
" <td>0.265800</td>\n",
" <td>0.594347</td>\n",
" <td>73588.726663</td>\n",
" <td>6.143813e+07</td>\n",
" <td>Earth</td>\n",
" <td>False</td>\n",
" <td>20.00</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2512244</td>\n",
" <td>512244 (2015 YE18)</td>\n",
" <td>0.722030</td>\n",
" <td>1.614507</td>\n",
" <td>114258.692129</td>\n",
" <td>4.979872e+07</td>\n",
" <td>Earth</td>\n",
" <td>False</td>\n",
" <td>17.83</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3596030</td>\n",
" <td>(2012 BV13)</td>\n",
" <td>0.096506</td>\n",
" <td>0.215794</td>\n",
" <td>24764.303138</td>\n",
" <td>2.543497e+07</td>\n",
" <td>Earth</td>\n",
" <td>False</td>\n",
" <td>22.20</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3667127</td>\n",
" <td>(2014 GE35)</td>\n",
" <td>0.255009</td>\n",
" <td>0.570217</td>\n",
" <td>42737.733765</td>\n",
" <td>4.627557e+07</td>\n",
" <td>Earth</td>\n",
" <td>False</td>\n",
" <td>20.09</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id name est_diameter_min est_diameter_max \\\n",
"0 2162635 162635 (2000 SS164) 1.198271 2.679415 \n",
"1 2277475 277475 (2005 WK4) 0.265800 0.594347 \n",
"2 2512244 512244 (2015 YE18) 0.722030 1.614507 \n",
"3 3596030 (2012 BV13) 0.096506 0.215794 \n",
"4 3667127 (2014 GE35) 0.255009 0.570217 \n",
"\n",
" relative_velocity miss_distance orbiting_body sentry_object \\\n",
"0 13569.249224 5.483974e+07 Earth False \n",
"1 73588.726663 6.143813e+07 Earth False \n",
"2 114258.692129 4.979872e+07 Earth False \n",
"3 24764.303138 2.543497e+07 Earth False \n",
"4 42737.733765 4.627557e+07 Earth False \n",
"\n",
" absolute_magnitude hazardous \n",
"0 16.73 False \n",
"1 20.00 True \n",
"2 17.83 False \n",
"3 22.20 False \n",
"4 20.09 True "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\".//static//csv//neo_v2.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We drop the `orbiting_body` and `sentry_object` features, since each of them has the same value in every record. We also drop the `name` feature, since it is of no use for predicting whether an object is hazardous."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"df = df.drop(columns=['name', 'orbiting_body', 'sentry_object'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking for missing data"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"id 0\n",
"est_diameter_min 0\n",
"est_diameter_max 0\n",
"relative_velocity 0\n",
"miss_distance 0\n",
"absolute_magnitude 0\n",
"hazardous 0\n",
"dtype: int64\n",
"\n",
"id False\n",
"est_diameter_min False\n",
"est_diameter_max False\n",
"relative_velocity False\n",
"miss_distance False\n",
"absolute_magnitude False\n",
"hazardous False\n",
"dtype: bool\n",
"\n"
]
}
],
"source": [
"# Number of missing values per feature\n",
"print(df.isnull().sum())\n",
"\n",
"print()\n",
"\n",
"# Whether each feature has any missing values\n",
"print(df.isnull().any())\n",
"\n",
"print()\n",
"\n",
"# Percentage of missing values per feature (printed only when non-zero)\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f\"{i}: {null_rate:.2f}% missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**No** missing data was found in the dataset."
]
},
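{
"cell_type": "markdown",
"metadata": {},
"source": [
"The per-column percentage loop can also be written without an explicit loop. The following is only a minimal sketch on a small made-up frame (`toy` and its columns are assumptions for illustration, not the NEO data):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame: column 'a' has one missing value out of four\n",
"toy = pd.DataFrame({'a': [1.0, None, 3.0, 4.0], 'b': [1, 2, 3, 4]})\n",
"\n",
"# isnull().mean() is the fraction of missing values in each column\n",
"null_share = toy.isnull().mean() * 100\n",
"\n",
"# Keep only the columns that actually contain missing values\n",
"print(null_share[null_share > 0].round(2))\n",
"```\n",
"\n",
"Applied to `df`, the filtered series would be empty, which matches the conclusion that the dataset has no missing values."
]
},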
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking the dataset for outliers"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of outliers in column 'est_diameter_min': 8306\n",
"Number of outliers in column 'est_diameter_max': 8306\n",
"Number of outliers in column 'relative_velocity': 1574\n",
"Number of outliers in column 'miss_distance': 0\n",
"Number of outliers in column 'absolute_magnitude': 101\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 5 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Columns to analyze\n",
"columns_to_check = ['est_diameter_min', 'est_diameter_max', 'relative_velocity', 'miss_distance', 'absolute_magnitude']\n",
"\n",
"# Count outliers per column using the 1.5 * IQR rule\n",
"def count_outliers(df, columns):\n",
"    outliers_count = {}\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Count the rows that fall outside the bounds\n",
"        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
"        outliers_count[col] = len(outliers)\n",
"\n",
"    return outliers_count\n",
"\n",
"# Count the outliers\n",
"outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"# Plot the histograms\n",
"plt.figure(figsize=(15, 10))\n",
"for i, col in enumerate(columns_to_check, 1):\n",
"    plt.subplot(2, 3, i)\n",
"    sns.histplot(df[col], kde=True)\n",
"    plt.title(f'Histogram of {col}')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `miss_distance` feature has no outliers, and the number of outliers in `absolute_magnitude` is acceptably small. For `est_diameter_min`, `est_diameter_max`, and `relative_velocity`, however, the outliers need to be handled. We cap them at the IQR bounds: values below the lower bound are replaced with the lower bound, and values above the upper bound are replaced with the upper bound:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"columns_to_fix = ['est_diameter_min', 'est_diameter_max', 'relative_velocity']\n",
"\n",
"for column in columns_to_fix:\n",
"    q1 = df[column].quantile(0.25)\n",
"    q3 = df[column].quantile(0.75)\n",
"    iqr = q3 - q1\n",
"\n",
"    # Bounds beyond which a value counts as an outlier\n",
"    lower_bound = q1 - 1.5 * iqr\n",
"    upper_bound = q3 + 1.5 * iqr\n",
"\n",
"    # Cap the outliers: values below the lower bound become the lower bound, values above the upper bound become the upper bound\n",
"    df[column] = df[column].apply(lambda x: lower_bound if x < lower_bound else upper_bound if x > upper_bound else x)"
]
},
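{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replacing values beyond the IQR bounds with the bounds themselves is the same operation that pandas exposes as `Series.clip`. The following is only a sketch under assumptions: the helper name `cap_outliers_iqr`, its parameter `k`, and the `toy` frame are made up for illustration and are not part of the lab code.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def cap_outliers_iqr(frame, columns, k=1.5):\n",
"    # Hypothetical helper: returns a copy with each column clipped to [Q1 - k*IQR, Q3 + k*IQR]\n",
"    capped = frame.copy()\n",
"    for column in columns:\n",
"        q1 = capped[column].quantile(0.25)\n",
"        q3 = capped[column].quantile(0.75)\n",
"        iqr = q3 - q1\n",
"        capped[column] = capped[column].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)\n",
"    return capped\n",
"\n",
"# Toy data: 500.0 lies far above the upper bound and gets capped\n",
"toy = pd.DataFrame({'relative_velocity': [10.0, 12.0, 11.0, 13.0, 500.0]})\n",
"print(cap_outliers_iqr(toy, ['relative_velocity']))\n",
"```\n",
"\n",
"Working on a copy keeps the original frame available for comparison, and `clip` is vectorized, so it avoids calling a Python lambda once per row."
]
},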
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us re-count the outliers after capping:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of outliers in column 'est_diameter_min': 0\n",
"Number of outliers in column 'est_diameter_max': 0\n",
"Number of outliers in column 'relative_velocity': 0\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"columns_to_check = ['est_diameter_min', 'est_diameter_max', 'relative_velocity']\n",
"\n",
"# Count the outliers\n",
"outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"# Plot the histograms\n",
"plt.figure(figsize=(15, 10))\n",
"for i, col in enumerate(columns_to_check, 1):\n",
"    plt.subplot(2, 3, i)\n",
"    sns.histplot(df[col], kde=True)\n",
"    plt.title(f'Histogram of {col}')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging by the counts and histograms above, the outliers in these features have been eliminated."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting the dataset into training, validation, and test sets"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set: (63585, 7)\n",
"hazardous\n",
"False 57399\n",
"True 6186\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Validation set: (13625, 7)\n",
"hazardous\n",
"False 12315\n",
"True 1310\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAhUAAADECAYAAAAoGdPdAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA8vElEQVR4nO3dd1hTZ/sH8G8SQgJh76EFZIi7iqNO3IhY6n7dii9qq2htHa32ddviqqPuVmup+toKttJqxb1Q66qgOJHlRkDZBEjy/P7gl7yEBCQQOKD357q4NCfnPOc+M3eeccJjjDEQQgghhFQTn+sACCGEEPJ2oKSCEEIIIXpBSQUhhBBC9IKSCkIIIYToBSUVhBBCCNELSioIIYQQoheUVBBCCCFELyipIIQQQoheUFJBCCGE1LLMzEw8fPgQMpmM61D0ipIKQuqACRMmwMTEhOsw9Gbx4sXg8Xhch0HeMU+ePMFPP/2kep2cnIy9e/dyF1ApxcXFWLVqFVq1agWRSARLS0t4enri5MmTXIemVzolFT/99BN4PJ7qTywWw8vLCyEhIUhNTa2pGAkhhJA34vF4mDZtGo4ePYrk5GTMnTsX58+f5zosFBYWonfv3liwYAG6d++O8PBwHD9+HKdOnULHjh25Dk+vDKqy0NKlS+Hm5gapVIro6Ghs3boVf/31F+Li4mBsbKzvGAkhhJA3cnZ2xqRJk9CvXz8AgKOjI86cOcNtUABWrlyJy5cv4+jRo+jevTvX4dSoKiUV/v7+aNu2LQAgODgY1tbWWLt2LSIjIzFy5Ei9BkgIqXtkMhkUCgUMDQ25DoUQNevXr8f06dORnp6O5s2bQyKRcBqPTCbD+vXrMWvWrLc+oQD01KeiZ8+eAICkpCQAwKtXrzB79my0aNECJiYmMDMzg7+/P2JjYzWWlUqlWLx4Mby8vCAWi+Ho6IjBgwcjISEBQEmbWOkml7J/pQ/SmTNnwOPx8Ouvv2L+/PlwcHCARCJBYGAgHj9+rLHuy5cvo1+/fjA3N4exsTF8fX1x4cIFrdvYvXt3retfvHixxrx79uyBj48PjIyMYGVlhREjRmhdf0XbVppCocD69evRrFkziMVi2NvbY8qUKXj9+rXafK6urhgwYIDGekJCQjTK1Bb76tWrNfYpUFJ1t2jRInh4eEAkEqFhw4aYO3cuCgsLte6r0rp3765R3tdffw0+n4///ve/Vdofa9asQadOnWBtbQ0jIyP4+PggIiJC6/r37NmD9u3bw9jYGJaWlujWrRuOHTumNs+RI0fg6+sLU1NTmJmZoV27dhqxhYeHq46pjY0NxowZg6dPn6rNM2HCBLWYLS0t0b17d52qX58+fYqBAwfCxMQEtra2mD17NuRyuc7bXzYWbedsUVERFi5cCB8fH5ibm0MikaBr1644ffq0WlnK47JmzRqsX78e7u7uEIlEuHPnDgAgOjoa7dq1g1gshru7O7Zv365122QyGZYtW6Za3tXVFfPnz9c4j8q7rlxdXTFhwgTV6+LiYixZsgSenp4Qi8WwtrZGly5dcPz48Qr3cdlmXGNjY7Ro0QI7duyocLnSyyYnJ6um3b59G5aWlhgwYIBap7vExEQMGzYMVlZWMDY2xgcffIDDhw+rlae8Z2k7f01MTFTbWzZmbX/KvgTK/jmJiYnw8/ODRCKBk5MTli5dirI/Sp2Xl4dZs2ahYcOGEIlEaNy4MdasWaMxX0UxlL6+lfNcu3atwv1YXh+iiIgI8Hg8jdqFyl5/rq6uAAB3d3d06NABr169gpGRkcYxKy+myly/5d1nlZTHVLkN9+/fx+vXr2FqagpfX18YGxvD3NwcAwYMQFxcnMbyN27cgL+/P8zMzGBiYoJevXrh77//VptHuZ/PnTuHKVOmwNraGmZmZhg3bpzWz4XS1w0ATJ48GWKxWGM/HzlyBF27doVEIoGpqSkCAgJw+/btCvdbWVWqqShLmQBYW1sDKLmYDh
48iGHDhsHNzQ2pqanYvn07fH19cefOHTg5OQEA5HI5BgwYgJMnT2LEiBH49NNPkZOTg+PHjyMuLg7u7u6qdYwcORL9+/dXW++8efO0xvP111+Dx+Phiy++wMuXL7F+/Xr07t0bMTExMDIyAgCcOnUK/v7+8PHxwaJFi8Dn87Fr1y707NkT58+fR/v27TXKbdCgAUJDQwEAubm5+OSTT7Sue8GCBRg+fDiCg4ORlpaGjRs3olu3brhx4wYsLCw0lpk8eTK6du0KAPjtt9/w+++/q70/ZcoU/PTTTwgKCsKMGTOQlJSETZs24caNG7hw4QKEQqHW/aCLzMxM1baVplAoEBgYiOjoaEyePBlNmjTBrVu3sG7dOjx48AAHDx7UaT27du3Cf/7zH3z77bcYNWqU1nnetD82bNiAwMBAjB49GkVFRfjll18wbNgwHDp0CAEBAar5lixZgsWLF6NTp05YunQpDA0NcfnyZZw6dQp9+/YFUHJxTpw4Ec2aNcO8efNgYWGBGzduICoqShWfct+3a9cOoaGhSE1NxYYNG3DhwgWNY2pjY4N169YBKOk0tmHDBvTv3x+PHz/WeuxLk8vl8PPzQ4cOHbBmzRqcOHEC3377Ldzd3dXOtcps/5QpU9C7d2+18qOiorB3717Y2dkBALKzs7Fjxw6MHDkSkyZNQk5ODnbu3Ak/Pz9cuXIF77//vsaxk0qlmDx5MkQiEaysrHDr1i307dsXtra2WLx4MWQyGRYtWgR7e3uN7QsODkZYWBiGDh2KWbNm4fLlywgNDcXdu3c1jnFlLF68GKGhoQgODkb79u2RnZ2Na9eu4Z9//kGfPn3euPy6detgY2OD7Oxs/Pjjj5g0aRJcXV019ltFHj9+jH79+sHb2xv79++HgUHJLTU1NRWdOnVCfn4+ZsyYAWtra4SFhSEwMBAREREYNGiQTtvarVs37N69W/X666+/BgB89dVXqmmdOnVS/V8ul6Nfv3744IMPsGrVKkRFRWHRokWQyWRYunQpAIAxhsDAQJw+fRr//ve/8f777+Po0aOYM2cOnj59qjqPy1Lut9Jx1CRdrr+yFi5cCKlUWul1Vef6LU9GRgaAks8rT09PLFmyBFKpFJs3b0bnzp1x9epVeHl5AShJULt27QozMzPMnTsXQqEQ27dvR/fu3XH27Fl06NBBreyQkBBYWFhg8eLFuH//PrZu3YqUlBRVYqPNokWLsHPnTvz6669qCeHu3bsxfvx4+Pn5YeXKlcjPz8fWrVvRpUsX3LhxQ5WwvRHTwa5duxgAduLECZaWlsYeP37MfvnlF2Ztbc2MjIzYkydPGGOMSaVSJpfL1ZZNSkpiIpGILV26VDXtxx9/ZADY2rVrNdalUChUywFgq1ev1pinWbNmzNfXV/X69OnTDABzdnZm2dnZqun79+9nANiGDRtUZXt6ejI/Pz/VehhjLD8/n7m5ubE+ffporKtTp06sefPmqtdpaWkMAFu0aJFqWnJyMhMIBOzrr79WW/bWrVvMwMBAY3p8fDwDwMLCwlTTFi1axEoflvPnzzMAbO/evWrLRkVFaUx3cXFhAQEBGrFPmzaNlT3UZWOfO3cus7OzYz4+Pmr7dPfu3YzP57Pz58+rLb9t2zYGgF24cEFjfaX5+vqqyjt8+DAzMDBgs2bN0jpvZfYHYyXHqbSioiLWvHlz1rNnT7Wy+Hw+GzRokMa5qDzmmZmZzNTUlHXo0IEVFBRonaeoqIjZ2dmx5s2bq81z6NAhBoAtXLhQNW38+PHMxcVFrZzvv/+eAWBXrlzRus2llwWgdn0wxljr1q2Zj4+PzttfVnx8PDM3N2d9+vRhMpmMMcaYTCZjhYWFavO9fv2a2dvbs4kTJ6qmKa9BMzMz9vLlS7X5Bw4cyMRiMUtJSVFNu3PnDhMIBGrHLSYmhgFgwcHBasvPnj2bAWCnTp1STSt7biq5uLiw8ePHq163atVK6/n+Jsr7WFJSkmragwcPGAC2at
WqSi/76tUr1rRpU9a4cWOWnp6uNt/MmTMZALXrJicnh7m5uTFXV1fVOam8Z4WHh2usSyKRqG1vaaWvq7KU59L06dN
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test set: (13626, 7)\n",
"hazardous\n",
"False 12282\n",
"True 1344\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAfQAAADECAYAAABp29OTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA5YUlEQVR4nO3dd1hTZ/sH8G8WCSvsJVpAEEUcVRTrQNS6UKu4aNU6i+On1tpqrdaquMrbumrdtlVRbK2CVWvromoVa92oKCqyXMiSDSGQPL8/eJOXkDAFgsn9ua5cmsNznnOfkec+6zmHwxhjIIQQQsgbjavtAAghhBDy+iihE0IIITqAEjohhBCiAyihE0IIITqAEjohhBCiAyihE0IIITqAEjohhBCiAyihE0IIITqAEjohhBC9kZubi8TEROTn52s7lDpHCZ2QRmDSpEkwMTHRdhh1JigoCBwOR9thkAaSl5eH7777Tvk9KysLW7Zs0V5AZTDGsHPnTrzzzjswMjKCWCyGi4sLQkNDtR1anatRQt+zZw84HI7yIxKJ4O7ujtmzZyMlJaW+YiSEENKIGRoa4quvvsL+/fvx9OlTBAUF4ffff9d2WACAsWPHYsaMGfDw8MC+fftw5swZREREYMSIEdoOrc7xazPSihUr4OLiAolEgsjISGzbtg1//vknoqOjYWRkVNcxEkIIacR4PB6WL1+OCRMmQC6XQywW448//tB2WNi7dy9+/fVXhIaGYuzYsdoOp97VKqH7+fmhU6dOAIDAwEBYWVlh/fr1OHr0KMaMGVOnARJCGp+SkhLI5XIYGBhoOxTSSMybNw/vv/8+nj59Cg8PD5ibm2s7JKxZswZjxozRi2QO1NE19D59+gAAEhISAACvXr3C/Pnz0bZtW5iYmEAsFsPPzw+3b99WG1cikSAoKAju7u4QiURwcHDAiBEjEBcXBwBITExUOc1f/tOrVy9lXefPnweHw8Gvv/6KL7/8Evb29jA2NsbQoUPx9OlTtWlfuXIFAwcOhJmZGYyMjODr64tLly5pnMdevXppnH5QUJBa2dDQUHh5ecHQ0BCWlpb44IMPNE6/snkrSy6X47vvvoOnpydEIhHs7Owwffp0ZGZmqpRzdnbGkCFD1KYze/ZstTo1xb5mzRq1ZQoARUVFWLZsGdzc3CAUCtGsWTMsWLAARUVFGpdVWb169VKrb/Xq1eByufj5559rtTzWrl2Lbt26wcrKCoaGhvDy8kJYWJjG6YeGhsLb2xtGRkawsLBAz549cfr0aZUyJ06cgK+vL0xNTSEWi9G5c2e12A4dOqRcp9bW1vjwww/x/PlzlTKTJk1SidnCwgK9evXCxYsXq1xOCs+fP4e/vz9MTExgY2OD+fPnQyaT1Xj+y8eiaZuVSqVYunQpvLy8YGZmBmNjY/j4+ODcuXMqdSnWy9q1a/Hdd9/B1dUVQqEQ9+/fBwBERkaic+fOEIlEcHV1xY4dOzTOW0lJCVauXKkc39nZGV9++aXadlTR78rZ2RmTJk1Sfi8uLsby5cvRokULiEQiWFlZoUePHjhz5kyly7j8pUMjIyO0bdsWP/74Y43G0/TZs2ePsvyDBw8watQoWFpaQiQSoVOnTjh27JhavVlZWfj000/h7OwMoVCIpk2bYsKECUhPT1e2aZV9yi6rW7duwc/PD2KxGCYmJnj33Xfx77//1nr+z549Cx8fHxgbG8Pc3BzDhg1DTEyMSpmy90s0bdoUXbt2BZ/Ph729PTgcDs6fP1/pclWMr/iYmprC29sbR44cUSnXq1cvtGnTpsJ6FNupYh3k5+cjOjoazZo1w+DBgyEWi2FsbFzhbzI+Ph6jR4+GpaUljIyM8M4776idZahJjqlJ21eTXFSZWh2hl6dIvlZWVgBKF8yRI0cwevRouLi4ICUlBTt27ICvry/u37+PJk2aAABkMhmGDBmCv/76Cx988AE++eQT5Obm4syZM4iOjoarq6tyGm
PGjMGgQYNUprto0SKN8axevRocDgdffPEFUlNT8d1336Fv376IioqCoaEhgNIN1c/PD15eXli2bBm4XC52796NPn364OLFi/D29lart2nTpggODgZQehPI//3f/2mc9pIlSxAQEIDAwECkpaVh06ZN6NmzJ27duqVxr3XatGnw8fEBABw+fBi//fabyt+nT5+OPXv2YPLkyZgzZw4SEhKwefNm3Lp1C5cuXYJAINC4HGoiKytLOW9lyeVyDB06FJGRkZg2bRo8PDxw9+5dbNiwAY8ePVL70VVl9+7d+Oqrr7Bu3boK95qrWh4bN27E0KFDMW7cOEilUhw4cACjR4/G8ePHMXjwYGW55cuXIygoCN26dcOKFStgYGCAK1eu4OzZs+jfvz+A0sZtypQp8PT0xKJFi2Bubo5bt27h5MmTyvgUy75z584IDg5GSkoKNm7ciEuXLqmtU2tra2zYsAEA8OzZM2zcuBGDBg3C06dPqzxikclkGDBgALp06YK1a9ciIiIC69atg6urq8q2Vp35nz59Ovr27atS/8mTJ7F//37Y2toCAHJycvDjjz9izJgxmDp1KnJzc/HTTz9hwIABuHr1Kt5++221dSeRSDBt2jQIhUJYWlri7t276N+/P2xsbBAUFISSkhIsW7YMdnZ2avMXGBiIkJAQjBo1CvPmzcOVK1cQHByMmJgYtXVcHUFBQQgODkZgYCC8vb2Rk5OD69ev4+bNm+jXr1+V42/YsAHW1tbIycnBrl27MHXqVDg7O6stN4WePXti3759yu+rV68GACxevFg5rFu3bgCAe/fuoXv37nB0dMTChQthbGyMgwcPwt/fH+Hh4Rg+fDiA0nbEx8cHMTExmDJlCjp27Ij09HQcO3YMz549U173Vdi5cydiYmKU2xgAtGvXTjlNHx8fiMViLFiwAAKBADt27ECvXr3w999/o0uXLjWa/4iICPj5+aF58+YICgpCYWEhNm3ahO7du+PmzZtwdnaucNmuW7euxvdVKeYzPT0dW7duxejRoxEdHY2WLVvWqB6FjIwMAMA333wDe3t7fP755xCJRPjhhx/Qt29fnDlzBj179gQApKSkoFu3bigoKMCcOXNgZWWFkJAQDB06FGFhYcr1pVCdHFNeRW1fbXJRhVgN7N69mwFgERERLC0tjT19+pQdOHCAWVlZMUNDQ/bs2TPGGGMSiYTJZDKVcRMSEphQKGQrVqxQDtu1axcDwNavX682LblcrhwPAFuzZo1aGU9PT+br66v8fu7cOQaAOTo6spycHOXwgwcPMgBs48aNyrpbtGjBBgwYoJwOY4wVFBQwFxcX1q9fP7VpdevWjbVp00b5PS0tjQFgy5YtUw5LTExkPB6PrV69WmXcu3fvMj6frzY8NjaWAWAhISHKYcuWLWNlV8vFixcZALZ//36VcU+ePKk23MnJiQ0ePFgt9lmzZrHyq7p87AsWLGC2trbMy8tLZZnu27ePcblcdvHiRZXxt2/fzgCwS5cuqU2vLF9fX2V9f/zxB+Pz+WzevHkay1ZneTBWup7KkkqlrE2bNqxPnz4qdXG5XDZ8+HC1bVGxzrOyspipqSnr0qULKyws1FhGKpUyW1tb1qZNG5Uyx48fZwDY0qVLlcMmTpzInJycVOrZuXMnA8CuXr2qcZ7LjgtA5ffBGGMdOnRgXl5eNZ7/8mJjY5mZmRnr168fKykpYYwxVlJSwoqKilTKZWZmMjs7OzZlyhTlMMVvUCwWs9TUVJXy/v7+TCQSsaSkJOWw+/fvMx6Pp7LeoqKiGAAWGBioMv78+fMZAHb27FnlsPLbpoKTkxObOHGi8nv79u01bu9VUbRjCQkJymGPHj1iANi3335b7XrKbtvlvfvuu6xt27ZMIpEoh8nlctatWzfWokUL5bClS5cyAOzw4cNqdZRtmxQ0bWMK/v7+zMDAgMXFxSmHvXjxgpmamrKePXsqh1V3/t9++21ma2vLMjIylMNu377NuFwumz
BhgnJY+d9oamoqMzU1ZX5+fgwAO3funMZ4KxqfMcZOnz7NALCDBw8qh/n6+jJPT88K61Fsp7t371b5bmBgwB49eqQ
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Hold out 30% of the data, then split that part evenly into validation and test sets (70/15/15)\n",
    "train_data, temp_data = train_test_split(df, test_size=0.3, random_state=42)\n",
    "val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)\n",
    "\n",
    "print(\"Training set: \", train_data.shape)\n",
    "print(train_data.hazardous.value_counts())\n",
    "hazardous_counts = train_data['hazardous'].value_counts()\n",
    "plt.figure(figsize=(2, 2))\n",
    "plt.pie(hazardous_counts, labels=hazardous_counts.index, autopct='%1.1f%%', startangle=90)\n",
    "plt.title('Distribution of hazardous classes in the training set')\n",
    "plt.show()\n",
    "\n",
    "print(\"Validation set: \", val_data.shape)\n",
    "print(val_data.hazardous.value_counts())\n",
    "hazardous_counts = val_data['hazardous'].value_counts()\n",
    "plt.figure(figsize=(2, 2))\n",
    "plt.pie(hazardous_counts, labels=hazardous_counts.index, autopct='%1.1f%%', startangle=90)\n",
    "plt.title('Distribution of hazardous classes in the validation set')\n",
    "plt.show()\n",
    "\n",
    "print(\"Test set: \", test_data.shape)\n",
    "print(test_data.hazardous.value_counts())\n",
    "hazardous_counts = test_data['hazardous'].value_counts()\n",
    "plt.figure(figsize=(2, 2))\n",
    "plt.pie(hazardous_counts, labels=hazardous_counts.index, autopct='%1.1f%%', startangle=90)\n",
    "plt.title('Distribution of hazardous classes in the test set')\n",
    "plt.show()"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the pie charts show, the class distribution is heavily skewed. This can cause problems during training, because the model will learn mostly from a single class. In this case it makes sense to consider data augmentation (resampling) methods."
]
},
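  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the classes are this imbalanced, a purely random split can also give each subset a slightly different class ratio. A minimal sketch, assuming a synthetic `demo_df` (a hypothetical stand-in, not the notebook's data), of how the `stratify` argument of `train_test_split` keeps the class proportions approximately equal across the subsets. This is an optional variant and not the split used in this notebook:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Hypothetical toy data: 90 negative and 10 positive rows\n",
    "demo_df = pd.DataFrame({\n",
    "    'feature': range(100),\n",
    "    'hazardous': [False] * 90 + [True] * 10,\n",
    "})\n",
    "\n",
    "# stratify= asks the splitter to preserve the class ratio in both parts\n",
    "demo_train, demo_temp = train_test_split(\n",
    "    demo_df, test_size=0.3, random_state=42, stratify=demo_df['hazardous']\n",
    ")\n",
    "demo_val, demo_test = train_test_split(\n",
    "    demo_temp, test_size=0.5, random_state=42, stratify=demo_temp['hazardous']\n",
    ")\n",
    "\n",
    "# Print the size and the share of the positive class for each subset\n",
    "for name, part in [('train', demo_train), ('val', demo_val), ('test', demo_test)]:\n",
    "    print(name, part.shape, part['hazardous'].mean())\n",
    "```\n",
    "\n",
    "Any resampling should still be applied to the training subset only, so that the validation and test sets keep the real class distribution."
   ]
  },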
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data augmentation by oversampling\n",
    "\n",
    "This method increases the number of minority-class examples. Here ADASYN is used, which generates synthetic minority-class examples."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set after oversampling: (115166, 7)\n",
"hazardous\n",
"True 57767\n",
"False 57399\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAs0AAADECAYAAABk3xxgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABDRUlEQVR4nO3dd3hT1f8H8HeStulu6S6rlELZswrKKrtliCyRPcsQEFEQBEWGQEFQpoAoAgI/REBBQFYFhKIyZO+yoUBpC90zyfn90W9i06RNWgq34/16njyQ2zs+99ybcz8599wTmRBCgIiIiIiIciWXOgAiIiIioqKOSTMRERERkQlMmomIiIiITGDSTERERERkApNmIiIiIiITmDQTEREREZnApJmIiIiIyAQmzUREREREJlhIHQARUWmQkZGBZ8+eQaPRoGzZslKHQ4UoLS0Nz549g4WFBTw8PKQOh4heErY0ExUBgwcPhr29vdRhFJoZM2ZAJpNJHYbkTp8+jb59+8LNzQ1KpRLe3t7o0aOH1GEVG8uXL0dcXJzu/eLFi5GcnCxdQNmEhYWhS5cucHZ2ho2NDcqVK4cPPvhA6rCI6CXKV0vzunXrMGTIEN17pVKJihUron379pg2bRo8PT0LPUAiouJo586dePfdd1G9enXMmTMHfn5+AMCWyHzYtWsXbt68iQkTJuDo0aOYNm0axo0bJ3VYWLFiBd5//300a9YMS5YsQbly5QAAPj4+EkdGRC9TgbpnzJo1C76+vkhLS0N4eDhWrlyJ33//HZcuXYKtrW1hx0hEVKw8e/YMISEhCAoKwtatW2FlZSV1SMXS1KlT0aVLFyxZsgRyuRxfffUV5HJpb5BGRETgo48+wogRI7BixQreUSEqRQqUNHfo0AGvvfYaACAkJASurq74+uuvsXPnTvTp06dQAySiokelUkGj0TAZzMXatWuRlpaGdevWsYxeQGBgIO7du4erV6+iQoUKKF++vNQhYenSpfDy8sLSpUuZMBOVMoXylb1169YAgDt37gDIamWZOHEi6tSpA3t7ezg6OqJDhw44f/68wbJpaWmYMWMG/P39YW1tDW9vb3Tv3h23bt0CANy9excymSzXV8uWLXXrOnLkCGQyGbZs2YKpU6fCy8sLdnZ26NKlCx48eGCw7RMnTiA4OBhOTk6wtbVFYGAgjh8/bnQfW7ZsaXT7M2bMMJh348aNCAgIgI2NDVxcXNC7d2+j289r37LTaDRYvHgxatWqBWtra3h6emLkyJF4/vy53nyVKlVC586dDbYzduxYg3Uai33BggUGZQoA6enpmD59OqpUqQKlUokKFSpg0qRJSE9PN1pW2bVs2dJgfXPmzIFcLsf//d//Fag8Fi5ciCZNmsDV1RU2NjYICAjAtm3bjG5/48aNaNSoEWxtbVGmTBm0aNECBw4c0Jtn7969CAwMhIODAxwdHfH6668bxLZ161bdMXVzc0P//v0RGRmpN8/gwYP1Yi5TpgxatmyJY8eOmSwnrcjISHTt2hX29vZwd3fHxIkToVar873/OWMxds5mZGTg888/R0BAAJycnGBnZ4fmzZvj8OHDeuvSHpeFCxdi8eLF8PPzg1KpxJUrVwAA4eHheP3112FtbQ0/Pz98++23RvdNpVLhiy++0C1fqVIlTJ061eA8yu1zValSJQwePFj3PjMzEzNnzkTVqlVhbW0NV1dXNGvWDAcPHsyzjNetW6dXHra2tqhTpw6+//77PJfTun37Nt555x24uLjA1tYWb7zxBvbs2aM3zz///IP69etj7ty5qFChApRKJapWrYp58+ZBo9Ho5gsMDES9evWMbqdatWoICgrSi/nu3bt68+T8fJl7TAHDcn7y5AkGDhwId3d3KJVK1K5dG999953eMtnPhexq165t8DlfuHCh0ZgjIyMxdOhQeHp6QqlUolatWvjhhx/05tHW5UeOHIGzszPefPNNlC9fHp06dcr1/D
C2vPalVCrh7++P0NBQCCF082n73sfExOS6rpzn3T///IOAgACMHj1atw/GygoAkpOTMWHCBN05UK1aNSxcuFAvBiDrWIwdOxabNm1CtWrVYG1tjYCAABw9elRvPmPPChw+fBhKpRKjRo3Sm25OOecmr2tupUqVCrSPgHn1cc5jl9t2X+S6BBR+nb537140b94cdnZ2cHBwQKdOnXD58mWD9dnb2+P27dsICgqCnZ0dypYti1mzZhmUl0ajwZIlS1CnTh1YW1vD3d0dwcHBOH36tEGZmso3tPlL165dDeIeOXIkZDIZateurZuWn7wrt7IyliMNHjzY4Dg+ePAANjY2BnVFfvMZY6/Zs2cDyF+9aEqhjJ6hTXBdXV0BZF1UduzYgXfeeQe+vr6IiorCt99+i8DAQFy5ckX35LharUbnzp3xxx9/oHfv3vjggw+QmJiIgwcP4tKlS7o+gADQp08fdOzYUW+7U6ZMMRrPnDlzIJPJMHnyZDx9+hSLFy9G27Ztce7cOdjY2AAADh06hA4dOiAgIADTp0+HXC7H2rVr0bp1axw7dgyNGjUyWG/58uURGhoKAEhKSsJ7771ndNvTpk1Dr169EBISgujoaCxbtgwtWrTA2bNn4ezsbLDMiBEj0Lx5cwDAL7/8gl9//VXv7yNHjtT1Jx83bhzu3LmD5cuX4+zZszh+/DgsLS2NlkN+xMXF6fYtO41Ggy5duiA8PBwjRoxAjRo1cPHiRSxatAg3btzAjh078rWdtWvX4rPPPsNXX32Fvn37Gp3HVHksWbIEXbp0Qb9+/ZCRkYGffvoJ77zzDnbv3o1OnTrp5ps5cyZmzJiBJk2aYNasWbCyssKJEydw6NAhtG/fHkBWMjJ06FDUqlULU6ZMgbOzM86ePYt9+/bp4tOW/euvv47Q0FBERUVhyZIlOH78uMExdXNzw6JFiwAADx8+xJIlS9CxY0c8ePDA6LHPTq1WIygoCI0bN8bChQsRFhaGr776Cn5+fnrnmjn7P3LkSLRt21Zv/fv27cOmTZt0fWoTEhLw/fffo0+fPhg+fDgSExOxZs0aBAUF4eTJk6hfv77BsUtLS8OIESOgVCrh4uKCixcvon379nB3d8eMGTOgUqkwffp0o883hISEYP369ejZsycmTJiAEydOIDQ0FFevXjU4xuaYMWMGQkNDERISgkaNGiEhIQGnT5/GmTNn0K5dO5PLL1q0CG5ubkhISMAPP/yA4cOHo1KlSgblll1UVBSaNGmClJQUjBs3Dq6urli/fj26dOmCbdu2oVu3bgCA2NhYhIeHIzw8HEOHDkVAQAD++OMPTJkyBXfv3sWqVasAAAMGDMDw4cNx6dIlvYvWqVOncOPGDXz22Wf5KpP8HlOtjIwMtG3bFteuXcN7772HatWqYceOHRgxYgRiY2PxySef5CuO3ERFReGNN97QJYnu7u7Yu3cvhg0bhoSEBIwfPz7XZY8ePYrff/89X9ubOnUqatSogdTUVF1jioeHB4YNG1bgfYiNjcXp06dhYWGBMWPGwM/Pz2hZCSHQpUsXHD58GMOGDUP9+vWxf/9+fPzxx4iMjNTVE1p//vkntmzZgnHjxkGpVGLFihUIDg7GyZMn9c6N7M6fP4+uXbuiY8eO+Oabb3TTX6Sctdq1a4eBAwfqTfvqq6/0Gmvys4/m1MfZaY8dAKxevRr379/X/e1Fr0uFXadv2LABgwYNQlBQEObPn4+UlBSsXLkSzZo1w9mzZ/USRbVajeDgYLzxxhv48ssvsW/fPkyfPh0qlQqzZs3SzTds2DCsW7cOHTp0QEhICFQqFY4dO4Z//vlHd6c/P/mGtbU19uzZg6dPn+quAdrPhbW1tdFyyk/elbOsgKz6zZTPP/8caWlpJuczxdj5qq3vClovGiXyYe3atQKACAsLE9HR0eLBgwfip59+Eq6ursLGxkY8fPhQCCFEWlqaUKvVesveuXNHKJVKMWvWLN20H374QQAQX3/9tc
G2NBqNbjkAYsGCBQbz1KpVSwQGBureHz58WAAQ5cqVEwkJCbrpP//8swAglixZolt31apVRVBQkG47QgiRkpIifH1
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "from imblearn.over_sampling import ADASYN\n",
    "\n",
    "# Create an ADASYN instance\n",
    "ada = ADASYN()\n",
    "\n",
    "# Apply ADASYN to the training set only\n",
    "X_resampled, y_resampled = ada.fit_resample(train_data.drop(columns=['hazardous']), train_data['hazardous'])\n",
    "\n",
    "# Build a new DataFrame\n",
    "df_train_adasyn = pd.DataFrame(X_resampled)\n",
    "df_train_adasyn['hazardous'] = y_resampled  # Add the target variable\n",
    "\n",
    "# Print information about the new sample\n",
    "print(\"Training set after oversampling: \", df_train_adasyn.shape)\n",
    "print(df_train_adasyn['hazardous'].value_counts())\n",
    "hazardous_counts = df_train_adasyn['hazardous'].value_counts()\n",
    "plt.figure(figsize=(2, 2))\n",
    "plt.pie(hazardous_counts, labels=hazardous_counts.index, autopct='%1.1f%%', startangle=90)\n",
    "plt.title('Distribution of hazardous classes in the training set after oversampling')\n",
    "plt.show()"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data augmentation by undersampling\n",
    "\n",
    "We also balance the data by undersampling. This method balances the sample by reducing the number of majority-class instances until it matches the number of minority-class instances."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set after undersampling: (12372, 7)\n",
"hazardous\n",
"False 6186\n",
"True 6186\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAtoAAADECAYAAAChrYbxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABC5ElEQVR4nO3deXwM5x8H8M/uRjZ3InK6I0HqJkWpiNYRV1NHaZ11BC2qlKqjjrhCaVGK6kGKn2pQV+sK6j6r7itIHEEkInc2ye4+vz/S3Wazm2RzWYnP+/XaFzuZ4zvPzDzz3ZlnnpEIIQSIiIiIiKhYSU0dABERERFRWcREm4iIiIioBDDRJiIiIiIqAUy0iYiIiIhKABNtIiIiIqISwESbiIiIiKgEMNEmIiIiIioBTLSJiIiIiEqAmakDICJ6FWRkZCAuLg5qtRoVK1Y0dThUjBQKBeLi4mBmZgYXFxdTh0NELxFe0SZ6CQwaNAg2NjamDqPYzJw5ExKJxNRhmNy5c+fQt29fODk5QS6Xw93dHT179jR1WKXG8uXLER8fr/2+ZMkSpKSkmC6gbMLCwhAQEAAHBwdYWlqiUqVK+PTTT00dFhG9ZAp0RXvt2rUYPHiw9rtcLkfVqlXRoUMHTJs2Da6ursUeIBFRabR9+3a8//778Pb2xty5c+Hp6QkAvOJZADt37sTt27cxfvx4HDlyBNOmTcOYMWNMHRZWrFiBTz75BK1atcLSpUtRqVIlAEC1atVMHBkRvWwK1XRk1qxZ8PDwgEKhwLFjx7By5Ur8+eefuHLlCqysrIo7RiKiUiUuLg6BgYHw9/dHaGgozM3NTR1SqTRlyhQEBARg6dKlkEql+PrrryGVmvZGbHh4OD777DMMHz4cK1as4J0bIspToRLtTp064fXXXwcABAYGokKFCvjmm2+wfft29OnTp1gDJKKXj1KphFqtZgKZizVr1kChUGDt2rUsoyLw8/PDvXv3cP36dVSpUgWVK1c2dUj49ttv4ebmhm+//ZZJNhHlq1guDbz99tsAgIiICABZV3MmTJiA+vXrw8bGBnZ2dujUqRMuXryoN61CocDMmTNRq1YtWFhYwN3dHT169MCdO3cAAJGRkZBIJLl+2rRpo53XX3/9BYlEgk2bNmHKlClwc3ODtbU1AgIC8ODBA71lnz59Gh07doS9vT2srKzg5+eH48ePG1zHNm3aGFz+zJkz9cZdv349fHx8YGlpCUdHR3zwwQcGl5/XumWnVquxZMkS1K1bFxYWFnB1dcWIESPw/PlznfGqV6+Orl276i1n9OjRevM0FPvChQv1yhQA0tPTMWPGDHh5eUEul6NKlSqYOHEi0tPTDZZVdm3atNGb39y5cyGVSvG///2vUOWxaNEitGzZEhUqVIClpSV8fHywefNmg8tfv349mjVrBisrK5QvXx6tW7fGvn37dMbZvXs3/Pz8YGtrCzs7OzRt2lQvttDQUO02dXJyQv/+/REVFaUzzqBBg3RiLl++PNq0aYOjR4/mW04aUVFR6NatG2xsbODs7IwJEyZApVIVeP1zxmJon83IyMD06dPh4+MDe3t7WFtbw9fXF4cOHdKZl2a7LFq0CEuWLIGnpyfkcjmuXbsGADh27BiaNm0KCwsLeHp64vvvvze4bkqlErNnz9ZOX716dUyZMkVvP8rtuKpevToGDRqk/Z6ZmYmgoCDUrFkTFhYWqFChAlq1aoX9+/fnWcZr167VKQ8rKyvUr18fP/74Y57Tady9exe9evWCo6MjrKys8MYbb+CPP/7QGefUqVNo1KgR5s2bhypVqkAul6NmzZqYP38+1Gq1djw/Pz80bNjQ4HJq164Nf39/nZgjIyN1xsl5fBm7TQH9cn7y5AkGDhwIZ2dnyOVy1KtXDz/88IPONNn3hezq1aund5wvWrTIYMxRUVEYMmQIXF1dIZfLUbduXfz8888642jq8r/++gsODg5o0aIFKleujC5duu
S6fxiaXvORy+WoVasWgoODIYTQjqd5liA2NjbXeeXc706dOgUfHx+MHDlSuw6GygoAUlJSMH78eO0+ULt2bSxatEgnBiBrW4wePRobNmxA7dq1YWFhAR8fHxw5ckRnPEPPPhw6dAhyuRwfffSRznBjyjk3eZ1zq1evXqh1BIyrj3Nuu9yWW5Tz0tGjR9GrVy9UrVpVO+24ceOQlpamM15uz85s3rxZu38aW3Y5xzU2fmP3DSBrmw8dOhQVK1aEXC6Hh4cHPv74Y2RkZGjHiY+Px9ixY7Xby8vLCwsWLNCpl7Kfi7dt26azDIVCgfLly+vVA5p9M7fP2rVrjS6r7HWGsbmKZr8xlAvY2NjoHMM5zwHZPw8fPgQAXLp0CYMGDUKNGjVgYWEBNzc3DBkyBM+ePdObf36KpdcRTVJcoUIFAFknom3btqFXr17w8PBAdHQ0vv/+e/j5+eHatWvaJ+5VKhW6du2KAwcO4IMPPsCnn36KpKQk7N+/H1euXNG2aQSAPn36oHPnzjrLnTx5ssF45s6dC4lEgi+++AJPnz7FkiVL0K5dO1y4cAGWlpYAgIMHD6JTp07w8fHBjBkzIJVKsWbNGrz99ts4evQomjVrpjffypUrIzg4GACQnJyMjz/+2OCyp02bht69eyMwMBAxMTFYtmwZWrdujX/++QcODg560wwfPhy+vr4AgK1bt+L333/X+fuIESO07ePHjBmDiIgILF++HP/88w+OHz+OcuXKGSyHgoiPj9euW3ZqtRoBAQE4duwYhg8fjtdeew2XL1/G4sWLcevWLb2DMD9r1qzBl19+ia+//hp9+/Y1OE5+5bF06VIEBASgX79+yMjIwK+//opevXph165d6NKli3a8oKAgzJw5Ey1btsSsWbNgbm6O06dP4+DBg+jQoQOArANuyJAhqFu3LiZPngwHBwf8888/2LNnjzY+Tdk3bdoUwcHBiI6OxtKlS3H8+HG9berk5ITFixcDAB4+fIilS5eic+fOePDggcFtn51KpYK/vz+aN2+ORYsWISwsDF9//TU8PT119jVj1n/EiBFo166dzvz37NmDDRs2aNsIJyYm4scff0SfPn0wbNgwJCUl4aeffoK/vz/OnDmDRo0a6W07hUKB4cOHQy6Xw9HREZcvX0aHDh3g7OyMmTNnQqlUYsaMGQaf1wgMDERISAjee+89jB8/HqdPn0ZwcDCuX7+ut42NMXPmTAQHByMwMBDNmjVDYmIizp07h/Pnz6N9+/b5Tr948WI4OTkhMTERP//8M4YNG4bq1avrlVt20dHRaNmyJVJTUzFmzBhUqFABISEhCAgIwObNm9G9e3cAwLNnz3Ds2DEcO3YMQ4YMgY+PDw4cOIDJkycjMjISq1atAgAMGDAAw4YNw5UrV1CvXj3tcs6ePYtbt27hyy+/LFCZFHSbamRkZKBdu3a4ceMGPv74Y9SuXRvbtm3D8OHD8ezZM0yaNKlAceQmOjoab7zxhjZ5cHZ2xu7duzF06FAkJiZi7NixuU575MgR/PnnnwVa3pQpU/Daa68hLS1NewHGxcUFQ4cOLfQ6PHv2DOfOnYOZmRlGjRoFT09Pg2UlhEBAQAAOHTqEoUOHolGjRti7dy8+//xzREVFaesJjcOHD2PTpk0YM2YM5HI5VqxYgY4dO+LMmTM6+0Z2Fy9eRLdu3dC5c2d899132uFFKWeN9u3bY+DAgTrDvv76a50LPAVZR2Pq4+w02w4AVq9ejfv372v/VtTzUmhoKFJTU/Hxxx+jQoUKOHPmDJYtW4aHDx8iNDQ037LJT/ayO3v2LL799ludvxc0fmP2jUePHqFZs2aIj4/H8OHD4e3tjaioKGzevBmpqakwNzdHamoq/Pz8EBUVhREjRqBq1ao4ceIEJk+ejMePH2PJkiU6y7WwsMCaNWvQrVs37bCtW7dCoVDkuu4rV67U+XESERGB6dOn5zp+9+7d0aNHDwBZP4BWr16d67hA7rlKYWiaQW
fn6OgIANi/fz/u3r2LwYMHw83NDVevXsXq1atx9epVnDp1qmB3s0QBrFmzRgAQYWFhIiYmRjx48ED8+uuvokKFCsL
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "from imblearn.under_sampling import RandomUnderSampler\n",
    "\n",
    "# Create a RandomUnderSampler instance\n",
    "rus = RandomUnderSampler()\n",
    "\n",
    "# Apply RandomUnderSampler to the training set only\n",
    "X_resampled, y_resampled = rus.fit_resample(train_data.drop(columns=['hazardous']), train_data['hazardous'])\n",
    "\n",
    "# Build a new DataFrame\n",
    "df_train_undersampled = pd.DataFrame(X_resampled)\n",
    "df_train_undersampled['hazardous'] = y_resampled  # Add the target variable\n",
    "\n",
    "# Print information about the new sample\n",
    "print(\"Training set after undersampling: \", df_train_undersampled.shape)\n",
    "print(df_train_undersampled['hazardous'].value_counts())\n",
    "\n",
    "# Visualize the class distribution\n",
    "hazardous_counts = df_train_undersampled['hazardous'].value_counts()\n",
    "plt.figure(figsize=(2, 2))\n",
    "plt.pie(hazardous_counts, labels=hazardous_counts.index, autopct='%1.1f%%', startangle=90)\n",
    "plt.title('Distribution of hazardous classes in the training set after undersampling')\n",
    "plt.show()"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset 3. Pima Indians diabetes data\n",
    "\n",
    "[**Link**](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)\n",
    "\n",
    "**Problem domain**: This dataset records the presence of diabetes in women of the Pima indigenous people, based on medical measurements. Diabetes is a chronic disease that requires long-term treatment and significantly affects patients' quality of life.\n",
    "\n",
    "**Objects of observation**: Each row (record) in the dataset corresponds to one patient from the Pima group.\n",
    "\n",
    "**Object attributes:**\n",
    "- `Pregnancies` - number of pregnancies the patient has had.\n",
    "- `Glucose` - blood glucose level.\n",
    "- `BloodPressure` - diastolic blood pressure (mm Hg).\n",
    "- `SkinThickness` - triceps skin fold thickness (mm).\n",
    "- `Insulin` - serum insulin level (μU/mL).\n",
    "- `BMI` - body mass index (weight in kg divided by the square of height in m).\n",
    "- `DiabetesPedigreeFunction` - score of hereditary predisposition to diabetes.\n",
    "- `Age` - patient's age.\n",
    "- `Outcome` - target feature indicating the presence (1) or absence (0) of diabetes.\n",
    "\n",
    "**Business goal**: Optimizing insurance offers. Insurance companies can offer individualized rates based on the probability that a patient develops diabetes, which reduces risk and makes insurance more affordable.\n",
    "\n",
    "**Technical goal**: Developing a predictive model that classifies patients by risk. Dynamic offers for clients can then be built on this risk estimate.\n",
    "\n",
    "**Input data**: Patient data.\n",
    "\n",
    "**Target variable**: Diabetes diagnosis (`Outcome`)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pregnancies</th>\n",
" <th>Glucose</th>\n",
" <th>BloodPressure</th>\n",
" <th>SkinThickness</th>\n",
" <th>Insulin</th>\n",
" <th>BMI</th>\n",
" <th>DiabetesPedigreeFunction</th>\n",
" <th>Age</th>\n",
" <th>Outcome</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6</td>\n",
" <td>148</td>\n",
" <td>72</td>\n",
" <td>35</td>\n",
" <td>0</td>\n",
" <td>33.6</td>\n",
" <td>0.627</td>\n",
" <td>50</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>85</td>\n",
" <td>66</td>\n",
" <td>29</td>\n",
" <td>0</td>\n",
" <td>26.6</td>\n",
" <td>0.351</td>\n",
" <td>31</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8</td>\n",
" <td>183</td>\n",
" <td>64</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>23.3</td>\n",
" <td>0.672</td>\n",
" <td>32</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>89</td>\n",
" <td>66</td>\n",
" <td>23</td>\n",
" <td>94</td>\n",
" <td>28.1</td>\n",
" <td>0.167</td>\n",
" <td>21</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>137</td>\n",
" <td>40</td>\n",
" <td>35</td>\n",
" <td>168</td>\n",
" <td>43.1</td>\n",
" <td>2.288</td>\n",
" <td>33</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
"0 6 148 72 35 0 33.6 \n",
"1 1 85 66 29 0 26.6 \n",
"2 8 183 64 0 0 23.3 \n",
"3 1 89 66 23 94 28.1 \n",
"4 0 137 40 35 168 43.1 \n",
"\n",
" DiabetesPedigreeFunction Age Outcome \n",
"0 0.627 50 1 \n",
"1 0.351 31 0 \n",
"2 0.672 32 1 \n",
"3 0.167 21 0 \n",
"4 2.288 33 1 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\".//static//csv//diabetes.csv\")\n",
"df.head()"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Checking for missing data"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pregnancies 0\n",
"Glucose 0\n",
"BloodPressure 0\n",
"SkinThickness 0\n",
"Insulin 0\n",
"BMI 0\n",
"DiabetesPedigreeFunction 0\n",
"Age 0\n",
"Outcome 0\n",
"dtype: int64\n",
"\n",
"Pregnancies False\n",
"Glucose False\n",
"BloodPressure False\n",
"SkinThickness False\n",
"Insulin False\n",
"BMI False\n",
"DiabetesPedigreeFunction False\n",
"Age False\n",
"Outcome False\n",
"dtype: bool\n",
"\n"
]
}
   ],
   "source": [
    "# Number of missing values per feature\n",
    "print(df.isnull().sum())\n",
    "\n",
    "print()\n",
    "\n",
    "# Whether each feature has any missing values\n",
    "print(df.isnull().any())\n",
    "\n",
    "print()\n",
    "\n",
    "# Percentage of missing values per feature (printed only when non-zero)\n",
    "for i in df.columns:\n",
    "    null_rate = df[i].isnull().sum() / len(df) * 100\n",
    "    if null_rate > 0:\n",
    "        print(f\"{i} missing values: {null_rate:.2f}%\")"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**No** missing values (NaN) were found in the dataset."
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Checking the dataset for outliers"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of outliers in column 'Pregnancies': 4\n",
      "Number of outliers in column 'Glucose': 5\n",
      "Number of outliers in column 'BloodPressure': 45\n",
      "Number of outliers in column 'SkinThickness': 1\n",
      "Number of outliers in column 'Insulin': 34\n",
      "Number of outliers in column 'BMI': 19\n",
      "Number of outliers in column 'DiabetesPedigreeFunction': 29\n",
      "Number of outliers in column 'Age': 9\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABdAAAAPeCAYAAAAMETjbAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3wUdfoH8M9sTd1Nb5CEBJAqvQoCSgcVFE9RFPCH4inoIZ56eIKKnBxWTo6D4wpFRTwbp5zSEUQQAQWlE1oCIb1s2iZb5vfHZgcWNqTt7mz5vF+vfWlmZmee3YTv7D7zzPMVRFEUQUREREREREREREREDhRyB0BERERERERERERE5I2YQCciIiIiIiIiIiIicoIJdCIiIiIiIiIiIiIiJ5hAJyIiIiIiIiIiIiJyggl0IiIiIiIiIiIiIiInmEAnIiIiIiIiIiIiInKCCXQiIiIiIiIiIiIiIieYQCciIiIiIiIiIiIicoIJdCIiIiIiIiIiIiIiJ5hA93OtWrXC1KlT5Q7D77355ptIT0+HUqlEt27d5A7Hr3z77bcQBAHffvut3KEQ+TyeEzzDneeEIUOGYMiQIS7dJxG5DsdZz3DHOCvnZ85XXnkFgiB4/LhEdGMc0z3D1WO6fTz/9NNPb7jdqlWrIAgCzp8/3+xj1uX8+fMQBAFvvfVWvds29VzQqlUr3HHHHU0JjxqBCXQfYv/HfeDAAafrhwwZgs6dOzf7OF9//TVeeeWVZu8nUGzevBnPP/88BgwYgJUrV+L111+vc9upU6dCEATpodPp0LVrV7z99tuorq72YNRE5Ot4TvBOjTkn2H333Xe477770KJFC2g0Guj1evTt2xfz589Hbm6uB6ImImc4znqn5nz2VqlUSE5OxsSJE3Hs2DEPRt14rVq1cog9Li4Ot956K7744gu5QyPySRzTvVNjPzt/9dVXGDx4MOLi4hASEoL09HTcd9992Lhxo4civn58ruuxatUqj8VE7qeSOwByr5MnT0KhaNx1kq+//hpLly7loN9A27dvh0KhwL/+9S9oNJp6t9dqtfjnP/8JACgpKcFnn32G3//+99i/fz/WrVvn7nB9zqBBg1BVVdWg95aIboznBPdr7Dlh3rx5eO2115Ceno6pU6ciPT0dRqMRBw8exNtvv43Vq1fjzJkzHoiciFyB46z7Neezt9lsxpkzZ7B8+XJs3LgRx44dQ1JSkrtDbrJu3brh2WefBQBkZ2fj73//O+655x4sW7YMv/3tb2WOjsj/cUx3v8aM6W+99Raee+45DB48GHPmzEFISAgyMjKwdetWrFu3DqNGjWrUsR9++GFMnDgRWq22Uc9bvHgxysvLpZ+//vprfPTRR3j33XcRExMjLb/lllsatd+XXnoJf/jDHxr1HPIcJtD9XGMHAm9QUVGB0NBQucNosLy8PAQHBzc4watSqfDQQw9JPz/55JPo27cvPv74Y7zzzjtOP8SLogij0Yjg4GCXxe0rFAoFgoKC5A6DyC/wnOB+jTknfPzxx3jttddw33334f3337/uOe+++y7effddd4VKRG7Acdb9mvvZGwD69euHO+64A//73//w2GOPuSNMl2jRooVD7JMnT0abNm3w7rvv1plAN5vNsFqtPlV84mt/gxQ4OKa7X0PHdLPZjNdeew3Dhw/H5s2bne6nsZRKJZRKZaOfN378eIefc3Jy8NFHH2H8+PFo1aqVw7rGtIdRqVRQqZim9VZs4eLnru3ZZTKZ8Oqrr6Jt27YICgpCdHQ0Bg4ciC1btgCw3ea4dOlSAHC49cSuoqICzz77LJKTk6HVatGuXTu89dZbEEXR4bhVVVV4+umnERMTg/DwcNx11124dOkSBEFwuBJr7/F07NgxPPjgg4iMjMTAgQMBAL/88otUjRcUFISEhAT83//9HwoLCx2OZd/HqVOn8NBDD0Gv1yM2NhZz586FKIrIysrCuHHjoNPpkJCQgLfffr
tB7519gG7dujW0Wi1atWqFF1980aHViiAIWLlyJSoqKpp8m45CoZD62doHV3sPq02bNqFXr14IDg7G3//+dwC2qvVZs2ZJv4M2bdpg0aJFsFqtDvstLCzEww8/DJ1Oh4iICEyZMgWHDx++LsapU6ciLCwMly5dwvjx4xEWFobY2Fj8/ve/h8VicdjnW2+9hVtuuQXR0dEIDg5Gz549nfYVEwQBM2fOxPr169G5c2dotVp06tTJ6W1Vly5dwrRp05CUlAStVou0tDQ88cQTqKmpAVB3P8p9+/Zh1KhR0Ov1CAkJweDBg/H99987bFNWVoZZs2ahVatW0Gq1iIuLw/Dhw/HTTz/V+3sh8kc8J3jXOWHevHmIiYmps+JGr9fXW71UV+/GG42dY8aMQWRkJEJDQ9GlSxf85S9/cdhm+/btuPXWWxEaGoqIiAiMGzcOx48fd9imoeNrQ8ZqIn/Ccda7xtm6JCQkAECDEhWffPIJevbsieDgYMTExOChhx7CpUuXrtuuIWMnAOzevRu9e/dGUFAQWrduLX3Gb2jcHTp0wLlz5wA49tZdvHix9N7Z29OcOHEC9957L6KiohAUFIRevXrhyy+/dNhnfX+jgC059Mgjj6Bly5bQarVITEzEuHHjHM491/6t2V37b8J+3tq5cyeefPJJxMXFoWXLltL6b775Rnofw8PDMXbsWBw9erTB7xGRK3FM954xvaCgAAaDAQMGDHC6Pi4u7obxVFdX44477oBer8eePXsAOP8cbc/F7N69G3369EFQUBDS09OxZs2aBr3uG1mxYoX0fvTu3Rv79+93WF9XD/QPPvgAffr0QUhICCIjIzFo0CCnFxGutnr1aqhUKjz33HMAHM8X9cUBePb84St4acMHlZaWoqCg4LrlJpOp3ue+8sorWLhwIR599FH06dMHBoMBBw4cwE8//YThw4fj8ccfR3Z2NrZs2YL333/f4bmiKOKuu+7Cjh07MG3aNHTr1g2bNm3Cc889h0uXLjlUyU2dOhX/+c9/8PDDD6Nfv37YuXMnxo4dW2dcv/nNb9C2bVu8/vrr0sljy5YtOHv2LB555BEkJCTg6NGjWLFiBY4ePYoffvjhuoHl/vvvR4cOHfDnP/8Z//vf/7BgwQJERUXh73//O26//XYsWrQIH374IX7/+9+jd+/eGDRo0A3fq0cffRSrV6/Gvffei2effRb79u3DwoULcfz4can34Pvvv48VK1bgxx9/lG4NbextOgCk2/Ojo6OlZSdPnsQDDzyAxx9/HI899hjatWuHyspKDB48GJcuXcLjjz+OlJQU7NmzB3PmzMHly5exePFiAIDVasWdd96JH3/8EU888QTat2+P//73v5gyZYrT41ssFowcORJ9+/bFW2+9ha1bt+Ltt99G69at8cQTT0jb/eUvf8Fdd92FSZMmoaamBuvWrcNvfvMbbNiw4brf7+7du/H555/jySefRHh4ON577z1MmDABmZmZ0uvMzs5Gnz59UFJSgunTp6N9+/a4dOkSPv30U1RWVtZ5FXr79u0YPXo0evbsiZdffhkKhQIrV67E7bffju+++w59+vQBAPz2t7/Fp59+ipkzZ6Jjx44oLCzE7t27cfz4cfTo0aPRvycib8Rzgm+eE06dOoVTp07h0UcfRVhY2A2P7SpbtmzBHXfcgcTERPzud79DQkICjh8/jg0bNuB3v/sdAGDr1q0YPXo00tPT8corr6CqqgpLlizBgAED8NNPP0lVNQ0ZXxs6VhN5O46zvjnOXs3++7NYLDh79ixeeOEFREdH1zvp2qpVq/DII4+gd+/eWLhwIXJzc/GXv/wF33//PX7++WdEREQAaPjY+euvv2LEiBGIjY3FK6+8ArPZjJdffhnx8fH1vgbA9jeXlZXl8J0BAFauXAmj0Yjp06dDq9UiKioKR48exYABA9CiRQv84Q9/QGhoKP7zn/9g/Pjx+Oyzz3D33XcDqP9vFAAmTJiAo0eP4qmnnk
KrVq2Ql5eHLVu2IDMz87pqy4Z68sknERsbi3nz5qGiogKA7fc7ZcoUjBw5EosWLUJlZSWWLVuGgQMH4ueff27ysYi
"text/plain": [
"<Figure size 1500x1000 with 8 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Columns to analyze\n",
    "columns_to_check = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']\n",
    "\n",
    "# Count outliers per column using the 1.5 * IQR rule\n",
    "def count_outliers(df, columns):\n",
    "    outliers_count = {}\n",
    "    for col in columns:\n",
    "        Q1 = df[col].quantile(0.25)\n",
    "        Q3 = df[col].quantile(0.75)\n",
    "        IQR = Q3 - Q1\n",
    "        lower_bound = Q1 - 1.5 * IQR\n",
    "        upper_bound = Q3 + 1.5 * IQR\n",
    "\n",
    "        # Count the rows that fall outside the bounds\n",
    "        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
    "        outliers_count[col] = len(outliers)\n",
    "\n",
    "    return outliers_count\n",
    "\n",
    "# Count the outliers\n",
    "outliers_count = count_outliers(df, columns_to_check)\n",
    "\n",
    "# Print the number of outliers for each column\n",
    "for col, count in outliers_count.items():\n",
    "    print(f\"Number of outliers in column '{col}': {count}\")\n",
    "\n",
    "# Plot histograms\n",
    "plt.figure(figsize=(15, 10))\n",
    "for i, col in enumerate(columns_to_check, 1):\n",
    "    plt.subplot(2, 4, i)\n",
    "    sns.histplot(df[col], kde=True)\n",
    "    plt.title(f'Histogram of {col}')\n",
    "plt.tight_layout()\n",
    "plt.show()"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of outliers per feature is within acceptable limits (at most 45, for `BloodPressure`), so no outlier smoothing is required."
]
},
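  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No outlier treatment is applied here. If it were needed, one common option is to clip values to the same IQR bounds that `count_outliers` uses. A minimal sketch, assuming a hypothetical helper `clip_outliers_iqr` and a toy DataFrame `demo` (both are illustrations, not part of the original analysis):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical helper: returns a copy with values outside\n",
    "# [Q1 - k * IQR, Q3 + k * IQR] replaced by the nearest bound\n",
    "def clip_outliers_iqr(frame, columns, k=1.5):\n",
    "    clipped = frame.copy()\n",
    "    for col in columns:\n",
    "        q1 = clipped[col].quantile(0.25)\n",
    "        q3 = clipped[col].quantile(0.75)\n",
    "        iqr = q3 - q1\n",
    "        clipped[col] = clipped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)\n",
    "    return clipped\n",
    "\n",
    "# Toy column with one extreme value\n",
    "demo = pd.DataFrame({'Insulin': [10, 12, 11, 13, 12, 500]})\n",
    "print(clip_outliers_iqr(demo, ['Insulin']))\n",
    "```\n",
    "\n",
    "Clipping keeps every row and only limits the extreme values, which matters for a dataset as small as this one."
   ]
  },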
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splitting the dataset into training, validation, and test sets"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set: (537, 9)\n",
"Outcome\n",
"0 349\n",
"1 188\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAf8AAADECAYAAACROyhkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA3nUlEQVR4nO3dd1xT1/sH8E8SAmSwtwqCggu0Ko46ALci1tVK3Yq22trWWvutrd/Wbeu31bqtq9WiUvf6aR0odYtbUKooKCAiskEII5Cc3x80KSFhCl4gz/v1yku5OffcJ+fenOfOEx5jjIEQQggheoPPdQCEEEIIebMo+RNCCCF6hpI/IYQQomco+RNCCCF6hpI/IYQQomco+RNCCCF6hpI/IYQQomco+RNCCCF6hpI/IYS8BplMhvj4eGRkZHAdCqlh2dnZiI2NhUwm4zqUGkfJnxBCqmj//v3o27cvTExMIJVK4eTkhJ9++onrsOqFnJwcrF69Wv13ZmYmNmzYwF1AJTDGsGXLFrz99tsQi8UwNTWFi4sLdu3axXVoNa5Kyf/3338Hj8dTv4yNjdGiRQt8+umnSEpKqq0YCdEbV65cwYgRI2BnZwcjIyM4Oztj+vTpePbsWbXrzM3NxcKFC3H+/PmaC1SPffPNN/D394eJiQm2bt2KM2fO4OzZs5gxYwbXodULIpEI3333HYKCghAfH4+FCxfi2LFjXIcFABg7diw++ugjtG7dGjt37lSv25EjR3IdWs1jVbB9+3YGgC1evJjt3LmTbd26lU2aNInx+Xzm4uLCZDJZVaojhJSwdu1axuPxWPPmzdmSJUvYr7/+yr788ktmZmbGzMzM2JUrV6pVb0pKCgPAFixYULMB66Hz588zAGzZsmVch1KvrVixgvH5fAaAmZqaskuXLnEdEgsMDGQ8Ho8FBQVxHcobUa3kf/PmTY3ps2fPZgDYH3/8UaPBEaIvLl++zPh8PvPy8tLaiY6OjmZ2dnbMwcGBpaenV7luSv41Z8iQIax79+5ch9EgxMfHs6tXr7KMjAyuQ2GMMebh4cHGjh3LdRhvTI1c8+/Tpw8AICYmBgCQnp6O//znP2jbti2kUilMTU3h6+uL8PBwrXnz8/OxcOFCtGjRAsbGxnBwcMDIkSPx5MkTAEBsbKzGpYbSr169eqnrOn/+PHg8Hvbu3Yv//ve/sLe3h0QiwdChQxEfH6+17OvXr2PQoEEwMzODWCyGj48Prly5ovMz9urVS+fyFy5cqFV2165d8PT0hEgkgqWlJUaPHq1z+eV9tpKUSiVWr14Nd3d3GBsbw87ODtOnT9e6wcjZ2RlDhgzRWs6nn36qVaeu2JcvX67VpgBQUFCABQsWwNXVFUZGRnB0dMScOXNQUFCgs61K6tWrl1Z933//Pfh8Pv74449qtceKFSvQvXt3WFlZQSQSwdPTEwcOHNC5/F27dqFLly4Qi8WwsLCAt7c3goODNcqcPHkSPj4+MDExgampKTp37qwV2/79+9Xr1NraGuPHj0dCQoJGmcmTJ2vEbGFhgV69euHSpUsVttOSJUvA4/EQGBgIsVis8V7z5s3x008/ITExEZs3b1ZP19W2qjicnZ0BFLepjY0NAGDRokU6t9vIyEj4+/vDxsYGIpEILVu2xLfffqtR5927d+Hr6wtTU1NIpVL07dsX165d0yijuix4+fJlzJw5EzY2NjA3N8f06dMhl8uRmZmJiRMnwsLCAhYWFpgzZw5YqR8Vrey2rsvrtD8A/PXXX/Dy8oJEIoG5uTmGDRuGhw8fapS5du0aPDw8MHr0aFhaWkIkEqFz5844cuSIukxOTg4kEgk+//xzrWU8f/4cAoEAy5YtU8esWlcllV5HcXFxmDFjBlq2bAmRSAQrKyuMGjUKsbGxGvOp+sCSl3hu3ryJ/v37w8TEBBKJRGebqNbdrVu31NNSU1N19hNDhgzRGXNl+tOFCxeqv89NmjRBt27dYGBgAH
t7e624dVHNr3qZmJigS5cuGu0PFH83PDw8yqxH1df8/vvvAIpv2oyIiICjoyP8/PxgampaZlsBwNOnTzFq1ChYWlpCLBbj7bffxp9//qlRpir5qCr9ZFXyVnkMqjyHDqpEbWVlBaC4YY4cOYJRo0bBxcUFSUlJ2Lx5M3x8fPDgwQM0atQIAKBQKDBkyBCEhIRg9OjR+Pzzz5GdnY0zZ84gIiICzZs3Vy9jzJgxGDx4sMZy586dqzOe77//HjweD19//TWSk5OxevVq9OvXD2FhYRCJRACKv+i+vr7w9PTEggULwOfzsX37dvTp0weXLl1Cly5dtOpt0qSJ+kubk5ODjz/+WOey582bB39/f3zwwQdISUnBunXr4O3tjbt378Lc3FxrnmnTpsHLywsAcOjQIRw+fFjj/enTp+P3339HQEAAZs6ciZiYGKxfvx53797FlStXIBQKdbZDVWRmZqo/W0lKpRJDhw7F5cuXMW3aNLRu3Rr379/HqlWr8PjxY60vXUW2b9+O7777Dj///DPGjh2rs0xF7bFmzRoMHToU48aNg1wux549ezBq1CgcP34cfn5+6nKLFi3CwoUL0b17dyxevBiGhoa4fv06/vrrLwwYMABAcac3ZcoUuLu7Y+7cuTA3N8fdu3dx6tQpdXyqtu/cuTOWLVuGpKQkrFmzBleuXNFap9bW1li1ahWA4o5+zZo1GDx4MOLj43Wue6D4mnxISAi8vLzg4uKis8z777+PadOm4fjx4/jmm28qbuh/2NjYYOPGjfj4448xYsQI9bXLdu3aAQDu3bsHLy8vCIVCTJs2Dc7Oznjy5AmOHTuG77//HgDw999/w8vLC6amppgzZw6EQiE2b96MXr164cKFC+jatavGMj/77DPY29tj0aJFuHbtGrZs2QJzc3NcvXoVTk5O+OGHH3DixAksX74cHh4emDhxonre193Wq9P+AHD27Fn4+vqiWbNmWLhwIfLy8rBu3Tr06NEDd+7cUSe7tLQ0bNmyBVKpVL2Ds2vXLowcORJBQUEYM2YMpFIpRowYgb1792LlypUQCATq5ezevRuMMYwbN65yK/AfN2/exNWrVzF69Gg0adIEsbGx2LhxI3r16oUHDx5o7TCqREdHo1evXhCLxfjqq68gFouxdetW9OvXD2fOnIG3t3eV4ihLdfpTlZ9//rnK94zt3LkTQPEOyi+//IJRo0YhIiICLVu2rFb8aWlpAIAff/wR9vb2+Oqrr2BsbKyzrZKSktC9e3fk5uZi5syZsLKyQmBgIIYOHYoDBw5gxIgRGnVXJh+VVlY/+TrtrKUqpwlUp/3Pnj3LUlJSWHx8PNuzZw+zsrJiIpGIPX/+nDHGWH5+PlMoFBrzxsTEMCMjI7Z48WL1tG3btjEAbOXKlVrLUiqV6vkAsOXLl2uVcXd3Zz4+Puq/z507xwCwxo0bs1evXqmn79u3jwFga9asUdft5ubGBg4cqF4OY4zl5uYyFxcX1r9/f61lde/enXl4eKj/1nUqNTY2lgkEAvb9999rzHv//n1mYGCgNT0qKooBYIGBgeppCxYsYCVXy6VLlxgAretQp06d0pretGlT5ufnpxX7J598wkqv6tKxz5kzh9na2jJPT0+NNt25cyfj8/la1+Q2bdrEAFR4HdrHx0dd359//skMDAzYl19+qbNsZdqDseL1VJJcLmceHh6sT58+GnXx+Xw2YsQIrW1Rtc4zMzOZiYkJ69q1K8vLy9NZRi6XM1tbW+bh4aFR5vjx4wwAmz9/vnrapEmTWNOmTTXq2bJlCwPAbty4ofMzM8ZYWFgYA8A+//zzMsswxli7du2YpaWl+u+SbVtS6TjKO+3v7e3NTExMWFxcnMb0kt+L4cOHM0NDQ/bkyRP1tBcvXjATExPm7e2tnqbqH0p/r7p168Z4PB776KOP1NOKiopYkyZNNOKvyrauS3XbnzHG2rdvz2xtbVlaWpp6Wnh4OOPz+WzixInqaQAYAHb+/H
n1tNzcXNa6dWtmb2/P5HI5Y4yx06dPMwDs5MmTGstp166dxmcOCAhgTk5OWvGUXl+lt3nGGAsNDWUA2I4dO9TTVH3
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Контрольная выборка: (115, 9)\n",
"Outcome\n",
"0 78\n",
"1 37\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAgsAAADECAYAAAARfmKGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA4d0lEQVR4nO3dZ3xTZd8H8F+StummLZ1AoaXssisgewqUKrJRkI0UEBBBeQTFMkRUvBEERETZKHvcIpaN7E0ZUqCUllXopKV0psn1vOid2JA0HbScjt/388mLnFznOv9zcs7JP9dIZEIIASIiIqJcyKUOgIiIiEo2JgtERERkEpMFIiIiMonJAhEREZnEZIGIiIhMYrJAREREJjFZICIiIpOYLBAREZFJTBaIiIiKgEajQVxcHO7evSt1KEWOyQIREZVoJ06cwNGjR3XPjx49ipMnT0oXUA5PnjzB5MmTUa1aNVhYWMDFxQX16tXDs2fPpA6tSBUoWVizZg1kMpnuYWlpiVq1amHChAmIjo4urhiJyo2TJ0+id+/ecHNzg1KphJeXFwIDA3H//v1C15mamopZs2bp3WyJSpMHDx5g/PjxuHbtGq5du4bx48fjwYMHUoeFO3fuoFmzZti0aRMCAwOxZ88eHDhwAIcOHYKNjY3U4RUps8KsNGfOHHh7eyM9PR0nTpzA8uXLsXfvXly/fh3W1tZFHSNRubBkyRJ8+OGHqF69OiZOnAgPDw+Ehobil19+webNm7F37160atWqwPWmpqZi9uzZAIAOHToUcdRExa9Pnz5YtGgRGjZsCABo2bIl+vTpI3FUQGBgICwsLHDmzBlUrlxZ6nCKVaGSBX9/f7z22msAgNGjR6NixYpYuHAhdu/ejXfffbdIAyQqD06ePInJkyejTZs2CA4O1ku6x40bh9atW6Nfv374559/4OjoKGGkRK+eUqnEqVOncP36dQBA/fr1oVAoJI3p4sWLOHz4MPbv31/mEwWgiMYsdOrUCQAQEREBAEhISMDHH3+MBg0awNbWFvb29vD398eVK1cM1k1PT8esWbNQq1YtWFpawsPDA3369EF4eDgAIDIyUq/r48VHzm9KR48ehUwmw+bNmzFjxgy4u7vDxsYGPXv2NNpkdfbsWXTv3h0VKlSAtbU12rdvn2s/WIcOHYxuf9asWQZlN2zYAD8/P1hZWcHJyQnvvPOO0e2b2recNBoNFi1aBF9fX1haWsLNzQ2BgYF4+vSpXjkvLy+8+eabBtuZMGGCQZ3GYl+wYIHBMQWAjIwMBAUFoUaNGlAqlfD09MS0adOQkZFh9Fjl1KFDB4P65s2bB7lcjt9++61Qx+O7775Dq1atULFiRVhZWcHPzw/btm0zuv0NGzagefPmsLa2hqOjI9q1a4f9+/frlfnrr7/Qvn172NnZwd7eHs2aNTOIbevWrbr31NnZGe+99x4ePXqkV2b48OF6MTs6OqJDhw44fvx4nsdp7ty5kMlkWLt2rUHrnI+PD7799ls8fvwYK1as0C03dmy1cXh5eQHIPqYuLi4AgNmzZxs9b2/evIkBAwbAxcUFVlZWqF27Nj777DO9Oi9fvgx/f3/Y29vD1tYWnTt3xpkzZ/TKaLspT5w4gUmTJsHFxQUODg4IDAxEZmYmEhMTMXToUDg6OsLR0RHTpk3Di396m99z3ZiXOf45j5nWhg0bIJfL8fXXX+stP3z4MNq2bQsbGxs4ODjg7bffRmhoqF6ZWbNmQSaTIS4uTm/5hQsXIJPJsGbNGqMxG3tERkYC+Pf63r9/Pxo3bgxLS0vUq1cPO3bsMNifu3fvon///nBycoK1tTVef/11/Pnnn/k6bsbOkeHDh8PW1jbP41iQe1BWVhbmzp0LHx8fXZfbjBkzDO4rXl5eGD58OBQKBRo1aoRGjRphx44dkMlkBu9ZbjFp90kul8Pd3R0DBw7U69rT3nu+++67XOvRvqdaZ86cgaWlJcLDw+Hr6w
ulUgl3d3cEBgYiISHBYP383kNsbW1x9+5ddOvWDTY2NqhUqRLmzJmjd61o49WeRwCQnJwMPz8/eHt74/Hjx7rlL3NN5VSoloUXaT/YK1asCCD7RN21axf69+8Pb29vREdHY8WKFWjfvj1u3LiBSpUqAQDUajXefPNNHDp0CO+88w4+/PBDJCcn48CBA7h+/Tp8fHx023j33XfRo0cPve1Onz7daDzz5s2DTCbD//3f/yEmJgaLFi1Cly5dEBISAisrKwDZF7y/vz/8/PwQFBQEuVyO1atXo1OnTjh+/DiaN29uUG+VKlUwf/58AMDz588xbtw4o9ueOXMmBgwYgNGjRyM2NhZLlixBu3btcPnyZTg4OBisM2bMGLRt2xYAsGPHDuzcuVPv9cDAQKxZswYjRozApEmTEBERgaVLl+Ly5cs4efIkzM3NjR6HgkhMTNTtW04ajQY9e/bEiRMnMGbMGNStWxfXrl3D999/j9u3b2PXrl0F2s7q1avx+eef4z//+Q8GDRpktExex2Px4sXo2bMnBg8ejMzMTGzatAn9+/fHnj17EBAQoCs3e/ZszJo1C61atcKcOXNgYWGBs2fP4vDhw+jatSuA7A+4kSNHwtfXF9OnT4eDgwMuX76M4OBgXXzaY9+sWTPMnz8f0dHRWLx4MU6ePGnwnjo7O+P7778HADx8+BCLFy9Gjx498ODBA6PvPZDdTXDo0CG0bdsW3t7eRssMHDgQY8aMwZ49e/Dpp5/mfaD/x8XFBcuXL8e4cePQu3dvXdOttjn36tWraNu2LczNzTFmzBh4eXkhPDwcf/zxB+bNmwcA+Oeff9C2bVvY29tj2rRpMDc3x4oVK9ChQwf8/fffaNGihd42J06cCHd3d8yePRtnzpzBzz//DAcHB5w6dQpVq1bFV199hb1792LBggWoX78+hg4dqlv3Zc/1whx/Y/bv34+RI0diwoQJesf74MGD8Pf3R/Xq1TFr1iykpaVhyZIlaN26NS5dupSvD6+cAgMD0aVLF93zIUOG6L1PAHTJHgCEhYVh4MCBGDt2LIYNG4bVq1ejf//+CA4OxhtvvAEAiI6ORqtWrZCamopJkyahYsWKWLt2LXr27Ilt27ahd+/eBnHkPG7aOIrb6NGjsXbtWvTr1w9Tp07F2bNnMX/+fISGhhpc8zllZWUZJLN5adu2LcaMGQONRoPr169j0aJFiIqKylcimZv4+Hikp6dj3Lhx6NSpE8aOHYvw8HAsW7YMZ8+exdmzZ6FUKgEU7B6iVqvRvXt3vP766/j2228RHByMoKAgZGVlYc6cOUZjUalU6Nu3L+7fv4+TJ0/Cw8ND91qRfX6IAli9erUAIA4ePChiY2PFgwcPxKZNm0TFihWFlZWVePjwoRBCiPT0dKFWq/XWjYiIEEqlUsyZM0e3bNWqVQKAWLhwocG2NBqNbj0AYsGCBQZlfH19Rfv27XXPjxw5IgCIypUri2fPnumWb9myRQAQixcv1tVds2ZN0a1bN912hBAiNTVVeHt7izfeeMNgW61atRL169fXPY+NjRUARFBQkG5ZZGSkUCgUYt68eXrrXrt2TZiZmRksDwsLEwDE2rVrdcuCgoJEzrfl+PHjAoDYuHGj3rrBwcEGy6tVqyYCAgIMYv/ggw/Ei2/1i7FPmzZNuLq6Cj8/P71jun79eiGXy8Xx48f11v/pp58EAHHy5EmD7eXUvn17XX1//vmnMDMzE1OnTjVaNj/HQ4js9ymnzMxMUb9+fdGpUye9uuRyuejdu7fBuah9zxMTE4WdnZ1o0aKFSEtLM1omMzNTuLq6ivr16+uV2bNnjwAgvvjiC92yYcOGiWrVqunV8/PPPwsA4ty5c0b3WQghQkJCBADx4Ycf5lpGCCEaNmwonJycdM9zHtucXozD2Lmq1a5dO2FnZyfu3buntzznddGrVy9hYWEhwsPDdcuioqKEnZ2daNeunW6Z9v7w4nXVsmVLIZPJxNixY3XLsrKyRJ
UqVfTiL8i5bkxhj/+L6164cEHY2tqK/v37G5w7jRs3Fq6uriI+Pl637MqVK0Iul4uhQ4fqlmnP29jYWL31z58/LwC
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Тестовая выборка: (116, 9)\n",
"Outcome\n",
"0 73\n",
"1 43\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAekAAADECAYAAAC7i9nLAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA0/klEQVR4nO3dd1xT1/sH8E8SEvbeKgqCgIo4cPwUGVoVcVWtpe6taNXaar+0trXu2tZW69a2rmqHq2KtuEfdk+FCRAUHsgVkJyTn9wdNyiUBAYGE5Hm/Xr5abk7OeXJyc5+cc8+94THGGAghhBCicfjqDoAQQgghqlGSJoQQQjQUJWlCCCFEQ1GSJoQQQjQUJWlCCCFEQ1GSJoQQQjQUJWlCCCFEQ1GSJoQQQjQUJWlCCCENGmMML1++RHx8vLpDqXWUpAkhhKh0584dhIeHK/6Ojo7G4cOH1RdQGbm5ufjiiy/g4eEBkUgEa2truLu7Iy4uTt2h1apqJent27eDx+Mp/hkYGMDd3R0zZ85EampqXcVIiM64ePEihgwZAnt7e+jr68PZ2RmhoaF4+vRpjessKCjAwoULcfbs2doLlOiE3NxchIaG4sqVK4iPj8fs2bNx+/ZtdYeFzMxMdO3aFWvWrMGwYcNw8OBBnDhxAmfPnoWzs7O6w6tVejV50uLFi+Hi4oKioiJcuHABGzduREREBO7cuQMjI6PajpEQnbB27VrMnj0bzZs3x6xZs+Do6IjY2Fj8/PPP2L17NyIiItCtW7dq11tQUIBFixYBAAIDA2s5aqLNunbtqvgHAO7u7pgyZYqaowL+97//ITk5GZcvX0br1q3VHU6dqlGSDg4ORseOHQEAkydPhrW1NVauXImDBw9ixIgRtRogIbrg4sWL+PDDD9G9e3ccPXqU82V3+vTp8PX1xbBhw3D37l1YWlqqMVKia8LDw3Hv3j0UFhaiTZs2EIlEao0nLS0NO3bswKZNm7Q+QQO1dE66Z8+eAICEhAQAwMuXL/Hxxx+jTZs2MDExgZmZGYKDgxETE6P03KKiIixcuBDu7u4wMDCAo6Mjhg4dikePHgEAEhMTOVPs5f+VHRmcPXsWPB4Pu3fvxmeffQYHBwcYGxtj0KBBePbsmVLbV69eRd++fWFubg4jIyMEBATg4sWLKl9jYGCgyvYXLlyoVHbXrl3w8fGBoaEhrKysMHz4cJXtV/baypLJZPjhhx/QunVrGBgYwN7eHqGhocjKyuKUc3Z2xoABA5TamTlzplKdqmJfsWKFUp8CQHFxMRYsWAA3Nzfo6+vDyckJYWFhKC4uVtlXZQUGBirVt2zZMvD5fPz222816o/vvvsO3bp1g7W1NQwNDeHj44N9+/apbH/Xrl3o3LkzjIyMYGlpCX9/fxw/fpxT5siRIwgICICpqSnMzMzQqVMnpdj27t2reE9tbGwwevRoJCUlccqMHz+eE7OlpSUCAwNx/vz51/bTkiVLwOPxsGPHDqXZKFdXV3z77bdITk7G5s2bFdtV9a08DvmUX2JiImxtbQEAixYtUrnf3r9/HyEhIbC1tYWhoSE8PDzw+eefc+qMiopCcHAwzMzMYGJigrfeegtXrlzhlJGfDrtw4QI++OAD2NrawsLCAqGhoRCLxcjOzsbYsWNhaWkJS0tLhIWFofyP8FV1X1elpv1f/nmq/iUmJirKHzlyBH5+fjA2NoapqSn69++Pu3fvKtVbWb8uXLjwtW2WPT1R2/vfhg0b0Lp1a+jr66NRo0aYMWMGsrOzOWXK7l+tWrWCj48PYmJiVH4mVSl/zLSxsUH//v1x584dTjkej4eZM2dWWI98v5K/B9evX4dMJoNYLEbHjh1hYGAAa2trjBgxQuVpodOnTyveLwsLC7z99tuIjY3llJG/H/L3zMzMDNbW1pg9ezaKioqU4i37+SkpKUG/fv1gZWWFe/fuccpWNRdUpkYj6fLkCdXa2hoA8PjxY4SHh+Pdd9
+Fi4sLUlNTsXnzZgQEBODevXto1KgRAEAqlWLAgAE4deoUhg8fjtmzZyM3NxcnTpzAnTt34OrqqmhjxIgR6NevH6fdefPmqYxn2bJl4PF4+OSTT5CWloYffvgBvXr1QnR0NAwNDQGUvnHBwcHw8fHBggULwOfzsW3bNvTs2RPnz59H586dlept0qQJli9fDgDIy8vD9OnTVbY9f/58hISEYPLkyUhPT8fatWvh7++PqKgoWFhYKD1n6tSp8PPzAwD8+eefOHDgAOfx0NBQbN++HRMmTMAHH3yAhIQErFu3DlFRUbh48SKEQqHKfqiO7OxsxWsrSyaTYdCgQbhw4QKmTp2Kli1b4vbt21i1ahUePHjAWVRSFdu2bcMXX3yB77//HiNHjlRZ5nX9sXr1agwaNAijRo2CWCzGH3/8gXfffRd///03+vfvryi3aNEiLFy4EN26dcPixYshEolw9epVnD59Gn369AFQegCYOHEiWrdujXnz5sHCwgJRUVE4evSoIj5533fq1AnLly9HamoqVq9ejYsXLyq9pzY2Nli1ahUA4Pnz51i9ejX69euHZ8+eqXzvgdLp6FOnTsHPzw8uLi4qy7z33nuYOnUq/v77b3z66aev7+h/2draYuPGjZg+fTqGDBmCoUOHAgC8vb0BALdu3YKfnx+EQiGmTp0KZ2dnPHr0CIcOHcKyZcsAAHfv3oWfnx/MzMwQFhYGoVCIzZs3IzAwEP/88w+6dOnCaXPWrFlwcHDAokWLcOXKFfz444+wsLDApUuX0LRpU3z11VeIiIjAihUr4OXlhbFjxyqe+6b7ek36PzQ0FL169VL8PWbMGE5fyfsRAHbu3Ilx48YhKCgI33zzDQoKCrBx40Z0794dUVFRii9Hr+vXoUOHws3NTVH/Rx99hJYtW2Lq1KmKbS1btgRQ+/vfwoULsWjRIvTq1QvTp09HXFwcNm7ciOvXr7+2jz/55JNK+788T09PfP7552CM4dGjR1i5ciX69ev3RmssMjMzAZQOPnx8fPD1118jPT0da9aswYULFxAVFQUbGxsAwMmTJxEcHIzmzZtj4cKFKCwsxNq1a+Hr64vIyEil89chISFwdnbG8uXLceXKFaxZswZZWVn45ZdfKoxn8uTJOHv2LE6cOIFWrVopttckF6jEqmHbtm0MADt58iRLT09nz549Y3/88QeztrZmhoaG7Pnz54wxxoqKiphUKuU8NyEhgenr67PFixcrtm3dupUBYCtXrlRqSyaTKZ4HgK1YsUKpTOvWrVlAQIDi7zNnzjAArHHjxuzVq1eK7Xv27GEA2OrVqxV1t2jRggUFBSnaYYyxgoIC5uLiwnr37q3UVrdu3ZiXl5fi7/T0dAaALViwQLEtMTGRCQQCtmzZMs5zb9++zfT09JS2x8fHMwBsx44dim0LFixgZd+W8+fPMwDs119/5Tz36NGjStubNWvG+vfvrxT7jBkzWPm3unzsYWFhzM7Ojvn4+HD6dOfOnYzP57Pz589znr9p0yYGgF28eFGpvbICAgIU9R0+fJjp6emxuXPnqixblf5grPR9KkssFjMvLy/Ws2dPTl18Pp8NGTJEaV+Uv+fZ2dnM1NSUdenShRUWFqosIxaLmZ2dHfPy8uKU+fvvvxkA9uWXXyq2jRs3jjVr1oxTz48//sgAsGvXrql8zYwxFh0dzQCw2bNnV1iGMca8vb2ZlZWV4u+yfVtW+ThU7aty/v7+zNTUlD158oSzveznYvDgwUwkErFHjx4ptr148YKZmpoyf39/xTb58aH856pr166Mx+OxadOmKbaVlJSwJk2acOKvzr6uSk37v7yK+io3N5dZWFiwKVOmcLanpKQwc3Nzzvaq9GtZzZo1Y+PGjVPaXtv7X1paGhOJRKxPnz6cz8W6desYALZ161bFtvL7V0REBAPA+vbtq/SZVEXV/vnZZ58xACwtLU2xDQCbMWNGhfXI96uEhATO361ateIcC+TH/7LHl3bt2j
E7OzuWmZmp2BYTE8P4fD4bO3asYpv8ODNo0CBO2++//z4DwGJiYjjxyvePefPmMYFAwMLDwznPq24uqEyNprt79eo
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_data, temp_data = train_test_split(df, test_size=0.3, random_state=42)\n",
"val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)\n",
"\n",
"print(\"Обучающая выборка: \", train_data.shape)\n",
"print(train_data.Outcome.value_counts())\n",
"outcome_counts = train_data['Outcome'].value_counts()\n",
"plt.figure(figsize=(2, 2))\n",
"plt.pie(outcome_counts, labels=outcome_counts.index, autopct='%1.1f%%', startangle=90)\n",
"plt.title('Распределение классов Outcome в обучающей выборке')\n",
"plt.show()\n",
"\n",
"print(\"Контрольная выборка: \", val_data.shape)\n",
"print(val_data.Outcome.value_counts())\n",
"outcome_counts = val_data['Outcome'].value_counts()\n",
"plt.figure(figsize=(2, 2))\n",
"plt.pie(outcome_counts, labels=outcome_counts.index, autopct='%1.1f%%', startangle=90)\n",
"plt.title('Распределение классов Outcome в контрольной выборке')\n",
"plt.show()\n",
"\n",
"print(\"Тестовая выборка: \", test_data.shape)\n",
"print(test_data.Outcome.value_counts())\n",
"outcome_counts = test_data['Outcome'].value_counts()\n",
"plt.figure(figsize=(2, 2))\n",
"plt.pie(outcome_counts, labels=outcome_counts.index, autopct='%1.1f%%', startangle=90)\n",
"plt.title('Распределение классов Outcome в тестовой выборке')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Как видно из круговых диаграмм, распределение классов достаточно смещено, что может привести к проблемам в обучении модели, так как модель будет обучаться в большей степени на одном классе. В таком случае имеет смысл рассмотреть методы аугментации данных."
]
},
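{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split above does not stratify on the target, so the class proportions can drift between the three subsets. The cell below is a minimal sketch of a stratified split, given as an assumption and not as the approach used in the rest of this notebook. It relies on the `df` DataFrame defined earlier; the names `train_strat`, `val_strat` and `test_strat` are hypothetical. Stratification only keeps the class ratio consistent across the subsets; it does not remove the imbalance itself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Sketch (assumption): pass the target column to `stratify` so that each\n",
"# subset keeps roughly the same Outcome class ratio as the full dataset.\n",
"train_strat, temp_strat = train_test_split(\n",
"    df, test_size=0.3, random_state=42, stratify=df['Outcome']\n",
")\n",
"val_strat, test_strat = train_test_split(\n",
"    temp_strat, test_size=0.5, random_state=42, stratify=temp_strat['Outcome']\n",
")\n",
"\n",
"# Compare the class shares in each subset\n",
"for name, part in [('train', train_strat), ('val', val_strat), ('test', test_strat)]:\n",
"    print(name, part['Outcome'].value_counts(normalize=True).round(3).to_dict())"
]
},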
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Аугментация данных методом оверсемплинга\n",
"\n",
"Этот метод увеличивает количество примеров меньшинства."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Обучающая выборка после оверсемплинга: (677, 9)\n",
"Outcome\n",
"0 349\n",
"1 328\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsMAAADECAYAAAB6FizTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA+w0lEQVR4nO3dd3gUVdsG8Ht303sv1ITQCUUiICAJvUsHEYUQqgXRV5QPLFSVF0EEEWkKRHqJgALSEWnSW4QghARDCymkZ9P2fH/k3TWb3ZQNCbPJ3r/LXLizs2eeOTN75tmZM2dkQggBIiIiIiITJJc6ACIiIiIiqTAZJiIiIiKTxWSYiIiIiEwWk2EiIiIiMllMhomIiIjIZDEZJiIiIiKTxWSYiIiIiEwWk2EiIiIiMllMhomInkF6ejpiYmLw9OlTqUOhcpaamoro6Gikp6dLHQoRVSAmw0REBtq+fTu6dOkCe3t72NnZoVatWvjqq6+kDqtSSEtLw+LFizWvk5KSsGzZMukCKkAIgVWrVuGll16CjY0NHBwc4Ovriw0bNkgdGhFVIIOS4XXr1kEmk2n+rKysUL9+fUyaNAmxsbEVFSORyTh16hQGDhwIT09PWFpawsfHBxMnTsQ///xT5jIzMjIwa9Ys/P777+UXqAmbNm0ahg0bBnt7e6xevRqHDh3C4cOH8fbbb0sdWqVgbW2NTz/9FBs3bkRMTAxmzZqFX3/9VeqwAAAjRozAm2++iUaNGmH9+vWabTto0CCpQyOiCmRWlg/NmTMHvr6+UCqVOHnyJJYvX459+/YhPDwcNjY25R0jkUlYunQp3nvvPdSpUwfvvvsuvL29cfPmTfzwww/YunUr9u3bh3bt2hlcbkZGBmbPng0A6NixYzlHbVqOHz+O+fPnY968eZg2bZrU4VRKCoUCs2fPxqhRo6BSqeDg4IC9e/dKHRZ++uknbN26FRs2bMCIESOkDoeIniOZEEKUduZ169YhJCQE58+fx4svvqiZPmXKFCxatAibNm3Ca6+9ViGBElVlp06dQmBgINq3b4/9+/dr/aiMjIxE+/btIZfL8ddff8HZ2dmgsuPj4+Hu7o6ZM2di1qxZ5Ry5aXnllVeQmJiIU6dOSR1KpXf//n3ExMSgUaNGcHJykjocNG3aFM2aNcPGjRulDoWInrNy6TPcuXNnAEBUVBQAIDExER9++CGaNm0KOzs7ODg4oFevXrh69arOZ5VKJWbNmoX69evDysoK3t7eGDRoECIjIwEA0dHRWl0zCv8VPNP1+++/QyaTYevWrfj444/h5eUFW1tb9OvXDzExMTrLPnv2LHr27AlHR0fY2NggKCioyINcx44d9S5fX3KxYcMGBAQEwNraGi4uLhg+fLje5Re3bgWpVCosXrwYTZo0gZWVFTw9PTFx4kSdG3Z8fHzQt29fneVMmjRJp0x9sS9YsECnTgEgKysLM2fORN26dWFpaYmaNWti6tSpyMrK0ltXBXXs2FGnvC+++AJyuRybNm0qU30sXLgQ7dq1g6urK6ytrREQEIAdO3boXf6GDRvQunVr2NjYwNnZGYGBgTh48KDWPL/99huCgoJgb28PBwcHtGrVSie27du3a7apm5sb3njjDTx48EBrntGjR2vF7OzsjI4dO+LEiRMl1tPcuXMhk8kQGhqqc3XFz88PX331FR49eoSVK1dqpuurW3UcPj4+APLr1N3dHQAwe/ZsvfttREQEhg0bBnd3d1hbW6NBgwb45JNPtMq8fPkyevXqBQcHB9jZ2aFLly74888/teZRd6M6efIkJk+eDHd3dzg5OWHixInIzs5GUlISRo0aBWdnZzg7O2Pq1Kko/Fu8tPu6Ps9S/wBw9OhRdOjQAba2tnByckL//v1x8+ZNrXn+/PNP+Pv7Y/jw4XBxcYG1tTVatWqFXbt2aeZJS0uDra0t3nvvPZ1l3L9/HwqFAvPmzdPErN5WBRXeRvfu3cPbb7+NBg
0awNraGq6urhg6dCiio6O1PqduAwt2iTl//jy6desGe3t72Nra6q0T9ba7cOGCZlp8fLzedqJv3756Yy5Nezpr1izN97lGjRpo27YtzMzM4OXlpRO3PurPq//s7e3RunVrrfoH8r8b/v7+RZajbmvWrVsHIP8myPDwcNSsWRN9+vSBg4NDkXUFAHfv3sXQoUPh4uICGxsbvPTSSzpntw05HhnSThpy3NIXT1F/o0ePNngdgZKP4WqFt11Ry33w4AHGjBmj6SrWpEkTrFmzpsT1A4Dc3FzMnTsXfn5+mm5mH3/8sc6xysfHR7N8uVwOLy8vvPrqqzrd0Qw99h48eBAtWrSAlZUVGjdujJ9//lknxqSkJPznP/+Bj48PLC0tUaNGDYwaNQrx8fGaeUp7zFWvQ8E++GoNGzaETCbDpEmTNNMKd3UtTT5TsK4K/hX8rvr4+Ohsx+3bt0Mmk2m1Ferv3cKFC3WW4+/vrzef0/d38uRJAKVvF0ujTN0kClPv9K6urgDyv0S7du3C0KFD4evri9jYWKxcuRJBQUG4ceMGqlWrBgDIy8tD3759ceTIEQwfPhzvvfceUlNTcejQIYSHh8PPz0+zjNdeew29e/fWWu706dP1xvPFF19AJpPh//7v//DkyRMsXrwYXbt2xZUrV2BtbQ0g/8DXq1cvBAQEYObMmZDL5Vi7di06d+6MEydOoHXr1jrl1qhRQ3MQS0tLw1tvvaV32Z999hmGDRuGcePGIS4uDkuXLkVgYCAuX76s9wzIhAkT0KFDBwDAzz//jJ07d2q9P3HiRM1Z+cmTJyMqKgrfffcdLl++jFOnTsHc3FxvPRgiKSlJs24FqVQq9OvXDydPnsSECRPQqFEjXL9+Hd988w3+/vtvnYNQSdauXYtPP/0UX3/9dZGXIkuqjyVLlqBfv354/fXXkZ2djS1btmDo0KHYs2cP+vTpo5lv9uzZmDVrFtq1a4c5c+bAwsICZ8+exdGjR9G9e3cA+Y3DmDFj0KRJE0yfPh1OTk64fPky9u/fr4lPXfetWrXCvHnzEBsbiyVLluDUqVM629TNzQ3ffPMNgPzEZ8mSJejduzdiYmKKPPuVkZGBI0eOoEOHDvD19dU7z6uvvooJEyZgz549Bl2ed3d3x/Lly/HWW29h4MCBmr6PzZo1AwBcu3YNHTp0gLm5OSZMmAAfHx9ERkbi119/xRdffAEA+Ouvv9ChQwc4ODhg6tSpMDc3x8qVK9GxY0ccP34cbdq00Vrmu+++Cy8vL8yePRt//vknVq1aBScnJ5w+fRq1atXCl19+iX379mHBggXw9/fHqFGjNJ991n29LPUPAIcPH0avXr1Qp04dzJo1C5mZmVi6dCnat2+PS5cuaRr0hIQErFq1CnZ2dpqEf8OGDRg0aBA2btyI1157DXZ2dhg4cCC2bt2KRYsWQaFQaJazefNmCCHw+uuvl24D/s/58+dx+vRpDB8+HDVq1EB0dDSWL1+Ojh074saNG0V2T7tz5w46duwIGxsbfPTRR7CxscHq1avRtWtXHDp0CIGBgQbFUZSytKdqX3/9tcH3nKxfvx5AfsL+/fffY+jQoQgPD0eDBg3KFH9CQgIAYP78+fDy8sJHH30EKysrvXUVGxuLdu3aISMjA5MnT4arqytCQ0PRr18/7NixAwMHDtQquzTHo8KKaiefpZ7VJk+ejFatWmlNGzdunNbr0q6jIcdwNfW2A4D//Oc/Ost96aWXNEmcu7s7fvvtN4wdOxYpKSl4//33i123cePGITQ0FEOGDMGUKVNw9uxZzJs3Dzdv3tQ5jnTo0AETJkyASqVCeHg4Fi9ejIcPH2r9+DGkPbp9+zZeffVVvPnmmwgODsbatWsxdOhQ7N+/H926dQOQnzd06NABN2/exJgxY9CyZUvEx8fjl19+wf379+Hm5mbwMdfKygpr167VqpvTp0/j3r17RdaTuqurWlH5TOG6AoCbN2/iyy+/LHojIP9HSeETKm
Wlb39Vf8/L2i7qJQywdu1aAUAcPnxYxMXFiZiYGLFlyxbh6uoqrK2txf3794UQQiiVSpGXl6f12aioKGFpaSnmzJm
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from imblearn.over_sampling import ADASYN\n",
"\n",
"# Создание экземпляра ADASYN\n",
"ada = ADASYN()\n",
"\n",
"# Применение ADASYN\n",
"X_resampled, y_resampled = ada.fit_resample(train_data.drop(columns=['Outcome']), train_data['Outcome'])\n",
"\n",
"# Создание нового DataFrame\n",
"df_train_adasyn = pd.DataFrame(X_resampled)\n",
"df_train_adasyn['Outcome'] = y_resampled # Добавление целевой переменной\n",
"\n",
"# Вывод информации о новой выборке\n",
"print(\"Обучающая выборка после оверсемплинга: \", df_train_adasyn.shape)\n",
"print(df_train_adasyn['Outcome'].value_counts())\n",
"counts = df_train_adasyn['Outcome'].value_counts()\n",
"plt.figure(figsize=(2, 2))\n",
"plt.pie(counts, labels=counts.index, autopct='%1.1f%%', startangle=90)\n",
"plt.title('Распределение классов Outcome в обучающей выборке после оверсемплинга')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Аугментация данных методом андерсемплинга\n",
"\n",
"Проведём также приращение данных методом выборки с недостатком (андерсемплинг). Этот метод помогает сбалансировать выборку, уменьшая количество экземпляров класса большинства, чтобы привести его в соответствие с классом меньшинства."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Обучающая выборка после андерсемплинга: (376, 9)\n",
"Outcome\n",
"0 188\n",
"1 188\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAtoAAADECAYAAAChrYbxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA/5UlEQVR4nO3dd3wT9f8H8FeS0qR70gFCKS20MmSUIQgUZRQK1gKKgjIKBRQQEXCgMopgZSggQ34oS+GLyJChskEZMkX2KtAyCpSW0tJBR5LP74+a2DRpm45wtLyej0cekOuN933u7nPv3H3uczIhhAAREREREZUrudQBEBERERFVRky0iYiIiIgsgIk2EREREZEFMNEmIiIiIrIAJtpERERERBbARJuIiIiIyAKYaBMRERERWQATbSIiIiIiC7CSOgAioqdBTk4OkpOTodVqUa1aNanDoXKUlZWF5ORkWFlZwcPDQ+pwiOgJwivaRE+AgQMHwt7eXuowys3kyZMhk8mkDkNyx48fR9++feHu7g6lUglvb2/06tVL6rAqjPnz5yMlJUX/fc6cOcjIyJAuoHx27dqFsLAwODs7w8bGBtWrV8d7770ndVhE9IQp0RXt5cuXIyIiQv9dqVSiZs2a6Ny5MyZMmABPT89yD5CIqCLatGkTXn/9dQQGBmLatGnw8/MDAF7xLIEtW7bgypUrGDt2LPbt24cJEyZg1KhRUoeFhQsX4t1330WbNm0wd+5cVK9eHQDg4+MjcWRE9KQpVdORKVOmwNfXF1lZWThw4AC+/fZb/P777zh79ixsbW3LO0YiogolOTkZkZGRCAkJwdq1a2FtbS11SBXSJ598grCwMMydOxdyuRxfffUV5HJpb8TGxMRgzJgxGDp0KBYuXMg7N0RUpFIl2l27dkWzZs0AAJGRkXBzc8PXX3+NTZs2oU+fPuUaIBE9edRqNbRaLRPIQixbtgxZWVlYvnw5y6gMgoODcf36dVy4cAE1atTAM888I3VI+Oabb+Dl5YVvvvmGSTYRFatcLg289NJLAIDY2FgAeVdzxo0bh4YNG8Le3h6Ojo7o2rUrTp06ZTRtVlYWJk+ejLp160KlUsHb2xs9e/bE1atXAQBxcXGQyWSFftq3b6+f1x9//AGZTIY1a9bgk08+gZeXF+zs7BAWFoabN28aLfvIkSPo0qULnJycYGtri+DgYBw8eNDkOrZv397k8idPnmw07sqVKxEUFAQbGxu4urrijTfeMLn8otYtP61Wizlz5qB+/fpQqVTw9PTEsGHD8ODBA4PxatWqhe7duxstZ+TIkUbzNBX7zJkzjcoUALKzszFp0iT4+/tDqVSiRo0a+PDDD5GdnW2yrPJr37690fymTZsGuVyO//3vf6Uqj1mzZqF169Zwc3ODjY0NgoKCsG7dOpPLX7lyJVq0aAFbW1u4uLigXbt22LFjh8E4W7duRXBwMBwcHODo6IjmzZsbxbZ27Vr9NnV3d8dbb72F+Ph4g3EGDhxoELOLiwvat2+P/fv3F1tOOvHx8QgPD4e9vT2qVq2KcePGQaPRlHj9C8Ziap/NycnBxIkTERQUBCcnJ9jZ2aFt27bYu3evwbx022XWrFmYM2cO/Pz8oFQqcf78eQDAgQMH0Lx5c6hUKvj5+eH//u//TK6bWq3G559/rp++Vq1a+OSTT4z2o8KOq1q1amHgwIH677m5uYiKikKdOnWgUqng5uaGNm3aYOfOnUWW8fLlyw3Kw9bWFg0bNsT3339f5HQ6165dw2uvvQZXV1fY2tri+eefx2+//WYwzuHDh9G4cWN88cUXqFGjBpRKJerUqYMvv/wSWq1WP15wcDAaNWpkcjkBAQEICQkxiDkuLs5gnILHl7nbFDAu57t376J///6oWrUqlEolGjRogO+++85gmvz7Qn4NGjQwOs5nzZplMub4+HgMGjQInp6eUCqVqF+/PpYuXWowjq4u/+OPP+Ds7IxWrVrhmW
eeQbdu3QrdP0xNr/solUrUrVsX0dHREELox9M9S5CUlFTovArud4cPH0ZQUBCGDx+uXwdTZQUAGRkZGDt2rH4fCAgIwKxZswxiAPK2xciRI7Fq1SoEBARApVIhKCgI+/btMxjP1LMPe/fuhVKpxNtvv20w3JxyLkxR59xatWqVah0B8+rjgtuusOWW5by0f/9+vPbaa6hZs6Z+2vfffx+PHj0yGK+wZ2fWrVun3z/NLbuC45obv7n7BpC3zQcPHoxq1apBqVTC19cX77zzDnJycvTjpKSkYPTo0frt5e/vj+nTpxvUS/nPxRs3bjRYRlZWFlxcXIzqAd2+Wdhn+fLlZpdV/jrD3FxFt9+YygXs7e0NjuGC54D8n1u3bgEATp8+jYEDB6J27dpQqVTw8vLCoEGDcP/+faP5F6dceh3RJcVubm4A8k5EGzduxGuvvQZfX18kJCTg//7v/xAcHIzz58/rn7jXaDTo3r07du/ejTfeeAPvvfce0tLSsHPnTpw9e1bfphEA+vTpg9DQUIPljh8/3mQ806ZNg0wmw0cffYR79+5hzpw56NixI06ePAkbGxsAwJ49e9C1a1cEBQVh0qRJkMvlWLZsGV566SXs378fLVq0MJrvM888g+joaABAeno63nnnHZPLnjBhAnr37o3IyEgkJiZi3rx5aNeuHf755x84OzsbTTN06FC0bdsWALBhwwb88ssvBn8fNmyYvn38qFGjEBsbi/nz5+Off/7BwYMHUaVKFZPlUBIpKSn6dctPq9UiLCwMBw4cwNChQ/Hss8/izJkzmD17Ni5fvmx0EBZn2bJl+Oyzz/DVV1+hb9++Jscprjzmzp2LsLAwvPnmm8jJycFPP/2E1157Db/++iu6deumHy8qKgqTJ09G69atMWXKFFhbW+PIkSPYs2cPOnfuDCDvgBs0aBDq16+P8ePHw9nZGf/88w+2bdumj09X9s2bN0d0dDQSEhIwd+5cHDx40Giburu7Y/bs2QCAW7duYe7cuQgNDcXNmzdNbvv8NBoNQkJC0LJlS8yaNQu7du3CV199BT8/P4N9zZz1HzZsGDp27Ggw/23btmHVqlX6NsIPHz7E999/jz59+mDIkCFIS0vDkiVLEBISgqNHj6Jx48ZG2y4rKwtDhw6FUqmEq6srzpw5g86dO6Nq1aqYPHky1Go1Jk2aZPJ5jcjISKxYsQKvvvoqxo4diyNHjiA6OhoXLlww2sbmmDx5MqKjoxEZGYkWLVrg4cOHOH78OE6cOIFOnToVO/3s2bPh7u6Ohw8fYunSpRgyZAhq1aplVG75JSQkoHXr1sjMzMSoUaPg5uaGFStWICwsDOvWrUOPHj0AAPfv38eBAwdw4MABDBo0CEFBQdi9ezfGjx+PuLg4LFq0CADQr18/DBkyBGfPnkWDBg30yzl27BguX76Mzz77rERlUtJtqpOTk4OOHTvi4sWLeOeddxAQEICNGzdi6NChuH//Pj7++OMSxVGYhIQEPP/88/rkoWrVqti6dSsGDx6Mhw8fYvTo0YVOu2/fPvz+++8lWt4nn3yCZ599Fo8ePdJfgPHw8MDgwYNLvQ7379/H8ePHYWVlhREjRsDPz89kWQkhEBYWhr1792Lw4MFo3Lgxtm/fjg8++ADx8fH6ekLnzz//xJo1azBq1CgolUosXLgQXbp0wdGjRw32jfxOnTqF8PBwhIaGYsGCBfrhZSlnnU6dOqF///4Gw7766iuDCzwlWUdz6uP8dNsOABYvXowbN27o/1bW89LatWuRmZmJd955B25ubjh69CjmzZuHW7duYe3atcWWTXHyl92xY8fwzTffGPy9pPGbs2/cvn0bLVq0QEpKCoYOHYrAwEDEx8dj3bp1yMzMhLW1NTIzMxEcHIz4+HgMGzYMNWvWxF9//YXx48fjzp07mDNnjsFyVSoVli1bhvDwcP2wDRs2ICsrq9B1//bbbw1+nMTGxmLixImFjt+jRw/07NkTQN4PoMWLFx
c6LlB4rlIaumbQ+bm6ugIAdu7ciWvXriEiIgJeXl44d+4cFi9ejHPnzuHw4cMlu5slSmDZsmUCgNi1a5dITEwUN2/
"text/plain": [
"<Figure size 200x200 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"rus = RandomUnderSampler()\n",
"\n",
"# Применение RandomUnderSampler\n",
"X_resampled, y_resampled = rus.fit_resample(train_data.drop(columns=['Outcome']), train_data['Outcome'])\n",
"\n",
"# Создание нового DataFrame\n",
"df_train_undersampled = pd.DataFrame(X_resampled)\n",
"df_train_undersampled['Outcome'] = y_resampled # Добавление целевой переменной\n",
"\n",
"# Вывод информации о новой выборке\n",
"print(\"Обучающая выборка после андерсемплинга: \", df_train_undersampled.shape)\n",
"print(df_train_undersampled['Outcome'].value_counts())\n",
"\n",
"# Визуализация распределения классов\n",
"hazardous_counts = df_train_undersampled['Outcome'].value_counts()\n",
"plt.figure(figsize=(2, 2))\n",
"plt.pie(hazardous_counts, labels=hazardous_counts.index, autopct='%1.1f%%', startangle=90)\n",
"plt.title('Распределение классов hazardous в обучающей выборке после андерсемплинга')\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}