ISE_31.Aparyan.mai/lab2.ipynb

2024-10-11 18:50:07 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Laboratory Work No. 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis of Several Datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Select three datasets that do not correspond to your assignment variant\n",
"### 2. Analyze the information about each dataset from its Kaggle download page. What is the problem domain?\n",
"\n",
"Stores, car prices, strokes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Strokes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is used to predict the likelihood of a stroke in a patient from parameters such as gender, age, existing diseases, and smoking status. According to the World Health Organization (WHO), stroke is the second leading cause of death worldwide and is responsible for approximately 11% of all deaths.\n",
"\n",
"Column information\n",
"\n",
"- id: unique patient identifier (int)\n",
"- gender: patient's gender; possible values are \"Male\", \"Female\", or \"Other\" (object, string)\n",
"- age: patient's age (float)\n",
"- hypertension: presence of hypertension; 0 if the patient does not have it, 1 if the patient has it (int)\n",
"- heart_disease: presence of heart disease; 0 if absent, 1 if present (int)\n",
"- ever_married: marital status; \"No\" or \"Yes\" (object, string)\n",
"- work_type: type of work; possible values are \"children\", \"Govt_job\" (government job), \"Never_worked\", \"Private\" (private sector), or \"Self-employed\" (object, string)\n",
"- Residence_type: type of residence; \"Rural\" or \"Urban\" (object, string)\n",
"- avg_glucose_level: average blood glucose level (float)\n",
"- bmi: body mass index (BMI) (float)\n",
"- smoking_status: smoking status; possible values are \"formerly smoked\", \"never smoked\", \"smokes\", or \"Unknown\". The value \"Unknown\" means that information about the patient's smoking status is unavailable (object, string)\n",
"- stroke: whether the patient had a stroke; 1 if yes, 0 if no (int)\n",
"\n",
"Each row of the dataset describes one patient, which makes it possible to analyze the data and build models that predict stroke risk."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>gender</th>\n",
" <th>age</th>\n",
" <th>hypertension</th>\n",
" <th>heart_disease</th>\n",
" <th>ever_married</th>\n",
" <th>work_type</th>\n",
" <th>Residence_type</th>\n",
" <th>avg_glucose_level</th>\n",
" <th>bmi</th>\n",
" <th>smoking_status</th>\n",
" <th>stroke</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>9046</td>\n",
" <td>Male</td>\n",
" <td>67.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Yes</td>\n",
" <td>Private</td>\n",
" <td>Urban</td>\n",
" <td>228.69</td>\n",
" <td>36.6</td>\n",
" <td>formerly smoked</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>51676</td>\n",
" <td>Female</td>\n",
" <td>61.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>Self-employed</td>\n",
" <td>Rural</td>\n",
" <td>202.21</td>\n",
" <td>NaN</td>\n",
" <td>never smoked</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>31112</td>\n",
" <td>Male</td>\n",
" <td>80.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Yes</td>\n",
" <td>Private</td>\n",
" <td>Rural</td>\n",
" <td>105.92</td>\n",
" <td>32.5</td>\n",
" <td>never smoked</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>60182</td>\n",
" <td>Female</td>\n",
" <td>49.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>Private</td>\n",
" <td>Urban</td>\n",
" <td>171.23</td>\n",
" <td>34.4</td>\n",
" <td>smokes</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1665</td>\n",
" <td>Female</td>\n",
" <td>79.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>Self-employed</td>\n",
" <td>Rural</td>\n",
" <td>174.12</td>\n",
" <td>24.0</td>\n",
" <td>never smoked</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5105</th>\n",
" <td>18234</td>\n",
" <td>Female</td>\n",
" <td>80.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>Private</td>\n",
" <td>Urban</td>\n",
" <td>83.75</td>\n",
" <td>NaN</td>\n",
" <td>never smoked</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5106</th>\n",
" <td>44873</td>\n",
" <td>Female</td>\n",
" <td>81.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>Self-employed</td>\n",
" <td>Urban</td>\n",
" <td>125.20</td>\n",
" <td>40.0</td>\n",
" <td>never smoked</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5107</th>\n",
" <td>19723</td>\n",
" <td>Female</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>Self-employed</td>\n",
" <td>Rural</td>\n",
" <td>82.99</td>\n",
" <td>30.6</td>\n",
" <td>never smoked</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5108</th>\n",
" <td>37544</td>\n",
" <td>Male</td>\n",
" <td>51.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>Private</td>\n",
" <td>Rural</td>\n",
" <td>166.29</td>\n",
" <td>25.6</td>\n",
" <td>formerly smoked</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5109</th>\n",
" <td>44679</td>\n",
" <td>Female</td>\n",
" <td>44.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>Govt_job</td>\n",
" <td>Urban</td>\n",
" <td>85.28</td>\n",
" <td>26.2</td>\n",
" <td>Unknown</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5110 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" id gender age hypertension heart_disease ever_married \\\n",
"0 9046 Male 67.0 0 1 Yes \n",
"1 51676 Female 61.0 0 0 Yes \n",
"2 31112 Male 80.0 0 1 Yes \n",
"3 60182 Female 49.0 0 0 Yes \n",
"4 1665 Female 79.0 1 0 Yes \n",
"... ... ... ... ... ... ... \n",
"5105 18234 Female 80.0 1 0 Yes \n",
"5106 44873 Female 81.0 0 0 Yes \n",
"5107 19723 Female 35.0 0 0 Yes \n",
"5108 37544 Male 51.0 0 0 Yes \n",
"5109 44679 Female 44.0 0 0 Yes \n",
"\n",
" work_type Residence_type avg_glucose_level bmi smoking_status \\\n",
"0 Private Urban 228.69 36.6 formerly smoked \n",
"1 Self-employed Rural 202.21 NaN never smoked \n",
"2 Private Rural 105.92 32.5 never smoked \n",
"3 Private Urban 171.23 34.4 smokes \n",
"4 Self-employed Rural 174.12 24.0 never smoked \n",
"... ... ... ... ... ... \n",
"5105 Private Urban 83.75 NaN never smoked \n",
"5106 Self-employed Urban 125.20 40.0 never smoked \n",
"5107 Self-employed Rural 82.99 30.6 never smoked \n",
"5108 Private Rural 166.29 25.6 formerly smoked \n",
"5109 Govt_job Urban 85.28 26.2 Unknown \n",
"\n",
" stroke \n",
"0 1 \n",
"1 1 \n",
"2 1 \n",
"3 1 \n",
"4 1 \n",
"... ... \n",
"5105 0 \n",
"5106 0 \n",
"5107 0 \n",
"5108 0 \n",
"5109 0 \n",
"\n",
"[5110 rows x 12 columns]"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd \n",
"\n",
"strokes = pd.read_csv(\"healthcare-dataset-stroke-data.csv\")\n",
"\n",
"strokes"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id int64\n",
"gender object\n",
"age float64\n",
"hypertension int64\n",
"heart_disease int64\n",
"ever_married object\n",
"work_type object\n",
"Residence_type object\n",
"avg_glucose_level float64\n",
"bmi float64\n",
"smoking_status object\n",
"stroke int64\n",
"dtype: object"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"strokes.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cars"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is used to predict the price of a car from parameters such as manufacturer, model, production year, and other characteristics.\n",
"\n",
"Column information\n",
"\n",
"- ID: unique car identifier (int)\n",
"- Price: car price (target column) (int)\n",
"- Levy: tax or fee associated with the car (object, string)\n",
"- Manufacturer: car manufacturer (object, string)\n",
"- Model: car model (object, string)\n",
"- Prod. year: production year (int)\n",
"- Category: car category (object, string)\n",
"- Leather interior: whether the car has a leather interior (yes/no) (object, string)\n",
"- Fuel type: fuel type (petrol, diesel, etc.) (object, string)\n",
"- Engine volume: engine displacement (object, string)\n",
"- Mileage: car mileage (object, string)\n",
"- Cylinders: number of engine cylinders (float)\n",
"- Gear box type: transmission type (manual, automatic, etc.) (object, string)\n",
"- Drive wheels: drive type (front, rear, all-wheel) (object, string)\n",
"- Doors: number of doors (object, string)\n",
"- Wheel: steering wheel position (left-hand, right-hand) (object, string)\n",
"- Color: car color (object, string)\n",
"- Airbags: number of airbags (int)\n",
"\n",
"\n",
"Each row of the dataset describes one car, which makes it possible to analyze the data and build models that predict its price."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Price</th>\n",
" <th>Levy</th>\n",
" <th>Manufacturer</th>\n",
" <th>Model</th>\n",
" <th>Prod. year</th>\n",
" <th>Category</th>\n",
" <th>Leather interior</th>\n",
" <th>Fuel type</th>\n",
" <th>Engine volume</th>\n",
" <th>Mileage</th>\n",
" <th>Cylinders</th>\n",
" <th>Gear box type</th>\n",
" <th>Drive wheels</th>\n",
" <th>Doors</th>\n",
" <th>Wheel</th>\n",
" <th>Color</th>\n",
" <th>Airbags</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>45654403</td>\n",
" <td>13328</td>\n",
" <td>1399</td>\n",
" <td>LEXUS</td>\n",
" <td>RX 450</td>\n",
" <td>2010</td>\n",
" <td>Jeep</td>\n",
" <td>Yes</td>\n",
" <td>Hybrid</td>\n",
" <td>3.5</td>\n",
" <td>186005 km</td>\n",
" <td>6.0</td>\n",
" <td>Automatic</td>\n",
" <td>4x4</td>\n",
" <td>04-May</td>\n",
" <td>Left wheel</td>\n",
" <td>Silver</td>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>44731507</td>\n",
" <td>16621</td>\n",
" <td>1018</td>\n",
" <td>CHEVROLET</td>\n",
" <td>Equinox</td>\n",
" <td>2011</td>\n",
" <td>Jeep</td>\n",
" <td>No</td>\n",
" <td>Petrol</td>\n",
" <td>3</td>\n",
" <td>192000 km</td>\n",
" <td>6.0</td>\n",
" <td>Tiptronic</td>\n",
" <td>4x4</td>\n",
" <td>04-May</td>\n",
" <td>Left wheel</td>\n",
" <td>Black</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>45774419</td>\n",
" <td>8467</td>\n",
" <td>-</td>\n",
" <td>HONDA</td>\n",
" <td>FIT</td>\n",
" <td>2006</td>\n",
" <td>Hatchback</td>\n",
" <td>No</td>\n",
" <td>Petrol</td>\n",
" <td>1.3</td>\n",
" <td>200000 km</td>\n",
" <td>4.0</td>\n",
" <td>Variator</td>\n",
" <td>Front</td>\n",
" <td>04-May</td>\n",
" <td>Right-hand drive</td>\n",
" <td>Black</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>45769185</td>\n",
" <td>3607</td>\n",
" <td>862</td>\n",
" <td>FORD</td>\n",
" <td>Escape</td>\n",
" <td>2011</td>\n",
" <td>Jeep</td>\n",
" <td>Yes</td>\n",
" <td>Hybrid</td>\n",
" <td>2.5</td>\n",
" <td>168966 km</td>\n",
" <td>4.0</td>\n",
" <td>Automatic</td>\n",
" <td>4x4</td>\n",
" <td>04-May</td>\n",
" <td>Left wheel</td>\n",
" <td>White</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>45809263</td>\n",
" <td>11726</td>\n",
" <td>446</td>\n",
" <td>HONDA</td>\n",
" <td>FIT</td>\n",
" <td>2014</td>\n",
" <td>Hatchback</td>\n",
" <td>Yes</td>\n",
" <td>Petrol</td>\n",
" <td>1.3</td>\n",
" <td>91901 km</td>\n",
" <td>4.0</td>\n",
" <td>Automatic</td>\n",
" <td>Front</td>\n",
" <td>04-May</td>\n",
" <td>Left wheel</td>\n",
" <td>Silver</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19232</th>\n",
" <td>45798355</td>\n",
" <td>8467</td>\n",
" <td>-</td>\n",
" <td>MERCEDES-BENZ</td>\n",
" <td>CLK 200</td>\n",
" <td>1999</td>\n",
" <td>Coupe</td>\n",
" <td>Yes</td>\n",
" <td>CNG</td>\n",
" <td>2.0 Turbo</td>\n",
" <td>300000 km</td>\n",
" <td>4.0</td>\n",
" <td>Manual</td>\n",
" <td>Rear</td>\n",
" <td>02-Mar</td>\n",
" <td>Left wheel</td>\n",
" <td>Silver</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19233</th>\n",
" <td>45778856</td>\n",
" <td>15681</td>\n",
" <td>831</td>\n",
" <td>HYUNDAI</td>\n",
" <td>Sonata</td>\n",
" <td>2011</td>\n",
" <td>Sedan</td>\n",
" <td>Yes</td>\n",
" <td>Petrol</td>\n",
" <td>2.4</td>\n",
" <td>161600 km</td>\n",
" <td>4.0</td>\n",
" <td>Tiptronic</td>\n",
" <td>Front</td>\n",
" <td>04-May</td>\n",
" <td>Left wheel</td>\n",
" <td>Red</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19234</th>\n",
" <td>45804997</td>\n",
" <td>26108</td>\n",
" <td>836</td>\n",
" <td>HYUNDAI</td>\n",
" <td>Tucson</td>\n",
" <td>2010</td>\n",
" <td>Jeep</td>\n",
" <td>Yes</td>\n",
" <td>Diesel</td>\n",
" <td>2</td>\n",
" <td>116365 km</td>\n",
" <td>4.0</td>\n",
" <td>Automatic</td>\n",
" <td>Front</td>\n",
" <td>04-May</td>\n",
" <td>Left wheel</td>\n",
" <td>Grey</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19235</th>\n",
" <td>45793526</td>\n",
" <td>5331</td>\n",
" <td>1288</td>\n",
" <td>CHEVROLET</td>\n",
" <td>Captiva</td>\n",
" <td>2007</td>\n",
" <td>Jeep</td>\n",
" <td>Yes</td>\n",
" <td>Diesel</td>\n",
" <td>2</td>\n",
" <td>51258 km</td>\n",
" <td>4.0</td>\n",
" <td>Automatic</td>\n",
" <td>Front</td>\n",
" <td>04-May</td>\n",
" <td>Left wheel</td>\n",
" <td>Black</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19236</th>\n",
" <td>45813273</td>\n",
" <td>470</td>\n",
" <td>753</td>\n",
" <td>HYUNDAI</td>\n",
" <td>Sonata</td>\n",
" <td>2012</td>\n",
" <td>Sedan</td>\n",
" <td>Yes</td>\n",
" <td>Hybrid</td>\n",
" <td>2.4</td>\n",
" <td>186923 km</td>\n",
" <td>4.0</td>\n",
" <td>Automatic</td>\n",
" <td>Front</td>\n",
" <td>04-May</td>\n",
" <td>Left wheel</td>\n",
" <td>White</td>\n",
" <td>12</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>19237 rows × 18 columns</p>\n",
"</div>"
],
"text/plain": [
" ID Price Levy Manufacturer Model Prod. year Category \\\n",
"0 45654403 13328 1399 LEXUS RX 450 2010 Jeep \n",
"1 44731507 16621 1018 CHEVROLET Equinox 2011 Jeep \n",
"2 45774419 8467 - HONDA FIT 2006 Hatchback \n",
"3 45769185 3607 862 FORD Escape 2011 Jeep \n",
"4 45809263 11726 446 HONDA FIT 2014 Hatchback \n",
"... ... ... ... ... ... ... ... \n",
"19232 45798355 8467 - MERCEDES-BENZ CLK 200 1999 Coupe \n",
"19233 45778856 15681 831 HYUNDAI Sonata 2011 Sedan \n",
"19234 45804997 26108 836 HYUNDAI Tucson 2010 Jeep \n",
"19235 45793526 5331 1288 CHEVROLET Captiva 2007 Jeep \n",
"19236 45813273 470 753 HYUNDAI Sonata 2012 Sedan \n",
"\n",
" Leather interior Fuel type Engine volume Mileage Cylinders \\\n",
"0 Yes Hybrid 3.5 186005 km 6.0 \n",
"1 No Petrol 3 192000 km 6.0 \n",
"2 No Petrol 1.3 200000 km 4.0 \n",
"3 Yes Hybrid 2.5 168966 km 4.0 \n",
"4 Yes Petrol 1.3 91901 km 4.0 \n",
"... ... ... ... ... ... \n",
"19232 Yes CNG 2.0 Turbo 300000 km 4.0 \n",
"19233 Yes Petrol 2.4 161600 km 4.0 \n",
"19234 Yes Diesel 2 116365 km 4.0 \n",
"19235 Yes Diesel 2 51258 km 4.0 \n",
"19236 Yes Hybrid 2.4 186923 km 4.0 \n",
"\n",
" Gear box type Drive wheels Doors Wheel Color Airbags \n",
"0 Automatic 4x4 04-May Left wheel Silver 12 \n",
"1 Tiptronic 4x4 04-May Left wheel Black 8 \n",
"2 Variator Front 04-May Right-hand drive Black 2 \n",
"3 Automatic 4x4 04-May Left wheel White 0 \n",
"4 Automatic Front 04-May Left wheel Silver 4 \n",
"... ... ... ... ... ... ... \n",
"19232 Manual Rear 02-Mar Left wheel Silver 5 \n",
"19233 Tiptronic Front 04-May Left wheel Red 8 \n",
"19234 Automatic Front 04-May Left wheel Grey 4 \n",
"19235 Automatic Front 04-May Left wheel Black 4 \n",
"19236 Automatic Front 04-May Left wheel White 12 \n",
"\n",
"[19237 rows x 18 columns]"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto = pd.read_csv(\"car_price_prediction.csv\")\n",
"\n",
"auto"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ID int64\n",
"Price int64\n",
"Levy object\n",
"Manufacturer object\n",
"Model object\n",
"Prod. year int64\n",
"Category object\n",
"Leather interior object\n",
"Fuel type object\n",
"Engine volume object\n",
"Mileage object\n",
"Cylinders float64\n",
"Gear box type object\n",
"Drive wheels object\n",
"Doors object\n",
"Wheel object\n",
"Color object\n",
"Airbags int64\n",
"dtype: object"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Stores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Column information\n",
"\n",
"- Store ID: unique identifier of a particular store (index) (int)\n",
"- Store_Area: physical area of the store in square yards (int)\n",
"- Items_Available: number of different items available in the store (int)\n",
"- Daily_Customer_Count: average number of customers who visited the store over a month (int)\n",
"- Store_Sales: sales (in US dollars) made by the store (int)\n",
"\n",
"Each row of the dataset describes one store, which makes it possible to analyze the data and build models that evaluate its performance."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Store ID</th>\n",
" <th>Store_Area</th>\n",
" <th>Items_Available</th>\n",
" <th>Daily_Customer_Count</th>\n",
" <th>Store_Sales</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1659</td>\n",
" <td>1961</td>\n",
" <td>530</td>\n",
" <td>66490</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1461</td>\n",
" <td>1752</td>\n",
" <td>210</td>\n",
" <td>39820</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1340</td>\n",
" <td>1609</td>\n",
" <td>720</td>\n",
" <td>54010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1451</td>\n",
" <td>1748</td>\n",
" <td>620</td>\n",
" <td>53730</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1770</td>\n",
" <td>2111</td>\n",
" <td>450</td>\n",
" <td>46620</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>891</th>\n",
" <td>892</td>\n",
" <td>1582</td>\n",
" <td>1910</td>\n",
" <td>1080</td>\n",
" <td>66390</td>\n",
" </tr>\n",
" <tr>\n",
" <th>892</th>\n",
" <td>893</td>\n",
" <td>1387</td>\n",
" <td>1663</td>\n",
" <td>850</td>\n",
" <td>82080</td>\n",
" </tr>\n",
" <tr>\n",
" <th>893</th>\n",
" <td>894</td>\n",
" <td>1200</td>\n",
" <td>1436</td>\n",
" <td>1060</td>\n",
" <td>76440</td>\n",
" </tr>\n",
" <tr>\n",
" <th>894</th>\n",
" <td>895</td>\n",
" <td>1299</td>\n",
" <td>1560</td>\n",
" <td>770</td>\n",
" <td>96610</td>\n",
" </tr>\n",
" <tr>\n",
" <th>895</th>\n",
" <td>896</td>\n",
" <td>1174</td>\n",
" <td>1429</td>\n",
" <td>1110</td>\n",
" <td>54340</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>896 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" Store ID Store_Area Items_Available Daily_Customer_Count Store_Sales\n",
"0 1 1659 1961 530 66490\n",
"1 2 1461 1752 210 39820\n",
"2 3 1340 1609 720 54010\n",
"3 4 1451 1748 620 53730\n",
"4 5 1770 2111 450 46620\n",
".. ... ... ... ... ...\n",
"891 892 1582 1910 1080 66390\n",
"892 893 1387 1663 850 82080\n",
"893 894 1200 1436 1060 76440\n",
"894 895 1299 1560 770 96610\n",
"895 896 1174 1429 1110 54340\n",
"\n",
"[896 rows x 5 columns]"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"shop = pd.read_csv(\"Stores.csv\")\n",
"\n",
"shop"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Store ID int64\n",
"Store_Area int64\n",
"Items_Available int64\n",
"Daily_Customer_Count int64\n",
"Store_Sales int64\n",
"dtype: object"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"shop.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Analyze the contents of each dataset. What are the objects of observation? What are the attributes of the objects? Are there relationships between the objects?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Stroke risk dataset\n",
"\n",
"Object of observation: patients.\n",
"\n",
"The attributes are listed above.\n",
"\n",
"2. Car price dataset\n",
"\n",
"Object of observation: cars.\n",
"\n",
"The attributes are listed above.\n",
"\n",
"3. Supermarket dataset\n",
"\n",
"Object of observation: the supermarket's stores.\n",
"\n",
"The attributes are listed above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Give examples of business goals that the selected datasets could help achieve. What is the effect for the business?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Stroke risk dataset\n",
"\n",
"Business goal: develop an early-warning system for stroke based on the analysis of patient data.\n",
"\n",
"Effect for the business:\n",
"\n",
"Better patient health: fewer strokes thanks to early detection of risk.\n",
"\n",
"Lower costs: reduced spending on stroke treatment and rehabilitation.\n",
"\n",
"2. Car price prediction dataset\n",
"\n",
"Business goal: optimize pricing and improve the car sales strategy.\n",
"\n",
"Effect for the business:\n",
"\n",
"Higher profit: competitive car prices set on the basis of data analysis.\n",
"\n",
"Better inventory planning: less surplus stock and optimized deliveries.\n",
"\n",
"3. Supermarket dataset\n",
"\n",
"Business goal: optimize the assortment and improve customer service based on the analysis of store traffic and sales.\n",
"\n",
"Effect for the business:\n",
"\n",
"Higher sales: stocking the items that are most popular with customers.\n",
"\n",
"Lower costs: optimized store area and product placement.\n",
"\n",
"Higher customer satisfaction: a better shopping experience through more efficient product layout and service."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Give examples of technical project goals for each business goal identified above. What is the input, and what is the target feature?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Stroke risk dataset\n",
"\n",
"Business goal: develop an early-warning system for stroke.\n",
"\n",
"Technical project goal: build a machine learning model that predicts the likelihood of a stroke.\n",
"\n",
"Input data:\n",
"\n",
"- Gender\n",
"- Age\n",
"- Presence of hypertension\n",
"- Presence of heart disease\n",
"- Marital status\n",
"- Type of work\n",
"- Type of residence\n",
"- Average glucose level\n",
"- Body mass index\n",
"- Smoking status\n",
"\n",
"Target feature: presence of a stroke (stroke).\n",
"\n",
"2. Car price prediction dataset\n",
"\n",
"Business goal: optimize pricing and improve the car sales strategy.\n",
"\n",
"Technical project goal: build a model that predicts the price of a car from its characteristics.\n",
"\n",
"Input data:\n",
"\n",
"- Manufacturer\n",
"- Model\n",
"- Production year\n",
"- Category\n",
"- Levy\n",
"- Leather interior\n",
"- Fuel type\n",
"- Engine displacement\n",
"- Mileage\n",
"- Number of cylinders\n",
"- Transmission type\n",
"- and the remaining characteristics of the car\n",
"\n",
"Target feature: car price (Price).\n",
"\n",
"3. Supermarket dataset\n",
"\n",
"Business goal: optimize the assortment and improve customer service.\n",
"\n",
"Technical project goal: develop an analytics platform for analyzing store traffic and sales.\n",
"\n",
"Input data:\n",
"\n",
"- Physical area of the store\n",
"- Number of available items\n",
"- Average number of customers\n",
"\n",
"Target feature: sales (Store_Sales)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Identify the problems of the selected datasets: noise, bias, relevance, outliers, data leakage.\n",
"### 7. Give examples of solutions to the problems found for each dataset"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# 1. Noise check: missing values as a percentage of all values\n",
"def check_noise(dataframe):\n",
"    total_values = dataframe.size\n",
"    missing_values = dataframe.isnull().sum().sum()\n",
"    noise_percentage = (missing_values / total_values) * 100\n",
"    return f\"Noise: {noise_percentage:.2f}%\"\n",
"\n",
"# 2. Bias check: share of unique values in the given column (a rough proxy only)\n",
"def check_bias(dataframe, target_column):\n",
"    if target_column in dataframe.columns:\n",
"        unique_values = dataframe[target_column].nunique()\n",
"        total_values = len(dataframe)\n",
"        bias_percentage = (unique_values / total_values) * 100\n",
"        return f\"Bias for {target_column}: {bias_percentage:.2f}% unique values\"\n",
"    return \"Target column not found.\"\n",
"\n",
"# 3. Duplicate check: share of fully duplicated rows\n",
"def check_duplicates(dataframe):\n",
"    duplicate_percentage = dataframe.duplicated().mean() * 100\n",
"    return f\"Duplicates: {duplicate_percentage:.2f}%\"\n",
"\n",
"# 4. Outlier check: share of rows outside the 1.5 * IQR fences\n",
"def check_outliers(dataframe, column):\n",
"    if column in dataframe.columns:\n",
"        Q1 = dataframe[column].quantile(0.25)\n",
"        Q3 = dataframe[column].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"        outlier_count = dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)].shape[0]\n",
"        total_count = dataframe.shape[0]\n",
"        outlier_percentage = (outlier_count / total_count) * 100\n",
"        return f\"Outliers in {column}: {outlier_percentage:.2f}%\"\n",
"    return f\"Column {column} not found.\"\n",
"\n",
"# 5. Data leakage check: absolute correlation of numeric columns with the given column\n",
"def check_data_leakage(dataframe, target_column):\n",
"    if target_column in dataframe.columns:\n",
"        correlation_matrix = dataframe.select_dtypes(include=[np.number]).corr()\n",
"        leakage_info = correlation_matrix[target_column].abs().nlargest(10)\n",
"        leakage_report = \", \".join([f\"{feature}: {value:.2f}\" for feature, value in leakage_info.items() if feature != target_column])\n",
"        return f\"Data leakage indicators: {leakage_report}\"\n",
"    return \"Target column not found.\""
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Noise: 0.33%\n",
"Bias for avg_glucose_level: 77.87% unique values\n",
"Duplicates: 0.00%\n",
"Outliers in avg_glucose_level: 12.27%\n",
"Data leakage indicators: age: 0.25, heart_disease: 0.13, avg_glucose_level: 0.13, hypertension: 0.13, bmi: 0.04, id: 0.01\n"
]
}
],
"source": [
"# Strokes\n",
"noise_columns = check_noise(strokes)\n",
"bias_info = check_bias(strokes, 'avg_glucose_level')\n",
"duplicate_count = check_duplicates(strokes)\n",
"outliers_data = check_outliers(strokes, 'avg_glucose_level')\n",
"leakage_info = check_data_leakage(strokes, 'stroke')\n",
"\n",
"print(noise_columns)\n",
"print(bias_info)\n",
"print(duplicate_count)\n",
"print(outliers_data)\n",
"print(leakage_info)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Noise: 0.00%\n",
"Bias for Price: 12.03% unique values\n",
"Duplicates: 1.63%\n",
"Outliers in Prod. year: 5.10%\n",
"Data leakage indicators: Prod. year: 0.24, Cylinders: 0.18, ID: 0.02, Price: 0.01\n"
]
}
],
"source": [
"# Cars\n",
"noise_columns = check_noise(auto)\n",
"bias_info = check_bias(auto, 'Price')\n",
"duplicate_count = check_duplicates(auto)\n",
"outliers_data = check_outliers(auto, 'Prod. year')\n",
"# Note: correlations are computed against 'Airbags' here, not against the 'Price' target\n",
"leakage_info = check_data_leakage(auto, 'Airbags')\n",
"\n",
"print(noise_columns)\n",
"print(bias_info)\n",
"print(duplicate_count)\n",
"print(outliers_data)\n",
"print(leakage_info)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Noise: 0.00%\n",
"Bias for Items_Available: 68.75% unique values\n",
"Duplicates: 0.00%\n",
"Outliers in Store_Sales: 0.11%\n",
"Data leakage indicators: Store_Area: 0.04, Items_Available: 0.04, Store ID : 0.01, Store_Sales: 0.01\n"
]
}
],
"source": [
"# Stores\n",
"noise_columns = check_noise(shop)\n",
"bias_info = check_bias(shop, 'Items_Available')\n",
"duplicate_count = check_duplicates(shop)\n",
"outliers_data = check_outliers(shop, 'Store_Sales')\n",
"# Note: correlations are computed against 'Daily_Customer_Count' here, not against the 'Store_Sales' target\n",
"leakage_info = check_data_leakage(shop, 'Daily_Customer_Count')\n",
"\n",
"print(noise_columns)\n",
"print(bias_info)\n",
"print(duplicate_count)\n",
"print(outliers_data)\n",
"print(leakage_info)"
]
},
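{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `check_bias` helper defined earlier reports the share of unique values in a column, which is only a rough proxy for bias. A minimal sketch, assuming a hypothetical toy frame `toy` with a binary `stroke` column (illustrative values, not the real dataset) and a hypothetical helper `class_balance`, of how the class balance of a target could be measured with `Series.value_counts(normalize=True)`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for the stroke dataset (not real records).\n",
"toy = pd.DataFrame({\"stroke\": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})\n",
"\n",
"def class_balance(dataframe, target_column):\n",
"    # Share of rows per class of target_column, in percent.\n",
"    return dataframe[target_column].value_counts(normalize=True) * 100\n",
"\n",
"balance = class_balance(toy, \"stroke\")\n",
"print(balance)"
]
},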
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9. Fix the missing-data problem. Use a different method for each dataset: deletion, substitution of a constant value (0 or similar), substitution of the mean value"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 0\n",
"gender 0\n",
"age 0\n",
"hypertension 0\n",
"heart_disease 0\n",
"ever_married 0\n",
"work_type 0\n",
"Residence_type 0\n",
"avg_glucose_level 0\n",
"bmi 201\n",
"smoking_status 0\n",
"stroke 0\n",
"dtype: int64"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Инсульт\n",
"\n",
"strokes.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"strokes['bmi'] = strokes['bmi'].fillna(strokes['bmi'].mean())"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 0\n",
"gender 0\n",
"age 0\n",
"hypertension 0\n",
"heart_disease 0\n",
"ever_married 0\n",
"work_type 0\n",
"Residence_type 0\n",
"avg_glucose_level 0\n",
"bmi 0\n",
"smoking_status 0\n",
"stroke 0\n",
"dtype: int64"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"strokes.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ID 0\n",
"Price 0\n",
"Levy 0\n",
"Manufacturer 0\n",
"Model 0\n",
"Prod. year 0\n",
"Category 0\n",
"Leather interior 0\n",
"Fuel type 0\n",
"Engine volume 0\n",
"Mileage 0\n",
"Cylinders 0\n",
"Gear box type 0\n",
"Drive wheels 0\n",
"Doors 0\n",
"Wheel 0\n",
"Color 0\n",
"Airbags 0\n",
"dtype: int64"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Store ID 0\n",
"Store_Area 0\n",
"Items_Available 0\n",
"Daily_Customer_Count 0\n",
"Store_Sales 0\n",
"dtype: int64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"shop.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"# удалить\n",
"shop = shop.dropna()\n",
"\n",
"# заполнить значением\n",
"shop['Items_Avialable'] = shop['Items_Avialable'].fillna(5000)"
]
},
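{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three strategies named in the task (deletion, constant fill, mean fill) can be compared side by side. The cell below is only a minimal sketch on a small synthetic frame: `demo` and its columns `a` and `b` are hypothetical and are not part of the lab datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame with gaps; not one of the lab datasets\n",
"demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, np.nan]})\n",
"\n",
"# Deletion: keep only the rows without missing values\n",
"dropped = demo.dropna()\n",
"\n",
"# Constant fill: replace every gap with 0\n",
"constant = demo.fillna(0)\n",
"\n",
"# Mean fill: replace each gap with the mean of its own column\n",
"mean_filled = demo.fillna(demo.mean())\n",
"\n",
"print(len(dropped))\n",
"print(constant['a'].tolist())\n",
"print(mean_filled['b'].tolist())"
]
},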
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 10. Выполнить разбиение каждого набора данных на обучающую, контрольную и тестовую выборки"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shop Dataset:\n",
"Train: 79.91%\n",
"Validation: 10.04%\n",
"Test: 10.04%\n",
"\n"
]
}
],
"source": [
"# Разбиение shop\n",
"original_shop_size = len(shop)\n",
"train_shop, temp_shop = train_test_split(shop, test_size=0.2, random_state=42)\n",
"val_shop, test_shop = train_test_split(temp_shop, test_size=0.5, random_state=42)\n",
"\n",
"print(\"Shop Dataset:\")\n",
"print(f\"Train: {len(train_shop)/original_shop_size*100:.2f}%\")\n",
"print(f\"Validation: {len(val_shop)/original_shop_size*100:.2f}%\")\n",
"print(f\"Test: {len(test_shop)/original_shop_size*100:.2f}%\\n\")\n"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Strokes Dataset:\n",
"Train: 80.00%\n",
"Validation: 10.00%\n",
"Test: 10.00%\n",
"\n"
]
}
],
"source": [
"# Разбиение strokes\n",
"original_strokes_size = len(strokes)\n",
"train_strokes, temp_strokes = train_test_split(strokes, test_size=0.2, random_state=42)\n",
"val_strokes, test_strokes = train_test_split(temp_strokes, test_size=0.5, random_state=42)\n",
"\n",
"print(\"Strokes Dataset:\")\n",
"print(f\"Train: {len(train_strokes)/original_strokes_size*100:.2f}%\")\n",
"print(f\"Validation: {len(val_strokes)/original_strokes_size*100:.2f}%\")\n",
"print(f\"Test: {len(test_strokes)/original_strokes_size*100:.2f}%\\n\")\n"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Auto Dataset:\n",
"Train: 80.00%\n",
"Validation: 10.00%\n",
"Test: 10.00%\n"
]
}
],
"source": [
"# Разбиение auto\n",
"original_auto_size = len(auto)\n",
"train_auto, temp_auto = train_test_split(auto, test_size=0.2, random_state=42)\n",
"val_auto, test_auto = train_test_split(temp_auto, test_size=0.5, random_state=42)\n",
"\n",
"print(\"Auto Dataset:\")\n",
"print(f\"Train: {len(train_auto)/original_auto_size*100:.2f}%\")\n",
"print(f\"Validation: {len(val_auto)/original_auto_size*100:.2f}%\")\n",
"print(f\"Test: {len(test_auto)/original_auto_size*100:.2f}%\")"
]
},
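{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a binary target such as `stroke`, a purely random split may give the parts slightly different class ratios. The sketch below shows one possible alternative, a stratified 80/10/10 split using the `stratify` parameter of `train_test_split`. The frame `demo` is synthetic and hypothetical; the cells above do not use stratification."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical imbalanced frame: 90 negatives, 10 positives\n",
"demo = pd.DataFrame({'x': range(100), 'stroke': [0] * 90 + [1] * 10})\n",
"\n",
"# 80/10/10 split that keeps the class ratio at each step\n",
"train, temp = train_test_split(demo, test_size=0.2, random_state=42, stratify=demo['stroke'])\n",
"val, test = train_test_split(temp, test_size=0.5, random_state=42, stratify=temp['stroke'])\n",
"\n",
"# Print the size and the positive-class share of every part\n",
"for name, part in [('train', train), ('val', val), ('test', test)]:\n",
"    print(name, len(part), part['stroke'].mean())"
]
},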
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 11. Оценить сбалансированность выборок для каждого набора данных. Оценить необходимость использования методов приращения (аугментации) данных.\n",
"### 12. Выполнить приращение данных методами выборки с избытком (oversampling) и выборки с недостатком (undersampling). Должны быть представлены примеры реализации обоих методов для выборок каждого набора данных."
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAq4AAAHWCAYAAAC2Zgs3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB/eklEQVR4nO3dd3hT1f8H8HdGk850T7qgdEDZZZU9iiwRFEUQFARx4VcQRcWFuABFERGcLBEHIOAERUBkj0LLKlAKpaU7Ld07Ob8/avMjtEApKTdp36/nyfM0956c+05v0nx6c+65MiGEABERERGRmZNLHYCIiIiIqC5YuBIRERGRRWDhSkREREQWgYUrEREREVkEFq5EREREZBFYuBIRERGRRWDhSkREREQWgYUrEREREVkEFq5EREREZBFYuBIRNaD33nsPer0eAKDX6zFv3jyJE9GtOHnyJDZv3my4HxMTg99//126QBbgzTffhEwmk2Tb/fr1Q5s2bSTZNt0ZLFzplqxatQoymcxws7a2RkhICJ555hlkZGRIHY/I7KxevRoLFy7E5cuX8eGHH2L16tVSR6JbUFBQgCeeeAIHDhxAfHw8pk+fjhMnTkgdq14CAwON/n5f77Zq1Sqpo9aQlZWF6dOnIywsDDY2NvDw8EDXrl3x0ksvobCwUOp4dAfJhBBC6hBkOVatWoVHH30Ub731Fpo3b47S0lLs2bMHa9asQUBAAE6ePAlbW1upYxKZjR9//BGPPPIIysvLoVar8e233+L++++XOhbdglGjRuHnn38GAISEhGDfvn1wdXWVONWt27x5s1GR98cff+D777/HokWL4ObmZljeo0cPtGjRot7bqaysRGVlJaytrW8rb7WcnBx07NgR+fn5mDx5MsLCwpCdnY3jx4/jt99+w/HjxxEYGAig6oirVqvFyZMnTbJtMj9KqQOQZRo6dCg6d+4MAHjsscfg6uqKjz76CD///DPGjRsncToi8/Hggw+if//+OH/+PIKDg+Hu7i51JLpFmzdvxunTp1FSUoK2bdtCpVJJHaleRo0aZXQ/PT0d33//PUaNGmUo/GpTVFQEOzu7Om9HqVRCqTRdebF8+XIkJSVh79696NGjh9G6/Px8i90fVD8cKkAmMWDAAADAxYsXAVT9h/zCCy+gbdu2sLe3h0ajwdChQxEbG1vjsaWlpXjzzTcREhICa2treHt747777kNCQgIAIDEx8YZfa/Xr18/Q1z///AOZTIYff/wRr7zyCry8vGBnZ4d77rkHycnJNbZ98OBBDBkyBI6OjrC1tUXfvn2xd+/eWp9jv379at3+m2++WaPtt99+i4iICNjY2MDFxQVjx46tdfs3em5X0+v1+PjjjxEeHg5ra2t4enriiSeewJUrV4zaBQYG4u67766xnWeeeaZGn7Vl/+CDD2r8TgGgrKwMc+bMQcuWLaFWq+Hn54cXX3wRZWVltf6urna9MWcLFy6ETCZDYmKi0fLc3FzMmDEDfn5+UKvVaNmyJRYsWGAYJ3q16rF0194mTZpk1C4lJQWTJ0+Gp6cn1Go1wsPDsWLFCqM21a+d6ptarUZISAjmzZuHa7+YOnbsGIYOHQqNRgN7e3sMHDgQBw4cMGpTPawmMTERHh4e6NGjB1xdXdGuXbs6fR177bCcm73ubuU5mvL9Ub0PPDw8UFFRYbTu+++/N+TVarVG67Zs2YLevXvDzs4ODg4OGD58OE6dOmXUZtKkSbC3t6+Ra8OGDZDJZPjnn38My271dbZs2TKEh4dDrVbDx8cH06ZNQ25urlGbfv36Gd4LrVu3RkREBGJjY2t9j97I9fbh1fmvfs512d8bNmxA586d4eDgYNRu4cKFdc5Vm+rfeUJCAoYNGwYHBweMHz8eALB792488MAD8Pf3N/wdeO6551BSUmLUR21jXGUyGZ555hls3rwZbdq0MbxGt27detNMCQkJUCgU6N69e411Go2m1iO7p0
+fRv/+/WFra4tmzZrh/fffr9EmMzMTU6ZMgaenJ6ytrdG+ffsaQ3mq/0YvXLgQixYtQkBAAGxsbNC3b18e1ZUIj7iSSVQXmdVfn124cAGbN2/GAw88gObNmyMjIwNffPEF+vbti9OnT8PHxwcAoNPpcPfdd2P79u0YO3Yspk+fjoKCAmzbtg0nT55EUFCQYRvjxo3DsGHDjLY7e/bsWvO8++67kMlkeOmll5CZmYmPP/4YUVFRiImJgY2NDQBgx44dGDp0KCIiIjBnzhzI5XKsXLkSAwYMwO7du9G1a9ca/fr6+hpOriksLMRTTz1V67Zff/11jBkzBo899hiysrKwZMkS9OnTB8eOHYOTk1ONxzz++OPo3bs3AGDjxo3YtGmT0fonnnjCMEzj2WefxcWLF/Hpp5/i2LFj2Lt3L6ysrGr9PdyK3NzcWk8c0uv1uOeee7Bnzx48/vjjaNWqFU6cOIFFixbh3LlzRieu3K7i4mL07dsXKSkpeOKJJ+Dv7499+/Zh9uzZSEtLw8cff1zr49asWWP4+bnnnjNal5GRge7duxs+ON3d3bFlyxZMmTIF+fn5mDFjhlH7V155Ba1atUJJSYmhwPPw8MCUKVMAAKdOnULv3r2h0Wjw4osvwsrKCl988QX69euHXbt2oVu3btd9fmvWrLnl8ZHVw3Kq1fa6u9Xn2BDvj4KCAvz222+49957DctWrlwJa2trlJaW1vg9TJw4EYMHD8aCBQtQXFyMzz77DL169cKxY8duePTPFN58803MnTsXUVFReOqpp3D27Fl89tlnOHz48E3fTy+99FK9tjlo0CA88sgjAIDDhw/jk08+uW5bNzc3LFq0yHD/4YcfNlq/f/9+jBkzBu3bt8f8+fPh6OgIrVZb47VfX5WVlRg8eDB69eqFhQsXGoZ/rV+/HsXFxXjqqafg6uqKQ4cOYcmSJbh8+TLWr19/03737NmDjRs34umnn4aDgwM++eQTjB49GklJSTccehEQEACdTmd43dzMlStXMGTIENx3330YM2YMNmzYgJdeeglt27bF0KFDAQAlJSXo168fzp8/j2eeeQbNmzfH+vXrMWnSJOTm5mL69OlGfX7zzTcoKCjAtGnTUFpaisWLF2PAgAE4ceIEPD09b5qJTEgQ3YKVK1cKAOLvv/8WWVlZIjk5Wfzwww/C1dVV2NjYiMuXLwshhCgtLRU6nc7osRcvXhRqtVq89dZbhmUrVqwQAMRHH31UY1t6vd7wOADigw8+qNEmPDxc9O3b13B/586dAoBo1qyZyM/PNyxft26dACAWL15s6Ds4OFgMHjzYsB0hhCguLhbNmzcXgwYNqrGtHj16iDZt2hjuZ2VlCQBizpw5hmWJiYlCoVCId9991+ixJ06cEEqlssby+Ph4AUCsXr3asGzOnDni6rfm7t27BQCxdu1ao8du3bq1xvKAgAAxfPjwGtmnTZsmrn27X5v9xRdfFB4eHiIiIsLod7pmzRohl8vF7t27jR7/+eefCwBi7969NbZ3tb59+4rw8PAayz/44AMBQFy8eNGw7O233xZ2dnbi3LlzRm1ffvlloVAoRFJSktHyV199VchkMqNlAQEBYuLEiYb7U6ZMEd7e3kKr1Rq1Gzt2rHB0dBTFxcVCiP9/7ezcudPQprS0VMjlcvH0008blo0aNUqoVCqRkJBgWJaamiocHBxEnz59DMuq3yvVz6+0tFT4+/uLoUOHCgBi5cqVNX9ZV6l+/OHDh42W1/a6u9XnaMr3R/Xrddy4ceLuu+82LL906ZKQy+Vi3LhxAoDIysoSQghRUFAgnJycxNSpU42ypqenC0dHR6PlEydOFHZ2djV+N+vXr6+xr+r6OsvMzBQqlUrcddddRn+jPv30UwFArFixwqjPq98Lf/zxhwAghgwZUuP9dD3l5eUCgHjmmWdumL/a+PHjRfPmzY2WXbu/Z8+eLQCItLQ0w7Ib/Z28ntregxMnThQAxMsvv1yjffXr6G
rz5s0TMplMXLp0ybDs2r9h1c9BpVKJ8+fPG5bFxsYKAGLJkiU3zJmeni7c3d0FABEWFiaefPJJ8d1334nc3Nwabfv
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAq4AAAHWCAYAAAC2Zgs3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB3/klEQVR4nO3dd3hTZf8G8DtJm3TvPSmlUChllWHZCIIsQZBlUZaIgq/gAERExIWDV0FEwAWogDIEVGQLyC6zrAKlFLpHuvdInt8f/TUvoQMoaZO09+e6ckHOefKcb05Okrsn5zxHIoQQICIiIiIycFJ9F0BERERE9CAYXImIiIjIKDC4EhEREZFRYHAlIiIiIqPA4EpERERERoHBlYiIiIiMAoMrERERERkFBlciIiIiMgoMrkRERERkFBhciYj+38cffwy1Wg0AUKvVWLx4sZ4roodx+fJlbN++XXP/woUL2Llzp/4KMkASiQTvvfee5v7atWshkUhw+/bt+z62SZMmmDhxok7rmThxIpo0aaLTPnWtd+/eaN26tb7LoP/H4NqAVXwgVdzMzMzQvHlzvPLKK0hJSdF3eUQGZ926dViyZAni4+Px3//+F+vWrdN3SfQQcnNzMW3aNJw8eRJRUVGYOXMmLl26pO+yauXVV1+FRCLBzZs3q20zf/58SCQSXLx4sR4re3iJiYl47733cOHCBX2XoiUtLQ0zZ85EYGAgzM3N4eLigs6dO2Pu3LnIy8vTd3lUDRN9F0B17/3334efnx+Kiopw9OhRrFy5En///TcuX74MCwsLfZdHZDDef/99PP/885g7dy4UCgV++eUXfZdEDyE0NFRzA4DmzZtj6tSpeq6qdsLCwrB8+XJs2LAB7777bpVtNm7ciODgYLRp06bWy3nuuecwduxYKBSKWvdxP4mJiVi0aBGaNGmCdu3aac377rvvNL9y1KeMjAx07NgROTk5mDx5MgIDA5Geno6LFy9i5cqVePnll2FlZVXvddH9Mbg2AgMHDkTHjh0BAC+88AIcHR3xxRdfYMeOHRg3bpyeqyMyHGPGjEGfPn1w8+ZNBAQEwNnZWd8l0UPavn07rl69isLCQgQHB0Mul+u7pFrp0qULmjVrho0bN1YZXE+cOIGYmBh88sknj7QcmUwGmUz2SH08ClNTU70s94cffkBsbCyOHTuGrl27as3Lyckx2u2mMeChAo3Q448/DgCIiYkBUP6X55tvvong4GBYWVnBxsYGAwcORERERKXHFhUV4b333kPz5s1hZmYGd3d3jBgxAtHR0QCA27dvax2ecO+td+/emr4OHToEiUSC3377DW+//Tbc3NxgaWmJp556CnFxcZWWferUKTz55JOwtbWFhYUFevXqhWPHjlX5HHv37l3l8u8+tqvCL7/8gpCQEJibm8PBwQFjx46tcvk1Pbe7qdVqLF26FEFBQTAzM4OrqyumTZuGzMxMrXZNmjTBkCFDKi3nlVdeqdRnVbV//vnnldYpABQXF2PhwoVo1qwZFAoFvL29MWfOHBQXF1e5ru5W3bFcS5YsqfI4uKysLMyaNQve3t5QKBRo1qwZPv300yr3oLz33ntVrrt7j5lLSEjA5MmT4erqCoVCgaCgIPz4449abSq2nYqbQqFA8+bNsXjxYgghtNqeP38eAwcOhI2NDaysrNC3b1+cPHlSq83dx/m5uLiga9eucHR0RJs2bSCRSLB27doa19u9h+Xcb7t7mOeoy/dHxWvg4uKC0tJSrXkbN27U1KtUKrXm7dq1Cz169IClpSWsra0xePBgXLlyRavNxIkTq9xDtWXLFkgkEhw6dEgz7WG3s2+++QZBQUFQKBTw8PDAjBkzkJWVpdWmd+/emvdCq1atEBISgoiIiCrfozWp7jW8u/67n/ODvN5btmxBx44dYW1trdVuyZIlNdYSFhaGa9eu4dy5c5XmbdiwARKJBOPGjUNJSQneffddhISEwNbWFpaWlu
jRowcOHjx43+db1TGuQgh8+OGH8PLygoWFBfr06VPp9QYe7Lvj0KFD6NSpEwBg0qRJmude8Z6q6hjX/Px8vPHGG5rPlRYtWmDJkiWV3tsSiQSvvPIKtm/fjtatW2veS7t3777v846OjoZMJsNjjz1WaZ6NjQ3MzMwqTb969Sr69OkDCwsLeHp64rPPPqvUJjU1FVOmTIGrqyvMzMzQtm3bSoccVXyXLFmyBF9++SV8fX1hbm6OXr164fLly/etvbHjHtdGqCJkOjo6AgBu3bqF7du3Y9SoUfDz80NKSgpWr16NXr164erVq/Dw8AAAqFQqDBkyBAcOHMDYsWMxc+ZM5ObmYt++fbh8+TL8/f01yxg3bhwGDRqktdx58+ZVWc9HH30EiUSCuXPnIjU1FUuXLkW/fv1w4cIFmJubAwD++ecfDBw4ECEhIVi4cCGkUinWrFmDxx9/HEeOHEHnzp0r9evl5aU5uSYvLw8vv/xylctesGABRo8ejRdeeAFpaWlYvnw5evbsifPnz8POzq7SY1588UX06NEDAPD7779j27ZtWvOnTZuGtWvXYtKkSXj11VcRExODr7/+GufPn8exY8d0sochKyuryhOH1Go1nnrqKRw9ehQvvvgiWrZsiUuXLuHLL7/EjRs3tE5ceVQFBQXo1asXEhISMG3aNPj4+OD48eOYN28ekpKSsHTp0iof9/PPP2v+/9prr2nNS0lJwWOPPab5QnJ2dsauXbswZcoU5OTkYNasWVrt3377bbRs2RKFhYWagOfi4oIpU6YAAK5cuYIePXrAxsYGc+bMgampKVavXo3evXvj8OHD6NKlS7XP7+eff37o4yMrDsupUNV297DPsS7eH7m5ufjrr7/w9NNPa6atWbMGZmZmKCoqqrQeJkyYgAEDBuDTTz9FQUEBVq5cie7du+P8+fN1fmLNe++9h0WLFqFfv354+eWXcf36daxcuRKnT5++7/tp7ty5tVrmE088geeffx4AcPr0aXz11VfVtnVycsKXX36puf/cc89pzT9x4gRGjx6Ntm3b4pNPPoGtrS2USmWlbb8qYWFhWLRoETZs2IAOHTpopqtUKmzatAk9evSAj48PlEolvv/+e4wbNw5Tp05Fbm4ufvjhBwwYMADh4eGVfp6/n3fffRcffvghBg0ahEGDBuHcuXPo378/SkpKtNo9yHdHy5Yt8f777+Pdd9/V+uy8dy9nBSEEnnrqKRw8eBBTpkxBu3btsGfPHsyePRsJCQla6xoAjh49it9//x3Tp0+HtbU1vvrqK4wcORKxsbGa77iq+Pr6QqVSabbv+8nMzMSTTz6JESNGYPTo0diyZQvmzp2L4OBgDBw4EABQWFiI3r174+bNm3jllVfg5+eHzZs3Y+LEicjKysLMmTO1+vzpp5+Qm5uLGTNmoKioCMuWLcPjjz+OS5cuwdXV9b41NVqCGqw1a9YIAGL//v0iLS1NxMXFiV9//VU4OjoKc3NzER8fL4QQoqioSKhUKq3HxsTECIVCId5//33NtB9//FEAEF988UWlZanVas3jAIjPP/+8UpugoCDRq1cvzf2DBw8KAMLT01Pk5ORopm/atEkAEMuWLdP0HRAQIAYMGKBZjhBCFBQUCD8/P/HEE09UWlbXrl1F69atNffT0tIEALFw4ULNtNu3bwuZTCY++ugjrcdeunRJmJiYVJoeFRUlAIh169Zppi1cuFDc/TY6cuSIACDWr1+v9djdu3dXmu7r6ysGDx5cqfYZM2aIe9+a99Y+Z84c4eLiIkJCQrTW6c8//yykUqk4cuSI1uNXrVolAIhjx45VWt7devXqJYKCgipN//zzzwUAERMTo5n2wQcfCEtLS3Hjxg2ttm+99ZaQyWQiNjZWa/r8+fOFRCLRmubr6ysmTJiguT9lyhTh7u4ulEqlVruxY8cKW1tbUVBQIIT437Zz8OBBTZuioiIhlUrF9OnTNdOGDx8u5HK5iI6O1kxLTEwU1tbWom
fPnpppFe+ViudXVFQkfHx8xMCBAwUAsWbNmsor6y4Vjz99+rTW9Kq2u4d9jrp8f1Rsr+PGjRNDhgzRTL9z546QSqV
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAArgAAAHWCAYAAACc1vqYAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB4zklEQVR4nO3dd3hTZf8G8PskaZPuvTeFUii7DAsyRQEBRVGUFxTELQ7EF3nRVxEHQ1FRUBw/BQeKDMUNspE9yyyllJbSvZvutMnz+6M2L6EFShs4SXp/risX5OTJkzs5J8m3J895jiSEECAiIiIishEKuQMQEREREZkTC1wiIiIisikscImIiIjIprDAJSIiIiKbwgKXiIiIiGwKC1wiIiIisikscImIiIjIprDAJSIiIiKbwgKXiIiIiGwKC1wiIjOZO3cuDAYDAMBgMGDevHkyJ6JrceLECaxbt854PT4+Hr///rt8gQiDBg1Cp06d5I5BVogFLl3W8uXLIUmS8aLRaBAVFYWnn34aOTk5cscjsjhfffUVFi5ciPT0dLz77rv46quv5I5E16C0tBSPP/449u7di6SkJDz33HM4fvy43LGaJTw83OTz+3KX5cuXm+Xx5s6da/LHwdXk5eXhueeeQ3R0NBwcHODr64vevXtj5syZKCsrM0smat1Ucgcgy/f6668jIiICVVVV2LlzJ5YuXYo//vgDJ06cgKOjo9zxiCzG66+/jgcffBAzZ86EWq3Gt99+K3ckugZxcXHGCwBERUXh0UcflTlV8yxatMikUPzjjz/w/fff4/3334e3t7dxed++fc3yeHPnzsU999yDMWPGXLVtYWEhevbsCa1WiylTpiA6OhoFBQU4duwYli5diieffBLOzs5myUWtFwtcuqoRI0agZ8+eAIBHHnkEXl5eeO+99/Dzzz9j/PjxMqcjshz33XcfBg8ejLNnz6Jdu3bw8fGROxJdo3Xr1uHUqVOorKxE586dYW9vL3ekZrm00MzOzsb333+PMWPGIDw8XJZM9b744gukpaVh165dDQpsrVZrta85WRYOUaBrNmTIEABASkoKgLq/xv/973+jc+fOcHZ2hqurK0aMGIGjR482uG9VVRVee+01REVFQaPRICAgAHfffTeSk5MBAKmpqVf8OW3QoEHGvrZt2wZJkvDDDz/gpZdegr+/P5ycnHDHHXfgwoULDR573759GD58ONzc3ODo6IiBAwdi165djT7HQYMGNfr4r732WoO23377LWJjY+Hg4ABPT0/cf//9jT7+lZ7bxQwGAxYtWoSYmBhoNBr4+fnh8ccfR1FRkUm78PBwjBo1qsHjPP300w36bCz7O++80+A1BYDq6mrMnj0bbdu2hVqtRkhICF588UVUV1c3+lpd7HLj5RYuXAhJkpCammqyvLi4GNOmTUNISAjUajXatm2LBQsWGMexXuy1115r9LWbPHmySbuMjAxMmTIFfn5+UKvViImJwZdffmnSpn7bqb+o1WpERUVh3rx5EEKYtD1y5AhGjBgBV1dXODs745ZbbsHevXtN2tQP50lNTYWvry/69u0LLy8vdOnSpUk/A186HOhq2921PEdzvj/q14Gvry9qampMbvv++++NefPz801u+/PPP9G/f384OTnBxcUFI0eOxMmTJ03aTJ48udG9dmvWrIEkSdi2bZtx2bVuZx9//DFiYmKgVqsRGBiIqVOnori42KTNoEGDjO+Fjh07IjY2FkePHm30PXoll1uHF+e/+Dk3ZX2vWbMGPXv2hIuLi0m7hQsXNjnX5TTl8yspKQljx46Fv78/NBoNgoODcf/996OkpMT4nMvLy/HVV19d9n15seTkZCiVStx0000NbnN1dYVGo2mw/NSpUxg8eDAcHR0RFBSEt99+u0Gb3NxcPPzww/Dz84NGo0HXrl0bDBOq/xxeuHAh3n//fYSFhcHBwQEDBw7EiRMnmvKSkZXgHly6ZvXFqJeXFw
Dg3LlzWLduHe69915EREQgJycHn376KQYOHIhTp04hMDAQAKDX6zFq1Chs3rwZ999/P5577jmUlpZi48aNOHHiBCIjI42PMX78eNx+++0mjztr1qxG87z11luQJAkzZ85Ebm4uFi1ahKFDhyI+Ph4ODg4AgC1btmDEiBGIjY3F7NmzoVAosGzZMgwZMgR///03evfu3aDf4OBg40FCZWVlePLJJxt97FdeeQXjxo3DI488gry8PCxevBgDBgzAkSNH4O7u3uA+jz32GPr37w8A+PHHH/HTTz+Z3P74449j+fLleOihh/Dss88iJSUFS5YswZEjR7Br1y7Y2dk1+jpci+Li4kYPgDIYDLjjjjuwc+dOPPbYY+jQoQOOHz+O999/H2fOnLmmMXZXU1FRgYEDByIjIwOPP/44QkNDsXv3bsyaNQtZWVlYtGhRo/f75ptvjP9//vnnTW7LycnBTTfdBEmS8PTTT8PHxwd//vknHn74YWi1WkybNs2k/UsvvYQOHTqgsrLSWAj6+vri4YcfBgCcPHkS/fv3h6urK1588UXY2dnh008/xaBBg7B9+3b06dPnss/vm2++uebxm/XDgeo1tt1d63O8Hu+P0tJS/Pbbb7jrrruMy5YtWwaNRoOqqqoGr8OkSZMwbNgwLFiwABUVFVi6dCluvvlmHDly5LrvTXzttdcwZ84cDB06FE8++SQSExOxdOlSHDhw4Krvp5kzZzbrMW+99VY8+OCDAIADBw7gww8/vGxbb29vvP/++8brDzzwgMnte/bswbhx49C1a1fMnz8fbm5uyM/Pb7DtN0dTPr90Oh2GDRuG6upqPPPMM/D390dGRgZ+++03FBcXw83NDd988w0eeeQR9O7dG4899hgAmHyeXyosLAx6vd64bVxNUVERhg8fjrvvvhvjxo3DmjVrMHPmTHTu3BkjRowAAFRWVmLQoEE4e/Ysnn76aURERGD16tWYPHkyiouL8dxzz5n0+fXXX6O0tBRTp05FVVUVPvjgAwwZMgTHjx+Hn59fC15VshiC6DKWLVsmAIhNmzaJvLw8ceHCBbFy5Urh5eUlHBwcRHp6uhBCiKqqKqHX603um5KSItRqtXj99deNy7788ksBQLz33nsNHstgMBjvB0C88847DdrExMSIgQMHGq9v3bpVABBBQUFCq9Ual69atUoAEB988IGx73bt2olhw4YZH0cIISoqKkRERIS49dZbGzxW3759RadOnYzX8/LyBAAxe/Zs47LU1FShVCrFW2+9ZXLf48ePC5VK1WB5UlKSACC++uor47LZs2eLi9+Gf//9twAgVqxYYXLf9evXN1geFhYmRo4c2SD71KlTxaVv7Uuzv/jii8LX11fExsaavKbffPONUCgU4u+//za5/yeffCIAiF27djV4vIsNHDhQxMTENFj+zjvvCAAiJSXFuOyNN94QTk5O4syZMyZt//Of/wilUinS0tJMlr/88stCkiSTZWFhYWLSpEnG6w8//LAICAgQ+fn5Ju3uv/9+4ebmJioqKoQQ/9t2tm7damxTVVUlFAqFeOqpp4zLxowZI+zt7UVycrJxWWZmpnBxcREDBgwwLqt/r9Q/v6qqKhEaGipGjBghAIhly5Y1fLEuUn//AwcOmCxvbLu71udozvdH/fY6fvx4MWrUKOPy8+fPC4VCIcaPHy8AiLy8PCGEEKWlpcLd3V08+uijJlmzs7OFm5ubyfJJkyYJJyenBq/N6tWrG6yrpm5nubm5wt7eXtx2220mn1FLliwRAMSXX35p0ufF74U//vhDABDDhw9v8H66HJ1OJwCIp59++or5602YMEFERESYLLt0fc+aNUsAEFlZWcZlV/qcvJxLX5umfn4dOXJEABCrV6++Yv9OTk4m78Uryc7OFj4+PgKAiI6OFk888YT47rvvRHFxcYO2AwcOFADE119/bVxWXV0t/P39xdixY43LFi1aJACIb7/91rhMp9OJuLg44ezsbHwP1L
92F3+HCSHEvn37BADx/PPPN+k5kOXjEAW6qqFDh8LHxwchISG4//774ezsjJ9++glBQUEAALVaDYWiblPS6/UoKCi
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def plot_sample_balance(y, sample_name):\n",
" plt.figure(figsize=(8, 5))\n",
" sns.histplot(y, bins=30, kde=True)\n",
" plt.title(f'Распределение целевой переменной для {sample_name}')\n",
" plt.xlabel(sample_name)\n",
" plt.ylabel('Частота')\n",
" plt.show()\n",
"\n",
"# Оценка сбалансированности выборок\n",
"plot_sample_balance(train_shop['Store_Sales'], 'Train Shop')\n",
"plot_sample_balance(val_shop['Store_Sales'], 'Validation Shop')\n",
"plot_sample_balance(test_shop['Store_Sales'], 'Test Shop')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Распределения выборок у данного датасета выглядят схоже. Это говорит о сбалансированности выборок. "
]
},
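{
"cell_type": "markdown",
"metadata": {},
"source": [
"The visual impression can be backed by summary statistics. The cell below is a minimal sketch that compares the mean, standard deviation and quartiles of three samples. The series in `samples` are synthetic, hypothetical stand-ins; in this notebook one would pass the `Store_Sales` columns of the three shop samples instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.default_rng(42)\n",
"\n",
"# Hypothetical stand-ins for the train / validation / test target columns\n",
"samples = {\n",
"    'train': pd.Series(rng.normal(60000, 17000, 700)),\n",
"    'val': pd.Series(rng.normal(60000, 17000, 100)),\n",
"    'test': pd.Series(rng.normal(60000, 17000, 100)),\n",
"}\n",
"\n",
"# One row per sample: close values suggest similar distributions\n",
"summary = pd.DataFrame({name: s.describe() for name, s in samples.items()}).T\n",
"print(summary[['mean', 'std', '25%', '50%', '75%']].round(1))"
]
},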
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsAAAAHWCAYAAAB5SD/0AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABj2ElEQVR4nO3dd1gU1/4G8Hcpu9QFpSOIiAVB1Iht7QVFRaPRxBiNJbaY4L1REzUkxhprTNRYk5sYvAnGFk2xI7aoWIKi2IgFxbYgGlh6Pb8/vMzPlSLgwqL7fp5nnrgzZ898Z2d38zJ7ZkYmhBAgIiIiIjIQRvougIiIiIioKjEAExEREZFBYQAmIiIiIoPCAExEREREBoUBmIiIiIgMCgMwERERERkUBmAiIiIiMigMwERERERkUBiAiYiIiMigMAATEenJ/PnzUVBQAAAoKCjAggUL9FwRlceFCxfw66+/So+jo6Oxc+dO/RX0Apg1axZkMpm+y9C5kSNHwsrKSt9lUDkwAJPOhIaGQiaTSZOZmRkaNGiACRMmICEhQd/lEVU769evx5IlS3Dnzh18+eWXWL9+vb5LonJITU3Fu+++ixMnTuDq1av44IMPEBMTo++yKqROnTpa398lTaGhofoutYgHDx7ggw8+gLe3N8zNzeHo6IhWrVph2rRpSEtLk9pt2LABy5Yt01+hVK3IhBBC30XQyyE0NBTvvPMO5syZA09PT2RlZeHo0aP48ccf4eHhgQsXLsDCwkLfZRJVG5s2bcLw4cORk5MDhUKBn376Ca+//rq+y6Jy6N+/P3777TcAQIMGDXD8+HHY2dnpuary+/XXX7XC4q5du/Dzzz9j6dKlsLe3l+a3bdsWdevWrfB68vLykJeXBzMzs+eqt9CjR4/wyiuvQKPRYNSoUfD29sbDhw9x/vx57NixA+fPn0edOnUAAH369MGFCxdw8+ZNnaz7SSNHjsTWrVu1XkOq3kz0XQC9fHr16oUWLVoAAMaMGQM7Ozt89dVX+O233/DWW2/puTqi6uPNN99Ely5dcO3aNdSvXx8ODg76LonK6ddff8WlS5eQmZkJPz8/yOVyfZdUIf3799d6rFar8fPPP6N///5SgCxOeno6LC0ty7weExMTmJjoLnp8//33iI+Px7Fjx9C2bVutZRqNpsL7IysrC3K5HEZG/KH8ZcU9S5Wua9euAIC4uDgAj/9i/+ijj+Dn5wcrKysolUr06tUL586dK/LcrKwszJo1Cw0aNICZmRlcXFwwYMAAXL9+HQBw8+bNUn+u69y5s9TXoUOHIJPJsGnTJnzyySdwdnaGpaUlXn31Vdy+fbvIuk+ePImePXvCxsYGFhYW6NSpE44dO1bsNnbu3LnY9c+aNatI259++gn+/v4wNzdHzZo1MXjw4GLXX9q2PamgoADLli2Dr68vzMzM4OTkhHfffRf//POPVrs6deqgT58+RdYzYcKEIn0WV/sXX3xR5DUFgOzsbMycORP16tWDQqGAu7s7pk6diuzs7GJfqyd17twZjRs3LjJ/yZIlkMlkRY7UJCcnY+LEiXB3d4dCoUC9evWwaNEiaRztkwrHGj49jRw5Uqvd3bt3MWrUKDg5OUGhUMDX1xfr1q3TalP43imcFAoFGjRogAULFuDpH9HOnj2LXr16QalUwsrKCt26dcOJEye02hQOF7p58yYcHR3Rtm1b2NnZoUmTJmX6mfnp4UbPet+VZxt1+fko3AeOjo7Izc3VWvbzzz9L9SYlJWkt2717Nzp06ABLS0tYW1sjKCgIFy9e1GpT0pjLrVu3QiaT4dChQ9K88r7PVq9eDV9fXygUCri6uiI4OBjJyclabTp37ix9Fnx8fODv749z584V+xktTUn78Mn6n9zmsuzvrVu3okWLFrC2ttZqt2TJkjLXVZzC1/z69evo3bs3rK2tMXToUADAn3/+iTfeeAO1a9eWvgcmTZqEzMxMrT6KGwMsk8kwYc
IE/Prrr2jcuLH0Ht2zZ88za7p+/TqMjY3Rpk2bIsuUSqV0pLlz587YuXMnbt26Jb0ehcG+8L2/ceNGTJ8+HbVq1YKFhQU0Gg0AYMuWLdJ3tr29Pd5++23cvXv3mbVFR0fDwcEBnTt3lo4Ml+WzCAArVqyAr68vLCwsUKNGDbRo0QIbNmx45jqp7HgEmCpdYVgt/Fnwxo0b+PXXX/HGG2/A09MTCQkJ+Oabb9CpUydcunQJrq6uAID8/Hz06dMHERERGDx4MD744AOkpqYiPDwcFy5cgJeXl7SOt956C71799Zab0hISLH1zJs3DzKZDNOmTUNiYiKWLVuGgIAAREdHw9zcHABw4MAB9OrVC/7+/pg5cyaMjIzwww8/oGvXrvjzzz/RqlWrIv26ublJJzGlpaXhvffeK3bdn332GQYNGoQxY8bgwYMHWLFiBTp27IizZ8/C1ta2yHPGjRuHDh06AAC2bduG7du3ay1/9913peEn//73vxEXF4eVK1fi7NmzOHbsGExNTYt9HcojOTm52BO0CgoK8Oqrr+Lo0aMYN24cGjVqhJiYGCxduhR///231glCzysjIwOdOnXC3bt38e6776J27do4fvw4QkJCcP/+/RLH9v3444/SvydNmqS1LCEhAW3atJH+B+zg4IDdu3dj9OjR0Gg0mDhxolb7Tz75BI0aNUJmZqYUFB0dHTF69GgAwMWLF9GhQwcolUpMnToVpqam+Oabb9C5c2ccPnwYrVu3LnH7fvzxx3KPHy0cblSouPddebexMj4fqamp2LFjB1577TVp3g8//AAzMzNkZWUVeR1GjBiBwMBALFq0CBkZGVizZg3at2+Ps2fPlno0UhdmzZqF2bNnIyAgAO+99x5iY2OxZs0anD59+pmfp2nTplVond27d8fw4cMBAKdPn8bXX39dYlt7e3ssXbpUejxs2DCt5ZGRkRg0aBCaNm2KhQsXwsbGBklJSUXe+xWVl5eHwMBAtG/fHkuWLJGGtW3ZsgUZGRl47733YGdnh1OnTmHFihW4c+cOtmzZ8sx+jx49im3btuH999+HtbU1vv76awwcOBDx8fGlDinx8PBAfn6+9L4pyaeffoqUlBTcuXNHev2e/gNq7ty5kMvl+Oijj5CdnQ25XC59t7Zs2RILFixAQkICli9fjmPHjpX4nQ083o+BgYFo0aIFfvvtN5ibm5f5s/if//wH//73v/H666/jgw8+QFZWFs6fP4+TJ09iyJAhz3wtqYwEkY788MMPAoDYv3+/ePDggbh9+7bYuHGjsLOzE+bm5uLOnTtCCCGysrJEfn6+1nPj4uKEQqEQc+bMkeatW7dOABBfffVVkXUVFBRIzwMgvvjiiyJtfH19RadOnaTHBw8eFABErVq1hEajkeZv3rxZABDLly+X+q5fv74IDAyU1iOEEBkZGcLT01N07969yLratm0rGjduLD1+8OCBACBmzpwpzbt586YwNjYW8+bN03puTEyMMDExKTL/6tWrAoBYv369NG/mzJniyY/tn3/+KQCIsLAwrefu2bOnyHwPDw8RFBRUpPbg4GDx9FfB07VPnTpVODo6Cn9/f63X9McffxRGRkbizz//1Hr+2rVrBQBx7NixIut7UqdOnYSvr2+R+V988YUAIOLi4qR5c+fOFZaWluLvv//Wavvxxx8LY2NjER8frzX/008/FTKZTGueh4eHGDFihPR49OjRwsXFRSQlJWm1Gzx4sLCxsREZGRlCiP9/7xw8eFBqk5WVJYyMjMT7778vzevfv7+Qy+Xi+vXr0rx79+4Ja2tr0bFjR2le4WelcPuysrJE7dq1Ra9evQQA8cMPPxR9sZ5Q+PzTp09rzS/ufVfebdTl56Pw/frWW2+JPn36SPNv3boljIyMxFtvvSUAiAcPHgghhEhNTRW2trZi7NixWrWq1WphY2OjNX/EiBHC0tKyyGuzZcuWIvuqrO+zxMREIZfLRY8ePbS+o1auXCkAiHXr1m
n1+eRnYdeuXQKA6NmzZ5HPU0lycnIEADFhwoRS6y80dOhQ4enpqTXv6f0dEhIiAIj79+9L80r7nixJcZ/BESNGCAD
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAArcAAAHWCAYAAABt3aEVAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABcrElEQVR4nO3deZyNdf/H8fc5s5vVNpst+5alxjZkSZiYlJsiSSqpRIWSZKci+SVZW1E3KUXuFEKobMlSsoVobDNMmsUy+/f3h3vO7ZjBzBgOl9fz8bgenO/1Pdf1uc51rnPec53vuY7NGGMEAAAAWIDd1QUAAAAAhYVwCwAAAMsg3AIAAMAyCLcAAACwDMItAAAALINwCwAAAMsg3AIAAMAyCLcAAACwDMItAAAALINwCwCSXn/9dWVlZUmSsrKyNHbsWBdXhPz4/fff9dVXXzlub9u2Td98843rCroO2Ww2jRw50nF71qxZstlsOnjw4GXve8stt+jRRx8t1HoeffRR3XLLLYW6zBvJo48+Kj8/P1eXYUmEW4vKftHKnry9vVWlShX17dtXcXFxri4PuO7Mnj1bEyZM0OHDh/V///d/mj17tqtLQj4kJyfrqaee0oYNG7R37149//zz2r59u6vLKpDnnntONptN+/btu2ifIUOGyGaz6bfffruGleXf0aNHNXLkSG3bts3VpTg5ceKEnn/+eVWrVk0+Pj4KDg5WgwYNNGjQIJ06dcrRb+7cuXr77bddVygKhHBrcaNHj9Ynn3yiKVOmqHHjxpo+fboiIyN15swZV5cGXFdGjx6tYcOGqUyZMho2bJheffVVV5eEfIiMjHRMVapUUWxsrHr16uXqsgqkW7duks4Fq4v59NNPVatWLdWuXbvA6+nevbvOnj2rcuXKFXgZl3P06FGNGjUq13D7/vvva8+ePVdt3Rdz8uRJ1atXTx9//LGio6P1zjvvaMCAAapUqZKmT5+u+Ph4R1/C7Y3J3dUF4Opq27at6tWrJ0l64oknVLx4cb311ltatGiRunbt6uLqgOtHly5ddOedd2rfvn2qXLmySpYs6eqSkE9fffWVdu7cqbNnz6pWrVry9PR0dUkF0rBhQ1WqVEmffvqphg8fnmP++vXrdeDAAY0bN+6K1uPm5iY3N7crWsaV8PDwcMl6P/zwQ8XExGjt2rVq3Lix07ykpKQCP29SUlLk6ekpu53zhq7GHrjJtGzZUpJ04MABSef+gn3xxRdVq1Yt+fn5KSAgQG3bttWvv/6a474pKSkaOXKkqlSpIm9vb4WFhaljx47av3+/JOngwYNOQyEunFq0aOFY1urVq2Wz2fTZZ5/plVdeUWhoqHx9fXXvvffq0KFDOda9ceNG3X333QoMDFSRIkXUvHlzrV27NtdtbNGiRa7rP3+sWbZ///vfioiIkI+Pj4oVK6YHH3ww1/VfatvOl5WVpbfffls1a9aUt7e3QkJC9NRTT+mff/5x6nfLLbfonnvuybGevn375lhmbrW/+eabOR5TSUpNTdWIESNUqVIleXl5qUyZMnrppZeUmpqa62N1vhYtWujWW2/N0T5hwoRcx+UlJCSoX79+KlOmjLy8vFSpUiW98cYbjnGr5xs5cmSuj92FY/iOHDmixx9/XCEhIfLy8lLNmjX10UcfOfXJfu5kT15eXqpSpYrGjh0rY4xT361bt6pt27YKCAiQn5+f7rrrLm3YsMGpz/njDoODg9W4cWMVL15ctWvXls1m06xZsy75uF04BOhyz7v8bGNhHh/Z+yA4OFjp6elO8z799FNHveeftZKkJUuWqGnTpvL19ZW/v7+io6O1Y8cOpz4XGzv4xRdfyGazafXq1Y62/D7Ppk2bppo1a8rLy0vh4eHq06ePEhISnPq0aNHCcSzUqFFDERER+vXXX3M9Ri/lYvvw/PrP3+a87O8vvvhC9erVk7+/v1O/CRMmXLKWbt26affu3dqyZUuOeXPnzpXNZl
PXrl2Vlpam4cOHKyIiQoGBgfL19VXTpk21atWqy25vbmNujTF69dVXVbp0aRUpUkR33nlnjv0t5e29Y/Xq1apfv74k6bHHHnNse/YxlduY29OnT+uFF15wvK5UrVpVEyZMyHFs22w29e3bV1999ZVuvfVWx7G0dOnSy273/v375ebmpkaNGuWYFxAQIG9vb0nnnlfffPON/vrrL0ft2fVmH6Pz5s3T0KFDVapUKRUpUkRJSUmSpPnz5zveW0qUKKGHH35YR44cuWxt27ZtU8mSJdWiRQvH8Ii8vGZI0uTJk1WzZk0VKVJERYsWVb169S559t/KOHN7k8kOosWLF5ck/fnnn/rqq6/0wAMPqHz58oqLi9O7776r5s2ba+fOnQoPD5ckZWZm6p577tHKlSv14IMP6vnnn1dycrKWL1+u33//XRUrVnSso2vXrmrXrp3TegcPHpxrPa+99ppsNpsGDRqk48eP6+2331arVq20bds2+fj4SJK+//57tW3bVhERERoxYoTsdrtmzpypli1b6scff1SDBg1yLLd06dKOLwSdOnVKvXv3znXdw4YNU+fOnfXEE0/oxIkTmjx5spo1a6atW7cqKCgox32efPJJNW3aVJK0YMECLVy40Gn+U089pVmzZumxxx7Tc889pwMHDmjKlCnaunWr1q5dWyhnKhISEnL9slNWVpbuvfde/fTTT3ryySdVvXp1bd++XRMnTtQff/zh9GWbK3XmzBk1b95cR44c0VNPPaWyZctq3bp1Gjx4sI4dO3bRj/E++eQTx//79+/vNC8uLk6NGjVyvGmVLFlSS5YsUc+ePZWUlKR+/fo59X/llVdUvXp1nT171hECg4OD1bNnT0nSjh071LRpUwUEBOill16Sh4eH3n33XbVo0UJr1qxRw4YNL7p9n3zySb7Ha44ePVrly5d33M7teZffbbwax0dycrIWL16sf/3rX462mTNnytvbWykpKTkehx49eigqKkpvvPGGzpw5o+nTp+uOO+7Q1q1br/qXgUaOHKlRo0apVatW6t27t/bs2aPp06dr06ZNlz2eBg0aVKB1tm7dWo888ogkadOmTXrnnXcu2rdEiRKaOHGi43b37t2d5q9fv16dO3dWnTp1NG7cOAUGBio+Pj7Hcz833bp106hRozR37lzdfvvtjvbMzEx9/vnnatq0qcqWLav4+Hh98MEH6tq1q3r16qXk5GR9+OGHioqK0s8//6y6devma/uHDx+uV199Ve3atVO7du20ZcsWtWnTRmlpaU798vLeUb16dY0ePVrDhw93eu288GxpNmOM7r33Xq1atUo9e/ZU3bp1tWzZMg0cOFBHjhxxeqwl6aefftKCBQv0zDPPyN/fX++88446deqkmJgYx3tcbsqVK6fMzEzH8/tihgwZosTERB0+fNix7gv/iBszZow8PT314osvKjU1VZ6eno73gPr162vs2LGKi4vTpEmTtHbt2ou+t0jnnm9RUVGqV6+eFi1aJB8fnzy/Zrz//vt67rnndP/99+v5559XSkqKfvvtN23cuFEPPfTQRbfRsgwsaebMmUaSWbFihTlx4oQ5dOiQmTdvnilevLjx8fExhw8fNsYYk5KSYjIzM53ue+DAAePl5WVGjx7taPvoo4+MJPPWW2/lWFdWVpbjfpLMm2++maNPzZo1TfPmzR23V61aZSSZUqVKmaSkJEf7559/biSZSZMmOZZduXJlExUV5ViPMcacOXPGlC9f3rRu3TrHuho3bmxuvfVWx+0TJ04YSWbEiBGOtoMHDxo3Nzfz2muvOd13+/btxt3dPUf73r17jSQze/ZsR9uIESPM+YfQjz/+aCSZOXPmON136dKlOdrLlStnoqOjc9Tep08fc+FheWHtL730kgkODjYRERFOj+knn3xi7Ha7+fHHH53uP2PGDCPJrF27Nsf6zte8eXNTs2bNHO1vvvmmkWQOHDjgaBszZozx9fU1f/zxh1Pfl19+2bi5uZ
mYmBin9iFDhhibzebUVq5cOdOjRw/H7Z49e5qwsDATHx/v1O/BBx80gYGB5syZM8aY/z13Vq1a5eiTkpJi7Ha7eea
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAArcAAAHWCAYAAABt3aEVAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABW5UlEQVR4nO3deZyN5f/H8feZfTMzBrPZ9z1qhLEWMiQtlCWJSCUqlErZSlFSKmupkEjRolSyhMqWtWRLUmObYdKszH79/vCd83PMDDNjONxez8fjfnCu+zr3/bnPfe4z73Of69zHZowxAgAAACzAxdkFAAAAAMWFcAsAAADLINwCAADAMgi3AAAAsAzCLQAAACyDcAsAAADLINwCAADAMgi3AAAAsAzCLQAAACyDcAsAl8H48eOVnZ0tScrOztaECROcXBEK4/fff9eXX35pv71jxw598803zisIl1Tfvn3l5+fn7DJQRIRbFMmcOXNks9nsk5eXl2rUqKHBgwcrNjbW2eUBV5y5c+dq0qRJOnz4sF5//XXNnTvX2SWhEJKSkvTwww9r48aN2r9/v5544gnt3LnT2WUVSaVKlRxev/Ob5syZUyzrGz9+vMMbgws5ceKEnnjiCdWqVUve3t4KDg5W48aN9cwzzyg5Odneb8GCBXrzzTeLpUZYi80YY5xdBK4+c+bM0QMPPKAXX3xRlStXVmpqqn7++WfNmzdPFStW1O+//y4fHx9nlwlcMT755BPdf//9Sk9Pl6enpz766CPdfffdzi4LhXDnnXdqyZIlkqQaNWpo/fr1KlWqlJOrKrwvv/zSISR+++23+vjjjzV58mSVLl3a3t6sWTNVqVLlotfn5+enu+++u0Bh+eTJk7r++uuVmJiofv36qVatWvr333/122+/aenSpfrtt99UqVIlSdJtt92m33//XX///fdF13iuvn37avHixQ6PE64ebs4uAFe3jh07qlGjRpKkBx98UKVKldIbb7yhJUuWqGfPnk6uDrhydO/eXTfffLP+/PNPVa9eXWXKlHF2SSikL7/8Urt379bp06dVv359eXh4OLukIrnzzjsdbsfExOjjjz/WnXfeaQ+OzvL+++8rOjpa69atU7NmzRzmJSYmFvkxT01NlYeHh1xc+MD6WsBeRrFq06aNJOngwYOSzrwLf+qpp1S/fn35+fnJ399fHTt21K+//prrvqmpqRo7dqxq1KghLy8vhYWFqUuXLjpw4IAk6e+//z7vR2g33XSTfVlr1qyRzWbTJ598oueee06hoaHy9fXV7bffrkOHDuVa96ZNm9ShQwcFBATIx8dHrVu31rp16/LcxptuuinP9Y8dOzZX348++kgRERHy9vZWUFCQevTokef6z7dtZ8vOztabb76punXrysvLSyEhIXr44Yf133//OfSrVKmSbrvttlzrGTx4cK5l5lX7a6+9lusxlaS0tDSNGTNG1apVk6enp8qXL6+nn35aaWlpeT5WZ7vppptUr169XO2TJk2SzWbLdfYlPj5eQ4YMUfny5eXp6alq1arp1VdftY9bPdvYsWPzfOz69u3r0O/IkSPq16+fQkJC5Onpqbp16+qDDz5w6JPz3MmZPD09VaNGDU2YMEHnftC1fft2dezYUf7+/vLz81Pbtm21ceNGhz45Q3j+/vtvBQcHq1mzZipVqpSuu+66An30e+4QoAs97wqzjcV5fOTsg+DgYGVkZDjM+/jjj+31xsXFOcz77rvv1LJlS/n6+qpEiRLq1KmTdu3a5dAnv/GPixcvls1m05o1a+xthX2eTZ8+XXXr1pWnp6fCw8M1aNAgxcfHO/S56aab7MdCnTp1FBERoV9//TXPY/R88tuHZ9d/9jYXZH8vXrxYjRo1UokSJRz6TZo0qcB15acgr1/79+9X165dFRoaKi8vL5UrV049evRQQkKCfZtTUlI0d+7cfI/Lsx04cECurq5q2rRprnn+/v7y8vKSdGaffP
PNN/rnn3/sy80J5jnP74ULF2rkyJEqW7asfHx8lJiYKElatGiRfbtKly6t++67T0eOHLng47Fjxw6VKVNGN910k/2MbkGON0maMmWK6tatKx8fH5UsWVKNGjXSggULLrhOFA1nblGscoJozkd1f/31l7788kvdc889qly5smJjY/XOO++odevW2r17t8LDwyVJWVlZuu2227Rq1Sr16NFDTzzxhJKSkrRixQr9/vvvqlq1qn0dPXv21K233uqw3hEjRuRZz8svvyybzaZnnnlGx48f15tvvql27dppx44d8vb2liT98MMP6tixoyIiIjRmzBi5uLho9uzZatOmjX766Sc1btw413LLlStn/0JQcnKyBg4cmOe6R40apW7duunBBx/UiRMnNGXKFLVq1Urbt29XYGBgrvs89NBDatmypSTp888/1xdffOEw/+GHH7YPCXn88cd18OBBTZ06Vdu3b9e6devk7u6e5+NQGPHx8Xl+2Sk7O1u33367fv75Zz300EOqXbu2du7cqcmTJ+uPP/4o1Ji6Czl16pRat26tI0eO6OGHH1aFChW0fv16jRgxQseOHct3nN28efPs/x86dKjDvNjYWDVt2lQ2m02DBw9WmTJl9N1336l///5KTEzUkCFDHPo/99xzql27tk6fPm0PgcHBwerfv78kadeuXWrZsqX8/f319NNPy93dXe+8845uuukmrV27Vk2aNMl3++bNm1fo8Zo5Q4By5PW8K+w2XorjIykpSUuXLtVdd91lb5s9e7a8vLyUmpqa63Ho06ePoqKi9Oqrr+rUqVOaMWOGWrRooe3bt1/ys4hjx47VCy+8oHbt2mngwIHat2+fZsyYoc2bN1/weHrmmWeKtM5bbrlF999/vyRp8+bNevvtt/PtW7p0aU2ePNl+u3fv3g7zN2zYoG7duqlBgwZ65ZVXFBAQoLi4uFzP/aIoyOtXenq6oqKilJaWpscee0yhoaE6cuSIli5dqvj4eAUEBGjevHl68MEH1bhxYz300EOS5PB6fq6KFSsqKyvL/tzIz/PPP6+EhAQdPnzY/hid+wZo3Lhx8vDw0FNPPaW0tDR5eHjYXz9vvPFGTZgwQbGxsXrrrbe0bt26fF+XpTP7KioqSo0aNdKSJUvk7e1d4ONt1qxZevzxx3X33XfriSeeUGpqqn777Tdt2rRJ9957byH2CgrMAEUwe/ZsI8msXLnSnDhxwhw6dMgsXLjQlCpVynh7e5vDhw8bY4xJTU01WVlZDvc9ePCg8fT0NC+++KK97YMPPjCSzBtvvJFrXdnZ2fb7STKvvfZarj5169Y1rVu3tt9evXq1kWTKli1rEhMT7e2ffvqpkWTeeust+7KrV69uoqKi7OsxxphTp06ZypUrm1tuuSXXupo1a2bq1atnv33ixAkjyYwZM8be9vfffxtXV1fz8ssvO9x3586dxs3NLVf7/v37jSQzd+5ce9uYMWPM2YfoTz/9ZCSZ+fPnO9x32bJludorVqxoOnXqlKv2QYMGmXMP+3Nrf/rpp01wcLCJiIhweEznzZtnXFxczE8//eRw/5kzZxpJZt26dbnWd7bWrVubunXr5mp/7bXXjCRz8OBBe9u4ceOMr6+v+eOPPxz6Pvvss8bV1dVER0c7tD///PPGZrM5tFWsWNH06dPHfrt///4mLCzMxMXFOfTr0aOHCQgIMKdOnTLG/P9zZ/Xq1fY+qampxsXFxTz66KP2tjvvvNN4eHiYAwcO2NuOHj1qSpQoYVq1amVvyzlWcrYvNTXVVKhQwXTs2NFIMrNnz879YJ0l5/6bN292aM/reVfYbSzO4yPn+dqzZ09z22232dv/+ecf4+LiYnr27GkkmRMnThhjjElKSjKBgYFmwIABDrXGxMSYgIAAh/Y+ffoYX1/fXI/NokWLcu2rgj7Pjh8/bjw8PEz79u0dXqOmTp1qJJkPPvjAYZlnHwvffvutkWQ6dOiQ63jKT3p6upFkBg8efN76c/
Tq1ctUrlzZoe3c/T1ixAgjyRw7dszedr7Xyfyc+9gU9PVr+/btRpJZtGjReZfv6+vrcCyeT0xMjClTpoyRZGrVqmU
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_sample_balance(train_strokes['stroke'], 'Train Strokes')\n",
"plot_sample_balance(val_strokes['stroke'], 'Validation Strokes')\n",
"plot_sample_balance(test_strokes['stroke'], 'Test Strokes')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Выборки выглядят схоже, но у всех трех имеется явный дисбаланс классов. Это проблема, т.к в дальнейшем не сможем обучить какую-либо модель."
]
},
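{
"cell_type": "markdown",
"metadata": {},
"source": [
"Class imbalance can be measured with `value_counts(normalize=True)` and reduced by resampling the training sample. The sketch below uses plain pandas random resampling as a simple stand-in for oversampling and undersampling; it is not necessarily the method used elsewhere in this lab. The frame `train` is synthetic and hypothetical. Only the training sample should be resampled, so that validation and test keep the real class ratio."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical imbalanced training frame: 95 negatives, 5 positives\n",
"train = pd.DataFrame({'x': range(100), 'stroke': [0] * 95 + [1] * 5})\n",
"print(train['stroke'].value_counts(normalize=True))\n",
"\n",
"majority = train[train['stroke'] == 0]\n",
"minority = train[train['stroke'] == 1]\n",
"\n",
"# Oversampling: draw minority rows with replacement up to the majority size\n",
"oversampled = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=42)])\n",
"\n",
"# Undersampling: keep a random majority subset of the minority size\n",
"undersampled = pd.concat([majority.sample(len(minority), random_state=42), minority])\n",
"\n",
"print(oversampled['stroke'].value_counts().to_dict())\n",
"print(undersampled['stroke'].value_counts().to_dict())"
]
},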
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAtEAAAHWCAYAAACxJNUiAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABf8klEQVR4nO3dd3gU1f7H8c9uQgqEhJqEYCgiTURQ0BikKYGgyL25oghyFTWCBRRE6lWaokhRioDIzwJXsYAKKgjcCAgKkd7bRY2CYEJPIPTk/P7g7sCSACmTtnm/nmcf3ZmzM9/Z2U0+nJw54zDGGAEAAADIMmdBFwAAAAAUNYRoAAAAIJsI0QAAAEA2EaIBAACAbCJEAwAAANlEiAYAAACyiRANAAAAZBMhGgAAAMgmQjQAAACQTYRoAChmXn/9daWnp0uS0tPTNXLkyAKuCNmxdetWzZ0713q+ceNGzZ8/v+AKKgKGDRsmh8NR0GXAwxCiUeRNnz5dDofDevj5+alWrVrq2bOnkpKSCro8oNCZMWOGxo4dqz///FNvvvmmZsyYUdAlIRuOHz+up556Sj///LN2796tXr16acuWLQVdVo5Uq1bN7ef3lR7Tp08v6FKvaMeOHdbvnmPHjuVqW9u3b9ewYcP0+++/21Ib8pbDGGMKugggN6ZPn67HH39cr7zyiqpXr67Tp0/rp59+0kcffaSqVatq69atKlmyZEGXCRQan3/+uR599FGdPXtWvr6++vjjj/XAAw8UdFnIhpiYGH399deSpFq1amnlypUqX758AVeVfXPnztWJEyes5999950+/fRTjRs3ThUqVLCWN2nSRNdff32O93P+/HmdP39efn5+uao3My+99JI++OADHT16VJMmTdKTTz6Z42198cUXevDBB7V06VK1bNnSviKRJ7wLugDALvfcc48aN24sSXryySdVvnx5vfXWW/r666/VuXPnAq4OKDweeugh3XXXXfrll19Us2ZNVaxYsaBLQjbNnTtX27dv16lTp1S/fn35+PgUdEk5EhMT4/Y8MTFRn376qWJiYlStWrUrvi41NVWlSpXK8n68vb3l7W1/5DHG6JNPPtHDDz+shIQEzZw5M1chGkULwzngse6++25JUkJCgiTpyJEj6tu3r+rXr6+AgAAFBgbqnnvu0aZNmzK89vTp0xo2bJhq1aolPz8/VapUSffff79+/fVXSdLvv/9+1T89XtqD8MMPP8jhcOjzzz/Xv/71L4WGhqpUqVL629/+pr1792bY96pVq9S2bVsFBQWpZMmSatGihVasWJHpMbZs2TLT/Q8bNixD248//liNGjWSv7+/ypUrp06dOmW6/6sd26XS09M1fvx41atXT35+fgoJCdFTTz2lo0ePurWrVq2a7rvvvgz76dmzZ4ZtZlb7mDFjMrynknTmzBkNHTpUN9xwg3x9fRUeHq7+/fvrzJkzmb5Xl2rZsqVuuummDMvHjh0rh8OR4U+px44dU+/evRUeHi5fX1/dcMMNGjVqlDWu+FKusZeXPx577DG3dvv27dMTTzyhkJAQ+fr6ql69evrggw/c2rg+O66Hr6+vatWqpZEjR+ryPyJu2LBB99xzjwIDAxUQEKBWrVrp559/dmvjGvr0+++/Kzg4WE2aNFH58uV18803Z+lP5pcPnbrW5y47x2jn98N1DoKDg3Xu3Dm3dZ9++qlV76FDh9zWLViwQM2aNVOpUqVUunRptWvXTtu2bXNr89hjjykgICBDXV988YUcDod++OEHa1l2P2dTpkxRvXr15Ovrq7CwMPXo0SPD8ICWLVta34Ubb7xRjRo10qZNmzL9jl7Nlc7hpfVfesxZOd9ffPGFGjdurNKlS7u1Gzt2bJbryozrPf/111917733qnTp0urSpYsk6ccff9SDDz6oKlWqWD8HXnjhBZ06dcptG5mNiXY4HOrZs6fmzp2rm266yfqMLly4MMu1rVixQr///rs6de
qkTp06afny5frzzz8ztLvSz+Vq1apZPxumT5+uBx98UJJ01113ZXpOsvIZQf6hJxoeyxV4XX/i/O233zR37lw9+OCDql69upKSkvTuu++qRYsW2r59u8LCwiRJaWlpuu+++7R48WJ16tRJvXr10vHjxxUXF6etW7eqRo0a1j46d+6se++9122/gwYNyrSe1157TQ6HQwMGDNCBAwc0fvx4RUVFaePGjfL395ckLVmyRPfcc48aNWqkoUOHyul06sMPP9Tdd9+tH3/8UbfffnuG7V533XXWhWEnTpzQM888k+m+Bw8erI4dO+rJJ5/UwYMH9fbbb6t58+basGGDypQpk+E13bt3V7NmzSRJX331lebMmeO2/qmnnrKG0jz//PNKSEjQpEmTtGHDBq1YsUIlSpTI9H3IjmPHjmV60Vt6err+9re/6aefflL37t1Vt25dbdmyRePGjdN///tft4uucuvkyZNq0aKF9u3bp6eeekpVqlTRypUrNWjQIP31118aP358pq/76KOPrP9/4YUX3NYlJSXpjjvusH6JV6xYUQsWLFBsbKxSUlLUu3dvt/b/+te/VLduXZ06dcoKm8HBwYqNjZUkbdu2Tc2aNVNgYKD69++vEiVK6N1331XLli21bNkyRUREXPH4Pvroo2yPp3UNnXLJ7HOX3WPMi+/H8ePHNW/ePP3jH/+wln344Yfy8/PT6dOnM7wPXbt2VXR0tEaNGqWTJ0/qnXfeUdOmTbVhw4ar9oraYdiwYRo+fLiioqL0zDPPaNeuXXrnnXe0Zs2aa36fBgwYkKN9tm7dWo8++qgkac2aNZo4ceIV21aoUEHjxo2znj/yyCNu6+Pj49WxY0c1aNBAb7zxhoKCgnTo0KEMn/2cOn/+vKKjo9W0aVONHTvWGqI3e/ZsnTx5Us8884zKly+v1atX6+2339aff/6p2bNnX3O7P/30k7766is9++yzKl26tCZOnKgOHTpoz549WRoeM3PmTNWoUUO33XabbrrpJpUsWVKffvqp+vXrl+1jbN68uZ5//nlNnDjR+s5Lsv6bm88I8ogBirgPP/zQSDLff/+9OXjwoNm7d6/57LPPTPny5Y2/v7/5888/jTHGnD592qSlpbm9NiEhwfj6+ppXXnnFWvbBBx8YSeatt97KsK/09HTrdZLMmDFjMrSpV6+eadGihfV86dKlRpKpXLmySUlJsZbPmjXLSDITJkywtl2zZk0THR1t7ccYY06ePGmqV69uWrdunWFfTZo0MTfddJP1/ODBg0aSGTp0qLXs999/N15eXua1115ze+2WLVuMt7d3huW7d+82ksyMGTOsZUOHDjWX/rj48ccfjSQzc+ZMt9cuXLgww/KqVauadu3aZai9R48e5vIfQZfX3r9/fxMcHGwaNWrk9p5+9NFHxul0mh9//NHt9VOnTjWSzIoVKzLs71ItWrQw9erVy7B8zJgxRpJJSEiwlr366qumVKlS5r///a9b24EDBxovLy+zZ88et+UvvfSScTgcbsuqVq1qunbtaj2PjY01lSpVMocOHXJr16lTJxMUFGROnjxpjLn42Vm6dKnV5vTp08bpdJpnn33WWhYTE2N8fHzMr7/+ai3bv3+/KV26tGnevLm1zPVdcR3f6dOnTZUqVcw999xjJJkPP/ww45t1Cdfr16xZ47Y8s89ddo/Rzu+H6/PauXNnc99991nL//jjD+N0Ok3nzp2NJHPw4EFjjDHHjx83ZcqUMd26dXOrNTEx0QQFBbkt79q1qylVqlSG92b27NkZzlVWP2cHDhwwPj4+pk2bNm4/oyZNmmQkmQ8++MBtm5d+F7777jsjybRt2zbD9+lKzp49aySZnj17XrV+ly5dupjq1au7Lbv8fA8aNMhIMn/99Ze17Go/J68ks+9g165djSQzcODADO1dn6NLjRw50jgcDvPHH39Yyy7/GeY6Bh8fH/PLL79YyzZt2mQkmbfffvuatZ49e9aUL1/evPTSS9
ayhx9+2DRo0CBD28vfL5fLfzZc6Txk5zOC/MNwDniMqKgoVaxYUeHh4erUqZMCAgI0Z84cVa5cWZLk6+srp/PCRz4
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_sample_balance(train_auto['Price'], 'Train Auto')\n",
"plot_sample_balance(val_auto['Price'], 'Validation Auto')\n",
"plot_sample_balance(test_auto['Price'], 'Test Auto')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distributions of the three splits look similar, which suggests that the splits are balanced. However, the training split has a noticeably wider range of values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 12. Augment the data using oversampling and undersampling. Examples of both methods must be given for the samples of each dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strokes"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After oversampling (strokes): stroke\n",
"1    4861\n",
"0    4861\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.over_sampling import SMOTE\n",
"\n",
"X_strokes = strokes.drop('stroke', axis=1)\n",
"y_strokes = strokes['stroke']\n",
"\n",
"# Encode categorical features as integer codes\n",
"for column in X_strokes.select_dtypes(include=['object']).columns:\n",
"    X_strokes[column] = X_strokes[column].astype('category').cat.codes\n",
"\n",
"# Apply SMOTE: synthesize minority-class rows until both classes are the same size\n",
"smote = SMOTE(random_state=42)\n",
"X_resampled_strokes, y_resampled_strokes = smote.fit_resample(X_strokes, y_strokes)\n",
"\n",
"# Class counts after resampling\n",
"print(f'After oversampling (strokes): {pd.Series(y_resampled_strokes).value_counts()}')"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After undersampling (strokes): stroke\n",
"0    249\n",
"1    249\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Undersampling for strokes: randomly drop majority-class rows until both classes are the same size\n",
"undersample = RandomUnderSampler(random_state=42)\n",
"X_under_strokes, y_under_strokes = undersample.fit_resample(X_strokes, y_strokes)\n",
"\n",
"print(f'After undersampling (strokes): {pd.Series(y_under_strokes).value_counts()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cars"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stores"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 1, n_samples = 1",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32mc:\\Users\\Матевос\\Desktop\\ИИ.Дырночкин\\ai\\lab2\\lab2.ipynb Cell 55\u001b[0m line \u001b[0;36m1\n\u001b[0;32m <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=7'>8</a>\u001b[0m \u001b[39m# Теперь применяем SMOTE\u001b[39;00m\n\u001b[0;32m <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=8'>9</a>\u001b[0m smote \u001b[39m=\u001b[39m SMOTE(random_state\u001b[39m=\u001b[39m\u001b[39m42\u001b[39m)\n\u001b[1;32m---> <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=9'>10</a>\u001b[0m X_resampled_shop, y_resampled_shop \u001b[39m=\u001b[39m smote\u001b[39m.\u001b[39;49mfit_resample(X_shop, y_shop)\n\u001b[0;32m <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=11'>12</a>\u001b[0m \u001b[39m# Получаем результаты\u001b[39;00m\n\u001b[0;32m <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=12'>13</a>\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39mf\u001b[39m\u001b[39m'\u001b[39m\u001b[39mПосле oversampling (strokes): \u001b[39m\u001b[39m{\u001b[39;00mpd\u001b[39m.\u001b[39mSeries(y_resampled_shop)\u001b[39m.\u001b[39mvalue_counts()\u001b[39m}\u001b[39;00m\u001b[39m'\u001b[39m)\n",
"File \u001b[1;32mc:\\Users\\Матевос\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\imblearn\\base.py:208\u001b[0m, in \u001b[0;36mBaseSampler.fit_resample\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 187\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"Resample the dataset.\u001b[39;00m\n\u001b[0;32m 188\u001b[0m \n\u001b[0;32m 189\u001b[0m \u001b[39mParameters\u001b[39;00m\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 205\u001b[0m \u001b[39m The corresponding label of `X_resampled`.\u001b[39;00m\n\u001b[0;32m 206\u001b[0m \u001b[39m\"\"\"\u001b[39;00m\n\u001b[0;32m 207\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_validate_params()\n\u001b[1;32m--> 208\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39msuper\u001b[39;49m()\u001b[39m.\u001b[39;49mfit_resample(X, y)\n",
"File \u001b[1;32mc:\\Users\\Матевос\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\imblearn\\base.py:112\u001b[0m, in \u001b[0;36mSamplerMixin.fit_resample\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 106\u001b[0m X, y, binarize_y \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_check_X_y(X, y)\n\u001b[0;32m 108\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msampling_strategy_ \u001b[39m=\u001b[39m check_sampling_strategy(\n\u001b[0;32m 109\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msampling_strategy, y, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_sampling_type\n\u001b[0;32m 110\u001b[0m )\n\u001b[1;32m--> 112\u001b[0m output \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_fit_resample(X, y)\n\u001b[0;32m 114\u001b[0m y_ \u001b[39m=\u001b[39m (\n\u001b[0;32m 115\u001b[0m label_binarize(output[\u001b[39m1\u001b[39m], classes\u001b[39m=\u001b[39mnp\u001b[39m.\u001b[39munique(y)) \u001b[39mif\u001b[39;00m binarize_y \u001b[39melse\u001b[39;00m output[\u001b[39m1\u001b[39m]\n\u001b[0;32m 116\u001b[0m )\n\u001b[0;32m 118\u001b[0m X_, y_ \u001b[39m=\u001b[39m arrays_transformer\u001b[39m.\u001b[39mtransform(output[\u001b[39m0\u001b[39m], y_)\n",
"File \u001b[1;32mc:\\Users\\Матевос\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\imblearn\\over_sampling\\_smote\\base.py:389\u001b[0m, in \u001b[0;36mSMOTE._fit_resample\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 386\u001b[0m X_class \u001b[39m=\u001b[39m _safe_indexing(X, target_class_indices)\n\u001b[0;32m 388\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mnn_k_\u001b[39m.\u001b[39mfit(X_class)\n\u001b[1;32m--> 389\u001b[0m nns \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mnn_k_\u001b[39m.\u001b[39;49mkneighbors(X_class, return_distance\u001b[39m=\u001b[39;49m\u001b[39mFalse\u001b[39;49;00m)[:, \u001b[39m1\u001b[39m:]\n\u001b[0;32m 390\u001b[0m X_new, y_new \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_make_samples(\n\u001b[0;32m 391\u001b[0m X_class, y\u001b[39m.\u001b[39mdtype, class_sample, X_class, nns, n_samples, \u001b[39m1.0\u001b[39m\n\u001b[0;32m 392\u001b[0m )\n\u001b[0;32m 393\u001b[0m X_resampled\u001b[39m.\u001b[39mappend(X_new)\n",
"File \u001b[1;32mc:\\Users\\Матевос\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\neighbors\\_base.py:834\u001b[0m, in \u001b[0;36mKNeighborsMixin.kneighbors\u001b[1;34m(self, X, n_neighbors, return_distance)\u001b[0m\n\u001b[0;32m 832\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m 833\u001b[0m inequality_str \u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39mn_neighbors <= n_samples_fit\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m--> 834\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m 835\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mExpected \u001b[39m\u001b[39m{\u001b[39;00minequality_str\u001b[39m}\u001b[39;00m\u001b[39m, but \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m 836\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mn_neighbors = \u001b[39m\u001b[39m{\u001b[39;00mn_neighbors\u001b[39m}\u001b[39;00m\u001b[39m, n_samples_fit = \u001b[39m\u001b[39m{\u001b[39;00mn_samples_fit\u001b[39m}\u001b[39;00m\u001b[39m, \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m 837\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mn_samples = \u001b[39m\u001b[39m{\u001b[39;00mX\u001b[39m.\u001b[39mshape[\u001b[39m0\u001b[39m]\u001b[39m}\u001b[39;00m\u001b[39m\"\u001b[39m \u001b[39m# include n_samples for common tests\u001b[39;00m\n\u001b[0;32m 838\u001b[0m )\n\u001b[0;32m 840\u001b[0m n_jobs \u001b[39m=\u001b[39m effective_n_jobs(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mn_jobs)\n\u001b[0;32m 841\u001b[0m chunked_results \u001b[39m=\u001b[39m \u001b[39mNone\u001b[39;00m\n",
"\u001b[1;31mValueError\u001b[0m: Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 1, n_samples = 1"
]
}
],
"source": [
"X_shop = shop.drop('Store_Sales', axis=1)\n",
"y_shop = shop['Store_Sales']\n",
"\n",
"# Encode categorical features as integer codes\n",
"for column in X_shop.select_dtypes(include=['object']).columns:\n",
"    X_shop[column] = X_shop[column].astype('category').cat.codes\n",
"\n",
"# Apply SMOTE. Store_Sales is a regression target, so each distinct value is treated as a class;\n",
"# a class with a single row has no neighbours to interpolate with, hence the ValueError in this cell's output\n",
"smote = SMOTE(random_state=42)\n",
"X_resampled_shop, y_resampled_shop = smote.fit_resample(X_shop, y_shop)\n",
"\n",
"# Class counts after resampling (not reached, because fit_resample raises)\n",
"print(f'After oversampling (shop): {pd.Series(y_resampled_shop).value_counts()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Only one of the three datasets (strokes) is meant for a classification task. Its class imbalance was handled with both undersampling and oversampling.\n",
"\n",
"The other two datasets have no classes because they are meant for regression tasks (predicting car prices or the size of a supermarket receipt), so augmenting them by class resampling is not required. Applying SMOTE to `Store_Sales` fails for this reason: target values that occur only once become single-row classes, which have no neighbours to interpolate between."
]
}
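,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If both methods still have to be shown on a regression dataset, one possible workaround is to bin the continuous target into groups and resample those groups. The cell below is only a sketch under assumptions: it runs on a synthetic frame rather than on the real car data, and the `Mileage` column, the price thresholds and the `price_group` name are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Synthetic stand-in for a regression dataset (hypothetical values, not the real car data)\n",
"rng = np.random.default_rng(42)\n",
"demo = pd.DataFrame({\n",
"    'Mileage': rng.integers(10_000, 300_000, size=200),\n",
"    'Price': rng.lognormal(mean=9.5, sigma=0.6, size=200),\n",
"})\n",
"\n",
"# Bin the continuous target into three groups using assumed, illustrative thresholds\n",
"bins = [0, 10_000, 25_000, np.inf]\n",
"demo['price_group'] = pd.cut(demo['Price'], bins=bins, labels=['low', 'mid', 'high']).astype(str)\n",
"\n",
"X_demo = demo[['Mileage', 'Price']]\n",
"y_demo = demo['price_group']\n",
"\n",
"# Oversampling duplicates rows of the smaller groups; undersampling drops rows of the larger ones\n",
"X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_demo, y_demo)\n",
"X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_demo, y_demo)\n",
"\n",
"print(f'Before resampling: {y_demo.value_counts()}')\n",
"print(f'After oversampling: {y_over.value_counts()}')\n",
"print(f'After undersampling: {y_under.value_counts()}')"
]
}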
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}