|
{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"# Lab Work No. 2"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"## Analysis of Several Datasets"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### 1. Select three datasets that do not correspond to your assignment variant\n",
"### 2. Analyze the information about each dataset from its Kaggle download page. What is the problem domain?\n",
"\n",
"Stores, car prices, strokes"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"#### Strokes"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"This dataset is used to predict the likelihood of a stroke in a patient from parameters such as gender, age, existing diseases, and smoking status. According to the World Health Organization (WHO), stroke is the second leading cause of death worldwide and is responsible for approximately 11% of all deaths.\n",
"\n",
"Column information\n",
"\n",
"- id: unique patient identifier (int)\n",
"- gender: patient's gender; possible values are \"Male\", \"Female\", or \"Other\" (object, string)\n",
"- age: patient's age (float)\n",
"- hypertension: 0 if the patient does not have hypertension, 1 if the patient does (int)\n",
"- heart_disease: 0 if the patient has no heart disease, 1 if the patient does (int)\n",
"- ever_married: marital status; \"No\" or \"Yes\" (object, string)\n",
"- work_type: type of work; possible values are \"children\", \"Govt_job\" (government job), \"Never_worked\", \"Private\" (private sector), or \"Self-employed\" (object, string)\n",
"- Residence_type: type of residence; \"Rural\" or \"Urban\" (object, string)\n",
"- avg_glucose_level: average blood glucose level (float)\n",
"- bmi: body mass index (BMI) (float)\n",
"- smoking_status: smoking status; possible values are \"formerly smoked\", \"never smoked\", \"smokes\", or \"Unknown\"; \"Unknown\" means that the smoking status is not available for this patient (object, string)\n",
"- stroke: 1 if the patient had a stroke, 0 if not (int)\n",
"\n",
"Each row contains the information about one patient, which makes it possible to analyze the data and build models that predict stroke risk."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 61,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>gender</th>\n",
|
|||
|
" <th>age</th>\n",
|
|||
|
" <th>hypertension</th>\n",
|
|||
|
" <th>heart_disease</th>\n",
|
|||
|
" <th>ever_married</th>\n",
|
|||
|
" <th>work_type</th>\n",
|
|||
|
" <th>Residence_type</th>\n",
|
|||
|
" <th>avg_glucose_level</th>\n",
|
|||
|
" <th>bmi</th>\n",
|
|||
|
" <th>smoking_status</th>\n",
|
|||
|
" <th>stroke</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>9046</td>\n",
|
|||
|
" <td>Male</td>\n",
|
|||
|
" <td>67.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Private</td>\n",
|
|||
|
" <td>Urban</td>\n",
|
|||
|
" <td>228.69</td>\n",
|
|||
|
" <td>36.6</td>\n",
|
|||
|
" <td>formerly smoked</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>51676</td>\n",
|
|||
|
" <td>Female</td>\n",
|
|||
|
" <td>61.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Self-employed</td>\n",
|
|||
|
" <td>Rural</td>\n",
|
|||
|
" <td>202.21</td>\n",
|
|||
|
" <td>NaN</td>\n",
|
|||
|
" <td>never smoked</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>31112</td>\n",
|
|||
|
" <td>Male</td>\n",
|
|||
|
" <td>80.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Private</td>\n",
|
|||
|
" <td>Rural</td>\n",
|
|||
|
" <td>105.92</td>\n",
|
|||
|
" <td>32.5</td>\n",
|
|||
|
" <td>never smoked</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>60182</td>\n",
|
|||
|
" <td>Female</td>\n",
|
|||
|
" <td>49.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Private</td>\n",
|
|||
|
" <td>Urban</td>\n",
|
|||
|
" <td>171.23</td>\n",
|
|||
|
" <td>34.4</td>\n",
|
|||
|
" <td>smokes</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>1665</td>\n",
|
|||
|
" <td>Female</td>\n",
|
|||
|
" <td>79.0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Self-employed</td>\n",
|
|||
|
" <td>Rural</td>\n",
|
|||
|
" <td>174.12</td>\n",
|
|||
|
" <td>24.0</td>\n",
|
|||
|
" <td>never smoked</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5105</th>\n",
|
|||
|
" <td>18234</td>\n",
|
|||
|
" <td>Female</td>\n",
|
|||
|
" <td>80.0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Private</td>\n",
|
|||
|
" <td>Urban</td>\n",
|
|||
|
" <td>83.75</td>\n",
|
|||
|
" <td>NaN</td>\n",
|
|||
|
" <td>never smoked</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5106</th>\n",
|
|||
|
" <td>44873</td>\n",
|
|||
|
" <td>Female</td>\n",
|
|||
|
" <td>81.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Self-employed</td>\n",
|
|||
|
" <td>Urban</td>\n",
|
|||
|
" <td>125.20</td>\n",
|
|||
|
" <td>40.0</td>\n",
|
|||
|
" <td>never smoked</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5107</th>\n",
|
|||
|
" <td>19723</td>\n",
|
|||
|
" <td>Female</td>\n",
|
|||
|
" <td>35.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Self-employed</td>\n",
|
|||
|
" <td>Rural</td>\n",
|
|||
|
" <td>82.99</td>\n",
|
|||
|
" <td>30.6</td>\n",
|
|||
|
" <td>never smoked</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5108</th>\n",
|
|||
|
" <td>37544</td>\n",
|
|||
|
" <td>Male</td>\n",
|
|||
|
" <td>51.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Private</td>\n",
|
|||
|
" <td>Rural</td>\n",
|
|||
|
" <td>166.29</td>\n",
|
|||
|
" <td>25.6</td>\n",
|
|||
|
" <td>formerly smoked</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5109</th>\n",
|
|||
|
" <td>44679</td>\n",
|
|||
|
" <td>Female</td>\n",
|
|||
|
" <td>44.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Govt_job</td>\n",
|
|||
|
" <td>Urban</td>\n",
|
|||
|
" <td>85.28</td>\n",
|
|||
|
" <td>26.2</td>\n",
|
|||
|
" <td>Unknown</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>5110 rows × 12 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" id gender age hypertension heart_disease ever_married \\\n",
|
|||
|
"0 9046 Male 67.0 0 1 Yes \n",
|
|||
|
"1 51676 Female 61.0 0 0 Yes \n",
|
|||
|
"2 31112 Male 80.0 0 1 Yes \n",
|
|||
|
"3 60182 Female 49.0 0 0 Yes \n",
|
|||
|
"4 1665 Female 79.0 1 0 Yes \n",
|
|||
|
"... ... ... ... ... ... ... \n",
|
|||
|
"5105 18234 Female 80.0 1 0 Yes \n",
|
|||
|
"5106 44873 Female 81.0 0 0 Yes \n",
|
|||
|
"5107 19723 Female 35.0 0 0 Yes \n",
|
|||
|
"5108 37544 Male 51.0 0 0 Yes \n",
|
|||
|
"5109 44679 Female 44.0 0 0 Yes \n",
|
|||
|
"\n",
|
|||
|
" work_type Residence_type avg_glucose_level bmi smoking_status \\\n",
|
|||
|
"0 Private Urban 228.69 36.6 formerly smoked \n",
|
|||
|
"1 Self-employed Rural 202.21 NaN never smoked \n",
|
|||
|
"2 Private Rural 105.92 32.5 never smoked \n",
|
|||
|
"3 Private Urban 171.23 34.4 smokes \n",
|
|||
|
"4 Self-employed Rural 174.12 24.0 never smoked \n",
|
|||
|
"... ... ... ... ... ... \n",
|
|||
|
"5105 Private Urban 83.75 NaN never smoked \n",
|
|||
|
"5106 Self-employed Urban 125.20 40.0 never smoked \n",
|
|||
|
"5107 Self-employed Rural 82.99 30.6 never smoked \n",
|
|||
|
"5108 Private Rural 166.29 25.6 formerly smoked \n",
|
|||
|
"5109 Govt_job Urban 85.28 26.2 Unknown \n",
|
|||
|
"\n",
|
|||
|
" stroke \n",
|
|||
|
"0 1 \n",
|
|||
|
"1 1 \n",
|
|||
|
"2 1 \n",
|
|||
|
"3 1 \n",
|
|||
|
"4 1 \n",
|
|||
|
"... ... \n",
|
|||
|
"5105 0 \n",
|
|||
|
"5106 0 \n",
|
|||
|
"5107 0 \n",
|
|||
|
"5108 0 \n",
|
|||
|
"5109 0 \n",
|
|||
|
"\n",
|
|||
|
"[5110 rows x 12 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 61,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd \n",
|
|||
|
"\n",
|
|||
|
"strokes = pd.read_csv(\"healthcare-dataset-stroke-data.csv\")\n",
|
|||
|
"\n",
|
|||
|
"strokes"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 62,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"id int64\n",
|
|||
|
"gender object\n",
|
|||
|
"age float64\n",
|
|||
|
"hypertension int64\n",
|
|||
|
"heart_disease int64\n",
|
|||
|
"ever_married object\n",
|
|||
|
"work_type object\n",
|
|||
|
"Residence_type object\n",
|
|||
|
"avg_glucose_level float64\n",
|
|||
|
"bmi float64\n",
|
|||
|
"smoking_status object\n",
|
|||
|
"stroke int64\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 62,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"strokes.dtypes"
|
|||
|
]
|
|||
|
},
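{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column list above can be checked against the data itself. The cell below is a minimal sketch, not part of the original assignment: it assumes the `strokes` DataFrame loaded earlier and uses only standard pandas calls to count missing values per column (the preview shows `NaN` in `bmi`) and to show the share of each class in the `stroke` target."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: relies on the `strokes` DataFrame loaded above.\n",
"missing = strokes.isna().sum()\n",
"# Keep only the columns that actually contain missing values.\n",
"print(missing[missing > 0])\n",
"\n",
"# Share of each class in the target column.\n",
"print(strokes[\"stroke\"].value_counts(normalize=True))"
]
},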
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": []
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"#### Cars"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"This dataset is used to predict the price of a car from parameters such as manufacturer, model, production year, and other characteristics.\n",
"\n",
"Column information\n",
"\n",
"- ID: unique car identifier (int)\n",
"- Price: car price (target column) (int)\n",
"- Levy: tax or fee associated with the car; missing values appear as \"-\" (object, string)\n",
"- Manufacturer: car manufacturer (object, string)\n",
"- Model: car model (object, string)\n",
"- Prod. year: year of manufacture (int)\n",
"- Category: car category (object, string)\n",
"- Leather interior: whether the car has a leather interior (yes/no) (object, string)\n",
"- Fuel type: fuel type (petrol, diesel, etc.) (object, string)\n",
"- Engine volume: engine displacement (object, string)\n",
"- Mileage: car mileage, stored with a unit, e.g. \"186005 km\" (object, string)\n",
"- Cylinders: number of engine cylinders (float)\n",
"- Gear box type: gearbox type (manual, automatic, etc.) (object, string)\n",
"- Drive wheels: drive type (front, rear, 4x4) (object, string)\n",
"- Doors: number of doors (object, string)\n",
"- Wheel: steering wheel position (left wheel, right-hand drive) (object, string)\n",
"- Color: car color (object, string)\n",
"- Airbags: number of airbags (int)\n",
"\n",
"\n",
"Each row contains the information about one car, which makes it possible to analyze the data and build models that predict its price."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 63,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>ID</th>\n",
|
|||
|
" <th>Price</th>\n",
|
|||
|
" <th>Levy</th>\n",
|
|||
|
" <th>Manufacturer</th>\n",
|
|||
|
" <th>Model</th>\n",
|
|||
|
" <th>Prod. year</th>\n",
|
|||
|
" <th>Category</th>\n",
|
|||
|
" <th>Leather interior</th>\n",
|
|||
|
" <th>Fuel type</th>\n",
|
|||
|
" <th>Engine volume</th>\n",
|
|||
|
" <th>Mileage</th>\n",
|
|||
|
" <th>Cylinders</th>\n",
|
|||
|
" <th>Gear box type</th>\n",
|
|||
|
" <th>Drive wheels</th>\n",
|
|||
|
" <th>Doors</th>\n",
|
|||
|
" <th>Wheel</th>\n",
|
|||
|
" <th>Color</th>\n",
|
|||
|
" <th>Airbags</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>45654403</td>\n",
|
|||
|
" <td>13328</td>\n",
|
|||
|
" <td>1399</td>\n",
|
|||
|
" <td>LEXUS</td>\n",
|
|||
|
" <td>RX 450</td>\n",
|
|||
|
" <td>2010</td>\n",
|
|||
|
" <td>Jeep</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Hybrid</td>\n",
|
|||
|
" <td>3.5</td>\n",
|
|||
|
" <td>186005 km</td>\n",
|
|||
|
" <td>6.0</td>\n",
|
|||
|
" <td>Automatic</td>\n",
|
|||
|
" <td>4x4</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>Silver</td>\n",
|
|||
|
" <td>12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>44731507</td>\n",
|
|||
|
" <td>16621</td>\n",
|
|||
|
" <td>1018</td>\n",
|
|||
|
" <td>CHEVROLET</td>\n",
|
|||
|
" <td>Equinox</td>\n",
|
|||
|
" <td>2011</td>\n",
|
|||
|
" <td>Jeep</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>Petrol</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>192000 km</td>\n",
|
|||
|
" <td>6.0</td>\n",
|
|||
|
" <td>Tiptronic</td>\n",
|
|||
|
" <td>4x4</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>Black</td>\n",
|
|||
|
" <td>8</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>45774419</td>\n",
|
|||
|
" <td>8467</td>\n",
|
|||
|
" <td>-</td>\n",
|
|||
|
" <td>HONDA</td>\n",
|
|||
|
" <td>FIT</td>\n",
|
|||
|
" <td>2006</td>\n",
|
|||
|
" <td>Hatchback</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>Petrol</td>\n",
|
|||
|
" <td>1.3</td>\n",
|
|||
|
" <td>200000 km</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>Variator</td>\n",
|
|||
|
" <td>Front</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Right-hand drive</td>\n",
|
|||
|
" <td>Black</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>45769185</td>\n",
|
|||
|
" <td>3607</td>\n",
|
|||
|
" <td>862</td>\n",
|
|||
|
" <td>FORD</td>\n",
|
|||
|
" <td>Escape</td>\n",
|
|||
|
" <td>2011</td>\n",
|
|||
|
" <td>Jeep</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Hybrid</td>\n",
|
|||
|
" <td>2.5</td>\n",
|
|||
|
" <td>168966 km</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>Automatic</td>\n",
|
|||
|
" <td>4x4</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>White</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>45809263</td>\n",
|
|||
|
" <td>11726</td>\n",
|
|||
|
" <td>446</td>\n",
|
|||
|
" <td>HONDA</td>\n",
|
|||
|
" <td>FIT</td>\n",
|
|||
|
" <td>2014</td>\n",
|
|||
|
" <td>Hatchback</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Petrol</td>\n",
|
|||
|
" <td>1.3</td>\n",
|
|||
|
" <td>91901 km</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>Automatic</td>\n",
|
|||
|
" <td>Front</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>Silver</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19232</th>\n",
|
|||
|
" <td>45798355</td>\n",
|
|||
|
" <td>8467</td>\n",
|
|||
|
" <td>-</td>\n",
|
|||
|
" <td>MERCEDES-BENZ</td>\n",
|
|||
|
" <td>CLK 200</td>\n",
|
|||
|
" <td>1999</td>\n",
|
|||
|
" <td>Coupe</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>CNG</td>\n",
|
|||
|
" <td>2.0 Turbo</td>\n",
|
|||
|
" <td>300000 km</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>Manual</td>\n",
|
|||
|
" <td>Rear</td>\n",
|
|||
|
" <td>02-Mar</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>Silver</td>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19233</th>\n",
|
|||
|
" <td>45778856</td>\n",
|
|||
|
" <td>15681</td>\n",
|
|||
|
" <td>831</td>\n",
|
|||
|
" <td>HYUNDAI</td>\n",
|
|||
|
" <td>Sonata</td>\n",
|
|||
|
" <td>2011</td>\n",
|
|||
|
" <td>Sedan</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Petrol</td>\n",
|
|||
|
" <td>2.4</td>\n",
|
|||
|
" <td>161600 km</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>Tiptronic</td>\n",
|
|||
|
" <td>Front</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>Red</td>\n",
|
|||
|
" <td>8</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19234</th>\n",
|
|||
|
" <td>45804997</td>\n",
|
|||
|
" <td>26108</td>\n",
|
|||
|
" <td>836</td>\n",
|
|||
|
" <td>HYUNDAI</td>\n",
|
|||
|
" <td>Tucson</td>\n",
|
|||
|
" <td>2010</td>\n",
|
|||
|
" <td>Jeep</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Diesel</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>116365 km</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>Automatic</td>\n",
|
|||
|
" <td>Front</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>Grey</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19235</th>\n",
|
|||
|
" <td>45793526</td>\n",
|
|||
|
" <td>5331</td>\n",
|
|||
|
" <td>1288</td>\n",
|
|||
|
" <td>CHEVROLET</td>\n",
|
|||
|
" <td>Captiva</td>\n",
|
|||
|
" <td>2007</td>\n",
|
|||
|
" <td>Jeep</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Diesel</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>51258 km</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>Automatic</td>\n",
|
|||
|
" <td>Front</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>Black</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19236</th>\n",
|
|||
|
" <td>45813273</td>\n",
|
|||
|
" <td>470</td>\n",
|
|||
|
" <td>753</td>\n",
|
|||
|
" <td>HYUNDAI</td>\n",
|
|||
|
" <td>Sonata</td>\n",
|
|||
|
" <td>2012</td>\n",
|
|||
|
" <td>Sedan</td>\n",
|
|||
|
" <td>Yes</td>\n",
|
|||
|
" <td>Hybrid</td>\n",
|
|||
|
" <td>2.4</td>\n",
|
|||
|
" <td>186923 km</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>Automatic</td>\n",
|
|||
|
" <td>Front</td>\n",
|
|||
|
" <td>04-May</td>\n",
|
|||
|
" <td>Left wheel</td>\n",
|
|||
|
" <td>White</td>\n",
|
|||
|
" <td>12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>19237 rows × 18 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" ID Price Levy Manufacturer Model Prod. year Category \\\n",
|
|||
|
"0 45654403 13328 1399 LEXUS RX 450 2010 Jeep \n",
|
|||
|
"1 44731507 16621 1018 CHEVROLET Equinox 2011 Jeep \n",
|
|||
|
"2 45774419 8467 - HONDA FIT 2006 Hatchback \n",
|
|||
|
"3 45769185 3607 862 FORD Escape 2011 Jeep \n",
|
|||
|
"4 45809263 11726 446 HONDA FIT 2014 Hatchback \n",
|
|||
|
"... ... ... ... ... ... ... ... \n",
|
|||
|
"19232 45798355 8467 - MERCEDES-BENZ CLK 200 1999 Coupe \n",
|
|||
|
"19233 45778856 15681 831 HYUNDAI Sonata 2011 Sedan \n",
|
|||
|
"19234 45804997 26108 836 HYUNDAI Tucson 2010 Jeep \n",
|
|||
|
"19235 45793526 5331 1288 CHEVROLET Captiva 2007 Jeep \n",
|
|||
|
"19236 45813273 470 753 HYUNDAI Sonata 2012 Sedan \n",
|
|||
|
"\n",
|
|||
|
" Leather interior Fuel type Engine volume Mileage Cylinders \\\n",
|
|||
|
"0 Yes Hybrid 3.5 186005 km 6.0 \n",
|
|||
|
"1 No Petrol 3 192000 km 6.0 \n",
|
|||
|
"2 No Petrol 1.3 200000 km 4.0 \n",
|
|||
|
"3 Yes Hybrid 2.5 168966 km 4.0 \n",
|
|||
|
"4 Yes Petrol 1.3 91901 km 4.0 \n",
|
|||
|
"... ... ... ... ... ... \n",
|
|||
|
"19232 Yes CNG 2.0 Turbo 300000 km 4.0 \n",
|
|||
|
"19233 Yes Petrol 2.4 161600 km 4.0 \n",
|
|||
|
"19234 Yes Diesel 2 116365 km 4.0 \n",
|
|||
|
"19235 Yes Diesel 2 51258 km 4.0 \n",
|
|||
|
"19236 Yes Hybrid 2.4 186923 km 4.0 \n",
|
|||
|
"\n",
|
|||
|
" Gear box type Drive wheels Doors Wheel Color Airbags \n",
|
|||
|
"0 Automatic 4x4 04-May Left wheel Silver 12 \n",
|
|||
|
"1 Tiptronic 4x4 04-May Left wheel Black 8 \n",
|
|||
|
"2 Variator Front 04-May Right-hand drive Black 2 \n",
|
|||
|
"3 Automatic 4x4 04-May Left wheel White 0 \n",
|
|||
|
"4 Automatic Front 04-May Left wheel Silver 4 \n",
|
|||
|
"... ... ... ... ... ... ... \n",
|
|||
|
"19232 Manual Rear 02-Mar Left wheel Silver 5 \n",
|
|||
|
"19233 Tiptronic Front 04-May Left wheel Red 8 \n",
|
|||
|
"19234 Automatic Front 04-May Left wheel Grey 4 \n",
|
|||
|
"19235 Automatic Front 04-May Left wheel Black 4 \n",
|
|||
|
"19236 Automatic Front 04-May Left wheel White 12 \n",
|
|||
|
"\n",
|
|||
|
"[19237 rows x 18 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 63,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"auto = pd.read_csv(\"car_price_prediction.csv\")\n",
|
|||
|
"\n",
|
|||
|
"auto"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 64,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ID int64\n",
|
|||
|
"Price int64\n",
|
|||
|
"Levy object\n",
|
|||
|
"Manufacturer object\n",
|
|||
|
"Model object\n",
|
|||
|
"Prod. year int64\n",
|
|||
|
"Category object\n",
|
|||
|
"Leather interior object\n",
|
|||
|
"Fuel type object\n",
|
|||
|
"Engine volume object\n",
|
|||
|
"Mileage object\n",
|
|||
|
"Cylinders float64\n",
|
|||
|
"Gear box type object\n",
|
|||
|
"Drive wheels object\n",
|
|||
|
"Doors object\n",
|
|||
|
"Wheel object\n",
|
|||
|
"Color object\n",
|
|||
|
"Airbags int64\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 64,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"auto.dtypes"
|
|||
|
]
|
|||
|
},
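{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several columns that look numeric are stored as `object` because of text in the values: in the preview above, `Levy` contains `-`, `Mileage` carries a `km` suffix, and `Engine volume` can include `Turbo`. The cell below is a hedged sketch of one possible cleanup, assuming the `auto` DataFrame loaded above; the names `auto_clean` and `Turbo` are introduced here only for illustration and are not part of the original work."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative cleanup sketch; `auto_clean` and the `Turbo` column are hypothetical names.\n",
"auto_clean = auto.copy()\n",
"\n",
"# \"-\" is not a number, so errors=\"coerce\" turns it into NaN.\n",
"auto_clean[\"Levy\"] = pd.to_numeric(auto_clean[\"Levy\"], errors=\"coerce\")\n",
"\n",
"# Strip the \" km\" unit before converting the mileage to a number.\n",
"auto_clean[\"Mileage\"] = pd.to_numeric(\n",
"    auto_clean[\"Mileage\"].str.replace(\" km\", \"\", regex=False), errors=\"coerce\"\n",
")\n",
"\n",
"# Remember the turbo flag, then keep only the numeric engine volume.\n",
"auto_clean[\"Turbo\"] = auto_clean[\"Engine volume\"].str.contains(\"Turbo\", regex=False)\n",
"auto_clean[\"Engine volume\"] = pd.to_numeric(\n",
"    auto_clean[\"Engine volume\"].str.replace(\" Turbo\", \"\", regex=False), errors=\"coerce\"\n",
")\n",
"\n",
"auto_clean.dtypes"
]
},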
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"#### Stores"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"\n",
"Column information\n",
"\n",
"- Store ID: unique identifier of a particular store (index) (int)\n",
"- Store_Area: physical area of the store in square yards (int)\n",
"- Items_Available: number of different items available in the store (int)\n",
"- Daily_Customer_Count: average number of customers visiting the store over a month (int)\n",
"- Store_Sales: sales (in US dollars) made by the store (int)\n",
"\n",
"Each row contains the information about one store, which makes it possible to analyze the data and build models that assess its performance."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 65,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Store ID</th>\n",
|
|||
|
" <th>Store_Area</th>\n",
|
|||
|
" <th>Items_Available</th>\n",
|
|||
|
" <th>Daily_Customer_Count</th>\n",
|
|||
|
" <th>Store_Sales</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1659</td>\n",
|
|||
|
" <td>1961</td>\n",
|
|||
|
" <td>530</td>\n",
|
|||
|
" <td>66490</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1461</td>\n",
|
|||
|
" <td>1752</td>\n",
|
|||
|
" <td>210</td>\n",
|
|||
|
" <td>39820</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1340</td>\n",
|
|||
|
" <td>1609</td>\n",
|
|||
|
" <td>720</td>\n",
|
|||
|
" <td>54010</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1451</td>\n",
|
|||
|
" <td>1748</td>\n",
|
|||
|
" <td>620</td>\n",
|
|||
|
" <td>53730</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" <td>1770</td>\n",
|
|||
|
" <td>2111</td>\n",
|
|||
|
" <td>450</td>\n",
|
|||
|
" <td>46620</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>891</th>\n",
|
|||
|
" <td>892</td>\n",
|
|||
|
" <td>1582</td>\n",
|
|||
|
" <td>1910</td>\n",
|
|||
|
" <td>1080</td>\n",
|
|||
|
" <td>66390</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>892</th>\n",
|
|||
|
" <td>893</td>\n",
|
|||
|
" <td>1387</td>\n",
|
|||
|
" <td>1663</td>\n",
|
|||
|
" <td>850</td>\n",
|
|||
|
" <td>82080</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>893</th>\n",
|
|||
|
" <td>894</td>\n",
|
|||
|
" <td>1200</td>\n",
|
|||
|
" <td>1436</td>\n",
|
|||
|
" <td>1060</td>\n",
|
|||
|
" <td>76440</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>894</th>\n",
|
|||
|
" <td>895</td>\n",
|
|||
|
" <td>1299</td>\n",
|
|||
|
" <td>1560</td>\n",
|
|||
|
" <td>770</td>\n",
|
|||
|
" <td>96610</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>895</th>\n",
|
|||
|
" <td>896</td>\n",
|
|||
|
" <td>1174</td>\n",
|
|||
|
" <td>1429</td>\n",
|
|||
|
" <td>1110</td>\n",
|
|||
|
" <td>54340</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>896 rows × 5 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Store ID Store_Area Items_Available Daily_Customer_Count Store_Sales\n",
|
|||
|
"0 1 1659 1961 530 66490\n",
|
|||
|
"1 2 1461 1752 210 39820\n",
|
|||
|
"2 3 1340 1609 720 54010\n",
|
|||
|
"3 4 1451 1748 620 53730\n",
|
|||
|
"4 5 1770 2111 450 46620\n",
|
|||
|
".. ... ... ... ... ...\n",
|
|||
|
"891 892 1582 1910 1080 66390\n",
|
|||
|
"892 893 1387 1663 850 82080\n",
|
|||
|
"893 894 1200 1436 1060 76440\n",
|
|||
|
"894 895 1299 1560 770 96610\n",
|
|||
|
"895 896 1174 1429 1110 54340\n",
|
|||
|
"\n",
|
|||
|
"[896 rows x 5 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 65,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"shop = pd.read_csv(\"Stores.csv\")\n",
|
|||
|
"\n",
|
|||
|
"shop"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 66,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Store ID int64\n",
|
|||
|
"Store_Area int64\n",
|
|||
|
"Items_Available int64\n",
|
|||
|
"Daily_Customer_Count int64\n",
|
|||
|
"Store_Sales int64\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 66,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"shop.dtypes"
|
|||
|
]
|
|||
|
},
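{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since every column here is numeric, one quick way to see how the store attributes relate to sales is a correlation table. The cell below is a minimal sketch, assuming the `shop` DataFrame loaded above; the `features` name is introduced only for illustration. It drops the first column (the store identifier) by position and reports the Pearson correlation of the remaining columns with `Store_Sales`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: relies on the `shop` DataFrame loaded above.\n",
"# The first column is the store identifier, so it is dropped by position.\n",
"features = shop.drop(columns=shop.columns[0])\n",
"\n",
"# Pearson correlation of every remaining column with the sales column.\n",
"features.corr()[\"Store_Sales\"].sort_values(ascending=False)"
]
},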
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### 3. Analyze the contents of each dataset. What are the objects of observation? What are the objects' attributes? Are there relationships between the objects?"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"1. Stroke risk dataset\n",
"\n",
"Object of observation: patients.\n",
"\n",
"The attributes are listed above.\n",
"\n",
"2. Car price dataset\n",
"\n",
"Object of observation: cars.\n",
"\n",
"The attributes are listed above.\n",
"\n",
"3. Supermarket dataset\n",
"\n",
"Object of observation: the supermarket's stores.\n",
"\n",
"The attributes are listed above."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### 4. Give examples of business goals that the selected datasets could help achieve. What is the effect on the business?"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"1. Stroke risk dataset\n",
"\n",
"Business goal: develop an early-warning system for stroke based on the analysis of patient data.\n",
"\n",
"Effect on the business:\n",
"\n",
"Better patient health: fewer strokes thanks to early detection of risk.\n",
"\n",
"Lower costs: reduced spending on stroke treatment and rehabilitation.\n",
"\n",
"2. Car price prediction dataset\n",
"\n",
"Business goal: optimize pricing and improve the car sales strategy.\n",
"\n",
"Effect on the business:\n",
"\n",
"Higher profit: setting competitive car prices based on data analysis.\n",
"\n",
"Better inventory planning: less surplus stock and optimized supply.\n",
"\n",
"3. Supermarket dataset\n",
"\n",
"Business goal: optimize the assortment and improve customer service based on the analysis of store traffic and sales.\n",
"\n",
"Effect on the business:\n",
"\n",
"Higher sales volume: stocking the products that are most popular with customers.\n",
"\n",
"Lower costs: optimized store area and product placement.\n",
"\n",
"Higher customer satisfaction: a better shopping experience through more efficient product arrangement and service."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### 5. Give examples of technical project goals for each business goal identified above. What is the input, and what is the target feature?"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"1. Stroke risk dataset\n",
"\n",
"Business goal: develop an early-warning system for stroke.\n",
"\n",
"Technical project goal: build a machine learning model that predicts the probability of stroke.\n",
"\n",
"Input data:\n",
"\n",
"- Gender\n",
"- Age\n",
"- Presence of hypertension\n",
"- Presence of heart disease\n",
"- Marital status\n",
"- Work type\n",
"- Residence type\n",
"- Average glucose level\n",
"- Body mass index\n",
"- Smoking status\n",
"\n",
"Target feature: presence of stroke (stroke).\n",
"\n",
"2. Car price prediction dataset\n",
"\n",
"Business goal: optimize pricing and improve the car sales strategy.\n",
"\n",
"Technical project goal: build a model that predicts the price of a car from its characteristics.\n",
"\n",
"Input data:\n",
"\n",
"- Manufacturer\n",
"- Model\n",
"- Production year\n",
"- Category\n",
"- Levy (tax)\n",
"- Leather interior\n",
"- Fuel type\n",
"- Engine volume\n",
"- Mileage\n",
"- Number of cylinders\n",
"- Gearbox type\n",
"- The remaining columns (drive wheels, doors, wheel, color, airbags)\n",
"\n",
"Target feature: car price (Price).\n",
"\n",
"3. Supermarket dataset\n",
"\n",
"Business goal: optimize the assortment and improve customer service.\n",
"\n",
"Technical project goal: develop an analytics platform for analyzing customer traffic and sales.\n",
"\n",
"Input data:\n",
"\n",
"- Physical area of the store\n",
"- Number of available items\n",
"- Average daily number of customers\n",
"\n",
"Target feature: sales volume (Store_Sales)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### 6. Identify the problems of the selected datasets: noise, bias, relevance, outliers, data leakage.\n",
"### 7. Give examples of how to solve the problems found for each dataset"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 67,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# The printed labels stay in Russian so that they match the recorded outputs:\n",
"# Зашумленность = noise, Смещение = bias, Количество дубликатов = duplicates,\n",
"# Выбросы = outliers, Признаки просачивания данных = signs of data leakage.\n",
"\n",
"# 1. Noise check: approximated by the share of missing cells in the whole frame\n",
"def check_noise(dataframe):\n",
"    total_values = dataframe.size\n",
"    missing_values = dataframe.isnull().sum().sum()\n",
"    noise_percentage = (missing_values / total_values) * 100\n",
"    return f\"Зашумленность: {noise_percentage:.2f}%\"\n",
"\n",
"# 2. Bias check: approximated by the share of unique values in the given column\n",
"def check_bias(dataframe, target_column):\n",
"    if target_column in dataframe.columns:\n",
"        unique_values = dataframe[target_column].nunique()\n",
"        total_values = len(dataframe)\n",
"        bias_percentage = (unique_values / total_values) * 100\n",
"        return f\"Смещение по {target_column}: {bias_percentage:.2f}% уникальных значений\"\n",
"    return \"Целевой признак не найден.\"\n",
"\n",
"# 3. Duplicate check: share of fully repeated rows\n",
"def check_duplicates(dataframe):\n",
"    duplicate_percentage = dataframe.duplicated().mean() * 100\n",
"    return f\"Количество дубликатов: {duplicate_percentage:.2f}%\"\n",
"\n",
"# 4. Outlier check: share of values outside the 1.5 * IQR fences\n",
"def check_outliers(dataframe, column):\n",
"    if column in dataframe.columns:\n",
"        Q1 = dataframe[column].quantile(0.25)\n",
"        Q3 = dataframe[column].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"        outlier_count = dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)].shape[0]\n",
"        total_count = dataframe.shape[0]\n",
"        outlier_percentage = (outlier_count / total_count) * 100\n",
"        return f\"Выбросы по {column}: {outlier_percentage:.2f}%\"\n",
"    return f\"Признак {column} не найден.\"\n",
"\n",
"# 5. Data leakage check: numeric features with the largest absolute correlation\n",
"# with the target; a value close to 1 would hint at leakage\n",
"def check_data_leakage(dataframe, target_column):\n",
"    if target_column in dataframe.columns:\n",
"        correlation_matrix = dataframe.select_dtypes(include=[np.number]).corr()\n",
"        leakage_info = correlation_matrix[target_column].abs().nlargest(10)\n",
"        leakage_report = \", \".join([f\"{feature}: {value:.2f}\" for feature, value in leakage_info.items() if feature != target_column])\n",
"        return f\"Признаки просачивания данных: {leakage_report}\"\n",
"    return \"Целевой признак не найден.\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 68,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Зашумленность: 0.33%\n",
|
|||
|
"Смещение по avg_glucose_level: 77.87% уникальных значений\n",
|
|||
|
"Количество дубликатов: 0.00%\n",
|
|||
|
"Выбросы по avg_glucose_level: 12.27%\n",
|
|||
|
"Признаки просачивания данных: age: 0.25, heart_disease: 0.13, avg_glucose_level: 0.13, hypertension: 0.13, bmi: 0.04, id: 0.01\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"noise_columns = check_noise(strokes)\n",
|
|||
|
"bias_info = check_bias(strokes, 'avg_glucose_level') \n",
|
|||
|
"duplicate_count = check_duplicates(strokes)\n",
|
|||
|
"outliers_data = check_outliers(strokes, 'avg_glucose_level') \n",
|
|||
|
"leakage_info = check_data_leakage(strokes, 'stroke') \n",
|
|||
|
"\n",
|
|||
|
"print(noise_columns)\n",
|
|||
|
"print(bias_info)\n",
|
|||
|
"print(duplicate_count)\n",
|
|||
|
"print(outliers_data)\n",
|
|||
|
"print(leakage_info)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 69,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Зашумленность: 0.00%\n",
|
|||
|
"Смещение по Price: 12.03% уникальных значений\n",
|
|||
|
"Количество дубликатов: 1.63%\n",
|
|||
|
"Выбросы по Prod. year: 5.10%\n",
|
|||
|
"Признаки просачивания данных: Prod. year: 0.24, Cylinders: 0.18, ID: 0.02, Price: 0.01\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Cars\n",
"noise_columns = check_noise(auto)\n",
"bias_info = check_bias(auto, 'Price')\n",
"duplicate_count = check_duplicates(auto)\n",
"outliers_data = check_outliers(auto, 'Prod. year')\n",
"# Note: correlations are reported against Airbags here, not against the Price target\n",
"leakage_info = check_data_leakage(auto, 'Airbags')\n",
|
|||
|
"\n",
|
|||
|
"print(noise_columns)\n",
|
|||
|
"print(bias_info)\n",
|
|||
|
"print(duplicate_count)\n",
|
|||
|
"print(outliers_data)\n",
|
|||
|
"print(leakage_info)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 70,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Зашумленность: 0.00%\n",
|
|||
|
"Смещение по Items_Available: 68.75% уникальных значений\n",
|
|||
|
"Количество дубликатов: 0.00%\n",
|
|||
|
"Выбросы по Store_Sales: 0.11%\n",
|
|||
|
"Признаки просачивания данных: Store_Area: 0.04, Items_Available: 0.04, Store ID : 0.01, Store_Sales: 0.01\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"noise_columns = check_noise(shop)\n",
|
|||
|
"bias_info = check_bias(shop, 'Items_Available') \n",
|
|||
|
"duplicate_count = check_duplicates(shop)\n",
|
|||
|
"outliers_data = check_outliers(shop, 'Store_Sales') \n",
|
|||
|
"leakage_info = check_data_leakage(shop, 'Daily_Customer_Count') \n",
|
|||
|
"\n",
|
|||
|
"print(noise_columns)\n",
|
|||
|
"print(bias_info)\n",
|
|||
|
"print(duplicate_count)\n",
|
|||
|
"print(outliers_data)\n",
|
|||
|
"print(leakage_info)"
|
|||
|
]
|
|||
|
},
|
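{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible illustration for item 7 (not part of the original analysis). The checks above report 1.63% duplicate rows in the cars data and 12.27% outliers in `avg_glucose_level`. The sketch below shows one way such findings could be handled, on a made-up toy frame. The frame `toy`, its values, and the helper `clip_outliers` are assumptions introduced for illustration. Clipping to the 1.5 * IQR fences is only one option; removing the rows or transforming the column are others."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy frame standing in for one of the datasets (values are made up)\n",
"toy = pd.DataFrame({\n",
"    'Price': [100, 100, 120, 130, 5000],\n",
"    'Mileage': [10, 10, 20, 30, 40],\n",
"})\n",
"\n",
"# Duplicates: keep the first occurrence of each fully repeated row\n",
"toy = toy.drop_duplicates().reset_index(drop=True)\n",
"\n",
"# Outliers: clip values to the same 1.5 * IQR fences that check_outliers uses\n",
"def clip_outliers(dataframe, column):\n",
"    q1 = dataframe[column].quantile(0.25)\n",
"    q3 = dataframe[column].quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    clipped = dataframe.copy()\n",
"    clipped[column] = clipped[column].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)\n",
"    return clipped\n",
"\n",
"toy = clip_outliers(toy, 'Price')\n",
"print(toy)"
]
},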
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### 9. Resolve the missing data problem. Use a different method for each dataset: deletion, substitution of a constant value (0 or similar), substitution of the mean"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 71,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"id 0\n",
|
|||
|
"gender 0\n",
|
|||
|
"age 0\n",
|
|||
|
"hypertension 0\n",
|
|||
|
"heart_disease 0\n",
|
|||
|
"ever_married 0\n",
|
|||
|
"work_type 0\n",
|
|||
|
"Residence_type 0\n",
|
|||
|
"avg_glucose_level 0\n",
|
|||
|
"bmi 201\n",
|
|||
|
"smoking_status 0\n",
|
|||
|
"stroke 0\n",
|
|||
|
"dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 71,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Strokes\n",
|
|||
|
"\n",
|
|||
|
"strokes.isnull().sum()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 72,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"strokes['bmi'] = strokes['bmi'].fillna(strokes['bmi'].mean())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 73,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"id 0\n",
|
|||
|
"gender 0\n",
|
|||
|
"age 0\n",
|
|||
|
"hypertension 0\n",
|
|||
|
"heart_disease 0\n",
|
|||
|
"ever_married 0\n",
|
|||
|
"work_type 0\n",
|
|||
|
"Residence_type 0\n",
|
|||
|
"avg_glucose_level 0\n",
|
|||
|
"bmi 0\n",
|
|||
|
"smoking_status 0\n",
|
|||
|
"stroke 0\n",
|
|||
|
"dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 73,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"strokes.isnull().sum()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 74,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ID 0\n",
|
|||
|
"Price 0\n",
|
|||
|
"Levy 0\n",
|
|||
|
"Manufacturer 0\n",
|
|||
|
"Model 0\n",
|
|||
|
"Prod. year 0\n",
|
|||
|
"Category 0\n",
|
|||
|
"Leather interior 0\n",
|
|||
|
"Fuel type 0\n",
|
|||
|
"Engine volume 0\n",
|
|||
|
"Mileage 0\n",
|
|||
|
"Cylinders 0\n",
|
|||
|
"Gear box type 0\n",
|
|||
|
"Drive wheels 0\n",
|
|||
|
"Doors 0\n",
|
|||
|
"Wheel 0\n",
|
|||
|
"Color 0\n",
|
|||
|
"Airbags 0\n",
|
|||
|
"dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 74,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"auto.isnull().sum()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 75,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Store ID 0\n",
|
|||
|
"Store_Area 0\n",
|
|||
|
"Items_Available 0\n",
|
|||
|
"Daily_Customer_Count 0\n",
|
|||
|
"Store_Sales 0\n",
|
|||
|
"dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 75,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"shop.isnull().sum()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 76,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
"# Delete rows with missing values\n",
"shop = shop.dropna()\n",
"\n",
"# Fill missing values with a constant\n",
"shop['Items_Available'] = shop['Items_Available'].fillna(5000)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### 10. Split each dataset into training, validation, and test samples"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 79,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.model_selection import train_test_split"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 80,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Shop Dataset:\n",
|
|||
|
"Train: 79.91%\n",
|
|||
|
"Validation: 10.04%\n",
|
|||
|
"Test: 10.04%\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Split of shop\n",
|
|||
|
"original_shop_size = len(shop)\n",
|
|||
|
"train_shop, temp_shop = train_test_split(shop, test_size=0.2, random_state=42)\n",
|
|||
|
"val_shop, test_shop = train_test_split(temp_shop, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"print(\"Shop Dataset:\")\n",
|
|||
|
"print(f\"Train: {len(train_shop)/original_shop_size*100:.2f}%\")\n",
|
|||
|
"print(f\"Validation: {len(val_shop)/original_shop_size*100:.2f}%\")\n",
|
|||
|
"print(f\"Test: {len(test_shop)/original_shop_size*100:.2f}%\\n\")\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 82,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Strokes Dataset:\n",
|
|||
|
"Train: 80.00%\n",
|
|||
|
"Validation: 10.00%\n",
|
|||
|
"Test: 10.00%\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Split of strokes\n",
|
|||
|
"original_strokes_size = len(strokes)\n",
|
|||
|
"train_strokes, temp_strokes = train_test_split(strokes, test_size=0.2, random_state=42)\n",
|
|||
|
"val_strokes, test_strokes = train_test_split(temp_strokes, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"print(\"Strokes Dataset:\")\n",
|
|||
|
"print(f\"Train: {len(train_strokes)/original_strokes_size*100:.2f}%\")\n",
|
|||
|
"print(f\"Validation: {len(val_strokes)/original_strokes_size*100:.2f}%\")\n",
|
|||
|
"print(f\"Test: {len(test_strokes)/original_strokes_size*100:.2f}%\\n\")\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 83,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Auto Dataset:\n",
|
|||
|
"Train: 80.00%\n",
|
|||
|
"Validation: 10.00%\n",
|
|||
|
"Test: 10.00%\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Split of auto\n",
|
|||
|
"original_auto_size = len(auto)\n",
|
|||
|
"train_auto, temp_auto = train_test_split(auto, test_size=0.2, random_state=42)\n",
|
|||
|
"val_auto, test_auto = train_test_split(temp_auto, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"print(\"Auto Dataset:\")\n",
|
|||
|
"print(f\"Train: {len(train_auto)/original_auto_size*100:.2f}%\")\n",
|
|||
|
"print(f\"Validation: {len(val_auto)/original_auto_size*100:.2f}%\")\n",
|
|||
|
"print(f\"Test: {len(test_auto)/original_auto_size*100:.2f}%\")"
|
|||
|
]
|
|||
|
},
|
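{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch related to item 10. For a classification target such as `stroke`, a plain random split may leave the rare class unevenly represented across the samples. Passing `stratify` to `train_test_split` is one common way to preserve class shares. The frame `toy` below is synthetic and only stands in for `strokes`; the 5% positive rate is an assumption, not a figure taken from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic stand-in for strokes: roughly 5% positive cases (made-up data)\n",
"rng = np.random.default_rng(42)\n",
"toy = pd.DataFrame({\n",
"    'age': rng.integers(1, 90, size=1000),\n",
"    'stroke': (rng.random(1000) < 0.05).astype(int),\n",
"})\n",
"\n",
"# stratify keeps the share of each class the same in both parts of a split\n",
"train_toy, temp_toy = train_test_split(\n",
"    toy, test_size=0.2, random_state=42, stratify=toy['stroke'])\n",
"val_toy, test_toy = train_test_split(\n",
"    temp_toy, test_size=0.5, random_state=42, stratify=temp_toy['stroke'])\n",
"\n",
"# Print the size and the positive-class share of every sample\n",
"for name, part in [('train', train_toy), ('validation', val_toy), ('test', test_toy)]:\n",
"    print(name, len(part), round(part['stroke'].mean(), 3))"
]
},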
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### 11. Assess how balanced the samples are for each dataset. Assess whether data augmentation methods are needed.\n",
"### 12. Augment the data using oversampling and undersampling. Examples of both methods must be given for the samples of each dataset."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 85,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
"image/png": "<base64 PNG data omitted: histogram of Store_Sales, Train Shop>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 800x500 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
"image/png": "<base64 PNG data omitted: histogram of Store_Sales, Validation Shop>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 800x500 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
"image/png": "<base64 PNG data omitted: histogram of Store_Sales, Test Shop>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 800x500 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"def plot_sample_balance(y, sample_name):\n",
"    # Histogram with a KDE curve of the target values in one sample\n",
"    plt.figure(figsize=(8, 5))\n",
"    sns.histplot(y, bins=30, kde=True)\n",
"    plt.title(f'Target distribution for {sample_name}')\n",
"    plt.xlabel(sample_name)\n",
"    plt.ylabel('Frequency')\n",
"    plt.show()\n",
"\n",
"# Assess how balanced the samples are\n",
"plot_sample_balance(train_shop['Store_Sales'], 'Train Shop')\n",
"plot_sample_balance(val_shop['Store_Sales'], 'Validation Shop')\n",
"plot_sample_balance(test_shop['Store_Sales'], 'Test Shop')\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"The distributions of the target in the training, validation, and test samples of this dataset look similar, which suggests that the samples are balanced."
|
|||
|
]
|
|||
|
},
|
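{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch for item 12, assuming a binary target. Random oversampling repeats minority-class rows, and random undersampling drops majority-class rows. The frame `toy_train` and its 90/10 class ratio are made up for illustration. For a continuous target such as `Store_Sales`, the values would first have to be grouped into bins. Resampling would normally be applied to the training sample only, so that the validation and test samples keep the original distribution. The imbalanced-learn package offers `RandomOverSampler` and `RandomUnderSampler` for the same purpose; plain pandas is used here to keep the example dependency-free."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy imbalanced training sample: 90 negatives, 10 positives (made-up data)\n",
"toy_train = pd.DataFrame({\n",
"    'feature': range(100),\n",
"    'target': [0] * 90 + [1] * 10,\n",
"})\n",
"majority = toy_train[toy_train['target'] == 0]\n",
"minority = toy_train[toy_train['target'] == 1]\n",
"\n",
"# Oversampling: draw minority rows with replacement up to the majority size\n",
"oversampled = pd.concat([\n",
"    majority,\n",
"    minority.sample(n=len(majority), replace=True, random_state=42),\n",
"])\n",
"\n",
"# Undersampling: keep a random subset of the majority equal to the minority size\n",
"undersampled = pd.concat([\n",
"    majority.sample(n=len(minority), random_state=42),\n",
"    minority,\n",
"])\n",
"\n",
"# Class counts after each resampling\n",
"print(oversampled['target'].value_counts().to_dict())\n",
"print(undersampled['target'].value_counts().to_dict())"
]
},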
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 86,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsAAAAHWCAYAAAB5SD/0AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABj2ElEQVR4nO3dd1gU1/4G8Hcpu9QFpSOIiAVB1Iht7QVFRaPRxBiNJbaY4L1REzUkxhprTNRYk5sYvAnGFk2xI7aoWIKi2IgFxbYgGlh6Pb8/vMzPlSLgwqL7fp5nnrgzZ898Z2d38zJ7ZkYmhBAgIiIiIjIQRvougIiIiIioKjEAExEREZFBYQAmIiIiIoPCAExEREREBoUBmIiIiIgMCgMwERERERkUBmAiIiIiMigMwERERERkUBiAiYiIiMigMAATEenJ/PnzUVBQAAAoKCjAggUL9FwRlceFCxfw66+/So+jo6Oxc+dO/RX0Apg1axZkMpm+y9C5kSNHwsrKSt9lUDkwAJPOhIaGQiaTSZOZmRkaNGiACRMmICEhQd/lEVU769evx5IlS3Dnzh18+eWXWL9+vb5LonJITU3Fu+++ixMnTuDq1av44IMPEBMTo++yKqROnTpa398lTaGhofoutYgHDx7ggw8+gLe3N8zNzeHo6IhWrVph2rRpSEtLk9pt2LABy5Yt01+hVK3IhBBC30XQyyE0NBTvvPMO5syZA09PT2RlZeHo0aP48ccf4eHhgQsXLsDCwkLfZRJVG5s2bcLw4cORk5MDhUKBn376Ca+//rq+y6Jy6N+/P3777TcAQIMGDXD8+HHY2dnpuary+/XXX7XC4q5du/Dzzz9j6dKlsLe3l+a3bdsWdevWrfB68vLykJeXBzMzs+eqt9CjR4/wyiuvQKPRYNSoUfD29sbDhw9x/vx57NixA+fPn0edOnUAAH369MGFCxdw8+ZNnaz7SSNHjsTWrVu1XkOq3kz0XQC9fHr16oUWLVoAAMaMGQM7Ozt89dVX+O233/DWW2/puTqi6uPNN99Ely5dcO3aNdSvXx8ODg76LonK6ddff8WlS5eQmZkJPz8/yOVyfZdUIf3799d6rFar8fPPP6N///5SgCxOeno6LC0ty7weExMTmJjoLnp8//33iI+Px7Fjx9C2bVutZRqNpsL7IysrC3K5HEZG/KH8ZcU9S5Wua9euAIC4uDgAj/9i/+ijj+Dn5wcrKysolUr06tUL586dK/LcrKwszJo1Cw0aNICZmRlcXFwwYMAAXL9+HQBw8+bNUn+u69y5s9TXoUOHIJPJsGnTJnzyySdwdnaGpaUlXn31Vdy+fbvIuk+ePImePXvCxsYGFhYW6NSpE44dO1bsNnbu3LnY9c+aNatI259++gn+/v4wNzdHzZo1MXjw4GLXX9q2PamgoADLli2Dr68vzMzM4OTkhHfffRf//POPVrs6deqgT58+RdYzYcKEIn0WV/sXX3xR5DUFgOzsbMycORP16tWDQqGAu7s7pk6diuzs7GJfqyd17twZjRs3LjJ/yZIlkMlkRY7UJCcnY+LEiXB3d4dCoUC9evWwaNEiaRztkwrHGj49jRw5Uqvd3bt3MWrUKDg5OUGhUMDX1xfr1q3TalP43imcFAoFGjRogAULFuDpH9HOnj2LXr16QalUwsrKCt26dcOJEye02hQOF7p58yYcHR3Rtm1b2NnZoUmTJmX6mfnp4UbPet+VZxt1+fko3AeOjo7Izc3VWvbzzz9L9SYlJWkt2717Nzp06ABLS0tYW1sjKCgIFy9e1GpT0pjLrVu3QiaT4dChQ9K88r7PVq9eDV9fXygUCri6uiI4OBjJyclabTp37ix9Fnx8fODv749z584V+xktTUn78Mn6n9zmsuzvrVu3okWLFrC2ttZqt2TJkjLXVZzC1/z69evo3bs3rK2tMXToUADAn3/+iTfeeAO1a9eWvgcmTZqEzMxMrT6KGwMsk8kwYc
IE/Prrr2jcuLH0Ht2zZ88za7p+/TqMjY3Rpk2bIsuUSqV0pLlz587YuXMnbt26Jb0ehcG+8L2/ceNGTJ8+HbVq1YKFhQU0Gg0AYMuWLdJ3tr29Pd5++23cvXv3mbVFR0fDwcEBnTt3lo4Ml+WzCAArVqyAr68vLCwsUKNGDbRo0QIbNmx45jqp7HgEmCpdYVgt/Fnwxo0b+PXXX/HGG2/A09MTCQkJ+Oabb9CpUydcunQJrq6uAID8/Hz06dMHERERGDx4MD744AOkpqYiPDwcFy5cgJeXl7SOt956C71799Zab0hISLH1zJs3DzKZDNOmTUNiYiKWLVuGgIAAREdHw9zcHABw4MAB9OrVC/7+/pg5cyaMjIzwww8/oGvXrvjzzz/RqlWrIv26ublJJzGlpaXhvffeK3bdn332GQYNGoQxY8bgwYMHWLFiBTp27IizZ8/C1ta2yHPGjRuHDh06AAC2bduG7du3ay1/9913peEn//73vxEXF4eVK1fi7NmzOHbsGExNTYt9HcojOTm52BO0CgoK8Oqrr+Lo0aMYN24cGjVqhJiYGCxduhR///231glCzysjIwOdOnXC3bt38e6776J27do4fvw4QkJCcP/+/RLH9v3444/SvydNmqS1LCEhAW3atJH+B+zg4IDdu3dj9OjR0Gg0mDhxolb7Tz75BI0aNUJmZqYUFB0dHTF69GgAwMWLF9GhQwcolUpMnToVpqam+Oabb9C5c2ccPnwYrVu3LnH7fvzxx3KPHy0cblSouPddebexMj4fqamp2LFjB1577TVp3g8//AAzMzNkZWUVeR1GjBiBwMBALFq0CBkZGVizZg3at2+Ps2fPlno0UhdmzZqF2bNnIyAgAO+99x5iY2OxZs0anD59+pmfp2nTplVond27d8fw4cMBAKdPn8bXX39dYlt7e3ssXbpUejxs2DCt5ZGRkRg0aBCaNm2KhQsXwsbGBklJSUXe+xWVl5eHwMBAtG/fHkuWLJGGtW3ZsgUZGRl47733YGdnh1OnTmHFihW4c+cOtmzZ8sx+jx49im3btuH999+HtbU1vv76awwcOBDx8fGlDinx8PBAfn6+9L4pyaeffoqUlBTcuXNHev2e/gNq7ty5kMvl+Oijj5CdnQ25XC59t7Zs2RILFixAQkICli9fjmPHjpX4nQ083o+BgYFo0aIFfvvtN5ibm5f5s/if//wH//73v/H666/jgw8+QFZWFs6fP4+TJ09iyJAhz3wtqYwEkY788MMPAoDYv3+/ePDggbh9+7bYuHGjsLOzE+bm5uLOnTtCCCGysrJEfn6+1nPj4uKEQqEQc+bMkeatW7dOABBfffVVkXUVFBRIzwMgvvjiiyJtfH19RadOnaTHBw8eFABErVq1hEajkeZv3rxZABDLly+X+q5fv74IDAyU1iOEEBkZGcLT01N07969yLratm0rGjduLD1+8OCBACBmzpwpzbt586YwNjYW8+bN03puTEyMMDExKTL/6tWrAoBYv369NG/mzJniyY/tn3/+KQCIsLAwrefu2bOnyHwPDw8RFBRUpPbg4GDx9FfB07VPnTpVODo6Cn9/f63X9McffxRGRkbizz//1Hr+2rVrBQBx7NixIut7UqdOnYSvr2+R+V988YUAIOLi4qR5c+fOFZaWluLvv//Wavvxxx8LY2NjER8frzX/008/FTKZTGueh4eHGDFihPR49OjRwsXFRSQlJWm1Gzx4sLCxsREZGRlCiP9/7xw8eFBqk5WVJYyMjMT7778vzevfv7+Qy+Xi+vXr0rx79+4Ja2tr0bFjR2le4WelcPuysrJE7dq1Ra9evQQA8cMPPxR9sZ5Q+PzTp09rzS/ufVfebdTl56Pw/frWW2+JPn36SPNv3boljIyMxFtvvSUAiAcPHgghhEhNTRW2trZi7NixWrWq1WphY2OjNX/EiBHC0tKyyGuzZcuWIvuqrO+zxMREIZfLRY8ePbS+o1auXCkAiHXr1m
n1+eRnYdeuXQKA6NmzZ5HPU0lycnIEADFhwoRS6y80dOhQ4enpqTXv6f0dEhIiAIj79+9L80r7nixJcZ/BESNGCAD
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_sample_balance(train_strokes['stroke'], 'Train Strokes')\n",
"plot_sample_balance(val_strokes['stroke'], 'Validation Strokes')\n",
"plot_sample_balance(test_strokes['stroke'], 'Test Strokes')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three splits look similar, but all of them show a pronounced class imbalance. This is a problem: without rebalancing, a model trained on this data would mostly learn to predict the majority class and would recognize strokes poorly."
]
},
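{
"cell_type": "markdown",
"metadata": {},
"source": [
"To put a number on the imbalance seen in the plots, the class shares of each split can be printed. This is only a minimal sketch: it assumes that `train_strokes`, `val_strokes`, and `test_strokes` are the pandas DataFrames plotted above and that each has a `stroke` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: share of each class in every split (assumes the three DataFrames defined earlier)\n",
"for name, split in [('train', train_strokes), ('val', val_strokes), ('test', test_strokes)]:\n",
"    shares = split['stroke'].value_counts(normalize=True)\n",
"    print(name, shares.round(3).to_dict())"
]
},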
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_sample_balance(train_auto['Price'], 'Train Auto')\n",
"plot_sample_balance(val_auto['Price'], 'Validation Auto')\n",
"plot_sample_balance(test_auto['Price'], 'Test Auto')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distributions of the splits for this dataset look similar, which suggests that the splits are balanced. However, the training split has a noticeably wider range of values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 12. Augment the data using oversampling and undersampling. Provide example implementations of both methods for the splits of each dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strokes"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After oversampling (strokes): stroke\n",
"1    4861\n",
"0    4861\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.over_sampling import SMOTE\n",
"\n",
"X_strokes = strokes.drop('stroke', axis=1)\n",
"y_strokes = strokes['stroke']\n",
"\n",
"# Encode categorical features as integer codes\n",
"for column in X_strokes.select_dtypes(include=['object']).columns:\n",
"    X_strokes[column] = X_strokes[column].astype('category').cat.codes\n",
"\n",
"# Apply SMOTE to synthesize additional minority-class samples\n",
"smote = SMOTE(random_state=42)\n",
"X_resampled_strokes, y_resampled_strokes = smote.fit_resample(X_strokes, y_strokes)\n",
"\n",
"# Show the class counts after resampling\n",
"print(f'After oversampling (strokes): {pd.Series(y_resampled_strokes).value_counts()}')"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After undersampling (strokes): stroke\n",
"0    249\n",
"1    249\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Undersampling for strokes\n",
"undersample = RandomUnderSampler(random_state=42)\n",
"X_under_strokes, y_under_strokes = undersample.fit_resample(X_strokes, y_strokes)\n",
"\n",
"print(f'After undersampling (strokes): {pd.Series(y_under_strokes).value_counts()}')"
]
},
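{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two cells above resample the whole `strokes` frame. A common alternative is to rebalance only the training split, so that the validation and test data keep their natural class ratio. The sketch below does random oversampling with plain pandas. It assumes that `train_strokes` is the training DataFrame from the earlier split, that `pd` is already imported, and that `stroke == 1` is the minority class, as the counts above indicate; `train_balanced` is a name introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: random oversampling of the minority class in the training split only\n",
"minority = train_strokes[train_strokes['stroke'] == 1]\n",
"majority = train_strokes[train_strokes['stroke'] == 0]\n",
"\n",
"# Draw minority rows with replacement until both classes have the same size\n",
"minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)\n",
"\n",
"# Combine and shuffle so the classes are not stored in two contiguous blocks\n",
"train_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)\n",
"\n",
"print(train_balanced['stroke'].value_counts())"
]
},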
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cars"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After oversampling (strokes): stroke\n",
"1    4861\n",
"0    4861\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.over_sampling import SMOTE\n",
"\n",
"X_strokes = strokes.drop('stroke', axis=1)\n",
"y_strokes = strokes['stroke']\n",
"\n",
"# Encode categorical features as integer codes\n",
"for column in X_strokes.select_dtypes(include=['object']).columns:\n",
"    X_strokes[column] = X_strokes[column].astype('category').cat.codes\n",
"\n",
"# Apply SMOTE to synthesize additional minority-class samples\n",
"smote = SMOTE(random_state=42)\n",
"X_resampled_strokes, y_resampled_strokes = smote.fit_resample(X_strokes, y_strokes)\n",
"\n",
"# Show the class counts after resampling\n",
"print(f'After oversampling (strokes): {pd.Series(y_resampled_strokes).value_counts()}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After undersampling (strokes): stroke\n",
"0    249\n",
"1    249\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Undersampling for strokes\n",
"undersample = RandomUnderSampler(random_state=42)\n",
"X_under_strokes, y_under_strokes = undersample.fit_resample(X_strokes, y_strokes)\n",
"\n",
"print(f'After undersampling (strokes): {pd.Series(y_under_strokes).value_counts()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stores"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 1, n_samples = 1",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32mc:\\Users\\Матевос\\Desktop\\ИИ.Дырночкин\\ai\\lab2\\lab2.ipynb Cell 55\u001b[0m line \u001b[0;36m1\n\u001b[0;32m <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=7'>8</a>\u001b[0m \u001b[39m# Теперь применяем SMOTE\u001b[39;00m\n\u001b[0;32m <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=8'>9</a>\u001b[0m smote \u001b[39m=\u001b[39m SMOTE(random_state\u001b[39m=\u001b[39m\u001b[39m42\u001b[39m)\n\u001b[1;32m---> <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=9'>10</a>\u001b[0m X_resampled_shop, y_resampled_shop \u001b[39m=\u001b[39m smote\u001b[39m.\u001b[39;49mfit_resample(X_shop, y_shop)\n\u001b[0;32m <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=11'>12</a>\u001b[0m \u001b[39m# Получаем результаты\u001b[39;00m\n\u001b[0;32m <a href='vscode-notebook-cell:/c%3A/Users/%D0%9C%D0%B0%D1%82%D0%B5%D0%B2%D0%BE%D1%81/Desktop/%D0%98%D0%98.%D0%94%D1%8B%D1%80%D0%BD%D0%BE%D1%87%D0%BA%D0%B8%D0%BD/ai/lab2/lab2.ipynb#Y151sZmlsZQ%3D%3D?line=12'>13</a>\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39mf\u001b[39m\u001b[39m'\u001b[39m\u001b[39mПосле oversampling (strokes): \u001b[39m\u001b[39m{\u001b[39;00mpd\u001b[39m.\u001b[39mSeries(y_resampled_shop)\u001b[39m.\u001b[39mvalue_counts()\u001b[39m}\u001b[39;00m\u001b[39m'\u001b[39m)\n",
"File \u001b[1;32mc:\\Users\\Матевос\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\imblearn\\base.py:208\u001b[0m, in \u001b[0;36mBaseSampler.fit_resample\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 187\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"Resample the dataset.\u001b[39;00m\n\u001b[0;32m 188\u001b[0m \n\u001b[0;32m 189\u001b[0m \u001b[39mParameters\u001b[39;00m\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 205\u001b[0m \u001b[39m The corresponding label of `X_resampled`.\u001b[39;00m\n\u001b[0;32m 206\u001b[0m \u001b[39m\"\"\"\u001b[39;00m\n\u001b[0;32m 207\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_validate_params()\n\u001b[1;32m--> 208\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39msuper\u001b[39;49m()\u001b[39m.\u001b[39;49mfit_resample(X, y)\n",
|
|||
|
"File \u001b[1;32mc:\\Users\\Матевос\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\imblearn\\base.py:112\u001b[0m, in \u001b[0;36mSamplerMixin.fit_resample\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 106\u001b[0m X, y, binarize_y \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_check_X_y(X, y)\n\u001b[0;32m 108\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msampling_strategy_ \u001b[39m=\u001b[39m check_sampling_strategy(\n\u001b[0;32m 109\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msampling_strategy, y, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_sampling_type\n\u001b[0;32m 110\u001b[0m )\n\u001b[1;32m--> 112\u001b[0m output \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_fit_resample(X, y)\n\u001b[0;32m 114\u001b[0m y_ \u001b[39m=\u001b[39m (\n\u001b[0;32m 115\u001b[0m label_binarize(output[\u001b[39m1\u001b[39m], classes\u001b[39m=\u001b[39mnp\u001b[39m.\u001b[39munique(y)) \u001b[39mif\u001b[39;00m binarize_y \u001b[39melse\u001b[39;00m output[\u001b[39m1\u001b[39m]\n\u001b[0;32m 116\u001b[0m )\n\u001b[0;32m 118\u001b[0m X_, y_ \u001b[39m=\u001b[39m arrays_transformer\u001b[39m.\u001b[39mtransform(output[\u001b[39m0\u001b[39m], y_)\n",
|
|||
|
"File \u001b[1;32mc:\\Users\\Матевос\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\imblearn\\over_sampling\\_smote\\base.py:389\u001b[0m, in \u001b[0;36mSMOTE._fit_resample\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 386\u001b[0m X_class \u001b[39m=\u001b[39m _safe_indexing(X, target_class_indices)\n\u001b[0;32m 388\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mnn_k_\u001b[39m.\u001b[39mfit(X_class)\n\u001b[1;32m--> 389\u001b[0m nns \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mnn_k_\u001b[39m.\u001b[39;49mkneighbors(X_class, return_distance\u001b[39m=\u001b[39;49m\u001b[39mFalse\u001b[39;49;00m)[:, \u001b[39m1\u001b[39m:]\n\u001b[0;32m 390\u001b[0m X_new, y_new \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_make_samples(\n\u001b[0;32m 391\u001b[0m X_class, y\u001b[39m.\u001b[39mdtype, class_sample, X_class, nns, n_samples, \u001b[39m1.0\u001b[39m\n\u001b[0;32m 392\u001b[0m )\n\u001b[0;32m 393\u001b[0m X_resampled\u001b[39m.\u001b[39mappend(X_new)\n",
|
|||
|
"File \u001b[1;32mc:\\Users\\Матевос\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\neighbors\\_base.py:834\u001b[0m, in \u001b[0;36mKNeighborsMixin.kneighbors\u001b[1;34m(self, X, n_neighbors, return_distance)\u001b[0m\n\u001b[0;32m 832\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m 833\u001b[0m inequality_str \u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39mn_neighbors <= n_samples_fit\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m--> 834\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m 835\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mExpected \u001b[39m\u001b[39m{\u001b[39;00minequality_str\u001b[39m}\u001b[39;00m\u001b[39m, but \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m 836\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mn_neighbors = \u001b[39m\u001b[39m{\u001b[39;00mn_neighbors\u001b[39m}\u001b[39;00m\u001b[39m, n_samples_fit = \u001b[39m\u001b[39m{\u001b[39;00mn_samples_fit\u001b[39m}\u001b[39;00m\u001b[39m, \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m 837\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mn_samples = \u001b[39m\u001b[39m{\u001b[39;00mX\u001b[39m.\u001b[39mshape[\u001b[39m0\u001b[39m]\u001b[39m}\u001b[39;00m\u001b[39m\"\u001b[39m \u001b[39m# include n_samples for common tests\u001b[39;00m\n\u001b[0;32m 838\u001b[0m )\n\u001b[0;32m 840\u001b[0m n_jobs \u001b[39m=\u001b[39m effective_n_jobs(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mn_jobs)\n\u001b[0;32m 841\u001b[0m chunked_results \u001b[39m=\u001b[39m \u001b[39mNone\u001b[39;00m\n",
|
|||
|
"\u001b[1;31mValueError\u001b[0m: Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 1, n_samples = 1"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"X_shop = shop.drop('Store_Sales', axis=1)\n",
|
|||
|
"y_shop = shop['Store_Sales']\n",
|
|||
|
"\n",
|
|||
|
"# Кодирование категориальных признаков\n",
|
|||
|
"for column in X_shop.select_dtypes(include=['object']).columns:\n",
|
|||
|
" X_shop[column] = X_shop[column].astype('category').cat.codes\n",
|
|||
|
"\n",
|
|||
|
"# Теперь применяем SMOTE\n",
|
|||
|
"smote = SMOTE(random_state=42)\n",
|
|||
|
"X_resampled_shop, y_resampled_shop = smote.fit_resample(X_shop, y_shop)\n",
|
|||
|
"\n",
|
|||
|
"# Получаем результаты\n",
|
|||
|
"print(f'После oversampling (strokes): {pd.Series(y_resampled_shop).value_counts()}')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"После undersampling (strokes): stroke\n",
|
|||
|
"0 249\n",
|
|||
|
"1 249\n",
|
|||
|
"Name: count, dtype: int64\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from imblearn.under_sampling import RandomUnderSampler\n",
|
|||
|
"\n",
|
|||
|
"# Undersampling для strokes\n",
|
|||
|
"undersample = RandomUnderSampler(random_state=42)\n",
|
|||
|
"X_under_strokes, y_under_strokes = undersample.fit_resample(X_strokes, y_strokes)\n",
|
|||
|
"\n",
|
|||
|
"print(f'После undersampling (strokes): {pd.Series(y_under_strokes).value_counts()}')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"В данном случае у нас есть только один датасет, предназначенный для решения задачи классификации (инсульт). Проблему дисбаланса в нем мы решили применив undersampling & oversampling.\n",
|
|||
|
"\n",
|
|||
|
"Два остальных датасета не содержат классов, т.к предназначены для решения задачи регрессии (предсказания цен на автомобили или на чек в супермаркете), поэтому выполнять приращение данных не требуется."
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"language_info": {
|
|||
|
"name": "python"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|