ISEbd-31_Alimova_M.S._MAI/labs/lab2/lab2.ipynb

1174 lines
421 KiB
Plaintext
Raw Permalink Normal View History

2024-11-09 10:40:40 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Лабораторная работа №2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ход выполнения работы. \n",
"1. Выбрать три набора данных, которые не соответствуют Вашему варианту задания.\n",
"2. Провести анализ сведений о каждом наборе данных со страницы загрузки в Kaggle.\n",
"Какова проблемная область?\n",
"3. Провести анализ содержимого каждого набора данных. Что является\n",
"объектом/объектами наблюдения? Каковы атрибуты объектов? Есть ли связи между\n",
"объектами?\n",
"4. Привести примеры бизнес-целей, для достижения которых могут подойти\n",
"выбранные наборы данных. Каков эффект для бизнеса?\n",
"5. Привести примеры целей технического проекта для каждой выделенной ранее\n",
"бизнес-цели. Что поступает на вход, что является целевым признаком?\n",
"6. Определить проблемы выбранных наборов данных: зашумленность, смещение,\n",
"актуальность, выбросы, просачивание данных.\n",
"7. Привести примеры решения обнаруженных проблем для каждого набора данных.\n",
"8. Оценить качество каждого набора данных: информативность, степень покрытия,\n",
"соответствие реальным данным, согласованность меток.\n",
"9. Устранить проблему пропущенных данных. Для каждого набора данных\n",
"использовать разные методы: удаление, подстановка константного значения (0 или\n",
"подобное), подстановка среднего значения.\n",
"10. Выполнить разбиение каждого набора данных на обучающую, контрольную и\n",
"тестовую выборки.\n",
"11. Оценить сбалансированность выборок для каждого набора данных. Оценить\n",
"необходимость использования методов приращения (аугментации) данных.\n",
"12. Выполнить приращение данных методами выборки с избытком (oversampling) и\n",
"выборки с недостатком (undersampling). Должны быть представлены примеры\n",
"реализации обоих методов для выборок каждого набора данных.\n",
"13. Все выводы и программный код должны быть оформлены в виде ноутбука. Для\n",
"выполнения данной лабораторной работы следует создать новый файл-ноутбук."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Пункты 1-5."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Были выбраны датасеты:\n",
"1. Объекты вокруг Земли (https://www.kaggle.com/datasets/sameepvani/nasa-nearest-earth-objects)\n",
"7. Экономика стран (https://www.kaggle.com/datasets/pratik453609/economic-data-9-countries-19802020)\n",
"18. Цены на мобильное устройство (https://www.kaggle.com/datasets/dewangmoghe/mobile-phone-price-prediction)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Объекты вокруг Земли**\n",
"\n",
"Проблемная область: космические объекты и их угроза для Земли\n",
"\n",
"Объект наблюдения: астероиды и другие малые тела Солнечной системы\n",
"\n",
"Атрибуты: имя объекта, минимальный и максимальный оценочные диаметры, относительная скорость, расстояние промаха, орбитальное тело, объекты программы \"Сентри\", абсолютная звездная величина, опасность\n",
"\n",
"Пример бизнес-цели:\n",
"\n",
"1. Разработка и продажа страховых продуктов для космических рисков. Цель технического проекта: разработка системы оценки рисков и ценообразования для страховых продуктов, защищающих от космических угроз.\n",
"\n",
"2. Разработка и продажа технологий для мониторинга и предотвращения космических угроз. Цель технического проекта: создание системы мониторинга и прогнозирования траекторий небесных тел для предотвращения космических угроз.\n",
"\n",
"3. Образовательные программы и сервисы. Цель технического проекта: разработка интерактивных образовательных материалов и сервисов, основанных на данных о небесных телах.\n",
"Актуальность: Исследования астероидов и разработка технологий для их отклонения не только помогают защитить Землю от потенциальных угроз, но и стимулируют научные открытия в различных областях, включая астрономию, физику, инженерию и образование. Эта тема имеет важное значение для будущего нашей планеты и человечества в целом."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Экономика стран**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Проблемная область: экономические данные по странам\n",
"\n",
"Объект наблюдения: каждая страна (например, США, Франция) выступает отдельным объектом наблюдения.\n",
"\n",
"Атрибуты: фондовый индекс: среднегодовая цена индекса, инфляция (inflationrate): годовой уровень инфляции, цена на нефть (oil prices): среднегодовая стоимость нефти, обменный курс (exchange_rate): курс национальной валюты к доллару США, ВВП в процентах (gdppercent): прирост ВВП, доход на душу населения (percapitaincome): средний доход на человека в стране.\n",
"\n",
"Пример бизнес-цели:\n",
"1) Определение факторов, влияющих на рост фондового рынка в различных странах (эффект для бизнеса: позволяет финансовым компаниям и инвесторам создавать более эффективные стратегии вложений)\n",
"2) Прогнозирование инфляции и курсов валют (эффект для бизнеса: помогает компаниям адаптироваться к колебаниям на валютных рынках и снижать риски при операциях с иностранной валютой)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Цены на мобильные устройства**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Проблемная область: цены, характеристики мобильных устройств разных компаний.\n",
"\n",
"Объект наблюдения: каждый мобильный телефон представляет собой отдельный объект.\n",
"\n",
"Атрибуты: фондовый индекс: рейтинг, спецификации, количество SIM-карт, поддержка сетей (3G, 4G, 5G), оперативная память (RAM), батарея: емкость батареи, экран (размер и разрешение), камера: характеристика камеры (основной и фронтальной), встроенная и внешняя память, версия Android, процессор, поддержка быстрой зарядки.\n",
"\n",
"Пример бизнес-цели:\n",
"1) Анализ характеристик, влияющих на рейтинг и популярность модели.\n",
"Эффект для бизнеса: помогает производителям оптимизировать параметры устройства для увеличения спроса.\n",
"2) Оптимизация ценовой стратегии в зависимости от характеристик.\n",
"Эффект для бизнеса: позволяет разработать ценовые сегменты, соответствующие характеристикам, и повысить конкурентоспособность."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**№6.** Сперва напишем функции для определения проблем выбранных наборов данных: зашумленности, смещения, актуальности, выбросов, просачивания данных.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"\n",
"#Проверка на зашумленность\n",
"def check_noise(dataframe):\n",
" total_values = dataframe.size\n",
" missing_values = dataframe.isnull().sum().sum()\n",
" noise_percentage = (missing_values / total_values) * 100\n",
" return f\"Зашумленность: {noise_percentage:.2f}%\"\n",
"\n",
"\n",
"#Проверка на смещение \n",
"def check_bias(dataframe, target_column):\n",
" if target_column in dataframe.columns:\n",
" unique_values = dataframe[target_column].nunique()\n",
" total_values = len(dataframe)\n",
" bias_percentage = (unique_values / total_values) * 100\n",
" return (\n",
" f\"Смещение по {target_column}: {bias_percentage:.2f}% уникальных значений\"\n",
" )\n",
" return \"Целевой признак не найден.\"\n",
"\n",
"\n",
"#Проверка на дубликаты\n",
"def check_duplicates(dataframe):\n",
" duplicate_percentage = dataframe.duplicated().mean() * 100\n",
" return f\"Количество дубликатов: {duplicate_percentage:.2f}%\"\n",
"\n",
"\n",
"#Проверка на выбросы\n",
"def check_outliers(dataframe, column):\n",
" if column in dataframe.columns:\n",
" Q1 = dataframe[column].quantile(0.25)\n",
" Q3 = dataframe[column].quantile(0.75)\n",
" IQR = Q3 - Q1\n",
" lower_bound = Q1 - 1.5 * IQR\n",
" upper_bound = Q3 + 1.5 * IQR\n",
" outlier_count = dataframe[\n",
" (dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)\n",
" ].shape[0]\n",
" total_count = dataframe.shape[0]\n",
" outlier_percentage = (outlier_count / total_count) * 100\n",
" return f\"Выбросы по {column}: {outlier_percentage:.2f}%\"\n",
" return f\"Признак {column} не найден.\"\n",
"\n",
"\n",
"#Проверка на просачивание данных\n",
"def check_data_leakage(dataframe, target_column):\n",
" if target_column in dataframe.columns:\n",
" correlation_matrix = dataframe.select_dtypes(include=[np.number]).corr()\n",
" leakage_info = correlation_matrix[target_column].abs().nlargest(10)\n",
" leakage_report = \", \".join(\n",
" [\n",
" f\"{feature}: {value:.2f}\"\n",
" for feature, value in leakage_info.items()\n",
" if feature != target_column\n",
" ]\n",
" )\n",
" return f\"Признаки просачивания данных: {leakage_report}\"\n",
" return \"Целевой признак не найден.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет 1. Объекты вокруг земли"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['id', 'name', 'est_diameter_min', 'est_diameter_max',\n",
" 'relative_velocity', 'miss_distance', 'orbiting_body', 'sentry_object',\n",
" 'absolute_magnitude', 'hazardous'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Вывод всех столбцов\n",
"\n",
"df_neo = pd.read_csv(\"neo.csv\")\n",
"print(df_neo.columns)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Зашумленность: 0.00%\n",
"Смещение по miss_distance: 99.67% уникальных значений\n",
"Количество дубликатов: 0.00%\n",
"Выбросы по relative_velocity: 1.73%\n",
"Признаки просачивания данных: est_diameter_min: 0.56, est_diameter_max: 0.56, relative_velocity: 0.35, id: 0.28, miss_distance: 0.26\n"
]
}
],
"source": [
"noise_columns = check_noise(df_neo)\n",
"bias_info = check_bias(df_neo, \"miss_distance\")\n",
"duplicate_count = check_duplicates(df_neo)\n",
"outliers_data = check_outliers(df_neo, \"relative_velocity\")\n",
"leakage_info = check_data_leakage(df_neo, \"absolute_magnitude\")\n",
"\n",
"print(noise_columns)\n",
"print(bias_info)\n",
"print(duplicate_count)\n",
"print(outliers_data)\n",
"print(leakage_info)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет 7. Экономика стран"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['stock index', 'country', 'year', 'index price', 'log_indexprice',\n",
" 'inflationrate', 'oil prices', 'exchange_rate', 'gdppercent',\n",
" 'percapitaincome', 'unemploymentrate', 'manufacturingoutput',\n",
" 'tradebalance', 'USTreasury'],\n",
" dtype='object')\n"
]
}
],
"source": [
"df_ed = pd.read_csv(\"economic_data.csv\")\n",
"print(df_ed.columns)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Зашумленность: 4.51%\n",
"Смещение по index price: 85.64% уникальных значений\n",
"Количество дубликатов: 0.00%\n",
"Выбросы по unemploymentrate: 4.88%\n",
"Признаки просачивания данных: percapitaincome: 0.51, USTreasury: 0.49, year: 0.47, log_indexprice: 0.34, gdppercent: 0.26, oil prices: 0.22, unemploymentrate: 0.18, manufacturingoutput: 0.11, index price: 0.08\n"
]
}
],
"source": [
"noise_columns = check_noise(df_ed)\n",
"bias_info = check_bias(df_ed, \"index price\")\n",
"duplicate_count = check_duplicates(df_ed)\n",
"outliers_data = check_outliers(df_ed, \"unemploymentrate\")\n",
"leakage_info = check_data_leakage(df_ed, \"inflationrate\")\n",
"\n",
"print(noise_columns)\n",
"print(bias_info)\n",
"print(duplicate_count)\n",
"print(outliers_data)\n",
"print(leakage_info)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет 18. Цена на мобильные устройства"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Unnamed: 0', 'Name', 'Rating', 'Spec_score', 'No_of_sim', 'Ram',\n",
" 'Battery', 'Display', 'Camera', 'External_Memory', 'Android_version',\n",
" 'Price', 'company', 'Inbuilt_memory', 'fast_charging',\n",
" 'Screen_resolution', 'Processor', 'Processor_name'],\n",
" dtype='object')\n"
]
}
],
"source": [
"df_mp = pd.read_csv(\"mobile_phone_price_prediction.csv\")\n",
"print(df_mp.columns)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Зашумленность: 2.36%\n",
"Смещение по Name: 97.37% уникальных значений\n",
"Количество дубликатов: 0.00%\n",
"Выбросы по Spec_score: 1.24%\n",
"Признаки просачивания данных: Spec_score: 0.06, Unnamed: 0: 0.03\n"
]
}
],
"source": [
"noise_columns = check_noise(df_mp)\n",
"bias_info = check_bias(df_mp, \"Name\")\n",
"duplicate_count = check_duplicates(df_mp)\n",
"outliers_data = check_outliers(df_mp, \"Spec_score\")\n",
"leakage_info = check_data_leakage(df_mp, \"Rating\")\n",
"\n",
"print(noise_columns)\n",
"print(bias_info)\n",
"print(duplicate_count)\n",
"print(outliers_data)\n",
"print(leakage_info)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет 1. Объекты вокруг Земли"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 0\n",
"name 0\n",
"est_diameter_min 0\n",
"est_diameter_max 0\n",
"relative_velocity 0\n",
"miss_distance 0\n",
"orbiting_body 0\n",
"sentry_object 0\n",
"absolute_magnitude 0\n",
"hazardous 0\n",
"dtype: int64"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_neo.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Пропущенных значений нет, поэтому пропускаем."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет 7. Экономика стран"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"stock index 0\n",
"country 0\n",
"year 0\n",
"index price 52\n",
"log_indexprice 0\n",
"inflationrate 43\n",
"oil prices 0\n",
"exchange_rate 2\n",
"gdppercent 19\n",
"percapitaincome 1\n",
"unemploymentrate 21\n",
"manufacturingoutput 91\n",
"tradebalance 4\n",
"USTreasury 0\n",
"dtype: int64"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ed.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Имеются пустые значения. На их место поставим \"No value\""
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"df_ed[\"index price\"] = df_ed[\"index price\"].fillna(\"No value\")\n",
"df_ed[\"inflationrate\"] = df_ed[\"inflationrate\"].fillna(\"No value\")\n",
"df_ed[\"exchange_rate\"] = df_ed[\"exchange_rate\"].fillna(\"No value\")\n",
"df_ed[\"gdppercent\"] = df_ed[\"gdppercent\"].fillna(\"No value\")\n",
"df_ed[\"percapitaincome\"] = df_ed[\"percapitaincome\"].fillna(\"No value\")\n",
"df_ed[\"unemploymentrate\"] = df_ed[\"unemploymentrate\"].fillna(\"No value\")\n",
"df_ed[\"manufacturingoutput\"] = df_ed[\"manufacturingoutput\"].fillna(\"No value\")\n",
"df_ed[\"tradebalance\"] = df_ed[\"tradebalance\"].fillna(\"No value\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_ed[\"index price\"] = df_ed[\"index price\"].replace(\"No value\", 0)\n",
"df_ed[\"inflationrate\"] = df_ed[\"inflationrate\"].replace(\"No value\", 0)\n",
"df_ed[\"exchange_rate\"] = df_ed[\"exchange_rate\"].replace(\"No value\", 0)\n",
"df_ed[\"gdppercent\"] = df_ed[\"gdppercent\"].replace(\"No value\", 0)\n",
"df_ed[\"percapitaincome\"] = df_ed[\"percapitaincome\"].replace(\"No value\", 0)\n",
"df_ed[\"unemploymentrate\"] = df_ed[\"unemploymentrate\"].replace(\"No value\", 0)\n",
"df_ed[\"manufacturingoutput\"] = df_ed[\"manufacturingoutput\"].replace(\"No value\", 0)\n",
"df_ed[\"tradebalance\"] = df_ed[\"tradebalance\"].replace(\"No value\", 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Снова проверим датафрейм на наличие пустых значений:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"stock index 0\n",
"country 0\n",
"year 0\n",
"index price 0\n",
"log_indexprice 0\n",
"inflationrate 0\n",
"oil prices 0\n",
"exchange_rate 0\n",
"gdppercent 0\n",
"percapitaincome 0\n",
"unemploymentrate 0\n",
"manufacturingoutput 0\n",
"tradebalance 0\n",
"USTreasury 0\n",
"dtype: int64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ed.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет 18. Цены на мобильные устройства"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Unnamed: 0 0\n",
"Name 0\n",
"Rating 0\n",
"Spec_score 0\n",
"No_of_sim 0\n",
"Ram 0\n",
"Battery 0\n",
"Display 0\n",
"Camera 0\n",
"External_Memory 0\n",
"Android_version 443\n",
"Price 0\n",
"company 0\n",
"Inbuilt_memory 19\n",
"fast_charging 89\n",
"Screen_resolution 2\n",
"Processor 28\n",
"Processor_name 0\n",
"dtype: int64"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_mp.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"df_mp[\"Android_version\"] = df_mp[\"Android_version\"].fillna(\"No value\")\n",
"df_mp[\"Inbuilt_memory\"] = df_mp[\"Inbuilt_memory\"].fillna(\"No value\")\n",
"df_mp[\"fast_charging\"] = df_mp[\"fast_charging\"].fillna(\"No value\")\n",
"df_mp[\"Screen_resolution\"] = df_mp[\"Screen_resolution\"].fillna(\"No value\")\n",
"df_mp[\"Processor\"] = df_mp[\"Processor\"].fillna(\"No value\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_mp[\"Android_version\"] = df_mp[\"Android_version\"].replace(\"No value\", 0)\n",
"df_mp[\"Inbuilt_memory\"] = df_mp[\"Inbuilt_memory\"].replace(\"No value\", 0)\n",
"df_mp[\"fast_charging\"] = df_mp[\"fast_charging\"].replace(\"No value\", 0)\n",
"df_mp[\"Screen_resolution\"] = df_mp[\"Screen_resolution\"].replace(\"No value\", 0)\n",
"df_mp[\"Processor\"] = df_mp[\"Processor\"].replace(\"No value\", 0)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Unnamed: 0 0\n",
"Name 0\n",
"Rating 0\n",
"Spec_score 0\n",
"No_of_sim 0\n",
"Ram 0\n",
"Battery 0\n",
"Display 0\n",
"Camera 0\n",
"External_Memory 0\n",
"Android_version 0\n",
"Price 0\n",
"company 0\n",
"Inbuilt_memory 0\n",
"fast_charging 0\n",
"Screen_resolution 0\n",
"Processor 0\n",
"Processor_name 0\n",
"dtype: int64"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_mp.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"10. Выполнить разбиение каждого набора данных на обучающую, контрольную и тестовую выборки."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df_neo Dataset:\n",
"Train: 80.00%\n",
"Validation: 10.00%\n",
"Test: 10.00%\n",
"\n"
]
}
],
"source": [
"# Разбиение df_neo\n",
"\n",
"original_df_neo_size = len(df_neo)\n",
"train_df_neo, temp_df_neo = train_test_split(df_neo, test_size=0.2, random_state=42)\n",
"val_df_neo, test_df_neo = train_test_split(temp_df_neo, test_size=0.5, random_state=42)\n",
"\n",
"print(\"df_neo Dataset:\")\n",
"print(f\"Train: {len(train_df_neo)/original_df_neo_size*100:.2f}%\")\n",
"print(f\"Validation: {len(val_df_neo)/original_df_neo_size*100:.2f}%\")\n",
"print(f\"Test: {len(test_df_neo)/original_df_neo_size*100:.2f}%\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df_ed Dataset:\n",
"Train: 79.95%\n",
"Validation: 10.03%\n",
"Test: 10.03%\n",
"\n"
]
}
],
"source": [
"# Разбиение df_ed\n",
"\n",
"original_df_ed_size = len(df_ed)\n",
"train_df_ed, temp_df_ed = train_test_split(df_ed, test_size=0.2, random_state=42)\n",
"val_df_ed, test_df_ed = train_test_split(temp_df_ed, test_size=0.5, random_state=42)\n",
"\n",
"print(\"df_ed Dataset:\")\n",
"print(f\"Train: {len(train_df_ed)/original_df_ed_size*100:.2f}%\")\n",
"print(f\"Validation: {len(val_df_ed)/original_df_ed_size*100:.2f}%\")\n",
"print(f\"Test: {len(test_df_ed)/original_df_ed_size*100:.2f}%\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df_mp Dataset:\n",
"Train: 80.00%\n",
"Validation: 10.00%\n",
"Test: 10.00%\n",
"\n"
]
}
],
"source": [
"# Разбиение df_mp\n",
"\n",
"original_df_mp_size = len(df_mp)\n",
"train_df_mp, temp_df_mp = train_test_split(df_mp, test_size=0.2, random_state=42)\n",
"val_df_mp, test_df_mp = train_test_split(temp_df_mp, test_size=0.5, random_state=42)\n",
"\n",
"print(\"df_mp Dataset:\")\n",
"print(f\"Train: {len(train_df_mp)/original_df_mp_size*100:.2f}%\")\n",
"print(f\"Validation: {len(val_df_mp)/original_df_mp_size*100:.2f}%\")\n",
"print(f\"Test: {len(test_df_mp)/original_df_mp_size*100:.2f}%\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"11. Оценить сбалансированность выборок для каждого набора данных. Оценить необходимость использования методов приращения (аугментации) данных."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAskAAAHWCAYAAACFXRQ+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB+k0lEQVR4nO3dd3xT5f4H8M/JbDqSLrqgLWUWypKiWLZSKYgDxIsoKCqCV0FF7wXlqoA4EEQBEdfPq6gXFyq4EEWmQGVvSlmFQksXHelOmzy/P0Ii6aItaU/aft6vV1+Qc56c802Tpp8+ec7zSEIIASIiIiIislPIXQARERERkathSCYiIiIiqoAhmYiIiIioAoZkIiIiIqIKGJKJiIiIiCpgSCYiIiIiqoAhmYiIiIioAoZkIiIiIqIKGJKJiIiIiCpgSCYiaiJee+01WCwWAIDFYsH8+fNlrojq4siRI1izZo399oEDB/DLL7/IV1ATMHfuXEiS1ODn2bx5MyRJwubNmx22f/7554iMjIRarYa3t3eD10GuhSGZZLNixQpIkmT/cnNzQ6dOnTBt2jSkp6fLXR6Ry/n000+xaNEiXLhwAW+++SY+/fRTuUuiOsjPz8ejjz6Kv/76CydPnsRTTz2Fw4cPy11WvbRt29bh/bu6rxUrVshdar0dP34cDz74INq3b4//+7//w4cffih3SdTIVHIXQDRv3jxERESgpKQE27Ztw3vvvYe1a9fiyJEjcHd3l7s8Ipcxb948PPDAA3j22Weh1Wrxv//9T+6SqA5iYmLsXwDQqVMnTJ48Weaq6mfJkiUoKCiw3167di2+/PJLLF68GP7+/vbt/fr1u6bzvPDCC3juueeu6Rj1tXnzZlgsFixduhQdOnSQpQaSF0MyyW7EiBHo06cPAOCRRx6Bn58f3nrrLfzwww+49957Za6OyHXcc889uOmmm3Dq1Cl07NgRrVq1krskqqM1a9bg2LFjKC4uRvfu3aHRaOQuqV5GjRrlcDstLQ1ffvklRo0ahbZt21Z7v8LCQnh4eNT6PCqVCiqVPFElIyMDADjMogXjcAtyOTfffDMAICkpCQCQnZ2Nf//73+jevTs8PT2h1+sxYsQIHDx4sNJ9S0pKMHfuXHTq1Alubm4IDg7GXXfdhdOnTwMAzp49W+NHg0OGDLEfyzZG7euvv8Z//vMfBAUFwcPDA3fccQfOnz9f6dw7d+7E8OHDYTAY4O7ujsGDB2P79u1VPsYhQ4ZUef65c+dWavu///0P0dHR0Ol08PX1xbhx46o8f02P7UoWiwVLlixBVFQU3NzcEBgYiEcffRQ5OTkO7dq2bYvbbrut0nmmTZtW6ZhV1f7GG29U+p4CQGlpKebMmYMOHTpAq9UiNDQUM2fORGlpaZXfqysNGTIE3bp1q7R90aJFkCQJZ8+eddiem5uL6dOnIzQ0FFqtFh06dMCCBQvs43qvZBv7WPHrwQcfdGiXkpKChx9+GIGBgdBqtYiKisLHH3/s0Mb22rF9abVadOrUCfPnz4cQwqHt/v37MWLECOj1enh6emLo0KH466+/HNrYhiadPXsWAQEB6NevH/z8/NCjR49afaRdcWjT1V53dXmMzvz5sD0HAQEBKCsrc9j35Zdf2uvNyspy2Pfrr79i4MCB8PDwgJeXF0aOHImjR486tHnwwQfh6elZqa5vv/220ljUur7O3n33XURFRUGr1SIkJARTp05Fbm6uQ5shQ4bYfxa6du2K6OhoHDx4sMqf0ZpU9xxWHEtre8y1eb6//fZb9OnTB15eXg7tFi1aVOu6qmL7np8+fRq33norvLy8MH78eADAn3/+iX/84x8ICwuzvw88/fTTKC4udjhGVWOSJUnCtGnTsGbNGnTr1s3+Gl23bl2t6rpw4QJGjRoFDw8PBAQE4Omnn670/tO2bVvMmTMHANCqVatq35+rYqv51KlTePDBB+Ht7Q2DwYCHHnoIRUVFldrX9j
1+1apV9nb+/v6YMGECUlJSalUT1Q97ksnl2AKtn58fAODMmTNYs2YN/vGPfyAiIgLp6en44IMPMHjwYBw7dgwhISEAALPZjNtuuw0bNmzAuHHj8NRTTyE/Px/r16/HkSNH0L59e/s57r33Xtx6660O5501a1aV9bz66quQJAnPPvssMjIysGTJEsTGxuLAgQPQ6XQAgI0bN2LEiBGIjo7GnDlzoFAo8Mknn+Dmm2/Gn3/+iRtuuKHScdu0aWO/8KqgoACPPfZYled+8cUXMXbsWDzyyCPIzMzEsmXLMGjQIOzfv7/KHo4pU6Zg4MCBAIDvv/8eq1evdtj/6KOPYsWKFXjooYfw5JNPIikpCe+88w7279+P7du3Q61WV/l9qIvc3NwqLyqzWCy44447sG3bNkyZMgVdunTB4cOHsXjxYpw4ccLhoqZrVVRUhMGDByMlJQWPPvoowsLCsGPHDsyaNQsXL17EkiVLqrzf559/bv//008/7bAvPT0dN954o/2XdKtWrfDrr79i0qRJMBqNmD59ukP7//znP+jSpQuKi4vtYTIgIACTJk0CABw9ehQDBw6EXq/HzJkzoVar8cEHH2DIkCHYsmUL+vbtW+3j+/zzz+s8ntU2tMmmqtddXR9jQ/x85Ofn4+eff8bo0aPt2z755BO4ubmhpKSk0vdh4sSJiIuLw4IFC1BUVIT33nsPAwYMwP79+2vs1XSGuXPn4qWXXkJsbCwee+wxJCYm4r333sPu3buv+vP07LPP1uuct9xyCx544AEAwO7du/H2229X29bf3x+LFy+2377//vsd9sfHx2Ps2LHo2bMnXn/9dRgMBmRlZVV67ddXeXk54uLiMGDAACxatMg+hG7VqlUoKirCY489Bj8/P+zatQvLli3DhQsXsGrVqqsed9u2bfj+++/x+OOPw8vLC2+//TbGjBmD5ORk+++OqhQXF2Po0KFITk7Gk08+iZCQEHz++efYuHGjQ7slS5bgs88+w+rVq/Hee+/B09MTPXr0qNNjHzt2LCIiIjB//nzs27cPH330EQICArBgwQJ7m9q+x9ves6+//nrMnz8f6enpWLp0KbZv317t7wJyAkEkk08++UQAEH/88YfIzMwU58+fF1999ZXw8/MTOp1OXLhwQQghRElJiTCbzQ73TUpKElqtVsybN8++7eOPPxYAxFtvvVXpXBaLxX4/AOKNN96o1CYqKkoMHjzYfnvTpk0CgGjdurUwGo327d98840AIJYuXWo/dseOHUVcXJz9PEIIUVRUJCIiIsQtt9xS6Vz9+vUT3bp1s9/OzMwUAMScOXPs286ePSuUSqV49dVXHe57+PBhoVKpKm0/efKkACA+/fRT+7Y5c+aIK3/M//zzTwFArFy50uG+69atq7Q9PDxcjBw5slLtU6dOFRXfOirWPnPmTBEQECCio6Mdvqeff/65UCgU4s8//3S4//vvvy8AiO3bt1c635UGDx4soqKiKm1/4403BACRlJRk3/byyy8LDw8PceLECYe2zz33nFAqlSI5Odlh+/PPPy8kSXLYFh4eLiZOnGi/PWnSJBEcHCyysrIc2o0bN04YDAZRVFQkhPj7tbNp0yZ7m5KSEqFQKMTjjz9u3zZq1Cih0WjE6dOn7dtSU1OFl5eXGDRokH2b7WfF9vhKSkpEWFiYGDFihAAgPvnkk8rfrCvY7r97926H7VW97ur6GJ3582F7vd57773itttus28/d+6cUCgU4t577xUARGZmphBCiPz8fOHt7S0mT57sUGtaWpowGAwO2ydOnCg8PDwqfW9WrVpV6bmq7essIyNDaDQaMWzYMIf3qHfeeUcAEB9//LHDMa/8WVi7dq0AIIYPH17p56k6JpNJABDTpk2rsX6b8ePHi4iICIdtFZ/vWbNmCQDi4sWL9m01vU9Wp6qfwYkTJwoA4rnnnqvU3vY6utL8+fOFJEni3Llz9m0V38Nsj0Gj0YhTp07Ztx08eFAAEMuWLa
uxziVLlggA4ptvvrFvKywsFB06dKj0fbSd2/Z6qy3b/R5++GGH7aNHjxZ+fn7227V9jzeZTCIgIEB069ZNFBcX29v
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsAAAAHWCAYAAAB5SD/0AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACKAElEQVR4nOzdd3xT5f4H8M9JmqRN23Q3bemglL2hQC1bqTJVhKvCBVHkB16Fq6AiFwfLgQhXUUQcV8UBKihDUVE2CmVvKBVKoYXO0L3b5Pn9URoJHbSl5STN5/165dXmnCfnfM/JSfLNk2dIQggBIiIiIiI7oZA7ACIiIiKi24kJMBERERHZFSbARERERGRXmAATERERkV1hAkxEREREdoUJMBERERHZFSbARERERGRXmAATERERkV1hAkxEREREdoUJMBFRHb3xxhswmUwAAJPJhIULF8ocEdXFqVOnsGHDBvP9Y8eO4eeff5YvICskSRLmzZtnvr9y5UpIkoSLFy/e9LHNmzfHY4891qDxPPbYY2jevHmDbvNW7Ny5E5IkYefOnRbLv/rqK7Rt2xYqlQru7u6yxEa1wwSYzG9sFTdHR0e0bt0a06ZNQ2pqqtzhEVmdL774AkuWLMHly5fx3//+F1988YXcIVEd5Obm4oknnsC+fftw7tw5PPPMMzh58qTcYdXL008/DUmScP78+WrLvPTSS5AkCSdOnLiNkdVdUlIS5s2bh2PHjskdSr2cPXsWjz32GMLCwvDJJ5/g448/ljskqoGD3AGQ9ViwYAFCQ0NRVFSEP//8EytWrMAvv/yCU6dOQavVyh0ekdVYsGABJkyYgFmzZkGj0eDrr7+WOySqg8jISPMNAFq3bo3JkyfLHFX9jBs3DsuWLcPq1asxZ86cKst888036NSpEzp37lzv/TzyyCMYM2YMNBpNvbdxM0lJSZg/fz6aN2+Orl27Wqz75JNPzL+6WKudO3fCZDLh3XffRcuWLeUOh26CCTCZDR06FD169AAA/N///R+8vLzw9ttvY+PGjRg7dqzM0RFZj4cffhh33nknzp8/j1atWsHHx0fukKiONmzYgDNnzqCwsBCdOnWCWq2WO6R6iYiIQMuWLfHNN99UmQBHR0cjPj4eb7755i3tR6lUQqlU3tI2boVKpZJt37WVlpYGAGz6YCPYBIKqdddddwEA4uPjAQAZGRl4/vnn0alTJ7i4uECn02Ho0KE4fvx4pccWFRVh3rx5aN26NRwdHeHv749Ro0YhLi4OAHDx4kWLZhc33gYOHGjeVkVbq++++w4vvvgi/Pz84OzsjPvuuw+JiYmV9r1//34MGTIEbm5u0Gq1GDBgAPbs2VPlMQ4cOLDK/V/f9q3C119/jfDwcDg5OcHT0xNjxoypcv81Hdv1TCYTli5dig4dOsDR0RF6vR5PPPEEMjMzLco1b94cI0aMqLSfadOmVdpmVbEvXry40jkFgOLiYsydOxctW7aERqNBUFAQXnjhBRQXF1d5rq43cOBAdOzYsdLyJUuWVNlOMCsrC9OnT0dQUBA0Gg1atmyJRYsWVVmjM2/evCrP3Y1tCq9cuYLHH38cer0eGo0GHTp0wGeffWZRpuLaqbhpNBq0bt0aCxcuhBDCouzRo0cxdOhQ6HQ6uLi4YNCgQdi3b59FmevbQfr6+qJ3797w8vJC586dIUkSVq5cWeN5u7G50c2uu7ocY0O+PiqeA19fX5SWllqs++abb8zxGgwGi3W//vor+vXrB2dnZ7i6umL48OE4ffq0RZnHHnsMLi4uleL6/vvvK7WprOt19sEHH6BDhw7QaDQICAjA1KlTkZWVZVFm4MCB5tdC+/btER4ejuPHj1f5Gq1Jdc/hjW1CK465Ns/3999/jx49esDV1dWi3JIlS2qMZdy4cTh79iyOHDlSad3q1ashSRLGjh2LkpISzJkzB+Hh4XBzc4OzszP69euHHTt23PR4q2oDLITAa6+9hs
DAQGi1Wtx5552Vnm+gdp8dO3fuRM+ePQEAEydONB97xWuqqjbA+fn5eO6558zvK23atMGSJUsqvbYlScK0adOwYcMGdOzY0fxa2rx5802PGwAuX76MkSNHwtnZGb6+vpgxY0al98nmzZtj7ty5AAAfH59qP0eqUvF6O3/+PB577DG4u7vDzc0NEydOREFBQaXytf0sWrt2rbmct7c3xo8fjytXrtQqJnvAGmCqVkWy6uXlBQC4cOECNmzYgAcffBChoaFITU3FRx99hAEDBuDMmTMICAgAABiNRowYMQLbtm3DmDFj8MwzzyA3NxdbtmzBqVOnEBYWZt7H2LFjMWzYMIv9zp49u8p4Xn/9dUiShFmzZiEtLQ1Lly5FVFQUjh07BicnJwDA9u3bMXToUISHh2Pu3LlQKBT4/PPPcdddd+GPP/5Ar169Km03MDDQ3IkpLy8PTz75ZJX7fuWVV/DQQw/h//7v/5Ceno5ly5ahf//+OHr0aJXf+KdMmYJ+/foBANatW4f169dbrH/iiSewcuVKTJw4EU8//TTi4+Px/vvv4+jRo9izZ0+D1HhkZWVV2UHLZDLhvvvuw59//okpU6agXbt2OHnyJN555x389ddfFh2EblVBQQEGDBiAK1eu4IknnkBwcDD27t2L2bNnIzk5GUuXLq3ycV999ZX5/xkzZlisS01NxR133GH+YPPx8cGvv/6KSZMmIScnB9OnT7co/+KLL6Jdu3YoLCw0J4q+vr6YNGkSAOD06dPo168fdDodXnjhBahUKnz00UcYOHAgdu3ahYiIiGqP76uvvqpz+9GK5kYVqrru6nqMjfH6yM3NxaZNm/DAAw+Yl33++edwdHREUVFRpfPw6KOPYvDgwVi0aBEKCgqwYsUK9O3bF0ePHm30Dkzz5s3D/PnzERUVhSeffBKxsbFYsWIFDh48eNPX06xZs+q1z7vvvhsTJkwAABw8eBDvvfdetWW9vb3xzjvvmO8/8sgjFuujo6Px0EMPoUuXLnjzzTfh5uYGg8FQ6dqvyrhx4zB//nysXr0a3bt3Ny83Go1Ys2YN+vXrh+DgYBgMBvzvf//D2LFjMXnyZOTm5uLTTz/F4MGDceDAgUrNDm5mzpw5eO211zBs2DAMGzYMR44cwT333IOSkhKLcrX57GjXrh0WLFiAOXPmWLx39u7du8p9CyFw3333YceOHZg0aRK6du2K3377DTNnzsSVK1cszjUA/Pnnn1i3bh2eeuopuLq64r333sPo0aORkJBg/oyrSmFhIQYNGoSEhAQ8/fTTCAgIwFdffYXt27dblFu6dCm+/PJLrF+/HitWrICLi0udm5w89NBDCA0NxcKFC3HkyBH873//g6+vLxYtWmQuU9vPoorPlp49e2LhwoVITU3Fu+++iz179lT7mWV3BNm9zz//XAAQW7duFenp6SIxMVF8++23wsvLSzg5OYnLly8LIYQoKioSRqPR4rHx8fFCo9GIBQsWmJd99tlnAoB4++23K+3LZDKZHwdALF68uFKZDh06iAEDBpjv79ixQwAQzZo1Ezk5Oebla9asEQDEu+++a952q1atxODBg837EUKIgoICERoaKu6+++5K++rdu7fo2LGj+X56eroAIObOnWtedvHiRaFUKsXrr79u8diTJ08KBweHSsvPnTsnAIgvvvjCvGzu3Lni+pfbH3/8IQCIVatWWTx28+bNlZaHhISI4cOHV4p96tSp4saX8I2xv/DCC8LX11eEh4dbnNOvvvpKKBQK8ccff1g8/sMPPxQAxJ49eyrt73oDBgwQHTp0qLR88eLFAoCIj483L3v11VeFs7Oz+OuvvyzK/uc//xFKpVIkJCRYLH/ppZeEJEkWy0JCQsSjjz5qvj9p0iTh7+8vDAaDRbkxY8YINzc3UVBQIIT4+9rZsWOHuUxRUZFQKBTiqaeeMi8bOXKkUKvVIi4uzrwsKSlJuLq6iv79+5uXVbxWKo6vqKhIBAcHi6FDhw
oA4vPPP698sq5T8fiDBw9aLK/quqvrMTbk66Pieh07dqwYMWKEefmlS5eEQqEQY8eOFQBEenq6EEKI3Nxc4e7uLiZ
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAtgAAAHWCAYAAABNMf7oAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACEiElEQVR4nOzdd3gU1f4G8He2pm56JYVQA6GHYqgqkSKoKIogKAgXUEFFvOBFBQQVBLkKKqL+rgIqNmwoIkpXIfQOIVICoaSH9LKb3fP7I2RlSSVsMrvZ9/M8eWBnzs5+ZzK7eTM5c44khBAgIiIiIiKrUMhdABERERFRY8KATURERERkRQzYRERERERWxIBNRERERGRFDNhERERERFbEgE1EREREZEUM2EREREREVsSATURERERkRQzYRERERERWxIBNRGSDFixYAJPJBAAwmUxYuHChzBXRzTh+/Dh+/PFH8+PDhw/jl19+ka8gB7N9+3ZIkoTt27dbLP/ss88QGRkJtVoNT09PWWojx8CATQ1i1apVkCTJ/OXk5IRWrVph6tSpSE1Nlbs8IpuzevVqLFmyBJcuXcJ///tfrF69Wu6S6Cbk5eVh8uTJ2L17N06fPo1nn30Wx44dk7usOmnatKnF53dVX6tWrbLK6y1YsMDilxNrOXXqFMaNG4fmzZvj//7v//DRRx9Z/TWIyqnkLoAcy/z58xEREYHi4mL89ddfWLFiBTZs2IDjx4/DxcVF7vKIbMb8+fPx2GOP4YUXXoBWq8Xnn38ud0l0E2JiYsxfANCqVStMnDhR5qrqZunSpcjPzzc/3rBhA7788ku8/fbb8PX1NS/v2bOnVV5vwYIFePDBBzFs2DCrbK/c9u3bYTKZsGzZMrRo0cKq2ya6EQM2NajBgweja9euAIB//etf8PHxwVtvvYV169Zh1KhRMldHZDsefvhh3HHHHThz5gxatmwJPz8/uUuim/Tjjz/i5MmTKCoqQvv27aHRaOQuqU5uDLopKSn48ssvMWzYMDRt2lSWmuoiLS0NANg1hBoEu4iQrO68804AQGJiIgAgKysL//73v9G+fXu4ublBp9Nh8ODBOHLkSIXnFhcX45VXXkGrVq3g5OSEoKAgPPDAAzh79iwA4Pz589X+OfP22283b6u8v97XX3+NF198EYGBgXB1dcW9996LixcvVnjtPXv2YNCgQfDw8ICLiwv69euHnTt3VrqPt99+e6Wv/8orr1Ro+/nnnyM6OhrOzs7w9vbGyJEjK3396vbteiaTCUuXLkVUVBScnJwQEBCAyZMn4+rVqxbtmjZtiqFDh1Z4nalTp1bYZmW1v/nmmxWOKQCUlJRg7ty5aNGiBbRaLUJDQzFz5kyUlJRUeqyud/vtt6Ndu3YVli9ZsgSSJOH8+fMWy7OzszFt2jSEhoZCq9WiRYsWWLRokbkf8/VeeeWVSo/duHHjLNpdvnwZ48ePR0BAALRaLaKiovDJJ59YtCk/d8q/tFotWrVqhYULF0IIYdH20KFDGDx4MHQ6Hdzc3NC/f3/s3r3bok15d6rz58/D398fPXv2hI+PDzp06FCrP8Pf2B2rpvPuZvbRmu+P8u+Bv78/DAaDxbovv/zSXG9GRobFul9//RV9+vSBq6sr3N3dMWTIEJw4ccKizbhx4+Dm5lahrm+//bZCv9ybPc/ef/99REVFQavVIjg4GFOmTEF2drZFm9tvv938Xmjbti2io6Nx5MiRSt+j1anqe3hjv+Lyfa7N9/vbb79F165d4e7ubtFuyZIlta6rKrX5/Dp9+jSGDx+OwMBAODk5ISQkBCNHjkROTo55nwsKCrB69eoq35c3unTpEoYNGwZXV1f4+/vjueeeq/AZ07RpU8ydOxcA4OfnV+VncGXKz9UzZ85g3Lhx8PT0hIeHBx5//HEUFhbW6TgAwNq1a83tfH19MWbMGFy+fLlWNZHt4xVsklV5GPbx8QEAnDt3Dj/++CMeeu
ghREREIDU1FR9++CH69euHkydPIjg4GABgNBoxdOhQbNmyBSNHjsSzzz6LvLw8bNq0CcePH0fz5s3NrzFq1CjcfffdFq87a9asSut5/fXXIUkSXnjhBaSlpWHp0qWIjY3F4cOH4ezsDADYunUrBg8ejOjoaMydOxcKhQIrV67EnXfeiT///BPdu3evsN2QkBDzTWr5+fl48sknK33t2bNnY8SIEfjXv/6F9PR0vPvuu+jbty8OHTpU6VWXSZMmoU+fPgCA77//Hj/88IPF+smTJ2PVqlV4/PHH8cwzzyAxMRHvvfceDh06hJ07d0KtVld6HG5GdnZ2pTfgmUwm3Hvvvfjrr78wadIktGnTBseOHcPbb7+Nv//+26p9LAsLC9GvXz9cvnwZkydPRlhYGHbt2oVZs2YhOTkZS5curfR5n332mfn/zz33nMW61NRU3HbbbZAkCVOnToWfnx9+/fVXTJgwAbm5uZg2bZpF+xdffBFt2rRBUVGROYj6+/tjwoQJAIATJ06gT58+0Ol0mDlzJtRqNT788EPcfvvt2LFjB3r06FHl/n322Wc33X+3vDtWucrOu5vdx/p4f+Tl5WH9+vW4//77zctWrlwJJycnFBcXVzgOY8eOxcCBA7Fo0SIUFhZixYoV6N27Nw4dOlTvV1NfeeUVzJs3D7GxsXjyySeRkJCAFStWYN++fTW+n1544YU6veZdd92Fxx57DACwb98+vPPOO1W29fX1xdtvv21+/Oijj1qsj4uLw4gRI9CxY0e88cYb8PDwQEZGRoVzvy5q8/ml1+sxcOBAlJSU4Omnn0ZgYCAuX76M9evXIzs7Gx4eHvjss8/wr3/9C927d8ekSZMAwOLz/EZFRUXo378/kpKS8MwzzyA4OBifffYZtm7datFu6dKl+PTTT/HDDz9gxYoVcHNzQ4cOHW5qH0eMGIGIiAgsXLgQBw8exP/+9z/4+/tj0aJFN3UcAJg/l7t164aFCxciNTUVy5Ytw86dO6v8vCc7I4gawMqVKwUAsXnzZpGeni4uXrwovvrqK+Hj4yOcnZ3FpUuXhBBCFBcXC6PRaPHcxMREodVqxfz5883LPvnkEwFAvPXWWxVey2QymZ8HQLz55psV2kRFRYl+/fqZH2/btk0AEE2aNBG5ubnm5d98840AIJYtW2bedsuWLcXAgQPNryOEEIWFhSIiIkLcddddFV6rZ8+eol27dubH6enpAoCYO3euedn58+eFUqkUr7/+usVzjx07JlQqVYXlp0+fFgDE6tWrzcvmzp0rrn9L//nnnwKAWLNmjcVzN27cWGF5eHi4GDJkSIXap0yZIm78mLix9pkzZwp/f38RHR1tcUw/++wzoVAoxJ9//mnx/A8++EAAEDt37qzwetfr16+fiIqKqrD8zTffFABEYmKiedmrr74qXF1dxd9//23R9j//+Y9QKpUiKSnJYvlLL70kJEmyWBYeHi7Gjh1rfjxhwgQRFBQkMjIyLNqNHDlSeHh4iMLCQiHEP+fOtm3bzG2Ki4uFQqEQTz31lHnZsGHDhEajEWfPnjUvu3LlinB3dxd9+/Y1Lyt/r5TvX3FxsQgLCxODBw8WAMTKlSsrHqzrlD9/3759FssrO+9udh+t+f4oP19HjRolhg4dal5+4cIFoVAoxKhRowQAkZ6eLoQQIi8vT3h6eoqJEyda1JqSkiI8PDwslo8dO1a4urpWODZr166t8L2q7XmWlpYmNBqNGDBggMVn1HvvvScAiE8++cRim9e/FzZs2CAAiEGDBlV4P1VFr9cLAGLq1KnV1l9u9OjRIiIiwmLZjd/vWbNmCQAiOTnZvKy6z8mq3Hhsavv5dejQIQFArF27ttrtu7q6WrwXq7N06VIBQHzzzTfmZQUFBaJFixYVjlX5OVd+TtVW+fPGjx9vsfz+++8XPj4+5se1PQ56vV74+/uLdu3aiaKiInO79evXCwBizpw5N1Uf2SZ2EaEGFRsbCz8/P4SGhmLkyJ
Fwc3PDDz/8gCZNmgAAtFotFIqy09JoNCIzMxNubm5o3bo1Dh48aN7Od999B19fXzz99NMVXuNm/gR7o8ceewzu7u7
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def plot_sample_balance(y, sample_name):\n",
" plt.figure(figsize=(8, 5))\n",
" sns.histplot(y, bins=30, kde=True)\n",
" plt.title(f\"Распределение целевой переменной для {sample_name}\")\n",
" plt.xlabel(sample_name)\n",
" plt.ylabel(\"Частота\")\n",
" plt.show()\n",
"\n",
"\n",
"# Оценка сбалансированности выборок\n",
"plot_sample_balance(train_df_neo[\"relative_velocity\"], \"Train df_neo\")\n",
"plot_sample_balance(val_df_neo[\"relative_velocity\"], \"Validation df_neo\")\n",
"plot_sample_balance(test_df_neo[\"relative_velocity\"], \"Test df_neo\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Кажется, выборки сбалансированы."
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAroAAAHWCAYAAACYIyqlAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABvfklEQVR4nO3dd3gU1foH8O9sTdnspjdIQqhBIJQoEIogBik2BEUQr6BcsGADK9cCclVE+QmKgOUqWFAEFMtVsCBgo4YOIbRAQkhCCunZfn5/hOxlSYCUTWaz+X6eZ55kz8yeeWdnJ/vm7JlzJCGEABERERGRh1HIHQARERERUWNgoktEREREHomJLhERERF5JCa6REREROSRmOgSERERkUdioktEREREHomJLhERERF5JCa6REREROSRmOgSERERkUdioktE1AheffVV2O12AIDdbsfcuXNljojq4sCBA/jmm28cj/fs2YMffvhBvoCagdmzZ0OSpEbfz6ZNmyBJEjZt2uRU/umnnyIuLg5qtRr+/v6NHsflYiH3wUSXamX58uWQJMmxeHl5oWPHjnj44YeRk5Mjd3hEbufjjz/G/Pnzcfr0afzf//0fPv74Y7lDojooKSnB/fffj61bt+Lo0aN47LHHsH//frnDqpc2bdo4/f2+1LJ8+XK5Q623w4cPY9KkSWjXrh0++OADvP/++3KHRG5CJXcA1LzMmTMHsbGxMBqN+PPPP7F06VL8+OOPOHDgAHx8fOQOj8htzJkzB/fccw+eeeYZaLVafPbZZ3KHRHWQmJjoWACgY8eOmDJlisxR1c/ChQtRWlrqePzjjz/iiy++wIIFCxAcHOwo79evX4P28/zzz+PZZ59tUB31tWnTJtjtdrz11lto3769LDGQe2KiS3UyYsQIXH311QCAf/7znwgKCsKbb76Jb7/9FuPHj5c5OiL3ceedd+K6667DsWPH0KFDB4SEhMgdEtXRN998g0OHDqGiogLdunWDRqORO6R6GTVqlNPj7OxsfPHFFxg1ahTatGlzyeeVlZXB19e31vtRqVRQqeRJK86ePQsATdZlgZoPdl2gBhkyZAgAIC0tDQBQUFCAJ598Et26dYNOp4Ner8eIESOwd+/eas81Go2YPXs2OnbsCC8vL0RERGD06NE4fvw4AODkyZOX/Zpt8ODBjrqq+kl9+eWX+Ne//oXw8HD4+vrilltuQUZGRrV9b9u2DcOHD4fBYICPjw8GDRqEv/76q8ZjHDx4cI37nz17drVtP/vsMyQkJMDb2xuBgYEYN25cjfu/3LFdyG63Y+HChejSpQu8vLwQFhaG+++/H+fOnXPark2bNrjpppuq7efhhx+uVmdNsb/xxhvVXlMAMJlMmDVrFtq3bw+tVouoqCg8/fTTMJlMNb5WFxo8eDC6du1arXz+/PmQJAknT550Ki8sLMTjjz+OqKgoaLVatG/fHvPmzXP0c71QVV/Ai5dJkyY5bZeZmYn77rsPYWFh0Gq16NKlCz766COnbareO1WLVqtFx44dMXfuXAghnLbdvXs3RowYAb1eD51Oh+uvvx5bt2512qaqm8/JkycRGhqKfv36ISgoCPHx8bX6evjibkJXet/V5RhdeX1UnYPQ0FBYLBandV988YUj3ry8PKd169atw8CBA+Hr6ws/Pz/ceOONOHjwoNM2kyZNgk6nqxbXmjVrqvWHrOv7bMmSJejSpQu0Wi0iIyMxbdo0FBYWOm0zePBgx7Vw1VVXISEhAXv37q3xGr2cS53DmvpzTpo0qVbne82aNbj66qvh5+fntN38+fNrHVdNql7z48ePY+TIkfDz88OECRMAAH/88QfuuOMOREdHO/4OTJ8+HRUVFU511NRHV5IkPPzww/jmm2/QtWtXx3t0/fr1tYrr9OnTGDVqFHx9fREaGorp06dX+/vTpk0bzJo1CwAQEhJyyb/Pl1Kba6i2sZD7YYsuNUhVUhoUFAQAOHHiBL
755hvccccdiI2NRU5ODt577z0MGjQIhw4dQmRkJADAZrPhpptuwoYNGzBu3Dg89thjKCkpwS+//IIDBw6gXbt2jn2MHz8eI0eOdNrvzJkza4znlVdegSRJeOaZZ3D27FksXLgQSUlJ2LNnD7y9vQEAv/32G0aMGIGEhATMmjULCoUCy5Ytw5AhQ/DHH3+gd+/e1ept3bq142ai0tJSPPjggzXu+4UXXsDYsWPxz3/+E7m5uVi0aBGuvfZa7N69u8aWhqlTp2LgwIEAgK+//hpr1651Wn///fdj+fLluPfee/Hoo48iLS0N77zzDnbv3o2//voLarW6xtehLgoLC2u8Ucput+OWW27Bn3/+ialTp6Jz587Yv38/FixYgCNHjjjdqNNQ5eXlGDRoEDIzM3H//fcjOjoaf//9N2bOnImsrCwsXLiwxud9+umnjt+nT5/utC4nJwd9+/Z1fNCGhIRg3bp1mDx5MoqLi/H44487bf+vf/0LnTt3RkVFhSMhDA0NxeTJkwEABw8exMCBA6HX6/H0009DrVbjvffew+DBg7F582b06dPnksf36aef1rl/Z1U3oSo1ve/qeoyNcX2UlJTgv//9L2677TZH2bJly+Dl5QWj0VjtdZg4cSKGDRuGefPmoby8HEuXLsWAAQOwe/fuy7YuusLs2bPx0ksvISkpCQ8++CBSU1OxdOlS7Nix44rX0zPPPFOvfQ4dOhT33HMPAGDHjh14++23L7ltcHAwFixY4Hj8j3/8w2n9li1bMHbsWHTv3h2vvfYaDAYD8vLyqr3368tqtWLYsGEYMGAA5s+f7+iOtnr1apSXl+PBBx9EUFAQtm/fjkWLFuH06dNYvXr1Fev9888/8fXXX+Ohhx6Cn58f3n77bYwZMwbp6emOz46aVFRU4Prrr0d6ejoeffRRREZG4tNPP8Vvv/3mtN3ChQvxySefYO3atVi6dCl0Oh3i4+Nrdcy1vYZqGwu5IUFUC8uWLRMAxK+//ipyc3NFRkaGWLlypQgKChLe3t7i9OnTQgghjEajsNlsTs9NS0sTWq1WzJkzx1H20UcfCQDizTffrLYvu93ueB4A8cYbb1TbpkuXLmLQoEGOxxs3bhQARKtWrURxcbGjfNWqVQKAeOuttxx1d+jQQQwbNsyxHyGEKC8vF7GxsWLo0KHV9tWvXz/RtWtXx+Pc3FwBQMyaNctRdvLkSaFUKsUrr7zi9Nz9+/cLlUpVrfzo0aMCgPj4448dZbNmzRIXXpJ//PGHACBWrFjh9Nz169dXK4+JiRE33nhjtdinTZsmLr7ML4796aefFqGhoSIhIcHpNf3000+FQqEQf/zxh9Pz3333XQFA/PXXX9X2d6FBgwaJLl26VCt/4403BACRlpbmKPv3v/8tfH19xZEjR5y2ffbZZ4VSqRTp6elO5c8995yQJMmpLCYmRkycONHxePLkySIiIkLk5eU5bTdu3DhhMBhEeXm5EOJ/752NGzc6tjEajUKhUIiHHnrIUTZq1Cih0WjE8ePHHWVnzpwRfn5+4tprr3WUVV0rVcdnNBpFdHS0GDFihAAgli1bVv3FukDV83fs2OFUXtP7rq7H6Mrro+r9On78eHHTTTc5yk+dOiUUCoUYP368ACByc3OFEEKUlJQIf39/MWXKFKdYs7OzhcFgcCqfOHGi8PX1rfbarF69utq5qu377OzZs0Kj0YgbbrjB6W/UO++8IwCIjz76yKnOC6+FH3/8UQAQw4cPr3Y9XYrZbBYAxMMPP3zZ+KtMmDBBxMbGOpVdfL5nzpwpAIisrCxH2eX+Tl5KTdfgxIkTBQDx7LPPVtu+6n10oblz5wpJksSpU6ccZRf/Das6Bo1GI44dO+Yo27t3rwAgFi1adNk4Fy5cKACIVatWOcrKyspE+/btq72OVfuuer/VVm2vobrEQu6FXReoTpKSkhASEoKoqCiMGzcOOp0Oa9euRatWrQAAWq0WCkXl28pmsyE/Px86nQ6dOn
XCrl27HPV89dVXCA4OxiOPPFJtHw0Znuaee+6Bn5+f4/Htt9+OiIgI/PjjjwAqhwg6evQo7rrrLuTn5yMvLw95eXk
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAArgAAAHWCAYAAACc1vqYAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABgEUlEQVR4nO3dd3gU5d7G8Xs3vVNSSOi9CaggvHQQFBFQbAgHFQQRFQuigBwPUiyAchQpYhdEsICCHQUFC6CgIIICAgZCCxBKet193j9i9rAkQBI22WTz/VzXXsnOzD7zm52d3TuTZ561GGOMAAAAAA9hdXcBAAAAgCsRcAEAAOBRCLgAAADwKARcAAAAeBQCLgAAADwKARcAAAAehYALAAAAj0LABQAAgEch4AIAAMCjEHABVHjPPPOM7Ha7JMlut2vatGlurghFsX37dq1YscJx/7ffftPnn3/uvoLKIIvFosmTJzvuL1iwQBaLRfv27bvgY+vUqaOhQ4e6tJ6hQ4eqTp06Lm3zYqxdu1YWi0Vr1651mr5o0SI1adJEPj4+qlSpkltrQdEQcD1Q3htX3s3f31+NGjXS/fffr6NHj7q7PKDMWbhwoWbOnKmDBw/qv//9rxYuXOjuklAEycnJGjlypH766Sft3r1bDz30kLZt2+busorlwQcflMVi0Z49e865zOOPPy6LxaLff/+9FCsrusOHD2vy5Mn67bff3F1KsezcuVNDhw5V/fr19dprr+nVV191d0koAm93F4CSM3XqVNWtW1cZGRn68ccfNX/+fH3xxRfavn27AgMD3V0eUGZMnTpVd9xxh8aPHy8/Pz+988477i4JRdC+fXvHTZIaNWqkESNGuLmq4hk8eLDmzJmjJUuW6IknnihwmXfffVctWrRQy5Yti72e22+/XQMHDpSfn1+x27iQw4cPa8qUKapTp44uvfRSp3mvvfaa478mZdXatWtlt9v14osvqkGDBu4uB0VEwPVgvXv3Vps2bSRJd911l6pWrarnn39eH3/8sQYNGuTm6oCy49Zbb1X37t21Z88eNWzYUBEREe4uCUW0YsUK/fnnn0pPT1eLFi3k6+vr7pKKpV27dmrQoIHefffdAgPuhg0bFBsbq+nTp1/Uery8vOTl5XVRbVwMHx8ft627sI4dOyZJpdY1Aa5FF4UK5Morr5QkxcbGSpJOnjypRx99VC1atFBwcLBCQ0PVu3dvbd26Nd9jMzIyNHnyZDVq1Ej+/v6Kjo7WjTfeqL1790qS9u3b59Qt4uxbt27dHG3l9S96//339e9//1vVqlVTUFCQrrvuOh04cCDfun/++Wddc801CgsLU2BgoLp27ap169YVuI3dunUrcP1n9j3L884776h169YKCAhQlSpVNHDgwALXf75tO5PdbtesWbPUvHlz+fv7KyoqSiNHjtSpU6eclqtTp4769u2bbz33339/vjYLqv25557L95xKUmZmpiZNmqQGDRrIz89PNWvW1Lhx45SZmVngc3Wmbt266ZJLLsk3febMmQX20zt9+rRGjx6tmjVrys/PTw0aNNCMGTMKPCMzefLkAp+7s/v0HTp0SMOGDVNUVJT8/PzUvHlzvfnmm07L5L128m5+fn5q1KiRpk2bJmOM07JbtmxR7969FRoaquDgYPXo0UM//fST0zJn9kOMjIxUhw4dVLVqVbVs2VIWi0ULFiw47/N2dnegC73uirKNrjw+8vZBZGSksrOznea9++67jnoTEhKc5n355Zfq3LmzgoKCFBISoj59+uiPP/5wWmbo0KEKDg7OV9eyZcvy9SMs6uvspZdeUvPmzeXn56eYmBiNGjVKp0+fdlqmW7dujmOhWbNmat26tbZu3VrgMXo+59qHBfWDHDp0aKH297Jly9SmTRuFhIQ4LTdz5szz1jJ48GDt3LlTmzdvzjdvyZIlslgsGjRokLKysvTEE0+odevWCgsLU1BQkDp37qw1a9ZccHsL6o
NrjNFTTz2lGjVqKDAwUN27d8+3v6XCfXasXbtWV1xxhSTpzjvvdGx73jFVUB/c1NRUPfLII473lcaNG2vmzJn5jm2LxaL7779fK1as0CWXXOI4llauXHnB7ZakgwcPqn///goKClJkZKQefvjhfO+TderU0aRJkyRJERER5/wcOZfCHOuFrQXFwxncCiQvjFatWlWS9Pfff2vFihW65ZZbVLduXR09elSvvPKKunbtqj///FMxMTGSJJvNpr59++qbb77RwIED9dBDDyk5OVmrVq3S9u3bVb9+fcc6Bg0apGuvvdZpvRMmTCiwnqeffloWi0Xjx4/XsWPHNGvWLPXs2VO//fabAgICJEnffvutevfurdatW2vSpEmyWq166623dOWVV+qHH35Q27Zt87Vbo0YNx0VCKSkpuvfeewtc98SJEzVgwADdddddOn78uObMmaMuXbpoy5YtBf7Ffvfdd6tz586SpI8++kjLly93mj9y5EgtWLBAd955px588EHFxsZq7ty52rJli9atW+eSMxanT58u8AIou92u6667Tj/++KPuvvtuNW3aVNu2bdMLL7ygv/76y+kCnIuVlpamrl276tChQxo5cqRq1aql9evXa8KECTpy5IhmzZpV4OMWLVrk+P3hhx92mnf06FH93//9n+ODKyIiQl9++aWGDx+upKQkjR492mn5f//732ratKnS09MdQTAyMlLDhw+XJP3xxx/q3LmzQkNDNW7cOPn4+OiVV15Rt27d9N1336ldu3bn3L5FixYVuf9mXnegPAW97oq6jSVxfCQnJ+uzzz7TDTfc4Jj21ltvyd/fXxkZGfmehyFDhqhXr16aMWOG0tLSNH/+fHXq1Elbtmwp8QuEJk+erClTpqhnz5669957tWvXLs2fP1+bNm264PE0fvz4Yq3zqquu0h133CFJ2rRpk2bPnn3OZcPDw/XCCy847t9+++1O8zds2KABAwaoVatWmj59usLCwpSQkJDvtV+QwYMHa8qUKVqyZIkuv/xyx3SbzaYPPvhAnTt3Vq1atZSQkKDXX39dgwYN0ogRI5ScnKw33nhDvXr10saNG/N1C7iQJ554Qk899ZSuvfZaXXvttdq8ebOuvvpqZWVlOS1XmM+Opk2baurUqXriiSec3js7dOhQ4LqNMbruuuu0Zs0aDR8+XJdeeqm++uorjR07VocOHXJ6riXpxx9/1EcffaT77rtPISEhmj17tm666SbFxcU5PuMKkp6erh49eiguLk4PPvigYmJitGjRIn377bdOy82aNUtvv/22li9frvnz5ys4OLjQXUIKe6wXthYUk4HHeeutt4wks3r1anP8+HFz4MAB895775mqVauagIAAc/DgQWOMMRkZGcZmszk9NjY21vj5+ZmpU6c6pr355ptGknn++efzrctutzseJ8k899xz+ZZp3ry56dq1q+P+mjVrjCRTvXp1k5SU5Jj+wQcfGEnmxRdfdLTdsGFD06tXL8d6jDEmLS3N1K1b11x11VX51tWhQwdzySWXOO4fP37cSDKTJk1yTNu3b5/x8vIyTz/9tNNjt23bZry9vfNN3717t5FkFi5c6Jg2adIkc+bh88MPPxhJZvHixU6PXblyZb7ptWvXNn369MlX+6hRo8zZh+TZtY8bN85ERkaa1q1bOz2nixYtMlar1fzwww9Oj3/55ZeNJLNu3bp86ztT165dTfPmzfNNf+6554wkExsb65j25JNPmqCgIPPXX385LfvYY48ZLy8vExcX5zT98ccfNxaLxWla7dq1zZAhQxz3hw8fbqKjo01CQoLTcgMHDjRhYWEmLS3NGPO/186aNWscy2RkZBir1Wruu+8+x7T+/fsbX19fs3fvXse0w4cPm5CQENOlSxfHtLxjJW/7MjIyTK1atUzv3r2NJPPWW2/lf7LOkPf4TZs2OU0v6HVX1G105fGR93odNGiQ6du3r2P6/v37jdVqNYMGDTKSzPHjx40xxiQnJ5
tKlSqZESNGONUaHx9vwsLCnKYPGTLEBAUF5Xtuli5dmm9fFfZ1duzYMePr62uuvvpqp/eouXPnGknmzTffdGrzzGP
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAqYAAAHWCAYAAAClsUvDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABhqElEQVR4nO3dd3gU5doG8Ht2N9lN3fRGEhJCD71KDQgISBEbyEEFQWxwBFH0wwZ4lIgioqJYjoIiioAUsYB0BAHpEDohEBLSlpCe7Ca77/dHyB6WJJCyyUzI/buuvWBnZ9955t2Z4WaqJIQQICIiIiKSmUruAoiIiIiIAAZTIiIiIlIIBlMiIiIiUgQGUyIiIiJSBAZTIiIiIlIEBlMiIiIiUgQGUyIiIiJSBAZTIiIiIlIEBlMiIiIiUgQGUyKiapozZw4sFgsAwGKxIDo6WuaKqDJiYmKwdu1a6/sjR47gt99+k6+gemb79u2QJAnbt2+3Gb506VI0b94cDg4O8PDwkLUWqj0MplTKkiVLIEmS9aXT6dC0aVNMnjwZKSkpcpdHpDjffvst5s2bh4SEBHzwwQf49ttv5S6JKiE7OxtPP/009u7di3PnzmHKlCk4fvy43GVVSVhYmM32u7zXkiVL7DK9OXPm2IR6ezl9+jTGjRuHiIgIfPXVV/jyyy/tPg1SJo3cBZByvfXWWwgPD0dBQQF27dqFRYsW4ffff0dMTAycnZ3lLo9IMd566y08/vjjeOWVV6DVavH999/LXRJVQrdu3awvAGjatCkmTpwoc1VVs2DBAuTk5Fjf//777/jxxx/x4YcfwsfHxzq8e/fudpnenDlz8NBDD2HEiBF2aa/E9u3bYbFY8NFHH6Fx48Z2bZuUjcGUyjV48GB06tQJAPDkk0/C29sb8+fPx7p16zB69GiZqyNSjlGjRqFv3744f/48mjRpAl9fX7lLokpau3YtTp48ifz8fLRu3RqOjo5yl1QlNwfE5ORk/PjjjxgxYgTCwsJkqakqUlNTAaDWDuGTcvBQPlXY3XffDQCIi4sDAKSnp+Oll15C69at4erqCnd3dwwePBhHjx4t9d2CggLMmjULTZs2hU6nQ2BgIB544AHExsYCAC5evHjLw059+vSxtlVyDtBPP/2EV199FQEBAXBxccHw4cNx+fLlUtPet28fBg0aBL1eD2dnZ0RFRWH37t1lzmOfPn3KnP6sWbNKjfv999+jY8eOcHJygpeXFx555JEyp3+rebuRxWLBggULEBkZCZ1OB39/fzz99NO4du2azXhhYWEYOnRoqelMnjy5VJtl1f7++++X6lMAMBqNmDlzJho3bgytVouQkBC8/PLLMBqNZfbVjfr06YNWrVqVGj5v3jxIkoSLFy/aDM/IyMDUqVMREhICrVaLxo0bY+7cudbzNG80a9asMvtu3LhxNuMlJiZi/Pjx8Pf3h1arRWRkJL755hubcUqWnZKXVqtF06ZNER0dDSGEzbiHDx/G4MGD4e7uDldXV/Tr1w979+61GafktJeLFy/Cz88P3bt3h7e3N9q0aVOhw6U3nzZzu+WuMvNoz/Wj5Dfw8/NDYWGhzWc//vijtV6DwWDz2R9//IFevXrBxcUFbm5uGDJkCE6cOGEzzrhx4+Dq6lqqrlWrVpU616+yy9lnn32GyMhIaLVaBAUFYdKkScjIyLAZp0+fPtZ1oWXLlujYsSOOHj1a5jp6K+X9hmWdqzhu3LgK/d6rVq1Cp06d4ObmZjPevHnzKlxXeSqy/Tp37hwefPBBBAQEQKfTITg4GI888ggyMzOt85ybm4tvv/223PXyZgkJCRgxYgRcXFzg5+eHF154odQ2JiwsDDNnzgQA+Pr6lrsNLk9F1pOK1kK1j3tMqcJKQqS3tzcA4MKFC1i7di0efvhhhIeHIyUlBV988QWioqJw8uRJBAUFAQDMZjOGDh2KLVu24JFHHsGUKVOQnZ2NTZs2ISYmBhEREd
ZpjB49Gvfee6/NdGfMmFFmPe+88w4kScIrr7yC1NRULFiwAP3798eRI0fg5OQEANi6dSsGDx6Mjh07YubMmVCpVFi8eDHuvvtu/PXXX+jSpUupdoODg60Xr+Tk5ODZZ58tc9pvvPEGRo4ciSeffBJpaWn45JNP0Lt3bxw+fLjM/+U/9dRT6NWrFwBg9erVWLNmjc3nTz/9NJYsWYInnngCzz//POLi4rBw4UIcPnwYu3fvhoODQ5n9UBkZGRllXphjsVgwfPhw7Nq1C0899RRatGiB48eP48MPP8TZs2fteg5ZXl4eoqKikJiYiKeffhqhoaH4+++/MWPGDCQlJWHBggVlfm/p0qXWv7/wwgs2n6WkpOCuu+6CJEmYPHkyfH198ccff2DChAnIysrC1KlTbcZ/9dVX0aJFC+Tn51sDnJ+fHyZMmAAAOHHiBHr16gV3d3e8/PLLcHBwwBdffIE+ffpgx44d6Nq1a7nzt3Tp0kqfn1hy2kyJspa7ys5jTawf2dnZ+PXXX3H//fdbhy1evBg6nQ4FBQWl+mHs2LEYOHAg5s6di7y8PCxatAg9e/bE4cOHa3zv3axZszB79mz0798fzz77LM6cOYNFixZh//79t12fXnnllSpNc8CAAXj88ccBAPv378fHH39c7rg+Pj748MMPre8fe+wxm8/37NmDkSNHom3btnj33Xeh1+thMBhKLftVUZHtl8lkwsCBA2E0GvHvf/8bAQEBSExMxK+//oqMjAzo9XosXboUTz75JLp06YKnnnoKAGy25zfLz89Hv379EB8fj+effx5BQUFYunQptm7dajPeggUL8N1332HNmjVYtGgRXF1d0aZNmwrNW0XXk4rWQjIQRDdZvHixACA2b94s0tLSxOXLl8Xy5cuFt7e3cHJyEgkJCUIIIQoKCoTZbLb5blxcnNBqteKtt96yDvvmm28EADF//vxS07JYLNbvARDvv/9+qXEiIyNFVFSU9f22bdsEANGgQQORlZVlHb5ixQoBQHz00UfWtps0aSIGDhxonY4QQuTl5Ynw8HAxYMCAUtPq3r27aNWqlfV9WlqaACBmzpxpHXbx4kWhVqvFO++8Y/Pd48ePC41GU2r4uXPnBADx7bffWofNnDlT3Lj6/fXXXwKAWLZsmc13N2zYUGp4w4YNxZAhQ0rVPmnSJHHzKn1z7S+//LLw8/MTHTt2tOnTpUuXCpVKJf766y+b73/++ecCgNi9e3ep6d0oKipKREZGlhr+/vvvCwAiLi7OOuw///mPcHFxEWfPnrUZ9//+7/+EWq0W8fHxNsNfe+01IUmSzbCGDRuKsWPHWt9PmDBBBAYGCoPBYDPeI488IvR6vcjLyxNC/G/Z2bZtm3WcgoICoVKpxHPPPWcdNmLECOHo6ChiY2Otw65cuSLc3NxE7969rcNK1pWS+SsoKBChoaFi8ODBAoBYvHhx6c66Qcn39+/fbzO8rOWusvNoz/WjZHkdPXq0GDp0qHX4pUuXhEqlEqNHjxYARFpamhBCiOzsbOHh4SEmTpxoU2tycrLQ6/U2w8eOHStcXFxK9c3KlStL/VYVXc5SU1OFo6OjuOeee2y2UQsXLhQAxDfffGPT5o3rwu+//y4AiEGDBpVan8pjMpkEADF58uRb1l9izJgxIjw83GbYzb/3jBkzBACRlJRkHXar7WR5bu6bim6/Dh8+LACIlStX3rJ9FxcXm3XxVhYsWCAAiBUrVliH5ebmisaNG5fqq5JlrmSZqqiKrieVqYVqFw/lU7n69+8PX19fhISE4JFHHoGrqyvWrFmDBg0aAAC0Wi1UquJFyGw24+rVq3B1dUWzZs1w6NAhazs///wzfHx88O9//7vUNCpzqOxmjz/+ONzc3KzvH3roIQQGBuL3338HUHzLl3PnzuFf//oXrl69CoPBAIPBgNzcXPTr1w87d+4sdei4oKAAOp3ultNdvXo1LBYLRo4caW3TYDAgICAATZo0wbZt22zGN5
lMAIr7qzwrV66EXq/HgAEDbNrs2LEjXF1dS7VZWFhoM57BYCi1x+pmiYmJ+OSTT/DGG2+UOnS6cuVKtGjRAs2bN7d
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Оценка сбалансированности выборок\n",
"plot_sample_balance(train_df_ed[\"inflationrate\"], \"Train df_ed\")\n",
"plot_sample_balance(val_df_ed[\"inflationrate\"], \"Validation df_ed\")\n",
"plot_sample_balance(test_df_ed[\"inflationrate\"], \"Test df_ed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Выборки выглядят схоже, но у всех трех разное количество значений. В дальнейшем мы не сможем обучить какую-либо модель. \n",
"Если модель обучается на несбалансированных данных, она будет предсказывать какой-то гораздо чаще, даже если в тестовой выборке классы распределены более равномерно."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAArcAAAHWCAYAAABt3aEVAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACfPElEQVR4nOzdd3hTZfsH8O/JbtOme9LBKpQ9BQrIhjJUVJQhICo/VAQVeEXeKspQQZBXQGXo+yqgMgQUcaAIyBAZsvdeLdAWSvdKm+T5/RFyaGgLnaTj+7muXJAz75Ock9x9cp/nkYQQAkREREREVYDC0QEQEREREZUVJrdEREREVGUwuSUiIiKiKoPJLRERERFVGUxuiYiIiKjKYHJLRERERFUGk1siIiIiqjKY3BIRERFRlcHkloiIiIiqDCa3REQP0IwZM2CxWAAAFosFM2fOdHBEVBzHjx/Hjz/+KD8/fPgwfv31V8cFVAlMnToVkiSV+362bdsGSZKwbds2u+nffPMNwsPDoVar4e7uXu5xkOMxuaVSWbp0KSRJkh86nQ716tXD2LFjER8f7+jwiCqcZcuWYc6cObh69Sr+85//YNmyZY4OiYohLS0NL730Evbs2YNz587h9ddfx7FjxxwdVonUrFnT7vO7sMfSpUsdHWqJnT59Gs899xzq1KmD//73v/jiiy8cHRI9ACpHB0BVw/Tp01GrVi1kZ2dj586dWLRoETZs2IDjx4/D2dnZ0eERVRjTp0/Hs88+i0mTJkGr1eLbb791dEhUDBEREfIDAOrVq4dRo0Y5OKqSmTdvHtLT0+XnGzZswMqVKzF37lx4e3vL09u3b1+q/UyePBn//ve/S7WNktq2bRssFgvmz5+PunXrOiQGevCY3FKZ6NOnD1q3bg0A+L//+z94eXnh448/xvr16zFkyBAHR0dUcQwaNAhdu3bF+fPnERYWBh8fH0eHRMX0448/4uTJk8jKykKTJk2g0WgcHVKJPP7443bP4+LisHLlSjz++OOoWbNmoetlZGRAr9cXeT8qlQoqlWPSjRs3bgAAyxGqGZYlULno1q0bAODSpUsAgMTERLzxxhto0qQJXFxcYDAY0KdPHxw5ciTfutnZ2Zg6dSrq1asHnU6HgIAAPPnkk7hw4QIA4PLly/f8Ca1Lly7ytmw1WN999x3eeust+Pv7Q6/X47HHHkNMTEy+fe/duxe9e/eGm5sbnJ2d0blzZ/z9998FHmOXLl0K3P/UqVPzLfvtt9+iVatWcHJygqenJwYPHlzg/u91bHlZLBbMmzcPjRo1gk6ng5+fH1566SUkJSXZLVezZk088sgj+fYzduzYfNssKPaPPvoo32sKAEajEVOmTEHdunWh1WoRHByMN998E0ajscDXKq8uXbqgcePG+abPmTMHkiTh8uXLdtOTk5Mxbtw4BAcHQ6vVom7dupg1a5Zct5qXrbbv7sdzzz1nt9y1a9fwwgsvwM/PD1qtFo0aNcJXX31lt4zt3LE9tFot6tWrh5kzZ0IIYbfsoUOH0KdPHxgMBri4uKB79+7Ys2eP3TK2Ep7Lly/D19cX7du3h5eXF5o2bVqkn37vLgG633lXnGMsy+vD9h74+voiNzfXbt7KlSvleBMSEuzm/fbbb3j44Yeh1+vh6uqKfv364cSJE3bLPPfcc3BxcckX19q1a/PVWhb3PFu4cCEaNWoErVaLwMBAjBkzBsnJyXbLdOnSRb4WGjZsiFatWuHIkSMFXqP3Uth7eHetqO2Yi/J+r127Fq1bt4arq6vdcnPmzClyXAWxveYXLlxA37594erqiqFDhwIA/vrrLzz99NMICQmRPwfGjx+PrKwsu20UVHMrSRLGjh2LH3/8EY0bN5bP0d9//71IcV29ehWPP/449Ho9fH19MX78+HyfPzVr1sSUKVMAAD4+PoV+PhfEFvPZs2cxbNgwuLm5wcfHB++88w6EEIiJiUH//v1hMBjg7++P//
znP3brF/faorLFllsqF7ZE1MvLCwBw8eJF/Pjjj3j66adRq1YtxMfH4/PPP0fnzp1x8uRJBAYGAgDMZjMeeeQRbNmyBYMHD8brr7+OtLQ0bNq0CcePH0edOnXkfQwZMgR9+/a1229UVFSB8XzwwQeQJAmTJk3CjRs3MG/ePPTo0QOHDx+Gk5MTAODPP/9Enz590KpVK0yZMgUKhQJLlixBt27d8Ndff6FNmzb5thsUFCTfEJSeno7Ro0cXuO933nkHAwcOxP/93//h5s2b+PTTT9GpUyccOnSowBaFF198EQ8//DAA4IcffsC6devs5r/00ktYunQpnn/+ebz22mu4dOkSPvvsMxw6dAh///031Gp1ga9DcSQnJxd4s5PFYsFjjz2GnTt34sUXX0SDBg1w7NgxzJ07F2fPnrW72aa0MjMz0blzZ1y7dg0vvfQSQkJCsGvXLkRFRSE2Nhbz5s0rcL1vvvlG/v/48ePt5sXHx6Ndu3byl6uPjw9+++03jBw5EqmpqRg3bpzd8m+99RYaNGiArKws+YvK19cXI0eOBACcOHECDz/8MAwGA958802o1Wp8/vnn6NKlC7Zv3462bdsWenzffPNNses1bSVANgWdd8U9xvK4PtLS0vDLL7/giSeekKctWbIEOp0O2dnZ+V6HESNGIDIyErNmzUJmZiYWLVqEjh074tChQ/dsRSwLU6dOxbRp09CjRw+MHj0aZ86cwaJFi7Bv3777Xk+TJk0q0T579uyJZ599FgCwb98+fPLJJ4Uu6+3tjblz58rPhw8fbjd/9+7dGDhwIJo1a4YPP/wQbm5uSEhIyHful5TJZEJkZCQ6duyIOXPmyKVma9asQWZmJkaPHg0vLy/8888/+PTTT3H16lWsWbPmvtvduXMnfvjhB7zyyitwdXXFJ598ggEDBiA6Olr+7ihIVlYWunfvjujoaLz22msIDAzEN998gz///NNuuXnz5uHrr7/GunXrsGjRIri4uKBp06bFOvZBgwahQYMG+PDDD/Hrr7/i/fffh6enJz7//HN069YNs2bNwvLly/HGG2/goYceQqdOnezWL8q1ReVAEJXCkiVLBACxefNmcfPmTRETEyNWrVolvLy8hJOTk7h69aoQQojs7GxhNpvt1r106ZLQarVi+vTp8rSvvvpKABAff/xxvn1ZLBZ5PQDio48+yrdMo0aNROfOneXnW7duFQBEjRo1RGpqqjx99erVAoCYP3++vO2wsDARGRkp70cIITIzM0WtWrVEz5498+2rffv2onHjxvLzmzdvCgBiypQp8rTLly8LpVIpPvjgA7t1jx07JlQqVb7p586dEwDEsmXL5GlTpkwReS/Vv/76SwAQy5cvt1v3999/zzc9NDRU9OvXL1/sY8aMEXdf/nfH/uabbwpfX1/RqlUru9f0m2++EQqFQvz111926y9evFgAEH///Xe+/eXVuXNn0ahRo3zTP/roIwFAXLp0SZ723nvvCb1eL86ePWu37L///W+hVCpFdHS03fS3335bSJJkNy00NFSMGDFCfj5y5EgREBAgEhIS7JYbPHiwcHNzE5mZmUKIO+fO1q1b5WWys7OFQqEQr7zyijzt8ccfFxqNRly4cEGedv36deHq6io6deokT7NdK7bjy87OFiEhIaJPnz4CgFiyZEn+FysP2/r79u2zm17QeVfcYyzL68N2vg4ZMkQ88sgj8vQrV64IhUIhhgwZIgCImzdvCiGESEtLE+7u7mLUqFF2scbFxQk3Nze76SNGjBB6vT7fa7NmzZp871VRz7MbN24IjUYjevXqZfcZ9dlnnwkA4quvvrLbZt5rYcOGDQKA6N27d77rqTA5OTkCgBg7duw947cZOnSoqFWrlt20u9/vqKgoAUDExsbK0+71OVmYgq7BESNGCADi3//+d77lbedRXjNnzhSSJIkrV67I0+7+DLMdg0ajEefPn5enHTlyRAAQn3766T3jnDdvngAgVq9eLU/LyM
gQdevWzfc62vZtO9+Kyrbeiy++KE8zmUwiKChISJIkPvzwQ3l6UlKScHJysvucKeq1ReWDZQlUJnr06AEfHx8EBwd
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAq4AAAHWCAYAAAC2Zgs3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB8j0lEQVR4nO3dd3hT5d8G8PskadKd7j0YbdkUKKtsGbIFQUEEBUFEBWUoKD9EBJGhKKAMUREUGQIyRBRlL9m0zLILLd2le7fJ8/5Rm5fQXQpp2vtzXbkgZ37PaU5y5+Q5z5GEEAJERERERFWczNAFEBERERGVBYMrERERERkFBlciIiIiMgoMrkRERERkFBhciYiIiMgoMLgSERERkVFgcCUiIiIio8DgSkRERERGgcGViIiIiIwCgysR1Ujz5s2DVqsFAGi1WsyfP9/AFVF5XL58GTt27NA9Dw4Oxu7duw1XUBUkSRI++eQT3fO1a9dCkiTcvXu31Hlr1aqFUaNGVWo9o0aNQq1atSp1mY/j0KFDkCQJhw4d0hu+bt061K9fHyYmJrCxsTFIbVQ8BtdqouANqeBhamoKPz8/TJgwATExMYYuj6jK+emnn7Bo0SLcv38fX375JX766SdDl0TlkJqainHjxuHkyZO4efMmJk6ciEuXLhm6rAp59913IUkSbt26Vew0M2bMgCRJuHjx4lOsrPwiIyPxySefIDg42NClVMi1a9cwatQo1K1bF99//z2+++47Q5dEj1AYugCqXHPmzEHt2rWRlZWFY8eOYeXKlfjzzz9x+fJlmJubG7o8oipjzpw5ePXVV/HBBx9ApVLhl19+MXRJVA6BgYG6BwD4+flh7NixBq6qYoYPH45vvvkGGzZswMcff1zkNBs3bkSTJk3QtGnTCq/nlVdewUsvvQSVSlXhZZQmMjISs2fPRq1atdCsWTO9cd9//73uV46q6tChQ9BqtVi6dCl8fHwMXQ4VgcG1munduzdatmwJAHj99ddhb2+Pr776Cjt37sSwYcMMXB1R1TF06FA888wzuHXrFnx9feHo6GjokqicduzYgatXryIzMxNNmjSBUqk0dEkV0qZNG/j4+GDjxo1FBtcTJ04gNDQUCxYseKz1yOVyyOXyx1rG4zAxMTHYussqNjYWANhEoApjU4FqrmvXrgCA0NBQAEBCQgLef/99NGnSBJaWlrC2tkbv3r1x4cKFQvNmZWXhk08+gZ+fH0xNTeHq6opBgwbh9u3bAIC7d+/qNU949NGlSxfdsgraEv3666/43//+BxcXF1hYWOC5555DeHh4oXWfOnUKvXr1glqthrm5OTp37ozjx48XuY1dunQpcv0Pt+0q8MsvvyAgIABmZmaws7PDSy+9VOT6S9q2h2m1WixZsgSNGjWCqakpnJ2dMW7cOCQmJupNV6tWLfTr16/QeiZMmFBomUXV/sUXXxTapwCQnZ2NWbNmwcfHByqVCp6enpg2bRqys7OL3FcP69KlCxo3blxo+KJFi4psB5eUlIRJkybB09MTKpUKPj4+WLhwYZFnUD755JMi992jbeYiIiIwevRoODs7Q6VSoVGjRvjxxx/1pil47RQ8VCoV/Pz8MH/+fAgh9KYNCgpC7969YW1tDUtLS3Tr1g0nT57Um+bhdn5OTk5o164d7O3t0bRpU0iShLVr15a43x5tllPa664821iZx0fB38DJyQm5ubl64zZu3KirNz4+Xm/cX3/9hY4dO8LCwgJWVlbo27cvrly5ojfNqFGjYGlpWaiurVu3FmozWN7X2YoVK9CoUSOoVCq4ublh/PjxSEpK0pumS5cuumOhYcOGCAgIwIULF4o8RktS3N/w0TaPBdtclr/31q1b0bJlS1hZWelNt2jRohJrGT58OK5du4bz588XGrdhwwZIkoRhw4YhJycHH3/8MQICAqBWq2FhYYGOHTvi4MGDpW5vUW1chRCYO3cuPDw8YG5ujmeeeabQ3x
so22fHoUOH0KpVKwDAa6+9ptv2gmOqqDau6enpeO+993TvK/Xq1cOiRYsKHduSJGHChAnYsWMHGjdurDuW9uzZU+p2A8D9+/cxcOBAWFhYwMnJCZMnTy70PlmrVi3MmjULAODo6Fjs50hRCo63GzduYMSIEVCr1XB0dMTMmTMhhEB4eDgGDBgAa2truLi44Msvv9Sbv7zvATUZz7hWcwUh097eHgBw584d7NixAy+++CJq166NmJgYrFq1Cp07d8bVq1fh5uYGANBoNOjXrx/279+Pl156CRMnTkRqair27t2Ly5cvo27durp1DBs2DH369NFb7/Tp04us57PPPoMkSfjggw8QGxuLJUuWoHv37ggODoaZmRkA4MCBA+jduzcCAgIwa9YsyGQyrFmzBl27dsXRo0fRunXrQsv18PDQXVyTlpaGt956q8h1z5w5E0OGDMHrr7+OuLg4fPPNN+jUqROCgoKK/Ib9xhtvoGPHjgCAbdu2Yfv27Xrjx40bh7Vr1+K1117Du+++i9DQUCxbtgxBQUE4fvx4pZxhSEpKKvLCIa1Wi+eeew7Hjh3DG2+8gQYNGuDSpUtYvHgxbty4oXfhyuPKyMhA586dERERgXHjxsHLywv//vsvpk+fjqioKCxZsqTI+datW6f7/+TJk/XGxcTEoG3btroPJEdHR/z1118YM2YMUlJSMGnSJL3p//e//6FBgwbIzMzUvbk7OTlhzJgxAIArV66gY8eOsLa2xrRp02BiYoJVq1ahS5cuOHz4MNq0aVPs9q1bt67c7SMLmuUUKOp1V95tfBLHR2pqKv744w88//zzumFr1qyBqakpsrKyCu2HkSNHomfPnli4cCEyMjKwcuVKdOjQAUFBQU/8wppPPvkEs2fPRvfu3fHWW2/h+vXrWLlyJc6cOVPq8fTBBx9UaJ09evTAq6++CgA4c+YMvv7662KndXBwwOLFi3XPX3nlFb3xJ06cwJAhQ+Dv748FCxZArVYjPj6+0Gu/KMOHD8fs2bOxYcMGtGjRQjdco9Fg8+bN6NixI7y8vBAfH48ffvgBw4YNw9ixY5GamorVq1ejZ8+eOH36dKGf50vz8ccfY+7cuejTpw/69OmD8+fP49lnn0VOTo7edGX57GjQoAHmzJmDjz/+WO+9s127dkWuWwiB5557DgcPHsSYMWPQrFkz/P3335g6dSoiIiL09jUAHDt2DNu2bcPbb78NKysrfP311xg8eDDCwsJ0n3FFyczMRLdu3RAWFoZ3330Xbm5uWLduHQ4cOKA33ZIlS/Dzzz9j+/btWLlyJSwtLcvdNGPo0KFo0KABFixYgN27d2Pu3Lmws7PDqlWr0LVrVyxcuBDr16/H+++/j1atWqFTp05685flPaDGE1QtrFmzRgAQ+/btE3FxcSI8PFxs2rRJ2NvbCzMzM3H//n0hhBBZWVlCo9HozRsaGipUKpWYM2eObtiPP/4oAIivvvqq0Lq0Wq1uPgDiiy++KDRNo0aNROfOnXXPDx48KAAId3d3kZKSohu+efNmAUAsXbpUt2xfX1/Rs2dP3XqEECIjI0PUrl1b9OjRo9C62rVrJxo3bqx7HhcXJwCIWbNm6YbdvXtXyOVy8dlnn+nNe+nSJaFQKAoNv3nzpgAgfvrpJ92wWbNmiYcPmaNHjwoAYv369Xrz7tmzp9Bwb29v0bdv30K1jx8/Xjx6GD5a+7Rp04STk5MICAjQ26fr1q0TMplMHD16VG/+b7/9VgAQx48fL7S+h3Xu3Fk0atSo0PAvvvhCABChoaG6YZ9++qmwsLAQN27c0Jv2ww8/FHK5XISFhekNnzFjhpAkSW+Yt7e3GDlypO75mDFjhKurq4iPj9eb7qWXXhJqtVpkZGQIIf7/tXPw4EHdNFlZWUImk4m3335bN2zgwIFCqVSK27dv64ZFRkYKKysr0alTJ92wgmOlYPuysrKEl5eX6N27twAg1qxZU3hnPaRg/jNnzugNL+p1V95trM
zjo+D1OmzYMNGvXz/d8Hv37gmZTCaGDRsmAIi4uDghhBCpqanCxsZGjB07Vq/W6OhooVar9YaPHDlSWFhYFNo3W7Z
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAq4AAAHWCAYAAAC2Zgs3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB+1ElEQVR4nO3dd3gU1f4G8He2JJu66b1QAoROCC10EEQEqQoiXFEQGyjlKspVRLwoYgOVIiqCIqCglAsqSsdCJ6F3AukhvSeb7J7fHyH7Y8kmJCHJZJP38zz7wE59d7Kz+ebsmTOSEEKAiIiIiKiOU8gdgIiIiIioIli4EhEREZFFYOFKRERERBaBhSsRERERWQQWrkRERERkEVi4EhEREZFFYOFKRERERBaBhSsRERERWQQWrkRERERkEVi4EhHVkPfeew8GgwEAYDAYsHDhQpkTUWWcPXsWW7duNT6PiIjAL7/8Il+gBmb//v2QJAn79+83mb527VoEBwdDrVbDyclJlmwkHxauVGFr1qyBJEnGh0ajQfPmzTFt2jQkJibKHY+ozvn222/x0UcfISYmBh9//DG+/fZbuSNRJWRlZeG5557D4cOHceXKFUyfPh1nzpyRO1aVNGrUyOTzu6zHmjVrqmV/7733nknRX10uXryIp556Ck2bNsVXX32FL7/8str3QXWbSu4AZHneeecdNG7cGPn5+fjrr7+wYsUK/Prrrzh79ixsbW3ljkdUZ7zzzjt48skn8dprr8Ha2hrff/+93JGoEsLCwowPAGjevDmmTJkic6qqWbJkCbKzs43Pf/31V2zYsAGLFy+Gm5ubcXr37t2rZX/vvfceHn30UYwYMaJatldi//79MBgM+PTTTxEUFFSt2ybLwMKVKm3w4MHo1KkTAOCZZ56Bq6srPvnkE2zbtg3jxo2TOR1R3TF27Fj069cPV69eRbNmzeDu7i53JKqkrVu34vz588jLy0Pbtm1hZWUld6QqubuATEhIwIYNGzBixAg0atRIlkxVcevWLQBgF4EGjF0F6L71798fABAZGQkASE1NxSuvvIK2bdvC3t4ejo6OGDx4ME6dOlVq3fz8fLz99tto3rw5NBoNvL29MWrUKFy7dg0AcOPGjXK/1urbt69xWyX9oX788Uf85z//gZeXF+zs7DBs2DBER0eX2veRI0fw0EMPQavVwtbWFn369MHff/9t9jX27dvX7P7ffvvtUst+//33CA0NhY2NDVxcXPD444+b3X95r+1OBoMBS5YsQevWraHRaODp6YnnnnsOaWlpJss1atQIQ4cOLbWfadOmldqmuewffvhhqWMKAAUFBZg3bx6CgoJgbW0Nf39/zJ49GwUFBWaP1Z369u2LNm3alJr+0UcfQZIk3Lhxw2R6eno6ZsyYAX9/f1hbWyMoKAiLFi0y9hO909tvv2322D311FMmy8XGxmLSpEnw9PSEtbU1WrdujW+++cZkmZL3TsnD2toazZs3x8KFCyGEMFk2PDwcgwcPhqOjI+zt7fHAAw/g8OHDJsuUdKu5ceMGPDw80L17d7i6uqJdu3YV+jr27m4593rfVeY1Vuf5UfIz8PDwQGFhocm8DRs2GPMmJyebzPvtt9/Qq1cv2NnZwcHBAUOGDMG5c+dMlnnqqadgb29fKtdPP/1Uqt9jZd9ny5cvR+vWrWFtbQ0fHx9MnToV6enpJsv07dvXeC60atUKoaGhOHXqlNlztDxl/Qzv7rdZ8por8vP+6aef0KlTJzg4OJgs99FHH1U4V1kq8vl15coVjB49Gl5eXtBoNPDz88Pjjz+OjIwM42vOycnBt99+W+Z5ebeYmBiMGDECdnZ28PDwwMyZM0t9xjRq1Ajz5s0DALi7u5f5GWxOyXv18uXLmDBhArRaLdzd3TF37lwIIRAdHY3hw4fD0dERXl5e+Pjjj03Wr+z5QzWHLa5030qKTFdXVwDA9evXsXXrVjz22GNo3LgxEhMTsX
LlSvTp0wfnz5+Hj48PAECv12Po0KHYs2cPHn/8cUyfPh1ZWVnYtWsXzp49i6ZNmxr3MW7cODz88MMm+50zZ47ZPO+++y4kScJrr72GW7duYcmSJRgwYAAiIiJgY2MDANi7dy8GDx6M0NBQzJs3DwqFAqtXr0b//v3x559/okuXLqW26+fnZ7y4Jjs7Gy+88ILZfc+dOxdjxozBM888g6SkJHz++efo3bs3wsPDzbYSPPvss+jVqxcAYPPmzdiyZYvJ/Oeeew5r1qzB008/jZdffhmRkZFYunQpwsPD8ffff0OtVps9DpWRnp5u9sIhg8GAYcOG4a+//sKzzz6Lli1b4syZM1i8eDEuX75crX3YcnNz0adPH8TGxuK5555DQEAA/vnnH8yZMwfx8fFYsmSJ2fXWrl1r/P/MmTNN5iUmJqJbt26QJAnTpk2Du7s7fvvtN0yePBmZmZmYMWOGyfL/+c9/0LJlS+Tl5Rl/QXl4eGDy5MkAgHPnzqFXr15wdHTE7NmzoVarsXLlSvTt2xcHDhxA165dy3x9a9eurXT/yJJuOSXMve8q+xpr4vzIysrCjh07MHLkSOO01atXQ6PRID8/v9RxmDhxIgYNGoRFixYhNzcXK1asQM+ePREeHl7jrX9vv/025s+fjwEDBuCFF17ApUuXsGLFChw7duye59Nrr71WpX0OHDgQTz75JADg2LFj+Oyzz8pc1s3NDYsXLzY+/9e//mUy/9ChQxgzZgzat2+P999/H1qtFsnJyaXe+1VRkc8vnU6HQYMGoaCgAC+99BK8vLwQGxuLHTt2ID09HVqtFmvXrsUzzzyDLl264NlnnwUAk8/zu+Xl5eGBBx5AVFQUXn75Zfj4+GDt2rXYu3evyXJLlizBd999hy1btmDFihWwt7dHu3btKvUax44di5YtW+L999/HL7/8ggULFsDFxQUrV65E//79sWjRIqxbtw6vvPIKOnfujN69e5c6Rvc6f6iGCaIKWr16tQAgdu/eLZKSkkR0dLT44YcfhKurq7CxsRExMTFCCCHy8/OFXq83WTcyMlJYW1uLd955xzjtm2++EQDEJ598UmpfBoPBuB4A8eGHH5ZapnXr1qJPnz7G5/v27RMAhK+vr8jMzDRO37hxowAgPv30U+O2mzVrJgYNGmTcjxBC5ObmisaNG4uBAweW2lf37t1FmzZtjM+TkpIEADFv3jzjtBs3bgilUineffddk3XPnDkjVCpVqelXrlwRAMS3335rnDZv3jxx52n5559/CgBi3bp1Juvu3Lmz1PTAwEAxZMiQUtmnTp0q7j7V784+e/Zs4eHhIUJDQ02O6dq1a4VCoRB//vmnyfpffPGFACD+/vvvUvu7U58+fUTr1q1LTf/www8FABEZGWmc9t///lfY2dmJy5cvmyz7+uuvC6VSKaKiokymv/HGG0KSJJNpgYGBYuLEicbnkydPFt7e3iI5Odlkuccff1xotVqRm5srhPj/986+ffuMy+Tn5wuFQiFefPFF47QRI0YIKysrce3aNeO0uLg44eDgIHr37m2cVnKulLy+/Px8ERAQIAYPHiwAiNWrV5c+WHcoWf/YsWMm08297yr7Gqvz/Ch5v44bN04MHTrUOP3mzZtCoVCIcePGCQAiKSlJCCFEVlaWcHJyElOmTDHJmpCQILRarcn0iRMnCjs7u1LHZtOmTaV+VhV9n926dUtYWVmJBx980OQzaunSpQKA+Oabb0y2eee58OuvvwoA4qGHHip1PpVFp9MJAGLatGnl5i8xfvx40bhxY5Npd/+858yZIwCI+Ph447TyPifLcvexqejnV3h4uAAgNm3aVO727ezsTM7F8ixZskQAEBs3bjROy8nJEUFBQaWOVcl7ruQ9VVEl6z377LPGaUVFRcLPz09IkiTef/994/S0tDRhY2Njkr+i5w/VPHYVoEobMGAA3N3d4e/vj8cffxz29vbYsmULfH19AQDW1tZQKIrfWnq9HikpKb
C3t0eLFi1w8uRJ43Z+/vlnuLm54aWXXiq1j8p8FXe3J598Eg4ODsbnjz76KLy9vfHrr78CKB7S5sqVK3jiiSeQkpK
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Оценка сбалансированности выборок\n",
"plot_sample_balance(train_df_mp[\"Ram\"], \"Train df_mp\")\n",
"plot_sample_balance(val_df_mp[\"Ram\"], \"Validation df_mp\")\n",
"plot_sample_balance(test_df_mp[\"Ram\"], \"Test df_mp\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Выборки явно не сбалансированы, пока что не можем обучить модель."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"12. Выполнить приращение данных методами выборки с избытком (oversampling) и выборки с недостатком (undersampling). Должны быть представлены примеры реализации обоих методов для выборок каждого набора данных."
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"from imblearn.over_sampling import SMOTE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет 1."
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"После oversampling (df_neo): hazardous\n",
"False 81996\n",
"True 81996\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"X_df_neo = df_neo.drop(\"hazardous\", axis=1)\n",
"y_df_neo = df_neo[\"hazardous\"]\n",
"\n",
"# Кодирование категориальных признаков\n",
"for column in X_df_neo.select_dtypes(include=[\"object\"]).columns:\n",
" X_df_neo[column] = X_df_neo[column].astype(\"category\").cat.codes\n",
"\n",
"# Теперь применяем SMOTE\n",
"smote = SMOTE(random_state=42)\n",
"X_resampled_df_neo, y_resampled_df_neo = smote.fit_resample(X_df_neo, y_df_neo)\n",
"\n",
"# Получаем результаты\n",
"print(f\"После oversampling (df_neo): {pd.Series(y_resampled_df_neo).value_counts()}\")"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"После undersampling (df_neo): hazardous\n",
"False 8840\n",
"True 8840\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Undersampling df_neo\n",
"undersample = RandomUnderSampler(random_state=42)\n",
"X_under_df_neo, y_under_df_neo = undersample.fit_resample(X_df_neo, y_df_neo)\n",
"\n",
"print(f\"После undersampling (df_neo): {pd.Series(y_under_df_neo).value_counts()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет 7."
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After oversampling (df_ed): country\n",
"United States of America 41\n",
"United Kingdom 41\n",
"India 41\n",
"Japan 41\n",
"Hong Kong 41\n",
"China 41\n",
"Germany 41\n",
"France 41\n",
"Spain 41\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"X_df_ed = df_ed.drop(\"country\", axis=1)\n",
"y_df_ed = df_ed[\"country\"]\n",
"\n",
"# Encode categorical features as integer codes\n",
"for column in X_df_ed.select_dtypes(include=[\"object\"]).columns:\n",
"    X_df_ed[column] = X_df_ed[column].astype(\"category\").cat.codes\n",
"\n",
"# Apply SMOTE to oversample the smaller classes\n",
"smote = SMOTE(random_state=42)\n",
"X_resampled_df_ed, y_resampled_df_ed = smote.fit_resample(X_df_ed, y_df_ed)\n",
"\n",
"# Show the class counts after resampling\n",
"print(f\"After oversampling (df_ed): {pd.Series(y_resampled_df_ed).value_counts()}\")"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After undersampling (df_ed): country\n",
"China 41\n",
"France 41\n",
"Germany 41\n",
"Hong Kong 41\n",
"India 41\n",
"Japan 41\n",
"Spain 41\n",
"United Kingdom 41\n",
"United States of America 41\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Undersample df_ed down to the size of the smallest class\n",
"undersample = RandomUnderSampler(random_state=42)\n",
"X_under_df_ed, y_under_df_ed = undersample.fit_resample(X_df_ed, y_df_ed)\n",
"\n",
"print(f\"After undersampling (df_ed): {pd.Series(y_under_df_ed).value_counts()}\")"
]
},
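{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both resampling runs for df_ed end with 41 rows per country, which is what one would expect if the classes were already balanced. A quick check of the class counts before resampling could look like the sketch below; the `labels` series is hypothetical toy data, not df_ed, and `imbalance_ratio` is just an illustrative name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical labels: three classes with equal counts\n",
"labels = pd.Series([\"A\"] * 4 + [\"B\"] * 4 + [\"C\"] * 4, name=\"label\")\n",
"\n",
"counts = labels.value_counts()\n",
"\n",
"# Ratio of the largest class to the smallest one; 1.0 means perfectly balanced\n",
"imbalance_ratio = counts.max() / counts.min()\n",
"\n",
"print(counts)\n",
"print(f\"imbalance ratio: {imbalance_ratio:.2f}\")"
]
},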
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataset 18."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_df_mp = df_mp.drop(\"Battery\", axis=1)\n",
"y_df_mp = df_mp[\"Battery\"]\n",
"\n",
"# Encode categorical features as integer codes\n",
"for column in X_df_mp.select_dtypes(include=[\"object\"]).columns:\n",
"    X_df_mp[column] = X_df_mp[column].astype(\"category\").cat.codes\n",
"\n",
"# SMOTE looks up k_neighbors (5 by default) neighbors inside each class, so every\n",
"# resampled class needs more than 5 rows. Some \"Battery\" values occur only once, which raised:\n",
"# ValueError: Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 1\n",
"min_class_size = y_df_mp.value_counts().min()\n",
"if min_class_size > 5:\n",
"    smote = SMOTE(random_state=42)\n",
"    X_resampled_df_mp, y_resampled_df_mp = smote.fit_resample(X_df_mp, y_df_mp)\n",
"    print(f\"After oversampling (df_mp): {pd.Series(y_resampled_df_mp).value_counts()}\")\n",
"else:\n",
"    print(f\"SMOTE skipped (df_mp): the smallest class has only {min_class_size} row(s)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
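"Two of the datasets are intended for classification tasks (df_neo, df_ed). We applied both oversampling (SMOTE) and undersampling (RandomUnderSampler) to them. For df_neo, undersampling leaves 8840 rows in each `hazardous` class. For df_ed, both methods return 41 rows per country, which indicates that its classes were already balanced.\n",
"\n",
"The last dataset (df_mp) is intended for a regression task (predicting the price of a mobile device), so class resampling does not apply and no data augmentation is required. SMOTE also cannot be run on a numeric column such as `Battery`: some of its values occur only once, while SMOTE needs more rows per class than its `k_neighbors` setting (5 by default)."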
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}