{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 3. Feature Engineering.\n",
"\n",
"## Dataset: \"Heart Attack Analysis and Prediction Dataset\".\n",
"\n",
"[**Link**](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)\n",
"\n",
"### Dataset description\n",
"\n",
"**Problem domain**: The dataset comes from medical statistics and is intended for analyzing the factors associated with heart attack risk. This matters for forecasting and for designing strategies to prevent cardiovascular disease.\n",
"\n",
"**Relevance**: Cardiovascular diseases are among the leading causes of death worldwide. Analyzing data on lifestyle, health status, and hereditary factors helps identify the key predictors of cardiovascular disease. The dataset supports this kind of analysis and can be used to build predictive models aimed at reducing risk and enabling timely diagnosis.\n",
"\n",
"**Units of observation**: Each record describes one person: their health status, lifestyle, demographic characteristics, and the presence of certain diseases. The units of observation are individual patients.\n",
"\n",
"**Attributes:**\n",
"- `HeartDisease`: presence of a heart attack (Yes/No); the target variable.\n",
"- `BMI`: body mass index, a numeric value.\n",
"- `Smoking`: smoking (Yes/No).\n",
"- `AlcoholDrinking`: alcohol consumption (Yes/No).\n",
"- `Stroke`: history of stroke (Yes/No).\n",
"- `PhysicalHealth`: number of days per month when physical health was not good.\n",
"- `MentalHealth`: number of days per month when mental health was not good.\n",
"- `DiffWalking`: difficulty walking (Yes/No).\n",
"- `Sex`: sex (Male/Female).\n",
"- `AgeCategory`: age category (for example, 55-59, 80 or older).\n",
"- `Race`: race (for example, White, Black).\n",
"- `Diabetic`: presence of diabetes (Yes/No/No, borderline diabetes).\n",
"- `PhysicalActivity`: physical activity (Yes/No).\n",
"- `GenHealth`: general health (from Excellent to Poor).\n",
"- `SleepTime`: average number of hours of sleep per day.\n",
"- `Asthma`: presence of asthma (Yes/No).\n",
"- `KidneyDisease`: presence of kidney disease (Yes/No).\n",
"- `SkinCancer`: presence of skin cancer (Yes/No).\n",
"\n",
"### Business goals and the corresponding technical project goals\n",
"\n",
"**Business goal 1: Developing personalized cardiovascular disease prevention programs**\n",
"\n",
"Reducing the incidence of cardiovascular disease in the at-risk group through prevention programs lowers healthcare costs (insurance payouts, treatment). Companies that provide insurance or medical services can minimize losses and increase revenue by detecting client risk early.\n",
"\n",
"*Technical project goals*:\n",
"1. Build a predictive machine learning model that estimates heart attack risk from the available data.\n",
"2. Develop an algorithm that classifies patients into risk groups based on their lifestyle, health status, and hereditary factors.\n",
"3. Identify the most significant risk factors in order to recommend targeted lifestyle changes.\n",
"\n",
"**Business goal 2: Creating a commercial product for assessing the health of company employees**\n",
"\n",
"The product can be offered to corporate clients to assess their employees' health and reduce the risk of long-term sick leave, which improves productivity and lowers employers' insurance payouts. It could be sold as a subscription or as a one-time assessment.\n",
"\n",
"*Technical project goals*:\n",
"1. Develop an employee health visualization tool based on analysis of key factors from the dataset (for example, smoking, body mass index, physical activity).\n",
"2. Train and tune a model that predicts the probability of a heart attack depending on the corporate context (employee age groups, stress factors).\n",
"3. Integrate predictive analytics into a product that delivers personalized health reports and recommendations.\n",
"\n",
"**Business goal**: Better prediction of heart attack risk helps insurers and healthcare providers identify at-risk clients early and target prevention programs at them.\n",
"\n",
"**Technical goal**: Predicting the presence of a heart attack.\n",
"\n",
"**Input data**: Records of individual patients, including all features (BMI, smoking, physical activity, general health, etc.).\n",
"\n",
"**Target variable**: The `HeartDisease` column, which indicates whether the patient has had a heart attack (`Yes` or `No`)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>HeartDisease</th>\n",
" <th>BMI</th>\n",
" <th>Smoking</th>\n",
" <th>AlcoholDrinking</th>\n",
" <th>Stroke</th>\n",
" <th>PhysicalHealth</th>\n",
" <th>MentalHealth</th>\n",
" <th>DiffWalking</th>\n",
" <th>Sex</th>\n",
" <th>AgeCategory</th>\n",
" <th>Race</th>\n",
" <th>Diabetic</th>\n",
" <th>PhysicalActivity</th>\n",
" <th>GenHealth</th>\n",
" <th>SleepTime</th>\n",
" <th>Asthma</th>\n",
" <th>KidneyDisease</th>\n",
" <th>SkinCancer</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>No</td>\n",
" <td>16.60</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>3.0</td>\n",
" <td>30.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>55-59</td>\n",
" <td>White</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>5.0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>No</td>\n",
" <td>20.34</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>80 or older</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>7.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>No</td>\n",
" <td>26.58</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>20.0</td>\n",
" <td>30.0</td>\n",
" <td>No</td>\n",
" <td>Male</td>\n",
" <td>65-69</td>\n",
" <td>White</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Fair</td>\n",
" <td>8.0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>No</td>\n",
" <td>24.21</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>75-79</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Good</td>\n",
" <td>6.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>No</td>\n",
" <td>23.71</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>28.0</td>\n",
" <td>0.0</td>\n",
" <td>Yes</td>\n",
" <td>Female</td>\n",
" <td>40-44</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>8.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth \\\n",
"0 No 16.60 Yes No No 3.0 \n",
"1 No 20.34 No No Yes 0.0 \n",
"2 No 26.58 Yes No No 20.0 \n",
"3 No 24.21 No No No 0.0 \n",
"4 No 23.71 No No No 28.0 \n",
"\n",
" MentalHealth DiffWalking Sex AgeCategory Race Diabetic \\\n",
"0 30.0 No Female 55-59 White Yes \n",
"1 0.0 No Female 80 or older White No \n",
"2 30.0 No Male 65-69 White Yes \n",
"3 0.0 No Female 75-79 White No \n",
"4 0.0 Yes Female 40-44 White No \n",
"\n",
" PhysicalActivity GenHealth SleepTime Asthma KidneyDisease SkinCancer \n",
"0 Yes Very good 5.0 Yes No Yes \n",
"1 Yes Very good 7.0 No No No \n",
"2 Yes Fair 8.0 Yes No No \n",
"3 No Good 6.0 No No Yes \n",
"4 Yes Very good 8.0 No No No "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\".//static//csv//heart_2020_cleaned.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling missing data\n",
"\n",
"First, check whether the dataset contains any missing feature values:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"HeartDisease 0\n",
"BMI 0\n",
"Smoking 0\n",
"AlcoholDrinking 0\n",
"Stroke 0\n",
"PhysicalHealth 0\n",
"MentalHealth 0\n",
"DiffWalking 0\n",
"Sex 0\n",
"AgeCategory 0\n",
"Race 0\n",
"Diabetic 0\n",
"PhysicalActivity 0\n",
"GenHealth 0\n",
"SleepTime 0\n",
"Asthma 0\n",
"KidneyDisease 0\n",
"SkinCancer 0\n",
"dtype: int64\n",
"\n",
"HeartDisease False\n",
"BMI False\n",
"Smoking False\n",
"AlcoholDrinking False\n",
"Stroke False\n",
"PhysicalHealth False\n",
"MentalHealth False\n",
"DiffWalking False\n",
"Sex False\n",
"AgeCategory False\n",
"Race False\n",
"Diabetic False\n",
"PhysicalActivity False\n",
"GenHealth False\n",
"SleepTime False\n",
"Asthma False\n",
"KidneyDisease False\n",
"SkinCancer False\n",
"dtype: bool\n",
"\n"
]
}
],
"source": [
"# Number of missing values per feature\n",
"print(df.isnull().sum())\n",
"\n",
"print()\n",
"\n",
"# Whether each feature has any missing values\n",
"print(df.isnull().any())\n",
"\n",
"print()\n",
"\n",
"# Percentage of missing values per feature (printed only for features that have any)\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f\"{i} missing values: {null_rate:.2f}%\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**No** missing data was found in the dataset."
]
},
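{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the per-column percentage of missing values computed in the loop above can also be obtained in one vectorized call. The sketch below is only an illustration on a small made-up frame; the `demo` name and its values are assumptions and are not part of the dataset:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame, not the heart disease dataset\n",
"demo = pd.DataFrame({\n",
"    \"BMI\": [16.6, np.nan, 26.58, 24.21],\n",
"    \"SleepTime\": [5.0, 7.0, 8.0, 6.0],\n",
"})\n",
"\n",
"# isna() gives a boolean frame; the column-wise mean of booleans\n",
"# is the share of missing values in each column\n",
"null_rate = demo.isna().mean() * 100\n",
"\n",
"# Keep only the columns that actually contain missing values\n",
"print(null_rate[null_rate > 0].round(2))\n",
"```\n",
"\n",
"On this toy frame only `BMI` is reported, with 25% of its values missing."
]
},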
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling noisy data\n",
"\n",
"**Noise** is random error or variation in the data that can make it harder to detect the true patterns. Noise can come from measurement errors, incorrect records, or other factors.\n",
"\n",
"**Outliers** are values that differ markedly from the rest of the observations in the dataset. They may point to errors in the data or to rare but important events, and they can distort statistical analysis.\n",
"\n",
"The code below checks the numeric columns for outliers using the interquartile range rule and draws a box plot for each column. Outliers that need fixing are then capped: values below the lower bound are replaced with the lower bound, and values above the upper bound are replaced with the upper bound:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking columns for outliers:\n",
"Column BMI:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 10396\n",
"\tMinimum value: 12.02\n",
"\tMaximum value: 94.85\n",
"\t1st quartile (Q1): 24.03\n",
"\t3rd quartile (Q3): 31.42\n",
"\n",
"Column PhysicalHealth:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 47146\n",
"\tMinimum value: 0.0\n",
"\tMaximum value: 30.0\n",
"\t1st quartile (Q1): 0.0\n",
"\t3rd quartile (Q3): 2.0\n",
"\n",
"Column MentalHealth:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 51576\n",
"\tMinimum value: 0.0\n",
"\tMaximum value: 30.0\n",
"\t1st quartile (Q1): 0.0\n",
"\t3rd quartile (Q3): 3.0\n",
"\n",
"Column SleepTime:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 4543\n",
"\tMinimum value: 1.0\n",
"\tMaximum value: 24.0\n",
"\t1st quartile (Q1): 6.0\n",
"\t3rd quartile (Q3): 8.0\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from math import ceil\n",
"\n",
"# Check the given DataFrame columns for outliers (IQR rule)\n",
"def check_outliers(dataframe, columns):\n",
"    for column in columns:\n",
"        if not pd.api.types.is_numeric_dtype(dataframe[column]):  # Skip non-numeric columns\n",
"            continue\n",
"\n",
"        Q1 = dataframe[column].quantile(0.25)  # 1st quartile (25%)\n",
"        Q3 = dataframe[column].quantile(0.75)  # 3rd quartile (75%)\n",
"        IQR = Q3 - Q1  # Interquartile range\n",
"\n",
"        # Bounds beyond which a value counts as an outlier\n",
"        lower_bound = Q1 - 1.5 * IQR  # Lower bound\n",
"        upper_bound = Q3 + 1.5 * IQR  # Upper bound\n",
"\n",
"        # Count the outliers\n",
"        outliers = dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)]\n",
"        outlier_count = outliers.shape[0]\n",
"\n",
"        print(f\"Column {column}:\")\n",
"        print(f\"\\tHas outliers: {'Yes' if outlier_count > 0 else 'No'}\")\n",
"        print(f\"\\tNumber of outliers: {outlier_count}\")\n",
"        print(f\"\\tMinimum value: {dataframe[column].min()}\")\n",
"        print(f\"\\tMaximum value: {dataframe[column].max()}\")\n",
"        print(f\"\\t1st quartile (Q1): {Q1}\")\n",
"        print(f\"\\t3rd quartile (Q3): {Q3}\\n\")\n",
"\n",
"# Visualize outliers\n",
"def visualize_outliers(dataframe, columns):\n",
"    # Box plots, three per row\n",
"    plt.figure(figsize=(15, 10))\n",
"    rows = ceil(len(columns) / 3)\n",
"    for index, column in enumerate(columns, 1):\n",
"        plt.subplot(rows, 3, index)\n",
"        plt.boxplot(dataframe[column], vert=True, patch_artist=True)\n",
"        plt.title(f\"Box plot for \\\"{column}\\\"\")\n",
"        plt.xlabel(column)\n",
"\n",
"    # Show the plots\n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"\n",
"\n",
"# Numeric columns of the DataFrame\n",
"numeric_columns = [\n",
"    'BMI',\n",
"    'PhysicalHealth',\n",
"    'MentalHealth',\n",
"    'SleepTime'\n",
"]\n",
"\n",
"# Check the columns for outliers\n",
"print('Checking columns for outliers:')\n",
"check_outliers(df, numeric_columns)\n",
"visualize_outliers(df, numeric_columns)"
]
},
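{
"cell_type": "markdown",
"metadata": {},
"source": [
"The interquartile range rule used in `check_outliers` can be traced by hand on a small sample. The sketch below uses a made-up series of sleep durations; the `sleep` name and its values are illustrative assumptions, not data from the dataset:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample: hours of sleep with two implausible values\n",
"sleep = pd.Series([1, 6, 6, 7, 7, 7, 8, 8, 24])\n",
"\n",
"q1 = sleep.quantile(0.25)  # 1st quartile\n",
"q3 = sleep.quantile(0.75)  # 3rd quartile\n",
"iqr = q3 - q1              # interquartile range\n",
"\n",
"# Values further than 1.5 * IQR from the quartiles are flagged\n",
"lower = q1 - 1.5 * iqr\n",
"upper = q3 + 1.5 * iqr\n",
"\n",
"print(lower, upper)\n",
"print(sleep[(sleep < lower) | (sleep > upper)].tolist())\n",
"```\n",
"\n",
"Here Q1 = 6 and Q3 = 8, so the bounds are 3 and 11, and the values 1 and 24 are flagged as outliers."
]
},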
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `BMI` and `SleepTime` features have a noticeable number of outliers (10,396 and 4,543 respectively) that are worth **handling**. The numeric features `PhysicalHealth` and `MentalHealth` have even more flagged values (47,146 and 51,576), but these make up a large share of all observations and both features take values in a fairly narrow range (0 to 30 days). Discarding that much important health-status information could **hurt the ability to predict a heart attack**, so these two features are left unchanged.\n",
"\n",
"To handle the outliers in `BMI` and `SleepTime`, we cap values that deviate too far by **replacing them with the value of the corresponding bound**:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking columns for outliers after handling them:\n",
"Column BMI:\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 12.945\n",
"\tMaximum value: 42.505\n",
"\t1st quartile (Q1): 24.03\n",
"\t3rd quartile (Q3): 31.42\n",
"\n",
"Column SleepTime:\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 3.0\n",
"\tMaximum value: 11.0\n",
"\t1st quartile (Q1): 6.0\n",
"\t3rd quartile (Q3): 8.0\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Cap outliers in the given DataFrame columns\n",
"def remove_outliers(dataframe, columns):\n",
"    for column in columns:\n",
"        if not pd.api.types.is_numeric_dtype(dataframe[column]):  # Skip non-numeric columns\n",
"            continue\n",
"\n",
"        Q1 = dataframe[column].quantile(0.25)  # 1st quartile (25%)\n",
"        Q3 = dataframe[column].quantile(0.75)  # 3rd quartile (75%)\n",
"        IQR = Q3 - Q1  # Interquartile range\n",
"\n",
"        # Bounds beyond which a value counts as an outlier\n",
"        lower_bound = Q1 - 1.5 * IQR  # Lower bound\n",
"        upper_bound = Q3 + 1.5 * IQR  # Upper bound\n",
"\n",
"        # Cap the outliers:\n",
"        # values below the lower bound become the lower bound,\n",
"        # values above the upper bound become the upper bound\n",
"        dataframe[column] = dataframe[column].clip(lower=lower_bound, upper=upper_bound)\n",
"\n",
"    return dataframe\n",
"\n",
"# Columns to fix\n",
"columns_to_fix = [\n",
"    'BMI',\n",
"    'SleepTime'\n",
"]\n",
"\n",
"# Cap the outliers\n",
"df = remove_outliers(df, columns_to_fix)\n",
"\n",
"# Check the columns for outliers again\n",
"print('Checking columns for outliers after handling them:')\n",
"check_outliers(df, columns_to_fix)\n",
"visualize_outliers(df, columns_to_fix)"
]
},
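{
"cell_type": "markdown",
"metadata": {},
"source": [
"Capping values at the bounds (sometimes called winsorizing) can be illustrated on a small made-up series. The sketch below is a hedged illustration, not the notebook's own step: it assumes bounds of 3 and 11 and compares pandas' `Series.clip` with an explicit element-wise conditional; `values`, `lower`, and `upper` are illustrative names:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sleep durations in hours; 1.0 and 24.0 lie outside the bounds\n",
"values = pd.Series([1.0, 6.0, 7.0, 8.0, 24.0])\n",
"lower, upper = 3.0, 11.0  # assumed bounds, e.g. Q1 - 1.5 * IQR and Q3 + 1.5 * IQR\n",
"\n",
"# Vectorized capping\n",
"capped = values.clip(lower=lower, upper=upper)\n",
"\n",
"# The same capping written out element by element\n",
"manual = values.apply(lambda x: lower if x < lower else upper if x > upper else x)\n",
"\n",
"print(capped.tolist())\n",
"print((capped == manual).all())\n",
"```\n",
"\n",
"Both versions agree: only the out-of-range values change, to 3.0 and 11.0, and no rows are dropped."
]
},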
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting the dataset\n",
"\n",
"Split the data into 3 sets:\n",
"1. *Training* set (60%).\n",
"2. *Validation* set (20%).\n",
"3. *Test* set (20%)."
]
},
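{
"cell_type": "markdown",
"metadata": {},
"source": [
"A three-way stratified split can be built from two calls to `train_test_split`. The sketch below is a minimal illustration on synthetic data, assuming fractions of 0.6/0.2/0.2; the `demo` frame, its `x` column, and the `random_state` value are made up for the example:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical imbalanced labels: 90% \"No\", 10% \"Yes\"\n",
"demo = pd.DataFrame({\n",
"    \"x\": range(1000),\n",
"    \"HeartDisease\": [\"No\"] * 900 + [\"Yes\"] * 100,\n",
"})\n",
"\n",
"frac_train, frac_val, frac_test = 0.6, 0.2, 0.2\n",
"\n",
"# Stage 1: separate the training set from the rest\n",
"train, temp = train_test_split(\n",
"    demo, stratify=demo[\"HeartDisease\"], test_size=1.0 - frac_train, random_state=0\n",
")\n",
"\n",
"# Stage 2: split the rest; the test share is taken relative to the remainder\n",
"relative_test = frac_test / (frac_val + frac_test)\n",
"val, test = train_test_split(\n",
"    temp, stratify=temp[\"HeartDisease\"], test_size=relative_test, random_state=0\n",
")\n",
"\n",
"for name, part in [(\"train\", train), (\"val\", val), (\"test\", test)]:\n",
"    shares = part[\"HeartDisease\"].value_counts(normalize=True).round(2).to_dict()\n",
"    print(name, len(part), shares)\n",
"```\n",
"\n",
"Each part should keep roughly the original 90/10 class ratio, which is the point of stratifying on the label."
]
},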
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking class balance of the splits:\n",
"Training set: (191877, 18)\n",
"Class distribution in the \"HeartDisease\" column:\n",
" HeartDisease\n",
"No 175453\n",
"Yes 16424\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 91.44%\n",
"Share of class \"Yes\": 8.56%\n",
"\n",
"Validation set: (63959, 18)\n",
"Class distribution in the \"HeartDisease\" column:\n",
" HeartDisease\n",
"No 58484\n",
"Yes 5475\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 91.44%\n",
"Share of class \"Yes\": 8.56%\n",
"\n",
"Test set: (63959, 18)\n",
"Class distribution in the \"HeartDisease\" column:\n",
" HeartDisease\n",
"No 58485\n",
"Yes 5474\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 91.44%\n",
"Share of class \"Yes\": 8.56%\n",
"\n",
"Checking whether the splits need augmentation:\n",
"The training set requires data augmentation\n",
"The validation set requires data augmentation\n",
"The test set requires data augmentation\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 600x800 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from math import isclose\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"def split_stratified_into_train_val_test(\n",
"    df_input,\n",
"    stratify_colname,\n",
"    frac_train,\n",
"    frac_val,\n",
"    frac_test,\n",
"    random_state=None,\n",
"):\n",
"    \"\"\"\n",
"    Splits a Pandas dataframe into three subsets (train, val, and test)\n",
"    following fractional ratios provided by the user, where each subset is\n",
"    stratified by the values in a specific column (that is, each subset has\n",
"    the same relative frequency of the values in the column). It performs this\n",
"    splitting by running train_test_split() twice.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    df_input : Pandas dataframe\n",
"        Input dataframe to be split.\n",
"    stratify_colname : str\n",
"        The name of the column that will be used for stratification. Usually\n",
"        this column would be for the label.\n",
"    frac_train : float\n",
"    frac_val : float\n",
"    frac_test : float\n",
"        The ratios with which the dataframe will be split into train, val, and\n",
"        test data. The values should be expressed as float fractions and should\n",
"        sum to 1.0.\n",
"    random_state : int, None, or RandomState instance\n",
"        Value to be passed to train_test_split().\n",
"\n",
"    Returns\n",
"    -------\n",
"    df_train, df_val, df_test :\n",
"        Dataframes containing the three splits.\n",
"    \"\"\"\n",
"\n",
"    # Compare with a tolerance: exact float equality can reject fractions\n",
"    # that sum to 1.0 only up to rounding error.\n",
"    if not isclose(frac_train + frac_val + frac_test, 1.0):\n",
"        raise ValueError(\n",
"            \"fractions %f, %f, %f do not add up to 1.0\"\n",
"            % (frac_train, frac_val, frac_test)\n",
"        )\n",
"\n",
"    if stratify_colname not in df_input.columns:\n",
"        raise ValueError(\"%s is not a column in the dataframe\" % (stratify_colname))\n",
"\n",
"    X = df_input  # Contains all columns.\n",
"    y = df_input[[stratify_colname]]  # Dataframe of just the column on which to stratify.\n",
"\n",
"    # Split original dataframe into train and temp dataframes.\n",
"    df_train, df_temp, y_train, y_temp = train_test_split(\n",
"        X, y, stratify=y, test_size=(1.0 - frac_train), random_state=random_state\n",
"    )\n",
"\n",
"    # Split the temp dataframe into val and test dataframes.\n",
"    # The test share is expressed relative to the size of the temp dataframe.\n",
"    relative_frac_test = frac_test / (frac_val + frac_test)\n",
"    df_val, df_test, y_val, y_test = train_test_split(\n",
"        df_temp,\n",
"        y_temp,\n",
"        stratify=y_temp,\n",
"        test_size=relative_frac_test,\n",
"        random_state=random_state,\n",
"    )\n",
"\n",
"    # Every row must end up in exactly one of the three splits.\n",
"    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)\n",
"\n",
"    return df_train, df_val, df_test\n",
"\n",
"# Оценка сбалансированности\n",
"def check_balance(dataframe, dataframe_name, column):\n",
" counts = dataframe[column].value_counts()\n",
" print(dataframe_name + \": \", dataframe.shape)\n",
" print(f\"Распределение выборки данных по классам в колонке \\\"{column}\\\":\\n\", counts)\n",
" total_count = len(dataframe)\n",
" for value in counts.index:\n",
" percentage: float = counts[value] / total_count * 100\n",
" print(f\"Процент объектов класса \\\"{value}\\\": {percentage:.2f}%\")\n",
" print()\n",
" \n",
"# Определение необходимости аугментации данных\n",
"def need_augmentation(dataframe,\n",
" column, \n",
" first_value, second_value):\n",
" counts = dataframe[column].value_counts()\n",
" ratio: float = counts[first_value] / counts[second_value]\n",
" return ratio > 1.5 or ratio < 0.67\n",
" \n",
" # Визуализация сбалансированности классов\n",
"def visualize_balance(dataframe_train,\n",
" dataframe_val,\n",
" dataframe_test, \n",
" column: str):\n",
" fig, axes = plt.subplots(3, 1, figsize=(6, 8))\n",
"\n",
" # Обучающая выборка\n",
" counts_train = dataframe_train[column].value_counts()\n",
" axes[0].pie(counts_train, labels=counts_train.index, autopct='%1.1f%%', startangle=90)\n",
" axes[0].set_title(f\"Распределение классов \\\"{column}\\\" в обучающей выборке\")\n",
"\n",
" # Контрольная выборка\n",
" counts_val = dataframe_val[column].value_counts()\n",
" axes[1].pie(counts_val, labels=counts_val.index, autopct='%1.1f%%', startangle=90)\n",
" axes[1].set_title(f\"Распределение классов \\\"{column}\\\" в контрольной выборке\")\n",
"\n",
" # Тестовая выборка\n",
" counts_test = dataframe_test[column].value_counts()\n",
" axes[2].pie(counts_test, labels=counts_test.index, autopct='%1.1f%%', startangle=90)\n",
" axes[2].set_title(f\"Распределение классов \\\"{column}\\\" в тренировочной выборке\")\n",
"\n",
" # Отображение графиков\n",
" plt.tight_layout()\n",
" plt.show()\n",
"\n",
"\n",
"df_train, df_val, df_test = split_stratified_into_train_val_test(\n",
" df, \n",
" stratify_colname=\"HeartDisease\", \n",
" frac_train=0.60, \n",
" frac_val=0.20, \n",
" frac_test=0.20\n",
")\n",
"\n",
"# Проверка сбалансированности выборок\n",
"print('Проверка сбалансированности выборок:')\n",
"check_balance(df_train, 'Обучающая выборка', 'HeartDisease')\n",
"check_balance(df_val, 'Контрольная выборка', 'HeartDisease')\n",
"check_balance(df_test, 'Тестовая выборка', 'HeartDisease')\n",
"\n",
"# Проверка необходимости аугментации выборок\n",
"print('Проверка необходимости аугментации выборок:')\n",
"print(f\"Для обучающей выборки аугментация данных {'не ' if not need_augmentation(df_train, 'HeartDisease', 'No', 'Yes') else ''}требуется\")\n",
"print(f\"Для контрольной выборки аугментация данных {'не ' if not need_augmentation(df_val, 'HeartDisease', 'No', 'Yes') else ''}требуется\")\n",
"print(f\"Для тестовой выборки аугментация данных {'не ' if not need_augmentation(df_test, 'HeartDisease', 'No', 'Yes') else ''}требуется\")\n",
" \n",
"# Визуализация сбалансированности классов\n",
"visualize_balance(df_train, df_val, df_test, 'HeartDisease')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Выборки оказались **недостаточно сбалансированными**. Используем методы приращения данных *с избытком* (**oversampling**) копирование наблюдений или генерация новых наблюдений на основе существующих с помощью алгоритмов SMOTE и ADASYN (нахождение k-ближайших соседей):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Проверка сбалансированности выборок:\n",
"Обучающая выборка: (350906, 51)\n",
"Распределение выборки данных по классам в колонке \"HeartDisease\":\n",
" HeartDisease\n",
"No 175453\n",
"Yes 175453\n",
"Name: count, dtype: int64\n",
"Процент объектов класса \"No\": 50.00%\n",
"Процент объектов класса \"Yes\": 50.00%\n",
"\n",
"Контрольная выборка: (116968, 51)\n",
"Распределение выборки данных по классам в колонке \"HeartDisease\":\n",
" HeartDisease\n",
"No 58484\n",
"Yes 58484\n",
"Name: count, dtype: int64\n",
"Процент объектов класса \"No\": 50.00%\n",
"Процент объектов класса \"Yes\": 50.00%\n",
"\n",
"Тестовая выборка: (116970, 51)\n",
"Распределение выборки данных по классам в колонке \"HeartDisease\":\n",
" HeartDisease\n",
"No 58485\n",
"Yes 58485\n",
"Name: count, dtype: int64\n",
"Процент объектов класса \"No\": 50.00%\n",
"Процент объектов класса \"Yes\": 50.00%\n",
"\n",
"Проверка необходимости аугментации выборок:\n",
"Для обучающей выборки аугментация данных не требуется\n",
"Для контрольной выборки аугментация данных не требуется\n",
"Для тестовой выборки аугментация данных не требуется\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAlMAAAMWCAYAAADVowODAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAC8AUlEQVR4nOzdd1wT5x8H8E8SIGwVkOVAKuLATdU6seLCrVXrqLNWW/XXZWurrYpYa63WUUerbR1VW/eotg5cVamzdYsbHKgMFWSFkTy/P2hSYgICCRzj83698lIuN753yV0+uee5i0wIIUBEREREBSKXugAiIiKikoxhioiIiMgEDFNEREREJmCYIiIiIjIBwxQRERGRCRimiIiIiEzAMEVERERkAoYpIiIiIhMwTBERkdkIIfDkyRPcuHFD6lLIzDQaDeLi4nD79m2pSyl2GKaIqNQZPnw4qlWrJnUZZUZiYiI+//xz1KxZE1ZWVnB2doavry+uXbsmdWklwrFjx3D48GHd34cPH0ZYWJh0BWXz6NEjvP/++/Dy8oKVlRUqVqyIOnXq4NmzZ1KXVqwUeZhatWoVZDKZ7mFtbQ1fX1+MHz8e0dHRRV0OUakRHBysCxDa/Sy7tm3bom7dukanjYyMhEwmw9y5cwu7TKNSUlIQHBys94GiFRwcrHfMsLW1RdWqVdG9e3esXLkSaWlpRV+wBKpVq4bg4GAAWa/l8OHDJa1H6/Hjx2jevDm+/fZb9O3bFzt27EBoaCgOHz7MQJtH9+7dw9ixY3Hx4kVcvHgRY8eOxb1796QuCzdv3kSTJk2wfv16jBkzBrt27UJoaCgOHDgAOzs7qcsrViykWnBISAi8vb2hUqlw7NgxfPfdd/jjjz9w6dIl2NraSlUWEUkgJSUF06dPB5AVFIz57rvvYG9vj7S0NERFRWHv3r0YOXIkFixYgF27dqFKlSq6cX/44QdoNJqiKL3M+/jjj/Hw4UMcP34cfn5+UpdTIvXp0wcLFixA/fr1AQDNmzdHnz59JK4KGDNmDKysrHDixAlUqlRJ6nKKNcnCVFBQEF5++WUAwKhRo+Ds7Ix58+Zhx44dGDhwoFRlEVER0mg0SE9Pz9O4ffv2hYuLi+7vqVOnYt26dRg6dCj69euHEydO6J6ztLQ0e61kKCYmBqtXr8b333/PIGUCpVKJv/76C5cuXQIA1K1bFwqFQtKa/v77bxw8eBD79u1jkMqDYtNnql27dgCAiIgIAMCTJ0/w0UcfoV69erC3t4ejoyOCgoJw/vx5g2lVKhWCg4Ph6+sLa2treHh4oE+fPrh16xaA/5owcnpk/yZ8+PBhyGQybNiwAZMnT4a7uzvs7OzQo0cPo6ddT548ic6dO6NcuXKwtbVFQEBAjm3dbdu2Nbp87an77NauXQt/f3/Y2NjAyckJAwYMMLr83NYtO41GgwULFsDPzw/W1tZwc3PDmDFj8PTpU73xqlWrhm7duhksZ/z48QbzNFb7nDlzDLYpAKSlpWHatGnw8fGBUqlElSpVMHHixDw10bRt29ZgfjNnzoRcLscvv/xSoO0xd+5ctGjRAs7OzrCxsYG/vz82b95sdPlr165F06ZNYWtriwoVKqBNmzbYt2+f3ji7d+9GQEAAHBwc4OjoiCZNmhjUtmnTJt1r6uLigjfeeANRUVF64wwfPlyv5goVKqBt27Y4evToC7dTYYiPj8f777+PKlWqQKlUwsfHB7NnzzY465PX7SmTyTB+/HisW7cOfn5+UCqV+P7771GxYkUAwPTp03PdL543ePBgjBo1CidPnkRoaKhuuLE+U+vXr4e/v7/uNapXrx4WLlxYqOsbGhqKVq1aoXz58rC3t0fNmjUxefJkvXFM2TdeJPt7SaFQoFKlShg9ejTi4+NfOG1mZiZmzJiB6tWrQ6lUolq1apg8ebJeXadPn9YF4pdffhnW1tZwdnbGwIEDcffuXd14K1euhEwmw9mzZw2W8+
WXX0KhUOj2BWOvvbbZOjIyUjdsx44d6Nq1Kzw9PaFUKlG9enXMmDEDarVab1pj74UFCxagVq1aUCqVcHd3x5gxY/DkyRO9cYw1i8+dO9egjri4OKM15+eYO3z4cCgUCjRo0AANGjTA1q1bIZPJ8tRMWq1aNd1rLJfL4e7ujtdff11v++elGV/bnK514sQJWFtb49atW7p9NadtBeT9+GZvb4/bt2+jU6dOsLOzg6enJ0JCQiCEMKh31apVumGJiYnw9/eHt7c3Hj58mO/tXNgkOzP1PG3wcXZ2BgDcvn0b27dvR79+/eDt7Y3o6GgsW7YMAQEBuHLlCjw9PQEAarUa3bp1w4EDBzBgwAC89957SExMRGhoKC5duoTq1avrljFw4EB06dJFb7mTJk0yWs/MmTMhk8nwySefICYmBgsWLED79u1x7tw52NjYAAAOHjyIoKAg+Pv7Y9q0aZDL5Vi5ciXatWuHo0ePomnTpgbzrVy5MmbNmgUASEpKwjvvvGN02VOmTEH//v0xatQoxMbGYtGiRWjTpg3Onj2L8uXLG0wzevRotG7dGgCwdetWbNu2Te/5MWPGYNWqVRgxYgTeffddREREYPHixTh79izCwsLM8k0+Pj5et27ZaTQa9OjRA8eOHcPo0aNRu3ZtXLx4EfPnz8f169exffv2fC1n5cqV+Pzzz/HNN99g0KBBRsd50fZYuHAhevTogcGDByM9PR3r169Hv379sGvXLnTt2lU33vTp0xEcHIwWLVogJCQEVlZWOHnyJA4ePIiOHTsCyDrQjxw5En5+fpg0aRLKly+Ps2fPYs+ePbr6tNu+SZMmmDVrFqKjo7Fw4UKEhYUZvKYuLi6YP38+AOD+/ftYuHAhunTpgnv37hl97fNDrVYjLi7OYLixA09KSgoCAgIQFRWFMWPGoGrVqvjrr78wadIkPHz4EAsWLMj39gSy9puNGzdi/PjxcHFxQYMGDfDdd9/hnXfeQe/evXXNG9omjxcZMmQIli9fjn379qFDhw5GxwkNDcXAgQMRGBiI2bNnAwDCw8MRFhaG9957r1DW9/Lly+jWrRvq16+PkJAQKJVK3Lx5U+/Llrn3DWO02zQzMxPHjx/H8uXLkZqaijVr1uQ63ahRo7B69Wr07dsXEyZMwMmTJzFr1iyEh4fr9qfHjx8DyPqy5e/vj6+++gqxsbH49ttvcezYMZw9exYuLi7o27cvxo0bh3Xr1qFRo0Z6y1m3bh3atm2b77Mfq1atgr29PT788EPY29vj4MGDmDp1Kp49e4Y5c+bkON2XX36Jzz77DG3atMG4ceN0x8KTJ0/i5MmTUCqV+aojJwU95mZmZuKzzz7L17Jat26N0aNHQ6PR4NKlS1iwYAEePHhg0pewx48fQ6VS4Z133kG7du3w9ttv49atW1iyZInBtsrP8U2tVqNz58545ZVX8PXXX2PPnj2YNm0aMjMzERISYrSWjIwMvPbaa7h79y7CwsLg4eGhe64oPtvyRBSxlStXCgBi//79IjY2Vty7d0+sX79eODs7CxsbG3H//n0hhBAqlUqo1Wq9aSMiIoRSqRQhISG6YStWrBAAxLx58wyWpdFodNMBEHPmzDEYx8/PTwQEBOj+PnTokAAgKlWqJJ49e6YbvnHjRgFALFy4UDfvGjVqiE6dOumWI4QQKSkpwtvbW3To0MFgWS1atBB169bV/R0bGysAiGnTpumGRUZGCoVCIWbOnKk37cWLF4WFhYXB8Bs3bggAYvXq1bph06ZNE9lf2qNHjwoAYt26dXrT7tmzx2C4l5eX6Nq1q0Ht48aNE8+/XZ6vfeLEicLV1VX4+/vrbdM1a9YIuVwujh49qjf9999/LwCIsLAwg+VlFxAQoJvf77//LiwsLMSECROMjpuX7SFE1uuUXXp6uqhbt65o166d3rzkcrno3bu3wXtR+5rHx8cLBwcH0axZM5Gammp0nPT0dOHq6irq1q2rN86uXbsEADF16lTdsG
HDhgkvLy+9+SxfvlwAEKdOnTK6znkVEBAgAOT6yL6PzJgxQ9jZ2Ynr16/rzefTTz8VCoVC3L17VzcsL9tTiKz3jFw
"text/plain": [
"<Figure size 600x800 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from imblearn.over_sampling import SMOTE\n",
"\n",
"# Метод приращения с избытком (oversampling)\n",
"def oversample(df, column):\n",
" X = pd.get_dummies(df.drop(column, axis=1))\n",
" y = df[column]\n",
" \n",
" smote = SMOTE()\n",
" X_resampled, y_resampled = smote.fit_resample(X, y)\n",
" \n",
" df_resampled = pd.concat([X_resampled, y_resampled], axis=1)\n",
" return df_resampled\n",
"\n",
"\n",
"# Приращение данных (oversampling)\n",
"df_train_oversampled = oversample(df_train, 'HeartDisease')\n",
"df_val_oversampled = oversample(df_val, 'HeartDisease')\n",
"df_test_oversampled = oversample(df_test, 'HeartDisease')\n",
"\n",
"# Проверка сбалансированности выборок\n",
"print('Проверка сбалансированности выборок:')\n",
"check_balance(df_train_oversampled, 'Обучающая выборка', 'HeartDisease')\n",
"check_balance(df_val_oversampled, 'Контрольная выборка', 'HeartDisease')\n",
"check_balance(df_test_oversampled, 'Тестовая выборка', 'HeartDisease')\n",
"\n",
"# Проверка необходимости аугментации выборок\n",
"print('Проверка необходимости аугментации выборок:')\n",
"print(f\"Для обучающей выборки аугментация данных {'не ' if not need_augmentation(df_train_oversampled, 'HeartDisease', 'No', 'Yes') else ''}требуется\")\n",
"print(f\"Для контрольной выборки аугментация данных {'не ' if not need_augmentation(df_val_oversampled, 'HeartDisease', 'No', 'Yes') else ''}требуется\")\n",
"print(f\"Для тестовой выборки аугментация данных {'не ' if not need_augmentation(df_test_oversampled, 'HeartDisease', 'No', 'Yes') else ''}требуется\")\n",
" \n",
"# Визуализация сбалансированности классов\n",
"visualize_balance(df_train_oversampled, df_val_oversampled, df_test_oversampled, 'HeartDisease')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Конструирование признаков\n",
"\n",
"**Конструирование признаков** (*feature engineering*) процесс использования знаний об особенностях решаемой задачи и предметной области для определения признаков, которые будут использованы для обучения статистической модели.\n",
"\n",
"Методы конструирования признаков:\n",
"1. Для категориальных данных:\n",
" - **Унитарное кодирование категориальных признаков** (one-hot encoding) метод, который применяется для преобразования категориальных переменных в числовой формат. Каждая характеристика представляется в виде бинарного вектора, где для каждой категории выделяется отдельный признак (столбец) со значением 1 (True), если объект принадлежит этой категории, и 0 (False) в противном случае.\n",
"2. Для числовых данных:\n",
" - **Дискретизация** процесс преобразования непрерывных числовых значений в категориальные группы или интервалы (дискретные значения).\n",
" - **Ручной синтез** процесс создания новых признаков на основе существующих данных. Это может включать в себя комбинирование нескольких признаков, использование математических операций (например, сложение, вычитание), а также создание полиномиальных или логарифмических признаков.\n",
" - **Масштабирование признаков на основе нормировки и стандартизации** метод, который позволяет привести все числовые признаки к одинаковым или очень похожим диапазонам значений либо распределениям.\n",
" - **С применением фреймворка FeatureTools** библиотека для автоматизированного создания признаков (features) из структурированных данных. Подходит для задач машинного обучения, когда нужно быстро извлекать полезные признаки из больших объемов данных."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Унитарное кодирование\n",
"\n",
"Преобразование уже было выполнено на этапе приращения с избытком (метод `pd.get_dummies(...)`), так как метод `fit_resample` требовал для работы признаки типа число с плавающей точкой. Были преобразованы категориальные признаки `Smoking`, `AlcoholDrinking`, `Stroke`, `DiffWalking` и т.д. в бинарные признаки:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>BMI</th>\n",
" <th>PhysicalHealth</th>\n",
" <th>MentalHealth</th>\n",
" <th>SleepTime</th>\n",
" <th>Smoking_No</th>\n",
" <th>Smoking_Yes</th>\n",
" <th>AlcoholDrinking_No</th>\n",
" <th>AlcoholDrinking_Yes</th>\n",
" <th>Stroke_No</th>\n",
" <th>Stroke_Yes</th>\n",
" <th>...</th>\n",
" <th>GenHealth_Good</th>\n",
" <th>GenHealth_Poor</th>\n",
" <th>GenHealth_Very good</th>\n",
" <th>Asthma_No</th>\n",
" <th>Asthma_Yes</th>\n",
" <th>KidneyDisease_No</th>\n",
" <th>KidneyDisease_Yes</th>\n",
" <th>SkinCancer_No</th>\n",
" <th>SkinCancer_Yes</th>\n",
" <th>HeartDisease</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>24.28</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>8.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>34.44</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>25.86</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" <td>8.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>19.47</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>8.0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>34.70</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>29.05</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>32.45</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" <td>7.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>26.25</td>\n",
" <td>0.0</td>\n",
" <td>30.0</td>\n",
" <td>6.0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>30.67</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>7.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>34.96</td>\n",
" <td>14.0</td>\n",
" <td>0.0</td>\n",
" <td>6.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 51 columns</p>\n",
"</div>"
],
"text/plain": [
" BMI PhysicalHealth MentalHealth SleepTime Smoking_No Smoking_Yes \\\n",
"0 24.28 2.0 3.0 8.0 True False \n",
"1 34.44 0.0 0.0 8.0 False True \n",
"2 25.86 0.0 5.0 8.0 True False \n",
"3 19.47 0.0 2.0 8.0 False True \n",
"4 34.70 0.0 0.0 8.0 False True \n",
"5 29.05 0.0 0.0 6.0 True False \n",
"6 32.45 0.0 5.0 7.0 True False \n",
"7 26.25 0.0 30.0 6.0 False True \n",
"8 30.67 2.0 3.0 7.0 True False \n",
"9 34.96 14.0 0.0 6.0 True False \n",
"\n",
" AlcoholDrinking_No AlcoholDrinking_Yes Stroke_No Stroke_Yes ... \\\n",
"0 True False True False ... \n",
"1 True False True False ... \n",
"2 True False True False ... \n",
"3 True False True False ... \n",
"4 True False True False ... \n",
"5 False True True False ... \n",
"6 True False True False ... \n",
"7 True False True False ... \n",
"8 True False True False ... \n",
"9 True False True False ... \n",
"\n",
" GenHealth_Good GenHealth_Poor GenHealth_Very good Asthma_No Asthma_Yes \\\n",
"0 False False True False True \n",
"1 True False False True False \n",
"2 False False True True False \n",
"3 False False True True False \n",
"4 False False True True False \n",
"5 True False False True False \n",
"6 True False False True False \n",
"7 False False True True False \n",
"8 True False False True False \n",
"9 True False False True False \n",
"\n",
" KidneyDisease_No KidneyDisease_Yes SkinCancer_No SkinCancer_Yes \\\n",
"0 True False True False \n",
"1 True False True False \n",
"2 True False True False \n",
"3 True False True False \n",
"4 True False True False \n",
"5 True False True False \n",
"6 True False True False \n",
"7 True False True False \n",
"8 True False True False \n",
"9 True False True False \n",
"\n",
" HeartDisease \n",
"0 No \n",
"1 No \n",
"2 No \n",
"3 No \n",
"4 No \n",
"5 No \n",
"6 No \n",
"7 No \n",
"8 No \n",
"9 No \n",
"\n",
"[10 rows x 51 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"categorical_features = [\n",
" 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Race',\n",
" 'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer'\n",
"]\n",
"\n",
"df_train_oversampled.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Дискретизация числовых признаков\n",
"\n",
"Распределим значения признака `BMI` по интервалам, преобразуя его из числового представления в категориальное. Будем использовать метод **Равномерная группировка**:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>BMI</th>\n",
" <th>BMI_Category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>24.280</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>34.440</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>25.860</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>19.470</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>34.700</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>29.050</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>32.450</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>26.250</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>30.670</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>34.960</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>27.810</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>20.360</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>27.400</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>42.505</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>21.520</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>36.260</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>23.490</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>28.190</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>28.290</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>20.800</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" BMI BMI_Category\n",
"0 24.280 1\n",
"1 34.440 3\n",
"2 25.860 2\n",
"3 19.470 1\n",
"4 34.700 3\n",
"5 29.050 2\n",
"6 32.450 3\n",
"7 26.250 2\n",
"8 30.670 2\n",
"9 34.960 3\n",
"10 27.810 2\n",
"11 20.360 1\n",
"12 27.400 2\n",
"13 42.505 4\n",
"14 21.520 1\n",
"15 36.260 3\n",
"16 23.490 1\n",
"17 28.190 2\n",
"18 28.290 2\n",
"19 20.800 1"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Функция для дискретизации числовых признаков\n",
"def discretize_features(df, features, bins=5, labels=False):\n",
" for feature in features:\n",
" df[f'{feature}_Category'] = pd.cut(df[feature], bins=bins, labels=labels)\n",
" return df\n",
"\n",
"# Определение числовых признаков для дискретизации\n",
"numerical_features = ['BMI']\n",
"\n",
"# Применение дискретизации к обучающей, контрольной и тестовой выборкам\n",
"df_train_oversampled = discretize_features(df_train_oversampled, numerical_features)\n",
"df_val_oversampled = discretize_features(df_val_oversampled, numerical_features)\n",
"df_test_oversampled = discretize_features(df_test_oversampled, numerical_features)\n",
"\n",
"df_train_oversampled[['BMI', 'BMI_Category']].head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ручной синтез признаков\n",
"\n",
"Будем синтезировать новый признак `HealthScore`, являющийся числовым показателем здоровья на основе таких признаков, как `PhysicalHealth`, `MentalHealth`, `SleepTime`:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>BMI</th>\n",
" <th>PhysicalHealth</th>\n",
" <th>MentalHealth</th>\n",
" <th>SleepTime</th>\n",
" <th>Smoking_No</th>\n",
" <th>Smoking_Yes</th>\n",
" <th>AlcoholDrinking_No</th>\n",
" <th>AlcoholDrinking_Yes</th>\n",
" <th>Stroke_No</th>\n",
" <th>Stroke_Yes</th>\n",
" <th>...</th>\n",
" <th>GenHealth_Very good</th>\n",
" <th>Asthma_No</th>\n",
" <th>Asthma_Yes</th>\n",
" <th>KidneyDisease_No</th>\n",
" <th>KidneyDisease_Yes</th>\n",
" <th>SkinCancer_No</th>\n",
" <th>SkinCancer_Yes</th>\n",
" <th>HeartDisease</th>\n",
" <th>BMI_Category</th>\n",
" <th>HealthScore</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>24.28</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>8.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>1</td>\n",
" <td>21.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>34.44</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>3</td>\n",
" <td>23.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>25.86</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" <td>8.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>21.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>19.47</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>8.0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>1</td>\n",
" <td>22.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>34.70</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>3</td>\n",
" <td>23.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>29.05</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>22.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>32.45</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" <td>7.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>3</td>\n",
" <td>21.6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>26.25</td>\n",
" <td>0.0</td>\n",
" <td>30.0</td>\n",
" <td>6.0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>13.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>30.67</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>7.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>21.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>34.96</td>\n",
" <td>14.0</td>\n",
" <td>0.0</td>\n",
" <td>6.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>No</td>\n",
" <td>3</td>\n",
" <td>17.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 53 columns</p>\n",
"</div>"
],
"text/plain": [
" BMI PhysicalHealth MentalHealth SleepTime Smoking_No Smoking_Yes \\\n",
"0 24.28 2.0 3.0 8.0 True False \n",
"1 34.44 0.0 0.0 8.0 False True \n",
"2 25.86 0.0 5.0 8.0 True False \n",
"3 19.47 0.0 2.0 8.0 False True \n",
"4 34.70 0.0 0.0 8.0 False True \n",
"5 29.05 0.0 0.0 6.0 True False \n",
"6 32.45 0.0 5.0 7.0 True False \n",
"7 26.25 0.0 30.0 6.0 False True \n",
"8 30.67 2.0 3.0 7.0 True False \n",
"9 34.96 14.0 0.0 6.0 True False \n",
"\n",
" AlcoholDrinking_No AlcoholDrinking_Yes Stroke_No Stroke_Yes ... \\\n",
"0 True False True False ... \n",
"1 True False True False ... \n",
"2 True False True False ... \n",
"3 True False True False ... \n",
"4 True False True False ... \n",
"5 False True True False ... \n",
"6 True False True False ... \n",
"7 True False True False ... \n",
"8 True False True False ... \n",
"9 True False True False ... \n",
"\n",
" GenHealth_Very good Asthma_No Asthma_Yes KidneyDisease_No \\\n",
"0 True False True True \n",
"1 False True False True \n",
"2 True True False True \n",
"3 True True False True \n",
"4 True True False True \n",
"5 False True False True \n",
"6 False True False True \n",
"7 True True False True \n",
"8 False True False True \n",
"9 False True False True \n",
"\n",
" KidneyDisease_Yes SkinCancer_No SkinCancer_Yes HeartDisease \\\n",
"0 False True False No \n",
"1 False True False No \n",
"2 False True False No \n",
"3 False True False No \n",
"4 False True False No \n",
"5 False True False No \n",
"6 False True False No \n",
"7 False True False No \n",
"8 False True False No \n",
"9 False True False No \n",
"\n",
" BMI_Category HealthScore \n",
"0 1 21.7 \n",
"1 3 23.4 \n",
"2 2 21.9 \n",
"3 1 22.8 \n",
"4 3 23.4 \n",
"5 2 22.8 \n",
"6 3 21.6 \n",
"7 2 13.8 \n",
"8 2 21.4 \n",
"9 3 17.2 \n",
"\n",
"[10 rows x 53 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Рассчитаем новый признак HealthScore\n",
"# Используем взвешенную сумму физического, ментального здоровья и количества сна\n",
"df_train_oversampled[\"HealthScore\"] = (\n",
" (30.0 - df_train_oversampled[\"PhysicalHealth\"]) * 0.4 + # Чем меньше проблем с физическим здоровьем, тем лучше\n",
" (30.0 - df_train_oversampled[\"MentalHealth\"]) * 0.3 + # Чем меньше проблем с ментальным здоровьем, тем лучше\n",
" df_train_oversampled[\"SleepTime\"] * 0.3 # Оптимальное время сна\n",
")\n",
"\n",
"df_train_oversampled.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Масштабирование признаков на основе нормировки и стандартизации\n",
"\n",
"Методы масштабирования признаков:\n",
"- *Нормировка* обычно применяется для равномерного распределения;\n",
"- *Стандартизация* обычно применяется для нормального распределения.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>BMI</th>\n",
" <th>PhysicalHealth</th>\n",
" <th>MentalHealth</th>\n",
" <th>SleepTime</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.383457</td>\n",
" <td>0.066667</td>\n",
" <td>0.100000</td>\n",
" <td>0.625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.727165</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.436908</td>\n",
" <td>0.000000</td>\n",
" <td>0.166667</td>\n",
" <td>0.625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.220737</td>\n",
" <td>0.000000</td>\n",
" <td>0.066667</td>\n",
" <td>0.625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.735961</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.544824</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.375</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.659844</td>\n",
" <td>0.000000</td>\n",
" <td>0.166667</td>\n",
" <td>0.500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0.450101</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.375</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0.599628</td>\n",
" <td>0.066667</td>\n",
" <td>0.100000</td>\n",
" <td>0.500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0.744756</td>\n",
" <td>0.466667</td>\n",
" <td>0.000000</td>\n",
" <td>0.375</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" BMI PhysicalHealth MentalHealth SleepTime\n",
"0 0.383457 0.066667 0.100000 0.625\n",
"1 0.727165 0.000000 0.000000 0.625\n",
"2 0.436908 0.000000 0.166667 0.625\n",
"3 0.220737 0.000000 0.066667 0.625\n",
"4 0.735961 0.000000 0.000000 0.625\n",
"5 0.544824 0.000000 0.000000 0.375\n",
"6 0.659844 0.000000 0.166667 0.500\n",
"7 0.450101 0.000000 1.000000 0.375\n",
"8 0.599628 0.066667 0.100000 0.500\n",
"9 0.744756 0.466667 0.000000 0.375"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"scaler = MinMaxScaler()\n",
"\n",
"# Применяем масштабирование к выбранным признакам\n",
"df_train_oversampled_normalized = df_train_oversampled\n",
"df_train_oversampled_normalized[numeric_columns] = scaler.fit_transform(df_train_oversampled_normalized[numeric_columns])\n",
"\n",
"df_train_oversampled_normalized[numeric_columns].head(10)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime', 'Smoking_No',\n",
" 'Smoking_Yes', 'AlcoholDrinking_No', 'AlcoholDrinking_Yes', 'Stroke_No',\n",
" 'Stroke_Yes', 'DiffWalking_No', 'DiffWalking_Yes', 'Sex_Female',\n",
" 'Sex_Male', 'AgeCategory_18-24', 'AgeCategory_25-29',\n",
" 'AgeCategory_30-34', 'AgeCategory_35-39', 'AgeCategory_40-44',\n",
" 'AgeCategory_45-49', 'AgeCategory_50-54', 'AgeCategory_55-59',\n",
" 'AgeCategory_60-64', 'AgeCategory_65-69', 'AgeCategory_70-74',\n",
" 'AgeCategory_75-79', 'AgeCategory_80 or older',\n",
" 'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black',\n",
" 'Race_Hispanic', 'Race_Other', 'Race_White', 'Diabetic_No',\n",
" 'Diabetic_No, borderline diabetes', 'Diabetic_Yes',\n",
" 'Diabetic_Yes (during pregnancy)', 'PhysicalActivity_No',\n",
" 'PhysicalActivity_Yes', 'GenHealth_Excellent', 'GenHealth_Fair',\n",
" 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good', 'Asthma_No',\n",
" 'Asthma_Yes', 'KidneyDisease_No', 'KidneyDisease_Yes', 'SkinCancer_No',\n",
" 'SkinCancer_Yes', 'HeartDisease', 'BMI_Category', 'HealthScore'],\n",
" dtype='object')"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train_oversampled_normalized.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Конструирование с применением FeatureTools"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>BMI</th>\n",
" <th>PhysicalHealth</th>\n",
" <th>MentalHealth</th>\n",
" <th>SleepTime</th>\n",
" <th>Smoking_No</th>\n",
" <th>Smoking_Yes</th>\n",
" <th>AlcoholDrinking_No</th>\n",
" <th>AlcoholDrinking_Yes</th>\n",
" <th>Stroke_No</th>\n",
" <th>Stroke_Yes</th>\n",
" <th>...</th>\n",
" <th>BMI_Category * HealthScore</th>\n",
" <th>BMI_Category * MentalHealth</th>\n",
" <th>BMI_Category * PhysicalHealth</th>\n",
" <th>BMI_Category * SleepTime</th>\n",
" <th>HealthScore * MentalHealth</th>\n",
" <th>HealthScore * PhysicalHealth</th>\n",
" <th>HealthScore * SleepTime</th>\n",
" <th>MentalHealth * PhysicalHealth</th>\n",
" <th>MentalHealth * SleepTime</th>\n",
" <th>PhysicalHealth * SleepTime</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.383457</td>\n",
" <td>0.066667</td>\n",
" <td>0.100000</td>\n",
" <td>0.625</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>21.7</td>\n",
" <td>0.100000</td>\n",
" <td>0.066667</td>\n",
" <td>0.625</td>\n",
" <td>2.17</td>\n",
" <td>1.446667</td>\n",
" <td>13.5625</td>\n",
" <td>0.006667</td>\n",
" <td>0.062500</td>\n",
" <td>0.041667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.727165</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.625</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>70.2</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.875</td>\n",
" <td>0.00</td>\n",
" <td>0.000000</td>\n",
" <td>14.6250</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.436908</td>\n",
" <td>0.000000</td>\n",
" <td>0.166667</td>\n",
" <td>0.625</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>43.8</td>\n",
" <td>0.333333</td>\n",
" <td>0.000000</td>\n",
" <td>1.250</td>\n",
" <td>3.65</td>\n",
" <td>0.000000</td>\n",
" <td>13.6875</td>\n",
" <td>0.000000</td>\n",
" <td>0.104167</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.220737</td>\n",
" <td>0.000000</td>\n",
" <td>0.066667</td>\n",
" <td>0.625</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>22.8</td>\n",
" <td>0.066667</td>\n",
" <td>0.000000</td>\n",
" <td>0.625</td>\n",
" <td>1.52</td>\n",
" <td>0.000000</td>\n",
" <td>14.2500</td>\n",
" <td>0.000000</td>\n",
" <td>0.041667</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.735961</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.625</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>70.2</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.875</td>\n",
" <td>0.00</td>\n",
" <td>0.000000</td>\n",
" <td>14.6250</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.544824</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.375</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>45.6</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.750</td>\n",
" <td>0.00</td>\n",
" <td>0.000000</td>\n",
" <td>8.5500</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0.659844</td>\n",
" <td>0.000000</td>\n",
" <td>0.166667</td>\n",
" <td>0.500</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>64.8</td>\n",
" <td>0.500000</td>\n",
" <td>0.000000</td>\n",
" <td>1.500</td>\n",
" <td>3.60</td>\n",
" <td>0.000000</td>\n",
" <td>10.8000</td>\n",
" <td>0.000000</td>\n",
" <td>0.083333</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0.450101</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.375</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>27.6</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.750</td>\n",
" <td>13.80</td>\n",
" <td>0.000000</td>\n",
" <td>5.1750</td>\n",
" <td>0.000000</td>\n",
" <td>0.375000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0.599628</td>\n",
" <td>0.066667</td>\n",
" <td>0.100000</td>\n",
" <td>0.500</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>42.8</td>\n",
" <td>0.200000</td>\n",
" <td>0.133333</td>\n",
" <td>1.000</td>\n",
" <td>2.14</td>\n",
" <td>1.426667</td>\n",
" <td>10.7000</td>\n",
" <td>0.006667</td>\n",
" <td>0.050000</td>\n",
" <td>0.033333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>0.744756</td>\n",
" <td>0.466667</td>\n",
" <td>0.000000</td>\n",
" <td>0.375</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>51.6</td>\n",
" <td>0.000000</td>\n",
" <td>1.400000</td>\n",
" <td>1.125</td>\n",
" <td>0.00</td>\n",
" <td>8.026667</td>\n",
" <td>6.4500</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.175000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 68 columns</p>\n",
"</div>"
],
"text/plain": [
" BMI PhysicalHealth MentalHealth SleepTime Smoking_No \\\n",
"Id \n",
"1 0.383457 0.066667 0.100000 0.625 True \n",
"2 0.727165 0.000000 0.000000 0.625 False \n",
"3 0.436908 0.000000 0.166667 0.625 True \n",
"4 0.220737 0.000000 0.066667 0.625 False \n",
"5 0.735961 0.000000 0.000000 0.625 False \n",
"6 0.544824 0.000000 0.000000 0.375 True \n",
"7 0.659844 0.000000 0.166667 0.500 True \n",
"8 0.450101 0.000000 1.000000 0.375 False \n",
"9 0.599628 0.066667 0.100000 0.500 True \n",
"10 0.744756 0.466667 0.000000 0.375 True \n",
"\n",
" Smoking_Yes AlcoholDrinking_No AlcoholDrinking_Yes Stroke_No \\\n",
"Id \n",
"1 False True False True \n",
"2 True True False True \n",
"3 False True False True \n",
"4 True True False True \n",
"5 True True False True \n",
"6 False False True True \n",
"7 False True False True \n",
"8 True True False True \n",
"9 False True False True \n",
"10 False True False True \n",
"\n",
" Stroke_Yes ... BMI_Category * HealthScore BMI_Category * MentalHealth \\\n",
"Id ... \n",
"1 False ... 21.7 0.100000 \n",
"2 False ... 70.2 0.000000 \n",
"3 False ... 43.8 0.333333 \n",
"4 False ... 22.8 0.066667 \n",
"5 False ... 70.2 0.000000 \n",
"6 False ... 45.6 0.000000 \n",
"7 False ... 64.8 0.500000 \n",
"8 False ... 27.6 2.000000 \n",
"9 False ... 42.8 0.200000 \n",
"10 False ... 51.6 0.000000 \n",
"\n",
" BMI_Category * PhysicalHealth BMI_Category * SleepTime \\\n",
"Id \n",
"1 0.066667 0.625 \n",
"2 0.000000 1.875 \n",
"3 0.000000 1.250 \n",
"4 0.000000 0.625 \n",
"5 0.000000 1.875 \n",
"6 0.000000 0.750 \n",
"7 0.000000 1.500 \n",
"8 0.000000 0.750 \n",
"9 0.133333 1.000 \n",
"10 1.400000 1.125 \n",
"\n",
" HealthScore * MentalHealth HealthScore * PhysicalHealth \\\n",
"Id \n",
"1 2.17 1.446667 \n",
"2 0.00 0.000000 \n",
"3 3.65 0.000000 \n",
"4 1.52 0.000000 \n",
"5 0.00 0.000000 \n",
"6 0.00 0.000000 \n",
"7 3.60 0.000000 \n",
"8 13.80 0.000000 \n",
"9 2.14 1.426667 \n",
"10 0.00 8.026667 \n",
"\n",
" HealthScore * SleepTime MentalHealth * PhysicalHealth \\\n",
"Id \n",
"1 13.5625 0.006667 \n",
"2 14.6250 0.000000 \n",
"3 13.6875 0.000000 \n",
"4 14.2500 0.000000 \n",
"5 14.6250 0.000000 \n",
"6 8.5500 0.000000 \n",
"7 10.8000 0.000000 \n",
"8 5.1750 0.000000 \n",
"9 10.7000 0.006667 \n",
"10 6.4500 0.000000 \n",
"\n",
" MentalHealth * SleepTime PhysicalHealth * SleepTime \n",
"Id \n",
"1 0.062500 0.041667 \n",
"2 0.000000 0.000000 \n",
"3 0.104167 0.000000 \n",
"4 0.041667 0.000000 \n",
"5 0.000000 0.000000 \n",
"6 0.000000 0.000000 \n",
"7 0.083333 0.000000 \n",
"8 0.375000 0.000000 \n",
"9 0.050000 0.033333 \n",
"10 0.000000 0.175000 \n",
"\n",
"[10 rows x 68 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import featuretools as ft\n",
"\n",
"# Создание EntitySet\n",
"\n",
"df_testing = df_train_oversampled_normalized\n",
"# Создание уникального идентификатора для каждой строки\n",
"df_testing['Id'] = range(1, len(df_testing) + 1)\n",
"\n",
"es = ft.EntitySet(id='my-test-data')\n",
"es = es.add_dataframe(dataframe=df_testing, dataframe_name='my-name', index='Id')\n",
"\n",
"# Указываем, какие трансформации нужно применить\n",
"trans_primitives = ['multiply_numeric']\n",
"\n",
"# Генерация признаков с помощью глубокого синтеза признаков\n",
"feature_matrix, feature_defs = ft.dfs(\n",
" entityset=es, \n",
" target_dataframe_name='my-name', \n",
" max_depth=1,\n",
" trans_primitives=trans_primitives\n",
")\n",
"\n",
"# Выводим первые 10 строк сгенерированного набора признаков\n",
"feature_matrix.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Оценка качества каждого набора признаков\n",
"\n",
"**Предсказательная способность**: Способность набора признаков успешно прогнозировать целевую переменную. Это определяется через метрики, такие как RMSE, MAE, R², которые показывают, насколько хорошо модель использует признаки для достижения точных результатов. Для определения качества необходимо провести обучение модели на обучающей выборке и сравнить с оценкой прогнозирования на контрольной и тестовой выборках.\n",
"\n",
"**Скорость вычисления**: Время, необходимое для обработки данных и выполнения алгоритмов машинного обучения. Признаки должны быть вычисляемыми за разумный срок, чтобы обеспечить эффективность модели, особенно при работе с большими наборами данных. Для оценки качества необходимо провести измерение времени выполнения генерации признаков и обучения модели.\n",
"\n",
"**Надежность**: Устойчивость и воспроизводимость результатов при изменении входных данных. Надежные признаки должны давать схожие результаты независимо от случайных факторов или незначительных изменений в данных. Методы оценки: Кросс-валидация, анализ чувствительности модели к изменениям в данных.\n",
"\n",
"**Корреляция**: Степень взаимосвязи между признаками и целевой переменной, а также между самими признаками. Высокая корреляция с целевой переменной указывает на потенциальную предсказательную силу, тогда как высокая взаимосвязь между самими признаками может приводить к многоколлинеарности и снижению эффективности модели. Методы оценки: Анализ корреляционной матрицы признаков, удаление мультиколлинеарных признаков.\n",
"\n",
"**Цельность**: Не является производным от других признаков. Методы оценки: Проверка логической связи между признаками и целевой переменной, интерпретация результатов модели."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки: 156699\n",
"Размер контрольной выборки: 67157\n",
"Размер тестовой выборки: 95939\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/entityset/entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
" warnings.warn(\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/synthesis/deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/computational_backends/feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/computational_backends/feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Feature Importance:\n",
" feature importance\n",
"0 HeartDisease 0.851120\n",
"1 BMI 0.014203\n",
"9 Stroke_No 0.012600\n",
"2 PhysicalHealth 0.008628\n",
"12 DiffWalking_Yes 0.008111\n",
"11 DiffWalking_No 0.007721\n",
"36 Diabetic_Yes 0.007583\n",
"10 Stroke_Yes 0.007551\n",
"4 SleepTime 0.007525\n",
"43 GenHealth_Poor 0.006605\n",
"27 AgeCategory_80 or older 0.006269\n",
"34 Diabetic_No 0.005300\n",
"3 MentalHealth 0.005102\n",
"41 GenHealth_Fair 0.004277\n",
"48 KidneyDisease_Yes 0.003435\n",
"47 KidneyDisease_No 0.003086\n",
"13 Sex_Female 0.002607\n",
"26 AgeCategory_75-79 0.002567\n",
"25 AgeCategory_70-74 0.002462\n",
"14 Sex_Male 0.002457\n",
"6 Smoking_Yes 0.002127\n",
"5 Smoking_No 0.001934\n",
"42 GenHealth_Good 0.001787\n",
"44 GenHealth_Very good 0.001734\n",
"33 Race_White 0.001731\n",
"50 SkinCancer_Yes 0.001687\n",
"38 PhysicalActivity_No 0.001658\n",
"39 PhysicalActivity_Yes 0.001585\n",
"49 SkinCancer_No 0.001513\n",
"40 GenHealth_Excellent 0.001451\n",
"24 AgeCategory_65-69 0.001318\n",
"46 Asthma_Yes 0.001315\n",
"45 Asthma_No 0.001256\n",
"23 AgeCategory_60-64 0.001091\n",
"30 Race_Black 0.000885\n",
"22 AgeCategory_55-59 0.000853\n",
"31 Race_Hispanic 0.000825\n",
"21 AgeCategory_50-54 0.000715\n",
"32 Race_Other 0.000699\n",
"7 AlcoholDrinking_No 0.000560\n",
"8 AlcoholDrinking_Yes 0.000550\n",
"20 AgeCategory_45-49 0.000520\n",
"28 Race_American Indian/Alaskan Native 0.000503\n",
"35 Diabetic_No, borderline diabetes 0.000479\n",
"19 AgeCategory_40-44 0.000444\n",
"18 AgeCategory_35-39 0.000412\n",
"17 AgeCategory_30-34 0.000267\n",
"29 Race_Asian 0.000260\n",
"15 AgeCategory_18-24 0.000231\n",
"16 AgeCategory_25-29 0.000217\n",
"37 Diabetic_Yes (during pregnancy) 0.000184\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"import featuretools as ft\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Загрузка данных\n",
"df = pd.read_csv(\".//static//csv//heart_2020_cleaned.csv\")\n",
"\n",
"# Разделение на обучающую и тестовую выборки (например, 70% обучающая, 30% тестовая)\n",
"train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)\n",
"\n",
"# Разделение обучающей выборки на обучающую и контрольную (например, 70% обучающая, 30% контрольная)\n",
"train_df, val_df = train_test_split(train_df, test_size=0.3, random_state=42)\n",
"\n",
"# Вывод размеров выборок\n",
"print(\"Размер обучающей выборки:\", len(train_df))\n",
"print(\"Размер контрольной выборки:\", len(val_df))\n",
"print(\"Размер тестовой выборки:\", len(test_df))\n",
"\n",
"# Определение категориальных признаков\n",
"categorical_features = [\n",
" 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Race',\n",
" 'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer'\n",
"]\n",
"\n",
"# Применение one-hot encoding к обучающей выборке\n",
"train_df_encoded = pd.get_dummies(train_df, columns=categorical_features)\n",
"\n",
"# Применение one-hot encoding к контрольной выборке\n",
"val_df_encoded = pd.get_dummies(val_df, columns=categorical_features)\n",
"\n",
"# Применение one-hot encoding к тестовой выборке\n",
"test_df_encoded = pd.get_dummies(test_df, columns=categorical_features)\n",
"\n",
"# Определение сущностей\n",
"es = ft.EntitySet(id='heart_data')\n",
"es = es.add_dataframe(dataframe_name='heart', dataframe=train_df_encoded, index='id')\n",
"\n",
"# Генерация признаков\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='heart', max_depth=2)\n",
"\n",
"# Преобразование признаков для контрольной и тестовой выборок\n",
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_df_encoded.index)\n",
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_df_encoded.index)\n",
"\n",
"# Оценка важности признаков\n",
"X = feature_matrix\n",
"y = train_df_encoded['HeartDisease']\n",
"\n",
"# Разделение данных на обучающую и тестовую выборки\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Обучение модели\n",
"model = RandomForestClassifier(n_estimators=100, random_state=42)\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Получение важности признаков\n",
"importances = model.feature_importances_\n",
"feature_names = feature_matrix.columns\n",
"\n",
"# Сортировка признаков по важности\n",
"feature_importance = pd.DataFrame({'feature': feature_names, 'importance': importances})\n",
"feature_importance = feature_importance.sort_values(by='importance', ascending=False)\n",
"\n",
"print(\"Feature Importance:\")\n",
"print(feature_importance)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки: 15670\n",
"Размер контрольной выборки: 6716\n",
"Размер тестовой выборки: 9594\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/entityset/entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
" warnings.warn(\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/computational_backends/feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/computational_backends/feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 1.0\n",
"Precision: 1.0\n",
"Recall: 1.0\n",
"F1 Score: 1.0\n",
"ROC AUC: 1.0\n",
"Cross-validated Accuracy: 0.906126356094448\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABD8AAAIjCAYAAAAEDbCUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1gV1/bw8e+hl0NRREFEEMEuKkETRRFbwFijEbuiYm8YjYodG0Yllhg1NrDFEqPEiJ2IBU0sEayxoIheC0YFxAIq8/7hy/w8oQjGxJT1eZ557pmZPXuvOeTmuWfdvdfWKIqiIIQQQgghhBBCCPEvpfeuAxBCCCGEEEIIIYT4M0nyQwghhBBCCCGEEP9qkvwQQgghhBBCCCHEv5okP4QQQgghhBBCCPGvJskPIYQQQgghhBBC/KtJ8kMIIYQQQgghhBD/apL8EEIIIYQQQgghxL+aJD+EEEIIIYQQQgjxrybJDyGEEEIIIYQQQvyrSfJDCCGEEEIIIYQQ/2qS/BBCCCGEeAciIiLQaDS5HqNHj/5Txjx8+DCTJk0iJSXlT+n/j8j+Po4fP/6uQ3ljCxcuJCIi4l2HIYQQIhcG7zoAIYQQQoj/ssmTJ1OmTBmda1WqVPlTxjp8+DAhISEEBARgbW39p4zxX7Zw4UKKFStGQEDAuw5FCCHE70jyQwghhBDiHWratCmenp7vOow/5NGjR5ibm7/rMN6Zx48fY2Zm9q7DEEIIkQ9Z9iKEEEII8Te2Y8cO6tWrh7m5ORYWFjRr1oyzZ8/qtDl16hQBAQG4uLhgYmKCnZ0dPXv25N69e2qbSZMm8dlnnwFQpkwZdYlNYmIiiYmJaDSaXJdsaDQaJk2apNOPRqPh3LlzdOrUiSJFilC3bl31/po1a3jvvfcwNTWlaNGidOjQgevXr7/RuwcEBKDVaklKSqJ58+ZotVocHBz46quvADh9+jQNGzbE3NwcJycnvvnmG53ns5fSHDhwgL59+2JjY4OlpSXdunXjwYMHOcZbuHAhlStXxtjYmJIlSzJw4MAcS4R8fHyoUqUKJ06cwNvbGzMzM8aMGYOzszNnz55l//796nfr4+MDwP379xkxYgRVq1ZFq9ViaWlJ06ZNiY+P1+k7JiYGjUbDxo0bmTZtGqVKlcLExIRGjRpx+fLlHPH+/PPPfPTRRxQpUgRzc3Pc3d2ZN2+eTptff/2VTz75hKJFi2JiYoKnpydbt24t7J9CCCH+8WTmhxBCCCHEO5Samspvv/2mc61YsWIArF69mu7du+Pr68vnn3/O48ePWbRoEXXr1uXkyZM4OzsDsGfPHq5cuUKPHj2ws7Pj7NmzLFmyhLNnz/LTTz+h0Who06YNFy9eZN26dcyZM0cdw9bWlrt37xY67nbt2uHm5sb06dNRFAWAadOmMX78ePz9/QkMDOTu3bt8+eWXeHt7c/LkyTdaavPixQuaNm2Kt7c3M2fOZO3atQwaNAhzc3PGjh1L586dadOmDYsXL6Zbt27Url07xzKiQYMGYW1tzaRJk7hw4QKLFi3i2rVrarIBXiZ1QkJCaNy4Mf3791fbHTt2jNjYWAwNDdX+7t27R9OmTenQoQNdunShRIkS+Pj4MHjwYLRaLWPHjgWgRIkSAFy5coXIyEjatWtHmTJluHPnDl9//TX169fn3LlzlCxZUifeGTNmoKenx4gRI0hNTWXmzJl07tyZn3/+WW2zZ88emjdvjr29PUOHDsXOzo7z58+zbds2hg4dCsDZs2fx8vLCwcGB0aNHY25uzsaNG2ndujXfffcdH3/8caH/HkII8Y+lCCGEEEKIv1x4eLgC5HooiqI8fPhQsba2Vnr37q3z3O3btxUrKyud648fP87R/7p16xRAOXDggHpt1qxZCqBcvXpVp+3Vq1cVQAkPD8/RD6BMnDhRPZ84caICKB07dtRpl5iYqOjr6yvTpk3TuX769GnFwMAgx/W8vo9jx46p17p376
4AyvTp09VrDx48UExNTRWNRqOsX79evf7rr7/miDW7z/fee0/JzMxUr8+cOVMBlO+//15RFEVJTk5WjIyMlA8//FB58eKF2m7BggUKoKxYsUK9Vr9+fQVQFi9enOMdKleurNSvXz/H9adPn+r0qygvv3NjY2Nl8uTJ6rV9+/YpgFKxYkUlIyNDvT5v3jwFUE6fPq0oiqI8f/5cKVOmjOLk5KQ8ePBAp9+srCz1c6NGjZSqVasqT58+1blfp04dxc3NLUecQgjxbybLXoQQQggh3qGvvvqKPXv26Bzw8v/ZT0lJoWPHjvz222/qoa+vz/vvv8++ffvUPkxNTdXPT58+5bfffuODDz4A4JdffvlT4u7Xr5/O+ebNm8nKysLf318nXjs7O9zc3HTiLazAwED1s7W1NeXLl8fc3Bx/f3/1evny5bG2tubKlSs5nu/Tp4/OzI3+/ftjYGDA9u3bAdi7dy+ZmZkEBQWhp/d///O4d+/eWFpaEhUVpdOfsbExPXr0KHD8xsbGar8vXrzg3r17aLVaypcvn+vfp0ePHhgZGann9erVA1Df7eTJk1y9epWgoKAcs2myZ7Lcv3+fH3/8EX9/fx4+fKj+Pe7du4evry+XLl3if//7X4HfQQgh/ulk2YsQQgghxDtUq1atXAueXrp0CYCGDRvm+pylpaX6+f79+4SEhLB+/XqSk5N12qWmpr7FaP/P75eWXLp0CUVRcHNzy7X9q8mHwjAxMcHW1lbnmpWVFaVKlVJ/6L96PbdaHr+PSavVYm9vT2JiIgDXrl0DXiZQXmVkZISLi4t6P5uDg4NOcuJ1srKymDdvHgsXLuTq1au8ePFCvWdjY5OjfenSpXXOixQpAqC+W0JCApD/rkCXL19GURTGjx/P+PHjc22TnJyMg4NDgd9DCCH+yST5IYQQQgjxN5SVlQW8rPthZ2eX476Bwf/9zzh/f38OHz7MZ599RvXq1dFqtWRlZeHn56f2k5/fJxGyvfoj/fdenW2SHa9Go2HHjh3o6+vnaK/Val8bR25y6yu/68r/rz/yZ/r9u7/O9OnTGT9+PD179mTKlCkULVoUPT09goKCcv37vI13y+53xIgR+Pr65trG1dW1wP0JIcQ/nSQ/hBBCCCH+hsqWLQtA8eLFady4cZ7tHjx4QHR0NCEhIUyYMEG9nj1z5FV5JTmyZxb8fmeT3894eF28iqJQpkwZypUrV+Dn/gqXLl2iQYMG6nl6ejq3bt3io48+AsDJyQmACxcu4OLiorbLzMzk6tWr+X7/r8rr+920aRMNGjRg+fLlOtdTUlLUwrOFkf3PxpkzZ/KMLfs9DA0NCxy/EEL8m0nNDyGEEEKIvyFfX18sLS2ZPn06z549y3E/e4eW7FkCv58VMHfu3BzPmJubAzmTHJaWlhQrVowDBw7oXF+4cGGB423Tpg36+vqEhITkiEVRFJ1td/9qS5Ys0fkOFy1axPPnz2natCkAjRs3xsjIiPnz5+vEvnz5clJTU2nWrFmBxjE3N8/x3cLLv9Hvv5Nvv/32jWtueHh4UKZMGebOnZtjvOxxihcvjo+PD19//TW3bt3K0ceb7PAjhBD/ZDLzQwghhBDib8jS0pJFixbRtWtXPDw86NChA7a2tiQlJREVFYWXlxcLFizA0tJS3Qb22bNnODg4sHv3bq5evZqjz/feew+AsWPH0qFDBwwNDWnRogXm5uYEBgYyY8YMAgMD8fT05MCBA1y8eLHA8ZYtW5apU6cSHBxMYmIirVu3xsLCgqtXr7Jlyxb69OnDiBEj3tr3UxiZmZk0atQIf39/Lly4wMKFC6lbty4tW7YEXm73GxwcTEhICH5+frRs2VJtV7NmTbp06VKgcd577z0WLVrE1KlTcXV1pXjx4jRs2JDmzZszefJkevToQZ06dTh9+jRr167VmWVSGHp6eixatIgWLVpQvXp1evTogb29Pb/++itnz55l165dwMtiunXr1qVq1ar07t0bFxcX7t
y5w5EjR7hx4wbx8fFvNL4QQvwTSfJDCCGEEOJvqlOnTpQsWZIZM2Ywa9YsMjIycHBwoF69ejq7jXzzzTcMHjyYr77
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train Accuracy: 0.9994894703254626\n",
"Train Precision: 0.9992816091954023\n",
"Train Recall: 0.9949928469241774\n",
"Train F1 Score: 0.9971326164874552\n",
"Train ROC AUC: 0.9974613898298016\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA04AAAIjCAYAAAA0vUuxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB7UklEQVR4nO3dd3xO9///8eeVRBIjMWoTYu+dysdWFKVaSmvULFUl9qbEqL0J9bGqWlWKtmqVKlVFjaAUib23SiRGJDm/P/pzvp+rMXKR5GQ87rfbdbu5Xudc53pecUReeb/P+9gMwzAEAAAAAHgqJ6sDAAAAAEBiR+MEAAAAAM9B4wQAAAAAz0HjBAAAAADPQeMEAAAAAM9B4wQAAAAAz0HjBAAAAADPQeMEAAAAAM9B4wQAAAAAz0HjBACJlM1m04gRI6yOYbmaNWuqZs2a5vOzZ8/KZrNp8eLFlmX6t39nBF8TAMkPjROAFGHOnDmy2Wzy9fV94WNcvnxZI0aM0MGDB+MuWCK3bds22Ww285EqVSrlz59fbdu21enTp62O55CdO3dqxIgRunPnjmUZvL299eabbz5x2+Ov9cqVKxM41T+edX63b9/e7jxIly6d8ufPr2bNmmnVqlWKjo5O+MAAkMBcrA4AAAlh6dKl8vb21p49e3Ty5EkVLFjQ4WNcvnxZI0eOlLe3t8qWLRv3IROxHj166NVXX9WjR48UGBioefPmad26dTp8+LBy5syZoFny5s2r+/fvK1WqVA69bufOnRo5cqTat2+vDBkyxE+4JOx557ebm5sWLFggSbp//77OnTunH3/8Uc2aNVPNmjX1ww8/yNPT09x/06ZNCRUdABIEI04Akr0zZ85o586dmjp1qrJkyaKlS5daHSnJqVatmlq3bq0OHTpo1qxZmjx5sm7fvq0vvvjiqa8JDw+Plyw2m03u7u5ydnaOl+OnNJGRkYqIiHjufi4uLmrdurVat26tDz/8UJ9++qkOHTqkcePGadu2bfrwww/t9nd1dZWrq2t8xQaABEfjBCDZW7p0qTJmzKiGDRuqWbNmT22c7ty5o969e8vb21tubm7KnTu32rZtq5s3b2rbtm169dVXJUkdOnQwpyw9vs7G29tb7du3j3HMf1/nERERoeHDh6tChQpKnz690qZNq2rVqmnr1q0Of65r167JxcVFI0eOjLEtKChINptNAQEBkqRHjx5p5MiRKlSokNzd3fXKK6+oatWq2rx5s8PvK0m1atWS9E9TKkkjRoyQzWbT0aNH1apVK2XMmFFVq1Y19//qq69UoUIFpU6dWpkyZVKLFi104cKFGMedN2+eChQooNSpU6tixYr67bffYuzztGucjh8/rvfee09ZsmRR6tSpVaRIEQ0dOtTM179/f0lSvnz5zL+/s2fPxkvGuHTp0iV98MEHypYtm9zc3FSiRAktWrTIbp/YnlePv3aTJ0/W9OnTVaBAAbm5uWnOnDnPPL+fZdCgQapbt66+/fZbBQcHm/UnXeM0a9YslShRQmnSpFHGjBnl4+Ojr7/+Ot4+ryR98803qlChgjw8POTp6alSpUppxowZdvvcuXNHvXr1kpeXl9zc3FSwYEFNmDCBKYgA7DBVD0Cyt3TpUr3zzjtydXVVy5Yt9dlnn2nv3r3mD4qSFBYWpmrVqunYsWP64IMPVL58ed28eVNr1qzRxYsXVaxYMY0aNUrDhw9X586dVa1aNUlS5cqVHcoSGhqqBQsWqGXLlvrwww919+5dLVy4UPXq1dOePXscmgKYLVs21ahRQytWrJC/v7/dtuXLl8vZ2VnvvvuupH8ah3HjxqlTp06qWLGiQkNDtW/fPgUGBur111936DNI0qlTpyRJr7zyil393XffVaFChTR27FgZhiFJGjNmjIYNG6b33ntPnTp10o0bNzRr1ixVr15dBw4cMKfNLVy4UB999JEqV66sXr166fTp03rrrbeUKVMmeXl5PTPPn3
/+qWrVqilVqlTq3LmzvL29derUKf34448aM2aM3nnnHQUHB2vZsmWaNm2aMmfOLEnKkiVLgmV87NGjR7p582aMekhISIzatWvX9J///Ec2m01+fn7KkiWLNmzYoI4dOyo0NFS9evWS5Ph59fnnn+vBgwfq3Lmz3Nzc1KRJE929e/eFz+82bdpo06ZN2rx5swoXLvzEfebPn68ePXqoWbNm6tmzpx48eKA///xTf/zxh1q1ahUvn3fz5s1q2bKlateurQkTJkiSjh07pt9//109e/aUJN27d081atTQpUuX9NFHHylPnjzauXOnBg8erCtXrmj69Omx+hoASAEMAEjG9u3bZ0gyNm/ebBiGYURHRxu5c+c2evbsabff8OHDDUnG6tWrYxwjOjraMAzD2Lt3ryHJ+Pzzz2PskzdvXqNdu3Yx6jVq1DBq1KhhPo+MjDQePnxot8/ff/9tZMuWzfjggw/s6pIMf3//Z36+//73v4Yk4/Dhw3b14sWLG7Vq1TKflylTxmjYsOEzj/UkW7duNSQZixYtMm7cuGFcvnzZWLduneHt7W3YbDZj7969hmEYhr+/vyHJaNmypd3rz549azg7Oxtjxoyxqx8+fNhwcXEx6xEREUbWrFmNsmXL2n195s2bZ0iy+xqeOXMmxt9D9erVDQ8PD+PcuXN27/P4784wDGPSpEmGJOPMmTPxnvFp8ubNa0h65uPbb7819+/YsaORI0cO4+bNm3bHadGihZE+fXrj3r17hmHE/rx6/LXz9PQ0rl+/brf/s87vdu3aGWnTpn3q5zpw4IAhyejdu7dZ+/e5//bbbxslSpR4+hcnHj5vz549DU9PTyMyMvKp7zl69Ggjbdq0RnBwsF190KBBhrOzs3H+/PlnZgaQcjBVD0CytnTpUmXLlk2vvfaapH+uj2nevLm++eYbRUVFmfutWrVKZcqUUZMmTWIcw2azxVkeZ2dn87qP6Oho3b59W5GRkfLx8VFgYKDDx3vnnXfk4uKi5cuXm7UjR47o6NGjat68uVnLkCGD/vrrL504ceKFcn/wwQfKkiWLcubMqYYNGyo8PFxffPGFfHx87Pbr0qWL3fPVq1crOjpa7733nm7evGk+smfPrkKFCplTq/bt26fr16+rS5cudtfFtG/fXunTp39mths3bmj79u364IMPlCdPHrttsfm7S4iM/8vX11ebN2+O8Zg8ebLdfoZhaNWqVWrUqJEMw7DLVq9ePYWEhJjnjKPnVdOmTc3RtriQLl06SdLdu3efuk+GDBl08eJF7d2794nb4+PzZsiQQeHh4c+ckvrtt9+qWrVqypgxo9171qlTR1FRUdq+fbvDXw8AyRNT9QAkW1FRUfrmm2/02muvmdfiSP/84DplyhRt2bJFdevWlfTP1LOmTZsmSK4vvvhCU6ZM0fHjx/Xo0SOzni9fPoePlTlzZtWuXVsrVqzQ6NGjJf0zTc/FxUXvvPOOud+oUaP09ttvq3DhwipZsqTq16+vNm3aqHTp0rF6n+HDh6tatWpydnZW5syZVaxYMbm4xPwv5N+f4cSJEzIMQ4UKFXricR+vjHfu3DlJirHf4+XPn+XxsuglS5aM1Wf5t4TI+L8yZ86sOnXqxKj/++t548YN3blzR/PmzdO8efOeeKzr16+bf3bkvHqRc+1ZwsLCJEkeHh5P3WfgwIH6+eefVbFiRRUsWFB169ZVq1atVKVKFUnx83m7du2qFStW6I033lCuXLlUt25dvffee6pfv765z4kTJ/Tnn38+tZH83/cEkLLROAFItn755RdduXJF33zzjb755psY25cuXWo2Ti/raSMbUVFRdqu/ffXVV2rfvr0aN26s/v37K2vWrHJ2dta4cePM64Yc1aJFC3Xo0EEHDx5U2bJltWLFCtWuXdu8jkeSqlevrlOnTumHH37Qpk2btGDBAk2bNk1z585Vp06dnvsepUqVeuIP+/+WOnVqu+fR0dGy2WzasGHDE1fBezxSYa
XEmvHxwgStW7dWu3btnrjP48bX0fPq339PL+vIkSOS9Mxl/osVK6agoCCtXbtWGzdu1KpVqzRnzhwNHz5cI0eOjJf
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score\n",
"from sklearn.model_selection import cross_val_score\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import featuretools as ft\n",
"\n",
"# Загрузка данных\n",
"df = pd.read_csv(\".//static//csv//heart_2020_cleaned.csv\")\n",
"\n",
"# Уменьшение размера выборки для ускорения работы (опционально)\n",
"df = df.sample(frac=0.1, random_state=42)\n",
"\n",
"# Разделение на обучающую и тестовую выборки (например, 70% обучающая, 30% тестовая)\n",
"train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)\n",
"\n",
"# Разделение обучающей выборки на обучающую и контрольную (например, 70% обучающая, 30% контрольная)\n",
"train_df, val_df = train_test_split(train_df, test_size=0.3, random_state=42)\n",
"\n",
"# Вывод размеров выборок\n",
"print(\"Размер обучающей выборки:\", len(train_df))\n",
"print(\"Размер контрольной выборки:\", len(val_df))\n",
"print(\"Размер тестовой выборки:\", len(test_df))\n",
"\n",
"# Определение категориальных признаков\n",
"categorical_features = [\n",
" 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Race',\n",
" 'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer'\n",
"]\n",
"\n",
"# Применение one-hot encoding к обучающей выборке\n",
"train_df_encoded = pd.get_dummies(train_df, columns=categorical_features)\n",
"\n",
"# Применение one-hot encoding к контрольной выборке\n",
"val_df_encoded = pd.get_dummies(val_df, columns=categorical_features)\n",
"\n",
"# Применение one-hot encoding к тестовой выборке\n",
"test_df_encoded = pd.get_dummies(test_df, columns=categorical_features)\n",
"\n",
"# Определение сущностей\n",
"es = ft.EntitySet(id='heart_data')\n",
"es = es.add_dataframe(dataframe_name='heart', dataframe=train_df_encoded, index='id')\n",
"\n",
"# Генерация признаков с уменьшенной глубиной\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='heart', max_depth=1)\n",
"\n",
"# Преобразование признаков для контрольной и тестовой выборок\n",
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_df_encoded.index)\n",
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_df_encoded.index)\n",
"\n",
"# Удаление строк с NaN\n",
"feature_matrix = feature_matrix.dropna()\n",
"val_feature_matrix = val_feature_matrix.dropna()\n",
"test_feature_matrix = test_feature_matrix.dropna()\n",
"\n",
"# Разделение данных на обучающую и тестовую выборки\n",
"X_train = feature_matrix.drop('HeartDisease', axis=1)\n",
"y_train = feature_matrix['HeartDisease']\n",
"X_val = val_feature_matrix.drop('HeartDisease', axis=1)\n",
"y_val = val_feature_matrix['HeartDisease']\n",
"X_test = test_feature_matrix.drop('HeartDisease', axis=1)\n",
"y_test = test_feature_matrix['HeartDisease']\n",
"\n",
"# Выбор модели\n",
"model = RandomForestClassifier(random_state=42)\n",
"\n",
"# Обучение модели\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Предсказание и оценка\n",
"y_pred = model.predict(X_test)\n",
"\n",
"accuracy = accuracy_score(y_test, y_pred)\n",
"precision = precision_score(y_test, y_pred)\n",
"recall = recall_score(y_test, y_pred)\n",
"f1 = f1_score(y_test, y_pred)\n",
"roc_auc = roc_auc_score(y_test, y_pred)\n",
"\n",
"print(f\"Accuracy: {accuracy}\")\n",
"print(f\"Precision: {precision}\")\n",
"print(f\"Recall: {recall}\")\n",
"print(f\"F1 Score: {f1}\")\n",
"print(f\"ROC AUC: {roc_auc}\")\n",
"\n",
"# Кросс-валидация\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')\n",
"accuracy_cv = scores.mean()\n",
"print(f\"Cross-validated Accuracy: {accuracy_cv}\")\n",
"\n",
"# Анализ важности признаков\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})\n",
"importance_df = importance_df.sort_values(by='Importance', ascending=False)\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"sns.barplot(x='Importance', y='Feature', data=importance_df)\n",
"plt.title('Feature Importance')\n",
"plt.show()\n",
"\n",
"# Проверка на переобучение\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"accuracy_train = accuracy_score(y_train, y_train_pred)\n",
"precision_train = precision_score(y_train, y_train_pred)\n",
"recall_train = recall_score(y_train, y_train_pred)\n",
"f1_train = f1_score(y_train, y_train_pred)\n",
"roc_auc_train = roc_auc_score(y_train, y_train_pred)\n",
"\n",
"print(f\"Train Accuracy: {accuracy_train}\")\n",
"print(f\"Train Precision: {precision_train}\")\n",
"print(f\"Train Recall: {recall_train}\")\n",
"print(f\"Train F1 Score: {f1_train}\")\n",
"print(f\"Train ROC AUC: {roc_auc_train}\")\n",
"\n",
"# Визуализация результатов\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual HeartDisease')\n",
"plt.ylabel('Predicted HeartDisease')\n",
"plt.title('Actual vs Predicted HeartDisease')\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}