{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab Assignment No. 3. Feature Engineering.\n",
"\n",
"## Dataset: \"Data for analyzing and predicting heart attacks\".\n",
"\n",
"[**Link**](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)\n",
"\n",
"### Dataset description\n",
"\n",
"**Problem domain**: The dataset belongs to medical statistics and is aimed at analyzing factors associated with the risk of a heart attack. This matters for forecasting and for developing strategies to prevent cardiovascular disease.\n",
"\n",
"**Relevance**: Cardiovascular diseases are one of the leading causes of death worldwide. Analyzing data on lifestyle, health status, and demographic characteristics makes it possible to identify the key predictors that influence the development of cardiovascular disease. The dataset supports this kind of analysis and can be useful for building predictive models aimed at reducing risk and enabling timely diagnosis.\n",
"\n",
"**Objects of observation**: Each record describes one person, including their health status, lifestyle, demographic characteristics, and the presence of certain diseases. The objects of observation are individual patients.\n",
"\n",
"**Object attributes:**\n",
"- `HeartDisease`: whether the person has had a heart attack (Yes/No) (target variable).\n",
"- `BMI`: body mass index, a numeric value.\n",
"- `Smoking`: smoking (Yes/No).\n",
"- `AlcoholDrinking`: alcohol consumption (Yes/No).\n",
"- `Stroke`: history of stroke (Yes/No).\n",
"- `PhysicalHealth`: number of days per month when physical health was not good.\n",
"- `MentalHealth`: number of days per month when mental health was not good.\n",
"- `DiffWalking`: difficulty walking (Yes/No).\n",
"- `Sex`: sex (Male/Female).\n",
"- `AgeCategory`: age category (for example, 55-59, 80 or older).\n",
"- `Race`: race (for example, White, Black).\n",
"- `Diabetic`: diabetes status (Yes/No/No, borderline diabetes).\n",
"- `PhysicalActivity`: physical activity (Yes/No).\n",
"- `GenHealth`: general health (from Excellent to Poor).\n",
"- `SleepTime`: average number of hours of sleep per day.\n",
"- `Asthma`: asthma (Yes/No).\n",
"- `KidneyDisease`: kidney disease (Yes/No).\n",
"- `SkinCancer`: skin cancer (Yes/No).\n",
"\n",
"### Business goals and the corresponding technical project goals\n",
"\n",
"**Business goal 1: Develop personalized programs for preventing cardiovascular disease**\n",
"\n",
"Reducing the number of cardiovascular disease cases in the at-risk group through prevention programs lowers healthcare costs (insurance payouts, treatment). Companies that provide insurance or medical services can minimize losses and increase revenue by identifying client risk early.\n",
"\n",
"*Technical project goals*:\n",
"1. Build a predictive machine learning model that forecasts the risk of a heart attack from the available data.\n",
"2. Develop an algorithm that classifies patients into risk groups based on their lifestyle, health status, and demographic characteristics.\n",
"3. Identify the most significant risk factors in order to recommend targeted lifestyle changes.\n",
"\n",
"**Business goal 2: Create a commercial product for assessing the health of company employees**\n",
"\n",
"The product can be offered to corporate clients to assess the health of their employees and reduce the risk of long-term sick leave, which improves productivity and lowers employers' insurance payouts. The service can be sold as a subscription or as a one-time assessment.\n",
"\n",
"*Technical project goals*:\n",
"1. Develop a tool for visualizing employee health based on an analysis of key factors from the dataset (for example, smoking, body mass index, physical activity).\n",
"2. Train and optimize a model that predicts the probability of a heart attack depending on the corporate context (employee age groups, stress factors).\n",
"3. Integrate predictive analytics into a product that delivers personalized health reports and recommendations.\n",
"\n",
"**Business goal**: More accurate prediction of heart attack risk helps insurers and healthcare providers identify at-risk clients early and target prevention programs, and helps employers assess the health of their staff.\n",
"\n",
"**Technical goal**: Predict whether a patient has had a heart attack (binary classification).\n",
"\n",
"**Input data**: Records about patients, including all features (body mass index, smoking, alcohol consumption, physical and mental health, age category, sleep time, and others).\n",
"\n",
"**Target variable**: The `HeartDisease` column, which indicates whether the patient has had a heart attack (`Yes` or `No`)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>HeartDisease</th>\n",
"      <th>BMI</th>\n",
"      <th>Smoking</th>\n",
"      <th>AlcoholDrinking</th>\n",
"      <th>Stroke</th>\n",
"      <th>PhysicalHealth</th>\n",
"      <th>MentalHealth</th>\n",
"      <th>DiffWalking</th>\n",
"      <th>Sex</th>\n",
"      <th>AgeCategory</th>\n",
"      <th>Race</th>\n",
"      <th>Diabetic</th>\n",
"      <th>PhysicalActivity</th>\n",
"      <th>GenHealth</th>\n",
"      <th>SleepTime</th>\n",
"      <th>Asthma</th>\n",
"      <th>KidneyDisease</th>\n",
"      <th>SkinCancer</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>No</td>\n",
"      <td>16.60</td>\n",
"      <td>Yes</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>3.0</td>\n",
"      <td>30.0</td>\n",
"      <td>No</td>\n",
"      <td>Female</td>\n",
"      <td>55-59</td>\n",
"      <td>White</td>\n",
"      <td>Yes</td>\n",
"      <td>Yes</td>\n",
"      <td>Very good</td>\n",
"      <td>5.0</td>\n",
"      <td>Yes</td>\n",
"      <td>No</td>\n",
"      <td>Yes</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>No</td>\n",
"      <td>20.34</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>Yes</td>\n",
"      <td>0.0</td>\n",
"      <td>0.0</td>\n",
"      <td>No</td>\n",
"      <td>Female</td>\n",
"      <td>80 or older</td>\n",
"      <td>White</td>\n",
"      <td>No</td>\n",
"      <td>Yes</td>\n",
"      <td>Very good</td>\n",
"      <td>7.0</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>No</td>\n",
"      <td>26.58</td>\n",
"      <td>Yes</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>20.0</td>\n",
"      <td>30.0</td>\n",
"      <td>No</td>\n",
"      <td>Male</td>\n",
"      <td>65-69</td>\n",
"      <td>White</td>\n",
"      <td>Yes</td>\n",
"      <td>Yes</td>\n",
"      <td>Fair</td>\n",
"      <td>8.0</td>\n",
"      <td>Yes</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>No</td>\n",
"      <td>24.21</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>0.0</td>\n",
"      <td>0.0</td>\n",
"      <td>No</td>\n",
"      <td>Female</td>\n",
"      <td>75-79</td>\n",
"      <td>White</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>Good</td>\n",
"      <td>6.0</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>Yes</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>No</td>\n",
"      <td>23.71</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>28.0</td>\n",
"      <td>0.0</td>\n",
"      <td>Yes</td>\n",
"      <td>Female</td>\n",
"      <td>40-44</td>\n",
"      <td>White</td>\n",
"      <td>No</td>\n",
"      <td>Yes</td>\n",
"      <td>Very good</td>\n",
"      <td>8.0</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"      <td>No</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  HeartDisease    BMI Smoking AlcoholDrinking Stroke  PhysicalHealth  \\\n",
"0           No  16.60     Yes              No     No             3.0   \n",
"1           No  20.34      No              No    Yes             0.0   \n",
"2           No  26.58     Yes              No     No            20.0   \n",
"3           No  24.21      No              No     No             0.0   \n",
"4           No  23.71      No              No     No            28.0   \n",
"\n",
"   MentalHealth DiffWalking    Sex AgeCategory  Race Diabetic  \\\n",
"0          30.0          No Female       55-59 White      Yes   \n",
"1           0.0          No Female 80 or older White       No   \n",
"2          30.0          No   Male       65-69 White      Yes   \n",
"3           0.0          No Female       75-79 White       No   \n",
"4           0.0         Yes Female       40-44 White       No   \n",
"\n",
"  PhysicalActivity GenHealth  SleepTime Asthma KidneyDisease SkinCancer  \n",
"0              Yes Very good        5.0    Yes            No        Yes  \n",
"1              Yes Very good        7.0     No            No         No  \n",
"2              Yes      Fair        8.0    Yes            No         No  \n",
"3               No      Good        6.0     No            No        Yes  \n",
"4              Yes Very good        8.0     No            No         No  "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"./static/csv/heart_2020_cleaned.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling missing data\n",
"\n",
"First, check whether the dataset contains any missing feature values:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"HeartDisease        0\n",
"BMI                 0\n",
"Smoking             0\n",
"AlcoholDrinking     0\n",
"Stroke              0\n",
"PhysicalHealth      0\n",
"MentalHealth        0\n",
"DiffWalking         0\n",
"Sex                 0\n",
"AgeCategory         0\n",
"Race                0\n",
"Diabetic            0\n",
"PhysicalActivity    0\n",
"GenHealth           0\n",
"SleepTime           0\n",
"Asthma              0\n",
"KidneyDisease       0\n",
"SkinCancer          0\n",
"dtype: int64\n",
"\n",
"HeartDisease        False\n",
"BMI                 False\n",
"Smoking             False\n",
"AlcoholDrinking     False\n",
"Stroke              False\n",
"PhysicalHealth      False\n",
"MentalHealth        False\n",
"DiffWalking         False\n",
"Sex                 False\n",
"AgeCategory         False\n",
"Race                False\n",
"Diabetic            False\n",
"PhysicalActivity    False\n",
"GenHealth           False\n",
"SleepTime           False\n",
"Asthma              False\n",
"KidneyDisease       False\n",
"SkinCancer          False\n",
"dtype: bool\n",
"\n"
]
}
],
"source": [
"# Number of missing values per feature\n",
"print(df.isnull().sum())\n",
"\n",
"print()\n",
"\n",
"# Whether each feature has any missing values\n",
"print(df.isnull().any())\n",
"\n",
"print()\n",
"\n",
"# Percentage of missing values per feature (printed only for features that have any)\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f\"{i} percentage of missing values: {null_rate:.2f}%\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**No** missing data was found in the dataset."
]
},
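{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a compact alternative to the per-column loop above, the share of missing values can be computed in one vectorized expression. This is only a sketch on a small made-up frame: the name `toy` and its values are hypothetical and are not taken from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame, used only for illustration\n",
"toy = pd.DataFrame({\n",
"    \"BMI\": [16.6, np.nan, 26.58, 24.21],\n",
"    \"SleepTime\": [5.0, 7.0, 8.0, 6.0],\n",
"})\n",
"\n",
"# isnull() yields booleans; their mean is the fraction of missing values per column\n",
"null_rate = toy.isnull().mean() * 100\n",
"\n",
"# Show only the columns that actually contain missing values\n",
"print(null_rate[null_rate > 0].round(2))"
]
},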
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling noisy data\n",
"\n",
"**Noise** is the presence of random errors or variations in the data that can make it harder to detect the true patterns. Noise can arise from measurement errors, incorrect records, or other factors.\n",
"\n",
"**Outliers** are values that differ substantially from the other observations in the dataset. Outliers may indicate errors in the data or rare but important events, and their presence can distort statistical methods of analysis.\n",
"\n",
"The code below flags outliers in the numeric columns with the interquartile range rule (values more than 1.5 × IQR below Q1 or above Q3) and plots them. Where outliers need to be handled, values below the lower bound are replaced with the lower bound, and values above the upper bound are replaced with the upper bound:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking columns for outliers:\n",
"Column BMI:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 10396\n",
"\tMinimum value: 12.02\n",
"\tMaximum value: 94.85\n",
"\t1st quartile (Q1): 24.03\n",
"\t3rd quartile (Q3): 31.42\n",
"\n",
"Column PhysicalHealth:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 47146\n",
"\tMinimum value: 0.0\n",
"\tMaximum value: 30.0\n",
"\t1st quartile (Q1): 0.0\n",
"\t3rd quartile (Q3): 2.0\n",
"\n",
"Column MentalHealth:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 51576\n",
"\tMinimum value: 0.0\n",
"\tMaximum value: 30.0\n",
"\t1st quartile (Q1): 0.0\n",
"\t3rd quartile (Q3): 3.0\n",
"\n",
"Column SleepTime:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 4543\n",
"\tMinimum value: 1.0\n",
"\tMaximum value: 24.0\n",
"\t1st quartile (Q1): 6.0\n",
"\t3rd quartile (Q3): 8.0\n",
"\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1500x1000 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from math import ceil\n",
"\n",
"# Check a DataFrame for outliers\n",
"def check_outliers(dataframe, columns):\n",
"    for column in columns:\n",
"        if not pd.api.types.is_numeric_dtype(dataframe[column]): # Skip non-numeric columns\n",
"            continue\n",
"\n",
"        Q1 = dataframe[column].quantile(0.25) # 1st quartile (25%)\n",
"        Q3 = dataframe[column].quantile(0.75) # 3rd quartile (75%)\n",
"        IQR = Q3 - Q1 # Interquartile range\n",
"\n",
"        # Bounds beyond which a value counts as an outlier\n",
"        lower_bound = Q1 - 1.5 * IQR # Lower bound\n",
"        upper_bound = Q3 + 1.5 * IQR # Upper bound\n",
"\n",
"        # Count the outliers\n",
"        outliers = dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)]\n",
"        outlier_count = outliers.shape[0]\n",
"\n",
"        print(f\"Column {column}:\")\n",
"        print(f\"\\tHas outliers: {'Yes' if outlier_count > 0 else 'No'}\")\n",
"        print(f\"\\tNumber of outliers: {outlier_count}\")\n",
"        print(f\"\\tMinimum value: {dataframe[column].min()}\")\n",
"        print(f\"\\tMaximum value: {dataframe[column].max()}\")\n",
"        print(f\"\\t1st quartile (Q1): {Q1}\")\n",
"        print(f\"\\t3rd quartile (Q3): {Q3}\\n\")\n",
"\n",
"# Visualize outliers\n",
"def visualize_outliers(dataframe, columns):\n",
"    # Box plots, three per row\n",
"    plt.figure(figsize=(15, 10))\n",
"    rows = ceil(len(columns) / 3)\n",
"    for index, column in enumerate(columns, 1):\n",
"        plt.subplot(rows, 3, index)\n",
"        plt.boxplot(dataframe[column], vert=True, patch_artist=True)\n",
"        plt.title(f\"Box plot for \\\"{column}\\\"\")\n",
"        plt.xlabel(column)\n",
"\n",
"    # Show the plots\n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"\n",
"\n",
"# Numeric columns of the DataFrame\n",
"numeric_columns = [\n",
"    'BMI',\n",
"    'PhysicalHealth',\n",
"    'MentalHealth',\n",
"    'SleepTime'\n",
"]\n",
"\n",
"# Check the columns for outliers\n",
"print('Checking columns for outliers:')\n",
"check_outliers(df, numeric_columns)\n",
"visualize_outliers(df, numeric_columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `BMI` and `SleepTime` features contain enough outliers (10,396 and 4,543 values) that they are worth **handling**. The numeric features `PhysicalHealth` and `MentalHealth` have even more flagged values (47,146 and 51,576), but these observations make up a large share of the dataset and the range of both features is small (0 to 30 days), so discarding this much important health information could **hurt the ability to predict a heart attack**.\n",
"\n",
"To handle the outliers in `BMI` and `SleepTime`, we clip values that deviate too far by **replacing them with the corresponding bound**:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking columns for outliers after handling them:\n",
"Column BMI:\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 12.945\n",
"\tMaximum value: 42.505\n",
"\t1st quartile (Q1): 24.03\n",
"\t3rd quartile (Q3): 31.42\n",
"\n",
"Column SleepTime:\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 3.0\n",
"\tMaximum value: 11.0\n",
"\t1st quartile (Q1): 6.0\n",
"\t3rd quartile (Q3): 8.0\n",
"\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Handle outliers in a DataFrame by clipping them to the IQR bounds\n",
"def remove_outliers(dataframe, columns):\n",
"    for column in columns:\n",
"        if not pd.api.types.is_numeric_dtype(dataframe[column]): # Skip non-numeric columns\n",
"            continue\n",
"\n",
"        Q1 = dataframe[column].quantile(0.25) # 1st quartile (25%)\n",
"        Q3 = dataframe[column].quantile(0.75) # 3rd quartile (75%)\n",
"        IQR = Q3 - Q1 # Interquartile range\n",
"\n",
"        # Bounds beyond which a value counts as an outlier\n",
"        lower_bound = Q1 - 1.5 * IQR # Lower bound\n",
"        upper_bound = Q3 + 1.5 * IQR # Upper bound\n",
"\n",
"        # Handle the outliers:\n",
"        # values below the lower bound are replaced with the lower bound,\n",
"        # values above the upper bound are replaced with the upper bound\n",
"        dataframe[column] = dataframe[column].clip(lower=lower_bound, upper=upper_bound)\n",
"\n",
"    return dataframe\n",
"\n",
"# Columns to fix\n",
"columns_to_fix = [\n",
"    'BMI',\n",
"    'SleepTime'\n",
"]\n",
"\n",
"# Handle the outliers\n",
"df = remove_outliers(df, columns_to_fix)\n",
"\n",
"# Check the columns for outliers again\n",
"print('Checking columns for outliers after handling them:')\n",
"check_outliers(df, columns_to_fix)\n",
"visualize_outliers(df, columns_to_fix)"
]
},
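{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the clipping rule concrete, here is a minimal sketch on a small made-up series (the values are hypothetical and are not taken from the dataset). It computes the same 1.5 × IQR bounds and applies them with `Series.clip`, so values outside the bounds are pulled back to the nearest bound while everything else is left unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical sleep durations with one very low and one very high value\n",
"s = pd.Series([1.0, 6.0, 7.0, 7.0, 8.0, 8.0, 24.0])\n",
"\n",
"q1, q3 = s.quantile(0.25), s.quantile(0.75)\n",
"iqr = q3 - q1\n",
"lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr\n",
"\n",
"# Values outside [lower, upper] are replaced by the nearest bound\n",
"clipped = s.clip(lower=lower, upper=upper)\n",
"\n",
"print(\"bounds:\", lower, upper)\n",
"print(\"before:\", s.tolist())\n",
"print(\"after: \", clipped.tolist())"
]
},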
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting the dataset into samples\n",
"\n",
"Split the data into 3 groups:\n",
"1. *Training* set (60%).\n",
"2. *Validation* set (20%).\n",
"3. *Test* set (20%)."
]
},
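{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to obtain a three-way split that preserves the class proportions of `HeartDisease` is two consecutive calls to scikit-learn's `train_test_split` with the `stratify` argument. This is a sketch under assumptions: the toy frame, the variable names, the 60% / 20% / 20% proportions (chosen to match the sample sizes reported in the output that follows), and `random_state=42` are illustrative and are not taken from the notebook's own splitting code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical toy data with an imbalanced binary target\n",
"toy = pd.DataFrame({\n",
"    \"BMI\": [20.0 + i for i in range(100)],\n",
"    \"HeartDisease\": [\"Yes\"] * 10 + [\"No\"] * 90,\n",
"})\n",
"\n",
"# First split off 60% for training, then halve the remainder (20% / 20%)\n",
"train, rest = train_test_split(\n",
"    toy, train_size=0.6, stratify=toy[\"HeartDisease\"], random_state=42\n",
")\n",
"val, test = train_test_split(\n",
"    rest, test_size=0.5, stratify=rest[\"HeartDisease\"], random_state=42\n",
")\n",
"\n",
"# Each part should keep roughly the same share of \"Yes\" and \"No\"\n",
"for name, part in [(\"train\", train), (\"validation\", val), (\"test\", test)]:\n",
"    shares = part[\"HeartDisease\"].value_counts(normalize=True).round(2).to_dict()\n",
"    print(name, part.shape, shares)"
]
},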
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Class balance check for the subsets:\n",
"Training set:  (191877, 18)\n",
"Class distribution in column \"HeartDisease\":\n",
" HeartDisease\n",
"No     175453\n",
"Yes     16424\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 91.44%\n",
"Share of class \"Yes\": 8.56%\n",
"\n",
"Validation set:  (63959, 18)\n",
"Class distribution in column \"HeartDisease\":\n",
" HeartDisease\n",
"No     58484\n",
"Yes     5475\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 91.44%\n",
"Share of class \"Yes\": 8.56%\n",
"\n",
"Test set:  (63959, 18)\n",
"Class distribution in column \"HeartDisease\":\n",
" HeartDisease\n",
"No     58485\n",
"Yes     5474\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 91.44%\n",
"Share of class \"Yes\": 8.56%\n",
"\n",
"Augmentation check for the subsets:\n",
"Data augmentation is required for the training set\n",
"Data augmentation is required for the validation set\n",
"Data augmentation is required for the test set\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
"image/png": "<base64-encoded PNG omitted: three pie charts of the HeartDisease class distribution in the training, validation, and test sets>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 600x800 with 3 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"def split_stratified_into_train_val_test(\n",
"    df_input,\n",
"    stratify_colname,\n",
"    frac_train,\n",
"    frac_val,\n",
"    frac_test,\n",
"    random_state=None,\n",
"):\n",
"    \"\"\"\n",
"    Splits a Pandas dataframe into three subsets (train, val, and test)\n",
"    following fractional ratios provided by the user, where each subset is\n",
"    stratified by the values in a specific column (that is, each subset has\n",
"    the same relative frequency of the values in the column). It performs this\n",
"    splitting by running train_test_split() twice.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    df_input : Pandas dataframe\n",
"        Input dataframe to be split.\n",
"    stratify_colname : str\n",
"        The name of the column that will be used for stratification. Usually\n",
"        this column would be for the label.\n",
"    frac_train : float\n",
"    frac_val   : float\n",
"    frac_test  : float\n",
"        The ratios with which the dataframe will be split into train, val, and\n",
"        test data. The values should be expressed as float fractions and should\n",
"        sum to 1.0.\n",
"    random_state : int, None, or RandomState instance\n",
"        Value to be passed to train_test_split().\n",
"\n",
"    Returns\n",
"    -------\n",
"    df_train, df_val, df_test :\n",
"        Dataframes containing the three splits.\n",
"    \"\"\"\n",
"\n",
"    # Compare with a tolerance: exact float equality can fail for valid ratios.\n",
"    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:\n",
"        raise ValueError(\n",
"            \"fractions %f, %f, %f do not add up to 1.0\"\n",
"            % (frac_train, frac_val, frac_test)\n",
"        )\n",
"\n",
"    if stratify_colname not in df_input.columns:\n",
"        raise ValueError(\"%s is not a column in the dataframe\" % (stratify_colname))\n",
"\n",
"    X = df_input  # Contains all columns.\n",
"    y = df_input[\n",
"        [stratify_colname]\n",
"    ]  # Dataframe of just the column on which to stratify.\n",
"\n",
"    # Split original dataframe into train and temp dataframes.\n",
"    df_train, df_temp, y_train, y_temp = train_test_split(\n",
"        X, y, stratify=y, test_size=(1.0 - frac_train), random_state=random_state\n",
"    )\n",
"\n",
"    # Split the temp dataframe into val and test dataframes.\n",
"    relative_frac_test = frac_test / (frac_val + frac_test)\n",
"    df_val, df_test, y_val, y_test = train_test_split(\n",
"        df_temp,\n",
"        y_temp,\n",
"        stratify=y_temp,\n",
"        test_size=relative_frac_test,\n",
"        random_state=random_state,\n",
"    )\n",
"\n",
"    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)\n",
"\n",
"    return df_train, df_val, df_test\n",
"\n",
"# Report the class balance of a subset\n",
"def check_balance(dataframe, dataframe_name, column):\n",
"    counts = dataframe[column].value_counts()\n",
"    print(dataframe_name + \": \", dataframe.shape)\n",
"    print(f\"Class distribution in column \\\"{column}\\\":\\n\", counts)\n",
"    total_count = len(dataframe)\n",
"    for value in counts.index:\n",
"        percentage: float = counts[value] / total_count * 100\n",
"        print(f\"Share of class \\\"{value}\\\": {percentage:.2f}%\")\n",
"    print()\n",
"\n",
"# Decide whether augmentation is needed: the class ratio must stay within [0.67, 1.5]\n",
"def need_augmentation(dataframe, column, first_value, second_value):\n",
"    counts = dataframe[column].value_counts()\n",
"    ratio: float = counts[first_value] / counts[second_value]\n",
"    return ratio > 1.5 or ratio < 0.67\n",
"\n",
"# Visualize the class balance of the three subsets\n",
"def visualize_balance(dataframe_train,\n",
"                      dataframe_val,\n",
"                      dataframe_test,\n",
"                      column: str):\n",
"    fig, axes = plt.subplots(3, 1, figsize=(6, 8))\n",
"\n",
"    # Training set\n",
"    counts_train = dataframe_train[column].value_counts()\n",
"    axes[0].pie(counts_train, labels=counts_train.index, autopct='%1.1f%%', startangle=90)\n",
"    axes[0].set_title(f\"Class distribution of \\\"{column}\\\" in the training set\")\n",
"\n",
"    # Validation set\n",
"    counts_val = dataframe_val[column].value_counts()\n",
"    axes[1].pie(counts_val, labels=counts_val.index, autopct='%1.1f%%', startangle=90)\n",
"    axes[1].set_title(f\"Class distribution of \\\"{column}\\\" in the validation set\")\n",
"\n",
"    # Test set\n",
"    counts_test = dataframe_test[column].value_counts()\n",
"    axes[2].pie(counts_test, labels=counts_test.index, autopct='%1.1f%%', startangle=90)\n",
"    axes[2].set_title(f\"Class distribution of \\\"{column}\\\" in the test set\")\n",
"\n",
"    # Show the charts\n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"\n",
"\n",
"df_train, df_val, df_test = split_stratified_into_train_val_test(\n",
"    df,\n",
"    stratify_colname=\"HeartDisease\",\n",
"    frac_train=0.60,\n",
"    frac_val=0.20,\n",
"    frac_test=0.20\n",
")\n",
"\n",
"# Check the class balance of the subsets\n",
"print('Class balance check for the subsets:')\n",
"check_balance(df_train, 'Training set', 'HeartDisease')\n",
"check_balance(df_val, 'Validation set', 'HeartDisease')\n",
"check_balance(df_test, 'Test set', 'HeartDisease')\n",
"\n",
"# Check whether the subsets need augmentation\n",
"print('Augmentation check for the subsets:')\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_train, 'HeartDisease', 'No', 'Yes') else ''}required for the training set\")\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_val, 'HeartDisease', 'No', 'Yes') else ''}required for the validation set\")\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_test, 'HeartDisease', 'No', 'Yes') else ''}required for the test set\")\n",
"\n",
"# Visualize the class balance\n",
"visualize_balance(df_train, df_val, df_test, 'HeartDisease')"
|
|||
|
]
|
|||
|
},
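{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two-step stratified split can be illustrated on a small synthetic frame. This is a minimal sketch under stated assumptions: the `toy` dataframe, its `x` column, and the 10% share of the `Yes` class are made up for illustration and are not part of the dataset. It only shows that calling `train_test_split` twice with `stratify` keeps the class proportions in each part."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical toy data: 100 rows, 10% positive class.\n",
"toy = pd.DataFrame({\"x\": range(100), \"HeartDisease\": [\"Yes\"] * 10 + [\"No\"] * 90})\n",
"\n",
"# Step 1: 60% train, 40% temporary remainder.\n",
"toy_train, toy_temp = train_test_split(\n",
"    toy, stratify=toy[\"HeartDisease\"], test_size=0.4, random_state=0\n",
")\n",
"# Step 2: split the remainder in half into validation and test.\n",
"toy_val, toy_test = train_test_split(\n",
"    toy_temp, stratify=toy_temp[\"HeartDisease\"], test_size=0.5, random_state=0\n",
")\n",
"\n",
"for name, part in [(\"train\", toy_train), (\"val\", toy_val), (\"test\", toy_test)]:\n",
"    shares = part[\"HeartDisease\"].value_counts(normalize=True).round(2).to_dict()\n",
"    print(name, len(part), shares)"
]
},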
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"The subsets turned out to be **insufficiently balanced**. We apply *oversampling*: duplicating observations or generating new observations from existing ones with algorithms such as SMOTE and ADASYN, which rely on the k nearest neighbours. The cell below uses SMOTE:"
|
|||
|
]
|
|||
|
},
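{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common convention, stated here as an assumption rather than as what this notebook does, is to resample only the training split so that the held-out data keeps its real class ratio. The sketch below shows that variant on synthetic data from `make_classification`; the sample size, class weights, and `random_state` values are illustrative choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"from imblearn.over_sampling import SMOTE\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical imbalanced data: roughly 90% of class 0 and 10% of class 1.\n",
"X_toy, y_toy = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=0\n",
")\n",
"\n",
"# Oversample the training part only; the test part is left untouched.\n",
"X_tr_res, y_tr_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)\n",
"\n",
"print(\"train before:\", Counter(y_tr))\n",
"print(\"train after: \", Counter(y_tr_res))\n",
"print(\"test (untouched):\", Counter(y_te))"
]
},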
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Class balance check for the subsets:\n",
"Training set:  (350906, 51)\n",
"Class distribution in column \"HeartDisease\":\n",
" HeartDisease\n",
"No     175453\n",
"Yes    175453\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 50.00%\n",
"Share of class \"Yes\": 50.00%\n",
"\n",
"Validation set:  (116968, 51)\n",
"Class distribution in column \"HeartDisease\":\n",
" HeartDisease\n",
"No     58484\n",
"Yes    58484\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 50.00%\n",
"Share of class \"Yes\": 50.00%\n",
"\n",
"Test set:  (116970, 51)\n",
"Class distribution in column \"HeartDisease\":\n",
" HeartDisease\n",
"No     58485\n",
"Yes    58485\n",
"Name: count, dtype: int64\n",
"Share of class \"No\": 50.00%\n",
"Share of class \"Yes\": 50.00%\n",
"\n",
"Augmentation check for the subsets:\n",
"Data augmentation is not required for the training set\n",
"Data augmentation is not required for the validation set\n",
"Data augmentation is not required for the test set\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
"image/png": "<base64-encoded PNG omitted: three pie charts of the HeartDisease class distribution in the training, validation, and test sets after oversampling>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 600x800 with 3 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"from imblearn.over_sampling import SMOTE\n",
"\n",
"# Oversampling with SMOTE\n",
"def oversample(df, column):\n",
"    # One-hot encode the features: SMOTE works on numeric data only.\n",
"    X = pd.get_dummies(df.drop(column, axis=1))\n",
"    y = df[column]\n",
"\n",
"    smote = SMOTE()\n",
"    X_resampled, y_resampled = smote.fit_resample(X, y)\n",
"\n",
"    df_resampled = pd.concat([X_resampled, y_resampled], axis=1)\n",
"    return df_resampled\n",
"\n",
"\n",
"# Oversample each subset\n",
"df_train_oversampled = oversample(df_train, 'HeartDisease')\n",
"df_val_oversampled = oversample(df_val, 'HeartDisease')\n",
"df_test_oversampled = oversample(df_test, 'HeartDisease')\n",
"\n",
"# Check the class balance of the subsets\n",
"print('Class balance check for the subsets:')\n",
"check_balance(df_train_oversampled, 'Training set', 'HeartDisease')\n",
"check_balance(df_val_oversampled, 'Validation set', 'HeartDisease')\n",
"check_balance(df_test_oversampled, 'Test set', 'HeartDisease')\n",
"\n",
"# Check whether the subsets need augmentation\n",
"print('Augmentation check for the subsets:')\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_train_oversampled, 'HeartDisease', 'No', 'Yes') else ''}required for the training set\")\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_val_oversampled, 'HeartDisease', 'No', 'Yes') else ''}required for the validation set\")\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_test_oversampled, 'HeartDisease', 'No', 'Yes') else ''}required for the test set\")\n",
"\n",
"# Visualize the class balance\n",
"visualize_balance(df_train_oversampled, df_val_oversampled, df_test_oversampled, 'HeartDisease')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### Feature engineering\n",
"\n",
"**Feature engineering** is the process of using knowledge about the task and the problem domain to define the features that will be used to train a statistical model.\n",
"\n",
"Feature engineering methods:\n",
"1. For categorical data:\n",
"    - **One-hot encoding** converts categorical variables to a numeric format. Each feature is represented as a binary vector: every category gets its own column with the value 1 (True) if the object belongs to that category and 0 (False) otherwise.\n",
"2. For numeric data:\n",
"    - **Discretization** converts continuous numeric values into categorical groups or intervals (discrete values).\n",
"    - **Manual synthesis** creates new features from existing data. It can combine several features, apply arithmetic operations (for example, addition or subtraction), or build polynomial and logarithmic features.\n",
"    - **Feature scaling by normalization and standardization** brings all numeric features to the same or very similar value ranges or distributions.\n",
"    - **Automated feature synthesis with the Featuretools framework**: Featuretools is a library for automatically generating features from structured data. It suits machine-learning tasks where useful features must be extracted quickly from large volumes of data."
|
|||
|
]
|
|||
|
},
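{
"cell_type": "markdown",
"metadata": {},
"source": [
"The methods listed above can be sketched on a tiny frame. This is a minimal illustration, not the notebook's pipeline: the `toy_fe` dataframe, its values, the bin labels, and the derived column names (`BMI_bin`, `BMI_log`, `SleepTime_norm`, `SleepTime_std`) are all hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"\n",
"# Hypothetical toy data with two numeric features and one categorical feature.\n",
"toy_fe = pd.DataFrame({\n",
"    \"BMI\": [18.5, 24.3, 31.0, 27.8],\n",
"    \"SleepTime\": [6.0, 8.0, 5.0, 7.0],\n",
"    \"Smoking\": [\"Yes\", \"No\", \"No\", \"Yes\"],\n",
"})\n",
"\n",
"# One-hot encoding: one binary column per category.\n",
"features = pd.get_dummies(toy_fe, columns=[\"Smoking\"])\n",
"\n",
"# Discretization: three equal-width bins over the BMI range.\n",
"features[\"BMI_bin\"] = pd.cut(toy_fe[\"BMI\"], bins=3, labels=[\"low\", \"mid\", \"high\"])\n",
"\n",
"# Manual synthesis: a logarithmic feature derived from BMI.\n",
"features[\"BMI_log\"] = np.log(toy_fe[\"BMI\"])\n",
"\n",
"# Scaling: min-max normalization to [0, 1] and standardization to zero mean, unit variance.\n",
"features[\"SleepTime_norm\"] = MinMaxScaler().fit_transform(toy_fe[[\"SleepTime\"]]).ravel()\n",
"features[\"SleepTime_std\"] = StandardScaler().fit_transform(toy_fe[[\"SleepTime\"]]).ravel()\n",
"\n",
"print(features)"
]
},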
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### One-hot encoding\n",
"\n",
"This transformation was already performed during oversampling (via `pd.get_dummies(...)`), because `fit_resample` requires numeric features. The categorical features `Smoking`, `AlcoholDrinking`, `Stroke`, `DiffWalking`, and so on were converted to binary features:"
|
|||
|
]
|
|||
|
},
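{
"cell_type": "markdown",
"metadata": {},
"source": [
"When `pd.get_dummies` is applied to each subset separately, a category that is absent from one subset produces no column there. One possible safeguard, sketched here on hypothetical frames `toy_a` and `toy_b`, is to reindex the encoded subset to the training columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical subsets: the second one lacks two of the categories.\n",
"toy_a = pd.DataFrame({\"Race\": [\"White\", \"Black\", \"Asian\"]})\n",
"toy_b = pd.DataFrame({\"Race\": [\"White\", \"White\"]})\n",
"\n",
"toy_a_enc = pd.get_dummies(toy_a)\n",
"# Align to the reference columns; categories missing in toy_b become all-False columns.\n",
"toy_b_enc = pd.get_dummies(toy_b).reindex(columns=toy_a_enc.columns, fill_value=False)\n",
"\n",
"print(list(toy_a_enc.columns))\n",
"print(toy_b_enc)"
]
},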
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>BMI</th>\n",
|
|||
|
" <th>PhysicalHealth</th>\n",
|
|||
|
" <th>MentalHealth</th>\n",
|
|||
|
" <th>SleepTime</th>\n",
|
|||
|
" <th>Smoking_No</th>\n",
|
|||
|
" <th>Smoking_Yes</th>\n",
|
|||
|
" <th>AlcoholDrinking_No</th>\n",
|
|||
|
" <th>AlcoholDrinking_Yes</th>\n",
|
|||
|
" <th>Stroke_No</th>\n",
|
|||
|
" <th>Stroke_Yes</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>GenHealth_Good</th>\n",
|
|||
|
" <th>GenHealth_Poor</th>\n",
|
|||
|
" <th>GenHealth_Very good</th>\n",
|
|||
|
" <th>Asthma_No</th>\n",
|
|||
|
" <th>Asthma_Yes</th>\n",
|
|||
|
" <th>KidneyDisease_No</th>\n",
|
|||
|
" <th>KidneyDisease_Yes</th>\n",
|
|||
|
" <th>SkinCancer_No</th>\n",
|
|||
|
" <th>SkinCancer_Yes</th>\n",
|
|||
|
" <th>HeartDisease</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>24.28</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>3.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>34.44</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>25.86</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>5.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>19.47</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>34.70</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>29.05</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>6.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>32.45</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>5.0</td>\n",
|
|||
|
" <td>7.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>26.25</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>30.0</td>\n",
|
|||
|
" <td>6.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>30.67</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>3.0</td>\n",
|
|||
|
" <td>7.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>34.96</td>\n",
|
|||
|
" <td>14.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>6.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>10 rows × 51 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
"     BMI  PhysicalHealth  MentalHealth  SleepTime  Smoking_No  Smoking_Yes  \\\n",
"0  24.28             2.0           3.0        8.0        True        False   \n",
"1  34.44             0.0           0.0        8.0       False         True   \n",
"2  25.86             0.0           5.0        8.0        True        False   \n",
"3  19.47             0.0           2.0        8.0       False         True   \n",
"4  34.70             0.0           0.0        8.0       False         True   \n",
"5  29.05             0.0           0.0        6.0        True        False   \n",
"6  32.45             0.0           5.0        7.0        True        False   \n",
"7  26.25             0.0          30.0        6.0       False         True   \n",
"8  30.67             2.0           3.0        7.0        True        False   \n",
"9  34.96            14.0           0.0        6.0        True        False   \n",
"\n",
"   AlcoholDrinking_No  AlcoholDrinking_Yes  Stroke_No  Stroke_Yes  ...  \\\n",
"0                True                False       True       False  ...   \n",
"1                True                False       True       False  ...   \n",
"2                True                False       True       False  ...   \n",
"3                True                False       True       False  ...   \n",
"4                True                False       True       False  ...   \n",
"5               False                 True       True       False  ...   \n",
"6                True                False       True       False  ...   \n",
"7                True                False       True       False  ...   \n",
"8                True                False       True       False  ...   \n",
"9                True                False       True       False  ...   \n",
"\n",
"   GenHealth_Good  GenHealth_Poor  GenHealth_Very good  Asthma_No  Asthma_Yes  \\\n",
"0           False           False                 True      False        True   \n",
"1            True           False                False       True       False   \n",
"2           False           False                 True       True       False   \n",
"3           False           False                 True       True       False   \n",
"4           False           False                 True       True       False   \n",
"5            True           False                False       True       False   \n",
"6            True           False                False       True       False   \n",
"7           False           False                 True       True       False   \n",
"8            True           False                False       True       False   \n",
"9            True           False                False       True       False   \n",
"\n",
"   KidneyDisease_No  KidneyDisease_Yes  SkinCancer_No  SkinCancer_Yes  \\\n",
"0              True              False           True           False   \n",
"1              True              False           True           False   \n",
"2              True              False           True           False   \n",
"3              True              False           True           False   \n",
"4              True              False           True           False   \n",
"5              True              False           True           False   \n",
"6              True              False           True           False   \n",
"7              True              False           True           False   \n",
"8              True              False           True           False   \n",
"9              True              False           True           False   \n",
"\n",
"   HeartDisease  \n",
"0            No  \n",
"1            No  \n",
"2            No  \n",
"3            No  \n",
"4            No  \n",
"5            No  \n",
"6            No  \n",
"7            No  \n",
"8            No  \n",
"9            No  \n",
"\n",
"[10 rows x 51 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"categorical_features = [\n",
"    'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Race',\n",
"    'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer'\n",
"]\n",
|
|||
|
"\n",
|
|||
|
"df_train_oversampled.head(10)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### Discretization of numerical features\n",
"\n",
"We bin the values of the `BMI` feature into intervals, converting it from a numeric to a categorical representation. We use **equal-width binning**:"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>BMI</th>\n",
|
|||
|
" <th>BMI_Category</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>24.280</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>34.440</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>25.860</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>19.470</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>34.700</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>29.050</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>32.450</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>26.250</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>30.670</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>34.960</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>10</th>\n",
|
|||
|
" <td>27.810</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>11</th>\n",
|
|||
|
" <td>20.360</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>12</th>\n",
|
|||
|
" <td>27.400</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>13</th>\n",
|
|||
|
" <td>42.505</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>14</th>\n",
|
|||
|
" <td>21.520</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>15</th>\n",
|
|||
|
" <td>36.260</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16</th>\n",
|
|||
|
" <td>23.490</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>17</th>\n",
|
|||
|
" <td>28.190</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>18</th>\n",
|
|||
|
" <td>28.290</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19</th>\n",
|
|||
|
" <td>20.800</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" BMI BMI_Category\n",
|
|||
|
"0 24.280 1\n",
|
|||
|
"1 34.440 3\n",
|
|||
|
"2 25.860 2\n",
|
|||
|
"3 19.470 1\n",
|
|||
|
"4 34.700 3\n",
|
|||
|
"5 29.050 2\n",
|
|||
|
"6 32.450 3\n",
|
|||
|
"7 26.250 2\n",
|
|||
|
"8 30.670 2\n",
|
|||
|
"9 34.960 3\n",
|
|||
|
"10 27.810 2\n",
|
|||
|
"11 20.360 1\n",
|
|||
|
"12 27.400 2\n",
|
|||
|
"13 42.505 4\n",
|
|||
|
"14 21.520 1\n",
|
|||
|
"15 36.260 3\n",
|
|||
|
"16 23.490 1\n",
|
|||
|
"17 28.190 2\n",
|
|||
|
"18 28.290 2\n",
|
|||
|
"19 20.800 1"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Discretize numerical features into equal-width bins\n",
"def discretize_features(df, features, bins=5, labels=False):\n",
"    for feature in features:\n",
"        # With labels=False, pd.cut returns integer bin indices (0 .. bins - 1)\n",
"        df[f'{feature}_Category'] = pd.cut(df[feature], bins=bins, labels=labels)\n",
"    return df\n",
"\n",
"# Numerical features to discretize\n",
"numerical_features = ['BMI']\n",
"\n",
"# Apply discretization to the training, validation, and test sets.\n",
"# Note: pd.cut derives the bin edges from each DataFrame separately,\n",
"# so the intervals can differ between the three sets.\n",
"df_train_oversampled = discretize_features(df_train_oversampled, numerical_features)\n",
"df_val_oversampled = discretize_features(df_val_oversampled, numerical_features)\n",
"df_test_oversampled = discretize_features(df_test_oversampled, numerical_features)\n",
"\n",
"df_train_oversampled[['BMI', 'BMI_Category']].head(20)"
]
|
|||
|
},
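The discretization cell above calls `pd.cut` with an integer `bins` on each split, so the five equal-width intervals are computed from that split's own minimum and maximum. The same BMI value can therefore land in different categories in the training and test sets. Below is a minimal pure-Python sketch of one way to avoid this, assuming the bin edges are learned from the training values only and then reused. The helpers `equal_width_edges` and `assign_bin` and the sample values are hypothetical, not part of the notebook.

```python
def equal_width_edges(values, bins=5):
    """Return bins + 1 equally spaced edges spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins)] + [hi]


def assign_bin(value, edges):
    """Return the 0-based index of the interval that contains value.

    Intervals are right-closed, (edge_i, edge_{i+1}], and the minimum belongs
    to bin 0. Values outside the learned range are clipped to the first or
    last bin (an assumption of this sketch; pd.cut would return NaN instead).
    """
    last = len(edges) - 2
    if value <= edges[0]:
        return 0
    if value > edges[-1]:
        return last
    for i in range(last + 1):
        if value <= edges[i + 1]:
            return i
    return last


# Illustrative BMI values, not the notebook's data.
train_bmi = [12.0, 19.5, 24.3, 34.4, 42.5, 62.0]
test_bmi = [18.0, 30.7, 70.0]

# Learn the edges on the training values only, then reuse them everywhere.
edges = equal_width_edges(train_bmi, bins=5)
train_bins = [assign_bin(v, edges) for v in train_bmi]
test_bins = [assign_bin(v, edges) for v in test_bmi]
```

With pandas, the same idea could be expressed by passing an explicit list of edges as `bins` instead of an integer; how out-of-range values should be handled is a modeling decision left open here.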
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### Manual feature synthesis\n",
"\n",
"We synthesize a new feature, `HealthScore`, a numeric health indicator derived from `PhysicalHealth`, `MentalHealth`, and `SleepTime`:"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>BMI</th>\n",
|
|||
|
" <th>PhysicalHealth</th>\n",
|
|||
|
" <th>MentalHealth</th>\n",
|
|||
|
" <th>SleepTime</th>\n",
|
|||
|
" <th>Smoking_No</th>\n",
|
|||
|
" <th>Smoking_Yes</th>\n",
|
|||
|
" <th>AlcoholDrinking_No</th>\n",
|
|||
|
" <th>AlcoholDrinking_Yes</th>\n",
|
|||
|
" <th>Stroke_No</th>\n",
|
|||
|
" <th>Stroke_Yes</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>GenHealth_Very good</th>\n",
|
|||
|
" <th>Asthma_No</th>\n",
|
|||
|
" <th>Asthma_Yes</th>\n",
|
|||
|
" <th>KidneyDisease_No</th>\n",
|
|||
|
" <th>KidneyDisease_Yes</th>\n",
|
|||
|
" <th>SkinCancer_No</th>\n",
|
|||
|
" <th>SkinCancer_Yes</th>\n",
|
|||
|
" <th>HeartDisease</th>\n",
|
|||
|
" <th>BMI_Category</th>\n",
|
|||
|
" <th>HealthScore</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>24.28</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>3.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>21.7</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>34.44</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>23.4</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>25.86</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>5.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>21.9</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>19.47</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>22.8</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>34.70</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>8.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>23.4</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>29.05</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>6.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>22.8</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>32.45</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>5.0</td>\n",
|
|||
|
" <td>7.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>21.6</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>26.25</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>30.0</td>\n",
|
|||
|
" <td>6.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>13.8</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>30.67</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>3.0</td>\n",
|
|||
|
" <td>7.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>21.4</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>34.96</td>\n",
|
|||
|
" <td>14.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>6.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>No</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>17.2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>10 rows × 53 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" BMI PhysicalHealth MentalHealth SleepTime Smoking_No Smoking_Yes \\\n",
|
|||
|
"0 24.28 2.0 3.0 8.0 True False \n",
|
|||
|
"1 34.44 0.0 0.0 8.0 False True \n",
|
|||
|
"2 25.86 0.0 5.0 8.0 True False \n",
|
|||
|
"3 19.47 0.0 2.0 8.0 False True \n",
|
|||
|
"4 34.70 0.0 0.0 8.0 False True \n",
|
|||
|
"5 29.05 0.0 0.0 6.0 True False \n",
|
|||
|
"6 32.45 0.0 5.0 7.0 True False \n",
|
|||
|
"7 26.25 0.0 30.0 6.0 False True \n",
|
|||
|
"8 30.67 2.0 3.0 7.0 True False \n",
|
|||
|
"9 34.96 14.0 0.0 6.0 True False \n",
|
|||
|
"\n",
|
|||
|
" AlcoholDrinking_No AlcoholDrinking_Yes Stroke_No Stroke_Yes ... \\\n",
|
|||
|
"0 True False True False ... \n",
|
|||
|
"1 True False True False ... \n",
|
|||
|
"2 True False True False ... \n",
|
|||
|
"3 True False True False ... \n",
|
|||
|
"4 True False True False ... \n",
|
|||
|
"5 False True True False ... \n",
|
|||
|
"6 True False True False ... \n",
|
|||
|
"7 True False True False ... \n",
|
|||
|
"8 True False True False ... \n",
|
|||
|
"9 True False True False ... \n",
|
|||
|
"\n",
|
|||
|
" GenHealth_Very good Asthma_No Asthma_Yes KidneyDisease_No \\\n",
|
|||
|
"0 True False True True \n",
|
|||
|
"1 False True False True \n",
|
|||
|
"2 True True False True \n",
|
|||
|
"3 True True False True \n",
|
|||
|
"4 True True False True \n",
|
|||
|
"5 False True False True \n",
|
|||
|
"6 False True False True \n",
|
|||
|
"7 True True False True \n",
|
|||
|
"8 False True False True \n",
|
|||
|
"9 False True False True \n",
|
|||
|
"\n",
|
|||
|
" KidneyDisease_Yes SkinCancer_No SkinCancer_Yes HeartDisease \\\n",
|
|||
|
"0 False True False No \n",
|
|||
|
"1 False True False No \n",
|
|||
|
"2 False True False No \n",
|
|||
|
"3 False True False No \n",
|
|||
|
"4 False True False No \n",
|
|||
|
"5 False True False No \n",
|
|||
|
"6 False True False No \n",
|
|||
|
"7 False True False No \n",
|
|||
|
"8 False True False No \n",
|
|||
|
"9 False True False No \n",
|
|||
|
"\n",
|
|||
|
" BMI_Category HealthScore \n",
|
|||
|
"0 1 21.7 \n",
|
|||
|
"1 3 23.4 \n",
|
|||
|
"2 2 21.9 \n",
|
|||
|
"3 1 22.8 \n",
|
|||
|
"4 3 23.4 \n",
|
|||
|
"5 2 22.8 \n",
|
|||
|
"6 3 21.6 \n",
|
|||
|
"7 2 13.8 \n",
|
|||
|
"8 2 21.4 \n",
|
|||
|
"9 3 17.2 \n",
|
|||
|
"\n",
|
|||
|
"[10 rows x 53 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Compute the new HealthScore feature as a weighted sum of\n",
"# physical health, mental health, and sleep time\n",
"df_train_oversampled[\"HealthScore\"] = (\n",
"    (30.0 - df_train_oversampled[\"PhysicalHealth\"]) * 0.4 +  # Fewer days of poor physical health give a higher score\n",
"    (30.0 - df_train_oversampled[\"MentalHealth\"]) * 0.3 +  # Fewer days of poor mental health give a higher score\n",
"    df_train_oversampled[\"SleepTime\"] * 0.3  # More sleep gives a higher score (linear term, no optimum is modeled)\n",
")\n",
"\n",
"df_train_oversampled.head(10)"
]
|
|||
|
},
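To make the weighting concrete, the sketch below recomputes the `HealthScore` formula for single rows. The function name `health_score` is hypothetical; the weights and the 30-day offset are copied from the notebook cell above, and the two inputs are rows 0 and 7 of the displayed table.

```python
def health_score(physical_health, mental_health, sleep_time):
    """Weighted sum from the notebook: fewer bad-health days and more sleep raise the score."""
    return (
        (30.0 - physical_health) * 0.4
        + (30.0 - mental_health) * 0.3
        + sleep_time * 0.3
    )


# Arguments are (PhysicalHealth, MentalHealth, SleepTime) before scaling.
row_0 = health_score(2.0, 3.0, 8.0)
row_7 = health_score(0.0, 30.0, 6.0)
print(round(row_0, 1), round(row_7, 1))  # 21.7 13.8
```

Because `SleepTime` enters linearly, a longer sleep duration always raises the score. A term based on the distance from a target sleep duration might be closer to the idea of "optimal sleep", but that is a modeling choice the notebook does not make.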
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### Feature scaling via normalization and standardization\n",
"\n",
"Feature scaling methods:\n",
"- *Normalization* (min-max scaling): typically used for features with a roughly uniform distribution;\n",
"- *Standardization* (z-score scaling): typically used for features with a roughly normal distribution.\n"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>BMI</th>\n",
|
|||
|
" <th>PhysicalHealth</th>\n",
|
|||
|
" <th>MentalHealth</th>\n",
|
|||
|
" <th>SleepTime</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>0.383457</td>\n",
|
|||
|
" <td>0.066667</td>\n",
|
|||
|
" <td>0.100000</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>0.727165</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>0.436908</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.166667</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>0.220737</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.066667</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>0.735961</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>0.544824</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.375</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>0.659844</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.166667</td>\n",
|
|||
|
" <td>0.500</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>0.450101</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1.000000</td>\n",
|
|||
|
" <td>0.375</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>0.599628</td>\n",
|
|||
|
" <td>0.066667</td>\n",
|
|||
|
" <td>0.100000</td>\n",
|
|||
|
" <td>0.500</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>0.744756</td>\n",
|
|||
|
" <td>0.466667</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.375</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" BMI PhysicalHealth MentalHealth SleepTime\n",
|
|||
|
"0 0.383457 0.066667 0.100000 0.625\n",
|
|||
|
"1 0.727165 0.000000 0.000000 0.625\n",
|
|||
|
"2 0.436908 0.000000 0.166667 0.625\n",
|
|||
|
"3 0.220737 0.000000 0.066667 0.625\n",
|
|||
|
"4 0.735961 0.000000 0.000000 0.625\n",
|
|||
|
"5 0.544824 0.000000 0.000000 0.375\n",
|
|||
|
"6 0.659844 0.000000 0.166667 0.500\n",
|
|||
|
"7 0.450101 0.000000 1.000000 0.375\n",
|
|||
|
"8 0.599628 0.066667 0.100000 0.500\n",
|
|||
|
"9 0.744756 0.466667 0.000000 0.375"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"scaler = MinMaxScaler()\n",
"\n",
"# Apply min-max scaling to the selected features.\n",
"# Note: this assignment creates an alias, not a copy, so df_train_oversampled is scaled in place as well.\n",
"df_train_oversampled_normalized = df_train_oversampled\n",
"df_train_oversampled_normalized[numeric_columns] = scaler.fit_transform(df_train_oversampled_normalized[numeric_columns])\n",
"\n",
"df_train_oversampled_normalized[numeric_columns].head(10)"
]
|
|||
|
},
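Normalization and standardization differ only in the statistics they use. Below is a minimal pure-Python sketch, assuming those statistics are computed on the training values and then reused for other splits (the notebook cell above scales only the training set). The helpers `fit_min_max` and `fit_standard` and the sample values are illustrative, not part of the notebook.

```python
from statistics import mean, pstdev


def fit_min_max(train):
    """Return a function mapping x to (x - min) / (max - min) with training statistics."""
    lo, hi = min(train), max(train)  # assumes hi > lo
    return lambda x: (x - lo) / (hi - lo)


def fit_standard(train):
    """Return a function mapping x to (x - mean) / std with training statistics."""
    # Population standard deviation (ddof = 0), as used by sklearn's StandardScaler.
    mu, sigma = mean(train), pstdev(train)  # assumes sigma > 0
    return lambda x: (x - mu) / sigma


train_sleep = [3.0, 6.0, 7.0, 8.0, 11.0]  # illustrative SleepTime values
val_sleep = [5.0, 12.0]

normalize = fit_min_max(train_sleep)
standardize = fit_standard(train_sleep)

train_norm = [normalize(x) for x in train_sleep]    # lies in [0, 1]
val_norm = [normalize(x) for x in val_sleep]        # may fall outside [0, 1]
train_std = [standardize(x) for x in train_sleep]   # mean 0, std 1
```

In scikit-learn terms this corresponds to calling `fit` (or `fit_transform`) on the training data and `transform` on the validation and test data, so that no statistics leak from the held-out splits.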
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 32,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Index(['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime', 'Smoking_No',\n",
|
|||
|
" 'Smoking_Yes', 'AlcoholDrinking_No', 'AlcoholDrinking_Yes', 'Stroke_No',\n",
|
|||
|
" 'Stroke_Yes', 'DiffWalking_No', 'DiffWalking_Yes', 'Sex_Female',\n",
|
|||
|
" 'Sex_Male', 'AgeCategory_18-24', 'AgeCategory_25-29',\n",
|
|||
|
" 'AgeCategory_30-34', 'AgeCategory_35-39', 'AgeCategory_40-44',\n",
|
|||
|
" 'AgeCategory_45-49', 'AgeCategory_50-54', 'AgeCategory_55-59',\n",
|
|||
|
" 'AgeCategory_60-64', 'AgeCategory_65-69', 'AgeCategory_70-74',\n",
|
|||
|
" 'AgeCategory_75-79', 'AgeCategory_80 or older',\n",
|
|||
|
" 'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black',\n",
|
|||
|
" 'Race_Hispanic', 'Race_Other', 'Race_White', 'Diabetic_No',\n",
|
|||
|
" 'Diabetic_No, borderline diabetes', 'Diabetic_Yes',\n",
|
|||
|
" 'Diabetic_Yes (during pregnancy)', 'PhysicalActivity_No',\n",
|
|||
|
" 'PhysicalActivity_Yes', 'GenHealth_Excellent', 'GenHealth_Fair',\n",
|
|||
|
" 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good', 'Asthma_No',\n",
|
|||
|
" 'Asthma_Yes', 'KidneyDisease_No', 'KidneyDisease_Yes', 'SkinCancer_No',\n",
|
|||
|
" 'SkinCancer_Yes', 'HeartDisease', 'BMI_Category', 'HealthScore'],\n",
|
|||
|
" dtype='object')"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 32,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"df_train_oversampled_normalized.columns"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### Feature construction with FeatureTools"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>BMI</th>\n",
|
|||
|
" <th>PhysicalHealth</th>\n",
|
|||
|
" <th>MentalHealth</th>\n",
|
|||
|
" <th>SleepTime</th>\n",
|
|||
|
" <th>Smoking_No</th>\n",
|
|||
|
" <th>Smoking_Yes</th>\n",
|
|||
|
" <th>AlcoholDrinking_No</th>\n",
|
|||
|
" <th>AlcoholDrinking_Yes</th>\n",
|
|||
|
" <th>Stroke_No</th>\n",
|
|||
|
" <th>Stroke_Yes</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>BMI_Category * HealthScore</th>\n",
|
|||
|
" <th>BMI_Category * MentalHealth</th>\n",
|
|||
|
" <th>BMI_Category * PhysicalHealth</th>\n",
|
|||
|
" <th>BMI_Category * SleepTime</th>\n",
|
|||
|
" <th>HealthScore * MentalHealth</th>\n",
|
|||
|
" <th>HealthScore * PhysicalHealth</th>\n",
|
|||
|
" <th>HealthScore * SleepTime</th>\n",
|
|||
|
" <th>MentalHealth * PhysicalHealth</th>\n",
|
|||
|
" <th>MentalHealth * SleepTime</th>\n",
|
|||
|
" <th>PhysicalHealth * SleepTime</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>Id</th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>0.383457</td>\n",
|
|||
|
" <td>0.066667</td>\n",
|
|||
|
" <td>0.100000</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>21.7</td>\n",
|
|||
|
" <td>0.100000</td>\n",
|
|||
|
" <td>0.066667</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" <td>2.17</td>\n",
|
|||
|
" <td>1.446667</td>\n",
|
|||
|
" <td>13.5625</td>\n",
|
|||
|
" <td>0.006667</td>\n",
|
|||
|
" <td>0.062500</td>\n",
|
|||
|
" <td>0.041667</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>0.727165</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>70.2</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1.875</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>14.6250</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>0.436908</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.166667</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>43.8</td>\n",
|
|||
|
" <td>0.333333</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1.250</td>\n",
|
|||
|
" <td>3.65</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>13.6875</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.104167</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>0.220737</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.066667</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>22.8</td>\n",
|
|||
|
" <td>0.066667</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" <td>1.52</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>14.2500</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.041667</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>0.735961</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.625</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>70.2</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1.875</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>14.6250</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>0.544824</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.375</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>45.6</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.750</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>8.5500</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>0.659844</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.166667</td>\n",
|
|||
|
" <td>0.500</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>64.8</td>\n",
|
|||
|
" <td>0.500000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1.500</td>\n",
|
|||
|
" <td>3.60</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>10.8000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.083333</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>0.450101</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1.000000</td>\n",
|
|||
|
" <td>0.375</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>27.6</td>\n",
|
|||
|
" <td>2.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.750</td>\n",
|
|||
|
" <td>13.80</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>5.1750</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.375000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>0.599628</td>\n",
|
|||
|
" <td>0.066667</td>\n",
|
|||
|
" <td>0.100000</td>\n",
|
|||
|
" <td>0.500</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>42.8</td>\n",
|
|||
|
" <td>0.200000</td>\n",
|
|||
|
" <td>0.133333</td>\n",
|
|||
|
" <td>1.000</td>\n",
|
|||
|
" <td>2.14</td>\n",
|
|||
|
" <td>1.426667</td>\n",
|
|||
|
" <td>10.7000</td>\n",
|
|||
|
" <td>0.006667</td>\n",
|
|||
|
" <td>0.050000</td>\n",
|
|||
|
" <td>0.033333</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>10</th>\n",
|
|||
|
" <td>0.744756</td>\n",
|
|||
|
" <td>0.466667</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.375</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>51.6</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1.400000</td>\n",
|
|||
|
" <td>1.125</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>8.026667</td>\n",
|
|||
|
" <td>6.4500</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.175000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>10 rows × 68 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" BMI PhysicalHealth MentalHealth SleepTime Smoking_No \\\n",
|
|||
|
"Id \n",
|
|||
|
"1 0.383457 0.066667 0.100000 0.625 True \n",
|
|||
|
"2 0.727165 0.000000 0.000000 0.625 False \n",
|
|||
|
"3 0.436908 0.000000 0.166667 0.625 True \n",
|
|||
|
"4 0.220737 0.000000 0.066667 0.625 False \n",
|
|||
|
"5 0.735961 0.000000 0.000000 0.625 False \n",
|
|||
|
"6 0.544824 0.000000 0.000000 0.375 True \n",
|
|||
|
"7 0.659844 0.000000 0.166667 0.500 True \n",
|
|||
|
"8 0.450101 0.000000 1.000000 0.375 False \n",
|
|||
|
"9 0.599628 0.066667 0.100000 0.500 True \n",
|
|||
|
"10 0.744756 0.466667 0.000000 0.375 True \n",
|
|||
|
"\n",
|
|||
|
" Smoking_Yes AlcoholDrinking_No AlcoholDrinking_Yes Stroke_No \\\n",
|
|||
|
"Id \n",
|
|||
|
"1 False True False True \n",
|
|||
|
"2 True True False True \n",
|
|||
|
"3 False True False True \n",
|
|||
|
"4 True True False True \n",
|
|||
|
"5 True True False True \n",
|
|||
|
"6 False False True True \n",
|
|||
|
"7 False True False True \n",
|
|||
|
"8 True True False True \n",
|
|||
|
"9 False True False True \n",
|
|||
|
"10 False True False True \n",
|
|||
|
"\n",
|
|||
|
" Stroke_Yes ... BMI_Category * HealthScore BMI_Category * MentalHealth \\\n",
|
|||
|
"Id ... \n",
|
|||
|
"1 False ... 21.7 0.100000 \n",
|
|||
|
"2 False ... 70.2 0.000000 \n",
|
|||
|
"3 False ... 43.8 0.333333 \n",
|
|||
|
"4 False ... 22.8 0.066667 \n",
|
|||
|
"5 False ... 70.2 0.000000 \n",
|
|||
|
"6 False ... 45.6 0.000000 \n",
|
|||
|
"7 False ... 64.8 0.500000 \n",
|
|||
|
"8 False ... 27.6 2.000000 \n",
|
|||
|
"9 False ... 42.8 0.200000 \n",
|
|||
|
"10 False ... 51.6 0.000000 \n",
|
|||
|
"\n",
|
|||
|
" BMI_Category * PhysicalHealth BMI_Category * SleepTime \\\n",
|
|||
|
"Id \n",
|
|||
|
"1 0.066667 0.625 \n",
|
|||
|
"2 0.000000 1.875 \n",
|
|||
|
"3 0.000000 1.250 \n",
|
|||
|
"4 0.000000 0.625 \n",
|
|||
|
"5 0.000000 1.875 \n",
|
|||
|
"6 0.000000 0.750 \n",
|
|||
|
"7 0.000000 1.500 \n",
|
|||
|
"8 0.000000 0.750 \n",
|
|||
|
"9 0.133333 1.000 \n",
|
|||
|
"10 1.400000 1.125 \n",
|
|||
|
"\n",
|
|||
|
" HealthScore * MentalHealth HealthScore * PhysicalHealth \\\n",
|
|||
|
"Id \n",
|
|||
|
"1 2.17 1.446667 \n",
|
|||
|
"2 0.00 0.000000 \n",
|
|||
|
"3 3.65 0.000000 \n",
|
|||
|
"4 1.52 0.000000 \n",
|
|||
|
"5 0.00 0.000000 \n",
|
|||
|
"6 0.00 0.000000 \n",
|
|||
|
"7 3.60 0.000000 \n",
|
|||
|
"8 13.80 0.000000 \n",
|
|||
|
"9 2.14 1.426667 \n",
|
|||
|
"10 0.00 8.026667 \n",
|
|||
|
"\n",
|
|||
|
" HealthScore * SleepTime MentalHealth * PhysicalHealth \\\n",
|
|||
|
"Id \n",
|
|||
|
"1 13.5625 0.006667 \n",
|
|||
|
"2 14.6250 0.000000 \n",
|
|||
|
"3 13.6875 0.000000 \n",
|
|||
|
"4 14.2500 0.000000 \n",
|
|||
|
"5 14.6250 0.000000 \n",
|
|||
|
"6 8.5500 0.000000 \n",
|
|||
|
"7 10.8000 0.000000 \n",
|
|||
|
"8 5.1750 0.000000 \n",
|
|||
|
"9 10.7000 0.006667 \n",
|
|||
|
"10 6.4500 0.000000 \n",
|
|||
|
"\n",
|
|||
|
" MentalHealth * SleepTime PhysicalHealth * SleepTime \n",
|
|||
|
"Id \n",
|
|||
|
"1 0.062500 0.041667 \n",
|
|||
|
"2 0.000000 0.000000 \n",
|
|||
|
"3 0.104167 0.000000 \n",
|
|||
|
"4 0.041667 0.000000 \n",
|
|||
|
"5 0.000000 0.000000 \n",
|
|||
|
"6 0.000000 0.000000 \n",
|
|||
|
"7 0.083333 0.000000 \n",
|
|||
|
"8 0.375000 0.000000 \n",
|
|||
|
"9 0.050000 0.033333 \n",
|
|||
|
"10 0.000000 0.175000 \n",
|
|||
|
"\n",
|
|||
|
"[10 rows x 68 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import featuretools as ft\n",
|
|||
|
"\n",
|
|||
|
"# Build the EntitySet\n",
|
|||
|
"\n",
|
|||
|
"df_testing = df_train_oversampled_normalized.copy()  # work on a copy so the source dataframe is not modified\n",
|
|||
|
"# Create a unique identifier for each row\n",
|
|||
|
"df_testing['Id'] = range(1, len(df_testing) + 1)\n",
|
|||
|
"\n",
|
|||
|
"es = ft.EntitySet(id='my-test-data')\n",
|
|||
|
"es = es.add_dataframe(dataframe=df_testing, dataframe_name='my-name', index='Id')\n",
|
|||
|
"\n",
|
|||
|
"# Specify which transform primitives to apply\n",
|
|||
|
"trans_primitives = ['multiply_numeric']\n",
|
|||
|
"\n",
|
|||
|
"# Generate features with Deep Feature Synthesis\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(\n",
|
|||
|
" entityset=es, \n",
|
|||
|
" target_dataframe_name='my-name', \n",
|
|||
|
" max_depth=1,\n",
|
|||
|
" trans_primitives=trans_primitives\n",
|
|||
|
")\n",
|
|||
|
"\n",
|
|||
|
"# Show the first 10 rows of the generated feature matrix\n",
|
|||
|
"feature_matrix.head(10)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Assessing the quality of each feature set\n",
|
|||
|
"\n",
|
|||
|
"**Predictive power**: The ability of a feature set to predict the target variable successfully. It is measured with metrics such as RMSE, MAE, and R² for regression, or accuracy, precision, recall, F1, and ROC AUC for classification, which show how well the model uses the features to produce accurate results. To assess it, train the model on the training set and compare its scores on the validation and test sets.\n",
|
|||
|
"\n",
|
|||
|
"**Computation speed**: The time needed to process the data and run the machine-learning algorithms. Features should be computable in reasonable time to keep the model efficient, especially on large datasets. To assess it, measure the time taken by feature generation and by model training.\n",
|
|||
|
"\n",
|
|||
|
"**Reliability**: The stability and reproducibility of results when the input data changes. Reliable features should give similar results regardless of random factors or minor changes in the data. Assessment methods: cross-validation, analysis of the model's sensitivity to changes in the data.\n",
|
|||
|
"\n",
|
|||
|
"**Correlation**: The degree of association between the features and the target variable, and among the features themselves. High correlation with the target indicates potential predictive power, whereas high correlation among the features themselves can cause multicollinearity and reduce model performance. Assessment methods: analysis of the feature correlation matrix, removal of multicollinear features.\n",
|
|||
|
"\n",
|
|||
|
"**Integrity**: The feature is not derived from other features. Assessment methods: checking the logical relationship between the features and the target variable, interpreting the model's results."
|
|||
|
]
|
|||
|
},
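{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the criteria above might be checked in practice, on synthetic stand-in data rather than the heart-disease dataset. The column names (`a`, `b`, `c`, `a_times_b`, `a_copy`), the 0.95 correlation threshold, and the choice of F1 as the score are illustrative assumptions, not part of the lab. The cell times a feature-generation step, ranks features by absolute correlation with the target, drops one feature from each highly correlated pair, and reports the spread of cross-validated scores as a rough reliability check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Synthetic stand-in data (hypothetical; not the heart-disease dataset)\n",
"rng = np.random.default_rng(42)\n",
"X = pd.DataFrame(rng.normal(size=(500, 3)), columns=['a', 'b', 'c'])\n",
"y = (X['a'] + 0.5 * X['b'] > 0).astype(int)\n",
"\n",
"# Computation speed: time the feature-generation step\n",
"start = time.perf_counter()\n",
"X['a_times_b'] = X['a'] * X['b']\n",
"X['a_copy'] = X['a'] * 2.0  # deliberately redundant feature\n",
"elapsed = time.perf_counter() - start\n",
"\n",
"# Correlation: feature-target, then feature-feature\n",
"target_corr = X.corrwith(y).abs().sort_values(ascending=False)\n",
"corr = X.corr().abs()\n",
"# Keep only the upper triangle so each pair is inspected once\n",
"upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))\n",
"redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]\n",
"X_reduced = X.drop(columns=redundant)\n",
"\n",
"# Reliability: spread of the cross-validated scores\n",
"model = RandomForestClassifier(n_estimators=50, random_state=42)\n",
"scores = cross_val_score(model, X_reduced, y, cv=5, scoring='f1')\n",
"\n",
"print(f'generation time: {elapsed:.4f} s')\n",
"print(target_corr)\n",
"print('dropped as redundant:', redundant)\n",
"print(f'F1: {scores.mean():.3f} +/- {scores.std():.3f}')"
]
},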
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Training set size: 156699\n",
|
|||
|
"Validation set size: 67157\n",
|
|||
|
"Test set size: 95939\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/entityset/entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/synthesis/deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/computational_backends/feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/computational_backends/feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Feature Importance:\n",
|
|||
|
" feature importance\n",
|
|||
|
"0 HeartDisease 0.851120\n",
|
|||
|
"1 BMI 0.014203\n",
|
|||
|
"9 Stroke_No 0.012600\n",
|
|||
|
"2 PhysicalHealth 0.008628\n",
|
|||
|
"12 DiffWalking_Yes 0.008111\n",
|
|||
|
"11 DiffWalking_No 0.007721\n",
|
|||
|
"36 Diabetic_Yes 0.007583\n",
|
|||
|
"10 Stroke_Yes 0.007551\n",
|
|||
|
"4 SleepTime 0.007525\n",
|
|||
|
"43 GenHealth_Poor 0.006605\n",
|
|||
|
"27 AgeCategory_80 or older 0.006269\n",
|
|||
|
"34 Diabetic_No 0.005300\n",
|
|||
|
"3 MentalHealth 0.005102\n",
|
|||
|
"41 GenHealth_Fair 0.004277\n",
|
|||
|
"48 KidneyDisease_Yes 0.003435\n",
|
|||
|
"47 KidneyDisease_No 0.003086\n",
|
|||
|
"13 Sex_Female 0.002607\n",
|
|||
|
"26 AgeCategory_75-79 0.002567\n",
|
|||
|
"25 AgeCategory_70-74 0.002462\n",
|
|||
|
"14 Sex_Male 0.002457\n",
|
|||
|
"6 Smoking_Yes 0.002127\n",
|
|||
|
"5 Smoking_No 0.001934\n",
|
|||
|
"42 GenHealth_Good 0.001787\n",
|
|||
|
"44 GenHealth_Very good 0.001734\n",
|
|||
|
"33 Race_White 0.001731\n",
|
|||
|
"50 SkinCancer_Yes 0.001687\n",
|
|||
|
"38 PhysicalActivity_No 0.001658\n",
|
|||
|
"39 PhysicalActivity_Yes 0.001585\n",
|
|||
|
"49 SkinCancer_No 0.001513\n",
|
|||
|
"40 GenHealth_Excellent 0.001451\n",
|
|||
|
"24 AgeCategory_65-69 0.001318\n",
|
|||
|
"46 Asthma_Yes 0.001315\n",
|
|||
|
"45 Asthma_No 0.001256\n",
|
|||
|
"23 AgeCategory_60-64 0.001091\n",
|
|||
|
"30 Race_Black 0.000885\n",
|
|||
|
"22 AgeCategory_55-59 0.000853\n",
|
|||
|
"31 Race_Hispanic 0.000825\n",
|
|||
|
"21 AgeCategory_50-54 0.000715\n",
|
|||
|
"32 Race_Other 0.000699\n",
|
|||
|
"7 AlcoholDrinking_No 0.000560\n",
|
|||
|
"8 AlcoholDrinking_Yes 0.000550\n",
|
|||
|
"20 AgeCategory_45-49 0.000520\n",
|
|||
|
"28 Race_American Indian/Alaskan Native 0.000503\n",
|
|||
|
"35 Diabetic_No, borderline diabetes 0.000479\n",
|
|||
|
"19 AgeCategory_40-44 0.000444\n",
|
|||
|
"18 AgeCategory_35-39 0.000412\n",
|
|||
|
"17 AgeCategory_30-34 0.000267\n",
|
|||
|
"29 Race_Asian 0.000260\n",
|
|||
|
"15 AgeCategory_18-24 0.000231\n",
|
|||
|
"16 AgeCategory_25-29 0.000217\n",
|
|||
|
"37 Diabetic_Yes (during pregnancy) 0.000184\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from imblearn.over_sampling import RandomOverSampler\n",
|
|||
|
"import featuretools as ft\n",
|
|||
|
"from sklearn.ensemble import RandomForestClassifier\n",
|
|||
|
"\n",
|
|||
|
"# Load the data\n",
|
|||
|
"df = pd.read_csv(\".//static//csv//heart_2020_cleaned.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Split into training and test sets (70% training, 30% test)\n",
|
|||
|
"train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Split the training set into training and validation sets (70% training, 30% validation)\n",
|
|||
|
"train_df, val_df = train_test_split(train_df, test_size=0.3, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Print the split sizes\n",
|
|||
|
"print(\"Training set size:\", len(train_df))\n",
|
|||
|
"print(\"Validation set size:\", len(val_df))\n",
|
|||
|
"print(\"Test set size:\", len(test_df))\n",
|
|||
|
"\n",
|
|||
|
"# Categorical features\n",
|
|||
|
"categorical_features = [\n",
|
|||
|
" 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Race',\n",
|
|||
|
" 'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer'\n",
|
|||
|
"]\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encode the training set\n",
|
|||
|
"train_df_encoded = pd.get_dummies(train_df, columns=categorical_features)\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encode the validation set\n",
|
|||
|
"val_df_encoded = pd.get_dummies(val_df, columns=categorical_features)\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encode the test set\n",
|
|||
|
"test_df_encoded = pd.get_dummies(test_df, columns=categorical_features)\n",
|
|||
|
"\n",
|
|||
|
"# Define the EntitySet (the dataframe has no 'id' column, so featuretools creates an integer index)\n",
|
|||
|
"es = ft.EntitySet(id='heart_data')\n",
|
|||
|
"es = es.add_dataframe(dataframe_name='heart', dataframe=train_df_encoded, index='id')\n",
|
|||
|
"\n",
|
|||
|
"# Generate features\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='heart', max_depth=2)\n",
|
|||
|
"\n",
|
|||
|
"# Compute the same features for the validation and test ids (note: es holds only the training dataframe, so the ids are looked up there)\n",
|
|||
|
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_df_encoded.index)\n",
|
|||
|
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_df_encoded.index)\n",
|
|||
|
"\n",
|
|||
|
"# Estimate feature importance (note: feature_matrix still contains the HeartDisease target column)\n",
|
|||
|
"X = feature_matrix\n",
|
|||
|
"y = train_df_encoded['HeartDisease']\n",
|
|||
|
"\n",
|
|||
|
"# Split the data into training and test sets\n",
|
|||
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Train the model\n",
|
|||
|
"model = RandomForestClassifier(n_estimators=100, random_state=42)\n",
|
|||
|
"model.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Get the feature importances\n",
|
|||
|
"importances = model.feature_importances_\n",
|
|||
|
"feature_names = feature_matrix.columns\n",
|
|||
|
"\n",
|
|||
|
"# Sort features by importance\n",
|
|||
|
"feature_importance = pd.DataFrame({'feature': feature_names, 'importance': importances})\n",
|
|||
|
"feature_importance = feature_importance.sort_values(by='importance', ascending=False)\n",
|
|||
|
"\n",
|
|||
|
"print(\"Feature Importance:\")\n",
|
|||
|
"print(feature_importance)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 13,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Training set size: 15670\n",
|
|||
|
"Validation set size: 6716\n",
|
|||
|
"Test set size: 9594\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/entityset/entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/computational_backends/feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n",
|
|||
|
"/home/oleg/aim_labs/lab_3/aimenv/lib/python3.12/site-packages/featuretools/computational_backends/feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Accuracy: 1.0\n",
|
|||
|
"Precision: 1.0\n",
|
|||
|
"Recall: 1.0\n",
|
|||
|
"F1 Score: 1.0\n",
|
|||
|
"ROC AUC: 1.0\n",
|
|||
|
"Cross-validated Accuracy: 0.906126356094448\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded PNG of the 'Feature Importance' bar plot omitted)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Train Accuracy: 0.9994894703254626\n",
|
|||
|
"Train Precision: 0.9992816091954023\n",
|
|||
|
"Train Recall: 0.9949928469241774\n",
|
|||
|
"Train F1 Score: 0.9971326164874552\n",
|
|||
|
"Train ROC AUC: 0.9974613898298016\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded PNG omitted)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from sklearn.ensemble import RandomForestClassifier\n",
|
|||
|
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score\n",
|
|||
|
"from sklearn.model_selection import cross_val_score\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"import featuretools as ft\n",
|
|||
|
"\n",
|
|||
|
"# Load the data\n",
|
|||
|
"df = pd.read_csv(\".//static//csv//heart_2020_cleaned.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Downsample to 10% of the rows to speed things up (optional)\n",
|
|||
|
"df = df.sample(frac=0.1, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Split into training and test sets (70% training, 30% test)\n",
|
|||
|
"train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Split the training set into training and validation sets (70% training, 30% validation)\n",
|
|||
|
"train_df, val_df = train_test_split(train_df, test_size=0.3, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Print the split sizes\n",
|
|||
|
"print(\"Training set size:\", len(train_df))\n",
|
|||
|
"print(\"Validation set size:\", len(val_df))\n",
|
|||
|
"print(\"Test set size:\", len(test_df))\n",
|
|||
|
"\n",
|
|||
|
"# Categorical features\n",
|
|||
|
"categorical_features = [\n",
|
|||
|
" 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Race',\n",
|
|||
|
" 'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer'\n",
|
|||
|
"]\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encode the training set\n",
|
|||
|
"train_df_encoded = pd.get_dummies(train_df, columns=categorical_features)\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encode the validation set\n",
|
|||
|
"val_df_encoded = pd.get_dummies(val_df, columns=categorical_features)\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encode the test set\n",
|
|||
|
"test_df_encoded = pd.get_dummies(test_df, columns=categorical_features)\n",
|
|||
|
"\n",
|
|||
|
"# Define the EntitySet (the dataframe has no 'id' column, so featuretools creates an integer index)\n",
|
|||
|
"es = ft.EntitySet(id='heart_data')\n",
|
|||
|
"es = es.add_dataframe(dataframe_name='heart', dataframe=train_df_encoded, index='id')\n",
|
|||
|
"\n",
|
|||
|
"# Generate features with a reduced depth\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='heart', max_depth=1)\n",
|
|||
|
"\n",
|
|||
|
"# Compute the same features for the validation and test ids (note: es holds only the training dataframe, so the ids are looked up there)\n",
|
|||
|
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_df_encoded.index)\n",
|
|||
|
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_df_encoded.index)\n",
|
|||
|
"\n",
|
|||
|
"# Drop rows containing NaN\n",
|
|||
|
"feature_matrix = feature_matrix.dropna()\n",
|
|||
|
"val_feature_matrix = val_feature_matrix.dropna()\n",
|
|||
|
"test_feature_matrix = test_feature_matrix.dropna()\n",
|
|||
|
"\n",
|
|||
|
"# Separate features and target for the training, validation, and test sets\n",
|
|||
|
"X_train = feature_matrix.drop('HeartDisease', axis=1)\n",
|
|||
|
"y_train = feature_matrix['HeartDisease']\n",
|
|||
|
"X_val = val_feature_matrix.drop('HeartDisease', axis=1)\n",
|
|||
|
"y_val = val_feature_matrix['HeartDisease']\n",
|
|||
|
"X_test = test_feature_matrix.drop('HeartDisease', axis=1)\n",
|
|||
|
"y_test = test_feature_matrix['HeartDisease']\n",
|
|||
|
"\n",
|
|||
|
"# Выбор модели\n",
|
|||
|
"model = RandomForestClassifier(random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Обучение модели\n",
|
|||
|
"model.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Предсказание и оценка\n",
|
|||
|
"y_pred = model.predict(X_test)\n",
|
|||
|
"\n",
|
|||
|
"accuracy = accuracy_score(y_test, y_pred)\n",
|
|||
|
"precision = precision_score(y_test, y_pred)\n",
|
|||
|
"recall = recall_score(y_test, y_pred)\n",
|
|||
|
"f1 = f1_score(y_test, y_pred)\n",
|
|||
|
"roc_auc = roc_auc_score(y_test, y_pred)\n",
|
|||
|
"\n",
|
|||
|
"print(f\"Accuracy: {accuracy}\")\n",
|
|||
|
"print(f\"Precision: {precision}\")\n",
|
|||
|
"print(f\"Recall: {recall}\")\n",
|
|||
|
"print(f\"F1 Score: {f1}\")\n",
|
|||
|
"print(f\"ROC AUC: {roc_auc}\")\n",
|
|||
|
"\n",
|
|||
|
"# Кросс-валидация\n",
|
|||
|
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')\n",
|
|||
|
"accuracy_cv = scores.mean()\n",
|
|||
|
"print(f\"Cross-validated Accuracy: {accuracy_cv}\")\n",
|
|||
|
"\n",
|
|||
|
"# Анализ важности признаков\n",
|
|||
|
"feature_importances = model.feature_importances_\n",
|
|||
|
"feature_names = X_train.columns\n",
|
|||
|
"\n",
|
|||
|
"importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})\n",
|
|||
|
"importance_df = importance_df.sort_values(by='Importance', ascending=False)\n",
|
|||
|
"\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"sns.barplot(x='Importance', y='Feature', data=importance_df)\n",
|
|||
|
"plt.title('Feature Importance')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Проверка на переобучение\n",
|
|||
|
"y_train_pred = model.predict(X_train)\n",
|
|||
|
"\n",
|
|||
|
"accuracy_train = accuracy_score(y_train, y_train_pred)\n",
|
|||
|
"precision_train = precision_score(y_train, y_train_pred)\n",
|
|||
|
"recall_train = recall_score(y_train, y_train_pred)\n",
|
|||
|
"f1_train = f1_score(y_train, y_train_pred)\n",
|
|||
|
"roc_auc_train = roc_auc_score(y_train, y_train_pred)\n",
|
|||
|
"\n",
|
|||
|
"print(f\"Train Accuracy: {accuracy_train}\")\n",
|
|||
|
"print(f\"Train Precision: {precision_train}\")\n",
|
|||
|
"print(f\"Train Recall: {recall_train}\")\n",
|
|||
|
"print(f\"Train F1 Score: {f1_train}\")\n",
|
|||
|
"print(f\"Train ROC AUC: {roc_auc_train}\")\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация результатов\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
|
|||
|
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
|
|||
|
"plt.xlabel('Actual HeartDisease')\n",
|
|||
|
"plt.ylabel('Predicted HeartDisease')\n",
|
|||
|
"plt.title('Actual vs Predicted HeartDisease')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "aimenv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.8"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|