{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Business goals and 2 tasks to be solved:
\n", "Reduce the likelihood of stroke in high-risk patients through early detection of predisposition.
\n", "Optimize the medical services provided to patients according to their stroke risk.


\n", "Build a model that predicts the probability of stroke for a patient.
\n", "Identify the features that matter most for stroke risk analysis, so that medical staff can focus their efforts on the important factors.

" ] }, { "cell_type": "code", "execution_count": 330, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Количество колонок: 12\n", "Колонки: Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',\n", " 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',\n", " 'smoking_status', 'stroke'],\n", " dtype='object')\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Загрузка данных\n", "data = pd.read_csv('./csv/option4.csv')\n", "\n", "# Обзор данных\n", "print(\"Количество колонок:\", data.columns.size)\n", "print(\"Колонки:\", data.columns)" ] }, { "cell_type": "code", "execution_count": 331, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Наличие пропущенных значений:\n", "id 0\n", "gender 0\n", "age 0\n", "hypertension 0\n", "heart_disease 0\n", "ever_married 0\n", "work_type 0\n", "Residence_type 0\n", "avg_glucose_level 0\n", "bmi 201\n", "smoking_status 0\n", "stroke 0\n", "dtype: int64\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "print(\"\\nНаличие пропущенных значений:\")\n", "print(data.isnull().sum())\n", "\n", "print(\"\\n\\n\")\n", "\n", "print(data.describe)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Replace the missing (null) values in the bmi column with the column median
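
A minimal sketch of this fill step on a toy frame, assuming the median is the intended statistic (it is what the notebook's code cell calls). The `toy` frame and its values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({'bmi': [22.0, np.nan, 30.0, 41.0, np.nan]})

# The median is less sensitive to extreme values than the mean,
# which matters for a skewed column such as bmi
bmi_median = toy['bmi'].median()
toy['bmi'] = toy['bmi'].fillna(bmi_median)

# Count the missing values that remain after the fill
missing_after = int(toy['bmi'].isnull().sum())
print(bmi_median, missing_after)
```

Computing the median before the fill keeps the statistic based only on observed values.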

" ] }, { "cell_type": "code", "execution_count": 332, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Наличие пропущенных значений:\n", "id 0\n", "gender 0\n", "age 0\n", "hypertension 0\n", "heart_disease 0\n", "ever_married 0\n", "work_type 0\n", "Residence_type 0\n", "avg_glucose_level 0\n", "bmi 0\n", "smoking_status 0\n", "stroke 0\n", "dtype: int64\n" ] } ], "source": [ "data['bmi'] = data['bmi'].fillna(data['bmi'].median())\n", "print(\"\\nНаличие пропущенных значений:\")\n", "print(data.isnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Let's look at the outliers:

" ] }, { "cell_type": "code", "execution_count": 333, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_numeric_boxplots(dataframe):\n", " # Фильтрация числовых столбцов\n", " numeric_columns = ['age', 'avg_glucose_level', 'bmi']\n", " \n", " # Построение графиков\n", " if numeric_columns:\n", " plt.figure(figsize=(15, 5))\n", " \n", " for i, col in enumerate(numeric_columns):\n", " if col != 'id':\n", " plt.subplot(1, len(numeric_columns), i + 1)\n", " sns.boxplot(y=dataframe[col])\n", " plt.title(f'{col}')\n", " plt.ylabel('')\n", " plt.xlabel(col)\n", " \n", " plt.tight_layout()\n", " plt.show()\n", " else:\n", " print(\"Нет подходящих числовых столбцов для построения графиков.\")\n", "\n", "plot_numeric_boxplots(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

We see outliers in the average glucose level column and in the bmi (body mass index) column. We handle them by capping values at lower and upper bounds.
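
The capping described here can be sketched with `Series.clip` and the usual 1.5 * IQR fences. The helper name `clip_iqr`, the factor `k`, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def clip_iqr(series, k=1.5):
    # Quartiles and interquartile range of the column
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Values outside the fences are replaced by the fence itself, not dropped
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# One obvious outlier (100.0) among otherwise similar values
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
clipped = clip_iqr(toy)
print(clipped.tolist())
```

Capping keeps every row in the dataset, at the cost of piling extreme values up at the bounds.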

" ] }, { "cell_type": "code", "execution_count": 334, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def remove_outliers(df):\n", "\n", " numeric_columns = ['age', 'avg_glucose_level', 'bmi']\n", " for column in numeric_columns:\n", " Q1 = df[column].quantile(0.25)\n", " Q3 = df[column].quantile(0.75)\n", " IQR = Q3 - Q1\n", " lower_bound = Q1 - 1.5 * IQR\n", " upper_bound = Q3 + 1.5 * IQR\n", " df[column] = df[column].apply(lambda x: lower_bound if x < lower_bound else upper_bound if x > upper_bound else x)\n", " return df\n", " \n", "data = remove_outliers(data)\n", "plot_numeric_boxplots(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Now we can move on to feature engineering, since the data is balanced (across the samples).

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

One-hot encoding of categorical features

Applied to the categorical (non-numeric) features: 'gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'

" ] }, { "cell_type": "code", "execution_count": 335, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Данные после унитарного кодирования:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idagehypertensionheart_diseaseavg_glucose_levelbmistrokegender_Malegender_Otherever_married_Yeswork_type_Never_workedwork_type_Privatework_type_Self-employedwork_type_childrenResidence_type_Urbansmoking_status_formerly smokedsmoking_status_never smokedsmoking_status_smokes
0904667.001169.357536.61TrueFalseTrueFalseTrueFalseFalseTrueTrueFalseFalse
15167661.000169.357528.11FalseFalseTrueFalseFalseTrueFalseFalseFalseTrueFalse
23111280.001105.920032.51TrueFalseTrueFalseTrueFalseFalseFalseFalseTrueFalse
36018249.000169.357534.41FalseFalseTrueFalseTrueFalseFalseTrueFalseFalseTrue
4166579.010169.357524.01FalseFalseTrueFalseFalseTrueFalseFalseFalseTrueFalse
55666981.000169.357529.01TrueFalseTrueFalseTrueFalseFalseTrueTrueFalseFalse
65388274.01170.090027.41TrueFalseTrueFalseTrueFalseFalseFalseFalseTrueFalse
71043469.00094.390022.81FalseFalseFalseFalseTrueFalseFalseTrueFalseTrueFalse
82741959.00076.150028.11FalseFalseTrueFalseTrueFalseFalseFalseFalseFalseFalse
96049178.00058.570024.21FalseFalseTrueFalseTrueFalseFalseTrueFalseFalseFalse
\n", "
" ], "text/plain": [ " id age hypertension heart_disease avg_glucose_level bmi stroke \\\n", "0 9046 67.0 0 1 169.3575 36.6 1 \n", "1 51676 61.0 0 0 169.3575 28.1 1 \n", "2 31112 80.0 0 1 105.9200 32.5 1 \n", "3 60182 49.0 0 0 169.3575 34.4 1 \n", "4 1665 79.0 1 0 169.3575 24.0 1 \n", "5 56669 81.0 0 0 169.3575 29.0 1 \n", "6 53882 74.0 1 1 70.0900 27.4 1 \n", "7 10434 69.0 0 0 94.3900 22.8 1 \n", "8 27419 59.0 0 0 76.1500 28.1 1 \n", "9 60491 78.0 0 0 58.5700 24.2 1 \n", "\n", " gender_Male gender_Other ever_married_Yes work_type_Never_worked \\\n", "0 True False True False \n", "1 False False True False \n", "2 True False True False \n", "3 False False True False \n", "4 False False True False \n", "5 True False True False \n", "6 True False True False \n", "7 False False False False \n", "8 False False True False \n", "9 False False True False \n", "\n", " work_type_Private work_type_Self-employed work_type_children \\\n", "0 True False False \n", "1 False True False \n", "2 True False False \n", "3 True False False \n", "4 False True False \n", "5 True False False \n", "6 True False False \n", "7 True False False \n", "8 True False False \n", "9 True False False \n", "\n", " Residence_type_Urban smoking_status_formerly smoked \\\n", "0 True True \n", "1 False False \n", "2 False False \n", "3 True False \n", "4 False False \n", "5 True True \n", "6 False False \n", "7 True False \n", "8 False False \n", "9 True False \n", "\n", " smoking_status_never smoked smoking_status_smokes \n", "0 False False \n", "1 True False \n", "2 True False \n", "3 False True \n", "4 True False \n", "5 False False \n", "6 True False \n", "7 True False \n", "8 False False \n", "9 False False " ] }, "execution_count": 335, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One-Hot Encoding\n", "categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']\n", "data_edit_categories = pd.get_dummies(data, columns=categorical_columns, 
drop_first=True)\n", "\n", "print(\"Data after one-hot encoding:\")\n", "data_edit_categories.head(10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Discretization of numeric features

Numeric features such as 'age', 'avg_glucose_level', and 'bmi' can be split into categories (binning).
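
A possible sketch of binning with `pd.cut`, run on unscaled values. The bin edges and labels here are illustrative assumptions rather than clinically validated thresholds, and the `toy` frame is made up.

```python
import pandas as pd

# Toy values standing in for the real columns (before any scaling)
toy = pd.DataFrame({
    'age': [5.0, 25.0, 45.0, 67.0],
    'bmi': [17.0, 22.5, 28.1, 36.6],
})

# pd.cut uses right-closed intervals by default, e.g. (0, 18], (18, 30], ...
toy['age_bins'] = pd.cut(
    toy['age'],
    bins=[0, 18, 30, 50, 100],
    labels=['child', 'young', 'middle-aged', 'senior'],
)
toy['bmi_bins'] = pd.cut(
    toy['bmi'],
    bins=[0, 18.5, 25, 30, 50],
    labels=['underweight', 'normal', 'overweight', 'obese'],
)
print(toy[['age_bins', 'bmi_bins']])
```

Values that fall outside the outermost edges become NaN, so the edges should cover the full observed range of each column.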

\n" ] }, { "cell_type": "code", "execution_count": 336, "metadata": {}, "outputs": [], "source": [ "# data_edit_categories['age_bins'] = pd.cut(data_edit_categories['age'], bins=[0, 18, 30, 50, 100], labels=['child', 'young', 'middle-aged', 'senior'])\n", "# data_edit_categories['bmi_bins'] = pd.cut(data_edit_categories['bmi'], bins=[0, 18.5, 25, 30, 50], labels=['underweight', 'normal', 'overweight', 'obese'])\n", "\n", "# print(\"Data after discretization:\")\n", "# data_edit_categories[['age_bins', 'bmi_bins']].head(10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Manual synthesis of new features

\n", "

  • Age-glucose index: age * avg_glucose_level\n", "
  • Glucose-adjusted body mass index: bmi / avg_glucose_level

    \n" ] }, { "cell_type": "code", "execution_count": 337, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Данные после синтеза новых признаков:\n" ] }, { "data": { "text/html": [ "
    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    age_glucose_indexbmi_glucose_ratio
    011346.95250.216111
    110330.80750.165921
    28473.60000.306835
    38298.51750.203121
    413379.24250.141712
    513717.95750.171235
    65186.66000.390926
    76512.91000.241551
    84492.85000.369009
    94568.46000.413181
    \n", "
    " ], "text/plain": [ " age_glucose_index bmi_glucose_ratio\n", "0 11346.9525 0.216111\n", "1 10330.8075 0.165921\n", "2 8473.6000 0.306835\n", "3 8298.5175 0.203121\n", "4 13379.2425 0.141712\n", "5 13717.9575 0.171235\n", "6 5186.6600 0.390926\n", "7 6512.9100 0.241551\n", "8 4492.8500 0.369009\n", "9 4568.4600 0.413181" ] }, "execution_count": 337, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_edit_categories['age_glucose_index'] = data_edit_categories['age'] * data_edit_categories['avg_glucose_level']\n", "data_edit_categories['bmi_glucose_ratio'] = data_edit_categories['bmi'] / data_edit_categories['avg_glucose_level']\n", "\n", "print(\"Данные после синтеза новых признаков:\")\n", "data_edit_categories[['age_glucose_index', 'bmi_glucose_ratio']].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

    Feature scaling

    Normalization compresses values into the range [0, 1]; standardization brings them to mean 0 and standard deviation 1. The cell below applies normalization only, with standardization left commented out.
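
The two transforms can be written out by hand to show what each one computes. This is a sketch on a made-up array `x`; the formulas correspond to the default behaviour of scikit-learn's `MinMaxScaler` and `StandardScaler`, which use the range [0, 1] and the population standard deviation respectively.

```python
import numpy as np

# Made-up feature values (illustrative only)
x = np.array([20.0, 40.0, 60.0, 80.0])

# Normalization (min-max): maps the smallest value to 0 and the largest to 1
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation;
# np.std defaults to the population formula (ddof=0)
standardized = (x - x.mean()) / x.std()

print(normalized)
print(standardized)
```

Min-max scaling is sensitive to extreme values, which is one reason to cap outliers before applying it.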

    " ] }, { "cell_type": "code", "execution_count": 338, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Данные после нормализации:\n", "\n" ] }, { "data": { "text/html": [ "
    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    idagehypertensionheart_diseaseavg_glucose_levelbmistrokegender_Malegender_Otherever_married_Yeswork_type_Never_workedwork_type_Privatework_type_Self-employedwork_type_childrenResidence_type_Urbansmoking_status_formerly smokedsmoking_status_never smokedsmoking_status_smokesage_glucose_indexbmi_glucose_ratio
    090460.816895011.0000000.7305561TrueFalseTrueFalseTrueFalseFalseTrueTrueFalseFalse11346.95250.216111
    1516760.743652001.0000000.4944441FalseFalseTrueFalseFalseTrueFalseFalseFalseTrueFalse10330.80750.165921
    2311120.975586010.4446880.6166671TrueFalseTrueFalseTrueFalseFalseFalseFalseTrueFalse8473.60000.306835
    3601820.597168001.0000000.6694441FalseFalseTrueFalseTrueFalseFalseTrueFalseFalseTrue8298.51750.203121
    416650.963379101.0000000.3805561FalseFalseTrueFalseFalseTrueFalseFalseFalseTrueFalse13379.24250.141712
    ...............................................................
    5105182340.975586100.2506180.4944440FalseFalseTrueFalseTrueFalseFalseTrueFalseTrueFalse6700.00000.335522
    5106448730.987793000.6134590.8250000FalseFalseTrueFalseFalseTrueFalseTrueFalseTrueFalse10141.20000.319489
    5107197230.426270000.2439650.5638890FalseFalseTrueFalseFalseTrueFalseFalseFalseTrueFalse2904.65000.368719
    5108375440.621582000.9731480.4250000TrueFalseTrueFalseTrueFalseFalseFalseTrueFalseFalse8480.79000.153948
    5109446790.536133000.2640110.4416670FalseFalseTrueFalseFalseFalseFalseTrueFalseFalseFalse3752.32000.307223
    \n", "

    5110 rows × 20 columns

    \n", "
    " ], "text/plain": [ " id age hypertension heart_disease avg_glucose_level \\\n", "0 9046 0.816895 0 1 1.000000 \n", "1 51676 0.743652 0 0 1.000000 \n", "2 31112 0.975586 0 1 0.444688 \n", "3 60182 0.597168 0 0 1.000000 \n", "4 1665 0.963379 1 0 1.000000 \n", "... ... ... ... ... ... \n", "5105 18234 0.975586 1 0 0.250618 \n", "5106 44873 0.987793 0 0 0.613459 \n", "5107 19723 0.426270 0 0 0.243965 \n", "5108 37544 0.621582 0 0 0.973148 \n", "5109 44679 0.536133 0 0 0.264011 \n", "\n", " bmi stroke gender_Male gender_Other ever_married_Yes \\\n", "0 0.730556 1 True False True \n", "1 0.494444 1 False False True \n", "2 0.616667 1 True False True \n", "3 0.669444 1 False False True \n", "4 0.380556 1 False False True \n", "... ... ... ... ... ... \n", "5105 0.494444 0 False False True \n", "5106 0.825000 0 False False True \n", "5107 0.563889 0 False False True \n", "5108 0.425000 0 True False True \n", "5109 0.441667 0 False False True \n", "\n", " work_type_Never_worked work_type_Private work_type_Self-employed \\\n", "0 False True False \n", "1 False False True \n", "2 False True False \n", "3 False True False \n", "4 False False True \n", "... ... ... ... \n", "5105 False True False \n", "5106 False False True \n", "5107 False False True \n", "5108 False True False \n", "5109 False False False \n", "\n", " work_type_children Residence_type_Urban \\\n", "0 False True \n", "1 False False \n", "2 False False \n", "3 False True \n", "4 False False \n", "... ... ... \n", "5105 False True \n", "5106 False True \n", "5107 False False \n", "5108 False False \n", "5109 False True \n", "\n", " smoking_status_formerly smoked smoking_status_never smoked \\\n", "0 True False \n", "1 False True \n", "2 False True \n", "3 False False \n", "4 False True \n", "... ... ... 
\n", "5105 False True \n", "5106 False True \n", "5107 False True \n", "5108 True False \n", "5109 False False \n", "\n", " smoking_status_smokes age_glucose_index bmi_glucose_ratio \n", "0 False 11346.9525 0.216111 \n", "1 False 10330.8075 0.165921 \n", "2 False 8473.6000 0.306835 \n", "3 True 8298.5175 0.203121 \n", "4 False 13379.2425 0.141712 \n", "... ... ... ... \n", "5105 False 6700.0000 0.335522 \n", "5106 False 10141.2000 0.319489 \n", "5107 False 2904.6500 0.368719 \n", "5108 False 8480.7900 0.153948 \n", "5109 False 3752.3200 0.307223 \n", "\n", "[5110 rows x 20 columns]" ] }, "execution_count": 338, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n", "\n", "scaler = MinMaxScaler()\n", "standardizer = StandardScaler()\n", "\n", "# Нормализация\n", "data_edit_categories[['age', 'avg_glucose_level', 'bmi']] = scaler.fit_transform(data_edit_categories[['age', 'avg_glucose_level', 'bmi']])\n", "print(\"Данные после нормализации:\\n\")\n", "data_edit_categories\n", "\n", "\n", "# # Стандартизация\n", "# X_encoded[['age', 'avg_glucose_level', 'bmi']] = standardizer.fit_transform(X_encoded[['age', 'avg_glucose_level', 'bmi']])\n", "# print(\"Данные после стандартизации:\\n\", X_encoded.head(10))\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

    Feature engineering with the Featuretools framework

    " ] }, { "cell_type": "code", "execution_count": 339, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Столбцы в data: ['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke']\n", "id 0\n", "gender 0\n", "age 0\n", "hypertension 0\n", "heart_disease 0\n", "ever_married 0\n", "work_type 0\n", "Residence_type 0\n", "avg_glucose_level 0\n", "bmi 0\n", "smoking_status 0\n", "stroke 0\n", "dtype: int64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. 
To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. 
To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Сгенерированные признаки:\n", " gender age hypertension heart_disease ever_married work_type \\\n", "id \n", "9046 Male 67.0 0 1 True Private \n", "51676 Female 61.0 0 0 True Self-employed \n", "31112 Male 80.0 0 1 True Private \n", "60182 Female 49.0 0 0 True Private \n", "1665 Female 79.0 1 0 True Self-employed \n", "\n", " Residence_type avg_glucose_level bmi smoking_status stroke \n", "id \n", "9046 Urban 169.3575 36.6 formerly smoked 1 \n", "51676 Rural 169.3575 28.1 never smoked 1 \n", "31112 Rural 105.9200 32.5 never smoked 1 \n", "60182 Urban 169.3575 34.4 smokes 1 \n", "1665 Rural 169.3575 24.0 never smoked 1 \n" ] }, { "data": { "text/html": [ "
    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    genderagehypertensionheart_diseaseever_marriedwork_typeResidence_typeavg_glucose_levelbmismoking_statusstroke
    id
    9046Male67.001TruePrivateUrban169.357536.6formerly smoked1
    51676Female61.000TrueSelf-employedRural169.357528.1never smoked1
    31112Male80.001TruePrivateRural105.920032.5never smoked1
    60182Female49.000TruePrivateUrban169.357534.4smokes1
    1665Female79.010TrueSelf-employedRural169.357524.0never smoked1
    ....................................
    18234Female80.010TruePrivateUrban83.750028.1never smoked0
    44873Female81.000TrueSelf-employedUrban125.200040.0never smoked0
    19723Female35.000TrueSelf-employedRural82.990030.6never smoked0
    37544Male51.000TruePrivateRural166.290025.6formerly smoked0
    44679Female44.000TrueGovt_jobUrban85.280026.2Unknown0
    \n", "

    5110 rows × 11 columns

    \n", "
    " ], "text/plain": [ " gender age hypertension heart_disease ever_married work_type \\\n", "id \n", "9046 Male 67.0 0 1 True Private \n", "51676 Female 61.0 0 0 True Self-employed \n", "31112 Male 80.0 0 1 True Private \n", "60182 Female 49.0 0 0 True Private \n", "1665 Female 79.0 1 0 True Self-employed \n", "... ... ... ... ... ... ... \n", "18234 Female 80.0 1 0 True Private \n", "44873 Female 81.0 0 0 True Self-employed \n", "19723 Female 35.0 0 0 True Self-employed \n", "37544 Male 51.0 0 0 True Private \n", "44679 Female 44.0 0 0 True Govt_job \n", "\n", " Residence_type avg_glucose_level bmi smoking_status stroke \n", "id \n", "9046 Urban 169.3575 36.6 formerly smoked 1 \n", "51676 Rural 169.3575 28.1 never smoked 1 \n", "31112 Rural 105.9200 32.5 never smoked 1 \n", "60182 Urban 169.3575 34.4 smokes 1 \n", "1665 Rural 169.3575 24.0 never smoked 1 \n", "... ... ... ... ... ... \n", "18234 Urban 83.7500 28.1 never smoked 0 \n", "44873 Urban 125.2000 40.0 never smoked 0 \n", "19723 Rural 82.9900 30.6 never smoked 0 \n", "37544 Rural 166.2900 25.6 formerly smoked 0 \n", "44679 Urban 85.2800 26.2 Unknown 0 \n", "\n", "[5110 rows x 11 columns]" ] }, "execution_count": 339, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import featuretools as ft\n", "\n", "print(\"Столбцы в data:\", data.columns.tolist())\n", "print(data.isnull().sum())\n", "\n", "# Создание EntitySet (основная структура для Featuretools)\n", "entity = ft.EntitySet(id=\"stroke_prediction\")\n", "\n", "entity = entity.add_dataframe(\n", " dataframe_name=\"data\", \n", " dataframe=data, \n", " index=\"id\",\n", ")\n", "\n", "# Генерация новых признаков\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=entity,\n", " target_dataframe_name=\"data\", # Основная таблица\n", " max_depth=2 # Уровень вложенности\n", ")\n", "\n", "print(\"Сгенерированные признаки:\")\n", "print(feature_matrix.head())\n", "\n", "# Сохранение результатов\n", 
"feature_matrix.to_csv(\"./csv/generated_features_copy.csv\", index=False)\n", "feature_matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

    Now let's split the data into training, test, and validation samples
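
One way to obtain an 80/10/10 split is two successive calls to scikit-learn's `train_test_split`, stratified on the target so that each part keeps the same share of positive cases. This is a sketch under assumptions: the proportions, the `random_state`, and the toy data are illustrative, and it is not necessarily how the notebook performs its own split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with 5% positive cases (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.uniform(0, 1, 1000),
    'stroke': [1] * 50 + [0] * 950,
})
X = toy.drop(columns=['stroke'])
y = toy['stroke']

# Step 1: hold out 20% of the rows, keeping the class ratio
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Step 2: split the held-out part in half (10% validation, 10% test)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Stratifying matters most when the positive class is rare, because an unstratified split can leave a small sample with very few positive cases.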

    " ] }, { "cell_type": "code", "execution_count": 340, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Размеры выборок:\n", "Обучающая выборка: (4088, 18)\n", "Тестовая выборка: (511, 18)\n", "Контрольная выборка: (511, 18)\n" ] }, { "data": { "text/html": [ "
<table>
<thead>
<tr><th></th><th>id</th><th>age</th><th>hypertension</th><th>heart_disease</th><th>avg_glucose_level</th><th>bmi</th><th>stroke</th><th>gender_Male</th><th>gender_Other</th><th>ever_married_Yes</th><th>work_type_Never_worked</th><th>work_type_Private</th><th>work_type_Self-employed</th><th>work_type_children</th><th>Residence_type_Urban</th><th>smoking_status_formerly smoked</th><th>smoking_status_never smoked</th><th>smoking_status_smokes</th><th>age_glucose_index</th><th>bmi_glucose_ratio</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>9046</td><td>0.816895</td><td>0</td><td>1</td><td>1.000000</td><td>0.730556</td><td>1</td><td>True</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>True</td><td>False</td><td>False</td><td>11346.9525</td><td>0.216111</td></tr>
<tr><th>1</th><td>51676</td><td>0.743652</td><td>0</td><td>0</td><td>1.000000</td><td>0.494444</td><td>1</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td><td>10330.8075</td><td>0.165921</td></tr>
<tr><th>2</th><td>31112</td><td>0.975586</td><td>0</td><td>1</td><td>0.444688</td><td>0.616667</td><td>1</td><td>True</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td><td>8473.6000</td><td>0.306835</td></tr>
<tr><th>3</th><td>60182</td><td>0.597168</td><td>0</td><td>0</td><td>1.000000</td><td>0.669444</td><td>1</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>8298.5175</td><td>0.203121</td></tr>
<tr><th>4</th><td>1665</td><td>0.963379</td><td>1</td><td>0</td><td>1.000000</td><td>0.380556</td><td>1</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td><td>13379.2425</td><td>0.141712</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>5105</th><td>18234</td><td>0.975586</td><td>1</td><td>0</td><td>0.250618</td><td>0.494444</td><td>0</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>6700.0000</td><td>0.335522</td></tr>
<tr><th>5106</th><td>44873</td><td>0.987793</td><td>0</td><td>0</td><td>0.613459</td><td>0.825000</td><td>0</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>10141.2000</td><td>0.319489</td></tr>
<tr><th>5107</th><td>19723</td><td>0.426270</td><td>0</td><td>0</td><td>0.243965</td><td>0.563889</td><td>0</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td><td>2904.6500</td><td>0.368719</td></tr>
<tr><th>5108</th><td>37544</td><td>0.621582</td><td>0</td><td>0</td><td>0.973148</td><td>0.425000</td><td>0</td><td>True</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>8480.7900</td><td>0.153948</td></tr>
<tr><th>5109</th><td>44679</td><td>0.536133</td><td>0</td><td>0</td><td>0.264011</td><td>0.441667</td><td>0</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>3752.3200</td><td>0.307223</td></tr>
</tbody>
</table>
<p>5110 rows × 20 columns</p>
    " ], "text/plain": [ " id age hypertension heart_disease avg_glucose_level \\\n", "0 9046 0.816895 0 1 1.000000 \n", "1 51676 0.743652 0 0 1.000000 \n", "2 31112 0.975586 0 1 0.444688 \n", "3 60182 0.597168 0 0 1.000000 \n", "4 1665 0.963379 1 0 1.000000 \n", "... ... ... ... ... ... \n", "5105 18234 0.975586 1 0 0.250618 \n", "5106 44873 0.987793 0 0 0.613459 \n", "5107 19723 0.426270 0 0 0.243965 \n", "5108 37544 0.621582 0 0 0.973148 \n", "5109 44679 0.536133 0 0 0.264011 \n", "\n", " bmi stroke gender_Male gender_Other ever_married_Yes \\\n", "0 0.730556 1 True False True \n", "1 0.494444 1 False False True \n", "2 0.616667 1 True False True \n", "3 0.669444 1 False False True \n", "4 0.380556 1 False False True \n", "... ... ... ... ... ... \n", "5105 0.494444 0 False False True \n", "5106 0.825000 0 False False True \n", "5107 0.563889 0 False False True \n", "5108 0.425000 0 True False True \n", "5109 0.441667 0 False False True \n", "\n", " work_type_Never_worked work_type_Private work_type_Self-employed \\\n", "0 False True False \n", "1 False False True \n", "2 False True False \n", "3 False True False \n", "4 False False True \n", "... ... ... ... \n", "5105 False True False \n", "5106 False False True \n", "5107 False False True \n", "5108 False True False \n", "5109 False False False \n", "\n", " work_type_children Residence_type_Urban \\\n", "0 False True \n", "1 False False \n", "2 False False \n", "3 False True \n", "4 False False \n", "... ... ... \n", "5105 False True \n", "5106 False True \n", "5107 False False \n", "5108 False False \n", "5109 False True \n", "\n", " smoking_status_formerly smoked smoking_status_never smoked \\\n", "0 True False \n", "1 False True \n", "2 False True \n", "3 False False \n", "4 False True \n", "... ... ... 
\n", "5105 False True \n", "5106 False True \n", "5107 False True \n", "5108 True False \n", "5109 False False \n", "\n", " smoking_status_smokes age_glucose_index bmi_glucose_ratio \n", "0 False 11346.9525 0.216111 \n", "1 False 10330.8075 0.165921 \n", "2 False 8473.6000 0.306835 \n", "3 True 8298.5175 0.203121 \n", "4 False 13379.2425 0.141712 \n", "... ... ... ... \n", "5105 False 6700.0000 0.335522 \n", "5106 False 10141.2000 0.319489 \n", "5107 False 2904.6500 0.368719 \n", "5108 False 8480.7900 0.153948 \n", "5109 False 3752.3200 0.307223 \n", "\n", "[5110 rows x 20 columns]" ] }, "execution_count": 340, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Define the features and the target variable\n", "\n", "# data_edit_categories = pd.read_csv('./csv/generated_features_copy.csv')\n", "\n", "\n", "X = data_edit_categories.drop(columns=['id', 'stroke'])\n", "y = data_edit_categories['stroke']\n", "\n", "# Training set (80% of the data).\n", "# Note: random_state=None means the split changes on every run.\n", "X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=None, stratify=y)\n", "\n", "# Test and control sets (10% of the data each)\n", "X_test, X_control, y_test, y_control = train_test_split(X_temp, y_temp, test_size=0.5, random_state=None, stratify=y_temp)\n", "\n", "print(\"\\nSplit sizes:\")\n", "print(f\"Training set: {X_train.shape}\")\n", "print(f\"Test set: {X_test.shape}\")\n", "print(f\"Control set: {X_control.shape}\")\n", "\n", "data_edit_categories\n" ] }, { "cell_type": "code", "execution_count": 341, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "stroke\n", "0 4861\n", "1 249\n", "Name: count, dtype: int64\n" ] }, { "data": { "image/png": "", "text/plain": [ "
    " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "# Count the number of samples in each class\n", "class_counts = y.value_counts()\n", "print(class_counts)\n", "\n", "# Visualization\n", "sns.barplot(x=class_counts.index, y=class_counts.values)\n", "plt.title(\"Class distribution (stroke)\")\n", "plt.xlabel(\"Class\")\n", "plt.ylabel(\"Count\")\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

    Let's write a function that rebalances the training data: first oversample the minority class, then undersample the majority class.

    " ] }, { "cell_type": "code", "execution_count": 342, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Class counts BEFORE resampling in the TRAINING SET (80% of the data)\n", "\n", "stroke\n", "0 3889\n", "1 199\n", "Name: count, dtype: int64\n", "\n", "After oversampling\n", "\n", "stroke\n", "0 3889\n", "1 777\n", "Name: count, dtype: int64\n", "\n", "After balancing the data (undersampling)\n", "\n", "stroke\n", "0 777\n", "1 777\n", "Name: count, dtype: int64\n" ] }, { "data": { "image/png": "", "text/plain": [ "
    " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<table>
<thead>
<tr><th></th><th>age</th><th>hypertension</th><th>heart_disease</th><th>avg_glucose_level</th><th>bmi</th><th>gender_Male</th><th>gender_Other</th><th>ever_married_Yes</th><th>work_type_Never_worked</th><th>work_type_Private</th><th>work_type_Self-employed</th><th>work_type_children</th><th>Residence_type_Urban</th><th>smoking_status_formerly smoked</th><th>smoking_status_never smoked</th><th>smoking_status_smokes</th><th>age_glucose_index</th><th>bmi_glucose_ratio</th></tr>
</thead>
<tbody>
<tr><th>2508</th><td>0.316406</td><td>0</td><td>0</td><td>0.176562</td><td>0.341667</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>True</td><td>1957.540</td><td>0.300173</td></tr>
<tr><th>2435</th><td>0.768066</td><td>0</td><td>0</td><td>0.351636</td><td>0.591667</td><td>True</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>6003.270</td><td>0.331619</td></tr>
<tr><th>2547</th><td>0.060059</td><td>0</td><td>0</td><td>0.250618</td><td>0.216667</td><td>True</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>True</td><td>True</td><td>False</td><td>False</td><td>False</td><td>418.750</td><td>0.216119</td></tr>
<tr><th>3885</th><td>0.914551</td><td>0</td><td>0</td><td>0.342882</td><td>0.691667</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>7071.750</td><td>0.373316</td></tr>
<tr><th>335</th><td>0.426270</td><td>0</td><td>0</td><td>0.500974</td><td>0.544444</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>3932.250</td><td>0.266133</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>4661</th><td>0.853516</td><td>1</td><td>0</td><td>1.000000</td><td>0.977778</td><td>True</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>11855.025</td><td>0.268662</td></tr>
<tr><th>4662</th><td>0.926758</td><td>0</td><td>0</td><td>0.024510</td><td>0.494444</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>True</td><td>False</td><td>False</td><td>4401.920</td><td>0.485152</td></tr>
<tr><th>4663</th><td>0.682617</td><td>0</td><td>0</td><td>1.000000</td><td>0.836111</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>True</td><td>False</td><td>False</td><td>9484.020</td><td>0.238549</td></tr>
<tr><th>4664</th><td>0.768066</td><td>0</td><td>0</td><td>0.313207</td><td>0.494444</td><td>False</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>True</td><td>False</td><td>False</td><td>5726.700</td><td>0.309131</td></tr>
<tr><th>4665</th><td>0.902344</td><td>0</td><td>0</td><td>0.156166</td><td>0.583333</td><td>True</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td><td>True</td><td>5399.040</td><td>0.429002</td></tr>
</tbody>
</table>
<p>1554 rows × 18 columns</p>
    " ], "text/plain": [ " age hypertension heart_disease avg_glucose_level bmi \\\n", "2508 0.316406 0 0 0.176562 0.341667 \n", "2435 0.768066 0 0 0.351636 0.591667 \n", "2547 0.060059 0 0 0.250618 0.216667 \n", "3885 0.914551 0 0 0.342882 0.691667 \n", "335 0.426270 0 0 0.500974 0.544444 \n", "... ... ... ... ... ... \n", "4661 0.853516 1 0 1.000000 0.977778 \n", "4662 0.926758 0 0 0.024510 0.494444 \n", "4663 0.682617 0 0 1.000000 0.836111 \n", "4664 0.768066 0 0 0.313207 0.494444 \n", "4665 0.902344 0 0 0.156166 0.583333 \n", "\n", " gender_Male gender_Other ever_married_Yes work_type_Never_worked \\\n", "2508 False False True False \n", "2435 True False True False \n", "2547 True False False False \n", "3885 True False True False \n", "335 False False True False \n", "... ... ... ... ... \n", "4661 True False True False \n", "4662 False False True False \n", "4663 False False True False \n", "4664 False False True False \n", "4665 True False True False \n", "\n", " work_type_Private work_type_Self-employed work_type_children \\\n", "2508 True False False \n", "2435 True False False \n", "2547 False False True \n", "3885 False False False \n", "335 True False False \n", "... ... ... ... \n", "4661 True False False \n", "4662 True False False \n", "4663 True False False \n", "4664 True False False \n", "4665 True False False \n", "\n", " Residence_type_Urban smoking_status_formerly smoked \\\n", "2508 False False \n", "2435 True False \n", "2547 True False \n", "3885 True False \n", "335 True False \n", "... ... ... \n", "4661 False True \n", "4662 True True \n", "4663 True True \n", "4664 True True \n", "4665 True False \n", "\n", " smoking_status_never smoked smoking_status_smokes age_glucose_index \\\n", "2508 False True 1957.540 \n", "2435 False True 6003.270 \n", "2547 False False 418.750 \n", "3885 False False 7071.750 \n", "335 False False 3932.250 \n", "... ... ... ... 
\n", "4661 False False 11855.025 \n", "4662 False False 4401.920 \n", "4663 False False 9484.020 \n", "4664 False False 5726.700 \n", "4665 False True 5399.040 \n", "\n", " bmi_glucose_ratio \n", "2508 0.300173 \n", "2435 0.331619 \n", "2547 0.216119 \n", "3885 0.373316 \n", "335 0.266133 \n", "... ... \n", "4661 0.268662 \n", "4662 0.485152 \n", "4663 0.238549 \n", "4664 0.309131 \n", "4665 0.429002 \n", "\n", "[1554 rows x 18 columns]" ] }, "execution_count": 342, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "import matplotlib.pyplot as plt\n", "from imblearn.over_sampling import RandomOverSampler\n", "from imblearn.under_sampling import RandomUnderSampler\n", "\n", "def over_under_sampling(x_selection, y_selection):\n", "\n", "    # First oversample the minority class up to 20% of the majority class\n", "\n", "    oversampler = RandomOverSampler(sampling_strategy=0.2, random_state=42)\n", "    x_over, y_over = oversampler.fit_resample(x_selection, y_selection)\n", "\n", "    print(\"\\nAfter oversampling\\n\")\n", "    print(y_over.value_counts())\n", "\n", "    # Then undersample the majority class down to a 1:1 ratio\n", "\n", "    undersampler = RandomUnderSampler(sampling_strategy=1.0, random_state=42)\n", "    x_balanced, y_balanced = undersampler.fit_resample(x_over, y_over)\n", "\n", "    print(\"\\nAfter balancing the data (undersampling)\\n\")\n", "    balanced_counts = y_balanced.value_counts()\n", "    print(balanced_counts)\n", "\n", "    plt.pie(\n", "        balanced_counts,\n", "        labels=balanced_counts.index,  # Class labels (0 and 1), taken from the data being plotted\n", "        autopct='%1.1f%%',  # Show percentages\n", "        colors=['lightgreen', 'lightcoral'],  # Colors for the classes\n", "        startangle=45,  # Rotate the chart\n", "        explode=(0, 0.05)  # Slightly offset the second slice\n", "    )\n", "    plt.title(\"Class distribution (stroke)\")\n", "    plt.show()\n", "    return x_balanced, y_balanced\n", "\n", "print(\"Class counts BEFORE resampling in the TRAINING SET (80% of the data)\\n\")\n", "print(y_train.value_counts())\n", "X_train, y_train = over_under_sampling(X_train, y_train)\n", "\n", "X_train\n", "\n", "# print(\"Class counts BEFORE resampling in the TEST SET (10% of the data)\\n\")\n", "# print(y_test.value_counts())\n", "# over_under_sampling(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

    Now it is time to evaluate the model's quality.

    " ] }, { "cell_type": "code", "execution_count": 343, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model training time: 0.25 seconds\n", "ROC-AUC: 0.84\n", "F1-Score: 0.29\n", "Confusion matrix:\n", "[[434 52]\n", " [ 12 13]]\n", "Classification report:\n", " precision recall f1-score support\n", "\n", " 0 0.97 0.89 0.93 486\n", " 1 0.20 0.52 0.29 25\n", "\n", " accuracy 0.87 511\n", " macro avg 0.59 0.71 0.61 511\n", "weighted avg 0.94 0.87 0.90 511\n", "\n" ] }, { "data": { "image/png": "", "text/plain": [ "
    " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
    " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import time\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, classification_report\n", "\n", "# Splitting the data into training and test sets\n", "\n", "# X = data.drop(columns=['id', 'stroke']) # Features\n", "# y = data['stroke'] # Target variable\n", "\n", "# # Encode categorical features with One-Hot Encoding\n", "# categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']\n", "# X = pd.get_dummies(X, columns=categorical_columns, drop_first=True)\n", "\n", "# # Fill missing values (for example, with the median for numeric data)\n", "# X.fillna(X.median(), inplace=True)\n", "\n", "# # Splitting the data into training and test sets\n", "# # Training set\n", "# X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n", "\n", "# # Test and control sets\n", "# X_test, X_control, y_test, y_control = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)\n", "\n", "\n", "# Train the model\n", "model = RandomForestClassifier(random_state=42)\n", "\n", "# Start the timer\n", "start_time = time.time()\n", "model.fit(X_train, y_train)\n", "\n", "# Model training time\n", "train_time = time.time() - start_time\n", "\n", "# Predictions and model evaluation\n", "y_pred = model.predict(X_test)\n", "y_pred_proba = model.predict_proba(X_test)[:, 1] # Class-1 probabilities for ROC-AUC\n", "\n", "# Metrics\n", "roc_auc = roc_auc_score(y_test, y_pred_proba)\n", "f1 = f1_score(y_test, y_pred)\n", "conf_matrix = confusion_matrix(y_test, y_pred)\n", "class_report = classification_report(y_test, y_pred)\n", "\n", "# Print the results\n", "print(f'Model training time: {train_time:.2f} seconds')\n", "print(f'ROC-AUC: {roc_auc:.2f}')\n", "print(f'F1-Score: {f1:.2f}')\n", "print('Confusion matrix:')\n", "print(conf_matrix)\n", "print('Classification report:')\n", "print(class_report)\n", "\n", "# Visualize the confusion matrix\n", "plt.figure(figsize=(7, 7))\n", "sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['No stroke', 'Stroke'], yticklabels=['No stroke', 'Stroke'])\n", "plt.title('Confusion matrix')\n", "plt.xlabel('Predicted class')\n", "plt.ylabel('True class')\n", "plt.show()\n", "\n", "\n", "plt.figure(figsize=(10, 6))\n", "plt.scatter(y_test, y_pred, alpha=0.5, color='blue', label='Model predictions')\n", "plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Perfect match')\n", "plt.xlabel('Actual stroke status')\n", "plt.ylabel('Predicted stroke status')\n", "plt.title('Actual vs. predicted stroke status')\n", "plt.legend()\n", "plt.show()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

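    NOW I have fixed the earlier mistakes, and the model finally seems to have started catching strokes. But because the class imbalance in the data is SEVERE, the model is still weak at detecting stroke: on the test set it finds 13 of the 25 stroke cases (recall 0.52), and only 20% of its stroke predictions are correct (precision 0.20).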
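
    To make the weak stroke detection concrete, the class-1 metrics can be re-derived by hand from the confusion matrix printed by the evaluation cell. This is a minimal sketch in plain Python; the variable names are illustrative and are not part of the notebook's pipeline:

```python
# Confusion matrix copied from the evaluation output above:
# rows are true classes, columns are predicted classes.
conf_matrix = [[434, 52],
               [12, 13]]

tn, fp = conf_matrix[0]  # true class 0: correct rejections, false alarms
fn, tp = conf_matrix[1]  # true class 1: missed strokes, detected strokes

precision = tp / (tp + fp)  # share of predicted strokes that are real
recall = tp / (tp + fn)     # share of real strokes that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}')
```

    The resulting 0.20 / 0.52 / 0.29 match the class-1 row of the classification report, while the 0.87 accuracy is carried almost entirely by the majority class (434 of the 447 correct predictions are non-stroke cases).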

    " ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 2 }