{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
Replace the missing values in the bmi column with the column median.
" ] }, { "cell_type": "code", "execution_count": 332, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Наличие пропущенных значений:\n", "id 0\n", "gender 0\n", "age 0\n", "hypertension 0\n", "heart_disease 0\n", "ever_married 0\n", "work_type 0\n", "Residence_type 0\n", "avg_glucose_level 0\n", "bmi 0\n", "smoking_status 0\n", "stroke 0\n", "dtype: int64\n" ] } ], "source": [ "data['bmi'] = data['bmi'].fillna(data['bmi'].median())\n", "print(\"\\nНаличие пропущенных значений:\")\n", "print(data.isnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Взглянем на выбросы:
" ] }, { "cell_type": "code", "execution_count": 333, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Видим выбросы в столбцах со средним уровнем глюкозы и в столбце bmi (индекс массы тела). устраним выбросы - поставим верхние и нижние границы
" ] }, { "cell_type": "code", "execution_count": 334, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Теперь можно и к конструированию признаков приступить) данные ведь сбалансированы (в выборках)
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Унитарное кодирование категориальных признаков
We apply it to the categorical (non-numeric) features: 'gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'.

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke | gender_Male | gender_Other | ever_married_Yes | work_type_Never_worked | work_type_Private | work_type_Self-employed | work_type_children | Residence_type_Urban | smoking_status_formerly smoked | smoking_status_never smoked | smoking_status_smokes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | 67.0 | 0 | 1 | 169.3575 | 36.6 | 1 | True | False | True | False | True | False | False | True | True | False | False |
| 1 | 51676 | 61.0 | 0 | 0 | 169.3575 | 28.1 | 1 | False | False | True | False | False | True | False | False | False | True | False |
| 2 | 31112 | 80.0 | 0 | 1 | 105.9200 | 32.5 | 1 | True | False | True | False | True | False | False | False | False | True | False |
| 3 | 60182 | 49.0 | 0 | 0 | 169.3575 | 34.4 | 1 | False | False | True | False | True | False | False | True | False | False | True |
| 4 | 1665 | 79.0 | 1 | 0 | 169.3575 | 24.0 | 1 | False | False | True | False | False | True | False | False | False | True | False |
| 5 | 56669 | 81.0 | 0 | 0 | 169.3575 | 29.0 | 1 | True | False | True | False | True | False | False | True | True | False | False |
| 6 | 53882 | 74.0 | 1 | 1 | 70.0900 | 27.4 | 1 | True | False | True | False | True | False | False | False | False | True | False |
| 7 | 10434 | 69.0 | 0 | 0 | 94.3900 | 22.8 | 1 | False | False | False | False | True | False | False | True | False | True | False |
| 8 | 27419 | 59.0 | 0 | 0 | 76.1500 | 28.1 | 1 | False | False | True | False | True | False | False | False | False | False | False |
| 9 | 60491 | 78.0 | 0 | 0 | 58.5700 | 24.2 | 1 | False | False | True | False | True | False | False | True | False | False | False |
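
A table of this shape can be produced with `pandas.get_dummies`. The sketch below assumes `drop_first=True`, which would explain why each feature is missing one category (for example, there is no `gender_Female` column); the notebook's actual call is not shown, and the `demo` frame is toy data.

```python
import pandas as pd

# Toy data with two of the categorical columns named in the text
demo = pd.DataFrame(
    {
        "age": [67.0, 61.0, 80.0],
        "gender": ["Male", "Female", "Other"],
        "ever_married": ["Yes", "No", "Yes"],
    }
)

# One indicator column per category; drop_first=True drops the first
# (alphabetically smallest) category of each feature to avoid redundancy.
encoded = pd.get_dummies(demo, columns=["gender", "ever_married"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one category per feature loses no information, because the dropped category is implied when all remaining indicators are False.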
Discretization of numeric features
Numeric features such as 'age', 'avg_glucose_level', and 'bmi' can be split into categories (binning).
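
As an illustration of binning, here is a minimal sketch with `pandas.cut`. The bin edges and labels are assumptions chosen for the example, not values taken from the notebook.

```python
import pandas as pd

demo = pd.DataFrame({"age": [5.0, 30.0, 49.0, 67.0, 81.0]})

# Assumed bin edges and labels, for illustration only.
# With the default right=True the intervals are (0, 18], (18, 40], (40, 60], (60, 120].
age_bins = [0, 18, 40, 60, 120]
age_labels = ["child", "young_adult", "middle_aged", "senior"]

demo["age_group"] = pd.cut(demo["age"], bins=age_bins, labels=age_labels)
print(demo)
```

The same call works for 'avg_glucose_level' and 'bmi' with edges that make sense for those scales; `pandas.qcut` is an alternative when equal-sized groups are preferred over fixed edges.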
Manual synthesis of new features

| | age_glucose_index | bmi_glucose_ratio |
|---|---|---|
| 0 | 11346.9525 | 0.216111 |
| 1 | 10330.8075 | 0.165921 |
| 2 | 8473.6000 | 0.306835 |
| 3 | 8298.5175 | 0.203121 |
| 4 | 13379.2425 | 0.141712 |
| 5 | 13717.9575 | 0.171235 |
| 6 | 5186.6600 | 0.390926 |
| 7 | 6512.9100 | 0.241551 |
| 8 | 4492.8500 | 0.369009 |
| 9 | 4568.4600 | 0.413181 |
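
The code for these two features is not shown, but the values are consistent with a product and a ratio of existing columns: for the first row, 67.0 × 169.3575 = 11346.9525 and 36.6 / 169.3575 ≈ 0.216111. A minimal sketch under that assumption, using two rows copied from the tables above:

```python
import pandas as pd

# Two rows taken from the encoded table (age, glucose, bmi before scaling)
demo = pd.DataFrame(
    {
        "age": [67.0, 80.0],
        "avg_glucose_level": [169.3575, 105.92],
        "bmi": [36.6, 32.5],
    }
)

# Assumed formulas, inferred from the displayed values
demo["age_glucose_index"] = demo["age"] * demo["avg_glucose_level"]
demo["bmi_glucose_ratio"] = demo["bmi"] / demo["avg_glucose_level"]
print(demo[["age_glucose_index", "bmi_glucose_ratio"]])
```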
Feature scaling
We apply normalization (to compress values into the range [0, 1]) and standardization (to bring values to mean 0 and standard deviation 1).

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke | gender_Male | gender_Other | ever_married_Yes | work_type_Never_worked | work_type_Private | work_type_Self-employed | work_type_children | Residence_type_Urban | smoking_status_formerly smoked | smoking_status_never smoked | smoking_status_smokes | age_glucose_index | bmi_glucose_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | 0.816895 | 0 | 1 | 1.000000 | 0.730556 | 1 | True | False | True | False | True | False | False | True | True | False | False | 11346.9525 | 0.216111 |
| 1 | 51676 | 0.743652 | 0 | 0 | 1.000000 | 0.494444 | 1 | False | False | True | False | False | True | False | False | False | True | False | 10330.8075 | 0.165921 |
| 2 | 31112 | 0.975586 | 0 | 1 | 0.444688 | 0.616667 | 1 | True | False | True | False | True | False | False | False | False | True | False | 8473.6000 | 0.306835 |
| 3 | 60182 | 0.597168 | 0 | 0 | 1.000000 | 0.669444 | 1 | False | False | True | False | True | False | False | True | False | False | True | 8298.5175 | 0.203121 |
| 4 | 1665 | 0.963379 | 1 | 0 | 1.000000 | 0.380556 | 1 | False | False | True | False | False | True | False | False | False | True | False | 13379.2425 | 0.141712 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | 18234 | 0.975586 | 1 | 0 | 0.250618 | 0.494444 | 0 | False | False | True | False | True | False | False | True | False | True | False | 6700.0000 | 0.335522 |
| 5106 | 44873 | 0.987793 | 0 | 0 | 0.613459 | 0.825000 | 0 | False | False | True | False | False | True | False | True | False | True | False | 10141.2000 | 0.319489 |
| 5107 | 19723 | 0.426270 | 0 | 0 | 0.243965 | 0.563889 | 0 | False | False | True | False | False | True | False | False | False | True | False | 2904.6500 | 0.368719 |
| 5108 | 37544 | 0.621582 | 0 | 0 | 0.973148 | 0.425000 | 0 | True | False | True | False | True | False | False | False | True | False | False | 8480.7900 | 0.153948 |
| 5109 | 44679 | 0.536133 | 0 | 0 | 0.264011 | 0.441667 | 0 | False | False | True | False | False | False | False | True | False | False | False | 3752.3200 | 0.307223 |

5110 rows × 20 columns
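
The two scaling methods named above can be sketched with scikit-learn's `MinMaxScaler` and `StandardScaler`. Which columns the notebook passes to which scaler is not shown, so the column choice and the new column names below are assumptions, and `demo` is toy data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

demo = pd.DataFrame({"age": [0.0, 41.0, 82.0], "bmi": [20.0, 30.0, 40.0]})

# Normalization: (x - min) / (max - min), result lies in [0, 1]
demo["age_norm"] = MinMaxScaler().fit_transform(demo[["age"]]).ravel()

# Standardization: (x - mean) / std, result has mean 0 and unit variance
demo["bmi_std"] = StandardScaler().fit_transform(demo[["bmi"]]).ravel()

print(demo)
```

In a full pipeline the scalers are usually fitted on the training set only and then applied to the validation and test sets, so that statistics from held-out data do not leak into training.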
\n", "Конструирование признаков с применением фреймворка Featuretools
" ] }, { "cell_type": "code", "execution_count": 339, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Столбцы в data: ['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke']\n", "id 0\n", "gender 0\n", "age 0\n", "hypertension 0\n", "heart_disease 0\n", "ever_married 0\n", "work_type 0\n", "Residence_type 0\n", "avg_glucose_level 0\n", "bmi 0\n", "smoking_status 0\n", "stroke 0\n", "dtype: int64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. 
To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. 
To ensure parsing is consistent and as-expected, please specify a format.\n", " pd.to_datetime(\n", "d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Сгенерированные признаки:\n", " gender age hypertension heart_disease ever_married work_type \\\n", "id \n", "9046 Male 67.0 0 1 True Private \n", "51676 Female 61.0 0 0 True Self-employed \n", "31112 Male 80.0 0 1 True Private \n", "60182 Female 49.0 0 0 True Private \n", "1665 Female 79.0 1 0 True Self-employed \n", "\n", " Residence_type avg_glucose_level bmi smoking_status stroke \n", "id \n", "9046 Urban 169.3575 36.6 formerly smoked 1 \n", "51676 Rural 169.3575 28.1 never smoked 1 \n", "31112 Rural 105.9200 32.5 never smoked 1 \n", "60182 Urban 169.3575 34.4 smokes 1 \n", "1665 Rural 169.3575 24.0 never smoked 1 \n" ] }, { "data": { "text/html": [ "\n", " | gender | \n", "age | \n", "hypertension | \n", "heart_disease | \n", "ever_married | \n", "work_type | \n", "Residence_type | \n", "avg_glucose_level | \n", "bmi | \n", "smoking_status | \n", "stroke | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|
id | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
9046 | \n", "Male | \n", "67.0 | \n", "0 | \n", "1 | \n", "True | \n", "Private | \n", "Urban | \n", "169.3575 | \n", "36.6 | \n", "formerly smoked | \n", "1 | \n", "
51676 | \n", "Female | \n", "61.0 | \n", "0 | \n", "0 | \n", "True | \n", "Self-employed | \n", "Rural | \n", "169.3575 | \n", "28.1 | \n", "never smoked | \n", "1 | \n", "
31112 | \n", "Male | \n", "80.0 | \n", "0 | \n", "1 | \n", "True | \n", "Private | \n", "Rural | \n", "105.9200 | \n", "32.5 | \n", "never smoked | \n", "1 | \n", "
60182 | \n", "Female | \n", "49.0 | \n", "0 | \n", "0 | \n", "True | \n", "Private | \n", "Urban | \n", "169.3575 | \n", "34.4 | \n", "smokes | \n", "1 | \n", "
1665 | \n", "Female | \n", "79.0 | \n", "1 | \n", "0 | \n", "True | \n", "Self-employed | \n", "Rural | \n", "169.3575 | \n", "24.0 | \n", "never smoked | \n", "1 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
18234 | \n", "Female | \n", "80.0 | \n", "1 | \n", "0 | \n", "True | \n", "Private | \n", "Urban | \n", "83.7500 | \n", "28.1 | \n", "never smoked | \n", "0 | \n", "
44873 | \n", "Female | \n", "81.0 | \n", "0 | \n", "0 | \n", "True | \n", "Self-employed | \n", "Urban | \n", "125.2000 | \n", "40.0 | \n", "never smoked | \n", "0 | \n", "
19723 | \n", "Female | \n", "35.0 | \n", "0 | \n", "0 | \n", "True | \n", "Self-employed | \n", "Rural | \n", "82.9900 | \n", "30.6 | \n", "never smoked | \n", "0 | \n", "
37544 | \n", "Male | \n", "51.0 | \n", "0 | \n", "0 | \n", "True | \n", "Private | \n", "Rural | \n", "166.2900 | \n", "25.6 | \n", "formerly smoked | \n", "0 | \n", "
44679 | \n", "Female | \n", "44.0 | \n", "0 | \n", "0 | \n", "True | \n", "Govt_job | \n", "Urban | \n", "85.2800 | \n", "26.2 | \n", "Unknown | \n", "0 | \n", "
5110 rows × 11 columns
\n", "Так, теперь разобьем на выборки
" ] }, { "cell_type": "code", "execution_count": 340, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Размеры выборок:\n", "Обучающая выборка: (4088, 18)\n", "Тестовая выборка: (511, 18)\n", "Контрольная выборка: (511, 18)\n" ] }, { "data": { "text/html": [ "\n", " | id | \n", "age | \n", "hypertension | \n", "heart_disease | \n", "avg_glucose_level | \n", "bmi | \n", "stroke | \n", "gender_Male | \n", "gender_Other | \n", "ever_married_Yes | \n", "work_type_Never_worked | \n", "work_type_Private | \n", "work_type_Self-employed | \n", "work_type_children | \n", "Residence_type_Urban | \n", "smoking_status_formerly smoked | \n", "smoking_status_never smoked | \n", "smoking_status_smokes | \n", "age_glucose_index | \n", "bmi_glucose_ratio | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "9046 | \n", "0.816895 | \n", "0 | \n", "1 | \n", "1.000000 | \n", "0.730556 | \n", "1 | \n", "True | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "True | \n", "False | \n", "False | \n", "11346.9525 | \n", "0.216111 | \n", "
1 | \n", "51676 | \n", "0.743652 | \n", "0 | \n", "0 | \n", "1.000000 | \n", "0.494444 | \n", "1 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "True | \n", "False | \n", "10330.8075 | \n", "0.165921 | \n", "
2 | \n", "31112 | \n", "0.975586 | \n", "0 | \n", "1 | \n", "0.444688 | \n", "0.616667 | \n", "1 | \n", "True | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "False | \n", "True | \n", "False | \n", "8473.6000 | \n", "0.306835 | \n", "
3 | \n", "60182 | \n", "0.597168 | \n", "0 | \n", "0 | \n", "1.000000 | \n", "0.669444 | \n", "1 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "8298.5175 | \n", "0.203121 | \n", "
4 | \n", "1665 | \n", "0.963379 | \n", "1 | \n", "0 | \n", "1.000000 | \n", "0.380556 | \n", "1 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "True | \n", "False | \n", "13379.2425 | \n", "0.141712 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
5105 | \n", "18234 | \n", "0.975586 | \n", "1 | \n", "0 | \n", "0.250618 | \n", "0.494444 | \n", "0 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "6700.0000 | \n", "0.335522 | \n", "
5106 | \n", "44873 | \n", "0.987793 | \n", "0 | \n", "0 | \n", "0.613459 | \n", "0.825000 | \n", "0 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "10141.2000 | \n", "0.319489 | \n", "
5107 | \n", "19723 | \n", "0.426270 | \n", "0 | \n", "0 | \n", "0.243965 | \n", "0.563889 | \n", "0 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "True | \n", "False | \n", "2904.6500 | \n", "0.368719 | \n", "
5108 | \n", "37544 | \n", "0.621582 | \n", "0 | \n", "0 | \n", "0.973148 | \n", "0.425000 | \n", "0 | \n", "True | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "8480.7900 | \n", "0.153948 | \n", "
5109 | \n", "44679 | \n", "0.536133 | \n", "0 | \n", "0 | \n", "0.264011 | \n", "0.441667 | \n", "0 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "3752.3200 | \n", "0.307223 | \n", "
5110 rows × 20 columns
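
The sample sizes in the output (4088 / 511 / 511 out of 5110 rows) correspond to an 80/10/10 split. One way to get such a split is two consecutive calls to scikit-learn's `train_test_split`; the sketch below assumes stratification by the target and an arbitrary `random_state`, neither of which is confirmed by the notebook, and uses toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 10% positive class
demo = pd.DataFrame(
    {"age": range(100), "stroke": [1 if i % 10 == 0 else 0 for i in range(100)]}
)
X = demo.drop(columns=["stroke"])
y = demo["stroke"]

# Step 1: hold out 20% of the rows
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Step 2: split the held-out part in half -> 10% validation, 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Stratifying keeps the share of stroke cases roughly equal across the three samples, which matters when the positive class is rare.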
\n", "Напишем функцию и сделаем аугментацию данных
" ] }, { "cell_type": "code", "execution_count": 342, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Данные ДО аугментации в ОБУЧАЮЩЕЙ ВЫБОРКЕ (60-80% данных)\n", "\n", "stroke\n", "0 3889\n", "1 199\n", "Name: count, dtype: int64\n", "\n", "После оверсемплинга\n", "\n", "stroke\n", "0 3889\n", "1 777\n", "Name: count, dtype: int64\n", "\n", "После балансировки данных (андерсемплинга)\n", "\n", "stroke\n", "0 777\n", "1 777\n", "Name: count, dtype: int64\n" ] }, { "data": { "image/png": "", "text/plain": [ "\n", " | age | \n", "hypertension | \n", "heart_disease | \n", "avg_glucose_level | \n", "bmi | \n", "gender_Male | \n", "gender_Other | \n", "ever_married_Yes | \n", "work_type_Never_worked | \n", "work_type_Private | \n", "work_type_Self-employed | \n", "work_type_children | \n", "Residence_type_Urban | \n", "smoking_status_formerly smoked | \n", "smoking_status_never smoked | \n", "smoking_status_smokes | \n", "age_glucose_index | \n", "bmi_glucose_ratio | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2508 | \n", "0.316406 | \n", "0 | \n", "0 | \n", "0.176562 | \n", "0.341667 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "True | \n", "1957.540 | \n", "0.300173 | \n", "
2435 | \n", "0.768066 | \n", "0 | \n", "0 | \n", "0.351636 | \n", "0.591667 | \n", "True | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "6003.270 | \n", "0.331619 | \n", "
2547 | \n", "0.060059 | \n", "0 | \n", "0 | \n", "0.250618 | \n", "0.216667 | \n", "True | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "True | \n", "True | \n", "False | \n", "False | \n", "False | \n", "418.750 | \n", "0.216119 | \n", "
3885 | \n", "0.914551 | \n", "0 | \n", "0 | \n", "0.342882 | \n", "0.691667 | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "7071.750 | \n", "0.373316 | \n", "
335 | \n", "0.426270 | \n", "0 | \n", "0 | \n", "0.500974 | \n", "0.544444 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "3932.250 | \n", "0.266133 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
4661 | \n", "0.853516 | \n", "1 | \n", "0 | \n", "1.000000 | \n", "0.977778 | \n", "True | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "11855.025 | \n", "0.268662 | \n", "
4662 | \n", "0.926758 | \n", "0 | \n", "0 | \n", "0.024510 | \n", "0.494444 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "True | \n", "False | \n", "False | \n", "4401.920 | \n", "0.485152 | \n", "
4663 | \n", "0.682617 | \n", "0 | \n", "0 | \n", "1.000000 | \n", "0.836111 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "True | \n", "False | \n", "False | \n", "9484.020 | \n", "0.238549 | \n", "
4664 | \n", "0.768066 | \n", "0 | \n", "0 | \n", "0.313207 | \n", "0.494444 | \n", "False | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "True | \n", "False | \n", "False | \n", "5726.700 | \n", "0.309131 | \n", "
4665 | \n", "0.902344 | \n", "0 | \n", "0 | \n", "0.156166 | \n", "0.583333 | \n", "True | \n", "False | \n", "True | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "False | \n", "False | \n", "True | \n", "5399.040 | \n", "0.429002 | \n", "
1554 rows × 18 columns
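
The two-stage balancing in the output (oversample the minority class, then undersample the majority class) can be sketched with pandas alone. The notebook's own function is not shown; this version swaps in plain random oversampling by duplication, and the helper name `rebalance`, the `minority_ratio` of 0.2, and the seed are all assumptions.

```python
import pandas as pd


def rebalance(train: pd.DataFrame, target: str,
              minority_ratio: float = 0.2, seed: int = 42) -> pd.DataFrame:
    """Oversample the minority class, then undersample the majority (hypothetical helper)."""
    counts = train[target].value_counts()
    minority = train[train[target] == counts.idxmin()]
    majority = train[train[target] == counts.idxmax()]

    # Step 1: random oversampling with replacement up to minority_ratio * majority size
    n_minority = max(len(minority), int(len(majority) * minority_ratio))
    extra = minority.sample(n=n_minority - len(minority), replace=True, random_state=seed)
    minority_over = pd.concat([minority, extra])

    # Step 2: random undersampling of the majority down to the new minority size
    majority_under = majority.sample(n=len(minority_over), random_state=seed)

    # Shuffle so the classes are not grouped together
    balanced = pd.concat([majority_under, minority_over])
    return balanced.sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Toy training set: 95 negatives, 5 positives
demo = pd.DataFrame({"age": range(100), "stroke": [1] * 5 + [0] * 95})
balanced = rebalance(demo, "stroke")
print(balanced["stroke"].value_counts())
```

Resampling should be applied to the training set only, as the output above does; the validation and test sets keep the original class distribution so that the metrics reflect real-world proportions.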
\n", "Самое время оценить качество работы модели
" ] }, { "cell_type": "code", "execution_count": 343, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Время обучения модели: 0.25 секунд\n", "ROC-AUC: 0.84\n", "F1-Score: 0.29\n", "Матрица ошибок:\n", "[[434 52]\n", " [ 12 13]]\n", "Отчет по классификации:\n", " precision recall f1-score support\n", "\n", " 0 0.97 0.89 0.93 486\n", " 1 0.20 0.52 0.29 25\n", "\n", " accuracy 0.87 511\n", " macro avg 0.59 0.71 0.61 511\n", "weighted avg 0.94 0.87 0.90 511\n", "\n" ] }, { "data": { "image/png": "", "text/plain": [ "А ВОТ ТЕПЕЕЕЕЕЕЕЕЕЕЕРЬ я поправила недоразумения и вроде как модель проперло на выявление инсульта. Но, так как в данных ЛЮТЫЙ дисбаланс, то модель слаба на выявление инсульта все еще.
" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 2 }