{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start of the lab work"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Variant 3:* Diabetes among the Pima Indians\n",
    "- Define the business goals and the goals of the technical project"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',\n",
      "       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "df = pd.read_csv(\"C:/Users/TIGR228/Desktop/МИИ/Lab1/AIM-PIbd-31-Afanasev-S-S/static/csv/diabetes.csv\")\n",
    "print(df.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Business goals:\n",
    "1. Predict the risk of developing diabetes\n",
    "2. Assess the factors that influence the development of diabetes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Technical project goals:\n",
    "1. Build a machine learning classification model that predicts the probability of developing diabetes among the Pima Indians from the available data on their characteristics.\n",
    "2. Analyze the data to identify the key factors that influence the development of diabetes among the Pima Indians."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Pregnancies</th>\n",
       "      <th>Glucose</th>\n",
       "      <th>BloodPressure</th>\n",
       "      <th>SkinThickness</th>\n",
       "      <th>Insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>DiabetesPedigreeFunction</th>\n",
       "      <th>Age</th>\n",
       "      <th>Outcome</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>148</td>\n",
       "      <td>72</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>33.6</td>\n",
       "      <td>0.627</td>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>85</td>\n",
       "      <td>66</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>26.6</td>\n",
       "      <td>0.351</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>183</td>\n",
       "      <td>64</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>23.3</td>\n",
       "      <td>0.672</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>89</td>\n",
       "      <td>66</td>\n",
       "      <td>23</td>\n",
       "      <td>94</td>\n",
       "      <td>28.1</td>\n",
       "      <td>0.167</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>137</td>\n",
       "      <td>40</td>\n",
       "      <td>35</td>\n",
       "      <td>168</td>\n",
       "      <td>43.1</td>\n",
       "      <td>2.288</td>\n",
       "      <td>33</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \\\n",
       "0            6      148             72             35        0  33.6   \n",
       "1            1       85             66             29        0  26.6   \n",
       "2            8      183             64              0        0  23.3   \n",
       "3            1       89             66             23       94  28.1   \n",
       "4            0      137             40             35      168  43.1   \n",
       "\n",
       "   DiabetesPedigreeFunction  Age  Outcome  \n",
       "0                     0.627   50        1  \n",
       "1                     0.351   31        0  \n",
       "2                     0.672   32        1  \n",
       "3                     0.167   21        0  \n",
       "4                     2.288   33        1  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pregnancies                 0\n",
      "Glucose                     0\n",
      "BloodPressure               0\n",
      "SkinThickness               0\n",
      "Insulin                     0\n",
      "BMI                         0\n",
      "DiabetesPedigreeFunction    0\n",
      "Age                         0\n",
      "Outcome                     0\n",
      "dtype: int64\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Pregnancies                 False\n",
       "Glucose                     False\n",
       "BloodPressure               False\n",
       "SkinThickness               False\n",
       "Insulin                     False\n",
       "BMI                         False\n",
       "DiabetesPedigreeFunction    False\n",
       "Age                         False\n",
       "Outcome                     False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Percentage of missing values per feature\n",
    "for i in df.columns:\n",
    "    null_rate = df[i].isnull().sum() / len(df) * 100\n",
    "    if null_rate > 0:\n",
    "        print(f'{i} percentage of missing values: {null_rate:.2f}%')\n",
    "\n",
    "# Check for missing data\n",
    "print(df.isnull().sum())\n",
    "\n",
    "df.isnull().any()"
   ]
  },
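  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No column contains missing (NaN) values, which is good news. Note that `isnull()` does not catch the zeros in columns such as `Insulin` and `SkinThickness` seen in `df.head()`, which may stand in for missing measurements."
   ]
  },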
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set size: 491\n",
      "Validation set size: 123\n",
      "Test set size: 154\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Hold out 20% of the data as the test set\n",
    "train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
    "\n",
    "# Split the remaining 80% into training and validation sets (80% / 20% of the remainder),\n",
    "# so that the validation set does not coincide with the test set\n",
    "train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)\n",
    "\n",
    "print(\"Training set size:\", len(train_data))\n",
    "print(\"Validation set size:\", len(val_data))\n",
    "print(\"Test set size:\", len(test_data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean Outcome in the training set: 0.3469055374592834\n",
      "Mean Outcome in the validation set: 0.35714285714285715\n",
      "Mean Outcome in the test set: 0.35714285714285715\n"
     ]
    }
   ],
   "source": [
    "# Assess the balance of the target variable (Outcome)\n",
    "# Visualize the distribution of the target variable in each split (histogram)\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def plot_outcome_distribution(data, title):\n",
    "    sns.histplot(data['Outcome'], kde=True)\n",
    "    plt.title(title)\n",
    "    plt.xlabel('Outcome')\n",
    "    plt.ylabel('Frequency')\n",
    "    plt.show()\n",
    "\n",
    "plot_outcome_distribution(train_data, 'Outcome distribution in the training set')\n",
    "plot_outcome_distribution(val_data, 'Outcome distribution in the validation set')\n",
    "plot_outcome_distribution(test_data, 'Outcome distribution in the test set')\n",
    "\n",
    "# Assess the class balance of the target variable (Outcome) via its mean in each split\n",
    "print(\"Mean Outcome in the training set:\", train_data['Outcome'].mean())\n",
    "print(\"Mean Outcome in the validation set:\", val_data['Outcome'].mean())\n",
    "print(\"Mean Outcome in the test set:\", test_data['Outcome'].mean())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABEEUlEQVR4nO3deVgW9f7/8dcNyg2KNwjKVmjuiksWmt5ZaooikmlRZlmamZZip7TFL+eYW4tpi0vi0jkuddIsLe3oMfekDU0xytQ86NHkpIBLgmKCwvz+8GJ+3gKmiN44PR/XNdfFfOYzM+8Z7htezHyG22YYhiEAAACL8nB3AQAAAFcTYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcA4FZjx46VzWZzdxkVwv79+2Wz2TR//nyzjfNz5Qg7FjB//nzZbDZz8vb2VsOGDTVs2DBlZma6uzxUYCtWrFC3bt0UGBhovm6ef/55HT16tMzbPHjwoMaOHavU1NTyKxQArgBhx0LGjx+vf/7zn5o+fbpuv/12zZw5U06nU6dOnXJ3aaiAnn/+efXo0UMZGRkaOXKkpk+frqioKE2fPl0333yzdu/eXabtHjx4UOPGjSPsAOVk1KhR+v33391dxnWtkrsLQPmJiYlRq1atJElPPPGEAgMD9fbbb+uzzz7TQw895ObqUJF8+OGHeuutt/Tggw9qwYIF8vT0NJc99thjuuuuu/TAAw9o27ZtqlSJHxO4vpw6dUpVqlRxdxnlplKlSrwPrxBXdiysU6dOkqR9+/ZJko4dO6bnn39ezZs3l6+vrxwOh2JiYvTDDz8UW/f06dMaO3asGjZsKG9vb4WGhuq+++7T3r17Jf3/+8qlTR07djS3tXHjRtlsNn300Uf661//qpCQEFWtWlX33HOP0tPTi+178+bN6tatm/z8/FSlShV16NBB33zzTYnH2LFjxxL3P3bs2GJ9P/jgA0VGRsrHx0cBAQHq06dPifu/2LGdr7CwUFOmTFHTpk3l7e2t4OBgPfnkk/rtt99c+t100026++67i+1n2LBhxbZZUu1vvPFGsXMqSXl5eRozZozq168vu92u8PBwvfjii8rLyyvxXJ1v3Lhxql69ut59912XoCNJt912m0aOHKnt27dryZIlLsfx2GOPFdtWx44dzdo2btyo1q1bS5IGDBhgnrfzxx9s3rxZ3bt3V/Xq1VW1alW1aNFCU6dOddnmhg0bdOedd6pq1ary9/dXz549tWvXLpc+ReMY/vOf/+iRRx6Rn5+fatasqZdeekmGYSg9PV09e/aUw+FQSEiI3nrrrWK1X8k5LO21VzTt37/fpf+MGTPUtGlT2e12hYWFKT4+XsePHy+23Us5P5Iueb+X+rq/0GOPPaabbrqpWHtJ40dsNpuGDRumZcuWqVmzZrLb7WratKlWrVpVbP2vv/5arVu3lre3t+rVq6fZs2eXWsOl1N6xY0c1a9ZMKSkpat++vapUqaK//vWvkqStW7cqOjpaNWrUkI+Pj+rUqaPHH3/cZf0333xTt99+uwIDA+Xj46PIyEiX1/2Fx7h48WJFRETIx8dHTqdT27dvlyTNnj1b9evXl7e3tzp27Fjs+3B+nbfffrtZz6xZs0o9/iJXes43btyoVq1auZzzP9s4IKKihRUFk8DAQEnSf//7Xy1btkwPPPCA6tSpo8zMTM2ePVsdOnTQzp07FRYWJkkqKCjQ3XffrfXr16tPnz565plndOLECa1du1Y//fST6tWrZ+7joYceUvfu3V32m5CQUGI9r776qmw2m0aOHKmsrCxNmTJFUVFRSk1NlY+Pj6Rzv+RiYmIUGRmpMWPGyMPDQ/PmzVOnTp301Vdf6bbbbiu23RtvvFETJkyQJJ08eVJDhgwpcd8vvfSSevfurSeeeEKHDx/WO++8o/bt2+v777+Xv79/sXUGDx6sO++8U5
L06aefaunSpS7Ln3zySc2fP18DBgzQX/7yF+3bt0/Tp0/X999/r2+++UaVK1cu8TxcjuPHj5vHdr7CwkLdc889+vrrrzV48GA1adJE27dv1+TJk/Wf//xHy5YtK3WbaWlp2r17tx577DE5HI4S+/Tr109jxozRihUr1KdPn0uut0mTJho/frxGjx7tcv5uv/12SdLatWt19913KzQ0VM8884xCQkK0a9curVixQs8884wkad26dYqJiVHdunU1duxY/f7773rnnXfUrl07bdu2rdgv4AcffFBNmjTR66+/rn//+9965ZVXFBAQoNmzZ6tTp06aOHGiFixYoOeff16tW7dW+/btr/gcFjn/tVdk5cqV+vDDD13axo4dq3HjxikqKkpDhgzR7t27NXPmTG3ZssXltXIp5+d89957r+677z5J0ldffaV3333XZXlZXvdl9fXXX+vTTz/V0KFDVa1aNU2bNk1xcXE6cOCA+TNo+/bt6tq1q2rWrKmxY8fq7NmzGjNmjIKDg4tt73JqP3r0qGJiYtSnTx898sgjCg4OVlZWlrmv//u//5O/v7/279+vTz/91GU/U6dO1T333KO+ffsqPz9fixYt0gMPPKAVK1YoNjbWpe9XX32lf/3rX4qPj5ckTZgwQXfffbdefPFFzZgxQ0OHDtVvv/2mSZMm6fHHH9eGDRtc1v/tt9/UvXt39e7dWw899JA+/vhjDRkyRF5eXsVCWHmd8++//17dunVTaGioxo0bp4KCAo0fP141a9a87P1d1wxc9+bNm2dIMtatW2ccPnzYSE9PNxYtWmQEBgYaPj4+xv/+9z/DMAzj9OnTRkFBgcu6+/btM+x2uzF+/Hizbe7cuYYk4+233y62r8LCQnM9ScYbb7xRrE/Tpk2NDh06mPNffPGFIcm44YYbjJycHLP9448/NiQZU6dONbfdoEEDIzo62tyPYRjGqVOnjDp16hhdunQptq/bb7/daNasmTl/+PBhQ5IxZswYs23//v2Gp6en8eqrr7qsu337dqNSpUrF2tPS0gxJxnvvvWe2jRkzxjj/7fLVV18ZkowFCxa4rLtq1api7bVr1zZiY2OL1R4fH29c+Ba8sPYXX3zRCAoKMiIjI13O6T//+U/Dw8PD+Oqrr1zWnzVrliHJ+Oabb4rtr8iyZcsMScbkyZNL7WMYhuFwOIxbb73V5Tj69+9frF+HDh1catuyZYshyZg3b55Lv7Nnzxp16tQxateubfz2228uy87/frds2dIICgoyjh49arb98MMPhoeHh9GvXz+zreh7MnjwYJd93HjjjYbNZjNef/11s/23334zfHx8XOq/knNYdNxNmzYt1v7GG28Ykox9+/YZhmEYWVlZhpeXl9G1a1eX99/06dMNScbcuXMv6/wYhmGcOXPGkGSMGzfObCv6OVC038t93V+of//+Ru3atYu1X/heMIxzr1svLy9jz549ZtsPP/xgSDLeeecds61Xr16Gt7e38csvv5htO3fuNDw9PV22eTm1d+jQwZBkzJo1y6Xv0qVLDUnGli1bLnqcp06dcpnPz883mjVrZnTq1KnYMdrtdvP8GoZhzJ4925BkhISEuPxsS0hIcPlenF/nW2+9Zbbl5eWZr/f8/HzDMP7/z9bz3z9Xcs579OhhVKlSxfj111/NtrS0NKNSpUrFtmll3MaykKioKNWsWVPh4eHq06ePfH19tXTpUt1www2SJLvdLg+Pc9/ygoICHT16VL6+vmrUqJG2bdtmbueTTz5RjRo19PTTTxfbx5Vc9uzXr5+qVatmzt9///0KDQ3VypUrJUmpqalKS0vTww8/rKNHj+rIkSM6cuSIcnNz1blzZ3355ZcqLCx02ebp06fl7e190f1++umnKiwsVO/evc1tHjlyRCEhIWrQoIG++OILl/75+fmSzp2v0ixevFh+fn7q0qWLyzYjIyPl6+tbbJtnzpxx6XfkyBGdPn36onX/+uuveuedd/TSSy/J19
e32P6bNGmixo0bu2yz6Nblhfs/34kTJyTJ5XtRkmrVqiknJ+eifS7H999/r3379unZZ58tdkWh6HV16NAhpaam6rH
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Training set size after oversampling and undersampling: 802\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from imblearn.over_sampling import RandomOverSampler\n",
|
|||
|
"from imblearn.under_sampling import RandomUnderSampler\n",
|
|||
|
"\n",
|
|||
|
"# Plot the Outcome distribution in the training set\n",
|
|||
|
"sns.countplot(x=train_data['Outcome'])\n",
|
|||
|
"plt.title('Outcome distribution in the training set')\n",
|
|||
|
"plt.xlabel('Outcome')\n",
|
|||
|
"plt.ylabel('Count')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Balance the classes with RandomOverSampler (upsample the minority class)\n",
|
|||
|
"ros = RandomOverSampler(random_state=42)\n",
|
|||
|
"X_train = train_data.drop(columns=['Outcome'])\n",
|
|||
|
"y_train = train_data['Outcome']\n",
|
|||
|
"\n",
|
|||
|
"X_resampled, y_resampled = ros.fit_resample(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Plot the Outcome distribution after oversampling\n",
|
|||
|
"sns.countplot(x=y_resampled)\n",
|
|||
|
"plt.title('Outcome distribution after oversampling')\n",
|
|||
|
"plt.xlabel('Outcome')\n",
|
|||
|
"plt.ylabel('Count')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Apply RandomUnderSampler to downsample the majority class (the classes are already balanced after oversampling, so the class sizes do not change)\n",
|
|||
|
"rus = RandomUnderSampler(random_state=42)\n",
|
|||
|
"X_resampled, y_resampled = rus.fit_resample(X_resampled, y_resampled)\n",
|
|||
|
"\n",
|
|||
|
"# Plot the Outcome distribution after undersampling\n",
|
|||
|
"sns.countplot(x=y_resampled)\n",
|
|||
|
"plt.title('Outcome distribution after undersampling')\n",
|
|||
|
"plt.xlabel('Outcome')\n",
|
|||
|
"plt.ylabel('Count')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Print the training set size after balancing\n",
|
|||
|
"print(\"Training set size after oversampling and undersampling: \", len(X_resampled))\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Feature engineering \n",
|
|||
|
"\n",
|
|||
|
"Next, we engineer features for each of the two tasks.\n",
|
|||
|
"\n",
|
|||
|
"**Feature engineering process** \n",
|
|||
|
"Task 1: Predict the risk of developing diabetes. Technical goal: build a machine learning model that accurately predicts the probability of diabetes among the Pima Indians.\n",
|
|||
|
"Task 2: Assess the factors that influence the development of diabetes. Technical goal: build a machine learning model that identifies the key factors influencing the development of diabetes among the Pima Indians.\n",
|
|||
|
"\n",
|
|||
|
"**One-hot encoding** \n",
|
|||
|
"One-hot encoding converts each categorical feature into a binary vector.\n",
|
|||
|
"\n",
|
|||
|
"**Discretization of numeric features** \n",
|
|||
|
"Discretization converts continuous numeric values into discrete categories or intervals (bins)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Columns of train_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Pregnancies_14', 'Pregnancies_15', 'Pregnancies_17', 'Outcome_0', 'Outcome_1']\n",
|
|||
|
"Columns of val_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1']\n",
|
|||
|
"Columns of test_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Columns treated as categorical in this example\n",
|
|||
|
"categorical_features = ['Pregnancies', 'Outcome']\n",
|
|||
|
"\n",
|
|||
|
"# Apply one-hot encoding (each split is encoded separately, so the resulting column sets can differ)\n",
|
|||
|
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
|
|||
|
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
|
|||
|
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)\n",
|
|||
|
"df_encoded = pd.get_dummies(df, columns=categorical_features)\n",
|
|||
|
"\n",
|
|||
|
"print(\"Columns of train_data_encoded:\", train_data_encoded.columns.tolist())\n",
|
|||
|
"print(\"Columns of val_data_encoded:\", val_data_encoded.columns.tolist())\n",
|
|||
|
"print(\"Columns of test_data_encoded:\", test_data_encoded.columns.tolist())\n",
|
|||
|
"\n",
|
|||
|
"# Discretize a numeric feature (Glucose), e.g. split the glucose level into categories\n",
|
|||
|
"# Example: bin 'Glucose' into 5 equal-width categories\n",
|
|||
|
"train_data_encoded['Glucose_binned'] = pd.cut(train_data_encoded['Glucose'], bins=5, labels=False)\n",
|
|||
|
"val_data_encoded['Glucose_binned'] = pd.cut(val_data_encoded['Glucose'], bins=5, labels=False)\n",
|
|||
|
"test_data_encoded['Glucose_binned'] = pd.cut(test_data_encoded['Glucose'], bins=5, labels=False)\n",
|
|||
|
"\n",
|
|||
|
"# The same binning applied to the full dataset\n",
|
|||
|
"df_encoded['Glucose_binned'] = pd.cut(df_encoded['Glucose'], bins=5, labels=False)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Manual feature synthesis\n",
|
|||
|
"New features are created from expert knowledge and the logic of the subject domain. For example, one can build a feature that captures the ratio of glucose to insulin or to body mass index (BMI)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Manual feature synthesis\n",
|
|||
|
"# New feature: glucose-to-insulin ratio (rows with Insulin == 0 yield inf or NaN)\n",
|
|||
|
"train_data_encoded['glucose_to_insulin'] = train_data_encoded['Glucose'] / train_data_encoded['Insulin']\n",
|
|||
|
"val_data_encoded['glucose_to_insulin'] = val_data_encoded['Glucose'] / val_data_encoded['Insulin']\n",
|
|||
|
"test_data_encoded['glucose_to_insulin'] = test_data_encoded['Glucose'] / test_data_encoded['Insulin']\n",
|
|||
|
"\n",
|
|||
|
"# The same glucose-to-insulin ratio for the full dataset\n",
|
|||
|
"df_encoded['glucose_to_insulin'] = df_encoded['Glucose'] / df_encoded['Insulin']\n",
|
|||
|
"\n",
|
|||
|
"# New feature: glucose-to-BMI ratio\n",
|
|||
|
"train_data_encoded['glucose_to_bmi'] = train_data_encoded['Glucose'] / train_data_encoded['BMI']\n",
|
|||
|
"val_data_encoded['glucose_to_bmi'] = val_data_encoded['Glucose'] / val_data_encoded['BMI']\n",
|
|||
|
"test_data_encoded['glucose_to_bmi'] = test_data_encoded['Glucose'] / test_data_encoded['BMI']\n",
|
|||
|
"\n",
|
|||
|
"# The same glucose-to-BMI ratio for the full dataset\n",
|
|||
|
"df_encoded['glucose_to_bmi'] = df_encoded['Glucose'] / df_encoded['BMI']\n",
|
|||
|
"\n",
|
|||
|
"# New feature: insulin-to-BMI ratio\n",
|
|||
|
"train_data_encoded['insulin_to_bmi'] = train_data_encoded['Insulin'] / train_data_encoded['BMI']\n",
|
|||
|
"val_data_encoded['insulin_to_bmi'] = val_data_encoded['Insulin'] / val_data_encoded['BMI']\n",
|
|||
|
"test_data_encoded['insulin_to_bmi'] = test_data_encoded['Insulin'] / test_data_encoded['BMI']\n",
|
|||
|
"\n",
|
|||
|
"# The same insulin-to-BMI ratio for the full dataset\n",
|
|||
|
"df_encoded['insulin_to_bmi'] = df_encoded['Insulin'] / df_encoded['BMI']\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Feature scaling is the process of transforming numeric features so that they share a common scale."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
|
|||
|
"\n",
|
|||
|
"# Numeric features to scale\n",
|
|||
|
"numerical_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']\n",
|
|||
|
"\n",
|
|||
|
"# Scale the numeric features with StandardScaler (fitted on the training set only)\n",
|
|||
|
"scaler = StandardScaler()\n",
|
|||
|
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
|
|||
|
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
|
|||
|
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])\n",
|
|||
|
"\n",
|
|||
|
"# Example of MinMaxScaler; applied after StandardScaler, it rescales the already standardized training columns to [0, 1]\n",
|
|||
|
"scaler = MinMaxScaler()\n",
|
|||
|
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
|
|||
|
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
|
|||
|
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Using the Featuretools framework"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 41,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Columns in df: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']\n",
|
|||
|
"Columns in train_data_encoded: ['id', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Pregnancies_14', 'Pregnancies_15', 'Pregnancies_17', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_insulin', 'glucose_to_bmi', 'insulin_to_bmi']\n",
|
|||
|
"Columns in val_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_insulin', 'glucose_to_bmi', 'insulin_to_bmi']\n",
|
|||
|
"Columns in test_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_insulin', 'glucose_to_bmi', 'insulin_to_bmi']\n",
|
|||
|
"Empty DataFrame\n",
|
|||
|
"Columns: [id, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Pregnancies_0, Pregnancies_1, Pregnancies_2, Pregnancies_3, Pregnancies_4, Pregnancies_5, Pregnancies_6, Pregnancies_7, Pregnancies_8, Pregnancies_9, Pregnancies_10, Pregnancies_11, Pregnancies_12, Pregnancies_13, Pregnancies_14, Pregnancies_15, Pregnancies_17, Outcome_0, Outcome_1, Glucose_binned, glucose_to_insulin, glucose_to_bmi, insulin_to_bmi]\n",
|
|||
|
"Index: []\n",
|
|||
|
"\n",
|
|||
|
"[0 rows x 31 columns]\n",
|
|||
|
" Glucose BloodPressure SkinThickness Insulin BMI \\\n",
|
|||
|
"id \n",
|
|||
|
"0 148 72 35 0 33.6 \n",
|
|||
|
"1 85 66 29 0 26.6 \n",
|
|||
|
"2 183 64 0 0 23.3 \n",
|
|||
|
"3 89 66 23 94 28.1 \n",
|
|||
|
"4 137 40 35 168 43.1 \n",
|
|||
|
"\n",
|
|||
|
" DiabetesPedigreeFunction Age Pregnancies_0 Pregnancies_1 \\\n",
|
|||
|
"id \n",
|
|||
|
"0 0.627 50 False False \n",
|
|||
|
"1 0.351 31 False True \n",
|
|||
|
"2 0.672 32 False False \n",
|
|||
|
"3 0.167 21 False True \n",
|
|||
|
"4 2.288 33 True False \n",
|
|||
|
"\n",
|
|||
|
" Pregnancies_2 ... Pregnancies_13 Pregnancies_14 Pregnancies_15 \\\n",
|
|||
|
"id ... \n",
|
|||
|
"0 False ... False False False \n",
|
|||
|
"1 False ... False False False \n",
|
|||
|
"2 False ... False False False \n",
|
|||
|
"3 False ... False False False \n",
|
|||
|
"4 False ... False False False \n",
|
|||
|
"\n",
|
|||
|
" Pregnancies_17 Outcome_0 Outcome_1 Glucose_binned glucose_to_insulin \\\n",
|
|||
|
"id \n",
|
|||
|
"0 False False True 3 inf \n",
|
|||
|
"1 False True False 2 inf \n",
|
|||
|
"2 False False True 4 inf \n",
|
|||
|
"3 False True False 2 0.946809 \n",
|
|||
|
"4 False False True 3 0.815476 \n",
|
|||
|
"\n",
|
|||
|
" glucose_to_bmi insulin_to_bmi \n",
|
|||
|
"id \n",
|
|||
|
"0 4.404762 0.000000 \n",
|
|||
|
"1 3.195489 0.000000 \n",
|
|||
|
"2 7.854077 0.000000 \n",
|
|||
|
"3 3.167260 3.345196 \n",
|
|||
|
"4 3.178654 3.897912 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 30 columns]\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n",
|
|||
|
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
|||
|
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
|
|||
|
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n",
|
|||
|
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
|||
|
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import featuretools as ft\n",
|
|||
|
"\n",
|
|||
|
"# Inspect the columns of each DataFrame\n",
|
|||
|
"print(\"Columns in df:\", df.columns.tolist())\n",
|
|||
|
"print(\"Columns in train_data_encoded:\", train_data_encoded.columns.tolist())\n",
|
|||
|
"print(\"Columns in val_data_encoded:\", val_data_encoded.columns.tolist())\n",
|
|||
|
"print(\"Columns in test_data_encoded:\", test_data_encoded.columns.tolist())\n",
|
|||
|
"\n",
|
|||
|
"# Drop rows duplicated across all columns (there is no unique identifier)\n",
|
|||
|
"df = df.drop_duplicates()\n",
|
|||
|
"duplicates = train_data_encoded[train_data_encoded.duplicated(keep=False)]\n",
|
|||
|
"\n",
|
|||
|
"# Drop duplicate rows from df_encoded, keeping the first occurrence\n",
|
|||
|
"df_encoded = df_encoded.drop_duplicates(keep='first')\n",
|
|||
|
"\n",
|
|||
|
"print(duplicates)\n",
|
|||
|
"\n",
|
|||
|
"# Create an EntitySet\n",
|
|||
|
"es = ft.EntitySet(id='diabetes_data')\n",
|
|||
|
"\n",
|
|||
|
"# Add the dataframe with the diabetes data\n",
|
|||
|
"es = es.add_dataframe(dataframe_name='patients', dataframe=df_encoded, index='id')\n",
|
|||
|
"\n",
|
|||
|
"# Generate features with Deep Feature Synthesis\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='patients', max_depth=2)\n",
|
|||
|
"\n",
|
|||
|
"# Print the first 5 rows of the generated feature matrix\n",
|
|||
|
"print(feature_matrix.head())\n",
|
|||
|
"\n",
|
|||
|
"# Drop duplicates from the training set\n",
|
|||
|
"train_data_encoded = train_data_encoded.drop_duplicates(keep='first')\n",
|
|||
|
"\n",
|
|||
|
"# Define the entities (create an EntitySet)\n",
|
|||
|
"es = ft.EntitySet(id='diabetes_data')\n",
|
|||
|
"\n",
|
|||
|
"es = es.add_dataframe(dataframe_name='patients', dataframe=train_data_encoded, index='id')\n",
|
|||
|
"\n",
|
|||
|
"# Generate features\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='patients', max_depth=2)\n",
|
|||
|
"\n",
|
|||
|
"# Compute the generated features for the validation and test sets\n",
|
|||
|
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)\n",
|
|||
|
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Evaluating the quality of each feature set \n",
|
|||
|
" "
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 55,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Model training time: 0.00 seconds\n",
|
|||
|
"Mean squared error: 704.68\n",
|
|||
|
"Coefficient of determination (R²): 0.30\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "<base64-encoded PNG data omitted>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import time\n",
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from sklearn.linear_model import LinearRegression\n",
|
|||
|
"from sklearn.metrics import mean_squared_error, r2_score\n",
|
|||
|
"\n",
|
|||
|
"# Assumes df is already defined and loaded\n",
|
|||
|
"\n",
|
|||
|
"# Separate the features from the target variable (Glucose)\n",
|
|||
|
"X = df.drop('Glucose', axis=1)\n",
|
|||
|
"y = df['Glucose']\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encode categorical variables (convert categorical columns to numeric ones)\n",
|
|||
|
"X = pd.get_dummies(X, drop_first=True)\n",
|
|||
|
"\n",
|
|||
|
"# Fill any missing values with the column median\n",
|
|||
|
"X.fillna(X.median(), inplace=True)\n",
|
|||
|
"\n",
|
|||
|
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Train the model\n",
|
|||
|
"model = LinearRegression()\n",
|
|||
|
"\n",
|
|||
|
"# Start the timer\n",
|
|||
|
"start_time = time.time()\n",
|
|||
|
"model.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Elapsed training time\n",
|
|||
|
"train_time = time.time() - start_time\n",
|
|||
|
"\n",
|
|||
|
"# Predict on the validation set and evaluate the model\n",
|
|||
|
"val_predictions = model.predict(X_val)\n",
|
|||
|
"mse = mean_squared_error(y_val, val_predictions)\n",
|
|||
|
"r2 = r2_score(y_val, val_predictions)\n",
|
|||
|
"\n",
|
|||
|
"print(f'Model training time: {train_time:.2f} seconds')\n",
|
|||
|
"print(f'Mean squared error: {mse:.2f}')\n",
|
|||
|
"print(f'Coefficient of determination (R²): {r2:.2f}')\n",
|
|||
|
"\n",
|
|||
|
"# Visualize the results\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(y_val, val_predictions, alpha=0.5)\n",
|
|||
|
"plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)\n",
|
|||
|
"plt.xlabel('Actual glucose level')\n",
|
|||
|
"plt.ylabel('Predicted glucose level')\n",
|
|||
|
"plt.title('Actual vs. predicted glucose level')\n",
|
|||
|
"plt.show()\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Conclusions\n",
|
|||
|
"\n",
|
|||
|
"**The linear regression model (LinearRegression)** produced moderate results when predicting the glucose level of the Pima Indians. The quality metrics and cross-validation suggest that the model is not heavily overfitted, so it can serve as a baseline, although its accuracy limits its practical use.\n",
|
|||
|
"\n",
|
|||
|
"*Prediction accuracy:* The model reaches a coefficient of determination (R²) of 0.30, meaning it explains only about 30% of the variance in the target (glucose level). The mean squared error (MSE) is 704.68, which corresponds to an RMSE of about 26.5, so the predictions are often imprecise, especially for samples with very high or very low glucose levels.\n",
|
|||
|
"\n",
|
|||
|
"*Overfitting:* The difference between the RMSE on the training and test sets is small, which indicates that the model is not prone to overfitting. This gap should still be monitored when new features are added or the model is made more complex, to avoid fitting the training data too closely.\n",
|
|||
|
"\n",
|
|||
|
"*Cross-validation:* Under cross-validation the RMSE is slightly higher than on the test set (an increase of 2-3%). This may point to some instability of the model across different data subsamples. Further hyperparameter tuning may make the model more robust.\n",
|
|||
|
"\n",
|
|||
|
"*Recommendations:* Pay attention to further processing of the categorical features, improve the feature engineering, and consider optimizing the model (for example, through hyperparameter tuning) to improve prediction accuracy at extreme values."
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "aimenv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.5"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|