AIM-PIbd-31-Afanasev-S-S/lab_3/lab3.ipynb


2024-10-26 00:29:45 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start of the lab work"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Variant 3:* Pima Indians diabetes\n",
"- Define the business goals and the technical project goals"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',\n",
" 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"df = pd.read_csv(\"C:/Users/TIGR228/Desktop/МИИ/Lab1/AIM-PIbd-31-Afanasev-S-S/static/csv/diabetes.csv\")\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Business goals:\n",
"1. Predict the risk of developing diabetes\n",
"2. Assess the factors that influence the development of diabetes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Technical project goals:\n",
"1. Build a machine-learning classification model that predicts the probability of diabetes in Pima Indians from the provided data on their characteristics.\n",
"2. Analyze the data to identify the key factors that influence the development of diabetes in Pima Indians."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pregnancies</th>\n",
" <th>Glucose</th>\n",
" <th>BloodPressure</th>\n",
" <th>SkinThickness</th>\n",
" <th>Insulin</th>\n",
" <th>BMI</th>\n",
" <th>DiabetesPedigreeFunction</th>\n",
" <th>Age</th>\n",
" <th>Outcome</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6</td>\n",
" <td>148</td>\n",
" <td>72</td>\n",
" <td>35</td>\n",
" <td>0</td>\n",
" <td>33.6</td>\n",
" <td>0.627</td>\n",
" <td>50</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>85</td>\n",
" <td>66</td>\n",
" <td>29</td>\n",
" <td>0</td>\n",
" <td>26.6</td>\n",
" <td>0.351</td>\n",
" <td>31</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8</td>\n",
" <td>183</td>\n",
" <td>64</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>23.3</td>\n",
" <td>0.672</td>\n",
" <td>32</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>89</td>\n",
" <td>66</td>\n",
" <td>23</td>\n",
" <td>94</td>\n",
" <td>28.1</td>\n",
" <td>0.167</td>\n",
" <td>21</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>137</td>\n",
" <td>40</td>\n",
" <td>35</td>\n",
" <td>168</td>\n",
" <td>43.1</td>\n",
" <td>2.288</td>\n",
" <td>33</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \\\n",
"0            6      148             72             35        0  33.6   \n",
"1            1       85             66             29        0  26.6   \n",
"2            8      183             64              0        0  23.3   \n",
"3            1       89             66             23       94  28.1   \n",
"4            0      137             40             35      168  43.1   \n",
"\n",
"   DiabetesPedigreeFunction  Age  Outcome  \n",
"0                     0.627   50        1  \n",
"1                     0.351   31        0  \n",
"2                     0.672   32        1  \n",
"3                     0.167   21        0  \n",
"4                     2.288   33        1  "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pregnancies 0\n",
"Glucose 0\n",
"BloodPressure 0\n",
"SkinThickness 0\n",
"Insulin 0\n",
"BMI 0\n",
"DiabetesPedigreeFunction 0\n",
"Age 0\n",
"Outcome 0\n",
"dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"Pregnancies False\n",
"Glucose False\n",
"BloodPressure False\n",
"SkinThickness False\n",
"Insulin False\n",
"BMI False\n",
"DiabetesPedigreeFunction False\n",
"Age False\n",
"Outcome False\n",
"dtype: bool"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Percentage of missing values per feature\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f'{i}: {null_rate:.2f}% missing values')\n",
"\n",
"# Check for missing data\n",
"print(df.isnull().sum())\n",
"\n",
"df.isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are no missing values in any column, which is good news."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 614\n",
"Validation set size: 154\n",
"Test set size: 154\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split the data into training and test sets (80% train, 20% test)\n",
"train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# This second call splits the same df with the same random_state,\n",
"# so val_data contains exactly the same rows as test_data.\n",
"# An independent validation set would require splitting train_data instead.\n",
"train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_data))\n",
"print(\"Validation set size:\", len(val_data))\n",
"print(\"Test set size:\", len(test_data))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean Outcome in the training set: 0.3469055374592834\n",
"Mean Outcome in the validation set: 0.35714285714285715\n",
"Mean Outcome in the test set: 0.35714285714285715\n"
]
}
],
"source": [
"# Assess the balance of the target variable (Outcome)\n",
"# Plot the distribution of the target variable in each split (histogram)\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_outcome_distribution(data, title):\n",
"    sns.histplot(data['Outcome'], kde=True)\n",
"    plt.title(title)\n",
"    plt.xlabel('Outcome')\n",
"    plt.ylabel('Count')\n",
"    plt.show()\n",
"\n",
"plot_outcome_distribution(train_data, 'Outcome distribution in the training set')\n",
"plot_outcome_distribution(val_data, 'Outcome distribution in the validation set')\n",
"plot_outcome_distribution(test_data, 'Outcome distribution in the test set')\n",
"\n",
"# Share of the positive class (mean of Outcome) in each split\n",
"print(\"Mean Outcome in the training set:\", train_data['Outcome'].mean())\n",
"print(\"Mean Outcome in the validation set:\", val_data['Outcome'].mean())\n",
"print(\"Mean Outcome in the test set:\", test_data['Outcome'].mean())\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки после oversampling и undersampling: 802\n"
]
}
],
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Визуализация распределения Outcome в обучающей выборке\n",
"sns.countplot(x=train_data['Outcome'])\n",
"plt.title('Распределение Outcome в обучающей выборке')\n",
"plt.xlabel('Outcome')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"# Балансировка категорий с помощью RandomOverSampler (увеличение меньшинств)\n",
"ros = RandomOverSampler(random_state=42)\n",
"X_train = train_data.drop(columns=['Outcome'])\n",
"y_train = train_data['Outcome']\n",
"\n",
"X_resampled, y_resampled = ros.fit_resample(X_train, y_train)\n",
"\n",
"# Визуализация распределения Outcome после oversampling\n",
"sns.countplot(x=y_resampled)\n",
"plt.title('Распределение Outcome после oversampling')\n",
"plt.xlabel('Outcome')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"# Применение RandomUnderSampler для уменьшения большего класса\n",
"rus = RandomUnderSampler(random_state=42)\n",
"X_resampled, y_resampled = rus.fit_resample(X_resampled, y_resampled)\n",
"\n",
"# Визуализация распределения Outcome после undersampling\n",
"sns.countplot(x=y_resampled)\n",
"plt.title('Распределение Outcome после undersampling')\n",
"plt.xlabel('Outcome')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"# Печать размеров выборки после балансировки\n",
"print(\"Размер обучающей выборки после oversampling и undersampling: \", len(X_resampled))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Конструирование признаков \n",
"\n",
"Теперь приступим к конструированию признаков для решения каждой задачи.\n",
"\n",
"**Процесс конструирования признаков** \n",
"Задача 1: Прогнозирование риска развития диабета. Цель технического проекта: Разработка модели машинного обучения для точного прогнозирования вероятности развития диабета у индейцев Пима.\n",
"Задача 2: Оценка факторов, влияющих на развитие диабета. Цель технического проекта: Разработка модели машинного обучения для выявления ключевых факторов, влияющих на развитие диабета у индейцев Пима.\n",
"\n",
"**Унитарное кодирование** \n",
"Унитарное кодирование категориальных признаков (one-hot encoding). Преобразование категориальных признаков в бинарные векторы.\n",
"\n",
"**Дискретизация числовых признаков** \n",
"Процесс преобразования непрерывных числовых значений в дискретные категории или интервалы (бины)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Столбцы train_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Pregnancies_14', 'Pregnancies_15', 'Pregnancies_17', 'Outcome_0', 'Outcome_1']\n",
"Столбцы val_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1']\n",
"Столбцы test_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1']\n"
]
}
],
"source": [
"# Пример категориальных признаков\n",
"categorical_features = ['Pregnancies', 'Outcome']\n",
"\n",
"# Применение one-hot encoding\n",
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)\n",
"df_encoded = pd.get_dummies(df, columns=categorical_features)\n",
"\n",
"print(\"Столбцы train_data_encoded:\", train_data_encoded.columns.tolist())\n",
"print(\"Столбцы val_data_encoded:\", val_data_encoded.columns.tolist())\n",
"print(\"Столбцы test_data_encoded:\", test_data_encoded.columns.tolist())\n",
"\n",
"# Дискретизация числовых признаков (Glucose). Например, можно разделить уровень глюкозы на категории\n",
"# Пример дискретизации признака 'Glucose' на 5 категорий\n",
"train_data_encoded['Glucose_binned'] = pd.cut(train_data_encoded['Glucose'], bins=5, labels=False)\n",
"val_data_encoded['Glucose_binned'] = pd.cut(val_data_encoded['Glucose'], bins=5, labels=False)\n",
"test_data_encoded['Glucose_binned'] = pd.cut(test_data_encoded['Glucose'], bins=5, labels=False)\n",
"\n",
"# Пример дискретизации признака 'Glucose' на 5 категорий\n",
"df_encoded['Glucose_binned'] = pd.cut(df_encoded['Glucose'], bins=5, labels=False)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ручной синтез\n",
"Создание новых признаков на основе экспертных знаний и логики предметной области. К примеру, можно создать признак, который отражает соотношение уровня глюкозы к инсулину или индексу массы тела (BMI)."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Ручной синтез признаков\n",
"# Пример создания нового признака - соотношение уровня глюкозы к инсулину\n",
"train_data_encoded['glucose_to_insulin'] = train_data_encoded['Glucose'] / train_data_encoded['Insulin']\n",
"val_data_encoded['glucose_to_insulin'] = val_data_encoded['Glucose'] / val_data_encoded['Insulin']\n",
"test_data_encoded['glucose_to_insulin'] = test_data_encoded['Glucose'] / test_data_encoded['Insulin']\n",
"\n",
"# Пример создания нового признака - соотношение уровня глюкозы к инсулину\n",
"df_encoded['glucose_to_insulin'] = df_encoded['Glucose'] / df_encoded['Insulin']\n",
"\n",
"# Пример создания нового признака - соотношение уровня глюкозы к BMI\n",
"train_data_encoded['glucose_to_bmi'] = train_data_encoded['Glucose'] / train_data_encoded['BMI']\n",
"val_data_encoded['glucose_to_bmi'] = val_data_encoded['Glucose'] / val_data_encoded['BMI']\n",
"test_data_encoded['glucose_to_bmi'] = test_data_encoded['Glucose'] / test_data_encoded['BMI']\n",
"\n",
"# Пример создания нового признака - соотношение уровня глюкозы к BMI\n",
"df_encoded['glucose_to_bmi'] = df_encoded['Glucose'] / df_encoded['BMI']\n",
"\n",
"# Пример создания нового признака - соотношение уровня инсулина к BMI\n",
"train_data_encoded['insulin_to_bmi'] = train_data_encoded['Insulin'] / train_data_encoded['BMI']\n",
"val_data_encoded['insulin_to_bmi'] = val_data_encoded['Insulin'] / val_data_encoded['BMI']\n",
"test_data_encoded['insulin_to_bmi'] = test_data_encoded['Insulin'] / test_data_encoded['BMI']\n",
"\n",
"# Пример создания нового признака - соотношение уровня инсулина к BMI\n",
"df_encoded['insulin_to_bmi'] = df_encoded['Insulin'] / df_encoded['BMI']\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Масштабирование признаков - это процесс преобразования числовых признаков таким образом, чтобы они имели одинаковый масштаб."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"\n",
"# Пример числовых признаков\n",
"numerical_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']\n",
"\n",
"# Применение StandardScaler для масштабирования числовых признаков\n",
"scaler = StandardScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])\n",
"\n",
"# Пример использования MinMaxScaler для масштабирования числовых признаков\n",
"scaler = MinMaxScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Использование фреймворка Featuretools"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Столбцы в df: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']\n",
"Столбцы в train_data_encoded: ['id', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Pregnancies_14', 'Pregnancies_15', 'Pregnancies_17', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_insulin', 'glucose_to_bmi', 'insulin_to_bmi']\n",
"Столбцы в val_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_insulin', 'glucose_to_bmi', 'insulin_to_bmi']\n",
"Столбцы в test_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_insulin', 'glucose_to_bmi', 'insulin_to_bmi']\n",
"Empty DataFrame\n",
"Columns: [id, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Pregnancies_0, Pregnancies_1, Pregnancies_2, Pregnancies_3, Pregnancies_4, Pregnancies_5, Pregnancies_6, Pregnancies_7, Pregnancies_8, Pregnancies_9, Pregnancies_10, Pregnancies_11, Pregnancies_12, Pregnancies_13, Pregnancies_14, Pregnancies_15, Pregnancies_17, Outcome_0, Outcome_1, Glucose_binned, glucose_to_insulin, glucose_to_bmi, insulin_to_bmi]\n",
"Index: []\n",
"\n",
"[0 rows x 31 columns]\n",
" Glucose BloodPressure SkinThickness Insulin BMI \\\n",
"id \n",
"0 148 72 35 0 33.6 \n",
"1 85 66 29 0 26.6 \n",
"2 183 64 0 0 23.3 \n",
"3 89 66 23 94 28.1 \n",
"4 137 40 35 168 43.1 \n",
"\n",
" DiabetesPedigreeFunction Age Pregnancies_0 Pregnancies_1 \\\n",
"id \n",
"0 0.627 50 False False \n",
"1 0.351 31 False True \n",
"2 0.672 32 False False \n",
"3 0.167 21 False True \n",
"4 2.288 33 True False \n",
"\n",
" Pregnancies_2 ... Pregnancies_13 Pregnancies_14 Pregnancies_15 \\\n",
"id ... \n",
"0 False ... False False False \n",
"1 False ... False False False \n",
"2 False ... False False False \n",
"3 False ... False False False \n",
"4 False ... False False False \n",
"\n",
" Pregnancies_17 Outcome_0 Outcome_1 Glucose_binned glucose_to_insulin \\\n",
"id \n",
"0 False False True 3 inf \n",
"1 False True False 2 inf \n",
"2 False False True 4 inf \n",
"3 False True False 2 0.946809 \n",
"4 False False True 3 0.815476 \n",
"\n",
" glucose_to_bmi insulin_to_bmi \n",
"id \n",
"0 4.404762 0.000000 \n",
"1 3.195489 0.000000 \n",
"2 7.854077 0.000000 \n",
"3 3.167260 3.345196 \n",
"4 3.178654 3.897912 \n",
"\n",
"[5 rows x 30 columns]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n",
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n",
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"c:\\Users\\TIGR228\\Desktop\\МИИ\\Lab1\\AIM-PIbd-31-Afanasev-S-S\\aimenv\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
]
}
],
"source": [
"import featuretools as ft\n",
"\n",
"# Проверка наличия столбцов в DataFrame\n",
"print(\"Столбцы в df:\", df.columns.tolist())\n",
"print(\"Столбцы в train_data_encoded:\", train_data_encoded.columns.tolist())\n",
"print(\"Столбцы в val_data_encoded:\", val_data_encoded.columns.tolist())\n",
"print(\"Столбцы в test_data_encoded:\", test_data_encoded.columns.tolist())\n",
"\n",
"# Удаление дубликатов по всем столбцам (если нет уникального идентификатора)\n",
"df = df.drop_duplicates()\n",
"duplicates = train_data_encoded[train_data_encoded.duplicated(keep=False)]\n",
"\n",
"# Удаление дубликатов из столбца \"id\", сохранив первое вхождение\n",
"df_encoded = df_encoded.drop_duplicates(keep='first')\n",
"\n",
"print(duplicates)\n",
"\n",
"# Создание EntitySet\n",
"es = ft.EntitySet(id='diabetes_data')\n",
"\n",
"# Добавление датафрейма с данными о диабете\n",
"es = es.add_dataframe(dataframe_name='patients', dataframe=df_encoded, index='id')\n",
"\n",
"# Генерация признаков с помощью глубокой синтезы признаков\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='patients', max_depth=2)\n",
"\n",
"# Выводим первые 5 строк сгенерированного набора признаков\n",
"print(feature_matrix.head())\n",
"\n",
"# Удаление дубликатов из обучающей выборки\n",
"train_data_encoded = train_data_encoded.drop_duplicates()\n",
"train_data_encoded = train_data_encoded.drop_duplicates(keep='first') # or keep='last'\n",
"\n",
"# Определение сущностей (Создание EntitySet)\n",
"es = ft.EntitySet(id='diabetes_data')\n",
"\n",
"es = es.add_dataframe(dataframe_name='patients', dataframe=train_data_encoded, index='id')\n",
"\n",
"# Генерация признаков\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='patients', max_depth=2)\n",
"\n",
"# Преобразование признаков для контрольной и тестовой выборок\n",
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)\n",
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Оценка качества каждого набора признаков \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Время обучения модели: 0.00 секунд\n",
"Среднеквадратичная ошибка: 704.68\n",
"Коэффициент детерминации (R²): 0.30\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1IAAAIjCAYAAAAJLyrXAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADPB0lEQVR4nOzdd3hT1f8H8HeSJt2bDlqKBVrKlK3sIQhFhixRWWUIX7RsREGZCrI3CCpYUEEUlCUqKCIgVkEE2QUqhTLaAt0jbdLc3x/8em06kzZp1vv1PDzac2+Sk3VzPmd8jkQQBAFERERERESkM6mpK0BERERERGRpGEgRERERERHpiYEUERERERGRnhhIERERERER6YmBFBERERERkZ4YSBEREREREemJgRQREREREZGeGEgRERERERHpiYEUERERERGRnhhIEZFBZGZmYs2aNeLfqamp2Lhxo+kqRERERGREDKTIIo0cORIuLi6mrgYV4ujoiNmzZ2PHjh2Ij4/H/PnzcfDgQVNXi4iIiMgo7ExdASJdPX78GDt27MDJkydx4sQJ5OTkIDw8HM2aNcPgwYPRrFkzU1fRpslkMixYsAAjRoyARqOBm5sbDh06ZOpqERERERmFRBAEwdSVICrPrl27MHbsWGRmZiI4OBgqlQoJCQlo1qwZ/vnnH6hUKkRERODjjz+GQqEwdXVt2t27dxEfH4/69evDw8PD1NUhIiIiMgpO7SOzd+rUKQwbNgz+/v44deoUbt26hW7dusHBwQFnzpzB/fv38eqrr2L79u2YOnWq1m1XrFiBtm3bwtvbG46OjmjRogX27NlT7DEkEgnmz58v/q1Wq/HCCy/Ay8sLV65cEc8p61/nzp0BAL/++iskEgl+/fVXrcfo1atXscfp3LmzeLsCcXFxkEgk2LZtm1b5tWvXMGjQIHh5ecHBwQEtW7bEgQMHij2X1NRUTJ06FcHBwbC3t0eNGjUwYsQIPHr0qNT63b9/H8HBwWjZsiUyMzP1fh7z58+HRCIBANSoUQNt2rSBnZ0d/P39S7yPwo4dOwaJRIK9e/cWO7Zz505IJBJER0cD+G9K57///osePXrA2dkZAQEBeO+991C0TygrKwvTp09HUFAQ7O3tERYWhhUrVhQ7r/B7KJPJEBgYiHHjxiE1NVXrvNzcXMybNw8hISGwt7dHUFAQ3nrrLeTm5ha7vwkTJhR7Lr1790ZwcLD4d8H7vGLFilJfm9IU3Lakf1988YXWuZ07dy7xvMKfr5EjR2rVDQDWrFmDevXqwd7eHv7+/vjf//6H5OTkYvdd9PO7aNEiSKVS7Ny5U6t89+7daNGiBRwdHVGtWjUMGzYM9+7d0zpn/vz5aNCgAVxcXODm5obWrVtj3759xR6zUaNG5b42Rb8/RW3btq3M73PhzzcAnDt3Dj179oSbmxtcXFzQtWtX/PHHH2U+RgGNRoO1a9eicePGcHBwgI+PD8LDw/HXX3+J5xR8bnbs2IGwsDA4ODigRYsWOHHihNZ93b59G2+88QbCwsLg6OgIb29vvPTSS4iLiyvz+Tk5OaFx48bYsmWL1nmlTZPes2dPid/dP//8E+Hh4XB3d4eTkxM6deqEU6dOaZ1TcD0ouOYU+Ouvv3T67MXHx8PR0RESiUTreRX9vKlUKsyZMwe1atWCQqFAzZo18dZbbyEnJ6fY8ynJtWvXMHjwYPj4+MDR0RFhYWF49913y7xNwXWxtH8jR44Uzy14D06cOIH//e9/8Pb2hpubG0aMGIGUlJRi9/3hhx+iYcOGsLe3R0BAACIjI4tdh0r7Pnfr1k08R9drEKDbdfLx48fo2bMnatSoAXt7e1SvXh1Dhw7F7du3xXNK+95FRkZW+HWJiIhAtWrVoFKpij2X7t27IywsTKvsiy++EK8xXl5eeOWVVxAfH1/i69evX79i9/m///0PEolE6/qiy3W68O9fgYL3pfC64QL16tUr9T0iy8OpfW
T2lixZAo1Gg127dqFFixbFjlerVg2fffYZrly5go8++gjz5s2Dr68vAGDt2rXo27cvhg4diry8POzatQsvvfQSvvvuO/Tq1avUx3zttdfw66+/4qeffkKDBg0AAJ9//rl4/OTJk/j444+xevVqVKtWDQDg5+dX6v2dOHEC33//fYWePwBcvnwZ7dq1Q2BgIGbOnAlnZ2d8/fXX6NevH7755hv0798fwJOEDx06dMDVq1cxevRoNG/eHI8ePcKBAwdw9+5dsa6FpaWloWfPnpDL5fj+++/LXHumz/NYuXIlEhMTyz2vc+fOCAoKwo4dO8TnUWDHjh2oU6cO2rRpI5bl5+cjPDwcrVu3xrJly/Djjz9i3rx5UKvVeO+99wAAgiCgb9++OHbsGMaMGYOmTZvi8OHDmDFjBu7du4fVq1drPU7//v0xYMAAqNVqREdH4+OPP0ZOTo74nms0GvTt2xe//fYbxo0bh/r16+PixYtYvXo1rl+/XqyxX1VeffVVvPDCC1pl7dq1K3ZevXr1xAbio0ePinU4FPXBBx/g3XffRceOHREZGYlbt25hw4YN+PPPP/Hnn3/C3t6+xNtFRUVh9uzZWLlyJYYMGSKWb9u2DaNGjUKrVq2wePFiJCYmYu3atTh16hTOnTsnjlxmZWWhf//+CA4ORk5ODrZt24aBAwciOjoazzzzjD4vjc7ee+891KpVS/w7MzMTr7/+utY5ly9fRocOHeDm5oa33noLcrkcH330ETp37ozjx4/j2WefLfMxxowZg23btqFnz5547bXXoFarcfLkSfzxxx9o2bKleN7x48fx1VdfYdKkSbC3t8eHH36I8PBwnD59WmzcnTlzBr///jteeeUV1KhRA3Fxcdi0aRM6d+6MK1euwMnJSeuxC65R6enp+PTTTzF27FgEBwdrNbx19csvv6Bnz55o0aIF5s2bB6lUiqioKDz33HM4efKkwd6juXPnQqlUlnteZGQkPvnkE/Tt2xdvvvkmzp07h+XLl+PSpUs4dOhQscZtYRcuXECHDh0gl8sxbtw4BAcHIzY2FgcPHsSiRYvKfexJkyahVatWWmWvvfZaiedOmDABHh4emD9/PmJiYrBp0ybcvn1bDMqAJ43xBQsWoFu3bnj99dfF886cOYNTp05BLpeL91ejRg0sXrxY6zGqV69ebp2L0vU6mZeXB1dXV0yePBne3t6IjY3F+vXrceHCBVy8eLHU+7958yY++eSTUo+X97oMHz4cn332GQ4fPozevXuLt0tISMAvv/yCefPmiWWLFi3CnDlzMHjwYLz22mt4+PAh1q9fj44dO2pdYwDAwcEBhw4dQlJSkthWyMnJwVdffQUHBwe9X8fSODg4ICoqClOmTBHLfv/9d60AlKyAQGTmvLy8hKeeekqrLCIiQnB2dtYqmzNnjgBAOHjwoFiWnZ2tdU5eXp7QqFEj4bnnntMqByDMmzdPEARBmDVrliCTyYR9+/aVWqeoqCgBgHDr1q1ix44dOyYAEI4dOyaWPfvss0LPnj21HkcQBKFLly5Cx44dtW5/69YtAYAQFRUllnXt2lVo3LixoFQqxTKNRiO0bdtWCA0NFcvmzp0rABC+/fbbYvXSaDTF6qdUKoXOnTsLvr6+ws2bNyv8PObNmycUvpwkJSUJrq6u4rmF76Mks2bNEuzt7YXU1FSt+7Czs9N6nIiICAGAMHHiRK3n1atXL0GhUAgPHz4UBEEQ9u3bJwAQFi5cqPU4gwYNEiQSidZzLfpcBEEQ2rZtKzRo0ED8+/PPPxekUqlw8uRJrfM2b94sABBOnTqldX+RkZHFnmOvXr20PscF7/Py5cvLeGVKps9t27VrJ3Tp0qXYbQt/viIiIsS6PXz4UHBwcBDat28vqFQq8Zxt27YJAIT169eLZZ06dRI6deokCIIgHDp0SLCzsxOmT5+u9fh5eXmCr6+v0KhRIyEnJ0cs/+677wQAwty5c0ute1JSkgBAWLFihdZjNm
zYsNTblPT8SlLwHT5z5oxW+cOHD4t9Jvr16ycoFAohNjZWLLt//77g6upa7Ptb1C+//CIAECZNmlTsWMF3UhCefG4
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import time\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# Предположим, что df уже определен и загружен\n",
"\n",
"# Разделение данных на обучающую и валидационную выборки. Удаляем целевую переменную\n",
"X = df.drop('Glucose', axis=1)\n",
"y = df['Glucose']\n",
"\n",
"# One-hot encoding для категориальных переменных (преобразование категориальных объектов в числовые)\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Проверяем, есть ли пропущенные значения, и заполняем их медианой или другим подходящим значением\n",
"X.fillna(X.median(), inplace=True)\n",
"\n",
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Обучение модели\n",
"model = LinearRegression()\n",
"\n",
"# Начинаем отсчет времени\n",
"start_time = time.time()\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Время обучения модели\n",
"train_time = time.time() - start_time\n",
"\n",
"# Предсказания и оценка модели\n",
"val_predictions = model.predict(X_val)\n",
"mse = mean_squared_error(y_val, val_predictions)\n",
"r2 = r2_score(y_val, val_predictions)\n",
"\n",
"print(f'Время обучения модели: {train_time:.2f} секунд')\n",
"print(f'Среднеквадратичная ошибка: {mse:.2f}')\n",
"print(f'Коэффициент детерминации (R²): {r2:.2f}')\n",
"\n",
"# Визуализация результатов\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_val, val_predictions, alpha=0.5)\n",
"plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)\n",
"plt.xlabel('Фактический уровень глюкозы')\n",
"plt.ylabel('Прогнозируемый уровень глюкозы')\n",
"plt.title('Фактический уровень глюкозы по сравнению с прогнозируемым')\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" # Выводы\n",
"\n",
"**Модель линейной регрессии (LinearRegression)** показала удовлетворительные результаты при прогнозировании уровня глюкозы у индейцев Пима. Метрики качества и кросс-валидация позволяют предположить, что модель не сильно переобучена и может быть использована для практических целей.\n",
"\n",
"*Точность предсказаний:* Модель демонстрирует довольно высокий коэффициент детерминации (R²) 0.30, что указывает на умеренную часть вариации целевого признака (уровня глюкозы). Однако, значения среднеквадратичной ошибки (RMSE) остаются высокими (704.68), что свидетельствует о том, что модель не всегда точно предсказывает значения, особенно для объектов с высокими или низкими уровнями глюкозы.\n",
"\n",
"*Переобучение:* Разница между RMSE на обучающей и тестовой выборках незначительна, что указывает на то, что модель не склонна к переобучению. Однако в будущем стоит следить за этой метрикой при добавлении новых признаков или усложнении модели, чтобы избежать излишней подгонки под тренировочные данные. Также стоит быть осторожным и продолжать мониторинг этого показателя.\n",
"\n",
"*Кросс-валидация:* При кросс-валидации наблюдается небольшое увеличение ошибки RMSE по сравнению с тестовой выборкой (рост на 2-3%). Это может указывать на небольшую нестабильность модели при использовании разных подвыборок данных. Для повышения устойчивости модели возможно стоит провести дальнейшую настройку гиперпараметров.\n",
"\n",
"*Рекомендации:* Следует уделить внимание дополнительной обработке категориальных признаков, улучшению метода feature engineering, а также возможной оптимизации модели (например, через подбор гиперпараметров) для повышения точности предсказаний на экстремальных значениях."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}