2024-11-15 15:14:05 +04:00
{
"cells": [
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',\n",
" 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],\n",
" dtype='object')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd;\n",
"df = pd.read_csv(\"data/diabetes.csv\")\n",
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Определение бизнес целей:\n",
"1. Предсказание риска развития диабета. Данная цель поможет определить людей с высоким риском заболевания, что полезно для профилактических программ и эффективного медицинского вмешательства.\n",
"2. Анализ ключевых факторов, влияющих на диабет. Позволит выявить основные факторы риска и разработать рекомендации для улучшения здоровья населения."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Определение целей технического проекта:\n",
"1. Построить модель, которая сможет прогнозировать наличие диабета, основываясь на других параметрах.\n",
"2. Провести анализ данных для выявления факторов, которые больше всего влияют на риск развития диабета."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Проверим данные на пустые значения"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pregnancies 0\n",
"Glucose 0\n",
"BloodPressure 0\n",
"SkinThickness 0\n",
"Insulin 0\n",
"BMI 0\n",
"DiabetesPedigreeFunction 0\n",
"Age 0\n",
"Outcome 0\n",
"dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"Pregnancies False\n",
"Glucose False\n",
"BloodPressure False\n",
"SkinThickness False\n",
"Insulin False\n",
"BMI False\n",
"DiabetesPedigreeFunction False\n",
"Age False\n",
"Outcome False\n",
"dtype: bool"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"for i in df.columns:\n",
" null_rate = df[i].isnull().sum() / len(df) * 100\n",
" if null_rate > 0:\n",
" print(f'{i} Процент пустых значений: %{null_rate:.2f}')\n",
"\n",
"print(df.isnull().sum())\n",
"\n",
"df.isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Пустые и номинальные значения отсутствуют.\n",
"Разделение данных на обучающую, тестовую и контрольную выборки"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки: 614\n",
"Размер контрольной выборки: 154\n",
"Размер тестовой выборки: 154\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"print(\"Размер обучающей выборки: \", len(train_data))\n",
"print(\"Размер контрольной выборки: \", len(val_data))\n",
"print(\"Размер тестовой выборки: \", len(test_data))"
]
},
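{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of a three-way split with a separate validation set, assuming a target of roughly 60/20/20 for training, validation, and test. The names `rest`, `train_part`, `val_part`, and `test_part` are hypothetical and chosen so that the variables used in the rest of this notebook are not overwritten. Stratifying on `Outcome` is an optional assumption that keeps the class ratio similar across the parts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hold out 20% of all rows as the test part (hypothetical names, see the note above).\n",
"rest, test_part = train_test_split(\n",
"    df, test_size=0.2, random_state=42, stratify=df['Outcome']\n",
")\n",
"\n",
"# Split the remaining 80% again: 25% of it is about 20% of the original data.\n",
"train_part, val_part = train_test_split(\n",
"    rest, test_size=0.25, random_state=42, stratify=rest['Outcome']\n",
")\n",
"\n",
"# The three parts should not share any rows.\n",
"assert set(val_part.index).isdisjoint(test_part.index)\n",
"assert set(train_part.index).isdisjoint(val_part.index)\n",
"print(len(train_part), len(val_part), len(test_part))"
]
},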
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Оценка сбалансированности целевой переменной (Outcome). Визуализация распределения целевой переменной в выборках (гистограмма)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABkKUlEQVR4nO3dd1hTZ/8G8DsJJMywIaCAiANUcKAi7oGTWtvaWqt11UpbsW+rrfVna10dji6t276utq5qq7bWqrhw4UJRVEREFFSmyt7h+f1ByWsEHAgE4v25rnNBznnOOd/zJIQ7Z0UihBAgIiIi0lNSXRdAREREVJ0YdoiIiEivMewQERGRXmPYISIiIr3GsENERER6jWGHiIiI9BrDDhEREek1hh0iIiLSaww7RES1TGZmJm7cuIHs7Gxdl0JVLC0tDdeuXUNRUZGuS3muMOwQEemYEAIrV65Ehw4dYGJiAqVSCTc3N/z666+6Lq1OuHXrFtauXat5fOPGDaxfv153BT2gsLAQ8+fPR8uWLaFQKGBlZYXGjRtj//79ui7tucKwU0esXbsWEolEMxgZGaFJkyaYMGECkpKSdF0e1WI7d+5Ev379YGNjo3ndfPzxx7h7926ll3nnzh3MnDkT4eHhVVfoc2zYsGF499134enpiV9++QXBwcHYt28fXnnlFV2XVidIJBIEBQVhz549uHHjBj755BMcOXJE12UhPz8f/v7++Pzzz9G9e3ds2bIFwcHBOHDgAPz8/HRd3nPFQNcF0NOZPXs23NzckJeXh6NHj2LZsmXYtWsXLl68CBMTE12XR7XMxx9/jO+++w4tW7bElClTYG1tjbNnz2Lx4sXYtGkT9u/fj6ZNmz71cu/cuYNZs2ahQYMGaNWqVdUX/hz5+eefsXnzZvz6668YNmyYrsupk+rVq4dx48ahX79+AABHR0ccOnRIt0UBmDdvHk6ePIk9e/age/fuui7n+SaoTlizZo0AIE6fPq01ftKkSQKA2LBhg44qo9pqw4YNAoB4/fXXRVFRkda0kydPChMTE+Hl5SUKCwufetmnT58WAMSaNWuqqNrnV4sWLcSwYcN0XYZeuHbtmjhx4oTIysrSdSmisLBQWFlZiU8//VTXpZAQgoex6riePXsCAGJjYwEA9+7dw8cffwwvLy+YmZlBqVSif//+OH/+fJl58/LyMHPmTDRp0gRGRkZwdHTEK6+8gpiYGAAlx70fPHT28PDgJ5VDhw5BIpFg8+bN+PTTT6FSqWBqaooXX3wR8fHxZdZ98uRJ9OvXDxYWFjAxMUG3bt1w7Nixcrexe/fu5a5/5syZZdr++uuv8PHxgbGxMaytrTF06NBy1/+obXtQcXExFixYgObNm8PIyAgODg545513cP/+fa12DRo0wAsvvFBmPRMmTCizzPJq/+abb8r0KVCyG3zGjBlo1KgRFAoFnJ2d8cknnyA/P7/cvnrQrFmzYGVlhZUrV0Imk2lNa9++PaZMmYKIiAhs3bpVaztGjx5dZlndu3fX1Hbo0CG0a9cOADBmzBhNvz14zsTJkycxYMAAWFlZwdTUFN7e3li4cKHWMg8cOIAuXbrA1NQUlpaWGDRoECIjI7XazJw5ExKJBFevXsWbb74JCwsL2NnZ4fPPP4cQAvHx8Rg0aBCUSiVUKhW+++67MrU/Sx8+/NqztbVFQEAALl68+Nh5AWDLli2a16OtrS3efPNN3L59WzM9OzsbFy9ehLOzMwICAqBUKmFqaoru3btrHYa5fv06JBIJfvjhhzLrOH78OCQSCTZu3Kip+eHXUenr/cHn6MKFCxg9ejQaNmwIIyMjqFQqvPXWW2UOb5YeQr9x44Zm3J49e9CxY0eYmJjAwsICL7zwQpk+KX3uUlNTNePOnDlTpg4AaNGiRbl7Pv755x/Na8Tc3BwBAQG4dOmSVpvRo0ejQYMGAAB3d3f4+vri3r17MDY2LlN3eUaPHq31HFtZWZXpf6
Div/FSpe+BpXuUoqKicP/+fZibm6Nbt26P7CsAOHfuHPr37w+lUgkzMzP06tULJ06c0GpT+lwcPnwY77zzDmxsbKBUKjFy5Mhy35Me/lsODAyEkZFRmb1eT9LPdR0PY9VxpcHExsYGQMmb4vbt2/Haa6/Bzc0NSUlJWLFiBbp164bLly/DyckJAKBWq/HCCy9g//79GDp0KD744ANkZmYiODgYFy9ehLu7u2Ydb7zxBgYMGKC13qlTp5Zbz1dffQWJRIIpU6YgOTkZCxYsgL+/P8LDw2FsbAyg5J9c//794ePjgxkzZkAqlWLNmjXo2bMnjhw5gvbt25dZbv369TFnzhwAQFZWFt57771y1/35559jyJAhePvtt5GSkoJFixaha9euOHfuHCwtLcvMExgYiC5dugAA/vjjD2zbtk1r+jvvvIO1a9dizJgx+M9//oPY2FgsXrwY586dw7Fjx2BoaFhuPzyNtLQ0zbY9qLi4GC+++CKOHj2KwMBAeHp6IiIiAj/88AOuXr2K7du3V7jM6OhoREVFYfTo0VAqleW2GTlyJGbMmIGdO3di6NChT1yvp6cnZs+ejenTp2v1X8eOHQEAwcHBeOGFF+Do6IgPPvgAKpUKkZGR2LlzJz744AMAwL59+9C/f380bNgQM2fORG5uLhYtWoROnTrh7Nmzmn9epV5//XV4enpi7ty5+Pvvv/Hll1/C2toaK1asQM+ePTFv3jysX78eH3/8Mdq1a4euXbs+cx+W8vDwwGeffQYhBGJiYvD9999jwIABiIuLe+R8pa+bdu3aYc6cOUhKSsLChQtx7NgxzeuxNFjMmzcPKpUKkydPhpGREX766Sf4+/sjODgYXbt2RcOGDdGpUyesX78eEydO1FrP+vXrYW5ujkGDBj12Wx4UHByM69evY8yYMVCpVLh06RJWrlyJS5cu4cSJE2VCeqkjR45gwIABcHV1xYwZM1BYWIilS5eiU6dOOH36NJo0afJUdVTkl19+wahRo9C3b1/MmzcPOTk5WLZsGTp37oxz586VeY08aPr06cjLy3viddna2mqC5K1bt7Bw4UIMGDAA8fHx5b5vPInS53bq1Klo3LgxZs2ahby8PCxZsqRMX126dAldunSBUqnEJ598AkNDQ6xYsQLdu3dHSEgIfH19tZY9YcIEWFpaYubMmYiKisKyZctw8+ZNTeAqz4wZM7Bq1Sps3rxZK1g+Sz/XKbretURPpvQw1r59+0RKSoqIj48XmzZtEjY2NsLY2FjcunVLCCFEXl6eUKvVWvPGxsYKhUIhZs+erRm3evVqAUB8//33ZdZVXFysmQ+A+Oabb8q0ad68uejWrZvm8cGDBwUAUa9ePZGRkaEZ/9tvvwkAYuHChZplN27cWPTt21ezHiGEyMnJEW5ubqJ3795l1tWxY0fRokULzeOUlBQBQMyYMUMz7saNG0Imk4mvvvpKa96IiAhhYGBQZnx0dLQAINatW6cZN2PGDPHgn8SRI0cEALF+/XqteXfv3l1mvKurqwgICChTe1BQkHj4z+zh2j/55BNhb28vfHx8tPr0l19+EVKpVBw5ckRr/uXLlwsA4tixY2XWV2r79u0CgPjhhx8qbCOEEEqlUrRp00ZrO0aNGlWmXbdu3bRqq+gwVlFRkXBzcxOurq7i/v37WtMefL5btWol7O3txd27dzXjzp8/L6RSqRg5cqRmXOlzEhgYqLWO+vXrC4lEIubOnasZf//+fWFsbKxV/7P0YXnbLYQQn376qQAgkpOTK5yvoKBA2NvbixYtWojc3FzN+J07dwoAYvr06UKI//2NyeVycfXqVU27lJQUYWNjI3x8fDTjVqxYIQCIyMhIrfXY2tpqbXOPHj1E165dteopXc+Dz1dOTk6Zujdu3CgAiMOHD2vGlb73xMbGCiGE8PHxERYWFiIxMVHT5urVq8LQ0FAMHjxYM670uUtJSdGMq+h18/D7SWZmprC0tBTjxo3TapeYmCgsLC
y0xo8aNUq4urpqHl+8eFFIpVLRv39/rbor8vD8QgixcuVKAUCcOnVKM66iv/FSpe+BBw8e1Hpsa2srUlNTNe3K66u
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABa90lEQVR4nO3dd1zV1f8H8Ncd3MveGwFRQRBX4gi3hrnTtNSyMjOtxG+usswcmWVq5S5tqA3LUlPLvTUVF4oTERUEUbbsfe/5/YHcn1fQFJB7+fh6Ph6fB97zGfd9D/deXp7PkgkhBIiIiIgkSm7oAoiIiIgeJ4YdIiIikjSGHSIiIpI0hh0iIiKSNIYdIiIikjSGHSIiIpI0hh0iIiKSNIYdIiIikjSGHSIiIgMrKSlBcnIy4uLiDF2KJDHsEBGRZG3evBkRERG6xxs3bsSFCxcMV9BdoqOjMXLkSLi5uUGlUsHFxQXBwcHgjQ2qH8OOkVm1ahVkMpluMjU1hZ+fH8aMGYOkpCRDl0dGbPPmzejRowccHBx075v33nsPaWlpld7mzZs3MWPGDL0/FkS1yblz5zB27FhER0fj6NGjePvtt5GdnW3osnD06FG0bt0ae/fuxYcffogdO3Zg165d2LhxI2QymaHLkxwZ741lXFatWoXhw4dj5syZ8PHxQUFBAQ4dOoRffvkF3t7eOH/+PMzNzQ1dJhmZ9957D1999RWaNWuGl19+Gfb29jh16hRWrFgBR0dH7NmzBw0bNnzk7Z48eRKtWrXCypUr8frrr1d/4USPWUpKCtq2bYsrV64AAAYMGID169cbtKaioiI0a9YM1tbW2LlzJ2xsbAxaz5NAaegCqGI9e/ZEy5YtAQBvvvkmHBwc8PXXX2PTpk146aWXDFwdGZPff/8dX331FQYPHozVq1dDoVDo5r3++uvo0qULXnzxRZw6dQpKJT/y9GRxcnLC+fPndf9RDAgIMHRJ+OeffxAVFYVLly4x6NQQ7saqJbp27QoAiImJAQCkp6fjvffeQ5MmTWBpaQlra2v07NkTZ86cKbduQUEBZsyYAT8/P5iamsLNzQ0DBgzA1atXAQCxsbF6u87unTp37qzb1v79+yGTyfDHH3/go48+gqurKywsLPDcc88hPj6+3HMfO3YMPXr0gI2NDczNzdGpUyccPny4wtfYuXPnCp9/xowZ5Zb99ddfERQUBDMzM9jb22PIkCEVPv+DXtvdtFotFixYgMDAQJiamsLFxQVvvfUWbt++rbdc3bp10adPn3LPM2bMmHLbrKj2efPmletTACgsLMT06dPRoEEDqNVqeHp6YtKkSSgsLKywr+72ySefwM7ODt99951e0AGA1q1b44MPPsC5c+ewbt06vddR0UhN586ddbXt378frVq1AgAMHz5c12+rVq3SLX/s2DH06tULdnZ2sLCwQNOmTbFw4UK9be7duxcdOnSAhYUFbG1t0a9fP0RGRuotM2PGDMhkMly+fBmvvPIKbGxs4OTkhKlTp0IIgfj4ePTr1w/W1tZwdXXFV199Va72qvThve89R0dH9O7dG+fPn3+ode/9fX722WeQy+X47bff9NrXrl2re986OjrilVdeQUJCgt4yr7/+OiwtLcs9z7p16yCTybB///4Ka37Qe1wmk2HMmDFYvXo1GjZsCFNTUwQFBeHgwYPlnuf06dPo2bMnrK2tYWlpiWeeeQZHjx59qH6r6D3SuXNnNG7c+EFdqFfjvfr06YO6devqteXm5mLixInw9PSEWq1Gw4YN8eWXX5Y71qXsM6hWqxEUFISAgID7fgbvV1PZpFAo4OHhgVGjRiEjI0O3TNl34t2fr3u9/vrreq/h6NGj8PHxwfr161G/fn2oVCp4eXlh0qRJyM/PL7f+N998g8DAQKjVari7uyM0NFSvBuD/+zk8PBxt27aFmZkZfHx8sGzZMr3lyuotex8Bpbur69ati5YtWyInJ0fXXpXPlL
Hhf/NqibJg4uDgAAC4du0aNm7ciBdffBE+Pj5ISkrC8uXL0alTJ1y8eBHu7u4AAI1Ggz59+mDPnj0YMmQIxo4di+zsbOzatQvnz59H/fr1dc/x0ksvoVevXnrPO3ny5Arr+eyzzyCTyfDBBx8gOTkZCxYsQEhICCIiImBmZgag9I9cz549ERQUhOnTp0Mul2PlypXo2rUr/v33X7Ru3brcduvUqYPZs2cDAHJycvDOO+9U+NxTp07FoEGD8OabbyIlJQWLFy9Gx44dcfr0adja2pZbZ9SoUejQoQMA4K+//sKGDRv05r/11lu6XYjvvvsuYmJisGTJEpw+fRqHDx+GiYlJhf3wKDIyMnSv7W5arRbPPfccDh06hFGjRiEgIADnzp3D/PnzcfnyZWzcuPG+24yOjkZUVBRef/11WFtbV7jMa6+9hunTp2Pz5s0YMmTIQ9cbEBCAmTNnYtq0aXr917ZtWwDArl270KdPH7i5uWHs2LFwdXVFZGQkNm/ejLFjxwIAdu/ejZ49e6JevXqYMWMG8vPzsXjxYrRr1w6nTp0q90ds8ODBCAgIwBdffIEtW7Zg1qxZsLe3x/Lly9G1a1fMmTMHq1evxnvvvYdWrVqhY8eOVe7DMv7+/pgyZQqEELh69Sq+/vpr9OrV65HPjlm5ciU+/vhjfPXVV3j55Zd17WXvr1atWmH27NlISkrCwoULcfjw4fu+bx9kypQpePPNNwEAqampGD9+vN7v6V4HDhzAH3/8gXfffRdqtRrffPMNevTogePHj+vCyIULF9ChQwdYW1tj0qRJMDExwfLly9G5c2ccOHAAbdq0Kbfdsn67u47HSQiB5557Dvv27cOIESPQvHlz7NixA++//z4SEhIwf/78+657v8/ggzz//PMYMGAASkpKEBYWhu+++w75+fn45ZdfKv0a0tLScO3aNXz00UcYMGAAJk6ciJMnT2LevHk4f/48tmzZogurM2bMwCeffIKQkBC88847iIqKwrfffosTJ06U+266ffs2evXqhUGDBuGll17Cn3/+iXfeeQcqlQpvvPFGhbVkZmaiZ8+eMDExwdatW3VBuzo+U0ZFkFFZuXKlACB2794tUlJSRHx8vFizZo1wcHAQZmZm4saNG0IIIQoKCoRGo9FbNyYmRqjVajFz5kxd24oVKwQA8fXXX5d7Lq1Wq1sPgJg3b165ZQIDA0WnTp10j/ft2ycACA8PD5GVlaVr//PPPwUAsXDhQt22fX19Rffu3XXPI4QQeXl5wsfHR3Tr1q3cc7Vt21Y0btxY9zglJUUAENOnT9e1xcbGCoVCIT777DO9dc+dOyeUSmW59ujoaAFA/PTTT7q26dOni7vf+v/++68AIFavXq237vbt28u1e3t7i969e5erPTQ0VNz7cbq39kmTJglnZ2cRFBSk16e//PKLkMvl4t9//9Vbf9myZQKAOHz4cLnnK7Nx40YBQMyfP/++ywghhLW1tWjRooXe6xg2bFi55Tp16qRX24kTJwQAsXLlSr3lSkpKhI+Pj/D29ha3b9/Wm3f377t58+bC2dlZpKWl6drOnDkj5HK5eO2113RtZb+TUaNG6T1HnTp1hEwmE1988YWu/fbt28LMzEyv/qr0YUWvWwghPvroIwFAJCcnP/S6W7ZsEUqlUkycOFFvmaKiIuHs7CwaN24s8vPzde2bN28WAMS0adN0bcOGDRMWFhblnmft2rUCgNi3b1+5eWWf4Xt/T2UACADi5MmTurbr168LU1NT8fzzz+va+vfvL1Qqlbh69aqu7ebNm8LKykp07Nix3HbbtWsnunTp8sA6OnXqJAIDAyus694aQ0NDy7X37t1beHt76x6XvednzZqlt9wLL7wgZDKZuHLlit42H+Yz+KCa7l5fiNLvqUaNGukel30nrl279r7bGTZsmN5rGDZsmAAgXn/9db3lyj4H//zzjxBCiOTkZKFSqcSzzz6r932/ZMkSAUCsWLFC19apUycBQH
z11Ve6tsLCQt1nsKioSK/effv2iYKCAtG5c2fh7Oys129CVP0zZWy4G8tIhYSEwMnJCZ6enhgyZAgsLS2xYcMGeHh
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABX/klEQVR4nO3dd1xV9f8H8Ne9F+5l73EBAVFREFfiCPfA3GlZalmZ5Si1r6ssM0dmmVaGK205KtPSHGWKA1NTcYsTEQUEQbbszf38/kDuzytogMiFw+v5eJwH3M8Z930+93LvizNlQggBIiIiIomS67sAIiIioieJYYeIiIgkjWGHiIiIJI1hh4iIiCSNYYeIiIgkjWGHiIiIJI1hh4iIiCSNYYeIiIgkjWGHiIjoMWg0GiQnJyMiIkLfpdBDMOwQEVGtdPToURw6dEj7+NChQzh27Jj+CrpPfHw8pk6dCnd3dyiVStjb26N58+bIyMjQd2lUDoYdCVq/fj1kMpl2MDIyQtOmTTF58mQkJCTouzyqxXbt2oV+/frB1tZW+7559913kZKSUuVlxsXFYf78+QgJCam+QqleiImJwcSJE3Hp0iVcunQJEydORExMjL7Lwo0bN9C+fXts3rwZEyZMwK5du7B//34EBQXB1NRU3+VROWS8N5b0rF+/HmPGjMGCBQvg4eGBvLw8HD16FD///DPc3d1x+fJlmJiY6LtMqmXeffddfPXVV2jdujVefvll2NjY4Ny5c1i7di3s7OwQFBSEZs2aVXq5Z86cQfv27bFu3Tq8/vrr1V84SVZ+fj66deuGU6dOAQD8/Pxw6NAhKJVKvdbVu3dvREVF4ciRI3BxcdFrLVQxBvougJ6c/v37o127dgCAsWPHwtbWFkuXLsXOnTvx0ksv6bk6qk02bdqEr776CiNGjMDGjRuhUCi0415//XX07NkTL774Is6dOwcDA35sUM1QqVQ4fvw4Ll++DABo0aKFzntTH86ePYuDBw9i3759DDp1CHdj1SO9evUCAERGRgIAUlNT8e6776Jly5YwMzODhYUF+vfvjwsXLpSZNy8vD/Pnz0fTpk1hZGQEJycnPP/887h58yYAICoqSmfX2YNDjx49tMs6dOgQZDIZfvvtN3z44YdQq9UwNTXFs88+W+4m6pMnT6Jfv36wtLSEiYkJunfv/tD99j169Cj3+efPn19m2l9++QW+vr4wNjaGjY0NRo4cWe7zP2rd7qfRaBAQEAAfHx8YGRnB0dEREyZMwN27d3Wma9iwIQYNGlTmeSZPnlxmmeXV/sUXX5TpU6Dkv+B58+ahSZMmUKlUcHV1xcyZM5Gfn19uX93v448/hrW1Nb777rsyXyYdOnTA+++/j0uXLmHr1q0661HelpoePXpoazt06BDat28PABgzZoy239avX6+d/uTJkxgwYACsra1hamqKVq1aYdmyZTrLPHjwILp27QpTU1NYWVlhyJAhCA0N1Zlm/vz5kMlkuH79Ol555RVYWlrC3t4ec+bMgRACMTExGDJkCCwsLKBWq/HVV1+Vqf1x+vDB956dnR0GDhyo/aKu6Hz/9T6r6Pv2Uf36+uuv/+dzRkVFaZf1zTffwMfHByqVCs7Ozpg0aRLS0tKqtP5FRUX45JNP0LhxY6hUKjRs2BAffvhhmT4ufX8pFAq0bt0arVu3xrZt2yCTydCwYcP/eDVK5i+tRS6XQ61WY8SIEYiOjtZOU/q3/eWXXz50OaXvq1InTpyAkZERbt68qe0TtVqNCRMmIDU1tcz8W7Zs0b5ednZ2eOWVVxAbG6szzeuvvw4zMzNERESgb9++MDU1hbOzMxYsWID7d76U1nv/309mZiZ8fX3h4eGBO3fuaNsr+nlUX/BftHqkNJjY2toCACIiIrBjxw68+OKL8PDwQEJCAr799lt0794dV69ehbOzMwCguLgYgwYNQlBQEEaOHIkpU6YgMzMT+/fvx+XLl9G4cWPtc7
z00ksYMGCAzvPOmjWr3Ho+/fRTyGQyvP/++0hMTERAQAD8/f0REhICY2NjACVfcv3794evry/mzZsHuVyOdevWoVevXvj333/RoUOHMstt0KABFi1aBADIysrC22+/Xe5zz5kzB8OHD8fYsWORlJSEFStWoFu3bjh//jysrKzKzDN+/Hh07doVALBt2zZs375dZ/yECRO0uxD/97//ITIyEitXrsT58+dx7NgxGBoaltsPlZGWlqZdt/tpNBo8++yzOHr0KMaPHw9vb29cunQJX3/9Na5fv44dO3Y8dJnh4eEICwvD66+/DgsLi3Knee211zBv3jzs2rULI0eOrHC93t7eWLBgAebOnavTf506dQIA7N+/H4MGDYKTkxOmTJkCtVqN0NBQ7Nq1C1OmTAEAHDhwAP3790ejRo0wf/585ObmYsWKFejcuTPOnTtX5otvxIgR8Pb2xueff46///4bCxcuhI2NDb799lv06tULixcvxsaNG/Huu++iffv26Nat22P3YSkvLy/Mnj0bQgjcvHkTS5cuxYABA3S+YB80e/ZsjB07FgCQnJyMadOm6fTV/Sr6vv2vfp0wYQL8/f21y3311Vfx3HPP4fnnn9e22dvbAyj5sv/444/h7++Pt99+G2FhYVi9ejVOnz5d5n1dkfUfO3YsNmzYgBdeeAEzZszAyZMnsWjRIoSGhpb5m7pfUVERZs+e/R+vgK6uXbti/Pjx0Gg0uHz5MgICAhAXF4d///23Usu5X0pKCvLy8vD222+jV69eeOutt3Dz5k2sWrUKJ0+exMmTJ6FSqQD8/yEF7du3x6JFi5CQkIBly5bh2LFjZT5niouL0a9fPzz99NNYsmQJAgMDMW/ePBQVFWHBggXl1lJYWIhhw4YhOjoax44dg5OTk3ZcTXwe1SmCJGfdunUCgDhw4IBISkoSMTExYvPmzcLW1lYYGxuL27dvCyGEyMvLE8XFxTrzRkZGCpVKJRYsWKBtW7t2rQAgli5dWua5NBqNdj4A4osvvigzjY+Pj+jevbv28T///CMACBcXF5GRkaFt//333wUAsWzZMu2yPT09Rd++fbXPI4QQOTk5wsPDQ/Tp06fMc3Xq1Em0aNFC+zgpKUkAEPPmzdO2RUVFCYVCIT799FOdeS9duiQMDAzKtIeHhwsAYsOGDdq2efPmifv/fP79918BQGzcuFFn3sDAwDLt7u7uYuDAgWVqnzRpknjwT/LB2mfOnCkcHByEr6+vTp/+/PPPQi6Xi3///Vdn/jVr1ggA4tixY2Wer9SOHTsEAPH1118/dBohhLCwsBBt27bVWY/Ro0eXma579+46tZ0+fVoAEOvWrdOZrqioSHh4eAh3d3dx9+5dnXH3v95t2rQRDg4OIiUlRdt24cIFIZfLxWuvvaZtK31Nxo8fr/McDRo0EDKZTHz++efa9rt37wpjY2Od+h+nD8tbbyGE+PDDDwUAkZiY+Mh5S5X+HT3YV0JU/H1b0X6934Pvs1KJiYlCqVSKZ555RuezYuXKlQKAWLt2rbatIusfEhIiAIixY8fqTPfuu+8KAOLgwYPatgffX998841QqVSiZ8+ewt3dvdz1uF9578+XX35ZmJiYaB8/6nOr1IN/66WPe/fuLYqKirTtpZ+7K1asEEIIUVBQIBwcHESLFi1Ebm6udrpdu3YJAGLu3LnattGjRwsA4p133tG2aTQaMXDgQKFUKkVSUpJOvevWrRMajUaMGjVKmJiYiJMnT+rUXJnPo/qCu7EkzN/fH/b29nB1dcXIkSNhZmaG7du3a/czq1QqyOUlb4Hi4mKkpKTAzMwMzZo1w7lz57TL+eOPP2BnZ4d33nmnzHM8uIm9Ml577TWYm5trH7/wwgtwcnLC7t27AQAhISEIDw/Hyy+/jJSUFCQnJyM5ORnZ2dno3bs3jhw5Ao1Go7PMvLw8GBkZPfJ5t23bBo1Gg+HDh2uXmZycDLVaDU9PT/zzzz860xcUFACA9r+18m
zZsgWWlpbo06ePzjJ9fX1hZmZWZpmFhYU60yUnJyMvL++RdcfGxmLFihWYM2cOzMzMyjy/t7c3vLy8dJZZuuvywee
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Среднее значение Outcome в обучающей выборке: 0.3469055374592834\n",
"Среднее значение Outcome в контрольной выборке: 0.35714285714285715\n",
"Среднее значение Outcome в тестовой выборке: 0.35714285714285715\n"
]
}
],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_outcome_distribution(data, title):\n",
" sns.histplot(data['Outcome'], kde=True)\n",
" plt.title(title)\n",
" plt.xlabel('Outcome')\n",
" plt.ylabel('Частота')\n",
" plt.show()\n",
"\n",
"plot_outcome_distribution(train_data, 'Распределение Outcome в обучающей выборке')\n",
"plot_outcome_distribution(val_data, 'Распределение Outcome в контрольной выборке')\n",
"plot_outcome_distribution(test_data, 'Распределение Outcome в тестовой выборке')\n",
"\n",
"print(\"Среднее значение Outcome в обучающей выборке: \", train_data['Outcome'].mean())\n",
"print(\"Среднее значение Outcome в контрольной выборке: \", val_data['Outcome'].mean())\n",
"print(\"Среднее значение Outcome в тестовой выборке: \", test_data['Outcome'].mean())"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABI1UlEQVR4nO3deVwVdf///+cBZOeAgGyGZu4LZKIpWa4oIpmVZZblcpmWYp/SMr+0uLWYtqlpatcnM0vLrNQrr3LDLRNNMXPN1DQpBVwSFBMU5veHP+bjEbBE9OD0uN9uc7sx73nPzGvmLDzPLOfYDMMwBAAAYFEuzi4AAADgaiLsAAAASyPsAAAASyPsAAAASyPsAAAASyPsAAAASyPsAAAASyPsAAAASyPsAEAFc/LkSR04cEC5ubnOLgXl7MSJE9q7d6/OnTvn7FL+UQg7AOBkhmHovffeU4sWLeTt7S273a4aNWro448/dnZp14XffvtNM2fONMcPHDig2bNnO6+gC5w9e1bjx4/XzTffLA8PD1WuXFm1a9dWSkqKs0v7RyHsXCdmzpwpm81mDp6enqpTp44GDx6szMxMZ5eHCmzRokXq1KmTgoKCzOfNM888o2PHjpV5mYcOHdKoUaO0ZcuW8iv0H+yhhx7S448/rvr16+ujjz7SsmXLtHz5ct17773OLu26YLPZlJSUpCVLlujAgQN69tln9e233zq7LOXl5SkuLk4vvvii2rRpo3nz5mnZsmVasWKFYmNjnV3eP4qbswvA5RkzZoxq1KihM2fOaO3atZo6daq+/vprbd++Xd7e3s4uDxXMM888ozfffFM333yzhg8frsDAQG3evFmTJ0/Wp59+qpSUFNWtW/eyl3vo0CGNHj1aN954oxo3blz+hf+DzJo1S3PnztXHH3+shx56yNnlXJeqVq2q/v37q1OnTpKk8PBwrVq1yrlFSRo3bpw2bNigJUuWqE2bNs4u55/NwHXhgw8+MCQZGzdudGgfOnSoIcmYM2eOkypDRTVnzhxDkvHAAw8Y586dc5i2YcMGw9vb24iKijLOnj172cveuHGjIcn44IMPyqnaf65GjRoZDz30kLPLsIS9e/ca69evN06dOuXsUoyzZ88alStXNp577jlnlwLDMDiNdZ1r166dJGn//v2SpOPHj+uZZ55RVFSUfH19ZbfblZCQoB9//LHYvGfOnNGoUaNUp04deXp6Kjw8XPfee6/27dsn6fx57wtPnV08XPhJZdWqVbLZbJo7d66ee+45hYWFycfHR3fddZfS09OLrXvDhg3q1KmT/P395e3trdatW+u7774rcRvbtGlT4vpHjRpVrO/HH3+smJgYeXl5KTAwUD169Chx/ZfatgsVFhZqwoQJatiwoTw9PRUaGqrHHntMf/zxh0O/G2+8UXfeeWex9QwePLjYMkuq/fXXXy+2T6Xzh8FHjhypWrVqycPDQ5GRkXr22WeVl5dX4r660OjRo1W5cmW99957cnV1dZh26623avjw4dq2bZs+//xzh+3o06dPsWW1adPGrG3VqlVq1qyZJKlv377mfrvwmokNGzaoc+fOqly5snx8fBQdHa2JEyc6LHPFihW644475OPjo4CAAHXt2lW7du1y6DNq1CjZbDb9/PPPevjhh+Xv768qVaroxRdflGEYSk9PV9euXWW32xUWFqY333yzWO1Xsg8vfu4FBwcrMTFR27dv/8t5JWnevHnm8zE4OFgPP/ywfv/9d3N6bm6utm/frsjISCUmJsput8vHx0dt2rRxOA3zyy+/yGaz6e233y62jnXr1slms+mTTz4xa774eVT0fL/wMdq6dav69Omjm266SZ6engoLC9O//vWvYqc3i06hHzhwwGxbsmSJbrvtNnl7e8vf31933nlnsX1S9NgdPXrUbNu0aVOxOiSpUaNGJR75+Oabb8zniJ+fnxITE7Vjxw6HPn369NGNN94oSapZs6aaN2+u48
ePy8vLq1jdJenTp4/DY1y5cuVi+18q/TVepOg9sOiI0u7du/XHH3/Iz89PrVu3vuS+kqQffvhBCQkJstvt8vX1Vfv27bV+/XqHPkWPxZo1a/TYY48pKChIdrtdvXr1KvE96eLX8oABA+Tp6VnsqNff2c/XO05jXeeKgklQUJCk82+KCxYs0P33368aNWooMzNT06dPV+vWrbVz505FRERIkgoKCnTnnXcqJSVFPXr00JNPPqmTJ09q2bJl2r59u2rWrGmu48EHH1Tnzp0d1pucnFxiPa+88opsNpuGDx+urKwsTZgwQXFxcdqyZYu8vLwknf8nl5CQoJiYGI0cOVIuLi764IMP1K5dO3377be69dZbiy33hhtu0NixYyVJp06d0sCBA0tc94svvqju3bvr0Ucf1ZEjR/TOO++oVatW+uGHHxQQEFBsngEDBuiOO+6QJH355ZeaP3++w/THHntMM2fOVN++ffU///M/2r9/vyZPnqwffvhB3333nSpVqlTifrgcJ06cMLftQoWFhbrrrru0du1aDRgwQPXr19e2bdv09ttv6+eff9aCBQtKXeaePXu0e/du9enTR3a7vcQ+vXr10siRI7Vo0SL16NHjb9dbv359jRkzRiNGjHDYf7fddpskadmyZbrzzjsVHh6uJ598UmFhYdq1a5cWLVqkJ598UpK0fPlyJSQk6KabbtKoUaP0559/6p133lHLli21efNm859XkQceeED169fXa6+9pv/+9796+eWXFRgYqOnTp6tdu3YaN26cZs+erWeeeUbNmjVTq1atrngfFqlXr56ef/55GYahffv26a233lLnzp118ODBS85X9Lxp1qyZxo4dq8zMTE2cOFHfffed+XwsChbjxo1TWFiYhg0bJk9PT/373/9WXFycli1bplatWummm25Sy5YtNXv2bA0ZMsRhPbNnz5afn5+6du36l9tyoWXLlumXX35R3759FRYWph07dui9997Tjh07tH79+mIhvci3336rzp07q3r16ho5cqTOnj2rd999Vy1bttTGjRtVp06dy6qjNB999JF69+6t+Ph4jRs3TqdPn9bUqVN1++2364cffij2HLnQiBEjdObMmb+9ruDgYDNI/vbbb5o4caI6d+6s9PT0Et83/o6ixzY5OVm1a9fW6NGjdebMGU2ZMqXYvtqxY4fuuOMO2e12Pfvss6pUqZKmT5+uNm3aaPXq1WrevLnDsgcPHqyAgACNGjVKu3fv1tSpU/Xrr7+agaskI0eO1Pvvv6+5c+c6BMsr2c/XFWcfWsLfU3Qaa/ny5caRI0eM9PR049NPPzWCgoIMLy8v47fffjMMwzDOnDljFBQUOMy7f/9+w8PDwxgzZozZNmPGDEOS8dZbbxVbV2FhoTmfJOP1118v1qdhw4ZG69atzfGVK1cakoyqVasaOTk5Zvtnn31mSDImTpxoLrt27dpGfHy8uR7DMIzTp08bNWrUMDp06FBsXbfddpvRqFEjc/zIkSOGJGPkyJFm24EDBwxXV1fjlVdecZh327ZthpubW7H2PXv2GJKMDz/80GwbOXKkceFL4ttvvzUkGbNnz3aYd/HixcXaq1evbiQmJharPSkpybj4ZXZx7c8++6wREhJixMTEOOzTjz76yHBxcTG+/fZbh/mnTZtmSDK+++67YusrsmDBAkOS8fbbb5faxzAMw263G02aNHHYjt69exfr17p1a4faSjuNde7cOaNGjRpG9erVjT/++MNh2oWPd+PGjY2QkBDj2LFjZtuPP/5ouLi4GL169TLbih6TAQMGOKzjhhtuMGw2m/Haa6+Z7X/88Yfh5eXlUP+V7MOSttswDOO5554zJBlZWVmlzpefn2+EhIQYjRo1Mv7880+zfdGiRYYkY8SIEYZh/N9rzN3d3fj555/NfkeOHDGCgoKMmJgYs2369OmGJGPXrl0O6wkODnbY5rZt2xqtWrVyqKdoPRc+XqdPny5W9yeffGJIMtasWW
O2Fb337N+/3zAMw4iJiTH8/f2NjIwMs8/PP/9sVKpUyejWrZvZVvTYHTlyxGwr7Xlz8fvJyZMnjYCAAKN///4O/TI
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABEJElEQVR4nO3deVgVdf//8dcBZBMOCMpmaO6Ka6EpLbihiGZallreuWR6Z9idWuaPO3NrIW1xyS3vO7O+aZqWeufX3RQr0RSzTM1bzZRSwCVAMUFhfn90MV+P4IbowfH5uK65LuYzn5l5zzDn8GKWc2yGYRgCAACwKBdnFwAAAHAjEXYAAIClEXYAAIClEXYAAIClEXYAAIClEXYAAIClEXYAAIClEXYAAIClEXYAAIClEXYAAHCCvn376s4773Ros9lsGjNmjFPqsTLCjoXNmTNHNpvNHDw9PVW7dm0NHjxY6enpzi4PZdiyZcvUoUMHBQYGmsfNiy++qBMnTpR4mUeOHNGYMWO0Y8eO0isUAK6Cm7MLwI03btw4VatWTWfPntU333yjGTNmaPny5frpp5/k7e3t7PJQxrz44ot655131LhxY40YMUIBAQHavn27pk6dqvnz52vdunWqU6fONS/3yJEjGjt2rO688041adKk9AsHLODPP/+Umxt/mksbe/Q2EBcXp6ZNm0qSnn76aQUGBurdd9/V0qVL9fjjjzu5OpQln376qd555x316NFDc+fOlaurqzmtb9++at26tR577DFt376dN2Q43dmzZ+Xu7i4XF+tcpPD09HR2CZZknSMEV61NmzaSpIMHD0qSTp48qRdffFENGzaUj4+P7Ha74uLi9MMPPxSZ9+zZsxozZoxq164tT09PhYaG6pFHHtGBAwckSb/++qvDpbOLh1atWpnL2rBhg2w2mxYsWKB//vOfCgkJUfny5fXQQw8pNTW1yLq3bNmiDh06yM/PT97e3mrZsqW+/fbbYrexVatWxa6/uGvhn3zyiSIjI+Xl5aWAgAD17Nmz2PVfbtsuVFBQoEmTJql+/fry9PRUcHCw/v73v+uPP/5w6HfnnXfqwQcfLLKewYMHF1lmcbW/9dZbRfapJOXm5mr06NGqWbOmPDw8FB4erpdeekm5ubnF7qsLjR07VhUqVNCsWbMcgo4k3XPPPRoxYoR27typRYsWOWxH3759iyyrVatWZm0bNmxQs2bNJEn9+vUz99ucOXPM/lu2bFHHjh1VoUIFlS9fXo0aNdLkyZMdlvnVV1/pgQceUPny5eXv768uXbpoz549Dn3GjBkjm82m//73v/rb3/4mPz8/VapUSa+88ooMw1Bqaqq6dOkiu92ukJAQvfPOO0Vqv559eKljr3D49ddfHfpPnz5d9evXl4eHh8LCwhQfH6/MzMwiy72a/SPpqtd7tcd9cb7//nvFxcXJbrfLx8dHbdu21ebNm83p27Ztk81m00cffVRk3lWrVslms2nZsmVm2++//66nnnpKwcHB8vDwUP369TV79myH+QrfL+bPn6+RI0eqcuXK8vb2VnZ2ts6dO6exY8eqVq1a8vT0VGBgoO6//36tWbPGnP/HH39U3759Vb16dXl6eiokJERPPfVUkUuz13v8XOv72sUufq0X1rN//3717dtX/v7+8vPzU79+/XTmzBmHef/880/94x//UMWKFeXr66uHHnpIv//+O/cBiTM7t6XCYBIYGChJ+uWXX7RkyRI99thjqlatmtLT0/X++++rZcuW2r17t8LCwiRJ+fn5evDBB7Vu3Tr17NlTzz//vE6dOqU1a9bop59+Uo0aNcx1PP744+rYsaPDehMSEoqt5/XXX5fNZtOIESOUkZGhSZMmKSYmRjt27JCXl5ekv/7IxcXFKTIyUqNHj5aLi4s+/PBDtWnTRl9//bXuueeeIsu94447lJiYKEk6ffq0Bg0aVOy6X3nlFXXv3l1PP/20jh07pvfee0/R0d
H6/vvv5e/vX2SegQMH6oEHHpAkffHFF1q8eLHD9L///e+aM2eO+vXrp3/84x86ePCgpk6dqu+//17ffvutypUrV+x+uBaZmZnmtl2ooKBADz30kL755hsNHDhQ9erV086dOzVx4kT997//1ZIlSy65zH379mnv3r3q27ev7HZ7sX169+6t0aNHa9myZerZs+dV11uvXj2NGzdOo0aNcth/9957ryRpzZo1evDBBxUaGqrnn39eISEh2rNnj5YtW6bnn39ekrR27VrFxcWpevXqGjNmjP7880+99957uu+++7R9+/YiN3r26NFD9erV05tvvqn//d//1WuvvaaAgAC9//77atOmjcaPH6+5c+fqxRdfVLNmzRQdHX3d+7DQhcdeoeXLl+vTTz91aBszZozGjh2rmJgYDRo0SHv37tWMGTO0detWh2PlavbPhR5++GE98sgjkqSvv/5as2bNcphekuO+0K5du/TAAw/IbrfrpZdeUrly5fT++++rVatWSkpKUvPmzdW0aVNVr15dn332mfr06eMw/4IFC1ShQgXFxsZKktLT09WiRQvZbDYNHjxYlSpV0ooVK9S/f39lZ2dryJAhDvO/+uqrcnd314svvqjc3Fy5u7trzJgxSkxM1NNPP6177rlH2dnZ2rZtm7Zv36527dqZ+/CXX35Rv379FBISol27dmnWrFnatWuXNm/eXOQfjJIePxfu4yu9r12L7t27q1q1akpMTNT27dv173//W0FBQRo/frzZp2/fvvrss8/05JNPqkWLFkpKSlKnTp2ueV2WZMCyPvzwQ0OSsXbtWuPYsWNGamqqMX/+fCMwMNDw8vIyfvvtN8MwDOPs2bNGfn6+w7wHDx40PDw8jHHjxplts2fPNiQZ7777bpF1FRQUmPNJMt56660iferXr2+0bNnSHF+/fr0hyahcubKRnZ1ttn/22WeGJGPy5MnmsmvVqmXExsaa6zEMwzhz5oxRrVo1o127dkXWde+99xoNGjQwx48dO2ZIMkaPHm22/frrr4arq6vx+uuvO8y7c+dOw83NrUj7vn37DEnGRx99ZLaNHj3auPBl9PXXXxuSjLlz5zrMu3LlyiLtVatWNTp16lSk9vj4eOPil+bFtb/00ktGUFCQERkZ6bBP/+d//sdwcXExvv76a4f5Z86caUgyvv322yLrK7RkyRJDkjFx4sRL9jEMw7Db7cbdd9/tsB19+vQp0q9ly5YOtW3dutWQZHz44YcO/c6fP29Uq1bNqFq1qvHHH384TLvw992kSRMjKCjIOHHihNn2ww8/GC4uLkbv3r3NtsLfycCBAx3Wcccddxg2m8148803zfY//vjD8PLycqj/evZh4XbXr1+/SPtbb71lSDIOHjxoGIZhZGRkGO7u7kb79u0dXn9Tp041JBmzZ8++pv1jGIZx7tw5Q5IxduxYs63wfaBwvdd63F+sa9euhru7u3HgwAGz7ciRI4avr68RHR1ttiUkJBjlypUzTp48abbl5uYa/v7+xlNPPWW29e/f3wgNDTWOHz/usJ6ePXsafn5+xpkzZwzD+L/3i+rVq5tthRo3blzsa+lCF89jGIbx6aefGpKMjRs3mm3Xe/xc7fuaYRhGnz59jKpVqzrUdPFrvbCeC/eZYRjGww8/bAQGBprjKSkphiRjyJAhDv369u1bZJm3Iy5j3QZiYmJUqVIlhYeHq2fPnvLx8dHixYtVuXJlSZKHh4d5zTs/P18nTpyQj4+P6tSpo+3bt5vL+fzzz1WxYkU999xzRdZx8X9F16J3797y9fU1xx999FGFhoZq+fLlkqQdO3Zo3759euKJJ3TixAkdP35cx48fV05Ojtq2bauNGzeqoKDAYZlnz5694rXvL774QgUFBerevbu5zOPHjyskJES1atXS+vXrHfrn5eVJ+mt/XcrChQvl5+endu3aOSwzMjJSPj4+RZZ57tw5h37Hjx/X2bNnL1v377//rvfee0+vvPKKfHx8iq
y/Xr16qlu3rsMyCy9dXrz+C506dUqSHH4XxfH19VV2dvZl+1yL77//XgcPHtSQIUOKnFEoPK6OHj2qHTt2qG/fvgo
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABEEUlEQVR4nO3deVgW9f7/8dcNyg2KNwjKVmjuiksWmt5ZaooikmlRZlmamZZip7TFL+eYW4tpi0vi0jkuddIsLe3oMfekDU0xytQ86NHkpIBLgmKCwvz+8GJ+3gKmiN44PR/XNdfFfOYzM+8Z7htezHyG22YYhiEAAACL8nB3AQAAAFcTYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcA4FZjx46VzWZzdxkVwv79+2Wz2TR//nyzjfNz5Qg7FjB//nzZbDZz8vb2VsOGDTVs2DBlZma6uzxUYCtWrFC3bt0UGBhovm6ef/55HT16tMzbPHjwoMaOHavU1NTyKxQArgBhx0LGjx+vf/7zn5o+fbpuv/12zZw5U06nU6dOnXJ3aaiAnn/+efXo0UMZGRkaOXKkpk+frqioKE2fPl0333yzdu/eXabtHjx4UOPGjSPsAOVk1KhR+v33391dxnWtkrsLQPmJiYlRq1atJElPPPGEAgMD9fbbb+uzzz7TQw895ObqUJF8+OGHeuutt/Tggw9qwYIF8vT0NJc99thjuuuuu/TAAw9o27ZtqlSJHxO4vpw6dUpVqlRxdxnlplKlSrwPrxBXdiysU6dOkqR9+/ZJko4dO6bnn39ezZs3l6+vrxwOh2JiYvTDDz8UW/f06dMaO3asGjZsKG9vb4WGhuq+++7T3r17Jf3/+8qlTR07djS3tXHjRtlsNn300Uf661//qpCQEFWtWlX33HOP0tPTi+178+bN6tatm/z8/FSlShV16NBB33zzTYnH2LFjxxL3P3bs2GJ9P/jgA0VGRsrHx0cBAQHq06dPifu/2LGdr7CwUFOmTFHTpk3l7e2t4OBgPfnkk/rtt99c+t100026++67i+1n2LBhxbZZUu1vvPFGsXMqSXl5eRozZozq168vu92u8PBwvfjii8rLyyvxXJ1v3Lhxql69ut59912XoCNJt912m0aOHKnt27dryZIlLsfx2GOPFdtWx44dzdo2btyo1q1bS5IGDBhgnrfzxx9s3rxZ3bt3V/Xq1VW1alW1aNFCU6dOddnmhg0bdOedd6pq1ary9/dXz549tWvXLpc+ReMY/vOf/+iRRx6Rn5+fatasqZdeekmGYSg9PV09e/aUw+FQSEiI3nrrrWK1X8k5LO21VzTt37/fpf+MGTPUtGlT2e12hYWFKT4+XsePHy+23Us5P5Iueb+X+rq/0GOPPaabbrqpWHtJ40dsNpuGDRumZcuWqVmzZrLb7WratKlWrVpVbP2vv/5arVu3lre3t+rVq6fZs2eXWsOl1N6xY0c1a9ZMKSkpat++vapUqaK//vWvkqStW7cqOjpaNWrUkI+Pj+rUqaPHH3/cZf0333xTt99+uwIDA+Xj46PIyEiX1/2Fx7h48WJFRETIx8dHTqdT27dvlyTNnj1b9evXl7e3tzp27Fjs+3B+nbfffrtZz6xZs0o9/iJXes43btyoVq1auZzzP9s4IKKihRUFk8DAQEnSf//7Xy1btkwPPPCA6tSpo8zMTM2ePVsdOnTQzp07FRYWJkkqKCjQ3XffrfXr16tPnz565plndOLECa1du1Y//fST6tWrZ+7joYceUvfu3V32m5CQUGI9r776qmw2m0aOHKmsrCxNmTJFUVFRSk1NlY+Pj6Rzv+RiYmIUGRmpMWPGyMPDQ/PmzVOnTp301Vdf6bbbbiu23RtvvFETJkyQJJ08eVJDhgwpcd8vvfSSevfurSeeeEKHDx/WO++8o/bt2+v777+Xv79/sXUGDx6sO++8U5
L06aefaunSpS7Ln3zySc2fP18DBgzQX/7yF+3bt0/Tp0/X999/r2+++UaVK1cu8TxcjuPHj5vHdr7CwkLdc889+vrrrzV48GA1adJE27dv1+TJk/Wf//xHy5YtK3WbaWlp2r17tx577DE5HI4S+/Tr109jxozRihUr1KdPn0uut0mTJho/frxGjx7tcv5uv/12SdLatWt19913KzQ0VM8884xCQkK0a9curVixQs8884wkad26dYqJiVHdunU1duxY/f7773rnnXfUrl07bdu2rdgv4AcffFBNmjTR66+/rn//+9965ZVXFBAQoNmzZ6tTp06aOHGiFixYoOeff16tW7dW+/btr/gcFjn/tVdk5cqV+vDDD13axo4dq3HjxikqKkpDhgzR7t27NXPmTG3ZssXltXIp5+d89957r+677z5J0ldffaV3333XZXlZXvdl9fXXX+vTTz/V0KFDVa1aNU2bNk1xcXE6cOCA+TNo+/bt6tq1q2rWrKmxY8fq7NmzGjNmjIKDg4tt73JqP3r0qGJiYtSnTx898sgjCg4OVlZWlrmv//u//5O/v7/279+vTz/91GU/U6dO1T333KO+ffsqPz9fixYt0gMPPKAVK1YoNjbWpe9XX32lf/3rX4qPj5ckTZgwQXfffbdefPFFzZgxQ0OHDtVvv/2mSZMm6fHHH9eGDRtc1v/tt9/UvXt39e7dWw899JA+/vhjDRkyRF5eXsVCWHmd8++//17dunVTaGioxo0bp4KCAo0fP141a9a87P1d1wxc9+bNm2dIMtatW2ccPnzYSE9PNxYtWmQEBgYaPj4+xv/+9z/DMAzj9OnTRkFBgcu6+/btM+x2uzF+/Hizbe7cuYYk4+233y62r8LCQnM9ScYbb7xRrE/Tpk2NDh06mPNffPGFIcm44YYbjJycHLP9448/NiQZU6dONbfdoEEDIzo62tyPYRjGqVOnjDp16hhdunQptq/bb7/daNasmTl/+PBhQ5IxZswYs23//v2Gp6en8eqrr7qsu337dqNSpUrF2tPS0gxJxnvvvWe2jRkzxjj/7fLVV18ZkowFCxa4rLtq1api7bVr1zZiY2OL1R4fH29c+Ba8sPYXX3zRCAoKMiIjI13O6T//+U/Dw8PD+Oqrr1zWnzVrliHJ+Oabb4rtr8iyZcsMScbkyZNL7WMYhuFwOIxbb73V5Tj69+9frF+HDh1catuyZYshyZg3b55Lv7Nnzxp16tQxateubfz2228uy87/frds2dIICgoyjh49arb98MMPhoeHh9GvXz+zreh7MnjwYJd93HjjjYbNZjNef/11s/23334zfHx8XOq/knNYdNxNmzYt1v7GG28Ykox9+/YZhmEYWVlZhpeXl9G1a1eX99/06dMNScbcuXMv6/wYhmGcOXPGkGSMGzfObCv6OVC038t93V+of//+Ru3atYu1X/heMIxzr1svLy9jz549ZtsPP/xgSDLeeecds61Xr16Gt7e38csvv5htO3fuNDw9PV22eTm1d+jQwZBkzJo1y6Xv0qVLDUnGli1bLnqcp06dcpnPz883mjVrZnTq1KnYMdrtdvP8GoZhzJ4925BkhISEuPxsS0hIcPlenF/nW2+9Zbbl5eWZr/f8/HzDMP7/z9bz3z9Xcs579OhhVKlSxfj111/NtrS0NKNSpUrFtmll3MaykKioKNWsWVPh4eHq06ePfH19tXTpUt1www2SJLvdLg+Pc9/ygoICHT16VL6+vmrUqJG2bdtmbueTTz5RjRo19PTTTxfbx5Vc9uzXr5+qVatmzt9///0KDQ3VypUrJUmpqalKS0vTww8/rKNHj+rIkSM6cuSIcnNz1blzZ3355ZcqLCx02ebp06fl7e190f1++umnKiwsVO/evc1tHjlyRCEhIWrQoIG++OILl/75+fmSzp2v0ixevFh+fn7q0qWLyzYjIyPl6+tbbJtnzpxx6XfkyBGdPn36onX/+uuveuedd/TSSy/J19
e32P6bNGmixo0bu2yz6Nblhfs/34kTJyTJ5XtRkmrVqiknJ+eifS7H999/r3379unZZ58tdkWh6HV16NAhpaam6rH
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки после oversampling и undersampling: 802\n"
]
}
],
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"sns.countplot(x=train_data['Outcome'])\n",
"plt.title('Распределение Outcome в обучающей выборке')\n",
"plt.xlabel('Outcome')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"ros = RandomOverSampler(random_state=42)\n",
"X_train = train_data.drop(columns=['Outcome'])\n",
"y_train = train_data['Outcome']\n",
"\n",
"X_resampled, y_resampled = ros.fit_resample(X_train, y_train)\n",
"\n",
"sns.countplot(x=y_resampled)\n",
"plt.title('Распределение Outcome после oversampling')\n",
"plt.xlabel('Outcome')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"rus = RandomUnderSampler(random_state=42)\n",
"X_resampled, y_resampled = rus.fit_resample(X_resampled, y_resampled)\n",
"\n",
"sns.countplot(x=y_resampled)\n",
"plt.title('Распределение Outcome после undersampling')\n",
"plt.xlabel('Outcome')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"print(\"Размер обучающей выборки после oversampling и undersampling: \", len(X_resampled))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Конструирование признаков\n",
"Унитарное кодирование - замена категориальных признаков бинарными значениями.\n",
"\n",
"Дискретизация числовых признаков - процесс преобразования непрерывных числовых значений в дискретные категории или интервалы (бины)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Columns of train_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Pregnancies_14', 'Pregnancies_15', 'Pregnancies_17', 'Outcome_0', 'Outcome_1']\n",
"Columns of val_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1']\n",
"Columns of test_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1']\n"
]
}
],
"source": [
"# Columns treated as categorical and expanded into one-hot indicator columns\n",
"categorical_features = ['Pregnancies', 'Outcome']\n",
"\n",
"# Each split is encoded independently, so a category that is absent from a split\n",
"# (e.g. Pregnancies_14 in the validation set) produces no column there\n",
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)\n",
"\n",
"print(\"Columns of train_data_encoded:\", train_data_encoded.columns.tolist())\n",
"print(\"Columns of val_data_encoded:\", val_data_encoded.columns.tolist())\n",
"print(\"Columns of test_data_encoded:\", test_data_encoded.columns.tolist())\n",
"\n",
"# Discretize Glucose into 5 equal-width bins; bins=5 derives the edges from each\n",
"# split's own min and max, so the edges can differ between splits\n",
"train_data_encoded['Glucose_binned'] = pd.cut(train_data_encoded['Glucose'], bins=5, labels=False)\n",
"val_data_encoded['Glucose_binned'] = pd.cut(val_data_encoded['Glucose'], bins=5, labels=False)\n",
"test_data_encoded['Glucose_binned'] = pd.cut(test_data_encoded['Glucose'], bins=5, labels=False)"
]
},
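{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoding cell above calls `pd.get_dummies` and `pd.cut` on each split separately, which is why the training set has columns such as `Pregnancies_14` that the other splits lack. Below is a minimal sketch of one possible way to keep the splits consistent: reindex the other splits to the training columns and reuse the bin edges learned on the training split. The frames `train` and `val` are small made-up examples, not the notebook's data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy splits; the values are made up for illustration.\n",
"train = pd.DataFrame({'Pregnancies': [0, 1, 14], 'Glucose': [80, 120, 200]})\n",
"val = pd.DataFrame({'Pregnancies': [0, 1], 'Glucose': [90, 150]})\n",
"\n",
"# One-hot encode each split, then align the validation columns to the training columns.\n",
"train_enc = pd.get_dummies(train, columns=['Pregnancies'])\n",
"val_enc = pd.get_dummies(val, columns=['Pregnancies'])\n",
"val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=False)\n",
"\n",
"# Learn equal-width bin edges on the training split only and reuse them elsewhere.\n",
"_, edges = pd.cut(train['Glucose'], bins=5, retbins=True)\n",
"train_enc['Glucose_binned'] = pd.cut(train['Glucose'], bins=edges, labels=False, include_lowest=True)\n",
"val_enc['Glucose_binned'] = pd.cut(val['Glucose'], bins=edges, labels=False, include_lowest=True)\n",
"```\n",
"\n",
"With fixed edges, values outside the training range fall into no bin and become `NaN`, so they would need separate handling."
]
},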
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Manual feature synthesis\n",
"Manual synthesis creates new features from expert knowledge and the logic of the subject domain. For example, we can create a feature that captures the ratio of glucose level to body mass index (BMI)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Glucose</th>\n",
" <th>BloodPressure</th>\n",
" <th>SkinThickness</th>\n",
" <th>Insulin</th>\n",
" <th>BMI</th>\n",
" <th>DiabetesPedigreeFunction</th>\n",
" <th>Age</th>\n",
" <th>Pregnancies_0</th>\n",
" <th>Pregnancies_1</th>\n",
" <th>Pregnancies_2</th>\n",
" <th>...</th>\n",
" <th>Pregnancies_8</th>\n",
" <th>Pregnancies_9</th>\n",
" <th>Pregnancies_10</th>\n",
" <th>Pregnancies_11</th>\n",
" <th>Pregnancies_12</th>\n",
" <th>Pregnancies_13</th>\n",
" <th>Outcome_0</th>\n",
" <th>Outcome_1</th>\n",
" <th>Glucose_binned</th>\n",
" <th>glucose_to_bmi</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>668</th>\n",
" <td>98</td>\n",
" <td>58</td>\n",
" <td>33</td>\n",
" <td>190</td>\n",
" <td>34.0</td>\n",
" <td>0.430</td>\n",
" <td>43</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>1</td>\n",
" <td>2.882353</td>\n",
" </tr>\n",
" <tr>\n",
" <th>324</th>\n",
" <td>112</td>\n",
" <td>75</td>\n",
" <td>32</td>\n",
" <td>0</td>\n",
" <td>35.7</td>\n",
" <td>0.148</td>\n",
" <td>21</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>1</td>\n",
" <td>3.137255</td>\n",
" </tr>\n",
" <tr>\n",
" <th>624</th>\n",
" <td>108</td>\n",
" <td>64</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.8</td>\n",
" <td>0.158</td>\n",
" <td>21</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>1</td>\n",
" <td>3.506494</td>\n",
" </tr>\n",
" <tr>\n",
" <th>690</th>\n",
" <td>107</td>\n",
" <td>80</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>24.6</td>\n",
" <td>0.856</td>\n",
" <td>34</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>1</td>\n",
" <td>4.349593</td>\n",
" </tr>\n",
" <tr>\n",
" <th>473</th>\n",
" <td>136</td>\n",
" <td>90</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>29.9</td>\n",
" <td>0.210</td>\n",
" <td>50</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>2</td>\n",
" <td>4.548495</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>355</th>\n",
" <td>165</td>\n",
" <td>88</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.4</td>\n",
" <td>0.302</td>\n",
" <td>49</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>3</td>\n",
" <td>5.427632</td>\n",
" </tr>\n",
" <tr>\n",
" <th>534</th>\n",
" <td>77</td>\n",
" <td>56</td>\n",
" <td>30</td>\n",
" <td>56</td>\n",
" <td>33.3</td>\n",
" <td>1.251</td>\n",
" <td>24</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>0</td>\n",
" <td>2.312312</td>\n",
" </tr>\n",
" <tr>\n",
" <th>344</th>\n",
" <td>95</td>\n",
" <td>72</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>36.8</td>\n",
" <td>0.485</td>\n",
" <td>57</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>1</td>\n",
" <td>2.581522</td>\n",
" </tr>\n",
" <tr>\n",
" <th>296</th>\n",
" <td>146</td>\n",
" <td>70</td>\n",
" <td>38</td>\n",
" <td>360</td>\n",
" <td>28.0</td>\n",
" <td>0.337</td>\n",
" <td>29</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>3</td>\n",
" <td>5.214286</td>\n",
" </tr>\n",
" <tr>\n",
" <th>462</th>\n",
" <td>74</td>\n",
" <td>70</td>\n",
" <td>40</td>\n",
" <td>49</td>\n",
" <td>35.3</td>\n",
" <td>0.705</td>\n",
" <td>39</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>0</td>\n",
" <td>2.096317</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>154 rows × 25 columns</p>\n",
"</div>"
],
"text/plain": [
" Glucose BloodPressure SkinThickness Insulin BMI \\\n",
"668 98 58 33 190 34.0 \n",
"324 112 75 32 0 35.7 \n",
"624 108 64 0 0 30.8 \n",
"690 107 80 0 0 24.6 \n",
"473 136 90 0 0 29.9 \n",
".. ... ... ... ... ... \n",
"355 165 88 0 0 30.4 \n",
"534 77 56 30 56 33.3 \n",
"344 95 72 0 0 36.8 \n",
"296 146 70 38 360 28.0 \n",
"462 74 70 40 49 35.3 \n",
"\n",
" DiabetesPedigreeFunction Age Pregnancies_0 Pregnancies_1 \\\n",
"668 0.430 43 False False \n",
"324 0.148 21 False False \n",
"624 0.158 21 False False \n",
"690 0.856 34 False False \n",
"473 0.210 50 False False \n",
".. ... ... ... ... \n",
"355 0.302 49 False False \n",
"534 1.251 24 False True \n",
"344 0.485 57 False False \n",
"296 0.337 29 False False \n",
"462 0.705 39 False False \n",
"\n",
" Pregnancies_2 ... Pregnancies_8 Pregnancies_9 Pregnancies_10 \\\n",
"668 False ... False False False \n",
"324 True ... False False False \n",
"624 True ... False False False \n",
"690 False ... True False False \n",
"473 False ... False False False \n",
".. ... ... ... ... ... \n",
"355 False ... False True False \n",
"534 False ... False False False \n",
"344 False ... True False False \n",
"296 True ... False False False \n",
"462 False ... True False False \n",
"\n",
" Pregnancies_11 Pregnancies_12 Pregnancies_13 Outcome_0 Outcome_1 \\\n",
"668 False False False True False \n",
"324 False False False True False \n",
"624 False False False True False \n",
"690 False False False True False \n",
"473 False False False True False \n",
".. ... ... ... ... ... \n",
"355 False False False False True \n",
"534 False False False True False \n",
"344 False False False True False \n",
"296 False False False False True \n",
"462 False False False True False \n",
"\n",
" Glucose_binned glucose_to_bmi \n",
"668 1 2.882353 \n",
"324 1 3.137255 \n",
"624 1 3.506494 \n",
"690 1 4.349593 \n",
"473 2 4.548495 \n",
".. ... ... \n",
"355 3 5.427632 \n",
"534 0 2.312312 \n",
"344 1 2.581522 \n",
"296 3 5.214286 \n",
"462 0 2.096317 \n",
"\n",
"[154 rows x 25 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Ratio of glucose level to BMI; rows with BMI equal to 0 yield inf\n",
"train_data_encoded['glucose_to_bmi'] = train_data_encoded['Glucose'] / train_data_encoded['BMI']\n",
"val_data_encoded['glucose_to_bmi'] = val_data_encoded['Glucose'] / val_data_encoded['BMI']\n",
"test_data_encoded['glucose_to_bmi'] = test_data_encoded['Glucose'] / test_data_encoded['BMI']\n",
"test_data_encoded"
]
},
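{
"cell_type": "markdown",
"metadata": {},
"source": [
"A BMI of 0 is physiologically impossible and most likely marks a missing measurement, so the plain ratio can evaluate to `inf`. Below is a hedged sketch of one way to guard against that by treating zeros as missing before dividing. The toy frame and the choice of `NaN` are assumptions for illustration, not the notebook's method.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy rows; the second row has BMI recorded as 0.\n",
"toy = pd.DataFrame({'Glucose': [98, 112, 136], 'BMI': [34.0, 0.0, 29.9]})\n",
"\n",
"# Replace zero BMI with NaN so the ratio becomes NaN instead of inf.\n",
"bmi = toy['BMI'].replace(0, np.nan)\n",
"toy['glucose_to_bmi'] = toy['Glucose'] / bmi\n",
"```\n",
"\n",
"The resulting `NaN` values can then be imputed or dropped like any other missing value."
]
},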
{
"cell_type": "markdown",
"metadata": {},
"source": [
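"## Feature scaling\n",
"Feature scaling changes the range of the features so that they are on a comparable scale. Here `StandardScaler` standardizes each numerical feature to zero mean and unit variance, with the parameters fitted on the training set only."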
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"numerical_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']\n",
"\n",
"scaler = StandardScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])"
]
},
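{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the scaling step above concrete, the following sketch checks on made-up numbers that `StandardScaler` computes `z = (x - mean) / std` with the statistics taken from the training data only. The arrays `train` and `val` are hypothetical.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical single-feature splits.\n",
"train = np.array([[1.0], [2.0], [3.0]])\n",
"val = np.array([[4.0]])\n",
"\n",
"# Fit on the training split, then apply the same transform to the validation split.\n",
"scaler = StandardScaler()\n",
"train_scaled = scaler.fit_transform(train)\n",
"val_scaled = scaler.transform(val)\n",
"\n",
"# The same result by hand, using the training mean and population std (ddof=0).\n",
"manual = (val - train.mean(axis=0)) / train.std(axis=0)\n",
"```\n",
"\n",
"Fitting the scaler on the training split alone keeps information from the validation and test sets out of the preprocessing."
]
},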
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature engineering with the Featuretools framework\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Columns in df: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']\n",
"Columns in train_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Pregnancies_14', 'Pregnancies_15', 'Pregnancies_17', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_bmi']\n",
"Columns in val_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_bmi']\n",
"Columns in test_data_encoded: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_0', 'Pregnancies_1', 'Pregnancies_2', 'Pregnancies_3', 'Pregnancies_4', 'Pregnancies_5', 'Pregnancies_6', 'Pregnancies_7', 'Pregnancies_8', 'Pregnancies_9', 'Pregnancies_10', 'Pregnancies_11', 'Pregnancies_12', 'Pregnancies_13', 'Outcome_0', 'Outcome_1', 'Glucose_binned', 'glucose_to_bmi']\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\tabee\\AIM_PIbd-31_Tabeev_A.P\\.venv\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
" warnings.warn(\n",
"c:\\Users\\tabee\\AIM_PIbd-31_Tabeev_A.P\\.venv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Glucose</th>\n",
" <th>BloodPressure</th>\n",
" <th>SkinThickness</th>\n",
" <th>Insulin</th>\n",
" <th>BMI</th>\n",
" <th>DiabetesPedigreeFunction</th>\n",
" <th>Age</th>\n",
" <th>Pregnancies_0</th>\n",
" <th>Pregnancies_1</th>\n",
" <th>Pregnancies_2</th>\n",
" <th>...</th>\n",
" <th>Pregnancies_11</th>\n",
" <th>Pregnancies_12</th>\n",
" <th>Pregnancies_13</th>\n",
" <th>Pregnancies_14</th>\n",
" <th>Pregnancies_15</th>\n",
" <th>Pregnancies_17</th>\n",
" <th>Outcome_0</th>\n",
" <th>Outcome_1</th>\n",
" <th>Glucose_binned</th>\n",
" <th>glucose_to_bmi</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-1.151398</td>\n",
" <td>-3.752683</td>\n",
" <td>-1.322774</td>\n",
" <td>-0.701206</td>\n",
" <td>-4.135256</td>\n",
" <td>-0.490735</td>\n",
" <td>-1.035940</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>2</td>\n",
" <td>inf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-0.276643</td>\n",
" <td>0.680345</td>\n",
" <td>0.233505</td>\n",
" <td>-0.701206</td>\n",
" <td>-0.489169</td>\n",
" <td>2.415030</td>\n",
" <td>1.487101</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>2</td>\n",
" <td>3.971631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.566871</td>\n",
" <td>-1.265862</td>\n",
" <td>-0.090720</td>\n",
" <td>0.013448</td>\n",
" <td>-0.424522</td>\n",
" <td>0.549161</td>\n",
" <td>-0.948939</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>3</td>\n",
" <td>4.843206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.254179</td>\n",
" <td>-1.049617</td>\n",
" <td>-1.322774</td>\n",
" <td>-0.701206</td>\n",
" <td>-1.303720</td>\n",
" <td>-0.639291</td>\n",
" <td>2.792122</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>4</td>\n",
" <td>7.351598</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.410665</td>\n",
" <td>0.572222</td>\n",
" <td>1.076490</td>\n",
" <td>2.484601</td>\n",
" <td>1.838121</td>\n",
" <td>-0.686829</td>\n",
" <td>1.139095</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>3</td>\n",
" <td>2.900433</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>609</th>\n",
" <td>0.566871</td>\n",
" <td>-0.292759</td>\n",
" <td>0.946800</td>\n",
" <td>0.504235</td>\n",
" <td>-0.437451</td>\n",
" <td>-0.172824</td>\n",
" <td>-0.600933</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>3</td>\n",
" <td>4.860140</td>\n",
" </tr>\n",
" <tr>\n",
" <th>610</th>\n",
" <td>-0.776503</td>\n",
" <td>2.842797</td>\n",
" <td>-1.322774</td>\n",
" <td>-0.701206</td>\n",
" <td>-1.239073</td>\n",
" <td>-0.778934</td>\n",
" <td>-0.513932</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>2</td>\n",
" <td>4.285714</td>\n",
" </tr>\n",
" <tr>\n",
" <th>611</th>\n",
" <td>-0.620297</td>\n",
" <td>0.896590</td>\n",
" <td>1.076490</td>\n",
" <td>-0.701206</td>\n",
" <td>1.760544</td>\n",
" <td>1.981245</td>\n",
" <td>0.443084</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>2</td>\n",
" <td>2.214912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>612</th>\n",
" <td>0.629354</td>\n",
" <td>-3.752683</td>\n",
" <td>-1.322774</td>\n",
" <td>-0.701206</td>\n",
" <td>1.346804</td>\n",
" <td>-0.784877</td>\n",
" <td>-0.339929</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>3</td>\n",
" <td>3.325472</td>\n",
" </tr>\n",
" <tr>\n",
" <th>613</th>\n",
" <td>0.129493</td>\n",
" <td>1.437203</td>\n",
" <td>-1.322774</td>\n",
" <td>-0.701206</td>\n",
" <td>-1.226144</td>\n",
" <td>-0.615522</td>\n",
" <td>-1.035940</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>3</td>\n",
" <td>5.555556</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>614 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" Glucose BloodPressure SkinThickness Insulin BMI \\\n",
"id \n",
"0 -1.151398 -3.752683 -1.322774 -0.701206 -4.135256 \n",
"1 -0.276643 0.680345 0.233505 -0.701206 -0.489169 \n",
"2 0.566871 -1.265862 -0.090720 0.013448 -0.424522 \n",
"3 1.254179 -1.049617 -1.322774 -0.701206 -1.303720 \n",
"4 0.410665 0.572222 1.076490 2.484601 1.838121 \n",
".. ... ... ... ... ... \n",
"609 0.566871 -0.292759 0.946800 0.504235 -0.437451 \n",
"610 -0.776503 2.842797 -1.322774 -0.701206 -1.239073 \n",
"611 -0.620297 0.896590 1.076490 -0.701206 1.760544 \n",
"612 0.629354 -3.752683 -1.322774 -0.701206 1.346804 \n",
"613 0.129493 1.437203 -1.322774 -0.701206 -1.226144 \n",
"\n",
" DiabetesPedigreeFunction Age Pregnancies_0 Pregnancies_1 \\\n",
"id \n",
"0 -0.490735 -1.035940 False False \n",
"1 2.415030 1.487101 False False \n",
"2 0.549161 -0.948939 False True \n",
"3 -0.639291 2.792122 True False \n",
"4 -0.686829 1.139095 False False \n",
".. ... ... ... ... \n",
"609 -0.172824 -0.600933 False False \n",
"610 -0.778934 -0.513932 False True \n",
"611 1.981245 0.443084 False False \n",
"612 -0.784877 -0.339929 True False \n",
"613 -0.615522 -1.035940 True False \n",
"\n",
" Pregnancies_2 ... Pregnancies_11 Pregnancies_12 Pregnancies_13 \\\n",
"id ... \n",
"0 True ... False False False \n",
"1 False ... False False False \n",
"2 False ... False False False \n",
"3 False ... False False False \n",
"4 False ... False False False \n",
".. ... ... ... ... ... \n",
"609 False ... False False False \n",
"610 False ... False False False \n",
"611 False ... False False False \n",
"612 False ... False False False \n",
"613 False ... False False False \n",
"\n",
" Pregnancies_14 Pregnancies_15 Pregnancies_17 Outcome_0 Outcome_1 \\\n",
"id \n",
"0 False False False True False \n",
"1 False False False False True \n",
"2 False False False True False \n",
"3 False False False True False \n",
"4 False False False False True \n",
".. ... ... ... ... ... \n",
"609 False False False True False \n",
"610 False False False True False \n",
"611 False False False False True \n",
"612 False False False False True \n",
"613 False False False True False \n",
"\n",
" Glucose_binned glucose_to_bmi \n",
"id \n",
"0 2 inf \n",
"1 2 3.971631 \n",
"2 3 4.843206 \n",
"3 4 7.351598 \n",
"4 3 2.900433 \n",
".. ... ... \n",
"609 3 4.860140 \n",
"610 2 4.285714 \n",
"611 2 2.214912 \n",
"612 3 3.325472 \n",
"613 3 5.555556 \n",
"\n",
"[614 rows x 28 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import featuretools as ft\n",
"\n",
"print(\"Columns in df:\", df.columns.tolist())\n",
"print(\"Columns in train_data_encoded:\", train_data_encoded.columns.tolist())\n",
"print(\"Columns in val_data_encoded:\", val_data_encoded.columns.tolist())\n",
"print(\"Columns in test_data_encoded:\", test_data_encoded.columns.tolist())\n",
"\n",
"# Drop duplicates across all columns (the data has no unique identifier)\n",
"df = df.drop_duplicates()\n",
"# Rows of the encoded training set that are exact duplicates of one another\n",
"duplicates = train_data_encoded[train_data_encoded.duplicated(keep=False)]\n",
"\n",
"# Create the EntitySet\n",
"es = ft.EntitySet(id='diabetes_data')\n",
"\n",
"# Add the dataframe to the EntitySet; there is no 'id' column, so Featuretools creates it as a new integer index\n",
"es = es.add_dataframe(dataframe_name='patients', dataframe=train_data_encoded, index='id')\n",
"\n",
"# Generate features with deep feature synthesis\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='patients', max_depth=2)\n",
"\n",
"feature_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quality evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model training time: 0.01 seconds\n",
"Mean squared error: 704.68\n",
"Coefficient of determination (R²): 0.30\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import time\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# Separate the features from the target variable (Glucose)\n",
"X = df.drop('Glucose', axis=1)\n",
"y = df['Glucose']\n",
"\n",
"# One-hot encoding for categorical variables (converts categorical columns to numeric ones)\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Fill any missing values with the column median\n",
"X.fillna(X.median(), inplace=True)\n",
"\n",
"# Split the data into training and validation sets\n",
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Create the model\n",
"model = LinearRegression()\n",
"\n",
"# Start the timer and fit the model\n",
"start_time = time.time()\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Model training time\n",
"train_time = time.time() - start_time\n",
"\n",
"# Predictions and model evaluation\n",
"val_predictions = model.predict(X_val)\n",
"mse = mean_squared_error(y_val, val_predictions)\n",
"r2 = r2_score(y_val, val_predictions)\n",
"\n",
"print(f'Model training time: {train_time:.2f} seconds')\n",
"print(f'Mean squared error: {mse:.2f}')\n",
"print(f'Coefficient of determination (R²): {r2:.2f}')\n",
"\n",
"# Visualize the results: predictions against actual values, with the ideal y = x line\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_val, val_predictions, alpha=0.5)\n",
"plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual glucose level')\n",
"plt.ylabel('Predicted glucose level')\n",
"plt.title('Actual vs. predicted glucose level')\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}