2024-11-26 22:22:49 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4 style=\"margin: 30px;\">Business goals and the two tasks to be solved:<br/>\n",
"Reduce the likelihood of stroke in high-risk patients through early detection of predisposition.<br/>\n",
"Optimize the medical services provided to patients according to their stroke risk.<br/><br/><br/>\n",
"Develop a model that predicts a patient's probability of stroke.<br/>\n",
"Identify the features that matter most for stroke risk analysis, so that medical staff can focus on the important factors.</h4>"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of columns: 12\n",
"Columns: Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',\n",
" 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',\n",
" 'smoking_status', 'stroke'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Load the data\n",
"data = pd.read_csv('./csv/option4.csv')\n",
"\n",
"# Data overview\n",
"print(\"Number of columns:\", data.columns.size)\n",
"print(\"Columns:\", data.columns)"
]
},
{
"cell_type": "code",
"execution_count": 165,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Missing values per column:\n",
"id 0\n",
"gender 0\n",
"age 0\n",
"hypertension 0\n",
"heart_disease 0\n",
"ever_married 0\n",
"work_type 0\n",
"Residence_type 0\n",
"avg_glucose_level 0\n",
"bmi 201\n",
"smoking_status 0\n",
"stroke 0\n",
"dtype: int64\n",
"\n",
"\n",
"\n",
"<bound method NDFrame.describe of id gender age hypertension heart_disease ever_married \\\n",
"0 9046 Male 67.0 0 1 Yes \n",
"1 51676 Female 61.0 0 0 Yes \n",
"2 31112 Male 80.0 0 1 Yes \n",
"3 60182 Female 49.0 0 0 Yes \n",
"4 1665 Female 79.0 1 0 Yes \n",
"... ... ... ... ... ... ... \n",
"5105 18234 Female 80.0 1 0 Yes \n",
"5106 44873 Female 81.0 0 0 Yes \n",
"5107 19723 Female 35.0 0 0 Yes \n",
"5108 37544 Male 51.0 0 0 Yes \n",
"5109 44679 Female 44.0 0 0 Yes \n",
"\n",
" work_type Residence_type avg_glucose_level bmi smoking_status \\\n",
"0 Private Urban 228.69 36.6 formerly smoked \n",
"1 Self-employed Rural 202.21 NaN never smoked \n",
"2 Private Rural 105.92 32.5 never smoked \n",
"3 Private Urban 171.23 34.4 smokes \n",
"4 Self-employed Rural 174.12 24.0 never smoked \n",
"... ... ... ... ... ... \n",
"5105 Private Urban 83.75 NaN never smoked \n",
"5106 Self-employed Urban 125.20 40.0 never smoked \n",
"5107 Self-employed Rural 82.99 30.6 never smoked \n",
"5108 Private Rural 166.29 25.6 formerly smoked \n",
"5109 Govt_job Urban 85.28 26.2 Unknown \n",
"\n",
" stroke \n",
"0 1 \n",
"1 1 \n",
"2 1 \n",
"3 1 \n",
"4 1 \n",
"... ... \n",
"5105 0 \n",
"5106 0 \n",
"5107 0 \n",
"5108 0 \n",
"5109 0 \n",
"\n",
"[5110 rows x 12 columns]>\n"
]
}
],
"source": [
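"print(\"\\nMissing values per column:\")\n",
"print(data.isnull().sum())\n",
"\n",
"print(\"\\n\\n\")\n",
"\n",
"# Note: describe is referenced without parentheses, so this prints the bound method,\n",
"# whose repr contains the full DataFrame, rather than summary statistics\n",
"print(data.describe)"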
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">Let's replace the missing (NaN) values in the bmi column with the column median.</p>"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Missing values per column:\n",
"id 0\n",
"gender 0\n",
"age 0\n",
"hypertension 0\n",
"heart_disease 0\n",
"ever_married 0\n",
"work_type 0\n",
"Residence_type 0\n",
"avg_glucose_level 0\n",
"bmi 0\n",
"smoking_status 0\n",
"stroke 0\n",
"dtype: int64\n"
]
}
],
"source": [
"# Fill missing bmi values with the column median\n",
"data['bmi'] = data['bmi'].fillna(data['bmi'].median())\n",
"print(\"\\nMissing values per column:\")\n",
"print(data.isnull().sum())"
]
},
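The imputation step above can be sketched in plain Python. This is only a minimal illustration on made-up BMI readings: `fill_missing_with_median` and `bmi_values` are hypothetical names, and the notebook itself relies on pandas `fillna` with `median()`.

```python
import math
import statistics

def fill_missing_with_median(values):
    """Return a copy of `values` where NaN entries are replaced by the median of the observed ones."""
    observed = [v for v in values if not math.isnan(v)]
    median = statistics.median(observed)
    return [median if math.isnan(v) else v for v in values]

# Hypothetical BMI readings with two gaps (not taken from the dataset)
bmi_values = [36.6, float("nan"), 32.5, 34.4, 24.0, float("nan")]
filled = fill_missing_with_median(bmi_values)
print(filled)
```

The median is typically preferred over the mean here because it is less sensitive to extreme values.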
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">Let's take a look at the outliers:</p>"
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1500x500 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def plot_numeric_boxplots(dataframe):\n",
"    # Numeric columns to plot\n",
"    numeric_columns = ['age', 'avg_glucose_level', 'bmi']\n",
"\n",
"    # Draw one box plot per column\n",
"    if numeric_columns:\n",
"        plt.figure(figsize=(15, 5))\n",
"\n",
"        for i, col in enumerate(numeric_columns):\n",
"            if col != 'id':\n",
"                plt.subplot(1, len(numeric_columns), i + 1)\n",
"                sns.boxplot(y=dataframe[col])\n",
"                plt.title(f'{col}')\n",
"                plt.ylabel('')\n",
"                plt.xlabel(col)\n",
"\n",
"        plt.tight_layout()\n",
"        plt.show()\n",
"    else:\n",
"        print(\"No suitable numeric columns to plot.\")\n",
"\n",
"plot_numeric_boxplots(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">We see outliers in the average glucose level column and in the bmi (body mass index) column. Let's handle them by capping the values at upper and lower bounds.</p>"
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1500x500 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def remove_outliers(df):\n",
"    # Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the nearest bound (no rows are dropped)\n",
"    numeric_columns = ['age', 'avg_glucose_level', 'bmi']\n",
"    for column in numeric_columns:\n",
"        Q1 = df[column].quantile(0.25)\n",
"        Q3 = df[column].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"        df[column] = df[column].apply(lambda x: lower_bound if x < lower_bound else upper_bound if x > upper_bound else x)\n",
"    return df\n",
"\n",
"data = remove_outliers(data)\n",
"plot_numeric_boxplots(data)"
]
},
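The capping rule used above can be sketched without pandas. This is a minimal illustration on made-up glucose values: `cap_outliers_iqr` and `glucose` are hypothetical names, and `method="inclusive"` is chosen on the assumption that it matches the linear interpolation pandas applies in `quantile` by default.

```python
import statistics

def cap_outliers_iqr(values, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest bound."""
    # "inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values inside the fences are left unchanged; nothing is removed
    return [min(max(v, lower), upper) for v in values]

# Hypothetical glucose readings with one extreme value
glucose = [70.0, 80.0, 90.0, 100.0, 110.0, 300.0]
capped = cap_outliers_iqr(glucose)
print(capped)
```

Here Q1 is 82.5 and Q3 is 107.5, so the upper fence is 145.0 and only the reading of 300.0 is pulled down to it.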
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">The outliers are dealt with; now let's split the data into subsets.</p>"
]
},
{
"cell_type": "code",
"execution_count": 171,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Sample sizes:\n",
"Training set: (4088, 10)\n",
"Test set: (511, 10)\n",
"Control set: (511, 10)\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Define the features and the target variable\n",
"X = data.drop(columns=['id', 'stroke'])\n",
"y = data['stroke']\n",
"\n",
"# Training set (80% of the data)\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n",
"\n",
"# Test and control sets (10% of the data each)\n",
"X_test, X_control, y_test, y_control = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)\n",
"\n",
"print(\"\\nSample sizes:\")\n",
"print(f\"Training set: {X_train.shape}\")\n",
"print(f\"Test set: {X_test.shape}\")\n",
"print(f\"Control set: {X_control.shape}\")"
]
},
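The sizes of the three subsets follow from simple arithmetic on the two-stage split. The sketch below is only an illustration: `split_sizes` is a hypothetical helper, and its use of `round` is an assumption that happens to agree with the 5110-row case here; scikit-learn may round differently when the fractions do not divide evenly.

```python
def split_sizes(n_samples, holdout_fraction=0.2, control_fraction=0.5):
    """Return (train, test, control) sizes for a two-stage hold-out split."""
    # First split: set aside a hold-out portion, the rest is used for training
    n_holdout = round(n_samples * holdout_fraction)
    n_train = n_samples - n_holdout
    # Second split: divide the hold-out portion into test and control parts
    n_control = round(n_holdout * control_fraction)
    n_test = n_holdout - n_control
    return n_train, n_test, n_control

print(split_sizes(5110))
```

With 5110 rows this gives an 80/10/10 split, consistent with the shapes printed by the cell above.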
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"stroke\n",
"0 4861\n",
"1 249\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Count the number of samples in each class\n",
"class_counts = y.value_counts()\n",
"print(class_counts)\n",
"\n",
"# Visualization\n",
"sns.barplot(x=class_counts.index, y=class_counts.values)\n",
"plt.title(\"Class distribution (stroke)\")\n",
"plt.xlabel(\"Class\")\n",
"plt.ylabel(\"Count\")\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">Let's write a function and augment (resample) the data</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data BEFORE augmentation in the TRAINING SET (60-80% of the data)\n",
"\n",
"stroke\n",
"0 3889\n",
"1 199\n",
"Name: count, dtype: int64\n",
"\n",
"After oversampling\n",
"\n",
"stroke\n",
"0 3889\n",
"1 1944\n",
"Name: count, dtype: int64\n",
"\n",
"After balancing the data (undersampling)\n",
"\n",
"stroke\n",
"0 1944\n",
"1 1944\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data BEFORE augmentation in the TEST SET (10-20% of the data)\n",
"\n",
"stroke\n",
"0 486\n",
"1 25\n",
"Name: count, dtype: int64\n",
"\n",
"After oversampling\n",
"\n",
"stroke\n",
"0 486\n",
"1 243\n",
"Name: count, dtype: int64\n",
"\n",
"After balancing the data (undersampling)\n",
"\n",
"stroke\n",
"0 243\n",
"1 243\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"def over_under_sampling(x_selection, y_selection):\n",
"\n",
"    # First, oversample the minority class\n",
"\n",
"    oversampler = RandomOverSampler(sampling_strategy=0.5, random_state=42)\n",
"    x_over, y_over = oversampler.fit_resample(x_selection, y_selection)\n",
"\n",
"    print(\"\\nAfter oversampling\\n\")\n",
"    print(y_over.value_counts())\n",
"\n",
"    # Then, undersample the majority class\n",
"\n",
"    undersampler = RandomUnderSampler(sampling_strategy=1.0, random_state=42)\n",
"    x_balanced, y_balanced = undersampler.fit_resample(x_over, y_over)\n",
"\n",
"    # Counts of the resampled classes, used for both the printout and the chart\n",
"    balanced_counts = y_balanced.value_counts()\n",
"\n",
"    print(\"\\nAfter balancing the data (undersampling)\\n\")\n",
"    print(balanced_counts)\n",
"\n",
"    plt.pie(\n",
"        balanced_counts,\n",
"        labels=balanced_counts.index,  # Class labels (0 and 1), taken from the plotted counts\n",
"        autopct='%1.1f%%',  # Show percentages\n",
"        colors=['lightgreen', 'lightcoral'],  # Colors for the classes\n",
"        startangle=45,  # Rotate the chart\n",
"        explode=(0, 0.05)  # Slight offset for the second wedge\n",
"    )\n",
"    plt.title(\"Class distribution (stroke)\")\n",
"    plt.show()\n",
"\n",
"print(\"Data BEFORE augmentation in the TRAINING SET (60-80% of the data)\\n\")\n",
"print(y_train.value_counts())\n",
"over_under_sampling(X_train, y_train)\n",
"\n",
"print(\"Data BEFORE augmentation in the TEST SET (10-20% of the data)\\n\")\n",
"print(y_test.value_counts())\n",
"over_under_sampling(X_test, y_test)"
]
},
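{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">A minimal sketch, not part of the original workflow: <code>over_under_sampling</code> above prints and plots the balanced counts but does not return the resampled data. Assuming the balanced split is needed later, a variant could return it. The names <code>X_demo</code>, <code>y_demo</code>, <code>resample_demo</code> and the toy data are hypothetical.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Hypothetical imbalanced toy data: 18 negatives and 2 positives\n",
"X_demo = pd.DataFrame({'age': range(20)})\n",
"y_demo = pd.Series([0] * 18 + [1] * 2, name='stroke')\n",
"\n",
"def resample_demo(x, y):\n",
"    # Oversample the minority class to half the size of the majority class\n",
"    x_over, y_over = RandomOverSampler(sampling_strategy=0.5, random_state=42).fit_resample(x, y)\n",
"    # Undersample the majority class down to a 1:1 ratio and return the result\n",
"    return RandomUnderSampler(sampling_strategy=1.0, random_state=42).fit_resample(x_over, y_over)\n",
"\n",
"X_bal, y_bal = resample_demo(X_demo, y_demo)\n",
"print(y_bal.value_counts())"
]
},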
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">Now we can move on to feature engineering. Note that <code>over_under_sampling</code> only prints and plots the balanced class counts; the cells below work with <code>X_train</code> as it is.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">One-hot encoding of categorical features <br/> <br/>We apply it to the categorical (NON-numeric) features: 'gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data after one-hot encoding:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>hypertension</th>\n",
" <th>heart_disease</th>\n",
" <th>avg_glucose_level</th>\n",
" <th>bmi</th>\n",
" <th>gender_Male</th>\n",
" <th>gender_Other</th>\n",
" <th>ever_married_Yes</th>\n",
" <th>work_type_Never_worked</th>\n",
" <th>work_type_Private</th>\n",
" <th>work_type_Self-employed</th>\n",
" <th>work_type_children</th>\n",
" <th>Residence_type_Urban</th>\n",
" <th>smoking_status_formerly smoked</th>\n",
" <th>smoking_status_never smoked</th>\n",
" <th>smoking_status_smokes</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>845</th>\n",
" <td>48.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>69.21</td>\n",
" <td>33.1</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3744</th>\n",
" <td>15.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>122.25</td>\n",
" <td>21.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4183</th>\n",
" <td>67.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>110.42</td>\n",
" <td>24.9</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3409</th>\n",
" <td>44.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>65.41</td>\n",
" <td>24.8</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>284</th>\n",
" <td>14.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>82.34</td>\n",
" <td>31.6</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age hypertension heart_disease avg_glucose_level bmi gender_Male \\\n",
"845 48.0 0 0 69.21 33.1 False \n",
"3744 15.0 0 0 122.25 21.0 True \n",
"4183 67.0 0 0 110.42 24.9 False \n",
"3409 44.0 0 0 65.41 24.8 True \n",
"284 14.0 0 0 82.34 31.6 True \n",
"\n",
" gender_Other ever_married_Yes work_type_Never_worked \\\n",
"845 False True False \n",
"3744 False False False \n",
"4183 False True False \n",
"3409 False True False \n",
"284 False False False \n",
"\n",
" work_type_Private work_type_Self-employed work_type_children \\\n",
"845 True False False \n",
"3744 True False False \n",
"4183 False True False \n",
"3409 True False False \n",
"284 False False False \n",
"\n",
" Residence_type_Urban smoking_status_formerly smoked \\\n",
"845 True False \n",
"3744 False False \n",
"4183 False False \n",
"3409 True False \n",
"284 True False \n",
"\n",
" smoking_status_never smoked smoking_status_smokes \n",
"845 True False \n",
"3744 True False \n",
"4183 True False \n",
"3409 False True \n",
"284 False False "
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# One-Hot Encoding\n",
"# drop_first=True drops the first level of each feature, so k levels produce k-1 columns\n",
"categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']\n",
"X_encoded = pd.get_dummies(X_train, columns=categorical_columns, drop_first=True)\n",
"\n",
"print(\"Data after one-hot encoding:\")\n",
"X_encoded.head()\n"
]
},
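{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">A hedged sketch, not part of the original workflow: if the test split is encoded separately, a rare level (such as 'Other' for gender) may be missing there, and the dummy columns will differ from the training ones. One possible way to align them is <code>reindex</code>. The frames <code>train_demo</code> and <code>test_demo</code> are hypothetical toy data.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy splits: the rare level 'Other' appears only in the training split\n",
"train_demo = pd.DataFrame({'gender': ['Male', 'Female', 'Other']})\n",
"test_demo = pd.DataFrame({'gender': ['Female', 'Male']})\n",
"\n",
"train_dummies = pd.get_dummies(train_demo, columns=['gender'], drop_first=True)\n",
"test_dummies = pd.get_dummies(test_demo, columns=['gender'], drop_first=True)\n",
"\n",
"# Add the columns missing from the test split (filled with False) and keep the training column order\n",
"test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=False)\n",
"print(test_aligned.columns.tolist())"
]
},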
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">Discretization of numeric features<br/><br/>Numeric features such as 'age', 'avg_glucose_level', and 'bmi' can be split into categories (binning). Here 'age' and 'bmi' are binned.</p>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data after discretization:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age_bins</th>\n",
" <th>bmi_bins</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>845</th>\n",
" <td>средний</td>\n",
" <td>ожирение</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3744</th>\n",
" <td>ребенок</td>\n",
" <td>норма</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4183</th>\n",
" <td>пожилой</td>\n",
" <td>норма</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3409</th>\n",
" <td>средний</td>\n",
" <td>норма</td>\n",
" </tr>\n",
" <tr>\n",
" <th>284</th>\n",
" <td>ребенок</td>\n",
" <td>ожирение</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4796</th>\n",
" <td>пожилой</td>\n",
" <td>ожирение</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1283</th>\n",
" <td>пожилой</td>\n",
" <td>ожирение</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3656</th>\n",
" <td>средний</td>\n",
" <td>ожирение</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2485</th>\n",
" <td>ребенок</td>\n",
" <td>норма</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1282</th>\n",
" <td>пожилой</td>\n",
" <td>ожирение</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age_bins bmi_bins\n",
"845 средний ожирение\n",
"3744 ребенок норма\n",
"4183 пожилой норма\n",
"3409 средний норма\n",
"284 ребенок ожирение\n",
"4796 пожилой ожирение\n",
"1283 пожилой ожирение\n",
"3656 средний ожирение\n",
"2485 ребенок норма\n",
"1282 пожилой ожирение"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
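],
"source": [
"# Age groups: ребенок (child), молодой (young), средний (middle-aged), пожилой (elderly)\n",
"# Values outside the bin edges (for example, age above 80) become NaN\n",
"X_encoded['age_bins'] = pd.cut(X_encoded['age'], bins=[0, 18, 30, 50, 80], labels=['ребенок', 'молодой', 'средний', 'пожилой'])\n",
"# BMI groups: низкий (underweight), норма (normal), избыток (overweight), ожирение (obese)\n",
"X_encoded['bmi_bins'] = pd.cut(X_encoded['bmi'], bins=[0, 18.5, 25, 30, 50], labels=['низкий', 'норма', 'избыток', 'ожирение'])\n",
"\n",
"print(\"Data after discretization:\")\n",
"X_encoded[['age_bins', 'bmi_bins']].head(10)\n"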
]
},
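{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">A small sketch, not part of the original workflow: with a finite last bin edge, <code>pd.cut</code> returns NaN for values above it. Assuming every age above 50 should fall into the last group, the last edge can be replaced with <code>np.inf</code>. The series <code>ages_demo</code> is hypothetical toy data.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy ages; 82 lies above the last finite edge (80)\n",
"ages_demo = pd.Series([5, 25, 45, 80, 82])\n",
"age_labels = ['ребенок', 'молодой', 'средний', 'пожилой']\n",
"\n",
"# Closed upper edge: 82 falls outside every interval and becomes NaN\n",
"closed_bins = pd.cut(ages_demo, bins=[0, 18, 30, 50, 80], labels=age_labels)\n",
"\n",
"# Open-ended upper edge: every age above 50 falls into the last group\n",
"open_bins = pd.cut(ages_demo, bins=[0, 18, 30, 50, np.inf], labels=age_labels)\n",
"\n",
"print(\"NaN with closed edge:\", closed_bins.isna().sum())\n",
"print(\"NaN with open edge:\", open_bins.isna().sum())"
]
},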
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">Manual synthesis of new features <br/><br/>\n",
"<li>Age-glucose index: age * avg_glucose_level\n",
"<li>Glucose-adjusted body mass index: bmi / avg_glucose_level </p>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data after synthesizing new features:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age_glucose_index</th>\n",
" <th>bmi_glucose_ratio</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>845</th>\n",
" <td>3322.0800</td>\n",
" <td>0.478255</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3744</th>\n",
" <td>1833.7500</td>\n",
" <td>0.171779</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4183</th>\n",
" <td>7398.1400</td>\n",
" <td>0.225503</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3409</th>\n",
" <td>2878.0400</td>\n",
" <td>0.379147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>284</th>\n",
" <td>1152.7600</td>\n",
" <td>0.383775</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4796</th>\n",
" <td>5204.1700</td>\n",
" <td>0.528826</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1283</th>\n",
" <td>9478.3500</td>\n",
" <td>0.295779</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3656</th>\n",
" <td>3164.2800</td>\n",
" <td>0.504380</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2485</th>\n",
" <td>987.5600</td>\n",
" <td>0.345903</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1282</th>\n",
" <td>8975.9475</td>\n",
" <td>0.188949</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age_glucose_index bmi_glucose_ratio\n",
"845 3322.0800 0.478255\n",
"3744 1833.7500 0.171779\n",
"4183 7398.1400 0.225503\n",
"3409 2878.0400 0.379147\n",
"284 1152.7600 0.383775\n",
"4796 5204.1700 0.528826\n",
"1283 9478.3500 0.295779\n",
"3656 3164.2800 0.504380\n",
"2485 987.5600 0.345903\n",
"1282 8975.9475 0.188949"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_encoded['age_glucose_index'] = X_encoded['age'] * X_encoded['avg_glucose_level']\n",
"X_encoded['bmi_glucose_ratio'] = X_encoded['bmi'] / X_encoded['avg_glucose_level']\n",
"\n",
"print(\"Data after synthesizing new features:\")\n",
"X_encoded[['age_glucose_index', 'bmi_glucose_ratio']].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">Feature scaling<br/><br/>We apply normalization (to squeeze values into the [0, 1] range); standardization (to bring features to mean 0 and standard deviation 1) is kept commented out as an alternative.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data after normalization:\n",
" age hypertension heart_disease avg_glucose_level bmi \\\n",
"845 0.584961 0 0 0.123340 0.633333 \n",
"3744 0.182129 0 0 0.587635 0.297222 \n",
"4183 0.816895 0 0 0.484079 0.405556 \n",
"3409 0.536133 0 0 0.090076 0.402778 \n",
"284 0.169922 0 0 0.238276 0.591667 \n",
"4796 0.890137 1 0 0.141547 0.761111 \n",
"1283 0.768066 1 1 0.834490 0.950000 \n",
"3656 0.511719 0 0 0.177000 0.769444 \n",
"2485 0.169922 0 0 0.134982 0.391667 \n",
"1282 0.645996 0 1 1.000000 0.602778 \n",
"\n",
" gender_Male gender_Other ever_married_Yes work_type_Never_worked \\\n",
"845 False False True False \n",
"3744 True False False False \n",
"4183 False False True False \n",
"3409 True False True False \n",
"284 True False False False \n",
"4796 True False False False \n",
"1283 True False True False \n",
"3656 False False True False \n",
"2485 False False False False \n",
"1282 True False True False \n",
"\n",
" work_type_Private work_type_Self-employed work_type_children \\\n",
"845 True False False \n",
"3744 True False False \n",
"4183 False True False \n",
"3409 True False False \n",
"284 False False False \n",
"4796 False False False \n",
"1283 True False False \n",
"3656 False True False \n",
"2485 True False False \n",
"1282 True False False \n",
"\n",
" Residence_type_Urban smoking_status_formerly smoked \\\n",
"845 True False \n",
"3744 False False \n",
"4183 False False \n",
"3409 True False \n",
"284 True False \n",
"4796 True False \n",
"1283 True True \n",
"3656 False False \n",
"2485 False True \n",
"1282 False False \n",
"\n",
" smoking_status_never smoked smoking_status_smokes age_bins bmi_bins \\\n",
"845 True False средний ожирение \n",
"3744 True False ребенок норма \n",
"4183 True False пожилой норма \n",
"3409 False True средний норма \n",
"284 False False ребенок ожирение \n",
"4796 True False пожилой ожирение \n",
"1283 False False пожилой ожирение \n",
"3656 True False средний ожирение \n",
"2485 False False ребенок норма \n",
"1282 False False пожилой ожирение \n",
"\n",
" age_glucose_index bmi_glucose_ratio \n",
"845 3322.0800 0.478255 \n",
"3744 1833.7500 0.171779 \n",
"4183 7398.1400 0.225503 \n",
"3409 2878.0400 0.379147 \n",
"284 1152.7600 0.383775 \n",
"4796 5204.1700 0.528826 \n",
"1283 9478.3500 0.295779 \n",
"3656 3164.2800 0.504380 \n",
"2485 987.5600 0.345903 \n",
"1282 8975.9475 0.188949 \n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"\n",
"scaler = MinMaxScaler()\n",
"standardizer = StandardScaler()\n",
"\n",
"# Normalization\n",
"X_encoded[['age', 'avg_glucose_level', 'bmi']] = scaler.fit_transform(X_encoded[['age', 'avg_glucose_level', 'bmi']])\n",
"print(\"Data after normalization:\\n\", X_encoded.head(10))\n",
"\n",
"# # Standardization\n",
"# X_encoded[['age', 'avg_glucose_level', 'bmi']] = standardizer.fit_transform(X_encoded[['age', 'avg_glucose_level', 'bmi']])\n",
"# print(\"Data after standardization:\\n\", X_encoded.head(10))\n",
"\n"
]
},
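{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">A hedged sketch, not part of the original workflow: assuming the test split has to be scaled as well, the usual pattern is to fit the scaler on the training split only and reuse it for the test split. The frames <code>train_demo</code> and <code>test_demo</code> and the name <code>demo_scaler</code> are hypothetical.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"numeric_columns = ['age', 'avg_glucose_level', 'bmi']\n",
"\n",
"# Hypothetical toy frames standing in for the training and test splits\n",
"train_demo = pd.DataFrame({'age': [20.0, 40.0, 60.0], 'avg_glucose_level': [70.0, 100.0, 130.0], 'bmi': [20.0, 25.0, 30.0]})\n",
"test_demo = pd.DataFrame({'age': [30.0], 'avg_glucose_level': [115.0], 'bmi': [35.0]})\n",
"\n",
"demo_scaler = MinMaxScaler()\n",
"# Learn the minimum and maximum from the training split only\n",
"train_demo[numeric_columns] = demo_scaler.fit_transform(train_demo[numeric_columns])\n",
"# Reuse the fitted scaler; test values outside the training range land outside [0, 1]\n",
"test_demo[numeric_columns] = demo_scaler.transform(test_demo[numeric_columns])\n",
"print(test_demo)"
]
},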
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">Feature engineering with the Featuretools framework</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Columns in data: ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi', 'gender_Male', 'gender_Other', 'ever_married_Yes', 'work_type_Never_worked', 'work_type_Private', 'work_type_Self-employed', 'work_type_children', 'Residence_type_Urban', 'smoking_status_formerly smoked', 'smoking_status_never smoked', 'smoking_status_smokes', 'age_bins', 'bmi_bins', 'age_glucose_index', 'bmi_glucose_ratio']\n",
"age 0\n",
"hypertension 0\n",
"heart_disease 0\n",
"avg_glucose_level 0\n",
"bmi 0\n",
"gender_Male 0\n",
"gender_Other 0\n",
"ever_married_Yes 0\n",
"work_type_Never_worked 0\n",
"work_type_Private 0\n",
"work_type_Self-employed 0\n",
"work_type_children 0\n",
"Residence_type_Urban 0\n",
"smoking_status_formerly smoked 0\n",
"smoking_status_never smoked 0\n",
"smoking_status_smokes 0\n",
"age_bins 87\n",
"bmi_bins 0\n",
"age_glucose_index 0\n",
"bmi_glucose_ratio 0\n",
"dtype: int64\n",
"Generated features:\n",
" age hypertension heart_disease avg_glucose_level bmi \\\n",
"id \n",
"0 0.584961 0 0 0.123340 0.633333 \n",
"1 0.182129 0 0 0.587635 0.297222 \n",
"2 0.816895 0 0 0.484079 0.405556 \n",
"3 0.536133 0 0 0.090076 0.402778 \n",
"4 0.169922 0 0 0.238276 0.591667 \n",
"\n",
" gender_Male gender_Other ever_married_Yes work_type_Never_worked \\\n",
"id \n",
"0 False False True False \n",
"1 True False False False \n",
"2 False False True False \n",
"3 True False True False \n",
"4 True False False False \n",
"\n",
" work_type_Private work_type_Self-employed work_type_children \\\n",
"id \n",
"0 True False False \n",
"1 True False False \n",
"2 False True False \n",
"3 True False False \n",
"4 False False False \n",
"\n",
" Residence_type_Urban smoking_status_formerly smoked \\\n",
"id \n",
"0 True False \n",
"1 False False \n",
"2 False False \n",
"3 True False \n",
"4 True False \n",
"\n",
" smoking_status_never smoked smoking_status_smokes age_bins bmi_bins \\\n",
"id \n",
"0 True False средний ожирение \n",
"1 True False ребенок норма \n",
"2 True False пожилой норма \n",
"3 False True средний норма \n",
"4 False False ребенок ожирение \n",
"\n",
" age_glucose_index bmi_glucose_ratio \n",
"id \n",
"0 3322.08 0.478255 \n",
"1 1833.75 0.171779 \n",
"2 7398.14 0.225503 \n",
"3 2878.04 0.379147 \n",
"4 1152.76 0.383775 \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
" warnings.warn(\n",
"d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>hypertension</th>\n",
" <th>heart_disease</th>\n",
" <th>avg_glucose_level</th>\n",
" <th>bmi</th>\n",
" <th>gender_Male</th>\n",
" <th>gender_Other</th>\n",
" <th>ever_married_Yes</th>\n",
" <th>work_type_Never_worked</th>\n",
" <th>work_type_Private</th>\n",
" <th>work_type_Self-employed</th>\n",
" <th>work_type_children</th>\n",
" <th>Residence_type_Urban</th>\n",
" <th>smoking_status_formerly smoked</th>\n",
" <th>smoking_status_never smoked</th>\n",
" <th>smoking_status_smokes</th>\n",
" <th>age_bins</th>\n",
" <th>bmi_bins</th>\n",
" <th>age_glucose_index</th>\n",
" <th>bmi_glucose_ratio</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.584961</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.123340</td>\n",
" <td>0.633333</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>средний</td>\n",
" <td>ожирение</td>\n",
" <td>3322.08</td>\n",
" <td>0.478255</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.182129</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.587635</td>\n",
" <td>0.297222</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>ребенок</td>\n",
" <td>норма</td>\n",
" <td>1833.75</td>\n",
" <td>0.171779</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.816895</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.484079</td>\n",
" <td>0.405556</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>пожилой</td>\n",
" <td>норма</td>\n",
" <td>7398.14</td>\n",
" <td>0.225503</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.536133</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.090076</td>\n",
" <td>0.402778</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>средний</td>\n",
" <td>норма</td>\n",
" <td>2878.04</td>\n",
" <td>0.379147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.169922</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.238276</td>\n",
" <td>0.591667</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>ребенок</td>\n",
" <td>ожирение</td>\n",
" <td>1152.76</td>\n",
" <td>0.383775</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4083</th>\n",
" <td>0.548340</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.330364</td>\n",
" <td>0.688889</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>средний</td>\n",
" <td>ожирение</td>\n",
" <td>4178.70</td>\n",
" <td>0.377988</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4084</th>\n",
" <td>0.194336</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.510778</td>\n",
" <td>0.255556</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>ребенок</td>\n",
" <td>норма</td>\n",
" <td>1815.52</td>\n",
" <td>0.171852</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4085</th>\n",
" <td>0.743652</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.205974</td>\n",
" <td>0.719444</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>пожилой</td>\n",
" <td>ожирение</td>\n",
" <td>4797.65</td>\n",
" <td>0.460267</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4086</th>\n",
" <td>0.377441</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.165707</td>\n",
" <td>0.436111</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>средний</td>\n",
" <td>избыток</td>\n",
" <td>2295.55</td>\n",
" <td>0.351114</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4087</th>\n",
" <td>0.072266</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.314520</td>\n",
" <td>0.327778</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>ребенок</td>\n",
" <td>норма</td>\n",
" <td>546.30</td>\n",
" <td>0.242724</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>4088 rows × 20 columns</p>\n",
"</div>"
],
"text/plain": [
" age hypertension heart_disease avg_glucose_level bmi \\\n",
"id \n",
"0 0.584961 0 0 0.123340 0.633333 \n",
"1 0.182129 0 0 0.587635 0.297222 \n",
"2 0.816895 0 0 0.484079 0.405556 \n",
"3 0.536133 0 0 0.090076 0.402778 \n",
"4 0.169922 0 0 0.238276 0.591667 \n",
"... ... ... ... ... ... \n",
"4083 0.548340 0 0 0.330364 0.688889 \n",
"4084 0.194336 0 0 0.510778 0.255556 \n",
"4085 0.743652 0 0 0.205974 0.719444 \n",
"4086 0.377441 0 0 0.165707 0.436111 \n",
"4087 0.072266 0 0 0.314520 0.327778 \n",
"\n",
" gender_Male gender_Other ever_married_Yes work_type_Never_worked \\\n",
"id \n",
"0 False False True False \n",
"1 True False False False \n",
"2 False False True False \n",
"3 True False True False \n",
"4 True False False False \n",
"... ... ... ... ... \n",
"4083 False False True False \n",
"4084 False False False False \n",
"4085 False False True False \n",
"4086 True False True False \n",
"4087 False False False False \n",
"\n",
" work_type_Private work_type_Self-employed work_type_children \\\n",
"id \n",
"0 True False False \n",
"1 True False False \n",
"2 False True False \n",
"3 True False False \n",
"4 False False False \n",
"... ... ... ... \n",
"4083 True False False \n",
"4084 False False True \n",
"4085 True False False \n",
"4086 True False False \n",
"4087 False False True \n",
"\n",
" Residence_type_Urban smoking_status_formerly smoked \\\n",
"id \n",
"0 True False \n",
"1 False False \n",
"2 False False \n",
"3 True False \n",
"4 True False \n",
"... ... ... \n",
"4083 True True \n",
"4084 False False \n",
"4085 False True \n",
"4086 True False \n",
"4087 True False \n",
"\n",
" smoking_status_never smoked smoking_status_smokes age_bins bmi_bins \\\n",
"id \n",
"0 True False средний ожирение \n",
"1 True False ребенок норма \n",
"2 True False пожилой норма \n",
"3 False True средний норма \n",
"4 False False ребенок ожирение \n",
"... ... ... ... ... \n",
"4083 False False средний ожирение \n",
"4084 False False ребенок норма \n",
"4085 False False пожилой ожирение \n",
"4086 False False средний избыток \n",
"4087 False False ребенок норма \n",
"\n",
" age_glucose_index bmi_glucose_ratio \n",
"id \n",
"0 3322.08 0.478255 \n",
"1 1833.75 0.171779 \n",
"2 7398.14 0.225503 \n",
"3 2878.04 0.379147 \n",
"4 1152.76 0.383775 \n",
"... ... ... \n",
"4083 4178.70 0.377988 \n",
"4084 1815.52 0.171852 \n",
"4085 4797.65 0.460267 \n",
"4086 2295.55 0.351114 \n",
"4087 546.30 0.242724 \n",
"\n",
"[4088 rows x 20 columns]"
]
},
"execution_count": 145,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import featuretools as ft\n",
"\n",
"print(\"Columns in data:\", X_encoded.columns.tolist())\n",
"print(X_encoded.isnull().sum())\n",
"\n",
"# Create the EntitySet (the core Featuretools structure)\n",
"entity = ft.EntitySet(id=\"stroke_prediction\")\n",
"\n",
"# X_encoded has no \"id\" column, so Featuretools creates a new integer index (see the warning in the output)\n",
"entity = entity.add_dataframe(\n",
"    dataframe_name=\"data\", \n",
"    dataframe=X_encoded, \n",
"    index=\"id\",\n",
")\n",
"\n",
"# Generate new features\n",
"feature_matrix, feature_defs = ft.dfs(\n",
"    entityset=entity,\n",
"    target_dataframe_name=\"data\",  # Main table\n",
"    max_depth=2  # Nesting depth; Featuretools lowers it to 1 because there is only one dataframe\n",
")\n",
"\n",
"print(\"Generated features:\")\n",
"print(feature_matrix.head())\n",
"\n",
"# Save the results\n",
"feature_matrix.to_csv(\"./csv/generated_features.csv\", index=False)\n",
"feature_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">It is time to evaluate the model's quality</p>"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model training time: 0.33 seconds\n",
"ROC-AUC: 0.84\n",
"F1-Score: 0.00\n",
"Confusion matrix:\n",
"[[486 0]\n",
" [ 25 0]]\n",
"Classification report:\n",
" precision recall f1-score support\n",
"\n",
" 0 0.95 1.00 0.97 486\n",
" 1 0.00 0.00 0.00 25\n",
"\n",
" accuracy 0.95 511\n",
" macro avg 0.48 0.50 0.49 511\n",
"weighted avg 0.90 0.95 0.93 511\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n",
"d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n",
"d:\\code\\mai\\labs\\AIM-PIbd-31-Bakalskaya-E-D\\lab_3\\venv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAksAAAJwCAYAAACZACVsAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABU8klEQVR4nO3deVxUZf//8feAbIKAqEDupuZuLmlSbpVludyalpqVoJaZS7mG5JZZYWa5ZFZmoVbm0mbaouaaiksqaWpmLpkpaC4gJoswvz/8Oj8ncIZRmDMwr2ePedzMdc6Z85nxJj+9z3WuMZnNZrMAAACQKw+jCwAAAHBlNEsAAAA20CwBAADYQLMEAABgA80SAACADTRLAAAANtAsAQAA2ECzBAAAYEMxowsAAKNkZGTo7Nmzys7OVtmyZY0uB4CLIlkC4FZ+/vln9ezZU6VLl5aPj49uueUWde3a1eiyALgwmiXgP+bOnSuTySSTyaSNGzfm2G42m1WhQgWZTCZ16NDBgApxo5YuXarmzZtr3759evXVV7Vq1SqtWrVK77//vtGlAXBhXIYDrsPX11cLFixQ8+bNrcbXr1+v48ePy8fHx6DKcCPOnj2rp556Sm3bttWSJUvk7e1tdEkACgmSJeA62rVrpyVLlujy5ctW4wsWLFDjxo0VHh5uUGW4EXFxcUpLS9PcuXNplAA4hGYJuI7HHntMZ86c0apVqyxjGRkZ+vzzz9WzZ89cj5kyZYruuusulSpVSn5+fmrcuLE+//xzq32uXuK73qN169aSpHXr1slkMmnRokV68cUXFR4eLn9/f/3vf//TX3/9ZfWarVu3thx31fbt2y2v+d/zDxo0KEftHTp0UOXKla3Gdu/eraioKN16663y9fVVeHi4+vTpozNnztj66CxOnTqlvn37KiwsTL6+vrr99ts1b948q32OHj0qk8mkKVOmWI3XrVs3x3saM2aMTCaTUlNTrd7PSy+9ZLXfG2+8YfVZStKWLVvUoEEDvfbaa6pQoYJ8fHxUvXp1TZo0SdnZ2VbHX758WRMnTlTVqlXl4+OjypUr68UXX1R6errVfpUrV1ZUVJTVWL9+/eTr66t169bZ/4AAFApchgOuo3LlyoqIiNBnn32mhx56SJL0/fffKzk5WT169NCMGTNyHDN9+nT973//0+OPP66MjAwtXLhQjz76qJYvX6727dtLkj7++GPL/j/99JNmz56tqVOnqnTp0pKksLAwq9d89dVXZTKZFB0drVOnTmnatGlq06aNEhIS5Ofnd936o6Ojb/ozWLVqlQ4fPqzevXsrPDxce/fu1ezZs7V3715t2bIlRyN2rUuXLql169b6448/NGjQIFWpUkVLlixRVFSUzp8/r+eff/6m68vN+fPnFRsbm2P8zJkz2rhxozZu3Kg+ffqocePGWr16tWJiYnT06FG99957ln2feuopzZs3T4888oiGDx+urVu3KjY2Vvv379dXX3113XOPHz9eH374oRYtWpSj0QNQiJkBWImLizNLMm/fvt08c+ZMc4kSJcz//vuv2Ww2mx999FHzPffcYzabzeZKlSqZ27dvb3Xs1f2uysjIMNetW9d877332jzXkSNHcmxbu3atWZK5XLly5pSUFMv44sWLzZLM06dPt4y1atXK3KpVK8vz7777zizJ/OCDD5r/+2suyTxw4MAc52vfvr25UqVKNt+P2Ww2f/bZZ2ZJ5g0bNuT6nq6aNm2aWZL5k08+sYxlZGSYIyIizAEBAZb3dOTIEbMk8xtvvGF1fJ06dazek9lsNo8ePdosyXzhwgWr9zN+/HjL8xdeeMEcGhpqbty4sdXxrVq1Mksyv/TSS1avGRUVZZZk3rNnj9lsNpsTEhLMksxPPfWU1X4jRowwSzKvWbPGMlapUiVzZGSk2Ww2m99//32zJPPbb79t83MBUPhwGQ6woVu3brp06ZKWL1+uCxcuaPny5de9BCfJKuk5d+
6ckpOT1aJFC+3cufOGa+jVq5dKlChhef7II4/olltu0XfffZfr/mazWTExMeratavuvPPOGz6vZP1+0tLS9M8//6hZs2aSZPc9fffddwoPD9djjz1mGfPy8tJzzz2n1NRUrV+//qZqy83ff/+tt99+W2PHjlVAQECO7Z6enho6dKjV2PDhwyVJ3377raVuSRo2bJjN/a61dOlSDRgwQCNHjsz1EieAwo1mCbChTJkyatOmjRYsWKAvv/xSWVlZeuSRR667//Lly9WsWTP5+voqJCREZcqU0bvvvqvk5OQbrqF69epWz00mk6pVq6ajR4/muv+nn36qvXv36rXXXrvhc1519uxZPf/88woLC5Ofn5/KlCmjKlWqSJLd9/Tnn3+qevXq8vCw/tdMrVq1LNvz2/jx41W2bFk988wzObaZTCaVLVtWgYGBVuM1atSQh4eH5fP8888/5eHhoWrVqlntFx4eruDg4Bx1JyQk6LHHHlNWVpbOnj2bv28IgEtgzhJgR8+ePfX0008rMTFRDz30kIKDg3Pd76efftL//vc/tWzZUrNmzdItt9wiLy8vxcXFacGCBU6pNSMjQ2PHjlXfvn1122233fTrdevWTZs3b9bIkSPVoEEDBQQEKDs7Ww8++GCOSdFG279/v+bOnatPPvlEXl5eObbbmt+VG1vzsa71yy+/6KGHHtJ9992nkSNH6oknnmC+ElDE0CwBdjz88MN65plntGXLFi1atOi6+33xxRfy9fXVihUrrNZgiouLu6nzHzx40Oq52WzWH3/8ofr16+fYd9asWTp16lSOu8NuxLlz57R69WpNmDBB48aNu24911OpUiXt3r1b2dnZVunSb7/9Ztmen2JiYtSgQQN179491+1VqlTRypUrdeHCBavLmr///ruys7MtdwJWqlRJ2dnZOnjwoCUFk6SkpCSdP38+R9316tXTkiVL5OfnpyVLlqhfv37avXu3fH198/X9ATAOl+EAOwICAvTuu+/qpZdeUseOHa+7n6enp0wmk7KysixjR48e1ddff31T558/f74uXLhgef7555/r5MmTljv0rrpw4YJeffVVDR06NF/WgPL09JR0pTm71rRp0/J0fLt27ZSYmGjVYF6+fFlvv/22AgIC1KpVq5uu8ar4+HgtXbpUkyZNum4i1K5dO2VlZWnmzJlW42+99ZYkWe5WbNeunaSc7/O/+13VqFEj+fv7y8PDQ3PmzNHRo0f18ssv3/R7AuA6SJaAPIiMjLS7T/v27fXWW2/pwQcfVM+ePXXq1Cm98847qlatmnbv3n3D5w4JCVHz5s3Vu3dvJSUladq0aapWrZqefvppq/127typ0qVL64UXXrD7mseOHdMPP/xgNXb69GldunRJP/zwg1q1aqXAwEC1bNlSkydPVmZmpsqVK6eVK1fqyJEjeaq7X79+ev/99xUVFaUdO3aocuXK+vzzz7Vp0yZNmzbNKt2RpAMHDljVlJqaKg8PD6uxw4cP53qulStX6v7771ebNm2uW0+7du3Upk0bjR49WkeOHFGDBg20Zs0affHFF+rfv7/q1q0rSbr99tsVGRmp2bNn6/z582rVqpW2bdumefPmqXPnzrrnnnuue466desqOjpakyZNUo8ePXJN/wAUQgbfjQe4nGuXDrAlt6UDPvzwQ3P16tXNPj4+5po1a5rj4uLM48ePz3H7/n/PZWvpgM8++8wcExNjDg0NNfv5+Znbt29v/vPPP632vXpb/NSpU63Gczu3JLuPq/UcP37c/PDDD5uDg4PNQUFB5kcffdR84sSJHLfrX09SUpK5d+/e5tKlS5u9vb3N9erVM8fFxVntc3XpAEce/106wGQymXfs2JHjM/nv0gOpqanmoUOHmsuWLWv28vIyV6tWzTxp0iRzVlaW1X6ZmZnmCRMmmKtUqWL28vIyV6hQwRwTE2NOS0uz2u/apQOuSktLM9esWdPcpEkT8+XLl+1+RgBcn8ls/k/GDsAlrF
u3Tvfcc4+WLFli8w68/HT06FFVqVJFR44cybGaNwC4K+YsAQAA2ECzBMDCz89Pbdu2dfg2ewAoypjgDcAiLCwsx8R
"text/plain": [
"<Figure size 700x700 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA04AAAIjCAYAAAA0vUuxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACkYUlEQVR4nOzddVhU6dsH8O/QDaKAAQoqIAaiYge6uuaaa60B9q4KKHZhrq0Iih3Y3bV2Y3eAWChioxKC1Mx5//DHeR2HGgSH+H6uay6d+5w55z5ngrnnec7zSARBEEBERERERERpUlN1AkRERERERLkdCyciIiIiIqIMsHAiIiIiIiLKAAsnIiIiIiKiDLBwIiIiIiIiygALJyIiIiIiogywcCIiIiIiIsoACyciIiIiIqIMsHAiIiIiIiLKAAsnIsoWX758ga+vr3g/MjISixcvVl1CRERERNmIhRPlSb169YKBgYGq06Dv6OrqYsKECdi0aRNevnyJyZMn48CBA6pOi4iIiChbaKg6AaLM+vjxIzZt2oTz58/j3Llz+Pr1K5o3b44qVaqgc+fOqFKliqpTLNDU1dUxZcoUuLq6QiaTwcjICIcOHVJ1WkRERETZQiIIgqDqJIgysnXrVvTv3x9fvnyBtbU1kpKS8PbtW1SpUgV37txBUlIS3NzcsGLFCmhpaak63QItPDwcL1++hIODA0xMTFSdDhEREVG2YFc9yvUCAwPRo0cPFC1aFIGBgQgNDUWTJk2go6ODa9eu4fXr1/jrr7+wbt06eHl5yT123rx5qFOnDgoXLgxdXV1Uq1YNO3fuVNiHRCLB5MmTxfvJyclo2bIlTE1NERQUJK6T3q1hw4YAgDNnzkAikeDMmTNy+2jVqpXCfho2bCg+LsXz588hkUiwdu1aufjDhw/RsWNHmJqaQkdHB87Ozti/f7/CsURGRsLLywvW1tbQ1taGpaUlXF1dERERkWZ+r1+/hrW1NZydnfHlyxelj2Py5MmQSCQAAEtLS9SuXRsaGhooWrRoqttIzatXr9C3b18UL14c2trasLGxwcCBA5GYmIi1a9dmeP5Tztfdu3fRq1cvlC5dGjo6OihatCj69OmDjx8/KuSb3u3MmTOYNGkSNDU18eHDB4V8BwwYABMTE8THx4ux//77Dy4uLjA0NISRkRGqV6+OzZs3p3vc35+7FF++fEn13DVs2BAVK1ZU2Ma8efMgkUjw/PlzuXh6+Sh7bCmvhx9v1tbWCuuk9h778XgzOvcAcP78eXTq1AklS5aEtrY2rKys4OXlha9fv6a5/RQZvWa+f/0CwK1bt9CiRQsYGRnBwMAAjRs3xuXLlzPcDwDIZDL4+fmhUqVK0NHRgZmZGZo3b47r16+L60gkEri7u2PTpk2wt7eHjo4OqlWrhnPnzslt68WLFxg0aBDs7e2hq6uLwoULo1OnTgrP7Y/Hp6enh0qVKmHVqlVy66XVrXnnzp2pvjevXLmC5s2bw9jYGHp6enBxcUFgYKDcOinPYcpnSorr168rfHb16tVL7jUCAC9fvoSurq7Ca/bHz8OkpCR4e3vDxsYGWlpaKFmyJEaNGpWp5x/49pnZuXNnmJmZQVdXF/b29hg/fny6j0nrdZ5y69Wrl7huynNw7tw5/P333yhcuDCMjIzg6uqKz58/K2x7yZIlqFChArS1tVG8eHEMHjwYkZGRcus0bNgw1f02adJEXCfltfSjP/74Q+Fcx8bGYvjw4bCysoK2tjbs7e0xb948fP+b+cePH9GiRQtYWlpCW1sbxYoVQ/fu3fHixQtxnbT+Lg0ePDjL58XNzQ1FihRBUlKSwrE0bdoU9vb2crGNGzeiWrVq0NXVhampKbp27YqXL1+mev7atWunsM2///4bEolE7jM05bjmzZunsH6K1D6jU56X76/rTVGuXLk0nyPK+9hVj3K9WbNmQSaTYevWrahWrZrC8iJFimD9+vUICgrC8uXLMWnSJJ
ibmwMA/Pz80KZNG3Tv3h2JiYnYunUrOnXqhIMHD6JVq1Zp7rNfv344c+YMjh8/jvLlywMANmzYIC4/f/48VqxYgQULFqBIkSIAAAsLizS3d+7cORw+fDhLxw8ADx48QN26dVGiRAmMGTMG+vr62L59O9q1a4ddu3ahffv2AL594a5fvz6Cg4PRp08fVK1aFREREdi/fz/Cw8PFXL8XFRWFFi1aQFNTE4cPH0732jFljmP+/Pl49+5dptZ9/fo1atSogcjISAwYMADlypXDq1evsHPnTsTFxaFBgwZy53/69OkAIPclqE6dOgCA48eP49mzZ+jduzeKFi2KBw8eYMWKFXjw4AEuX74MiUSCDh06oGzZsuJjvby84ODggAEDBogxBwcHWFpaYurUqdi2bZvcH8HExETs3LkTf/75J3R0dAB8+7LQp08fVKhQAWPHjoWJiQlu3bqFI0eOoFu3bpk6D1k5d2nJKJ+ePXtm+ti+N27cODg4OAAAVqxYgbCwMKXyyuy5B4AdO3YgLi4OAwcOROHChXH16lUsWrQI4eHh2LFjR6b2N3XqVNjY2Ij3v3z5goEDB8qt8+DBA9SvXx9GRkYYNWoUNDU1sXz5cjRs2BBnz55FzZo1091H3759sXbtWrRo0QL9+vVDcnIyzp8/j8uXL8PZ2Vlc7+zZs9i2bRs8PT2hra2NJUuWoHnz5rh69ar4Ze7atWu4ePEiunbtCktLSzx//hxLly5Fw4YNERQUBD09Pbl9p3wGRUdHY82aNejfvz+sra3lvmhn1qlTp9CiRQtUq1YNkyZNgpqaGgICAvDbb7/h/PnzqFGjhtLbTM3EiRPlivK0DB48GCtXrkSbNm0wYsQI3Lp1C3PnzsX9+/dx6NAhhS+z37t79y7q168PTU1NDBgwANbW1nj69CkOHDggfn6kx9PTE9WrV5eL9evXL9V13d3dYWJigsmTJyMkJARLly7FixcvxCIM+Pble8qUKWjSpAkGDhwornft2jUEBgZCU1NT3J6lpSVmzpwpt49ixYplmPOPBEFAmzZtcPr0afTt2xdOTk44evQoRo4ciVevXmHBggUAvr3nDQ0NMWTIEBQuXBhPnz7FokWLcPfuXdy7dy/N7T958gQrV65Mc3lG56Vnz55Yv349jh49ij/++EN83Nu3b3Hq1ClMmjRJjE2fPh3e3t7o3Lkz+vXrhw8fPmDRokVo0KABbt26Jde7QUdHB4cOHcL79+/F7wJfv37Ftm3bUv1MyyodHR0EBARg6NChYuzixYtyBSflQwJRLmdqaiqUKlVKLubm5ibo6+vLxby9vQUAwoEDB8RYXFyc3DqJiYlCxYoVhd9++00uDkCYNGmSIAiCMHbsWEFdXV3Yu3dvmjkFBAQIAITQ0FCFZadPnxYACKdPnxZjNWvWFFq0aCG3H0EQhEaNGgkNGjSQe3xoaKgAQAgICBBjjRs3FipVqiTEx8eLMZlMJtSpU0ewtbUVYxMnThQACLt371bISyaTKeQXHx8vNGzYUDA3NxeePHmS5eOYNGmS8P3Hyfv37wVDQ0Nx3e+3kRpXV1dBTU1NuHbtWpp5f8/FxUVwcXFJdVs/PueCIAhbtmwRAAjnzp1L9TGlSpUS3NzcUl1Wu3ZtoWbNmnKx3bt3yx1XZGSkYGhoKNSsWVP4+vVrhvl/T5lz5+LiIlSoUEFhG3PnzpV7PWY2n8wcW4rjx48LAISzZ8+KMTc3N7n3ZsprZseOHeke8/fSO/epPZczZ84UJBKJ8OLFi3S3m/Ie/fE19eHDB4XXb7t27QQtLS3h6dOnYuz169eCoaGhwvvzR6dOnRIACJ6engrLvj/XAAQAwvXr18XYixcvBB0dHaF9+/ZiLLVjvnTpkgBAWL9+vcLxff8Z9OjRIwGAMGfOHDGW2melIAjCjh075J5nmUwm2NraCs2aNZPLOy4uTrCxsRF+//13MZbymv3w4YPcNq9du6bw2fXja+T+/fuCmpqa+Pr+Pv
/v39d3794VJBKJ0LVrV7l9TJ48WeFzPjUNGjQQDA0NFV4nGb0f03sN6+vry71WU56DatWqCYmJiWJ8zpw5AgBh375
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import time\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, classification_report\n",
"\n",
"# Separate the features from the target variable\n",
"X = data.drop(columns=['id', 'stroke'])  # Features\n",
"y = data['stroke']  # Target variable\n",
"\n",
"# Encode categorical features with one-hot encoding\n",
"categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']\n",
"X = pd.get_dummies(X, columns=categorical_columns, drop_first=True)\n",
"\n",
"# Fill missing values (e.g. with the median for numeric data)\n",
"X.fillna(X.median(), inplace=True)\n",
"\n",
"# Split the data into training, test, and control sets\n",
"# Training set (80%); the remaining 20% is held out\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n",
"\n",
"# Test and control sets (10% of the data each)\n",
"X_test, X_control, y_test, y_control = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)\n",
"\n",
"# Train the model\n",
"model = RandomForestClassifier(random_state=42)\n",
"\n",
"# Start the timer\n",
"start_time = time.time()\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Model training time\n",
"train_time = time.time() - start_time\n",
"\n",
"# Predictions and model evaluation\n",
"y_pred = model.predict(X_test)\n",
"y_pred_proba = model.predict_proba(X_test)[:, 1]  # Positive-class probabilities for ROC-AUC\n",
"\n",
"# Metrics\n",
"roc_auc = roc_auc_score(y_test, y_pred_proba)\n",
"f1 = f1_score(y_test, y_pred)\n",
"conf_matrix = confusion_matrix(y_test, y_pred)\n",
"class_report = classification_report(y_test, y_pred)\n",
"\n",
"# Print the results\n",
"print(f'Model training time: {train_time:.2f} seconds')\n",
"print(f'ROC-AUC: {roc_auc:.2f}')\n",
"print(f'F1-Score: {f1:.2f}')\n",
"print('Confusion matrix:')\n",
"print(conf_matrix)\n",
"print('Classification report:')\n",
"print(class_report)\n",
"\n",
"# Visualize the confusion matrix\n",
"plt.figure(figsize=(7, 7))\n",
"sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['No stroke', 'Stroke'], yticklabels=['No stroke', 'Stroke'])\n",
"plt.title('Confusion matrix')\n",
"plt.xlabel('Predicted class')\n",
"plt.ylabel('True class')\n",
"plt.show()\n",
"\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5, color='blue', label='Model predictions')\n",
"plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Perfect agreement')\n",
"plt.xlabel('Actual stroke status')\n",
"plt.ylabel('Predicted stroke status')\n",
"plt.title('Actual vs. predicted stroke status')\n",
"plt.legend()\n",
"plt.show()\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"margin: 30px;\">In short, for now my model can predict the ABSENCE of a stroke with high accuracy, but it cannot predict its presence at all. So far the goals are not met and the tasks are not solved.</p>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}