AIM-PIbd-32-Kozlova-A-A/lab_2/laba2.ipynb

2024-10-19 00:33:26 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **1. Car Price Prediction Challenge**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Problem domain**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset describes the car market. It can be used to analyze price trends, to identify which vehicle characteristics affect price, and to study buyer preferences for models, manufacturers, and other attributes.\n"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ID Price Levy Manufacturer Model Prod. year Category \\\n",
"0 45654403 13328 1399 LEXUS RX 450 2010 Jeep \n",
"1 44731507 16621 1018 CHEVROLET Equinox 2011 Jeep \n",
"2 45774419 8467 - HONDA FIT 2006 Hatchback \n",
"3 45769185 3607 862 FORD Escape 2011 Jeep \n",
"4 45809263 11726 446 HONDA FIT 2014 Hatchback \n",
"... ... ... ... ... ... ... ... \n",
"19232 45798355 8467 - MERCEDES-BENZ CLK 200 1999 Coupe \n",
"19233 45778856 15681 831 HYUNDAI Sonata 2011 Sedan \n",
"19234 45804997 26108 836 HYUNDAI Tucson 2010 Jeep \n",
"19235 45793526 5331 1288 CHEVROLET Captiva 2007 Jeep \n",
"19236 45813273 470 753 HYUNDAI Sonata 2012 Sedan \n",
"\n",
" Leather interior Fuel type Engine volume Mileage Cylinders \\\n",
"0 Yes Hybrid 3.5 186005 km 6.0 \n",
"1 No Petrol 3 192000 km 6.0 \n",
"2 No Petrol 1.3 200000 km 4.0 \n",
"3 Yes Hybrid 2.5 168966 km 4.0 \n",
"4 Yes Petrol 1.3 91901 km 4.0 \n",
"... ... ... ... ... ... \n",
"19232 Yes CNG 2.0 Turbo 300000 km 4.0 \n",
"19233 Yes Petrol 2.4 161600 km 4.0 \n",
"19234 Yes Diesel 2 116365 km 4.0 \n",
"19235 Yes Diesel 2 51258 km 4.0 \n",
"19236 Yes Hybrid 2.4 186923 km 4.0 \n",
"\n",
" Gear box type Drive wheels Doors Wheel Color Airbags \n",
"0 Automatic 4x4 04-May Left wheel Silver 12 \n",
"1 Tiptronic 4x4 04-May Left wheel Black 8 \n",
"2 Variator Front 04-May Right-hand drive Black 2 \n",
"3 Automatic 4x4 04-May Left wheel White 0 \n",
"4 Automatic Front 04-May Left wheel Silver 4 \n",
"... ... ... ... ... ... ... \n",
"19232 Manual Rear 02-Mar Left wheel Silver 5 \n",
"19233 Tiptronic Front 04-May Left wheel Red 8 \n",
"19234 Automatic Front 04-May Left wheel Grey 4 \n",
"19235 Automatic Front 04-May Left wheel Black 4 \n",
"19236 Automatic Front 04-May Left wheel White 12 \n",
"\n",
"[19237 rows x 18 columns] \n",
"\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Load the dataset.\n",
"df = pd.read_csv(\"./static/csv/car_price_prediction.csv\")\n",
"\n",
"# Print a preview of the rows; call df.info() instead for the dtype and non-null summary.\n",
"print(df, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Objects of observation**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The objects of observation are the cars listed in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Object attributes**"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ID\n",
"Price\n",
"Levy\n",
"Manufacturer\n",
"Model\n",
"Prod. year\n",
"Category\n",
"Leather interior\n",
"Fuel type\n",
"Engine volume\n",
"Mileage\n",
"Cylinders\n",
"Gear box type\n",
"Drive wheels\n",
"Doors\n",
"Wheel\n",
"Color\n",
"Airbags\n",
"price_log\n",
"price_category\n"
]
}
],
"source": [
"# List every column (attribute) of the dataset.\n",
"for attribute in df.columns:\n",
"    print(attribute)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Relationships between objects**"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Make sure both columns are numeric; unparseable values become NaN and are dropped.\n",
"df['Price'] = pd.to_numeric(df['Price'], errors='coerce')\n",
"df['Prod. year'] = pd.to_numeric(df['Prod. year'], errors='coerce')\n",
"df.dropna(subset=['Price', 'Prod. year'], inplace=True)\n",
"\n",
"# Average price for each production year.\n",
"avg_price_per_year = df.groupby('Prod. year')['Price'].mean().reset_index()\n",
"\n",
"plt.figure(figsize=(12, 6))\n",
"plt.plot(avg_price_per_year['Prod. year'], avg_price_per_year['Price'], marker='o')\n",
"plt.title('Average car price by production year', fontsize=14)\n",
"plt.xlabel('Production year')\n",
"plt.ylabel('Average price ($)')\n",
"plt.grid(True)\n",
"\n",
"# Label every fourth year to keep the x-axis readable.\n",
"plt.xticks(avg_price_per_year['Prod. year'][::4])\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Conclusion:** attributes such as price, production year, mileage, fuel type, and engine volume are likely to be correlated. For example, newer cars with lower mileage may command higher prices.\n",
"Vehicle category and manufacturer may also be related to price and to a model's popularity among buyers. The plot shows how the average price depends on the production year: in general, the newer the car, the higher its price, with some exceptions."
]
},
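{
"cell_type": "markdown",
"metadata": {},
"source": [
"The relationships described above can be checked numerically with a correlation matrix. The cell below is only a sketch under stated assumptions: `toy_cars` is a small made-up frame that reuses the dataset's column names, not the real data, and `Mileage` is assumed to be already numeric. Pearson correlation captures only linear relationships, so a weak value does not rule out a dependency."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with the dataset's column names (the values are made up).\n",
"toy_cars = pd.DataFrame({\n",
"    'Price': [5000, 8000, 12000, 15000, 21000],\n",
"    'Prod. year': [2004, 2008, 2012, 2015, 2019],\n",
"    'Mileage': [250000, 190000, 140000, 90000, 30000],\n",
"})\n",
"\n",
"# Pearson correlation between every pair of numeric columns.\n",
"toy_corr = toy_cars.corr()\n",
"print(toy_corr.round(2))"
]
},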
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Business goals**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pricing**\n",
"\n",
"Build models that predict car prices from vehicle characteristics. Business impact: a better-tuned sales strategy and higher profit.\n",
"\n",
"**Demand analysis**\n",
"\n",
"Identify popular car models and types. Business impact: more accurate inventory management and higher customer satisfaction.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Technical project goals**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**For pricing:**\n",
"\n",
"- Goal: develop a model that predicts car prices.\n",
"- Input data: vehicle characteristics (production year, fuel type, mileage, and others).\n",
"- Target: the car's price.\n",
"\n",
"**For demand analysis:**\n",
"\n",
"- Goal: analyze demand for different car models.\n",
"- Input data: sales data, vehicle characteristics, and models.\n",
"- Target: how often specific models and types are sold."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Data problems**\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing values per column:\n",
" ID 0\n",
"Price 0\n",
"Levy 5819\n",
"Manufacturer 0\n",
"Model 0\n",
"Prod. year 0\n",
"Category 0\n",
"Leather interior 0\n",
"Fuel type 0\n",
"Engine volume 0\n",
"Mileage 0\n",
"Cylinders 0\n",
"Gear box type 0\n",
"Drive wheels 0\n",
"Doors 0\n",
"Wheel 0\n",
"Color 0\n",
"Airbags 0\n",
"dtype: int64\n",
"\n",
"Outliers in numeric columns:\n",
" ID 2531\n",
"Price 627\n",
"Prod. year 829\n",
"Cylinders 0\n",
"Airbags 0\n",
"dtype: int64\n",
"\n",
"Number of cars per manufacturer:\n",
" Manufacturer\n",
"HYUNDAI 3769\n",
"TOYOTA 3662\n",
"MERCEDES-BENZ 2076\n",
"FORD 1111\n",
"CHEVROLET 1069\n",
" ... \n",
"LAMBORGHINI 1\n",
"PONTIAC 1\n",
"SATURN 1\n",
"ASTON MARTIN 1\n",
"GREATWALL 1\n",
"Name: count, Length: 65, dtype: int64\n",
"\n",
"Latest production year in the data: 2020.0\n",
"\n",
"Earliest production year in the data: 2000.0\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"# 1. Missing values: \"-\" marks a missing value, so replace it with NaN in the numeric-like columns.\n",
"numeric_columns = ['Price', 'Levy', 'Engine volume', 'Mileage', 'Cylinders', 'Airbags']\n",
"df[numeric_columns] = df[numeric_columns].replace(\"-\", np.nan)\n",
"\n",
"missing_values = df.isnull().sum()\n",
"print(\"Missing values per column:\\n\", missing_values)\n",
"\n",
"# 2. Outliers: only columns that pandas stores as numbers are checked.\n",
"# Text columns such as Levy, Mileage and Engine volume are skipped; ID is included but is only an identifier.\n",
"numerical_data = df.select_dtypes(include=[np.number])\n",
"\n",
"# A value counts as an outlier if it lies more than 1.5 * IQR outside the quartiles.\n",
"Q1 = numerical_data.quantile(0.25)\n",
"Q3 = numerical_data.quantile(0.75)\n",
"IQR = Q3 - Q1\n",
"outliers = ((numerical_data < (Q1 - 1.5 * IQR)) | (numerical_data > (Q3 + 1.5 * IQR))).sum()\n",
"print(\"\\nOutliers in numeric columns:\\n\", outliers)\n",
"\n",
"# 3. Data skew: how many cars each manufacturer contributes.\n",
"category_counts = df['Manufacturer'].value_counts()\n",
"print(\"\\nNumber of cars per manufacturer:\\n\", category_counts)\n",
"\n",
"# 4. Data freshness: range of production years.\n",
"max_production_year = df['Prod. year'].max()\n",
"min_year_from_car = df['Prod. year'].min()\n",
"\n",
"print(\"\\nLatest production year in the data:\", max_production_year)\n",
"print(\"\\nEarliest production year in the data:\", min_year_from_car)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset has missing values in the Levy column (5,819 rows). By the 1.5 * IQR rule, outliers occur in Price (627 rows) and Prod. year (829 rows); Cylinders and Airbags show none, and the 2,531 values flagged in ID can be ignored because ID is only an identifier. The data is heavily skewed toward a few manufacturers: Hyundai, Toyota, and Mercedes-Benz together account for roughly half of the records.\n",
"At the same time, brands such as Lamborghini, Pontiac, Saturn, and Aston Martin are each represented by a single record. The latest production year in the dataset is 2020, so the data is several years out of date: models released after 2020 are not covered, which makes the dataset less relevant for analyzing the current car market."
]
},
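{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several columns that look numeric are stored as text: `Levy` uses \"-\" for missing values, `Mileage` carries a \" km\" suffix, and `Engine volume` sometimes ends in \" Turbo\". `select_dtypes(include=[np.number])` skips such columns, which is why they are absent from the outlier check above. The cell below is a minimal sketch of one way to convert them, shown on a few made-up rows; `toy_raw`, `toy_clean`, and the extra `Turbo` flag column are names assumed for illustration and are not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical rows that mimic the text formats seen in the data preview.\n",
"toy_raw = pd.DataFrame({\n",
"    'Levy': ['1399', '-', '862'],\n",
"    'Mileage': ['186005 km', '200000 km', '91901 km'],\n",
"    'Engine volume': ['3.5', '2.0 Turbo', '1.3'],\n",
"})\n",
"\n",
"toy_clean = toy_raw.copy()\n",
"\n",
"# '-' cannot be parsed as a number, so errors='coerce' turns it into NaN.\n",
"toy_clean['Levy'] = pd.to_numeric(toy_clean['Levy'], errors='coerce')\n",
"\n",
"# Drop the ' km' suffix, then convert to integers.\n",
"toy_clean['Mileage'] = toy_clean['Mileage'].str.replace(' km', '', regex=False).astype(int)\n",
"\n",
"# Keep the turbo information as a separate flag before stripping it from the volume.\n",
"toy_clean['Turbo'] = toy_clean['Engine volume'].str.contains('Turbo')\n",
"toy_clean['Engine volume'] = (\n",
"    toy_clean['Engine volume'].str.replace(' Turbo', '', regex=False).astype(float)\n",
")\n",
"\n",
"print(toy_clean.dtypes)"
]
},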
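{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Examples of fixing the detected problems**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Levy column has 5,819 missing values. We fill the gaps with 0, which assumes that a missing levy means no levy was charged."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# Levy is stored as text; convert it to numbers (unparseable entries become NaN), then fill the gaps with 0.\n",
"df['Levy'] = pd.to_numeric(df['Levy'], errors='coerce').fillna(0)"
]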
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace the outliers with the median."
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [],
"source": [
"for col in ['Price', 'Prod. year', 'Cylinders']:\n",
"    Q1 = df[col].quantile(0.25)  # first quartile\n",
"    Q3 = df[col].quantile(0.75)  # third quartile\n",
"    IQR = Q3 - Q1  # interquartile range\n",
"    median = df[col].median()  # median of the column\n",
"\n",
"    # Flag values that lie more than 1.5 * IQR outside the quartiles.\n",
"    condition = (df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))\n",
"\n",
"    # Replace the flagged values with the median.\n",
"    df[col] = np.where(condition, median, df[col])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the analysis should cover only cars from certain production years, the data can be filtered as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = df[df['Prod. year'] >= 2015]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Data quality assessment**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each categorical variable, such as Manufacturer, Model, and Category, we can count the number of unique values. This gives an idea of how diverse and informative the data is.\n"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ID 18924\n",
"Price 2315\n",
"Levy 559\n",
"Manufacturer 65\n",
"Model 1590\n",
"Prod. year 54\n",
"Category 11\n",
"Leather interior 2\n",
"Fuel type 7\n",
"Engine volume 107\n",
"Mileage 7687\n",
"Cylinders 13\n",
"Gear box type 4\n",
"Drive wheels 3\n",
"Doors 3\n",
"Wheel 2\n",
"Color 16\n",
"Airbags 17\n",
"price_log 2315\n",
"price_category 5\n",
"dtype: int64\n"
]
}
],
"source": [
"unique_counts = df.nunique()\n",
"print(unique_counts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 65 car brands and 1,590 distinct models, which indicates that the data is diverse."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, check whether the data matches real-world values."
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Box plot of prices for each car category.\n",
"plt.figure(figsize=(10, 5))\n",
"sns.boxplot(x='Category', y='Price', data=df)\n",
"plt.title('Prices by car category')\n",
"plt.xlabel('Category')\n",
"plt.ylabel('Price')\n",
"plt.xticks(rotation=45)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This plot was produced before the outliers were removed."
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# The same box plot, redrawn after the outliers were replaced with the median.\n",
"plt.figure(figsize=(10, 5))\n",
"sns.boxplot(x='Category', y='Price', data=df)\n",
"plt.title('Prices by car category')\n",
"plt.xlabel('Category')\n",
"plt.ylabel('Price')\n",
"plt.xticks(rotation=45)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the outliers are removed the plot looks better, although it still contains very small values that do not correspond to real car prices. From this we can conclude that the prices in the dataset do not match reality particularly well."
]
},
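{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to quantify the remaining problem is to count the rows whose price falls below a plausibility threshold. The cell below is a sketch only: the cutoff of 500 is an assumption chosen for illustration, not a value taken from the dataset's documentation, and `toy_prices` holds a handful of sample values rather than the real column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical sample of prices, including two implausibly low ones.\n",
"toy_prices = pd.DataFrame({'Price': [470, 3607, 8467, 13328, 26108, 120]})\n",
"\n",
"# Assumed cutoff below which a car price is treated as implausible.\n",
"MIN_PLAUSIBLE_PRICE = 500\n",
"\n",
"# Boolean mask of the rows that fail the check.\n",
"too_cheap = toy_prices['Price'] < MIN_PLAUSIBLE_PRICE\n",
"print(f'{too_cheap.sum()} of {len(toy_prices)} rows fall below {MIN_PLAUSIBLE_PRICE}')\n",
"\n",
"# Keep only the rows that pass the check.\n",
"toy_filtered = toy_prices[~too_cheap]"
]
},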
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset covers the main characteristics that can affect car prices, so it can be considered informative."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Splitting the data into samples and data augmentation**"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoEAAAIwCAYAAAD5xbFoAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1gUV9sG8HuX3hGRpqjYezRYgr2g2HvBgr0llqgx1lhjNPYWayLF3ks0iYq9oUYUsUeN2MEKKIIIPN8ffjsvy+4CEo2+796/69pLmTk7c2Z25swzM6eoRERAREREREZF/bEzQERERET/PgaBREREREaIQSARERGREWIQSERERGSEGAQSERERGSEGgURERERGiEEgERERkRFiEEhERERkhBgEEhEAICEhAXfv3sXz588/dlboPXvx4gWioqKQkJDwsbNCRJ8QBoFERmzTpk2oV68e7OzsYGtri/z582PGjBkfO1v/FV6+fIl58+Ypf8fGxmLRokUfL0PpiAiWL1+OL774AtbW1rC3t4eXlxdWr179sbP23pw+fRrm5ua4ffv2x84K0Ufz9OlT2NjY4Pfff8/R91U5HTYuODgYPXr0UP62sLBA/vz50aBBA4wbNw6urq45yhAR/TtGjRqF6dOno0WLFvD394ezszNUKhWKFSsGT0/Pj529T15qaiocHBywbNky1KxZE7Nnz8bVq1exe/fuj501dOzYERs2bEC3bt3QtGlTODg4QKVSoVy5csiTJ8/Hzt57Ub9+fXh4eCAkJORjZ4Xoo/r6669x7NgxhIeHv/N3Tf/pyidPngwvLy8kJSXh2LFjWLJkCX7//XdcvHgR1tbW/3TxRPQBHD58GNOnT8e0adMwatSoj52d/0omJiaYNGkSunbtirS0NNjb2+O333772NnCypUrsWHDBqxevRqdOnX62Nn5ICIiIrBv3z6cOHHiY2eF6KPr378/FixYgAMHDqBu3brv9N1//CTwzz//RMWKFZXp33zzDebMmYO1a9eiY8eOOVk0EX1gzZo1w7Nnz3D8+PGPnZX/evfu3cPdu3dRsmRJODo6fuzsoGzZsihXrhzWrFnzsbPywXz99dfYvn07oqKioFKpPnZ2iD66smXLokKFCli5cuU7fe+91wnURKG3bt0CADx79gzDhw9H2bJlYWtrC3t7ezRq1Ajnz5/X+W5SUhImTpyIYsWKwdLSEu7u7mjdujVu3rwJAMoJb+hTu3ZtZVmHDh2CSqXChg0bMGbMGLi5ucHGxgbNmzfH3bt3ddZ96tQpNGzYEA4ODrC2tkatWrUMXiBr166td/0TJ07USbt69Wp4e3vDysoKTk5O8Pf317v+zLYtvbS0NMybNw+lS5eGpaUlXF1d0a9fP53K/AULFkTTpk111jNw4ECdZerL+8yZM3X2KQC8fv0aEyZMQJEiRWBhYQFPT0+MGDECr1+/1ruv0qtdu7bO8n744Qeo1WqsXbs2R/tj1qxZqFq1KnLnzg0rKyt4e3tj8+bNete/evVqVK5cGdbW1siVKxdq1qyJvXv3aqX5448/UKtWLdjZ2cHe3h6VKlXSydumTZuU39TZ2RldunTB/fv3tdJ0795dK8+5cuVC7dq1cfTo0Sz30z/5LgAcOHAANWrUgI2NDRwdHdGiRQtcuXJFK83JkydRpkwZ+Pv7w8nJCVZWVqhUqRK2b9+upHn58iVsbGzw9ddf66zj3r17MDExwbRp05Q8FyxYUCddxmPr9u3b+Oqrr1C8eHFYWVkhd+7caNeuHaKiorS+pzl/Dx06pEz7888/Ub9+fdjZ2cHGxkbvPgkODoZKpcKZM2eUaU+ePNF7jDdt2lRvnrNTFkycOFE5FvPlywcfHx+YmprCzc1NJ9/6aL6v+djZ2aFy5cpa+x94e86UKVPG4HI050lwcDCAt417Ll68CE9PTzRp0gT29vYG9xUA/P3332jXrh2cnJxgbW2NL774Qudp5ruUpe
9yjr9LmZvR9u3bUbduXZ3yoGDBgpleI9JLSUnB999/j8KFC8PCwgIFCxbEmDFj9JZl2SkX3ncZbkh2zu/sHl/Aux3vV69eRfv27WFvb4/cuXPj66+/RlJSks4yMytrjx49Cl9fXzg7O8PKygoVKlTAkiVLkPF5lOY627JlS53l9+vXDyqVSuvceJf4QCNjWavvWq6vbLt79y6srKygUqm0yq53ve7q+0yZMgUAkJycjPHjx8Pb2xsODg6wsbFBjRo1cPDgQZ3lA2+rR+zcuVNnP2blH78OzkgTsOXOnRvA20Jm+/btaNeuHby8vBATE4Nly5ahVq1auHz5Mjw8PAC8rV/TtGlT7N+/H/7+/vj666/x4sULhIaG4uLFiyhcuLCyjo4dO6Jx48Za6x09erTe/Pzwww9QqVQYOXIkHj16hHnz5sHX1xcRERGwsrIC8PakatSoEby9vTFhwgSo1WoEBQWhbt26OHr0KCpXrqyz3Hz58ikXwJcvX+LLL7/Uu+5x48ahffv26N27Nx4/foyFCxeiZs2aOHfunN6nBn379kWNGjUAAFu3bsW2bdu05vfr1095Cjt48GDcunULP/30E86dO4fjx4/DzMxM7354F7Gxscq2pZeWlobmzZvj2LFj6Nu3L0qWLIkLFy5g7ty5+Ouvv/QWMJkJCgrCd999h9mzZxt8bZXV/pg/fz6aN2+Ozp07Izk5GevXr0e7du2wa9cuNGnSREk3adIkTJw4EVWrVsXkyZNhbm6OU6dO4cCBA2jQoAGAtwFEz549Ubp0aYwePRqOjo44d+4cdu/ereRPs+8rVaqEadOmISYmBvPnz8fx48d1flNnZ2fMnTsXwNugaf78+WjcuDHu3r2b5ROjnH533759aNSoEQoVKoSJEyciMTERCxcuRLVq1XD27FmlMHv69CmWL18OW1tbDB48GHny5MHq1avRunVrrFmzBh07doStrS1atWqFDRs2YM6cOTAxMVHWs27dOogIOnfunOl2ZPTnn3/ixIkT8Pf3R758+RAVFYUlS5agdu3auHz5ssEqJDdu3EDt2rVhbW2Nb7/9FtbW1vj555/h6+uL0NBQ1KxZ853yYUhOygKN2bNnIyYm5p3Wt2rVKgBvA9XFixejXbt2uHjxIooXL56j/D99+hQAMH36dLi5ueHbb7+FpaWl3n0VExODqlWr4tWrVxg8eDBy586NkJAQNG/eHJs3b0arVq20lp2dsjQjQ+f4P9nP9+/fx507d/D555/rnV++fHl88803WtNWrlyJ0NBQrWm9e/dGSEgI2rZti2+++QanTp3CtGnTcOXKFa1yJjvlQnofsgzP7vmtkdXx9a6/Q/v27VGwYEFMmzYNJ0+exIIFC/D8+XOtp09ZlbUnTpyAi4sLvvvuO5iYmODw4cP46quvEBkZiSVLlmitz9LSEr/99hsePXoEFxcXAEBiYiI2bNgAS0tLvfvoXeIDQLusBYCAgACDaTXGjx+vN/h9V/Xr10fXrl21ppUvXx4AEB8fj19++QUdO3ZEnz598OLFC6xYsQJ+fn44ffq0kk7D29sbc+fOxaVLlzK9cdQhORQUFCQAZN++ffL48WO5e/eurF+/XnLnzi1WVlZy7949ERFJSkqS1NRUre/eunVLLCwsZPLkycq0wMBAASBz5szRWVdaWpryPQAyc+ZMnTSlS5eWWrVqKX8fPHhQAEjevHklPj5emb5x40YBIPPnz1eWXbRoUfHz81PWIyLy6tUr8fLykvr16+usq2rVqlKmTBnl78ePHwsAmTBhgjItKipKTExM5IcfftD67oULF8TU1FRn+vXr1wWAhISEKNMmTJgg6X+io0ePCgBZs2aN1nd3796tM71AgQLSpEkTnbwPGDBAMv7sGfM+YsQIcXFxEW9vb619umrVKlGr1XL06FGt7y9dulQAyPHjx3XWl16tWrWU5f32229iamoq33zzjd602dkfIm9/p/SSk5OlTJkyUrduXa1lqd
VqadWqlc6xqPnNY2Njxc7OTqpUqSKJiYl60yQnJ4uLi4uUKVNGK82uXbsEgIwfP16Z1q1bNylQoIDWcpYvXy4A5PT
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Load the data\n",
"new1 = pd.read_csv(\".//static//csv//car_price_prediction.csv\")\n",
"\n",
"# Split into a training set (70%) and a temporary set (30%)\n",
"train_data, temp_data = train_test_split(new1, test_size=0.3, random_state=42)\n",
"\n",
"# Split the temporary set evenly into validation and test sets\n",
"val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)\n",
"\n",
"# Define the target variable and the features\n",
"X = train_data.drop(columns=['Manufacturer'])  # Features\n",
"y = train_data['Manufacturer']  # Target variable\n",
"\n",
"# Helper that plots the class distribution as a bar chart\n",
"def plot_class_distribution(data, title):\n",
"    data['Manufacturer'].value_counts().plot(kind='bar', title=title)\n",
"    plt.xlabel('Manufacturer')\n",
"    plt.ylabel('Count')\n",
"    plt.show()\n",
"\n",
"# Plot the class distribution before oversampling\n",
"plot_class_distribution(train_data, 'Class distribution in the training set (before oversampling)')\n",
"\n",
"# Apply oversampling\n",
"ros = RandomOverSampler(random_state=42)\n",
"X_resampled, y_resampled = ros.fit_resample(X, y)\n",
"\n",
"# Build a new DataFrame from the oversampled data\n",
"train_resampled_over = pd.DataFrame(X_resampled, columns=X.columns)\n",
"train_resampled_over['Manufacturer'] = y_resampled\n",
"\n",
"# Plot the class distribution after oversampling\n",
"plot_class_distribution(train_resampled_over, 'Class distribution in the training set (after oversampling)')\n",
"\n",
"# Apply undersampling\n",
"rus = RandomUnderSampler(random_state=42)\n",
"X_resampled, y_resampled = rus.fit_resample(X, y)\n",
"\n",
"# Build a new DataFrame from the undersampled data\n",
"train_resampled_under = pd.DataFrame(X_resampled, columns=X.columns)\n",
"train_resampled_under['Manufacturer'] = y_resampled\n",
"\n",
"# Plot the class distribution after undersampling\n",
"plot_class_distribution(train_resampled_under, 'Class distribution in the training set (after undersampling)')"
]
},
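{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plots show that the training set was initially poorly balanced across manufacturers, and that resampling changes the picture radically: with default settings, random oversampling duplicates rows until every manufacturer matches the largest class, while random undersampling drops rows until every manufacturer matches the smallest one."
]
},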
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **2. Forbes 2022 Billionaires data**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Problem domain**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The problem domain of this dataset is the analysis of the net worth, age, and sources of wealth of the world's richest people. It can help show which factors influence the accumulation of wealth and how those factors vary by region and industry."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<bound method DataFrame.info of Rank Name Networth Age Country \\\n",
"0 1 Elon Musk 219.0 50 United States \n",
"1 2 Jeff Bezos 171.0 58 United States \n",
"2 3 Bernard Arnault & family 158.0 73 France \n",
"3 4 Bill Gates 129.0 66 United States \n",
"4 5 Warren Buffett 118.0 91 United States \n",
"... ... ... ... ... ... \n",
"2595 2578 Jorge Gallardo Ballart 1.0 80 Spain \n",
"2596 2578 Nari Genomal 1.0 82 Philippines \n",
"2597 2578 Ramesh Genomal 1.0 71 Philippines \n",
"2598 2578 Sunder Genomal 1.0 68 Philippines \n",
"2599 2578 Horst-Otto Gerberding 1.0 69 Germany \n",
"\n",
" Source Industry \n",
"0 Tesla, SpaceX Automotive \n",
"1 Amazon Technology \n",
"2 LVMH Fashion & Retail \n",
"3 Microsoft Technology \n",
"4 Berkshire Hathaway Finance & Investments \n",
"... ... ... \n",
"2595 pharmaceuticals Healthcare \n",
"2596 apparel Fashion & Retail \n",
"2597 apparel Fashion & Retail \n",
"2598 garments Fashion & Retail \n",
"2599 flavors and fragrances Food & Beverage \n",
"\n",
"[2600 rows x 7 columns]> \n",
"\n"
]
}
],
"source": [
"import pandas as pd\n",
"df2 = pd.read_csv(\".//static//csv//Forbes Billionaires.csv\")\n",
"print(df2.info, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Objects of observation**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The objects of observation are the world's richest people represented in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Object attributes**"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rank\n",
"Name\n",
"Networth\n",
"Age\n",
"Country\n",
"Source\n",
"Industry\n"
]
}
],
"source": [
"# List the attributes of the Forbes dataset (df2), not the car dataset (df)\n",
"attributes = df2.columns\n",
"for attribute in attributes:\n",
"    print(attribute)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Relationships between objects**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relationships can be identified between age and net worth, between country of residence and source of income, and between industry and level of wealth."
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"\n",
"# Age vs. net worth\n",
"plt.subplot(2, 2, 1)\n",
"sns.scatterplot(data=df2, x='Age', y='Networth')\n",
"plt.title('Relationship between age and net worth')\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Net worth (billions)')\n",
"plt.show()\n",
"\n",
"\n",
"# Country of residence vs. net worth (top 10 countries by number of entries)\n",
"plt.subplot(2, 2, 2)\n",
"top_countries = df2['Country'].value_counts().index[:10]\n",
"sns.boxplot(data=df2[df2['Country'].isin(top_countries)], x='Country', y='Networth')\n",
"plt.title('Relationship between country of residence and net worth')\n",
"plt.xticks(rotation=90)\n",
"plt.xlabel('Country')\n",
"plt.ylabel('Net worth (billions)')\n",
"plt.show()\n",
"\n",
"\n",
"# Source of income vs. net worth (top 10 sources by number of entries)\n",
"plt.subplot(2, 2, 3)\n",
"top_sources = df2['Source'].value_counts().index[:10]\n",
"sns.boxplot(data=df2[df2['Source'].isin(top_sources)], x='Source', y='Networth')\n",
"plt.title('Relationship between source of income and net worth')\n",
"plt.xticks(rotation=90)\n",
"plt.xlabel('Source of income')\n",
"plt.ylabel('Net worth (billions)')\n",
"plt.show()\n",
"\n",
"# Industry vs. net worth (top 10 industries by number of entries)\n",
"plt.subplot(2, 2, 4)\n",
"top_industries = df2['Industry'].value_counts().index[:10]\n",
"sns.boxplot(data=df2[df2['Industry'].isin(top_industries)], x='Industry', y='Networth')\n",
"plt.title('Relationship between industry and net worth')\n",
"plt.xticks(rotation=90)\n",
"plt.xlabel('Industry')\n",
"plt.ylabel('Net worth (billions)')\n",
"plt.show()\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### **Business goals:**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Understanding the factors of success:\n",
"\n",
"Business goal: Investigate which factors (age, country, source of income) correlate with high net worth.\n",
"Business effect: This can help new entrepreneurs and startups learn from the experience of successful people.\n",
"\n",
"Analyzing wealth trends:\n",
"\n",
"Business goal: Study how sources of wealth change over time and how this relates to economic conditions in different countries.\n",
"Business effect: This will help investors and analysts identify which sectors may be the most promising for future investment."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Technical project goals**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Goal 1: Analysis of wealth trends.\n",
"\n",
"Input data: Data on the richest people (age, country, source of wealth).\n",
"Target feature: Presence of a relationship between source of wealth and country.\n",
"\n",
"Goal 2: Study of success factors.\n",
"\n",
"Input data: Data on the richest people (age, net worth, industry).\n",
"Target feature: Identification of the factors that contribute to wealth accumulation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Data problems**\n"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1500x500 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data size before outlier removal: (2600, 7)\n"
]
}
],
"source": [
"fig, axs = plt.subplots(1, 2, figsize=(15, 5))\n",
"\n",
"sns.boxplot(data=df2, x='Networth', ax=axs[0])\n",
"axs[0].set_title(\"Outliers in net worth\")\n",
"\n",
"sns.boxplot(data=df2, x='Age', ax=axs[1])\n",
"axs[1].set_title(\"Outliers in age\")\n",
"\n",
"plt.show()\n",
"print(\"Data size before outlier removal: \", df2.shape)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
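"No rows are removed at this step. Age stays within a plausible range, while net worth has a long right tail: a few very large fortunes lie far above the bulk of the data (the maximum in the training split below is 219 against an upper quartile of 4.3). These are genuine observations rather than recording errors, so they are kept."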
]
},
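{
"cell_type": "markdown",
"metadata": {},
"source": [
"The boxplots above mark points beyond the whiskers, which seaborn places at 1.5 IQR by default. As a rough numeric cross-check, the sketch below counts such points per column. The helper `count_iqr_outliers` is hypothetical (it is not part of the original notebook), and the multiplier 1.5 is the conventional default rather than a tuned value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical helper: count values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)\n",
"def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:\n",
"    q1, q3 = series.quantile(0.25), series.quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    lower, upper = q1 - k * iqr, q3 + k * iqr\n",
"    return int(((series < lower) | (series > upper)).sum())\n",
"\n",
"# df2 is the DataFrame used in the cells above\n",
"for column in ['Networth', 'Age']:\n",
"    print(column, count_iqr_outliers(df2[column]))"
]
},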
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Histogram of the net worth distribution\n",
"plt.figure(figsize=(12, 6))\n",
"sns.histplot(df2['Networth'], bins=10, kde=True)\n",
"plt.title('Histogram of the net worth distribution')\n",
"plt.xlabel('Net worth (billions of dollars)')\n",
"plt.ylabel('Frequency')\n",
"plt.grid(True)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The net worth distribution is strongly right-skewed. Most values sit at the low end of the range, with only a handful of very high ones: the majority of people in the dataset have a comparatively modest net worth, while a few individuals hold extremely large fortunes."
]
},
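{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common way to quantify this kind of right skew is to compare the skewness coefficient before and after a log transform. The sketch below is an illustration only: working with `log1p(Networth)` instead of the raw value is an assumption, not a step the original analysis takes, and the names `raw_skew` and `log_skew` are introduced here for the example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Skewness well above 0 indicates a long right tail\n",
"raw_skew = df2['Networth'].skew()\n",
"\n",
"# log1p compresses large values; Networth is positive here, so the transform is defined\n",
"log_skew = np.log1p(df2['Networth']).skew()\n",
"\n",
"print(f'skewness (raw): {raw_skew:.2f}')\n",
"print(f'skewness (log1p): {log_skew:.2f}')"
]
},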
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Data quality assessment**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To assess coverage, we look at how diverse the data is across countries, industries, and ages."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 1. Bar chart by country\n",
"plt.figure(figsize=(12, 6))\n",
"sns.countplot(data=df2, x='Country', order=df2['Country'].value_counts().index)\n",
"plt.title('Number of people by country')\n",
"plt.xlabel('Country')\n",
"plt.ylabel('Count')\n",
"plt.xticks(rotation=45)\n",
"plt.show()\n",
"\n",
"# 2. Bar chart by industry\n",
"plt.figure(figsize=(12, 6))\n",
"sns.countplot(data=df2, x='Industry', order=df2['Industry'].value_counts().index)\n",
"plt.title('Number of people by industry')\n",
"plt.xlabel('Industry')\n",
"plt.ylabel('Count')\n",
"plt.xticks(rotation=45)\n",
"plt.show()\n",
"\n",
"# 3. Histogram of the age distribution\n",
"plt.figure(figsize=(10, 5))\n",
"sns.histplot(df2['Age'], bins=30, kde=True)\n",
"plt.title('Age distribution')\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Frequency')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The charts show that many countries and industries are represented, so the dataset covers a wide range of regions and sectors.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Handling missing data**"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing values in the data:\n",
" Rank 0\n",
"Name 0\n",
"Networth 0\n",
"Age 0\n",
"Country 0\n",
"Source 0\n",
"Industry 0\n",
"dtype: int64\n"
]
}
],
"source": [
"missing_values = df2.isnull().sum()\n",
"print(\"Missing values in the data:\\n\", missing_values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No missing values were found."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Splitting the dataset into training, validation, and test sets**"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((1560, 6), (520, 6), (520, 6))"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split the dataset into features (X) and the target (y)\n",
"X = df2.drop(columns=['Networth'])\n",
"y = df2['Networth']\n",
"\n",
"# Split into training (60%), validation (20%), and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the size of each split\n",
"(X_train.shape, X_val.shape, X_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Assessing the balance of the splits**"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(count 1560.000000\n",
" mean 5.208173\n",
" std 12.653032\n",
" min 1.000000\n",
" 25% 1.500000\n",
" 50% 2.400000\n",
" 75% 4.300000\n",
" max 219.000000\n",
" Name: Networth, dtype: float64,\n",
" count 520.000000\n",
" mean 4.443654\n",
" std 7.267615\n",
" min 1.000000\n",
" 25% 1.500000\n",
" 50% 2.400000\n",
" 75% 4.825000\n",
" max 91.400000\n",
" Name: Networth, dtype: float64,\n",
" count 520.000000\n",
" mean 4.235577\n",
" std 5.861496\n",
" min 1.000000\n",
" 25% 1.600000\n",
" 50% 2.500000\n",
" 75% 4.500000\n",
" max 60.000000\n",
" Name: Networth, dtype: float64)"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compare the distribution of the target across the splits\n",
"train_dist = y_train.describe()\n",
"val_dist = y_val.describe()\n",
"test_dist = y_test.describe()\n",
"\n",
"train_dist, val_dist, test_dist"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shapes after oversampling: (13910, 10047) (13910,)\n",
"Shapes after undersampling: (13065, 10047) (13065,)\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Note: these resamplers balance class labels, so the target is expected to be categorical\n",
"oversampler = RandomOverSampler(random_state=12)\n",
"X_train_over, y_train_over = oversampler.fit_resample(X_train, y_train)\n",
"\n",
"undersampler = RandomUnderSampler(random_state=12)\n",
"X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)\n",
"\n",
"print(\"Shapes after oversampling:\", X_train_over.shape, y_train_over.shape)\n",
"print(\"Shapes after undersampling:\", X_train_under.shape, y_train_under.shape)"
]
},
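{
"cell_type": "markdown",
"metadata": {},
"source": [
"`RandomOverSampler` and `RandomUnderSampler` balance class labels, so a continuous target such as `Networth` would need to be discretized before they can be applied. A minimal sketch, assuming three hypothetical wealth bands with hand-picked edges; the edges and the names `bins`, `labels`, `y_train_band`, `X_bal`, and `y_bal` are illustrative and do not come from the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"\n",
"# Hypothetical band edges in billions of dollars; adjust them to the task\n",
"bins = [0, 2, 10, np.inf]\n",
"labels = ['low', 'mid', 'high']\n",
"y_train_band = pd.cut(y_train, bins=bins, labels=labels).astype(str)\n",
"\n",
"# Class counts before resampling show how imbalanced the bands are\n",
"print(y_train_band.value_counts())\n",
"\n",
"# Oversample the smaller bands until each matches the largest one\n",
"sampler = RandomOverSampler(random_state=12)\n",
"X_bal, y_bal = sampler.fit_resample(X_train, y_train_band)\n",
"print(pd.Series(y_bal).value_counts())"
]
},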
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **3. Pima Indians Diabetes Database**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### **Problem domain**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem domain: the dataset concerns predicting whether a patient has diabetes from medical measurements such as glucose level, blood pressure, skin fold thickness, insulin level, and other parameters."
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
"0 6 148 72 35 0 33.6 \n",
"1 1 85 66 29 0 26.6 \n",
"2 8 183 64 0 0 23.3 \n",
"3 1 89 66 23 94 28.1 \n",
"4 0 137 40 35 168 43.1 \n",
".. ... ... ... ... ... ... \n",
"763 10 101 76 48 180 32.9 \n",
"764 2 122 70 27 0 36.8 \n",
"765 5 121 72 23 112 26.2 \n",
"766 1 126 60 0 0 30.1 \n",
"767 1 93 70 31 0 30.4 \n",
"\n",
" DiabetesPedigreeFunction Age Outcome \n",
"0 0.627 50 1 \n",
"1 0.351 31 0 \n",
"2 0.672 32 1 \n",
"3 0.167 21 0 \n",
"4 2.288 33 1 \n",
".. ... ... ... \n",
"763 0.171 63 0 \n",
"764 0.340 27 0 \n",
"765 0.245 30 0 \n",
"766 0.349 47 1 \n",
"767 0.315 23 0 \n",
"\n",
"[768 rows x 9 columns] \n",
"\n"
]
}
],
"source": [
"import pandas as pd\n",
"df3 = pd.read_csv(\".//static//csv//diabetes.csv\")\n",
"# Preview the data; pandas truncates the middle rows\n",
"print(df3, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Objects of observation**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The objects of observation are patients."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Object attributes**"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pregnancies\n",
"Glucose\n",
"BloodPressure\n",
"SkinThickness\n",
"Insulin\n",
"BMI\n",
"DiabetesPedigreeFunction\n",
"Age\n",
"Outcome\n"
]
}
],
"source": [
"attributes = df3.columns\n",
"for attribute in attributes:\n",
" print(attribute)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Relationships between objects**"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 800x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Build the correlation matrix\n",
"correlation_matrix = df3[['Glucose', 'Insulin', 'BMI', 'SkinThickness','Age', 'Outcome']].corr()\n",
"\n",
"# Draw a heatmap to visualize the correlations\n",
"plt.figure(figsize=(8, 6))\n",
"sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=\".2f\")\n",
"plt.title('Correlation between attributes and diabetes outcome')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The heat map shows a notable positive correlation between glucose level and the presence of diabetes, which points to the importance of glucose in assessing disease risk. There is also a positive relationship between body mass index (BMI) and diabetes, consistent with excess weight raising the likelihood of developing the disease. Insulin is also correlated with the outcome, but more weakly, so its significance needs a closer analysis."
]
},
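{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the heat map easier to read, the correlations with the target can also be listed in sorted order. The cell below is a minimal sketch, assuming `df3` is the DataFrame used above; the variable name `outcome_correlations` is introduced here only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (assumes df3 from the cells above): correlation of each attribute with Outcome\n",
"outcome_correlations = (\n",
"    df3[['Glucose', 'Insulin', 'BMI', 'SkinThickness', 'Age', 'Outcome']]\n",
"    .corr()['Outcome']          # column of correlations with the target\n",
"    .drop('Outcome')            # drop the trivial self-correlation\n",
"    .sort_values(ascending=False)\n",
")\n",
"print(outcome_correlations)"
]
},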
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### **Business goals:**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Improving early diagnosis of diabetes:\n",
"\n",
"Business effect: helps medical institutions identify patients at risk of diabetes, so treatment can start sooner and the costs of late diagnosis go down.\n",
"\n",
"Developing personalized treatment plans:\n",
"\n",
"Business effect: medical companies can offer patients personalized treatment and diet plans based on their predisposition to diabetes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Technical project goals**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Early diagnosis of diabetes\n",
"\n",
"Input data: patient health parameters such as glucose level, blood pressure, BMI and others.\n",
"\n",
"Target feature: Outcome (presence of diabetes).\n",
"\n",
"Personalized recommendations\n",
"\n",
"Input data: the patient's historical data, age, body mass index, insulin level, etc.\n",
"\n",
"Target feature: the recommendations produced, or a prediction of whether the treatment needs to change."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Data problems**\n"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing values in the data:\n",
" Pregnancies 0\n",
"Glucose 0\n",
"BloodPressure 0\n",
"SkinThickness 0\n",
"Insulin 0\n",
"BMI 0\n",
"DiabetesPedigreeFunction 0\n",
"Age 0\n",
"Outcome 0\n",
"dtype: int64\n"
]
}
],
"source": [
"# Check for missing values again\n",
"missing_values = df3.isnull().sum()\n",
"print(\"Missing values in the data:\\n\", missing_values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are no missing values: no column contains NaN entries."
]
},
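{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `isnull()` only detects NaN. In medical tables with this column layout, a zero in a field such as Glucose, BloodPressure, SkinThickness, Insulin or BMI may be a placeholder for a measurement that was never taken; this is an assumption about the file, not something the cells above confirm. A minimal sketch for counting such zeros, assuming `df3` is the DataFrame used above (the list `zero_suspect_columns` is introduced here for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assumption: a zero in these columns is physiologically implausible\n",
"# and may therefore encode a missing measurement\n",
"zero_suspect_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']\n",
"\n",
"# Count the exact zeros in each suspect column\n",
"zero_counts = (df3[zero_suspect_columns] == 0).sum()\n",
"print(\"Zero values per column:\\n\", zero_counts)"
]
},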
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of outliers in column 'Pregnancies': 4\n",
"Number of outliers in column 'Glucose': 5\n",
"Number of outliers in column 'BloodPressure': 45\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"# Select the numeric columns to plot\n",
"numeric_columns = ['Pregnancies', 'Glucose', 'BloodPressure']\n",
"\n",
"# Select the columns to check for outliers\n",
"columns_to_check = ['Pregnancies', 'Glucose', 'BloodPressure']\n",
"\n",
"# Count outliers per column using the 1.5 * IQR rule\n",
"def count_outliers(df3, columns):\n",
"    outliers_count = {}\n",
"    for col in columns:\n",
"        Q1 = df3[col].quantile(0.25)\n",
"        Q3 = df3[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Count the rows that fall outside the bounds\n",
"        outliers = df3[(df3[col] < lower_bound) | (df3[col] > upper_bound)]\n",
"        outliers_count[col] = len(outliers)\n",
"\n",
"    return outliers_count\n",
"\n",
"# Count the outliers\n",
"outliers_count = count_outliers(df3, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"# Plot a histogram for each column\n",
"plt.figure(figsize=(15, 10))\n",
"for i, col in enumerate(numeric_columns, 1):\n",
"    plt.subplot(2, 3, i)\n",
"    sns.histplot(df3[col], kde=True)\n",
"    plt.title(f'Histogram of {col}')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us remove the outliers in the BloodPressure column."
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of removed rows: 45\n",
"Number of outliers in column 'BloodPressure': 4\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Select the columns to clean\n",
"columns_to_clean = ['BloodPressure']\n",
"\n",
"# Remove rows that fall outside the 1.5 * IQR bounds\n",
"def remove_outliers(df3, columns):\n",
"    for col in columns:\n",
"        Q1 = df3[col].quantile(0.25)\n",
"        Q3 = df3[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Keep only the rows inside the bounds\n",
"        df3 = df3[(df3[col] >= lower_bound) & (df3[col] <= upper_bound)]\n",
"\n",
"    return df3\n",
"\n",
"# Remove the outliers\n",
"df3_cleaned = remove_outliers(df3, columns_to_clean)\n",
"\n",
"# Print the number of removed rows\n",
"print(f\"Number of removed rows: {len(df3) - len(df3_cleaned)}\")\n",
"\n",
"df3 = df3_cleaned\n",
"\n",
"# Select the columns to check\n",
"columns_to_check = ['BloodPressure']\n",
"\n",
"# Count outliers per column using the 1.5 * IQR rule\n",
"def count_outliers(df3, columns):\n",
"    outliers_count = {}\n",
"    for col in columns:\n",
"        Q1 = df3[col].quantile(0.25)\n",
"        Q3 = df3[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Count the rows that fall outside the bounds\n",
"        outliers = df3[(df3[col] < lower_bound) | (df3[col] > upper_bound)]\n",
"        outliers_count[col] = len(outliers)\n",
"\n",
"    return outliers_count\n",
"\n",
"# Count the outliers again; the bounds are recomputed on the cleaned data\n",
"outliers_count = count_outliers(df3, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"# Plot a histogram of the cleaned data\n",
"plt.figure(figsize=(15, 6))\n",
"\n",
"# Histogram for BloodPressure\n",
"sns.histplot(df3_cleaned['BloodPressure'], kde=True)\n",
"plt.title('Histogram of Blood Pressure (Cleaned)')\n",
"plt.xlabel('Blood Pressure')\n",
"plt.ylabel('Frequency')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
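{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of outliers in BloodPressure dropped from 45 to 4. The remaining 4 values are flagged only because the quartiles and the IQR are recomputed on the cleaned data, which shifts the bounds."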
]
},
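{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why a second pass can still flag points may be easier to see on a small example. The cell below is a sketch on made-up numbers (the series `toy` and the helper `iqr_bounds` are hypothetical and not part of the dataset): it computes the 1.5 * IQR bounds, removes the outliers, and recomputes the bounds on what is left."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy values, chosen only to illustrate the effect\n",
"toy = pd.Series([0, 50, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 90, 122])\n",
"\n",
"def iqr_bounds(series):\n",
"    # Return the (lower, upper) bounds of the 1.5 * IQR rule\n",
"    q1 = series.quantile(0.25)\n",
"    q3 = series.quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    return q1 - 1.5 * iqr, q3 + 1.5 * iqr\n",
"\n",
"# First pass: bounds on the raw values\n",
"low, high = iqr_bounds(toy)\n",
"cleaned = toy[(toy >= low) & (toy <= high)]\n",
"\n",
"# Second pass: bounds recomputed on the cleaned values\n",
"new_low, new_high = iqr_bounds(cleaned)\n",
"flagged_again = ((cleaned < new_low) | (cleaned > new_high)).sum()\n",
"\n",
"print(\"Bounds before cleaning:\", (low, high))\n",
"print(\"Bounds after cleaning:\", (new_low, new_high))\n",
"print(\"Values flagged on the second pass:\", flagged_again)"
]
},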
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Data quality assessment**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All attributes (for example, glucose level, body mass index, insulin level) are important indicators that can substantially affect the presence of diabetes. The dataset contains enough relevant features for analysis. To judge how well it matches real-world data, the descriptive statistics should be checked against reasonable medical ranges."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Splitting the data into sets**"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 433\n",
"Validation set size: 145\n",
"Test set size: 145\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split into features (X) and target variable (y)\n",
"X = df3.drop('Outcome', axis=1) # Features\n",
"y = df3['Outcome'] # Target variable\n",
"\n",
"# Split into the training set and the remainder (validation + test)\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"\n",
"# Split the remainder into validation and test sets\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Print the size of each set\n",
"print(\"Training set size:\", X_train.shape[0])\n",
"print(\"Validation set size:\", X_val.shape[0])\n",
"print(\"Test set size:\", X_test.shape[0])"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Class balance of the training set:\n",
"Outcome\n",
"0 0.658199\n",
"1 0.341801\n",
"Name: proportion, dtype: float64\n",
"\n",
"Class balance of the validation set:\n",
"Outcome\n",
"0 0.655172\n",
"1 0.344828\n",
"Name: proportion, dtype: float64\n",
"\n",
"Class balance of the test set:\n",
"Outcome\n",
"0 0.662069\n",
"1 0.337931\n",
"Name: proportion, dtype: float64\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split into features (X) and target variable (y)\n",
"X = df3.drop('Outcome', axis=1) # Features\n",
"y = df3['Outcome'] # Target variable\n",
"\n",
"# Split into the training set and the remainder (validation + test), stratified by class\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)\n",
"\n",
"# Split the remainder into validation and test sets, stratified by class\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)\n",
"\n",
"# Print the class proportions of each set\n",
"def check_balance(y_train, y_val, y_test):\n",
"    print(\"Class balance of the training set:\")\n",
"    print(y_train.value_counts(normalize=True))\n",
"\n",
"    print(\"\\nClass balance of the validation set:\")\n",
"    print(y_val.value_counts(normalize=True))\n",
"\n",
"    print(\"\\nClass balance of the test set:\")\n",
"    print(y_test.value_counts(normalize=True))\n",
"\n",
"# Check the class balance\n",
"check_balance(y_train, y_val, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training set shows a class imbalance: about 66% of the cases have no diabetes and about 34% have diabetes. This can bias a model toward predicting class 0."
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Class balance of the training set after SMOTE:\n",
"Outcome\n",
"0 0.5\n",
"1 0.5\n",
"Name: proportion, dtype: float64\n",
"Class balance of the training set:\n",
"Outcome\n",
"0 0.5\n",
"1 0.5\n",
"Name: proportion, dtype: float64\n",
"\n",
"Class balance of the validation set:\n",
"Outcome\n",
"0 0.655172\n",
"1 0.344828\n",
"Name: proportion, dtype: float64\n",
"\n",
"Class balance of the test set:\n",
"Outcome\n",
"0 0.662069\n",
"1 0.337931\n",
"Name: proportion, dtype: float64\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from imblearn.over_sampling import SMOTE\n",
"\n",
"# Split into features (X) and target variable (y)\n",
"X = df3.drop('Outcome', axis=1) # Features\n",
"y = df3['Outcome'] # Target variable\n",
"\n",
"# Split into the training set and the remainder (validation + test), stratified by class\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)\n",
"\n",
"# Split the remainder into validation and test sets, stratified by class\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)\n",
"\n",
"# Apply SMOTE to balance the training set only\n",
"smote = SMOTE(random_state=42)\n",
"X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)\n",
"\n",
"# Print the class proportions of each set\n",
"def check_balance(y_train, y_val, y_test):\n",
"    print(\"Class balance of the training set:\")\n",
"    print(y_train.value_counts(normalize=True))\n",
"\n",
"    print(\"\\nClass balance of the validation set:\")\n",
"    print(y_val.value_counts(normalize=True))\n",
"\n",
"    print(\"\\nClass balance of the test set:\")\n",
"    print(y_test.value_counts(normalize=True))\n",
"\n",
"# Check the class balance after SMOTE\n",
"print(\"Class balance of the training set after SMOTE:\")\n",
"print(y_train_resampled.value_counts(normalize=True))\n",
"\n",
"# Check the balance of the resampled training set and of the validation and test sets\n",
"check_balance(y_train_resampled, y_val, y_test)"
]
},
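{
"cell_type": "markdown",
"metadata": {},
"source": [
"After SMOTE the training set is balanced (50% per class); the validation and test sets keep their original class proportions."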
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}