2024-10-19 00:25:57 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Лабораторная 2\n",
"\n",
2024-10-20 00:37:12 +04:00
"Первый датасет: Н а б о р данных для анализа и прогнозирования сердечного приступа (https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)"
2024-10-19 00:25:57 +04:00
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',\n",
" 'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',\n",
" 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime',\n",
" 'Asthma', 'KidneyDisease', 'SkinCancer'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# A relative path with forward slashes works on Windows, Linux, and macOS\n",
"df_heart = pd.read_csv(\"../static/csv/heart_2020_cleaned.csv\")\n",
"print(df_heart.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Столбцы на русском:\n",
"\n",
2024-10-20 00:37:12 +04:00
"HeartDisease - сердечный приступ \\\n",
2024-10-19 00:25:57 +04:00
"BMI - ИМТ \\\n",
"Smoking - курящий ли человек \\\n",
"AlcoholDrinking - выпивающий ли человек\\\n",
"Stroke - был ли инсульт\\\n",
"PhysicalHealth - физическое здоровье\\\n",
"MentalHealth - ментальное здоровье\\\n",
"DiffWalking - проблемы с ходьбой\\\n",
"Sex - пол\\\n",
"AgeCategory - возрастная категория\\\n",
"Race - р а с а \\\n",
"Diabetic - диабетик ли человек\\\n",
"PhysicalActivity - физическая активность\\\n",
"GenHealth - общее здоровье\\\n",
"SleepTime - время сна\\\n",
"Asthma - астматик ли человек\\\n",
"KidneyDisease - нефропатия\\\n",
"SkinCancer - рак кожи"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 319795 entries, 0 to 319794\n",
"Data columns (total 18 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 HeartDisease 319795 non-null object \n",
" 1 BMI 319795 non-null float64\n",
" 2 Smoking 319795 non-null object \n",
" 3 AlcoholDrinking 319795 non-null object \n",
" 4 Stroke 319795 non-null object \n",
" 5 PhysicalHealth 319795 non-null float64\n",
" 6 MentalHealth 319795 non-null float64\n",
" 7 DiffWalking 319795 non-null object \n",
" 8 Sex 319795 non-null object \n",
" 9 AgeCategory 319795 non-null object \n",
" 10 Race 319795 non-null object \n",
" 11 Diabetic 319795 non-null object \n",
" 12 PhysicalActivity 319795 non-null object \n",
" 13 GenHealth 319795 non-null object \n",
" 14 SleepTime 319795 non-null float64\n",
" 15 Asthma 319795 non-null object \n",
" 16 KidneyDisease 319795 non-null object \n",
" 17 SkinCancer 319795 non-null object \n",
"dtypes: float64(4), object(14)\n",
"memory usage: 43.9+ MB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>HeartDisease</th>\n",
" <th>BMI</th>\n",
" <th>Smoking</th>\n",
" <th>AlcoholDrinking</th>\n",
" <th>Stroke</th>\n",
" <th>PhysicalHealth</th>\n",
" <th>MentalHealth</th>\n",
" <th>DiffWalking</th>\n",
" <th>Sex</th>\n",
" <th>AgeCategory</th>\n",
" <th>Race</th>\n",
" <th>Diabetic</th>\n",
" <th>PhysicalActivity</th>\n",
" <th>GenHealth</th>\n",
" <th>SleepTime</th>\n",
" <th>Asthma</th>\n",
" <th>KidneyDisease</th>\n",
" <th>SkinCancer</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>No</td>\n",
" <td>16.60</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>3.0</td>\n",
" <td>30.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>55-59</td>\n",
" <td>White</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>5.0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>No</td>\n",
" <td>20.34</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>80 or older</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>7.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>No</td>\n",
" <td>26.58</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>20.0</td>\n",
" <td>30.0</td>\n",
" <td>No</td>\n",
" <td>Male</td>\n",
" <td>65-69</td>\n",
" <td>White</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Fair</td>\n",
" <td>8.0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>No</td>\n",
" <td>24.21</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>75-79</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Good</td>\n",
" <td>6.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>No</td>\n",
" <td>23.71</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>28.0</td>\n",
" <td>0.0</td>\n",
" <td>Yes</td>\n",
" <td>Female</td>\n",
" <td>40-44</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>8.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth \\\n",
"0 No 16.60 Yes No No 3.0 \n",
"1 No 20.34 No No Yes 0.0 \n",
"2 No 26.58 Yes No No 20.0 \n",
"3 No 24.21 No No No 0.0 \n",
"4 No 23.71 No No No 28.0 \n",
"\n",
" MentalHealth DiffWalking Sex AgeCategory Race Diabetic \\\n",
"0 30.0 No Female 55-59 White Yes \n",
"1 0.0 No Female 80 or older White No \n",
"2 30.0 No Male 65-69 White Yes \n",
"3 0.0 No Female 75-79 White No \n",
"4 0.0 Yes Female 40-44 White No \n",
"\n",
" PhysicalActivity GenHealth SleepTime Asthma KidneyDisease SkinCancer \n",
"0 Yes Very good 5.0 Yes No Yes \n",
"1 Yes Very good 7.0 No No No \n",
"2 Yes Fair 8.0 Yes No No \n",
"3 No Good 6.0 No No Yes \n",
"4 Yes Very good 8.0 No No No "
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_heart.info()\n",
"df_heart.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Объект наблюдения: состояние человека\\\n",
"Атрибуты объектов: сердечная недостаточность, ИМТ, курящий человек или нет, выпивающий человек или нет, был ли инсульт у человека и т.д."
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABIQAAAIjCAYAAAByG8BaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACSK0lEQVR4nOzdd3xT9eLG8SfpnqHQDYWWJXsjsgQHLpwoLjbITxS314F6HThA1HvdoLIURBStqDhBUVSQvfcqhVJGaZvulZzfH0AvpS3Q0va0zef9euUlSc5Jnx5DaZ9+h8UwDEMAAAAAAABwGVazAwAAAAAAAKBqUQgBAAAAAAC4GAohAAAAAAAAF0MhBAAAAAAA4GIohAAAAAAAAFwMhRAAAAAAAICLoRACAAAAAABwMRRCAAAAAAAALoZCCAAAAAAAwMVQCAEAAAAAALgYCiEAAFBtfPnll7JYLCXe2rRpY3Y8l5WRkaHnnntOV111lerWrSuLxaKZM2eaHQsAAJwHd7MDAAAAnO6pp55Sy5YtC++//PLLJqZBUlKSxo8fr4YNG6p9+/b6/fffzY4EAADOE4UQAACodvr166e+ffsW3p86daqSkpLMC+TiIiIilJiYqPDwcK1atUpdu3Y1OxIAADhPTBkDAADVRl5eniTJaj37tygzZ86UxWJRXFxc4WNOp1Pt2rUrNqVpw4YNGj58uBo3bixvb2+Fh4dr5MiROnbsWJHXfP7550ucrubu/r/fofXt21dt2rTR6tWr1aNHD/n4+CgmJkZTpkwp9rk8++yz6ty5s2w2m/z8/NS7d28tXry4yHFxcXGFH2f+/PlFnsvJyVFQUJAsFotef/31YjlDQ0OVn59f5JzPPvus8PVOLdG++eYb9e/fX5GRkfLy8lKTJk304osvyuFwnPVae3l5KTw8/KzHAQCAmoMRQgAAoNo4WQh5eXmV6/xZs2Zp48aNxR5fuHCh9uzZoxEjRig8PFybN2/Whx9+qM2bN+uff/6RxWIpcvzkyZPl7+9feP/0giolJUXXXHONbr31Vt1xxx364osvdM8998jT01MjR46UJKWlpWnq1Km64447NHr0aKWnp2vatGm68sortWLFCnXo0KHIa3p7e2vGjBm68cYbCx+LjY1VTk5OqZ9venq6FixYoJtuuqnwsRkzZsjb27vYeTNnzpS/v78eeeQR+fv767ffftOzzz6rtLQ0vfbaa6V+DAAAUDtRCAEAgGrDbrdLknx8fMp8bm5urp599lldffXV+vHHH4s8d++99+rRRx8t8thFF12kO+64Q3/99Zd69+5d5LlbbrlFwcHBpX6sgwcP6o033tAjjzwiSbr77rvVrVs3jRs3TkOGDJGHh4eCgoIUFxcnT0/PwvNGjx6tFi1a6J133tG0adOKvOZNN92kefPm6fDhwwoLC5MkTZ8+XQMGDNCcOXNKzHHTTTdp+vTphYVQfHy8fv31V91222367LPPihw7Z86cItd1zJgxGjNmjN5//3299NJL5S7hAABAzcSUMQAAUG2cnMIVEhJS5nPfe+89HTt2TM8991yx504tQnJycpSUlKSLLrpIkrRmzZoyfyx3d3fdfffdhfc9PT11991368iRI1q9erUkyc3NrbAMcjqdSk5OVkFBgbp06VLix+zUqZNat26tWbNmSZL27dunxYsXa/jw4aXmGDlypH766ScdOnRIkvTxxx+re/fuat68ebFjT70G6enpSkpKUu/evZWVlaVt27aV+RoAAICajUIIAABUG/v27ZO7u3uZCyG73a5XXnlFjzzySOHomlMlJyfrwQcfVFhYmHx8fBQSEqKYmJjCc8sqMjJSfn5+RR47WcKcuqbRxx9/rHbt2snb21v16tVTSEiIvv/++1I/5ogRIzRjxgxJx6d49ejRQ82aNSs1R4cOHdSmTRt98sknMgxDM2fO1IgRI0o8dvPmzbrppptks9kUGBiokJAQDR48WFL5rg
EAAKjZKIQAAEC1sX37djVu3LjIIs7n4tVXX5XVatVjjz1W4vO33nqrPvroI40ZM0axsbH65Zdf9NNPP0k6PnqnMsyePVvDhw9XkyZNNG3aNP30009auHChLr300lI/5uDBg7Vr1y79888/+vjjj0std041cuRIzZgxQ3/88YcOHTqkW2+9tdgxqamp6tOnj9avX6/x48fru+++08KFC/Xqq69KqrxrAAAAqi/WEAIAANVCbm6u1q1bV2RR5XNx8OBBvfXWW5owYYICAgKK7RyWkpKiX3/9VS+88IKeffbZwsd37txZ7qwHDx5UZmZmkVFCO3bskCRFR0dLkr788ks1btxYsbGxRRatLmlK20n16tXT9ddfXzj97NZbby2yU1hJBg0apMcee0wPPvigbrnlFgUEBBQ75vfff9exY8cUGxuriy++uPDxvXv3ntPnCwAAah9GCAEAgGphzpw5ys3N1WWXXVam81544QWFhYVpzJgxJT7v5uYmSTIMo8jjb775ZrlySlJBQYE++OCDwvt5eXn64IMPFBISos6dO5f6cZcvX65ly5ad8bVHjhypDRs2aODAgUV2OitN3bp1dcMNN2jDhg2FO5ydrqQseXl5ev/998/6+gAAoHZihBAAADBVZmam3nnnHY0fP15ubm4yDEOzZ88ucszhw4eVkZGh2bNnq1+/fkXWCfrll1/06aefFtnN61SBgYG6+OKLNWnSJOXn56t+/fr65Zdfzmt0TGRkpF599VXFxcWpefPm+vzzz7Vu3Tp9+OGH8vDwkCRde+21io2N1U033aT+/ftr7969mjJlilq1aqWMjIxSX/uqq67S0aNHz6kMOmnmzJl67733St0ZrUePHgoKCtKwYcP0wAMPyGKxaNasWcVKsjN59913lZqaqoMHD0qSvvvuOx04cECSdP/998tms53zawEAAPNRCAEAAFMdPXpU48aNK7x/6u5dpxsyZIgWL15cpBDq0KGD7rjjjjN+jDlz5uj+++/Xe++9J8MwdMUVV+jHH39UZGRkuTIHBQXp448/1v3336+PPvpIYWFhevfddzV69OjCY4YPH65Dhw7pgw8+0M8//6xWrVpp9uzZmjdvnn7//fdSX9tisZxxy/uS+Pj4FNlF7HT16tXTggUL9Oijj+qZZ55RUFCQBg8erMsuu0xXXnnlOX2M119/Xfv27Su8Hxsbq9jYWEnH1z6iEAIAoGaxGGX51RAAAEAFi4uLU0xMjBYvXqy+ffue93GVrW/fvkpKStKmTZtMywAAAHC+WEMIAAAAAADAxVAIAQAAU/n7+2vQoEFFpoGdz3EAAAA4O6aMAQAAlAFTxgAAQG1AIQQAAAAAAOBimDIGAAAAAADgYiiEAAAAAAAAXIy72QGqmtPp1MGDBxUQECCLxWJ2HAAAAAAAgAphGIbS09MVGRkpq/XMY4BcrhA6ePCgoqKizI4BAAAAAABQKfbv368GDRqc8RiXK4QCAgIkHb84gYGBJqcBAAAAAACoGGlpaYqKiirsPs7E5Qqhk9PEAgMDKYQAAAAAAECtcy5L5LCoNAAAAAAAgIuhEAIAAAAAAHAxFEIAAAAAAAAuhkIIAAAAAADAxVAIAQAAAAAAuBgKIQAAAAAAABdDIQQAAAAAAOBiKIQAAAAAAABcDIUQAAAAAACAi6EQAgAAAAAAcDEUQgAAAAAAAC6GQggAAAAAAMDFUAgBAAAAAAC4GAohAAAAAAAAF0MhBAAAAAAA4GIohAAAAAAAgMtLtGdr6e4kJdqzzY5SJdzNDgAAAAAAAGCmz1fG68nYjTIMyWqRJgxoq9u6NjQ7VqVihBAAAAAAAHBZifbswjJIkpyG9FTsplo/UohCCAAAAAAAuKxv1h0sLINOchiG4pKyzAlURSiEAAAAAACAS/pl8yG98fP2Yo+7WSyKDvY1IVHVoRACAAAAAAAu59v1B3XPp2uU7zTUtn6grJbjj7tZLHplQBtF2HzMDVjJWFQaAAAAAA
C4lC9W7tcTsRtkGNKAjvU16ZZ2OpqRq7ikLEUH+9b6MkiiEAIAAAAAAC7k46Vxeu7bzZKkO7s11Es3tJHValGEzcc
"text/plain": [
"<Figure size 1400x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Mean sleep time for each age category\n",
"mean_sleep_by_age = df_heart.groupby('AgeCategory')['SleepTime'].mean().reset_index()\n",
"\n",
"plt.figure(figsize=(14, 6))\n",
"\n",
"plt.plot(mean_sleep_by_age['AgeCategory'], mean_sleep_by_age['SleepTime'], marker='.')\n",
"\n",
"plt.title(\"Chart 1\")\n",
"plt.xlabel(\"Age group\")\n",
"plt.ylabel(\"Sleep time\")\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Между атрибутами присутствует связь. Пример, на диаграмме 1 - связь между возрастной группой и временем сна\\\n",
"Примеры бизнес-целей:\\\n",
" 1. Прогнозирование инсульта на основе ИМТ.\\\n",
" 2. Наблюдение за изменением времени сна в зависимости от возраста.\\\n",
"\\\n",
"Эффект для бизнеса: влияние количества сна на здоровье, влияние ИМТ на здоровье, влияние возраста на инсульты\\\n",
"\\\n",
"\\\n",
"Цели технического проекта:\\\n",
" 1. Первая бизнес-цель: вход - ИМТ, целевой признак - инсульт.\\\n",
" 2. Вторая бизнес-цель: вход - возрастная группа, целевой признак - время сна."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Проверка на выбросы"
]
},
{
"cell_type": "code",
"execution_count": 217,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Пустые значения по столбцам:\n",
"HeartDisease 0\n",
"BMI 0\n",
"Smoking 0\n",
"AlcoholDrinking 0\n",
"Stroke 0\n",
"PhysicalHealth 0\n",
"MentalHealth 0\n",
"DiffWalking 0\n",
"Sex 0\n",
"AgeCategory 0\n",
"Race 0\n",
"Diabetic 0\n",
"PhysicalActivity 0\n",
"GenHealth 0\n",
"SleepTime 0\n",
"Asthma 0\n",
"KidneyDisease 0\n",
"SkinCancer 0\n",
"dtype: int64\n",
"\n",
"Number of duplicates: 18078\n",
"\n",
"Statistical summary of the data:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>BMI</th>\n",
" <th>PhysicalHealth</th>\n",
" <th>MentalHealth</th>\n",
" <th>SleepTime</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>319795.000000</td>\n",
" <td>319795.00000</td>\n",
" <td>319795.000000</td>\n",
" <td>319795.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>28.325399</td>\n",
" <td>3.37171</td>\n",
" <td>3.898366</td>\n",
" <td>7.097075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>6.356100</td>\n",
" <td>7.95085</td>\n",
" <td>7.955235</td>\n",
" <td>1.436007</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>12.020000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>24.030000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>6.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>27.340000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>7.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>31.420000</td>\n",
" <td>2.00000</td>\n",
" <td>3.000000</td>\n",
" <td>8.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>94.850000</td>\n",
" <td>30.00000</td>\n",
" <td>30.000000</td>\n",
" <td>24.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" BMI PhysicalHealth MentalHealth SleepTime\n",
"count 319795.000000 319795.00000 319795.000000 319795.000000\n",
"mean 28.325399 3.37171 3.898366 7.097075\n",
"std 6.356100 7.95085 7.955235 1.436007\n",
"min 12.020000 0.00000 0.000000 1.000000\n",
"25% 24.030000 0.00000 0.000000 6.000000\n",
"50% 27.340000 0.00000 0.000000 7.000000\n",
"75% 31.420000 2.00000 3.000000 8.000000\n",
"max 94.850000 30.00000 30.000000 24.000000"
]
},
"execution_count": 217,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count missing values in each column\n",
"null_values = df_heart.isnull().sum()\n",
"print(\"Missing values per column:\")\n",
"print(null_values)\n",
"\n",
"# Count fully duplicated rows\n",
"duplicates = df_heart.duplicated().sum()\n",
"print(f\"\\nNumber of duplicates: {duplicates}\")\n",
"\n",
"print(\"\\nStatistical summary of the data:\")\n",
"df_heart.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output shows that there are no missing values, but there are duplicates. We remove them and check the data for outliers:"
]
},
{
"cell_type": "code",
"execution_count": 236,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Коэффициент асимметрии для столбца 'BMI': 1.3324306428979513\n",
"\n",
"Коэффициент асимметрии для столбца 'PhysicalHealth': 2.6039732622480822\n",
"\n",
"Коэффициент асимметрии для столбца 'MentalHealth': 2.331111549136165\n",
"\n",
2024-10-20 00:37:12 +04:00
"Коэффициент асимметрии для столбца 'SleepTime': 0.6790346208011537\n"
2024-10-19 00:25:57 +04:00
]
}
],
"source": [
"cleaned_df = df_heart.drop_duplicates()\n",
"\n",
"# Skewness of each numeric column (computed on the original df_heart, duplicates included)\n",
"for column in df_heart.select_dtypes(include=[np.number]).columns:\n",
"    skewness = df_heart[column].skew()\n",
"    print(f\"\\nSkewness coefficient for column '{column}': {skewness}\")\n"
]
},
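As a rough illustration of what the skewness coefficient printed above measures, here is a minimal dependency-free sketch of the adjusted Fisher-Pearson sample skewness. The helper name `sample_skewness` and the toy lists are assumptions made for this example. Treating this formula as equivalent to what `Series.skew()` reports is also an assumption, not something the notebook states.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness of a list of numbers."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    mean = sum(values) / n
    # Second and third central moments (biased, divided by n)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    g1 = m3 / m2 ** 1.5
    # Small-sample correction factor
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


# Hypothetical toy data, not taken from the dataset
symmetric = [1, 2, 3, 4, 5]
right_tailed = [0, 0, 0, 0, 2, 30]

print(sample_skewness(symmetric))     # no tail on either side
print(sample_skewness(right_tailed))  # long tail of large values
```

Values near zero suggest a roughly symmetric column, while clearly positive values, such as those printed for PhysicalHealth and MentalHealth, point to a long right tail.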
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The outliers are minor. Next, we clean the data of noise."
]
},
{
"cell_type": "code",
"execution_count": 237,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Выбросы в датасете:\n",
" HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth \\\n",
2024-10-20 00:37:12 +04:00
"32 No 45.35 No No No 30.0 \n",
"57 No 46.52 Yes No No 30.0 \n",
"90 No 44.29 No No No 30.0 \n",
"105 No 58.54 No No No 30.0 \n",
"107 No 45.42 No No No 0.0 \n",
"... ... ... ... ... ... ... \n",
"319636 No 47.55 No No No 0.0 \n",
"319693 No 44.29 No No No 0.0 \n",
"319709 No 51.46 Yes No No 30.0 \n",
"319725 No 53.16 No No No 29.0 \n",
"319794 No 46.56 No No No 0.0 \n",
"\n",
" MentalHealth DiffWalking Sex AgeCategory Race \\\n",
"32 0.0 Yes Male 70-74 White \n",
"57 0.0 No Male 65-69 White \n",
"90 10.0 Yes Female 70-74 White \n",
"105 0.0 Yes Male 65-69 Other \n",
"107 0.0 No Female 45-49 White \n",
"... ... ... ... ... ... \n",
"319636 0.0 No Female 55-59 Hispanic \n",
"319693 0.0 No Female 25-29 Hispanic \n",
"319709 0.0 No Male 55-59 Hispanic \n",
"319725 0.0 Yes Male 25-29 Hispanic \n",
"319794 0.0 No Female 80 or older Hispanic \n",
"\n",
" Diabetic PhysicalActivity GenHealth SleepTime Asthma \\\n",
"32 Yes No Good 8.0 No \n",
"57 Yes No Poor 8.0 Yes \n",
"90 No No Fair 7.0 No \n",
"105 No, borderline diabetes Yes Poor 3.0 Yes \n",
"107 No Yes Very good 7.0 Yes \n",
"... ... ... ... ... ... \n",
"319636 No No Fair 7.0 Yes \n",
"319693 No Yes Very good 7.0 No \n",
"319709 No Yes Good 7.0 Yes \n",
"319725 No, borderline diabetes No Fair 5.0 Yes \n",
"319794 No Yes Good 8.0 No \n",
"\n",
" KidneyDisease SkinCancer \n",
"32 No No \n",
"57 No No \n",
"90 No Yes \n",
"105 No No \n",
"107 No No \n",
"... ... ... \n",
"319636 No No \n",
"319693 No No \n",
"319709 No No \n",
"319725 No No \n",
"319794 No No \n",
"\n",
"[8905 rows x 18 columns]\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10, 6))\n",
2024-10-20 00:37:12 +04:00
"plt.scatter(cleaned_df['BMI'], cleaned_df['PhysicalHealth'])\n",
"plt.xlabel('ИМТ')\n",
"plt.ylabel('Физическое здоровье')\n",
2024-10-19 00:25:57 +04:00
"plt.title('Диаграмма рассеивания перед чисткой')\n",
"plt.show()\n",
"\n",
2024-10-20 00:37:12 +04:00
"Q1 = cleaned_df[\"BMI\"].quantile(0.25)\n",
"Q3 = cleaned_df[\"BMI\"].quantile(0.75)\n",
2024-10-19 00:25:57 +04:00
"\n",
"IQR = Q3 - Q1\n",
"\n",
"threshold = 1.5 * IQR\n",
"lower_bound = Q1 - threshold\n",
"upper_bound = Q3 + threshold\n",
"\n",
2024-10-20 00:37:12 +04:00
"outliers = (cleaned_df[\"BMI\"] < lower_bound) | (cleaned_df[\"BMI\"] > upper_bound)\n",
2024-10-19 00:25:57 +04:00
"\n",
"print(\"Выбросы в датасете:\")\n",
"print(cleaned_df[outliers])\n",
"\n",
2024-10-20 00:37:12 +04:00
"median_score = cleaned_df[\"BMI\"].median()\n",
"cleaned_df.loc[outliers, \"BMI\"] = median_score\n",
2024-10-19 00:25:57 +04:00
"\n",
"plt.figure(figsize=(10, 6))\n",
2024-10-20 00:37:12 +04:00
"plt.scatter(cleaned_df['BMI'], cleaned_df['PhysicalHealth'])\n",
"plt.xlabel('ИМТ')\n",
"plt.ylabel('Физическое здоровье')\n",
2024-10-19 00:25:57 +04:00
"plt.title('Диаграмма рассеивания после чистки')\n",
2024-10-20 00:37:12 +04:00
"plt.show()\n"
2024-10-19 00:25:57 +04:00
]
},
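The IQR rule used in the cell above can be sketched without pandas on a handful of values. This is only an illustration under stated assumptions: `iqr_bounds` and `replace_outliers_with_median` are hypothetical helper names, the BMI list is made up, and the "inclusive" quantile method from the standard library is assumed to be a reasonable stand-in for the default interpolation of `Series.quantile`.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fences located k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def replace_outliers_with_median(values, k=1.5):
    """Replace every value outside the IQR fences with the median."""
    lower, upper = iqr_bounds(values, k)
    # The median is taken before any replacement, as in the notebook cell
    median = statistics.median(values)
    return [median if (v < lower or v > upper) else v for v in values]


# Hypothetical BMI values with one extreme entry
bmi = [22.0, 24.5, 26.0, 27.3, 28.1, 30.2, 31.4, 94.85]

print(iqr_bounds(bmi))
print(replace_outliers_with_median(bmi))
```

Replacing outliers with the median keeps the row count unchanged, at the cost of piling many rows onto a single BMI value, which is visible later as one very frequent value in the distribution printouts.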
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Разбиение набора данных на обучающую, контрольную и тестовую выборки"
]
},
{
"cell_type": "code",
"execution_count": 238,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки: 181029\n",
"Размер контрольной выборки: 60344\n",
"Размер тестовой выборки: 60344\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_df, test_df = train_test_split(cleaned_df, test_size=0.2, random_state=42)\n",
"\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Размер обучающей выборки:\", len(train_df))\n",
"print(\"Размер контрольной выборки:\", len(val_df))\n",
"print(\"Размер тестовой выборки:\", len(test_df))"
]
},
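The two-step split above gives roughly 60/20/20 proportions. The sketch below reproduces the printed sizes with plain arithmetic. It assumes that `train_test_split` rounds a fractional test size up to the next whole row, and `split_sizes` is a hypothetical helper written for this note, not part of scikit-learn.

```python
import math


def split_sizes(n_rows, test_frac=0.2, val_frac_of_rest=0.25):
    """Sizes of the train, validation, and test parts of a two-step split."""
    # Step 1: hold out the test set from all rows
    n_test = math.ceil(test_frac * n_rows)
    n_rest = n_rows - n_test
    # Step 2: hold out the validation set from what remains
    n_val = math.ceil(val_frac_of_rest * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


# Rows left after dropping duplicates: 319795 total minus 18078 duplicates
n_rows = 319795 - 18078

n_train, n_val, n_test = split_sizes(n_rows)
print("train:", n_train, "validation:", n_val, "test:", n_test)
```

Taking 25% of the remaining 80% is what makes the validation set the same size as the test set.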
{
"cell_type": "code",
"execution_count": 239,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Распределение ИМТ в обучающей выборке:\n",
"BMI\n",
2024-10-20 00:37:12 +04:00
"27.41 5906\n",
2024-10-19 00:25:57 +04:00
"26.63 1941\n",
"27.46 1456\n",
"27.44 1416\n",
"27.12 1258\n",
" ... \n",
"17.93 1\n",
"34.10 1\n",
"23.27 1\n",
"13.81 1\n",
"28.58 1\n",
"Name: count, Length: 2305, dtype: int64\n",
"\n",
"BMI distribution in the validation set:\n",
"BMI\n",
"27.41 1972\n",
"26.63 657\n",
"27.46 494\n",
"24.41 474\n",
"27.44 463\n",
" ... \n",
"16.76 1\n",
"19.93 1\n",
"38.99 1\n",
"35.34 1\n",
"32.80 1\n",
"Name: count, Length: 1969, dtype: int64\n",
"\n",
"BMI distribution in the test set:\n",
"BMI\n",
"27.41 1931\n",
"26.63 646\n",
"27.44 506\n",
"27.46 475\n",
"24.41 452\n",
" ... \n",
"34.89 1\n",
"30.75 1\n",
"41.06 1\n",
"39.91 1\n",
"20.27 1\n",
"Name: count, Length: 1988, dtype: int64\n",
"\n"
]
}
],
"source": [
"def check_balance(df, name):\n",
" counts = df['BMI'].value_counts()\n",
" print(f\"Распределение ИМТ в {name}:\")\n",
" print(counts)\n",
" print()\n",
"\n",
"check_balance(train_df, \"обучающей выборке\")\n",
"check_balance(val_df, \"контрольной выборке\")\n",
"check_balance(test_df, \"тестовой выборке\")"
]
},
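Counting distinct BMI values yields over two thousand categories, so class balance is usually easier to judge on a categorical column. The sketch below shows the idea on a Yes/No label. Using `Stroke` (the target named in the technical goals) is an assumption, `class_shares` is a hypothetical helper, and the toy labels are made up for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return the share of each class label, from most to least common."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical toy labels standing in for a Yes/No column such as 'Stroke'
stroke_labels = ["No"] * 96 + ["Yes"] * 4

for label, share in class_shares(stroke_labels).items():
    print(f"{label}: {share:.1%}")
```

Comparing these shares across the training, validation, and test sets shows whether the split preserved the class proportions.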
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oversampling and undersampling"
]
},
{
"cell_type": "code",
"execution_count": 240,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Оверсэмплинг:\n",
"Распределение ИМТ в обучающей выборке:\n",
"BMI\n",
"27.41 5906\n",
"26.63 1941\n",
"27.46 1636\n",
"27.44 1595\n",
"27.12 1258\n",
" ... \n",
"40.92 1\n",
"26.98 1\n",
"31.22 1\n",
"29.59 1\n",
"16.00 1\n",
"Name: count, Length: 2305, dtype: int64\n",
"\n",
"Распределение ИМТ в контрольной выборке:\n",
"BMI\n",
"27.41 1981\n",
"26.63 657\n",
"27.46 494\n",
"24.41 474\n",
"27.44 465\n",
" ... \n",
"43.03 1\n",
"30.64 1\n",
"42.48 1\n",
"31.61 1\n",
"35.34 1\n",
"Name: count, Length: 1969, dtype: int64\n",
"\n",
"Распределение ИМТ в тестовой выборке:\n",
"BMI\n",
"27.41 1931\n",
"26.63 646\n",
"27.44 580\n",
"27.46 533\n",
"24.41 452\n",
" ... \n",
"14.37 1\n",
"17.28 1\n",
"18.04 1\n",
"35.98 1\n",
"17.41 1\n",
"Name: count, Length: 1988, dtype: int64\n",
"\n",
"Андерсэмплинг:\n",
"Распределение ИМТ в обучающей выборке:\n",
"BMI\n",
"27.41 5173\n",
"26.63 1708\n",
"27.46 1456\n",
"27.44 1416\n",
"27.12 1104\n",
" ... \n",
"35.27 1\n",
"42.80 1\n",
"28.58 1\n",
"34.10 1\n",
"38.83 1\n",
"Name: count, Length: 2282, dtype: int64\n",
"\n",
"Распределение ИМТ в контрольной выборке:\n",
"BMI\n",
"27.41 1972\n",
"26.63 654\n",
"27.46 494\n",
"24.41 470\n",
"27.44 463\n",
" ... \n",
"32.31 1\n",
"33.76 1\n",
"38.38 1\n",
"42.11 1\n",
"31.23 1\n",
"Name: count, Length: 1969, dtype: int64\n",
"\n",
"Распределение ИМТ в тестовой выборке:\n",
"BMI\n",
"27.41 1706\n",
"26.63 559\n",
"27.44 506\n",
"27.46 475\n",
"24.41 398\n",
" ... \n",
"28.22 1\n",
"39.93 1\n",
"39.17 1\n",
"28.24 1\n",
"41.90 1\n",
"Name: count, Length: 1966, dtype: int64\n",
"\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"def binning(target, bins):\n",
"    # Quantile-based binning: with bins=2 the values are split at the median\n",
"    return pd.qcut(target, q=bins, labels=False)\n",
"\n",
"# Note: each sample is binned using its own quantiles\n",
"train_df['BMI_binned'] = binning(train_df['BMI'], bins=2)\n",
"val_df['BMI_binned'] = binning(val_df['BMI'], bins=2)\n",
"test_df['BMI_binned'] = binning(test_df['BMI'], bins=2)\n",
"\n",
"def oversample(df, target_column):\n",
"    # Randomly duplicate rows of the smaller classes until all classes are equal in size\n",
"    X = df.drop(target_column, axis=1)\n",
"    y = df[target_column]\n",
"\n",
"    oversampler = RandomOverSampler(random_state=42)\n",
"    x_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([x_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"def undersample(df, target_column):\n",
"    # Randomly drop rows of the larger classes until all classes are equal in size\n",
"    X = df.drop(target_column, axis=1)\n",
"    y = df[target_column]\n",
"\n",
"    undersampler = RandomUnderSampler(random_state=42)\n",
"    x_resampled, y_resampled = undersampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([x_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"train_df_oversampled = oversample(train_df, 'BMI_binned')\n",
"val_df_oversampled = oversample(val_df, 'BMI_binned')\n",
"test_df_oversampled = oversample(test_df, 'BMI_binned')\n",
"\n",
"train_df_undersampled = undersample(train_df, 'BMI_binned')\n",
"val_df_undersampled = undersample(val_df, 'BMI_binned')\n",
"test_df_undersampled = undersample(test_df, 'BMI_binned')\n",
"\n",
"print(\"Оверсэмплинг:\")\n",
"check_balance(train_df_oversampled, \"обучающей выборке\")\n",
"check_balance(val_df_oversampled, \"контрольной выборке\")\n",
"check_balance(test_df_oversampled, \"тестовой выборке\")\n",
"\n",
"print(\"Андерсэмплинг:\")\n",
"check_balance(train_df_undersampled, \"обучающей выборке\")\n",
"check_balance(val_df_undersampled, \"контрольной выборке\")\n",
"check_balance(test_df_undersampled, \"тестовой выборке\")"
]
},
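{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on reading the output above: `check_balance` counts distinct `BMI` values, so it does not show the class sizes of `BMI_binned`, which is the column the samplers equalize. The sketch below is an illustration and not part of the original lab. It implements random oversampling with pandas only and reports class sizes on a toy frame. The names `class_counts` and `random_oversample` and the `toy` data are hypothetical. It also assumes the common practice of resampling only the training sample and leaving the validation and test samples as they are.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def class_counts(df, target_column):\n",
"    # Number of rows per class label, sorted by label\n",
"    return df[target_column].value_counts().sort_index()\n",
"\n",
"\n",
"def random_oversample(df, target_column, random_state=42):\n",
"    # Keep every original row and add randomly drawn duplicates\n",
"    # of each smaller class until it matches the largest class\n",
"    counts = df[target_column].value_counts()\n",
"    largest = counts.max()\n",
"    parts = [df]\n",
"    for label, count in counts.items():\n",
"        if count < largest:\n",
"            extra = df[df[target_column] == label].sample(\n",
"                n=largest - count, replace=True, random_state=random_state\n",
"            )\n",
"            parts.append(extra)\n",
"    return pd.concat(parts, ignore_index=True)\n",
"\n",
"\n",
"# Toy data (hypothetical): six rows of class 0, two rows of class 1\n",
"toy = pd.DataFrame({'BMI': [21.0, 22.5, 23.1, 24.0, 24.8, 25.2, 31.0, 35.4],\n",
"                    'BMI_binned': [0, 0, 0, 0, 0, 0, 1, 1]})\n",
"balanced = random_oversample(toy, 'BMI_binned')\n",
"print(class_counts(toy, 'BMI_binned'))\n",
"print(class_counts(balanced, 'BMI_binned'))\n",
"```\n",
"\n",
"Because `pd.qcut` with two bins splits at the median, the two `BMI_binned` classes are probably close to equal in size already, which would be consistent with the small differences between the counts before and after resampling."
]
},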
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataset: Mobile phone prices (https://www.kaggle.com/datasets/dewangmoghe/mobile-phone-price-prediction)"
]
},
{
"cell_type": "code",
"execution_count": 241,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Unnamed: 0', 'Name', 'Rating', 'Spec_score', 'No_of_sim', 'Ram',\n",
" 'Battery', 'Display', 'Camera', 'External_Memory', 'Android_version',\n",
" 'Price', 'company', 'Inbuilt_memory', 'fast_charging',\n",
" 'Screen_resolution', 'Processor', 'Processor_name'],\n",
" dtype='object')\n"
]
}
],
"source": [
"df_phones = pd.read_csv(\"..\\\\static\\\\csv\\\\mobile-phone-price-prediction.csv\")\n",
"\n",
"print(df_phones.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Column descriptions:\n",
"\n",
"Unnamed: 0 - row identifier \\\n",
"Name - model \\\n",
"Rating - rating \\\n",
"Spec_score - specification score\\\n",
"No_of_sim - available SIM cards\\\n",
"Ram - RAM size\\\n",
"Battery - battery\\\n",
"Display - display size\\\n",
"Camera - camera\\\n",
"External_Memory - external memory\\\n",
"Android_version - Android version\\\n",
"Price - price\\\n",
"company - manufacturer\\\n",
"Inbuilt_memory - built-in memory\\\n",
"fast_charging - fast charging\\\n",
"Screen_resolution - screen resolution\\\n",
"Processor - processor\\\n",
"Processor_name - processor name"
]
},
{
"cell_type": "code",
"execution_count": 242,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1370 entries, 0 to 1369\n",
"Data columns (total 18 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Unnamed: 0 1370 non-null int64 \n",
" 1 Name 1370 non-null object \n",
" 2 Rating 1370 non-null float64\n",
" 3 Spec_score 1370 non-null int64 \n",
" 4 No_of_sim 1370 non-null object \n",
" 5 Ram 1370 non-null object \n",
" 6 Battery 1370 non-null object \n",
" 7 Display 1370 non-null object \n",
" 8 Camera 1370 non-null object \n",
" 9 External_Memory 1370 non-null object \n",
" 10 Android_version 927 non-null object \n",
" 11 Price 1370 non-null object \n",
" 12 company 1370 non-null object \n",
" 13 Inbuilt_memory 1351 non-null object \n",
" 14 fast_charging 1281 non-null object \n",
" 15 Screen_resolution 1368 non-null object \n",
" 16 Processor 1342 non-null object \n",
" 17 Processor_name 1370 non-null object \n",
"dtypes: float64(1), int64(2), object(15)\n",
"memory usage: 192.8+ KB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Name</th>\n",
" <th>Rating</th>\n",
" <th>Spec_score</th>\n",
" <th>No_of_sim</th>\n",
" <th>Ram</th>\n",
" <th>Battery</th>\n",
" <th>Display</th>\n",
" <th>Camera</th>\n",
" <th>External_Memory</th>\n",
" <th>Android_version</th>\n",
" <th>Price</th>\n",
" <th>company</th>\n",
" <th>Inbuilt_memory</th>\n",
" <th>fast_charging</th>\n",
" <th>Screen_resolution</th>\n",
" <th>Processor</th>\n",
" <th>Processor_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Samsung Galaxy F14 5G</td>\n",
" <td>4.65</td>\n",
" <td>68</td>\n",
" <td>Dual Sim, 3G, 4G, 5G, VoLTE,</td>\n",
" <td>4 GB RAM</td>\n",
" <td>6000 mAh Battery</td>\n",
" <td>6.6 inches</td>\n",
" <td>50 MP + 2 MP Dual Rear &amp; 13 MP Front Camera</td>\n",
" <td>Memory Card Supported, upto 1 TB</td>\n",
" <td>13</td>\n",
" <td>9,999</td>\n",
" <td>Samsung</td>\n",
" <td>128 GB inbuilt</td>\n",
" <td>25W Fast Charging</td>\n",
" <td>2408 x 1080 px Display with Water Drop Notch</td>\n",
" <td>Octa Core Processor</td>\n",
" <td>Exynos 1330</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Samsung Galaxy A11</td>\n",
" <td>4.20</td>\n",
" <td>63</td>\n",
" <td>Dual Sim, 3G, 4G, VoLTE,</td>\n",
" <td>2 GB RAM</td>\n",
" <td>4000 mAh Battery</td>\n",
" <td>6.4 inches</td>\n",
" <td>13 MP + 5 MP + 2 MP Triple Rear &amp; 8 MP Fro...</td>\n",
" <td>Memory Card Supported, upto 512 GB</td>\n",
" <td>10</td>\n",
" <td>9,990</td>\n",
" <td>Samsung</td>\n",
" <td>32 GB inbuilt</td>\n",
" <td>15W Fast Charging</td>\n",
" <td>720 x 1560 px Display with Punch Hole</td>\n",
" <td>1.8 GHz Processor</td>\n",
" <td>Octa Core</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Samsung Galaxy A13</td>\n",
" <td>4.30</td>\n",
" <td>75</td>\n",
" <td>Dual Sim, 3G, 4G, VoLTE,</td>\n",
" <td>4 GB RAM</td>\n",
" <td>5000 mAh Battery</td>\n",
" <td>6.6 inches</td>\n",
" <td>50 MP Quad Rear &amp; 8 MP Front Camera</td>\n",
" <td>Memory Card Supported, upto 1 TB</td>\n",
" <td>12</td>\n",
" <td>11,999</td>\n",
" <td>Samsung</td>\n",
" <td>64 GB inbuilt</td>\n",
" <td>25W Fast Charging</td>\n",
" <td>1080 x 2408 px Display with Water Drop Notch</td>\n",
" <td>2 GHz Processor</td>\n",
" <td>Octa Core</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Samsung Galaxy F23</td>\n",
" <td>4.10</td>\n",
" <td>73</td>\n",
" <td>Dual Sim, 3G, 4G, VoLTE,</td>\n",
" <td>4 GB RAM</td>\n",
" <td>6000 mAh Battery</td>\n",
" <td>6.4 inches</td>\n",
" <td>48 MP Quad Rear &amp; 13 MP Front Camera</td>\n",
" <td>Memory Card Supported, upto 1 TB</td>\n",
" <td>12</td>\n",
" <td>11,999</td>\n",
" <td>Samsung</td>\n",
" <td>64 GB inbuilt</td>\n",
" <td>NaN</td>\n",
" <td>720 x 1600 px</td>\n",
" <td>Octa Core</td>\n",
" <td>Helio G88</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>Samsung Galaxy A03s (4GB RAM + 64GB)</td>\n",
" <td>4.10</td>\n",
" <td>69</td>\n",
" <td>Dual Sim, 3G, 4G, VoLTE,</td>\n",
" <td>4 GB RAM</td>\n",
" <td>5000 mAh Battery</td>\n",
" <td>6.5 inches</td>\n",
" <td>13 MP + 2 MP + 2 MP Triple Rear &amp; 5 MP Fro...</td>\n",
" <td>Memory Card Supported, upto 1 TB</td>\n",
" <td>11</td>\n",
" <td>11,999</td>\n",
" <td>Samsung</td>\n",
" <td>64 GB inbuilt</td>\n",
" <td>15W Fast Charging</td>\n",
" <td>720 x 1600 px Display with Water Drop Notch</td>\n",
" <td>Octa Core</td>\n",
" <td>Helio P35</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Name Rating Spec_score \\\n",
"0 0 Samsung Galaxy F14 5G 4.65 68 \n",
"1 1 Samsung Galaxy A11 4.20 63 \n",
"2 2 Samsung Galaxy A13 4.30 75 \n",
"3 3 Samsung Galaxy F23 4.10 73 \n",
"4 4 Samsung Galaxy A03s (4GB RAM + 64GB) 4.10 69 \n",
"\n",
" No_of_sim Ram Battery Display \\\n",
"0 Dual Sim, 3G, 4G, 5G, VoLTE, 4 GB RAM 6000 mAh Battery 6.6 inches \n",
"1 Dual Sim, 3G, 4G, VoLTE, 2 GB RAM 4000 mAh Battery 6.4 inches \n",
"2 Dual Sim, 3G, 4G, VoLTE, 4 GB RAM 5000 mAh Battery 6.6 inches \n",
"3 Dual Sim, 3G, 4G, VoLTE, 4 GB RAM 6000 mAh Battery 6.4 inches \n",
"4 Dual Sim, 3G, 4G, VoLTE, 4 GB RAM 5000 mAh Battery 6.5 inches \n",
"\n",
" Camera \\\n",
"0 50 MP + 2 MP Dual Rear & 13 MP Front Camera \n",
"1 13 MP + 5 MP + 2 MP Triple Rear & 8 MP Fro... \n",
"2 50 MP Quad Rear & 8 MP Front Camera \n",
"3 48 MP Quad Rear & 13 MP Front Camera \n",
"4 13 MP + 2 MP + 2 MP Triple Rear & 5 MP Fro... \n",
"\n",
" External_Memory Android_version Price company \\\n",
"0 Memory Card Supported, upto 1 TB 13 9,999 Samsung \n",
"1 Memory Card Supported, upto 512 GB 10 9,990 Samsung \n",
"2 Memory Card Supported, upto 1 TB 12 11,999 Samsung \n",
"3 Memory Card Supported, upto 1 TB 12 11,999 Samsung \n",
"4 Memory Card Supported, upto 1 TB 11 11,999 Samsung \n",
"\n",
" Inbuilt_memory fast_charging \\\n",
"0 128 GB inbuilt 25W Fast Charging \n",
"1 32 GB inbuilt 15W Fast Charging \n",
"2 64 GB inbuilt 25W Fast Charging \n",
"3 64 GB inbuilt NaN \n",
"4 64 GB inbuilt 15W Fast Charging \n",
"\n",
" Screen_resolution Processor \\\n",
"0 2408 x 1080 px Display with Water Drop Notch Octa Core Processor \n",
"1 720 x 1560 px Display with Punch Hole 1.8 GHz Processor \n",
"2 1080 x 2408 px Display with Water Drop Notch 2 GHz Processor \n",
"3 720 x 1600 px Octa Core \n",
"4 720 x 1600 px Display with Water Drop Notch Octa Core \n",
"\n",
" Processor_name \n",
"0 Exynos 1330 \n",
"1 Octa Core \n",
"2 Octa Core \n",
"3 Helio G88 \n",
"4 Helio P35 "
]
},
"execution_count": 242,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_phones.info()\n",
"df_phones.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unit of observation: a phone\\\n",
"Object attributes: model name, rating, specification score, SIM cards, RAM, and so on."
]
},
{
"cell_type": "code",
"execution_count": 243,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1400x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(14, 6))\n",
"\n",
"\n",
"plt.scatter(df_phones['company'].str.lower(), df_phones['Spec_score'])\n",
"plt.xlabel('Фирма')\n",
"plt.ylabel('Оценка характеристик')\n",
"plt.xticks(rotation=45)\n",
"plt.title('Диаграмма 1')\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The attributes are related. For example, diagram 1 shows the relationship between the manufacturer and the specification score.\\\n",
"Examples of business goals:\\\n",
" 1. Predicting the price from the specification score\\\n",
" 2. Predicting the specification score from the manufacturer and the price\\\n",
"\\\n",
"Business effect: understanding how the manufacturer affects the price and how the specifications affect the rating\\\n",
"\\\n",
"\\\n",
"Technical project goals:\\\n",
" 1. First business goal: the input is the specification score, the target is the price.\\\n",
" 2. Second business goal: the inputs are the manufacturer and the price, the target is the specification score."
]
},
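{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both technical goals rely on `Price`, which `df_phones.info()` reports as an `object` column with values such as `9,999`. Before it can serve as a regression target or a numeric input, it has to be converted to a number. The sketch below is one possible conversion and not part of the original lab. The helper name `parse_price` is hypothetical, and it assumes that the comma is a thousands separator.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def parse_price(prices):\n",
"    # Remove thousands separators and convert to numbers;\n",
"    # values that still cannot be parsed become NaN\n",
"    cleaned = prices.astype(str).str.replace(',', '', regex=False).str.strip()\n",
"    return pd.to_numeric(cleaned, errors='coerce')\n",
"\n",
"\n",
"# Sample values in the format shown by df_phones.head()\n",
"sample = pd.Series(['9,999', '9,990', '11,999'])\n",
"print(parse_price(sample).tolist())\n",
"```\n",
"\n",
"With `errors='coerce'`, unexpected strings become `NaN` instead of raising an error, so they can be counted and handled together with the other missing values."
]
},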
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data quality check: missing values, duplicates, and summary statistics (a first look for outliers)"
]
},
{
"cell_type": "code",
"execution_count": 244,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Пустые значения по столбцам:\n",
"Unnamed: 0 0\n",
"Name 0\n",
"Rating 0\n",
"Spec_score 0\n",
"No_of_sim 0\n",
"Ram 0\n",
"Battery 0\n",
"Display 0\n",
"Camera 0\n",
"External_Memory 0\n",
"Android_version 443\n",
"Price 0\n",
"company 0\n",
"Inbuilt_memory 19\n",
"fast_charging 89\n",
"Screen_resolution 2\n",
"Processor 28\n",
"Processor_name 0\n",
"dtype: int64\n",
"\n",
"Количество дубликатов: 0\n",
"\n",
"Статистический обзор данных:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Rating</th>\n",
" <th>Spec_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>1370.000000</td>\n",
" <td>1370.000000</td>\n",
" <td>1370.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>684.500000</td>\n",
" <td>4.374416</td>\n",
" <td>80.234307</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>395.629246</td>\n",
" <td>0.230176</td>\n",
" <td>8.373922</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>3.750000</td>\n",
" <td>42.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>342.250000</td>\n",
" <td>4.150000</td>\n",
" <td>75.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>684.500000</td>\n",
" <td>4.400000</td>\n",
" <td>82.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1026.750000</td>\n",
" <td>4.550000</td>\n",
" <td>86.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1369.000000</td>\n",
" <td>4.750000</td>\n",
" <td>98.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Rating Spec_score\n",
"count 1370.000000 1370.000000 1370.000000\n",
"mean 684.500000 4.374416 80.234307\n",
"std 395.629246 0.230176 8.373922\n",
"min 0.000000 3.750000 42.000000\n",
"25% 342.250000 4.150000 75.000000\n",
"50% 684.500000 4.400000 82.000000\n",
"75% 1026.750000 4.550000 86.000000\n",
"max 1369.000000 4.750000 98.000000"
]
},
"execution_count": 244,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"null_values = df_phones.isnull().sum()\n",
"print(\"Пустые значения по столбцам:\")\n",
"print(null_values)\n",
"\n",
"duplicates = df_phones.duplicated().sum()\n",
"print(f\"\\nКоличество дубликатов: {duplicates}\")\n",
"\n",
"print(\"\\nСтатистический обзор данных:\")\n",
"df_phones.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data contains missing values but no duplicates. We drop the rows with missing values."
]
},
{
"cell_type": "code",
"execution_count": 245,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"В наборе данных 'Phones' было удалено 553 строк с пустыми значениями.\n"
]
}
],
"source": [
"def drop_missing_values(dataframe, name):\n",
"    # Drop every row that has at least one missing value and report how many were removed\n",
"    before_shape = dataframe.shape\n",
"    cleaned_dataframe = dataframe.dropna()\n",
"    after_shape = cleaned_dataframe.shape\n",
"    print(f\"В наборе данных '{name}' было удалено {before_shape[0] - after_shape[0]} строк с пустыми значениями.\")\n",
"    return cleaned_dataframe\n",
"\n",
"cleaned_df = drop_missing_values(df_phones, \"Phones\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the skewness coefficient"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Коэффициент асимметрии для столбца 'Unnamed: 0': 0.0\n",
"\n",
"Коэффициент асимметрии для столбца 'Rating': -0.06697860128699223\n",
"\n",
"Коэффициент асимметрии для столбца 'Spec_score': -0.7393772365886471\n"
]
}
],
"source": [
"for column in df_phones.select_dtypes(include=[np.number]).columns:\n",
"    skewness = df_phones[column].skew()\n",
"    print(f\"\\nКоэффициент асимметрии для столбца '{column}': {skewness}\")"
]
},
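{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, `Series.skew()` in pandas returns the bias-adjusted sample skewness. The sketch below is an illustration and not part of the original lab. It computes the same quantity by hand and compares it with pandas. The function name `sample_skewness` and the values in `data` are made up for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def sample_skewness(values):\n",
"    # Adjusted Fisher-Pearson coefficient:\n",
"    # n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3),\n",
"    # where s is the sample standard deviation (ddof=1)\n",
"    x = np.asarray(values, dtype=float)\n",
"    n = x.size\n",
"    s = x.std(ddof=1)\n",
"    return n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)\n",
"\n",
"\n",
"# Made-up scores with a long left tail, so the skewness is negative\n",
"data = [42, 70, 75, 80, 82, 84, 86, 88]\n",
"print(sample_skewness(data), pd.Series(data).skew())\n",
"```\n",
"\n",
"A negative value means a longer left tail. The value of about -0.74 printed for `Spec_score` indicates that low scores stretch further from the mean than high scores do."
]
},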
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The skewness values are small in magnitude (the largest is about -0.74, for Spec_score), so the outliers are minor.\n",
"\n",
"Cleaning noise from the data"
]
},
{
"cell_type": "code",
"execution_count": 247,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Выбросы в датасете:\n",
" Unnamed: 0 Name Rating Spec_score \\\n",
"99 99 Vivo Y02 4.35 54 \n",
"214 214 Realme C30s 4.55 58 \n",
"802 802 Vivo Y02 (2GB RAM + 32GB) 4.50 53 \n",
"803 803 Vivo Y02 4.35 54 \n",
"1344 1344 TCL 501 4.25 55 \n",
"\n",
" No_of_sim Ram Battery Display \\\n",
"99 Dual Sim, 3G, 4G, VoLTE, 3 GB RAM 5000 mAh Battery 6.51 inches \n",
"214 Dual Sim, 3G, 4G, VoLTE, 2 GB RAM 5000 mAh Battery 6.5 inches \n",
"802 Dual Sim, 3G, 4G, VoLTE, 2 GB RAM 5000 mAh Battery 6.51 inches \n",
"803 Dual Sim, 3G, 4G, VoLTE, 3 GB RAM 5000 mAh Battery 6.51 inches \n",
"1344 Dual Sim, 3G, 4G, VoLTE, 2 GB RAM 3000 mAh Battery 6 inches \n",
"\n",
" Camera External_Memory \\\n",
"99 8 MP Rear & 5 MP Front Camera Memory Card Supported, upto 1 TB \n",
"214 8 MP Rear & 5 MP Front Camera Memory Card Supported, upto 1 TB \n",
"802 8 MP Rear & 5 MP Front Camera Memory Card Supported, upto 1 TB \n",
"803 8 MP Rear & 5 MP Front Camera Memory Card Supported, upto 1 TB \n",
"1344 5 MP Rear & 2 MP Front Camera Memory Card Supported \n",
"\n",
" Android_version Price company Inbuilt_memory fast_charging \\\n",
"99 12 9,999 Vivo 32 GB inbuilt 10W Fast Charging \n",
"214 12 6,950 Realme 32 GB inbuilt 10W Fast Charging \n",
"802 12 8,999 Vivo 32 GB inbuilt 10W Fast Charging \n",
"803 12 8,489 Vivo 32 GB inbuilt 10W Fast Charging \n",
"1344 14 7,990 TCL 32 GB inbuilt 10W Fast Charging \n",
"\n",
" Screen_resolution Processor \\\n",
"99 720 x 1600 px Display with Water Drop Notch Octa Core Processor \n",
"214 720 x 1600 px Display with Water Drop Notch Octa Core \n",
"802 720 x 1600 px Display with Water Drop Notch Octa Core Processor \n",
"803 720 x 1600 px Display with Water Drop Notch Octa Core Processor \n",
"1344 540 x 1092 px Display Octa Core \n",
"\n",
" Processor_name \n",
"99 Helio \n",
"214 Unisoc SC9863A \n",
"802 Helio \n",
"803 Helio \n",
"1344 Helio G36 \n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(cleaned_df['company'].str.lower(), cleaned_df['Spec_score'])\n",
"plt.xlabel('Фирма')\n",
"plt.ylabel('Оценка характеристик')\n",
"plt.xticks(rotation=45)\n",
"plt.title('Диаграмма рассеивания перед чисткой')\n",
"plt.show()\n",
"\n",
"Q1 = cleaned_df[\"Spec_score\"].quantile(0.25)\n",
"Q3 = cleaned_df[\"Spec_score\"].quantile(0.75)\n",
"\n",
"IQR = Q3 - Q1\n",
"\n",
"threshold = 1.5 * IQR\n",
"lower_bound = Q1 - threshold\n",
"upper_bound = Q3 + threshold\n",
"\n",
"outliers = (cleaned_df[\"Spec_score\"] < lower_bound) | (cleaned_df[\"Spec_score\"] > upper_bound)\n",
"\n",
"print(\"Выбросы в датасете:\")\n",
"print(cleaned_df[outliers])\n",
"\n",
"median_score = cleaned_df[\"Spec_score\"].median()\n",
"cleaned_df.loc[outliers, \"Spec_score\"] = median_score\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(cleaned_df['company'].str.lower(), cleaned_df['Spec_score'])\n",
"plt.xlabel('Фирма')\n",
"plt.ylabel('Оценка характеристик')\n",
"plt.xticks(rotation=45)\n",
"plt.title('Диаграмма рассеивания после чистки')\n",
"plt.show()"
]
},
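{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above flags outliers in `Spec_score` with the interquartile-range rule and replaces them with the column median. As a short restatement, using the same quantities the code computes (`Q1`, `Q3`, `IQR`, `lower_bound`, `upper_bound`), a value $x$ is treated as an outlier when it falls outside the two fences:\n",
"\n",
"$$\\mathrm{IQR} = Q_3 - Q_1, \\qquad x < Q_1 - 1.5\\,\\mathrm{IQR} \\quad \\text{or} \\quad x > Q_3 + 1.5\\,\\mathrm{IQR}$$\n",
"\n",
"The factor 1.5 is a common convention rather than a property of this dataset, so a different multiplier would flag a different set of rows."
]
},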
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Разбиение на выборки"
]
},
{
"cell_type": "code",
"execution_count": 248,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки: 489\n",
"Размер контрольной выборки: 164\n",
"Размер тестовой выборки: 164\n",
"\n",
"Распределение оценки характеристик в обучающей выборке:\n",
"Spec_score\n",
"75 48\n",
"86 35\n",
"80 34\n",
"84 32\n",
"85 23\n",
"78 23\n",
"83 23\n",
"77 19\n",
"79 19\n",
"82 18\n",
"89 17\n",
"88 17\n",
"71 16\n",
"73 15\n",
"72 13\n",
"74 13\n",
"87 12\n",
"69 11\n",
"76 10\n",
"81 10\n",
"67 9\n",
"90 9\n",
"70 8\n",
"68 8\n",
"91 8\n",
"64 7\n",
"93 7\n",
"92 6\n",
"66 5\n",
"94 4\n",
"63 4\n",
"96 2\n",
"95 1\n",
"65 1\n",
"60 1\n",
"61 1\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение оценки характеристик в контрольной выборке:\n",
"Spec_score\n",
"75 18\n",
"81 12\n",
"74 11\n",
"79 9\n",
"82 9\n",
"85 9\n",
"84 8\n",
"86 8\n",
"76 7\n",
"78 7\n",
"77 7\n",
"83 6\n",
"89 5\n",
"71 5\n",
"72 5\n",
"80 4\n",
"70 4\n",
"88 3\n",
"68 3\n",
"65 3\n",
"73 3\n",
"67 2\n",
"87 2\n",
"63 2\n",
"95 2\n",
"93 2\n",
"90 2\n",
"94 1\n",
"66 1\n",
"92 1\n",
"69 1\n",
"98 1\n",
"61 1\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение оценки характеристик в тестовой выборке:\n",
"Spec_score\n",
"75 15\n",
"84 13\n",
"76 11\n",
"82 10\n",
"81 9\n",
"80 9\n",
"77 8\n",
"83 8\n",
"86 7\n",
"89 6\n",
"78 6\n",
"79 6\n",
"87 5\n",
"71 5\n",
"74 5\n",
"85 5\n",
"70 4\n",
"94 3\n",
"72 3\n",
"73 3\n",
"66 3\n",
"91 3\n",
"88 3\n",
"92 3\n",
"93 2\n",
"96 1\n",
"64 1\n",
"90 1\n",
"67 1\n",
"62 1\n",
"65 1\n",
"68 1\n",
"95 1\n",
"69 1\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"train_df, test_df = train_test_split(cleaned_df, test_size=0.2, random_state=42)\n",
"\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Размер обучающей выборки:\", len(train_df))\n",
"print(\"Размер контрольной выборки:\", len(val_df))\n",
"print(\"Размер тестовой выборки:\", len(test_df))\n",
"\n",
"print()\n",
"\n",
"def check_balance(df, name):\n",
" counts = df['Spec_score'].value_counts()\n",
" print(f\"Распределение оценки характеристик в {name}:\")\n",
" print(counts)\n",
" print()\n",
"\n",
"check_balance(train_df, \"обучающей выборке\")\n",
"check_balance(val_df, \"контрольной выборке\")\n",
"check_balance(test_df, \"тестовой выборке\")"
]
},
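{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two `train_test_split` calls above give roughly a 60/20/20 split. A quick check of the arithmetic, assuming scikit-learn rounds the test size up and using the 817 rows implied by the printed sizes (489 + 164 + 164):\n",
"\n",
"$$n_{\\text{test}} = \\lceil 0.2 \\cdot 817 \\rceil = 164, \\qquad 817 - 164 = 653$$\n",
"\n",
"$$n_{\\text{val}} = \\lceil 0.25 \\cdot 653 \\rceil = 164, \\qquad n_{\\text{train}} = 653 - 164 = 489$$\n",
"\n",
"Taking 0.25 of the remaining 80% is what gives the validation set about 20% of the original data."
]
},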
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Оверсемплинг и андерсемплинг"
]
},
{
"cell_type": "code",
"execution_count": 249,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Оверсэмплинг:\n",
"Распределение оценки характеристик в обучающей выборке:\n",
"Spec_score\n",
"85 48\n",
"78 48\n",
"75 48\n",
"82 48\n",
"64 48\n",
"73 48\n",
"79 48\n",
"87 48\n",
"86 48\n",
"80 48\n",
"70 48\n",
"83 48\n",
"68 48\n",
"74 48\n",
"71 48\n",
"72 48\n",
"66 48\n",
"93 48\n",
"77 48\n",
"88 48\n",
"69 48\n",
"89 48\n",
"84 48\n",
"94 48\n",
"76 48\n",
"95 48\n",
"90 48\n",
"63 48\n",
"81 48\n",
"67 48\n",
"91 48\n",
"92 48\n",
"96 48\n",
"65 48\n",
"60 48\n",
"61 48\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение оценки характеристик в контрольной выборке:\n",
"Spec_score\n",
"75 18\n",
"94 18\n",
"72 18\n",
"82 18\n",
"70 18\n",
"74 18\n",
"68 18\n",
"88 18\n",
"71 18\n",
"80 18\n",
"92 18\n",
"86 18\n",
"66 18\n",
"81 18\n",
"84 18\n",
"79 18\n",
"73 18\n",
"76 18\n",
"67 18\n",
"95 18\n",
"78 18\n",
"85 18\n",
"83 18\n",
"77 18\n",
"89 18\n",
"98 18\n",
"69 18\n",
"90 18\n",
"87 18\n",
"65 18\n",
"63 18\n",
"93 18\n",
"61 18\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение оценки характеристик в тестовой выборке:\n",
"Spec_score\n",
"80 15\n",
"94 15\n",
"82 15\n",
"77 15\n",
"75 15\n",
"79 15\n",
"96 15\n",
"83 15\n",
"76 15\n",
"71 15\n",
"64 15\n",
"78 15\n",
"84 15\n",
"91 15\n",
"74 15\n",
"93 15\n",
"87 15\n",
"89 15\n",
"81 15\n",
"66 15\n",
"86 15\n",
"92 15\n",
"88 15\n",
"73 15\n",
"90 15\n",
"67 15\n",
"85 15\n",
"72 15\n",
"62 15\n",
"70 15\n",
"65 15\n",
"68 15\n",
"95 15\n",
"69 15\n",
"Name: count, dtype: int64\n",
"\n",
"Андерсэмплинг:\n",
"Распределение оценки характеристик в обучающей выборке:\n",
"Spec_score\n",
"60 1\n",
"61 1\n",
"63 1\n",
"64 1\n",
"65 1\n",
"66 1\n",
"67 1\n",
"68 1\n",
"69 1\n",
"70 1\n",
"71 1\n",
"72 1\n",
"73 1\n",
"74 1\n",
"75 1\n",
"76 1\n",
"77 1\n",
"78 1\n",
"79 1\n",
"80 1\n",
"81 1\n",
"82 1\n",
"83 1\n",
"84 1\n",
"85 1\n",
"86 1\n",
"87 1\n",
"88 1\n",
"89 1\n",
"90 1\n",
"91 1\n",
"92 1\n",
"93 1\n",
"94 1\n",
"95 1\n",
"96 1\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение оценки характеристик в контрольной выборке:\n",
"Spec_score\n",
"61 1\n",
"63 1\n",
"65 1\n",
"66 1\n",
"67 1\n",
"68 1\n",
"69 1\n",
"70 1\n",
"71 1\n",
"72 1\n",
"73 1\n",
"74 1\n",
"75 1\n",
"76 1\n",
"77 1\n",
"78 1\n",
"79 1\n",
"80 1\n",
"81 1\n",
"82 1\n",
"83 1\n",
"84 1\n",
"85 1\n",
"86 1\n",
"87 1\n",
"88 1\n",
"89 1\n",
"90 1\n",
"92 1\n",
"93 1\n",
"94 1\n",
"95 1\n",
"98 1\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение оценки характеристик в тестовой выборке:\n",
"Spec_score\n",
"62 1\n",
"64 1\n",
"65 1\n",
"66 1\n",
"67 1\n",
"68 1\n",
"69 1\n",
"70 1\n",
"71 1\n",
"72 1\n",
"73 1\n",
"74 1\n",
"75 1\n",
"76 1\n",
"77 1\n",
"78 1\n",
"79 1\n",
"80 1\n",
"81 1\n",
"82 1\n",
"83 1\n",
"84 1\n",
"85 1\n",
"86 1\n",
"87 1\n",
"88 1\n",
"89 1\n",
"90 1\n",
"91 1\n",
"92 1\n",
"93 1\n",
"94 1\n",
"95 1\n",
"96 1\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"train_df_oversampled = oversample(train_df, 'Spec_score')\n",
"val_df_oversampled = oversample(val_df, 'Spec_score')\n",
"test_df_oversampled = oversample(test_df, 'Spec_score')\n",
"\n",
"train_df_undersampled = undersample(train_df, 'Spec_score')\n",
"val_df_undersampled = undersample(val_df, 'Spec_score')\n",
"test_df_undersampled = undersample(test_df, 'Spec_score')\n",
"\n",
"print(\"Оверсэмплинг:\")\n",
"check_balance(train_df_oversampled, \"обучающей выборке\")\n",
"check_balance(val_df_oversampled, \"контрольной выборке\")\n",
"check_balance(test_df_oversampled, \"тестовой выборке\")\n",
"\n",
"print(\"Андерсэмплинг:\")\n",
"check_balance(train_df_undersampled, \"обучающей выборке\")\n",
"check_balance(val_df_undersampled, \"контрольной выборке\")\n",
"check_balance(test_df_undersampled, \"тестовой выборке\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Датасет: \"Удаленная работа и ментальное здоровье\" (https://www.kaggle.com/datasets/waqi786/remote-work-and-mental-health)"
]
},
{
"cell_type": "code",
"execution_count": 250,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Employee_ID', 'Age', 'Gender', 'Job_Role', 'Industry',\n",
" 'Years_of_Experience', 'Work_Location', 'Hours_Worked_Per_Week',\n",
" 'Number_of_Virtual_Meetings', 'Work_Life_Balance_Rating',\n",
" 'Stress_Level', 'Mental_Health_Condition',\n",
" 'Access_to_Mental_Health_Resources', 'Productivity_Change',\n",
" 'Social_Isolation_Rating', 'Satisfaction_with_Remote_Work',\n",
" 'Company_Support_for_Remote_Work', 'Physical_Activity', 'Sleep_Quality',\n",
" 'Region'],\n",
" dtype='object')\n"
]
}
],
"source": [
"df_remotework = pd.read_csv(\"..\\\\static\\\\csv\\\\Impact_of_Remote_Work_on_Mental_Health.csv\")\n",
"\n",
"print(df_remotework.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Столбцы на русском:\n",
"\n",
"Employee_ID - идентификатор \\\n",
"Age - возраст\\\n",
"Gender - гендер\\\n",
"Job_Role - специальность\\\n",
"Industry - образование\\\n",
"Years_of_Experience - опыт\\\n",
"Work_Location - место работы\\\n",
"Hours_Worked_Per_Week - количество часов работы в неделю\\\n",
"Number_of_Virtual_Meetings - количество виртуальных встреч\\\n",
"Work_Life_Balance_Rating - рейтинг баланса между работой и жизнью\\\n",
"Stress_Level - уровень стресса\n",
"Mental_Health_Condition - состояние ментального здоровья\\\n",
"Access_to_Mental_Health_Resources - доступ к ресурсам по психическому здоровью\\\n",
"Productivity_Change - изменение продуктивности\\\n",
"Social_Isolation_Rating - рейтинг социальной изоляции\\\n",
"Satisfaction_with_Remote_Work - удовольствие от удаленной работы\\\n",
"Company_Support_for_Remote_Work - поддержка компании удаленной работы\\\n",
"Physical_Activity - физическая активность\\\n",
"Sleep_Quality - качество сна\\\n",
"Region - регион"
]
},
{
"cell_type": "code",
"execution_count": 251,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 5000 entries, 0 to 4999\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Employee_ID 5000 non-null object\n",
" 1 Age 5000 non-null int64 \n",
" 2 Gender 5000 non-null object\n",
" 3 Job_Role 5000 non-null object\n",
" 4 Industry 5000 non-null object\n",
" 5 Years_of_Experience 5000 non-null int64 \n",
" 6 Work_Location 5000 non-null object\n",
" 7 Hours_Worked_Per_Week 5000 non-null int64 \n",
" 8 Number_of_Virtual_Meetings 5000 non-null int64 \n",
" 9 Work_Life_Balance_Rating 5000 non-null int64 \n",
" 10 Stress_Level 5000 non-null object\n",
" 11 Mental_Health_Condition 3804 non-null object\n",
" 12 Access_to_Mental_Health_Resources 5000 non-null object\n",
" 13 Productivity_Change 5000 non-null object\n",
" 14 Social_Isolation_Rating 5000 non-null int64 \n",
" 15 Satisfaction_with_Remote_Work 5000 non-null object\n",
" 16 Company_Support_for_Remote_Work 5000 non-null int64 \n",
" 17 Physical_Activity 3371 non-null object\n",
" 18 Sleep_Quality 5000 non-null object\n",
" 19 Region 5000 non-null object\n",
"dtypes: int64(7), object(13)\n",
"memory usage: 781.4+ KB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Employee_ID</th>\n",
" <th>Age</th>\n",
" <th>Gender</th>\n",
" <th>Job_Role</th>\n",
" <th>Industry</th>\n",
" <th>Years_of_Experience</th>\n",
" <th>Work_Location</th>\n",
" <th>Hours_Worked_Per_Week</th>\n",
" <th>Number_of_Virtual_Meetings</th>\n",
" <th>Work_Life_Balance_Rating</th>\n",
" <th>Stress_Level</th>\n",
" <th>Mental_Health_Condition</th>\n",
" <th>Access_to_Mental_Health_Resources</th>\n",
" <th>Productivity_Change</th>\n",
" <th>Social_Isolation_Rating</th>\n",
" <th>Satisfaction_with_Remote_Work</th>\n",
" <th>Company_Support_for_Remote_Work</th>\n",
" <th>Physical_Activity</th>\n",
" <th>Sleep_Quality</th>\n",
" <th>Region</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>EMP0001</td>\n",
" <td>32</td>\n",
" <td>Non-binary</td>\n",
" <td>HR</td>\n",
" <td>Healthcare</td>\n",
" <td>13</td>\n",
" <td>Hybrid</td>\n",
" <td>47</td>\n",
" <td>7</td>\n",
" <td>2</td>\n",
" <td>Medium</td>\n",
" <td>Depression</td>\n",
" <td>No</td>\n",
" <td>Decrease</td>\n",
" <td>1</td>\n",
" <td>Unsatisfied</td>\n",
" <td>1</td>\n",
" <td>Weekly</td>\n",
" <td>Good</td>\n",
" <td>Europe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>EMP0002</td>\n",
" <td>40</td>\n",
" <td>Female</td>\n",
" <td>Data Scientist</td>\n",
" <td>IT</td>\n",
" <td>3</td>\n",
" <td>Remote</td>\n",
" <td>52</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>Medium</td>\n",
" <td>Anxiety</td>\n",
" <td>No</td>\n",
" <td>Increase</td>\n",
" <td>3</td>\n",
" <td>Satisfied</td>\n",
" <td>2</td>\n",
" <td>Weekly</td>\n",
" <td>Good</td>\n",
" <td>Asia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>EMP0003</td>\n",
" <td>59</td>\n",
" <td>Non-binary</td>\n",
" <td>Software Engineer</td>\n",
" <td>Education</td>\n",
" <td>22</td>\n",
" <td>Hybrid</td>\n",
" <td>46</td>\n",
" <td>11</td>\n",
" <td>5</td>\n",
" <td>Medium</td>\n",
" <td>Anxiety</td>\n",
" <td>No</td>\n",
" <td>No Change</td>\n",
" <td>4</td>\n",
" <td>Unsatisfied</td>\n",
" <td>5</td>\n",
" <td>NaN</td>\n",
" <td>Poor</td>\n",
" <td>North America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>EMP0004</td>\n",
" <td>27</td>\n",
" <td>Male</td>\n",
" <td>Software Engineer</td>\n",
" <td>Finance</td>\n",
" <td>20</td>\n",
" <td>Onsite</td>\n",
" <td>32</td>\n",
" <td>8</td>\n",
" <td>4</td>\n",
" <td>High</td>\n",
" <td>Depression</td>\n",
" <td>Yes</td>\n",
" <td>Increase</td>\n",
" <td>3</td>\n",
" <td>Unsatisfied</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>Poor</td>\n",
" <td>Europe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>EMP0005</td>\n",
" <td>49</td>\n",
" <td>Male</td>\n",
" <td>Sales</td>\n",
" <td>Consulting</td>\n",
" <td>32</td>\n",
" <td>Onsite</td>\n",
" <td>35</td>\n",
" <td>12</td>\n",
" <td>2</td>\n",
" <td>High</td>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>Decrease</td>\n",
" <td>3</td>\n",
" <td>Unsatisfied</td>\n",
" <td>3</td>\n",
" <td>Weekly</td>\n",
" <td>Average</td>\n",
" <td>North America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Employee_ID Age Gender Job_Role Industry \\\n",
"0 EMP0001 32 Non-binary HR Healthcare \n",
"1 EMP0002 40 Female Data Scientist IT \n",
"2 EMP0003 59 Non-binary Software Engineer Education \n",
"3 EMP0004 27 Male Software Engineer Finance \n",
"4 EMP0005 49 Male Sales Consulting \n",
"\n",
" Years_of_Experience Work_Location Hours_Worked_Per_Week \\\n",
"0 13 Hybrid 47 \n",
"1 3 Remote 52 \n",
"2 22 Hybrid 46 \n",
"3 20 Onsite 32 \n",
"4 32 Onsite 35 \n",
"\n",
" Number_of_Virtual_Meetings Work_Life_Balance_Rating Stress_Level \\\n",
"0 7 2 Medium \n",
"1 4 1 Medium \n",
"2 11 5 Medium \n",
"3 8 4 High \n",
"4 12 2 High \n",
"\n",
" Mental_Health_Condition Access_to_Mental_Health_Resources \\\n",
"0 Depression No \n",
"1 Anxiety No \n",
"2 Anxiety No \n",
"3 Depression Yes \n",
"4 NaN Yes \n",
"\n",
" Productivity_Change Social_Isolation_Rating Satisfaction_with_Remote_Work \\\n",
"0 Decrease 1 Unsatisfied \n",
"1 Increase 3 Satisfied \n",
"2 No Change 4 Unsatisfied \n",
"3 Increase 3 Unsatisfied \n",
"4 Decrease 3 Unsatisfied \n",
"\n",
" Company_Support_for_Remote_Work Physical_Activity Sleep_Quality \\\n",
"0 1 Weekly Good \n",
"1 2 Weekly Good \n",
"2 5 NaN Poor \n",
"3 3 NaN Poor \n",
"4 3 Weekly Average \n",
"\n",
" Region \n",
"0 Europe \n",
"1 Asia \n",
"2 North America \n",
"3 Europe \n",
"4 North America "
]
},
"execution_count": 251,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_remotework.info()\n",
"df_remotework.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Объект наблюдения: работник\\\n",
"Атрибуты объектов: возраст, гендер, специальность, образование и т.д."
]
},
{
"cell_type": "code",
"execution_count": 252,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABI0AAAIjCAYAAACODuuQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACEeElEQVR4nOzdd3hUdcLF8XNnUiadUAIEAgSkhx6EEMWGoqJSVIoiIIg0C4tlZRUbKpZXLEhVmhRFkSK6iMgCSu+99xpaSCF1kpn3DyQQEpBMEm7K9/M885i5986dM6xryMmvGE6n0ykAAAAAAADgChazAwAAAAAAAKDgoTQCAAAAAABAFpRGAAAAAAAAyILSCAAAAAAAAFlQGgEAAAAAACALSiMAAAAAAABkQWkEAAAAAACALCiNAAAAAAAAkAWlEQAAAAAAALKgNAIAAAAAAEAWlEYAAKDQmDlzpgzDyPYRFhZmdrxi68KFC3rrrbd0//33q2TJkjIMQ5MmTTI7FgAAyCU3swMAAADk1H/+8x/Vrl074/n7779vYhqcPXtW7777ripVqqQGDRpoyZIlZkcCAAB5gNIIAAAUOvfee6/uvPPOjOfffPONzp49a16gYq58+fI6efKkypUrp3Xr1qlp06ZmRwIAAHmA6WkAAKDQSE1NlSRZLP/8V5hJkybJMAwdOnQo45jD4VD9+vWzTJ/asmWLevTooapVq8pms6lcuXLq2bOnzp07l+meb7/9drZT49zcLv8e7s4771RYWJjWr1+vFi1ayMvLS6GhoRozZkyWz/Lmm2+qSZMmCggIkI+Pj26//XYtXrw403WHDh3KeJ85c+ZkOpecnKzAwEAZhqH/+7//y5IzKChIdrs902u+++67jPtdWbTNnTtXbdq0UXBwsDw9PVWtWjUNHTpU6enp//hn7enpqXLlyv3jdQAAoHBhpBEAACg0LpVGnp6eLr1+ypQp2rp1a5bjCxcu1IEDB/T000+rXLly2r59u8aNG6ft27dr1apVMgwj0/WjR4+Wr69vxvOrS6zz58/rwQcfVMeOHdWlSxf98MMP6tevnzw8PNSzZ09JUlxcnL755ht16dJFvXv3Vnx8vMaPH6/WrVtrzZo1atiwYaZ72mw2TZw4Ue3atcs4NmvWLCUnJ1/z88bHx+uXX35R+/btM45NnDhRNpsty+smTZokX19fDRo0SL6+vvrf//6nN998U3Fxcfrkk0+u+R4AAKDoojQCAACFRmxsrCTJy8srx69NSUnRm2++qQceeEDz58/PdK5///566aWXMh1r3ry5unTpomXLlun222/PdO6xxx5T6dKlr/leJ06c0KeffqpBgwZJkvr06aNmzZpp8ODBeuqpp+Tu7q7AwEAdOnRIHh4eGa/r3bu3atWqpREjRmj8+PGZ7tm+fXv9+OOPOnXqlMqWLStJmjBhgjp06KDp06dnm6N9+/aaMGFCRml05MgRLVq0SJ06ddJ3332X6drp06dn+nPt27ev+vbtq1GjRum9995zuagDAACFF9PTAABAoXFpuliZMmVy/NqRI0fq3Llzeuutt7Kcu7IsSU5O1tmzZ9W8eXNJ0oYNG3L8Xm5uburTp0/Gcw8PD/Xp00enT5/W+vXrJUlWqzWjMHI4HIqOjlZaWprCw8Ozfc/GjRurbt26mjJliiTp8OHDWrx4sXr06HHNHD179tRvv/2mqKgoSdLkyZMVERGhGjVqZLn2yj+D+Ph4nT17VrfffrsSExO1a9euHP8ZAACAwo/SCAAAFBqHDx+Wm5tbjkuj2NhYffDBBxo0aFDGKJ0rRUdH68UXX1TZsmXl5eWlMmXKKDQ0NOO1ORUcHCwfH59Mxy4VNVeusTR58mTVr19fNptNpUqVUpkyZfTrr79e8z2ffvppTZw4UdLF6WQtWrRQ9erVr5mjYcOGCgsL07fffiun06lJkybp6aefzvba7du3q3379goICJC/v7/KlCmjrl27Sn
LtzwAAABR+lEYAAKDQ2L17t6pWrZpp4ekb8dFHH8liseiVV17J9nzHjh319ddfq2/fvpo1a5Z+//13/fbbb5IujgLKD1OnTlWPHj1UrVo1jR8/Xr/99psWLlyou++++5rv2bVrV+3bt0+rVq3S5MmTr1kAXalnz56aOHGili5dqqioKHXs2DHLNTExMbrjjju0efNmvfvuu5o3b54WLlyojz76SFL+/RkAAICCjTWNAABAoZCSkqJNmzZlWgj6Rpw4cUJffPGFhg0bJj8/vyw7op0/f16LFi3SO++8ozfffDPj+N69e13OeuLECSUkJGQabbRnzx5JUpUqVSRJM2fOVNWqVTVr1qxMC21nN33uklKlSumRRx7JmOrWsWPHTDugZefJJ5/UK6+8ohdffFGPPfaY/Pz8slyzZMkSnTt3TrNmzVLLli0zjh88ePCGPi8AACiaGGkEAAAKhenTpyslJUX33HNPjl73zjvvqGzZsurbt2+2561WqyTJ6XRmOv7555+7lFOS0tLSNHbs2IznqampGjt2rMqUKaMmTZpc831Xr16tlStXXvfePXv21JYtW/T4449n2sHtWkqWLKm2bdtqy5YtGTu3XS27LKmpqRo1atQ/3h8AABRdjDQCAAAFWkJCgkaMGKF3331XVqtVTqdTU6dOzXTNqVOndOHCBU2dOlX33ntvpnWLfv/9d02bNi3TLmVX8vf3V8uWLfXxxx/LbrerQoUK+v3333M1yiY4OFgfffSRDh06pBo1amjGjBnatGmTxo0bJ3d3d0nSQw89pFmzZql9+/Zq06aNDh48qDFjxqhOnTq6cOHCNe99//3368yZMzdUGF0yadIkjRw58po7vrVo0UKBgYHq3r27XnjhBRmGoSlTpmQp0q7nq6++UkxMjE6cOCFJmjdvno4dOyZJev755xUQEHDD9wIAAAUDpREAACjQzpw5o8GDB2c8v3JXsqs99dRTWrx4cabSqGHDhurSpct132P69Ol6/vnnNXLkSDmdTt13332aP3++goODXcocGBioyZMn6/nnn9fXX3+tsmXL6quvvlLv3r0zrunRo4eioqI0duxYLViwQHXq1NHUqVP1448/asmSJde8t2EY1yx/rsXLyyvT7mhXK1WqlH755Re99NJLeuONNxQYGKiuXbvqnnvuUevWrW/oPf7v//5Phw8fzng+a9YszZo1S9LFtZgojQAAKHwMZ05+hQQAAHCTHTp0SKGhoVq8eLHuvPPOXF+X3+68806dPXtW27ZtMy0DAABAXmBNIwAAAAAAAGRBaQQAAAo0X19fPfnkk5mmnOXmOgAAANwYpqcBAADkIaanAQCAooLSCAAAAAAAAFkwPQ0AAAAAAABZUBoBAAAAAAAgCzezAxREDodDJ06ckJ+fnwzDMDsOAAAAAABAnnA6nYqPj1dwcLAsluuPJaI0ysaJEycUEhJidgwAAAAAAIB8cfToUVWsWPG611AaZcPPz0/SxT9Af39/k9MAAAAAAADkjbi4OIWEhGR0H9dDaZSNS1PS/P39KY0AAAAAAECRcyPL8bAQNgAAAAAAALKgNAIAAAAAAEAWlEYAAAAAAADIgtIIAAAAAAAAWVAaAQAAAAAAIAtKIwAAAAAAAGRBaQQAAAAAAIAsKI0AAAAAAACQBaURAAAAAAAAsqA0AgAAAAAAQBaURgAAAAAAAMiC0ggAAAAAAABZUBoBAAAAAAAgC0ojAAAAAAAAZEFpBAAAAAAAgCwojQAAAAAAAJCFm9kBAAAAigun06nE1HSdT0xVTKJdsUl2nU9MVbLdoTtqlFEZP0+zIwIAAGSgNAIAAMghp9OpZLsjo/yJSfr7n4kXS6DYJLvOJ6QqJsmu2L+PXfo6Nd2R7T1L+3pqfPdwNQgpcXM/DAAAwDVQGgEAgGIt2Z6eUfycT7Ar9u8C6PylMijh73OJF0ufS1+npmVf/twID6tFJbzd/354KCo2WUeiE9Vp3Ep93qmh7g8rn4
efEAAAwDWURgAAoEhISUv/u9TJfpRPTGJqxkigq6eGucrNYqiEt4dKeLsr0NtdAV6Xv750vISXx8Vzfx8L9HaXl7t
"text/plain": [
"<Figure size 1400x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"mean_isolation_rating = df_remotework.groupby('Work_Location')['Social_Isolation_Rating'].mean().reset_index()\n",
"\n",
"plt.figure(figsize=(14, 6))\n",
"\n",
"plt.plot(mean_isolation_rating['Work_Location'], mean_isolation_rating['Social_Isolation_Rating'])\n",
"\n",
"plt.title(\"Диаграмма 1\")\n",
"plt.xlabel(\"Место работы\")\n",
"plt.ylabel(\"Рейтинг социальной изоляции\")\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Между атрибутами присутствует связь. Пример, на диаграмме 1 - связь между местом работы и рейтингом социальной изоляции\\\n",
"\\\n",
"Примеры бизнес-целей:\\\n",
" 1. Прогнозирование изменения продуктивности на основе опыта работы и уровня стресса\\\n",
" 2. Прогнозирование ментального состояния на основе рейтинга баланса между работой и жизнью\\\n",
"\\\n",
"Эффект для бизнеса: влияние ментального состояния на продуктивность, влияние качества сна и количества часов работы в неделю на ментальное состояние\\\n",
"\\\n",
"\\\n",
"Цели технического проекта:\\\n",
" 1. Первая бизнес-цель: вход - опыт работы, уровень стресса, целевой признак - изменение продуктивности.\\\n",
" 2. Вторая бизнес-цель: вход - рейтинг баланса между работой и жизнью, целевой признак - ментальное состояние."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Проверим на выбросы"
]
},
{
"cell_type": "code",
"execution_count": 305,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Пустые значения по столбцам:\n",
"Employee_ID 0\n",
"Age 0\n",
"Gender 0\n",
"Job_Role 0\n",
"Industry 0\n",
"Years_of_Experience 0\n",
"Work_Location 0\n",
"Hours_Worked_Per_Week 0\n",
"Number_of_Virtual_Meetings 0\n",
"Work_Life_Balance_Rating 0\n",
"Stress_Level 0\n",
"Mental_Health_Condition 1196\n",
"Access_to_Mental_Health_Resources 0\n",
"Productivity_Change 0\n",
"Social_Isolation_Rating 0\n",
"Satisfaction_with_Remote_Work 0\n",
"Company_Support_for_Remote_Work 0\n",
"Physical_Activity 1629\n",
"Sleep_Quality 0\n",
"Region 0\n",
"dtype: int64\n",
"\n",
"Количество дубликатов: 0\n",
"\n",
"Статистический обзор данных:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>Years_of_Experience</th>\n",
" <th>Hours_Worked_Per_Week</th>\n",
" <th>Number_of_Virtual_Meetings</th>\n",
" <th>Work_Life_Balance_Rating</th>\n",
" <th>Social_Isolation_Rating</th>\n",
" <th>Company_Support_for_Remote_Work</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>5000.000000</td>\n",
" <td>5000.000000</td>\n",
" <td>5000.000000</td>\n",
" <td>5000.000000</td>\n",
" <td>5000.000000</td>\n",
" <td>5000.000000</td>\n",
" <td>5000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>40.995000</td>\n",
" <td>17.810200</td>\n",
" <td>39.614600</td>\n",
" <td>7.559000</td>\n",
" <td>2.984200</td>\n",
" <td>2.993800</td>\n",
" <td>3.007800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>11.296021</td>\n",
" <td>10.020412</td>\n",
" <td>11.860194</td>\n",
" <td>4.636121</td>\n",
" <td>1.410513</td>\n",
" <td>1.394615</td>\n",
" <td>1.399046</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>22.000000</td>\n",
" <td>1.000000</td>\n",
" <td>20.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>31.000000</td>\n",
" <td>9.000000</td>\n",
" <td>29.000000</td>\n",
" <td>4.000000</td>\n",
" <td>2.000000</td>\n",
" <td>2.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>41.000000</td>\n",
" <td>18.000000</td>\n",
" <td>40.000000</td>\n",
" <td>8.000000</td>\n",
" <td>3.000000</td>\n",
" <td>3.000000</td>\n",
" <td>3.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>51.000000</td>\n",
" <td>26.000000</td>\n",
" <td>50.000000</td>\n",
" <td>12.000000</td>\n",
" <td>4.000000</td>\n",
" <td>4.000000</td>\n",
" <td>4.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>60.000000</td>\n",
" <td>35.000000</td>\n",
" <td>60.000000</td>\n",
" <td>15.000000</td>\n",
" <td>5.000000</td>\n",
" <td>5.000000</td>\n",
" <td>5.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age Years_of_Experience Hours_Worked_Per_Week \\\n",
"count 5000.000000 5000.000000 5000.000000 \n",
"mean 40.995000 17.810200 39.614600 \n",
"std 11.296021 10.020412 11.860194 \n",
"min 22.000000 1.000000 20.000000 \n",
"25% 31.000000 9.000000 29.000000 \n",
"50% 41.000000 18.000000 40.000000 \n",
"75% 51.000000 26.000000 50.000000 \n",
"max 60.000000 35.000000 60.000000 \n",
"\n",
" Number_of_Virtual_Meetings Work_Life_Balance_Rating \\\n",
"count 5000.000000 5000.000000 \n",
"mean 7.559000 2.984200 \n",
"std 4.636121 1.410513 \n",
"min 0.000000 1.000000 \n",
"25% 4.000000 2.000000 \n",
"50% 8.000000 3.000000 \n",
"75% 12.000000 4.000000 \n",
"max 15.000000 5.000000 \n",
"\n",
" Social_Isolation_Rating Company_Support_for_Remote_Work \n",
"count 5000.000000 5000.000000 \n",
"mean 2.993800 3.007800 \n",
"std 1.394615 1.399046 \n",
"min 1.000000 1.000000 \n",
"25% 2.000000 2.000000 \n",
"50% 3.000000 3.000000 \n",
"75% 4.000000 4.000000 \n",
"max 5.000000 5.000000 "
]
},
"execution_count": 305,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"null_values = df_remotework.isnull().sum()\n",
"print(\"Пустые значения по столбцам:\")\n",
"print(null_values)\n",
"\n",
"duplicates = df_remotework.duplicated().sum()\n",
"print(f\"\\nК о личе с тво дубликатов: {duplicates}\")\n",
"\n",
"print(\"\\nС та тис тиче с кий обзор данных:\")\n",
"df_remotework.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Видим, что есть пустые данные, но нет дубликатов. Удаляем их"
]
},
{
"cell_type": "code",
"execution_count": 306,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"В наборе данных 'RemoteWork' было удалено 2423 строк с пустыми значениями.\n"
]
}
],
"source": [
"cleaned_df_remotework = drop_missing_values(df_remotework, \"RemoteWork\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Рассчитаем коэффицент ассиметрии"
]
},
{
"cell_type": "code",
"execution_count": 307,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Коэффициент асимметрии для столбца 'Age': -0.0039235316463557286\n",
"\n",
"Коэффициент асимметрии для столбца 'Years_of_Experience': 0.017598366445551735\n",
"\n",
"Коэффициент асимметрии для столбца 'Hours_Worked_Per_Week': 0.044648255942281\n",
"\n",
"Коэффициент асимметрии для столбца 'Number_of_Virtual_Meetings': 0.013099681093504066\n",
"\n",
"Коэффициент асимметрии для столбца 'Work_Life_Balance_Rating': -0.005133932183379798\n",
"\n",
"Коэффициент асимметрии для столбца 'Social_Isolation_Rating': 0.00804653798135057\n",
"\n",
"Коэффициент асимметрии для столбца 'Company_Support_for_Remote_Work': 0.0032439806320526494\n"
]
}
],
"source": [
"for column in cleaned_df_remotework.select_dtypes(include=[np.number]).columns:\n",
" skewness = cleaned_df_remotework[column].skew()\n",
" print(f\"\\nК о эффицие нт асимметрии для столбца '{column}': {skewness}\")"
]
},
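{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above calls `Series.skew()` on each numeric column. As a sketch of what that number measures, assuming pandas returns the bias-adjusted sample skewness (the adjusted Fisher-Pearson coefficient), for a column with $n > 2$ values $x_i$ and mean $\\bar{x}$:\n",
"\n",
"$$m_k = \\frac{1}{n}\\sum_{i=1}^{n}(x_i - \\bar{x})^k, \\qquad G_1 = \\frac{\\sqrt{n(n-1)}}{n-2}\\cdot\\frac{m_3}{m_2^{3/2}}$$\n",
"\n",
"Values near 0 indicate a roughly symmetric distribution; positive values indicate a longer right tail and negative values a longer left tail."
]
},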
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Видим, что выбросы незначительные\n",
"\n",
"Очистка данных от шумов"
]
},
{
"cell_type": "code",
"execution_count": 308,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0kAAAIrCAYAAAA6MtKlAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADNe0lEQVR4nOyde3hU1bn/v5PrhFwGBhMBkSTejYhChR4QiFpRQeAUT+uliiKnYLWC0h5Oi1WBFqVKKxasVuhRFNtqa8UKBTxUlEulDQqW0uA9CXJrIgOZ3CaTZOb3R37JSZjM7LXDN5lt+v08T54Hdt68813vete799p7z1qucDgchhBCCCGEEEIIAEBCvAUIIYQQQgghhJPQJEkIIYQQQggh2qBJkhBCCCGEEEK0QZMkIYQQQgghhGiDJklCCCGEEEII0QZNkoQQQgghhBCiDZokCSGEEEIIIUQbNEkSQgghhBBCiDZokiSEEEIIIYQQbdAkSQghhBAijhw4cACrVq1q/X9paSl+9atfxU+QEEKTJCHEyfHyyy/D5XJ1+DN48OB4yxNCCMfjcrnw7W9/G6+//jpKS0vx3//939i2bVu8ZQnxL01SvAUIIXoG9913H84///zW/z/00ENxVCOEEF8cTjvtNMyYMQPXXHMNAKB///5466234itKiH9xXOFwOBxvEUKILy4vv/wyvv71r+PNN9/EZZdd1nr8sssuw+eff469e/fGT5wQQnyB+OSTT/D5559j8ODBSE9Pj7ccIf6l0et2QoiTIhgMAgASEqzLyapVq+ByuVBaWtp6LBQKYciQIXC5XO3eyd+zZw+mTZuGM844A263G/369cP06dNx9OjRdj4XLFjQ4at+SUn/96D8sssuw+DBg/Huu+9i1KhRSEtLQ35+Pn7xi19EtOXBBx/El770JXg8HqSnp2PMmDF4880329mVlpa2fs6rr77a7neBQAB9+vSBy+XCT37ykwidOTk5aGhoaPc3v/nNb1r9ff75563H//CHP+Daa6/FgAEDkJqaijPPPBM/+tGP0NTUZBnrls97//33cf311yMrKwt9+/bFPffcg0Ag0M722WefxRVXXIGcnBykpqaioKAATz31VId+N2zYgMLCQmRmZiIrKwvDhw/Hr3/963Y2f/3rXzFhwgT06dMH6enpGDJkCH72s5+1s3n//ffxta99DV6vF263G5dccglee+21djZ28mXatGnt+r9Pnz647LLLIl5ZMo1pS86cyE9+8pMITXl5eZg2bVo7u9/97ndwuVzIy8trd7y8vBz/+Z//iUGDBiExMbFVb0ZGRsRnnUheXl7UV1tdLleE/QsvvIAvfelLSEtLg9frxY033ojPPvusw3ZajQ0AqK+vx/z583HWWWchNTUVp59+Ov77v/8b9fX1EbZvvfWWsc4TacndjtrfNs528gNA61jIzs5GWloazj33XPzgBz9o95mxflqe7Fx22WXtbggBzU/OExISIsbC7373u9Y+OOWUU3DLLbfg4MGD7WymTZvWmidnnnkmvvzlL8Pn8yEtLS2ifUKI7kOv2wkhToqWSVJqamqn/n716tX4+9//HnF806ZN+PTTT3H77bejX79++Mc//oEVK1bgH//4B/7yl79EXEQ99dRT7S40T5y0HTt2DBMmTMD111+Pm266Cb/97W9x5513IiUlBdOnTwcA+P1+/PKXv8RNN92EGTNmoKqqCv/zP/+Dq6++GkVFRbj44ovb+XS73Xj22Wfx1a9+tfXYK6+8EjEJaUtVVRXWrVuHKVOmtB579tln4Xa7I/5u1apVyMjIwHe+8x1kZGRg8+bNePDBB+H3+7FkyZKon9GW66+/Hnl5eVi8eDH+8pe/YNmyZTh27Bief/75drG74IILMHnyZCQlJWHt2rW46667EAqF8O1vf7udnunTp+OCCy7AvHnz0Lt3b+zevRsbN27EN77xDQDN/TZx4kT0798f99xzD/r164
d9+/Zh3bp1uOeeewAA//jHP3DppZfitNNOw/e//32kp6fjt7/9Lb761a/i97//fbvYnEi0fAGAU045BUuXLgXQ/EX4n/3sZ5gwYQI+++wz9O7dmxZTKxobG1svvk/ktttuw5/+9CfMmjULF110ERITE7FixQrs2rXLyPfFF1+M7373u+2OPf/889i0aVO7Yw899BAeeOABXH/99fjmN7+JiooKLF++HGPHjsXu3btb4wGYjY1QKITJkydj+/btmDlzJs4//3z8/e9/x9KlS/Hhhx9G3CxoYfbs2Rg+fHhUnWyi5ceePXswZswYJCcnY+bMmcjLy8Mnn3yCtWvX4qGHHsJ1112Hs846q9V+zpw5OP/88zFz5szWY21fJ27Ls88+i/vvvx8//elPW8cB0Jxrt99+O4YPH47Fixfjn//8J372s5/hz3/+c0QfnMiDDz4Ys44IIbqBsBBCnASPP/54GED4b3/7W7vjhYWF4QsuuKDdsWeffTYMIFxSUhIOh8PhQCAQHjRoUHj8+PFhAOFnn3221ba2tjbis37zm9+EAYS3bt3aemz+/PlhAOGKioqoGgsLC8MAwj/96U9bj9XX14cvvvjicE5OTjgYDIbD4XC4sbExXF9f3+5vjx07Fj711FPD06dPbz1WUlISBhC+6aabwklJSeEjR460/u4rX/lK+Bvf+EYYQHjJkiUROm+66abwxIkTW4+XlZWFExISwjfddFNEOzqKwR133BHu1atXOBAIRG1v28+bPHlyu+N33XVXRH919DlXX311+Iwzzmj9//Hjx8OZmZnhL3/5y+G6urp2tqFQKBwON8cvPz8/nJubGz527FiHNuFwc4wuvPDCdm0IhULhUaNGhc8+++zWY3by5bbbbgvn5ua2+8wVK1aEAYSLiopitrWjmHaUv+FwOLxkyZJ2msLhcDg3Nzd82223tf7/ySefDKempoYvv/zydprq6urCCQkJ4TvuuKOdz9tuuy2cnp4e8VknkpubG7722msjjn/7298Otz2dl5aWhhMTE8MPPfRQO7u///3v4aSkpHbHTcfG6tWrwwkJCeFt27a18/mLX/wiDCD85z//ud3x//3f/w0DCL/88stRdUZj4cKFYQDtcqal/W3jbCc/xo4dG87MzAyXlZW183niZ0T7rLYUFhaGCwsLw+FwOPzHP/4xnJSUFP7ud7/bziYYDIZzcnLCgwcPbjde1q1bFwYQfvDBB1uPnZi7e/fuDSckJLS2o22uCSG6D71uJ4Q4KVpef8vOzrb9tz//+c9x9OhRzJ8/P+J3aWlprf8OBAL4/PPP8W//9m8AYHzXvS1JSUm44447Wv+fkpKCO+64A+Xl5Xj33XcBAImJiUhJSQHQfOfc5/OhsbERl1xySYefOWzYMFxwwQVYvXo1AKCsrAxvvvlmxKtXbZk+fTo2btyII0eOAACee+45jBw5Euecc06EbdsYVFVV4fPPP8eYMWNQW1uL999/36jdbZ8EAcCsWbMAAOvXr+/wcyorK/H555+jsLAQn376KSorKwE0PyGqqqrC97//fbjd7nY+W57q7d69GyUlJbj33nsj7pK32Ph8PmzevBnXX399a5s+//xzHD16FFdffTU++uijiNeRWoiVL0Bzn7X4e++99/D888+jf//+7Z4A2IlpU1NTq7+Wn9ra2g4/u4Xa2lr88Ic/xN13341Bgwa1+11NTQ1CoRD69u0b08fJ8sorryAUCuH6669vp71fv344++yzI14fNRkbv/vd73D++efjvPPOa+fziiuuAIAIny1PQU7MFRNycnIAND8NtEO0/KioqMDWrVsxffr0iD4xef0vGkVFRbj++uvxH//xHxFPId955x2Ul5fjrrvuaheDa6+9Fueddx7++Mc/RvU7b948DBs2DF//+tc7rU0IcfLodTshxElRVlaGpKQk25OkyspKPPzww/jOd76DU089NeL3Pp8PCxcuxIsvvojy8vKIv7XLgAEDIr4I3T
IxKS0tbZ2APffcc/jpT3+K999/v913h/Lz8zv0e/vtt2PFihX4r//6L6xatQqjRo3C2WefHVXHxRdfjMGDB+P555/
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Outliers in the dataset:\n",
"Empty DataFrame\n",
"Columns: [Employee_ID, Age, Gender, Job_Role, Industry, Years_of_Experience, Work_Location, Hours_Worked_Per_Week, Number_of_Virtual_Meetings, Work_Life_Balance_Rating, Stress_Level, Mental_Health_Condition, Access_to_Mental_Health_Resources, Productivity_Change, Social_Isolation_Rating, Satisfaction_with_Remote_Work, Company_Support_for_Remote_Work, Physical_Activity, Sleep_Quality, Region]\n",
"Index: []\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Scatter plot before outlier handling.\n",
"# Note: the plots show Age vs. Number_of_Virtual_Meetings, while the cleaning below\n",
"# only modifies Hours_Worked_Per_Week, so both plots are expected to look the same.\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(cleaned_df_remotework['Age'], cleaned_df_remotework['Number_of_Virtual_Meetings'])\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Number of virtual meetings')\n",
"plt.xticks(rotation=45)\n",
"plt.title('Scatter plot before cleaning')\n",
"plt.show()\n",
"\n",
"# Interquartile range (IQR) of weekly working hours\n",
"Q1 = cleaned_df_remotework[\"Hours_Worked_Per_Week\"].quantile(0.25)\n",
"Q3 = cleaned_df_remotework[\"Hours_Worked_Per_Week\"].quantile(0.75)\n",
"\n",
"IQR = Q3 - Q1\n",
"\n",
"# Values farther than 1.5 * IQR from the quartiles are treated as outliers\n",
"threshold = 1.5 * IQR\n",
"lower_bound = Q1 - threshold\n",
"upper_bound = Q3 + threshold\n",
"\n",
"outliers = (cleaned_df_remotework[\"Hours_Worked_Per_Week\"] < lower_bound) | (cleaned_df_remotework[\"Hours_Worked_Per_Week\"] > upper_bound)\n",
"\n",
"print(\"Outliers in the dataset:\")\n",
"print(cleaned_df_remotework[outliers])\n",
"\n",
"# Replace outliers with the column median\n",
"median_score = cleaned_df_remotework[\"Hours_Worked_Per_Week\"].median()\n",
"cleaned_df_remotework.loc[outliers, \"Hours_Worked_Per_Week\"] = median_score\n",
"\n",
"# Scatter plot after outlier handling\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(cleaned_df_remotework['Age'], cleaned_df_remotework['Number_of_Virtual_Meetings'])\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Number of virtual meetings')\n",
"plt.xticks(rotation=45)\n",
"plt.title('Scatter plot after cleaning')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No noise was found.\n",
"\n",
"Splitting into training, validation, and test sets"
]
},
{
"cell_type": "code",
"execution_count": 311,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 1545\n",
"Validation set size: 516\n",
"Test set size: 516\n",
"\n",
"Distribution of feature values in the training set:\n",
"Hours_Worked_Per_Week\n",
"23 49\n",
"22 47\n",
"28 44\n",
"45 42\n",
"40 42\n",
"32 42\n",
"59 41\n",
"41 41\n",
"34 40\n",
"33 40\n",
"54 40\n",
"24 40\n",
"30 39\n",
"25 39\n",
"56 39\n",
"35 39\n",
"26 39\n",
"49 39\n",
"20 39\n",
"57 39\n",
"27 38\n",
"53 38\n",
"42 37\n",
"31 37\n",
"39 37\n",
"37 37\n",
"46 36\n",
"52 36\n",
"36 36\n",
"55 36\n",
"29 35\n",
"48 34\n",
"50 34\n",
"38 34\n",
"44 34\n",
"47 33\n",
"58 33\n",
"51 32\n",
"21 31\n",
"43 29\n",
"60 28\n",
"Name: count, dtype: int64\n",
"\n",
"Distribution of feature values in the validation set:\n",
"Hours_Worked_Per_Week\n",
"37 19\n",
"25 18\n",
"33 17\n",
"44 17\n",
"22 17\n",
"57 16\n",
"41 16\n",
"45 15\n",
"36 15\n",
"27 15\n",
"34 14\n",
"56 14\n",
"23 14\n",
"53 14\n",
"48 14\n",
"39 14\n",
"32 14\n",
"52 14\n",
"24 13\n",
"21 13\n",
"43 13\n",
"28 12\n",
"42 12\n",
"50 12\n",
"38 12\n",
"55 12\n",
"29 12\n",
"60 11\n",
"46 10\n",
"59 10\n",
"20 10\n",
"58 10\n",
"51 10\n",
"26 10\n",
"49 10\n",
"54 9\n",
"31 9\n",
"35 9\n",
"47 8\n",
"30 7\n",
"40 5\n",
"Name: count, dtype: int64\n",
"\n",
"Distribution of feature values in the test set:\n",
"Hours_Worked_Per_Week\n",
"53 20\n",
"31 19\n",
"50 18\n",
"35 17\n",
"51 17\n",
"28 17\n",
"60 16\n",
"27 16\n",
"47 15\n",
"56 15\n",
"36 15\n",
"26 15\n",
"38 15\n",
"21 14\n",
"43 13\n",
"59 13\n",
"30 13\n",
"52 13\n",
"46 13\n",
"24 13\n",
"33 12\n",
"34 12\n",
"39 12\n",
"48 12\n",
"57 12\n",
"22 11\n",
"42 11\n",
"23 11\n",
"20 11\n",
"55 11\n",
"32 11\n",
"58 11\n",
"49 10\n",
"25 10\n",
"54 9\n",
"40 9\n",
"41 9\n",
"44 8\n",
"29 8\n",
"45 5\n",
"37 4\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"# Hold out 20% of the data as the test set\n",
"train_df, test_df = train_test_split(cleaned_df_remotework, test_size=0.2, random_state=42)\n",
"\n",
"# Take 25% of the remaining 80% as the validation set (60/20/20 split overall)\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_df))\n",
"print(\"Validation set size:\", len(val_df))\n",
"print(\"Test set size:\", len(test_df))\n",
"\n",
"print()\n",
"\n",
"# Print how many rows each Hours_Worked_Per_Week value has in the given split\n",
"def check_balance(df, name):\n",
"    counts = df['Hours_Worked_Per_Week'].value_counts()\n",
"    print(f\"Distribution of feature values in {name}:\")\n",
"    print(counts)\n",
"    print()\n",
"\n",
"check_balance(train_df, \"the training set\")\n",
"check_balance(val_df, \"the validation set\")\n",
"check_balance(test_df, \"the test set\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oversampling and undersampling"
]
},
{
"cell_type": "code",
"execution_count": 312,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Oversampling:\n",
"Distribution of feature values in the training set:\n",
"Hours_Worked_Per_Week\n",
"43 49\n",
"48 49\n",
"31 49\n",
"45 49\n",
"42 49\n",
"35 49\n",
"50 49\n",
"23 49\n",
"36 49\n",
"38 49\n",
"27 49\n",
"37 49\n",
"60 49\n",
"34 49\n",
"53 49\n",
"58 49\n",
"24 49\n",
"56 49\n",
"32 49\n",
"39 49\n",
"46 49\n",
"40 49\n",
"57 49\n",
"59 49\n",
"21 49\n",
"41 49\n",
"52 49\n",
"55 49\n",
"25 49\n",
"26 49\n",
"29 49\n",
"54 49\n",
"33 49\n",
"47 49\n",
"49 49\n",
"44 49\n",
"28 49\n",
"51 49\n",
"22 49\n",
"30 49\n",
"20 49\n",
"Name: count, dtype: int64\n",
"\n",
"Distribution of feature values in the validation set:\n",
"Hours_Worked_Per_Week\n",
"24 19\n",
"46 19\n",
"39 19\n",
"43 19\n",
"48 19\n",
"22 19\n",
"30 19\n",
"32 19\n",
"20 19\n",
"60 19\n",
"52 19\n",
"33 19\n",
"57 19\n",
"36 19\n",
"51 19\n",
"40 19\n",
"31 19\n",
"45 19\n",
"27 19\n",
"50 19\n",
"38 19\n",
"58 19\n",
"53 19\n",
"55 19\n",
"34 19\n",
"29 19\n",
"49 19\n",
"41 19\n",
"35 19\n",
"28 19\n",
"21 19\n",
"23 19\n",
"37 19\n",
"56 19\n",
"25 19\n",
"47 19\n",
"44 19\n",
"42 19\n",
"54 19\n",
"59 19\n",
"26 19\n",
"Name: count, dtype: int64\n",
"\n",
"Distribution of feature values in the test set:\n",
"Hours_Worked_Per_Week\n",
"21 20\n",
"50 20\n",
"43 20\n",
"53 20\n",
"30 20\n",
"54 20\n",
"47 20\n",
"44 20\n",
"59 20\n",
"57 20\n",
"55 20\n",
"48 20\n",
"39 20\n",
"46 20\n",
"35 20\n",
"49 20\n",
"24 20\n",
"28 20\n",
"42 20\n",
"45 20\n",
"32 20\n",
"52 20\n",
"26 20\n",
"34 20\n",
"25 20\n",
"41 20\n",
"51 20\n",
"22 20\n",
"36 20\n",
"38 20\n",
"60 20\n",
"29 20\n",
"37 20\n",
"58 20\n",
"23 20\n",
"20 20\n",
"40 20\n",
"27 20\n",
"56 20\n",
"31 20\n",
"33 20\n",
"Name: count, dtype: int64\n",
"\n",
"Undersampling:\n",
"Distribution of feature values in the training set:\n",
"Hours_Worked_Per_Week\n",
"20 28\n",
"21 28\n",
"22 28\n",
"23 28\n",
"24 28\n",
"25 28\n",
"26 28\n",
"27 28\n",
"28 28\n",
"29 28\n",
"30 28\n",
"31 28\n",
"32 28\n",
"33 28\n",
"34 28\n",
"35 28\n",
"36 28\n",
"37 28\n",
"38 28\n",
"39 28\n",
"40 28\n",
"41 28\n",
"42 28\n",
"43 28\n",
"44 28\n",
"45 28\n",
"46 28\n",
"47 28\n",
"48 28\n",
"49 28\n",
"50 28\n",
"51 28\n",
"52 28\n",
"53 28\n",
"54 28\n",
"55 28\n",
"56 28\n",
"57 28\n",
"58 28\n",
"59 28\n",
"60 28\n",
"Name: count, dtype: int64\n",
"\n",
"Distribution of feature values in the validation set:\n",
"Hours_Worked_Per_Week\n",
"20 5\n",
"21 5\n",
"22 5\n",
"23 5\n",
"24 5\n",
"25 5\n",
"26 5\n",
"27 5\n",
"28 5\n",
"29 5\n",
"30 5\n",
"31 5\n",
"32 5\n",
"33 5\n",
"34 5\n",
"35 5\n",
"36 5\n",
"37 5\n",
"38 5\n",
"39 5\n",
"40 5\n",
"41 5\n",
"42 5\n",
"43 5\n",
"44 5\n",
"45 5\n",
"46 5\n",
"47 5\n",
"48 5\n",
"49 5\n",
"50 5\n",
"51 5\n",
"52 5\n",
"53 5\n",
"54 5\n",
"55 5\n",
"56 5\n",
"57 5\n",
"58 5\n",
"59 5\n",
"60 5\n",
"Name: count, dtype: int64\n",
"\n",
"Distribution of feature values in the test set:\n",
"Hours_Worked_Per_Week\n",
"20 4\n",
"21 4\n",
"22 4\n",
"23 4\n",
"24 4\n",
"25 4\n",
"26 4\n",
"27 4\n",
"28 4\n",
"29 4\n",
"30 4\n",
"31 4\n",
"32 4\n",
"33 4\n",
"34 4\n",
"35 4\n",
"36 4\n",
"37 4\n",
"38 4\n",
"39 4\n",
"40 4\n",
"41 4\n",
"42 4\n",
"43 4\n",
"44 4\n",
"45 4\n",
"46 4\n",
"47 4\n",
"48 4\n",
"49 4\n",
"50 4\n",
"51 4\n",
"52 4\n",
"53 4\n",
"54 4\n",
"55 4\n",
"56 4\n",
"57 4\n",
"58 4\n",
"59 4\n",
"60 4\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"# Oversampling: every Hours_Worked_Per_Week value ends up with as many rows as the most frequent one\n",
"train_df_oversampled = oversample(train_df, 'Hours_Worked_Per_Week')\n",
"val_df_oversampled = oversample(val_df, 'Hours_Worked_Per_Week')\n",
"test_df_oversampled = oversample(test_df, 'Hours_Worked_Per_Week')\n",
"\n",
"# Undersampling: every Hours_Worked_Per_Week value is cut down to the size of the rarest one\n",
"train_df_undersampled = undersample(train_df, 'Hours_Worked_Per_Week')\n",
"val_df_undersampled = undersample(val_df, 'Hours_Worked_Per_Week')\n",
"test_df_undersampled = undersample(test_df, 'Hours_Worked_Per_Week')\n",
"\n",
"print(\"Oversampling:\")\n",
"check_balance(train_df_oversampled, \"the training set\")\n",
"check_balance(val_df_oversampled, \"the validation set\")\n",
"check_balance(test_df_oversampled, \"the test set\")\n",
"\n",
"print(\"Undersampling:\")\n",
"check_balance(train_df_undersampled, \"the training set\")\n",
"check_balance(val_df_undersampled, \"the validation set\")\n",
"check_balance(test_df_undersampled, \"the test set\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}