{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset 1: Pima Indians Diabetes Database"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database\n",
"\n",
"This dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.\n",
"\n",
"* According to the dataset description, the objects of study are Pima Indian women.\n",
"* Object attributes: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome\n",
"* The evident purpose of the dataset is to learn to predict diabetes.\n",
"\n",
"Example business goals:\n",
"* Improving patients' quality of life. Technical project goal: develop an interface for the model that gives patients personalized recommendations on diabetes prevention and treatment, based on their individual risks as determined by the model.\n",
"* Making diabetes screening more effective. Technical project goal: develop and train a machine learning model with a prediction accuracy of at least 85% for automated diabetes screening on the Pima Indians Diabetes dataset.\n",
"* Reducing medical costs. Technical project goal: optimize the predictive model to minimize false negatives (patients with diabetes who go undetected), which lowers the cost of treating complications."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of columns: 9\n",
"columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 768 entries, 0 to 767\n",
"Data columns (total 9 columns):\n",
" #   Column                    Non-Null Count  Dtype  \n",
"---  ------                    --------------  -----  \n",
" 0   Pregnancies               768 non-null    int64  \n",
" 1   Glucose                   768 non-null    int64  \n",
" 2   BloodPressure             768 non-null    int64  \n",
" 3   SkinThickness             768 non-null    int64  \n",
" 4   Insulin                   768 non-null    int64  \n",
" 5   BMI                       768 non-null    float64\n",
" 6   DiabetesPedigreeFunction  768 non-null    float64\n",
" 7   Age                       768 non-null    int64  \n",
" 8   Outcome                   768 non-null    int64  \n",
"dtypes: float64(2), int64(7)\n",
"memory usage: 54.1 KB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pregnancies</th>\n",
" <th>Glucose</th>\n",
" <th>BloodPressure</th>\n",
" <th>SkinThickness</th>\n",
" <th>Insulin</th>\n",
" <th>BMI</th>\n",
" <th>DiabetesPedigreeFunction</th>\n",
" <th>Age</th>\n",
" <th>Outcome</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6</td>\n",
" <td>148</td>\n",
" <td>72</td>\n",
" <td>35</td>\n",
" <td>0</td>\n",
" <td>33.6</td>\n",
" <td>0.627</td>\n",
" <td>50</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>85</td>\n",
" <td>66</td>\n",
" <td>29</td>\n",
" <td>0</td>\n",
" <td>26.6</td>\n",
" <td>0.351</td>\n",
" <td>31</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8</td>\n",
" <td>183</td>\n",
" <td>64</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>23.3</td>\n",
" <td>0.672</td>\n",
" <td>32</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>89</td>\n",
" <td>66</td>\n",
" <td>23</td>\n",
" <td>94</td>\n",
" <td>28.1</td>\n",
" <td>0.167</td>\n",
" <td>21</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>137</td>\n",
" <td>40</td>\n",
" <td>35</td>\n",
" <td>168</td>\n",
" <td>43.1</td>\n",
" <td>2.288</td>\n",
" <td>33</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \\\n",
"0            6      148             72             35        0  33.6   \n",
"1            1       85             66             29        0  26.6   \n",
"2            8      183             64              0        0  23.3   \n",
"3            1       89             66             23       94  28.1   \n",
"4            0      137             40             35      168  43.1   \n",
"\n",
"   DiabetesPedigreeFunction  Age  Outcome  \n",
"0                     0.627   50        1  \n",
"1                     0.351   31        0  \n",
"2                     0.672   32        1  \n",
"3                     0.167   21        0  \n",
"4                     2.288   33        1  "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"./static/csv/diabetes.csv\", sep=\",\")\n",
"print('number of columns: ' + str(df.columns.size))\n",
"print('columns: ' + ', '.join(df.columns))\n",
"\n",
"df.info()\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting information about missing data\n",
"\n",
"Types of missing data:\n",
"\n",
"* None - Python's representation of a missing value\n",
"* NaN - pandas' representation of a missing value\n",
"* '' - an empty string"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pregnancies                 0\n",
"Glucose                     0\n",
"BloodPressure               0\n",
"SkinThickness               0\n",
"Insulin                     0\n",
"BMI                         0\n",
"DiabetesPedigreeFunction    0\n",
"Age                         0\n",
"Outcome                     0\n",
"dtype: int64\n",
"\n",
"Pregnancies                 False\n",
"Glucose                     False\n",
"BloodPressure               False\n",
"SkinThickness               False\n",
"Insulin                     False\n",
"BMI                         False\n",
"DiabetesPedigreeFunction    False\n",
"Age                         False\n",
"Outcome                     False\n",
"dtype: bool\n",
"\n"
]
}
],
"source": [
"# Number of null values per feature\n",
"print(df.isnull().sum())\n",
"\n",
"print()\n",
"\n",
"# Whether each feature has any null values\n",
"print(df.isnull().any())\n",
"\n",
"print()\n",
"\n",
"# Percentage of null values per feature (printed only when non-zero)\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f\"{i} null percentage: {null_rate:.2f}%\")"
]
},
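{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging by the statistics above, there are no null values. Note that `isnull()` does not flag zeros, so the zeros visible in columns such as Insulin and SkinThickness may still stand for missing measurements. Let's check the dataset for outliers:"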
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of outliers in column 'Pregnancies': 4\n",
"Number of outliers in column 'Glucose': 5\n",
"Number of outliers in column 'BloodPressure': 45\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"df = pd.read_csv(\"./static/csv/diabetes.csv\")\n",
"\n",
"# Numeric columns to plot\n",
"numeric_columns = ['Pregnancies', 'Glucose', 'BloodPressure']\n",
"\n",
"# Columns to check for outliers\n",
"columns_to_check = ['Pregnancies', 'Glucose', 'BloodPressure']\n",
"\n",
"# Count outliers per column using the 1.5 * IQR rule\n",
"def count_outliers(df, columns):\n",
"    outliers_count = {}\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Count the rows that fall outside the bounds\n",
"        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
"        outliers_count[col] = len(outliers)\n",
"\n",
"    return outliers_count\n",
"\n",
"# Count the outliers\n",
"outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"# Plot the histograms\n",
"plt.figure(figsize=(15, 10))\n",
"for i, col in enumerate(numeric_columns, 1):\n",
"    plt.subplot(2, 3, i)\n",
"    sns.histplot(df[col], kde=True)\n",
"    plt.title(f'Histogram of {col}')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of outliers in the 'Pregnancies' and 'Glucose' columns is small enough to leave alone, which cannot be said of the 'BloodPressure' column. Let's remove the outliers from that column:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows removed: 45\n",
"Number of outliers in column 'BloodPressure': 4\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Columns to clean\n",
"columns_to_clean = ['BloodPressure']\n",
"\n",
"# Drop rows whose values fall outside the 1.5 * IQR bounds\n",
"def remove_outliers(df, columns):\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Keep only the rows that lie within the bounds\n",
"        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]\n",
"\n",
"    return df\n",
"\n",
"# Remove the outliers\n",
"df_cleaned = remove_outliers(df, columns_to_clean)\n",
"\n",
"# Print the number of rows removed\n",
"print(f\"Number of rows removed: {len(df) - len(df_cleaned)}\")\n",
"\n",
"df = df_cleaned\n",
"\n",
"# Columns to check for outliers\n",
"columns_to_check = ['BloodPressure']\n",
"\n",
"# Count outliers per column using the 1.5 * IQR rule\n",
"def count_outliers(df, columns):\n",
"    outliers_count = {}\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Count the rows that fall outside the bounds\n",
"        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
"        outliers_count[col] = len(outliers)\n",
"\n",
"    return outliers_count\n",
"\n",
"# Count the outliers again, now on the cleaned data\n",
"outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"\n",
"# Plot a histogram of the cleaned data\n",
"plt.figure(figsize=(15, 6))\n",
"\n",
"# Histogram of BloodPressure\n",
"plt.subplot(1, 2, 1)\n",
"sns.histplot(df_cleaned['BloodPressure'], kde=True)\n",
"plt.title('Histogram of Blood Pressure (Cleaned)')\n",
"plt.xlabel('Blood Pressure')\n",
"plt.ylabel('Frequency')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
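{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging by the output and histogram above, the number of outliers has dropped substantially, from 45 to 4, and is now within an acceptable range. The few that remain appear because the IQR bounds are recomputed on the cleaned data. We can now split the dataset into training, validation, and test sets:"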
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 433\n",
"Validation set size: 145\n",
"Test set size: 145\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split into features (X) and target variable (y)\n",
"X = df.drop('Outcome', axis=1)  # Features\n",
"y = df['Outcome']  # Target variable\n",
"\n",
"# Split into the training set and the remainder (validation + test)\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"\n",
"# Split the remainder into validation and test sets\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Print the size of each set\n",
"print(\"Training set size:\", X_train.shape[0])\n",
"print(\"Validation set size:\", X_val.shape[0])\n",
"print(\"Test set size:\", X_test.shape[0])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Class balance of the training set:\n",
"Outcome\n",
"0    0.658199\n",
"1    0.341801\n",
"Name: proportion, dtype: float64\n",
"\n",
"Class balance of the validation set:\n",
"Outcome\n",
"0    0.655172\n",
"1    0.344828\n",
"Name: proportion, dtype: float64\n",
"\n",
"Class balance of the test set:\n",
"Outcome\n",
"0    0.662069\n",
"1    0.337931\n",
"Name: proportion, dtype: float64\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split into features (X) and target variable (y)\n",
"X = df.drop('Outcome', axis=1)  # Features\n",
"y = df['Outcome']  # Target variable\n",
"\n",
"# Stratified split into the training set and the remainder (validation + test)\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)\n",
"\n",
"# Stratified split of the remainder into validation and test sets\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)\n",
"\n",
"# Print the class proportions of each set\n",
"def check_balance(y_train, y_val, y_test):\n",
"    print(\"Class balance of the training set:\")\n",
"    print(y_train.value_counts(normalize=True))\n",
"\n",
"    print(\"\\nClass balance of the validation set:\")\n",
"    print(y_val.value_counts(normalize=True))\n",
"\n",
"    print(\"\\nClass balance of the test set:\")\n",
"    print(y_test.value_counts(normalize=True))\n",
"\n",
"# Check the class balance\n",
"check_balance(y_train, y_val, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above shows that the classes are only roughly balanced: about 66% negative to 34% positive in every set. To balance the training set, we augment it by oversampling with SMOTE:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Class balance of the training set after SMOTE:\n",
"Outcome\n",
"0    0.5\n",
"1    0.5\n",
"Name: proportion, dtype: float64\n",
"Class balance of the training set:\n",
"Outcome\n",
"0    0.5\n",
"1    0.5\n",
"Name: proportion, dtype: float64\n",
"\n",
"Class balance of the validation set:\n",
"Outcome\n",
"0    0.655172\n",
"1    0.344828\n",
"Name: proportion, dtype: float64\n",
"\n",
"Class balance of the test set:\n",
"Outcome\n",
"0    0.662069\n",
"1    0.337931\n",
"Name: proportion, dtype: float64\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from imblearn.over_sampling import SMOTE\n",
"\n",
"# Split into features (X) and target variable (y)\n",
"X = df.drop('Outcome', axis=1)  # Features\n",
"y = df['Outcome']  # Target variable\n",
"\n",
"# Stratified split into the training set and the remainder (validation + test)\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)\n",
"\n",
"# Stratified split of the remainder into validation and test sets\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)\n",
"\n",
"# Apply SMOTE to balance the training set only\n",
"smote = SMOTE(random_state=42)\n",
"X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)\n",
"\n",
"# Print the class proportions of each set\n",
"def check_balance(y_train, y_val, y_test):\n",
"    print(\"Class balance of the training set:\")\n",
"    print(y_train.value_counts(normalize=True))\n",
"\n",
"    print(\"\\nClass balance of the validation set:\")\n",
"    print(y_val.value_counts(normalize=True))\n",
"\n",
"    print(\"\\nClass balance of the test set:\")\n",
"    print(y_test.value_counts(normalize=True))\n",
"\n",
"# Check the class balance after SMOTE\n",
"print(\"Class balance of the training set after SMOTE:\")\n",
"print(y_train_resampled.value_counts(normalize=True))\n",
"\n",
"# Check the class balance of the resampled training set and the untouched validation and test sets\n",
"check_balance(y_train_resampled, y_val, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training set is now balanced; the validation and test sets keep their original class proportions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset 2: Starbucks Stock Price Dataset 📊🍵🧋🔥"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Starbucks Corporation is a world-famous coffeehouse chain founded in 1971 in Seattle, Washington, by Jerry Baldwin, Zev Siegl, and Gordon Bowker. Starting as a modest store selling high-quality coffee beans and equipment, Starbucks has grown into one of the largest coffeehouse chains in the world, with thousands of stores worldwide. Known for its premium coffee, innovative beverages, and distinctive customer service, Starbucks has become a cultural icon of the coffee industry.\n",
"\n",
"This dataset provides comprehensive information on Starbucks stock price changes over the last 25 years. It includes the key columns: date, opening price, daily high, daily low, closing price, adjusted closing price, and trading volume.\n",
"\n",
"The data is valuable for historical analysis, for forecasting future stock performance, and for understanding market trends related to Starbucks stock.\n",
"\n",
"* According to the dataset description, the objects of study are records of the stock's price movements.\n",
"* Object attributes: Date, Open, High, Low, Close, Adj Close, Volume\n",
"* The evident purpose of the dataset is to learn to predict stock prices.\n",
"\n",
"Example business goals:\n",
"* Predicting future stock prices: use historical data to forecast future Starbucks stock prices.\n",
"* Volatility analysis: estimate the stock's volatility from historical data to support better-informed investment decisions.\n",
"* Optimizing trading strategies: develop strategies for buying and selling the stock based on specific indicators or price behavior patterns."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of columns: 7\n",
"columns: Date, Open, High, Low, Close, Adj Close, Volume\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 8036 entries, 0 to 8035\n",
"Data columns (total 7 columns):\n",
" #   Column     Non-Null Count  Dtype  \n",
"---  ------     --------------  -----  \n",
" 0   Date       8036 non-null   object \n",
" 1   Open       8036 non-null   float64\n",
" 2   High       8036 non-null   float64\n",
" 3   Low        8036 non-null   float64\n",
" 4   Close      8036 non-null   float64\n",
" 5   Adj Close  8036 non-null   float64\n",
" 6   Volume     8036 non-null   int64  \n",
"dtypes: float64(5), int64(1), object(1)\n",
"memory usage: 439.6+ KB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Open</th>\n",
" <th>High</th>\n",
" <th>Low</th>\n",
" <th>Close</th>\n",
" <th>Adj Close</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1992-06-26</td>\n",
" <td>0.328125</td>\n",
" <td>0.347656</td>\n",
" <td>0.320313</td>\n",
" <td>0.335938</td>\n",
" <td>0.260703</td>\n",
" <td>224358400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1992-06-29</td>\n",
" <td>0.339844</td>\n",
" <td>0.367188</td>\n",
" <td>0.332031</td>\n",
" <td>0.359375</td>\n",
" <td>0.278891</td>\n",
" <td>58732800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1992-06-30</td>\n",
" <td>0.367188</td>\n",
" <td>0.371094</td>\n",
" <td>0.343750</td>\n",
" <td>0.347656</td>\n",
" <td>0.269797</td>\n",
" <td>34777600</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1992-07-01</td>\n",
" <td>0.351563</td>\n",
" <td>0.359375</td>\n",
" <td>0.339844</td>\n",
" <td>0.355469</td>\n",
" <td>0.275860</td>\n",
" <td>18316800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1992-07-02</td>\n",
" <td>0.359375</td>\n",
" <td>0.359375</td>\n",
" <td>0.347656</td>\n",
" <td>0.355469</td>\n",
" <td>0.275860</td>\n",
" <td>13996800</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"         Date      Open      High       Low     Close  Adj Close     Volume\n",
"0  1992-06-26  0.328125  0.347656  0.320313  0.335938   0.260703  224358400\n",
"1  1992-06-29  0.339844  0.367188  0.332031  0.359375   0.278891   58732800\n",
"2  1992-06-30  0.367188  0.371094  0.343750  0.347656   0.269797   34777600\n",
"3  1992-07-01  0.351563  0.359375  0.339844  0.355469   0.275860   18316800\n",
"4  1992-07-02  0.359375  0.359375  0.347656  0.355469   0.275860   13996800"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"./static/csv/sd.csv\", sep=\",\")\n",
"print('number of columns: ' + str(df.columns.size))\n",
"print('columns: ' + ', '.join(df.columns))\n",
"\n",
"df.info()\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting information about missing data\n",
"\n",
"Types of missing data:\n",
"\n",
"* None - Python's representation of a missing value\n",
"* NaN - pandas' representation of a missing value\n",
"* '' - an empty string"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date         0\n",
"Open         0\n",
"High         0\n",
"Low          0\n",
"Close        0\n",
"Adj Close    0\n",
"Volume       0\n",
"dtype: int64\n",
"\n",
"Date         False\n",
"Open         False\n",
"High         False\n",
"Low          False\n",
"Close        False\n",
"Adj Close    False\n",
"Volume       False\n",
"dtype: bool\n",
"\n"
]
}
],
"source": [
"# Number of null values per feature\n",
"print(df.isnull().sum())\n",
"\n",
"print()\n",
"\n",
"# Whether each feature has any null values\n",
"print(df.isnull().any())\n",
"\n",
"print()\n",
"\n",
"# Percentage of null values per feature (printed only when non-zero)\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f\"{i} null percentage: {null_rate:.2f}%\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging by the statistics above, there are no null values. Let's check the dataset for outliers:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Количество выбросов в столбце 'Open': 0\n",
"Количество выбросов в столбце 'High': 0\n",
"Количество выбросов в столбце 'Low': 0\n",
"Количество выбросов в столбце 'Close': 0\n",
"Количество выбросов в столбце 'Adj Close': 7\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 5 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"df = pd.read_csv(\".//static//csv//sd.csv\")\n",
"\n",
"# Select the numeric columns\n",
"numeric_columns = ['Open','High','Low','Close','Adj Close']\n",
"\n",
"# Select the columns to analyze\n",
"columns_to_check = ['Open','High','Low','Close','Adj Close']\n",
"\n",
"# Count outliers per column using the 1.5 * IQR rule\n",
"def count_outliers(df, columns):\n",
"    outliers_count = {}\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Count the rows that fall outside the bounds\n",
"        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
"        outliers_count[col] = len(outliers)\n",
"\n",
"    return outliers_count\n",
"\n",
"# Count the outliers\n",
"outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Количество выбросов в столбце '{col}': {count}\")\n",
"\n",
"# Plot the histograms\n",
"plt.figure(figsize=(15, 10))\n",
"for i, col in enumerate(numeric_columns, 1):\n",
"    plt.subplot(2, 3, i)\n",
"    sns.histplot(df[col], kde=True)\n",
"    plt.title(f'Histogram of {col}')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
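{
"cell_type": "markdown",
"metadata": {},
"source": [
"The outlier count above relies on the 1.5 * IQR rule: a value is flagged when it lies more than 1.5 interquartile ranges below the first quartile or above the third. A minimal worked sketch on a made-up series (`toy_values` is an assumption for illustration, not dataset data), using the default quantile interpolation in pandas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical values: eight ordinary points and one extreme point\n",
"toy_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])\n",
"\n",
"toy_q1 = toy_values.quantile(0.25)\n",
"toy_q3 = toy_values.quantile(0.75)\n",
"toy_iqr = toy_q3 - toy_q1\n",
"\n",
"# Fences beyond which a value counts as an outlier\n",
"toy_lower = toy_q1 - 1.5 * toy_iqr\n",
"toy_upper = toy_q3 + 1.5 * toy_iqr\n",
"\n",
"toy_outliers = toy_values[(toy_values < toy_lower) | (toy_values > toy_upper)]\n",
"print(toy_lower, toy_upper, toy_outliers.tolist())"
]
},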
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging by the histograms and the counts above, outliers are either absent or few enough to be acceptable. The dataset can now be split into samples, this time using explicitly written helper functions for data augmentation (oversampling and undersampling): "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"label_encoder = LabelEncoder()\n",
"\n",
"# Helper that applies random oversampling\n",
"def apply_oversampling(X, y):\n",
"    oversampler = RandomOverSampler(random_state=42)\n",
"    X_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"    return X_resampled, y_resampled\n",
"\n",
"# Helper that applies random undersampling\n",
"def apply_undersampling(X, y):\n",
"    undersampler = RandomUnderSampler(random_state=42)\n",
"    X_resampled, y_resampled = undersampler.fit_resample(X, y)\n",
"    return X_resampled, y_resampled\n",
"\n",
"def split_stratified_into_train_val_test(\n",
"    df_input,\n",
"    stratify_colname=\"y\",\n",
"    frac_train=0.6,\n",
"    frac_val=0.15,\n",
"    frac_test=0.25,\n",
"    random_state=None,\n",
"):\n",
"    \"\"\"\n",
"    Splits a Pandas dataframe into three subsets (train, val, and test)\n",
"    following fractional ratios provided by the user, where each subset is\n",
"    stratified by the values in a specific column (that is, each subset has\n",
"    the same relative frequency of the values in the column). It performs this\n",
"    splitting by running train_test_split() twice.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    df_input : Pandas dataframe\n",
"        Input dataframe to be split.\n",
"    stratify_colname : str\n",
"        The name of the column that will be used for stratification. Usually\n",
"        this column would be for the label.\n",
"    frac_train : float\n",
"    frac_val : float\n",
"    frac_test : float\n",
"        The ratios with which the dataframe will be split into train, val, and\n",
"        test data. The values should be expressed as float fractions and should\n",
"        sum to 1.0.\n",
"    random_state : int, None, or RandomStateInstance\n",
"        Value to be passed to train_test_split().\n",
"\n",
"    Returns\n",
"    -------\n",
"    df_train, df_val, df_test :\n",
"        Dataframes containing the three splits.\n",
"    \"\"\"\n",
"\n",
"    # Compare with a tolerance: a floating-point sum of fractions can differ\n",
"    # from 1.0 by rounding error even when the fractions are valid.\n",
"    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:\n",
"        raise ValueError(\n",
"            \"fractions %f, %f, %f do not add up to 1.0\"\n",
"            % (frac_train, frac_val, frac_test)\n",
"        )\n",
"\n",
"    if stratify_colname not in df_input.columns:\n",
"        raise ValueError(\"%s is not a column in the dataframe\" % (stratify_colname))\n",
"\n",
"    X = df_input  # Contains all columns.\n",
"    y = df_input[\n",
"        [stratify_colname]\n",
"    ]  # Dataframe of just the column on which to stratify.\n",
"\n",
"    # Split original dataframe into train and temp dataframes.\n",
"    df_train, df_temp, y_train, y_temp = train_test_split(\n",
"        X, y, stratify=y, test_size=(1.0 - frac_train), random_state=random_state\n",
"    )\n",
"\n",
"    # Split the temp dataframe into val and test dataframes.\n",
"    relative_frac_test = frac_test / (frac_val + frac_test)\n",
"    df_val, df_test, y_val, y_test = train_test_split(\n",
"        df_temp,\n",
"        y_temp,\n",
"        stratify=y_temp,\n",
"        test_size=relative_frac_test,\n",
"        random_state=random_state,\n",
"    )\n",
"\n",
"    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)\n",
"\n",
"    return df_train, df_val, df_test"
]
},
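{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function above reaches three subsets with two calls to `train_test_split()`: the first call holds out `1 - frac_train` of the rows, and the second divides that remainder between validation and test. The arithmetic can be checked on its own. In the sketch below, `two_stage_sizes` is a hypothetical helper written only for illustration; real subset sizes may differ by a row because scikit-learn rounds to whole rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper for illustration; it is not used by the function above\n",
"def two_stage_sizes(n_rows, frac_train, frac_val, frac_test):\n",
"    # First split: everything except the training share goes to a temporary set\n",
"    n_temp = n_rows * (1.0 - frac_train)\n",
"    # Second split: share of the temporary set that becomes the test set\n",
"    relative_frac_test = frac_test / (frac_val + frac_test)\n",
"    return {\n",
"        'train': n_rows * frac_train,\n",
"        'val': n_temp * (1.0 - relative_frac_test),\n",
"        'test': n_temp * relative_frac_test,\n",
"        'relative_frac_test': relative_frac_test,\n",
"    }\n",
"\n",
"# With a 60/20/20 split, the second call divides the remainder in half\n",
"sizes = two_stage_sizes(1000, 0.60, 0.20, 0.20)\n",
"print(sizes)"
]
},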
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Обучающая выборка: (4821, 4)\n",
"Volume_Grouped\n",
" 0 2802\n",
" 1 1460\n",
" 2 369\n",
" 3 111\n",
" 4 40\n",
" 5 18\n",
"-1 10\n",
" 6 7\n",
" 7 4\n",
"Name: count, dtype: int64\n",
"Обучающая выборка после oversampling: (25218, 4)\n",
"Volume_Grouped\n",
" 0 2802\n",
" 2 2802\n",
" 1 2802\n",
" 5 2802\n",
" 3 2802\n",
" 4 2802\n",
" 7 2802\n",
"-1 2802\n",
" 6 2802\n",
"Name: count, dtype: int64\n",
"Контрольная выборка: (1607, 4)\n",
"Volume_Grouped\n",
" 0 934\n",
" 1 487\n",
" 2 123\n",
" 3 37\n",
" 4 13\n",
" 5 6\n",
"-1 4\n",
" 6 2\n",
" 7 1\n",
"Name: count, dtype: int64\n",
"Тестовая выборка: (1608, 4)\n",
"Volume_Grouped\n",
" 0 934\n",
" 1 487\n",
" 2 124\n",
" 3 37\n",
" 4 14\n",
" 5 6\n",
"-1 3\n",
" 6 2\n",
" 7 1\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"# Bin the continuous Volume column so that it can be used for stratification\n",
"data = df[[\"Volume\", \"High\", \"Low\"]].copy()\n",
"data[\"Volume_Grouped\"] = pd.cut(data[\"Volume\"], bins=50, labels=False)\n",
"\n",
"interval_counts = data[\"Volume_Grouped\"].value_counts().sort_index()\n",
"\n",
"# Merge sparsely populated bins into a single group labelled -1\n",
"min_samples_per_interval = 5\n",
"for interval, count in interval_counts.items():\n",
"    if count < min_samples_per_interval:\n",
"        data.loc[data[\"Volume_Grouped\"] == interval, \"Volume_Grouped\"] = -1\n",
"\n",
"\n",
"df_coffee_train, df_coffee_val, df_coffee_test = split_stratified_into_train_val_test(\n",
"    data, stratify_colname=\"Volume_Grouped\", frac_train=0.60, frac_val=0.20, frac_test=0.20)\n",
"\n",
"print(\"Обучающая выборка: \", df_coffee_train.shape)\n",
"print(df_coffee_train[\"Volume_Grouped\"].value_counts())\n",
"\n",
"# Despite the `_adasyn` suffix, the resampling here is random oversampling (RandomOverSampler), not ADASYN\n",
"X_resampled, y_resampled = apply_oversampling(df_coffee_train, df_coffee_train[\"Volume_Grouped\"])\n",
"df_coffee_train_adasyn = pd.DataFrame(X_resampled)\n",
"\n",
"print(\"Обучающая выборка после oversampling: \", df_coffee_train_adasyn.shape)\n",
"print(df_coffee_train_adasyn[\"Volume_Grouped\"].value_counts())\n",
"\n",
"print(\"Контрольная выборка: \", df_coffee_val.shape)\n",
"print(df_coffee_val[\"Volume_Grouped\"].value_counts())\n",
"\n",
"print(\"Тестовая выборка: \", df_coffee_test.shape)\n",
"print(df_coffee_test[\"Volume_Grouped\"].value_counts())"
]
},
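{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stratified splitting needs a categorical column, so the cell above first cuts the continuous target into equal-width bins and then merges sparsely populated bins into one group labelled `-1`. A minimal sketch of the same two steps on made-up numbers (the frame `toy_bins`, the choice of four bins and the threshold of 2 are assumptions for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data: nine small values and one very large value\n",
"toy_bins = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})\n",
"\n",
"# Equal-width binning; labels=False returns integer bin indices\n",
"toy_bins['group'] = pd.cut(toy_bins['value'], bins=4, labels=False)\n",
"\n",
"# Bins with fewer than 2 rows are merged into the catch-all group -1\n",
"group_counts = toy_bins['group'].value_counts()\n",
"rare_groups = group_counts[group_counts < 2].index\n",
"toy_bins.loc[toy_bins['group'].isin(rare_groups), 'group'] = -1\n",
"\n",
"print(toy_bins['group'].value_counts())"
]
},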
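{
"cell_type": "markdown",
"metadata": {},
"source": [
"After oversampling, the training set is balanced: each of the 9 groups contains 2802 rows. The validation and test sets keep their original distribution."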
]
},
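{
"cell_type": "markdown",
"metadata": {},
"source": [
"Random oversampling, as used above, duplicates rows of the smaller groups (sampling with replacement) until every group is as large as the largest one. The sketch below reproduces that idea in plain pandas on a made-up frame. It is an approximation for illustration and not the `imblearn` implementation; `toy_train` and its columns are hypothetical. As in the notebook, only the training split would be resampled, so validation and test data keep their original distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical imbalanced training frame: four rows of class 0, two of class 1\n",
"toy_train = pd.DataFrame({'x': range(6), 'label': [0, 0, 0, 0, 1, 1]})\n",
"\n",
"# Every class is grown to the size of the largest class\n",
"target_size = int(toy_train['label'].value_counts().max())\n",
"\n",
"parts = []\n",
"for label, group in toy_train.groupby('label'):\n",
"    parts.append(group)\n",
"    deficit = target_size - len(group)\n",
"    if deficit > 0:\n",
"        # Draw the missing rows from the same class, with replacement\n",
"        parts.append(group.sample(n=deficit, replace=True, random_state=42))\n",
"\n",
"toy_balanced = pd.concat(parts, ignore_index=True)\n",
"print(toy_balanced['label'].value_counts())"
]
},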
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset 3: Supermarket store branches sales analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A supermarket is a self-service store offering a wide range of food, beverages and household products, organized into sections. It is larger and has a wider selection than earlier grocery stores, but is smaller and more limited in its range of merchandise than a hypermarket or big-box market. In everyday U.S. usage, however, \"grocery store\" is synonymous with \"supermarket\" and is not used to refer to other types of stores that sell groceries.\n",
"\n",
"* According to the dataset description, the objects of study are stores.\n",
"* Object attributes: Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales\n",
"* The evident goal of this dataset is to learn to predict sales volume from characteristics such as store area and other factors.\n",
"\n",
"Examples of business goals:\n",
"* Optimizing store operations. This may include identifying the store characteristics (for example, area or location) that affect sales most strongly, and developing strategies to raise sales based on those factors.\n",
"* Expanding or relocating stores, where the data can inform decisions on choosing locations, using space optimally and laying out stores to maximize sales.\n",
"* Managing inventory and resources, since understanding how store area affects sales volume helps to manage stock and allocate resources better."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"количество колонок: 5\n",
"колонки: Store ID , Store_Area, Items_Available, Daily_Customer_Count, Store_Sales\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 896 entries, 0 to 895\n",
"Data columns (total 5 columns):\n",
" # Column Non-Null Count Dtype\n",
"--- ------ -------------- -----\n",
" 0 Store ID 896 non-null int64\n",
" 1 Store_Area 896 non-null int64\n",
" 2 Items_Available 896 non-null int64\n",
" 3 Daily_Customer_Count 896 non-null int64\n",
" 4 Store_Sales 896 non-null int64\n",
"dtypes: int64(5)\n",
"memory usage: 35.1 KB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Store ID</th>\n",
" <th>Store_Area</th>\n",
" <th>Items_Available</th>\n",
" <th>Daily_Customer_Count</th>\n",
" <th>Store_Sales</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1659</td>\n",
" <td>1961</td>\n",
" <td>530</td>\n",
" <td>66490</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1461</td>\n",
" <td>1752</td>\n",
" <td>210</td>\n",
" <td>39820</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1340</td>\n",
" <td>1609</td>\n",
" <td>720</td>\n",
" <td>54010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1451</td>\n",
" <td>1748</td>\n",
" <td>620</td>\n",
" <td>53730</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1770</td>\n",
" <td>2111</td>\n",
" <td>450</td>\n",
" <td>46620</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Store ID Store_Area Items_Available Daily_Customer_Count Store_Sales\n",
"0 1 1659 1961 530 66490\n",
"1 2 1461 1752 210 39820\n",
"2 3 1340 1609 720 54010\n",
"3 4 1451 1748 620 53730\n",
"4 5 1770 2111 450 46620"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\".//static//csv//Stores.csv\", sep=\",\")\n",
"print('количество колонок: ' + str(df.columns.size)) \n",
"print('колонки: ' + ', '.join(df.columns))\n",
"\n",
"df.info()\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting information about missing data\n",
"\n",
"Types of missing data:\n",
"\n",
"* None - how empty values are represented in Python\n",
"* NaN - how empty values are represented in Pandas\n",
"* '' - an empty string"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Store ID 0\n",
"Store_Area 0\n",
"Items_Available 0\n",
"Daily_Customer_Count 0\n",
"Store_Sales 0\n",
"dtype: int64\n",
"\n",
"Store ID False\n",
"Store_Area False\n",
"Items_Available False\n",
"Daily_Customer_Count False\n",
"Store_Sales False\n",
"dtype: bool\n",
"\n"
]
}
],
"source": [
"# Number of missing values per feature\n",
"print(df.isnull().sum())\n",
"\n",
"print()\n",
"\n",
"# Whether each feature has any missing values\n",
"print(df.isnull().any())\n",
"\n",
"print()\n",
"\n",
"# Percentage of missing values per feature (printed only for features that have any)\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f\"{i} percentage of missing values: {null_rate:.2f}%\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to the statistics above, there are no missing values. Next, check the dataset for outliers:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Количество выбросов в столбце 'Store_Area': 5\n",
"Количество выбросов в столбце 'Items_Available': 5\n",
"Количество выбросов в столбце 'Daily_Customer_Count': 3\n",
"Количество выбросов в столбце 'Store_Sales': 1\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"df = pd.read_csv(\".//static//csv//Stores.csv\")\n",
"\n",
"# Select the numeric columns\n",
"numeric_columns = ['Store_Area','Items_Available','Daily_Customer_Count','Store_Sales']\n",
"\n",
"# Select the columns to analyze\n",
"columns_to_check = ['Store_Area','Items_Available','Daily_Customer_Count','Store_Sales']\n",
"\n",
"# Count outliers per column using the 1.5 * IQR rule\n",
"def count_outliers(df, columns):\n",
"    outliers_count = {}\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Count the rows that fall outside the bounds\n",
"        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
"        outliers_count[col] = len(outliers)\n",
"\n",
"    return outliers_count\n",
"\n",
"# Count the outliers\n",
"outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"# Print the number of outliers for each column\n",
"for col, count in outliers_count.items():\n",
"    print(f\"Количество выбросов в столбце '{col}': {count}\")\n",
"\n",
"# Plot the histograms\n",
"plt.figure(figsize=(15, 10))\n",
"for i, col in enumerate(numeric_columns, 1):\n",
"    plt.subplot(2, 3, i)\n",
"    sns.histplot(df[col], kde=True)\n",
"    plt.title(f'Histogram of {col}')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging by the histograms and the counts above, outliers are either absent or few enough to be acceptable. The dataset can now be split into samples: "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Обучающая выборка: (537, 4)\n",
"Sales_Grouped\n",
" 2 184\n",
" 3 148\n",
" 1 135\n",
" 4 45\n",
" 0 20\n",
"-1 5\n",
"Name: count, dtype: int64\n",
"Обучающая выборка после oversampling: (1104, 4)\n",
"Sales_Grouped\n",
" 1 184\n",
" 2 184\n",
" 3 184\n",
" 4 184\n",
" 0 184\n",
"-1 184\n",
"Name: count, dtype: int64\n",
"Контрольная выборка: (179, 4)\n",
"Sales_Grouped\n",
" 2 61\n",
" 3 49\n",
" 1 45\n",
" 4 15\n",
" 0 7\n",
"-1 2\n",
"Name: count, dtype: int64\n",
"Тестовая выборка: (180, 4)\n",
"Sales_Grouped\n",
" 2 61\n",
" 3 50\n",
" 1 45\n",
" 4 15\n",
" 0 7\n",
"-1 2\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"# Bin the continuous Store_Sales column so that it can be used for stratification\n",
"data = df[[\"Store_Sales\", \"Store_Area\", \"Daily_Customer_Count\"]].copy()\n",
"data[\"Sales_Grouped\"] = pd.cut(data[\"Store_Sales\"], bins=6, labels=False)\n",
"\n",
"interval_counts = data[\"Sales_Grouped\"].value_counts().sort_index()\n",
"\n",
"# Merge sparsely populated bins into a single group labelled -1\n",
"min_samples_per_interval = 10\n",
"for interval, count in interval_counts.items():\n",
"    if count < min_samples_per_interval:\n",
"        data.loc[data[\"Sales_Grouped\"] == interval, \"Sales_Grouped\"] = -1\n",
"\n",
"df_shop_train, df_shop_val, df_shop_test = split_stratified_into_train_val_test(\n",
"    data, stratify_colname=\"Sales_Grouped\", frac_train=0.60, frac_val=0.20, frac_test=0.20)\n",
"\n",
"\n",
"print(\"Обучающая выборка: \", df_shop_train.shape)\n",
"print(df_shop_train[\"Sales_Grouped\"].value_counts())\n",
"\n",
"# Despite the `_adasyn` suffix, the resampling here is random oversampling (RandomOverSampler), not ADASYN\n",
"X_resampled, y_resampled = apply_oversampling(df_shop_train, df_shop_train[\"Sales_Grouped\"])\n",
"df_shop_train_adasyn = pd.DataFrame(X_resampled)\n",
"\n",
"print(\"Обучающая выборка после oversampling: \", df_shop_train_adasyn.shape)\n",
"print(df_shop_train_adasyn[\"Sales_Grouped\"].value_counts())\n",
"\n",
"print(\"Контрольная выборка: \", df_shop_val.shape)\n",
"print(df_shop_val[\"Sales_Grouped\"].value_counts())\n",
"\n",
"print(\"Тестовая выборка: \", df_shop_test.shape)\n",
"print(df_shop_test[\"Sales_Grouped\"].value_counts())"
]
},
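{
"cell_type": "markdown",
"metadata": {},
"source": [
"After oversampling, the training set is balanced: each of the 6 groups contains 184 rows. The validation and test sets keep their original distribution."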
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}