AIM-PIbd-32-Fedorenko-G-Y/lab_3/lab3.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Jio Mart Product Items"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['category', 'sub_category', 'href', 'items', 'price'], dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Business goals**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Business goal: Reduce costs and increase sales by optimizing product prices.\n",
"Technical goal: Build a machine learning model that predicts whether an item is overpriced for its category."
]
},
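{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how a target for the technical goal above could be built. The rule used here (price above 1.5 times the category median), the toy values and the column name `is_overpriced` are assumptions made for illustration; they are not taken from the dataset or from the rest of this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy data standing in for jio_mart_items.csv (values are made up)\n",
"toy = pd.DataFrame({\n",
"    'category': ['Groceries', 'Groceries', 'Groceries', 'Fashion', 'Fashion'],\n",
"    'price': [40.0, 50.0, 200.0, 500.0, 600.0],\n",
"})\n",
"\n",
"# Hypothetical rule: an item counts as overpriced if it costs more than\n",
"# 1.5x the median price of its own category\n",
"median_by_category = toy.groupby('category')['price'].transform('median')\n",
"toy['is_overpriced'] = toy['price'] > 1.5 * median_by_category\n",
"\n",
"print(toy)"
]
},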
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Business goal: Optimize the distribution of items across categories.\n",
"Technical goal: Build a machine learning model that predicts optimal item prices from their categories, subcategories and current prices."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Data preparation**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing values per column:\n",
"category 0\n",
"sub_category 0\n",
"href 0\n",
"items 0\n",
"price 0\n",
"dtype: int64\n"
]
}
],
"source": [
"# Check for missing values\n",
"missing_data = df.isnull().sum()\n",
"print(\"Missing values per column:\")\n",
"print(missing_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Splitting each dataset into training, validation and test sets**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Split sizes:\n",
"Training set: 18000 rows\n",
"Validation set: 6000 rows\n",
"Test set: 6000 rows\n"
]
},
{
"data": {
"image/png": "<base64 PNG data truncated in the source>",
"text/plain": [
"<Figure size 1800x600 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Separate the features from the target variable\n",
"X = df.drop(columns=['price']) # Features (all columns except 'price')\n",
"y = df['price'] # Target variable (price)\n",
"\n",
"# Split into training (60%), validation (20%) and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the split sizes\n",
"print(\"Split sizes:\")\n",
"print(f\"Training set: {X_train.shape[0]} rows\")\n",
"print(f\"Validation set: {X_val.shape[0]} rows\")\n",
"print(f\"Test set: {X_test.shape[0]} rows\")\n",
"\n",
"# Plot the price distribution in each split\n",
"plt.figure(figsize=(18, 6))\n",
"\n",
"plt.subplot(1, 3, 1)\n",
"plt.hist(y_train, bins=30, color='blue', alpha=0.7)\n",
"plt.title('Training set')\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.subplot(1, 3, 2)\n",
"plt.hist(y_val, bins=30, color='green', alpha=0.7)\n",
"plt.title('Validation set')\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.subplot(1, 3, 3)\n",
"plt.hist(y_test, bins=30, color='red', alpha=0.7)\n",
"plt.title('Test set')\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Balancing the samples**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Split sizes:\n",
"Training set: 18000 rows\n",
"Validation set: 6000 rows\n",
"Test set: 6000 rows\n"
]
},
{
"data": {
"image/png": "<base64 PNG data truncated in the source>",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Load the data\n",
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
"\n",
"# Separate the features from the target variable\n",
"X = df.drop(columns=['price']) # Features (all columns except 'price')\n",
"y = df['price'] # Target variable (price)\n",
"\n",
"# One-hot encode the categorical features\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Split into training (60%), validation (20%) and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the split sizes\n",
"print(\"Split sizes:\")\n",
"print(f\"Training set: {X_train.shape[0]} rows\")\n",
"print(f\"Validation set: {X_val.shape[0]} rows\")\n",
"print(f\"Test set: {X_test.shape[0]} rows\")\n",
"\n",
"# Remove outliers from the training set (prices above the 95th percentile)\n",
"upper_limit = y_train.quantile(0.95)\n",
"X_train = X_train[y_train <= upper_limit]\n",
"y_train = y_train[y_train <= upper_limit]\n",
"\n",
"# Log-transform the target variable\n",
"y_train_log = np.log1p(y_train)\n",
"y_val_log = np.log1p(y_val)\n",
"y_test_log = np.log1p(y_test)\n",
"\n",
"# Standardize the features (the scaler is fitted on the training set only)\n",
"scaler = StandardScaler()\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"X_val_scaled = scaler.transform(X_val)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"# Plot the price distribution in the balanced training set\n",
"plt.figure(figsize=(10, 6))\n",
"plt.hist(y_train_log, bins=30, color='orange', alpha=0.7)\n",
"plt.title('Balanced training set (log transform)')\n",
"plt.xlabel('Log of price')\n",
"plt.ylabel('Count')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**One-hot encoding of categorical features**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before one-hot encoding:\n",
" category ... price\n",
"0 Groceries ... 109.0\n",
"1 Groceries ... 49.0\n",
"2 Groceries ... 69.0\n",
"3 Groceries ... 125.0\n",
"4 Groceries ... 44.0\n",
"\n",
"[5 rows x 5 columns]\n",
"\n",
"Data after one-hot encoding:\n",
" price ... items_ Hilife Pantyliners\n",
"0 109.0 ... False\n",
"1 49.0 ... False\n",
"2 69.0 ... False\n",
"3 125.0 ... False\n",
"4 44.0 ... False\n",
"\n",
"[5 rows x 28392 columns]\n"
]
}
],
"source": [
"import pandas as pd\n",
"df1 = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
"\n",
"print(\"Data before one-hot encoding:\")\n",
"print(df1.head())\n",
"\n",
"# One-hot encode the categorical features\n",
"df_encoded = pd.get_dummies(df1, drop_first=True)\n",
"\n",
"print(\"\\nData after one-hot encoding:\")\n",
"print(df_encoded.head())"
]
},
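{
"cell_type": "markdown",
"metadata": {},
"source": [
"The one-hot encoding cell above encodes every text column, including the free-text `items` and `href` columns, which is why the result has 28392 columns. The following is a hedged sketch of a narrower variant that encodes only the low-cardinality columns. The choice of columns and the toy values are assumptions made for illustration, not part of the original lab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy data standing in for jio_mart_items.csv (values are made up)\n",
"toy = pd.DataFrame({\n",
"    'category': ['Groceries', 'Groceries', 'Fashion'],\n",
"    'sub_category': ['Fruits & Vegetables', 'Dairy', 'Men'],\n",
"    'items': ['Apple 1 kg', 'Milk 1 L', 'T-shirt'],\n",
"    'price': [109.0, 49.0, 399.0],\n",
"})\n",
"\n",
"# Encode only the columns with few distinct values; free-text columns\n",
"# such as 'items' are passed through unchanged\n",
"encoded = pd.get_dummies(toy, columns=['category', 'sub_category'], drop_first=True)\n",
"\n",
"print(encoded.columns.tolist())"
]
},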
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Discretization of numeric features**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before discretization:\n",
" category ... price\n",
"0 Groceries ... 109.0\n",
"1 Groceries ... 49.0\n",
"2 Groceries ... 69.0\n",
"3 Groceries ... 125.0\n",
"4 Groceries ... 44.0\n",
"\n",
"[5 rows x 5 columns]\n",
"\n",
"Data after discretization:\n",
" price price_bins\n",
"0 109.0 100-500\n",
"1 49.0 0-100\n",
"2 69.0 0-100\n",
"3 125.0 100-500\n",
"4 44.0 0-100\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Load the data\n",
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
"\n",
"print(\"Data before discretization:\")\n",
"print(df.head())\n",
"\n",
"# Define the bin edges and labels for discretization\n",
"bins = [0, 100, 500, 1000, 5000, float('inf')]\n",
"labels = ['0-100', '100-500', '500-1000', '1000-5000', '5000+']\n",
"\n",
"# Apply the discretization (right=False: each bin includes its left edge)\n",
"df['price_bins'] = pd.cut(df['price'], bins=bins, labels=labels, right=False)\n",
"\n",
"print(\"\\nData after discretization:\")\n",
"print(df[['price', 'price_bins']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Manual feature synthesis**"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before feature synthesis:\n",
" category ... price\n",
"0 Groceries ... 109.0\n",
"1 Groceries ... 49.0\n",
"2 Groceries ... 69.0\n",
"3 Groceries ... 125.0\n",
"4 Groceries ... 44.0\n",
"\n",
"[5 rows x 5 columns]\n",
"\n",
"Data after synthesizing the 'relative_price' feature:\n",
" price category relative_price\n",
"0 109.0 Groceries 0.247286\n",
"1 49.0 Groceries 0.111165\n",
"2 69.0 Groceries 0.156539\n",
"3 125.0 Groceries 0.283584\n",
"4 44.0 Groceries 0.099822\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Load the data\n",
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
"\n",
"# Inspect the first rows of the data\n",
"print(\"Data before feature synthesis:\")\n",
"print(df.head())\n",
"\n",
"# Compute the mean price per category\n",
"mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
"\n",
"# Create the new feature 'relative_price' (price relative to the category mean)\n",
"df['relative_price'] = df['price'] / mean_price_by_category\n",
"\n",
"# Inspect the first rows after feature synthesis\n",
"print(\"\\nData after synthesizing the 'relative_price' feature:\")\n",
"print(df[['price', 'category', 'relative_price']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature scaling by normalization and standardization**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before scaling:\n",
" price relative_price\n",
"0 109.0 0.247286\n",
"1 49.0 0.111165\n",
"2 69.0 0.156539\n",
"3 125.0 0.283584\n",
"4 44.0 0.099822\n",
"\n",
"Data after normalization:\n",
" price relative_price\n",
"0 0.005507 0.005507\n",
"1 0.002330 0.002330\n",
"2 0.003389 0.003389\n",
"3 0.006354 0.006354\n",
"4 0.002065 0.002065\n",
"\n",
"Data after standardization:\n",
" price relative_price\n",
"0 -0.483613 -0.483613\n",
"1 -0.571070 -0.571070\n",
"2 -0.541918 -0.541918\n",
"3 -0.460292 -0.460292\n",
"4 -0.578358 -0.578358\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"\n",
"# Load the data\n",
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
"\n",
"# Create the new feature 'relative_price' (price relative to the category mean)\n",
"mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
"df['relative_price'] = df['price'] / mean_price_by_category\n",
"\n",
"# Inspect the first rows before scaling\n",
"print(\"Data before scaling:\")\n",
"print(df[['price', 'relative_price']].head())\n",
"\n",
"# Scale the features by min-max normalization\n",
"min_max_scaler = MinMaxScaler()\n",
"df[['price', 'relative_price']] = min_max_scaler.fit_transform(df[['price', 'relative_price']])\n",
"\n",
"# Inspect the first rows after normalization\n",
"print(\"\\nData after normalization:\")\n",
"print(df[['price', 'relative_price']].head())\n",
"\n",
"# Standardize the features (applied to the already normalized columns,\n",
"# because df was overwritten by the previous step)\n",
"standard_scaler = StandardScaler()\n",
"df[['price', 'relative_price']] = standard_scaler.fit_transform(df[['price', 'relative_price']])\n",
"\n",
"# Inspect the first rows after standardization\n",
"print(\"\\nData after standardization:\")\n",
"print(df[['price', 'relative_price']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature engineering with the Featuretools framework**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Built 6 features\n",
"Elapsed: 00:00 | Progress: 100%|██████████\n",
"New features generated by Featuretools:\n",
" category sub_category ... NUM_CHARACTERS(items) NUM_WORDS(items)\n",
"index ... \n",
"0 Groceries Fruits & Vegetables ... 41 8\n",
"1 Groceries Fruits & Vegetables ... 59 11\n",
"2 Groceries Fruits & Vegetables ... 12 3\n",
"3 Groceries Fruits & Vegetables ... 20 4\n",
"4 Groceries Fruits & Vegetables ... 50 10\n",
"\n",
"[5 rows x 6 columns]\n"
]
}
],
"source": [
"import pandas as pd\n",
"import featuretools as ft\n",
"\n",
"# Load the data\n",
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
"\n",
"# Create the new feature 'relative_price'\n",
"mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
"df['relative_price'] = df['price'] / mean_price_by_category\n",
"\n",
"# Create the EntitySet\n",
"es = ft.EntitySet(id='jio_mart_items')\n",
"\n",
"# Add the dataframe, explicitly creating an index column\n",
"es = es.add_dataframe(dataframe_name='items_data', dataframe=df, index='index', make_index=True)\n",
"\n",
"# Run deep feature synthesis\n",
"features, feature_defs = ft.dfs(entityset=es, target_dataframe_name='items_data', verbose=True)\n",
"\n",
"# Inspect the first rows of the generated features\n",
"print(\"New features generated by Featuretools:\")\n",
"print(features.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Model quality evaluation**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE: 534.0885949291326\n",
"R²: 0.6087611252156747\n",
"MAE: 28.697400000000002\n",
"Training Time: 4.757523536682129 seconds\n",
"Cross-validated RMSE: 133.74731704254154\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\halina\\AppData\\Local\\Temp\\ipykernel_13300\\3211138617.py:70: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n"
]
},
{
"data": {
"image/png": "<base64 PNG data truncated in the source>",
"text/plain": [
"<Figure size 1000x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train RMSE: 50.92770420271637\n",
"Train R²: 0.9845578370650323\n",
"Train MAE: 1.9114281249999987\n",
"Correlation: 0.82\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA2QAAAIjCAYAAABswtioAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACNHUlEQVR4nOzdeZxOdf/H8fc1+5jVYGYMY2eYyFqWUJaMSCllSbK1G7KG3ISSPXuk7uK+015UlEKkImtoxFiyhZnBmNXs1/n90c+5u7I0oxlnltfz8bgej87nfK9zfc6FeM/5nu+xGYZhCAAAAABw0zlZ3QAAAAAAlFQEMgAAAACwCIEMAAAAACxCIAMAAAAAixDIAAAAAMAiBDIAAAAAsAiBDAAAAAAsQiADAAAAAIsQyAAAAADAIgQyAIBlbDabJk6caHUblrvrrrt01113mdvHjx+XzWbTsmXLLOvpr/7aY0EpjOcOAAWJQAYAxcRrr70mm82mpk2b3vAxzpw5o4kTJ2rPnj3511ght2nTJtlsNvPl6uqqatWq6bHHHtNvv/1mdXt5smXLFk2cOFEJCQmW9VClShWH7zMwMFCtWrXSypUrLesJAAozF6sbAADkjxUrVqhKlSravn27jhw5oho1auT5GGfOnNGkSZNUpUoVNWjQIP+bLMSGDBmi2267TVlZWdq9e7eWLl2qNWvW6JdfflFISMhN7aVy5cpKS0uTq6trnt63ZcsWTZo0Sf369ZO/v3/BNJcLDRo00IgRIyT98Xvq9ddf14MPPqjFixfr6aefvu57b/TcAaCo4goZABQDx44d05YtW/Tqq6+qXLlyWrFihdUtFTmtWrXSo48+qv79+2vBggWaNWuW4uPjtXz58mu+JzU1tUB6sdls8vDwkLOzc4Ecv6BVqFBBjz76qB599FE9//zz+vHHH+Xl5aU5c+Zc8z3Z2dnKzMws8ucOAHlFIAOAYmDFihUqXbq0OnfurIceeuiagSwhIUHDhg1TlSpV5O7urooVK+qxxx7T+fPntWnTJt12222SpP79+5tTzi7fy1OlShX169fvimP+9d6izMxMTZgwQY0bN5afn5+8vLzUqlUrbdy4Mc/nFRsbKxcXF02aNOmKfdHR0bLZbFq4cKEkKSsrS5MmTVLNmjXl4eGhMmXKqGXLllq3bl2eP1eS2rZtK+mPsCtJEydOlM1m06+//qpHHnlEpUuXVsuWLc3x77zzjho3bixPT08FBASoZ8+eOnXq1BXHXbp0qapXry5PT0/dfvvt+v77768Yc637qA4ePKju3burXLly8vT0VFhYmMaNG2f2N2rUKElS1apVzV+/48ePF0iPeREcHKw6deqY3+Xl85s1a5bmzp2r6tWry93dXb/++usNnftlp0+f1oABAxQUFCR3d3fdcssteuutt/5R7wBQ0JiyCADFwIoVK/Tggw/Kzc1NvXr10uLFi7Vjxw4zYElSSkqKWrVqpQMHDmjAgAFq1KiRzp8/r88//1y///676tSpo8mTJ2vChAl68skn1apVK0lSixYt8tRLUlKS3nzzTfXq1UtPPPGEkpOT9e9//1sRERHavn17nqZCBgUF6c4779SHH36oF1980WHfBx98IGdnZz388MOS/ggkU6dO1eOPP67bb79dSUlJ2rlzp3bv3q277747T+cgSUePHpUklSlTxqH+8MMPq2bNmnrllVdkGIYkacqUKRo/fry6d++uxx9/XOfOndOCBQvUunVr/fzzz+b0wX//+9966qmn1KJFCw0dOlS//fab7rvvPgUEBCg0NPS6/ezbt0+tWrWSq6urnnzySVWpUkVHjx7VF198oSlTpujBBx/UoUOH9N5772nOnDkqW7asJKlcuXI3rcdrycrK0qlTp674Lt9++22lp6frySeflLu7uwICAmS32/N87tIf4b1Zs2ay2WyKjIxUuXLl9NVXX2ngwIFKSkrS0KFDb6h3AChwBgCgSNu5c6
chyVi3bp1hGIZht9uNihUrGs8995zDuAkTJhiSjE8//fSKY9jtdsMwDGPHjh2GJOPtt9++YkzlypWNvn37XlG/8847jTvvvNPczs7ONjIyMhzGXLx40QgKCjIGDBjgUJdkvPjii9c9v9dff92QZPzyyy8O9fDwcKNt27bmdv369Y3OnTtf91hXs3HjRkOS8dZbbxnnzp0zzpw5Y6xZs8aoUqWKYbPZjB07dhiGYRgvvviiIcno1auXw/uPHz9uODs7G1OmTHGo//LLL4aLi4tZz8zMNAIDA40GDRo4fD9Lly41JDl8h8eOHbvi16F169aGj4+PceLECYfPufxrZxiGMXPmTEOScezYsQLv8VoqV65sdOjQwTh37pxx7tw5Y+/evUbPnj0NScbgwYMdzs/X19eIi4tzeP+NnvvAgQON8uXLG+fPn3cY07NnT8PPz8+4dOnS3/YOAFZgyiIAFHErVqxQUFCQ2rRpI+mP+4969Oih999/Xzk5Oea4Tz75RPXr19cDDzxwxTFsNlu+9ePs7Cw3NzdJkt1uV3x8vLKzs9WkSRPt3r07z8d78MEH5eLiog8++MCsRUVF6ddff1WPHj3Mmr+/v/bv36/Dhw/fUN8DBgxQuXLlFBISos6dOys1NVXLly9XkyZNHMb9dVGKTz/9VHa7Xd27d9f58+fNV3BwsGrWrGlO1dy5c6fi4uL09NNPm9+PJPXr109+fn7X7e3cuXPavHmzBgwYoEqVKjnsy82v3c3o8c+++eYblStXTuXKlVP9+vX10UcfqU+fPpo+fbrDuG7duplX8K4lN+duGIY++eQTdenSRYZhOJxjRESEEhMTb+j3HgDcDExZBIAiLCcnR++//77atGlj3p8jSU2bNtXs2bO1YcMGdejQQdIfU/C6det2U/pavny5Zs+erYMHDyorK8usV61aNc/HKlu2rNq1a6cPP/xQL730kqQ/piu6uLjowQcfNMdNnjxZ999/v2rVqqW6deuqY8eO6tOnj2699dZcfc6ECRPUqlUrOTs7q2zZsqpTp45cXK78a/Kv53D48GEZhqGaNWte9biXVws8ceKEJF0x7vIy+9dzefn9unXr5upc/upm9PhnTZs21csvvyybzaZSpUqpTp06V131MTe/H3Jz7ufOnVNCQoKWLl2qpUuXXnVMXFxc7poHgJuMQAYARdi3336rs2fP6v3339f7779/xf4VK1aYgeyfutaVmJycHIcV8d555x3169dPXbt21ahRoxQYGChnZ2dNnTrVvC8rr3r27Kn+/ftrz549atCggT788EO1a9fOvE9Kklq3bq2jR4/qs88+0zfffKM333xTc+bM0ZIlS/T444//7WfUq1dP7du3/9txnp6eDtt2u102m01fffXVVVcG9Pb2zsUZFqyb3WPZsmVv6Lu8UZfvO3v00UfVt2/fq47JbTAHgJuNQAYARdiKFSsUGBioRYsWXbHv008/1cqVK7VkyRJ5enqqevXqioqKuu7xrjf9rXTp0ld94PCJEyccrp58/PHHqlatmj799FOH4/11UY686Nq1q5566ilz2uKhQ4c0duzYK8YFBASof//+6t+/v1JSUtS6dWtNnDgxV4HsRlWvXl2GYahq1aqqVavWNcdVrlxZ0h9Xqy6v4Cj9seDFsWPHVL9+/Wu+9/L3e6O/fjejx4KSm3MvV66cfHx8lJOTk6sgCACFCfeQAUARlZaWpk8//VT33nuvHnrooStekZGRSk5O1ueffy7pj/t19u7dq5UrV15xLOP/Vwv08vKSpKsGr+rVq+unn35SZmamWVu9evUVy6ZfvgJz+ZiStG3bNm3duvWGz9Xf318RERH68MMP9f7778vNzU1du3Z1GHPhwgWHbW9vb9WoUUMZGRk3/Lm58eCDD8rZ2VmTJk1yOGfpj+/gcl9NmjRRuXLltGTJEofvcNmyZVf9vv+sXLlyat26td566y2dPHnyis+47Fq/fjejx4KSm3N3dnZWt2
7d9Mknn1w1uJ07d+6m9AoAN4IrZABQRH3++edKTk7Wfffdd9X9zZo1Mx8S3aNHD40aNUoff/yxHn74YQ0YMECNGzd
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error\n",
"from sklearn.model_selection import cross_val_score\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import time\n",
"import numpy as np\n",
"\n",
"# Загрузка данных\n",
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(2000)\n",
"\n",
"# Создание нового признака 'relative_price'\n",
"mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
"df['relative_price'] = df['price'] / mean_price_by_category\n",
"\n",
"# Предобработка данных\n",
"# Преобразуем категориальные переменные в числовые\n",
"df = pd.get_dummies(df, drop_first=True)\n",
"\n",
"# Разделение данных на признаки и целевую переменную\n",
"X = df.drop('price', axis=1)\n",
"y = df['price']\n",
"\n",
"# Разделение данных на обучающую и тестовую выборки\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Выбор модели\n",
"model = RandomForestRegressor(random_state=42)\n",
"\n",
"# Измерение времени обучения и предсказания\n",
"start_time = time.time()\n",
"\n",
"# Обучение модели\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Предсказание и оценка\n",
"y_pred = model.predict(X_test)\n",
"\n",
"end_time = time.time()\n",
"training_time = end_time - start_time\n",
"\n",
"rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
"r2 = r2_score(y_test, y_pred)\n",
"mae = mean_absolute_error(y_test, y_pred)\n",
"\n",
"print(f\"RMSE: {rmse}\")\n",
"print(f\"R²: {r2}\")\n",
"print(f\"MAE: {mae}\")\n",
"print(f\"Training Time: {training_time} seconds\")\n",
"\n",
"# Кросс-валидация\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validated RMSE: {rmse_cv}\")\n",
"\n",
"# Анализ важности признаков\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})\n",
"importance_df = importance_df.sort_values(by='Importance', ascending=False)\n",
"\n",
"# Отобразим только топ-20 признаков\n",
"top_n = 20\n",
"importance_df_top = importance_df.head(top_n)\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n",
"plt.title(f'Top {top_n} Feature Importance')\n",
"plt.xlabel('Importance')\n",
"plt.ylabel('Feature')\n",
"plt.show()\n",
"\n",
"# Проверка на переобучение\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)\n",
"r2_train = r2_score(y_train, y_train_pred)\n",
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
"\n",
"print(f\"Train RMSE: {rmse_train}\")\n",
"print(f\"Train R²: {r2_train}\")\n",
"print(f\"Train MAE: {mae_train}\")\n",
"\n",
"correlation = np.corrcoef(y_test, y_pred)[0, 1]\n",
"print(f\"Корреляция: {correlation:.2f}\")\n",
"\n",
"# Визуализация результатов\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual Price')\n",
"plt.ylabel('Predicted Price')\n",
"plt.title('Actual vs Predicted Price')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Вывод:\n",
"\n",
"Время обучения:\n",
"\n",
"Время обучения модели составляет 4.76 секунды, что является средним. Это указывает на то, что модель обучается быстро и может эффективно обрабатывать данные.\n",
"\n",
"Предсказательная способность:\n",
"\n",
"MAE (Mean Absolute Error): 28.6974 — это средняя абсолютная ошибка предсказаний модели. Значение MAE невелико, что означает, что предсказанные значения в среднем отклоняются от реальных на 28.6974. Это может быть приемлемым уровнем ошибки.\n",
"\n",
"RMSE (Mean Squared Error): 534.088 — это среднее значение квадратов ошибок. Хотя MSE высокое, оно также может быть связано с большими значениями целевой переменной (цен).\n",
"\n",
"R² (коэффициент детерминации): 0.609 — это средний уровень, указывающий на то, что модель объясняет 60,9% вариации целевой переменной. Это свидетельствует о средней предсказательной способности модели.\n",
"\n",
"Корреляция:\n",
"\n",
"Корреляция (0.82) между предсказанными и реальными значениями говорит о том, что предсказания модели имеют сильную линейную зависимость с реальными значениями. Это подтверждает, что модель хорошо обучена и делает точные прогнозы.\n",
"\n",
"Надежность (кросс-валидация):\n",
"\n",
"Среднее RMSE (кросс-валидация): 133.75 — это значительно ниже, чем обычное RMSE, что указывает на отсутствие проблем с переобучением - что и подтверждается тестом переобучением. \n",
"\n",
"Результаты визуализации важности признаков, полученные из линейной регрессии, помогают понять, какие из входных переменных наибольшим образом влияют на целевую переменную (price). Это может быть полезным для дальнейшего анализа и при принятии бизнес-решений, связанных с управлением и ценообразованием в Jio Mart."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "miienv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}