{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Jio Mart Product Items"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['category', 'sub_category', 'href', 'items', 'price'], dtype='object')\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Load the first 30,000 rows of the Jio Mart catalogue and list its columns\n",
    "df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
    "print(df.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Business goals**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Business goal: Reduce costs and increase sales by optimizing product prices.\n",
    "\n",
    "Technical goal: Build a machine learning model that predicts whether a product is overpriced for its category."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Business goal: Optimize the distribution of products across categories.\n",
    "\n",
    "Technical goal: Build a machine learning model that predicts optimal product prices from their categories, subcategories, and current prices."
   ]
  },
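  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how a target for the first technical goal might be built. It assumes an item counts as overpriced when its price exceeds 1.5 times the median price of its category. The threshold, the `is_overpriced` column name, and the toy rows are illustrative assumptions and are not part of the original analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy rows standing in for the real catalogue (hypothetical values)\n",
    "toy = pd.DataFrame({\n",
    "    'category': ['Groceries', 'Groceries', 'Groceries', 'Fashion', 'Fashion'],\n",
    "    'price': [40.0, 50.0, 200.0, 500.0, 600.0],\n",
    "})\n",
    "\n",
    "# Median price of each item's category, aligned to the original rows\n",
    "category_median = toy.groupby('category')['price'].transform('median')\n",
    "\n",
    "# Assumed rule: overpriced if the price is above 1.5 x the category median\n",
    "toy['is_overpriced'] = toy['price'] > 1.5 * category_median\n",
    "print(toy)"
   ]
  },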
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Data preparation**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Missing values per column:\n",
      "category        0\n",
      "sub_category    0\n",
      "href            0\n",
      "items           0\n",
      "price           0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "# Check for missing values\n",
    "missing_data = df.isnull().sum()\n",
    "print(\"Missing values per column:\")\n",
    "print(missing_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Splitting the dataset into training, validation, and test sets**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sample sizes:\n",
      "Training set: 18000 records\n",
      "Validation set: 6000 records\n",
      "Test set: 6000 records\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 1800x600 with 3 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Separate the features from the target variable\n",
    "X = df.drop(columns=['price'])  # Features (all columns except 'price')\n",
    "y = df['price']  # Target variable (price)\n",
    "\n",
    "# Split into training (60%), validation (20%), and test (20%) sets\n",
    "X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
    "X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
    "\n",
    "# Check the sample sizes\n",
    "print(\"Sample sizes:\")\n",
    "print(f\"Training set: {X_train.shape[0]} records\")\n",
    "print(f\"Validation set: {X_val.shape[0]} records\")\n",
    "print(f\"Test set: {X_test.shape[0]} records\")\n",
    "\n",
    "# Visualize the price distribution in each sample\n",
    "plt.figure(figsize=(18, 6))\n",
    "\n",
    "plt.subplot(1, 3, 1)\n",
    "plt.hist(y_train, bins=30, color='blue', alpha=0.7)\n",
    "plt.title('Training set')\n",
    "plt.xlabel('Price')\n",
    "plt.ylabel('Count')\n",
    "\n",
    "plt.subplot(1, 3, 2)\n",
    "plt.hist(y_val, bins=30, color='green', alpha=0.7)\n",
    "plt.title('Validation set')\n",
    "plt.xlabel('Price')\n",
    "plt.ylabel('Count')\n",
    "\n",
    "plt.subplot(1, 3, 3)\n",
    "plt.hist(y_test, bins=30, color='red', alpha=0.7)\n",
    "plt.title('Test set')\n",
    "plt.xlabel('Price')\n",
    "plt.ylabel('Count')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Balancing the samples**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sample sizes:\n",
      "Training set: 18000 records\n",
      "Validation set: 6000 records\n",
      "Test set: 6000 records\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 1000x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Load the data\n",
    "df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
    "\n",
    "# Separate the features from the target variable\n",
    "X = df.drop(columns=['price'])  # Features (all columns except 'price')\n",
    "y = df['price']  # Target variable (price)\n",
    "\n",
    "# Apply one-hot encoding to the categorical features\n",
    "X = pd.get_dummies(X, drop_first=True)\n",
    "\n",
    "# Split into training (60%), validation (20%), and test (20%) sets\n",
    "X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
    "X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
    "\n",
    "# Check the sample sizes (printed before outliers are removed)\n",
    "print(\"Sample sizes:\")\n",
    "print(f\"Training set: {X_train.shape[0]} records\")\n",
    "print(f\"Validation set: {X_val.shape[0]} records\")\n",
    "print(f\"Test set: {X_test.shape[0]} records\")\n",
    "\n",
    "# Remove outliers from the training set (prices above its 95th percentile)\n",
    "upper_limit = y_train.quantile(0.95)\n",
    "X_train = X_train[y_train <= upper_limit]\n",
    "y_train = y_train[y_train <= upper_limit]\n",
    "\n",
    "# Log-transform the target variable\n",
    "y_train_log = np.log1p(y_train)\n",
    "y_val_log = np.log1p(y_val)\n",
    "y_test_log = np.log1p(y_test)\n",
    "\n",
    "# Standardize the features (the scaler is fitted on the training set only)\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_val_scaled = scaler.transform(X_val)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "# Visualize the price distribution in the balanced training set\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.hist(y_train_log, bins=30, color='orange', alpha=0.7)\n",
    "plt.title('Balanced training set (log transform)')\n",
    "plt.xlabel('Log of price')\n",
    "plt.ylabel('Count')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**One-hot encoding of categorical features**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data before one-hot encoding:\n",
      "    category  ...  price\n",
      "0  Groceries  ...  109.0\n",
      "1  Groceries  ...   49.0\n",
      "2  Groceries  ...   69.0\n",
      "3  Groceries  ...  125.0\n",
      "4  Groceries  ...   44.0\n",
      "\n",
      "[5 rows x 5 columns]\n",
      "\n",
      "Data after one-hot encoding:\n",
      "   price  ...  items_ Hilife Pantyliners\n",
      "0  109.0  ...                      False\n",
      "1   49.0  ...                      False\n",
      "2   69.0  ...                      False\n",
      "3  125.0  ...                      False\n",
      "4   44.0  ...                      False\n",
      "\n",
      "[5 rows x 28392 columns]\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Load the data\n",
    "df1 = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
    "\n",
    "print(\"Data before one-hot encoding:\")\n",
    "print(df1.head())\n",
    "\n",
    "# Apply one-hot encoding to every non-numeric column\n",
    "df_encoded = pd.get_dummies(df1, drop_first=True)\n",
    "\n",
    "print(\"\\nData after one-hot encoding:\")\n",
    "print(df_encoded.head())"
   ]
  },
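  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output above reports 28,392 columns, apparently because `pd.get_dummies` also expands the free-text columns such as `items`. A minimal sketch of one possible alternative, assuming only `category` and `sub_category` should be encoded. The toy rows (including the 'Dairy', 'Fashion', and 'Men' values) are hypothetical and only illustrate the `columns` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy rows standing in for the real catalogue (hypothetical values)\n",
    "toy = pd.DataFrame({\n",
    "    'category': ['Groceries', 'Groceries', 'Fashion'],\n",
    "    'sub_category': ['Fruits & Vegetables', 'Dairy', 'Men'],\n",
    "    'items': ['Item A', 'Item B', 'Item C'],\n",
    "    'price': [109.0, 49.0, 799.0],\n",
    "})\n",
    "\n",
    "# Encode only the selected columns; 'items' is left as plain text\n",
    "encoded = pd.get_dummies(toy, columns=['category', 'sub_category'], drop_first=True)\n",
    "print(encoded.columns.tolist())"
   ]
  },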
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Discretization of numeric features**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data before discretization:\n",
      "    category  ...  price\n",
      "0  Groceries  ...  109.0\n",
      "1  Groceries  ...   49.0\n",
      "2  Groceries  ...   69.0\n",
      "3  Groceries  ...  125.0\n",
      "4  Groceries  ...   44.0\n",
      "\n",
      "[5 rows x 5 columns]\n",
      "\n",
      "Data after discretization:\n",
      "   price price_bins\n",
      "0  109.0    100-500\n",
      "1   49.0      0-100\n",
      "2   69.0      0-100\n",
      "3  125.0    100-500\n",
      "4   44.0      0-100\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Load the data\n",
    "df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
    "\n",
    "print(\"Data before discretization:\")\n",
    "print(df.head())\n",
    "\n",
    "# Define the bin edges and labels for discretization\n",
    "bins = [0, 100, 500, 1000, 5000, float('inf')]\n",
    "labels = ['0-100', '100-500', '500-1000', '1000-5000', '5000+']\n",
    "\n",
    "# Apply discretization; right=False makes each bin include its left edge\n",
    "df['price_bins'] = pd.cut(df['price'], bins=bins, labels=labels, right=False)\n",
    "\n",
    "print(\"\\nData after discretization:\")\n",
    "print(df[['price', 'price_bins']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**\"Manual\" feature synthesis**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data before feature synthesis:\n",
      "    category  ...  price\n",
      "0  Groceries  ...  109.0\n",
      "1  Groceries  ...   49.0\n",
      "2  Groceries  ...   69.0\n",
      "3  Groceries  ...  125.0\n",
      "4  Groceries  ...   44.0\n",
      "\n",
      "[5 rows x 5 columns]\n",
      "\n",
      "Data after synthesizing the 'relative_price' feature:\n",
      "   price   category  relative_price\n",
      "0  109.0  Groceries        0.247286\n",
      "1   49.0  Groceries        0.111165\n",
      "2   69.0  Groceries        0.156539\n",
      "3  125.0  Groceries        0.283584\n",
      "4   44.0  Groceries        0.099822\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Load the data\n",
    "df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
    "\n",
    "# Inspect the first rows\n",
    "print(\"Data before feature synthesis:\")\n",
    "print(df.head())\n",
    "\n",
    "# Compute the mean price of each row's category\n",
    "mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
    "\n",
    "# Create the new feature 'relative_price' (price divided by the category mean)\n",
    "df['relative_price'] = df['price'] / mean_price_by_category\n",
    "\n",
    "# Inspect the first rows after feature synthesis\n",
    "print(\"\\nData after synthesizing the 'relative_price' feature:\")\n",
    "print(df[['price', 'category', 'relative_price']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Feature scaling via normalization and standardization**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data before scaling:\n",
      "   price  relative_price\n",
      "0  109.0        0.247286\n",
      "1   49.0        0.111165\n",
      "2   69.0        0.156539\n",
      "3  125.0        0.283584\n",
      "4   44.0        0.099822\n",
      "\n",
      "Data after normalization:\n",
      "      price  relative_price\n",
      "0  0.005507        0.005507\n",
      "1  0.002330        0.002330\n",
      "2  0.003389        0.003389\n",
      "3  0.006354        0.006354\n",
      "4  0.002065        0.002065\n",
      "\n",
      "Data after standardization:\n",
      "      price  relative_price\n",
      "0 -0.483613       -0.483613\n",
      "1 -0.571070       -0.571070\n",
      "2 -0.541918       -0.541918\n",
      "3 -0.460292       -0.460292\n",
      "4 -0.578358       -0.578358\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
    "\n",
    "# Load the data\n",
    "df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
    "\n",
    "# Create the new feature 'relative_price' (price relative to the category mean)\n",
    "mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
    "df['relative_price'] = df['price'] / mean_price_by_category\n",
    "\n",
    "# Inspect the first rows before scaling\n",
    "print(\"Data before scaling:\")\n",
    "print(df[['price', 'relative_price']].head())\n",
    "\n",
    "# Normalization: rescale both features to the [0, 1] range\n",
    "min_max_scaler = MinMaxScaler()\n",
    "df[['price', 'relative_price']] = min_max_scaler.fit_transform(df[['price', 'relative_price']])\n",
    "\n",
    "# Inspect the first rows after normalization\n",
    "print(\"\\nData after normalization:\")\n",
    "print(df[['price', 'relative_price']].head())\n",
    "\n",
    "# Standardization: zero mean and unit variance.\n",
    "# Note: this runs on the already normalized columns. Min-max scaling is an\n",
    "# affine map, so the z-scores equal those computed from the raw values.\n",
    "standard_scaler = StandardScaler()\n",
    "df[['price', 'relative_price']] = standard_scaler.fit_transform(df[['price', 'relative_price']])\n",
    "\n",
    "# Inspect the first rows after standardization\n",
    "print(\"\\nData after standardization:\")\n",
    "print(df[['price', 'relative_price']].head())"
   ]
  },
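  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell above fits both scalers on the full table. When scaled features feed a model, a common practice is to fit the scaler on the training split only and reuse its statistics for the other splits, as the balancing cell earlier in this notebook does. A minimal sketch under that assumption, using hypothetical toy prices rather than the real data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Toy single-feature matrix (hypothetical prices)\n",
    "prices = np.array([[44.0], [49.0], [69.0], [109.0], [125.0], [300.0]])\n",
    "train, test = train_test_split(prices, test_size=0.33, random_state=42)\n",
    "\n",
    "scaler = StandardScaler()\n",
    "train_scaled = scaler.fit_transform(train)  # mean and std come from the training rows only\n",
    "test_scaled = scaler.transform(test)        # the same statistics are reused for the test rows\n",
    "\n",
    "print(train_scaled.ravel())\n",
    "print(test_scaled.ravel())"
   ]
  },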
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Feature engineering with the Featuretools framework**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
      "  pd.to_datetime(\n",
      "c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
      "  pd.to_datetime(\n",
      "c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
      "  pd.to_datetime(\n",
      "c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
      "  pd.to_datetime(\n",
      "c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
      "  pd.to_datetime(\n",
      "c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
      "  pd.to_datetime(\n",
      "c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Built 6 features\n",
      "Elapsed: 00:00 | Progress: 100%|██████████\n",
      "New features created with Featuretools:\n",
      "        category         sub_category  ...  NUM_CHARACTERS(items)  NUM_WORDS(items)\n",
      "index                                  ...                                         \n",
      "0      Groceries  Fruits & Vegetables  ...                     41                 8\n",
      "1      Groceries  Fruits & Vegetables  ...                     59                11\n",
      "2      Groceries  Fruits & Vegetables  ...                     12                 3\n",
      "3      Groceries  Fruits & Vegetables  ...                     20                 4\n",
      "4      Groceries  Fruits & Vegetables  ...                     50                10\n",
      "\n",
      "[5 rows x 6 columns]\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import featuretools as ft\n",
    "\n",
    "# Load the data\n",
    "df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(30000)\n",
    "\n",
    "# Create the new feature 'relative_price'\n",
    "mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
    "df['relative_price'] = df['price'] / mean_price_by_category\n",
    "\n",
    "# Create the EntitySet\n",
    "es = ft.EntitySet(id='jio_mart_items')\n",
    "\n",
    "# Add the data, explicitly creating an index column\n",
    "es = es.add_dataframe(dataframe_name='items_data', dataframe=df, index='index', make_index=True)\n",
    "\n",
    "# Build the features with deep feature synthesis\n",
    "features, feature_defs = ft.dfs(entityset=es, target_dataframe_name='items_data', verbose=True)\n",
    "\n",
    "# Inspect the first rows of the new features\n",
    "print(\"New features created with Featuretools:\")\n",
    "print(features.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Quality evaluation**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE: 534.0885949291326\n",
      "R²: 0.6087611252156747\n",
      "MAE: 28.697400000000002\n",
      "Training Time: 4.757523536682129 seconds\n",
      "Cross-validated RMSE: 133.74731704254154\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\halina\\AppData\\Local\\Temp\\ipykernel_13300\\3211138617.py:70: FutureWarning: \n",
      "\n",
      "Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.\n",
      "\n",
      "  sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n"
     ]
    },
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAACD8AAAK9CAYAAAApe1VgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeXRN1///8ddNkDkEEUGaGCOmiLFBJWL2kaJKq4oYitYUNZTWPNMqOlClppZqlZIixg8paWuehYQKqmlRY0INyfn94ZfzdSUhpkY+fT7Wumvl7LPP3u99zrmnq8777m0xDMMQAAAAAAAAAAAAAABANmWT1QEAAAAAAAAAAAAAAAA8DpIfAAAAAAAAAAAAAABAtkbyAwAAAAAAAAAAAAAAyNZIfgAAAAAAAAAAAAAAANkayQ8AAAAAAAAAAAAAACBbI/kBAAAAAAAAAAAAAABkayQ/AAAAAAAAAAAAAACAbI3kBwAAAAAAAAAAAAAAkK2R/AAAAAAAAAAAAAAAALI1kh8AAAAAAAAAAAAAAEC2RvIDAAAAAAAAAAD4n2axWDL12bx581ON4/Tp0xo5cqSqVasmNzc35c+fX8HBwdqwYUO69S9duqSuXbvK3d1dTk5OqlOnjnbv3p2pvoKDgzMc55EjR57ksEzTp0/XvHnznkrbjys4OFjlypXL6jAe2e+//64RI0Zo7969WR0KADyzcmR1AAAAAAAAAAAAAE/Tl19+abW9YMECrV+/Pk25n5/fU41jxYoVmjhxopo3b64OHTro9u3bWrBggerXr685c+aoY8eOZt2UlBT95z//0b59+zRgwADlz59f06dPV3BwsHbt2qWSJUs+sL8iRYpo/PjxacoLFSr0RMeVavr06cqfP7/CwsKeSvv/Zr///rtGjhwpHx8fVaxYMavDAYBnEskPAAAAAAAAAADgf9rrr79utf3LL79o/fr1acqftjp16ujUqVPKnz+/Wda9e3dVrFhRw4YNs0p++O677/TTTz9pyZIlevnllyVJrVu3VqlSpTR8+HAtWrTogf3lzp37Hx/jk2YYhv7++285ODhkdShZ4vbt20pJScnqMAAgW2DZCwAAAAAAAAAA8K+XlJSkfv36ycvLS3Z2dvL19dUHH3wgwzCs6lksFvXs2VMLFy6Ur6+v7O3tVblyZf34448P7KNs2bJWiQ+SZGdnpyZNmui3337T1atXzfLvvvtOHh4eeumll8wyd3d3tW7dWitWrNCNGzcec8TSjRs3NHz4cJUoUUJ2dnby8vLSwIED07Q9d+5chYSEqECBArKzs1OZMmU0Y8YMqzo+Pj46dOiQoqKizOU1goODJUkjRoyQxWJJ0/+8efNksVgUHx9v1U7Tpk21du1aValSRQ4ODpo5c6akO8uAhIeHm9eoRIkSmjhx4iMnB6ReyyVLlqhMmTJycHBQYGCgDhw4IEmaOXOmSpQoIXt7ewUHB1vFKf3fUhq7du1SjRo15ODgoKJFi+qzzz5L09fZs2fVuXNneXh4yN7eXv7+/po/f75Vnfj4eFksFn3wwQeaOnWqihcvLjs7O02fPl1Vq1aVJHXs2NE8v6lLjGzZskWtWrXSc889Z17Hvn376vr161bth4WFydnZWWfOnFHz5s3l7Owsd3d39e/fX8nJyVZ1U1JSNG3aNJUvX1729vZyd3dXo0aNtHPnTqt6X331lSpXriwHBwflzZtXr776qk6fPv3Q1wIAngRmfgAAAAAAAAAAAP9qhmHoxRdf1KZNm9S5c2dVrFhRa9eu1YABA3TmzBlNmTLFqn5UVJS++eYb9e7d23w53ahRI23fvl3lypV76P7/+OMPOTo6ytHR0Szbs2ePKlWqJBsb69+xVqtWTZ9//rliY2NVvnz5+7abnJys8+fPW5XZ29vL2dlZKSkpevHFF7V161Z17dpVfn5+OnDggKZMmaLY2FgtX77cPGbGjBkqW7asXnzxReXIkUM//PCD3n
rrLaWkpKhHjx6SpKlTp6pXr15ydnbWe++9J0ny8PB46HMhSUePHlWbNm3UrVs3vfHGG/L19dW1a9cUFBSkM2fOqFu3bnruuef0008/afDgwUpISNDUqVMfqa8tW7YoIiLCHMf48ePVtGlTDRw4UNOnT9dbb72lixcvatKkSerUqZP++9//Wh1/8eJFNWnSRK1bt1abNm307bff6s0331SuXLnUqVMnSdL169cVHBysY8eOqWfPnipatKiWLFmisLAwXbp0SX369LFqc+7cufr777/VtWtX2dnZqUWLFrp69aqGDRumrl276oUXXpAk1ahRQ5K0ZMkSXbt2TW+++aby5cun7du36+OPP9Zvv/2mJUuWWLWdnJyshg0bqnr16vrggw+0YcMGTZ48WcWLF9ebb75p1uvcubPmzZunxo0bq0uXLrp9+7a2bNmiX375RVWqVJEkjR07VkOHDlXr1q3VpUsXnTt3Th9//LFq166tPXv2KE+ePI90TQDgkRkAAAAAAAAAAAD/Ij169DDufkWyfPlyQ5IxZswYq3ovv/yyYbFYjGPHjpllkgxJxs6dO82ykydPGvb29kaLFi0eOpa4uDjD3t7eaNeunVW5k5OT0alTpzT1V61aZUgy1qxZc992g4KCzFjv/nTo0MEwDMP48ssvDRsbG2PLli1Wx3322WeGJCM6Otosu3btWpr2GzZsaBQrVsyqrGzZskZQUFCausOHDzfSeyU1d+5cQ5Jx4sQJs8zb2zvd8Y0ePdpwcnIyYmNjrcoHDRpk2NraGqdOnUr3PKQKCgoyypYta1UmybCzs7Pqf+bMmYYko2DBgsaVK1fM8sGDB6eJNfUcT5482Sy7ceOGUbFiRaNAgQLGzZs3DcMwjKlTpxqSjK+++sqsd/PmTSMwMNBwdnY2+zlx4oQhyXB1dTXOnj1rFeuOHTsMScbcuXPTjC296zN+/HjDYrEYJ0+eNMs6dOhgSDJGjRplVTcgIMCoXLmyuf3f//7XkGT07t07TbspKSmGYRhGfHy8YWtra4wdO9Zq/4EDB4wcOXKkKQeAfwLLXgAAAAAAAAAAgH+11atXy9bWVr1797Yq79evnwzDUGRkpFV5YGCgKleubG4/99xzatasmdauXZtm+YD7uXbtmlq1aiUHBwdNmDDBat/169dlZ2eX5hh7e3tz/4P4+Pho/fr1Vp+BAwdKujNbgJ+fn0qXLq3z58+bn5CQEEnSpk2bzHYcHBzMvy9fvqzz588rKChIv/76qy5fvpzp8WZW0aJF1bBhQ6uyJUuW6IUXXpCbm5tVvPXq1VNycnKmlh1JT926deXj42NuV69eXZLUsmVLubi4pCn/9ddfrY7PkSOHunXrZm7nypVL3bp109mzZ7Vr1y5Jd+6vggULqk2bNma9nDlzqnfv3kpMTFRUVJRVmy1btpS7u3umx3D39UlKStL58+dVo0YNGYahPXv2pKnfvXt3q+0XXnjBalxLly6VxWLR8OHD0xybunzJsmXLlJKSotatW1tdj4IFC6pkyZJW9w8A/FNY9gIAAAAAAAAAAPyrnTx5UoUKFbJ62S1Jfn5+5v67lSxZMk0bpUqV0rVr13Tu3DkVLFjwgX0mJyfr1Vdf1eHDhxUZGalChQpZ7XdwcNCNGzfSHPf333+b+x/EyclJ9erVS3dfXFycYmJiMnzJfvbsWfPv6OhoDR8+XD///LOuXbtmVe/y5cvKnTv3A2N5GEWLFk033v3792cq3ofx3HPPWW2njsXLyyvd8osXL1qVFypUSE5OTlZlpUqVkiTFx8fr+eef18mTJ1WyZMk0S5hkdH+lN/77OXXqlIYNG6aIiIg08d2bnGJvb5/mHLq5uVkdd/z4cRUqVEh58+bNsM+4uDgZhpHud0G6k9wBAP80kh8AAAAAAAAAAAD+YW+88YZWrlyphQsXmrMt3M3T01MJCQlpylPL7k2WeFgpKSkqX768Pvzww3T3p778P378uOrWravSpUvrww
8/lJeXl3LlyqXVq1drypQpSklJeWBfqbMF3CujWTLSS+xISUlR/fr1zZkr7pWacPCwbG1tH6rcMIxH6udhZCaxJVV
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x800 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\halina\\repos\\mii\\AIM-PIbd-32-Fedorenko-G-Y\\miienv\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Train RMSE: 50.92770420271637\n",
|
|||
|
"Train R²: 0.9845578370650323\n",
|
|||
|
"Train MAE: 1.9114281249999987\n",
|
|||
|
"Корреляция: 0.82\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA2QAAAIjCAYAAABswtioAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACNHUlEQVR4nOzdeZxOdf/H8fc1+5jVYGYMY2eYyFqWUJaMSCllSbK1G7KG3ISSPXuk7uK+015UlEKkImtoxFiyhZnBmNXs1/n90c+5u7I0oxlnltfz8bgej87nfK9zfc6FeM/5nu+xGYZhCAAAAABw0zlZ3QAAAAAAlFQEMgAAAACwCIEMAAAAACxCIAMAAAAAixDIAAAAAMAiBDIAAAAAsAiBDAAAAAAsQiADAAAAAIsQyAAAAADAIgQyAIBlbDabJk6caHUblrvrrrt01113mdvHjx+XzWbTsmXLLOvpr/7aY0EpjOcOAAWJQAYAxcRrr70mm82mpk2b3vAxzpw5o4kTJ2rPnj3511ght2nTJtlsNvPl6uqqatWq6bHHHtNvv/1mdXt5smXLFk2cOFEJCQmW9VClShWH7zMwMFCtWrXSypUrLesJAAozF6sbAADkjxUrVqhKlSravn27jhw5oho1auT5GGfOnNGkSZNUpUoVNWjQIP+bLMSGDBmi2267TVlZWdq9e7eWLl2qNWvW6JdfflFISMhN7aVy5cpKS0uTq6trnt63ZcsWTZo0Sf369ZO/v3/BNJcLDRo00IgRIyT98Xvq9ddf14MPPqjFixfr6aefvu57b/TcAaCo4goZABQDx44d05YtW/Tqq6+qXLlyWrFihdUtFTmtWrXSo48+qv79+2vBggWaNWuW4uPjtXz58mu+JzU1tUB6sdls8vDwkLOzc4Ecv6BVqFBBjz76qB599FE9//zz+vHHH+Xl5aU5c+Zc8z3Z2dnKzMws8ucOAHlFIAOAYmDFihUqXbq0OnfurIceeuiagSwhIUHDhg1TlSpV5O7urooVK+qxxx7T+fPntWnTJt12222SpP79+5tTzi7fy1OlShX169fvimP+9d6izMxMTZgwQY0bN5afn5+8vLzUqlUrbdy4Mc/nFRsbKxcXF02aNOmKfdHR0bLZbFq4cKEkKSsrS5MmTVLNmjXl4eGhMmXKqGXLllq3bl2eP1eS2rZtK+mPsCtJEydOlM1m06+//qpHHnlEpUuXVsuWLc3x77zzjho3bixPT08FBASoZ8+eOnXq1BXHXbp0qapXry5PT0/dfvvt+v77768Yc637qA4ePKju3burXLly8vT0VFhYmMaNG2f2N2rUKElS1apVzV+/48ePF0iPeREcHKw6deqY3+Xl85s1a5bmzp2r6tWry93dXb/++usNnftlp0+f1oABAxQUFCR3d3fdcssteuutt/5R7wBQ0JiyCADFwIoVK/Tggw/Kzc1NvXr10uLFi7Vjxw4zYElSSkqKWrVqpQMHDmjAgAFq1KiRzp8/r88//1y///676tSpo8mTJ2vChAl68skn1apVK0lSixYt8tRLUlKS3nzzTfXq1UtPPPGEkpOT9e9//1sRERHavn17nqZCBgUF6c4779SHH36oF1980WHfBx98IGdnZz388MOS/ggkU6dO1eOPP67bb79dSUlJ2rlzp3bv3q277747T+cgSUePHpUklSlTxqH+8MMPq2bNmnrllVdkGIYkacqUKRo/fry6d++uxx9/XOfOndOCBQvUunVr/fzzz+b0wX//+9966qmn1KJFCw0dOlS//fab7rvvPgUEBCg0NPS6/ezbt0+tWrWSq6urnnzySVWpUkVHjx7VF198oSlTpujBBx/UoUOH9N5772nOnDkqW7asJKlcuXI3rcdrycrK0qlTp674Lt9++22lp6frySeflLu7uwICAmS32/N87tIf4b1Zs2ay2WyKjIxUuXLl9NVXX2ngwIFKSkrS0KFDb6h3AChwBgCgSNu5c6
chyVi3bp1hGIZht9uNihUrGs8995zDuAkTJhiSjE8//fSKY9jtdsMwDGPHjh2GJOPtt9++YkzlypWNvn37XlG/8847jTvvvNPczs7ONjIyMhzGXLx40QgKCjIGDBjgUJdkvPjii9c9v9dff92QZPzyyy8O9fDwcKNt27bmdv369Y3OnTtf91hXs3HjRkOS8dZbbxnnzp0zzpw5Y6xZs8aoUqWKYbPZjB07dhiGYRgvvviiIcno1auXw/uPHz9uODs7G1OmTHGo//LLL4aLi4tZz8zMNAIDA40GDRo4fD9Lly41JDl8h8eOHbvi16F169aGj4+PceLECYfPufxrZxiGMXPmTEOScezYsQLv8VoqV65sdOjQwTh37pxx7tw5Y+/evUbPnj0NScbgwYMdzs/X19eIi4tzeP+NnvvAgQON8uXLG+fPn3cY07NnT8PPz8+4dOnS3/YOAFZgyiIAFHErVqxQUFCQ2rRpI+mP+4969Oih999/Xzk5Oea4Tz75RPXr19cDDzxwxTFsNlu+9ePs7Cw3NzdJkt1uV3x8vLKzs9WkSRPt3r07z8d78MEH5eLiog8++MCsRUVF6ddff1WPHj3Mmr+/v/bv36/Dhw/fUN8DBgxQuXLlFBISos6dOys1NVXLly9XkyZNHMb9dVGKTz/9VHa7Xd27d9f58+fNV3BwsGrWrGlO1dy5c6fi4uL09NNPm9+PJPXr109+fn7X7e3cuXPavHmzBgwYoEqVKjnsy82v3c3o8c+++eYblStXTuXKlVP9+vX10UcfqU+fPpo+fbrDuG7duplX8K4lN+duGIY++eQTdenSRYZhOJxjRESEEhMTb+j3HgDcDExZBIAiLCcnR++//77atGlj3p8jSU2bNtXs2bO1YcMGdejQQdIfU/C6det2U/pavny5Zs+erYMHDyorK8usV61aNc/HKlu2rNq1a6cPP/xQL730kqQ/piu6uLjowQcfNMdNnjxZ999/v2rVqqW6deuqY8eO6tOnj2699dZcfc6ECRPUqlUrOTs7q2zZsqpTp45cXK78a/Kv53D48GEZhqGaNWte9biXVws8ceKEJF0x7vIy+9dzefn9unXr5upc/upm9PhnTZs21csvvyybzaZSpUqpTp06V131MTe/H3Jz7ufOnVNCQoKWLl2qpUuXXnVMXFxc7poHgJuMQAYARdi3336rs2fP6v3339f7779/xf4VK1aYgeyfutaVmJycHIcV8d555x3169dPXbt21ahRoxQYGChnZ2dNnTrVvC8rr3r27Kn+/ftrz549atCggT788EO1a9fOvE9Kklq3bq2jR4/qs88+0zfffKM333xTc+bM0ZIlS/T444//7WfUq1dP7du3/9txnp6eDtt2u102m01fffXVVVcG9Pb2zsUZFqyb3WPZsmVv6Lu8UZfvO3v00UfVt2/fq47JbTAHgJuNQAYARdiKFSsUGBioRYsWXbHv008/1cqVK7VkyRJ5enqqevXqioqKuu7xrjf9rXTp0ld94PCJEyccrp58/PHHqlatmj799FOH4/11UY686Nq1q5566ilz2uKhQ4c0duzYK8YFBASof//+6t+/v1JSUtS6dWtNnDgxV4HsRlWvXl2GYahq1aqqVavWNcdVrlxZ0h9Xqy6v4Cj9seDFsWPHVL9+/Wu+9/L3e6O/fjejx4KSm3MvV66cfHx8lJOTk6sgCACFCfeQAUARlZaWpk8//VT33nuvHnrooStekZGRSk5O1ueffy7pj/t19u7dq5UrV15xLOP/Vwv08vKSpKsGr+rVq+unn35SZmamWVu9evUVy6ZfvgJz+ZiStG3bNm3duvWGz9Xf318RERH68MMP9f7778vNzU1du3Z1GHPhwgWHbW9vb9WoUUMZGRk3/Lm58eCDD8rZ2VmTJk1yOGfpj+/gcl9NmjRRuXLltGTJEofvcNmyZVf9vv+sXLlyat26td566y2dPHnyis+47Fq/fjejx4KSm3N3dnZWt2
7d9Mknn1w1uJ07d+6m9AoAN4IrZABQRH3++edKTk7Wfffdd9X9zZo1Mx8S3aNHD40aNUoff/yxHn74YQ0YMECNGzd
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from sklearn.ensemble import RandomForestRegressor\n",
|
|||
|
"from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error\n",
|
|||
|
"from sklearn.model_selection import cross_val_score\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"import time\n",
|
|||
|
"import numpy as np\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(2000)\n",
|
|||
|
"\n",
|
|||
|
"# Создание нового признака 'relative_price'\n",
|
|||
|
"mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
|
|||
|
"df['relative_price'] = df['price'] / mean_price_by_category\n",
|
|||
|
"\n",
|
|||
|
"# Предобработка данных\n",
|
|||
|
"# Преобразуем категориальные переменные в числовые\n",
|
|||
|
"df = pd.get_dummies(df, drop_first=True)\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на признаки и целевую переменную\n",
|
|||
|
"X = df.drop('price', axis=1)\n",
|
|||
|
"y = df['price']\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и тестовую выборки\n",
|
|||
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Выбор модели\n",
|
|||
|
"model = RandomForestRegressor(random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Измерение времени обучения и предсказания\n",
|
|||
|
"start_time = time.time()\n",
|
|||
|
"\n",
|
|||
|
"# Обучение модели\n",
|
|||
|
"model.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Предсказание и оценка\n",
|
|||
|
"y_pred = model.predict(X_test)\n",
|
|||
|
"\n",
|
|||
|
"end_time = time.time()\n",
|
|||
|
"training_time = end_time - start_time\n",
|
|||
|
"\n",
|
|||
|
"rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
|
|||
|
"r2 = r2_score(y_test, y_pred)\n",
|
|||
|
"mae = mean_absolute_error(y_test, y_pred)\n",
|
|||
|
"\n",
|
|||
|
"print(f\"RMSE: {rmse}\")\n",
|
|||
|
"print(f\"R²: {r2}\")\n",
|
|||
|
"print(f\"MAE: {mae}\")\n",
|
|||
|
"print(f\"Training Time: {training_time} seconds\")\n",
|
|||
|
"\n",
|
|||
|
"# Кросс-валидация\n",
|
|||
|
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
|
|||
|
"rmse_cv = (-scores.mean())**0.5\n",
|
|||
|
"print(f\"Cross-validated RMSE: {rmse_cv}\")\n",
|
|||
|
"\n",
|
|||
|
"# Анализ важности признаков\n",
|
|||
|
"feature_importances = model.feature_importances_\n",
|
|||
|
"feature_names = X_train.columns\n",
|
|||
|
"\n",
|
|||
|
"importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})\n",
|
|||
|
"importance_df = importance_df.sort_values(by='Importance', ascending=False)\n",
|
|||
|
"\n",
|
|||
|
"# Отобразим только топ-20 признаков\n",
|
|||
|
"top_n = 20\n",
|
|||
|
"importance_df_top = importance_df.head(top_n)\n",
|
|||
|
"\n",
|
|||
|
"plt.figure(figsize=(10, 8))\n",
|
|||
|
"sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n",
|
|||
|
"plt.title(f'Top {top_n} Feature Importance')\n",
|
|||
|
"plt.xlabel('Importance')\n",
|
|||
|
"plt.ylabel('Feature')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Проверка на переобучение\n",
|
|||
|
"y_train_pred = model.predict(X_train)\n",
|
|||
|
"\n",
|
|||
|
"rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)\n",
|
|||
|
"r2_train = r2_score(y_train, y_train_pred)\n",
|
|||
|
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
|
|||
|
"\n",
|
|||
|
"print(f\"Train RMSE: {rmse_train}\")\n",
|
|||
|
"print(f\"Train R²: {r2_train}\")\n",
|
|||
|
"print(f\"Train MAE: {mae_train}\")\n",
|
|||
|
"\n",
|
|||
|
"correlation = np.corrcoef(y_test, y_pred)[0, 1]\n",
|
|||
|
"print(f\"Корреляция: {correlation:.2f}\")\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация результатов\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
|
|||
|
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
|
|||
|
"plt.xlabel('Actual Price')\n",
|
|||
|
"plt.ylabel('Predicted Price')\n",
|
|||
|
"plt.title('Actual vs Predicted Price')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Вывод:\n",
|
|||
|
"\n",
|
|||
|
"Время обучения:\n",
|
|||
|
"\n",
|
|||
|
"Время обучения модели составляет 4.76 секунды, что является средним. Это указывает на то, что модель обучается быстро и может эффективно обрабатывать данные.\n",
|
|||
|
"\n",
|
|||
|
"Предсказательная способность:\n",
|
|||
|
"\n",
|
|||
|
"MAE (Mean Absolute Error): 28.6974 — это средняя абсолютная ошибка предсказаний модели. Значение MAE невелико, что означает, что предсказанные значения в среднем отклоняются от реальных на 28.6974. Это может быть приемлемым уровнем ошибки.\n",
|
|||
|
"\n",
|
|||
|
"RMSE (Mean Squared Error): 534.088 — это среднее значение квадратов ошибок. Хотя MSE высокое, оно также может быть связано с большими значениями целевой переменной (цен).\n",
|
|||
|
"\n",
|
|||
|
"R² (коэффициент детерминации): 0.609 — это средний уровень, указывающий на то, что модель объясняет 60,9% вариации целевой переменной. Это свидетельствует о средней предсказательной способности модели.\n",
|
|||
|
"\n",
|
|||
|
"Корреляция:\n",
|
|||
|
"\n",
|
|||
|
"Корреляция (0.82) между предсказанными и реальными значениями говорит о том, что предсказания модели имеют сильную линейную зависимость с реальными значениями. Это подтверждает, что модель хорошо обучена и делает точные прогнозы.\n",
|
|||
|
"\n",
|
|||
|
"Надежность (кросс-валидация):\n",
|
|||
|
"\n",
|
|||
|
"Среднее RMSE (кросс-валидация): 133.75 — это значительно ниже, чем обычное RMSE, что указывает на отсутствие проблем с переобучением - что и подтверждается тестом переобучением. \n",
|
|||
|
"\n",
|
|||
|
"Результаты визуализации важности признаков, полученные из линейной регрессии, помогают понять, какие из входных переменных наибольшим образом влияют на целевую переменную (price). Это может быть полезным для дальнейшего анализа и при принятии бизнес-решений, связанных с управлением и ценообразованием в Jio Mart."
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "miienv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.7"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|