{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Laboratory Work No. 3\n",
"\n",
"*Assignment variant:* Jio Mart products (variant 23)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this lab on the 'jio mart product items' dataset, here are two example business goals:\n",
"\n",
"### Business goals:\n",
"\n",
"1. **Product price optimization**\n",
"   \n",
"   **Goal:** Reduce costs and increase sales by optimizing product prices.\n",
"\n",
"   **Technical goal:** Build a machine learning model that predicts whether a product is overpriced for its category.\n",
"   \n",
"\n",
"2. **Distribution of products across categories**\n",
"   \n",
"   **Goal:** Optimize how products are distributed across categories.\n",
"\n",
"   **Technical goal:** Build a machine learning model that predicts optimal product prices from their categories, subcategories, and current prices.\n",
"   "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['category', 'sub_category', 'href', 'items', 'price'], dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.ticker as ticker\n",
"import seaborn as sns\n",
"\n",
"# Load the data\n",
"df = pd.read_csv(\"../static/csv/jio_mart_items.csv\")\n",
"\n",
"# Keep only the first 15,000 rows (copy so later column assignments do not hit a view)\n",
"df = df.iloc[:15000].copy()\n",
"\n",
"# Print the column names\n",
"print(df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>category</th>\n",
" <th>sub_category</th>\n",
" <th>href</th>\n",
" <th>items</th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Groceries</td>\n",
" <td>Fruits & Vegetables</td>\n",
" <td>https://www.jiomart.com/c/groceries/fruits-veg...</td>\n",
" <td>Fresh Dates (Pack) (Approx 450 g - 500 g)</td>\n",
" <td>109.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Groceries</td>\n",
" <td>Fruits & Vegetables</td>\n",
" <td>https://www.jiomart.com/c/groceries/fruits-veg...</td>\n",
" <td>Tender Coconut Cling Wrapped (1 pc) (Approx 90...</td>\n",
" <td>49.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Groceries</td>\n",
" <td>Fruits & Vegetables</td>\n",
" <td>https://www.jiomart.com/c/groceries/fruits-veg...</td>\n",
" <td>Mosambi 1 kg</td>\n",
" <td>69.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Groceries</td>\n",
" <td>Fruits & Vegetables</td>\n",
" <td>https://www.jiomart.com/c/groceries/fruits-veg...</td>\n",
" <td>Orange Imported 1 kg</td>\n",
" <td>125.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Groceries</td>\n",
" <td>Fruits & Vegetables</td>\n",
" <td>https://www.jiomart.com/c/groceries/fruits-veg...</td>\n",
" <td>Banana Robusta 6 pcs (Box) (Approx 800 g - 110...</td>\n",
" <td>44.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" category sub_category \\\n",
"0 Groceries Fruits & Vegetables \n",
"1 Groceries Fruits & Vegetables \n",
"2 Groceries Fruits & Vegetables \n",
"3 Groceries Fruits & Vegetables \n",
"4 Groceries Fruits & Vegetables \n",
"\n",
" href \\\n",
"0 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"1 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"2 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"3 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"4 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"\n",
" items price \n",
"0 Fresh Dates (Pack) (Approx 450 g - 500 g) 109.0 \n",
"1 Tender Coconut Cling Wrapped (1 pc) (Approx 90... 49.0 \n",
"2 Mosambi 1 kg 69.0 \n",
"3 Orange Imported 1 kg 125.0 \n",
"4 Banana Robusta 6 pcs (Box) (Approx 800 g - 110... 44.0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Preview the first rows\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>15000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>373.427633</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>463.957949</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>5.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>123.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>250.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>446.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>14999.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" price\n",
"count 15000.000000\n",
"mean 373.427633\n",
"std 463.957949\n",
"min 5.000000\n",
"25% 123.000000\n",
"50% 250.000000\n",
"75% 446.000000\n",
"max 14999.000000"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Summary statistics for the numeric columns\n",
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"category 0\n",
"sub_category 0\n",
"href 0\n",
"items 0\n",
"price 0\n",
"dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"category False\n",
"sub_category False\n",
"href False\n",
"items False\n",
"price False\n",
"dtype: bool"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Percentage of missing values per feature (printed only when non-zero)\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f'{i}: {null_rate:.2f}% missing values')\n",
"\n",
"# Count of missing values per column\n",
"print(df.isnull().sum())\n",
"\n",
"df.isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are no missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting into training, validation, and test sets"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Split sizes:\n",
"Training set: 9000 rows\n",
"Validation set: 3000 rows\n",
"Test set: 3000 rows\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1800x600 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Separate the features from the target variable\n",
"X = df.drop(columns=['price'])  # Features (every column except 'price')\n",
"y = df['price']  # Target variable (price)\n",
"\n",
"# Split into training (60%), validation (20%), and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the split sizes\n",
"print(\"Split sizes:\")\n",
"print(f\"Training set: {X_train.shape[0]} rows\")\n",
"print(f\"Validation set: {X_val.shape[0]} rows\")\n",
"print(f\"Test set: {X_test.shape[0]} rows\")\n",
"\n",
"# Plot the price distribution in each split\n",
"plt.figure(figsize=(18, 6))\n",
"\n",
"plt.subplot(1, 3, 1)\n",
"plt.hist(y_train, bins=30, color='blue', alpha=0.7)\n",
"plt.title('Training set')\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.subplot(1, 3, 2)\n",
"plt.hist(y_val, bins=30, color='green', alpha=0.7)\n",
"plt.title('Validation set')\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.subplot(1, 3, 3)\n",
"plt.hist(y_test, bins=30, color='red', alpha=0.7)\n",
"plt.title('Test set')\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Balancing the samples**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Split sizes:\n",
"Training set: 9000 rows\n",
"Validation set: 3000 rows\n",
"Test set: 3000 rows\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Separate the features from the target variable\n",
"X = df.drop(columns=['price'])  # Features (every column except 'price')\n",
"y = df['price']  # Target variable (price)\n",
"\n",
"# One-hot encode the categorical features\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Split into training (60%), validation (20%), and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the split sizes\n",
"print(\"Split sizes:\")\n",
"print(f\"Training set: {X_train.shape[0]} rows\")\n",
"print(f\"Validation set: {X_val.shape[0]} rows\")\n",
"print(f\"Test set: {X_test.shape[0]} rows\")\n",
"\n",
"# Remove outliers from the training set (prices above the 95th percentile)\n",
"upper_limit = y_train.quantile(0.95)\n",
"X_train = X_train[y_train <= upper_limit]\n",
"y_train = y_train[y_train <= upper_limit]\n",
"\n",
"# Log-transform the target variable\n",
"y_train_log = np.log1p(y_train)\n",
"y_val_log = np.log1p(y_val)\n",
"y_test_log = np.log1p(y_test)\n",
"\n",
"# Standardize the features (fit on the training set only)\n",
"scaler = StandardScaler()\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"X_val_scaled = scaler.transform(X_val)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"# Plot the price distribution in the balanced training set\n",
"plt.figure(figsize=(10, 6))\n",
"plt.hist(y_train_log, bins=30, color='orange', alpha=0.7)\n",
"plt.title('Balanced training set (log-transformed)')\n",
"plt.xlabel('Log of price')\n",
"plt.ylabel('Count')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**One-hot encoding of categorical features**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before one-hot encoding:\n",
" category sub_category \\\n",
"0 Groceries Fruits & Vegetables \n",
"1 Groceries Fruits & Vegetables \n",
"2 Groceries Fruits & Vegetables \n",
"3 Groceries Fruits & Vegetables \n",
"4 Groceries Fruits & Vegetables \n",
"\n",
" href \\\n",
"0 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"1 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"2 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"3 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"4 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"\n",
" items price \n",
"0 Fresh Dates (Pack) (Approx 450 g - 500 g) 109.0 \n",
"1 Tender Coconut Cling Wrapped (1 pc) (Approx 90... 49.0 \n",
"2 Mosambi 1 kg 69.0 \n",
"3 Orange Imported 1 kg 125.0 \n",
"4 Banana Robusta 6 pcs (Box) (Approx 800 g - 110... 44.0 \n",
"\n",
"Data after one-hot encoding:\n",
" price sub_category_Fruits & Vegetables sub_category_Premium Fruits \\\n",
"0 109.0 True False \n",
"1 49.0 True False \n",
"2 69.0 True False \n",
"3 125.0 True False \n",
"4 44.0 True False \n",
"\n",
" sub_category_Snacks & Branded Foods sub_category_Staples \\\n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"\n",
" href_https://www.jiomart.com/c/groceries/dairy-bakery/bakery-snacks/281 \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" href_https://www.jiomart.com/c/groceries/dairy-bakery/batter-chutney/407 \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" href_https://www.jiomart.com/c/groceries/dairy-bakery/breads-and-buns/267 \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" href_https://www.jiomart.com/c/groceries/dairy-bakery/cakes-muffins/125 \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" href_https://www.jiomart.com/c/groceries/dairy-bakery/cheese/1569 ... \\\n",
"0 False ... \n",
"1 False ... \n",
"2 False ... \n",
"3 False ... \n",
"4 False ... \n",
"\n",
" items_sUpazon Instant Idli Mix, 400g (Ragi Idli) \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_sUpazon Instant Idli Mix, 400g (Rawa Idli) \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_shivanyamart Roasted Peanuts - 1 kg \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_shivanyamart Special Bombay Mixer - 200 g (Pack of 2) \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_tasty tongue - Haldi ka Achar with Lime decoction, 190 gms Glass Jar \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_tasty tongue - Homemade Baingan Ka Achar, Certified, No Added preservatives |190 GMS Glass Jar \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_tasty tongue - Homemade Heeng wala Nimbu ka teekha Achar , Certified | 190 GMS Glass Jar \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_tasty tongue - Homemade Kacche Aam ka Achar, Certified | 350 GMS Glass Jar \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_tasty tongue - Homemade Khatta Meetha Nimbu ka Achar ,Certified | 350 GMS Glass Jar \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" items_xThe Whole Truth - 71% Dark Chocolate Combo - (Pack of 3) - 1 - 71% ,1 - Sea-Salt , 1 - Orange - No Added Sugar - Sweetened Only with Dates - 71% Cocoa - 29% Dates \n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
"[5 rows x 14769 columns]\n"
]
}
],
"source": [
"print(\"Data before one-hot encoding:\")\n",
"print(df.head())\n",
"\n",
"# One-hot encode the categorical features\n",
"df_encoded = pd.get_dummies(df, drop_first=True)\n",
"\n",
"print(\"\\nData after one-hot encoding:\")\n",
"print(df_encoded.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Discretization of numeric features**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before discretization:\n",
" category sub_category \\\n",
"0 Groceries Fruits & Vegetables \n",
"1 Groceries Fruits & Vegetables \n",
"2 Groceries Fruits & Vegetables \n",
"3 Groceries Fruits & Vegetables \n",
"4 Groceries Fruits & Vegetables \n",
"\n",
" href \\\n",
"0 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"1 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"2 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"3 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"4 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"\n",
" items price price_bins \n",
"0 Fresh Dates (Pack) (Approx 450 g - 500 g) 109.0 100-500 \n",
"1 Tender Coconut Cling Wrapped (1 pc) (Approx 90... 49.0 0-100 \n",
"2 Mosambi 1 kg 69.0 0-100 \n",
"3 Orange Imported 1 kg 125.0 100-500 \n",
"4 Banana Robusta 6 pcs (Box) (Approx 800 g - 110... 44.0 0-100 \n",
"\n",
"Data after discretization:\n",
" price price_bins\n",
"0 109.0 100-500\n",
"1 49.0 0-100\n",
"2 69.0 0-100\n",
"3 125.0 100-500\n",
"4 44.0 0-100\n"
]
}
],
"source": [
"print(\"Data before discretization:\")\n",
"print(df.head())\n",
"\n",
"# Define the bin edges and labels for discretization\n",
"bins = [0, 100, 500, 1000, 5000, float('inf')]\n",
"labels = ['0-100', '100-500', '500-1000', '1000-5000', '5000+']\n",
"\n",
"# Apply the discretization (left-closed intervals, e.g. [0, 100))\n",
"df['price_bins'] = pd.cut(df['price'], bins=bins, labels=labels, right=False)\n",
"\n",
"print(\"\\nData after discretization:\")\n",
"print(df[['price', 'price_bins']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Manual feature synthesis**\n",
"\n",
"Creating new features from expert knowledge and the logic of the subject area. For example, for product sales data one could derive a price-per-unit feature."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before feature synthesis:\n",
" category sub_category \\\n",
"0 Groceries Fruits & Vegetables \n",
"1 Groceries Fruits & Vegetables \n",
"2 Groceries Fruits & Vegetables \n",
"3 Groceries Fruits & Vegetables \n",
"4 Groceries Fruits & Vegetables \n",
"\n",
" href \\\n",
"0 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"1 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"2 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"3 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"4 https://www.jiomart.com/c/groceries/fruits-veg... \n",
"\n",
" items price price_bins \n",
"0 Fresh Dates (Pack) (Approx 450 g - 500 g) 109.0 100-500 \n",
"1 Tender Coconut Cling Wrapped (1 pc) (Approx 90... 49.0 0-100 \n",
"2 Mosambi 1 kg 69.0 0-100 \n",
"3 Orange Imported 1 kg 125.0 100-500 \n",
"4 Banana Robusta 6 pcs (Box) (Approx 800 g - 110... 44.0 0-100 \n",
"\n",
"Data after synthesizing the 'relative_price' feature:\n",
" price category relative_price\n",
"0 109.0 Groceries 0.291891\n",
"1 49.0 Groceries 0.131217\n",
"2 69.0 Groceries 0.184775\n",
"3 125.0 Groceries 0.334737\n",
"4 44.0 Groceries 0.117827\n"
]
}
],
"source": [
"# Inspect the first rows of the data\n",
"print(\"Data before feature synthesis:\")\n",
"print(df.head())\n",
"\n",
"# Compute the mean price within each category\n",
"mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
"\n",
"# Create the new feature 'relative_price' (price relative to the category mean)\n",
"df['relative_price'] = df['price'] / mean_price_by_category\n",
"\n",
"# Inspect the first rows after feature synthesis\n",
"print(\"\\nData after synthesizing the 'relative_price' feature:\")\n",
"print(df[['price', 'category', 'relative_price']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature scaling via normalization and standardization**\n",
"\n",
"Feature scaling transforms numeric features so that they share a common scale. This matters for many machine learning algorithms that are sensitive to feature scale, such as regularized linear regression, support vector machines (SVM), and neural networks."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before scaling:\n",
" price relative_price\n",
"0 109.0 0.291891\n",
"1 49.0 0.131217\n",
"2 69.0 0.184775\n",
"3 125.0 0.334737\n",
"4 44.0 0.117827\n",
"\n",
"Data after normalization:\n",
" price relative_price\n",
"0 0.006936 0.006936\n",
"1 0.002935 0.002935\n",
"2 0.004268 0.004268\n",
"3 0.008003 0.008003\n",
"4 0.002601 0.002601\n",
"\n",
"Data after standardization:\n",
" price relative_price\n",
"0 -0.569958 -0.569958\n",
"1 -0.699284 -0.699284\n",
"2 -0.656175 -0.656175\n",
"3 -0.535471 -0.535471\n",
"4 -0.710061 -0.710061\n"
]
}
],
"source": [
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"\n",
"# Create the feature 'relative_price' (price relative to the mean price in its category)\n",
"mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
"df['relative_price'] = df['price'] / mean_price_by_category\n",
"\n",
"# Inspect the first rows before scaling\n",
"print(\"Data before scaling:\")\n",
"print(df[['price', 'relative_price']].head())\n",
"\n",
"# Min-max normalization of the features (overwrites the columns in place)\n",
"min_max_scaler = MinMaxScaler()\n",
"df[['price', 'relative_price']] = min_max_scaler.fit_transform(df[['price', 'relative_price']])\n",
"\n",
"# Inspect the first rows after normalization\n",
"print(\"\\nData after normalization:\")\n",
"print(df[['price', 'relative_price']].head())\n",
"\n",
"# Standardization of the features (applied to the already normalized columns)\n",
"standard_scaler = StandardScaler()\n",
"df[['price', 'relative_price']] = standard_scaler.fit_transform(df[['price', 'relative_price']])\n",
"\n",
"# Inspect the first rows after standardization\n",
"print(\"\\nData after standardization:\")\n",
"print(df[['price', 'relative_price']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature engineering with the Featuretools framework**"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/featuretools/synthesis/deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Built 7 features\n",
"Elapsed: 00:00 | Progress: 100%|██████████\n",
"Новые признаки, созданные с помощью Featuretools:\n",
" category sub_category price price_bins relative_price \\\n",
"index \n",
"0 Groceries Fruits & Vegetables -0.569958 100-500 6.281321e+15 \n",
"1 Groceries Fruits & Vegetables -0.699284 0-100 7.706585e+15 \n",
"2 Groceries Fruits & Vegetables -0.656175 0-100 7.231497e+15 \n",
"3 Groceries Fruits & Vegetables -0.535471 100-500 5.901250e+15 \n",
"4 Groceries Fruits & Vegetables -0.710061 0-100 7.825357e+15 \n",
"\n",
" NUM_CHARACTERS(items) NUM_WORDS(items) \n",
"index \n",
"0 41 8 \n",
"1 59 11 \n",
"2 12 3 \n",
"3 20 4 \n",
"4 50 10 \n"
]
}
],
"source": [
"import featuretools as ft\n",
"\n",
"# 'relative_price' is already present in df from the previous cell, so it is not recomputed here\n",
"\n",
"# Create the EntitySet\n",
"es = ft.EntitySet(id='jio_mart_items')\n",
"\n",
"# Add the data, letting Featuretools create an explicit index column\n",
"es = es.add_dataframe(dataframe_name='items_data', dataframe=df, index='index', make_index=True)\n",
"\n",
"# Run deep feature synthesis\n",
"features, feature_defs = ft.dfs(entityset=es, target_dataframe_name='items_data', verbose=True)\n",
"\n",
"# First rows of the generated features\n",
"print(\"Новые признаки, созданные с помощью Featuretools:\")\n",
"print(features.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Quality assessment**\n",
"\n",
"*Predictive power. Metrics:* RMSE, MAE, R² \n",
"\n",
"*Methods:* Train the model on the training set and evaluate it on the validation and test sets. \n",
"\n",
"*Computation speed. Methods:* Measure the time needed for feature generation and model training. \n",
"\n",
"*Reliability. Methods:* Cross-validation and analysis of the model's sensitivity to changes in the data. \n",
"\n",
"*Correlation. Methods:* Analyze the feature correlation matrix and remove multicollinear features. \n",
"\n",
"*Coherence. Methods:* Check the logical link between the features and the target variable, and interpret the model's results. "
]
},
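{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three metrics listed above can be computed by hand, which makes their definitions explicit. The next cell is a sketch on hypothetical actual and predicted prices (the arrays `y_true_toy` and `y_hat_toy` are illustrative and not part of the dataset). RMSE is the square root of the mean squared error, MAE is the mean absolute error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical actual and predicted prices, for illustration only\n",
"y_true_toy = np.array([100.0, 50.0, 70.0, 120.0])\n",
"y_hat_toy = np.array([110.0, 45.0, 65.0, 100.0])\n",
"\n",
"errors = y_true_toy - y_hat_toy\n",
"\n",
"# Root mean squared error: large errors are penalized more heavily\n",
"rmse_toy = np.sqrt(np.mean(errors ** 2))\n",
"\n",
"# Mean absolute error: average size of an error in price units\n",
"mae_toy = np.mean(np.abs(errors))\n",
"\n",
"# Coefficient of determination: share of the target variance explained\n",
"ss_res = np.sum(errors ** 2)\n",
"ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)\n",
"r2_toy = 1 - ss_res / ss_tot\n",
"\n",
"print(f\"RMSE: {rmse_toy:.2f}, MAE: {mae_toy:.2f}, R2: {r2_toy:.3f}\")"
]
},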
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/metrics/_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE: 534.0885949291326\n",
"R²: 0.6087611252156747\n",
"MAE: 28.697400000000002\n",
"Training Time: 1.323014259338379 seconds\n",
"Cross-validated RMSE: 133.74731704254154\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/rd/3q9k4y6x0mn6ztd0mby6zx3r0000gn/T/ipykernel_96452/3211138617.py:70: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train RMSE: 50.92770420271637\n",
"Train R²: 0.9845578370650323\n",
"Train MAE: 1.9114281249999987\n",
"Корреляция: 0.82\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/metrics/_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import time\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"\n",
"# Load the data (first 2000 rows)\n",
"df = pd.read_csv(\"..//static//csv//jio_mart_items.csv\").head(2000)\n",
"\n",
"# Create the new feature 'relative_price'\n",
"# Note: it is derived from 'price', the target, so it leaks target information into X\n",
"mean_price_by_category = df.groupby('category')['price'].transform('mean')\n",
"df['relative_price'] = df['price'] / mean_price_by_category\n",
"\n",
"# Preprocessing: one-hot encode the categorical columns\n",
"# Note: this also encodes the text columns 'href' and 'items'\n",
"df = pd.get_dummies(df, drop_first=True)\n",
"\n",
"# Split the data into features and target\n",
"X = df.drop('price', axis=1)\n",
"y = df['price']\n",
"\n",
"# Split the data into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Choose the model\n",
"model = RandomForestRegressor(random_state=42)\n",
"\n",
"# Measure the time taken by training and prediction\n",
"start_time = time.time()\n",
"\n",
"# Train the model\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Predict on the test set\n",
"y_pred = model.predict(X_test)\n",
"\n",
"end_time = time.time()\n",
"training_time = end_time - start_time\n",
"\n",
"# RMSE as the square root of MSE (the 'squared' argument is deprecated in scikit-learn 1.4)\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"r2 = r2_score(y_test, y_pred)\n",
"mae = mean_absolute_error(y_test, y_pred)\n",
"\n",
"print(f\"RMSE: {rmse}\")\n",
"print(f\"R²: {r2}\")\n",
"print(f\"MAE: {mae}\")\n",
"print(f\"Training Time: {training_time} seconds\")\n",
"\n",
"# Cross-validation on the training set\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validated RMSE: {rmse_cv}\")\n",
"\n",
"# Feature importance analysis\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})\n",
"importance_df = importance_df.sort_values(by='Importance', ascending=False)\n",
"\n",
"# Show only the top 20 features\n",
"top_n = 20\n",
"importance_df_top = importance_df.head(top_n)\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"# Assign 'Feature' to hue so that the palette is applied without a deprecation warning\n",
"sns.barplot(x='Importance', y='Feature', hue='Feature', data=importance_df_top, palette='viridis', legend=False)\n",
"plt.title(f'Top {top_n} Feature Importance')\n",
"plt.xlabel('Importance')\n",
"plt.ylabel('Feature')\n",
"plt.show()\n",
"\n",
"# Overfitting check: the same metrics on the training set\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))\n",
"r2_train = r2_score(y_train, y_train_pred)\n",
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
"\n",
"print(f\"Train RMSE: {rmse_train}\")\n",
"print(f\"Train R²: {r2_train}\")\n",
"print(f\"Train MAE: {mae_train}\")\n",
"\n",
"# Correlation between actual and predicted prices\n",
"correlation = np.corrcoef(y_test, y_pred)[0, 1]\n",
"print(f\"Корреляция: {correlation:.2f}\")\n",
"\n",
"# Plot actual against predicted prices\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual Price')\n",
"plt.ylabel('Predicted Price')\n",
"plt.title('Actual vs Predicted Price')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusions and summary \n",
"\n",
"**Training time:**\n",
"\n",
"Training the model and predicting on the test set took 1.32 seconds, so the model trains quickly on this 2,000-row sample.\n",
"\n",
"**Predictive power:**\n",
"\n",
"MAE (Mean Absolute Error): 28.6974 is the mean absolute error of the model's predictions: on average, a predicted price differs from the actual one by about 28.70. This may be an acceptable level of error.\n",
"\n",
"RMSE (Root Mean Squared Error): 534.088 is the square root of the mean squared error. It is far larger than the MAE, which suggests that a few large errors, probably on high-priced items, dominate it.\n",
"\n",
"R² (coefficient of determination): 0.609 is a moderate value: the model explains 60.9% of the variance of the target variable, so its predictive power is moderate.\n",
"\n",
"**Correlation:**\n",
"\n",
"The correlation of 0.82 between predicted and actual values shows a strong linear relationship between them. The predictions track the actual prices well, although correlation alone does not measure the size of the errors.\n",
"\n",
"**Reliability (cross-validation):**\n",
"\n",
"The cross-validated RMSE is 133.75, well below the test RMSE of 534.09. However, the overfitting check gives a train RMSE of 50.93 and a train R² of 0.985, against 534.09 and 0.609 on the test set. A gap this large points to overfitting, or to a few extreme prices in the test split, rather than to its absence. \n",
"\n",
"The feature importance plot, obtained from the random forest model, shows which input variables have the greatest influence on the target variable (price). This can be useful for further analysis and for business decisions about management and pricing at Jio Mart."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}