AIM-PIbd-32-Petrushin-E-A/lab_3/lab3.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment variant: car price prediction\n",
"### Business goals:\n",
"**1. More efficient pricing on the used-car market**\n",
"\n",
"Goal: build a machine learning model that accurately predicts the market value of cars on the used-car market.\n",
"\n",
"Key performance indicators (KPIs):\n",
"- Price prediction accuracy (for example, RMSE, MAE).\n",
"- Less time needed to value a car.\n",
"- More sales thanks to more competitive prices.\n",
"\n",
"**2. Optimized advertising budgets for online car marketplaces**\n",
"\n",
"Goal: use car price predictions to improve ad targeting and raise conversion on online marketplaces.\n",
"\n",
"Key performance indicators (KPIs):\n",
"- Higher click-through rate (CTR) of ads.\n",
"- Higher conversion (the share of users who buy after clicking an ad).\n",
"- Lower cost per acquisition (CPA).\n",
"\n",
"### Technical project goals:\n",
"**For business goal 1:**\n",
"\n",
"Data collection and preparation:\n",
"- Clean the data of missing values, outliers, and duplicates.\n",
"- Convert categorical variables to numeric ones.\n",
"- Split the data into training and test sets.\n",
"\n",
"Model development and training:\n",
"- Explore different machine learning algorithms (linear regression, decision trees, random forest, and so on).\n",
"- Train the models on the training set.\n",
"- Evaluate the models on the test set with metrics such as RMSE and MAE.\n",
"\n",
"Model deployment:\n",
"- Integrate the model into an existing system or build a new API for serving predictions.\n",
"- Create a web interface or mobile app for convenient use of the model.\n",
"\n",
"**For business goal 2:**\n",
"\n",
"Analysis of user and behavior data:\n",
"- Analyze views, clicks, and purchases on the online marketplace.\n",
"- Identify user segments with different levels of interest in buying a car.\n",
"\n",
"Recommender system development:\n",
"- Build a model that recommends cars matching users' preferences and budget.\n",
"- Integrate the recommender system into advertising campaigns.\n",
"\n",
"Ad targeting optimization:\n",
"- Use car price predictions to target ads more precisely at users who are ready to buy.\n",
"- Test different targeting strategies and evaluate their effectiveness."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['ID', 'Price', 'Levy', 'Manufacturer', 'Model', 'Prod. year',\n",
" 'Category', 'Leather interior', 'Fuel type', 'Engine volume', 'Mileage',\n",
" 'Cylinders', 'Gear box type', 'Drive wheels', 'Doors', 'Wheel', 'Color',\n",
" 'Airbags'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pn\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"import matplotlib.ticker as ticker\n",
"df = pn.read_csv(\".//static//csv//car_price_prediction.csv\")\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the data into three sets: training, validation, and test.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 12311\n",
"Validation set size: 3078\n",
"Test set size: 3848\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hold out 20% of the data as the test set (80% remains for training and validation)\n",
"train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# Split the remaining 80% again: 80% training, 20% validation (about 64% and 16% of all rows)\n",
"train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_data))\n",
"print(\"Validation set size:\", len(val_data))\n",
"print(\"Test set size:\", len(test_data))"
]
},
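{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, not part of the original lab, of how the two-stage split works out in proportions. It uses a hypothetical array of 1,000 row indices instead of the real dataset: holding out 20% and then splitting the remainder 80/20 leaves 64% for training, 16% for validation, and 20% for testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical stand-in for the row indices of the real dataset\n",
"rows = np.arange(1000)\n",
"\n",
"# Same two-stage split as in the lab cell\n",
"train_val, test = train_test_split(rows, test_size=0.2, random_state=42)\n",
"train, val = train_test_split(train_val, test_size=0.2, random_state=42)\n",
"\n",
"# Report each part as a share of the full dataset\n",
"for name, part in [('train', train), ('validation', val), ('test', test)]:\n",
"    print(f'{name}: {len(part)} rows ({len(part) / len(rows):.0%})')"
]
},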
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Check how the target variable (car price) is distributed in each split\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Histogram of the price distribution in the training set\n",
"sns.histplot(train_data['Price'], kde=True)\n",
"plt.title('Price distribution in the training set')\n",
"plt.show()\n",
"\n",
"# Histogram of the price distribution in the validation set\n",
"sns.histplot(val_data['Price'], kde=True)\n",
"plt.title('Price distribution in the validation set')\n",
"plt.show()\n",
"\n",
"# Histogram of the price distribution in the test set\n",
"sns.histplot(test_data['Price'], kde=True)\n",
"plt.title('Price distribution in the test set')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature engineering process\n",
"Task 1: Car price prediction\n",
"\n",
"Technical goal: build a machine learning model that accurately predicts the market value of cars.\n",
"\n",
"Task 2: Advertising budget optimization\n",
"\n",
"Technical goal: use car price predictions to improve ad targeting and raise conversion on online marketplaces.\n",
"\n",
"\n",
"### One-hot encoding of categorical features\n",
"\n",
"One-hot encoding converts each categorical feature into a set of binary indicator columns, one per category."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Categorical features to encode\n",
"categorical_features = ['Model', 'Category', 'Fuel type', 'Gear box type', 'Leather interior']\n",
"\n",
"# Apply one-hot encoding to each split\n",
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)\n",
"\n",
"# Align validation and test columns with the training set:\n",
"# categories unseen in training are dropped, missing ones are filled with 0\n",
"val_data_encoded = val_data_encoded.reindex(columns=train_data_encoded.columns, fill_value=0)\n",
"test_data_encoded = test_data_encoded.reindex(columns=train_data_encoded.columns, fill_value=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Discretization of numeric features\n",
"Discretization converts continuous numeric values into discrete categories or intervals (bins). It can be useful for several reasons: for example, it reduces the influence of outliers and lets simple models capture non-linear relationships."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Discretize 'Prod. year' into 5 equal-width bins (bin edges are computed separately for each split)\n",
"train_data_encoded['Year bin'] = pd.cut(train_data_encoded['Prod. year'], bins=5, labels=False)\n",
"val_data_encoded['Year bin'] = pd.cut(val_data_encoded['Prod. year'], bins=5, labels=False)\n",
"test_data_encoded['Year bin'] = pd.cut(test_data_encoded['Prod. year'], bins=5, labels=False)"
]
},
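{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `pd.cut` with an integer `bins` derives the edges from the data it is given, the same production year can land in different bins in different splits. One possible alternative, sketched here on hypothetical toy data rather than the lab dataset, learns the edges on the training set with `retbins=True` and reuses them for the other splits. Under this approach, years outside the training range become NaN, which may or may not be acceptable for the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for the 'Prod. year' column\n",
"train_years = pd.Series([1995, 2003, 2010, 2015, 2020])\n",
"val_years = pd.Series([1998, 2012, 2019])\n",
"\n",
"# Learn the bin edges on the training data only\n",
"train_bins, edges = pd.cut(train_years, bins=5, labels=False, retbins=True)\n",
"\n",
"# Reuse the same edges so a given year maps to the same bin in every split\n",
"val_bins = pd.cut(val_years, bins=edges, labels=False)\n",
"\n",
"print(edges)\n",
"print(val_bins.tolist())"
]
},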
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Manual feature synthesis\n",
"New features are created from expert knowledge and the logic of the problem domain. For car sales data, for example, a \"car age\" feature can be computed as the difference between the current year and the production year."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Synthesize the \"car age\" feature, using 2024 as the reference year\n",
"train_data_encoded['Age'] = 2024 - train_data_encoded['Prod. year']\n",
"val_data_encoded['Age'] = 2024 - val_data_encoded['Prod. year']\n",
"test_data_encoded['Age'] = 2024 - test_data_encoded['Prod. year']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature scaling\n",
"Feature scaling transforms numeric features so that they share the same scale. This matters for many machine learning algorithms that are sensitive to feature scale, such as linear models trained with regularization or gradient descent, support vector machines (SVM), and neural networks."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"\n",
"# Scale the numeric features: fit on the training set, then apply to validation and test\n",
"numerical_features = ['Airbags', 'Age']\n",
"\n",
"scaler = StandardScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])"
]
},
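{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of what `StandardScaler` does, the sketch below uses toy values (assumed for the example, not taken from the dataset) to check that transforming validation data is the same as computing z = (x - mean) / std with the mean and standard deviation of the training set only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical toy values for a single numeric feature\n",
"train = np.array([[2.0], [4.0], [6.0]])\n",
"val = np.array([[8.0]])\n",
"\n",
"# Fit on the training data only\n",
"scaler = StandardScaler().fit(train)\n",
"\n",
"# Manual z-score with the training mean and (population) standard deviation\n",
"manual = (val - train.mean(axis=0)) / train.std(axis=0)\n",
"\n",
"print(scaler.transform(val), manual)\n",
"print(np.allclose(scaler.transform(val), manual))"
]
},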
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature engineering with the Featuretools framework"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:724: UserWarning: A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: index\n",
" warnings.warn(\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
]
}
],
"source": [
"import featuretools as ft\n",
"\n",
"# Define the entity set with a single dataframe\n",
"es = ft.EntitySet(id='car_data')\n",
"es = es.add_dataframe(dataframe_name='cars', dataframe=train_data_encoded, index='id')\n",
"\n",
"# Define relationships between dataframes (if any)\n",
"# es = es.add_relationship(...)\n",
"\n",
"# Generate features\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='cars', max_depth=2)\n",
"\n",
"# Compute the same features for the validation and test rows\n",
"# (instance_ids are looked up in es, which contains only the training rows)\n",
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)\n",
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assessing the quality of each feature set\n",
"- **Predictive power.** Metrics: RMSE, MAE, R². Methods: train the model on the training set and evaluate it on the validation and test sets.\n",
"- **Computation speed.** Methods: measure the time taken to generate the features and to train the model.\n",
"- **Reliability.** Methods: cross-validation and analysis of how sensitive the model is to changes in the data.\n",
"- **Correlation.** Methods: analyze the feature correlation matrix and remove multicollinear features.\n",
"- **Coherence.** Methods: check that the features are logically related to the target variable and interpret the model's results."
]
},
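{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predictive power, speed, and reliability criteria above could be measured roughly as follows. This is a sketch on synthetic data with a plain linear regression, both chosen only for illustration; the lab's own feature matrices and model may differ. RMSE is computed as the square root of the MSE so that the code does not depend on the deprecated `squared` argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Hypothetical synthetic data standing in for a feature matrix and the price\n",
"rng = np.random.default_rng(42)\n",
"X = rng.normal(size=(500, 3))\n",
"y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=500)\n",
"X_train, X_val, y_train, y_val = X[:400], X[400:], y[:400], y[400:]\n",
"\n",
"# Computation speed: time the model fit\n",
"start = time.perf_counter()\n",
"model = LinearRegression().fit(X_train, y_train)\n",
"fit_seconds = time.perf_counter() - start\n",
"\n",
"# Predictive power on the validation set\n",
"pred = model.predict(X_val)\n",
"rmse = np.sqrt(mean_squared_error(y_val, pred))\n",
"mae = mean_absolute_error(y_val, pred)\n",
"r2 = r2_score(y_val, pred)\n",
"\n",
"# Reliability: 5-fold cross-validated RMSE on the training set\n",
"cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')\n",
"cv_rmse = -cv_scores.mean()\n",
"\n",
"print(f'RMSE: {rmse:.3f}, MAE: {mae:.3f}, R2: {r2:.3f}')\n",
"print(f'Cross-validated RMSE: {cv_rmse:.3f}, fit time: {fit_seconds:.4f} s')"
]
},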
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:724: UserWarning: A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: index\n",
" warnings.warn(\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
]
}
],
"source": [
"import featuretools as ft\n",
"\n",
"# Определение сущностей\n",
"es = ft.EntitySet(id='car_data')\n",
"es = es.add_dataframe(dataframe_name='cars', dataframe=train_data_encoded, index='id')\n",
"\n",
"# Генерация признаков\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='cars', max_depth=2)\n",
"\n",
"# Преобразование признаков для контрольной и тестовой выборок\n",
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)\n",
2024-10-11 23:18:43 +04:00
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)\n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE: 234661.34107821883\n",
"R²: 0.8029264507217629\n",
"MAE: 7964.677649030692\n",
"Cross-validated RMSE: 259310.71680259163\n",
"Train RMSE: 109324.02870848698\n",
"Train R²: 0.7887252013114727\n",
"Train MAE: 3471.173866063129\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Egor\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA04AAAIjCAYAAAA0vUuxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABzjElEQVR4nO3deZiN9eP/8deZfYyZ0cRgmBDGzthDtgiRohRKlhAxlmxZKlvIvkeytSAVSiolRLasY8ky1uxbmDGW2c79+8Ov+/s5WWYOM3PP8nxc17ku533f58zrjGOc17zv+33bDMMwBAAAAAC4LxerAwAAAABAWkdxAgAAAIBEUJwAAAAAIBEUJwAAAABIBMUJAAAAABJBcQIAAACARFCcAAAAACARFCcAAAAASATFCQAAAAASQXECACTKZrNpyJAhVsewXK1atVSrVi3z/okTJ2Sz2TR//nzLMv3XfzOmlLT42gEgJVGcACCVffzxx7LZbKpcufJDP8fZs2c1ZMgQhYeHJ1+wNO7333+XzWYzb+7u7nryySfVunVrHTt2zOp4Ttm0aZOGDBmia9euWZYhf/78Dt/PwMBAVa9eXcuWLbMsEwCkZW5WBwCAzGbBggXKnz+/tm7dqiNHjqhQoUJOP8fZs2c1dOhQ5c+fX6GhockfMg3r3r27KlasqLi4OO3cuVOzZs3Sjz/+qL179yooKChVs+TLl0+3bt2Su7u7U4/btGmThg4dqrZt2ypbtmwpEy4JQkND1bt3b0l33lOffPKJXnrpJc2YMUOdO3d+4GMf9rUDQHrFjBMApKLjx49r06ZNmjBhgnLkyKEFCxZYHSndqV69ulq1aqV27dpp6tSpGjdunK5cuaLPPvvsvo+5ceNGimSx2Wzy8vKSq6trijx/SsuTJ49atWqlVq1aqV+/ftq4caN8fHw0ceLE+z4mPj5esbGx6f61A4CzKE4AkIoWLFigxx57TI0aNVKzZs3uW5yuXbumd955R/nz55enp6fy5s2r1q1b6/Lly/r9999VsWJFSVK7du3MQ63+Pdckf/78atu27V3P+d9zX2JjY/XBBx+ofPny8vf3l4+Pj6pXr661a9c6/bouXLggNzc3DR069K5thw4dks1m07Rp0yRJcXFxGjp0qAoXLiwvLy89/vjjevrpp7Vq1Sqnv64kPfPMM5LulFJJGjJkiGw2m/bv36/XXntNjz32mJ5++mlz/y+//FLly5eXt7e3AgIC1KJFC506dequ5501a5YKFiwob29vVapUSX/88cdd+9zvPJ+DBw/q1VdfVY4cOeTt7a0iRYpo0KBBZr6+fftKkgoUKGD+/Z04cSJFMjojV65cKlasmPm9/Pf1jRs3TpMmTVLBggXl6emp/fv3P9Rr/9eZM2f05ptvKmfOnPL09FSJEiU0d+7cR8oOACmNQ/UAIBUtWLBAL730kjw8PNSyZUvNmDFD27ZtM4uQJEVHR6t69eo6cOCA3nzzTZUrV06XL1/W8uXLdfr0aRUrVkzDhg3TBx98oLfeekvVq1eXJFWtWtWpLFFRUZo9e7Zatmypjh076vr165ozZ47q16+vrVu3OnUIYM6cOVWzZk19/fXXGjx4sMO2xYsXy9XVVa+88oqkO8Vh1KhR6tChgypVqqSoqCht375dO3fu1LPPPuvUa5Cko0ePSpIef/xxh/FXXnlFhQsX1siRI2UYhiRpxIgRev/99/Xqq6+qQ4cOunTpkqZOnaoaNWpo165d5mFzc+bMUadOnVS1alX17NlTx44d0wsvvKCAgAAFBwc/MM+ePXtUvXp1ubu766233lL+/Pl19OhR/fDDDxoxYoReeuklRUREaNGiRZo4caKyZ88uScqRI0eqZbyfuLg4nTp16q7v5bx583T79m299dZb8vT0VEBAgOx2u9OvXbpTsp966inZbDaFhYUpR44c+vnnn9W+fXtFRUWpZ8+eD5UdAFKcAQBIFdu3bzckGatWrT
IMwzDsdruRN29eo0ePHg77ffDBB4YkY+nSpXc9h91uNwzDMLZt22ZIMubNm3fXPvny5TPatGlz13jNmjWNmjVrmvfj4+ONmJgYh32uXr1q5MyZ03jzzTcdxiUZgwcPfuDr++STTwxJxt69ex3GixcvbjzzzDPm/TJlyhiNGjV64HPdy9q1aw1Jxty5c41Lly4ZZ8+eNX788Ucjf/78hs1mM7Zt22YYhmEMHjzYkGS0bNnS4fEnTpwwXF1djREjRjiM792713BzczPHY2NjjcDAQCM0NNTh+zNr1ixDksP38Pjx43f9PdSoUcPw9fU1/v77b4ev8+/fnWEYxtixYw1JxvHjx1M84/3ky5fPqFevnnHp0iXj0qVLxu7du40WLVoYkoxu3bo5vD4/Pz/j4sWLDo9/2Nfevn17I3fu3Mbly5cd9mnRooXh7+9v3Lx5M9HsAGAFDtUDgFSyYMEC5cyZU7Vr15Z05/yY5s2b66uvvlJCQoK535IlS1SmTBk1bdr0ruew2WzJlsfV1VUeHh6SJLvdritXrig+Pl4VKlTQzp07nX6+l156SW5ublq8eLE5tm/fPu3fv1/Nmzc3x7Jly6a//vpLhw8ffqjcb775pnLkyKGgoCA1atRIN27c0GeffaYKFSo47PffxQ2WLl0qu92uV199VZcvXzZvuXLlUuHChc1DFLdv366LFy+qc+fO5vdHktq2bSt/f/8HZrt06ZLWr1+vN998U0888YTDtqT83aVGxv/166+/KkeOHMqRI4fKlCmjb775Rm+88YZGjx7tsN/LL79szojdT1Jeu2EYWrJkiRo3bizDMBxeY/369RUZGflQ7z0ASA2ZujitX79ejRs3VlBQkGw2m7777junHv/vcfT/vfn4+KRMYADpVkJCgr766ivVrl1bx48f15EjR3TkyBFVrlxZFy5c0OrVq819jx49qpIlS6ZKrs8++0ylS5c2zzXKkSOHfvzxR0VGRjr9XNmzZ1edOnX09ddfm2OLFy+Wm5ubXnrpJXNs2LBhunbtmkJCQlSqVCn17dtXe/bsSfLX+eCDD7Rq1SqtWbNGe/bs0dmzZ/XGG2/ctV+BAgUc7h8+fFiGYahw4cJmWfj3duDAAV28eFGS9Pfff0uSChcu7PD4f5c/f5B/l0V/2L+/1Mj4vypXrqxVq1bpt99+06ZNm3T58mV9/vnn8vb2dtjvv9/Le0nKa7906ZKuXbumWbNm3fX62rVrJ0nmawSAtCZTn+N048YNlSlTRm+++abDf+pJ1adPn7t+o1mnTh2HcxUAQJLWrFmjc+fO6auvvtJXX3111/YFCxaoXr16yfK17jezkZCQ4LAC2pdffqm2bduqSZMm6tu3rwIDA+Xq6qpRo0aZ5w05q0WLFmrXrp3Cw8MVGhqqr7/+WnXq1DHP45GkGjVq6OjRo/r+++/166+/avbs2Zo4caJmzpypDh06JPo1SpUqpbp16ya6338//NvtdtlsNv3888/3XAkua9asSXiFKSu1M2bPnv2hvpcP69/zolq1aqU2bdrcc5/SpUsny9cCgOSWqYvTc889p+eee+6+22NiYjRo0CAtWrRI165dU8mSJTV69GhzVaqsWbM6/Ce2e/du7d+/XzNnzkzp6ADSmQULFigwMFDTp0+/a9vSpUu1bNkyzZw5U97e3ipYsKD27dv3wOd70GFfjz322D0vrPr33387zEZ8++23evLJJ7V06VKH5/vv4g7OaNKkiTp16mQerhcREaEBAwbctV9AQIDatWundu3aKTo6WjVq1NCQIUOSVJweVsGCBWUYhgoUKKCQkJD77pcvXz5Jd2Z//l2xT7qzcMLx48dVpkyZ+z723+/vw/79pUbGlJKU154jRw75+voqISEhSYUNANKSTH2oXmLCwsK0efNmffXVV9qzZ49eeeUVNWjQ4L7H5c+ePVshISHmClcAIEm3bt3S0qVL9fzzz6tZs2Z33cLCwnT9+nUtX75c0p3zSXbv3q
1ly5bd9VzG/18d7t9Dgu9VkAoWLKgtW7YoNjbWHFuxYsVdy1n/O6Px73NK0p9//qnNmzc/9GvNli2b6tevr6+//lp
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error\n",
"from sklearn.model_selection import cross_val_score\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Удаление строк с NaN\n",
"feature_matrix = feature_matrix.dropna()\n",
"val_feature_matrix = val_feature_matrix.dropna()\n",
"test_feature_matrix = test_feature_matrix.dropna()\n",
"\n",
"# Разделение данных на обучающую и тестовую выборки\n",
"X_train = feature_matrix.drop('Price', axis=1)\n",
"y_train = feature_matrix['Price']\n",
"X_val = val_feature_matrix.drop('Price', axis=1)\n",
"y_val = val_feature_matrix['Price']\n",
"X_test = test_feature_matrix.drop('Price', axis=1)\n",
"y_test = test_feature_matrix['Price']\n",
"\n",
"# Выбор модели\n",
"model = RandomForestRegressor(random_state=42)\n",
"\n",
"# Обучение модели\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Предсказание и оценка\n",
"y_pred = model.predict(X_test)\n",
"\n",
"rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
"r2 = r2_score(y_test, y_pred)\n",
"mae = mean_absolute_error(y_test, y_pred)\n",
"\n",
"print(f\"RMSE: {rmse}\")\n",
"print(f\"R²: {r2}\")\n",
"print(f\"MAE: {mae}\")\n",
"\n",
"# Кросс-валидация\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validated RMSE: {rmse_cv}\")\n",
"\n",
"# Анализ важности признаков\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"# importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})\n",
"# importance_df = importance_df.sort_values(by='Importance', ascending=False)\n",
"\n",
"# plt.figure(figsize=(10, 6))\n",
"# sns.barplot(x='Importance', y='Feature', data=importance_df)\n",
"# plt.title('Feature Importance')\n",
"# plt.show()\n",
"\n",
"# Проверка на переобучение\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)\n",
"r2_train = r2_score(y_train, y_train_pred)\n",
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
"\n",
"print(f\"Train RMSE: {rmse_train}\")\n",
"print(f\"Train R²: {r2_train}\")\n",
"print(f\"Train MAE: {mae_train}\")\n",
"\n",
"# Визуализация результатов\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual Price')\n",
"plt.ylabel('Predicted Price')\n",
"plt.title('Actual vs Predicted Price')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Точность предсказаний: Модель показывает довольно высокий R² (0.8029), что указывает на хорошее объяснение вариации цен. Однако, значения RMSE и MAE довольно высоки, что говорит о том, что модель не очень точно предсказывает цены, особенно для высоких значений.\n",
"\n",
"Переобучение: Разница между RMSE на обучающей и тестовой выборках не очень большая, что указывает на то, что переобучение не является критическим. Однако, стоит быть осторожным и продолжать мониторинг этого показателя.\n",
"\n",
"Кросс-валидация: Значение RMSE после кросс-валидации немного выше, чем на тестовой выборке, что может указывать на некоторую нестабильность модели."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}