1324 lines
684 KiB
Plaintext
1324 lines
684 KiB
Plaintext
|
{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Начало лабораторной работы"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Цены на автомобили"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Index(['ID', 'Price', 'Levy', 'Manufacturer', 'Model', 'Prod. year',\n",
|
|||
|
" 'Category', 'Leather interior', 'Fuel type', 'Engine volume', 'Mileage',\n",
|
|||
|
" 'Cylinders', 'Gear box type', 'Drive wheels', 'Doors', 'Wheel', 'Color',\n",
|
|||
|
" 'Airbags'],\n",
|
|||
|
" dtype='object')\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//car_price_prediction.csv\")\n",
|
|||
|
"print(df.columns)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Проблемная область: Данные о ценах на автомобили, включая их характеристики\n",
|
|||
|
"\n",
|
|||
|
"Объект наблюдения: автомобиль\n",
|
|||
|
"\n",
|
|||
|
"Атрибуты: идентификатор, цена, налог, производитель, модель, год производства, категория, наличие кожаного салона, тип топлива, объем двигателя, пробег автомобиля, количество цилиндров в двигателе, тип коробки передач, тип привода, количество дверей, расположение руля, цвет, количество подушек безопасностей.\n",
|
|||
|
"\n",
|
|||
|
"Пример бизнес-цели: \n",
|
|||
|
"1. Анализ данных: Изучение и очистка данных для выявления закономерностей и корреляций между характеристиками автомобилей и их ценами.\n",
|
|||
|
"2. Разработка модели: Создание и обучение модели машинного обучения, которая будет прогнозировать цены на автомобили на основе их характеристик.\n",
|
|||
|
"3. Внедрение: Интеграция модели в систему ценообразования компании для автоматического расчета цен на автомобили.\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"Актуальность: Данный датасет является актуальным и ценным ресурсом для компаний, занимающихся продажей автомобилей, а также для исследователей и инвесторов, поскольку он предоставляет обширную информацию о ценах и характеристиках автомобилей на вторичном рынке. Эти данные могут быть использованы для разработки моделей прогнозирования цен, анализа рыночных тенденций и принятия обоснованных бизнес-решений.\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA04AAAIjCAYAAAA0vUuxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABSuElEQVR4nO3deVwV9f7H8fdB5QCyuLGIEpKWueWWGpqS5Zq5ZItp5ZZLqVnaanVT26jMym6Wt7pqZaZZqZWp4ZaplEtqLmWuYQpuJSgqKHx/f/jjXI8sc0DgHOH1fDzOo87Md+Z85suA8z4z8x2bMcYIAAAAAJArL3cXAAAAAACejuAEAAAAABYITgAAAABggeAEAAAAABYITgAAAABggeAEAAAAABYITgAAAABggeAEAAAAABYITgAAAABggeAEAKXcvn37ZLPZNH36dHeX4mTRokVq1KiRfHx8ZLPZdPz4cXeXlCubzaZx48a5u4xSr0aNGurfv7+7ywBQQhGcAJRYW7Zs0R133KHIyEj5+PioWrVqat++vf79738X2WfOnDlTb731VrbpBw8e1Lhx47Rp06Yi++yLrVixQjabzfEqV66crrzySvXt21d79uwplM9Ys2aNxo0bV+ih5tixY7rrrrvk6+uryZMn65NPPlH58uVzbDt9+nSn7fTx8dHVV1+tESNG6NChQ4VaF4pO//79nX6OgYGBatiwoSZOnKi0tDR3lwcAKuvuAgCgKKxZs0Zt27bVFVdcocGDByssLEz79+/XTz/9pEmTJumhhx4qks+dOXOmtm7dqkceecRp+sGDBzV+/HjVqFFDjRo1KpLPzs3IkSPVrFkznT17Vr/88ovef/99LViwQFu2bFF4ePglrXvNmjUaP368+vfvrwoVKhROwZLWrVunEydO6IUXXlC7du1cWub5559XVFSUzpw5o1WrVum9997Td999p61bt8rPz6/QakPRsdvt+vDDDyVJx48f15dffqnHHntM69at06xZsyyX37Fjh7y8+E4YQNEgOAEokV566SUFBQVp3bp12Q7oDx8+7J6iikBqamquZ2KytG7dWnfccYckacCAAbr66qs1cuRIffTRRxozZkxxlJlvWT+j/ISxzp0767rrrpMkDRo0SJUrV9Ybb7yh+fPnq3fv3jku40r/lSanTp1ya8gsW7as7r33Xsf7YcOGqUWLFpo9e7beeOONHIO+MUZnzpyRr6+v7HZ7cZYLoJThaxkAJdLu3btVr169HA+8Q0JCsk2bMWOGmjdvLj8/P1WsWFFt2rTR999/75g/f/58denSReHh4bLb7apZs6ZeeOEFZWRkONrceOONWrBggf7880/H5UY1atTQihUr1KxZM0nng0vWvAvvKfr555/VqVMnBQUFyc/PTzExMVq9erVTjePGjZPNZtP27dvVp08fVaxYUTfccEO+++amm26SJO3duzfPdsuWLVPr1q1Vvnx5VahQQd27d9dvv/3mVM/jjz8uSYqKinJs1759+/Jc75w5c9S0aVP5+vqqSpUquvfee3XgwAHH/BtvvFH9+vWTJDVr1kw2m61A961cvJ39+/eXv7+/du/erVtuuUUBAQG65557JJ0PUI8++qgiIiJkt9tVu3Ztvf766zLGOK0zLS1No0aNUnBwsAICAtStWzf99ddf+a4tS0xMjBo2bJjjvNq1a6tjx46O95mZmXrrrbdUr149+fj4KDQ0VEOHDtU///zjtJwr+6p0vp/r16+vDRs2qE2bNvLz89PTTz+dYy2vv/66bDab/vzzz2zzxowZI29vb0cdO3fu1O23366wsDD5+PioevXquvvuu5WcnJyvvpEkLy8v3XjjjZLk2K9q1KihW2+9VYsXL9Z1110nX19f/ec//3HMu3hfOX78uEaNGqUaNWrIbrerevXq6tu3r44ePepok5aWprFjx6pWrVqy2+2KiIjQE088wSWCAJxwxglAiRQZGan4+Hht3bpV9evXz7Pt+PHjNW7cOLVs2VLPP/+8vL299fPPP2vZsmXq0KGDpPP30fj7+2v06NHy9/fXsmXL9NxzzyklJUUTJkyQJD3zzDNKTk7WX3/9pTfffFOS5O/vrzp16uj555/Xc889pyFDhqh169aSpJYtW0o6H1A6d+6spk2bauzYsfLy8tK0adN000036ccff1Tz5s2d6r3zzjt11VVX6eWXX852YO+K3bt3S5IqV66ca5slS5aoc+fOuvLKKzVu3DidPn1a//73v9WqVSv98ssvqlGjhnr27Kk//vhDn332md58801VqVJFkhQcHJzreqdPn64BAwaoWbNmio2N1aFDhzRp0iStXr1aGzduVIUKFfTMM8+odu3aev/99x2X39WsWbNQtvPcuXPq2LGjbrjhBr3++uvy8/OTMUbdunXT8uXLdf/996tRo0ZavHixHn/8cR04cMDxs5TOn8maMWOG+vTpo5YtW2rZsmXq0qVLvmvLct9992nw4MHZ9tN169bpjz/+0LPPPuuYNnToUEf/jRw5Unv37tU777yjjRs3avXq1SpXrpwk1/bVLMeOHVPnzp119913695771VoaGiOdd5111164okn9PnnnzvCcpbPP/9cHTp0UMWKFZWenq6OHTsqLS1NDz30kMLCwnTgwAF9++23On78uIKCgvLdRzn9HHfs2KHevXtr6NChGjx4sGrXrp3jsidPnlTr1q3122+/aeDAgWrSpImOHj2qr7/+Wn/99ZeqVKmizMxMdevWTatWrdKQIUNUp04dbdmyRW+++ab++OMPzZs3L981AyihDACUQN9//70pU6aMKVOmjImOjjZPPPGEWbx4sUlPT3dqt3PnTuPl5WVuu+02k5GR4TQvMzPT8f+nTp3K9hlDhw41fn5+5syZM45pXbp0MZGRkdnarlu3zkgy06ZNy/YZV111lenYsWO2z4uKijLt27d3TBs7dqyRZHr37u1SHyxfvtxIMlOnTjVHjhwxBw8eNAsWLDA1atQwNpvNrFu3zhhjzN69e7PV1qhRIxMSEmKOHTvmmLZ582bj5eVl+vbt65g2YcIEI8ns3bvXsp709HQTEhJi6tevb06fPu2Y/u233xpJ5rnnnnNMmzZtmpHkqDEvWW2XLFlijhw5Yvbv329mzZplKleubHx9fc1ff/1ljDGmX79+RpJ56qmnnJafN2+ekWRefPFFp+l33HGHsdlsZteuXcYYYzZt2mQkmWHDhjm169Onj5Fkxo4da1nrxY4fP258fHzMk08+6TR95MiRpnz58ubkyZPGGGN+/PFHI8l8+umnTu0WLVqUbbqr+2pMTIyRZKZMmeJSrdHR0aZp06ZO09auXWskmY8//tgYY8zGjRuNJDNnzhyX1nmhfv36mfLly5sjR46YI0eOmF27dpmXX37Z2Gw2c+211zraRUZGGklm0aJF2dYRGRlp+vXr53j/3HPPGUnmq6++ytY26/ftk08+MV5eXubHH390mj9lyhQjyaxevTrf2wKgZOJSPQAlUvv27RUfH69u3bpp8+bNeu2119SxY0dVq1ZNX3/9taPdvHnzlJmZqeeeey7bTeU2m83x/76+vo7/P3HihI4eParWrVvr1KlT+v333wtc56ZNm7Rz50716dNHx44d09GjR3X06FGlpqbq5ptv1sqVK5WZmem0zAMPPJCvzxg4cKCCg4MVHh6uLl26KDU1VR999JHjfqCLJSYmatOmTerfv78qVarkmH7ttdeqffv2+u677/K/oZLWr1+vw4cPa9iwYfLx8XFM79Kli6655hotWLCgQOvN0q5dOwUHBysiIkJ33323/P39NXfuXFWrVs2p3YMPPuj0/rvvvlOZMmU0cuRIp+mPPvqojDFauHCho52kbO0uHggkP4KCgtS9e3d99tlnjrOHGRkZmj17tnr06OG4/2rOnDkKCgpS+/btHfvI0aNH1bRpU/n7+2v58uWOdeZnX7Xb7RowYIBLtfbq1UsbNmxwnAGSpNmzZ8tut6t79+6O7ZGkxYsX69SpU/nuj9TUVAUHBys4OFi1atXS008/rejoaM2dO9epXVRUlNNljLn58ss
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//car_price_prediction.csv\")\n",
|
|||
|
"\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(df['Prod. year'], df['Price'])\n",
|
|||
|
"plt.xlabel('Prod. year')\n",
|
|||
|
"plt.ylabel('Price')\n",
|
|||
|
"plt.title('Scatter Plot of Prod. year vs Price')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"При проверке на шум можно заметить выброс в 2000 году. Цена там запредельная.\n",
|
|||
|
"\n",
|
|||
|
"Для удаления выбросов из датасета можно использовать метод межквартильного размаха. Зашумленность не очень высокая. Покрытие данных высокое и подошло бы для поставленной задачи по актуальности."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA2QAAAIjCAYAAABswtioAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOydeXxcVd3/33eZfSZ70rRp2nSDUlrAloJlR6D9IeACyKbsKD6A/IRHxe0B1OeRR1FRf4KoKCCiAm6oCJRVWYoCBaRA96Zbmn2Zfbn3nt8fNzMkTdLcMpmboZz365WmM/PJzJl7z9w53/PdFCGEQCKRSCQSiUQikUgkrqNO9gAkEolEIpFIJBKJ5L2KNMgkEolEIpFIJBKJZJKQBplEIpFIJBKJRCKRTBLSIJNIJBKJRCKRSCSSSUIaZBKJRCKRSCQSiUQySUiDTCKRSCQSiUQikUgmCWmQSSQSiUQikUgkEskkIQ0yiUQikUgkEolEIpkkpEEmkUgkEolEIpFIJJOENMgkEsm7htbWVhRF4a677prsoQzjkUce4ZBDDsHv96MoCv39/ZM9pDFRFIUbb7xxsofxnqelpYWLLrpoUl77gx/8IJ/85Cff8d+/m+b7u52LLrqIlpaWyR6Gqxx33HEcd9xxhduTed0/55xzOOuss1x/Xcl7D2mQSSRlwOuvv86ZZ57JzJkz8fv9NDU1cdJJJ/H//t//K9lr/vrXv+b73//+iPvb2tq48cYbefXVV0v22rvz9NNPoyhK4cfj8TB79mwuuOACNm/ePCGv8fzzz3PjjTdO+OKxp6eHs846i0AgwK233so999xDKBQaVXvXXXcNe59+v5/99tuPq666io6Ojgkdl6R0XHTRRcPOY0VFBQcffDDf/e53yWQykz28PfLcc8+xcuVKrrvuulEf/9vf/oaiKEybNg3LskY8PtZ8/+Y3v8mf/vSnEo9+OEPPQf48HHvssTz00EOujmNfRgjBPffcwzHHHENVVRXBYJBFixbx9a9/nUQi8Y6f98033+TGG2+ktbV14gZbAq677jp+//vf89prr032UCT7OPpkD0Aiea/z/PPPc/zxxzNjxgw++clP0tjYyPbt23nhhRf4wQ9+wGc+85mSvO6vf/1r1qxZw2c/+9lh97e1tfG1r32NlpYWDjnkkJK89lhcffXVLF26lFwux+rVq/npT3/KQw89xOuvv860adOKeu7nn3+er33ta1x00UVUVVVNzICBF198kVgsxje+8Q1OPPFER3/z9a9/nVmzZpFOp3n22Wf58Y9/zN/+9jfWrFlDMBicsLFJSofP5+OOO+4AoL+/n9///vd87nOf48UXX+S3v/3tuH+/bt06VNX9PdGbb76ZE044gblz5476+L333ktLSwutra08+eSTI+b0WPP9m9/8JmeeeSYf+chHSjn8EZx00klccMEFCCHYunUrP/7xjznttNN4+OGHWbFihatjKQU/+9nPRjWM3cA0Tc477zzuv/9+jj76aG688UaCwSDPPPMMX/va13jggQd4/PHHmTJlyl4/95tvvsnXvvY1jjvuuHE9gDNnziSVSuHxeN7hO3nnvO997+PQQw/lu9/9Lr/85S9df33JewdpkEkkk8z//M//UFlZyYsvvjjCUOjs7JycQZWARCIxpucoz9FHH82ZZ54JwMUXX8x+++3H1Vdfzd13382XvvQlN4a51+TP0d4YeSeffDKHHnooAJdddhm1tbV873vf48EHH+Tcc88d9W+cHL/3EslkclKNV13X+cQnPlG4fcUVV3D44Ydz33338b3vfW/UDQQhBOl0mkAggM/nc3O4gD1XH3roIW6//fZRH08kEjz44IPcdNNN3Hnnndx7770jDLJ3Mt/fKel0Gq/Xu0fDdb/99ht2Hs444wwWLFjAD37wg33CIJsMIyTPt7/9be6//34+97nPcfPNNxfu/9SnPsVZZ53FRz7yES666CIefvjhko4jH00wUezttfSss87ihhtu4LbbbiMcDk/YOCSSociQRYlkktm0aRMHHnjgqAuchoaGEff96le/4rDDDiMYDFJdXc0xxxzDypUrC48/+OCDnHLKKUybNg2fz8ecOXP4xje+gWmaBc1xxx3HQw89xNatWwvhPi0tLTz99NMsXboUsA2i/GNDY/f/+c9/8n/+z/+hsrKSYDDIsccey3PPPTdsjDfeeCOKovDmm29y3nnnUV1dzVFHHbXXx+YDH/gAAFu2bNmj7sknn+Too48mFApRVVXFhz/8Yd56661h4/n85z8PwKxZswrva7xwmQceeIAlS5YQCASoq6vjE5/4BDt37iw8ftxxx3HhhRcCsHTpUhRFeUd5Qbu/z4suuohwOMymTZv44Ac/SCQS4eMf/zhgLyb+8z//k+bmZnw+H/vvvz/f+c53EEIMe85MJsM111xDfX09kUiED33oQ+zYsWOvx5bn2GOP5eCDDx71sf3333/Y4teyLL7//e9z4IEH4vf7mTJlCpdffjl9fX3D/s7JXAX7OC9cuJCXX36ZY445hmAwyJe//OVRx/Kd73wHRVHYunXriMe+9KUv4fV6C+PYsGEDZ5xxBo2Njfj9fqZPn84555zDwMDAXh0bAFVVC3kv+XnV0tLCqaeeyqOPPsqhhx5KIBDgJz/5SeGx3edKf38/11xzDS0tLfh8PqZPn84FF1xAd3d3QZPJZLjhhhuYO3cuPp+P5uZmvvCFLzgKlXzooYcwDGNMT+4f//hHUqkUH/vYxzjnnHP4wx/+QDqdLjw+1nxXFIVEIsHdd99d+GwNfW87d+7kkksuYcqUKfh8Pg488EB+8YtfDHvtfNjyb3/7W7761a/S1NREMBgkGo2O+76GcsABB1BXV8emTZuG3e/0uCmKwlVXXcUDDzzAggULCAQCLFu2jNdffx2An/zkJ8ydOxe/389xxx036jVkvOvG3szR3XPI8vlU3/nOd/jpT3/KnDlz8Pl8LF26lBdffHHUsSxYsAC/38/ChQv54x//6CgvLZVKcfPNN7Pffvtx0003jXj8tNNO48ILL+SRRx7hhRdeGHb8RstRHTrf77rrLj72sY8BcPzxxxfmzNNPPz3qWMbKIVu7di1nnnkmNTU1+P1+Dj30UP785z8P0+TDxP/+979zxRVX0NDQwPTp0wGIxWJ89rOfLXzeGhoaOOmkk1i9evWw5zjppJNIJBI89thjezpkEklRSA+ZRDLJzJw5k1WrVrFmzRoWLly4R+3XvvY1brzxRo444gi+/vWv4/V6+ec//8mTTz7J8uXLAfsLKBwOc+211xIOh3nyySe5/vrriUajhV3Or3zlKwwMDLBjxw5uueUWAMLhMAcccABf//rXuf766/nUpz7F0UcfDcARRxwB2IbPySefzJIlS7jhhhtQVZU777yTD3zgAzzzzDMcdthhw8b7sY99jHnz5vHNb35zhMHghPyiqra2dkzN448/zsknn8zs2bO58cYbSaVS/L//9/848sgjWb16NS0tLZx++umsX7+e3/zmN9xyyy3U1dUBUF9fP+bz3nXXXVx88cUsXbqUm266iY6ODn7wgx/w3HPP8corr1BVVcVXvvIV9t9/f376058WwhDnzJkzIe/TMAxWrFjBUUcdxXe+8x2CwSBCCD70oQ/x1FNPcemll3LIIYfw6KOP8vnPf56dO3cWziXYnrdf/epXnHfeeRxxxBE8+eSTnHLKKXs9tjznn38+n/zkJ0fM0xdffJH169fz1a9+tXDf5ZdfXjh+V199NVu2bOFHP/oRr7zyCs8991xh19/JXM3T09PDySefzDnnnMMnPvGJMcOkzjrrLL7whS9w//33F4zwPPfffz/Lly+nurqabDbLihUryGQyfOYzn6GxsZGdO3fy17/+lf7+fiorK/f6GI12HtetW8e5557L5Zdfzic/+Un233//Uf82Ho9z9NFH89Zbb3HJJZewePFiuru7+fOf/8yOHTuoq6vDsiw+9KEP8eyzz/KpT32KAw44gNdff51bbrmF9evXj5vD9fzzz1NbW8vMmTNHffzee+/l+OOPp7GxkXPOOYcvfvG
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Количество строк до удаления выбросов: 19237\n",
|
|||
|
"Количество строк после удаления выбросов: 17241\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//car_price_prediction.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор столбцов для анализа\n",
|
|||
|
"column1 = 'Prod. year'\n",
|
|||
|
"column2 = 'Price'\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Функция для удаления выбросов\n",
|
|||
|
"def remove_outliers(df, column):\n",
|
|||
|
" Q1 = df[column].quantile(0.25)\n",
|
|||
|
" Q3 = df[column].quantile(0.75)\n",
|
|||
|
" IQR = Q3 - Q1\n",
|
|||
|
" lower_bound = Q1 - 1.5 * IQR\n",
|
|||
|
" upper_bound = Q3 + 1.5 * IQR\n",
|
|||
|
" return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
|
|||
|
"\n",
|
|||
|
"# Удаление выбросов для каждого столбца\n",
|
|||
|
"df_cleaned = df.copy()\n",
|
|||
|
"for column in [column1, column2]:\n",
|
|||
|
" df_cleaned = remove_outliers(df_cleaned, column)\n",
|
|||
|
"\n",
|
|||
|
"# Построение точечной диаграммы после удаления выбросов\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(df_cleaned[column1], df_cleaned[column2], alpha=0.5)\n",
|
|||
|
"plt.xlabel(column1)\n",
|
|||
|
"plt.ylabel(column2)\n",
|
|||
|
"plt.title(f'Scatter Plot of {column1} vs {column2} (After Removing Outliers)')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Вывод количества строк до и после удаления выбросов\n",
|
|||
|
"print(f\"Количество строк до удаления выбросов: {len(df)}\")\n",
|
|||
|
"print(f\"Количество строк после удаления выбросов: {len(df_cleaned)}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Теперь очистим датасет от пустых строк"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Общая информация о датасете:\n",
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"RangeIndex: 19237 entries, 0 to 19236\n",
|
|||
|
"Data columns (total 18 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 ID 19237 non-null int64 \n",
|
|||
|
" 1 Price 19237 non-null int64 \n",
|
|||
|
" 2 Levy 19237 non-null object \n",
|
|||
|
" 3 Manufacturer 19237 non-null object \n",
|
|||
|
" 4 Model 19237 non-null object \n",
|
|||
|
" 5 Prod. year 19237 non-null int64 \n",
|
|||
|
" 6 Category 19237 non-null object \n",
|
|||
|
" 7 Leather interior 19237 non-null object \n",
|
|||
|
" 8 Fuel type 19237 non-null object \n",
|
|||
|
" 9 Engine volume 19237 non-null object \n",
|
|||
|
" 10 Mileage 19237 non-null object \n",
|
|||
|
" 11 Cylinders 19237 non-null float64\n",
|
|||
|
" 12 Gear box type 19237 non-null object \n",
|
|||
|
" 13 Drive wheels 19237 non-null object \n",
|
|||
|
" 14 Doors 19237 non-null object \n",
|
|||
|
" 15 Wheel 19237 non-null object \n",
|
|||
|
" 16 Color 19237 non-null object \n",
|
|||
|
" 17 Airbags 19237 non-null int64 \n",
|
|||
|
"dtypes: float64(1), int64(4), object(13)\n",
|
|||
|
"memory usage: 2.6+ MB\n",
|
|||
|
"None\n",
|
|||
|
"\n",
|
|||
|
"Таблица анализа пропущенных значений:\n",
|
|||
|
" Количество пропущенных значений \\\n",
|
|||
|
"ID 0 \n",
|
|||
|
"Price 0 \n",
|
|||
|
"Levy 0 \n",
|
|||
|
"Manufacturer 0 \n",
|
|||
|
"Model 0 \n",
|
|||
|
"Prod. year 0 \n",
|
|||
|
"Category 0 \n",
|
|||
|
"Leather interior 0 \n",
|
|||
|
"Fuel type 0 \n",
|
|||
|
"Engine volume 0 \n",
|
|||
|
"Mileage 0 \n",
|
|||
|
"Cylinders 0 \n",
|
|||
|
"Gear box type 0 \n",
|
|||
|
"Drive wheels 0 \n",
|
|||
|
"Doors 0 \n",
|
|||
|
"Wheel 0 \n",
|
|||
|
"Color 0 \n",
|
|||
|
"Airbags 0 \n",
|
|||
|
"\n",
|
|||
|
" Процент пропущенных значений \n",
|
|||
|
"ID 0.0 \n",
|
|||
|
"Price 0.0 \n",
|
|||
|
"Levy 0.0 \n",
|
|||
|
"Manufacturer 0.0 \n",
|
|||
|
"Model 0.0 \n",
|
|||
|
"Prod. year 0.0 \n",
|
|||
|
"Category 0.0 \n",
|
|||
|
"Leather interior 0.0 \n",
|
|||
|
"Fuel type 0.0 \n",
|
|||
|
"Engine volume 0.0 \n",
|
|||
|
"Mileage 0.0 \n",
|
|||
|
"Cylinders 0.0 \n",
|
|||
|
"Gear box type 0.0 \n",
|
|||
|
"Drive wheels 0.0 \n",
|
|||
|
"Doors 0.0 \n",
|
|||
|
"Wheel 0.0 \n",
|
|||
|
"Color 0.0 \n",
|
|||
|
"Airbags 0.0 \n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//car_price_prediction.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Вывод общей информации о датасете\n",
|
|||
|
"print(\"Общая информация о датасете:\")\n",
|
|||
|
"print(df.info())\n",
|
|||
|
"\n",
|
|||
|
"# Вывод таблицы анализа пропущенных значений\n",
|
|||
|
"missing_values = df.isnull().sum()\n",
|
|||
|
"missing_values_percentage = (missing_values / len(df)) * 100\n",
|
|||
|
"missing_data = pd.concat([missing_values, missing_values_percentage], axis=1, keys=['Количество пропущенных значений', 'Процент пропущенных значений'])\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nТаблица анализа пропущенных значений:\")\n",
|
|||
|
"print(missing_data)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Пустых строк не было обнаружено.\n",
|
|||
|
"\n",
|
|||
|
"Теперь создадим выборки."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 18,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер обучающей выборки: 11542\n",
|
|||
|
"Размер контрольной выборки: 3847\n",
|
|||
|
"Размер тестовой выборки: 3848\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//car_price_prediction.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор признаков и целевой переменной\n",
|
|||
|
"X = df.drop('Price', axis=1) # Признаки (все столбцы, кроме 'Price')\n",
|
|||
|
"y = df['Price'] # Целевая переменная ('Price')\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение данных на обучающую и оставшуюся часть (контрольную + тестовую)\n",
|
|||
|
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение оставшейся части на контрольную и тестовую выборки\n",
|
|||
|
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Вывод размеров выборок\n",
|
|||
|
"print(f\"Размер обучающей выборки: {X_train.shape[0]}\")\n",
|
|||
|
"print(f\"Размер контрольной выборки: {X_val.shape[0]}\")\n",
|
|||
|
"print(f\"Размер тестовой выборки: {X_test.shape[0]}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Проанализируем сбалансированность выборок"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Распределение Price в обучающей выборке:\n",
|
|||
|
"Price\n",
|
|||
|
"1 1\n",
|
|||
|
"3 8\n",
|
|||
|
"6 4\n",
|
|||
|
"19 1\n",
|
|||
|
"20 4\n",
|
|||
|
" ..\n",
|
|||
|
"260296 1\n",
|
|||
|
"297930 2\n",
|
|||
|
"308906 1\n",
|
|||
|
"872946 1\n",
|
|||
|
"26307500 1\n",
|
|||
|
"Name: count, Length: 1764, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n",
|
|||
|
"\n",
|
|||
|
"Необходима аугментация данных для балансировки классов.\n",
|
|||
|
"\n",
|
|||
|
"Распределение Price в контрольной выборке:\n",
|
|||
|
"Price\n",
|
|||
|
"1 1\n",
|
|||
|
"3 4\n",
|
|||
|
"6 1\n",
|
|||
|
"20 1\n",
|
|||
|
"25 3\n",
|
|||
|
" ..\n",
|
|||
|
"141124 1\n",
|
|||
|
"144261 1\n",
|
|||
|
"156805 1\n",
|
|||
|
"172486 1\n",
|
|||
|
"627220 1\n",
|
|||
|
"Name: count, Length: 983, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n",
|
|||
|
"\n",
|
|||
|
"Необходима аугментация данных для балансировки классов.\n",
|
|||
|
"\n",
|
|||
|
"Распределение Price в тестовой выборке:\n",
|
|||
|
"Price\n",
|
|||
|
"3 3\n",
|
|||
|
"6 1\n",
|
|||
|
"9 1\n",
|
|||
|
"20 2\n",
|
|||
|
"25 7\n",
|
|||
|
" ..\n",
|
|||
|
"153669 1\n",
|
|||
|
"156805 1\n",
|
|||
|
"163077 1\n",
|
|||
|
"216391 1\n",
|
|||
|
"288521 1\n",
|
|||
|
"Name: count, Length: 978, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n",
|
|||
|
"\n",
|
|||
|
"Необходима аугментация данных для балансировки классов.\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//car_price_prediction.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор признаков и целевой переменной\n",
|
|||
|
"X = df.drop('Price', axis=1) # Признаки (все столбцы, кроме 'Price')\n",
|
|||
|
"y = df['Price'] # Целевая переменная ('Price')\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение данных на обучающую и оставшуюся часть (контрольную + тестовую)\n",
|
|||
|
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение оставшейся части на контрольную и тестовую выборки\n",
|
|||
|
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Функция для анализа распределения и вывода результатов\n",
|
|||
|
"def analyze_distribution(data, title):\n",
|
|||
|
" print(f\"Распределение Price в {title}:\")\n",
|
|||
|
" distribution = data.value_counts().sort_index()\n",
|
|||
|
" print(distribution)\n",
|
|||
|
" total = len(data)\n",
|
|||
|
" positive_count = (data > 0).sum()\n",
|
|||
|
" negative_count = (data < 0).sum()\n",
|
|||
|
" positive_percent = (positive_count / total) * 100\n",
|
|||
|
" negative_percent = (negative_count / total) * 100\n",
|
|||
|
" print(f\"Процент положительных значений: {positive_percent:.2f}%\")\n",
|
|||
|
" print(f\"Процент отрицательных значений: {negative_percent:.2f}%\")\n",
|
|||
|
" print(\"\\nНеобходима аугментация данных для балансировки классов.\\n\")\n",
|
|||
|
"\n",
|
|||
|
"# Анализ распределения для каждой выборки\n",
|
|||
|
"analyze_distribution(y_train, \"обучающей выборке\")\n",
|
|||
|
"analyze_distribution(y_val, \"контрольной выборке\")\n",
|
|||
|
"analyze_distribution(y_test, \"тестовой выборке\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Выборки не сбалансированы, и для улучшения качества модели рекомендуется провести аугментацию данных."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 3,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Распределение Price в обучающей выборке после oversampling:\n",
|
|||
|
"Price\n",
|
|||
|
"1 169\n",
|
|||
|
"3 169\n",
|
|||
|
"6 169\n",
|
|||
|
"19 169\n",
|
|||
|
"20 169\n",
|
|||
|
" ... \n",
|
|||
|
"260296 169\n",
|
|||
|
"297930 169\n",
|
|||
|
"308906 169\n",
|
|||
|
"872946 169\n",
|
|||
|
"26307500 169\n",
|
|||
|
"Name: count, Length: 1764, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n",
|
|||
|
"Распределение Price в контрольной выборке:\n",
|
|||
|
"Price\n",
|
|||
|
"1 1\n",
|
|||
|
"3 4\n",
|
|||
|
"6 1\n",
|
|||
|
"20 1\n",
|
|||
|
"25 3\n",
|
|||
|
" ..\n",
|
|||
|
"141124 1\n",
|
|||
|
"144261 1\n",
|
|||
|
"156805 1\n",
|
|||
|
"172486 1\n",
|
|||
|
"627220 1\n",
|
|||
|
"Name: count, Length: 983, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n",
|
|||
|
"Распределение Price в тестовой выборке:\n",
|
|||
|
"Price\n",
|
|||
|
"3 3\n",
|
|||
|
"6 1\n",
|
|||
|
"9 1\n",
|
|||
|
"20 2\n",
|
|||
|
"25 7\n",
|
|||
|
" ..\n",
|
|||
|
"153669 1\n",
|
|||
|
"156805 1\n",
|
|||
|
"163077 1\n",
|
|||
|
"216391 1\n",
|
|||
|
"288521 1\n",
|
|||
|
"Name: count, Length: 978, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from imblearn.over_sampling import RandomOverSampler\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//car_price_prediction.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор признаков и целевой переменной\n",
|
|||
|
"X = df.drop('Price', axis=1) # Признаки (все столбцы, кроме 'Price')\n",
|
|||
|
"y = df['Price'] # Целевая переменная ('Price')\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение данных на обучающую и оставшуюся часть (контрольную + тестовую)\n",
|
|||
|
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение оставшейся части на контрольную и тестовую выборки\n",
|
|||
|
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Применение oversampling к обучающей выборке\n",
|
|||
|
"oversampler = RandomOverSampler(random_state=42)\n",
|
|||
|
"X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Функция для анализа распределения и вывода результатов\n",
|
|||
|
"def analyze_distribution(data, title):\n",
|
|||
|
" print(f\"Распределение Price в {title}:\")\n",
|
|||
|
" distribution = data.value_counts().sort_index()\n",
|
|||
|
" print(distribution)\n",
|
|||
|
" total = len(data)\n",
|
|||
|
" positive_count = (data > 0).sum()\n",
|
|||
|
" negative_count = (data < 0).sum()\n",
|
|||
|
" positive_percent = (positive_count / total) * 100\n",
|
|||
|
" negative_percent = (negative_count / total) * 100\n",
|
|||
|
" print(f\"Процент положительных значений: {positive_percent:.2f}%\")\n",
|
|||
|
" print(f\"Процент отрицательных значений: {negative_percent:.2f}%\")\n",
|
|||
|
"\n",
|
|||
|
"# Анализ распределения для каждой выборки\n",
|
|||
|
"analyze_distribution(y_train_resampled, \"обучающей выборке после oversampling\")\n",
|
|||
|
"analyze_distribution(y_val, \"контрольной выборке\")\n",
|
|||
|
"analyze_distribution(y_test, \"тестовой выборке\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Цены на бриллианты"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 2,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Index(['Unnamed: 0', 'carat', 'cut', 'color', 'clarity', 'depth', 'table',\n",
|
|||
|
" 'price', 'x', 'y', 'z'],\n",
|
|||
|
" dtype='object')\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Diamonds Prices2022.csv\")\n",
|
|||
|
"print(df.columns)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Проблемная область: ценообразование бриллиантов\n",
|
|||
|
"\n",
|
|||
|
"Объект наблюдения: бриллиант\n",
|
|||
|
"\n",
|
|||
|
"Атрибуты: идентификатор, вес, качество огранки, цвет, чистота, общая глубина, ширина верхней грани, цена, длина, ширина, высота.\n",
|
|||
|
"\n",
|
|||
|
"Пример бизнес-цели: \n",
|
|||
|
"1. Оптимизация ценообразования и повышение эффективности продаж:\n",
|
|||
|
"Цель: Разработка модели прогнозирования цен на бриллианты, которая позволит компаниям устанавливать конкурентоспособные цены и повысить продажи.\n",
|
|||
|
"\n",
|
|||
|
"2. Повышение эффективности маркетинговых кампаний:\n",
|
|||
|
"Цель: Использование данных о бриллиантах для разработки целевых маркетинговых кампаний, направленных на конкретные сегменты рынка.\n",
|
|||
|
"\n",
|
|||
|
"3. Повышение качества сервиса и удовлетворенности клиентов:\n",
|
|||
|
"Цель: Использование данных для предоставления клиентам персонализированных рекомендаций и улучшения качества обслуживания.\n",
|
|||
|
"\n",
|
|||
|
"Актуальность: Данный датасет является актуальным и ценным ресурсом для компаний, работающих на рынке бриллиантов, а также для исследователей и инвесторов, поскольку он предоставляет обширную информацию о ценах и характеристиках бриллиантов. Эти данные могут быть использованы для разработки моделей прогнозирования цен, анализа рыночных тенденций и принятия обоснованных бизнес-решений."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 29,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0EAAAIjCAYAAADFthA8AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACpr0lEQVR4nOzde3gU5dk/8O/s5rBJSDYnYBOEJBwUQgQEQZCTIlQOIuprVRTPP1SE1kNrFa0CUovUvlXfatFi0RYEra2KFIwFQTkYDBJOMagQk4CQAEnIBnJOdn5/xFn3MLs7szt7yn4/18V1kc3szLOzs5vnnvt57kcQRVEEERERERFRhNAFuwFERERERESBxCCIiIiIiIgiCoMgIiIiIiKKKAyCiIiIiIgoojAIIiIiIiKiiMIgiIiIiIiIIgqDICIiIiIiiigMgoiIiIiIKKIwCCIiIiIioojCIIiIiNwqLy+HIAh46623gt0UO/n5+Rg2bBgMBgMEQUBdXZ1fjycIAhYvXuzXYxARUWAwCCKiiHXo0CHceOONyMrKgsFgQK9evTBlyhT8+c9/9tsx165di5deesnp8ZMnT2Lx4sXYv3+/347t6LPPPoMgCNZ/0dHR6Nu3L+644w58//33mhzjiy++wOLFizUPUGpqanDTTTchLi4Or776KlavXo2EhARNj0GB5+rzQUSktahgN4CIKBi++OILXHnllejTpw/mzp0Lk8mE48ePY/fu3Xj55Zfxi1/8wi/HXbt2LYqLi/Hwww/bPX7y5EksWbIE2dnZGDZsmF+O7covf/lLjBw5Em1tbSgqKsJf//pXbNy4EYcOHUJmZqZP+/7iiy+wZMkS3HXXXUhOTtamwQD27NmDc+fOYenSpZg8ebJm+3WnqakJUVH8s+lPrj4fRERa47c5EUWk5557DkajEXv27HHqnJ8+fTo4jfKDhoYGjxmS8ePH48YbbwQA3H333bjwwgvxy1/+En//+9+xcOHCQDRTNek90jKwkmOxWNDa2gqDwQCDweDXY3VFjY2NiI+PD3YziIiccDgcEUWk0tJSDB48WLYT3aNHD6fH1qxZg1GjRiE+Ph4pKSmYMGEC/vvf/1p/v379esyYMQOZmZmIjY1Fv379sHTpUnR0dFi3ueKKK7Bx40ZUVFRYh6BlZ2fjs88+w8iRIwF0BiHS72zn4Hz55ZeYOnUqjEYj4uPjMXHiROzatcuujYsXL4YgCCgpKcGtt96KlJQUjBs3TvW5mTRpEgCgrKzM7XZbt27F+PHjkZCQgOTkZMyaNQuHDx+2a89jjz0GAMjJybG+rvLycrf7fe+99zBixAjExcUhPT0dc+bMwYkTJ6y/v+KKK3DnnXcCAEaOHAlBEHDXXXe53J90Xr755hvcdNNNSEpKQlpaGh566CE0NzfbbSsIAhYsWIC3334bgwcPRmxsLPLz862/c5wTdOLECdx7773W9z0nJwfz5s1Da2urdZu6ujo8/PDD6N27N2JjY9G/f38sX74cFovF7Xm45ppr0LdvX9nfjRkzBpdeeqn1582bN2PcuHFITk5Gt27dcNFFF+HJJ590u3+JFtc20Pm+5OXlYe/evZgwYQLi4+OtbfDl80FE5A/MBBFRRMrKykJBQQGKi4uRl5fndtslS5Zg8eLFuPzyy/Hss88iJiYGX375JbZu3Yqf/exnAIC33noL3bp1w6OPPopu3bph69ateOaZZ1BfX48XXngBAPDUU0/BbDbjhx9+wIsvvggA6NatGwYNGoRnn30WzzzzDO677z6MHz8eAHD55ZcD6Aw2pk2bhhEjRmDRokXQ6XR48803MWnSJOzYsQOjRo2ya+/Pf/5zDBgwAL///e8hiqLqc1NaWgoASEtLc7nNli1bMG3aNPTt2xeLFy9GU1MT/vznP2Ps2LEoKipCdnY2brjhBnz33XdYt24dXnzxRaSnpwMAunfv7nK/b731Fu6++26MHDkSy5Ytw6lTp/Dyyy9j165d2LdvH5KTk/HUU0/hoosuwl//+lc8++yzyMnJQb9+/Ty+rptuugnZ2dlYtmwZdu/ejf/7v//D2bNn8Y9//MNuu61bt+Kf//wnFixYgPT0dJcd8ZMnT2LUqFGoq6vDfffdh4EDB+LEiRP417/+hcbGRsTExKCxsRETJ07EiRMncP/996NPnz744osvsHDhQlRWVrqd/3LzzTfjjjvuwJ49e6xBMgBUVFRg9+7d1uvq66+/xjXXXIMhQ4bg2WefRWxsLI4ePeoUJMvR6tqW1NTUYNq0abjlllswZ84c9OzZU/E+XH0+iIj8QiQiikD//e9/Rb1eL+r1enHMmDHib37zG/GTTz4RW1tb7bY7cuSIqNPpxOuvv17s6Oiw+53FYrH+v7Gx0ekY999/vxgfHy82NzdbH5sxY4aYlZXltO2ePXtEAOKbb77pdIwBAwaIV199tdPxcnJyxClTplgfW7RokQhAnD17tqJzsG3bNhGAuGrVKvHMmTPiyZMnxY0bN4rZ2dmiIAjinj17RFEUxbKyMqe2DRs2TOzRo4dYU1NjfezAgQOiTqcT77jjDutjL7zwgghALCsr89ie1tZWsUePHmJeXp7Y1NRkffw///mPCEB85plnrI+9+eabIgBrG92Rzsu1115r9/iDDz4oAhAPHDhgfQyAqNPpxK+//tppPwDERYsWWX++4447RJ1OJ9sG6b1aunSpmJCQIH733Xd2v3/iiSdEvV4vHjt2zGW7zWazGBsbK/7qV7+ye/wPf/iDKAiCWFFRIYqiKL744osiAPHMmTMu9yVH62t74sSJIgDxtddec9re188HEZHWOByOiCLSlClTUFBQgGuvvRYHDhzAH/7wB1x99dXo1asXPvroI+t2H374ISwWC5555hnodPZfmYIgWP8fFxdn/f+5c+dQXV2N8ePHo7GxEd98843X7dy/fz+OHDmCW2+9FTU1NaiurkZ1dTUaGhpw1VVXYfv27U7Dqh544AFVx7jnnnvQvXt3ZGZmYsaMGWhoaMDf//53u+FWtiorK7F//37cddddSE1NtT4+ZMgQTJkyBZs2bVL/QgF89dVXOH36NB588EG7+TczZszAwIEDsXHjRq/2K5k/f77dz1LxC8f2Tpw4Ebm5uW73ZbFY8OGHH2LmzJmy50m6Nt577z2MHz8eKSkp1veuuroakydPRkdHB7Zv3+7yGElJSZg2bRr++c9/2mX03n33XYwePRp9+vQB8NO8qPXr13scYmfLH9d2bGws7r77bqdj+evzQUTkLQ6HI6KINXLkSLz//vtobW3FgQMH8MEHH+DFF1/EjTfeiP379yM3NxelpaXQ6XQeO8Vff/01fvvb32Lr1q2or6+3+53ZbPa6jUeOHAEA6xwYOWazGSkpKdafc3JyVB3jmWeewfjx46HX65Geno5Bgwa5rYJWUVEBALjoooucfjdo0CB88sknigoyqNnvwIEDsXPnTlX7czRgwAC7n/v16wedTuc0R0nJ+Ttz5gzq6+s9DqU8cuQIDh486HIIoKciHDfffDM+/PBDFBQU4PLLL0dpaSn27t1rN4zu5ptvxhtvvIH/9//+H5544glcddVVuOGGG3DjjTc6BTe2/HFt9+rVCzExMT7tg4goEBgEEVHEi4mJwciRIzFy5EhceOGFuPvuu/Hee+9h0aJFip5fV1eHiRMnIikpCc8++yz69esHg8GAoqIiPP7446ruzjuSnvvCCy+4LJ3tOG/C9q67EhdffHHAykyHEttshy21588di8WCKVOm4De/+Y3s7y+88EK3z585cybi4+Pxz3/+E5dffjn++c9/QqfT4ec//7lde7dv345t27Zh48aNyM/Px7vvvotJkybhv//9L/R6vdftV3tty507f34+iIi8xSCIiMiGNLSpsrISQGe2wGKxoKSkxGUQ8tlnn6Gmpgbvv/8+JkyYYH1crrqaq463q8elCf9JSUkhE6hkZWUBAL799lun333zzTdIT0+3ZoFcvS5P+5Uq1Em+/fZb6++9deTIEbssz9GjR2GxWLyqQNa9e3ckJSWhuLjY7Xb9+vXD+fPnvX7vEhIScM0
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Diamonds Prices2022.csv\")\n",
|
|||
|
"\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(df['price'], df['carat'])\n",
|
|||
|
"plt.xlabel('price')\n",
|
|||
|
"plt.ylabel('carat')\n",
|
|||
|
"plt.title('Scatter Plot of price vs carat')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"При проверке на шум можно заметить выброс при цене в 17500. Количество карат запредельно.\n",
|
|||
|
"\n",
|
|||
|
"Для удаления выбросов из датасета можно использовать метод межквартильного размаха. Зашумленность не очень высокая. Покрытие данных высокое и подошло бы для поставленной задачи по актуальности."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 30,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA2QAAAIjCAYAAABswtioAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOy9d5xtVXn//9719DO93F7pVYo0ERXkiogSO5ofRSPGCEYlahIVEKJE7JrYvkZQgopYSGIBkRICItIucGm39+kzZ07dff3+WPucO/XeuX3Q9X690Dtn77P7nlnPep7n89GEEAKFQqFQKBQKhUKhUBxw9IN9AAqFQqFQKBQKhULxl4oKyBQKhUKhUCgUCoXiIKECMoVCoVAoFAqFQqE4SKiATKFQKBQKhUKhUCgOEiogUygUCoVCoVAoFIqDhArIFAqFQqFQKBQKheIgoQIyhUKhUCgUCoVCoThIqIBMoVAoFAqFQqFQKA4SKiBTKBQKhUKhUCgUioOECsgUCsVfLBs3bkTTNG6++eaDfSjjuPPOOzn++ONJJpNomkahUDjYh/QXy8F8RsrlMp2dndx66617vI0vfOELLF26FMMwOP744/fdwSkmsXjxYi699NKDfRgHFE3TuPbaaxs/33zzzWiaxsaNGw/ocfi+z4IFC/jmN795QPerUOwrVECmUPwZ8swzz/DWt76VRYsWkUwmmTdvHq997Wv5xje+sd/2+aMf/YivfvWrkz7fvn071157LStXrtxv+57I/fffj6Zpjf8sy2Lp0qVcfPHFrF+/fp/s4w9/+APXXnvtPg+WhoaGePvb304qleLf//3fueWWW8hkMvt0H/uLg3Gv/5z52te+Ri6X453vfOeUyz/+8Y+jaRrveMc7plz+u9/9jo9//OOcccYZ3HTTTXzuc5+bFe+jYRh0dnby1re+leeff/6AHcefO5VKheuvv55jjz2WdDpNU1MTZ555Jj/84Q8RQuzxdn/zm9+MC7pmI5Zl8dGPfpTPfvazOI5zsA9Hodh9hEKh+LPioYceErZti+XLl4vrr79e/L//9//E1VdfLc4991yxbNmy/bbf888/XyxatGjS548++qgAxE033bTf9j2R++67TwDiQx/6kLjlllvE97//fXHFFVcI27ZFa2ur2LZtmxBCiA0bNuzxsX3hC18QgNiwYcM+Pfbf/va3AhB33333Pt3ugeBg3Ov9TRRFolariSAIDuh+Pc8THR0d4nOf+9y0xzV//nyxePFikUqlRLFYnLTOJz7xCaHrunBdt/HZbHkfP/zhD4tkMina2tpET0/PATuW/YnjOMLzvIOy797eXnHUUUcJXdfFu971LvGd73xHfO1rXxOvfOUrBSDe8Y537PEz/MEPflBMN1wExDXXXNP4OQgCUavVRBRFe7SvvWFkZETYti3+4z/+44DvW6HYW8yDFAcqFIr9xGc/+1mampp49NFHaW5uHresv7//4BzUfqBSqewyc3TmmWfy1re+FYDLLruMQw89lA996EP84Ac/4J/+6Z8OxGHuNvV7NPHeHQwcx8G2bXT9L6+YIggCoijCtm2SyeQB3/+vfvUrBgYGePvb3z7l8vvvv5+tW7dy7733smLFCn7xi19wySWXjFunv7+fVCqFbdv7/Xh3930EOOyww/jABz7AD3/4Qz7+8Y/v70Pc7yQSiYO270suuYTnn3+eX/7yl7zxjW9sfP6hD32Ij33sY3zxi1/kZS97GZ/4xCf263EYhoFhGPtsezN5ruo0Nzdz7rnncvPNN/Oe97xnnx2DQnFAONgRoUKh2Lccdthh4lWvetWM17/lllvEySefLFKplGhubhZnnnmmuOuuuxrL77jjDvH6179ezJkzR9i2LZYuXSquu+66cbOtZ511lgDG/bdo0aLGzPjE/8bOzv/xj38UK1asEPl8XqRSKfHKV75SPPjgg+OO8ZprrhGAePbZZ8VFF10kmpubxfHHHz/tOdX3e/vtt4/7fNWqVQIQ73vf+4QQ02fI7rnnHvGKV7xCpNNp0dTUJN74xjeK5557btLxTPxvV9myn/70p+KEE05oZAbe/e53i61bt+70Ol5yySU73ebWrVvFe97znsb9Wbx4sfjbv/3bRlZkaGhIXHXVVeLoo48WmUxG5HI58brXvU6sXLlyymv24x//WHzyk58Uc+fOFZqmiZGRkRltYyb3eiy33367AMT9998/adm3v/1tAYhnnnlGCCFET0+PuPTSS8W8efOEbduiu7tbvPGNb9zl9b7kkktEJpMR69atE+eee65Ip9Nizpw54jOf+cy4Gfz6c/CFL3xBfOUrXxFLly4Vuq6LJ598ctpn5Pnnnxdve9vbRHt7u0gmk+LQQw8V//zP/zzp3lx22WWis7NT2LYtjjzyyBnP3l988cVi8eLF0y5/73vfK4488kghhBDnnXeeeO1rXztu+XT3Yja+j5dffvm4z2dy3erbvO2228S1114r5s6dK7LZrHjLW94iCoWCcBxH/P3f/73o6OgQmUxGXHrppcJxnHHb8H1fXHfddWLp0qXCtm2xaNEi8U//9E/j1jv//PPFkiVLpjyvU089VZx44omNnxctWjTufa1f7wcffFB85CMfEe3t7SKdTosLL7xQ9Pf3j9tWGIbimmuuEXPmzBGpVEq86lWvEs8+++ykbU7Fww8/LADxnve8Z8rlvu+LQw45RLS0tIhqtTru+t13333j1p34vF9yySVTPjN1mJAhq5/zxHfzN7/5TeN3ajabFa9//evFqlWrxq1Tf1/Xrl0rzjvvPJHNZsWb3vQmIYQQq1evFm9+85tFV1eXSCQSYt68eeId73iHKBQK47bxta99TWiaJoaGhnZ6zRSK2YbKkCkUf2YsWrSIhx9+mFWrVnH00UfvdN3PfOYzXHvttZx++ulcd9112LbNI488wr333su5554LyCbtbDbLRz/6UbLZLPfeey9XX301xWKRL3zhCwB88pOfZHR0lK1bt/KVr3wFgGw2yxFHHMF1113H1VdfzeWXX86ZZ54JwOmnnw7Avffey3nnnceJJ57INddcg67r3HTTTbzmNa/h//7v/3j5y18+7njf9ra3ccghh/C5z31uj3oi1q1bB0BbW9u06/z+97/nvPPOY+nSpVx77bXUajW+8Y1vcMYZZ/DEE0+wePFi3vzmN7N69Wp+/OMf85WvfIX29nYAOjo6pt3uzTffzGWXXcbJJ5/MDTfcQF9fH1/72td46KGHePLJJ2lubuaTn/wkhx12GN/97ne57rrrWLJkCcuWLZt2m9u3b+flL385hUKByy+/nMMPP5xt27bxs5/9jGq1im3brF+/njvuuIO3ve1tLFmyhL6+Pr7zne9w1lln8dxzzzF37txx27z++uuxbZt/+Id/wHVdbNvmueee2+U2dnWvJ3L++eeTzWb56U9/yllnnTVu2W233cZRRx3VeH7f8pa38Oyzz3LllVeyePFi+vv7ufvuu9m8eTOLFy+e9voAhGHI6173Ok499VRuvPFG7rzzTq655hqCIOC6664bt+5NN92E4zhcfvnlJBIJWltbiaJo0jaffvppzjzzTCzL4vLLL2fx4sWsW7eO//mf/+Gzn/0sAH19fZx66qlomsYVV1xBR0cHv/3tb3nve99LsVjkwx/+8E6P+w9/+AMnnHDClMtc1+XnP/85V111FQAXXXQRl112Gb29vXR3dwNwyy238N3vfpc//elPfO973wPgkEMOmVXvY134oaWlpfHZ7l63G264gVQqxT/+4z+ydu1avvGNb2BZFrquMzIywrXXXssf//hHbr75ZpYsWcLVV1/d+O7f/M3f8IMf/IC3vvWtXHXVVTzyyCPccMMNjUwTwDve8Q4uvvhiHn30UU4++eTGdzdt2sQf//jHxu/AnXHllVfS0tLCNddcw8aNG/nqV7/KFVdcwW233dZY55/+6Z+48cYbueCCC1ixYgVPPfUUK1asmFE/1P/8z/8AcPHFF0+53DRN3vWud/GZz3yGhx56iHPOOWeX26zz/ve/n+3bt3P33Xdzyy23zPh7Y7nlllu45JJLWLFiBZ/
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Количество строк до удаления выбросов: 53943\n",
|
|||
|
"Количество строк после удаления выбросов: 49517\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Diamonds Prices2022.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор столбцов для анализа\n",
|
|||
|
"column1 = 'carat'\n",
|
|||
|
"column2 = 'price'\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Функция для удаления выбросов\n",
|
|||
|
"def remove_outliers(df, column):\n",
|
|||
|
" Q1 = df[column].quantile(0.25)\n",
|
|||
|
" Q3 = df[column].quantile(0.75)\n",
|
|||
|
" IQR = Q3 - Q1\n",
|
|||
|
" lower_bound = Q1 - 1.5 * IQR\n",
|
|||
|
" upper_bound = Q3 + 1.5 * IQR\n",
|
|||
|
" return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
|
|||
|
"\n",
|
|||
|
"# Удаление выбросов для каждого столбца\n",
|
|||
|
"df_cleaned = df.copy()\n",
|
|||
|
"for column in [column1, column2]:\n",
|
|||
|
" df_cleaned = remove_outliers(df_cleaned, column)\n",
|
|||
|
"\n",
|
|||
|
"# Построение точечной диаграммы после удаления выбросов\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(df_cleaned[column1], df_cleaned[column2], alpha=0.5)\n",
|
|||
|
"plt.xlabel(column1)\n",
|
|||
|
"plt.ylabel(column2)\n",
|
|||
|
"plt.title(f'Scatter Plot of {column1} vs {column2} (After Removing Outliers)')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Вывод количества строк до и после удаления выбросов\n",
|
|||
|
"print(f\"Количество строк до удаления выбросов: {len(df)}\")\n",
|
|||
|
"print(f\"Количество строк после удаления выбросов: {len(df_cleaned)}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Теперь очистим датасет от пустых строк"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Общая информация о датасете:\n",
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"RangeIndex: 53943 entries, 0 to 53942\n",
|
|||
|
"Data columns (total 11 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 Unnamed: 0 53943 non-null int64 \n",
|
|||
|
" 1 carat 53943 non-null float64\n",
|
|||
|
" 2 cut 53943 non-null object \n",
|
|||
|
" 3 color 53943 non-null object \n",
|
|||
|
" 4 clarity 53943 non-null object \n",
|
|||
|
" 5 depth 53943 non-null float64\n",
|
|||
|
" 6 table 53943 non-null float64\n",
|
|||
|
" 7 price 53943 non-null int64 \n",
|
|||
|
" 8 x 53943 non-null float64\n",
|
|||
|
" 9 y 53943 non-null float64\n",
|
|||
|
" 10 z 53943 non-null float64\n",
|
|||
|
"dtypes: float64(6), int64(2), object(3)\n",
|
|||
|
"memory usage: 4.5+ MB\n",
|
|||
|
"None\n",
|
|||
|
"\n",
|
|||
|
"Таблица анализа пропущенных значений:\n",
|
|||
|
" Количество пропущенных значений Процент пропущенных значений\n",
|
|||
|
"Unnamed: 0 0 0.0\n",
|
|||
|
"carat 0 0.0\n",
|
|||
|
"cut 0 0.0\n",
|
|||
|
"color 0 0.0\n",
|
|||
|
"clarity 0 0.0\n",
|
|||
|
"depth 0 0.0\n",
|
|||
|
"table 0 0.0\n",
|
|||
|
"price 0 0.0\n",
|
|||
|
"x 0 0.0\n",
|
|||
|
"y 0 0.0\n",
|
|||
|
"z 0 0.0\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Diamonds Prices2022.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Вывод общей информации о датасете\n",
|
|||
|
"print(\"Общая информация о датасете:\")\n",
|
|||
|
"print(df.info())\n",
|
|||
|
"\n",
|
|||
|
"# Вывод таблицы анализа пропущенных значений\n",
|
|||
|
"missing_values = df.isnull().sum()\n",
|
|||
|
"missing_values_percentage = (missing_values / len(df)) * 100\n",
|
|||
|
"missing_data = pd.concat([missing_values, missing_values_percentage], axis=1, keys=['Количество пропущенных значений', 'Процент пропущенных значений'])\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nТаблица анализа пропущенных значений:\")\n",
|
|||
|
"print(missing_data)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Пустых строк не было обнаружено.\n",
|
|||
|
"\n",
|
|||
|
"Теперь создадим выборки."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 32,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер обучающей выборки: 32365\n",
|
|||
|
"Размер контрольной выборки: 10789\n",
|
|||
|
"Размер тестовой выборки: 10789\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Diamonds Prices2022.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор признаков и целевой переменной\n",
|
|||
|
"X = df.drop('carat', axis=1) # Признаки (все столбцы, кроме 'carat')\n",
|
|||
|
"y = df['carat'] # Целевая переменная ('carat')\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение данных на обучающую и оставшуюся часть (контрольную + тестовую)\n",
|
|||
|
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение оставшейся части на контрольную и тестовую выборки\n",
|
|||
|
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Вывод размеров выборок\n",
|
|||
|
"print(f\"Размер обучающей выборки: {X_train.shape[0]}\")\n",
|
|||
|
"print(f\"Размер контрольной выборки: {X_val.shape[0]}\")\n",
|
|||
|
"print(f\"Размер тестовой выборки: {X_test.shape[0]}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Проанализируем сбалансированность выборок"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Распределение carat в обучающей выборке:\n",
|
|||
|
"carat\n",
|
|||
|
"0.20 7\n",
|
|||
|
"0.21 5\n",
|
|||
|
"0.22 4\n",
|
|||
|
"0.23 178\n",
|
|||
|
"0.24 139\n",
|
|||
|
" ... \n",
|
|||
|
"3.40 1\n",
|
|||
|
"3.65 1\n",
|
|||
|
"3.67 1\n",
|
|||
|
"4.13 1\n",
|
|||
|
"4.50 1\n",
|
|||
|
"Name: count, Length: 263, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n",
|
|||
|
"Распределение carat в контрольной выборке:\n",
|
|||
|
"carat\n",
|
|||
|
"0.20 2\n",
|
|||
|
"0.21 2\n",
|
|||
|
"0.23 62\n",
|
|||
|
"0.24 58\n",
|
|||
|
"0.25 51\n",
|
|||
|
" ..\n",
|
|||
|
"3.11 1\n",
|
|||
|
"3.51 1\n",
|
|||
|
"4.00 1\n",
|
|||
|
"4.01 1\n",
|
|||
|
"5.01 1\n",
|
|||
|
"Name: count, Length: 232, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n",
|
|||
|
"Распределение carat в тестовой выборке:\n",
|
|||
|
"carat\n",
|
|||
|
"0.20 3\n",
|
|||
|
"0.21 2\n",
|
|||
|
"0.22 1\n",
|
|||
|
"0.23 53\n",
|
|||
|
"0.24 57\n",
|
|||
|
" ..\n",
|
|||
|
"3.00 1\n",
|
|||
|
"3.01 1\n",
|
|||
|
"3.04 1\n",
|
|||
|
"3.50 1\n",
|
|||
|
"4.01 1\n",
|
|||
|
"Name: count, Length: 241, dtype: int64\n",
|
|||
|
"Процент положительных значений: 100.00%\n",
|
|||
|
"Процент отрицательных значений: 0.00%\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Diamonds Prices2022.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор признаков и целевой переменной\n",
|
|||
|
"X = df.drop('carat', axis=1) # Признаки (все столбцы, кроме 'carat')\n",
|
|||
|
"y = df['carat'] # Целевая переменная ('carat')\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение данных на обучающую и оставшуюся часть (контрольную + тестовую)\n",
|
|||
|
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Разбиение оставшейся части на контрольную и тестовую выборки\n",
|
|||
|
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Функция для анализа распределения и вывода результатов\n",
|
|||
|
"def analyze_distribution(data, title):\n",
|
|||
|
" print(f\"Распределение carat в {title}:\")\n",
|
|||
|
" distribution = data.value_counts().sort_index()\n",
|
|||
|
" print(distribution)\n",
|
|||
|
" total = len(data)\n",
|
|||
|
" positive_count = (data > 0).sum()\n",
|
|||
|
" negative_count = (data < 0).sum()\n",
|
|||
|
" positive_percent = (positive_count / total) * 100\n",
|
|||
|
" negative_percent = (negative_count / total) * 100\n",
|
|||
|
" print(f\"Процент положительных значений: {positive_percent:.2f}%\")\n",
|
|||
|
" print(f\"Процент отрицательных значений: {negative_percent:.2f}%\")\n",
|
|||
|
"\n",
|
|||
|
"# Анализ распределения для каждой выборки\n",
|
|||
|
"analyze_distribution(y_train, \"обучающей выборке\")\n",
|
|||
|
"analyze_distribution(y_val, \"контрольной выборке\")\n",
|
|||
|
"analyze_distribution(y_test, \"тестовой выборке\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Цены на кофе"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 3,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Starbucks Dataset.csv\")\n",
|
|||
|
"print(df.columns)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Проблемная область: ценообразование кофе\n",
|
|||
|
"\n",
|
|||
|
"Объект наблюдения: кофе\n",
|
|||
|
"\n",
|
|||
|
"Атрибуты: дата, цена на момент открытия, максимальная цена, минимальная цена, цена на момент закрытия, скорректированная цена закрытия, объем\n",
|
|||
|
"\n",
|
|||
|
"Пример бизнес-цели: \n",
|
|||
|
"1. Анализ рыночных тенденций:\n",
|
|||
|
"Цель: Определить долгосрочные тенденции в ценах на кофе.\n",
|
|||
|
"\n",
|
|||
|
"2. Прогнозирование цен:\n",
|
|||
|
"Цель: Разработать модель прогнозирования будущих цен на кофе.\n",
|
|||
|
"\n",
|
|||
|
"3. Оценка рисков:\n",
|
|||
|
"Цель: Оценить риски, связанные с колебаниями цен на кофе.\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"Актуальность: Данные о ценах на кофе являются крайне актуальными для компаний, работающих в сфере кофейной индустрии, а также для инвесторов и трейдеров, заинтересованных в сырьевом рынке. Понимание динамики цен на кофе позволяет оптимизировать стратегии закупок, управления запасами и ценообразования, что в конечном итоге влияет на прибыльность бизнеса и эффективность инвестиций."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 35,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0EAAAIjCAYAAADFthA8AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABj/ElEQVR4nO3deXhTdf728Tvp3tKmLVsLFCjbYEH2VRRcYAQZFDcEQRnEHRR1dITxUcAN1N+4oeM2IzogyqjgDsrOsAkIiIgiYFmElqWlC4W20JznDyaxadM2adMm7Xm/rovroicnJ5/kNOm5890shmEYAgAAAACTsPq7AAAAAACoSYQgAAAAAKZCCAIAAABgKoQgAAAAAKZCCAIAAABgKoQgAAAAAKZCCAIAAABgKoQgAAAAAKZCCAIAAABgKoQgADChffv2yWKx6J133vF3KS4WL16sLl26KDw8XBaLRVlZWf4uCcVcfPHFuvjii/1dBgBUGSEIQJ3yww8/6LrrrlOLFi0UHh6upk2batCgQZo1a1a1Pea8efP04osvltp++PBhTZs2Tdu2bau2xy5p5cqVslgszn8hISFq1aqVbr75Zv36668+eYx169Zp2rRpPg8oGRkZGjFihCIiIvTqq69qzpw5ioqKKvc+P/74o8aMGaOmTZsqLCxMTZo00ejRo/Xjjz/6tLbaZsGCBbJYLPrnP/9Z5j5LliyRxWLRyy+/XIOVAUBgIAQBqDPWrVunHj166Pvvv9dtt92mV155RbfeequsVqteeumlanvc8kLQ9OnTazQEOdx7772aM2eO3nzzTQ0dOlTz589Xz549dfjw4Sofe926dZo+fbrPQ9CmTZuUm5urJ554QuPHj9eYMWMUEhJS5v4LFixQt27dtGzZMo0bN07/+Mc/NH78eK1YsULdunXTwoULfVpfbTJ06FDZbDbNmzevzH3mzZunoKAgjRw5sgYrA4DAEOzvAgDAV5566inZbDZt2rRJsbGxLrcdPXrUP0VVg7y8vApbSC666CJdd911kqRx48apXbt2uvfee/Xuu+9qypQpNVGm1xznqOS5c2fv3r266aab1KpVK61evVoNGzZ03jZp0iRddNFFuummm7R9+3a1atWqukoOWGFhYbruuus0e/ZsHT58WE2aNHG5PT8/XwsXLtSgQYPUqFEjP1UJAP5DSxCAOmPv3r3q0KGD24todxd6c+fOVa9evRQZGam4uDj1799f33zzjfP2Tz/9VEOHDlWTJk0UFham1q1b64knnlBRUZFzn4svvlhffvml9u/f7+yC1rJlS61cuVI9e/aUdC6EOG4rPgbn22+/1eDBg2Wz2RQZGakBAwZo7dq1LjVOmzZNFotFO3fu1I033qi4uDhdeOGFXr82l156qSQpNTW13P2WL1+uiy66SFFRUYqNjdVVV12ln376yaWehx56SJKUnJzsfF779u0r97gffvihunfvroiICDVo0EBjxozRoUOHnLdffPHFGjt2rCSpZ8+eslgs+vOf/1zm8Z577jmdOnVKb775pksAkqQGDRrojTfeUF5enp599lmX2i0Wi37++WeNGDFCMTExql+/viZNmqT8/PxSjzF37lxnzfHx8Ro5cqQOHjzoss/FF1+sjh07aufOnbrkkksUGRmppk2bujxuWTp27KhLLrmk1Ha73a6mTZs6Q6wkffDBB+revbuio6MVExOj888/v8LWzTFjxshut+uDDz4odduXX36p7OxsjR49WpJ09uxZPfHEE2rdurXCwsLUsmVL/e1vf1NBQUG5j/HOO++4Pf+ObpkrV650bnO8Vtu3b9eAAQMUGRmpNm3a6KOPPpIkrVq1Sr1791ZERIT+8Ic/aOnSpaUe79ChQ7rlllvUuHFjhYWFqUOHDnr77bfLrREA3CEEAagzWrRooe+++047duyocN/p06frpptuUkhIiB5//HFNnz5dSUlJWr58uXOfd955R/Xq1dMDDzygl156Sd27d9djjz2myZMnO/d55JFH1KVLFzVo0EBz5szRnDlz9OKLL+q8887T448/Lkm6/fbbnbf1799f0rmw0b9/f+Xk5Gjq1Kl6+umnlZWVpUsvvVQbN24sVe/111+vU6dO6emnn9Ztt93m9Wuzd+9eSVL9+vXL3Gfp0qW6/PLLdfToUU2bNk0PPPCA1q1bp379+jkvcq+55hqNGjVKkvTCCy84n1fJIFLcO++8oxEjRigoKEgzZszQbbfdpgULFujCCy90dql75JFHdPvtt0uSHn/8cc2ZM0d33HFHmcf8/PPP1bJlS1100UVub+/fv79atmypL7/8stRtI0aMUH5+vmbMmKErrrhCL7/8svOxHZ566indfPPNatu2rZ5//nndd999WrZsmfr371+qG+CJEyc0ePBgde7cWX//+9/Vvn17Pfzww1q0aFGZ9UvSDTfcoNWrVys9Pd1l+5o1a3T48GFnN7UlS5Zo1KhRiouL0zPPPKOZM2fq4osvLhWY3b0GzZo1c9slbt68eYqMjNTw4cMlSbfeeqsee+wxdevWTS+88IIGDBigGTNm+Lyr3IkTJ/SnP/1JvXv31rPPPquwsDCNHDlS8+fP18iRI3XFFVdo5syZysvL03XXXafc3FznfY8cOaI+ffpo6dKlmjhxol566SW1adNG48ePd9sdFQDKZQBAHfHNN98YQUFBRlBQkNG3b1/jr3/9q/H1118bhYWFLvvt3r3bsFqtxtVXX20UFRW53Ga3253/P3XqVKnHuOOOO4zIyEgjPz/fuW3o0KFGixYtSu27adMmQ5Ixe/bsUo/Rtm1b4/LLLy/1eMnJycagQYOc26ZOnWpIMkaNGuXRa7BixQpDkvH2228bx44dMw4fPmx8+eWXRsuWLQ2LxWJs2rTJMAzDSE1NLVVbly5djEaNGhkZGRnObd9//71htVqNm2++2bntueeeMyQZqampFdZTWFhoNGrUyOjYsaNx+vRp5/YvvvjCkGQ89thjzm2zZ882JDlrLEtWVpYhybjqqqvK3e/KK680JBk5OTmGYfz+Wl555ZUu+919992GJOP77783DMMw9u3bZwQFBRlPPfWUy34//PCDERwc7LJ9wIABhiTj3//+t3NbQUGBkZCQYFx77bXl1rdr1y5DkjFr1qxS9dSrV8/5+zdp0iQjJibGOHv2bLnHc+ehhx4yJBm7du1ybsvOzjbCw8Odv1Pbtm0zJBm33nqry30ffPBBQ5KxfPlyl+c7YMAA58+Oc1byd8Hxe7hixQqX+0oy5s2b59z2888/G5IMq9VqbNiwwbn966+/LvX7OX78eCMxMdE4fvy4y2ONHDnSsNlsbt+vAFAWWoIA1BmDBg3S+vXrdeWVV+r777/Xs88+q8svv1xNmzbVZ5995tzvk08+kd1u12OPPSar1fVj0GKxOP8fERHh/H9ubq6OHz+uiy66SKdOndLPP/9c6Tq3bdum3bt368Ybb1RGRoaOHz+u48ePKy8vT5dddplWr14tu93ucp8777zTq8e45ZZb1LBhQzVp0kRDhw5VXl6e3n33XfXo0cPt/mlpadq2bZv+/Oc/Kz4+3rm9U6dOGjRokL766ivvn6ikzZs36+jRo7r77rsVHh7u3D506FC1b9/ebUtNRRytA9HR0eXu57g9JyfHZfuECRNcfr7nnnskyfkcFyxYILvdrhEjRjjPzfHjx5WQkKC2bdtqxYoVLvevV6+exowZ4/w5NDRUvXr1qnA2vnbt2qlLly6aP3++c1tRUZE++ugjDRs2zPn7Fxsbq7y8PC1ZsqTc47njqKt4a9DHH3+s/Px8Z1c4x/N+4IEHXO77l7/8RZIqdY7KUq9ePZfWpT/84Q+KjY3Veeedp969ezu3O/7veA0Nw9DHH3+sYcOGyTAMl/Ny+eWXKzs7W1u2bPFZnQDqvjoTglavXq1hw4apSZMmslgs+uSTT7w+xtdff60+ffooOjpaDRs21LXXXlthP3cAgaVnz55asGCBTpw4oY0bN2rKlCnKzc3Vddddp507d0o61zXMarUqJSWl3GP9+OOPuvrqq2Wz2RQTE6OGDRs6Lyqzs7M
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Starbucks Dataset.csv\")\n",
|
|||
|
"\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(df['Open'], df['Volume'])\n",
|
|||
|
"plt.xlabel('Open')\n",
|
|||
|
"plt.ylabel('Volume')\n",
|
|||
|
"plt.title('Scatter Plot of Open vs Volume')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Выброс присутствует. Сделаем очистку данных.\n",
|
|||
|
"\n",
|
|||
|
"Для удаления выбросов из датасета можно использовать метод межквартильного размаха. Зашумленность не очень высокая. Покрытие данных высокое и подошло бы для поставленной задачи по актуальности."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA04AAAIjCAYAAAA0vUuxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOydeXQUZfb3v90hO0knIUIHRBI2IQaIIAgEQSOMLAroOAqKCyKKwgziuKGiICoyzu9FRxwRFHTEgCuCgDhAcJAQBAkBQlAhJqiQgNkhIQvpev+I1fRS1fVUdXVVdff9nMM5pLuWp2u997n3fq+J4zgOBEEQBEEQBEEQhChmvQdAEARBEARBEARhdMhxIgiCIAiCIAiCkIAcJ4IgCIIgCIIgCAnIcSIIgiAIgiAIgpCAHCeCIAiCIAiCIAgJyHEiCIIgCIIgCIKQgBwngiAIgiAIgiAICchxIgiCIAiCIAiCkIAcJ4IgCIIgCIIgCAnIcSIIgtCRkpISmEwmvPfee3oPxYktW7YgPT0dERERMJlMqK6u1ntIhAPXXnstrr32Wr2HoZhz586hffv2+PDDDxVvw2QyYf78+eoNSkfee+89mEwmlJSU2D8bPHgwnnjiCf0GRRCEG+Q4EQThEw4fPoxbb70VXbp0QUREBDp16oRRo0bhjTfe8Nk+s7Ky8Nprr7l9furUKcyfPx/5+fk+27cr33zzDUwmk/1faGgounbtirvvvhs///yzKvvYvXs35s+fr7pTU1FRgdtuuw2RkZF488038cEHHyA6OtrjOkeOHMGUKVPQqVMnhIeHo2PHjrjzzjtx5MgRVcfmb3z++ecwmUx45513RJfZunUrTCYT/vWvf2k4Mn15/fXXERMTg0mTJrl9l5+fjylTpqBz584IDw9HQkICRo4ciVWrVqGlpUWH0erDk08+iTfffBNlZWV6D4UgiD8gx4kgCNXZvXs3rrrqKhw8eBDTp0/H0qVLcf/998NsNuP111/32X49OU4LFizQ1HHi+dvf/oYPPvgAy5cvx7hx4/DRRx9h4MCBOHXqlNfb3r17NxYsWKC647Rv3z6cPXsWCxcuxLRp0zBlyhSEhoaKLv/555+jf//+2L59O6ZOnYp///vfmDZtGnbs2IH+/ftj3bp1qo7Pnxg3bhwsFguysrJEl8nKykJISIigExGINDc34/XXX8f999+PkJAQp+/eeecdXHXVVdixYwfuvPNO/Pvf/8Zzzz2HyMhITJs2DYsXL9Zp1NozYcIExMbG4t///rfeQyEI4g/a6D0AgiACj5deegkWiwX79u1DXFyc03dnzpzRZ1A+oK6uTjISc8011+DWW28FAEydOhU9e/bE3/72N7z//vuYO3euFsOUDX+OXM+dEEVFRbjrrrvQtWtX7Ny5E5dccon9u9mzZ+Oaa67BXXfdhUOHDqFr166+GrJhCQ8Px6233opVq1bh1KlT6Nixo9P3DQ0NWLduHUaNGoX27dvrNEpt2bhxI37//XfcdtttTp/v2bMHM2bMwJAhQ7B582bExMTYv3vkkUfw/fffo6CgQOvh6obZbMatt96K//znP1iwYAFMJpPeQyKIoIciTgRBqE5RURGuuOIKQcNbyDhcvXo1Bg0ahKioKMTHx2P48OH473//a/9+/fr1GDduHDp27Ijw8HB069YNCxcudErbufbaa7Fp0yacOHHCnh6XnJyMb775BgMHDgTQ6rjw3znWFH333XcYPXo0LBYLoqKiMGLECOTk5DiNcf78+TCZTCgsLMQdd9yB+Ph4DBs2TPaxyczMBAAUFxd7XC47OxvXXHMNoqOjERcXhwkTJuDo0aNO43n88ccBACkpKfbf5VgjIcQnn3yCAQMGIDIyEomJiZgyZQpOnjxp//7aa6/FPffcAwAYOHAgTCYT7r33XtHtvfrqq6ivr8fy5cudnCYASExMxNtvv426ujr84x//cBq7yWTCDz/8gNtuuw2xsbFo164dZs+ejYaGBrd9rF692j7mhIQETJo0Cb/++qvTMtdeey3S0tJQWFiI6667DlFRUejUqZPTfsVIS0vDdddd5/a5zWZDp06d7I4vAKxduxYDBgxATEwMYmNj0adPH8ko6pQpU2Cz2bB27Vq37zZt2oSamhrceeedAIALFy5g4cKF6NatG8LDw5GcnIynn34ajY2NHvchVCMDXEwZ/eabb+yf8cfq0KFDGDFiBKKiotC9e3d8+umnAID//e9/uPrqqxEZGYnLL78c27Ztc9vfyZMncd9996FDhw4IDw/HFVdcgZUrV3ocI88XX3yB5ORkdOvWzelz3jn48MMPnZwmnquuusrjtcg6rqamJjz33HMYMGAALBYLoqOjcc0112DHjh1Oy/H1h//85z+xfPly+zkZOHAg9u3b57bvH374AbfeeisSEhIQERGBq666Chs2bHBb7siRI8jMzERkZCQuvfRSvPjii7DZbIK/Z9SoUThx4oQu0XKCINyhiBNBEKrTpUsX5ObmoqCgAGlpaR6XXbBgAebPn4+hQ4fihRdeQFhYGL777jtkZ2fjT3/6E4BWo7Bt27Z49NFH0bZtW2RnZ+O5555DbW0tXn31VQDAM888g5qaGvz2229YsmQJAKBt27bo3bs3XnjhBTz33HN44IEHcM011wAAhg4dCqDVQRkzZgwGDBiA559/HmazGatWrUJmZia+/fZbDBo0yGm8f/nLX9CjRw+8/PLL4DhO9rEpKioCALRr1050mW3btmHMmDHo2rUr5s+fj/Pnz+ONN95ARkYG8vLykJycjFtuuQU//fQT1qxZgyVLliAxMREA3JwXR9577z1MnToVAwcOxKJFi3D69Gm8/vrryMnJwYEDBxAXF4dnnnkGl19+OZYvX44XXngBKSkpbgauI19++SWSk5Ptx9WV4cOHIzk5GZs2bXL77rbbbkNycjIWLVqEPXv24F//+heqqqrwn//8x77MSy+9hHnz5uG2227D/fffj99//x1vvPEGhg8fbh8zT1VVFUaPHo1bbrkFt912Gz799FM8+eST6NOnD8aMGSP6G26//XbMnz8fZWVlsFqt9s937dqFU6dO2VPotm7dismTJ+P666+3p4wdPXoUOTk5mD17tuj2hw8fjksvvRRZWVl49NFHnb7LyspCVFQUJk6cCAC4//778f777+PWW2/F3//+d3z33XdYtGgRjh49qmrKY1VVFW688UZMmjQJf/nLX/DWW29h0qRJ+PDDD/HII49gxowZuOOOO/Dqq6/i1ltvxa+//mp3Zk6fPo3BgwfDZDJh1qxZuOSSS/DVV19h2rRpqK2txSOPPOJx37t370b//v2dPquvr8f27dsxfPhwXHbZZYp+E+u4amtr8c4772Dy5MmYPn06zp49i3fffRc33HAD9u7di/T0dKftZmVl4ezZs3jwwQdhMpnwj3/8A7fccgt+/vlnewrrkSNHkJGRgU6dOuGpp55CdHQ0Pv74Y0ycOBGfffYZbr75ZgBAWVkZrrvuOly4cMG+3PLlyxEZGSn4mwYMGAAAyMnJwZVXXqnouBAEoSIcQRCEyvz3v//lQkJCuJCQEG7IkCHcE088wX399ddcU1OT03LHjh3jzGYzd/PNN3MtLS1O39lsNvv/6+vr3fbx4IMPclFRUVxDQ4P9s3HjxnFdunRxW3bfvn0cAG7VqlVu++jRowd3ww03uO0vJSWFGzVqlP2z559/ngPATZ48mekY7NixgwPArVy5kvv999+5U6dOcZs2beKSk5M5k8nE7du3j+M4jisuLnYbW3p6Ote+fXuuoqLC/tnBgwc5s9nM3X333fbPXn31VQ4AV1xcLDmepqYmrn379lxaWhp3/vx5++cbN27kAHDPPfec/bNVq1ZxAOxjFKO6upoDwE2YMMHjcuPHj+cAcLW1tRzHXTyW48ePd1ru4Ycf5gBwBw8e5DiO40pKSriQkBDupZdeclru8OHDXJs2bZw+HzFiBAeA+89//mP/rLGxkbNardyf//xnj+P78ccfOQDcG2+84Taetm3b2q+/2bNnc7GxsdyFCxc8bk+Ixx9/nAPA/fjjj/b
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Starbucks Dataset.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Функция для удаления выбросов с использованием IQR\n",
|
|||
|
"def remove_outliers_iqr(df, column):\n",
|
|||
|
" Q1 = df[column].quantile(0.25)\n",
|
|||
|
" Q3 = df[column].quantile(0.75)\n",
|
|||
|
" IQR = Q3 - Q1\n",
|
|||
|
" lower_bound = Q1 - 1.5 * IQR\n",
|
|||
|
" upper_bound = Q3 + 1.5 * IQR\n",
|
|||
|
" return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
|
|||
|
"\n",
|
|||
|
"# Удаление выбросов для столбцов 'Open' и 'Volume'\n",
|
|||
|
"df_cleaned = remove_outliers_iqr(df, 'Open')\n",
|
|||
|
"df_cleaned = remove_outliers_iqr(df_cleaned, 'Volume')\n",
|
|||
|
"\n",
|
|||
|
"# Построение графика для очищенных данных\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(df_cleaned['Open'], df_cleaned['Volume'])\n",
|
|||
|
"plt.xlabel('Open')\n",
|
|||
|
"plt.ylabel('Volume')\n",
|
|||
|
"plt.title('Scatter Plot of Open vs Volume (Cleaned)')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Теперь очистим датасет от пустых строк"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Количество строк до очистки: 8036\n",
|
|||
|
"Количество строк после удаления выбросов: 7585\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Starbucks Dataset.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Вывод количества строк до очистки\n",
|
|||
|
"print(f\"Количество строк до очистки: {len(df)}\")\n",
|
|||
|
"\n",
|
|||
|
"# Удаление пустых строк\n",
|
|||
|
"df_cleaned = df.dropna()\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Функция для удаления выбросов с использованием IQR\n",
|
|||
|
"def remove_outliers_iqr(df, column):\n",
|
|||
|
" Q1 = df[column].quantile(0.25)\n",
|
|||
|
" Q3 = df[column].quantile(0.75)\n",
|
|||
|
" IQR = Q3 - Q1\n",
|
|||
|
" lower_bound = Q1 - 1.5 * IQR\n",
|
|||
|
" upper_bound = Q3 + 1.5 * IQR\n",
|
|||
|
" return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
|
|||
|
"\n",
|
|||
|
"# Удаление выбросов для столбцов 'Open' и 'Volume'\n",
|
|||
|
"df_cleaned = remove_outliers_iqr(df_cleaned, 'Open')\n",
|
|||
|
"df_cleaned = remove_outliers_iqr(df_cleaned, 'Volume')\n",
|
|||
|
"\n",
|
|||
|
"# Вывод количества строк после удаления выбросов\n",
|
|||
|
"print(f\"Количество строк после удаления выбросов: {len(df_cleaned)}\")\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Теперь создадим выборки"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер обучающей выборки: 29709\n",
|
|||
|
"Размер контрольной выборки: 9904\n",
|
|||
|
"Размер тестовой выборки: 9904\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Diamonds Prices2022.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор столбцов для анализа\n",
|
|||
|
"column1 = 'carat'\n",
|
|||
|
"column2 = 'price'\n",
|
|||
|
"\n",
|
|||
|
"# Функция для удаления выбросов\n",
|
|||
|
"def remove_outliers(df, column):\n",
|
|||
|
" Q1 = df[column].quantile(0.25)\n",
|
|||
|
" Q3 = df[column].quantile(0.75)\n",
|
|||
|
" IQR = Q3 - Q1\n",
|
|||
|
" lower_bound = Q1 - 1.5 * IQR\n",
|
|||
|
" upper_bound = Q3 + 1.5 * IQR\n",
|
|||
|
" return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
|
|||
|
"\n",
|
|||
|
"# Удаление выбросов для каждого столбца\n",
|
|||
|
"df_cleaned = df.copy()\n",
|
|||
|
"for column in [column1, column2]:\n",
|
|||
|
" df_cleaned = remove_outliers(df_cleaned, column)\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и тестовую выборки\n",
|
|||
|
"X = df_cleaned[[column1]]\n",
|
|||
|
"y = df_cleaned[column2]\n",
|
|||
|
"\n",
|
|||
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Разделение обучающей выборки на обучающую и контрольную выборки\n",
|
|||
|
"X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Вывод размеров выборок\n",
|
|||
|
"print(f\"Размер обучающей выборки: {len(X_train)}\")\n",
|
|||
|
"print(f\"Размер контрольной выборки: {len(X_val)}\")\n",
|
|||
|
"print(f\"Размер тестовой выборки: {len(X_test)}\")\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Проанализируем сбалансированность выборок"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер обучающей выборки: 29709\n",
|
|||
|
"Размер контрольной выборки: 9904\n",
|
|||
|
"Размер тестовой выборки: 9904\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABdEAAAHqCAYAAADrpwd3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB/8klEQVR4nOzde1hVZf7//xcH2eBhg6iA/EIiy/MpqdFdeSZRyTJpGpUSk6IcqNRGHWfMPFQklacynZoSKx3L+ZSVmYpnSzTFSNPG1Cws3dCkQGhyXL8/+rKmHexU3Jzk+biudQ3rvt9rrftel82b/Wbte7kZhmEIAAAAAAAAAACU417TAwAAAAAAAAAAoLaiiA4AAAAAAAAAgBMU0QEAAAAAAAAAcIIiOgAAAAAAAAAATlBEBwAAAAAAAADACYroAAAAAAAAAAA4QREdAAAAAAAAAAAnKKIDAAAAAAAAAOAERXQAAAAAAAAAAJygiA5AkjRmzBhdffXVlTp2xowZcnNzc+2AAACoR7755hu5ubkpJSXFbLuU/Orm5qYZM2a4dEx9+/ZV3759XXpOAAAAoC6iiA7Ucm5ubhe1bd26taaHWmM++OAD9enTRwEBAWrYsKGuueYa3X333Vq3bl2lzvf0009r9erVrh0kAOCKcfvtt6thw4b66aefnMbExMTIy8tLP/74YzWO7NIdOnRIM2bM0DfffFPTQ3HwzTff6L777lPr1q3l7e2toKAg9e7dW0888USlzrd27VqX/5EBAIBfq87P7ufOndOMGTMu6VzkVuDyuBmGYdT0IAA49+abbzrsv/7660pNTdUbb7zh0H7rrbcqMDCw0tcpKipSaWmpLBbLJR9bXFys4uJieXt7V/r6lfXcc89p0qRJ6tOnj+644w41bNhQR48e1caNG9W1a1eHJ/ouVuPGjXXXXXdV6lgAwJXvrbfe0ogRI7Rs2TKNHj26XP+5c+cUEBCg/v376/3337+oc37zzTcKCwvT0qVLNWbMGEmXll/d3Nz0xBNPXPKH2X//+9/64x//qC1btpR76rywsFCS5OXldUnnvFxHjx7VjTfeKB8fH40dO1ZXX321Tp06pX379umjjz7S+fPnL/mciYmJWrRokfjoAwCoKtX12V2S/vvf/6pFixYXnfvJrcDl86zpAQD4fffcc4/D/q5du5Samlqu/bfOnTunhg0bXvR1GjRoUKnxSZKnp6c8Pav//06Ki4s1e/Zs3XrrrdqwYUO5/uzs7GofEwDgynf77berSZMmWrFiRYVF9Pfee09nz55VTEzMZV2npvJrmeounpeZN2+e8vPzlZGRodDQUIc+cjsAoLaq7Gf36kBuBS4fy7kAV4C+ffuqU6dOSk9PV+/evdWwYUP97W9/k/TLB/moqCgFBwfLYrGodevWmj17tkpKShzO8ds10cvWZn3uuef08ssvq3Xr1rJYLLrxxhu1Z88eh2MrWrPVzc1NiYmJWr16tTp16iSLxaKOHTtWuMTK1q1bdcMNN8jb21utW7fWP/7xj4taB/a///2v8vLydPPNN1fYHxAQ4LBfUFCgJ554Qtdee60sFotCQkI0efJkFRQUOIz77NmzWrZsmfl1u7InAgEAkCQfHx8NHz5cmzZtqvCD54oVK9SkSRPdfvvtOn36tP7yl7+oc+fOaty4saxWqwYPHqzPP//8gtepKBcWFBRowoQJatGihXmN7777rtyx3377rf785z+rbdu28vHxUbNmzfTHP/7RYdmWlJQU/fGPf5Qk9evXr9zXzCtaEz07O1txcXEKDAyUt7e3unbtqmXLljnEXMrvEBU5duyYrrrqqnIf8qXyuV2SPvroI/Xq1UuNGjVSkyZNFBUVpYMHD5r9Y8aM0aJFiyQ5ftUeAIDqVlpaqvnz56tjx47y9vZWYGCgHnzwQZ05c8Yhbu/evYqMjFTz5s3l4+OjsLAwjR07VtIvebZFixaSpJkzZ5p57feeSCe3ApePJ9GBK8SPP/6owYMHa8SIEbrnnnvMr4elpKSocePGmjhxoho3bqzNmzdr+vTpysvL07PPPnvB865YsUI//fSTHnzwQbm5uSk5OVnDhw/X119/fcGn1z/++GO98847+vOf/6wmTZpo4cKFio6OVmZmppo1ayZJ+uyzzzRo0CC1bNlSM2fOVElJiWbNmmX+UvB7AgIC5OPjow8++EAPP/yw/P39ncaWlpbq9ttv18cff6z4+Hi1b99eBw4c0Lx58/TVV1+Za6C/8cYbuv/++/WHP/xB8fHxkqTWrVtfcCwAgPolJiZGy5Yt09tvv63ExESz/fTp01q/fr1GjhwpHx8fHTx4UKtXr9Yf//hHhYWFKSsrS//4xz/Up08fHTp0SMHBwZd03fvvv19vvvmmRo0apZtuukmbN29WVFRUubg9e/Zo586dGjFihK666ip98803Wrx4sfr27atDhw6pYcOG6t27tx555BEtXLhQf/vb39S+fXtJMv/3t37++Wf17dtXR48eVWJiosLCwrRq1SqNGTNGOTk5evTRRx3iK/s7RGhoqDZu3KjNmzerf//+v3s/3njjDcXGxioyMlJz5szRuXPntHjxYt1yyy367LPPdPXVV+vBBx/UyZMnK/xKPQAA1enBBx9USkqK7rvvPj3yyCM6fvy4XnzxRX322Wf65JNP1KBBA2VnZ2vgwIFq0aKF/vrXv8rPz0/ffPON3nnnHUlSixYttHjxYo0bN0533nmnhg8fLknq0qWL0+uSWwEXMADUKQkJCcZv/9Pt06ePIclYsmRJufhz586Va3vwwQeNhg0bGufPnzfbYmNjjdDQUHP/+PHjhiSjWbNmxunTp8329957z5BkfPDBB2bbE088UW5MkgwvLy/j6NGjZtvnn39uSDJeeOEFs23o0KFGw4YNje+//95sO3LkiOHp6VnunBWZPn26Iclo1KiRMXjwYOOpp54y0tPTy8W98cYbhru7u7Fjxw6H9iVLlhiSjE8++cRsa9SokREbG3vBawMA6q/i4mKjZcuWhs1mc2gvyyvr1683DMMwzp8/b5SUlDjEHD9+3LBYLMasWbMc2iQZS5cuNdt+m18zMjIMScaf//xnh/ONGjXKkGQ88cQTZltF+T8tLc2QZLz++utm26pVqwxJxpYtW8rF9+nTx+jTp4+5P3/+fEOS8eabb5pthYWFhs1mMxo3bmzk5eU5zOVifoeoyBdffGH4+PgYkoxu3boZjz76qLF69Wrj7NmzDnE//fST4efnZzzwwAMO7Xa73fD19XVor+j3JwAAqtJvc8+OHTsMScby5csd4tatW+fQ/u677xqSjD179jg99w8//FAu9/8ecitw+VjOBbhCWCwW3XfffeXafXx8zJ9/+ukn/fe//1WvXr107tw5/ec//7ngef/0pz+padOm5n6vXr0kSV9//fUFj42IiHB4irtLly6yWq3msSUlJdq4caOGDRvm8CTetddeq8GDB1/w/NIvX19bsWKFrr/+eq1fv15///vfFR4eru7du+vLL78041atWqX27durXbt2+u9//2tuZX+F37Jly0VdDwAASfLw8NCIESOUlpbmsETKihUrFBgYqAEDBkj6JT+7u//yK3dJSYl+/PFHNW7cWG3bttW+ffsu6Zpr166VJD3yyCMO7ePHjy8X++v8X1RUpB9//FHXXnut/Pz8Lvm6v75+UFCQRo4cabY1aNBAjzzyiPLz87Vt2zaH+Mr+DtGxY0dlZGTonnvu0TfffKMFCxZo2LBhCgwM1CuvvGLGpaamKicnRyNHjnTI7R4eHurRowe5HQBQq6xatUq+vr669dZbHfJWeHi4GjdubOYtPz8/SdKaNWtUVFTkkmuTW4HLRxEduEL8f//f/1fhC8AOHjyoO++8U76+vrJarWrRooX5YpPc3NwLnrdVq1YO+2Ufhn+7ZtvFHFt2fNmx2dnZ+vnnn3XttdeWi6uozZmRI0dqx44dOnPmjDZs2KBRo0bps88+09ChQ823jB85ckQHDx5UixYtHLY2bdqYYwEA4FKUvTh0xYoVkqTvvvtOO3bs0IgRI+Th4SHpl+XE5s2bp+u
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1500x500 with 3 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"\n",
|
|||
|
"Статистика для Training Set:\n",
|
|||
|
"Среднее: 3021.2620418055135\n",
|
|||
|
"Медиана: 2090.0\n",
|
|||
|
"Стандартное отклонение: 2574.5120319534017\n",
|
|||
|
"\n",
|
|||
|
"Статистика для Validation Set:\n",
|
|||
|
"Среднее: 3012.331684168013\n",
|
|||
|
"Медиана: 2059.5\n",
|
|||
|
"Стандартное отклонение: 2587.9320915537055\n",
|
|||
|
"\n",
|
|||
|
"Статистика для Test Set:\n",
|
|||
|
"Среднее: 3020.2212237479807\n",
|
|||
|
"Медиана: 2097.5\n",
|
|||
|
"Стандартное отклонение: 2568.915633156046\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"..//static//csv//Diamonds Prices2022.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Выбор столбцов для анализа\n",
|
|||
|
"column1 = 'carat'\n",
|
|||
|
"column2 = 'price'\n",
|
|||
|
"\n",
|
|||
|
"# Функция для удаления выбросов\n",
|
|||
|
"def remove_outliers(df, column):\n",
|
|||
|
" Q1 = df[column].quantile(0.25)\n",
|
|||
|
" Q3 = df[column].quantile(0.75)\n",
|
|||
|
" IQR = Q3 - Q1\n",
|
|||
|
" lower_bound = Q1 - 1.5 * IQR\n",
|
|||
|
" upper_bound = Q3 + 1.5 * IQR\n",
|
|||
|
" return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
|
|||
|
"\n",
|
|||
|
"# Удаление выбросов для каждого столбца\n",
|
|||
|
"df_cleaned = df.copy()\n",
|
|||
|
"for column in [column1, column2]:\n",
|
|||
|
" df_cleaned = remove_outliers(df_cleaned, column)\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и тестовую выборки\n",
|
|||
|
"X = df_cleaned[[column1]]\n",
|
|||
|
"y = df_cleaned[column2]\n",
|
|||
|
"\n",
|
|||
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Разделение обучающей выборки на обучающую и контрольную выборки\n",
|
|||
|
"X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Вывод размеров выборок\n",
|
|||
|
"print(f\"Размер обучающей выборки: {len(X_train)}\")\n",
|
|||
|
"print(f\"Размер контрольной выборки: {len(X_val)}\")\n",
|
|||
|
"print(f\"Размер тестовой выборки: {len(X_test)}\")\n",
|
|||
|
"\n",
|
|||
|
"# Построение гистограмм для каждой выборки\n",
|
|||
|
"plt.figure(figsize=(15, 5))\n",
|
|||
|
"\n",
|
|||
|
"# Гистограмма для обучающей выборки\n",
|
|||
|
"plt.subplot(1, 3, 1)\n",
|
|||
|
"plt.hist(y_train, bins=30, alpha=0.5, label='Train')\n",
|
|||
|
"plt.xlabel('Price')\n",
|
|||
|
"plt.ylabel('Frequency')\n",
|
|||
|
"plt.title('Training Set')\n",
|
|||
|
"\n",
|
|||
|
"# Гистограмма для контрольной выборки\n",
|
|||
|
"plt.subplot(1, 3, 2)\n",
|
|||
|
"plt.hist(y_val, bins=30, alpha=0.5, label='Validation')\n",
|
|||
|
"plt.xlabel('Price')\n",
|
|||
|
"plt.ylabel('Frequency')\n",
|
|||
|
"plt.title('Validation Set')\n",
|
|||
|
"\n",
|
|||
|
"# Гистограмма для тестовой выборки\n",
|
|||
|
"plt.subplot(1, 3, 3)\n",
|
|||
|
"plt.hist(y_test, bins=30, alpha=0.5, label='Test')\n",
|
|||
|
"plt.xlabel('Price')\n",
|
|||
|
"plt.ylabel('Frequency')\n",
|
|||
|
"plt.title('Test Set')\n",
|
|||
|
"\n",
|
|||
|
"plt.tight_layout()\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Вычисление статистических показателей\n",
|
|||
|
"def print_stats(data, name):\n",
|
|||
|
" print(f\"\\nСтатистика для {name}:\")\n",
|
|||
|
" print(f\"Среднее: {data.mean()}\")\n",
|
|||
|
" print(f\"Медиана: {data.median()}\")\n",
|
|||
|
" print(f\"Стандартное отклонение: {data.std()}\")\n",
|
|||
|
"\n",
|
|||
|
"print_stats(y_train, 'Training Set')\n",
|
|||
|
"print_stats(y_val, 'Validation Set')\n",
|
|||
|
"print_stats(y_test, 'Test Set')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Мы вычислили среднее, медиану и стандартное отклонение для каждой выборки. Если эти показатели для всех выборок близки, это также указывает на сбалансированность выборок."
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "aimenv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.5"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|