1291 lines
265 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Laboratory work No. 2\n",
    "## The following datasets were selected:\n",
    " - ### 11. Diamond prices.\n",
    " - ### 18. Mobile device prices.\n",
    " - ### 19. Data on millionaires."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis of dataset No. 11.\n",
    "\n",
    "Source data: https://www.kaggle.com/datasets/nancyalaswad90/diamonds-prices\n",
    "\n",
    "**General description**: This dataset contains prices and attributes for about 53,940 round-cut diamonds (the CSV file loaded below has 53,943 rows). There are 10 characteristics (carat, cut, color, clarity, depth, table, price, x, y, and z). Most of the variables are numeric, but cut, color, and clarity are ordered categorical (factor) variables.\n",
    "\n",
    "**Problem domain**: Diamond price analysis and forecasting.\n",
    "\n",
    "**Objects of observation**: Diamonds, described by the attributes _Carat, Cut, Color, Clarity, Depth, Table, Price_.\n",
    "\n",
    "**Business goals**:\n",
    "- ***Diamond price prediction***: helps buyers and sellers navigate market prices and supports decisions about buying or selling diamonds,\n",
    "- ***Analysis of the factors that affect value***: understanding which characteristics of a diamond (for example, cut quality or color) have the greatest effect on its price can help in developing pricing strategies and improving the product range.\n",
    "\n",
    "**Technical project goals**:\n",
    "1. ***Diamond price prediction***: the input data are the diamond attributes; the target is _price_,\n",
    "2. ***Analysis of influencing factors***: the input data are the attributes describing the quality and characteristics of a diamond; the target is the influence of each attribute on the final price, which can be analyzed with regression methods and data visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Unnamed: 0', 'carat', 'cut', 'color', 'clarity', 'depth', 'table',\n",
      "       'price', 'x', 'y', 'z'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"./data/Diamonds-Prices.csv\")\n",
    "print(df.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Attributes: \n",
    "- Row index (Unnamed: 0), \n",
    "- Carat weight (carat), \n",
    "- Cut quality (cut), \n",
    "- Color (color), \n",
    "- Clarity (clarity), \n",
    "- Total depth percentage (depth), \n",
    "- Table width relative to the widest point (table), \n",
    "- Price (price), \n",
    "- Length in mm (x), \n",
    "- Width in mm (y), \n",
    "- Depth in mm (z). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check for outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "<base64-encoded PNG data omitted>",
      "text/plain": [
       "<Figure size 1000x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Load the data\n",
    "df = pd.read_csv(\"./data/Diamonds-Prices.csv\")\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.scatter(df[\"price\"], df[\"carat\"])\n",
    "plt.xlabel(\"Price\")\n",
    "plt.ylabel(\"Carat\")\n",
    "plt.title(\"Relationship between price and carat\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most extreme outlier, the largest carat value (about 5 carats), sits near the top of the price range (prices in this dataset do not exceed 18,823).\n",
    "We will use the interquartile range (IQR) method to remove outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Missing values per column:\n",
      "Unnamed: 0    0\n",
      "carat         0\n",
      "cut           0\n",
      "color         0\n",
      "clarity       0\n",
      "depth         0\n",
      "table         0\n",
      "price         0\n",
      "x             0\n",
      "y             0\n",
      "z             0\n",
      "dtype: int64\n",
      "\n",
      "Number of duplicates: 0\n",
      "\n",
      "Statistical summary of the data:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>carat</th>\n",
       "      <th>depth</th>\n",
       "      <th>table</th>\n",
       "      <th>price</th>\n",
       "      <th>x</th>\n",
       "      <th>y</th>\n",
       "      <th>z</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>53943.000000</td>\n",
       "      <td>53943.000000</td>\n",
       "      <td>53943.000000</td>\n",
       "      <td>53943.000000</td>\n",
       "      <td>53943.000000</td>\n",
       "      <td>53943.000000</td>\n",
       "      <td>53943.000000</td>\n",
       "      <td>53943.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>26972.000000</td>\n",
       "      <td>0.797935</td>\n",
       "      <td>61.749322</td>\n",
       "      <td>57.457251</td>\n",
       "      <td>3932.734294</td>\n",
       "      <td>5.731158</td>\n",
       "      <td>5.734526</td>\n",
       "      <td>3.538730</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>15572.147122</td>\n",
       "      <td>0.473999</td>\n",
       "      <td>1.432626</td>\n",
       "      <td>2.234549</td>\n",
       "      <td>3989.338447</td>\n",
       "      <td>1.121730</td>\n",
       "      <td>1.142103</td>\n",
       "      <td>0.705679</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.200000</td>\n",
       "      <td>43.000000</td>\n",
       "      <td>43.000000</td>\n",
       "      <td>326.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>13486.500000</td>\n",
       "      <td>0.400000</td>\n",
       "      <td>61.000000</td>\n",
       "      <td>56.000000</td>\n",
       "      <td>950.000000</td>\n",
       "      <td>4.710000</td>\n",
       "      <td>4.720000</td>\n",
       "      <td>2.910000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>26972.000000</td>\n",
       "      <td>0.700000</td>\n",
       "      <td>61.800000</td>\n",
       "      <td>57.000000</td>\n",
       "      <td>2401.000000</td>\n",
       "      <td>5.700000</td>\n",
       "      <td>5.710000</td>\n",
       "      <td>3.530000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>40457.500000</td>\n",
       "      <td>1.040000</td>\n",
       "      <td>62.500000</td>\n",
       "      <td>59.000000</td>\n",
       "      <td>5324.000000</td>\n",
       "      <td>6.540000</td>\n",
       "      <td>6.540000</td>\n",
       "      <td>4.040000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>53943.000000</td>\n",
       "      <td>5.010000</td>\n",
       "      <td>79.000000</td>\n",
       "      <td>95.000000</td>\n",
       "      <td>18823.000000</td>\n",
       "      <td>10.740000</td>\n",
       "      <td>58.900000</td>\n",
       "      <td>31.800000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Unnamed: 0         carat         depth         table         price  \\\n",
       "count  53943.000000  53943.000000  53943.000000  53943.000000  53943.000000   \n",
       "mean   26972.000000      0.797935     61.749322     57.457251   3932.734294   \n",
       "std    15572.147122      0.473999      1.432626      2.234549   3989.338447   \n",
       "min        1.000000      0.200000     43.000000     43.000000    326.000000   \n",
       "25%    13486.500000      0.400000     61.000000     56.000000    950.000000   \n",
       "50%    26972.000000      0.700000     61.800000     57.000000   2401.000000   \n",
       "75%    40457.500000      1.040000     62.500000     59.000000   5324.000000   \n",
       "max    53943.000000      5.010000     79.000000     95.000000  18823.000000   \n",
       "\n",
       "                  x             y             z  \n",
       "count  53943.000000  53943.000000  53943.000000  \n",
       "mean       5.731158      5.734526      3.538730  \n",
       "std        1.121730      1.142103      0.705679  \n",
       "min        0.000000      0.000000      0.000000  \n",
       "25%        4.710000      4.720000      2.910000  \n",
       "50%        5.700000      5.710000      3.530000  \n",
       "75%        6.540000      6.540000      4.040000  \n",
       "max       10.740000     58.900000     31.800000  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "null_values_diamond = df.isnull().sum()\n",
    "print(\"Missing values per column:\")\n",
    "print(null_values_diamond)\n",
    "\n",
    "duplicates = df.duplicated().sum()\n",
    "print(f\"\\nNumber of duplicates: {duplicates}\")\n",
    "\n",
    "print(\"\\nStatistical summary of the data:\")\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Skewness of column 'Unnamed: 0': 0.0\n",
      "\n",
      "Skewness of column 'carat': 1.1167052359880187\n",
      "\n",
      "Skewness of column 'depth': -0.08218721424717913\n",
      "\n",
      "Skewness of column 'table': 0.7968359775412807\n",
      "\n",
      "Skewness of column 'price': 1.6184763222032386\n",
      "\n",
      "Skewness of column 'x': 0.37868453466912216\n",
      "\n",
      "Skewness of column 'y': 2.4342330799873775\n",
      "\n",
      "Skewness of column 'z': 1.5224810204974413\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "for column in df.select_dtypes(include=[np.number]).columns:\n",
    "    asymmetry = df[column].skew()\n",
    "    print(f\"\\nSkewness of column '{column}': {asymmetry}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see outliers. Let's clean the noise from the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "<base64-encoded PNG data omitted>",
      "text/plain": [
       "<Figure size 1000x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "image/png": "<base64-encoded PNG data omitted>",
      "text/plain": [
       "<Figure size 1000x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.figure(figsize=(10, 6))\n",
    "plt.scatter(df[\"price\"], df[\"carat\"])\n",
    "plt.xlabel(\"Price\")\n",
    "plt.ylabel(\"Carat\")\n",
    "plt.xticks(rotation=45)\n",
    "plt.title(\"Scatter plot before cleaning\")\n",
    "plt.show()\n",
    "\n",
    "\n",
    "# Columns to analyze\n",
    "column1 = \"carat\"\n",
    "column2 = \"price\"\n",
    "# Remove the rows of one column that fall outside the 1.5 * IQR fences\n",
    "def remove_outliers(df, column):\n",
    "    Q1 = df[column].quantile(0.25)\n",
    "    Q3 = df[column].quantile(0.75)\n",
    "    IQR = Q3 - Q1\n",
    "    lower_bound = Q1 - 1.5 * IQR\n",
    "    upper_bound = Q3 + 1.5 * IQR\n",
    "    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
    "\n",
    "\n",
    "# Remove outliers column by column\n",
    "df_cleaned = df.copy()\n",
    "for column in [column1, column2]:\n",
    "    df_cleaned = remove_outliers(df_cleaned, column)\n",
    "\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "# Price on x and carat on y, so the data match the axis labels and the plot above\n",
    "plt.scatter(df_cleaned[column2], df_cleaned[column1])\n",
    "plt.xlabel(\"Price\")\n",
    "plt.ylabel(\"Carat\")\n",
    "plt.xticks(rotation=45)\n",
    "plt.title(\"Scatter plot after cleaning\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of rows before outlier removal: 53943\n",
      "Number of rows after outlier removal: 49517\n"
     ]
    }
   ],
   "source": [
    "# Print the number of rows before and after outlier removal\n",
    "print(f\"Number of rows before outlier removal: {len(df)}\")\n",
    "print(f\"Number of rows after outlier removal: {len(df_cleaned)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, create the training, validation, and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set size: 32365\n",
      "Validation set size: 10789\n",
      "Test set size: 10789\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Note: re-reading the CSV means the split below uses the raw data\n",
    "# (53943 rows), not df_cleaned with the outliers removed\n",
    "df = pd.read_csv(\"./data/Diamonds-Prices.csv\")\n",
    "\n",
    "# Select the features and the target variable\n",
    "X = df.drop(\"price\", axis=1)  # All columns except price\n",
    "y = df[\"price\"]\n",
    "\n",
    "# Split the data into a training part and a remainder (validation + test)\n",
    "X_train, X_temp, y_train, y_temp = train_test_split(\n",
    "    X, y, test_size=0.4, random_state=42\n",
    ")\n",
    "\n",
    "# Split the remainder into validation and test sets\n",
    "X_val, X_test, y_val, y_test = train_test_split(\n",
    "    X_temp, y_temp, test_size=0.5, random_state=42\n",
    ")\n",
    "\n",
    "# Print the set sizes\n",
    "print(f\"Training set size: {X_train.shape[0]}\")\n",
    "print(f\"Validation set size: {X_val.shape[0]}\")\n",
    "print(f\"Test set size: {X_test.shape[0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analyze how balanced the sets are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Price distribution in the training set:\n",
      "price\n",
      "327      1\n",
      "334      1\n",
      "336      1\n",
      "337      1\n",
      "338      1\n",
      "        ..\n",
      "18791    1\n",
      "18795    2\n",
      "18797    1\n",
      "18804    1\n",
      "18806    1\n",
      "Name: count, Length: 9476, dtype: int64\n",
      "Percentage of positive values: 100.00%\n",
      "Percentage of negative values: 0.00%\n",
      "\n",
      "Data augmentation is needed to balance the classes.\n",
      "\n",
      "Price distribution in the validation set:\n",
      "price\n",
      "326      2\n",
      "340      1\n",
      "344      1\n",
      "354      1\n",
      "357      1\n",
      "        ..\n",
      "18781    1\n",
      "18784    1\n",
      "18791    1\n",
      "18803    1\n",
      "18823    1\n",
      "Name: count, Length: 5389, dtype: int64\n",
      "Percentage of positive values: 100.00%\n",
      "Percentage of negative values: 0.00%\n",
      "\n",
      "Data augmentation is needed to balance the classes.\n",
      "\n",
      "Price distribution in the test set:\n",
      "price\n",
      "335      1\n",
      "336      1\n",
      "337      1\n",
      "351      1\n",
      "353      1\n",
      "        ..\n",
      "18766    1\n",
      "18768    1\n",
      "18780    1\n",
      "18788    1\n",
      "18818    1\n",
      "Name: count, Length: 5308, dtype: int64\n",
      "Percentage of positive values: 100.00%\n",
      "Percentage of negative values: 0.00%\n",
      "\n",
      "Data augmentation is needed to balance the classes.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "def analyze_distribution(data, title):\n",
    "    print(f\"Price distribution in {title}:\")\n",
    "    # price is a continuous target, so this counts rows per distinct price\n",
    "    distribution = data.value_counts().sort_index()\n",
    "    print(distribution)\n",
    "    total = len(data)\n",
    "    positive_count = (data > 0).sum()\n",
    "    negative_count = (data < 0).sum()\n",
    "    positive_percent = (positive_count / total) * 100\n",
    "    negative_percent = (negative_count / total) * 100\n",
    "    print(f\"Percentage of positive values: {positive_percent:.2f}%\")\n",
    "    print(f\"Percentage of negative values: {negative_percent:.2f}%\")\n",
    "    # This message is printed unconditionally, whatever the distribution is\n",
    "    print(\"\\nData augmentation is needed to balance the classes.\\n\")\n",
    "\n",
    "\n",
    "# Analyze the distribution of each set\n",
    "analyze_distribution(y_train, \"the training set\")\n",
    "analyze_distribution(y_val, \"the validation set\")\n",
    "analyze_distribution(y_test, \"the test set\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply data augmentation methods."
   ]
  },
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 18,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Price distribution in the training set after oversampling:\n",
"price\n",
"327 85\n",
"334 85\n",
"336 85\n",
"337 85\n",
"338 85\n",
" ..\n",
"18791 85\n",
"18795 85\n",
"18797 85\n",
"18804 85\n",
"18806 85\n",
"Name: count, Length: 9476, dtype: int64\n",
"Percentage of positive values: 100.00%\n",
"Percentage of negative values: 0.00%\n",
"\n",
"Data augmentation is needed to balance the classes.\n",
"\n",
"Price distribution in the validation set:\n",
"price\n",
"326 2\n",
"340 1\n",
"344 1\n",
"354 1\n",
"357 1\n",
" ..\n",
"18781 1\n",
"18784 1\n",
"18791 1\n",
"18803 1\n",
"18823 1\n",
"Name: count, Length: 5389, dtype: int64\n",
"Percentage of positive values: 100.00%\n",
"Percentage of negative values: 0.00%\n",
"\n",
"Data augmentation is needed to balance the classes.\n",
"\n",
"Price distribution in the test set:\n",
"price\n",
"335 1\n",
"336 1\n",
"337 1\n",
"351 1\n",
"353 1\n",
" ..\n",
"18766 1\n",
"18768 1\n",
"18780 1\n",
"18788 1\n",
"18818 1\n",
"Name: count, Length: 5308, dtype: int64\n",
"Percentage of positive values: 100.00%\n",
"Percentage of negative values: 0.00%\n",
"\n",
"Data augmentation is needed to balance the classes.\n",
"\n"
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"\n",
"# Apply oversampling to the training set only\n",
"oversampler = RandomOverSampler(random_state=42)\n",
"X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)\n",
"\n",
"# Analyze the distribution in each split\n",
"analyze_distribution(y_train_resampled, \"training set after oversampling\")\n",
"analyze_distribution(y_val, \"validation set\")\n",
"analyze_distribution(y_test, \"test set\")"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"## Analysis of dataset No. 18.\n",
"\n",
"Source data: https://www.kaggle.com/datasets/dewangmoghe/mobile-phone-price-prediction\n",
"\n",
"**General description**: This dataset contains prices and attributes for 1369 mobile phones of various configurations and manufacturers. It has 17 features: model name, rating (min 0, max 5), spec-based score (min 0, max 100), dual-SIM and network technology support (3G, 4G, 5G, VoLTE), amount of RAM, battery specifications, display information, camera specifications, external memory support, Android version, price, manufacturer, built-in memory, fast-charging support, screen resolution, processor type, and processor name.\n",
"\n",
"**Problem domain**: Financial analysis and price prediction for mobile phones.\n",
"\n",
"**Objects of observation**: mobile phones, described by the attributes _Name, Rating, Spec_score, No_of_sim, Ram, Battery, Display, Camera, External_Memory, Android_version, Price, company, Inbuilt_memory, fast_charging, Screen_resolution, Processor, Processor_name_.\n",
"\n",
"**Business goals**:\n",
"- ***Predicting mobile phone prices from the spec score***.\n",
"- ***Predicting the spec score from the manufacturer and price***.\n",
"\n",
"**Technical project goals**:\n",
"1. ***Predicting phone prices***: Input data - _spec score_; target feature - _price_,\n",
"2. ***Predicting the spec score***: Input data - _manufacturer and price_; target feature - _spec score_."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"ename": "FileNotFoundError",
|
|||
|
"evalue": "[Errno 2] No such file or directory: '../data/mobile-phone-price-prediction.csv'",
|
|||
|
"output_type": "error",
|
|||
|
"traceback": [
"---------------------------------------------------------------------------",
"FileNotFoundError Traceback (most recent call last)",
"Cell In[19], line 3\n 1 import pandas as pd\n----> 3 df = pd.read_csv(\"../data/mobile-phone-price-prediction.csv\")\n 4 print(df.columns)\n",
"File d:\\Users\\Leo\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\mai-S9i2J6c7-py3.12\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1026, in read_csv\n-> 1026 return _read(filepath_or_buffer, kwds)\n",
"File d:\\Users\\Leo\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\mai-S9i2J6c7-py3.12\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:620, in _read\n--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)\n",
"File d:\\Users\\Leo\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\mai-S9i2J6c7-py3.12\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1620, in TextFileReader.__init__\n-> 1620 self._engine = self._make_engine(f, self.engine)\n",
"File d:\\Users\\Leo\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\mai-S9i2J6c7-py3.12\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1880, in TextFileReader._make_engine\n-> 1880 self.handles = get_handle(\n",
"File d:\\Users\\Leo\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\mai-S9i2J6c7-py3.12\\Lib\\site-packages\\pandas\\io\\common.py:873, in get_handle\n--> 873 handle = open(\n",
"FileNotFoundError: [Errno 2] No such file or directory: '../data/mobile-phone-price-prediction.csv'"
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"\n",
|
|||
|
"df = pd.read_csv(\"../data/mobile-phone-price-prediction.csv\")\n",
|
|||
|
"print(df.columns)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Attributes:\n",
"- Unnamed index column (Unnamed: 0),\n",
"- Phone name (Name),\n",
"- Rating (Rating),\n",
"- Spec-based score (Spec_score),\n",
"- SIM and network technology support (No_of_sim),\n",
"- Amount of RAM (Ram),\n",
"- Battery information (Battery),\n",
"- Display information (Display),\n",
"- Camera information (Camera),\n",
"- External memory information (External_Memory),\n",
"- Android version (Android_version),\n",
"- Price (Price),\n",
"- Manufacturer (company),\n",
"- Built-in memory information (Inbuilt_memory),\n",
"- Fast charging (fast_charging),\n",
"- Screen resolution (Screen_resolution),\n",
"- Processor type (Processor),\n",
"- Processor name (Processor_name)."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"plt.figure(figsize=(14, 6))\n",
"\n",
"# Spec score per manufacturer (names are lower-cased so that case variants merge)\n",
"plt.scatter(df[\"company\"].str.lower(), df[\"Spec_score\"])\n",
"plt.xlabel(\"Manufacturer\")\n",
"plt.ylabel(\"Spec score\")\n",
"plt.xticks(rotation=45)\n",
"plt.title(\"Diagram 1\")\n",
"plt.show()"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"The attributes are related to one another. For example, Diagram 1 shows the relationship between the manufacturer and the spec score."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Next, check the data for missing values and duplicates before looking for outliers."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"null_values = df.isnull().sum()\n",
"print(\"Missing values per column:\")\n",
"print(null_values)\n",
"\n",
"duplicates = df.duplicated().sum()\n",
"print(f\"\\nNumber of duplicates: {duplicates}\")\n",
"\n",
"print(\"\\nStatistical summary of the data:\")\n",
"df.describe()"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"The data contains missing values but no duplicates. Drop the rows with missing values."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"def drop_missing_values(dataframe, name):\n",
"    # Drop every row with at least one missing value and report how many were removed\n",
"    before_shape = dataframe.shape\n",
"    cleaned_dataframe = dataframe.dropna()\n",
"    after_shape = cleaned_dataframe.shape\n",
"    print(\n",
"        f\"Removed {before_shape[0] - after_shape[0]} rows with missing values from the '{name}' dataset.\"\n",
"    )\n",
"    return cleaned_dataframe\n",
"\n",
"\n",
"cleaned_df = drop_missing_values(df, \"Phones\")"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Compute the skewness coefficient of each numeric column."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"import numpy as np\n",
"\n",
"# Skewness of every numeric column (0 corresponds to a symmetric distribution)\n",
"for column in df.select_dtypes(include=[np.number]).columns:\n",
"    asymmetry = df[column].skew()\n",
"    print(f\"\\nSkewness coefficient for column '{column}': {asymmetry}\")"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"The outliers are minor.\n",
"\n",
"Clean the data of noise: Spec_score values farther than 1.5 IQR from the quartiles are replaced with the median."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(cleaned_df[\"company\"].str.lower(), cleaned_df[\"Spec_score\"])\n",
"plt.xlabel(\"Manufacturer\")\n",
"plt.ylabel(\"Spec score\")\n",
"plt.xticks(rotation=45)\n",
"plt.title(\"Scatter plot before cleaning\")\n",
"plt.show()\n",
"\n",
"# Interquartile range of the spec score\n",
"Q1 = cleaned_df[\"Spec_score\"].quantile(0.25)\n",
"Q3 = cleaned_df[\"Spec_score\"].quantile(0.75)\n",
"\n",
"IQR = Q3 - Q1\n",
"\n",
"# Values farther than 1.5 * IQR from the quartiles are treated as outliers\n",
"threshold = 1.5 * IQR\n",
"lower_bound = Q1 - threshold\n",
"upper_bound = Q3 + threshold\n",
"\n",
"outliers = (cleaned_df[\"Spec_score\"] < lower_bound) | (\n",
"    cleaned_df[\"Spec_score\"] > upper_bound\n",
")\n",
"\n",
"print(\"Outliers in the dataset:\")\n",
"print(cleaned_df[outliers])\n",
"\n",
"# Replace the outliers with the median spec score\n",
"median_score = cleaned_df[\"Spec_score\"].median()\n",
"cleaned_df.loc[outliers, \"Spec_score\"] = median_score\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(cleaned_df[\"company\"].str.lower(), cleaned_df[\"Spec_score\"])\n",
"plt.xlabel(\"Manufacturer\")\n",
"plt.ylabel(\"Spec score\")\n",
"plt.xticks(rotation=45)\n",
"plt.title(\"Scatter plot after cleaning\")\n",
"plt.show()"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Split the data into training, validation, and test sets."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"# Hold out 20% for testing, then 25% of the remainder for validation (60/20/20 overall)\n",
"train_df, test_df = train_test_split(cleaned_df, test_size=0.2, random_state=42)\n",
"\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_df))\n",
"print(\"Validation set size:\", len(val_df))\n",
"print(\"Test set size:\", len(test_df))\n",
"\n",
"print()\n",
"\n",
"\n",
"def check_balance(df, name):\n",
"    counts = df[\"Spec_score\"].value_counts()\n",
"    print(f\"Spec score distribution in the {name}:\")\n",
"    print(counts)\n",
"    print()\n",
"\n",
"\n",
"check_balance(train_df, \"training set\")\n",
"check_balance(val_df, \"validation set\")\n",
"check_balance(test_df, \"test set\")"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Oversampling and undersampling"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"def oversample(df, target_column):\n",
"    # Randomly duplicate rows of rarer target values until all values are equally frequent\n",
"    X = df.drop(target_column, axis=1)\n",
"    y = df[target_column]\n",
"\n",
"    oversampler = RandomOverSampler(random_state=42)\n",
"    x_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([x_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"\n",
"def undersample(df, target_column):\n",
"    # Randomly drop rows of more frequent target values down to the rarest value's count\n",
"    X = df.drop(target_column, axis=1)\n",
"    y = df[target_column]\n",
"\n",
"    undersampler = RandomUnderSampler(random_state=42)\n",
"    x_resampled, y_resampled = undersampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([x_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"train_df_oversampled = oversample(train_df, \"Spec_score\")\n",
"val_df_oversampled = oversample(val_df, \"Spec_score\")\n",
"test_df_oversampled = oversample(test_df, \"Spec_score\")\n",
"\n",
"train_df_undersampled = undersample(train_df, \"Spec_score\")\n",
"val_df_undersampled = undersample(val_df, \"Spec_score\")\n",
"test_df_undersampled = undersample(test_df, \"Spec_score\")\n",
"\n",
"print(\"Oversampling:\")\n",
"check_balance(train_df_oversampled, \"training set\")\n",
"check_balance(val_df_oversampled, \"validation set\")\n",
"check_balance(test_df_oversampled, \"test set\")\n",
"\n",
"print(\"Undersampling:\")\n",
"check_balance(train_df_undersampled, \"training set\")\n",
"check_balance(val_df_undersampled, \"validation set\")\n",
"check_balance(test_df_undersampled, \"test set\")"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"## Analysis of dataset No. 19.\n",
"\n",
"Source data: https://www.kaggle.com/datasets/surajjha101/forbes-billionaires-data-preprocessed\n",
"\n",
"**General description**: \"The World's Billionaires\" is an annual ranking of the documented net worth of the world's wealthiest billionaires, compiled and published every March by the American business magazine Forbes. The list was first published in March 1987. The total net worth of each person on the list is estimated and reported in US dollars, based on their documented assets and accounting for debt and other factors. Royalty and dictators whose wealth comes from their positions are excluded from these lists. The ranking is an index of the wealthiest documented individuals and excludes anyone whose wealth cannot be fully ascertained.\n",
"\n",
"**Problem domain**: Analysis of the net worth, age, and sources of wealth of the richest people in the world.\n",
"\n",
"**Objects of observation**: The world's richest people represented in the dataset.\n",
"\n",
"**Relationships between objects**: the following relationships can be identified:\n",
"- Between age and net worth\n",
"- Between country of residence and source of income\n",
"- Between industry and level of wealth.\n",
"\n",
"**Business goals**:\n",
"- ***Understanding success factors***: Investigate which factors (age, country, source of income) are associated with large fortunes. This may help new entrepreneurs and startups learn from the experience of successful people.\n",
"- ***Analyzing wealth trends***: Understand how sources of wealth change over time and how this relates to economic conditions in different countries. This may help investors and analysts identify which sectors could be the most promising for future investment.\n",
"\n",
"**Technical project goals**:\n",
"1. ***Studying success factors***: Input data - data on the richest people (age, net worth, industry); target - identifying the factors that contribute to accumulating wealth.\n",
"2. ***Analyzing wealth trends***: Input data - data on the richest people (age, country, source of wealth); target - whether the source of wealth depends on the country."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"\n",
|
|||
|
"df = pd.read_csv(\"../data/Forbes Billionaires.csv\")\n",
|
|||
|
"print(df.columns)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Attributes:\n",
"- Rank (Rank),\n",
"- Name (Name),\n",
"- Net worth (Networth),\n",
"- Age (Age),\n",
"- Country (Country),\n",
"- Source of income (Source),\n",
"- Industry (Industry)."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Look at the relationships between the attributes."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Draw all four panels on one 2x2 figure and show it once at the end\n",
"plt.figure(figsize=(16, 12))\n",
"\n",
"# Relationship between age and net worth\n",
"plt.subplot(2, 2, 1)\n",
"sns.scatterplot(data=df, x=\"Age\", y=\"Networth\")\n",
"plt.title(\"Age vs. net worth\")\n",
"plt.xlabel(\"Age\")\n",
"plt.ylabel(\"Net worth (billions)\")\n",
"\n",
"# Relationship between country of residence and net worth (top 10 countries)\n",
"plt.subplot(2, 2, 2)\n",
"top_countries = df[\"Country\"].value_counts().index[:10]\n",
"sns.boxplot(data=df[df[\"Country\"].isin(top_countries)], x=\"Country\", y=\"Networth\")\n",
"plt.title(\"Country of residence vs. net worth\")\n",
"plt.xticks(rotation=90)\n",
"plt.xlabel(\"Country\")\n",
"plt.ylabel(\"Net worth (billions)\")\n",
"\n",
"# Relationship between source of income and net worth (top 10 sources)\n",
"plt.subplot(2, 2, 3)\n",
"top_sources = df[\"Source\"].value_counts().index[:10]\n",
"sns.boxplot(data=df[df[\"Source\"].isin(top_sources)], x=\"Source\", y=\"Networth\")\n",
"plt.title(\"Source of income vs. net worth\")\n",
"plt.xticks(rotation=90)\n",
"plt.xlabel(\"Source of income\")\n",
"plt.ylabel(\"Net worth (billions)\")\n",
"\n",
"# Relationship between industry and net worth (top 10 industries)\n",
"plt.subplot(2, 2, 4)\n",
"top_industries = df[\"Industry\"].value_counts().index[:10]\n",
"sns.boxplot(data=df[df[\"Industry\"].isin(top_industries)], x=\"Industry\", y=\"Networth\")\n",
"plt.title(\"Industry vs. net worth\")\n",
"plt.xticks(rotation=90)\n",
"plt.xlabel(\"Industry\")\n",
"plt.ylabel(\"Net worth (billions)\")\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Next, check the data for missing values and outliers."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"missing_values = df.isnull().sum()\n",
"print(\"Missing values in the data:\\n\", missing_values)"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"No missing values were found.\n"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"fig, axs = plt.subplots(1, 2, figsize=(15, 5))\n",
"\n",
"sns.boxplot(data=df, x='Networth', ax=axs[0])\n",
"axs[0].set_title(\"Outliers in net worth\")\n",
"\n",
"sns.boxplot(data=df, x=\"Age\", ax=axs[1])\n",
"axs[1].set_title(\"Outliers in age\")\n",
"\n",
"plt.show()\n",
"print(\"Data size before removing outliers: \", df.shape)"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"No outliers are apparent here; the values fall within an acceptable range."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"# Histogram of the net worth distribution\n",
"plt.figure(figsize=(12, 6))\n",
"sns.histplot(df['Networth'], bins=10, kde=True)\n",
"plt.title(\"Histogram of the net worth distribution\")\n",
"plt.xlabel(\"Net worth (billions of US dollars)\")\n",
"plt.ylabel(\"Frequency\")\n",
"plt.grid(True)\n",
"plt.show()"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"The net worth distribution is strongly right-skewed: most values are concentrated in the lower range, with a small number of very high values. Most people on the list have a relatively low net worth for this ranking, while a few at the very top have an extremely high one."
]
|
|||
|
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. Bar chart of counts by country\n",
"plt.figure(figsize=(12, 6))\n",
"sns.countplot(data=df, x=\"Country\", order=df[\"Country\"].value_counts().index)\n",
"plt.title(\"Number of people by country\")\n",
"plt.xlabel(\"Country\")\n",
"plt.ylabel(\"Count\")\n",
"plt.xticks(rotation=45)\n",
"plt.show()\n",
"\n",
"# 2. Bar chart of counts by industry\n",
"plt.figure(figsize=(12, 6))\n",
"sns.countplot(data=df, x=\"Industry\", order=df[\"Industry\"].value_counts().index)\n",
"plt.title(\"Number of people by industry\")\n",
"plt.xlabel(\"Industry\")\n",
"plt.ylabel(\"Count\")\n",
"plt.xticks(rotation=45)\n",
"plt.show()\n",
"\n",
"# 3. Histogram of the age distribution\n",
"plt.figure(figsize=(10, 5))\n",
"sns.histplot(df[\"Age\"], bins=30, kde=True)\n",
"plt.title(\"Age distribution\")\n",
"plt.xlabel(\"Age\")\n",
"plt.ylabel(\"Frequency\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plots show the variety of countries and industries represented in the dataset, which indicates that the data covers many regions and fields of activity.\n",
"\n",
"Next, we split the dataset into training, validation, and test sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Separate the features (X) from the target (y)\n",
"X = df.drop(columns=[\"Networth\"])\n",
"y = df[\"Networth\"]\n",
"\n",
"# Split into training (60%), validation (20%), and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(\n",
"    X, y, test_size=0.4, random_state=42\n",
")\n",
"X_val, X_test, y_val, y_test = train_test_split(\n",
"    X_temp, y_temp, test_size=0.5, random_state=42\n",
")\n",
"\n",
"# Check the size of each split\n",
"(X_train.shape, X_val.shape, X_test.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the distribution of the target across the splits\n",
"train_dist = y_train.describe()\n",
"val_dist = y_val.describe()\n",
"test_dist = y_test.describe()\n",
"\n",
"train_dist, val_dist, test_dist"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Note: imblearn samplers treat y as class labels, so a continuous\n",
"# Networth target may need to be binned into classes first.\n",
"oversampler = RandomOverSampler(random_state=12)\n",
"X_train_over, y_train_over = oversampler.fit_resample(X_train, y_train)\n",
"\n",
"undersampler = RandomUnderSampler(random_state=12)\n",
"X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)\n",
"\n",
"print(\"Shapes after oversampling:\", X_train_over.shape, y_train_over.shape)\n",
"print(\"Shapes after undersampling:\", X_train_under.shape, y_train_under.shape)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}