AIM-PIbd-31-LOBASHOV-I-D/lab_3/lab_3.ipynb

2024-11-15 21:58:38 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Variant 19:* Data on millionaires\n",
"- Define the business goals and the technical project goals"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry'], dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"df = pd.read_csv(\"C:/Users/goldfest/Desktop/3 курс/MII/AIM-PIbd-31-LOBASHOV-I-D/static/csv/Forbes Billionaires.csv\")\n",
"print(df.columns)"
]
},
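{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column listing above shows `'Rank '` with a trailing space, so `df['Rank']` would raise a `KeyError`. A minimal sketch, assuming a hypothetical stand-in frame `toy` (not the Forbes data), of stripping whitespace from the column labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in frame; only the column labels matter here.\n",
"toy = pd.DataFrame({'Rank ': [1, 2], 'Name': ['A', 'B'], 'Networth': [219.0, 171.0]})\n",
"\n",
"# Strip leading and trailing whitespace from every column label.\n",
"toy.columns = toy.columns.str.strip()\n",
"print(list(toy.columns))"
]
},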
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Business goals:\n",
"\n",
"1. Predict potential millionaires based on data analysis.\n",
"2. Assess the factors that influence reaching millionaire status."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Technical project goals:\n",
"\n",
"1. Build a machine-learning classification model that predicts the probability of reaching millionaire status from the available data on millionaires' characteristics.\n",
"2. Analyze the data to identify the key factors that influence reaching millionaire status."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Rank</th>\n",
" <th>Name</th>\n",
" <th>Networth</th>\n",
" <th>Age</th>\n",
" <th>Country</th>\n",
" <th>Source</th>\n",
" <th>Industry</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Elon Musk</td>\n",
" <td>219.0</td>\n",
" <td>50</td>\n",
" <td>United States</td>\n",
" <td>Tesla, SpaceX</td>\n",
" <td>Automotive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Jeff Bezos</td>\n",
" <td>171.0</td>\n",
" <td>58</td>\n",
" <td>United States</td>\n",
" <td>Amazon</td>\n",
" <td>Technology</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Bernard Arnault &amp; family</td>\n",
" <td>158.0</td>\n",
" <td>73</td>\n",
" <td>France</td>\n",
" <td>LVMH</td>\n",
" <td>Fashion &amp; Retail</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Bill Gates</td>\n",
" <td>129.0</td>\n",
" <td>66</td>\n",
" <td>United States</td>\n",
" <td>Microsoft</td>\n",
" <td>Technology</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Warren Buffett</td>\n",
" <td>118.0</td>\n",
" <td>91</td>\n",
" <td>United States</td>\n",
" <td>Berkshire Hathaway</td>\n",
" <td>Finance &amp; Investments</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Rank Name Networth Age Country \\\n",
"0 1 Elon Musk 219.0 50 United States \n",
"1 2 Jeff Bezos 171.0 58 United States \n",
"2 3 Bernard Arnault & family 158.0 73 France \n",
"3 4 Bill Gates 129.0 66 United States \n",
"4 5 Warren Buffett 118.0 91 United States \n",
"\n",
" Source Industry \n",
"0 Tesla, SpaceX Automotive \n",
"1 Amazon Technology \n",
"2 LVMH Fashion & Retail \n",
"3 Microsoft Technology \n",
"4 Berkshire Hathaway Finance & Investments "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rank 0\n",
"Name 0\n",
"Networth 0\n",
"Age 0\n",
"Country 0\n",
"Source 0\n",
"Industry 0\n",
"dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"Rank False\n",
"Name False\n",
"Networth False\n",
"Age False\n",
"Country False\n",
"Source False\n",
"Industry False\n",
"dtype: bool"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Percentage of missing values per feature\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f'{i} missing values: {null_rate:.2f}%')\n",
"\n",
"# Count missing values per column\n",
"print(df.isnull().sum())\n",
"\n",
"# Flag columns that contain at least one missing value\n",
"df.isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No column contains missing values, so no imputation is needed."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 2080\n",
"Validation set size: 520\n",
"Test set size: 520\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split the data into training and test sets (80% training, 20% test)\n",
"train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# Split the data into training and validation sets (80% training, 20% validation).\n",
"# Caution: this call re-splits the full df with the same test_size and random_state,\n",
"# so val_data contains exactly the same rows as test_data.\n",
"train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"print(\"Training set size: \", len(train_data))\n",
"print(\"Validation set size: \", len(val_data))\n",
"print(\"Test set size: \", len(test_data))"
]
},
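{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above calls `train_test_split(df, ...)` twice with identical arguments, so the validation and test sets come out the same (520 rows each). A minimal sketch of one common alternative, assuming a 60/20/20 split and a synthetic frame `toy` (both are assumptions, not the author's choice): hold out the test set first, then carve the validation set out of the remaining rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic stand-in with as many rows as the split above (2080 + 520).\n",
"toy = pd.DataFrame({'Networth': range(2600)})\n",
"\n",
"# Step 1: hold out 20% of all rows as the test set.\n",
"train_val, test_part = train_test_split(toy, test_size=0.2, random_state=42)\n",
"# Step 2: take 25% of the remainder (20% of the total) as the validation set.\n",
"train_part, val_part = train_test_split(train_val, test_size=0.25, random_state=42)\n",
"\n",
"print(len(train_part), len(val_part), len(test_part))\n",
"# The validation and test sets must not share any rows.\n",
"assert set(val_part.index).isdisjoint(test_part.index)"
]
},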
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean Networth in the training set: 5.05858173076923\n",
"Mean Networth in the validation set: 4.069423076923076\n",
"Mean Networth in the test set: 4.069423076923076\n"
]
}
],
"source": [
"# Compare the distribution of the target variable (Networth) across the splits\n",
"# Visualize the target distribution in each split (histogram with a KDE curve)\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_networth_distribution(data, title):\n",
"    sns.histplot(data['Networth'], kde=True)\n",
"    plt.title(title)\n",
"    plt.xlabel('Networth')\n",
"    plt.ylabel('Frequency')\n",
"    plt.show()\n",
"\n",
"plot_networth_distribution(train_data, 'Networth distribution in the training set')\n",
"plot_networth_distribution(val_data, 'Networth distribution in the validation set')\n",
"plot_networth_distribution(test_data, 'Networth distribution in the test set')\n",
"\n",
"# Compare the mean of the target variable (Networth) across the splits\n",
"print(\"Mean Networth in the training set: \", train_data['Networth'].mean())\n",
"print(\"Mean Networth in the validation set: \", val_data['Networth'].mean())\n",
"print(\"Mean Networth in the test set: \", test_data['Networth'].mean())\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size after standardization: 2080\n"
]
}
],
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Visualize the Networth distribution in the training set\n",
"sns.histplot(train_data['Networth'], kde=True)\n",
"plt.title('Networth distribution in the training set')\n",
"plt.xlabel('Networth')\n",
"plt.ylabel('Frequency')\n",
"plt.show()\n",
"\n",
"# Standardize the data (StandardScaler rescales to zero mean and unit variance)\n",
"scaler = StandardScaler()\n",
"train_data['Networth_scaled'] = scaler.fit_transform(train_data[['Networth']])\n",
"\n",
"# Visualize the Networth distribution after standardization\n",
"sns.histplot(train_data['Networth_scaled'], kde=True)\n",
"plt.title('Networth distribution after standardization')\n",
"plt.xlabel('Networth (standardized)')\n",
"plt.ylabel('Frequency')\n",
"plt.show()\n",
"\n",
"# Print the training set size after standardization\n",
"print(\"Training set size after standardization: \", len(train_data))\n"
]
},
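{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scaler above is fitted on the training split only. A minimal sketch, assuming two small synthetic splits (`train_part` and `val_part` are hypothetical), of reusing the fitted scaler on a held-out split with `transform`, so that held-out statistics do not leak into the scaling:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical splits standing in for train_data and val_data.\n",
"train_part = pd.DataFrame({'Networth': [1.0, 2.0, 3.0, 10.0]})\n",
"val_part = pd.DataFrame({'Networth': [2.0, 4.0]})\n",
"\n",
"scaler = StandardScaler()\n",
"# Fit on the training split only, then scale it.\n",
"train_part['Networth_scaled'] = scaler.fit_transform(train_part[['Networth']]).ravel()\n",
"# Reuse the training mean and standard deviation for the held-out split.\n",
"val_part['Networth_scaled'] = scaler.transform(val_part[['Networth']]).ravel()\n",
"\n",
"print(val_part)"
]
},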
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature engineering\n",
"\n",
"Next, we engineer features for each task.\n",
"\n",
"**Feature engineering process**  \n",
"Task 1: Predict the probability of reaching millionaire status. Technical project goal: develop a machine-learning model that accurately predicts this probability.  \n",
"Task 2: Assess the factors that influence reaching millionaire status. Technical project goal: develop a machine-learning model that identifies the key factors.\n",
"\n",
"**One-hot encoding**  \n",
"One-hot encoding converts each categorical feature into a set of binary indicator columns, one per category.\n",
"\n",
"**Discretization of numeric features**  \n",
"Discretization converts continuous numeric values into discrete categories or intervals (bins)."
]
},
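{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the two techniques described above, on a hypothetical three-row frame (`toy`, the bin edges, and the bin labels are assumptions chosen for illustration): `pd.get_dummies` for one-hot encoding and `pd.cut` for discretizing `Age`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in rows with one numeric and one categorical feature.\n",
"toy = pd.DataFrame({\n",
"    'Age': [50, 73, 91],\n",
"    'Country': ['United States', 'France', 'United States'],\n",
"})\n",
"\n",
"# One-hot encoding: each distinct Country value becomes its own indicator column.\n",
"encoded = pd.get_dummies(toy, columns=['Country'])\n",
"\n",
"# Discretization: map continuous Age values onto labelled intervals (bins).\n",
"# With the default right=True the bins are (0, 60], (60, 80], (80, 120].\n",
"encoded['Age_group'] = pd.cut(toy['Age'], bins=[0, 60, 80, 120], labels=['<=60', '61-80', '>80'])\n",
"\n",
"print(encoded)"
]
},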
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Столбцы train_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'LogNetworth', 'Networth_scaled', 'Country_Algeria', 'Country_Argentina', 'Country_Australia', 'Country_Austria', 'Country_Barbados', 'Country_Belgium', 'Country_Belize', 'Country_Brazil', 'Country_Bulgaria', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Colombia', 'Country_Cyprus', 'Country_Czechia', 'Country_Denmark', 'Country_Egypt', 'Country_Estonia', 'Country_Eswatini (Swaziland)', 'Country_Finland', 'Country_France', 'Country_Georgia', 'Country_Germany', 'Country_Greece', 'Country_Guernsey', 'Country_Hong Kong', 'Country_Hungary', 'Country_Iceland', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Macau', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_Morocco', 'Country_Nepal', 'Country_Netherlands', 'Country_New Zealand', 'Country_Nigeria', 'Country_Norway', 'Country_Oman', 'Country_Peru', 'Country_Philippines', 'Country_Poland', 'Country_Portugal', 'Country_Qatar', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Thailand', 'Country_Turkey', 'Country_Ukraine', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Country_Uruguay', 'Country_Venezuela', 'Country_Vietnam', 'Country_Zimbabwe', 'Source_3D printing', 'Source_AOL', 'Source_Airbnb', \"Source_Aldi, Trader Joe's\", 'Source_Aluminium', 'Source_Amazon', 'Source_Apple', 'Source_BMW, pharmaceuticals', 'Source_Banking', 'Source_Berkshire Hathaway', 'Source_Bloomberg LP', 'Source_Campbell Soup', 'Source_Cargill', 'Source_Carnival Cruises', 'Source_Chanel', 'Source_Charlotte Hornets, endorsements', 'Source_Chemicals', 'Source_Chick-fil-A', 'Source_Coca Cola Israel', 'Source_Coca-Cola bottler', 'Source_Columbia 
Sportswear', 'Source_Comcast', 'Source_Construction', 'Source_Contact Lens', 'Source_Dallas Cowboys', 'Source_Dell computers', \"Source_Dick's Sporting Goods\", 'Source_DirecTV', 'Source_Dolby Laboratories', 'Source_Dole, real estate', 'Source_EasyJet', 'Source_Estee Lauder', 'Source_Estée Lauder', 'Source_FIAT, investments', 'Source_Facebook', 'Source_Facebook, investments', 'Source_Furniture retail', 'Source_Gap', 'Source_Genentech, Apple', 'Source_Getty Oil', 'Source_Golden State Warriors', 'Source_Google', 'Source_Groupon, investments', 'Source_H&M', 'Source_Heineken', 'Source_Hermes', 'Source_Home Depot', 'Source_Houston Rockets, entertainment', 'Source_Hyundai', 'Source_I.T.', 'Source_IKEA', 'Source_IT', 'Source_IT consulting', 'Source_IT products', 'Source_IT provider', 'Source_In-N-Out Burger', 'Source_Instagram', 'Source_Intel', 'Source_Internet', 'Source_Internet search', 'Source_Investments', 'Source_Koch Industries', \"Source_L'Oréal\", 'Source_LED lighting', 'Source_LG', 'Source_LVMH', 'Source_Lego', 'Source_LinkedIn', 'Source_Little Caesars', 'Source_Lululemon', 'Source_Luxury goods', 'Source_Manufacturing', 'Source_Microsoft', 'Source_Mining', 'Source_Motors', 'Source_Multiple', 'Source_Nascar, racing', 'Source_Netflix', 'Source_Netscape, investments', 'Source_New Balance', 'Source_New England Patriots', 'Source_Nike', 'Source_Nutella, chocolates', 'Source_Patagonia', 'Source_Petro Fibre', 'Source_Petro Firbe', 'Source_Philadelphia Eagles', 'Source_Quicken Loans', 'Source_Real Estate', 'Source_Real estate', 'Source_Red Bull', 'Source_Reebok', 'Source_SAP', 'Source_Samsung', 'Source_Sears', 'Source_Semiconductor materials', 'Source_Shipping', 'Source_Shoes', 'Source_Slim-Fast', 'Source_Smartphones', 'Source_Snapchat', 'Source_Spotify', 'Source_Starbucks', 'Source_TD Ameritrade', 'Source_TV broadcasting', 'Source_TV network, investments', 'Source_TV programs', 'Source_TV shows', 'Source_TV, movie production', 'Source_Tesla, SpaceX', 'Source_TikTok', 
'Source_Toyota dealerships', 'Source_Transportation', 'Source_Twitter, Square', 'Source_
"Столбцы val_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foods',
"Столбцы test_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foods'
]
}
],
"source": [
"# Пример категориальных признаков\n",
"categorical_features = ['Country', 'Source', 'Industry']\n",
"\n",
"# Применение one-hot encoding\n",
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)\n",
"df_encoded = pd.get_dummies(df, columns=categorical_features)\n",
"\n",
"print(\"Столбцы train_data_encoded:\", train_data_encoded.columns.tolist())\n",
"print(\"Столбцы val_data_encoded:\", val_data_encoded.columns.tolist())\n",
"print(\"Столбцы test_data_encoded:\", test_data_encoded.columns.tolist())\n",
"\n",
"# Дискретизация числовых признаков (Age и Networth). Например, можно разделить возраст и стоимость активов на категории\n",
"# Пример дискретизации признака 'Age' на 5 категорий\n",
"train_data_encoded['Age_binned'] = pd.cut(train_data_encoded['Age'], bins=5, labels=False)\n",
"val_data_encoded['Age_binned'] = pd.cut(val_data_encoded['Age'], bins=5, labels=False)\n",
"test_data_encoded['Age_binned'] = pd.cut(test_data_encoded['Age'], bins=5, labels=False)\n",
"\n",
"# Пример дискретизации признака 'Networth' на 5 категорий\n",
"train_data_encoded['Networth_binned'] = pd.cut(train_data_encoded['Networth'], bins=5, labels=False)\n",
"val_data_encoded['Networth_binned'] = pd.cut(val_data_encoded['Networth'], bins=5, labels=False)\n",
"test_data_encoded['Networth_binned'] = pd.cut(test_data_encoded['Networth'], bins=5, labels=False)\n",
"\n",
"# Пример дискретизации признака 'Age' на 5 категорий\n",
"df_encoded['Age_binned'] = pd.cut(df_encoded['Age'], bins=5, labels=False)\n",
"\n",
"# Пример дискретизации признака 'Networth' на 5 категорий\n",
"df_encoded['Networth_binned'] = pd.cut(df_encoded['Networth'], bins=5, labels=False)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ручной синтез\n",
"Создание новых признаков на основе экспертных знаний и логики предметной области. К примеру, можно создать признак, который отражает соотношение возраста к стоимости активов (Networth) или другие полезные метрики."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Пример создания нового признака - соотношение возраста к стоимости активов (Networth)\n",
"train_data_encoded['age_to_networth'] = train_data_encoded['Age'] / train_data_encoded['Networth']\n",
"val_data_encoded['age_to_networth'] = val_data_encoded['Age'] / val_data_encoded['Networth']\n",
"test_data_encoded['age_to_networth'] = test_data_encoded['Age'] / test_data_encoded['Networth']\n",
"\n",
"# Пример создания нового признака - соотношение возраста к стоимости активов (Networth)\n",
"df_encoded['age_to_networth'] = df_encoded['Age'] / df_encoded['Networth']\n",
"\n",
"# Пример создания нового признака - соотношение стоимости активов к возрасту\n",
"train_data_encoded['networth_to_age'] = train_data_encoded['Networth'] / train_data_encoded['Age']\n",
"val_data_encoded['networth_to_age'] = val_data_encoded['Networth'] / val_data_encoded['Age']\n",
"test_data_encoded['networth_to_age'] = test_data_encoded['Networth'] / test_data_encoded['Age']\n",
"\n",
"# Пример создания нового признака - соотношение стоимости активов к возрасту\n",
"df_encoded['networth_to_age'] = df_encoded['Networth'] / df_encoded['Age']\n",
"\n",
"# Пример создания нового признака - квадрат возраста\n",
"train_data_encoded['age_squared'] = train_data_encoded['Age'] ** 2\n",
"val_data_encoded['age_squared'] = val_data_encoded['Age'] ** 2\n",
"test_data_encoded['age_squared'] = test_data_encoded['Age'] ** 2\n",
"\n",
"# Пример создания нового признака - квадрат возраста\n",
"df_encoded['age_squared'] = df_encoded['Age'] ** 2\n",
"\n",
"# Пример создания нового признака - логарифм стоимости активов\n",
"import numpy as np\n",
"train_data_encoded['log_networth'] = train_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)\n",
"val_data_encoded['log_networth'] = val_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)\n",
"test_data_encoded['log_networth'] = test_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)\n",
"\n",
"# Пример создания нового признака - логарифм стоимости активов\n",
"df_encoded['log_networth'] = df_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Масштабирование признаков - это процесс преобразования числовых признаков таким образом, чтобы они имели одинаковый масштаб."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"\n",
"# Пример числовых признаков\n",
"numerical_features = ['Networth', 'Age']\n",
"\n",
"# Применение StandardScaler для масштабирования числовых признаков\n",
"scaler = StandardScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])\n",
"\n",
"# Пример использования MinMaxScaler для масштабирования числовых признаков\n",
"scaler = MinMaxScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Использование фреймворка Featuretools"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Столбцы в df: ['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry']\n",
"Столбцы в train_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'LogNetworth', 'Networth_scaled', 'Country_Algeria', 'Country_Argentina', 'Country_Australia', 'Country_Austria', 'Country_Barbados', 'Country_Belgium', 'Country_Belize', 'Country_Brazil', 'Country_Bulgaria', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Colombia', 'Country_Cyprus', 'Country_Czechia', 'Country_Denmark', 'Country_Egypt', 'Country_Estonia', 'Country_Eswatini (Swaziland)', 'Country_Finland', 'Country_France', 'Country_Georgia', 'Country_Germany', 'Country_Greece', 'Country_Guernsey', 'Country_Hong Kong', 'Country_Hungary', 'Country_Iceland', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Macau', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_Morocco', 'Country_Nepal', 'Country_Netherlands', 'Country_New Zealand', 'Country_Nigeria', 'Country_Norway', 'Country_Oman', 'Country_Peru', 'Country_Philippines', 'Country_Poland', 'Country_Portugal', 'Country_Qatar', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Thailand', 'Country_Turkey', 'Country_Ukraine', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Country_Uruguay', 'Country_Venezuela', 'Country_Vietnam', 'Country_Zimbabwe', 'Source_3D printing', 'Source_AOL', 'Source_Airbnb', \"Source_Aldi, Trader Joe's\", 'Source_Aluminium', 'Source_Amazon', 'Source_Apple', 'Source_BMW, pharmaceuticals', 'Source_Banking', 'Source_Berkshire Hathaway', 'Source_Bloomberg LP', 'Source_Campbell Soup', 'Source_Cargill', 'Source_Carnival Cruises', 'Source_Chanel', 'Source_Charlotte Hornets, endorsements', 'Source_Chemicals', 'Source_Chick-fil-A', 'Source_Coca Cola Israel', 'Source_Coca-Cola bottler', 'Source_Columbia 
Sportswear', 'Source_Comcast', 'Source_Construction', 'Source_Contact Lens', 'Source_Dallas Cowboys', 'Source_Dell computers', \"Source_Dick's Sporting Goods\", 'Source_DirecTV', 'Source_Dolby Laboratories', 'Source_Dole, real estate', 'Source_EasyJet', 'Source_Estee Lauder', 'Source_Estée Lauder', 'Source_FIAT, investments', 'Source_Facebook', 'Source_Facebook, investments', 'Source_Furniture retail', 'Source_Gap', 'Source_Genentech, Apple', 'Source_Getty Oil', 'Source_Golden State Warriors', 'Source_Google', 'Source_Groupon, investments', 'Source_H&M', 'Source_Heineken', 'Source_Hermes', 'Source_Home Depot', 'Source_Houston Rockets, entertainment', 'Source_Hyundai', 'Source_I.T.', 'Source_IKEA', 'Source_IT', 'Source_IT consulting', 'Source_IT products', 'Source_IT provider', 'Source_In-N-Out Burger', 'Source_Instagram', 'Source_Intel', 'Source_Internet', 'Source_Internet search', 'Source_Investments', 'Source_Koch Industries', \"Source_L'Oréal\", 'Source_LED lighting', 'Source_LG', 'Source_LVMH', 'Source_Lego', 'Source_LinkedIn', 'Source_Little Caesars', 'Source_Lululemon', 'Source_Luxury goods', 'Source_Manufacturing', 'Source_Microsoft', 'Source_Mining', 'Source_Motors', 'Source_Multiple', 'Source_Nascar, racing', 'Source_Netflix', 'Source_Netscape, investments', 'Source_New Balance', 'Source_New England Patriots', 'Source_Nike', 'Source_Nutella, chocolates', 'Source_Patagonia', 'Source_Petro Fibre', 'Source_Petro Firbe', 'Source_Philadelphia Eagles', 'Source_Quicken Loans', 'Source_Real Estate', 'Source_Real estate', 'Source_Red Bull', 'Source_Reebok', 'Source_SAP', 'Source_Samsung', 'Source_Sears', 'Source_Semiconductor materials', 'Source_Shipping', 'Source_Shoes', 'Source_Slim-Fast', 'Source_Smartphones', 'Source_Snapchat', 'Source_Spotify', 'Source_Starbucks', 'Source_TD Ameritrade', 'Source_TV broadcasting', 'Source_TV network, investments', 'Source_TV programs', 'Source_TV shows', 'Source_TV, movie production', 'Source_Tesla, SpaceX', 'Source_TikTok', 
'Source_Toyota dealerships', 'Source_Transportation', 'Source_Twitter, Square', 'Sour
"Столбцы в val_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen food
"Столбцы в test_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foo
"Empty DataFrame\n",
"Columns: [Rank , Name, Networth, Age, LogNetworth, Networth_scaled, Country_Algeria, Country_Argentina, Country_Australia, Country_Austria, Country_Barbados, Country_Belgium, Country_Belize, Country_Brazil, Country_Bulgaria, Country_Canada, Country_Chile, Country_China, Country_Colombia, Country_Cyprus, Country_Czechia, Country_Denmark, Country_Egypt, Country_Estonia, Country_Eswatini (Swaziland), Country_Finland, Country_France, Country_Georgia, Country_Germany, Country_Greece, Country_Guernsey, Country_Hong Kong, Country_Hungary, Country_Iceland, Country_India, Country_Indonesia, Country_Ireland, Country_Israel, Country_Italy, Country_Japan, Country_Kazakhstan, Country_Lebanon, Country_Macau, Country_Malaysia, Country_Mexico, Country_Monaco, Country_Morocco, Country_Nepal, Country_Netherlands, Country_New Zealand, Country_Nigeria, Country_Norway, Country_Oman, Country_Peru, Country_Philippines, Country_Poland, Country_Portugal, Country_Qatar, Country_Romania, Country_Russia, Country_Singapore, Country_Slovakia, Country_South Africa, Country_South Korea, Country_Spain, Country_Sweden, Country_Switzerland, Country_Taiwan, Country_Thailand, Country_Turkey, Country_Ukraine, Country_United Arab Emirates, Country_United Kingdom, Country_United States, Country_Uruguay, Country_Venezuela, Country_Vietnam, Country_Zimbabwe, Source_3D printing, Source_AOL, Source_Airbnb, Source_Aldi, Trader Joe's, Source_Aluminium, Source_Amazon, Source_Apple, Source_BMW, pharmaceuticals, Source_Banking, Source_Berkshire Hathaway, Source_Bloomberg LP, Source_Campbell Soup, Source_Cargill, Source_Carnival Cruises, Source_Chanel, Source_Charlotte Hornets, endorsements, Source_Chemicals, Source_Chick-fil-A, Source_Coca Cola Israel, Source_Coca-Cola bottler, Source_Columbia Sportswear, Source_Comcast, ...]\n",
"Index: []\n",
"\n",
"[0 rows x 869 columns]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
" warnings.warn(\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Rank Networth Age Country_Algeria Country_Argentina \\\n",
"id \n",
"0 1 219.0 50 False False \n",
"1 2 171.0 58 False False \n",
"2 3 158.0 73 False False \n",
"3 4 129.0 66 False False \n",
"4 5 118.0 91 False False \n",
"\n",
" Country_Australia Country_Austria Country_Barbados Country_Belgium \\\n",
"id \n",
"0 False False False False \n",
"1 False False False False \n",
"2 False False False False \n",
"3 False False False False \n",
"4 False False False False \n",
"\n",
" Country_Belize ... Industry_Sports Industry_Technology \\\n",
"id ... \n",
"0 False ... False False \n",
"1 False ... False True \n",
"2 False ... False False \n",
"3 False ... False True \n",
"4 False ... False False \n",
"\n",
" Industry_Telecom Industry_diversified Age_binned Networth_binned \\\n",
"id \n",
"0 False False 1 4 \n",
"1 False False 2 3 \n",
"2 False False 3 3 \n",
"3 False False 2 2 \n",
"4 False False 4 2 \n",
"\n",
" age_to_networth networth_to_age age_squared log_networth \n",
"id \n",
"0 0.228311 4.380000 2500 5.389072 \n",
"1 0.339181 2.948276 3364 5.141664 \n",
"2 0.462025 2.164384 5329 5.062595 \n",
"3 0.511628 1.954545 4356 4.859812 \n",
"4 0.771186 1.296703 8281 4.770685 \n",
"\n",
"[5 rows x 997 columns]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
" warnings.warn(\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
]
}
],
"source": [
"import featuretools as ft\n",
"\n",
"# Проверка наличия столбцов в DataFrame\n",
"print(\"Столбцы в df:\", df.columns.tolist())\n",
"print(\"Столбцы в train_data_encoded:\", train_data_encoded.columns.tolist())\n",
"print(\"Столбцы в val_data_encoded:\", val_data_encoded.columns.tolist())\n",
"print(\"Столбцы в test_data_encoded:\", test_data_encoded.columns.tolist())\n",
"\n",
"# Удаление дубликатов по всем столбцам (если нет уникального идентификатора)\n",
"df = df.drop_duplicates()\n",
"duplicates = train_data_encoded[train_data_encoded.duplicated(keep=False)]\n",
"\n",
"# Удаление дубликатов из столбца \"id\", сохранив первое вхождение\n",
"df_encoded = df_encoded.drop_duplicates(keep='first')\n",
"\n",
"print(duplicates)\n",
"\n",
"# Создание EntitySet\n",
"es = ft.EntitySet(id='millionaires_data')\n",
"\n",
"# Добавление датафрейма с данными о миллионерах\n",
"es = es.add_dataframe(dataframe_name='millionaires', dataframe=df_encoded, index='id')\n",
"\n",
"# Генерация признаков с помощью глубокой синтезы признаков\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='millionaires', max_depth=2)\n",
"\n",
"# Выводим первые 5 строк сгенерированного набора признаков\n",
"print(feature_matrix.head())\n",
"\n",
"# Удаление дубликатов из обучающей выборки\n",
"train_data_encoded = train_data_encoded.drop_duplicates()\n",
"train_data_encoded = train_data_encoded.drop_duplicates(keep='first') # or keep='last'\n",
"\n",
"# Определение сущностей (Создание EntitySet)\n",
"es = ft.EntitySet(id='millionaires_data')\n",
"\n",
"es = es.add_dataframe(dataframe_name='millionaires', dataframe=train_data_encoded, index='id')\n",
"\n",
"# Генерация признаков\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='millionaires', max_depth=2)\n",
"\n",
"# Преобразование признаков для контрольной и тестовой выборок\n",
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)\n",
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Оценка качества каждого набора признаков \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Время обучения модели: 11.98 секунд\n",
"Среднеквадратичная ошибка: 17.43\n",
"Коэффициент детерминации (R²): 0.27\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0kAAAIjCAYAAADWYVDIAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAC8l0lEQVR4nOzdd1hTZ/sH8G8GIUAgiEwRRcG6t75VcNRqxb3rrHtAa221vu2rra3dtrW1dgMO1Kq1bq3VOqtWxL2KdeJCQIbICoSQ5Pz+8EdKDGgCgQB+P9fFpTzPycnNSXJy7nOe5z4iQRAEEBEREREREQBAbOsAiIiIiIiIKhMmSUREREREREUwSSIiIiIiIiqCSRIREREREVERTJKIiIiIiIiKYJJERERERERUBJMkIiIiIiKiIpgkERERERERFcEkiYiIiIiIqAgmSUREREREZXT37l2sWLHC8PutW7ewZs0a2wVEZcIkiWxuwoQJUCgUtg6DiIiIqNREIhGmT5+O3bt349atW3jrrbfw119/2TosKiWprQOgp9P9+/exZs0a/PXXXzh8+DDy8vLQq1cvtG7dGsOHD0fr1q1tHSIRERGR2Xx9fTF16lT06tULAODj44ODBw/aNigqNZEgCIKtg6Cny7p16zB16lTk5OTA398fBQUFuHfvHlq3bo3z58+joKAA48ePR2RkJGQyma3DJSIiIjJbXFwc0tLS0KxZMzg5Odk6HColDrejChUdHY2XXnoJ3t7eiI6Oxs2bN9GjRw/I5XKcPHkSiYmJGDVqFFauXIlZs2YZPfbLL79EUFAQatasCQcHB7Rt2xYbN240eQ6RSIT333/f8LtWq0WfPn3g5uaGf/75x7DM436ee+45AMDBgwchEolMzgT17dvX5Hmee+45w+MK3bp1CyKRyGiMMgBcvnwZw4YNg5ubG+RyOdq1a4ft27eb/C0ZGRmYNWsW/P39YW9vj9q1a2PcuHFIS0srMb7ExET4+/ujXbt2yMnJAQBoNBq89957aNu2LZRKJZycnNC5c2f8+eefJs+ZkpKCyZMno06dOpBIJIZtYu6QyF27dqFr165wdnaGi4sL2rdvj7Vr1xq20ZO2fSGtVouPPvoIAQEBsLe3h7+/P95++23k5+cbPZ+/vz8mTJhg1LZhwwaIRCL4+/sb2gpfC5FIhK1btxotr1arUaNGDYhEInz55ZdGfWfPnkXv3r3h4uIChUKB7t2749ixYyZ/9+Neq8LX6XE/he+l999/HyKRyPAaW+L27dt45ZVX0LBhQzg4OKBmzZp48cUXcevWLaPlVqxYAZFIZNR+8eJF1KhRA/369YNWqzUs87ifwvf1hAkTjLY1AMTHx8PBwcHkefz9/Q2PF4vF8Pb2xogRI3Dnzh2jx6tUKsyePRt+fn6wt7dHw4YN8eWXX+LR83pF45FIJPD19cW0adOQkZHxxO31uL/t0b/H3HhKcvz4cfTp0wc1atSAk5MTWrRogW+++cbQXzjs+MaNGwgJCYGTkxNq1aqFDz/80OQ5LNkXPmnbFL43i3u8QqEw+WxlZGRg5syZhu0QGBiIzz//HHq93rBM4Wft0c8SADRr1sxoP2nJPra49+3u3bsRFBQER0dHKJVK9OvXD7GxsSbPWxy1Wo33338fzzzzDORyOXx8fDBkyBDExcU99nFF38OP24cBD1+DV199FWvWrEHDhg0hl8vRtm1bHD582GS95uxrHve5vHv3LoCSh7Bv3Lix2G29YcMGtG3bFg4ODnB3d8dLL72EhIQEo2Xef/99NGnSBAqFAi4uLujQoYPJfrS478CTJ0+Werv8+eefEIlE2LJli8nfsnbtWohEIsTExBjazPleLdx+MpkMqampRn0xMTGGWE+dOmXxNiq6HwwICMCzzz6L9PT0YveDVDVwuB1VqM8++wx6vR7r1q1D27ZtTfrd3d2xatUq/PPPP4iIiMD8+fPh6ekJAPjmm28wYMAAjB
kzBhqNBuvWrcOLL76IHTt2oG/fviU+55QpU3Dw4EHs3bsXTZo0AQD8/PPPhv6//voLkZGR+Prrr+Hu7g4A8PLyKnF9hw8fxs6dO0v19wMPD0aDg4Ph6+uLOXPmwMnJCevXr8egQYOwadMmDB48GACQk5ODzp0749KlS5g0aRLatGmDtLQ0bN++HXfv3jXEWlRmZiZ69+4NOzs77Ny50/BFmZWVhaVLl2LUqFGYOnUqsrOzsWzZMoSEhODEiRNo1aqVYR3jx4/Hvn37MGPGDLRs2RISiQSRkZE4c+bME/+2FStWYNKkSWjatCnmzp0LV1dXnD17Fn/88QdGjx6Nd955B1OmTAEApKWlYdasWZg2bRo6d+5ssq4pU6Zg5cqVGDZsGGbPno3jx49jwYIFuHTpUrFfmoW0Wi3eeeedEvvlcjmioqIwaNAgQ9vmzZuhVqtNlr148SI6d+4MFxcXvPXWW7Czs0NERASee+45HDp0CM8++yyAJ79WjRs3NnrPRUZG4tKlS/j6668NbS1atCh5w5rp5MmTOHr0KEaOHInatWvj1q1b+Omnn/Dcc8/hn3/+gaOjY7GPi4+PR69evdCoUSOsX78eUqkUXbp0MYr5k08+AQCjbRsUFFRiLO+9916x2xQAOnfujGnTpkGv1yM2NhaLFy9GYmKiYey+IAgYMGAA/vzzT0yePBmtWrXC7t278eabbyIhIcFouwHA4MGDMWTIEGi1WsTExCAyMhJ5eXlG8ZfkhRdewLhx44zavvrqKzx48MDwu6XxPGrv3r3o168ffHx88Prrr8Pb2xuXLl3Cjh078PrrrxuW0+l06NWrFzp06IAvvvgCf/zxB+bPnw+tVosPP/zQsJwl+8KybJtH5ebmomvXrkhISEBoaCjq1KmDo0ePYu7cuUhKSsLixYstXmdxzN3H/vXXX+jTpw/q1q2L+fPno6CgAD/++COCg4Nx8uRJPPPMMyU+VqfToV+/fti/fz9GjhyJ119/HdnZ2di7dy9iY2MREBDw2Odu1aoVZs+ebdS2atUq7N2712TZQ4cO4ddff8Vrr70Ge3t7/Pjjj+jVqxdOnDiBZs2aATB/X1Poww8/RL169Yza3NzcHhtzcVasWIGJEyeiffv2WLBgAZKTk/HNN98gOjoaZ8+ehaurK4CHJwkGDx4Mf39/5OXlYcWKFRg6dChiYmLwn//8p8T1/+9//yux70nb5bnnnoOfnx/WrFlj+F4stGbNGgQEBKBjx44AzP9eLSSRSLB69Wqjk7FRUVGQy+Um+y1zt1FxHrcfpCpAIKpAbm5uQt26dY3axo8fLzg5ORm1vfvuuwIA4bfffjO05ebmGi2j0WiEZs2aCc8//7xROwBh/vz5giAIwty5cwWJRCJs3bq1xJiioqIEAMLNmzdN+v78808BgPDnn38a2p599lmhd+/eRs8jCILQrVs3oUuXLkaPv3nzpgBAiIqKMrR1795daN68uaBWqw1ter1eCAoKEho0aGBoe++99wQAwubNm03i0uv1JvGp1WrhueeeEzw9PYXr168bLa/VaoX8/HyjtgcPHgheXl7CpEmTDG15eXmCWCwWQkNDjZYt7jV6VEZGhuDs7Cw8++yzQl5eXrHxFlXctil07tw5AYAwZcoUo/b//ve/AgDhwIEDhra6desK48ePN/z+448/Cvb29kK3bt2M3muFzzdq1ChBKpUK9+7dM/R1795dGD16tABAWLhwoaF90KBBgkwmE+Li4gxtiYmJgrOzs9Frbc5rVdT48eNNPgeF5s+fLwAQUlNTi+1/nEc/I4IgCDExMQIAYdWqVYa2ou/59PR0oUmTJkLDhg2FtLS0EtfdtWtXoWvXrsX2Pfr3xMbGCmKx2PA5KfrZevT1EgRBGD16tODo6Gj4fevWrQIA4eOPPzZabtiwYYJIJDJ6fz/6ORQEQQgKChKaNGlS4t9S9LHTp083ae/bt6/R32NJPI/SarVCvXr1hLp16woPHjww6i
v63hg/frwAQJgxY4ZRf9++fQWZTGb0fijNvrDQo9umcB+yYcMGk9idnJyMXquPPvpIcHJyEq5evWq03Jw5cwSJRCL
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import time\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"# Предположим, что df уже определен и загружен\n",
"\n",
"# Разделение данных на обучающую и валидационную выборки. Удаляем целевую переменную\n",
"X = df.drop('Networth', axis=1)\n",
"y = df['Networth']\n",
"\n",
"# One-hot encoding для категориальных переменных (преобразование категориальных объектов в числовые)\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Проверяем, есть ли пропущенные значения, и заполняем их медианой или другим подходящим значением\n",
"X.fillna(X.median(), inplace=True)\n",
"\n",
"# Масштабирование признаков\n",
"scaler = StandardScaler()\n",
"X = scaler.fit_transform(X)\n",
"\n",
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Обучение модели с регуляризацией (Ridge)\n",
"model = Ridge()\n",
"\n",
"# Настройка гиперпараметров с помощью GridSearchCV\n",
"param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0]}\n",
"grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')\n",
"\n",
"# Начинаем отсчет времени\n",
"start_time = time.time()\n",
"grid_search.fit(X_train, y_train)\n",
"\n",
"# Время обучения модели\n",
"train_time = time.time() - start_time\n",
"\n",
"# Лучшая модель\n",
"best_model = grid_search.best_estimator_\n",
"\n",
"# Предсказания и оценка модели\n",
"val_predictions = best_model.predict(X_val)\n",
"mse = mean_squared_error(y_val, val_predictions)\n",
"r2 = r2_score(y_val, val_predictions)\n",
"\n",
"print(f'Время обучения модели: {train_time:.2f} секунд')\n",
"print(f'Среднеквадратичная ошибка: {mse:.2f}')\n",
"print(f'Коэффициент детерминации (R²): {r2:.2f}')\n",
"\n",
"# Визуализация результатов\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_val, val_predictions, alpha=0.5)\n",
"plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)\n",
"plt.xlabel('Фактическая стоимость активов')\n",
"plt.ylabel('Прогнозируемая стоимость активов')\n",
"plt.title('Фактическая стоимость активов по сравнению с прогнозируемой')\n",
"plt.show()\n",
"\n",
"\n"
]
},
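  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conclusions refer to RMSE on the training and held-out data and to cross-validated RMSE, while the evaluation cell prints only the validation MSE and R². The cell that follows is a minimal sketch of how those three numbers could be computed side by side. The helper `rmse_report` and the synthetic data from `make_regression` are illustrative assumptions, not part of the original analysis. Wrapping `StandardScaler` and `Ridge` in a pipeline is one possible way to keep the scaling inside each cross-validation fold.\n",
    "\n",
    "To apply the sketch to the notebook's data, pass the unscaled training and validation splits in place of the synthetic arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import make_regression\n",
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.model_selection import cross_val_score, train_test_split\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "\n",
    "def rmse_report(model, X_tr, y_tr, X_va, y_va, cv=5):\n",
    "    # Hypothetical helper: RMSE on the training split, the validation split, and k-fold CV\n",
    "    cv_mse = -cross_val_score(model, X_tr, y_tr, cv=cv, scoring='neg_mean_squared_error')\n",
    "    model.fit(X_tr, y_tr)\n",
    "    return {\n",
    "        'train_rmse': float(np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))),\n",
    "        'val_rmse': float(np.sqrt(mean_squared_error(y_va, model.predict(X_va)))),\n",
    "        'cv_rmse': float(np.sqrt(cv_mse.mean())),\n",
    "    }\n",
    "\n",
    "\n",
    "# Synthetic stand-in data (assumption): replace with the notebook's unscaled splits\n",
    "X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)\n",
    "X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)\n",
    "\n",
    "# The pipeline refits the scaler on the training part of every fold\n",
    "demo_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))\n",
    "report = rmse_report(demo_model, X_tr, y_tr, X_va, y_va)\n",
    "print(report)"
   ]
  },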
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Выводы\n",
"\n",
"**Модель линейной регрессии (LinearRegression)** показала удовлетворительные результаты при прогнозировании стоимости активов миллионеров. Метрики качества и кросс-валидация позволяют предположить, что модель не сильно переобучена и может быть использована для практических целей.\n",
"\n",
"*Точность предсказаний:* Модель демонстрирует коэффициент детерминации (R²) 0.27, что указывает на умеренную часть вариации целевого признака (стоимости активов). Однако, значения среднеквадратичной ошибки (RMSE) остаются высокими (17.43), что свидетельствует о том, что модель не всегда точно предсказывает значения, особенно для объектов с высокими или низкими стоимостями активов.\n",
"\n",
"*Переобучение:* Разница между RMSE на обучающей и тестовой выборках незначительна, что указывает на то, что модель не склонна к переобучению. Однако в будущем стоит следить за этой метрикой при добавлении новых признаков или усложнении модели, чтобы избежать излишней подгонки под тренировочные данные. Также стоит быть осторожным и продолжать мониторинг этого показателя.\n",
"\n",
"*Кросс-валидация:* При кросс-валидации наблюдается небольшое увеличение ошибки RMSE по сравнению с тестовой выборкой (рост на 2-3%). Это может указывать на небольшую нестабильность модели при использовании разных подвыборок данных. Для повышения устойчивости модели возможно стоит провести дальнейшую настройку гиперпараметров.\n",
"\n",
"*Рекомендации:* Следует уделить внимание дополнительной обработке категориальных признаков, улучшению метода feature engineering, а также возможной оптимизации модели (например, через подбор гиперпараметров) для повышения точности предсказаний на экстремальных значениях.\n",
"\n",
"*Время обучения модели:* Модель обучалась в течение 11.98 секунд, что является приемлемым временем для данного объема данных.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}