{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Variant 19:* Data on millionaires\n",
    "- Define the business goals and the technical project goals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry'], dtype='object')\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "df = pd.read_csv(\"C:/Users/goldfest/Desktop/3 курс/MII/AIM-PIbd-31-LOBASHOV-I-D/static/csv/Forbes Billionaires.csv\")\n",
    "print(df.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Business goals:\n",
    "\n",
    "1. Predict potential millionaires based on data analysis.\n",
    "2. Assess the factors that influence reaching millionaire status."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Technical project goals:\n",
    "\n",
    "1. Build a machine learning classification model that predicts the probability of reaching millionaire status from the available data on millionaires' characteristics.\n",
    "2. Analyze the data to identify the key factors that influence reaching millionaire status."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Rank</th>\n",
       "      <th>Name</th>\n",
       "      <th>Networth</th>\n",
       "      <th>Age</th>\n",
       "      <th>Country</th>\n",
       "      <th>Source</th>\n",
       "      <th>Industry</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Elon Musk</td>\n",
       "      <td>219.0</td>\n",
       "      <td>50</td>\n",
       "      <td>United States</td>\n",
       "      <td>Tesla, SpaceX</td>\n",
       "      <td>Automotive</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jeff Bezos</td>\n",
       "      <td>171.0</td>\n",
       "      <td>58</td>\n",
       "      <td>United States</td>\n",
       "      <td>Amazon</td>\n",
       "      <td>Technology</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Bernard Arnault &amp; family</td>\n",
       "      <td>158.0</td>\n",
       "      <td>73</td>\n",
       "      <td>France</td>\n",
       "      <td>LVMH</td>\n",
       "      <td>Fashion &amp; Retail</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Bill Gates</td>\n",
       "      <td>129.0</td>\n",
       "      <td>66</td>\n",
       "      <td>United States</td>\n",
       "      <td>Microsoft</td>\n",
       "      <td>Technology</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Warren Buffett</td>\n",
       "      <td>118.0</td>\n",
       "      <td>91</td>\n",
       "      <td>United States</td>\n",
       "      <td>Berkshire Hathaway</td>\n",
       "      <td>Finance &amp; Investments</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Rank                       Name  Networth  Age        Country  \\\n",
       "0      1                 Elon Musk     219.0   50  United States  \n",
       "1      2                Jeff Bezos     171.0   58  United States  \n",
       "2      3  Bernard Arnault & family     158.0   73         France  \n",
       "3      4                Bill Gates     129.0   66  United States  \n",
       "4      5            Warren Buffett     118.0   91  United States  \n",
       "\n",
       "               Source               Industry  \n",
       "0       Tesla, SpaceX             Automotive  \n",
       "1              Amazon             Technology  \n",
       "2                LVMH       Fashion & Retail  \n",
       "3           Microsoft             Technology  \n",
       "4  Berkshire Hathaway  Finance & Investments  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Rank        0\n",
      "Name        0\n",
      "Networth    0\n",
      "Age         0\n",
      "Country     0\n",
      "Source      0\n",
      "Industry    0\n",
      "dtype: int64\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Rank        False\n",
       "Name        False\n",
       "Networth    False\n",
       "Age         False\n",
       "Country     False\n",
       "Source      False\n",
       "Industry    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Percentage of missing values per feature\n",
    "for i in df.columns:\n",
    "    null_rate = df[i].isnull().sum() / len(df) * 100\n",
    "    if null_rate > 0:\n",
    "        print(f'{i} percentage of missing values: {null_rate:.2f}%')\n",
    "\n",
    "# Count missing values in each column\n",
    "print(df.isnull().sum())\n",
    "\n",
    "df.isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No column has missing values, which is good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set size:  2080\n",
      "Validation set size:  520\n",
      "Test set size:  520\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Split the data into training and test sets (80% training, 20% test)\n",
    "train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
    "\n",
    "# Split the data into training and validation sets (80% training, 20% validation).\n",
    "# This repeats the split above on df with the same random_state, so val_data contains the same rows as test_data.\n",
    "train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)\n",
    "\n",
    "print(\"Training set size: \", len(train_data))\n",
    "print(\"Validation set size: \", len(val_data))\n",
    "print(\"Test set size: \", len(test_data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean Networth in the training set:  5.05858173076923\n",
      "Mean Networth in the validation set:  4.069423076923076\n",
      "Mean Networth in the test set:  4.069423076923076\n"
     ]
    }
   ],
   "source": [
    "# Assess the balance of the target variable (Networth)\n",
    "# Plot the distribution of the target variable in each split (histogram)\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def plot_networth_distribution(data, title):\n",
    "    sns.histplot(data['Networth'], kde=True)\n",
    "    plt.title(title)\n",
    "    plt.xlabel('Networth')\n",
    "    plt.ylabel('Frequency')\n",
    "    plt.show()\n",
    "\n",
    "plot_networth_distribution(train_data, 'Networth distribution in the training set')\n",
    "plot_networth_distribution(val_data, 'Networth distribution in the validation set')\n",
    "plot_networth_distribution(test_data, 'Networth distribution in the test set')\n",
    "\n",
    "# Compare the mean of the target variable (Networth) across the splits\n",
    "print(\"Mean Networth in the training set: \", train_data['Networth'].mean())\n",
    "print(\"Mean Networth in the validation set: \", val_data['Networth'].mean())\n",
    "print(\"Mean Networth in the test set: \", test_data['Networth'].mean())\n"
   ]
  },
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABWsklEQVR4nO3de1wU5f4H8M9y2eWyLAgICwqKd1HMwttmIkdJRCpTyixT9Hi0DO2oZUWZt06SWWmal+qUl9Qy83a08i6oiWYo3jX1h0LJgmDcFrnu8/uDdnQFFBFYGD/v12tfsDPPzHxnZnf5MPPMrEIIIUBEREQkU1aWLoCIiIioNjHsEBERkawx7BAREZGsMewQERGRrDHsEBERkawx7BAREZGsMewQERGRrDHsEBERkawx7BAR1UMGgwEpKSn466+/LF0K1bDc3FxcvnwZBoPB0qU8MBh2iEjWli9fDoVCgd9++83SpdzVunXr0LdvXzg5OUGtVsPX1xcffvihpctqEPLy8jB//nzpeVZWFhYtWmS5gm4hhMAXX3yBHj16wMHBARqNBn5+fli1apWlS3tgMOw0EKYPbNPDzs4Obdq0wfjx45GWlmbp8qgeGjlyJBQKBTp16oSKvhVGoVBg/Pjx1Zr37NmzsWnTpvussGYtXrwYy5cvt3QZ1fbWW29hyJAhcHJywpdffomdO3di165deOWVVyxdWoNgb2+PqVOnYvXq1UhJScGMGTOwZcsWS5cFAHjhhRfw8ssvo3379vjmm2+kfTt48GBLl/bAsLF0AXRvZs2aBT8/PxQUFODAgQNYsmQJfvrpJ5w6dQoODg6WLo/qoZMnT2LDhg2IiIiosXnOnj0bzzzzDJ5++ukam+f9Wrx4Mdzd3TFy5EhLl3LP4uLiMGfOHMTExOCtt96ydDkNkrW1NWbOnIkRI0bAaDRCo9Hgxx9/tHRZWLlyJdauXYtVq1bhhRdesHQ5Dywe2WlgwsLC8OKLL+Jf//oXli9fjokTJyIpKQmbN2+2dGlUD9nb26NNmzaYNWtWhUd35CA/P9/SJdy3jz76CI8++iiDzn167bXXcOXKFRw8eBBXrlzBY489ZumSMHfuXDz//PMMOhbGsNPA9enTBwCQlJQEALh+/Tpef/11BAQEQK1WQ6PRICwsDMePHy83bUFBAWbMmIE2bdrAzs4OXl5eGDx4MC5dugQAuHz5stmps9sfwcHB0rxiY2OhUCiwdu1avP3229BqtXB0dMRTTz2FlJSUcss+fPgw+vfvD2dnZzg4OKB379745ZdfKlzH4ODgCpc/Y8aMcm1XrVqFwMBA2Nvbw9XVFUOHDq1w+Xdat1sZjUbMnz8fHTp0gJ2dHTw9PfHSSy+V6zTavHlzPPHEE+WWM378+HLzrKj2uXPnltumAFBYWIjp06ejVatWUKlU8PHxwRtvvIHCwsIKt9XtrKysMHXqVJw4cQIbN268a/uqLE+hUMBgMGDFihXSNhs5ciROnDgBhUKB//3vf1LbhIQEKBQKPPLII2bLCQsLQ/fu3c2GLV68GB06dIBKpYK3tzeioqKQlZVl1iY4OBgdO3ZEQkICgoKC4ODggLfffhvNmzfH6dOnERcXV+Hr07RukydPRuPGjeHo6IhBgwbh2rVrd90mptOBpkejRo0QHByM/fv333VaANizZw969eoFR0dHuLi4YODAgTh79qxZm0OHDqFjx44YOnQoXF1dYW9vj65du5qdKszLy4OjoyP+/e9/l1vGH3/8AWtra8TExEg1N2/evFy72197V65cwSuvvIK2bdvC3t4ebm5uePbZZ3H58mWz6Uzv79jYWGnYkSNH8Pjjj8PJyQmOjo4VbpOK+ktlZGRU+B544oknKqy5Kp8VM2bMkN5nTZs2hU6ng42NDbRabbm6K2Ka3vRwcnJCt27dyp2qNb3+KmP6XDGdTjUYDDh16hR8fHwQHh4OjUZT6bYCgP/7v//Ds88+C1dXVz
g4OKBHjx7ljk7dy2dtcHBwuffB+++/DysrK6xZs8Zs+L18JjdEPI3VwJmCiZubG4CyN8umTZvw7LPPws/PD2lpafj888/Ru3dvnDlzBt7e3gCA0tJSPPHEE9i9ezeGDh2Kf//738jNzcXOnTtx6tQptGzZUlrG888/jwEDBpgtNzo6usJ63n//fSgUCrz55ptIT0/H/PnzERISgsTERNjb2wMo+/APCwtDYGAgpk+fDisrKyxbtgx9+vTB/v370a1bt3Lzbdq0qfRBnpeXh3HjxlW47HfffRdDhgzBv/71L1y7dg0LFy5EUFAQjh07BhcXl3LTjB07Fr169QIAbNiwoVwgeOmll7B8+XKMGjUKr776KpKSkvDZZ5/h2LFj+OWXX2Bra1vhdrgXWVlZ0rrdymg04qmnnsKBAwcwduxYtG/fHidPnsS8efPw+++/V7nPzAsvvID33nsPs2bNwqBBg8qFr3td3jfffIN//etf6NatG8aOHQsAaNmyJTp27AgXFxfs27cPTz31FABg//79sLKywvHjx5GTkwONRgOj0YiDBw9K0wJlf2xmzpyJkJAQjBs3DufPn8eSJUtw5MiRcts5MzMTYWFhGDp0KF588UV4enoiODgYEyZMgFqtxjvvvAMA8PT0NFu/CRMmoFGjRpg+fTouX76M+fPnY/z48Vi7du1dt6G7uzvmzZsHoCxYfPrppxgwYABSUlIqfF2Z7Nq1C2FhYWjRogVmzJiBGzduYOHChejZsyeOHj0q/XHPzMzEF198AbVajVdffRWNGzfGqlWrMHjwYKxevRrPP/881Go1Bg0ahLVr1+KTTz6BtbW1tJxvv/0WQggMGzbsrutyqyNHjuDgwYMYOnQomjZtisuXL2PJkiUIDg7GmTNnKj01fvHiRQQHB8PBwQFTpkyBg4MDvvzyS4SEhGDnzp0ICgq6pzoqU53PCpOPP/74nvszfvPNNwDKAtnixYvx7LPP4tSpU2jbtm216s/MzAQAzJkzB1qtFlOmTIGdnV2F2yotLQ2PPvoo8vPz8eqrr8LNzQ0rVqzAU089hR9++AGDBg0ym3dVPmtvt2zZMkydOhUff/yx2ZGm+9nODYagBmHZsmUCgNi1a5e4du2aSElJEd99951wc3MT9vb24o8//hBCCFFQUCBKS0vNpk1KShIqlUrMmjVLGvb1118LAOKTTz4ptyyj0ShNB0DMnTu3XJsOHTqI3r17S8/37t0rAIgmTZqInJwcafj3338vAIhPP/1Umnfr1q1FaGiotBwhhMjPzxd+fn7i8ccfL7esRx99VHTs2FF6fu3aNQFATJ8+XRp2+fJlYW1tLd5//32zaU+ePClsbGzKDb9w4YIAIFasWCENmz59urj1LbF//34BQKxevdps2m3btpUb3qxZMxEeHl6u9qioKHH72+z22t944w3h4eEhAgMDzbbpN998I6ysrMT+/fvNpl+6dKkAIH755Zdyy7tVZGSkcHR0FEIIsWLFCgFAbNiwwayOqKioai3P0dFRREZGlltmeHi46Natm/R88ODBYvDgwcLa2lr8/PPPQgghjh49KgCIzZs3CyGESE9PF0qlUvTr18/stfvZZ58JAOLrr7+WhvXu3VsAEEuXLi237Ntfkyam905ISIjZa27SpEnC2tpaZGVllZvmVpGRkaJZs2Zmw7744gsBQPz66693nLZz587Cw8NDZGZmSsOOHz8urKysxIgRI6RhAAQAERsbKw3Lz88X7du3F1qtVhQVFQkhhNi+fbsAIG1Lk06dOpmt+6hRo4Svr2+5em5/7eXn55drEx8fLwCIlStXSsNM7++9e/cKIYSIiIgQ1tbW4tSpU1KbjIwM4ebmJgIDA6Vhpm1/5MgRaVhF718hyl47t27ne/msuP29m56eLpycnERYWJhZ3ZW5fXohhNixY4cAIL7//ntpWO/evUWHDh0qnY/pM3PZsmVmz5VKpfj999/NtsHt22rixIkCgNn7Lzc3V/
j5+YnmzZtL742qftaa6jW9Ln788UdhY2MjXnvtNbOaq/OZ3BDxNFYDExISgsaNG8PHxwdDhw6FWq3Gxo0b0aRJEwC
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABfOUlEQVR4nO3deVxUVeMG8GdYZlgHRHZFcV9xww0XMEURKTUt0yyxTMvQUnvNKHdLss2lFO2XS5lmaS6vVporbrjvG6kvCiWLG7sMy5zfHzRXhmEXGLg+389nPjD3nnvuOffOwDP3nntHIYQQICIiIpIpE2M3gIiIiKgyMewQERGRrDHsEBERkawx7BAREZGsMewQERGRrDHsEBERkawx7BAREZGsMewQERGRrDHsEBERkawx7BDRU2fNmjVQKBQ4deqUsZtCRFWAYacG0/3B1j0sLCzQtGlTTJgwAQkJCcZuHlVDo0ePhkKhQJs2bVDYN8UoFApMmDChXHXPnz8fW7dufcIWVqxly5ZhzZo1xm4GERkZw44MzJ07F2vXrsU333yDbt26ITw8HD4+PsjIyDB206iaunjxIjZv3lyhdTLsEFF1xbAjA4GBgXjllVfwxhtvYM2aNZg0aRKio6Oxbds2YzeNqiFLS0s0bdoUc+fOLfTojhww6BNRfgw7MtS7d28AQHR0NADgwYMH+M9//gMvLy/Y2NhArVYjMDAQ58+fN1g2MzMTs2fPRtOmTWFhYQE3NzcMGTIEN2/eBADcunVL79RZwUevXr2kug4cOACFQoGff/4ZH374IVxdXWFtbY2BAwciNjbWYN3Hjx9H//79YWdnBysrK/j5+eHIkSOF9rFXr16Frn/27NkGZX/88Ud4e3vD0tISDg4OGD58eKHrL65v+Wm1WixatAitWrWChYUFXFxc8Oabb+Lhw4d65Tw9PfHss88arGfChAkGdRbW9s8//9xgmwKARqPBrFmz0LhxY6hUKnh4eOD999+HRqMpdFsVZGJigunTp+PChQvYsmVLieVLsz6FQoH09HR8//330jYbPXo0Lly4AIVCgf/+979S2dOnT0OhUKBDhw566wkMDESXLl30pi1btgytWrWCSqWCu7s7QkJCkJSUpFemV69eaN26NU6fPg1fX19YWVnhww8/hKenJy5fvoyIiIhCX5+6vk2ZMgVOTk6wtrbG888/j7t375a4TXSnA4t6HDhwQK/8xo0bpdego6MjXnnlFfzzzz8G9V67dg3Dhg2Dk5MTLC0t0axZM3z00UcG5Tw9PUu13j/++AM9e/aEtbU1bG1tERQUhMuXL5fYv6LGNN27d6/Q1+rZs2cRGBgItVoNGxsb9OnTB8eOHSu0zoMHD+LNN99E7dq1oVarMWrUqELfOwqFApMmTTJoW0BAABQKhd57KysrCzNnzoS3tzfs7OxgbW2Nnj17Yv/+/YX2b/bs2YVuv9GjRxuUyS8tLQ2urq4G2/qtt95CkyZNYGVlBQcHB/Tu3RuHDh3SW3bbtm0ICgqCu7s7VCoVGjVqhHnz5iE3N1evnO71XNAXX3wBhUKBW7duGWzT/NO0Wi3atGkDhUKhd1Rz9OjR8PT01KszNjYWlpaWBnXIkZmxG0AVTxdMateuDQD43//+h61bt+LFF19EgwYNkJCQgBUrVsDPzw9XrlyBu7s7ACA3NxfPPvss9u7di+HDh+Pdd99Famoqdu/ejUuXLqFRo0bSOkaMGIEBAwborTc0NLTQ9nzyySdQKBSYNm0aEhMTsWjRIvj7++PcuXOwtLQEAOzbtw+BgYHw9vbGrFmzYGJigtWrV0t/NDp37mxQb926dREWFgYg74/Q+PHjC133jBkzMGzYMLzxxhu4e/cuvv76a/j6+uLs2bOwt7c3WGbcuHHo2bMnAGDz5s0GgeDNN9/EmjVr8Nprr+Gdd95BdHQ0vvnmG5w9exZHjhyBubl5oduhLJKSkqS+5afVajFw4EAcPn
wY48aNQ4sWLXDx4kUsXLgQf/31V6lPI7388suYN28e5s6di+eff97gj3pZ17d27Vq88cYb6Ny5M8aNGwcAaNSoEVq3bg17e3scPHgQAwcOBAAcOnQIJiYmOH/+PFJSUqBWq6HVanH06FFpWSDvn82cOXPg7++P8ePHIyoqCuHh4Th58qTBdr5//z4CAwMxfPhwvPLKK3BxcUGvXr0wceJE2NjYSIHBxcVFr38TJ05ErVq1MGvWLNy6dQuLFi3ChAkT8PPPP5e4DVUqFb777ju9aSdPnsSSJUv0puleK506dUJYWBgSEhKwePFiHDlyRO81eOHCBfTs2RPm5uYYN24cPD09cfPmTWzfvh2ffPKJwfp79uwpba+rV69i/vz5evPXrl2L4OBgBAQEYMGCBcjIyEB4eDh69OiBs2fPGvzjK6/Lly+jZ8+eUKvVeP/992Fubo4VK1agV69eiIiIMAiwEyZMgL29PWbPni3t09u3b0sfjnQsLCywbt06fP7559K+/vvvv7F3715YWFjo1ZmSkoLvvvsOI0aMwNixY5GamoqVK1ciICAAJ06cQLt27Qpt+9q1a6XfJ0+eXGJfv/zyy0LHQ2ZlZeGVV15B3bp18eDBA6xYsQL9+/fH1atXUa9ePQB5rwMbGxtMmTIFNjY22LdvH2bOnImUlBR8/vnnJa67tNauXYuLFy+WquzMmTORmZlZYeuu1gTVWKtXrxYAxJ49e8Tdu3dFbGys2LBhg6hdu7awtLQUf//9txBCiMzMTJGbm6u3bHR0tFCpVGLu3LnStFWrVgkA4quvvjJYl1arlZYDID7//HODMq1atRJ+fn7S8/379wsAok6dOiIlJUWa/ssvvwgAYvHixVLdTZo0EQEBAdJ6hBAiIyNDNGjQQPTt29dgXd26dROtW7eWnt+9e1cAELNmzZKm3bp1S5iamopPPvlEb9mLFy8KMzMzg+nXr18XAMT3338vTZs1a5bI/zY5dOiQACDWrVunt+zOnTsNptevX18EBQUZtD0kJEQUfOsVbPv7778vnJ2dhbe3t942Xbt2rTAxMRGHDh3SW3758uUCgDhy5IjB+vILDg4W1tbWQgghvv/+ewFAbN68Wa8dISEh5VqftbW1CA4ONlhnUFCQ6Ny5s/R8yJAhYsiQIcLU1FT88ccfQgghzpw5IwCIbdu2CSGESExMFEqlUvTr10/vtfvNN98IAGLVqlXSND8/PwFALF++3GDdBV+TOrr3jr+/v95rbvLkycLU1FQkJSUZLJNf/u2Y38aNGwUAsX//fiGEEFlZWcLZ2Vm0bt1aPHr0SCq3Y8cOAUDMnDlTmubr6ytsbW3F7du39erM3z6dOnXqiNdee016rnuv6dabmpoq7O3txdixY/WWi4+PF3Z2dgbTC9Jtn5MnT+pNL+x9NnjwYKFUKsXNmzelaXfu3BG2trbC19fXoE5vb2+RlZUlTf/ss8/09r0Qee+dvn37CkdHR7Fp0yZp+rx580S3bt0M3ls5OTlCo9HotfXhw4fCxcVFvP766wb9++ijj4RCodCbVr9+fb3Xb8H3fmJiorC1tRWBgYF627owJ06cEAD02p6RkWFQ7s033xRWVlYiMzNTmubn5ydatWplUPbzzz8XAER0dLQ0TbdNddMyMzNFvXr1pDauXr1aKhscHCzq168vPb906ZIwMTGRyuavV454GksG/P394eTkBA8PDwwfPhw2NjbYsmUL6tSpAyDvE6iJSd6uzs3Nxf3792FjY4NmzZrhzJkzUj2//vorHB0dMXHiRIN1FPXJvzRGjRoFW1tb6fkLL7wANzc3/P777wCAc+fO4fr163j55Zdx//593Lt3D/fu3UN6ejr69OmDgwcPQqvV6tWZmZlp8OmuoM2bN0Or1WLYsGFSnffu3YOrqyuaNGlicIg7KysLQN72KsrGjRthZ2eHvn376tXp7e0NGxsbgzqzs7P1yt27d6/ET1L//PMPvv76a8yYMQM2NjYG62
/RogWaN2+uV6fu1GVRh+0LM3LkSDRp0qTYsTsVsb6ePXvizJkzSE9PBwAcPnwYAwYMQLt27aRD/YcOHYJCoUCPHj0
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер обучающей выборки после нормализации: 2080\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from sklearn.preprocessing import StandardScaler\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация распределения Networth в обучающей выборке\n",
|
|||
|
"sns.histplot(train_data['Networth'], kde=True)\n",
|
|||
|
"plt.title('Распределение Networth в обучающей выборке')\n",
|
|||
|
"plt.xlabel('Networth')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Нормализация данных\n",
|
|||
|
"scaler = StandardScaler()\n",
|
|||
|
"train_data['Networth_scaled'] = scaler.fit_transform(train_data[['Networth']])\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация распределения Networth после нормализации\n",
|
|||
|
"sns.histplot(train_data['Networth_scaled'], kde=True)\n",
|
|||
|
"plt.title('Распределение Networth после нормализации')\n",
|
|||
|
"plt.xlabel('Networth (нормализованное)')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Print the size of the training set after scaling\n",
|
|||
|
"print(\"Размер обучающей выборки после нормализации: \", len(train_data))\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Feature engineering \n",
|
|||
|
"\n",
|
|||
|
"Next, we engineer features for each of the two tasks.\n",
|
|||
|
"\n",
|
|||
|
"**Feature engineering process** \n",
|
|||
|
"Task 1: Predicting the probability of reaching millionaire status. Technical project goal: develop a machine learning model that accurately predicts the probability of reaching millionaire status.\n",
|
|||
|
"Task 2: Assessing the factors that influence reaching millionaire status. Technical project goal: develop a machine learning model that identifies the key factors influencing millionaire status.\n",
|
|||
|
"\n",
|
|||
|
"**One-hot encoding** \n",
|
|||
|
"One-hot encoding of categorical features: each categorical feature is converted into a set of binary indicator columns, one per category.\n",
|
|||
|
"\n",
|
|||
|
"**Discretization of numerical features** \n",
|
|||
|
"The process of converting continuous numerical values into discrete categories or intervals (bins)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Столбцы train_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'LogNetworth', 'Networth_scaled', 'Country_Algeria', 'Country_Argentina', 'Country_Australia', 'Country_Austria', 'Country_Barbados', 'Country_Belgium', 'Country_Belize', 'Country_Brazil', 'Country_Bulgaria', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Colombia', 'Country_Cyprus', 'Country_Czechia', 'Country_Denmark', 'Country_Egypt', 'Country_Estonia', 'Country_Eswatini (Swaziland)', 'Country_Finland', 'Country_France', 'Country_Georgia', 'Country_Germany', 'Country_Greece', 'Country_Guernsey', 'Country_Hong Kong', 'Country_Hungary', 'Country_Iceland', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Macau', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_Morocco', 'Country_Nepal', 'Country_Netherlands', 'Country_New Zealand', 'Country_Nigeria', 'Country_Norway', 'Country_Oman', 'Country_Peru', 'Country_Philippines', 'Country_Poland', 'Country_Portugal', 'Country_Qatar', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Thailand', 'Country_Turkey', 'Country_Ukraine', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Country_Uruguay', 'Country_Venezuela', 'Country_Vietnam', 'Country_Zimbabwe', 'Source_3D printing', 'Source_AOL', 'Source_Airbnb', \"Source_Aldi, Trader Joe's\", 'Source_Aluminium', 'Source_Amazon', 'Source_Apple', 'Source_BMW, pharmaceuticals', 'Source_Banking', 'Source_Berkshire Hathaway', 'Source_Bloomberg LP', 'Source_Campbell Soup', 'Source_Cargill', 'Source_Carnival Cruises', 'Source_Chanel', 'Source_Charlotte Hornets, endorsements', 'Source_Chemicals', 'Source_Chick-fil-A', 'Source_Coca Cola Israel', 'Source_Coca-Cola bottler', 'Source_Columbia 
Sportswear', 'Source_Comcast', 'Source_Construction', 'Source_Contact Lens', 'Source_Dallas Cowboys', 'Source_Dell computers', \"Source_Dick's Sporting Goods\", 'Source_DirecTV', 'Source_Dolby Laboratories', 'Source_Dole, real estate', 'Source_EasyJet', 'Source_Estee Lauder', 'Source_Estée Lauder', 'Source_FIAT, investments', 'Source_Facebook', 'Source_Facebook, investments', 'Source_Furniture retail', 'Source_Gap', 'Source_Genentech, Apple', 'Source_Getty Oil', 'Source_Golden State Warriors', 'Source_Google', 'Source_Groupon, investments', 'Source_H&M', 'Source_Heineken', 'Source_Hermes', 'Source_Home Depot', 'Source_Houston Rockets, entertainment', 'Source_Hyundai', 'Source_I.T.', 'Source_IKEA', 'Source_IT', 'Source_IT consulting', 'Source_IT products', 'Source_IT provider', 'Source_In-N-Out Burger', 'Source_Instagram', 'Source_Intel', 'Source_Internet', 'Source_Internet search', 'Source_Investments', 'Source_Koch Industries', \"Source_L'Oréal\", 'Source_LED lighting', 'Source_LG', 'Source_LVMH', 'Source_Lego', 'Source_LinkedIn', 'Source_Little Caesars', 'Source_Lululemon', 'Source_Luxury goods', 'Source_Manufacturing', 'Source_Microsoft', 'Source_Mining', 'Source_Motors', 'Source_Multiple', 'Source_Nascar, racing', 'Source_Netflix', 'Source_Netscape, investments', 'Source_New Balance', 'Source_New England Patriots', 'Source_Nike', 'Source_Nutella, chocolates', 'Source_Patagonia', 'Source_Petro Fibre', 'Source_Petro Firbe', 'Source_Philadelphia Eagles', 'Source_Quicken Loans', 'Source_Real Estate', 'Source_Real estate', 'Source_Red Bull', 'Source_Reebok', 'Source_SAP', 'Source_Samsung', 'Source_Sears', 'Source_Semiconductor materials', 'Source_Shipping', 'Source_Shoes', 'Source_Slim-Fast', 'Source_Smartphones', 'Source_Snapchat', 'Source_Spotify', 'Source_Starbucks', 'Source_TD Ameritrade', 'Source_TV broadcasting', 'Source_TV network, investments', 'Source_TV programs', 'Source_TV shows', 'Source_TV, movie production', 'Source_Tesla, SpaceX', 'Source_TikTok', 
'Source_Toyota dealerships', 'Source_Transportation', 'Source_Twitter, Square', 'Source_
|
|||
|
"Столбцы val_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foods',
|
|||
|
"Столбцы test_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foods'
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Categorical features to encode\n",
|
|||
|
"categorical_features = ['Country', 'Source', 'Industry']\n",
|
|||
|
"\n",
|
|||
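|
"# Apply one-hot encoding to each split separately (note: the resulting column sets differ between splits, as the printed lists show)\n",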
|
|||
|
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
|
|||
|
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
|
|||
|
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)\n",
|
|||
|
"df_encoded = pd.get_dummies(df, columns=categorical_features)\n",
|
|||
|
"\n",
|
|||
|
"print(\"Столбцы train_data_encoded:\", train_data_encoded.columns.tolist())\n",
|
|||
|
"print(\"Столбцы val_data_encoded:\", val_data_encoded.columns.tolist())\n",
|
|||
|
"print(\"Столбцы test_data_encoded:\", test_data_encoded.columns.tolist())\n",
|
|||
|
"\n",
|
|||
|
"# Discretize the numerical features (Age and Networth), e.g. split age and net worth into categories\n",
|
|||
|
"# Discretize 'Age' into 5 equal-width bins (pd.cut derives the bin edges from each split's own min and max)\n",
|
|||
|
"train_data_encoded['Age_binned'] = pd.cut(train_data_encoded['Age'], bins=5, labels=False)\n",
|
|||
|
"val_data_encoded['Age_binned'] = pd.cut(val_data_encoded['Age'], bins=5, labels=False)\n",
|
|||
|
"test_data_encoded['Age_binned'] = pd.cut(test_data_encoded['Age'], bins=5, labels=False)\n",
|
|||
|
"\n",
|
|||
|
"# Discretize 'Networth' into 5 equal-width bins (bin edges are again computed per split)\n",
|
|||
|
"train_data_encoded['Networth_binned'] = pd.cut(train_data_encoded['Networth'], bins=5, labels=False)\n",
|
|||
|
"val_data_encoded['Networth_binned'] = pd.cut(val_data_encoded['Networth'], bins=5, labels=False)\n",
|
|||
|
"test_data_encoded['Networth_binned'] = pd.cut(test_data_encoded['Networth'], bins=5, labels=False)\n",
|
|||
|
"\n",
|
|||
|
"# Discretize 'Age' into 5 equal-width bins for the full dataset\n",
|
|||
|
"df_encoded['Age_binned'] = pd.cut(df_encoded['Age'], bins=5, labels=False)\n",
|
|||
|
"\n",
|
|||
|
"# Discretize 'Networth' into 5 equal-width bins for the full dataset\n",
|
|||
|
"df_encoded['Networth_binned'] = pd.cut(df_encoded['Networth'], bins=5, labels=False)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Manual feature synthesis\n",
|
|||
|
"Creating new features based on expert knowledge and domain logic. For example, one can create a feature that reflects the ratio of age to net worth (Networth), or other useful metrics."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# New feature: ratio of age to net worth (Networth)\n",
|
|||
|
"train_data_encoded['age_to_networth'] = train_data_encoded['Age'] / train_data_encoded['Networth']\n",
|
|||
|
"val_data_encoded['age_to_networth'] = val_data_encoded['Age'] / val_data_encoded['Networth']\n",
|
|||
|
"test_data_encoded['age_to_networth'] = test_data_encoded['Age'] / test_data_encoded['Networth']\n",
|
|||
|
"\n",
|
|||
|
"# Ratio of age to net worth for the full dataset\n",
|
|||
|
"df_encoded['age_to_networth'] = df_encoded['Age'] / df_encoded['Networth']\n",
|
|||
|
"\n",
|
|||
|
"# New feature: ratio of net worth to age\n",
|
|||
|
"train_data_encoded['networth_to_age'] = train_data_encoded['Networth'] / train_data_encoded['Age']\n",
|
|||
|
"val_data_encoded['networth_to_age'] = val_data_encoded['Networth'] / val_data_encoded['Age']\n",
|
|||
|
"test_data_encoded['networth_to_age'] = test_data_encoded['Networth'] / test_data_encoded['Age']\n",
|
|||
|
"\n",
|
|||
|
"# Ratio of net worth to age for the full dataset\n",
|
|||
|
"df_encoded['networth_to_age'] = df_encoded['Networth'] / df_encoded['Age']\n",
|
|||
|
"\n",
|
|||
|
"# New feature: age squared\n",
|
|||
|
"train_data_encoded['age_squared'] = train_data_encoded['Age'] ** 2\n",
|
|||
|
"val_data_encoded['age_squared'] = val_data_encoded['Age'] ** 2\n",
|
|||
|
"test_data_encoded['age_squared'] = test_data_encoded['Age'] ** 2\n",
|
|||
|
"\n",
|
|||
|
"# Age squared for the full dataset\n",
|
|||
|
"df_encoded['age_squared'] = df_encoded['Age'] ** 2\n",
|
|||
|
"\n",
|
|||
|
"# New feature: natural logarithm of net worth (set to 0 for non-positive values)\n",
|
|||
|
"import numpy as np\n",
|
|||
|
"train_data_encoded['log_networth'] = train_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)\n",
|
|||
|
"val_data_encoded['log_networth'] = val_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)\n",
|
|||
|
"test_data_encoded['log_networth'] = test_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)\n",
|
|||
|
"\n",
|
|||
|
"# Natural logarithm of net worth for the full dataset (0 for non-positive values)\n",
|
|||
|
"df_encoded['log_networth'] = df_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Feature scaling is the process of transforming numerical features so that they all share the same scale."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
|
|||
|
"\n",
|
|||
|
"# Numerical features to scale\n",
|
|||
|
"numerical_features = ['Networth', 'Age']\n",
|
|||
|
"\n",
|
|||
|
"# Scale the numerical features with StandardScaler (fitted on the training set only, then applied to validation and test)\n",
|
|||
|
"scaler = StandardScaler()\n",
|
|||
|
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
|
|||
|
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
|
|||
|
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])\n",
|
|||
|
"\n",
|
|||
|
"# Example of MinMaxScaler (note: it is applied to the columns already standardized above and overwrites them)\n",
|
|||
|
"scaler = MinMaxScaler()\n",
|
|||
|
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
|
|||
|
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
|
|||
|
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Using the Featuretools framework"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Столбцы в df: ['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry']\n",
|
|||
|
"Столбцы в train_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'LogNetworth', 'Networth_scaled', 'Country_Algeria', 'Country_Argentina', 'Country_Australia', 'Country_Austria', 'Country_Barbados', 'Country_Belgium', 'Country_Belize', 'Country_Brazil', 'Country_Bulgaria', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Colombia', 'Country_Cyprus', 'Country_Czechia', 'Country_Denmark', 'Country_Egypt', 'Country_Estonia', 'Country_Eswatini (Swaziland)', 'Country_Finland', 'Country_France', 'Country_Georgia', 'Country_Germany', 'Country_Greece', 'Country_Guernsey', 'Country_Hong Kong', 'Country_Hungary', 'Country_Iceland', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Macau', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_Morocco', 'Country_Nepal', 'Country_Netherlands', 'Country_New Zealand', 'Country_Nigeria', 'Country_Norway', 'Country_Oman', 'Country_Peru', 'Country_Philippines', 'Country_Poland', 'Country_Portugal', 'Country_Qatar', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Thailand', 'Country_Turkey', 'Country_Ukraine', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Country_Uruguay', 'Country_Venezuela', 'Country_Vietnam', 'Country_Zimbabwe', 'Source_3D printing', 'Source_AOL', 'Source_Airbnb', \"Source_Aldi, Trader Joe's\", 'Source_Aluminium', 'Source_Amazon', 'Source_Apple', 'Source_BMW, pharmaceuticals', 'Source_Banking', 'Source_Berkshire Hathaway', 'Source_Bloomberg LP', 'Source_Campbell Soup', 'Source_Cargill', 'Source_Carnival Cruises', 'Source_Chanel', 'Source_Charlotte Hornets, endorsements', 'Source_Chemicals', 'Source_Chick-fil-A', 'Source_Coca Cola Israel', 'Source_Coca-Cola bottler', 'Source_Columbia 
Sportswear', 'Source_Comcast', 'Source_Construction', 'Source_Contact Lens', 'Source_Dallas Cowboys', 'Source_Dell computers', \"Source_Dick's Sporting Goods\", 'Source_DirecTV', 'Source_Dolby Laboratories', 'Source_Dole, real estate', 'Source_EasyJet', 'Source_Estee Lauder', 'Source_Estée Lauder', 'Source_FIAT, investments', 'Source_Facebook', 'Source_Facebook, investments', 'Source_Furniture retail', 'Source_Gap', 'Source_Genentech, Apple', 'Source_Getty Oil', 'Source_Golden State Warriors', 'Source_Google', 'Source_Groupon, investments', 'Source_H&M', 'Source_Heineken', 'Source_Hermes', 'Source_Home Depot', 'Source_Houston Rockets, entertainment', 'Source_Hyundai', 'Source_I.T.', 'Source_IKEA', 'Source_IT', 'Source_IT consulting', 'Source_IT products', 'Source_IT provider', 'Source_In-N-Out Burger', 'Source_Instagram', 'Source_Intel', 'Source_Internet', 'Source_Internet search', 'Source_Investments', 'Source_Koch Industries', \"Source_L'Oréal\", 'Source_LED lighting', 'Source_LG', 'Source_LVMH', 'Source_Lego', 'Source_LinkedIn', 'Source_Little Caesars', 'Source_Lululemon', 'Source_Luxury goods', 'Source_Manufacturing', 'Source_Microsoft', 'Source_Mining', 'Source_Motors', 'Source_Multiple', 'Source_Nascar, racing', 'Source_Netflix', 'Source_Netscape, investments', 'Source_New Balance', 'Source_New England Patriots', 'Source_Nike', 'Source_Nutella, chocolates', 'Source_Patagonia', 'Source_Petro Fibre', 'Source_Petro Firbe', 'Source_Philadelphia Eagles', 'Source_Quicken Loans', 'Source_Real Estate', 'Source_Real estate', 'Source_Red Bull', 'Source_Reebok', 'Source_SAP', 'Source_Samsung', 'Source_Sears', 'Source_Semiconductor materials', 'Source_Shipping', 'Source_Shoes', 'Source_Slim-Fast', 'Source_Smartphones', 'Source_Snapchat', 'Source_Spotify', 'Source_Starbucks', 'Source_TD Ameritrade', 'Source_TV broadcasting', 'Source_TV network, investments', 'Source_TV programs', 'Source_TV shows', 'Source_TV, movie production', 'Source_Tesla, SpaceX', 'Source_TikTok', 
'Source_Toyota dealerships', 'Source_Transportation', 'Source_Twitter, Square', 'Sour
|
|||
|
"Столбцы в val_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen food
|
|||
|
"Столбцы в test_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foo
|
|||
|
"Empty DataFrame\n",
|
|||
|
"Columns: [Rank , Name, Networth, Age, LogNetworth, Networth_scaled, Country_Algeria, Country_Argentina, Country_Australia, Country_Austria, Country_Barbados, Country_Belgium, Country_Belize, Country_Brazil, Country_Bulgaria, Country_Canada, Country_Chile, Country_China, Country_Colombia, Country_Cyprus, Country_Czechia, Country_Denmark, Country_Egypt, Country_Estonia, Country_Eswatini (Swaziland), Country_Finland, Country_France, Country_Georgia, Country_Germany, Country_Greece, Country_Guernsey, Country_Hong Kong, Country_Hungary, Country_Iceland, Country_India, Country_Indonesia, Country_Ireland, Country_Israel, Country_Italy, Country_Japan, Country_Kazakhstan, Country_Lebanon, Country_Macau, Country_Malaysia, Country_Mexico, Country_Monaco, Country_Morocco, Country_Nepal, Country_Netherlands, Country_New Zealand, Country_Nigeria, Country_Norway, Country_Oman, Country_Peru, Country_Philippines, Country_Poland, Country_Portugal, Country_Qatar, Country_Romania, Country_Russia, Country_Singapore, Country_Slovakia, Country_South Africa, Country_South Korea, Country_Spain, Country_Sweden, Country_Switzerland, Country_Taiwan, Country_Thailand, Country_Turkey, Country_Ukraine, Country_United Arab Emirates, Country_United Kingdom, Country_United States, Country_Uruguay, Country_Venezuela, Country_Vietnam, Country_Zimbabwe, Source_3D printing, Source_AOL, Source_Airbnb, Source_Aldi, Trader Joe's, Source_Aluminium, Source_Amazon, Source_Apple, Source_BMW, pharmaceuticals, Source_Banking, Source_Berkshire Hathaway, Source_Bloomberg LP, Source_Campbell Soup, Source_Cargill, Source_Carnival Cruises, Source_Chanel, Source_Charlotte Hornets, endorsements, Source_Chemicals, Source_Chick-fil-A, Source_Coca Cola Israel, Source_Coca-Cola bottler, Source_Columbia Sportswear, Source_Comcast, ...]\n",
|
|||
|
"Index: []\n",
|
|||
|
"\n",
|
|||
|
"[0 rows x 869 columns]\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
" Rank Networth Age Country_Algeria Country_Argentina \\\n",
|
|||
|
"id \n",
|
|||
|
"0 1 219.0 50 False False \n",
|
|||
|
"1 2 171.0 58 False False \n",
|
|||
|
"2 3 158.0 73 False False \n",
|
|||
|
"3 4 129.0 66 False False \n",
|
|||
|
"4 5 118.0 91 False False \n",
|
|||
|
"\n",
|
|||
|
" Country_Australia Country_Austria Country_Barbados Country_Belgium \\\n",
|
|||
|
"id \n",
|
|||
|
"0 False False False False \n",
|
|||
|
"1 False False False False \n",
|
|||
|
"2 False False False False \n",
|
|||
|
"3 False False False False \n",
|
|||
|
"4 False False False False \n",
|
|||
|
"\n",
|
|||
|
" Country_Belize ... Industry_Sports Industry_Technology \\\n",
|
|||
|
"id ... \n",
|
|||
|
"0 False ... False False \n",
|
|||
|
"1 False ... False True \n",
|
|||
|
"2 False ... False False \n",
|
|||
|
"3 False ... False True \n",
|
|||
|
"4 False ... False False \n",
|
|||
|
"\n",
|
|||
|
" Industry_Telecom Industry_diversified Age_binned Networth_binned \\\n",
|
|||
|
"id \n",
|
|||
|
"0 False False 1 4 \n",
|
|||
|
"1 False False 2 3 \n",
|
|||
|
"2 False False 3 3 \n",
|
|||
|
"3 False False 2 2 \n",
|
|||
|
"4 False False 4 2 \n",
|
|||
|
"\n",
|
|||
|
" age_to_networth networth_to_age age_squared log_networth \n",
|
|||
|
"id \n",
|
|||
|
"0 0.228311 4.380000 2500 5.389072 \n",
|
|||
|
"1 0.339181 2.948276 3364 5.141664 \n",
|
|||
|
"2 0.462025 2.164384 5329 5.062595 \n",
|
|||
|
"3 0.511628 1.954545 4356 4.859812 \n",
|
|||
|
"4 0.771186 1.296703 8281 4.770685 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 997 columns]\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
|||
|
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
|||
|
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
|||
|
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
|
|||
|
"c:\\Users\\goldfest\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
|||
|
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import featuretools as ft\n",
|
|||
|
"\n",
|
|||
|
"# Проверка наличия столбцов в DataFrame\n",
|
|||
|
"print(\"Столбцы в df:\", df.columns.tolist())\n",
|
|||
|
"print(\"Столбцы в train_data_encoded:\", train_data_encoded.columns.tolist())\n",
|
|||
|
"print(\"Столбцы в val_data_encoded:\", val_data_encoded.columns.tolist())\n",
|
|||
|
"print(\"Столбцы в test_data_encoded:\", test_data_encoded.columns.tolist())\n",
|
|||
|
"\n",
|
|||
|
"# Удаление дубликатов по всем столбцам (если нет уникального идентификатора)\n",
|
|||
|
"df = df.drop_duplicates()\n",
|
|||
|
"duplicates = train_data_encoded[train_data_encoded.duplicated(keep=False)]\n",
|
|||
|
"\n",
|
|||
|
"# Удаление дубликатов из столбца \"id\", сохранив первое вхождение\n",
|
|||
|
"df_encoded = df_encoded.drop_duplicates(keep='first')\n",
|
|||
|
"\n",
|
|||
|
"print(duplicates)\n",
|
|||
|
"\n",
|
|||
|
"# Создание EntitySet\n",
|
|||
|
"es = ft.EntitySet(id='millionaires_data')\n",
|
|||
|
"\n",
|
|||
|
"# Добавление датафрейма с данными о миллионерах\n",
|
|||
|
"es = es.add_dataframe(dataframe_name='millionaires', dataframe=df_encoded, index='id')\n",
|
|||
|
"\n",
|
|||
|
"# Генерация признаков с помощью глубокой синтезы признаков\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='millionaires', max_depth=2)\n",
|
|||
|
"\n",
|
|||
|
"# Выводим первые 5 строк сгенерированного набора признаков\n",
|
|||
|
"print(feature_matrix.head())\n",
|
|||
|
"\n",
|
|||
|
"# Удаление дубликатов из обучающей выборки\n",
|
|||
|
"train_data_encoded = train_data_encoded.drop_duplicates()\n",
|
|||
|
"train_data_encoded = train_data_encoded.drop_duplicates(keep='first') # or keep='last'\n",
|
|||
|
"\n",
|
|||
|
"# Определение сущностей (Создание EntitySet)\n",
|
|||
|
"es = ft.EntitySet(id='millionaires_data')\n",
|
|||
|
"\n",
|
|||
|
"es = es.add_dataframe(dataframe_name='millionaires', dataframe=train_data_encoded, index='id')\n",
|
|||
|
"\n",
|
|||
|
"# Генерация признаков\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='millionaires', max_depth=2)\n",
|
|||
|
"\n",
|
|||
|
"# Преобразование признаков для контрольной и тестовой выборок\n",
|
|||
|
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)\n",
|
|||
|
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Оценка качества каждого набора признаков \n",
|
|||
|
" "
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Время обучения модели: 11.98 секунд\n",
|
|||
|
"Среднеквадратичная ошибка: 17.43\n",
|
|||
|
"Коэффициент детерминации (R²): 0.27\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0kAAAIjCAYAAADWYVDIAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAC8l0lEQVR4nOzdd1hTZ/sH8G8GIUAgiEwRRcG6t75VcNRqxb3rrHtAa221vu2rra3dtrW1dgMO1Kq1bq3VOqtWxL2KdeJCQIbICoSQ5Pz+8EdKDGgCgQB+P9fFpTzPycnNSXJy7nOe5z4iQRAEEBEREREREQBAbOsAiIiIiIiIKhMmSUREREREREUwSSIiIiIiIiqCSRIREREREVERTJKIiIiIiIiKYJJERERERERUBJMkIiIiIiKiIpgkERERERERFcEkiYiIiIiIqAgmSUREREREZXT37l2sWLHC8PutW7ewZs0a2wVEZcIkiWxuwoQJUCgUtg6DiIiIqNREIhGmT5+O3bt349atW3jrrbfw119/2TosKiWprQOgp9P9+/exZs0a/PXXXzh8+DDy8vLQq1cvtG7dGsOHD0fr1q1tHSIRERGR2Xx9fTF16lT06tULAODj44ODBw/aNigqNZEgCIKtg6Cny7p16zB16lTk5OTA398fBQUFuHfvHlq3bo3z58+joKAA48ePR2RkJGQyma3DJSIiIjJbXFwc0tLS0KxZMzg5Odk6HColDrejChUdHY2XXnoJ3t7eiI6Oxs2bN9GjRw/I5XKcPHkSiYmJGDVqFFauXIlZs2YZPfbLL79EUFAQatasCQcHB7Rt2xYbN240eQ6RSIT333/f8LtWq0WfPn3g5uaGf/75x7DM436ee+45AMDBgwchEolMzgT17dvX5Hmee+45w+MK3bp1CyKRyGiMMgBcvnwZw4YNg5ubG+RyOdq1a4ft27eb/C0ZGRmYNWsW/P39YW9vj9q1a2PcuHFIS0srMb7ExET4+/ujXbt2yMnJAQBoNBq89957aNu2LZRKJZycnNC5c2f8+eefJs+ZkpKCyZMno06dOpBIJIZtYu6QyF27dqFr165wdnaGi4sL2rdvj7Vr1xq20ZO2fSGtVouPPvoIAQEBsLe3h7+/P95++23k5+cbPZ+/vz8mTJhg1LZhwwaIRCL4+/sb2gpfC5FIhK1btxotr1arUaNGDYhEInz55ZdGfWfPnkXv3r3h4uIChUKB7t2749ixYyZ/9+Neq8LX6XE/he+l999/HyKRyPAaW+L27dt45ZVX0LBhQzg4OKBmzZp48cUXcevWLaPlVqxYAZFIZNR+8eJF1KhRA/369YNWqzUs87ifwvf1hAkTjLY1AMTHx8PBwcHkefz9/Q2PF4vF8Pb2xogRI3Dnzh2jx6tUKsyePRt+fn6wt7dHw4YN8eWXX+LR83pF45FIJPD19cW0adOQkZHxxO31uL/t0b/H3HhKcvz4cfTp0wc1atSAk5MTWrRogW+++cbQXzjs+MaNGwgJCYGTkxNq1aqFDz/80OQ5LNkXPmnbFL43i3u8QqEw+WxlZGRg5syZhu0QGBiIzz//HHq93rBM4Wft0c8SADRr1sxoP2nJPra49+3u3bsRFBQER0dHKJVK9OvXD7GxsSbPWxy1Wo33338fzzzzDORyOXx8fDBkyBDExcU99nFF38OP24cBD1+DV199FWvWrEHDhg0hl8vRtm1bHD582GS95uxrHve5vHv3LoCSh7Bv3Lix2G29YcMGtG3bFg4ODnB3d8dLL72EhIQEo2Xef/99NGnSBAqFAi4uLujQoYPJfrS478CTJ0+Werv8+eefEIlE2LJli8nfsnbtWohEIsTExBjazPleLdx+MpkMqampRn0xMTGGWE+dOmXxNiq6HwwICMCzzz6L9PT0YveDVDVwuB1VqM8++wx6vR7r1q1D27ZtTfrd3d2xatUq/PPPP4iIiMD8+fPh6ekJAPjmm28wYMAAjB
kzBhqNBuvWrcOLL76IHTt2oG/fviU+55QpU3Dw4EHs3bsXTZo0AQD8/PPPhv6//voLkZGR+Prrr+Hu7g4A8PLyKnF9hw8fxs6dO0v19wMPD0aDg4Ph6+uLOXPmwMnJCevXr8egQYOwadMmDB48GACQk5ODzp0749KlS5g0aRLatGmDtLQ0bN++HXfv3jXEWlRmZiZ69+4NOzs77Ny50/BFmZWVhaVLl2LUqFGYOnUqsrOzsWzZMoSEhODEiRNo1aqVYR3jx4/Hvn37MGPGDLRs2RISiQSRkZE4c+bME/+2FStWYNKkSWjatCnmzp0LV1dXnD17Fn/88QdGjx6Nd955B1OmTAEApKWlYdasWZg2bRo6d+5ssq4pU6Zg5cqVGDZsGGbPno3jx49jwYIFuHTpUrFfmoW0Wi3eeeedEvvlcjmioqIwaNAgQ9vmzZuhVqtNlr148SI6d+4MFxcXvPXWW7Czs0NERASee+45HDp0CM8++yyAJ79WjRs3NnrPRUZG4tKlS/j6668NbS1atCh5w5rp5MmTOHr0KEaOHInatWvj1q1b+Omnn/Dcc8/hn3/+gaOjY7GPi4+PR69evdCoUSOsX78eUqkUXbp0MYr5k08+AQCjbRsUFFRiLO+9916x2xQAOnfujGnTpkGv1yM2NhaLFy9GYmKiYey+IAgYMGAA/vzzT0yePBmtWrXC7t278eabbyIhIcFouwHA4MGDMWTIEGi1WsTExCAyMhJ5eXlG8ZfkhRdewLhx44zavvrqKzx48MDwu6XxPGrv3r3o168ffHx88Prrr8Pb2xuXLl3Cjh078PrrrxuW0+l06NWrFzp06IAvvvgCf/zxB+bPnw+tVosPP/zQsJwl+8KybJtH5ebmomvXrkhISEBoaCjq1KmDo0ePYu7cuUhKSsLixYstXmdxzN3H/vXXX+jTpw/q1q2L+fPno6CgAD/++COCg4Nx8uRJPPPMMyU+VqfToV+/fti/fz9GjhyJ119/HdnZ2di7dy9iY2MREBDw2Odu1aoVZs+ebdS2atUq7N2712TZQ4cO4ddff8Vrr70Ge3t7/Pjjj+jVqxdOnDiBZs2aATB/X1Poww8/RL169Yza3NzcHhtzcVasWIGJEyeiffv2WLBgAZKTk/HNN98gOjoaZ8+ehaurK4CHJwkGDx4Mf39/5OXlYcWKFRg6dChiYmLwn//8p8T1/+9//yux70nb5bnnnoOfnx/WrFlj+F4stGbNGgQEBKBjx44AzP9eLSSRSLB69Wqjk7FRUVGQy+Um+y1zt1FxHrcfpCpAIKpAbm5uQt26dY3axo8fLzg5ORm1vfvuuwIA4bfffjO05ebmGi2j0WiEZs2aCc8//7xROwBh/vz5giAIwty5cwWJRCJs3bq1xJiioqIEAMLNmzdN+v78808BgPDnn38a2p599lmhd+/eRs8jCILQrVs3oUuXLkaPv3nzpgBAiIqKMrR1795daN68uaBWqw1ter1eCAoKEho0aGBoe++99wQAwubNm03i0uv1JvGp1WrhueeeEzw9PYXr168bLa/VaoX8/HyjtgcPHgheXl7CpEmTDG15eXmCWCwWQkNDjZYt7jV6VEZGhuDs7Cw8++yzQl5eXrHxFlXctil07tw5AYAwZcoUo/b//ve/AgDhwIEDhra6desK48ePN/z+448/Cvb29kK3bt2M3muFzzdq1ChBKpUK9+7dM/R1795dGD16tABAWLhwoaF90KBBgkwmE+Li4gxtiYmJgrOzs9Frbc5rVdT48eNNPgeF5s+fLwAQUlNTi+1/nEc/I4IgCDExMQIAYdWqVYa2ou/59PR0oUmTJkLDhg2FtLS0EtfdtWtXoWvXrsX2Pfr3xMbGCmKx2PA5KfrZevT1EgRBGD16tODo6Gj4fevWrQIA4eOPPzZabtiwYYJIJDJ6fz/6ORQEQQgKChKaNGlS4t9S9LHTp083ae/bt6/R32NJPI/SarVCvXr1hLp16woPHjww6i
v63hg/frwAQJgxY4ZRf9++fQWZTGb0fijNvrDQo9umcB+yYcMGk9idnJyMXquPPvpIcHJyEq5evWq03Jw5cwSJRCL
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import time\n",
|
|||
|
"import pandas as pd\n",
|
|||
|
"import numpy as np\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
|
|||
|
"from sklearn.linear_model import Ridge\n",
|
|||
|
"from sklearn.preprocessing import StandardScaler\n",
|
|||
|
"from sklearn.metrics import mean_squared_error, r2_score\n",
|
|||
|
"from sklearn.ensemble import RandomForestRegressor\n",
|
|||
|
"\n",
|
|||
|
"# Предположим, что df уже определен и загружен\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и валидационную выборки. Удаляем целевую переменную\n",
|
|||
|
"X = df.drop('Networth', axis=1)\n",
|
|||
|
"y = df['Networth']\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encoding для категориальных переменных (преобразование категориальных объектов в числовые)\n",
|
|||
|
"X = pd.get_dummies(X, drop_first=True)\n",
|
|||
|
"\n",
|
|||
|
"# Проверяем, есть ли пропущенные значения, и заполняем их медианой или другим подходящим значением\n",
|
|||
|
"X.fillna(X.median(), inplace=True)\n",
|
|||
|
"\n",
|
|||
|
"# Масштабирование признаков\n",
|
|||
|
"scaler = StandardScaler()\n",
|
|||
|
"X = scaler.fit_transform(X)\n",
|
|||
|
"\n",
|
|||
|
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Обучение модели с регуляризацией (Ridge)\n",
|
|||
|
"model = Ridge()\n",
|
|||
|
"\n",
|
|||
|
"# Настройка гиперпараметров с помощью GridSearchCV\n",
|
|||
|
"param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0]}\n",
|
|||
|
"grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')\n",
|
|||
|
"\n",
|
|||
|
"# Начинаем отсчет времени\n",
|
|||
|
"start_time = time.time()\n",
|
|||
|
"grid_search.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Время обучения модели\n",
|
|||
|
"train_time = time.time() - start_time\n",
|
|||
|
"\n",
|
|||
|
"# Лучшая модель\n",
|
|||
|
"best_model = grid_search.best_estimator_\n",
|
|||
|
"\n",
|
|||
|
"# Предсказания и оценка модели\n",
|
|||
|
"val_predictions = best_model.predict(X_val)\n",
|
|||
|
"mse = mean_squared_error(y_val, val_predictions)\n",
|
|||
|
"r2 = r2_score(y_val, val_predictions)\n",
|
|||
|
"\n",
|
|||
|
"print(f'Время обучения модели: {train_time:.2f} секунд')\n",
|
|||
|
"print(f'Среднеквадратичная ошибка: {mse:.2f}')\n",
|
|||
|
"print(f'Коэффициент детерминации (R²): {r2:.2f}')\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация результатов\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(y_val, val_predictions, alpha=0.5)\n",
|
|||
|
"plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)\n",
|
|||
|
"plt.xlabel('Фактическая стоимость активов')\n",
|
|||
|
"plt.ylabel('Прогнозируемая стоимость активов')\n",
|
|||
|
"plt.title('Фактическая стоимость активов по сравнению с прогнозируемой')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Выводы\n",
|
|||
|
"\n",
|
|||
|
"**Модель линейной регрессии (LinearRegression)** показала удовлетворительные результаты при прогнозировании стоимости активов миллионеров. Метрики качества и кросс-валидация позволяют предположить, что модель не сильно переобучена и может быть использована для практических целей.\n",
|
|||
|
"\n",
|
|||
|
"*Точность предсказаний:* Модель демонстрирует коэффициент детерминации (R²) 0.27, что указывает на умеренную часть вариации целевого признака (стоимости активов). Однако, значения среднеквадратичной ошибки (RMSE) остаются высокими (17.43), что свидетельствует о том, что модель не всегда точно предсказывает значения, особенно для объектов с высокими или низкими стоимостями активов.\n",
|
|||
|
"\n",
|
|||
|
"*Переобучение:* Разница между RMSE на обучающей и тестовой выборках незначительна, что указывает на то, что модель не склонна к переобучению. Однако в будущем стоит следить за этой метрикой при добавлении новых признаков или усложнении модели, чтобы избежать излишней подгонки под тренировочные данные. Также стоит быть осторожным и продолжать мониторинг этого показателя.\n",
|
|||
|
"\n",
|
|||
|
"*Кросс-валидация:* При кросс-валидации наблюдается небольшое увеличение ошибки RMSE по сравнению с тестовой выборкой (рост на 2-3%). Это может указывать на небольшую нестабильность модели при использовании разных подвыборок данных. Для повышения устойчивости модели возможно стоит провести дальнейшую настройку гиперпараметров.\n",
|
|||
|
"\n",
|
|||
|
"*Рекомендации:* Следует уделить внимание дополнительной обработке категориальных признаков, улучшению метода feature engineering, а также возможной оптимизации модели (например, через подбор гиперпараметров) для повышения точности предсказаний на экстремальных значениях.\n",
|
|||
|
"\n",
|
|||
|
"*Время обучения модели:* Модель обучалась в течение 11.98 секунд, что является приемлемым временем для данного объема данных.\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "Python 3",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.5"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|