{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis and optimization of retail store operations."
]
},
{
"cell_type": "markdown",
"metadata": {
"notebookRunGroups": {
"groupValue": "1"
}
},
"source": [
"2. Problem domain: Analysis and optimization of retail store operations.\n",
"\n",
"The data contains information about stores, such as floor area, number of available items, daily customer count, and sales.\n",
"\n",
"3. Objects of observation: Stores.\n",
"\n",
"4. Business goal: Optimize the distribution of goods across stores to increase sales.\n",
"\n",
"Business effect: Higher profit through more efficient use of floor area and inventory.\n",
"\n",
"5. Technical goal: Build a model that predicts a store's sales from its area, number of items, and number of customers."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Load the data\n",
"stores_pd = pd.read_csv('static/csv/Stores.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data visualization using scatter plots"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Scatter plot of Store_Area vs Store_Sales\n",
"plt.subplot(2, 2, 1)\n",
"sns.scatterplot(x='Store_Area', y='Store_Sales', data=stores_pd)\n",
"plt.title('Store_Area vs Store_Sales')\n",
"\n",
"# Split into training, validation, and test sets\n",
"train_df, test_df = train_test_split(stores_pd, test_size=0.4, random_state=42)\n",
"val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)\n",
"\n",
"# Scatter plot of Items_Available vs Store_Sales\n",
"plt.subplot(2, 2, 2)\n",
"sns.scatterplot(x='Items_Available', y='Store_Sales', data=stores_pd)\n",
"plt.title('Items_Available vs Store_Sales')\n",
"\n",
"# Scatter plot of Daily_Customer_Count vs Store_Sales\n",
"plt.subplot(2, 2, 3)\n",
"sns.scatterplot(x='Daily_Customer_Count', y='Store_Sales', data=stores_pd)\n",
"plt.title('Daily_Customer_Count vs Store_Sales')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
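{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two `train_test_split` calls in the cell above first hold out 40% of the rows and then halve that hold-out, which should give roughly a 60/20/20 split into training, validation, and test sets. A minimal sketch to check the proportions, assuming a hypothetical 100-row frame `demo_df` in place of `stores_pd`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical stand-in for stores_pd: 100 rows, one column\n",
"demo_df = pd.DataFrame({'x': range(100)})\n",
"\n",
"# Hold out 40% first, then split the hold-out in half\n",
"train_df, holdout_df = train_test_split(demo_df, test_size=0.4, random_state=42)\n",
"val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)\n",
"\n",
"print(len(train_df), len(val_df), len(test_df))\n",
"```\n",
"\n",
"On 100 rows this prints `60 20 20`; on the real data the sizes are rounded, so the shares are only approximately 60/20/20."
]
},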
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outliers are detected and removed using the interquartile range (IQR) method. A point is treated as an outlier if it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."
]
},
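{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked illustration of the IQR rule (a sketch on made-up numbers, not taken from `Stores.csv`), the bounds can be computed by hand with pandas. The series `toy` and its values are assumptions chosen only to make the arithmetic easy to follow.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample: eight ordinary values and one extreme value\n",
"toy = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 100])\n",
"\n",
"q1 = toy.quantile(0.25)\n",
"q3 = toy.quantile(0.75)\n",
"iqr = q3 - q1\n",
"lower_bound = q1 - 1.5 * iqr\n",
"upper_bound = q3 + 1.5 * iqr\n",
"\n",
"# Keep only the points inside [lower_bound, upper_bound]\n",
"kept = toy[(toy >= lower_bound) & (toy <= upper_bound)]\n",
"print(lower_bound, upper_bound, len(kept))\n",
"```\n",
"\n",
"With these numbers the quartiles are 13 and 17, so the bounds are 7 and 23 and only the value 100 is dropped."
]
},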
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def remove_outliers(df, column):\n",
"    # Keep only the rows whose `column` value lies within the IQR bounds\n",
"    Q1 = df[column].quantile(0.25)\n",
"    Q3 = df[column].quantile(0.75)\n",
"    IQR = Q3 - Q1\n",
"    lower_bound = Q1 - 1.5 * IQR\n",
"    upper_bound = Q3 + 1.5 * IQR\n",
"    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
"\n",
"# Remove outliers column by column\n",
"stores_cleaned = stores_pd.copy()\n",
"stores_cleaned = remove_outliers(stores_cleaned, 'Store_Area')\n",
"stores_cleaned = remove_outliers(stores_cleaned, 'Items_Available')\n",
"stores_cleaned = remove_outliers(stores_cleaned, 'Daily_Customer_Count')\n",
"stores_cleaned = remove_outliers(stores_cleaned, 'Store_Sales')\n",
"\n",
"# Visualize the cleaned data\n",
"plt.figure(figsize=(15, 10))\n",
"\n",
"# Scatter plot of Store_Area vs Store_Sales\n",
"plt.subplot(2, 2, 1)\n",
"sns.scatterplot(x='Store_Area', y='Store_Sales', data=stores_cleaned)\n",
"plt.title('Store_Area vs Store_Sales (Cleaned)')\n",
"\n",
"# Scatter plot of Items_Available vs Store_Sales\n",
"plt.subplot(2, 2, 2)\n",
"sns.scatterplot(x='Items_Available', y='Store_Sales', data=stores_cleaned)\n",
"plt.title('Items_Available vs Store_Sales (Cleaned)')\n",
"\n",
"# Scatter plot of Daily_Customer_Count vs Store_Sales\n",
"plt.subplot(2, 2, 3)\n",
"sns.scatterplot(x='Daily_Customer_Count', y='Store_Sales', data=stores_cleaned)\n",
"plt.title('Daily_Customer_Count vs Store_Sales (Cleaned)')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis and forecasting of countries' economic indicators."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Problem domain: Analysis and forecasting of countries' economic indicators.\n",
"\n",
"The data contains economic indicators for 9 countries from 1980 to 2020, including indices, inflation, oil prices, exchange rates, and other indicators.\n",
"\n",
"3. Objects of observation: Countries.\n",
"\n",
"4. Business goal: Forecast economic indicators to support investment decisions.\n",
"\n",
"Business effect: Higher investment returns through more accurate forecasting of market conditions.\n",
"\n",
"5. Technical goal: Develop a model that forecasts inflation from historical data."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Load the data\n",
"economic_df = pd.read_csv('static/csv/Economic Data - 9 Countries (1980-2020).csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data visualization using scatter plots"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Scatter plot of index price vs log_indexprice\n",
"plt.subplot(2, 2, 1)\n",
"sns.scatterplot(x='index price', y='log_indexprice', data=economic_df)\n",
"plt.title('Index Price vs Log Index Price')\n",
"\n",
"# Scatter plot of inflationrate vs index price\n",
"plt.subplot(2, 2, 2)\n",
"sns.scatterplot(x='inflationrate', y='index price', data=economic_df)\n",
"plt.title('Inflation Rate vs Index Price')\n",
"\n",
"# Split into training, validation, and test sets\n",
"train_df, test_df = train_test_split(economic_df, test_size=0.4, random_state=42)\n",
"val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)\n",
"\n",
"# Scatter plot of oil prices vs index price\n",
"plt.subplot(2, 2, 3)\n",
"sns.scatterplot(x='oil prices', y='index price', data=economic_df)\n",
"plt.title('Oil Prices vs Index Price')\n",
"\n",
"# Scatter plot of exchange_rate vs index price\n",
"plt.subplot(2, 2, 4)\n",
"sns.scatterplot(x='exchange_rate', y='index price', data=economic_df)\n",
"plt.title('Exchange Rate vs Index Price')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outliers are detected and removed using the interquartile range (IQR) method. A point is treated as an outlier if it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def remove_outliers(df, column):\n",
"    # Keep only the rows whose `column` value lies within the IQR bounds\n",
"    Q1 = df[column].quantile(0.25)\n",
"    Q3 = df[column].quantile(0.75)\n",
"    IQR = Q3 - Q1\n",
"    lower_bound = Q1 - 1.5 * IQR\n",
"    upper_bound = Q3 + 1.5 * IQR\n",
"    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]\n",
"\n",
"# Remove outliers column by column\n",
"economic_cleaned = economic_df.copy()\n",
"economic_cleaned = remove_outliers(economic_cleaned, 'index price')\n",
"economic_cleaned = remove_outliers(economic_cleaned, 'log_indexprice')\n",
"economic_cleaned = remove_outliers(economic_cleaned, 'inflationrate')\n",
"economic_cleaned = remove_outliers(economic_cleaned, 'oil prices')\n",
"economic_cleaned = remove_outliers(economic_cleaned, 'exchange_rate')\n",
"\n",
"# Visualize the cleaned data\n",
"plt.figure(figsize=(15, 10))\n",
"\n",
"# Scatter plot of index price vs log_indexprice\n",
"plt.subplot(2, 2, 1)\n",
"sns.scatterplot(x='index price', y='log_indexprice', data=economic_cleaned)\n",
"plt.title('Index Price vs Log Index Price (Cleaned)')\n",
"\n",
"# Scatter plot of inflationrate vs index price\n",
"plt.subplot(2, 2, 2)\n",
"sns.scatterplot(x='inflationrate', y='index price', data=economic_cleaned)\n",
"plt.title('Inflation Rate vs Index Price (Cleaned)')\n",
"\n",
"# Scatter plot of oil prices vs index price\n",
"plt.subplot(2, 2, 3)\n",
"sns.scatterplot(x='oil prices', y='index price', data=economic_cleaned)\n",
"plt.title('Oil Prices vs Index Price (Cleaned)')\n",
"\n",
"# Scatter plot of exchange_rate vs index price\n",
"plt.subplot(2, 2, 4)\n",
"sns.scatterplot(x='exchange_rate', y='index price', data=economic_cleaned)\n",
"plt.title('Exchange Rate vs Index Price (Cleaned)')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis of the labor market for data professionals."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Problem domain: Analysis of the labor market for data professionals.\n",
"\n",
"The data contains salary information for data professionals, such as experience level, employment type, salary in US dollars, and other attributes.\n",
"\n",
"3. Objects of observation: Data professionals.\n",
"\n",
"4. Business goal: Analyze the labor market for data professionals in order to optimize HR policy.\n",
"\n",
"Business effect: Better attraction and retention of talent through more competitive pay.\n",
"\n",
"5. Technical goal: Build a model that predicts a data professional's salary from their experience level and employment type."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Load the data\n",
"salaries_df = pd.read_csv('static/csv/ds_salaries.csv')\n",
"\n",
"# Select the column to analyze (here, salary_in_usd)\n",
"data = salaries_df['salary_in_usd']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data visualization using scatter plots"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Draw the scatter plot\n",
"plt.figure(figsize=(10, 6))\n",
"sns.scatterplot(x=range(len(data)), y=data)\n",
"plt.title('Scatter plot of salary_in_usd')\n",
"plt.xlabel('Index')\n",
"plt.ylabel('Salary in USD')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Определение и удаление выбросов происходит с помощь метода межквартильного размаха IQR. Выбросами считаются точки, которые лежат за пределами Q1 - 1.5 * IQR и Q3 + 1.5 * IQR."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Количество выбросов: 63\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA2wAAAIjCAYAAAB/FZhcAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOydd3hT9f7H323aNEnbpBuKUCgUmWUPgTJkiAoqiAv9XYGq1yvDgQNRlqBycaECrqsMr4CKIlxBcbAERL1AEQqClFWuhZaWNqVJ07TJ+f1RzuHk5MzkJDlpv6/n4XlocnK+e3++708ERVEUCAQCgUAgEAgEAoGgOSJDHQECgUAgEAgEAoFAIPBDFmwEAoFAIBAIBAKBoFHIgo1AIBAIBAKBQCAQNApZsBEIBAKBQCAQCASCRiELNgKBQCAQCAQCgUDQKGTBRiAQCAQCgUAgEAgahSzYCAQCgUAgEAgEAkGjkAUbgUAgEAgEAoFAIGgUsmAjEAgEAoFAIBAksNlsOHfuHMrLy0MdFUIjgyzYCAQCgdBoOHPmDCIiIrBy5cpQR4WXHTt2ICIiAjt27Ah1VHxm4sSJaNWqVaijQSCowrp16zBs2DDEx8cjLi4OGRkZeOWVV0IdLUIjgyzYCI2WL774AhEREbz/OnfuHOroEQgEAoFACCHPPvss7rrrLsTHx+Nf//oXfvjhB/z444+YPHlyqKNGaGREhToCBEKoee6559ChQwfm75deeimEsSEQCI2ZQYMGobq6Gnq9PtRRIRAaNTt37sSiRYuwcOFCPPvss6GODqGRQxZshEbPiBEjMGTIEObvDz/8EKWlpaGLEIFACBtsNhtiY2NVe19kZCQMBoNq7yMQCL7x2muvoX///mSxRtAExCSS0GhxOp0A6idIUqxcuRIRERE4c+YM85nb7UaXLl287sMcOnQIEydOROvWrWEwGNC0aVPk5uairKzM453z5s3jNceMirq6jzJkyBB07twZ+/fvR//+/WE0GpGZmYn33nvPKy1z5sxBz549YbFYEBsbi4EDB2L79u0ez9H3dyIiIrBhwwaP7xwOBxITExEREYHXXnvNK55paWmora31+M3atWuZ97EXuRs3bsSoUaPQrFkzxMTEoE2bNliwYAFcLpdkXtPhHTt2DHfddRfMZjOSk5Px2GOPweFweDy7YsUKDB06FGlpaYiJiUHHjh3x7rvv8r7322+/xeDBgxEfHw+z2YzevXtjzZo1Hs/8+uuvuPnmm5GYmIjY2Fh06dIFb731lsczx44dwx133IGkpCQYDAb06tUL//nPf7zCy8vLw4033ojU1FSP8h09ejTzDF2v9u3b5/Hb0tJSREREYN68eV75wqaqqgpNmzblvfP07rvvonPnzjCZTB7hf/HFF7z5w2XIkCG89ZMdJ246uP/YGyFy80SITz/9FD179mTKLzs726NsLl26hKeeegrZ2dmIi4uD2WzGTTfdhN9//13y3Urb7NGjR3HvvfciMTEROTk5WLFiBSIiIpCXl+f17pdffhk6nQ5//fWXZDwA/jtsdD9w9OhRXH/99TCZTLjmmmt8ukcjVIatWrXCxIkTmb9ra2vxwgsvoG3btjAYDEhOTkZOTg5++OEHj99t2LABnTt3hsFgQOfOnfHVV1/JjkurVq0EzdK5db2urg4LFixAmzZtEBMTg1atWuG5555DTU2N13vltHV2XygWrtvtxptvvolOnTrBYDCgSZMmePjhhyVFJ3ypE0LxYdcFuX09jdA4wy5rAPjrr7+Qm5uLJk2aICYmBp06dcLy5cs9nqHrJl8fEhcX5/FOJWMmUH9FoVevXoiPj/eIJ3ss4oPb95hMJmRnZ+PDDz/0eG7ixImIi4sTfRe3bfzyyy/o3Lkz7rnnHiQlJcFoNKJ3795eYyedL5999hmee+45NG3aFLGxsbj11ltx7tw5r3DWrVuHnj17wmg0IiUlBf/3f//H2z/QY2Bqai
qMRiPatWuH559/3uOZvLw83HTTTTCbzYiLi8OwYcPwyy+/+JRHBG1DTtgIjRZ6wRYTE+PT7//973/j8OHDXp//8MMPOHXqFCZNmoSmTZviyJEj+OCDD3DkyBH88ssvXhOCd99912Mg4S4gy8vLcfPNN+Ouu+7C+PHj8fnnn+ORRx6BXq9Hbm4uAKCyshIffvghxo8fj4ceegiXL1/GRx99hJEjR+K3335Dt27dPN5pMBiwYsUKjBkzhvls/fr1XgsiNpcvX8amTZswduxY5rMVK1bAYDB4/W7lypWIi4vD9OnTERcXh23btmHOnDmorKzEq6++KhgGm7vuugutWrXCwoUL8csvv+Dtt99GeXk5Pv74Y4+869SpE2699VZERUXh66+/xuTJk+F2uzFlyhSP+OTm5qJTp06YOXMmEhISkJeXhy1btuDee+8FUF9uo0ePRnp6Oh577DE0bdoUf/zxBzZt2oTHHnsMAHDkyBEMGDAA11xzDZ599lnExsbi888/x5gxY/Dll18yeWO1WnHTTTeBoihMnz4dLVq0AAA88cQTstIul9dffx3FxcVen3/22WeYPHkyhgwZgmnTpiE2NhZ//PEHXn75ZUXvb968ORYuXAigfnH4yCOPiD6/ePFipKSkAPA2LfYnT3744QeMHz8ew4YNw6JFiwAAf/zxB/bs2cOUzalTp7BhwwbceeedyMzMRHFxMd5//30MHjwYR48eRbNmzUTfr6TN3nnnnWjbti1efvllUBSFO+64A1OmTMHq1avRvXt3j2dXr16NIUOG4JprrpFMpxjl5eW48cYbcfvtt+Ouu+7CF198gRkzZiA7Oxs33XSTX+/mY968eVi4cCEefPBB9OnTB5WVldi3bx8OHDiAESNGAAC+//57jBs3Dh07dsTChQtRVlaGSZMmoXnz5rLD6datG5588kmPzz7++GOvheGDDz6IVatW4Y477sCTTz6JX3/9FQsXLsQff/zhsUiU09bZ/P3vf8fAgQMB1PeB3AXnww8/jJUrV2LSpEl49NFHcfr0aSxduhR5eXnYs2cPoqOjedPla50YMWIE7r//fgDAf//7X7z99tse3yvt62n+/e9/M//ntrni4mJcd911iIiIwNSpU5Gamopvv/0WDzzwACorK/H444/zvlMpQmPm3r17cdddd6Fr16745z//CYvFgtLSUkX9Jd33VFZWYvny5XjooYfQqlUrDB8+3Of4lpWV4YMPPkBcXBweffRRpKam4pNPPsHtt9+O1atXY/z48R7Pv/TSS4iIiMCMGTNQUlKCN998E8OHD8fBgwdhNBoBgKlLvXv3xsKFC1FcXIy33noLe/bsQV5eHhISEgDUbyINHDgQ0dHR+Pvf/45WrVrh5MmT+Prrr5m+9ciRIxg4cCDMZjOeeeYZREdH4/3338eQIUOwc+dO9O3bN+B5RAgiFIHQSHnzzTcpANTvv//u8fngwYOpTp06eXy2YsUKCgB1+vRpiqIoyuFwUBkZGdRNN91EAaBWrFjBPGu3273CWrt2LQWA+umnn5jP5s6dSwGgLl68KBjHwYMHUwCo119/nfmspqaG6tatG5WWlkY5nU6Koiiqrq6Oqqmp8fhteXk51aRJEyo3N5f57PTp0xQAavz48VRUVBR14cIF5rthw4ZR9957LwWAevXVV73iOX78eGr06NHM52fPnqUiIyOp8ePHe6WDLw8efvhhymQyUQ6HQzC97PBuvfVWj88nT57sVV584YwcOZJq3bo183dFRQUVHx9P9e3bl6qurvZ41u12UxRVn3+ZmZlUy5YtqfLyct5nKKo+j7Kzsz3S4Ha7qf79+1Nt27ZlPvvuu+8oANTatWs93tWyZUtq1KhRzN90vfrvf//r8dzFixcpANTcuXO98oWmpKSEio+PZ+rg9u3bme/Gjx9PJSQkeKR3+/btFABq3bp1XnnGR//+/anOnTuLxonmX//6FwWAOnv2LPPZ4MGDqcGDBzN/y80TPh
577DHKbDZTdXV1gs84HA7K5XJ5fHb69GkqJiaGmj9/vsdn/rbZ8ePHez0/fvx4qlmzZh5xOHDggFdYUtDlxC5Puh/
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Определение выбросов с использованием межквартильного размаха (IQR)\n",
"Q1 = data.quantile(0.25)\n",
"Q3 = data.quantile(0.75)\n",
"IQR = Q3 - Q1\n",
"\n",
"# Выбросы — это значения, которые лежат за пределами 1.5 * IQR от Q1 и Q3\n",
"outliers = data[(data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))]\n",
"print(f'Количество выбросов: {len(outliers)}')\n",
"\n",
2024-10-12 10:08:57 +04:00
"# Разбиение на обучающую, контрольную и тестовую выборки\n",
"train_df, test_df = train_test_split(salaries_df, test_size=0.4, random_state=42)\n",
"val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)\n",
"\n",
2024-10-12 09:48:37 +04:00
"# Удаление выбросов\n",
"filtered_data = data[(data >= (Q1 - 1.5 * IQR)) & (data <= (Q3 + 1.5 * IQR))]\n",
"\n",
"# Построение диаграммы рассеяния после удаления выбросов\n",
"plt.figure(figsize=(10, 6))\n",
"sns.scatterplot(x=range(len(filtered_data)), y=filtered_data)\n",
"plt.title('Диаграмма рассеяния для salary_in_usd после удаления выбросов')\n",
"plt.xlabel('Индекс')\n",
"plt.ylabel('Зарплата в USD')\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}