{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start of the lab work"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Coffee prices (variant 12)\n",
"2. Stock prices (variant 13)\n",
"3. Gold prices (variant 14)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Coffee prices"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object') \n",
"\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Load the Starbucks daily price history (absolute path from the author's machine)\n",
"df = pd.read_csv(\"C:/Users/TIGR228/Desktop/МИИ/Lab1/AIM-PIbd-31-Afanasev-S-S/static/csv/Starbucks.csv\")\n",
"\n",
"# List the available columns\n",
"print(df.columns, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Column descriptions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Date: trading date\n",
"2. Open: opening price\n",
"3. High: highest price of the day\n",
"4. Low: lowest price of the day\n",
"5. Close: closing price\n",
"6. Adj Close: adjusted closing price\n",
"7. Volume: trading volume\n",
"\n",
"Problem domain: forecasting Starbucks stock price dynamics from historical price and trading volume data.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date Open High Low Close Adj Close \\\n",
"0 1992-06-26 0.328125 0.347656 0.320313 0.335938 0.260703 \n",
"1 1992-06-29 0.339844 0.367188 0.332031 0.359375 0.278891 \n",
"2 1992-06-30 0.367188 0.371094 0.343750 0.347656 0.269797 \n",
"3 1992-07-01 0.351563 0.359375 0.339844 0.355469 0.275860 \n",
"4 1992-07-02 0.359375 0.359375 0.347656 0.355469 0.275860 \n",
"... ... ... ... ... ... ... \n",
"8031 2024-05-17 75.269997 78.000000 74.919998 77.849998 77.849998 \n",
"8032 2024-05-20 77.680000 78.320000 76.709999 77.540001 77.540001 \n",
"8033 2024-05-21 77.559998 78.220001 77.500000 77.720001 77.720001 \n",
"8034 2024-05-22 77.699997 81.019997 77.440002 80.720001 80.720001 \n",
"8035 2024-05-23 80.099998 80.699997 79.169998 79.260002 79.260002 \n",
"\n",
" Volume \n",
"0 224358400 \n",
"1 58732800 \n",
"2 34777600 \n",
"3 18316800 \n",
"4 13996800 \n",
"... ... \n",
"8031 14436500 \n",
"8032 11183800 \n",
"8033 8916600 \n",
"8034 22063400 \n",
"8035 4651418 \n",
"\n",
"[8036 rows x 7 columns] \n",
"\n"
]
}
],
"source": [
"# Preview the first and last rows; df.info() would report dtypes and non-null counts instead\n",
"print(df, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The object of observation is the Starbucks stock price.<br>\n",
"Attributes: the dataset describes Starbucks stock prices with the following fields: date, opening price, daily high, daily low, closing price, adjusted closing price, and trading volume."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"\n",
"# Scatter of closing price against trading volume, colored by closing price\n",
"plt.scatter(df['Volume'], df['Close'], c=df['Close'], alpha=0.6)\n",
"plt.colorbar(label='Close Price')\n",
"\n",
"plt.title(\"Closing price vs. trading volume\")\n",
"plt.ylabel(\"Closing price\")\n",
"plt.xlabel(\"Trading volume\")\n",
"plt.grid(visible=True)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Extract the calendar year from the date string\n",
"df['Year'] = pd.to_datetime(df['Date']).dt.year\n",
"\n",
"# Mean closing price for each year\n",
"year_close = df.groupby('Year')['Close'].mean().reset_index()\n",
"\n",
"plt.figure(figsize=(12, 6))\n",
"\n",
"plt.plot(year_close['Year'], year_close['Close'], marker='.')\n",
"\n",
"plt.title(\"Mean Starbucks closing price by year\")\n",
"plt.xlabel(\"Year\")\n",
"plt.ylabel(\"Mean closing price\")\n",
"\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The attributes are related to each other. Price is related to almost all characteristics of the stock. For example, the first plot shows the relationship between closing price and trading volume, and the second plot shows how the mean closing price depends on the year."
]
},
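{
"cell_type": "markdown",
"metadata": {},
"source": [
"The relationship described above can be quantified with correlation coefficients. The cell below is a minimal sketch, assuming `df` is the frame loaded earlier with its original column names; `price_volume_cols` and `corr_with_close` are illustrative names that do not appear elsewhere in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: Pearson correlation of each price/volume column with Close.\n",
"# Assumes `df` is the DataFrame loaded from Starbucks.csv above.\n",
"price_volume_cols = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']\n",
"corr_with_close = df[price_volume_cols].corr()['Close'].sort_values(ascending=False)\n",
"print(corr_with_close)"
]
},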
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Example business goals</h3>\n",
"\n",
"1. Forecast Starbucks stock price dynamics from historical price and trading volume data.\n",
"2. Track how Starbucks stock prices change over the years.\n",
"\n",
"Business effect: price assessment and optimization, cost assessment and planning, identification of market trends, and strategic planning.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Technical project goals</h3>\n",
"For the first goal:\n",
"<ul>\n",
"<li>Input: historical price and trading volume data</li>\n",
"<li>Target feature: closing price</li>\n",
"</ul>\n",
"For the second goal:\n",
"<ul>\n",
"<li>Input: historical price and trading volume data</li>\n",
"<li>Target feature: year</li>\n",
"</ul>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>The code below identifies problems in the data</h3>"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Maximum values:\n",
" Date 2024-05-23\n",
"Open 126.080002\n",
"High 126.32\n",
"Low 124.809998\n",
"Close 126.059998\n",
"Adj Close 118.010414\n",
"Volume 585508800\n",
"Year 2024\n",
"dtype: object \n",
"\n",
"Columns containing zero values:\n",
" Index([], dtype='object') \n",
"\n",
"Features with low variance:\n",
" Series([], dtype: float64) \n",
"\n",
"Years:\n",
" 0 1992\n",
"1 1992\n",
"2 1992\n",
"3 1992\n",
"4 1992\n",
" ... \n",
"8031 2024\n",
"8032 2024\n",
"8033 2024\n",
"8034 2024\n",
"8035 2024\n",
"Name: Year, Length: 8036, dtype: int32\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Per-column maxima, as a quick look at possible outliers\n",
"max_value = df.max(axis=0)\n",
"\n",
"# Columns with at least one value equal to 0 (this does not detect NaN)\n",
"columns_with_zero = df.columns[(df == 0).any()]\n",
"\n",
"# Numeric columns whose variance is below the threshold\n",
"numeric_data = df.select_dtypes(include='number')\n",
"shum = numeric_data.var()\n",
"low_dispers = 0.1\n",
"low_var_columns = shum[shum < low_dispers]\n",
"\n",
"df['Year'] = pd.to_datetime(df['Date']).dt.year\n",
"print(\"Maximum values:\\n\", max_value, \"\\n\")\n",
"print(\"Columns containing zero values:\\n\", columns_with_zero, \"\\n\")\n",
"print(\"Features with low variance:\\n\", low_var_columns, \"\\n\")\n",
"print(\"Years:\\n\", df['Year'])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Findings from the output above:</h4><ul> <li>No column contains zero values. This check looks for zeros, not NaN, so it does not by itself confirm that there are no missing values.</li> <li>Maximum values of the columns: <ul> <li>Date: 2024-05-23</li> <li>Open: 126.080002</li> <li>High: 126.32</li> <li>Low: 124.809998</li> <li>Close: 126.059998</li> <li>Adj Close: 118.010414</li> <li>Volume: 585508800</li> <li>Year: 2024</li> </ul> </li> <li>No feature has a variance below the 0.1 threshold, so no feature is near-constant; this check alone does not measure noise.</li> <li>Years range from 1992 to 2024, which is useful for analyzing trends and changes over a long period. However, if the data includes future dates (for example, 2024), this may indicate a problem with data currency or data leakage.</li> <li>Outliers: the maximum values of some columns (for example, Volume) may indicate outliers that can distort analysis and modeling.</li> <li>Bias: the absence of zero-valued columns and low-variance features suggests no obvious bias problems. A more precise assessment requires further checks, such as comparing feature distributions between the training and test sets.</li> <li>Data leakage: future dates (for example, 2024) may indicate data leakage if such rows are used to predict future events, which can lead to incorrect modeling results.</li> </ul>"
]
},
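{
"cell_type": "markdown",
"metadata": {},
"source": [
"The zero-value check above compares values with 0 and does not look at NaN, so missing values can be counted directly. The cell below is a minimal sketch, assuming `df` is the frame loaded earlier; `missing_per_column` is an illustrative name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: count missing (NaN) values per column.\n",
"# Assumes `df` is the DataFrame loaded from Starbucks.csv above.\n",
"missing_per_column = df.isna().sum()\n",
"\n",
"# Show only the columns that actually have missing values (empty if there are none)\n",
"print(missing_per_column[missing_per_column > 0])"
]
},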
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Examples of fixing problems in the dataset</h3>\n",
"<ol>\n",
"<li>Remove outliers based on price values or Volume</li>\n",
"<li>Remove or update outdated dates, since the presence of future dates may indicate a problem with data currency</li>\n",
"</ol>\n"
]
},
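{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common way to remove Volume outliers is the interquartile range (IQR) rule. The cell below is a minimal sketch, assuming `df` is the frame loaded earlier; the 1.5 multiplier is a conventional choice rather than something this notebook specifies, and `is_outlier` and `df_no_outliers` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: flag Volume outliers with the 1.5 * IQR rule (the multiplier is an assumption).\n",
"q1 = df['Volume'].quantile(0.25)\n",
"q3 = df['Volume'].quantile(0.75)\n",
"iqr = q3 - q1\n",
"lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr\n",
"\n",
"# Rows whose Volume falls outside [lower, upper] are treated as outliers\n",
"is_outlier = ~df['Volume'].between(lower, upper)\n",
"df_no_outliers = df[~is_outlier]\n",
"\n",
"print(f'Volume outliers: {is_outlier.sum()} of {len(df)} rows')"
]
},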
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Data quality assessment</h3>\n",
"1. Informativeness. The dataset provides enough information to analyze Starbucks stock prices.\n",
"2. Coverage. The dataset covers only one company, Starbucks, and includes no information about other companies.\n",
"3. Realism. The data looks realistic, apart from a few rare outliers.\n",
"4. Label consistency. The columns (Date, Open, High, Low, Close, Adj Close, Volume) have clear, unambiguous meanings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Splitting the data into training, validation, and test sets</h3>"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original size: 8036 rows\n",
"Training set size: 5625 rows\n",
"Validation set size: 1205 rows\n",
"Test set size: 1206 rows\n"
]
}
],
"source": [
"df_numeric = df.select_dtypes(include='number')\n",
"\n",
"# Features: all numeric columns except the target; target: closing price\n",
"x = df_numeric.drop(['Close'], axis=1)\n",
"y = df_numeric['Close']\n",
"\n",
"# Hold out 30% of the rows, then split that part evenly into validation and test\n",
"x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=0.3, random_state=14)\n",
"\n",
"x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=0.5, random_state=14)\n",
"\n",
"print(f\"Original size: {df_numeric.shape[0]} rows\")\n",
"print(f\"Training set size: {x_train.shape[0]} rows\")\n",
"print(f\"Validation set size: {x_val.shape[0]} rows\")\n",
"print(f\"Test set size: {x_test.shape[0]} rows\")\n"
]
},
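{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split above shuffles rows at random. For time-ordered prices, a chronological split is a common alternative that keeps later rows out of the training set. The cell below is a minimal sketch, assuming `df` still has the `Date` column in `YYYY-MM-DD` format; the 70/15/15 proportions mirror the previous cell, and `df_sorted`, `train_part`, `val_part` and `test_part` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: chronological 70/15/15 split (an alternative to the random split above).\n",
"# ISO-formatted date strings sort in chronological order.\n",
"df_sorted = df.sort_values('Date').reset_index(drop=True)\n",
"\n",
"n_rows = len(df_sorted)\n",
"train_end = int(n_rows * 0.70)\n",
"val_end = int(n_rows * 0.85)\n",
"\n",
"# Earliest rows for training, the next block for validation, the latest rows for testing\n",
"train_part = df_sorted.iloc[:train_end]\n",
"val_part = df_sorted.iloc[train_end:val_end]\n",
"test_part = df_sorted.iloc[val_end:]\n",
"\n",
"print(len(train_part), len(val_part), len(test_part))"
]
},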
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Summary statistics for the training set:\n",
"Mean: 2.53\n",
"Standard deviation: 1.54\n",
"Minimum: -1.06\n",
"Maximum: 4.84\n",
"Number of observations: 5625\n",
"\n",
"Summary statistics for the validation set:\n",
"Mean: 2.44\n",
"Standard deviation: 1.58\n",
"Minimum: -1.02\n",
"Maximum: 4.79\n",
"Number of observations: 1205\n",
"\n",
"Summary statistics for the test set:\n",
"Mean: 2.49\n",
"Standard deviation: 1.57\n",
"Minimum: -1.09\n",
"Maximum: 4.81\n",
"Number of observations: 1206\n",
"\n"
]
}
],
"source": [
"import seaborn as sns\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Log-transform the target variable\n",
"df['Close_log'] = np.log(df['Close'])\n",
"\n",
"# Select the features and the target\n",
"X = df.drop(['Close', 'Close_log'], axis=1)\n",
"y = df['Close_log']\n",
"\n",
"# Keep only the numeric features\n",
"X = X.select_dtypes(include='number')\n",
"\n",
"# Split the data into training (70%), validation (15%) and test (15%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Helper that plots a histogram of the target distribution\n",
"def plot_distribution(data, title):\n",
"    \"\"\"Plot a histogram of the target distribution.\"\"\"\n",
"    plt.figure(figsize=(10, 6))\n",
"    sns.histplot(data, kde=True, bins=30, color='skyblue')\n",
"    plt.title(title)\n",
"    plt.xlabel('Logarithm of Close Price')\n",
"    plt.ylabel('Count')\n",
"    plt.grid(True)\n",
"    plt.show()\n",
"\n",
"# Plot the target distribution for each set\n",
"plot_distribution(y_train, 'Distribution of log closing price in the training set')\n",
"plot_distribution(y_val, 'Distribution of log closing price in the validation set')\n",
"plot_distribution(y_test, 'Distribution of log closing price in the test set')\n",
"\n",
"# Helper that prints summary statistics of a Series\n",
"def get_statistics(series, name):\n",
"    print(f\"Summary statistics for the {name} set:\")\n",
"    print(f\"Mean: {series.mean():.2f}\")\n",
"    print(f\"Standard deviation: {series.std():.2f}\")\n",
"    print(f\"Minimum: {series.min():.2f}\")\n",
"    print(f\"Maximum: {series.max():.2f}\")\n",
"    print(f\"Number of observations: {series.count()}\\n\")\n",
"\n",
"# Print summary statistics for the training, validation and test sets\n",
"get_statistics(y_train, \"training\")\n",
"get_statistics(y_val, \"validation\")\n",
"get_statistics(y_test, \"test\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Oversampling and undersampling</h3>"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Class distribution after SMOTE (oversampling):\n",
"Close_category\n",
"0    1157\n",
"1    1157\n",
"2    1157\n",
"3    1157\n",
"4    1157\n",
"Name: count, dtype: int64\n",
"Class distribution after RandomUnderSampler (undersampling):\n",
"Close_category\n",
"0    1092\n",
"1    1092\n",
"2    1092\n",
"3    1092\n",
"4    1092\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.over_sampling import SMOTE\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# If a 'Date' column is present, derive a 'Year' column from it and drop 'Date'\n",
"if 'Date' in df.columns:\n",
"    df['Year'] = pd.to_datetime(df['Date'], errors='coerce').dt.year\n",
"    df = df.drop(['Date'], axis=1)\n",
"\n",
"# Log-transform the target variable\n",
"df['Close_log'] = np.log(df['Close'])\n",
"\n",
"# Bin the log-transformed target into five quantile-based categories\n",
"df['Close_category'] = pd.qcut(df['Close_log'], q=5, labels=[0, 1, 2, 3, 4])\n",
"\n",
"# Select the features and the target\n",
"X = df.drop(['Close', 'Close_log', 'Close_category'], axis=1)\n",
"y = df['Close_category']\n",
"\n",
"# Split the data into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n",
"\n",
"# Oversample the training set with SMOTE\n",
"smote = SMOTE(random_state=42)\n",
"X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)\n",
"\n",
"print(\"Class distribution after SMOTE (oversampling):\")\n",
"print(pd.Series(y_train_smote).value_counts())\n",
"\n",
"# Undersample the training set with RandomUnderSampler\n",
"undersampler = RandomUnderSampler(random_state=42)\n",
"X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)\n",
"\n",
"print(\"Class distribution after RandomUnderSampler (undersampling):\")\n",
"print(pd.Series(y_train_under).value_counts())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Assessing the balance of the samples</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assessing the need for data augmentation"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The training sample is categorical.\n",
"Augmentation check for the validation sample:\n",
"Mean: 2.44, standard deviation: 1.58\n",
"25th percentile: 1.20\n",
"50th percentile (median): 2.53\n",
"75th percentile: 4.01\n",
"The validation sample is unbalanced; augmentation is recommended.\n",
"\n",
"The test sample is categorical.\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def check_augmentation_need(data, name):\n",
"    \"\"\"Check whether the data need augmentation.\"\"\"\n",
"    # Only numeric data can be analyzed\n",
"    if isinstance(data.dtype, pd.CategoricalDtype):\n",
"        print(f\"The {name} sample is categorical.\")\n",
"        return\n",
"    elif not np.issubdtype(data.dtype, np.number):\n",
"        print(f\"The {name} sample is not numeric.\")\n",
"        return\n",
"\n",
"    # Check for missing values\n",
"    if data.isnull().any():\n",
"        print(f\"The {name} sample contains missing values.\")\n",
"        return\n",
"\n",
"    quantiles = data.quantile([0.25, 0.5, 0.75])\n",
"    mean = data.mean()\n",
"    std = data.std()\n",
"\n",
"    print(f\"Augmentation check for the {name} sample:\")\n",
"    print(f\"Mean: {mean:.2f}, standard deviation: {std:.2f}\")\n",
"    print(f\"25th percentile: {quantiles[0.25]:.2f}\")\n",
"    print(f\"50th percentile (median): {quantiles[0.5]:.2f}\")\n",
"    print(f\"75th percentile: {quantiles[0.75]:.2f}\")\n",
"\n",
"    # Heuristic: flag the sample when the standard deviation exceeds half of the mean\n",
"    if std > mean * 0.5:\n",
"        print(f\"The {name} sample is unbalanced; augmentation is recommended.\\n\")\n",
"    else:\n",
"        print(f\"The {name} sample is balanced; augmentation is not required.\\n\")\n",
"\n",
"# Example usage\n",
"# y_train, y_val and y_test must already be defined\n",
"check_augmentation_need(y_train, \"training\")\n",
"check_augmentation_need(y_val, \"validation\")\n",
"check_augmentation_need(y_test, \"test\")\n"
]
},
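{
"cell_type": "markdown",
"metadata": {},
"source": [
"The check flags the validation sample as unbalanced: its standard deviation (1.58) exceeds half of its mean (2.44), and the quartiles (1.20, 2.53, 4.01) confirm a wide spread of values, so augmentation is recommended at this stage to improve balance and model quality. The training and test samples were skipped because they are categorical. Note that this criterion measures the spread of a continuous target rather than class balance, so the per-class counts are a more reliable basis for the final decision."
]
},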
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of 'Close_category' in the training set:\n",
" Close_category\n",
"2    1157\n",
"4    1134\n",
"1    1126\n",
"3    1116\n",
"0    1092\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 800x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of 'Close_category' in the validation set:\n",
" Close_category\n",
"0    263\n",
"1    242\n",
"3    238\n",
"4    238\n",
"2    224\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 800x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of 'Close_category' in the test set:\n",
" Close_category\n",
"0    253\n",
"3    252\n",
"1    241\n",
"4    235\n",
"2    225\n",
"Name: count, dtype: int64\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 800x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Augmentation check for 'Close_category' in the training set:\n",
"Smallest class size: 1092\n",
"Largest class size: 1157\n",
"The 'training' set is balanced; augmentation is not required.\n",
"\n",
"Augmentation check for 'Close_category' in the validation set:\n",
"Smallest class size: 224\n",
"Largest class size: 263\n",
"The 'validation' set is balanced; augmentation is not required.\n",
"\n",
"Augmentation check for 'Close_category' in the test set:\n",
"Smallest class size: 225\n",
"Largest class size: 253\n",
"The 'test' set is balanced; augmentation is not required.\n",
"\n",
"Class distribution after SMOTE (oversampling):\n",
"Close_category\n",
"0    1157\n",
"1    1157\n",
"2    1157\n",
"3    1157\n",
"4    1157\n",
"Name: count, dtype: int64\n",
"Class distribution after RandomUnderSampler (undersampling):\n",
"Close_category\n",
"0    1092\n",
"1    1092\n",
"2    1092\n",
"3    1092\n",
"4    1092\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from imblearn.over_sampling import SMOTE\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Log-transform the target variable\n",
"df['Close_log'] = np.log(df['Close'])\n",
"\n",
"# Bin the log-transformed target into five quantile-based categories\n",
"df['Close_category'] = pd.qcut(df['Close_log'], q=5, labels=[0, 1, 2, 3, 4])\n",
"\n",
"# Select the features and the target\n",
"X = df.drop(['Close', 'Close_log', 'Close_category'], axis=1)\n",
"y = df['Close_category']\n",
"\n",
"# Split the data into training (70%), validation (15%) and test (15%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"def analyze_close_category_distribution(data, name):\n",
"    \"\"\"Print and plot the distribution of 'Close_category'.\"\"\"\n",
"    category_counts = data.value_counts()\n",
"    print(f\"Distribution of 'Close_category' in the {name} set:\\n\", category_counts)\n",
"\n",
"    plt.figure(figsize=(8, 6))\n",
"    # Assign x to hue and hide the legend: passing palette without hue is deprecated in seaborn\n",
"    sns.barplot(x=category_counts.index, y=category_counts.values, hue=category_counts.index, palette='viridis', legend=False)\n",
"    plt.title(f\"Distribution of 'Close_category' in the {name} set\")\n",
"    plt.xlabel('Close Category')\n",
"    plt.ylabel('Count')\n",
"    plt.grid(True)\n",
"    plt.show()\n",
"\n",
"analyze_close_category_distribution(y_train, 'training')\n",
"analyze_close_category_distribution(y_val, 'validation')\n",
"analyze_close_category_distribution(y_test, 'test')\n",
"\n",
"def check_close_category_augmentation(data, name):\n",
"    print(f\"Augmentation check for 'Close_category' in the {name} set:\")\n",
"    min_count = data.value_counts().min()\n",
"    max_count = data.value_counts().max()\n",
"    print(f\"Smallest class size: {min_count}\")\n",
"    print(f\"Largest class size: {max_count}\")\n",
"\n",
"    # Treat the set as unbalanced when the largest class is more than 1.5 times the smallest\n",
"    if max_count > min_count * 1.5:\n",
"        print(f\"The '{name}' set is unbalanced; augmentation is recommended.\\n\")\n",
"    else:\n",
"        print(f\"The '{name}' set is balanced; augmentation is not required.\\n\")\n",
"\n",
"check_close_category_augmentation(y_train, 'training')\n",
"check_close_category_augmentation(y_val, 'validation')\n",
"check_close_category_augmentation(y_test, 'test')\n",
"\n",
"# Oversample the training set with SMOTE\n",
"smote = SMOTE(random_state=42)\n",
"X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)\n",
"\n",
"print(\"Class distribution after SMOTE (oversampling):\")\n",
"print(pd.Series(y_train_smote).value_counts())\n",
"\n",
"# Undersample the training set with RandomUnderSampler\n",
"undersampler = RandomUnderSampler(random_state=42)\n",
"X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)\n",
"\n",
"print(\"Class distribution after RandomUnderSampler (undersampling):\")\n",
"print(pd.Series(y_train_under).value_counts())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this study the data are balanced, so augmentation is not required."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}