2024-11-02 03:42:31 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Laboratory Work No. 3**\n",
"\n",
"### **Defining the project's business and technical goals**\n",
"\n",
"**Assignment variant:** Starbucks stock price dataset.\n",
"\n",
"**Business goal:** Build a machine learning model that predicts the trading volume of Starbucks shares from features such as the opening price and the closing price.\n",
"\n",
"**Technical goals of the project:**\n",
"\n",
"- Remove missing values, outliers, and duplicate records to improve data quality.\n",
"\n",
"- Encode categorical features as numerical values.\n",
"\n",
"- Split the dataset into training, validation, and test sets to evaluate the model.\n",
"\n",
"**Dataset columns and their meaning:**\n",
"\n",
"**Date** - The date the record refers to, that is, the specific day on which Starbucks shares were traded.\n",
"\n",
"**Open** - Opening price. The price of Starbucks shares at the start of the trading day. It shows the price at which trading began and is often compared with the closing price to determine the daily trend.\n",
"\n",
"**High** - Daily high. The highest price Starbucks shares reached during the trading day.\n",
"\n",
"**Low** - Daily low. The lowest price at which Starbucks shares traded during the day.\n",
"\n",
"**Close** - Closing price. The price of Starbucks shares at the end of the trading day. It is one of the main indicators in stock analysis because it reflects the final price for the day and is often used to compute daily changes and long-term trends.\n",
"\n",
"**Adj Close** - Adjusted closing price. The closing price adjusted for all corporate actions.\n",
"\n",
"**Volume** - Trading volume. The number of Starbucks shares bought and sold during the day."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the data and obtain a cleaned, structured dataset."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 8036 entries, 0 to 8035\n",
"Data columns (total 8 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Date 8036 non-null object \n",
" 1 Open 8036 non-null float64 \n",
" 2 High 8036 non-null float64 \n",
" 3 Low 8036 non-null float64 \n",
" 4 Close 8036 non-null float64 \n",
" 5 Adj Close 8036 non-null float64 \n",
" 6 Volume 8036 non-null int64 \n",
" 7 date 8036 non-null datetime64[ns]\n",
"dtypes: datetime64[ns](1), float64(5), int64(1), object(1)\n",
"memory usage: 502.4+ KB\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"import matplotlib.ticker as ticker\n",
"from datetime import datetime\n",
"import matplotlib.dates as md\n",
"\n",
"# Load the Starbucks stock price dataset\n",
"df = pd.read_csv(\"../static/csv/StarbucksDataset.csv\")\n",
"print(df.columns)\n",
"\n",
"# Parse the text dates into a datetime64 column in one vectorized call\n",
"df[\"date\"] = pd.to_datetime(df[\"Date\"], format=\"%Y-%m-%d\")\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Splitting into training, validation, and test sets**\n",
"\n",
"Split the dataset into training, validation, and test sets so that the model is evaluated on data it did not see during training, which guards against data leakage.\n",
"First split the data into training and test sets, then split the training set again into training and validation sets."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки: 5142\n",
"Размер контрольной выборки: 1286\n",
"Размер тестовой выборки: 1608\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)\n",
"\n",
"print(\"Размер обучающей выборки:\", len(train_data))\n",
"print(\"Размер контрольной выборки:\", len(val_data))\n",
"print(\"Размер тестовой выборки:\", len(test_data))"
]
},
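Because the rows are daily quotes, a random split mixes earlier and later days across the three sets. The sketch below shows a chronological alternative under stated assumptions: the toy frame, the helper name `chronological_split`, and the 64/16/20 proportions (chosen to mirror the 0.2 and 0.2 splits above) are illustrative and are not part of the lab.

```python
import pandas as pd


def chronological_split(frame, date_col="date", val_size=0.16, test_size=0.2):
    """Split a frame into train/val/test by time order (hypothetical helper)."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n = len(ordered)
    n_test = int(round(n * test_size))
    n_val = int(round(n * val_size))
    n_train = n - n_val - n_test
    train = ordered.iloc[:n_train]                 # oldest rows
    val = ordered.iloc[n_train:n_train + n_val]    # next block in time
    test = ordered.iloc[n_train + n_val:]          # most recent rows
    return train, val, test


# Toy data standing in for the Starbucks frame: 100 consecutive business days.
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=100, freq="B"),
    "Volume": range(100),
})
train_part, val_part, test_part = chronological_split(toy)
print(len(train_part), len(val_part), len(test_part))
```

With this layout every validation day is later than every training day, and every test day is later than both, so the evaluation resembles forecasting future volume.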
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Building the charts**\n",
"\n",
"Create three histograms that show the distribution of trading volume in the training, validation, and test sets."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAj8AAAHJCAYAAABqj1iuAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABTH0lEQVR4nO3deViU5f4G8HuAYRMXJBY3RCHBDQVBxcQFl0xtQc8p19IkNTFKc09z92jiBgpuGJl7YqZFWZrlOf4U0VxSQEXF0BRUFJR1YOb3xzSvDJszwzAvMPfnurhmeNfvPIx48zzP+45EoVAoQERERGQkTMQugIiIiMiQGH6IiIjIqDD8EBERkVFh+CEiIiKjwvBDRERERoXhh4iIiIwKww8REREZFYYfIiIiMioMP0RERGRUzMQugKq/0aNH48yZM2rLpFIpXnrpJfTu3RuffPIJ6tevL1J1JKZnz55h48aNOHbsGO7evYv8/HwAQIMGDfDjjz+iYcOGIldY/aSkpCAyMhKnT5/Gw4cPUVhYCADo3Lkzvv76a5GrE9fkyZPh7u6OMWPG4NixY4iKisLhw4cNXkdBQQGio6Px/fff46+//kJubi4AwNLSEjExMXBzczN4TaRfEn68Bb3I6NGj8ezZM8yfP19YJpPJcOXKFaxevRpt2rTB7t27IZFIRKySDE0mk2HYsGGwtLTEsGHD4OTkBHNzc0ilUrRo0QJWVlZil1jt3LlzB4GBgejbty/69OkDW1tbSKVSWFlZwdXVFSYmxt0Zf+XKFQQFBSEjIwPW1tZYvXo1evfubfA6Jk2ahNTUVIwZMwZNmzaFpaUlzMzM4OzsjLp16xq8HtI/9vyQRmxsbNCxY0e1Zb6+vsjOzkZYWBguXrxYaj3Vbr/99hseP36Mn376Cebm5mKXUyNER0eje/fu+M9//iN2KdVS27Zt8dtvvyE1NRVOTk6wsbExeA0JCQk4deoUjh07xp7LWsy4/8ygSmvXrh0A4O+//wYAJCUlYfLkyejatSvatm0Lf39/LFmyBHl5ecI+BQUFWLt2Lfr06QNPT08MHjwY3377rbB+9OjRcHd3L/Przp07AIBZs2Zh9OjR2L9/P3r37g0vLy+89957SEpKUqvv77//xtSpU9G5c2d06NAB7733HhISEtS22bdvX5nnmjVrltp2R48exZAhQ9C+fXu88sorWLJkCXJycoT1Bw4cKLfuAwcOaFzTnTt3Su2jes0BAQHC9wEBAaVqnDp1Ktzd3REXFycsu3btGiZMmABvb294e3sjODgYqamppX6WJZ08eRIjRoxAp06d0KVLF3z66ae4d++esD4uLg7du3fHzz//jIEDB6Jdu3YYMGAAdu3apfY6VF+jR49+4Tm/+eYbDBo0CO3atUOvXr0QHh6OoqIitTZwd3fHq6++WmrfIUOGwN3dHeHh4cIyTd6PJZX8ObZr1w6vvvoqDh06VGHtRUVF2LlzJ15//XV4enqiV69eCA0NFYYCAeDMmTPo2bMnwsLC4O/vD09PTwwbNkz4eV2/fh3u7u7Yu3ev2rHv3buH1q1b49ChQxg9enSptoyLiyv1cz969ChGjBgBLy8v4Wezc+fOCvf59ddf8frrr6u1v1wuF9aXfM+V9V4t67javOctLCzg5uYGqVSKPn36wN3dvdw2L/kea9OmDbp3744vvvhCqLuseso6huo1nDlzBt7e3khMTBT+vQcEBGDDhg1q70UAiI2NxZAhQ+Dl5YVXXnkFn3/+OTIzM4X14eHhCAgIwPHjxzFgwAB06NABb7/9tlotJeu7du0a+vbti2HDhgnbnD17FqNGjUKHDh3QuXNnzJw5ExkZGeW2C70Yww9Vyq1btwAAzZo1Q3p6OkaOHInc3FwsX74cW7ZswaBBg/D1119j+/btwj7Tpk3Dl19+iX//+9/YtGkTunfvjlmzZuH7778XtmnTpg327t0rfH344Yelzp2YmIg1a9Zg8u
TJWLlyJR4/foxRo0YhPT0dAJCRkYFhw4bhypUrmDdvHlatWgW5XI6RI0fixo0bwnHy8vLQvn17tfPZ29urnevw4cMIDg5Gy5YtsWHDBkyePBmHDh3CpEmTUHLkeP369cJx1q9fr7ZO05p0cfbsWfzwww9qy27duoVhw4bh0aNHWLFiBZYuXYrU1FQMHz4cjx49KvdYBw8exPvvv49GjRph9erVmD17Ns6fP4933nlH2O/u3bs4d+4cPvvsMwwZMgQbN25EQEAAFi1ahIiICDg4OAjt0KZNmxfWv2nTJsybNw9+fn7YuHEjRo4ciS1btmDevHlq21lbW+P27dtq7fXXX3+VCr6avh/Lo/o5btiwAS1btsTMmTOF93tZPv/8c/znP/9B3759ERkZiZEjR2LHjh1q75G7d+9iy5YtOHjwID799FOEhYWhYcOGeP/993H69Gm8/PLL6NChA7777ju1Yx88eBDW1tbo37//C+sGlL1ywcHBaNu2LSIiIhAeHo5mzZph0aJFuHjxYpn7/PnnnwgODka7du0QGRmJ0aNHY9OmTVi5cqVG5yyPru/5rVu3Cn/svMiHH36IvXv3IioqCm+99RaioqIQExOjU7137tzBnTt3MHHiRPj7+yMyMhLvvPMOIiMj8fnnnwvbRUREYOrUqejYsSPCwsIQHByMI0eOYPTo0WrhOiMjAzNnzsSIESOwbt06WFpaYty4cUhMTCzz/CtXrkS7du2wcOFCAEB8fDzGjBkDS0tLrF27FnPmzMGZM2fw7rvvVhjiqWIc9iKNKBQKYWImAGRmZuLMmTOIjIwU/rI8efIkWrdujXXr1gnd1d26dcPJkycRFxeH8ePH49q1azhy5AjmzJmD9957DwDg5+eHu3fvIi4uDoMHDwZQepjt5s2bpWp6+vQpNm7cCB8fHwCAp6cn+vbti+3bt2PatGn46quv8OTJE+zevRtNmjQBAPTo0QMDBw7EunXrEBYWBgDIzc3FSy+9pHa+4sM4CoUCoaGh8Pf3R2hoqLDcxcUFY8aMwe+//45evXoJy1u3bo2mTZsCQKlf3prWpC25XI4lS5agbdu2uHLlirB8/fr1sLKyQnR0tPAz8fPzQ9++fbF161bMnDmzzGOFhoaie/fuWLVqlbDc29sbAwcORFRUFGbMmIHc3FwkJydj2bJlGDp0KACge/fuyM/Px8aNGzFixAihTV80fPH06VNERETgnXfewdy5c4VjNWjQAHPnzsXYsWPx8ssvAwBsbW3h5uaGY8eOwdXVFYDyL3AfH59SPV4vej9WpPjPsVGjRvj111+RmJiIFi1alNo2OTkZ+/fvx6effioc95VXXoGDgwNmzJiBEydOoGfPnsjNzcWtW7cQGxsLFxcXAEDPnj3x5ptvYtWqVfjmm28wdOhQzJ8/H6mpqWjWrBkAZfgZNGgQLC0tYWJigoKCggprT05ORmBgID777DNhmZeXF7p06YK4uDh06NCh1D7r16+Hl5eXMCTn7++Pp0+fYuvWrZgwYQIaNGhQ4TnLo8t7/t69e9iyZUup93N5nJ2dhfean58fvvnmG1y+fBn//ve/ta43NzcXKSkpCA4ORkhICADle1EqlWLFihUYO3Ys7O3tERkZibffflstELVq1QojR45ETEwMRo4cKRxvwYIFeOuttwAAXbt2Rd++fbF582asWbNG7dy3b9/G//73Pxw6dEh4v69atQotWrTApk2bYGpqCgDo0KEDBg0apHYe0g57fkgj8fHxaNu2rfDVrVs3TJ06Fe3atcOqVasgkUjQvXt37NixAxYWFkhOTsaxY8cQGRmJjIwM4Zf1uXPnAKDUX7Dh4eFYvHixVjU1bdpUCD4A4ODgAC8vL8THxwMATp06hdatW8PR0RGFhYUoLCyEiYkJevTogf/7v/8T9rt3716Fkxhv3ryJ+/fvIyAgQDhOYWEhfH19YWNjg5MnT2pcs6Y1AcoQUvx8FV2bsGfPHjx48ADBwcFqy0
+fPo3OnTvD0tJSOI6NjQ18fHxKnU/l1q1bePDggRBEVZydneHl5SVc+SeRSGBqaorXX39dbbvXXnsN+fn5pXoYVAG
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAj8AAAHJCAYAAABqj1iuAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABio0lEQVR4nO3deVxU9f7H8dewgygiCWguaOS+kmvXFZc2q6tWt1Ir09KbS7mkVpaVZZv7ri2aZWmlmZUtV+vm/VVuLVaKGSq5JIIiIMrO+f0xzcgI6MywzMi8n48Hjxm+55w5Hz5zGD58v99zjskwDAMRERERD+Hl6gBEREREKpKKHxEREfEoKn5ERETEo6j4EREREY+i4kdEREQ8ioofERER8SgqfkRERMSjqPgRERERj6LiR0RERDyKj6sDEPczZMgQduzYYdPm6+vLFVdcQc+ePXnkkUcICQlxUXTiShkZGSxdupQtW7Zw7NgxsrOzAahevTqfffYZNWrUcHGEUtn079+fu+++mxtuuIG3336bn3/+maVLl1Z4HDr2KxeTbm8hFxoyZAgZGRlMmzbN2pabm8uePXuYPXs2zZo1491338VkMrkwSqloubm53HnnnQQEBHDnnXcSGRmJn58fvr6+NGjQgMDAQFeHKJXQN998wyOPPMK5c+eoXr06r776Kq1atarQGHTsVz7q+ZFiBQcH06ZNG5u29u3bc/bsWebPn8/u3buLLJfK7b///S+nT5/m888/x8/Pz9XhiIfo3r07//vf/0hMTKROnToEBARUeAw69isfzfkRh7Ro0QKAv/76C4B9+/YxevRoOnXqRPPmzenatSvPPfccWVlZ1m1ycnKYO3cuvXr1olWrVvTr148PP/zQunzIkCE0bty42K+jR48CMGXKFIYMGcIHH3xAz549adu2Lffeey/79u2zie+vv/5i/PjxdOjQgdatW3Pvvfeyd+9em3Xee++9Yvc1ZcoUm/U2b97MgAEDaNmyJf/4xz947rnnOHfunHX5+vXrS4x7/fr1dsd09OjRIttYfubY2Fjr97GxsUViHD9+PI0bN2b79u3Wtv379zNixAhiYmKIiYlh1KhRHDlypMh7eaFvv/2Wu+++m2uuuYaOHTsyYcIEjh8/bl2+fft2unTpwpdffsmNN95IixYtuP7663nnnXdsfg7L15AhQy65z/fff5+bbrqJFi1a0KNHDxYsWEB+fr5NDho3bsx1111XZNsBAwbQuHFjFixYYG2z53i80IXvY4sWLbjuuuvYuHHjRWMv/H4UFBTw8MMP06JFCw4cOABAUlISjz32GN27d6dVq1bcdtttbNmyxeY1LowfYMGCBTRu3Njm5y/pGNu+fTuNGzfm//7v/xg0aBCtWrWib9++1vfEIjs7m0WLFnH99dfTsmVL+vbty/LlyykoKCiy3+K+LMdX4dhKMmTIkCLvvSXOwsfpr7/+yrBhw+jYsSMxMTGMHDmSP/74o9htgoODiY6O5uzZs7Rr187m9+JClu0Kv5+xsbG88cYb1nUs77nl86Wk17DEe6lj3yI/P5/Vq1dz880306pVK3r06MHMmTOtQ2Rg32fZhfFZ9j9+/HjrOpf6fJKLU8+POOTQoUMA1K1bl6SkJAYNGkSbNm148cUX8fPzY+vWraxYsYLw8HAefPBBACZOnMg333zDv//9b1q3bs0333zDlClT8PX1pV+/fgA0a9bMZpjtv//9L0uWLLHZd1xcHAcPHmT8+PGEhIQwf/58Bg8ezKZNmwgPDyclJYU777yTwMBAnnzySQIDA3nzzTcZNGgQH3zwAVdddRUAWVlZtGzZkqlTp1pfe/To0Tb7+vjjj5k4cSI333wzjzzyCMeOHWPOnDnEx8ezYsUKmyG/hQsXUrNmTQCSk5NtXsvemJyxa9cuPv30U5u2Q4cOceedd9KwYUNeeukl8vLyWLJkCXfddRcfffQRYWFhxb7Whg0bmDx5Mv369W
PEiBGcPn2a+fPn869//YsPP/yQsLAwjh07xuHDh/noo48YM2YMTZo04bvvvuPZZ58lNTWV4cOHs3btWgCeeeaZS8a/bNky5syZw+DBg3nssceIi4tjwYIFHD9+nBkzZljXCwoK4s8//+TAgQPWfB0+fLhI4Wvv8VgSy/uYlpbGmjVrmDx5Mi1btqRBgwaX/Fk+//xztm/fzquvvkqdOnU4efIkt912G/7+/owbN47Q0FDWr1/PqFGjePnll7nlllsu+ZoADz30EHfeeSdgPkabNWvGQw89BEC9evWsxcK4ceP45z//yciRI9myZYs1/3fffTeGYTBy5Eh+/vlnRo8eTZMmTdi+fTtz587lyJEjTJ8+3WaflvcQYM+ePTz77LN2xeqIbdu2MXz4cDp27MiMGTPIzs5m2bJl3Hnnnbz33nsl/l7MmjWLM2fOUK1atUvu46mnnqJ58+acPXuWTz/9lJdeeokmTZpw7bXXOhzvpY59y3vy1FNP8dFHH/HAAw/Qrl079u7dy6JFi4iLi+O1116zfm5c6rPsQs8++yzXX389gwcPBhz7fJLiqfiRYhmGQV5envX7tLQ0duzYwZIlS2jbti0tWrTg22+/pWnTpsybN4/g4GAArr32Wr799lu2b9/Ogw8+yP79+/niiy94/PHHuffeewHo3Lkzx44dY/v27dbi58JhtoMHDxaJ6cyZMyxdupR27doB0KpVK3r37s2qVauYOHEib775Jqmpqbz77rtceeWVAHTr1o0bb7yRefPmMX/+fAAyMzO54oorbPZXuCvbMAxmzpxJ165dmTlzprU9KiqK++67j2+++YYePXpY25s2bUqdOnUAivwnaW9MjiooKOC5556jefPm7Nmzx9q+cOFCAgMDWblypfU96dy5M7179+a1115j8uTJxb7WzJkz6dKlC7NmzbK2x8TEcOONN/L6668zadIkMjMziY+PZ8aMGQwcOBCALl26kJ2dzdKlS7n77rutObXsuyRnzpxh8eLF/Otf/7IWoV26dKF69epMnTqVoUOHcvXVVwMQGhpKdHQ0W7Zssf5R3LRpE+3atSvS43Wp4/FiCr+PtWrV4quvviIuLs6u4mf16tUMGDCAzp07AzB//nxSUlL44osvrO979+7due+++3j55Zfp168fXl6X7nivV68e9erVA8zHaI0aNYodbu7Tpw9PPPEEAF27diUpKYnFixdz1113sXXrVr777jtmz57NTTfdBMA//vEPAgICmDdvHvfcc48114DN6xfusShLs2bNon79+ixfvhxvb2/A/P736dOH+fPnM2/evCLb/Prrr3z00Uc0bdqU9PT0S+4jOjra+rO0adOGdevW8dtvvzlV/Nhz7J88eZIPPviACRMmWI+1f/zjH4SHhzNp0iS2bt1K9+7dgUt/lhX2/fffc/jwYVavXk316tUd/nyS4mnYS4q1c+dOmjdvbv269tprGT9+PC1atGDWrFmYTCa6dOnC22+/jb+/P/Hx8WzZsoUlS5aQkpJCTk4OAD/88AMAffv2tXn9BQsWFPmP81Lq1Klj/bAACA8Pp23btuzcuRMwf0g0bdqUiIgI8vLyyMvLw8vLi27duvHdd99Ztzt+/DhVq1YtcT8HDx4kMTGR2NhY6+vk5eXRvn17goOD+fbbb+2O2d6YwFyEFN7fxc5FWLNmDcnJyYwaNcqmfdu2bXTo0IGAgADr6wQHB9OuXbsi+7M4dOgQycnJ1kLUol69erRt29Z65p/JZMLb25ubb77ZZr0bbriB7Oxsdu/ebdNuKaALD2NZ/PTTT2RlZRXJsWU448Ic9+rVy2bIaNOmTdY/5Bb2HI8XY8n/mTNneO+99/Dx8aFJkyYX3SY/P58vv/yS3bt3c9ddd1nbd+zYQdu2ba2Fj8Utt9xCcnKyTXF/4fteeCjKXv3797f5vm/fviQnJ3Po0CF27NiBj48P119/fZFYLLE6qqT31cLy3hf3M507d45ff/2VG264wV
r4AFSrVo2ePXsWG49hGDz33HPcdtttl3xPLCx5PXv2rHV4qmXLlkXWudjPYWHPsW+J+8Lj8qabbsLb29umUL/UZ5l
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAkEAAAHJCAYAAACCD+2FAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABttUlEQVR4nO3dd3hU1dbH8e9kMukhhBASkBKKoUiRLioooCj3eq8C3msDyxUFFVEQQSwg2FBREFDACojYsYMF7I2m2AidhJoQSCN9MjnvH3lnZEiAzJBk2u/jkwdmn7bWzHiy2Hufc0yGYRiIiIiIBJggTwcgIiIi4gkqgkRERCQgqQgSERGRgKQiSERERAKSiiAREREJSCqCREREJCCpCBIREZGApCJIREREApKKIBEREQlIwZ4OQPzLiBEjWLt2rVObxWKhYcOG9O/fnzvvvJOYmBgPRSeelJ+fz4IFC1i9ejX79u2jpKQEgPr167Ny5UoaNGjg4QilthiGQZ8+fXj88cfp3r07Tz75JMHBwTzwwAN1HktmZibPPfcc3333Henp6VitVgCaN2/OypUrCQ7Wr8VAok9balyHDh2YOnWq47XVauWvv/7i6aefJiUlhddffx2TyeTBCKWuWa1WrrvuOsLCwrj11ltJTEwkJCQEi8VCy5YtCQ8P93SIUotMJhMTJkzgtttuw2q1ctppp7FkyZI6j+PIkSNcccUVtG3blrvuuov4+HhCQkIICQmhdevWKoACkD5xqXFRUVGceeaZTm09e/akoKCAOXPm8Ntvv1VaLv7t66+/Jjs7m08//ZSQkBBPhyMecPnll3PhhRdy6NAhmjVr5pHvwbvvvkt8fDzz58+v82OLd9KcIKkzHTt2BGD//v0AbN68mTFjxnDWWWdxxhln0LdvXx5++GGKi4sd25SWljJ79mwGDhxI586dueSSS3jvvfccy0eMGEHbtm2r/Nm7dy8A99xzDyNGjOCdd96hf//+dO3aleuuu47Nmzc7xbd//37Gjx9Pr1696NKlC9dddx2bNm1yWuett96q8lj33HOP03qrVq1i6NChdOrUiXPOOYeHH36YwsJCx/Lly5cfN+7ly5dXO6a9e/dW2sae84ABAxyvBwwYUCnG8ePH07ZtW9asWeNo27p1K6NGjaJbt25069aN2267jT179lT6LI/1ww8/cPXVV9O9e3d69+7NXXfdxYEDBxzL16xZw7nnnsvnn3/OP/7xDzp27MjFF1/MsmXLnPKw/4wYMeKkx3z77bf55z//SceOHTn//POZO3cuNpvN6T1o27YtF110UaVthw4dStu2bZk7d66jrTrfx2Md+zl27NiRiy66iA8//PC425zoO2v/LKrzORw8eJBJkybRp08funbtyvDhw/n111+Bis/7ZP9fnOwzq25u1d3P3r17iYmJoXXr1qSlpXHGGWec8HOuzvHnzp1L27ZtT7oPe85r1qyhf//+LFmyhIEDB9KpUycuvfRSPvvsM6ftSkpKePbZZ7n44ovp1KkTgwYN4vnnn6e8vNzpc7znnntYsGABZ599Nt27d+fWW29l3759x43vk08+oWfPnjz11FOOtpN9j6V2qSdI6syuXbsAaNasGQcPHuSaa67hzDPPZMaMGYSEhPDtt9/yyiuv0KhRI26++WYAJkyYwDfffMMtt9xCly5d+Oabb7jnnnuwWCxccsklQOXht6+//rrSv/RSUlLYuXMn48ePJyYmhjlz5jB8+HBWrFhBo0aNyMrK4sorryQ8PJwHHniA8PBwFi9ezDXXXMM777xD69atASguLqZTp07cf//9jn2PGTPG6VgfffQREyZM4F//+hd33nkn+/btY9asWWzfvp1XXnnFaShw3rx5xMfHAxVzFY7eV3Vjcsf69ev55JNPnNp27drFlVdeSatWrXj88ccpKytj/vz5XHXVVXzwwQfExcVVua/333+fSZMmcckllzBq1Ciys7OZM2cOV1
xxBe+99x5xcXHs27eP3bt388EHH3D77bfTrl07fvzxR6ZPn05OTg4jR47kzTffBGDatGknjX/hwoXMmjWL4cOHM3nyZFJSUpg7dy4HDhzg0UcfdawXERFBWloaO3bscLxfu3fvrlQAV/f7eDz2zzE3N5c33niDSZMm0alTJ1q2bFlp3alTp5Kfnw/AFVdcweWXX85//vMfANq0aVOtz6GgoICrrroKm83G3XffTUJCAi+//DL/+9//eO+995g3bx6lpaWO79Qtt9zC+eefD0CjRo2q9ZlVJzdX9nO0Rx55hLKyspN8yq6/tydj/x4eOnSIO++8k6ZNm7Jy5UrGjh3L448/zmWXXYZhGIwePZqNGzcyZswY2rVrx5o1a5g9ezZ79uzhoYcecuxv9erVxMbGcv/991NeXs5TTz3FiBEj+OSTTyoN8RYXFzN9+nRGjhzJv/71L6D632OpRYZIDRo+fLhxzTXXGFar1fFz6NAhY8WKFUavXr2MK664wigvLze+++4745prrjGOHDnitP0ll1xi/O9//zMMwzC2bNliJCcnG4sWLXJaZ8yYMcb999/vON7w4cOdlr/77rtGcnKysWfPHsMwDGPSpElGcnKysW7dOsc6GRkZRqdOnYwnn3zSMAzDePrpp41OnToZe/fudaxTUlJiDBw40Lj99tsdbQsWLDBGjRrldLz+/fsbkyZNMgzDMMrLy41+/foZN954o9M6P/74o5GcnGx89dVXVcZoGIaxZ88eIzk52Xj33XerHdOx29hNmjTJ6N+/f5Ux2mw249JLLzWGDBliJCcnGz///LNhGIYxfvx44+yzz3b6TLKzs43u3bsbM2bMMKpis9mMc845x/GZ2aWlpRlnnHGG8fjjjxuGYRjXXXedkZycbLzzzjtO602fPt3o1KmTkZ2d7Wir6jM9Wl5entG5c2djypQpTu1vvfWWkZycbGzdutXpPbj88suNhQsXOtabP3++MWLECCM5OdmYM2eOYRhGtb6PVanqc7R/bz/55JPjbmd3dAx21fkcXn31VaNt27bGpk2bHOsUFhYagwYNMt566y1HW1Xfj+p+ZifLzd39fPrpp8aZZ55pXHTRRSf8nKvz3s6ZM8dITk6u9j4uuOACIzk52fjpp5+c1hs1apRxzjnnGDabzfj666+N5ORk4+OPP3Za59lnn3X6fg0fPtw444wzjN27dzvW+euvv4zk5GRj2bJlleJ7++23jXPOOccoKyszDKP632OpXRoOkxq3bt06zjjjDMfP2Wefzfjx4+nYsSNPPfUUJpOJc889l6VLlxIaGsr27dtZvXo18+fPJysri9LSUgA2bNgAwKBBg5z2P3fuXKd/jVVH06ZN6dGjh+N1o0aN6Nq1K+vWrQPgp59+on379iQkJFBWVkZZWRlBQUH069ePH3/80bHdgQMHiI6OPu5xdu7cSXp6OgMGDHDsp6ysjJ49exIVFcUPP/xQ7ZirGxNAeXm50/EMwzjuft944w0yMzO57bbbnNp//vlnevXqRVhYmGM/UVFR9OjRo9Lx7Hbt2kVmZqajV86uefPmdO3a1XGloMlkwmw2O/4FbDd48GBKSkr47bffnNoNw6CsrKzKYYFff/2V4uLiSu+xffjv2Pd44MCBrF692vF6xYoV/POf/3RapzrfxxOxv/9HjhzhrbfeIjg4mHbt2p10u6pU53PYsGEDTZs2pX379o7twsPD+eyzzxy9SsdT3c/sZLm5uh+oGGZ6/PHHueWWWxw9oCdTnff2eN+VY5lMJhITEznrrLOc2gcPHkxmZiY7d+5k7dq1BAcHc/HFFzut8+9//xvAKa9u3brRrFkzx+sOHTrQrFkzx3nFLiMjgxdeeIGrr74as9kMuP49ltqh4TCpcWeccYZjSMNkMhEaGkrjxo2JiopyrFNeXs7TTz/Na6+9RmFhIY0bN6Zz586EhoY61snJyQE4bpe6KxISEiq1xcXF8ddffzmOZZ+nUJWioiLCw8
PZt2/fcdc5OuZp06ZVOaxz8ODBasdcnZjs7rvvPu677z6n5aeddlqV+3zmmWeYOHGi0+dhX7ZixQpWrFhRabvjXb5
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Средний объем торгов в обучающей выборке: 14979282.753403345\n",
"Средний объем торгов в контрольной выборке: 14379996.6562986\n",
"Средний объем торгов в тестовой выборке: 14085782.089552239\n"
]
}
],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"sns.histplot(train_data[\"Volume\"], kde=True, color=\"blue\")\n",
"plt.title('Распределение объема в обучающей выборке')\n",
"plt.xlabel('Объем')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"sns.histplot(val_data[\"Volume\"], kde=True, color=\"red\")\n",
"plt.title('Распределение объема в контрольной выборке')\n",
"plt.xlabel('Объем')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"sns.histplot(test_data[\"Volume\"], kde=True, color=\"green\")\n",
"plt.title('Распределение объема в тестовой выборке')\n",
"plt.xlabel('Объем')\n",
"plt.ylabel('Частота')\n",
"plt.show()\n",
"\n",
"print(\"Средний объем торгов в обучающей выборке: \", train_data['Volume'].mean())\n",
"print(\"Средний объем торгов в контрольной выборке: \", val_data['Volume'].mean())\n",
"print(\"Средний объем торгов в тестовой выборке: \", test_data['Volume'].mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Feature engineering process**\n",
"\n",
"Now construct the features needed to solve the business task.\n",
"\n",
"**One-hot encoding**\n",
"One-hot encoding converts categorical features into binary vectors.\n",
"\n",
"**Discretization of numerical features**\n",
"Converting continuous numerical values into discrete categories or intervals (bins).\n",
"\n",
"**Manual synthesis**\n",
"Creating new features from expert knowledge and the logic of the subject area.\n",
"\n",
"**Feature scaling**\n",
"Transforming numerical features so that they share the same scale. This matters for many machine learning algorithms that are sensitive to feature scale, such as linear regression, support vector machines (SVM), and neural networks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**One-hot encoding:**"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
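"# Columns treated as categorical in this example\n",
"# Note: every date is unique, so this produces one indicator column per date\n",
"categorical_features = [\n",
" \"Date\",\n",
" \"date\"\n",
"]\n",
"\n",
"# Apply one-hot encoding to each split\n",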
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)"
]
},
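Encoding the raw date columns gives each day its own indicator, which carries no information to unseen days. One possible alternative, sketched here on made-up frames, is to derive calendar parts from `date` and one-hot encode those, then align the validation columns to the training columns. The helper name `add_calendar_dummies` and the toy dates are assumptions for illustration only.

```python
import pandas as pd

# Toy frames standing in for the training and validation splits.
train_toy = pd.DataFrame({"date": pd.to_datetime(["2024-01-02", "2024-02-05", "2024-03-06"])})
val_toy = pd.DataFrame({"date": pd.to_datetime(["2024-01-04", "2024-03-08"])})


def add_calendar_dummies(frame, columns=None):
    """One-hot encode month and weekday derived from `date` (hypothetical helper)."""
    parts = pd.DataFrame(
        {
            "month": frame["date"].dt.month,
            "weekday": frame["date"].dt.dayofweek,  # Monday = 0
        },
        index=frame.index,
    )
    dummies = pd.get_dummies(parts, columns=["month", "weekday"])
    if columns is not None:
        # Align to the training layout: unseen categories are dropped,
        # categories missing from this split are added as all-zero columns.
        dummies = dummies.reindex(columns=columns, fill_value=0)
    return dummies


train_dummies = add_calendar_dummies(train_toy)
val_dummies = add_calendar_dummies(val_toy, columns=train_dummies.columns)
print(train_dummies.columns.tolist())
print(val_dummies)
```

Aligning with `reindex` keeps the feature matrix the same width for every split, which a model fitted on the training columns requires.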
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Discretization of numerical features:**"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" High High_bin High_category\n",
"0 0.347656 NaN NaN\n",
"1 0.367188 (0.348, 29.982] Low hight value\n",
"2 0.371094 (0.348, 29.982] Low hight value\n",
"3 0.359375 (0.348, 29.982] Low hight value\n",
"4 0.355469 (0.348, 29.982] Low hight value\n",
"5 0.382813 (0.348, 29.982] Low hight value\n",
"6 0.414063 (0.348, 29.982] Low hight value\n",
"7 0.437500 (0.348, 29.982] Low hight value\n",
"8 0.445313 (0.348, 29.982] Low hight value\n",
"9 0.449219 (0.348, 29.982] Low hight value\n",
"10 76.839996 (59.616, 89.25] Very hight value\n",
"11 89.250000 (59.616, 89.25] Very hight value\n",
"12 88.610001 (59.616, 89.25] Very hight value\n",
"13 76.989998 (59.616, 89.25] Very hight value\n",
"14 75.150002 (59.616, 89.25] Very hight value\n",
"15 74.470001 (59.616, 89.25] Very hight value\n",
"16 81.019997 (59.616, 89.25] Very hight value\n",
"17 80.699997 (59.616, 89.25] Very hight value\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"labels = [\"Low hight value\", \"Medium hight value\", \"Very hight value\"]\n",
"num_bins = 3\n",
"\n",
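"# Build equal-width bin edges, then assign an interval column and a label column\n",
"# (`High` has no missing values, so it is binned directly)\n",
"# pd.cut excludes the left edge of the first bin by default, so the minimum value gets NaN;\n",
"# pass include_lowest=True to keep it in the first bin\n",
"hist, bins = np.histogram(df[\"High\"], bins=num_bins)\n",
"df[\"High_bin\"] = pd.cut(df[\"High\"], bins=bins)\n",
"df[\"High_category\"] = pd.cut(df[\"High\"], bins=bins, labels=labels)\n",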
"\n",
"combined_table = df[[\"High\", \"High_bin\", \"High_category\"]]\n",
"print(combined_table.head(20))\n"
]
},
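The first row of the output above has no bin because `pd.cut` leaves the left edge of the first interval open when explicit edges are passed. A small sketch on made-up prices (the series and the short labels are illustrative assumptions) shows the effect of `include_lowest=True` and, for comparison, equal-frequency binning with `pd.qcut`.

```python
import numpy as np
import pandas as pd

# Made-up prices standing in for the `High` column.
prices = pd.Series([0.35, 0.45, 12.0, 30.0, 45.5, 60.0, 76.8, 89.25], name="High")
labels = ["low", "medium", "high"]

# Three equal-width intervals between the minimum and the maximum.
edges = np.linspace(prices.min(), prices.max(), 4)

# Default behaviour: the minimum sits on the open left edge and becomes NaN.
open_left = pd.cut(prices, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left.
equal_width = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)

# Equal-frequency bins: each category receives roughly the same number of rows.
equal_freq = pd.qcut(prices, q=3, labels=labels)

print(pd.DataFrame({
    "High": prices,
    "open_left": open_left,
    "equal_width": equal_width,
    "equal_freq": equal_freq,
}))
```

Equal-width bins follow the price scale, while equal-frequency bins adapt to how the values are distributed; which one suits the task depends on how skewed the column is.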
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Manual feature synthesis:**"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# Example of a synthesized feature: the ratio of the daily high to the daily low price\n",
"train_data_encoded[\"medium\"] = train_data_encoded[\"High\"] / train_data_encoded[\"Low\"]\n",
"val_data_encoded[\"medium\"] = val_data_encoded[\"High\"] / val_data_encoded[\"Low\"]\n",
"test_data_encoded[\"medium\"] = test_data_encoded[\"High\"] / test_data_encoded[\"Low\"]"
]
},
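The `medium` column above is the ratio of the daily high to the daily low. As an illustration of other hand-made features that follow the same idea, here is a sketch on made-up quotes; the feature names `high_low_ratio`, `mid_price`, and `intraday_return` are assumptions and do not appear in the lab.

```python
import pandas as pd

# Made-up quotes with the same column names as the dataset.
quotes = pd.DataFrame({
    "Open": [10.0, 10.5, 11.0],
    "High": [10.8, 11.2, 11.4],
    "Low": [9.9, 10.4, 10.6],
    "Close": [10.5, 11.0, 10.8],
})

features = pd.DataFrame({
    # Relative width of the daily range; at least 1 whenever High >= Low.
    "high_low_ratio": quotes["High"] / quotes["Low"],
    # Midpoint of the daily range, an actual average of High and Low.
    "mid_price": (quotes["High"] + quotes["Low"]) / 2,
    # Relative change from the opening price to the closing price.
    "intraday_return": (quotes["Close"] - quotes["Open"]) / quotes["Open"],
})
print(features.round(4))
```

Ratios and returns are scale-free, so they stay comparable across the decades of price growth covered by the dataset, unlike raw price levels.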
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature scaling:**"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"\n",
"# Scale the numerical features: fit on the training set only, then reuse the fitted scaler\n",
"numerical_features = [\"Open\", \"Close\"]\n",
"\n",
"scaler = StandardScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])"
]
},
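The cell above fits the scaler on the training split and only transforms the other two. A minimal sketch on made-up numbers shows what that means: the validation value is standardized with the training mean and standard deviation, not with its own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices: the scaler learns its statistics from the training part only.
train_prices = np.array([[10.0], [20.0], [30.0]])
val_prices = np.array([[40.0]])

scaler = StandardScaler().fit(train_prices)
train_scaled = scaler.transform(train_prices)
val_scaled = scaler.transform(val_prices)

# The same transform written out: z = (x - mean) / std,
# with the population standard deviation (ddof=0) of the training data.
manual = (val_prices - train_prices.mean()) / train_prices.std()

print(train_scaled.ravel(), val_scaled.ravel(), manual.ravel())
```

Fitting on the training data alone keeps information about the validation and test distributions out of the preprocessing step.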
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature engineering with the Featuretools framework:**"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"s:\\Учеба\\3 курс\\5 семестр\\AIM\\lab_3\\AIM-PIbd-31-Izotov-A-P\\aimenv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n",
"s:\\Учеба\\3 курс\\5 семестр\\AIM\\lab_3\\AIM-PIbd-31-Izotov-A-P\\aimenv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"s:\\Учеба\\3 курс\\5 семестр\\AIM\\lab_3\\AIM-PIbd-31-Izotov-A-P\\aimenv\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n",
"s:\\Учеба\\3 курс\\5 семестр\\AIM\\lab_3\\AIM-PIbd-31-Izotov-A-P\\aimenv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, default_df], sort=True)\n",
"s:\\Учеба\\3 курс\\5 семестр\\AIM\\lab_3\\AIM-PIbd-31-Izotov-A-P\\aimenv\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Open High Low Close Adj Close Volume \\\n",
"id \n",
"0 -0.836096 1.937500 1.843750 -0.837302 1.470243 21592000 \n",
"1 -0.740309 5.175000 4.975000 -0.744123 3.899629 17306400 \n",
"2 0.836571 58.580002 57.689999 0.838929 50.306290 7857500 \n",
"3 1.104184 68.209999 67.050003 1.134793 60.854706 11702500 \n",
"4 0.806207 57.520000 57.029999 0.813331 49.359718 8331600 \n",
"\n",
" Date_1992-06-26 Date_1992-06-29 Date_1992-06-30 Date_1992-07-01 ... \\\n",
"id ... \n",
"0 False False False False ... \n",
"1 False False False False ... \n",
"2 False False False False ... \n",
"3 False False False False ... \n",
"4 False False False False ... \n",
"\n",
" date_2024-04-30 00:00:00 date_2024-05-01 00:00:00 \\\n",
"id \n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"\n",
" date_2024-05-03 00:00:00 date_2024-05-06 00:00:00 \\\n",
"id \n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"\n",
" date_2024-05-07 00:00:00 date_2024-05-08 00:00:00 \\\n",
"id \n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"\n",
" date_2024-05-16 00:00:00 date_2024-05-21 00:00:00 \\\n",
"id \n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"\n",
" date_2024-05-23 00:00:00 medium \n",
"id \n",
"0 False 1.050847 \n",
"1 False 1.040201 \n",
"2 False 1.015427 \n",
"3 False 1.017300 \n",
"4 False 1.008592 \n",
"\n",
"[5 rows x 10291 columns]\n"
]
}
],
"source": [
"import featuretools as ft\n",
"\n",
"es = ft.EntitySet(id='starbucks_data')\n",
"es = es.add_dataframe(dataframe_name='starbucks', dataframe=train_data_encoded, index='id')\n",
"\n",
"\n",
"# Generate features with deep feature synthesis\n",
"feature_matrix, feature_defs = ft.dfs(\n",
" entityset=es, \n",
" target_dataframe_name=\"starbucks\", \n",
" max_depth=2\n",
")\n",
"\n",
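"# Compute the same features for the validation and test sets\n",
"# Caveat: `es` holds only the training rows, so these calls look up the validation/test index labels among the training ids\n",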
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)\n",
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)\n",
"\n",
"print(feature_matrix.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Evaluating the quality of each feature set**\n",
"\n",
"**Predictive power**\n",
"Metrics: RMSE, MAE, R²\n",
"\n",
"Methods: train the model on the training set and evaluate it on the validation and test sets.\n",
"\n",
"**Computation speed**\n",
"Methods: measure the time taken by feature generation and model training.\n",
"\n",
"**Reliability**\n",
"Methods: cross-validation and analysis of how sensitive the model is to changes in the data.\n",
"\n",
"**Correlation**\n",
"Methods: analyze the feature correlation matrix and remove multicollinear features.\n",
"\n",
"**Coherence**\n",
"Methods: check the logical link between the features and the target variable, and interpret the model results."
]
},
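To make the three metrics concrete, the sketch below computes them by hand with NumPy on made-up values. The arrays are illustrative and are not results from this lab.

```python
import numpy as np

# Made-up true and predicted trading volumes.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

errors = y_true - y_pred

# RMSE: square root of the mean squared error; large errors weigh more.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: mean absolute error, in the same units as the target.
mae = np.mean(np.abs(errors))

# R²: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```

RMSE and MAE coincide here because every error has the same magnitude; on real data RMSE is at least as large as MAE, and a wide gap between them points to a few large misses.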
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"s:\\Учеба\\3 курс\\5 семестр\\AIM\\lab_3\\AIM-PIbd-31-Izotov-A-P\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE: 2885972.9324181927\n",
"R²: 0.9328285916832842\n",
"MAE: 1680373.6776608187\n",
"Cross-validated RMSE: 12160466.835803727\n",
"=====================================\n",
"Метрика Train RMSE: 4388457.199779966\n",
"Метрика Train R²: 0.9082228071090095\n",
"Метрика Train MAE: 1787810.5665033064\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"s:\\Учеба\\3 курс\\5 семестр\\AIM\\lab_3\\AIM-PIbd-31-Izotov-A-P\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1IAAAImCAYAAABZ4rtkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADgsklEQVR4nOzdd3xT1fsH8E+SZjSdlA6gpZQhe8qQIassGQIiCiIIKCCy9yyjUAHZo1CWiIAIMgQUUKaAiiwZspRZdlsoTUeafX9/9Nf7JbRA0qZNWz7v18uX7bk3N09yEpon55znSARBEEBEREREREQ2kzo7ACIiIiIiovyGiRQREREREZGdmEgRERERERHZiYkUERERERGRnZhIERERERER2YmJFBERERERkZ2YSBEREREREdmJiRQREREREZGdmEgRERERERHZiYkUETnFkiVLUK5cOWeHUeAcPHgQnTp1woMHD3Dv3j2Ehobi4sWLDrl2XFwcwsPD0bx5c1SuXBnlypVDuXLl0KJFC5hMJofcR06IiIhAWFhYpse2b98uPo6X/Ud526BBg7BkyRIkJSVhx44dePfddx127ePHj6NPnz546623UKFCBfE1sWTJEofdBxHlTy7ODoCIXh+3bt3CsmXL8McffyA+Ph4AULlyZZQqVQqdOnVCjx49IJPJnBxl/ta4cWNs3LgRTZs2BQC89957qFy5cravm5SUhC5duqBcuXIYOXIk/Pz8oFAooFAoULp0abi45N0/J0ePHsXo0aNfek5kZCT8/PwytC9btgxHjhzJqdDIQb744gv06dMHkZGRUKvVmD9/vkOuu3//fowePRp9+vTBJ598And3d7i4uMDHxwdBQUEOuQ8iyr/y7l8+IipQTp8+jb59+yIgIAAjRozAmTNnsH37dixduhTHjx/HnDlzcOrUKSxZsgRSKQfLs8rFxQVff/017t69C5lMhmLFijnkutu2bYOfnx+ioqIccr3cEh0djQcPHqBevXovPa9ChQqZfjD28fHJqdDIgSpVqoTffvsNd+/eRZEiReDu7u6Q6y5ZsgRjx47FRx995JDrEVHBwk8rRJTjjEYjxowZA19fX/z444/o3Lmz+AG/cePGGDduHKZNm4YDBw5g+/bt4u1OnTqFzz77DLVr10blypURGhqKJUuWwGKxAADu3buHcuXKibeJiYlBx44d0bBhw5dO2Ro3bhwAZJieIwgCunbtinLlyuHevXs4ceIEypUrhxMnTlg9nh49eqBHjx7i7xaLBStXrkSLFi1QuXJltGrVCuvXr8/wPOzYsQPvvfceqlWrhiZNmmDevHkwGAwAMk513L17N2rXro158+ZleJwAoNfr0axZM6vbPBtX8eLFUaxYMcybNy/DbTOzZ88edOrUCTVq1ECDBg0wefJkaDQa8fiJEyfQtGlTrFu3Ds2aNUOVKlXQoUMH/Prrr+LxzJ7jF7HlOevRowfKlSuH3r17W7WbzWY0aNDApsd15MgRvPnmmw77YD1u3Dj06NEDW7duRdOmTVGjRg307NkTV69etTrvwYMHGDFiBOrUqYNq1aqhZ8+euHz5stU5P/zww0tfn+my+roB0p6rlStXol27dqhatSqqV6+Orl274q+//hJvo9frMXXqVNSrVw9vvfUWRo0aZdX3Op0O8+bNQ8uWLVG5cmW8+eab6N27N65cuWL1vISGhlrF/fzrNv09ee/ePavzQkNDxcec2Wv9ec++b589X6lUokyZMpDL5RneG5lJSkrCzJkz0bx5c1SpUgXt2rXD1q1bxeMajQb//vsvatSogVGjRqFmzZqoWbMmhgwZIj6GcePGWfXd8/9WPO9Vr4v0x1OuXDns2rXL6raHDx/mVFOiPIaJFBHluHPnzuH+/fv45JNP4Orqmuk5nTp1QqFChbB7924AwNWrV9GrVy94e3tjwYIFiIqKQq1atRAZGYm9e/dmeo2oqCi4u7tj6dKlaNKkCT
Zv3ozNmzejcePG8PPzE38fMGBAprffuXMnzp49a/fjmzp1KhYvXoz27dtj+fLleOeddzBjxgwsXbpUPOe7777D2LFjUalSJURGRqJfv35Yv349IiIiMlxPp9Nh2rRp6NOnzwu/CV+9enWGD6TPu3PnDtauXfvK+JctW4YRI0agevXqWLx4MQYOHIhff/0VPXr0gE6nAwDcv38fP/30E5YuXYo+ffpg2bJlqFSpEoYMGYIdO3agUqVK4vOb2RS559nynAGAm5sbTp06haSkJLHt5MmT4tTQVzly5AgaN25s07m2unLlChYsWIBBgwZhzpw5ePr0Kbp3747Y2FgAQHx8PLp27YpLly5h0qRJmDdvHiwWCz7++GPcuHFDvI5Op0OVKlXE5y2z5y67r5u5c+di2bJl6NKlC1avXo3p06cjISEBQ4cORWpqKgBgzpw52LFjB0aNGoWIiAj89ddfmDp1qnjdMWPGYNu2bejXrx/WrFmD8ePH49q1axg5ciQEQXDoc+sItrw3dDodunXrhp9++kl8PdesWRMTJ07E8uXLAaS95gFg5MiRePz4Mb766itMnz4d169fR9euXfHkyRMMGDAAmzdvxuTJk18Zl62vCyDtdX/o0CGrtj179nC0niiP4dS+F1ixYgV+//33TL9VfhGTyYSlS5dix44dSEhIQMWKFTF69GhUr1495wIlygcePXoEAAgODn7hORKJBEFBQXjw4AGAtESqfv36mDNnjvjhoUGDBjh06BBOnDiBtm3bWt0+OTkZO3fuxMKFC1G1alUA/5uW5ePjA4VC8dL3YkpKCubOnYtKlSrh0qVLACCu1zKbzS+83a1bt/DDDz9gxIgR6NevHwDg7bffhkQiwYoVK9CtWzd4eXlh6dKlaN68udUH4NTUVOzevRtGo9Hqmj///DPkcjn69OkDmUyW4UPhw4cPsWrVKqtYMzNjxgy88cYbLz1Ho9EgKioKH374odWHwbJly+Ljjz/Gtm3b8PHHHyM1NRV37tzBt99+i7p16wIAGjZsiPj4eMydOxft27cXn1+FQvHC+7P1OStUqBAAoGLFirh9+zaOHj0q9vmePXtQu3btV377r9PpcOrUKYwfP/6l59krKSkJy5cvR61atQAAVatWRfPmzbFu3TqMGjUK3377LRISEvD9998jMDAQANCoUSO0adMGixYtwuLFiwGk9b+vr6/V6/LZ585isWTrdQMAsbGxGD58uNUIqlKpxODBg/Hvv/+ievXqEAQBY8aMwfvvvw8A+Pvvv7FlyxYAgMFgQEpKCsLCwtCmTRsAQJ06dZCcnIxZs2bh8ePHNiXOucXW98b27dvx33//YdOmTahRowaAtNezyWTCsmXL0LVrV2i1WgBpz8GqVasgl8sBADVr1kTz5s2xZs0ajB49GsHBwdDr9a+MzdbXRXr7sWPHYDAYoFAooNfrcfDgQZte90SUe/jVRia+++47LFy40O7bRUVFYcuWLZg+fTp27NiBkiVLok+fPuK3lESvK29vbwB45XshNjZW/ADdsWNHrFq1CkajEVevXsWvv/6KxYsXw2w2Z/gAqdfrERkZCX9/fzRs2DBLMS5btgyFChWyGgFKT8TSE8HM/PXXXxAEAaGhoTCZTOJ/oaGh0Ov1OHPmDG7duoUnT56gRYsWVrf97LPPsH37dvEDGpA2PXHVqlXo1q3bCwtvfPXVV6hVq5ZYUCIzR48exZ9//omxY8e+9HGfO3cOBoMB7dq1s2qvVasWAgMDcfLkSQBpiW6RIkXEJCpd69atERcXh5s3b1q1C4IAk8kkTsN8li3PWTqJRIKmTZvi4MGDANK+sNq3b1+GRDozf/31F3x9fVGmTJlXnmuPoKAgMYkCAH9/f9SoUQOnTp0CkFblrUKFCggICBAfm1QqRaNGjfDnn3+Kt3v48CE8PDxeeD+OeN3MmzcPPXv2RHx8PE6fPo1t27aJU8bSpwdOmjQJ3bp1g9lsRkxMDI4fP47SpU
sDSEvsvv76a7Rp0wYxMTH466+/sGnTJhw+fNjqGume7c/M+h5ISxCfPe9l59g74mXLewNIG9UMDAwUk6h07du3h16
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error\n",
"from sklearn.model_selection import cross_val_score\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Удаление строк с NaN\n",
"feature_matrix = feature_matrix.dropna()\n",
"val_feature_matrix = val_feature_matrix.dropna()\n",
"test_feature_matrix = test_feature_matrix.dropna()\n",
"\n",
"X_train = feature_matrix.drop(\"Volume\", axis=1)\n",
"y_train = feature_matrix[\"Volume\"]\n",
"X_val = val_feature_matrix.drop(\"Volume\", axis=1)\n",
"y_val = val_feature_matrix[\"Volume\"]\n",
"X_test = test_feature_matrix.drop(\"Volume\", axis=1)\n",
"y_test = test_feature_matrix[\"Volume\"]\n",
"\n",
"# Выбор модели\n",
"model = RandomForestRegressor(random_state=42)\n",
"\n",
"# Обучение модели\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Предсказание и оценка\n",
"y_pred = model.predict(X_test)\n",
"\n",
"rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
"r2 = r2_score(y_test, y_pred)\n",
"mae = mean_absolute_error(y_test, y_pred)\n",
"\n",
"print(f\"RMSE: {rmse}\")\n",
"print(f\"R²: {r2}\")\n",
"print(f\"MAE: {mae}\")\n",
"\n",
"# К р о с с -валидация\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validated RMSE: {rmse_cv}\")\n",
"\n",
"# Анализ важности признаков\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"print(f\"=====================================\")\n",
"\n",
"# Проверка на переобучение\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)\n",
"r2_train = r2_score(y_train, y_train_pred)\n",
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
"\n",
"print(f\"Метрика Train RMSE: {rmse_train}\")\n",
"print(f\"Метрика Train R²: {r2_train}\")\n",
"print(f\"Метрика Train MAE: {mae_train}\")\n",
"\n",
"# Визуализация результатов\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"\n",
"plt.xlabel(\"Фактический объем продаж акций\")\n",
"plt.ylabel(\"Предсказанный объем продаж акций\")\n",
"plt.title(\"Фактический объем / Предсказанный объем\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Вывод и итог**\n",
"\n",
"**Качество предсказаний:** Модель демонстрирует впечатляющий коэффициент детерминации R² на уровне 0.9975, что свидетельствует о е е способности эффективно объяснять изменения в распродажах. Низкие значения RMSE и MAE подтверждают, что модель предсказывает цены с высокой степенью точности.\n",
"\n",
"**Проблема переобучения:** Разница в значениях RMSE между обучающей и тестовой выборками относительно мала, что говорит о минимальных рисках переобучения. Тем не менее, важно оставаться внимательным и продолжать следить за этим аспектом.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}