2025-01-20 01:56:12 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observation objects and their attributes\n",
"Observation objects:\n",
"In this case the observation objects are shares of Starbucks. Each record in the dataset represents one trading day.\n",
"\n",
"The attributes of Starbucks shares include:\n",
"Date: the date of the trading day.\n",
"Open: the opening price of Starbucks shares on that day.\n",
"High: the highest price of Starbucks shares during the trading day.\n",
"Low: the lowest price of Starbucks shares during the trading day.\n",
"Close: the closing price of Starbucks shares on that day.\n",
"Adj Close: the adjusted closing price of Starbucks shares.\n",
"Volume: the trading volume of Starbucks shares on that day.\n",
"\n",
"Relationships between objects may appear as:\n",
"Temporal dependencies: for example, price changes on different days may be tied to events within the company or in the market as a whole.\n",
"Correlations: for example, a high share price on a given day may coincide with high trading volume, which may indicate heightened investor interest.\n",
"\n",
"Business goals\n",
"\n",
"Optimizing investment decisions:\n",
"Business effect: better-grounded investment decisions can increase the return on investment.\n",
"Technical project goal: build a forecasting model that predicts future share price changes from historical data.\n",
"Input data: share prices, trading volumes, and other market indicators.\n",
"Target feature: the predicted share price for the next day.\n",
"\n",
"Analyzing the effect of seasonality on share prices:\n",
"Business effect: understanding seasonal fluctuations in share prices can help with planning investment decisions and managing assets.\n",
"Technical project goal: build a system that analyzes Starbucks share prices by time of year and identifies seasonal trends and anomalies.\n",
"Input data: share prices (Close, Adj Close) and the date (Date), used to detect seasonal patterns.\n",
"Target feature: the change in share price by season, measured as the percentage price change over different time intervals (for example, quarters or months)."
]
},
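{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next-day target described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual pipeline: the column name `Target_Next_Close` is hypothetical, and the four prices are toy values used only to show the alignment.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy closing prices for illustration only\n",
"prices = pd.DataFrame({'Close': [0.335938, 0.359375, 0.347656, 0.355469]})\n",
"\n",
"# Hypothetical target column: tomorrow's close aligned with today's row\n",
"prices['Target_Next_Close'] = prices['Close'].shift(-1)\n",
"\n",
"# The last row has no next day, so it is dropped\n",
"prices = prices.dropna(subset=['Target_Next_Close'])\n",
"print(prices)\n",
"```\n",
"\n",
"Shifting by -1 pairs each day's features with the following day's close, so no row sees its own target among its inputs."
]
},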
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Loading the dataset"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Date Open High Low Close Adj Close Volume\n",
"0 1992-06-26 0.328125 0.347656 0.320313 0.335938 0.260703 224358400\n",
"1 1992-06-29 0.339844 0.367188 0.332031 0.359375 0.278891 58732800\n",
"2 1992-06-30 0.367188 0.371094 0.343750 0.347656 0.269797 34777600\n",
"3 1992-07-01 0.351563 0.359375 0.339844 0.355469 0.275860 18316800\n",
"4 1992-07-02 0.359375 0.359375 0.347656 0.355469 0.275860 13996800\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"starbucks = pd.read_csv(\"data/starbucks.csv\")\n",
"\n",
"# Check the result\n",
"print(starbucks.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### One-hot encoding\n",
"\n",
"Converting a categorical feature into several binary features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### One-hot encoding of the Date and trading volume (Volume) features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Encoding"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Date Open High Low Close Adj Close Volume \\\n",
"0 1992-06-26 0.328125 0.347656 0.320313 0.335938 0.260703 224358400 \n",
"1 1992-06-29 0.339844 0.367188 0.332031 0.359375 0.278891 58732800 \n",
"2 1992-06-30 0.367188 0.371094 0.343750 0.347656 0.269797 34777600 \n",
"3 1992-07-01 0.351563 0.359375 0.339844 0.355469 0.275860 18316800 \n",
"4 1992-07-02 0.359375 0.359375 0.347656 0.355469 0.275860 13996800 \n",
"\n",
" Season \n",
"0 Summer \n",
"1 Summer \n",
"2 Summer \n",
"3 Summer \n",
"4 Summer \n"
]
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"\n",
"# Convert the 'Date' column to datetime\n",
"starbucks[\"Date\"] = pd.to_datetime(starbucks[\"Date\"])\n",
"\n",
"\n",
"# Function that determines the season from a date\n",
"def get_season(date):\n",
"    if date.month in [12, 1, 2]:\n",
"        return \"Winter\"\n",
"    elif date.month in [3, 4, 5]:\n",
"        return \"Spring\"\n",
"    elif date.month in [6, 7, 8]:\n",
"        return \"Summer\"\n",
"    elif date.month in [9, 10, 11]:\n",
"        return \"Autumn\"\n",
"\n",
"\n",
"# Apply the function to the 'Date' column\n",
"starbucks[\"Season\"] = starbucks[\"Date\"].apply(get_season)\n",
"\n",
"# Encode the seasons with OneHotEncoder\n",
"encoder = OneHotEncoder(sparse_output=False, drop=None)  # drop=None keeps a column for every season\n",
"encoded_values = encoder.fit_transform(starbucks[[\"Season\"]])\n",
"\n",
"# Get the names of the encoded columns\n",
"encoded_columns = encoder.get_feature_names_out([\"Season\"])\n",
"\n",
"# Check the result\n",
"print(starbucks.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Adding the features to the original DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Open</th>\n",
" <th>High</th>\n",
" <th>Low</th>\n",
" <th>Close</th>\n",
" <th>Adj Close</th>\n",
" <th>Volume</th>\n",
" <th>Season</th>\n",
" <th>Season_Autumn</th>\n",
" <th>Season_Spring</th>\n",
" <th>Season_Summer</th>\n",
" <th>Season_Winter</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1992-06-26</td>\n",
" <td>0.328125</td>\n",
" <td>0.347656</td>\n",
" <td>0.320313</td>\n",
" <td>0.335938</td>\n",
" <td>0.260703</td>\n",
" <td>224358400</td>\n",
" <td>Summer</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1992-06-29</td>\n",
" <td>0.339844</td>\n",
" <td>0.367188</td>\n",
" <td>0.332031</td>\n",
" <td>0.359375</td>\n",
" <td>0.278891</td>\n",
" <td>58732800</td>\n",
" <td>Summer</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1992-06-30</td>\n",
" <td>0.367188</td>\n",
" <td>0.371094</td>\n",
" <td>0.343750</td>\n",
" <td>0.347656</td>\n",
" <td>0.269797</td>\n",
" <td>34777600</td>\n",
" <td>Summer</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1992-07-01</td>\n",
" <td>0.351563</td>\n",
" <td>0.359375</td>\n",
" <td>0.339844</td>\n",
" <td>0.355469</td>\n",
" <td>0.275860</td>\n",
" <td>18316800</td>\n",
" <td>Summer</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1992-07-02</td>\n",
" <td>0.359375</td>\n",
" <td>0.359375</td>\n",
" <td>0.347656</td>\n",
" <td>0.355469</td>\n",
" <td>0.275860</td>\n",
" <td>13996800</td>\n",
" <td>Summer</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8031</th>\n",
" <td>2024-05-17</td>\n",
" <td>75.269997</td>\n",
" <td>78.000000</td>\n",
" <td>74.919998</td>\n",
" <td>77.849998</td>\n",
" <td>77.849998</td>\n",
" <td>14436500</td>\n",
" <td>Spring</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8032</th>\n",
" <td>2024-05-20</td>\n",
" <td>77.680000</td>\n",
" <td>78.320000</td>\n",
" <td>76.709999</td>\n",
" <td>77.540001</td>\n",
" <td>77.540001</td>\n",
" <td>11183800</td>\n",
" <td>Spring</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8033</th>\n",
" <td>2024-05-21</td>\n",
" <td>77.559998</td>\n",
" <td>78.220001</td>\n",
" <td>77.500000</td>\n",
" <td>77.720001</td>\n",
" <td>77.720001</td>\n",
" <td>8916600</td>\n",
" <td>Spring</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8034</th>\n",
" <td>2024-05-22</td>\n",
" <td>77.699997</td>\n",
" <td>81.019997</td>\n",
" <td>77.440002</td>\n",
" <td>80.720001</td>\n",
" <td>80.720001</td>\n",
" <td>22063400</td>\n",
" <td>Spring</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8035</th>\n",
" <td>2024-05-23</td>\n",
" <td>80.099998</td>\n",
" <td>80.699997</td>\n",
" <td>79.169998</td>\n",
" <td>79.260002</td>\n",
" <td>79.260002</td>\n",
" <td>4651418</td>\n",
" <td>Spring</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8036 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" Date Open High Low Close Adj Close \\\n",
"0 1992-06-26 0.328125 0.347656 0.320313 0.335938 0.260703 \n",
"1 1992-06-29 0.339844 0.367188 0.332031 0.359375 0.278891 \n",
"2 1992-06-30 0.367188 0.371094 0.343750 0.347656 0.269797 \n",
"3 1992-07-01 0.351563 0.359375 0.339844 0.355469 0.275860 \n",
"4 1992-07-02 0.359375 0.359375 0.347656 0.355469 0.275860 \n",
"... ... ... ... ... ... ... \n",
"8031 2024-05-17 75.269997 78.000000 74.919998 77.849998 77.849998 \n",
"8032 2024-05-20 77.680000 78.320000 76.709999 77.540001 77.540001 \n",
"8033 2024-05-21 77.559998 78.220001 77.500000 77.720001 77.720001 \n",
"8034 2024-05-22 77.699997 81.019997 77.440002 80.720001 80.720001 \n",
"8035 2024-05-23 80.099998 80.699997 79.169998 79.260002 79.260002 \n",
"\n",
" Volume Season Season_Autumn Season_Spring Season_Summer \\\n",
"0 224358400 Summer 0.0 0.0 1.0 \n",
"1 58732800 Summer 0.0 0.0 1.0 \n",
"2 34777600 Summer 0.0 0.0 1.0 \n",
"3 18316800 Summer 0.0 0.0 1.0 \n",
"4 13996800 Summer 0.0 0.0 1.0 \n",
"... ... ... ... ... ... \n",
"8031 14436500 Spring 0.0 1.0 0.0 \n",
"8032 11183800 Spring 0.0 1.0 0.0 \n",
"8033 8916600 Spring 0.0 1.0 0.0 \n",
"8034 22063400 Spring 0.0 1.0 0.0 \n",
"8035 4651418 Spring 0.0 1.0 0.0 \n",
"\n",
" Season_Winter \n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"... ... \n",
"8031 0.0 \n",
"8032 0.0 \n",
"8033 0.0 \n",
"8034 0.0 \n",
"8035 0.0 \n",
"\n",
"[8036 rows x 12 columns]"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Build a DataFrame from the encoded values\n",
"encoded_values_df = pd.DataFrame(encoded_values, columns=encoded_columns)\n",
"\n",
"# Concatenate the encoded values with the original DataFrame\n",
"starbucks = pd.concat([starbucks, encoded_values_df], axis=1)\n",
"\n",
"\n",
"starbucks"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Summer' 'Autumn' 'Winter' 'Spring']\n",
"Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Season',\n",
" 'Season_Autumn', 'Season_Spring', 'Season_Summer', 'Season_Winter'],\n",
" dtype='object')\n"
]
}
],
"source": [
"# Check the unique values in the 'Season' column\n",
"print(starbucks[\"Season\"].unique())\n",
"print(starbucks.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conclusion:\n",
"Predictive power:\n",
"The Season feature may help predict trading volume, since demand for the shares may vary with the time of year (this is used in one of the business goals).\n",
"\n",
"Computation speed:\n",
"\n",
"Features should be computable in reasonable time. For example, creating the Season feature with apply can be slow, but on this dataset everything is computed very quickly.\n",
"\n",
"Reliability:\n",
"Historical trading data does not change, so the derived features remain valid.\n",
"\n",
"Correlation:\n",
"The features have a low degree of correlation."
]
},
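{
"cell_type": "markdown",
"metadata": {},
"source": [
"The remark above about `apply` being potentially slow can be illustrated with a vectorized variant. This is a minimal sketch that assumes the same month-to-season mapping as `get_season`; the name `SEASON_BY_MONTH` and the toy dates are hypothetical and not part of the original notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical lookup table mirroring get_season: month number -> season name\n",
"SEASON_BY_MONTH = {\n",
"    12: 'Winter', 1: 'Winter', 2: 'Winter',\n",
"    3: 'Spring', 4: 'Spring', 5: 'Spring',\n",
"    6: 'Summer', 7: 'Summer', 8: 'Summer',\n",
"    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',\n",
"}\n",
"\n",
"# Toy dates for illustration only\n",
"dates = pd.Series(pd.to_datetime(['1992-06-26', '2024-01-15', '2024-05-23', '2023-10-02']))\n",
"\n",
"# .dt.month extracts the month for the whole column at once; .map performs the lookup\n",
"seasons = dates.dt.month.map(SEASON_BY_MONTH)\n",
"print(seasons.tolist())\n",
"```\n",
"\n",
"On a dataset of about 8000 rows the difference is negligible, so this matters mainly if the same step is reused on much larger data."
]
},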
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Feature discretization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equal-width split of the data into 3 groups (by trading volume)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"labels = [\"low\", \"medium\", \"high\"]\n",
"num_bins = 3"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([1.50400000e+06, 1.96172267e+08, 3.90840533e+08, 5.85508800e+08]),\n",
" array([8032, 3, 1]))"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hist1, bins1 = np.histogram(starbucks[\"Volume\"].fillna(starbucks[\"Volume\"].median()), bins=num_bins)\n",
"bins1, hist1"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Volume</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>224358400</td>\n",
" <td>(196172266.667, 390840533.333]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>58732800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>34777600</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18316800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13996800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5753600</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>10662400</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>15500800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3923200</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>11040000</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>5996800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>17062400</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>4992000</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>17062400</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>15667200</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>19744000</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7782400</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10892800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>10387200</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>7052800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Volume Volume\n",
"0 224358400 (196172266.667, 390840533.333]\n",
"1 58732800 (1504000.0, 196172266.667]\n",
"2 34777600 (1504000.0, 196172266.667]\n",
"3 18316800 (1504000.0, 196172266.667]\n",
"4 13996800 (1504000.0, 196172266.667]\n",
"5 5753600 (1504000.0, 196172266.667]\n",
"6 10662400 (1504000.0, 196172266.667]\n",
"7 15500800 (1504000.0, 196172266.667]\n",
"8 3923200 (1504000.0, 196172266.667]\n",
"9 11040000 (1504000.0, 196172266.667]\n",
"10 5996800 (1504000.0, 196172266.667]\n",
"11 17062400 (1504000.0, 196172266.667]\n",
"12 4992000 (1504000.0, 196172266.667]\n",
"13 17062400 (1504000.0, 196172266.667]\n",
"14 15667200 (1504000.0, 196172266.667]\n",
"15 19744000 (1504000.0, 196172266.667]\n",
"16 7782400 (1504000.0, 196172266.667]\n",
"17 10892800 (1504000.0, 196172266.667]\n",
"18 10387200 (1504000.0, 196172266.667]\n",
"19 7052800 (1504000.0, 196172266.667]"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([starbucks[\"Volume\"], pd.cut(starbucks[\"Volume\"], list(bins1))], axis=1).head(20)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Volume</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>224358400</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>58732800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>34777600</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18316800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13996800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5753600</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>10662400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>15500800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3923200</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>11040000</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>5996800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>17062400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>4992000</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>17062400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>15667200</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>19744000</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7782400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10892800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>10387200</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>7052800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Volume Volume\n",
"0 224358400 medium\n",
"1 58732800 low\n",
"2 34777600 low\n",
"3 18316800 low\n",
"4 13996800 low\n",
"5 5753600 low\n",
"6 10662400 low\n",
"7 15500800 low\n",
"8 3923200 low\n",
"9 11040000 low\n",
"10 5996800 low\n",
"11 17062400 low\n",
"12 4992000 low\n",
"13 17062400 low\n",
"14 15667200 low\n",
"15 19744000 low\n",
"16 7782400 low\n",
"17 10892800 low\n",
"18 10387200 low\n",
"19 7052800 low"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([starbucks[\"Volume\"], pd.cut(starbucks[\"Volume\"], list(bins1), labels=labels)], axis=1\n",
").head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conclusion:\n",
"Predictive power\n",
"The Volume_Group feature, which splits trading volume into groups, can also be useful for finding patterns in the data.\n",
"\n",
"Computation speed\n",
"Features should be computable in reasonable time.\n",
"\n",
"Reliability\n",
"Features should be stable and should not change significantly under small changes in the data. For example, if the trading data does not change, the derived features should remain valid. This is important for consistent predictions.\n",
"\n",
"Correlation\n",
"Features should not be strongly correlated with one another, to avoid multicollinearity. For example, if Volume and Open are highly correlated, this can cause problems in models that assume independent features. The correlation between features should be analyzed and redundant features removed.\n",
"\n",
"Integrity\n",
"Features should be logically justified and consistent with the business logic. For example, the Volume_Group feature, which splits trading volume into groups, should be understandable and useful for analysis. This helps with interpreting results and making business decisions.\n",
"\n",
"Example of analyzing the trading volume distribution\n",
"A histogram can be used to analyze the distribution of trading volume. The code above calls np.histogram, which returns the bin edges and the number of days in each bin. With equal-width bins almost all days (8032 of 8036) fall into the lowest group."
]
},
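{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the equal-width groups above are so imbalanced, one possible alternative (not used in the original notebook) is quantile-based binning with `pd.qcut`. The sketch below is an assumption-laden illustration on toy values: the `volume` series is made up, and only the `labels` list matches the notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"labels = ['low', 'medium', 'high']\n",
"\n",
"# Toy, heavily skewed volumes for illustration only\n",
"volume = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 1000])\n",
"\n",
"# Equal-width bins: the single outlier stretches the range\n",
"equal_width = pd.cut(volume, bins=3, labels=labels)\n",
"\n",
"# Quantile bins: each group receives roughly the same number of rows\n",
"equal_freq = pd.qcut(volume, q=3, labels=labels)\n",
"\n",
"print(equal_width.value_counts().to_dict())\n",
"print(equal_freq.value_counts().to_dict())\n",
"```\n",
"\n",
"Quantile bins trade interpretable, fixed-width boundaries for balanced group sizes, so the choice depends on how the groups will be used."
]
},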
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equal-width split of the data into 3 groups with an explicitly set value range"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([1.50400000e+06, 1.96172267e+08, 3.90840533e+08, 5.85508800e+08]),\n",
" array([8032, 3, 0, 1]))"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bins2 = np.linspace(1504000, 585508800, 4)\n",
"\n",
"tmp_bins2 = np.digitize(starbucks[\"Volume\"].fillna(starbucks[\"Volume\"].median()), bins2)\n",
"\n",
"hist2 = np.bincount(tmp_bins2 - 1)\n",
"\n",
"bins2, hist2"
]
},
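{
"cell_type": "markdown",
"metadata": {},
"source": [
"The count array above has four entries for three bins. A likely explanation is that `np.digitize` uses half-open intervals, so a value equal to the last edge (here the maximum volume) lands in an extra overflow bin, whereas `np.histogram` closes the last interval. A minimal sketch on toy numbers (the `edges` and `values` arrays are hypothetical):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"edges = np.linspace(0, 30, 4)      # 0, 10, 20, 30\n",
"values = np.array([0, 5, 15, 30])  # 30 equals the last edge\n",
"\n",
"# np.digitize treats bins as [a, b): 30 falls past the last edge into a fourth bin\n",
"idx_digitize = np.digitize(values, edges) - 1\n",
"counts_digitize = np.bincount(idx_digitize)\n",
"\n",
"# np.histogram makes the last bin closed, so 30 stays in the third bin\n",
"counts_hist, _ = np.histogram(values, bins=edges)\n",
"\n",
"print(counts_digitize, counts_hist)\n",
"```\n",
"\n",
"If three groups are required with `np.digitize`, one option is to clip the indices to the last valid bin before counting."
]
},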
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Volume</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>224358400</td>\n",
" <td>(196172266.667, 390840533.333]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>58732800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>34777600</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18316800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13996800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5753600</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>10662400</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>15500800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3923200</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>11040000</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>5996800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>17062400</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>4992000</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>17062400</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>15667200</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>19744000</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7782400</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10892800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>10387200</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>7052800</td>\n",
" <td>(1504000.0, 196172266.667]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Volume Volume\n",
"0 224358400 (196172266.667, 390840533.333]\n",
"1 58732800 (1504000.0, 196172266.667]\n",
"2 34777600 (1504000.0, 196172266.667]\n",
"3 18316800 (1504000.0, 196172266.667]\n",
"4 13996800 (1504000.0, 196172266.667]\n",
"5 5753600 (1504000.0, 196172266.667]\n",
"6 10662400 (1504000.0, 196172266.667]\n",
"7 15500800 (1504000.0, 196172266.667]\n",
"8 3923200 (1504000.0, 196172266.667]\n",
"9 11040000 (1504000.0, 196172266.667]\n",
"10 5996800 (1504000.0, 196172266.667]\n",
"11 17062400 (1504000.0, 196172266.667]\n",
"12 4992000 (1504000.0, 196172266.667]\n",
"13 17062400 (1504000.0, 196172266.667]\n",
"14 15667200 (1504000.0, 196172266.667]\n",
"15 19744000 (1504000.0, 196172266.667]\n",
"16 7782400 (1504000.0, 196172266.667]\n",
"17 10892800 (1504000.0, 196172266.667]\n",
"18 10387200 (1504000.0, 196172266.667]\n",
"19 7052800 (1504000.0, 196172266.667]"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([starbucks[\"Volume\"], pd.cut(starbucks[\"Volume\"], list(bins2))], axis=1).head(20)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Volume</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>224358400</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>58732800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>34777600</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18316800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13996800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5753600</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>10662400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>15500800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3923200</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>11040000</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>5996800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>17062400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>4992000</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>17062400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>15667200</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>19744000</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7782400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10892800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>10387200</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>7052800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Volume Volume\n",
"0 224358400 medium\n",
"1 58732800 low\n",
"2 34777600 low\n",
"3 18316800 low\n",
"4 13996800 low\n",
"5 5753600 low\n",
"6 10662400 low\n",
"7 15500800 low\n",
"8 3923200 low\n",
"9 11040000 low\n",
"10 5996800 low\n",
"11 17062400 low\n",
"12 4992000 low\n",
"13 17062400 low\n",
"14 15667200 low\n",
"15 19744000 low\n",
"16 7782400 low\n",
"17 10892800 low\n",
"18 10387200 low\n",
"19 7052800 low"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat(\n",
" [starbucks[\"Volume\"], pd.cut(starbucks[\"Volume\"], list(bins2), labels=labels)], axis=1\n",
").head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split of the data into 3 groups with custom intervals (min: 1504000, median: 11698150, max: 585508800); the inner bin edges, 6601075 and 298603475, are the midpoints between min and median and between median and max"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([ 1504000, 6601075, 298603475, 585508800]),\n",
" array([1304, 6731, 1]))"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hist3, bins3 = np.histogram(\n",
" starbucks[\"Volume\"].fillna(starbucks[\"Volume\"].median()),\n",
" bins=[1504000, 6601075, 298603475, 585508800],\n",
")\n",
"\n",
"\n",
"bins3, hist3"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Volume</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>224358400</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>58732800</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>34777600</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18316800</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13996800</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5753600</td>\n",
" <td>(1504000, 6601075]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>10662400</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>15500800</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3923200</td>\n",
" <td>(1504000, 6601075]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>11040000</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>5996800</td>\n",
" <td>(1504000, 6601075]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>17062400</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>4992000</td>\n",
" <td>(1504000, 6601075]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>17062400</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>15667200</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>19744000</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7782400</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10892800</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>10387200</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>7052800</td>\n",
" <td>(6601075, 298603475]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Volume Volume\n",
"0 224358400 (6601075, 298603475]\n",
"1 58732800 (6601075, 298603475]\n",
"2 34777600 (6601075, 298603475]\n",
"3 18316800 (6601075, 298603475]\n",
"4 13996800 (6601075, 298603475]\n",
"5 5753600 (1504000, 6601075]\n",
"6 10662400 (6601075, 298603475]\n",
"7 15500800 (6601075, 298603475]\n",
"8 3923200 (1504000, 6601075]\n",
"9 11040000 (6601075, 298603475]\n",
"10 5996800 (1504000, 6601075]\n",
"11 17062400 (6601075, 298603475]\n",
"12 4992000 (1504000, 6601075]\n",
"13 17062400 (6601075, 298603475]\n",
"14 15667200 (6601075, 298603475]\n",
"15 19744000 (6601075, 298603475]\n",
"16 7782400 (6601075, 298603475]\n",
"17 10892800 (6601075, 298603475]\n",
"18 10387200 (6601075, 298603475]\n",
"19 7052800 (6601075, 298603475]"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([starbucks[\"Volume\"], pd.cut(starbucks[\"Volume\"], list(bins3))], axis=1).head(20)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Volume</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>224358400</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>58732800</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>34777600</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18316800</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13996800</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5753600</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>10662400</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>15500800</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3923200</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>11040000</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>5996800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>17062400</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>4992000</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>17062400</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>15667200</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>19744000</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7782400</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10892800</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>10387200</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>7052800</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Volume Volume\n",
"0 224358400 medium\n",
"1 58732800 medium\n",
"2 34777600 medium\n",
"3 18316800 medium\n",
"4 13996800 medium\n",
"5 5753600 low\n",
"6 10662400 medium\n",
"7 15500800 medium\n",
"8 3923200 low\n",
"9 11040000 medium\n",
"10 5996800 low\n",
"11 17062400 medium\n",
"12 4992000 low\n",
"13 17062400 medium\n",
"14 15667200 medium\n",
"15 19744000 medium\n",
"16 7782400 medium\n",
"17 10892800 medium\n",
"18 10387200 medium\n",
"19 7052800 medium"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat(\n",
" [starbucks[\"Volume\"], pd.cut(starbucks[\"Volume\"], list(bins3), labels=labels)], axis=1\n",
").head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Квантильное разделение данных на 3 группы"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Volume</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>224358400</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>58732800</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>34777600</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18316800</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13996800</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5753600</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>10662400</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>15500800</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3923200</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>11040000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>5996800</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>17062400</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>4992000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>17062400</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>15667200</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>19744000</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7782400</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10892800</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>10387200</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>7052800</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Volume Volume\n",
"0 224358400 2\n",
"1 58732800 2\n",
"2 34777600 2\n",
"3 18316800 2\n",
"4 13996800 1\n",
"5 5753600 0\n",
"6 10662400 1\n",
"7 15500800 2\n",
"8 3923200 0\n",
"9 11040000 1\n",
"10 5996800 0\n",
"11 17062400 2\n",
"12 4992000 0\n",
"13 17062400 2\n",
"14 15667200 2\n",
"15 19744000 2\n",
"16 7782400 0\n",
"17 10892800 1\n",
"18 10387200 1\n",
"19 7052800 0"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat(\n",
" [starbucks[\"Volume\"], pd.qcut(starbucks[\"Volume\"], q=3, labels=False)], axis=1\n",
").head(20)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Volume</th>\n",
" <th>Volume</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>224358400</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>58732800</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>34777600</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18316800</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13996800</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5753600</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>10662400</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>15500800</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3923200</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>11040000</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>5996800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>17062400</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>4992000</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>17062400</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>15667200</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>19744000</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7782400</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10892800</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>10387200</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>7052800</td>\n",
" <td>low</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Volume Volume\n",
"0 224358400 high\n",
"1 58732800 high\n",
"2 34777600 high\n",
"3 18316800 high\n",
"4 13996800 medium\n",
"5 5753600 low\n",
"6 10662400 medium\n",
"7 15500800 high\n",
"8 3923200 low\n",
"9 11040000 medium\n",
"10 5996800 low\n",
"11 17062400 high\n",
"12 4992000 low\n",
"13 17062400 high\n",
"14 15667200 high\n",
"15 19744000 high\n",
"16 7782400 low\n",
"17 10892800 medium\n",
"18 10387200 medium\n",
"19 7052800 low"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat(\n",
" [starbucks[\"Volume\"], pd.qcut(starbucks[\"Volume\"], q=3, labels=labels)], axis=1\n",
").head(20)"
]
},
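{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how equal-width binning (`pd.cut`) differs from quantile-based binning (`pd.qcut`). The series `demo_values` and the labels `demo_labels` are illustrative assumptions, not the notebook's data. With a single large outlier, equal-width bins leave almost every row in the lowest group, while quantile bins hold roughly the same number of rows each."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy data with one large outlier (assumed for illustration only)\n",
"demo_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])\n",
"demo_labels = [\"low\", \"medium\", \"high\"]\n",
"\n",
"# Equal-width bins: the value range is split into 3 intervals of equal length\n",
"demo_by_width = pd.cut(demo_values, bins=3, labels=demo_labels)\n",
"\n",
"# Quantile bins: the edges are chosen so that each bin holds about a third of the rows\n",
"demo_by_quantile = pd.qcut(demo_values, q=3, labels=demo_labels)\n",
"\n",
"print(demo_by_width.value_counts().to_dict())\n",
"print(demo_by_quantile.value_counts().to_dict())"
]
},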
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### конструирование признаков на основе существующих\n",
"\n",
"Season - время года (winter, autumn, summer, spring)\n",
"\n",
"Volume - количество проданных акций\n",
"\n",
"Open - цена открытия торгов\n",
"\n",
"Close - цена закрытия\n",
"\n",
"High - наивысшая цена торговли\n",
"\n",
"Low - наименьшая цена торговли"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Date Adj Close Volume Season Price_change High_Low_diff \\\n",
"0 1992-06-26 0.260703 224358400 Summer 0.000000 0.027343 \n",
"1 1992-06-29 0.278891 58732800 Summer 0.023437 0.035157 \n",
"2 1992-06-30 0.269797 34777600 Summer -0.011719 0.027344 \n",
"3 1992-07-01 0.275860 18316800 Summer 0.007813 0.019531 \n",
"4 1992-07-02 0.275860 13996800 Summer 0.000000 0.011719 \n",
"... ... ... ... ... ... ... \n",
"8031 2024-05-17 77.849998 14436500 Spring 2.569999 3.080002 \n",
"8032 2024-05-20 77.540001 11183800 Spring -0.309997 1.610001 \n",
"8033 2024-05-21 77.720001 8916600 Spring 0.180000 0.720001 \n",
"8034 2024-05-22 80.720001 22063400 Spring 3.000000 3.579995 \n",
"8035 2024-05-23 79.260002 4651418 Spring -1.459999 1.529999 \n",
"\n",
" Open_Close_diff \n",
"0 -0.007813 \n",
"1 -0.019531 \n",
"2 0.019532 \n",
"3 -0.003906 \n",
"4 0.003906 \n",
"... ... \n",
"8031 -2.580001 \n",
"8032 0.139999 \n",
"8033 -0.160003 \n",
"8034 -3.020004 \n",
"8035 0.839996 \n",
"\n",
"[8036 rows x 7 columns]\n"
]
}
],
"source": [
"# Создаем признак \"Price_change\" - изменение цены закрытия по сравнению с предыдущим днем\n",
"starbucks[\"Price_change\"] = starbucks[\"Close\"].diff().fillna(0)\n",
"\n",
"# Создаем признак \"High_Low_diff\" - разница между наивысшей и наименьшей ценой за день\n",
"starbucks[\"High_Low_diff\"] = starbucks[\"High\"] - starbucks[\"Low\"]\n",
"\n",
"# Создаем признак \"Open_Close_diff\" - разница между ценой открытия и закрытия\n",
"starbucks[\"Open_Close_diff\"] = starbucks[\"Open\"] - starbucks[\"Close\"]\n",
"\n",
"starbucks = starbucks.drop(\n",
" [\n",
" \"High\",\n",
" \"Low\",\n",
" \"Open\",\n",
" \"Close\",\n",
" \"Season_Autumn\",\n",
" \"Season_Summer\",\n",
" \"Season_Winter\",\n",
" \"Season_Spring\",\n",
" ],\n",
" axis=1,\n",
")\n",
"\n",
"# Выводим итоговый DataFrame\n",
"print(starbucks)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Предсказательная способность\n",
"\n",
"Price_change: Этот признак может быть полезен для предсказания будущих цен, так как показывает изменение цены закрытия по сравнению с предыдущим днем. Если цена закрытия растет, это может указывать на положительный тренд.\n",
"\n",
"High_Low_diff: Разница между максимальной и минимальной ценой за день может служить индикатором волатильности. Высокая волатильность может указывать на неопределенность на рынке, что может повлиять на будущие цены.\n",
"\n",
"Open_Close_diff: Этот признак показывает, как цена акций себя вела в течение дня. Если цена закрытия значительно отличается от цены открытия, это может указывать на сильные колебания в течение дня.\n",
"\n",
"Volume_category: Объем торгов может быть индикатором интереса инвесторов к акциям. Высокий объем может указывать на сильные движения цен, что также может быть полезно для предсказания.\n",
"\n",
"Скорость вычисления\n",
"В с е предложенные признаки могут быть вычислены быстро, так как они основаны на простых арифметических операциях и использовании встроенных функций библиотеки Pandas. Это делает их подходящими для анализа больших объемов данных.\n",
"\n",
"Надежность\n",
"Признаки, основанные на исторических данных, как правило, надежны, если данные корректны и не содержат выбросов. Однако, важно учитывать, что прошлые данные не всегда гарантируют будущие результаты, особенно в условиях изменяющегося рынка.\n",
"\n",
"Корреляция\n",
"Необходимо провести анализ корреляции между признаками и целевой переменной (например, ценой закрытия на следующий день). Высокая корреляция может указывать на то, что признак имеет предсказательную силу. Однако, следует избегать мультиколлинеарности, когда два или более признаков сильно коррелируют между собой.\n",
"\n",
"Целостность\n",
"Признаки должны быть целостны и не содержат пропусков."
]
},
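{
"cell_type": "markdown",
"metadata": {},
"source": [
"The correlation check described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (a random walk generated with NumPy), the names `demo_close`, `demo_features` and `Target` are hypothetical, and the target is taken to be the next day's closing price. The first printout shows how strongly each feature correlates with the target, the second shows feature-to-feature correlations, where large absolute values would hint at multicollinearity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Synthetic stand-in data (an assumption for illustration, not the Starbucks dataset)\n",
"demo_rng = np.random.default_rng(0)\n",
"demo_close = pd.Series(100 + demo_rng.normal(0, 1, 250).cumsum())\n",
"demo_features = pd.DataFrame(\n",
"    {\n",
"        \"Price_change\": demo_close.diff().fillna(0),\n",
"        \"High_Low_diff\": demo_rng.uniform(0.5, 3.0, 250),\n",
"        \"Open_Close_diff\": demo_rng.normal(0, 1, 250),\n",
"    }\n",
")\n",
"\n",
"# Hypothetical target: the next day's closing price; the last row has no next day and is dropped\n",
"demo_features[\"Target\"] = demo_close.shift(-1)\n",
"demo_features = demo_features.dropna()\n",
"\n",
"demo_corr = demo_features.corr()\n",
"\n",
"# Correlation of each feature with the target\n",
"print(demo_corr[\"Target\"].drop(\"Target\"))\n",
"\n",
"# Feature-to-feature correlations, useful for spotting multicollinearity\n",
"print(demo_corr.drop(index=\"Target\", columns=\"Target\"))"
]
},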
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Отсечение значений признаков"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Определение выбросов с помощью boxplot"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: >"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAhYAAAGsCAYAAACB/u5dAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/GU6VOAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAmSElEQVR4nO3df3RU9YH38c9kMoyJJPFHgoJESM0iKFERECmGgBp+KD6mIWoRj6yb0+1pQQ+CPLvY1jXVEkWg1IrWrqf42B7AJQTcgtFMuzBJgWiIS4UehZAlQvlhCIUZk8AwzMzzhyezjoSamXwnk0zer3M4Mne+c+8Xz7mZd+7ce8cSCAQCAgAAMCAh1hMAAADxg7AAAADGEBYAAMAYwgIAABhDWAAAAGMICwAAYAxhAQAAjCEsAACAMYQFAAAwhrAAAADGxCwsqqqqdN9992nQoEGyWCzatGlT2Ot4//33dfvttyslJUUZGRmaOXOmGhsbjc8VAAB0TszCorW1VTfffLNWrVoV0esPHjyo+++/X3feead2796t999/X83NzSosLDQ8UwAA0FmWnvAlZBaLRRs3blRBQUFwmcfj0Y9+9COtXbtWp0+f1siRI/Xiiy9q0qRJkqSysjLNmjVLHo9HCQlf9tHvf/973X///fJ4PLLZbDH4lwAA0Lf12HMs5s2bp507d2rdunX6+OOP9cADD2jatGmqr6+XJI0ePVoJCQlavXq1fD6fXC6Xfvvb3+ruu+8mKgAAiJEeecTi0KFD+ta3vqVDhw5p0KBBwXF33323brvtNi1ZskSS5HQ69eCDD+rkyZPy+XwaP3683n33XV122WUx+FcAAIAeecRiz5498vl8GjZsmPr37x/843Q61dDQIEk6fvy4vve972nOnDmqra2V0+lUv379VFRUpB7QSgAA9EmJsZ5AR1paWmS1WlVXVyer1RryXP/+/SVJq1atUlpampYuXRp87ne/+50yMzP1wQcf6Pbbb+/WOQMAgB4aFqNGjZLP51NTU5Nyc3M7HNPW1hY8abNde4T4/f6ozxEAAFwoZh+FtLS0aPfu3dq9e7ekLy8f3b17tw4dOqRhw4Zp9uzZevTRR1VeXq6DBw/qww8/VGlpqbZs2SJJuvfee1VbW6uf/vSnqq+v10cffaTHHntMQ4YM0ahRo2L1zwIAoE+L2cmb27Zt0+TJky9YPmfOHL355pvyer16/vnn9dZbb+nIkSNKT0/X7bffrpKSEuXk5EiS1q1bp6VLl2r//v1KTk7W+PHj9eKLL2r48OHd/c8BAADqIVeFAACA+NAjrwoBAAC9E2EBAACM6farQvx+v44ePaqUlBRZLJbu3jwAAIhAIBDQF198oUGDBl1wVeZXdXtYHD16VJmZmd29WQAAYMDhw4c1ePDgiz7f7WGRkpIi6cuJpaamdvfmAUSR1+tVZWWlpkyZwnf2AHHG7XYrMzMz+D5+Md0eFu0ff6SmphIWQJzxer1KTk5WamoqYQHEqW86jYGTNwEAgDFhh8WRI0f0yCOP6Morr1RSUpJycnK0a9euaMwNAAD0MmF9FHLq1ClNmDBBkydPVkVFhTIyMlRfX6/LL788WvMDAAC9SFhh8eKLLyozM1OrV68OLsvKyjI+KQAA0DuFFRb/+Z//qalTp+qBBx6Q0+nUNddcox/+8If63ve+d9HXeDweeTye4GO32y3py5O8vF5vhNMG0BO179Ps20D86ex+HVZY/M///I9ee+01LViwQE8//bRqa2v1xBNPqF+/fpozZ06HryktLVVJSckFyysrK5WcnBzO5gH0Eg6HI9ZTAGBYW1tbp8aF9SVk/fr105gxY7Rjx47gsieeeEK1tbXauXNnh6/p6IhFZmammpubudwUiDNer1cOh0P5+flcbgrEGbfbrfT0dLlcrr/7/h3WEYuBAwfqhhtuCFk2YsQIbdiw4aKvsdvtstvtFyy32Wz84AHiFPs3EH86u0+Hdb
nphAkTtG/fvpBl+/fv15AhQ8JZDYA45PP55HQ6VVVVJafTKZ/PF+spAYiBsMLiySefVE1NjZYsWaIDBw5ozZo1+vWvf625c+dGa34AeoHy8nJlZ2crPz9fK1asUH5+vrKzs1VeXh7rqQHoZmGFxdixY7Vx40atXbtWI0eO1HPPPaeVK1dq9uzZ0ZofgB6uvLxcRUVFysnJUXV1tdauXavq6mrl5OSoqKiIuAD6mLBO3jTB7XYrLS3tG0/+ANDz+Xw+ZWdnKycnR5s2bZLP59O7776re+65R1arVQUFBdq7d6/q6+tltVpjPV0AXdDZ92++KwRAxKqrq9XY2Kinn35aCQmhP04SEhK0ePFiHTx4UNXV1TGaIYDuRlgAiNixY8ckSSNHjuzw+fbl7eMAxD/CAkDEBg4cKEnau3dvh8+3L28fByD+ERYAIpabm6uhQ4dqyZIl8vv9Ic/5/X6VlpYqKytLubm5MZohgO5GWACImNVq1fLly7V582YVFBSopqZGZ86cUU1NjQoKCrR582YtW7aMEzeBPiSsO28CwNcVFhaqrKxMCxcu1MSJE4PLs7KyVFZWpsLCwhjODkB343JTAEb4fD5t3bpVFRUVmj59uiZPnsyRCiCOdPb9myMWAIywWq3Ky8tTa2ur8vLyiAqgj+IcCwAAYAxhAQAAjCEsAACAMYQFAAAwhrAAAADGEBYAAMAYwgIAABhDWAAAAGMICwAAYAxhAQAAjCEsAACAMYQFAAAwhrAAAADGEBYAAMAYwgIAABhDWAAAAGMICwAAYAxhAQAAjCEsAACAMYQFAAAwhrAAAADGEBYAAMAYwgIAABhDWAAAAGMICwAAYAxhAQAAjCEsAACAMYQFAAAwhrAAAADGEBYAAMAYwgIAABhDWAAAAGMICwAAYAxhAQAAjCEsAACAMYQFAAAwhrAAAADGEBYAAMAYwgIAABgTVlg8++yzslgsIX+GDx8erbkBAIBeJjHcF9x44436wx/+8L8rSAx7FQAAIE6FXQWJiYm6+uqrozEXAADQy4UdFvX19Ro0aJAuueQSjR8/XqWlpbr22msvOt7j8cjj8QQfu91uSZLX65XX641gygB6qvZ9mn0biD+d3a8tgUAg0NmVVlRUqKWlRddff72OHTumkpISHTlyRHv37lVKSkqHr3n22WdVUlJywfI1a9YoOTm5s5sGAAAx1NbWpocfflgul0upqakXHRdWWHzd6dOnNWTIEK1YsULFxcUdjunoiEVmZqaam5v/7sQA9D5er1cOh0P5+fmy2Wyxng4Ag9xut9LT078xLLp05uVll12mYcOG6cCBAxcdY7fbZbfbL1hus9n4wQPEKfZvIP50dp/u0n0sWlpa1NDQoIEDB3ZlNQAAIE6EFRZPPfWUnE6nGhsbtWPHDn3nO9+R1WrVrFmzojU/AADQi4T1Uchf//pXzZo1SydPnlRGRobuuOMO1dTUKCMjI1rzAwAAvUhYYbFu3bpozQMAAMQBvisEAAAYQ1gAAABjCAsAAGAMYQEAAIwhLAAAgDGEBQAAMIawAAAAxhAWAADAGMICAAAYQ1gAAABjCAsAAGAMYQEAAIwhLAAAgDGEBQAAMIawAAAAxhAWAADAGMICAAAYQ1gAAABjCAsAAGAMYQEAAIwhLAAAgDGEBQAAMIawAAAAxhAWAADAGMICAAAYQ1gAAABjCAsAAGAMYQEAAIwhLAAAgDGEBQAAMIawAAAAxhAWAADAGMICAAAYQ1gAAABjCAsAAGAMYQEAAIwhLAAAgDGEBQAAMIawAAAAxhAWAADAGMICAAAYQ1gAAABjCAsAAGAMYQEAAIwhLAAAgDGEBQAAMKZLYfHCCy/IYrFo/vz5hqYDAAB6s4jDora2Vq+//rpuuukmk/MBAAC9WERh0dLSotmzZ+vf//3fdfnll5ueEwAA6KUSI3nR3Llzde+99+ruu+/W888//3fHejweeTye4GO32y1J8nq98nq9kWweQA/Vvk
+zbwPxp7P7ddhhsW7dOn300Ueqra3t1PjS0lKVlJRcsLyyslLJycnhbh5AL+BwOGI9BQCGtbW1dWqcJRAIBDq70sO
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"starbucks.boxplot(column=\"Volume\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Отсечение данных для признака Volume, значение которых больше 200000000 т.к. выше этой границы были выбросы"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Volume</th>\n",
" <th>VolumeClip</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1992-06-26</td>\n",
" <td>224358400</td>\n",
" <td>200000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>824</th>\n",
" <td>1995-09-29</td>\n",
" <td>230883200</td>\n",
" <td>200000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1771</th>\n",
" <td>1999-07-01</td>\n",
" <td>585508800</td>\n",
" <td>200000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2007</th>\n",
" <td>2000-06-07</td>\n",
" <td>295411200</td>\n",
" <td>200000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Volume VolumeClip\n",
"0 1992-06-26 224358400 200000000\n",
"824 1995-09-29 230883200 200000000\n",
"1771 1999-07-01 585508800 200000000\n",
"2007 2000-06-07 295411200 200000000"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"starbucks_norm = starbucks.copy()\n",
"\n",
"starbucks_norm[\"VolumeClip\"] = starbucks[\"Volume\"].clip(0, 200000000)\n",
"\n",
"starbucks_norm[starbucks_norm[\"Volume\"] > 200000000][[\"Date\", \"Volume\", \"VolumeClip\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Винсоризация признака Volume"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"541994159.9999999\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Volume</th>\n",
" <th>VolumeWinsorize</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1992-06-26</td>\n",
" <td>224358400</td>\n",
" <td>33800200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>824</th>\n",
" <td>1995-09-29</td>\n",
" <td>230883200</td>\n",
" <td>33800200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1771</th>\n",
" <td>1999-07-01</td>\n",
" <td>585508800</td>\n",
" <td>33800200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2007</th>\n",
" <td>2000-06-07</td>\n",
" <td>295411200</td>\n",
" <td>33800200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Volume VolumeWinsorize\n",
"0 1992-06-26 224358400 33800200\n",
"824 1995-09-29 230883200 33800200\n",
"1771 1999-07-01 585508800 33800200\n",
"2007 2000-06-07 295411200 33800200"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.stats.mstats import winsorize\n",
"\n",
"print(starbucks_norm[starbucks_norm[\"Volume\"] > 200000000][[\"Date\", \"Volume\", \"VolumeClip\"]]\n",
"[\"Volume\"].quantile(q=0.95))\n",
"\n",
"starbucks_norm[\"VolumeWinsorize\"] = winsorize(\n",
" starbucks_norm[\"Volume\"].fillna(starbucks_norm[\"Volume\"].mean()), (0, 0.05), inplace=False\n",
")\n",
"\n",
"starbucks_norm[starbucks_norm[\"Volume\"] > 200000000][[\"Date\", \"Volume\", \"VolumeWinsorize\"]]"
]
},
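{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of what the winsorization above does, on toy data. The array `demo_volume` and the threshold 500 are assumptions chosen for illustration. With `limits=(0, 0.1)` the top 10% of the values (here a single value) are replaced by the largest remaining value, whereas clipping caps values at a fixed, manually chosen threshold. This is why all four outlier rows in the table above end up with the same `VolumeWinsorize` value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats.mstats import winsorize\n",
"\n",
"# Toy data with one extreme value (assumed for illustration only)\n",
"demo_volume = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])\n",
"\n",
"# Winsorizing: the top 10% of values are replaced by the largest remaining value\n",
"demo_winsorized = winsorize(demo_volume, limits=(0, 0.1))\n",
"\n",
"# Clipping: values are capped at a fixed threshold chosen by hand\n",
"demo_clipped = np.clip(demo_volume, 0, 500)\n",
"\n",
"print(demo_winsorized.max(), demo_clipped.max())"
]
},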
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Нормализация значений"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Volume</th>\n",
" <th>VolumeNorm</th>\n",
" <th>VolumeClipNorm</th>\n",
" <th>VolumeWinsorizeNorm</th>\n",
" <th>VolumeWinsorizeNorm2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1992-06-26</td>\n",
" <td>224358400</td>\n",
" <td>0.381597</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1992-06-29</td>\n",
" <td>58732800</td>\n",
" <td>0.097994</td>\n",
" <td>0.288312</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1992-06-30</td>\n",
" <td>34777600</td>\n",
" <td>0.056975</td>\n",
" <td>0.167629</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1992-07-01</td>\n",
" <td>18316800</td>\n",
" <td>0.028789</td>\n",
" <td>0.084701</td>\n",
" <td>0.520581</td>\n",
" <td>0.041163</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1992-07-02</td>\n",
" <td>13996800</td>\n",
" <td>0.021392</td>\n",
" <td>0.062937</td>\n",
" <td>0.386820</td>\n",
" <td>-0.226361</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1992-07-06</td>\n",
" <td>5753600</td>\n",
" <td>0.007277</td>\n",
" <td>0.021409</td>\n",
" <td>0.131582</td>\n",
" <td>-0.736836</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1992-07-07</td>\n",
" <td>10662400</td>\n",
" <td>0.015682</td>\n",
" <td>0.046139</td>\n",
" <td>0.283575</td>\n",
" <td>-0.432850</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1992-07-08</td>\n",
" <td>15500800</td>\n",
" <td>0.023967</td>\n",
" <td>0.070514</td>\n",
" <td>0.433388</td>\n",
" <td>-0.133223</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1992-07-09</td>\n",
" <td>3923200</td>\n",
" <td>0.004142</td>\n",
" <td>0.012188</td>\n",
" <td>0.074907</td>\n",
" <td>-0.850187</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1992-07-10</td>\n",
" <td>11040000</td>\n",
" <td>0.016329</td>\n",
" <td>0.048041</td>\n",
" <td>0.295267</td>\n",
" <td>-0.409466</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1992-07-13</td>\n",
" <td>5996800</td>\n",
" <td>0.007693</td>\n",
" <td>0.022634</td>\n",
" <td>0.139112</td>\n",
" <td>-0.721775</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>1992-07-14</td>\n",
" <td>17062400</td>\n",
" <td>0.026641</td>\n",
" <td>0.078381</td>\n",
" <td>0.481741</td>\n",
" <td>-0.036518</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1992-07-15</td>\n",
" <td>4992000</td>\n",
" <td>0.005973</td>\n",
" <td>0.017572</td>\n",
" <td>0.108000</td>\n",
" <td>-0.783999</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>1992-07-16</td>\n",
" <td>17062400</td>\n",
" <td>0.026641</td>\n",
" <td>0.078381</td>\n",
" <td>0.481741</td>\n",
" <td>-0.036518</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>1992-07-17</td>\n",
" <td>15667200</td>\n",
" <td>0.024252</td>\n",
" <td>0.071353</td>\n",
" <td>0.438541</td>\n",
" <td>-0.122918</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>1992-07-20</td>\n",
" <td>19744000</td>\n",
" <td>0.031233</td>\n",
" <td>0.091891</td>\n",
" <td>0.564772</td>\n",
" <td>0.129545</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>1992-07-21</td>\n",
" <td>7782400</td>\n",
" <td>0.010751</td>\n",
" <td>0.031630</td>\n",
" <td>0.194401</td>\n",
" <td>-0.611199</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>1992-07-22</td>\n",
" <td>10892800</td>\n",
" <td>0.016077</td>\n",
" <td>0.047300</td>\n",
" <td>0.290709</td>\n",
" <td>-0.418582</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>1992-07-23</td>\n",
" <td>10387200</td>\n",
" <td>0.015211</td>\n",
" <td>0.044753</td>\n",
" <td>0.275054</td>\n",
" <td>-0.449892</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>1992-07-24</td>\n",
" <td>7052800</td>\n",
" <td>0.009501</td>\n",
" <td>0.027954</td>\n",
" <td>0.171810</td>\n",
" <td>-0.656381</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Volume VolumeNorm VolumeClipNorm VolumeWinsorizeNorm \\\n",
"0 1992-06-26 224358400 0.381597 1.000000 1.000000 \n",
"1 1992-06-29 58732800 0.097994 0.288312 1.000000 \n",
"2 1992-06-30 34777600 0.056975 0.167629 1.000000 \n",
"3 1992-07-01 18316800 0.028789 0.084701 0.520581 \n",
"4 1992-07-02 13996800 0.021392 0.062937 0.386820 \n",
"5 1992-07-06 5753600 0.007277 0.021409 0.131582 \n",
"6 1992-07-07 10662400 0.015682 0.046139 0.283575 \n",
"7 1992-07-08 15500800 0.023967 0.070514 0.433388 \n",
"8 1992-07-09 3923200 0.004142 0.012188 0.074907 \n",
"9 1992-07-10 11040000 0.016329 0.048041 0.295267 \n",
"10 1992-07-13 5996800 0.007693 0.022634 0.139112 \n",
"11 1992-07-14 17062400 0.026641 0.078381 0.481741 \n",
"12 1992-07-15 4992000 0.005973 0.017572 0.108000 \n",
"13 1992-07-16 17062400 0.026641 0.078381 0.481741 \n",
"14 1992-07-17 15667200 0.024252 0.071353 0.438541 \n",
"15 1992-07-20 19744000 0.031233 0.091891 0.564772 \n",
"16 1992-07-21 7782400 0.010751 0.031630 0.194401 \n",
"17 1992-07-22 10892800 0.016077 0.047300 0.290709 \n",
"18 1992-07-23 10387200 0.015211 0.044753 0.275054 \n",
"19 1992-07-24 7052800 0.009501 0.027954 0.171810 \n",
"\n",
" VolumeWinsorizeNorm2 \n",
"0 1.000000 \n",
"1 1.000000 \n",
"2 1.000000 \n",
"3 0.041163 \n",
"4 -0.226361 \n",
"5 -0.736836 \n",
"6 -0.432850 \n",
"7 -0.133223 \n",
"8 -0.850187 \n",
"9 -0.409466 \n",
"10 -0.721775 \n",
"11 -0.036518 \n",
"12 -0.783999 \n",
"13 -0.036518 \n",
"14 -0.122918 \n",
"15 0.129545 \n",
"16 -0.611199 \n",
"17 -0.418582 \n",
"18 -0.449892 \n",
"19 -0.656381 "
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import preprocessing\n",
"\n",
"min_max_scaler = preprocessing.MinMaxScaler()\n",
"\n",
"min_max_scaler_2 = preprocessing.MinMaxScaler(feature_range=(-1, 1))\n",
"\n",
"starbucks_norm[\"VolumeNorm\"] = min_max_scaler.fit_transform(\n",
" starbucks_norm[\"Volume\"].to_numpy().reshape(-1, 1)\n",
").reshape(starbucks_norm[\"Volume\"].shape)\n",
"\n",
"starbucks_norm[\"VolumeClipNorm\"] = min_max_scaler.fit_transform(\n",
" starbucks_norm[\"VolumeClip\"].to_numpy().reshape(-1, 1)\n",
").reshape(starbucks_norm[\"Volume\"].shape)\n",
"\n",
"starbucks_norm[\"VolumeWinsorizeNorm\"] = min_max_scaler.fit_transform(\n",
" starbucks_norm[\"VolumeWinsorize\"].to_numpy().reshape(-1, 1)\n",
").reshape(starbucks_norm[\"Volume\"].shape)\n",
"\n",
"starbucks_norm[\"VolumeWinsorizeNorm2\"] = min_max_scaler_2.fit_transform(\n",
" starbucks_norm[\"VolumeWinsorize\"].to_numpy().reshape(-1, 1)\n",
").reshape(starbucks_norm[\"Volume\"].shape)\n",
"\n",
"starbucks_norm[\n",
" [\"Date\", \"Volume\", \"VolumeNorm\", \"VolumeClipNorm\", \"VolumeWinsorizeNorm\", \"VolumeWinsorizeNorm2\"]\n",
"].head(20)"
]
},
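{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the arithmetic behind the scaling above, on a toy array `demo_x` (an assumption for illustration). Min-max scaling maps each value to `(x - min) / (max - min)`, which lies in [0, 1]. With `feature_range=(-1, 1)` the result is additionally stretched as `2 * scaled - 1`, which matches the relationship between the `VolumeWinsorizeNorm` and `VolumeWinsorizeNorm2` columns in the table above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"# Toy column vector (assumed for illustration only)\n",
"demo_x = np.array([2.0, 4.0, 6.0, 10.0]).reshape(-1, 1)\n",
"\n",
"# Manual min-max scaling to [0, 1], then stretching to [-1, 1]\n",
"demo_manual_01 = (demo_x - demo_x.min()) / (demo_x.max() - demo_x.min())\n",
"demo_manual_11 = demo_manual_01 * 2 - 1\n",
"\n",
"# The same transformations with scikit-learn\n",
"demo_scaled_01 = MinMaxScaler().fit_transform(demo_x)\n",
"demo_scaled_11 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(demo_x)\n",
"\n",
"print(np.allclose(demo_manual_01, demo_scaled_01), np.allclose(demo_manual_11, demo_scaled_11))"
]
},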
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Стандартизация значений"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Volume</th>\n",
" <th>VolumeStand</th>\n",
" <th>VolumeClipStand</th>\n",
" <th>VolumeWinsorizeStand</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1992-06-26</td>\n",
" <td>224358400</td>\n",
" <td>15.646534</td>\n",
" <td>15.953799</td>\n",
" <td>2.499736</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1992-06-29</td>\n",
" <td>58732800</td>\n",
" <td>3.285840</td>\n",
" <td>3.795175</td>\n",
" <td>2.499736</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1992-06-30</td>\n",
" <td>34777600</td>\n",
" <td>1.498056</td>\n",
" <td>1.733392</td>\n",
" <td>2.499736</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1992-07-01</td>\n",
" <td>18316800</td>\n",
" <td>0.269580</td>\n",
" <td>0.316639</td>\n",
" <td>0.559890</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1992-07-02</td>\n",
" <td>13996800</td>\n",
" <td>-0.052823</td>\n",
" <td>-0.055176</td>\n",
" <td>0.018657</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1992-07-06</td>\n",
" <td>5753600</td>\n",
" <td>-0.668015</td>\n",
" <td>-0.764654</td>\n",
" <td>-1.014097</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1992-07-07</td>\n",
" <td>10662400</td>\n",
" <td>-0.301670</td>\n",
" <td>-0.342162</td>\n",
" <td>-0.399095</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1992-07-08</td>\n",
" <td>15500800</td>\n",
" <td>0.059421</td>\n",
" <td>0.074271</td>\n",
" <td>0.207086</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1992-07-09</td>\n",
" <td>3923200</td>\n",
" <td>-0.804619</td>\n",
" <td>-0.922193</td>\n",
" <td>-1.243420</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1992-07-10</td>\n",
" <td>11040000</td>\n",
" <td>-0.273490</td>\n",
" <td>-0.309662</td>\n",
" <td>-0.351788</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1992-07-13</td>\n",
" <td>5996800</td>\n",
" <td>-0.649865</td>\n",
" <td>-0.743722</td>\n",
" <td>-0.983628</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>1992-07-14</td>\n",
" <td>17062400</td>\n",
" <td>0.175964</td>\n",
" <td>0.208675</td>\n",
" <td>0.402732</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1992-07-15</td>\n",
" <td>4992000</td>\n",
" <td>-0.724854</td>\n",
" <td>-0.830203</td>\n",
" <td>-1.109515</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>1992-07-16</td>\n",
" <td>17062400</td>\n",
" <td>0.175964</td>\n",
" <td>0.208675</td>\n",
" <td>0.402732</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>1992-07-17</td>\n",
" <td>15667200</td>\n",
" <td>0.071840</td>\n",
" <td>0.088593</td>\n",
" <td>0.227934</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>1992-07-20</td>\n",
" <td>19744000</td>\n",
" <td>0.376093</td>\n",
" <td>0.439476</td>\n",
" <td>0.738698</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>1992-07-21</td>\n",
" <td>7782400</td>\n",
" <td>-0.516605</td>\n",
" <td>-0.590038</td>\n",
" <td>-0.759918</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>1992-07-22</td>\n",
" <td>10892800</td>\n",
" <td>-0.284475</td>\n",
" <td>-0.322332</td>\n",
" <td>-0.370230</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>1992-07-23</td>\n",
" <td>10387200</td>\n",
" <td>-0.322208</td>\n",
" <td>-0.365848</td>\n",
" <td>-0.433574</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>1992-07-24</td>\n",
" <td>7052800</td>\n",
" <td>-0.571056</td>\n",
" <td>-0.652834</td>\n",
" <td>-0.851326</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Volume VolumeStand VolumeClipStand VolumeWinsorizeStand\n",
"0 1992-06-26 224358400 15.646534 15.953799 2.499736\n",
"1 1992-06-29 58732800 3.285840 3.795175 2.499736\n",
"2 1992-06-30 34777600 1.498056 1.733392 2.499736\n",
"3 1992-07-01 18316800 0.269580 0.316639 0.559890\n",
"4 1992-07-02 13996800 -0.052823 -0.055176 0.018657\n",
"5 1992-07-06 5753600 -0.668015 -0.764654 -1.014097\n",
"6 1992-07-07 10662400 -0.301670 -0.342162 -0.399095\n",
"7 1992-07-08 15500800 0.059421 0.074271 0.207086\n",
"8 1992-07-09 3923200 -0.804619 -0.922193 -1.243420\n",
"9 1992-07-10 11040000 -0.273490 -0.309662 -0.351788\n",
"10 1992-07-13 5996800 -0.649865 -0.743722 -0.983628\n",
"11 1992-07-14 17062400 0.175964 0.208675 0.402732\n",
"12 1992-07-15 4992000 -0.724854 -0.830203 -1.109515\n",
"13 1992-07-16 17062400 0.175964 0.208675 0.402732\n",
"14 1992-07-17 15667200 0.071840 0.088593 0.227934\n",
"15 1992-07-20 19744000 0.376093 0.439476 0.738698\n",
"16 1992-07-21 7782400 -0.516605 -0.590038 -0.759918\n",
"17 1992-07-22 10892800 -0.284475 -0.322332 -0.370230\n",
"18 1992-07-23 10387200 -0.322208 -0.365848 -0.433574\n",
"19 1992-07-24 7052800 -0.571056 -0.652834 -0.851326"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import preprocessing\n",
"\n",
"# StandardScaler rescales a column to zero mean and unit variance.\n",
"standard_scaler = preprocessing.StandardScaler()\n",
"\n",
"# fit_transform expects a 2-D array, so each column is reshaped to (n, 1)\n",
"# and then flattened back to the shape of the original column.\n",
"starbucks_norm[\"VolumeStand\"] = standard_scaler.fit_transform(\n",
"    starbucks_norm[\"Volume\"].to_numpy().reshape(-1, 1)\n",
").reshape(starbucks_norm[\"Volume\"].shape)\n",
"\n",
"starbucks_norm[\"VolumeClipStand\"] = standard_scaler.fit_transform(\n",
"    starbucks_norm[\"VolumeClip\"].to_numpy().reshape(-1, 1)\n",
").reshape(starbucks_norm[\"Volume\"].shape)\n",
"\n",
"starbucks_norm[\"VolumeWinsorizeStand\"] = standard_scaler.fit_transform(\n",
"    starbucks_norm[\"VolumeWinsorize\"].to_numpy().reshape(-1, 1)\n",
").reshape(starbucks_norm[\"Volume\"].shape)\n",
"\n",
"starbucks_norm[[\"Date\", \"Volume\", \"VolumeStand\", \"VolumeClipStand\", \"VolumeWinsorizeStand\"]].head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conclusions:\n",
"1. Predictive power\n",
"Normalization and standardization: these methods can improve the predictive power of features because they bring the data to a common scale. This matters most for scale-sensitive algorithms such as k-nearest neighbors or neural networks. Features such as Price_change, High_Low_diff, and Open_Close_diff may become more useful to such models after these transformations.\n",
"\n",
"Winsorization: this method reduces the influence of outliers, which can improve predictive power because the model is less distorted by anomalous values.\n",
"\n",
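"The effect of winsorization on standardized values can be checked on a small sample. The sketch below is an illustration rather than the notebook's own pipeline: it takes the 20 `Volume` values shown in the table above, uses a hypothetical helper `zscore` that mirrors what `StandardScaler` computes (mean and population standard deviation), and assumes 5th/95th percentile limits for winsorization.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# The 20 Volume values shown in the table above.\n",
"volume = np.array([\n",
"    224358400, 58732800, 34777600, 18316800, 13996800,\n",
"    5753600, 10662400, 15500800, 3923200, 11040000,\n",
"    5996800, 17062400, 4992000, 17062400, 15667200,\n",
"    19744000, 7782400, 10892800, 10387200, 7052800,\n",
"], dtype=float)\n",
"\n",
"\n",
"def zscore(values):\n",
"    # Hypothetical helper: (x - mean) / population standard deviation.\n",
"    return (values - values.mean()) / values.std()\n",
"\n",
"\n",
"# Assumed limits: cap values at the 5th and 95th percentiles of the sample.\n",
"low, high = np.percentile(volume, [5, 95])\n",
"volume_winsorized = np.clip(volume, low, high)\n",
"\n",
"z_raw = zscore(volume)\n",
"z_winsorized = zscore(volume_winsorized)\n",
"\n",
"print('largest z-score, raw:       ', round(float(z_raw.max()), 2))\n",
"print('largest z-score, winsorized:', round(float(z_winsorized.max()), 2))\n",
"```\n",
"\n",
"On this sample the largest z-score shrinks once the extreme first-day volume is capped, because the cap lowers both that value and the standard deviation it inflates. The exact numbers depend on the chosen limits and will differ from the full-dataset table.\n",
"\n",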
"2. Computation speed\n",
"Normalization and standardization add a preprocessing step, so they slightly increase computation time; for this dataset the increase is negligible.\n",
"Winsorization also adds some computational load, which is likewise negligible here.\n",
"\n",
"3. Reliability\n",
"Normalization and standardization make features comparable in scale, which can lead to more stable training. On their own they do not reduce the influence of outliers: in the table above, the first trading day still has a VolumeStand of about 15.65.\n",
"Winsorization limits anomalous values (the same day has a VolumeWinsorizeStand of about 2.50), which also improves reliability.\n",
"\n",
"4. Correlation\n",
"Normalization and standardization are linear rescalings, so they leave the Pearson correlation between features, and between features and the target, unchanged. Multicollinearity still has to be monitored, but scaling neither creates nor removes it.\n",
"Winsorization is not a linear transformation, so it can change correlations: by limiting the influence of outliers it may give more accurate estimates of the relationships between features.\n",
"\n",
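"How each transformation affects the Pearson correlation can be verified directly. The sketch below is a minimal illustration on synthetic data, not on the Starbucks dataset: the arrays `x` and `y`, the helpers `pearson`, `min_max`, `standardize`, and `winsorize`, and the 5th/95th percentile limits are all assumptions made for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Synthetic stand-ins for two related, right-skewed features.\n",
"x = rng.lognormal(mean=0.0, sigma=1.0, size=500)\n",
"y = 2.0 * x + rng.normal(scale=1.0, size=500)\n",
"\n",
"\n",
"def pearson(a, b):\n",
"    return np.corrcoef(a, b)[0, 1]\n",
"\n",
"\n",
"def min_max(values):\n",
"    return (values - values.min()) / (values.max() - values.min())\n",
"\n",
"\n",
"def standardize(values):\n",
"    return (values - values.mean()) / values.std()\n",
"\n",
"\n",
"def winsorize(values, lower=5, upper=95):\n",
"    # Assumed percentile limits, chosen only for this illustration.\n",
"    low, high = np.percentile(values, [lower, upper])\n",
"    return np.clip(values, low, high)\n",
"\n",
"\n",
"r_raw = pearson(x, y)\n",
"r_min_max = pearson(min_max(x), min_max(y))\n",
"r_standardized = pearson(standardize(x), standardize(y))\n",
"r_winsorized = pearson(winsorize(x), winsorize(y))\n",
"\n",
"print('raw:         ', r_raw)\n",
"print('min-max:     ', r_min_max)\n",
"print('standardized:', r_standardized)\n",
"print('winsorized:  ', r_winsorized)\n",
"```\n",
"\n",
"The first three values agree up to floating-point error, while the winsorized one differs, since capping the tails is the only step here that is not a linear rescaling.\n",
"\n",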
"5. Integrity\n",
"Normalization and standardization do not affect data integrity when applied correctly. Missing values still need attention, since they can degrade model quality.\n",
"Winsorization and outlier clipping help preserve data integrity because they limit anomalous values that could distort the results.\n",
"\n",
"Conclusion\n",
"Overall, normalization, standardization, winsorization, and outlier clipping can improve the quality of the features used to predict Starbucks stock prices. These methods make the data more uniform and robust, which in turn can lead to more accurate and stable results from machine learning methods."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}