{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Lab 2"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"FORBES LIST DATASET\n",
"The objects of observation in this dataset are billionaires whose net worth is estimated and documented in the annual Forbes ranking. Each record represents one billionaire together with their estimated net worth.\n",
"Object attributes\n",
"\n",
"The attributes of the objects (billionaires) are:\n",
"Name: the billionaire's name.\n",
"Country: the country where the billionaire resides.\n",
"Net worth: the billionaire's estimated net worth in US dollars (the values in the data are given in billions).\n",
"Source of wealth: the source from which the billionaire derived their fortune (for example, technology, finance, real estate).\n",
"Age: the billionaire's age at the time the list was published.\n",
"Rank: the billionaire's position in the ranking relative to other billionaires.\n",
"\n",
"Relationships between objects can be defined through shared sources of wealth or countries of residence. For example, billionaires from the same country may have similar sources of income, and they may also be linked through business partnerships or family ties.\n",
"\n",
"Example business goals\n",
"Attracting investment: companies can use data on billionaires for targeted marketing and for attracting investment from wealthy individuals.\n",
"Market analysis: understanding the sources of wealth and how net worth is distributed can help in analyzing market trends and consumer preferences.\n",
"\n",
"Business impact\n",
"These business goals can lead to increased investment, a stronger company reputation, a broader customer base, and greater financial stability for organizations working in various sectors.\n",
"\n",
"Example technical project goals\n",
"For attracting investment: build a platform for analyzing data on billionaires that helps companies find potential investors based on their interests and sources of wealth.\n",
"For market analysis: build an analytics dashboard that visualizes data on billionaires and their sources of wealth, so that companies can better understand market trends.\n",
"\n",
"Input data: data on billionaires, including name, country, net worth, source of wealth, age, and rank.\n",
"\n",
"Target feature: the target can be the billionaire's net worth, which makes it possible to build models that predict future changes in net worth or rank."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"Index: 2600 entries, 1 to 2578\n",
|
|||
|
"Data columns (total 6 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 Name 2600 non-null object \n",
|
|||
|
" 1 Networth 2600 non-null float64\n",
|
|||
|
" 2 Age 2600 non-null int64 \n",
|
|||
|
" 3 Country 2600 non-null object \n",
|
|||
|
" 4 Source 2600 non-null object \n",
|
|||
|
" 5 Industry 2600 non-null object \n",
|
|||
|
"dtypes: float64(1), int64(1), object(4)\n",
|
|||
|
"memory usage: 142.2+ KB\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(2600, 6)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Name</th>\n",
|
|||
|
" <th>Networth</th>\n",
|
|||
|
" <th>Age</th>\n",
|
|||
|
" <th>Country</th>\n",
|
|||
|
" <th>Source</th>\n",
|
|||
|
" <th>Industry</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>RankID</th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>Elon Musk</td>\n",
|
|||
|
" <td>219.0</td>\n",
|
|||
|
" <td>50</td>\n",
|
|||
|
" <td>United States</td>\n",
|
|||
|
" <td>Tesla, SpaceX</td>\n",
|
|||
|
" <td>Automotive</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>Jeff Bezos</td>\n",
|
|||
|
" <td>171.0</td>\n",
|
|||
|
" <td>58</td>\n",
|
|||
|
" <td>United States</td>\n",
|
|||
|
" <td>Amazon</td>\n",
|
|||
|
" <td>Technology</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>Bernard Arnault & family</td>\n",
|
|||
|
" <td>158.0</td>\n",
|
|||
|
" <td>73</td>\n",
|
|||
|
" <td>France</td>\n",
|
|||
|
" <td>LVMH</td>\n",
|
|||
|
" <td>Fashion & Retail</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>Bill Gates</td>\n",
|
|||
|
" <td>129.0</td>\n",
|
|||
|
" <td>66</td>\n",
|
|||
|
" <td>United States</td>\n",
|
|||
|
" <td>Microsoft</td>\n",
|
|||
|
" <td>Technology</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>Warren Buffett</td>\n",
|
|||
|
" <td>118.0</td>\n",
|
|||
|
" <td>91</td>\n",
|
|||
|
" <td>United States</td>\n",
|
|||
|
" <td>Berkshire Hathaway</td>\n",
|
|||
|
" <td>Finance & Investments</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Name Networth Age Country \\\n",
|
|||
|
"RankID \n",
|
|||
|
"1 Elon Musk 219.0 50 United States \n",
|
|||
|
"2 Jeff Bezos 171.0 58 United States \n",
|
|||
|
"3 Bernard Arnault & family 158.0 73 France \n",
|
|||
|
"4 Bill Gates 129.0 66 United States \n",
|
|||
|
"5 Warren Buffett 118.0 91 United States \n",
|
|||
|
"\n",
|
|||
|
" Source Industry \n",
|
|||
|
"RankID \n",
|
|||
|
"1 Tesla, SpaceX Automotive \n",
|
|||
|
"2 Amazon Technology \n",
|
|||
|
"3 LVMH Fashion & Retail \n",
|
|||
|
"4 Microsoft Technology \n",
|
|||
|
"5 Berkshire Hathaway Finance & Investments "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"\n",
|
|||
|
"df = pd.read_csv(\"data/forbes.csv\", index_col=\"RankID\")\n",
|
|||
|
"\n",
|
|||
|
"df.info()\n",
|
|||
|
"\n",
|
|||
|
"display(df.shape)\n",
|
|||
|
"\n",
|
|||
|
"df.head()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Getting information about missing data"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Types of missing data:\n",
"- None - Python's representation of a missing value\n",
"- NaN - pandas' representation of a missing value\n",
"- '' - an empty string"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 28,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Name 0\n",
|
|||
|
"Networth 0\n",
|
|||
|
"Age 0\n",
|
|||
|
"Country 0\n",
|
|||
|
"Source 0\n",
|
|||
|
"Industry 0\n",
|
|||
|
"dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Name False\n",
|
|||
|
"Networth False\n",
|
|||
|
"Age False\n",
|
|||
|
"Country False\n",
|
|||
|
"Source False\n",
|
|||
|
"Industry False\n",
|
|||
|
"dtype: bool"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Number of missing values per feature\n",
"display(df.isnull().sum())\n",
"\n",
"# Whether each feature has any missing values\n",
"display(df.isnull().any())\n",
"\n",
"# Percentage of missing values per feature\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        display(f\"{i} percentage of missing values: {null_rate:.2f}%\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"In this dataset the number of missing values for every feature is 0, i.e. no value is missing, so no imputation or correction is needed."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(2600, 7)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Name False\n",
|
|||
|
"Networth False\n",
|
|||
|
"Age False\n",
|
|||
|
"Country False\n",
|
|||
|
"Source False\n",
|
|||
|
"Industry False\n",
|
|||
|
"AgeFillMedian False\n",
|
|||
|
"dtype: bool"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Name</th>\n",
|
|||
|
" <th>Networth</th>\n",
|
|||
|
" <th>Age</th>\n",
|
|||
|
" <th>Country</th>\n",
|
|||
|
" <th>Source</th>\n",
|
|||
|
" <th>Industry</th>\n",
|
|||
|
" <th>AgeFillMedian</th>\n",
|
|||
|
" <th>AgeFillNA</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>RankID</th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Jorge Gallardo Ballart</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>80</td>\n",
|
|||
|
" <td>Spain</td>\n",
|
|||
|
" <td>pharmaceuticals</td>\n",
|
|||
|
" <td>Healthcare</td>\n",
|
|||
|
" <td>80</td>\n",
|
|||
|
" <td>80</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Nari Genomal</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>82</td>\n",
|
|||
|
" <td>Philippines</td>\n",
|
|||
|
" <td>apparel</td>\n",
|
|||
|
" <td>Fashion & Retail</td>\n",
|
|||
|
" <td>82</td>\n",
|
|||
|
" <td>82</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Ramesh Genomal</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>71</td>\n",
|
|||
|
" <td>Philippines</td>\n",
|
|||
|
" <td>apparel</td>\n",
|
|||
|
" <td>Fashion & Retail</td>\n",
|
|||
|
" <td>71</td>\n",
|
|||
|
" <td>71</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Sunder Genomal</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>68</td>\n",
|
|||
|
" <td>Philippines</td>\n",
|
|||
|
" <td>garments</td>\n",
|
|||
|
" <td>Fashion & Retail</td>\n",
|
|||
|
" <td>68</td>\n",
|
|||
|
" <td>68</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Horst-Otto Gerberding</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>69</td>\n",
|
|||
|
" <td>Germany</td>\n",
|
|||
|
" <td>flavors and fragrances</td>\n",
|
|||
|
" <td>Food & Beverage</td>\n",
|
|||
|
" <td>69</td>\n",
|
|||
|
" <td>69</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Name Networth Age Country \\\n",
|
|||
|
"RankID \n",
|
|||
|
"2578 Jorge Gallardo Ballart 1.0 80 Spain \n",
|
|||
|
"2578 Nari Genomal 1.0 82 Philippines \n",
|
|||
|
"2578 Ramesh Genomal 1.0 71 Philippines \n",
|
|||
|
"2578 Sunder Genomal 1.0 68 Philippines \n",
|
|||
|
"2578 Horst-Otto Gerberding 1.0 69 Germany \n",
|
|||
|
"\n",
|
|||
|
" Source Industry AgeFillMedian AgeFillNA \n",
|
|||
|
"RankID \n",
|
|||
|
"2578 pharmaceuticals Healthcare 80 80 \n",
|
|||
|
"2578 apparel Fashion & Retail 82 82 \n",
|
|||
|
"2578 apparel Fashion & Retail 71 71 \n",
|
|||
|
"2578 garments Fashion & Retail 68 68 \n",
|
|||
|
"2578 flavors and fragrances Food & Beverage 69 69 "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"fillna_df = df.fillna(0)\n",
"\n",
"display(fillna_df.shape)\n",
"\n",
"display(fillna_df.isnull().any())\n",
"\n",
"# Replace missing values with 0\n",
"df[\"AgeFillNA\"] = df[\"Age\"].fillna(0)\n",
"\n",
"# Replace missing values with the median\n",
"df[\"AgeFillMedian\"] = df[\"Age\"].fillna(df[\"Age\"].median())\n",
"\n",
"df.tail()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 32,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Name</th>\n",
|
|||
|
" <th>Networth</th>\n",
|
|||
|
" <th>Age</th>\n",
|
|||
|
" <th>Country</th>\n",
|
|||
|
" <th>Source</th>\n",
|
|||
|
" <th>Industry</th>\n",
|
|||
|
" <th>AgeFillMedian</th>\n",
|
|||
|
" <th>AgeFillNA</th>\n",
|
|||
|
" <th>AgeCopy</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>RankID</th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Jorge Gallardo Ballart</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>80</td>\n",
|
|||
|
" <td>Spain</td>\n",
|
|||
|
" <td>pharmaceuticals</td>\n",
|
|||
|
" <td>Healthcare</td>\n",
|
|||
|
" <td>80</td>\n",
|
|||
|
" <td>80</td>\n",
|
|||
|
" <td>80</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Nari Genomal</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>82</td>\n",
|
|||
|
" <td>Philippines</td>\n",
|
|||
|
" <td>apparel</td>\n",
|
|||
|
" <td>Fashion & Retail</td>\n",
|
|||
|
" <td>82</td>\n",
|
|||
|
" <td>82</td>\n",
|
|||
|
" <td>82</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Ramesh Genomal</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>71</td>\n",
|
|||
|
" <td>Philippines</td>\n",
|
|||
|
" <td>apparel</td>\n",
|
|||
|
" <td>Fashion & Retail</td>\n",
|
|||
|
" <td>71</td>\n",
|
|||
|
" <td>71</td>\n",
|
|||
|
" <td>71</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Sunder Genomal</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>68</td>\n",
|
|||
|
" <td>Philippines</td>\n",
|
|||
|
" <td>garments</td>\n",
|
|||
|
" <td>Fashion & Retail</td>\n",
|
|||
|
" <td>68</td>\n",
|
|||
|
" <td>68</td>\n",
|
|||
|
" <td>68</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2578</th>\n",
|
|||
|
" <td>Horst-Otto Gerberding</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>69</td>\n",
|
|||
|
" <td>Germany</td>\n",
|
|||
|
" <td>flavors and fragrances</td>\n",
|
|||
|
" <td>Food & Beverage</td>\n",
|
|||
|
" <td>69</td>\n",
|
|||
|
" <td>69</td>\n",
|
|||
|
" <td>69</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Name Networth Age Country \\\n",
|
|||
|
"RankID \n",
|
|||
|
"2578 Jorge Gallardo Ballart 1.0 80 Spain \n",
|
|||
|
"2578 Nari Genomal 1.0 82 Philippines \n",
|
|||
|
"2578 Ramesh Genomal 1.0 71 Philippines \n",
|
|||
|
"2578 Sunder Genomal 1.0 68 Philippines \n",
|
|||
|
"2578 Horst-Otto Gerberding 1.0 69 Germany \n",
|
|||
|
"\n",
|
|||
|
" Source Industry AgeFillMedian AgeFillNA \\\n",
|
|||
|
"RankID \n",
|
|||
|
"2578 pharmaceuticals Healthcare 80 80 \n",
|
|||
|
"2578 apparel Fashion & Retail 82 82 \n",
|
|||
|
"2578 apparel Fashion & Retail 71 71 \n",
|
|||
|
"2578 garments Fashion & Retail 68 68 \n",
|
|||
|
"2578 flavors and fragrances Food & Beverage 69 69 \n",
|
|||
|
"\n",
|
|||
|
" AgeCopy \n",
|
|||
|
"RankID \n",
|
|||
|
"2578 80 \n",
|
|||
|
"2578 82 \n",
|
|||
|
"2578 71 \n",
|
|||
|
"2578 68 \n",
|
|||
|
"2578 69 "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 32,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"df[\"AgeCopy\"] = df[\"Age\"]\n",
"\n",
"# Replace values directly in the DataFrame, without creating a copy\n",
"df.fillna({\"AgeCopy\": 0}, inplace=True)\n",
"\n",
"df.tail()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Dropping observations with missing values"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 33,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(2600, 9)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Name False\n",
|
|||
|
"Networth False\n",
|
|||
|
"Age False\n",
|
|||
|
"Country False\n",
|
|||
|
"Source False\n",
|
|||
|
"Industry False\n",
|
|||
|
"AgeFillMedian False\n",
|
|||
|
"dtype: bool"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Drop every row that contains at least one missing value\n",
"dropna_df = df.dropna()\n",
"\n",
"display(dropna_df.shape)\n",
"\n",
"display(dropna_df.isnull().any())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Creating data splits\n",
"\n",
"The scikit-learn library\n",
"\n",
"https://scikit-learn.org/stable/index.html"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 75,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Industry\n",
|
|||
|
"Finance & Investments 386\n",
|
|||
|
"Technology 329\n",
|
|||
|
"Manufacturing 322\n",
|
|||
|
"Fashion & Retail 246\n",
|
|||
|
"Healthcare 212\n",
|
|||
|
"Food & Beverage 201\n",
|
|||
|
"Real Estate 189\n",
|
|||
|
"diversified 178\n",
|
|||
|
"Media & Entertainment 95\n",
|
|||
|
"Energy 93\n",
|
|||
|
"Automotive 69\n",
|
|||
|
"Metals & Mining 67\n",
|
|||
|
"Service 51\n",
|
|||
|
"Construction & Engineering 43\n",
|
|||
|
"Logistics 35\n",
|
|||
|
"Telecom 35\n",
|
|||
|
"Sports 26\n",
|
|||
|
"Gambling & Casinos 23\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
"'Training set: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1560, 3)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Industry\n",
|
|||
|
"Finance & Investments 231\n",
|
|||
|
"Technology 197\n",
|
|||
|
"Manufacturing 193\n",
|
|||
|
"Fashion & Retail 148\n",
|
|||
|
"Healthcare 127\n",
|
|||
|
"Food & Beverage 121\n",
|
|||
|
"Real Estate 113\n",
|
|||
|
"diversified 107\n",
|
|||
|
"Media & Entertainment 57\n",
|
|||
|
"Energy 56\n",
|
|||
|
"Automotive 41\n",
|
|||
|
"Metals & Mining 40\n",
|
|||
|
"Service 31\n",
|
|||
|
"Construction & Engineering 26\n",
|
|||
|
"Logistics 21\n",
|
|||
|
"Telecom 21\n",
|
|||
|
"Sports 16\n",
|
|||
|
"Gambling & Casinos 14\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
"'Validation set: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(520, 3)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Industry\n",
|
|||
|
"Finance & Investments 77\n",
|
|||
|
"Technology 66\n",
|
|||
|
"Manufacturing 64\n",
|
|||
|
"Fashion & Retail 49\n",
|
|||
|
"Healthcare 43\n",
|
|||
|
"Food & Beverage 40\n",
|
|||
|
"Real Estate 38\n",
|
|||
|
"diversified 35\n",
|
|||
|
"Media & Entertainment 19\n",
|
|||
|
"Energy 18\n",
|
|||
|
"Automotive 14\n",
|
|||
|
"Metals & Mining 14\n",
|
|||
|
"Service 10\n",
|
|||
|
"Construction & Engineering 9\n",
|
|||
|
"Telecom 7\n",
|
|||
|
"Logistics 7\n",
|
|||
|
"Sports 5\n",
|
|||
|
"Gambling & Casinos 5\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
"'Test set: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(520, 3)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Industry\n",
|
|||
|
"Finance & Investments 78\n",
|
|||
|
"Technology 66\n",
|
|||
|
"Manufacturing 65\n",
|
|||
|
"Fashion & Retail 49\n",
|
|||
|
"Healthcare 42\n",
|
|||
|
"Food & Beverage 40\n",
|
|||
|
"Real Estate 38\n",
|
|||
|
"diversified 36\n",
|
|||
|
"Media & Entertainment 19\n",
|
|||
|
"Energy 19\n",
|
|||
|
"Automotive 14\n",
|
|||
|
"Metals & Mining 13\n",
|
|||
|
"Service 10\n",
|
|||
|
"Construction & Engineering 8\n",
|
|||
|
"Logistics 7\n",
|
|||
|
"Telecom 7\n",
|
|||
|
"Sports 5\n",
|
|||
|
"Gambling & Casinos 4\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"from src.utils import split_stratified_into_train_val_test\n",
"\n",
"# Show the distribution of observations by industry\n",
"display(df.Industry.value_counts())\n",
"\n",
"data = df[[\"Networth\", \"Age\", \"Industry\"]].copy()\n",
"\n",
"# Stratified 60/20/20 split into training, validation, and test sets\n",
"df_train, df_val, df_test, y_train, y_val, y_test = split_stratified_into_train_val_test(\n",
"    data, stratify_colname=\"Industry\", frac_train=0.60, frac_val=0.20, frac_test=0.20\n",
")\n",
"\n",
"display(\"Training set: \", df_train.shape)\n",
"display(df_train.Industry.value_counts())\n",
"\n",
"display(\"Validation set: \", df_val.shape)\n",
"display(df_val.Industry.value_counts())\n",
"\n",
"display(\"Test set: \", df_test.shape)\n",
"display(df_test.Industry.value_counts())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 78,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
"'Training set: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1560, 3)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Industry\n",
|
|||
|
"Finance & Investments 231\n",
|
|||
|
"Technology 197\n",
|
|||
|
"Manufacturing 193\n",
|
|||
|
"Fashion & Retail 148\n",
|
|||
|
"Healthcare 127\n",
|
|||
|
"Food & Beverage 121\n",
|
|||
|
"Real Estate 113\n",
|
|||
|
"diversified 107\n",
|
|||
|
"Media & Entertainment 57\n",
|
|||
|
"Energy 56\n",
|
|||
|
"Automotive 41\n",
|
|||
|
"Metals & Mining 40\n",
|
|||
|
"Service 31\n",
|
|||
|
"Construction & Engineering 26\n",
|
|||
|
"Logistics 21\n",
|
|||
|
"Telecom 21\n",
|
|||
|
"Sports 16\n",
|
|||
|
"Gambling & Casinos 14\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"ename": "ValueError",
|
|||
|
"evalue": "could not convert string to float: 'Technology '",
|
|||
|
"output_type": "error",
|
|||
|
"traceback": [
|
|||
|
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
|
|||
|
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
|
|||
|
"\u001b[1;32m~\\AppData\\Local\\Temp\\ipykernel_1348\\420769102.py\u001b[0m in \u001b[0;36m?\u001b[1;34m()\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Обучающая выборка: \"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf_train\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdf_train\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mIndustry\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvalue_counts\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mX_resampled\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_resampled\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mada\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit_resample\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdf_train\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf_train\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m\"Industry\"\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;31m# type: ignore\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[0mdf_train_adasyn\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX_resampled\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Обучающая выборка после oversampling: \"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf_train_adasyn\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
|||
|
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\imblearn\\base.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 204\u001b[0m \u001b[0my_resampled\u001b[0m \u001b[1;33m:\u001b[0m \u001b[0marray\u001b[0m\u001b[1;33m-\u001b[0m\u001b[0mlike\u001b[0m \u001b[0mof\u001b[0m \u001b[0mshape\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mn_samples_new\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 205\u001b[0m \u001b[0mThe\u001b[0m \u001b[0mcorresponding\u001b[0m \u001b[0mlabel\u001b[0m \u001b[0mof\u001b[0m \u001b[1;33m`\u001b[0m\u001b[0mX_resampled\u001b[0m\u001b[1;33m`\u001b[0m\u001b[1;33m.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 206\u001b[0m \"\"\"\n\u001b[0;32m 207\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_validate_params\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 208\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0msuper\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit_resample\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
|
|||
|
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\imblearn\\base.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 102\u001b[0m \u001b[0mThe\u001b[0m \u001b[0mcorresponding\u001b[0m \u001b[0mlabel\u001b[0m \u001b[0mof\u001b[0m \u001b[1;33m`\u001b[0m\u001b[0mX_resampled\u001b[0m\u001b[1;33m`\u001b[0m\u001b[1;33m.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 103\u001b[0m \"\"\"\n\u001b[0;32m 104\u001b[0m \u001b[0mcheck_classification_targets\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 105\u001b[0m \u001b[0marrays_transformer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mArraysTransformer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 106\u001b[1;33m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbinarize_y\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_check_X_y\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 107\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 108\u001b[0m self.sampling_strategy_ = check_sampling_strategy(\n\u001b[0;32m 109\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msampling_strategy\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_sampling_type\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
|||
|
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\imblearn\\base.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, X, y, accept_sparse)\u001b[0m\n\u001b[0;32m 157\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_check_X_y\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 158\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0maccept_sparse\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 159\u001b[0m \u001b[0maccept_sparse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;34m\"csr\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"csc\"\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 160\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbinarize_y\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcheck_target_type\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mindicate_one_vs_all\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 161\u001b[1;33m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_validate_data\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mreset\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maccept_sparse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 
162\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbinarize_y\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
|||
|
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\sklearn\\base.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[0;32m 646\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;34m\"estimator\"\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mcheck_y_params\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 647\u001b[0m \u001b[0mcheck_y_params\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m{\u001b[0m\u001b[1;33m**\u001b[0m\u001b[0mdefault_check_params\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 648\u001b[0m \u001b[0my\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0minput_name\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"y\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 649\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 650\u001b[1;33m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcheck_X_y\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 651\u001b[0m \u001b[0mout\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 652\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 653\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m 
\u001b[0mno_val_X\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mcheck_params\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"ensure_2d\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
|||
|
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[0m\n\u001b[0;32m 1297\u001b[0m raise ValueError(\n\u001b[0;32m 1298\u001b[0m \u001b[1;33mf\"\u001b[0m\u001b[1;33m{\u001b[0m\u001b[0mestimator_name\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m requires y to be passed, but the target y is None\u001b[0m\u001b[1;33m\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1299\u001b[0m \u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1300\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1301\u001b[1;33m X = check_array(\n\u001b[0m\u001b[0;32m 1302\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1303\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maccept_sparse\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1304\u001b[0m \u001b[0maccept_large_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maccept_large_sparse\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
|||
|
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[0;32m 1009\u001b[0m \u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1010\u001b[0m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mxp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1011\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1012\u001b[0m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_asarray_with_order\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mxp\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mxp\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1013\u001b[1;33m \u001b[1;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1014\u001b[0m raise ValueError(\n\u001b[0;32m 1015\u001b[0m \u001b[1;34m\"Complex data not 
supported\\n{}\\n\"\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1016\u001b[0m \u001b[1;33m)\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
|||
|
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\sklearn\\utils\\_array_api.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(array, dtype, order, copy, xp, device)\u001b[0m\n\u001b[0;32m 741\u001b[0m \u001b[1;31m# Use NumPy API to support order\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 742\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mcopy\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 743\u001b[0m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 744\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 745\u001b[1;33m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 746\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 747\u001b[0m \u001b[1;31m# At this point array is a NumPy ndarray. We convert it to an array\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 748\u001b[0m \u001b[1;31m# container that is consistent with the input's namespace.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
|||
|
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, dtype, copy)\u001b[0m\n\u001b[0;32m 2149\u001b[0m def __array__(\n\u001b[0;32m 2150\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mnpt\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mDTypeLike\u001b[0m \u001b[1;33m|\u001b[0m \u001b[1;32mNone\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mbool_t\u001b[0m \u001b[1;33m|\u001b[0m \u001b[1;32mNone\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2151\u001b[0m \u001b[1;33m)\u001b[0m \u001b[1;33m->\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2152\u001b[0m \u001b[0mvalues\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2153\u001b[1;33m \u001b[0marr\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2154\u001b[0m if (\n\u001b[0;32m 2155\u001b[0m \u001b[0mastype_is_view\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0marr\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2156\u001b[0m \u001b[1;32mand\u001b[0m 
\u001b[0musing_copy_on_write\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
|||
|
"\u001b[1;31mValueError\u001b[0m: could not convert string to float: 'Technology '"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from imblearn.over_sampling import ADASYN\n",
|
|||
|
"\n",
|
|||
|
"ada = ADASYN()\n",
|
|||
|
"\n",
|
|||
|
"display(\"Обучающая выборка: \", df_train.shape)\n",
|
|||
|
"display(df_train.Industry.value_counts())\n",
|
|||
|
"\n",
|
|||
|
"X_resampled, y_resampled = ada.fit_resample(df_train, df_train[\"Industry\"]) # type: ignore\n",
|
|||
|
"df_train_adasyn = pd.DataFrame(X_resampled)\n",
|
|||
|
"\n",
|
|||
|
"display(\"Обучающая выборка после oversampling: \", df_train_adasyn.shape)\n",
|
|||
|
"display(df_train_adasyn.Industry.value_counts())\n",
|
|||
|
"\n",
|
|||
|
"df_train_adasyn"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"________________________________________________________________________________________________________________________________________"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"ДАТАСЕТ ЦЕНЫ НА ЗОЛОТО\n",
|
|||
|
"\n",
|
|||
|
"Объектами наблюдения в данном наборе данных являются цены на золото, представленные через Gold ETF (Exchange-Traded Fund). Каждая запись в наборе данных соответствует отдельному дню торговли золотыми активами.\n",
|
|||
|
"Атрибуты объектов\n",
|
|||
|
"\n",
|
|||
|
"Атрибутами объектов (цен на золото) являются:\n",
|
|||
|
"Дата: дата, когда происходила торговля.\n",
|
|||
|
"Цена открытия (Open): цена, по которой золото открывалось в начале торгового дня.\n",
|
|||
|
"Максимальная цена (High): наивысшая цена золота в течение дня.\n",
|
|||
|
"Минимальная цена (Low): наименьшая цена золота в течение дня.\n",
|
|||
|
"Цена закрытия (Close): цена, по которой золото закрылось в конце торгового дня.\n",
|
|||
|
"Скорректированная цена закрытия (Adjusted Close): цена закрытия, скорректированная с учетом факторов, таких как дивиденды и сплиты акций.\n",
|
|||
|
"Объем (Volume): количество золота, которое было куплено и продано в течение дня.\n",
|
|||
|
"\n",
|
|||
|
"Связи между объектами могут быть определены через временные последовательности. Например, изменение цен на золото в один день может зависеть от цен в предыдущие дни, а также от внешних факторов, таких как цены на другие драгоценные металлы, цены на нефть, экономические условия и рыночные тренды.\n",
|
|||
|
"\n",
|
|||
|
"Примеры бизнес-целей\n",
|
|||
|
"Оптимизация инвестиционных решений: Анализ исторических данных о ценах на золото может помочь инвесторам принимать более обоснованные решения о покупке или продаже золота.\n",
|
|||
|
"Управление рисками: Понимание факторов, влияющих на цены на золото, может помочь компаниям и инвесторам минимизировать риски, связанные с колебаниями цен.\n",
|
|||
|
"\n",
|
|||
|
"Эффект для бизнеса\n",
|
|||
|
"Эти бизнес-цели могут привести к увеличению доходов, привлечению новых инвесторов и повышению общей финансовой устойчивости компаний, работающих с золотом.\n",
|
|||
|
"\n",
|
|||
|
"Примеры целей технического проекта\n",
|
|||
|
"Для управления рисками: Создание системы мониторинга, которая будет отслеживать изменения цен на золото и другие факторы, влияющие на рынок, и предоставлять рекомендации по управлению рисками.\n",
|
|||
|
"\n",
|
|||
|
"Входные данные: Данные о ценах на золото, включая дату, цену открытия, максимальную и минимальную цены, цену закрытия, скорректированную цену закрытия и объем торгов.\n",
|
|||
|
"\n",
|
|||
|
"Целевой признак: Целевым признаком может быть скорректированная цена закрытия золота на следующий день, что позволит строить модели для прогнозирования будущих цен."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 38,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"Index: 1718 entries, 2011-12-15 to 2018-12-31\n",
|
|||
|
"Data columns (total 80 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 Open 1718 non-null float64\n",
|
|||
|
" 1 High 1718 non-null float64\n",
|
|||
|
" 2 Low 1718 non-null float64\n",
|
|||
|
" 3 Close 1718 non-null float64\n",
|
|||
|
" 4 Adj Close 1718 non-null float64\n",
|
|||
|
" 5 Volume 1718 non-null int64 \n",
|
|||
|
" 6 SP_open 1718 non-null float64\n",
|
|||
|
" 7 SP_high 1718 non-null float64\n",
|
|||
|
" 8 SP_low 1718 non-null float64\n",
|
|||
|
" 9 SP_close 1718 non-null float64\n",
|
|||
|
" 10 SP_Ajclose 1718 non-null float64\n",
|
|||
|
" 11 SP_volume 1718 non-null int64 \n",
|
|||
|
" 12 DJ_open 1718 non-null float64\n",
|
|||
|
" 13 DJ_high 1718 non-null float64\n",
|
|||
|
" 14 DJ_low 1718 non-null float64\n",
|
|||
|
" 15 DJ_close 1718 non-null float64\n",
|
|||
|
" 16 DJ_Ajclose 1718 non-null float64\n",
|
|||
|
" 17 DJ_volume 1718 non-null int64 \n",
|
|||
|
" 18 EG_open 1718 non-null float64\n",
|
|||
|
" 19 EG_high 1718 non-null float64\n",
|
|||
|
" 20 EG_low 1718 non-null float64\n",
|
|||
|
" 21 EG_close 1718 non-null float64\n",
|
|||
|
" 22 EG_Ajclose 1718 non-null float64\n",
|
|||
|
" 23 EG_volume 1718 non-null int64 \n",
|
|||
|
" 24 EU_Price 1718 non-null float64\n",
|
|||
|
" 25 EU_open 1718 non-null float64\n",
|
|||
|
" 26 EU_high 1718 non-null float64\n",
|
|||
|
" 27 EU_low 1718 non-null float64\n",
|
|||
|
" 28 EU_Trend 1718 non-null int64 \n",
|
|||
|
" 29 OF_Price 1718 non-null float64\n",
|
|||
|
" 30 OF_Open 1718 non-null float64\n",
|
|||
|
" 31 OF_High 1718 non-null float64\n",
|
|||
|
" 32 OF_Low 1718 non-null float64\n",
|
|||
|
" 33 OF_Volume 1718 non-null int64 \n",
|
|||
|
" 34 OF_Trend 1718 non-null int64 \n",
|
|||
|
" 35 OS_Price 1718 non-null float64\n",
|
|||
|
" 36 OS_Open 1718 non-null float64\n",
|
|||
|
" 37 OS_High 1718 non-null float64\n",
|
|||
|
" 38 OS_Low 1718 non-null float64\n",
|
|||
|
" 39 OS_Trend 1718 non-null int64 \n",
|
|||
|
" 40 SF_Price 1718 non-null int64 \n",
|
|||
|
" 41 SF_Open 1718 non-null int64 \n",
|
|||
|
" 42 SF_High 1718 non-null int64 \n",
|
|||
|
" 43 SF_Low 1718 non-null int64 \n",
|
|||
|
" 44 SF_Volume 1718 non-null int64 \n",
|
|||
|
" 45 SF_Trend 1718 non-null int64 \n",
|
|||
|
" 46 USB_Price 1718 non-null float64\n",
|
|||
|
" 47 USB_Open 1718 non-null float64\n",
|
|||
|
" 48 USB_High 1718 non-null float64\n",
|
|||
|
" 49 USB_Low 1718 non-null float64\n",
|
|||
|
" 50 USB_Trend 1718 non-null int64 \n",
|
|||
|
" 51 PLT_Price 1718 non-null float64\n",
|
|||
|
" 52 PLT_Open 1718 non-null float64\n",
|
|||
|
" 53 PLT_High 1718 non-null float64\n",
|
|||
|
" 54 PLT_Low 1718 non-null float64\n",
|
|||
|
" 55 PLT_Trend 1718 non-null int64 \n",
|
|||
|
" 56 PLD_Price 1718 non-null float64\n",
|
|||
|
" 57 PLD_Open 1718 non-null float64\n",
|
|||
|
" 58 PLD_High 1718 non-null float64\n",
|
|||
|
" 59 PLD_Low 1718 non-null float64\n",
|
|||
|
" 60 PLD_Trend 1718 non-null int64 \n",
|
|||
|
" 61 RHO_PRICE 1718 non-null int64 \n",
|
|||
|
" 62 USDI_Price 1718 non-null float64\n",
|
|||
|
" 63 USDI_Open 1718 non-null float64\n",
|
|||
|
" 64 USDI_High 1718 non-null float64\n",
|
|||
|
" 65 USDI_Low 1718 non-null float64\n",
|
|||
|
" 66 USDI_Volume 1718 non-null int64 \n",
|
|||
|
" 67 USDI_Trend 1718 non-null int64 \n",
|
|||
|
" 68 GDX_Open 1718 non-null float64\n",
|
|||
|
" 69 GDX_High 1718 non-null float64\n",
|
|||
|
" 70 GDX_Low 1718 non-null float64\n",
|
|||
|
" 71 GDX_Close 1718 non-null float64\n",
|
|||
|
" 72 GDX_Adj Close 1718 non-null float64\n",
|
|||
|
" 73 GDX_Volume 1718 non-null int64 \n",
|
|||
|
" 74 USO_Open 1718 non-null float64\n",
|
|||
|
" 75 USO_High 1718 non-null float64\n",
|
|||
|
" 76 USO_Low 1718 non-null float64\n",
|
|||
|
" 77 USO_Close 1718 non-null float64\n",
|
|||
|
" 78 USO_Adj Close 1718 non-null float64\n",
|
|||
|
" 79 USO_Volume 1718 non-null int64 \n",
|
|||
|
"dtypes: float64(58), int64(22)\n",
|
|||
|
"memory usage: 1.1+ MB\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1718, 80)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Open</th>\n",
|
|||
|
" <th>High</th>\n",
|
|||
|
" <th>Low</th>\n",
|
|||
|
" <th>Close</th>\n",
|
|||
|
" <th>Adj Close</th>\n",
|
|||
|
" <th>Volume</th>\n",
|
|||
|
" <th>SP_open</th>\n",
|
|||
|
" <th>SP_high</th>\n",
|
|||
|
" <th>SP_low</th>\n",
|
|||
|
" <th>SP_close</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>GDX_Low</th>\n",
|
|||
|
" <th>GDX_Close</th>\n",
|
|||
|
" <th>GDX_Adj Close</th>\n",
|
|||
|
" <th>GDX_Volume</th>\n",
|
|||
|
" <th>USO_Open</th>\n",
|
|||
|
" <th>USO_High</th>\n",
|
|||
|
" <th>USO_Low</th>\n",
|
|||
|
" <th>USO_Close</th>\n",
|
|||
|
" <th>USO_Adj Close</th>\n",
|
|||
|
" <th>USO_Volume</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>Date</th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2011-12-15</th>\n",
|
|||
|
" <td>154.740005</td>\n",
|
|||
|
" <td>154.949997</td>\n",
|
|||
|
" <td>151.710007</td>\n",
|
|||
|
" <td>152.330002</td>\n",
|
|||
|
" <td>152.330002</td>\n",
|
|||
|
" <td>21521900</td>\n",
|
|||
|
" <td>123.029999</td>\n",
|
|||
|
" <td>123.199997</td>\n",
|
|||
|
" <td>121.989998</td>\n",
|
|||
|
" <td>122.180000</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>51.570000</td>\n",
|
|||
|
" <td>51.680000</td>\n",
|
|||
|
" <td>48.973877</td>\n",
|
|||
|
" <td>20605600</td>\n",
|
|||
|
" <td>36.900002</td>\n",
|
|||
|
" <td>36.939999</td>\n",
|
|||
|
" <td>36.049999</td>\n",
|
|||
|
" <td>36.130001</td>\n",
|
|||
|
" <td>36.130001</td>\n",
|
|||
|
" <td>12616700</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2011-12-16</th>\n",
|
|||
|
" <td>154.309998</td>\n",
|
|||
|
" <td>155.369995</td>\n",
|
|||
|
" <td>153.899994</td>\n",
|
|||
|
" <td>155.229996</td>\n",
|
|||
|
" <td>155.229996</td>\n",
|
|||
|
" <td>18124300</td>\n",
|
|||
|
" <td>122.230003</td>\n",
|
|||
|
" <td>122.949997</td>\n",
|
|||
|
" <td>121.300003</td>\n",
|
|||
|
" <td>121.589996</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>52.040001</td>\n",
|
|||
|
" <td>52.680000</td>\n",
|
|||
|
" <td>49.921513</td>\n",
|
|||
|
" <td>16285400</td>\n",
|
|||
|
" <td>36.180000</td>\n",
|
|||
|
" <td>36.500000</td>\n",
|
|||
|
" <td>35.730000</td>\n",
|
|||
|
" <td>36.270000</td>\n",
|
|||
|
" <td>36.270000</td>\n",
|
|||
|
" <td>12578800</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2011-12-19</th>\n",
|
|||
|
" <td>155.479996</td>\n",
|
|||
|
" <td>155.860001</td>\n",
|
|||
|
" <td>154.360001</td>\n",
|
|||
|
" <td>154.869995</td>\n",
|
|||
|
" <td>154.869995</td>\n",
|
|||
|
" <td>12547200</td>\n",
|
|||
|
" <td>122.059998</td>\n",
|
|||
|
" <td>122.320000</td>\n",
|
|||
|
" <td>120.029999</td>\n",
|
|||
|
" <td>120.290001</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>51.029999</td>\n",
|
|||
|
" <td>51.169998</td>\n",
|
|||
|
" <td>48.490578</td>\n",
|
|||
|
" <td>15120200</td>\n",
|
|||
|
" <td>36.389999</td>\n",
|
|||
|
" <td>36.450001</td>\n",
|
|||
|
" <td>35.930000</td>\n",
|
|||
|
" <td>36.200001</td>\n",
|
|||
|
" <td>36.200001</td>\n",
|
|||
|
" <td>7418200</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2011-12-20</th>\n",
|
|||
|
" <td>156.820007</td>\n",
|
|||
|
" <td>157.429993</td>\n",
|
|||
|
" <td>156.580002</td>\n",
|
|||
|
" <td>156.979996</td>\n",
|
|||
|
" <td>156.979996</td>\n",
|
|||
|
" <td>9136300</td>\n",
|
|||
|
" <td>122.180000</td>\n",
|
|||
|
" <td>124.139999</td>\n",
|
|||
|
" <td>120.370003</td>\n",
|
|||
|
" <td>123.930000</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>52.369999</td>\n",
|
|||
|
" <td>52.990002</td>\n",
|
|||
|
" <td>50.215282</td>\n",
|
|||
|
" <td>11644900</td>\n",
|
|||
|
" <td>37.299999</td>\n",
|
|||
|
" <td>37.610001</td>\n",
|
|||
|
" <td>37.220001</td>\n",
|
|||
|
" <td>37.560001</td>\n",
|
|||
|
" <td>37.560001</td>\n",
|
|||
|
" <td>10041600</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2011-12-21</th>\n",
|
|||
|
" <td>156.979996</td>\n",
|
|||
|
" <td>157.529999</td>\n",
|
|||
|
" <td>156.130005</td>\n",
|
|||
|
" <td>157.160004</td>\n",
|
|||
|
" <td>157.160004</td>\n",
|
|||
|
" <td>11996100</td>\n",
|
|||
|
" <td>123.930000</td>\n",
|
|||
|
" <td>124.360001</td>\n",
|
|||
|
" <td>122.750000</td>\n",
|
|||
|
" <td>124.169998</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>52.419998</td>\n",
|
|||
|
" <td>52.959999</td>\n",
|
|||
|
" <td>50.186852</td>\n",
|
|||
|
" <td>8724300</td>\n",
|
|||
|
" <td>37.669998</td>\n",
|
|||
|
" <td>38.240002</td>\n",
|
|||
|
" <td>37.520000</td>\n",
|
|||
|
" <td>38.110001</td>\n",
|
|||
|
" <td>38.110001</td>\n",
|
|||
|
" <td>10728000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>5 rows × 80 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Open High Low Close Adj Close \\\n",
|
|||
|
"Date \n",
|
|||
|
"2011-12-15 154.740005 154.949997 151.710007 152.330002 152.330002 \n",
|
|||
|
"2011-12-16 154.309998 155.369995 153.899994 155.229996 155.229996 \n",
|
|||
|
"2011-12-19 155.479996 155.860001 154.360001 154.869995 154.869995 \n",
|
|||
|
"2011-12-20 156.820007 157.429993 156.580002 156.979996 156.979996 \n",
|
|||
|
"2011-12-21 156.979996 157.529999 156.130005 157.160004 157.160004 \n",
|
|||
|
"\n",
|
|||
|
" Volume SP_open SP_high SP_low SP_close ... \\\n",
|
|||
|
"Date ... \n",
|
|||
|
"2011-12-15 21521900 123.029999 123.199997 121.989998 122.180000 ... \n",
|
|||
|
"2011-12-16 18124300 122.230003 122.949997 121.300003 121.589996 ... \n",
|
|||
|
"2011-12-19 12547200 122.059998 122.320000 120.029999 120.290001 ... \n",
|
|||
|
"2011-12-20 9136300 122.180000 124.139999 120.370003 123.930000 ... \n",
|
|||
|
"2011-12-21 11996100 123.930000 124.360001 122.750000 124.169998 ... \n",
|
|||
|
"\n",
|
|||
|
" GDX_Low GDX_Close GDX_Adj Close GDX_Volume USO_Open \\\n",
|
|||
|
"Date \n",
|
|||
|
"2011-12-15 51.570000 51.680000 48.973877 20605600 36.900002 \n",
|
|||
|
"2011-12-16 52.040001 52.680000 49.921513 16285400 36.180000 \n",
|
|||
|
"2011-12-19 51.029999 51.169998 48.490578 15120200 36.389999 \n",
|
|||
|
"2011-12-20 52.369999 52.990002 50.215282 11644900 37.299999 \n",
|
|||
|
"2011-12-21 52.419998 52.959999 50.186852 8724300 37.669998 \n",
|
|||
|
"\n",
|
|||
|
" USO_High USO_Low USO_Close USO_Adj Close USO_Volume \n",
|
|||
|
"Date \n",
|
|||
|
"2011-12-15 36.939999 36.049999 36.130001 36.130001 12616700 \n",
|
|||
|
"2011-12-16 36.500000 35.730000 36.270000 36.270000 12578800 \n",
|
|||
|
"2011-12-19 36.450001 35.930000 36.200001 36.200001 7418200 \n",
|
|||
|
"2011-12-20 37.610001 37.220001 37.560001 37.560001 10041600 \n",
|
|||
|
"2011-12-21 38.240002 37.520000 38.110001 38.110001 10728000 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 80 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 38,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"dfGold = pd.read_csv(\"data/gold.csv\", index_col=\"Date\")\n",
|
|||
|
"\n",
|
|||
|
"dfGold.info()\n",
|
|||
|
"\n",
|
|||
|
"display(dfGold.shape)\n",
|
|||
|
"\n",
|
|||
|
"dfGold.head()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Пустые значения"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 41,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Open 0\n",
|
|||
|
"High 0\n",
|
|||
|
"Low 0\n",
|
|||
|
"Close 0\n",
|
|||
|
"Adj Close 0\n",
|
|||
|
" ..\n",
|
|||
|
"USO_High 0\n",
|
|||
|
"USO_Low 0\n",
|
|||
|
"USO_Close 0\n",
|
|||
|
"USO_Adj Close 0\n",
|
|||
|
"USO_Volume 0\n",
|
|||
|
"Length: 80, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Open False\n",
|
|||
|
"High False\n",
|
|||
|
"Low False\n",
|
|||
|
"Close False\n",
|
|||
|
"Adj Close False\n",
|
|||
|
" ... \n",
|
|||
|
"USO_High False\n",
|
|||
|
"USO_Low False\n",
|
|||
|
"USO_Close False\n",
|
|||
|
"USO_Adj Close False\n",
|
|||
|
"USO_Volume False\n",
|
|||
|
"Length: 80, dtype: bool"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Количество пустых значений признаков\n",
|
|||
|
"display(dfGold.isnull().sum())\n",
|
|||
|
"display()\n",
|
|||
|
"\n",
|
|||
|
"# Есть ли пустые значения признаков\n",
|
|||
|
"display(dfGold.isnull().any())\n",
|
|||
|
"display()\n",
|
|||
|
"\n",
|
|||
|
"# Процент пустых значений признаков\n",
|
|||
|
"for i in dfGold.columns:\n",
|
|||
|
" null_rate = dfGold[i].isnull().sum() / len(dfGold) * 100\n",
|
|||
|
" if null_rate > 0:\n",
|
|||
|
" display(f\"{i} процент пустых значений: %{null_rate:.2f}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Заполение пустых значений для данного набора так же не требуется."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Создание выборок данных"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 67,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"USB_Trend\n",
|
|||
|
"0 876\n",
|
|||
|
"1 842\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'Обучающая выборка: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1030, 4)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"USB_Trend\n",
|
|||
|
"0 525\n",
|
|||
|
"1 505\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'Контрольная выборка: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(344, 4)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"USB_Trend\n",
|
|||
|
"0 176\n",
|
|||
|
"1 168\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'Тестовая выборка: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(344, 4)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"USB_Trend\n",
|
|||
|
"0 175\n",
|
|||
|
"1 169\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Вывод распределения количества наблюдений по меркам\n",
|
|||
|
"from src.utils import split_stratified_into_train_val_test\n",
|
|||
|
"\n",
|
|||
|
"display((dfGold.USB_Trend).value_counts())\n",
|
|||
|
"display()\n",
|
|||
|
"\n",
|
|||
|
"selected_columns = [\"Open\", \"High\", \"Low\", \"USB_Trend\"]\n",
|
|||
|
"dfGold[\"USB_Trend\"] = round(dfGold[\"USB_Trend\"])\n",
|
|||
|
"data = dfGold[selected_columns].copy()\n",
|
|||
|
"\n",
|
|||
|
"# Создание выборок\n",
|
|||
|
"dfGold_train, dfGold_val, dfGold_test, y_train, y_val, y_test = (\n",
|
|||
|
" split_stratified_into_train_val_test(\n",
|
|||
|
" data,\n",
|
|||
|
" stratify_colname=\"USB_Trend\",\n",
|
|||
|
" frac_train=0.60,\n",
|
|||
|
" frac_val=0.20,\n",
|
|||
|
" frac_test=0.20,\n",
|
|||
|
" )\n",
|
|||
|
")\n",
|
|||
|
"\n",
|
|||
|
"# Используем display для вывода информации о выборках\n",
|
|||
|
"display(\"Обучающая выборка: \", dfGold_train.shape)\n",
|
|||
|
"display(round(dfGold_train.USB_Trend).value_counts())\n",
|
|||
|
"\n",
|
|||
|
"display(\"Контрольная выборка: \", dfGold_val.shape)\n",
|
|||
|
"display(round(dfGold_val.USB_Trend).value_counts())\n",
|
|||
|
"\n",
|
|||
|
"display(\"Тестовая выборка: \", dfGold_test.shape)\n",
|
|||
|
"display(round(dfGold_test.USB_Trend).value_counts())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 81,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'Training set: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1030, 4)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"USB_Trend\n",
|
|||
|
"0 525\n",
|
|||
|
"1 505\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'Training set after undersampling: '"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1010, 4)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"USB_Trend\n",
|
|||
|
"0 505\n",
|
|||
|
"1 505\n",
|
|||
|
"Name: count, dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Open</th>\n",
|
|||
|
" <th>High</th>\n",
|
|||
|
" <th>Low</th>\n",
|
|||
|
" <th>USB_Trend</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>Date</th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2016-04-27</th>\n",
|
|||
|
" <td>118.970001</td>\n",
|
|||
|
" <td>119.699997</td>\n",
|
|||
|
" <td>118.430000</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2017-02-16</th>\n",
|
|||
|
" <td>117.930000</td>\n",
|
|||
|
" <td>118.349998</td>\n",
|
|||
|
" <td>117.830002</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2016-11-15</th>\n",
|
|||
|
" <td>116.459999</td>\n",
|
|||
|
" <td>117.239998</td>\n",
|
|||
|
" <td>116.290001</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2016-11-07</th>\n",
|
|||
|
" <td>122.660004</td>\n",
|
|||
|
" <td>122.709999</td>\n",
|
|||
|
" <td>121.879997</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2018-04-30</th>\n",
|
|||
|
" <td>124.410004</td>\n",
|
|||
|
" <td>125.199997</td>\n",
|
|||
|
" <td>124.190002</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2012-07-13</th>\n",
|
|||
|
" <td>153.449997</td>\n",
|
|||
|
" <td>154.940002</td>\n",
|
|||
|
" <td>153.440002</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2016-05-25</th>\n",
|
|||
|
" <td>116.589996</td>\n",
|
|||
|
" <td>117.059998</td>\n",
|
|||
|
" <td>116.320000</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2016-03-02</th>\n",
|
|||
|
" <td>118.339996</td>\n",
|
|||
|
" <td>118.970001</td>\n",
|
|||
|
" <td>118.070000</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2013-08-05</th>\n",
|
|||
|
" <td>126.510002</td>\n",
|
|||
|
" <td>126.639999</td>\n",
|
|||
|
" <td>125.339996</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2013-06-20</th>\n",
|
|||
|
" <td>125.220001</td>\n",
|
|||
|
" <td>126.379997</td>\n",
|
|||
|
" <td>123.330002</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>1010 rows × 4 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Open High Low USB_Trend\n",
|
|||
|
"Date \n",
|
|||
|
"2016-04-27 118.970001 119.699997 118.430000 0\n",
|
|||
|
"2017-02-16 117.930000 118.349998 117.830002 0\n",
|
|||
|
"2016-11-15 116.459999 117.239998 116.290001 0\n",
|
|||
|
"2016-11-07 122.660004 122.709999 121.879997 0\n",
|
|||
|
"2018-04-30 124.410004 125.199997 124.190002 0\n",
|
|||
|
"... ... ... ... ...\n",
|
|||
|
"2012-07-13 153.449997 154.940002 153.440002 1\n",
|
|||
|
"2016-05-25 116.589996 117.059998 116.320000 1\n",
|
|||
|
"2016-03-02 118.339996 118.970001 118.070000 1\n",
|
|||
|
"2013-08-05 126.510002 126.639999 125.339996 1\n",
|
|||
|
"2013-06-20 125.220001 126.379997 123.330002 1\n",
|
|||
|
"\n",
|
|||
|
"[1010 rows x 4 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 81,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from imblearn.under_sampling import RandomUnderSampler\n",
|
|||
|
"\n",
|
|||
|
"# Create a RandomUnderSampler instance\n",
|
|||
|
"rus = RandomUnderSampler(\n",
|
|||
|
" sampling_strategy=\"auto\"\n",
|
|||
|
") # 'auto' balances the classes by undersampling all but the minority class\n",
|
|||
|
"\n",
|
|||
|
"display(\"Training set: \", dfGold_train.shape)\n",
|
|||
|
"display(dfGold_train.USB_Trend.value_counts())\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Separate the features from the target variable\n",
|
|||
|
"X = dfGold_train.drop(columns=[\"USB_Trend\"])\n",
|
|||
|
"y = dfGold_train[\"USB_Trend\"]\n",
|
|||
|
"\n",
|
|||
|
"# Apply undersampling\n",
|
|||
|
"X_resampled, y_resampled = rus.fit_resample(X, y)\n",
|
|||
|
"\n",
|
|||
|
"# Assemble the resampled features and target into a new DataFrame\n",
|
|||
|
"dfGold_train_undersampled = pd.DataFrame(X_resampled)\n",
|
|||
|
"dfGold_train_undersampled[\"USB_Trend\"] = y_resampled\n",
|
|||
|
"\n",
|
|||
|
"display(\"Training set after undersampling: \", dfGold_train_undersampled.shape)\n",
|
|||
|
"display(dfGold_train_undersampled.USB_Trend.value_counts())\n",
|
|||
|
"\n",
|
|||
|
"dfGold_train_undersampled"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"MARKETING CAMPAIGN DATASET"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": ".venv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.5"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|