{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Lab 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FORBES LIST DATASET\n", "The objects of observation in this dataset are billionaires whose net worth is estimated and documented in the annual Forbes ranking. Each record represents one billionaire together with their estimated net worth.\n", "Object attributes\n", "\n", "The attributes of the objects (billionaires) are:\n", "Name (`Name`): the billionaire's name.\n", "Country (`Country`): the country where the billionaire resides.\n", "Net worth (`Networth`): the billionaire's estimated net worth in billions of US dollars.\n", "Source of wealth (`Source`): the company or business the fortune comes from (for example, Tesla, Amazon, LVMH).\n", "Industry (`Industry`): the sector of that business (for example, Technology, Finance & Investments, Real Estate).\n", "Age (`Age`): the billionaire's age at the time the list was published.\n", "Rank (`RankID`): the billionaire's position in the ranking relative to the others; it is used as the DataFrame index.\n", "\n", "Links between objects can be defined through shared sources of wealth or countries of residence. 
For example, billionaires from the same country may have similar sources of income and may also be connected through business partnerships or family ties.\n", "\n", "Examples of business goals\n", "Attracting investment: companies can use data on billionaires for targeted marketing and for attracting investment from wealthy individuals.\n", "Market analysis: understanding the sources of wealth and how net worth is distributed can help in analyzing market trends and consumer preferences.\n", "\n", "Business effect\n", "Achieving these goals can lead to more investment, a better company reputation, a broader client base, and greater financial stability for organizations working in various sectors.\n", "\n", "Examples of technical project goals\n", "For attracting investment: build a data analysis platform on billionaires that helps companies find potential investors based on their interests and sources of wealth.\n", "For market analysis: build an analytics dashboard that visualizes data on billionaires and their sources of wealth so that companies can better understand market trends.\n", "\n", "Input data: data on billionaires, including name, country, net worth, source of wealth, industry, age, and rank.\n", "\n", "Target feature: the target can be the billionaire's net worth, which makes it possible to build models that predict future changes in net worth or rank." 
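, "\n", "A minimal sketch of the target feature and the links between objects described above. It assumes only pandas; the five rows are copied from the `df.head()` output shown later in this notebook, and the names `sample`, `features`, `target` and `by_country` are illustrative rather than part of the lab code.\n", "\n", "
```python
import pandas as pd

# Five rows copied from the df.head() output of this notebook
sample = pd.DataFrame(
    {
        'Name': ['Elon Musk', 'Jeff Bezos', 'Bernard Arnault & family', 'Bill Gates', 'Warren Buffett'],
        'Networth': [219.0, 171.0, 158.0, 129.0, 118.0],
        'Age': [50, 58, 73, 66, 91],
        'Country': ['United States', 'United States', 'France', 'United States', 'United States'],
        'Industry': ['Automotive', 'Technology', 'Fashion & Retail', 'Technology', 'Finance & Investments'],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='RankID'),
)

# Target feature: net worth; the remaining columns are treated as inputs
target = sample['Networth']
features = sample.drop(columns=['Networth'])

# One possible notion of a link between objects: billionaires sharing a country
by_country = sample.groupby('Country')['Name'].apply(list).to_dict()
```
"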
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "Index: 2600 entries, 1 to 2578\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Name 2600 non-null object \n", " 1 Networth 2600 non-null float64\n", " 2 Age 2600 non-null int64 \n", " 3 Country 2600 non-null object \n", " 4 Source 2600 non-null object \n", " 5 Industry 2600 non-null object \n", "dtypes: float64(1), int64(1), object(4)\n", "memory usage: 142.2+ KB\n" ] }, { "data": { "text/plain": [ "(2600, 6)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { 
"text/plain": [ " Name Networth Age Country \\\n", "RankID \n", "1 Elon Musk 219.0 50 United States \n", "2 Jeff Bezos 171.0 58 United States \n", "3 Bernard Arnault & family 158.0 73 France \n", "4 Bill Gates 129.0 66 United States \n", "5 Warren Buffett 118.0 91 United States \n", "\n", " Source Industry \n", "RankID \n", "1 Tesla, SpaceX Automotive \n", "2 Amazon Technology \n", "3 LVMH Fashion & Retail \n", "4 Microsoft Technology \n", "5 Berkshire Hathaway Finance & Investments " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/forbes.csv\", index_col=\"RankID\")\n", "\n", "df.info()\n", "\n", "display(df.shape)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting information about missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Types of missing data:\n", "- None - the representation of a missing value in Python\n", "- NaN - the representation of a missing value in Pandas\n", "- '' - an empty string (not detected by `isnull()`)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name 0\n", "Networth 0\n", "Age 0\n", "Country 0\n", "Source 0\n", "Industry 0\n", "dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Name False\n", "Networth False\n", "Age False\n", "Country False\n", "Source False\n", "Industry False\n", "dtype: bool" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Number of missing values per feature\n", "display(df.isnull().sum())\n", "\n", "# Whether each feature has any missing values\n", "display(df.isnull().any())\n", "\n", "# Percentage of missing values per feature\n", "for i in df.columns:\n", "    null_rate = df[i].isnull().sum() / len(df) * 100\n", "    if null_rate > 0:\n", "        display(f\"{i} percentage of missing values: {null_rate:.2f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "For this dataset the number of missing values for each feature is 0, i.e. no value is missing, so no imputation or correction is needed." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2600, 7)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Name False\n", "Networth False\n", "Age False\n", "Country False\n", "Source False\n", "Industry False\n", "AgeFillMedian False\n", "dtype: bool" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Name Networth Age Country \\\n", "RankID \n", "2578 Jorge Gallardo Ballart 1.0 80 Spain \n", "2578 Nari Genomal 1.0 82 Philippines \n", "2578 Ramesh Genomal 1.0 71 Philippines \n", "2578 Sunder Genomal 1.0 68 Philippines \n", "2578 Horst-Otto Gerberding 1.0 69 Germany \n", "\n", " Source Industry AgeFillMedian AgeFillNA \n", "RankID \n", "2578 pharmaceuticals Healthcare 80 80 \n", "2578 apparel Fashion & Retail 82 82 \n", "2578 apparel Fashion & Retail 71 71 \n", "2578 garments Fashion & Retail 68 68 \n", "2578 flavors and fragrances Food & Beverage 69 69 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fillna_df = df.fillna(0)\n", "\n", "display(fillna_df.shape)\n", "\n", "display(fillna_df.isnull().any())\n", "\n", "# Replace missing values with 0\n", "df[\"AgeFillNA\"] = df[\"Age\"].fillna(0)\n", "\n", "# Replace missing values with the median\n", "df[\"AgeFillMedian\"] = df[\"Age\"].fillna(df[\"Age\"].median())\n", "\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Networth Age Country \\\n", "RankID \n", "2578 Jorge Gallardo Ballart 1.0 80 Spain \n", "2578 Nari Genomal 1.0 82 Philippines \n", "2578 Ramesh Genomal 1.0 71 Philippines \n", "2578 Sunder Genomal 1.0 68 Philippines \n", "2578 Horst-Otto Gerberding 1.0 69 Germany \n", "\n", " Source Industry AgeFillMedian AgeFillNA \\\n", "RankID \n", "2578 pharmaceuticals Healthcare 80 80 \n", "2578 apparel Fashion & Retail 82 82 \n", "2578 apparel Fashion & Retail 71 71 \n", "2578 garments Fashion & Retail 68 68 \n", "2578 flavors and fragrances Food & Beverage 69 69 \n", "\n", " AgeCopy \n", "RankID \n", "2578 80 \n", "2578 82 \n", "2578 71 \n", "2578 68 \n", "2578 69 " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"AgeCopy\"] = df[\"Age\"]\n", "\n", "# Replace values in place, directly in the DataFrame, without creating a copy\n", "df.fillna({\"AgeCopy\": 0}, inplace=True)\n", "\n", "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dropping observations with missing values" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2600, 9)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Name False\n", "Networth False\n", "Age False\n", "Country False\n", "Source False\n", "Industry False\n", "AgeFillMedian False\n", "dtype: bool" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dropna_df = df.dropna()\n", "\n", "display(dropna_df.shape)\n", "\n", "display(dropna_df.isnull().any())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating data splits\n", "\n", "The scikit-learn library\n", "\n", "https://scikit-learn.org/stable/index.html" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Industry\n", "Finance & Investments 386\n", "Technology 329\n", "Manufacturing 322\n", "Fashion & Retail 246\n", "Healthcare 212\n", "Food & Beverage 
201\n", "Real Estate 189\n", "diversified 178\n", "Media & Entertainment 95\n", "Energy 93\n", "Automotive 69\n", "Metals & Mining 67\n", "Service 51\n", "Construction & Engineering 43\n", "Logistics 35\n", "Telecom 35\n", "Sports 26\n", "Gambling & Casinos 23\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Training set: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(1560, 3)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Industry\n", "Finance & Investments 231\n", "Technology 197\n", "Manufacturing 193\n", "Fashion & Retail 148\n", "Healthcare 127\n", "Food & Beverage 121\n", "Real Estate 113\n", "diversified 107\n", "Media & Entertainment 57\n", "Energy 56\n", "Automotive 41\n", "Metals & Mining 40\n", "Service 31\n", "Construction & Engineering 26\n", "Logistics 21\n", "Telecom 21\n", "Sports 16\n", "Gambling & Casinos 14\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Validation set: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(520, 3)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Industry\n", "Finance & Investments 77\n", "Technology 66\n", "Manufacturing 64\n", "Fashion & Retail 49\n", "Healthcare 43\n", "Food & Beverage 40\n", "Real Estate 38\n", "diversified 35\n", "Media & Entertainment 19\n", "Energy 18\n", "Automotive 14\n", "Metals & Mining 14\n", "Service 10\n", "Construction & Engineering 9\n", "Telecom 7\n", "Logistics 7\n", "Sports 5\n", "Gambling & Casinos 5\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Test set: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(520, 3)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { 
"text/plain": [ "Industry\n", "Finance & Investments 78\n", "Technology 66\n", "Manufacturing 65\n", "Fashion & Retail 49\n", "Healthcare 42\n", "Food & Beverage 40\n", "Real Estate 38\n", "diversified 36\n", "Media & Entertainment 19\n", "Energy 19\n", "Automotive 14\n", "Metals & Mining 13\n", "Service 10\n", "Construction & Engineering 8\n", "Logistics 7\n", "Telecom 7\n", "Sports 5\n", "Gambling & Casinos 4\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from src.utils import split_stratified_into_train_val_test\n", "\n", "# Distribution of the number of observations by industry\n", "display(df.Industry.value_counts())\n", "\n", "data = df[[\"Networth\", \"Age\", \"Industry\"]].copy()\n", "\n", "# Stratified 60/20/20 split that preserves the industry proportions\n", "df_train, df_val, df_test, y_train, y_val, y_test = split_stratified_into_train_val_test(\n", "    data, stratify_colname=\"Industry\", frac_train=0.60, frac_val=0.20, frac_test=0.20\n", ")\n", "\n", "display(\"Training set: \", df_train.shape)\n", "display(df_train.Industry.value_counts())\n", "\n", "display(\"Validation set: \", df_val.shape)\n", "display(df_val.Industry.value_counts())\n", "\n", "display(\"Test set: \", df_test.shape)\n", "display(df_test.Industry.value_counts())" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Training set: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(1560, 3)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Industry\n", "Finance & Investments 231\n", "Technology 197\n", "Manufacturing 193\n", "Fashion & Retail 148\n", "Healthcare 127\n", "Food & Beverage 121\n", "Real Estate 113\n", "diversified 107\n", "Media & Entertainment 57\n", "Energy 56\n", "Automotive 41\n", "Metals & Mining 40\n", "Service 31\n", "Construction & Engineering 26\n", "Logistics 21\n", "Telecom 21\n", "Sports 16\n", "Gambling & 
Casinos 14\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "ename": "ValueError", "evalue": "could not convert string to float: 'Technology '", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m~\\AppData\\Local\\Temp\\ipykernel_1348\\420769102.py\u001b[0m in \u001b[0;36m?\u001b[1;34m()\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Обучающая выборка: \"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf_train\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdf_train\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mIndustry\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvalue_counts\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mX_resampled\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_resampled\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mada\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit_resample\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdf_train\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf_train\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m\"Industry\"\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;31m# type: ignore\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[0mdf_train_adasyn\u001b[0m \u001b[1;33m=\u001b[0m 
\u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX_resampled\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Обучающая выборка после oversampling: \"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf_train_adasyn\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\imblearn\\base.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 204\u001b[0m \u001b[0my_resampled\u001b[0m \u001b[1;33m:\u001b[0m \u001b[0marray\u001b[0m\u001b[1;33m-\u001b[0m\u001b[0mlike\u001b[0m \u001b[0mof\u001b[0m \u001b[0mshape\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mn_samples_new\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 205\u001b[0m \u001b[0mThe\u001b[0m \u001b[0mcorresponding\u001b[0m \u001b[0mlabel\u001b[0m \u001b[0mof\u001b[0m \u001b[1;33m`\u001b[0m\u001b[0mX_resampled\u001b[0m\u001b[1;33m`\u001b[0m\u001b[1;33m.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 206\u001b[0m \"\"\"\n\u001b[0;32m 207\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_validate_params\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 208\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0msuper\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit_resample\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\imblearn\\base.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 102\u001b[0m \u001b[0mThe\u001b[0m \u001b[0mcorresponding\u001b[0m \u001b[0mlabel\u001b[0m \u001b[0mof\u001b[0m \u001b[1;33m`\u001b[0m\u001b[0mX_resampled\u001b[0m\u001b[1;33m`\u001b[0m\u001b[1;33m.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 103\u001b[0m \"\"\"\n\u001b[0;32m 104\u001b[0m \u001b[0mcheck_classification_targets\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 105\u001b[0m \u001b[0marrays_transformer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mArraysTransformer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 106\u001b[1;33m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbinarize_y\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_check_X_y\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 107\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 108\u001b[0m self.sampling_strategy_ = check_sampling_strategy(\n\u001b[0;32m 109\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msampling_strategy\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_sampling_type\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\imblearn\\base.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, X, y, accept_sparse)\u001b[0m\n\u001b[0;32m 157\u001b[0m \u001b[1;32mdef\u001b[0m 
\u001b[0m_check_X_y\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 158\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0maccept_sparse\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 159\u001b[0m \u001b[0maccept_sparse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;34m\"csr\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"csc\"\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 160\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbinarize_y\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcheck_target_type\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mindicate_one_vs_all\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 161\u001b[1;33m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_validate_data\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mreset\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maccept_sparse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 162\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbinarize_y\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\sklearn\\base.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[0;32m 646\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;34m\"estimator\"\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mcheck_y_params\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 647\u001b[0m \u001b[0mcheck_y_params\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m{\u001b[0m\u001b[1;33m**\u001b[0m\u001b[0mdefault_check_params\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 648\u001b[0m \u001b[0my\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0minput_name\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"y\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 649\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 650\u001b[1;33m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcheck_X_y\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 651\u001b[0m \u001b[0mout\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 652\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 653\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m 
\u001b[0mno_val_X\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mcheck_params\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"ensure_2d\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[0m\n\u001b[0;32m 1297\u001b[0m raise ValueError(\n\u001b[0;32m 1298\u001b[0m \u001b[1;33mf\"\u001b[0m\u001b[1;33m{\u001b[0m\u001b[0mestimator_name\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m requires y to be passed, but the target y is None\u001b[0m\u001b[1;33m\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1299\u001b[0m \u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1300\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1301\u001b[1;33m X = check_array(\n\u001b[0m\u001b[0;32m 1302\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1303\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maccept_sparse\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1304\u001b[0m \u001b[0maccept_large_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maccept_large_sparse\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, 
ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[0;32m 1009\u001b[0m \u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1010\u001b[0m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mxp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1011\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1012\u001b[0m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_asarray_with_order\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mxp\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mxp\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1013\u001b[1;33m \u001b[1;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1014\u001b[0m raise ValueError(\n\u001b[0;32m 1015\u001b[0m \u001b[1;34m\"Complex data not supported\\n{}\\n\"\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1016\u001b[0m \u001b[1;33m)\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 
"\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\sklearn\\utils\\_array_api.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(array, dtype, order, copy, xp, device)\u001b[0m\n\u001b[0;32m 741\u001b[0m \u001b[1;31m# Use NumPy API to support order\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 742\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mcopy\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 743\u001b[0m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 744\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 745\u001b[1;33m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 746\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 747\u001b[0m \u001b[1;31m# At this point array is a NumPy ndarray. 
We convert it to an array\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 748\u001b[0m \u001b[1;31m# container that is consistent with the input's namespace.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mc:\\Users\\ateks\\Courses\\Courses\\.venv\\Lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, dtype, copy)\u001b[0m\n\u001b[0;32m 2149\u001b[0m def __array__(\n\u001b[0;32m 2150\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mnpt\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mDTypeLike\u001b[0m \u001b[1;33m|\u001b[0m \u001b[1;32mNone\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mbool_t\u001b[0m \u001b[1;33m|\u001b[0m \u001b[1;32mNone\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2151\u001b[0m \u001b[1;33m)\u001b[0m \u001b[1;33m->\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2152\u001b[0m \u001b[0mvalues\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2153\u001b[1;33m \u001b[0marr\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2154\u001b[0m if (\n\u001b[0;32m 2155\u001b[0m \u001b[0mastype_is_view\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m 
\u001b[0marr\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2156\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0musing_copy_on_write\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mValueError\u001b[0m: could not convert string to float: 'Technology '" ] } ], "source": [ "from imblearn.over_sampling import ADASYN\n", "\n", "ada = ADASYN()\n", "\n", "display(\"Training set: \", df_train.shape)\n", "display(df_train.Industry.value_counts())\n", "\n", "# ADASYN interpolates between samples, so it accepts only numeric features;\n", "# passing the raw frame fails with \"could not convert string to float\".\n", "# Assumption: the numeric columns of df_train contain no missing values.\n", "X_train_num = df_train.select_dtypes(include=\"number\")\n", "X_resampled, y_resampled = ada.fit_resample(X_train_num, df_train[\"Industry\"]) # type: ignore\n", "df_train_adasyn = pd.DataFrame(X_resampled)\n", "df_train_adasyn[\"Industry\"] = y_resampled\n", "\n", "display(\"Training set after oversampling: \", df_train_adasyn.shape)\n", "display(df_train_adasyn.Industry.value_counts())\n", "\n", "df_train_adasyn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "________________________________________________________________________________________________________________________________________" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GOLD PRICE DATASET\n", "\n", "The observations in this dataset are gold prices, represented by a Gold ETF (Exchange-Traded Fund). Each record corresponds to one trading day.\n", "Object attributes\n", "\n", "The attributes of the objects (gold prices) are:\n", "Date: the trading date.\n", "Open: the price at which the fund opened at the start of the trading day.\n", "High: the highest price during the day.\n", "Low: the lowest price during the day.\n", "Close: the price at which the fund closed at the end of the trading day.\n", "Adjusted Close: the closing price adjusted for factors such as dividends and stock splits.\n", "Volume: the number of fund shares traded during the day.\n", "\n", "Relationships between objects can be defined through time series. For example, the gold price on a given day may depend on prices on previous days, as well as on external factors such as the prices of other precious metals, oil prices, economic conditions, and market trends.\n", "\n", "Example business goals\n", "Optimizing investment decisions: Analyzing historical gold price data can help investors make better-informed decisions about buying or selling gold.\n", "Risk management: Understanding the factors that drive gold prices can help companies and investors reduce the risks associated with price fluctuations.\n", "\n", "Business impact\n", "These business goals can lead to higher revenue, attract new investors, and improve the overall financial stability of companies that deal in gold.\n", "\n", "Example technical project goals\n", "For risk management: Building a monitoring system that tracks changes in gold prices and other market factors and provides risk management recommendations.\n", "\n", "Input data: Gold price data, including the date, open, high, low, close, adjusted close, and trading volume.\n", "\n", "Target feature: The target can be the next day's adjusted closing price, which makes it possible to build models that forecast future prices." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "Index: 1718 entries, 2011-12-15 to 2018-12-31\n", "Data columns (total 80 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Open 1718 non-null float64\n", " 1 High 1718 non-null float64\n", " 2 Low 1718 non-null float64\n", " 3 Close 1718 non-null float64\n", " 4 Adj Close 1718 non-null float64\n", " 5 Volume 1718 non-null int64 \n", " 6 SP_open 1718 non-null float64\n", " 7 SP_high 1718 non-null float64\n", " 8 SP_low 1718 non-null float64\n", " 9 SP_close 1718 non-null float64\n", " 10 SP_Ajclose 1718 non-null float64\n", " 11 SP_volume 1718 non-null int64 \n", " 12 DJ_open 1718 non-null float64\n", " 13 DJ_high 1718 non-null float64\n", " 14 DJ_low 1718 non-null float64\n", " 15 DJ_close 1718 non-null float64\n", " 16 DJ_Ajclose 1718 non-null float64\n", " 17 DJ_volume 1718 non-null int64 \n", " 18 EG_open 1718 non-null float64\n", " 19 EG_high 1718 non-null float64\n", " 20 EG_low 1718 non-null float64\n", " 21 EG_close 1718 non-null float64\n", " 22 EG_Ajclose 1718 non-null float64\n", " 23 EG_volume 1718 non-null int64 \n", " 24 EU_Price 1718 non-null float64\n", " 25 EU_open 1718 non-null float64\n", " 26 EU_high 1718 non-null float64\n", " 27 EU_low 1718 non-null float64\n", " 28 EU_Trend 1718 non-null int64 \n", " 29 OF_Price 1718 non-null float64\n", " 30 OF_Open 1718 non-null float64\n", " 31 OF_High 1718 non-null float64\n", " 32 OF_Low 1718 non-null float64\n", " 33 OF_Volume 1718 non-null int64 \n", " 34 OF_Trend 1718 non-null int64 \n", " 35 OS_Price 1718 non-null float64\n", " 36 OS_Open 1718 non-null float64\n", " 37 OS_High 1718 non-null 
float64\n", " 38 OS_Low 1718 non-null float64\n", " 39 OS_Trend 1718 non-null int64 \n", " 40 SF_Price 1718 non-null int64 \n", " 41 SF_Open 1718 non-null int64 \n", " 42 SF_High 1718 non-null int64 \n", " 43 SF_Low 1718 non-null int64 \n", " 44 SF_Volume 1718 non-null int64 \n", " 45 SF_Trend 1718 non-null int64 \n", " 46 USB_Price 1718 non-null float64\n", " 47 USB_Open 1718 non-null float64\n", " 48 USB_High 1718 non-null float64\n", " 49 USB_Low 1718 non-null float64\n", " 50 USB_Trend 1718 non-null int64 \n", " 51 PLT_Price 1718 non-null float64\n", " 52 PLT_Open 1718 non-null float64\n", " 53 PLT_High 1718 non-null float64\n", " 54 PLT_Low 1718 non-null float64\n", " 55 PLT_Trend 1718 non-null int64 \n", " 56 PLD_Price 1718 non-null float64\n", " 57 PLD_Open 1718 non-null float64\n", " 58 PLD_High 1718 non-null float64\n", " 59 PLD_Low 1718 non-null float64\n", " 60 PLD_Trend 1718 non-null int64 \n", " 61 RHO_PRICE 1718 non-null int64 \n", " 62 USDI_Price 1718 non-null float64\n", " 63 USDI_Open 1718 non-null float64\n", " 64 USDI_High 1718 non-null float64\n", " 65 USDI_Low 1718 non-null float64\n", " 66 USDI_Volume 1718 non-null int64 \n", " 67 USDI_Trend 1718 non-null int64 \n", " 68 GDX_Open 1718 non-null float64\n", " 69 GDX_High 1718 non-null float64\n", " 70 GDX_Low 1718 non-null float64\n", " 71 GDX_Close 1718 non-null float64\n", " 72 GDX_Adj Close 1718 non-null float64\n", " 73 GDX_Volume 1718 non-null int64 \n", " 74 USO_Open 1718 non-null float64\n", " 75 USO_High 1718 non-null float64\n", " 76 USO_Low 1718 non-null float64\n", " 77 USO_Close 1718 non-null float64\n", " 78 USO_Adj Close 1718 non-null float64\n", " 79 USO_Volume 1718 non-null int64 \n", "dtypes: float64(58), int64(22)\n", "memory usage: 1.1+ MB\n" ] }, { "data": { "text/plain": [ "(1718, 80)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Adj Close</th><th>Volume</th><th>SP_open</th><th>SP_high</th><th>SP_low</th><th>SP_close</th><th>...</th><th>GDX_Low</th><th>GDX_Close</th><th>GDX_Adj Close</th><th>GDX_Volume</th><th>USO_Open</th><th>USO_High</th><th>USO_Low</th><th>USO_Close</th><th>USO_Adj Close</th><th>USO_Volume</th></tr>\n", " <tr><th>Date</th><th colspan=\"21\"></th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>2011-12-15</th><td>154.740005</td><td>154.949997</td><td>151.710007</td><td>152.330002</td><td>152.330002</td><td>21521900</td><td>123.029999</td><td>123.199997</td><td>121.989998</td><td>122.180000</td><td>...</td><td>51.570000</td><td>51.680000</td><td>48.973877</td><td>20605600</td><td>36.900002</td><td>36.939999</td><td>36.049999</td><td>36.130001</td><td>36.130001</td><td>12616700</td></tr>\n", " <tr><th>2011-12-16</th><td>154.309998</td><td>155.369995</td><td>153.899994</td><td>155.229996</td><td>155.229996</td><td>18124300</td><td>122.230003</td><td>122.949997</td><td>121.300003</td><td>121.589996</td><td>...</td><td>52.040001</td><td>52.680000</td><td>49.921513</td><td>16285400</td><td>36.180000</td><td>36.500000</td><td>35.730000</td><td>36.270000</td><td>36.270000</td><td>12578800</td></tr>\n", " <tr><th>2011-12-19</th><td>155.479996</td><td>155.860001</td><td>154.360001</td><td>154.869995</td><td>154.869995</td><td>12547200</td><td>122.059998</td><td>122.320000</td><td>120.029999</td><td>120.290001</td><td>...</td><td>51.029999</td><td>51.169998</td><td>48.490578</td><td>15120200</td><td>36.389999</td><td>36.450001</td><td>35.930000</td><td>36.200001</td><td>36.200001</td><td>7418200</td></tr>\n", " <tr><th>2011-12-20</th><td>156.820007</td><td>157.429993</td><td>156.580002</td><td>156.979996</td><td>156.979996</td><td>9136300</td><td>122.180000</td><td>124.139999</td><td>120.370003</td><td>123.930000</td><td>...</td><td>52.369999</td><td>52.990002</td><td>50.215282</td><td>11644900</td><td>37.299999</td><td>37.610001</td><td>37.220001</td><td>37.560001</td><td>37.560001</td><td>10041600</td></tr>\n", " <tr><th>2011-12-21</th><td>156.979996</td><td>157.529999</td><td>156.130005</td><td>157.160004</td><td>157.160004</td><td>11996100</td><td>123.930000</td><td>124.360001</td><td>122.750000</td><td>124.169998</td><td>...</td><td>52.419998</td><td>52.959999</td><td>50.186852</td><td>8724300</td><td>37.669998</td><td>38.240002</td><td>37.520000</td><td>38.110001</td><td>38.110001</td><td>10728000</td></tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 80 columns</p>\n", "</div>
" ], "text/plain": [ " Open High Low Close Adj Close \\\n", "Date \n", "2011-12-15 154.740005 154.949997 151.710007 152.330002 152.330002 \n", "2011-12-16 154.309998 155.369995 153.899994 155.229996 155.229996 \n", "2011-12-19 155.479996 155.860001 154.360001 154.869995 154.869995 \n", "2011-12-20 156.820007 157.429993 156.580002 156.979996 156.979996 \n", "2011-12-21 156.979996 157.529999 156.130005 157.160004 157.160004 \n", "\n", " Volume SP_open SP_high SP_low SP_close ... \\\n", "Date ... \n", "2011-12-15 21521900 123.029999 123.199997 121.989998 122.180000 ... \n", "2011-12-16 18124300 122.230003 122.949997 121.300003 121.589996 ... \n", "2011-12-19 12547200 122.059998 122.320000 120.029999 120.290001 ... \n", "2011-12-20 9136300 122.180000 124.139999 120.370003 123.930000 ... \n", "2011-12-21 11996100 123.930000 124.360001 122.750000 124.169998 ... \n", "\n", " GDX_Low GDX_Close GDX_Adj Close GDX_Volume USO_Open \\\n", "Date \n", "2011-12-15 51.570000 51.680000 48.973877 20605600 36.900002 \n", "2011-12-16 52.040001 52.680000 49.921513 16285400 36.180000 \n", "2011-12-19 51.029999 51.169998 48.490578 15120200 36.389999 \n", "2011-12-20 52.369999 52.990002 50.215282 11644900 37.299999 \n", "2011-12-21 52.419998 52.959999 50.186852 8724300 37.669998 \n", "\n", " USO_High USO_Low USO_Close USO_Adj Close USO_Volume \n", "Date \n", "2011-12-15 36.939999 36.049999 36.130001 36.130001 12616700 \n", "2011-12-16 36.500000 35.730000 36.270000 36.270000 12578800 \n", "2011-12-19 36.450001 35.930000 36.200001 36.200001 7418200 \n", "2011-12-20 37.610001 37.220001 37.560001 37.560001 10041600 \n", "2011-12-21 38.240002 37.520000 38.110001 38.110001 10728000 \n", "\n", "[5 rows x 80 columns]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfGold = pd.read_csv(\"data/gold.csv\", index_col=\"Date\")\n", "\n", "dfGold.info()\n", "\n", "display(dfGold.shape)\n", "\n", "dfGold.head()" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "Missing values" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Open 0\n", "High 0\n", "Low 0\n", "Close 0\n", "Adj Close 0\n", " ..\n", "USO_High 0\n", "USO_Low 0\n", "USO_Close 0\n", "USO_Adj Close 0\n", "USO_Volume 0\n", "Length: 80, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Open False\n", "High False\n", "Low False\n", "Close False\n", "Adj Close False\n", " ... \n", "USO_High False\n", "USO_Low False\n", "USO_Close False\n", "USO_Adj Close False\n", "USO_Volume False\n", "Length: 80, dtype: bool" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Number of missing values per feature\n", "display(dfGold.isnull().sum())\n", "\n", "# Whether each feature has any missing values\n", "display(dfGold.isnull().any())\n", "\n", "# Percentage of missing values per feature (shown only when non-zero)\n", "for i in dfGold.columns:\n", "    null_rate = dfGold[i].isnull().sum() / len(dfGold) * 100\n", "    if null_rate > 0:\n", "        display(f\"{i} missing values: {null_rate:.2f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset has no missing values either, so no imputation is needed." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating the data splits" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "USB_Trend\n", "0 876\n", "1 842\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Training set: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(1030, 4)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "USB_Trend\n", "0 525\n", "1 505\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Validation set: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(344, 4)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "USB_Trend\n", "0 176\n", "1 168\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Test set: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(344, 4)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "USB_Trend\n", "0 175\n", "1 169\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Show how the observations are distributed across the class labels\n", "from src.utils import split_stratified_into_train_val_test\n", "\n", "display((dfGold.USB_Trend).value_counts())\n", "\n", "selected_columns = [\"Open\", \"High\", \"Low\", \"USB_Trend\"]\n", "dfGold[\"USB_Trend\"] = round(dfGold[\"USB_Trend\"])\n", "data = dfGold[selected_columns].copy()\n", "\n", "# Create the train, validation, and test splits (60/20/20), stratified by USB_Trend\n", "dfGold_train, dfGold_val, dfGold_test, y_train, y_val, y_test = (\n", "    split_stratified_into_train_val_test(\n", "        data,\n", "        stratify_colname=\"USB_Trend\",\n", "        frac_train=0.60,\n", "        frac_val=0.20,\n", "        frac_test=0.20,\n", "    )\n", ")\n", "\n", "# Show the size and class balance of each split\n", "display(\"Training set: \", dfGold_train.shape)\n", "display(round(dfGold_train.USB_Trend).value_counts())\n", "\n", "display(\"Validation set: \", dfGold_val.shape)\n", "display(round(dfGold_val.USB_Trend).value_counts())\n", "\n", "display(\"Test set: \", dfGold_test.shape)\n", "display(round(dfGold_test.USB_Trend).value_counts())" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Training set: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(1030, 4)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "USB_Trend\n", "0 525\n", "1 505\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Training set after undersampling: '" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(1010, 4)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "USB_Trend\n", "0 505\n", "1 505\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>Open</th><th>High</th><th>Low</th><th>USB_Trend</th></tr>\n", " <tr><th>Date</th><th colspan=\"4\"></th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>2016-04-27</th><td>118.970001</td><td>119.699997</td><td>118.430000</td><td>0</td></tr>\n", " <tr><th>2017-02-16</th><td>117.930000</td><td>118.349998</td><td>117.830002</td><td>0</td></tr>\n", " <tr><th>2016-11-15</th><td>116.459999</td><td>117.239998</td><td>116.290001</td><td>0</td></tr>\n", " <tr><th>2016-11-07</th><td>122.660004</td><td>122.709999</td><td>121.879997</td><td>0</td></tr>\n", " <tr><th>2018-04-30</th><td>124.410004</td><td>125.199997</td><td>124.190002</td><td>0</td></tr>\n", " <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", " <tr><th>2012-07-13</th><td>153.449997</td><td>154.940002</td><td>153.440002</td><td>1</td></tr>\n", " <tr><th>2016-05-25</th><td>116.589996</td><td>117.059998</td><td>116.320000</td><td>1</td></tr>\n", " <tr><th>2016-03-02</th><td>118.339996</td><td>118.970001</td><td>118.070000</td><td>1</td></tr>\n", " <tr><th>2013-08-05</th><td>126.510002</td><td>126.639999</td><td>125.339996</td><td>1</td></tr>\n", " <tr><th>2013-06-20</th><td>125.220001</td><td>126.379997</td><td>123.330002</td><td>1</td></tr>\n", " </tbody>\n", "</table>\n", "<p>1010 rows × 4 columns</p>\n", "</div>" ], "text/plain": [ " Open High Low USB_Trend\n", "Date \n", "2016-04-27 118.970001 119.699997 118.430000 0\n", "2017-02-16 117.930000 118.349998 117.830002 0\n", "2016-11-15 116.459999 117.239998 116.290001 0\n", "2016-11-07 122.660004 122.709999 121.879997 0\n", "2018-04-30 124.410004 125.199997 124.190002 0\n", "... ... ... ... ...\n", "2012-07-13 153.449997 154.940002 153.440002 1\n", "2016-05-25 116.589996 117.059998 116.320000 1\n", "2016-03-02 118.339996 118.970001 118.070000 1\n", "2013-08-05 126.510002 126.639999 125.339996 1\n", "2013-06-20 125.220001 126.379997 123.330002 1\n", "\n", "[1010 rows x 4 columns]" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from imblearn.under_sampling import RandomUnderSampler\n", "\n", "# Create a RandomUnderSampler instance\n", "rus = RandomUnderSampler(\n", "    sampling_strategy=\"auto\"\n", ") # 'auto' resamples all classes except the minority class\n", "\n", "display(\"Training set: \", dfGold_train.shape)\n", "display(dfGold_train.USB_Trend.value_counts())\n", "\n", "\n", "# Separate the features from the target variable\n", "X = dfGold_train.drop(columns=[\"USB_Trend\"])\n", "y = dfGold_train[\"USB_Trend\"]\n", "\n", "# Apply undersampling\n", "X_resampled, y_resampled = rus.fit_resample(X, y)\n", "\n", "# Build a new DataFrame from the resampled features and target\n", "dfGold_train_undersampled = pd.DataFrame(X_resampled)\n", "dfGold_train_undersampled[\"USB_Trend\"] = y_resampled\n", "\n", "display(\"Training set after undersampling: \", dfGold_train_undersampled.shape)\n", "display(dfGold_train_undersampled.USB_Trend.value_counts())\n", "\n", "dfGold_train_undersampled" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MARKETING CAMPAIGN DATASET" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": 
"python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 2 }