AIM-PIbd-32-Isaeva-A-I/lab_3/Lab3.ipynb
2024-12-20 23:47:13 +04:00

164 KiB
Raw Blame History

Лабораторная 3

Вариант 7. Экономика стран

Бизнес-цели:

  1. прогнозирование уровня инфляции на основе данных за года
  2. определение факторов, значительно влияющих на показателль ВВП на душу населения

Технические цели:

  1. Разработать МО для прогнозирования уровня инфляции на основе исторических данных
  2. Проанализировать взаимосвязь между экономическими показателями и ВВП
In [74]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(".//csv//EconomicData.csv")
print(df.columns)
Index(['stock index', 'country', 'year', 'index price', 'log_indexprice',
       'inflationrate', 'oil prices', 'exchange_rate', 'gdppercent',
       'percapitaincome', 'unemploymentrate', 'manufacturingoutput',
       'tradebalance', 'USTreasury'],
      dtype='object')

Подготовка данных:

In [75]:
print(df.isnull().sum())
stock index             0
country                 0
year                    0
index price            52
log_indexprice          0
inflationrate          43
oil prices              0
exchange_rate           2
gdppercent             19
percapitaincome         1
unemploymentrate       21
manufacturingoutput    91
tradebalance            4
USTreasury              0
dtype: int64

Заполним пустые значения медианами:

In [76]:
for column in df.columns:
    if (column != "stock index" and column != "country"):
        df[column].fillna(df[column].median())
In [77]:
plt.figure(figsize=(12, 8))
ax = sns.scatterplot(x='exchange_rate', y='oil prices', hue='inflationrate', data=df)
plt.title('Уровень инфляции')
plt.xlabel('Валютный курс')
plt.ylabel('Цены на нефть')
plt.legend(title='inflationrate')
plt.show()
No description has been provided for this image
In [78]:
Q1 = df['oil prices'].quantile(0.25)
Q3 = df['oil prices'].quantile(0.75)
IQR = Q3 - Q1

threshold = 1.5 * IQR
outliers = (df['oil prices'] < (Q1 - threshold)) | (df['oil prices'] > (Q3 + threshold))

median_rating = df['oil prices'].median()
df.loc[outliers, 'oil prices'] = median_rating

plt.figure(figsize=(12, 8))
ax = sns.scatterplot(x='exchange_rate', y='gdppercent', hue='inflationrate', data=df)
plt.title('Уровень инфляции')
plt.xlabel('Валютный курс')
plt.ylabel('Цены на нефть')
plt.legend(title='inflationrate')
plt.show()
No description has been provided for this image

Разбиение данных на выборки и оценка сбалансированности выборки

In [79]:
from sklearn.model_selection import train_test_split

# обучающая и тестовая
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# обучающая на обучающую и контрольную
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)

print("Размер обучающей выборки:", len(train_df))
print("Размер контрольной выборки:", len(val_df))
print("Размер тестовой выборки:", len(test_df))
Размер обучающей выборки: 221
Размер контрольной выборки: 74
Размер тестовой выборки: 74

Конструирование признаков

  1. Кодирование категориальных признаков
In [80]:
from sklearn.preprocessing import OneHotEncoder

df = pd.get_dummies(df, columns=['country'])
print(df.head)
<bound method NDFrame.head of     stock index    year  index price  log_indexprice  inflationrate  \
0        NASDAQ  1980.0       168.61            2.23           0.14   
1        NASDAQ  1981.0       203.15            2.31           0.10   
2        NASDAQ  1982.0       188.98            2.28           0.06   
3        NASDAQ  1983.0       285.43            2.46           0.03   
4        NASDAQ  1984.0       248.89            2.40           0.04   
..          ...     ...          ...             ...            ...   
364      IEX 35  2016.0      9352.10            3.97            NaN   
365      IEX 35  2017.0     10043.90            4.00           0.02   
366      IEX 35  2018.0      8539.90            3.93           0.02   
367      IEX 35  2019.0      9549.20            3.98           0.01   
368      IEX 35  2020.0      8073.70            3.91            NaN   

     oil prices  exchange_rate  gdppercent  percapitaincome  unemploymentrate  \
0         21.59           1.00        0.09          12575.0              0.07   
1         31.77           1.00        0.12          13976.0              0.08   
2         28.52           1.00        0.04          14434.0              0.10   
3         26.19           1.00        0.09          15544.0              0.10   
4         25.88           1.00        0.11          17121.0              0.08   
..          ...            ...         ...              ...               ...   
364       51.97           1.11        0.03          26523.0              0.20   
365       57.88           1.13        0.03          28170.0              0.17   
366       49.52           1.18        0.02          30389.0              0.15   
367       59.88           1.12        0.02          29565.0              0.14   
368       47.02           1.14       -0.11          27057.0              0.16   

     ...  USTreasury  country_China  country_France  country_Germany  \
0    ...        0.11          False           False            False   
1    ...        0.14          False           False            False   
2    ...        0.13          False           False            False   
3    ...        0.11          False           False            False   
4    ...        0.12          False           False            False   
..   ...         ...            ...             ...              ...   
364  ...        0.02          False           False            False   
365  ...        0.02          False           False            False   
366  ...        0.03          False           False            False   
367  ...        0.02          False           False            False   
368  ...        0.01          False           False            False   

     country_Hong Kong  country_India  country_Japan  country_Spain  \
0                False          False          False          False   
1                False          False          False          False   
2                False          False          False          False   
3                False          False          False          False   
4                False          False          False          False   
..                 ...            ...            ...            ...   
364              False          False          False           True   
365              False          False          False           True   
366              False          False          False           True   
367              False          False          False           True   
368              False          False          False           True   

     country_United Kingdom  country_United States of America  
0                     False                              True  
1                     False                              True  
2                     False                              True  
3                     False                              True  
4                     False                              True  
..                      ...                               ...  
364                   False                             False  
365                   False                             False  
366                   False                             False  
367                   False                             False  
368                   False                             False  

[369 rows x 22 columns]>
  1. Дискретизация числовых признаков
In [88]:
print(f"min = {df['percapitaincome'].min()}")
print(f"max = {df['percapitaincome'].max()}")
print(df['percapitaincome'].max()/6)
min = 27.0
max = 65280.0
10880.0
In [92]:
from sklearn.preprocessing import KBinsDiscretizer

bins = [0, 11000, 22000, 33000, 44000, float('inf')]
labels = ['незначительный', 'низкий', 'средний', 'высокий', 'очень высокий']

df['percapitaincome_level'] = pd.cut(df['percapitaincome'], bins=bins, labels=labels)
print(df['percapitaincome_level'].head)
<bound method NDFrame.head of 0       низкий
1       низкий
2       низкий
3       низкий
4       низкий
        ...   
364    средний
365    средний
366    средний
367    средний
368    средний
Name: percapitaincome_level, Length: 369, dtype: category
Categories (5, object): ['незначительный' < 'низкий' < 'средний' < 'высокий' < 'очень высокий']>
  1. Ручной синтез признаков
In [95]:
# pip install featuretools
import featuretools as ft

es = ft.EntitySet(id='economy_data')
es.add_dataframe(
    dataframe=df,
    dataframe_name='economy',
    index='index',
    make_index=True
)

# Автоматическое конструирование
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='economy',
    agg_primitives=['mean', 'sum', 'max', 'min', 'std'],
    trans_primitives=['add_numeric', 'multiply_numeric'],
    max_depth=2 
)

print("Сгенерированные признаки:")
print(feature_matrix.head())
print("\nОписание:")
print(feature_defs)
e:\AIM1.5\Scripts\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
e:\AIM1.5\Scripts\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
e:\AIM1.5\Scripts\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
e:\AIM1.5\Scripts\Lib\site-packages\featuretools\synthesis\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created
  warnings.warn(
e:\AIM1.5\Scripts\Lib\site-packages\featuretools\synthesis\dfs.py:321: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  agg_primitives: ['max', 'mean', 'min', 'std', 'sum']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)
Сгенерированные признаки:
      stock index    year  index price  log_indexprice  inflationrate  \
index                                                                   
0          NASDAQ  1980.0       168.61            2.23           0.14   
1          NASDAQ  1981.0       203.15            2.31           0.10   
2          NASDAQ  1982.0       188.98            2.28           0.06   
3          NASDAQ  1983.0       285.43            2.46           0.03   
4          NASDAQ  1984.0       248.89            2.40           0.04   

       oil prices  exchange_rate  gdppercent  percapitaincome  \
index                                                           
0           21.59            1.0        0.09            12575   
1           31.77            1.0        0.12            13976   
2           28.52            1.0        0.04            14434   
3           26.19            1.0        0.09            15544   
4           25.88            1.0        0.11            17121   

       unemploymentrate  ...  oil prices * year  percapitaincome * USTreasury  \
index                    ...                                                    
0                  0.07  ...           42748.20                       1383.25   
1                  0.08  ...           62936.37                       1956.64   
2                  0.10  ...           56526.64                       1876.42   
3                  0.10  ...           51934.77                       1709.84   
4                  0.08  ...           51345.92                       2054.52   

       percapitaincome * tradebalance  percapitaincome * unemploymentrate  \
index                                                                       
0                          -164229.50                              880.25   
1                          -174979.52                             1118.08   
2                          -288246.98                             1443.40   
3                          -802692.16                             1554.40   
4                         -1758840.33                             1369.68   

       percapitaincome * year  tradebalance * USTreasury  \
index                                                      
0                  24898500.0                    -1.4366   
1                  27686456.0                    -1.7528   
2                  28608188.0                    -2.5961   
3                  30823752.0                    -5.6804   
4                  33968064.0                   -12.3276   

       tradebalance * unemploymentrate  tradebalance * year  \
index                                                         
0                              -0.9142            -25858.80   
1                              -1.0016            -24802.12   
2                              -1.9970            -39580.54   
3                              -5.1640           -102402.12   
4                              -8.2184           -203816.32   

       unemploymentrate * USTreasury  unemploymentrate * year  
index                                                          
0                             0.0077                   138.60  
1                             0.0112                   158.48  
2                             0.0130                   198.20  
3                             0.0110                   198.30  
4                             0.0096                   158.72  

[5 rows x 207 columns]

Описание:
[<Feature: stock index>, <Feature: year>, <Feature: index price>, <Feature: log_indexprice>, <Feature: inflationrate>, <Feature: oil prices>, <Feature: exchange_rate>, <Feature: gdppercent>, <Feature: percapitaincome>, <Feature: unemploymentrate>, <Feature: manufacturingoutput>, <Feature: tradebalance>, <Feature: USTreasury>, <Feature: country_China>, <Feature: country_France>, <Feature: country_Germany>, <Feature: country_Hong Kong>, <Feature: country_India>, <Feature: country_Japan>, <Feature: country_Spain>, <Feature: country_United Kingdom>, <Feature: country_United States of America>, <Feature: percapitaincome_level>, <Feature: index price_scaled>, <Feature: log_indexprice_scaled>, <Feature: USTreasury + year>, <Feature: exchange_rate + USTreasury>, <Feature: exchange_rate + gdppercent>, <Feature: exchange_rate + index price>, <Feature: exchange_rate + index price_scaled>, <Feature: exchange_rate + inflationrate>, <Feature: exchange_rate + log_indexprice>, <Feature: exchange_rate + log_indexprice_scaled>, <Feature: exchange_rate + manufacturingoutput>, <Feature: exchange_rate + oil prices>, <Feature: exchange_rate + percapitaincome>, <Feature: exchange_rate + tradebalance>, <Feature: exchange_rate + unemploymentrate>, <Feature: exchange_rate + year>, <Feature: gdppercent + USTreasury>, <Feature: gdppercent + index price>, <Feature: gdppercent + index price_scaled>, <Feature: gdppercent + inflationrate>, <Feature: gdppercent + log_indexprice>, <Feature: gdppercent + log_indexprice_scaled>, <Feature: gdppercent + manufacturingoutput>, <Feature: gdppercent + oil prices>, <Feature: gdppercent + percapitaincome>, <Feature: gdppercent + tradebalance>, <Feature: gdppercent + unemploymentrate>, <Feature: gdppercent + year>, <Feature: index price + USTreasury>, <Feature: index price + index price_scaled>, <Feature: index price + inflationrate>, <Feature: index price + log_indexprice>, <Feature: index price + log_indexprice_scaled>, <Feature: index price + manufacturingoutput>, <Feature: index price + oil prices>, <Feature: index price + percapitaincome>, <Feature: index price + tradebalance>, <Feature: index price + unemploymentrate>, <Feature: index price + year>, <Feature: index price_scaled + USTreasury>, <Feature: index price_scaled + inflationrate>, <Feature: index price_scaled + log_indexprice>, <Feature: index price_scaled + log_indexprice_scaled>, <Feature: index price_scaled + manufacturingoutput>, <Feature: index price_scaled + oil prices>, <Feature: index price_scaled + percapitaincome>, <Feature: index price_scaled + tradebalance>, <Feature: index price_scaled + unemploymentrate>, <Feature: index price_scaled + year>, <Feature: inflationrate + USTreasury>, <Feature: inflationrate + log_indexprice>, <Feature: inflationrate + log_indexprice_scaled>, <Feature: inflationrate + manufacturingoutput>, <Feature: inflationrate + oil prices>, <Feature: inflationrate + percapitaincome>, <Feature: inflationrate + tradebalance>, <Feature: inflationrate + unemploymentrate>, <Feature: inflationrate + year>, <Feature: log_indexprice + USTreasury>, <Feature: log_indexprice + log_indexprice_scaled>, <Feature: log_indexprice + manufacturingoutput>, <Feature: log_indexprice + oil prices>, <Feature: log_indexprice + percapitaincome>, <Feature: log_indexprice + tradebalance>, <Feature: log_indexprice + unemploymentrate>, <Feature: log_indexprice + year>, <Feature: log_indexprice_scaled + USTreasury>, <Feature: log_indexprice_scaled + manufacturingoutput>, <Feature: log_indexprice_scaled + oil prices>, <Feature: log_indexprice_scaled + percapitaincome>, <Feature: log_indexprice_scaled + tradebalance>, <Feature: log_indexprice_scaled + unemploymentrate>, <Feature: log_indexprice_scaled + year>, <Feature: manufacturingoutput + USTreasury>, <Feature: manufacturingoutput + oil prices>, <Feature: manufacturingoutput + percapitaincome>, <Feature: manufacturingoutput + tradebalance>, <Feature: manufacturingoutput + unemploymentrate>, <Feature: manufacturingoutput + year>, <Feature: oil prices + USTreasury>, <Feature: oil prices + percapitaincome>, <Feature: oil prices + tradebalance>, <Feature: oil prices + unemploymentrate>, <Feature: oil prices + year>, <Feature: percapitaincome + USTreasury>, <Feature: percapitaincome + tradebalance>, <Feature: percapitaincome + unemploymentrate>, <Feature: percapitaincome + year>, <Feature: tradebalance + USTreasury>, <Feature: tradebalance + unemploymentrate>, <Feature: tradebalance + year>, <Feature: unemploymentrate + USTreasury>, <Feature: unemploymentrate + year>, <Feature: USTreasury * year>, <Feature: exchange_rate * USTreasury>, <Feature: exchange_rate * gdppercent>, <Feature: exchange_rate * index price>, <Feature: exchange_rate * index price_scaled>, <Feature: exchange_rate * inflationrate>, <Feature: exchange_rate * log_indexprice>, <Feature: exchange_rate * log_indexprice_scaled>, <Feature: exchange_rate * manufacturingoutput>, <Feature: exchange_rate * oil prices>, <Feature: exchange_rate * percapitaincome>, <Feature: exchange_rate * tradebalance>, <Feature: exchange_rate * unemploymentrate>, <Feature: exchange_rate * year>, <Feature: gdppercent * USTreasury>, <Feature: gdppercent * index price>, <Feature: gdppercent * index price_scaled>, <Feature: gdppercent * inflationrate>, <Feature: gdppercent * log_indexprice>, <Feature: gdppercent * log_indexprice_scaled>, <Feature: gdppercent * manufacturingoutput>, <Feature: gdppercent * oil prices>, <Feature: gdppercent * percapitaincome>, <Feature: gdppercent * tradebalance>, <Feature: gdppercent * unemploymentrate>, <Feature: gdppercent * year>, <Feature: index price * USTreasury>, <Feature: index price * index price_scaled>, <Feature: index price * inflationrate>, <Feature: index price * log_indexprice>, <Feature: index price * log_indexprice_scaled>, <Feature: index price * manufacturingoutput>, <Feature: index price * oil prices>, <Feature: index price * percapitaincome>, <Feature: index price * tradebalance>, <Feature: index price * unemploymentrate>, <Feature: index price * year>, <Feature: index price_scaled * USTreasury>, <Feature: index price_scaled * inflationrate>, <Feature: index price_scaled * log_indexprice>, <Feature: index price_scaled * log_indexprice_scaled>, <Feature: index price_scaled * manufacturingoutput>, <Feature: index price_scaled * oil prices>, <Feature: index price_scaled * percapitaincome>, <Feature: index price_scaled * tradebalance>, <Feature: index price_scaled * unemploymentrate>, <Feature: index price_scaled * year>, <Feature: inflationrate * USTreasury>, <Feature: inflationrate * log_indexprice>, <Feature: inflationrate * log_indexprice_scaled>, <Feature: inflationrate * manufacturingoutput>, <Feature: inflationrate * oil prices>, <Feature: inflationrate * percapitaincome>, <Feature: inflationrate * tradebalance>, <Feature: inflationrate * unemploymentrate>, <Feature: inflationrate * year>, <Feature: log_indexprice * USTreasury>, <Feature: log_indexprice * log_indexprice_scaled>, <Feature: log_indexprice * manufacturingoutput>, <Feature: log_indexprice * oil prices>, <Feature: log_indexprice * percapitaincome>, <Feature: log_indexprice * tradebalance>, <Feature: log_indexprice * unemploymentrate>, <Feature: log_indexprice * year>, <Feature: log_indexprice_scaled * USTreasury>, <Feature: log_indexprice_scaled * manufacturingoutput>, <Feature: log_indexprice_scaled * oil prices>, <Feature: log_indexprice_scaled * percapitaincome>, <Feature: log_indexprice_scaled * tradebalance>, <Feature: log_indexprice_scaled * unemploymentrate>, <Feature: log_indexprice_scaled * year>, <Feature: manufacturingoutput * USTreasury>, <Feature: manufacturingoutput * oil prices>, <Feature: manufacturingoutput * percapitaincome>, <Feature: manufacturingoutput * tradebalance>, <Feature: manufacturingoutput * unemploymentrate>, <Feature: manufacturingoutput * year>, <Feature: oil prices * USTreasury>, <Feature: oil prices * percapitaincome>, <Feature: oil prices * tradebalance>, <Feature: oil prices * unemploymentrate>, <Feature: oil prices * year>, <Feature: percapitaincome * USTreasury>, <Feature: percapitaincome * tradebalance>, <Feature: percapitaincome * unemploymentrate>, <Feature: percapitaincome * year>, <Feature: tradebalance * USTreasury>, <Feature: tradebalance * unemploymentrate>, <Feature: tradebalance * year>, <Feature: unemploymentrate * USTreasury>, <Feature: unemploymentrate * year>]
  1. Масштабирование
In [94]:
from sklearn.preprocessing import MinMaxScaler

scaler_minmax = MinMaxScaler()
df[['index price_scaled', 'log_indexprice_scaled']] = scaler_minmax.fit_transform(df[['index price', 'log_indexprice']])

Оценка качества наборов признаков: Набор данных достаточно полный, но требует предварительной обработки (заполнение пропусков, удаление выбросов, нормализация). После обработки он может быть использован для анализа и построения моделей.