AIM-PIbd-31-LOBASHOV-I-D/lab_3/lab_3.ipynb
2024-11-15 21:58:38 +04:00


Variant 19: Data on millionaires

  • Define the business goals and the technical project goals
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("C:/Users/goldfest/Desktop/3 курс/MII/AIM-PIbd-31-LOBASHOV-I-D/static/csv/Forbes Billionaires.csv")
print(df.columns)
Index(['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry'], dtype='object')
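The column index printed above contains 'Rank ' with a trailing space, which is easy to mistype in later cells. A minimal sketch of cleaning the header and a text column, using a small stand-in frame (`toy` and its values are hypothetical, not the real file):

```python
import pandas as pd

# Hypothetical stand-in for the loaded frame; only the stray spaces matter here.
toy = pd.DataFrame({
    "Rank ": [1, 2],
    "Name": ["A", "B"],
    "Industry": ["Automotive ", "Technology "],
})

# Strip leading/trailing whitespace from the column names.
toy.columns = toy.columns.str.strip()

# Strip leading/trailing whitespace from the values of a text column.
toy["Industry"] = toy["Industry"].str.strip()

print(toy.columns.tolist())
```

Applied to `df` right after loading, the same two calls would let later cells refer to `Rank` without the trailing space.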

Business goals:

  1. Predict potential millionaires based on data analysis.
  2. Assess the factors that influence reaching millionaire status.

Technical project goals:

  1. Build a machine learning classification model that predicts the probability of reaching millionaire status from the available data on millionaires' characteristics.
  2. Analyze the data to identify the key factors that influence reaching millionaire status.
In [3]:
df.head()
Out[3]:
Rank Name Networth Age Country Source Industry
0 1 Elon Musk 219.0 50 United States Tesla, SpaceX Automotive
1 2 Jeff Bezos 171.0 58 United States Amazon Technology
2 3 Bernard Arnault & family 158.0 73 France LVMH Fashion & Retail
3 4 Bill Gates 129.0 66 United States Microsoft Technology
4 5 Warren Buffett 118.0 91 United States Berkshire Hathaway Finance & Investments
In [4]:
# Percentage of missing values per feature
for i in df.columns:
    null_rate = df[i].isnull().sum() / len(df) * 100
    if null_rate > 0:
        print(f'{i} percentage of missing values: {null_rate:.2f}%')

# Check for missing data
print(df.isnull().sum())

df.isnull().any()
Rank        0
Name        0
Networth    0
Age         0
Country     0
Source      0
Industry    0
dtype: int64
Out[4]:
Rank        False
Name        False
Networth    False
Age         False
Country     False
Source      False
Industry    False
dtype: bool

None of the columns has missing values, so no imputation is needed.
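The same check can be written without an explicit loop. A small sketch on a hypothetical frame with a few gaps (`toy` is made up so that the result is not empty):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in Age and one in Country.
toy = pd.DataFrame({
    "Age": [50, np.nan, 73, 66],
    "Country": ["US", "US", None, "FR"],
    "Networth": [219.0, 171.0, 158.0, 129.0],
})

# Share of missing values per column, in percent.
null_pct = toy.isnull().mean() * 100

# Keep only the columns that actually contain gaps.
cols_with_gaps = null_pct[null_pct > 0].round(2)
print(cols_with_gaps)
```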

In [7]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets (80% training, 20% test)
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Split the data into training and validation sets (80% training, 20% validation)
# Note: this call splits df again with the same random_state, so val_data
# contains the same rows as test_data.
train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)

print("Training set size: ", len(train_data))
print("Validation set size: ", len(val_data))
print("Test set size: ", len(test_data))
Training set size:  2080
Validation set size:  520
Test set size:  520
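Both calls in the cell above split `df` with the same `random_state`, so the validation and test sets hold the same 520 rows. One way to get three disjoint parts is to hold out the test set first and then split the remainder. The sketch below assumes a 60/20/20 proportion and uses a hypothetical `toy` frame in place of `df`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df with 100 rows.
toy = pd.DataFrame({"Networth": range(100), "Age": range(100, 200)})

# Step 1: hold out 20% of all rows as the test set.
train_val, test = train_test_split(toy, test_size=0.2, random_state=42)

# Step 2: take 25% of the remaining 80% as validation (20% of the total).
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```

Because the second split is taken from `train_val` rather than from the full frame, no row can appear in more than one part.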
In [9]:
# Assess the balance of the target variable (Networth)
# Visualize the target distribution in each split (histogram)
import seaborn as sns
import matplotlib.pyplot as plt

def plot_networth_distribution(data, title):
    sns.histplot(data['Networth'], kde=True)
    plt.title(title)
    plt.xlabel('Networth')
    plt.ylabel('Frequency')
    plt.show()

plot_networth_distribution(train_data, 'Networth distribution in the training set')
plot_networth_distribution(val_data, 'Networth distribution in the validation set')
plot_networth_distribution(test_data, 'Networth distribution in the test set')

# Compare the mean of the target variable (Networth) across the splits
print("Mean Networth in the training set: ", train_data['Networth'].mean())
print("Mean Networth in the validation set: ", val_data['Networth'].mean())
print("Mean Networth in the test set: ", test_data['Networth'].mean())
[Figure: Networth distribution in the training set]
[Figure: Networth distribution in the validation set]
[Figure: Networth distribution in the test set]
Mean Networth in the training set:  5.05858173076923
Mean Networth in the validation set:  4.069423076923076
Mean Networth in the test set:  4.069423076923076
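The mean is sensitive to a handful of very large fortunes, so a gap between the split means does not by itself show that the splits are unbalanced. A sketch that compares quantiles as well, on synthetic heavy-tailed data (the log-normal values and the names `part_a` and `part_b` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic heavy-tailed values standing in for Networth.
values = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=1000))
part_a, part_b = values.iloc[:800], values.iloc[800:]

# describe() returns count, mean, std, min, the requested quantiles and max.
summary = pd.DataFrame({
    "part_a": part_a.describe(percentiles=[0.25, 0.5, 0.75]),
    "part_b": part_b.describe(percentiles=[0.25, 0.5, 0.75]),
})

print(summary.loc[["mean", "25%", "50%", "75%", "max"]])
```

If the quartiles agree while the means differ, the difference most likely comes from a few extreme values rather than from the bulk of the distribution.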
In [14]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Visualize the Networth distribution in the training set
sns.histplot(train_data['Networth'], kde=True)
plt.title('Networth distribution in the training set')
plt.xlabel('Networth')
plt.ylabel('Frequency')
plt.show()

# Standardize the data (StandardScaler rescales to zero mean and unit variance)
# Work on a copy to avoid pandas' SettingWithCopyWarning when adding a column
train_data = train_data.copy()
scaler = StandardScaler()
train_data['Networth_scaled'] = scaler.fit_transform(train_data[['Networth']])

# Visualize the Networth distribution after standardization
sns.histplot(train_data['Networth_scaled'], kde=True)
plt.title('Networth distribution after standardization')
plt.xlabel('Networth (standardized)')
plt.ylabel('Frequency')
plt.show()

# Print the training set size after standardization
print("Training set size after standardization: ", len(train_data))
[Figure: Networth distribution in the training set]
[Figure: Networth distribution after standardization]
Training set size after standardization:  2080
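The cell above scales only the training set. If the validation and test sets are scaled too, the usual practice is to fit the scaler on the training data and reuse its statistics for the other splits. A minimal sketch with hypothetical values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training and validation values.
train = pd.DataFrame({"Networth": [1.0, 2.0, 3.0, 4.0, 10.0]})
val = pd.DataFrame({"Networth": [2.0, 6.0]})

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training set only.
train["Networth_scaled"] = scaler.fit_transform(train[["Networth"]]).ravel()

# transform reuses the training statistics; nothing is learned from val.
val["Networth_scaled"] = scaler.transform(val[["Networth"]]).ravel()

print(scaler.mean_[0], scaler.scale_[0])
print(val)
```

Fitting on the training set alone keeps information from the held-out rows out of the preprocessing step.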

Feature engineering

Now let us construct features for each task.

Feature engineering process
Task 1: Predict the probability of reaching millionaire status. Technical project goal: develop a machine learning model that accurately predicts this probability.
Task 2: Assess the factors that influence reaching millionaire status. Technical project goal: develop a machine learning model that identifies the key factors.

One-hot encoding
One-hot encoding converts each categorical feature into a set of binary indicator columns, one per category.
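When each split is encoded separately, a category that is missing from one split produces a different set of columns. A sketch of one way to keep the columns consistent: encode each split, then align the other splits to the training columns. The frames and country values are hypothetical:

```python
import pandas as pd

# Hypothetical splits: "Japan" appears only in val, "France" only in train.
train = pd.DataFrame({"Country": ["France", "India", "France"], "Age": [50, 60, 70]})
val = pd.DataFrame({"Country": ["India", "Japan"], "Age": [40, 45]})

train_enc = pd.get_dummies(train, columns=["Country"])
val_enc = pd.get_dummies(val, columns=["Country"])

# Align val to the training columns: columns missing in val are filled with 0,
# and columns unseen in train (Country_Japan) are dropped.
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(val_enc)
```

After the `reindex` call both frames have identical columns in the same order, which a model trained on `train_enc` would need at prediction time.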

Discretization of numeric features
Discretization converts continuous numeric values into discrete categories or intervals (bins).
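With equal-width binning, the bin edges depend on the minimum and maximum of the data being binned, so the same age can land in different bins in different splits. A sketch that learns the edges on the training values and reuses them elsewhere. The ages are made up, and widening the outer edges to infinity is one possible choice for values outside the training range:

```python
import pandas as pd

# Hypothetical ages.
train_age = pd.Series([20, 35, 50, 65, 80])
val_age = pd.Series([25, 60, 95])

# Learn 3 equal-width bins on the training values and keep the edges.
train_bins, edges = pd.cut(train_age, bins=3, labels=False, retbins=True)

# Widen the outer edges so values outside the training range still get a bin.
edges[0], edges[-1] = -float("inf"), float("inf")

# Apply the training edges to the validation values.
val_bins = pd.cut(val_age, bins=edges, labels=False)

print(train_bins.tolist())
print(val_bins.tolist())
```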

In [15]:
# Categorical features to encode
categorical_features = ['Country', 'Source', 'Industry']

# Apply one-hot encoding
train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)
val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)
test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)
df_encoded = pd.get_dummies(df, columns=categorical_features)

print("Columns of train_data_encoded:", train_data_encoded.columns.tolist())
print("Columns of val_data_encoded:", val_data_encoded.columns.tolist())
print("Columns of test_data_encoded:", test_data_encoded.columns.tolist())

# Discretize the numeric features (Age and Networth), for example by splitting age and net worth into categories
# Discretize 'Age' into 5 bins
train_data_encoded['Age_binned'] = pd.cut(train_data_encoded['Age'], bins=5, labels=False)
val_data_encoded['Age_binned'] = pd.cut(val_data_encoded['Age'], bins=5, labels=False)
test_data_encoded['Age_binned'] = pd.cut(test_data_encoded['Age'], bins=5, labels=False)

# Discretize 'Networth' into 5 bins
train_data_encoded['Networth_binned'] = pd.cut(train_data_encoded['Networth'], bins=5, labels=False)
val_data_encoded['Networth_binned'] = pd.cut(val_data_encoded['Networth'], bins=5, labels=False)
test_data_encoded['Networth_binned'] = pd.cut(test_data_encoded['Networth'], bins=5, labels=False)

# Discretize 'Age' into 5 bins for the full dataset
df_encoded['Age_binned'] = pd.cut(df_encoded['Age'], bins=5, labels=False)

# Discretize 'Networth' into 5 bins for the full dataset
df_encoded['Networth_binned'] = pd.cut(df_encoded['Networth'], bins=5, labels=False)
Columns of train_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'LogNetworth', 'Networth_scaled', 'Country_Algeria', 'Country_Argentina', 'Country_Australia', 'Country_Austria', 'Country_Barbados', 'Country_Belgium', 'Country_Belize', 'Country_Brazil', 'Country_Bulgaria', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Colombia', 'Country_Cyprus', 'Country_Czechia', 'Country_Denmark', 'Country_Egypt', 'Country_Estonia', 'Country_Eswatini (Swaziland)', 'Country_Finland', 'Country_France', 'Country_Georgia', 'Country_Germany', 'Country_Greece', 'Country_Guernsey', 'Country_Hong Kong', 'Country_Hungary', 'Country_Iceland', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Macau', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_Morocco', 'Country_Nepal', 'Country_Netherlands', 'Country_New Zealand', 'Country_Nigeria', 'Country_Norway', 'Country_Oman', 'Country_Peru', 'Country_Philippines', 'Country_Poland', 'Country_Portugal', 'Country_Qatar', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Thailand', 'Country_Turkey', 'Country_Ukraine', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Country_Uruguay', 'Country_Venezuela', 'Country_Vietnam', 'Country_Zimbabwe', 'Source_3D printing', 'Source_AOL', 'Source_Airbnb', "Source_Aldi, Trader Joe's", 'Source_Aluminium', 'Source_Amazon', 'Source_Apple', 'Source_BMW, pharmaceuticals', 'Source_Banking', 'Source_Berkshire Hathaway', 'Source_Bloomberg LP', 'Source_Campbell Soup', 'Source_Cargill', 'Source_Carnival Cruises', 'Source_Chanel', 'Source_Charlotte Hornets, endorsements', 'Source_Chemicals', 'Source_Chick-fil-A', 'Source_Coca Cola Israel', 'Source_Coca-Cola bottler', 'Source_Columbia 
Sportswear', 'Source_Comcast', 'Source_Construction', 'Source_Contact Lens', 'Source_Dallas Cowboys', 'Source_Dell computers', "Source_Dick's Sporting Goods", 'Source_DirecTV', 'Source_Dolby Laboratories', 'Source_Dole, real estate', 'Source_EasyJet', 'Source_Estee Lauder', 'Source_Estée Lauder', 'Source_FIAT, investments', 'Source_Facebook', 'Source_Facebook, investments', 'Source_Furniture retail', 'Source_Gap', 'Source_Genentech, Apple', 'Source_Getty Oil', 'Source_Golden State Warriors', 'Source_Google', 'Source_Groupon, investments', 'Source_H&M', 'Source_Heineken', 'Source_Hermes', 'Source_Home Depot', 'Source_Houston Rockets, entertainment', 'Source_Hyundai', 'Source_I.T.', 'Source_IKEA', 'Source_IT', 'Source_IT consulting', 'Source_IT products', 'Source_IT provider', 'Source_In-N-Out Burger', 'Source_Instagram', 'Source_Intel', 'Source_Internet', 'Source_Internet search', 'Source_Investments', 'Source_Koch Industries', "Source_L'Oréal", 'Source_LED lighting', 'Source_LG', 'Source_LVMH', 'Source_Lego', 'Source_LinkedIn', 'Source_Little Caesars', 'Source_Lululemon', 'Source_Luxury goods', 'Source_Manufacturing', 'Source_Microsoft', 'Source_Mining', 'Source_Motors', 'Source_Multiple', 'Source_Nascar, racing', 'Source_Netflix', 'Source_Netscape, investments', 'Source_New Balance', 'Source_New England Patriots', 'Source_Nike', 'Source_Nutella, chocolates', 'Source_Patagonia', 'Source_Petro Fibre', 'Source_Petro Firbe', 'Source_Philadelphia Eagles', 'Source_Quicken Loans', 'Source_Real Estate', 'Source_Real estate', 'Source_Red Bull', 'Source_Reebok', 'Source_SAP', 'Source_Samsung', 'Source_Sears', 'Source_Semiconductor materials', 'Source_Shipping', 'Source_Shoes', 'Source_Slim-Fast', 'Source_Smartphones', 'Source_Snapchat', 'Source_Spotify', 'Source_Starbucks', 'Source_TD Ameritrade', 'Source_TV broadcasting', 'Source_TV network, investments', 'Source_TV programs', 'Source_TV shows', 'Source_TV, movie production', 'Source_Tesla, SpaceX', 'Source_TikTok', 
'Source_Toyota dealerships', 'Source_Transportation', 'Source_Twitter, Square', 'Source_U-Haul', 'Source_Uber', 'Source_Urban Outfitters', 'Source_Waffle House', 'Source_Walmart', 'Source_Walmart, logistics', 'Source_Washington Football Team', 'Source_WeWork', 'Source_WhatsApp', 'Source_Yahoo', 'Source_Zara', 'Source_Zoom Video Communications', 'Source_accounting services', 'Source_adhesives', 'Source_advertising', 'Source_aerospace', 'Source_agribusiness', 'Source_agriculture', 'Source_agriculture, land', 'Source_agriculture, water', 'Source_agrochemicals', 'Source_air compressors', 'Source_aircraft leasing', 'Source_airline', 'Source_airlines', 'Source_airport', 'Source_airport management', 'Source_airports, investments', 'Source_alcohol', 'Source_alcohol, real estate', 'Source_aluminum', 'Source_aluminum products', 'Source_aluminum, diversified  ', 'Source_aluminum, utilities', 'Source_animal health, investments', 'Source_apparel', 'Source_appliances', 'Source_art', 'Source_art collection', 'Source_art, car dealerships', 'Source_asset management', 'Source_auto dealers, investments', 'Source_auto dealerships', 'Source_auto loans', 'Source_auto parts', 'Source_auto repair', 'Source_automobiles', 'Source_automobiles, batteries', 'Source_automotive', 'Source_automotive brakes', 'Source_automotive technology', 'Source_aviation', 'Source_bakeries', 'Source_banking', 'Source_banking, credit cards', 'Source_banking, insurance', 'Source_banking, insurance, media', 'Source_banking, investments', 'Source_banking, minerals', 'Source_banking, oil', 'Source_banking, property', 'Source_banking, real estate', 'Source_banking, tobacco', 'Source_banks, real estate', 'Source_bars', 'Source_batteries', 'Source_batteries, automobiles', 'Source_batteries, investments', 'Source_battery components', 'Source_beauty products', 'Source_beef packing', 'Source_beef processing', 'Source_beer', 'Source_beverages', 'Source_beverages, pharmaceuticals', 'Source_biochemicals', 
'Source_biopharmaceuticals', 'Source_biotech', 'Source_biotech investing', 'Source_biotech, investments', 'Source_biotechnology', 'Source_blockchain technology', 'Source_blockchain, technology', 'Source_book distribution, transportation', 'Source_brakes, investments', 'Source_brewery', 'Source_building materials', 'Source_business software', 'Source_cable', 'Source_cable TV, investments', 'Source_cable television', 'Source_call centers', 'Source_cameras, software', 'Source_candy', 'Source_candy, pet food', 'Source_car dealerships', 'Source_car rentals', 'Source_carbon fiber products', 'Source_carpet', 'Source_cars', 'Source_cashmere', 'Source_casinos', 'Source_casinos, banking', 'Source_casinos, hotels', 'Source_casinos, mixed martial arts', 'Source_casinos, property, energy', 'Source_casinos/hotels', 'Source_cement', 'Source_cement, sugar', 'Source_cheese', 'Source_chemical products', 'Source_chemicals', 'Source_chemicals, investments', 'Source_chemicals, logistics', 'Source_chemicals, spandex', 'Source_chewing gum', 'Source_chicken processing', 'Source_cleaning products', 'Source_clinical diagnostics', 'Source_clinical trials', 'Source_cloud communications', 'Source_cloud computing', 'Source_coal', 'Source_coal mines', 'Source_coal, fertilizers', 'Source_coal, investments', 'Source_cobalt', 'Source_coffee', 'Source_coffee, shipping', 'Source_coking', 'Source_commodities', 'Source_communication equipment', 'Source_communications', 'Source_computer hardware', 'Source_computer services, real estate', 'Source_computer services, telecom', 'Source_computer software', 'Source_conglomerate', 'Source_construction', 'Source_construction equipment', 'Source_construction equipment, media', 'Source_construction materials', 'Source_construction, investments', 'Source_construction, media', 'Source_construction, mining', 'Source_construction, mining machinery', 'Source_construction, pipes, banking', 'Source_construction, real estate', 'Source_consumer', 'Source_consumer 
electronics', 'Source_consumer goods', 'Source_consumer products, banking', 'Source_convenience stores', 'Source_convinience stores', 'Source_copper, poultry', 'Source_cosmetics', 'Source_cosmetics, reality TV', 'Source_cruises', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_cybersecurity', 'Source_dairy', 'Source_dairy & consumer products', 'Source_damaged cars', 'Source_data analytics', 'Source_data centers', 'Source_data management', 'Source_defense', 'Source_defense, hotels', 'Source_dental implants', 'Source_dental products', 'Source_department stores', 'Source_diagnostics', 'Source_diamond jewelry', 'Source_diamonds', 'Source_digital advertising', 'Source_discount brokerage', 'Source_diversified  ', 'Source_drilling, shipping', 'Source_drones', 'Source_drugs', 'Source_drugstores', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_e-commerce software', 'Source_eBay', 'Source_eBay, PayPal', 'Source_education', 'Source_education technology', 'Source_electric bikes, scooters', 'Source_electric components', 'Source_electric equipment', 'Source_electric scooters', 'Source_electric vehicles', 'Source_electrical equipment', 'Source_electrodes', 'Source_electronic components', 'Source_electronic trading', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_email marketing', 'Source_employment agency', 'Source_energy', 'Source_energy drink', 'Source_energy drinks', 'Source_energy drinks,investments', 'Source_energy services', 'Source_energy, banking, construction', 'Source_energy, chemicals', 'Source_energy, investments', 'Source_energy, real estate', 'Source_energy, sports', 'Source_engineering', 'Source_engineering, automotive', 'Source_engineering, construction', 'Source_entertainment', 'Source_executive search, investments', 'Source_express delivery', 'Source_fashion', 'Source_fashion investments', 'Source_fashion retail', 'Source_fashion retail, investments', 'Source_fashion retailer', 'Source_fast 
food', 'Source_fasteners', 'Source_feed', 'Source_fertilizer', 'Source_fertilizer, real estate', 'Source_fertilizers', 'Source_fiber optic cables', 'Source_finance', 'Source_finance and investments', 'Source_finance, real estate', 'Source_finance, telecommunications', 'Source_financial information', 'Source_financial services', 'Source_financial services, property', 'Source_financial services★', 'Source_financial technology', 'Source_fintech', 'Source_fitness equipment', 'Source_flavorings', 'Source_flavors and fragrances', 'Source_flipkart', 'Source_flooring', 'Source_food', 'Source_food & beverage retailing', 'Source_food delivery app', 'Source_food distribution', 'Source_food processing', 'Source_food service', 'Source_food services', 'Source_food, beverages', 'Source_foods', 'Source_footwear', 'Source_forestry, mining', 'Source_frozen foods', 'Source_furniture', 'Source_furniture retailing', 'Source_gambling', 'Source_gambling products', 'Source_gambling software', 'Source_game software', 'Source_gaming', 'Source_gas stations', 'Source_gas, chemicals', 'Source_generic drugs', 'Source_glass', 'Source_gold', 'Source_graphite electrodes', 'Source_grocery delivery service', 'Source_grocery stores', 'Source_hair care products', 'Source_hair dryers', 'Source_hair products, tequila', 'Source_hand tools', 'Source_hardware', 'Source_health IT', 'Source_health care', 'Source_health clinics', 'Source_health insurance', 'Source_healthcare', 'Source_healthcare services', 'Source_hearing aids', 'Source_heating and cooling equipment', 'Source_heating, cooling equipment', 'Source_hedge fund', 'Source_hedge funds', 'Source_herbal products', 'Source_high speed trading', 'Source_home appliances', 'Source_home building', 'Source_home building, banking', 'Source_home furnishings', 'Source_home improvement stores', 'Source_home sales', 'Source_home-cleaning robots', 'Source_homebuilder', 'Source_homebuilding', 'Source_homebuilding, insurance', 'Source_hospitals', 'Source_hospitals, 
health care', 'Source_hospitals, health insurance', 'Source_hotels', 'Source_hotels, diversified  ', 'Source_hotels, energy', 'Source_hotels, investments', 'Source_hotels, motels', 'Source_household chemicals', 'Source_hydraulic machinery', 'Source_industrial equipment', 'Source_industrial explosives', 'Source_industrial lasers', 'Source_industrial machinery', 'Source_infant formula', 'Source_information technology', 'Source_infrastructure', 'Source_infrastructure, commodities', 'Source_insurance', 'Source_insurance, NFL team', 'Source_insurance, beverages', 'Source_insurance, investments', 'Source_internet', 'Source_internet media', 'Source_internet search', 'Source_internet service provider', 'Source_investing', 'Source_investment', 'Source_investments', 'Source_investments, art', 'Source_investments, energy', 'Source_investments, real estate', 'Source_jewellery', 'Source_jewelry', 'Source_kitchen appliances', 'Source_laboratory services', 'Source_leveraged buyouts', 'Source_lighting', 'Source_lighting installations', 'Source_liquefied natural gas', 'Source_liquor', 'Source_lithium', 'Source_lithium batteries', 'Source_lithium battery', 'Source_lithium-ion battery cap', 'Source_live entertainment', 'Source_live streaming service', 'Source_logistics', 'Source_low-cost airlines', 'Source_luxury goods', 'Source_machine tools', 'Source_machinery', 'Source_magazines, media', 'Source_magnetic switches', 'Source_manufacturing', 'Source_manufacturing, investment', 'Source_manufacturing, investments', 'Source_mapping software', 'Source_materials', 'Source_measuring instruments', 'Source_meat processing', 'Source_media', 'Source_media, automotive', 'Source_media, investments', 'Source_media, real estate', 'Source_medical devices', 'Source_medical diagnostic equipment', 'Source_medical diagnostics', 'Source_medical equipment', 'Source_medical packaging', 'Source_medical patents', 'Source_medical products', 'Source_medical technology', 'Source_medical testing', 
'Source_messaging app', 'Source_metal processing', 'Source_metals', 'Source_metals, coal', 'Source_metals, energy', 'Source_metals, mining', 'Source_metalworking tools', 'Source_microbiology', 'Source_microchip testing', 'Source_mining', 'Source_mining, banking', 'Source_mining, banking, hotels', 'Source_mining, commodities', 'Source_mining, copper products', 'Source_mining, metals, machinery', 'Source_mobile games', 'Source_mobile gaming', 'Source_mobile payments', 'Source_mobile phone retailer', 'Source_mobile phones', 'Source_money management', 'Source_mortgage lender★', 'Source_motorcycle loans', 'Source_motorcycles', 'Source_motorhomes, RVs', 'Source_motors', 'Source_movie making', 'Source_movies, investments', 'Source_movies, record labels', 'Source_music, chemicals', 'Source_music, cosmetics', 'Source_music, sneakers', 'Source_mutual funds', 'Source_natural gas', 'Source_natural gas distribution', 'Source_natural gas, fertilizers', 'Source_navigation equipment', 'Source_newspapers, TV network', 'Source_nonferrous', 'Source_nutrition, wellness products', 'Source_office real estate', 'Source_oil', 'Source_oil & gas', 'Source_oil & gas, banking', 'Source_oil & gas, investments', 'Source_oil and gas', 'Source_oil and gas, IT, lotteries', 'Source_oil refinery', 'Source_oil, banking, telecom', 'Source_oil, gas', 'Source_oil, investments', 'Source_oil, real estate', 'Source_oilfield equipment', 'Source_online dating', 'Source_online gambling', 'Source_online games', 'Source_online games, investments', 'Source_online gaming', 'Source_online marketplace', 'Source_online media', 'Source_online media, Dallas Mavericks', 'Source_online payments', 'Source_online recruitment', 'Source_online retail', 'Source_online retailing', 'Source_online services', 'Source_optical components', 'Source_optometry', 'Source_orange juice', 'Source_package delivery', 'Source_packaged meats', 'Source_packaging', 'Source_paint', 'Source_paints', 'Source_palm oil', 'Source_palm oil, nickel 
mining', 'Source_palm oil, property', 'Source_palm oil, shipping, property', 'Source_paper', 'Source_paper & related products', 'Source_paper and pulp', 'Source_payment software', 'Source_payments software', 'Source_payments technology', 'Source_payments, banking', 'Source_payroll processing', 'Source_payroll software', 'Source_pearlescent pigments', 'Source_personal care goods', 'Source_pest control', 'Source_pet food', 'Source_petrochemicals', 'Source_petroleum, diversified  ', 'Source_phamaceuticals', 'Source_pharmaceutical', 'Source_pharmaceutical services', 'Source_pharmaceuticals', 'Source_pharmaceuticals, diversified  ', 'Source_pharmaceuticals, food', 'Source_pharmaceuticals, medical equipment', 'Source_pharmaceuticals, power', 'Source_pharmacies', 'Source_photovoltaic equipment', 'Source_photovoltaics', 'Source_pig breeding', 'Source_pipe manufacturing', 'Source_pipelines', 'Source_plastic', 'Source_plastics', 'Source_plumbing fixtures', 'Source_plush toys, real estate', 'Source_polyester', 'Source_ports', 'Source_poultry genetics', 'Source_poultry processing', 'Source_powdered metal', 'Source_power equipment', 'Source_power strip', 'Source_power supply equipment', 'Source_precision machinery', 'Source_price comparison website', 'Source_printed circuit boards', 'Source_printing', 'Source_private equity', 'Source_private equity★', 'Source_pro sports teams', 'Source_property, healthcare', 'Source_prosthetics', 'Source_publishing', 'Source_pulp and paper', 'Source_quartz products', 'Source_readymade garments', 'Source_real estate', 'Source_real estate developer', 'Source_real estate development', 'Source_real estate services', 'Source_real estate, airport', 'Source_real estate, construction', 'Source_real estate, diversified  ', 'Source_real estate, electronics', 'Source_real estate, gambling', 'Source_real estate, investments', 'Source_real estate, manufacturing', 'Source_real estate, media', 'Source_real estate, oil, cars, sports', 'Source_real estate, 
private equity', 'Source_real estate, retail', 'Source_real estate, shipping', 'Source_refinery, chemicals', 'Source_renewable energy', 'Source_restaurant', 'Source_restaurants', 'Source_retail', 'Source_retail, investments', 'Source_retail, media', 'Source_retail, real estate', 'Source_retailing', 'Source_roofing', 'Source_salsa', 'Source_sandwich chain', 'Source_satellite TV', 'Source_scaffolding, cement mixers', 'Source_scientific equipment', 'Source_security', 'Source_security services', 'Source_security software', 'Source_seed production', 'Source_semiconductor', 'Source_semiconductor devices', 'Source_semiconductors', 'Source_sensor systems', 'Source_sensor technology', 'Source_sensors', 'Source_sensors★', 'Source_shipbuilding', 'Source_shipping', 'Source_shipping, airlines', 'Source_shipping, seafood', 'Source_shoes', 'Source_shopping centers', 'Source_shopping malls', 'Source_silicon', 'Source_smartphone components', 'Source_smartphone screens', 'Source_smartphones', 'Source_snack bars', 'Source_snacks, beverages', 'Source_sneakers, sportswear', 'Source_social media', 'Source_social network', 'Source_soft drinks, fast food', 'Source_software', 'Source_software firm', 'Source_software services', 'Source_software, investments', 'Source_solar energy', 'Source_solar energy equipment', 'Source_solar equipment', 'Source_solar inverters', 'Source_solar panel components', 'Source_solar panel materials', 'Source_solar wafers and modules', 'Source_soy sauce', 'Source_specialty chemicals', 'Source_spirits', 'Source_sporting goods retail', 'Source_sports apparel', 'Source_sports data', 'Source_sports drink', 'Source_sports retailing', 'Source_sports team', 'Source_sports teams', 'Source_sports, real estate', 'Source_staffing & recruiting', 'Source_stationery', 'Source_steel', 'Source_steel pipes, diversified  ', 'Source_steel, coal', 'Source_steel, diversified  ', 'Source_steel, investments', 'Source_steel, telecom, investments', 'Source_steel, transport', 
'Source_stock brokerage', 'Source_stock exchange', 'Source_stock photos', 'Source_storage facilities', 'Source_sugar, ethanol', 'Source_sunglasses', 'Source_supermarkets', 'Source_tech investments', 'Source_technology', 'Source_telecom', 'Source_telecom services', 'Source_telecom, investments', 'Source_telecom, oil', 'Source_telecommunication', 'Source_telecommunications', 'Source_temp agency', 'Source_tequila', 'Source_textiles', 'Source_textiles, paper', 'Source_ticketing service', 'Source_tire', 'Source_tires', 'Source_tires, diversified  ', 'Source_tobacco', 'Source_tobacco distribution, retail', 'Source_toll roads', 'Source_touch screens', 'Source_tourism, cultural industry', 'Source_toys', 'Source_tractors', 'Source_trading, investments', 'Source_train cars', 'Source_transportation', 'Source_travel', 'Source_trucking', 'Source_two-wheelers, finance', 'Source_used cars', 'Source_utilities, diversified  ', 'Source_utilities, real estate', 'Source_vaccine & shoes', 'Source_vaccines', 'Source_valve manufacturing', 'Source_valves', 'Source_venture capital', 'Source_venture capital, Google', 'Source_video games', 'Source_video games, pachinko', 'Source_video streaming', 'Source_video streaming app', 'Source_video surveillance', 'Source_videogames', 'Source_vodka', 'Source_waste disposal', 'Source_web hosting', 'Source_wine', 'Source_wireless networking gear', 'Industry_Automotive ', 'Industry_Construction & Engineering ', 'Industry_Energy ', 'Industry_Fashion & Retail ', 'Industry_Finance & Investments ', 'Industry_Food & Beverage ', 'Industry_Gambling & Casinos ', 'Industry_Healthcare ', 'Industry_Logistics ', 'Industry_Manufacturing ', 'Industry_Media & Entertainment ', 'Industry_Metals & Mining ', 'Industry_Real Estate ', 'Industry_Service ', 'Industry_Sports ', 'Industry_Technology ', 'Industry_Telecom ', 'Industry_diversified   ']
Columns of val_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified  ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified  ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foods', 'Source_furniture retailing', 'Source_garments', 'Source_gas stations, retail', 'Source_generic drugs', 'Source_glass', 'Source_greek yogurt', 'Source_gym equipment', 'Source_hardware stores', 'Source_health products', 'Source_healthcare IT', 'Source_hedge funds', 'Source_home appliances', 'Source_home furnishings', 'Source_homebuilding', 'Source_homebuilding, NFL team', 'Source_hospitals, health insurance', 'Source_hotels', 'Source_hotels, investments', 'Source_household chemicals', 'Source_hygiene products', 'Source_imaging systems', 'Source_insurance', 'Source_insurance, NFL team', 'Source_internet and software', 'Source_internet media', 'Source_internet, telecom', 'Source_investing', 'Source_investment banking', 'Source_investment research', 'Source_investments', 'Source_iron ore mining', 'Source_jewelry', 'Source_kitchen appliances', 'Source_laboratory services', 'Source_liquor', 'Source_lithium', 'Source_logistics', 'Source_logistics, baseball', 'Source_logistics, real estate', 'Source_luxury goods', 'Source_machine tools', 'Source_machinery', 'Source_manufacturing', 'Source_manufacturing, investments', 'Source_mattresses', 'Source_meat processing', 'Source_media', 'Source_media, automotive', 'Source_media, tech', 'Source_medical cosmetics', 'Source_medical devices', 'Source_medical equipment', 'Source_medical services', 'Source_messaging software', 'Source_metals', 'Source_metals, banking, fertilizers', 'Source_mining', 'Source_mining, commodities', 'Source_mining, metals', 'Source_mining, steel', 'Source_money management', 'Source_motorcycles', 'Source_movies', 'Source_movies, digital effects', 'Source_natural gas', 'Source_non-ferrous metals', 'Source_nutritional supplements', 'Source_oil', 'Source_oil & gas, investments', 'Source_oil refining', 'Source_oil trading', 'Source_oil, banking', 'Source_oil, gas', 'Source_oil, investments', 'Source_oil, real estate', 
'Source_oil, semiconductor', 'Source_online gambling', 'Source_online games', 'Source_online media', 'Source_online retail', 'Source_package delivery', 'Source_packaging', 'Source_paper', 'Source_paper manufacturing', 'Source_payment processing', 'Source_payroll services', 'Source_pet food', 'Source_petrochemicals', 'Source_pharma retailing', 'Source_pharmaceutical', 'Source_pharmaceutical ingredients', 'Source_pharmaceuticals', 'Source_pipelines', 'Source_plastic pipes', 'Source_poultry', 'Source_poultry breeding', 'Source_power strips', 'Source_precious metals, real estate', 'Source_printed circuit boards', 'Source_private equity', 'Source_publishing', 'Source_pulp and paper', 'Source_real estate', 'Source_real estate finance', 'Source_real estate, hotels', 'Source_real estate, investments', 'Source_record label', 'Source_refinery, chemicals', 'Source_restaurants', 'Source_retail', 'Source_retail & gas stations', 'Source_retail chain', 'Source_retail stores', 'Source_retail, agribusiness', 'Source_retail, investments', 'Source_rubber gloves', 'Source_security software', 'Source_self storage', 'Source_semiconductor', 'Source_semiconductors', 'Source_sensor systems', 'Source_shipping', 'Source_shoes', 'Source_smartphone screens', 'Source_smartphones', 'Source_software', 'Source_solar panels', 'Source_soy sauce', 'Source_sporting goods', 'Source_sports', 'Source_sports apparel', 'Source_staffing, Baltimore Ravens', 'Source_stationery', 'Source_steel', 'Source_steel production', 'Source_steel, diversified  ', 'Source_steel, mining', 'Source_supermarkets', 'Source_supermarkets, investments', 'Source_surveillance equipment', 'Source_technology', 'Source_telecom', 'Source_telecom, lotteries, insurance', 'Source_telecoms, media, oil-services', 'Source_testing equipment', 'Source_textile, chemicals', 'Source_textiles, apparel', 'Source_textiles, petrochemicals', 'Source_timberland, lumber mills', 'Source_titanium', 'Source_transport, logistics', 'Source_two-wheelers', 
'Source_used cars', 'Source_vaccines', 'Source_vacuums', 'Source_venture capital', 'Source_venture capital investing', 'Source_video games', 'Source_wedding dresses', 'Source_wind turbines', 'Source_winter jackets', 'Source_wire & cables, paints', 'Industry_Automotive ', 'Industry_Construction & Engineering ', 'Industry_Energy ', 'Industry_Fashion & Retail ', 'Industry_Finance & Investments ', 'Industry_Food & Beverage ', 'Industry_Gambling & Casinos ', 'Industry_Healthcare ', 'Industry_Logistics ', 'Industry_Manufacturing ', 'Industry_Media & Entertainment ', 'Industry_Metals & Mining ', 'Industry_Real Estate ', 'Industry_Service ', 'Industry_Sports ', 'Industry_Technology ', 'Industry_Telecom ', 'Industry_diversified   ']
Столбцы test_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified  ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified  ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foods', 'Source_furniture retailing', 'Source_garments', 'Source_gas stations, retail', 'Source_generic drugs', 'Source_glass', 'Source_greek yogurt', 'Source_gym equipment', 'Source_hardware stores', 'Source_health products', 'Source_healthcare IT', 'Source_hedge funds', 'Source_home appliances', 'Source_home furnishings', 'Source_homebuilding', 'Source_homebuilding, NFL team', 'Source_hospitals, health insurance', 'Source_hotels', 'Source_hotels, investments', 'Source_household chemicals', 'Source_hygiene products', 'Source_imaging systems', 'Source_insurance', 'Source_insurance, NFL team', 'Source_internet and software', 'Source_internet media', 'Source_internet, telecom', 'Source_investing', 'Source_investment banking', 'Source_investment research', 'Source_investments', 'Source_iron ore mining', 'Source_jewelry', 'Source_kitchen appliances', 'Source_laboratory services', 'Source_liquor', 'Source_lithium', 'Source_logistics', 'Source_logistics, baseball', 'Source_logistics, real estate', 'Source_luxury goods', 'Source_machine tools', 'Source_machinery', 'Source_manufacturing', 'Source_manufacturing, investments', 'Source_mattresses', 'Source_meat processing', 'Source_media', 'Source_media, automotive', 'Source_media, tech', 'Source_medical cosmetics', 'Source_medical devices', 'Source_medical equipment', 'Source_medical services', 'Source_messaging software', 'Source_metals', 'Source_metals, banking, fertilizers', 'Source_mining', 'Source_mining, commodities', 'Source_mining, metals', 'Source_mining, steel', 'Source_money management', 'Source_motorcycles', 'Source_movies', 'Source_movies, digital effects', 'Source_natural gas', 'Source_non-ferrous metals', 'Source_nutritional supplements', 'Source_oil', 'Source_oil & gas, investments', 'Source_oil refining', 'Source_oil trading', 'Source_oil, banking', 'Source_oil, gas', 'Source_oil, investments', 'Source_oil, real estate', 
'Source_oil, semiconductor', 'Source_online gambling', 'Source_online games', 'Source_online media', 'Source_online retail', 'Source_package delivery', 'Source_packaging', 'Source_paper', 'Source_paper manufacturing', 'Source_payment processing', 'Source_payroll services', 'Source_pet food', 'Source_petrochemicals', 'Source_pharma retailing', 'Source_pharmaceutical', 'Source_pharmaceutical ingredients', 'Source_pharmaceuticals', 'Source_pipelines', 'Source_plastic pipes', 'Source_poultry', 'Source_poultry breeding', 'Source_power strips', 'Source_precious metals, real estate', 'Source_printed circuit boards', 'Source_private equity', 'Source_publishing', 'Source_pulp and paper', 'Source_real estate', 'Source_real estate finance', 'Source_real estate, hotels', 'Source_real estate, investments', 'Source_record label', 'Source_refinery, chemicals', 'Source_restaurants', 'Source_retail', 'Source_retail & gas stations', 'Source_retail chain', 'Source_retail stores', 'Source_retail, agribusiness', 'Source_retail, investments', 'Source_rubber gloves', 'Source_security software', 'Source_self storage', 'Source_semiconductor', 'Source_semiconductors', 'Source_sensor systems', 'Source_shipping', 'Source_shoes', 'Source_smartphone screens', 'Source_smartphones', 'Source_software', 'Source_solar panels', 'Source_soy sauce', 'Source_sporting goods', 'Source_sports', 'Source_sports apparel', 'Source_staffing, Baltimore Ravens', 'Source_stationery', 'Source_steel', 'Source_steel production', 'Source_steel, diversified  ', 'Source_steel, mining', 'Source_supermarkets', 'Source_supermarkets, investments', 'Source_surveillance equipment', 'Source_technology', 'Source_telecom', 'Source_telecom, lotteries, insurance', 'Source_telecoms, media, oil-services', 'Source_testing equipment', 'Source_textile, chemicals', 'Source_textiles, apparel', 'Source_textiles, petrochemicals', 'Source_timberland, lumber mills', 'Source_titanium', 'Source_transport, logistics', 'Source_two-wheelers', 
'Source_used cars', 'Source_vaccines', 'Source_vacuums', 'Source_venture capital', 'Source_venture capital investing', 'Source_video games', 'Source_wedding dresses', 'Source_wind turbines', 'Source_winter jackets', 'Source_wire & cables, paints', 'Industry_Automotive ', 'Industry_Construction & Engineering ', 'Industry_Energy ', 'Industry_Fashion & Retail ', 'Industry_Finance & Investments ', 'Industry_Food & Beverage ', 'Industry_Gambling & Casinos ', 'Industry_Healthcare ', 'Industry_Logistics ', 'Industry_Manufacturing ', 'Industry_Media & Entertainment ', 'Industry_Metals & Mining ', 'Industry_Real Estate ', 'Industry_Service ', 'Industry_Sports ', 'Industry_Technology ', 'Industry_Telecom ', 'Industry_diversified   ']

Manual feature synthesis

Creating new features based on expert knowledge and domain logic. For example, one can create a feature that reflects the ratio of age to net worth (Networth), or other useful metrics.

In [16]:
# New feature: ratio of age to net worth (Networth), for each split
train_data_encoded['age_to_networth'] = train_data_encoded['Age'] / train_data_encoded['Networth']
val_data_encoded['age_to_networth'] = val_data_encoded['Age'] / val_data_encoded['Networth']
test_data_encoded['age_to_networth'] = test_data_encoded['Age'] / test_data_encoded['Networth']

# The same age-to-net-worth ratio for the full encoded dataset
df_encoded['age_to_networth'] = df_encoded['Age'] / df_encoded['Networth']

# New feature: ratio of net worth to age, for each split
train_data_encoded['networth_to_age'] = train_data_encoded['Networth'] / train_data_encoded['Age']
val_data_encoded['networth_to_age'] = val_data_encoded['Networth'] / val_data_encoded['Age']
test_data_encoded['networth_to_age'] = test_data_encoded['Networth'] / test_data_encoded['Age']

# The same net-worth-to-age ratio for the full encoded dataset
df_encoded['networth_to_age'] = df_encoded['Networth'] / df_encoded['Age']

# New feature: age squared, for each split
train_data_encoded['age_squared'] = train_data_encoded['Age'] ** 2
val_data_encoded['age_squared'] = val_data_encoded['Age'] ** 2
test_data_encoded['age_squared'] = test_data_encoded['Age'] ** 2

# The same squared age for the full encoded dataset
df_encoded['age_squared'] = df_encoded['Age'] ** 2

# New feature: natural logarithm of net worth (0 for non-positive values), for each split
import numpy as np
train_data_encoded['log_networth'] = train_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)
val_data_encoded['log_networth'] = val_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)
test_data_encoded['log_networth'] = test_data_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)

# The same logarithm of net worth for the full encoded dataset
df_encoded['log_networth'] = df_encoded['Networth'].apply(lambda x: np.log(x) if x > 0 else 0)
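
The cell above repeats each transformation four times (train, validation, test and the full frame). As a minimal sketch, the same four features could be gathered in one helper and applied to any frame that has `Networth` and `Age` columns. The function name `add_manual_features` and the toy data are assumptions made for illustration, and the zero-to-NaN guard on the divisions is an addition that the original cell does not have.

```python
import numpy as np
import pandas as pd


def add_manual_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the four hand-crafted features added."""
    out = frame.copy()
    # Assumption: treat zeros as missing so a division yields NaN instead of inf
    networth = out["Networth"].replace(0, np.nan)
    age = out["Age"].replace(0, np.nan)
    out["age_to_networth"] = out["Age"] / networth
    out["networth_to_age"] = out["Networth"] / age
    out["age_squared"] = out["Age"] ** 2
    # Same rule as the cell above: log for positive values, 0 otherwise.
    # The clip only keeps np.log away from non-positive inputs; those
    # positions are overwritten with 0 by np.where.
    safe_networth = out["Networth"].clip(lower=1e-12)
    out["log_networth"] = np.where(out["Networth"] > 0, np.log(safe_networth), 0.0)
    return out


# Toy frame (illustrative values, not the real dataset)
toy = pd.DataFrame({"Networth": [219.0, 171.0, 1.0], "Age": [50, 58, 40]})
toy_features = add_manual_features(toy)
print(toy_features)
```

Because the helper returns a copy, it can be called once per split without modifying the input frame.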

Feature scaling is the process of transforming numerical features so that they share the same scale.

In [17]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Numerical features to scale
numerical_features = ['Networth', 'Age']

# StandardScaler: fit on the training set only, then apply the same transform to validation and test
scaler = StandardScaler()
train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])
val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])
test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])

# MinMaxScaler example. Note: it is fitted on the columns already standardized above,
# so the values that remain are min-max scaled (training set in [0, 1])
scaler = MinMaxScaler()
train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])
val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])
test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])
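
To make the difference between the two scalers concrete, here is a small sketch on toy arrays (the numbers are illustrative, not taken from the dataset). Each scaler is fitted separately on the training values and then reused for the validation values, rather than one being applied on top of the other as in the cell above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature data (assumed values for illustration)
train = np.array([[1.0], [2.0], [3.0], [4.0]])
val = np.array([[2.5], [5.0]])

# StandardScaler: subtract the training mean, divide by the training standard deviation
std_scaler = StandardScaler().fit(train)
train_std = std_scaler.transform(train)
val_std = std_scaler.transform(val)

# MinMaxScaler: map the training minimum to 0 and the training maximum to 1
mm_scaler = MinMaxScaler().fit(train)
train_mm = mm_scaler.transform(train)
val_mm = mm_scaler.transform(val)

print("standardized train:", train_std.ravel())
print("standardized val:  ", val_std.ravel())
print("min-max train:     ", train_mm.ravel())
print("min-max val:       ", val_mm.ravel())
```

Validation values outside the training range (5.0 here) fall outside [0, 1] after min-max scaling, which is expected when the scaler statistics come from the training set only.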

Using the Featuretools framework

In [20]:
import featuretools as ft

# Check which columns are present in each DataFrame
print("Столбцы в df:", df.columns.tolist())
print("Столбцы в train_data_encoded:", train_data_encoded.columns.tolist())
print("Столбцы в val_data_encoded:", val_data_encoded.columns.tolist())
print("Столбцы в test_data_encoded:", test_data_encoded.columns.tolist())

# Remove rows duplicated across all columns (there is no unique identifier column)
df = df.drop_duplicates()
duplicates = train_data_encoded[train_data_encoded.duplicated(keep=False)]

# Remove fully duplicated rows from df_encoded, keeping the first occurrence
df_encoded = df_encoded.drop_duplicates(keep='first')

print(duplicates)

# Create the EntitySet
es = ft.EntitySet(id='millionaires_data')

# Add the dataframe with the millionaires data.
# 'id' is not among the columns printed above, so Featuretools presumably creates it as a new integer index
es = es.add_dataframe(dataframe_name='millionaires', dataframe=df_encoded, index='id')

# Generate features with Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='millionaires', max_depth=2)

# Print the first 5 rows of the generated feature matrix
print(feature_matrix.head())

# Remove duplicates from the training set (keep='first' is the default)
train_data_encoded = train_data_encoded.drop_duplicates()

# Define the entities (create a new EntitySet for the training data)
es = ft.EntitySet(id='millionaires_data')

es = es.add_dataframe(dataframe_name='millionaires', dataframe=train_data_encoded, index='id')

# Generate features
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='millionaires', max_depth=2)

# Compute the features for the validation and test sets.
# Caveat: this EntitySet holds only the training rows, so the instance_ids taken from the
# validation/test indexes are looked up among the training ids
val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)
test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)
Столбцы в df: ['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry']
Столбцы в train_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'LogNetworth', 'Networth_scaled', 'Country_Algeria', 'Country_Argentina', 'Country_Australia', 'Country_Austria', 'Country_Barbados', 'Country_Belgium', 'Country_Belize', 'Country_Brazil', 'Country_Bulgaria', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Colombia', 'Country_Cyprus', 'Country_Czechia', 'Country_Denmark', 'Country_Egypt', 'Country_Estonia', 'Country_Eswatini (Swaziland)', 'Country_Finland', 'Country_France', 'Country_Georgia', 'Country_Germany', 'Country_Greece', 'Country_Guernsey', 'Country_Hong Kong', 'Country_Hungary', 'Country_Iceland', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Macau', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_Morocco', 'Country_Nepal', 'Country_Netherlands', 'Country_New Zealand', 'Country_Nigeria', 'Country_Norway', 'Country_Oman', 'Country_Peru', 'Country_Philippines', 'Country_Poland', 'Country_Portugal', 'Country_Qatar', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Thailand', 'Country_Turkey', 'Country_Ukraine', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Country_Uruguay', 'Country_Venezuela', 'Country_Vietnam', 'Country_Zimbabwe', 'Source_3D printing', 'Source_AOL', 'Source_Airbnb', "Source_Aldi, Trader Joe's", 'Source_Aluminium', 'Source_Amazon', 'Source_Apple', 'Source_BMW, pharmaceuticals', 'Source_Banking', 'Source_Berkshire Hathaway', 'Source_Bloomberg LP', 'Source_Campbell Soup', 'Source_Cargill', 'Source_Carnival Cruises', 'Source_Chanel', 'Source_Charlotte Hornets, endorsements', 'Source_Chemicals', 'Source_Chick-fil-A', 'Source_Coca Cola Israel', 'Source_Coca-Cola bottler', 'Source_Columbia 
Sportswear', 'Source_Comcast', 'Source_Construction', 'Source_Contact Lens', 'Source_Dallas Cowboys', 'Source_Dell computers', "Source_Dick's Sporting Goods", 'Source_DirecTV', 'Source_Dolby Laboratories', 'Source_Dole, real estate', 'Source_EasyJet', 'Source_Estee Lauder', 'Source_Estée Lauder', 'Source_FIAT, investments', 'Source_Facebook', 'Source_Facebook, investments', 'Source_Furniture retail', 'Source_Gap', 'Source_Genentech, Apple', 'Source_Getty Oil', 'Source_Golden State Warriors', 'Source_Google', 'Source_Groupon, investments', 'Source_H&M', 'Source_Heineken', 'Source_Hermes', 'Source_Home Depot', 'Source_Houston Rockets, entertainment', 'Source_Hyundai', 'Source_I.T.', 'Source_IKEA', 'Source_IT', 'Source_IT consulting', 'Source_IT products', 'Source_IT provider', 'Source_In-N-Out Burger', 'Source_Instagram', 'Source_Intel', 'Source_Internet', 'Source_Internet search', 'Source_Investments', 'Source_Koch Industries', "Source_L'Oréal", 'Source_LED lighting', 'Source_LG', 'Source_LVMH', 'Source_Lego', 'Source_LinkedIn', 'Source_Little Caesars', 'Source_Lululemon', 'Source_Luxury goods', 'Source_Manufacturing', 'Source_Microsoft', 'Source_Mining', 'Source_Motors', 'Source_Multiple', 'Source_Nascar, racing', 'Source_Netflix', 'Source_Netscape, investments', 'Source_New Balance', 'Source_New England Patriots', 'Source_Nike', 'Source_Nutella, chocolates', 'Source_Patagonia', 'Source_Petro Fibre', 'Source_Petro Firbe', 'Source_Philadelphia Eagles', 'Source_Quicken Loans', 'Source_Real Estate', 'Source_Real estate', 'Source_Red Bull', 'Source_Reebok', 'Source_SAP', 'Source_Samsung', 'Source_Sears', 'Source_Semiconductor materials', 'Source_Shipping', 'Source_Shoes', 'Source_Slim-Fast', 'Source_Smartphones', 'Source_Snapchat', 'Source_Spotify', 'Source_Starbucks', 'Source_TD Ameritrade', 'Source_TV broadcasting', 'Source_TV network, investments', 'Source_TV programs', 'Source_TV shows', 'Source_TV, movie production', 'Source_Tesla, SpaceX', 'Source_TikTok', 
'Source_Toyota dealerships', 'Source_Transportation', 'Source_Twitter, Square', 'Source_U-Haul', 'Source_Uber', 'Source_Urban Outfitters', 'Source_Waffle House', 'Source_Walmart', 'Source_Walmart, logistics', 'Source_Washington Football Team', 'Source_WeWork', 'Source_WhatsApp', 'Source_Yahoo', 'Source_Zara', 'Source_Zoom Video Communications', 'Source_accounting services', 'Source_adhesives', 'Source_advertising', 'Source_aerospace', 'Source_agribusiness', 'Source_agriculture', 'Source_agriculture, land', 'Source_agriculture, water', 'Source_agrochemicals', 'Source_air compressors', 'Source_aircraft leasing', 'Source_airline', 'Source_airlines', 'Source_airport', 'Source_airport management', 'Source_airports, investments', 'Source_alcohol', 'Source_alcohol, real estate', 'Source_aluminum', 'Source_aluminum products', 'Source_aluminum, diversified  ', 'Source_aluminum, utilities', 'Source_animal health, investments', 'Source_apparel', 'Source_appliances', 'Source_art', 'Source_art collection', 'Source_art, car dealerships', 'Source_asset management', 'Source_auto dealers, investments', 'Source_auto dealerships', 'Source_auto loans', 'Source_auto parts', 'Source_auto repair', 'Source_automobiles', 'Source_automobiles, batteries', 'Source_automotive', 'Source_automotive brakes', 'Source_automotive technology', 'Source_aviation', 'Source_bakeries', 'Source_banking', 'Source_banking, credit cards', 'Source_banking, insurance', 'Source_banking, insurance, media', 'Source_banking, investments', 'Source_banking, minerals', 'Source_banking, oil', 'Source_banking, property', 'Source_banking, real estate', 'Source_banking, tobacco', 'Source_banks, real estate', 'Source_bars', 'Source_batteries', 'Source_batteries, automobiles', 'Source_batteries, investments', 'Source_battery components', 'Source_beauty products', 'Source_beef packing', 'Source_beef processing', 'Source_beer', 'Source_beverages', 'Source_beverages, pharmaceuticals', 'Source_biochemicals', 
'Source_biopharmaceuticals', 'Source_biotech', 'Source_biotech investing', 'Source_biotech, investments', 'Source_biotechnology', 'Source_blockchain technology', 'Source_blockchain, technology', 'Source_book distribution, transportation', 'Source_brakes, investments', 'Source_brewery', 'Source_building materials', 'Source_business software', 'Source_cable', 'Source_cable TV, investments', 'Source_cable television', 'Source_call centers', 'Source_cameras, software', 'Source_candy', 'Source_candy, pet food', 'Source_car dealerships', 'Source_car rentals', 'Source_carbon fiber products', 'Source_carpet', 'Source_cars', 'Source_cashmere', 'Source_casinos', 'Source_casinos, banking', 'Source_casinos, hotels', 'Source_casinos, mixed martial arts', 'Source_casinos, property, energy', 'Source_casinos/hotels', 'Source_cement', 'Source_cement, sugar', 'Source_cheese', 'Source_chemical products', 'Source_chemicals', 'Source_chemicals, investments', 'Source_chemicals, logistics', 'Source_chemicals, spandex', 'Source_chewing gum', 'Source_chicken processing', 'Source_cleaning products', 'Source_clinical diagnostics', 'Source_clinical trials', 'Source_cloud communications', 'Source_cloud computing', 'Source_coal', 'Source_coal mines', 'Source_coal, fertilizers', 'Source_coal, investments', 'Source_cobalt', 'Source_coffee', 'Source_coffee, shipping', 'Source_coking', 'Source_commodities', 'Source_communication equipment', 'Source_communications', 'Source_computer hardware', 'Source_computer services, real estate', 'Source_computer services, telecom', 'Source_computer software', 'Source_conglomerate', 'Source_construction', 'Source_construction equipment', 'Source_construction equipment, media', 'Source_construction materials', 'Source_construction, investments', 'Source_construction, media', 'Source_construction, mining', 'Source_construction, mining machinery', 'Source_construction, pipes, banking', 'Source_construction, real estate', 'Source_consumer', 'Source_consumer 
electronics', 'Source_consumer goods', 'Source_consumer products, banking', 'Source_convenience stores', 'Source_convinience stores', 'Source_copper, poultry', 'Source_cosmetics', 'Source_cosmetics, reality TV', 'Source_cruises', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_cybersecurity', 'Source_dairy', 'Source_dairy & consumer products', 'Source_damaged cars', 'Source_data analytics', 'Source_data centers', 'Source_data management', 'Source_defense', 'Source_defense, hotels', 'Source_dental implants', 'Source_dental products', 'Source_department stores', 'Source_diagnostics', 'Source_diamond jewelry', 'Source_diamonds', 'Source_digital advertising', 'Source_discount brokerage', 'Source_diversified  ', 'Source_drilling, shipping', 'Source_drones', 'Source_drugs', 'Source_drugstores', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_e-commerce software', 'Source_eBay', 'Source_eBay, PayPal', 'Source_education', 'Source_education technology', 'Source_electric bikes, scooters', 'Source_electric components', 'Source_electric equipment', 'Source_electric scooters', 'Source_electric vehicles', 'Source_electrical equipment', 'Source_electrodes', 'Source_electronic components', 'Source_electronic trading', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_email marketing', 'Source_employment agency', 'Source_energy', 'Source_energy drink', 'Source_energy drinks', 'Source_energy drinks,investments', 'Source_energy services', 'Source_energy, banking, construction', 'Source_energy, chemicals', 'Source_energy, investments', 'Source_energy, real estate', 'Source_energy, sports', 'Source_engineering', 'Source_engineering, automotive', 'Source_engineering, construction', 'Source_entertainment', 'Source_executive search, investments', 'Source_express delivery', 'Source_fashion', 'Source_fashion investments', 'Source_fashion retail', 'Source_fashion retail, investments', 'Source_fashion retailer', 'Source_fast 
food', 'Source_fasteners', 'Source_feed', 'Source_fertilizer', 'Source_fertilizer, real estate', 'Source_fertilizers', 'Source_fiber optic cables', 'Source_finance', 'Source_finance and investments', 'Source_finance, real estate', 'Source_finance, telecommunications', 'Source_financial information', 'Source_financial services', 'Source_financial services, property', 'Source_financial services★', 'Source_financial technology', 'Source_fintech', 'Source_fitness equipment', 'Source_flavorings', 'Source_flavors and fragrances', 'Source_flipkart', 'Source_flooring', 'Source_food', 'Source_food & beverage retailing', 'Source_food delivery app', 'Source_food distribution', 'Source_food processing', 'Source_food service', 'Source_food services', 'Source_food, beverages', 'Source_foods', 'Source_footwear', 'Source_forestry, mining', 'Source_frozen foods', 'Source_furniture', 'Source_furniture retailing', 'Source_gambling', 'Source_gambling products', 'Source_gambling software', 'Source_game software', 'Source_gaming', 'Source_gas stations', 'Source_gas, chemicals', 'Source_generic drugs', 'Source_glass', 'Source_gold', 'Source_graphite electrodes', 'Source_grocery delivery service', 'Source_grocery stores', 'Source_hair care products', 'Source_hair dryers', 'Source_hair products, tequila', 'Source_hand tools', 'Source_hardware', 'Source_health IT', 'Source_health care', 'Source_health clinics', 'Source_health insurance', 'Source_healthcare', 'Source_healthcare services', 'Source_hearing aids', 'Source_heating and cooling equipment', 'Source_heating, cooling equipment', 'Source_hedge fund', 'Source_hedge funds', 'Source_herbal products', 'Source_high speed trading', 'Source_home appliances', 'Source_home building', 'Source_home building, banking', 'Source_home furnishings', 'Source_home improvement stores', 'Source_home sales', 'Source_home-cleaning robots', 'Source_homebuilder', 'Source_homebuilding', 'Source_homebuilding, insurance', 'Source_hospitals', 'Source_hospitals, 
health care', 'Source_hospitals, health insurance', 'Source_hotels', 'Source_hotels, diversified  ', 'Source_hotels, energy', 'Source_hotels, investments', 'Source_hotels, motels', 'Source_household chemicals', 'Source_hydraulic machinery', 'Source_industrial equipment', 'Source_industrial explosives', 'Source_industrial lasers', 'Source_industrial machinery', 'Source_infant formula', 'Source_information technology', 'Source_infrastructure', 'Source_infrastructure, commodities', 'Source_insurance', 'Source_insurance, NFL team', 'Source_insurance, beverages', 'Source_insurance, investments', 'Source_internet', 'Source_internet media', 'Source_internet search', 'Source_internet service provider', 'Source_investing', 'Source_investment', 'Source_investments', 'Source_investments, art', 'Source_investments, energy', 'Source_investments, real estate', 'Source_jewellery', 'Source_jewelry', 'Source_kitchen appliances', 'Source_laboratory services', 'Source_leveraged buyouts', 'Source_lighting', 'Source_lighting installations', 'Source_liquefied natural gas', 'Source_liquor', 'Source_lithium', 'Source_lithium batteries', 'Source_lithium battery', 'Source_lithium-ion battery cap', 'Source_live entertainment', 'Source_live streaming service', 'Source_logistics', 'Source_low-cost airlines', 'Source_luxury goods', 'Source_machine tools', 'Source_machinery', 'Source_magazines, media', 'Source_magnetic switches', 'Source_manufacturing', 'Source_manufacturing, investment', 'Source_manufacturing, investments', 'Source_mapping software', 'Source_materials', 'Source_measuring instruments', 'Source_meat processing', 'Source_media', 'Source_media, automotive', 'Source_media, investments', 'Source_media, real estate', 'Source_medical devices', 'Source_medical diagnostic equipment', 'Source_medical diagnostics', 'Source_medical equipment', 'Source_medical packaging', 'Source_medical patents', 'Source_medical products', 'Source_medical technology', 'Source_medical testing', 
'Source_messaging app', 'Source_metal processing', 'Source_metals', 'Source_metals, coal', 'Source_metals, energy', 'Source_metals, mining', 'Source_metalworking tools', 'Source_microbiology', 'Source_microchip testing', 'Source_mining', 'Source_mining, banking', 'Source_mining, banking, hotels', 'Source_mining, commodities', 'Source_mining, copper products', 'Source_mining, metals, machinery', 'Source_mobile games', 'Source_mobile gaming', 'Source_mobile payments', 'Source_mobile phone retailer', 'Source_mobile phones', 'Source_money management', 'Source_mortgage lender★', 'Source_motorcycle loans', 'Source_motorcycles', 'Source_motorhomes, RVs', 'Source_motors', 'Source_movie making', 'Source_movies, investments', 'Source_movies, record labels', 'Source_music, chemicals', 'Source_music, cosmetics', 'Source_music, sneakers', 'Source_mutual funds', 'Source_natural gas', 'Source_natural gas distribution', 'Source_natural gas, fertilizers', 'Source_navigation equipment', 'Source_newspapers, TV network', 'Source_nonferrous', 'Source_nutrition, wellness products', 'Source_office real estate', 'Source_oil', 'Source_oil & gas', 'Source_oil & gas, banking', 'Source_oil & gas, investments', 'Source_oil and gas', 'Source_oil and gas, IT, lotteries', 'Source_oil refinery', 'Source_oil, banking, telecom', 'Source_oil, gas', 'Source_oil, investments', 'Source_oil, real estate', 'Source_oilfield equipment', 'Source_online dating', 'Source_online gambling', 'Source_online games', 'Source_online games, investments', 'Source_online gaming', 'Source_online marketplace', 'Source_online media', 'Source_online media, Dallas Mavericks', 'Source_online payments', 'Source_online recruitment', 'Source_online retail', 'Source_online retailing', 'Source_online services', 'Source_optical components', 'Source_optometry', 'Source_orange juice', 'Source_package delivery', 'Source_packaged meats', 'Source_packaging', 'Source_paint', 'Source_paints', 'Source_palm oil', 'Source_palm oil, nickel 
mining', 'Source_palm oil, property', 'Source_palm oil, shipping, property', 'Source_paper', 'Source_paper & related products', 'Source_paper and pulp', 'Source_payment software', 'Source_payments software', 'Source_payments technology', 'Source_payments, banking', 'Source_payroll processing', 'Source_payroll software', 'Source_pearlescent pigments', 'Source_personal care goods', 'Source_pest control', 'Source_pet food', 'Source_petrochemicals', 'Source_petroleum, diversified  ', 'Source_phamaceuticals', 'Source_pharmaceutical', 'Source_pharmaceutical services', 'Source_pharmaceuticals', 'Source_pharmaceuticals, diversified  ', 'Source_pharmaceuticals, food', 'Source_pharmaceuticals, medical equipment', 'Source_pharmaceuticals, power', 'Source_pharmacies', 'Source_photovoltaic equipment', 'Source_photovoltaics', 'Source_pig breeding', 'Source_pipe manufacturing', 'Source_pipelines', 'Source_plastic', 'Source_plastics', 'Source_plumbing fixtures', 'Source_plush toys, real estate', 'Source_polyester', 'Source_ports', 'Source_poultry genetics', 'Source_poultry processing', 'Source_powdered metal', 'Source_power equipment', 'Source_power strip', 'Source_power supply equipment', 'Source_precision machinery', 'Source_price comparison website', 'Source_printed circuit boards', 'Source_printing', 'Source_private equity', 'Source_private equity★', 'Source_pro sports teams', 'Source_property, healthcare', 'Source_prosthetics', 'Source_publishing', 'Source_pulp and paper', 'Source_quartz products', 'Source_readymade garments', 'Source_real estate', 'Source_real estate developer', 'Source_real estate development', 'Source_real estate services', 'Source_real estate, airport', 'Source_real estate, construction', 'Source_real estate, diversified  ', 'Source_real estate, electronics', 'Source_real estate, gambling', 'Source_real estate, investments', 'Source_real estate, manufacturing', 'Source_real estate, media', 'Source_real estate, oil, cars, sports', 'Source_real estate, 
private equity', 'Source_real estate, retail', 'Source_real estate, shipping', 'Source_refinery, chemicals', 'Source_renewable energy', 'Source_restaurant', 'Source_restaurants', 'Source_retail', 'Source_retail, investments', 'Source_retail, media', 'Source_retail, real estate', 'Source_retailing', 'Source_roofing', 'Source_salsa', 'Source_sandwich chain', 'Source_satellite TV', 'Source_scaffolding, cement mixers', 'Source_scientific equipment', 'Source_security', 'Source_security services', 'Source_security software', 'Source_seed production', 'Source_semiconductor', 'Source_semiconductor devices', 'Source_semiconductors', 'Source_sensor systems', 'Source_sensor technology', 'Source_sensors', 'Source_sensors★', 'Source_shipbuilding', 'Source_shipping', 'Source_shipping, airlines', 'Source_shipping, seafood', 'Source_shoes', 'Source_shopping centers', 'Source_shopping malls', 'Source_silicon', 'Source_smartphone components', 'Source_smartphone screens', 'Source_smartphones', 'Source_snack bars', 'Source_snacks, beverages', 'Source_sneakers, sportswear', 'Source_social media', 'Source_social network', 'Source_soft drinks, fast food', 'Source_software', 'Source_software firm', 'Source_software services', 'Source_software, investments', 'Source_solar energy', 'Source_solar energy equipment', 'Source_solar equipment', 'Source_solar inverters', 'Source_solar panel components', 'Source_solar panel materials', 'Source_solar wafers and modules', 'Source_soy sauce', 'Source_specialty chemicals', 'Source_spirits', 'Source_sporting goods retail', 'Source_sports apparel', 'Source_sports data', 'Source_sports drink', 'Source_sports retailing', 'Source_sports team', 'Source_sports teams', 'Source_sports, real estate', 'Source_staffing & recruiting', 'Source_stationery', 'Source_steel', 'Source_steel pipes, diversified  ', 'Source_steel, coal', 'Source_steel, diversified  ', 'Source_steel, investments', 'Source_steel, telecom, investments', 'Source_steel, transport', 
'Source_stock brokerage', 'Source_stock exchange', 'Source_stock photos', 'Source_storage facilities', 'Source_sugar, ethanol', 'Source_sunglasses', 'Source_supermarkets', 'Source_tech investments', 'Source_technology', 'Source_telecom', 'Source_telecom services', 'Source_telecom, investments', 'Source_telecom, oil', 'Source_telecommunication', 'Source_telecommunications', 'Source_temp agency', 'Source_tequila', 'Source_textiles', 'Source_textiles, paper', 'Source_ticketing service', 'Source_tire', 'Source_tires', 'Source_tires, diversified  ', 'Source_tobacco', 'Source_tobacco distribution, retail', 'Source_toll roads', 'Source_touch screens', 'Source_tourism, cultural industry', 'Source_toys', 'Source_tractors', 'Source_trading, investments', 'Source_train cars', 'Source_transportation', 'Source_travel', 'Source_trucking', 'Source_two-wheelers, finance', 'Source_used cars', 'Source_utilities, diversified  ', 'Source_utilities, real estate', 'Source_vaccine & shoes', 'Source_vaccines', 'Source_valve manufacturing', 'Source_valves', 'Source_venture capital', 'Source_venture capital, Google', 'Source_video games', 'Source_video games, pachinko', 'Source_video streaming', 'Source_video streaming app', 'Source_video surveillance', 'Source_videogames', 'Source_vodka', 'Source_waste disposal', 'Source_web hosting', 'Source_wine', 'Source_wireless networking gear', 'Industry_Automotive ', 'Industry_Construction & Engineering ', 'Industry_Energy ', 'Industry_Fashion & Retail ', 'Industry_Finance & Investments ', 'Industry_Food & Beverage ', 'Industry_Gambling & Casinos ', 'Industry_Healthcare ', 'Industry_Logistics ', 'Industry_Manufacturing ', 'Industry_Media & Entertainment ', 'Industry_Metals & Mining ', 'Industry_Real Estate ', 'Industry_Service ', 'Industry_Sports ', 'Industry_Technology ', 'Industry_Telecom ', 'Industry_diversified   ', 'Age_binned', 'Networth_binned', 'age_to_networth', 'networth_to_age', 'age_squared', 'log_networth']
Columns in val_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified  ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified  ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foods', 'Source_furniture retailing', 'Source_garments', 'Source_gas stations, retail', 'Source_generic drugs', 'Source_glass', 'Source_greek yogurt', 'Source_gym equipment', 'Source_hardware stores', 'Source_health products', 'Source_healthcare IT', 'Source_hedge funds', 'Source_home appliances', 'Source_home furnishings', 'Source_homebuilding', 'Source_homebuilding, NFL team', 'Source_hospitals, health insurance', 'Source_hotels', 'Source_hotels, investments', 'Source_household chemicals', 'Source_hygiene products', 'Source_imaging systems', 'Source_insurance', 'Source_insurance, NFL team', 'Source_internet and software', 'Source_internet media', 'Source_internet, telecom', 'Source_investing', 'Source_investment banking', 'Source_investment research', 'Source_investments', 'Source_iron ore mining', 'Source_jewelry', 'Source_kitchen appliances', 'Source_laboratory services', 'Source_liquor', 'Source_lithium', 'Source_logistics', 'Source_logistics, baseball', 'Source_logistics, real estate', 'Source_luxury goods', 'Source_machine tools', 'Source_machinery', 'Source_manufacturing', 'Source_manufacturing, investments', 'Source_mattresses', 'Source_meat processing', 'Source_media', 'Source_media, automotive', 'Source_media, tech', 'Source_medical cosmetics', 'Source_medical devices', 'Source_medical equipment', 'Source_medical services', 'Source_messaging software', 'Source_metals', 'Source_metals, banking, fertilizers', 'Source_mining', 'Source_mining, commodities', 'Source_mining, metals', 'Source_mining, steel', 'Source_money management', 'Source_motorcycles', 'Source_movies', 'Source_movies, digital effects', 'Source_natural gas', 'Source_non-ferrous metals', 'Source_nutritional supplements', 'Source_oil', 'Source_oil & gas, investments', 'Source_oil refining', 'Source_oil trading', 'Source_oil, banking', 'Source_oil, gas', 'Source_oil, investments', 'Source_oil, real estate', 
'Source_oil, semiconductor', 'Source_online gambling', 'Source_online games', 'Source_online media', 'Source_online retail', 'Source_package delivery', 'Source_packaging', 'Source_paper', 'Source_paper manufacturing', 'Source_payment processing', 'Source_payroll services', 'Source_pet food', 'Source_petrochemicals', 'Source_pharma retailing', 'Source_pharmaceutical', 'Source_pharmaceutical ingredients', 'Source_pharmaceuticals', 'Source_pipelines', 'Source_plastic pipes', 'Source_poultry', 'Source_poultry breeding', 'Source_power strips', 'Source_precious metals, real estate', 'Source_printed circuit boards', 'Source_private equity', 'Source_publishing', 'Source_pulp and paper', 'Source_real estate', 'Source_real estate finance', 'Source_real estate, hotels', 'Source_real estate, investments', 'Source_record label', 'Source_refinery, chemicals', 'Source_restaurants', 'Source_retail', 'Source_retail & gas stations', 'Source_retail chain', 'Source_retail stores', 'Source_retail, agribusiness', 'Source_retail, investments', 'Source_rubber gloves', 'Source_security software', 'Source_self storage', 'Source_semiconductor', 'Source_semiconductors', 'Source_sensor systems', 'Source_shipping', 'Source_shoes', 'Source_smartphone screens', 'Source_smartphones', 'Source_software', 'Source_solar panels', 'Source_soy sauce', 'Source_sporting goods', 'Source_sports', 'Source_sports apparel', 'Source_staffing, Baltimore Ravens', 'Source_stationery', 'Source_steel', 'Source_steel production', 'Source_steel, diversified  ', 'Source_steel, mining', 'Source_supermarkets', 'Source_supermarkets, investments', 'Source_surveillance equipment', 'Source_technology', 'Source_telecom', 'Source_telecom, lotteries, insurance', 'Source_telecoms, media, oil-services', 'Source_testing equipment', 'Source_textile, chemicals', 'Source_textiles, apparel', 'Source_textiles, petrochemicals', 'Source_timberland, lumber mills', 'Source_titanium', 'Source_transport, logistics', 'Source_two-wheelers', 
'Source_used cars', 'Source_vaccines', 'Source_vacuums', 'Source_venture capital', 'Source_venture capital investing', 'Source_video games', 'Source_wedding dresses', 'Source_wind turbines', 'Source_winter jackets', 'Source_wire & cables, paints', 'Industry_Automotive ', 'Industry_Construction & Engineering ', 'Industry_Energy ', 'Industry_Fashion & Retail ', 'Industry_Finance & Investments ', 'Industry_Food & Beverage ', 'Industry_Gambling & Casinos ', 'Industry_Healthcare ', 'Industry_Logistics ', 'Industry_Manufacturing ', 'Industry_Media & Entertainment ', 'Industry_Metals & Mining ', 'Industry_Real Estate ', 'Industry_Service ', 'Industry_Sports ', 'Industry_Technology ', 'Industry_Telecom ', 'Industry_diversified   ', 'Age_binned', 'Networth_binned', 'age_to_networth', 'networth_to_age', 'age_squared', 'log_networth']
Columns in test_data_encoded: ['Rank ', 'Name', 'Networth', 'Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Cyprus', 'Country_Denmark', 'Country_Egypt', 'Country_Finland', 'Country_France', 'Country_Germany', 'Country_Greece', 'Country_Hong Kong', 'Country_Hungary', 'Country_India', 'Country_Indonesia', 'Country_Ireland', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Kazakhstan', 'Country_Lebanon', 'Country_Liechtenstein', 'Country_Malaysia', 'Country_Mexico', 'Country_Monaco', 'Country_New Zealand', 'Country_Norway', 'Country_Philippines', 'Country_Poland', 'Country_Romania', 'Country_Russia', 'Country_Singapore', 'Country_Slovakia', 'Country_South Africa', 'Country_South Korea', 'Country_Spain', 'Country_St. Kitts and Nevis', 'Country_Sweden', 'Country_Switzerland', 'Country_Taiwan', 'Country_Tanzania', 'Country_Thailand', 'Country_Turkey', 'Country_United Arab Emirates', 'Country_United Kingdom', 'Country_United States', 'Source_Airbnb', 'Source_Amazon', 'Source_Apple, Disney', 'Source_BMW', 'Source_Berkshire Hathaway', 'Source_Best Buy', 'Source_Cargill', 'Source_Chanel', 'Source_Cirque du Soleil', 'Source_Coca Cola Israel', 'Source_Electric power', 'Source_Estee Lauder', 'Source_Facebook', 'Source_FedEx', 'Source_Formula One', 'Source_Gap', 'Source_Google', 'Source_Hyundai', 'Source_IKEA', 'Source_Indianapolis Colts', 'Source_LG', 'Source_Manufacturing', 'Source_Marvel comics', 'Source_Microsoft', 'Source_Pinterest', 'Source_Real Estate', 'Source_Real estate', 'Source_Roku', 'Source_Samsung', 'Source_San Francisco 49ers', 'Source_Snapchat', 'Source_Spanx', 'Source_Spotify', 'Source_Star Wars', 'Source_TV broadcasting', 'Source_Twitter', 'Source_Uber', 'Source_Under Armour', 'Source_Virgin', 'Source_Walmart', 'Source_acoustic components', 'Source_adhesives', 'Source_aerospace', 'Source_agribusiness', 'Source_airports, real estate', 'Source_alcoholic 
beverages', 'Source_aluminum products', 'Source_amusement parks', 'Source_appliances', 'Source_art collection', 'Source_asset management', 'Source_auto loans', 'Source_auto parts', 'Source_bakery chain', 'Source_banking', 'Source_banking, minerals', 'Source_batteries', 'Source_beer', 'Source_beer distribution', 'Source_beer, investments', 'Source_beverages', 'Source_billboards, Los Angeles Angels', 'Source_biomedical products', 'Source_biopharmaceuticals', 'Source_biotech', 'Source_budget airline', 'Source_building materials', 'Source_business software', 'Source_call centers', 'Source_candy, pet food', 'Source_car dealerships', 'Source_casinos', 'Source_casinos, real estate', 'Source_cement', 'Source_cement, diversified  ', 'Source_chemical', 'Source_chemicals', 'Source_cloud computing', 'Source_cloud storage service', 'Source_coal', 'Source_coffee', 'Source_coffee makers', 'Source_commodities', 'Source_commodities, investments', 'Source_computer games', 'Source_computer networking', 'Source_computer software', 'Source_construction', 'Source_consumer goods', 'Source_consumer products', 'Source_cooking appliances', 'Source_copper, education', 'Source_copy machines, software', 'Source_cosmetics', 'Source_cryptocurrency', 'Source_cryptocurrency exchange', 'Source_damaged cars', 'Source_data centers', 'Source_defense contractor', 'Source_dental materials', 'Source_diversified  ', 'Source_drug distribution', 'Source_e-cigarettes', 'Source_e-commerce', 'Source_eBay', 'Source_ecommerce', 'Source_edible oil', 'Source_edtech', 'Source_education', 'Source_electrical equipment', 'Source_electronics', 'Source_electronics components', 'Source_elevators, escalators', 'Source_energy services', 'Source_entertainment', 'Source_eyeglasses', 'Source_fashion retail', 'Source_fast fashion', 'Source_finance', 'Source_finance services', 'Source_financial services', 'Source_fine jewelry', 'Source_fintech', 'Source_fish farming', 'Source_flavorings', 'Source_food', 'Source_food delivery 
service', 'Source_food manufacturing', 'Source_forestry, mining', 'Source_frozen foods', 'Source_furniture retailing', 'Source_garments', 'Source_gas stations, retail', 'Source_generic drugs', 'Source_glass', 'Source_greek yogurt', 'Source_gym equipment', 'Source_hardware stores', 'Source_health products', 'Source_healthcare IT', 'Source_hedge funds', 'Source_home appliances', 'Source_home furnishings', 'Source_homebuilding', 'Source_homebuilding, NFL team', 'Source_hospitals, health insurance', 'Source_hotels', 'Source_hotels, investments', 'Source_household chemicals', 'Source_hygiene products', 'Source_imaging systems', 'Source_insurance', 'Source_insurance, NFL team', 'Source_internet and software', 'Source_internet media', 'Source_internet, telecom', 'Source_investing', 'Source_investment banking', 'Source_investment research', 'Source_investments', 'Source_iron ore mining', 'Source_jewelry', 'Source_kitchen appliances', 'Source_laboratory services', 'Source_liquor', 'Source_lithium', 'Source_logistics', 'Source_logistics, baseball', 'Source_logistics, real estate', 'Source_luxury goods', 'Source_machine tools', 'Source_machinery', 'Source_manufacturing', 'Source_manufacturing, investments', 'Source_mattresses', 'Source_meat processing', 'Source_media', 'Source_media, automotive', 'Source_media, tech', 'Source_medical cosmetics', 'Source_medical devices', 'Source_medical equipment', 'Source_medical services', 'Source_messaging software', 'Source_metals', 'Source_metals, banking, fertilizers', 'Source_mining', 'Source_mining, commodities', 'Source_mining, metals', 'Source_mining, steel', 'Source_money management', 'Source_motorcycles', 'Source_movies', 'Source_movies, digital effects', 'Source_natural gas', 'Source_non-ferrous metals', 'Source_nutritional supplements', 'Source_oil', 'Source_oil & gas, investments', 'Source_oil refining', 'Source_oil trading', 'Source_oil, banking', 'Source_oil, gas', 'Source_oil, investments', 'Source_oil, real estate', 
'Source_oil, semiconductor', 'Source_online gambling', 'Source_online games', 'Source_online media', 'Source_online retail', 'Source_package delivery', 'Source_packaging', 'Source_paper', 'Source_paper manufacturing', 'Source_payment processing', 'Source_payroll services', 'Source_pet food', 'Source_petrochemicals', 'Source_pharma retailing', 'Source_pharmaceutical', 'Source_pharmaceutical ingredients', 'Source_pharmaceuticals', 'Source_pipelines', 'Source_plastic pipes', 'Source_poultry', 'Source_poultry breeding', 'Source_power strips', 'Source_precious metals, real estate', 'Source_printed circuit boards', 'Source_private equity', 'Source_publishing', 'Source_pulp and paper', 'Source_real estate', 'Source_real estate finance', 'Source_real estate, hotels', 'Source_real estate, investments', 'Source_record label', 'Source_refinery, chemicals', 'Source_restaurants', 'Source_retail', 'Source_retail & gas stations', 'Source_retail chain', 'Source_retail stores', 'Source_retail, agribusiness', 'Source_retail, investments', 'Source_rubber gloves', 'Source_security software', 'Source_self storage', 'Source_semiconductor', 'Source_semiconductors', 'Source_sensor systems', 'Source_shipping', 'Source_shoes', 'Source_smartphone screens', 'Source_smartphones', 'Source_software', 'Source_solar panels', 'Source_soy sauce', 'Source_sporting goods', 'Source_sports', 'Source_sports apparel', 'Source_staffing, Baltimore Ravens', 'Source_stationery', 'Source_steel', 'Source_steel production', 'Source_steel, diversified  ', 'Source_steel, mining', 'Source_supermarkets', 'Source_supermarkets, investments', 'Source_surveillance equipment', 'Source_technology', 'Source_telecom', 'Source_telecom, lotteries, insurance', 'Source_telecoms, media, oil-services', 'Source_testing equipment', 'Source_textile, chemicals', 'Source_textiles, apparel', 'Source_textiles, petrochemicals', 'Source_timberland, lumber mills', 'Source_titanium', 'Source_transport, logistics', 'Source_two-wheelers', 
'Source_used cars', 'Source_vaccines', 'Source_vacuums', 'Source_venture capital', 'Source_venture capital investing', 'Source_video games', 'Source_wedding dresses', 'Source_wind turbines', 'Source_winter jackets', 'Source_wire & cables, paints', 'Industry_Automotive ', 'Industry_Construction & Engineering ', 'Industry_Energy ', 'Industry_Fashion & Retail ', 'Industry_Finance & Investments ', 'Industry_Food & Beverage ', 'Industry_Gambling & Casinos ', 'Industry_Healthcare ', 'Industry_Logistics ', 'Industry_Manufacturing ', 'Industry_Media & Entertainment ', 'Industry_Metals & Mining ', 'Industry_Real Estate ', 'Industry_Service ', 'Industry_Sports ', 'Industry_Technology ', 'Industry_Telecom ', 'Industry_diversified   ', 'Age_binned', 'Networth_binned', 'age_to_networth', 'networth_to_age', 'age_squared', 'log_networth']
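The printed column lists do not match across splits: the list that precedes `val_data_encoded` contains, for example, `'Source_fasteners'` and `'Source_wireless networking gear'`, which are absent from the `val_data_encoded` list, while the `val_data_encoded` and `test_data_encoded` lists appear identical. This is the usual result of one-hot encoding each split separately. A minimal sketch of one way to bring a split onto a reference column layout, assuming the first list belongs to the training split; the toy frames and the names `train`, `val`, and `val_aligned` are illustrative and not taken from the notebook:

```python
import pandas as pd

# Toy stand-ins for the notebook's splits (hypothetical data, not the real CSV).
train = pd.DataFrame({
    "Age": [50, 58, 73],
    "Industry": ["Automotive", "Technology", "Fashion & Retail"],
})
val = pd.DataFrame({
    "Age": [66, 91],
    "Industry": ["Technology", "Finance & Investments"],
})

# Encoding each split on its own yields different dummy columns per split.
train_encoded = pd.get_dummies(train, columns=["Industry"])
val_encoded = pd.get_dummies(val, columns=["Industry"])

# Reindex the validation frame to the training layout:
# categories unseen in train are dropped, categories missing in val
# are added as all-False columns.
val_aligned = val_encoded.reindex(columns=train_encoded.columns, fill_value=False)

print(list(val_aligned.columns))
```

After alignment both frames expose the same features in the same order, which a model fitted on the training split would expect. Note that `fill_value=False` is only appropriate for the dummy columns; a missing numeric column would need a different default.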
Empty DataFrame
Columns: [Rank , Name, Networth, Age, LogNetworth, Networth_scaled, Country_Algeria, Country_Argentina, Country_Australia, Country_Austria, Country_Barbados, Country_Belgium, Country_Belize, Country_Brazil, Country_Bulgaria, Country_Canada, Country_Chile, Country_China, Country_Colombia, Country_Cyprus, Country_Czechia, Country_Denmark, Country_Egypt, Country_Estonia, Country_Eswatini (Swaziland), Country_Finland, Country_France, Country_Georgia, Country_Germany, Country_Greece, Country_Guernsey, Country_Hong Kong, Country_Hungary, Country_Iceland, Country_India, Country_Indonesia, Country_Ireland, Country_Israel, Country_Italy, Country_Japan, Country_Kazakhstan, Country_Lebanon, Country_Macau, Country_Malaysia, Country_Mexico, Country_Monaco, Country_Morocco, Country_Nepal, Country_Netherlands, Country_New Zealand, Country_Nigeria, Country_Norway, Country_Oman, Country_Peru, Country_Philippines, Country_Poland, Country_Portugal, Country_Qatar, Country_Romania, Country_Russia, Country_Singapore, Country_Slovakia, Country_South Africa, Country_South Korea, Country_Spain, Country_Sweden, Country_Switzerland, Country_Taiwan, Country_Thailand, Country_Turkey, Country_Ukraine, Country_United Arab Emirates, Country_United Kingdom, Country_United States, Country_Uruguay, Country_Venezuela, Country_Vietnam, Country_Zimbabwe, Source_3D printing, Source_AOL, Source_Airbnb, Source_Aldi, Trader Joe's, Source_Aluminium, Source_Amazon, Source_Apple, Source_BMW, pharmaceuticals, Source_Banking, Source_Berkshire Hathaway, Source_Bloomberg LP, Source_Campbell Soup, Source_Cargill, Source_Carnival Cruises, Source_Chanel, Source_Charlotte Hornets, endorsements, Source_Chemicals, Source_Chick-fil-A, Source_Coca Cola Israel, Source_Coca-Cola bottler, Source_Columbia Sportswear, Source_Comcast, ...]
Index: []

[0 rows x 869 columns]
c:\Users\goldfest\AppData\Local\Programs\Python\Python312\Lib\site-packages\featuretools\entityset\entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column
  warnings.warn(
c:\Users\goldfest\AppData\Local\Programs\Python\Python312\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
c:\Users\goldfest\AppData\Local\Programs\Python\Python312\Lib\site-packages\featuretools\synthesis\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created
  warnings.warn(
    Rank   Networth  Age  Country_Algeria  Country_Argentina  \
id                                                             
0       1     219.0   50            False              False   
1       2     171.0   58            False              False   
2       3     158.0   73            False              False   
3       4     129.0   66            False              False   
4       5     118.0   91            False              False   

    Country_Australia  Country_Austria  Country_Barbados  Country_Belgium  \
id                                                                          
0               False            False             False            False   
1               False            False             False            False   
2               False            False             False            False   
3               False            False             False            False   
4               False            False             False            False   

    Country_Belize  ...  Industry_Sports   Industry_Technology   \
id                  ...                                           
0            False  ...             False                 False   
1            False  ...             False                  True   
2            False  ...             False                 False   
3            False  ...             False                  True   
4            False  ...             False                 False   

    Industry_Telecom   Industry_diversified     Age_binned  Networth_binned  \
id                                                                            
0               False                    False           1                4   
1               False                    False           2                3   
2               False                    False           3                3   
3               False                    False           2                2   
4               False                    False           4                2   

    age_to_networth  networth_to_age  age_squared  log_networth  
id                                                               
0          0.228311         4.380000         2500      5.389072  
1          0.339181         2.948276         3364      5.141664  
2          0.462025         2.164384         5329      5.062595  
3          0.511628         1.954545         4356      4.859812  
4          0.771186         1.296703         8281      4.770685  

[5 rows x 997 columns]
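The four derived columns in the preview are consistent with simple arithmetic on `Age` and `Networth`: for the first row, 50 / 219 ≈ 0.228311, 219 / 50 = 4.38, 50² = 2500, and ln(219) ≈ 5.389072. The sketch below reproduces them from the five printed rows. The formulas are inferred from the printed values, not copied from the notebook's code, so the actual cell may be written differently:

```python
import numpy as np
import pandas as pd

# The five rows shown in the preview above.
df = pd.DataFrame({
    "Networth": [219.0, 171.0, 158.0, 129.0, 118.0],
    "Age": [50, 58, 73, 66, 91],
})

# Ratio features in both directions.
df["age_to_networth"] = df["Age"] / df["Networth"]
df["networth_to_age"] = df["Networth"] / df["Age"]

# Polynomial and log transforms; the printed values match the natural log.
df["age_squared"] = df["Age"] ** 2
df["log_networth"] = np.log(df["Networth"])

print(df.round(6))
```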
c:\Users\goldfest\AppData\Local\Programs\Python\Python312\Lib\site-packages\featuretools\computational_backends\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat([df, default_df], sort=True)
c:\Users\goldfest\AppData\Local\Programs\Python\Python312\Lib\site-packages\woodwork\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  series = series.replace(ww.config.get_option("nan_values"), np.nan)

Evaluating the quality of each feature set

In [26]:
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

# Assumes df has already been defined and loaded

# Separate the features from the target variable
X = df.drop('Networth', axis=1)
y = df['Networth']

# One-hot encode the categorical variables (convert categorical columns to numeric ones)
X = pd.get_dummies(X, drop_first=True)

# Fill any missing values with the column median
X.fillna(X.median(), inplace=True)

# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Regularized linear model (Ridge)
model = Ridge()

# Tune the regularization strength with GridSearchCV
param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0]}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')

# Start the timer
start_time = time.time()
grid_search.fit(X_train, y_train)

# Time spent on the grid search (all fits plus the final refit)
train_time = time.time() - start_time

# Best model found by the search
best_model = grid_search.best_estimator_

# Predict on the validation set and evaluate the model
val_predictions = best_model.predict(X_val)
mse = mean_squared_error(y_val, val_predictions)
r2 = r2_score(y_val, val_predictions)

print(f'Model training time: {train_time:.2f} seconds')
print(f'Mean squared error: {mse:.2f}')
print(f'Coefficient of determination (R²): {r2:.2f}')

# Plot predicted against actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_val, val_predictions, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual net worth')
plt.ylabel('Predicted net worth')
plt.title('Actual vs. predicted net worth')
plt.show()
Model training time: 11.98 seconds
Mean squared error: 17.43
Coefficient of determination (R²): 0.27
[Figure: scatter plot of actual vs. predicted net worth on the validation set, with a dashed diagonal marking perfect predictions]
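The cell above fits `StandardScaler` on all rows before calling `train_test_split`, so the validation rows contribute to the scaling statistics. One possible way to avoid this is to split first and put the scaler inside a `Pipeline`, so that `GridSearchCV` refits it on the training part of every fold. The sketch below is only an illustration: the `demo` frame and its values are synthetic stand-ins invented for this example, not the Forbes data, and the results will not match the numbers printed above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (hypothetical values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.integers(30, 90, size=200),
    "Industry": rng.choice(["Technology", "Finance", "Retail"], size=200),
})
demo["Networth"] = 1.0 + 0.05 * demo["Age"] + rng.normal(0.0, 1.0, size=200)

X = pd.get_dummies(demo.drop("Networth", axis=1), drop_first=True).astype(float)
y = demo["Networth"]

# Split first, so the validation rows never influence the scaler
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is a pipeline step, so it is refit on the training part of each CV fold
pipeline = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
# With this scoring argument, score() returns the negative MSE
val_neg_mse = search.score(X_val, y_val)
print(f"best alpha: {best_alpha}, validation MSE: {-val_neg_mse:.2f}")
```

The `ridge__alpha` key follows scikit-learn's `<step name>__<parameter>` convention for addressing parameters of pipeline steps.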

Conclusions

The Ridge regression model (`Ridge`, with `alpha` tuned by `GridSearchCV`) gave acceptable baseline results when predicting net worth (`Networth`). The quality metrics and cross-validation suggest that the model is not heavily overfitted, so it can serve as a starting point for practical use.

Prediction accuracy: The coefficient of determination (R²) is 0.27, so the model explains only about 27% of the variance in the target (net worth). The mean squared error (MSE) is 17.43, which corresponds to an RMSE of about 4.17 in the units of `Networth`. The error remains high: the model does not always predict values accurately, especially for records with very high or very low net worth.
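The value 17.43 printed by the cell comes from `mean_squared_error`, so it is an MSE in squared units of `Networth`. Taking the square root gives the RMSE, which is on the same scale as the target and easier to interpret. A minimal check:

```python
import math

mse = 17.43  # value printed by the evaluation cell
rmse = math.sqrt(mse)  # back to the units of the target
print(f"RMSE: {rmse:.2f}")  # prints "RMSE: 4.17"
```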

Overfitting: The difference between RMSE on the training and test sets is small, which indicates that the model is not prone to overfitting. This gap should still be monitored when new features are added or the model is made more complex, to avoid fitting the training data too closely.
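The evaluation cell reports metrics for the validation set only. A sketch of how the training and validation RMSE could be compared side by side is shown below; it uses synthetic data and a fixed `alpha=1.0` as assumptions, so its numbers say nothing about the Forbes dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0.0, 1.0, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# RMSE on data the model has seen vs. data it has not
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_val = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

# Relative gap: a large positive value is a sign of overfitting
gap = (rmse_val - rmse_train) / rmse_train
print(f"train RMSE: {rmse_train:.3f}, validation RMSE: {rmse_val:.3f}, gap: {gap:+.1%}")
```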

Cross-validation: Under cross-validation the RMSE is slightly higher than on the test set (by 2-3%). This may point to mild instability of the model across different subsets of the data. Further hyperparameter tuning may make the model more stable.
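Per-fold RMSE values can be obtained with `cross_val_score` and the `neg_root_mean_squared_error` scorer; their spread gives a rough picture of how stable the model is across subsets. The following is a sketch on synthetic data with an assumed `alpha=1.0`, not a reproduction of the notebook's run.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0.0, 1.0, size=300)

# scikit-learn scorers are "higher is better", so RMSE comes back negated
scores = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores

print("RMSE per fold:", np.round(rmse_per_fold, 3))
print(f"mean: {rmse_per_fold.mean():.3f}, std: {rmse_per_fold.std():.3f}")
```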

Recommendations: Improve the handling of categorical features and the feature engineering, and consider further optimizing the model (for example, through hyperparameter tuning) to increase accuracy on extreme values.
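One option that may help with extreme values is to fit the model on a log-transformed target and convert predictions back automatically. The sketch below uses scikit-learn's `TransformedTargetRegressor` with `np.log1p` and `np.expm1`. This is a suggestion rather than something the notebook does, and the data is synthetic and right-skewed by construction.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic, positive, right-skewed target (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.exp(1.0 + 0.8 * X[:, 0] + rng.normal(0.0, 0.3, size=200))

# Ridge is trained on log1p(y); predictions are mapped back with expm1
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
pred = model.predict(X)  # already on the original scale of y

print(f"largest actual: {y.max():.2f}, largest predicted: {pred.max():.2f}")
```

Whether this improves the metrics on the real data would need to be checked with the same validation split.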

Training time: The grid search (4 values of `alpha` × 5 folds, plus the final refit) took 11.98 seconds, which is acceptable for a dataset of this size.