MII/lec3.ipynb at 03e83b03126873e0223b0f0b2deda2e7b490c36f

gaillard/MII

antoc0der 03e83b0312 3 no comments

2024-10-24 21:50:12 +04:00

One-hot encoding

Converting a categorical feature into several binary features
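As a minimal sketch of the idea, the snippet below one-hot encodes a toy column with `pd.get_dummies`. The `Continent` column and its values are hypothetical and only illustrate the transformation; `drop_first=True` mirrors the idea behind `OneHotEncoder(drop="first")`.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"Continent": ["Asia", "Europe", "Asia", "Africa"]})

# One binary column per category value; each row has exactly one 1.
encoded = pd.get_dummies(df["Continent"], prefix="Continent", dtype=int)

# Dropping the first category removes a column that the others already imply.
encoded_reduced = pd.get_dummies(
    df["Continent"], prefix="Continent", dtype=int, drop_first=True
)

print(encoded)
print(encoded_reduced)
```

With `drop_first=True`, a row of all zeros stands for the dropped category.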

Loading the dataset and converting the data to numeric format

In [ ]:

import pandas as pd

countries = pd.read_csv(
    "data/population.csv", index_col="no"
)
# convert the data to numeric format by removing the comma thousands separators
countries["Population 2020"] = countries["Population 2020"].apply(
    lambda x: int("".join(x.split(",")))
)
countries["Net Change"] = countries["Net Change"].apply(
    lambda x: int("".join(x.split(",")))
)
countries["Yearly Change"] = countries["Yearly Change"].apply(
    lambda x: float(x.rstrip("%"))
)
countries["Land Area (Km²)"] = countries["Land Area (Km²)"].apply(
    lambda x: int("".join(x.split(",")))
)
countries
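The same conversions can also be written with vectorized string methods instead of `apply` with a lambda. The sketch below uses two hypothetical rows; the exact raw formatting of the percentage column (no space before `%`) is an assumption.

```python
import pandas as pd

# Hypothetical rows that imitate the formatting of the raw CSV columns.
raw = pd.DataFrame(
    {
        "Population 2020": ["1,439,323,776", "801"],
        "Yearly Change": ["0.39%", "0.25%"],
    }
)

# Remove the comma separators, then cast to integers.
population = (
    raw["Population 2020"].str.replace(",", "", regex=False).astype("int64")
)

# Strip the trailing percent sign, then cast to floats.
yearly_change = raw["Yearly Change"].str.rstrip("%").astype(float)

print(population.tolist(), yearly_change.tolist())
```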

One-hot encoding of the Sex and Embarked (port of embarkation) features

Encoding

In [12]:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# encoder = OneHotEncoder(sparse_output=False, drop="first")

# encoded_values = encoder.fit_transform(titanic[["Embarked", "Sex"]])

# encoded_columns = encoder.get_feature_names_out(["Embarked", "Sex"])

# encoded_values_df = pd.DataFrame(encoded_values, columns=encoded_columns)

# encoded_values_df

Adding the features to the original DataFrame

In [37]:

# titanic = pd.concat([titanic, encoded_values_df], axis=1)

# titanic

Feature discretization

Equal-width split of the data into 3 groups. The first output is the bin edges for land area; the second is the number of countries in each group.

In [10]:

labels = ["Small", "Middle", "Big"]
num_bins = 3

In [ ]:

hist1, bins1 = np.histogram(
    countries["Land Area (Km²)"].fillna(countries["Land Area (Km²)"].median()), bins=num_bins
)
bins1, hist1

In [ ]:

pd.concat(
    [
        countries["Country (or dependency)"],
        countries["Land Area (Km²)"],
        pd.cut(countries["Land Area (Km²)"], list(bins1)),
    ],
    axis=1,
).head(20)

In [ ]:

pd.concat(
    [
        countries["Country (or dependency)"],
        countries["Land Area (Km²)"],
        pd.cut(countries["Land Area (Km²)"], list(bins1), labels=labels),
    ],
    axis=1,
).head(20)
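One detail worth checking when bin edges from `np.histogram` are passed to `pd.cut`: `pd.cut` builds right-closed intervals by default, so a value equal to the lowest edge falls outside the first interval and becomes NaN. The sketch below shows this on hypothetical areas and how `include_lowest=True` changes it.

```python
import numpy as np
import pandas as pd

# Hypothetical areas; not taken from the dataset.
areas = pd.Series([0, 10, 20, 50, 90])

# Three equal-width bins over [0, 90]: edges are 0, 30, 60, 90.
counts, edges = np.histogram(areas, bins=3)

# Default intervals are (0, 30], (30, 60], (60, 90]; the value 0 is left out.
default_cut = pd.cut(areas, list(edges))

# include_lowest=True closes the first interval on the left as well.
closed_cut = pd.cut(areas, list(edges), include_lowest=True)

print(counts, edges)
print(default_cut.isna().sum(), closed_cut.isna().sum())
```

In the countries table this matters for any row whose land area equals the minimum edge.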

Equal-width split of the data into 3 groups with a custom value range (from 0 to 12,000,000)

In [ ]:

labels = ["Small", "Middle", "Big"]
bins2 = np.linspace(0, 12000000, 4)

tmp_bins2 = np.digitize(
    countries["Land Area (Km²)"].fillna(countries["Land Area (Km²)"].median()), bins2
)

hist2 = np.bincount(tmp_bins2 - 1)

bins2, hist2

In [ ]:

pd.concat(
    [
        countries["Country (or dependency)"],
        countries["Land Area (Km²)"],
        pd.cut(countries["Land Area (Km²)"], list(bins2)),
    ],
    axis=1,
).head(20)

In [ ]:

pd.concat(
    [
        countries["Country (or dependency)"],
        countries["Land Area (Km²)"],
        pd.cut(countries["Land Area (Km²)"], list(bins2), labels=labels),
    ],
    axis=1,
).head(20)
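The `np.digitize` plus `np.bincount` combination can be traced on a small array. This is a sketch with hypothetical values; it also shows that anything at or above the last edge lands in an extra, overflow bin.

```python
import numpy as np

values = np.array([5, 150, 450, 900, 1500])  # hypothetical values
edges = np.linspace(0, 1200, 4)              # 0, 400, 800, 1200

# np.digitize returns i such that edges[i-1] <= x < edges[i];
# values at or above the last edge get index len(edges).
bin_index = np.digitize(values, edges)

# Shift to zero-based bin numbers and count members per bin.
counts = np.bincount(bin_index - 1)

print(bin_index, counts)
```

Here `counts` has four entries for three intervals: the last one counts values beyond the chosen upper bound, and `pd.cut` with the same edges would return NaN for them.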

Splitting the data into 5 groups with custom intervals (0 to 1,000; 1,000 to 100,000; 100,000 to 500,000; 500,000 to 3,000,000; above 3,000,000)

In [ ]:

labels2 = ["Dwarf", "Small", "Middle", "Big", "Giant"]
hist3, bins3 = np.histogram(
    countries["Land Area (Km²)"].fillna(countries["Land Area (Km²)"].median()),
    bins=[0, 1000, 100000, 500000, 3000000, np.inf],
)


bins3, hist3

In [ ]:

pd.concat(
    [
        countries["Country (or dependency)"],
        countries["Land Area (Km²)"],
        pd.cut(countries["Land Area (Km²)"], list(bins3)),
    ],
    axis=1,
).head(20)

In [ ]:

pd.concat(
    [
        countries["Country (or dependency)"],
        countries["Land Area (Km²)"],
        pd.cut(countries["Land Area (Km²)"], list(bins3), labels=labels2),
    ],
    axis=1,
).head(20)

Quantile split of the data into 5 groups

In [ ]:

pd.concat(
    [
        countries["Country (or dependency)"],
        countries["Land Area (Km²)"],
        pd.qcut(countries["Land Area (Km²)"], q=5, labels=False),
    ],
    axis=1,
).head(20)

In [ ]:

pd.concat(
    [
        countries["Country (or dependency)"],
        countries["Land Area (Km²)"],
        pd.qcut(countries["Land Area (Km²)"], q=5, labels=labels2),
    ],
    axis=1,
).head(20)
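To see how a quantile split differs from an equal-width split, the sketch below applies both to a hypothetical series with one large outlier. `pd.qcut` balances the group sizes, while `pd.cut` with the same number of bins puts nearly everything into the first bin.

```python
import pandas as pd

# Hypothetical values with one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Quantile split: five groups with two values each.
quantile_codes = pd.qcut(values, q=5, labels=False)

# Equal-width split: the outlier stretches the range, so 1..9 share one bin.
width_codes = pd.cut(values, bins=5, labels=False)

print(quantile_codes.tolist())
print(width_codes.tolist())
```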

Example of constructing features from existing ones

Title: the passenger's form of address (Mr, Mrs, Miss)

Is_married: whether the woman is married

Cabin_type: deck (cabin type)

In [50]:

# titanic_cl = titanic.drop(
#     ["Embarked_Q", "Embarked_S", "Embarked_nan", "Sex_male"], axis=1, errors="ignore"
# )
# titanic_cl = titanic_cl.dropna()

# titanic_cl["Title"] = [
#     i.split(",")[1].split(".")[0].strip() for i in titanic_cl["Name"]
# ]

# titanic_cl["Is_married"] = [1 if i == "Mrs" else 0 for i in titanic_cl["Title"]]

# titanic_cl["Cabin_type"] = [i[0] for i in titanic_cl["Cabin"]]

# titanic_cl
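Since the cell above is commented out, here is a small runnable sketch of the same three derived features using vectorized string methods. The passengers are hypothetical and only follow the assumed "Surname, Title. Given names" layout of the Name column.

```python
import pandas as pd

# Hypothetical passengers; names and cabins are invented for illustration.
people = pd.DataFrame(
    {
        "Name": ["Smith, Mr. John", "Jones, Mrs. Anna", "Brown, Miss. Eve"],
        "Cabin": ["C85", "E46", "B28"],
    }
)

# Title: the text between the comma and the first period.
people["Title"] = (
    people["Name"].str.split(",").str[1].str.split(".").str[0].str.strip()
)

# Is_married: 1 when the title is "Mrs", otherwise 0.
people["Is_married"] = (people["Title"] == "Mrs").astype(int)

# Cabin_type: the first character of the cabin code (the deck letter).
people["Cabin_type"] = people["Cabin"].str[0]

print(people)
```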

Example of using the Featuretools library for automated feature construction (synthesis)

https://featuretools.alteryx.com/en/stable/getting_started/using_entitysets.html

Loading the data

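The original example was based on the "Ecommerce Orders Data Set" from Kaggle, using only the first 100 orders and the objects related to them. The code below instead loads three local tables: data/population.csv, data/forcast.csv, and data/country.csv.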

https://www.kaggle.com/datasets/sangamsharmait/ecommerce-orders-data-analysis

In [32]:

import featuretools as ft
from woodwork.logical_types import Categorical, Datetime

info = pd.read_csv("data/population.csv")
forcast = pd.read_csv("data/forcast.csv")
capitals = pd.read_csv("data/country.csv", encoding="ISO-8859-1")
forcast["Population"] = forcast["Population"].apply(
    lambda x: int("".join(x.split(",")))
)
forcast["YearlyPer"] = forcast["YearlyPer"].apply(
    lambda x: float(x.rstrip("%"))
)
forcast["Yearly"] = forcast["Yearly"].apply(
    lambda x: int("".join(x.split(",")))
)
info = info.drop(
    ["Migrants (net)", "Fert. Rate", "MedAge", "Urban Pop %", "World Share"], axis=1
)
info["Population 2020"] = info["Population 2020"].apply(
    lambda x: int("".join(x.split(",")))
)
info["Yearly Change"] = info["Yearly Change"].apply(
    lambda x: float(x.rstrip("%"))
)
info["Net Change"] = info["Net Change"].apply(
    lambda x: int("".join(x.split(",")))
)
info["Land Area (Km²)"] = info["Land Area (Km²)"].apply(
    lambda x: int("".join(x.split(",")))
)

info, forcast, capitals

Out[32]:

(      no Country (or dependency)  Population 2020  Yearly Change  Net Change  \
 0      1                   China       1439323776           0.39     5540090   
 1      2                   India       1380004385           0.99    13586631   
 2      3           United States        331002651           0.59     1937734   
 3      4               Indonesia        273523615           1.07     2898047   
 4      5                Pakistan        220892340           2.00     4327022   
 ..   ...                     ...              ...            ...         ...   
 230  231              Montserrat             4992           0.06           3   
 231  232        Falkland Islands             3480           3.05         103   
 232  233                    Niue             1626           0.68          11   
 233  234                 Tokelau             1357           1.27          17   
 234  235                Holy See              801           0.25           2   
 
     Density(P/Km²)  Land Area (Km²)  
 0              153          9388211  
 1              464          2973190  
 2               36          9147420  
 3              151          1811570  
 4              287           770880  
 ..             ...              ...  
 230             50              100  
 231              0            12170  
 232              6              260  
 233            136               10  
 234          2,003                0  
 
 [235 rows x 7 columns],
    Year  Population  YearlyPer    Yearly  Median  Fertility  Density
 0  2020  7794798739       1.10  83000320      31       2.47       52
 1  2025  8184437460       0.98  77927744      32       2.54       55
 2  2030  8548487400       0.87  72809988      33       2.62       57
 3  2035  8887524213       0.78  67807363      34       2.70       60
 4  2040  9198847240       0.69  62264605      35       2.77       62
 5  2045  9481803274       0.61  56591207      35       2.85       64
 6  2050  9735033990       0.53  50646143      36       2.95       65,
      Country/Territory           Capital Continent
 0          Afghanistan             Kabul      Asia
 1              Albania            Tirana    Europe
 2              Algeria           Algiers    Africa
 3       American Samoa         Pago Pago   Oceania
 4              Andorra  Andorra la Vella    Europe
 ..                 ...               ...       ...
 229  Wallis and Futuna          Mata-Utu   Oceania
 230     Western Sahara           El Aain    Africa
 231              Yemen             Sanaa      Asia
 232             Zambia            Lusaka    Africa
 233           Zimbabwe            Harare    Africa
 
 [234 rows x 3 columns])

Creating entities in featuretools

Adding the dataframes to an EntitySet, specifying for each one the entity (table) name, the primary key, and the logical types of its categorical attributes (including dates)

In [34]:

es = ft.EntitySet(id="countries")

es = es.add_dataframe(
    dataframe_name="countries",
    dataframe=info,
    index="no",
    logical_types={
        "Country (or dependency)": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="capitals",
    dataframe=capitals,
    index="Country/Territory",
    logical_types={
        "Country/Territory": Categorical,
        "Capital": Categorical,
        "Continent": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="forcast",
    dataframe=forcast,
    index="forcast_id",
    make_index=True,
    logical_types={
        "Year": Datetime,
    },
)

es

c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(

Out[34]:

Entityset: countries
  DataFrames:
    countries [Rows: 235, Columns: 7]
    capitals [Rows: 234, Columns: 3]
    forcast [Rows: 7, Columns: 8]
  Relationships:
    No relationships

Setting up relationships between featuretools entities

Relationships between tables are set up at the key level.

A relationship is specified from the parent to its children (parent table, primary key, child table, foreign key).

In [35]:

es = es.add_relationship(
    "capitals", "Country/Territory", "countries", "Country (or dependency)"
)

es

Out[35]:

Entityset: countries
  DataFrames:
    countries [Rows: 235, Columns: 7]
    capitals [Rows: 234, Columns: 3]
    forcast [Rows: 7, Columns: 8]
  Relationships:
    countries.Country (or dependency) -> capitals.Country/Territory

Automated feature construction with featuretools

The library applies various aggregation and transformation functions to the attributes of the countries table, taking the relationships into account.

The result is stored in the DataFrame feature_matrix.

In [36]:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="countries",
    max_depth=1,
)

feature_matrix

Out[36]:

	Country (or dependency)	Population 2020	Yearly Change	Net Change	Land Area (Km²)	capitals.Capital	capitals.Continent
no
1	China	1439323776	0.39	5540090	9388211	Beijing	Asia
2	India	1380004385	0.99	13586631	2973190	New Delhi	Asia
3	United States	331002651	0.59	1937734	9147420	Washington, D.C.	North America
4	Indonesia	273523615	1.07	2898047	1811570	Jakarta	Asia
5	Pakistan	220892340	2.00	4327022	770880	Islamabad	Asia
...	...	...	...	...	...	...	...
231	Montserrat	4992	0.06	3	100	Brades	North America
232	Falkland Islands	3480	3.05	103	12170	Stanley	South America
233	Niue	1626	0.68	11	260	Alofi	Oceania
234	Tokelau	1357	1.27	17	10	Nukunonu	Oceania
235	Holy See	801	0.25	2	0	NaN	NaN

235 rows × 7 columns
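With `max_depth=1` and a single relationship, the only new columns here are lookups from the parent table (`capitals.Capital`, `capitals.Continent`). In effect this resembles a left join of the child table onto the parent key, which is also why a country without a matching capital row gets NaN. The sketch below reproduces that effect in plain pandas on hypothetical miniature tables; it is an analogy, not how featuretools works internally.

```python
import pandas as pd

# Hypothetical miniature versions of the child (countries) and parent (capitals) tables.
countries_small = pd.DataFrame(
    {"no": [1, 2], "Country (or dependency)": ["China", "Holy See"]}
)
capitals_small = pd.DataFrame(
    {"Country/Territory": ["China"], "Capital": ["Beijing"], "Continent": ["Asia"]}
)

# Keep every child row and look up parent attributes by key.
joined = countries_small.merge(
    capitals_small,
    how="left",
    left_on="Country (or dependency)",
    right_on="Country/Territory",
).drop(columns="Country/Territory")

print(joined)
```

A key that is spelled differently in the two tables would likewise produce NaN, so unmatched rows are worth inspecting before modelling.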

Resulting features

List of the columns of the resulting dataframe

In [37]:

feature_defs

Out[37]:

[<Feature: Country (or dependency)>,
 <Feature: Population 2020>,
 <Feature: Yearly Change>,
 <Feature: Net Change>,
 <Feature: Land Area (Km²)>,
 <Feature: capitals.Capital>,
 <Feature: capitals.Continent>]

Clipping feature values

Detecting outliers with a boxplot

In [38]:

countries.boxplot(column="Population 2020")

Out[38]:

<Axes: >

[Figure: boxplot of the Population 2020 column]
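A boxplot with default settings draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and marks points outside them as outliers. The same rule can be computed directly; the sample below is hypothetical.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical sample

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]

print(lower, upper, outliers)
```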

Clipping values of the Population 2020 feature that exceed 50,000,000

In [40]:

countries_norm = countries.copy()

countries_norm["Population Clip"] = countries_norm["Population 2020"].clip(0, 50000000)

countries_norm[countries_norm["Population 2020"] > 50000000][
    ["Country (or dependency)", "Population 2020", "Population Clip"]
]

Out[40]:

	Country (or dependency)	Population 2020	Population Clip
no
1	China	1439323776	50000000
2	India	1380004385	50000000
3	United States	331002651	50000000
4	Indonesia	273523615	50000000
5	Pakistan	220892340	50000000
6	Brazil	212559417	50000000
7	Nigeria	206139589	50000000
8	Bangladesh	164689383	50000000
9	Russia	145934462	50000000
10	Mexico	128932753	50000000
11	Japan	126476461	50000000
12	Ethiopia	114963588	50000000
13	Philippines	109581078	50000000
14	Egypt	102334404	50000000
15	Vietnam	97338579	50000000
16	DR Congo	89561403	50000000
17	Turkey	84339067	50000000
18	Iran	83992949	50000000
19	Germany	83783942	50000000
20	Thailand	69799978	50000000
21	United Kingdom	67886011	50000000
22	France	65273511	50000000
23	Italy	60461826	50000000
24	Tanzania	59734218	50000000
25	South Africa	59308690	50000000
26	Myanmar	54409800	50000000
27	Kenya	53771296	50000000
28	South Korea	51269185	50000000
29	Colombia	50882891	50000000
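As a compact illustration of what `clip` does, the sketch below applies the same bounds to three population values taken from the tables above (China, Colombia, Holy See). Values inside the bounds stay unchanged; values above the upper bound are replaced by it.

```python
import pandas as pd

# Three population values from the tables above: China, Colombia, Holy See.
population = pd.Series([1439323776, 50882891, 801])

# Values above the upper bound are replaced by the bound itself.
clipped = population.clip(lower=0, upper=50_000_000)

# Mask of the rows that were actually changed.
was_clipped = population > 50_000_000

print(clipped.tolist(), was_clipped.tolist())
```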

Winsorizing the Population 2020 feature

In [41]:

from scipy.stats.mstats import winsorize

print(countries_norm["Population 2020"].quantile(q=0.95))

countries_norm["PopulationWinsorized"] = winsorize(
    countries_norm["Population 2020"].fillna(countries_norm["Population 2020"].mean()),
    (0, 0.05),
    inplace=False,
)

countries_norm[countries_norm["Population 2020"] > 50000000][
    ["Country (or dependency)", "Population 2020", "PopulationWinsorized"]
]

111195830.99999991

Out[41]:

	Country (or dependency)	Population 2020	PopulationWinsorized
no
1	China	1439323776	114963588
2	India	1380004385	114963588
3	United States	331002651	114963588
4	Indonesia	273523615	114963588
5	Pakistan	220892340	114963588
6	Brazil	212559417	114963588
7	Nigeria	206139589	114963588
8	Bangladesh	164689383	114963588
9	Russia	145934462	114963588
10	Mexico	128932753	114963588
11	Japan	126476461	114963588
12	Ethiopia	114963588	114963588
13	Philippines	109581078	109581078
14	Egypt	102334404	102334404
15	Vietnam	97338579	97338579
16	DR Congo	89561403	89561403
17	Turkey	84339067	84339067
18	Iran	83992949	83992949
19	Germany	83783942	83783942
20	Thailand	69799978	69799978
21	United Kingdom	67886011	67886011
22	France	65273511	65273511
23	Italy	60461826	60461826
24	Tanzania	59734218	59734218
25	South Africa	59308690	59308690
26	Myanmar	54409800	54409800
27	Kenya	53771296	53771296
28	South Korea	51269185	51269185
29	Colombia	50882891	50882891
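The printed 0.95 quantile (111195830.99999991) differs from the cap that appears in the table (114963588) because the two are computed differently. `quantile` interpolates between neighbouring values, whereas winsorizing with limits `(0, 0.05)`, as far as I understand its default settings, replaces the top `int(0.05 * n)` values with the largest remaining value. With 235 rows that is 11 values, so the cap is the 12th largest population. The NumPy sketch below imitates this behaviour on hypothetical data; `winsorize_upper` is an illustrative helper, not a SciPy function.

```python
import numpy as np

def winsorize_upper(values, upper_limit):
    """Replace the top int(upper_limit * n) values with the largest remaining value."""
    arr = np.asarray(values, dtype=float).copy()
    order = np.argsort(arr)
    n = arr.size
    k = int(upper_limit * n)  # number of values to replace
    if k > 0:
        arr[order[n - k:]] = arr[order[n - k - 1]]
    return arr

sample = np.arange(1, 21)                   # hypothetical data: 1..20
winsorized = winsorize_upper(sample, 0.10)  # k = 2, so 19 and 20 become 18
percentile_90 = np.quantile(sample, 0.90)   # interpolated between 18 and 19

print(winsorized[-3:], percentile_90)
print(int(0.05 * 235))                      # values replaced in the 235-row table
```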

Normalizing values

In [43]:

from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()

min_max_scaler_2 = preprocessing.MinMaxScaler(feature_range=(-1, 1))

countries_norm["PopulationNorm"] = min_max_scaler.fit_transform(
    countries_norm["Population 2020"].to_numpy().reshape(-1, 1)
).reshape(countries_norm["Population 2020"].shape)

countries_norm["PopulationClipNorm"] = min_max_scaler.fit_transform(
    countries_norm["Population Clip"].to_numpy().reshape(-1, 1)
).reshape(countries_norm["Population 2020"].shape)

countries_norm["PopulationWinsorizedNorm"] = min_max_scaler.fit_transform(
    countries_norm["PopulationWinsorized"].to_numpy().reshape(-1, 1)
).reshape(countries_norm["Population 2020"].shape)

countries_norm["PopulationWinsorizedNorm2"] = min_max_scaler_2.fit_transform(
    countries_norm["PopulationWinsorized"].to_numpy().reshape(-1, 1)
).reshape(countries_norm["Population 2020"].shape)

countries_norm[
    [
        "Country (or dependency)",
        "Population 2020",
        "PopulationNorm",
        "PopulationClipNorm",
        "PopulationWinsorizedNorm",
        "PopulationWinsorizedNorm2",
    ]
]

Out[43]:

	Country (or dependency)	Population 2020	PopulationNorm	PopulationClipNorm	PopulationWinsorizedNorm	PopulationWinsorizedNorm2
no
1	China	1439323776	1.000000e+00	1.000000	1.000000	1.000000
2	India	1380004385	9.587866e-01	1.000000	1.000000	1.000000
3	United States	331002651	2.299705e-01	1.000000	1.000000	1.000000
4	Indonesia	273523615	1.900357e-01	1.000000	1.000000	1.000000
5	Pakistan	220892340	1.534691e-01	1.000000	1.000000	1.000000
...	...	...	...	...	...	...
231	Montserrat	4992	2.911786e-06	0.000084	0.000036	-0.999927
232	Falkland Islands	3480	1.861292e-06	0.000054	0.000023	-0.999953
233	Niue	1626	5.731862e-07	0.000017	0.000007	-0.999986
234	Tokelau	1357	3.862927e-07	0.000011	0.000005	-0.999990
235	Holy See	801	0.000000e+00	0.000000	0.000000	-1.000000

235 rows × 6 columns
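Min-max scaling maps each value to `(x - min) / (max - min)` and then stretches the result to the requested range. The sketch below applies that formula by hand to four population values from the table above. Because they include the table's minimum (Holy See) and maximum (China), the results should agree with the PopulationNorm column. The `min_max` helper is illustrative and not part of scikit-learn.

```python
import numpy as np

# China, India, Tokelau, Holy See (values from the table above).
population = np.array([1439323776, 1380004385, 1357, 801], dtype=float)

def min_max(x, feature_range=(0.0, 1.0)):
    """Scale x linearly so that its minimum and maximum hit the range bounds."""
    low, high = feature_range
    scaled = (x - x.min()) / (x.max() - x.min())
    return scaled * (high - low) + low

norm_01 = min_max(population)                              # range [0, 1]
norm_pm1 = min_max(population, feature_range=(-1.0, 1.0))  # range [-1, 1]

print(norm_01)
print(norm_pm1)
```

The tiny values for small countries show how a few very large observations compress everything else toward zero, which is the motivation for clipping or winsorizing before scaling.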

Standardizing values

In [44]:

from sklearn import preprocessing

standard_scaler = preprocessing.StandardScaler()

countries_norm["PopulationStand"] = standard_scaler.fit_transform(
    countries_norm["Population 2020"].to_numpy().reshape(-1, 1)
).reshape(countries_norm["Population 2020"].shape)

countries_norm["PopulationClipStand"] = standard_scaler.fit_transform(
    countries_norm["Population Clip"].to_numpy().reshape(-1, 1)
).reshape(countries_norm["Population 2020"].shape)

countries_norm["PopulationWinsorizedStand"] = standard_scaler.fit_transform(
    countries_norm["PopulationWinsorized"].to_numpy().reshape(-1, 1)
).reshape(countries_norm["Population 2020"].shape)

countries_norm[
    [
        "Country (or dependency)",
        "Population 2020",
        "PopulationStand",
        "PopulationClipStand",
        "PopulationWinsorizedStand",
    ]
]

Out[44]:

	Country (or dependency)	Population 2020	PopulationStand	PopulationClipStand	PopulationWinsorizedStand
no
1	China	1439323776	10.427597	2.073933	3.171659
2	India	1380004385	9.987702	2.073933	3.171659
3	United States	331002651	2.208627	2.073933	3.171659
4	Indonesia	273523615	1.782380	2.073933	3.171659
5	Pakistan	220892340	1.392082	2.073933	3.171659
...	...	...	...	...	...
231	Montserrat	4992	-0.245950	-0.795071	-0.621969
232	Falkland Islands	3480	-0.245962	-0.795158	-0.622019
233	Niue	1626	-0.245975	-0.795265	-0.622080
234	Tokelau	1357	-0.245977	-0.795280	-0.622089
235	Holy See	801	-0.245982	-0.795312	-0.622107

235 rows × 5 columns
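Standardization subtracts the mean and divides by the standard deviation, so the result has zero mean and unit variance. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default, while `pandas.Series.std` defaults to `ddof=1`. A minimal sketch on hypothetical data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample

# z-score with the population standard deviation (np.std defaults to ddof=0).
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```

Unlike min-max scaling, the output is not bounded, which is why China reaches a z-score above 10 on the raw population column.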
