pred_analytics/lec3.ipynb at main

Загрузка набора данных Titanic¶

In [1]:

import pandas as pd

titanic = pd.read_csv("data/titanic.csv", index_col="PassengerId")

titanic

Out[1]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...
887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 11 columns

Унитарное кодирование¶

Преобразование категориального признака в несколько бинарных признаков

Унитарное кодирование признаков Пол (Sex) и Порт посадки (Embarked)¶

Кодирование

In [2]:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

encoder = OneHotEncoder(sparse_output=False, drop="first")

encoded_values = encoder.fit_transform(titanic[["Embarked", "Sex"]])

encoded_columns = encoder.get_feature_names_out(["Embarked", "Sex"])

encoded_values_df = pd.DataFrame(encoded_values, columns=encoded_columns)

encoded_values_df

Out[2]:

	Embarked_Q	Embarked_S	Embarked_nan	Sex_male
0	0.0	1.0	0.0	1.0
1	0.0	0.0	0.0	0.0
2	0.0	1.0	0.0	0.0
3	0.0	1.0	0.0	0.0
4	0.0	1.0	0.0	1.0
...	...	...	...	...
886	0.0	1.0	0.0	1.0
887	0.0	1.0	0.0	0.0
888	0.0	1.0	0.0	0.0
889	0.0	0.0	0.0	1.0
890	1.0	0.0	0.0	1.0

891 rows × 4 columns

Добавление признаков в исходный Dataframe

In [3]:

titanic = pd.concat([titanic, encoded_values_df], axis=1)

titanic

Out[3]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Embarked_Q	Embarked_S	Embarked_nan	Sex_male
1	0.0	3.0	Braund, Mr. Owen Harris	male	22.0	1.0	0.0	A/5 21171	7.2500	NaN	S	0.0	0.0	0.0	0.0
2	1.0	1.0	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1.0	0.0	PC 17599	71.2833	C85	C	0.0	1.0	0.0	0.0
3	1.0	3.0	Heikkinen, Miss. Laina	female	26.0	0.0	0.0	STON/O2. 3101282	7.9250	NaN	S	0.0	1.0	0.0	0.0
4	1.0	1.0	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1.0	0.0	113803	53.1000	C123	S	0.0	1.0	0.0	1.0
5	0.0	3.0	Allen, Mr. William Henry	male	35.0	0.0	0.0	373450	8.0500	NaN	S	1.0	0.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
888	1.0	1.0	Graham, Miss. Margaret Edith	female	19.0	0.0	0.0	112053	30.0000	B42	S	0.0	1.0	0.0	0.0
889	0.0	3.0	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1.0	2.0	W./C. 6607	23.4500	NaN	S	0.0	0.0	0.0	1.0
890	1.0	1.0	Behr, Mr. Karl Howell	male	26.0	0.0	0.0	111369	30.0000	C148	C	1.0	0.0	0.0	1.0
891	0.0	3.0	Dooley, Mr. Patrick	male	32.0	0.0	0.0	370376	7.7500	NaN	Q	NaN	NaN	NaN	NaN
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	1.0	0.0	1.0

892 rows × 15 columns

Дискретизация признаков¶

Равномерное разделение данных на 3 группы

In [4]:

labels = ["young", "middle-aged", "old"]
num_bins = 3

In [5]:

hist1, bins1 = np.histogram(titanic["Age"].fillna(titanic["Age"].median()), bins=num_bins)
bins1, hist1

Out[5]:

(array([ 0.42      , 26.94666667, 53.47333333, 80.        ]),
 array([319, 523,  50]))

In [6]:

pd.concat([titanic["Age"], pd.cut(titanic["Age"], list(bins1))], axis=1).head(20)

Out[6]:

	Age	Age
1	22.0	(0.42, 26.947]
2	38.0	(26.947, 53.473]
3	26.0	(0.42, 26.947]
4	35.0	(26.947, 53.473]
5	35.0	(26.947, 53.473]
6	NaN	NaN
7	54.0	(53.473, 80.0]
8	2.0	(0.42, 26.947]
9	27.0	(26.947, 53.473]
10	14.0	(0.42, 26.947]
11	4.0	(0.42, 26.947]
12	58.0	(53.473, 80.0]
13	20.0	(0.42, 26.947]
14	39.0	(26.947, 53.473]
15	14.0	(0.42, 26.947]
16	55.0	(53.473, 80.0]
17	2.0	(0.42, 26.947]
18	NaN	NaN
19	31.0	(26.947, 53.473]
20	NaN	NaN

In [7]:

pd.concat([titanic["Age"], pd.cut(titanic["Age"], list(bins1), labels=labels)], axis=1).head(20)

Out[7]:

	Age	Age
1	22.0	young
2	38.0	middle-aged
3	26.0	young
4	35.0	middle-aged
5	35.0	middle-aged
6	NaN	NaN
7	54.0	old
8	2.0	young
9	27.0	middle-aged
10	14.0	young
11	4.0	young
12	58.0	old
13	20.0	young
14	39.0	middle-aged
15	14.0	young
16	55.0	old
17	2.0	young
18	NaN	NaN
19	31.0	middle-aged
20	NaN	NaN

Равномерное разделение данных на 3 группы c установкой собственной границы диапазона значений (от 0 до 100)

In [8]:

bins2 = np.linspace(0, 100, 4)
tmp_bins2 = np.digitize(titanic["Age"].fillna(titanic["Age"].median()), bins2)
hist2 = np.bincount(tmp_bins2 - 1)
bins2, hist2

Out[8]:

(array([  0.        ,  33.33333333,  66.66666667, 100.        ]),
 array([641, 244,   7]))

In [9]:

pd.concat([titanic["Age"], pd.cut(titanic["Age"], list(bins2))], axis=1).head(20)

Out[9]:

	Age	Age
1	22.0	(0.0, 33.333]
2	38.0	(33.333, 66.667]
3	26.0	(0.0, 33.333]
4	35.0	(33.333, 66.667]
5	35.0	(33.333, 66.667]
6	NaN	NaN
7	54.0	(33.333, 66.667]
8	2.0	(0.0, 33.333]
9	27.0	(0.0, 33.333]
10	14.0	(0.0, 33.333]
11	4.0	(0.0, 33.333]
12	58.0	(33.333, 66.667]
13	20.0	(0.0, 33.333]
14	39.0	(33.333, 66.667]
15	14.0	(0.0, 33.333]
16	55.0	(33.333, 66.667]
17	2.0	(0.0, 33.333]
18	NaN	NaN
19	31.0	(0.0, 33.333]
20	NaN	NaN

In [10]:

pd.concat([titanic["Age"], pd.cut(titanic["Age"], list(bins2), labels=labels)], axis=1).head(20)

Out[10]:

	Age	Age
1	22.0	young
2	38.0	middle-aged
3	26.0	young
4	35.0	middle-aged
5	35.0	middle-aged
6	NaN	NaN
7	54.0	middle-aged
8	2.0	young
9	27.0	young
10	14.0	young
11	4.0	young
12	58.0	middle-aged
13	20.0	young
14	39.0	middle-aged
15	14.0	young
16	55.0	middle-aged
17	2.0	young
18	NaN	NaN
19	31.0	young
20	NaN	NaN

Равномерное разделение данных на 3 группы c установкой собственных интервалов (0 - 39, 40 - 60, 61 - 100)

In [11]:

hist3, bins3 = np.histogram(
    titanic["Age"].fillna(titanic["Age"].median()), bins=[0, 40, 60, 100]
)
bins3, hist3

Out[11]:

(array([  0,  40,  60, 100]), array([729, 137,  26]))

In [12]:

pd.concat([titanic["Age"], pd.cut(titanic["Age"], list(bins3))], axis=1).head(20)

Out[12]:

	Age	Age
1	22.0	(0.0, 40.0]
2	38.0	(0.0, 40.0]
3	26.0	(0.0, 40.0]
4	35.0	(0.0, 40.0]
5	35.0	(0.0, 40.0]
6	NaN	NaN
7	54.0	(40.0, 60.0]
8	2.0	(0.0, 40.0]
9	27.0	(0.0, 40.0]
10	14.0	(0.0, 40.0]
11	4.0	(0.0, 40.0]
12	58.0	(40.0, 60.0]
13	20.0	(0.0, 40.0]
14	39.0	(0.0, 40.0]
15	14.0	(0.0, 40.0]
16	55.0	(40.0, 60.0]
17	2.0	(0.0, 40.0]
18	NaN	NaN
19	31.0	(0.0, 40.0]
20	NaN	NaN

In [13]:

pd.concat([titanic["Age"], pd.cut(titanic["Age"], list(bins3), labels=labels)], axis=1).head(20)

Out[13]:

	Age	Age
1	22.0	young
2	38.0	young
3	26.0	young
4	35.0	young
5	35.0	young
6	NaN	NaN
7	54.0	middle-aged
8	2.0	young
9	27.0	young
10	14.0	young
11	4.0	young
12	58.0	middle-aged
13	20.0	young
14	39.0	young
15	14.0	young
16	55.0	middle-aged
17	2.0	young
18	NaN	NaN
19	31.0	young
20	NaN	NaN

Квантильное разделение данных на 3 группы

In [14]:

pd.concat([titanic["Age"], pd.qcut(titanic["Age"], q=3, labels=False)], axis=1).head(20)

Out[14]:

	Age	Age
1	22.0	0.0
2	38.0	2.0
3	26.0	1.0
4	35.0	2.0
5	35.0	2.0
6	NaN	NaN
7	54.0	2.0
8	2.0	0.0
9	27.0	1.0
10	14.0	0.0
11	4.0	0.0
12	58.0	2.0
13	20.0	0.0
14	39.0	2.0
15	14.0	0.0
16	55.0	2.0
17	2.0	0.0
18	NaN	NaN
19	31.0	1.0
20	NaN	NaN

In [15]:

pd.concat([titanic["Age"], pd.qcut(titanic["Age"], q=3, labels=labels)], axis=1).head(20)

Out[15]:

	Age	Age
1	22.0	young
2	38.0	old
3	26.0	middle-aged
4	35.0	old
5	35.0	old
6	NaN	NaN
7	54.0	old
8	2.0	young
9	27.0	middle-aged
10	14.0	young
11	4.0	young
12	58.0	old
13	20.0	young
14	39.0	old
15	14.0	young
16	55.0	old
17	2.0	young
18	NaN	NaN
19	31.0	middle-aged
20	NaN	NaN

Пример конструирования признаков на основе существующих¶

Title - обращение к пассажиру (Mr, Mrs, Miss)

Is_married - замужняя ли женщина

Cabin_type - палуба (тип каюты)

In [ ]:

titanic_cl = titanic.drop(
    ["Embarked_Q", "Embarked_S", "Embarked_nan", "Sex_male"], axis=1, errors="ignore"
)
titanic_cl = titanic_cl.dropna()

titanic_cl["Title"] = [
    i.split(",")[1].split(".")[0].strip() for i in titanic_cl["Name"]
]

titanic_cl["Is_married"] = [1 if i == "Mrs" else 0 for i in titanic_cl["Title"]]

titanic_cl["Cabin_type"] = [i[0] for i in titanic_cl["Cabin"]]

titanic_cl

Out[ ]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title	Is_married	Cabin_type
2	1.0	1.0	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1.0	0.0	PC 17599	71.2833	C85	C	Mrs	1	C
4	1.0	1.0	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1.0	0.0	113803	53.1000	C123	S	Mrs	1	C
7	0.0	1.0	McCarthy, Mr. Timothy J	male	54.0	0.0	0.0	17463	51.8625	E46	S	Mr	0	E
11	1.0	3.0	Sandstrom, Miss. Marguerite Rut	female	4.0	1.0	1.0	PP 9549	16.7000	G6	S	Miss	0	G
12	1.0	1.0	Bonnell, Miss. Elizabeth	female	58.0	0.0	0.0	113783	26.5500	C103	S	Miss	0	C
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
872	1.0	1.0	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1.0	1.0	11751	52.5542	D35	S	Mrs	1	D
873	0.0	1.0	Carlsson, Mr. Frans Olof	male	33.0	0.0	0.0	695	5.0000	B51 B53 B55	S	Mr	0	B
880	1.0	1.0	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0.0	1.0	11767	83.1583	C50	C	Mrs	1	C
888	1.0	1.0	Graham, Miss. Margaret Edith	female	19.0	0.0	0.0	112053	30.0000	B42	S	Miss	0	B
890	1.0	1.0	Behr, Mr. Karl Howell	male	26.0	0.0	0.0	111369	30.0000	C148	C	Mr	0	C

183 rows × 14 columns

Пример использования библиотеки Featuretools для автоматического конструирования (синтеза) признаков¶

https://featuretools.alteryx.com/en/stable/getting_started/using_entitysets.html

Загрузка данных¶

За основу был взят набор данных "Ecommerce Orders Data Set" из Kaggle

Используется только 100 первых заказов и связанные с ними объекты

https://www.kaggle.com/datasets/sangamsharmait/ecommerce-orders-data-analysis

In [17]:

import featuretools as ft
from woodwork.logical_types import Categorical, Datetime

customers = pd.read_csv("data/orders/customers.csv")
sellers = pd.read_csv("data/orders/sellers.csv")
products = pd.read_csv("data/orders/products.csv")
orders = pd.read_csv("data/orders/orders.csv")
orders.fillna({"order_delivered_carrier_date": pd.to_datetime(
    "1900-01-01 00:00:00"
)}, inplace=True)
orders.fillna(
    {"order_delivered_customer_date": pd.to_datetime("1900-01-01 00:00:00")},
    inplace=True,
)
order_items = pd.read_csv("data/orders/order_items.csv")

Создание сущностей в featuretools¶

Добавление dataframe'ов с данными в EntitySet с указанием параметров: название сущности (таблицы), первичный ключ, категориальные атрибуты (в том числе даты)

In [18]:

es = ft.EntitySet(id="orders")

es = es.add_dataframe(
    dataframe_name="customers",
    dataframe=customers,
    index="customer_id",
    logical_types={
        "customer_unique_id": Categorical,
        "customer_zip_code_prefix": Categorical,
        "customer_city": Categorical,
        "customer_state": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="sellers",
    dataframe=sellers,
    index="seller_id",
    logical_types={
        "seller_zip_code_prefix": Categorical,
        "seller_city": Categorical,
        "seller_state": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="products",
    dataframe=products,
    index="product_id",
    logical_types={
        "product_category_name": Categorical,
        "product_name_lenght": Categorical,
        "product_description_lenght": Categorical,
        "product_photos_qty": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="orders",
    dataframe=orders,
    index="order_id",
    logical_types={
        "order_status": Categorical,
        "order_purchase_timestamp": Datetime,
        "order_approved_at": Datetime,
        "order_delivered_carrier_date": Datetime,
        "order_delivered_customer_date": Datetime,
        "order_estimated_delivery_date": Datetime,
    },
)
es = es.add_dataframe(
    dataframe_name="order_items",
    dataframe=order_items,
    index="orderitem_id",
    make_index=True,
    logical_types={"shipping_limit_date": Datetime},
)

es

c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(

Out[18]:

Entityset: orders
  DataFrames:
    customers [Rows: 100, Columns: 5]
    sellers [Rows: 87, Columns: 4]
    products [Rows: 100, Columns: 9]
    orders [Rows: 100, Columns: 8]
    order_items [Rows: 115, Columns: 8]
  Relationships:
    No relationships

Настройка связей между сущностями featuretools¶

Настройка связей между таблицами на уровне ключей

Связь указывается от родителя к потомкам (таблица-родитель, первичный ключ, таблица-потомок, внешний ключ)

In [19]:

es = es.add_relationship("customers", "customer_id", "orders", "customer_id")
es = es.add_relationship("orders", "order_id", "order_items", "order_id")
es = es.add_relationship("products", "product_id", "order_items", "product_id")
es = es.add_relationship("sellers", "seller_id", "order_items", "seller_id")

es

Out[19]:

Entityset: orders
  DataFrames:
    customers [Rows: 100, Columns: 5]
    sellers [Rows: 87, Columns: 4]
    products [Rows: 100, Columns: 9]
    orders [Rows: 100, Columns: 8]
    order_items [Rows: 115, Columns: 8]
  Relationships:
    orders.customer_id -> customers.customer_id
    order_items.order_id -> orders.order_id
    order_items.product_id -> products.product_id
    order_items.seller_id -> sellers.seller_id

Автоматическое конструирование признаков с помощью featuretools¶

Библиотека применят различные функции агрегации и трансформации к атрибутам таблицы order_items с учетом отношений

Результат помещается в Dataframe feature_matrix

In [20]:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="order_items",
    agg_primitives=["mean", "count", "mode", "any"],
    trans_primitives=["hour", "weekday"],
    max_depth=2,
)

feature_matrix

c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\featuretools\synthesis\dfs.py:321: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  agg_primitives: ['any', 'mode']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\featuretools\computational_backends\feature_set_calculator.py:785: FutureWarning: The provided callable <function mean at 0x00000245E1C73EC0> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  ).agg(to_agg)
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\featuretools\computational_backends\feature_set_calculator.py:785: FutureWarning: The provided callable <function mean at 0x00000245E1C73EC0> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  ).agg(to_agg)
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\featuretools\computational_backends\feature_set_calculator.py:785: FutureWarning: The provided callable <function mean at 0x00000245E1C73EC0> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  ).agg(to_agg)

Out[20]:

	order_item_id	price	freight_value	HOUR(shipping_limit_date)	WEEKDAY(shipping_limit_date)	orders.order_status	products.product_category_name	products.product_name_lenght	products.product_description_lenght	products.product_photos_qty	...	orders.customers.customer_city	orders.customers.customer_state	products.COUNT(order_items)	products.MEAN(order_items.freight_value)	products.MEAN(order_items.order_item_id)	products.MEAN(order_items.price)	sellers.COUNT(order_items)	sellers.MEAN(order_items.freight_value)	sellers.MEAN(order_items.order_item_id)	sellers.MEAN(order_items.price)
orderitem_id
0	1	38.50	24.84	20	4	delivered	cama_mesa_banho	53.0	223.0	1.0	...	santa luzia	PB	1	24.84	1.0	38.50	2	21.340	1.0	61.200000
1	1	29.99	7.39	8	0	delivered	telefonia	59.0	675.0	5.0	...	sao paulo	SP	1	7.39	1.0	29.99	1	7.390	1.0	29.990000
2	1	110.99	21.27	21	1	delivered	cama_mesa_banho	52.0	413.0	1.0	...	gravatai	RS	1	21.27	1.0	110.99	1	21.270	1.0	110.990000
3	1	27.99	15.10	23	1	delivered	telefonia	60.0	818.0	6.0	...	imbituba	SC	1	15.10	1.0	27.99	2	13.970	1.0	26.490000
4	1	49.90	16.05	13	2	invoiced	NaN	NaN	NaN	NaN	...	santa rosa	RS	1	16.05	1.0	49.90	1	16.050	1.0	49.900000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
110	1	17.90	10.96	8	1	delivered	cama_mesa_banho	55.0	122.0	1.0	...	jundiai	SP	1	10.96	1.0	17.90	1	10.960	1.0	17.900000
111	1	79.99	8.91	9	4	delivered	beleza_saude	59.0	492.0	3.0	...	sao paulo	SP	1	8.91	1.0	79.99	5	13.206	1.2	54.590000
112	1	190.00	19.41	13	3	delivered	climatizacao	60.0	3270.0	4.0	...	paulinia	SP	1	19.41	1.0	190.00	1	19.410	1.0	190.000000
113	1	109.90	15.53	2	2	delivered	cool_stuff	46.0	595.0	2.0	...	rio de janeiro	RJ	1	15.53	1.0	109.90	1	15.530	1.0	109.900000
114	1	27.90	18.30	14	2	delivered	alimentos	59.0	982.0	1.0	...	joinville	SC	2	16.70	1.0	27.90	3	16.190	1.0	38.596667

115 rows × 43 columns

Полученные признаки¶

Список колонок полученного dataframe'а

In [21]:

feature_defs

Out[21]:

[<Feature: order_item_id>,
 <Feature: price>,
 <Feature: freight_value>,
 <Feature: HOUR(shipping_limit_date)>,
 <Feature: WEEKDAY(shipping_limit_date)>,
 <Feature: orders.order_status>,
 <Feature: products.product_category_name>,
 <Feature: products.product_name_lenght>,
 <Feature: products.product_description_lenght>,
 <Feature: products.product_photos_qty>,
 <Feature: products.product_weight_g>,
 <Feature: products.product_length_cm>,
 <Feature: products.product_height_cm>,
 <Feature: products.product_width_cm>,
 <Feature: sellers.seller_zip_code_prefix>,
 <Feature: sellers.seller_city>,
 <Feature: sellers.seller_state>,
 <Feature: orders.COUNT(order_items)>,
 <Feature: orders.MEAN(order_items.freight_value)>,
 <Feature: orders.MEAN(order_items.order_item_id)>,
 <Feature: orders.MEAN(order_items.price)>,
 <Feature: orders.HOUR(order_approved_at)>,
 <Feature: orders.HOUR(order_delivered_carrier_date)>,
 <Feature: orders.HOUR(order_delivered_customer_date)>,
 <Feature: orders.HOUR(order_estimated_delivery_date)>,
 <Feature: orders.HOUR(order_purchase_timestamp)>,
 <Feature: orders.WEEKDAY(order_approved_at)>,
 <Feature: orders.WEEKDAY(order_delivered_carrier_date)>,
 <Feature: orders.WEEKDAY(order_delivered_customer_date)>,
 <Feature: orders.WEEKDAY(order_estimated_delivery_date)>,
 <Feature: orders.WEEKDAY(order_purchase_timestamp)>,
 <Feature: orders.customers.customer_unique_id>,
 <Feature: orders.customers.customer_zip_code_prefix>,
 <Feature: orders.customers.customer_city>,
 <Feature: orders.customers.customer_state>,
 <Feature: products.COUNT(order_items)>,
 <Feature: products.MEAN(order_items.freight_value)>,
 <Feature: products.MEAN(order_items.order_item_id)>,
 <Feature: products.MEAN(order_items.price)>,
 <Feature: sellers.COUNT(order_items)>,
 <Feature: sellers.MEAN(order_items.freight_value)>,
 <Feature: sellers.MEAN(order_items.order_item_id)>,
 <Feature: sellers.MEAN(order_items.price)>]

Отсечение значений признаков¶

Определение выбросов с помощью boxplot

In [22]:

titanic.boxplot(column="Age")

Out[22]:

<Axes: >

No description has been provided for this image

Отсечение данных для признака Возраст, значение которых больше 65 лет

In [23]:

titanic_norm = titanic.copy()

titanic_norm["AgeClip"] = titanic["Age"].clip(0, 65);

titanic_norm[titanic_norm["Age"] > 65][["Name", "Age", "AgeClip"]]

Out[23]:

	Name	Age	AgeClip
34	Wheadon, Mr. Edward H	66.0	65.0
97	Goldschmidt, Mr. George B	71.0	65.0
117	Connors, Mr. Patrick	70.5	65.0
494	Artagaveytia, Mr. Ramon	71.0	65.0
631	Barkworth, Mr. Algernon Henry Wilson	80.0	65.0
673	Mitchell, Mr. Henry Michael	70.0	65.0
746	Crosby, Capt. Edward Gifford	70.0	65.0
852	Svensson, Mr. Johan	74.0	65.0

Винсоризация признака Возраст

In [24]:

from scipy.stats.mstats import winsorize

print(titanic_norm["Age"].quantile(q=0.95))

titanic_norm["AgeWinsorize"] = winsorize(
    titanic_norm["Age"].fillna(titanic_norm["Age"].mean()), (0, 0.05), inplace=False
)

titanic_norm[titanic_norm["Age"] > 65][["Name", "Age", "AgeWinsorize"]]

56.0

Out[24]:

	Name	Age	AgeWinsorize
34	Wheadon, Mr. Edward H	66.0	54.0
97	Goldschmidt, Mr. George B	71.0	54.0
117	Connors, Mr. Patrick	70.5	54.0
494	Artagaveytia, Mr. Ramon	71.0	54.0
631	Barkworth, Mr. Algernon Henry Wilson	80.0	54.0
673	Mitchell, Mr. Henry Michael	70.0	54.0
746	Crosby, Capt. Edward Gifford	70.0	54.0
852	Svensson, Mr. Johan	74.0	54.0

Нормализация значений¶

In [25]:

from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()

min_max_scaler_2 = preprocessing.MinMaxScaler(feature_range=(-1, 1))

titanic_norm["AgeNorm"] = min_max_scaler.fit_transform(
    titanic_norm["Age"].to_numpy().reshape(-1, 1)
).reshape(titanic_norm["Age"].shape)

titanic_norm["AgeClipNorm"] = min_max_scaler.fit_transform(
    titanic_norm["AgeClip"].to_numpy().reshape(-1, 1)
).reshape(titanic_norm["Age"].shape)

titanic_norm["AgeWinsorizeNorm"] = min_max_scaler.fit_transform(
    titanic_norm["AgeWinsorize"].to_numpy().reshape(-1, 1)
).reshape(titanic_norm["Age"].shape)

titanic_norm["AgeWinsorizeNorm2"] = min_max_scaler_2.fit_transform(
    titanic_norm["AgeWinsorize"].to_numpy().reshape(-1, 1)
).reshape(titanic_norm["Age"].shape)

titanic_norm[
    ["Name", "Age", "AgeNorm", "AgeClipNorm", "AgeWinsorizeNorm", "AgeWinsorizeNorm2"]
].head(20)

Out[25]:

	Name	Age	AgeNorm	AgeClipNorm	AgeWinsorizeNorm	AgeWinsorizeNorm2
1	Braund, Mr. Owen Harris	22.0	0.271174	0.334159	0.402762	-0.194476
2	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0	0.472229	0.581914	0.701381	0.402762
3	Heikkinen, Miss. Laina	26.0	0.321438	0.396098	0.477417	-0.045166
4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0	0.434531	0.535460	0.645390	0.290780
5	Allen, Mr. William Henry	35.0	0.434531	0.535460	0.645390	0.290780
6	Moran, Mr. James	NaN	NaN	NaN	0.546456	0.092912
7	McCarthy, Mr. Timothy J	54.0	0.673285	0.829669	1.000000	1.000000
8	Palsson, Master. Gosta Leonard	2.0	0.019854	0.024466	0.029489	-0.941023
9	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	27.0	0.334004	0.411583	0.496081	-0.007839
10	Nasser, Mrs. Nicholas (Adele Achem)	14.0	0.170646	0.210282	0.253453	-0.493094
11	Sandstrom, Miss. Marguerite Rut	4.0	0.044986	0.055435	0.066816	-0.866368
12	Bonnell, Miss. Elizabeth	58.0	0.723549	0.891607	1.000000	1.000000
13	Saundercock, Mr. William Henry	20.0	0.246042	0.303190	0.365435	-0.269130
14	Andersson, Mr. Anders Johan	39.0	0.484795	0.597399	0.720045	0.440090
15	Vestrom, Miss. Hulda Amanda Adolfina	14.0	0.170646	0.210282	0.253453	-0.493094
16	Hewlett, Mrs. (Mary D Kingcome)	55.0	0.685851	0.845153	1.000000	1.000000
17	Rice, Master. Eugene	2.0	0.019854	0.024466	0.029489	-0.941023
18	Williams, Mr. Charles Eugene	NaN	NaN	NaN	0.546456	0.092912
19	Vander Planke, Mrs. Julius (Emelia Maria Vande...	31.0	0.384267	0.473521	0.570735	0.141471
20	Masselmani, Mrs. Fatima	NaN	NaN	NaN	0.546456	0.092912

Стандартизация значений¶

In [26]:

from sklearn import preprocessing

stndart_scaler = preprocessing.StandardScaler()

titanic_norm["AgeStand"] = stndart_scaler.fit_transform(
    titanic_norm["Age"].to_numpy().reshape(-1, 1)
).reshape(titanic_norm["Age"].shape)

titanic_norm["AgeClipStand"] = stndart_scaler.fit_transform(
    titanic_norm["AgeClip"].to_numpy().reshape(-1, 1)
).reshape(titanic_norm["Age"].shape)

titanic_norm["AgeWinsorizeStand"] = stndart_scaler.fit_transform(
    titanic_norm["AgeWinsorize"].to_numpy().reshape(-1, 1)
).reshape(titanic_norm["Age"].shape)

titanic_norm[["Name", "Age", "AgeStand", "AgeClipStand", "AgeWinsorizeStand"]].head(20)

Out[26]:

	Name	Age	AgeStand	AgeClipStand	AgeWinsorizeStand
1	Braund, Mr. Owen Harris	22.0	-0.530377	-0.532745	-0.606602
2	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0	0.571831	0.585060	0.718863
3	Heikkinen, Miss. Laina	26.0	-0.254825	-0.253294	-0.275236
4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0	0.365167	0.375472	0.470339
5	Allen, Mr. William Henry	35.0	0.365167	0.375472	0.470339
6	Moran, Mr. James	NaN	NaN	NaN	0.031205
7	McCarthy, Mr. Timothy J	54.0	1.674039	1.702866	2.044329
8	Palsson, Master. Gosta Leonard	2.0	-1.908136	-1.930003	-2.263435
9	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	27.0	-0.185937	-0.183431	-0.192394
10	Nasser, Mrs. Nicholas (Adele Achem)	14.0	-1.081480	-1.091648	-1.269335
11	Sandstrom, Miss. Marguerite Rut	4.0	-1.770360	-1.790277	-2.097751
12	Bonnell, Miss. Elizabeth	58.0	1.949591	1.982317	2.044329
13	Saundercock, Mr. William Henry	20.0	-0.668153	-0.672471	-0.772286
14	Andersson, Mr. Anders Johan	39.0	0.640719	0.654923	0.801705
15	Vestrom, Miss. Hulda Amanda Adolfina	14.0	-1.081480	-1.091648	-1.269335
16	Hewlett, Mrs. (Mary D Kingcome)	55.0	1.742927	1.772729	2.044329
17	Rice, Master. Eugene	2.0	-1.908136	-1.930003	-2.263435
18	Williams, Mr. Charles Eugene	NaN	NaN	NaN	0.031205
19	Vander Planke, Mrs. Julius (Emelia Maria Vande...	31.0	0.089615	0.096020	0.138972
20	Masselmani, Mrs. Fatima	NaN	NaN	NaN	0.031205

	Embarked_Q	Embarked_S	Embarked_nan	Sex_male
0	0.0	1.0	0.0	1.0
1	0.0	0.0	0.0	0.0
2	0.0	1.0	0.0	0.0
3	0.0	1.0	0.0	0.0
4	0.0	1.0	0.0	1.0
...	...	...	...	...
886	0.0	1.0	0.0	1.0
887	0.0	1.0	0.0	0.0
888	0.0	1.0	0.0	0.0
889	0.0	0.0	0.0	1.0
890	1.0	0.0	0.0	1.0

	Embarked_Q	Embarked_S	Embarked_nan	Sex_male
0	0.0	1.0	0.0	1.0
1	0.0	0.0	0.0	0.0
2	0.0	1.0	0.0	0.0
3	0.0	1.0	0.0	0.0
4	0.0	1.0	0.0	1.0
...	...	...	...	...
886	0.0	1.0	0.0	1.0
887	0.0	1.0	0.0	0.0
888	0.0	1.0	0.0	0.0
889	0.0	0.0	0.0	1.0
890	1.0	0.0	0.0	1.0

160 KiB Raw Permalink Blame History Unescape Escape