
Example of using the Featuretools library for automated feature engineering

https://featuretools.alteryx.com/en/stable/getting_started/using_entitysets.html

Loading the data

The example is based on the "Ecommerce Orders Data Set" from Kaggle

Only the first 100 orders and the objects related to them are used

https://www.kaggle.com/datasets/sangamsharmait/ecommerce-orders-data-analysis
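
The notebook does not show how the 100-order subset was produced. As a rough sketch only, one way to build such a subset with pandas is to take the first orders and keep just the rows of the other tables that they reference. The helper name `subset_related` and the tiny tables below are hypothetical and made up for illustration; only the key column names come from the notebook.

```python
import pandas as pd


def subset_related(orders, order_items, customers, products, sellers, n=100):
    """Keep the first n orders and only the rows of the other tables they reference."""
    orders_n = orders.head(n)
    items_n = order_items[order_items["order_id"].isin(orders_n["order_id"])]
    customers_n = customers[customers["customer_id"].isin(orders_n["customer_id"])]
    products_n = products[products["product_id"].isin(items_n["product_id"])]
    sellers_n = sellers[sellers["seller_id"].isin(items_n["seller_id"])]
    return orders_n, items_n, customers_n, products_n, sellers_n


# Tiny made-up tables (not the Kaggle data) to show the behaviour
orders = pd.DataFrame(
    {"order_id": ["o1", "o2", "o3"], "customer_id": ["c1", "c2", "c3"]}
)
order_items = pd.DataFrame(
    {
        "order_id": ["o1", "o1", "o2", "o3"],
        "product_id": ["p1", "p2", "p1", "p3"],
        "seller_id": ["s1", "s1", "s2", "s3"],
    }
)
customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"]})
products = pd.DataFrame({"product_id": ["p1", "p2", "p3"]})
sellers = pd.DataFrame({"seller_id": ["s1", "s2", "s3"]})

orders_n, items_n, customers_n, products_n, sellers_n = subset_related(
    orders, order_items, customers, products, sellers, n=2
)
```

Filtering the child and parent tables by the keys of the selected orders keeps the subset consistent, so every foreign key in the reduced tables still has a matching parent row.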

In [51]:
import pandas as pd
import featuretools as ft
from woodwork.logical_types import Categorical, Datetime

customers = pd.read_csv("data/orders/customers.csv")
sellers = pd.read_csv("data/orders/sellers.csv")
products = pd.read_csv("data/orders/products.csv")
orders = pd.read_csv("data/orders/orders.csv")
# Replace missing delivery dates with a sentinel timestamp (1900-01-01)
# so that both datetime columns contain no empty values
orders.fillna(
    {
        "order_delivered_carrier_date": pd.to_datetime("1900-01-01 00:00:00"),
        "order_delivered_customer_date": pd.to_datetime("1900-01-01 00:00:00"),
    },
    inplace=True,
)
order_items = pd.read_csv("data/orders/order_items.csv")

Creating entities in Featuretools

The DataFrames are added to an EntitySet with the following parameters: the entity (table) name, the primary key, and the logical types of the columns (categorical attributes and dates)

In [52]:
es = ft.EntitySet(id="orders")

es = es.add_dataframe(
    dataframe_name="customers",
    dataframe=customers,
    index="customer_id",
    logical_types={
        "customer_unique_id": Categorical,
        "customer_zip_code_prefix": Categorical,
        "customer_city": Categorical,
        "customer_state": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="sellers",
    dataframe=sellers,
    index="seller_id",
    logical_types={
        "seller_zip_code_prefix": Categorical,
        "seller_city": Categorical,
        "seller_state": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="products",
    dataframe=products,
    index="product_id",
    logical_types={
        "product_category_name": Categorical,
        "product_name_lenght": Categorical,
        "product_description_lenght": Categorical,
        "product_photos_qty": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="orders",
    dataframe=orders,
    index="order_id",
    logical_types={
        "order_status": Categorical,
        "order_purchase_timestamp": Datetime,
        "order_approved_at": Datetime,
        "order_delivered_carrier_date": Datetime,
        "order_delivered_customer_date": Datetime,
        "order_estimated_delivery_date": Datetime,
    },
)
es = es.add_dataframe(
    dataframe_name="order_items",
    dataframe=order_items,
    index="orderitem_id",
    make_index=True,
    logical_types={"shipping_limit_date": Datetime},
)

es
c:\Users\user\Projects\python\mai\.venv\Lib\site-packages\woodwork\type_sys\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
Out[52]:
Entityset: orders
  DataFrames:
    customers [Rows: 100, Columns: 5]
    sellers [Rows: 87, Columns: 4]
    products [Rows: 100, Columns: 9]
    orders [Rows: 100, Columns: 8]
    order_items [Rows: 115, Columns: 8]
  Relationships:
    No relationships
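
The "Could not infer format" warning above suggests that the datetime columns are being parsed element by element. A possible way to make the parsing explicit, shown here only as a sketch, is to convert those columns with `pd.to_datetime` and a fixed `format` before adding the DataFrame to the EntitySet. The format string `"%Y-%m-%d %H:%M:%S"` and the sample rows are assumptions; the actual CSV files may use a different layout. Note that this variant keeps missing dates as `NaT` instead of the 1900-01-01 sentinel used in the notebook.

```python
import pandas as pd

# Made-up rows; the real files may use a different datetime layout
orders = pd.DataFrame(
    {
        "order_id": ["o1", "o2"],
        "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
        "order_delivered_customer_date": ["2017-10-10 21:25:13", None],
    }
)

datetime_columns = ["order_purchase_timestamp", "order_delivered_customer_date"]
for column in datetime_columns:
    # An explicit format avoids per-element guessing; missing values become NaT
    orders[column] = pd.to_datetime(orders[column], format="%Y-%m-%d %H:%M:%S")
```

Columns that already have a datetime dtype should not need format inference later on, which will likely silence the warning, although that has not been verified against this dataset.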

Configuring relationships between Featuretools entities

Relationships between the tables are defined at the key level

Each relationship is specified from the parent to the child (parent table, primary key, child table, foreign key)

In [53]:
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")
es = es.add_relationship("orders", "order_id", "order_items", "order_id")
es = es.add_relationship("products", "product_id", "order_items", "product_id")
es = es.add_relationship("sellers", "seller_id", "order_items", "seller_id")

es
Out[53]:
Entityset: orders
  DataFrames:
    customers [Rows: 100, Columns: 5]
    sellers [Rows: 87, Columns: 4]
    products [Rows: 100, Columns: 9]
    orders [Rows: 100, Columns: 8]
    order_items [Rows: 115, Columns: 8]
  Relationships:
    orders.customer_id -> customers.customer_id
    order_items.order_id -> orders.order_id
    order_items.product_id -> products.product_id
    order_items.seller_id -> sellers.seller_id
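
A parent-to-child relationship is only meaningful if every foreign key in the child table points at an existing parent row. The following sketch checks that with plain pandas before relationships are declared. The helper `orphan_keys` and the sample tables are hypothetical and are not part of Featuretools.

```python
import pandas as pd


def orphan_keys(parent, parent_key, child, child_key):
    """Return the child foreign-key values that have no matching parent row."""
    child_values = set(child[child_key].dropna())
    parent_values = set(parent[parent_key])
    return child_values - parent_values


# Made-up tables: order "o3" refers to a customer that does not exist
customers = pd.DataFrame({"customer_id": ["c1", "c2"]})
orders = pd.DataFrame(
    {"order_id": ["o1", "o2", "o3"], "customer_id": ["c1", "c2", "c9"]}
)

missing = orphan_keys(customers, "customer_id", orders, "customer_id")
```

An empty result for each of the four parent/child pairs would indicate that the subset of tables is consistent.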

Automated feature engineering with Featuretools

The library applies aggregation and transform primitives to the attributes of the order_items table, taking its relationships with the other tables into account

The result is stored in the feature_matrix DataFrame

In [54]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="order_items",
    agg_primitives=["mean", "count", "mode", "any"],
    trans_primitives=["hour", "weekday"],
    max_depth=2,
)

feature_matrix
c:\Users\user\Projects\python\mai\.venv\Lib\site-packages\featuretools\synthesis\dfs.py:321: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  agg_primitives: ['any', 'mode']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)
c:\Users\user\Projects\python\mai\.venv\Lib\site-packages\featuretools\computational_backends\feature_set_calculator.py:785: FutureWarning: The provided callable <function mean at 0x00000278FE70BA60> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  ).agg(to_agg)
Out[54]:
order_item_id price freight_value HOUR(shipping_limit_date) WEEKDAY(shipping_limit_date) orders.order_status products.product_category_name products.product_name_lenght products.product_description_lenght products.product_photos_qty ... orders.customers.customer_city orders.customers.customer_state products.COUNT(order_items) products.MEAN(order_items.freight_value) products.MEAN(order_items.order_item_id) products.MEAN(order_items.price) sellers.COUNT(order_items) sellers.MEAN(order_items.freight_value) sellers.MEAN(order_items.order_item_id) sellers.MEAN(order_items.price)
orderitem_id
0 1 38.50 24.84 20 4 delivered cama_mesa_banho 53.0 223.0 1.0 ... santa luzia PB 1 24.84 1.0 38.50 2 21.340 1.0 61.200000
1 1 29.99 7.39 8 0 delivered telefonia 59.0 675.0 5.0 ... sao paulo SP 1 7.39 1.0 29.99 1 7.390 1.0 29.990000
2 1 110.99 21.27 21 1 delivered cama_mesa_banho 52.0 413.0 1.0 ... gravatai RS 1 21.27 1.0 110.99 1 21.270 1.0 110.990000
3 1 27.99 15.10 23 1 delivered telefonia 60.0 818.0 6.0 ... imbituba SC 1 15.10 1.0 27.99 2 13.970 1.0 26.490000
4 1 49.90 16.05 13 2 invoiced NaN NaN NaN NaN ... santa rosa RS 1 16.05 1.0 49.90 1 16.050 1.0 49.900000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110 1 17.90 10.96 8 1 delivered cama_mesa_banho 55.0 122.0 1.0 ... jundiai SP 1 10.96 1.0 17.90 1 10.960 1.0 17.900000
111 1 79.99 8.91 9 4 delivered beleza_saude 59.0 492.0 3.0 ... sao paulo SP 1 8.91 1.0 79.99 5 13.206 1.2 54.590000
112 1 190.00 19.41 13 3 delivered climatizacao 60.0 3270.0 4.0 ... paulinia SP 1 19.41 1.0 190.00 1 19.410 1.0 190.000000
113 1 109.90 15.53 2 2 delivered cool_stuff 46.0 595.0 2.0 ... rio de janeiro RJ 1 15.53 1.0 109.90 1 15.530 1.0 109.900000
114 1 27.90 18.30 14 2 delivered alimentos 59.0 982.0 1.0 ... joinville SC 2 16.70 1.0 27.90 3 16.190 1.0 38.596667

115 rows × 43 columns
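
To make the column names in the output above easier to read, here is a rough pandas equivalent of two of them, `products.COUNT(order_items)` and `products.MEAN(order_items.price)`: the items are aggregated per product, and the result is attached back to every item row. This is an illustration with made-up numbers, not the way Featuretools computes the features internally.

```python
import pandas as pd

# Made-up order items: product "p1" was sold twice
order_items = pd.DataFrame(
    {
        "product_id": ["p1", "p2", "p1", "p3"],
        "price": [10.0, 20.0, 30.0, 40.0],
    }
)

# Aggregate the child rows per parent key (one row per product)
per_product = order_items.groupby("product_id").agg(
    **{
        "products.COUNT(order_items)": ("price", "size"),
        "products.MEAN(order_items.price)": ("price", "mean"),
    }
)

# Attach the per-product aggregates back to every order item
features = order_items.join(per_product, on="product_id")
```

Both rows for product "p1" receive the same count and mean, which matches how the aggregated columns in feature_matrix repeat for items that share a product or seller.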

Resulting features

The list of feature definitions, one for each column of the resulting DataFrame

In [55]:
feature_defs
Out[55]:
[<Feature: order_item_id>,
 <Feature: price>,
 <Feature: freight_value>,
 <Feature: HOUR(shipping_limit_date)>,
 <Feature: WEEKDAY(shipping_limit_date)>,
 <Feature: orders.order_status>,
 <Feature: products.product_category_name>,
 <Feature: products.product_name_lenght>,
 <Feature: products.product_description_lenght>,
 <Feature: products.product_photos_qty>,
 <Feature: products.product_weight_g>,
 <Feature: products.product_length_cm>,
 <Feature: products.product_height_cm>,
 <Feature: products.product_width_cm>,
 <Feature: sellers.seller_zip_code_prefix>,
 <Feature: sellers.seller_city>,
 <Feature: sellers.seller_state>,
 <Feature: orders.COUNT(order_items)>,
 <Feature: orders.MEAN(order_items.freight_value)>,
 <Feature: orders.MEAN(order_items.order_item_id)>,
 <Feature: orders.MEAN(order_items.price)>,
 <Feature: orders.HOUR(order_approved_at)>,
 <Feature: orders.HOUR(order_delivered_carrier_date)>,
 <Feature: orders.HOUR(order_delivered_customer_date)>,
 <Feature: orders.HOUR(order_estimated_delivery_date)>,
 <Feature: orders.HOUR(order_purchase_timestamp)>,
 <Feature: orders.WEEKDAY(order_approved_at)>,
 <Feature: orders.WEEKDAY(order_delivered_carrier_date)>,
 <Feature: orders.WEEKDAY(order_delivered_customer_date)>,
 <Feature: orders.WEEKDAY(order_estimated_delivery_date)>,
 <Feature: orders.WEEKDAY(order_purchase_timestamp)>,
 <Feature: orders.customers.customer_unique_id>,
 <Feature: orders.customers.customer_zip_code_prefix>,
 <Feature: orders.customers.customer_city>,
 <Feature: orders.customers.customer_state>,
 <Feature: products.COUNT(order_items)>,
 <Feature: products.MEAN(order_items.freight_value)>,
 <Feature: products.MEAN(order_items.order_item_id)>,
 <Feature: products.MEAN(order_items.price)>,
 <Feature: sellers.COUNT(order_items)>,
 <Feature: sellers.MEAN(order_items.freight_value)>,
 <Feature: sellers.MEAN(order_items.order_item_id)>,
 <Feature: sellers.MEAN(order_items.price)>]
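
The `HOUR(...)` and `WEEKDAY(...)` entries in the list come from the transform primitives passed to `ft.dfs`. As a sketch, the same values can be derived with the pandas `.dt` accessor, assuming the weekday is numbered from Monday = 0 to Sunday = 6. The timestamps below are made up for illustration.

```python
import pandas as pd

# Made-up timestamps: a Tuesday morning and a Friday evening
order_items = pd.DataFrame(
    {
        "shipping_limit_date": pd.to_datetime(
            ["2017-09-19 09:45:35", "2017-09-22 20:10:00"]
        )
    }
)

# Hour of day (0-23) and day of week (Monday = 0 ... Sunday = 6)
order_items["HOUR(shipping_limit_date)"] = order_items["shipping_limit_date"].dt.hour
order_items["WEEKDAY(shipping_limit_date)"] = order_items[
    "shipping_limit_date"
].dt.weekday
```

Features prefixed with a table name, such as `orders.HOUR(order_approved_at)`, apply the same transformation to a column of a parent table and then carry the value over to each related order item.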