Example of using the Featuretools library for automated feature engineering


Loading the data

The example is based on the "Ecommerce Orders Data Set" from Kaggle.

Only the first 100 orders and the objects related to them are used.


In [51]:
import pandas as pd
import featuretools as ft
from woodwork.logical_types import Categorical, Datetime

customers = pd.read_csv("data/orders/customers.csv")
sellers = pd.read_csv("data/orders/sellers.csv")
products = pd.read_csv("data/orders/products.csv")
orders = pd.read_csv("data/orders/orders.csv")
# Replace missing delivery dates with a sentinel timestamp
orders.fillna(
    {
        "order_delivered_carrier_date": pd.to_datetime("1900-01-01 00:00:00"),
        "order_delivered_customer_date": pd.to_datetime("1900-01-01 00:00:00"),
    },
    inplace=True,
)
order_items = pd.read_csv("data/orders/order_items.csv")
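
The notebook does not show how the 100-order subset was produced. A minimal pandas sketch of one possible way is given below: take the first orders, then keep only the rows of the other tables that those orders reference. The toy tables, their values, and the `*_small` names are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Toy stand-ins for the Kaggle tables (hypothetical values).
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_state": ["SP", "RJ", "SC"],
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "product_id": ["p1", "p2", "p1", "p3"],
    "seller_id": ["s1", "s2", "s1", "s3"],
})
products = pd.DataFrame({"product_id": ["p1", "p2", "p3"]})
sellers = pd.DataFrame({"seller_id": ["s1", "s2", "s3"]})

n_orders = 2  # the notebook uses 100

# Keep the first n orders, then only the rows they reference in other tables.
orders_small = orders.head(n_orders)
customers_small = customers[customers["customer_id"].isin(orders_small["customer_id"])]
items_small = order_items[order_items["order_id"].isin(orders_small["order_id"])]
products_small = products[products["product_id"].isin(items_small["product_id"])]
sellers_small = sellers[sellers["seller_id"].isin(items_small["seller_id"])]

print(len(orders_small), len(customers_small), len(items_small),
      len(products_small), len(sellers_small))
```

Filtering the child and parent tables this way keeps every foreign key resolvable, which matters later when relationships are added to the EntitySet.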

Creating entities in featuretools

The dataframes are added to an EntitySet with the following parameters: the entity (table) name, the primary key, and the logical types of the columns (categorical attributes and dates).

In [52]:
es = ft.EntitySet(id="orders")

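es = es.add_dataframe(
    dataframe_name="customers", dataframe=customers, index="customer_id",
    logical_types={
        "customer_unique_id": Categorical,
        "customer_zip_code_prefix": Categorical,
        "customer_city": Categorical,
        "customer_state": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="sellers", dataframe=sellers, index="seller_id",
    logical_types={
        "seller_zip_code_prefix": Categorical,
        "seller_city": Categorical,
        "seller_state": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="products", dataframe=products, index="product_id",
    logical_types={
        "product_category_name": Categorical,
        "product_name_lenght": Categorical,
        "product_description_lenght": Categorical,
        "product_photos_qty": Categorical,
    },
)
es = es.add_dataframe(
    dataframe_name="orders", dataframe=orders, index="order_id",
    logical_types={
        "order_status": Categorical,
        "order_purchase_timestamp": Datetime,
        "order_approved_at": Datetime,
        "order_delivered_carrier_date": Datetime,
        "order_delivered_customer_date": Datetime,
        "order_estimated_delivery_date": Datetime,
    },
)
# NOTE: index="id" with make_index=True is an assumption; the original
# arguments of this call were lost (the table has no single-column key).
es = es.add_dataframe(
    dataframe_name="order_items", dataframe=order_items, index="id", make_index=True,
    logical_types={"shipping_limit_date": Datetime},
)
es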

Entityset: orders
    customers [Rows: 100, Columns: 5]
    sellers [Rows: 87, Columns: 4]
    products [Rows: 100, Columns: 9]
    orders [Rows: 100, Columns: 8]
    order_items [Rows: 115, Columns: 8]
    No relationships
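
Logical types tell the library how a column may be used: a zip code prefix stored as an integer should not be averaged, and a timestamp stored as text cannot yield an hour or a weekday. As a rough pandas-only analogy (not the Woodwork mechanism itself), the sketch below converts such columns to the `category` and datetime dtypes. The toy values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows shaped like the customers and orders tables.
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_zip_code_prefix": [14409, 9790, 1151],
    "customer_state": ["SP", "SP", "RJ"],
})
orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
})

# Read from CSV, the zip prefix is numeric and the timestamp is plain text.
# Mark the first as categorical and parse the second as a datetime.
customers["customer_zip_code_prefix"] = customers["customer_zip_code_prefix"].astype("category")
customers["customer_state"] = customers["customer_state"].astype("category")
orders["order_purchase_timestamp"] = pd.to_datetime(orders["order_purchase_timestamp"])

print(customers.dtypes)
print(orders.dtypes)
```

With these dtypes a numeric aggregation such as a mean no longer applies to the zip prefix, while the `.dt` accessor becomes available on the timestamp column.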

Configuring relationships between featuretools entities

Relationships between the tables are defined at the key level.

A relationship is specified from parent to child: (parent table, primary key, child table, foreign key).

In [53]:
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")
es = es.add_relationship("orders", "order_id", "order_items", "order_id")
es = es.add_relationship("products", "product_id", "order_items", "product_id")
es = es.add_relationship("sellers", "seller_id", "order_items", "seller_id")

Entityset: orders
    customers [Rows: 100, Columns: 5]
    sellers [Rows: 87, Columns: 4]
    products [Rows: 100, Columns: 9]
    orders [Rows: 100, Columns: 8]
    order_items [Rows: 115, Columns: 8]
    orders.customer_id -> customers.customer_id
    order_items.order_id -> orders.order_id
    order_items.product_id -> products.product_id
    order_items.seller_id -> sellers.seller_id
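
A parent-child relationship only makes sense if the parent key is unique and every foreign key in the child table points to an existing parent row. The sketch below shows how this could be checked with pandas before adding a relationship. The helper `orphan_keys` and the toy tables are hypothetical and are not part of featuretools.

```python
import pandas as pd

# Hypothetical parent (orders) and child (order_items) tables.
orders = pd.DataFrame({"order_id": ["o1", "o2"], "customer_id": ["c1", "c2"]})
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "price": [10.0, 20.0, 5.0],
})


def orphan_keys(parent, parent_key, child, child_key):
    """Return child foreign-key values that have no matching parent row."""
    missing = ~child[child_key].isin(parent[parent_key])
    return sorted(child.loc[missing, child_key].unique().tolist())


# The parent key must be unique for a one-to-many relationship.
assert orders["order_id"].is_unique

print(orphan_keys(orders, "order_id", order_items, "order_id"))  # []

# The same relationship expressed as a join; validate raises an error
# if the parent side turns out not to be unique.
merged = order_items.merge(orders, on="order_id", how="left", validate="many_to_one")
print(merged)
```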

Automated feature engineering with featuretools

The library applies aggregation functions (and transform functions such as hour and weekday) to the attributes of the order_items table, taking the relationships into account.

The result is stored in the feature_matrix DataFrame.

In [54]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="order_items",
    agg_primitives=["mean", "count", "mode", "any"],
    trans_primitives=["hour", "weekday"],
)

order_item_id price freight_value HOUR(shipping_limit_date) WEEKDAY(shipping_limit_date) orders.order_status products.product_category_name products.product_name_lenght products.product_description_lenght products.product_photos_qty ... orders.customers.customer_city orders.customers.customer_state products.COUNT(order_items) products.MEAN(order_items.freight_value) products.MEAN(order_items.order_item_id) products.MEAN(order_items.price) sellers.COUNT(order_items) sellers.MEAN(order_items.freight_value) sellers.MEAN(order_items.order_item_id) sellers.MEAN(order_items.price)
0 1 38.50 24.84 20 4 delivered cama_mesa_banho 53.0 223.0 1.0 ... santa luzia PB 1 24.84 1.0 38.50 2 21.340 1.0 61.200000
1 1 29.99 7.39 8 0 delivered telefonia 59.0 675.0 5.0 ... sao paulo SP 1 7.39 1.0 29.99 1 7.390 1.0 29.990000
2 1 110.99 21.27 21 1 delivered cama_mesa_banho 52.0 413.0 1.0 ... gravatai RS 1 21.27 1.0 110.99 1 21.270 1.0 110.990000
3 1 27.99 15.10 23 1 delivered telefonia 60.0 818.0 6.0 ... imbituba SC 1 15.10 1.0 27.99 2 13.970 1.0 26.490000
4 1 49.90 16.05 13 2 invoiced NaN NaN NaN NaN ... santa rosa RS 1 16.05 1.0 49.90 1 16.050 1.0 49.900000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110 1 17.90 10.96 8 1 delivered cama_mesa_banho 55.0 122.0 1.0 ... jundiai SP 1 10.96 1.0 17.90 1 10.960 1.0 17.900000
111 1 79.99 8.91 9 4 delivered beleza_saude 59.0 492.0 3.0 ... sao paulo SP 1 8.91 1.0 79.99 5 13.206 1.2 54.590000
112 1 190.00 19.41 13 3 delivered climatizacao 60.0 3270.0 4.0 ... paulinia SP 1 19.41 1.0 190.00 1 19.410 1.0 190.000000
113 1 109.90 15.53 2 2 delivered cool_stuff 46.0 595.0 2.0 ... rio de janeiro RJ 1 15.53 1.0 109.90 1 15.530 1.0 109.900000
114 1 27.90 18.30 14 2 delivered alimentos 59.0 982.0 1.0 ... joinville SC 2 16.70 1.0 27.90 3 16.190 1.0 38.596667

115 rows × 43 columns
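
To make the aggregated columns easier to read, the sketch below reproduces two of them by hand with pandas on a made-up three-row table: `orders.MEAN(order_items.price)` is the mean price over all items of the same order, and `sellers.COUNT(order_items)` is the number of items sold by the same seller. This is an illustration of what the feature names mean, not the library's internal implementation.

```python
import pandas as pd

# Hypothetical order items: two items in order o1, one in order o2.
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "seller_id": ["s1", "s2", "s1"],
    "price": [10.0, 20.0, 5.0],
})

# Aggregate over the parent's children, then copy the value back to every
# child row (the "direct" feature of an aggregation on the parent).
order_items["orders.MEAN(order_items.price)"] = (
    order_items.groupby("order_id")["price"].transform("mean")
)
order_items["sellers.COUNT(order_items)"] = (
    order_items.groupby("seller_id")["seller_id"].transform("count")
)

print(order_items)
```

Each item row thus carries context about its order and its seller, which is why the values repeat for rows sharing the same parent.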

Resulting features

List of the columns of the resulting dataframe

In [55]:
feature_defs
[<Feature: order_item_id>,
 <Feature: price>,
 <Feature: freight_value>,
 <Feature: HOUR(shipping_limit_date)>,
 <Feature: WEEKDAY(shipping_limit_date)>,
 <Feature: orders.order_status>,
 <Feature: products.product_category_name>,
 <Feature: products.product_name_lenght>,
 <Feature: products.product_description_lenght>,
 <Feature: products.product_photos_qty>,
 <Feature: products.product_weight_g>,
 <Feature: products.product_length_cm>,
 <Feature: products.product_height_cm>,
 <Feature: products.product_width_cm>,
 <Feature: sellers.seller_zip_code_prefix>,
 <Feature: sellers.seller_city>,
 <Feature: sellers.seller_state>,
 <Feature: orders.COUNT(order_items)>,
 <Feature: orders.MEAN(order_items.freight_value)>,
 <Feature: orders.MEAN(order_items.order_item_id)>,
 <Feature: orders.MEAN(order_items.price)>,
 <Feature: orders.HOUR(order_approved_at)>,
 <Feature: orders.HOUR(order_delivered_carrier_date)>,
 <Feature: orders.HOUR(order_delivered_customer_date)>,
 <Feature: orders.HOUR(order_estimated_delivery_date)>,
 <Feature: orders.HOUR(order_purchase_timestamp)>,
 <Feature: orders.WEEKDAY(order_approved_at)>,
 <Feature: orders.WEEKDAY(order_delivered_carrier_date)>,
 <Feature: orders.WEEKDAY(order_delivered_customer_date)>,
 <Feature: orders.WEEKDAY(order_estimated_delivery_date)>,
 <Feature: orders.WEEKDAY(order_purchase_timestamp)>,
 <Feature: orders.customers.customer_unique_id>,
 <Feature: orders.customers.customer_zip_code_prefix>,
 <Feature: orders.customers.customer_city>,
 <Feature: orders.customers.customer_state>,
 <Feature: products.COUNT(order_items)>,
 <Feature: products.MEAN(order_items.freight_value)>,
 <Feature: products.MEAN(order_items.order_item_id)>,
 <Feature: products.MEAN(order_items.price)>,
 <Feature: sellers.COUNT(order_items)>,
 <Feature: sellers.MEAN(order_items.freight_value)>,
 <Feature: sellers.MEAN(order_items.order_item_id)>,
 <Feature: sellers.MEAN(order_items.price)>]
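
The feature names encode how each column was derived. As a hedged illustration on made-up rows, the pandas sketch below mirrors two kinds of features from the list: the transform features `HOUR(shipping_limit_date)` and `WEEKDAY(shipping_limit_date)` (weekday is counted from Monday = 0), and the chained direct feature `orders.customers.customer_city`, which follows two relationships from an item to its order and then to the customer.

```python
import pandas as pd

# Hypothetical rows; the dates and cities are invented for the example.
order_items = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "shipping_limit_date": pd.to_datetime(
        ["2017-09-22 20:05:00", "2018-01-01 08:30:00"]
    ),
})
orders = pd.DataFrame({"order_id": ["o1", "o2"], "customer_id": ["c1", "c2"]})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "customer_city": ["sao paulo", "joinville"],
})

# Transform features: computed row by row from a single datetime column.
hour = order_items["shipping_limit_date"].dt.hour
weekday = order_items["shipping_limit_date"].dt.weekday  # Monday = 0

# Direct feature through two relationships: item -> order -> customer.
joined = order_items.merge(orders, on="order_id").merge(customers, on="customer_id")

print(hour.tolist(), weekday.tolist())  # [20, 8] [4, 0]
print(joined["customer_city"].tolist())  # ['sao paulo', 'joinville']
```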