Add featuretools example

Aleksey Filippov 2024-10-02 21:27:56 +04:00
parent 5c88818958
commit 317c52035c

878
lec3.ipynb Normal file

@ -0,0 +1,878 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: automated feature engineering with the Featuretools library\n",
"\n",
"https://featuretools.alteryx.com/en/stable/getting_started/using_entitysets.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Loading the data\n",
"\n",
"The example is based on the \"Ecommerce Orders Data Set\" from Kaggle.\n",
"\n",
"Only the first 100 orders and the objects related to them are used.\n",
"\n",
"https://www.kaggle.com/datasets/sangamsharmait/ecommerce-orders-data-analysis"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import featuretools as ft\n",
"from woodwork.logical_types import Categorical, Datetime\n",
"\n",
"customers = pd.read_csv(\"data/orders/customers.csv\")\n",
"sellers = pd.read_csv(\"data/orders/sellers.csv\")\n",
"products = pd.read_csv(\"data/orders/products.csv\")\n",
"orders = pd.read_csv(\"data/orders/orders.csv\")\n",
"# Replace missing delivery dates with a sentinel value (1900-01-01).\n",
"orders = orders.fillna(\n",
"    {\n",
"        \"order_delivered_carrier_date\": \"1900-01-01 00:00:00\",\n",
"        \"order_delivered_customer_date\": \"1900-01-01 00:00:00\",\n",
"    }\n",
")\n",
"order_items = pd.read_csv(\"data/orders/order_items.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Creating entities in Featuretools\n",
"\n",
"Each dataframe is added to the EntitySet together with its parameters: the entity (table) name, the primary key, and the logical types of its columns (categorical attributes and dates)."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\user\\Projects\\python\\mai\\.venv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n"
]
},
{
"data": {
"text/plain": [
"Entityset: orders\n",
" DataFrames:\n",
" customers [Rows: 100, Columns: 5]\n",
" sellers [Rows: 87, Columns: 4]\n",
" products [Rows: 100, Columns: 9]\n",
" orders [Rows: 100, Columns: 8]\n",
" order_items [Rows: 115, Columns: 8]\n",
" Relationships:\n",
" No relationships"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es = ft.EntitySet(id=\"orders\")\n",
"\n",
"es = es.add_dataframe(\n",
" dataframe_name=\"customers\",\n",
" dataframe=customers,\n",
" index=\"customer_id\",\n",
" logical_types={\n",
" \"customer_unique_id\": Categorical,\n",
" \"customer_zip_code_prefix\": Categorical,\n",
" \"customer_city\": Categorical,\n",
" \"customer_state\": Categorical,\n",
" },\n",
")\n",
"es = es.add_dataframe(\n",
" dataframe_name=\"sellers\",\n",
" dataframe=sellers,\n",
" index=\"seller_id\",\n",
" logical_types={\n",
" \"seller_zip_code_prefix\": Categorical,\n",
" \"seller_city\": Categorical,\n",
" \"seller_state\": Categorical,\n",
" },\n",
")\n",
"es = es.add_dataframe(\n",
" dataframe_name=\"products\",\n",
" dataframe=products,\n",
" index=\"product_id\",\n",
" logical_types={\n",
" \"product_category_name\": Categorical,\n",
" \"product_name_lenght\": Categorical,\n",
" \"product_description_lenght\": Categorical,\n",
" \"product_photos_qty\": Categorical,\n",
" },\n",
")\n",
"es = es.add_dataframe(\n",
" dataframe_name=\"orders\",\n",
" dataframe=orders,\n",
" index=\"order_id\",\n",
" logical_types={\n",
" \"order_status\": Categorical,\n",
" \"order_purchase_timestamp\": Datetime,\n",
" \"order_approved_at\": Datetime,\n",
" \"order_delivered_carrier_date\": Datetime,\n",
" \"order_delivered_customer_date\": Datetime,\n",
" \"order_estimated_delivery_date\": Datetime,\n",
" },\n",
")\n",
"es = es.add_dataframe(\n",
" dataframe_name=\"order_items\",\n",
" dataframe=order_items,\n",
" index=\"orderitem_id\",\n",
" make_index=True,\n",
" logical_types={\"shipping_limit_date\": Datetime},\n",
")\n",
"\n",
"es"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Defining relationships between Featuretools entities\n",
"\n",
"Relationships between the tables are defined at the key level.\n",
"\n",
"Each relationship is specified from parent to child: parent table, primary key, child table, foreign key."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Entityset: orders\n",
" DataFrames:\n",
" customers [Rows: 100, Columns: 5]\n",
" sellers [Rows: 87, Columns: 4]\n",
" products [Rows: 100, Columns: 9]\n",
" orders [Rows: 100, Columns: 8]\n",
" order_items [Rows: 115, Columns: 8]\n",
" Relationships:\n",
" orders.customer_id -> customers.customer_id\n",
" order_items.order_id -> orders.order_id\n",
" order_items.product_id -> products.product_id\n",
" order_items.seller_id -> sellers.seller_id"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es = es.add_relationship(\"customers\", \"customer_id\", \"orders\", \"customer_id\")\n",
"es = es.add_relationship(\"orders\", \"order_id\", \"order_items\", \"order_id\")\n",
"es = es.add_relationship(\"products\", \"product_id\", \"order_items\", \"product_id\")\n",
"es = es.add_relationship(\"sellers\", \"seller_id\", \"order_items\", \"seller_id\")\n",
"\n",
"es"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Automated feature engineering with Featuretools\n",
"\n",
"The library applies aggregation and transformation functions to the attributes of the order_items table, taking the relationships between the tables into account.\n",
"\n",
"The result is stored in the feature_matrix DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\user\\Projects\\python\\mai\\.venv\\Lib\\site-packages\\featuretools\\synthesis\\dfs.py:321: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:\n",
" agg_primitives: ['any', 'mode']\n",
"This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used.\n",
" warnings.warn(warning_msg, UnusedPrimitiveWarning)\n",
"c:\\Users\\user\\Projects\\python\\mai\\.venv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:785: FutureWarning: The provided callable <function mean at 0x00000278FE70BA60> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"mean\" instead.\n",
" ).agg(to_agg)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>order_item_id</th>\n",
" <th>price</th>\n",
" <th>freight_value</th>\n",
" <th>HOUR(shipping_limit_date)</th>\n",
" <th>WEEKDAY(shipping_limit_date)</th>\n",
" <th>orders.order_status</th>\n",
" <th>products.product_category_name</th>\n",
" <th>products.product_name_lenght</th>\n",
" <th>products.product_description_lenght</th>\n",
" <th>products.product_photos_qty</th>\n",
" <th>...</th>\n",
" <th>orders.customers.customer_city</th>\n",
" <th>orders.customers.customer_state</th>\n",
" <th>products.COUNT(order_items)</th>\n",
" <th>products.MEAN(order_items.freight_value)</th>\n",
" <th>products.MEAN(order_items.order_item_id)</th>\n",
" <th>products.MEAN(order_items.price)</th>\n",
" <th>sellers.COUNT(order_items)</th>\n",
" <th>sellers.MEAN(order_items.freight_value)</th>\n",
" <th>sellers.MEAN(order_items.order_item_id)</th>\n",
" <th>sellers.MEAN(order_items.price)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>orderitem_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>38.50</td>\n",
" <td>24.84</td>\n",
" <td>20</td>\n",
" <td>4</td>\n",
" <td>delivered</td>\n",
" <td>cama_mesa_banho</td>\n",
" <td>53.0</td>\n",
" <td>223.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>santa luzia</td>\n",
" <td>PB</td>\n",
" <td>1</td>\n",
" <td>24.84</td>\n",
" <td>1.0</td>\n",
" <td>38.50</td>\n",
" <td>2</td>\n",
" <td>21.340</td>\n",
" <td>1.0</td>\n",
" <td>61.200000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>29.99</td>\n",
" <td>7.39</td>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>delivered</td>\n",
" <td>telefonia</td>\n",
" <td>59.0</td>\n",
" <td>675.0</td>\n",
" <td>5.0</td>\n",
" <td>...</td>\n",
" <td>sao paulo</td>\n",
" <td>SP</td>\n",
" <td>1</td>\n",
" <td>7.39</td>\n",
" <td>1.0</td>\n",
" <td>29.99</td>\n",
" <td>1</td>\n",
" <td>7.390</td>\n",
" <td>1.0</td>\n",
" <td>29.990000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>110.99</td>\n",
" <td>21.27</td>\n",
" <td>21</td>\n",
" <td>1</td>\n",
" <td>delivered</td>\n",
" <td>cama_mesa_banho</td>\n",
" <td>52.0</td>\n",
" <td>413.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>gravatai</td>\n",
" <td>RS</td>\n",
" <td>1</td>\n",
" <td>21.27</td>\n",
" <td>1.0</td>\n",
" <td>110.99</td>\n",
" <td>1</td>\n",
" <td>21.270</td>\n",
" <td>1.0</td>\n",
" <td>110.990000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>27.99</td>\n",
" <td>15.10</td>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>delivered</td>\n",
" <td>telefonia</td>\n",
" <td>60.0</td>\n",
" <td>818.0</td>\n",
" <td>6.0</td>\n",
" <td>...</td>\n",
" <td>imbituba</td>\n",
" <td>SC</td>\n",
" <td>1</td>\n",
" <td>15.10</td>\n",
" <td>1.0</td>\n",
" <td>27.99</td>\n",
" <td>2</td>\n",
" <td>13.970</td>\n",
" <td>1.0</td>\n",
" <td>26.490000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>49.90</td>\n",
" <td>16.05</td>\n",
" <td>13</td>\n",
" <td>2</td>\n",
" <td>invoiced</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>santa rosa</td>\n",
" <td>RS</td>\n",
" <td>1</td>\n",
" <td>16.05</td>\n",
" <td>1.0</td>\n",
" <td>49.90</td>\n",
" <td>1</td>\n",
" <td>16.050</td>\n",
" <td>1.0</td>\n",
" <td>49.900000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110</th>\n",
" <td>1</td>\n",
" <td>17.90</td>\n",
" <td>10.96</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>delivered</td>\n",
" <td>cama_mesa_banho</td>\n",
" <td>55.0</td>\n",
" <td>122.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>jundiai</td>\n",
" <td>SP</td>\n",
" <td>1</td>\n",
" <td>10.96</td>\n",
" <td>1.0</td>\n",
" <td>17.90</td>\n",
" <td>1</td>\n",
" <td>10.960</td>\n",
" <td>1.0</td>\n",
" <td>17.900000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>111</th>\n",
" <td>1</td>\n",
" <td>79.99</td>\n",
" <td>8.91</td>\n",
" <td>9</td>\n",
" <td>4</td>\n",
" <td>delivered</td>\n",
" <td>beleza_saude</td>\n",
" <td>59.0</td>\n",
" <td>492.0</td>\n",
" <td>3.0</td>\n",
" <td>...</td>\n",
" <td>sao paulo</td>\n",
" <td>SP</td>\n",
" <td>1</td>\n",
" <td>8.91</td>\n",
" <td>1.0</td>\n",
" <td>79.99</td>\n",
" <td>5</td>\n",
" <td>13.206</td>\n",
" <td>1.2</td>\n",
" <td>54.590000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>112</th>\n",
" <td>1</td>\n",
" <td>190.00</td>\n",
" <td>19.41</td>\n",
" <td>13</td>\n",
" <td>3</td>\n",
" <td>delivered</td>\n",
" <td>climatizacao</td>\n",
" <td>60.0</td>\n",
" <td>3270.0</td>\n",
" <td>4.0</td>\n",
" <td>...</td>\n",
" <td>paulinia</td>\n",
" <td>SP</td>\n",
" <td>1</td>\n",
" <td>19.41</td>\n",
" <td>1.0</td>\n",
" <td>190.00</td>\n",
" <td>1</td>\n",
" <td>19.410</td>\n",
" <td>1.0</td>\n",
" <td>190.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>113</th>\n",
" <td>1</td>\n",
" <td>109.90</td>\n",
" <td>15.53</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>delivered</td>\n",
" <td>cool_stuff</td>\n",
" <td>46.0</td>\n",
" <td>595.0</td>\n",
" <td>2.0</td>\n",
" <td>...</td>\n",
" <td>rio de janeiro</td>\n",
" <td>RJ</td>\n",
" <td>1</td>\n",
" <td>15.53</td>\n",
" <td>1.0</td>\n",
" <td>109.90</td>\n",
" <td>1</td>\n",
" <td>15.530</td>\n",
" <td>1.0</td>\n",
" <td>109.900000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>114</th>\n",
" <td>1</td>\n",
" <td>27.90</td>\n",
" <td>18.30</td>\n",
" <td>14</td>\n",
" <td>2</td>\n",
" <td>delivered</td>\n",
" <td>alimentos</td>\n",
" <td>59.0</td>\n",
" <td>982.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>joinville</td>\n",
" <td>SC</td>\n",
" <td>2</td>\n",
" <td>16.70</td>\n",
" <td>1.0</td>\n",
" <td>27.90</td>\n",
" <td>3</td>\n",
" <td>16.190</td>\n",
" <td>1.0</td>\n",
" <td>38.596667</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>115 rows × 43 columns</p>\n",
"</div>"
],
"text/plain": [
" order_item_id price freight_value HOUR(shipping_limit_date) \\\n",
"orderitem_id \n",
"0 1 38.50 24.84 20 \n",
"1 1 29.99 7.39 8 \n",
"2 1 110.99 21.27 21 \n",
"3 1 27.99 15.10 23 \n",
"4 1 49.90 16.05 13 \n",
"... ... ... ... ... \n",
"110 1 17.90 10.96 8 \n",
"111 1 79.99 8.91 9 \n",
"112 1 190.00 19.41 13 \n",
"113 1 109.90 15.53 2 \n",
"114 1 27.90 18.30 14 \n",
"\n",
" WEEKDAY(shipping_limit_date) orders.order_status \\\n",
"orderitem_id \n",
"0 4 delivered \n",
"1 0 delivered \n",
"2 1 delivered \n",
"3 1 delivered \n",
"4 2 invoiced \n",
"... ... ... \n",
"110 1 delivered \n",
"111 4 delivered \n",
"112 3 delivered \n",
"113 2 delivered \n",
"114 2 delivered \n",
"\n",
" products.product_category_name products.product_name_lenght \\\n",
"orderitem_id \n",
"0 cama_mesa_banho 53.0 \n",
"1 telefonia 59.0 \n",
"2 cama_mesa_banho 52.0 \n",
"3 telefonia 60.0 \n",
"4 NaN NaN \n",
"... ... ... \n",
"110 cama_mesa_banho 55.0 \n",
"111 beleza_saude 59.0 \n",
"112 climatizacao 60.0 \n",
"113 cool_stuff 46.0 \n",
"114 alimentos 59.0 \n",
"\n",
" products.product_description_lenght products.product_photos_qty \\\n",
"orderitem_id \n",
"0 223.0 1.0 \n",
"1 675.0 5.0 \n",
"2 413.0 1.0 \n",
"3 818.0 6.0 \n",
"4 NaN NaN \n",
"... ... ... \n",
"110 122.0 1.0 \n",
"111 492.0 3.0 \n",
"112 3270.0 4.0 \n",
"113 595.0 2.0 \n",
"114 982.0 1.0 \n",
"\n",
" ... orders.customers.customer_city \\\n",
"orderitem_id ... \n",
"0 ... santa luzia \n",
"1 ... sao paulo \n",
"2 ... gravatai \n",
"3 ... imbituba \n",
"4 ... santa rosa \n",
"... ... ... \n",
"110 ... jundiai \n",
"111 ... sao paulo \n",
"112 ... paulinia \n",
"113 ... rio de janeiro \n",
"114 ... joinville \n",
"\n",
" orders.customers.customer_state products.COUNT(order_items) \\\n",
"orderitem_id \n",
"0 PB 1 \n",
"1 SP 1 \n",
"2 RS 1 \n",
"3 SC 1 \n",
"4 RS 1 \n",
"... ... ... \n",
"110 SP 1 \n",
"111 SP 1 \n",
"112 SP 1 \n",
"113 RJ 1 \n",
"114 SC 2 \n",
"\n",
" products.MEAN(order_items.freight_value) \\\n",
"orderitem_id \n",
"0 24.84 \n",
"1 7.39 \n",
"2 21.27 \n",
"3 15.10 \n",
"4 16.05 \n",
"... ... \n",
"110 10.96 \n",
"111 8.91 \n",
"112 19.41 \n",
"113 15.53 \n",
"114 16.70 \n",
"\n",
" products.MEAN(order_items.order_item_id) \\\n",
"orderitem_id \n",
"0 1.0 \n",
"1 1.0 \n",
"2 1.0 \n",
"3 1.0 \n",
"4 1.0 \n",
"... ... \n",
"110 1.0 \n",
"111 1.0 \n",
"112 1.0 \n",
"113 1.0 \n",
"114 1.0 \n",
"\n",
" products.MEAN(order_items.price) sellers.COUNT(order_items) \\\n",
"orderitem_id \n",
"0 38.50 2 \n",
"1 29.99 1 \n",
"2 110.99 1 \n",
"3 27.99 2 \n",
"4 49.90 1 \n",
"... ... ... \n",
"110 17.90 1 \n",
"111 79.99 5 \n",
"112 190.00 1 \n",
"113 109.90 1 \n",
"114 27.90 3 \n",
"\n",
" sellers.MEAN(order_items.freight_value) \\\n",
"orderitem_id \n",
"0 21.340 \n",
"1 7.390 \n",
"2 21.270 \n",
"3 13.970 \n",
"4 16.050 \n",
"... ... \n",
"110 10.960 \n",
"111 13.206 \n",
"112 19.410 \n",
"113 15.530 \n",
"114 16.190 \n",
"\n",
" sellers.MEAN(order_items.order_item_id) \\\n",
"orderitem_id \n",
"0 1.0 \n",
"1 1.0 \n",
"2 1.0 \n",
"3 1.0 \n",
"4 1.0 \n",
"... ... \n",
"110 1.0 \n",
"111 1.2 \n",
"112 1.0 \n",
"113 1.0 \n",
"114 1.0 \n",
"\n",
" sellers.MEAN(order_items.price) \n",
"orderitem_id \n",
"0 61.200000 \n",
"1 29.990000 \n",
"2 110.990000 \n",
"3 26.490000 \n",
"4 49.900000 \n",
"... ... \n",
"110 17.900000 \n",
"111 54.590000 \n",
"112 190.000000 \n",
"113 109.900000 \n",
"114 38.596667 \n",
"\n",
"[115 rows x 43 columns]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix, feature_defs = ft.dfs(\n",
" entityset=es,\n",
" target_dataframe_name=\"order_items\",\n",
" agg_primitives=[\"mean\", \"count\", \"mode\", \"any\"],\n",
" trans_primitives=[\"hour\", \"weekday\"],\n",
" max_depth=2,\n",
")\n",
"\n",
"feature_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Resulting features\n",
"\n",
"The list of feature definitions, one per column of the resulting dataframe"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<Feature: order_item_id>,\n",
" <Feature: price>,\n",
" <Feature: freight_value>,\n",
" <Feature: HOUR(shipping_limit_date)>,\n",
" <Feature: WEEKDAY(shipping_limit_date)>,\n",
" <Feature: orders.order_status>,\n",
" <Feature: products.product_category_name>,\n",
" <Feature: products.product_name_lenght>,\n",
" <Feature: products.product_description_lenght>,\n",
" <Feature: products.product_photos_qty>,\n",
" <Feature: products.product_weight_g>,\n",
" <Feature: products.product_length_cm>,\n",
" <Feature: products.product_height_cm>,\n",
" <Feature: products.product_width_cm>,\n",
" <Feature: sellers.seller_zip_code_prefix>,\n",
" <Feature: sellers.seller_city>,\n",
" <Feature: sellers.seller_state>,\n",
" <Feature: orders.COUNT(order_items)>,\n",
" <Feature: orders.MEAN(order_items.freight_value)>,\n",
" <Feature: orders.MEAN(order_items.order_item_id)>,\n",
" <Feature: orders.MEAN(order_items.price)>,\n",
" <Feature: orders.HOUR(order_approved_at)>,\n",
" <Feature: orders.HOUR(order_delivered_carrier_date)>,\n",
" <Feature: orders.HOUR(order_delivered_customer_date)>,\n",
" <Feature: orders.HOUR(order_estimated_delivery_date)>,\n",
" <Feature: orders.HOUR(order_purchase_timestamp)>,\n",
" <Feature: orders.WEEKDAY(order_approved_at)>,\n",
" <Feature: orders.WEEKDAY(order_delivered_carrier_date)>,\n",
" <Feature: orders.WEEKDAY(order_delivered_customer_date)>,\n",
" <Feature: orders.WEEKDAY(order_estimated_delivery_date)>,\n",
" <Feature: orders.WEEKDAY(order_purchase_timestamp)>,\n",
" <Feature: orders.customers.customer_unique_id>,\n",
" <Feature: orders.customers.customer_zip_code_prefix>,\n",
" <Feature: orders.customers.customer_city>,\n",
" <Feature: orders.customers.customer_state>,\n",
" <Feature: products.COUNT(order_items)>,\n",
" <Feature: products.MEAN(order_items.freight_value)>,\n",
" <Feature: products.MEAN(order_items.order_item_id)>,\n",
" <Feature: products.MEAN(order_items.price)>,\n",
" <Feature: sellers.COUNT(order_items)>,\n",
" <Feature: sellers.MEAN(order_items.freight_value)>,\n",
" <Feature: sellers.MEAN(order_items.order_item_id)>,\n",
" <Feature: sellers.MEAN(order_items.price)>]"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_defs"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
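
Feature names in the notebook above, such as `sellers.COUNT(order_items)` and `sellers.MEAN(order_items.price)`, describe an aggregation over child rows that is attached back to each row of the target table. The following is a minimal sketch of that computation in plain Python, not Featuretools' own implementation. The toy rows and the helper `aggregate_by_parent` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows standing in for the order_items table.
order_items = [
    {"orderitem_id": 0, "seller_id": "s1", "price": 10.0},
    {"orderitem_id": 1, "seller_id": "s1", "price": 30.0},
    {"orderitem_id": 2, "seller_id": "s2", "price": 50.0},
]


def aggregate_by_parent(rows, parent_key, value_key):
    """Group child rows by a foreign key and compute COUNT and MEAN per parent."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[parent_key]].append(row[value_key])
    return {
        parent: {"count": len(values), "mean": mean(values)}
        for parent, values in groups.items()
    }


# Aggregates per seller (the parent of order_items in this relationship).
seller_stats = aggregate_by_parent(order_items, "seller_id", "price")

# Attach the parent-level aggregates back to every child row.
feature_rows = [
    {
        "orderitem_id": row["orderitem_id"],
        "price": row["price"],
        "sellers.COUNT(order_items)": seller_stats[row["seller_id"]]["count"],
        "sellers.MEAN(order_items.price)": seller_stats[row["seller_id"]]["mean"],
    }
    for row in order_items
]

for row in feature_rows:
    print(row)
```

Both items of seller `s1` receive the same count and mean, which mirrors how rows sharing a seller share the `sellers.*` values in `feature_matrix`.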