Feature management using feature_store


Feature engineering is one of the most important parts of Kaggle competitions. If you do a lot of it, recomputing every feature each time you build a model is time-consuming.

Many skilled Kagglers save their features to local disk in a binary format (npy, pickle or feather) to manage them [1] [2] [3] [4].

feature_store provides simple helper APIs for feature management.

import pandas as pd
import nyaggle.feature_store as fs

def make_feature_1(df: pd.DataFrame) -> pd.DataFrame:
    return ...

def make_feature_2(df: pd.DataFrame) -> pd.DataFrame:
    return ...

# feature 1
feature_1 = make_feature_1(df)

# feature 2
feature_2 = make_feature_2(df)

# name can be str or int
fs.save_feature(feature_1, "my_feature_1")
fs.save_feature(feature_2, 42, '../my_favorite_feature_store')  # change directory to be saved

save_feature stores the dataframe in feather format under the feature directory (./features by default). To load a feature, call load_feature with the name it was saved under.

feature_1_restored = fs.load_feature("my_feature_1")
feature_2_restored = fs.load_feature(42, '../my_favorite_feature_store')  # same name and directory used when saving
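To make the save and load round trip concrete, here is a minimal sketch of the idea, not nyaggle's actual implementation. The helper names save_feature_sketch and load_feature_sketch are hypothetical, and the sketch writes pickle files instead of feather so that it runs with pandas alone.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_feature_sketch(df: pd.DataFrame, name, directory: str) -> Path:
    """Write df to <directory>/<name>.pkl; name may be a str or an int."""
    path = Path(directory) / f"{name}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(path)  # assumption: pickle stands in for feather here
    return path


def load_feature_sketch(name, directory: str) -> pd.DataFrame:
    """Read back the frame stored under the given name."""
    return pd.read_pickle(Path(directory) / f"{name}.pkl")


with tempfile.TemporaryDirectory() as tmp:
    feature = pd.DataFrame({"x_squared": [1, 4, 9]})
    saved_path = save_feature_sketch(feature, 42, tmp)  # int name, custom directory
    restored = load_feature_sketch(42, tmp)             # same name and directory
    round_trip_ok = restored.equals(feature)
    saved_name = saved_path.name
```

The point of the sketch is that a feature is identified by the pair (name, directory), so loading must use the same pair that was used when saving.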

To merge all features into the main dataframe, call load_features with the main dataframe you want to merge with.

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
base_df = pd.concat([train, test])

df_with_features = fs.load_features(base_df, ["my_feature_1", "magic_1", "leaky_1"])


load_features assumes that the stored feature values are concatenated in the order [train, test].
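The row-order assumption can be illustrated with plain pandas. The sketch below assumes that features are attached to the base dataframe by position (row i of the feature belongs to row i of the base dataframe); the source does not show how nyaggle performs the merge internally, and the toy column names are made up for illustration.

```python
import pandas as pd

# Toy data standing in for train.csv and test.csv
train = pd.DataFrame({"id": [1, 2], "target": [0, 1]})
test = pd.DataFrame({"id": [3, 4, 5]})

# ignore_index=True gives the combined frame a clean 0..n-1 index
base_df = pd.concat([train, test], ignore_index=True)

# A feature computed on the combined frame has one row per base_df row,
# in the same order: train rows first, then test rows
feature = pd.DataFrame({"id_squared": base_df["id"] ** 2})

# Position-wise merge: columns are placed side by side, row by row
merged = pd.concat([base_df, feature.reset_index(drop=True)], axis=1)

# Because the order is [train, test], the frame can be split back by length
n_train = len(train)
train_with_features = merged.iloc[:n_train]
test_with_features = merged.iloc[n_train:]
```

If a feature were saved with test rows first, or computed on train only, a positional merge like this would silently attach values to the wrong rows, which is why the ordering convention matters.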

If you don't want to separate your feature engineering code into an independent module, the cached_feature decorator provides cache functionality. A function with this decorator automatically saves its return value using save_feature on the first call, and on subsequent calls returns the result of load_feature instead of executing the function body.

import pandas as pd
import nyaggle.feature_store as fs

@fs.cached_feature("my_feature_1")
def make_feature_1(df: pd.DataFrame) -> pd.DataFrame:
    result = ...  # compute the feature DataFrame here
    return result

# saves automatically to features/my_feature_1.f
feature_1 = make_feature_1(df)

# loads from saved binary instead of calling make_feature_1
feature_1 = make_feature_1(df)


The function decorated by cached_feature must return a pandas DataFrame.
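The caching behaviour described above can be sketched as a small decorator. This is an illustrative stand-in, not nyaggle's code: cached_feature_sketch is a hypothetical name, and it caches to a pickle file rather than feather so that it runs with pandas alone.

```python
import functools
import tempfile
from pathlib import Path

import pandas as pd


def cached_feature_sketch(name, directory):
    """Hypothetical cache-to-disk decorator keyed by feature name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            path = Path(directory) / f"{name}.pkl"
            if path.exists():
                # Cache hit: the function body is not executed
                return pd.read_pickle(path)
            result = func(*args, **kwargs)
            if not isinstance(result, pd.DataFrame):
                raise TypeError("decorated function must return a pandas DataFrame")
            path.parent.mkdir(parents=True, exist_ok=True)
            result.to_pickle(path)
            return result
        return wrapper
    return decorator


calls = []  # records each time the function body actually runs

with tempfile.TemporaryDirectory() as tmp:
    @cached_feature_sketch("my_feature_1", tmp)
    def make_feature_1(df: pd.DataFrame) -> pd.DataFrame:
        calls.append(1)
        return pd.DataFrame({"doubled": df["x"] * 2})

    df = pd.DataFrame({"x": [1, 2, 3]})
    first = make_feature_1(df)   # runs the body and writes the cache file
    second = make_feature_1(df)  # reads the cache file; the body is skipped
```

In this sketch the cache is keyed only by the feature name, so calling the function with a different dataframe still returns the stored result. Delete the cached file (or use a new name) when the feature logic or input data changes.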

Use with run_experiment

If you pass the feature_list and feature_directory parameters to the run_experiment API, nyaggle will merge the specified features into the given dataframe before performing cross-validation.

The list of features is logged as a parameter (and can of course be seen in the mlflow UI), which makes your experiment cycle much simpler.

import pandas as pd
import nyaggle.feature_store as fs
from nyaggle.experiment import run_experiment

               feature_list=["my_feature_1", "magic_1", "leaky_1"],