nyaggle.experiment

class nyaggle.experiment.Experiment(logging_directory, custom_logger=None, with_mlflow=False, if_exists='error')[source]

Minimal experiment logger for Kaggle

This class provides minimal functionality for tracking experiments. The output files are laid out as follows:

<logging_directory>/
    log.txt       <== Output of log
    metrics.json  <== Output of log_metric(s), format: name,score
    params.json   <== Output of log_param(s), format: key,value
    mlflow.json   <== mlflow's run_id, experiment_id and artifact_uri (logged if with_mlflow=True)

You can also save numpy arrays and pandas dataframes under the directory through log_numpy and log_dataframe.

Parameters
  • logging_directory (str) – Path to directory where output is stored.

  • custom_logger (Optional[Logger]) – A custom logger to be used instead of default logger.

  • with_mlflow (bool) – If True, mlflow tracking is used. One instance of nyaggle.experiment.Experiment corresponds to one run in mlflow. Note that all output files are stored in both logging_directory and mlflow’s directory (mlruns by default).

  • if_exists (str) –

    How to behave if the logging directory already exists.

    • error: Raise a ValueError.

    • replace: Delete logging directory before logging.

    • append: Append to the existing experiment.

    • rename: Rename current directory by adding “_1”, “_2”… suffix

Example

>>> import numpy as np
>>> import pandas as pd
>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment(logging_directory='./output/') as exp:
...     # log a key-value pair as a parameter
...     exp.log_param('lr', 0.01)
...     exp.log_param('optimizer', 'adam')
...
...     # log text
...     exp.log('blah blah blah')
...
...     # log a metric
...     exp.log_metric('CV', 0.85)
...
...     # log a dictionary with flattened keys
...     exp.log_dict('params', {'X': 3, 'Y': {'Z': 'foobar'}})
...
...     # log a numpy ndarray, a pandas dataframe and arbitrary artifacts
...     exp.log_numpy('predicted', np.zeros(1))
...     exp.log_dataframe('submission', pd.DataFrame(), file_format='csv')
...     exp.log_artifact('path-to-your-file')
get_logger()[source]

Get logger used in this experiment.

Return type

Logger

Returns

logger object

get_run()[source]

Get mlflow’s currently active run, or None if with_mlflow = False.

Returns

active Run
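
Example

A minimal sketch (assumes mlflow is installed and with_mlflow=True; accessing run.info.run_id follows mlflow's standard Run API and is shown only for illustration):

>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment('./output/', with_mlflow=True) as exp:
...     run = exp.get_run()
...     print(run.info.run_id)  # attributes of mlflow's active Run are available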

log(text)[source]

Logs a message on the logger for the experiment.

Parameters

text (str) – The message to be written.

log_artifact(src_file_path)[source]

Make a copy of the file under the logging directory.

Parameters

src_file_path (str) – Path of the file. If the path is not a child of the logging directory, the file will be copied there. If with_mlflow is True, mlflow.log_artifact will be called (so another copy will be made).

log_dataframe(name, df, file_format='feather')[source]

Log a pandas dataframe under the logging directory.

Parameters
  • name (str) – Name of the file. A .f or .csv extension will be appended to the file name if it does not already have one.

  • df (DataFrame) – A dataframe to be saved.

  • file_format (str) – A format of output file. csv and feather are supported.

log_dict(name, value, separator='.')[source]

Logs a dictionary as parameters in flattened format.

Parameters
  • name (str) – Parameter name

  • value (Dict) – Parameter value

  • separator (str) – Separating character used to concatenate keys

Examples

>>> with Experiment('./') as e:
...     e.log_dict('a', {'b': 1, 'c': 'd'})
...     print(e.params)
{'a.b': 1, 'a.c': 'd'}
log_metric(name, score)[source]

Log a metric under the logging directory.

Parameters
  • name (str) – Metric name.

  • score (float) – Metric value.

log_metrics(metrics)[source]

Log a batch of metrics under the logging directory.

Parameters

metrics (Dict) – dictionary of metrics.

log_numpy(name, array)[source]

Log a numpy ndarray under the logging directory.

Parameters
  • name (str) – Name of the file. A .npy extension will be appended to the file name if it does not already have one.

  • array (ndarray) – Array data to be saved.

log_param(key, value)[source]

Logs a key-value pair for the experiment.

Parameters
  • key – parameter name

  • value – parameter value

log_params(params)[source]

Logs a batch of parameters for the experiment.

Parameters

params (Dict) – dictionary of parameters

start()[source]

Start a new experiment.

stop()[source]

Stop current experiment.
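
Example

A sketch of explicit start/stop usage, roughly equivalent to the context-manager form shown above:

>>> from nyaggle.experiment import Experiment
>>>
>>> exp = Experiment('./output/', if_exists='append')
>>> exp.start()
>>> exp.log_param('lr', 0.01)
>>> exp.log_metric('CV', 0.85)
>>> exp.stop()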

nyaggle.experiment.add_leaderboard_score(logging_directory, score)[source]

Record a leaderboard score in an existing experiment directory.

Parameters
  • logging_directory (str) – Path to the existing experiment directory to which the score is added

  • score (float) – Leaderboard score
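
Example

A minimal sketch (the directory and score below are placeholders):

>>> from nyaggle.experiment import add_leaderboard_score
>>>
>>> # attach the public LB score to a finished experiment directory
>>> add_leaderboard_score('./output/my-experiment', 0.845)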

nyaggle.experiment.find_best_lgbm_parameter(base_param, X, y, cv=None, groups=None, time_budget=None, type_of_target='auto')[source]

Search hyperparameters for lightgbm using optuna.

Parameters
  • base_param (Dict) – Base parameters passed to lgb.train.

  • X (DataFrame) – Training data.

  • y (Series) – Target

  • cv (Union[int, Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • time_budget (Optional[int]) – Time budget for tuning (in seconds).

  • type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.

Return type

Dict

Returns

The best parameters found
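
Example

A sketch of typical usage (the data and base parameters are illustrative only, not a recommended configuration):

>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from nyaggle.experiment import find_best_lgbm_parameter
>>>
>>> X, y = make_classification(n_samples=200, n_features=10, random_state=0)
>>> X = pd.DataFrame(X, columns=[f'f{i}' for i in range(10)])
>>> y = pd.Series(y)
>>> base_param = {'objective': 'binary', 'metric': 'auc'}  # passed through to lgb.train
>>> best_param = find_best_lgbm_parameter(base_param, X, y, time_budget=60)
>>> # best_param is base_param merged with the values tuned by optuna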

nyaggle.experiment.run_experiment(model_params, X_train, y, X_test=None, logging_directory='output/{time}', if_exists='error', eval_func=None, algorithm_type='lgbm', fit_params=None, cv=None, groups=None, categorical_feature=None, sample_submission=None, submission_filename=None, type_of_target='auto', feature_list=None, feature_directory=None, inherit_experiment=None, with_auto_hpo=False, with_auto_prep=False, with_mlflow=False)[source]

Evaluate metrics by cross-validation and store the results (log, oof prediction, test prediction, feature importance plot and submission file) under the specified directory.

One of the following estimators is used (automatically dispatched by type_of_target(y) and algorithm_type).

  • LGBMClassifier

  • LGBMRegressor

  • CatBoostClassifier

  • CatBoostRegressor

The output files are laid out as follows:

<logging_directory>/
    log.txt                  <== Logging file
    importance.png           <== Feature importance plot generated by nyaggle.util.plot_importance
    oof_prediction.npy       <== Out of fold prediction in numpy array format
    test_prediction.npy      <== Test prediction in numpy array format
    submission.csv           <== Submission csv file
    metrics.json             <== Metrics
    params.json              <== Parameters
    models/
        fold1                <== The trained model in fold 1
        ...
Parameters
  • model_params (Dict[str, Any]) – Parameters passed to the constructor of the classifier/regressor object (e.g. LGBMRegressor).

  • X_train (DataFrame) – Training data. Categorical features should be cast to pandas categorical type or encoded as integers.

  • y (Series) – Target

  • X_test (Optional[DataFrame]) – Test data (optional). If specified, prediction on the test data is performed using an ensemble of the fold models.

  • logging_directory (str) – Path to directory where output of experiment is stored.

  • if_exists (str) –

    How to behave if the logging directory already exists.

    • error: Raise a ValueError.

    • replace: Delete logging directory before logging.

    • append: Append to the existing experiment.

    • rename: Rename current directory by adding “_1”, “_2”… suffix

  • fit_params (Union[Dict[str, Any], Callable, None]) – Parameters passed to the fit method of the estimator. If a dict is passed, the same parameters (except eval_set) are passed for each fold. If a callable is passed, the return value of fit_params(fold_id, train_index, test_index) is used for each fold (see the sketch after this parameter list).

  • eval_func (Optional[Callable]) – Function used for logging and for computing the returned scores. This parameter isn’t passed to the GBDT library, so you should set objective and eval_metric separately if needed. If eval_func is None, roc_auc_score or mean_squared_error is used by default.

  • algorithm_type (str) – Type of gradient boosting library used: “lgbm” (lightgbm) or “cat” (catboost)

  • cv (Union[int, Iterable, BaseCrossValidator, None]) –

    int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

    • None, to use the default KFold(5, random_state=0, shuffle=True),

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter (the instance of BaseCrossValidator),

    • An iterable yielding (train, test) splits as arrays of indices.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • sample_submission (Optional[DataFrame]) – A sample dataframe aligned with the test data (in Kaggle, it is usually available as sample_submission.csv). The submission file will be created with the same schema as this dataframe.

  • submission_filename (Optional[str]) – The name of the submission file created under the logging directory. If None, the basename of the logging directory will be used as the filename.

  • categorical_feature (Optional[List[str]]) – List of categorical column names. If None, categorical columns are automatically determined by dtype.

  • type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.

  • feature_list (Optional[List[Union[int, str]]]) – The list of feature ids saved through nyaggle.feature_store module.

  • feature_directory (Optional[str]) – The directory where the features are stored. Only used if feature_list is not empty.

  • inherit_experiment (Optional[Experiment]) – An experiment object used to log results. If not None, all logs in this function are treated as a part of that experiment.

  • with_auto_prep (bool) – If True, the input datasets will be copied and automatic preprocessing will be performed on them. For example, if algorithm_type = 'cat', all missing values in categorical features will be filled.

  • with_auto_hpo (bool) – If True, model parameters will be automatically updated using optuna (only available in lightgbm).

  • with_mlflow (bool) –

    If True, mlflow tracking is used. One instance of nyaggle.experiment.Experiment corresponds to one run in mlflow. Note that all output files are also stored under mlflow’s directory (mlruns by default).
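
The callable form of fit_params can be sketched as follows (a hypothetical example; sample_weight is just one per-fold argument the estimator's fit() accepts, and the weight array w is illustrative):

>>> import numpy as np
>>>
>>> w = np.ones(1000)  # hypothetical per-sample weights, aligned with a 1000-row X_train
>>> def fit_params(fold_id, train_index, test_index):
...     # called once per fold; the returned dict is passed to the estimator's fit()
...     return {'sample_weight': w[train_index]}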

Returns

Namedtuple with the following members

  • oof_prediction:

    numpy array, shape (len(X_train),). Predicted values on out-of-fold validation data.

  • test_prediction:

    numpy array, shape (len(X_test),). Predicted values on the test data; None if X_test is None.

  • metrics:

    list of float, shape (nfolds+1,). metrics[i] denotes the validation score in the i-th fold; metrics[-1] is the overall score.

  • models:

    list of objects, shape (nfolds,). Trained model for each fold.

  • importance:

    list of pd.DataFrame, feature importance for each fold (type=”gain”).

  • time:

    Training time in seconds.

  • submit_df:

    The dataframe saved as submission.csv
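
Example

A sketch of end-to-end usage on a toy binary classification task (the data, parameters and directory are illustrative only, not a recommended configuration):

>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from nyaggle.experiment import run_experiment
>>>
>>> X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
>>> X = pd.DataFrame(X, columns=[f'f{i}' for i in range(20)])
>>> y = pd.Series(y)
>>> X_train, X_test, y_train = X.iloc[:800], X.iloc[800:], y.iloc[:800]
>>>
>>> params = {'n_estimators': 100, 'max_depth': 4}  # passed to LGBMClassifier
>>> result = run_experiment(params, X_train, y_train, X_test,
...                         logging_directory='./output/example', if_exists='replace')
>>> result.metrics[-1]      # overall CV score (roc_auc_score by default for a binary target)
>>> result.oof_prediction   # shape (len(X_train),)
>>> result.test_prediction  # shape (len(X_test),)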