nyaggle.experiment

class nyaggle.experiment.Experiment(logging_directory, custom_logger=None, with_mlflow=False, if_exists='error')[source]

Minimal experiment logger for Kaggle

This class provides minimal functionality for tracking experiments. The output files are laid out as follows:

<logging_directory>/
    log.txt       <== Output of log
    metrics.json  <== Output of log_metric(s), format: name,score
    params.json   <== Output of log_param(s), format: key,value
    mlflow.json   <== mlflow's run_id, experiment_id and artifact_uri (logged if with_mlflow=True)

You can also save numpy arrays and pandas dataframes under the directory through log_numpy and log_dataframe.

Parameters
  • logging_directory (str) – Path to directory where output is stored.

  • custom_logger (Optional[Logger]) – A custom logger to be used instead of default logger.

  • with_mlflow (bool) – If True, mlflow tracking is used. One instance of nyaggle.experiment.Experiment corresponds to one run in mlflow. Note that all output files are stored in both logging_directory and mlflow’s directory (mlruns by default).

  • if_exists (str) –

    How to behave if the logging directory already exists.

    • error: Raise a ValueError.

    • replace: Delete logging directory before logging.

    • append: Append to the existing experiment.

    • rename: Rename current directory by adding “_1”, “_2”… suffix

Example

>>> import numpy as np
>>> import pandas as pd
>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment(logging_directory='./output/') as exp:
...     # log a key-value pair as a parameter
...     exp.log_param('lr', 0.01)
...     exp.log_param('optimizer', 'adam')
...
...     # log text
...     exp.log('blah blah blah')
...
...     # log a metric
...     exp.log_metric('CV', 0.85)
...
...     # log a dictionary with flattened keys
...     exp.log_dict('params', {'X': 3, 'Y': {'Z': 'foobar'}})
...
...     # log a numpy ndarray, a pandas dataframe and arbitrary artifacts
...     exp.log_numpy('predicted', np.zeros(1))
...     exp.log_dataframe('submission', pd.DataFrame(), file_format='csv')
...     exp.log_artifact('path-to-your-file')
get_logger()[source]

Get logger used in this experiment.

Return type

Logger

Returns

logger object

get_run()[source]

Get mlflow’s currently active run, or None if with_mlflow = False.

Returns

active Run
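
Example

A minimal sketch (assumes mlflow is installed and with_mlflow=True; accessing run.info.run_id follows mlflow's standard Run API and is shown only for illustration):

>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment('./output/', with_mlflow=True) as exp:
...     run = exp.get_run()
...     print(run.info.run_id)  # attributes of mlflow's active Run are available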

log(text)[source]

Logs a message on the logger for the experiment.

Parameters

text (str) – The message to be written.

log_artifact(src_file_path)[source]

Make a copy of the file under the logging directory.

Parameters

src_file_path (str) – Path of the file. If the path is not a child of the logging directory, the file will be copied there. If with_mlflow is True, mlflow.log_artifact will be called (so another copy will be made).

log_dataframe(name, df, file_format='feather')[source]

Log a pandas dataframe under the logging directory.

Parameters
  • name (str) – Name of the file. A .f or .csv extension will be appended to the file name if it does not already have one.

  • df (DataFrame) – A dataframe to be saved.

  • file_format (str) – A format of output file. csv and feather are supported.

log_dict(name, value, separator='.')[source]

Logs a dictionary as parameters in flattened format.

Parameters
  • name (str) – Parameter name

  • value (Dict) – Parameter value

  • separator (str) – Separating character used to concatenate keys

Examples

>>> with Experiment('./') as e:
...     e.log_dict('a', {'b': 1, 'c': 'd'})
...     print(e.params)
{'a.b': 1, 'a.c': 'd'}
log_metric(name, score)[source]

Log a metric under the logging directory.

Parameters
  • name (str) – Metric name.

  • score (float) – Metric value.

log_metrics(metrics)[source]

Log a batch of metrics under the logging directory.

Parameters

metrics (Dict) – dictionary of metrics.

log_numpy(name, array)[source]

Log a numpy ndarray under the logging directory.

Parameters
  • name (str) – Name of the file. A .npy extension will be appended to the file name if it does not already have one.

  • array (ndarray) – Array data to be saved.

log_param(key, value)[source]

Logs a key-value pair for the experiment.

Parameters
  • key – parameter name

  • value – parameter value

log_params(params)[source]

Logs a batch of parameters for the experiment.

Parameters

params (Dict) – dictionary of parameters

start()[source]

Start a new experiment.

stop()[source]

Stop current experiment.
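
Example

A sketch of explicit start/stop usage, roughly equivalent to the context-manager form shown above:

>>> from nyaggle.experiment import Experiment
>>>
>>> exp = Experiment('./output/', if_exists='append')
>>> exp.start()
>>> exp.log_param('lr', 0.01)
>>> exp.log_metric('CV', 0.85)
>>> exp.stop()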

nyaggle.experiment.add_leaderboard_score(logging_directory, score)[source]

Record a leaderboard score in an existing experiment directory.

Parameters
  • logging_directory (str) – Path to the existing experiment directory to which the score is added

  • score (float) – Leaderboard score
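
Example

A minimal sketch (the directory and score below are placeholders):

>>> from nyaggle.experiment import add_leaderboard_score
>>>
>>> # attach the public LB score to a finished experiment directory
>>> add_leaderboard_score('./output/my-experiment', 0.845)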

nyaggle.experiment.find_best_lgbm_parameter(base_param, X, y, cv=None, groups=None, time_budget=None, type_of_target='auto')[source]

Search hyperparameters for lightgbm using optuna.

Parameters
  • base_param (Dict) – Base parameters passed to lgb.train.

  • X (DataFrame) – Training data.

  • y (Series) – Target

  • cv (Union[int, Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • time_budget (Optional[int]) – Time budget for tuning (in seconds).

  • type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.

Return type

Dict

Returns

The best parameters found
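
Example

A sketch of typical usage (the data and base parameters are illustrative only, not a recommended configuration):

>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from nyaggle.experiment import find_best_lgbm_parameter
>>>
>>> X, y = make_classification(n_samples=200, n_features=10, random_state=0)
>>> X = pd.DataFrame(X, columns=[f'f{i}' for i in range(10)])
>>> y = pd.Series(y)
>>> base_param = {'objective': 'binary', 'metric': 'auc'}  # passed through to lgb.train
>>> best_param = find_best_lgbm_parameter(base_param, X, y, time_budget=60)
>>> # best_param is base_param merged with the values tuned by optuna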

nyaggle.experiment.run_experiment(model_params, X_train, y, X_test=None, logging_directory='output/{time}', if_exists='error', eval_func=None, algorithm_type='lgbm', fit_params=None, cv=None, groups=None, categorical_feature=None, sample_submission=None, submission_filename=None, type_of_target='auto', feature_list=None, feature_directory=None, inherit_experiment=None, with_auto_hpo=False, with_auto_prep=False, with_mlflow=False)[source]

Evaluate metrics by cross-validation and store the results (log, oof prediction, test prediction, feature importance plot and submission file) under the specified directory.

One of the following estimators is used (automatically dispatched by type_of_target(y) and algorithm_type).

  • LGBMClassifier

  • LGBMRegressor

  • CatBoostClassifier

  • CatBoostRegressor

The output files are laid out as follows:

<logging_directory>/
    log.txt                  <== Logging file
    importance.png           <== Feature importance plot generated by nyaggle.util.plot_importance
    oof_prediction.npy       <== Out of fold prediction in numpy array format
    test_prediction.npy      <== Test prediction in numpy array format
    submission.csv           <== Submission csv file
    metrics.json             <== Metrics
    params.json              <== Parameters
    models/
        fold1                <== The trained model in fold 1
        ...
Parameters
  • model_params (Dict[str, Any]) – Parameters passed to the constructor of the classifier/regressor object (e.g. LGBMRegressor).

  • X_train (DataFrame) – Training data. Categorical features should be cast to pandas categorical type or encoded as integers.

  • y (Series) – Target

  • X_test (Optional[DataFrame]) – Test data (optional). If specified, prediction on the test data is performed using an ensemble of the fold models.

  • logging_directory (str) – Path to directory where output of experiment is stored.

  • if_exists (str) –

    How to behave if the logging directory already exists.

    • error: Raise a ValueError.

    • replace: Delete logging directory before logging.

    • append: Append to the existing experiment.

    • rename: Rename current directory by adding “_1”, “_2”… suffix

  • fit_params (Union[Dict[str, Any], Callable, None]) – Parameters passed to the fit method of the estimator. If a dict is passed, the same parameters (except eval_set) are passed for each fold. If a callable is passed, the return value of fit_params(fold_id, train_index, test_index) is used for each fold (see the sketch after this parameter list).

  • eval_func (Optional[Callable]) – Function used for logging and for computing the returned scores. This parameter isn’t passed to the GBDT library, so you should set objective and eval_metric separately if needed. If eval_func is None, roc_auc_score or mean_squared_error is used by default.

  • algorithm_type (str) – Type of gradient boosting library used: “lgbm” (lightgbm) or “cat” (catboost)

  • cv (Union[int, Iterable, BaseCrossValidator, None]) –

    int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

    • None, to use the default KFold(5, random_state=0, shuffle=True),

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter (the instance of BaseCrossValidator),

    • An iterable yielding (train, test) splits as arrays of indices.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • sample_submission (Optional[DataFrame]) – A sample dataframe aligned with the test data (in Kaggle, it is usually available as sample_submission.csv). The submission file will be created with the same schema as this dataframe.

  • submission_filename (Optional[str]) – The name of the submission file created under the logging directory. If None, the basename of the logging directory will be used as the filename.

  • categorical_feature (Optional[List[str]]) – List of categorical column names. If None, categorical columns are automatically determined by dtype.

  • type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.

  • feature_list (Optional[List[Union[int, str]]]) – The list of feature ids saved through nyaggle.feature_store module.

  • feature_directory (Optional[str]) – The directory where the features are stored. Only used if feature_list is not empty.

  • inherit_experiment (Optional[Experiment]) – An experiment object used to log results. If not None, all logs in this function are treated as a part of that experiment.

  • with_auto_prep (bool) – If True, the input datasets will be copied and automatic preprocessing will be performed on them. For example, if algorithm_type = 'cat', all missing values in categorical features will be filled.

  • with_auto_hpo (bool) – If True, model parameters will be automatically updated using optuna (only available in lightgbm).

  • with_mlflow (bool) –

    If True, mlflow tracking is used. One instance of nyaggle.experiment.Experiment corresponds to one run in mlflow. Note that all output files are also stored under mlflow’s directory (mlruns by default).
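
The callable form of fit_params can be sketched as follows (a hypothetical example; sample_weight is just one per-fold argument the estimator's fit() accepts, and the weight array w is illustrative):

>>> import numpy as np
>>>
>>> w = np.ones(1000)  # hypothetical per-sample weights, aligned with a 1000-row X_train
>>> def fit_params(fold_id, train_index, test_index):
...     # called once per fold; the returned dict is passed to the estimator's fit()
...     return {'sample_weight': w[train_index]}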

Returns

Namedtuple with the following members

  • oof_prediction:

    numpy array, shape (len(X_train),). Predicted values on out-of-fold validation data.

  • test_prediction:

    numpy array, shape (len(X_test),). Predicted values on the test data; None if X_test is None.

  • metrics:

    list of float, shape (nfolds+1,). metrics[i] denotes the validation score in the i-th fold; metrics[-1] is the overall score.

  • models:

    list of objects, shape (nfolds,). Trained model for each fold.

  • importance:

    list of pd.DataFrame, feature importance for each fold (type=”gain”).

  • time:

    Training time in seconds.

  • submit_df:

    The dataframe saved as submission.csv
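
Example

A sketch of end-to-end usage on a toy binary classification task (the data, parameters and directory are illustrative only, not a recommended configuration):

>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from nyaggle.experiment import run_experiment
>>>
>>> X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
>>> X = pd.DataFrame(X, columns=[f'f{i}' for i in range(20)])
>>> y = pd.Series(y)
>>> X_train, X_test, y_train = X.iloc[:800], X.iloc[800:], y.iloc[:800]
>>>
>>> params = {'n_estimators': 100, 'max_depth': 4}  # passed to LGBMClassifier
>>> result = run_experiment(params, X_train, y_train, X_test,
...                         logging_directory='./output/example', if_exists='replace')
>>> result.metrics[-1]      # overall CV score (roc_auc_score by default for a binary target)
>>> result.oof_prediction   # shape (len(X_train),)
>>> result.test_prediction  # shape (len(X_test),)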