nyaggle.experiment
- class nyaggle.experiment.Experiment(logging_directory, custom_logger=None, with_mlflow=False, if_exists='error')[source]
Minimal experiment logger for Kaggle.
This module provides minimal functionality for tracking experiments. The output files are laid out as follows:

<logging_directory>/
    log.txt       <== Output of log
    metrics.json  <== Output of log_metric(s), format: name,score
    params.json   <== Output of log_param(s), format: key,value
    mlflow.json   <== mlflow's run_id, experiment_id and artifact_uri (logged if with_mlflow=True)
You can add numpy arrays and pandas dataframes under the directory through log_numpy and log_dataframe.
- Parameters
logging_directory (str) – Path to the directory where output is stored.
custom_logger (Optional[Logger]) – A custom logger to be used instead of the default logger.
with_mlflow (bool) – If True, mlflow tracking is used. One instance of nyaggle.experiment.Experiment corresponds to one run in mlflow. Note that all output files are located both in logging_directory and in mlflow's directory (mlruns by default).
if_exists (str) – How to behave if the logging directory already exists.
    error: Raise a ValueError.
    replace: Delete the logging directory before logging.
    append: Append to the existing experiment.
    rename: Rename the current directory by adding a "_1", "_2", … suffix.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment(logging_directory='./output/') as exp:
>>>     # log key-value pair as a parameter
>>>     exp.log_param('lr', 0.01)
>>>     exp.log_param('optimizer', 'adam')
>>>
>>>     # log text
>>>     exp.log('blah blah blah')
>>>
>>>     # log metric
>>>     exp.log_metric('CV', 0.85)
>>>
>>>     # log dictionary with flattened keys
>>>     exp.log_dict('params', {'X': 3, 'Y': {'Z': 'foobar'}})
>>>
>>>     # log numpy ndarray, pandas dataframe and any artifacts
>>>     exp.log_numpy('predicted', np.zeros(1))
>>>     exp.log_dataframe('submission', pd.DataFrame(), file_format='csv')
>>>     exp.log_artifact('path-to-your-file')
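The if_exists option only matters when the directory already exists; a minimal sketch of the rename behavior (the resulting directory name in the comment is an illustration, not a guaranteed value):
>>> with Experiment(logging_directory='./output/', if_exists='rename') as exp:
>>>     # if ./output/ already exists, this run is logged under a suffixed
>>>     # directory such as ./output_1 instead of raising a ValueError
>>>     exp.log('logged to the renamed directory')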
- get_run()[source]
Get mlflow's currently active run, or None if with_mlflow = False.
- Returns
active Run
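Example (a sketch assuming mlflow is installed and with_mlflow=True; run.info is mlflow's own Run API):
>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment(logging_directory='./output/', with_mlflow=True) as exp:
>>>     run = exp.get_run()
>>>     if run is not None:
>>>         print(run.info.run_id, run.info.experiment_id)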
- log(text)[source]
Logs a message on the logger for the experiment.
- Parameters
text (str) – The message to be written.
- log_artifact(src_file_path)[source]
Make a copy of the file under the logging directory.
- Parameters
src_file_path (str) – Path of the file. If the path is not a child of the logging directory, the file will be copied. If with_mlflow is True, mlflow.log_artifact will be called (and another copy will be made).
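Example (a minimal sketch; the attached file name is hypothetical):
>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment(logging_directory='./output/') as exp:
>>>     with open('model_config.yaml', 'w') as f:  # hypothetical file outside the logging directory
>>>         f.write('learning_rate: 0.01')
>>>     exp.log_artifact('model_config.yaml')      # a copy is placed under ./output/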
- log_dataframe(name, df, file_format='feather')[source]
Log a pandas dataframe under the logging directory.
- Parameters
name (str) – Name of the file. A .f or .csv extension will be appended to the file name if it does not already have one.
df (DataFrame) – A dataframe to be saved.
file_format (str) – Format of the output file. csv and feather are supported.
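Example (a small sketch; the data and file names are arbitrary):
>>> import pandas as pd
>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment(logging_directory='./output/') as exp:
>>>     df = pd.DataFrame({'id': [0, 1], 'pred': [0.2, 0.8]})
>>>     exp.log_dataframe('oof', df)                     # feather format, saved as oof.f
>>>     exp.log_dataframe('oof', df, file_format='csv')  # csv format, saved as oof.csv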
- log_dict(name, value, separator='.')[source]
Logs a dictionary as parameters in flattened format.
- Parameters
name (str) – Parameter name.
value (Dict) – Parameter value.
separator (str) – Separating character used to concatenate keys.
Examples
>>> with Experiment('./') as e:
>>>     e.log_dict('a', {'b': 1, 'c': 'd'})
>>>     print(e.params)
{ 'a.b': 1, 'a.c': 'd' }
- log_metric(name, score)[source]
Log a metric under the logging directory.
- Parameters
name (str) – Metric name.
score (float) – Metric value.
- log_metrics(metrics)[source]
Log a batch of metrics under the logging directory.
- Parameters
metrics (Dict) – Dictionary of metrics.
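Example (a sketch; the metric names and values are arbitrary):
>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment(logging_directory='./output/') as exp:
>>>     exp.log_metrics({'fold0_auc': 0.84, 'fold1_auc': 0.86})  # each entry is logged like log_metric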
- log_numpy(name, array)[source]
Log a numpy ndarray under the logging directory.
- Parameters
name (str) – Name of the file. A .npy extension will be appended to the file name if it does not already have one.
array (ndarray) – Array data to be saved.
- log_param(key, value)[source]
Logs a key-value pair for the experiment.
- Parameters
key – Parameter name.
value – Parameter value.
- nyaggle.experiment.add_leaderboard_score(logging_directory, score)[source]
Record a leaderboard score to the existing experiment directory.
- Parameters
logging_directory (str) – The experiment directory to which the score is added.
score (float) – Leaderboard score.
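Example (a sketch; the path and score are placeholders for an experiment you have already run and submitted):
>>> from nyaggle.experiment import add_leaderboard_score
>>>
>>> # after submitting ./output/exp1/submission.csv to the competition
>>> add_leaderboard_score('./output/exp1', 0.7123)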
- nyaggle.experiment.find_best_lgbm_parameter(base_param, X, y, cv=None, groups=None, time_budget=None, type_of_target='auto')[source]
Search hyperparameters for lightgbm using optuna.
- Parameters
base_param (Dict) – Base parameters passed to lgb.train.
X (DataFrame) – Training data.
y (Series) – Target.
cv (Union[int, Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a "Group" cv instance (e.g., GroupKFold).
time_budget (Optional[int]) – Time budget for tuning (in seconds).
type_of_target (str) – The type of the target variable. If auto, the type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.
- Return type
Dict
- Returns
The best parameters found.
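Example (a sketch using synthetic data; lightgbm and optuna must be installed, and the time budget is arbitrary):
>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from nyaggle.experiment import find_best_lgbm_parameter
>>>
>>> X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
>>> X = pd.DataFrame(X, columns=[f'f{i}' for i in range(20)])
>>> y = pd.Series(y)
>>>
>>> base_param = {'objective': 'binary', 'metric': 'auc'}
>>> best = find_best_lgbm_parameter(base_param, X, y, time_budget=60)
>>> print(best)  # the base parameters merged with the tuned values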
- nyaggle.experiment.run_experiment(model_params, X_train, y, X_test=None, logging_directory='output/{time}', if_exists='error', eval_func=None, algorithm_type='lgbm', fit_params=None, cv=None, groups=None, categorical_feature=None, sample_submission=None, submission_filename=None, type_of_target='auto', feature_list=None, feature_directory=None, inherit_experiment=None, with_auto_hpo=False, with_auto_prep=False, with_mlflow=False)[source]
Evaluate metrics by cross-validation and store the results (log, oof prediction, test prediction, feature importance plot and submission file) under the specified directory.
One of the following estimators is used (automatically dispatched by type_of_target(y) and algorithm_type):
LGBMClassifier
LGBMRegressor
CatBoostClassifier
CatBoostRegressor
The output files are laid out as follows:

<logging_directory>/
    log.txt             <== Logging file
    importance.png      <== Feature importance plot generated by nyaggle.util.plot_importance
    oof_prediction.npy  <== Out-of-fold prediction in numpy array format
    test_prediction.npy <== Test prediction in numpy array format
    submission.csv      <== Submission csv file
    metrics.json        <== Metrics
    params.json         <== Parameters
    models/
        fold1           <== The trained model in fold 1
        ...
- Parameters
model_params (Dict[str, Any]) – Parameters passed to the constructor of the classifier/regressor object (i.e. LGBMRegressor).
X_train (DataFrame) – Training data. Categorical features should be cast to pandas categorical type or encoded as integers.
y (Series) – Target.
X_test (Optional[DataFrame]) – Test data (optional). If specified, prediction on the test data is performed using an ensemble of the models.
logging_directory (str) – Path to the directory where the output of the experiment is stored.
if_exists (str) – How to behave if the logging directory already exists.
    error: Raise a ValueError.
    replace: Delete the logging directory before logging.
    append: Append to the existing experiment.
    rename: Rename the current directory by adding a "_1", "_2", … suffix.
fit_params (Union[Dict[str, Any], Callable, None]) – Parameters passed to the fit method of the estimator. If a dict is passed, the same parameters, except eval_set, are passed for each fold. If a callable is passed, the return value of fit_params(fold_id, train_index, test_index) will be used for each fold.
eval_func (Optional[Callable]) – Function used for logging and for calculating the returned scores. This parameter isn't passed to the GBDT, so you should set objective and eval_metric separately if needed. If eval_func is None, roc_auc_score or mean_squared_error is used by default.
algorithm_type – Type of gradient boosting library used. "lgbm" (lightgbm) or "cat" (catboost).
cv (Union[int, Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy:
    None, to use the default KFold(5, random_state=0, shuffle=True),
    integer, to specify the number of folds in a (Stratified)KFold,
    CV splitter (an instance of BaseCrossValidator),
    an iterable yielding (train, test) splits as arrays of indices.
groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a "Group" cv instance (e.g., GroupKFold).
sample_submission (Optional[DataFrame]) – A sample dataframe aligned with the test data (in Kaggle, it is usually available as sample_submission.csv). The submission file will be created with the same schema as this dataframe.
submission_filename (Optional[str]) – The name of the submission file created under the logging directory. If None, the basename of the logging directory will be used as the filename.
categorical_feature (Optional[List[str]]) – List of categorical column names. If None, categorical columns are automatically determined by dtype.
type_of_target (str) – The type of the target variable. If auto, the type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.
feature_list (Optional[List[Union[int, str]]]) – The list of feature ids saved through the nyaggle.feature_store module.
feature_directory (Optional[str]) – The location where the features are stored. Only used if feature_list is not empty.
inherit_experiment (Optional[Experiment]) – An experiment object which is used to log results. If not None, all logs in this function are treated as a part of that experiment.
with_auto_prep (bool) – If True, the input datasets will be copied and automatic preprocessing will be performed on them. For example, if algorithm_type = 'cat', all missing values in categorical features will be filled.
with_auto_hpo (bool) – If True, model parameters will be automatically tuned using optuna (only available with lightgbm).
with_mlflow (bool) – If True, mlflow tracking is used. One instance of nyaggle.experiment.Experiment corresponds to one run in mlflow. Note that all output files are located both in logging_directory and in mlflow's directory (mlruns by default).
- Returns
Namedtuple with the following members:
- oof_prediction: numpy array, shape (len(X_train),). Predicted values on the out-of-fold validation data.
- test_prediction: numpy array, shape (len(X_test),). Predicted values on the test data. None if X_test is None.
- metrics: list of float, shape (nfolds+1). metrics[i] denotes the validation score in the i-th fold; metrics[-1] is the overall score.
- models: list of objects, shape (nfolds). Trained models for each fold.
- importance: list of pd.DataFrame. Feature importance for each fold (type="gain").
- time: Training time in seconds.
- submit_df: The dataframe saved as submission.csv.
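Example (an end-to-end sketch using synthetic data; the model parameters and directory name are placeholders):
>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from nyaggle.experiment import run_experiment
>>>
>>> X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
>>> X = pd.DataFrame(X, columns=[f'f{i}' for i in range(20)])
>>> X_train, X_test = X.iloc[:800], X.iloc[800:]
>>> y_train = pd.Series(y[:800])
>>>
>>> model_params = {'n_estimators': 300, 'max_depth': 4}
>>> result = run_experiment(model_params, X_train, y_train, X_test,
>>>                         logging_directory='./output/exp1')
>>> print(result.metrics[-1])  # overall CV score
>>> result.submit_df.head()    # the dataframe written as submission.csv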