nyaggle.experiment
- class nyaggle.experiment.Experiment(logging_directory, custom_logger=None, with_mlflow=False, if_exists='error')
Minimal experiment logger for Kaggle
This module provides minimal functionality for tracking experiments. The output files are laid out as follows:
    <logging_directory>/
        log.txt       <== Output of log
        metrics.json  <== Output of log_metric(s), format: name,score
        params.json   <== Output of log_param(s), format: key,value
        mlflow.json   <== mlflow's run_id, experiment_id and artifact_uri (logged if with_mlflow=True)

You can add numpy arrays and pandas dataframes under the directory through log_numpy and log_dataframe.
- Parameters
logging_directory (str) – Path to the directory where output is stored.
custom_logger (Optional[Logger]) – A custom logger to be used instead of the default logger.
with_mlflow (bool) – If True, mlflow tracking is used. One instance of nyaggle.experiment.Experiment corresponds to one run in mlflow. Note that all output files are located in both logging_directory and mlflow’s directory (mlruns by default).
if_exists (str) – How to behave if the logging directory already exists.
error: Raise a ValueError.
replace: Delete logging directory before logging.
append: Append to existing experiment.
rename: Rename current directory by adding a “_1”, “_2”… suffix.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from nyaggle.experiment import Experiment
>>>
>>> with Experiment(logging_directory='./output/') as exp:
>>>     # log key-value pair as a parameter
>>>     exp.log_param('lr', 0.01)
>>>     exp.log_param('optimizer', 'adam')
>>>
>>>     # log text
>>>     exp.log('blah blah blah')
>>>
>>>     # log metric
>>>     exp.log_metric('CV', 0.85)
>>>
>>>     # log dictionary with flattening keys
>>>     exp.log_dict('params', {'X': 3, 'Y': {'Z': 'foobar'}})
>>>
>>>     # log numpy ndarray, pandas dataframe and any artifacts
>>>     exp.log_numpy('predicted', np.zeros(1))
>>>     exp.log_dataframe('submission', pd.DataFrame(), file_format='csv')
>>>     exp.log_artifact('path-to-your-file')
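The four if_exists options can be pictured with a small standard-library sketch. This is not nyaggle's implementation: resolve_logging_directory is a hypothetical helper, and reading "rename" as "pick a fresh suffixed name for the new run" is an assumption made for illustration.

```python
import os
import shutil
import tempfile

VALID_MODES = {"error", "replace", "append", "rename"}


def resolve_logging_directory(path: str, if_exists: str = "error") -> str:
    """Hypothetical helper: return the directory to log into, creating it if needed."""
    if if_exists not in VALID_MODES:
        raise ValueError(f"unknown if_exists: {if_exists}")
    # Normalize so that './output/' and './output' are treated the same.
    path = os.path.normpath(path)
    if os.path.exists(path):
        if if_exists == "error":
            raise ValueError(f"directory {path} already exists")
        if if_exists == "replace":
            # Delete the old directory before logging.
            shutil.rmtree(path)
        elif if_exists == "rename":
            # Try path_1, path_2, ... until an unused name is found (assumed behavior).
            index = 1
            while os.path.exists(f"{path}_{index}"):
                index += 1
            path = f"{path}_{index}"
        # "append" keeps the existing directory and its contents untouched.
    os.makedirs(path, exist_ok=True)
    return path
```

Calling the helper twice with if_exists="rename" on an existing directory yields two distinct directories, while "append" returns the original directory with its files still in place.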
- get_run()
Get mlflow’s currently active run, or None if with_mlflow = False.
- Returns
active Run
- log(text)
Logs a message on the logger for the experiment.
- Parameters
text (str) – The message to be written.
- log_artifact(src_file_path)
Make a copy of the file under the logging directory.
- Parameters
src_file_path (str) – Path of the file. If the path is not a child of the logging directory, the file will be copied. If with_mlflow is True, mlflow.log_artifact will be called (then another copy will be made).
- log_dataframe(name, df, file_format='feather')
Log a pandas dataframe under the logging directory.
- Parameters
name (str) – Name of the file. A .f or .csv extension will be appended to the file name if it does not already have one.
df (DataFrame) – A dataframe to be saved.
file_format (str) – Format of the output file. csv and feather are supported.
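The naming rule for log_dataframe can be sketched as follows. dataframe_file_name is a hypothetical helper, not part of nyaggle, and it assumes that "does not already have one" means the name does not already end in the extension matching file_format.

```python
def dataframe_file_name(name: str, file_format: str = "feather") -> str:
    """Hypothetical helper mirroring the documented extension rule."""
    # Documented formats and their extensions: feather -> .f, csv -> .csv
    extensions = {"feather": ".f", "csv": ".csv"}
    if file_format not in extensions:
        raise ValueError(f"unsupported file_format: {file_format}")
    extension = extensions[file_format]
    # Append the extension only when the name does not already end with it (assumption).
    return name if name.endswith(extension) else name + extension
```

Under this sketch, 'submission' with file_format='csv' becomes 'submission.csv', and a name that already ends in '.csv' is left unchanged.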
- log_dict(name, value, separator='.')
Logs a dictionary as parameters in flattened form.
- Parameters
name (str) – Parameter name
value (Dict) – Parameter value
separator (str) – Separating character used to concatenate keys
Examples
>>> with Experiment('./') as e:
>>>     e.log_dict('a', {'b': 1, 'c': 'd'})
>>>     print(e.params)
{ 'a.b': 1, 'a.c': 'd' }
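The flattening that log_dict describes can be sketched in a few lines. flatten_dict is a hypothetical stand-in written for illustration, not nyaggle's own code; it joins nested keys with the separator in the same way the example above shows.

```python
from typing import Any, Dict


def flatten_dict(name: str, value: Dict[str, Any], separator: str = ".") -> Dict[str, Any]:
    """Hypothetical helper: flatten a nested dict into 'name<sep>key' entries."""
    flat: Dict[str, Any] = {}
    for key, item in value.items():
        full_key = f"{name}{separator}{key}"
        if isinstance(item, dict):
            # Recurse so that deeper levels extend the same key path.
            flat.update(flatten_dict(full_key, item, separator))
        else:
            flat[full_key] = item
    return flat
```

With this sketch, flatten_dict('params', {'X': 3, 'Y': {'Z': 'foobar'}}) gives {'params.X': 3, 'params.Y.Z': 'foobar'}.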
- log_metric(name, score)
Log a metric under the logging directory.
- Parameters
name (str) – Metric name.
score (float) – Metric value.
- log_metrics(metrics)
Log a batch of metrics under the logging directory.
- Parameters
metrics (Dict) – Dictionary of metrics.
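The relationship between log_metric and log_metrics can be illustrated with a small recorder that rewrites metrics.json on every call. MetricsRecorder is a hypothetical class, and storing the metrics as a JSON object of name-to-score pairs is an assumption about the file format, not something taken from nyaggle's source.

```python
import json
import os
import tempfile
from typing import Dict


class MetricsRecorder:
    """Hypothetical stand-in for the metric-logging part of an experiment."""

    def __init__(self, logging_directory: str) -> None:
        self.path = os.path.join(logging_directory, "metrics.json")
        self.metrics: Dict[str, float] = {}

    def log_metric(self, name: str, score: float) -> None:
        # Record one metric and persist the full set.
        self.metrics[name] = score
        self._save()

    def log_metrics(self, metrics: Dict[str, float]) -> None:
        # A batch is treated as repeated single-metric calls.
        for name, score in metrics.items():
            self.log_metric(name, score)

    def _save(self) -> None:
        with open(self.path, "w") as f:
            json.dump(self.metrics, f, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    recorder = MetricsRecorder(tmp)
    recorder.log_metric("CV", 0.85)
    recorder.log_metrics({"fold1": 0.84, "fold2": 0.86})
    with open(recorder.path) as f:
        saved = json.load(f)
```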
- log_numpy(name, array)
Log a numpy ndarray under the logging directory.
- Parameters
name (str) – Name of the file. A .npy extension will be appended to the file name if it does not already have one.
array (ndarray) – Array data to be saved.
- log_param(key, value)
Logs a key-value pair for the experiment.
- Parameters
key – parameter name
value – parameter value
- nyaggle.experiment.add_leaderboard_score(logging_directory, score)
Record a leaderboard score in an existing experiment directory.
- Parameters
logging_directory (str) – The existing experiment directory to which the score is added
score (float) – Leaderboard score
- nyaggle.experiment.find_best_lgbm_parameter(base_param, X, y, cv=None, groups=None, time_budget=None, type_of_target='auto')
Search hyperparameters for lightgbm using optuna.
- Parameters
base_param (Dict) – Base parameters passed to lgb.train.
X (DataFrame) – Training data.
y (Series) – Target
cv (Union[int, Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
time_budget (Optional[int]) – Time budget for tuning (in seconds).
type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.
- Return type
Dict
- Returns
The best parameters found
- nyaggle.experiment.run_experiment(model_params, X_train, y, X_test=None, logging_directory='output/{time}', if_exists='error', eval_func=None, algorithm_type='lgbm', fit_params=None, cv=None, groups=None, categorical_feature=None, sample_submission=None, submission_filename=None, type_of_target='auto', feature_list=None, feature_directory=None, inherit_experiment=None, with_auto_hpo=False, with_auto_prep=False, with_mlflow=False)
Evaluates metrics by cross-validation and stores the results (log, oof prediction, test prediction, feature importance plot and submission file) under the specified directory.
One of the following estimators is used (automatically dispatched by type_of_target(y) and algorithm_type).
LGBMClassifier
LGBMRegressor
CatBoostClassifier
CatBoostRegressor
The output files are laid out as follows:
    <logging_directory>/
        log.txt              <== Logging file
        importance.png       <== Feature importance plot generated by nyaggle.util.plot_importance
        oof_prediction.npy   <== Out of fold prediction in numpy array format
        test_prediction.npy  <== Test prediction in numpy array format
        submission.csv       <== Submission csv file
        metrics.json         <== Metrics
        params.json          <== Parameters
        models/
            fold1            <== The trained model in fold 1
            ...
- Parameters
model_params (Dict[str, Any]) – Parameters passed to the constructor of the classifier/regressor object (e.g. LGBMRegressor).
X_train (DataFrame) – Training data. Categorical features should be cast to pandas categorical type or encoded as integers.
y (Series) – Target
X_test (Optional[DataFrame]) – Test data (optional). If specified, prediction on the test data is performed using an ensemble of models.
logging_directory (str) – Path to the directory where the output of the experiment is stored.
if_exists (str) – How to behave if the logging directory already exists.
error: Raise a ValueError.
replace: Delete logging directory before logging.
append: Append to existing experiment.
rename: Rename current directory by adding a “_1”, “_2”… suffix.
fit_params (Union[Dict[str, Any], Callable, None]) – Parameters passed to the fit method of the estimator. If a dict is passed, the same parameters (except eval_set) are passed for each fold. If a callable is passed, the return value of fit_params(fold_id, train_index, test_index) will be used for each fold.
eval_func (Optional[Callable]) – Function used for logging and for calculating the returned scores. This parameter isn’t passed to GBDT, so you should set objective and eval_metric separately if needed. If eval_func is None, roc_auc_score or mean_squared_error is used by default.
algorithm_type – Type of gradient boosting library used. “lgbm” (lightgbm) or “cat” (catboost)
cv (Union[int, Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
None, to use the default KFold(5, random_state=0, shuffle=True),
integer, to specify the number of folds in a (Stratified)KFold,
CV splitter (the instance of BaseCrossValidator),
An iterable yielding (train, test) splits as arrays of indices.
groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
sample_submission (Optional[DataFrame]) – A sample dataframe aligned with the test data (usually available in Kaggle as sample_submission.csv). The submission file will be created with the same schema as this dataframe.
submission_filename (Optional[str]) – The name of the submission file to be created under the logging directory. If None, the basename of the logging directory will be used as the filename.
categorical_feature (Optional[List[str]]) – List of categorical column names. If None, categorical columns are automatically determined by dtype.
type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.
feature_list (Optional[List[Union[int, str]]]) – The list of feature ids saved through the nyaggle.feature_store module.
feature_directory (Optional[str]) – The location where features are stored. Only used if feature_list is not empty.
inherit_experiment (Optional[Experiment]) – An experiment object which is used to log results. If not None, all logs in this function are treated as a part of this experiment.
with_auto_prep (bool) – If True, the input datasets will be copied and automatic preprocessing will be performed on them. For example, if algorithm_type = 'cat', all missing values in categorical features will be filled.
with_auto_hpo (bool) – If True, model parameters will be automatically updated using optuna (only available in lightgbm).
with_mlflow (bool) – If True, mlflow tracking is used. One instance of nyaggle.experiment.Experiment corresponds to one run in mlflow. Note that all output files are located in both logging_directory and mlflow’s directory (mlruns by default).
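The two accepted forms of fit_params (a dict shared by all folds, or a callable invoked per fold) can be sketched without any GBDT library. Everything here is illustrative: resolve_fit_params, fit_params_for_fold, ROW_WEIGHTS and the toy folds are hypothetical names, and the special handling of eval_set mentioned above is deliberately left out.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union

# Hypothetical per-row weights for a six-row training set.
ROW_WEIGHTS: List[float] = [1.0, 1.0, 2.0, 2.0, 1.0, 3.0]

FitParams = Union[Dict[str, Any], Callable[..., Dict[str, Any]], None]


def fit_params_for_fold(fold_id: int, train_index: Sequence[int],
                        test_index: Sequence[int]) -> Dict[str, Any]:
    """Callable with the documented (fold_id, train_index, test_index) signature."""
    # Select the weights of the rows that are used for training in this fold.
    return {"sample_weight": [ROW_WEIGHTS[i] for i in train_index]}


def resolve_fit_params(fit_params: FitParams, fold_id: int,
                       train_index: Sequence[int],
                       test_index: Sequence[int]) -> Dict[str, Any]:
    """Hypothetical helper: turn either form of fit_params into a dict for one fold."""
    if fit_params is None:
        return {}
    if callable(fit_params):
        return fit_params(fold_id, train_index, test_index)
    # A dict is reused for every fold; copy it so that folds cannot affect each other.
    return dict(fit_params)


# Toy (train, test) index splits standing in for a 3-fold cross-validation.
folds: List[Tuple[List[int], List[int]]] = [
    ([2, 3, 4, 5], [0, 1]),
    ([0, 1, 4, 5], [2, 3]),
    ([0, 1, 2, 3], [4, 5]),
]

per_fold = [resolve_fit_params(fit_params_for_fold, i, train, test)
            for i, (train, test) in enumerate(folds)]
```

A callable is the natural choice when a fit argument depends on which rows are in the training split, as sample_weight does here; a plain dict is enough when every fold uses the same settings.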
- Returns
Namedtuple with the following members
- oof_prediction:
numpy array, shape (len(X_train),). Predicted value on Out-of-Fold validation data.
- test_prediction:
numpy array, shape (len(X_test),). Predicted value on test data. None if X_test is None.
- metrics:
list of float, shape (nfolds+1). metrics[i] denotes the validation score in the i-th fold. metrics[-1] is the overall score.
- models:
list of objects, shape (nfolds). Trained models for each fold.
- importance:
list of pd.DataFrame, feature importance for each fold (type="gain").
- time:
Training time in seconds.
- submit_df:
The dataframe saved as submission.csv
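The layout of the metrics member (one score per fold, then an overall score) can be illustrated with plain Python. This is a sketch under assumptions: score_folds and the local mean_squared_error are hypothetical helpers, and computing the overall score on the full out-of-fold prediction (rather than averaging the fold scores) is an assumption, not something stated above.

```python
from typing import Callable, List, Sequence, Tuple


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Plain-Python MSE, standing in for an eval_func."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def score_folds(y: Sequence[float], oof_prediction: Sequence[float],
                folds: Sequence[Tuple[Sequence[int], Sequence[int]]],
                eval_func: Callable[[Sequence[float], Sequence[float]], float]) -> List[float]:
    """Hypothetical helper: return a list of length nfolds + 1."""
    metrics = []
    for _, valid_index in folds:
        # metrics[i]: score on the validation rows of fold i.
        metrics.append(eval_func([y[i] for i in valid_index],
                                 [oof_prediction[i] for i in valid_index]))
    # metrics[-1]: overall score on all out-of-fold predictions (assumed definition).
    metrics.append(eval_func(y, oof_prediction))
    return metrics


y = [1.0, 2.0, 3.0, 4.0]
oof_prediction = [1.0, 2.0, 2.0, 6.0]
folds = [([2, 3], [0, 1]), ([0, 1], [2, 3])]

metrics = score_folds(y, oof_prediction, folds, mean_squared_error)
print(metrics)  # [0.0, 2.5, 1.25]
```

Note that the overall score (1.25) equals the mean of the fold scores here only because both folds have the same size.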