nyaggle.validation

class nyaggle.validation.Nth(n, base_validator)[source]

Returns N-th fold of the base validator

This validator wraps the base validator and yields only its n-th (1-origin) fold.

Parameters
  • n (int) – The 1-origin index of the fold to be taken.

  • base_validator (BaseCrossValidator) – The base validator to be wrapped.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Nth
>>> # take the 3rd fold
>>> folds = Nth(3, KFold(5))
>>> folds.get_n_splits()
1
get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations in the cross-validator

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.

  • groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.

class nyaggle.validation.Skip(n, base_validator)[source]

Skips the first N folds and returns the remaining folds

This validator wraps the base validator to skip the first n folds.

Parameters
  • n (int) – The number of folds to be skipped.

  • base_validator (BaseCrossValidator) – The base validator to be wrapped.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Skip
>>> # take the last 2 folds out of 5
>>> folds = Skip(3, KFold(5))
>>> folds.get_n_splits()
2
get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations in the cross-validator

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.

  • groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.
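To see which rows remain held out after skipping, the same behaviour can be approximated with plain scikit-learn. This sketch assumes that skipping simply drops the leading folds of the base validator's split sequence; it does not use nyaggle itself.

```python
from itertools import islice

import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((10, 2))
base = KFold(5)

# Skipping the first 3 folds leaves folds 4 and 5 of the base validator.
remaining = list(islice(base.split(X), 3, None))
held_out = [test.tolist() for _, test in remaining]
print(len(remaining))  # 2
print(held_out)  # [[6, 7], [8, 9]]
```

Rows 0 to 5 are never used as validation data in this setup, which matters when out-of-fold predictions are expected for every row.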

class nyaggle.validation.SlidingWindowSplit(source, train_from, train_to, test_from, test_to, n_windows, stride)[source]

Sliding window time series cross-validator

Time Series cross-validator which provides train/test indices based on a sliding window to split variable interval time series data. With N = n_windows, each fold is split as follows:

Folds  Training data                                     Testing data
1      (train_from-(N-1)*stride, train_to-(N-1)*stride)  (test_from-(N-1)*stride, test_to-(N-1)*stride)
...    ...                                               ...
N-1    (train_from-stride, train_to-stride)              (test_from-stride, test_to-stride)
N      (train_from, train_to)                            (test_from, test_to)

This class is compatible with sklearn’s BaseCrossValidator (base class of KFold, GroupKFold etc).

Parameters
  • source (Union[Series, str]) – The column name or series of timestamp.

  • train_from (Union[datetime, str]) – Start datetime for the training data in the base split.

  • train_to (Union[datetime, str]) – End datetime for the training data in the base split.

  • test_from (Union[datetime, str]) – Start datetime for the testing data in the base split.

  • test_to (Union[datetime, str]) – End datetime for the testing data in the base split.

  • n_windows (int) – The number of windows (or folds) in the validation.

  • stride (timedelta) – Time delta between folds.
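The fold table above can be turned into concrete dates with the standard library alone. This is an illustrative sketch: sliding_windows is a hypothetical helper, it assumes datetime inputs rather than strings, and it only computes the intervals without selecting any rows.

```python
from datetime import datetime, timedelta


def sliding_windows(train_from, train_to, test_from, test_to, n_windows, stride):
    """List ((train_from, train_to), (test_from, test_to)) per fold, oldest first."""
    windows = []
    for fold in range(1, n_windows + 1):
        # Fold k is the base split shifted back by (n_windows - k) strides.
        shift = (n_windows - fold) * stride
        windows.append(((train_from - shift, train_to - shift),
                        (test_from - shift, test_to - shift)))
    return windows


windows = sliding_windows(
    train_from=datetime(2018, 1, 15), train_to=datetime(2018, 1, 22),
    test_from=datetime(2018, 1, 22), test_to=datetime(2018, 1, 29),
    n_windows=3, stride=timedelta(days=7),
)
for (tr_from, tr_to), (te_from, te_to) in windows:
    print(tr_from.date(), tr_to.date(), te_from.date(), te_to.date())
```

The last fold is the base split itself, and each earlier fold moves both intervals one stride further into the past.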

class nyaggle.validation.StratifiedGroupKFold(n_splits=3, shuffle=False, random_state=None)[source]

Stratified K-Folds cross-validator with grouping

Provides train/test indices to split data into train/test sets. This cross-validation object is a variation of GroupKFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

Parameters
  • n_splits (int) – Number of folds. Must be at least 2.

Example

>>> from pprint import pprint
>>> import numpy as np
>>> from nyaggle.validation import StratifiedGroupKFold
>>> rng = np.random.RandomState(0)
>>> groups = [1, 1, 3, 4, 2, 2, 7, 8, 8]
>>> y      = [1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> X = np.empty((len(y), 0))
>>> folds = StratifiedGroupKFold(random_state=rng)
>>> skf_list = list(folds.split(X=X, y=y, groups=groups))
>>> pprint(skf_list)
[
    (np.array([2, 3, 4, 5, 6]), np.array([0, 1, 7, 8])),
    (np.array([0, 1, 2, 7, 8]), np.array([3, 4, 5, 6])),
    (np.array([0, 1, 3, 4, 5, 6, 7, 8]), np.array([2])),
]
split(X, y, groups=None)[source]

Generate indices to split data into training and test set.

class nyaggle.validation.Take(n, base_validator)[source]

Returns the first N folds of the base validator

This validator wraps the base validator to take the first n folds.

Parameters
  • n (int) – The number of folds to be taken.

  • base_validator (BaseCrossValidator) – The base validator to be wrapped.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Take
>>> # take the first 3 folds out of 5
>>> folds = Take(3, KFold(5))
>>> folds.get_n_splits()
3
get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations in the cross-validator

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters
  • X – Training data.

  • y – Target.

  • groups – Group indices.

Yields

The training set and the testing set indices for that split.

class nyaggle.validation.TimeSeriesSplit(source, times=None)[source]

Time Series cross-validator

Time Series cross-validator which provides train/test indices to split variable interval time series data. This class provides low-level API for time series validation strategy. This class is compatible with sklearn’s BaseCrossValidator (base class of KFold, GroupKFold etc).

Parameters
  • source (Union[Series, str]) – The column name or series of timestamp.

  • times (Optional[List[Tuple[Tuple[Union[datetime, str], Union[datetime, str]], Tuple[Union[datetime, str], Union[datetime, str]]]]]) – Splitting windows, where times[i][0] and times[i][1] denote the train and test time intervals of the (i-1)th fold respectively. Each time interval should be a pair of datetime or str, and the validator generates indices of rows whose timestamp is in the half-open interval [start, end). For example, if times[i][0] = ('2018-01-01', '2018-01-03'), the indices for the (i-1)th training data will be the rows whose timestamp satisfies 2018-01-01 <= t < 2018-01-03.
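The half-open interval rule can be checked directly with pandas. This is a minimal sketch of the selection logic as described, not nyaggle's code; interval_indices is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def interval_indices(timestamps, interval):
    """Positions of rows whose timestamp t satisfies start <= t < end."""
    start, end = (pd.Timestamp(v) for v in interval)
    mask = (timestamps >= start) & (timestamps < end)
    return np.flatnonzero(mask.to_numpy())


df = pd.DataFrame({'time': pd.date_range(start='2018-01-01', periods=5)})
train_idx = interval_indices(df['time'], ('2018-01-01', '2018-01-02'))
test_idx = interval_indices(df['time'], ('2018-01-02', '2018-01-04'))
print(train_idx.tolist(), test_idx.tolist())  # [0] [1, 2]
```

Because the end point is excluded, adjacent train and test intervals that share a boundary date never select the same row.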

Example

>>> import numpy as np
>>> import pandas as pd
>>> from nyaggle.validation import TimeSeriesSplit
>>> df = pd.DataFrame()
>>> df['time'] = pd.date_range(start='2018/1/1', periods=5)
>>> folds = TimeSeriesSplit('time',
...                         [(('2018-01-01', '2018-01-02'), ('2018-01-02', '2018-01-04')),
...                          (('2018-01-02', '2018-01-03'), ('2018-01-04', '2018-01-06'))])
>>> folds.get_n_splits()
2
>>> splits = folds.split(df)
>>> train_index, test_index = next(splits)
>>> train_index
[0]
>>> test_index
[1, 2]
>>> train_index, test_index = next(splits)
>>> train_index
[1]
>>> test_index
[3, 4]
add_fold(train_interval, test_interval)[source]

Append 1 split to the validator.

Parameters
  • train_interval (Tuple[Union[datetime, str], Union[datetime, str]]) – start and end time of training data.

  • test_interval (Tuple[Union[datetime, str], Union[datetime, str]]) – start and end time of test data.

get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations in the cross-validator

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters
  • X – Training data.

  • y – Ignored.

  • groups – Ignored.

Yields

The training set and the testing set indices for that split.

nyaggle.validation.adversarial_validate(X_train, X_test, importance_type='gain', estimator=None, categorical_feature=None, cv=None)[source]

Perform adversarial validation between X_train and X_test.

Parameters
  • X_train (DataFrame) – Training data

  • X_test (DataFrame) – Test data

  • importance_type (str) – The type of feature importance calculated.

  • estimator (Optional[BaseEstimator]) – The custom estimator. If None, LGBMClassifier is automatically used. Only LGBMModel or CatBoost instances are supported.

  • categorical_feature (Optional[List[str]]) – List of categorical column names. If None, categorical columns are automatically determined by dtype.

  • cv (Union[int, Iterable, BaseCrossValidator, None]) – Cross validation split. If None, the first fold out of 5 folds is used as validation.

Return type

ADVResult

Returns

Namedtuple with the following members

  • auc:

    float, ROC AUC score of adversarial validation.

  • importance:

    pandas DataFrame, feature importance of adversarial model (order by importance)

Example

>>> from sklearn.model_selection import train_test_split
>>> from nyaggle.testing import make_regression_df
>>> from nyaggle.validation import adversarial_validate
>>> X, y = make_regression_df(n_samples=8)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> auc, importance = adversarial_validate(X_train, X_test)
>>>
>>> print(auc)
0.51078231
>>> importance.head()
feature importance
col_1   231.5827204
col_5   207.1837266
col_7   188.6920685
col_4   174.5668498
col_9   170.6438643
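The idea behind adversarial validation can be sketched with scikit-learn alone: label every row by whether it came from the training or the test set, then measure how well a classifier separates the two. The example below is an assumption-laden illustration that swaps in LogisticRegression and 5-fold out-of-fold predictions; it is not the LightGBM-based procedure that adversarial_validate runs by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 3))
X_test = rng.normal(size=(200, 3))
X_test[:, 0] += 3.0  # deliberately shift one feature in the test set

# Label each row by its origin: 0 = train, 1 = test.
X_all = np.vstack([X_train, X_test])
is_test = np.concatenate([np.zeros(len(X_train), dtype=int),
                          np.ones(len(X_test), dtype=int)])

# Out-of-fold probabilities keep the AUC from being inflated by overfitting.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(LogisticRegression(), X_all, is_test,
                          cv=cv, method='predict_proba')[:, 1]
auc = roc_auc_score(is_test, proba)
print(auc > 0.9)  # True: the two sets are easy to tell apart
```

An AUC close to 0.5 suggests the two sets are hard to distinguish, while a value near 1.0 points to a distribution shift worth inspecting through the feature importances.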
nyaggle.validation.cross_validate(estimator, X_train, y, X_test=None, cv=None, groups=None, eval_func=None, logger=None, on_each_fold=None, fit_params=None, importance_type='gain', early_stopping=True, type_of_target='auto')[source]

Evaluate metrics by cross-validation. It also records out-of-fold prediction and test prediction.

Parameters
  • estimator (Union[BaseEstimator, List[BaseEstimator]]) – The object to be used in cross-validation. For list inputs, estimator[i] is trained on i-th fold.

  • X_train (Union[DataFrame, ndarray]) – Training data

  • y (Union[Series, ndarray]) – Target

  • X_test (Union[DataFrame, ndarray, None]) – Test data (Optional). If specified, prediction on the test data is performed using ensemble of models.

  • cv (Union[int, Iterable, BaseCrossValidator, None]) –

    int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

    • None, to use the default KFold(5, random_state=0, shuffle=True),

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter (the instance of BaseCrossValidator),

    • An iterable yielding (train, test) splits as arrays of indices.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • eval_func (Optional[Callable]) – Function used for logging and returning scores

  • logger (Optional[Logger]) – logger

  • on_each_fold (Optional[Callable[[int, BaseEstimator, DataFrame, Series], None]]) – called for each fold with (idx_fold, model, X_fold, y_fold)

  • fit_params (Union[Dict[str, Any], Callable, None]) – Parameters passed to the fit method of the estimator

  • importance_type (str) – The type of feature importance to be used to calculate result. Used only in LGBMClassifier and LGBMRegressor.

  • early_stopping (bool) – If True, eval_set will be added to fit_params for each fold. early_stopping_rounds = 100 will also be appended to fit_params if it does not already have one.

  • type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.

Return type

CVResult

Returns

Namedtuple with the following members

  • oof_prediction (numpy array, shape (len(X_train),)):

    The predicted value on out-of-fold validation data.

  • test_prediction (numpy array, shape (len(X_test),)):

    The predicted value on test data. None if X_test is None.

  • scores (list of float, shape (nfolds+1,)):

    scores[i] denotes the validation score in the i-th fold. scores[-1] is the overall score. None if eval_func is not specified.

  • importance (list of pandas DataFrame, shape (nfolds,)):

    importance[i] denotes the feature importance of the i-th fold model. If the estimator is not GBDT, an empty array is returned.
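The shapes of these return values follow from how out-of-fold prediction works. The sketch below reproduces the bookkeeping with plain scikit-learn; it is a simplified assumption of the procedure (for instance, test predictions are averaged over fold models here), not the body of cross_validate.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=60, n_features=5, noise=1.0, random_state=0)
X_train, y_train, X_test = X[:40], y[:40], X[40:]

cv = KFold(n_splits=4, shuffle=True, random_state=0)
oof_prediction = np.zeros(len(X_train))
test_prediction = np.zeros(len(X_test))
scores = []

for train_idx, valid_idx in cv.split(X_train):
    model = clone(Ridge(alpha=1.0)).fit(X_train[train_idx], y_train[train_idx])
    # Each training row is predicted by the one model that never saw it.
    oof_prediction[valid_idx] = model.predict(X_train[valid_idx])
    # Test predictions are averaged over the fold models.
    test_prediction += model.predict(X_test) / cv.get_n_splits()
    scores.append(mean_squared_error(y_train[valid_idx], oof_prediction[valid_idx]))

# The last entry is the score over all out-of-fold predictions.
scores.append(mean_squared_error(y_train, oof_prediction))
print(oof_prediction.shape, test_prediction.shape, len(scores))  # (40,) (20,) 5
```

With 4 folds the score list has 5 entries, matching the nfolds+1 shape given above.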

Example

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import mean_squared_error
>>> from nyaggle.validation import cross_validate
>>> X, y = make_regression(n_samples=8)
>>> model = Ridge(alpha=1.0)
>>> pred_oof, pred_test, scores, _ = \
...     cross_validate(model,
...                    X_train=X[:3, :],
...                    y=y[:3],
...                    X_test=X[3:, :],
...                    cv=3,
...                    eval_func=mean_squared_error)
>>> print(pred_oof)
[-101.1123267 ,   26.79300693,   17.72635528]
>>> print(pred_test)
[-10.65095894 -12.18909059 -23.09906427 -17.68360714 -20.08218267]
>>> print(scores)
[71912.80290003832, 15236.680239881942, 15472.822033121925, 34207.43505768073]