nyaggle.validation

class nyaggle.validation.Nth(n, base_validator)

Returns N-th fold of the base validator

This validator wraps the base validator and yields only its n-th (1-origin) fold.

Parameters

n (int) – The index (1-origin) of the fold to be taken.
base_validator (BaseCrossValidator) – The base validator to be wrapped.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Nth

>>> # take the 3rd fold
>>> folds = Nth(3, KFold(5))
>>> folds.get_n_splits()
1

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)

Generate indices to split data into training and test set.

Parameters

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class nyaggle.validation.Skip(n, base_validator)

Skips the first N folds and returns the remaining folds

This validator wraps the base validator to skip the first n folds.

Parameters

n (int) – The number of folds to be skipped.
base_validator (BaseCrossValidator) – The base validator to be wrapped.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Skip

>>> # take the last 2 folds out of 5
>>> folds = Skip(3, KFold(5))
>>> folds.get_n_splits()
2

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)

Generate indices to split data into training and test set.

Parameters

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class nyaggle.validation.SlidingWindowSplit(source, train_from, train_to, test_from, test_to, n_windows, stride)

Sliding window time series cross-validator

Time series cross-validator that provides train/test indices based on a sliding window, for splitting variable-interval time series data. With N = n_windows, the folds are split as follows:

Folds  Training data                                     Testing data
1      (train_from-(N-1)*stride, train_to-(N-1)*stride)  (test_from-(N-1)*stride, test_to-(N-1)*stride)
...    ...                                               ...
N-1    (train_from-stride, train_to-stride)              (test_from-stride, test_to-stride)
N      (train_from, train_to)                            (test_from, test_to)

This class is compatible with sklearn’s BaseCrossValidator (base class of KFold, GroupKFold etc).

Parameters

source (Union[Series, str]) – The column name or series of timestamp.
train_from (Union[datetime, str]) – Start datetime for the training data in the base split.
train_to (Union[datetime, str]) – End datetime for the training data in the base split.
test_from (Union[datetime, str]) – Start datetime for the testing data in the base split.
test_to (Union[datetime, str]) – End datetime for the testing data in the base split.
n_windows (int) – The number of windows (or folds) in the validation.
stride (timedelta) – Time delta between folds.
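
The window arithmetic in the table above can be sketched with the standard library alone. This is only an illustration of how the interval boundaries shift per fold; `sliding_windows` is a hypothetical helper, not part of nyaggle, and the dates are made up.

```python
from datetime import datetime, timedelta


def sliding_windows(train_from, train_to, test_from, test_to, n_windows, stride):
    """Hypothetical helper: list (train_interval, test_interval) per fold.

    Fold N is the base split; fold k is shifted back by (N - k) * stride.
    """
    windows = []
    for k in range(1, n_windows + 1):
        shift = (n_windows - k) * stride
        train = (train_from - shift, train_to - shift)
        test = (test_from - shift, test_to - shift)
        windows.append((train, test))
    return windows


# Illustrative dates only.
windows = sliding_windows(
    train_from=datetime(2018, 1, 20),
    train_to=datetime(2018, 1, 23),
    test_from=datetime(2018, 1, 23),
    test_to=datetime(2018, 1, 25),
    n_windows=3,
    stride=timedelta(days=2),
)

for fold, (train, test) in enumerate(windows, start=1):
    print(fold, train[0].date(), train[1].date(), test[0].date(), test[1].date())
```

The last fold reproduces the base split unchanged, and each earlier fold moves every boundary back by one more stride.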

class nyaggle.validation.StratifiedGroupKFold(n_splits=3, shuffle=False, random_state=None)

Stratified K-Folds cross-validator with grouping

Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of GroupKFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class. Read more in the User Guide.

Parameters

n_splits (int) – Number of folds. Must be at least 2.

Example

>>> from pprint import pprint
>>> import numpy as np
>>> from nyaggle.validation import StratifiedGroupKFold

>>> rng = np.random.RandomState(0)
>>> groups = [1, 1, 3, 4, 2, 2, 7, 8, 8]
>>> y      = [1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> X = np.empty((len(y), 0))
>>> folds = StratifiedGroupKFold(random_state=rng)
>>> skf_list = list(folds.split(X=X, y=y, groups=groups))
>>> pprint(skf_list)
[
    (np.array([2, 3, 4, 5, 6]), np.array([0, 1, 7, 8])),
    (np.array([0, 1, 2, 7, 8]), np.array([3, 4, 5, 6])),
    (np.array([0, 1, 3, 4, 5, 6, 7, 8]), np.array([2])),
]

split(X, y, groups=None)

Generate indices to split data into training and test set.
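
As a quick sanity check on the grouping behaviour, the sketch below takes the index pairs printed in the example output above and checks two properties in plain Python: no group appears on both sides of a split, and every row is used as test data exactly once. The helper `groups_of` is introduced here for illustration only.

```python
groups = [1, 1, 3, 4, 2, 2, 7, 8, 8]

# (train, test) index pairs copied from the example output above.
folds = [
    ([2, 3, 4, 5, 6], [0, 1, 7, 8]),
    ([0, 1, 2, 7, 8], [3, 4, 5, 6]),
    ([0, 1, 3, 4, 5, 6, 7, 8], [2]),
]


def groups_of(indices):
    """Return the set of group labels covered by the given row indices."""
    return {groups[i] for i in indices}


# Groups shared between train and test in each fold (should all be empty).
leaks = [groups_of(train) & groups_of(test) for train, test in folds]

# All test indices across folds, sorted.
tested = sorted(i for _, test in folds for i in test)

print(leaks)
print(tested)
```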

class nyaggle.validation.Take(n, base_validator)

Returns the first N folds of the base validator

This validator wraps the base validator to take the first n folds.

Parameters

n (int) – The number of folds.
base_validator (BaseCrossValidator) – The base validator to be wrapped.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Take

>>> # take the first 3 folds out of 5
>>> folds = Take(3, KFold(5))
>>> folds.get_n_splits()
3

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)

Generate indices to split data into training and test set.

Parameters

X – Training data.
y – Target.
groups – Group indices.

Yields

The training set and the testing set indices for that split.
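
Nth, Skip and Take each expose a slice of the base validator's fold sequence. The sketch below illustrates that idea with `itertools.islice` over a plain generator. The names `simple_kfold`, `take`, `skip` and `nth` are hypothetical helpers written for this illustration; this is not how nyaggle implements the wrappers.

```python
from itertools import islice


def simple_kfold(n_samples, n_splits):
    """Hypothetical stand-in for KFold: yields (train, test) index lists."""
    fold_size = n_samples // n_splits  # assumes n_samples divides evenly
    for k in range(n_splits):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test


def take(n, splits):
    """Keep only the first n folds."""
    return islice(splits, n)


def skip(n, splits):
    """Drop the first n folds and keep the rest."""
    return islice(splits, n, None)


def nth(n, splits):
    """Keep only the n-th fold (1-origin)."""
    return islice(splits, n - 1, n)


first_three = list(take(3, simple_kfold(10, 5)))
last_two = list(skip(3, simple_kfold(10, 5)))
third_only = list(nth(3, simple_kfold(10, 5)))

print(len(first_three), len(last_two), len(third_only))
print(third_only[0][1])  # test indices of the third fold
```

The fold counts match the `get_n_splits()` results in the Take, Skip and Nth examples (3, 2 and 1 respectively).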

class nyaggle.validation.TimeSeriesSplit(source, times=None)

Time Series cross-validator

Time series cross-validator that provides train/test indices to split variable-interval time series data. This class provides a low-level API for time series validation strategies. It is compatible with sklearn’s BaseCrossValidator (base class of KFold, GroupKFold etc).

Parameters

source (Union[Series, str]) – The column name or series of timestamp.
times (Optional[List[Tuple[Tuple[Union[datetime, str], Union[datetime, str]], Tuple[Union[datetime, str], Union[datetime, str]]]]]) – Splitting windows, where times[i][0] and times[i][1] denote the train and test time intervals of the i-th fold (0-origin), respectively. Each time interval should be a pair of datetime or str values, and the validator generates the indices of rows whose timestamp lies in the half-open interval [start, end). For example, if times[i][0] = ('2018-01-01', '2018-01-03'), the training indices for that fold are the rows whose timestamp t satisfies 2018-01-01 <= t < 2018-01-03.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from nyaggle.validation import TimeSeriesSplit
>>> df = pd.DataFrame()
>>> df['time'] = pd.date_range(start='2018/1/1', periods=5)

>>> folds = TimeSeriesSplit('time',
...                         [(('2018-01-01', '2018-01-02'), ('2018-01-02', '2018-01-04')),
...                          (('2018-01-02', '2018-01-03'), ('2018-01-04', '2018-01-06'))])

>>> folds.get_n_splits()
2

>>> splits = folds.split(df)

>>> train_index, test_index = next(splits)
>>> train_index
[0]
>>> test_index
[1, 2]

>>> train_index, test_index = next(splits)
>>> train_index
[1]
>>> test_index
[3, 4]
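
The half-open interval rule explains the indices in the example above. The sketch below reproduces them with the standard library only, as an illustration of the selection rule rather than of nyaggle's implementation; `rows_in` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Five daily timestamps starting 2018-01-01, as in the example above.
timestamps = [datetime(2018, 1, 1) + timedelta(days=d) for d in range(5)]


def rows_in(interval):
    """Indices of rows whose timestamp t satisfies start <= t < end."""
    start, end = (datetime.strptime(s, "%Y-%m-%d") for s in interval)
    return [i for i, t in enumerate(timestamps) if start <= t < end]


times = [
    (("2018-01-01", "2018-01-02"), ("2018-01-02", "2018-01-04")),
    (("2018-01-02", "2018-01-03"), ("2018-01-04", "2018-01-06")),
]

splits = [(rows_in(train), rows_in(test)) for train, test in times]
print(splits)  # [([0], [1, 2]), ([1], [3, 4])]
```

Because the end of each interval is excluded, a train interval ending on 2018-01-02 and a test interval starting on 2018-01-02 do not share any row.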

add_fold(train_interval, test_interval)

Append one split to the validator.

Parameters

train_interval (Tuple[Union[datetime, str], Union[datetime, str]]) – start and end time of training data.
test_interval (Tuple[Union[datetime, str], Union[datetime, str]]) – start and end time of test data.

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)

Generate indices to split data into training and test set.

Parameters

X – Training data.
y – Ignored.
groups – Ignored.

Yields

The training set and the testing set indices for that split.

nyaggle.validation.adversarial_validate(X_train, X_test, importance_type='gain', estimator=None, categorical_feature=None, cv=None)

Perform adversarial validation between X_train and X_test.

Parameters

X_train (DataFrame) – Training data
X_test (DataFrame) – Test data
importance_type (str) – The type of feature importance calculated.
estimator (Optional[BaseEstimator]) – The custom estimator. If None, LGBMClassifier is automatically used. Only LGBMModel or CatBoost instances are supported.
categorical_feature (Optional[List[str]]) – List of categorical column names. If None, categorical columns are automatically determined by dtype.
cv (Union[int, Iterable, BaseCrossValidator, None]) – Cross-validation split. If None, the first fold of a 5-fold split is used for validation.

Return type

ADVResult

Returns

Namedtuple with the following members:

auc:
float, ROC AUC score of adversarial validation.
importance:
pandas DataFrame, feature importance of the adversarial model (ordered by importance).

Example

>>> from sklearn.model_selection import train_test_split
>>> from nyaggle.testing import make_regression_df
>>> from nyaggle.validation import adversarial_validate

>>> X, y = make_regression_df(n_samples=8)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> auc, importance = adversarial_validate(X_train, X_test)
>>> print(auc)
0.51078231
>>> importance.head()
feature importance
col_1   231.5827204
col_5   207.1837266
col_7   188.6920685
col_4   174.5668498
col_9   170.6438643
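
To make the returned auc easier to read, the sketch below computes a ROC AUC for the question "is this row from the test set?" by comparing scores pair by pair. The scores are made-up stand-ins for a classifier's predictions, and `pairwise_auc` is a hypothetical helper, not a nyaggle function. Under these assumptions, a value near 0.5 suggests the two sets are hard to tell apart, while a value near 1.0 suggests they differ noticeably.

```python
def pairwise_auc(scores_train, scores_test):
    """ROC AUC for 'is this row from the test set?', computed pair by pair.

    A test row scoring above a training row counts 1, a tie counts 0.5.
    """
    wins = 0.0
    for s_test in scores_test:
        for s_train in scores_train:
            if s_test > s_train:
                wins += 1.0
            elif s_test == s_train:
                wins += 0.5
    return wins / (len(scores_test) * len(scores_train))


# Hypothetical classifier scores; a real run would use model predictions.
same_distribution = pairwise_auc([1, 2, 3, 4], [1, 2, 3, 4])
shifted = pairwise_auc([1, 2, 3, 4], [5, 6, 7, 8])

print(same_distribution, shifted)  # 0.5 1.0
```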

nyaggle.validation.cross_validate(estimator, X_train, y, X_test=None, cv=None, groups=None, eval_func=None, logger=None, on_each_fold=None, fit_params=None, importance_type='gain', early_stopping=True, type_of_target='auto')

Evaluate metrics by cross-validation. It also records out-of-fold prediction and test prediction.

Parameters

estimator (Union[BaseEstimator, List[BaseEstimator]]) – The object to be used in cross-validation. For list inputs, estimator[i] is trained on i-th fold.
X_train (Union[DataFrame, ndarray]) – Training data
y (Union[Series, ndarray]) – Target
X_test (Union[DataFrame, ndarray, None]) – Test data (Optional). If specified, prediction on the test data is performed using ensemble of models.
cv (Union[int, Iterable, BaseCrossValidator, None]) –
int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
- None, to use the default KFold(5, random_state=0, shuffle=True),
- integer, to specify the number of folds in a (Stratified)KFold,
- CV splitter (the instance of BaseCrossValidator),
- An iterable yielding (train, test) splits as arrays of indices.
groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
eval_func (Optional[Callable]) – Function used for logging and returning scores
logger (Optional[Logger]) – logger
on_each_fold (Optional[Callable[[int, BaseEstimator, DataFrame, Series], None]]) – called for each fold with (idx_fold, model, X_fold, y_fold)
fit_params (Union[Dict[str, Any], Callable, None]) – Parameters passed to the fit method of the estimator
importance_type (str) – The type of feature importance to be used to calculate result. Used only in LGBMClassifier and LGBMRegressor.
early_stopping (bool) – If True, eval_set will be added to fit_params for each fold. early_stopping_rounds = 100 will also be appended to fit_params if it does not already have one.
type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.

Return type

CVResult

Returns

Namedtuple with the following members:

oof_prediction (numpy array, shape (len(X_train),)):
The predicted value on out-of-fold validation data.
test_prediction (numpy array, shape (len(X_test),)):
The predicted value on test data. None if X_test is None.
scores (list of float, shape (nfolds+1,)):
scores[i] denotes the validation score in the i-th fold. scores[-1] is the overall score. None if eval_func is not specified.
importance (list of pandas DataFrame, shape (nfolds,)):
importance[i] denotes the feature importance in the i-th fold model. If the estimator is not a GBDT model, an empty array is returned.
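
The shapes of oof_prediction and scores can be illustrated with a minimal pure-Python sketch. It is not nyaggle's implementation: `mean_model` is a hypothetical stand-in for an estimator, the fold assignment is arbitrary, and the overall score is assumed here to be the metric evaluated on the full out-of-fold prediction.

```python
def mean_model(y_train):
    """Hypothetical 'estimator': always predicts the mean training target."""
    prediction = sum(y_train) / len(y_train)
    return lambda n_rows: [prediction] * n_rows


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n_folds = 3
# Arbitrary fold assignment: row i belongs to test fold i % n_folds.
folds = [[i for i in range(len(y)) if i % n_folds == k] for k in range(n_folds)]

oof_prediction = [None] * len(y)
scores = []
for test_idx in folds:
    train_idx = [i for i in range(len(y)) if i not in test_idx]
    predict = mean_model([y[i] for i in train_idx])
    fold_pred = predict(len(test_idx))
    # Each row is predicted by the model that did not see it during training.
    for i, p in zip(test_idx, fold_pred):
        oof_prediction[i] = p
    scores.append(mse([y[i] for i in test_idx], fold_pred))

# Assumption: the last entry is the score on the full out-of-fold prediction.
scores.append(mse(y, oof_prediction))

print(oof_prediction)  # [4.0, 3.5, 3.0, 4.0, 3.5, 3.0]
print(len(scores))     # 4, i.e. n_folds + 1
```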

Example

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import mean_squared_error
>>> from nyaggle.validation import cross_validate

>>> X, y = make_regression(n_samples=8)
>>> model = Ridge(alpha=1.0)
>>> pred_oof, pred_test, scores, _ = \
...     cross_validate(model,
...                    X_train=X[:3, :],
...                    y=y[:3],
...                    X_test=X[3:, :],
...                    cv=3,
...                    eval_func=mean_squared_error)
>>> print(pred_oof)
[-101.1123267 ,   26.79300693,   17.72635528]
>>> print(pred_test)
[-10.65095894 -12.18909059 -23.09906427 -17.68360714 -20.08218267]
>>> print(scores)
[71912.80290003832, 15236.680239881942, 15472.822033121925, 34207.43505768073]