nyaggle.feature

class nyaggle.feature.category_encoder.KFoldEncoderWrapper(base_transformer, cv=None, return_same_type=True, groups=None)[source]

KFold wrapper for transformers with an sklearn-like interface

This class wraps sklearn’s TransformerMixin (an object that has fit/transform/fit_transform methods) and calls it in a K-fold manner.

Parameters
  • base_transformer (BaseEstimator) – Transformer object to be wrapped.

  • cv (Union[int, Iterable, BaseCrossValidator, None]) –

    int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

    • None, to use the default KFold(5, random_state=0, shuffle=True),

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter (the instance of BaseCrossValidator),

    • An iterable yielding (train, test) splits as arrays of indices.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.
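
A minimal usage sketch, assuming pandas, scikit-learn and category_encoders are installed; the column names and toy data below are hypothetical:

    import category_encoders as ce
    import pandas as pd
    from sklearn.model_selection import KFold

    from nyaggle.feature.category_encoder import KFoldEncoderWrapper

    # hypothetical toy data
    X = pd.DataFrame({"city": ["tokyo", "osaka", "tokyo", "kyoto"] * 25})
    y = pd.Series([0, 1, 1, 0] * 25)

    # Wrap an ordinary (non-KFold) encoder so that fit_transform produces
    # out-of-fold encoded values instead of leaking the target.
    encoder = KFoldEncoderWrapper(ce.TargetEncoder(cols=["city"]),
                                  cv=KFold(5, shuffle=True, random_state=0))

    X_encoded = encoder.fit_transform(X, y)  # a DataFrame, because return_same_type=True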

fit(X, y)[source]

Fit models for each fold.

Parameters
  • X (DataFrame) – Data

  • y (Series) – Target

Returns

The fitted transformer object.

fit_transform(X, y=None, **fit_params)[source]

Fit models for each fold, then transform X

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y (Optional[Series]) – Target

  • fit_params – Additional parameters passed to models

Return type

Union[DataFrame, ndarray]

Returns

Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)[source]

Transform X

Parameters

X (Union[DataFrame, ndarray]) – Data

Return type

Union[DataFrame, ndarray]

Returns

Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.

class nyaggle.feature.category_encoder.TargetEncoder(cv=None, groups=None, cols=None, drop_invariant=False, handle_missing='value', handle_unknown='value', min_samples_leaf=20, smoothing=10.0, return_same_type=True)[source]

Target Encoder

KFold version of category_encoders.TargetEncoder (see https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html).

Parameters
  • cv (Union[int, Iterable, BaseCrossValidator, None]) –

    int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

    • None, to use the default KFold(5, random_state=0, shuffle=True),

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter (the instance of BaseCrossValidator),

    • An iterable yielding (train, test) splits as arrays of indices.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • cols (Optional[List[str]]) – A list of columns to encode. If None, all string columns will be encoded.

  • drop_invariant (bool) – Boolean for whether or not to drop columns with 0 variance.

  • handle_missing (str) – Options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • handle_unknown (str) – Options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • min_samples_leaf (int) – Minimum number of samples required to take the category average into account.

  • smoothing (float) – Smoothing effect to balance the categorical average vs. the prior. A higher value means stronger regularization. The value must be strictly greater than 0.

  • return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.
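
A minimal usage sketch, assuming pandas is installed; the column names and toy data below are hypothetical:

    import pandas as pd

    from nyaggle.feature.category_encoder import TargetEncoder

    # hypothetical toy data
    train = pd.DataFrame({"city": ["tokyo", "osaka", "tokyo", "kyoto"] * 25})
    y = pd.Series([0, 1, 1, 0] * 25)
    test = pd.DataFrame({"city": ["tokyo", "nagoya"]})

    te = TargetEncoder(cols=["city"])            # KFold target encoding with the default 5-fold CV
    train_encoded = te.fit_transform(train, y)   # "city" is replaced by out-of-fold target statistics
    test_encoded = te.transform(test)            # encoded using the models fitted on the training folds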

fit(X, y)

Fit models for each fold.

Parameters
  • X (DataFrame) – Data

  • y (Series) – Target

Returns

The fitted transformer object.

fit_transform(X, y=None, **fit_params)

Fit models for each fold, then transform X

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y (Optional[Series]) – Target

  • fit_params – Additional parameters passed to models

Return type

Union[DataFrame, ndarray]

Returns

Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)

Transform X

Parameters

X (Union[DataFrame, ndarray]) – Data

Return type

Union[DataFrame, ndarray]

Returns

Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.

class nyaggle.feature.nlp.BertSentenceVectorizer(lang='en', n_components=None, text_columns=None, pooling_strategy='reduce_mean', use_cuda=False, tokenizer=None, model=None, return_same_type=True, column_format='{col}_{idx}')[source]

Sentence vectorizer using a pretrained BERT model.

Extracts a fixed-length feature vector from a variable-length English/Japanese sentence using BERT.

Parameters
  • lang (str) – Type of language. If set to “jp”, the Japanese BERT model is used (MeCab must be installed).

  • n_components (Optional[int]) – Number of components in SVD. If None, SVD is not applied.

  • text_columns (Optional[List[str]]) – List of columns to process. If None, all object columns are regarded as text columns.

  • pooling_strategy (str) – The pooling algorithm for generating a fixed-length encoding vector. ‘reduce_mean’ and ‘reduce_max’ use average pooling and max pooling respectively to reduce the vector from (num_words, emb_dim) to (emb_dim). ‘reduce_mean_max’ performs ‘reduce_mean’ and ‘reduce_max’ separately and concatenates them. ‘cls_token’ takes the first element (i.e. the [CLS] token).

  • use_cuda (bool) – If True, inference is performed on GPU.

  • tokenizer (Optional[PreTrainedTokenizer]) – The custom tokenizer used instead of the default tokenizer.

  • model – The custom pretrained model used instead of the default BERT model.

  • return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.

  • column_format (str) – Format of the names of the transformed columns (used only if the return type is pd.DataFrame).
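
A minimal usage sketch, assuming the required BERT dependencies (e.g. the transformers library) are installed; a pretrained model is downloaded on first use, and the column names and toy data below are hypothetical:

    import pandas as pd

    from nyaggle.feature.nlp import BertSentenceVectorizer

    # hypothetical toy data with one text column and one numeric column
    df = pd.DataFrame({"description": ["a short sentence about cats",
                                       "another short sentence about dogs"],
                       "price": [10, 20]})

    bv = BertSentenceVectorizer(text_columns=["description"],
                                pooling_strategy="reduce_mean",
                                use_cuda=False)   # n_components=None, so SVD is skipped

    features = bv.fit_transform(df)  # "description" is replaced by fixed-length embedding columns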

fit(X, y=None)[source]

Fit SVD model on training data X.

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y – Ignored

fit_transform(X, y=None, **fit_params)[source]

Fit the SVD model on training data X, then perform feature extraction and dimensionality reduction using the pretrained BERT model and the fitted SVD model.

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y – Ignored

transform(X, y=None)[source]

Perform feature extraction and dimensionality reduction using the pretrained BERT model and the fitted SVD model.

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y – Ignored

nyaggle.feature.groupby.aggregation(input_df, group_key, group_values, agg_methods)[source]

Aggregate values after grouping table rows by a given key.

Parameters
  • input_df (DataFrame) – Input data frame.

  • group_key (str) – Used to determine the groups for the groupby.

  • group_values (List[str]) – Used to aggregate values for the groupby.

  • agg_methods (List[Union[str, FunctionType]]) – List of functions or function names, e.g. [‘mean’, ‘max’, ‘min’, numpy.mean]. Do not use a lambda function: its __name__ attribute is always ‘<lambda>’, so unique column names cannot be generated.

Return type

Tuple[DataFrame, List[str]]

Returns

Tuple of output dataframe and new column names.
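
A minimal usage sketch, assuming pandas and numpy are installed; the column names and toy data below are hypothetical:

    import numpy as np
    import pandas as pd

    from nyaggle.feature.groupby import aggregation

    # hypothetical toy data
    df = pd.DataFrame({"user_id": [1, 1, 2, 2, 2],
                       "price":   [100, 200, 50, 60, 70],
                       "qty":     [1, 2, 3, 4, 5]})

    out_df, new_cols = aggregation(df,
                                   group_key="user_id",
                                   group_values=["price", "qty"],
                                   agg_methods=["max", "min", np.mean])

    # out_df contains the original columns plus the aggregated features
    # (the exact generated column names are library-defined);
    # new_cols lists the names of the added columns.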