nyaggle.feature

class nyaggle.feature.category_encoder.KFoldEncoderWrapper(base_transformer, cv=None, return_same_type=True, groups=None)[source]

KFold wrapper for transformers with an sklearn-like interface

This class wraps sklearn’s TransformerMixin (an object that has fit/transform/fit_transform methods) and calls it in a K-fold manner.

Parameters
  • base_transformer (BaseEstimator) – Transformer object to be wrapped.

  • cv (Union[int, Iterable, BaseCrossValidator, None]) –

    int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

    • None, to use the default KFold(5, random_state=0, shuffle=True),

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter (the instance of BaseCrossValidator),

    • An iterable yielding (train, test) splits as arrays of indices.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.
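
A minimal usage sketch, assuming pandas, scikit-learn and category_encoders are installed; the column names and toy data below are hypothetical:

    import category_encoders as ce
    import pandas as pd
    from sklearn.model_selection import KFold

    from nyaggle.feature.category_encoder import KFoldEncoderWrapper

    # hypothetical toy data
    X = pd.DataFrame({"city": ["tokyo", "osaka", "tokyo", "kyoto"] * 25})
    y = pd.Series([0, 1, 1, 0] * 25)

    # Wrap an ordinary (non-KFold) encoder so that fit_transform produces
    # out-of-fold encoded values instead of leaking the target.
    encoder = KFoldEncoderWrapper(ce.TargetEncoder(cols=["city"]),
                                  cv=KFold(5, shuffle=True, random_state=0))

    X_encoded = encoder.fit_transform(X, y)  # a DataFrame, because return_same_type=True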

fit(X, y)[source]

Fit models for each fold.

Parameters
  • X (DataFrame) – Data

  • y (Series) – Target

Returns

The fitted transformer object.

fit_transform(X, y=None, **fit_params)[source]

Fit models for each fold, then transform X

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y (Optional[Series]) – Target

  • fit_params – Additional parameters passed to models

Return type

Union[DataFrame, ndarray]

Returns

Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)[source]

Transform X

Parameters

X (Union[DataFrame, ndarray]) – Data

Return type

Union[DataFrame, ndarray]

Returns

Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.

class nyaggle.feature.category_encoder.TargetEncoder(cv=None, groups=None, cols=None, drop_invariant=False, handle_missing='value', handle_unknown='value', min_samples_leaf=20, smoothing=10.0, return_same_type=True)[source]

Target Encoder

KFold version of category_encoders.TargetEncoder (see https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html).

Parameters
  • cv (Union[int, Iterable, BaseCrossValidator, None]) –

    int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

    • None, to use the default KFold(5, random_state=0, shuffle=True),

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter (the instance of BaseCrossValidator),

    • An iterable yielding (train, test) splits as arrays of indices.

  • groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • cols (Optional[List[str]]) – A list of columns to encode. If None, all string columns will be encoded.

  • drop_invariant (bool) – Boolean for whether or not to drop columns with 0 variance.

  • handle_missing (str) – Options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • handle_unknown (str) – Options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • min_samples_leaf (int) – Minimum number of samples required to take the category average into account.

  • smoothing (float) – Smoothing effect to balance the categorical average vs. the prior. A higher value means stronger regularization. The value must be strictly greater than 0.

  • return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.
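
A minimal usage sketch, assuming pandas is installed; the column names and toy data below are hypothetical:

    import pandas as pd

    from nyaggle.feature.category_encoder import TargetEncoder

    # hypothetical toy data
    train = pd.DataFrame({"city": ["tokyo", "osaka", "tokyo", "kyoto"] * 25})
    y = pd.Series([0, 1, 1, 0] * 25)
    test = pd.DataFrame({"city": ["tokyo", "nagoya"]})

    te = TargetEncoder(cols=["city"])            # KFold target encoding with the default 5-fold CV
    train_encoded = te.fit_transform(train, y)   # "city" is replaced by out-of-fold target statistics
    test_encoded = te.transform(test)            # encoded using the models fitted on the training folds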

fit(X, y)

Fit models for each fold.

Parameters
  • X (DataFrame) – Data

  • y (Series) – Target

Returns

The fitted transformer object.

fit_transform(X, y=None, **fit_params)

Fit models for each fold, then transform X

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y (Optional[Series]) – Target

  • fit_params – Additional parameters passed to models

Return type

Union[DataFrame, ndarray]

Returns

Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)

Transform X

Parameters

X (Union[DataFrame, ndarray]) – Data

Return type

Union[DataFrame, ndarray]

Returns

Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.

class nyaggle.feature.nlp.BertSentenceVectorizer(lang='en', n_components=None, text_columns=None, pooling_strategy='reduce_mean', use_cuda=False, tokenizer=None, model=None, return_same_type=True, column_format='{col}_{idx}')[source]

Sentence vectorizer using a pretrained BERT model.

Extracts a fixed-length feature vector from a variable-length English/Japanese sentence using BERT.

Parameters
  • lang (str) – Type of language. If set to “jp”, the Japanese BERT model is used (MeCab must be installed).

  • n_components (Optional[int]) – Number of components in SVD. If None, SVD is not applied.

  • text_columns (Optional[List[str]]) – List of columns to process. If None, all object columns are regarded as text columns.

  • pooling_strategy (str) – The pooling algorithm for generating a fixed-length encoding vector. ‘reduce_mean’ and ‘reduce_max’ use average pooling and max pooling respectively to reduce the vector from (num_words, emb_dim) to (emb_dim). ‘reduce_mean_max’ performs ‘reduce_mean’ and ‘reduce_max’ separately and concatenates them. ‘cls_token’ takes the first element (i.e. the [CLS] token).

  • use_cuda (bool) – If True, inference is performed on GPU.

  • tokenizer (Optional[PreTrainedTokenizer]) – The custom tokenizer used instead of the default tokenizer.

  • model – The custom pretrained model used instead of the default BERT model.

  • return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.

  • column_format (str) – Format of the names of the transformed columns (used only if the return type is pd.DataFrame).
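
A minimal usage sketch, assuming the required BERT dependencies (e.g. the transformers library) are installed; a pretrained model is downloaded on first use, and the column names and toy data below are hypothetical:

    import pandas as pd

    from nyaggle.feature.nlp import BertSentenceVectorizer

    # hypothetical toy data with one text column and one numeric column
    df = pd.DataFrame({"description": ["a short sentence about cats",
                                       "another short sentence about dogs"],
                       "price": [10, 20]})

    bv = BertSentenceVectorizer(text_columns=["description"],
                                pooling_strategy="reduce_mean",
                                use_cuda=False)   # n_components=None, so SVD is skipped

    features = bv.fit_transform(df)  # "description" is replaced by fixed-length embedding columns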

fit(X, y=None)[source]

Fit SVD model on training data X.

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y – Ignored

fit_transform(X, y=None, **fit_params)[source]

Fit the SVD model on training data X, then perform feature extraction and dimensionality reduction using the pretrained BERT model and the fitted SVD model.

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y – Ignored

transform(X, y=None)[source]

Perform feature extraction and dimensionality reduction using the pretrained BERT model and the fitted SVD model.

Parameters
  • X (Union[DataFrame, ndarray]) – Data

  • y – Ignored

nyaggle.feature.groupby.aggregation(input_df, group_key, group_values, agg_methods)[source]

Aggregate values after grouping table rows by a given key.

Parameters
  • input_df (DataFrame) – Input data frame.

  • group_key (str) – Used to determine the groups for the groupby.

  • group_values (List[str]) – Used to aggregate values for the groupby.

  • agg_methods (List[Union[str, FunctionType]]) – List of functions or function names, e.g. [‘mean’, ‘max’, ‘min’, numpy.mean]. Do not use a lambda function: its __name__ attribute is always ‘<lambda>’, so unique column names cannot be generated.

Return type

Tuple[DataFrame, List[str]]

Returns

Tuple of output dataframe and new column names.
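
A minimal usage sketch, assuming pandas and numpy are installed; the column names and toy data below are hypothetical:

    import numpy as np
    import pandas as pd

    from nyaggle.feature.groupby import aggregation

    # hypothetical toy data
    df = pd.DataFrame({"user_id": [1, 1, 2, 2, 2],
                       "price":   [100, 200, 50, 60, 70],
                       "qty":     [1, 2, 3, 4, 5]})

    out_df, new_cols = aggregation(df,
                                   group_key="user_id",
                                   group_values=["price", "qty"],
                                   agg_methods=["max", "min", np.mean])

    # out_df contains the original columns plus the aggregated features
    # (the exact generated column names are library-defined);
    # new_cols lists the names of the added columns.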