nyaggle.feature
- class nyaggle.feature.category_encoder.KFoldEncoderWrapper(base_transformer, cv=None, return_same_type=True, groups=None)
KFold wrapper for transformers with an sklearn-like interface.
This class wraps an sklearn TransformerMixin (an object that has fit/transform/fit_transform methods) and calls it in a K-fold manner.
- Parameters
base_transformer (BaseEstimator) – Transformer object to be wrapped.
cv (Union[int, Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy. Accepted values:
None, to use the default KFold(5, random_state=0, shuffle=True),
an integer, to specify the number of folds in a (Stratified)KFold,
a CV splitter (an instance of BaseCrossValidator),
an iterable yielding (train, test) splits as arrays of indices.
groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.
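The last form of cv listed above, an iterable yielding (train, test) index splits, can be sketched with the standard library alone. The helper name kfold_splits is hypothetical, and its shuffling does not reproduce sklearn's KFold; the sketch only illustrates the shape of the data such an iterable yields.

```python
import random
from typing import Iterator, List, Tuple


def kfold_splits(n_samples: int, n_splits: int = 5,
                 seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    for fold in range(n_splits):
        test = sorted(indices[fold::n_splits])  # every n_splits-th shuffled index
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test


# Hypothetical usage: five folds over ten samples.
splits = list(kfold_splits(n_samples=10, n_splits=5))
```

Since the parameter description asks for arrays of indices, converting each list to a numpy array before passing it as cv is probably appropriate.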
- fit(X, y)
Fit models for each fold.
- Parameters
X (DataFrame) – Data.
y (Series) – Target.
- Returns
The fitted transformer object.
- fit_transform(X, y=None, **fit_params)
Fit models for each fold, then transform X.
- Parameters
X (Union[DataFrame, ndarray]) – Data.
y (Optional[Series]) – Target.
fit_params – Additional parameters passed to models.
- Return type
Union[DataFrame, ndarray]
- Returns
Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.
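One plausible reading of "fit models for each fold, then transform X" is an out-of-fold scheme: each row is transformed by a model fitted only on the other folds, so its own target never leaks into its encoding. The sketch below illustrates that idea with a toy mean encoder. The function names are hypothetical and this is not nyaggle's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def fit_mean_encoder(cats: Sequence[str],
                     y: Sequence[float]) -> Tuple[Dict[str, float], float]:
    """Toy transformer: per-category mean of y, plus the global mean as fallback."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for c, t in zip(cats, y):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return means, sum(y) / len(y)


def out_of_fold_encode(cats: Sequence[str], y: Sequence[float],
                       splits: Sequence[Tuple[List[int], List[int]]]) -> List[float]:
    """Encode each row with a model fitted only on the other folds."""
    encoded = [0.0] * len(cats)
    for train, test in splits:
        means, prior = fit_mean_encoder([cats[i] for i in train],
                                        [y[i] for i in train])
        for i in test:
            encoded[i] = means.get(cats[i], prior)  # unseen category -> fold prior
    return encoded


cats = ["a", "a", "b", "b"]
y = [1.0, 0.0, 1.0, 1.0]
splits = [([2, 3], [0, 1]), ([0, 1], [2, 3])]  # two hand-written folds
encoded = out_of_fold_encode(cats, y, splits)
```

In this toy data every test category is unseen in its training fold, so each row falls back to the prior of the fold it was excluded from.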
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
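The <component>__<parameter> convention described for set_params can be illustrated by splitting each key at its first double underscore. This is a minimal sketch of the naming scheme only, not sklearn's code. The helper split_nested_params is hypothetical, and base_transformer__smoothing is used purely as an example key.

```python
from typing import Any, Dict, Tuple


def split_nested_params(
    params: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Dict[str, Any]]]:
    """Separate top-level parameters from <component>__<parameter> ones."""
    own: Dict[str, Any] = {}
    nested: Dict[str, Dict[str, Any]] = {}
    for key, value in params.items():
        component, delim, sub_key = key.partition("__")  # split at the first "__"
        if delim:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"return_same_type": False, "base_transformer__smoothing": 5.0}
)
```

Here own holds the parameter that belongs to the outer object, while nested groups the remaining values by the component that should receive them.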
- class nyaggle.feature.category_encoder.TargetEncoder(cv=None, groups=None, cols=None, drop_invariant=False, handle_missing='value', handle_unknown='value', min_samples_leaf=20, smoothing=10.0, return_same_type=True)
Target Encoder
K-fold version of category_encoders.TargetEncoder (see https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html).
- Parameters
cv (Union[Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy. Accepted values:
None, to use the default KFold(5, random_state=0, shuffle=True),
an integer, to specify the number of folds in a (Stratified)KFold,
a CV splitter (an instance of BaseCrossValidator),
an iterable yielding (train, test) splits as arrays of indices.
groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
cols (Optional[List[str]]) – A list of columns to encode. If None, all string columns will be encoded.
drop_invariant (bool) – Whether to drop columns with 0 variance.
handle_missing (str) – Options are ‘error’, ‘return_nan’ and ‘value’. Defaults to ‘value’, which returns the target mean.
handle_unknown (str) – Options are ‘error’, ‘return_nan’ and ‘value’. Defaults to ‘value’, which returns the target mean.
min_samples_leaf (int) – Minimum number of samples required to take the category average into account.
smoothing (float) – Smoothing effect to balance the categorical average against the prior. A higher value means stronger regularization. The value must be strictly greater than 0.
return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.
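To make the interplay of min_samples_leaf and smoothing concrete, the sketch below blends a category's target mean with the global prior using a sigmoid weight on the category count. This weighting is an assumption modelled on the blend commonly associated with category_encoders.TargetEncoder; the exact formula may differ between versions, and the function name is hypothetical.

```python
import math


def smoothed_target_mean(category_mean: float, prior: float, n: int,
                         min_samples_leaf: int = 20,
                         smoothing: float = 10.0) -> float:
    """Blend a category's target mean with the global prior.

    Assumed weighting: a sigmoid in the category count n, centred at
    min_samples_leaf, whose width is controlled by smoothing.
    """
    if smoothing <= 0:
        raise ValueError("smoothing must be strictly greater than 0")
    weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
    return prior * (1.0 - weight) + category_mean * weight


# A category seen exactly min_samples_leaf times sits halfway between
# the prior and its own mean; rarer categories stay closer to the prior.
at_threshold = smoothed_target_mean(category_mean=1.0, prior=0.5, n=20)
rare = smoothed_target_mean(category_mean=1.0, prior=0.5, n=1)
frequent = smoothed_target_mean(category_mean=1.0, prior=0.5, n=200)
```

Under this assumed formula, a larger smoothing value flattens the sigmoid, so more observations are needed before a category's own mean dominates.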
- fit(X, y)
Fit models for each fold.
- Parameters
X (DataFrame) – Data.
y (Series) – Target.
- Returns
The fitted transformer object.
- fit_transform(X, y=None, **fit_params)
Fit models for each fold, then transform X.
- Parameters
X (Union[DataFrame, ndarray]) – Data.
y (Optional[Series]) – Target.
fit_params – Additional parameters passed to models.
- Return type
Union[DataFrame, ndarray]
- Returns
Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- transform(X)
Transform X.
- Parameters
X (Union[DataFrame, ndarray]) – Data.
- Return type
Union[DataFrame, ndarray]
- Returns
Transformed version of X. It will be a pd.DataFrame if X is a pd.DataFrame and return_same_type is True.
- class nyaggle.feature.nlp.BertSentenceVectorizer(lang='en', n_components=None, text_columns=None, pooling_strategy='reduce_mean', use_cuda=False, tokenizer=None, model=None, return_same_type=True, column_format='{col}_{idx}')
Sentence vectorizer using a pretrained BERT model.
Extracts a fixed-length feature vector from a variable-length English or Japanese sentence using BERT.
- Parameters
lang (str) – Type of language. If set to “jp”, the Japanese BERT model is used (you need to install MeCab).
n_components (Optional[int]) – Number of components in SVD. If None, SVD is not applied.
text_columns (Optional[List[str]]) – List of columns to process. If None, all object columns are regarded as text columns.
pooling_strategy (str) – The pooling algorithm for generating a fixed-length encoding vector. ‘reduce_mean’ and ‘reduce_max’ use average pooling and max pooling respectively to reduce the vector from (num_words, emb_dim) to (emb_dim). ‘reduce_mean_max’ performs ‘reduce_mean’ and ‘reduce_max’ separately and concatenates the results. ‘cls_token’ takes the first element (i.e. [CLS]).
use_cuda (bool) – If True, inference is performed on GPU.
tokenizer (Optional[PreTrainedTokenizer]) – A custom tokenizer used instead of the default tokenizer.
model – A custom pretrained model used instead of the default BERT model.
return_same_type (bool) – If True, transform and fit_transform return the same type as X. If False, these APIs always return a numpy array, similar to sklearn’s API.
column_format (str) – Name format of the transformed columns (used if the returned type is pd.DataFrame).
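The four pooling strategies can be sketched on a plain (num_words, emb_dim) matrix without BERT. The sketch below is only an illustration of the reductions described above, under two assumptions: that ‘reduce_mean_max’ concatenates the mean vector first and the max vector second, and that column_format receives the source column name as col and the component index as idx. The function pool is hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # shape: (num_words, emb_dim)


def pool(token_vectors: Matrix, strategy: str = "reduce_mean") -> List[float]:
    """Reduce per-token vectors to one fixed-length sentence vector."""
    columns = list(zip(*token_vectors))  # transpose to (emb_dim, num_words)
    if strategy == "reduce_mean":
        return [sum(col) / len(col) for col in columns]
    if strategy == "reduce_max":
        return [max(col) for col in columns]
    if strategy == "reduce_mean_max":
        # Assumed order: mean part first, then max part.
        return pool(token_vectors, "reduce_mean") + pool(token_vectors, "reduce_max")
    if strategy == "cls_token":
        return list(token_vectors[0])  # first token, i.e. [CLS]
    raise ValueError(f"unknown pooling strategy: {strategy}")


tokens = [[1.0, 4.0], [3.0, 0.0]]  # two tokens, emb_dim = 2
vector = pool(tokens, "reduce_mean_max")
# Assumed use of column_format: one output column per vector component.
names = ["{col}_{idx}".format(col="text", idx=i) for i in range(len(vector))]
```

Note that ‘reduce_mean_max’ doubles the output length, which matters when choosing n_components for the optional SVD step.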
- fit(X, y=None)
Fit SVD model on training data X.
- Parameters
X (Union[DataFrame, ndarray]) – Data.
y – Ignored.
- nyaggle.feature.groupby.aggregation(input_df, group_key, group_values, agg_methods)
Aggregate values after grouping table rows by a given key.
- Parameters
input_df (DataFrame) – Input data frame.
group_key (str) – Column used to determine the groups for the groupby.
group_values (List[str]) – Columns whose values are aggregated within each group.
agg_methods (List[Union[str, FunctionType]]) – List of functions or function names, e.g. [‘mean’, ‘max’, ‘min’, numpy.mean]. Do not use lambda functions: the name attribute of every lambda is <lambda>, so unique column names cannot be generated from it.
- Return type
Tuple[DataFrame, List[str]]
- Returns
Tuple of the output dataframe and the new column names.
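The grouping and column-naming logic, including the reason lambdas are discouraged, can be sketched with the standard library on a list of dict rows. This is an illustration under assumptions, not nyaggle's implementation: the column pattern agg_{name}_{value}_by_{group_key} is hypothetical, and the sketch assumes aggregates are attached back to every input row.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Tuple, Union

Row = Dict[str, object]
AggMethod = Union[str, Callable[[List[float]], float]]
NAMED = {"mean": mean, "max": max, "min": min, "sum": sum}  # string shortcuts


def aggregate(rows: List[Row], group_key: str, group_values: List[str],
              agg_methods: List[AggMethod]) -> Tuple[List[Row], List[str]]:
    """Attach per-group aggregates to every row and report the new columns."""
    groups: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    new_cols: List[str] = []
    out = [dict(row) for row in rows]  # leave the input untouched
    for method in agg_methods:
        func = NAMED[method] if isinstance(method, str) else method
        name = method if isinstance(method, str) else func.__name__
        for value in group_values:
            col = f"agg_{name}_{value}_by_{group_key}"  # hypothetical pattern
            if col in new_cols:
                raise ValueError(f"duplicate column name: {col}")
            new_cols.append(col)
            stats = {k: func([r[value] for r in g]) for k, g in groups.items()}
            for row in out:
                row[col] = stats[row[group_key]]
    return out, new_cols


rows = [{"shop": "x", "price": 1.0}, {"shop": "x", "price": 3.0},
        {"shop": "y", "price": 5.0}]
out, new_cols = aggregate(rows, "shop", ["price"], ["mean", max])
```

Passing two lambdas to this sketch raises ValueError, because both report the same name, <lambda>, and would therefore produce the same column name.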