Data Dojo 20 - Missing Data

scikit-learn documentation on imputation

Specific Tasks

How often is each feature missing?
Try a simple imputer
Try a more sophisticated imputation strategy
Optional: try a model that can handle missing values / a multi-stage modeling approach

Setup

In [1]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

Hacking Order

In [2]:

np.random.seed(42)
names = ["Sascha", "Marko", "Sebastian", "Max", "Markus", "Sabine", "Caro", "Prithivi", "Mike", "Robin"]
np.random.shuffle(names)
" => ".join(names)

Out[2]:

'Mike => Marko => Sabine => Sascha => Prithivi => Sebastian => Robin => Markus => Max => Caro'

Data Loading

In [4]:

data = pd.read_csv("https://github.com/ddojo/ddojo.github.io/raw/main/sessions/14_trees/train.tsv", sep="\t")
test = pd.read_csv("https://github.com/ddojo/ddojo.github.io/raw/main/sessions/14_trees/test.tsv", sep="\t")

All cases

In [5]:

X = data.drop("species",axis=1)
y = data.species
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

In [6]:

X_test = test.drop("tree_id",axis=1)
tree_id = test.tree_id
pred = pd.DataFrame()
pred["tree_id"] = tree_id
pred["species"] = "unknown"

In [15]:

X_train.isna().sum()

Out[15]:

latitude               0
longitude              0
stem_diameter_cm       0
height_m             316
crown_radius_m      2512
dtype: int64
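
The counts above answer "how often is each feature missing?" in absolute terms; a relative share is often easier to compare across columns. A minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not the tree data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; the values are invented for illustration.
toy = pd.DataFrame({
    "stem_diameter_cm": [10.0, 12.5, 8.0, 20.0],
    "height_m": [5.0, np.nan, 4.0, 9.0],
    "crown_radius_m": [np.nan, np.nan, 1.5, np.nan],
})

# isna() yields a boolean frame; the column mean is the share of missing entries.
missing_share = toy.isna().mean()
print(missing_share)
```

On the real data the same call would be `X_train.isna().mean()`.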

In [23]:

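# `np.nan in entry` tests the Series index (the column names), not the values,
# so the original check never matched. Test the values with isna() instead.
for index, entry in X_train.iterrows():
    if entry.isna().any():
        print(entry)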

In [24]:

from sklearn.impute import SimpleImputer

In [25]:

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_train)

Out[25]:

SimpleImputer()

In [36]:

X_train_full = imp.transform(X_train)

In [33]:

y_train.isna().sum()

Out[33]:

0

In [35]:

X_val_full = imp.transform(X_val)
X_test_full = imp.transform(X_test)

In [29]:

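# imp.transform returns a NumPy array, which has no .isna(); wrap it in a DataFrame first.
pd.DataFrame(X_train_full, columns=X_train.columns).isna().sum()

Out[29]:

latitude            0
longitude           0
stem_diameter_cm    0
height_m            0
crown_radius_m      0
dtype: int64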

In [37]:

from sklearn.ensemble import RandomForestClassifier

In [39]:

RF = RandomForestClassifier(n_estimators=100, criterion="log_loss", oob_score=True)

In [40]:

RF.fit(X_train_full, y_train)

Out[40]:

RandomForestClassifier(criterion='log_loss', oob_score=True)

In [41]:

RF.score(X_val_full, y_val)

Out[41]:

0.9516809116809117
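
The impute-then-fit steps above can also be bundled into a single scikit-learn pipeline, so that the imputer is always fitted on the training split only and applied consistently to validation and test data. The following is a sketch on synthetic data, not the notebook's result: the arrays `X_toy`/`y_toy`, the 20% missing rate, and the use of `add_indicator=True` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the tree data (assumption, for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
# Knock out roughly 20% of the entries at random.
X_toy[rng.random(X_toy.shape) < 0.2] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=0)

# add_indicator=True appends one binary "was missing" column per feature
# that had missing values during fit, so the forest can use missingness itself.
pipe = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
pipe.fit(X_tr, y_tr)
val_accuracy = pipe.score(X_va, y_va)
print(val_accuracy)
```

With a pipeline, `pipe.predict(X_test)` can be called on the raw test frame without a separate `transform` step.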

In [46]:


imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
imp.fit(X_train)

Out[46]:

SimpleImputer(fill_value=-1, strategy='constant')

In [47]:

X_train_full = imp.transform(X_train)
X_val_full = imp.transform(X_val)
X_test_full = imp.transform(X_test)

In [52]:

RF.fit(X_train_full, y_train)

Out[52]:

RandomForestClassifier(criterion='log_loss', oob_score=True)

In [53]:

RF.score(X_val_full, y_val)

Out[53]:

0.9535042735042735

In [54]:

predictions = RF.predict(X_test_full)

In [45]:

?SimpleImputer

Out[45]:

Init signature:
SimpleImputer(
    *,
    missing_values=nan,
    strategy='mean',
    fill_value=None,
    verbose='deprecated',
    copy=True,
    add_indicator=False,
    keep_empty_features=False,
)
Docstring:     
Univariate imputer for completing missing values with simple strategies.

Replace missing values using a descriptive statistic (e.g. mean, median, or
most frequent) along each column, or using a constant value.

Read more in the :ref:`User Guide <impute>`.

.. versionadded:: 0.20
   `SimpleImputer` replaces the previous `sklearn.preprocessing.Imputer`
   estimator which is now removed.

Parameters
----------
missing_values : int, float, str, np.nan, None or pandas.NA, default=np.nan
    The placeholder for the missing values. All occurrences of
    `missing_values` will be imputed. For pandas' dataframes with
    nullable integer dtypes with missing values, `missing_values`
    can be set to either `np.nan` or `pd.NA`.

strategy : str, default='mean'
    The imputation strategy.

    - If "mean", then replace missing values using the mean along
      each column. Can only be used with numeric data.
    - If "median", then replace missing values using the median along
      each column. Can only be used with numeric data.
    - If "most_frequent", then replace missing using the most frequent
      value along each column. Can be used with strings or numeric data.
      If there is more than one such value, only the smallest is returned.
    - If "constant", then replace missing values with fill_value. Can be
      used with strings or numeric data.

    .. versionadded:: 0.20
       strategy="constant" for fixed value imputation.

fill_value : str or numerical value, default=None
    When strategy == "constant", `fill_value` is used to replace all
    occurrences of missing_values. For string or object data types,
    `fill_value` must be a string.
    If `None`, `fill_value` will be 0 when imputing numerical
    data and "missing_value" for strings or object data types.

verbose : int, default=0
    Controls the verbosity of the imputer.

    .. deprecated:: 1.1
       The 'verbose' parameter was deprecated in version 1.1 and will be
       removed in 1.3. A warning will always be raised upon the removal of
       empty columns in the future version.

copy : bool, default=True
    If True, a copy of X will be created. If False, imputation will
    be done in-place whenever possible. Note that, in the following cases,
    a new copy will always be made, even if `copy=False`:

    - If `X` is not an array of floating values;
    - If `X` is encoded as a CSR matrix;
    - If `add_indicator=True`.

add_indicator : bool, default=False
    If True, a :class:`MissingIndicator` transform will stack onto output
    of the imputer's transform. This allows a predictive estimator
    to account for missingness despite imputation. If a feature has no
    missing values at fit/train time, the feature won't appear on
    the missing indicator even if there are missing values at
    transform/test time.

keep_empty_features : bool, default=False
    If True, features that consist exclusively of missing values when
    `fit` is called are returned in results when `transform` is called.
    The imputed value is always `0` except when `strategy="constant"`
    in which case `fill_value` will be used instead.

    .. versionadded:: 1.2

Attributes
----------
statistics_ : array of shape (n_features,)
    The imputation fill value for each feature.
    Computing statistics can result in `np.nan` values.
    During :meth:`transform`, features corresponding to `np.nan`
    statistics will be discarded.

indicator_ : :class:`~sklearn.impute.MissingIndicator`
    Indicator used to add binary indicators for missing values.
    `None` if `add_indicator=False`.

n_features_in_ : int
    Number of features seen during :term:`fit`.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.

    .. versionadded:: 1.0

See Also
--------
IterativeImputer : Multivariate imputer that estimates values to impute for
    each feature with missing values from all the others.
KNNImputer : Multivariate imputer that estimates missing features using
    nearest samples.

Notes
-----
Columns which only contained missing values at :meth:`fit` are discarded
upon :meth:`transform` if strategy is not `"constant"`.

In a prediction context, simple imputation usually performs poorly when
associated with a weak learner. However, with a powerful learner, it can
lead to as good or better performance than complex imputation such as
:class:`~sklearn.impute.IterativeImputer` or :class:`~sklearn.impute.KNNImputer`.

Examples
--------
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
SimpleImputer()
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
File:           /usr/local/lib/python3.8/dist-packages/sklearn/impute/_base.py
Type:           type
Subclasses:     
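
The "See Also" section of the docstring above names two multivariate imputers, `KNNImputer` and `IterativeImputer`, which fit the task "try a more sophisticated imputation strategy". A minimal sketch on a tiny made-up array follows; the array `X_toy` and the parameter choices (`n_neighbors=2`, `max_iter=10`) are illustrative assumptions. Note that `IterativeImputer` is still experimental in scikit-learn and needs the explicit enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to unlock IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

# Tiny made-up array with correlated columns and a few gaps.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
    [np.nan, 10.0, 15.0],
])

# KNNImputer fills each gap with the mean of that feature over the nearest rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)

# IterativeImputer models each feature with missing values from the other features.
iter_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_toy)

print(knn_filled)
print(iter_filled)
```

Both classes follow the same `fit`/`transform` interface as `SimpleImputer`, so they could replace `imp` in the cells above without other changes.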

In [31]:

X_train_full = X_train.copy()
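# Assign the result back instead of using inplace=True on a column selection,
# which may act on a temporary copy in recent pandas versions.
X_train_full["height_m"] = X_train_full["height_m"].fillna(X_train_full["height_m"].mean())
X_train_full.isna().sum()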

Out[31]:

latitude               0
longitude              0
stem_diameter_cm       0
height_m               0
crown_radius_m      2512
dtype: int64

In [32]:

# Same pattern as for height_m: assign back rather than fill in place.
X_train_full["crown_radius_m"] = X_train_full["crown_radius_m"].fillna(X_train_full["crown_radius_m"].mean())
X_train_full.isna().sum()

Out[32]:

latitude            0
longitude           0
stem_diameter_cm    0
height_m            0
crown_radius_m      0
dtype: int64

Only complete cases

In [6]:

complete = data.dropna()  # keep only rows without any missing value
X_complete = complete.drop("species", axis=1)
y_complete = complete.species
X_train_complete, X_val_complete, y_train_complete, y_val_complete = train_test_split(X_complete, y_complete, random_state=42)

In [7]:

test_complete = test.dropna()  # test rows with missing values are dropped here
X_test_complete = test_complete.drop("tree_id", axis=1)
tree_id_complete = test_complete.tree_id
pred_complete = pd.DataFrame()
pred_complete["tree_id"] = tree_id_complete
pred_complete["species"] = "unknown"
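
Dropping incomplete rows from the test set means those trees receive no prediction at all. A quick way to see what is lost, sketched on a made-up frame (`toy_test` and its values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up test frame; two of the four rows have a missing value.
toy_test = pd.DataFrame({
    "tree_id": [101, 102, 103, 104],
    "height_m": [5.0, np.nan, 7.0, 6.0],
    "crown_radius_m": [1.0, 2.0, np.nan, 1.5],
})

complete = toy_test.dropna()

# IDs that would be missing from a complete-case prediction file.
dropped_ids = toy_test.loc[~toy_test.index.isin(complete.index), "tree_id"].tolist()
kept_share = len(complete) / len(toy_test)

print(dropped_ids, kept_share)
```

If every `tree_id` needs a prediction, the dropped rows would have to be handled by a second model or by imputation.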

Models

Save Test Predictions

In [55]:

pred["species"] = RF.predict(X_test_full)
pred.to_csv("constant-1_imp_RF.tsv", sep="\t")

In [0]:

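# `model` is a placeholder: assign a classifier fitted on X_train_complete before running this cell.
pred_complete["species"] = model.predict(X_test_complete)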
pred_complete.to_csv("my_prediction.tsv", sep="\t")