
Data Dojo 20 - Missing Data

scikit-learn documentation on imputation

Specific Tasks

  • How often is each feature missing?

  • Try a simple imputer

  • Try a more sophisticated imputation strategy

  • Optional: try a model that can handle missing values / a multi-stage modeling approach

Setup

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

Hacking Order

In [2]:
np.random.seed(42)
names = ["Sascha", "Marko", "Sebastian", "Max", "Markus", "Sabine", "Caro", "Prithivi", "Mike", "Robin"]
np.random.shuffle(names)
" => ".join(names)
Out[2]:
'Mike => Marko => Sabine => Sascha => Prithivi => Sebastian => Robin => Markus => Max => Caro'

Data Loading

In [4]:
data = pd.read_csv("https://github.com/ddojo/ddojo.github.io/raw/main/sessions/14_trees/train.tsv", sep="\t")
test = pd.read_csv("https://github.com/ddojo/ddojo.github.io/raw/main/sessions/14_trees/test.tsv", sep="\t")

All cases

In [5]:
X = data.drop("species", axis=1)
y = data.species
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
In [6]:
X_test = test.drop("tree_id", axis=1)
tree_id = test.tree_id
pred = pd.DataFrame()
pred["tree_id"] = tree_id
pred["species"] = "unknown"
In [15]:
X_train.isna().sum()
Out[15]:
latitude               0
longitude              0
stem_diameter_cm       0
height_m             316
crown_radius_m      2512
dtype: int64
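Raw counts are hard to compare without the number of rows. A minimal sketch of the same check as a fraction, using a made-up toy frame (`toy` and its values are assumptions for illustration, not the tree data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train; the values are invented for illustration.
toy = pd.DataFrame({
    "stem_diameter_cm": [30.0, 45.0, 12.0, 60.0],
    "height_m": [10.0, np.nan, 5.0, 20.0],
    "crown_radius_m": [np.nan, np.nan, 1.5, 4.0],
})

# isna() yields booleans, so the column mean is the fraction of missing entries.
missing_fraction = toy.isna().mean()

# Counts and fractions side by side, most affected feature first.
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "fraction_missing": missing_fraction,
}).sort_values("fraction_missing", ascending=False)
print(summary)
```

On the real data the same two lines should work with `X_train` in place of `toy`.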
In [23]:
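# `np.nan in entry` tests the index labels of the row, not its values, so it never matches.
# Select the rows with at least one missing value via isna() instead.
X_train[X_train.isna().any(axis=1)]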
In [24]:
from sklearn.impute import SimpleImputer
In [25]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_train)
Out[25]:
SimpleImputer()
In [36]:
X_train_full = imp.transform(X_train)
In [33]:
y_train.isna().sum()
Out[33]:
0
In [35]:
X_val_full = imp.transform(X_val)
X_test_full = imp.transform(X_test)
In [29]:
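# imp.transform returns a NumPy array, which has no isna(); wrap it in a DataFrame first.
pd.DataFrame(X_train_full).isna().sum()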
Out[29]:
0    0
1    0
2    0
3    0
4    0
dtype: int64
In [37]:
from sklearn.ensemble import RandomForestClassifier
In [39]:
RF = RandomForestClassifier(n_estimators = 100, criterion="log_loss", oob_score=True)
In [40]:
RF.fit(X_train_full, y_train)
Out[40]:
RandomForestClassifier(criterion='log_loss', oob_score=True)
In [41]:
RF.score(X_val_full, y_val)
Out[41]:
0.9516809116809117
In [46]:
imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
imp.fit(X_train)
Out[46]:
SimpleImputer(fill_value=-1, strategy='constant')
In [47]:
X_train_full = imp.transform(X_train)
X_val_full = imp.transform(X_val)
X_test_full = imp.transform(X_test)
In [52]:
RF.fit(X_train_full, y_train)
Out[52]:
RandomForestClassifier(criterion='log_loss', oob_score=True)
In [53]:
RF.score(X_val_full, y_val)
Out[53]:
0.9535042735042735
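The fit/transform/fit/score steps above can be bundled into a pipeline, which keeps the imputer fitted on the training split only and makes it easier to compare strategies. A minimal sketch on synthetic data (`X_demo`, `y_demo` and the fixed `random_state` are assumptions for illustration; the scores it prints say nothing about the tree data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: the label depends on column 0, column 1 gets holes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[rng.random(200) < 0.2, 1] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=42)

scores = {}
for strategy, fill_value in [("mean", None), ("constant", -1)]:
    # fill_value is only used when strategy == "constant".
    pipe = make_pipeline(
        SimpleImputer(strategy=strategy, fill_value=fill_value),
        RandomForestClassifier(n_estimators=100, random_state=42),
    )
    pipe.fit(X_tr, y_tr)               # imputer statistics come from X_tr only
    scores[strategy] = pipe.score(X_va, y_va)

print(scores)
```

Fixing `random_state` in the forest also removes one source of run-to-run variation when two imputation strategies differ by only a fraction of a percent.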
In [54]:
predictions = RF.predict(X_test_full)
In [45]:
?SimpleImputer
Out[45]:
Init signature:
SimpleImputer(
    *,
    missing_values=nan,
    strategy='mean',
    fill_value=None,
    verbose='deprecated',
    copy=True,
    add_indicator=False,
    keep_empty_features=False,
)
Docstring:
Univariate imputer for completing missing values with simple strategies.

Replace missing values using a descriptive statistic (e.g. mean, median, or
most frequent) along each column, or using a constant value.

Read more in the :ref:`User Guide <impute>`.

.. versionadded:: 0.20
   `SimpleImputer` replaces the previous `sklearn.preprocessing.Imputer`
   estimator which is now removed.

Parameters
----------
missing_values : int, float, str, np.nan, None or pandas.NA, default=np.nan
    The placeholder for the missing values. All occurrences of
    `missing_values` will be imputed. For pandas' dataframes with
    nullable integer dtypes with missing values, `missing_values`
    can be set to either `np.nan` or `pd.NA`.

strategy : str, default='mean'
    The imputation strategy.

    - If "mean", then replace missing values using the mean along
      each column. Can only be used with numeric data.
    - If "median", then replace missing values using the median along
      each column. Can only be used with numeric data.
    - If "most_frequent", then replace missing using the most frequent
      value along each column. Can be used with strings or numeric data.
      If there is more than one such value, only the smallest is returned.
    - If "constant", then replace missing values with fill_value. Can be
      used with strings or numeric data.

    .. versionadded:: 0.20
       strategy="constant" for fixed value imputation.

fill_value : str or numerical value, default=None
    When strategy == "constant", `fill_value` is used to replace all
    occurrences of missing_values. For string or object data types,
    `fill_value` must be a string.
    If `None`, `fill_value` will be 0 when imputing numerical
    data and "missing_value" for strings or object data types.

verbose : int, default=0
    Controls the verbosity of the imputer.

    .. deprecated:: 1.1
       The 'verbose' parameter was deprecated in version 1.1 and will be
       removed in 1.3. A warning will always be raised upon the removal of
       empty columns in the future version.

copy : bool, default=True
    If True, a copy of X will be created. If False, imputation will
    be done in-place whenever possible. Note that, in the following cases,
    a new copy will always be made, even if `copy=False`:

    - If `X` is not an array of floating values;
    - If `X` is encoded as a CSR matrix;
    - If `add_indicator=True`.

add_indicator : bool, default=False
    If True, a :class:`MissingIndicator` transform will stack onto output
    of the imputer's transform. This allows a predictive estimator
    to account for missingness despite imputation. If a feature has no
    missing values at fit/train time, the feature won't appear on
    the missing indicator even if there are missing values at
    transform/test time.

keep_empty_features : bool, default=False
    If True, features that consist exclusively of missing values when
    `fit` is called are returned in results when `transform` is called.
    The imputed value is always `0` except when `strategy="constant"`
    in which case `fill_value` will be used instead.

    .. versionadded:: 1.2

Attributes
----------
statistics_ : array of shape (n_features,)
    The imputation fill value for each feature.
    Computing statistics can result in `np.nan` values.
    During :meth:`transform`, features corresponding to `np.nan`
    statistics will be discarded.

indicator_ : :class:`~sklearn.impute.MissingIndicator`
    Indicator used to add binary indicators for missing values.
    `None` if `add_indicator=False`.

n_features_in_ : int
    Number of features seen during :term:`fit`.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.

    .. versionadded:: 1.0

See Also
--------
IterativeImputer : Multivariate imputer that estimates values to impute for
    each feature with missing values from all the others.
KNNImputer : Multivariate imputer that estimates missing features using
    nearest samples.

Notes
-----
Columns which only contained missing values at :meth:`fit` are discarded
upon :meth:`transform` if strategy is not `"constant"`.

In a prediction context, simple imputation usually performs poorly when
associated with a weak learner. However, with a powerful learner, it can
lead to as good or better performance than complex imputation such as
:class:`~sklearn.impute.IterativeImputer` or :class:`~sklearn.impute.KNNImputer`.

Examples
--------
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
SimpleImputer()
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
File:           /usr/local/lib/python3.8/dist-packages/sklearn/impute/_base.py
Type:           type
Subclasses:
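The `add_indicator` parameter described in the docstring is easy to miss. A small sketch of what it does to the output, on a hand-made array (`X_small` is an assumption for illustration): each feature that had missing values at fit time gets an extra 0/1 column appended after the imputed features.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hand-made example: both columns contain one missing value.
X_small = np.array([
    [1.0, np.nan],
    [2.0, 4.0],
    [np.nan, 6.0],
])

imp_ind = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp_ind.fit_transform(X_small)

# Two imputed feature columns followed by two missingness indicator columns.
print(X_out.shape)
print(X_out)
```

With the indicator columns a downstream model can still see which values were originally missing, even after mean imputation.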
In [31]:
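X_train_full = X_train.copy()
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about as chained assignment.
X_train_full["height_m"] = X_train_full["height_m"].fillna(X_train_full["height_m"].mean())
X_train_full.isna().sum()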
Out[31]:
latitude               0
longitude              0
stem_diameter_cm       0
height_m               0
crown_radius_m      2512
dtype: int64
In [32]:
# Same pattern as above: assign back rather than filling in place.
X_train_full["crown_radius_m"] = X_train_full["crown_radius_m"].fillna(X_train_full["crown_radius_m"].mean())
X_train_full.isna().sum()
Out[32]:
latitude            0
longitude           0
stem_diameter_cm    0
height_m            0
crown_radius_m      0
dtype: int64
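For the "more sophisticated imputation strategy" task, scikit-learn offers `KNNImputer` and the still experimental `IterativeImputer`. The sketch below compares them with mean imputation on synthetic data where one feature is roughly proportional to another (the `height`/`crown` relationship and all numbers are assumptions for illustration, not properties of the tree data):

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Synthetic data: crown radius is a noisy linear function of height.
rng = np.random.default_rng(0)
height = rng.uniform(5, 30, size=200)
crown = 0.2 * height + rng.normal(scale=0.3, size=200)
X_true = np.column_stack([height, crown])

# Hide about 30% of the crown values.
X_missing = X_true.copy()
mask = rng.random(200) < 0.3
X_missing[mask, 1] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

# Root mean squared error on the hidden entries only.
rmse = {}
for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    err = X_filled[mask, 1] - X_true[mask, 1]
    rmse[name] = float(np.sqrt(np.mean(err ** 2)))

print(rmse)
```

Both multivariate imputers can use the observed feature to estimate the hidden one, which the column mean cannot. Whether that translates into a better validation score for a random forest on the real data has to be checked separately.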

Only complete cases

In [6]:
X_complete = data.dropna().drop("species", axis=1)
y_complete = data.dropna().species
X_train_complete, X_val_complete, y_train_complete, y_val_complete = train_test_split(X_complete, y_complete, random_state=42)
In [7]:
X_test_complete = test.dropna().drop("tree_id", axis=1)
tree_id_complete = test.dropna().tree_id
pred_complete = pd.DataFrame()
pred_complete["tree_id"] = tree_id_complete
pred_complete["species"] = "unknown"

Models

In [0]:

Save Test Predictions

In [55]:
pred["species"] = RF.predict(X_test_full)
pred.to_csv("constant-1_imp_RF.tsv", sep="\t")

or

In [0]:
# `model` is a placeholder for a classifier fitted on the complete cases; none is defined above.
pred_complete["species"] = model.predict(X_test_complete)
pred_complete.to_csv("my_prediction.tsv", sep="\t")