Data Dojo 19 - Ensemble Models¶

[`scikit-learn` documentation on ensemble models](https://scikit-learn.org/stable/modules/ensemble.html)

Setup¶

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

Hacking Order¶

In [4]:
np.random.seed(42)
names = ["Robin", "Markus", "Mike", "Kerstin", "Stefan", "Chris", "Andi", "Carolin", "Sascha", "Sabine", "Prithivi"]
np.random.shuffle(names)
" => ".join(names)
Out[4]:
'Chris => Robin => Sabine => Prithivi => Mike => Markus => Sascha => Stefan => Carolin => Kerstin => Andi'

Data Loading¶

In [2]:
data = pd.read_csv("https://github.com/ddojo/ddojo.github.io/raw/main/sessions/14_trees/train.tsv", sep="\t")
test = pd.read_csv("https://github.com/ddojo/ddojo.github.io/raw/main/sessions/14_trees/test.tsv", sep="\t")

All cases¶

In [3]:
X = data.drop("species",axis=1)
y = data.species
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
In [4]:
X_test = test.drop("tree_id",axis=1)
tree_id = test.tree_id
pred = pd.DataFrame()
pred["tree_id"] = tree_id
pred["species"] = "unknown"

Only complete cases¶

In [5]:
X_complete = data.dropna().drop("species",axis=1)
y_complete = data.dropna().species
X_train_complete, X_val_complete, y_train_complete, y_val_complete = train_test_split(X_complete, y_complete, random_state=42)
In [6]:
X_test_complete = test.dropna().drop("tree_id",axis=1)
tree_id_complete = test.dropna().tree_id
pred_complete = pd.DataFrame()
pred_complete["tree_id"] = tree_id_complete
pred_complete["species"] = "unknown"
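Dropping incomplete rows discards data. As an alternative (not used in this session), scikit-learn's `SimpleImputer` can fill in missing values so all rows are kept. A minimal sketch on a toy frame standing in for the tree data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing values (synthetic, mimicking the tree columns)
df = pd.DataFrame({"height_m": [10.0, np.nan, 12.0],
                   "crown_radius_m": [2.0, 3.0, np.nan]})

# Median imputation fills gaps per column instead of dropping the row
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Fitting the imputer on the training split and reusing it on validation/test data avoids leaking information between splits.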

Models¶

In [10]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

AdaBoost¶

In [39]:
adaboost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10), n_estimators=10)
adaboost.fit(X_train_complete,y_train_complete)
Out[39]:
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=10),
                   n_estimators=10)
In [40]:
adaboost.score(X_train_complete, y_train_complete)
Out[40]:
0.9773878976280713
In [41]:
adaboost.score(X_val_complete, y_val_complete)
Out[41]:
0.9412365866121615
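The gap between training and validation accuracy above suggests some overfitting of the deep base trees. One way to inspect boosting behaviour is `staged_score`, which reports accuracy after each boosting round. A sketch on synthetic data (the dataset, depth, and estimator count here are illustrative, not the values used above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=42)

# Shallow stumps are the classic AdaBoost base learner
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=42)
clf.fit(X_tr, y_tr)

# One validation score per boosting round; shows where accuracy plateaus
scores = list(clf.staged_score(X_va, y_va))
```

Plotting `scores` against the round index is a quick way to pick `n_estimators`.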

Random Forests¶

In [7]:
from sklearn.ensemble import RandomForestClassifier
In [20]:
RF = RandomForestClassifier(n_estimators = 100, criterion="log_loss", oob_score=True)
In [21]:
RF.fit(X_train_complete, y_train_complete)
Out[21]:
RandomForestClassifier(criterion='log_loss', oob_score=True)
In [22]:
RF.score(X_val_complete, y_val_complete)
Out[22]:
0.9503065917220235
Note the distinction: `oob_score` is the constructor flag we set above, while the trailing-underscore attribute `oob_score_` holds the out-of-bag accuracy computed during fitting.

In [23]:
RF.oob_score
Out[23]:
True
In [24]:
RF.oob_score_
Out[24]:
0.9506025635566154
In [25]:
RF.feature_importances_
Out[25]:
array([0.29352523, 0.27102041, 0.07447549, 0.27080703, 0.09017184])
In [27]:
RF.feature_names_in_
Out[27]:
array(['latitude', 'longitude', 'stem_diameter_cm', 'height_m',
       'crown_radius_m'], dtype=object)

Save Test Predictions¶

In [0]:
# `model` is a placeholder: replace it with any fitted estimator from above
pred["species"] = model.predict(X_test)
pred.to_csv("my_prediction.tsv", sep="\t", index=False)

or

In [18]:
pred_complete["species"] = RF.predict(X_test_complete)
pred_complete.to_csv("randomforest_logloss_prediction.tsv", sep="\t", index=False)