%matplotlib inline

Supervised learning

We have the human operator labels for our data, so we can train models to predict these labels instead of using them only to check our model.

Examples of supervised learning include linear regression, support vector machines, neural networks (including deep learning), and more.

One of the simplest supervised learning models for our classification problem is the k-nearest-neighbors classifier (kNN).

kNN assigns labels based on the labels of a sample’s k nearest neighbors. Let’s try it out!
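Before reaching for scikit-learn, the voting rule can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how scikit-learn implements it; the function name knn_predict and the toy RGB values are made up for this example. The default of k=5 mirrors the default n_neighbors of KNeighborsClassifier.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_new, k=5):
    """Label each row of X_new by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for sample in X_new:
        # Euclidean distance from this sample to every training point.
        distances = np.linalg.norm(X_train - sample, axis=1)
        # Indices of the k closest training points.
        nearest = np.argsort(distances)[:k]
        # The most common label among those neighbors wins.
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Made-up RGB-like values, for illustration only (not from the pellets dataset).
toy_X = [[250, 250, 240], [245, 240, 230], [240, 245, 235],
         [200, 150, 60], [210, 160, 70], [205, 155, 65]]
toy_y = ["low", "low", "low", "high", "high", "high"]

toy_pred = knn_predict(toy_X, toy_y, [[248, 246, 238], [207, 158, 66]], k=3)
print(toy_pred)
```

Each new sample simply inherits the label that dominates among its closest training points, which is why the choice of distance, and therefore the scale of the features, matters.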

import pandas as pd

df = pd.read_csv("pellets-visual-classes-rgb.csv", index_col="image").dropna()
df["yellowing index"] = df["yellowing index"].astype(int)

We already know that the size is mostly random, so let’s drop it here.

feature_columns = ["r", "g", "b"]
X = df[feature_columns].values

y = df["yellowing"].values
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier()
knn.fit(X, y)
prediction = knn.predict(X)
prediction
array(['low', 'low', 'low', 'low', 'low', 'low', 'low', 'high', 'high',
       'low', 'low', 'very high', 'low', 'low', 'low', 'low', 'low',
       'high', 'high', 'low', 'high', 'low', 'moderate', 'high', 'high',
       'low', 'moderate', 'low', 'low', 'moderate', 'low', 'moderate',
       'low', 'low', 'moderate', 'moderate', 'low', 'high', 'low', 'high',
       'moderate', 'high', 'low', 'moderate', 'high', 'low', 'moderate',
       'low', 'high', 'high', 'low', 'high', 'low', 'high', 'low', 'high',
       'high', 'high', 'low', 'low', 'high', 'moderate', 'moderate',
       'very high', 'low', 'low', 'low', 'low', 'low', 'low', 'moderate',
       'low', 'low', 'low', 'low', 'low', 'low', 'moderate', 'low',
       'moderate', 'high', 'high', 'high', 'high', 'high', 'high', 'low',
       'low', 'low', 'low', 'low', 'low', 'low', 'high', 'low', 'high',
       'low', 'moderate', 'moderate', 'high', 'moderate', 'low', 'high',
       'high', 'low', 'high', 'low', 'low', 'low', 'low', 'low',
       'very high', 'very high', 'low', 'high', 'moderate', 'high',
       'high', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'high',
       'low'], dtype=object)
from sklearn import metrics

metrics.accuracy_score(y, prediction)
0.8267716535433071

Quite the improvement from our k-means attempt. Aren’t we forgetting anything though? Yes, we should always standardize the data!

from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
knn = neighbors.KNeighborsClassifier()
knn.fit(X_scaled, y)
prediction_scaled = knn.predict(X_scaled)

metrics.accuracy_score(y, prediction_scaled)
0.7795275590551181

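The lower score suggests that we were overfitting before standardizing. Keep in mind that both scores are computed on the same data the model was fit on, so they are optimistic. Still, ~78% is much better than our k-means.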
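To make the standardization step concrete, here is a small sketch of what StandardScaler computes, using made-up numbers rather than the pellet data. kNN works on distances, so a feature with a much larger numeric range would otherwise dominate the neighbor search. The array name demo and its values are assumptions for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Made-up two-feature data on very different scales (illustration only).
demo = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0], [4.0, 6000.0]])

demo_scaled = preprocessing.StandardScaler().fit_transform(demo)

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0, the NumPy default).
by_hand = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(demo_scaled, by_hand))
print(demo_scaled.mean(axis=0).round(6), demo_scaled.std(axis=0).round(6))
```

After scaling, every column has zero mean and unit variance, so each feature contributes on a comparable footing to the distance.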

import seaborn as sns
redux = df[["r", "g", "b", "yellowing"]]
redux = redux.assign(knn=prediction_scaled)

sns.pairplot(redux, hue="knn", vars=feature_columns);
[Figure: pair plot of the r, g, and b features, colored by the kNN prediction]

How can we stop forgetting to standardize the data? Well, scikit-learn is awesome and has our back. We can create data processing pipelines and keep all the steps of our model in a single object. Pipelines are very robust and may contain custom steps if your data requires them.

from sklearn import pipeline

classifier = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)

classifier.fit(X, y)
prediction_pipeline = classifier.predict(X)

metrics.accuracy_score(y, prediction_pipeline)
0.7795275590551181
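As a sketch of what a custom step could look like, the pipeline below wraps a plain Python function with FunctionTransformer. The function to_unit_range, the assumption that the channels are 0-255 values, and the toy data are all hypothetical; the step is a placeholder for whatever cleaning your own data needs, and here it is redundant with the scaler that follows.

```python
import numpy as np
from sklearn import neighbors, pipeline, preprocessing


def to_unit_range(rgb):
    """Hypothetical custom step: rescale 0-255 channel values to 0-1."""
    return np.asarray(rgb, dtype=float) / 255.0


custom_classifier = pipeline.make_pipeline(
    preprocessing.FunctionTransformer(to_unit_range),  # custom step
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(n_neighbors=3),
)

# Made-up RGB values and labels, for illustration only.
demo_X = np.array([[250, 250, 240], [245, 240, 230], [240, 245, 235],
                   [200, 150, 60], [210, 160, 70], [205, 155, 65]])
demo_y = np.array(["low", "low", "low", "high", "high", "high"])

custom_classifier.fit(demo_X, demo_y)
custom_pred = custom_classifier.predict([[247, 244, 236]])
print(custom_pred)
```

Because the custom function lives inside the pipeline, it is applied automatically at both fit and predict time, so it cannot be forgotten either.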

Validation

from sklearn import model_selection

split = model_selection.train_test_split(X, y)

X_train, X_test, y_train, y_test = split

X_train.shape, X_test.shape
((95, 3), (32, 3))
classifier.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kneighborsclassifier', KNeighborsClassifier())])
metrics.accuracy_score(y_test, classifier.predict(X_test))
0.71875
scores = model_selection.cross_val_score(classifier, X, y)
scores
array([0.69230769, 0.65384615, 0.84      , 0.68      , 0.72      ])
scores.mean()
0.7172307692307692
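To see what cross_val_score is doing for us, here is a rough manual equivalent. For a classifier with the default cv setting, scikit-learn uses five stratified folds without shuffling and scores each fold with the estimator's own score method (accuracy here). The data below is synthetic, generated with make_classification as a stand-in, and the names demo_X, demo_y, and demo_classifier are assumptions for this sketch.

```python
import numpy as np
from sklearn import base, datasets, model_selection, neighbors, pipeline, preprocessing

# Synthetic stand-in data (not the pellets dataset).
demo_X, demo_y = datasets.make_classification(
    n_samples=120, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

demo_classifier = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)

folds = model_selection.StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(demo_X, demo_y):
    # A fresh copy per fold: scaler and classifier only see the training part.
    fold_model = base.clone(demo_classifier)
    fold_model.fit(demo_X[train_idx], demo_y[train_idx])
    # Accuracy on the held-out part of this fold.
    manual_scores.append(fold_model.score(demo_X[test_idx], demo_y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = model_selection.cross_val_score(demo_classifier, demo_X, demo_y)
print(np.allclose(manual_scores, library_scores))
```

Passing the whole pipeline, rather than pre-scaled data, matters here: the scaler is refit on each training fold, so no information from the held-out samples leaks into the preprocessing.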

Our accuracy drops when we evaluate on data the model has not seen, both with a train/test split and with cross-validation. Why did that happen? The first guess is that our model may be “data hungry.” We just don’t have enough samples in each class to predict them.

df["yellowing"].value_counts().plot.barh(title="yellowing");
[Figure: horizontal bar chart of the number of samples in each yellowing class]

The data is unbalanced! What can we do next?

  • Try to balance the current data;

  • Collect more data and see if the dataset balances itself out;

  • Choose a technique that is more robust to unbalanced data, like Decision Trees (DT).
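One possible way to act on the first and third options is sketched below, under assumptions that go beyond this notebook: a stratified split keeps class proportions similar in both parts, a decision tree is fit with class_weight="balanced" so rare classes are weighted up, and balanced accuracy is used so every class counts equally in the score. The data is synthetic and deliberately imbalanced; none of the numbers it prints say anything about the pellets dataset.

```python
from sklearn import datasets, metrics, model_selection, tree

# Synthetic, deliberately imbalanced stand-in data (not the pellets dataset).
demo_X, demo_y = datasets.make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1],
    random_state=0,
)

# stratify keeps the class proportions similar in the train and test parts.
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    demo_X, demo_y, stratify=demo_y, random_state=0
)

# class_weight="balanced" weights samples inversely to their class frequency.
balanced_tree = tree.DecisionTreeClassifier(class_weight="balanced", random_state=0)
balanced_tree.fit(X_tr, y_tr)
tree_pred = balanced_tree.predict(X_te)

# Balanced accuracy averages recall over classes, so rare classes count equally.
tree_score = metrics.balanced_accuracy_score(y_te, tree_pred)
print(tree_score)
print(metrics.confusion_matrix(y_te, tree_pred))
```

Looking at the confusion matrix alongside the score shows which classes the model confuses, which plain accuracy hides when one class dominates.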

PS: Check this awesome paper on DTs.