A quick introduction to machine learning#

  • These notebooks are a brief, hands-on, introduction to machine learning.

  • We will review some of the nomenclature, principles, and applications from Valentina’s presentation.

ML will solve all of our problems, right?#

What is Machine Learning (ML)?#

Caveat: I’m not a Statistician, Mathematician, or ML expert. I only play one online. You can find my work in movies like “How to get by with little to no data”, “Oh gosh, the PI wants some buzz-words in the report”, and “Fuzzy logic no longer does it, we need ML → AI → DL”.

What is ML (a personal point of view):

  • Focus on practical problems

  • Learn from the data and/or make predictions with it

  • Middle ground between statistics and optimization techniques

  • We have fast computers now, right? Let them do the work! (Must see JVP talk on this.)

Oversimplified take: fit a model to data and use it to make predictions. (This is how scikit-learn designed its API, by the way.)
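To make that concrete, here is a minimal sketch of the fit-then-predict pattern (the numbers are invented purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # features (one column)
y = np.array([2.1, 3.9, 6.2, 8.1])          # labels we want to predict

model = LinearRegression()   # 1. choose a model
model.fit(X, y)              # 2. fit (train) it on the data
model.predict([[5.0]])       # 3. use it to predict a new, unseen value

Virtually every scikit-learn estimator follows this same fit-then-predict (or fit-then-transform) pattern.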

Vocabulary#

  • parameters: Variables that define the model and control its behavior.

  • model: Set of mathematical equations used to approximate the data.

  • labels/classes: Quantity/category that we want to predict

  • features: Observations (information) used as predictors of labels/classes.

  • training: Use features and known labels/classes to fit the model and estimate its parameters (full circle, right? But why stop now?).

Please check out this awesome lecture on ML for climate science.

  • hyper-parameters: Variables that influence the training and the model but are not estimated during training (see the short sketch after this list).

  • unsupervised learning: Extract information and structure from the data without training with known labels. We will see clustering, and Principal Component Analysis (PCA).

  • supervised learning: Fit a model using data to “train” it for making predictions. Examples: regression, classification, spam detection, recommendation systems. We’ll see KNN, a classification method, later in this tutorial.
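To make the parameters vs. hyper-parameters distinction concrete, here is a minimal sketch (toy numbers, invented for illustration) using the KNN classifier mentioned above: the number of neighbors is a hyper-parameter we pick before training, while everything the model “learns” happens inside fit.

from sklearn.neighbors import KNeighborsClassifier

X = [[150, 148, 140], [220, 219, 218], [180, 160, 75], [190, 150, 60]]  # features: r, g, b
y = [1, 1, 2, 2]                                                        # labels: yellowing index

model = KNeighborsClassifier(n_neighbors=3)  # hyper-parameter: set by us, not estimated
model.fit(X, y)                              # training: learn from features + known labels
model.predict([[200, 190, 100]])             # supervised prediction for a new observation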

Unsupervised: PCA#

The dataset we will use consists of Red, Green, and Blue composites (our features) extracted from photos of plastic pellets. We also have some extra information on the pellet size, shape, etc.

The labels are the yellowing index. The goal is to predict the yellowing based on the pellet images, broken down into their RGB information.

import pandas as pd

# Load the pellets dataset, drop rows with missing values, and make the label an integer.
df = pd.read_csv("pellets-visual-classes-rgb.csv", index_col="image").dropna()
df["yellowing index"] = df["yellowing index"].astype(int)
df
image                           r    g    b    size (mm)  color        description       erosion           erosion index  yellowing  yellowing index
cl1_p11_moca2_deixa5_a0001      152  150  143  4.021      transparent  sphere            high erosion      3              low        1
cl1_p12_lagoinha_deixa1_g0006   221  218  219  4.244      white        light erosion     low erosion       1              low        1
cl1_p12_lagoinha_deixa1_g0007   140  137  129  3.946      white        not erosion       low erosion       1              low        1
cl1_p12_lagoinha_deixa1_g0008   188  178  146  3.948      white        moderate erosion  high erosion      3              moderate   2
cl1_p12_lagoinha_deixa2_h0004   207  200  189  6.043      white        light erosion     low erosion       1              moderate   2
...                             ...  ...  ...  ...        ...          ...               ...               ...            ...        ...
cl1_p6_moca2_deixa3_a0006       186  193  155  4.546      transparent  cylinder          moderate erosion  2              low        1
cl1_p8_moca2_deixa5_b0001       169  168  106  3.082      transparent  sphere            low erosion       1              low        1
cl1_p8_moca2_deixa5_b0003       191  189  152  3.932      white        sphere            low erosion       1              low        1
cl1_p8_moca2_deixa5_b0004       181  156  70   3.230      white        sphere            moderate erosion  3              moderate   2
cl1_p9_moca2_deixa5_b0001       193  192  198  3.763      transparent  sphere            high erosion      3              low        1

127 rows × 10 columns

import matplotlib.pyplot as plt


def histograms():
    """Horizontal bar charts of the categorical columns in the dataset."""
    fig, axes = plt.subplots(figsize=(11, 11), nrows=2, ncols=2)

    axes = axes.ravel()

    df["erosion"].value_counts().plot.barh(ax=axes[0], title="erosion")
    df["color"].value_counts().plot.barh(ax=axes[1], title="color")
    df["description"].value_counts().plot.barh(ax=axes[2], title="description")
    df["yellowing"].value_counts().plot.barh(ax=axes[3], title="yellowing")

    # Move the tick labels of the right-hand panels to the right side.
    axes[1].yaxis.set_label_position("right")
    axes[1].yaxis.tick_right()

    axes[3].yaxis.set_label_position("right")
    axes[3].yaxis.tick_right()
histograms();
[Figure: horizontal bar charts of the value counts for erosion, color, description, and yellowing.]

We will be using only the R, G, B data for now.

RGB = df[["r", "g", "b"]]
import numpy as np
import seaborn

corr = RGB.corr()

seaborn.heatmap(corr, vmin=-1, vmax=1, annot=True);
[Figure: correlation heatmap of the r, g, and b channels.]

The first step in most ML techniques is to standardize the data: we do not want high-variance features to bias our model.

def z_score(x):
    """Standardize to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()


# Standardize each color channel, then transpose so rows are r/g/b and columns are the images.
zs = RGB.apply(z_score).T

zs.std(axis=1)  # Should be 1
r    1.0
g    1.0
b    1.0
dtype: float64
zs.mean(axis=1)  # Should be zero
r    3.020331e-16
g    4.056248e-16
b   -1.188900e-16
dtype: float64
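As an aside, scikit-learn ships an equivalent transformer, StandardScaler. A minimal sketch (it divides by the population standard deviation, so its output differs slightly from the pandas version above, which uses the sample standard deviation):

from sklearn.preprocessing import StandardScaler

# Standardize each column to zero mean and unit standard deviation (returns a NumPy array).
scaled = StandardScaler().fit_transform(RGB)
scaled.mean(axis=0), scaled.std(axis=0)  # ~0 and exactly 1 for each column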
from sklearn.decomposition import PCA

pca = PCA(n_components=None)  # n_components=None keeps all the components
pca.fit(zs);

The pca object, or fitted model, was designed before pandas existed and is based on NumPy arrays. We can do better nowadays and attach meaningful labels to it with a pandas DataFrame.

loadings = pd.DataFrame(pca.components_.T)
loadings.index = ["PC %s" % pc for pc in loadings.index + 1]
loadings.columns = ["TS %s" % pc for pc in loadings.columns + 1]
loadings
        TS 1       TS 2       TS 3
PC 1    0.106617  -0.020035   0.796085
PC 2    0.038558  -0.015377   0.045291
PC 3    0.118255  -0.029110  -0.071213
PC 4    0.032419  -0.036895  -0.025253
PC 5    0.037738  -0.033506   0.003937
...          ...        ...        ...
PC 123  0.037121   0.082580   0.065741
PC 124  0.023213   0.049475   0.000340
PC 125  0.027663   0.022751  -0.036524
PC 126 -0.032039  -0.082355   0.004122
PC 127  0.076820  -0.015290  -0.064717

127 rows × 3 columns

# Project each color channel onto the principal components (the "scores").
PCs = np.dot(loadings.values.T, RGB)
marker = {
    "linestyle": "none",
    "marker": "o",
    "markersize": 7,
    "color": "blue",
    "alpha": 0.5,
}

fig, ax = plt.subplots(figsize=(7, 2.75))
ax.plot(PCs[0], PCs[1], label="Scores", **marker)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")

text = [ax.text(x, y, t) for x, y, t in zip(PCs[0], PCs[1] + 0.5, RGB.columns)]
[Figure: scores plot of PC1 vs PC2 with the r, g, and b channels labeled.]
perc = pca.explained_variance_ratio_ * 100
perc = pd.DataFrame(
    perc,
    columns=["Percentage explained ratio"],
    index=["PC %s" % pc for pc in np.arange(len(perc)) + 1],
)
ax = perc.plot(kind="bar")
[Figure: bar chart of the percentage of explained variance for each principal component.]

The non-projected loadings plot can help us see whether the data shows some sort of grouping that we can use.

common = {"linestyle": "none", "markersize": 7, "alpha": 0.5}

markers = {
    0: {"color": "black", "marker": "o", "label": "no yellowing"},
    1: {"color": "red", "marker": "^", "label": "low"},
    2: {"color": "blue", "marker": "*", "label": "moderate"},
    3: {"color": "khaki", "marker": "s", "label": "high"},
    4: {"color": "darkgoldenrod", "marker": "d", "label": "very high"},
}


def unprojected_loadings():
    fig, ax = plt.subplots(figsize=(7, 7))
    for x, y, idx in zip(
        loadings.iloc[:, 0], loadings.iloc[:, 1], df["yellowing index"]
    ):
        ax.plot(x, y, **common, **markers.get(idx))

    ax.set_xlabel("non-projected PC1")
    ax.set_ylabel("non-projected PC2")
    # Zoom in on the region where the loadings actually fall.
    ax.axis([-0.25, 0.25, -0.4, 0.4])

    # Trick to remove duplicate labels from the for-loop.
    handles, labels = ax.get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    ax.legend(by_label.values(), by_label.keys())
    return fig, ax
unprojected_loadings();
[Figure: non-projected PC1 vs PC2 loadings, colored and marked by yellowing index.]

Summary#

  • PCA is probably the most robust, and easiest to perform, unsupervised ML technique (it has been a common technique in ocean sciences since before the ML hype);

  • We learned that a single RGB value does not have enough predictive power to be used on its own; we will need at least a combination of the Reds and Greens;

  • The loadings plot shows that the moderate and the low yellowing classes overlap somewhat. That can be troublesome when using this model for predictions.
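Looking ahead to the supervised part of this tutorial, here is a minimal sketch of how one could quantify that concern with the KNN classifier mentioned earlier, predicting the yellowing index from the same RGB features (the test split and n_neighbors below are arbitrary choices for illustration):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out a quarter of the pellets to check how well RGB alone predicts the yellowing index.
X_train, X_test, y_train, y_test = train_test_split(
    RGB, df["yellowing index"], test_size=0.25, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)  # hyper-parameter chosen before training
knn.fit(X_train, y_train)
knn.score(X_test, y_test)  # mean accuracy on the held-out pellets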