{ "cells": [ { "cell_type": "markdown", "id": "ea0786c2", "metadata": {}, "source": [ "# A quick introduction to machine learning\n", "\n", "\n", "These notebooks are a brief, hands-on introduction to machine learning. We will review some of the nomenclature, principles, and applications from Valentina's presentation." ] }, { "cell_type": "markdown", "id": "987fda81", "metadata": {}, "source": [ "## What is Machine Learning (ML)?\n", "\n", "**Caveat:** I'm not a Statistician, Mathematician, or ML expert. I only play one online. You can find my work on plays like \"How to get by with little to no data\" or \"Oh gosh, the PI wants some buzz-words in the report.\"\n", "\n", "What is ML (a personal point of view):\n", "\n", "* Focus on practical problems\n", "* Learn from the data and/or make predictions with it\n", "* Middle ground between statistics and optimization techniques\n", "* We have fast computers now, right? Let them do the work! ([Must-see JVP talk on this](https://www.youtube.com/watch?app=desktop&v=Iq9DzN6mvYA).)\n", "\n", "**Oversimplified take:** Fit a model to data and use it to make predictions. (This is how scikit-learn designed its API, BTW.)" ] }, { "cell_type": "markdown", "id": "b2cf395e", "metadata": {}, "source": [ "## Vocabulary\n", "\n", "- **model:** Mathematical equations used to approximate the data.\n", "\n", "\n", "- **parameters:** Variables that define the model and control its behavior.\n", "\n", "\n", "- **labels/classes:** Quantity/category that we want to predict.\n", "\n", "\n", "- **features:** Observations (information) used as predictors of labels/classes.\n", "\n", "\n", "- **training:** Use **features** and known **labels/classes** to fit the **model** and estimate its **parameters** (full circle, right? 
But why stop now?).\n", "\n", "\n", "- **hyper-parameters:** Variables that influence the training and the model but are not estimated during training.\n", "\n", "\n", "- **unsupervised learning:** Extract information and structure from the data without \"training\". We will see clustering and Principal Component Analysis (PCA).\n", "\n", "\n", "- **supervised learning:** Fit a model using data to \"train\" it for making predictions. Examples: regression, classification, spam detection, recommendation systems. We'll see k-nearest neighbors (KNN), a classification technique." ] }, { "cell_type": "markdown", "id": "97fe276c", "metadata": {}, "source": [ "## Unsupervised: PCA\n", "\n", "The dataset we will use consists of Red, Green, Blue (RGB) composites (our **features**) from photos of plastic pellets. We also have some extra information on the pellet size, shape, etc.\n", "\n", "The **labels** are the yellowing index. The goal is to predict the yellowing from the pellet image, broken down to its RGB info." ] }, { "cell_type": "code", "execution_count": null, "id": "048ff44d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"pellets-visual-classes-rgb.csv\", index_col=\"image\").dropna()\n", "df[\"yellowing index\"] = df[\"yellowing index\"].astype(int)\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "121bde11", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "fig, axes = plt.subplots(figsize=(11, 11), nrows=2, ncols=2)\n", "\n", "axes = axes.ravel()\n", "\n", "df[\"erosion\"].value_counts().plot.barh(ax=axes[0], title=\"erosion\")\n", "df[\"color\"].value_counts().plot.barh(ax=axes[1], title=\"color\")\n", "df[\"description\"].value_counts().plot.barh(ax=axes[2], title=\"description\")\n", "df[\"yellowing\"].value_counts().plot.barh(ax=axes[3], title=\"yellowing\")\n", "\n", "axes[1].yaxis.set_label_position(\"right\")\n", "axes[1].yaxis.tick_right()\n", "\n", 
"axes[3].yaxis.set_label_position(\"right\")\n", "axes[3].yaxis.tick_right()" ] }, { "cell_type": "markdown", "id": "d00fb82b", "metadata": {}, "source": [ "We will be using only the R, G, B data for now." ] }, { "cell_type": "code", "execution_count": null, "id": "82ef1547", "metadata": {}, "outputs": [], "source": [ "RGB = df[[\"r\", \"g\", \"b\"]]" ] }, { "cell_type": "code", "execution_count": null, "id": "9c3c0700", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import seaborn\n", "\n", "corr = RGB.corr()\n", "\n", "# Generate a mask for the upper triangle\n", "mask = np.zeros_like(corr, dtype=bool)\n", "mask[np.triu_indices_from(mask)] = True\n", "\n", "fig, ax = plt.subplots()\n", "\n", "# Draw the heatmap with the mask and correct aspect ratio\n", "vmax = np.abs(corr.values[~mask]).max()\n", "seaborn.heatmap(\n", "    corr,\n", "    mask=mask,\n", "    cmap=plt.cm.PuOr,\n", "    vmin=-vmax,\n", "    vmax=vmax,\n", "    square=True,\n", "    linecolor=\"lightgray\",\n", "    linewidths=1,\n", "    ax=ax,\n", ")\n", "\n", "# Recent seaborn versions invert the y-axis, so row k is centered at y = k + 0.5.\n", "# Names go on the (masked) diagonal and correlation values on the upper triangle.\n", "for k in range(len(corr)):\n", "    ax.text(\n", "        k + 0.5,\n", "        k + 0.5,\n", "        corr.columns[k],\n", "        ha=\"center\",\n", "        va=\"center\",\n", "        rotation=45,\n", "    )\n", "    for j in range(k + 1, len(corr)):\n", "        s = \"{:.3f}\".format(corr.values[k, j])\n", "        ax.text(j + 0.5, k + 0.5, s, ha=\"center\", va=\"center\")\n", "ax.axis(\"off\")" ] }, { "cell_type": "markdown", "id": "614f713c", "metadata": {}, "source": [ "The first step in most ML techniques is to standardize the data. We do not want high-variance features to bias our model." 
] }, { "cell_type": "code", "execution_count": null, "id": "9dd44aa7", "metadata": {}, "outputs": [], "source": [ "def z_score(x):\n", "    return (x - x.mean()) / x.std()\n", "\n", "\n", "zs = RGB.apply(z_score).T\n", "\n", "zs.std(axis=1)  # Should be 1" ] }, { "cell_type": "code", "execution_count": null, "id": "9a5930af", "metadata": {}, "outputs": [], "source": [ "zs.mean(axis=1)  # Should be zero" ] }, { "cell_type": "code", "execution_count": null, "id": "82f65e58", "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pca = PCA(n_components=None)\n", "pca.fit(zs)" ] }, { "cell_type": "markdown", "id": "41df5061", "metadata": {}, "source": [ "The pca object, or fitted model, was designed before pandas existed and is based on numpy arrays. We can do better nowadays and add meaningful labels to it." ] }, { "cell_type": "code", "execution_count": null, "id": "1f8a6e3a", "metadata": {}, "outputs": [], "source": [ "loadings = pd.DataFrame(pca.components_.T)\n", "# Rows are the pellets (one per image); columns are the principal components.\n", "loadings.index = zs.columns\n", "loadings.columns = [\"PC %s\" % pc for pc in loadings.columns + 1]\n", "loadings" ] }, { "cell_type": "code", "execution_count": null, "id": "78ce6ab9", "metadata": {}, "outputs": [], "source": [ "PCs = np.dot(loadings.values.T, RGB)" ] }, { "cell_type": "code", "execution_count": null, "id": "179811b8", "metadata": {}, "outputs": [], "source": [ "line = {\"linewidth\": 1, \"linestyle\": \"--\", \"color\": \"k\"}\n", "marker = {\n", "    \"linestyle\": \"none\",\n", "    \"marker\": \"o\",\n", "    \"markersize\": 7,\n", "    \"color\": \"blue\",\n", "    \"alpha\": 0.5,\n", "}\n", "\n", "\n", "fig, ax = plt.subplots(figsize=(7, 2.75))\n", "ax.plot(PCs[0], PCs[1], label=\"Scores\", **marker)\n", "\n", "ax.set_xlabel(\"PC1\")\n", "ax.set_ylabel(\"PC2\")\n", "\n", "text = [ax.text(x, y, t) for x, y, t in zip(PCs[0], PCs[1] + 0.5, RGB.columns)]" ] }, { "cell_type": "code", "execution_count": null, "id": "67cbf486", 
"metadata": {}, "outputs": [], "source": [ "perc = pca.explained_variance_ratio_ * 100\n", "\n", "perc = pd.DataFrame(\n", "    perc,\n", "    columns=[\"Explained variance (%)\"],\n", "    index=[\"PC %s\" % pc for pc in np.arange(len(perc)) + 1],\n", ")\n", "ax = perc.plot(kind=\"bar\")" ] }, { "cell_type": "markdown", "id": "b0461500", "metadata": {}, "source": [ "The non-projected loadings plot can help us see if the data has some sort of aggregation that we can leverage." ] }, { "cell_type": "code", "execution_count": null, "id": "bbabd98c", "metadata": { "scrolled": false }, "outputs": [], "source": [ "common = {\"linestyle\": \"none\", \"markersize\": 7, \"alpha\": 0.5}\n", "\n", "markers = {\n", "    0: {\"color\": \"black\", \"marker\": \"o\", \"label\": \"no yellowing\"},\n", "    1: {\"color\": \"red\", \"marker\": \"^\", \"label\": \"low\"},\n", "    2: {\"color\": \"blue\", \"marker\": \"*\", \"label\": \"moderate\"},\n", "    3: {\"color\": \"khaki\", \"marker\": \"s\", \"label\": \"high\"},\n", "    4: {\"color\": \"darkgoldenrod\", \"marker\": \"d\", \"label\": \"very high\"},\n", "}\n", "\n", "fig, ax = plt.subplots(figsize=(7, 7))\n", "for x, y, idx in zip(loadings.iloc[:, 0], loadings.iloc[:, 1], df[\"yellowing index\"]):\n", "    ax.plot(x, y, **common, **markers[idx])\n", "\n", "ax.set_xlabel(\"non-projected PC1\")\n", "ax.set_ylabel(\"non-projected PC2\")\n", "ax.axis([-0.25, 0.25, -0.4, 0.4])\n", "\n", "# Trick to remove duplicate labels from the for-loop.\n", "handles, labels = ax.get_legend_handles_labels()\n", "by_label = dict(zip(labels, handles))\n", "ax.legend(by_label.values(), by_label.keys())" ] }, { "cell_type": "markdown", "id": "e33ab2a8", "metadata": {}, "source": [ "## Summary\n", "\n", "- PCA is probably the most robust and easiest to perform unsupervised ML technique (it has been a common technique in ocean sciences since before the ML hype);\n", "- We learned that a single RGB value does not have enough predictive 
power to be used alone, so we'll need at least a combination of Red and Green;\n", "- The loadings plot shows that the moderate and low yellowing classes overlap somewhat, which can be troublesome when using this model for predictions." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }