{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ea0786c2",
   "metadata": {},
   "source": [
    "# A quick introduction to machine learning\n",
    "\n",
    "\n",
    "These notebooks are a brief, hands-on, introduction to machine learning. We will revise some of the nomenclature, principles, and applications from Valentina's presentation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "987fda81",
   "metadata": {},
   "source": [
    "## What is Machine Learning (ML)?\n",
    "\n",
    "**Caveat:** I'm not a Staticician, Mathematicial, or ML expert. I only play one online. You can find my work on plays like \"How to get by with little to no data\" or \"Oh gosh, the PI wants some buzz-words in the report.\"\n",
    "\n",
    "What is ML (a personal point of view):\n",
    "\n",
    "* Focus on practical problems\n",
    "* Learn from the data and/or make predictions with it\n",
    "* Middle ground between statistics and optimization techniques\n",
    "* We have fast computers now, right? Let them do the work! ([Must see JVP talk on this](https://www.youtube.com/watch?app=desktop&v=Iq9DzN6mvYA).)\n",
    "\n",
    "**Oversimplified take:** Fit a model to data and use it to make predictions. (This is how scikit-learn designed its API BTW)."
   ]
  },
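  {
   "cell_type": "markdown",
   "id": "c4a1f0e2",
   "metadata": {},
   "source": [
    "A minimal sketch of that fit-then-predict pattern, assuming scikit-learn's `LinearRegression` and synthetic data (neither is part of the pellets dataset used later):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# Synthetic data (an assumption for illustration): y = 2x + 1, no noise.\n",
    "X = np.arange(10, dtype=float).reshape(-1, 1)  # features, shape (n_samples, 1)\n",
    "y = 2.0 * X.ravel() + 1.0  # labels\n",
    "\n",
    "model = LinearRegression()\n",
    "model.fit(X, y)  # training: estimate the parameters from the data\n",
    "\n",
    "# Use the fitted model to predict labels for features it has not seen.\n",
    "y_new = model.predict(np.array([[10.0], [11.0]]))\n",
    "```\n",
    "\n",
    "Most scikit-learn estimators follow the same two steps, `fit` and then `predict` (or `transform`)."
   ]
  },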
  {
   "cell_type": "markdown",
   "id": "b2cf395e",
   "metadata": {},
   "source": [
    "## Vocabulary \n",
    "\n",
    "- **model:** Mathematical equations used to approximate the data.\n",
    "\n",
    "\n",
    "- **parameters:** Variables that define the model and control its behavior.\n",
    "\n",
    "\n",
    "- **labels/classes:** Quantity/category that we want to predict\n",
    "\n",
    "\n",
    "- **features:** Observations (information) used as predictors of labels/classes.\n",
    "\n",
    "\n",
    "- **training:** Use **features** and known **labels/classes** to fit the **model** estimate its **parameters** (full circle, right? But why stop now?).\n",
    "\n",
    "\n",
    "- **hyper-parameters:** Variables that influence the training and the model but are not estimated during training.\n",
    "\n",
    "\n",
    "- **unsupervised learning:** Extract information and structure from the data without \"training\". We will see clustering, and Principal Component Analysis (PCA).\n",
    "\n",
    "\n",
    "- **supervised learning:** Fit a model using data to \"train\" it for making predictions. Examples: regression, classification, spam detection, recommendation systems. We'll see KNN, a classification type of ML."
   ]
  },
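  {
   "cell_type": "markdown",
   "id": "d7b39a15",
   "metadata": {},
   "source": [
    "To make the difference between **parameters** and **hyper-parameters** concrete, here is a small sketch. It assumes scikit-learn's `Ridge` regression and synthetic data, both chosen only for illustration:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import Ridge\n",
    "\n",
    "# Synthetic features and labels (an assumption for illustration).\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(50, 2))\n",
    "y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)\n",
    "\n",
    "# alpha is a hyper-parameter: we choose it before training.\n",
    "model = Ridge(alpha=1.0)\n",
    "model.fit(X, y)\n",
    "\n",
    "# coef_ and intercept_ are parameters: they are estimated during training.\n",
    "print(model.coef_, model.intercept_)\n",
    "```\n",
    "\n",
    "Changing `alpha` changes the fitted `coef_`, but `alpha` itself is never updated by `fit`."
   ]
  },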
  {
   "cell_type": "markdown",
   "id": "97fe276c",
   "metadata": {},
   "source": [
    "## Unsupervised: PCA\n",
    "\n",
    "The dataset we will use was consists of Red, Green, Blue (**parameters**) composites from plastic pellets photos. We Also have some extra information on the pellet size, shape, etc.\n",
    "\n",
    "The **labels** are the yellowing index. The goal is to predict the yellowing based the pellets image. broken down to its RGB info,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "048ff44d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"pellets-visual-classes-rgb.csv\", index_col=\"image\").dropna()\n",
    "df[\"yellowing index\"] = df[\"yellowing index\"].astype(int)\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "121bde11",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig, axes = plt.subplots(figsize=(11, 11), nrows=2, ncols=2)\n",
    "\n",
    "axes = axes.ravel()\n",
    "\n",
    "df[\"erosion\"].value_counts().plot.barh(ax=axes[0], title=\"erosion\")\n",
    "df[\"color\"].value_counts().plot.barh(ax=axes[1], title=\"color\")\n",
    "df[\"description\"].value_counts().plot.barh(ax=axes[2], title=\"description\")\n",
    "df[\"yellowing\"].value_counts().plot.barh(ax=axes[3], title=\"yellowing\")\n",
    "\n",
    "axes[1].yaxis.set_label_position(\"right\")\n",
    "axes[1].yaxis.tick_right()\n",
    "\n",
    "axes[3].yaxis.set_label_position(\"right\")\n",
    "axes[3].yaxis.tick_right()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d00fb82b",
   "metadata": {},
   "source": [
    "We will be using only the R, G, B data for now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82ef1547",
   "metadata": {},
   "outputs": [],
   "source": [
    "RGB = df[[\"r\", \"g\", \"b\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c3c0700",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import seaborn\n",
    "\n",
    "corr = RGB.corr()\n",
    "\n",
    "# Generate a mask for the upper triangle\n",
    "mask = np.zeros_like(corr, dtype=bool)\n",
    "mask[np.triu_indices_from(mask)] = True\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "\n",
    "# Draw the heatmap with the mask and correct aspect ratio\n",
    "vmax = np.abs(corr.values[~mask]).max()\n",
    "seaborn.heatmap(\n",
    "    corr,\n",
    "    mask=mask,\n",
    "    cmap=plt.cm.PuOr,\n",
    "    vmin=-vmax,\n",
    "    vmax=vmax,\n",
    "    square=True,\n",
    "    linecolor=\"lightgray\",\n",
    "    linewidths=1,\n",
    "    ax=ax,\n",
    ")\n",
    "\n",
    "for k in range(len(corr)):\n",
    "    ax.text(\n",
    "        k + 0.5,\n",
    "        len(corr) - (k + 0.5),\n",
    "        corr.columns[k],\n",
    "        ha=\"center\",\n",
    "        va=\"center\",\n",
    "        rotation=45,\n",
    "    )\n",
    "    for j in range(k + 1, len(corr)):\n",
    "        s = \"{:.3f}\".format(corr.values[k, j])\n",
    "        ax.text(j + 0.5, len(corr) - (k + 0.5), s, ha=\"center\", va=\"center\")\n",
    "ax.axis(\"off\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "614f713c",
   "metadata": {},
   "source": [
    "The first step to most ML techniques is to standardize the data. We do not want high variance data to bias our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9dd44aa7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def z_score(x):\n",
    "    return (x - x.mean()) / x.std()\n",
    "\n",
    "\n",
    "zs = RGB.apply(z_score).T\n",
    "\n",
    "zs.std(axis=1)  # Should be 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a5930af",
   "metadata": {},
   "outputs": [],
   "source": [
    "zs.mean(axis=1)  # Should be zero"
   ]
  },
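  {
   "cell_type": "markdown",
   "id": "e2f84c6b",
   "metadata": {},
   "source": [
    "scikit-learn ships its own standardization, `StandardScaler`. The sketch below compares it with a pandas z-score like the one above, using a synthetic table as a stand-in for the RGB data (an assumption for illustration). The two results differ slightly because pandas uses the sample standard deviation (`ddof=1`) while `StandardScaler` uses the population one (`ddof=0`):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Synthetic stand-in for the RGB table (an assumption for illustration).\n",
    "rng = np.random.default_rng(42)\n",
    "data = pd.DataFrame(\n",
    "    rng.integers(0, 256, size=(20, 3)).astype(float),\n",
    "    columns=[\"r\", \"g\", \"b\"],\n",
    ")\n",
    "\n",
    "# pandas: sample standard deviation (ddof=1) by default.\n",
    "z_pandas = (data - data.mean()) / data.std()\n",
    "\n",
    "# scikit-learn: population standard deviation (ddof=0).\n",
    "z_sklearn = StandardScaler().fit_transform(data)\n",
    "\n",
    "# The two differ only by the constant factor sqrt((n - 1) / n).\n",
    "n = len(data)\n",
    "ratio = np.sqrt((n - 1) / n)\n",
    "same = np.allclose(z_pandas.to_numpy(), z_sklearn * ratio)\n",
    "```\n",
    "\n",
    "For a large number of samples the factor is close to 1, so either choice works in practice."
   ]
  },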
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82f65e58",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "pca = PCA(n_components=None)\n",
    "pca.fit(zs)"
   ]
  },
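  {
   "cell_type": "markdown",
   "id": "f19d02a7",
   "metadata": {},
   "source": [
    "scikit-learn's `PCA.fit` expects an array of shape `(n_samples, n_features)`. A minimal sketch of that convention on synthetic data (the array below is an assumption, not the pellets table), showing the shapes of the scores and of `components_`:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "# Synthetic stand-in for standardized R, G, B values (an assumption).\n",
    "rng = np.random.default_rng(7)\n",
    "samples = rng.normal(size=(30, 3))  # rows: samples, columns: features\n",
    "\n",
    "demo_pca = PCA(n_components=2)\n",
    "scores = demo_pca.fit_transform(samples)  # one row of scores per sample\n",
    "\n",
    "# components_ has one row per component and one column per feature.\n",
    "print(scores.shape, demo_pca.components_.shape)\n",
    "```\n",
    "\n",
    "In the cell above the table is transposed before fitting, so each image plays the role of a feature and `components_` has one column per image."
   ]
  },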
  {
   "cell_type": "markdown",
   "id": "41df5061",
   "metadata": {},
   "source": [
    "The pca object, or fitted model, was designed before pandas existed and it is based on numpy arrays. We can do better nowadays and add meaniful labels to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f8a6e3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "loadings = pd.DataFrame(pca.components_.T)\n",
    "loadings.index = [\"PC %s\" % pc for pc in loadings.index + 1]\n",
    "loadings.columns = [\"TS %s\" % pc for pc in loadings.columns + 1]\n",
    "loadings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78ce6ab9",
   "metadata": {},
   "outputs": [],
   "source": [
    "PCs = np.dot(loadings.values.T, RGB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "179811b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "line = {\"linewidth\": 1, \"linestyle\": \"--\", \"color\": \"k\"}\n",
    "marker = {\n",
    "    \"linestyle\": \"none\",\n",
    "    \"marker\": \"o\",\n",
    "    \"markersize\": 7,\n",
    "    \"color\": \"blue\",\n",
    "    \"alpha\": 0.5,\n",
    "}\n",
    "\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(7, 2.75))\n",
    "ax.plot(PCs[0], PCs[1], label=\"Scores\", **marker)\n",
    "\n",
    "ax.set_xlabel(\"PC1\")\n",
    "ax.set_ylabel(\"PC2\")\n",
    "\n",
    "text = [ax.text(x, y, t) for x, y, t in zip(PCs[0], PCs[1] + 0.5, RGB.columns)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67cbf486",
   "metadata": {},
   "outputs": [],
   "source": [
    "perc = pca.explained_variance_ratio_ * 100\n",
    "\n",
    "perc = pd.DataFrame(\n",
    "    perc,\n",
    "    columns=[\"Percentage explained ratio\"],\n",
    "    index=[\"PC %s\" % pc for pc in np.arange(len(perc)) + 1],\n",
    ")\n",
    "ax = perc.plot(kind=\"bar\")"
   ]
  },
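  {
   "cell_type": "markdown",
   "id": "a85e3c94",
   "metadata": {},
   "source": [
    "Where do these percentages come from? The sketch below, on synthetic data (an assumption for illustration), recomputes them from a singular value decomposition of the centered data and also builds the cumulative share, which is handy when deciding how many components to keep:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "# Synthetic data (an assumption): 100 samples, 3 correlated features.\n",
    "rng = np.random.default_rng(1)\n",
    "base = rng.normal(size=(100, 1))\n",
    "X = np.hstack([base, 0.5 * base, -base]) + rng.normal(scale=0.3, size=(100, 3))\n",
    "\n",
    "demo_pca = PCA().fit(X)\n",
    "\n",
    "# The variance of each component is proportional to the squared\n",
    "# singular value of the centered data.\n",
    "Xc = X - X.mean(axis=0)\n",
    "s = np.linalg.svd(Xc, compute_uv=False)\n",
    "ratio = s**2 / np.sum(s**2)\n",
    "\n",
    "# Cumulative share of the variance captured by the first k components.\n",
    "cumulative = np.cumsum(demo_pca.explained_variance_ratio_)\n",
    "```\n",
    "\n",
    "Here `ratio` matches `demo_pca.explained_variance_ratio_`, and the last entry of `cumulative` is 1 because all components are kept."
   ]
  },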
  {
   "cell_type": "markdown",
   "id": "b0461500",
   "metadata": {},
   "source": [
    "The non-project loadings plot can help us see if the data has some sort of aggregation that we can leverage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bbabd98c",
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "common = {\"linestyle\": \"none\", \"markersize\": 7, \"alpha\": 0.5}\n",
    "\n",
    "markers = {\n",
    "    0: {\"color\": \"black\", \"marker\": \"o\", \"label\": \"no yellowing\"},\n",
    "    1: {\"color\": \"red\", \"marker\": \"^\", \"label\": \"low\"},\n",
    "    2: {\"color\": \"blue\", \"marker\": \"*\", \"label\": \"moderate\"},\n",
    "    3: {\"color\": \"khaki\", \"marker\": \"s\", \"label\": \"high\"},\n",
    "    4: {\"color\": \"darkgoldenrod\", \"marker\": \"d\", \"label\": \"very high\"},\n",
    "}\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(7, 7))\n",
    "for x, y, idx in zip(loadings.iloc[:, 0], loadings.iloc[:, 1], df[\"yellowing index\"]):\n",
    "    ax.plot(x, y, **common, **markers.get(idx))\n",
    "\n",
    "ax.set_xlabel(\"non-projected PC1\")\n",
    "ax.set_ylabel(\"non-projected PC2\")\n",
    "ax.axis([-1, 1, -1, 1])\n",
    "ax.axis([-0.25, 0.25, -0.4, 0.4])\n",
    "\n",
    "# Trick to remove duplicate labels from the for-loop.\n",
    "handles, labels = ax.get_legend_handles_labels()\n",
    "by_label = dict(zip(labels, handles))\n",
    "ax.legend(by_label.values(), by_label.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e33ab2a8",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "- PCA is probably be most robust, and easy to perform, non-supervised ML technique (it has been a common technique in ocean sciences since before the ML hype);\n",
    "- We learned that a single RGB value does not have enough predictive power to be used alone, we'll need at least a combination of Red and Green;\n",
    "- The loading plot show that the moderate and the low yellowing have some overlaps and that can be troublesome when using this model for predictions."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}