{ "cells": [ { "cell_type": "markdown", "id": "ea0786c2", "metadata": {}, "source": [ "# A quick introduction to machine learning\n", "\n", "\n", "These notebooks are a brief, hands-on introduction to machine learning. We will review some of the nomenclature, principles, and applications from Valentina's presentation." ] }, { "cell_type": "markdown", "id": "987fda81", "metadata": {}, "source": [ "## What is Machine Learning (ML)?\n", "\n", "**Caveat:** I'm not a Statistician, Mathematician, or ML expert. I only play one online. You can find my work on plays like \"How to get by with little to no data\" or \"Oh gosh, the PI wants some buzz-words in the report.\"\n", "\n", "What is ML (a personal point of view):\n", "\n", "* Focus on practical problems\n", "* Learn from the data and/or make predictions with it\n", "* Middle ground between statistics and optimization techniques\n", "* We have fast computers now, right? Let them do the work! ([Must-see JVP talk on this](https://www.youtube.com/watch?app=desktop&v=Iq9DzN6mvYA).)\n", "\n", "**Oversimplified take:** Fit a model to data and use it to make predictions. (This is how scikit-learn designed its API, BTW.)" ] }, { "cell_type": "markdown", "id": "b2cf395e", "metadata": {}, "source": [ "## Vocabulary\n", "\n", "- **model:** Mathematical equations used to approximate the data.\n", "\n", "\n", "- **parameters:** Variables that define the model and control its behavior.\n", "\n", "\n", "- **labels/classes:** Quantity/category that we want to predict.\n", "\n", "\n", "- **features:** Observations (information) used as predictors of labels/classes.\n", "\n", "\n", "- **training:** Use **features** and known **labels/classes** to fit the **model** and estimate its **parameters** (full circle, right? 
But why stop now?).\n", "\n", "\n", "- **hyper-parameters:** Variables that influence the training and the model but are not estimated during training.\n", "\n", "\n", "- **unsupervised learning:** Extract information and structure from the data without \"training\". We will see clustering and Principal Component Analysis (PCA).\n", "\n", "\n", "- **supervised learning:** Fit a model using data to \"train\" it for making predictions. Examples: regression, classification, spam detection, recommendation systems. We'll see k-nearest neighbors (KNN), a classification technique." ] }, { "cell_type": "markdown", "id": "97fe276c", "metadata": {}, "source": [ "## Unsupervised: PCA\n", "\n", "The dataset we will use consists of Red, Green, Blue (**features**) composites from photos of plastic pellets. We also have some extra information on the pellet size, shape, etc.\n", "\n", "The **labels** are the yellowing index. The goal is to predict the yellowing based on the pellet image, broken down to its RGB info." ] }, { "cell_type": "code", "execution_count": null, "id": "048ff44d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Load the pellet data, indexed by image name, and drop rows with missing values.\n", "df = pd.read_csv(\"pellets-visual-classes-rgb.csv\", index_col=\"image\").dropna()\n", "df[\"yellowing index\"] = df[\"yellowing index\"].astype(int)\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "121bde11", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "fig, axes = plt.subplots(figsize=(11, 11), nrows=2, ncols=2)\n", "\n", "axes = axes.ravel()\n", "\n", "# Bar charts with the counts of each categorical descriptor.\n", "df[\"erosion\"].value_counts().plot.barh(ax=axes[0], title=\"erosion\")\n", "df[\"color\"].value_counts().plot.barh(ax=axes[1], title=\"color\")\n", "df[\"description\"].value_counts().plot.barh(ax=axes[2], title=\"description\")\n", "df[\"yellowing\"].value_counts().plot.barh(ax=axes[3], title=\"yellowing\")\n", "\n", "# Move the y-axis ticks of the right-hand panels to the right side.\n", "axes[1].yaxis.set_label_position(\"right\")\n", "axes[1].yaxis.tick_right()\n", "\n", 
"axes[3].yaxis.set_label_position(\"right\")\n", "axes[3].yaxis.tick_right()" ] }, { "cell_type": "markdown", "id": "d00fb82b", "metadata": {}, "source": [ "We will be using only the R, G, B data for now." ] }, { "cell_type": "code", "execution_count": null, "id": "82ef1547", "metadata": {}, "outputs": [], "source": [ "RGB = df[[\"r\", \"g\", \"b\"]]" ] }, { "cell_type": "code", "execution_count": null, "id": "9c3c0700", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import seaborn\n", "\n", "corr = RGB.corr()\n", "\n", "# Generate a mask for the upper triangle\n", "mask = np.zeros_like(corr, dtype=bool)\n", "mask[np.triu_indices_from(mask)] = True\n", "\n", "fig, ax = plt.subplots()\n", "\n", "# Draw the heatmap with the mask and correct aspect ratio\n", "vmax = np.abs(corr.values[~mask]).max()\n", "seaborn.heatmap(\n", " corr,\n", " mask=mask,\n", " cmap=plt.cm.PuOr,\n", " vmin=-vmax,\n", " vmax=vmax,\n", " square=True,\n", " linecolor=\"lightgray\",\n", " linewidths=1,\n", " ax=ax,\n", ")\n", "\n", "for k in range(len(corr)):\n", " ax.text(\n", " k + 0.5,\n", " len(corr) - (k + 0.5),\n", " corr.columns[k],\n", " ha=\"center\",\n", " va=\"center\",\n", " rotation=45,\n", " )\n", " for j in range(k + 1, len(corr)):\n", " s = \"{:.3f}\".format(corr.values[k, j])\n", " ax.text(j + 0.5, len(corr) - (k + 0.5), s, ha=\"center\", va=\"center\")\n", "ax.axis(\"off\")" ] }, { "cell_type": "markdown", "id": "614f713c", "metadata": {}, "source": [ "The first step to most ML techniques is to standardize the data. We do not want high variance data to bias our model." 
] }, { "cell_type": "code", "execution_count": null, "id": "9dd44aa7", "metadata": {}, "outputs": [], "source": [ "def z_score(x):\n", "    return (x - x.mean()) / x.std()\n", "\n", "\n", "zs = RGB.apply(z_score).T\n", "\n", "zs.std(axis=1) # Should be 1" ] }, { "cell_type": "code", "execution_count": null, "id": "9a5930af", "metadata": {}, "outputs": [], "source": [ "zs.mean(axis=1) # Should be zero" ] }, { "cell_type": "code", "execution_count": null, "id": "82f65e58", "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pca = PCA(n_components=None)\n", "pca.fit(zs)" ] }, { "cell_type": "markdown", "id": "41df5061", "metadata": {}, "source": [ "The pca object, or fitted model, was designed before pandas existed and it is based on numpy arrays. 