{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Tutorial 4. Basic Plotting

\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Previous sections have focused on putting various simple types of data together\n", "in notebooks and deployed servers, but most people will want to include plots as\n", "well. In this section, we'll focus on one of the simplest (but still powerful)\n", "ways to get a plot.\n", "\n", "If you have tried to visualize a `pandas.DataFrame` before, then you have likely\n", "encountered the\n", "[Pandas .plot() API](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).\n", "This basic plotting interface uses [Matplotlib](http://matplotlib.org) to render\n", "static PNGs or SVGs in a Jupyter notebook using the`inline` backend (or\n", "interactive figures via `%matplotlib notebook` or `%matplotlib widget`) and for\n", "exporting from Python, with a command that can be as simple as `df.plot()` for a\n", "DataFrame with one or two columns.\n", "\n", "The Pandas .plot() API has emerged as a de-facto standard for high-level\n", "plotting APIs in Python, and is now supported by many different libraries that\n", "use other underlying plotting engines to provide additional power and\n", "flexibility. Thus learning this API allows you to access capabilities provided\n", "by a wide variety of underlying tools, with relatively little additional effort.\n", "The libraries currently supporting this API include:\n", "\n", "- [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)\n", " -- Matplotlib-based API included with Pandas. Static or interactive output in\n", " Jupyter notebooks.\n", "- [xarray](https://xarray.pydata.org/en/stable/plotting.html) --\n", " Matplotlib-based API included with xarray, based on pandas .plot API. Static\n", " or interactive output in Jupyter notebooks.\n", "- [hvPlot](https://hvplot.pyviz.org) -- HoloViews and Bokeh-based interactive\n", " plots for Pandas, GeoPandas, xarray, Dask, Intake, and Streamz data.\n", "- [Pandas Bokeh](https://github.com/PatrikHlobil/Pandas-Bokeh) -- Bokeh-based\n", " interactive plots, for Pandas, GeoPandas, and PySpark data.\n", "- [Cufflinks](https://github.com/santosjorge/cufflinks) -- Plotly-based\n", " interactive plots for Pandas data.\n", "- [Plotly Express](https://plotly.com/python/pandas-backend) --\n", " Plotly-Express-based interactive plots for Pandas data; only partial support\n", " for the .plot API keywords\n", "- [PdVega](https://altair-viz.github.io/pdvega) -- Vega-lite-based, JSON-encoded\n", " interactive plots for Pandas data.\n", "\n", "In this notebook we'll explore what is possible with the default `.plot` API and\n", "demonstrate the additional capabilities of `.hvplot`, using the same dataset. Of\n", "course, this particular dataset is just an example; the same approach can be\n", "used with just about any tabular dataset.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read in the data\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from load_data import *\n", "\n", "df = load_data()\n", "print(df.shape)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lat_min = df.latitude.min()\n", "lat_max = df.latitude.max()\n", "\n", "lon_min = df.longitude.min()\n", "lon_max = df.longitude.max()\n", "\n", "f\"Wave heights from {lat_min, lat_max} latitude to {lon_min, lon_max} longitude\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Pandas `.plot`\n", "\n", "The first thing that we'd like to do with this data is visualize the locations\n", "of every earthquake. So we would like to make a scatter or points plot where\n", "`x='longitude'` and `y='latitude'`.\n", "\n", "If you are familiar with the `pandas.plot` API, you might expect to execute\n", "`df.plot.scatter(x='longitude', y='latitude')`. Feel free to try this out in a\n", "new cell, but it will throw an error:\n", "`AttributeError: 'DataFrame' object has no attribute 'plot'`. In order to make\n", "the data more manageable for now, we'll briefly use just a fraction (1%) of it\n", "and call that `small_df`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "small_df = df.sample(frac=.1)\n", "small_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a smaller dataset with many fewer observations. We can use that to\n", "test out our visualizations before ramping back up to the full dataset.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df.plot.scatter(x=\"longitude\", y=\"latitude\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `.hvplot`\n", "\n", "As you can see above, the Pandas API gives you a usable plot very easily, where\n", "you can start to see the density of waves in the western hemisphere. You can\n", "make a very similar plot with the same arguments using hvplot.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hvplot.pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df.hvplot.scatter(x=\"longitude\", y=\"latitude\", alpha=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here unlike in the Pandas `.plot()` there is a default hover action on the\n", "datapoints to show the location values, and you can also pan and zoom to focus\n", "on any particular region of the data of interest.\n", "\n", "You might have noticed that many of the dots in the scatter that we've just\n", "created lie on top of one another. This is called\n", "[\"overplotting\"](http://datashader.org/user_guide/1_Plotting_Pitfalls.html#1.-Overplotting)\n", "and can be avoided in a variety of ways, such as by making the dots slightly\n", "transparent, or binning the data. These approaches have the downside of\n", "introducing bias because you need to choose the alpha or the edges of the bins,\n", "and in order to do that, you have to make assumptions about the data. For an\n", "initial exploration of a new dataset, it's much safer if you can just **_see_**\n", "the data, before you impose any assumptions about its form or structure.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Try changing the alpha (try .1) on the plot above to see the effect of this\n", "approach\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try creating a `hexbin` plot.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df.hvplot.hexbin(x=\"longitude\", y=\"latitude\", alpha=0.8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Datashader\n", "\n", "To avoid some of the problems of traditional scatter/point plots we can use\n", "[Datashader](datashader.org), which aggregates data into each pixel without any\n", "arbitrary parameter settings. In `hvplot` we can activate this capability by\n", "setting `datashade=True`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df.hvplot.scatter(\n", " x=\"longitude\", y=\"latitude\", datashade=True, dynspread=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you can see all of the rich detail in this set of thousands of wave\n", "heights. If you have a live Python process running, you can zoom in and see\n", "additional detail at each zoom level, without tuning any parameters or making\n", "any assumptions about the form or structure of the data. We'll come back to\n", "Datashader later, but for now the important thing to know about it is that it\n", "lets us work with arbitrarily large datasets in a web browser conveniently.\n", "\n", "Note that the `.hvplot()` API works here because unlike the other `.plot`\n", "libraries, `hvplot` doesn't just target Pandas objects. Instead hvplot can be\n", "used with:\n", "\n", "- Pandas : DataFrame, Series (columnar/tabular data)\n", "- xarray : Dataset, DataArray (labelled multidimensional arrays)\n", "- Dask : DataFrame, Series (distributed/out of core arrays and columnar data)\n", "- Streamz : DataFrame(s), Series(s) (streaming columnar data)\n", "- Intake : DataSource (data catalogues)\n", "- GeoPandas : GeoDataFrame (geometry data)\n", "- NetworkX : Graph (network graphs)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Select a subset of the data, e.g. only magitudes >5 and plot them with a\n", "different colormap (valid `cmap` values include 'viridis', 'Reds' and 'magma'):\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Solution
\n", "\n", "```python\n", "df[df.mag>5].hvplot.scatter(x='longitude', y='latitude', datashade=True, cmap='Reds')\n", "```\n", "\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Note on `points`\n", "\n", "As a final note, we should really use `hvplot.points` instead of\n", "`hvplot.scatter` in this instance. The former does not exist in the standard\n", "pandas `.plot` API which is why we have used `hvplot.scatter` up until now.\n", "\n", "The reason scatter is inappropriate is that it implies that the y-axis\n", "(latitude) is a _dependent variable_ with respect to the x-axis (latitude). In\n", "reality, this is not the case, as waves can happen anywhere on the Earth's\n", "two-dimensional surface. For this reason, it is best to use `hvplot.points` for\n", "wave locations, as will be explained further in the next notebook.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df.hvplot.points(x=\"longitude\", y=\"latitude\", datashade=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistical Plots\n", "\n", "Let's dive into some of the other capabilities of `.plot()` and `.hvplot()`,\n", "starting with the frequency of different wind gusts.\n", "\n", "As a first pass, we'll use a histogram first with `plot.hist` on the small data,\n", "then with `.hvplot.hist` on the full dataset.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df.plot.hist(y=\"gst\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly we can create a histogram of the whole dataset using hvplot:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.hvplot.hist(y=\"gst\", bin_range=(0, 10), bins=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Create a kernel density estimate (kde) plot of wind gust speed using the smaller\n", "`small_df`:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical variables\n", "\n", "Next we'll categorize the waves based on wind gusts. You can read about all the\n", "different variables available in this dataset\n", "[here](https://coastwatch.pfeg.noaa.gov/erddap/info/cwwcNDBCMet/index.html).\n", "\n", "First we'll use `pd.cut` to split the small_dataset into gust categories.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gust_bins = [-np.inf, 4, 8, np.inf]\n", "gust_names = [\"Tame\", \"Gusty\", \"Blown Away\"]\n", "gust_class_column = pd.cut(small_df[\"gst\"], gust_bins, labels=gust_names)\n", "\n", "small_df.insert(1, \"gust_class\", gust_class_column)\n", "# handy: in case we need to modify the \"gust_class\", we can drop it from our data frame in order to re-add it\n", "# small_df = small_df.drop(['gust_class'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use this new categorical variable to group our data. First we will\n", "overlay all our groups on the same plot using the `by` option:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df.hvplot.hist(y=\"wvht\", by=\"gust_class\", alpha=0.6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** Click on the legend to turn off certain categories and see what is\n", "behind them.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Add `subplots=True` and `width=300` to see the different classes side-by-side.\n", "The y-axis will be linked, so try zooming.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping\n", "\n", "To use a widget to toggle between classes, use the `groupby` option, here in a\n", "bivariate plot:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_df.hvplot.bivariate(x=\"wvht\", y=\"gst\", groupby=\"gust_class\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring further\n", "\n", "These visualizations just touch the surface of what is available from hvPlot. To\n", "see many more examples, study the [hvPlot website](https://hvplot.pyviz.org).\n", "The following section will focus on how to put these plots together once you\n", "have them, linking them to understand and show their relationships.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:root] *", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 4 }