{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Citizen weather stations quality checks\n", "\n", "Citizen weather stations (CWS) provide an unprecedented potential for high-resolution climate modeling, especially in urban environments where the spatial density of stations can be very high {cite}`muller2015crowdsourcing,meier2017crowdsourcing`. To that end, Meteora features the `NetatmoClient`, which makes it easy to access [public data from Netatmo weather stations](https://weathermap.netatmo.com) (one of the major CWS providers).\n", "\n", "However, since there is no guarantee that the scientific guidelines to properly set up meteorological stations have been met for CWS, performing a quality control (QC) of the collected CWS data is a crucial prerequisite before any analysis can be conducted. To that end, Meteora features the `qc` module, which is based on Napoly et al., (2018) {cite}`napoly2018qc` and implements a set of statistically-based QC methods that control for common errors in CWS observations. Note that this QC methods are designed to detect stations with **suspicious temperature measurements and therefore may not be appropriate to use on other meteorological variables**." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import contextily as cx\n", "import geopandas as gpd\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "from meteora import clients, qc, settings, utils\n", "\n", "figwidth = plt.rcParams[\"figure.figsize\"][0]\n", "figheight = plt.rcParams[\"figure.figsize\"][1]" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "We will start by using the `NetatmoClient` to get CWS data for a bounding box in the Lavaux region of Switzerland. In order to use the `NetatmoClient`, we need to obtain a client ID and secret (see the detailed steps [in the dedicated section of the Meteora documentation](../supported-providers.html#netatmo))." ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "region = [6.6469, 46.4799, 6.7346, 46.5250]\n", "crs = \"epsg:4326\"\n", "client_id = os.getenv(\"NETATMO_CLIENT_ID\", default=\"\")\n", "client_secret = os.getenv(\"NETATMO_CLIENT_SECRET\", default=\"\")\n", "scale = \"1hour\" # default is \"30min\"\n", "\n", "# fallback to local files (to run notebooks programatically without interactive auth\n", "netatmo_ts_filepath = \"data/lavaux-netatmo-ts-df.csv\"\n", "netatmo_stations_filepath = \"data/lavaux-netatmo-cws.gpkg\"\n", "use_netatmo_files = not client_id or not client_secret" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "Given the client ID and secret, we can pass them to the instantiation of the `NetatmoClient`, which will then open a browser window asking us to authorize our application (that we used to get the client ID and secret) to access the data from the Netatmo weather stations. After accepting, we will be redirected to the page with the list of your accounts applications. At the end of the URL bar there will be an authorization code preceded by `code=`, which we will have to copy and enter into the input window that will appear when we execute the cell below:" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "if not use_netatmo_files:\n", " # query Netatmo data from the API\n", " client = clients.NetatmoClient(region, client_id, client_secret)\n", " stations_gdf = client.stations_gdf\n", "else:\n", " # fallback to a local file when interactive auth is not possible\n", " stations_gdf = gpd.read_file(netatmo_stations_filepath)\n", " if settings.STATIONS_ID_COL in stations_gdf.columns:\n", " stations_gdf = stations_gdf.set_index(settings.STATIONS_ID_COL)\n", " client = None" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "Like in any other Meteora client, we can start by plotting the station locations:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "ax = stations_gdf.plot()\n", "cx.add_basemap(ax=ax, crs=stations_gdf.crs, attribution=\"\")" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "*(C) OpenStreetMap contributors, Tiles style by Humanitarian OpenStreetMap Team hosted by OpenStreetMap France*\n", "\n", "## Mislocated stations\n", "\n", "Before retrieving any time series data, we can filter out some locations just based on their geolocation using the `meteora.qc.get_mislocated_stations_method`, which will return the list of station ids (index keys in `client.stations_gdf`) that are considered mislocated. As noted by Meier et al. (2017) {cite}`meier2017crowdsourcing`, many Netatmo stations show identical geolocation, which is likely because an inadequate set up \"led to automated location assignment based on the IP address of the wireless network\" {cite}`napoly2018qc` (page 4). Let us plot them in a map:" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "mislocated_stations = qc.get_mislocated_stations(stations_gdf[\"geometry\"])\n", "ax = stations_gdf.assign(\n", " label=[\n", " \"mislocated\" if i in mislocated_stations else \"correct\"\n", " for i in stations_gdf.index\n", " ]\n", ").plot(\"label\", legend=True)\n", "cx.add_basemap(ax, crs=stations_gdf.crs, attribution=\"\")" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "*(C) OpenStreetMap contributors, Tiles style by Humanitarian OpenStreetMap Team hosted by OpenStreetMap France*\n", "\n", "As we can see, some of the mislocated stations, but not all of them, are on Lake Geneva. The other stations on the lake are likely mislocated too and can be filtered out using an ad-hoc method.\n", "\n", "Let us now get a time series of data during a heatwave in August 2023, defined as at least 3 consecutive days with an average temperature over 25$^{\\circ}$C ([based on the heat level warning definitions by MeteoSwiss](https://www.meteoswiss.admin.ch/weather/weather-and-climate-from-a-to-z/heat-warnings.html)):" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "variables = [\"temperature\"]\n", "start = \"18-08-2023\"\n", "end = \"25-08-2023\"\n", "\n", "if use_netatmo_files:\n", " # query Netatmo data from the API\n", " long_ts_df = pd.read_csv(\n", " netatmo_ts_filepath,\n", " index_col=[settings.STATIONS_ID_COL, settings.TIME_COL],\n", " parse_dates=True,\n", " )\n", " long_ts_df = long_ts_df[variables]\n", "else:\n", " # fallback to a local file when interactive auth is not possible\n", " long_ts_df = client.get_ts_df(variables, start, end, scale=scale)" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "We can now plot the time series of CWS observations. Since there are 59 CWS in the study area (number of rows in `client.stations_gdf`), we can use the [`seaborn.lineplot`](https://seaborn.pydata.org/generated/seaborn.lineplot.html) function to plot the mean and 95% confidence intervals at each temporal cross-section:" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "ax = sns.lineplot(long_ts_df.reset_index(), x=\"time\", y=\"temperature\")\n", "ax.tick_params(axis=\"x\", labelrotation=45)" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "The `get_ts_df` method of all Meteora clients returns a [long-form time series data frames](data-structures.ipynb#long-data-frames), multi-indexed by stations (first-level) and time (second-level) and with variables as columns. However, the methods of the `qc` module expect a [wide-form time series data frame](data-structures.ipynb#wide-data-frames) instead, where each cell represents the temperature measurement at a given time (index) and station (columns). See the [\"Data structures\" section of the Meteora documentation](data-structures.ipynb) for more details.\n", "\n", "We can use the `meteora.utils.long_to_wide` function to transform a time series data frame from its long form to its wide form. Let us do that and start by filtering out the mislocated stations:" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "ts_df = utils.long_to_wide(long_ts_df)\n", "# discard mislocated stations\n", "ts_df = ts_df.drop(columns=mislocated_stations, errors=\"ignore\")" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "Note that we are using the `errors=\"ignore\"` keyword argument when discarding stations because some stations in `client.stations_gdf` may not have data for the requested period and thus may not appear in the time series data frames. The default `errors=\"raise\"` would instead raise a `KeyError`.\n", "\n", "## Unreliable stations\n", "\n", "After discarding the mislocated stations, we can continue with a very simple QC to filter out stations with not enough valid observations during the study period. This is done by the `meteora.qc.get_unreliable_stations` function, which besides the (wide) time series data frame, accepts an optional `unreliable_threshold` keyword argument (otherwise the default set in `meteora.settings.UNRELIABLE_THRESHOLD` is used). Let us thus discard stations that have more than 20% of missing observations for the study period:" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "unreliable_threshold = 0.2\n", "unreliable_stations = qc.get_unreliable_stations(\n", " ts_df, unreliable_threshold=unreliable_threshold\n", ")\n", "unreliable_stations" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "At any QC step we can use the `meteora.qc.comparison_lineplot` function to compare the time series of the discarded and kept stations. In this case, since there are only 2 unreliable stations, we will use the `individual_discard_lines=True` keyword argument (default `False`) to plot each unreliable station's line individually rather than plotting the mean plus confidence interval, which would not let us appreciate the periods with non-valid data:" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "qc.comparison_lineplot(\n", " ts_df,\n", " unreliable_stations,\n", " label_discarded=\"unreliable\",\n", " label_kept=\"reliable\",\n", " individual_discard_lines=True,\n", ")" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "Filtering out unreliable stations should be the first step of the time series-based QC, since otherwise QC functions may return misleading results:" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "reliable_ts_df = ts_df.drop(columns=unreliable_stations)" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "## Elevation adjustment (optional)\n", "\n", "Due to the [atmospheric lapse rate](https://en.wikipedia.org/wiki/Lapse_rate), stations at lower elevations are usually expected to be consistently warmer than its counterparts at higher elevations (except under, e.g., the occurrence of [inversion](https://en.wikipedia.org/wiki/Inversion_(meteorology))). Therefore, it is possible to adjust for the elevation the time series data frame - otherwise we may find, e.g., that a given station is consistently warmer because of an incorrect set up when in reality the difference is due to the elevation. This can be done using the `meteora.qc.elevation_adjustment` function, which besides the (wide) time series data frame, accepts an optional `atmospheric_lapse_rate` keyword argument (otherwise the default set in `meteora.settings.ATMOSPHERIC_LAPSE_RATE` is used):" ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "atmospheric_lapse_rate = 0.0065 # in C / m\n", "adj_ts_df = qc.elevation_adjustment(\n", " reliable_ts_df,\n", " stations_gdf[\"altitude\"],\n", " atmospheric_lapse_rate=atmospheric_lapse_rate,\n", ")" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "Let us now plot the difference between the average elevation-adjusted and unadjusted temperature at each station:" ] }, { "cell_type": "code", "execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "ax = stations_gdf.assign(diff=adj_ts_df.mean() - reliable_ts_df.mean()).plot(\n", " \"diff\",\n", " legend=True,\n", " legend_kwds={\n", " \"label\": \"$\\overline{T_{adj}} - \\overline{T} \\quad [\\circ C]$\",\n", " \"shrink\": 0.6,\n", " },\n", ")\n", "cx.add_basemap(ax, crs=stations_gdf.crs, attribution=\"\")\n", "stations_gdf[\"altitude\"].describe()[[\"mean\", \"min\", \"max\"]]" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "Stations at higher elevations are added about 1.5 $^{\\circ}$C whereas the ones at lower elevations are subtracted less than that (because of the smaller elevation difference between the mean and the minimum than between the maximum and the mean).\n", "\n", "## Oulier stations\n", "\n", "The `meteora.qc.get_outlier_stations` allows detecting when the measurements of a given station show suspicious deviations from a normal distribution. The function allow three optional keyword arguments, namely `low_alpha` and `high_alpha`, which are the respective values for the lower and upper tail that lead to a rejection of the null hypothesis (and that default to `meteora.settings.OUTLIER_LOW_ALPHA` and `meteora.settings.OUTLIER_LOW_ALPHA` respectively) as well as `station_outlier_threshold`, which is the maximum proportion of outlier measurements after which the respective station is flagged as faulty (defaults to `meteora.settings.station_outlier_threshold`).\n", "As noted by Meier et al., (2017) {cite}`meier2017crowdsourcing`, one of the main issues with Netatmo data are stations located in non-shaded areas, resulting in radiative errors, which is why the upper tail threshold is by default less permisive than its lower tail counterpart.\n", "\n", "We can detect outlier stations separately for the initial data frame with (i) all (but mislocated) stations, (ii) reliable stations and (ii) reliable stations with elevation adjustment:" ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "low_alpha = 0.01\n", "high_alpha = 0.95\n", "station_outlier_threshold = 0.2\n", "\n", "for _ts_df, label in zip(\n", " [ts_df, reliable_ts_df, adj_ts_df], [\"raw\", \"reliable\", \"elev. adjusted\"]\n", "):\n", " outlier_stations = qc.get_outlier_stations(\n", " _ts_df,\n", " low_alpha=low_alpha,\n", " high_alpha=high_alpha,\n", " station_outlier_threshold=station_outlier_threshold,\n", " )\n", " print(f\"n. outlier stations from {label}: {len(outlier_stations)}\")\n", "outlier_stations = qc.get_outlier_stations(reliable_ts_df)" ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "As we can see, initially 3 stations would be considered outliers, but after we exclude the unreliable stations, the number drops to 1 (with or without elevation adjustment). This highlights the importance of excluding unreliable stations first, otherwise we may wrongfully assume that the unreliable stations are systematically warmer or colder whereas in fact they likely do not have enough measurements to determine that.\n", "\n", "For the reminder of the notebook, let us consider the outlier stations from the unadjusted reliable stations. We can again use the `comparison_lineplot` method:" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "qc.comparison_lineplot(\n", " reliable_ts_df,\n", " outlier_stations,\n", " label_discarded=\"outlier\",\n", " label_kept=\"non-outlier\",\n", ")" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "Stations labeled as outliers clearly display an amplified temperature variation, especially for the maximum temperatures - in line with the radiative errors suggested by Meier et al. (2017) {cite}`meier2017crowdsourcing`.\n", "\n", "## Indoor stations\n", "\n", "Netatmo stations may also be set indoors, in which case their diurnal cycles are assumed to be less correlated to the one from outdoor stations {cite}`napoly2018qc`. To that end, we can use the `meteora.qc.get_indoor_stations` function, which accepts an optional keyword argument `station_indoor_corr_threshold`, so that stations showing Pearson correlation (with the overall station median distribution) lower than the provided value are flagged as indoors (which defaults to `meteora.settings.STATION_INDOOR_CORR_THRESHOLD`):" ] }, { "cell_type": "code", "execution_count": null, "id": "31", "metadata": {}, "outputs": [], "source": [ "station_indoor_corr_threshold = 0.9\n", "\n", "indoor_stations = qc.get_indoor_stations(\n", " reliable_ts_df, station_indoor_corr_threshold=station_indoor_corr_threshold\n", ")\n", "indoor_stations" ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "In this example, 7 stations are considered to be indoors. We can check whether any indoor station is also considered an outlier:" ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [], "source": [ "set(indoor_stations).intersection(set(outlier_stations))" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "which is not the case. We can also plot the time series of temperature measurements for indoor and outdoor stations separately:" ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [], "source": [ "qc.comparison_lineplot(\n", " reliable_ts_df,\n", " indoor_stations,\n", " label_discarded=\"indoor\",\n", " label_kept=\"outdoor\",\n", ")" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "To conclude the QC, we can combine all the QC filters and plot the kept and discarded stations:" ] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [], "source": [ "discarded_stations = list(set(outlier_stations) | set(indoor_stations))\n", "qc.comparison_lineplot(\n", " reliable_ts_df,\n", " list(set(outlier_stations) | set(indoor_stations)),\n", ")" ] }, { "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "## Comparison with official data\n", "\n", "Finally, we can compare the time series of temperature measurements of the CWS data (before and after QC) with its counterpart from official stations. To that end, we will get the data from [Agrometeo](https://www.agrometeo.ch) and [MeteoSwiss](https://www.meteoswiss.admin.ch):" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [], "source": [ "# official\n", "# need to specify the crs for agrometeo, otherwise it will be epsg:21781\n", "agrometeo_client = clients.AgrometeoClient(region, crs=crs)\n", "meteoswiss_client = clients.MeteoSwissClient(region)\n", "\n", "ax = pd.concat(\n", " [\n", " stations_gdf.assign(source=label).to_crs(crs)\n", " for label, stations_gdf in {\n", " \"Netatmo\": stations_gdf,\n", " \"Agrometeo\": agrometeo_client.stations_gdf,\n", " \"MeteoSwiss\": meteoswiss_client.stations_gdf,\n", " }.items()\n", " ]\n", ").plot(\"source\", legend=True)\n", "cx.add_basemap(ax, crs=crs, attribution=\"\")" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "There are only a few official stations in the study area compared with the high spatial density of Netatmo CWS. Let us now get the time series of temperature data for the study period:" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "# specify hourly `scale` for Agrometeo, otherwise is 10-minutes.\n", "agrometeo_ts_df = agrometeo_client.get_ts_df(variables, start, end, scale=\"hour\")\n", "meteoswiss_ts_df = meteoswiss_client.get_ts_df(variables, start, end)" ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "We can now plot the time series separately. To have more flexibility, we will use the long-form data frames (note that we thus need to exlude the mislocated stations again, which were already excluded from the wide-form data frame but not from the long-form one):" ] }, { "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [], "source": [ "all_discards = list(\n", " set(discarded_stations) | set(unreliable_stations) | set(mislocated_stations)\n", ")\n", "ax = sns.lineplot(\n", " pd.concat(\n", " [\n", " ts_df.assign(label=label)\n", " for ts_df, label in zip(\n", " [\n", " long_ts_df.drop(all_discards, errors=\"ignore\"),\n", " long_ts_df.loc[discarded_stations],\n", " pd.concat([agrometeo_ts_df, meteoswiss_ts_df]),\n", " ],\n", " [\"CWS (QC)\", \"CWS (Discarded)\", \"Official\"],\n", " )\n", " ]\n", " )\n", " .droplevel(\"station_id\")\n", " .reset_index(),\n", " x=\"time\",\n", " y=\"temperature\",\n", " hue=\"label\",\n", ")\n", "ax.tick_params(axis=\"x\", labelrotation=45)" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "We can observe how, unlike the discarded CWS, the CWS that pass QC have a very similar trend as the official stations. Although this is beyond the scope of this notebook, note that even after QC, CWS seem to be slightly warmer than the official stations, but this may be due to elevation differences as well as the urban heat island effect.\n", "\n", "## References\n", "\n", "```{bibliography}\n", ":filter: docname in docnames\n", "```" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }