{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Citizen weather stations quality checks\n",
    "\n",
    "Citizen weather stations (CWS) provide an unprecedented potential for high-resolution climate modeling, especially in urban environments where the spatial density of stations can be very high {cite}`muller2015crowdsourcing,meier2017crowdsourcing`. To that end, Meteora features the `NetatmoClient`, which makes it easy to access [public data from Netatmo weather stations](https://weathermap.netatmo.com) (one of the major CWS providers).\n",
    "\n",
    "Like other low-cost sensors, Netatmo stations can be subject to [radiative errors](radiation-bias-correction.ipynb) {cite}`meier2017crowdsourcing,ahmed2025comparison`. Beyond such measurement biases, CWS are also prone to station-placement issues: many Netatmo stations report an identical geolocation - likely because an inadequate set up led to an automated location assignment based on the IP address of the wireless network {cite}`meier2017crowdsourcing` - while others are inadvertently set up indoors {cite}`napoly2018qc`.\n",
    "\n",
    "Because of these issues, and since there is no guarantee that the scientific guidelines to properly set up meteorological stations have been met for CWS, performing a quality control (QC) of the collected CWS data is a crucial prerequisite before any analysis can be conducted. To that end, Meteora features the `qc` module, which is based on Napoly et al., (2018) {cite}`napoly2018qc` and implements a set of statistically-based QC methods that control for common errors in CWS observations. Note that these QC methods are designed to detect stations with **suspicious temperature measurements and therefore may not be appropriate to use on other meteorological variables**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import contextily as cx\n",
    "import geopandas as gpd\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "\n",
    "from meteora import clients, qc, settings, utils\n",
    "\n",
    "figwidth = plt.rcParams[\"figure.figsize\"][0]\n",
    "figheight = plt.rcParams[\"figure.figsize\"][1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "We will start by using the `NetatmoClient` to get CWS data for a bounding box in the Lavaux region of Switzerland. In order to use the `NetatmoClient`, we need to obtain a client ID and secret (see the detailed steps [in the dedicated section of the Meteora documentation](../supported-providers.html#netatmo))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {},
   "outputs": [],
   "source": [
    "region = [6.6469, 46.4799, 6.7346, 46.5250]\n",
    "crs = \"epsg:4326\"\n",
    "client_id = os.getenv(\"NETATMO_CLIENT_ID\", default=\"\")\n",
    "client_secret = os.getenv(\"NETATMO_CLIENT_SECRET\", default=\"\")\n",
    "scale = \"1hour\"  # default is \"30min\"\n",
    "\n",
    "# fallback to local files (to run notebooks programatically without interactive auth\n",
    "netatmo_ts_filepath = \"data/lavaux-netatmo-ts-df.csv\"\n",
    "netatmo_stations_filepath = \"data/lavaux-netatmo-cws.gpkg\"\n",
    "use_netatmo_files = not client_id or not client_secret"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "Given the client ID and secret, we can pass them to the instantiation of the `NetatmoClient`, which will then open a browser window asking us to authorize our application (that we used to get the client ID and secret) to access the data from the Netatmo weather stations. After accepting, we will be redirected to the page with the list of your accounts applications. At the end of the URL bar there will be an authorization code preceded by `code=`, which we will have to copy and enter into the input window that will appear when we execute the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5",
   "metadata": {},
   "outputs": [],
   "source": [
    "if not use_netatmo_files:\n",
    "    # query Netatmo data from the API\n",
    "    client = clients.NetatmoClient(region, client_id, client_secret)\n",
    "    stations_gdf = client.stations_gdf\n",
    "else:\n",
    "    # fallback to a local file when interactive auth is not possible\n",
    "    stations_gdf = gpd.read_file(netatmo_stations_filepath)\n",
    "    if settings.STATIONS_ID_COL in stations_gdf.columns:\n",
    "        stations_gdf = stations_gdf.set_index(settings.STATIONS_ID_COL)\n",
    "    client = None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "Like in any other Meteora client, we can start by plotting the station locations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = stations_gdf.plot()\n",
    "cx.add_basemap(ax=ax, crs=stations_gdf.crs, attribution=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "*(C) OpenStreetMap contributors, Tiles style by Humanitarian OpenStreetMap Team hosted by OpenStreetMap France*\n",
    "\n",
    "Let us now get a time series of data during a heatwave in August 2023, defined as at least 3 consecutive days with an average temperature over 25$^{\\circ}$C ([based on the heat level warning definitions by MeteoSwiss](https://www.meteoswiss.admin.ch/weather/weather-and-climate-from-a-to-z/heat-warnings.html)):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "variables = [\"temperature\"]\n",
    "start = \"18-08-2023\"\n",
    "end = \"25-08-2023\"\n",
    "\n",
    "if use_netatmo_files:\n",
    "    # fallback to a local file when interactive auth is not possible\n",
    "    long_ts_df = pd.read_csv(\n",
    "        netatmo_ts_filepath,\n",
    "        index_col=[settings.STATIONS_ID_COL, settings.TIME_COL],\n",
    "        parse_dates=True,\n",
    "    )\n",
    "    long_ts_df = long_ts_df[variables]\n",
    "else:\n",
    "    # query Netatmo data from the API\n",
    "    long_ts_df = client.get_ts_df(variables, start, end, scale=scale)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "The `get_ts_df` method of all Meteora clients returns a [long-form time series data frame](data-structures.ipynb#long-data-frames), multi-indexed by stations (first level) and time (second level) and with variables as columns. However, the `qc` module expects a [wide-form time series data frame](data-structures.ipynb#wide-data-frames) instead, where each cell represents the temperature measurement at a given time (index) and station (columns). See the [\"Data structures\" section of the Meteora documentation](data-structures.ipynb) for more details. We can use the `meteora.utils.long_to_wide` function to convert it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "ts_df = utils.long_to_wide(long_ts_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "Since there are several dozen CWS in the study area, we can use the [`seaborn.lineplot`](https://seaborn.pydata.org/generated/seaborn.lineplot.html) function to plot the mean and 95% confidence interval at each temporal cross-section:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.lineplot(long_ts_df.reset_index(), x=\"time\", y=\"temperature\")\n",
    "ax.tick_params(axis=\"x\", labelrotation=45)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "## Quality control pipeline\n",
    "\n",
    "Rather than applying each quality check by hand, Meteora bundles them into `meteora.qc.QCPipeline`, which in line with the approach of [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html#pipeline), chains an ordered sequence of QC *steps* and applies them to a (wide) time series data frame.\n",
    "\n",
    "Based on the literature `meier2017crowdsourcing,napoly2018qc` will:\n",
    "\n",
    "1. flag mislocated stations, likely due to an incorrect set up that led to automatic location assignment based on the IP address of the wireless network;\n",
    "2. flag unreliable stations, with a high proportion (customizable) of non-valid measurements;\n",
    "3. adjust temperature measurements based on station elevation to account for the [atmospheric lapse rate](https://en.wikipedia.org/wiki/Lapse_rate);\n",
    "4. flag outlier stations, which show a high proportion of measurements that largely deviate from a normal distribution (especially in the upper tail due to radiative errors);\n",
    "5. flag indoor stations, whose measurements show low correlation with the spatial median.\n",
    "\n",
    "See the section \"The individual QC steps\" for more details on each QC step. Let us now apply this to our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "elevation_col = \"altitude\"  # required to enable elevation adjustment\n",
    "pipeline = qc.QCPipeline(\n",
    "    stations=stations_gdf,\n",
    "    elevation=elevation_col,\n",
    ")\n",
    "qc_ts_df = pipeline.apply(ts_df)\n",
    "print(f\"{ts_df.shape[1]} stations before QC, {qc_ts_df.shape[1]} after\")\n",
    "pipeline.discarded_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "Calling `apply` returns the quality-controlled data frame (with the flagged stations dropped) and records the stations discarded by each step in the `discarded_` attribute (keyed by the step/function name).\n",
    "\n",
    "## The individual QC steps\n",
    "\n",
    "The pipeline above ran the default sequence of steps - `meteora.settings.DEFAULT_QC_STEPS` - which we now go through one by one. For each step we explain what it does, which parameter (and `meteora.settings` default) controls it, and use the stations it flagged - readily available in `pipeline.discarded_` - to visualize its effect. Note that each built-in step is also exposed as a standalone function in the `qc` module, should one prefer to run it on its own.\n",
    "\n",
    "Several of the steps below are illustrated by comparing the time series of the discarded and kept stations. Let us first define a small `comparison_lineplot` helper:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "def comparison_lineplot(\n",
    "    ts_df,\n",
    "    discard_stations,\n",
    "    *,\n",
    "    label_discarded=\"discarded\",\n",
    "    label_kept=\"kept\",\n",
    "    individual_discard_lines=False,\n",
    "    ax=None,\n",
    "):\n",
    "    \"\"\"Plot the time series of the discarded and kept stations separately.\"\"\"\n",
    "    data = pd.concat(\n",
    "        [\n",
    "            _ts_df.reset_index()\n",
    "            .melt(value_name=\"temperature\", id_vars=\"time\")\n",
    "            .assign(label=label)\n",
    "            for _ts_df, label in zip(\n",
    "                [\n",
    "                    ts_df[discard_stations],\n",
    "                    ts_df.drop(columns=discard_stations, errors=\"ignore\"),\n",
    "                ],\n",
    "                [label_discarded, label_kept],\n",
    "            )\n",
    "        ],\n",
    "        ignore_index=True,\n",
    "    )\n",
    "    if ax is None:\n",
    "        _, ax = plt.subplots()\n",
    "    if individual_discard_lines:\n",
    "        # avoid warnings about matplotlib converters\n",
    "        pd.plotting.register_matplotlib_converters()\n",
    "        sns.lineplot(\n",
    "            data[data[\"label\"] == label_kept],\n",
    "            x=\"time\",\n",
    "            y=\"temperature\",\n",
    "            hue=\"label\",\n",
    "            ax=ax,\n",
    "        )\n",
    "        data[data[\"label\"] == label_discarded].set_index(\"time\").rename(\n",
    "            columns={\"temperature\": label_discarded}\n",
    "        ).plot(ax=ax)\n",
    "    else:\n",
    "        sns.lineplot(data, x=\"time\", y=\"temperature\", hue=\"label\", ax=ax)\n",
    "        ax.tick_params(axis=\"x\", labelrotation=45)\n",
    "    return ax"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "### Mislocated stations\n",
    "\n",
    "The first step, `\"flag_mislocated\"`, filters stations purely from their geolocation, flagging those that share an identical location. As noted by Meier et al. (2017) {cite}`meier2017crowdsourcing`, many Netatmo stations show identical geolocation, which is likely because an inadequate set up \"led to automated location assignment based on the IP address of the wireless network\" {cite}`napoly2018qc` (page 4). This step takes no parameters (it is purely geometric) and is skipped when no `stations` are provided. Let us plot the flagged stations on a map:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19",
   "metadata": {},
   "outputs": [],
   "source": [
    "mislocated_stations = pipeline.discarded_[\"flag_mislocated\"]\n",
    "ax = stations_gdf.assign(\n",
    "    label=[\n",
    "        \"mislocated\" if i in mislocated_stations else \"correct\"\n",
    "        for i in stations_gdf.index\n",
    "    ]\n",
    ").plot(\"label\", legend=True)\n",
    "cx.add_basemap(ax, crs=stations_gdf.crs, attribution=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20",
   "metadata": {},
   "source": [
    "*(C) OpenStreetMap contributors, Tiles style by Humanitarian OpenStreetMap Team hosted by OpenStreetMap France*\n",
    "\n",
    "As we can see, some of the mislocated stations, but not all of them, are on Lake Geneva. The other stations on the lake are likely mislocated too and can be filtered out using an ad-hoc method.\n",
    "\n",
    "### Unreliable stations\n",
    "\n",
    "The `\"flag_unreliable\"` step filters out stations with too few valid observations during the study period. The maximum proportion of missing observations is set by the `unreliable_threshold` keyword argument (defaulting to `meteora.settings.UNRELIABLE_THRESHOLD`). Since there are only a few such stations, we use `individual_discard_lines=True` so that each is drawn as its own line (rather than a mean plus confidence interval), which lets us appreciate the periods with non-valid data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21",
   "metadata": {},
   "outputs": [],
   "source": [
    "comparison_lineplot(\n",
    "    ts_df,\n",
    "    pipeline.discarded_[\"flag_unreliable\"],\n",
    "    label_discarded=\"unreliable\",\n",
    "    label_kept=\"reliable\",\n",
    "    individual_discard_lines=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22",
   "metadata": {},
   "source": [
    "### Elevation adjustment\n",
    "\n",
    "Due to the [atmospheric lapse rate](https://en.wikipedia.org/wiki/Lapse_rate), stations at lower elevations are usually expected to be consistently warmer than their counterparts at higher elevations (except under, e.g., the occurrence of [inversion](https://en.wikipedia.org/wiki/Inversion_(meteorology))). The `\"adjust_elevation\"` step therefore *adjusts* the time series rather than discarding stations - otherwise we may find, e.g., that a given station is consistently warmer because of an incorrect set up when in reality the difference is due to the elevation. The adjustment uses the per-station elevation passed to the step (`station_elevation_ser`) and the `atmospheric_lapse_rate` keyword argument (defaulting to `meteora.settings.ATMOSPHERIC_LAPSE_RATE`). Being a transform, this step discards no stations (its entry in `discarded_` is empty); to illustrate its effect we can apply `meteora.qc.adjust_elevation` directly and map the difference between the average adjusted and unadjusted temperature at each station:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23",
   "metadata": {},
   "outputs": [],
   "source": [
    "adj_ts_df, _ = qc.adjust_elevation(\n",
    "    ts_df,\n",
    "    station_elevation_ser=stations_gdf[\"altitude\"],\n",
    ")\n",
    "ax = stations_gdf.assign(diff=adj_ts_df.mean() - ts_df.mean()).plot(\n",
    "    \"diff\",\n",
    "    legend=True,\n",
    "    legend_kwds={\n",
    "        \"label\": \"$\\\\overline{T_{adj}} - \\\\overline{T} \\\\quad [\\\\circ C]$\",\n",
    "        \"shrink\": 0.6,\n",
    "    },\n",
    ")\n",
    "cx.add_basemap(ax, crs=stations_gdf.crs, attribution=\"\")\n",
    "stations_gdf[\"altitude\"].describe()[[\"mean\", \"min\", \"max\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24",
   "metadata": {},
   "source": [
    "*(C) OpenStreetMap contributors, Tiles style by Humanitarian OpenStreetMap Team hosted by OpenStreetMap France*\n",
    "\n",
    "Stations at higher elevations are added about 1.5 $^{\\circ}$C whereas the ones at lower elevations are subtracted less than that (because of the smaller elevation difference between the mean and the minimum than between the maximum and the mean).\n",
    "\n",
    "### Outlier stations\n",
    "\n",
    "The `\"flag_outliers\"` step detects stations whose measurements show suspicious deviations from a normal distribution (based on a modified z-score using a robust Qn variance estimator). It accepts the `low_alpha` and `high_alpha` keyword arguments - the lower- and upper-tail significance levels that lead to flagging a measurement as an outlier (defaulting to `meteora.settings.OUTLIER_LOW_ALPHA` and `meteora.settings.OUTLIER_HIGH_ALPHA`) - and `station_outlier_threshold`, the maximum proportion of outlier measurements after which the station is flagged (defaulting to `meteora.settings.STATION_OUTLIER_THRESHOLD`). As noted by Meier et al. (2017) {cite}`meier2017crowdsourcing`, one of the main issues with Netatmo data are stations located in non-shaded areas, resulting in radiative errors, which is why the upper-tail threshold is by default less permissive than its lower-tail counterpart:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25",
   "metadata": {},
   "outputs": [],
   "source": [
    "comparison_lineplot(\n",
    "    ts_df,\n",
    "    pipeline.discarded_[\"flag_outliers\"],\n",
    "    label_discarded=\"outlier\",\n",
    "    label_kept=\"non-outlier\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "Stations labeled as outliers clearly display an amplified temperature variation, especially for the maximum temperatures - in line with the radiative errors suggested by Meier et al. (2017) {cite}`meier2017crowdsourcing`.\n",
    "\n",
    "### Indoor stations\n",
    "\n",
    "Netatmo stations may also be set indoors, in which case their diurnal cycles are assumed to be less correlated with the one from outdoor stations {cite}`napoly2018qc`. The `\"flag_indoor\"` step flags stations whose Pearson correlation with the overall station median is below the `station_indoor_corr_threshold` keyword argument (defaulting to `meteora.settings.STATION_INDOOR_CORR_THRESHOLD`). We can plot the time series of temperature measurements for indoor and outdoor stations separately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27",
   "metadata": {},
   "outputs": [],
   "source": [
    "comparison_lineplot(\n",
    "    ts_df,\n",
    "    pipeline.discarded_[\"flag_indoor\"],\n",
    "    label_discarded=\"indoor\",\n",
    "    label_kept=\"outdoor\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "## Customizing the pipeline\n",
    "\n",
    "The default pipeline can be customized in several ways: running it without the station metadata (skipping the steps that need it), tuning each step's parameters through `step_kwargs`, inserting *transforming* steps that modify the measurements instead of discarding stations (e.g. `mask_outliers`), adding built-in steps that are not part of the default sequence (e.g. the spatial buddy check), and even plugging in our own callables. The following sections cover each of these in turn. In every case a step is given either as the name of a built-in `meteora.qc` function or as any callable matching the signature `step(ts_df, **kwargs) -> (ts_df, discarded_station_ids)`.\n",
    "\n",
    "### Running without station metadata\n",
    "\n",
    "The pipeline does not require the station locations or elevation: when they are not provided, the steps that need them are simply skipped. Running the default pipeline with no arguments therefore omits the mislocated-station flagging and the elevation adjustment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = qc.QCPipeline()\n",
    "qc_ts_df = pipeline.apply(ts_df)\n",
    "print(f\"{ts_df.shape[1]} stations before QC, {qc_ts_df.shape[1]} after\")\n",
    "pipeline.discarded_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {},
   "source": [
    "### Tuning step parameters\n",
    "\n",
    "Each step's parameters are supplied through the `step_kwargs` list, one dict per step and positionally matched to `steps`. Any step whose entry is omitted - by not passing `step_kwargs` at all - or left empty (`{}`) falls back to the parameter defaults defined in `meteora.settings`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31",
   "metadata": {},
   "outputs": [],
   "source": [
    "unreliable_threshold = 0.2\n",
    "atmospheric_lapse_rate = 0.0065  # in C / m\n",
    "low_alpha = 0.01\n",
    "high_alpha = 0.95\n",
    "station_outlier_threshold = 0.2\n",
    "station_indoor_corr_threshold = 0.9\n",
    "\n",
    "# each step receives its own keyword-argument dict, positionally matched to `steps`;\n",
    "# the per-station elevation series is passed to the `adjust_elevation` step\n",
    "default_step_kwargs = [\n",
    "    {},  # flag_mislocated (purely geometric)\n",
    "    {\"unreliable_threshold\": unreliable_threshold},\n",
    "    {\n",
    "        \"atmospheric_lapse_rate\": atmospheric_lapse_rate,\n",
    "    },\n",
    "    {\n",
    "        \"low_alpha\": low_alpha,\n",
    "        \"high_alpha\": high_alpha,\n",
    "        \"station_outlier_threshold\": station_outlier_threshold,\n",
    "    },\n",
    "    {\"station_indoor_corr_threshold\": station_indoor_corr_threshold},\n",
    "]\n",
    "pipeline = qc.QCPipeline(\n",
    "    stations=stations_gdf,\n",
    "    elevation=\"altitude\",\n",
    "    steps=settings.DEFAULT_QC_STEPS,\n",
    "    step_kwargs=default_step_kwargs,\n",
    ")\n",
    "_ = pipeline.apply(ts_df)\n",
    "pipeline.discarded_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32",
   "metadata": {},
   "source": [
    "### Flagging vs. transforming steps\n",
    "\n",
    "There are two main types of QC steps depending on how `apply` handles their output:\n",
    "\n",
    "- **Flagging** steps *discard whole stations*: they return the list of suspicious station ids - recorded in `discarded_` - and leave the measurements untouched. All the `flag_*` steps (`flag_mislocated`, `flag_unreliable`, `flag_outliers`, `flag_indoor`, `flag_buddies`) are of this kind.\n",
    "- **Transforming** steps instead *modify the measurements* and discard no station (their `discarded_` entry is always empty). `adjust_elevation` is one: it shifts each station's temperature by the lapse-rate correction rather than flagging anyone - which is why, although it belongs to the default sequence, it never contributes any discarded station.\n",
    "\n",
    "A second transforming step, deliberately left out of `meteora.settings.DEFAULT_QC_STEPS`, is `meteora.qc.mask_outliers`. It reuses the exact same robust (median/Qn) z-score as `flag_outliers`, but instead of discarding entire stations it replaces the individual outlier *measurements* with `NaN` - closer to the m2 level of CrowdQC+ {cite}`fenner2021crowdqc`, which flags values rather than whole stations. Being a transform, it returns the modified data frame and an empty discard list, and can be applied on its own or inserted anywhere in a pipeline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33",
   "metadata": {},
   "outputs": [],
   "source": [
    "masked_ts_df, discarded = qc.mask_outliers(ts_df)\n",
    "n_masked = int(masked_ts_df.isna().sum().sum() - ts_df.isna().sum().sum())\n",
    "print(f\"{n_masked} measurements masked as outliers; stations discarded: {discarded}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34",
   "metadata": {},
   "source": [
    "### Spatial buddy check\n",
    "\n",
    "The `flag_outlier` step compares each station against the *whole* network. We can complement it with a *spatial* buddy check, which instead compares each station only against its neighbours (its \"buddies\") within a given radius {cite}`baaserud2020titan,fenner2021crowdqc`. In meteora, this is implemented as the `flag_buddies` step, which is deliberately *not* part of `meteora.settings.DEFAULT_QC_STEPS` but is easy to add through the `steps` argument. Its neighbourhood is set by `buddy_radius` (in meters, defaulting to `meteora.settings.BUDDY_RADIUS`) and `buddy_min_n` (the minimum number of buddies required to evaluate a station, defaulting to `meteora.settings.BUDDY_MIN_N`); it reuses the outlier statistics (`low_alpha`, `high_alpha`, `station_outlier_threshold`, defaulting to the `meteora.settings.BUDDY_*` values) in its own `step_kwargs` dict. Let us add it to the default sequence and compare:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35",
   "metadata": {},
   "outputs": [],
   "source": [
    "buddy_radius = 2000  # in meters\n",
    "pipeline_buddy = qc.QCPipeline(\n",
    "    stations=stations_gdf,\n",
    "    steps=[*settings.DEFAULT_QC_STEPS, \"flag_buddies\"],\n",
    "    step_kwargs=[*default_step_kwargs, {\"buddy_radius\": buddy_radius}],\n",
    ")\n",
    "qc_buddy_ts_df = pipeline_buddy.apply(ts_df)\n",
    "print(\n",
    "    f\"{qc_ts_df.shape[1]} stations kept without the buddy check, \"\n",
    "    f\"{qc_buddy_ts_df.shape[1]} with it\"\n",
    ")\n",
    "pipeline_buddy.discarded_[\"flag_buddies\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36",
   "metadata": {},
   "source": [
    "The buddy check flags the additional stations above. Since it is a *spatial* check, we visualize its effect on a map: each flagged station is shown together with a dashed circle delimiting the `buddy_radius` neighbourhood it was compared against (note that some flagged stations may already have been caught by the earlier steps):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37",
   "metadata": {},
   "outputs": [],
   "source": [
    "buddy_flagged = pipeline_buddy.discarded_[\"flag_buddies\"]\n",
    "# project to LV95 so that `buddy_radius` (in meters) maps to true distances\n",
    "stations_lv95_gdf = stations_gdf.to_crs(\"epsg:2056\")\n",
    "ax = stations_lv95_gdf.assign(\n",
    "    label=[\n",
    "        \"buddy outlier\" if i in buddy_flagged else \"kept\"\n",
    "        for i in stations_lv95_gdf.index\n",
    "    ]\n",
    ").plot(\"label\", legend=True)\n",
    "# dashed circle delimiting each flagged station's `buddy_radius` neighbourhood\n",
    "stations_lv95_gdf.loc[buddy_flagged].buffer(buddy_radius).boundary.plot(\n",
    "    ax=ax, color=\"black\", linestyle=\"--\", linewidth=1\n",
    ")\n",
    "cx.add_basemap(ax, crs=stations_lv95_gdf.crs, attribution=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38",
   "metadata": {},
   "source": [
    "### Custom steps\n",
    "\n",
    "Each step can be given either as a **string** naming a built-in function of `meteora.qc` (as in the sequences above), or directly as **any callable** with the signature `step(ts_df, **kwargs) -> (ts_df, discarded_station_ids)`: station-discarding steps return `ts_df` unchanged together with the ids to drop, while transforms (e.g. the elevation adjustment) return a new `ts_df` and an empty list. We can therefore freely interleave built-in names and our own functions in the `steps` list, supplying each step's parameters through the positionally-matched `step_kwargs` list. As an example, the callable below flags *systematically sunlit* stations - those that overheat at the same hour every day, a recurring radiative bias that the per-measurement `flag_outliers` step can miss because it is too small a fraction of the record. It is a deliberately simplified version of the method of Bosch et al. (in prep.) {cite}`bosch2026sunlit`, dropped into the pipeline right after the `flag_outliers` step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39",
   "metadata": {},
   "outputs": [],
   "source": [
    "def sunlit_stations(ts_df, *, contrast_threshold=1.5):\n",
    "    \"\"\"Flag systematically sunlit stations: a recurring, time-locked warm peak.\n",
    "\n",
    "    Robust cross-station z-score (median centre, MAD scale), then for each station the\n",
    "    median-over-days hour-of-day profile; a station is flagged when its warmest hour\n",
    "    exceeds its own baseline (the median across hours) by more than\n",
    "    ``contrast_threshold``. Simplified from Bosch et al. (in prep.), whose full method\n",
    "    adds a permutation significance test with false-discovery-rate control.\n",
    "    \"\"\"\n",
    "    centered = ts_df.sub(ts_df.median(axis=\"columns\"), axis=\"index\")\n",
    "    z = centered.div(1.4826 * centered.abs().median(axis=\"columns\"), axis=\"index\")\n",
    "    profile = z.groupby(z.index.hour).median()  # hour-of-day median per station\n",
    "    contrast = profile.max() - profile.median()  # peak vs the station's own baseline\n",
    "    return ts_df, list(contrast.index[contrast > contrast_threshold])\n",
    "\n",
    "\n",
    "custom_pipeline = qc.QCPipeline(\n",
    "    stations=stations_gdf,\n",
    "    steps=[\n",
    "        \"flag_mislocated\",\n",
    "        \"flag_unreliable\",\n",
    "        \"adjust_elevation\",\n",
    "        \"flag_outliers\",\n",
    "        sunlit_stations,\n",
    "        \"flag_indoor\",\n",
    "    ],\n",
    ")\n",
    "_ = custom_pipeline.apply(ts_df)\n",
    "custom_pipeline.discarded_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "## Comparison with official data\n",
    "\n",
    "Finally, we can compare the time series of temperature measurements of the CWS data (before and after QC) with its counterpart from official stations. To that end, we will get the data from [Agrometeo](https://www.agrometeo.ch) and [MeteoSwiss](https://www.meteoswiss.admin.ch):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41",
   "metadata": {},
   "outputs": [],
   "source": [
    "# official\n",
    "# specify the crs for both clients (they default to other CRSs: epsg:21781 for\n",
    "# Agrometeo, the Swiss LV95 epsg:2056 for MeteoSwiss), otherwise the lon/lat `region`\n",
    "# bounding box is misinterpreted and no stations are matched\n",
    "agrometeo_client = clients.AgrometeoClient(region, crs=crs)\n",
    "meteoswiss_client = clients.MeteoSwissClient(region, crs=crs)\n",
    "\n",
    "ax = pd.concat(\n",
    "    [\n",
    "        stations_gdf.assign(source=label).to_crs(crs)\n",
    "        for label, stations_gdf in {\n",
    "            \"Netatmo\": stations_gdf,\n",
    "            \"Agrometeo\": agrometeo_client.stations_gdf,\n",
    "            \"MeteoSwiss\": meteoswiss_client.stations_gdf,\n",
    "        }.items()\n",
    "    ]\n",
    ").plot(\"source\", legend=True)\n",
    "cx.add_basemap(ax, crs=crs, attribution=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "There are only a few official stations in the study area compared with the high spatial density of Netatmo CWS. Let us now get the time series of temperature data for the study period:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43",
   "metadata": {},
   "outputs": [],
   "source": [
    "# specify hourly `scale` for Agrometeo, otherwise is 10-minutes.\n",
    "agrometeo_ts_df = agrometeo_client.get_ts_df(variables, start, end, scale=\"hour\")\n",
    "meteoswiss_ts_df = meteoswiss_client.get_ts_df(variables, start, end)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44",
   "metadata": {},
   "source": [
    "We can now plot the time series separately. To have more flexibility, we will use the long-form data frames (note that we thus need to exclude the discarded stations, which were dropped from the quality-controlled wide-form data frame but not from the long-form one):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_discards = pipeline.discarded_stations\n",
    "ax = sns.lineplot(\n",
    "    pd.concat(\n",
    "        [\n",
    "            ts_df.assign(label=label)\n",
    "            for ts_df, label in zip(\n",
    "                [\n",
    "                    long_ts_df.drop(all_discards, errors=\"ignore\"),\n",
    "                    long_ts_df[\n",
    "                        long_ts_df.index.get_level_values(\"station_id\").isin(\n",
    "                            all_discards\n",
    "                        )\n",
    "                    ],\n",
    "                    pd.concat([agrometeo_ts_df, meteoswiss_ts_df]),\n",
    "                ],\n",
    "                [\"CWS (QC)\", \"CWS (Discarded)\", \"Official\"],\n",
    "            )\n",
    "        ]\n",
    "    )\n",
    "    .droplevel(\"station_id\")\n",
    "    .reset_index(),\n",
    "    x=\"time\",\n",
    "    y=\"temperature\",\n",
    "    hue=\"label\",\n",
    ")\n",
    "ax.tick_params(axis=\"x\", labelrotation=45)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46",
   "metadata": {},
   "source": [
    "We can observe how, unlike the discarded CWS, the CWS that pass QC have a very similar trend as the official stations. Although this is beyond the scope of this notebook, note that even after QC, CWS seem to be slightly warmer than the official stations, but this may be due to elevation differences as well as the urban heat island effect.\n",
    "\n",
    "## References\n",
    "\n",
    "```{bibliography}\n",
    ":filter: docname in docnames\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all"
  },
  "kernelspec": {
   "display_name": "Python (Pixi)",
   "language": "python",
   "name": "pixi-kernel-python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.11"
  },
  "pixi-kernel": {
   "environment": "doc"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}