{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "## Data structures for geospatial time series data\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import contextily as cx\n", "import matplotlib.pyplot as plt\n", "import osmnx as ox\n", "import seaborn as sns\n", "import xarray as xr\n", "\n", "from meteora import clients, utils" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "region = \"Switzerland\"\n", "start = \"2021-08-13\"\n", "end = \"2021-08-16\"\n", "variables = [\"temperature\", \"precipitation\", \"wind_speed\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "client = clients.METARASOSIEMClient(region)" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "ts_df = client.get_ts_df(variables, start=start, end=end)\n", "ts_df" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "There are different ways in which we can structure time series data frame shown above. The most common representation are the so-called long and wide table formats (see the [\"Data structures accepted by seaborn\"](https://seaborn.pydata.org/tutorial/data_structure.html) section of the seaborn documentation for an overview).\n", "\n", "### Long data frames\n", "\n", "The `get_ts_df` method returns a long data frame where the station and time are used as multi-index levels, so that we can easily select the time series of a specific station:" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "station_id = \"LSGG\"\n", "ts_df.loc[station_id]" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "The key advantage of a long data frame for time series data is that it is very flexible, i.e., we can combine stations with different time ranges and temporal resolutions into a single data frame structure. However, performing simple time series operations requires grouping by stations, which may not be the most convenient and straight-forward.\n", "\n", "### Wide data frames\n", "\n", "The same data can be represented as a wide data frame, where the index is the time and the variables and stations are the multi-level columns. We can convert the long data frame to a wide data frame using the `meteora.utils.long_to_wide` function:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "# in this case, equivalently:\n", "# wide_ts_df = ts_df.unstack(level=\"station\")\n", "wide_ts_df = utils.long_to_wide(ts_df)\n", "wide_ts_df" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "Accordingly, we can easily select the time series of a specific variable:" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "wide_ts_df[\"temperature\"]" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "Note that unlike in the long data frame, the records of all the stations and variables are indexed by a common and unique timestamp. Accordingly, a key advantage of the wide data frame is that time series operations are supported by pandas out-of-the-box. For example, we can easily select a time range and plot the temperature time series of all stations:" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "wide_ts_df.loc[\"2021-08-14 00:00:00\":\"2021-08-15 00:00:00\"][\"temperature\"].plot(\n", " legend=False\n", ")" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "We can also resample the time series to a different frequency, e.g., we can compute the hourly mean temperature time series of all stations:" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "wide_ts_df.resample(\"3h\")[\"temperature\"].mean().plot(legend=False)" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "The main disadvantage is the wide data form may not be appropriate to store raw observations from different sources at different time resolutions, since it can result in many nan values.\n", "\n", "### Example geospatial operations with long and wide data frames\n", "\n", "We have so far reviewed the advantages and disatvantages that the long and wide data frame forms feature when dealing with time series data. However, meteorological observations also have an important geospatial component which we have not addressed yet. While it is possible to add a \"geometry\" column to the long data frame, it would result in many repeated values (since we likely have many observations for each station). On the other hand, stations are featured as rows in the wide data frame, which makes adding their geolocation very inconvenient.\n", "\n", "Therefore, in most cases the best way to peform spatial operations on meteorological data is to use two separate data structures, i.e., one for the time series of measurements and another with the station locations (and potentially other attributes).\n", "As shown above, the station locations can be accessed in the `stations_gdf` property, which we can use to perform geospatial operations. For example, we can select stations within a buffer around a specific location:" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "query = \"Canton de Vaud\"\n", "buffer_dist = 20e3 # in meters\n", "\n", "extent_geom = (\n", " ox.projection.project_gdf(ox.geocode_to_gdf(query))\n", " .buffer(buffer_dist)\n", " .to_crs(client.stations_gdf.crs)\n", " .iloc[0]\n", ")\n", "station_ids = client.stations_gdf[client.stations_gdf.within(extent_geom)].index\n", "station_ids" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "We can use this information to select the time series of the stations within the buffer. For the long data frame (the default in Meteora), this is straightforward since the stations are the first-level index:" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "ts_df.loc[station_ids]" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "For the wide data frame, stations are in the second level of the columns, so we can either (i) swap the column levels and then simply pass `station_ids` to select the desired stations or (ii) use `.loc` at the second level of the columns:" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "# (i)\n", "# wide_ts_df.swaplevel(\"variable\", \"station\", axis=\"columns\")[station_ids]\n", "# (ii)\n", "wide_ts_df.loc[:, (slice(None), station_ids)]" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "Note that if we are interested in a single variable, filtering by stations becomes much simpler:" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "wide_ts_df[\"temperature\"][station_ids]" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "### Vector data cubes\n", "\n", "As reviewed above, long and wide data frame forms have its own strengths and weaknesses when dealing with time series data, yet none of the forms offers a single object to encapsulate both the temporal and geospatial aspects of meteorological stations' data. Luckily for us, there exists a data structure that fulfills our requirements - enter *vector data cubes*.\n", "\n", "A *vector data cube* is an n-D array with at least one dimension indexed by vector geometries {cite}`pebesma2022vdc`. In Python, this is represented as an [xarray](https://xarray.dev) Dataset (or DataArray) object with an indexed dimension with vector geometries set using [xvec](https://xvec.readthedocs.io). Given the time series data frame and station locations, we can get a vector data cube using the `meteora.utils.long_to_cube` function:" ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "ts_cube = utils.long_to_cube(ts_df, client.stations_gdf)\n", "ts_cube" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "We obtain an xarray Dataset with both the temporal and geospatial (respectively labeled as \"time\" and \"station_id\" following `ts_df`) coordinates and the three variables as data variables. Additionally, note in the \"Indexes\" section that the \"station\" index have the geospatial features enabled by xvec (note the `GeometryIndex` with the CRS).\n", "\n", "The major advantage of vector data cubes is that we can now natively perform both time series and geospatial operations using a single object:" ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "ts_cube[\"temperature\"].sel(geometry=extent_geom, method=\"intersects\").sel(\n", " time=slice(\"2021-08-13 12:00:00\", \"2021-08-14 12:00:00\")\n", ").resample(time=\"3h\").mean()" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "We can also use the xvec accessor to plot our results, both in terms of time series data:" ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "# assign_coords(station=ts_df.index.get_level_values(\"station\").unique())\n", "fig, ax = plt.subplots()\n", "ts_cube[\"temperature\"].plot.line(x=\"time\", hue=\"station_id\", add_legend=True, ax=ax)\n", "sns.move_legend(ax, loc=\"center left\", bbox_to_anchor=(1, 0.5))" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "or including geospatial information as well, e.g., stations colored by temperature averaged over a given period:" ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [], "source": [ "fig, ax = (\n", " ts_cube[\"temperature\"]\n", " .sel(geometry=extent_geom, method=\"intersects\")\n", " .sel(time=slice(\"2021-08-13 12:00:00\", \"2021-08-14 12:00:00\"))\n", " .mean(\"time\")\n", " .xvec.plot(cmap=\"coolwarm\", geometry=\"geometry\")\n", ")\n", "cx.add_basemap(ax, crs=ts_cube.geometry.crs, attribution=False)" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "*(C) OpenStreetMap contributors, Tiles style by Humanitarian OpenStreetMap Team hosted by OpenStreetMap France*\n", "\n", "### StationBench format\n", "\n", "The vector data cube is very similar to the ground truth station data format expected by [StationBench](https://github.com/juaAI/stationbench) to benchmark weather forecasts. The only differences are that instead of point geometry objects we have longitude/latitude coordinates (so geospatial operations are not enabled) and the variables use specific naming conventions. You can convert the long time series data frame to the expected xarray using the `meteora.utils.long_to_stationbench` function:" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "ts_ds = utils.long_to_stationbench(ts_df, client.stations_gdf)\n", "ts_ds" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "## Saving to disk\n", "\n", "There are several ways in which we can store the geospatial time series data that we deal with when using Meteora.\n", "\n", "### Using long or wide data frames\n", "\n", "When using long and wide data frames, the most straight-forward solution is to separately save the stations geo-data frame and time series data frame to disk, e.g., using a geospatial file format for the former and a pandas data frame compatible format (CSV, JSON...) for the latter:" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "ts_df.to_csv(\"ts-df.csv\")\n", "client.stations_gdf.to_file(\"stations.gpkg\")" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "Note that for long data frames, it is possible to save a single object by adding the station attributes as columns, however note that in most cases this will be highly inefficient and result in many repeated values.\n", "\n", "For large amounts of data, you may consider [tstore](https://github.com/ltelab/tstore), which is an interoperable specification based on [Apache Parquet](https://parquet.apache.org) and [GeoParquet](https://github.com/opengeospatial/geoparquet). (note that the API is still experimental). The [first tutorial](https://tstore.readthedocs.io/en/latest/tutorials/01-your-first-tstore.html) illustrates different ways in which long data frames from Meteora can be efficiently$^1$ stored while keeping the geospatial information.\n", "\n", "### Using vector data cubes\n", "\n", "There exist several ways of saving vector data cubes to disk, which include (see the [related section of the xvec documentation](https://xvec.readthedocs.io/en/stable/io.html) for more details):\n", "\n", " - *geospatial file formats*, which is essentially equivalent to transforming the vector data cube to a long data frame with the station attributes (including location), and thus has the exact same shortcomings in terms of the high degree of repetition.\n", " - *binary serialization formats* such as pickle and joblib, which have the advantage of allowing compression but may only work when read in a similar environment with similar package versions (therefore with restricted interoperability).\n", " - *CF conventions with netCDF or zarr*, which encode the geospatial information so that it is compatible with the xarray's formats such as netCDF and zarr.\n", "\n", "Overall, the latter option is likely the most convenient. Let us try it (note that in order to run the cells below, we need to [install zarr](https://github.com/zarr-developers/zarr-python) which is not a Meteora - nor xvec - dependency):" ] }, { "cell_type": "code", "execution_count": null, "id": "36", "metadata": {}, "outputs": [], "source": [ "encoded = ts_cube.xvec.encode_cf()\n", "# encoded.to_netcdf(\"geo-encoded.nc\", mode=\"w\")\n", "encoded.to_zarr(\"geo-encoded.zarr\", mode=\"w\")" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "We can now read back the saved zarr store from disk into a vector data cube:" ] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": [ "roundtripped = xr.open_zarr(\"geo-encoded.zarr\").xvec.decode_cf()\n", "roundtripped" ] }, { "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "which we can test to be identical to the original `ts_cube` (this may currently fail due to [xvec/issues/141](https://github.com/xarray-contrib/xvec/issues/141)):" ] }, { "cell_type": "code", "execution_count": null, "id": "40", "metadata": {}, "outputs": [], "source": [ "roundtripped.identical(ts_cube)" ] }, { "cell_type": "markdown", "id": "41", "metadata": {}, "source": [ "## Footnotes\n", "\n", "1. See [a benchmark](https://tstore.readthedocs.io/en/latest/tutorials/02-benchmark.html) comparing tstore, netCDF and zarr in terms of resulting file sizes as well as reading and writing times.\n", "\n", "## References\n", "\n", "```{bibliography}\n", ":filter: docname in docnames\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python (Pixi)", "language": "python", "name": "pixi-kernel-python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.11" }, "pixi-kernel": { "environment": "doc" } }, "nbformat": 4, "nbformat_minor": 5 }