Data structures for geospatial time series data#

import contextily as cx
import matplotlib.pyplot as plt
import osmnx as ox
import seaborn as sns
import xarray as xr

from meteora import clients, utils

region = "Switzerland"
start = "2021-08-13"
end = "2021-08-16"
variables = ["temperature", "precipitation", "wind_speed"]

client = clients.METARASOSIEMClient(region)

ts_df = client.get_ts_df(variables, start=start, end=end)
ts_df

		temperature	precipitation	wind_speed
station_id	time
LSGC	2021-08-13 00:20:00	60.8	0.0	1.0
	2021-08-13 00:50:00	60.8	0.0	2.0
	2021-08-13 01:20:00	60.8	0.0	2.0
	2021-08-13 01:50:00	59.0	0.0	0.0
	2021-08-13 02:20:00	59.0	0.0	0.0
...	...	...	...	...
LSZS	2021-08-15 21:50:00	51.8	0.0	4.0
	2021-08-15 22:20:00	51.8	0.0	2.0
	2021-08-15 22:50:00	51.8	0.0	4.0
	2021-08-15 23:20:00	50.0	0.0	2.0
	2021-08-15 23:50:00	50.0	0.0	1.0

2304 rows × 3 columns

There are different ways in which we can structure time series data frame shown above. The most common representation are the so-called long and wide table formats (see the “Data structures accepted by seaborn” section of the seaborn documentation for an overview).

Long data frames#

The get_ts_df method returns a long data frame where the station and time are used as multi-index levels, so that we can easily select the time series of a specific station:

station_id = "LSGG"
ts_df.loc[station_id]

	temperature	precipitation	wind_speed
time
2021-08-13 00:20:00	64.4	0.0	4.0
2021-08-13 00:50:00	62.6	0.0	1.0
2021-08-13 01:20:00	62.6	0.0	3.0
2021-08-13 01:50:00	62.6	0.0	2.0
2021-08-13 02:20:00	62.6	0.0	3.0
...	...	...	...
2021-08-15 21:50:00	77.0	0.0	2.0
2021-08-15 22:20:00	77.0	0.0	4.0
2021-08-15 22:50:00	77.0	0.0	10.0
2021-08-15 23:20:00	68.0	0.0	7.0
2021-08-15 23:50:00	68.0	0.0	3.0

144 rows × 3 columns

The key advantage of a long data frame for time series data is that it is very flexible, i.e., we can combine stations with different time ranges and temporal resolutions into a single data frame structure. However, performing simple time series operations requires grouping by stations, which may not be the most convenient and straight-forward.

Wide data frames#

The same data can be represented as a wide data frame, where the index is the time and the variables and stations are the multi-level columns. We can convert the long data frame to a wide data frame using the meteora.utils.long_to_wide function:

# in this case, equivalently:
# wide_ts_df = ts_df.unstack(level="station")
wide_ts_df = utils.long_to_wide(ts_df)
wide_ts_df

variable	temperature										...	wind_speed
station_id	LSGC	LSGG	LSGS	LSMA	LSMD	LSME	LSMM	LSMP	LSZA	LSZB	...	LSMM	LSMP	LSZA	LSZB	LSZC	LSZG	LSZH	LSZL	LSZR	LSZS
time
2021-08-13 00:20:00	60.8	64.4	64.4	69.8	68.0	69.8	66.2	68.0	71.6	64.4	...	2.0	4.0	8.0	3.0	4.0	2.0	7.0	5.0	6.0	1.0
2021-08-13 00:50:00	60.8	62.6	64.4	68.0	66.2	69.8	66.2	68.0	71.6	62.6	...	3.0	4.0	4.0	3.0	2.0	2.0	4.0	7.0	3.0	4.0
2021-08-13 01:20:00	60.8	62.6	64.4	66.2	60.8	68.0	64.4	68.0	71.6	62.6	...	4.0	3.0	2.0	1.0	2.0	2.0	5.0	8.0	6.0	2.0
2021-08-13 01:50:00	59.0	62.6	62.6	66.2	59.0	66.2	64.4	66.2	69.8	62.6	...	3.0	3.0	3.0	2.0	4.0	2.0	2.0	7.0	5.0	2.0
2021-08-13 02:20:00	59.0	62.6	62.6	64.4	59.0	64.4	62.6	64.4	69.8	62.6	...	3.0	0.0	4.0	2.0	3.0	2.0	4.0	3.0	4.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2021-08-15 21:50:00	62.6	77.0	75.2	73.4	-4.0	75.2	69.8	71.6	75.2	68.0	...	0.0	1.0	1.0	1.0	2.0	1.0	2.0	3.0	2.0	4.0
2021-08-15 22:20:00	64.4	77.0	75.2	71.6	12.2	73.4	68.0	71.6	75.2	66.2	...	3.0	0.0	2.0	2.0	9.0	4.0	2.0	2.0	4.0	2.0
2021-08-15 22:50:00	66.2	77.0	75.2	69.8	-13.0	71.6	68.0	71.6	73.4	66.2	...	6.0	2.0	2.0	2.0	15.0	3.0	2.0	2.0	5.0	4.0
2021-08-15 23:20:00	64.4	68.0	75.2	71.6	26.6	71.6	68.0	75.2	75.2	66.2	...	3.0	3.0	3.0	4.0	9.0	3.0	2.0	2.0	5.0	2.0
2021-08-15 23:50:00	66.2	68.0	71.6	73.4	23.0	71.6	68.0	69.8	73.4	68.0	...	1.0	2.0	1.0	2.0	8.0	5.0	2.0	3.0	3.0	1.0

144 rows × 48 columns

Accordingly, we can easily select the time series of a specific variable:

wide_ts_df["temperature"]

station_id	LSGC	LSGG	LSGS	LSMA	LSMD	LSME	LSMM	LSMP	LSZA	LSZB	LSZC	LSZG	LSZH	LSZL	LSZR	LSZS
time
2021-08-13 00:20:00	60.8	64.4	64.4	69.8	68.0	69.8	66.2	68.0	71.6	64.4	68.0	68.0	71.6	71.6	71.6	50.0
2021-08-13 00:50:00	60.8	62.6	64.4	68.0	66.2	69.8	66.2	68.0	71.6	62.6	69.8	66.2	71.6	73.4	71.6	50.0
2021-08-13 01:20:00	60.8	62.6	64.4	66.2	60.8	68.0	64.4	68.0	71.6	62.6	68.0	66.2	69.8	73.4	73.4	50.0
2021-08-13 01:50:00	59.0	62.6	62.6	66.2	59.0	66.2	64.4	66.2	69.8	62.6	66.2	64.4	69.8	73.4	69.8	50.0
2021-08-13 02:20:00	59.0	62.6	62.6	64.4	59.0	64.4	62.6	64.4	69.8	62.6	68.0	62.6	69.8	73.4	71.6	48.2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2021-08-15 21:50:00	62.6	77.0	75.2	73.4	-4.0	75.2	69.8	71.6	75.2	68.0	73.4	71.6	71.6	73.4	73.4	51.8
2021-08-15 22:20:00	64.4	77.0	75.2	71.6	12.2	73.4	68.0	71.6	75.2	66.2	73.4	71.6	69.8	73.4	77.0	51.8
2021-08-15 22:50:00	66.2	77.0	75.2	69.8	-13.0	71.6	68.0	71.6	73.4	66.2	71.6	69.8	69.8	73.4	78.8	51.8
2021-08-15 23:20:00	64.4	68.0	75.2	71.6	26.6	71.6	68.0	75.2	75.2	66.2	73.4	71.6	69.8	73.4	78.8	50.0
2021-08-15 23:50:00	66.2	68.0	71.6	73.4	23.0	71.6	68.0	69.8	73.4	68.0	69.8	75.2	69.8	73.4	77.0	50.0

144 rows × 16 columns

Note that unlike in the long data frame, the records of all the stations and variables are indexed by a common and unique timestamp. Accordingly, a key advantage of the wide data frame is that time series operations are supported by pandas out-of-the-box. For example, we can easily select a time range and plot the temperature time series of all stations:

wide_ts_df.loc["2021-08-14 00:00:00":"2021-08-15 00:00:00"]["temperature"].plot(
    legend=False
)

<Axes: xlabel='time'>

../_images/3318bb1c7715f068a107f27a834c9417cea7d45efca24238dd24692b1ea08ae8.png

We can also resample the time series to a different frequency, e.g., we can compute the hourly mean temperature time series of all stations:

wide_ts_df.resample("3h")["temperature"].mean().plot(legend=False)

<Axes: xlabel='time'>

../_images/dd6ad213246b8dd7871149aacf390f4f0a110c5f9f2e5fd3b9959d6ce4b192ad.png

The main disadvantage is the wide data form may not be appropriate to store raw observations from different sources at different time resolutions, since it can result in many nan values.

Example geospatial operations with long and wide data frames#

We have so far reviewed the advantages and disatvantages that the long and wide data frame forms feature when dealing with time series data. However, meteorological observations also have an important geospatial component which we have not addressed yet. While it is possible to add a “geometry” column to the long data frame, it would result in many repeated values (since we likely have many observations for each station). On the other hand, stations are featured as rows in the wide data frame, which makes adding their geolocation very inconvenient.

Therefore, in most cases the best way to peform spatial operations on meteorological data is to use two separate data structures, i.e., one for the time series of measurements and another with the station locations (and potentially other attributes). As shown above, the station locations can be accessed in the stations_gdf property, which we can use to perform geospatial operations. For example, we can select stations within a buffer around a specific location:

query = "Canton de Vaud"
buffer_dist = 20e3  # in meters

extent_geom = (
    ox.projection.project_gdf(ox.geocode_to_gdf(query))
    .buffer(buffer_dist)
    .to_crs(client.stations_gdf.crs)
    .iloc[0]
)
station_ids = client.stations_gdf[client.stations_gdf.within(extent_geom)].index
station_ids

Index(['LSGC', 'LSGG', 'LSGS', 'LSMP'], dtype='str', name='station_id')

We can use this information to select the time series of the stations within the buffer. For the long data frame (the default in Meteora), this is straightforward since the stations are the first-level index:

ts_df.loc[station_ids]

		temperature	precipitation	wind_speed
station_id	time
LSGC	2021-08-13 00:20:00	60.8	0.0	1.0
	2021-08-13 00:50:00	60.8	0.0	2.0
	2021-08-13 01:20:00	60.8	0.0	2.0
	2021-08-13 01:50:00	59.0	0.0	0.0
	2021-08-13 02:20:00	59.0	0.0	0.0
...	...	...	...	...
LSMP	2021-08-15 21:50:00	71.6	0.0	1.0
	2021-08-15 22:20:00	71.6	0.0	0.0
	2021-08-15 22:50:00	71.6	0.0	2.0
	2021-08-15 23:20:00	75.2	0.0	3.0
	2021-08-15 23:50:00	69.8	0.0	2.0

576 rows × 3 columns

For the wide data frame, stations are in the second level of the columns, so we can either (i) swap the column levels and then simply pass station_ids to select the desired stations or (ii) use .loc at the second level of the columns:

# (i)
# wide_ts_df.swaplevel("variable", "station", axis="columns")[station_ids]
# (ii)
wide_ts_df.loc[:, (slice(None), station_ids)]

variable	temperature				precipitation				wind_speed
station_id	LSGC	LSGG	LSGS	LSMP	LSGC	LSGG	LSGS	LSMP	LSGC	LSGG	LSGS	LSMP
time
2021-08-13 00:20:00	60.8	64.4	64.4	68.0	0.0	0.0	0.0	0.0	1.0	4.0	4.0	4.0
2021-08-13 00:50:00	60.8	62.6	64.4	68.0	0.0	0.0	0.0	0.0	2.0	1.0	5.0	4.0
2021-08-13 01:20:00	60.8	62.6	64.4	68.0	0.0	0.0	0.0	0.0	2.0	3.0	0.0	3.0
2021-08-13 01:50:00	59.0	62.6	62.6	66.2	0.0	0.0	0.0	0.0	0.0	2.0	0.0	3.0
2021-08-13 02:20:00	59.0	62.6	62.6	64.4	0.0	0.0	0.0	0.0	0.0	3.0	2.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...
2021-08-15 21:50:00	62.6	77.0	75.2	71.6	0.0	0.0	0.0	0.0	1.0	2.0	3.0	1.0
2021-08-15 22:20:00	64.4	77.0	75.2	71.6	0.0	0.0	0.0	0.0	2.0	4.0	3.0	0.0
2021-08-15 22:50:00	66.2	77.0	75.2	71.6	0.0	0.0	0.0	0.0	5.0	10.0	3.0	2.0
2021-08-15 23:20:00	64.4	68.0	75.2	75.2	0.0	0.0	0.0	0.0	6.0	7.0	5.0	3.0
2021-08-15 23:50:00	66.2	68.0	71.6	69.8	0.0	0.0	0.0	0.0	6.0	3.0	1.0	2.0

144 rows × 12 columns

Note that if we are interested in a single variable, filtering by stations becomes much simpler:

wide_ts_df["temperature"][station_ids]

station_id	LSGC	LSGG	LSGS	LSMP
time
2021-08-13 00:20:00	60.8	64.4	64.4	68.0
2021-08-13 00:50:00	60.8	62.6	64.4	68.0
2021-08-13 01:20:00	60.8	62.6	64.4	68.0
2021-08-13 01:50:00	59.0	62.6	62.6	66.2
2021-08-13 02:20:00	59.0	62.6	62.6	64.4
...	...	...	...	...
2021-08-15 21:50:00	62.6	77.0	75.2	71.6
2021-08-15 22:20:00	64.4	77.0	75.2	71.6
2021-08-15 22:50:00	66.2	77.0	75.2	71.6
2021-08-15 23:20:00	64.4	68.0	75.2	75.2
2021-08-15 23:50:00	66.2	68.0	71.6	69.8

144 rows × 4 columns

Saving to disk#

There are several ways in which we can store the geospatial time series data that we deal with when using Meteora.

Using long or wide data frames#

When using long and wide data frames, the most straight-forward solution is to separately save the stations geo-data frame and time series data frame to disk, e.g., using a geospatial file format for the former and a pandas data frame compatible format (CSV, JSON…) for the latter:

ts_df.to_csv("ts-df.csv")
client.stations_gdf.to_file("stations.gpkg")

Note that for long data frames, it is possible to save a single object by adding the station attributes as columns, however note that in most cases this will be highly inefficient and result in many repeated values.

For large amounts of data, you may consider tstore, which is an interoperable specification based on Apache Parquet and GeoParquet. (note that the API is still experimental). The first tutorial illustrates different ways in which long data frames from Meteora can be efficiently$^1$ stored while keeping the geospatial information.

Using vector data cubes#

There exist several ways of saving vector data cubes to disk, which include (see the related section of the xvec documentation for more details):

geospatial file formats, which is essentially equivalent to transforming the vector data cube to a long data frame with the station attributes (including location), and thus has the exact same shortcomings in terms of the high degree of repetition.
binary serialization formats such as pickle and joblib, which have the advantage of allowing compression but may only work when read in a similar environment with similar package versions (therefore with restricted interoperability).
CF conventions with netCDF or zarr, which encode the geospatial information so that it is compatible with the xarray’s formats such as netCDF and zarr.

Overall, the latter option is likely the most convenient. Let us try it (note that in order to run the cells below, we need to install zarr which is not a Meteora - nor xvec - dependency):

encoded = ts_cube.xvec.encode_cf()
# encoded.to_netcdf("geo-encoded.nc", mode="w")
encoded.to_zarr("geo-encoded.zarr", mode="w")

<xarray.backends.zarr.ZarrStore at 0x70badad963e0>

We can now read back the saved zarr store from disk into a vector data cube:

roundtripped = xr.open_zarr("geo-encoded.zarr").xvec.decode_cf()
roundtripped

<xarray.Dataset> Size: 57kB
Dimensions:        (geometry: 16, time: 144)
Coordinates:
  * geometry       (geometry) object 128B POINT (6.7928 47.0838) ... POINT (9...
    station_id     (geometry) object 128B dask.array<chunksize=(16,), meta=np.ndarray>
  * time           (time) datetime64[ns] 1kB 2021-08-13T00:20:00 ... 2021-08-...
Data variables:
    precipitation  (geometry, time) float64 18kB dask.array<chunksize=(16, 144), meta=np.ndarray>
    temperature    (geometry, time) float64 18kB dask.array<chunksize=(16, 144), meta=np.ndarray>
    wind_speed     (geometry, time) float64 18kB dask.array<chunksize=(16, 144), meta=np.ndarray>
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)

which we can test to be identical to the original ts_cube (this may currently fail due to xvec/issues/141):

roundtripped.identical(ts_cube)

False

Footnotes#

See a benchmark comparing tstore, netCDF and zarr in terms of resulting file sizes as well as reading and writing times.

References#

[1]

Edzer Pebesma. Vector data cubes. r-spatial.org blog, 2022. URL: https://r-spatial.org/r/2022/09/12/vdc.html.