the Creative Commons Attribution 4.0 License.
SEEPS4ALL: an open dataset for the verification of daily precipitation forecasts using station climate statistics
Abstract. Forecast verification is an essential task when developing a forecasting model. How well does a model perform? How does the forecast performance compare with previous versions or other models? Which aspects of the forecast could be improved? In weather forecasting, these questions apply in particular to precipitation, a key weather parameter with vital societal applications. Scores specifically designed to assess the performance of precipitation forecasts have been developed over the years. One example is the Stable and Equitable Error in Probability Space (SEEPS, Rodwell et al., 2010). The computation of this score is however not straightforward because it requires information about the precipitation climatology at the verification locations. More generally, climate statistics are key to assessing forecasts for extreme precipitation and high-impact events. Here, we introduce SEEPS4ALL, a set of data and tools that democratize the use of climate statistics for verification purposes. In particular, verification results for daily precipitation are showcased with both deterministic and probabilistic forecasts.
Status: closed
RC1: 'Comment on essd-2025-553', Jonas Bhend, 16 Dec 2025
SEEPS4All review
The authors present a new station-based dataset to support the evaluation of precipitation forecasts with the SEEPS score and other verification metrics. This dataset and the corresponding verification software are useful contributions, and the paper is well written and concise. The datasets are well described and accessible, but I suggest reorganizing the data as detailed below.
The suggested changes to the datasets can be summarized as:
- Percentile as a dimension rather than variable name
- Time dimension for percentiles and other climatological quantities as month of year to avoid redundant repetition of values
- Re-chunk the data: the current chunk layout produces far too many very small chunks of around 10 kB. Recommendations for chunk sizes vary, but are generally in the 10 MB range.
- Also add the maximum (i.e. the 100th percentile)
To facilitate working with the new data layout, the scripts should be adjusted to take into account the difference in time representation.
So instead of the current representation, shown below:
```
>>> xr.open_zarr("obs_clim_tp24_2022_2024_ecad.zarr")
<xarray.Dataset> Size: 9GB
Dimensions: (stnid: 10562, time: 1097)
Coordinates:
elevation (stnid) int64 84kB dask.array<chunksize=(10562,), meta=np.ndarray>
lat (stnid) float64 84kB dask.array<chunksize=(10562,), meta=np.ndarray>
lon (stnid) float64 84kB dask.array<chunksize=(10562,), meta=np.ndarray>
* stnid (stnid) int64 84kB 13 14 15 16 21 ... 27706 27707 27708 27710
* time (time) datetime64[ns] 9kB 2022-01-01 2022-01-02 ... 2025-01-01
Data variables: (12/100)
observation (time, stnid) float64 93MB dask.array<chunksize=(138, 1321), meta=np.ndarray>
perc1 (time, stnid) float64 93MB dask.array<chunksize=(138, 1321), meta=np.ndarray>
perc10 (time, stnid) float64 93MB dask.array<chunksize=(138, 1321), meta=np.ndarray>
perc11 (time, stnid) float64 93MB dask.array<chunksize=(138, 1321), meta=np.ndarray>
perc12 (time, stnid) float64 93MB dask.array<chunksize=(138, 1321), meta=np.ndarray>
perc13 (time, stnid) float64 93MB dask.array<chunksize=(138, 1321), meta=np.ndarray>
... ...
perc98 (time, stnid) float64 93MB dask.array<chunksize=(138, 1321), meta=np.ndarray>
perc99 (time, stnid) float64 93MB dask.array<chunksize=(138, 1321), meta=np.ndarray>
Attributes:
description: observations with climate percentiles from 1 to 99
licence: CC-BY-NC. See also https://knmi-ecad-assets-prd.s3.amazonaw...
version: 1.0.0
```
I suggest the following:
```
>>> xr.open_zarr("obs_clim_tp24_2022_2024_ecad.zarr")
<xarray.Dataset> Size: 9GB
Dimensions: (stnid: 10562, time: 1097, month: 12, perc: 100)
Coordinates:
elevation (stnid) int64 84kB dask.array<chunksize=(10562,), meta=np.ndarray>
lat (stnid) float64 84kB dask.array<chunksize=(10562,), meta=np.ndarray>
lon (stnid) float64 84kB dask.array<chunksize=(10562,), meta=np.ndarray>
* stnid (stnid) int64 84kB 13 14 15 16 21 ... 27706 27707 27708 27710
* time (time) datetime64[ns] 9kB 2022-01-01 2022-01-02 ... 2025-01-01
* month (month) int64 1 2 3 ... 12
* perc (perc) int64 1 2 3 ... 100
Data variables:
observation (time, stnid) float64 93MB
percentile (month, stnid, perc) float64 …
Attributes:
description: observations with climate percentiles from 1 to 99
licence: CC-BY-NC. See also https://knmi-ecad-assets-prd.s3.amazonaw...
version: 1.0.0
```
Similarly, the `obs_seeps_tp24_2022_2024_ecad.zarr` dataset could be reorganized in a corresponding fashion.
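As an illustration of the reorganization suggested above, the following is a minimal sketch, not the authors' actual conversion script. It builds a small synthetic stand-in for the current layout (all sizes, station ids and values are made up), stacks the `perc<N>` variables along a new `perc` dimension, and keeps one value per month. The helper names `to_percentile_dimension` and `stations_per_chunk`, the output path, and the assumption that the percentile values are constant within a calendar month are all hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the current layout (sizes, station ids and values are
# illustrative): one variable per percentile, repeated along the daily time axis.
time = pd.date_range("2022-01-01", "2022-03-31", freq="D")
stnid = np.array([13, 14, 15])
levels = [1, 50, 99]
rng = np.random.default_rng(0)

wide = xr.Dataset(coords={"time": time, "stnid": stnid})
wide["observation"] = (("time", "stnid"), rng.gamma(0.5, 4.0, (time.size, stnid.size)))
for p in levels:
    # One value per (month, station), copied to every day of that month.
    monthly = rng.gamma(0.5, 4.0, (12, stnid.size))
    wide[f"perc{p}"] = (("time", "stnid"), monthly[time.month.to_numpy() - 1])


def to_percentile_dimension(ds, levels):
    """Stack perc<N> variables along 'perc' and keep one value per month."""
    stacked = xr.concat(
        [ds[f"perc{p}"] for p in levels], dim=pd.Index(levels, name="perc")
    )
    # Assumes values are constant within a calendar month, so the first one suffices.
    monthly = stacked.groupby("time.month").first()
    return xr.Dataset(
        {
            "observation": ds["observation"],
            "percentile": monthly.transpose("month", "stnid", "perc"),
        }
    )


def stations_per_chunk(n_time, itemsize=8, target_bytes=10_000_000):
    """Number of stations so that a (n_time, n) chunk stays near the target size."""
    return max(1, target_bytes // (n_time * itemsize))


tidy = to_percentile_dimension(wide, levels)
n_stn = stations_per_chunk(n_time=1097)  # full 2022-2024 time axis, float64 values
# With dask installed, the chunking could then be applied before writing, e.g.
# tidy.chunk({"time": -1, "stnid": n_stn}).to_zarr("some_output.zarr")
```

Under these assumptions, the climatological values are stored once per month instead of once per day, and the chunk shape is derived from a target size in bytes rather than fixed by hand.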
## Minor Editorial Comments:
L13: recognizes meteorological data as high value data, …
L143: PSS
Citation: https://doi.org/10.5194/essd-2025-553-RC1
AC1: 'Reply on RC1', Zied Ben Bouallegue, 16 Jan 2026
The authors would like to thank Jonas Bhend for taking the time to review our paper.
We have followed the suggestions regarding the changes to the datasets: 1) the percentiles are now a dimension rather than a variable name, 2) the time dimension for percentiles and other climatological quantities is expressed as month of year, 3) the maximum has been added to the list of percentiles, and 4) the data has been archived with chunk sizes of around 10 MB.
The verification scripts have been adapted to accommodate the change in dataset dimensions.
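To illustrate what such an adaptation can look like, here is a minimal sketch, which is not the actual SEEPS4ALL verification code. It assumes a climatology indexed by `month`, `stnid` and `perc` as in the layout suggested by the reviewer, and looks up the threshold valid for each observation date through the month of that date. The function name `threshold_for_dates` and all values are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

stnid = [13, 14]
time = pd.date_range("2022-01-30", "2022-02-02", freq="D")
obs = xr.DataArray(
    [[0.0, 12.0], [5.0, 0.2], [30.0, 1.0], [2.0, 8.0]],
    dims=("time", "stnid"),
    coords={"time": time, "stnid": stnid},
    name="observation",
)

# Made-up monthly climatology with dims (month, stnid, perc):
# the 50th percentile equals the month number, the 95th is ten times that.
months = np.arange(1, 13)
values = months[:, None, None] * np.array([1.0, 10.0]) * np.ones((1, 2, 1))
clim = xr.DataArray(
    values,
    dims=("month", "stnid", "perc"),
    coords={"month": months, "stnid": stnid, "perc": [50, 95]},
    name="percentile",
)


def threshold_for_dates(clim, obs, perc):
    """Pick, for every observation date, the climatological value of its month."""
    # Vectorized selection: the 'month' dimension is replaced by 'time'.
    return clim.sel(perc=perc).sel(month=obs["time"].dt.month)


thr = threshold_for_dates(clim, obs, perc=95)
exceeds = obs > thr  # True where the observation exceeds the monthly threshold
```

The lookup replaces the `month` dimension with `time`, so the result aligns with the observations without repeating the climatology on disk.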
The minor editorial comments were addressed. Many thanks for your help.
Citation: https://doi.org/10.5194/essd-2025-553-AC1
RC2: 'Comment on essd-2025-553', Anonymous Referee #2, 23 Dec 2025
The manuscript introduces a new dataset focused on station-based precipitation over Europe, including climatological statistics that facilitate the computation of meaningful verification metrics, like the SEEPS. This dataset can be useful to verify precipitation forecasts, not only from NWP but also from ML methods. The paper is well written and the data is accessible as described in the manuscript.
Regarding the dataset, there is already an open discussion about reformatting the data. I also think that including the maximum (100th percentile) is a good idea and agree with the change of dimensions.
Regarding the manuscript, I have some minor comments and clarifying questions:
- Line 68: The dataset includes observations from 2022 to 2024 coming from ECA&D. This is presented as a novelty (Line 60), but is this part of the dataset just a direct subset of ECA&D?
- Fig. 1: Information about the number of plotted stations would be appreciated.
- Lines 82-83: The study uses the blended dataset provided in ECA&D. Blending data has an impact on extreme values, which are smoothed. Please comment on the impact this blending might have on the extreme cases of the provided dataset.
Citation: https://doi.org/10.5194/essd-2025-553-RC2
AC2: 'Reply on RC2', Zied Ben Bouallegue, 16 Jan 2026
We would like to thank the Reviewer for their comment on the data structure and the text.
We have now included the maximum as a new climate statistic in the dataset and adapted the dimensions as required.
The observations of SEEPS4ALL are indeed a subset of ECA&D. The novelty is that this data is in zarr format, ready for benchmarking activities. We have clarified this point in the text.
In Figure 1, we have added information about the number of stations as suggested.
Blended data are data merged from different providers with the aim of having the most complete dataset possible (see https://www.ecad.eu/FAQ/index.php#3). So, blending in that sense does not have a smoothing effect.
Citation: https://doi.org/10.5194/essd-2025-553-AC2
Data sets
SEEPS4ALL version 1.0 Zied Ben Bouallègue https://doi.org/10.5281/zenodo.17052887