OpenRainER: an open-source dataset for studying the opportunistic sensing of rainfall in Emilia-Romagna, Italy
Abstract. We present the OpenRainER dataset of precipitation measurements, available open-source on Zenodo repository at https://doi.org/10.5281/zenodo.10593848. The dataset contains mainly precipitation related measurements over the region of Emilia-Romagna in northern Italy. Inside OpenRainER, measurement from the commercial microwave link network managed by Lepida S.c.p.A. are published, consisting in 1 min time series of the transmitted signal level and of the received signal level over 151 radio links. These data are primarily generated for link quality monitoring; however, they can be opportunistically exploited for weather monitoring as there is a well known direct relationship between rainfall intensity and decrease of the received signal level. The data are stored in NetCDF format and can be processed and converted into rainfall intensity time series along each radio link path using open-source tools developed within the COST Action OpenSense framework. We also provide concurrent data from the regional operational rain gauge network and two weather radars, as a reference for calibration and validation purposes. OpenRainER has peculiar characteristics with respect to similar open datasets: (1.) The links are distributed over very different types of terrain, including plains, hills, valleys, and mountain ridges (up to 2000 m a.s.l.), and both densely inhabited cities and rural areas. (2.) While conventional CML networks show a frequency distribution inversely proportional to the link path length, here links operate at 24.6 or 25.6 GHz. This offers a wider range of sensitivities for testing classification and retrieval algorithms.
The authors present the OpenRainER data set. The data set consists of commercial microwave link (CML), weather radar, and rain gauge data, with the main goal of improving the usability of the CML data for rainfall monitoring.
The data set offers a significant contribution to the community involved in rainfall monitoring with CMLs. As an open data set in a field where the raw data are often undisclosed, this CML data set can be of high value in benchmark studies. It differs from the other three open data sets mentioned in the paper, and the addition of weather radar and rain gauge data completes the data set.
The data itself is of sufficiently high quality, though the limitations, especially of the CML data, should be elaborated on to improve the usability. The manuscript with the description and analyses, in my view, currently does not sufficiently support the data set, other than having it published on Zenodo. I have outlined my major concerns regarding these points below.
Preprocessing CML data
The TSL and RSL time series are processed in-house by the data owner. Even though the details are not disclosed, it would be valuable to be as precise as possible. Is it just the sampling frequency and quantization that are unknown? That is the only thing the appendix focuses on. If so, state this concretely in the main text (L120-126). If the authors suspect that other forms of preprocessing were applied, it would be helpful to know why they think so, and which processing steps they suspect based on their experience with this data set.
The appendix on the in-house processing is an important addition and I appreciate that the authors have looked into this. However, in my view it is not sufficiently well written and is difficult to understand at this point. Unfortunately, the example in Line 259 does not make it much clearer to me. Maybe it is clear to the authors, but I suggest having a colleague (not involved in this exact field of research) read through it and iterate together to make it more understandable.
Some examples of what is unclear to me in the appendix:
L251-257: As I understand it, N samples are recorded within 1 minute, and each of these samples is quantized to a level somewhere between RSL_0 and RSL_K. So K is simply the number of possible quantization values? Then shouldn’t RSL_0 + k in Eq. A1 actually be RSL_{0+k} (same in L257)? If K is the quantization value itself, then I would expect Eq. A1 to be the sum of N_k RSL_k and not N_k (RSL_0 + k). Otherwise the range from RSL_0 to RSL_K mentioned in L254 does not make sense.
L259: Doesn’t N_k already vary between 0 and K, instead of between 0 and N?
Eq. A3: Follow the same format as in Eq. A1 and write it out completely. E.g., if K = 1, I would expect Eq. A3 to be (1/N) (N_0 RSL_0 + N_1 (RSL_0 + 1)), based on the format of Eq. A1.
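To make my reading of Eq. A1 concrete, here is a minimal sketch of the count-weighted average I believe is meant. It assumes that N_k is the number of samples within the minute that fall on level RSL_0 + k times the quantization step. The function name, the 1 dB default step, and the 60 samples per minute are illustrative assumptions, not values taken from the manuscript.

```python
def mean_rsl(counts, rsl0, step_db=1.0):
    """Count-weighted mean of quantized RSL levels (assumed reading of Eq. A1).

    counts[k] is N_k, the number of samples in the interval that fell on
    the level rsl0 + k * step_db; len(counts) - 1 plays the role of K.
    """
    n_total = sum(counts)  # N, the total number of samples in the interval
    if n_total == 0:
        raise ValueError("no samples in this interval")
    weighted = sum(n_k * (rsl0 + k * step_db) for k, n_k in enumerate(counts))
    return weighted / n_total


# K = 1: two possible levels, RSL_0 = -50 dBm and RSL_0 + 1 = -49 dBm.
# Hypothetical counts: N_0 = 45 and N_1 = 15, so N = 60.
example = mean_rsl([45, 15], rsl0=-50.0)
```

If this reading is correct, the 1 min value can take N + 1 distinct values between two adjacent quantization levels, which may be what the example in L259 tries to express.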
Having someone outside of your direct field of research read through this appendix will hopefully improve the readability and make it more accessible to those outside the direct CML community too. Finally, I would suggest making Figures A1 and possibly A2 a lot larger (wider), as it is now very difficult to deduce anything from these time series or to follow along when the authors refer to these figures. For example, the bulk 1 dB quantization step in L251 cannot be deduced from Fig. A1 with the current vertical axis.
Additional analyses
To gain more trust in this opportunistic data source, some additional analyses would be advisable. Section 3.3 on rainfall comparison is rather limited at this moment. The authors state that the data set has some peculiarities compared to other open CML data sets, such as the presence of orography and the use of nearly identical CML operating frequencies. However, nothing is done with these features at this moment.
It would be fairly straightforward to split the results in Figure 11 by orography / elevation, for example. The same holds for link length. Since the frequency is predominantly the same, how does the accuracy of the CML rainfall estimates vary with link length? This would quickly give a user some insight into the quality of the data that makes this data set unique.
Similarly, in L162 you state gauges are “the most accurate ground-truth reference against which CML-based rainfall retrievals can be compared in our case”. However, no such analysis is currently performed. To gain confidence in the (reference) data, a short comparison between CMLs and the nearest gauge, similar to what is done for radar, would be useful. Also, a comparison between the gauges themselves, such as a simple correlogram or double mass curve, could already yield some insights into the quality of the reference data itself and potentially spot outliers.
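As an indication of how little effort such a split requires, a minimal sketch follows. It assumes one record per link holding the CML and radar accumulations plus a link attribute. The field names (`cml_mm`, `radar_mm`, `length_km`), the bin edges, and the numbers are hypothetical and serve only as illustration.

```python
from collections import defaultdict
from statistics import mean


def stratified_bias(records, key, edges):
    """Mean CML-minus-radar difference per bin of the attribute `key`.

    records: dicts with the hypothetical fields "cml_mm", "radar_mm" and `key`.
    edges: ascending bin edges; bin i covers [edges[i], edges[i + 1]).
    Returns {(low, high): (number of links, mean difference)}.
    """
    groups = defaultdict(list)
    for rec in records:
        for low, high in zip(edges[:-1], edges[1:]):
            if low <= rec[key] < high:
                groups[(low, high)].append(rec["cml_mm"] - rec["radar_mm"])
                break
    return {bin_: (len(d), mean(d)) for bin_, d in sorted(groups.items())}


# Made-up numbers for illustration only, not values from OpenRainER.
records = [
    {"length_km": 3.0, "cml_mm": 12.0, "radar_mm": 10.0},
    {"length_km": 4.5, "cml_mm": 9.0, "radar_mm": 8.0},
    {"length_km": 12.0, "cml_mm": 20.0, "radar_mm": 21.0},
]
by_length = stratified_bias(records, key="length_km", edges=[0, 5, 10, 20])
```

The same function could be called with an elevation attribute, or with records split in time by rainfall type, so one routine would cover all the stratifications suggested in this review.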
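A double mass curve needs little more than two cumulative sums. The sketch below is one possible implementation under stated assumptions: both series cover the same time steps, missing values are encoded as `None`, and the reference is something like the mean of neighbouring gauges. The function name and the numbers are hypothetical.

```python
from itertools import accumulate


def double_mass_curve(gauge_mm, reference_mm):
    """Return paired cumulative sums (reference, gauge) for plotting.

    Both inputs are equal-length sequences of accumulations over the same
    time steps. Steps where either value is None are skipped in both series.
    """
    if len(gauge_mm) != len(reference_mm):
        raise ValueError("series must have the same length")
    pairs = [(g, r) for g, r in zip(gauge_mm, reference_mm)
             if g is not None and r is not None]
    cum_gauge = list(accumulate(g for g, _ in pairs))
    cum_ref = list(accumulate(r for _, r in pairs))
    return cum_ref, cum_gauge


# Made-up daily totals in mm, for illustration only.
gauge = [0.0, 5.0, None, 2.0, 10.0]
neighbours = [0.0, 4.0, 1.0, 2.0, 8.0]
x, y = double_mass_curve(gauge, neighbours)
```

Plotting `y` against `x` should give an approximately straight line for a consistent gauge; a persistent change in slope would flag a station worth inspecting.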
Added value of split between convective and stratiform – Section 3.1
The split between stratiform and convective is interesting to get an idea of the local climatology, but is not really used in any analysis at the moment. So what is the exact added value to this description of the dataset? A quick addition could be to split the results from Fig 11 in time and show the performance of CMLs vs. weather radar in stratiform and convective conditions.
In general I am not suggesting to do many different analyses, as that is not the goal of such a data paper, but simply to split the analyses currently done by some of the characteristics mentioned in the paper to get an idea of how the CML data performs in different conditions.
Specific comments on content:
L42: So you have 1 minute data but receive it every 15 minutes?
L47-56: I would switch the order of this paragraph, and put it before L37-46, so that after describing the other data sets you first describe what makes the OpenRainER data set unique.
L51: Arguably, “wet-antenna attenuation” is not the most frequency-dependent effect, so I would give another example. A much larger effect could be the presumed uniform attenuation along the path.
L57: Nice to mention other publicly available datasets! Hyperlinks in the text don’t read nicely though. A small table would fit this better. You can then add some extra columns, such as whether these data are freely available, which could benefit the community too.
L85: I would make “Study Area” Section 2, and “Datasets” a separate section 3.
L126: To make the main text self-contained, explicitly mention the interesting characteristics you find in your Appendix here.
Fig 2: Make this and all other figures a lot larger, as they are difficult to read now. Moreover, stacked bar charts are generally not a good idea, since the reader is never sure whether one of the bars is covering the other. Instead, put the orange and blue bars next to each other. If you really must have a single bar per bin, use different widths so that the heights of both bars are always visible. Finally, it would be useful to briefly define the term roughness in the caption as well.
Fig3: To enhance the usability, this data would better fit on a map so it becomes clear which links have less data, which would especially be of importance if they are all in the same regions.
L142: Should it be “the one only covered by SPC”? Would be good to mention the mosaic codes in your README / dataset here.
L145: Would be good to add the resolution to the README on Zenodo too.
L147-154: If there is a publication from Arpae describing the entire processing chain, or some comparison / validation studies on the radar data, it would be valuable to mention that here for reference.
L158: What did these minor modifications do? Try to be as specific as possible to give the reader/user confidence in the data.
Fig6: Similar to Fig3, a map with the gauge locations colored by availability would be useful.
Table 3: For completeness, adding precipitation to this table would be nice.
L174: It would be useful to know which stations, or at least how many stations, operate at temporal resolutions below 15 minutes. Possibly add this to Table 3 or, again, to a map.
L198: Explain shortly how the averaging of all radar pixels is done. Weighting the length of the CML in the radar pixel?
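To illustrate the kind of weighting I have in mind, a minimal sketch follows. It assumes a regular radar grid in projected coordinates (km) with its origin at (0, 0), and a link that is a straight line in that projection. The function name, the grid layout, and the pixel size are hypothetical, and this is not necessarily what the authors did.

```python
def path_average(radar, x0, y0, x1, y1, pixel_km=1.0, n_points=1000):
    """Approximate length-weighted mean of radar pixels along a link.

    radar: 2D list indexed [row][col], rows along y and columns along x,
    with square pixels of size pixel_km and the grid origin at (0, 0).
    The link is sampled at the midpoints of n_points equal sub-segments, so
    each pixel is weighted by (approximately) the link length inside it.
    """
    total = 0.0
    for i in range(n_points):
        t = (i + 0.5) / n_points
        x = x0 + t * (x1 - x0)
        y = y0 + t * (y1 - y0)
        total += radar[int(y // pixel_km)][int(x // pixel_km)]
    return total / n_points


# Made-up 1 x 3 grid (mm/h). The link covers 0.5 km of pixel 0, 1 km of
# pixel 1 and 0.5 km of pixel 2, so the weights are 0.25, 0.5 and 0.25.
grid = [[4.0, 8.0, 0.0]]
r_path = path_average(grid, 0.5, 0.5, 2.5, 0.5)
```

An unweighted mean of the three intersected pixels would give 4.0 mm/h here, while the length-weighted value is 5.0 mm/h, which is why the manuscript should state which of the two was used.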
L201-203: This hypothesis can be fairly quickly checked by coarsening the CML resolution to 15 minutes too, and/or shifting the timestamp to the beginning/end of each interval.
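A minimal sketch of such a check is given below. It assumes the CML retrievals are available as a gap-free list of 1 min rain depths; the function name, the `shift` parameter, and the numbers are hypothetical.

```python
def coarsen(rain_mm_per_min, window=15, shift=0):
    """Sum 1 min rain depths into consecutive windows of `window` minutes.

    shift moves the start of the first window by that many minutes, to test
    whether a timestamp convention (start vs. end of interval) explains a
    mismatch. Incomplete windows at the end of the series are dropped.
    """
    data = rain_mm_per_min[shift:]
    n_full = len(data) // window
    return [sum(data[i * window:(i + 1) * window]) for i in range(n_full)]


# Made-up 30 min series with rain only in minutes 10 to 19.
series = [0.0] * 10 + [0.5] * 10 + [0.0] * 10
totals = coarsen(series)                   # windows 0-14 and 15-29
totals_shifted = coarsen(series, shift=5)  # single full window, 5-19
```

In this toy case the same event is either split evenly over two intervals or contained in one, which shows how strongly the 15 min comparison can depend on the alignment of the intervals.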
L204: “which reflect the different spatial sampling performed by rain gauges, CML and radars”. Not necessarily true, or in any case not only true! It could be due to rainfall variability in time, differences in altitude, etc.
L210: “however it captures the temporal dynamics of rainfall better than the rain gauge”. How so? The timing of the rain gauge seems to agree more with the timing of the radar.
Fig8: Panels b and c show the number of days, which is a discrete variable; hence a discrete rather than a continuous colormap would be more useful.
L223-224: In my opinion this conclusion is too easily drawn. To give the user of this data set some more confidence in the data it would be very helpful to simply stratify these results, as mentioned above in the general comments, by mountains/plains, link length, stratiform/convective.
L241: Lepida S.c.p.A. (the company that owns the CML network). -> mobile network operator Lepida S.c.p.A.
L249: by the CML network -> in a network management system
L250: What is basically constant? Can you mention a range, i.e. -0.1 to +0.1 dB?
Textual comments:
L3: measurement -> measurements
L4: in 1 min -> of 1 min
L10: I would use “unique” instead of “peculiar”
L10: datasets: -> datasets.
L16: hereinafter -> hereafter
L17: by the -> by a, or in a
L21: “well known” is subjective, remove
L65: OpenSense has been -> OpenSense, a COST Action project that ended in October 2025, brought together..
L138: Joss and Waldvogel (1988) -> (Joss and Waldvogel, 1988)
L182: lapse? Do you mean interval?
L213: to to -> to
L214: estimate -> estimates
Comments on the README on Zenodo: