A 16-year global climate data record of total column water vapour generated from OMI observations in the visible blue spectral range

Abstract. We present a long-term data set of 1° × 1° monthly mean total column water vapour (TCWV) based on global measurements of the Ozone Monitoring Instrument (OMI) covering the time range from January 2005 to December 2020. In comparison to the retrieval algorithm of Borger et al. (2020), several modifications and filters have been applied to account for instrumental issues (such as OMI's "row-anomaly") and the inferior quality of the solar reference spectra. For instance, to overcome the problems of low-quality reference spectra, the daily solar irradiance spectrum is replaced by an annually varying mean Earthshine radiance obtained in December over Antarctica. For the TCWV data set, only measurements are taken into account for which the effective cloud fraction is < 20%, the AMF is > 0.1, the ground pixel is snow- and ice-free, and the OMI row is not affected by the "row-anomaly" over the complete time range of the data set. The individual TCWV measurements are then gridded to a regular 1° × 1° lattice, from which the monthly means are calculated.

In a comprehensive validation study we demonstrate that the OMI TCWV data set is in good agreement with the reference data sets of ERA5, RSS SSM/I, and ESA CCI Water Vapour CDR-2: over ocean, ordinary least squares (OLS) as well as orthogonal distance regressions (ODR) indicate slopes close to unity with very small offsets and high correlation coefficients of around 0.98. However, over land, distinctive positive deviations are obtained, especially within the tropics, with relative deviations of approximately +10%, likely caused by uncertainties in the retrieval input data (surface albedo, cloud information) due to frequent cloud contamination in these regions. Nevertheless, a temporal stability analysis shows that the OMI TCWV data set is consistent with the temporal changes of the reference data sets and has no significant deviation trends.

Since the TCWV retrieval can easily be applied to further satellite missions, additional TCWV data sets can be created from past missions such as GOME-1 or SCIAMACHY, which, under consideration of systematic differences (e.g. due to different observation times), can be combined with the OMI TCWV data set in order to create a data record that would cover a time span from 1995 to the present. Moreover, the TCWV retrieval will also work for all future missions dedicated to NO2, such as Sentinel-5 on MetOp-SG.

Competing interests. The authors declare that they have no conflict of interest.

Acknowledgements. The combined microwave and near-infrared imager based product COMBI was initiated, funded and provided by the Water Vapour project of the ESA Climate Change Initiative, with contributions from Brockmann Consult, Spectral Earth, Deutscher Wetterdienst and the EUMETSAT Satellite Application Facility on Climate Monitoring (CM SAF). The combined MW and NIR product will be owned by EUMETSAT CM SAF and will be released by CM SAF in late 2021. In particular, we would like to thank Marc Schröder and the ESA CCI WV team for providing the CDR TCWV and common mask data.

Introduction
Water vapour is the most important natural greenhouse gas in the Earth's atmosphere. It alters the Earth's energy balance by playing a dominant role in the atmospheric thermal opacity and has a major amplifying influence on several factors of anthropogenic climate change through various feedback mechanisms (Kiehl and Trenberth, 1997; Randall et al., 2007; Trenberth et al., 2009). Despite its great importance, not only for processes on the global and climate scale, the complex interactions between the components of the hydrological cycle (including water vapour) and the atmosphere are still one of the major challenges for climate modelling and for a better understanding of the Earth's climate system in general (Stevens and Bony, 2013). Moreover, the amount and distribution of water vapour are highly variable, so that global observations must also be made with high spatiotemporal resolution. Changes in water vapour are closely linked to changes in temperature via the Clausius-Clapeyron equation, i.e. for typical atmospheric conditions a temperature increase of 1 K yields an increase in the water vapour concentration of approximately 6-7% (Held and Soden, 2000). It is therefore essential to accurately monitor the variability and change of the amount and distribution of water vapour on the global scale.

To observe the water vapour distribution on the global scale, satellite measurements provide invaluable information. Due to its spectroscopic absorption properties, water vapour can be retrieved from satellite spectra in various spectral ranges, from the radio (e.g. Kursinski et al., 1997), microwave (e.g. Rosenkranz, 2001), thermal infrared (e.g. Susskind et al., 2003), and short and near-infrared (e.g. Bennartz and Fischer, 2001; Gao and Kaufman, 2003) to the visible spectral range (e.g. Noël et al., 1999; Lang et al., 2003; Wagner et al., 2003; Grossi et al., 2015).

Within the past decade, substantial progress has been made in retrieving total column water vapour (TCWV) in the visible blue spectral range (e.g. Wagner et al., 2013; Wang et al., 2019; Borger et al., 2020; Chan et al., 2020). This allows the use of measurements from satellite instruments like TROPOMI (Veefkind et al., 2012) and even GOME-2 (Munro et al., 2016), for which so far only retrievals in the visible red and near-infrared spectral range have been available. In comparison to these aforementioned spectral ranges, TCWV retrievals in the visible "blue" have several advantages, for instance a similar sensitivity to the near-surface layers over land and ocean due to a more homogeneous surface albedo distribution than at longer wavelengths (Koelemeijer et al., 2003; Wagner et al., 2013; Tilstra et al., 2017). Moreover, any satellite mission dedicated to NO2 monitoring covers this spectral range.
For investigations of climate change or global warming, the Ozone Monitoring Instrument (Levelt et al., 2006, 2018) onboard NASA's Aura satellite is particularly interesting: launched in July 2004, it offers an almost continuous measurement record of more than 16 years up until today. In this study, we make use of OMI's long-term data record and retrieve total column water vapour (TCWV) from its measurements in the visible blue spectral range in order to generate a climate data set.
The paper is structured as follows: in Sect. 2 we describe the data set generation and briefly explain the retrieval methodology and the applied modifications in comparison to the TCWV retrieval of Borger et al. (2020). Then, in Sect. 3 we characterize the data set via a validation study against the various reference TCWV data sets and also analyze its temporal stability. Finally, we briefly summarize our results in Sect. 4 and draw conclusions.

The Ozone Monitoring Instrument OMI (Levelt et al., 2006, 2018) onboard NASA's Aura satellite is a nadir-looking UV-vis pushbroom spectrometer that measures the Earth's radiance spectrum from 270-500 nm with a spectral resolution of approximately 0.5 nm, following a sun-synchronous orbit with an equator crossing time around 13:30 LT. The instrument employs a 2D CCD consisting of 60 across-track rows, which in total cover a swath width of approximately 2600 km with a spatial resolution of 24 km × 13 km at nadir, increasing to 24 km × 160 km towards the edges of the swath. Launched in July 2004, OMI provides an almost continuous measurement record until today with more than 90000 orbits.

However, since July 2007 OMI has suffered from the so-called "row-anomaly", a dynamic artefact causing abnormally low radiance readings in the across-track rows, i.e. several rows of the CCD detector receive less light from the Earth, and some other rows appear to receive sunlight scattered off a peeling piece of spacecraft insulation. One plausible explanation for these effects is a partial obscuration of the entrance port by insulating layer material that may have come loose on the outside of the instrument (Schenkeveld et al., 2017; Boersma et al., 2018). Thus, in this study, the affected measurements are excluded for the entire period of the evaluation.

Methodology and modifications of the spectral analysis
To retrieve total column water vapour (TCWV) from OMI UV-vis spectra, we apply the TCWV retrieval of Borger et al. (2020) developed for the TROPOspheric Monitoring Instrument (TROPOMI) onboard Sentinel-5P. The retrieval is based on the principles of Differential Optical Absorption Spectroscopy (DOAS, Platt and Stutz, 2008) with a fit window of 430-450 nm and consists of the common two-step DOAS approach. First, the absorption along the light path is calculated:

$$\ln\left(\frac{I(\lambda)}{I_0(\lambda)}\right) = \sum_i \sigma_i(\lambda)\,\mathrm{SCD}_i + \Psi + \Phi \tag{1}$$

where $I$ and $I_0$ represent the solar irradiance and the radiance backscattered from Earth, respectively, $i$ denotes the index of a trace gas of interest, $\sigma_i(\lambda)$ its respective molecular absorption cross section, $\mathrm{SCD}_i = \int_s c_i\,\mathrm{d}s$ its concentration integrated along the light path $s$ (the so-called slant column density), $\Psi$ summarizes terms accounting for the Ring effect and additional pseudo-absorbers, and $\Phi$ is a closure polynomial accounting for Mie and Rayleigh scattering as well as parts of the low-frequency contributions of the trace gas cross sections.

Second, to convert the slant column density to a vertical column density (VCD), we apply the so-called airmass factor (AMF):

$$\mathrm{VCD} = \frac{\mathrm{SCD}}{\mathrm{AMF}} \tag{2}$$

The AMF accounts for the non-trivial effects of atmospheric radiative transfer and depends on the conditions of the retrieval scenario (i.e. aerosol and cloud effects, viewing geometry, and surface properties) as well as the profile shape of the trace gas of interest. The algorithm of Borger et al. (2020) makes use of the relation between the H2O VCD and the profile shape and iteratively finds the optimal VCD by assuming an exponential water vapour profile shape.

For the application to OMI, the spectral analysis is modified in comparison to Borger et al. (2020). For climate studies such as trend analyses it is essential to provide a consistent data record. Thus, all rows that have ever been affected by the so-called "row-anomaly" are excluded from the data set for the complete time series, which corresponds to approximately half of the OMI swath. Also, instead of a daily solar irradiance, an Earthshine radiance is used as reference spectrum within the DOAS analysis. The rationale for using an Earthshine radiance rather than a solar irradiance is as follows:

- The daily OMI solar irradiance spectra (OML1BIRR version 3) are very noisy and have several gaps, causing high H2O SCD fit errors and thus leading to an overall poor quality of the H2O VCD data set.

- By using an annual mean solar irradiance spectrum from the year 2005 (also used during the QA4ECV project; Boersma et al., 2018) a good fit quality can be obtained. However, OMI also suffers from degradation effects (Schenkeveld et al., 2017). Thus, for climate trend analyses it would be almost impossible to disentangle whether a trend signal originates from the spectral degradation of OMI or indeed from a geophysical trend (see also Fig. A1). By using an Earthshine radiance as reference spectrum these degradation effects largely cancel out.

- By using an Earthshine radiance as reference spectrum, the across-track biases within the OMI swath are also strongly reduced (see Panel (c) in Fig. 1) and consequently no destriping is necessary during post-processing (see also Anand et al., 2015).
- However, as a disadvantage of the use of Earthshine spectra, the retrieved H2O slant columns do not represent absolute slant columns because the Earthshine reference spectra also contain H2O absorptions. Hence, a slant column representative for the chosen reference sector has to be added to the retrieved values.
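The last two points can be made explicit with a short derivation. This is only a sketch: it drops the Ring and polynomial terms, and the symbols $I_\mathrm{sun}$ (solar irradiance), $I_\mathrm{earth}$ (measured radiance), $I_\mathrm{ref}$ (Earthshine reference radiance), $\mathrm{SCD}_{i,\mathrm{ref}}$ and the degradation factor $d(\lambda)$ are introduced here purely for illustration.

```latex
\begin{align}
\ln\frac{I_\mathrm{sun}(\lambda)}{I_\mathrm{earth}(\lambda)}
  &\approx \sum_i \sigma_i(\lambda)\,\mathrm{SCD}_i ,
&
\ln\frac{I_\mathrm{sun}(\lambda)}{I_\mathrm{ref}(\lambda)}
  &\approx \sum_i \sigma_i(\lambda)\,\mathrm{SCD}_{i,\mathrm{ref}} \\
\Rightarrow\quad
\ln\frac{I_\mathrm{ref}(\lambda)}{I_\mathrm{earth}(\lambda)}
  &\approx \sum_i \sigma_i(\lambda)\,
     \bigl(\mathrm{SCD}_i - \mathrm{SCD}_{i,\mathrm{ref}}\bigr) \\
\ln\frac{d(\lambda)\,I_\mathrm{ref}(\lambda)}{d(\lambda)\,I_\mathrm{earth}(\lambda)}
  &= \ln\frac{I_\mathrm{ref}(\lambda)}{I_\mathrm{earth}(\lambda)}
\end{align}
```

Under these assumptions the fit against an Earthshine reference returns the difference $\mathrm{SCD}_i - \mathrm{SCD}_{i,\mathrm{ref}}$, which is why the slant column of the reference sector has to be added back, and any multiplicative instrument factor common to both radiances drops out of the ratio.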
For the creation of annual Earthshine reference spectra we selected the Antarctic continent as reference sector (high surface albedo due to snow and ice cover) and the month of December (i.e. during austral summer), yielding a relatively high signal-to-noise ratio for our radiance measurements despite large solar zenith angles. Furthermore, only pixels at an altitude of more than 2000 m above sea level are selected: as the air temperatures are very low there, the water vapour concentrations are very low as well, thus representing a reference atmosphere that is as dry as possible (i.e. the reference SCD, or more precisely the absolute value of its uncertainty, has to be as small as possible). Also, to avoid the inclusion of noisy measurements (in particular from the descending part of the OMI orbit), only pixels with a solar zenith angle (SZA) < 80° are considered. From these measurements we calculate the monthly mean radiance for December of each year for every OMI row and then use the resulting reference spectra for the retrievals of the upcoming year. Further details about destriping in general and a comparison of the temporal behaviour of the irradiance-based and Earthshine SCD are available in Appendix A.
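The selection and averaging described above could be sketched as follows. This is a minimal illustration, not the processing code used for the data set: the pixel dictionaries and their keys (`row`, `month`, `altitude_m`, `sza_deg`, `radiance`) are hypothetical, and the input is assumed to be already restricted to pixels over the Antarctic continent.

```python
def earthshine_reference(pixels, min_altitude_m=2000.0, max_sza_deg=80.0):
    """Average December radiance spectra separately for every OMI row.

    `pixels` is an iterable of dicts with the hypothetical keys
    'row', 'month', 'altitude_m', 'sza_deg' and 'radiance'
    (a list of radiances, one per wavelength). The pixels are assumed
    to lie over Antarctica already.
    """
    sums = {}    # row -> summed spectrum
    counts = {}  # row -> number of accepted pixels
    for p in pixels:
        if p["month"] != 12:
            continue  # December only (austral summer)
        if p["altitude_m"] <= min_altitude_m:
            continue  # keep only the dry, elevated plateau
        if p["sza_deg"] >= max_sza_deg:
            continue  # drop noisy measurements at large SZA
        row = p["row"]
        if row not in sums:
            sums[row] = [0.0] * len(p["radiance"])
            counts[row] = 0
        for k, value in enumerate(p["radiance"]):
            sums[row][k] += value
        counts[row] += 1
    # Row-wise mean spectrum, used as reference for the following year.
    return {row: [s / counts[row] for s in spec] for row, spec in sums.items()}
```

Keeping one spectrum per row, rather than one for the whole swath, is what lets the reference absorb row-dependent instrument features.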

VCD conversion and data set generation
To account for the potential water vapour contamination within the Earthshine reference spectra, the SCDs based on the Earthshine reference have to be corrected for the corresponding offset. In this study, we determine this offset for each row based on the difference between the solar irradiance based SCDs and the Earthshine based SCDs for the first 5 years of OMI operation (see Appendix A). Equation (2) can then be rewritten as:

$$\mathrm{VCD} = \frac{\mathrm{eSCD} + \mathrm{SCD}_\mathrm{offset}}{\mathrm{AMF}} \tag{3}$$

where eSCD denotes the SCD derived using the Earthshine reference and $\mathrm{SCD}_\mathrm{offset}$ the row-dependent offset described above.
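As a small numerical illustration of this conversion, the following sketch adds a row-dependent offset to an Earthshine-based SCD and divides by the AMF. The function name, the argument names and the example values are assumptions for illustration only; the iterative AMF determination of Borger et al. (2020) is not reproduced here.

```python
def earthshine_vcd(escd, row_offset, amf, min_amf=0.1):
    """Convert an Earthshine-based slant column into a vertical column.

    escd       : SCD fitted against the Earthshine reference
    row_offset : offset for the OMI row of this pixel (same units as escd)
    amf        : airmass factor, assumed to be given
    Returns None when the AMF does not exceed `min_amf`, mirroring the
    AMF > 0.1 filter of the data set.
    """
    if amf <= min_amf:
        return None
    return (escd + row_offset) / amf


# Illustrative values only, e.g. in molec cm^-2.
vcd = earthshine_vcd(escd=1.0e23, row_offset=0.2e23, amf=1.2)
```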
The AMFs are calculated as described in Borger et al. (2020). For the determination of the AMF, additional information about the retrieval scenario, such as cloud cover and surface properties, is necessary. We use the cloud information from the OMI L2 NO2 product (OMNO2, Lamsal et al., 2021) and the modified OMI surface albedo version of Kleipool et al. (2008) as described in Borger et al. (2020). We also tested the surface albedo information from the OMNO2 product. However, within the framework of a trend analysis study (Borger et al., 2021a) we observed spatial artefacts in the surface albedo trends which likely arise [...] distribution of TCWV trends is mainly determined by the trends in the SCD. The albedo or AMF trends usually only determine whether the trend signal becomes stronger or weaker, but this only affects trends over land, since an albedo climatology from Kleipool et al. (2008) is used over ocean. As the ice flags from the OMI processor sometimes indicate snow/ice-free surfaces over Antarctica or Greenland, we additionally use the monthly mean sea ice cover information from ERA5 (Hersbach et al., 2020) and the annual mean land cover information from MODIS Aqua (Sulla-Menashe et al., 2019).

To create the OMI TCWV data set, we have chosen the time range from January 2005 to December 2020 and only include observations with an effective cloud fraction < 20% and AMF > 0.1. Furthermore, the pixels have to be free of snow and ice and must not be affected by the row anomaly. The results of every orbit are then gridded to a 1° × 1° lattice for every day. From these daily grids, the monthly mean H2O VCD distributions are then calculated, ensuring that a continuous TCWV time series is available for as many grid cells as possible.
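The filtering and averaging chain of this paragraph can be sketched as below. Several details are assumptions made for illustration: the observation keys are hypothetical, pixels are binned by their centre coordinate only (no footprint weighting), and the monthly mean is taken as the unweighted mean of the daily grid values.

```python
import math
from collections import defaultdict


def keep(obs):
    """Filter criteria of the data set, applied to a single observation."""
    return (obs["cloud_fraction"] < 0.20
            and obs["amf"] > 0.1
            and not obs["snow_ice"]
            and not obs["row_anomaly"])


def cell_index(lat, lon):
    """Lower-left corner (in degrees) of the 1 x 1 degree cell of a pixel centre."""
    lon = (lon + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    return min(89, math.floor(lat)), math.floor(lon)


def monthly_mean(observations):
    """Grid one month of observations daily, then average the daily grids."""
    daily = defaultdict(list)  # (day, cell) -> VCDs of that day and cell
    for obs in observations:
        if keep(obs):
            key = (obs["day"], cell_index(obs["lat"], obs["lon"]))
            daily[key].append(obs["vcd"])
    per_cell = defaultdict(list)  # cell -> list of daily means
    for (day, cell), values in daily.items():
        per_cell[cell].append(sum(values) / len(values))
    return {cell: sum(v) / len(v) for cell, v in per_cell.items()}
```

Averaging daily grids first gives every day the same weight in the monthly mean, regardless of how many pixels fall into a cell on a given day.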
Figure 2 shows the global mean OMI H2O VCD averaged over the complete time range of the TCWV data set. The resulting distribution demonstrates that the retrieval is capable of capturing the macroscale water vapour patterns, like high VCD values in the tropics (in particular over the maritime continent) and low values towards the polar regions, but also characteristic regional patterns like the South Pacific convergence zone.

To evaluate the overall quality of the OMI TCWV data set, we conducted a validation study for which we use the merged, 1-degree total precipitable water (TPW) data set version 7 from Remote Sensing Systems (RSS) (Mears et al., 2015; Wentz, 2015), TCWV data from the reanalysis model ERA5 (Hersbach et al., 2019, 2020), and the ESA Water Vapour CCI climate data record CDR-2 as reference. For the correlation analysis we perform an ordinary least-squares (OLS) linear regression and an orthogonal distance regression (ODR). In the case of the ODR it is necessary to use reasonable ratios of the relative errors of the compared data sets instead of absolute errors in order to obtain meaningful results. Thus, for the sake of simplicity, we assume that the relative errors of the reference data sets are 5% over ocean and 10% over land, and that the relative error of the OMI TCWV data set is 20%. We also tested other variants of error assumptions and found that the exact choice of errors has a negligible effect on the regression results.
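To illustrate the difference between the two regression types, the sketch below compares OLS with a Deming regression. Note that Deming regression is a simplified stand-in chosen here, not the ODR implementation of the study: it accounts for errors in both variables through a single constant error-variance ratio `delta`, whereas relative errors strictly imply value-dependent uncertainties. Setting `delta = (0.20 / 0.05) ** 2` for the ocean case is likewise an assumption.

```python
import math


def _moments(x, y):
    """Means and centred sums of squares and products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


def ols(x, y):
    """Ordinary least squares: errors are assumed in y only."""
    mx, my, sxx, _, sxy = _moments(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx


def deming(x, y, delta=1.0):
    """Deming regression; delta = var(y errors) / var(x errors).

    delta = 1 corresponds to orthogonal regression.
    """
    mx, my, sxx, syy, sxy = _moments(x, y)
    d = syy - delta * sxx
    slope = (d + math.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)
    return slope, my - slope * mx


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    _, _, sxx, syy, sxy = _moments(x, y)
    return sxy / math.sqrt(sxx * syy)
```

Here `x` would hold the reference TCWV and `y` the OMI TCWV of the matched grid cells. For noise-free data both fits return the same line; with scatter, OLS tends to give the smaller slope because it attributes all errors to `y`.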

Intercomparison to ERA5
For further validation, we also compare the OMI TCWV data to ERA5 (Hersbach et al., 2020). To account for OMI's observation time (around 13:30 LT), we only take into account ERA5 monthly mean values between 13:00-14:00 LT. The results of the intercomparison to ERA5 are depicted in Figure 5. To investigate potential dependencies on the surface type, we separated the data into data over ocean (top row in Fig. 5) and data over land (bottom row in Fig. 5). The intercomparison for data over ocean reveals similar results to the intercomparison between OMI and RSS: the results of the OLS and ODR indicate a slight overestimation (slopes of around 1.05 and 1.08) together with a correlation coefficient close to unity (R of around 0.98).
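One way to realise such a local-time selection is sketched below. It assumes hourly UTC fields and approximates local time by mean solar time derived from longitude; the text does not state how the selection was actually implemented, so the function and its parameters are hypothetical.

```python
def utc_hours_for_local_window(lon_deg, start_lt=13.0, end_lt=14.0):
    """Integer UTC hours whose mean solar time lies in [start_lt, end_lt).

    Mean solar time is approximated as UTC + longitude / 15 h,
    with longitude in degrees east.
    """
    hours = []
    for utc in range(24):
        local = (utc + lon_deg / 15.0) % 24.0
        if start_lt <= local < end_lt:
            hours.append(utc)
    return hours
```

For a grid cell at 90° W, for example, the 19 UTC field would be used, since 19 h minus 6 h gives 13:00 LT.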
Moreover, the periodic pattern of positive deviations in the tropics occurs again.

In contrast, the regression results for the intercomparison for data over land reveal a distinctive overestimation of about 11% for the OLS and slightly less for the ODR (8%), but still with a high correlation coefficient of 0.97. Interestingly, the distribution [...]

Thus, considering these large uncertainties in the OMI retrieval and that the uncertainties in ERA5 for data over tropical landmasses are no longer negligible, we conclude that the OMI TCWV data set can well represent the global distribution of the atmospheric water vapour content, at least over ocean. Over land, however, the data set should be treated with caution due to the systematic positive deviations from the reference data sets, especially in areas of high TCWV values (i.e. above 26 kg m⁻²).
An additional comparison in which particularly critical regions were filtered using the ESA WV CCI CDR "common mask" (see Fig. B1) is given in Appendix B. When this mask is applied, only high-quality measurements are taken into account for the intercomparison. Consequently, the regression results for the comparison over land improve significantly (see Fig. B3), so that the slopes now vary between 0.95 and 1.02, which is closer to the results of the piecewise-linear regression for TCWV < 26 kg m⁻².

Intercomparison to ESA Water Vapour CCI climate data record
In addition to RSS SSM/I and ERA5 we compare the OMI TCWV data to the ESA Water Vapour CCI climate data record [...] mostly affected by frequent cloud cover (see Fig. 8). However, we also observe systematic overestimations along coastlines (e.g. Central America), which possibly arise from sampling issues of the different satellite products, and in some mountain regions (e.g. Himalaya).
In Appendix B we present a comparison in which critical regions were filtered using the "common mask" from the ESA WV CCI CDR. When this mask is applied, there are clear improvements for the comparison over land: instead of an overestimation of 11-14%, a good agreement is obtained with slopes now between 0.97 and 1.04 (see Fig. B3), which agrees quite well with the slopes obtained for the piecewise-linear regression for TCWV < 26 kg m⁻².

Temporal stability
In addition to a good agreement with existing reference data sets, temporal stability is an important property of a climate data [...] to the RSS SSM/I and ERA5 data sets, as these two cover the complete time range of the OMI TCWV data set. For the sake of completeness, however, we also show the results for the ESA WV CCI CDR.
To assess the stability of the OMI TCWV data set, we derive the global mean relative deviation between the OMI TCWV data set and each reference data set for every time step and then calculate temporal trends of these deviations using linear regression, following the approach of Danielczok and Schröder (2017) and Beirle et al. (2018). For the calculation of global means, only those grid cells are taken into account for which data from both the OMI TCWV data set and the reference data set are available at every time step. In the case of the ESA WV CCI CDR-2 a "common mask" has been provided (see also Fig. B1).

For the time series until the end of the reference data set we find trends of −0.06 % dec⁻¹ for the comparison to RSS SSM/I and −0.18 % dec⁻¹ for the comparison to ERA5; these trends are not significantly different from 0 % dec⁻¹. For the comparison to the ESA data there is a stronger trend (around −0.52 % dec⁻¹) than for the other two data sets; however, the time range is also much shorter and does not cover the complete time range of the OMI TCWV data set. Altogether, the obtained trends of the relative deviations are in line with typical stability requirements for climate data products of ±1 % dec⁻¹ (see e.g. Beirle et al., 2018, and references therein).
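The stability analysis can be sketched as follows. The exact definition of the global mean relative deviation is not reproduced in the text, so this sketch assumes the unweighted mean of the per-cell relative deviations (no area weighting) and a plain OLS trend; the data layout, with one dict of grid cells per time step, is hypothetical.

```python
def common_cells(omi_series, ref_series):
    """Grid cells with valid OMI and reference data at every time step."""
    cells = None
    for omi, ref in zip(omi_series, ref_series):
        valid = set(omi) & set(ref)
        cells = valid if cells is None else cells & valid
    return cells or set()


def deviation_trend(times_yr, omi_series, ref_series):
    """Relative deviation per time step (%) and its trend (% per decade).

    times_yr   : time of each step in decimal years
    omi_series : list of dicts {cell: TCWV}, one per time step
    ref_series : same layout for the reference data set
    """
    cells = sorted(common_cells(omi_series, ref_series))
    if not cells:
        raise ValueError("no grid cell is valid at every time step")
    dev = []
    for omi, ref in zip(omi_series, ref_series):
        rel = [100.0 * (omi[c] - ref[c]) / ref[c] for c in cells]
        dev.append(sum(rel) / len(rel))
    # Ordinary least-squares slope of deviation against time.
    n = len(times_yr)
    mt, md = sum(times_yr) / n, sum(dev) / n
    num = sum((t - mt) * (d - md) for t, d in zip(times_yr, dev))
    den = sum((t - mt) ** 2 for t in times_yr)
    return dev, 10.0 * num / den  # per year -> per decade
```

Restricting the mean to cells that are valid at every time step prevents changes in sampling from appearing as a spurious trend.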

In this study, we present a 16-year data record of total column water vapour (TCWV) retrieved from OMI observations in the visible blue spectral range by means of Differential Optical Absorption Spectroscopy. To derive TCWV from OMI measurements, we applied the TCWV retrieval developed for TROPOMI (Borger et al., 2020) and modified the spectral analysis to account for the degradation of OMI's daily solar irradiance. To this end, annual Earthshine reference spectra were calculated from radiance measurements over Antarctica during December (austral summer).

Within a validation study, the OMI TCWV data set proves to be in good agreement with the reference data sets of RSS SSM/I, ERA5, and the ESA WV CCI CDR-2, in particular over ocean. However, over land the OMI data set systematically overestimates the TCWV content compared to ERA5 and the ESA CDR by approximately 10%, especially in the tropical regions affected by frequent cloud cover. The reasons for these overestimations are manifold, but they are likely due to an overestimation by the OMI TCWV retrieval caused by uncertainties in the retrieval input data (surface albedo, cloud information) on the one hand and an underestimation by the reference data caused by missing or uncertain observations on the other hand. Nevertheless, the validation also shows that good agreement with the reference data can be obtained for TCWV < 26 kg m⁻² and when regions of large uncertainty are filtered. In the temporal stability analysis, no significant deviation trends were found with respect to ERA5 and RSS SSM/I, which demonstrates that the OMI TCWV data set is well suited for climate studies.

Altogether, the OMI TCWV data set provides a promising basis for investigations of climate change: on the one hand, it covers a long time series (more than 16 years, with the instrument still in operation), and on the other hand, these measurements are based on a single instrument, so that no bias corrections between different sensors need to be taken into account (e.g. in trend analysis studies). Although OMI is affected by degradation effects, we were able to successfully suppress these effects by using Earthshine reference spectra. Furthermore, the data set is based on a retrieval in the visible blue spectral range, where a similar sensitivity to the near-surface layers over ocean and land is given, and thus a consistent global data set can be obtained from measurements of only one sensor.
In the future, we plan to complement the data set with TCWV measurements from TROPOMI to ensure the continuation of the data set after the end of the OMI mission. Since the TCWV retrieval can easily be applied to other UV-vis satellite instruments, additional data sets can be created from past and present missions such as GOME-1/2 and SCIAMACHY, but also from future instruments such as Sentinel-5 on MetOp-SG, and eventually combined with the OMI TCWV data set, taking into account the different instrumental properties (e.g. observation time). This would allow the construction of a data record that extends from 1995 to today. Similarly, a combination of data from low-Earth-orbit satellites and geostationary satellite instruments such as GEMS, TEMPO or Sentinel-4 could be a promising option to fill temporal gaps in daily observations, but also to investigate (semi-)diurnal cycles of the water vapour distribution.

Data availability
The MPIC OMI total column water vapour (TCWV) climate data record is available at https://doi.org/10.5281/zenodo.5776718 (Borger et al., 2021b).

[...] For the case of an Earthshine reference this is already implicitly accounted for during the spectral analysis. However, one still has to consider that the Earthshine reference spectrum is not perfectly free of the trace gas of interest. For example, in our case, although the water vapour concentrations in Antarctica are very low, the Earthshine reference might still be contaminated because of the long light path at such high solar zenith angles. To get an overview of how the SCD difference (i.e. solar irradiance based minus Earthshine SCD) behaves with time over the complete OMI swath, Fig. A2 [...]

Appendix B: Validation taking into account the common mask from ESA WV CCI CDR

The validation in Sect. 3 also takes into account regions for which only a small number of measurements are available, for example due to frequent cloud cover or the seasonality of the solar zenith angle. On the one hand, the small sample size leads to a higher statistical uncertainty of the monthly mean, and on the other hand also to a non-continuous time series when data are missing for a complete month. Moreover, the errors of the individual measurements are also significantly larger in these regions. With the help of the "common mask" of the ESA WV CCI CDR-2 (see Fig. B1), these regions can be identified and filtered for an additional validation.

The results of the validation with the "filtered" data are shown in Fig. B2 for data over ocean and in Fig. B3 for data over land. For all comparisons, the correlation coefficients remain at approximately the same level (i.e. over 0.95) as for the non-"filtered" comparison. For the comparison over ocean we obtain a slight improvement, so that overall the slopes are closer to unity. However, there is a remarkable improvement for the comparison over land: instead of a distinctive overestimation of 8-14%, the slopes now vary between 0.95 and 1.04, but this is also associated with an increase in the y-axis intercept. Altogether, the results for the "filtered" comparison over land also agree very well with the findings of the piecewise-linear regression, for which similar slopes were found for TCWV < 26 kg m⁻².

Figure B2. Correlation analysis of the OMI TCWV data set and RSS SSM/I, ERA5, and the ESA WV CCI CDR-2 for data over ocean, taking into account only valid grid cells according to the "common mask" in Figure B1.