Two decades of flask observations of atmospheric δ ( O 2 / N 2 ) , CO 2 , and APO at stations Lutjewad ( the Netherlands ) and Mace Head ( Ireland ) , and 3 years from Halley station ( Antarctica )

We present 20-year flask sample records of atmospheric CO2, δ(O2/N2) and APO (Atmospheric Potential 15 Oxygen) from the stations Lutjewad (the Netherlands) and Mace Head (Ireland), and a 3-year record from Halley Station (Antarctica). We include details of our calibration procedures and the stability of our calibration scale over time, which we estimate to be less than 3 per meg over the 14 years of calibration, and our compatibility with the international Scripps O2 scale. The measurement records from Lutjewad and Mace Head show similar long-term trends during the period 2002-2018 of 2.31 ± 0.07 ppm yr-1 for CO2 and -21.2 ± 0.8 per meg yr-1 for δ(O2/N2) at Lutjewad, and 2.22 ± 20 0.04 ppm yr-1 for CO2 and -21.3 ± 0.9 per meg yr-1 for δ(O2/N2) at Mace Head. They also show a similar δ(O2/N2) seasonal cycle with an amplitude of 54 ± 4 per meg at Lutjewad and 61 ± 5 per meg at Mace Head, while the CO2 seasonal amplitude at Lutjewad (16.8 ± 0.5 ppm) is slightly higher than that at Mace Head (14.8 ± 0.3 ppm). We show that the observed long-term trends and seasonal cycles are in good agreement with the measurements from various other stations, especially the measurements from Weybourne Atmospheric Observatory (United Kingdom). However, 25 there are remarkable differences in the progression of annual trends between the Mace Head and Lutjewad records for δ(O2/N2) and APO, which might in part be caused by sampling differences, but also by environmental effects, such as North Atlantic Ocean oxygen ventilation changes to which Mace Head is more sensitive. The Halley record shows clear trends and seasonality in δ(O2/N2) and APO, where especially APO agrees well with continuous measurements at the same location made by the University of East Anglia, while CO2 and δ(O2/N2) present slight disagreements, most likely 30 caused by small leakages during sampling. From our 2002-2018 records, we find a good agreement with the Global Carbon Budget for the global ocean carbon sink: 2.1 ± 0.8 PgC yr-1, based on the Lutjewad record. The data presented in this work are available at https://doi.org/10.18160/qq7d-t060 (Nguyen et al., 2021).

and O2 cycles are closely coupled -in most processes, there is an anti-correlation in the changes of their mole fraction, except for the oceanic uptake of CO2 . To quantify the various components of the global carbon cycle, the changes in atmospheric mole fraction of the two species can be used in combination with their stoichiometric exchange ratio (ER), which is the ratio of CO2 and O2 exchanged (consumed/produced) in a process.

45
The ER value varies depending on the process, and is close to 1.1 for photosynthesis/respiration (Severinghaus, 1995) and on average 1.38 for the global mix of fossil fuels (Keeling and Manning, 2014).
Despite many improvements to these techniques over the years, it is still very challenging to obtain O2 measurements with high accuracy and precision. This is mainly because the atmospheric background mole fraction of O2 is very high -around 209,392 ± 3 ppm (Tohjima et al., 2005) -while the observed variations are at the level of a few ppm. These challenges are magnified further for long-term measurements because of possible small biases, drifts or other changes 55 in the analysers or in the calibration scales. Thus the sampling procedures and analysing (laboratory) conditions must be monitored and corrected for by a carefully designed use of calibration and reference gas cylinders over the years (Aoki et al., 2021). As a result, there are only a handful of programmes around the globe which are proficient in coupled CO2 and O2 measurements, for example, the network of atmospheric stations maintained by the Scripps Institution of Oceanography ; National Institute of Advanced Industrial Science and Technology (Aoki et al., 2021); National Institute for Environmental Studies (Tohjima et al., 2008); Tohoku University (Goto et al., 2017); University of East Anglia (UEA) (Pickers et al., 2017); and the University of Groningen (van Der . Our laboratory -the Centre for Isotope Research (CIO) of the University of Groningen (RUG) in the Netherlands -has been carrying out flask measurements of CO2 and O2 since the early 2000s from various locations . Flask sampling for CO2 and O2 has been conducted at Lutjewad (the Netherlands), Mace Head (Ireland), Jungfraujoch (Switzerland) and Halley (Antarctica).
In this paper, we present the O2 and CO2 measurements from flasks collected at Lutjewad (the Netherlands), Mace Head (Ireland), both for the period 2000-2020, and Halley (Antarctica) for 2014-2017. From these measurements, a tracer called Atmospheric Potential Oxygen (APO) (the details of which are given in Sect. 2.5) is calculated. We first The Lutjewad station is a "class 2" station in the European Union's Integrated Carbon Observation System (ICOS) network. It comprises a 60-m tall tower, an additional platform of 10-m height, and a laboratory building containing analysers, flask sampling systems, measurement systems and other equipment. The dominant wind direction in the Netherlands is southwest, meaning that the measurements acquired at the Lutjewad station often represent continental 90 air masses influenced by anthropogenic and biogenic sources and sinks . Otherwise, when the wind comes from the north, the station samples background air that comes from the North Sea and North Atlantic .
The Mace Head station consists of field laboratories and a 20-m tower for sampling. The dominant wind arriving at the 95 station is westerly from the North Atlantic Ocean, carrying air masses that would not have been considerably affected by regional anthropogenic activities. Air masses from other directions carry contamination from local and continental sources (Derwent et al., 2002;Jennings et al., 1993). The Halley station is a "Global" station within the World Meteorological Organisation's Global Atmosphere Watch 100 (WMO/GAW) programme, that observes background atmospheric conditions at various locations around the globe.
The main Halley station consists of 8 modules that are atop ski-fitted hydraulic legs, within which are the research facilities and living quarters. Air sampling for this project was carried out at the Clean Air Sector Laboratory, which is located 1.5km from the main station in a location that receives minimal contamination from station activities (Jones et al., 2008). The predominant winds are from the east, bringing background air masses from the South Atlantic sector of 105 the Southern Ocean (60%) or from the continental plateaux (30%). Westerly winds that have passed over the Weddell Sea gyre occur 10% of the time (Barningham, 2018;British Antarctic Survey, 2021).

Flask sampling procedure
At Lutjewad, we employ an automated flask sampling system, hereafter called the autosampler .
Air is pumped from the top of the 60-m tower via inlets connected to a series of tubing towards the laboratory building.

110
The inlet is equipped with a Nafion drying tube (MD 110-72-S, Perma Pure, Toms River, New Jersey) so that the incoming air is first partly dried. The flow in the outer side of the Nafion tube is the outlet of the same air sampling system, after the air is dried with the second stage cryogenic dryer in the laboratory to a dewpoint below -45 °C . This ensures that, except for water, all constituents have a negligible gradient over the Nafion membrane.
From the inlet, the sampled air is stored in glass flasks via a flask sampling system for further analyses in the CIO laboratories . For storing air samples, we use 2.5-litre glass flasks with dip tubes, capped with two high-vacuum valves (Louwers, Hapert, NL) sealed with Viton o-rings (these flasks are also used at Mace Head and Halley). Our autosampler is designed to connect to and fill up to 20 flasks without requiring user intervention, and we can remotely control the opening/closing of the flask valves (via custom-made electric motor actuators) and the filling of samples (via a series including a small diaphragm pump (KNF N811), flow controllers, and magnetic solenoid 120 valves). The autosampler schedule is controlled via custom-made software (written in Delphi programming language), and carries out the sampling procedure automatically, but it can also be operated remotely using software such as VNC or TeamViewer when needed. A normal filling procedure starts with the air stream being cryogenically dried (to a dewpoint of -45°C) and flushed through a flask for at least an hour at 2.5 L min -1 before filling the flask slowly so that the sample remains at current atmospheric pressure (to prevent the sample from fractionation and differential 125 permeation through the o-rings caused by a pressure gradient (Sturm et al., 2004)) and moving to the next flask.
Individual flasks can be preserved at any time. Samples at Lutjewad are collected under various conditions and time frequencies, but in this paper we present only the data from flasks collected under local background conditions, defined by van Der  as flasks taken while the 222 Radon activity monitored at the station was less than 3 Bq m -3 and with a CO mole fraction of less than 200 ppb. This filtering procedure is applied to the dataset after the 130 flasks are analysed.
We employ the same type of flasks, flow rates, and filling pressure (to current atmospheric pressure) at all stations. Due to different setup of the stations, the drying methods are different, and only Halley station has an aspirated inlet.

135
At Mace Head, flasks are collected once or twice per week via a manually operated system as described by Conway et al. (1994), at 35 m above sea level and mostly during restricted baseline conditions (Bousquet et al., 1996). A sampling sequence starts with the air being pumped from the inlet via a small diaphragm pump (KNF N86KT), into a drying tube packed with magnesium perchlorate, then flushed through the flasks for about 30 minutes at 2.5 L min -1 at atmospheric pressure before each flask is manually closed. Also, for Mace Head, only flasks with a CO mole fraction of less than This consists of a diaphragm pump (KNF N86), flowmeter, drying agent (magnesium perchlorate), 7µm filter and 3 sampling flasks connected in concession. The air is sampled about 6 metres above the snow surface on the east side of 145 the building via Synflex tubing connected to an aspirated inlet (the details of the aspirated inlet are as described by Blaine et al. (2006)). The system is flushed for about 45 minutes at a flow rate of 2.5 L min -1 at atmospheric pressure before each flask is manually closed. The collected samples are stored in insulated aluminium boxes at room temperature until their annual return to the UK on the Antarctic supply ship.

150
After sample collection, flasks from the three stations are transferred back to our laboratory in Groningen for analysis.
Typically, the mole fractions of CH4, CO, CO2 and O2 (reported as δ(O2/N2), see next section) are measured (van Der Laan et al., 2009), and additional analyses such as stable isotopes (for example 13 C and 18 O in CO2) and radiocarbon ( 14 C in CO2) are also conducted when required .

CO2 measurement
All flask samples are analysed on an Agilent HP6890N gas chromatograph equipped with a Flame Ionization Detector (referred to as HPGC) to determine the mole fractions of CO2, CO and CH4. The HPGC system has a set-up similar to the GC-systems described by Worthy et al. (2003) and van Der Laan et al. (2009). All working standard mixtures (made 160 from dried ambient air) that were used to calibrate the HPGC have been calibrated on the HPGC system at CIO against a suite of 5 primary standards linked to the World Meteorological Organization (WMO) X2007 scale with CO2 ranging between 354 and 426 µmol mol -1 (ppm). These primary standards were provided by the Earth System Research Laboratory (ESRL) of the National Oceanic and Atmospheric Administration (NOAA), USA. Since the summer of 2013, working standard gas cylinders were also calibrated for CO2, CO and CH4 mole fractions on a Cavity Ring-Down

165
Spectrometer (CRDS) model G2401-m from Picarro Inc. using the same suite of primary standards. We refer to Chen et al. (2010) for more details on the CDRS technique. The measurement precision and accuracy for flask measurements of CO2 on the HPGC are typically <0.06 ppm and <0.07 ppm, respectively (van Leeuwen, 2015).
All CO2 measurements presented in this paper were originally calibrated against standards on the WMO X2007 scale, 170 and are updated to the WMO X2019 scale (the new scale is explained in details by Hall et al. (2020)).

O2 measurements
Atmospheric O2 is typically reported as the δ(O2/N2) value. The δ(O2/N2) value of a sample is calculated as the difference between the O2/N2 ratio of the sample and that of a reference gas (Keeling and Shertz, 1992): (1) Since for natural variations, δ(O2/N2) values are very small, they are usually expressed in "per meg", which is 1/1000 of a per mil, as typically used in the stable isotope community. Atmospheric O2 is reported as O2/N2 ratio because it is not a trace gas, and its mole fraction is thus affected by changes in other atmospheric constituents such as CO2.
Atmospheric N2 is very stable , therefore changes in the O2/N2 ratio would reflect mostly the changes in atmospheric O2 (only in a detailed budget analysis minor N2 variabilities are still considered, as described 180 in Keeling and Manning (2014)). For δ(O2/N2) measurements, we use a Micromass Optima Dual Inlet Isotope Ratio Mass Spectrometer (DI-IRMS). The DI-IRMS analytical technique (which was first developed by Bender et al. (1994)) follows the principles as explained by Keeling et al. (2004). Each measurement comprises sixteen successive switches between sample and reference gases from the respective bellows. After every switch, the pressures of the two bellows are equalized, using a differential pressure meter (GA63, Effa France), subsequently there is an idle period of 120 s 185 before the actual signal is measured for 30 s, to account for the disturbances in the signals caused by the switching of the valves that affect the measurement precision . Due to the sensitivity of the analyser, it is located inside a climate-controlled room in our CIO laboratory. However, it is inevitable that the measurements still drift over time. To correct for the instrumental drifts, we perform frequent calibrations using a suite of reference gas cylinders. These cylinders are calibrated against the international Scripps scale using three primary standard cylinders

Atmospheric Potential Oxygen (APO)
Combining highly precise measurements of atmospheric CO2 and O2 can isolate the effects of the oceanic processes, by removing the effects of the land biosphere (Stephens et al., 1998). This is achieved by deriving the tracer Atmospheric Potential Oxygen (APO). The APO value of an air sample is determined by combining its δ(O2/N2) and

200
CO2 measurements (Battle et al., 2006;Gruber et al., 2001;Stephens et al., 1998): The value of 1.1 represents the mean O2:CO2 ER of terrestrial ecosystems (Severinghaus, 1995); for the SO2 , we take 0.2094, which is the standard atmospheric O2 mole fraction (Tohjima et al., 2005); and 350 is the consensus (arbitrary) reference value to be subtracted from the measured CO2 mole fraction, as defined in the SIO per meg scale conversion 205 for APO . Therefore, APO is not affected by land biosphere processes and mainly captures the seasonal and long-term air-sea exchange of CO2 and O2, with an influence from fossil fuels combustion, caused by their higher average ER of ≈1.4 (Pickers et al., 2017;Sirignano et al., 2010).

Calibration of the DI-IRMS
In this section we present the calibration procedure and the stability achieved at our laboratory from 2006 to 2020. The

The calibration procedure
The DI-IRMS compares the measurement of a sample gas with that of a reference gas (hereby called "machine 215 reference" or "MREF") in a sequence of several switches back and forth ("change overs"). The result of this process is the δ(O2/N2) value of the sample, as presented in equation 1. Each individual measurement is based on seven successive pairs of sample and reference measurements, which are used to calculate seven delta values (equation 1). The seven delta values then go through a filtering process. First, the mean and standard deviation of the seven delta values are calculated. Then, the delta value that is furthest from the mean is marked as a potential outlier. Next, a new mean and 220 a new standard deviation are calculated for the remaining six delta values. If the excluded delta value is more than 2.7 times (equivalent to p = 0.01) the new standard deviation away from the new mean, it is defined as an outlier and removed. This process is repeated to identify and remove a potential second outlier (at most two outliers are removed by this process, otherwise the reliability of the measurement is sacrificed). After removing possible outliers, the remaining delta values are averaged to produce one δ(O2/N2) value per measurement. A flask is typically measured two 225 to three times consecutively, for which we do not find any systematic biases. The final measurement for each flask (as

230
To improve the stability of our measurements, we also measure local reference gas cylinders (hereafter called "working tank" or "WT") on the sample side of the DI-IRMS. These WTs are also used to connect between periods of different MREF cylinders, where there may be shifts in the scales of the measurements and thus a scale conversion is required to keep all raw measurements on a comparable scale. The summary of different WTs and MREF cylinders used from 1998 to 2020 is shown in Fig. 2 and Table 1. In addition to the conversion to our internal CIO scale, the measurements are also affected by instrumental drifts over time. To correct for these drifts, we first divide our long measurement record into several periods, which are defined based on the timing of when the MREF cylinders are changed, and/or apparent fluctuations in the raw data related to, 250 for example, repairs or modifications of the system. In this work, the calibration procedure is carried out for measurements from 2011 onwards, which were divided into seven periods (periods 9-15, Table 1). These 7 periods were divided into 144 sub-periods (selected based on breaks in the records) which were then individually processed to derive the final corrections for all measurements in those sub-periods. The complete step in transforming the raw measurements of a sample (S) against a current MREF (M) into comparable data is to combine the drift correction with the shift to the CIO scale (R), by using an equation described by van Der Laan-Luijkx (2010): Where: -δS/R is the δ(O2/N2) value of the sample against the CIO 2534 scale;  After these adjustments, the measurements of the three long-term WTs (5279, 6096, and 6168) show that all three were simultaneously stable over time. To verify this, we calculated the trends of all three WTs based on their annual averages, and the weighted mean slope amounts to -0.4 ± 0.7 per meg yr -1 , so not significantly different from zero. To check the stability of our scale over time, we calculated the year-to-year variability between the WTs, and found a value of less 285 than 3 per meg over the 14 years of measurements.
WT 4845 was recently measured for a relatively short period only, and appeared to be less stable and noisier compared to WT 5279 measured in the same period. It is not clear why this is the case, but it could be due to the fact that the value of this cylinder is very low, suggesting a potential contamination when the cylinder was filled, or a leak in the pressure 290 reducer when it was measured. Thus, WT4845 was not used for the calculations in the calibration procedure, and its measurements are only shown here for completeness.
In addition to their long-term stability, the 3 WTs also showed no systematic drifts across different MREF periods (    Table 3. The measured values are the weighted means of each cylinder, since each data point is calculated based on different numbers of separate measurements. It can be seen from Figure 4 and Table 3 that cylinder 7008 exhibits a small upward drift over time of 1.4 ± 0.4 per meg yr -1 , whereas the other two remain constant. The 320 ensemble thus suggests that there is no clear systematic error in our scale conversion and calibration procedure. Overall, the SIO primary standards produce a weighted uncertainty of 8.6 per meg in 10 years. To improve the quality of our conversion into the SIO scale, and especially to check the behaviour of cylinder 7008, we are planning to purchase new primary standard cylinders in the future.

325
The conversion of the CIO scale to the Scripps scale is calibrated using these Scripps primary standards measurements, and in such a way that the ensemble difference between the assigned values and weighted averages of our measurements of three Scripps cylinders is minimised (Mook, 2000):   Deviation from assigned 6.9 -5.3 -7.4

Inter-comparison programmes
In addition to measuring the primary standard cylinders, the CIO also took part in two inter-comparison programmes involving oxygen measurements: "Cucumber" Intercomparison which was initialised in the European Union's 345 CarboEurope project and coordinated by the UEA (http://cucumbers.uea.ac.uk/); and the Global Oxygen Laboratories Link Ultra-precise Measurements (GOLLUM) programme, also coordinated by UEA (Manning et al., 2015) . These inter-comparison programmes provide an additional tool for checking the internal stability of our measurements, while also linking the oxygen measurements between global laboratories.
The Cucumber programme involves inter-comparison of nine atmospheric species (of which δ(O2/N2) is one) between 350 atmospheric research stations in Europe and a number of laboratories in Europe, USA, Canada, Japan, and Australia.
Within the programme, there are seven sets of three cylinders sent around in different rotations. The CIO participated in three rotations, with two involving oxygen measurements (called "Inter-1" and "Euro-3") (University of East Anglia, 2021).

355
The GOLLUM programme is specifically designed for the inter-comparison of oxygen measurements and involves 10 laboratories worldwide that carry out high-precision atmospheric oxygen measurements. Two sets (named "Bilbo" and "Frodo") of three cylinders are rotated in opposite directions amongst participating laboratories (Manning et al., 2015). Compared to the Cucumber cylinders, GOLLUM cylinders show much less variations between years (varying within a range of less than 20 per meg), and also significantly smaller overall drift over the duration of the measurements (4 ± 6 per meg yr -1 ). However, all 6 cylinders appear to drift in similar direction, suggesting a significant drift in our scale rather than drifts in these cylinders. The SIO cylinder 7008 also shows similar stability and a general drift in the same 380 direction as GOLLUM cylinders, whereas the two other SIO cylinder do not (Fig. 4).
Since the cylinders show an inconclusive "drift": INTER-1 and EURO-3 do not show an apparent drift direction; Bilbo and Frodo present a minor drift similarly to that observed by our SIO cylinder 7008 (while the other 2 SIO cylinders did not exhibit this behaviour as shown in Sect. 3.2); and our internal WTs all show no overall drifts, we consider our calibration procedure as sufficient. Recalibration of the SIO cylinders might shed further light on these small discrepancies, mostly to see if cylinder 7008 has indeed drifted or not.

Treatment of analysed flask samples
After the calibration and conversion to the SIO scale, the individual flask sample measurements are scrutinized for outliers and background conditions. For this purpose, we perform several iterations of fitting a combination of quadratic and 3-harmonic regression (following similar curve fitting methods applied to time series in NOAA without the use of a digital filtering method (Thoning et al., 1989)) and filtering the outliers from the combined fit. This outlier filtering 400 process uses the robust median absolute deviation (MAD) method (Rousseeuw and Verboven, 2002), in which the MAD value for a dataset is determined by first finding the median of the set, then subtracting the median from each individual value, and finally finding the median of the absolute differences. Measurements that are 3 times the MAD value away from the median of the measurement set are considered outliers and removed. The full principle of the procedure is described by van Der Laan-Luijkx (2010) (though with a different filtering process that was described in Sect. 3.1). In total, after both filtering processes, around 30% of the flasks were excluded from further analyses from In the period prior to 2006, our internal calibration scale was not as well-established as in the later period, due to 415 frequent changes in MREF and WT cylinders, especially in 2004 when there is little information to connect the following period to the first period (as presented in Fig. 2 and Table 1). Next to this, we also only obtained the SIO inter-comparison programmes. Therefore, we conclude that our calibration process is accurate within the uncertainties mentioned above.

The CO2, δ(O2/N2) and APO records
In this section, we present the long-term flask measurement records (from 2000 to 2020) of Lutjewad and Mace Head, the coloured lines are the total fit (combined quadratic trend and 3-harmonic seasonal cycles) and the black lines are the trend parts of the total fit. The fit lines are shown for the whole period, but for the fitting process we left the first 440 and last two years out, to make sure that the fit period comprises complete calendar years (from January to December).
Otherwise, the beginning and end of the curves can influence the trend part of the fit due to the irregular sampling frequency and other problems, as explained in Sect. 3.4. From the records, the total uncertainties associated with the trends are also calculated, based on a quadratic sum of the uncertainties of the flask measurements and other factors.
For CO2, the only other contributing factor is the uncertainty in the trend fit. For δ(O2/N2), and APO, the uncertainties associated with the measurements of the SIO primary standards, our internal scale, the long-term scale conversion between CIO and SIO scales, and the trend fits all contributed to the final uncertainty.  the uncertainty in the trend fits is the most significant factor. However, at Mace Head the uncertainties in the flask measurements contributed more significantly than those at Lutjewad (12.5 compared to 7.4 per meg, respectively).
The APO trend and seasonality can be determined either from fitting the APO values of the individual flasks themselves, or by combining the trend/seasonal parameters of the δ(O2/N2) and CO2 fits. Both methods yield almost identical results.
We present here the results from the first approach. Since APO is calculated from the combination of δ(O2/N2) and CO2 measurements, it shows a combination of the patterns as illustrated in the two species. The APO trend (reported here also in per meg yr -1 ) at Lutjewad does not differ significantly over time, varying from -9.4 ± 0.8 per meg yr -1 in 2002 to -9.31 ± 0.20 per meg yr -1 in 2010, and -9.3 ± 0.8 per meg yr -1 in 2018. In Mace Head, however, the same pattern as 1.4 ± 2.4 per meg yr -1 , -9.3 ± 0.6 per meg yr -1 , and -6.7 ± 0.9 per meg yr -1 .

Seasonal cycles
The seasonal cycles of CO2, δ(O2/N2), and APO for all three stations are presented in Fig. 9. The seasonal components are extracted from the total fits (detrended) and presented as 1-year cycles. In general, the CO2 seasonal cycles at 515 Lutjewad and Mace Head are similar in size and shape, although the average seasonal amplitude is higher at Lutjewad (16.8 ± 0.5 ppm) than Mace Head (14.8 ± 0.3 ppm). The CO2 seasonal cycle at Halley station, on the other hand, has a much smaller amplitude of 3.0 ± 0.3 ppm, as is generally the case for the ocean-dominated Southern Hemisphere due to the absence of a terrestrial biosphere influence. Lutjewad and Mace Head show very similar, and significantly higher δ(O2/N2) seasonal amplitudes (131 ± 6 per meg and 130 ± 6 per meg, respectively) than that at Halley (76 ± 4 per meg), 520 due to the influences of the terrestrial biosphere. In APO this influence is cancelled because APO is invariant to terrestrial biosphere processes, and the Halley amplitude is even somewhat higher than that of Lutjewad and Mace Head (65 ± 3 per meg compared to 54 ± 4 and 61 ± 5 per meg, respectively). All numerical seasonality parameters of the three stations are given in Table 4 below.

Measurements at Lutjewad, Mace Head, and Halley
Here, we discuss our measurement records in more detail. At first, the difference in the progression of trends in δ(O2/N2) and APO between Lutjewad and Mace Head ( Fig. 6 and 7) suggests that there could be an issue with the flask sampling procedure at Mace Head, such as the way the samples are dried. At Lutjewad, the sampling process has been more closely controlled thanks to the vicinity of our laboratory enabling frequent visits, multiple tests and other measurements taken from the same sample lines. Furthermore, a comparison of the Lutjewad data with data from the nearby Weybourne coastal station in the UK (presented in Sect. 5.2) showed very good agreement. As both Lutjewad and Mace Head samples share the same measurement procedure, measurement and calibration issues cannot explain their differences, so the differences must either be real, or related to the flask sampling procedure. It takes longer to transport the flasks from Mace Head to Groningen than from Lutjewad and thus contaminations of the samples through the valve 540 caps might have occurred. For the samples from the Halley station, the transport time is even longer, but here, additional protective caps (glass or aluminium) with Viton O-rings are used on the valve caps of the flasks to create small buffer volumes that slow down permeation effects. We tested the preservation of the samples using the protective caps by sending flasks to Halley station that were pre-filled with air of known composition, without actually using them. Back in Groningen, we could conclude the integrity of the samples by comparing the measurements before and after 545 shipment, and we found no significant change in δ(O2/N2) after 26 to 51 months. We found a small drift of 0.4 per meg in δ(O2/N2) after 48 months; and a drift of -0.3 ppm in CO2 after 24 months, on a set of 20 flasks. These numbers would only amount to biases of 0.008 per meg /month in δ(O2/N2) and 0.013 ppm/month in CO2. Unfortunately, the protective caps were not applied to Mace Head samples. Still, it is hard to imagine how such permeation effects could cause a deviating long-term trend in the data given that the flasks were filled to ambient pressure. Furthermore, the time between 550 taking the sample and analysing was a few months at most. If anything, one would expect more scatter in the record.
The same holds for sampling problems, such as incomplete drying.
To summarise, the trends at Lutjewad are as expected while those at Mace Head are not, so if there are no systematic sampling errors, the differences in δ(O2/N2) and APO in Mace Head compared to Lutjewad might be partially caused by the sparse and irregular sampling frequency at Mace Head or technical issues that remain undiagnosed. However, it is also worthwhile to consider effects that may be caused by real environmental differences between the two stations.
Two effects come to mind: the first is a difference in fossil fuel use (both in quantity and type), which would influence δ(O2/N2) and to a lesser extent also APO. The average fossil fuel exchange ratio (ER) for the Netherlands, when accounting for all fossil fuel types, is 1.60 ± 0.02 for the 2000-2020 period, much higher than that for Mace Head (  When comparing the seasonal cycles of the three stations, we can see that while CO2 and δ(O2/N2) seasonal amplitudes at Halley are significantly smaller than those at Lutjewad and Mace Head, the APO seasonal amplitude is slightly higher, agreeing with the model simulation by Tohjima et al. (2012) that the APO seasonal variations in the Southern Hemispheric ocean are larger than those in the Northern Hemisphere due to larger air-sea O2 exchange. As mentioned in Sect. 2, APO values also contain a small influence from fossil fuels, however, by selecting for flasks based on the 585 background conditions, we eliminate as much as possible this influence, especially for the Lutjewad record. As such, our APO values from these three stations represent mostly ocean influences.
As an illustration of the usefulness of the δ(O2/N2) measurement, we calculated the partitioning of CO2 uptake by the terrestrial biosphere and the ocean from the observations at Lutjewad, using the measurements of CO2 and APO 590 concentrations from 2002 to 2018, following the method described by Keeling and Manning (2014), but using the fitted trend lines from Lutjewad instead of global averaged values . This partitioning is illustrated in Fig. 10.

605
(2021), and the ER for globally averaged fossil fuel combustion of 1.43 from Jones et al. (2021). From the Lutjewad Ocean O 2 outgassing record, the global land biotic sink (B) is 2.5 ± 1.1 PgC yr -1 , the oceanic sink (O) is 1.5 ± 0.8 PgC yr -1 , and the CO2 remaining in the atmosphere (A) amounts to 4.89 ± 0.15 PgC yr -1 . However, these calculations did not take into account the riverine carbon fluxes, which are fluxes of carbon fixed on land and transported into the oceans through the river systems (Jacobson et al., 2007). After correcting for the riverine fluxes of 0.6 PgC yr -1 (shifting from the land sink to the ocean sink) (Friedlingstein et al., 2021), the corresponding values are 1.9 PgC yr -1 for B and 2.

615
The challenges in making O2 measurements have presented themselves clearly in this work: the sensitivity of the mass spectrometer that require intensive calibration; the quality maintenance of the internal calibration scale to make sure that our measurements can be reported with sufficient quality on the international scale; and the unexpected patterns (especially in APO for Mace Head) that could not be fully explained, partly due to the lack of consistent sampling The trend and seasonality fitting procedure are also of great importance, as these are also highly sensitive to irregular sampling frequency and biases in the timing in which the majority of the samples is collected. Nevertheless, our flask measurement records of Lutjewad, Mace Head, and Halley have proven to be informative and valuable in evaluating APO, and with future technical improvement (especially regarding the sampling frequency and the quality maintenance 625 of our internal scale), they will be extended further. In the near future, in addition to more regular sampling frequency at Lutjewad and Mace Head, we aim to improve the frequency at which we perform the measurements on the SIO primary standard cylinders, and also to purchase new primary standard cylinders from them, to produce higher precision conversion to the SIO scale. We also aim to employ more WTs as the current ones are either running out or experiencing considerable noise (see WT 4845 in Fig. 3). We have now added another cylinder to measure along with our last stable 630 WT, to ensure the continuation of our calibration scale quality. More protective measures to the flasks, such as using additional caps or switching to another type of valve, will also be considered, to reduce the risks of potential leakages, permeations, and contaminations during storage and transportation.

Comparison with other long-term records
In Table 5, we compare the seasonal amplitudes of our CO2, δ(O2/N2), and APO measurements with those of some 635 other stations worldwide. As can be seen, the measurements for all three species at Lutjewad and Mace Head agree well with the measurements conducted at other Northern Hemisphere stations Weybourne (UK), Sendai (Japan), and Ny Ålesund (Norway). In the Southern Hemisphere, our δ(O2/N2) and APO measurements for Halley station show an excellent agreement with those at the Syowa station. On the other hand, our CO2 measurements exhibit a much larger and noisier seasonal cycle, which is caused by small leaks during sampling (the details of which are given at the end of 640 this section). Nonetheless, the general concurrence with these stations helps to consolidate the quality of our measurements.

645
Additionally, we compare our long-term measurement record with an extended record of Weybourne station (Fig. 11), the first part of which has been published by Pickers (2016) and Barningham (2018). The figure shows the continuous Weybourne record as hourly averages. In general, the two records agree well, except for the period of late 2018 to the end of 2019, when flask measurements (and the fit curves) of CO2 and APO at Lutjewad are slightly higher than those at Weybourne. This difference is due to the fact that the Weybourne hourly measurements make year-to year variability 650 (in trend and seasonal cycle) visible, whereas the Lutjewad record, due to its sparser sampling character, is fitted with a smooth trend and a seasonal cycle that is fixed over the years. Apparently, the 2018-2019 period deviated from the average trend and/or seasonal cycle. However, the overall agreement further consolidates the quality of our measurements. For Halley, we compare our CO2, δ(O2/N2), and APO measurements with those conducted by UEA (Fig. 12) (Barningham, 2018 corresponding changes of 10 per meg in APO (since the APO is constructed from "clean" CO2 and "contaminated" δ(O2/N2)). The short-term variations in δ(O2/N2) and APO are greater than 10 per meg, masking the suspected leaks.
However, the significant difference between the average values for our APO measurements and the ones calculated using the NOAA's CO2 and our δ(O2/N2) (indicated by the black and blue lines in the APO plot, respectively), suggesting that our flasks must have been contaminated with inside air in the early 2014 period. the combined fits for the continuous measurements, with the lighter blue shaded area the 95% CI associated with the total fit. The black and blue lines in the APO plot are the average values for our APO measurements, and the ones calculated using the NOAA CO2 and our δ(O2/N2), respectively. The latter is significantly lower, corroborating our conclusion that our CO2 measurements must have been contaminated with inside air (human breathing). The CO2 scale is zoomed in to show the anomalies in 2016 more clearly.

6 Conclusion
We have presented 20-year flask measurement records for δ(O2/N2), CO2 and APO from Lutjewad and Mace Head, along with 3-year records from Halley. We also presented results of the calibration procedures of our instruments. Due to the sensitive nature of oxygen measurements, we conducted an extensive and intensive calibration procedures, which demonstrated a long-term stability for δ(O2/N2) of less than 3 per meg in 14 years based on our own internal cylinders programmes are not consistent, and therefore inconclusive. The long-term records from Lutjewad and Mace Head 705 provided useful information on the two-decadal trends and seasonality of CO2, δ(O2/N2), and APO, showing good agreements with other stations around the world, especially the Weybourne Atmospheric Observatory in the UK. We found long term trends during the period 2002-2018 of 2.31 ± 0.07 ppm yr -1 for CO2 and -21.2 ± 0.8 per meg yr -1 for δ(O2/N2) at Lutjewad, and 2.22 ± 0.04 ppm yr -1 for CO2 and -21.3 ± 0.9 per meg yr -1 for δ(O2/N2) at Mace Head. The notable differences in the year-to-year progression of δ(O2/N2) and APO trends between Lutjewad and Mace Head 710 might in part be caused by the sparse sampling frequency at Mace Head, but also may potentially be indications of influences from the changes in continental fossil fuel use, different degrees of sensitivity to the North Atlantic O2 ventilation, a shift in atmospheric transport, or an artefact in the data. Using the measurements at Lutjewad for 2002-2018, the partitioning of atmospheric CO2 sinks into the global terrestrial biosphere and the oceans are 1.9 ± 1.1 PgC yr 1 and 2.1 ± 0.8 PgC yr -1 , respectively. These values agree well with the numbers reported in the most recent Global

715
Carbon Budget. The Halley record shows that the APO seasonal variations in the Southern Ocean are slightly larger than those in the Northern Hemisphere due to larger air-sea O2 exchange there, and illustrates clearly the influences of oceanic processes on the variations in APO and atmospheric O2. With better maintenance of our internal scale, more regular sampling frequency, and better quality-control of the sampling process, the reliability of our future flask measurements will be improved.

Data availability
The accompanying database comprises three csv files. The files contain the information on the CO2, δ(O2/N2), and APO measurements (measured values and associated uncertainties) of the three stations, and are named after the corresponding station and the measured parameter (9 files in total).
The additional data presented in this paper are available upon request.

730
LNTN, HAJM, and ITL conducted the data analyses, produced all figures and tables, and wrote the manuscript. ITL and HAJM designed the methodology and framework for the calibration procedure of the DI-IRMS. BAMK conducted the technical work and prepared the flask samples at Lutjewad station, and carried out the CO2 and δ(O2/N2) measurements from flasks collected at all three stations. HAS calibrated the CO2 data at CIO. AEJ, NB, and TB performed the measurements at Halley station, prepared the flask samples, produced the data for comparison. PAP and