the Creative Commons Attribution 4.0 License.
BGC-Argo+: Global biogeochemical Argo float data with secondary quality control and derived parameters
Abstract. Biogeochemical Argo floats have quickly become the largest single source of ocean biogeochemical measurements. The vast amount of data collected and the long lifetime of the floats, typically over 5 years of ocean measurements, make it challenging to provide consistently quality-controlled data. We developed a set of automatic and manual outlier detection methods with a focus on oxygen, nitrate, and pH data that we applied to the 2,429 biogeochemical Argo floats with data available through the global data assembly centers that were deployed as of January 2025. The full biogeochemical dataset, named BGC-Argo+, is a uniform repository of profiles and gridded data to enable easy use of the global dataset for direct observational, mapping, or modeling uses. Data are available through an archived repository or at www.bgc-argo-plus.info, along with information on the outliers removed that should enable future improvements to the global dataset of biogeochemical float measurements.
Status: open (until 11 Jul 2026)
- RC1: 'Comment on essd-2026-311', Hernan Garcia, 24 Jun 2026
- RC2: 'Comment on essd-2026-311', Anonymous Referee #2, 24 Jun 2026
The paper presents a data compilation of the freely and openly available BGC-Argo data with additional data processing steps performed by the authors. Their work is inspired by efforts like GLODAP or SOCAT, which add additional steps of quality control for a higher consistency dataset than that from the original observing network (repeat hydrography or SOCONET). Likewise, the authors performed additional visual quality control as well as automated checks that flag parts of the BGC-Argo data.
Those checks follow the intended use by the authors and the data compilation's potential user base, e.g., to remove density inversions above a certain threshold. This modifies the original BGC-Argo dataset beyond "just" quality control and should be clearly communicated as such. That is, BGC-ArgoPlus is a cleaned version of the BGC-Argo data, but also one cleaned (somewhat) towards a certain science purpose. As long as the authors are transparent about this, this is fair, and such choices are at their discretion. However, BGC-ArgoPlus should not simply be sold as a "better BGC-Argo" dataset.
In addition, the compilation contains additional, calculated parameters for convenience such as MLD and carbonate system parameters. The selection is again towards the anticipated user base. There is merit to having those calculations provided alongside the BGC-Argo data in a convenient way, as it promises to reduce spread across upstream studies due to different subtleties in the underlying calculations.
A key feature and positive aspect of this work is that the data compilation retains part of the original data structure, i.e., data from BGC-ArgoPlus can be straightforwardly linked back to the original BGC-Argo floats and profiles. This enormously helps the work's transparency.
The manuscript is clearly and logically structured and can be followed easily.
Comments:
- The product's website and the manuscript differ in their statements about whether product data received additional adjustments that the authors deemed reasonable (e.g., DOXY; Bushinsky et al. 2025) or not. At present, there seems to be no adjustment done (following the manuscript). I think it would be important to preserve a version without additional data adjustments, so users can choose whether or not to follow the authors' decisions on such adjustments.
- It is unclear at present what the process will look like for updates of the BGC-ArgoPlus compilation. Will they include visual screening of all new data, as was done for the current compilation, or will they rely on automated checks on new data only? This will make a difference in the scientific utility judgement going forward and should be included in the manuscript.
- Some of the additional checks can be automated and efforts should be made to have those enter into the original processing systems, the Argo DACs (e.g., through the Argo Data Management Team meetings). A prime example seems to be the surface DOXY removal; another is the bottom oxygen filtering.
- It is less clear why there are surface removals on NITRATE with SUNA sensors: As they are typically placed at the bottom of a float (for buoyancy reasons), under no circumstances can they deliver data 'in air'. Any insights?
- The visual quality inspection on every float of the current dataset, which probably was the most time consuming part, is one of the most valuable pieces to feed back to the original Argo data system. The authors provide a file that traces all their additional flagging, which is great. Some more detail on whether only point-wise flags were/should be applied (i.e., at the given pressure/N_LEVELS index) or whether contiguous parts of the profile (e.g., starting at the given pressure for N_LEVELS) would be helpful, e.g., to help automate such a transfer back to Argo.
- On the slow response "goo" oxygen profiles observed and filtered out in the compilation: Again, this is a case where it should be made clear that data are flagged out based on the authors' interpretation of good data. (I have not checked whether such "goo" cases are separately marked in the accompanying file tracing the additional flagging. It would be good if they were separate, because:)
- Did the authors have the chance to check some measurement timing information associated with those "goo" profiles? In my experience, floats sometimes happen to perform their buoyancy increase actions at a location that results in them getting a stronger buoyancy boost than expected, having them "shoot" upward through the water column compared to their typical ascent speed. With the sensor's time response still being the same, this can cause such seemingly delayed profiles, without any additional "goo" but just because the float profiles faster. The authors would have to look into the corresponding core profile file for such timing information ("MTIME" or "NB_SAMPLE_CTD"). It is not available in the s-profiles, as data in the s-profiles can be interpolated.
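The ascent-speed argument can be illustrated with a first-order lag model. The following is a minimal sketch under assumptions that are not taken from the manuscript: a linear oxygen gradient, a constant ascent speed, an illustrative response time `tau_s`, and the hypothetical function name `depth_lag_m`. For a sensor with time constant τ rising at speed w through a linear gradient, the reported value belongs to water roughly w·τ deeper once the initial transient has decayed.

```python
def depth_lag_m(ascent_speed_m_s, tau_s, gradient_per_m=0.05,
                dt_s=1.0, duration_s=1500.0, start_depth_m=1000.0):
    """Apparent vertical displacement (m) of a first-order-lag sensor.

    All parameter values are illustrative assumptions, not float specifications.
    The sensor obeys d(sensor)/dt = (true - sensor) / tau_s while the float
    rises at a constant speed through a linear gradient.
    """
    surface_value = 200.0  # arbitrary offset; it cancels in the result

    def true_value(depth_m):
        return surface_value + gradient_per_m * depth_m

    depth = start_depth_m
    sensor = true_value(depth)  # sensor starts equilibrated with the water
    for _ in range(int(duration_s / dt_s)):
        # Explicit Euler step of the first-order response.
        sensor += (dt_s / tau_s) * (true_value(depth) - sensor)
        depth -= ascent_speed_m_s * dt_s
    # Translate the value mismatch into an equivalent depth offset.
    return (sensor - true_value(depth)) / gradient_per_m


if __name__ == "__main__":
    for speed in (0.1, 0.3):
        print(speed, round(depth_lag_m(speed, tau_s=60.0), 2))
```

With the same illustrative 60 s time constant, tripling the ascent speed from 0.1 to 0.3 m/s triples the apparent displacement from about 6 m to about 18 m. A profile can therefore look "delayed" without any change in the sensor itself, which is why the timing information matters.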
Additional comments:
- On the example figures given, please make sure to state the WMO and cycle number(s) somewhere within the figure labels/titles or the figure caption for traceability.
- Data: Please add a proper Argo reference, selecting the monthly snapshot that is closest to your data access. (e.g. https://doi.org/10.17882/42182#117069 or https://doi.org/10.17882/42182#117068 ?)
- l.74: Replace "performed" by "implementation", since all should follow the same steps, i.e., perform the same calculation?
- l. 89: Maybe add "and are subject to future updates when a float's data are revisited" to indicate the non-static nature of Argo data?
- Only ascending profiles were used, no descending ones?
Typos:
- l. 53: data 2x
- Table 1: superscript "1" appears only in the footnotes, does it?
Citation: https://doi.org/10.5194/essd-2026-311-RC2
Data sets
BGC Argo+, v0.1 2026_04 Seth Bushinsky, Zachary Nachod, Mathilde Jutras, Daniela König, Shannon McClish, and Charles Addey https://doi.org/10.5281/zenodo.19709012
Model code and software
BGC-Argo+ Processing and Outlier Detection Code Seth Bushinsky, Zachary Nachod, Mathilde Jutras, Daniela König, Shannon McClish, and Charles Addey https://doi.org/10.5281/zenodo.19705310
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 279 | 93 | 16 | 388 | 19 | 16 |
Seth and co-authors describe the BGC-Argo+ dataset and the code used. The addition of the BGC-Argo+ dataset is welcome for Argo and BGC delayed-mode data users. This paper describes their effort to provide the ocean community with a dataset subjected to additional primary-level QC using an Argo delayed-mode dataset downloaded on January 24, 2025. The code helps ensure independent repeatability of the QC.
I do not understand in what context the authors use the term "secondary quality control" as stated in the title and in the text. The identification and flagging of large outliers and vertical gradients is generally part of a primary-level QC. A secondary QC approach would generally aim at goals such as reducing data uncertainty, improving internal data quality consistency between newer and older instruments and methods, reducing known data biases, new data re-calibrations, or data adjustments (e.g., precision, target accuracy, or some other metric). If the authors mean that this is a second QC iteration following the delayed-mode level, then this might cause confusion with each future BGC-Argo+ QC iteration. To help minimize confusion, please clearly define what the authors mean by primary- and secondary-level QC.
It would be beneficial if the authors included a version number or label in case they consider potential future QC iterations of BGC-Argo+ (e.g., in the metadata, a BGC-Argo+ version number and/or date).
I suggest that the methods section include or define metrics for what the authors consider large outliers, hooks, spikes, and large vertical gradients, perhaps in table form for simplicity. I would also encourage the authors to point to the Python code in the table label and/or in the text.
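As an illustration of what an explicit metric could look like, the sketch below computes a three-point spike statistic of the form used in the Argo real-time spike test, |V2 − (V3 + V1)/2| − |(V3 − V1)/2|. Whether the authors use this form is not stated here; the function names and the threshold value are hypothetical and chosen only for the example.

```python
def spike_test_values(values):
    """Return the three-point spike statistic per level (None at both ends)."""
    stats = [None] * len(values)
    for i in range(1, len(values) - 1):
        v1, v2, v3 = values[i - 1], values[i], values[i + 1]
        # Deviation from the neighbours' mean, minus half the neighbour spread,
        # so a steady gradient is not mistaken for a spike.
        stats[i] = abs(v2 - (v3 + v1) / 2.0) - abs((v3 - v1) / 2.0)
    return stats


def flag_spikes(values, threshold):
    """True where the spike statistic exceeds the (user-chosen) threshold."""
    return [s is not None and s > threshold for s in spike_test_values(values)]


if __name__ == "__main__":
    doxy = [200.0, 201.0, 260.0, 203.0, 204.0]  # made-up values, umol/kg
    print(flag_spikes(doxy, threshold=50.0))
```

A table in the methods section could list one such statistic and its threshold per parameter and test, which would make the definitions of "large" reproducible.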
Below I provide some additional comments and suggestions for the authors' consideration that I hope prove useful. Please use them as you see fit. The numbers before each comment refer to line numbers in the manuscript.
10 Simply as a historical note, I note that there were open ocean dissolved O2 measurements collected even before the Winkler method was introduced (Winkler, 1888). For example, about 10 years earlier, O2 data were collected in 1878 at the surface and at 157 m depth during the 1876-1878 Norwegian North-Atlantic Expedition (Tornøe H. The Norwegian North-Atlantic Expedition, 1876-1878. Chemistry. Christiania: Grrendal and Son bogtrykkeri, 1880).
12 "Nisken"? should be Niskin. I'd probably also include the use of the Nansen bottles (e.g., pre-1970s).
44 Add a reference to SOCAT here?
41-47. I find the meaning of this paragraph confusing. In what way do the authors use the GLODAP and SOCAT QC protocols in this paper, given the statement "In this work we follow the lead of two ship-based quality control and data-inter-comparability programs to ..."? Consider merging here the paragraphs starting in line 55 and/or lines 60-67.
48-49. It seems to me that the measurement quality of O2, pH, and nitrate from shipboard and sensor chemical observations is also quite different. Modern shipboard data are of higher quality (e.g., uncertainty, target accuracy, precision) than those from current chemical sensors mounted on floats. This is an important distinction to make in the text; without it, the reader gets the erroneous impression that their quality is similar and invariant over time.
55-57: It seems to me that the identification and removal of outliers is part of any primary level QC approach. Why would this be considered a secondary level QC approach? Please define/explain what you mean by primary and secondary QC in the text (e.g., methods).
60: What do you mean by "improving" data quality? How is this quantified? I suggest rewriting this sentence as, for example, (1) improve the internal data quality consistency by minimizing the number of large vertical gradients and outliers. Then define in the text what "large" vertical gradients and outliers are.
71 Please define "Sprof" and "meta" and/or add a reference where these terms are defined.
76 For consistency, if the chemical measurements (e.g., O2, nitrate) are reported in units on a per-mass basis (e.g., umol/kg; Molal), then the recommended term to use is "content" (e.g., Nitrate content). On the other hand, if such data are reported in units on a per-volume basis (e.g., umol/l; Molar), then the term to use is "concentration" (e.g., Nitrate concentration).
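The two quantities differ by the seawater density. As a minimal sketch, assuming the density is supplied in kg/m³ (which density to use, e.g. in situ or potential, is a choice this example does not make) and using hypothetical function names:

```python
def concentration_to_content(conc_umol_per_l, density_kg_per_m3):
    """Convert a per-volume value (umol/L) to a per-mass value (umol/kg).

    One litre of seawater has a mass of density / 1000 kg.
    """
    return conc_umol_per_l * 1000.0 / density_kg_per_m3


def content_to_concentration(content_umol_per_kg, density_kg_per_m3):
    """Convert a per-mass value (umol/kg) to a per-volume value (umol/L)."""
    return content_umol_per_kg * density_kg_per_m3 / 1000.0


if __name__ == "__main__":
    # Illustrative numbers only: 250 umol/L at a density of 1025 kg/m3.
    print(concentration_to_content(250.0, 1025.0))
```

At typical seawater densities the two numbers differ by a few percent, which is larger than many of the differences discussed in the QC, so the term and the unit should always match.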
90-97. These steps are basic primary QC (see comment lines 55-57).
96 Please change "variables" to "parameters" in the text to differentiate between measured and derived quantities.
99-107. Please cite or refer to a manual or a reference that describes the Argo QC flags and metadata.
102 Please clarify here that the primary Argo delayed-mode data used in this effort have already received QC by the Argo data management team. I would also note that the delayed-mode data may have been adjusted as part of a secondary QC step by the Argo data management teams.
109. Suggest changing "highest quality data is present the new dataset" to "data with a higher level of QC in the new dataset" or similar wording. What do you mean by data of "highest quality"? How do you know that the BGC-Argo+ data are of the highest quality possible?
110. "salvaged". What does this term mean?
130-131 "sharply deviated from clear mixed layer mean values" Please define what you mean by a "sharp deviation". I did not understand what the authors mean by "clear mean values"; suggest using "mixed layer mean values" instead?
155 If the authors set certain observations to "NaN" or remove values, then there should be an independent way to replicate your QC data analysis from the same delayed-mode version used. Why not just flag the data following the Argo QC flag scheme? As the authors indicate in lines 291-292 "maintain the original data format to make it easier to compare between different versions" Why not do the same here? See comments for line 218.
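The flag-rather-than-delete approach suggested here can be sketched as follows. This is an illustration only: the function names and the sample values are hypothetical, and the only facts assumed from the Argo flag scheme are that "1" denotes good data, "2" probably good data, and "4" bad data.

```python
import math

ARGO_QC_GOOD = "1"
ARGO_QC_PROBABLY_GOOD = "2"
ARGO_QC_BAD = "4"


def flag_levels(qc_flags, bad_indices, bad_flag=ARGO_QC_BAD):
    """Return a copy of the flags with the given levels marked bad.

    The measured values are not touched, so the original data survive.
    """
    flagged = list(qc_flags)
    for i in bad_indices:
        flagged[i] = bad_flag
    return flagged


def masked_view(values, qc_flags,
                accepted=(ARGO_QC_GOOD, ARGO_QC_PROBABLY_GOOD)):
    """Build a NaN-masked copy on demand from the values and their flags."""
    return [v if q in accepted else math.nan
            for v, q in zip(values, qc_flags)]


if __name__ == "__main__":
    doxy = [210.0, 5000.0, 212.0]          # made-up values, umol/kg
    doxy_qc = flag_levels(["1", "1", "1"], bad_indices=[1])
    print(doxy_qc, masked_view(doxy, doxy_qc))
```

Users who want the cleaned product obtain it through the masked view, while anyone replicating or revisiting the QC can still see which values were rejected and what they were.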
I note that the O2 data "hooks" in Figure 5 for profiles 48 and 49 near the bottom are approximately less than 0.5 umol/kg. Is the precision of the O2 sensors capable of consistently resolving or measuring such O2 content differences over the sensor deployment time? Please check and provide additional context. I also note that profiles 47, 49, and 49 also show what appear to be constant or stuck values over some depth ranges (e.g., near-bottom values in profile 47). Are these a product of slow sensor response? In comparison, I note that the temperature profile shows a vertical gradient over the same depths/pressures.
172. Instead of "at each point", suggest using "at each depth or pressure level in the profile".
188 Please spell acronyms the first time used (e.g., DIC, TALK, pCO2) or cite a reference. Some readers might not necessarily be familiar with the nomenclature.
218. See comments for line 155.
227 It should be "hook" not "book". Perhaps run a spell check on the rest of text.
247-248. "goo covered". Jargon. Please cite a reference where this term is described.
Table 3. Change "Variable" to "Parameter" for consistency with the text and common community use for derived parameters and measured variables.
265-266. Please spell out IF, IN, AO, HZ, and BO and/or point to a reference.
Figure A2. Please label axes and their units. Consider adding a-d to the panels and refer to them in the figure caption as needed.
Figure A4: PSAL is not a unit. Salinity is unitless on the practical salinity scale. Suggest changing +3.466e1 to +34.66 (or is it 34.660 for consistency with the salinity significant figures for differences in the graph?).