BGC-Argo+: Global biogeochemical Argo float data with secondary quality control and derived parameters

Bushinsky, Seth M.; Nachod, Zachary; Jutras, Mathilde; König, Daniela; McClish, Shannon; Addey, Charles

doi:10.5194/essd-2026-311

12 May 2026

Status: this preprint is currently under review for the journal ESSD.

Abstract. Biogeochemical Argo floats have quickly become the largest single source of ocean biogeochemical measurements. The vast amount of data collected and the long lifetime of the floats, typically over 5 years, make it challenging to provide consistently quality-controlled data. We developed a set of automatic and manual outlier detection methods, with a focus on oxygen, nitrate, and pH data, and applied them to the 2,429 biogeochemical Argo floats deployed as of January 2025 with data available through the global data assembly centers. The full biogeochemical dataset, named BGC-Argo+, is a uniform repository of profiles and gridded data to enable easy use of the global dataset for direct observational, mapping, or modeling uses. Data are available through an archived repository or at www.bgc-argo-plus.info, along with information on the outliers removed that should enable future improvements to the global dataset of biogeochemical float measurements.

Received: 23 Apr 2026 – Discussion started: 12 May 2026

Status: final response (author comments only)

RC1: 'Comment on essd-2026-311', Hernan Garcia, 24 Jun 2026

Seth and co-authors describe the BGC-Argo+ dataset and the code used. The addition of the BGC-Argo+ dataset is welcome for Argo and BGC delayed-mode data users. This paper describes their effort to provide the ocean community with a dataset subjected to additional primary-level QC, using an Argo delayed-mode dataset downloaded on January 24, 2025. The code helps ensure independent repeatability of the QC.
I do not understand in what context the authors use the term "secondary quality control" as stated in the title and in the text. The identification and flagging of large outliers and vertical gradients is generally part of a primary-level QC. A secondary QC approach would generally aim at goals such as reducing data uncertainty, improving internal data quality consistency between newer and older instruments and methods, reducing known data biases, or applying new data re-calibrations or adjustments (e.g., to precision, target accuracy, or some other metric). If the authors mean that this is a second QC iteration following the delayed-mode level, then this might cause confusion with each future BGC-Argo+ QC iteration. To help minimize confusion, please clearly define what the authors mean by primary- and secondary-level QC.
It would be beneficial if the authors included a version number or label, in case they anticipate future QC iterations of BGC-Argo+ (e.g., a BGC-Argo+ version number and/or date in the metadata).
I suggest that the methods section include or define metrics for what the authors consider large outliers, hooks, spikes, and large vertical gradients, perhaps in table form for simplicity. I would also encourage the authors to point to the Python code in the table label and/or in the text.
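As an illustration of how such a metric could be stated alongside its code, the sketch below implements the spike statistic used in the standard Argo real-time spike test, |V2 - (V1 + V3)/2| - |(V3 - V1)/2|, for a single profile. It is a hedged example only: the function names `spike_statistic` and `flag_spikes`, the sample profile, and the threshold value are hypothetical and are not taken from the manuscript or its code.

```python
def spike_statistic(values):
    """Return the Argo-style spike statistic for each interior level.

    For three consecutive levels V1, V2, V3 the statistic is
    |V2 - (V1 + V3) / 2| - |(V3 - V1) / 2|.
    The first and last levels lack a neighbour and get None.
    """
    stats = [None] * len(values)
    for i in range(1, len(values) - 1):
        v1, v2, v3 = values[i - 1], values[i], values[i + 1]
        stats[i] = abs(v2 - (v1 + v3) / 2) - abs((v3 - v1) / 2)
    return stats


def flag_spikes(values, threshold):
    """Return the indices whose spike statistic exceeds `threshold`."""
    return [i for i, s in enumerate(spike_statistic(values))
            if s is not None and s > threshold]


# Hypothetical oxygen profile (umol/kg) with one spike at index 2.
doxy = [250.0, 248.0, 290.0, 244.0, 242.0]
# The threshold is an illustrative assumption, not a value from the manuscript.
spikes = flag_spikes(doxy, threshold=10.0)
```

A table of metrics could then list, for each check, the statistic, the threshold per parameter, and the function that applies it.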
Below I provide some additional comments and suggestions for the authors' consideration that I hope prove useful. Please use them as you see fit. The numbers before each comment refer to line numbers in the manuscript.
10 Simply as a historical note, I note that there were open ocean dissolved O2 measurements collected even before the Winkler method was introduced (Winkler, 1888). For example, about 10 years earlier, O2 data were collected in 1878 at the surface and at 157 m depth during the 1876-1878 Norwegian North-Atlantic Expedition (Tornøe H. The Norwegian North-Atlantic Expedition, 1876-1878. Chemistry. Christiania: Grrendal and Son bogtrykkeri, 1880).
12 "Nisken"? should be Niskin. I'd probably also include the use of the Nansen bottles (e.g., pre-1970s).
44 Add a reference to SOCAT here?
41-47. I find the meaning of this paragraph confusing. In what way do the authors use the GLODAP and SOCAT QC protocols in this paper when they state "In this work we follow the lead of two ship-based quality control and data-inter-comparability programs to ..."? Consider merging here the paragraphs starting in line 55 and/or lines 60-67.
48-49. It seems to me that the measurement quality of shipboard and sensor-based chemical observations of O2, pH, and nitrate is also quite different. Modern shipboard data are of higher quality (e.g., uncertainty, target accuracy, precision) than those from current chemical sensors mounted on floats. This is an important distinction to make, because the text as written gives the reader the erroneous impression that their quality is similar and invariant over time.
55-57: It seems to me that the identification and removal of outliers is part of any primary level QC approach. Why would this be considered a secondary level QC approach? Please define/explain what you mean by primary and secondary QC in the text (e.g., methods).
60: What do you mean by "improving" data quality? How is this quantified? I suggest rewriting this sentence as, for example, "(1) improve the internal data quality consistency by minimizing the number of large vertical gradients and outliers", and then defining in the text what "large" vertical gradients and outliers are.
71 Please define "Sprof" and "meta" and/or add a reference where these terms are defined.
76 For consistency, if the chemical measurements (e.g., O2, nitrate) are reported using units on a per mass basis (e.g., umol/kg; Molal) then the recommended term to use is "content" (e.g., Nitrate content). On the other hand, reporting such data in units on a per volume basis (e.g., umol/l; Molar), then the term to use is "concentration" (e.g., Nitrate concentration).
90-97. These steps are basic primary QC (see comment lines 55-57).
96 Please change "variables" to "parameters" in the text to differentiate between measured and derived quantities.
99-107. Please cite or refer to a manual or a reference that describes the Argo QC flags and metadata.
102 Please clarify here that the primary Argo delayed-mode data used in this effort have already received QC by the Argo data management team. I would also note that the delayed-mode data may have been adjusted as part of a secondary QC step by the Argo data management teams.
109. Suggest changing "highest quality data is present the new dataset" to "data with higher level of QC in the new dataset" or similar wording. What do you mean by data of "highest quality"? How do you know that the BGC-Argo+ data are of the highest quality possible?
110. "salvaged". What does this term mean?
130-131 "sharply deviated from clear mixed layer mean values" Please define what you mean by a "sharp deviation". I did not understand what the authors mean by "clear mean values"; suggest using "mixed layer mean values" instead?
155 If the authors set certain observations to "NaN" or remove values, then there should be an independent way to replicate the QC analysis from the same delayed-mode version used. Why not just flag the data following the Argo QC flag scheme? The authors indicate in lines 291-292 that they "maintain the original data format to make it easier to compare between different versions"; why not do the same here? See comments for line 218.
I note that the O2 data "hooks" in Figure 5 for profiles 48 and 49 near the bottom are smaller than approximately 0.5 umol/kg. Is the precision of the O2 sensors sufficient to consistently resolve or measure such O2 content differences over the sensor deployment time? Please check and provide additional context. I also note that profiles 47, 49, and 49 show what appear to be constant or stuck values over some depth ranges (e.g., near-bottom values in profile 47). Are these a product of slow sensor response? In comparison, I note that the temperature profiles show a vertical gradient over the same depths/pressures.
172. Instead of "at each point", suggest using "at each depth or pressure level in the profile".
188 Please spell out acronyms the first time they are used (e.g., DIC, TALK, pCO2) or cite a reference. Some readers might not be familiar with the nomenclature.
218. See comments for line 155.
227 It should be "hook" not "book". Perhaps run a spell check on the rest of the text.
247-248. "goo covered". Jargon. Please cite a reference where this term is described.
Table 3. Change "Variable" to "Parameter" for consistency with the text and common community use for derived parameters and measured variables.
265-266. Please spell out IF, IN, AO, HZ, and BO and/or point to a reference.
Figure A2. Please label axes and their units. Consider adding a-d to the panels and refer to them in the figure caption as needed.

Figure A4: PSAL is not a unit. Salinity is unitless on the practical salinity scale. Suggest changing +3.466e1 to +34.66 (or should it be 34.660, for consistency with the significant figures of the salinity differences in the graph?).
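The suggestion in the comment on line 155, to flag rejected values rather than overwrite them with NaN, can be sketched as follows. This is a minimal illustration, assuming the Argo convention in which a QC flag of '4' marks bad data and '1' marks good data; the helpers `apply_outlier_flags` and `masked_values` and the sample arrays are hypothetical and do not reproduce the authors' code.

```python
import math

BAD = "4"   # Argo QC flag for bad data
GOOD = "1"  # Argo QC flag for good data


def apply_outlier_flags(qc_flags, outlier_indices):
    """Return a copy of `qc_flags` with the given levels marked bad.

    The data values themselves are left untouched, so the original
    delayed-mode profile can always be recovered.
    """
    flagged = list(qc_flags)
    for i in outlier_indices:
        flagged[i] = BAD
    return flagged


def masked_values(values, qc_flags):
    """Return values with NaN wherever the flag is bad (analysis view only)."""
    return [v if q != BAD else math.nan for v, q in zip(values, qc_flags)]


# Hypothetical oxygen values (umol/kg) and their delayed-mode flags.
doxy = [250.0, 248.0, 290.0, 244.0]
doxy_qc = [GOOD, GOOD, GOOD, GOOD]

new_qc = apply_outlier_flags(doxy_qc, outlier_indices=[2])
analysis_view = masked_values(doxy, new_qc)
```

With this design the stored values stay identical to the delayed-mode source, and users who want NaN-masked data can derive it from the flags.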

Citation: https://doi.org/10.5194/essd-2026-311-RC1
RC2: 'Comment on essd-2026-311', Anonymous Referee #2, 24 Jun 2026

The paper presents a data compilation of the freely and openly available BGC-Argo data with additional data processing steps performed by the authors. Their work is inspired by efforts like GLODAP or SOCAT, which add additional steps of quality control for a higher consistency dataset than that from the original observing network (repeat hydrography or SOCONET). Likewise, the authors performed additional visual quality control as well as automated checks that flag parts of the BGC-Argo data.

Those checks follow the intended use by the authors and the data compilation's potential user base, e.g., to remove density inversions above a certain threshold. This modifies the original BGC-Argo dataset beyond "just" quality control and should be clearly communicated as such. I.e., BGC-ArgoPlus is a cleaned version of the BGC-Argo data, but also one cleaned (somewhat) towards a certain science purpose. As long as the authors are transparent about this, this is fair and such choices up to their discretion. However, BGC-ArgoPlus should not simply be sold as a "better BGC-Argo" dataset.

In addition, the compilation contains additional, calculated parameters for convenience such as MLD and carbonate system parameters. The selection is again towards the anticipated user base. There is merit to having those calculations provided alongside the BGC-Argo data in a convenient way, as it promises to reduce spread across upstream studies due to different subtleties in the underlying calculations.

A key feature and positive aspect of this work is that the data compilation retains part of the original data structure, i.e., data from BGC-ArgoPlus can be straightforwardly linked back to the original BGC-Argo floats and profiles. This enormously helps the work's transparency.

The manuscript is clearly and logically structured and can be followed easily.

Comments:

- The product's website and the manuscript differ in their statements about whether product data received additional adjustments that the authors deemed reasonable (e.g., DOXY; Bushinsky et al. 2025) or not. At present, there seems to be no adjustment done (following the manuscript). I think it would be important to preserve a version without additional data adjustments, so users can choose whether or not to follow the authors' decision on such adjustments.
- It is unclear at present what the process will look like for updates of the BGC-ArgoPlus compilation. Will updates include visual screening of all new data, as was done for the current compilation, or will they rely on automated checks on new data only? This will make a difference in judging the scientific utility going forward and should be included in the manuscript.
- Some of the additional checks can be automated, and efforts should be made to have those enter into the original processing systems, the Argo DACs (e.g., through the Argo Data Management Team meetings). A prime example seems to be the surface DOXY removal; another is the bottom oxygen filtering.
- It is less clear why there are surface removals on NITRATE with SUNA sensors: As they are typically placed at the bottom of a float (for buoyancy reasons), under no circumstances can they deliver data 'in air'. Any insights?
- The visual quality inspection of every float in the current dataset, which was probably the most time-consuming part, is one of the most valuable pieces to feed back to the original Argo data system. The authors provide a file that traces all their additional flagging, which is great. Some more detail would be helpful on whether only point-wise flags were/should be applied (i.e., at the given pressure/N_LEVELS index) or whether contiguous parts of the profile should be flagged (e.g., starting at the given pressure for N_LEVELS), e.g., to help automate such a transfer back to Argo.
- On the slow response "goo" oxygen profiles observed and filtered out in the compilation: Again, this is a case where it should be made clear that data are flagged out based on the authors' interpretation of good data. (I have not checked whether such "goo" cases are separately marked in the accompanying file tracing the additional flagging. It would be good if they were separate, because:)
- Did the authors have the chance to check some measurement timing information associated with those "goo" profiles? In my experience, floats sometimes happen to perform their buoyancy increase actions at a location that results in them getting a stronger buoyancy boost than expected, making them "shoot" upward through the water column faster than their typical ascent speed. With the sensor's time response still being the same, this can cause such seemingly delayed profiles without any additional "goo", simply because the float profiles faster. The authors would have to look into the corresponding core profile file for such timing information ("MTIME" or "NB_SAMPLE_CTD"). It is not available in the s-profiles, as data in the s-profiles can be interpolated.
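The ascent-speed check suggested in the last comment could be sketched as below. This is an illustration under stated assumptions: it takes per-level measurement times in fractional days (as `MTIME` is assumed to be expressed here) and pressures in dbar from the core profile file, with levels ordered as sampled during ascent. The function name `ascent_rates` and the sample values are hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def ascent_rates(pressure_dbar, mtime_days):
    """Return the ascent rate (dbar/s, positive upward) between levels.

    Levels are assumed to be ordered as sampled during ascent, so pressure
    decreases while time increases. Pairs with no elapsed time give None.
    """
    rates = []
    for k in range(len(pressure_dbar) - 1):
        dp = pressure_dbar[k] - pressure_dbar[k + 1]  # dbar risen
        dt = (mtime_days[k + 1] - mtime_days[k]) * SECONDS_PER_DAY
        rates.append(dp / dt if dt > 0 else None)
    return rates


# Made-up ascent: two 10 dbar steps, the second taken twice as fast.
pres = [1000.0, 990.0, 980.0]
mtime = [0.0, 100.0 / SECONDS_PER_DAY, 150.0 / SECONDS_PER_DAY]
rates = ascent_rates(pres, mtime)
```

Comparing these rates with those from the same float's other cycles would indicate whether a seemingly lagged profile coincides with an unusually fast ascent.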

Additional comments:

- On the example figures given, please make sure to state the WMO and cycle number(s) somewhere within the figure labels/titles or the figure caption for traceability.

- Data: Please add a proper Argo reference, selecting the monthly snapshot that is closest to your data access. (e.g. https://doi.org/10.17882/42182#117069 or https://doi.org/10.17882/42182#117068 ?)

- l.74: Replace "performed" by "implementation", since all should follow the same steps, i.e., perform the same calculation?

- l. 89: Maybe add "and are subject to future updates when a float's data are revisited" to indicate the non-static nature of Argo data?

- Were only ascending profiles used, and no descending ones?

Typos:

- l. 53: data 2x

- Table 1: superscript "1" appears only in the footnotes, does it?

Citation: https://doi.org/10.5194/essd-2026-311-RC2
RC3: 'Comment on essd-2026-311', Anonymous Referee #3, 30 Jun 2026

The manuscript aims to improve the quality of the BGC-Argo dataset and deliver more derived parameters for users and help delayed mode operators and Argo data assembly centers to identify remaining quality control issues to address them.
Qualifying data is always a huge and time-consuming task. Moreover, as recalled by the authors, Argo data are collected over years and floats may be corrected individually. To improve the consistency of the global dataset, it is therefore important to perform an analysis at a global and/or basin scale, and leading such a task is always a benefit for the Argo user community.

By delivering the “filtered” data and providing additional derived parameters such as dissolved inorganic carbon, total alkalinity (TA), and pCO2 in a commonly used Argo format, the authors will be of great help to users.
It is definitely also useful to remove from the data the remaining bottom hooks and the in-air data wrongly assigned to the data profiles, but the manuscript leaves the reader a bit disappointed, as it skims over the main quality issues in the Argo chemical data, such as the oxygen response time and the potential remaining bias in Oxygen Minimum Zones. Some path forward and some perspectives to address those issues would be a plus for this manuscript. Moreover, it could be stated directly in the title that the only variables covered are the “chemical” ones, while the bio-optical variables are poorly and sometimes wrongly served.
Finally, the huge effort on the visual data screening deserves recognition and an efficient path for the screened data to the Argo data system should be found.
Specific comments
- Line 22: This sentence may be a bit reductive: “Optical measurements that provide information on particle density and fluorescence, light sensors, and many other types of sensors have been deployed over time as well”
The authors would be well advised to specify in the title of this manuscript that this dataset focuses solely on the chemical parameters of Argo, rather than treating bio-optics poorly.

-Line 72: For data reproducibility, considering that the Argo dataset is constantly evolving, the correct way to proceed is to use an archive corresponding to a specific snapshot of the dataset, instead of mentioning that the dataset was accessed on January 24, 2025.
For the January 2025 Sprof snapshot, this would have been https://doi.org/10.17882/42182#116312
-In paragraph 2, related to data: it would have been useful to cite references for the documentation of the real-time quality tests and delayed-mode procedures that have been applied by the Argo Data team and that lay the foundation of this manuscript:
For dissolved oxygen https://doi.org/10.13155/46542
For Nitrate https://doi.org/10.13155/84370
For pH https://doi.org/10.13155/97828
as well as https://doi.org/10.3389/fmars.2021.683207 (Maurer et al., 2021), describing the delayed-mode procedures for all the chemical parameters,
and the Argo user’s manual https://doi.org/10.13155/29825 and the manual for the Sprof generation http://dx.doi.org/10.13155/55637
These will definitely be useful for understanding the “jargon”.

-Line 92, Line 111:
“filter out non-delayed mode data for most parameters”
It is not clear what this means; this needs to be stated more precisely.
“We did not perform the same steps for pressure, temperature, or salinity, as our experience was that bad data in the "ADJUSTED" data fields for these physical parameters were already removed by the time any BGC data was put into Delayed Mode.”
As for line 92, this needs to be stated more precisely, as it is accepted in the Argo community that BGC variables can be pushed to delayed mode before the CTD data.
Do the authors mean that they only consider profiles with PRESSURE, TEMPERATURE, SALINITY, pH, Oxygen, and NITRATE in Delayed Mode? If so, they should say so directly; as written, this sentence is again confusing.
- Line 114, “as the vast majority of those did not appear to ever be set into Delayed Mode”: this sentence does not read well. It looks like bio-optics variables are not within the scope of this manuscript, so why not say so? Are there any plans to account for bio-optics in BGC-Argo+ in the future, and how? Also, as reported in Table 1, it appears that 1/3 of radiometric data have been through delayed mode, so the expression “the vast majority of those” may seem inappropriate.
- Table 1 page 8
Regarding CHLA, these figures are wrong: presently there are fewer than 200,000 CHLA profiles in the Argo dataset (June 2026). As CHLA is often associated with BBP, a more realistic number of CHLA profiles would be ~170,000.
- Line 250, regarding the spikes: have the authors noticed spikes occurring simultaneously in pH and NITRATE? Regarding the lags for both sensors, are they of the same nature? While the nitrate sensor could be mounted apart from the CTD, what can cause the lag for pH? Is there any time response related to the pH sensor?
-Line 265 :
For the reader, the users, and the Argo delayed-mode operators, this analysis would have benefited from further examination, for example by illustrating how many bad surface data and bottom hooks had been removed by the delayed-mode process and how many were still in the data system and have been removed by this study (this is potentially feasible, mainly by applying the test to DOXY and checking whether DOXY_ADJUSTED_QC has been set to 4 or not), and whether the visual data screening performed by the authors is in line with the delayed-mode operators’ data screening, if any.
It would also have been useful to highlight that the bad surface data are mainly related to old CTS4 PROVOR floats, from when they were upgraded to take some in-air measurements.
Bottom oxygen hooks are also more visible for floats with a higher sampling frequency.

-To cite the Argo dataset
Argo (2026). Argo float data and metadata from Global Data Assembly Centre (Argo GDAC). SEANOE. https://doi.org/10.17882/42182
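The comparison proposed in the comment on line 265 could be tallied as sketched below: for each level, check whether the delayed-mode flag was already set to bad and whether the additional screening removed it. This is a minimal sketch, assuming per-level `DOXY_ADJUSTED_QC` flags stored as characters with '4' meaning bad; the function `tally_removals` and the sample data are hypothetical.

```python
from collections import Counter


def tally_removals(adjusted_qc, removed_by_screening):
    """Count levels by how delayed-mode QC and the extra screening agree.

    adjusted_qc: per-level QC flags as characters ('4' = bad).
    removed_by_screening: per-level booleans, True if the extra
        screening removed the level.
    """
    counts = Counter()
    for qc, removed in zip(adjusted_qc, removed_by_screening):
        dm_bad = qc == "4"
        if dm_bad and removed:
            counts["both"] += 1
        elif dm_bad:
            counts["delayed_mode_only"] += 1
        elif removed:
            counts["screening_only"] += 1
        else:
            counts["kept"] += 1
    return counts


# Hypothetical profile of five levels.
qc = ["1", "4", "1", "1", "4"]
removed = [False, True, True, False, False]
summary = tally_removals(qc, removed)
```

Summing such counts per float or per data assembly center would show how much of the removed data had already been caught in delayed mode.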

Technical corrections
About the data
File downloaded here
https://ftp.soest.hawaii.edu/bgc_argo_plus/outliers_removed/v0.1_2026_04/Individual_Floats/5901744_Sprof_BGCArgoPlus.nc
Long names could be added to the new derived parameters to help when navigating through the files.
Some are missing and some are certainly wrong. For example, DOXY_SAT is certainly the oxygen saturation and not the practical salinity, so its long_name, standard_name, units, valid_min, and valid_max are wrong:
double DOXY_SAT(N_PROF, N_LEVELS) ;
DOXY_SAT:_FillValue = NaN ;
DOXY_SAT:long_name = "Practical salinity" ;
DOXY_SAT:standard_name = "sea_water_salinity" ;
DOXY_SAT:units = "psu" ;
DOXY_SAT:valid_min = 2.f ;
DOXY_SAT:valid_max = 41.f ;
DOXY_SAT:C_format = "%9.3f" ;
DOXY_SAT:FORTRAN_format = "F9.3" ;
DOXY_SAT:resolution = 0.001f ;
DOXY_SAT:coordinates = "JULD LATITUDE LONGITUDE PRES_ADJUSTED_BGCArgoPlus" ;

-Line 12 : Niskin bottles instead of “Nisken” bottles

Citation: https://doi.org/10.5194/essd-2026-311-RC3

Data sets

BGC Argo+, v0.1 2026_04 Seth Bushinsky, Zachary Nachod, Mathilde Jutras, Daniela König, Shannon McClish, and Charles Addey https://doi.org/10.5281/zenodo.19709012

Model code and software

BGC-Argo+ Processing and Outlier Detection Code Seth Bushinsky, Zachary Nachod, Mathilde Jutras, Daniela König, Shannon McClish, and Charles Addey https://doi.org/10.5281/zenodo.19705310

Short summary

In this work we describe the creation of a secondary quality-controlled dataset for profiling float data. We focus on oxygen, nitrate, and pH measurements, using a mix of automated and manual detection methods to identify data points and profiles that are likely to be bad data. This work grew out of our individual recognition that spurious data exist in these widely used data and the need to remove these data before many types of oceanographic analyses can take place.

