A rare intercomparison of nutrient analysis at sea: lessons learned and recommendations to enhance comparability of open-ocean nutrient data

An intercomparison study has been carried out on the analysis of inorganic nutrients at sea following the operation of two nutrient analysers simultaneously on the GO-SHIP A02 trans-Atlantic survey in May 2017. Both instruments were Skalar San Continuous Flow Analyzers, one from the Marine Institute, Ireland and the other from Dalhousie University, Canada, each operated by their own laboratory analysts following GO-SHIP guidelines while adopting their existing laboratory methods. There was high comparability between the two data sets and vertical profiles of nutrients also compared well with those collected in 1997 along the same A02 transect by the World Ocean Circulation Experiment. The largest differences between data sets were observed in the low-nutrient surface waters and results highlight the value of using three reference materials (low, middle and high concentration) to cover the full range of expected nutrients and identify bias and non-linearity in the calibrations. The intercomparison also raised some interesting questions on the comparison of nutrients analysed by different systems and a number of recommendations have been suggested that we feel will enhance the existing GO-SHIP guidelines to improve the comparability of global nutrient data sets. A key recommendation is for the specification of clearly defined data quality objectives for oceanic nutrient measurements and a flagging method for reported data that do not meet these criteria. The A02 nutrient data set is currently available at the National Oceanographic Data Centre of Ireland: https://doi.org/10.20393/CE49BC4C-91CC-41B9-A07F-D4E36B18B26F and https://doi.org/10.20393/EAD02A1F-AAB3-4F4E-AD60-6289B9585531.


Introduction
Dissolved nutrients such as nitrate, nitrite, silicate and phosphate can be a critical limiting factor constraining the growth of phytoplankton, which in turn form the base of the marine food web. They also provide useful chemical signatures (e.g. ratios of preformed nutrients) that can distinguish water masses and their origins (Broecker and Peng, 1982) as well as act as tracers for biogeochemical processes such as nitrogen fixation and denitrification (Deutsch and Weber, 2012). There is growing evidence for significant variability, including long-term trends, in nutrient levels in coastal waters (Kim et al., 2011), open-ocean surface waters (Yasunaka et al., 2014) and deep water (Kim et al., 2014). These changes reflect direct human intervention in the global environment, especially the effects of the massive ongoing perturbation of the nitrogen cycle (Yang and Gruber, 2016), as well as changes in ocean circulation and biogeochemical cycling that may or [...]
Published by Copernicus Publications.
Identification and attribution of the variability of nutrient concentrations has been complicated by the existence of systematic analytical errors in data sets collected by different groups at different times. This can lead to controversy over the significance of observed long-term changes (e.g. Zhang et al., 2001) and generally requires empirical correction of historical data, using a variety of ad hoc approaches and principles (Keller et al., 2002; Moon et al., 2016; Pahlow and Riebesell, 2000; Tanhua et al., 2010). Recognition of such systematic errors within and between data sets led to a series of international comparison studies and the introduction of certified reference materials (CRMs) for dissolved nutrients (Aoyama et al., 2007, 2016), as well as recommendations concerning standard protocols for sampling, sample preservation and analysis (Hydes et al., 2010). These steps have undoubtedly contributed to a general improvement in interlaboratory comparability of field-collected data. However, it is notable that most intercomparison studies rely on either (a) shore-based laboratory analysis of replicate samples in the context of specially organised intercomparison studies or (b) crossover analysis of measurements made at nearby locations in the ocean where temporal and spatial variability is expected to be small.
The former approach is valuable, but most analysts are aware that conditions during an actual research cruise do not always match the stable, controlled conditions of a shore-based laboratory where a group can prepare carefully for their measurement of intercomparison samples. On the other hand, the latter approach works well in oceanic regions, where stable, unchanging nutrient concentrations can be expected. However, in regions such as the surface open ocean of the North Atlantic, or the northwest Pacific and in coastal regions everywhere, temporal and/or spatial variations can be expected, which complicates the interpretation of crossover comparisons.
In this paper we report the results, findings and lessons learned from a rare opportunity in which two independent nutrient analysis teams participated jointly in a deep-ocean hydrographic section as part of the international GO-SHIP programme (Talley et al., 2016). Both teams followed standard protocols (Hydes et al., 2010) and both groups used CRMs during the cruise. As such, the cruise provided an opportunity to assess the likely comparability of nutrient data collected following such protocols as well as helping to identify a number of issues affecting data quality that could be of general relevance to groups conducting such measurements elsewhere. The intercomparison illustrates how lab-based performance assessment can be compared to at-sea assessment. We are not aware of any other report of such an extensive, at-sea intercomparison of nutrient measurement systems.
The GO-SHIP A02 survey was completed in April-May 2017 on the RV Celtic Explorer, travelling from St. John's, Newfoundland, Canada, across the North Atlantic to Galway, Ireland with on-board teams from Ireland, Canada, Germany, the UK and the USA. The survey provided an unusual opportunity for cross-comparison of methods, data quality procedures and exchange of technical expertise between the international scientific groups. The Marine Institute (MI) and Dalhousie University (Dal) teams brought separate Skalar San++ nutrient auto-analysers on the survey to provide a contingency against technical failures and allow for on-board intercomparison of data, as well as exploration of the impact on data quality of subtle differences in laboratory methods, procedures and instrument configurations that ostensibly conform to the same (GO-SHIP) guidelines and quality assurance criteria.
A total of 67 stations were occupied along the A02 transect (Fig. 1), with 1231 nutrient samples analysed for total oxidised nitrogen (TOxN), nitrite, phosphate and silicate on the MI nutrient system. Of these, 12 stations were sampled and analysed on both the MI and Dal nutrient systems, allowing the comparison of 291 samples between the two systems. The 12 stations were also compared with historical data from the A02 transect completed on a World Ocean Circulation Experiment survey in 1997.

Methods
Sampling, sample preservation and analytical procedures on both systems followed methods outlined in the GO-SHIP guidelines for nutrient analysis at sea (Hydes et al., 2010), while both groups also incorporated their existing laboratory quality control (QC), which was specifically adapted to their individual instruments. Note that the draft revised version of the GO-SHIP nutrients manual that is available at the time of writing (Becker et al., 2019) had not yet been released ahead of the 2017 A02 survey.

Sampling procedures
Both groups collected nutrient samples directly from the Niskin bottles into Falcon tubes (details in Table 1); as per GO-SHIP guidelines, the samples were not filtered. Samples were typically analysed on board within 12 h of sampling.

Analytical methods
Analysis was carried out on two separate Skalar San++ Continuous Flow Analyzers, set up in two separate on-board containerised laboratories brought by each team. Both analysers run four channels of nutrients simultaneously: total oxidised nitrogen, nitrite, silicate and phosphate. The Dal system also measures ammonia; however, contamination issues were encountered during the survey and there is therefore no further discussion of this method. Each instrument has an auto-sampler in which a needle draws the sample into the analyser, where it is split into the four channels. Each channel has its own set of reagents; the stream of reagents and sample is pumped through the manifold to undergo treatment such as mixing and heating before entering a flow cell for detection. The air-segmented flow promotes mixing of the sample and prevents contamination between samples. The reagents react to develop a colour, which is measured as an absorbance through the flow cell at a given wavelength. The Skalar Interface transmits all the data to the Skalar FlowAccess software.
Reagents for both systems were made using high-purity chemicals, pre-weighed using high-precision calibrated balances prior to the survey, stored in acid-washed polyethylene (PE) containers and mixed to final volume on board using ultrapure water. See reagent compositions in Table 1. The ultrapure water was generated using a Smart2Pure water purification system. Reagent storage time was in accordance with the Skalar methods: most can be stored for 1 week, the silicate ammonium heptamolybdate and oxalic acid reagents for 1 month; however, fresh reagents were typically made every 2-3 days due to the volume required during the survey.
The analytical procedures for all nutrients were similar between the Dal and MI systems, but with some differences in the chemical composition of reagents and volumes of reagents/sample through the instruments (Table 1). For the determination of nitrite, the diazonium compounds formed by diazotising sulfanilamide with nitrite in water under acidic conditions (due to phosphoric acid in the reagent) are coupled with N-(1-naphthyl) ethylenediamine dihydrochloride to produce a reddish-purple colour, which is measured at 540 nm.
For silicate determination the sample is acidified with sulfuric acid and mixed with an ammonium heptamolybdate solution, forming molybdosilicic acid. This acid is reduced with L(+)ascorbic acid to a blue dye and measured at 810 nm. Oxalic acid is added to avoid phosphate interference.
For the determination of phosphate, ammonium heptamolybdate and potassium antimony(III) oxide tartrate react in an acidic medium (with sulfuric acid) with diluted solutions of phosphate to form an antimony-phosphomolybdate complex. This complex is reduced to an intensely blue-coloured complex by L(+)ascorbic acid and is measured at 880 nm.
For the determination of total oxidised nitrogen (TOxN), both methods buffer the sample to a pH of 8.2; the sample is then passed through a column containing granulated copper-cadmium to reduce nitrate to nitrite. The nitrite originally present, plus the reduced nitrate, is determined by being diazotised with sulfanilamide and coupled with N-(1-naphthyl) ethylenediamine dihydrochloride to form a strong reddish-purple dye, which is measured at 540 nm. MI uses an ammonium chloride and ammonium hydroxide buffer solution, while the Dal buffer solution is made of imidazole and hydrochloric acid (Table 1). The MI uses a cadmium column through which no air bubbles are allowed, while the Dal system allows air bubbles through its column but monitors the efficiency of the reduction process daily, reactivating the cadmium column with 1 M hydrochloric acid and a copper sulfate solution if the efficiency falls below 95 %. It should be noted that above 95 %, the reduction efficiency is consistent throughout a run and therefore does not have to be corrected for; below 95 % the reduction efficiency may be variable, so the column must be reactivated to ensure there is no impact on the samples; this follows GO-SHIP protocol (Hydes et al., 2010).
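The daily column check described above can be sketched in a few lines. The text does not state how the reduction efficiency is computed, so this sketch assumes one common approach: comparing the baseline-corrected response of a nitrate standard with that of a nitrite standard of equal concentration, both passed through the column. The function names and the example responses are hypothetical; only the 95 % threshold comes from the text.

```python
def reduction_efficiency(nitrate_std_response: float,
                         nitrite_std_response: float) -> float:
    """Percentage of nitrate reduced to nitrite.

    Assumption: both standards have the same molar concentration and
    both responses are baseline-corrected peak heights (or absorbances).
    """
    if nitrite_std_response <= 0:
        raise ValueError("nitrite standard response must be positive")
    return 100.0 * nitrate_std_response / nitrite_std_response


def column_needs_reactivation(efficiency_pct: float,
                              threshold_pct: float = 95.0) -> bool:
    """True if the efficiency has fallen below the 95 % threshold."""
    return efficiency_pct < threshold_pct


# Hypothetical daily check values (not taken from the cruise data).
efficiency = reduction_efficiency(0.452, 0.480)
print(f"efficiency = {efficiency:.1f} %, "
      f"reactivate = {column_needs_reactivation(efficiency)}")
```

Keeping the threshold as a parameter makes it easy to apply a stricter laboratory criterion if one is in use.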
Both instruments were calibrated daily using a suite of calibration standards (see calibration range in Table 2). The primary standards for each nutrient were made by each team immediately prior to the survey using calibrated balances and high-purity chemicals diluted to 1 L with ultrapure water, as per Skalar methods. The primary stocks were stored in a refrigerator for the duration of the survey. Two batches of primary stocks were used on the MI system to ensure no bias from an individual batch, while one batch of primary stock was used on the Dal system. Weekly secondary stocks were diluted from the primary stocks into 100 mL polypropylene (PP) flasks and stored in the fridge when not in use. These could be used for 1 week. Daily standards were made from secondary stock into 100 mL PP volumetric flasks.
MI calibration standards were made using calibrated fixed volume pipettes, while Dal standards were made using calibrated adjustable volume pipettes (0.1-1, 0.5-5 mL) and one calibrated fixed volume pipette (10 mL). All pipettes were tested prior to the start of the survey to ensure that the volumes delivered were accurate. The MI secondary stocks were made using ultrapure water, while the daily standards were made using artificial seawater (ASW) with salinity of 35. Both secondary and daily standards on the Dal system were made using ASW (salinity 33-35). Concentrations of daily standards for each system are given in Table 2; first-order (linear) calibration curves were fitted: neither group forced [...]
A notable difference between the two systems was the composition of the baseline wash: the MI analyser used ASW, a sodium chloride solution with a similar salinity to the expected samples (salinity 35), as the baseline wash for all channels. Batches of sodium chloride used were tested prior to the survey to ensure no contamination with any of the nutrients. The MI system runs its baseline wash as the first (zero) standard. The Dal system used ultrapure water as the baseline wash and ran a sample of ASW (effectively a blank, i.e. no nutrients) as the first standard, which was set to 0 for [...] (Table 2). The GO-SHIP manual recognises both ASW and ultrapure water as suitable baseline washes for nutrient analysis at sea.

Quality control
The certified reference materials (CRMs) used on the survey by both groups were supplied by KANSO (Aoyama et al., 2007, 2016). Two batches (batch CD and batch BW, Table 3) were used on the MI system to cover the full range of nutrients expected on the survey, with a CD and a BW analysed at the beginning of a run and another CD at the end of the run. While Dal primarily analysed batch CD, they also analysed a BW CRM on three runs, as a comparison. The KANSO certified values are in µmol kg⁻¹ (Table 3), which were converted to µmol L⁻¹ for the QC charts, since the Skalar results are in µmol L⁻¹. The density for this conversion was calculated as per Millero and Poisson (1981), where the CRM salinity and analysis temperature (laboratory temperature of 20 °C for both the MI and Dal containers) were used. The BW CRM for silicate has a concentration (61.47 µmol L⁻¹) higher than the highest standard (60 µmol L⁻¹) used by both groups and is therefore only used as an indication of QC variations for higher levels of silicate.
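The unit conversion described above can be illustrated with a short sketch. It uses the one-atmosphere equation of state of Millero and Poisson (1981) with the coefficients as they are commonly tabulated; these should be checked against the original publication before any operational use. The function names and the example concentration and salinity are hypothetical and are not taken from Table 3.

```python
def seawater_density(salinity: float, temp_c: float) -> float:
    """Density of seawater at one atmosphere in kg m^-3.

    Polynomial form of Millero and Poisson (1981); coefficients as
    commonly tabulated (to be verified against the original paper).
    """
    t, s = temp_c, salinity
    # Density of pure water as a function of temperature.
    rho_w = (999.842594 + 6.793952e-2 * t - 9.095290e-3 * t**2
             + 1.001685e-4 * t**3 - 1.120083e-6 * t**4
             + 6.536332e-9 * t**5)
    # Salinity-dependent terms.
    a = (8.24493e-1 - 4.0899e-3 * t + 7.6438e-5 * t**2
         - 8.2467e-7 * t**3 + 5.3875e-9 * t**4)
    b = -5.72466e-3 + 1.0227e-4 * t - 1.6546e-6 * t**2
    c = 4.8314e-4
    return rho_w + a * s + b * s**1.5 + c * s**2


def umol_per_kg_to_umol_per_l(conc_umol_kg: float, salinity: float,
                              temp_c: float = 20.0) -> float:
    """Convert a gravimetric concentration to a volumetric one.

    The default temperature is the 20 degC laboratory temperature
    quoted in the text.
    """
    return conc_umol_kg * seawater_density(salinity, temp_c) / 1000.0


# Hypothetical CRM value and salinity, for illustration only.
print(round(umol_per_kg_to_umol_per_l(10.0, 35.0), 3))
```

Because seawater is denser than 1 kg L⁻¹, the value in µmol L⁻¹ is always about 2 %-3 % larger than the certified value in µmol kg⁻¹.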
Prior to GO-SHIP, the MI laboratory developed acceptance criteria for CRMs based on the standard deviation of CRM results. The MI had primarily used Eurofins seawater and estuarine CRMs (https://www.eurofins.dk/miljoe/vores-ydelser/certificerede-vki-referencematerialer/information-in-english/, last access: 4 March 2019) in the daily nutrient runs, with good results. The MI also participates in the QUASIMEME marine and estuarine proficiency testing schemes: between 2008 and 2017, the average absolute z scores |Z| from 84 test samples at the MI laboratory were 0.5 for TOxN, 0.4 for nitrite, 0.5 for silicate and 0.4 for phosphate. In that period, |Z| scores were satisfactory for all results greater than the limit of quantification (LOQ), with the exception of a single silicate result (Z = 2.04).
With no history of KANSO CRM results prior to the A02 survey, the QUASIMEME z-score assessment criteria were used, where |z| < 2 is considered satisfactory. The z score is calculated as follows (Cofino and Wells, 1994):

$$ z\ \mathrm{score} = \frac{\text{measured value} - \text{certified value}}{\text{total error}} \tag{1} $$

Total error is calculated as

$$ \text{total error} = \frac{\text{assigned value} \times \text{proportional error (6 \%)}}{100} \tag{2} $$

On the MI system, every sample was analysed twice and relative percentage differences (RPD_REP) were calculated for replicates using Eq. (3). Samples with RPD_REP > 10 % were reanalysed.

$$ \mathrm{RPD_{REP}} = \frac{\text{replicate A} - \text{replicate B concentration}}{\text{average nutrient concentration}} \times 100\,\% \tag{3} $$

On the Dal system, every sample was measured in triplicate and a coefficient of variation (CV(%)) was calculated (Eq. 4). For samples with concentrations of 0.5 to 10 and > 10 µmol L⁻¹, an outlier replicate was removed if the CV(%) was > 5 % and > 3 %, respectively. If the remaining two replicates differed by more than these amounts, both were rejected and the sample was reanalysed during the following run. For samples with lower concentrations (< 0.5 µmol L⁻¹), the CV(%) test was not used.
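The two replicate checks can be sketched as follows. Several details are assumptions rather than statements from the text: the RPD is compared with the 10 % limit in absolute value, CV(%) is taken as the sample standard deviation divided by the mean, the outlier in a triplicate is the replicate farthest from the median, and the two remaining replicates are compared through their RPD. All function names are hypothetical.

```python
from statistics import mean, median, stdev


def rpd(a: float, b: float) -> float:
    """Relative percentage difference of two replicates (Eq. 3)."""
    return (a - b) / mean([a, b]) * 100.0


def mi_duplicate_ok(a: float, b: float, limit_pct: float = 10.0) -> bool:
    """MI rule: duplicates with |RPD| above the limit are reanalysed."""
    return abs(rpd(a, b)) <= limit_pct


def cv_percent(values) -> float:
    """Coefficient of variation: sample standard deviation / mean."""
    return stdev(values) / mean(values) * 100.0


def dal_triplicate_qc(reps):
    """Dal rule for triplicates; returns (accepted replicates, status)."""
    m = mean(reps)
    if m < 0.5:
        # The CV(%) test is not used below 0.5 umol/L.
        return list(reps), "cv_test_not_applied"
    limit = 5.0 if m <= 10.0 else 3.0
    if cv_percent(reps) <= limit:
        return list(reps), "accepted"
    # Assumption: the outlier is the replicate farthest from the median.
    med = median(reps)
    outlier = max(reps, key=lambda x: abs(x - med))
    kept = list(reps)
    kept.remove(outlier)
    if abs(rpd(kept[0], kept[1])) > limit:
        return [], "reanalyse"
    return kept, "outlier_removed"


print(mi_duplicate_ok(10.5, 9.5))
print(dal_triplicate_qc([20.0, 20.1, 22.0]))
```

The status strings make it straightforward to count, per run, how many samples were accepted, trimmed or sent for reanalysis.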
Concentrations falling between the LOD and LOQ values were reported as < LOQ, while concentrations lower than the detection limit were reported as < LOD. Drift samples were analysed after every four samples on both systems to correct for instrumental drift during a run. The drift samples were prepared from secondary stock and artificial seawater (see concentrations in Table 2). System Suitability Standards (SSS) were made daily by the MI group using secondary stock standards and artificial seawater. These were not used to correct for drift but were instead analysed as an internal reference material every four samples to ensure drift correction was accurate and to identify any problems during the course of a run. All SSS were checked in post-processing: any deviating by more than ±10 % from the SSS value were marked as failed QC. The four samples on either side of a failed SSS were then reanalysed. The Dal group analysed their drift solution as an internal reference material every four samples; this "drift check" was monitored during a run but was not used for post-processing rejection or flagging.
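The SSS post-processing rule lends itself to a small illustration. The sketch below assumes a run is available as an ordered list of (kind, label, value) tuples, which is a hypothetical layout rather than the Skalar FlowAccess export format; the ±10 % tolerance and the window of four samples on either side come from the text.

```python
def samples_to_reanalyse(run, sss_nominal, tolerance_pct=10.0, window=4):
    """Return labels of samples bracketing any failed SSS.

    run: list of (kind, label, value) in analysis order, where kind is
    "sample" or "sss" (hypothetical layout used for this sketch).
    """
    flagged = []
    for i, (kind, _label, value) in enumerate(run):
        if kind != "sss":
            continue
        deviation = abs(value - sss_nominal) / sss_nominal * 100.0
        if deviation <= tolerance_pct:
            continue
        # The four samples before and the four after the failed SSS.
        before = [lab for k, lab, _ in run[:i] if k == "sample"][-window:]
        after = [lab for k, lab, _ in run[i + 1:] if k == "sample"][:window]
        for lab in before + after:
            if lab not in flagged:
                flagged.append(lab)
    return flagged


# Hypothetical run: twelve samples with an SSS after every fourth one.
example_run = []
sss_values = [10.2, 11.4, 9.9]  # the second SSS is 14 % high and fails
for n in range(1, 13):
    example_run.append(("sample", f"s{n}", 4.0))
    if n % 4 == 0:
        example_run.append(("sss", f"sss{n // 4}", sss_values[n // 4 - 1]))

print(samples_to_reanalyse(example_run, sss_nominal=10.0))
```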

Comparison of data
To compare final nutrient concentrations analysed on the two instruments, the sample relative percentage difference (RPD_MI−DAL) was also calculated based on the MI and Dal nutrient concentrations: RPD_MI−DAL = [(conc_MI − conc_Dal)/average conc_MI&Dal × 100 %].
While nitrite was analysed on both instruments, there were issues with nitrite contamination in both systems, potentially due to the ultrapure water quality on board. Whereas all frozen samples were reanalysed at the MI after the cruise, this was not possible for the Dal samples, so a comparison of nitrite methods and data cannot be carried out in this study.
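A minimal sketch of the between-system comparison follows. The form of the RPD is assumed by analogy with the RPD_SEA−LAB expression given later in the text (difference divided by the average of the two values), and the 95 % confidence interval uses a normal approximation (factor 1.96), which is reasonable only for large sample counts such as the several hundred paired samples of this cruise. The text does not state how the confidence intervals were actually computed. The paired values below are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def rpd_mi_dal(conc_mi: float, conc_dal: float) -> float:
    """Relative percentage difference, MI minus Dal (assumed form)."""
    return (conc_mi - conc_dal) / ((conc_mi + conc_dal) / 2.0) * 100.0


def mean_with_ci95(values, z: float = 1.96):
    """Mean and half-width of a normal-approximation 95 % interval."""
    half_width = z * stdev(values) / sqrt(len(values))
    return mean(values), half_width


# Hypothetical paired results (umol/L), not cruise data.
pairs = [(9.8, 10.2), (15.1, 15.3), (20.4, 20.3), (17.0, 17.4)]
rpds = [rpd_mi_dal(mi, dal) for mi, dal in pairs]
avg, ci = mean_with_ci95(rpds)
print(f"mean RPD = {avg:.2f} % (+/- {ci:.2f} %)")
```

A negative mean indicates that the MI results are lower than the Dal results on average, which is the sign convention used for the values quoted in the results.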

Sample-to-sample comparisons including vertical profiles
The MI and Dal data are both available on the MI database (see links in Data availability). It is important to note that the MI data used in this comparison were calculated using split calibration curves: any TOxN and silicate data < 5 µmol L⁻¹ were calculated from a calibration range of 0-10 µmol L⁻¹, while all other data were calculated using the 0-50 µmol L⁻¹ calibration range. The reason for this split calibration is discussed in Sect. 3.2 and 3.3. Overall, without any adjustments based on CRM analysis results, there was relatively good agreement between vertical profiles of nutrients measured with the two systems, as can be seen from vertical profiles presented in Fig. 2 and the Supplement (Fig. S1). The mean percentage differences (RPD_MI−DAL) for all of the comparison samples measured during the cruise (n = 278-284) are shown in Table 5 and are −1.4 ± 0.6 %, −1.1 ± 1.1 % and +2.3 ± 1.2 % for TOxN, silicate and phosphate, respectively, where uncertainties are 95 % confidence intervals. This gives general confidence in the overall comparability of the data and individual methods, standardisation and analysis protocols used by each group.
For silicate, 70 % of samples had RPD_MI−DAL < 5 %. The largest differences are in the top 400 m, which typically had < 3 µmol L⁻¹ silicate, where 8 % of all the samples have RPDs between 11 % and 117 %, with the highest RPDs in stations with lowest silicate values (see vertical profiles of RPD_MI−DAL in Fig. 3). In contrast, for samples > 400 m, there was no significant difference between silicate concentrations measured on the two systems with an average RPD_MI−DAL of 0.3 ± 0.7 %, where the uncertainty is the 95 % confidence interval.
TOxN vertical profiles also compare reasonably well, with 77 % of all RPD_MI−DAL < 5 %. Virtually all TOxN samples with RPD_MI−DAL > 10 % are within the top 200 m, where TOxN concentrations are low (Fig. 3). However, Fig. 3 shows that MI values of TOxN from deeper than 400 m are significantly lower, by 2.1 ± 0.4 % (95 % CI), than concentrations measured on the Dal system. This is consistent with the difference in mean values reported for CRM analyses on the two systems (see Sect. 3.2 and Table 6).
There was less agreement between the two systems for phosphate, with only 38 % of samples having RPD_MI−DAL < 5 % (79 % of all samples had RPD_MI−DAL < 10 %). Almost half of the samples with RPD_MI−DAL > 10 % were in the top 400 m (Fig. 3). The remaining samples with larger differences deeper in the water column were from early stations of the cruise when the Dal system had problems with its phosphate channel. These problems were resolved and, in addition, the calibration range was altered from Station 46 onwards. If the earlier stations are excluded from the comparison, samples from > 400 m showed an average RPD_MI−DAL of 6.4 ± 0.8 % (95 % CI). The negative bias of Dal's phosphate results, relative to MI's, is also consistent with the difference of ca. 4 % in CRM results measured on the two systems from Station 46 onwards (see Sect. 3.2; Fig. 4; Table 6).
A comparison was also performed between analyses of frozen replicate samples conducted in the MI laboratory after the survey and MI samples analysed at sea. The RPD_SEA−LAB [(conc_sea − conc_lab)/average conc_sea&lab × 100 %] was 4(±8) % for TOxN, 8(±14) % for silicate and 13(±16) % for phosphate (where uncertainties are given as 1 standard deviation). The frozen samples were defrosted at the MI overnight prior to analysis, which was carried out within 2 months of sample collection. The RPD_SEA−LAB was typically positive, indicating that nutrient concentrations were lower in the frozen samples. This was also observed in a number of frozen samples that were analysed while at sea during the A02 survey. Of the nitrite samples that passed QC early in the survey, the frozen reruns had differences within the limit of quantification (< LOQ = 0.04 µmol L⁻¹) of the method.

Comparison of QC results at sea and on shore
Both systems used the z-score criteria used by QUASIMEME (with a proportional error of 6 %) for assessment of the CRM results during the survey; all CRMs had |Z| scores within 2, as shown on the QC charts in Fig. 4.
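The CRM assessment based on Eqs. (1) and (2) can be sketched as follows. Only the proportional error term of 6 % given in the text is used, and the function names are hypothetical. The certified values are the TOxN concentrations quoted in this paper for batches CD (5.65 µmol L⁻¹) and BW (25.26 µmol L⁻¹); the measured values are invented for illustration.

```python
def total_error(assigned_value: float,
                proportional_error_pct: float = 6.0) -> float:
    """Total error as in Eq. (2): proportional error only."""
    return assigned_value * proportional_error_pct / 100.0


def z_score(measured: float, certified: float,
            proportional_error_pct: float = 6.0) -> float:
    """z score as in Eq. (1)."""
    return (measured - certified) / total_error(certified,
                                                proportional_error_pct)


def is_satisfactory(z: float, limit: float = 2.0) -> bool:
    """|z| < 2 is considered satisfactory."""
    return abs(z) < limit


# Certified TOxN values from the text; measured values are hypothetical.
for batch, certified, measured in [("CD", 5.65, 5.40), ("BW", 25.26, 24.90)]:
    z = z_score(measured, certified)
    print(batch, round(z, 2), is_satisfactory(z))
```

With a 6 % proportional error, |z| < 2 corresponds to a measured value within 12 % of the certified value, which explains why all CRM results can pass this test while still showing a statistically significant bias of a few percent.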
Table 6 presents summary statistics for differences between measured and certified values as measured on both systems, expressed as percentages of certified values, together with the coefficient of variation, CV(%), of these differences. Overall, coefficients of variation for CRM analyses made on both systems were in the range of 3 %-5 % for all three nutrients. Early results for phosphate on the Dal system showed higher variation (10 %), but this improved later in the cruise following modifications to calibration procedures (Table 2). One CD CRM was run at the beginning and end of every run on both systems, and one BW CRM was analysed at the beginning of every run on the MI system. BW CRMs were run on only a selected number of runs of the Dal system for comparison.
Earth Syst. Sci. Data, 11, 355-374, 2019, www.earth-syst-sci-data.net/11/355/2019/
For TOxN there were statistically significant biases of the order of −3 % (95 % confidence interval of ±1) (Dal) and −5 % (±1.5) (MI) for the lower-concentration CRM (CD), with an apparently smaller bias at the higher concentration (BW). For silicate, the Dal and MI analyses were not statistically distinguishable from certified values. For phosphate, the high scatter of the Dal analyses at earlier stations (before Station 46) precluded a useful estimation of bias for the cruise as a whole. The later analyses on the Dal system, with reduced scatter, suggested a bias of the order of −6 % (±3) for the mid-range CRM, whereas the MI phosphate analyses showed a smaller bias of ca. −2.5 % with the mid-range CRM.
Comparison of the QC results of the MI system during the A02 cruise with those from shore-based analyses conducted before and afterwards suggests a considerable reduction in the precision of CRM analyses conducted at sea. Between 2013 and 2017 the Eurofins CRMs (n = 67) were measured with a CV(%) of 1.9 % for TOxN, 3.0 % for silicate and 2.6 % for phosphate. Following the survey, the CV(%) of KANSO CD CRM (n = 20) was 2.2 % for TOxN, 1.7 % for silicate and 4.4 % for phosphate, whereas the CV(%) of the KANSO CJ CRM (n = 18) was 1.7 % for TOxN, 3.0 % for silicate and 2.8 % for phosphate. Hence the variability of CRM analyses for TOxN and silicate during the A02 cruise (Table 6) is almost a factor of 2 larger than that of corresponding shore-based analyses, whereas phosphate variability was largely unchanged. This, together with the bias in the TOxN data, has been noted in the metadata for the data set. At-sea QC results with the Dal system on the A02 cruise were comparable to subsequent on-shore analyses (September 2017), which had a CV(%) for the KANSO CD CRM (n = 21) of 2.7 % for TOxN, 3.3 % for silicate and 4 % for phosphate (these values can be compared with Table 6). Analyses conducted at sea 1 year later (on cruise MSM74, May-June 2018) were also comparable, with CV(%) of 2.5 % for TOxN, 2.8 % for silicate and 5.4 % for phosphate.

Comparison of instrument calibrations
Both groups carried out testing of instrument calibrations prior to the A02 survey to determine the optimal calibration range. Tests indicated that the optimal calibration range for TOxN on the MI instrument was 0-30 µmol L⁻¹. However, early in the cruise, a negative bias was observed in the MI QC charts for the higher TOxN CRM (batch BW, 25.26 µmol L⁻¹) while, at the same time, a comparison of the MI and Dal data sets also identified a negative bias in the MI TOxN data relative to Dal data for samples at concentrations > 15 µmol L⁻¹. In an attempt to correct the bias while at sea, the TOxN calibration range on the MI system was increased from 0-30 to 0-50 µmol L⁻¹ to match the Dal system's calibration range. This change appeared to reduce the negative bias in the BW CRM, without substantially affecting the CD CRM results (Fig. S2 in the Supplement). The reason for the negative bias was, and remains, unclear since upon return of the instrument to the laboratory following the cruise, standards up to 30 µmol L⁻¹ resulted in better performance with greater precision and with less bias evident for TOxN.
A positive bias in the CD CRM was noted on the Dal phosphate channel early on in the cruise. This was corrected for by adding three new standards between 0 and 0.8 µmol L⁻¹ to help with the standard curve fit (Table 2). This change in the calibration range removed the positive bias (Fig. 4), and as such, stations 46-59, measured after the curve was changed, are primarily considered in the phosphate intercomparison. This change in the calibration curve and its use in the intercomparison are noted throughout the text.
Following the cruise, a calibration test was carried out in the MI laboratory, in which two sets of 14 QUASIMEME proficiency test materials with a wide range of nutrient concentrations were analysed, together with three batches of KANSO CRMs. The full suite of calibration standards (Table 2) was analysed during the run, while in the post-processing, results were calculated after selecting different standards and calibration coefficients (either first or second order calibration). This test was repeated a number of times and the results illustrate that the range of calibration standards used can indeed have an appreciable effect on the final reported value, particularly for lower nutrient concentrations (Table 7). While nitrite and phosphate were also analysed during this experiment, the range used on the A02 cruise did not extend beyond 2.2 µmol L⁻¹, and adjusting the lower calibration standards had a minimal effect on the final reported concentrations. Therefore, only the results for TOxN and silicate are discussed in this section.
For silicate, the use of different calibration standard ranges had only a marginal effect on samples with middle to high concentrations, for which almost all Z scores were |Z| < 1 (all < 4 % bias). The samples that illustrated a significant difference were those with concentrations < 2 µmol L⁻¹, where |Z| scores increased to 2 if the higher-concentration calibration standards were included. For example, in the QNU 300 sample (Table 7), the measured value had a difference of 7 % from the assigned value when using standards ≤ 10 µmol L⁻¹, whereas the difference increased to 21 % with use of standards up to 60 µmol L⁻¹.
There was greater variation in the TOxN results depending on which standards were used, but again it is clear that inclusion of the highest concentration standards (≤ 50 µmol L⁻¹) results in a larger bias in the accuracy of low-concentration TOxN samples. With the QNU 307 sample, the measured value was exactly the same as the assigned value (0 % difference) when standards ≤ 10 µmol L⁻¹ were used, while the difference increased to ±19 % if standards up to 50 µmol L⁻¹ were included.
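The sensitivity of low-concentration results to the calibration range can be reproduced with synthetic data. In the sketch below the detector response is assumed to be slightly non-linear (a small quadratic term), which is an illustrative assumption and not a measured property of either instrument. The standards, response coefficients and function names are hypothetical; the 0-10 µmol L⁻¹ low range and the 5 µmol L⁻¹ switch point follow the split calibration applied to the MI data.

```python
def fit_linear(concs, signals):
    """Ordinary least-squares first-order calibration: signal = a*c + b."""
    n = len(concs)
    mx, my = sum(concs) / n, sum(signals) / n
    sxx = sum((x - mx) ** 2 for x in concs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(concs, signals))
    slope = sxy / sxx
    return slope, my - slope * mx


def to_concentration(signal, slope, intercept):
    """Invert the calibration line."""
    return (signal - intercept) / slope


def split_calibration(signal, standards, low_max=10.0, threshold=5.0):
    """Use the full-range fit unless the estimate is at or below the
    threshold, in which case refit with standards <= low_max only."""
    concs, signals = zip(*standards)
    estimate = to_concentration(signal, *fit_linear(concs, signals))
    if estimate > threshold:
        return estimate
    low = [(c, s) for c, s in standards if c <= low_max]
    low_concs, low_signals = zip(*low)
    return to_concentration(signal, *fit_linear(low_concs, low_signals))


def response(c):
    """Synthetic, slightly curved detector response (assumption)."""
    return 0.02 * c - 2e-5 * c ** 2


standards = [(c, response(c)) for c in (0, 1, 2, 5, 10, 20, 30, 50)]
true_conc = 1.5
full_fit = fit_linear(*zip(*standards))
full_result = to_concentration(response(true_conc), *full_fit)
split_result = split_calibration(response(true_conc), standards)
print(f"full range: {full_result:.2f}, split: {split_result:.2f}")
```

With this synthetic response the full-range fit underestimates the low sample noticeably, while the restricted fit recovers it closely; the size of the effect on a real instrument depends on its actual curvature and on the spacing of the standards.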
Based on this experiment's finding that the lowest TOxN and silicate concentrations showed a reduced bias when calculated with a smaller range of calibration standards, the MI GO-SHIP A02 data with TOxN and silicate concentrations ≤ 5 µmol L⁻¹ were recalculated using standards of ≤ 10 µmol L⁻¹ (Table 2). The TOxN CD values (5.65 µmol L⁻¹) plotted in Fig. 4 are calculated using the calibration range of 0-10 µmol L⁻¹ to illustrate the accuracy of this method. This is a key finding of this intercomparison: a split calibration could potentially reduce bias and CV(%) in CRMs and samples across a broad concentration range. A sample run could be split up into two (or more) components that are linear, which will be specific to individual instruments and configurations.
Nutrient analysis on the WOCE A02 survey in 1997 was also carried out using a Skalar Continuous Flow AutoAnalyzer (SA 4000) for photometric determination of nitrate, nitrite, phosphate and silicate. Analytical methods were similar to the MI and Dal systems, with nutrients measured at the same wavelengths, while calibrated flasks and pipettes were also used for the daily calibration standards. There were no CRMs available for the 1997 cruise; instead the internal consistency of the nutrient measurements between cruises was assessed by comparison of quality-controlled dissolved inorganic carbon (DIC) data, wherein any inaccuracies in the nutrient measurements would show up as offsets or slope changes in the DIC-nutrient plots derived from various cruises. The "estimated accuracy on the WOCE survey was 0.02 µmol kg⁻¹ for nitrite, 0.1 µmol L⁻¹ for nitrate, 0.05 µmol L⁻¹ for phosphate and 0.5 µmol L⁻¹ for silicate" (https://cchdo.ucsd.edu/cruise/06MT39_3, last access: 4 March 2019). There was no information provided in the cruise report, and no articles published (that we know of), which state the calibration ranges used on this survey. The vertical profiles of nutrient data compared quite well with the 2017 data (Fig. 2 and Fig. S1 in the Supplement). Not every station on the 2017 survey could be compared directly with the 1997 survey due to small differences in some station positions, which sometimes resulted in bottom depth differences of over 500 m between the two surveys.

Discussion
The comparison of the MI and Dal data sets from the A02 survey highlights the importance and effectiveness of following standard protocols. Both groups followed the GO-SHIP manual (Hydes et al., 2010) for the sampling and determination of nutrients in seawater while also incorporating their existing laboratory QC methods that were specifically adapted to their instruments.

MI vs. Dal station-by-station comparison
Figure 5 presents differences between samples that were measured on both the MI and Dal systems on a station-by-station basis. Summary statistics for the station-by-station comparisons are shown in Table 5. Because most of the stations plotted and listed were measured on different autoanalyser runs, these plots and statistics also give an indication of run-to-run differences in the level of agreement between the systems. RPD_MI−Dal values are shown for three subsets: all data (upper panels), samples from > 400 m only (middle panels) and samples from < 400 m only (lower panels).
The plots show the larger RPDs and greater number of outliers for comparisons made on shallower (< 400 m) samples with lower concentrations, which is also evident from the depth profiles (Fig. 3). Figure 5 and Table 5 also show good overall agreement between MI and Dal measurements of TOxN and silicate as determined on a cruise-wide basis (average bias of ca. 1 %-2 %; see Sect. 3.1). However, the difference is variable from station to station, with individual stations having average differences as large as 3 %-4 %; this is likely due to run-to-run variations in measurement calibration on both systems. For phosphate, there was a clear improvement in the variability and magnitude of the between-system agreement later in the cruise.
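As a minimal sketch of how the RPD statistics for the three depth subsets could be computed, the following uses the RPD definition given in the figure captions, (MI conc − Dal conc)/average conc × 100 %. The concentrations and depths are invented values, and the function name `rpd` is hypothetical.

```python
import numpy as np

def rpd(mi, dal):
    """Relative percentage difference: (MI - Dal) / mean(MI, Dal) * 100."""
    mi = np.asarray(mi, dtype=float)
    dal = np.asarray(dal, dtype=float)
    return (mi - dal) / ((mi + dal) / 2.0) * 100.0

# Hypothetical paired measurements (µmol/L) and sampling depths (m).
depth = np.array([10, 50, 200, 600, 1500, 3000])
mi_conc = np.array([0.9, 2.1, 9.8, 16.9, 17.8, 20.1])
dal_conc = np.array([1.1, 2.0, 10.0, 17.0, 18.0, 20.5])

values = rpd(mi_conc, dal_conc)
deep = depth > 400       # samples from below 400 m
shallow = depth < 400    # samples from above 400 m
everything = np.ones_like(deep)  # boolean mask selecting all samples

for label, mask in [("all", everything), ("> 400 m", deep), ("< 400 m", shallow)]:
    subset = values[mask]
    print(f"{label:8s} mean RPD = {subset.mean():+.2f} %, "
          f"SD = {subset.std(ddof=1):.2f} %")
```

Because the denominator is the mean of the two measurements, the same absolute offset produces a much larger RPD at low concentrations, which is one reason the shallow subset shows more scatter.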
Figure 5 and Table 5 show that, on a cruise-wide basis, average differences (MI − Dal) determined on the water samples and CRMs are similar. The respective differences of MI − Dal results for water samples and CRMs are −1.4 % and −2.2 % (TOxN), −1.1 % and +1.3 % (silicate), and +2.3 % and +3.6 % (phosphate). Figure 5 also shows that the station-by-station means of differences measured on the water samples generally fall within ±1 standard deviation of the cruise-wide average RPD that was determined from analyses of CRMs.
We regressed the station-to-station differences of sample analyses with the corresponding differences of CRM analyses but found no significant correlation. This implies that, for this data set at least, we cannot use run-by-run analyses of CRMs to correct sample data from individual stations. This is likely due to the limited number of CRMs that were analysed per station and run relative to the within-run precision.
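A check of this kind can be sketched as follows. The station-mean differences are invented numbers, the helper names are hypothetical, and the statistical procedure actually used for the A02 data is not specified in this section, so the Pearson correlation with a t statistic shown here is an assumption.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation (n - 2 d.o.f.)."""
    return r * np.sqrt((n - 2) / (1.0 - r**2))

# Hypothetical station means of MI - Dal differences (%), one value per station.
sample_diff = np.array([-1.8, -0.4, -2.9, -1.1, 0.3, -2.2, -0.9, -1.6])
crm_diff = np.array([-2.5, -2.0, -1.4, -3.1, -2.6, -1.9, -2.8, -1.7])

r = pearson_r(sample_diff, crm_diff)
t = t_statistic(r, len(sample_diff))
print(f"r = {r:+.2f}, t = {t:+.2f} with {len(sample_diff) - 2} degrees of freedom")
```

The t value is then compared with the two-sided critical value for n − 2 degrees of freedom (about 2.45 at the 5 % level for 6 degrees of freedom). A run-by-run CRM correction is only justified if the correlation is significant.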
Overall, the results suggest that average levels of agreement between independent nutrient data sets should be interpreted with caution. Clearly, comparisons of data collected in deepwater with high concentrations risk not being directly applicable to samples from shallower depths with lower concentration ranges, where percentage errors are generally larger. Perhaps more significantly, our results also show that station-to-station variations in data quality and bias can be considerably larger (by several percent) than the mean bias between two cruise-wide data sets. These station-to-station variations in bias likely arise from short-term differences in instrument calibration that are difficult to identify without very detailed monitoring of system performance. This observation is relevant to "secondary quality control" (Tanhua et al., 2009, 2010) of nutrient data, in which adjustments to entire cruise data sets might potentially be recommended on the basis of offsets between deepwater measurements made on different cruises at a limited number of crossover or co-located stations. "Drifting or variable measurement precision and accuracy during a cruise" (Tanhua et al., 2010) is a recognised potential pitfall of this approach and the A02 survey provides a rare example of a "crossover cruise" from which its impact on between-cruise data comparisons can be estimated.

At-sea vs. on-shore measurement: potential sources of error
A key observation from this study was the demonstration of the potential for reduced precision and increased bias of CRM results analysed at sea, relative to those analysed on shore. This was evident for TOxN, silicate and nitrite analyses on the MI system, with almost a doubling in the CV(%) of CRMs analysed on the A02 survey, while phosphate QC was similar for land- and sea-based analyses. This implies that shore-based intercomparisons and QC tests, where samples are measured under stable conditions and where there may be a tendency to analyse test samples when instruments are working "normally", do not necessarily reflect the quality of data collected at sea under more difficult conditions and, often, when analysts are under time pressure. It is likely not possible to pinpoint the exact cause(s) for the increased scatter in MI silicate and TOxN CRM results relative to shore-based analyses or for the negative bias in the TOxN results from both systems that was observed during A02. However, a number of potential sources of error associated with at-sea analysis can be speculated on.
-Ship vibrations: these were particularly evident in the MI container during A02. Unlike the other container labs, which were lined along the middle of the aft deck, the MI container was located along the starboard aft deck, in contact with the ship's hull, and appeared to suffer greater vibration at higher speeds and during dynamic positioning of the ship (when the thrusters were in action) than noticed in other containers. The vibrations even caused the instrument to crash a number of times when the auto-sampler syringe could not address the cup correctly. These vibrations had not been encountered on previous surveys on which the on-board laboratory was deployed and analysis undertaken. Vibration could potentially disrupt the light path of the instrument photometers, which could ultimately affect the measured nutrient concentrations.
During a transit westward across the Atlantic immediately prior to the A02 survey, during which the sea state was calmer and dynamic positioning was not used, two trial runs on the MI system showed little bias and better precision (CV(%) in CD: TOxN < 2.8, phosphate < 2.2, silicate < 2, n = 6; CV(%) in BW: TOxN 0.6, phosphate < 1.2, silicate < 1.5, n = 5; see QC charts in the Supplement). The trial runs on the westward leg used the same reagents, stock solutions, pipettes and glassware as used on the survey proper. A vibration-related error, affecting the MI system more than the Dal system, could lead to variable differences between the measurements made on the two systems during the cruise.
-Water purification unit: although the ultrapure water from the RV Celtic Explorer was tested ahead of the A02 survey to ensure no nutrient contamination, problems arose for both groups during the survey with their nitrite channels, and this appeared to be due to the varying levels of nitrite in different batches of ultrapure water. This was sometimes seen as a shift in the nitrite baseline when a new batch of ultrapure water was used. If, in fact, there was nitrite in the ultrapure water used to make the reagents, standards and baseline wash, then it would contribute to the negative bias observed in the TOxN measurements with both systems, as it would raise the baseline due to higher levels of nitrite present.
It was noted that, on the westward leg, there were no such issues with the nitrite analysis on the MI system. Anecdotal reports of problems with pure-water supplies on research vessels are common. Such a contamination issue on a shared water supply might lead to a bias with TOxN measurements on both systems, as observed.
-Standard preparation: a key difference between shore-based and at-sea analysis by the MI group was the use of pipettes rather than balances for the preparation of daily calibration standards. However, all pipettes used by the MI on the A02 survey were calibrated ahead of the survey and should not have influenced the final results. There also did not appear to be any bias in the results between the two analysts using the MI system. The Dal system used the same pipettes to make secondary and working standards on land as were used on the survey. This source of error might be expected to result in constant (rather than variable) differences between the two systems.
-Reagent preparation: all reagent chemicals were pre-weighed and stored in acid-cleaned containers until use. Tests were carried out at the MI and Dal prior to the A02 survey to ensure there were no issues of contamination in the pre-weighed chemicals. The accuracy and precision measured on the test runs on the westward transit prior to the A02 survey also indicated no contamination in the MI chemicals. The Dal team had extra pre-weighed reagents, which they continued to use for up to 9 months after the survey, indicating there were no contamination issues with storage time of the reagents.
-CRM use: the latest revision of the GO-SHIP guidelines (Becker et al., 2019) recommends that a new CRM bottle should be opened for every run, or at least every 2 days. This protocol was not followed on the GO-SHIP A02 survey and CRMs were generally used until they ran out. Similarly, this was not done during shore-based analysis, and therefore is unlikely to have contributed to the difference between at-sea and shore-based analyses. Changes in CRM concentrations after opening could impact the comparison of CRM results between the two systems, and the CV(%) of the CRM measurements. However, there is no reason for this to impact the differences observed between MI and Dal analyses of water samples.
Based on this difference in the overall method performance between the lab-based and at-sea analyses, the z-score acceptance criteria were recalculated following the survey, reducing the proportional error from 6 % to 2 % in Eq. (2) to better quantify the land-based instrument capability. This narrowed the CRM assessment criteria (see both limits in Fig. 4) to levels which we feel are more suitable for oceanic nutrient samples. This was also closer to the CV(%) results of international laboratories from the recent JAMSTEC intercomparison exercise, which were typically less than 2 % for both TOxN and silicate (Aoyama et al., 2018).
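The narrowing of the acceptance limits can be sketched numerically. Equations (1) and (2) are not reproduced in this section, so the sketch assumes the QUASIMEME-style form, total error = assigned value × proportional error/100 + 0.5 × constant error, with z = (measured − assigned)/total error and action limits at |z| = 2. The function names and the measured value are hypothetical; the assigned value (5.65 µmol L⁻¹ for TOxN in CD) and the constant error (0.05 µmol L⁻¹) are taken from the text.

```python
def total_error(assigned, pe_percent, constant_error):
    """Assumed total error: proportional part plus half the constant error."""
    return assigned * pe_percent / 100.0 + 0.5 * constant_error

def z_score(measured, assigned, pe_percent, constant_error):
    """z score of a CRM measurement against its assigned (certified) value."""
    return (measured - assigned) / total_error(assigned, pe_percent, constant_error)

def action_limits(assigned, pe_percent, constant_error, z_limit=2.0):
    """Lower and upper action limits corresponding to |z| = z_limit."""
    te = total_error(assigned, pe_percent, constant_error)
    return assigned - z_limit * te, assigned + z_limit * te

ASSIGNED_TOXN_CD = 5.65   # µmol/L, TOxN value for CRM batch CD (from the text)
CONSTANT_ERROR = 0.05     # µmol/L, constant error for TOxN (from the text)
measured = 5.30           # µmol/L, hypothetical CRM result

for pe in (6.0, 2.0):
    lal, ual = action_limits(ASSIGNED_TOXN_CD, pe, CONSTANT_ERROR)
    z = z_score(measured, ASSIGNED_TOXN_CD, pe, CONSTANT_ERROR)
    verdict = "accepted" if abs(z) <= 2.0 else "flagged"
    print(f"PE = {pe:.0f} %: limits {lal:.2f}-{ual:.2f} µmol/L, z = {z:+.2f} ({verdict})")
```

Under these assumptions, a result that passes comfortably with a 6 % proportional error can fall outside the limits once the proportional error is reduced to 2 %.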

Quality control, including reference materials
The results from this intercomparison exercise highlight the need for using low, middle and top-range reference materials covering the full range of the expected nutrient concentrations for ocean surveys. This is recommended by Hydes et al. (2010), and also in the JAMSTEC I/C report (Aoyama et al., 2018). If the CD CRM had been solely used by both groups on the A02 survey, the negative bias in the MI TOxN at high concentrations would not have been apparent. Without confirmation from the higher-concentration CRM (batch BW), it would not have been clear whether there was a negative bias in the MI data or a positive bias in the Dal data, since both were producing similar values for the lower (CD) CRM.
Similarly, a low-concentration CRM would have improved comparison of surface waters where nutrient concentrations were close to the detection limit, and where the largest differences between the two data sets were observed. The low-nutrient KANSO CRMs available at the time of the survey (BY), similarly to the current low-nutrient batch (batch CE by KANSO or batch 7601a from NMIJ), have nutrient levels below our limits of quantification and therefore they are not useful as a low-concentration CRM for the MI-Dal methods.
For future surveys, if a low KANSO batch is still not suitable, alternatives could be used to check precision and accuracy at low levels, such as low-concentration materials remaining from intercalibration/proficiency testing or in-house materials that are used to check precision.
With the availability of a range of CRMs for nutrients in seawater, there remains a need for clearly defined data quality objectives for oceanic nutrient measurements to meet GO-SHIP objectives as well as clear criteria for flagging acceptable and questionable data. Such criteria exist for other biogeochemical parameters; for example, for dissolved inorganic carbon (DIC) and total alkalinity (TA) in the open ocean, a level of uncertainty of 2 µmol kg⁻¹ (∼ 0.1 %) is recommended to assess long-term anthropogenic trends in the marine carbonate system (referred to as "climate" level objectives), although for short-term changes and spatial variability less stringent objectives are specified ("weather") (Newton et al., 2015). In coastal waters, the level of accuracy required would be less since the range of carbonate parameters observed would be much wider than those in the open ocean. If clear criteria for nutrient measurements were set, laboratories could flag reported data where these were not attained. The metadata supplied with published data sets should include all of the related QC information, including calibration ranges, batches of CRMs used, CRM assessment criteria, accuracy of CRMs achieved, sample storage prior to analysis, etc.
In a 2015 Inter-laboratory Calibration (I/C) exercise, Aoyama et al. (2016) reported CV(%) of 1 % for TOxN, 2 % for silicate and 6 % for phosphate with the reference material batch BU (which is similar to batch CD used on the A02 survey), and 2 % for all nutrients for batch CA (similar to batch BW). These CV(%) are lower than those produced by the MI and Dal groups on the A02 survey (Table 6). The CV(%) for the participating laboratories of the 2015 I/C exercise were, however, calculated from measurements carried out in shore-based laboratories, a much more stable and less pressured environment than during a research cruise. Our comparison of QC before and after the A02 survey with performance at sea illustrated an increase in CV(%) during A02 in all parameters for the MI group as well as a systematic bias for TOxN with both groups and variable performance of the Dal phosphate analyses during the cruise. These observations highlight the difficulty and nature of problems associated with carrying out ship-based nutrient analysis of open-ocean samples. A key question is whether it should be acknowledged, when setting accuracy goals/targets for sea-going analyses, that at-sea analytical performance may not always attain the standards that can be reached in shore-based studies. Hydes et al. (2010) suggest that the use of CRMs, along with best practices in using analysis equipment and internal standardisation, should make it "commonly possible to achieve comparability of nutrient analysis to a level better than 1 %". The draft-revised guidelines for nutrients state that an accuracy of 1 % should be aimed for in order to be able to quantify decadal trends in the deep ocean. Based on intercalibration performance during A02 and in international I/C exercises, a target proportional error of 2 % for analysis of nutrients might instead be reasonable and achievable. The associated narrower z-score limits (Fig. 4) calculated with a PE% of 2 % could be considered as oceanic nutrient CRM acceptance criteria for future surveys. In addition, specification of an appropriate total error combining proportional and constant error components, as applied by the QUASIMEME system, may be appropriate to allow for a wider allowable total error for concentrations extending closer to the LOQs. We note that the GOOS essential ocean variable specifications list accuracy goals for nutrients in terms of constant errors that are similar to those specified for QUASIMEME.

Quality of data
The largest differences between the MI and Dal data sets were observed in the low-nutrient surface waters, where the RPD_MI−Dal of all nutrients were considerably higher than in the rest of the water column. In the 2015 I/C exercise (Aoyama et al., 2016), poorer comparability between the participating laboratories was also observed in the low-nutrient reference materials, which yielded CV(%) of up to 60 %. This was confirmed in the I/C 2018 exercise (Aoyama et al., 2018), where CV(%) for the low-nutrient sample was 50 % for TOxN, and 120 % for silicate, compared to CV(%) < 2 (TOxN) and < 2.3 (silicate) for all higher-concentration samples. Larger differences in low-nutrient waters would be expected since any error in calibration standards, instrument baselines and detection limits would more strongly impact concentrations close to the limit of detection. The larger differences at low-nutrient concentrations could also be sensitive to the sample : reagent ratio of each system, since the instruments have different capabilities for measuring low-nutrient concentrations. Also, the low-nutrient surface samples (concentrations < 5 µmol L⁻¹ for TOxN and silicate) were measured with a restricted calibration curve (0-10 µmol L⁻¹) on the MI system, whereas the Dal group used their full calibration range (0-50 µmol L⁻¹) for their entire data set. The calibration tests carried out in the MI laboratory following the survey illustrate how low-concentration measurements can be significantly affected by the higher-concentration standards. This will vary between instruments depending on the linearity of the calibration curves over different ranges. The JAMSTEC I/C 2018 report indicates that the non-linearity of calibration curves is a significant source of reduced comparability of nutrient data and recommends the use of CRMs of concentrations covering the whole range of measurements (Aoyama et al., 2018).
Accurate, intercomparable measurements of nutrient concentrations in the upper ocean, with lower concentrations, are important for a range of applications. Inaccurate measurements of nutrient concentrations in the euphotic zone would lead to large discrepancies in primary production estimation, or estimation of near-surface N : P ratios and indices of nutrient limitation. Hence our interpretation of ocean function can be directly related to the quality of the measurements. In the entire GO-SHIP A02 survey, 32 % of all samples are from the upper 400 m of the water column. Clearly, achieving high-accuracy measurements across the large concentration ranges encountered from surface to deepwater remains an analytical challenge. It is generally not possible to compare upper water column nutrient data quality using crossover analyses between different cruises from the same geographic area due to the greater "real" variability on short spatial and temporal scales (Tanhua et al., 2009, 2010). This intercomparison study therefore identifies a key issue in the comparability of nutrient data in lower-nutrient upper-ocean waters and suggests the need for in-house testing on the impact of higher standards on low-nutrient samples. It may, for example, be useful to split the calibration curve into low and high ranges, as was done on the MI system during the A02 survey.
In an intercomparison study carried out in 2005 and 2006 (Sahlsten and Håkansson, 2006), five different laboratories from monitoring institutes in Denmark, Norway and Sweden compared nutrient concentrations from identical sets of natural seawater subsamples (as opposed to prepared reference materials) that were analysed ashore in individual laboratories. Results for the deepwater samples indicated a between-laboratory CV(%) generally better than 5 %. The study indicated that variations between laboratories could be explained by improper storage of the nutrient samples between sampling and analysis. Tanhua et al. (2009, 2010) carried out crossover analyses as a secondary QC on nutrient data from the Atlantic (CARINA), where an offset and standard deviation were calculated for nutrients at depths > 1500 m. They found that nitrate data showed the greatest consistency, with an RMSE of 2.9 %, compared with 4.2 % for phosphate and 7 % for silicate, and suggested the larger differences in the reported data were likely due to analytical difficulties.
The results of this intercomparison strongly support the recommendation of Hydes et al. (2010) that individual laboratories or groups must carry out extensive internal testing on their own instruments to understand the full capability of their instruments and ensure their laboratory methods achieve the highest level of accuracy for the samples being measured. Ahead of bringing a laboratory-based instrument to sea, scientists must take account of the different requirements of analysis at sea and be aware that, if analytical problems arise, analysts may have limited time and resources to troubleshoot compared to a shore-based laboratory: a constant throughput of samples requiring analysis leaves little time for investigative work in the event of problems.
Despite carrying out extensive testing ahead of the survey (including testing the ship's ultrapure water and batches of pre-weighed reagents), along with a contingency plan for almost all foreseeable problems that may arise at sea (including a back-up of all equipment used during analysis, and a second Skalar system), there were unresolved changes in the QC of the ship-based analysis, illustrating the challenges that can occur during analysis at sea. Results also highlighted the value of carrying out a between-laboratory testing exercise, which in this case helped both groups to identify quality assurance issues in their internal procedures, which would otherwise not have been evident. All laboratory groups should ensure they incorporate additional QC into their methods, including extra calibration standards, extra reference materials and internal standards, to allow for post-correction of data if some unforeseen changes to their instrument occur while at sea.

Conclusions and recommendations
For data to be of use to the scientific community, oceanographic data collected by different groups at different times must be comparable so that true changes in the marine environment can be quantified. The presence of biases or imprecision in the measurement of nutrients in seawater reduces our ability to understand spatial and temporal trends in nutrient concentrations in the ocean. The comparison of two nutrient data sets from the 2017 A02 survey illustrated how analysis at sea can change the method performance relative to the analytical ability of a system and expectations of data accuracy and precision in shore-based laboratories. This study illustrates the importance of including extra QC checks (e.g. a higher number of calibration and internal standards) in case post-processing of the data is necessary. The cross-comparison of laboratory methods, quality control and instrument configurations allowed the MI and Dal groups to scrutinise their laboratory procedures in order to identify reasons for analytical bias while carrying out nutrient analysis at sea. The GO-SHIP hydro-manual provides essential guidelines to analytical teams undertaking on-board nutrient analysis. Following this study, some additional suggestions and recommendations were identified which could enhance those in the GO-SHIP manual (Hydes et al., 2010) for improved quality of global nutrient data sets.
-Agreed and clearly defined data quality objectives and acceptance criteria for ocean observation nutrient measurements would aid in improving data quality and support the flagging of reported data that do not meet these criteria. Such criteria could include proportional and constant error components.
-Additional information could be provided to indicate how CRMs can be used to correct data from a cruise if a bias is observed. This should factor in station-to-station variability, which was found to be several percent larger than cruise-wide average bias.
-If low-nutrient CRMs are below limits of detection, an alternative low-nutrient reference material should be considered, for example an internal reference solution or past proficiency test material. Extensive testing must be carried out ahead of a survey to understand individual instrument capabilities and additional QC checks should be included to allow for changes to the methods due to unforeseen changes while carrying out analysis at sea.
-Depending on individual auto-analysers, it may be necessary and effective to use two (or more) separate calibration curves to cover different nutrient concentration ranges.
-Metadata should include all information related to QC, including calibration ranges and CRM performance, so as to increase comparability and traceability between different nutrient data sets.
Author contributions. TM was the lead of the Irish nutrient team for the survey (shore based). TM carried out a large portion of the data analysis and led the writing of the paper. MC was the lead nutrient chemist on board the survey and also carried out data analysis and writing of the manuscript. EK was the lead nutrient chemist on the Dalhousie team, carried out data analysis of the Dal data and aided in the writing of the manuscript. DW was shore-based support in the Dal group, aided in data analysis and provided a large contribution to the writing of the manuscript. CG was the second nutrient chemist on the MI team and provided support in preparation of the survey. CN was the second nutrient chemist on the Dal team and aided in the preparation of the survey. EM was the PI of the entire survey and provided a large contribution to the data analysis of the MI data and to the writing of the manuscript.

Figure 1. Station positions sampled along the GO-SHIP A02 trans-Atlantic survey completed in May 2017. The Marine Institute (MI) group sampled and analysed nutrient samples at every station along the transect, while the Dalhousie group (Dal) analysed nutrient samples from a selected number of sites, marked with a diamond. Both groups analysed samples over the full water column.

Figure 2. Vertical profiles of TOxN, silicate and phosphate (in µmol kg⁻¹) from the MI (Marine Institute), Dal (Dalhousie University) and WOCE (World Ocean Circulation Experiment) data sets. Only stations 29 and 56 are included here; all other stations compared are in the Supplement. Profiles are in µmol kg⁻¹ since WOCE data were reported in µmol kg⁻¹ rather than µmol L⁻¹.

Figure 3. Relative percentage difference (RPD_MI−Dal) calculated as (MI conc − Dal conc)/average conc × 100 % for each nutrient for the whole water column and for depths > 400 m. The colour bar for each plot is the average concentration (µmol L⁻¹) of each nutrient (i.e. the average concentration from both systems) at that depth. Note the use of different y-axis scales for the different subsets.

Figure 4. Control charts of CRM concentrations from the MI and Dal systems. The dashed centre line represents the certified value for each CRM (CV), while the red upper (UAL, upper action limit) and lower (LAL, lower action limit) lines represent the allowable limits at a z score of ±2, where the z scores were calculated with a proportional error of 6 %. MV (MI) and MV (Dal) are the measured values from the MI and Dal systems, respectively. The dashed-dotted and dotted lines represent the revised z-score limits with a proportional error of 2 %. One CD CRM was run at the beginning and end of every run on both systems, and one BW CRM was analysed at the beginning of every run on the MI system. BW CRMs were run on only a selected number of runs of the Dal system for comparison.

Figure 5. Box plots of relative percent differences (RPD) between MI and Dal measurements on water samples that were analysed on both systems. RPDs are calculated as (MI conc − Dal conc)/average conc × 100 %. The median RPD(%) for a station defines the centre line of each box, and the entire box, representing the interquartile distance (IQD), is closed by the upper (UQ) and lower quartiles (LQ). Individual data points identify outliers, defined as values below LQ − 1.5 × IQD or above UQ + 1.5 × IQD. The top row (a-c) represents results from all depths, whereas the middle (d-f) and lower (g-i) rows represent results for samples from > 400 m and < 400 m, respectively. TOxN, silicate and phosphate data are presented in the left (a, d, g), middle (b, e, h) and right (c, f, i) columns. Note the use of different y-axis scales for the different depth ranges. For phosphate, the shaded area denotes the period prior to Station 46, after which the Dal standard curve was altered. The horizontal lines represent the cruise-wide means (solid line) and uncertainties (dashed lines) between MI and Dal measurements of the KANSO CD CRMs for each nutrient. The uncertainty bounds represent ±1 standard deviation, which was calculated from the respective standard deviations of the differences from certified values of the MI and Dal analyses, summed in quadrature. TOxN and silicate were calculated using results from all CD CRMs measured during the cruise. The lines for the average and standard deviation for phosphate relate only to the later (non-shaded) portion of the cruise.
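The two statistical conventions used in this caption can be sketched briefly: the quartile-based outlier rule and the combination of two standard deviations in quadrature. The function names and input values are hypothetical, and the quartiles rely on NumPy's default linear interpolation, which may differ slightly from the plotting software used for the figure.

```python
import numpy as np

def tukey_outliers(values):
    """Return values below LQ - 1.5*IQD or above UQ + 1.5*IQD."""
    v = np.asarray(values, dtype=float)
    lq, uq = np.percentile(v, [25, 75])
    iqd = uq - lq
    return v[(v < lq - 1.5 * iqd) | (v > uq + 1.5 * iqd)]

def combined_sd(sd_mi, sd_dal):
    """Combine two independent standard deviations in quadrature."""
    return float(np.hypot(sd_mi, sd_dal))

# Hypothetical RPD values (%) for one station.
station_rpd = [-2.1, -1.4, -0.9, -1.8, -1.2, 9.5]
print("outliers:", tukey_outliers(station_rpd))

# Hypothetical SDs (%) of the MI and Dal differences from certified values.
print("combined SD:", combined_sd(1.5, 2.0))
```

Summing in quadrature assumes that the MI and Dal CRM errors are independent of each other.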

Table 1. A comparison of sampling, instrument configurations (including sample and reagent tubing sizes) and reagent compositions for each nutrient from the Marine Institute, Ireland (MI) and Dalhousie University, Canada (Dal) systems.

Table 2. Concentrations of daily calibration standards in µmol L⁻¹ on the MI and Dal systems.

Table 3. Certified values in µmol kg⁻¹ for the two batches of KANSO CRMs used on the survey. These were converted to µmol L⁻¹ for comparison with Skalar data using a laboratory temperature of 20 °C and CRM salinity.
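The unit conversion amounts to multiplying by the seawater density. In the minimal sketch below, the density is a rough placeholder for seawater at 20 °C and would in practice be computed from an equation of state using the CRM salinity. The function name and the certified value are hypothetical.

```python
def umol_per_kg_to_umol_per_l(conc_umol_kg, density_kg_per_l):
    """Convert a gravimetric concentration (µmol/kg) to volumetric (µmol/L)."""
    return conc_umol_kg * density_kg_per_l

# Placeholder density (kg/L) for seawater at 20 °C; an assumption, not a
# value taken from the CRM certificates.
DENSITY_KG_PER_L = 1.0245

certified_umol_kg = 5.50  # hypothetical certified value
converted = umol_per_kg_to_umol_per_l(certified_umol_kg, DENSITY_KG_PER_L)
print(f"{certified_umol_kg} µmol/kg = {converted:.3f} µmol/L")
```

Because the density is only a few percent above 1 kg L⁻¹, the two units differ by roughly 2 %-3 %, which is comparable to the between-system biases discussed in the text and therefore cannot be ignored.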

Constant errors are 0.05, 0.01, 0.1 and 0.05 µmol L⁻¹ for TOxN, nitrite, silicate and phosphate, respectively, which are defined by the Scientific Advisory Board of QUASIMEME. These constant errors are similar to accuracy/uncertainty levels called for by the Global Ocean Observing System's (GOOS) Biogeochemistry Expert Panel (http://www.goosocean.org/index.php?option=com_oe&task=viewDocumentRecord&docID=17474, last access: 4 March 2019). (We note that the GOOS Panel does not follow QUASIMEME in also specifying a proportional error; see Discussion section.)

Table 4. The limit of detection (LOD) and limit of quantification (LOQ) in µmol L⁻¹ for both instruments.

Table 5. Relative percentage difference (RPD_MI−Dal) calculated as (MI conc − Dal conc)/average conc × 100 % for each station in the intercomparison study. N represents the number of samples, and SD-RPD is the standard deviation. Bold font represents the stations analysed prior to Station 46 (i.e. before the phosphate standard curve was altered; see details in Table 2). "All data" refers to all water samples which were measured by both MI and Dal during the cruise and "> 400 m" refers to all samples from below 400 m that were analysed by both groups. Asterisks (*) for phosphate values denote that the statistic refers only to samples from Station 46 onwards.

Table 6. Mean differences from certified values, and coefficients of variation of the differences (CV(%)) for the KANSO CRMs analysed by the Marine Institute (MI) and Dalhousie University (Dal). The CV(%) were calculated as (standard deviation/mean × 100 %). The KANSO batches CD and BW were used by both groups, where N is the number of measurements. Dal results for phosphate do not include analyses prior to Station 46 (see text).

Table 7. Results from a laboratory experiment testing the effect of using different calibration ranges, where STD in the first column of the table indicates the top standard included in the calibration. The second column (order) indicates whether the first- or second-order calibration coefficient was used in the calibration. The samples are either QUASIMEME test materials (QNU) or KANSO CRMs; MV is the measured value; AV is the assigned (or certified) value; TE is the total error used for calculating the z score; Z is the calculated z score as per Eq. (1) and RPD is the relative % difference ((MV − AV)/AV × 100 %). LOD and LOQ are the limits of detection and quantification, respectively.