Feasibility of reconstructing the basin-scale sea surface partial pressure of carbon dioxide from sparse in situ observations over the South China Sea

Sea surface partial pressure of CO2 (pCO2) data with high spatial-temporal resolution are important in studying the global carbon cycle and assessing the oceanic carbon uptake capacity. However, the observed sea surface pCO2 data are usually limited in spatial and temporal coverage, especially in marginal seas. This study provides an approach to reconstruct the complete sea surface pCO2 field in the South China Sea (SCS) with a grid resolution of 0.5°×0.5° over the period of 2000–2017 using both remote-sensing derived pCO2 and observed pCO2. Empirical orthogonal functions (EOFs) were computed from the remote-sensing derived pCO2. Then, a multilinear regression was applied to the observed pCO2 as the response variable with the EOFs as the explanatory variables. EOF1 explains the general spatial pattern of pCO2 in the SCS. EOF2 shows the pattern influenced by the Pearl River plume on the northern shelf and slope. EOF3 is consistent with the pattern influenced by coastal upwelling along the north coast of the SCS. The reconstructions always agree with observations. When pCO2 observations cover a sufficiently large area, the reconstructed fields successfully display a pattern of relatively high pCO2 in the mid-and-southern basin. The rate of sea surface pCO2 increase in the SCS is 2.383 µatm per year based on the spatial average of the reconstructed pCO2 over the period of 2000–2017. All the data for this paper are openly and freely available at PANGAEA under the link https://doi.pangaea.de/10.1594/PANGAEA.921210 (Wang et al., 2020).

Thus, it is necessary to improve the spatial-temporal resolution and accuracy of the data in the evaluation of oceanic carbon uptake capacity to better understand the global carbon cycle and to better project the future climate. The sea-air CO2 flux helps quantify the oceanic carbon uptake capacity and is primarily determined by the difference between the atmospheric and sea surface partial pressure of CO2 (pCO2). Although the number of sea surface pCO2 measurements had grown to 14.7 million by 2014, with records available in almost all ocean basins, and the compilations continue to receive more data (Rodenbeck et al., 2015; Sheu et al., 2010), the observations still cover the global ocean surface only sparsely in space and time, especially in marginal seas. Thus, interpolation and/or extrapolation methods are needed to obtain a complete pCO2 field in space and time over the oceanic areas of interest. Various methods have been applied for this purpose in the past two decades, including statistical interpolation (Chou et al., 2005) and empirical formulas between pCO2 and proxies such as sea surface temperature, salinity, chlorophyll a, sea surface height, and mixed layer depth (Boutin et al., 1999; Denvil-Sommer et al., 2019; Jo et al., 2012; Laruelle et al., 2017; Lefevre and Taylor, 2002; Ono et al., 2004; Zhai et al., 2005a).
These studies usually present their pCO2 fields on a monthly time scale and on a 1°×1° or even coarser grid. In marginal seas a finer grid resolution is needed to discern the influence of local forcing such as plumes and upwelling.
The South China Sea (SCS) is the largest marginal sea in the western Pacific. Measurements of sea surface pCO2 in the SCS started as early as 2000 (Zhai et al., 2005b). Seasonal and spatial variations are present in different domains of the SCS (Li et al., 2020; Zhai et al., 2013). However, the data coverage is still so sparse each year that on global compilation maps the SCS is mostly blank (Fay and McKinley, 2013; Takahashi et al., 2009). For example, the summer observations of 2017 cover 7 % of the SCS, and those of 2001 cover only 1 %. Consequently, the observational data themselves cannot quantitatively depict the pCO2 field over the entire SCS basin. Thus, it is necessary to reconstruct a space-time complete pCO2 field in the SCS in order to better assess the CO2 source and sink features in the SCS and to supplement the global pCO2 map.
The purpose of this paper is to demonstrate the feasibility of reconstructing the pCO2 field over the SCS basin from the sparse in situ observations in the SCS with a grid resolution of 0.5°×0.5°, using a method illustrated in the flowchart of Fig. 1. This paper focuses on the pCO2 reconstruction for the summer season. As indicated in Fig. 1, we need to use an auxiliary dataset, the remote-sensing derived pCO2 data, to calculate empirical orthogonal functions (EOFs) for spatial patterns of pCO2. The remote-sensing data are complete in the space-time grid but less accurate than the in situ observations. The singular value decomposition (SVD) method is applied to the remote-sensing data to compute the EOFs. These EOFs form an orthogonal basis for the spectral optimal gridding (SOG) method (Shen et al., 2014, 2017; Gao et al., 2015; Lammlein and

Observed data in the SCS
In the SCS, the underway sea surface pCO2 data are not available for every month of each year, so we decided to compile the data seasonally. This study focuses on the summer data since the greatest temporal coverage of the sampling occurs in summer. The available observed summer pCO2 data from 2000 to 2017 are compiled in this study and shown in Table 1. The summer data are the June-August mean for each year in this period excluding 2002, 2003, 2010, 2011 and 2013 (Li et al., 2020; Zhai et al., 2005a). Thus, we have observed pCO2 data for 13 summers during 2000-2017. The blue cruise tracks of Fig. 2 indicate all the sea surface pCO2 observations in the 13 summers. The tracks indicate that these data are distributed mainly on the northern shelf and slope, and in the northern-and-mid basin of the SCS. The coverage of an individual summer is only a subset of the blue tracks. See Fig. 3 for the subset of each year. These observational data were aggregated onto 0.5°×0.5° grid boxes in the (5-25° N, 109-122° E) region that covers most of the SCS. The aggregation used a simple space-time average of the data in a grid box. The aggregated data for 13 summers are shown in Fig. 3. The aggregated pCO2 in general falls in the range of 160-480 µatm, with relatively larger spatial variation nearshore and smaller spatial variability in the basin. In addition, large differences are apparent in the spatial coverage from year to year. pCO2 data cover a spatial range of 12° in latitude and 13° in longitude, 231 grid boxes with data that cover 22 % of the SCS. The data fall in the range of 281-480 µatm. In the summer of 2017 the observed data cover a spatial range of 13° in latitude and 6° in longitude, 77 grid boxes with data that cover 7 % of the SCS. The data are in the range of 279-440 µatm. The summer of 2000 has only 5 grid boxes (covering 0.5 % of the SCS) with data in the range of 400-425 µatm.
The lowest observational pCO2 values appear on the northern SCS shelf due to the influence of the Pearl River plume (see Fig. 2), where nutrient-stimulated phytoplankton uptake consumes CO2. The relatively high sea surface pCO2 values occur mainly in the basin and are often higher than the atmospheric pCO2 (Li et al., 2020; Zhai et al., 2013). The high pCO2 values off the northeastern coast of the SCS and the southern coast of Hainan Island in the summer of 2007 are consistent with local upwelling occurrences, which bring CO2-enriched water from the subsurface (Li et al., 2020). In the summer of 2012, the spatial coverage is 7° in latitude and 9.5° in longitude. The pCO2 data are in the range of 191-480 µatm, with the lowest value appearing on the northwestern shelf of the SCS due to the Jianjiang River plume and the highest values occurring on the northeast shelf and off the eastern coast of Hainan Island due to upwelling (Gan et al., 2015; Jing et al., 2015). Some other data, for example those of the summer of 2000, are so localized that no clear spatial pattern is visible before the reconstruction. Our reconstruction results will help display the spatial patterns of the complete sea surface pCO2 field.

[Table 1. Observed summer sea surface pCO2 data compiled in this study. Recoverable entries: sampling periods July-Aug. 2007 and July-Aug. 2008; references Zhai et al. (2013) and Li et al. (2020).]

Preprint. Discussion started: 11 September 2020. © Author(s) 2020. CC BY 4.0 License.
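The box-averaging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' R code: the function names `box_index` and `aggregate` and the sample values are hypothetical, and the coverage is computed relative to all 1040 boxes of the rectangular region (an assumption, although it is consistent with 231 boxes corresponding to about 22 %).

```python
from collections import defaultdict
import math

# Study region and grid spacing as stated in the text.
LAT_MIN, LAT_MAX = 5.0, 25.0
LON_MIN, LON_MAX = 109.0, 122.0
STEP = 0.5
N_LAT = int((LAT_MAX - LAT_MIN) / STEP)  # 40 rows
N_LON = int((LON_MAX - LON_MIN) / STEP)  # 26 columns


def box_index(lat, lon):
    """Return (row, col) of the grid box containing a point, or None if outside."""
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    row = int((lat - LAT_MIN) // STEP)
    col = int((lon - LON_MIN) // STEP)
    return row, col


def aggregate(samples):
    """Average all samples (lat, lon, pCO2) that fall in the same grid box."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, value in samples:
        idx = box_index(lat, lon)
        if idx is None or math.isnan(value):
            continue  # skip points outside the region and missing values
        sums[idx] += value
        counts[idx] += 1
    return {idx: sums[idx] / counts[idx] for idx in sums}


# Hypothetical underway samples (lat, lon, pCO2 in µatm), for illustration only.
samples = [
    (21.10, 114.20, 350.0),
    (21.30, 114.40, 370.0),  # falls in the same box as the first sample
    (18.60, 115.70, 410.0),
    (30.00, 114.00, 390.0),  # outside the study region, ignored
]
gridded = aggregate(samples)
coverage = 100.0 * len(gridded) / (N_LAT * N_LON)
print(gridded, f"coverage = {coverage:.2f} %")
```

Keeping sums and counts separately makes it easy to add samples from several cruises in the same summer before taking the mean.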

Remote-sensing derived sea surface pCO2 data
The satellite remote-sensing derived sea surface pCO2 in the SCS were estimated for the years of 2000-2014 using a semi-analytical algorithm developed by Bai et al. (2015a). The algorithm treats pCO2 as a function of major controlling factors derived from multiple satellite remote-sensing products, including sea surface temperature, chlorophyll a, and salinity. The spatial resolution of the remote-sensing derived pCO2 data is 1×1. These data were aggregated into 0.5°×0.5° grid boxes in our study region (5-25° N, 109-122° E). As shown in Fig. 4, the gridded remote-sensing derived pCO2 data cover almost all the areas of the SCS (see the boxes of RS pCO2 and RS data full coverage in Fig. 1). However, the variations shown by these remote-sensing derived pCO2 data are much smaller than those shown by the observed pCO2 data. Larger spatial variations are expected, especially in areas influenced by river plumes. This makes it necessary to reconstruct a pCO2 field that is not only based on the remote-sensing derived pCO2, but also constrained by the observed in situ pCO2 data from the cruise samplings.

Reconstruction method
Figure 1 is a flowchart of our method. We used the remote-sensing derived data to compute the EOFs for the SOG reconstruction. The grid with 0.5°×0.5° resolution covered from 5° to 25° N and from 109° to 122° E with 1040 grid boxes in total. The land area data were marked with NaN. The data were arranged in a 1040×15 space-time matrix with rows for grid boxes and columns for time. Then, we removed the 143 land grid boxes from the data, and computed the climatology and standard deviation for the remaining 897 non-NaN grid boxes from the 15 years of remote-sensing derived data from 2000 to 2014. The standardized anomalies were computed for each grid box using the remote-sensing derived data minus the climatology and subsequently dividing the difference by the standard deviation. The singular value decomposition (SVD) method was applied to the standardized anomalies in the space-time matrix to compute the EOFs. The results are shown in Section 3. The climatology and standard deviation calculated from the remote-sensing derived data were also used to compute the standardized anomalies of the observed data, which were used as the response variable in the SOG regression reconstruction.
https://doi.org/10.5194/essd-2020-167
Following the reconstruction of the standardized anomalies, the remote-sensing derived climatology and standard deviation were then used to produce the full field as the final reconstruction result.
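The standardization and its inverse can be illustrated with a short Python sketch. It is a minimal example under assumptions that are not taken from the paper: the helper names are hypothetical, the sample standard deviation (divisor n-1) is used, and the numbers are invented for illustration.

```python
import math
import statistics


def climatology_and_sd(series_by_box):
    """Per-box mean (climatology) and standard deviation over the years."""
    return {
        box: (statistics.mean(values), statistics.stdev(values))
        for box, values in series_by_box.items()
    }


def standardize(field, stats):
    """Convert values to standardized anomalies using reference statistics."""
    return {
        box: (value - stats[box][0]) / stats[box][1]
        for box, value in field.items()
        if box in stats
    }


def to_full_field(anomalies, stats):
    """Invert the standardization: multiply by the sd, then add the climatology."""
    return {
        box: anomaly * stats[box][1] + stats[box][0]
        for box, anomaly in anomalies.items()
    }


# Hypothetical remote-sensing derived series (µatm) for two grid boxes.
rs_series = {"A": [380.0, 390.0, 400.0], "B": [350.0, 370.0, 390.0]}
stats = climatology_and_sd(rs_series)

# An observed value is standardized with the remote-sensing statistics.
observed = {"A": 405.0}
anomalies = standardize(observed, stats)
restored = to_full_field(anomalies, stats)
print(anomalies, restored)
```

Using the same remote-sensing statistics for both the forward and the inverse transform keeps the observed anomalies and the EOFs on a common scale.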
The SOG reconstruction method is basically a multivariate regression model for the space-time field at grid box $x$ and time $t$, expressed as follows:

$$P(x,t) = \beta_0(t) + \sum_{m \in \mathbb{M}} \beta_m(t)\,\frac{E_m(x)}{\sqrt{a(x)}} + \varepsilon(x,t).$$

Here, $P(x,t)$ is the response variable whose data are the standardized anomalies of the observed data, $\beta_0(t)$ is the regression intercept, $\beta_m(t)$ is the regression coefficient for the $m$th EOF $E_m(x)$, the least square estimator of $\beta_m(t)$ is denoted by $b_m(t)$, $a(x) = \cos(\phi_x)$ is the area-factor, $\phi_x$ is the centroid's latitude, expressed in radian, of the grid box $x$, and $\varepsilon(x,t)$ is the regression error. The error is assumed to be normally distributed with zero mean and has an independent error variance $\sigma^2(x,t) = \langle \varepsilon^2(x,t) \rangle$, where $\langle \cdot \rangle$ denotes the mathematical operation of expected value. The explanatory variables in the above multivariate regression are $E_m(x)$, computed from the area-weighted standardized anomalies of the remote-sensing derived data. The anomalies were written as an 897×15 space-time data matrix. The SVD method was applied to this matrix to compute the spatial patterns, which are EOFs, the temporal patterns, which are principal components (PCs), and their corresponding variances. $\mathbb{M}$ is the set of EOFs selected for our regression reconstruction.

The reconstructed standardized anomalies are

$$\hat{P}(x,t) = b_0(t) + \sum_{m \in \mathbb{M}} b_m(t)\,\frac{E_m(x)}{\sqrt{a(x)}},$$

where $x$ runs through the entire 893 grid boxes over our study region in the SCS. These anomalies were converted to the full field by multiplying by the standard deviation and adding the climatology computed from the remote-sensing derived data for each of the 893 grid boxes. In this way, the full reconstructed field was produced and is presented in Section 3.
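The regression step for one summer can be sketched as an ordinary least-squares fit of an intercept and a few EOF coefficients to the boxes that have observations, followed by evaluation at every box. The Python sketch below is an illustration under assumptions, not the authors' implementation: the function names, the tiny two-mode EOF table, and the coefficient values are hypothetical, the EOF values are taken as already scaled by whatever area factor is used, and the normal equations are solved by plain Gaussian elimination for self-containment.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few or degenerate observations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def fit_coefficients(eofs, observed):
    """Least-squares intercept and EOF coefficients from boxes that have data.

    eofs: dict box -> list of EOF values at that box (one per retained mode)
    observed: dict box -> standardized anomaly observed at that box
    """
    rows = [[1.0] + list(eofs[box]) for box in observed]
    y = [observed[box] for box in observed]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def reconstruct(eofs, coeffs):
    """Evaluate the fitted combination of EOFs at every grid box."""
    return {
        box: coeffs[0] + sum(b * e for b, e in zip(coeffs[1:], values))
        for box, values in eofs.items()
    }


# Hypothetical values of two EOFs at five grid boxes.
eofs = {0: [0.5, 0.1], 1: [0.4, -0.3], 2: [0.3, 0.6], 3: [-0.2, 0.2], 4: [-0.6, -0.4]}
true_coeffs = [0.2, 1.5, -0.8]  # invented intercept and coefficients
# Noise-free synthetic observations at four of the five boxes.
observed = {
    box: true_coeffs[0] + true_coeffs[1] * eofs[box][0] + true_coeffs[2] * eofs[box][1]
    for box in (0, 1, 2, 3)
}
coeffs = fit_coefficients(eofs, observed)
field = reconstruct(eofs, coeffs)
print(coeffs, field[4])
```

With n observed boxes, an intercept and at most n-1 EOF coefficients can be determined; fewer observations than parameters make the system singular, which the sketch reports as an error.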
Many computer software packages are available to compute the EOFs using SVD and to compute multilinear regressions. This paper chose to use R, a programming language that has become a popular data science tool in the last few years, for this purpose. The R computer codes and their required files for this paper are freely available at https://github.com/Hqin2019/pCO2-reconstruction
SOG usually uses the first few EOFs, either the first M EOFs that account for more than 80 % of the total variance or a set determined from the response data via a correlation test (Smith et al., 1998). The current paper used eight EOFs that explain 87 % of the total variance (Fig. 5). However, the year 2000 was an exception and used only four EOFs, because the year has only five grid boxes with the observed data.
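The variance criterion for choosing the number of EOFs can be sketched as follows. In an SVD of the anomaly matrix, the fraction of variance carried by a mode is its squared singular value divided by the sum of all squared singular values. The Python sketch is an illustration only: the singular values are hypothetical, and the cap of n-1 modes for n observed boxes is one plausible reading of why the summer of 2000 used four EOFs, not a rule stated in the paper.

```python
def variance_fractions(singular_values):
    """Fraction of total variance explained by each SVD mode."""
    total = sum(d * d for d in singular_values)
    return [d * d / total for d in singular_values]


def modes_needed(singular_values, threshold=0.80):
    """Smallest number of leading modes whose cumulative variance exceeds threshold."""
    cumulative = 0.0
    for m, fraction in enumerate(variance_fractions(singular_values), start=1):
        cumulative += fraction
        if cumulative > threshold:
            return m
    return len(singular_values)


def usable_modes(m, n_observed_boxes):
    """Cap the mode count so that intercept plus modes do not exceed the data count."""
    return max(0, min(m, n_observed_boxes - 1))


# Hypothetical singular values, in decreasing order.
singular_values = [7.0, 4.0, 3.0, 2.0, 1.0]
print(variance_fractions(singular_values))
print(modes_needed(singular_values))        # modes for more than 80 % of variance
print(modes_needed(singular_values, 0.90))  # modes for more than 90 % of variance
print(usable_modes(8, 5))                   # eight modes requested, five boxes with data
```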

EOFs and PCs
EOF1 demonstrates the mode of the average level of pCO2, with lower or higher values near the coastal regions of mainland China (Fig. 6a). This mode accounts for 49 % of the variance, which indicates the dominance of the average field and hence a small overall spatial variation, except in the coastal regions. The remote-sensing derived pCO2 data support this mode well. EOF2 shows a north-south dipole (Fig. 6b), which is supported by the observed data shown in Fig. 3, particularly in the summer of 2017, showing lower values in the north on the shelf and slope and higher values in the south in the ocean basin. The minimum values in the north occur where the Pearl River plume dominates (Li et al., 2020; Zhai et al., 2013). EOF3 shows an east-west pattern (Fig. 6c), in addition to the north-south dipole in EOF2. EOF3 thus reflects a spatial variation of a smaller scale. This pattern is consistent with that influenced by coastal upwelling along the northeast China coast and off eastern Hainan Island (Gan et al., 2015; Jing et al., 2015).
The PCs are the temporal stamps of the occurrence of the spatial patterns. PC1 basically shows the temporal trend (Fig. 6d). It has been concluded that surface SCS pCO2 has an increasing trend with time (Tseng et al., 2007). [...] the north-south dipole. This strength seems to be related to the strength and extent of the Pearl River plume on the northern shelf and slope (Bai et al., 2015b; Li et al., 2020; Zhai et al., 2013). PC3 shows the temporal variation corresponding to the east-west spatial pattern of EOF3.
Figure 7 shows that the reconstructed pCO2 fields in the SCS have successfully displayed the spatial patterns of the observed pCO2 and in general are consistent with previous studies (Li et al., 2020; Zhai et al., 2013). Relatively low values appear in the northern coastal region where the Pearl River plume is dominant in summer, and generally high values occur in the mid and southern basin.

Reconstruction results in the SCS
The reconstructions take advantage of both the in situ data, for retaining spatial and temporal variations, and the remote-sensing derived data, for the EOF patterns. By default, the reconstructed field has fidelity to the in situ data, because the SOG reconstruction method is a fit of EOFs to the in situ data. The reconstruction is, thus, consistent with the in situ observations. When the in situ data cover a sufficiently large area and hence provide a proper constraint to the EOF fitting through the SOG procedure, the reconstruction result is more faithful to reality. For example, the reconstructions of the summers of 2004, 2007, 2009, 2012, and 2014-2017 nicely demonstrate the spatial pCO2 patterns (Figs. 7c, f, h, ...) that are consistent with observations (Li et al., 2020; Zhai et al., 2013) and ocean dynamics (Gan et al., 2015; Jing et al., 2015).
When the observational data are scarce, as long as the in situ data provide a proper constraint to the EOFs, the reconstruction can still yield reasonable results. For example, the summer of 2001 has few in situ data, but its reconstruction appears reasonable (Fig. 7b).
In cases of extreme data scarcity, the reconstruction may not be reliable. For example, the reconstructed data in the summer of 2000 appear to be of poor quality (Fig. 7a), since the relatively low values in the mid SCS basin may not be realistic. These poorly reconstructed data may be due to the poor spatial coverage of the in situ pCO2 data in the summer of 2000, which had only 5 grid boxes with data (Fig. 3a). These 5 boxes are all located together and cover only 0.5 % of the SCS. Similarly [...] (380-420 µatm) (Li et al., 2020; Zhai et al., 2013). Another cause of the less ideal reconstruction results for the summers of 2005, 2006, and 2008 may be the large spatial gradient of the in situ data. These gradients, such as those for the summer of 2008 (Fig. 3g), can cause a large deviation of the regression coefficients because the linear regression is not robust.
The reconstruction results have demonstrated the feasibility of the SOG reconstruction of the sea surface pCO2 over the SCS, as long as the in situ data provide a proper constraint to the EOFs. The percentage of the in situ data coverage need not be large. However, large spatial gradients of the in situ data can distort the reconstruction and lower its quality, because the linear regression method is not robust.
As an application of our reconstruction and a validation, we examine the temporal trend of sea surface pCO2 over the SCS. [...] (Tseng et al., 2007). This makes sense since our rate is a spatial average in summer. When compared with the summer rate at the Hawaii Ocean Time-Series Station (Station HOT) (22° 45´ N, 158° W) in the North Pacific, which is 1.976 µatm per year over 2000 (Dore et al., 2009) (see Fig. 8b), our rate is about 0.4 µatm per year higher. This is reasonable for a marginal sea where a higher rate of increase in pCO2 would be expected.
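A rate of increase of this kind can be obtained by fitting a straight line to the yearly spatial averages of the reconstructed field. The Python sketch below shows one way to do this under assumptions that are not confirmed by the paper: each grid box is weighted by the cosine of its latitude when averaging, the slope comes from ordinary least squares, and the fields are synthetic values constructed to rise by 2 µatm per year.

```python
import math


def area_weighted_mean(field):
    """Spatial mean of a gridded field, weighting each box by cos(latitude)."""
    numerator = 0.0
    denominator = 0.0
    for (lat_deg, _lon_deg), value in field.items():
        weight = math.cos(math.radians(lat_deg))
        numerator += weight * value
        denominator += weight
    return numerator / denominator


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (x, y)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Synthetic summer fields (µatm) on three boxes, rising by 2 µatm per year.
years = list(range(2000, 2005))
fields = {
    year: {
        (lat, 115.25): 380.0 + 2.0 * (year - 2000) + 0.5 * lat
        for lat in (10.25, 15.25, 20.25)
    }
    for year in years
}
means = [area_weighted_mean(fields[year]) for year in years]
rate = ols_slope(years, means)
print(f"{rate:.3f} µatm per year")
```

Years without a reconstruction can simply be left out of `years`, since the least-squares slope does not require evenly spaced points.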

Outliers of the observed data in the reconstruction
The SOG method is basically a linear regression method, which is known to be sensitive to the outliers of the response data. Some outliers, whether due to observational biases or extreme events, can cause a large change in the regression coefficients, and hence the regression results, and can even push the regression results outside the physically valid domain, such as negative pCO2 values in the reconstructed data. Although we cannot conclude that the outliers that lie 3σ away from the mean in the observed data are due to data biases, we have decided not to use them in our reconstruction to avoid unphysical reconstruction results. Table 2 shows the 14 outlier entries excluded from our response data for regression. These outliers are located in the region of (21.25-23.25° N, 113.25-116.75° E). This region is near the Pearl River Estuary. Thus, these extremely low pCO2 values may result from the Pearl River plume, where the observed pCO2 can be very low. These very low values, at least 3σ away from the mean, may cause a very large gradient in the observed pCO2. Our reconstruction has excluded these extremely low values influenced by the river plumes. Our reconstructed data may therefore overestimate the pCO2 values in the Pearl River Estuary and its nearby region.

This study has demonstrated the feasibility of using the SOG method to reconstruct the sea surface pCO2 data into regular grid boxes. We compiled the observed and remote-sensing derived sea surface pCO2 data in the SCS in summer over the period of 2000-2017 and aggregated these data with a grid resolution of 0.5°×0.5° for reconstruction. The SOG method based on the multilinear regression was applied to reconstruct the space-time complete pCO2 field in the SCS. The method took the EOFs calculated from the remote-sensing derived pCO2 as the explanatory variables and treated the observed pCO2 as the response variable.
The EOFs reflect reasonably well the general spatial pattern of the sea surface pCO2 in the SCS and reveal features affected by regional physical forcing such as the river plume and coastal upwelling in the northern SCS. As long as the in situ data provide a proper constraint to the EOFs, the reconstructed pCO2 fields are, in general, consistent with the patterns of the observed pCO2 and demonstrate relatively low values along the north coast affected by the Pearl River plume and consistently high values in the ocean basin of the SCS. These reconstructed pCO2 fields provide full spatial coverage of the sea surface pCO2 of the SCS in summer over a temporal scale of almost two decades and therefore fill the long-lasting blanks in the global sea surface pCO2 mapping. Thus, the reconstruction products will help improve the accuracy of the estimate of the oceanic CO2 flux of the largest marginal sea of the western Pacific so as to better constrain the global oceanic carbon uptake capacity.
Although the SOG method can optimize the information from both the in situ data and the remote-sensing derived data, the reliability of the reconstructed results is still limited by the observed data. When the observed data are limited to only a few grid boxes in a small region, the reconstruction results may not be realistic. Additional constraints have to be considered.

Author contribution
Minhan Dai conceptualized and directed the field program of the in situ observations. Baoshan Chen and Xianghui Guo participated in the in situ data collection. Yan Bai provided the remote-sensing derived data. Guizhi Wang, Yao Chen and Samuel S. P. Shen developed the reconstruction method, wrote the Matlab and R codes, analyzed the data, and plotted the figures. Huan Qin developed the data repository, and revised and tested the R codes. Guizhi Wang and Samuel S. P. Shen wrote the manuscript. All the authors contributed to the original writing, editing and revisions of the manuscript.

Competing interests
The authors declare that they have no conflict of interest.

[...] of the National Natural Science Foundation of China (40521003). Acknowledgement is for the data support from "National Earth System Science Data Sharing Infrastructure, National Science & Technology Infrastructure of China (http://www.geodata.cn)".