A historical nutrient dataset (1895–2024) for the North Pacific: reconstructed from machine learning and hydrographic observations
Abstract. Nutrients play a critical role in oceanic primary productivity and the biological pump. However, because nutrient measurements are labor-intensive and costly, nutrient observations are several orders of magnitude sparser than observations of hydrographic parameters such as temperature and salinity. In this study, we first established a rigorous quality control procedure to clean the hydrographic and nutrient (NO₃⁻, NO₂⁻, dissolved inorganic phosphorus (DIP), and Si(OH)₄) observations in the North Pacific collected from the World Ocean Database (WOD) and the CLIVAR and Carbon Hydrographic Data Office (CCHDO). The quality-controlled CCHDO dataset was then used to train three machine learning models: Random Forest, Light Gradient Boosting Machine (LightGBM), and Gaussian Process Regression. These models relate nutrient concentrations to spatial coordinates (longitude, latitude, and depth), time variables (year and month), and water mass properties (indexed by potential temperature and salinity). Validation shows that the reconstruction closely matches the observations, with RMSEs of <1.41, <0.071, <0.089, and <3.07 µmol kg⁻¹ for NO₃⁻, NO₂⁻, DIP, and Si(OH)₄, respectively. The validated models were then applied to the hydrographic observations in WOD, most of which lacked direct nutrient measurements, to reconstruct nutrient concentrations. This yielded ~473 million reconstructed data points across 1.92 million stations for each nutrient, spanning 1895 to 2024, a 2,127- to 2,393-fold increase over the original nutrient observations in the North Pacific (197,539 to 222,234). The new dataset will be valuable for studying nutrient variability under climate change and anthropogenic influences, and for providing transient boundary conditions in ocean biogeochemical models.
The dataset generated in this study is openly available on Zenodo at https://zenodo.org/records/17451417.
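The train, validate, and apply workflow summarised in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: it uses only one of the three model families (a scikit-learn Random Forest), purely synthetic stand-in data, and hypothetical variable names, value ranges, and hyperparameters chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the seven predictors named in the abstract.
# All ranges and functional forms below are assumptions, not real data.
lon = rng.uniform(120.0, 260.0, n)           # degrees east
lat = rng.uniform(0.0, 65.0, n)              # degrees north
depth = rng.uniform(0.0, 4000.0, n)          # metres
year = rng.integers(1985, 2025, n)
month = rng.integers(1, 13, n)
theta = 25.0 * np.exp(-depth / 500.0) + rng.normal(0.0, 0.5, n)  # potential temperature
salinity = 34.0 + 0.6 * (1.0 - np.exp(-depth / 1000.0)) + rng.normal(0.0, 0.05, n)

# Toy "nitrate" target that increases with depth, plus measurement-like noise.
nitrate = 42.0 * (1.0 - np.exp(-depth / 800.0)) + rng.normal(0.0, 1.0, n)

X = np.column_stack([lon, lat, depth, year, month, theta, salinity])

# Hold out 20% of the nutrient-bearing samples for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, nitrate, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# RMSE on the held-out samples, the validation metric quoted in the abstract.
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print(f"validation RMSE: {rmse:.2f}")

# Apply the fitted model to hydrography-only rows (here, simply five
# held-out rows standing in for profiles without nutrient measurements).
X_hydro = X_test[:5]
nitrate_reconstructed = model.predict(X_hydro)
print(nitrate_reconstructed.shape)
```

In practice, each nutrient would presumably have its own fitted model, and the hold-out strategy (for example, splitting by cruise rather than by sample) strongly affects how representative the validation RMSE is.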