A full-coverage daily XCO₂ dataset in China from 2015 to 2020 based on DSC-DF-LGB

Huang, Xinfeng; Yang, Hui; Lv, Qingzhou; Fan, Huaiwei; Cui, Liu; Qiao, Yina; Yao, Yuejing; Feng, Gefei

doi:10.5194/essd-2024-371

Preprints

https://doi.org/10.5194/essd-2024-371

28 Oct 2024

Status: this discussion paper is a preprint. It has been under review for the journal Earth System Science Data (ESSD). The manuscript was not accepted for further review after discussion.

Abstract. Carbon dioxide (CO₂), a major greenhouse gas, is one of the important causes of global warming. In recent years, the atmospheric CO₂ concentration in China has increased year by year. Satellite observation is the main means of obtaining atmospheric CO₂ concentrations. However, the onboard sensors currently used to measure atmospheric CO₂ have a narrow observation range and cannot provide spatiotemporally continuous atmospheric CO₂ concentrations. This paper therefore proposes a method for generating a daily full-coverage XCO₂ dataset based on the DSC-DF-LGB (Deep Separable Convolutional Neural Network and Deep Forest concatenated with LightGBM) model, in order to obtain the spatiotemporal distribution of atmospheric CO₂ in China. The DSC-DF-LGB model was trained to learn the mapping between OCO-2 XCO₂ retrievals and related variables (reanalysis XCO₂, vegetation parameters, human factors, elevation, and meteorological parameters). The model was then used to generate a daily 0.1° full-coverage XCO₂ dataset for China from 2015 to 2020. The cross-validation (CV) result indicates that the model performs strongly in estimating XCO₂, with an R² of 0.9633 and an RMSE of 0.9761 ppm. Validation against independent TCCON sites indicates that the estimated XCO₂ is highly consistent with in-situ measurements, with an R² of 0.8786 and an RMSE of 1.5452 ppm. The full-coverage, high-resolution XCO₂ dataset can provide data support for research on carbon sources and sinks. The dataset is available at https://zenodo.org/doi/10.5281/zenodo.12696674 (Huang, 2024).
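The abstract reports accuracy as R² and RMSE. As a reading aid, the sketch below shows how these two metrics are commonly computed. It assumes R² means the coefficient of determination (some studies report the squared Pearson correlation instead, and the abstract does not say which is used). The function names and the three sample values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def rmse(observed, estimated):
    """Root-mean-square error, in the same unit as the inputs (ppm here)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((estimated - observed) ** 2)))

def r_squared(observed, estimated):
    """Coefficient of determination: 1 - SS_res / SS_tot (an assumed definition)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    ss_res = np.sum((observed - estimated) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical XCO2 values in ppm, for illustration only.
observed = [410.0, 412.0, 414.0]
estimated = [411.0, 412.0, 413.0]
print(rmse(observed, estimated), r_squared(observed, estimated))
```

Both metrics depend on the spread of the reference values, so an R² from cross-validation and an R² from a handful of TCCON sites are not directly comparable.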

Received: 25 Aug 2024 – Discussion started: 28 Oct 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: closed

RC1: 'Comment on essd-2024-371', Anonymous Referee #1, 29 Nov 2024

Using satellite observations, reanalysis data, and other auxiliary variables to produce spatiotemporally continuous XCO2 data is meaningful for carbon cycle studies. This paper employs the DSC-DF-LGB model to generate daily XCO2 data for China from 2015 to 2020 based on the OCO-2 XCO2. Cross-validation and independent site validation both demonstrate that the generated dataset is highly accurate. However, the novelty of this paper seems insufficient for the ESSD journal, as there has been a considerable amount of research in this area recently, generating many similar datasets. So, where exactly does this paper contribute? Is it in achieving higher accuracy for the dataset, or is the proposed method significantly different from existing studies? I do not recommend this paper for publication in ESSD, and here are some major concerns:
1. What is the contribution of this paper compared to existing studies and datasets? Machine learning methods have increasingly been used for XCO2 reconstruction, and there is no substantial difference in accuracy. The daily XCO2 data produced here does not generate new insights, nor is it compared with existing datasets. From a scientific perspective, I believe the novelty of this paper is insufficient.
2. Regarding the DSC-DF-LGB method proposed in the paper, the advantages of combining DF and LGB are not well demonstrated. Does it perform better than a single model in terms of accuracy? Additionally, the role and meaning of DSC are unclear. It seems to only be used for feature extraction, but it’s not clear why it needs to be involved in the training process.
3. The results in Section 3.2 are surprising. XCO2 shows an increasing trend year by year, and theoretically, using machine learning to predict past or future XCO2 data will inevitably lead to some over- or under-estimations. However, the degree of error described in the paper seems exaggerated, especially given that the prediction period is only one year away from the model training period. Also, what is the purpose of this section in the paper? Does it imply that the DSC-DF-LGB method lacks generalization capability?
4. The analysis of the results lacks depth. For example, Figure 8 could benefit from more quantitative analysis, rather than just a simple qualitative comparison. In addition, Figure 9 presents the growth of concentrations, which has already been reported in many studies. Does this paper offer any new analysis or findings?
5. Just as I find the novelty of this paper lacking, the authors are also unclear in stating the motivation for their work. The last part of the introduction, which describes the problem or research objectives this paper aims to address, is vague and unclear. I suggest reorganizing this section.

Citation: https://doi.org/10.5194/essd-2024-371-RC1
RC2: 'Comment on essd-2024-371', Anonymous Referee #2, 18 Feb 2025
Huang et al. present a methodology for generating a high-resolution, full-coverage daily XCO₂ (column-averaged CO₂) dataset for China from 2015 to 2020 using a novel DSC-DF-LGB model. The researchers combined OCO-2 satellite data with various environmental and anthropogenic variables to create the dataset, claiming that they could achieve strong validation results. The resulting dataset reveals important spatiotemporal patterns in China's atmospheric CO₂ concentrations. However, for a manuscript presenting a data product, the method section lacks a significant amount of detail. The scientific scope is not clear and there are fundamental errors. I do not think this manuscript can be published in its current form, and a major revision is necessary. Here are my main concerns.

Method:
As a data paper, it should clearly demonstrate the product's generation process. The method description section (2.2.2) lacks crucial details, making it difficult for readers to understand the methodology.

The authors acknowledge poor temporal generalization of their method, which raises fundamental concerns about its reliability. I suggest validating the method by randomly selecting one year as a validation set while using the remaining years for training.
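The year-wise validation suggested above can be sketched as follows. This is a minimal illustration under stated assumptions: it holds out every year in turn (a superset of picking one year at random), uses synthetic data with a built-in upward trend, and fits a plain least-squares model as a stand-in for DSC-DF-LGB. The function name `leave_one_year_out` and all variables are hypothetical.

```python
import numpy as np

def leave_one_year_out(years):
    """Yield (held_out_year, train_idx, test_idx) for each distinct year."""
    years = np.asarray(years)
    for held_out in np.unique(years):
        test_mask = years == held_out
        yield int(held_out), np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Synthetic stand-in data: 50 samples per year with a 2.5 ppm/yr trend.
# These numbers are invented for the sketch and are not from the paper.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2015, 2021), 50)
feature = rng.normal(size=years.size)
X = np.column_stack([np.ones(years.size), feature])  # the year is NOT a predictor
y = 400.0 + 2.5 * (years - 2015) + 0.5 * feature + rng.normal(scale=0.3, size=years.size)

rmse_by_year = {}
for year, train_idx, test_idx in leave_one_year_out(years):
    # Placeholder model: ordinary least squares fitted on the training years only.
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    residual = X[test_idx] @ coef - y[test_idx]
    rmse_by_year[year] = float(np.sqrt(np.mean(residual ** 2)))

for year, value in rmse_by_year.items():
    print(year, round(value, 2))
```

Because the placeholder model has no way to represent the trend, the held-out years at the ends of the record are the hardest to predict in this toy setup, which is the kind of behaviour such a split is meant to expose.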

The use of CAMS XCO₂ as a predictive variable raises concerns about model dependency. The authors should demonstrate the relative importance of each predictive variable in reconstructing satellite XCO₂.
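One common way to quantify the relative importance of predictors is permutation importance: shuffle one predictor at a time and measure how much the error grows. The sketch below is only an illustration of that idea, not the analysis requested of the authors in any specific form. The function `permutation_importance`, the synthetic predictors, and the linear stand-in model are all assumptions made for this example.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when each column of X is shuffled independently."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.sqrt(np.mean((predict(X) - y) ** 2))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            shuffled = X.copy()
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            permuted_rmse = np.sqrt(np.mean((predict(shuffled) - y) ** 2))
            increases.append(permuted_rmse - baseline)
        importances[j] = np.mean(increases)
    return importances

# Synthetic predictors: column 0 dominates, column 1 is weak, column 2 is unused.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Stand-in model: ordinary least squares instead of the paper's model.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importances = permutation_importance(lambda A: A @ coef, X, y)
print(importances)
```

If one predictor, such as a reanalysis XCO₂ field, accounted for nearly all of the error increase, that would support the concern that the reconstruction mostly reproduces that input.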

The manuscript lacks comparison with existing machine learning approaches for high-resolution XCO₂ retrieval, making it unclear what improvements their method offers.

The comparison between TCCON site measurements and model reconstruction is problematic. While comparing XCO2_TCCON with XCO2_OCO2 measures satellite retrieval error, and comparing XCO2_reconstructed with XCO2_OCO2 measures algorithm error, comparing XCO2_TCCON with XCO2_reconstructed lacks clear scientific justification.

Scientific scope:
While reading the paper, I think none of the authors are familiar with carbon cycle science in general. There are fundamental errors in presenting the science.

L265-266: I think carbon uptake by boreal forests plays an even more prominent role.

L297: XCO₂ is influenced by emissions (not only from fossil fuels, but also from land), but is also largely mitigated by atmospheric mixing. Claiming that high XCO₂ reflects increased human activities without specifying the timescale is fundamentally wrong. For example, while a single retrieval might correlate with an emission spike, the annual average XCO₂ might not correlate with emissions at all. Additionally, this apparent dependency could be a result of using emissions as a predictive variable.

Figure 9: The comments on the growth rate trend are fundamentally wrong. The 2016 anomaly is largely a result of the strong 2015-2016 ENSO event that led to larger CO₂ outgassing from low-latitude lands. It is expected to have a larger growth rate in 2016 compared to the following few years. The interannual variability of CO₂ growth rate is not a good metric for emission reduction due to substantial natural variability. From a purely statistical perspective, you cannot draw this conclusion based on only 5 years of data, especially starting with a strong ENSO year. If you remove the first point, you don't see a decline; if you remove the first two points, you actually see an increasing growth rate. Furthermore, have you compared your trend estimate with observations from the large array of surface stations (e.g., NOAA boundary layer average)?
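The sensitivity to the starting point described above can be checked with a simple fit. The sketch below fits a least-squares line to a short growth-rate series and repeats the fit after dropping the first one or two points. The growth-rate values are invented to mimic the pattern the comment describes (a high first year followed by a slow rise) and are not the values in Figure 9. The function name `trend_slope` is hypothetical.

```python
import numpy as np

def trend_slope(years, values):
    """Least-squares linear slope of values against years (unit per year)."""
    x = np.asarray(years, dtype=float)
    x = x - x.mean()  # centre the years for a well-conditioned fit
    slope, _intercept = np.polyfit(x, np.asarray(values, dtype=float), 1)
    return float(slope)

# Hypothetical annual XCO2 growth rates in ppm/yr, for illustration only.
years = [2016, 2017, 2018, 2019, 2020]
growth = [3.0, 2.2, 2.3, 2.4, 2.5]

# Slope of the fitted trend when the series starts in 2016, 2017, or 2018.
slopes = {years[start]: trend_slope(years[start:], growth[start:]) for start in range(3)}
for first_year, slope in slopes.items():
    print(first_year, round(slope, 3))
```

In this toy series the fitted trend is negative only when the first year is included, which illustrates why a five-point record that begins with an anomalous year cannot support a conclusion about a declining growth rate.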

I suggest the authors have a carbon cycle scientist read their draft and provide feedback on the scientific merit. For example, they should address why high-resolution XCO2 is needed, and why specifically over China.
The authors claim they generate daily XCO₂ over China. However, I do not see any validation of the daily reconstruction performance. Does this method really capture synoptic-scale variability? Does the daily reconstruction make sense? What is the importance of having XCO₂ at a daily timescale?

Writing:
Some sentences read awkwardly. I suggest having the manuscript proofread by native speakers.
Citation: https://doi.org/10.5194/essd-2024-371-RC2

Data sets

Full-coverage daily 0.1° XCO2 in China Xinfeng Huang https://zenodo.org/doi/10.5281/zenodo.12696674

Model code and software

Full-coverage daily 0.1° XCO2 in China Xinfeng Huang https://zenodo.org/doi/10.5281/zenodo.12696674

Viewed

Total article views: 1,809 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,159	400	250	1,809	53	104

Views and downloads (calculated since 28 Oct 2024)

Month	HTML	PDF	XML	Total
Oct 2024	72	23	2	97
Nov 2024	97	28	7	132
Dec 2024	77	13	2	92
Jan 2025	60	11	0	71
Feb 2025	74	9	42	125
Mar 2025	37	7	57	101
Apr 2025	46	33	46	125
May 2025	37	12	51	100
Jun 2025	43	11	23	77
Jul 2025	30	8	2	40
Aug 2025	59	10	1	70
Sep 2025	310	16	0	326
Oct 2025	32	45	0	77
Nov 2025	27	44	2	73
Dec 2025	39	23	2	64
Jan 2026	72	30	5	107
Feb 2026	30	57	4	91
Mar 2026	17	20	4	41

Viewed (geographical distribution)

Total article views: 1,771 (including HTML, PDF, and XML), of which 1,771 have a defined geography and 0 are of unknown origin.

Latest update: 14 Mar 2026

Short summary

XCO₂ is the atmospheric CO₂ column concentration, mainly measured by satellite instruments. However, the XCO₂ retrieved from satellite sensors is spatially discontinuous and has long intervals between observations. In this study, we used machine learning models to generate a spatiotemporally continuous XCO₂ dataset with high temporal and spatial resolution. This dataset can be used for in-depth research on carbon sources and sinks.


Total:	0
HTML:	0
PDF:	0
XML:	0

A full-coverage daily XCO2 dataset in China from 2015 to 2020 based on DSC-DF-LGB

Data sets

Model code and software

Viewed

Viewed (geographical distribution)

A full-coverage daily XCO₂ dataset in China from 2015 to 2020 based on DSC-DF-LGB