the Creative Commons Attribution 4.0 License.
GPRChinaTemp1km: a high-resolution monthly air temperature dataset for China (1951–2020) based on machine learning
Abstract. An accurate spatially continuous air temperature dataset is crucial for multiple applications in environmental and ecological sciences. Existing spatial interpolation methods have relatively low accuracy and the resolution of available long-term gridded products of air temperature for China is coarse. Point observations from meteorological stations can provide long-term air temperature data series but cannot represent spatially continuous information. Here, we devised a method for spatial interpolation of air temperature data from meteorological stations based on powerful machine learning tools. First, to determine the optimal method for interpolation of air temperature data, we employed three machine learning models: random forest, support vector machine, and Gaussian process regression. Comparison of the mean absolute error, root mean square error, coefficient of determination, and residuals revealed that Gaussian process regression had high accuracy and clearly outperformed the other two models regarding interpolation of monthly maximum, minimum, and mean air temperatures. The machine learning methods were compared with three traditional methods used frequently for spatial interpolation: inverse distance weighting, ordinary kriging, and ANUSPLIN. Results showed that the Gaussian process regression model had higher accuracy and greater robustness than the traditional methods regarding interpolation of monthly maximum, minimum, and mean air temperatures in each month. Comparison with the TerraClimate, FLDAS, and ERA5 datasets revealed that the accuracy of the temperature data generated using the Gaussian process regression model was higher. Finally, using the Gaussian process regression method, we produced a long-term (January 1951 to December 2020) gridded monthly air temperature dataset with 1 km resolution and high accuracy for China, which we named GPRChinaTemp1km. 
The dataset consists of three variables: monthly mean air temperature, monthly maximum air temperature, and monthly minimum air temperature. The obtained GPRChinaTemp1km data were used to analyse the spatiotemporal variations of air temperature using Theil–Sen median trend analysis in combination with the Mann–Kendall test. It was found that the monthly mean and minimum air temperatures across China were characterized by a significant trend of increase in each month, whereas monthly maximum air temperature showed a more spatially heterogeneous pattern with significant increase, non-significant increase, and non-significant decrease. The GPRChinaTemp1km dataset is publicly available at https://doi.org/10.5281/zenodo.5112122 (He et al., 2021a) for monthly maximum air temperature, at https://doi.org/10.5281/zenodo.5111989 (He et al., 2021b) for monthly mean air temperature and at https://doi.org/10.5281/zenodo.5112232 (He et al., 2021c) for monthly minimum air temperature.
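The trend analysis mentioned above (Theil–Sen median slope combined with the Mann–Kendall test) can be sketched for a single grid cell as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `theil_sen_slope` and `mann_kendall` are hypothetical, the input is assumed to be one evenly spaced series (for example, 70 values of one calendar month, 1951 to 2020), and the p-value uses the usual normal approximation with a tie correction.

```python
import math
from collections import Counter
from statistics import median


def theil_sen_slope(values):
    """Median of all pairwise slopes for an evenly spaced series.

    The slope is expressed per time step (e.g. degrees C per year when
    `values` holds one value per year).
    """
    n = len(values)
    slopes = [
        (values[j] - values[i]) / (j - i)
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    return median(slopes)


def mann_kendall(values):
    """Mann-Kendall trend test.

    Returns (S, Z, p), where p is the two-sided p-value from the normal
    approximation with continuity and tie corrections.
    """
    n = len(values)
    # S counts concordant minus discordant pairs.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    # Variance of S, reduced by the contribution of tied groups.
    ties = Counter(values).values()
    var_s = (
        n * (n - 1) * (2 * n + 5)
        - sum(t * (t - 1) * (2 * t + 5) for t in ties)
    ) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p
```

Applied per pixel and per calendar month, a positive slope with p below a chosen threshold (0.05 is a common choice, assumed here) would be labelled a significant increase, and a positive slope with a larger p a non-significant increase.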
Status: closed
RC1: 'Comment on essd-2021-267', Anonymous Referee #1, 21 Sep 2021
The manuscript describes ML tools for spatially interpolating air temperature data from meteorological stations in China to a 1 km resolution. This topic is essential. However, many significant issues must be addressed. The following comments are intended to help improve the quality of the manuscript:
- The spatial resolution of the meteorological stations is not clear. Since the covered period, the temporal resolution, and the amount of missing data may differ between monitoring stations, it may be useful to provide a table or figure that summarizes this information.
- The methods used in this study are inappropriate, and the experiments lack sufficient detail. In ML, the dataset should be divided into training, validation, and test subsets. The validation subset can be used to evaluate the model while tuning its hyperparameters. Hyperparameter tuning is crucial for obtaining the best-performing model, and it is missing from this manuscript.
- Details of the data processing are missing; e.g., it is unclear how missing data were handled, how the data were normalized, etc.
- The models are compared on a single dataset and without a statistical test, so there is a high chance that GPR outperforms the others by accident.
- Why do the RMSE, MAE, and R2 errors show a cyclic pattern? Is there a reason for that?
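The points above about data splitting, error metrics, and statistical testing can be made concrete with a short sketch. It is an illustration only and does not reproduce anything in the manuscript: the function names, the 70/15/15 split fractions, and the choice of a paired sign-flip permutation test are all assumptions made for this example. The test compares the absolute errors of two models at the same held-out stations.

```python
import math
import random


def split_stations(station_ids, frac_train=0.7, frac_val=0.15, seed=0):
    """Shuffle station IDs and split them into train/validation/test lists."""
    ids = list(station_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * frac_train)
    n_val = int(len(ids) * frac_val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r2(obs, pred):
    """Coefficient of determination (undefined if all observations are equal)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def paired_permutation_test(err_a, err_b, n_perm=10000, seed=0):
    """Two-sided sign-flip test on the mean difference of paired errors.

    Under the null hypothesis that both models are equally accurate, the
    sign of each paired difference is arbitrary, so signs are flipped at
    random to build the reference distribution.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(err_a, err_b)]
    observed = abs(sum(diffs) / len(diffs))
    count = 0
    for _ in range(n_perm):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / len(diffs)) >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

In such a setup, hyperparameters would be chosen by comparing `mae`, `rmse`, and `r2` on the validation stations only, and the test stations would be used once, at the end, for the paired comparison between models.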
Minor comments and questions:
1) Line 7 needs to clarify why the "subset features" option of Geostatistic Analysis tools is used. Is it used to split features or datasets?
2) The explanation of SVM is not clear and needs to be further improved.
3) The sentence in Line 189 is not understandable.
Citation: https://doi.org/10.5194/essd-2021-267-RC1
RC2: 'Comment on essd-2021-267', Anonymous Referee #2, 04 Oct 2021
This study, conducted by Qian He et al., produced a high-resolution air temperature dataset using three types of machine learning methods. The dataset is timely and fits the scope of the journal well; it could be valuable and of interest to readers and the community. The language and the methods of the work are good overall, and I enjoyed reading it. I would really like to see this dataset published.
However, some points are not clear enough or need further clarification. My general comments and suggestions are listed below:
1) Generating high-precision, long time series of temperature data for China can effectively meet the needs of scientific research, but high-precision temperature data with 1 km resolution for China have already been released (Zhu X et al., 2019; Peng S et al., 2019). What are the innovative and distinguishing points of your data and methods?
2) Selection of the characteristic factors: the authors chose three temporally invariant variables (longitude, latitude, and elevation) to predict the dynamic changes of temperature. However, these three static factors do not really reflect the changes of temperature or its real spatial distribution characteristics. Have you considered factors such as the NDVI vegetation index, land use change, surface temperature, temporal and spatial correlations, monthly changes, etc.?
3) The accuracy of machine learning depends on the tuning and calibration of hyperparameters. This study uses 840 models; do these 840 models share the same set of parameters, or does each model have a different set?
4) How was the accuracy of the raster products verified? The authors used a limited number of sites (613) to generate the 1 km raster data products. However, as far as I know, the climate modelling sites are too sparse over the Qinghai-Tibet Plateau and Northwest China.
Citation: https://doi.org/10.5194/essd-2021-267-RC2
Data sets
GPRChinaTemp1km: 1 km monthly maximum air temperature for China from January 1951 to December 2020. He, Qian; Wang, Ming; Liu, Kai; Li, Kaiwen; Jiang, Ziyu. https://doi.org/10.5281/zenodo.5112122
GPRChinaTemp1km: 1 km monthly minimum air temperature for China from January 1951 to December 2020. He, Qian; Wang, Ming; Liu, Kai; Li, Kaiwen; Jiang, Ziyu. https://doi.org/10.5281/zenodo.5112232
GPRChinaTemp1km: 1 km monthly mean air temperature for China from January 1951 to December 2020. He, Qian (Beijing Normal University); Wang, Ming; Liu, Kai; Li, Kaiwen; Jiang, Ziyu. https://doi.org/10.5281/zenodo.5111989