the Creative Commons Attribution 4.0 License.
Global tropical cyclone size and intensity reconstruction dataset for 1959–2022 based on IBTrACS and ERA5 data
Abstract. Tropical cyclones (TCs) are powerful weather systems that can cause extreme disasters. The International Best Track Archive for Climate Stewardship (IBTrACS) dataset has been used extensively to estimate TC climatology. However, it has low data coverage, lacking intensity and outer size data for more than half of all recorded storms, and is therefore insufficient as a reference for researchers and decision makers. To fill this data gap, we reconstructed a long-term TC dataset by integrating IBTrACS and European Centre for Medium-Range Weather Forecasts Reanalysis 5 (ERA5) data. This new dataset covers the period 1959–2022, with 3 h temporal resolution. Compared to the IBTrACS dataset, it contains approximately 3–4 times more data points per characteristic. We established machine learning models to estimate the maximum sustained wind speed (Vmax) and radius to maximum wind speed (Rmax) in six basins for which TCs were generated using ERA5-derived 10 m azimuthal median azimuthal wind profiles as input, with Vmax and Rmax data from the IBTrACS dataset used as training data. An empirical wind–pressure relationship and six wind profile models were employed to estimate the minimum central pressure (Pmin) and outer size of the TCs, respectively. Overall, this high-resolution TC reconstruction dataset demonstrated global consistency with observations, exhibiting mean biases of <1 % for Vmax and 3 % for Rmax and Pmin in almost all basins. The new dataset is publicly available from https://doi.org/10.5281/zenodo.12740372 (Xu et al., 2024) and significantly advances our understanding of TC climatology, thereby facilitating risk assessments and defenses against TC-related disasters.
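To illustrate how a parametric wind profile turns Vmax and Rmax into an outer size, here is a minimal sketch based on a modified Rankine vortex. This profile is chosen purely for illustration and is not necessarily one of the six models used in the dataset. The decay exponent `alpha = 0.5`, the function names, and the example storm (Vmax = 50 m/s, Rmax = 30 km) are assumptions. The thresholds of 17, 26, and 33 m/s follow the wind radii mentioned in the referee comments.

```python
def rankine_wind(r_km, vmax, rmax_km, alpha=0.5):
    """Tangential wind (m/s) at radius r_km for a modified Rankine vortex.

    alpha is an assumed decay exponent, not a value from the manuscript.
    """
    if r_km <= 0:
        return 0.0
    if r_km < rmax_km:
        # Solid-body rotation inside the radius of maximum wind
        return vmax * r_km / rmax_km
    # Power-law decay outside the radius of maximum wind
    return vmax * (rmax_km / r_km) ** alpha


def outer_radius(v_threshold, vmax, rmax_km, alpha=0.5):
    """Radius (km) outside Rmax where the wind decays to v_threshold.

    Returns None when the storm never reaches v_threshold.
    """
    if vmax < v_threshold:
        return None
    # Invert vmax * (rmax / r) ** alpha = v_threshold for r
    return rmax_km * (vmax / v_threshold) ** (1.0 / alpha)


# Hypothetical storm: Vmax = 50 m/s, Rmax = 30 km
radii = {v: outer_radius(v, vmax=50.0, rmax_km=30.0) for v in (17.0, 26.0, 33.0)}
for v, r in radii.items():
    print(f"R{v:.0f}: {r:.1f} km")
```

With any profile of this kind, a bias in Vmax or Rmax carries directly into the derived radii, which is why the accuracy of the two inputs matters for the outer size estimates.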
Status: closed
RC1: 'Comment on essd-2024-329', Anonymous Referee #1, 13 Aug 2024
The manuscript presents a new global tropical cyclone dataset that integrates the IBTrACS and ERA5 reanalysis data to reconstruct key TC characteristics such as Vmax, Rmax, and Pmin. The authors use a random forest algorithm to reduce biases in the ERA5-derived characteristics, enhancing the data availability and spatiotemporal coverage of the best track dataset. This manuscript demonstrates a certain level of innovation and scientific value, and it is generally well-organized. I recommend accepting the manuscript after the minor revisions listed below.
Comment 1: The approach of combining IBTrACS and ERA5 data using machine learning like Random Forest models appears to be well-justified based on the reported improvements in bias reduction. However, it would be helpful to provide more details about the selection process for the RF model, particularly in comparison with other models that were tested but not selected.
Comment 2: One suggestion for improving the writing is to streamline the description of the wind profile models, as the detailed mathematical formulations might be overwhelming for some readers. Instead, focusing the main body of the manuscript on the selected wind profile models and the comparative performance of all models (rather than leaving these in the supplement), and summarizing the tables, would be more impactful.
Comment 3: The reductions in bias for key metrics like Vmax and Rmax are impressive. However, while the manuscript acknowledges the limitations related to landfall TCs and the dependency on ERA5's spatial resolution, a more detailed discussion on how these limitations might affect specific use cases of the dataset could be beneficial.
Comment 4: Ensure consistency in tense usage, particularly when discussing results and implications. For example, there is a mixture of the past simple tense and present simple tense in line 240.
Comment 5: Consider using active voice more frequently to make the writing more direct. For example, "Six wind profile models were used to compute the radii..." could be "We used six wind profile models to compute the radii..."
Citation: https://doi.org/10.5194/essd-2024-329-RC1
AC1: 'Reply on RC1', Jianping Guo, 16 Oct 2024
RC2: 'Comment on essd-2024-329', Anonymous Referee #2, 14 Sep 2024
In Xu et al. (2024), the authors introduce an extension of the International Best Track Archive for Climate Stewardship (IBTrACS) dataset. Using the ERA-5 reanalysis dataset alongside IBTrACS observations, the authors utilize random forest to generate estimates of maximum wind speed and radius of maximum wind. The authors then proceed to construct minimum pressure estimates using an empirical wind-pressure relationship and the newly generated maximum wind speed estimates, before incorporating the radius of maximum wind and maximum wind speed estimates into wind models to generate estimates of the surface wind radii (radius of 17, 26, and 33 m s⁻¹ winds). Validation statistics suggest that these estimates are more accurate than ERA-5 reanalysis data by itself, and the overall number of estimates generated in this manuscript provides substantially more data to be used by climatologists.
Overall, this paper presents an interesting and novel approach to handling missing observations in the tropical cyclone record. However, I have some concerns regarding how the dataset was created and how the manuscript is constructed. As such, I believe major revisions are necessary before this manuscript can be accepted.
Major Comments
Comment 1: The selection of random forest as the primary method for this paper is not presented in a way that convinces the readers that it was the best choice. In Lines 146–148, the authors list several machine learning techniques that could be used, but random forest is chosen without demonstrating how it outperformed the other methods. In Lines 161–162, the hyperparameters of each random forest are noted to have been selected by “randomized search.” This implies that the authors picked random numbers until the output was deemed acceptable, instead of utilizing validation statistics. Additionally, the authors do not mention the software or program used to implement the random forest (e.g., R, Python, SAS, or Matlab), nor do they inform the reader of the error rate when training the random forest. Incorporating more details on how random forest outperformed the other potential techniques, how the best hyperparameters were determined, and how the process was conducted would go a long way toward reassuring readers that the data they are using were generated using best practices.
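For context on the “randomized search” point, the sketch below shows what the term usually means in practice: candidate hyperparameters are drawn at random, but each candidate is scored with a validation statistic (here, cross-validated mean absolute error) and the best-scoring one is kept. This is a stand-in under stated assumptions, not the authors' procedure. A simple k-nearest-neighbour regressor replaces the random forest to keep the example dependency-free, and the data, function names, and search range are all hypothetical.

```python
import random
import statistics


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean([y for _, y in nearest])


def cv_mae(data, k, n_folds=5):
    """Mean absolute error of the stand-in model under n_folds-fold cross-validation."""
    errors = []
    for fold in range(n_folds):
        # Every n_folds-th sample forms the held-out fold
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        errors.extend(abs(knn_predict(train, x, k) - y) for x, y in test)
    return statistics.fmean(errors)


def randomized_search(data, k_range, n_iter, seed=0):
    """Sample up to n_iter values of k at random; keep the one with the lowest CV error."""
    rng = random.Random(seed)
    candidates = {rng.randint(*k_range) for _ in range(n_iter)}
    scores = {k: cv_mae(data, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic data for illustration only: a noisy quadratic
noise = random.Random(42)
data = [(x / 10, (x / 10) ** 2 + noise.gauss(0, 0.5)) for x in range(100)]

best_k, scores = randomized_search(data, k_range=(1, 30), n_iter=10)
print(f"best k = {best_k}, cross-validated MAE = {scores[best_k]:.3f}")
```

Reporting the search ranges, the number of sampled candidates, the validation metric, and the resulting scores would address the concern about how the hyperparameters were determined.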
Minor Comments and Technical Corrections
Lines 10–11: The first two sentences of the abstract differ in voice: the first is written in active voice, while the second is written in passive voice. This occurs throughout the paper. Please be consistent, preferably in active voice.
Line 18: The abstract mentions the authors used “10 m azimuthal median azimuthal wind profiles”; however, in Line 118 the authors mention using “10 m azimuthal mean azimuthal wind profiles.” These are not the only instances where the wind profiles differ. Please adjust to state the correct wind profile (mean or median).
Line 26: Referring to something as “significant” typically implies statistical significance. Unless you have statistics to support the claim, consider using a similar word, such as “substantial.”
Lines 26–27: Words like “formidable,” “torrential,” and “devastating” are not typically used in scientific writing. Please adjust accordingly.
Line 28: Are the 29 billion dollars in damages presented here in US dollars? If so, which year?
Line 28: Are the 22 million individuals globally distributed, or located on a specific continent?
Line 29: The previous sentence establishes the scale of TC-related disasters, but the frequency has not been addressed. Please consider adding information pertaining to TC frequency.
Line 50: Consider changing “radius to maximum wind speed” to “radius of maximum wind,” as this aligns with other manuscripts. This appears elsewhere in the manuscript, not just here.
Line 52: Consider changing “on surface” to “near the surface.”
Lines 71–73: It would be helpful to add a couple of sentences to demonstrate to readers how reanalysis can be used to reconstruct TC size. Some examples:
- Gori et al., 2023: North Atlantic Tropical Cyclone Size and Storm Surge Reconstructions From 1950-Present.
- Schenkel et al., 2017: Evaluating Outer Tropical Cyclone Size in Reanalysis Datasets Using QuikSCAT Data.
- Thompson et al., 2024: Construction of a tropical cyclone size dataset using reanalysis data.
- Zick and Matyas, 2016: Tropical cyclones in the North American regional reanalysis: the impact of satellite-derived precipitation over ocean.
Line 78: Consider re-wording “relatively accurate TC intensity,” as this implies that ERA-5 intensity is accurate (the description of maximum wind biases later in the manuscript undermines this phrasing).
Line 79: In the abstract, it says that this paper is about a reconstructed dataset. Here, it says a new dataset is generated. Please adjust throughout the manuscript for consistency.
Lines 91–92: IBTrACS can be downloaded in netCDF, CSV, and Shapefile format. Which version did you use?
Line 93: Which U.S. agencies were responsible for the data?
Line 118: You cite Schenkel (2017), but your references do not contain a paper where Schenkel is the solo author. Was this supposed to be Schenkel et al. (2017), or did you intend Schenkel (2017) but forget to add the citation (Schenkel did publish a solo author paper in 2017 pertaining to TC centers, which is the topic of the sentence)? Please revisit your citations to be sure that all in-text citations and references align.
Lines 120–122: How was this done? Which software did you use?
Lines 136–137: What is causing this systemic bias in ERA-5 derived maximum wind? An earlier sentence mentioned a couple of reasons, but it is presented as being an issue for other datasets. Is it the same reasoning, or are there different reasons for this bias?
Line 138: Different basins use different scales to categorize TCs. Why did you select the Saffir-Simpson Hurricane Wind Scale for all basins?
Line 142: I am not sure “modesty” is the appropriate word here. A possible change to the sentence: “Despite the discrepancy in TC intensity, Bian et al. (2021) demonstrated that ERA-5 accurately depicts TC structural alterations.”
Line 166: Which correlation did you use (Pearson, Spearman, or Kendall)? How was statistical significance evaluated?
Lines 180–201: Are the same variables incorporated into each wind profile? For example, is the distance r in Holland (1980) the same r in DeMaria (1987)?
Line 187: What are R1 and R2?
Line 188: What is the transition zone?
Line 216: Providing the error rate of the random forest on the training data would also help with assessing the accuracy of the reconstructed data.
Lines 217–218: According to the flowchart in Figure 3, the observational data from IBTrACS were used to generate the reconstructed versions of the same data. This would explain why you have such high correlations. In practice, you typically do not see such high correlation coefficients, and that is cause for concern. In the abstract, it says that the observation data were used for training. Were they also used in the test data?
Line 225: While the reconstructed data do appear to better fit the observational data, the error bars are quite large and the reconstructed data are estimates and therefore not “real” data. Saying they are “clearly superior” is a stretch.
Line 233: Does Table 2 compare the ERA-5 and reconstructed data with the observation data? If so, please make that clear in the table caption.
Line 246: Why are the error bars so much larger for the North Atlantic and East Pacific basins compared to the other basins?
Line 247: For each basin, the MAE of the reconstructed Rmax is less than the MAE for ERA-5, but at the global level, MAE for the reconstructed Rmax is greater than ERA-5. It looks like the two numbers were switched. Is that the case?
Line 258: What is the “sea level pressure difference”? It isn’t defined anywhere else in the paper.
Line 268: Consider adding “, respectively.” after the second sentence.
Line 287: Consider “Finally” instead of “Besides.”
Lines 294–295: In the Zenodo repository, consider adding a ReadMe file that explains what the columns are in the dataset (how they were derived, their units, etc.). This would provide helpful information for users who would use your data.
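The concern raised for Line 247 rests on a simple arithmetic property. If the global MAE is computed by pooling all samples, and both products are evaluated on the same samples (both are assumptions here; the manuscript may aggregate differently), then the global MAE is the sample-weighted mean of the per-basin MAEs. A product with the lower MAE in every basin must then also have the lower global MAE. The sketch below demonstrates this with hypothetical numbers that are not taken from the manuscript.

```python
def pooled_mae(per_basin):
    """MAE over all samples pooled, given a list of (n_samples, mae) per basin."""
    total = sum(n for n, _ in per_basin)
    return sum(n * mae for n, mae in per_basin) / total


# Hypothetical sample counts and Rmax MAEs (km); not values from the manuscript
counts = [1200, 800, 500]
era5_mae = [40.0, 55.0, 35.0]
recon_mae = [22.0, 30.0, 20.0]

global_era5 = pooled_mae(list(zip(counts, era5_mae)))
global_recon = pooled_mae(list(zip(counts, recon_mae)))

print(f"global ERA5 MAE:          {global_era5:.1f} km")
print(f"global reconstructed MAE: {global_recon:.1f} km")
```

A reversal at the global level would therefore point to swapped values, differing sample sets, or a different aggregation method, any of which would be worth stating explicitly.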
Citation: https://doi.org/10.5194/essd-2024-329-RC2
AC2: 'Reply on RC2', Jianping Guo, 16 Oct 2024
Data sets
Global tropical cyclone size and intensity reconstruction dataset for 1959–2022 based on IBTrACS and ERA5 data. Zhiqi Xu, Jianping Guo, Guwei Zhang, Yuchen Ye, Haikun Zhao, and Haishan Chen. https://zenodo.org/records/12740372
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
---|---|---|---|---|---|---|
449 | 126 | 26 | 601 | 27 | 10 | 12 |