the Creative Commons Attribution 4.0 License.
Data mining-based machine learning methods for improving hydrological data: a case study of salinity field in the Western Arctic Ocean
Abstract. The Beaufort Gyre, in the Western Arctic Ocean, is the largest freshwater reservoir in the Arctic Ocean. Long-term changes in freshwater reservoirs are critical for understanding the Arctic Ocean, and data from various sources, particularly measured or reanalyzed data, must be used to the greatest extent possible. Over the past two decades, intensive field observations and ship surveys in the western Arctic Ocean have produced a large amount of CTD data. Multiple machine learning methods were evaluated and merged to reconstruct an annual salinity product for the western Arctic Ocean over the period 2003–2022. Data mining-based machine learning methods make use of variables determined by physical processes, such as sea level pressure, sea ice concentration, and sea ice drift. Because sea surface salinity is more susceptible to atmospheric, sea ice, and oceanic changes, and therefore more variable, we constrained the average root mean square error (RMSE) of the CTD and EN4 sea surface salinity fields during machine learning training to within 0.25 psu. The machine learning process reveals that the uncertainty in predicting sea surface salinity is 0.24 % when constrained by CTD data and 0.02 % when constrained by EN4 data. During data merging and post-calibration, the weight coefficients are constrained by imposing limits on the uncertainty value. Compared with the commonly used EN4 and ORAS5 salinity products in the Arctic Ocean, our salinity product provides more accurate descriptions of the freshwater content in the Beaufort Gyre and of depth variations at its halocline base.
The approach of evaluating and integrating results from multiple machine learning methods has application potential beyond the salinity field, including hydrometeorology, sea ice thickness, polar biogeochemistry, and other related fields. The datasets are available at https://zenodo.org/records/10990138 (Tao and Du, 2024).
This preprint has been withdrawn.
Interactive discussion
Status: closed
-
RC1: 'No signs of cross-validation against independent data, probable overfitting', Anonymous Referee #1, 24 Jun 2024
The salinity dataset by Tao and co-authors interpolates salinity profiles in the data-scarce Western Arctic Ocean with the help of an interpolated salinity database and auxiliary atmospheric and sea ice observations. As many as six different machine learning algorithms are used to merge the various data. The resulting data product is used to calculate freshwater contents and is compared to in situ and model-based estimates.
The major weakness of the paper is the lack of consideration for validation against independent data, both in the choice of data sources and in the machine learning methodology. The authors use different data sources (WOD, EN4, UDASH and the WOA18 climatology) without indicating the inter-dependences between these datasets. Whether the CTD data have been included in EN4 or not has implications for how I interpret the results: is the product a better interpolation method than the objective analysis used in EN4, or are the two predictions of CTD and EN4 values two independent estimates of the same salinities? The same question applies to the BGEP mooring data: are they included or not in the EN4, UDASH and WOA18 aggregators? This question is an important prerequisite for understanding why the estimates differ so much in the results section. As the paper stands, these results are just numbers that could look right for the wrong reason.
The methodological concern is that all six machine learning techniques used will overfit the salinity data unless a part (say, one or two years) is set aside for validation. This is common practice in machine learning, as all textbooks will show. From the text and the columns of zeroes in Table 1, the results are presented on the training data, which therefore does not guarantee any skill in extrapolation. Along the same lines, the uncertainty CT1 is computed from the residuals of the training datasets, which in the likely case of overfitting are much lower than the actual errors (and worse, unrelated to them). So if there were one additional CTD cast that was not included in the input dataset, I see no guarantee that the prediction at that point would be as skillful as indicated in the paper. I would prefer one properly validated algorithm to six different ways of overfitting.
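The hold-out procedure the referee describes can be sketched as follows. This is a minimal illustration on synthetic data, not the manuscript's method: the `slp` predictor, the choice of 2021 and 2022 as held-out years, and the deliberately overfitting `fit_memorizer` model are all assumptions made for the example.

```python
import math
import random


def rmse(pred, target):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))


def split_by_year(samples, holdout_years):
    """Separate samples into training and validation sets by year."""
    train = [s for s in samples if s["year"] not in holdout_years]
    valid = [s for s in samples if s["year"] in holdout_years]
    return train, valid


def fit_memorizer(train):
    """A deliberately overfitting model: it memorizes every training sample.

    Inputs it has never seen fall back to the training-mean salinity.
    """
    table = {(s["year"], s["slp"]): s["salinity"] for s in train}
    mean = sum(s["salinity"] for s in train) / len(train)
    return lambda s: table.get((s["year"], s["slp"]), mean)


# Synthetic data (NOT from the manuscript): salinity depends weakly on a
# hypothetical sea-level-pressure predictor, plus random noise.
rng = random.Random(0)
samples = []
for year in range(2003, 2023):
    for _ in range(30):
        slp = rng.uniform(1000.0, 1030.0)
        salinity = 28.0 + 0.05 * (slp - 1015.0) + rng.gauss(0.0, 0.3)
        samples.append({"year": year, "slp": slp, "salinity": salinity})

# Hold out two whole years; the model never sees them during fitting.
train, valid = split_by_year(samples, holdout_years={2021, 2022})
model = fit_memorizer(train)

train_rmse = rmse([model(s) for s in train], [s["salinity"] for s in train])
valid_rmse = rmse([model(s) for s in valid], [s["salinity"] for s in valid])
print(f"training RMSE:   {train_rmse:.3f} psu")
print(f"validation RMSE: {valid_rmse:.3f} psu")
```

The training residuals here say nothing about the error on the held-out years, which is why an uncertainty estimate would need to come from the validation residuals instead.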
Overall, the text is missing a clear explanation of the algorithm used, and I had to squint at Figure 2 to imagine the methodology. The paper contains a lot of information whose relevance for the dataset is not mentioned. It makes for a very tedious read, and I realize that the correct procedures may have been applied without my finding them mentioned in the text.
In view of the above weaknesses, I believe the submitted manuscript cannot easily be modified into a publishable version. All the results should be presented on validation data rather than training data, and all the methods and data sections should be completely rewritten to specify the intended use of the data. The results section should be rewritten to reflect on the reasons for the differences between EN4 and BGEP data and why the authors' approach is the correct answer to the problem.
Citation: https://doi.org/10.5194/essd-2024-138-RC1
- AC1: 'Reply on RC1', Ling Du, 16 Jul 2024
-
RC2: 'Comment on essd-2024-138', Anonymous Referee #2, 10 Jul 2024
This paper employs multiple machine learning methods to reconstruct the salinity in the Western Arctic Ocean based on easily obtained atmospheric reanalysis data and satellite-based sea ice concentration and motion. This topic is interesting and crucial, and, as the authors claim, this method can expand the spatial-temporal coverage and improve the accuracy of estimation relative to existing products.
However, due to so much confusion about the method and verification, I have to treat this article with caution, and I believe that it is currently far from an adequate article, especially for publication in ESSD. Here are some major comments.
- For the method, there are 2 major questions.
  - I cannot understand why the authors use EN4 reanalysis data as a target to train their model. EN4 includes uncertainties and errors and cannot provide the real world’s salinity. Moreover, EN4 covers the whole area and time span this work focuses on. I think the authors could use this dataset directly or simply interpolate it.
  - I think the authors want to highlight the machine learning methods, but I don’t know what role they played in improving the accuracy of the salinity reconstruction. A comparison with traditional methods (e.g., optimal interpolation) is lacking in this work. Moreover, the authors compared their product with EN4, which is used as a target set to train the method, and said their results are better, which is contrary to general knowledge. Maybe the decrease in errors comes from the application of machine learning, but the more likely reason is simply the merging of CTD data.
- For the verification, I am very surprised by the RMSE of 0 for the results from the KNN method. It means that this method can perfectly reproduce your target salinity. The only possibility I can think of is that the authors use the training set to calculate the RMSE instead of a verification or test set. This makes it completely impossible to evaluate the salinity reconstructed by machine learning in this work.
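One way an RMSE of exactly 0 can arise from a nearest-neighbour method is sketched below. This is a toy example with invented one-dimensional data and a hand-written `knn_predict` helper, not the authors' configuration, which the source does not specify. With k = 1, every training point is its own nearest neighbour, so scoring on the training set returns the targets unchanged.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points nearest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(pred, target):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))


# Toy data, invented for illustration: one predictor and a salinity-like target.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [27.1, 28.4, 27.9, 29.3, 28.8]

# Score each setting of k on the SAME points the model was built from.
train_rmse = {}
for k in (1, 3):
    pred = [knn_predict(train_x, train_y, x, k) for x in train_x]
    train_rmse[k] = rmse(pred, train_y)
    print(f"k = {k}: training RMSE = {train_rmse[k]:.3f}")
```

The zero for k = 1 is a property of evaluating on the training points, not evidence of skill; inverse-distance weighting would give the same result for any k, since a zero-distance neighbour dominates the average.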
There are also many minor comments. What the authors need to pay more attention to is that many places in the MS go against the conventions of academic writing, which makes it hard to read.
Line 17: write out the full name of “CTD” at its first use in the paper. Similar issues exist in many other places; please check.
Line 63: double full stops.
Line 86: what’s your dataset’s temporal resolution?
Line 96: I think it is better to put the section about why you focus on the Western Arctic Ocean in the Introduction than here.
Line 191: Fig. 2 is confusing and should be replotted. a) I cannot find “Classify” and “statistic analysis” (case matters) in the text. b) Similarly, “Physical process” and “Nearest Neighbors” (case matters) cannot be found either, and they look very much like input variables. c) Only 4 variables are used in the data-selecting step, which is far fewer than the data introduced in 2.2. And in the text, you don't seem to be doing anything with these 4 variables, while the CTD data were cleaned. This figure gives readers the opposite impression. You should make it clear to readers which variables are used to train or build the dataset, and where the algorithms are. d) Where is WOA18 used when you create the dataset? Mark it in the figure or delete it.
Line 122: In this section, I think two things were done: 1, introducing the data used in this work; 2, data cleaning (selection). It would be better to divide them into two paragraphs. A lot of data are introduced here, but it is confusing which are used to train, which are used to create the dataset, and which are used to evaluate. Reorganize them according to their purpose.
Line 127 “the data with flags 0 and 1 based on the quality control provided by the data itself”: what does it mean?
Line 141: Why are the data from 2004 ignored?
Line 143: Enlarge the size of years. Also, in Fig 9 and 10.
Line 151: Does the quality of the reconstructed data vary across seasons?
Line 154: Fig 3 instead of “Fig 2”.
Line 157: Why did you choose the EN4? You should tell readers more about it, for example, if the WOD data is assimilated.
Line 174: which Neural Network did you use?
Line 195 “the prediction using EN4 data is better than using CTD data”: just means that the results based on EN4 are closer to its target data. There may be some errors in EN4. Don’t say that it is better.
Line 199: target values instead of “True values”.
Line 204: It is so surprising that the RMSE is 0. I guess you may just show the results of the training set. You should explain what causes these 0.
Line 207: “prediction” or reconstruction? Prediction is a good usage for this method, but it looks like you didn’t do it in this work.
Line 216 “In the same year, some machine learning predictions are good while others are poor.”: it is better to give me a statistical, quantitative result rather than such a description.
Line 226: In this work, you always show me examples, but a better way is to give readers a statistical result. I think it is not so hard.
Line 245: Maybe here you can tell readers how to merge and post-calibrate data first, and then discuss how to calculate uncertainty. Give me more details.
Line 253: how to get the weights “a”? Also, the weights beta needs more description.
Line 258: where is the 1st-3rd post-calibrating?
Line 259 “when there are CTD measured data around the grid point”: the “around” needs to be quantified.
Line 261: I’m a bit confused about three things. a), why are you using machine learning to reconstruct salinity based on the EN4. I assume you're trying to improve its resolution, but you don’t show the advance of the method. You need to compare it with traditional methods like optimal interpolation. b), you only used the reconstructed data based on CTD which is close to the in-situ observations. I think this may be due to the increasing error away from the buoy, but this also requires evidence. You need to show the reader at what distance the salinity error based on CTD reconstruction is greater than EN4. c), if you use the optimal interpolation instead of machine learning to reconstruct salinity based on the CTD data, how much error is going to increase?
Line 270: I don't suggest using percentages to report uncertainty. The salinity of the ocean is always high, so the proportions of errors are always low and the percentage statements are misleading. For the FWC, you calculate the proportion of errors, but the FWC depends on the difference between the reconstructed salinity and 34.8 psu. These two similar-looking quantities are quite different, potentially confusing readers.
Line 288: Please do a brief introduction about BGEP.
Line 334: why does the freshwater decrease after 2011 support the recent major freshening event from 2012 to 2016?
Line 384: I think it will be clearer to write as “sea level pressure from ERA5 and sea ice concentration and motion from NSIDC”.
Line 385: where did you use the ETOPO1?
Line 392: you should not add any new results in the Summary and please cite Fig. 11 in this paragraph.
Line 405: I cannot understand the meaning of “trend” and also don’t know why you discuss the relationship between salinity and freshwater here. If you want to show me something, a figure about it is necessary.
Line 411: at the beginning of the paper, you said that the greatest advantage of your dataset is that the salinity in recent years is included, but here you say that the greatest advantage is your result is more accurate. I agree that both of them are important and you are able to do them, but changing your big problem in one paper is improper.
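The Line 270 point about percentage uncertainties can be made concrete with a little arithmetic. The sketch below assumes the common definition of the freshwater fraction, (S_ref - S) / S_ref with S_ref = 34.8 psu, and uses the 0.25 psu error bound quoted in the abstract; the three salinity values and the `relative_errors` helper are illustrative choices, not taken from the manuscript.

```python
S_REF = 34.8  # reference salinity in psu, the value named in the Line 270 comment


def relative_errors(salinity, abs_error, s_ref=S_REF):
    """Return (relative salinity error, relative freshwater-fraction error)."""
    rel_salinity = abs_error / salinity
    # The freshwater fraction is (s_ref - S) / s_ref, so an absolute salinity
    # error is measured against (s_ref - S), not against S itself.
    rel_freshwater = abs_error / (s_ref - salinity)
    return rel_salinity, rel_freshwater


# Illustrative salinities (assumed values), all with the same 0.25 psu error.
results = {s: relative_errors(s, abs_error=0.25) for s in (28.0, 32.0, 34.0)}
for s, (rel_s, rel_f) in results.items():
    print(f"S = {s:4.1f} psu: salinity error {rel_s:.2%}, "
          f"freshwater-fraction error {rel_f:.2%}")
```

The same absolute error looks small as a percentage of salinity but grows quickly as a percentage of the freshwater fraction when S approaches the reference value, which is why reporting the error in psu is less ambiguous.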
Citation: https://doi.org/10.5194/essd-2024-138-RC2
- AC2: 'Reply on RC2', Ling Du, 16 Jul 2024
Data sets
Data mining-based machine learning methods for improving hydrological data: a case study of salinity field in the Western Arctic Ocean Shuhao Tao and Ling Du https://zenodo.org/records/10990138