The authors have added some metadata labels to the database, which were indeed needed, and have revised certain parts of the manuscript and its presentation. However, concerning the point-by-point rebuttal: although the authors have written many replies to the reviewer comments, the majority of these are not reflected in the manuscript. It would be a simple thing to add some of the explanations given to the reviewers into the main text, so that they are available to the general readership in a straightforward way, helping towards a better understanding (and, most importantly, a better use) of the data on offer. My main and final recommendation for revision is therefore to add the explanations (and references) given in the rebuttal in order to clarify/improve the main text. Some examples follow:
1. Points 2.2.7, 1.2.1, 2.1.14:
Both reviewers pose the question of how 5 Hz (the maximum frequency that can numerically propagate through the grid) and 100 Hz (the sampling rate) are really reconciled. The (identical) reply given to both reviewers is this: “100 Hz matches the usual temporal resolution of recorded time series available in public accessible earthquake engineering strong motion databases which is important for tasks such as seismic phase picking”. Yet this is not explained in the revised manuscript; it is given only as a personal reply in the rebuttal. The reader, who is likely to ask the same question, therefore cannot benefit from it. He/she should not have to read the commentary exchange in order to obtain the clarifications necessary for the article, so please explain your rationale in the paper (a numerical sketch of how the two figures coexist is given after the note below).
A note regarding this specific reply: please rephrase this explanation before adding it to the manuscript, because it is incorrect on a few counts:
1. earthquake engineers do not access strong-motion datasets to do phase picking, which is a purely seismological task/skill
2. the reason for the high sampling in strong-motion data is not for the sake of phase picking (wave windowing can be very rough in such applications, in stark contrast to seismic monitoring); the investment is made in order to be sure to capture PGA (peak ground acceleration) correctly
3. in many important networks, the sampling rate of accelerometric data is actually not even 100 Hz but 200 Hz
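To make the requested clarification concrete: the two figures are not in conflict, because one is a physical bandwidth limit and the other a storage sampling rate. A minimal sketch, assuming a generic finite-difference dispersion rule of thumb (the grid spacing, minimum velocity, and points-per-wavelength values are placeholders, not the authors' actual setup):

```python
# How a 5 Hz physical bandwidth and a 100 Hz sampling rate coexist.
# All numerical values below are illustrative placeholders.
dt_out = 1.0 / 100.0          # output sampling interval (100 Hz)
f_nyquist = 0.5 / dt_out      # 50 Hz: sampling limit of the stored traces

v_min = 1000.0                # m/s, slowest shear-wave speed in the model
dx = 25.0                     # m, grid spacing
ppw = 8                       # grid points per minimum wavelength (FD rule of thumb)
f_grid = v_min / (ppw * dx)   # 5 Hz: highest frequency the grid propagates cleanly

# The traces are band-limited by the grid (~5 Hz) but stored at 100 Hz,
# i.e. oversampled by a factor of f_nyquist / f_grid = 10.
print(f"Nyquist: {f_nyquist:.0f} Hz, grid limit: {f_grid:.0f} Hz")
```

Stating these two limits side by side in the paper would resolve the apparent contradiction for the reader.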
2. Points 2.1.2, 2.1.3, 1.1.1, 1.1.2, 2.1.5, 2.1.19:
Again, both reviewers pointed this out: the choice to include unphysical instances of various parameter values in the database. If the authors agree that some of the models are unrealistic from a geological/geophysical/seismological point of view, then please stress this in the text and explain why you consider it necessary to include them nevertheless for ML purposes. Also, and this is something I would like to stress, please be very clear about what percentage of the data corresponds to unrealistic, or in statistical terms “extremely rare” or tail, cases. It seems these rare cases may actually take up a large share of the database: from the numbers the authors give, they appear to amount to roughly one third of it (10,550 out of 30,000?), which seems too high, so is the tail of the distributions being sampled or oversampled in the end? Please be clear on the statistics.
To say that the ‘plausibility of data depends on the application’ is, I think, a compromise detrimental to the earth-science applicability of the work, in favor of ML. Natural occurrence does need to have a role here. (E.g. one may well collect 1,000,000 soil samples but will never measure a density of, say, 5 t/m^3; and even if one did, it would certainly not happen 30% of the time.) So please add explicit commentary about all this to the paper. If both reviewers felt the need to bring it up, most earth scientists in the audience are likely to have similar questions.
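To be concrete about the kind of statistic I am asking for, a minimal sketch follows; the file name, column names, and thresholds are hypothetical placeholders, not the database's actual schema:

```python
import pandas as pd

# Hypothetical sketch: "metadata.csv" and the columns/thresholds below
# are placeholders, not the database's actual schema.
meta = pd.read_csv("metadata.csv")

# Example plausibility screen on material parameters (illustrative limits).
implausible = (meta["density_t_m3"] > 3.5) | (meta["vs_surface_m_s"] > 4000)

n_bad, n_all = int(implausible.sum()), len(meta)
print(f"{n_bad} of {n_all} models ({n_bad / n_all:.1%}) fall outside "
      f"geologically plausible ranges")
```

A table of such percentages, per parameter, would settle the sampling-versus-oversampling question.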
3. Point 2.1.6: amplification
I find the authors' arguments incomplete with respect to the existing knowledge on site response. Even so, please give your arguments in the paper, explaining if, how, and why amplification is or is not accounted for in your calculations, especially with respect to the frequency range <5 Hz (which may well be impressively high for such calculations, but is lower than the range where hard sites amplify), and especially considering the large proportion of high-Vs cases. It is ok to say that it is not accounted for completely but is at least dealt with better than in past papers, or is out of scope, etc. However, I think it is not ok to claim that there is nothing to discuss here and simply ignore the issue: limitations exist and should be stated.
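To illustrate why the <5 Hz band still matters for site response, consider a quick quarter-wavelength resonance estimate; the two profiles below are generic examples, not values from the manuscript:

```python
# Quarter-wavelength estimate of the fundamental site frequency, f0 = Vs / (4 H).
# The two profiles are generic examples, not values from the manuscript.
def f0(vs_m_s: float, thickness_m: float) -> float:
    """Fundamental resonance of a single uniform layer over bedrock."""
    return vs_m_s / (4.0 * thickness_m)

print(f"soft site (Vs=300 m/s, H=50 m):  f0 = {f0(300.0, 50.0):.2f} Hz")   # 1.50 Hz, inside the <5 Hz band
print(f"hard site (Vs=1500 m/s, H=50 m): f0 = {f0(1500.0, 50.0):.2f} Hz")  # 7.50 Hz, above the band limit
```

Soft-site resonances fall squarely inside the simulated band, while hard sites amplify above it; this is exactly the distinction the paper should spell out.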
4. Limitations related to calculation speed/run time and memory/storage needs seem to underpin many of the decisions made and/or are the answer to many of the reviewer comments (duration of waveforms, spatial sampling, inability to fill basins with soft material, etc.). It would be good to discuss all these cases together at the end of the paper. In other words, answer the question: when computing becomes faster/easier in the future, what would be the top 5 things you would like to do differently, without the need to worry about such issues?
5. Point 2.1.8: surface velocity
I did not find the new explanation about surface velocity in the revised text. Please add it if missing, because it is very important that the reader understands how you define ‘surface’ velocity (over 30 or 300 m!). In the majority of locations in the world where seismic hazard is a concern, we would love to have a ‘near-surface’ Vs of 1000 m/s or more, but we do not! Unless the dataset is more representative of certain regions (France? stable continental areas?), which it claims it is not. Also, please state in the main text (as perspectives; we know it is not in the scope of this paper) if/how your methods and data could be combined with near-surface site-effect calculations.
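For reference, the standard time-averaged definition I have in mind is sketched below (depth divided by vertical S travel time); the example velocity profile is hypothetical:

```python
def vs_average(thicknesses_m, velocities_m_s, depth_m=30.0):
    """Time-averaged shear-wave velocity over the top depth_m metres
    (Vs30 for depth_m=30): depth divided by the vertical S travel time."""
    travel_time, depth_left = 0.0, depth_m
    for h, v in zip(thicknesses_m, velocities_m_s):
        dz = min(h, depth_left)
        travel_time += dz / v
        depth_left -= dz
        if depth_left <= 0.0:
            break
    return depth_m / travel_time

# Hypothetical profile: 10 m at 200 m/s, 20 m at 400 m/s over a halfspace.
print(vs_average([10.0, 20.0, 1e9], [200.0, 400.0, 800.0]))  # -> 300.0 m/s
```

Making explicit whether ‘surface velocity’ means such an average over 30 m, over 300 m, or the value of the top grid cell would remove the ambiguity.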
6. Point 2.1.9: GMPEs
Although many GMPEs (ground motion prediction equations) are in fact informed by simulations, even if they were all purely empirical they would still be a key tool in practice. So if your database were to show a great divergence from what they predict, it would be extremely important to point it out. A comparison would be beneficial, and if there is disagreement, the various arguments the authors give can be proposed to explain why their work is better suited than GMPEs for such and such a case. It is not a matter of believing in data more than in simulations, but there is an urgent need for the two communities to finally start acknowledging each other, so that science can move forward faster. Please help in this direction.
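The comparison I have in mind could be as simple as computing ln-residuals of the synthetic peak amplitudes against a published model. A schematic sketch; the functional form and coefficients below are generic placeholders, not a real published GMPE (in practice one would use an implemented model, e.g. from openquake.hazardlib.gsim):

```python
import numpy as np

# Generic attenuation form with PLACEHOLDER coefficients, not a real GMPE:
# ln PGA = c0 + c1*M + c2*ln(sqrt(R^2 + h^2))
def ln_pga_gmpe(mag, r_km, c0=-1.0, c1=0.9, c2=-1.5, h=6.0):
    return c0 + c1 * mag + c2 * np.log(np.sqrt(r_km**2 + h**2))

# Hypothetical simulated PGAs with their magnitude/distance metadata.
mags = np.array([5.0, 6.0, 6.5])
dists = np.array([20.0, 40.0, 80.0])    # km
pga_sim = np.array([0.05, 0.08, 0.04])  # g, made-up values

residuals = np.log(pga_sim) - ln_pga_gmpe(mags, dists)
print("ln-residuals (simulation minus GMPE):", residuals)
```

Residual means and trends with magnitude and distance, binned over the whole database, would show immediately where the synthetics and the empirical models diverge.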
7. Point 2.1.5: durations
Please make this clarification about durations and lack of content in the main text as well.
8. Point 2.2.4:
This point was not answered; please address it.
Outside the reviewers' bullet points:
On lowering the duration from 20 s to 8 s: please include a phrase about why 8 s alone is a sufficient duration (perhaps based on the distance and magnitude combinations covered).
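As an example of the kind of justification I mean, here is a back-of-the-envelope check of whether an 8 s window can contain the S arrival plus some shaking, assuming the window starts at the origin time; the distances and the S-wave speed are illustrative:

```python
# Back-of-the-envelope duration check. Values are illustrative; the
# window is assumed to start at the event origin time.
vs_crust = 3.5    # km/s, generic crustal S-wave speed
window_s = 8.0    # record duration

for r_km in (10.0, 25.0, 50.0):
    t_s = r_km / vs_crust                # approximate S arrival time
    print(f"R = {r_km:4.0f} km: S arrives at ~{t_s:.1f} s, "
          f"{window_s - t_s:.1f} s of window left after it")
```

A short statement of this kind, for the magnitude-distance combinations actually sampled, would justify the 8 s choice directly.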