This work is distributed under the Creative Commons Attribution 4.0 License.
Synthetic ground motions in heterogeneous geologies: the HEMEW-3D dataset for scientific machine learning
Abstract. The ever-improving performance of physics-based simulations and the rapid development of deep learning are offering new perspectives on the study of earthquake-induced ground motion. Due to the large amount of data required to train deep neural networks, applications have so far been limited to recorded data or two-dimensional simulations. To bridge the gap between deep learning and high-fidelity numerical simulations, this work introduces a new database of physics-based earthquake simulations.
The HEMEW-3D database comprises 30,000 simulations of elastic wave propagation in three-dimensional (3D) geological domains. Each domain is parametrized by a different geological model built from a random arrangement of layers augmented by random fields that represent heterogeneities. For each simulation, ground motion is synthesized at the surface by a grid of virtual sensors. The high maximum frequency of the waveforms (f_max = 5 Hz) allows extensive analyses of surface ground motion.
Existing and foreseen applications range from statistical analyses of ground motion variability and machine learning methods on geological models, to deep learning-based predictions of ground motion conditioned on 3D heterogeneous geologies.
Status: closed

RC1: 'Comment on essd-2023-470', Anonymous Referee #1, 12 Mar 2024
General comments:
The datasets proposed in this paper consist of (1) 30,000 ‘geological’ models and (2) the corresponding virtual recordings at 256 surface stations for a double-couple point-source excitation at depth. The dataset is certainly unique in the sense that it corresponds to a parametric exploration of a large space with 3D numerical simulation tools in a realistic frequency range, and is therefore not the result of a routine process. My opinion is that some information should be added to favor the reuse of the data. Given that the data are already published on a repository, I do not know whether curation is still possible. In any case, my concerns about data reuse are listed in the paragraphs below.
Individual scientific questions/issues:

Generation of the background propagation media: to generate their bank of geological models, the authors consider 7 driving parameters (number of layers, thickness of layers, etc.), the values of which are chosen assuming a uniform probability distribution over predefined intervals. A major problem with this approach is that it assumes (without explicitly mentioning it) that the parameters are independent. The authors themselves admit that some velocity models may be unrealistic because, for example, the depth dependence of the average velocity is not constrained by any monotonicity condition, nor by other constraints such as the occurrence of low-velocity layers or the average vertical slowness. The same criticism can be made of the correlation lengths used to define the autocovariance function for the small-scale fluctuations: many realistic situations would call for similar horizontal correlation lengths and a much smaller (commonly by one order of magnitude) vertical correlation length. The authors argue that unrealistic velocity models can be discarded from the database. This is certainly possible for the average velocity depth dependence, which can be measured directly on the geological models, but how would a user proceed to discard unrealistic (or undesired) correlation lengths? It seems indeed that the values of the seven parameters are not present in the geological models. This information should be added as metadata to the geological models to favor the reuse of the synthetic seismograms. Another piece of information missing from the article is the number of random draws used for each parameter. Only the total number of models (30,000) is given. This number seems large, but how much would it shrink once unrealistic models have been discarded?
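To illustrate this point, a minimal sketch of such a sampling-and-filtering workflow is given below. The parameter ranges and the monotonicity filter are hypothetical placeholders, not the intervals or criteria actually used to build HEMEW-3D:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_layered_model(n_layers_range=(2, 6),
                       vs_range=(1000.0, 4000.0),
                       thickness_range=(200.0, 2000.0)):
    """Draw one layered model with independent uniform parameters.

    The ranges are illustrative placeholders, not the paper's intervals.
    """
    n_layers = rng.integers(*n_layers_range, endpoint=True)
    vs = rng.uniform(*vs_range, size=n_layers)          # mean S-wave velocity per layer
    thickness = rng.uniform(*thickness_range, size=n_layers)
    return vs, thickness

def is_monotonic(vs):
    """A simple plausibility filter: velocity non-decreasing with depth."""
    return bool(np.all(np.diff(vs) >= 0))

# Estimate how much of the bank a monotonicity filter would discard.
n_total, n_kept = 10_000, 0
for _ in range(n_total):
    vs, _ = draw_layered_model()
    n_kept += is_monotonic(vs)
print(f"kept {n_kept}/{n_total} models")
```

With independent uniform draws, only a small fraction of models survives such a filter, which quantifies the reviewer's concern that the effective database size after curation is unknown.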

Generation of the random heterogeneities: the generation of random fluctuations presented in the article poses two potential problems (for the reuse of the datasets). First, it seems that only one realization of the random fluctuations is considered. If true (please confirm by explicitly writing it), it means that ensemble averages would have to be replaced by spatial averages (over the 256 virtual sensors), assuming that ergodicity holds. Testing the ergodicity assumption would be quite difficult without information about the inter-sensor distances and some physical lengths that control the scattering regimes. This is in fact the second issue: it is difficult to guess the scattering regime (characterized by the values of the scattering and transport mean free paths) under consideration in the simulations, and this may prevent the reuse of the datasets. It would be useful to add the mean free path values for each layer as metadata. Without this information, the reuse of the datasets would certainly be limited.
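As an illustration of the anisotropy discussed above, the following sketch generates a 2-D random field with a Gaussian autocovariance and a horizontal correlation length ten times the vertical one. The covariance model and all numerical values are illustrative assumptions; the paper's actual autocovariance function may differ:

```python
import numpy as np

def gaussian_random_field_2d(n, dx, lx, lz, seed=0):
    """2-D zero-mean, unit-variance random field with a Gaussian
    autocovariance and anisotropic correlation lengths (lx, lz).

    Spectral (FFT) sketch; not the generator used for HEMEW-3D.
    """
    rng = np.random.default_rng(seed)
    kx = np.fft.fftfreq(n, d=dx) * 2 * np.pi
    kz = np.fft.fftfreq(n, d=dx) * 2 * np.pi
    KX, KZ = np.meshgrid(kx, kz, indexing="ij")
    # Power spectrum of a Gaussian covariance model
    psd = np.exp(-(KX**2 * lx**2 + KZ**2 * lz**2) / 4.0)
    noise = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    field = np.real(np.fft.ifft2(np.sqrt(psd) * noise))
    return (field - field.mean()) / field.std()  # normalize to unit variance

# Horizontal correlation length 10x the vertical one, as the reviewer suggests.
field = gaussian_random_field_2d(n=128, dx=100.0, lx=2000.0, lz=200.0)
print(field.shape)
```

Plotting such a field shows elongated horizontal laminations, quite different from the isotropic fluctuations obtained with lx = lz, which is why the correlation lengths must be recoverable from metadata.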
Technical corrections:

The choice of the sampling parameters (in space and time) for the synthetic datasets is rather surprising: on the one hand, the time series are sampled at 100 Hz, whereas the maximum simulated frequency is 5 Hz. This corresponds to a 10-fold additional cost in terms of storage. On the other hand, the spatial sampling is limited to the minimum possible wavelength (300 m), whereas it should be twice as fine (assuming that a wavelength is not sensitive to heterogeneities smaller than half its size).
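The storage argument follows from the Nyquist criterion: 5 Hz content only requires a sampling rate above 10 Hz, so the 100 Hz series could be decimated almost 10-fold with an anti-aliasing filter. A hypothetical sketch with SciPy, using a toy band-limited waveform:

```python
import numpy as np
from scipy.signal import decimate

fs = 100.0          # original sampling frequency (Hz)
t = np.arange(0.0, 8.0, 1.0 / fs)
# A toy waveform band-limited below 5 Hz (sum of 1-4.5 Hz sinusoids).
x = sum(np.sin(2 * np.pi * f * t) for f in (1.0, 2.5, 4.5))

# Decimating by 8 gives fs = 12.5 Hz, still above the 10 Hz Nyquist rate
# for 5 Hz content, and cuts storage by the same factor of 8.
x_dec = decimate(x, q=8)
print(len(x), "->", len(x_dec))
```

The decimated series preserves all frequency content below its new Nyquist frequency of 6.25 Hz, i.e. everything the simulation resolves.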

The content of the README.md file appears to be incorrect. It states that “The 300 velocity fields files amount to 2.96Tb. They are downloadable individually (9.8 Gb per file)”, but each velocity file weighs only (and fortunately) 1.1 GB. Please explain and correct.
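A quick arithmetic check (assuming the README's “Tb”/“Gb” are meant as terabytes and gigabytes) shows that the two README figures are mutually consistent, so the discrepancy lies between the stated and observed per-file sizes:

```python
# Sanity-check the README figures quoted above.
total_tb, n_files = 2.96, 300
per_file_gb = total_tb * 1000 / n_files
print(f"{per_file_gb:.2f} GB per file")
# ~9.87 GB per file, consistent with the README's "9.8 Gb per file",
# but an order of magnitude larger than the observed 1.1 GB per file.
```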

Give the value of N in section 3.3.2.
Citation: https://doi.org/10.5194/essd-2023-470-RC1

RC2: 'Comment on essd-2023-470', Anonymous Referee #2, 19 Mar 2024
Please see attached for detailed comments. This is a large and original effort worth discussing and presenting with more detail/clarity. The actual set of analyses may be impossible to enhance/change anymore, but please enhance the paper in view of the concerns, assumptions and limitations discussed in the report.

AC1: 'Comment on essd-2023-470', Fanny Lehmann, 07 May 2024
The authors thank the reviewers for their numerous and detailed comments. During the review process, the authors have proposed a new database, HEMEW^S-3D, that addresses several of the shortcomings highlighted by the reviewers. The revised manuscript therefore describes this updated database. The main differences between the initial HEMEW-3D and the revised HEMEW^S-3D databases are the following:
- the parameters of the geological models (mean value, thickness, coefficient of variation, and correlation lengths in each layer) are now provided as metadata;
- the source has a random position and orientation in HEMEW^S-3D;
- the spatial sampling of the velocity wavefields was increased from 16 x 16 sensors to 32 x 32 sensors, while the total duration was reduced from 20 s to 8 s to maintain reasonable memory requirements and to discard insignificant ground motion.
Please find detailed responses in the attached file.
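The combined effect of the sensor-grid and duration changes on per-simulation storage can be sketched as follows, assuming (since the revised sampling rate is not stated here) that the 100 Hz time sampling of the original database is kept:

```python
# Rough memory-ratio check for the HEMEW^S-3D changes described above,
# assuming the original 100 Hz time sampling is unchanged.
old = 16 * 16 * 20 * 100   # sensors x time samples (20 s at 100 Hz)
new = 32 * 32 * 8 * 100    # sensors x time samples (8 s at 100 Hz)
print(new / old)  # 4x more sensors, offset by the 2.5x shorter duration
```

Under this assumption the storage per simulation grows by a factor of 1.6, so the shortened duration only partly offsets the denser sensor grid.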
Data sets
Physics-based Simulations of 3D Wave Propagation: a Dataset for Scientific Machine Learning, Fanny Lehmann, https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/LAI6YU
Model code and software
HEMEW-3D, Fanny Lehmann, https://github.com/lehmannfa/HEMEW3D