This work is distributed under the Creative Commons Attribution 4.0 License.
Synthetic ground motions in heterogeneous geologies: the HEMEW-3D dataset for scientific machine learning
Abstract. The ever-improving performance of physics-based simulations and the rapid development of deep learning are offering new perspectives on the study of earthquake-induced ground motion. Due to the large amount of data required to train deep neural networks, applications have so far been limited to recorded data or two-dimensional simulations. To bridge the gap between deep learning and high-fidelity numerical simulations, this work introduces a new database of physics-based earthquake simulations.
The HEMEW-3D database comprises 30,000 simulations of elastic wave propagation in three-dimensional (3D) geological domains. Each domain is parametrized by a different geological model built from a random arrangement of layers augmented by random fields that represent heterogeneities. For each simulation, ground motion is synthesized at the surface by a grid of virtual sensors. The high maximum frequency of the waveforms (f_max = 5 Hz) allows extensive analyses of surface ground motion.
Existing and foreseen applications range from statistical analyses of ground motion variability and machine learning methods on geological models, to deep learning-based predictions of ground motion conditioned on 3D heterogeneous geologies.
Status: closed

RC1: 'Comment on essd-2023-470', Anonymous Referee #1, 12 Mar 2024
General comments:
The datasets proposed in this paper consist of (1) 30,000 ‘geological’ models and (2) the corresponding virtual recordings at 256 surface stations for a double-couple point-source excitation at depth. The dataset is certainly unique in the sense that it corresponds to a parametric exploration of a large space with 3D numerical simulation tools in a realistic frequency range, and is therefore not the result of a routine process. My opinion is that some information should be added to favor the reuse of the data. Given that the data are already published on a repository, I do not know whether curation is still possible. In any case, my concerns about data reuse are listed in the paragraphs below.
Individual scientific questions/issues:

Generation of the background propagation media: to generate their bank of geological models, the authors consider 7 driving parameters (number of layers, thickness of layers, etc.), the values of which are chosen assuming a uniform probability distribution over predefined intervals. A major problem with this approach is that it assumes (without explicitly mentioning it) that the parameters are independent. The authors themselves admit that some velocity models may be unrealistic because, for example, the depth dependence of the average velocity is not constrained by any monotonicity condition, nor by other constraints such as the occurrence of low-velocity layers or the average vertical slowness. The same criticism can be made of the correlation lengths used to define the autocovariance function for the small-scale fluctuations: many realistic situations would call for similar horizontal correlation lengths and a much smaller (commonly by one order of magnitude) vertical correlation length. The authors argue that unrealistic velocity models can be discarded from the database. This is certainly possible for the average velocity depth dependence, which can be measured directly on the geological models, but how would a user proceed to discard unrealistic (or undesired) correlation lengths? It seems indeed that the values of the seven parameters are not present in the geological models. This information should be added as metadata to the geological models to favor the reuse of the synthetic seismograms. Another piece of information missing from the article is the number of random draws used for each parameter. Only the total number of models (30,000) is given. This number seems large, but how much would it shrink once unrealistic models have been discarded?
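To illustrate this point, a minimal sketch of such a sampling-and-filtering workflow is given below. The parameter ranges and the monotonicity filter are hypothetical placeholders, not the intervals or criteria actually used to build HEMEW-3D:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_layered_model(n_layers_range=(2, 6),
                       vs_range=(1000.0, 4000.0),
                       thickness_range=(200.0, 2000.0)):
    """Draw one layered model with independent uniform parameters.

    The ranges are illustrative placeholders, not the paper's intervals.
    """
    n_layers = rng.integers(*n_layers_range, endpoint=True)
    vs = rng.uniform(*vs_range, size=n_layers)          # mean S-wave velocity per layer
    thickness = rng.uniform(*thickness_range, size=n_layers)
    return vs, thickness

def is_monotonic(vs):
    """A simple plausibility filter: velocity non-decreasing with depth."""
    return bool(np.all(np.diff(vs) >= 0))

# Estimate how much of the bank a monotonicity filter would discard.
n_total, n_kept = 10_000, 0
for _ in range(n_total):
    vs, _ = draw_layered_model()
    n_kept += is_monotonic(vs)
print(f"kept {n_kept}/{n_total} models")
```

With independent uniform draws, only a small fraction of models survives such a filter, which quantifies the reviewer's concern that the effective database size after curation is unknown.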

Generation of the random heterogeneities: the generation of random fluctuations presented in the article poses two potential problems (for the reuse of the datasets). First, it seems that only one realization of the random fluctuations is considered. If true (please confirm by explicitly writing it), it means that ensemble averages would have to be replaced by spatial averages (over the 256 virtual sensors), assuming that ergodicity holds. Testing the ergodicity assumption would be quite difficult without information about the inter-sensor distances and some physical lengths that control the scattering regimes. This is in fact the second issue: it is difficult to guess the scattering regime (characterized by the values of the scattering and transport mean free paths) under consideration in the simulations, and this may prevent the reuse of the datasets. It would be useful to add the mean free path values for each layer as metadata. Without this information, the reuse of the datasets would certainly be limited.
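As an illustration of the anisotropy discussed above, the following sketch generates a 2-D random field with a Gaussian autocovariance and a horizontal correlation length ten times the vertical one. The covariance model and all numerical values are illustrative assumptions; the paper's actual autocovariance function may differ:

```python
import numpy as np

def gaussian_random_field_2d(n, dx, lx, lz, seed=0):
    """2-D zero-mean, unit-variance random field with a Gaussian
    autocovariance and anisotropic correlation lengths (lx, lz).

    Spectral (FFT) sketch; not the generator used for HEMEW-3D.
    """
    rng = np.random.default_rng(seed)
    kx = np.fft.fftfreq(n, d=dx) * 2 * np.pi
    kz = np.fft.fftfreq(n, d=dx) * 2 * np.pi
    KX, KZ = np.meshgrid(kx, kz, indexing="ij")
    # Power spectrum of a Gaussian covariance model
    psd = np.exp(-(KX**2 * lx**2 + KZ**2 * lz**2) / 4.0)
    noise = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    field = np.real(np.fft.ifft2(np.sqrt(psd) * noise))
    return (field - field.mean()) / field.std()  # normalize to unit variance

# Horizontal correlation length 10x the vertical one, as the reviewer suggests.
field = gaussian_random_field_2d(n=128, dx=100.0, lx=2000.0, lz=200.0)
print(field.shape)
```

Plotting such a field shows elongated horizontal laminations, quite different from the isotropic fluctuations obtained with lx = lz, which is why the correlation lengths must be recoverable from metadata.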
Technical corrections:

The choice of the sampling parameters (in space and time) for the synthetic datasets is rather surprising: on the one hand, the time series are sampled at 100 Hz, whereas the maximum simulated frequency is 5 Hz. This corresponds to a 10-fold additional cost in terms of storage. On the other hand, the spatial sampling is limited to the minimum possible wavelength (300 m), whereas it should be twice as fine (assuming that a wavelength is not sensitive to heterogeneities smaller than half its size).
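The storage argument follows from the Nyquist criterion: 5 Hz content only requires a sampling rate above 10 Hz, so the 100 Hz series could be decimated almost 10-fold with an anti-aliasing filter. A hypothetical sketch with SciPy, using a toy band-limited waveform:

```python
import numpy as np
from scipy.signal import decimate

fs = 100.0          # original sampling frequency (Hz)
t = np.arange(0.0, 8.0, 1.0 / fs)
# A toy waveform band-limited below 5 Hz (sum of 1-4.5 Hz sinusoids).
x = sum(np.sin(2 * np.pi * f * t) for f in (1.0, 2.5, 4.5))

# Decimating by 8 gives fs = 12.5 Hz, still above the 10 Hz Nyquist rate
# for 5 Hz content, and cuts storage by the same factor of 8.
x_dec = decimate(x, q=8)
print(len(x), "->", len(x_dec))
```

The decimated series preserves all frequency content below its new Nyquist frequency of 6.25 Hz, i.e. everything the simulation resolves.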

The content of the README.md file appears to be incorrect. It states that “The 300 velocity fields files amount to 2.96Tb. They are downloadable individually (9.8 Gb per file)”, but each velocity file weighs only (and fortunately) 1.1 GB. Please explain and correct.
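A quick arithmetic check (assuming the README's “Tb”/“Gb” are meant as terabytes and gigabytes) shows that the two README figures are mutually consistent, so the discrepancy lies between the stated and observed per-file sizes:

```python
# Sanity-check the README figures quoted above.
total_tb, n_files = 2.96, 300
per_file_gb = total_tb * 1000 / n_files
print(f"{per_file_gb:.2f} GB per file")
# ~9.87 GB per file, consistent with the README's "9.8 Gb per file",
# but an order of magnitude larger than the observed 1.1 GB per file.
```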

Give the value of N in section 3.3.2.
Citation: https://doi.org/10.5194/essd-2023-470-RC1

RC2: 'Comment on essd-2023-470', Anonymous Referee #2, 19 Mar 2024
Please see attached for detailed comments. This is a large and original effort worth discussing and presenting with more detail/clarity. The actual set of analyses may be impossible to enhance/change anymore, but please enhance the paper in view of the concerns, assumptions and limitations discussed in the report.

AC1: 'Comment on essd-2023-470', Fanny Lehmann, 07 May 2024
The authors thank the reviewers for their numerous and detailed comments. During the review process, the authors have proposed a new database, HEMEW^S-3D, that addresses several of the shortcomings highlighted by the reviewers. The revised manuscript therefore describes this updated database. The main differences between the initial HEMEW-3D and the revised HEMEW^S-3D databases are the following:
- the parameters of the geological models (mean value, thickness, coefficient of variation, and correlation lengths in each layer) are now provided as metadata;
- the source has a random position and orientation in HEMEW^S-3D;
- the spatial sampling of the velocity wavefields was increased from 16 x 16 sensors to 32 x 32 sensors, while the total duration was reduced from 20 s to 8 s to maintain reasonable memory requirements and to discard insignificant ground motion.
Please find detailed responses in the attached file.
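The combined effect of the sensor-grid and duration changes on per-simulation storage can be sketched as follows, assuming (since the revised sampling rate is not stated here) that the 100 Hz time sampling of the original database is kept:

```python
# Rough memory-ratio check for the HEMEW^S-3D changes described above,
# assuming the original 100 Hz time sampling is unchanged.
old = 16 * 16 * 20 * 100   # sensors x time samples (20 s at 100 Hz)
new = 32 * 32 * 8 * 100    # sensors x time samples (8 s at 100 Hz)
print(new / old)  # 4x more sensors, offset by the 2.5x shorter duration
```

Under this assumption the storage per simulation grows by a factor of 1.6, so the shortened duration only partly offsets the denser sensor grid.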
Data sets
Physics-based Simulations of 3D Wave Propagation: a Dataset for Scientific Machine Learning, Fanny Lehmann, https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/LAI6YU
Model code and software
HEMEW-3D, Fanny Lehmann, https://github.com/lehmannfa/HEMEW3D