The ever-improving performance of physics-based simulations and the rapid development of deep learning offer new perspectives for studying earthquake-induced ground motion. Due to the large amount of data required to train deep neural networks, applications have so far been limited to recorded data or two-dimensional (2D) simulations. To bridge the gap between deep learning and high-fidelity numerical simulations, this work introduces a new database of physics-based earthquake simulations.

The HEterogeneous Materials and Elastic Waves with Source variability in 3D (HEMEW

Existing and foreseen applications range from statistical analyses of the ground motion variability and machine learning methods on geological models to deep-learning-based predictions of ground motion that depend on 3D heterogeneous geologies and source properties. Data are available at

Deep learning has a long tradition in seismology thanks to large networks of sensors recording earthquakes worldwide. Applications are extremely diverse in terms of methods, data, and scientific goals (see, for example,

However, all those methods rely on databases of seismic waveforms. While there exist several curated databases of recorded ground motion, they are sparse in regions with low to moderate seismicity or poor instrumental coverage

In fact, physics-based simulations show several limitations. Firstly, they require a detailed description of the ground properties that define the physical behaviour of the waves propagating in the Earth. In particular, ground properties should be given in the form of 3D geological models since 3D features have crucial effects that are not accounted for in 2D settings (e.g. sedimentary basins leading to site effects)

Quantifying the effects of 3D geological features is made more difficult by the second limitation of physics-based simulations, which is their high computational cost, especially when dealing with high frequencies and large spatial domains. Despite relying on high-performance computing (HPC) frameworks, seismic wave propagation simulations can reach tens to hundreds of thousands of equivalent core hours

When predicting the surface ground motion generated by an earthquake, it is important to obtain time series that describe the temporal evolution of shaking and not only scalar features (such as peak ground acceleration and cumulative absolute velocity) that give useful but limited information. Physics-informed neural networks (PINNs;

The recent emergence of scientific machine learning (SciML) offers a new paradigm for the prediction of physics-based ground motion parametrized by 3D ground properties and source parameters, with an intrinsic ability to generalize to various resolutions and geological configurations. SciML has led to significant scientific developments in communities with large, reliable, and freely available databases. For instance, in numerical weather prediction,

In this work, we describe the first open database of seismic simulations associated with 3D heterogeneous geological models. The HEMEW

In the following text, Sect.

Summary of datasets providing geological models and seismic wavefields. Domain: size of the physical domain with the number of grid points given in parenthesis, in the order width, depth for 2D datasets and width, length, depth for 3D datasets. The dimension of seismic wavefields is given in the format receivers along width, time steps for 2D datasets and receivers along width, receivers along length, time steps for 3D datasets.

Datasets of recorded ground motion have enabled major deep learning applications in seismology, but they have several limitations in data-scarce regions. In this section, we focus on datasets with 2D or 3D data used in geophysics and seismology with SciML applications. Due to the mathematical similarities between wave propagation and fluid flow (both are governed by hyperbolic equations), related studies are reviewed beyond the field of seismology. This highlights the challenges of high-fidelity numerical simulations for deep learning applications.

Ground motion simulations of past earthquakes have been collected in databases for model verification, characterization of complex near-field conditions, and machine learning purposes. The BB-SPEEDset provides 3D simulations of 16 earthquake scenarios in various regions of the globe

Due to the high computational costs of solving 3D partial differential equations (PDEs), only very few 3D datasets are available. CO

A few datasets of realistic geological units have been developed, such as the Noddyverse dataset of 3D geological models

Several other studies have computed simulation outputs for the acoustic or elastic wave equation, but their data are not public (e.g.

Elastodynamics describes reversible wave propagation phenomena in solid and fluid domains. In solid mechanics, the solution is represented by a displacement field,

In our database, the forcing term

The source position is represented by the coordinates

In addition to the source position, the source is parametrized by the symmetric

The source amplitude corresponds to a seismic moment,

First, any magnitude can be obtained by applying a scalar factor to the ground motion wavefields. Second, the response to any source time function can be computed from the Green function,

Computing the solution,

compute the Fourier transform of the reference source time function,

derive the Green function in the frequency domain,

compute the Fourier transform of the new source time function,

compute the new solution in the frequency domain,

deduce the new solution in the temporal domain,
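The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function name `change_source_time_function`, the array layout (time on the last axis), and the water-level regularization `eps` that stabilizes the spectral division are assumptions introduced here. Both source time functions are assumed to be sampled on the same time axis as the wavefield, and the FFT implies a circular convolution, so signals should decay to zero within the time window (or be zero-padded beforehand).

```python
import numpy as np


def change_source_time_function(u_ref, s_ref, s_new, eps=1e-8):
    """Return the wavefield for a new source time function (illustrative sketch).

    u_ref : array (..., nt), reference solution with time on the last axis
    s_ref : array (nt,), reference source time function
    s_new : array (nt,), new source time function
    eps   : water level relative to max|S_ref|^2 (assumption, not from the paper)
    """
    nt = u_ref.shape[-1]
    # Step 1: Fourier transform of the reference source time function
    S_ref = np.fft.rfft(s_ref, n=nt)
    # Step 2: Green function in the frequency domain, G = U_ref / S_ref,
    # written as a regularized division to avoid blow-up where S_ref ~ 0
    U_ref = np.fft.rfft(u_ref, axis=-1)
    denom = np.abs(S_ref) ** 2 + eps * np.max(np.abs(S_ref)) ** 2
    G = U_ref * np.conj(S_ref) / denom
    # Step 3: Fourier transform of the new source time function
    S_new = np.fft.rfft(s_new, n=nt)
    # Step 4: new solution in the frequency domain
    U_new = G * S_new
    # Step 5: back to the time domain
    return np.fft.irfft(U_new, n=nt, axis=-1)
```

Frequencies at which the reference source time function carries almost no energy cannot be recovered by this division, so the new source time function should not contain more high-frequency content than the reference one.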

From these remarks, one should remember that ground motion wavefields in the HEMEW

The HEMEW

Geological models are built by adding heterogeneities to randomly chosen horizontal layers. Elastic waves are then propagated from a source with a random position and random orientation to the surface, where velocity wavefields are synthesized.

A homogeneous layer that is 1.8

Statistical distribution of each parameter describing the geological models. Mean

To recover the other geological properties, the ratio of P- to S-wave velocity was fixed to

The layers' thicknesses and mean values describe the general structure of the propagation domain and they correspond to the prior physical information usually available. However, geomaterials of the Earth's crust contain a lot of variability, especially along the horizontal directions. This heterogeneity can be represented by random fields, characterized by their correlation length and coefficient of variation. Following previous studies on geological heterogeneity (e.g.

Choosing the correlation lengths and coefficients of variation is delicate yet crucial for ensuring sufficient variability in the dataset

The 3D random field computation is made highly efficient by the use of the spectral representation

Finally,

It should be noted that all layers have distinct coefficients of variation and correlation lengths, meaning that different random fields are drawn inside each layer. Also, random fields are drawn only once for each set of parameters.
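To make the idea of a random field controlled by a correlation length and a coefficient of variation concrete, the sketch below filters white noise in the Fourier domain. It is only an illustration under stated assumptions: a Gaussian correlation function is swapped in for simplicity (the correlation model of the database is the one specified in the text), the field is periodic because of the FFT, and the function name `random_field_3d`, its arguments, and the multiplicative form `mean * (1 + cov * field)` are hypothetical choices made here.

```python
import numpy as np


def random_field_3d(shape, spacing, corr_length, cov, mean, seed=0):
    """Draw a 3D heterogeneous property field (illustrative sketch).

    shape       : (nx, ny, nz) number of grid points
    spacing     : grid step, same unit as corr_length
    corr_length : correlation length of the assumed Gaussian correlation
    cov         : coefficient of variation (standard deviation / mean)
    mean        : mean value of the property inside the layer
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(shape)
    # Angular wavenumbers along each axis
    kx, ky, kz = (2.0 * np.pi * np.fft.fftfreq(n, d=spacing) for n in shape)
    k2 = (kx[:, None, None] ** 2
          + ky[None, :, None] ** 2
          + kz[None, None, :] ** 2)
    # exp(-r^2 / l^2) has a power spectrum proportional to exp(-k^2 l^2 / 4);
    # the amplitude filter is its square root
    amplitude = np.exp(-k2 * corr_length ** 2 / 8.0)
    field = np.fft.ifftn(np.fft.fftn(white) * amplitude).real
    # Normalize empirically to zero mean and unit standard deviation
    field = (field - field.mean()) / field.std()
    return mean * (1.0 + cov * field)
```

Drawing one such field per layer, each with its own `corr_length` and `cov`, mirrors the statement that different random fields are drawn inside each layer.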

Geological realization,

The elastic wave equation was solved in each domain

To maintain reasonable computational loads and reflect realistic scenarios, velocity fields were recorded only at the surface of the propagation domain. A regular grid of

Three-component velocity waveforms synthesized at eight virtual sensors on a line parallel to the

Velocity fields are provided as

Files are gathered in

Since most of the geological parameters are chosen uniformly randomly (Table

The first-wave arrival time is a crucial parameter for earthquake early warning, and seismic phase picking is a common task with deep learning models. Arrival time depends on the distance between the earthquake source and the monitoring sensor as well as the geological properties on the propagation path. Wave arrival times are usually determined from recordings, either manually by experts or using machine learning methods. However, it is possible to compute almost exact arrival times from synthetic velocity fields since ground motion is almost zero before the first wave arrival. Therefore, we obtained wave arrival times for P-waves as the earliest time step where the amplitude exceeds 0.1 % of the maximum amplitude. Due to the source depth variability and the different wave velocities in the geological models, first wave arrival times significantly vary among samples and among sensors. Figure

Distributions of the temporal features of velocity time series at each monitoring sensor and for 30 000 samples.
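The threshold rule used for the P-wave arrival time (earliest time step at which the amplitude exceeds 0.1 % of the maximum amplitude) can be sketched as follows. The function name `first_arrival_time` and the choice of taking the amplitude as the largest absolute value over the available components are assumptions made for this illustration; the text does not specify how components are combined.

```python
import numpy as np


def first_arrival_time(velocity, dt, threshold=1e-3):
    """Estimate the first-arrival time at one sensor (illustrative sketch).

    velocity  : array (n_components, nt) or (nt,), synthetic velocity traces
    dt        : time step in seconds
    threshold : fraction of the maximum amplitude (0.1 % by default)
    """
    # Amplitude at each time step: max |v| over components (assumption)
    amplitude = np.abs(np.atleast_2d(velocity)).max(axis=0)
    # argmax on a boolean array returns the first True index;
    # an all-zero trace would return 0 and should be handled by the caller
    first_index = np.argmax(amplitude > threshold * amplitude.max())
    return first_index * dt
```

Such a simple rule works on synthetic data because the signal is almost exactly zero before the first arrival; it would not be robust on noisy recordings.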

As expected, the P-wave arrival time is strongly correlated with the hypocentral distance (Fig.

For each sample and each sensor, the P-wave arrival time is shown against

The temporal evolution of ground motion can also be characterized by its relative significant duration (RSD). It corresponds to the duration of the signal between 5 % and 95 % of the Arias intensity (
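As a rough illustration of the 5 % to 95 % rule, the sketch below computes the duration from the normalized cumulative energy of a single trace. The function name `relative_significant_duration` is hypothetical, and the constant prefactor of the Arias intensity is omitted because it cancels in the normalization. Whether the input should be an acceleration trace (the usual Arias definition) or another quantity follows the definition given in the text, not this sketch.

```python
import numpy as np


def relative_significant_duration(signal, dt, lower=0.05, upper=0.95):
    """Duration between two fractions of the cumulative energy (sketch).

    signal : array (nt,), one component of ground motion at one sensor
    dt     : time step in seconds
    """
    signal = np.asarray(signal, dtype=float)
    # Cumulative integral of the squared signal, normalized to end at 1
    energy = np.cumsum(signal ** 2) * dt
    energy = energy / energy[-1]
    # First time steps at which the normalized energy reaches each fraction
    t_lower = np.searchsorted(energy, lower) * dt
    t_upper = np.searchsorted(energy, upper) * dt
    return t_upper - t_lower
```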

Quantitatively, the HEMEW

The PGV is computed as the maximum absolute value over all time steps separately for each component. The PGV is slightly lower on the vertical component than the two horizontal components (Fig.

The peak ground velocity (PGV) is computed as the maximum absolute value over all time steps separately on each component. There is one value for each of the

When the propagation path is longer, seismic waves encounter more geological heterogeneities. These heterogeneities disperse and diffract the waves, which spreads the signal energy over time. Larger hypocentral distances are associated with longer propagation paths. Figure

For each sample and each sensor, the PGV is shown against

It is also known that the seismic energy,

The pseudo-spectral acceleration (PSA) is a commonly used metric to estimate structural response. It evaluates the maximal acceleration of a 1-degree-of-freedom oscillator (with a 5 % damping), with a natural period,

For each sample and each sensor, the pseudo-spectral acceleration is shown against the hypocentral distance at period

GMMs provide an analytical formula to compute intensity measures, such as PSA and PGV, based on regression analyses. They are mainly derived from databases of recorded earthquakes, although numerical simulations can also be used. The PSA estimated from the HEMEW

GMM from

GMM from

GMM from

GMM from

Horizontal PSA at period

Figure

In supervised deep learning, it is always challenging to determine whether the size of the database (i.e. the number of samples) is sufficient to represent its variability. This question relates to the definition of the intrinsic dimension of the dataset, which indicates the number of hidden variables that should be necessary to represent the main features of the samples. In the following sections, we provide insights into this question, with the intrinsic dimension based on the principal component analysis (Sect.

The principal component analysis (PCA) decomposes data in principal components that correspond to the directions where data vary the most. For different sizes of datasets, we compute the number of principal components required to retain 95 % of variance and define this number as the intrinsic dimension of data. The 3D geological models and the 3D ground motion wavefields are transformed into 1D vectors to perform the PCA. To reduce the memory requirements, ground motions are analysed only on the east–west component. Geological models are represented by

Table

Database intrinsic dimension estimated by PCA, correlation dimension, and maximum likelihood estimator (MLE) for the geological database and the velocity fields database, depending on the number of data samples.
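The PCA-based estimate (number of principal components needed to retain 95 % of the variance after flattening each sample into a 1D vector) can be sketched with a singular value decomposition. This is an illustrative version with a hypothetical function name, `pca_intrinsic_dimension`; it holds the full data matrix in memory, which would not scale to the complete database without an incremental or randomized PCA.

```python
import numpy as np


def pca_intrinsic_dimension(samples, variance=0.95):
    """Number of principal components retaining `variance` (sketch).

    samples : array (n_samples, ...), e.g. 3D geological models;
              each sample is flattened into a 1D vector
    """
    data = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    data = data - data.mean(axis=0)  # centre each feature
    # Squared singular values are proportional to the explained variances
    singular_values = np.linalg.svd(data, compute_uv=False)
    explained = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    # Smallest number of components whose cumulative variance reaches the target
    return int(np.searchsorted(explained, variance) + 1)
```

Because the number of non-zero singular values is bounded by the number of samples, this estimate can only grow until the sample count exceeds the true dimension, which is one reason to evaluate it for several dataset sizes.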

An alternative dimensionality measure was introduced by

Figure

It can also be noted that the intrinsic dimension increases with the number of samples, as was observed for the PCA. This may reflect a flaw in the intrinsic dimension's definition, or it may indicate that despite being already large, our database of 30 000 samples does not capture all the variability.
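For readers unfamiliar with the correlation dimension listed among the estimators, a minimal sketch is given below. It uses the standard correlation integral, i.e. the fraction of sample pairs closer than a radius r in Euclidean distance, and fits the slope of log C(r) against log r. The function names are hypothetical, the range of radii that defines the "linear part" is left to the user, and the dense pairwise-distance computation is only practical for small sample counts.

```python
import numpy as np


def correlation_integral(points, radii):
    """Fraction of distinct pairs with Euclidean distance below each radius."""
    points = np.asarray(points, dtype=float)
    # Dense pairwise distances: O(n^2 * d) memory, small n only
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pair_dist = dist[np.triu_indices(len(points), k=1)]
    return np.array([(pair_dist < r).mean() for r in radii])


def correlation_dimension(points, radii):
    """Slope of log C(r) versus log r over the supplied radii (sketch)."""
    radii = np.asarray(radii, dtype=float)
    c = correlation_integral(points, radii)
    keep = c > 0  # log is undefined where no pair falls within r
    slope, _ = np.polyfit(np.log(radii[keep]), np.log(c[keep]), 1)
    return slope
```

The estimate depends on the chosen radii: too small and few pairs are counted, too large and boundary effects flatten the slope.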

The correlation dimension is computed from the Euclidean distance between pairs of geological models. However, pointwise metrics do not necessarily best represent similarities between geological models, and alternative metrics such as the structural similarity index measure (SSIM) have been introduced for this purpose

Figure

The structural similarity index measure (SSIM) quantifies the visual resemblance between images, in a way that should mimic human perception. For each SSIM value

To give insights into the sparsity of the geological database, Fig.

Dimensionality analyses have shown that at least 1000 principal components are necessary to represent geological models with enough accuracy, as measured by the reconstructed variance. This means that the PCA provides a basis onto which a wide diversity of 3D geological models can be decomposed. One can consider geological models that are very different from the random fields contained in the HEMEW

The original geological model

Figure

The influence of the PCA reconstruction on the generated velocity wavefields was investigated in more detail in

Since the HEMEW

For two spatial points, the velocity field predicted by the MIFNO (dashed red line) is compared with the reference from the HEMEW

For each geological model and source in the HEMEW

Thanks to the large number of simulations, one can also envision studying the variability in ground motion to capture its statistical distribution. In particular, one can investigate the best sampling that minimizes the number of samples while preserving the largest ground motion variability

The surface ground motion can also be considered an “outcropping bedrock” response, which is classically used in 1D site–effect and soil–structure interaction analyses and which may require deconvolution.

Since HEMEW

First, the minimum S-wave velocity of 1071 m s

Second, the maximum S-wave velocity of 4500 m s

Third, we do not constrain the ordering of layerwise

Additionally, more diverse configurations could be designed by relaxing the assumption that all geological parameters depend on a single variable. This would imply, for instance, varying the

The domain size was limited to 9.6

consider larger and deeper geological models,

design models with a higher spatial resolution,

include lower minimum

increase the frequency limit of the wave propagation simulations,

increase the spatial sampling of virtual sensors,

increase the temporal duration of signals to match the longer epicentral distances coming from larger models.

It should also be noted that numerical simulations are only valid for a frequency of up to 5

The database is referred to as

We presented the HEMEW

Geological models are built from horizontal layers that are randomly arranged, and they correspond to the velocity of shear waves (

Seismic waves propagate numerically from the earthquake source to the surface. Pointwise sources have a random position and orientation. Ground motion wavefields are synthesized at the surface of the propagation domain by a grid of

Ground motion characteristics differ strongly between samples. They were analysed in terms of relative significant duration (RSD), P-wave arrival time, peak ground velocity (PGV), and pseudo-spectral acceleration (PSA). In addition to quantifying the distributions of essential intensity measures in seismology, these analyses confirm expected relationships between physical parameters and ground motion characteristics. In particular, hypocentral distance,

Concerning the velocity wavefields, the PCA and the MLE confirm the intuition that the intrinsic dimension is larger than the geological dimension since the source adds variability to the arrival time of wavefields as well as their location at the surface. In this situation, it is reasonable to consider that the intrinsic dimension of ground motion is at least on the order of 100. However, if data are decomposed with the PCA, then the number of principal components is a few thousand. The correlation dimension yields questionable estimates of the intrinsic dimension that contradict our intuition and the PCA and MLE outcomes.

By providing a large number of physics-based simulations, the HEMEW

The seismic moment function in the HEMEW

Distribution of impedance contrasts in the HEMEW

Frequency spectra corresponding to Fig.

The intrinsic dimension based on the principal component analysis (PCA) components has been evaluated with the

For comparison purposes, the wavefield intrinsic dimension is also computed for a previous version of the database, where the source has a fixed position and orientation (HEMEW-3D database,

Number of principal components (

The correlation dimension is determined as the slope of the linear part in the log–log representation of

The intrinsic dimension based on the maximum likelihood estimator (MLE) has been computed with the

Correlation dimension (

The correlation dimension,

Figure

Intrinsic dimension estimated by the MLE (

Two pairs of geological models with a high SSIM of 0.6.

The Multiple-Input Fourier Neural Operator (MIFNO) architecture is shown in Fig.

The MIFNO is made of a geology branch that encodes the geology with factorized Fourier (F-Fourier) layers and a source branch that transforms the vector of source parameters

FL, FG, and DC designed the study. FL conducted the analyses. MB and DC conceived the original idea. FL wrote the paper with input from all authors.

The contact author has declared that none of the authors has any competing interests.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

We used the code from

This paper was edited by Andrea Rovida and reviewed by two anonymous referees.