the Creative Commons Attribution 4.0 License.
the Creative Commons Attribution 4.0 License.
Synthetic ground motions in heterogeneous geologies from various sources: the HEMEW^{S}3D database
Fanny Lehmann
Filippo Gatti
Michaël Bertin
Didier Clouteau
The everimproving performances of physicsbased simulations and the rapid developments of deep learning are offering new perspectives to study earthquakeinduced ground motion. Due to the large amount of data required to train deep neural networks, applications have so far been limited to recorded data or twodimensional (2D) simulations. To bridge the gap between deep learning and highfidelity numerical simulations, this work introduces a new database of physicsbased earthquake simulations.
The HEterogeneous Materials and Elastic Waves with Source variability in 3D (HEMEW^{S}3D) database comprises 30 000 simulations of elastic wave propagation in 3D geological domains. Each domain is parametrized by a different geological model built from a random arrangement of layers augmented by random fields that represent heterogeneities. Elastic waves originate from a randomly located pointwise source parametrized by a random moment tensor. For each simulation, ground motion is synthesized at the surface by a grid of virtual sensors. The high frequency of waveforms (${f}_{max}=\mathrm{5}$ Hz) allows for extensive analyses of surface ground motion.
Existing and foreseen applications range from statistical analyses of the ground motion variability and machine learning methods on geological models to deeplearningbased predictions of ground motion that depend on 3D heterogeneous geologies and source properties. Data are available at https://doi.org/10.57745/LAI6YU (Lehmann, 2023).
 Article
(6227 KB)  Fulltext XML
 BibTeX
 EndNote
Deep learning has a long tradition in seismology thanks to large networks of sensors recording earthquakes worldwide. Applications are extremely diverse in terms of methods, data, and scientific goals (see, for example, Mousavi and Beroza, 2023, for a review). Detecting earthquakes and distinguishing them from other events such as explosions, quarry blasts, or seismic noise are the most common applications of deep learning in seismology (Mousavi and Beroza, 2023). A wide variety of methods is also devoted to characterizing earthquakes from ground motion recordings – for instance, to estimate source mechanisms, earthquake location, and magnitude. The rapid improvements in deep learning in the last few years have even enabled its use in operational frameworks, thereby providing realtime predictions of earthquake parameters (Zhu et al., 2022).
However, all those methods rely on databases of seismic waveforms. While there exist several curated databases of recorded ground motion, they are sparse in regions with low to moderate seismicity or poor instrumental coverage (Bahrampouri et al., 2021; Michelini et al., 2021; Mousavi et al., 2019). In those cases, numerical simulations are a great opportunity to complement existing databases. Simulations rely on computational schemes to solve the wave propagation equations from the earthquake's source to the Earth's surface and provide synthetic waveforms at any spatial point of the simulation domain. Results of threedimensional (3D) physicsbased simulations have been compiled for several past earthquakes in, for example, the BBSPEEDset dataset (Paolucci et al., 2021) and the Southern California Earthquake Center (SCEC) Broadband Platform (Maechling et al., 2015), but the number of simulations is not appropriate for machine learning approaches.
In fact, physicsbased simulations show several limitations. Firstly, they require a detailed description of the ground properties that define the physical behaviour of the waves propagating in the Earth. Especially, ground properties should be given in the form of 3D geological models since 3D features have crucial effects that are not accounted for in 2D settings (e.g. sedimentary basins leading to site effects) (Moczo et al., 2018; Smerzini et al., 2011; Zhu et al., 2020). Since extensive geophysical investigations are needed to obtain 3D geological models, they are rare, and when they exist, they are still limited by epistemic uncertainties. Therefore, when trying to reproduce an earthquake with physicsbased numerical simulations, uncertainties can be represented by random heterogeneities added to the reference model to introduce variability (Chaljub et al., 2021; Lehmann et al., 2022).
Quantifying the effects of 3D geological features is made more difficult by the second limitation of physicsbased simulations, which is their high computational cost, especially when dealing with high frequencies and large spatial domains. Despite relying on highperformance computing (HPC) frameworks, seismic waves propagation simulations can reach tens to hundreds of thousands of equivalent core hours (Fu et al., 2017; Heinecke et al., 2014; Poursartip et al., 2020). Since computational costs prevent statistical studies on synthetic waveforms due to a limited number of simulations per geological model, deep learning represents a promising alternative to obtain waveforms.
When predicting the surface ground motion generated by an earthquake, it is important to obtain time series that describe the temporal evolution of shaking and not only scalar features (such as peak ground acceleration and cumulative absolute velocity) that give useful but limited information. Physicsinformed neural networks (PINNs; Raissi et al., 2019) successfully solved the wave equation (Ding et al., 2023; Karimpouli and Tahmasebi, 2020; Moseley et al., 2020; RashtBehesht et al., 2022; Ren et al., 2024; Song et al., 2023; Wu et al., 2023). However, applications are mainly limited to 2D domains, and models cannot extrapolate to different geological configuration to the one used in the training phase. Alternatively, generative methods have been used to enhance existing numerical simulations by increasing their spatial resolution (e.g. Gadylshin et al., 2021) or their frequency content (e.g. Gatti and Clouteau, 2020).
The recent emergence of scientific machine learning (SciML) is offering a new paradigm for the prediction of physicsbased ground motion parametrized by 3D ground properties and source parameters, with intrinsic ability to generalize to various resolutions and geological configurations. SciML has led to significant scientific developments in communities with large, reliable, and freely available databases. For instance, in numerical weather prediction, Bonev et al. (2023) and Pathak et al. (2022) took advantage of the ERA5 dataset provided by the European Centre for MediumRange Weather Forecasts (Hersbach et al., 2020). In seismology, Mousavi and Beroza (2023) pointed out that “the limitations on training data and generalization are the main challenges in solving inverse and forward problems using supervised [deep neural networks].”
In this work, we describe the first open database of seismic simulations associated with 3D heterogeneous geological models. The HEMEW^{S}3D (HEterogeneous Materials and Elastic Waves with Source variability in 3D) database contains 30 000 highfidelity simulations in 3D domains of $\mathrm{9.6}\times \mathrm{9.6}\times \mathrm{9.6}$ km^{3} in size. This represents a challenging computational task, accounting for 9×10^{5} core hours and 4.4 MWh in total. Ground motion was synthesized at the surface of the simulation domain for 8 s on a grid of 32×32 virtual sensors. Data are available at https://doi.org/10.57745/LAI6YU (Lehmann, 2023).
In the following text, Sect. 2 provides an overview of existing datasets in related fields; Sect. 3 describes the geological models, sources, and surface wavefields in the database; Sect. 4 analyses physical characteristics; Sect. 5 illustrates applications; and Sect. 6 discusses limitations and perspectives.
(Jessell et al., 2022)(Deng et al., 2022)(Deng et al., 2022)(Deng et al., 2022)(Feng et al., 2023)Datasets of recorded ground motion have enabled major deep learning applications in seismology, but they have several limitations in datascarce regions. In this section, we focus on datasets with 2D or 3D data used in geophysics and seismology with SciML applications. Due to the mathematical similarities between wave propagation and fluid flow (both are governed by hyperbolic equations), related studies are reviewed beyond the field of seismology. This highlights the challenges of highfidelity numerical simulations for deep learning applications.
2.1 Datasets of ground motion simulations
Ground motion simulations of past earthquakes have been collected in databases for model verification, characterization of complex nearfield conditions, and machine learning purposes. The BBSPEEDset provides 3D simulations of 16 earthquake scenarios in various regions of the globe (Paolucci et al., 2021). The SCEC Broadband Platform also provides simulations of 17 past earthquakes with different source models (Maechling et al., 2015). Other studies focus on specific regions, such as the Hayward Fault (Petrone et al., 2021) and Türkiye (Altindal and Askan, 2022). The limited number of scenarios makes the abovementioned databases less suitable for machine learning purposes, where dataset variability and size are crucial factors. In addition, although they provide some variability on the source, those datasets do not consider varying geological models but focus on regional validated geological models.
2.2 3D datasets
Due to the high computational costs of solving 3D partial differential equations (PDEs), only very few 3D datasets are available. CO_{2} underground storage has been explored with SciML based on 3D numerical simulations (Grady et al., 2023; Wen et al., 2023; Witte et al., 2023). To support the study of Witte et al. (2023), Annon (2022) provided 4000 simulation results for 3D CO_{2} flow through geological models based on the Sleipner dataset complemented by random fields (Equinor, 2020). The Kimberlina dataset also contains 6000 CO_{2} leakage rate simulations (Mansoor et al., 2020). However, the geological models in both of those databases are all variants of the geological model that was carefully estimated for a given region, thereby limiting the reproducibility in other areas.
2.3 Geophysical datasets
A few datasets of realistic geological units have been developed, such as the Noddyverse dataset of 3D geological models (Jessell et al., 2022). In this dataset, geological models result from the deformation of horizontal layers by successive geological events (folds, faults, unconformities, dykes, plugs, shear zones, and tilts), but no associated ground motion is provided. Along the same line of geological deformation, the OpenFWI database combines geological models with associated waveforms and targets 2D geophysical inversion as the main application (Deng et al., 2022). OpenFWI contains geological models made of horizontal and nonhorizontal layers with various folds. It also includes real geological models from field survey areas and models of CO_{2} geological storage. To generate the wavefields, the acoustic wave equation is solved in the 2D domains. Waves originate from a line of sources at the surface and wavefields are acquired on a line of receivers at depth. The 𝔼^{FWI} database is an extension of OpenFWI to the elastic wave equation, providing twocomponent ground motion time series (Feng et al., 2023).
Several other studies have computed simulation outputs for the acoustic or elastic wave equation, but their data are not public (e.g. Liu et al., 2021; Ovadia et al., 2023; Zhang et al., 2023). Table 1 summarizes the characteristics of the public datasets and shows that no database provides solutions of the elastic wave equation in 3D domains. Our HEMEW^{S}3D database intends to fill this gap.
3.1 The elastic wave equation
Elastodynamics describes reversible wave propagation phenomena in solid and fluid domains. In solid mechanics, the solution is represented by a displacement field, u∈ℝ^{3}, that propagates in a 3D Euclidean space. We consider a truncated propagation domain, $\mathrm{\Omega}=[\mathrm{0},L{]}^{\mathrm{3}}$, with absorbing boundary conditions all around, except the tractionfree top surface, and a solution, $\mathit{u}:\mathrm{\Omega}\times [\mathrm{0},T]\to {\mathbb{R}}^{\mathrm{3}}$. The domain length is fixed to L=9.6 km, and the total time is T=8 s. In its most general form, the elastic wave equation is written as
where $\mathit{\rho}:\mathrm{\Omega}\to \mathbb{R}$ is the material unit mass density; $\mathit{\lambda}:\mathrm{\Omega}\to \mathbb{R}$ and $\mathit{\mu}:\mathrm{\Omega}\to \mathbb{R}$ are the Lamé parameters, characterizing the thermodynamically reversible mechanical behaviour of the material; and f is the body force distribution. In geomechanics, properties ρ, λ, and μ are rarely independently characterized due to a lack of measurements. Therefore, it is legitimate to assume that there is a single informative variable from which all parameters can be deduced. In this work, the velocity of shear waves, V_{S}, is the informative variable. Equation (1) can then be rewritten in the following general form:
3.2 Earthquake source
In our database, the forcing term $\mathit{f}(\mathit{x},t)=\mathrm{div}\phantom{\rule{0.33em}{0ex}}\mathit{m}\left(\mathit{x}\right)\cdot s\left(t\right)$ is the divergence of a moment tensor density, m, localized at a pointwise location. m encodes the source radiation patterns as a double couple representing a pointwise kinematic discontinuity in the media.
The source position is represented by the coordinates $({x}_{\mathrm{s}},{y}_{\mathrm{s}},{z}_{\mathrm{s}})\in \mathrm{\Omega}$, which is not too close from the boundaries to avoid numerical issues due to absorbing boundary conditions. The position is chosen from a product of Latin hypercube sampling (LHS), with
In addition to the source position, the source is parametrized by the symmetric 3×3 moment tensor. An equivalent formulation is obtained with the three angles, strike, dip, and rake (Aki and Richards, 1980). With this representation, the angles are samples from a LHS with a strike between 0 and 360°, dip between 0 and 90°, and rake between 0 and 360°.
The source amplitude corresponds to a seismic moment, ${M}_{\mathrm{0}}=\mathrm{2.47}\times {\mathrm{10}}^{\mathrm{16}}$ N m, and the source time evolution is a spice bench given by $s\left(t\right)=\mathrm{1}\left(\mathrm{1}+\frac{t}{\mathit{\tau}}\right){e}^{\frac{t}{\mathit{\tau}}}$, with τ=0.1 s (Fig. A1). Due to the linearity of the elastic wave equation (Eq. 1), it is important to notice that the choice of the source time function in the HEMEW^{S}3D database does not constrain the variability in resulting ground motions.
First, any magnitude can be obtained by applying a scalar factor to the ground motion wavefields. Second, the response to any source time function can be computed from the Green function, G(x,t), which is the fundamental solution of the elastic wave equation, Eq. (1), when the source is an impulse point force located at x_{s} and occurring at t=t_{0}. For the reference source time function, s(t), the solution, u(x,t), provided in HEMEW^{S}3D can be written as the convolution of the Green function with the source time function, i.e. $\mathit{u}(\mathit{x},t)=\mathbf{G}(\mathit{x},t)*s\left(t\right)$.
Computing the solution, u_{1}, for a new source time function, s_{1} (provided that the moment tensor density m and the geological parameter V_{S} remain the same), is straightforward following the steps below:

compute the Fourier transform of the reference source time function, $\widehat{s}\phantom{\rule{0.25em}{0ex}}:=\phantom{\rule{0.25em}{0ex}}\mathcal{F}\left(s\right)$, and the solution, $\widehat{\mathit{u}}\phantom{\rule{0.25em}{0ex}}:=\phantom{\rule{0.25em}{0ex}}\mathcal{F}\left(u\right)$;

derive the Green function in the frequency domain, $\widehat{\mathbf{G}}={\displaystyle \frac{\widehat{\mathit{u}}}{\widehat{s}}}$;

compute the Fourier transform of the new source time function, ${\widehat{s}}_{\mathrm{1}}$;

compute the new solution in the frequency domain, ${\widehat{\mathit{u}}}_{\mathrm{1}}=\widehat{\mathbf{G}}*{\widehat{s}}_{\mathrm{1}}$;

deduce the new solution in the temporal domain, ${\mathit{u}}_{\mathrm{1}}={\mathcal{F}}^{\mathrm{1}}\left({\widehat{\mathit{u}}}_{\mathrm{1}}.\right)$
From these remarks, one should remember that ground motion wavefields in the HEMEW^{S}3D database originate from pointwise sources with a different location, x_{s}∈ℝ^{3}, and orientation, θ_{s}∈ℝ^{3}, but the same source time function.
3.3 Heterogeneous geological models
The HEMEW^{S}3D database contains samples $\mathit{\{}{V}_{\mathrm{S}}^{\left(i\right)},{\mathit{x}}_{\mathrm{s}}^{\left(i\right)},{\mathit{\theta}}_{\mathrm{s}}^{\left(i\right)},{\dot{\mathit{u}}}^{\left(i\right)}{\mathit{\}}}_{i}$ that satisfy Eq. (2) ($\dot{\mathit{u}}$ denotes the velocity field obtained as the time derivative of the displacement field u). The 3D geological model set, V_{S}(x), is a set of nonstationary random fields defined as a mean stair function (horizontal homogeneous layers) to which fluctuations are added, as illustrated in Fig. 1.
3.3.1 Homogeneous models
A homogeneous layer that is 1.8 km thick is imposed at the bottom of each geological model, with a V_{S} value of ${V}_{\mathrm{S},\mathrm{max}}=\mathrm{4500}$ m s^{−1}. The minimum Swave velocity is ${V}_{\mathrm{S},\mathrm{min}}=\mathrm{1071}$ m s^{−1}. Above the bottom layer, the number of horizontal layers and their thickness are randomly chosen for each sample ${V}_{\mathrm{S}}^{\left(i\right)}$, with the sole constraint to fill the total depth with two to seven layers. In particular, this means that velocity values are not sorted by depth. This choice is discussed in depth in Sect. 6. Then, the mean layerwise value is drawn from the uniform distribution, 𝒰([μ_{1},μ_{2}]). All layerwise values are chosen independently. Values of ${\mathit{\mu}}_{\mathrm{1}}\phantom{\rule{0.25em}{0ex}}=\phantom{\rule{0.25em}{0ex}}{V}_{\mathrm{S},\mathrm{min}}/\mathrm{0.6}$ = 1785 m s^{−1} and ${\mathit{\mu}}_{\mathrm{2}}\phantom{\rule{0.25em}{0ex}}=\phantom{\rule{0.25em}{0ex}}{V}_{\mathrm{S},\mathrm{max}}/\mathrm{1.4}=\mathrm{3214}$ m s^{−1} were determined to ensure that most values remain bounded within $[{V}_{\mathrm{S},\mathrm{min}},{V}_{\mathrm{S},\mathrm{max}}]$ after the addition of random fields in each layer (see Table 2 for a summary of the parameters).
To recover the other geological properties, the ratio of P to Swave velocity was fixed to ${V}_{\mathrm{P}}/{V}_{\mathrm{S}}=\mathrm{1.7}$. The density, ρ, is computed as a function of the Pwave velocity (Molinari and Morelli, 2011):
Attenuation factors for P waves (Q_{P}) and S waves (Q_{S}) are computed as follows:
3.3.2 Addition of heterogeneities
The layers' thicknesses and mean values describe the general structure of the propagation domain and they correspond to the prior physical information usually available. However, geomaterials of the Earth's crust contain a lot of variability, especially along the horizontal directions. This heterogeneity can be represented by random fields, characterized by their correlation length and coefficient of variation. Following previous studies on geological heterogeneity (e.g. Hartzell et al., 2010; Imperatori and Mai, 2013; Khazaie et al., 2016; Scalise et al., 2021; Thompson et al., 2007), we drew random fields with a von Kármán correlation kernel and a Hurst exponent of 0.1 (Chernov, 1960) (marginal distributions are lognormal to preserve positive values).
In order to provide a sufficient dataset variability, the choice of correlation lengths and coefficients of variation is tricky yet crucial (Colvez, 2021). The correlation length gives an idea of the distance above which two points, x_{A} and x_{B}, have independent geological properties, V_{S}(x_{A}) and V_{S}(x_{B}). We chose correlation lengths randomly from the set {1.5, 3, 4.5, 6} km to mix samples with small and largescale heterogeneity. In addition, large coefficients of variation were chosen to provide high geological contrasts, following the folded normal distribution, $\left\mathcal{N}\right(\mathrm{0.2},\mathrm{0.1}\left)\right$, with a mean of 0.2 and coefficient of variation of 0.1. Coefficients of variation of around 20 % are common at the surface (Arroucau, 2020), while it is known that values of up to 40 % can be found locally (El Haber et al., 2021).
The 3D random field computation is made highly efficient by the use of the spectral representation (Shinozuka and Deodatis, 1991; de Carvalho Paludo et al., 2019). With this formulation, a centred Gaussian random field, V_{S}, determined by its autocovariance function, ℛ, can be decomposed as a sum of independent identically distributed random variables $({V}_{\mathrm{S},n}{)}_{N\le n\le N}$, with a uniform distribution over [0,2π].
where $\widehat{\mathcal{R}}$ is the Fourier transform of the autocovariance function ℛ and Δk is the unit volume in ℝ^{3}.
Finally, V_{S} values are clipped between ${V}_{\mathrm{S},\mathrm{min}}=\mathrm{1071}$ m s^{−1} and ${V}_{\mathrm{S},\mathrm{max}}=\mathrm{4500}$ m s^{−1}. These bounds correspond to the velocity of shear waves in hard sediments and at the bottom of the continental crust (Molinari and Morelli, 2011).
It should be noted that all layers have distinct coefficients of variation and correlation lengths, meaning that different random fields are drawn inside each layer. Also, random fields are drawn only once for each set of parameters.
3.3.3 Representation in the database
Geological realization, ${V}_{\mathrm{S}}^{\left(i\right)}$, is discretized over a grid of $\mathrm{32}\times \mathrm{32}\times \mathrm{32}$ elements (corresponding to x, y, and z axes). The geological models are provided as .npy arrays. The total size of the geological dataset is 3.9 GB, split in 15 files of 2000 geological models for easier data management. Additionally, metadata give parameters of each layer: the mean V_{S} value; the thickness; the coefficient of variation; and the correlation lengths along x, y, and z.
3.4 Solutions of the wave equation
The elastic wave equation was solved in each domain i by means of the opensource code SEM3D (https://github.com/sem3d/SEM, CEA et al., 2017) (Touhami et al., 2022) based on the spectral element method (Faccioli et al., 1997; Komatitsch and Tromp, 1999). The dimension of the simulation mesh is prescribed by the maximum frequency, f_{max}, one aims at exactly resolving. In this study, f_{max} was fixed at 5 Hz, which is relatively high for this type of simulation. Many simulations have been conducted so far with an accuracy of up to 1 or 2 Hz (Rekoske et al., 2023; Rosti et al., 2023), while highfidelity simulations for local realistic earthquake scenarios extend to up to 10 Hz (CastroCruz et al., 2021; De Martin et al., 2021; Heinecke et al., 2014) (and exceptionally to up to 18 Hz, such as in Fu et al., 2017). Then, the smallest wavelength, ${\mathit{\lambda}}_{min}={V}_{\mathrm{S},\mathrm{min}}/{f}_{max}$, must be described on the mesh by at least five quadrature points. With seven Gauss–Lobatto–Legendre quadrature points per mesh element, this leads to elements of size $h=\frac{\mathrm{7}}{\mathrm{5}}\cdot \frac{{V}_{\mathrm{S},\mathrm{min}}}{{f}_{max}}=\mathrm{300}$ m. This explains how 32 elements in each direction amount to a domain size of L=9.6 km. The timemarching scheme is a leapfrog secondorder accurate explicit scheme solved for velocity fields.
To maintain reasonable computational loads and reflect realistic scenarios, velocity fields were recorded only at the surface of the propagation domain. A regular grid of 32×32 sensors was placed between 150 and 9450 m in both horizontal directions (implying a distance of 300 m between two neighbouring sensors). At each monitoring point, the threecomponent velocity field is synthesized with a 100 Hz sampling frequency for between 0 and 8 s. Although the sampling frequency is higher than the Nyquist frequency (i.e. $\mathrm{2}\times {f}_{max}=\mathrm{10}$ Hz), the value of 100 Hz was chosen to match the temporal resolution of recorded time series in several publicly accessible datasets (e.g. STEAD, Mousavi et al., 2019; INSTANCE, Michelini et al., 2021). The sampling frequency is sufficient to allow for an accurate computation of peak ground velocity (PGV), derive the acceleration time series with finite differences, and compute the peak ground acceleration (PGA). Figure 2 illustrates velocity waveforms at eight virtual sensors.
3.4.1 Representation in the database
Velocity fields are provided as .h5 files. Each file contains three keys: uE, uN, and uZ that correspond to the three components of ground motion (east–west, north–south, and vertical). Each velocity field is of shape $\mathrm{32}\times \mathrm{32}\times \mathrm{800}$, where the first index corresponds to the y axis, the second index to the x axis, and the third index to the temporal axis.
Files are gathered in .zip archives containing 100 simulation results. The 300 .zip files amount to 263.4 GB. They are downloadable individually (0.87 GB per file).
4.1 Descriptive statistics of the temporal evolution
Since most of the geological parameters are chosen uniformly randomly (Table 2), the geological dataset is well balanced: geological models with one to six layers are equipartitioned, and all random field parameters have approximately the same frequency. Mean V_{S} values range from 1756 to 3145 m s^{−1}.
The firstwave arrival time is a crucial parameter for earthquake early warning, and seismic phase picking is a common task with deep learning models. Arrival time depends on the distance between the earthquake source and the monitoring sensor as well as the geological properties on the propagation path. Wave arrival times are usually determined from recordings, either manually by experts or using machine learning methods. However, it is possible to compute almost exact arrival times from synthetic velocity fields since ground motion is almost zero before the first wave arrival. Therefore, we obtained wave arrival times for Pwaves as the earliest time step where the amplitude exceeds 0.1 % of the maximum amplitude. Due to the source depth variability and the different wave velocities in the geological models, first wave arrival times significantly vary among samples and among sensors. Figure 3a shows that 10 % of velocity time series are initiated before 0.65 s, while 10 % of time series are still null after 2.17 s.
As expected, the Pwave arrival time is strongly correlated with the hypocentral distance (Fig. 4a) since shorter hypocentral distances are associated with shorter propagation paths. The mean velocity on the propagation path also influences the first wave arrival time, but variability is higher. Figure 4b indeed shows that the Pwave arrival time is negatively correlated with the mean Swave velocity in the whole domain. It confirms that waves propagate more slowly when the mean velocity is lower. The mean velocity gives an approximation of the velocity values encountered by the waves along the propagation path. In particular, the mean velocity does not depend on the sensor position with this approximation.
The temporal evolution of ground motion can also be characterized by its relative significant duration (RSD). It corresponds to the duration of the signal between 5 % and 95 % of the Arias intensity (I_{A}) (Arias, 1970).
where a(t) is the acceleration field and T is the total duration of the signal. Figure 3b shows that RSD covers a large variation range, from 0.17 to 7.60 s. This variability is illustrated in Fig. 2, where the grey areas represent the RSD. One can especially notice that samples with a strong pulse have a short RSD. Indeed, most of the energy is concentrated around the pulse.
Quantitatively, the HEMEW^{S}3D database contains short ground motions since 10 % of time series have a RSD lower than 0.24 s as well as longer ground motions, where 10 % of time series have a RSD longer than 3.42 s. The median RSD is 1.06 s. These low RSD values are related to the absence of highfrequency components in the coda and the dominance of high pulselike time series in cases with shallow sources and low heterogeneity contrasts. Combined with the short Pwave arrival times, RSD values also justify the fact that the 8 s window contains a significant part of ground motion.
4.2 Descriptive statistics related to energy content
The PGV is computed as the maximum absolute value over all time steps separately for each component. The PGV is slightly lower on the vertical component than the two horizontal components (Fig. 5). It is very similar between the east–west and north–south components, which is expected since the HEMEW^{S}3D database is statistically invariant per horizontal rotation. Figure 5 shows that the PGV extends over 3 orders of magnitude, with the first percentile being equal to 0.89 cm s^{−1}, while the 99th percentile is equal to 129.3 cm s^{−1}. The median PGV is 8.9 cm s^{−1}. The 1stto99thpercentile interval is in line with ground motion observed within a few kilometres of moderatemagnitude earthquakes (e.g. Convertito et al., 2022).
When the propagation path is longer, seismic waves encounter more geological heterogeneities. They create a dispersion and diffraction of waves that spread the energy signal over time. Larger hypocentral distances are associated with longer propagation paths. Figure 6a then shows that the PGV is negatively correlated with the hypocentral distance.
It is also known that the seismic energy, E_{s}, generated by a fault rupture is
where M_{0} is the seismic moment, Δσ is the stress drop, and μ is the shear modulus at the fault location. Knowing that the shear wave velocity is written as ${V}_{\mathrm{S}}=\sqrt{\mathit{\mu}/\mathit{\rho}}$, Eq. (7) indicates that the seismic energy is inversely proportional to ${V}_{\mathrm{S}}^{\mathrm{2}}$. Furthermore, Fig. 6b confirms that the PGV is negatively correlated with the velocity of S waves at the source location.
4.3 Distribution of pseudospectral acceleration (PSA)
The pseudospectral acceleration (PSA) is a commonly used metric to estimate structural response. It evaluates the maximal acceleration of a 1degreeoffreedom oscillator (with a 5 % damping), with a natural period, T. At T=0.2 s, the PSA value in the HEMEW^{S}3D database is between $\mathrm{2.3}\times {\mathrm{10}}^{\mathrm{3}}$ and 81.2 g. It decreases to the interval between $\mathrm{5.8}\times {\mathrm{10}}^{\mathrm{4}}$ g and 6.1 g at T=1.0 s. Figure 7 additionally shows that there exists a negative correlation between the PSA and the hypocentral distance. The distancedependent PSA values can be compared with existing ground motion models (GMMs), as shown in Fig. 8.
GMMs provide an analytical formula to compute intensity measures, such as PSA and PGV, based on regression analyses. They are mainly derived from databases of recorded earthquakes, although numerical simulations can also be used. The PSA estimated from the HEMEW^{S}3D database is compared with four GMMs (all taken with a moment magnitude of 4.9 that corresponds to the seismic moment of the HEMEW^{S}3D database):

GMM from Atkinson and Boore (2006), computed with ${V}_{\mathrm{S},\mathrm{30}}=\mathrm{2000}$ m s^{−1};

GMM from Atkinson (2015), computed with a depth of 1 km (solid line in Fig. 8) and a depth of 7.5 km (dashed line in Fig. 8);

GMM from Chiou and Youngs (2014), computed with ${V}_{\mathrm{S},\mathrm{30}}=\mathrm{2000}$ m s^{−1}, a depth of 1 km, and different dip and rake angles (these last two factors having little influence on the PSA);

GMM from Shahjouei and Pezeshk (2016).
Figure 8 shows that the mean PSA computed from the HEMEW^{S}3D database is in good agreement with all GMMs and that the standard deviation is smaller than GMMs. The Atkinson (2015) GMM illustrates the influence of depth, especially for the smallest hypocentral distances. The HEMEW^{S}3D PSA values fit better with the GMM PSA for shallow sources (solid purple line in Fig. 8) than deeper sources (dashed purple line).
4.4 Dimensionality
In supervised deep learning, it is always challenging to determine whether the size of the database (i.e. the number of samples) is sufficient to represent its variability. This question relates to the definition of the intrinsic dimension of the dataset, which indicates the number of hidden variables that should be necessary to represent the main features of the samples. In the following sections, we provide insights into this question, with the intrinsic dimension based on the principal component analysis (Sect. 4.4.1), the correlation dimension (Sect. 4.4.2), the maximum likelihood estimate (Sect. 4.4.3), and the structural similarity index (Sect. 4.5).
4.4.1 Principal component analysis (PCA)
The principal component analysis (PCA) decomposes data in principal components that correspond to the directions where data vary the most. For different sizes of datasets, we compute the number of principal components required to retain 95 % of variance and define this number as the intrinsic dimension of data. The 3D geological models and the 3D ground motion wavefields are transformed into 1D vectors to perform the PCA. To reduce the memory requirements, ground motions are analysed only on the east–west component. Geological models are represented by $\mathrm{32}\times \mathrm{32}\times \mathrm{32}$ = 32 768 points, and ground motions contain $\mathrm{16}\times \mathrm{16}\times \mathrm{320}$ = 81 920 points (16 sensors in directions x and y and 320 time steps between 0 and 6.4 s). To ease the computation on the large sample covariance matrix, an incremental PCA algorithm was used (Ross et al., 2008).
Table 3 and Fig. B1 show that more than 1000 principal components are needed to reconstruct the geological models with high accuracy, whereas the intrinsic dimension of ground motion wavefields is around 4900. It is a reasonable fact that the wavefield intrinsic dimension is larger than the geological dimension since wavefield variability is created by geological variations and the source position. However, due to its linearity, the PCA requires a large number of components to accurately represent complex patterns. Therefore, it may overestimate the intrinsic data dimension.
4.4.2 Correlation dimension
An alternative dimensionality measure was introduced by Grassberger and Procaccia (1983) as the correlation dimension, which characterizes the distance between pairs of samples. For a dataset of N samples $\mathit{\{}{V}_{\mathrm{S}}^{\left(i\right)}{\mathit{\}}}_{\mathrm{1}\le i\le N}$ and a given radius r, the correlation dimension (C_{N}(r)) is defined as the ratio of sample pairs $({V}_{\mathrm{S}}^{\left(i\right)},{V}_{\mathrm{S}}^{\left(j\right)}{)}_{i\ne j}$ that are at a distance lower than r:
Figure B2 indicates a correlation dimension of 8 for the geological dataset, which is significantly lower than the PCA dimension. In fact, it is known that the correlation dimension may underestimate the intrinsic dimension, especially “when data are scattered” (Qiu et al., 2023), which is likely to be the case in highdimensional spaces. However, the correlation dimension of the ground motion wavefields is debatable as it drops to 2. Figure B3c shows that the log–log representation of C_{N}(r) does not produce an obvious linear part, which makes it difficult to identify the correlation dimension.
4.4.3 Maximum likelihood estimator (MLE) intrinsic dimension
Levina and Bickel (2004) proposed another approach based on the maximum likelihood estimator (MLE) of the distance to the closest neighbours. Figure B4 shows the evolution of the intrinsic dimension as a function of the number of samples for geological models and velocity wavefields. The intrinsic dimension of geological models is 34, while the dimension of velocity wavefields is higher (around 110). Although this method may still underestimate data with high intrinsic dimensionality (Qiu et al., 2023), it provides higher estimates than the correlation dimension.
It can also be noted that the intrinsic dimension increases with the number of samples, as was observed for the PCA. This may reflect a flaw in the intrinsic dimension's definition, or it may indicate that despite being already large, our database of 30 000 samples does not capture all the variabilities.
4.5 Structural similarity
The correlation dimension is computed from the Euclidean distance between pairs of geological models. However, pointwise metrics do not necessarily best represent similarities between geological models, and alternative metrics such as the structural similarity index measure (SSIM) have been introduced for this purpose (Wang et al., 2004). This index theoretically ranges from 0 to 1, with 0 indicating no similarity and 1 indicating perfectly similar geological models (although values between −1 and 0 can be obtained numerically from the covariance computation). The SSIM of two geological models A and B is defined as
where μ_{A} and μ_{B} are the means of A and B, σ_{A} and σ_{B} are the unbiased estimators of the variance of A and B, σ_{AB} is the unbiased estimator of the covariance of A and B, and C_{1} and C_{2} are constants determined from the range of A and B values.
Figure B5a and b illustrate two pairs of geological models with the same SSIM (of 0.6), meaning there is a rather high similarity. The first geologies have similar mean values but different heterogeneities, resulting in a low Euclidean distance (Fig. B5a), while the second geologies have different mean values, leading to a higher Euclidean distance (Fig. B5b).
To give insights into the sparsity of the geological database, Fig. 9 shows that only 1.4 % of geological pairs have a SSIM greater than 0.2. This means that geological models are generally very distinct from each other in the HEMEW^{S}3D database. For comparison, the 2D OpenFWI dataset leads to significantly higher SSIM, with 31 % of geologies having a SSIM larger than 0.2 (3000 models were chosen from each of the 10 OpenFWI families; Deng et al., 2022).
5.1 Dimensionality reduction in geological models
Dimensionality analyses have shown that at least 1000 principal components are necessary to represent geological models with enough accuracy, as measured by the reconstructed variance. This means that the PCA provides a basis for 3D models to decompose a wide diversity of geological models. One can consider geological models that are very different from the random fields contained in the HEMEW^{S}3D database – for instance, embedding a basin shape.
Figure 10 shows that 1000 principal components allow for a good reconstruction of the basin shape with correct velocity values inside and outside the basin. Edges are slightly blurred, which is expected since sharp contrasts correspond to high spatial frequencies that require many principal components. This example illustrates the generalization ability of the HEMEW^{S}3D database from a geometrical point of view. To match the design of the HEMEW^{S}3D database, the velocity values are chosen within the same bounds. If one were to consider real sedimentary basins, they should apply rescaling to target lower velocity values.
The influence of the PCA reconstruction on the generated velocity wavefields was investigated in more details in Lehmann et al. (2022). It was shown that wavefields created by the propagation of seismic waves inside the reconstructed geological models and the reference model are very similar. When the initial geological model has strong heterogeneities, heterogeneities tend to be blurred in the PCA reconstruction, which reduces the dispersion of seismic waves. As a consequence, velocity wavefields generated inside the reconstructed geological model have slightly larger amplitudes.
5.2 Velocity fields predictions
Since the HEMEW^{S}3D database associates geological models and sources with their corresponding velocity wavefields, it can serve to predict the latter using the former. Neural operators are one class of SciML models that have shown great success in the prediction of parametric PDEs. One can mention in particular the MultipleInput Fourier Neural Operator (MIFNO; Lehmann et al., 2024) that uses the fast Fourier transform to learn the frequential representation of the elastic wave equation and a dedicated handling of the source term (Fig. C1).
For each geological model and source in the HEMEW^{S}3D database, the MIFNO predicts the velocity field at each surface point. Figure 11 illustrates that the MIFNO gives accurate predictions for samples with different ground motions. This shows that the variability and size of the HEMEW^{S}3D database are appropriate to train complex SciML models.
5.3 Other potential applications
Thanks to the large number of simulations, one can also envision studying the variability in ground motion to capture its statistical distribution. In particular, one can investigate the best sampling that minimizes the number of samples while preserving the largest ground motion variability (Tarbali and Bradley, 2015).
The surface ground motion can also be considered an “outcropping bedrock” response, which is classically used in 1D site–effect and soil–structure interaction analyses and which may require deconvolution.
Since HEMEW^{S}3D is the first database that provides 3D ground motion, it is constrained by some hypotheses to control its size and allow machine learning applications.
First, the minimum Swave velocity of 1071 m s^{−1} is rather high when compared to Swave velocities in soft sediments (typically of a few hundred metres per second) but coherent for hard sediments (Molinari and Morelli, 2011). V_{S} values in the HEMEW^{S}3D database are also in line with national velocity models with a low spatial resolution that display surface values of around 2000 m s^{−1}. The choice of minimum V_{S} implies in particular that the HEMEW^{S}3D dataset is not targeted towards site–effect analyses in sedimentary basins. One should also note that the vertical resolution of the geological models is 300 m, while very low V_{S} values are more commonly encountered in the first tens of metres. These low values would be averaged with higher deeper values in our models. In particular, this means that the upper velocity in the HEMEW^{S}3D database should be understood as the average velocity between 0 and −300 m. It is not comparable with the common notion of V_{S,30} (average V_{S} in the first 30 m). Reducing the minimum velocity poses no theoretical limitation but would increase the computational cost of the subsequent numerical simulations since it increases the number of mesh elements.
Second, the maximum Swave velocity of 4500 m s^{−1} corresponds to existing V_{S} values at the bottom of the Earth's crust that are often adopted in velocity models (Duverger et al., 2021; Molinari and Morelli, 2011). In addition, the bottom layer has a fixed thickness and value that originates from earlier works. Therefore, variability is considered only above this constant layer.
Third, we do not constrain the ordering of layerwise V_{S} values to provide a large database variability that is essential for machine learning perspectives. Similar choices were made in the widely used OpenFWI database (Deng et al., 2022), in which the FlatVelB family is based on a random arrangement of layers. This means that some layer arrangements may be unphysical – for instance, if the mean values are linearly decreasing with depth. However, it is important to notice that the physics of wave propagation is still satisfied in those situations, which is the main concern of this work. One choice opposite to ours would be to impose layerwise values increasing with depth (as done in the OpenFWI FlatVelA family for instance). However, this would remove all models with velocity inversion, which can be found in complex geological contexts. From the metadata provided, users can filter geological models with custom criteria to exclude those outliers from their study. For instance, one custom criterion could be the impedance contrast, defined as the ratio between impedance ($Z={V}_{\mathrm{S}}\times \mathit{\rho}$) in one geological layer, Z_{ℓ}, and the layer above, Z_{ℓ−1}. Figure A2 illustrates the distribution of impedance contrasts and shows that they are realistic. A total of 9028 geological models have a minimum impedance contrast smaller than 0.7.
Additionally, more diverse configurations could be designed by relaxing the assumption that all geological parameters depend on a single variable. This would imply, for instance, varying the ${V}_{\mathrm{P}}/{V}_{\mathrm{S}}$ ratio. Random anisotropic heterogeneities can also be generated for more diversity (Ta et al., 2010).
The domain size was limited to 9.6 km to prove that SciML was possible with a manageable dataset size. This size allows for reasonable local studies and is already larger than existing 2D databases (Table 1). Extending the spatial size is certainly of interest for some seismological applications and requires additional computational costs. In summary, with larger computational budgets and lesser memory constraints, it would become possible to

consider larger and deeper geological models,

design models with a higher spatial resolution,

include lower minimum V_{S} values,

increase the frequency limit of the wave propagation simulations,

increase the spatial sampling of virtual sensors,

increase the temporal duration of signals to match the longer epicentral distances coming from larger models.
It should also be noted that numerical simulations are only valid for a frequency of up to 5 Hz due to the mesh design, with numerical pollution for frequencies larger than 5 Hz. We observed that it is crucial to apply a lowpass filter (with a cutoff frequency of 5 Hz) to the velocity fields before using machine learning models; otherwise, the model may try to fit numerical noise.
The database is referred to as Lehmann (2023) and can be downloaded at https://doi.org/10.57745/LAI6YU. The wave propagation code, SEM3D, is available at https://github.com/sem3d/SEM (CEA et al., 2017). The code used to generate the HEMEW^{S}3D database is given at https://github.com/lehmannfa/HEMEW3D (Lehmann, 2024).
We presented the HEMEW^{S}3D (HEterogeneous Materials and Elastic Waves with Source variability in 3D) database that contains 30 000 geological models, source parameters, and time and spacedependent surface wavefields generated by the propagation of seismic waves through each geological model. This database was conceived as a solution to the forward problem of wave propagation.
Geological models are built from horizontal layers that are randomly arranged, and they correspond to the velocity of shear waves (V_{S}). They represent a domain of size $\mathrm{9.6}\times \mathrm{9.6}\times \mathrm{9.6}$ km discretized in 300 m wide elements. V_{S} values are comprised between 1071 and 4500 m s^{−1}. Random fields are then added independently in each geological layer to create 3D heterogeneities. Their parameters (coefficients of variation and correlation lengths) vary widely to cover diverse geological configurations and are given as metadata. Geological models are provided as cubes with $\mathrm{32}\times \mathrm{32}\times \mathrm{32}$ voxels.
Seismic waves propagate numerically from the earthquake source to the surface. Pointwise sources have a random position and orientation. Ground motion wavefields are synthesized at the surface of the propagation domain by a grid of 32×32 sensors recording for 8 s. Simulations are conducted with the highperformance computing code SEM3D and amount to a total computational time of 9×10^{5} core hours. The dataset description shows that the 8 s time window covers most significant ground motion at the surface.
Ground motion characteristics differ strongly between samples. They were analysed in terms of relative significant duration (RSD), Pwave arrival time, peak ground velocity (PGV), and pseudospectral acceleration (PSA). In addition to quantifying the distributions of essential intensity measures in seismology, these analyses confirm expected relationships between physical parameters and ground motion characteristics. In particular, hypocentral distance, V_{S} at the source location, and mean velocity were investigated. Comparisons with ground motion models (GMMs) show that PSA is comparable with estimates from recorded earthquakes. This indicates the usefulness of the HEMEW^{S}3D database in complementing databases of recorded earthquakes. Due to the size of individual samples in the HEMEW^{S}3D database, one may wonder whether data could be represented with fewer parameters to reduce memory requirements. To this end, we explored different methods to estimate the dataintrinsic dimension, and we exemplified the wellknown fact that they can lead to very different values. Taking the MLE as a lower bound, one can argue that the intrinsic dimension of the geological database is at least 30. In addition, the low values of the SSIM indicate that geologies are sparse and distant from each other in the HEMEW^{S}3D database.
Concerning the velocity wavefields, the PCA and the MLE confirm the intuition that the intrinsic dimension is larger than the geological dimension since the source adds variability to the time arrival of wavefields as well as their location at the surface. In this situation, it is reasonable to consider that the intrinsic dimension of ground motion is at least on the order of 100. However, if data are decomposed with the PCA, then the number of principal components is a few thousand. The correlation dimension yields questionable estimates of the intrinsic dimension that contradict our intuition and the PCA and MLE outcomes.
By providing a large number of physicsbased simulations, the HEMEW^{S}3D database offers new perspectives from which the relationship between geological properties and surface ground motion can be studied. It led to the first neural operator that predicts 3D ground motion, but many applications in statistics, scientific machine learning, and deep learning are envisioned. We designed the database to be as generic as possible, and we believe that several scientific communities can benefit from it.
B1 Principal component analysis
The intrinsic dimension based on the principal component analysis (PCA) components has been evaluated with the scikitdimension package. Figure B1 illustrates the number of PCA components required to retain 95 % of the data variance depending on the number of samples. It can be noted that, for computational reasons, the velocity fields are represented by a single component (the east–west component, parallel to the x axis) for all three methods.
For comparison purposes, the wavefield intrinsic dimension is also computed for a previous version of the database, where the source has a fixed position and orientation (HEMEW3D database, https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/LAI6YU&version=1.0 (last access: 12 July 2024)). With this database, the wavefield intrinsic dimension was around 3200. It is a reasonable fact that adding degrees of freedom with a random source increases the variability in data.
B2 Correlation dimension
The correlation dimension is determined as the slope of the linear part in the log–log representation of C_{N} (Fig. B3). This definition is subject to some interpretation since one should determine which portion constitutes the linear part. Nevertheless, we found that small variations in the linear part limits had very little influence on the slope estimate (less than one unit).
B3 MLEbased intrinsic dimension
The intrinsic dimension based on the maximum likelihood estimator (MLE) has been computed with the scikitdimension package. Figure B4 shows the evolution of the intrinsic dimension as a function of the number of samples for geological models and velocity fields.
B4 Structural similarity index
Figure B5 exemplifies two pairs of geological models with high similarity (SSIM of 0.6) but different properties. The first pair (Fig. B5a and b) has similar mean values but different heterogeneities, while in the second pair, geological models are almost homogeneous but exhibit different mean values (Fig. B5c and d).
The MultipleInput Fourier Neural Operator (MIFNO) architecture is shown in Fig. C1. Details about the model are given in Lehmann et al. (2024).
FL, FG, and DC designed the study. FL conducted the analyses. MB and DC conceived the original idea. FL wrote the paper with input from all authors.
The contact author has declared that none of the authors has any competing interests.
Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
We used the code from https://github.com/bakerjw/GMMs/tree/master (last access: 9 July 2024, Baker, 2022) to compute the GMMs. The authors are grateful for the resources and human support of the Très Grand Centre de Calcul (TGCC, CCRT, France). They thank the anonymous reviewers for their numerous suggestions that helped to improve the paper.
This paper was edited by Andrea Rovida and reviewed by two anonymous referees.
Aki, K. and Richards, P. G.: Quantitative Seismology, Freeman, San Francisco, 1980. a
Altindal, A. and Askan, A.: SIGMOIDTR: A Simulated Ground Motion Dataset for Turkey, Zenodo [data set], https://doi.org/10.5281/ZENODO.7007917, 2022. a
Annon: Simulate CO_{2} Flow with Open Porous Media, Github [code], https://github.com/microsoft/AzureClusterlessHPC.jl/tree/main/examples/opm, 2022. a
Arias, A.: A Measure of Earthquake Intensity, in: Seismic design for nuclear plants, edited by: Hansen, R. J., The MIT Press, 438–483, ISBN 9780262080415, 1970. a
Arroucau, P.: A Preliminary ThreeDimensional Seismological Model of the Crust and Uppermost Mantle for Metropolitan Franc, Tech. Rep. SIGMA22018D2014, https://www.sigma2.net/medias/files/sigma22018d20143dvelocitymodelfranceapprovedpublic.pdf (last access: 12 April 2022), 2020. a
Atkinson, G. M.: GroundMotion Prediction Equation for SmalltoModerate Events at Short Hypocentral Distances, with Application to InducedSeismicity Hazards, B. Seismol. Soc. Am., 105, 981–992, https://doi.org/10.1785/0120140142, 2015. a, b, c, d
Atkinson, G. M. and Boore, D. M.: Earthquake GroundMotion Prediction Equations for Eastern North America, B. Seismol. Soc. Am., 96, 2181–2205, https://doi.org/10.1785/0120050245, 2006. a, b
Bahrampouri, M., RodriguezMarek, A., Shahi, S., and Dawood, H.: An Updated Database for Ground Motion Parameters for KiKnet Records, Earthquake Spectra, 37, 505–522, https://doi.org/10.1177/8755293020952447, 2021. a
Baker, J. W.: GMM, Github [code], https://github.com/bakerjw/GMMs/tree/master (last access: 9 July 2024), 2022. a
Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., and Kashinath, K.: Modelling Atmospheric Dynamics with Spherical Fourier Neural Operators, in: ICLR 2023 Workshop on Tackling Climate Change with Machine Learning, https://www.climatechange.ai/papers/iclr2023/47 (last access: 24 August 2023), 2023. a
CastroCruz, D., Gatti, F., and LopezCaballero, F.: HighFidelity Broadband Prediction of Regional Seismic Response: A Hybrid Coupling of PhysicsBased Synthetic Simulation and Empirical Green Functions, Natural Hazards, 108, 1997–2031, https://doi.org/10.1007/s1106902104766x, 2021. a
CEA, CentraleSupélec, IPGP and CNRS: SEM3D, Github [code], https://github.com/sem3d/SEM (last access: 17 April 2024), 2017. a, b
Chaljub, E., Celorio, M., Cornou, C., Martin, F. D., Haber, E. E., Margerin, L., Marti, J., and Zentner, I.: Numerical Simulation of Wave Propagation in Heterogeneous and Random Media for Site Effects Assessment in the Grenoble Valley, in: He 6th IASPEI/IAEE International Symposium: Effects of Surface Geology on Seismic Motion, 2021. a
Chernov, L. A.: Wave Propagation in a Random Medium, Translated by Richard A. Silverman, Mineola, New York, dover publications Edn., 1960. a
Chiou, B. S.J. and Youngs, R. R.: Update of the Chiou and Youngs NGA Model for the Average Horizontal Component of Peak Ground Motion and Response Spectra, Earthquake Spectra, 30, 1117–1153, https://doi.org/10.1193/072813EQS219M, 2014. a, b
Colvez, M.: Influence of the Earth's Crust Heterogeneities and Complex Fault Structures on the Frequency Content of Seismic Waves, Ph.D. thesis, Université Paris Saclay, ParisSaclay, https://theses.hal.science/tel03551874 (last access: 11 March 2023), 2021. a
Convertito, V., De Matteis, R., Amoroso, O., and Capuano, P.: Ground Motion Prediction Equations as a Proxy for Medium Properties Variation Due to Geothermal Resources Exploitation, Sci. Rep., 12, 12632, https://doi.org/10.1038/s4159802216815x, 2022. a
de Carvalho Paludo, L., Bouvier, V., and Cottereau, R.: Scalable Parallel Scheme for Sampling of Gaussian Random Fields over Very Large Domains: Parallel Scheme for Sampling of Random Fields over Very Large Domains, Int. J. Numer. Meth. Eng., 117, 845–859, https://doi.org/10.1002/nme.5981, 2019. a
De Martin, F., Chaljub, E., Thierry, P., Sochala, P., Dupros, F., Maufroy, E., Hadri, B., Benaichouche, A., and Hollender, F.: Influential Parameters on 3D Synthetic Ground Motions in a Sedimentary Basin Derived from Global Sensitivity Analysis, Geophys. J. Int., 227, 1795–1817, https://doi.org/10.1093/gji/ggab304, 2021. a
Deng, C., Feng, S., Wang, H., Zhang, X., Jin, P., Feng, Y., Zeng, Q., Chen, Y., and Lin, Y.: OpenFWI: Largescale MultiStructural Benchmark Datasets for Full Waveform Inversion, in: Advances in Neural Information Processing Systems, edited by: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., Curran Associates, Inc., 35, 6007–6020, https://proceedings.neurips.cc/paper_files/paper/2022/file/27d3ef263c7cb8d542c4f9815a49b69bPaperDatasets_and_Benchmarks.pdf (last access: 24 August 2023), 2022. a, b, c, d, e, f
Ding, Y., Chen, S., Li, X., Wang, S., Luan, S., and Sun, H.: SelfAdaptive PhysicsDriven Deep Learning for Seismic Wave Modeling in Complex Topography, Eng. Appl. Artif. Intel., 123, 106425, https://doi.org/10.1016/j.engappai.2023.106425, 2023. a
Duverger, C., MazetRoux, G., Bollinger, L., Trilla, A. G., Vallage, A., Hernandez, B., and Cansi, Y.: A Decade of Seismicity in Metropolitan France (2010–2019): The CEA/LDG Methodologies and Observations, Bulletin de la Société Géologique de France, 192, p. 25, https://doi.org/10.1051/bsgf/2021014, 2021. a
El Haber, E., Cornou, C., Jongmans, D., LopezCaballero, F., Youssef Abdelmassih, D., and AlBittar, T.: Impact of Spatial Variability of Shear Wave Velocity on the Lagged Coherency of Synthetic Surface Ground Motions, Soil Dyn. Earthq. Eng., 145, 106689, https://doi.org/10.1016/j.soildyn.2021.106689, 2021. a
Equinor: Sleipner 2019 Benchmark Model, CO2DataShare [data set], https://doi.org/10.11582/2020.00004, 2020. a
Faccioli, E., Maggio, F., Paolucci, R., and Quarteroni, A.: 2d and 3D Elastic Wave Propagation by a PseudoSpectral Domain Decomposition Method, J. Seismol., 1, 237–251, https://doi.org/10.1023/A:1009758820546, 1997. a
Feng, S., Wang, H., Deng, C., Feng, Y., Liu, Y., Zhu, M., Jin, P., Chen, Y., and Lin, Y.: E^{FWI}: Multiparameter Benchmark Datasets for Elastic Full Waveform Inversion of Geophysical Properties, in: ThirtySeventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, https://openreview.net/forum?id=3BQaMV9jxK (last access: 10 June 2024), 2023. a, b
Fu, H., He, C., Chen, B., Yin, Z., Zhang, Z., Zhang, W., Zhang, T., Xue, W., Liu, W., Yin, W., Yang, G., and Chen, X.: 18.9Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18Hz and 8Meter Scenarios, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '17, Association for Computing Machinery, New York, NY, USA, ISBN 9781450351140, https://doi.org/10.1145/3126908.3126910, 2017. a, b
Gadylshin, K., Lisitsa, V., Gadylshina, K., Vishnevsky, D., and Novikov, M.: Machine LearningBased Numerical Dispersion Mitigation in Seismic Modelling, in: Computational Science and Its Applications – ICCSA 2021, edited by: Gervasi, O., Murgante, B., Misra, S., Garau, C., Blečić, I., Taniar, D., Apduhan, B. O., Rocha, A. M. A. C., Tarantino, E., and Torre, C. M., Springer International Publishing, Cham, 34–47, ISBN 9783030866532, 2021. a
Gatti, F. and Clouteau, D.: Towards Blending PhysicsBased Numerical Simulations and Seismic Databases Using Generative Adversarial Network, Comput. Method. Appl. M., 372, 113421, https://doi.org/10.1016/j.cma.2020.113421, 2020. a
Grady, T. J., Khan, R., Louboutin, M., Yin, Z., Witte, P. A., Chandra, R., Hewett, R. J., and Herrmann, F. J.: ModelParallel Fourier Neural Operators as Learned Surrogates for LargeScale Parametric PDEs, Comput. Geosci., 178, 105402, https://doi.org/10.1016/j.cageo.2023.105402, 2023. a
Grassberger, P. and Procaccia, I.: Measuring the Strangeness of Strange Attractors, Physica D, 9, 189–208, 1983. a
Hartzell, S., Harmsen, S., and Frankel, A.: Effects of 3D Random Correlated Velocity Perturbations on Predicted Ground Motions, B. Seismol. Soc. Am., 100, 1415–1426, https://doi.org/10.1785/0120090060, 2010. a
Heinecke, A., Breuer, A., Rettenberger, S., Bader, M., Gabriel, A. A., Pelties, C., Bode, A., Barth, W., Liao, X. K., Vaidyanathan, K., Smelyanskiy, M., and Dubey, P.: Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers, in: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA, ISBN 21674337, 3–14, https://doi.org/10.1109/SC.2014.6, 2014. a, b
Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., MuñozSabater, J., Nicolas, J., Peubey, C., Radu, R., Schepers, D., Simmons, A., Soci, C., Abdalla, S., Abellan, X., Balsamo, G., Bechtold, P., Biavati, G., Bidlot, J., Bonavita, M., De Chiara, G., Dahlgren, P., Dee, D., Diamantakis, M., Dragani, R., Flemming, J., Forbes, R., Fuentes, M., Geer, A., Haimberger, L., Healy, S., Hogan, R. J., Hólm, E., Janisková, M., Keeley, S., Laloyaux, P., Lopez, P., Lupu, C., Radnoti, G., de Rosnay, P., Rozum, I., Vamborg, F., Villaume, S., and Thépaut, J.N.: The ERA5 Global Reanalysis, Q. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803, 2020. a
Imperatori, W. and Mai, P. M.: BroadBand nearField Ground Motion Simulations in 3Dimensional Scattering Media, Geophys. J. Int., 192, 725–744, https://doi.org/10.1093/gji/ggs041, 2013. a
Jessell, M., Guo, J., Li, Y., Lindsay, M., Scalzo, R., Giraud, J., Pirot, G., Cripps, E., and Ogarko, V.: Into the Noddyverse: a massive data store of 3D geological models for machine learning and inversion applications, Earth Syst. Sci. Data, 14, 381–392, https://doi.org/10.5194/essd143812022, 2022. a, b
Karimpouli, S. and Tahmasebi, P.: Physics Informed Machine Learning: Seismic Wave Equation, Geosci. Front., 11, 1993–2001, https://doi.org/10.1016/j.gsf.2020.07.007, 2020. a
Khazaie, S., Cottereau, R., and Clouteau, D.: Influence of the Spatial Correlation Structure of an Elastic Random Medium on Its Scattering Properties, J. Sound Vib., 370, 132–148, https://doi.org/10.1016/j.jsv.2016.01.012, 2016. a
Komatitsch, D. and Tromp, J.: Introduction to the Spectral Element Method for ThreeDimensional Seismic Wave Propagation, Geophys. J. Int., 139, 806–822, https://doi.org/10.1046/j.1365246x.1999.00967.x, 1999. a
Lehmann, F.: Physicsbased Simulations of 3D Wave Propagation with Source Variability: HEMEW∧S3D, Recherche Data Gouv [data set], https://doi.org/10.57745/LAI6YU, 2023. a, b, c
Lehmann, F.: Lehmannfa/HEMEW3D: Initial Version, Zenodo [code], https://doi.org/10.5281/ZENODO.13625608, 2024. a
Lehmann, F., Gatti, F., Bertin, M., and Clouteau, D.: Machine Learning Opportunities to Conduct HighFidelity Earthquake Simulations in MultiScale Heterogeneous Geology, Front. Earth Sci., 10, https://doi.org/10.3389/feart.2022.1029160, 2022. a, b
Lehmann, F., Gatti, F., and Clouteau, D.: MultipleInput Fourier Neural Operator (MIFNO) for SourceDependent 3D Elastodynamics, arXiv [preprint], https://doi.org/10.48550/arXiv.2404.10115, 2024. a, b
Levina, E. and Bickel, P.: Maximum Likelihood Estimation of Intrinsic Dimension, in: Advances in Neural Information Processing Systems, edited by Saul, L., Weiss, Y., and Bottou, L., MIT Press, vol. 17, https://proceedings.neurips.cc/paper_files/paper/2004/file/74934548253bcab8490ebd74afed7031Paper.pdf (last access: 17 April 2023), 2004. a
Liu, B., Yang, S., Ren, Y., Xu, X., Jiang, P., and Chen, Y.: DeepLearning Seismic FullWaveform Inversion for Realistic Structural Models, Geophysics, 86, R31–R44, https://doi.org/10.1190/geo20190435.1, 2021. a
Maechling, P. J., Silva, F., Callaghan, S., and Jordan, T. H.: SCEC Broadband Platform: System Architecture and Software Implementation, Seismol. Res. Lett., 86, 27–38, https://doi.org/10.1785/0220140125, 2015. a, b
Mansoor, K., Buscheck, T., Yang, X., Carroll, S., and Chen, X.: LLNL Kimberlina 1.2 NUFT Simulations June 2018 (V2), https://doi.org/10.18141/1603336, 2020. a
Michelini, A., Cianetti, S., Gaviano, S., Giunchi, C., Jozinović, D., and Lauciani, V.: INSTANCE – the Italian seismic dataset for machine learning, Earth Syst. Sci. Data, 13, 5509–5544, https://doi.org/10.5194/essd1355092021, 2021. a, b
Moczo, P., Kristek, J., Bard, P.Y., Stripajová, S., Hollender, F., Chovanová, Z., Kristeková, M., and Sicilia, D.: Key Structural Parameters Affecting Earthquake Ground Motion in 2D and 3D Sedimentary Structures, B. Earthq. Eng., 16, 2421–2450, https://doi.org/10.1007/s1051801803455, 2018. a
Molinari, I. and Morelli, A.: EPcrust: A Reference Crustal Model for the European Plate: EPcrust, Geophys. J. Int., 185, 352–364, https://doi.org/10.1111/j.1365246X.2011.04940.x, 2011. a, b, c, d
Moseley, B., Markham, A., and NissenMeyer, T.: Solving the Wave Equation with PhysicsInformed Deep Learning, ArXiv [preprint], https://doi.org/10.48550/arxiv.2006.11894, 2020. a
Mousavi, S. M. and Beroza, G. C.: Machine Learning in Earthquake Seismology, Annu. Rev. Earth Pl. Sc., 51, 105–129, https://doi.org/10.1146/annurevearth071822100323, 2023. a, b, c
Mousavi, S. M., Sheng, Y., Zhu, W., and Beroza, G. C.: STanford EArthquake Dataset (STEAD): A Global Data Set of Seismic Signals for AI, IEEE Access, 7, 179464–179476, https://doi.org/10.1109/ACCESS.2019.2947848, 2019. a, b
Ovadia, O., Kahana, A., Stinis, P., Turkel, E., and Karniadakis, G. E.: ViTO: Vision TransformerOperator, ArXiv [preprint], https://doi.org/10.48550/arXiv.2303.08891, 2023. a
Paolucci, R., Smerzini, C., and Vanini, M.: BBSPEEDset: A Validated Dataset of Broadband NearSource Earthquake Ground Motions from 3D PhysicsBased Numerical Simulations, B. Seismol. Soc. Am., 111, 2527–2545, https://doi.org/10.1785/0120210089, 2021. a, b
Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., Kurth, T., Hall, D., Li, Z., Azizzadenesheli, K., Hassanzadeh, P., Kashinath, K., and Anandkumar, A.: FourCastNet: A Global Datadriven Highresolution Weather Model Using Adaptive Fourier Neural Operators, ArXiv [preprint], https://doi.org/10.48550/arXiv.2202.11214, 2022. a
Petrone, F., Abrahamson, N., McCallen, D., Pitarka, A., and Rodgers, A.: Engineering Evaluation of the EQSIM Simulated Groundmotion Database: The San Francisco Bay Area Region, Earthquake Eng. Struc., 50, 3939–3961, https://doi.org/10.1002/eqe.3540, 2021. a
Poursartip, B., Fathi, A., and Tassoulas, J. L.: LargeScale Simulation of Seismic Wave Motion: A Review, Soil Dyn. Earthq. Eng., 129, 105909, https://doi.org/10.1016/j.soildyn.2019.105909, 2020. a
Qiu, H., Yang, Y., and Pan, H.: Underestimation Modification for Intrinsic Dimension Estimation, Pattern Recogn., 140, 109580, https://doi.org/10.1016/j.patcog.2023.109580, 2023. a, b
Raissi, M., Perdikaris, P., and Karniadakis, G.: PhysicsInformed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations, J. Comput. Phys., 378, 686–707, https://doi.org/10.1016/j.jcp.2018.10.045, 2019. a
RashtBehesht, M., Huber, C., Shukla, K., and Karniadakis, G. E.: PhysicsInformed Neural Networks (PINNs) for Wave Propagation and Full Waveform Inversions, J. Geophys. Res.Sol. Ea., 127, e2021JB023120, https://doi.org/10.1029/2021JB023120, 2022. a
Rekoske, J. M., Gabriel, A.A., and May, D. A.: Instantaneous PhysicsBased Ground Motion Maps Using ReducedOrder Modeling, J. Geophys. Res.Sol. Ea., 128, e2023JB026975, https://doi.org/10.1029/2023JB026975, 2023. a
Ren, P., Rao, C., Chen, S., Wang, J.X., Sun, H., and Liu, Y.: SeismicNet: Physicsinformed Neural Networks for Seismic Wave Modeling in SemiInfinite Domain, Comput. Phys. Commun., 295, https://doi.org/10.1016/j.cpc.2023.109010, 2024. a
Ross, D. A., Lim, J., Lin, R.S., and Yang, M.H.: Incremental Learning for Robust Visual Tracking, Int. J. Comput. Vis., 77, 125–141, https://doi.org/10.1007/s1126300700757, 2008. a
Rosti, A., Smerzini, C., Paolucci, R., Penna, A., and Rota, M.: Validation of PhysicsBased Ground Shaking Scenarios for Empirical Fragility Studies: The Case of the 2009 L'Aquila Earthquake, B. Earthquake Eng., 21, 95–123, https://doi.org/10.1007/s10518022015541, 2023. a
Scalise, M., Pitarka, A., Louie, J. N., and Smith, K. D.: Effect of Random 3D Correlated Velocity Perturbations on Numerical Modeling of Ground Motion from the Source Physics Experiment, B. Seismol. Soc. Am., 111, 139–156, https://doi.org/10.1785/0120200160, 2021. a
Shahjouei, A. and Pezeshk, S.: Alternative Hybrid Empirical GroundMotion Model for Central and Eastern North America Using Hybrid Simulations and NGAWest2 Models, B. Seismol. Soc. Am., 106, 734–754, https://doi.org/10.1785/0120140367, 2016. a, b
Shinozuka, M. and Deodatis, G.: Simulation of Stochastic Processes by Spectral Representation, Appl. Mech. Rev., 44, 191–204, https://doi.org/10.1115/1.3119501, 1991. a
Smerzini, C., Paolucci, R., and Stupazzini, M.: Comparison of 3D, 2D and 1D Numerical Approaches to Predict Long Period Earthquake Ground Motion in the Gubbio Plain, Central Italy, B. Earthquake Eng., 9, 2007–2029, https://doi.org/10.1007/s1051801192898, 2011. a
Song, C., Liu, Y., Zhao, P., Zhao, T., Zou, J., and Liu, C.: Simulating MultiComponent Elastic Seismic Wavefield Using Deep Learning, IEEE Geosci. Remote Sens. Lett., 20, 1–5, https://doi.org/10.1109/LGRS.2023.3250522, 2023. a
Ta, Q.A., Clouteau, D., and Cottereau, R.: Modeling of Random Anisotropic Elastic Media and Impact on Wave Propagation, Eur. J. Comput. Mech., 19, 241–253, https://doi.org/10.3166/ejcm.19.241253, 2010. a
Tarbali, K. and Bradley, B.: Ground Motion Selection for Scenario Ruptures Using the Generalised Conditional Intensity Measures (GCIM) Method, Earthq. Eng. Struct. D., 44, 1601–1621, https://doi.org/10.1002/eqe.2546, 2015. a
Thompson, E. M., Baise, L. G., and Kayen, R. E.: Spatial Correlation of ShearWave Velocity in the San Francisco Bay Area Sediments, Soil Dyn. Earthq. Eng., 27, 144–152, https://doi.org/10.1016/j.soildyn.2006.05.004, 2007. a
Touhami, S., Gatti, F., LopezCaballero, F., Cottereau, R., de Abreu Corrêa, L., Aubry, L., and Clouteau, D.: SEM3D: A 3D HighFidelity Numerical Earthquake Simulator for Broadband (0–10 Hz) Seismic Response Prediction at a Regional Scale, Geosciences, 12, 112, https://doi.org/10.3390/geosciences12030112, 2022. a
Wang, Z., Bovik, A., Sheikh, H., and Simoncelli, E.: Image Quality Assessment: From Error Visibility to Structural Similarity, IEEE T. Image Process., 13, 600–612, https://doi.org/10.1109/TIP.2003.819861, 2004. a
Wen, G., Li, Z., Long, Q., Azizzadenesheli, K., Anandkumar, A., and Benson, S. M.: RealTime HighResolution CO_{2} Geological Storage Prediction Using Nested Fourier Neural Operators, Energ. Environ. Sci., 16, 1732–1741, https://doi.org/10.1039/d2ee04204e, 2023. a
Witte, P. A., Konuk, T., Skjetne, E., and Chandra, R.: Fast CO2 Saturation Simulations on LargeScale Geomodels with Artificial IntelligenceBased Wavelet Neural Operators, Int. J. Greenhouse Gas Control, 126, 103880, https://doi.org/10.1016/j.ijggc.2023.103880, 2023. a, b
Wu, Y., Aghamiry, H. S., Operto, S., and Ma, J.: Helmholtz Equation Solution in NonSmooth Media by PhysicsInformed Neural Network with Incorporating Quadratic Terms and a Perfectly Matching Layer Condition, Geophysics, 88, 1–66, https://doi.org/10.1190/geo20220479.1, 2023. a
Zhang, T., Trad, D., and Innanen, K.: Learning to Solve the Elastic Wave Equation with Fourier Neural Operators, Geophysics, 88, 1–63, https://doi.org/10.1190/geo20220268.1, 2023. a
Zhu, C., Riga, E., Pitilakis, K., Zhang, J., and Thambiratnam, D.: Seismic Aggravation in Shallow Basins in Addition to Onedimensional Site Amplification, J. Earthquake Eng., 24, 1477–1499, https://doi.org/10.1080/13632469.2018.1472679, 2020. a
Zhu, W., Hou, A. B., Yang, R., Datta, A., Mousavi, S. M., Ellsworth, W. L., and Beroza, G. C.: QuakeFlow: A Scalable MachineLearningBased Earthquake Monitoring Workflow with Cloud Computing, Geophys. J. Int., 232, 684–693, https://doi.org/10.1093/gji/ggac355, 2022. a
 Abstract
 Introduction
 Related work
 Dataset creation
 Dataset analysis
 Applications
 Limitations and perspectives
 Code and data availability
 Conclusions
 Appendix A: Dataset description
 Appendix B: Dimensionality of data
 Appendix C: MultipleInput Fourier Neural Operator (MIFNO)
 Author contributions
 Competing interests
 Disclaimer
 Acknowledgements
 Review statement
 References
 Abstract
 Introduction
 Related work
 Dataset creation
 Dataset analysis
 Applications
 Limitations and perspectives
 Code and data availability
 Conclusions
 Appendix A: Dataset description
 Appendix B: Dimensionality of data
 Appendix C: MultipleInput Fourier Neural Operator (MIFNO)
 Author contributions
 Competing interests
 Disclaimer
 Acknowledgements
 Review statement
 References