INSTANCE – the Italian seismic dataset for machine learning

Michelini, Alberto; Cianetti, Spina; Gaviano, Sonja; Giunchi, Carlo; Jozinović, Dario; Lauciani, Valentino

doi:https://doi.org/10.5194/essd-13-5509-2021

Articles | Volume 13, issue 12

Data description paper

30 Nov 2021

Abstract

The Italian earthquake waveform data are collected here in a dataset suited for machine learning (ML) analysis applications. The dataset consists of nearly 1.2 million three-component (3C) waveform traces from about 50 000 earthquakes and more than 130 000 noise 3C waveform traces, for a total of about 43 000 h of data and an average of 21 3C traces provided per event. The earthquake list is based on the Italian Seismic Bulletin (http://terremoti.ingv.it/bsi, last access: 15 February 2020) of the Istituto Nazionale di Geofisica e Vulcanologia between January 2005 and January 2020, and it includes events in the magnitude range between 0.0 and 6.5. The waveform data have been recorded primarily by the Italian National Seismic Network (network code IV) and include both weak- (HH, EH channels) and strong-motion (HN channels) recordings. All the waveform traces have a length of 120 s, are sampled at 100 Hz, and are provided both in counts and in physical units of ground motion after deconvolution of the instrument transfer functions. The waveform dataset is accompanied by metadata consisting of more than 100 parameters providing comprehensive information on the earthquake source, the recording stations, the trace features, and other derived quantities. This rich set of metadata allows the users to target the data selection for their own purposes. Many of these metadata can be used as labels in ML analysis or for other studies. The dataset, assembled in HDF5 format, is available at http://doi.org/10.13127/instance (Michelini et al., 2021).

Received: 11 May 2021 – Discussion started: 27 May 2021 – Revised: 08 Oct 2021 – Accepted: 17 Oct 2021 – Published: 30 Nov 2021

1 Introduction

Important breakthroughs in the understanding of earthquake phenomena can be achieved through the analysis of the very large number of continuous waveform recordings stored in the existing seismic archives. To this end, it can be important to make available well-organized representative subsets of the archives together with their associated metadata information.

The recent developments of machine learning (ML) software platforms like TensorFlow, PyTorch, Keras, Caffe (see Abadi et al., 2016; Paszke et al., 2019; Chollet and others, 2015; and Jia et al., 2014, respectively); the availability of high performance computing hardware (i.e., GPUs); and the access to thoroughly selected benchmark datasets (e.g., STEAD, https://github.com/smousavi05/STEAD, last access: 19 November 2021; and LEN-DB, https://doi.org/10.5281/zenodo.3648231) offer new opportunities to apply ML methodologies to seismological and earthquake engineering problems. In particular, the use of sophisticated and optimized ML algorithms for the analysis of large amounts of seismic data can lead to remarkable improvements for automated tasks like seismic waveform onset picking, ground motion prediction, and earthquake early warning; for the detection of hidden signals currently recognized as noise; or for novel modeling and inversion strategies (see Kong et al., 2018; Bergen et al., 2019; and Dramsch, 2020, for recent reviews). Specifically, the advent of ML in the field of seismology has highlighted the importance of reference datasets for benchmarking the developed methodologies, and it has fostered more thorough and statistically sound schemes for analyzing the data, like splitting all the available data into training, validation, and test sets. Moreover, the introduction of competitions like those for predicting laboratory earthquakes launched on the Kaggle platform (https://www.kaggle.com/c/LANL-Earthquake-Prediction/data, last access: 19 November 2021) or the SeismOlympics (Fang et al., 2017), which attracted several thousand teams, evidences even more the great potential of benchmark datasets (Johnson et al., 2021) and the general interest to tackle seismology problems with ML.

The application of ML techniques to seismological waveform data can be quite straightforward. Indeed, large amounts of labeled data are already available thanks to the analyses carried out for many decades by expert analysts that have compiled and reviewed earthquake catalogs (which include phase onset readings, earthquake location, and size estimates) or that have assembled ground motion parameters in special flat files and maps of strong ground motion among the most common tasks. Their work provides effectively metadata that can be associated with the recorded waveforms and that can be used as labels when performing ML analysis. A main bottleneck in wide-scale implementation of ML is, however, the fast access to the waveforms and to the associated metadata. Open-access waveform archives available to the seismological community (e.g., EIDA, Strollo et al., 2021; or IRIS, Ingate, 2008) were mainly designed for preserving the continuous data and making them available to the scientific community. In practice, one of the main goals of seismological data centers has been the seamless acquisition of continuous data from the networks and the preservation, curation, and archiving of the entire record of continuous waveforms. In this context, the users have complete flexibility in the selection of the data to download, but accessing large data volumes can be very time consuming. Thus, despite the achievements attained in the last decades with the implementation of well-tested and efficient web services (e.g., FDSN dataselect), the accessibility of remote servers still remains cumbersome (Quinteros et al., 2021). It follows that in order to attract a broader audience of users and developers there is a strong need to assemble and publish benchmark datasets that can be readily used with the existing software platforms (Mousavi et al., 2019). 
In practical terms, this means assembling quality-checked data and metadata in volumes and formats that are ready to be used in ML applications.

Recently, efforts have been made to assemble and make publicly available datasets consisting of waveforms and associated metadata. In detail, the dataset used in the works by Ross et al. (2018 a), Ross et al. (2018 b), and Meier et al. (2019) is downloadable from the Southern California Earthquake Data Center at the web portal https://scedc.caltech.edu/data/deeplearning.html (last access: 19 November 2021). This dataset includes 4.8 million time series recorded by nearly 700 receivers from more than 270 000 earthquakes in southern California. The STEAD dataset assembled by Mousavi et al. (2019) includes 1.2 million 3C traces comprising 450 000 local earthquakes and 100 000 noise windows recorded by more than 2600 stations at the global scale. The LEN-DB dataset (Magrini et al., 2020) is also a global dataset of local earthquakes and includes 1.2 million 3C waveform traces, with half belonging to earthquakes and half to noise. The NEIC dataset (Yeck and Patton, 2020) includes global data and has been used by Yeck et al. (2020) to train three separate convolutional neural network models on 1.3 million seismic-phase arrivals to predict arrival time onset, phase type, and distance.

Results attained by Ross et al. (2018 b), L. Zhu et al. (2019), W. Zhu et al. (2019), Mousavi et al. (2020), and Mousavi and Beroza (2020) are excellent examples of successful applications of ML which can improve substantially the earthquake detection level with respect to most traditional methods, leading to the location of tiny and previously undetected earthquakes improving our knowledge on the heterogeneity of stress release on known and unknown faults. This enhanced information is crucial to make more thorough assessments of the ongoing seismotectonics and seismic hazard. The ML methods are likely to become an irreplaceable tool in seismology to extract as much information as possible from the large amount of data already stored in the archives. Among the indirect advantages, the enhanced detection can, to some extent, also govern network densification with sensible reductions in equipment investments and maintenance costs.

In general, the impressive performance of ML applications has been strongly related to the availability of large amounts of data with associated properly labeled metadata. Large amounts of data are critical to perform proper training and avoid overfitting. However, the preparation of an ML dataset is also tedious and very time consuming. These are the main reasons that motivated the work presented in this article. Our goal is to provide an open-access dataset consisting of raw and instrument-corrected waveform data and associated metadata to study earthquake occurrence in Italy. The data collection, named INSTANCE, gathers seismic waveform data from weak- and strong-motion stations that have been extracted from the Italian EIDA node (Danecek et al., 2021; see Sect. 6 for a full list of the FDSN networks included in the dataset). The metadata associated with the waveforms are extracted from the INGV earthquake catalogue and from the waveform traces themselves. We expect this reference dataset to be used for several different purposes spanning from improvements of the existing configurations of seismic monitoring in Italy to the development and testing of new techniques for earthquake detection and ground motion estimation.

2 Earthquakes

2.1 Data preparation

The data collection was assembled following the main stages listed below:

earthquake selection;
station selection;
waveform data selection and download;
cross-validation between phase-based station selection and downloaded waveform data;
processing of the data counts waveforms;
application of the instrument transfer function to the waveforms.

2.1.1 Earthquake selection

To compile the waveform dataset, we started from the Italian Seismic Bulletin (http://terremoti.ingv.it/en/bsi, last access: 15 February 2020, INGV bulletin hereinafter) and seismic stations archives (http://terremoti.ingv.it/iside, last access: 15 February 2020). These data are public and can be queried using the fdsnws-event (https://www.fdsn.org/webservices/fdsnws-event-1.2.pdf, last access: 19 November 2021) and the fdsnws-station web services provided by INGV. The event data belong to the INGV bulletin, which has been adopting the same velocity model and earthquake location software in the time period included in this study (see Appendix B for details).

The first step consisted of retrieving all the earthquakes with M≥0 from 1 January 2005 to 31 January 2020 in an enlarged area within the latitude and longitude corners (35.0, 5.0) and (49.0, 19.0). A total of 315 225 earthquakes were found. The beginning of the query corresponds approximately with the update, renovation, and increase in the number of stations of the national seismic network (Michelini et al., 2016; Danecek et al., 2021; Margheriti et al., 2021). Around 2005, the INGV network (FDSN code IV) underwent a major upgrade, with the existing, predominantly analog, instruments being replaced by high-quality digital seismic data loggers and new, mostly broadband (and some extended short period), three-component (3C) sensors. Selected stations were also complemented with additional 3C strong-motion sensors. The upgrade resulted in more than a 2-fold increase in the number of stations of the IV network. In addition, since 2005, there have been many temporary deployments of seismic stations coinciding with earthquake sequences and specific experiments, the data of which are also available through the EIDA INGV node (Danecek et al., 2021). The total number of stations also increased thanks to the contribution of the networks belonging to other Italian institutions (e.g., the University of Genoa, the National Institute of Oceanography and Experimental Geophysics (OGS), and the University of Naples, among others). This increment resulted in a significant improvement of the detection of low-magnitude earthquakes. At the regional scale of Italy, the magnitude of completeness of the INGV bulletin is approximately M 1.7–M 1.8, although significant differences occur depending on the area. In this regard, the preferred INGV catalogue magnitude is the local magnitude, M_l (Richter, 1935), but sometimes M_w and M_d are also used (see below for additional details).

A relevant aspect when compiling a large dataset to be used for ML purposes consists of gathering a balanced distribution of data. In seismology, when using earthquake magnitude for classification, balanced representation is impossible to achieve because small earthquakes, following the Gutenberg–Richter power law relating magnitude to the number of earthquakes (Gutenberg and Richter, 1944), outnumber larger earthquakes. To address this issue (or at least to mitigate its influence), we chose to select in our target area

the great majority of the earthquakes with M≥4.0 – of the 30 earthquakes that were discarded, all but 5 occurred outside the Italian borders, mainly in the Balkan area (the discarded earthquakes in Italy, all with M<5, will be included in a future update of the dataset);
earthquakes with origin times differing by more than 120 s in the range $2.0 \leq M < 4.0$; and
an additional 20 000 earthquakes, randomly selected, with origin times differing by more than 120 s for M<2.0.
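
The three selection rules above could be sketched roughly as follows. This is only an illustration and not the code used to build the dataset: the `(origin_time_s, magnitude)` tuple layout and the function names are hypothetical, "origin times differing by more than 120 s" is interpreted here as isolation from both neighbouring events in the full catalogue, and all M≥4.0 events are kept for simplicity.

```python
import random


def isolated_events(events, min_gap_s=120.0):
    """Keep events whose origin time is more than min_gap_s away from
    both neighbouring events in the time-sorted catalogue (assumed reading)."""
    ordered = sorted(events, key=lambda e: e[0])
    kept = []
    for i, ev in enumerate(ordered):
        prev_ok = i == 0 or ev[0] - ordered[i - 1][0] > min_gap_s
        next_ok = i == len(ordered) - 1 or ordered[i + 1][0] - ev[0] > min_gap_s
        if prev_ok and next_ok:
            kept.append(ev)
    return kept


def select_events(events, n_small=20000, seed=0):
    """events: list of (origin_time_s, magnitude) tuples (hypothetical layout)."""
    large = [e for e in events if e[1] >= 4.0]        # simplification: keep all M >= 4.0
    isolated = isolated_events(events)
    medium = [e for e in isolated if 2.0 <= e[1] < 4.0]
    small = [e for e in isolated if e[1] < 2.0]
    rng = random.Random(seed)                          # fixed seed for reproducibility
    small = rng.sample(small, min(n_small, len(small)))
    return large + medium + small
```

Fixing the random seed makes the M<2.0 subsample reproducible, which is convenient when a selection has to be regenerated later.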

The resulting distribution of the earthquakes according to their magnitude is detailed in Table 1, and they are mapped in Fig. 1a.

Table 1. Final data selection. “All” indicates the total number of earthquakes in the INGV bulletin in the time period between 1 January 2005 and 31 January 2020, “Selected” and “Percent kept” refer to the earthquakes, and “Nb. 3C records” refers to the waveform traces included in the dataset.

Figure 1. Map of the earthquakes included in the dataset shown as solid circles with colors selected according to depth (a), and map of the available moment tensors with colors assigned depending on the focal mechanism (b). Symbol size, in both maps, is proportional to earthquake magnitude.

2.1.2 Station selection

In order to gather high-quality earthquake signals, we based our choice on the most accurately picked P- and S-wave onset phases published in the INGV bulletin. In this regard, the manual picking of the arrival phases is routinely performed by a group of about 20 INGV highly trained staff personnel who also review the hypocenter locations and magnitude determination before bulletin publication. These manually reviewed locations are indicated as preferred solutions in the INGV bulletin. In practice, we have selected only those stations that had P- and, if available, S-wave onset picks associated with the preferred location of the INGV bulletin. We note that the strong-motion data provided by the national strong-motion network (Rete Accelerometrica Nazionale) operated by the Italian Department of Civil Protection do not enter in the earthquake picking and location performed by the INGV staff, and the same data are not available through EIDA. They may be included, however, in future releases of the dataset.

In summary, we have adopted the following criteria to identify the waveform records to be included in the dataset after the earthquake selection above was applied:

all stations that feature P-wave onset phases (and S-wave onset phases when available) used for the preferred earthquake location (no distinction is made between Pg, and Pn and no secondary phases like PmP are picked);
all stations with waveform data available through the Italian EIDA node (see the dataset contributing networks in the pie diagram of Fig. 5b);
P- and S-wave location residual times less than 1.0 s;
P- and S-wave phases that contributed to the location with a weight larger than 10 %.

This selection procedure reduced the number of P-wave phases from ∼1.9 million to ∼1.2 million and the number of S-wave phases from ∼1.1 million to ∼0.7 million.
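
As a rough illustration of the residual and weight criteria listed above, a pick filter might look like the sketch below. The dictionary keys (`event_id`, `station`, `phase`, `time_s`, `residual_s`, `weight_pct`) are hypothetical and do not reflect the INGV bulletin schema; the check on EIDA waveform availability is not modelled here.

```python
def select_records(picks, max_residual_s=1.0, min_weight_pct=10.0):
    """Group accepted picks by (event, station) and keep records with a P pick.

    picks: iterable of dicts with hypothetical keys
    event_id, station, phase ("P" or "S"), time_s, residual_s, weight_pct.
    """
    records = {}
    for pick in picks:
        # Reject picks with a location residual of 1.0 s or more,
        # or a location weight of 10 % or less.
        if abs(pick["residual_s"]) >= max_residual_s:
            continue
        if pick["weight_pct"] <= min_weight_pct:
            continue
        key = (pick["event_id"], pick["station"])
        records.setdefault(key, {})[pick["phase"]] = pick["time_s"]
    # A record requires a P pick; the S pick is optional.
    return {key: phases for key, phases in records.items() if "P" in phases}
```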

2.1.3 Waveform data selection and download

The selection procedure described in Sect. 2.1.2 resulted in the compilation of a list of waveform data time windows to be downloaded from the EIDA continuous waveform archive. We chose a time window of 120 s in order to include both P and S waves from stations at distances of up to ∼600 km from the hypocenter. Indeed, in these cases, the S − P time differences are approximately 75–80 s. Adding about 20 s of signal before the P-wave time and about 20 s after the S wave, we end up with a 120 s window that captures the most significant earthquake signals for either the most distant stations, in the case of crustal depth earthquakes, or closer stations, in the case of deep earthquakes of the Calabrian Arc subduction.

More technically, the time windows set for data download were defined by inserting a randomly selected buffer time ranging between 15 and 20 s before the P-wave onset arrival phase and enlarging the time window to 125 s. The adoption of 125 s long windows at the data download stage is arbitrary since after data processing the time windows have been all set to 120 s. This criterion ensured that the great majority of the waveform traces downloaded featured a pre-P-wave onset buffer time between 15 and 20 s. However, we found that, when dealing with such a large number of waveforms acquired by diversified instruments configured differently, some discrepancies may occur. In practice, since the data are archived in miniSEED compressed format that features different sizes of the logical records, and since the web service extracts the full logical record containing the predefined trace start time, the start time of the trace can be earlier than the predefined minimum time of 20 s (i.e., in this case, there is a longer time interval between the P-arrival and the actual trace start time). In contrast, when data are missing before the P-wave onset time (i.e., in the 15–20 s pre-P-onset buffer time), start time of the extracted window can be delayed and a shorter time interval will separate the trace window start time from the P-wave arrival time (i.e., <15 s). See Fig. D1 in the Appendix for the distribution of the P- and S-wave phase arrival time samples. The data (miniSEED format) were downloaded using the FDSN dataselect web services provided by INGV (http://terremoti.ingv.it/en/webservices_and_software, last access: 19 November 2021). Using a set of 14 container-based querying procedures running in parallel, this stage required about 7 d to complete the download of the ∼4 million waveform traces (i.e., ∼1.3 million 3C traces), with a storage requirement of ∼80 GB (miniSEED STEIM1 compression).
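
The definition of the download window described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the download code itself: times are plain seconds on an arbitrary clock, and the function name and the uniform distribution of the buffer are assumptions.

```python
import random


def download_window(p_arrival_s, rng, min_buffer_s=15.0, max_buffer_s=20.0,
                    length_s=125.0):
    """Return (start_s, end_s) of the window requested for one record.

    The window starts a random 15-20 s before the P-wave onset and is
    125 s long; traces are cut to 120 s later, during processing.
    """
    buffer_s = rng.uniform(min_buffer_s, max_buffer_s)  # assumed uniform draw
    start_s = p_arrival_s - buffer_s
    return start_s, start_s + length_s


# Budget behind the 120 s choice: ~20 s pre-P + ~80 s S-P time + ~20 s post-S.
rng = random.Random(42)
start_s, end_s = download_window(1000.0, rng)
```

Randomizing the pre-P buffer avoids placing the P onset at the same sample in every trace, which a model could otherwise learn as a shortcut.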

2.1.4 Cross-validation between phase-based metadata and downloaded waveform data

After the massive data download was concluded, a list of all the downloaded files was generated. This list was intersected with the originally selected metadata (Sect. 2.1.2) to have a one-to-one correspondence between the miniSEED data and the metadata (i.e., each 3C waveform record – three miniSEED files – must correspond to a row of the metadata file).
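
A minimal sketch of such an intersection is given below, assuming that each metadata row can be identified by an `(event_id, station)` key and each downloaded file by an `(event_id, station, component)` key. These keys and the component labels E, N, and Z are illustrative assumptions.

```python
def cross_validate(metadata_keys, downloaded_files, components=("E", "N", "Z")):
    """Keep metadata rows that have all three component files available.

    metadata_keys: iterable of (event_id, station) tuples (hypothetical keys).
    downloaded_files: iterable of (event_id, station, component) tuples.
    """
    found = {}
    for event_id, station, component in downloaded_files:
        found.setdefault((event_id, station), set()).add(component)
    required = set(components)
    complete = {key for key, comps in found.items() if required <= comps}
    # Preserve the metadata order; drop rows without a complete 3C record.
    return [key for key in metadata_keys if key in complete]
```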

2.1.5 Preparation of processed waveforms in digital units

This part of our data assembling procedure targets the preparation of the digital counts waveform traces. It includes the following steps:

removal of traces containing data gaps (i.e., missing data);
trimming the waveform trace to the nearest sample to the start time;
120 s trace windowing;
removal of mean and linear trends from the data;
resampling at 100 Hz;
calculation of the signal-to-noise ratio;
extraction of the data quality metrics.
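
Two of the steps above, windowing and the removal of the mean and linear trend, can be sketched in plain Python as follows. The function names are hypothetical, and a real workflow would more likely rely on an established library such as ObsPy; gap removal, resampling, and the quality metrics are not shown.

```python
def window_trace(samples, sampling_rate_hz, length_s=120.0):
    """Cut the trace to length_s seconds, counted from its first sample."""
    n_samples = int(round(length_s * sampling_rate_hz))
    if len(samples) < n_samples:
        raise ValueError("trace shorter than the requested window")
    return samples[:n_samples]


def detrend_linear(samples):
    """Remove the mean and the least-squares linear trend from a trace."""
    n = len(samples)
    if n < 2:
        return [0.0] * n
    t_mean = (n - 1) / 2.0
    y_mean = sum(samples) / n
    sxx = sum((i - t_mean) ** 2 for i in range(n))
    sxy = sum((i - t_mean) * (y - y_mean) for i, y in enumerate(samples))
    slope = sxy / sxx
    return [y - (y_mean + slope * (i - t_mean)) for i, y in enumerate(samples)]
```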

No rotation of the horizontal component along the N–S and E–W directions was required since all sensors used are oriented accordingly. For each waveform trace (i.e., each component), the maximum value of signal-to-noise ratio (SNR) was extracted and kept as metadata. The SNR was calculated as

$$ SNR = 20 \log_{10} \frac{| S_{95} |}{| N_{95} |}, \tag{1} $$

where $| S_{95} |$ and $| N_{95} |$ are the 95th percentiles of the absolute values of the data in a 5 s window immediately after the S-wave onset and immediately before the P-wave arrival time, respectively. If the S-wave onset was not available, the S-wave window was determined from the predicted S-wave arrival, calculated using an average velocity of 3.0 km s⁻¹ and the hypocentral distance.
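
Equation (1) can be illustrated with the short sketch below. It assumes a single-component trace given as a list of samples at 100 Hz, onsets expressed as sample indices, and a linearly interpolated percentile; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math


def percentile_95(values):
    """95th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = 0.95 * (len(ordered) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def predicted_s_sample(origin_sample, hypo_dist_km, sampling_rate_hz=100.0,
                       vs_km_s=3.0):
    """Predicted S onset (sample index) for an average velocity of 3.0 km/s."""
    return origin_sample + int(round(hypo_dist_km / vs_km_s * sampling_rate_hz))


def snr_db(trace, p_sample, s_sample, sampling_rate_hz=100.0, window_s=5.0):
    """SNR in dB from 5 s windows after the S onset and before the P onset.

    Assumes at least one sample before the P onset and a non-zero noise level.
    """
    n = int(round(window_s * sampling_rate_hz))
    noise = [abs(x) for x in trace[max(p_sample - n, 0):p_sample]]
    signal = [abs(x) for x in trace[s_sample:s_sample + n]]
    return 20.0 * math.log10(percentile_95(signal) / percentile_95(noise))
```

Using a high percentile rather than the maximum makes the estimate less sensitive to isolated spikes in either window.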

During this stage of the data preparation, we have also calculated some quality parameters extracted from the waveform traces for the purpose of a later inclusion in the metadata information. These additional parameters, providing the distribution of the trace values, have been computed using the MSEEDMetadata class of the ObsPy python software (Beyreuther et al., 2010; Megies et al., 2011; Krischer et al., 2015). For the same purpose, we have determined the number of spikes using a Hampel filter on a 161-sample sliding window to find outliers in the traces.
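
A Hampel-type spike count on a 161-sample sliding window might be sketched as follows. The 3-sigma threshold, the scaling constant 1.4826 (which makes the median absolute deviation consistent with a Gaussian standard deviation), and the truncation of the window at the trace edges are assumptions, since the text specifies only the window length.

```python
import statistics


def count_spikes_hampel(trace, window=161, n_sigmas=3.0):
    """Count samples deviating from the local median by more than
    n_sigmas * 1.4826 * MAD, using a centred sliding window."""
    half = window // 2
    scale = 1.4826  # MAD-to-sigma factor for Gaussian data
    count = 0
    for i, value in enumerate(trace):
        lo = max(0, i - half)               # window is truncated at the edges
        hi = min(len(trace), i + half + 1)
        segment = trace[lo:hi]
        median = statistics.median(segment)
        mad = statistics.median([abs(v - median) for v in segment])
        if abs(value - median) > n_sigmas * scale * mad:
            count += 1
    return count
```

This direct implementation sorts each window and is therefore slow on 12 000-sample traces; it is meant to show the logic rather than to be used at scale.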

The final dataset consists of a total of 1 159 249 3C waveform data records from 54 008 earthquakes in count units assembled within an HDF5 format file. Table 1 provides the number of traces within each magnitude interval of the final assembled dataset.

2.1.6 Application of the instrument transfer function to the waveforms

To make the dataset of more general use, we have also generated a dataset in units of physical ground motion after deconvolving the instrument response. To this end, we have downloaded the station response files for all the stations used and applied the transfer functions to the individual traces with frequency filtering corners 0.01, 0.04, 25, and 40 Hz using a cosine flank frequency domain taper (see cosine_sac_taper in ObsPy) and applying a 5 % cosine tapering at both ends of the trace signal. After removing the instrument response, we extracted the intensity measures (IMs, i.e., peak ground acceleration, PGA; peak ground velocity, PGV; and the spectral accelerations at a 0.3, 1.0, and 3.0 s period) on each component so that they could be included among the metadata parameters. Peak ground displacements are not included since they are from single or double integration of velocity and acceleration records, respectively, and their determination can be inaccurate when performed automatically.
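
The shape of a cosine flank taper with the four corner frequencies quoted above can be written down explicitly. The sketch below evaluates the taper at a single frequency; it is a standalone illustration of the functional form and is not ObsPy's implementation.

```python
import math


def cosine_flank_taper(freq_hz, corners=(0.01, 0.04, 25.0, 40.0)):
    """Frequency-domain taper: 0 outside (f1, f4), 1 within [f2, f3],
    and raised-cosine flanks in between."""
    f1, f2, f3, f4 = corners
    if freq_hz <= f1 or freq_hz >= f4:
        return 0.0
    if freq_hz < f2:
        return 0.5 * (1.0 - math.cos(math.pi * (freq_hz - f1) / (f2 - f1)))
    if freq_hz <= f3:
        return 1.0
    return 0.5 * (1.0 + math.cos(math.pi * (freq_hz - f3) / (f4 - f3)))
```

With these corners, the spectrum is left untouched between 0.04 and 25 Hz and smoothly suppressed below 0.01 Hz and above 40 Hz, which limits the amplification of low-frequency noise during deconvolution.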

2.2 Metadata description

The 115 metadata associated with each 3C waveform trace of our collection are listed in Table 2. They provide different kinds of information that can be subdivided into four main types – source, station, trace, and path metadata. The unit of each metadata parameter is given in its name.

The source metadata provide information on the earthquake with description of the source origin time; location; size; and, when available, the focal mechanism, the moment tensor, and the finite fault.

The station metadata provide information on the characteristics of the recording station, which include the station, channel, network, and location (SCNL) (cf. http://www.fdsn.org/seed_manual/SEEDManual_V2.4.pdf, last access: 19 November 2021); the geographical coordinates; and the average shear-wave velocity of the top 30 m of the Earth, V_S,30, which is an important parameter for classifying sites in seismic engineering applications (e.g., Boore, 2004) and is extracted from the map used in the INGV implementation of the USGS ShakeMap software in Italy (Michelini et al., 2019).

The trace metadata consist of parameters that are extracted from the waveform traces, like maximum and minimum amplitudes, root mean squared values of the traces, and, after application of the transfer function, intensity measures (IMs) of the ground motion. In this class of metadata, we include the P- (and S-) wave arrival times provided by the INGV bulletin and, in addition, the number of P and S picks obtained by processing the waveforms with two deep-learning, phase-picking and event-detection algorithms (GPD and EQTransformer; Ross et al., 2018 a; Mousavi et al., 2020) to make the user aware that the waveform trace being used may include more than a single earthquake (see discussion further below).

The path metadata follow from the calculation of parameters that link the types of metadata above (e.g., traveltimes, hypocentral, and epicentral distances).
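
As an illustration of how such path parameters can be derived, the sketch below computes the epicentral distance with the haversine formula on a spherical Earth and the hypocentral distance with a flat-Earth approximation. Both approximations, the Earth radius of 6371 km, and the function names are assumptions; the text does not state how the distances in the metadata were computed.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def epicentral_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between epicenter and station (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def hypocentral_distance_km(epi_km, depth_km, station_elevation_km=0.0):
    """Straight-line distance, neglecting Earth curvature (local distances)."""
    return math.hypot(epi_km, depth_km + station_elevation_km)
```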

The rationale of our metadata selection reflects our intention of providing the users with comprehensive information about the data. This appears to be an important issue since the data, being recorded automatically, can suffer from many diverse problems deriving from malfunctioning of the data loggers and of the sensors or from poor data transmission. Since we seek to assemble a dataset that can be used also for analyzing real-time data streams using ML, we note that the automatic processing summarized above does not differ significantly from that routinely applied to the streamed data.

One alternative to our comprehensive metadata approach would have consisted of “cleaning” the dataset by removing the faulty traces altogether. We do not think this approach is appropriate since in this case the dataset would not be representative of the “true” data that are collected in real time by the monitoring networks. Thus, the basic idea behind our criterion is that we would like to enable the users to make their own choices using appropriate filters to exploit the data for their own purposes. For example, if a user looks for the cleanest data, this can be achieved by filtering the metadata accordingly (e.g., saturated velocimetric data acquired by broadband sensors equipped with 24 bit data loggers could be removed in a conservative fashion just by selecting only those traces with counts within $\pm 0.8 \times 2^{23}$). In contrast, the user could also opt to leave the ML model to learn the data problems so that they can be detected when using real data. An approach of this kind has been used by Jozinović et al. (2020) for missing data. In Jozinović et al. (2020), the dataset used for ML consists of a fixed number of stations, and when data from one or more stations are missing (either the whole trace or parts of it), the signal trace is set to be an array of zeros. The ML model used there was found to detect and learn the problematic values and to compensate for them, reaching a prediction accuracy on those stations similar to that obtained on the stations for which the input data were available. In practice, the provision of a rich set of waveform descriptive metadata is important not only to make use of an enlarged suite of labels that can be used for diverse purposes but also to identify problems with the waveform data and include or filter them out.
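
The conservative saturation filter mentioned in the example above could be sketched as follows. The dictionary keys `min_counts` and `max_counts` are hypothetical stand-ins for the per-trace minimum and maximum amplitudes in the metadata; the actual column names should be taken from Table 2.

```python
CLIP_LIMIT_COUNTS = 0.8 * 2 ** 23  # 80 % of the 24 bit full-scale range


def select_unsaturated(rows, limit=CLIP_LIMIT_COUNTS):
    """Keep rows whose minimum and maximum counts lie within +/- limit.

    rows: iterable of dicts with hypothetical keys min_counts and max_counts.
    """
    return [
        row for row in rows
        if -limit <= row["min_counts"] and row["max_counts"] <= limit
    ]
```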

Our metadata include P- and S-wave onsets manually picked by INGV analysts as provided in the INGV bulletin. Recall that the traces were selected to include just one P-wave arrival time and possibly one S-wave arrival time since we sought to assemble one earthquake per window trace. This criterion was chosen for the purpose of facilitating the training of ML models using traces containing just one earthquake (for phase picking, peak ground motions, etc.). However, even though we have made considerable efforts to isolate only one earthquake per time window, more than one can be present effectively within the same time window (e.g., the analyst did not see or just disregarded other events with smaller amplitudes). Because the presence of additional, unidentified earthquakes adds complexities to the ML training phase, we followed the same approach taken by Mousavi et al. (2019) to run automatic picking algorithms upon the waveform dataset and include as metadata also the number of P- and S-wave phases picked automatically by the generalized phase detection, GPD, technique proposed by Ross et al. (2018 a) and the EQTransformer technique by Mousavi et al. (2020). In the analysis we have used as detection threshold 0.99 for P- and S-phase detection for GPD and 0.2, 0.1 and 0.1 for earthquakes, P-phase detection, and S-phase detection, respectively, for EQTransformer. Both GPD and EQTransformer have been run only on the high-gain channels (i.e., HH, EH).

As presented above, metadata are important constituents of data collections. They can be used for identifying the data to be analyzed, and they can be used as labels in ML applications. In addition to the fact that not all the metadata information in INSTANCE is always available (e.g., moment tensors are generally available only for events with magnitudes $\sim M \geq 3.5$ or the S-wave onset pick retrieved from the INGV bulletin may not be present), we have found that the automatically processed ground motion trace data may suffer from errors because the original traces contained already undetected malfunctioning problems (e.g., spikes, anomalous trends) which, after application of the instrument transfer function, are mapped into erroneous ground motion traces and IM values. Similarly, it may have also occurred that in isolated cases the coefficients of the instrument transfer functions were incorrect, producing also in this case incorrect traces and IM values. To address these problems, we have operated in two ways. First, we have chosen to detect the traces' maximum and minimum values lying outside the acceptable physical range and to replace them with NumPy nan in the metadata file. This acceptable range was based on the IMs reported in the “flat” file of the ESM DB (https://esm-db.eu/, last access: 19 November 2021, Lanzano et al., 2018), which includes all the IMs (obtained from analyst processing) of all the recordings available of earthquakes with M≥4.0 in Europe. Secondly, we have verified our instrument transfer function processing procedure by cross-validating all our IM values with those reported in the ESM DB flat file. In this regard, we found a very good correspondence between the IMs obtained using the two methodologies, giving us confidence in the quality of the applied data processing and of the IM metadata being provided.

Table 2. List of the metadata for the events and noise waveform traces. The units are given in parentheses in the “Description” column. Only a subset of metadata can be associated with the noise traces (star in the “Noise” column).

The horizontal line between “trace_EQT_number_detections” and “trace_[E,N,Z]_pga_cmps2” separates the additional metadata obtained after application of the instrument response transfer function.

2.3 Dataset description

Figure 1a shows the earthquakes included in the dataset. The symbol size is proportional to the earthquake magnitude. We observe that the 54 008 selected earthquakes composing the dataset can be considered a representative subset of the entire seismicity in Italy and, for the larger events, also for those earthquakes occurring in the near vicinity of the Italian national borders. During the time span of our data selection three important sequences occurred in Italy after the main shocks of the 2009 L'Aquila M 6.0 earthquake, the 2012 Emilia M 5.9 earthquake, and the 2016 central Italy extended sequence which featured three main earthquakes with magnitudes M 6.0, M 5.9, and M 6.5.

In Fig. 1b we plot the 527 moment tensors included in the metadata. The size of the moment tensor symbol is proportional to source_magnitude, while the colors are defined according to the prevalent strain regime: indigo, lavender, and dark orange for strike slip, normal, and thrust faults, respectively. The prevalent strain regime is determined according to the fault's rake as derived from source_mechanisms_strike_dip_rake: strike slip for $- 45^{\circ} < rake < 45^{\circ}$ and $135^{\circ} < rake < 225^{\circ}$, normal for $225^{\circ} \leq rake \leq 315^{\circ}$, and thrust for $45^{\circ} \leq rake \leq 135^{\circ}$.
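
The rake-based classification above translates directly into a small function. The sketch below first maps any rake onto the interval [−45°, 315°) so that the quoted ranges can be applied as written; this wrapping step and the function name are assumptions.

```python
def strain_regime(rake_deg):
    """Classify the prevalent strain regime from the fault rake (degrees)."""
    # Wrap the rake onto [-45, 315), e.g. -90 -> 270 and 315 -> -45.
    rake = (rake_deg + 45.0) % 360.0 - 45.0
    if -45.0 < rake < 45.0 or 135.0 < rake < 225.0:
        return "strike_slip"
    if 45.0 <= rake <= 135.0:
        return "thrust"
    # Remaining values: 225 <= rake < 315, plus the wrapped boundary at -45 (= 315).
    return "normal"
```

For example, a rake of −90° wraps to 270° and is classified as normal faulting, while a rake of 90° is classified as thrust faulting.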

In Fig. 2 we show the maps of the stations included in the events and noise datasets, respectively. The symbol size in panel a is proportional to the number of phase arrivals reported by each station, while in panel b it is proportional to the number of waveforms included in the dataset for each station. Figure 2a shows that the number of reported phases varies considerably among the stations included in the event dataset. These differences depend on several factors, such as whether the stations are permanent or temporary, the duration of the acquisition, the noise level, and the level of seismicity of the area where the stations have been deployed. For example, many stations in central Italy report a large number of phases (and associated trace recordings), mainly because the area was struck by the 2009 and 2016 earthquake sequences. In contrast, stations located in the Po Plain generally feature a small number of phases, mainly because the high noise level makes phase picking difficult. The same variability in the number of available traces is not observed for the noise dataset shown in Fig. 2b, because we intentionally selected a roughly even number of traces for all the station channels.


Figure 2. Map of the stations used to assemble the events (a) and noise (b) datasets. The symbol size in panel (a) is proportional to the number of P phases and corresponding waveform traces available for each station. In panel (b) the symbol size is proportional to the number of traces. A total of 620 stations are included.

In Fig. 3, we show the distribution of the 3C record traces composing the dataset according to magnitude, earthquake-to-station epicentral distance, earthquake depth, and back azimuth. The histograms use the $\log_{10}$ scale to provide a complete representation of the distribution of the dataset, except for Fig. 3d, where we adopt the linear scale to emphasize the distribution of the back azimuth. Despite the attempt to balance the distribution of earthquakes according to magnitude (Sect. 2.1.1), Fig. 3a shows that our selection still (inevitably) reflects the Gutenberg–Richter increase in the number of earthquakes at smaller magnitudes. The largest number of trace records in the dataset belongs to earthquakes in the magnitude range $2 \leq M < 3$. The significant decrease in the number of traces for $M < 2$ follows from our choice to balance the dataset at small magnitudes by taking only about 7 % of the whole dataset. Regarding the epicentral distances of the stations (Fig. 3b), the great majority of the traces have been recorded within 200 km. The selected traces can be better appreciated in Fig. 4, where we show the magnitude versus hypocentral distance distribution of the dataset traces represented as density plots using hexagon binning (hexbin, Hunter, 2007). The earthquake depth distribution (Fig. 3c) shows that the great majority of the traces belong to shallow crustal earthquakes, although a few thousand occur in the depth range 100 to 300 km. At greater depths, the number of traces decreases sharply, and only a few hundred or fewer recordings are included in the depth range 400 to 550 km. Figure 3d shows that the great majority of the P- and S-wave onsets belong to paths oriented predominantly along the NW–SE direction, in agreement with the geographical trend of the Apennines and of peninsular Italy overall.
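The effect of the $\log_{10}$ scale on the histograms can be illustrated with a short sketch. The magnitudes below are synthetic values chosen for illustration only and are not taken from the dataset; the helper name `log10_histogram` and the bin edges are likewise hypothetical.

```python
import numpy as np


def log10_histogram(values, bin_edges):
    """Return the bin counts and their log10 (NaN where a bin is empty)."""
    counts, _ = np.histogram(values, bins=bin_edges)
    log_counts = np.full(counts.shape, np.nan)
    nonzero = counts > 0
    log_counts[nonzero] = np.log10(counts[nonzero])
    return counts, log_counts


# Synthetic magnitudes (illustrative only) binned in unit-wide magnitude intervals.
magnitudes = [0.5, 1.2, 2.1, 2.5, 2.9, 3.3]
edges = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
counts, log_counts = log10_histogram(magnitudes, edges)
```

On a logarithmic axis, sparsely populated bins (such as the largest magnitudes or the deepest events) remain visible next to bins holding several orders of magnitude more traces, which is why this scale is preferred for all panels except the back azimuth.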


Figure 3. Histograms of the distribution of the trace records composing the dataset according to magnitude (a), epicentral distances (b), earthquake depth (c), and back azimuth (d). The labels of the horizontal axis are assigned using the metadata names listed in Table 2.
