INSTANCE – the Italian seismic dataset for machine learning

Abstract. The Italian earthquake waveform data are collected here in a dataset suited for machine learning analysis (ML) applications.
The dataset consists of nearly 1.2 million three-component (3C) waveform traces from about 50 000 earthquakes and more than 130 000 noise 3C waveform traces, for a total of about 43 000 h of data and an average of 21 3C traces provided per event.
The earthquake list is based on the Italian Seismic Bulletin (http://terremoti.ingv.it/bsi, last access: 15 February 2020​​​​​​​) of the Istituto Nazionale di Geofisica e Vulcanologia between January 2005 and January 2020, and it includes events in the magnitude range between 0.0 and 6.5.
The waveform data have been recorded primarily by the Italian National Seismic Network (network code IV) and include both weak- (HH, EH channels) and strong-motion (HN channels) recordings.
All the waveform traces have a length of 120 s, are sampled at 100 Hz,
and are provided both in counts and ground motion physical units after deconvolution of the instrument transfer functions.
The waveform dataset is accompanied by metadata consisting of more than 100 parameters providing comprehensive information on the earthquake source, the recording stations, the trace features, and other derived quantities.
This rich set of metadata allows the users to target the data selection for their own purposes.
Much of these metadata can be used as labels in ML analysis or for other studies.
The dataset, assembled in HDF5 format,
is available at http://doi.org/10.13127/instance (Michelini et al., 2021).


1 Introduction 15 Important breakthroughs in the understanding of earthquake phenomena can be achieved through the analysis of the very large number of continuous waveform recordings stored in the existing seismic archives. To this end it can be important to make available well organized representative subsets of the archives together with their associated metadata information.
The recent developments of machine learning (ML) software platforms like TensorFlow (https://www.tensorflow.org), Py-Torch (https://pytorch.org), Keras (https://keras.io), Caffe (https://caffe.berkeleyvision.org) (see Abadi et al., 2016;Paszke 20 et al., 2019;Chollet and others, 2015;Jia et al., 2014, respectively), the availability of high performance computing hardware 1 (i.e., GPUs) and the access to thoroughly selected benchmark datasets (e.g., STEAD https://github.com/smousavi05/STEAD, LEN-DB https://doi.org/10.5281/zenodo.3648231) offer new opportunities to apply ML methodologies to seismological and earthquake engineering problems. In particular, the use of sophisticated and optimized ML algorithms for the analysis of large amounts of seismic data can lead to remarkable improvements for automated tasks like seismic waveform onset picking, ground motion prediction, earthquake early warning or for the detection of hidden signals currently recognized as noise or for novel 5 modeling and inversion strategies (see Kong et al., 2018;Bergen et al., 2019;Dramsch, 2020, for recent reviews). Specifically, the advent of ML in the field of seismology has highlighted the importance of reference datasets for benchmarking the developed methodologies and it has fostered more thorough and statistically sound schemes for analyzing the data, like splitting all the available data into training, validation and test sets. Moreover, the introduction of competitions like those for predicting laboratory earthquakes launched on the Kaggle platform (https://www.kaggle.com/c/LANL-Earthquake-Prediction/data) 10 or the SeismOlympics (Fang et al., 2017), that attracted several thousand teams, evidences even more the great potential of benchmark datasets (Johnson et al., 2021) and the general interest to tackle seismology problems with ML.
The application of ML techniques to seismological waveform data can be quite straightforward. Indeed, large amounts of labelled data are already available thanks to the analyses carried out since many decades by expert analysts that have compiled and reviewed earthquake catalogs (that include phase onset readings, earthquake location and size estimates) or that have 15 assembled ground motion parameters in special flat files and maps of strong ground motion amongst the most common tasks.
Their work provides effectively metadata that can be associated with the recorded waveforms and that can be used as labels when performing ML analysis. A main bottleneck in wide-scale implementation of ML is, however, the fast access to the waveforms and to the associated metadata. Open access waveforms archives available to the seismological community (e.g., EIDA, Strollo et al. (2021) http://eida.ingv.it/, or IRIS, Ingate (2008)) were mainly designed for preserving the continuous data 20 and making them available to the scientific community. In practice, one of the main goals of seismological data centers has been the seamless acquisition of continuous data from the networks and the preservation, curation and archiving of the entire record of continuous waveforms. In this context, the users have complete flexibility in the selection of the data to download but accessing large data volumes can be very time consuming. Thus, despite the achievements attained in the last decades with the implementation of well tested and efficient web services (e.g., FDSN dataselect), the accessibility of remote servers 25 still remains cumbersome (Quinteros et al., 2021). It follows that in order to attract a broader audience of users and developers there is a strong need to assemble and publish benchmark datasets that can be readily used with the existing software platforms (Mousavi et al., 2019). In practical terms, the matter consists of assembling quality checked data and metadata according to volume and formats ready to be used in ML applications.
Recently, effort has been made to assemble and make publicly available datasets consisting of waveforms and associated 30 metadata. In detail, the dataset used in the works by Ross et al. (2018a, b); Meier et al. (2019) is downloadable from the Southern California Earthquake Data Center at the web portal https://scedc.caltech.edu/data/deeplearning.html. This dataset includes 4.8 million time series recorded by nearly 700 receivers from more than 270,000 earthquakes in southern California.
The STEAD dataset assembled by Mousavi et al. (2019) includes 1.2 million of 3C traces comprising 450,000 local earthquakes and 100,000 noise windows recorded by more than 2600 stations at global scale. The LEN-DB dataset (Magrini et al., 2020) 35 is also a global dataset of local earthquakes and includes 1.2 million 3C waveform traces, half belonging to earthquakes and half to noise. The NEIC dataset (Yeck and Patton, 2020) includes global data and has been used by Yeck et al. (2020) to train the 1.3 million seismic-phase arrivals using three separate convolutional neural network models to predict arrival time onset, phase type, and distance.
Results attained by Ross et al. (2018b), Zhu et al. (2019a), Zhu et al. (2019b), Mousavi et al. (2020), and Mousavi and Beroza 5 (2020) are excellent examples of successful applications of ML which can improve substantially the earthquake detection level with respect to most traditional methods, leading to the location of tiny and previously undetected earthquakes improving our knowledge on the heterogeneity of stress release on known and unknown faults. This enhanced information is crucial to make more thorough assessments of the ongoing seismotectonics and seismic hazard. The ML methods are likely to become an irreplaceable tool in seismology to extract as much information as possible from the large amount of data already stored in the 10 archives. Among the indirect advantages, the enhanced detection can, to some extent, also govern network densification with sensible reductions in equipment investments and maintenance costs.
In general, the impressive performances of ML applications have been strongly related to the availability of large amounts of data with associated properly labeled metadata. Large amounts of data are critical to perform proper training and avoid data overfitting. However, the preparation of a ML dataset is also tedious and very time consuming. These are the main 15 reasons that motivated the work presented in this article. Our goal is to provide an open access dataset consisting of raw and instrument removed waveform data and associated metadata to study earthquake occurrence in Italy. The data collection, named INSTANCE, gathers seismic waveform data from weak and strong motion stations that have been extracted from the Italian EIDA node (Danecek et al., 2021, see section 7 for a full list of the FDSN networks included in the dataset).
The metadata associated to the waveforms are extracted from the INGV earthquake catalogue and from the waveform traces 20 themselves. We expect this reference dataset to be used for several different purposes spanning from improvements of the existing configurations of seismic monitoring in Italy to the development and testing of new techniques for earthquake detection and ground motion estimation.
2 Earthquakes 2.1 Data preparation 25 The data collection was assembled following the main stages listed below: 1. earthquake selection; 2. station selection; 3. waveform data selection and download; 4. cross-validation between phase-based station selection and downloaded waveform data; 30 5. processing of the data counts waveforms; 6. application of the instrument transfer function to the waveforms.

Earthquake selection
To compile the waveform dataset, we started from the Italian Seismic Bulletin (http://terremoti.ingv.it/en/bsi, INGV bulletin hereinafter) and seismic stations archives (http://terremoti.ingv.it/iside). These data are public and can be queried using the fdsnws-event (https://www.fdsn.org/webservices/fdsnws-event-1.2.pdf) and the fdsnws-station web services pro-5 vided by INGV. The event data belong to the INGV bulletin which has been adopting the same velocity model and earthquake location software in the time period included in this study (see Appendix B for detail).
The first step consisted of retrieving all the earthquakes with M ≥ 0 from 1 January 2005 to 31 January 2020 in an enlarged area within the latitude and longitude corners (35.0,5.0) and (49.0,19.0). A total of 315,225 earthquakes were found. The beginning of the query corresponds approximately with the update, renovation and increase in the number of stations of the 10 national seismic network (Michelini et al., 2016;Danecek et al., 2021;Margheriti et al., 2021). Around 2005, the INGV network (FDSN code IV) underwent a major upgrade with the existing, predominantly analog, instruments being replaced by high quality digital seismic data loggers and new, mostly broadband (and some extended short period), three-component (3C) sensors. Selected stations were also complemented with additional 3C strong motion sensors. The upgrade resulted in more than a two fold increase of the number of stations of IV network. Also and since 2005, there have been many temporary 15 deployments of seismic stations coinciding with earthquake sequences and specific experiments which data are also available through the EIDA INGV node (Danecek et al., 2021). The total number of stations also increased thanks to the contribution of the networks belonging to other Italian institutions (e.g., the University of Genoa, National Institute of Oceanography and Experimental Geophysics-OGS, and the University of Naples amongst others). This increment resulted in a significant improvement of the detection of low magnitude earthquakes. At the regional scale of Italy, the magnitude of completeness of 20 the INGV bulletin is around ∼ M 1.7 − M 1.8 although significant differences occur depending on the area. To this regard, the preferred INGV catalogue magnitude is the local magnitude, Ml, (Richter, 1935) but sometimes also Mw and Md (see below for additional detail).
A relevant aspect when compiling a large dataset to be used for ML purposes consists of gathering a balanced distribution of data. In seismology, when using earthquake magnitude for classification, balanced representation is impossible to achieve 25 because small size earthquakes, following the Gutenberg-Richter magnitude versus the number of earthquakes power-law (Gutenberg and Richter, 1944), outnumber larger earthquakes. To address this issue (or at least to mitigate its influence), we choose to select in our target area: • the great majority of the earthquakes with M ≥ 4.0. The earthquakes that have been discarded (30) occurred all but five outside the Italian country borders and mainly in the Balkan area. (The earthquakes in Italy, all with M < 5, will be 30 included in a future update of the dataset.) • earthquakes with origin times differing by more than 120 s in the range 2.0 ≤ M < 4.0; • additional 20,000 earthquakes, randomly selected, with origin times differing by more than 120 s for M < 2.0.
The resulting distribution of the earthquakes according to their magnitude is detailed in Table 1 and they are mapped in Fig. 1a. and 01/31/2020, "Selected" and "Percent kept" refer to the earthquakes, and "Nb. 3C records" to the waveform traces included in the dataset.

Station selection
In order to gather high quality earthquake signals, we based our choice on the most accurately picked P-and S-wave onset phases published in the INGV bulletin. To this regard, the manual picking of the arrival phases is routinely performed by a group of about 20 INGV highly trained staff personnel who also review the hypocenter locations and magnitude determination 5 before bulletin publication. These manually reviewed locations are indicated as preferred solutions in the INGV bulletin. In practice, we have selected only those stations that had P-and, if available, S-wave onset picks associated to the preferred location of the INGV bulletin. We note that the strong motion data provided by the national strong motion network ("Rete Accelerometrica Nazionale") operated by the Italian Department of Civil Protection do not enter in the earthquake picking and location performed by the INGV staff and the same data are not available through EIDA. They may be included, however, in 10 future releases of the dataset.
In summary, we have adopted the following criteria to identify the waveform records to be included in the dataset after the earthquake selection above was applied: • all stations that feature P-wave (and S-wave when available) onset phases used for the preferred earthquake location (no distinction is made between Pg and Pn and no secondary phases like PmP are picked);

15
• all stations with waveform data available through the INGV EIDA node (see the dataset contributing networks in the pie diagram of Fig. 5b); • P-and S-wave location residual times less than 1.0 s; • P-and S-wave phases that contributed to the location with a weight larger than 10 %.
This selection procedure reduced the number of P-and S-wave phases from ∼ 1.9 to ∼ 1.2 and from ∼ 1.1 to ∼ 0.7 millions, respectively.

Waveform data selection and download
The selection procedure described in section (2.1.2) resulted in the compilation of a list of waveform data time windows to be downloaded from the EIDA continuous waveform archive. We choose a time window of 120 s in order to include both P-and 5 S-waves from stations whose distance is up to~600 km from the hypocenter. Indeed, in these cases, the S − P time differences are approximately 75-80 s. Adding about 20 s of signal before the P-wave time and about 20 s after the S-wave, we end up with a 120 s window choice providing the most significant earthquake signals for either the most distant stations, in case of crustal depth earthquakes, or closer stations, in case of deep earthquakes of the Calabrian arc subduction.
More technically, the time windows set for data download were defined by inserting a randomly selected buffer time ranging 10 between 15 and 20 s before the P-wave onset arrival phase and enlarging the time window to 125 s. The adoption of 125 s long windows at the data download stage is arbitrary since after data processing the time windows have been all set to 120 s. This criterion ensured that the great majority of the waveform traces downloaded featured a pre-P-wave onset buffer time between 15 and 20 s. However, we found that, when dealing with such a large number of waveforms acquired by diversified instruments configured differently, some discrepancies may occur. In practice, since the data are archived in miniSEED compressed format 15 that feature different sizes of the logical records, and since the web service extracts the full logical record containing the predefined trace start time, it can occur that the start time of the trace is earlier than the predefined minimum time of 20 s (i.e., in this case, there is a longer time interval between the P-arrival and the actual trace start time). In contrast, when data are missing before the P-wave onset time (i.e., in the 15-20 s pre-P-onset buffer time), start time of the extracted window can be delayed and a shorter time interval will separate the trace window start time from the P-wave arrival time (i.e., < 15 s). See

Cross-validation between phases-based metadata and downloaded waveform data
After the massive data download was concluded, a list of all the downloaded files was generated. This list was intersected with the originally selected metadata (section 2.1.2) to have a one-to-one correspondence between the miniSEED data and the metadata (i.e., each 3C waveform record -three miniSEED files -must correspond to a row of the metadata file).
6 2.1.5 Preparation of processed waveforms in digital units This part of our data assembling procedure targets the preparation of the digital counts waveform traces. It includes the following steps.
-removal of traces containing data gaps (i.e., missing data); -trimming the waveform trace to the nearest sample to the start time;

5
-120 s trace windowing; -removal of mean and linear trends from the data; -re-sampling at 100 Hz; -calculation of the signal-to-noise ratio; -extraction of the data quality metrics.
where |S 95 | and |N 95 | are the 95 % percentile of the data absolute values in a 5 s window immediately after the S-wave onset 15 and right before the P-wave arrival time. If the S-wave onset were not available, the S-wave window was determined after calculation of the predicted S-wave arrival using an average velocity of 3.0 km s −1 and the hypocentral distance.
During this stage of the data preparation, we have also calculated some quality parameters extracted from the waveform traces to the purpose of a later inclusion in the metadata information. These additional parameters, providing the distribution of the trace values, have been computed using the MSEEDMetadata class of the obspy python software (Beyreuther et al.,20 2010; Megies et al., 2011;Krischer et al., 2015). To the same purpose, we have determined the number of spikes using a Hampel filter on a 161 samples sliding window to find outliers in the traces.
The final dataset consists of a total of 1,159,249 3C waveform data records from 54,008 earthquakes in counts units assembled within an hdf5 format file. Table 1 provides the number of traces within each magnitude interval of the final assembled dataset. To make the dataset of more general use, we have also generated a dataset in units of physical ground motion after deconvolving the instrument response. To this end, we have downloaded the station response files for all the stations used and applied the transfer functions to the individual traces with frequency filtering corners 0.01, 0.04, 25, 40 Hz using a cosine flank frequency domain taper (see cosine_sac_taper in Obspy), and applying a 5 % cosine tapering at both ends of the trace signal.
After removing the instrument response, we extracted the intensity measures (IMs, i.e., peak ground acceleration, PGA, peak ground velocity, PGV, and the spectral accelerations at 0.3, 1.0 and 3.0 seconds period) on each component so that they could be included amongst the metadata parameters. Peak ground displacements are not included since they be from single or double 5 integration of velocity and acceleration records, respectively, and their determination can result inaccurate when performed automatically.

Metadata description
The 115 metadata associated to each 3C waveform trace of our collection are listed in Table 2. They provide different kind of information that can be subdivided into four main types -source, station, trace and path metadata. The unit of each metadata 10 is provided in its denomination.
The source metadata provide information on the earthquake with description of the source origin time, location, size and, when available, the focal mechanism, the moment tensor, and the finite fault.
The station metadata provide information on the characteristics of the recording station which include the station, channel, network and location (SCNL) (cf. http://www.fdsn.org/seed_manual/SEEDManual_V2.4.pdf), the geographical coordinates 15 and the average shear-wave velocity of the top 30 m of the Earth, V S,30 , which is an important parameter for classifying sites in seismic engineering applications (e.g., Boore, 2004) and is extracted from the map used in the INGV implementation of the USGS-ShakeMap software in Italy .
The trace metadata consists of parameters that are extracted from the waveform traces like maximum and minimum amplitudes, root mean squared values of the traces and, after application of the transfer function, intensity measures (IMs) of the 20 ground motion. In this class of metadata, we include the P-(and S-wave) provided by the INGV bulletin and, in addition, the number of P and S picks obtained by processing the waveforms with two deep learning phase picking and event detection algorithms (GPD and EQTransformer; Ross et al., 2018a;Mousavi et al., 2020) to make the user aware that the waveform trace being used may include more than a single earthquake (see discussion further below).
The path metadata follow from the calculation of parameters that link the types of metadata above (e.g., traveltimes, hypocen-25 tral and epicentral distances).
The rationale of our metadata selection reflects our intention of providing the users with comprehensive information about the data. This appears an important issue since the data, being recorded automatically, can suffer of many diverse problems deriving from malfunctioning of the data loggers, of the sensors or from poor data transmission. Since we seek to assemble a data set that can be used also for analysing real time data streams using ML, we note that the automatic processing summarized 30 above does not differ significantly from that routinely applied to the streamed data.
One alternative to our metadata "comprehensive" approach would have consisted of "cleaning" the dataset by removing the faulty traces from the dataset altogether. We do not think this approach appropriate since in this case the dataset would not be representative of the "true" data that are collected in real-time by the monitoring networks. Thus, the basic idea behind our criterion is that we would like to enable the users to make their own choices using opportune filters to exploit the data for their own purposes. For example, if a user looks for the cleanest data, this can be achieved by filtering the metadata accordingly (e.g., saturated velocimetric data acquired by broadband sensors equipped with 24 bit data loggers could be removed in a conservative fashion just by selecting only those traces with counts within ±0.8 * 2 23 ). In contrast, the user could also opt to leave the ML model to learn the "data problems" so that they can be detected when using real data. An approach of this kind 5 has been used by (Jozinovic et al., 2020) for missing data. In Jozinovic et al. (2020), the dataset used for ML consists of a fixed number of stations and when data from one or more stations are missing (either the whole trace or parts of it), the signal trace is set to be an array of zeros. The ML model used there was found to detect and learn the problematic values, and compensate for it, having a similar prediction accuracy on those stations as the accuracy on the stations which had the input data available .
In practice, the provision of a rich set of waveform descriptive metadata is important not only to make use of an enlarged suite 10 of labels that can be used for diverse purposes but also to identify problems with the waveform data and include or filter them out.
Our metadata includes P-and S-wave onsets manually picked by INGV analysts as provided in the INGV bulletin. Recall that the traces were selected to include just one P-wave arrival time and possibly one S-wave arrival time since we sought to assemble one earthquake per window trace. This criterion was chosen to the purpose of facilitating the training of ML models 15 using traces containing just one earthquake (e.g., for phase picking, peak ground motions, ...). However, even though we have made considerable efforts to isolate only one earthquake per time window, more than one can be present effectively within the same time window (e.g., the analyst did not see or just disregarded other events with smaller amplitudes). Because the presence of additional, unidentified earthquakes adds complexities to the ML training phase, we followed the same approach taken by Mousavi et al. (2019) to run automatic picking algorithms upon the waveform dataset and include as metadata also the 20 number of P-and S-wave phases picked automatically by the generalized phase detection, GPD, technique proposed by Ross et al. (2018a) and the EQTransformer technique by Mousavi et al. (2020). In the analysis we have used as detection threshold: 0.99 for P-and S-phase detection for GPD, and 0.2, 0.1 and 0.1 for earthquakes, P-and S-phase detection, respectively, for EQTransformer. Both GPD and EQTransformer have been run only on the high gain channels (i.e., HH, EH).
As presented above, metadata are important constituents of data collections. They can be used for identifying the data to 25 be analyzed and they can be used as labels in ML applications. In addition to the fact that not all the metadata information in INSTANCE is always available (e.g., moment tensors are generally available only for events with magnitudes~M ≥ 3.5 or the S-wave onset pick retrieved from the INGV bulletin may not be present), we have found that the automatically processed ground motion trace data may suffer from errors because the original traces contained already undetected malfunctioning problems (e.g., spikes, anomalous trends) which, after application of the instrument transfer function, are mapped into erroneous ground from analyst processing) of all the recordings available of earthquakes with M ≥ 4.0 in Europe. Secondly, we have verified our instrument transfer function processing procedure by cross-validating all our IM values with those reported in the ESM DB "flat" file. To this regard, we found a very good correspondence between the IMs obtained using the two methodologies giving us confidence in the quality of the applied data processing and of the IM metadata being provided. Table (2). List of the metadata for the events and noise waveform traces. The units are given in parenthesis in the "Description" column.
Only a subset of metadata can be associated to the noise traces (star in the "Noise" column).    In Fig. 1b we plot the 527 moment tensors included in the metadata. The size of the moment tensors symbol is proportional to source_magnitude while the colors are defined according to the prevalent strain regime: black, blue and red for strike slip, normal and thrust faults, respectively. The prevalent strain regime is determined according to the fault's rake as derived 10 from source_mechanisms_strike_dip_rake: strike slip for -45 • < rake < 45 • and 135 • < rake < 225 • ; normal for In Fig. 2 Table 2. significant decrease of the number of traces for M < 2 follows from our choice to balance the dataset at small magnitudes by taking only about 7 % of the whole dataset. For what concerns the epicentral distances of the stations (Fig. 3b), the great majority of the traces have been recorded within 200 km. A better appreciation of the selected traces can be obtained from the observation of Fig. 4 where we show the magnitude versus hypocentral distance distribution of the dataset traces represented as density plots using hexagon binning (hexbin, Hunter, 2007). The earthquake depth distribution (Fig. 3c) shows that the great 5 majority of the traces belong to shallow crustal earthquakes although a few thousand occur in the depth range 100 to 300 km.
At greater depths, the number of traces decreases sharply and only a few hundred or less recordings are included in the depth range 400 to 550 km. Figure 3d shows that the great majority of the P-and S-wave onsets belong to paths more frequent along the NW-SE direction in agreement with the geographical trend of the Apennines and of peninsular Italy overall. The polarities associated to the P-wave onsets (trace_polarity) are shown in Fig. 5c and have been reported in only 20 % of the total number of traces. Although this represents only a fraction of the dataset, we are confident that its number (~235,000 ) is likely large enough to be used in a ML dedicated model (e.g., Ross et al., 2018b) for training and testing, and  Table 2. then used to recover the polarities of the remaining batch. To this regard, it is noteworthy to observe that the positive and negative polarities have a ratio nearly 2:1. In the Appendix C we have examined the possible origin for this asymmetry. In addition, we would like to point out that, although the source_type is provided amongst the metadata, there are inherent difficulties to identify man made sources by the staff analysts.
The magnitude type distribution (source_magnitude_type) is shown in Fig. 5d. The Wood-Anderson local magnitude 5 M L (Richter, 1935) is calculated predominantly (~96 %). The moment magnitude M w is determined for earthquakes with M L ≥ 3.5 and when enough good quality station data are available (Scognamiglio et al., 2009). The M d magnitude is used only when it is impossible to determine the M L and it is provided mainly in the first years of the dataset when the IV network still included a considerable number of analog stations.
In Fig. 6a,c we present the histograms of the P-and S-wave residual times included in the dataset. Fig. 6b,d shows the phase 10 arrival weights resulting from the earthquake locations for P-and S-phases, respectively.
To provide a broader perspective of the dataset and with the intent of showing the wide range of waveform paths that have been included we present, in Fig. 7, the hexbin plots of the traveltime for both P-and S-wave arrival times used in the locations.
These panels have been arranged using four different maximum distances and are useful for visualizing the dominant structure of the data selection given the large number of data. More specifically, it can be observed that the hexagon binning panels for 15 the larger distance ranges (700 and 200 km max distance) and for both P-and S-wave traveltimes (Fig. 7a,b,e,f) highlight well both the direct and Moho refracted travel times. At smaller distance ranges (100 and 40 km, Fig. 7c,d,g,h), it is evident that our dataset includes waveforms that propagated across crustal structures with different velocities. This is very evident, for example, for both P-and S-wave in the hexbin plots where at small distance are observable very low Vp and Vs velocities.  Table 2 for the specific metadata being represented.
In the following, we will focus on the trace amplitude metadata. These parameters are important for refined selection of the traces and are extracted from both the raw waveforms expressed in "counts" and from the traces in physical units after application of the instrument transfer function. Some of these parameters can be obtained without any knowledge on the earthquake source parameters whereas others, like the SNR, require knowledge on the arrival times of P-and S-wave onset times. The SNR distributions are shown in Fig. 10 as histograms and versus distance and magnitude (M ≥ 2) as hexbin plots in 10 Fig. 11. The histograms show that the peak values for the whole dataset are at~10 db for the two horizontal components and slightly less for the vertical (~6 db). This is expected because the S-wave motion in the shallow, near surface low velocity layers is polarized on a plane perpendicular to the nearly vertical propagation direction of the wavefront, implying that the ground motion occurs mainly along the horizontal components. In any event, the distribution of the SNR values of our dataset can be considered sensible given that values larger than 2 already provide distinct earthquake signals. In contrast and at the lower end 15 of the SNR distribution, we find that 10% (see Table 3) of the trace data of the HH channels have SNR values less than 2.3  Table 2 are used as labels.
(1.2 for the vertical component) that corresponds to roughly to 59,000 waveform traces out of the 592,000 traces of the HH channels included in the dataset. This number of low SNR traces could still be used, for example, to train machine learning models aimed to the detection of very small magnitude earthquakes little above the background noise level.
The hexbin plots of Fig. 11 provide a snapshot of the dominant levels of SNR with distance and magnitude. It is observed that higher values occur for nearby earthquakes and that the SNR progressively decreases at farther distances. Conversely and 5 as expected, the SNR generally increases with larger magnitude earthquakes.
The hexbin plots of the distribution of the IMs with distance for earthquakes with M ≥ 2 are shown in Fig. 12 whereas their associated distributions are shown in Fig. D4. The panels evidence a broad concentration of ground motion values deriving from the inclusion of earthquake recordings from different distances and magnitudes. The panels also evidence some horizontal stripes at higher and lower values of ground motion resulting presumably from the acquisition and processing problems  The metadata names listed in Table 2 are used as labels for the specific metadata being represented.

Examples of event data traces
Some examples of the data traces are shown in Fig. 14, 15 and 16. The traces have been selected randomly according to certainly non-exhaustive criteria described in the figure caption using, as a guideline, the metadata distribution illustrated in Sect. 2.3.
In Fig. 14 we show the traces in counts of events recorded by the broadband instruments (HH channels). Specifically, the 5 first three rows ( Fig. 14a-i) show traces for different ranges of magnitude which, taken together, represent more than 80 % of the total HH traces. In the following two panel rows (Fig. 14j-o) we show examples of traces selected according to SNR and distance criteria that evidence that more than 65 % of the traces feature relatively high SNR (i.e., ≥ 10) within the whole distance range covered by our data collection. The seismograms shown in the last panel row (Fig. 14p-r) provide some samples of recordings of the largest events (M ≥ 4) where we found that ∼ 87 % feature SNR ≥ 10.  Table 2. are unevenly distributed about the mean value as result of acquisition or processing problems; and iv.) the values of peak acceleration and velocity ground motion parameters. The user, depending on desiderata, can customize the selection criteria.
In Table 3, we provide a basic quantification of the distribution of the relevant metadata shown in Fig. 15 and, in Table 4, we present the distribution of the values of the maximum horizontal ground acceleration and velocity expressed as % g and cm s −1 , respectively. Some of the values reported in these two tables are used for our trace selections. 5 In Fig. 15 (a-c) and Fig. 15 (d-f) we show some traces that have been selected from the HH channels according to the number of P-and S-wave onset picks greater than 3 detected through the application of the GPD and EQTransformer techniques, respectively. Based on the values reported in Table 3, the presence of these multiple event traces in the dataset is less than the 10 %. In Fig. 15 Table 2. 6 % of the entire HH channels dataset. In the bottom row of Fig. 15 (m-o), we show that excluding the very low 25 % of SNR values from the previous selection ( Fig. 15 (j-l)) it is possible to select traces that do not suffer of particular problems.
22 Figure (11). Hexbin representation of the distribution of the signal-to-noise ratio for the E, N and Z components of the earthquake dataset as function of hypocentral distance distance and magnitude. The metadata names listed in Table 2 are used as labels for the specific metadata being represented.
In particular, we see that just by selecting a higher threshold of SNR values, about 85 % of the first and last 10 % of the distribution of median values, the traces appear acceptable. In Fig. 16, we show the instrument corrected traces randomly chosen in groups of 6 for each channel. The traces drawn from the entire dataset belong to the 75 % with the largest values of the maximum horizontal acceleration (i.e., second, third and fourth quartile of the value distribution, cf. Table 4). The total of traces satisfying this criterion amounts to more than 860,000  Table 2 are used as labels for the specific metadata being represented.
3C traces. Application of the instrument transfer function appears to be generally successful without introduction of particular side effects with the exception of some amplification of the very low frequencies for some very low amplitude traces of the EH channels (e.g., panels (h,k) in Fig. 16). This effect results from our choice to bandpass filter all the traces channels in the same frequency range: this has the negative effect of boosting the low frequencies of the narrower band EH channels although it can be promptly removed by high-pass filtering. Overall, the quality of the ground motion units dataset can be considered of 5 satisfactory quality to perform analysis of ground motions.

Noise
Noise is generated by many different sources such as ocean waves, wind, traffic, instrumental noise, electrical noise, etc. and its suppression in earthquake recordings represent a long standing objective (Zhu et al., 2019b, and references therein). The in- The metadata names listed in Table 2 are used as labels for the specific metadata being represented.
clusion of noise data in a dataset like INSTANCE is thus important because it provides information on the noise characteristics of the individual stations in the absence of earthquake generated signal. ML models can reveal to be effective for noise removal or, in a classification analysis, for improving the detection of earthquakes. The noise data have been assembled starting from the stations gathered in the event selection stage described above.

5
Starting from the entire catalogue consisting of more than 300,000 events (Table 1), we first identified 600 s long time windows free of any earthquake. Secondly, we obtained the operational times of acquisition of each station. The third step consisted of identifying the 120 s time windows to be included in the dataset for each station and channel. This was achieved by intersecting the time window series obtained in the previous two stages (i.e., the event free windows and the periods of station acquisition). The traces are representative of 75 % of the data and belong to the second, third and fourth quartiles of each channel. Each row contains three, randomly selected, 3C traces drawn according to the following criteria based on the metadata listed in Table 2 and the quantile values provided in Table 4: (a-f) HH traces with trace_pga_perc > 5.1e-4 % g; (g-l) EH traces with trace_pga_perc > 9.3e-4 % g; (m-r) HN traces with trace_pga_perc > 8.7e-4 % g; The arrival times of P-and S-wave onsets (i.e., trace_[P,S]_arrival_time ) are shown by blue and red vertical lines, respectively.
It follows that the adopted procedure does not entail the selection of the same time window for multiple stations. For stations acquiring more than one channel type (e.g., HH and HN), noise windows for all the channels were identified and downloaded.
The resulting total number of noise trace windows is 132,288 corresponding to about 10 % of the total number of traces of INSTANCE. We note also that this procedure does not preclude the presence of noise traces that include energy from regional and teleseismic events.

Metadata description
The 46 metadata elements (Table 2) used for the noise data selection include for each 3C waveform trace an identifier based on the start time, the station parameters, the trace quality control that include the automatic picks and event detection obtained using the GPD and EQTransformer procedures. These picks provide potential insights on whether any earthquake not catalogued in the INGV bulletin might be present in the selected time windows. In Fig.17 we show the channel subdivision of the downloaded noise together with the networks the stations belong to. In   Table 2 are used as labels for the specific metadata being represented.

Examples of noise data traces
Examples of the noise traces are shown in Figure 20. To perform the selection, we have used the distribution of the trace rms values that is provided in Table 5.
In the top two rows of Fig. 20, there are shown some examples of events detected using the GPD and EQTransformer algorithms on the EH channels. As it was the case with the event dataset, the noise traces also contain undetected events 5 although their number according to our analysis seems rather small especially for the earthquakes detected by EQTransformer.
This result gives us good confidence that the noise traces are for the great majority free of earthquake events. The following rows of Fig. 20 provide waveform samples drawn from the 90 % of the dataset (panels g-i and m-o) for the HH and EH channels, respectively. Both sets of panels exemplify some of the features of the great majority of the noise data. In contrast, the panels (j-l) and (p-r) have been chosen to show what could be considered traces exceeding noise values or that contain finite  Table 2.

Discussion
The primary objective of this work has been to assemble a benchmark dataset consisting of seismic waveforms and associated metadata. It has been designed to be used for the analysis of earthquakes in Italy (and neighboring areas) using ML techniques and it could prove useful for ML analysis also elsewhere in other active tectonic regions by adopting transfer learning methodologies . The dataset consists of three HDF5 volumes -raw and instrument removed event traces, and 5 raw noise traces -and of the associated metadata.
The selection of the waveform traces to be included was based on the availability of low (≤ 1 s) P-and S-phase location residual times and large location weights taken from the preferred solutions listed in the INGV bulletin. To counteract the Gutenberg-Richter power law which affects the compilation of seismological datasets targeting ML analysis applications, attention was paid towards assembling a dataset that was not completely skewed by a large number of small magnitude earth-  Table 2 and on the quantile values listed in Table 5.
Each row contains three, randomly selected, 3C traces drawn according to the following criteria: (a-c) trace_GPD_[P,S]_number   (5). Distribution according to different quantiles of selected noise metadata (cf.  quakes. To this end, we included all the traces available of the larger size earthquakes and then we decreased progressively the number of smaller size earthquakes and associated traces. Our effort, however, trades-off with the need of assembling a dataset that is sufficiently large for ML purposes. The distribution of the selected traces shown in Fig. 3 and Fig. 4 according to magnitude, distance and focal depth allows the users to make the appropriate choices for their purposes even though we recognize that the achievement of the sought balanced distribution remains difficult. Other data selection criteria could have 5 been used (e.g., select all the data acquired within distance ranges depending on earthquake magnitude) but the (un)balanced magnitude and distance distribution would have persisted. Thus, given the criteria adopted it is pleonastic to remark that this dataset is not designed for studies addressing the earthquake magnitude power-law distribution (e.g., the b-value). Similarly, although the dataset contains an average of 21 traces per earthquake, it may not be optimal for dedicated earthquake relocation studies.

10
Our criterion, based on the available high-quality P-and S-phases with low location residual times, is expected to provide a large number of traces with distinct earthquake signal and high SNR ratios. The distribution of SNR values shown in Fig. 10 and 11 and in Table 3, and the example seismograms shown in Fig. 14(j-r) and Fig. 15(m-o), appear to confirm our choices.
The selection of 120 s trace length time window is longer than those made by other authors for analogous benchmark datasets (e.g., Mousavi et al., 2019;Magrini et al., 2020). This relatively long time window was required, however, because we sought 15 to include the entire seismicity occurring in Italy that spans from very shallow to very deep (Fig. 3). Unfortunately, this long window trades-off with a higher probability of including earthquakes close in time that had not been reported in the INGV catalogue. For this reason, we carried out also a (preliminary) automatic picking and earthquake detection analysis using two well established recent ML techniques (GPD and EQTransformer; Ross et al., 2018a;Mousavi et al., 2020) to possibly isolate those traces that include multiple events. The results of this analysis summarized in Table 3 would indicate that about 90 % of the event data contain only one earthquake according to the EQTransformer analysis whereas the GPD analysis returned some slightly higher numbers of P-and S-phase detections.
Our metadata for the earthquake part of the dataset consist of 115 parameters. They are subdivided into three main classes plus one additional class derived from the previous ones. This is a rather rich set of parameters that can be used either to select 5 subsets of the dataset, or as additional features to rely on when developing ML models, or as labels in supervised ML analysis, or for unsupervised ML applications. In addition, the metadata could be used by themselves for specific studies (e.g., seismic velocity model regionalization, travel time tomography, ground motion prediction models, local site corrections, ...).
Earthquake data gathered by seismic instruments and streamed in realtime to earthquake monitoring centers or preserved within archives can suffer from problems of different nature (e.g., sensor, data logger, equipment installation, data transmission, 10 and processing among the most common). Thus, the compiled dataset could be useful for the development of robust techniques of analysis and this is one main reason for including several trace quality parameters as metadata since they can help the user to identify the possibly "faulty" records which can be then either removed or included to train the ML model just to "learn" them. This approach may seem to contradict one of the main purposes of compiling a high-quality dataset and it may also be an obstacle when attempting to reveal deep information therein but the expectation is that, by including all the data together, the 15 rich set of metadata leaves the users enough freedom to identify the "good" data for their purposes. It is worthwhile to mention that Yeck et al. (2020) have found that inclusion of only "good", high SNR trace data during training of various body waves resulted in lower performance when applied to real-time pick data.
The INSTANCE dataset includes intensity measures (i.e., PGA, PGV, spectral accelerations) obtained after deconvolution of the instrument response performed automatically and possibly affected by digital signal processing problems induced, for 20 example, by the presence of abnormal drifts and spikes. Given the difficulty to verify the quality of all the individual processed traces, the availability of a rich set of trace metadata can be useful (again) to detect the faulty traces.
The example traces drawn randomly from the dataset that we have presented in Figs. 14, 15, 16 and 20 provide some evidence of the characteristics of the traces contained in the dataset and how they can be promptly selected through the provided metadata. Although the great majority of the data appear to be of very good quality, we are also aware that low quality data are 25 almost inevitable to occur. Inspection of the waveform traces by using other selection criteria than those shown here, and of the IM metadata (cf. Fig. 12) give us, however, good confidence on an overall good quality of the dataset.
The INSTANCE data collection assembles for the first time a very large amount of earthquake and noise data throughout Italy. If on one side this might seem a limitation when compared to other recent data collections like STEAD and LEN-DB that have gathered data globally, on the other hand, the dataset can be considered a representative subset of the seismicity in 30 Italy and neighbouring areas. The dataset equals to more than 43,000 hours of continuous event and noise data and associated metadata with an average of 21 3C traces per earthquake. To the purpose of comparison, in Table 6, we summarize the main features of the currently available seismological datasets assembled for ML analysis. As noted above, the main features that distinguish INSTANCE from the other datasets are the number of metadata for both earthquakes and noise traces and the average number of traces per event. In addition, the dataset provides a generally large number of traces for each recording site 35 making the dataset suitable for quite diversified target studies. The dataset is also unique since it is the only one (yet) to provide the waveform traces in both digital counts and physical units. In this context, the set of parameters provided by INSTANCE spans both specific seismological parameters like P and S arrival times, fault plane and moment tensor solutions, and also peak ground motion parameters in physical units (e.g., PGA, PGV) which can be used for studies that target the estimation of the ground shaking (e.g., shakemaps).
5 In summary, the dataset features strengths such as the prompt availability of a large number of records assembled within a ready-to-use data volume that can be certainly considered representative of the whole waveform data archive of the INGV ORFEUS-EIDA node and that can be used for many diverse studies. In our opinion, the strengths of providing a diversified set of data outnumber the weaknesses and the latter ones can be isolated, and their negative contribution reduced through the exploitation of the very rich set of metadata.

Applications
To the purpose of describing the range of possible applications of INSTANCE we follow the basic exposition schema adopted by Mousavi et al. (2019) for the STEAD benchmark dataset. These authors addressed four main areas in which benchmark 5 datasets can prove very effective for improving seismological knowledge and seismic monitoring operational activities: earthquake trace denoising, earthquake detection and onset picking, classification/discrimination and direct earthquake characterization.
The seismic noise level at a station is frequency dependent and derives from many factors such as types of equipment, installation, meteorological conditions, anthropic generated noise, geography, season, and time of day (McNamara and Buland, . Seismic trace denoising enhances the SNR that is crucial to lowering the magnitude detection level of earthquake catalogs and, by so doing, increase the number events detected. Analogously, denoising can be relevant to pre-process seismic traces when performing ambient noise cross-correlation analysis (e.g. Baig et al., 2009), or for detecting speed-of-light changes of the gravitational field (Vallée et al., 2017), or for the analysis of seismic data acquired in urban areas (e.g. Parolai, 2009) just to mention a few among many applications. ML techniques seem very promising to address the reduction of noise in be associated with partial noise component. They apply the technique to the waveform stacked data used in Shearer (1991) and similar stacks can be promptly prepared using INSTANCE at the local/regional scale and apply the denoising technique accordingly. 25 Earthquake detection (including phase picking), discrimination and rapid characterization represent main pillars of seismic monitoring and surveillance. During their lifetime, operational seismic centers alternate calm periods characterized by low levels of seismicity, in which it can become of relevance the detection of even the smallest possible events to delineate the activation of often hidden tectonic structures, to paroxysmal periods starting with significant earthquakes and followed by hundreds or thousands of aftershocks felt by people. To ameliorate the response of the centers in both these extreme cases, we 30 find that the INSTANCE dataset can be of importance to calibrate and benchmark methodologies for i.) phase onset picking and earthquake detection methods to lower the magnitude detection level (e.g., Ross et al., 2018a;Zhu et al., 2019a;Walter et al., 2020;Mousavi et al., 2020, amongst others); ii.) discriminate between volcanic and tectonic earthquakes (e.g., Esposito et al., 2006) and, in the future, after updating INSTANCE with new data, discriminate earthquakes and other sources of seismic energy (e.g., sonic booms, quarry blasts, underwater explosions) often felt by the population (e.g., Del Pezzo et al., 2003;Linville et al., 2019); iii.) the rapid and accurate characterization of the earthquake source, distance, depth (e.g., Perol et al., 2018;Trugman and Shearer, 2018;Kriegerowski et al., 2018;Zhang et al., 2020;Lomax et al., 2019;Mousavi and Beroza, 2020;Münchmeyer et al., 2021) and of the ground shaking (e.g., Alavi, 2011;Derras et al., 2012;Derras, 2014;Jozinovic 5 et al., 2020;Münchmeyer et al., 2020).
Indeed, the field of application of the dataset is quite extensive and it can be used to address many diverse topics depending on how the data are grouped and it can also be useful for applications not relying on ML techniques. For example, the dataset features some stations with several thousand of traces recording earthquakes from different azimuths and distances that can be used to construct common-station gathers of seismograms for swaths of sources in almost any desired geometry (e.g., Korneev 10 et al., 2003), or to study in detail the local site response. Analogously, the metadata alone provide a rich set of arrival times (cf. Fig. 7) that could be used as is for traveltime tomography at regional scale in Italy and, in addition, the availability of the associated waveforms makes it possible the application of methodologies that resolve the velocity structure jointly using arrival times and waveform data (e.g., Zhang* and Chen, 2014). For what concerns the ground motion amplitude data, the availability of these metadata can be of relevance in combination with the shakemaps (http://shakemap.ingv.it) to develop new tools for 15 rapid earthquake ground motion estimation. Other applications of the data collection include the adoption of unsupervised ML algorithms to group the waveforms independently of the earthquake location and just on the waveform themselves (e.g., Seydoux et al., 2020). INSTANCE can also be used, as a dataset with a large number of data, for creating pre-trained models when using transfer learning techniques either for seismological or other applications which use time-series data (Otović et al., 2021). 20 Overall, we believe that the dataset will be useful for stepping up towards a new generation of earthquake monitoring tools that will profit of the ongoing very fast developments in machine learning. What is certain is that seismology is in great need of benchmark datasets (Mousavi et al., 2019) upon which test new and existing techniques. To this end, standardization of the input data and metadata formats is of great relevance and in constructing this dataset we have adopted the schema proposed by the SeisBench initiative (Wollam et al., 2021). Needless to emphasize that widespread adoption of the same metadata 25 schema and data volume formats can foster the compilation of similar datasets also for other regions with the possibility to merge them all together giving the opportunity to perform ML analysis exploiting the potentials of the resulting huge large datasets. Perhaps more importantly, standardization of data and metadata formats will make it easier to test different datasets using the same ML model or, alternatively, benchmarking different models on the same dataset and in both cases the benefits appear clear. 30 6 Data availability The data used in this work were downloaded using the webservices provided by INGV (http://terremoti.ingv.it/en/webservices_ and_software). The networks used for the ISTANCE dataset are organized in Table 7 Table (7). Seismic networks used in the compilation of INSTANCE dataset 7 Code and data availability Routines and notebooks for analysis and display of the dataset (and the sample dataset) are available on https://github.com/ ingv/instance. The processing was performed using ObsPy (Beyreuther et al., 2010;Megies et al., 2011;Krischer et al., 2015), NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020) and Pandas (McKinney, 2010; pandas development team, 2020) python modules and the graphics was prepared using the Matplotlib library (Hunter, 2007) and seaborn (Waskom, 2021). 5 The dataset can be downloaded from http://doi.org/10.13127/instance . A versioning schema has been also included since the dataset is expected to undergo modifications or expansions in the future. For instance, it is possible that some earthquakes or noise traces have been misclassified, or future significant seismic events be included. A sample dataset is also provided on the same landing page.

10
INSTANCE is the first dataset designed for the application of ML methodologies compiled using the seismic data archived on the ORFEUS-EIDA node of INGV for Italy (Danecek et al., 2021). One of the main scopes of the dataset is to provide a benchmark for developing improved techniques for earthquake and ground motion characterization. The dataset consists of about 1.3 M, 120 s long each, 100 Hz sampling, 3C traces subdivided into about 1.2 M containing seismic events and more than 100,000 that include noise. The traces are assembled within HDF5 formatted volumes to facilitate access and analysis. 15 More than 100 metadata grouped according to source, station, trace (and derived quantities) are associated to each trace to give the user much flexibility and control for the selection of the most appropriate data for her/his scientific targets.

38
The event data include recordings of earthquakes in the magnitude range 0 ≤ M ≤ 6.5 and in the distance range between 0 and more than 600 km although the great majority of the traces belong to earthquakes in the magnitude range 2 ≤ M ≤ 3 and within 250 km. The depth of the earthquakes varies between very shallow crustal earthquakes and about 600 km depth in the Calabrian subduction slab. The data have been recorded by more than 600 stations operating in Italy in the time span January 2005 -January 2020. The dataset equals to more than 43,000 hours of continuous event and noise data and associated 5 metadata. An average of 21 3C traces are provided per earthquake.
Appendix A: Preparation of the HDF5 data containers The waveform trace data are provided in binary HDF5 files volumes. The volumes have been prepared for event data in counts and ground motion units, and for noise data in counts. The HDF5 format allows for rapid and easy access to the individual traces without the need of loading the whole dataset into memory. The waveform trace datasets have been created using the 10 HDF5-Group "data" structure https://portal.hdfgroup.org/display/HDF5/HDF5 which contains as many HDF5-Datasets as 3C waveforms. Every 3-component waveform is a separate HDF5-Dataset and is accessed by its trace name (trace_name) found in the metadata file.
Appendix B: Velocity model used by the Italian Seismic Bulletin for the earthquake locations The earthquake locations provided in the Italian Seismic Bulletin (http://terremoti.ingv.it/en/help#BSI) are fully described by 15 Mele et al. (2010). The model consists of two layers over a half space assuming a ratio V P /V S = 1.732. The software IPOP developed by Alberto Basili is used for the earthquake locations (Bono, 2008).

Appendix C: Positive and negative polarities
In this appendix, we examine the origin for the observed asymmetry (almost 2:1 ratio) in the number of reported positive (up) and negative (down) polarities of the INSTANCE dataset. We evaluate whether i) the inclusion of anthropic, unidentified 20 sources like quarry blasts mistaken for earthquakes can affect the reported asymmetry and ii) the tectonics of the region can condition the number of positive and negative polarities in INSTANCE.
For the first investigation, we have followed Mele et al. (2010) (see also the recent work by Gulia and Gasperini, 2021) who found that in the 2008 bulletin the 99.6 % of the blasts have local magnitude M L ≤ 2.2 (Fig. 23 of their study). We have progressively increased the lower magnitude threshold to verify whether the nearly 2:1 ratio between positive and negative 5 polarities persists as the magnitude is increased. The expectation is that, as the magnitude increases, the ratio progressively levels out since the blasts (or other artificial sources) do not produce magnitudes greater than M =3 in Europe (cf. Giardini et al., 2004). To address the variation of the proportion between positive and negative polarities with magnitude, we have selected from the INSTANCE dataset earthquakes with progressively larger minimum threshold magnitudes and found that the fraction (percent values) of negative polarities increases progressively from 36 % to 41 % when including earthquakes with 10 M > 0.25 and M > 3, respectively. For larger minimum magnitudes, the percentage stabilizes around 42-43 %. This indicates that inclusion of the polarities of e.g., unrecognized blasts (i.e., with M < 3) has a moderate but still significant impact on the observed asymmetry between the reported positive and negative polarities. This asymmetry, although somewhat surprising, seems to occur also elsewhere. For example, Ross et al. (2018b) report, in their analysis of the southern California earthquake dataset (before data augmentation), a content of 67 % and 33 % for up and down polarities, respectively. We also note that the  Given these conditions, the observed asymmetry could result from the complex interplay between the source receiver geometry, the 200-300 km width of peninsular Italy and the dominant NW-SE extensional faulting characterizing the tectonic regime in the Apennines. In this setting, the radiation pattern predicts negative polarities in the near field and positive polarities farther away from the source. Also the negative lobes are characterized by a smaller extension with respect to the positive lobes. Since the seismic receivers are more or less evenly distributed, the chance to record a negative pulse is smaller with respect to the 5 positive one.
The Fig. C1 shows the histograms of the distribution of the positive and negative polarities with distance. The Fig. C1(a) shows the distribution of the polarities for the chosen target area in the Apennines, while Fig. C1(c) exhibits the polarities in the same area but only along backazimuth approximately NE-SW (i.e., the ranges 15 • -105 • and 195 • -285 • degrees). Finally, Fig. C1(b), displays the polarities from earthquakes outside the target area. We note that within the target area ( Fig. C1(a)) 10 the polarities are overwhelmingly positive in gross agreement with the explanation above. For further confirmation, we note in Fig. C1(c) that, if we restrict to the NE-SW propagation direction perpendicular to the Italian peninsula, the ratio between positive and negative polarities (pos %, neg %) increases further from (68 %, 32 %) to (81 %, 19 %), respectively. Conversely, the number of polarities for the earthquakes outside the target area are pretty much well balanced (49 %, 51 %) as shown in     Table 2.  Table 2.
Author contributions. AM prepared the initial data and metadata selection, drafted the manuscript and initial version of the figures. SC and SG provided ML estimation of the waveforms data, contributed to final versions of the figures and manuscript revision. CG contributed to data preparation and manuscript revision. DJ compiled the HDF5 formatted volumes and contributed to the data and manuscript revision. VL set up the virtual machines and the scripts to perform the massive data download.
Competing interests. No competing interests are present.

5
Disclaimer. The authors decline any responsibility for possible errors present in the INSTANCE dataset which can lead erroneous evaluations and to physical and economic damages.
Acknowledgements. This work would not have been possible without the effort and dedication of the people that install and maintain the stations of the networks used here, and the skilled IT people that are in charge of archiving, curating and providing access to the data. We also greatly acknowledge the earthquake analysts that routinely perform the data analysis for the compilation of the earthquake bulletins.

10
The metadata nomenclature benefited from the interaction and discussions within the recently formed SeisBench group. We would like to for the insightful comments, questions and suggestions that helped to clarify the text and improve the manuscript overall.