OpenSWI: A Massive-Scale Benchmark Dataset for Surface Wave Dispersion Curve Inversion
Abstract. Surface wave dispersion curve inversion plays a critical role in both shallow geophysical exploration and deep geological studies, yet it remains hindered by sensitivity to initial models, susceptibility to local minima, and low computational efficiency. Recently, data-driven deep learning methods, inspired by their success in computer vision and natural language processing, have shown promising potential to overcome these challenges. However, the lack of large-scale and diverse benchmark datasets remains a major obstacle to the development and evaluation of such methods. To address this gap, we introduce OpenSWI, a comprehensive benchmark dataset generated through the Surface Wave Inversion Dataset Preparation (SWIDP) pipeline. OpenSWI comprises two synthetic datasets tailored to different research scales and application scenarios, namely OpenSWI-shallow and OpenSWI-deep, as well as an AI-ready real-world dataset for generalization evaluation, OpenSWI-real. OpenSWI-shallow is derived from the 2-D geological model dataset OpenFWI, containing over 22 million 1-D velocity profiles paired with their fundamental-mode phase and group velocity dispersion curves, spanning a broad spectrum of shallow geological structures (e.g., flat layers, faults, folds, and realistic stratigraphy). OpenSWI-deep is built from 14 global and regional 3-D geological models, comprising approximately 1.26 million high-fidelity 1-D velocity-dispersion data pairs for deep earth studies. OpenSWI-real, compiled from open-source projects, contains two sets of observed dispersion curves and their corresponding 1-D reference models, serving as a benchmark for evaluating the generalization of deep learning models. To demonstrate the utility of OpenSWI, we trained deep learning models on OpenSWI-shallow and OpenSWI-deep, and evaluated them on OpenSWI-real. 
The results show strong agreement between the predicted and reference velocity models, confirming the diversity and representativeness of the OpenSWI dataset. To facilitate the advancement of intelligent surface wave dispersion curve inversion techniques, we release the OpenSWI dataset (https://doi.org/10.5281/zenodo.16874111) and the SWIDP toolbox along with associated resources (https://doi.org/10.5281/zenodo.16884901), providing open resources to support the research community.
The size and the extent of the proposed database are remarkable and certainly of interest for the community. However, there are a few issues that must be addressed before publication:
- Extracting 1D profiles from the same 3D geology, while adding some random fluctuation, seems to bias the dataset: the profiles are close to each other and they all describe the same large geological structures.
- Too little information is provided, even in the appendix, about the DDPM, in particular on how viable it is to expand the dataset with a diffusion model. Does the DDPM reproduce the same statistics? How many iterations are needed to infer new samples? How diverse are those samples? Unless the DDPM has some novel feature, I think its role in this paper is rather marginal and can be omitted. Otherwise, the description should be expanded to highlight its importance.
- What is the highest frequency that the geological models can propagate?
- Are the random perturbations introduced by the authors consistent with the natural uncertainty? What about small-scale heterogeneity, which is well known to have a specific 3D correlation structure? Why did the authors not include it in their dataset?
- The authors overlooked one major dataset, published in this journal in 2024, which provides 30000 ground motion simulations including complex randomized geology:
Lehmann, F.; Gatti, F.; Bertin, M.; Clouteau, D. Synthetic Ground Motions in Heterogeneous Geologies from Various Sources: The HEMEW^S-3D Database. Earth Syst. Sci. Data 2024, 16 (9), 3949-3972. https://doi.org/10.5194/essd-16-3949-2024.
This database spans an area of ~10x10 km² for each sample and is constructed with minimal bias. Considering that the dataset provides (geology, time-histories) pairs, it would be interesting to benchmark the proposed model out-of-distribution, which is the most difficult aspect of benchmarking a new ML model.
- The transformer architecture presented in the paper seems a little too advanced for such a simple dataset (dispersion curves vs. 1D geological profiles). It should be benchmarked against existing alternative deep learning models before it can be considered a reliable alternative.