the Creative Commons Attribution 4.0 License.
cigChannel: A massive-scale 3D seismic dataset with labeled paleochannels for advancing deep learning in seismic interpretation
Abstract. Identifying buried channels in 3D seismic volumes is essential for characterizing hydrocarbon reservoirs and offering insights into paleoclimate conditions, yet it remains a labor-intensive and time-consuming task. The data-driven deep learning methods are highly promising to automate the seismic channel interpretation with high efficiency and accuracy, as they have already achieved significant success in similar image segmentation tasks within the field of computer vision (CV). However, unlike the CV domain, the field of seismic exploration lacks a comprehensive benchmark dataset for channels, severely limiting the development, application, and evaluation of deep learning approaches in seismic channel interpretation. Manually labeling 3D channels in field seismic volumes can be a tedious and subjective work and most importantly, many field seismic volumes are proprietary and not accessible to most of the researchers. To overcome these limitations, we propose a comprehensive workflow of geological channel simulation and geophysical forward modeling to create a massive-scale synthetic seismic dataset containing 1,200 256×256×256 seismic volumes with labels of more than 10,000 diverse channels and their associated sedimentary facies. It is by far the most comprehensive dataset for channel identification, providing realistic and geologically reasonable seismic volumes with meandering, distributary, and submarine channels. Trained with this synthetic dataset, a convolutional neural network (simplified from the U-Net) model performs well in identifying various types of channels in field seismic volumes, which indicates the diversity and representativeness of the dataset. We have made the dataset, codes generating the data, and trained model publicly available for facilitating further research and validation of deep learning approaches for seismic channel interpretation.
Status: open (until 15 Jan 2025)
RC1: 'Comment on essd-2024-131', Anonymous Referee #1, 18 Sep 2024
This manuscript proposes a workflow to generate 3D synthetic seismic cubes with paleo-channels interpretations that can be used to train deep-learning models for seismic interpretation, and a dataset of 1200 3D synthetic seismic cubes. This dataset is used in an application where a deep learning model is trained to interpret paleo-channels and tested on 3 real seismic cubes.
Overall the manuscript is easy to follow, and there is a clear need for such datasets considering that deep learning is getting a lot of traction for subsurface applications but we cannot rely on subsurface data alone. However, the dataset isn't nearly as realistic and comprehensive as claimed by the authors. This will drastically reduce its usefulness as a benchmark, but I believe this is a step in the right direction.
Major comments:
Overall the workflow doesn't rely on the state of the art for the generation of synthetic subsurface realizations. Using soillib instead of more established models like Landlab, Fastscape, or Badlands is puzzling, especially considering that I'm not sure if the principles behind soillib have been thoroughly validated. Integrating channels and topography in an object-based manner (which is similar to what Alluvsim does, see https://doi.org/10.1016/j.cageo.2008.09.012) is quick and easy, but far from the realism that stratigraphic models such as Flumy or Sedsim would lead to. The rock physics model is very simplistic, and ignores all the variability within the facies. And the forward seismic model is also the simplest there is. Of course full-waveform models are very expensive, but 3D convolution using a point-scatterer function can also capture the acquisition setting without increasing the computation cost. So in the end a large part of the variability in seismic data isn't captured in this dataset, and the impact of not fully capturing that variability needs to be discussed at the very least. The lack of faults is also a major drawback, as shown by the application, and will limit the usefulness of this dataset.
There are very few references to support the claim of geological realism, and I would even say that the manuscript lacks geological perspective. Section 2.2 is particularly problematic, because soillib seems to be simulating continental fluvial systems with tributaries, instead of deltas and distributaries as claimed in the text. So how comprehensive are the workflow and dataset when they don't explicitly capture key subsurface deposits for geo-energy applications? On top of that, deltas aren't just channels (they are more recognizable by their lobate shapes on seismic data), but this aspect isn't discussed at all. In general the limitations of the workflow and dataset should be more clearly acknowledged and discussed, including in the abstract. What could be the impact of the limitations for anyone who would like to use this dataset? What is it missing so that methods validated on it could fail when applied on real data?
Beyond all this lies the question of whether this dataset is plausible enough to be used confidently for the validation of new methods. To me the answer is no, but there could be a simple approach to discuss this in a quantitative way: divide the dataset into a train and a test set, train an autoencoder to reconstruct the train set, then compare the reconstruction error between the test set and the real seismic cubes. Essentially, if the autoencoder can reconstruct the real data just as well as the synthetic ones, it means that the dataset likely captures the patterns that matter. If not, it quantifies the room for improvement and gives a clear criterion to follow for future work.
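The comparison step of this check can be sketched in a few lines. This is only an illustration under stated assumptions: `reconstruct` stands for any trained autoencoder (here replaced by a toy placeholder, `toy_reconstruct`), patches are flat lists of amplitudes, and all function names are hypothetical rather than taken from the cigChannel repository.

```python
from statistics import mean


def mse(a, b):
    """Mean squared error between two equally long flat sequences."""
    return mean((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_gap(reconstruct, synthetic_test, real_patches):
    """Compare mean reconstruction error on held-out synthetic vs. real patches.

    `reconstruct` is any callable mapping a patch to its reconstruction,
    e.g. the forward pass of an autoencoder trained on the synthetic train set.
    Returns (synthetic error, real error, real minus synthetic).
    """
    syn_err = mean(mse(x, reconstruct(x)) for x in synthetic_test)
    real_err = mean(mse(x, reconstruct(x)) for x in real_patches)
    return syn_err, real_err, real_err - syn_err


def toy_reconstruct(patch):
    # ASSUMPTION: placeholder for a trained autoencoder. It reproduces only
    # the patch mean, so it is "perfect" on constant patches and poor elsewhere.
    m = mean(patch)
    return [m] * len(patch)


if __name__ == "__main__":
    synthetic = [[1.0, 1.0], [2.0, 2.0]]   # patterns the placeholder captures
    real = [[0.0, 2.0]]                    # a pattern it does not capture
    print(reconstruction_gap(toy_reconstruct, synthetic, real))
```

A gap close to zero would suggest that the synthetic data cover the patterns present in the real cubes; a large positive gap would quantify what is missing.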
I also suggest reflecting on the value of such an open workflow and dataset beyond algorithmic developments related to deep learning. The field of geophysics in general tends to test new approaches on (very) simplified case studies, geostatistics tends to do the same, but I also see a lot of value for education (e.g., to learn deep learning with subsurface applications or learn about the impact of heterogeneities on seismic data). On that, turning the functions available on GitHub into a small, properly documented Python package available from PyPI could foster its use.
Specific comments:
Abstract
Line 8: "to most researchers" instead of "to most of the researchers".
Line 9: If it's geologically reasonable then it's not realistic. Something realistic represents a system accurately, while reasonable means that it's a good enough approximation. In this case we're in the latter situation: the geology here isn't realistic at all, but considering the lack of resolution of seismic data it's a good enough approximation to get some valuable insights from deep learning.
Line 13: What does it mean to "perform well"? Better be specific and quantitative there.
Line 14: How many seismic volumes?
Line 14: "which indicates the diversity and representativeness of the dataset" Nothing in the abstract suggests that you can conclude that.
Line 15: That's great!
1. Introduction
Line 20: Considering that we're already experiencing the consequences of climate change, that focus solely on hydrocarbons is unfortunate. Paleochannels are good reservoirs, so they are also valuable in hydrogeology, hydrothermal production, and mining.
Line 27: "have been developed" instead of "are developed".
Line 35: More and more seismic data are being released by government agencies (in Australia, the Netherlands, New Zealand, ...), so lack of access is not as true as it used to be. A key issue remains processing: those data can be raw or not completely processed (e.g., not depth converted), so it's difficult for non-specialists to reuse them. And then, as the authors rightfully mentioned, there's the difficulty in interpreting the data.
Line 39: I wouldn't say that it's not an option, it's just an expensive one, prone to uncertainties (in the processing for instance) and to biases (see for instance https://doi.org/10.1130/GSAT01711A.1).
Line 51: "massive-scale" is exaggerated. It's a relatively large dataset for the subsurface, but it's nothing compared to datasets from the deep learning community. And even in the subsurface much larger datasets have been released before (see https://doi.org/10.5194/essd-14-381-2022 for instance).
Line 65: Maybe mention the link to the GitHub repository also here?
2. Dataset generation workflow
Line 67: This sentence is a bit convoluted, with several repetitions that can be avoided (i.e., "generation" then "generating", "elaborate" then "explain details").
Line 72: Are meandering channels the most common river channels? I'm not convinced of that, it would be better to support this claim with a reference.
Line 86: Any reference to support those two shapes? It would support the claim of realism much better to show that this is indeed what we observe in nature.
Line 105: I couldn't find a paper describing how this model works in detail. The key problem here is that the channels shown in figure 3 look nothing like deltaic channels, but more like a continental river system (and those look more like tributaries, not distributaries). Many models have been developed to simulate such systems (see Landlab, Fastscape, Badlands) based on laws that only approximate the physics of overland flow, erosion, and deposition but have been validated to some degree and can be fast depending on the processes included and the implementation. So why not use those models? Regarding deltas, DeltaRCM developed by Liang et al. (2015) is a valid candidate, even if it's still a bit slow. I'm not sure if Sedflux could be an option (https://doi.org/10.1016/j.cageo.2008.02.013), but it shows that deltas are more than just channels, and that will impact the seismic data.
Line 127: Not just any sediment, you need enough fine sediments to build levees, so meandering turbiditic channels correspond to a quite specific depositional environment. This could restrict the comprehensiveness of the dataset, which won't capture sandier environments.
Line 130: The main difference is the scale: submarine channels are much wider, deeper, and longer than their terrestrial counterparts, which isn't really clear from figure 5. The lack of spatial scale also doesn't help in assessing the plausibility of the 3D structures, especially relative to one another.
Line 152: I would say abandoned meander instead of oxbow lake, which fits more a continental setting.
Line 164: Channels in general have different facies (point bars, levees, crevasse splays, abandoned meanders, abandoned channels). You take that into account when selecting the impedance for submarine channels but not the others, why is that? And what's the impact of that choice? On top of that, variations in grain size distribution within a facies lead to variations of impedance, so why use a uniform impedance, which isn't realistic? And this is excluding the effect of burial and diagenesis.
Line 177: So this is a 1D convolution? I get that this is a common and simple approach, but it's not really realistic either (see for instance https://doi.org/10.1111/1365-2478.12936 or https://library.seg.org/doi/full/10.1190/1.2919584, and https://doi.org/10.1190/geo2021-0824.1 for an application to seismic data interpretation using deep learning).
Figure 8: What's the spatial scale of the 2D sections? This comment stands for almost all the figures, but here in particular because we can't compare to the wavelet without a clear scale. And how do the peak wavenumbers relate to the usual values in Hz?
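For readers unfamiliar with the 1D convolutional model raised in the Line 177 comment, a minimal sketch follows. It assumes the generic textbook form (normal-incidence reflectivity from acoustic impedance, convolved trace by trace with a Ricker wavelet) and is not claimed to be the authors' implementation; the function names and parameter values are hypothetical.

```python
import math


def ricker(peak_freq, dt, half_len):
    """Ricker wavelet sampled at dt seconds, centred on index half_len."""
    out = []
    for k in range(-half_len, half_len + 1):
        x = (math.pi * peak_freq * k * dt) ** 2
        out.append((1.0 - 2.0 * x) * math.exp(-x))
    return out


def reflectivity(impedance):
    """Normal-incidence reflection coefficients between consecutive samples."""
    return [(b - a) / (b + a) for a, b in zip(impedance, impedance[1:])]


def convolve_same(signal, kernel):
    """Discrete convolution, cropped to the length of `signal`.

    Assumes an odd-length kernel so that its centre sample aligns with
    the corresponding sample of `signal`.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = i - (j - half)
            if 0 <= k < len(signal):
                acc += w * signal[k]
        out.append(acc)
    return out


def synthetic_trace(impedance, peak_freq=30.0, dt=0.002, half_len=25):
    """One seismic trace from one impedance column (illustrative defaults)."""
    return convolve_same(reflectivity(impedance), ricker(peak_freq, dt, half_len))


if __name__ == "__main__":
    column = [2.0] * 10 + [4.0] * 10   # single impedance contrast
    trace = synthetic_trace(column)
    print(max(range(len(trace)), key=lambda i: abs(trace[i])))
```

Because each trace is computed independently, this model cannot reproduce lateral resolution or illumination effects, which is the limitation the comment points to.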
3. Results
Line 219: No faults? How does that impact the usefulness of the dataset?
Line 228: Actually size and aggradation are the only differences between fluvial and turbiditic channels in your simulations, since the model for meandering is the same. So a deep learning model trained on your dataset might struggle with small turbiditic systems and large fluvial systems.
Lines 228-229: That doesn't really explain why they are so much larger. Overall I see very little geological literature cited to support the plausibility of the models, which is unfortunate.
Line 235: Any reference to show how to do that? Any reference for the weighted loss function?
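The manuscript's exact weighted loss is not reproduced in this comment, so the following is only a sketch of one common form, a class-balanced binary cross-entropy in which each class is weighted by the frequency of the other class. The function name and the choice of weighting are assumptions, not the authors' definition.

```python
import math


def balanced_bce(probs, labels, eps=1e-7):
    """Class-balanced binary cross-entropy over flat sequences.

    probs  : predicted channel probabilities in [0, 1]
    labels : ground-truth labels, 1 for channel and 0 for background
    ASSUMPTION: the channel term is weighted by the background fraction (beta)
    and the background term by the channel fraction (1 - beta), so the rare
    channel voxels are not swamped by the background.
    """
    n = len(labels)
    beta = 1.0 - sum(labels) / n          # fraction of background voxels
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= beta * y * math.log(p) + (1.0 - beta) * (1 - y) * math.log(1.0 - p)
    return total / n


if __name__ == "__main__":
    print(balanced_bce([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 0]))
```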
4. Applications
Line 249: It would be nice to add a link to that dataset.
Line 251: This isn't quantitative, which is unfortunate. It would have been much better to compare to a human-made interpretation, especially since here you have nice channels that look easily interpretable, and measure different metrics such as precision and recall. There seem to be a lot of false positives in the deep-learning interpretation. I realize that the manuscript doesn't aim at developing a deep-learning model for channel interpretation, but are the false positives due to a not-so-optimal model or a not-so-optimal dataset? Not having any validation metric for the training of the deep-learning model doesn't help to assess this.
Lines 256 and 261: Are those seismic volumes open? If not, that part of the manuscript is irreproducible.
Line 270: This is really a strong limitation considering that faults are ubiquitous in the subsurface, can have a big impact on applications, and that there have been a lot of studies similar to yours for faults already, so methods to introduce faults already exist (e.g., https://doi.org/10.1190/geo2021-0824.1).
Line 279: That's not quite true: you're proposing a benchmark dataset (lines 5, 59, 290), so a standard to compare (future) methods. How can your dataset become a standard if it excludes a basic configuration of the subsurface (faulted domains)?
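The precision and recall metrics suggested in the Line 251 comment can be computed voxel by voxel once a human-made interpretation is available. The sketch below assumes both the prediction and the reference are flattened binary masks (1 for channel, 0 for background); the function name is hypothetical.

```python
def precision_recall(pred, truth):
    """Voxel-wise precision and recall for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = [1, 1, 1, 0, 0, 0]
    reference = [1, 0, 0, 1, 0, 0]
    print(precision_recall(predicted, reference))
```

A low precision with a high recall would point to the many false positives mentioned in that comment, although it would not by itself say whether the model or the training dataset is responsible.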
5. Conclusions
Line 284: What are the predecessors actually? You've never mentioned any.
Lines 285-286: I'm dubious of the claims of realism and diversity, which aren't well supported by the manuscript. That doesn't mean that this dataset cannot be useful, but I expect more openness on the limitations, which is essential if this is to be used as a benchmark.
Table A1: It would be much better to have some justification for those values, either in an extra reference column or in the text. Channels can be smaller than 200 m and larger than 500 m (see https://doi.org/10.2110/jsr.2006.060).
Citation: https://doi.org/10.5194/essd-2024-131-RC1
Data sets
cigChannel: A massive-scale dataset of 3D synthetic seismic volumes and labelled palaeochannels for deep learning Guangyu Wang, Xinming Wu, and Wen Zhang https://doi.org/10.5281/zenodo.10791151
Model code and software
cigChannel Guangyu Wang, Xinming Wu, and Wen Zhang https://github.com/wanggy-1/cigChannel
Viewed
- HTML: 391
- PDF: 184
- XML: 212
- Total: 787
- BibTeX: 12
- EndNote: 14