This work is distributed under the Creative Commons Attribution 4.0 License.
BuildingSense: a new multimodal building function classification dataset
Abstract. Building function describes how a building is used. Access to this information is essential for urban research, including studies of urban morphology, the urban environment, and human activity patterns. Existing building function classification methodologies face two major bottlenecks: (1) poor model interpretability and (2) inadequate multimodal feature fusion. Although large models, with their strong interpretability and efficient multimodal data fusion capabilities, offer promising potential for addressing these bottlenecks, they are still considered limited in processing multimodal spatial datasets, and their performance in building function classification therefore remains unknown. To the best of our knowledge, no multimodal building function classification dataset is available, which makes it difficult to evaluate their performance effectively. Meanwhile, prevailing building function categorization schemes remain coarse, which limits their ability to support finer-grained urban research. To bridge this gap, we constructed a novel multimodal and fine-grained dataset, BuildingSense, for building function classification. Based on BuildingSense, we evaluated four state-of-the-art large models with respect to both their classification outcomes and their reasoning processes. The results demonstrate that large models can effectively comprehend multimodal spatial data, challenging the conventional view. Building on these results, we identify three key directions for future research: (1) building a categorized inference example database, (2) developing cost-effective classification models, and (3) quantifying the confidence of model outputs. Our findings not only provide insights for the development of subsequent large model-based classification methods but also contribute to the advancement of multimodal fusion-based classification methods. The dataset and code of this paper can be accessed at https://doi.org/10.6084/m9.figshare.30645776.v2 (Su et al., 2025a).
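For readers who want to experiment with the released data, the minimal sketch below illustrates one way the four modalities described in the abstract (remote sensing image, street-view image, POI records, and building attributes) could be assembled into a classification prompt for a multimodal large model. The file layout, field names (e.g., metadata.json, poi, attributes), the category list, and the helper functions are hypothetical illustrations only; they are not taken from the authors' released code.

```python
# Hypothetical sketch: paths, field names, and categories are assumptions for
# illustration, not the structure of the released BuildingSense files.
import base64
import json
from pathlib import Path

# Illustrative subset of categories; the released dataset defines the full scheme.
CATEGORIES = ["residential", "office", "retail"]


def load_record(sample_dir: Path) -> dict:
    """Read one hypothetical sample: JSON metadata plus two base64-encoded images."""
    record = json.loads((sample_dir / "metadata.json").read_text())
    record["remote_sensing_b64"] = base64.b64encode(
        (sample_dir / "remote_sensing.png").read_bytes()
    ).decode()
    record["street_view_b64"] = base64.b64encode(
        (sample_dir / "street_view.png").read_bytes()
    ).decode()
    return record


def build_prompt(record: dict) -> str:
    """Fold POI entries and building attributes into one text prompt; the encoded
    images would be attached separately via the chosen model's multimodal API."""
    poi_text = "; ".join(
        f"{p['name']} ({p['category']})" for p in record.get("poi", [])
    )
    attr_text = ", ".join(f"{k}={v}" for k, v in record.get("attributes", {}).items())
    return (
        "Classify the function of the building shown in the attached remote sensing "
        "and street-view images.\n"
        f"Nearby POIs: {poi_text or 'none'}\n"
        f"Building attributes: {attr_text or 'none'}\n"
        f"Answer with one category from: {', '.join(CATEGORIES)}, "
        "and briefly explain your reasoning."
    )


if __name__ == "__main__":
    sample = load_record(Path("BuildingSense/samples/0001"))  # hypothetical path
    print(build_prompt(sample))
```

The assembled text would then be paired with the two encoded images through whichever multimodal model interface is used for the evaluation.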
Status: open (until 18 Mar 2026)
- RC1: 'Comment on essd-2025-710', Anonymous Referee #1, 12 Feb 2026
- AC1: 'Reply on RC1', Pengxiang Su, 20 Feb 2026
Dear Anonymous Referee #1,
We sincerely thank you for your insightful and constructive comments on our manuscript. Your feedback has been invaluable in improving the clarity, quality, and overall presentation of the paper.
Our detailed responses to each of your comments are provided in the attached PDF document. We have carefully considered all suggestions and made revisions where appropriate. We hope our responses adequately address your concerns and demonstrate our commitment to producing a rigorous, transparent manuscript.
Kind regards,
Pengxiang Su (on behalf of all authors)
- RC2: 'Comment on essd-2025-710', Anonymous Referee #2, 16 Feb 2026
This paper constructs a multimodal fine-grained dataset for building function classification, covering remote sensing imagery, street-view images, POI data, and building attributes for over 34,000 buildings in New York City. Through a systematic literature review, it identifies current methodological bottlenecks in building function classification and the scarcity of datasets, while revealing both the potential and research gaps of large models in this task. The subsequent benchmark experiments with large models demonstrate the promise of multimodal fusion for understanding building functions, challenging prevailing views in the field. The dataset construction process is solid and includes appropriate quality-control measures. The paper is timely and valuable, and falls within the scope of ESSD.
However, there are some issues that need to be addressed before publication:
- As the first contact with readers, the abstract should briefly highlight the dataset's scale (e.g., number of buildings, modalities included).
- I noticed that the street-view images in the published dataset do not have the target buildings annotated. This will undoubtedly lead to “Spatial relationship errors” in the current study and similar issues in future models trained on this dataset. Could the authors explain the reasoning behind this?
- There is a typo in Equation (1): "arctan" should replace "acrtan".
- In the POI section, adding a field table would improve readability.
- There are spelling inconsistencies for "BuildingSense" throughout the manuscript (e.g., line 165).
- While large models have surpassed traditional methods in reasoning on complex tasks and can significantly enhance the interpretability of results, it is still necessary to explain why traditional deep learning models were not evaluated on this building function dataset.
- Please add crucial information about “residential category statistics scaled by a factor of 10” in the title of Figure 7 to prevent misinterpretation.
Citation: https://doi.org/10.5194/essd-2025-710-RC2
- AC2: 'Reply on RC2', Pengxiang Su, 20 Feb 2026
Dear Anonymous Referee #2,
We sincerely thank you for your insightful and constructive comments on our manuscript. Your feedback has been invaluable in improving the clarity, quality, and overall presentation of the paper.
Our detailed responses to each of your comments are provided in the attached PDF document. We have carefully considered all suggestions and made revisions where appropriate. We hope our responses adequately address your concerns and demonstrate our commitment to producing a rigorous, transparent manuscript.
Kind regards,
Pengxiang Su (on behalf of all authors)
- RC3: 'Reply on AC2', Anonymous Referee #2, 22 Feb 2026
The authors have addressed my concerns well. I have no more comments.
Citation: https://doi.org/10.5194/essd-2025-710-RC3
Data sets
BuildingSense-A multimodal building function classification dataset Pengxiang Su, Runfei Chen, Heng Xu, Wei Huang, Xinling Deng, Wanglin Yan, Songnian Li, Hangbin Wu, Chun Liu https://figshare.com/s/dc6aada5afa0d620a79f
Model code and software
BuildingSense-A multimodal building function classification dataset Pengxiang Su, Runfei Chen, Heng Xu, Wei Huang, Xinling Deng, Wanglin Yan, Songnian Li, Hangbin Wu, Chun Liu https://figshare.com/s/dc6aada5afa0d620a79f
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 257 | 97 | 26 | 380 | 39 | 17 | 12 |
It is a timely and valuable dataset that can be useful not only for building classification but also for training multimodal AI models to better understand the function of urban buildings, as well as the underlying motivations of human activities and movements. To the best of my knowledge, it is the first multimodal dataset dedicated to building function classification that offers 26 distinct, fine-grained categories. This is a significant improvement over existing schemes that often mirror coarse land-use classifications. The study challenges the conventional belief that large models cannot handle multimodal spatial data, which is a high-quality contribution. Overall, the paper is well-structured, the methodology is sound, and the dataset indeed fills a clear gap in the Earth System Science community. The issues below should be addressed before publication.