This work is distributed under the Creative Commons Attribution 4.0 License.
BuildingSense: a new multimodal building function classification dataset
Abstract. Building function describes how a building is used. Access to this information is essential for urban research, including studies of urban morphology, the urban environment, and human activity patterns. Existing building function classification methodologies face two major bottlenecks: (1) poor model interpretability and (2) inadequate multimodal feature fusion. Although large models, with their strong interpretability and efficient multimodal data fusion capabilities, offer promising potential for addressing these bottlenecks, they are still considered limited in processing multimodal spatial datasets, and their performance in building function classification therefore remains unknown. To the best of our knowledge, no multimodal building function classification dataset is available, which makes it difficult to evaluate their performance effectively. Meanwhile, prevailing building function categorization schemes remain coarse, which limits their ability to support finer-grained urban research. To bridge this gap, we constructed a novel multimodal and fine-grained dataset, BuildingSense, for building function classification. Based on BuildingSense, we evaluated four state-of-the-art large models with respect to both their classification outcomes and their reasoning processes. The results demonstrate that large models can effectively comprehend multimodal spatial data, challenging the conventional view. Building on these results, we identify three key directions for future research: (1) building a categorized inference example database, (2) developing cost-effective classification models, and (3) quantifying the confidence of model outputs. Our findings not only provide insights for the development of subsequent large model-based classification methods but also contribute to the advancement of multimodal fusion-based classification methods. The dataset and code of this paper can be accessed at https://doi.org/10.6084/m9.figshare.30645776.v2 (Su et al., 2025a).
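For readers who want to experiment with the released data, the minimal sketch below illustrates one way the four modalities described in the abstract (remote sensing image, street-view image, POI records, and building attributes) could be assembled into a classification prompt for a multimodal large model. The file layout, field names (e.g., metadata.json, poi, attributes), the category list, and the helper functions are hypothetical illustrations only; they are not taken from the authors' released code.

```python
# Hypothetical sketch: paths, field names, and categories are assumptions for
# illustration, not the structure of the released BuildingSense files.
import base64
import json
from pathlib import Path

# Illustrative subset of categories; the released dataset defines the full scheme.
CATEGORIES = ["residential", "office", "retail"]


def load_record(sample_dir: Path) -> dict:
    """Read one hypothetical sample: JSON metadata plus two base64-encoded images."""
    record = json.loads((sample_dir / "metadata.json").read_text())
    record["remote_sensing_b64"] = base64.b64encode(
        (sample_dir / "remote_sensing.png").read_bytes()
    ).decode()
    record["street_view_b64"] = base64.b64encode(
        (sample_dir / "street_view.png").read_bytes()
    ).decode()
    return record


def build_prompt(record: dict) -> str:
    """Fold POI entries and building attributes into one text prompt; the encoded
    images would be attached separately via the chosen model's multimodal API."""
    poi_text = "; ".join(
        f"{p['name']} ({p['category']})" for p in record.get("poi", [])
    )
    attr_text = ", ".join(f"{k}={v}" for k, v in record.get("attributes", {}).items())
    return (
        "Classify the function of the building shown in the attached remote sensing "
        "and street-view images.\n"
        f"Nearby POIs: {poi_text or 'none'}\n"
        f"Building attributes: {attr_text or 'none'}\n"
        f"Answer with one category from: {', '.join(CATEGORIES)}, "
        "and briefly explain your reasoning."
    )


if __name__ == "__main__":
    sample = load_record(Path("BuildingSense/samples/0001"))  # hypothetical path
    print(build_prompt(sample))
```

The assembled text would then be paired with the two encoded images through whichever multimodal model interface is used for the evaluation.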
Status: open (until 18 Mar 2026)
- RC1: 'Comment on essd-2025-710', Anonymous Referee #1, 12 Feb 2026
- AC1: 'Reply on RC1', Pengxiang Su, 20 Feb 2026
Dear Anonymous Referee #1,
We sincerely thank you for your insightful and constructive comments on our manuscript. Your feedback has been invaluable in improving the clarity, quality, and overall presentation of the paper.
Our detailed responses to each of your comments are provided in the attached PDF document. We have carefully considered all suggestions and made revisions where appropriate. We hope our responses adequately address your concerns and demonstrate our commitment to producing a rigorous, transparent manuscript.
Kind regards,
Pengxiang Su (on behalf of all authors)
- RC2: 'Comment on essd-2025-710', Anonymous Referee #2, 16 Feb 2026
This paper constructs a multimodal fine-grained dataset for building function classification, covering remote sensing imagery, street-view images, POI data, and building attributes for over 34,000 buildings in New York City. Through a systematic literature review, it identifies current methodological bottlenecks in building function classification and the scarcity of datasets, while revealing both the potential and research gaps of large models in this task. The subsequent benchmark experiments with large models demonstrate the promise of multimodal fusion for understanding building functions, challenging prevailing views in the field. The dataset construction process is solid and includes appropriate quality-control measures. The paper is timely and valuable, and falls within the scope of ESSD.
However, there are some issues that need to be addressed before publication:
- As the first contact with readers, the abstract should briefly highlight the dataset's scale (e.g., number of buildings, modalities included).
- I noticed that the street-view images in the published dataset do not have the target buildings annotated. This will undoubtedly lead to “Spatial relationship errors” in the current study and similar issues in future models trained on this dataset. Could the authors explain the reasoning behind this?
- There is a typo in Equation (1): "arctan" should replace "acrtan".
- In the POI section, adding a field table would improve readability.
- There are spelling inconsistencies for "BuildingSense" throughout the manuscript (e.g., line 165).
- While large models have surpassed traditional methods in reasoning on complex tasks and can significantly enhance the interpretability of results, it is still necessary to explain why traditional deep learning models were not evaluated on this building function dataset.
- Please add crucial information about “residential category statistics scaled by a factor of 10” in the title of Figure 7 to prevent misinterpretation.
Citation: https://doi.org/10.5194/essd-2025-710-RC2
- AC2: 'Reply on RC2', Pengxiang Su, 20 Feb 2026
Dear Anonymous Referee #2,
We sincerely thank you for your insightful and constructive comments on our manuscript. Your feedback has been invaluable in improving the clarity, quality, and overall presentation of the paper.
Our detailed responses to each of your comments are provided in the attached PDF document. We have carefully considered all suggestions and made revisions where appropriate. We hope our responses adequately address your concerns and demonstrate our commitment to producing a rigorous, transparent manuscript.
Kind regards,
Pengxiang Su (on behalf of all authors)
- RC3: 'Reply on AC2', Anonymous Referee #2, 22 Feb 2026
The authors have addressed my concerns well. I have no more comments.
Citation: https://doi.org/10.5194/essd-2025-710-RC3
Data sets
BuildingSense-A multimodal building function classification dataset Pengxiang Su, Runfei Chen, Heng Xu, Wei Huang, Xinling Deng, Wanglin Yan, Songnian Li, Hangbin Wu, Chun Liu https://figshare.com/s/dc6aada5afa0d620a79f
Model code and software
BuildingSense-A multimodal building function classification dataset Pengxiang Su, Runfei Chen, Heng Xu, Wei Huang, Xinling Deng, Wanglin Yan, Songnian Li, Hangbin Wu, Chun Liu https://figshare.com/s/dc6aada5afa0d620a79f
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 257 | 97 | 26 | 380 | 39 | 17 | 12 |
It is a timely and valuable dataset that can be useful not only for building classification but also for training multimodal AI models to better understand the function of urban buildings, as well as the underlying motivations of human activities and movements. To the best of my knowledge, it is the first multimodal dataset dedicated to building function classification that offers 26 distinct, fine-grained categories. This is a significant improvement over existing schemes that often mirror coarse land-use classifications. The study challenges the conventional belief that large models cannot handle multimodal spatial data, which is a high-quality contribution. Overall, the paper is well-structured, the methodology is sound, and the dataset indeed fills a clear gap in the Earth System Science community. The issues below should be addressed before publication.