<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="data-paper">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">ESSD</journal-id><journal-title-group>
    <journal-title>Earth System Science Data</journal-title>
    <abbrev-journal-title abbrev-type="publisher">ESSD</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Earth Syst. Sci. Data</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">1866-3516</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/essd-18-2609-2026</article-id><title-group><article-title>BuildingSense: a new multimodal building function classification dataset</article-title><alt-title>BuildingSense</alt-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Su</surname><given-names>Pengxiang</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2">
          <name><surname>Chen</surname><given-names>Runfei</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Xu</surname><given-names>Heng</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="yes" rid="aff1 aff2 aff3">
          <name><surname>Huang</surname><given-names>Wei</given-names></name>
          <email>wei_huang@tongji.edu.cn</email>
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff5">
          <name><surname>Deng</surname><given-names>Xinling</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff3">
          <name><surname>Li</surname><given-names>Songnian</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff4">
          <name><surname>Yan</surname><given-names>Wanglin</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Wu</surname><given-names>Hangbin</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Liu</surname><given-names>Chun</given-names></name>
          
        </contrib>
        <aff id="aff1"><label>1</label><institution>College of Surveying and Geo-informatics, Tongji University, Shanghai, China</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>Urban Mobility Institute, Tongji University, Shanghai, China</institution>
        </aff>
        <aff id="aff3"><label>3</label><institution>Department of Civil Engineering, Toronto Metropolitan University, Toronto, Canada</institution>
        </aff>
        <aff id="aff4"><label>4</label><institution>Faculty of Environment and Information Studies, Keio University, Fujisawa City, Japan</institution>
        </aff>
        <aff id="aff5"><label>5</label><institution>Cornell Tech, Cornell University, New York City, USA</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Wei Huang (wei_huang@tongji.edu.cn)</corresp></author-notes><pub-date><day>13</day><month>April</month><year>2026</year></pub-date>
      
      <volume>18</volume>
      <issue>4</issue>
      <fpage>2609</fpage><lpage>2634</lpage>
      <history>
        <date date-type="received"><day>19</day><month>November</month><year>2025</year></date>
           <date date-type="rev-request"><day>29</day><month>January</month><year>2026</year></date>
           <date date-type="rev-recd"><day>10</day><month>March</month><year>2026</year></date>
           <date date-type="accepted"><day>19</day><month>March</month><year>2026</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2026 Pengxiang Su et al.</copyright-statement>
        <copyright-year>2026</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026.html">This article is available from https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026.html</self-uri><self-uri xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026.pdf">The full text article is available as a PDF file from https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d2e181">Building function is a description of building usage. The accessibility of its information is essential for urban research, including urban morphology, urban environment, and human activity patterns. Existing building function classification methodologies face two major bottlenecks: (1) poor model interpretability and (2) inadequate multimodal feature fusion. Although large models with strong interpretability and efficient multimodal data fusion capabilities offer promising potential for addressing the bottlenecks, they remain limited in processing multimodal spatial datasets. Their performance in building function classification is therefore also unknown. To the best of our knowledge, there is a lack of multimodal building function classification datasets, which results in the challenge of effectively performing their performance evaluation. Meanwhile, prevailing building function categorization schemes remain coarse, which hinders their ability to support finer-grained urban research in the future. To bridge the gap, we constructed a novel multimodal and fine-grained dataset – BuildingSense – for building function classification, comprising over 34 000 buildings, 60 000 annotated images, 71 654 POIs, and 3400 building description texts in 26 distinct categories. Based on BuildingSense, we evaluated the performance of four state-of-the-art large models from the perspective of classification outcomes and reasoning processes. The results demonstrate that large models can effectively comprehend multimodal spatial data, challenging the conventional concept. Based on that, three directions for future research can be key: (1) build a categorized inference example database, (2) develop cost-effective classification models, and (3) quantify the confidence of model outputs. Our findings not only provide insights into the development of subsequent large model-based classification methods but also contribute to the advancement of multimodal fusion-based classification methods. The dataset and code of this paper can be accessed through <ext-link xlink:href="https://doi.org/10.6084/m9.figshare.30645776.v2" ext-link-type="DOI">10.6084/m9.figshare.30645776.v2</ext-link> <xref ref-type="bibr" rid="bib1.bibx29" id="paren.1"/>.</p>
  </abstract>
    
<funding-group>
<award-group id="gs1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>42171452</award-id>
</award-group>
</funding-group>
</article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d2e199">Buildings, as primary spatial carriers of urban functions, constitute a fundamental component of the urban system, substantially influencing the cultural, economic, and environmental development of cities <xref ref-type="bibr" rid="bib1.bibx1 bib1.bibx25 bib1.bibx49" id="paren.2"/>. Building function, as a vital attribute of buildings, refers to the purpose of building  usage, presenting a high-level summary of the possible human activities within a building <xref ref-type="bibr" rid="bib1.bibx32" id="paren.3"/>. It provides crucial information for research on urban planning, urban heat island effects, and human mobility <xref ref-type="bibr" rid="bib1.bibx28 bib1.bibx7" id="paren.4"/>. Traditionally, the acquisition of building function data primarily relied on field surveys, which are resource-intensive, resulting in long update cycles <xref ref-type="bibr" rid="bib1.bibx12 bib1.bibx38" id="paren.5"/>.</p>
      <p id="d2e214">The development of remote sensing technology, information and communication technology (ICT), and the availability of geo-tagged images have resulted in multimodal data presenting building attributes <xref ref-type="bibr" rid="bib1.bibx42" id="paren.6"/>. For example, building footprints provide high spatial resolution boundary of buildings, the widely covered street views and very-high-resolution (VHR) remote sensing imagery captures detailed surface textures of buildings, and point of interest (POI) data offers finer-grained semantic information about human activities within building spaces <xref ref-type="bibr" rid="bib1.bibx9 bib1.bibx15 bib1.bibx22 bib1.bibx27" id="paren.7"/>. These advancements have shifted the paradigm of building function data acquisition from survey-based  to data-driven, significantly reducing the costs and accelerating the data update cycles <xref ref-type="bibr" rid="bib1.bibx48" id="paren.8"/>.</p>

      <fig id="F1" specific-use="star"><label>Figure 1</label><caption><p id="d2e228">The overview of BuildingSense. The other building function category examples are shown in   Fig. S1 in the Supplement.</p></caption>
        <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f01.jpg"/>

      </fig>

      <p id="d2e238">Existing data-driven methods for acquiring building function information mostly focus on traditional machine learning and deep learning methods <xref ref-type="bibr" rid="bib1.bibx22 bib1.bibx34" id="paren.9"/>. These methods respectively employ feature engineering and convolutional operations to construct the embedding feature vectors, subsequently constructing complex mapping relationships between these vectors and building functions, and ultimately classifying the building functions. Although breakthroughs have been made in improving the classification accuracy, they inevitably introduce new challenges, such as interpretability of the models and the comprehensive fusion of multimodal features. These limitations result in poor performance of a trained model when it is transferred to another study area. Notably, large models, as large-scale Artificial Intelligence (AI) models pretrained on vast amounts of multimodal data, contain extensive knowledge of human society and possess robust capabilities in multimodal information extraction and understanding <xref ref-type="bibr" rid="bib1.bibx3 bib1.bibx5" id="paren.10"/>. Recently, with advancements in the technology of large models, they have evolved human-like reasoning abilities, enabling them to perform logical inferences based on extracted multimodal information and explicitly output their reasoning processes <xref ref-type="bibr" rid="bib1.bibx31" id="paren.11"/>. These characteristics highlight the models' outstanding interpretability and efficient multimodal data processing capabilities, shedding light on overcoming these bottlenecks.</p>
      <p id="d2e250">However, previous studies have revealed that large models exhibit limited performance when processing spatial datasets, such as POI-based urban function classification, street view image–based urban noise intensity classification, and remote sensing image scene classification  <xref ref-type="bibr" rid="bib1.bibx24 bib1.bibx18" id="paren.12"/>. This limitation indeed raises an uncertainty regarding whether they can effectively extract building function-related multimodal information and logically infer the building function. Therefore, it is urgent to systematically evaluate the performance of large models on a multimodal building function classification dataset, providing a benchmark for future research on building function classification methods using large models.</p>
      <p id="d2e256">To the best of our knowledge, there is a lack of multimodal datasets that are explicitly created for building function classification. Instead, most existing studies employ coarse-grained function classification schemes, directly using standard land-use classifications or reclassifying them  <xref ref-type="bibr" rid="bib1.bibx45 bib1.bibx17 bib1.bibx27" id="paren.13"/>. Such approaches cover up the finer-grained human activity information inherently embedded in buildings, leading to information missing for exploring urban issues from a more detailed perspective. The availability of this dataset can be beneficial for benchmarking the capabilities of large models under increased classification complexity. Therefore, it is imperative to construct a multimodal building function dataset with finer-grained function classification.</p>
      <p id="d2e262">To overcome the dataset limitation, we developed BuildingSense – a novel multimodal dataset for building functions classification, comprising over 60 000 annotated images, 71 654 POIs, and 3400 building description texts in 26 distinct categories (Detailed definition is shown in  Table S1 in the Supplement). It was collected based on 34 000 building footprints in New York City (NYC) (Fig. <xref ref-type="fig" rid="F1"/>). To the best of our knowledge, BuildingSense is the first multimodal dataset dedicated to building function classification, offering detailed function categories. We subsequently selected four commonly used large models, including Gemini-2.5-flash (Thinking), Claude-sonnet-4, QVQ-plus, and Deepseek-chat, as baselines to evaluate the performance of large models on building function classification. Drawing from this evaluation, we discuss potential future avenues for enhancing the ability of large models to classify building function. Our main contributions are summarized as follows: <list list-type="order"><list-item>
      <p id="d2e269">We develop BuildingSense dataset, which is the first multimodal dataset focused on building function classification, featuring a substantial collection of well-aligned VHR images, street view images, and POIs, along with detailed categories for building functions.</p></list-item><list-item>
      <p id="d2e273">We present a systematic evaluation of the large models' classification decision and inference process. Based on this, we found that the best-performing model (Gemini-2.5-flash (Thinking)) can logically infer the building function based on multimodal spatial data, which challenges the current perspective that large models cannot handle multimodal spatial data.</p></list-item><list-item>
      <p id="d2e277">We discuss three potential ways for improving the performance of large models in building function classification tasks: (1) constructing an external information database related to building functions, (2) creating small parameter models that maintain exceptional inference capabilities, and (3) measuring the confidence levels of the model's outputs.</p></list-item></list></p>
</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Literature review</title>
<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>Building function classification</title>
      <p id="d2e295">Cities are hierarchically nested systems, presenting a hierarchical structure. Such a complicated structure is conceived as a hierarchical soup, containing different “stuffs” such as buildings, roads, etc. These “stuffs” are obvious to see but need to be organized into something more coherent  <xref ref-type="bibr" rid="bib1.bibx16" id="paren.14"/>. In human activity studies, land use, building function, and POIs can be analogized as a hierarchical semantic soup. The relationships can be imagined as an inverted pyramid, with land use at the top and POI at the bottom. At the same time, the building function occupies an intermediate position in the hierarchical structure, providing more granular semantic information than land use while simultaneously synthesizing and contextualizing the semantic information of POIs within the building. Thus, building function data is essential for hierarchically understanding urban systems.</p>
      <p id="d2e301">However, our previous review has revealed that building function classification studies are relatively limited compared to the land use ones (Keywords for paper collection are shown in   Fig. S2) <xref ref-type="bibr" rid="bib1.bibx30" id="paren.15"/>. The contradictions between the relatively scarce research and the importance of the building function data indicate a current neglect of building functions within the academic community. It is noteworthy that the significant quantitative change observed in studies on building function classification (2020–2025) (Keywords for paper collection are shown in   Fig. S3) shows that 11 papers have been published in less than two years (2023–2025), exceeding the total number of articles from the previous three years (2020–2023) (Table <xref ref-type="table" rid="T1"/>). This striking contrast highlights a growing interest among researchers in the topic of building function classification. Accordingly, we review these studies from three dimensions to conclude the main limitation: (1) classification method and data, (2) building function categories, and (3) data accessibility.</p>

<table-wrap id="T1" specific-use="star"><label>Table 1</label><caption><p id="d2e312">Research on building function classification between 2020 and 2025. Method refers to classification method; POI refers to point of interest; RSI refers to remote sensing image; SVI refers to street view image; HMD refers to human mobility data; RN refers to road network; BT refers to building topology; CN refers to category number of building function; DA refers to data accessibility.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="10">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="center"/>
     <oasis:colspec colnum="4" colname="col4" align="center"/>
     <oasis:colspec colnum="5" colname="col5" align="center"/>
     <oasis:colspec colnum="6" colname="col6" align="center"/>
     <oasis:colspec colnum="7" colname="col7" align="center"/>
     <oasis:colspec colnum="8" colname="col8" align="center"/>
     <oasis:colspec colnum="9" colname="col9" align="right"/>
     <oasis:colspec colnum="10" colname="col10" align="center"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Research</oasis:entry>
         <oasis:entry colname="col2">Method</oasis:entry>
         <oasis:entry colname="col3">POI</oasis:entry>
         <oasis:entry colname="col4">RSI</oasis:entry>
         <oasis:entry colname="col5">SVI</oasis:entry>
         <oasis:entry colname="col6">HMD</oasis:entry>
         <oasis:entry colname="col7">RN</oasis:entry>
         <oasis:entry colname="col8">BT</oasis:entry>
         <oasis:entry colname="col9">CN</oasis:entry>
         <oasis:entry colname="col10">DA</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx22" id="text.16"/>
                  </oasis:entry>
         <oasis:entry colname="col2">Semi-supervised learning</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">6</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx13" id="text.17"/>
                  </oasis:entry>
         <oasis:entry colname="col2">LARSE</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">10</oasis:entry>
         <oasis:entry colname="col10">✓</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx14" id="text.18"/>
                  </oasis:entry>
         <oasis:entry colname="col2">UB-FineNet</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">11</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx48" id="text.19"/>
                  </oasis:entry>
         <oasis:entry colname="col2">XGBoost</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7">✓</oasis:entry>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">10</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx27" id="text.20"/>
                  </oasis:entry>
         <oasis:entry colname="col2">XGBoost</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">5</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx26" id="text.21"/>
                  </oasis:entry>
         <oasis:entry colname="col2">XGBoost</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">4</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx17" id="text.22"/>
                  </oasis:entry>
         <oasis:entry colname="col2">GraphSAGE</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">✓</oasis:entry>
         <oasis:entry colname="col9">7</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx10" id="text.23"/>
                  </oasis:entry>
         <oasis:entry colname="col2">Radom forest</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7">✓</oasis:entry>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">5</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx6" id="text.24"/>
                  </oasis:entry>
         <oasis:entry colname="col2">OneClassSVM</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7">✓</oasis:entry>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">7</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx45" id="text.25"/>
                  </oasis:entry>
         <oasis:entry colname="col2">Geo-aware transformer</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">✓</oasis:entry>
         <oasis:entry colname="col9">12</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx15" id="text.26"/>
                  </oasis:entry>
         <oasis:entry colname="col2">CNN</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">3</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx38" id="text.27"/>
                  </oasis:entry>
         <oasis:entry colname="col2">GraphSAGE</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">✓</oasis:entry>
         <oasis:entry colname="col9">5</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx37" id="text.28"/>
                  </oasis:entry>
         <oasis:entry colname="col2">Frequency based</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">3</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx9" id="text.29"/>
                  </oasis:entry>
         <oasis:entry colname="col2">XGBoost</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7">✓</oasis:entry>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">5</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx44" id="text.30"/>
                  </oasis:entry>
         <oasis:entry colname="col2">CNN</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">4</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx43" id="text.31"/>
                  </oasis:entry>
         <oasis:entry colname="col2">Tensor decomposition</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9">7</oasis:entry>
         <oasis:entry colname="col10"/>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p id="d2e897">First, from the perspective of classification method and data, we reveal that the majority of studies utilize multimodal data and their methods are dominated by deep learning and traditional machine learning algorithms (Table <xref ref-type="table" rid="T1"/>).  While the multimodal data are adopted to classify building functions, demonstrating the advantage of data fusion, they heavily rely on manual feature engineering, which is unable to guarantee that task-relevant features are fully extracted from each data instance during the feature construction process, resulting in partial information loss and limited classification performance <xref ref-type="bibr" rid="bib1.bibx48 bib1.bibx27 bib1.bibx26 bib1.bibx17 bib1.bibx10 bib1.bibx6 bib1.bibx45 bib1.bibx38 bib1.bibx9" id="paren.32"/>. Recent studies have shown that deep learning methods, which extract deeper data features through convolutional operations, ensure more comprehensive utilization of information and significantly improve the classification accuracy <xref ref-type="bibr" rid="bib1.bibx22 bib1.bibx14 bib1.bibx13 bib1.bibx15 bib1.bibx44" id="paren.33"/>. It should be noticed that their black-box characteristic substantially limits the interpretability of the models. Additionally, the predominant usage of single-modal data in these studies indicates that the application of multimodal approaches in building function classification remains an in-depth exploration, under the condition that the multimodal fusion methods have demonstrated superior performance.</p>
      <p id="d2e908">Current building function classification algorithms suffer from two major limitations: (1) the lack of interpretability in models, and (2) the absence of deep feature fusion from multimodal data. These limitations lead to poor model transferability and excessive reliance on training data. Notably, current technological advances present promising solutions to address these limitations. Reasoning-based large models (e.g., GPT-4o, Gemini) offer a competitive approach in deep learning interpretability and multimodal fusion. However, their multimodal synthesis analysis and reasoning ability remain substantially evaluated in the context of building function classification <xref ref-type="bibr" rid="bib1.bibx24 bib1.bibx18" id="paren.34"/>. Such conditions highlight an imperative demand for multimodal building function classification datasets.</p>
      <p id="d2e914">Second, from the perspective of building function categories, the current building function classification framework employed in the Table <xref ref-type="table" rid="T1"/>'s research remains coarse-grained, indicating these studies merely mirror the land use category or straightforwardly reclassify based on the land use category <xref ref-type="bibr" rid="bib1.bibx45" id="paren.35"/>. Although this division can facilitate bottom-top land use classification, it fundamentally overlooks the intrinsic connection between building function and fine-grained human activity. Referring to the POI type taxonomy, it employs multi-level taxonomies with detailed terminal categories that precisely describe the specific human activity. In analogy, land use, situated at the top of the classification hierarchy, offers a coarse-grained generalization of human activities at the parcel level. POIs, located at the bottom of the hierarchy, provide fine-grained descriptions of human activities. Building function, serving as a critical intermediate layer, should ideally provide deeper semantic information than land use and generalize the semantic information of diverse POIs within the building.</p>

<table-wrap id="T2" specific-use="star"><label>Table 2</label><caption><p id="d2e925">Example of typical building function categories. Building function categories 12, 7, 5, and 3 were selected from Table <xref ref-type="table" rid="T1"/> to illustrate the current mainstream classification of building functions.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="4">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx45" id="text.36"/>
                  </oasis:entry>
         <oasis:entry colname="col2">
                    <xref ref-type="bibr" rid="bib1.bibx17" id="text.37"/>
                  </oasis:entry>
         <oasis:entry colname="col3">
                    <xref ref-type="bibr" rid="bib1.bibx27" id="text.38"/>
                  </oasis:entry>
         <oasis:entry colname="col4">
                    <xref ref-type="bibr" rid="bib1.bibx15" id="text.39"/>
                  </oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Urban village</oasis:entry>
         <oasis:entry colname="col2">Residential</oasis:entry>
         <oasis:entry colname="col3">Residential</oasis:entry>
         <oasis:entry colname="col4">Residential</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Urban residential</oasis:entry>
         <oasis:entry colname="col2">Commercial</oasis:entry>
         <oasis:entry colname="col3">Commercial</oasis:entry>
         <oasis:entry colname="col4">Commercial</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Business office</oasis:entry>
         <oasis:entry colname="col2">Urban villages</oasis:entry>
         <oasis:entry colname="col3">Industrial</oasis:entry>
         <oasis:entry colname="col4">Other</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Big catering</oasis:entry>
         <oasis:entry colname="col2">Communal facilities</oasis:entry>
         <oasis:entry colname="col3">Public service</oasis:entry>
         <oasis:entry colname="col4"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Shopping center</oasis:entry>
         <oasis:entry colname="col2">Education</oasis:entry>
         <oasis:entry colname="col3">Landscape</oasis:entry>
         <oasis:entry colname="col4"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Hotel</oasis:entry>
         <oasis:entry colname="col2">Warehouse and factories</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Recreation &amp; Tourism</oasis:entry>
         <oasis:entry colname="col2">Mixed function</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Company &amp; Factories</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Industrial park</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Administrative</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Education</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Medical</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p id="d2e1138">However, the current mainstream building function classification (Table <xref ref-type="table" rid="T2"/>) fails to fulfill its intermediary role in bridging land use and POIs. This limitation represents a fundamental conceptual gap in the existing building function classification framework, where the unique position of buildings as spatial containers for diverse human activities remains underexploited. This drawback highlights the need for more refined building function categories.</p>
      <p id="d2e1144">Third, from the perspective of building function data accessibility, only one paper shares its building function raw data <xref ref-type="bibr" rid="bib1.bibx13" id="paren.40"/>. This situation leads to a biased comparison of building function classification methodology, restricting the development of the method. This issue is compounded by the conflict between the shared dataset and the current demand for multimodal methodology development. Thus, an open-source and multimodal building function classification dataset is in pressing need.</p>
      <p id="d2e1150">In conclusion, building function classification is an emerging research direction. Although the studies of building function classification are becoming more and more prevalent, there are three limitations: (1) the advantage of multimodal fusion and the powerful reasoning ability of multimodal large models remains underexplored; (2) the current division of building function categories is coarse; (3) an open-source and high-quality building function dataset is in urgent need. Thus, instead of developing state-of-the-art methods for building function classification, conducting foundational work  –  building a multimodal and fine-grained building function classification dataset  –  is practically indispensable for future building function classification.</p>
</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Existing building-related dataset</title>
      <p id="d2e1161">To further illustrate the necessity of building a multimodal and fine-grained building function classification dataset, we conducted a comprehensive review of existing building-related datasets (Table <xref ref-type="table" rid="T3"/>). We divided the task into three categories, including image segmentation (represented by S), object detection (represented by O), and building classification (represented by C). Additionally, CMAB (A Multi-Attribute Building Dataset of China) is only a product instead of a training dataset <xref ref-type="bibr" rid="bib1.bibx46" id="paren.41"/>. Our analysis reveals two critical limitations in current datasets: (1) the majority of datasets (Cityscapes, SkyScapes, etc) exclusively contain single-view imagery (either street view image or remote sensing image), which are primarily designed for segmentation or object detection; (2) while some multiview (the dataset with street view and remote sensing image in Table <xref ref-type="table" rid="T3"/>) datasets (TorontoCity, Wojna, etc) have incorporated basic building information annotations, they remain limited in modal diversity and semantic depth. Therefore, it can be concluded that existing datasets universally lack multimodal information and employ coarse building function classifications. An ideal building function classification dataset should incorporate multimodal data related to buildings along with fine-grained functional annotations, thereby supporting the development of both unimodal and multimodal building function classification methods. Finally, the classified fine-grained building function data can be utilized for advanced urban research that requires granular spatial-semantic analysis.</p>
      <p id="d2e1171">Thus, we constructed BuildingSense to provide fundamental advances in building function classification. The dataset comprises over 34 000 accurately aligned building footprints with multimodal, multiview, and fine-grained functions. Although the dataset is moderate in size, we applied stratified sampling to ensure a highly representative and less biased sample, enhancing its analytical value over larger but potentially skewed alternatives (Table <xref ref-type="table" rid="T3"/>). To our knowledge, this is the first multimodal building-centric dataset to offer fine-grained functional annotations.</p>

<table-wrap id="T3" specific-use="star"><label>Table 3</label><caption><p id="d2e1179">Building-related dataset. NB refers to the number of buildings; Balance refers to whether the dataset considers each category sample of the dataset is balanced or not; POI refers to point of interest; RSI refers to remote sensing image; SVI refers to street view image; RN refers to road network; H refers to building height; Y refers to building year; roof refers to building roof; F refers to building function; CN refers to the number of function category.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="13">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="center"/>
     <oasis:colspec colnum="4" colname="col4" align="center"/>
     <oasis:colspec colnum="5" colname="col5" align="center"/>
     <oasis:colspec colnum="6" colname="col6" align="center"/>
     <oasis:colspec colnum="7" colname="col7" align="center"/>
     <oasis:colspec colnum="8" colname="col8" align="center"/>
     <oasis:colspec colnum="9" colname="col9" align="center"/>
     <oasis:colspec colnum="10" colname="col10" align="center"/>
     <oasis:colspec colnum="11" colname="col11" align="center"/>
     <oasis:colspec colnum="12" colname="col12" align="right"/>
     <oasis:colspec colnum="13" colname="col13" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Dataset</oasis:entry>
         <oasis:entry colname="col2">NB</oasis:entry>
         <oasis:entry colname="col3">Balance</oasis:entry>
         <oasis:entry colname="col4">POI</oasis:entry>
         <oasis:entry colname="col5">RSI</oasis:entry>
         <oasis:entry colname="col6">SVI</oasis:entry>
         <oasis:entry colname="col7">RN</oasis:entry>
         <oasis:entry colname="col8">H</oasis:entry>
         <oasis:entry colname="col9">Y</oasis:entry>
         <oasis:entry colname="col10">Roof</oasis:entry>
         <oasis:entry colname="col11">F</oasis:entry>
         <oasis:entry colname="col12">CN</oasis:entry>
         <oasis:entry colname="col13">Task</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">KITTI (<xref ref-type="bibr" rid="bib1.bibx11" id="altparen.42"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Cityscapes (<xref ref-type="bibr" rid="bib1.bibx8" id="altparen.43"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">EuroCity (<xref ref-type="bibr" rid="bib1.bibx4" id="altparen.44"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">O</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">WildPASS (<xref ref-type="bibr" rid="bib1.bibx41" id="altparen.45"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">PASS (<xref ref-type="bibr" rid="bib1.bibx40" id="altparen.46"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">HoliCity (<xref ref-type="bibr" rid="bib1.bibx23" id="altparen.47"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">SkyScapes (<xref ref-type="bibr" rid="bib1.bibx2" id="altparen.48"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">SpaceNet (<xref ref-type="bibr" rid="bib1.bibx35" id="altparen.49"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx20" id="text.50"/>
                  </oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">TorontoCity (<xref ref-type="bibr" rid="bib1.bibx33" id="altparen.51"/>)</oasis:entry>
         <oasis:entry colname="col2">400 000</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">✓</oasis:entry>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Wojna (<xref ref-type="bibr" rid="bib1.bibx36" id="altparen.52"/>)</oasis:entry>
         <oasis:entry colname="col2">9674</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10">✓</oasis:entry>
         <oasis:entry colname="col11">✓</oasis:entry>
         <oasis:entry colname="col12">6</oasis:entry>
         <oasis:entry colname="col13">C</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">OmniCity (<xref ref-type="bibr" rid="bib1.bibx21" id="altparen.53"/>)</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">✓</oasis:entry>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11">✓</oasis:entry>
         <oasis:entry colname="col12">7</oasis:entry>
         <oasis:entry colname="col13">S</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">CMAB (<xref ref-type="bibr" rid="bib1.bibx46" id="altparen.54"/>)</oasis:entry>
         <oasis:entry colname="col2">31 000 000</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">✓</oasis:entry>
         <oasis:entry colname="col9">✓</oasis:entry>
         <oasis:entry colname="col10">✓</oasis:entry>
         <oasis:entry colname="col11">✓</oasis:entry>
         <oasis:entry colname="col12">6</oasis:entry>
         <oasis:entry colname="col13">P</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">
                    <xref ref-type="bibr" rid="bib1.bibx13" id="text.55"/>
                  </oasis:entry>
         <oasis:entry colname="col2">500 000</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12">10</oasis:entry>
         <oasis:entry colname="col13">C</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Ours</oasis:entry>
         <oasis:entry colname="col2">34 458</oasis:entry>
         <oasis:entry colname="col3">✓</oasis:entry>
         <oasis:entry colname="col4">✓</oasis:entry>
         <oasis:entry colname="col5">✓</oasis:entry>
         <oasis:entry colname="col6">✓</oasis:entry>
         <oasis:entry colname="col7">✓</oasis:entry>
         <oasis:entry colname="col8">✓</oasis:entry>
         <oasis:entry colname="col9">✓</oasis:entry>
         <oasis:entry colname="col10">✓</oasis:entry>
         <oasis:entry colname="col11">✓</oasis:entry>
         <oasis:entry colname="col12">26</oasis:entry>
         <oasis:entry colname="col13">C</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
</sec>
<sec id="Ch1.S3">
  <label>3</label><title>Method</title>
      <p id="d2e1867">As illustrated in the Fig. <xref ref-type="fig" rid="F2"/>, the methodological workflow of our dataset consists of three components: (1) building-related data and annotation collection, (2) building annotation and data cleaning, and (3) evaluation on large models. The first component describes how building footprints were sampled and how the relevant data and annotations were collected based on these footprints (Sect. <xref ref-type="sec" rid="Ch1.S3.SS1"/>). The second component details the cleaning process for the collected data and annotations, ultimately yielding a high-quality multimodal dataset (Sect. <xref ref-type="sec" rid="Ch1.S3.SS2"/>). The third component outlines the setup for baseline comparisons to evaluate the performance of different large models and the evaluation method for the optimal model (Sect. <xref ref-type="sec" rid="Ch1.S3.SS3"/>).</p>

      <fig id="F2" specific-use="star"><label>Figure 2</label><caption><p id="d2e1880">The workflow chart of BuildingSense benchmark.</p></caption>
        <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f02.png"/>

      </fig>

<sec id="Ch1.S3.SS1">
  <label>3.1</label><title>Building-related annotation and multimodal data collection</title>
      <p id="d2e1896">As shown in Fig. <xref ref-type="fig" rid="F3"/>, we collect remote sensing images, street view images, and POIs based on each building footprint in Fig. <xref ref-type="fig" rid="F3"/>. The data sources are summarized in Table <xref ref-type="table" rid="T4"/>. They are either from the government or a licensed commercial company, except for the road network. The sources of these data ensure the reliability of BuildingSense. The following section details the annotation and multimodal data collection (Sect. <xref ref-type="sec" rid="Ch1.S3.SS1.SSS1"/>, <xref ref-type="sec" rid="Ch1.S3.SS1.SSS2"/>, and <xref ref-type="sec" rid="Ch1.S3.SS1.SSS3"/>, <xref ref-type="sec" rid="Ch1.S3.SS1.SSS4"/>).</p>

      <fig id="F3" specific-use="star"><label>Figure 3</label><caption><p id="d2e1916">Building-related remote sensing image, street view image, and POI data collection. An example of the raw dataset is shown in Fig. S4.</p></caption>
          <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f03.png"/>

        </fig>

<table-wrap id="T4" specific-use="star"><label>Table 4</label><caption><p id="d2e1928">Data sources of BuildingSense.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="4">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Data</oasis:entry>
         <oasis:entry colname="col2">Year</oasis:entry>
         <oasis:entry colname="col3">Source</oasis:entry>
         <oasis:entry colname="col4">Acquired via</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Building footprint</oasis:entry>
         <oasis:entry colname="col2">2024</oasis:entry>
         <oasis:entry colname="col3">NYC office</oasis:entry>
         <oasis:entry colname="col4">NYC open data</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Building footprint annotation</oasis:entry>
         <oasis:entry colname="col2">2024</oasis:entry>
         <oasis:entry colname="col3">NYC office</oasis:entry>
         <oasis:entry colname="col4">NYC open data</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">NYC taxi zones</oasis:entry>
         <oasis:entry colname="col2">2025</oasis:entry>
         <oasis:entry colname="col3">NYC office</oasis:entry>
         <oasis:entry colname="col4">NYC open data</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Remote sensing image</oasis:entry>
         <oasis:entry colname="col2">2018</oasis:entry>
         <oasis:entry colname="col3">NYC office</oasis:entry>
         <oasis:entry colname="col4">NYC open data</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Street view image</oasis:entry>
         <oasis:entry colname="col2">2025</oasis:entry>
         <oasis:entry colname="col3">Google Maps</oasis:entry>
         <oasis:entry colname="col4">Google Maps API</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">POI</oasis:entry>
         <oasis:entry colname="col2">2025</oasis:entry>
         <oasis:entry colname="col3">Google Maps</oasis:entry>
         <oasis:entry colname="col4">Google Maps API</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Road networks</oasis:entry>
         <oasis:entry colname="col2">2025</oasis:entry>
         <oasis:entry colname="col3">OSM</oasis:entry>
         <oasis:entry colname="col4">OSM website</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

<sec id="Ch1.S3.SS1.SSS1">
  <label>3.1.1</label><title>Building annotation and footprint</title>
      <p id="d2e2080">The building footprints collected in BuildingSense are sampled from the official building footprint data published by NYC Open Data. To avoid the geographical bias and imbalance in the categories of samples, the sampling process adhered strictly to two principles: (1) the sampled buildings should be spatially evenly distributed, and (2) the distribution of building functional categories after sampling should be uniform. Based on principle (1), we utilize NYC taxi zones (spatial distribution shown in  Fig. S5) as the spatial sampling units and assign each building to a corresponding zone number via spatial computations. Regarding principle (2), we reclassify NYC's building functional categories (details on category mappings and definitions are provided in   Table S1 in the Supplement). Considering their inherent long-tail distribution characteristics (as illustrated in  Fig. S6), we set a target of 1000 samples per category. For categories containing fewer than 1000 buildings, all available buildings are sampled to maintain balance in the distribution of building functions within each category. Additionally, to explicitly represent the intrinsic long-tail nature of building function distributions, we introduce an extra 16 000 residential buildings, which form the dataset that adheres to realistic distributional properties.</p>
      <p id="d2e2083">The building footprint dataset from NYC contains officially annotated information, including building height, constructed year, function, and location. The functional annotations of BuildingSense originate from the official NYC building classifications. In BuildingSense, we specifically extract building height, constructed year, location, and function as annotation attributes for the sampled building footprints.</p>
</sec>
<sec id="Ch1.S3.SS1.SSS2">
  <label>3.1.2</label><title>Street view image</title>
      <p id="d2e2094">The street view images of buildings are collected using the method proposed by <xref ref-type="bibr" rid="bib1.bibx17" id="text.56"/>, which requires road network data. Therefore, OSM road network data for NYC was collected, and sidewalks and bicycle lanes were removed through attribute-based filtering to reduce the proportion of indoor images and distorted viewpoints within the collected street view images. The schematic diagram of this method is illustrated in Fig. <xref ref-type="fig" rid="F4"/>. First, the nearest projection point of each building's centroid onto the road network is computed as the sampling point. Next, based on the building height and the corresponding projection point, parameters required for crawling street view images, namely Pitch (Eq. <xref ref-type="disp-formula" rid="Ch1.E1"/>) and Heading (Eq. <xref ref-type="disp-formula" rid="Ch1.E2"/>), are calculated. Finally, street view images corresponding to these parameters are retrieved through Google Place API requests.

              <disp-formula id="Ch1.E1" content-type="numbered"><label>1</label><mml:math id="M1" display="block"><mml:mrow><mml:mi mathvariant="normal">Pitch</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="normal">arctan</mml:mi><mml:mfenced open="(" close=")"><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mi>h</mml:mi><mml:msqrt><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:msqrt></mml:mfrac></mml:mstyle></mml:mfenced></mml:mrow></mml:math></disp-formula>

            

                  <disp-formula id="Ch1.E2" specific-use="gather" content-type="subnumberedsingle"><mml:math id="M2" display="block"><mml:mtable displaystyle="true"><mml:mlabeledtr id="Ch1.E2.3"><mml:mtd><mml:mtext>2a</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true" class="stylechange"/><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="normal">arctan</mml:mi><mml:mfenced open="(" close=")"><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:mfrac></mml:mstyle></mml:mfenced><mml:mo>,</mml:mo><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>∈</mml:mo><mml:mo>[</mml:mo><mml:mo>-</mml:mo><mml:mn mathvariant="normal">90</mml:mn><mml:mi mathvariant="italic">°</mml:mi><mml:mo>,</mml:mo><mml:mn mathvariant="normal">90</mml:mn><mml:mi mathvariant="italic">°</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mlabeledtr><mml:mlabeledtr id="Ch1.E2.4"><mml:mtd><mml:mtext>2b</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true" class="stylechange"/><mml:mi mathvariant="normal">Heading</mml:mi><mml:mo>=</mml:mo><mml:mfenced open="{" close=""><mml:mtable class="cases" rowspacing="0.2ex" columnspacing="1em" columnalign="left left" framespacing="0em"><mml:mtr><mml:mtd><mml:mrow><mml:mn mathvariant="normal">90</mml:mn><mml:mi mathvariant="italic">°</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if </mml:mtext><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>&gt;</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mn mathvariant="normal">270</mml:mn><mml:mi mathvariant="italic">°</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if </mml:mtext><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>≤</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mfenced></mml:mrow></mml:mtd></mml:mlabeledtr></mml:mtable></mml:math></disp-formula>

            where <inline-formula><mml:math id="M3" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> refers to the longitude and latitude position of the street view sampling point, <inline-formula><mml:math id="M4" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> refers to the longitude and latitude position of the building center point, and <inline-formula><mml:math id="M5" display="inline"><mml:mi>h</mml:mi></mml:math></inline-formula> refers to the height of the building.</p>
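      <p>A minimal sketch of Eqs. (1) and (2) is given below, assuming the sampling point and the building centroid are expressed in a projected coordinate system whose units match the building height (e.g., metres); the helper function is illustrative rather than the exact implementation.</p>
      <preformat>
import math

def pitch_and_heading(x1, y1, x2, y2, h):
    """Sketch of Eqs. (1)-(2): pitch from building height and horizontal
    distance, heading measured clockwise from north."""
    dx, dy = x2 - x1, y2 - y1
    dist = math.hypot(dx, dy)

    # Eq. (1): camera pitch toward the top of the building.
    pitch = math.degrees(math.atan2(h, dist))

    # Eq. (2a): auxiliary angle theta in [-90 deg, 90 deg].
    if dx != 0:
        theta = math.degrees(math.atan(dy / dx))
    else:
        theta = 90.0 if dy > 0 else -90.0  # limit of arctan for dx = 0

    # Eq. (2b): branch on the sign of x2 - x1.
    heading = 90.0 - theta if dx > 0 else 270.0 - theta
    return pitch, heading % 360.0

# Example: a 30 m tall building 40 m east and 30 m north of the sampling point.
print(pitch_and_heading(0.0, 0.0, 40.0, 30.0, 30.0))  # approx. (31.0, 53.1)
      </preformat>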

      <fig id="F4"><label>Figure 4</label><caption><p id="d2e2372">Street view image sampling schema.</p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f04.png"/>

          </fig>

</sec>
<sec id="Ch1.S3.SS1.SSS3">
  <label>3.1.3</label><title>Remote sensing image</title>
      <p id="d2e2389">The remote sensing imagery is derived from orthophotos published by the NYC government in 2018. Despite the temporal discrepancy between the orthophotos and building footprints, we deem them compatible due to the relatively limited changes in NYC's building infrastructure during the intervening period, which is subsequently confirmed during the data cleaning process (Sect. <xref ref-type="sec" rid="Ch1.S3.SS2"/>). Based on the orthophotos and building footprints, we extract top-down images of building rooftops and their surrounding 60.96 m buffer zones using the masking technique. These images represent rooftop characteristics and environmental contexts, respectively. Moreover, rooftop images could potentially be re-annotated to facilitate tasks such as rooftop material classification.</p>
</sec>
<sec id="Ch1.S3.SS1.SSS4">
  <label>3.1.4</label><title>POI</title>
      <p id="d2e2403">The POI data of buildings in BuildingSense are collected from Google Maps. Given the constraints imposed by API return limits, we employed a comprehensive deep-search approach to extract POI information within the boundary of the building footprint. The detailed procedure for this algorithm is shown in the   Algorithm S1 in the Supplement. The collected POI data comprises several attributes: place id, name, latitude, longitude, types, primary type, and current opening hours (Table <xref ref-type="table" rid="T5"/>). Specifically, “types” and “primary type” represent the diverse and primary functions of the POIs, respectively, and “current opening hours” facilitates the future studies of dynamic building functions classification.</p>

<table-wrap id="T5" specific-use="star"><label>Table 5</label><caption><p id="d2e2411">Information of POI sample data.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="2">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Field</oasis:entry>
         <oasis:entry colname="col2">Value</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Index</oasis:entry>
         <oasis:entry colname="col2">1</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Place id</oasis:entry>
         <oasis:entry colname="col2">ChIJ2SZR_QdZwokRT5cHeU0dKUo</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Name</oasis:entry>
         <oasis:entry colname="col2">SBFI Group</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Latitude</oasis:entry>
         <oasis:entry colname="col2">40.745241</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Longitude</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M6" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>73.982472</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Types</oasis:entry>
         <oasis:entry colname="col2">[furniture_store, home_improvement_store, home_goods_store, store]</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Primary type</oasis:entry>
         <oasis:entry colname="col2">furniture_store</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Current opening hours</oasis:entry>
         <oasis:entry colname="col2">09:00–17:00</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
</sec>
<sec id="Ch1.S3.SS2">
  <label>3.2</label><title>Building annotation and matching error cleaning</title>
      <p id="d2e2527">To further verify the data quality, we manually audit the quality of each collected data instance and the results of data matching. The process is presented in Fig. <xref ref-type="fig" rid="F5"/>. We visualize the integrated data, including building footprints, a top-down buffer image of the building footprint, street view image, POI, and the collection process of the street view image. Based on the visualization, data obtained from the aforementioned processes were manually reviewed and cleaned to identify and rectify two types of errors: (1) matching errors and (2) labeling errors. Matching errors encompassed mismatches between street view images and building footprints, between buildings and remote sensing images, and between buildings and POI data. Labeling errors specifically refer to the origin labeling errors. To ensure annotation quality, annotators are instructed only to modify clearly erroneous category labels while preserving original official annotations in cases of uncertainty.  Different strategies are employed to address these errors: labeling errors are resolved through manual re-annotation; street view mismatches are rectified by manually selecting alternative sampling points; and remote sensing image mismatches are removed from the dataset.</p>

      <fig id="F5" specific-use="star"><label>Figure 5</label><caption><p id="d2e2534">Illustration of data cleaning. Examples of errors are shown in  Fig. S7.</p></caption>
          <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f05.jpg"/>

        </fig>


</sec>
<sec id="Ch1.S3.SS3">
  <label>3.3</label><title>Experimental setting for evaluating large models on BuildingSense</title>
      <p id="d2e2553">The characteristics of large models are highly aligned with the requirements for building function classification tasks, which can be concluded in three aspects: (1) powerful multimodal data processing capabilities enable the extraction of deep-level information from multimodal data, (2) extensive knowledge of human society allows for accurately interpret the extracted multimodal information, and (3) reasoning abilities facilitate logical inference based on both societal knowledge and multimodal data to classify building functions. Additionally, compared to other multimodal deep learning methods, the output of large models' reasoning process offers three distinct advantages: (1) the explicit output of the reasoning process enhances methodological transparency, (2) the inference chains can be manually verified and rectified to generate improved training datasets, and (3) corrected outputs enable further model refinement. Therefore, we systematically evaluate the performance of large models on BuildingSense to verify whether large models can understand multimodal spatial data.</p>
<sec id="Ch1.S3.SS3.SSS1">
  <label>3.3.1</label><title>Baselines</title>
      <p id="d2e2563">We benchmark four state-of-the-art large models, Gemini-2.5-flash (Thinking), Claude-sonnet-4, QVQ-plus, and Deepseek-chat on the balanced BuildingSense (18 458 buildings), to comprehensively evaluate the large model. Based on the collected data, we designed seven data combination configurations (detailed in   Table S2): single-modality (text/imagery), dual-modality (e.g., text <inline-formula><mml:math id="M7" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula> street view imagery), multi-view (street view imagery <inline-formula><mml:math id="M8" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula> remote sensing imagery), and dual-modality and multi-view combinations (text <inline-formula><mml:math id="M9" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula> street view imagery <inline-formula><mml:math id="M10" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula> remote sensing imagery). For each data combination, we design a typical prompt to guide the analysis process of large models (detailed in  Table S2). Due to computational constraints, single-modality tests were limited to Deepseek-chat and QVQ-plus, with a primary focus on the performance of multimodal/multiview. Notably, we precluded multiview evaluations on QVQ-plus from our benchmark evaluation due to the image input restriction of QVQ-plus.</p>
      <p id="d2e2594">The exclusion of traditional deep learning and machine learning methods from the baselines is motivated by two points: (1) two specific bottlenecks identified in the existing literature on building function classification (Sect. 2.1) and (2) the fundamental differences in their evaluation paradigms and spatial scalability compared to large models.</p>
      <p id="d2e2597">First, two specific bottlenecks are as follows: (1) insufficient feature extraction and fusion <xref ref-type="bibr" rid="bib1.bibx48 bib1.bibx27 bib1.bibx26 bib1.bibx10 bib1.bibx6 bib1.bibx9 bib1.bibx22 bib1.bibx14 bib1.bibx13 bib1.bibx15 bib1.bibx44" id="paren.57"/> and (2) poor interpretability of model outputs <xref ref-type="bibr" rid="bib1.bibx22 bib1.bibx14 bib1.bibx13 bib1.bibx15 bib1.bibx44" id="paren.58"/>. Second, the differences in their evaluation paradigms and spatial scalability are as follows: (1) from the perspective of evaluation paradigms, we assess the out-of-the-box zero/few-shot reasoning capabilities of large models on the full dataset. In contrast, evaluating traditional machine learning/deep learning methods requires partitioning the majority of the data into a training set. Such an approach allows the models to learn data characteristics in advance, leading to better-fitting results. Therefore, it is unfair to directly compare large models with deep learning models trained for building function classification. (2) From the perspective of spatial scalability, traditional deep learning models inherently overfit on the training dataset, leading to performance decreasing when transferred to another region or when the data format changes. Such a limitation renders them unscalable for large-scale urban applications.</p>
      <p id="d2e2606">Given these limitations, large models have emerged as a promising approach. Its extensive world knowledge and human-like reasoning abilities (spatial scalability) enable it to jointly interpret visual and textual cues (sufficient feature extraction and fusion), thereby explicitly articulating its inference processes (interpretability) <xref ref-type="bibr" rid="bib1.bibx5 bib1.bibx31" id="paren.59"/>. These characteristics directly target the limitations mentioned above, which remain underexplored. Therefore, we select four state-of-the-art large models to test the hypothesis that large models can overcome traditional limitations in multimodal building-function classification.</p>
</sec>
<sec id="Ch1.S3.SS3.SSS2">
  <label>3.3.2</label><title>Evaluation metric</title>
      <p id="d2e2620">We assessed model performance based on the category-balanced dataset, using four established classification metrics:

                  <disp-formula specific-use="gather" content-type="numbered"><mml:math id="M11" display="block"><mml:mtable displaystyle="true"><mml:mlabeledtr id="Ch1.E5"><mml:mtd><mml:mtext>3</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle class="stylechange" displaystyle="true"/><mml:mtext>Accuracy</mml:mtext><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mi mathvariant="normal">TP</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">TN</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">TP</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">TN</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">FP</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">FN</mml:mi></mml:mrow></mml:mfrac></mml:mstyle></mml:mrow></mml:mtd></mml:mlabeledtr><mml:mlabeledtr id="Ch1.E6"><mml:mtd><mml:mtext>4</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle class="stylechange" displaystyle="true"/><mml:mtext>Precision</mml:mtext><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mi mathvariant="normal">TP</mml:mi><mml:mrow><mml:mi mathvariant="normal">TP</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">FP</mml:mi></mml:mrow></mml:mfrac></mml:mstyle></mml:mrow></mml:mtd></mml:mlabeledtr><mml:mlabeledtr id="Ch1.E7"><mml:mtd><mml:mtext>5</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle class="stylechange" displaystyle="true"/><mml:mtext>Recall</mml:mtext><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mi mathvariant="normal">TP</mml:mi><mml:mrow><mml:mi mathvariant="normal">TP</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">FN</mml:mi></mml:mrow></mml:mfrac></mml:mstyle></mml:mrow></mml:mtd></mml:mlabeledtr><mml:mlabeledtr id="Ch1.E8"><mml:mtd><mml:mtext>6</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true" class="stylechange"/><mml:msub><mml:mi>F</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mtext>-score</mml:mtext><mml:mo>=</mml:mo><mml:mn mathvariant="normal">2</mml:mn><mml:mo>×</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mtext>Precision</mml:mtext><mml:mo>×</mml:mo><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mrow><mml:mtext>Precision</mml:mtext><mml:mo>+</mml:mo><mml:mtext>Recall</mml:mtext></mml:mrow></mml:mfrac></mml:mstyle></mml:mrow></mml:mtd></mml:mlabeledtr></mml:mtable></mml:math></disp-formula>

            where TP <inline-formula><mml:math id="M12" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> true positives, TN <inline-formula><mml:math id="M13" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> true negatives, FP <inline-formula><mml:math id="M14" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> false positives, and FN <inline-formula><mml:math id="M15" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> false negatives. Equation (<xref ref-type="disp-formula" rid="Ch1.E5"/>) represents Accuracy (Acc). Equation (<xref ref-type="disp-formula" rid="Ch1.E6"/>) represents Precision (Pre). Equation (<xref ref-type="disp-formula" rid="Ch1.E7"/>) represents the Recall rate. Equation (<xref ref-type="disp-formula" rid="Ch1.E8"/>) represents the <inline-formula><mml:math id="M16" display="inline"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>-score (<inline-formula><mml:math id="M17" display="inline"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>).</p>
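      <p>A minimal sketch of computing Eqs. (3)-(6) for the multi-class case is given below, assuming hypothetical label lists; the averaging scheme shown (macro) is an assumption, since the per-class scores can also be aggregated with weighted averaging.</p>
      <preformat>
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical ground-truth and predicted category labels.
y_true = ["Residential", "Office", "Sports", "Office"]
y_pred = ["Residential", "Office", "Entertainment", "Office"]

acc = accuracy_score(y_true, y_pred)
pre, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                  average="macro",
                                                  zero_division=0)
print(f"Acc={acc:.2f}  Pre={pre:.2f}  Recall={rec:.2f}  F1={f1:.2f}")
      </preformat>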
</sec>
<sec id="Ch1.S3.SS3.SSS3">
  <label>3.3.3</label><title>Manual evaluation method for the best model</title>
      <p id="d2e2811">As illustrated in Fig. <xref ref-type="fig" rid="F6"/>, our structured prompt template requires models to generate three components, including the analysis of the picture modal, the analysis of the text modal, and the final results. The analysis of picture modal includes the interpretation of the rooftop characteristics and the building's surrounding context, as well as the building's ground-level facade and the surrounding environment analysis. Analysis of text modal contains location geocoding, building height and year understanding, semantic analysis of POIs' names, and integrated visual-textual reasoning. The final results include the reason for the decision, building function, and additional notes. The additional notes are designed to evaluate the model's dialectical reasoning capacity. Based on the model output, we defined six criteria to assess model performance and detect hallucinations in positive samples: (1) Are remote sensing descriptions accurate? (2) Are street view descriptions accurate? (3) Are text descriptions accurate? (4) Is the combination analysis logic? (5) Are there conflicts between results and analysis? (6) Is additional information helpful? (detailed definition is listed in the   Sect. S3 in the Supplement). To evaluate the best model (Gemini-flash-2.5 (Thinking)), five professionals with expertise in geographic information systems are employed to assess the positive sets based on six criteria, and one large model (Deepseek-chat) is employed to assess criteria 4 to ensure the discrimination is consistent (detailed prompt designs are provided in  Sect. S2). Samples are created by selecting 20 % from each category of both negative and positive classification outcomes.</p>

      <fig id="F6" specific-use="star"><label>Figure 6</label><caption><p id="d2e2818">Reasoning output of large model under the remote sensing image, street view image, and description text input. Detailed reasoning output is shown in   Fig. S9.</p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f06.jpg"/>

          </fig>

      <p id="d2e2827">For the negative sample set, we hierarchically categorize the error sources to elucidate further large model failure modes, including three primary errors and six detailed errors. The three primary errors include (1) definition-induced errors, (2) human-aligned errors, and (3) model errors. The six detailed errors include (1) category ambiguity, (2) origin annotation error, (3) insufficient evidence, (4) recognition failure, (5) spatial relation errors, and (6) POI semantic misinterpretation (definition detailed in   Sect. S3). The relationship between the primary errors and detailed errors is shown in Fig. <xref ref-type="fig" rid="F10"/>. By analyzing the frequency of error causes, we aim to reveal potential directions for improving current large models and provide insight for future related research.</p>
</sec>
</sec>
</sec>
<sec id="Ch1.S4">
  <label>4</label><title>BuildingSense</title>
      <p id="d2e2842">Ultimately, based on our data collection and cleaning methods, we construct a multimodal and fine-grained building function classification dataset containing 34 458 building samples. In BuildingSense, each sample contains a corresponding ID, a footprint polygon, a street view imagery, a rooftop remote sensing imagery, a 60.96 m buffer zone remote sensing imagery, POIs, and annotation (including building height, constructed year, location, and function) (Fig. <xref ref-type="fig" rid="F7"/>).  As a training dataset, BuildingSense ensures data quality through three aspects: (1) consideration of sample categories and spatial distribution, (2) multimodal data matching and annotation verification, and (3) data completeness of individual building samples.</p>

      <fig id="F7" specific-use="star"><label>Figure 7</label><caption><p id="d2e2849">Building-related data and annotation example in BuildingSense.</p></caption>
        <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f07.jpg"/>

      </fig>

      <p id="d2e2858">First, the building function annotations in BuildingSense are derived from government annotations, ensuring the reliability of annotations. Based on this reliability, it can be guaranteed that the sampling results obtained through our spatially and categorically balanced sampling method are statistically sound, establishing the fundamental quality of our dataset.  As illustrated in Fig. <xref ref-type="fig" rid="F8"/>, the sampled buildings are spatially distributed across NYC, adhering to the principle (1).  The categories distribution is shown in Fig. <xref ref-type="fig" rid="F9"/>. Although the sample distribution is not perfectly uniform, it primarily results from the condition that the number of certain categories in NYC is lower than the threshold we set based on the overall category distribution. To some extent, the outcome already approximates principle (2) as closely as possible under the given city.</p>

      <fig id="F8" specific-use="star"><label>Figure 8</label><caption><p id="d2e2868">Distribution of building footprints in NYC.</p></caption>
        <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f08.jpg"/>

      </fig>

      <p id="d2e2877">Second, to further ensure the data quality, we have cleaned the errors in BuildingSense. The results of data cleaning are presented in Table <xref ref-type="table" rid="T6"/>. The fewer matching errors in the table indicate that the raw data we collected required less manual adjustment, thereby reducing human-induced errors and ensuring the quality of the dataset. Annotation errors primarily originated from insufficiently detailed labeling of interior structures in specific open-area categories, such as parks, which have been manually corrected.</p>

<table-wrap id="T6"><label>Table 6</label><caption><p id="d2e2885">Data cleaning results.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1">Error type</oasis:entry>

         <oasis:entry colname="col2">Detailed error type</oasis:entry>

         <oasis:entry colname="col3">Percentage</oasis:entry>

       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col1" morerows="2">Matching errors</oasis:entry>

         <oasis:entry colname="col2">Street view and building</oasis:entry>

         <oasis:entry colname="col3">4 %</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">Remote sensing and building</oasis:entry>

         <oasis:entry colname="col3">0.1 %</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col2">POI and building</oasis:entry>

         <oasis:entry colname="col3">0</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1">Labeling errors</oasis:entry>

         <oasis:entry colname="col2">Building function errors</oasis:entry>

         <oasis:entry colname="col3">11 %</oasis:entry>

       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p id="d2e2959">Third, Fig. <xref ref-type="fig" rid="F9"/> presents a data completeness analysis of BuildingSense. It can be seen that most of the buildings include remote sensing imagery, street views, and attribute annotations (Fig. <xref ref-type="fig" rid="F9"/>). However, three functional categories, Entertainment, Public service, and Fundamental infrastructure, exhibit relatively poorer data completeness, primarily due to missing street view imagery. This phenomenon stems from the inaccessible areas of these buildings located beyond the street view vehicles (e.g., secured facilities or locations far from roads).</p>

      <fig id="F9" specific-use="star"><label>Figure 9</label><caption><p id="d2e2969">Data completeness of each category. R refers to remote sensing imagery, A refers to annotated building attributes (building height, location, and constructed year), S refers to street view imagery, and P refers to POI data. Residential category statistics scaled by a factor of 10.</p></caption>
        <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f09.png"/>

      </fig>

      <p id="d2e2979">Overall, BuildingSense demonstrates competitive data quality through its well-integrated multimodal data and granular building function category taxonomy, while maintaining substantial data completeness across most building function categories.</p>
</sec>
<sec id="Ch1.S5">
  <label>5</label><title>Evaluation of large model on BuildingSense</title>
<sec id="Ch1.S5.SS1">
  <label>5.1</label><title>Baselines comparison</title>
      <p id="d2e2998">We conducted a comprehensive quantitative evaluation of four state-of-the-art large models across various input modals, as presented in Table <xref ref-type="table" rid="T7"/>. Moreover, we further investigate the performance across two classification schemata  – Fine-grained category (26 categories) and coarse category (14 categories) (Definitions detailed in   Tables S1 and S3). The evaluation framework examines three critical dimensions of the large models' results, including (1) effects of granularity of the categories, (2) efficacy of the combination of modality, and (3) model capability profiling.  The three dimensions are designed to reveal (1) the discrepancy performance of the large model in land use type and fine-grained building function classification, (2) the effect of each input modal on the performance of the large model, and (3) the advancement of the chosen large models.</p>

<table-wrap id="T7" specific-use="star"><label>Table 7</label><caption><p id="d2e3006">Performance Comparison of large models. R refers to a remote sensing image, S refers to a street view image, and T refers to a building text description (Detailed examples are listed in  Table S4). For instance, RT refers to the combination of remote sensing image and building text description. Bold values indicate the best performance in each category.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="7">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1">Category</oasis:entry>

         <oasis:entry colname="col2">Large model</oasis:entry>

         <oasis:entry colname="col3">Data combination</oasis:entry>

         <oasis:entry colname="col4">Acc</oasis:entry>

         <oasis:entry colname="col5">Pre</oasis:entry>

         <oasis:entry colname="col6">Recall</oasis:entry>

         <oasis:entry colname="col7"><inline-formula><mml:math id="M18" display="inline"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col1" morerows="10">26</oasis:entry>

         <oasis:entry rowsep="1" colname="col2">Deepseek-chat</oasis:entry>

         <oasis:entry rowsep="1" colname="col3">T</oasis:entry>

         <oasis:entry rowsep="1" colname="col4">0.08</oasis:entry>

         <oasis:entry rowsep="1" colname="col5">0.21</oasis:entry>

         <oasis:entry rowsep="1" colname="col6">0.08</oasis:entry>

         <oasis:entry rowsep="1" colname="col7">0.09</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="3">QVQ-plus</oasis:entry>

         <oasis:entry colname="col3">R</oasis:entry>

         <oasis:entry colname="col4">0.12</oasis:entry>

         <oasis:entry colname="col5">0.37</oasis:entry>

         <oasis:entry colname="col6">0.12</oasis:entry>

         <oasis:entry colname="col7">0.15</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col3">RT</oasis:entry>

         <oasis:entry colname="col4">0.29</oasis:entry>

         <oasis:entry colname="col5">0.49</oasis:entry>

         <oasis:entry colname="col6">0.29</oasis:entry>

         <oasis:entry colname="col7">0.30</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col3">S</oasis:entry>

         <oasis:entry colname="col4">0.24</oasis:entry>

         <oasis:entry colname="col5">0.47</oasis:entry>

         <oasis:entry colname="col6">0.24</oasis:entry>

         <oasis:entry colname="col7">0.27</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3">ST</oasis:entry>

         <oasis:entry colname="col4">0.34</oasis:entry>

         <oasis:entry colname="col5"><bold>0.59</bold></oasis:entry>

         <oasis:entry colname="col6">0.34</oasis:entry>

         <oasis:entry colname="col7">0.36</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="2">Claude-sonnet-4</oasis:entry>

         <oasis:entry colname="col3">ST</oasis:entry>

         <oasis:entry colname="col4">0.38</oasis:entry>

         <oasis:entry colname="col5">0.50</oasis:entry>

         <oasis:entry colname="col6">0.38</oasis:entry>

         <oasis:entry colname="col7">0.39</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col3">RS</oasis:entry>

         <oasis:entry colname="col4">0.27</oasis:entry>

         <oasis:entry colname="col5">0.42</oasis:entry>

         <oasis:entry colname="col6">0.27</oasis:entry>

         <oasis:entry colname="col7">0.30</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3">RST</oasis:entry>

         <oasis:entry colname="col4">0.39</oasis:entry>

         <oasis:entry colname="col5">0.49</oasis:entry>

         <oasis:entry colname="col6">0.39</oasis:entry>

         <oasis:entry colname="col7">0.40</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="2">Gemini-2.5-flash (Thinking)</oasis:entry>

         <oasis:entry colname="col3">ST</oasis:entry>

         <oasis:entry colname="col4">0.42</oasis:entry>

         <oasis:entry colname="col5">0.53</oasis:entry>

         <oasis:entry colname="col6">0.42</oasis:entry>

         <oasis:entry colname="col7">0.43</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col3">RS</oasis:entry>

         <oasis:entry colname="col4">0.35</oasis:entry>

         <oasis:entry colname="col5">0.51</oasis:entry>

         <oasis:entry colname="col6">0.35</oasis:entry>

         <oasis:entry colname="col7">0.37</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3">RST</oasis:entry>

         <oasis:entry colname="col4"><bold>0.43</bold></oasis:entry>

         <oasis:entry colname="col5">0.53</oasis:entry>

         <oasis:entry colname="col6"><bold>0.43</bold></oasis:entry>

         <oasis:entry colname="col7"><bold>0.44</bold></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" morerows="1">14</oasis:entry>

         <oasis:entry rowsep="1" colname="col2">Gemini-2.5-flash (Thinking)</oasis:entry>

         <oasis:entry rowsep="1" colname="col3">ST</oasis:entry>

         <oasis:entry rowsep="1" colname="col4"><bold>0.55</bold></oasis:entry>

         <oasis:entry rowsep="1" colname="col5">0.63</oasis:entry>

         <oasis:entry rowsep="1" colname="col6"><bold>0.55</bold></oasis:entry>

         <oasis:entry rowsep="1" colname="col7"><bold>0.56</bold></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">QVQ-plus</oasis:entry>

         <oasis:entry colname="col3">ST</oasis:entry>

         <oasis:entry colname="col4">0.40</oasis:entry>

         <oasis:entry colname="col5"><bold>0.65</bold></oasis:entry>

         <oasis:entry colname="col6">0.40</oasis:entry>

         <oasis:entry colname="col7">0.43</oasis:entry>

       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p id="d2e3343">First, we evaluated performance across different categories of granularity under the ST data combination (Gemini-2.5-flash (Thinking) and QVQ-plus). The two models are chosen for their superior performance in terms of accuracy and precision. As demonstrated in Table <xref ref-type="table" rid="T7"/>, the model's accuracy improved significantly in land use classification, achieving a maximum accuracy of 0.55. Notably, this result is attained solely through prompt engineering and dual-modal data fusion (without fine-tuning or external knowledge injection). This performance is striking compared to the 0.6 accuracy reported in remote sensing-only studies (albeit with different modalities) <xref ref-type="bibr" rid="bib1.bibx14" id="paren.60"/>, underscoring that the large model can comprehend the multimodal data and make logical inference.  The advantage challenges that the large models remain limited in performance when they encounter multimodal spatial data.</p>
      <p id="d2e3352">Second, two key findings emerge from the analysis of data combination, including (1) multimodal fusion (image and text) substantially improved model performance (<inline-formula><mml:math id="M19" display="inline"><mml:mo lspace="0mm">+</mml:mo></mml:math></inline-formula>0.10 accuracy vs. unimodal baselines, Table <xref ref-type="table" rid="T7"/>) and (2) multiview fusion (remote detection <inline-formula><mml:math id="M20" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula> street view imagery) showed a marginal improvement (<inline-formula><mml:math id="M21" display="inline"><mml:mo lspace="0mm">+</mml:mo></mml:math></inline-formula>0.01 accuracy for RST vs. ST and the worse performance for RS vs. ST). Among unimodal inputs, street-view images yielded the highest accuracy (S), followed by remote sensing (R) and text (T), suggesting that street-level imagery provides richer semantic cues for building function inference. However, combining remote sensing and street-view data failed to deliver synergistic gains, indicating a limited cross-perspective reasoning capability in current models.</p>
      <p id="d2e3378">Third, while Gemini-2.5-flash (Thinking) achieved overall superior performance across data combinations, QVQ-plus exhibited exceptional precision (0.59 in ST mode vs. Gemini's 0.53 in RST mode). This reliability, coupled with cost-effectiveness and open-source availability of its series product, positions the qwen-series as a viable large model for developing affordable, high-performance building function classification models. More importantly, techniques such as retrieval-augmented generation (RAG) and low-rank adaptation (LoRA) will enable substantial reductions in development costs for enhancing large model performance.</p>
      <p id="d2e3381">Overall, our findings remarkably indicate that large models can transform conventional building function classification paradigms through (1) extensive human society knowledge, (2) transparent, interpretable reasoning processes, (3) robust multimodal processing capacity, and (4) efficient, lightweight model development.</p>
</sec>
<sec id="Ch1.S5.SS2">
  <label>5.2</label><title>Cost analysis</title>
      <p id="d2e3392">To compare the cost-effectiveness of large models, we calculated the experimental expenses under different data combinations in this study (official token costs for each model are provided in  Table S5), as shown in Table <xref ref-type="table" rid="T8"/>. A price comparison revealed that Gemini-2.5-flash (Thinking), as the optimal model, does not incur the highest cost – under identical data combination inputs, its expense is only one-third that of Claude-Sonnet-4. Notably, the most affordable multimodal reasoning model is the QVQ-plus model, whose cost under the same data conditions is merely one-tenth that of Gemini-2.5-flash (Thinking). Furthermore, its excellent prediction precision ensures the correctness of model outputs, making it the most cost-effective model.</p>

<table-wrap id="T8"><label>Table 8</label><caption><p id="d2e3400">Cost comparison of large models.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:thead>
       <oasis:row>

         <oasis:entry colname="col1">Large model</oasis:entry>

         <oasis:entry colname="col2">Data</oasis:entry>

         <oasis:entry colname="col3">Total cost</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2">combination</oasis:entry>

         <oasis:entry colname="col3">(USD)</oasis:entry>

       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1">Deepseek-chat</oasis:entry>

         <oasis:entry colname="col2">T</oasis:entry>

         <oasis:entry colname="col3">2</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col1" morerows="2">Claude-sonnet-4</oasis:entry>

         <oasis:entry colname="col2">RS</oasis:entry>

         <oasis:entry colname="col3">180</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">ST</oasis:entry>

         <oasis:entry colname="col3">217</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col2">RST</oasis:entry>

         <oasis:entry colname="col3">372</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col1" morerows="2">Gemini-2.5-flash (Thinking)</oasis:entry>

         <oasis:entry colname="col2">ST</oasis:entry>

         <oasis:entry colname="col3">52</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">RS</oasis:entry>

         <oasis:entry colname="col3">88</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col2">RST</oasis:entry>

         <oasis:entry colname="col3">107</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" morerows="3">QVQ-plus</oasis:entry>

         <oasis:entry colname="col2">R</oasis:entry>

         <oasis:entry colname="col3">10</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">S</oasis:entry>

         <oasis:entry colname="col3">8</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">ST</oasis:entry>

         <oasis:entry colname="col3">12</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">RT</oasis:entry>

         <oasis:entry colname="col3">14</oasis:entry>

       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
<sec id="Ch1.S5.SS3">
  <label>5.3</label><title>Results of manual evaluation</title>
      <p id="d2e3563">Large models often encounter the hallucination challenge. We evaluated the Gemini-2.5-flash (Thinking) model manually to check for hallucinations and understand misjudgments. Key findings include: (1) it shows minimal hallucination in building function inference, with most judgments (positive samples) having traceable logic; (2) it has limitations in capturing semantic information from remote sensing, synthesizing multiview data for analyzing complex spatial relationships in functional building classifications, and understanding POI names; (3) it can be a detailed auxiliary tool for building function labeling.</p>
<sec id="Ch1.S5.SS3.SSS1">
  <label>5.3.1</label><title>Reasoning abilities and hallucination</title>
      <p id="d2e3573">The percentage of Criteria 4 in Table <xref ref-type="table" rid="T9"/> reveals that Gemini-2.5-flash (Thinking) demonstrates logically consistent information synthesis, with strong agreement to the large model assessment (Cohen's Kappa <inline-formula><mml:math id="M22" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.86</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M23" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mo>&lt;</mml:mo><mml:mn mathvariant="normal">0.001</mml:mn></mml:mrow></mml:math></inline-formula>). All discrimination results align with their combinatorial analyses, and over half of the positive samples include valuable supplementary reasoning. Given the results, it can be concluded that the model exhibits negligible hallucination, strong multimodal processing, effective data integration, and reliable reasoning, highlighting its potential for fine-grained auxiliary annotation of building functions.</p>

<table-wrap id="T9" specific-use="star"><label>Table 9</label><caption><p id="d2e3603">Analysis of the positive sample.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="7">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">Criteria 1</oasis:entry>
         <oasis:entry colname="col3">Criteria 2</oasis:entry>
         <oasis:entry colname="col4">Criteria 3</oasis:entry>
         <oasis:entry colname="col5">Criteria 4</oasis:entry>
         <oasis:entry colname="col6">Criteria 5</oasis:entry>
         <oasis:entry colname="col7">Criteria 6</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Ture analysis</oasis:entry>
         <oasis:entry colname="col2">0.99</oasis:entry>
         <oasis:entry colname="col3">0.98</oasis:entry>
         <oasis:entry colname="col4">1.00</oasis:entry>
         <oasis:entry colname="col5">0.99</oasis:entry>
         <oasis:entry colname="col6">0.00</oasis:entry>
         <oasis:entry colname="col7">0.76</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
<sec id="Ch1.S5.SS3.SSS2">
  <label>5.3.2</label><title>Error sources of the best-performance model</title>
      <p id="d2e3689">To analyze Gemini-2.5-flash (Thinking) errors, we categorized them statistically and visualized the results in Fig. <xref ref-type="fig" rid="F10"/>. Model-induced errors account for only a small share, with spatial relation errors and POI semantic misinterpretation being the most common. Representative cases are shown in Fig. <xref ref-type="fig" rid="F11c"/>, with causes highlighted in blue.</p>

      <fig id="F10" specific-use="star"><label>Figure 10</label><caption><p id="d2e3698">Error distribution of Gemini-2.5-flash (Thinking).</p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f10.png"/>

          </fig>

      <p id="d2e3707">The highlighted characters in the spatial relation errors picture (Fig. <xref ref-type="fig" rid="F11c"/>) demonstrate that the model is unable to infer that the building is part of a railway trunk line based on the synthesis of the remote sensing image and the street view image. The inference process is dominated by the semantic information derived from the street view image. Meanwhile, the model is unable to locate the building in the street view image accurately. This phenomenon highlights the model's limitation in synthesizing multiview information. Additionally, the POI semantic misinterpretation cases (highlighted in blue) reveal the model's limitations in understanding the semantic meaning of POI names (Fig. <xref ref-type="fig" rid="F11c"/>). It can be found that the model interprets the POI as a resident function. However, based on a search in Chat-GPT, it appears that the “Woodruff Family” is an organization that assists vulnerable groups and primarily provides non-profit accommodation environments, which is aligned with the annotated building function (Public service for poor). Such mistakes reflect that the semantic understanding ability of large models for some irregular POI names is vital for building function classification.</p>
      <p id="d2e3715">To our surprise, the primary source of errors in the model stems from category ambiguity. Our prompt template failed to provide detailed definitions for each category, leading to the misclassification examples shown in Fig. <xref ref-type="fig" rid="F11c"/>. The composite analysis part of category ambiguity error presented in Fig. <xref ref-type="fig" rid="F11c"/> effectively characterizes both the building's spatial configuration and the associated POI semantic attributes. However, in the building classification of NYC, the building is categorized as Q3 (outdoor pool). According to the predefined mapping relationship, it should properly be classified under the “Sports” category. In contrast, Gemini-2.5-flash (Thinking) assigned it to the “Entertainment” instead. From the perspective of semantic representation, this classification result cannot be considered entirely incorrect; rather, it reflects a systematic discrepancy between the classification rules understood by the large model and those we have formally established.</p>
      <p id="d2e3722">More surprisingly, a considerable proportion of “Human-aligned errors” were identified in the negative samples. These refer to cases where, given the same building information, human evaluators and the model produced highly consistent judgments despite both being technically incorrect according to our ground truth (see Fig. <xref ref-type="fig" rid="F11c"/> for concrete examples). The model demonstrates robust and comprehensive analysis capabilities, with its outputs representing optimal conclusions that can be achieved given the available information. This error distribution pattern strongly suggests that Gemini-2.5-flash (Thinking) has attained competent performance in comprehending multimodal spatial data.</p>

      <fig id="F11a" specific-use="star"><label>Figure 11</label><caption><p id="d2e3729"> </p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f11-part01.jpg"/>

          </fig>

      <fig id="F11b" specific-use="star"><label>Figure 11</label><caption><p id="d2e3740"> </p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f11-part02.jpg"/>

          </fig>

      <fig id="F11c" specific-use="star"><label>Figure 11</label><caption><p id="d2e3752">Example of each error.</p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f11-part03.jpg"/>

          </fig>

</sec>
<sec id="Ch1.S5.SS3.SSS3">
  <label>5.3.3</label><title>Ablation study of category definition</title>
      <p id="d2e3769">Considering the systematic discrepancy between the classification rules understood by the large model and our predefined classification framework, we conducted an ablation experiment. By incorporating simplified definitions of our classification categories through prompt engineering, we compared the modified model with the original version. The results demonstrate that introducing functional definitions comprehensively improved the model's prediction performance, although the enhancement remains limited. Quantitative analysis reveals that while this approach cannot completely resolve all errors caused by definition ambiguity, its effectiveness surpasses that of supplementing remote sensing image information alone (as evidenced by comparative results in Tables <xref ref-type="table" rid="T7"/> and <xref ref-type="table" rid="T10"/>).</p>

<table-wrap id="T10" specific-use="star"><label>Table 10</label><caption><p id="d2e3779">Comparison between the original prompt and the definition given prompt. Bold values indicate the best performance.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="7">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Building function category</oasis:entry>
         <oasis:entry colname="col2">Large model</oasis:entry>
         <oasis:entry colname="col3">Data combination</oasis:entry>
         <oasis:entry colname="col4">Acc</oasis:entry>
         <oasis:entry colname="col5">Pre</oasis:entry>
         <oasis:entry colname="col6">Recall</oasis:entry>
         <oasis:entry colname="col7"><inline-formula><mml:math id="M24" display="inline"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">26</oasis:entry>
         <oasis:entry colname="col2">Gemini-2.5-flash (Thinking)</oasis:entry>
         <oasis:entry colname="col3">RST (Definition)</oasis:entry>
         <oasis:entry colname="col4"><bold>0.47</bold></oasis:entry>
         <oasis:entry colname="col5"><bold>0.55</bold></oasis:entry>
         <oasis:entry colname="col6"><bold>0.47</bold></oasis:entry>
         <oasis:entry colname="col7"><bold>0.48</bold></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">26</oasis:entry>
         <oasis:entry colname="col2">Gemini-2.5-flash (Thinking)</oasis:entry>
         <oasis:entry colname="col3">RST</oasis:entry>
         <oasis:entry colname="col4">0.43</oasis:entry>
         <oasis:entry colname="col5">0.53</oasis:entry>
         <oasis:entry colname="col6">0.43</oasis:entry>
         <oasis:entry colname="col7">0.44</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>
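
      <p>For illustration, a minimal sketch of how simplified category definitions can be prepended to the classification prompt is given below; the CATEGORY_DEFINITIONS texts and the build_prompt helper are hypothetical placeholders rather than the exact wording used in our experiments.</p>
      <preformat>
# Minimal sketch of the definition-augmented prompting used in the ablation.
# CATEGORY_DEFINITIONS and build_prompt are illustrative placeholders only.

CATEGORY_DEFINITIONS = {
    "Sports": "Buildings or facilities mainly used for physical exercise, e.g. gyms, pools.",
    "Entertainment": "Buildings mainly used for leisure and amusement, e.g. cinemas, clubs.",
    # ... remaining categories of the 26-class scheme
}

def build_prompt(building_info, with_definitions=True):
    """Assemble the classification prompt for one building sample."""
    parts = []
    if with_definitions:
        parts.append("Category definitions:")
        for name, text in CATEGORY_DEFINITIONS.items():
            parts.append(f"- {name}: {text}")
    parts.append("Classify the building function using the multimodal evidence below.")
    parts.append(building_info)  # POI list, building height, image references, etc.
    return "\n".join(parts)

# original prompt (RST row in Table 10): build_prompt(info, with_definitions=False)
# definition-given prompt (RST (Definition) row): build_prompt(info, with_definitions=True)
      </preformat>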

      <p id="d2e3896">To better analyze errors from definitional discrepancies, we constructed a confusion matrix (Fig. <xref ref-type="fig" rid="F12c"/>), which shows that refined definitions improve the model's performance for most categories, as seen by the prevalence of blue cells along the diagonal. However, specific categories, such as office mix and office/medicine, warehouse and factory/garage, and sport and entertainment/entertainment mix, still show ambiguity. This suggests that even with enhanced definitions, the model's classification logic diverges from our classification framework, as human evaluators apply their own rules to these ambiguous cases. Instead of refining definitions, creating a reasoning-example database would help the model learn latent classification rules, better aligning it with the predefined classification framework.</p>

      <fig id="F12a" specific-use="star"><label>Figure 12</label><caption><p id="d2e3904"> </p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f12-part01.png"/>

          </fig>

      <fig id="F12b" specific-use="star"><label>Figure 12</label><caption><p id="d2e3915"> </p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f12-part02.png"/>

          </fig>

      <fig id="F12c" specific-use="star"><label>Figure 12</label><caption><p id="d2e3926">Confusion matrix of the original results, the definition given results, and their comparison.</p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f12-part03.png"/>

          </fig>


</sec>
</sec>
</sec>
<sec id="Ch1.S6">
  <label>6</label><title>Discussion</title>
<sec id="Ch1.S6.SS1">
  <label>6.1</label><title>Future directions for large model developments in building function classification</title>
      <p id="d2e3954">Based on the experimental results in this study, we found that, under zero-shot conditions, even though the overall quantitative metrics are not satisfactory, the current state-of-the-art large models are capable of effectively utilizing multimodal data to classify the building functions. The results challenge the conventional belief that large models struggle to comprehend multimodal spatial data. Although our conclusions may be subject to urban bias (the variations in model performance and error distributions across data from different cities),  the evaluation based on NYC has already demonstrated the feasibility of using large models for building function classification. We argue that, in the future research, three directions can be explored to comprehensively improve model performance and reduce inference costs, ultimately achieving low-cost auxiliary or automated annotation for building function, including (1) constructing an external information database for building function classification, (2) developing small parameter models with excellent inference performance, and (3) quantifying the confidence of model outputs.</p>
      <p id="d2e3957">First, based on the experimental results in this study, the main errors in the large model's building function classification result from category ambiguity, which can be attributed to the system diversity between the classification rules of the large model and the predefined definition. Despite providing relatively detailed classification definitions, we did not achieve a breakthrough in model performance. Therefore, rather than offering a complete definition, it may be more effective to construct a database of inference examples for different building function categories. Using RAG technology, the model can match the most similar inference examples to each input, allowing it to learn similar reasoning processes. Additionally, POI and location knowledge bases should be developed to provide sufficient semantic information for the model's building function inference. The first direction will not only enhance the model's inference capabilities but can also be combined with the second direction to reduce inference costs significantly.</p>
      <p id="d2e3960">Second, based on the experimental costs in this study, the optimal model, Gemini-2.5-flash (Thinking), has an inference cost of USD 107 for 16 043 buildings in RST combinations. Such costs are prohibitively expensive for a city-level building function inference project – approximately USD 6000 for NYC. In contrast, QVQ-plus only costs around USD 12.9 for the same number of buildings. Although its inference accuracy under zero-shot conditions is relatively low, we can still see its potential in building function classification. In the future, the performance of the large models can be improved by fine-tuning the open-source multimodal models such as qwen-VL, and using RAG technology to retrieve semantic information from the building function inference knowledge base proposed in the first direction. Ultimately, it will lead to the development of a cost-effective building function classification model for data production and auxiliary annotation.</p>
      <p id="d2e3963">Third, quantifying the confidence of model outputs facilitates a quick assessment of the quality of classification results. Results with low confidence can be flagged for human correction, thereby assisting with the annotation process. However, there are limited methods for quantifying the confidence of building function classification results, especially for large models. Consequently, developing a confidence quantification method is important for constructing a low-cost building function classification model.</p>
</sec>
<sec id="Ch1.S6.SS2">
  <label>6.2</label><title>Applications of BuildingSense</title>
      <p id="d2e3974">Except for evaluating large model performance and providing insights for large model-based classification methods development, BuildingSense, as the first multimodal and fine-grained building function dataset, offers advantages for supporting algorithm development in two aspects: its multimodal and fine-grained characteristics facilitate advancements in multimodal fusion-based classification algorithms. At the same time, its rich building-related annotations make it applicable for algorithm development of building height and constructed year inversion.</p>
      <p id="d2e3977">First, from the perspective of multimodal-based function classification methods development, there are two critical issues: (1) how to extract deep semantic features from POI and building-related descriptive texts, and (2) how to align multiview and multimodal data features.</p>
      <p id="d2e3980">Our evaluation of large models reveals that they can effectively extract the background semantic information about human activity solely from the POI name. This capability results from its extensive knowledge of human society. Similarly, the model shows the same ability in understanding the deep semantics of building-related descriptive texts. In contrast,  multimodal-based methods lack such advantages. Typically, the embedding feature vectors of POIs rely on predefined POI categories, resulting in an inadequate capture of the spatial and semantic relationships among POIs within buildings. Thus, extracting deep semantic features from POIs and building-related texts remains a significant bottleneck.</p>
      <p id="d2e3983">Large models exhibit limited performance in comprehensively interpreting the semantics of remote sensing and street view imagery, indicating that the multi-view features of remote sensing and street view imagery are poorly aligned. Although significant progress has been made in image-text alignment methods, methods for aligning multi-view images with texts remain underexplored <xref ref-type="bibr" rid="bib1.bibx19" id="paren.61"/>.</p>
      <p id="d2e3990">Second, from the perspective of  BuildingSense‘s task diversity, building height and constructed year, as two critical building attributes, are widely used in urban research <xref ref-type="bibr" rid="bib1.bibx47 bib1.bibx34" id="paren.62"/>. Consequently, the inversion of these parameters has become two important research directions. Existing studies have demonstrated that street-view imagery can be utilized for inferring building height and constructed year <xref ref-type="bibr" rid="bib1.bibx34 bib1.bibx39" id="paren.63"/>, while remote sensing imagery can be employed for building height estimation <xref ref-type="bibr" rid="bib1.bibx47" id="paren.64"/>. Notably, both the data and relevant annotations for these tasks are included in BuildingSense. Therefore, BuildingSense extends far beyond building function classification and can also support research on height and constructed year inversion.</p>
      <p id="d2e4002">In summary, in addition to supporting research on building function classification methods, BuildingSense can also be applied to estimate other building parameters, providing a foundational platform for the development of current building parameter estimation methods.</p>
</sec>
<sec id="Ch1.S6.SS3">
  <label>6.3</label><title>Limitations of BuildingSense</title>
<sec id="Ch1.S6.SS3.SSS1">
  <label>6.3.1</label><title>City sample bias</title>
      <p id="d2e4020">NYC, as an international metropolis, has a diverse ethnic composition and high population density, which contribute to its varied architectural styles, including row houses, modern commercial buildings, grand public structures, and towering skyscrapers. Thus, the building function dataset we constructed from NYC samples exhibits a certain degree of representativeness for the North American context. However, the architectural styles that differ considerably from those in North America, such as East Asia and Europe, may lead to two potential outcomes: (1) the evaluation of the optimal large model based on the baseline results may be overestimated, and (2) the performance of models trained on this dataset may degrade when transferred to regions with substantially different architectural styles.</p>
      <p id="d2e4023">First, our evaluation of the optimal large model (Gemini-2.5-flash (Thinking)) indicates that it can effectively integrate multimodal spatial data and perform logical building function inference, even though it was not specifically designed for this task. During this process, it demonstrates strong capabilities in image understanding, text processing, and spatial reasoning. As a flagship product of Google, we hypothesize that its training data likely includes extensive Google spatial data (e.g., street view imagery, POI reviews, and remote sensing data), which may account for its superior performance compared to other models. However, in regions such as China, the distribution of information across data modalities (visual, textual, spatial) differs significantly from that in NYC. For instance, Chinese urban villages present a unique challenge: densely built, low-rise residential areas contain informal commercial activities on the ground floor. Visually, a building may appear entirely residential (with laundry hanging from windows, narrow alleyways, and residential-style architecture), while POI data might indicate numerous small businesses (e.g., convenience stores, hair salons, street food vendors) operating within. Such a condition may lead to brittle reasoning chains and performance decline of baselines in four primary aspects: (1) an inability to analyze architectural styles from street view images; (2) a failure to semantically interpret POI names within buildings due to linguistic differences; (3) an incapacity to assess the built environment and architectural styles from remote sensing imagery; and (4) incorrect spatial reasoning resulting from any of the aforementioned errors. This implies that our evaluation of the optimal large model may be overestimated.</p>
      <p id="d2e4026">Second, despite these limitations, Gemini's strong performance on BuildingSense within its training distribution still offers valuable insights: the technical approach of inferring building functions using large models is feasible, and fine-tuning such models with relevant data could significantly enhance their performance on building function classification tasks. Consequently, subsequent large models fine-tuned on BuildingSense may only achieve performance improvements within the North American context, with relatively limited gains when applied to other regions. Furthermore, for traditional deep learning models, limited generalizability beyond the training distribution has always been a primary drawback. Models trained on BuildingSense may therefore not be suitable for application outside North America. Nevertheless, BuildingSense remains an important dataset for validating model performance.</p>
      <p id="d2e4029">To address this limitation, our future work will focus on three directions: (1) extending BuildingSense to include multiple cities representing diverse urban typologies (e.g., a European compact city, an Asian high-density city, and a Latin American spontaneous city); (2) developing domain adaptation techniques to transfer knowledge from NYC to data-scarce regions; and (3) conducting systematic cross-city evaluations to quantify the generalizability of both traditional and large models.</p>
</sec>
<sec id="Ch1.S6.SS3.SSS2">
  <label>6.3.2</label><title>Lack of annotation in street view images</title>
      <p id="d2e4040">In BuildingSense, the street view images are not annotated with the target building. It was designed to assess whether large models can infer the target building's location within a street-view image from other provided information (e.g., remote-sensing imagery, building height). In our prompting templates, we only indicate to the model that the target building is located directly ahead. Strikingly, we found that the model attempts to match and integrate across the different viewpoints and is capable of self-correction when it detects logical inconsistencies, ultimately producing a revised – and correct – label. An illustrative example is provided in Fig. <xref ref-type="fig" rid="F13"/>.</p>

      <fig id="F13" specific-use="star"><label>Figure 13</label><caption><p id="d2e4047">Example of the Gemini-flash 2.5 (Thinking)'s output.</p></caption>
            <graphic xlink:href="https://essd.copernicus.org/articles/18/2609/2026/essd-18-2609-2026-f13.png"/>

          </fig>

      <p id="d2e4056">It can be observed that the model's textual description of the street view image did not match the actual characteristics of the target building: the model mistakenly took a taller background tower as the target structure (see the blue text in Fig. <xref ref-type="fig" rid="F13"/>, “Description of building in Street view image”). Despite the condition arising from the unannotated target building in the street view image, this setup enables testing whether the model can align the building's location across the remote sensing and street view modalities using the provided information. In a subsequent inference, the section “Semantic information of the building in the content” in Fig. <xref ref-type="fig" rid="F13"/> mentions that given the building height that we supplied, it recognized a recognition error in the street view with textual cues present in the street image and revised its street view judgment. The later stages of the inference retained this revised decision consistently (see the blue text in Fig. <xref ref-type="fig" rid="F13"/>, “Analyze each POI name of the building” and “Combined the analysis and judgment”).</p>
      <p id="d2e4066">This annotation gap was intentionally designed to test whether large models could integrate multimodal spatial data and perform coherent reasoning – a finding that ultimately proves our assumptions. However, this approach inevitably increases annotation costs in practical applications, as target buildings in streetview images must be labeled to support specific tasks. Therefore, we plan to include target building annotations in future updates of the dataset.</p>
</sec>
</sec>
</sec>
<sec id="Ch1.S7">
  <label>7</label><title>Data availability</title>
      <p id="d2e4079">The data can be accessed through <ext-link xlink:href="https://doi.org/10.6084/m9.figshare.30645776.v2" ext-link-type="DOI">10.6084/m9.figshare.30645776.v2</ext-link> <xref ref-type="bibr" rid="bib1.bibx29" id="paren.65"/>.</p>
</sec>
<sec id="Ch1.S8">
  <label>8</label><title>Code availability</title>
      <p id="d2e4096">The code can be accessed through <ext-link xlink:href="https://doi.org/10.6084/m9.figshare.30645776.v2" ext-link-type="DOI">10.6084/m9.figshare.30645776.v2</ext-link> <xref ref-type="bibr" rid="bib1.bibx29" id="paren.66"/>.</p>
</sec>
<sec id="Ch1.S9" sec-type="conclusions">
  <label>9</label><title>Conclusions</title>
      <p id="d2e4113">Building function classification, as a primary method for obtaining building functions, remains two major challenges: (1) interpretability of models and (2) comprehensive fusion of multimodal features. The limitations have led to unreliable classification results and suboptimal classification accuracy. Large models, benefiting from extensive knowledge of human society, powerful multimodal data fusion capabilities, and advanced reasoning ability, provide a promising approach to address the issues. However, their current limited performance in handling multimodal spatial data raises a need for a systematic evaluation of their capabilities in building function classification task. Yet, the absence of a multimodal building function dataset, coupled with the coarse-grained classifications in most existing studies, obscures the deep semantic information that building functions convey about human activities. Therefore, we aim to construct a multimodal fine-grained building function classification dataset and benchmark the performance of large models on it to provide insights for future large model-based algorithms.</p>
      <p id="d2e4116">The main contributions of this study are twofold: (1) the creation of the first multimodal fine-grained building function classification dataset, and (2) a systematic evaluation of both the outcomes and reasoning processes of existing large models. Quantitative analysis of model classification results reveals that: (1) multimodal fusion improves classification accuracy, (2) multi-view fusion (street-view and remote sensing) has limitations in semantic understanding, and (3) building function classification models based on open-source large models require further investigation. Meanwhile, a manual evaluation of the reasoning processes shows that: (1) large models perform well when category distinctions are clear but struggle with insufficient or ambiguous data, with errors stemming from a lack of domain knowledge rather than modality misunderstanding – the Gemini-2.5-flash (Thinking) model demonstrates potential as an assistant in building function annotation; (2) improvements are needed in multi-view understanding for spatially complex buildings; and (3) utilizing RAG technology in the future could enhance performance, but quantifying model confidence deserves further exploration.</p>
      <p id="d2e4119">Overall, this study not only provides a multimodal fine-grained dataset for building function classification training but also demonstrates that current large models can handle multimodal spatial data, challenging the prevailing concept about their limitations in handling multimodal spatial data. In future work, we will update the dataset, incorporate multimodal inference chains, and expand the regional coverage of BuildingSense.</p>
</sec>

      
      </body>
    <back><app-group>
        <supplementary-material position="anchor"><p id="d2e4121">The supplement related to this article is available online at <inline-supplementary-material xlink:href="https://doi.org/10.5194/essd-18-2609-2026-supplement" xlink:title="pdf">https://doi.org/10.5194/essd-18-2609-2026-supplement</inline-supplementary-material>.</p></supplementary-material>
        </app-group><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d2e4132">P.S.: Conceptualization, Investigation, Methodology, Software, Data Curation, Writing – Original Draft, and Visualization. R.C.: Conceptualization, Methodology, Data Curation, and Writing – Original Draft. H.X.: Data Curation and Writing – Original Draft. W.H.: Conceptualization, Resources, Writing – Review &amp; Editing, Supervision, Funding acquisition, and Project administration. X.D.: Data Curation and Writing – Original Draft. S.L.: Writing – Review &amp; Editing. W.Y.: Writing – Review &amp; Editing. H.W.: Writing – Review &amp; Editing. C.L.: Writing – Review &amp; Editing.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d2e4139">The contact author has declared that none of the authors has any competing interests.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d2e4145">Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. The authors bear the ultimate responsibility for providing appropriate place names. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.</p>
  </notes><ack><title>Acknowledgements</title><p id="d2e4151">This work was supported by the General Program of the National Natural Science Foundation of China (grant no. 42171452). Besides, we extend our sincere gratitude to Weihua Huan and Yan Tang for their professional advice on the language, structure, and logic of the paper.</p></ack><notes notes-type="financialsupport"><title>Financial support</title>

      <p id="d2e4156">This research has been supported by the National Natural Science Foundation of China (grant no. 42171452).</p>
  </notes><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d2e4162">This paper was edited by Yuyu Zhou and reviewed by two anonymous referees.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><label>Arribas-Bel and Fleischmann(2022)</label><mixed-citation>Arribas-Bel, D. and Fleischmann, M.: Spatial signatures – Understanding (urban) spaces through form and function, Habitat Int., 128, 102641, <ext-link xlink:href="https://doi.org/10.1016/j.habitatint.2022.102641" ext-link-type="DOI">10.1016/j.habitatint.2022.102641</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx2"><label>Azimi et al.(2019)Azimi, Henry, Sommer, Schumann, and Vig</label><mixed-citation>Azimi, S. M., Henry, C., Sommer, L. W., Schumann, A., and Vig, E.: SkyScapes – Fine-grained semantic understanding of aerial scenes, 2019 IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 7392–7402, <uri>https://api.semanticscholar.org/CorpusID:207998372</uri> (last access: 20 February 2026), 2019.</mixed-citation></ref>
      <ref id="bib1.bibx3"><label>Bommasani et al.(2022)Bommasani, Hudson, and Ehsan Adeli</label><mixed-citation>Bommasani, R., Hudson, D. A., and Ehsan Adeli, E. A.: On the opportunities and risks of foundation models,   arXiv [preprint] <ext-link xlink:href="https://doi.org/10.48550/arXiv.2108.07258" ext-link-type="DOI">10.48550/arXiv.2108.07258</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx4"><label>Braun et al.(2019)Braun, Krebs, Flohr, and Gavrila</label><mixed-citation>Braun, M., Krebs, S., Flohr, F., and Gavrila, D. M.: EuroCity persons: A novel benchmark for person detection in traffic scenes, IEEE T. Pattern Anal. Mach. Intell., 41, 1844–1861, <ext-link xlink:href="https://doi.org/10.1109/TPAMI.2019.2897684" ext-link-type="DOI">10.1109/TPAMI.2019.2897684</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx5"><label>Brown et al.(2020)Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever, and Amodei</label><mixed-citation> Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D.: Language models are few-shot learners, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Curran Associates Inc., Red Hook, NY, USA, ISBN 9781713829546, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx6"><label>Chen et al.(2024)Chen, Zhou, Stokes, and Zhang</label><mixed-citation>Chen, W., Zhou, Y., Stokes, E. C., and Zhang, X.: Large-scale urban building function mapping by integrating multi-source web-based geospatial data, Geo-Spat. Inf. Sci., 27, 1785–1799, <ext-link xlink:href="https://doi.org/10.1080/10095020.2023.2264342" ext-link-type="DOI">10.1080/10095020.2023.2264342</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx7"><label>Choi and Yoon(2023)</label><mixed-citation>Choi, S. and Yoon, S.: Energy signature-based clustering using open data for urban building energy analysis toward carbon neutrality: A case study on electricity change under COVID-19, Sust. Cities Soc., 92, 104471, <ext-link xlink:href="https://doi.org/10.1016/j.scs.2023.104471" ext-link-type="DOI">10.1016/j.scs.2023.104471</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx8"><label>Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele</label><mixed-citation>Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B.: The cityscapes dataset for semantic urban scene understanding, in: 2016 IEEE Conf. Comput. Vis. Pattern. Recognit. (CVPR), 3213–3223, <ext-link xlink:href="https://doi.org/10.1109/CVPR.2016.350" ext-link-type="DOI">10.1109/CVPR.2016.350</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx9"><label>Deng et al.(2022)Deng, Chen, Yang, Li, Jiang, Liao, and Sun</label><mixed-citation>Deng, Y., Chen, R., Yang, J., Li, Y., Jiang, H., Liao, W., and Sun, M.: Identify urban building functions with multisource data: A case study in Guangzhou, China, Int. J. Geogr. Inf. Sci., 36, 2060–2085, <ext-link xlink:href="https://doi.org/10.1080/13658816.2022.2046756" ext-link-type="DOI">10.1080/13658816.2022.2046756</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx10"><label>Du et al.(2024)Du, Zheng, Guo, Wu, Li, and Liu</label><mixed-citation>Du, S., Zheng, M., Guo, L., Wu, Y., Li, Z., and Liu, P.: Urban building function classification based on multisource geospatial data: A two-stage method combining unsupervised and supervised algorithms, Earth Sci. Inform., 17, 1179–1201, <ext-link xlink:href="https://doi.org/10.1007/s12145-024-01250-5" ext-link-type="DOI">10.1007/s12145-024-01250-5</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx11"><label>Geiger et al.(2012)Geiger, Lenz, and Urtasun</label><mixed-citation>Geiger, A., Lenz, P., and Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite, in: 2012 IEEE Conf. on Comput. Vis. Pattern Recognit.(CVPR),  3354–3361, <ext-link xlink:href="https://doi.org/10.1109/CVPR.2012.6248074" ext-link-type="DOI">10.1109/CVPR.2012.6248074</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bibx12"><label>Griffiths and Boehm(2019)</label><mixed-citation>Griffiths, D. and Boehm, J.: Improving public data for building segmentation from Convolutional Neural Networks (CNNs) for fused airborne lidar and image data using active contours, ISPRS-J. Photogramm. Remote Sens., 154, 70–83, <ext-link xlink:href="https://doi.org/10.1016/j.isprsjprs.2019.05.013" ext-link-type="DOI">10.1016/j.isprsjprs.2019.05.013</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx13"><label>He et al.(2025)He, Liu, Shi, and Zheng</label><mixed-citation>He, D., Liu, X., Shi, Q., and Zheng, Y.: Visual-language reasoning segmentation (LARSE) of function-level building footprint across Yangtze River Economic Belt of China, Sust. Cities Soc., 127, 106439, <ext-link xlink:href="https://doi.org/10.1016/j.scs.2025.106439" ext-link-type="DOI">10.1016/j.scs.2025.106439</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx14"><label>He et al.(2024)He, Yao, Shao, and Wang</label><mixed-citation>He, Z., Yao, W., Shao, J., and Wang, P.: UB-FineNet: Urban building fine-grained classification network for open-access satellite images, ISPRS-J. Photogramm. Remote Sens., 217, 76–90, <ext-link xlink:href="https://doi.org/10.1016/j.isprsjprs.2024.08.008" ext-link-type="DOI">10.1016/j.isprsjprs.2024.08.008</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx15"><label>Hoffmann et al.(2023)Hoffmann, Abdulahhad, and Zhu</label><mixed-citation>Hoffmann, E. J., Abdulahhad, K., and Zhu, X. X.: Using social media images for building function classification, Cities, 133, 104107, <ext-link xlink:href="https://doi.org/10.1016/j.cities.2022.104107" ext-link-type="DOI">10.1016/j.cities.2022.104107</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx16"><label>Johnson(2012)</label><mixed-citation>Johnson, J.: Cities: systems of systems of systems, in: Complexity theories of cities have come of age: An overview with implications to urban planning and design, edited by: Portugali, J., Meyer, H., Stolk, E., and Tan, E., chap. 153–172, Springer Berlin Heidelberg, Berlin, Heidelberg, ISBN 978-3-642-24544-2, <ext-link xlink:href="https://doi.org/10.1007/978-3-642-24544-2_9" ext-link-type="DOI">10.1007/978-3-642-24544-2_9</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bibx17"><label>Kong et al.(2024)Kong, Ai, Zou, Yan, and Yang</label><mixed-citation>Kong, B., Ai, T., Zou, X., Yan, X., and Yang, M.: A graph-based neural network approach to integrate multi-source data for urban building function classification, Comput. Environ. Urban Syst., 110, 102094, <ext-link xlink:href="https://doi.org/10.1016/j.compenvurbsys.2024.102094" ext-link-type="DOI">10.1016/j.compenvurbsys.2024.102094</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx18"><label>Kun et al.(2024)Kun, Wanxuan, Xiaoyu, Chubo, Hongfeng, and Xian</label><mixed-citation>Kun, F., Wanxuan, L., Xiaoyu, L., Chubo, D., Hongfeng, Y., and Xian, S.: A comprehensive survey and assumption of remote sensing foundation modal, National Remote Sensing Bulletin, 28, 1667–1680, <ext-link xlink:href="https://doi.org/10.11834/jrs.20233313" ext-link-type="DOI">10.11834/jrs.20233313</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx19"><label>Li and Tang(2024)</label><mixed-citation>Li, S. and Tang, H.: Multimodal alignment and fusion: A survey, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2411.17040" ext-link-type="DOI">10.48550/arXiv.2411.17040</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx20"><label>Li et al.(2021)Li, Meng, Wang, He, Xia, and Lin</label><mixed-citation>Li, W., Meng, L., Wang, J., He, C., Xia, G. S., and Lin, D.: 3D building reconstruction from monocular remote sensing images, in: 2021 IEEE/CVF Int. Conf. Comput. Vis. (ICCV),  12528–12537, <ext-link xlink:href="https://doi.org/10.1109/ICCV48922.2021.01232" ext-link-type="DOI">10.1109/ICCV48922.2021.01232</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx21"><label>Li et al.(2023)Li, Lai, Xu, Xiangli, Yu, He, Xia, and Lin</label><mixed-citation>Li, W., Lai, Y., Xu, L., Xiangli, Y., Yu, J., He, C., Xia, G. S., and Lin, D.: OmniCity: Omnipotent city understanding with multi-Level and multi-View images, in: 2023 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 17397–17407, <ext-link xlink:href="https://doi.org/10.1109/CVPR52729.2023.01669" ext-link-type="DOI">10.1109/CVPR52729.2023.01669</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx22"><label>Li et al.(2025)Li, Yu, Chen, Lin, Dong, Zhang, He, and Fu</label><mixed-citation>Li, W., Yu, J., Chen, D., Lin, Y., Dong, R., Zhang, X., He, C., and Fu, H.: Fine-grained building function recognition with street-view images and GIS map data via geometry-aware semi-supervised learning, Int. J. Appl. Earth Obs. Geoinf., 137, 104386, <ext-link xlink:href="https://doi.org/10.1016/j.jag.2025.104386" ext-link-type="DOI">10.1016/j.jag.2025.104386</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx23"><label>Ma et al.(2021)Ma, Huang, Dai, Liu, Luo, Chen, and Yi</label><mixed-citation>Ma, Y. Z., Huang, J., Dai, X., Liu, S., Luo, L., Chen, Z., and Yi: HoliCity: A city-scale data platform for learning holistic 3D structures, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2008.03286" ext-link-type="DOI">10.48550/arXiv.2008.03286</ext-link> 2021.</mixed-citation></ref>
      <ref id="bib1.bibx24"><label>Mai et al.(2024)Mai, Huang, Sun, Song, Mishra, Liu, Gao, Liu, Cong, Hu, Cundy, Li, Zhu, and Lao</label><mixed-citation>Mai, G., Huang, W., Sun, J., Song, S., Mishra, D., Liu, N., Gao, S., Liu, T., Cong, G., Hu, Y., Cundy, C., Li, Z., Zhu, R., and Lao, N.: On the opportunities and challenges of foundation models for GeoAI (Vision Paper), ACM Trans. Spatial Algorithms Syst., 10, 46, <ext-link xlink:href="https://doi.org/10.1145/3653070" ext-link-type="DOI">10.1145/3653070</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx25"><label>Marcus and Koch(2016)</label><mixed-citation>Marcus, L. and Koch, D.: Cities as implements or facilities – The need for a spatial morphology in smart city systems, Env. Plan. B-Urban Anal. City Sci., 44, 204–226, <ext-link xlink:href="https://doi.org/10.1177/0265813516685565" ext-link-type="DOI">10.1177/0265813516685565</ext-link>, <ext-link xlink:href="https://doi.org/10.1177/0265813516685565" ext-link-type="DOI">10.1177/0265813516685565</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx26"><label>Memduhoglu et al.(2024)Memduhoglu, Fulman, and Zipf</label><mixed-citation>Memduhoglu, A., Fulman, N., and Zipf, A.: Enriching building function classification using Large Language Model embeddings of OpenStreetMap Tags, Earth Sci. Inform., 17, 5403–5418, <ext-link xlink:href="https://doi.org/10.1007/s12145-024-01463-8" ext-link-type="DOI">10.1007/s12145-024-01463-8</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx27"><label>Ren et al.(2024)Ren, Qiu, and An</label><mixed-citation>Ren, D., Qiu, X., and An, Z.: A multi-source data-driven analysis of building functional classification and its relationship with population distribution, Remote Sens., 16, <ext-link xlink:href="https://doi.org/10.3390/rs16234492" ext-link-type="DOI">10.3390/rs16234492</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx28"><label>Shen et al.(2021)Shen, Liu, and Wang</label><mixed-citation>Shen, P., Liu, J., and Wang, M.: Fast generation of microclimate weather data for building simulation under heat island using map capturing and clustering technique, Sust. Cities Soc, 71, 102954, <ext-link xlink:href="https://doi.org/10.1016/j.scs.2021.102954" ext-link-type="DOI">10.1016/j.scs.2021.102954</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx29"><label>Su et al.(2025a)Su, Chen, Xu, Huang, Deng, Yan et al.</label><mixed-citation>Su, P., Chen, R., Xu, H., Huang, W., Deng, X., Yan, W., Wu, Hangbin, and Liu C.: BuildingSense-A multimodal building function classification dataset, <ext-link xlink:href="https://doi.org/10.6084/m9.figshare.30645776.v2" ext-link-type="DOI">10.6084/m9.figshare.30645776.v2</ext-link>, 2025a.</mixed-citation></ref>
      <ref id="bib1.bibx30"><label>Su et al.(2025b)Su, Yan, Li, Wu, Liu, and Huang</label><mixed-citation>Su, P., Yan, Y., Li, H., Wu, H., Liu, C., and Huang, W.: Images and deep learning in human and urban infrastructure interactions pertinent to sustainable urban studies: Review and perspective, Int. J. Appl. Earth Obs. Geoinf., 136, 104352, <ext-link xlink:href="https://doi.org/10.1016/j.jag.2024.104352" ext-link-type="DOI">10.1016/j.jag.2024.104352</ext-link>, 2025b.</mixed-citation></ref>
      <ref id="bib1.bibx31"><label>Sun et al.(2023)Sun, Zheng, Xie, Liu, Chu, Qiu, Xu, Ding, Li, Geng, Wu, Wang, Chen, Yin, Ren, Fu, He, Yuan, Liu, Liu, Li, Dong, Cheng, Zhang, Heng, Dai, Luo, Wang, Wen, Qiu, Guo, Xiong, Liu, and Li</label><mixed-citation>Sun, J., Zheng, C., Xie, E., Liu, Z., Chu, R., Qiu, J., Xu, J., Ding, M., Li, H., Geng, M., Wu, Y., Wang, W., Chen, J., Yin, Z., Ren, X., Fu, J., He, J., Yuan, W., Liu, Q., Liu, X., Li, Y., Dong, H., Cheng, Y., Zhang, M., Heng, P.-A., Dai, J., Luo, P., Wang, J., Wen, J.-R., Qiu, X., Guo, Y.-C., Xiong, H., Liu, Q., and Li, Z.: A survey of reasoning with foundation models, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2312.11562" ext-link-type="DOI">10.48550/arXiv.2312.11562</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx32"><label>The construction wiki contributors(2021)</label><mixed-citation>The construction wiki contributors: Function, <uri>https://www.designingbuildings.co.uk/wiki/Function</uri> (last access: 12 September 2025), 2021.</mixed-citation></ref>
      <ref id="bib1.bibx33"><label>Wang et al.(2017)Wang, Bai, Mattyus, Chu, Luo, Yang, Liang, Cheverie, Fidler, and Urtasun</label><mixed-citation>Wang, S., Bai, M., Mattyus, G., Chu, H., Luo, W., Yang, B., Liang, J., Cheverie, J., Fidler, S., and Urtasun, R.: TorontoCity: Seeing the world with a million eyes, in: 2017 IEEE Int. Conf. Comput. Vis. (ICCV),  3028–3036, <ext-link xlink:href="https://doi.org/10.1109/ICCV.2017.327" ext-link-type="DOI">10.1109/ICCV.2017.327</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx34"><label>Wang et al.(2024)Wang, Zhang, Dong, Guo, Tao, and Zhang</label><mixed-citation>Wang, Y., Zhang, Y., Dong, Q., Guo, H., Tao, Y., and Zhang, F.: A multi-view graph neural network for building age prediction, ISPRS-J. Photogramm. Remote Sens., 218, 294–311, <ext-link xlink:href="https://doi.org/10.1016/j.isprsjprs.2024.10.011" ext-link-type="DOI">10.1016/j.isprsjprs.2024.10.011</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx35"><label>Weir et al.(2019)Weir, Lindenbaum, Bastidas, Etten, Kumar, Mcpherson, Shermeyer, and Tang</label><mixed-citation>Weir, N., Lindenbaum, D., Bastidas, A., Etten, A., Kumar, V., Mcpherson, S., Shermeyer, J., and Tang, H.: SpaceNet MVOI: A multi-view overhead imagery dataset, in: 2019 IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 992–1001, ISBN 2380-7504, <ext-link xlink:href="https://doi.org/10.1109/ICCV.2019.00108" ext-link-type="DOI">10.1109/ICCV.2019.00108</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx36"><label>Wojna et al.(2021)Wojna, Maziarz, Jocz, Paluba, Kozikowski, and Kokkinos</label><mixed-citation>Wojna, Z., Maziarz, K., Jocz, L., Paluba, R., Kozikowski, R., and Kokkinos, I.: Holistic multi-view building analysis in the wild with projection pooling,  Proceedings of the AAAI Conference on Artificial Intelligence, 2870–2878, <ext-link xlink:href="https://doi.org/10.1609/aaai.v35i4.16393" ext-link-type="DOI">10.1609/aaai.v35i4.16393</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx37"><label>Xiao et al.(2022)Xiao, Jia, Yang, Sun, Shi, Wang, and Jia</label><mixed-citation>Xiao, B., Jia, X., Yang, D., Sun, L., Shi, F., Wang, Q., and Jia, Y.: Research on classification method of building function oriented to urban building stock management, Sustainability, 14, 5871, <ext-link xlink:href="https://doi.org/10.3390/su14105871" ext-link-type="DOI">10.3390/su14105871</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx38"><label>Xu et al.(2022)Xu, He, Xie, Xie, Luo, and Xie</label><mixed-citation>Xu, Y., He, Z., Xie, X., Xie, Z., Luo, J., and Xie, H.: Building function classification in Nanjing, China, using deep learning, Trans. GIS, 26, 2145–2165, <ext-link xlink:href="https://doi.org/10.1111/tgis.12934" ext-link-type="DOI">10.1111/tgis.12934</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx39"><label>Xu et al.(2023)Xu, Zhang, Wu, Yang, and Wu</label><mixed-citation>Xu, Z., Zhang, F., Wu, Y., Yang, Y., and Wu, Y.: Building height calculation for an urban area based on street view images and deep learning, Comput.-Aided Civil Infrastruct. Eng., 38, 892–906, <ext-link xlink:href="https://doi.org/10.1111/mice.12930" ext-link-type="DOI">10.1111/mice.12930</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx40"><label>Yang et al.(2020)Yang, Hu, Bergasa, Romera, and Wang</label><mixed-citation>Yang, K., Hu, X., Bergasa, L. M., Romera, E., and Wang, K.: PASS: Panoramic annular semantic segmentation, IEEE Trans. Intell. Transp. Syst., 21, 4171–4185, <ext-link xlink:href="https://doi.org/10.1109/TITS.2019.2938965" ext-link-type="DOI">10.1109/TITS.2019.2938965</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx41"><label>Yang et al.(2021)Yang, Hu, and Stiefelhagen</label><mixed-citation>Yang, K., Hu, X., and Stiefelhagen, R.: Is context-aware CNN ready for the surroundings? panoramic semantic segmentation in the wild, IEEE Trans. Image Process., 30, 1866–1881, <ext-link xlink:href="https://doi.org/10.1109/TIP.2020.3048682" ext-link-type="DOI">10.1109/TIP.2020.3048682</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx42"><label>Yu and Fang(2023)</label><mixed-citation>Yu, D. and Fang, C.: Urban remote sensing with spatial big data: A review and renewed perspective of urban studies in recent decades, Remote Sens., 15, 1307, <ext-link xlink:href="https://doi.org/10.3390/rs15051307" ext-link-type="DOI">10.3390/rs15051307</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx43"><label>Zhang et al.(2021a)Zhang, Shi, Zhuo, Wang, and Tao</label><mixed-citation>Zhang, C., Shi, Q., Zhuo, L., Wang, F., and Tao, H.: Inferring mixed use of buildings with multisource data based on tensor decomposition, ISPRS Int. J. Geo-Inf., 10, 185, <ext-link xlink:href="https://doi.org/10.3390/ijgi10030185" ext-link-type="DOI">10.3390/ijgi10030185</ext-link>, 2021a.</mixed-citation></ref>
      <ref id="bib1.bibx44"><label>Zhang et al.(2021b)Zhang, Fukuda, and Yabuki</label><mixed-citation>Zhang, J., Fukuda, T., and Yabuki, N.: Development of a city-scale approach for facade color measurement with building functional classification using deep learning and street view images, ISPRS Int. J. Geo-Inf., 10, 551, <ext-link xlink:href="https://doi.org/10.3390/ijgi10080551" ext-link-type="DOI">10.3390/ijgi10080551</ext-link>, 2021b.</mixed-citation></ref>
      <ref id="bib1.bibx45"><label>Zhang et al.(2023)Zhang, Liu, Chen, Guan, Luo, and Huang</label><mixed-citation>Zhang, X., Liu, X., Chen, K., Guan, F., Luo, M., and Huang, H.: Inferring building function: A novel geo-aware neural network supporting building-level function classification, Sust. Cities Soc., 89, 104349, <ext-link xlink:href="https://doi.org/10.1016/j.scs.2022.104349" ext-link-type="DOI">10.1016/j.scs.2022.104349</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx46"><label>Zhang et al.(2025)Zhang, Zhao, and Long</label><mixed-citation>Zhang, Y., Zhao, H., and Long, Y.: CMAB: A Multi-Attribute Building Dataset of China, Sci. Data, 12, 430, <ext-link xlink:href="https://doi.org/10.1038/s41597-025-04730-5" ext-link-type="DOI">10.1038/s41597-025-04730-5</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx47"><label>Zhao et al.(2023)Zhao, Wu, Li, Yang, Fan, Wu, and Yu</label><mixed-citation>Zhao, Y., Wu, B., Li, Q., Yang, L., Fan, H., Wu, J., and Yu, B.: Combining ICESat-2 photons and Google Earth Satellite images for building height extraction, Int. J. Appl. Earth Obs. Geoinf., 117, 103213, <ext-link xlink:href="https://doi.org/10.1016/j.jag.2023.103213" ext-link-type="DOI">10.1016/j.jag.2023.103213</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx48"><label>Zheng et al.(2024)Zheng, Zhang, Ou, and Liu</label><mixed-citation>Zheng, Y., Zhang, X., Ou, J., and Liu, X.: Identifying building function using multisource data: A case study of China's three major urban agglomerations, Sust. Cities Soc., 108, 105498, <ext-link xlink:href="https://doi.org/10.1016/j.scs.2024.105498" ext-link-type="DOI">10.1016/j.scs.2024.105498</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx49"><label>Zhou et al.(2023)Zhou, Persello, Li, and Stein</label><mixed-citation>Zhou, W., Persello, C., Li, M., and Stein, A.: Building use and mixed-use classification with a transformer-based network fusing satellite images and geospatial textual information, Remote Sens. Environ., 297, 113767, <ext-link xlink:href="https://doi.org/10.1016/j.rse.2023.113767" ext-link-type="DOI">10.1016/j.rse.2023.113767</ext-link>, 2023.</mixed-citation></ref>

  </ref-list></back>
    <!--<article-title-html>BuildingSense: a new multimodal building function classification dataset</article-title-html>
<abstract-html/>
<ref-html id="bib1.bib1"><label>Arribas-Bel and Fleischmann(2022)</label><mixed-citation>
      
Arribas-Bel, D. and Fleischmann, M.: Spatial signatures – Understanding (urban)
spaces through form and function, Habitat Int., 128, 102641,
<a href="https://doi.org/10.1016/j.habitatint.2022.102641" target="_blank">https://doi.org/10.1016/j.habitatint.2022.102641</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib2"><label>Azimi et al.(2019)Azimi, Henry, Sommer, Schumann, and
Vig</label><mixed-citation>
      
Azimi, S. M., Henry, C., Sommer, L. W., Schumann, A., and Vig, E.: SkyScapes –
Fine-grained semantic understanding of aerial scenes, 2019 IEEE/CVF Int.
Conf. Comput. Vis. (ICCV), 7392–7402,
<a href="https://api.semanticscholar.org/CorpusID:207998372" target="_blank"/> (last access: 20 February 2026), 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib3"><label>Bommasani et al.(2022)Bommasani, Hudson, and
Ehsan Adeli</label><mixed-citation>
      
Bommasani, R., Hudson, D. A., and Ehsan Adeli, E. A.: On the opportunities and
risks of foundation models,   arXiv [preprint]
<a href="https://doi.org/10.48550/arXiv.2108.07258" target="_blank">https://doi.org/10.48550/arXiv.2108.07258</a>,
2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib4"><label>Braun et al.(2019)Braun, Krebs, Flohr, and Gavrila</label><mixed-citation>
      
Braun, M., Krebs, S., Flohr, F., and Gavrila, D. M.: EuroCity persons: A novel
benchmark for person detection in traffic scenes, IEEE T. Pattern Anal.
Mach. Intell., 41, 1844–1861, <a href="https://doi.org/10.1109/TPAMI.2019.2897684" target="_blank">https://doi.org/10.1109/TPAMI.2019.2897684</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib5"><label>Brown et al.(2020)Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal,
Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan,
Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess,
Clark, Berner, McCandlish, Radford, Sutskever, and Amodei</label><mixed-citation>
      
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S.,
Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler,
D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray,
S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever,
I., and Amodei, D.: Language models are few-shot learners, in: Proceedings of
the 34th International Conference on Neural Information Processing Systems,
NIPS '20, Curran Associates Inc., Red Hook, NY, USA, ISBN 9781713829546,
2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib6"><label>Chen et al.(2024)Chen, Zhou, Stokes, and Zhang</label><mixed-citation>
      
Chen, W., Zhou, Y., Stokes, E. C., and Zhang, X.: Large-scale urban building
function mapping by integrating multi-source web-based geospatial data,
Geo-Spat. Inf. Sci., 27, 1785–1799, <a href="https://doi.org/10.1080/10095020.2023.2264342" target="_blank">https://doi.org/10.1080/10095020.2023.2264342</a>,
2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib7"><label>Choi and Yoon(2023)</label><mixed-citation>
      
Choi, S. and Yoon, S.: Energy signature-based clustering using open data for
urban building energy analysis toward carbon neutrality: A case study on
electricity change under COVID-19, Sust. Cities Soc., 92, 104471,
<a href="https://doi.org/10.1016/j.scs.2023.104471" target="_blank">https://doi.org/10.1016/j.scs.2023.104471</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib8"><label>Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler,
Benenson, Franke, Roth, and Schiele</label><mixed-citation>
      
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., and Schiele, B.: The cityscapes dataset for semantic
urban scene understanding, in: 2016 IEEE Conf. Comput. Vis. Pattern.
Recognit. (CVPR), 3213–3223,
<a href="https://doi.org/10.1109/CVPR.2016.350" target="_blank">https://doi.org/10.1109/CVPR.2016.350</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib9"><label>Deng et al.(2022)Deng, Chen, Yang, Li, Jiang, Liao, and
Sun</label><mixed-citation>
      
Deng, Y., Chen, R., Yang, J., Li, Y., Jiang, H., Liao, W., and Sun, M.:
Identify urban building functions with multisource data: A case study in
Guangzhou, China, Int. J. Geogr. Inf. Sci., 36, 2060–2085,
<a href="https://doi.org/10.1080/13658816.2022.2046756" target="_blank">https://doi.org/10.1080/13658816.2022.2046756</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib10"><label>Du et al.(2024)Du, Zheng, Guo, Wu, Li, and Liu</label><mixed-citation>
      
Du, S., Zheng, M., Guo, L., Wu, Y., Li, Z., and Liu, P.: Urban building
function classification based on multisource geospatial data: A two-stage
method combining unsupervised and supervised algorithms, Earth Sci. Inform.,
17, 1179–1201, <a href="https://doi.org/10.1007/s12145-024-01250-5" target="_blank">https://doi.org/10.1007/s12145-024-01250-5</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib11"><label>Geiger et al.(2012)Geiger, Lenz, and Urtasun</label><mixed-citation>
      
Geiger, A., Lenz, P., and Urtasun, R.: Are we ready for autonomous driving? The
KITTI vision benchmark suite, in: 2012 IEEE Conf. on Comput. Vis. Pattern
Recognit.(CVPR),  3354–3361,
<a href="https://doi.org/10.1109/CVPR.2012.6248074" target="_blank">https://doi.org/10.1109/CVPR.2012.6248074</a>, 2012.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib12"><label>Griffiths and Boehm(2019)</label><mixed-citation>
      
Griffiths, D. and Boehm, J.: Improving public data for building segmentation
from Convolutional Neural Networks (CNNs) for fused airborne lidar and image
data using active contours, ISPRS-J. Photogramm. Remote Sens., 154, 70–83,
<a href="https://doi.org/10.1016/j.isprsjprs.2019.05.013" target="_blank">https://doi.org/10.1016/j.isprsjprs.2019.05.013</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib13"><label>He et al.(2025)He, Liu, Shi, and Zheng</label><mixed-citation>
      
He, D., Liu, X., Shi, Q., and Zheng, Y.: Visual-language reasoning segmentation
(LARSE) of function-level building footprint across Yangtze River Economic
Belt of China, Sust. Cities Soc., 127, 106439,
<a href="https://doi.org/10.1016/j.scs.2025.106439" target="_blank">https://doi.org/10.1016/j.scs.2025.106439</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib14"><label>He et al.(2024)He, Yao, Shao, and Wang</label><mixed-citation>
      
He, Z., Yao, W., Shao, J., and Wang, P.: UB-FineNet: Urban building
fine-grained classification network for open-access satellite images,
ISPRS-J. Photogramm. Remote Sens., 217, 76–90,
<a href="https://doi.org/10.1016/j.isprsjprs.2024.08.008" target="_blank">https://doi.org/10.1016/j.isprsjprs.2024.08.008</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib15"><label>Hoffmann et al.(2023)Hoffmann, Abdulahhad, and Zhu</label><mixed-citation>
      
Hoffmann, E. J., Abdulahhad, K., and Zhu, X. X.: Using social media images for
building function classification, Cities, 133, 104107,
<a href="https://doi.org/10.1016/j.cities.2022.104107" target="_blank">https://doi.org/10.1016/j.cities.2022.104107</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib16"><label>Johnson(2012)</label><mixed-citation>
      
Johnson, J.: Cities: systems of systems of systems, in: Complexity theories of
cities have come of age: An overview with implications to urban planning and
design, edited by: Portugali, J., Meyer, H., Stolk, E., and Tan, E.,
153–172, Springer Berlin Heidelberg, Berlin, Heidelberg, ISBN
978-3-642-24544-2, <a href="https://doi.org/10.1007/978-3-642-24544-2_9" target="_blank">https://doi.org/10.1007/978-3-642-24544-2_9</a>, 2012.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib17"><label>Kong et al.(2024)Kong, Ai, Zou, Yan, and Yang</label><mixed-citation>
      
Kong, B., Ai, T., Zou, X., Yan, X., and Yang, M.: A graph-based neural network
approach to integrate multi-source data for urban building function
classification, Comput. Environ. Urban Syst., 110, 102094,
<a href="https://doi.org/10.1016/j.compenvurbsys.2024.102094" target="_blank">https://doi.org/10.1016/j.compenvurbsys.2024.102094</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib18"><label>Kun et al.(2024)Kun, Wanxuan, Xiaoyu, Chubo, Hongfeng, and
Xian</label><mixed-citation>
      
Kun, F., Wanxuan, L., Xiaoyu, L., Chubo, D., Hongfeng, Y., and Xian, S.: A
comprehensive survey and assumption of remote sensing foundation models,
National Remote Sensing Bulletin, 28, 1667–1680,
<a href="https://doi.org/10.11834/jrs.20233313" target="_blank">https://doi.org/10.11834/jrs.20233313</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib19"><label>Li and Tang(2024)</label><mixed-citation>
      
Li, S. and Tang, H.: Multimodal alignment and fusion: A survey,
arXiv [preprint],
<a href="https://doi.org/10.48550/arXiv.2411.17040" target="_blank">https://doi.org/10.48550/arXiv.2411.17040</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib20"><label>Li et al.(2021)Li, Meng, Wang, He, Xia, and Lin</label><mixed-citation>
      
Li, W., Meng, L., Wang, J., He, C., Xia, G. S., and Lin, D.: 3D building
reconstruction from monocular remote sensing images, in: 2021 IEEE/CVF Int.
Conf. Comput. Vis. (ICCV), 12528–12537,
<a href="https://doi.org/10.1109/ICCV48922.2021.01232" target="_blank">https://doi.org/10.1109/ICCV48922.2021.01232</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib21"><label>Li et al.(2023)Li, Lai, Xu, Xiangli, Yu, He, Xia, and Lin</label><mixed-citation>
      
Li, W., Lai, Y., Xu, L., Xiangli, Y., Yu, J., He, C., Xia, G. S., and Lin, D.:
OmniCity: Omnipotent city understanding with multi-Level and multi-View
images, in: 2023 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
17397–17407, <a href="https://doi.org/10.1109/CVPR52729.2023.01669" target="_blank">https://doi.org/10.1109/CVPR52729.2023.01669</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib22"><label>Li et al.(2025)Li, Yu, Chen, Lin, Dong, Zhang, He, and Fu</label><mixed-citation>
      
Li, W., Yu, J., Chen, D., Lin, Y., Dong, R., Zhang, X., He, C., and Fu, H.:
Fine-grained building function recognition with street-view images and GIS
map data via geometry-aware semi-supervised learning, Int. J. Appl. Earth
Obs. Geoinf., 137, 104386, <a href="https://doi.org/10.1016/j.jag.2025.104386" target="_blank">https://doi.org/10.1016/j.jag.2025.104386</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib23"><label>Ma et al.(2021)Ma, Huang, Dai, Liu, Luo, Chen, and Yi</label><mixed-citation>
      
Ma, Y. Z., Huang, J., Dai, X., Liu, S., Luo, L., Chen, Z., and Yi: HoliCity: A
city-scale data platform for learning holistic 3D structures, arXiv [preprint],
<a href="https://doi.org/10.48550/arXiv.2008.03286" target="_blank">https://doi.org/10.48550/arXiv.2008.03286</a> 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib24"><label>Mai et al.(2024)Mai, Huang, Sun, Song, Mishra, Liu, Gao, Liu, Cong,
Hu, Cundy, Li, Zhu, and Lao</label><mixed-citation>
      
Mai, G., Huang, W., Sun, J., Song, S., Mishra, D., Liu, N., Gao, S., Liu, T.,
Cong, G., Hu, Y., Cundy, C., Li, Z., Zhu, R., and Lao, N.: On the
opportunities and challenges of foundation models for GeoAI (Vision Paper),
ACM Trans. Spatial Algorithms Syst., 10, 46, <a href="https://doi.org/10.1145/3653070" target="_blank">https://doi.org/10.1145/3653070</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib25"><label>Marcus and Koch(2016)</label><mixed-citation>
      
Marcus, L. and Koch, D.: Cities as implements or facilities – The need for a
spatial morphology in smart city systems, Env. Plan. B-Urban Anal. City Sci.,
44, 204–226, <a href="https://doi.org/10.1177/0265813516685565" target="_blank">https://doi.org/10.1177/0265813516685565</a>,
2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib26"><label>Memduhoglu et al.(2024)Memduhoglu, Fulman, and Zipf</label><mixed-citation>
      
Memduhoglu, A., Fulman, N., and Zipf, A.: Enriching building function
classification using Large Language Model embeddings of OpenStreetMap Tags,
Earth Sci. Inform., 17, 5403–5418, <a href="https://doi.org/10.1007/s12145-024-01463-8" target="_blank">https://doi.org/10.1007/s12145-024-01463-8</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib27"><label>Ren et al.(2024)Ren, Qiu, and An</label><mixed-citation>
      
Ren, D., Qiu, X., and An, Z.: A multi-source data-driven analysis of building
functional classification and its relationship with population distribution,
Remote Sens., 16, 4492, <a href="https://doi.org/10.3390/rs16234492" target="_blank">https://doi.org/10.3390/rs16234492</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib28"><label>Shen et al.(2021)Shen, Liu, and Wang</label><mixed-citation>
      
Shen, P., Liu, J., and Wang, M.: Fast generation of microclimate weather data
for building simulation under heat island using map capturing and clustering
technique, Sust. Cities Soc., 71, 102954, <a href="https://doi.org/10.1016/j.scs.2021.102954" target="_blank">https://doi.org/10.1016/j.scs.2021.102954</a>,
2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib29"><label>Su et al.(2025a)Su, Chen, Xu, Huang, Deng, Yan
et al.</label><mixed-citation>
      
Su, P., Chen, R., Xu, H., Huang, W., Deng, X., Yan, W., Wu, Hangbin, and Liu C.:
BuildingSense-A multimodal building function classification dataset,
<a href="https://doi.org/10.6084/m9.figshare.30645776.v2" target="_blank">https://doi.org/10.6084/m9.figshare.30645776.v2</a>, 2025a.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib30"><label>Su et al.(2025b)Su, Yan, Li, Wu, Liu, and
Huang</label><mixed-citation>
      
Su, P., Yan, Y., Li, H., Wu, H., Liu, C., and Huang, W.: Images and deep
learning in human and urban infrastructure interactions pertinent to
sustainable urban studies: Review and perspective, Int. J. Appl. Earth Obs.
Geoinf., 136, 104352, <a href="https://doi.org/10.1016/j.jag.2024.104352" target="_blank">https://doi.org/10.1016/j.jag.2024.104352</a>,
2025b.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib31"><label>Sun et al.(2023)Sun, Zheng, Xie, Liu, Chu, Qiu, Xu, Ding, Li, Geng,
Wu, Wang, Chen, Yin, Ren, Fu, He, Yuan, Liu, Liu, Li, Dong, Cheng, Zhang,
Heng, Dai, Luo, Wang, Wen, Qiu, Guo, Xiong, Liu, and Li</label><mixed-citation>
      
Sun, J., Zheng, C., Xie, E., Liu, Z., Chu, R., Qiu, J., Xu, J., Ding, M., Li,
H., Geng, M., Wu, Y., Wang, W., Chen, J., Yin, Z., Ren, X., Fu, J., He, J.,
Yuan, W., Liu, Q., Liu, X., Li, Y., Dong, H., Cheng, Y., Zhang, M., Heng,
P.-A., Dai, J., Luo, P., Wang, J., Wen, J.-R., Qiu, X., Guo, Y.-C., Xiong,
H., Liu, Q., and Li, Z.: A survey of reasoning with foundation models, arXiv [preprint],
<a href="https://doi.org/10.48550/arXiv.2312.11562" target="_blank">https://doi.org/10.48550/arXiv.2312.11562</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib32"><label>The construction wiki contributors(2021)</label><mixed-citation>
      
The construction wiki contributors: Function,
<a href="https://www.designingbuildings.co.uk/wiki/Function" target="_blank"/> (last access: 12 September 2025), 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib33"><label>Wang et al.(2017)Wang, Bai, Mattyus, Chu, Luo, Yang, Liang, Cheverie,
Fidler, and Urtasun</label><mixed-citation>
      
Wang, S., Bai, M., Mattyus, G., Chu, H., Luo, W., Yang, B., Liang, J.,
Cheverie, J., Fidler, S., and Urtasun, R.: TorontoCity: Seeing the world with
a million eyes, in: 2017 IEEE Int. Conf. Comput. Vis. (ICCV), 3028–3036,
<a href="https://doi.org/10.1109/ICCV.2017.327" target="_blank">https://doi.org/10.1109/ICCV.2017.327</a>, 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib34"><label>Wang et al.(2024)Wang, Zhang, Dong, Guo, Tao, and Zhang</label><mixed-citation>
      
Wang, Y., Zhang, Y., Dong, Q., Guo, H., Tao, Y., and Zhang, F.: A multi-view
graph neural network for building age prediction, ISPRS-J. Photogramm. Remote
Sens., 218, 294–311, <a href="https://doi.org/10.1016/j.isprsjprs.2024.10.011" target="_blank">https://doi.org/10.1016/j.isprsjprs.2024.10.011</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib35"><label>Weir et al.(2019)Weir, Lindenbaum, Bastidas, Etten, Kumar, Mcpherson,
Shermeyer, and Tang</label><mixed-citation>
      
Weir, N., Lindenbaum, D., Bastidas, A., Etten, A., Kumar, V., Mcpherson, S.,
Shermeyer, J., and Tang, H.: SpaceNet MVOI: A multi-view overhead imagery
dataset, in: 2019 IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 992–1001,
ISSN 2380-7504, <a href="https://doi.org/10.1109/ICCV.2019.00108" target="_blank">https://doi.org/10.1109/ICCV.2019.00108</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib36"><label>Wojna et al.(2021)Wojna, Maziarz, Jocz, Paluba, Kozikowski, and
Kokkinos</label><mixed-citation>
      
Wojna, Z., Maziarz, K., Jocz, L., Paluba, R., Kozikowski, R., and Kokkinos, I.:
Holistic multi-view building analysis in the wild with projection pooling, in: Proceedings of the AAAI Conference on Artificial Intelligence, 35, 2870–2878, <a href="https://doi.org/10.1609/aaai.v35i4.16393" target="_blank">https://doi.org/10.1609/aaai.v35i4.16393</a>,
2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib37"><label>Xiao et al.(2022)Xiao, Jia, Yang, Sun, Shi, Wang, and Jia</label><mixed-citation>
      
Xiao, B., Jia, X., Yang, D., Sun, L., Shi, F., Wang, Q., and Jia, Y.: Research
on classification method of building function oriented to urban building
stock management, Sustainability, 14, 5871, <a href="https://doi.org/10.3390/su14105871" target="_blank">https://doi.org/10.3390/su14105871</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib38"><label>Xu et al.(2022)Xu, He, Xie, Xie, Luo, and Xie</label><mixed-citation>
      
Xu, Y., He, Z., Xie, X., Xie, Z., Luo, J., and Xie, H.: Building function
classification in Nanjing, China, using deep learning, Trans. GIS, 26,
2145–2165, <a href="https://doi.org/10.1111/tgis.12934" target="_blank">https://doi.org/10.1111/tgis.12934</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib39"><label>Xu et al.(2023)Xu, Zhang, Wu, Yang, and Wu</label><mixed-citation>
      
Xu, Z., Zhang, F., Wu, Y., Yang, Y., and Wu, Y.: Building height calculation
for an urban area based on street view images and deep learning,
Comput.-Aided Civil Infrastruct. Eng., 38, 892–906,
<a href="https://doi.org/10.1111/mice.12930" target="_blank">https://doi.org/10.1111/mice.12930</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib40"><label>Yang et al.(2020)Yang, Hu, Bergasa, Romera, and Wang</label><mixed-citation>
      
Yang, K., Hu, X., Bergasa, L. M., Romera, E., and Wang, K.: PASS: Panoramic
annular semantic segmentation, IEEE Trans. Intell. Transp. Syst., 21,
4171–4185, <a href="https://doi.org/10.1109/TITS.2019.2938965" target="_blank">https://doi.org/10.1109/TITS.2019.2938965</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib41"><label>Yang et al.(2021)Yang, Hu, and Stiefelhagen</label><mixed-citation>
      
Yang, K., Hu, X., and Stiefelhagen, R.: Is context-aware CNN ready for the
surroundings? Panoramic semantic segmentation in the wild, IEEE Trans. Image
Process., 30, 1866–1881, <a href="https://doi.org/10.1109/TIP.2020.3048682" target="_blank">https://doi.org/10.1109/TIP.2020.3048682</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib42"><label>Yu and Fang(2023)</label><mixed-citation>
      
Yu, D. and Fang, C.: Urban remote sensing with spatial big data: A review and
renewed perspective of urban studies in recent decades, Remote Sens., 15,
1307, <a href="https://doi.org/10.3390/rs15051307" target="_blank">https://doi.org/10.3390/rs15051307</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib43"><label>Zhang et al.(2021a)Zhang, Shi, Zhuo, Wang, and
Tao</label><mixed-citation>
      
Zhang, C., Shi, Q., Zhuo, L., Wang, F., and Tao, H.: Inferring mixed use of
buildings with multisource data based on tensor decomposition, ISPRS Int. J.
Geo-Inf., 10, 185, <a href="https://doi.org/10.3390/ijgi10030185" target="_blank">https://doi.org/10.3390/ijgi10030185</a>, 2021a.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib44"><label>Zhang et al.(2021b)Zhang, Fukuda, and
Yabuki</label><mixed-citation>
      
Zhang, J., Fukuda, T., and Yabuki, N.: Development of a city-scale approach for
facade color measurement with building functional classification using deep
learning and street view images, ISPRS Int. J. Geo-Inf., 10, 551,
<a href="https://doi.org/10.3390/ijgi10080551" target="_blank">https://doi.org/10.3390/ijgi10080551</a>, 2021b.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib45"><label>Zhang et al.(2023)Zhang, Liu, Chen, Guan, Luo, and Huang</label><mixed-citation>
      
Zhang, X., Liu, X., Chen, K., Guan, F., Luo, M., and Huang, H.: Inferring
building function: A novel geo-aware neural network supporting building-level
function classification, Sust. Cities Soc., 89, 104349,
<a href="https://doi.org/10.1016/j.scs.2022.104349" target="_blank">https://doi.org/10.1016/j.scs.2022.104349</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib46"><label>Zhang et al.(2025)Zhang, Zhao, and Long</label><mixed-citation>
      
Zhang, Y., Zhao, H., and Long, Y.: CMAB: A multi-attribute building dataset of
China, Sci. Data, 12, 430, <a href="https://doi.org/10.1038/s41597-025-04730-5" target="_blank">https://doi.org/10.1038/s41597-025-04730-5</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib47"><label>Zhao et al.(2023)Zhao, Wu, Li, Yang, Fan, Wu, and Yu</label><mixed-citation>
      
Zhao, Y., Wu, B., Li, Q., Yang, L., Fan, H., Wu, J., and Yu, B.: Combining
ICESat-2 photons and Google Earth Satellite images for building height
extraction, Int. J. Appl. Earth Obs. Geoinf., 117, 103213,
<a href="https://doi.org/10.1016/j.jag.2023.103213" target="_blank">https://doi.org/10.1016/j.jag.2023.103213</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib48"><label>Zheng et al.(2024)Zheng, Zhang, Ou, and Liu</label><mixed-citation>
      
Zheng, Y., Zhang, X., Ou, J., and Liu, X.: Identifying building function using
multisource data: A case study of China's three major urban agglomerations,
Sust. Cities Soc., 108, 105498, <a href="https://doi.org/10.1016/j.scs.2024.105498" target="_blank">https://doi.org/10.1016/j.scs.2024.105498</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib49"><label>Zhou et al.(2023)Zhou, Persello, Li, and Stein</label><mixed-citation>
      
Zhou, W., Persello, C., Li, M., and Stein, A.: Building use and mixed-use
classification with a transformer-based network fusing satellite images and
geospatial textual information, Remote Sens. Environ., 297, 113767,
<a href="https://doi.org/10.1016/j.rse.2023.113767" target="_blank">https://doi.org/10.1016/j.rse.2023.113767</a>, 2023.

    </mixed-citation></ref-html></article>
