An Open-Source Automatic Survey of Green Roofs in London using Segmentation of Aerial Imagery

Abstract. Green roofs can mitigate heat, increase biodiversity, and attenuate storm water, providing some of the benefits of natural vegetation in an urban context where ground space is scarce. To guide the design of more sustainable and climate-resilient buildings and neighbourhoods, there is a need to assess the existing status of green roof coverage and explore the potential for future implementation. Accurate information on the prevalence and characteristics of existing green roofs is therefore needed, but this information is currently lacking. Segmentation algorithms have been used widely to identify buildings and land cover in aerial imagery. Using a machine-learning algorithm based on U-Net to segment aerial imagery, we surveyed the area and coverage of green roofs in London, producing a geospatial dataset (Simpson et al., 2023). We estimate that in 2021 there was 0.23 km² of green roof in the Central Activities Zone (CAZ) of London, 1.07 km² in Inner London, and 1.89 km² in Greater London. This corresponds to 2.0% of the total building footprint area in the CAZ, and 1.3% in Inner London. There is a relatively higher concentration of green roofs in the City of London, covering 3.9% of the total building footprint area. Test set accuracy was 0.99, with an F-score of 0.58. When tested against imagery and labels from a different year (2019), the model performed just as well as a model trained on the imagery and labels from that year, showing that the model generalised well between different imagery. We improve on previous studies by including more negative examples in the training data, and by requiring coincidence between vector building footprints and green roof patches. We experimented with different data augmentation methods, and found a small improvement in performance when applying random elastic deformations, colour shifts, gamma adjustments, and rotations to the imagery.

Existing surveys of green roof coverage in London lack transparency about the methods used, disagree widely about the area of green roofs in the CAZ, and do not make the full data publicly available for analysis. Accurate, comprehensive, and open data documenting the location and area of green roofs can directly inform research into city-scale heat mitigation strategy and is useful for stakeholders such as urban planners, policy makers, and research communities looking at urban heat mitigation and the added value of green spaces. However, there is a general lack of open data documenting the area and coverage of green roofs. To address this, Wu and Biljecki (2021) applied a machine-learning algorithm to high-resolution satellite imagery to identify green roofs and solar panels in a number of cities around the world, producing a ranking of which of the surveyed cities have the greatest coverage of green roofs and solar panels.
London was not included in their survey.

In this study, we identify green roofs from aerial imagery: this is a binary segmentation problem, as a single class needs to be identified from a background. We used a fully convolutional neural network known as U-Net to segment the imagery. This type of neural network was originally designed for biomedical image segmentation (Ronneberger et al., 2015), but has since been applied in other research fields including remote sensing, for example to map roads (Ozturk et al., 2020), parking lots (Ng and Hofmann, 2018), and green roofs (Wu and Biljecki, 2021) from imagery. Such algorithms process an image to output a binary mask identifying areas belonging to the target class. The encoder layers of the U-Net produce compressed abstract representations of the image at different scales, by repeatedly applying convolution blocks followed by max-pool downsampling.
The decoder layers apply upsampling and concatenation with convolution to produce a prediction with the same dimensions as the input image, combining information from the different scales provided by each encoder layer. The relationship between the image and the classification is learned from a set of labelled examples, hereafter referred to as training polygons.
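The downsampling, upsampling, and skip-concatenation structure described above can be sketched at the level of array shapes in NumPy. This is an illustrative sketch only: the learned convolution blocks are omitted, and the 2 × 2 max-pool, nearest-neighbour upsampling, and channel concatenation stand in for the corresponding U-Net operations rather than reproducing the implementation used in this study.

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max-pool: halves the spatial dimensions (C, H, W) -> (C, H/2, W/2)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour upsampling: doubles the spatial dimensions."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# A toy 3-channel "image" tile (channels, height, width).
x = np.random.rand(3, 8, 8)

# Encoder: repeated downsampling produces abstract representations of the
# image at progressively coarser scales (convolutions omitted here).
enc1 = x                       # (3, 8, 8)
enc2 = maxpool2x2(enc1)        # (3, 4, 4)
bottleneck = maxpool2x2(enc2)  # (3, 2, 2)

# Decoder: upsample and concatenate with the matching encoder output
# (the "skip connection"), combining information from the different scales.
dec2 = np.concatenate([upsample2x(bottleneck), enc2], axis=0)  # (6, 4, 4)
dec1 = np.concatenate([upsample2x(dec2), enc1], axis=0)        # (9, 8, 8)

assert dec1.shape[1:] == x.shape[1:]  # prediction matches input resolution
```

The key property illustrated is that the decoder output recovers the spatial resolution of the input image, so a per-pixel classification can be produced.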

Green roofs cover only a small proportion of the planar area of London, so in aerial imagery most pixels are not part of a green roof. This means that the classification problem is imbalanced, with the negative class being many times more numerous than the positive class. This can create problems with model training if gradient-descent batches often do not contain any positive examples. In Wu and Biljecki (2021), the training polygons were restricted to areas with relatively higher concentrations of green roofs and image tiles with no green roof were excluded, covering 1-5 km² of each of the 17 cities surveyed. Furthermore, the total number of examples for training is relatively low compared to many computer vision tasks, meaning that a computer vision model may be unable to generalise the appearance of green roofs; as such, data augmentation is key for achieving good segmentation performance. In the original U-Net paper, elastic deformations are applied to the training images, which makes the network learn to be invariant to these deformations without the need for all possible deformations to be present in the data (Ronneberger et al., 2015); this is justified because soft tissues in medical images are often deformed in this way.

In this study, we build on the machine-learning based method used by Wu and Biljecki (2021) for the segmentation of green roofs from remote-sensed imagery, improving the segmentation performance by including more negative examples and experimenting with data augmentation methods. We thus provide a robust, open, and documented dataset of the location and area of green roofs in London at the level of individual buildings (Simpson et al., 2022), filling a gap in publicly available data. This dataset has the greatest extent of its kind for any single city.
2 Data and Methods

Geographic Context and Data
Greater London is a region of England with an area of 1,570 km², which is divided into local authority districts (LADs). Central London is predominantly compact midrise, but Greater London also contains a large amount of open lowrise. The CAZ mainly covers the area of compact midrise in the centre and is therefore the most densely built part of London. Buildings in the CAZ, and especially the City of London, are more likely to be non-residential buildings. Figure 1 shows the outlines of the LADs in Greater London and Inner London, and the outline of the CAZ. The imagery used for segmentation was colour (red, green, blue) raster imagery from aerial surveys at 25 cm horizontal resolution (Getmapping Plc., 2020; accessed under an academic licence), collected in Summer 2019. Two GIS datasets were used for building footprints. Ordnance Survey (OS) VectorMap Local (VML) (Ordnance Survey (GB), 2021) building footprints were used in post-processing the segmentation, as inspection showed that their outlines were more consistent with the aerial imagery, especially in cases of buildings with internal courtyards. UKBuildings (Verisk Analytics, Inc., 2022) building footprints were used for building counts, as this dataset divides buildings into individual properties.

Segmentation pipeline
Our segmentation pipeline was based on that of Wu and Biljecki (2021), which is in turn based on Ng and Hofmann (2018).
The key differences are as follows: 1. we used aerial imagery rather than satellite imagery; 2. our hand-labelled training areas are distributed around the city, rather than concentrated in a central area; 3. we focussed on fully surveying a single city rather than trying to cover many; 4. we experimented with additional data augmentation methods; 5. we implemented early stopping rather than training for a fixed number of epochs. All analysis and data management was performed using Python (Van Rossum and Drake, 2009). A general outline of the workflow is shown in Figure 2. The method is covered in more detail in the following subsections.

To identify the locations of green roofs and estimate their covered area, we trained our U-Net with training polygons from a sample area. To produce training data, green roofs in the imagery were labelled by hand to provide input for model training.
The training label polygons and geospatial results are included in the Supplementary Material to this article for reproducibility.
We selected areas for labelling based on the OS 1 km grid reference system, so each grid square is 1 km². Firstly, a 4 km² area in the CAZ was selected, known to have a relatively higher concentration of green roofs: this was to ensure that there is sufficient representation of green roofs in the data. Secondly, to increase the diversity of the data, we selected a further 21 km² distributed around Inner London without prior knowledge of the concentration of green roofs, aiming to represent each LAD and a variety of built forms (based on an LCZ map); these areas had much smaller amounts of green roofs. All grid references that were included are listed in Table A2. Within the selected grid squares, every building in the imagery was inspected and green roofs were labelled by hand. Labelling was performed by drawing polygons using QGIS (QGIS Association, 2022); some examples of training polygons are shown in Figure 3. In total, sample areas covered 7.8% of Inner London, resulting in 4.9 × 10⁴ m² of green roofs labelled inside the CAZ, and 2.2 × 10⁴ m² outside the CAZ.
Once trained, we applied the U-Net to a larger area (the whole of Greater London) to map existing green roofs.

Performance metrics
Standard metrics were calculated to assess the validity of the segmentation model. Metrics were calculated from the final vector layers, after all processing steps. The metrics are listed in Table 2. Accuracy, IoU, precision, recall, and F-score all range from 0 to 1, where 1 represents an ideal classifier. F-score is a more appropriate measure of the overall validity of a model for imbalanced classification than accuracy. As well as calculating these metrics, we examined examples of poor segmentation performance to understand the failure modes of our segmentation method.
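For reference, these metrics can be computed directly from the entries of a binary confusion matrix, as in the generic sketch below (not the exact evaluation code used in this study; the counts in the example are invented for illustration):

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix entries
    (counts of pixels, areas, or buildings)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou = tp / (tp + fp + fn)  # intersection over union
    f_score = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "iou": iou, "f_score": f_score}

# With a heavily imbalanced problem, accuracy can be high even when the
# positive class is segmented poorly, which is why F-score is preferred:
# here accuracy is 0.994 while F-score is only 0.625.
m = segmentation_metrics(tp=50, fp=10, fn=50, tn=9890)
```

Note how the large true-negative count inflates accuracy while leaving precision, recall, and F-score unaffected.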

Segmentation algorithm
Transfer learning refers to the practice of transferring models or parts of models between different learning tasks, in this case from a well-known image classification task to our segmentation task. Ng and Hofmann (2018) initialised their model with weights pre-trained on the ImageNet dataset (He et al., 2016; Deng et al., 2009). Transfer learning can improve performance and reduce the required training resources as the model will have already learned to extract features from images that are generally informative. We initialised the U-Net encoder with a ResNet-50 model pre-trained on the ImageNet dataset (He et al., 2016).
The imagery was broken into 256 × 256 pixel tiles at a scale of 0.0005° to one tile, approximately 47.5 m per tile or 19 cm per pixel, using OpenStreetMap's tiling conventions. We refer to areas labelled with no green roof as negative, and those labelled with green roof as positive. All tiles within the hand-labelled areas were used. In order to ensure that batches would contain positive examples, we over-sampled positive tiles by repetition during training so that they were as prevalent as the fully negative tiles. Tiles were split randomly into training (80%), validation (10%), and testing (10%) sets. The random split was performed separately for positive and fully negative tiles to ensure all splits contained both classes. Separate validation and testing datasets are required because hyperparameter tuning is performed by selecting the hyperparameters that maximise performance on the validation dataset, so a testing dataset is needed to properly estimate out-of-sample performance.
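The class-stratified split and over-sampling by repetition can be sketched as follows (an illustrative sketch only; tile loading, tiling conventions, and the actual training loop are omitted, and the 10/90 tile counts in the toy example are invented):

```python
import random

def stratified_split_and_oversample(tiles, is_positive, seed=0):
    """Split tiles 80/10/10 separately for positive and fully-negative tiles,
    then over-sample positive training tiles by repetition so that they are
    roughly as prevalent as the negative training tiles."""
    rng = random.Random(seed)
    pos = [t for t in tiles if is_positive(t)]
    neg = [t for t in tiles if not is_positive(t)]
    splits = {"train": [], "val": [], "test": []}
    # Stratified 80/10/10 split: performed separately per class so that
    # every split contains both positive and negative tiles.
    for group in (pos, neg):
        rng.shuffle(group)
        n = len(group)
        n_train, n_val = int(0.8 * n), int(0.1 * n)
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]
    # Over-sample positives by repetition, in the training split only.
    train_pos = [t for t in splits["train"] if is_positive(t)]
    train_neg = [t for t in splits["train"] if not is_positive(t)]
    reps = max(1, len(train_neg) // max(1, len(train_pos)))
    splits["train"] = train_neg + train_pos * reps
    return splits

# Toy example: 10 positive and 90 negative tile IDs.
tiles = [("pos", i) for i in range(10)] + [("neg", i) for i in range(90)]
splits = stratified_split_and_oversample(tiles, lambda t: t[0] == "pos")
```

In the toy example, the 8 positive training tiles are each repeated 9 times, so the training set contains 72 positive and 72 negative entries.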
The algorithm was implemented in PyTorch (Paszke et al., 2019). The model was trained using the Adam optimiser (Kingma and Ba, 2014), an optimiser that dynamically adjusts learning rates for each model parameter, making training less dependent on the global learning rate and therefore reducing the required training resources.
Rather than training the model for a fixed number of epochs, we implemented early stopping, i.e. stopping training when validation performance ceases to improve. This reduces the required training resources and can be effective at reducing overfitting. Training was stopped if the mean validation loss in the past five epochs was greater than that of the five epochs before that.
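The stopping criterion can be expressed directly (a sketch consistent with the five-epoch windows stated above; the surrounding training loop is omitted):

```python
def should_stop(val_losses, window=5):
    """Early stopping: return True if the mean validation loss over the most
    recent `window` epochs exceeds the mean over the `window` epochs before
    that, i.e. validation performance has ceased to improve."""
    if len(val_losses) < 2 * window:
        return False  # not enough history yet
    recent = val_losses[-window:]
    previous = val_losses[-2 * window:-window]
    return sum(recent) / window > sum(previous) / window

# Steadily improving losses do not trigger stopping...
assert not should_stop([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
# ...whereas a plateau followed by rising loss does.
assert should_stop([0.5, 0.4, 0.3, 0.3, 0.3, 0.35, 0.4, 0.4, 0.45, 0.5])
```

Comparing windowed means, rather than single epochs, makes the rule robust to epoch-to-epoch noise in the validation loss.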
Hyper-parameter tuning experiments were performed via grid search, and the final selection made based on the validation data. Testing data were not used for training method tuning, and were only processed after the hyper-parameters were finalised.
Cross-entropy, Lovász, and focal loss functions were tested: Lovász loss is intended as a surrogate for the intersection-over-union measure (Berman et al., 2018), whereas focal loss is intended to give greater weight to hard-to-classify examples during training (Lin et al., 2017). Learning rate, loss function, and data augmentation methods were tested. The hyper-parameters tuned, and the hyper-parameter values used for the final classification, are listed in Table A1.
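The down-weighting behaviour of focal loss relative to cross-entropy can be illustrated with a NumPy sketch of the binary case following Lin et al. (2017); the value gamma = 2 here is the one suggested in that paper, not necessarily the value used in this study:

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Per-pixel binary cross-entropy for predicted probabilities p, labels y."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss (Lin et al., 2017): cross-entropy scaled by
    (1 - p_t)**gamma, which down-weights well-classified examples so that
    hard-to-classify examples dominate the gradient."""
    p_t = np.where(y == 1, p, 1 - p)  # probability assigned to the true class
    return (1 - p_t) ** gamma * binary_cross_entropy(p, y)

p = np.array([0.9, 0.6])  # a confident and an uncertain positive prediction
y = np.array([1.0, 1.0])
ce, fl = binary_cross_entropy(p, y), focal_loss(p, y)
# The confident (easy) example is down-weighted by (1-0.9)^2 = 0.01, the
# uncertain (hard) example only by (1-0.6)^2 = 0.16.
```

This is why focal loss is attractive for imbalanced problems: the abundant, easily classified negative pixels contribute little to the total loss.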
A key part of the U-Net methodology is data augmentation, a process wherein distortions or transformations are applied to the training data to increase robustness when training data are scarce. Augmentation can reduce overfitting, wherein a model memorises features of the training dataset that do not generalise out-of-sample (Shorten and Khoshgoftaar, 2019). During training, augmentations were applied to the imagery tiles, and correspondingly to the label masks. We randomly flipped images in both axes, and also applied random 90° rotations, and found that this improved performance. There was no further improvement in performance from applying fully random rotations. We experimented with applying a random 90% crop to the images, with randomly manipulating the colours of the imagery, and with randomly adjusting the sharpness of the image.
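The geometric augmentations described above can be sketched in NumPy; the essential point is that the identical transform must be applied to the image tile and its label mask (an illustrative sketch, not the study's actual augmentation code):

```python
import random
import numpy as np

def random_flip_rotate(image, mask, rng):
    """Randomly flip along each axis and rotate by a random multiple of
    90 degrees, applying exactly the same transform to an image tile and
    its label mask."""
    if rng.random() < 0.5:
        image, mask = np.flip(image, axis=0), np.flip(mask, axis=0)
    if rng.random() < 0.5:
        image, mask = np.flip(image, axis=1), np.flip(mask, axis=1)
    k = rng.randrange(4)  # 0, 90, 180, or 270 degrees
    return np.rot90(image, k), np.rot90(mask, k)

rng = random.Random(42)
image = np.arange(16.0).reshape(4, 4)  # toy single-band "tile"
mask = image > 7.0                     # toy label mask
aug_image, aug_mask = random_flip_rotate(image, mask, rng)

# The geometry changes, but image/mask correspondence is preserved.
assert np.array_equal(aug_mask, aug_image > 7.0)
```

Because a fresh transform is drawn each epoch, the model sees a different view of each tile every time, which is what discourages memorisation.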
Augmentation was applied randomly and independently each training epoch. In previous work, morphological opening and closing have been used on the classification masks as a post-processing step: these are filters that remove small isolated positive areas and fill in small negative areas respectively. We tested these methods with our own models and imagery, but found that morphological opening of the classification masks increased recall but decreased precision, overall decreasing F-score; whereas, morphological closing did not have any substantial effect on F-score.
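The opening and closing filters referred to above can be sketched in NumPy as follows (a 3 × 3 structuring element is assumed here for illustration; the filters tested in this study may have used different parameters):

```python
import numpy as np

def dilate(mask):
    """Binary dilation with a 3x3 structuring element (zero padding)."""
    p = np.pad(mask, 1)
    h, w = mask.shape
    return np.max([p[i:i + h, j:j + w]
                   for i in range(3) for j in range(3)], axis=0)

def erode(mask):
    """Binary erosion with a 3x3 structuring element (one padding)."""
    p = np.pad(mask, 1, constant_values=1)
    h, w = mask.shape
    return np.min([p[i:i + h, j:j + w]
                   for i in range(3) for j in range(3)], axis=0)

# Opening (erode then dilate) removes small isolated positive areas.
noisy = np.zeros((9, 9), dtype=np.uint8)
noisy[1, 1] = 1        # isolated single-pixel detection: likely noise
noisy[3:8, 3:8] = 1    # larger coherent positive region
opened = dilate(erode(noisy))

# Closing (dilate then erode) fills in small negative holes.
holey = np.ones((7, 7), dtype=np.uint8)
holey[3, 3] = 0        # small spurious hole in a positive region
closed = erode(dilate(holey))
```

In this toy example, opening removes the isolated pixel while leaving the 5 × 5 region intact, and closing fills the single-pixel hole.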

Therefore, we decided not to include these post-processing steps in our final classification pipeline.
From the binary masks produced by the segmentation algorithm, we extracted green roof candidate polygons. The intersection was then taken between the candidate polygons and the OS VML building footprints, to remove any candidate polygons that did not intersect with a building footprint. This process helped to reduce the false positive rate because the segmentation algorithm can incorrectly identify ground-level green cover as a green roof. The post-processed segmentation results were spatially joined with the UKBuildings layer in order to identify which individual buildings have green roofs, and so calculate the number of buildings covered.
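The footprint-intersection filter can be illustrated with a deliberately simplified sketch, using axis-aligned rectangles (xmin, ymin, xmax, ymax) in place of the real vector polygon geometry handled by the GIS tools; the coordinates below are invented for illustration:

```python
def intersects(a, b):
    """True if two axis-aligned rectangles (xmin, ymin, xmax, ymax) overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def filter_candidates(candidates, footprints):
    """Keep only candidate green-roof patches that intersect at least one
    building footprint; ground-level vegetation is discarded."""
    return [c for c in candidates if any(intersects(c, f) for f in footprints)]

footprints = [(0, 0, 10, 10), (20, 0, 30, 10)]   # two building footprints
candidates = [(2, 2, 5, 5),                       # on a roof: kept
              (12, 2, 18, 5)]                     # ground-level patch: removed
kept = filter_candidates(candidates, footprints)
```

In practice, each surviving candidate would then be clipped to the footprint geometry and spatially joined to the building records.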

Area estimates
To estimate the area of green roof in each geographic area, the polygons of green roof area identified by the segmentation are spatially overlaid with the polygons of the geographic area. A similar process is used with the building footprints to estimate building footprint area. All area calculations were performed in the OSGB36 / EPSG:27700 coordinate projection.
Area estimates are scaled up by the inverse of the model recall, based on the assumption that a fixed proportion of each green roof is missed by the model. Not doing so would lead to an underestimation of the green roof area.
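This scaling amounts to dividing the detected area by the recall (the detected area of 0.50 km² in the example below is purely illustrative):

```python
def scaled_area_estimate(detected_area, recall):
    """If a fixed proportion of each green roof is missed by the model, then
    detected_area = recall * true_area, so the estimate of the true area is
    the detected area divided by the recall."""
    return detected_area / recall

# Illustrative only: with an area recall of 0.48, a detected 0.50 km^2
# scales up to roughly 1.04 km^2 of estimated green roof.
estimate = scaled_area_estimate(0.50, 0.48)
```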

Segmentation performance
The confusion matrix and performance statistics for green roof identification are given in Tables 3 and 4 for calculations based on area. Tables 5 and 6 give the same statistics but calculated based on building counts.
Results of the hyperparameter search are shown in Table A1. The best performance was achieved by the combination of random flips, random 90° rotations, and randomly reducing the sharpness of the image by a sharpness factor of 0.5. Only small differences were observed between the performance of the different loss functions. We found that the building-intersection step made only a small difference to the performance statistics of the validation and testing data; however, areas with no green roof are under-represented in these datasets. We found that across the whole study area, 20% of predicted green roof area was outside of building footprints and was consequently removed, showing that the building-intersection step plays an important role in suppressing false positives. Table 7 gives estimates for LADs in Inner London and Table 8 for Outer London (see also Table 9). Most (58%) LSOAs contain no green roofs, and the maximum proportion of building footprint area covered by green roofs in any LSOA is 33.8%.

Segmentation Performance
Overall, the segmentation model achieves a high level of accuracy (99.8%). The area precision (81%) and count precision (88%) are high (Tables 4 and 6), meaning that we can be confident that identified green roofs are real.

Given that the survey covers such a large and diverse area, and the green roof fraction is low in many areas, it is important to consider the false positive rate. Tables 3 and 5 show that we expect just 0.08% of buildings to be incorrectly identified as having a green roof, or 0.02% of the built area.
Area recall is somewhat low at 0.48, whereas the count recall is higher at 0.875. Inspection of the results showed that many pixels classified as false positives and false negatives occur at the edges of labelled green roofs. This indicates that the dataset is good at identifying whether a building has a green roof, but tends to underestimate the area.
There is a substantial difference in area recall between the training and testing datasets (Table 4), which indicates overfitting: due to the high diversity in the appearance of green roofs, the model struggles to generalise the appearance of green roofs to unseen areas. However, area precision is consistently high. Together, this suggests that the training data contained a good quantity of negative examples, but would benefit from greater diversity in the positive examples. This could be improved by labelling more data, thus increasing the size of the training dataset. In our experimentation with augmentation methods, we found that augmentation substantially decreased the difference in F-score between the training and validation datasets. It may be possible to reduce the performance difference using additional methods of data augmentation or model regularisation.
The IoU score for counts of buildings (0.43, Table 6) of our segmentation model is higher than that reported in Wu and Biljecki (2021) (0.396, see their Sect. 3.2.2). This could be because we used higher quality imagery and focused on a single city, although because the two studies cover different cities it is not a like-for-like comparison. Wu and Biljecki (2021) covered a total of 2,217 km² across 12 cities, with the largest being 302 km² in Las Vegas, Nevada; our survey covered 1,463 km², making ours the largest survey of green roofs in a single city.
This method can be applied to other cities, and we have explored how the segmentation methods can be improved. We found that including a large number of negative examples and over-sampling positive examples was very effective at suppressing the false positive rate in unseen areas, and we would recommend this approach in general. Furthermore, we found small improvements in performance from augmenting the training dataset with random adjustments in sharpness, but this will not necessarily be effective with different imagery.

As is clear from this study, automatic methods are scalable, allowing large areas to be surveyed and monitored; however, they have limitations. Green roofs can only be identified by this method if they are visible in the imagery, and small areas of vegetation (that is, not visible at 25 cm pixel size) are necessarily left out. Hand labelling also has limitations; there are cases where it is difficult to determine visually from the imagery whether a building has a green roof, or where the edge of the green roof lies. We found that there was a higher rate of false positives east of OSGB36 Easting 5.5 × 10⁵ m. This is because the aerial imagery there was too different from the hand-labelled areas, having been collected in a different year and possibly with a different instrument. Data in this area were excluded from the dataset, affecting parts of Bromley, Havering, Bexley, and Barking and Dagenham.
While performance was generally good as measured by the performance metrics, we collected some examples of poor classification performance, which are shown in Figure 6. Many false positives were observed where part of a green roof was correctly identified but another part nearby was misclassified, exemplified by Figure 6 A and B. It could be that relatively small variations in colour lead to the misclassification, but we found that random augmentations in brightness and hue did not improve performance. Areas in shadow in the imagery are generally poorly classified (for example in Figure 6 C). This could be because the shapes and colours are simply less distinct in shadow, but there are also few examples of this to learn from in the training data. Multi-spectral imagery could help deal with shadows and variations in vegetation colour. However, multi-spectral aerial imagery is collected more rarely and is less available; satellite multi-spectral imagery is available but its resolution is poorer.
Therefore, visible-spectrum aerial imagery has some practical advantages over multi-spectral imagery. Combining layers of multi-spectral imagery at lower resolution with aerial imagery is technically challenging, but could be effective for this task.
We have not attempted to separate different types of green roof (e.g. intensive, extensive, roof gardens). While types of plant may be differentiated to some extent in aerial imagery, important features like depth of substrate cannot. Some green roofs may be in poor condition from lack of water, and there may be cases of fake turf or other imitation vegetation being detected as green roofs: both of these could be better identified using multi-spectral imagery.
While relatively high-resolution satellite imagery is available covering most cities in the world, it is generally not of as high quality as the aerial imagery available in London; therefore, the same method applied to other cities may yield worse performance.
Performance of the building-intersection step is reliant on the alignment of the building footprints with the imagery. The OS building footprints are very accurate, especially for identifying courtyards within building footprints. We found that alignment with other imagery sets, and with other building footprint sources, was not as reliable. However, OS maps are only available in Great Britain, as opposed to OpenStreetMap, which has more global coverage. Our results broadly agree with previous estimates for the LADs with the greatest green roof area (e.g. Tower Hamlets, Greenwich), although a slightly smaller area of green roofs is identified by our study in Tower Hamlets.
Examining the GLA's geospatial data (which is only public for the CAZ) (Greater London Authority, 2014) and infographics (Livingroofs Enterprises Ltd, 2019), we see multiple instances of ground-level parks being incorrectly identified as green roofs (e.g. Finsbury Square in Islington, Figure A2). Making use of the building footprint data enables us to avoid such misclassifications. There is also disagreement for the Barbican Centre (Figure A3), of which the full area is counted as a roof by the GLA results: this is a difficult edge-case, as the OS building footprints do not include the full area of the complex as a building. Over the CAZ, we find that 4% of the area of the Greater London Authority (2014) dataset does not intersect with OS building footprints. It also appears that in the GLA's geospatial data, an area slightly larger than the vegetation is usually selected, which may be due to the resolution of the input data. This demonstrates the utility of ensuring the coincidence of identified green roof patches with building footprints.

Wu and Biljecki (2021) report that the proportion of buildings by area which have a green roof is 41.6% in Zurich, 24.8% in Berlin, and 17.2% in New York (London was not included in their survey). Comparing this with the results in Tables 7, 8, and 9, we see that the City of London ranks between Berlin and New York at 21.0%. This method of ranking is sensitive to the geographic area included in the calculation if the concentration of green roofs varies between districts within a city. Furthermore, given our interest in rooftop vegetation as a climate adaptation strategy, the actual amount of vegetation seems more relevant than the total area of the building.

Distribution of green roof areas
As shown in Table 7, although larger total areas of green roof are present in some LADs, the City of London is unique within Greater London for its relatively high concentration of green roofs. High concentrations of green roofs are also seen in the former dockland areas in Tower Hamlets, Newham and Greenwich, as well as in Stratford around the Olympic Park and the Kidbrooke development in Greenwich.
The distribution of green-roofed buildings within LADs is heterogeneous (see Figure 5). When LSOAs stand out as having relatively high green roof coverage, it is often due to a single large building or a cluster of buildings with green roofs.

Despite having the highest green roof coverage of the LADs, only 3.2% of the building footprint of the City of London is covered by green roofs. The City of London has very low amounts of green cover generally, so it is consistent with policy (e.g. Greater London Authority, 2021b, Policy G5) that green roofs would be adopted there. However, the LRW 2008 report (Design for London et al., 2008) found that 32% of roof area in the City of London could be suitable for retrofitting with green roofs, so the current status is a long way from that proposal. As the dataset identifies individual buildings, in future work we will explore what kinds of buildings, and what areas, have adopted green roofs.
Given that the area of vegetation in the City of London is overall quite low, it is possible that existing green roof coverage is making a difference to the thermal environment: a possibility that we will explore in an urban climate modelling study enabled by these data.
In this study, we produced a survey of green roofs in London using automatic segmentation of aerial imagery. The resulting geospatial dataset is made available for further research. We identified areas which have a high prevalence of green roofs, especially the City of London, the former docklands in Tower Hamlets and Newham, and the Olympic Park in Stratford. We highlighted some of the difficulties of producing such a dataset, especially that the low prevalence of green roofs makes the classification problem highly imbalanced, which can create problems for machine-learning algorithms. Furthermore, we demonstrated the importance of excluding ground-level vegetation from surveys of green roofs by ensuring that areas classified as green roofs are coincident with building footprints.
In future work, we will use this geospatial dataset to further explore the characteristics and uses of buildings and neighbourhoods which have green roofs as well as those with potential for more green infrastructure, and to quantify the thermal effects of green roofs on London's micro-climate.