Mapping photovoltaic power plants in China using Landsat, Random Forest, and Google Earth Engine

Photovoltaic (PV) technology, an efficient solution for mitigating the impacts of climate change, is increasingly used across the world to replace fossil-fuel power and minimize greenhouse gas emissions. With the world's highest cumulative and fastest-growing PV capacity, China needs to assess the environmental and social impacts of its established PV power plants. However, a comprehensive map of the locations and extent of PV power plants remains scarce at the country scale. This study developed a workflow combining machine learning and visual interpretation methods with big satellite data to map PV power plants across China. We applied a pixel-based Random Forest (RF) model to classify PV power plants from composite images in 2020 with 30-meter spatial resolution on Google Earth Engine (GEE). The resulting classification map was further improved by visual interpretation. Eventually, we established a map of PV power plants in China by 2020, covering a total area of 2917 km². Based on the derived national PV map, we found that most PV power plants were sited on cropland, followed by barren land and grassland. In addition, the installation of PV power plants has generally decreased the vegetation cover. This new dataset is expected to be conducive to policy management, environmental assessment, and further classification of PV power plants. The dataset of photovoltaic power plant distribution


Introduction
Solar power is the most available renewable energy source, with great potential to replace fossil fuels, reduce greenhouse gas (GHG) emissions, and mitigate climate change (Nemet, 2009; Creutzig et al., 2017). Photovoltaic (PV) technology converts solar energy directly into electricity with large PV arrays. With the development of PV technology and the decline in the cost of PV power generation in recent years, the number of PV power plants has risen rapidly (Zou et al., 2017). China's PV industry leads the world in both cumulative and newly installed capacity. According to the National Energy Administration of China, the cumulative installed capacity of PV power in China had reached 253 gigawatts (GW) by the end of 2020, with 48.2 GW newly installed in 2020. As China aims to reach a carbon emissions peak before 2030 and carbon neutrality before 2060, PV power generation is expected to keep growing rapidly across China. Because the development of PV power plants requires a large amount of land (Capellán-Pérez et al., 2017), knowing the distribution of PV power plants is crucial for evaluating their eco-environmental effects and predicting their power generation in China (Taha, 2013; Hernandez et al., 2014; Hernandez et al., 2015; Li et al., 2018; Grodsky and Hernandez, 2020).
However, data on the distribution of PV power plants remain scarce in China, which greatly hinders national policy management and environmental assessment of PV power plants.
Remote sensing techniques can acquire features of different ground objects from images in spectral, temporal, and spatial dimensions globally (Zhu et al., 2012). A few studies have mapped PV panels or power plants using manual annotation (Dunnett et al., 2020) and machine learning methods with various remote sensing imagery (Malof et al., 2016a; Malof et al., 2016b; Malof et al., 2017; Zhang et al., 2021b). Machine learning algorithms can classify ground features with high accuracy by incorporating various input predictor data from remote sensing imagery without making assumptions about the data distribution (Maxwell et al., 2018). While machine learning methods have improved the efficiency of identifying PV power plants, mapping PV power plants on a continental scale is still challenging because it is limited by computing resources and by accuracy in complex environments.
Training an applicable machine learning model requires a massive number of labelled training samples to cover as much of the system parameter space as possible. PV power plants are built in various landscapes, including deserts, mountains, coasts, and lakes (Sahu et al., 2016; Al Garni and Awasthi, 2017; Hammoud et al., 2019). The limited labelled data are insufficient to cover most of the spectral parameter space of PV power plants in complicated geographical environments. Thus, machine learning models will unavoidably produce misclassifications when identifying PV power plants. Especially on a continental scale, the model's inaccuracy will lead to many misclassified PV areas because the background non-PV area is thousands of times larger than the actual PV area. Since PV power plants do not change over a short time, visual interpretation provides a potential way to filter out misclassifications from machine learning results.
Deep learning models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and residual networks (ResNet) (He et al., 2016; Schmidhuber, 2015; Krizhevsky et al., 2012), have also been applied to map PV power plants in the United States (Yu et al., 2018), China (Hou et al., 2019), and worldwide (Kruitwagen et al., 2021). As a branch of machine learning, deep learning is characterized by neural networks (NNs) with several to hundreds of layers that exploit feature representations learned exclusively from data. Deep learning models can accurately identify PV power plants from remote sensing data by extracting in-depth information without hand-crafted features, but these tasks need extensive computation resources. For example, Kruitwagen et al. (2021) used deep learning models and over 10^6 CPU-hours, 20000 GPU-hours, 71 MWh, and approximately two months of real time to map PV power plants worldwide with remote sensing imagery. In addition, these tasks usually require extra storage resources to hold an enormous amount of remote sensing imagery. As a result, updating or modifying such deep-learning-derived PV maps for a region of interest such as China is infeasible for researchers in most countries who lack access to supercomputing facilities.
Cloud computing platforms facilitate classification tasks on a global scale with shared data and computing resources.
Google Earth Engine (GEE) is a cloud geospatial computing platform that provides free access to petabytes of remote sensing data, multiple machine learning algorithms, and shared computing resources (Gorelick et al., 2017). With GEE's support, researchers in the remote sensing community have completed numerous classification works on a planetary scale (Deines et al., 2019; Li et al., 2019; Gong et al., 2019; Xie et al., 2019; Gong et al., 2020; Mao et al., 2021).
In this study, we integrated the advantages of cloud computing, machine learning, and visual interpretation to map the PV power plants in China in 2020. We used GEE to obtain a preliminary classification from Landsat-8 imagery using a random forest model. We further refined the classified results by visual interpretation. Based on the final filtered result, we also investigated the statistics of the PV power plants within different climatic and geographic areas. The proposed approach is easy to repeat, and the result will help future policymaking and environmental assessment for PV power facilities. The large number of labelled PV power plant samples across China derived from visual interpretation could offer valuable data for future studies to update and improve maps of PV power plants.
In summary, the objectives of this study are to (1) build a workflow to map the PV power plants on a continental scale with Landsat imagery on GEE; (2) produce a fine-resolution map of PV power plants in China; and (3) analyse the distribution characteristics of PV power plants in China.

Landsat-8 surface reflectance imagery
This study used the Landsat-8 (L-8) surface reflectance (SR) product with a 30 m spatial resolution. The L-8 product has been atmospherically and topographically corrected and is accessible on GEE. We removed the pixels contaminated by clouds and shadows in each image using the pixel quality control bands. We then composited the L-8 images using the median value of the six reflective bands over a specific period. The composite image is robust against extreme values and provides enough information about the particular period (Flood, 2013). We composited the images of autumn 2020 (September to November) and of the whole year 2020 (January to December) over China, respectively. The autumn composite (C1) has the advantage of fewer clouds, less snow, and less vegetation in China than images from other seasons. The whole-year composite (C2) was built from nearly four times as many images as C1, so C2 is less affected by contaminated pixels than C1 but is less timely. Therefore, we used C2 as a substitute in the regions where the quality of C1 was poor.
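The compositing logic can be illustrated offline with a minimal pure-Python sketch for a single pixel and band. It is not the GEE code used in this study: the function names and the toy reflectance values are hypothetical, and "poor quality" is simplified here to "no cloud-free observation in the autumn stack", which is an assumption.

```python
from statistics import median


def median_composite(observations):
    """Median of the cloud-free values for one pixel and one band.

    observations: list of (reflectance, is_clear) pairs, where is_clear is
    False for observations flagged as cloud or shadow.
    Returns None when no clear observation exists.
    """
    clear = [value for value, is_clear in observations if is_clear]
    return median(clear) if clear else None


def composite_with_fallback(autumn_obs, annual_obs):
    """Use the autumn composite, falling back to the annual composite.

    Assumption: the fallback is triggered only when the autumn stack has
    no clear observation at all.
    """
    autumn_value = median_composite(autumn_obs)
    if autumn_value is not None:
        return autumn_value
    return median_composite(annual_obs)


# Toy example: one cloudy observation (0.90) is ignored by the median.
autumn = [(0.12, True), (0.90, False), (0.10, True), (0.11, True)]
print(median_composite(autumn))
```

Because the median ignores the ordering of outliers, a single unmasked bright or dark value has little effect on the composite, which is the robustness property the text refers to.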

Random forest classification
We used a pixel-based random forest (RF) algorithm on GEE to map the PV power plants over China (Zhang et al., 2021b). The RF classifier is an ensemble classifier that uses a set of decision trees for classification or regression, with the advantages of high precision, efficiency, and stability (Belgiu and Drăguţ, 2016). The RF classifier has also been shown to outperform other machine learning classifiers on GEE for mapping rangelands and croplands (Phalke et al., 2020). For the RF classifier, we set the number of trees to 500 and left the rest of the parameters at GEE's defaults. Compared with object-based classification, pixel-based classification uses pixels at the raw resolution and does not require further segmentation of the classified image.
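To make the ensemble idea concrete, the following toy sketch trains bagged one-split decision trees (stumps) and classifies by majority vote. It is a deliberately simplified stand-in for a random forest, not GEE's implementation: real RF trees are deeper and also subsample features at each split. All function names and the toy samples are hypothetical.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a one-split tree on (feature_vector, label) pairs.

    Returns (feature_index, threshold, left_label, right_label), chosen to
    minimise the number of misclassified training samples.
    """
    best = None
    for f in range(len(samples[0][0])):
        values = sorted({x[f] for x, _ in samples})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for x, y in samples if x[f] <= threshold]
            right = [y for x, y in samples if x[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # All feature values are identical: always predict the majority label.
        label = Counter(y for _, y in samples).most_common(1)[0][0]
        return (0, float("inf"), label, label)
    return best[1:]


def fit_forest(samples, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample of the training samples."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(samples) for _ in samples])
            for _ in range(n_trees)]


def predict(forest, x):
    """Majority vote over all trees in the ensemble."""
    votes = Counter(left if x[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]
```

In this sketch `n_trees` plays the role of the number of trees, which was set to 500 in the study; the default of 25 is only for a quick illustration.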

Training and validation samples
The RF classifier is sensitive to the sampling design (Belgiu and Drăguţ, 2016). Suitable training samples are crucial for an RF model's classification accuracy and stable performance. We collected and labelled samples as PV and non-PV regions, abbreviated as PV and NPV, respectively. We primarily collected the PV samples from Dunnett's dataset, a global solar plants dataset annotated by volunteers (Dunnett et al., 2020). The total area of the PV power plants in China is about 897 km² in Dunnett's dataset.
We manually modified this dataset against Google Earth imagery to ensure that the PV samples are located inside the PV power plants. We found that Dunnett's dataset contains few labelled PV power plants in eastern China, which would limit our model's ability to identify PV power plants in similar areas. Using high-resolution Google Earth images from 2017, we therefore enriched the training dataset by manually selecting and labelling PV power plants over regions of eastern China.
The improved training dataset aims to ensure that the labelled data cover most of the parameter space of PV power plants in China. We stored all the PV samples as polygon vectors. The area of the modified labelled PV polygons was 1121 km². We randomly sampled points within the polygons with a balanced quantity from humid and arid regions (Fig. 1).
We collected the NPV samples from three sources: regions adjacent to the PV power plants within 5-kilometer buffers, manually selected typical land types, and the whole of China. We prepared 20000 points labelled as PV and 50000 points labelled as NPV in this study. Finally, after filtering out the low-quality pixels, we randomly chose 75% of the total points as the training set and the remaining 25% as the validation set (Table 1).
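The random 75%/25% partition can be sketched as follows. The function name, the fixed seed, and the rounding rule are illustrative assumptions; the source does not state how the random split was implemented.

```python
import random


def split_samples(points, train_fraction=0.75, seed=42):
    """Randomly split labelled points into training and validation sets.

    A fixed seed (an assumption, for reproducibility) makes the split
    repeatable across runs.
    """
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy example with 1000 point identifiers instead of real samples.
train, validation = split_samples(range(1000))
print(len(train), len(validation))  # 750 250
```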

Calculation of variables
We derived nine variables from the Landsat-8 SR imagery, including six original bands and three calculated indices (Zhang et al., 2021b). We used these variables to train machine learning models to distinguish the PV and NPV regions. The six original bands were blue (B2), green (B3), red (B4), near-infrared (B5), and two shortwave infrared bands (B6 and B7) from the L-8 images. The three indices were the Normalized Difference Vegetation Index (NDVI) (Tucker, 1979), the Normalized Difference Built-up Index (NDBI) (Zha et al., 2003), and the Modified Normalized Difference Water Index (MNDWI) (Xu, 2006).
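The three indices are normalized differences of band pairs. The sketch below uses the standard definitions mapped onto Landsat-8 bands; the choice of B6 as the shortwave infrared band for NDBI and MNDWI is an assumption, since the text does not state which of the two SWIR bands was used, and the function names and reflectance values are hypothetical.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), returning 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def spectral_indices(bands):
    """Compute NDVI, NDBI and MNDWI from a dict of Landsat-8 SR bands."""
    return {
        "NDVI": normalized_difference(bands["B5"], bands["B4"]),   # NIR vs red
        "NDBI": normalized_difference(bands["B6"], bands["B5"]),   # SWIR1 vs NIR
        "MNDWI": normalized_difference(bands["B3"], bands["B6"]),  # green vs SWIR1
    }


def feature_vector(bands):
    """Nine predictors: six reflective bands followed by three indices."""
    indices = spectral_indices(bands)
    order = ["B2", "B3", "B4", "B5", "B6", "B7"]
    return [bands[name] for name in order] + [
        indices["NDVI"], indices["NDBI"], indices["MNDWI"]
    ]


# Toy surface reflectance values for one pixel.
pixel = {"B2": 0.05, "B3": 0.08, "B4": 0.10, "B5": 0.30, "B6": 0.20, "B7": 0.15}
print(spectral_indices(pixel))
```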

Classification accuracy assessment
We evaluated the pixel-based RF model using the validation set. From the confusion matrix of classified versus labelled points in the validation set, we calculated the kappa coefficient, overall accuracy, producer's accuracy, and user's accuracy to assess the model's performance (Congalton, 1991). The kappa coefficient is widely used to check consistency and evaluate model performance. The overall accuracy measures the overall efficacy of the model. The producer's accuracy is the proportion of reference samples of a class that are correctly classified as that class. The user's accuracy is the proportion of samples classified as a class on the map that truly belong to that class.
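For a two-class problem, these four metrics follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions; the function name, the argument names, and the counts in the example are hypothetical and are not taken from Table 3.

```python
def accuracy_metrics(tp, fn, fp, tn):
    """Accuracy metrics for a PV / NPV confusion matrix.

    tp: reference PV classified as PV      fn: reference PV classified as NPV
    fp: reference NPV classified as PV     tn: reference NPV classified as NPV
    """
    n = tp + fn + fp + tn
    overall = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (overall - expected) / (1 - expected)
    return {
        "overall_accuracy": overall,
        "kappa": kappa,
        "producers_accuracy_pv": tp / (tp + fn),
        "users_accuracy_pv": tp / (tp + fp),
        "producers_accuracy_npv": tn / (tn + fp),
        "users_accuracy_npv": tn / (tn + fn),
    }


# Hypothetical validation counts.
metrics = accuracy_metrics(tp=90, fn=10, fp=5, tn=95)
print(round(metrics["overall_accuracy"], 3), round(metrics["kappa"], 3))  # 0.925 0.85
```

Producer's accuracy is the complement of the omission error, and user's accuracy is the complement of the commission error, which is why both are reported per class.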

Filter and morphological operations
After applying the RF classification, we obtained pixels categorized as PV and NPV regions over the whole of China. We then filtered the pixels by topography. PV power plants are not suited to steep or shaded slopes (Al Garni and Awasthi, 2017; Aydin et al., 2013). We calculated slope and hillshade from the Shuttle Radar Topography Mission (SRTM) data with 30 m spatial resolution (Farr et al., 2007), setting the azimuth to 180° and the elevation angle to 45° for the hillshade. We removed the pixels where the slope was over 30° and the hillshade value was less than 150.
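The topographic rule can be written as a per-pixel predicate. The wording in the text leaves open whether a pixel is dropped when either condition holds or only when both hold; the sketch below defaults to the "either" reading, which is an assumption, and exposes a hypothetical `require_both` flag for the literal conjunctive reading. The function name is also hypothetical.

```python
def keep_pixel(slope_deg, hillshade, max_slope_deg=30.0, min_hillshade=150,
               require_both=False):
    """Return True when a classified PV pixel passes the topographic filter.

    slope_deg: terrain slope in degrees.
    hillshade: illumination value (0 to 255) for the chosen sun position.
    require_both: if True, remove a pixel only when it is both too steep
    and too shaded; if False (assumed default), remove it when either holds.
    """
    too_steep = slope_deg > max_slope_deg
    too_shaded = hillshade < min_hillshade
    if require_both:
        remove = too_steep and too_shaded
    else:
        remove = too_steep or too_shaded
    return not remove


print(keep_pixel(10, 200))   # True: gentle, well-lit slope
print(keep_pixel(35, 100))   # False: steep and shaded
```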
In pixel-based classification, sudden disturbances in the image signal, different objects with the same spectrum, or the same objects with different spectra can cause salt-and-pepper noise (i.e., impulse noise), which appears as image speckles. To reduce this noise, we removed classified PV pixels that belonged to connected clusters of fewer than 9 pixels.
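One way to implement such a speckle filter is to label connected clusters and drop the small ones. The sketch below assumes that the threshold of 9 refers to the size of an 8-connected cluster of PV pixels; this interpretation, the function name, and the toy mask are assumptions, and the study's GEE implementation may differ.

```python
from collections import deque


def remove_small_clusters(mask, min_pixels=9):
    """Set to 0 every 8-connected cluster of 1s smaller than min_pixels.

    mask: list of rows, each a list of 0 (NPV) or 1 (PV).
    Returns a new mask; the input is not modified.
    """
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected cluster.
            cluster, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if ((dy or dx) and 0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(cluster) < min_pixels:
                for y, x in cluster:
                    out[y][x] = 0
    return out


# A 3 x 3 block (9 pixels) is kept; the isolated pixel is removed.
mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
cleaned = remove_small_clusters(mask)
```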
Additionally, pixels at the edges of PV power plants are often mixed with roads or other PV facilities and are therefore not categorized as PV, although they should be part of the PV power plants. We therefore used morphological operations on the GEE platform to dilate the PV pixel clusters. The morphological operations consisted of one round of a max filter and one round of a mode filter, each with a circular kernel of one-pixel radius, to conduct spatial filtering.
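The effect of one max filter followed by one mode filter can be sketched on a small binary mask. The footprint below is plus-shaped (the centre pixel and its four direct neighbours), which is an assumed discretization of a circular kernel of one-pixel radius; the tie-breaking rule in the mode filter and all function names are also assumptions.

```python
from collections import Counter

# Offsets within a distance of one pixel from the centre (assumed footprint).
KERNEL = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def _neighbourhood(mask, r, c):
    """Values under the kernel, ignoring offsets that fall outside the mask."""
    rows, cols = len(mask), len(mask[0])
    return [mask[r + dy][c + dx] for dy, dx in KERNEL
            if 0 <= r + dy < rows and 0 <= c + dx < cols]


def max_filter(mask):
    """Dilation: a pixel becomes 1 if any pixel under the kernel is 1."""
    return [[max(_neighbourhood(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def mode_filter(mask):
    """Replace each pixel by the most common value under the kernel.

    Ties (possible only at the mask border) keep the original pixel value.
    """
    out = []
    for r in range(len(mask)):
        row = []
        for c in range(len(mask[0])):
            counts = Counter(_neighbourhood(mask, r, c))
            top = max(counts.values())
            winners = [value for value, n in counts.items() if n == top]
            row.append(mask[r][c] if mask[r][c] in winners else winners[0])
        out.append(row)
    return out


def dilate_and_smooth(mask):
    """One round of the max filter followed by one round of the mode filter."""
    return mode_filter(max_filter(mask))
```

In this sketch a compact block of PV pixels grows by one pixel along its edges, while an isolated pixel that is dilated into a thin cross is largely removed again by the mode filter.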

Visual interpretation
We further converted the clusters of PV pixels into polygon vectors on GEE. We used visual interpretation to check all polygons categorized as PV power plants by the RF model. To keep the visual interpretation workload manageable, we calculated each polygon's area and removed the polygons smaller than 0.04 km², which equals about 45 adjacent pixels. According to Kruitwagen's dataset, PV power plants over 0.04 km² account for 94.2 percent of the total area of PV power plants in China (Kruitwagen et al., 2021).
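The correspondence between the area threshold and the pixel count follows from the 30 m pixel size: one pixel covers 900 m², so 0.04 km² corresponds to about 44.4 pixels, which rounds up to 45. A short check, with a hypothetical helper name:

```python
import math


def min_pixels_for_area(area_km2, pixel_size_m=30.0):
    """Smallest whole number of pixels whose total area reaches area_km2."""
    pixel_area_km2 = (pixel_size_m / 1000.0) ** 2  # 30 m pixel = 0.0009 km2
    return math.ceil(area_km2 / pixel_area_km2)


print(min_pixels_for_area(0.04))  # 45
```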
Using QGIS (http://www.qgis.org/) and the GEE plugin (https://gee-community.github.io/qgis-earthengine-plugin/), we filtered the PV polygons by visual interpretation based on their size, shape, color, and texture, using true-color background images from Landsat-8, Sentinel-2, and Google Earth (Fig. 2). We first collected the PV power plants from the classified result of CS1, which represents the image in autumn of 2020, and then collected the PV power plants from the result of CS2 in regions where clouds still contaminated CS1.

Dataset organization and statistical analysis
The workflow of this study is shown in Fig. 3, and example regions containing PV power plants illustrate the changes at each step (Fig. 4). We built a dataset of PV power plants in China and stored the PV power plants as polygon objects in shapefile format (Falge et al., 2017). Since PV power plants are not entirely adjacent, we grouped the PV power plants within 10 kilometers of each other for further analysis. We calculated the area, average elevation, annual mean air temperature, cumulative yearly precipitation, population density, annual mean enhanced vegetation index (EVI), and land cover type for each PV power plant (Table 2). All the datasets are available on GEE.
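The grouping step can be sketched as single-linkage clustering with a 10 km threshold. The source does not state how the distance between plants was measured; the sketch below assumes great-circle distance between polygon centroids, and the function names and coordinates are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def group_plants(centroids, max_km=10.0):
    """Assign a group id to each (lat, lon) centroid.

    Two plants share a group when they are linked by a chain of plants
    in which consecutive members are at most max_km apart.
    """
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if haversine_km(*centroids[i], *centroids[j]) <= max_km:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive ids in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(centroids))]


# Three plants about 5.6 km apart in a row, and one distant plant.
plants = [(36.00, 100.0), (36.05, 100.0), (36.10, 100.0), (40.0, 110.0)]
print(group_plants(plants))  # [0, 0, 0, 1]
```

Note that single linkage chains plants together: the first and third plants in the example are more than 10 km apart but end up in the same group through the second one.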

Result
The map of the distribution of PV power plants in China is shown in Fig. 5a. The total area of PV power plants mapped in this study was 2917 km² by the end of autumn 2020. In the machine learning classification process, the model trained on the CS1 dataset gave results comparable to the model trained on the CS2 dataset (Table 3). The kappa coefficient (kappa), overall accuracy (OA), user's accuracy (UA) of PV and non-PV (NPV), and producer's accuracy (PA) of PV and NPV were 0.878, 95.04%, 95.51%, 93.82%, 97.59%, and 88.83% for CS1, respectively. The corresponding values for CS2 were 0.886, 95.39%, 95.961%, 93.89%, 97.62%, and 89.89% (Table 3).
The top three provinces for installing PV power plants were Qinghai, Xinjiang, and Inner Mongolia (Fig. 5b). The land cover analysis showed that most PV power plants were sited on cropland, followed by barren land and grassland (Fig. 5c).
We further summarized the distribution of PV power plants by temperature, precipitation, elevation, population density, and location. Many PV power plants are located in China's arid and alpine regions, where solar energy resources are plentiful, precipitation is low, vegetation is sparse, population density is low, and elevation is relatively high (Fig. 6). Additionally, some PV power plants are located in the industrially developed eastern coastal provinces of China, where precipitation is high, population density is high, and elevation is low. This distribution shows two tendencies in China's site selection of PV power plants. One is to install in areas with suitable natural conditions but less power demand. The other is to install in areas with more local energy demand.
Installation of PV power plants affects the local vegetation under different climate conditions (Zhang and Xu, 2020; Nghiem et al., 2019; Liu et al., 2019). We calculated and compared each PV power plant's annual mean EVI (larger than 0) in 2013 and 2020 from Landsat-8 images. According to the National Energy Administration of China, the cumulative installed PV capacity was 19.4 GW by 2013 and 252.8 GW by 2020, which indicates that over 92% of the PV capacity was installed after 2013. The EVI values of PV power plants in 2020 were strongly and positively related to the EVI values in 2013: the area-weighted linear regression (p < 0.01) had an estimated slope of 0.594 and an intercept of 0.0312 (Fig. 7). From the linear regression result, we found that the installation of PV power plants generally decreased the EVI in regions of high vegetation cover. By contrast, in the hyper-arid regions, where EVI was less than 0.07, the installation of PV power plants slightly increased the EVI values.
In this study, we have successfully established a dataset of PV power plants in China as of 2020, with a total area of 2917 km². To our knowledge, our dataset is the latest and most complete public dataset of the spatial extent of PV power plants in China. Our method integrates the efficiency of machine learning and the accuracy of visual interpretation.
The two pixel-wise RF models performed well, with producer's accuracy over 88% and overall accuracy over 95%. PV power plants are a mixture of PV panels and the land they occupy, which often makes mapping them challenging. PV power plants are likely to have spectral features similar to those of other objects, such as plastic-covered sheds and biological soil crust. PV power plants in different regions have different PV panel spacing and tilt angles due to the sunlight incident angle and terrain, which could cause spectral variability (Yadav and Chandel, 2013; Ji et al., 2021). Nevertheless, there are still some omission errors in the RF classification result. Misclassified PV regions that are sporadically distributed within the PV power plants will not affect the results of the morphological operations and visual interpretation.
However, some PV power plants with a low density of PV panels would be misclassified as non-PV objects. In particular, PV power plants situated in mountainous areas typically have unusual installation spacing and installation angles for their solar panels. Additionally, the mountainous terrain also affects the reflectance of the PV power plants (Wen et al., 2018). These PV power plants were thus mostly missed in our study but accounted for only a small portion of the total number.
annotation is incomplete, with no guarantee of timely updates in China. We also compared our result with Kruitwagen's dataset (Kruitwagen et al., 2021), which was classified by deep learning methods. The total area of PV power plants in China in Kruitwagen's dataset is 2169.8 km² by 2018, of which 1873.5 km² spatially intersect with our dataset. The PV power plants in Kruitwagen's dataset that do not intersect with our dataset cover 296.3 km²; some of them are too small to be identified by our method, and some are misidentified in Kruitwagen's dataset.
Our dataset could provide training samples for researchers to identify PV power plants in the future. We calculated each PV power plant's geographical and climatic conditions based on the PV map and auxiliary data. PV power plants in China are more likely to be installed either in areas with suitable natural conditions but low power demand or in areas with high local energy demand. We also found that installing PV power plants generally decreases the vegetation cover. Our dataset is conducive to policy management and environmental assessment.

Data availability
The dataset of photovoltaic power plant distribution in China by 2020 is stored in shapefile format and available to the public at https://doi.org/10.5281/zenodo.4552919 (Zhang et al., 2021a).
The dataset of photovoltaic power plant distribution in China by 2020 and the training set are stored in shapefile format and available to the public at https://doi.org/10.5281/zenodo.6849477 (Zhang et al., 2022).