Catchment attributes and meteorology for large sample study in contiguous China

We introduce the first large-scale dataset of catchment attributes and meteorological time series for contiguous China. To develop the dataset, we compiled diverse data sources to generate basin-oriented features describing the catchment characteristics related to hydrological processes. The proposed dataset consists of catchment characteristics, including soil, land cover, climate, topography, and geology, and 29-year meteorological time series (from 1990 to 2018). The meteorological variables include precipitation, temperature, evapotranspiration, wind speed, ground surface temperature, pressure, humidity and sunshine duration. We also derived a daily potential evapotranspiration time series based on a modified Penman's equation. The dataset covers 4875 catchments within contiguous China derived from digital elevation models. We analysed the spatial variations of catchment characteristics and organised them into a series of maps. Correlation analysis between attributes was conducted. Compared to the previously proposed datasets, we derived more catchment characteristics, resulting in 125 attributes that provide a more complete description of the catchments. In addition, we propose Normal-Camels-YR, a hydrological dataset covering 102 basins of the Yellow River basin with normalized streamflow observations. The proposed dataset provides numerous opportunities for comparative hydrological research, such as examining the differences in hydrological behaviour across catchments and building general rainfall-runoff modelling frameworks for many catchments rather than only a few. The dataset is freely available via http://doi.org/10.5281/zenodo.4704017 for community use. We will open-source the accompanying code for generating the dataset so that users can generate meteorological series and catchment attributes for any watershed within contiguous China.


Introduction
Studying a large set of catchments often provides insights that cannot be obtained by looking at a single catchment or only a few (Coron, Andreassian et al. 2012, Newman, Clark et al. 2015, Lane, Coxon et al. 2019). The hydrologic cycle consists of many sub-processes, including evaporation from the ocean, rainfall, interception, surface runoff, and infiltration. Catchment attributes such as soil characteristics, land cover characteristics and climate indices influence water movement and storage in these sub-processes, so hydrologic behaviours can vary across catchments (van Werkhoven, Wagener et al. 2008). A hydrological model developed for one basin may therefore not be applicable in another. However, by examining a large sample of catchments, a hydrological model can learn the similarities and differences of hydrological behaviours across catchments. For example, prediction in ungauged basins is a long-standing challenge in hydrology. The central difficulty is how to extrapolate hydrologic information from gauged basins to ungauged ones. Solving the problem relies on understanding the similarities and differences between catchments, and regionally and temporally imbalanced observations make this harder. For a hydrologic model to successfully simulate ungauged areas, it must adapt to the different hydrologic behaviours present in different catchments. (Kratzert, Klotz et al. 2019) showed that encoding catchment characteristics (e.g., soil characteristics, land cover, topography) into a data-driven model can teach the model to respond differently to the meteorological time series input depending on the static catchment attributes. (Silberstein 2006, Shen, Laloy et al. 2018, Nevo, Anisimov et al. 2019) pointed out that large-sample hydrological datasets are the foundation of many hydrological studies.
https://doi.org/10.5194/essd-2021-71
The term big hydrologic data refers to all data influencing the water cycle, such as meteorological variables, infiltration characteristics of the study area, land use or land cover types, and physical and geological features of the study area. Many studies cannot be carried out without large-scale hydrologic data (Coron, Andreassian et al. 2012, Singh, van Werkhoven et al. 2014, Berghuijs, Aalbers et al. 2017, Gudmundsson, Leonard et al. 2019, Tyralis, Papacharalampous et al. 2019). For hydrological research, basin-oriented large-sample datasets are of great significance. For example, comparative hydrology (de Araújo and González Piedra 2009, Singh, Archfield et al. 2014) focuses on understanding how hydrological processes interact with the ecosystem, in particular how hydrologic behaviours respond to changes in the surface and subsurface of the earth, to determine to what extent hydrological predictions can be transferred from one area to another. Large-sample catchment attribute datasets also provide opportunities for studying interrelationships among catchment attributes. (Seybold, Rothman et al. 2017) studied the correlations of river junction angle with geometric factors, downstream concavity, and aridity. (Oudin, Andréassian et al. 2008) investigated the link between land cover and mean annual streamflow based on 1508 basins representing a large hydroclimatic variety. (Voepel, Ruddell et al. 2011) examined how the interaction of climate and topography influences vegetation response.

Data-driven methods benefit most from large-scale data. Data-driven approaches have shown great potential in various fields, transforming applications in many industries (LeCun, Bengio et al. 2015). However, data-driven methods, especially deep learning-based approaches, usually require large data volumes, and limited data causes the over-fitting problem (Blumer, Ehrenfeucht et al. 1987, Abu-Mostafa, Magdon-Ismail et al. 2012). Big hydrologic data is therefore fundamental to the successful deployment of powerful data-driven strategies. Traditional hydrological models have some long-standing challenges, such as the inability to capture the complexity of hydrological process mechanisms, which is due to the structural limitations of conceptual models. Data-driven methods have been proposed to overcome some of these obstacles. They acquire knowledge from data, shifting the research pattern from hypothesis-driven to data-driven. (Feng, Fang et al. 2020) proposed a flexible data integration method that fuses various types of observations to improve rainfall-runoff modelling. The research shows that combining different data sources benefits predictions in regions with high autocorrelation in streamflow. (Wongso, Nateghi et al. 2020) developed a model predicting state-level, per capita water use in the United States, taking various geographic, climatic, and socioeconomic variables as input. The research also identified key factors associated with high water usage. (Mei, Maggioni et al. 2020) proposed a statistical framework for spatial downscaling to obtain hyper-resolution precipitation data. The results show improvements compared with the original product. (Brodeur, Herman et al. 2020) applied machine learning techniques, namely bootstrap aggregation and cross-validation, to reduce overfitting in reservoir control policy search.
(Ni and Benson 2020) proposed an unsupervised machine learning method to differentiate flow regimes and identify capillary heterogeneity trapping, showing the promise of machine learning methods for analysing large datasets from coreflooding experiments. (Legasa and Gutiérrez 2020) proposed applying Bayesian networks to multisite precipitation occurrence generation. The proposed methodology shows improvements over existing methods.
World-wide data sharing has become a trend (Wickel, Lehner et al. 2007, Ceola, Arheimer et al. 2015, Blume, van Meerveld et al. 2018, Wang, Chen et al. 2020), and the amount of hydrologic data available is ever-increasing. However, these data typically come from different providers and are compiled in various formats. For example, ASTGTM (see footnote 1) provides a global digital elevation model; GLiM (Hartmann and Moosdorf 2012) includes rock type data globally; MODIS provides data products (Knyazikhin 1999, Didan 2015, Myneni, Knyazikhin et al. 2015, Running, Mu et al. 2017, Sulla-Menashe and Friedl 2018) describing features of the land and the atmosphere derived from remote sensing observations; (Yamazaki, Ikeshima et al. 2019) provides a global flow direction map at three arc-second resolution; HydroBASINS (Lehner 2014) provides basin boundaries at different scales globally; GDBD (Masutomi, Inui et al. 2009) provides basin boundaries with geographic attributes; GLHYMPS (Gleeson, Moosdorf et al. 2014) provides a global map of subsurface permeability and porosity; and the SoilGrids250m dataset (Hengl, Mendes de Jesus et al. 2017) provides global numeric soil properties. Local government agencies often hold meteorological data such as precipitation and evaporation, and the amount of this data is also growing; however, data transparency is still a problem (Viglione, Borga et al. 2010). The data mentioned above are rarely spatially aggregated to the catchment scale, making them difficult for researchers to use. Properly pre-processed and formatted large-scale datasets are therefore of great importance for hydrology research. Searching for appropriate data sources, pre-processing, and formatting often consume a lot of researchers' time. In some cases, individual research groups either do not know where to obtain the appropriate data or cannot process the data into the desired format.
In summary, both data-driven and traditional hydrological research need diverse hydrologic datasets to learn to generalise from one area to another. For a model to adapt to various behaviours in different catchments, the dataset must be large enough to represent the complex heterogeneity present in the natural hydrologic system. Although data sharing is being advocated in the community, it is usually difficult for the public to obtain certain data such as meteorological data and streamflow observations, either because there are not enough observations or because there are no open access permissions.

1 https://asterweb.jpl.nasa.gov/gdem.asp

Recently, there have been efforts (Addor, Newman et al. 2017, Alvarez-Garreton, Mendoza et al. 2018, Chagas, Chaffe et al. 2020) to compile different types of data sources into large-scale hydrological datasets. These four collected datasets cover the continental United States, Chile, Brazil, and Great Britain. (Addor, Do et al. 2020) reviewed these datasets and discussed guidelines for producing large-sample hydrological datasets and the limitations of the currently proposed datasets. The CAMELS dataset has been used to support a large body of research. Based on CAMELS, (Kratzert, Klotz et al. 2018) built a Long Short-Term Memory (LSTM) network for rainfall-runoff modelling, showing that one model can predict the discharge for a variety of catchments. (Knoben, Freer et al. 2019) compared metrics used in hydrology based on simulations of many basins. (Tyralis, Papacharalampous et al. 2019) studied the relationship between the shape parameter and basin attributes based on the sizeable basin-oriented dataset.

However, there is no large-scale compilation of hydrological datasets for contiguous China. One alternative is the global-scale HydroATLAS dataset (Linke, Lehner et al. 2019). However, because it is built on a world-wide scale, it lacks many attributes compared with datasets constructed for individual regions and is not built according to the CAMELS standards. Besides, its climatic data is not up to date, and the derivation of the climatic data lacks ground surface observation inputs, so the data quality is not guaranteed. Researchers therefore still need to do repetitive work to compile data from different sources, such as obtaining historical meteorological data (temperature, rainfall, evapotranspiration) for a catchment in contiguous China. Inspired by (Addor, Newman et al. 2017), in this paper we present a catchment-scale hydrologic dataset compiling a wide variety of hydrological data, including basin topography, climate indices, land cover characteristics, soil characteristics and geological characteristics, covering contiguous China.
The proposed dataset is the first to provide catchment meteorological time series and catchment attributes for contiguous China. We compiled and named the dataset following most standards of the previously proposed datasets. Unlike CAMELS and CAMELS-CL, the catchments in the proposed dataset are not selectively chosen. Instead, the dataset consists of all basins generated from the Digital Elevation Model (DEM), based on the Global Drainage Basin Dataset (GDBD) (Masutomi, Inui et al. 2009).
The GDBD is derived at high resolution (100 m-1 km) and has a good geographic agreement with existing global drainage basin data in China (see footnote 2). Besides, an essential feature of the proposed dataset is that it provides a complete description of the catchment, rather than an abstraction. For example, both CAMELS and CAMELS-CL only report the most frequent and second most frequent catchment land cover and lithology types. In contrast, the proposed dataset reports the proportion of each land cover and lithology type for each catchment to better serve data-driven research. We also introduced many more climate and soil characteristics to support more diverse potential research.
Researchers from different places can use the proposed dataset in conjunction with their own streamflow data, which simplifies the usually repetitive work of organising and compiling various data resources. The proposed dataset is, to our knowledge, the most comprehensive catchment attributes and meteorological time series dataset for contiguous China and is suitable for multipurpose data-driven research. The dataset consists of basin boundaries in the shapefile format, computed catchment attributes of climate, land cover, soil, topography and lithology, and 29-year meteorological time series. Table 1 compares the number of static attributes between CAMELS, CAMELS-BR, and the proposed dataset.

The paper is organized as follows: Section 2 describes the study area. Sections 3-7 describe the five classes of computed catchment attributes. In Sections 3-7, each unit follows the same structure: it first introduces the meaning and significance of each added feature and the data source used, then describes the variables' spatial variability if necessary. Section 8 describes the proposed catchment-scale meteorological forcing time series. Section 9 introduces the Normal-Camels-YR dataset, which provides normalized streamflow measurements for 102 catchments of the Yellow River. Section 10 describes the code and data availability. Section 11 presents the concluding remarks.
In summary, our contributions are as follows:
(1) The proposed dataset is the first large-scale dataset containing catchment-scale meteorological time series of contiguous China, which is the basis for many hydrological studies.
(2) We present the first basin-oriented static attributes dataset in contiguous China.
(3) We introduce several new catchment characteristics, providing a more complete description of the catchment than the previously proposed datasets, so that the proposed dataset is prepared for potential hydrological studies.
(4) We offer a self-contained dataset covering 102 basins of the Yellow River basin with normalized runoff observations, supporting many potential studies.

2 In this study, gauge streamflow measurements are not available in areas other than the Yellow River, so it is infeasible to specify a gauge location for generating the basin boundary in most areas. Streamflow measurements are subject to a strict redistribution policy; however, local research institutions hold their own streamflow measurements for hydrological research, and the proposed dataset can be used in conjunction with the streamflow data of researchers in various places.

Study area
The study area corresponds to contiguous China, with diverse climate and terrain characteristics, spanning from 18.2° N to 52.3° N and 76.0° E to 134.3° E. Mountains, plateaus, and hills account for about two-thirds of the area of contiguous China, and the remainder consists of basins and plains. China's topography is like a three-level ladder, high in the west and low in the east. The Qinghai-Tibet Plateau, the highest plateau globally, located in the west of contiguous China, with a mean elevation of over 4000 meters, is the first step of China's topography. The Xinjiang region, the Loess Plateau, the Sichuan Basin, and the Yunnan-Guizhou Plateau to the north and east are the second step of China's topography. The mean elevation here is between 1000 and 2000 meters. Plains and hills dominate the area from east of the Daxinganling-Taihang Mountains to the coastline, the third step of China's topography. The elevation of this step descends to 500-1,000 meters. In contiguous China, precipitation and temperature vary significantly from place to place, forming a diverse climate environment.
According to the Köppen Climate Classification System, from northwest to southeast, China's climate gradually evolves from Cold desert (BWk) climate, Tundra (ET) climate, Warm and temperate continental (Dfa and Dwb) climate to Humid subtropical (Cwa) climate and Warm oceanic (Cfa) climate. From the perspective of temperature zones, there are tropical, subtropical, warm temperate, medium temperate, cold temperate and Qinghai-Tibet Plateau regions, and from the perspective of wet and dry zones there are humid, semi-humid, semiarid, and arid regions. Moreover, the same temperature zone can contain different dry and wet zones, so heat and wetness can differ within the same climate type. The complexity of the terrain makes the climate even more complex and diverse. Besides, a wide range of regions in China are affected by the alternating winter and summer monsoons. Compared with other parts of the world at the same latitude, these areas have low winter temperatures, high summer temperatures, large annual temperature differences, and concentrated precipitation in summer. The cold and dry winter monsoon originates in Asia's interior, far away from the ocean.
Under its influence, winter rainfall in most parts of China is low, accompanied by low temperature. The summer monsoon is warm and humid, coming from the Pacific Ocean and the Indian Ocean. Under its influence, precipitation generally increases.

[...] temperature, relative humidity, precipitation, evaporation, wind speed, sunshine duration, and ground surface temperature. The summary is presented in Table 4. The inverse distance weighting method is used for interpolating the site observations. Climate indices are then obtained by averaging the catchment-scale extraction from the interpolated raster. To ensure data quality, we chose the most recent 29-year record (from 1990 to 2018) to construct the dataset, since the distribution of sites was sparse in the early days (Fig. 2). We computed more climatic characteristics than other datasets (Table 2). These characteristics have potentially critical effects on hydrological processes; for example, wind speed can affect actual evapotranspiration. To be consistent with CAMELS (Addor, Newman et al. 2017), we also determined all climatic attributes (Woods 2009) in the CAMELS dataset. The proposed dataset provides more meteorological variables and longer time series than CAMELS and CAMELS-CL. A summary of the computed climate indices is presented in Table 3. The national distribution of meteorological attributes of catchments is shown in Fig. 3.

The average daily precipitation in contiguous China is highest in the southeast and lowest in the northwest. It is also higher in the coastal areas than in the interior. Ground surface pressure is negatively correlated with elevation: it is lowest in the Qinghai-Tibet Plateau and highest in the Southeast Plain. The average relative humidity is generally positively correlated with precipitation; it is also higher in some forested areas, such as the Taihang Mountains and Daxingan Mountains.
The Qinghai-Tibet Plateau has the lowest average temperature, and the southern coastal area has the highest. A distinctive feature of the distribution of wind speed is the high wind speed in mountainous areas. The highest wind speed occurs in the southeast coastal area (> 6 meters per second). Refer to Section 8 for a detailed description of the proposed catchment-scale meteorological time series dataset of contiguous China.
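
As an illustration of how a daily catchment-mean series can be condensed into static climate indices, the sketch below computes a few CAMELS-style quantities. It is a minimal example, not the code used to build the dataset: the function name, the dictionary keys, the reporting of frequencies as fractions of days, and the sample values are all assumptions made here. The thresholds (5 times the mean daily precipitation for high-precipitation days, 1 mm/day for dry days) follow the usual CAMELS definitions.

```python
import numpy as np


def climate_indices(precip_mm, pet_mm):
    """Summarise daily catchment-mean series into a few CAMELS-style indices."""
    precip = np.asarray(precip_mm, dtype=float)
    pet = np.asarray(pet_mm, dtype=float)
    p_mean = precip.mean()      # mean daily precipitation (mm/day)
    pet_mean = pet.mean()       # mean daily potential evapotranspiration (mm/day)
    return {
        "p_mean": p_mean,
        "pet_mean": pet_mean,
        # Aridity: ratio of mean PET to mean precipitation.
        "aridity": pet_mean / p_mean,
        # Fraction of days with precipitation >= 5 times the mean daily value.
        "high_prec_frac": float(np.mean(precip >= 5.0 * p_mean)),
        # Fraction of dry days, taken here as days with < 1 mm of precipitation.
        "low_prec_frac": float(np.mean(precip < 1.0)),
    }


# Ten made-up days of catchment-mean precipitation and PET (illustrative only).
example = climate_indices([0, 0, 10, 0, 0, 0, 30, 0, 0, 0], [2.0] * 10)
print(example)
```

Applied to the full 1990-2018 record of a catchment, the same reductions would yield one static value per index.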

Geology
To describe the lithological characteristics of each catchment, we used the same two global datasets as CAMELS, Global Lithological Map (GLiM) (Hartmann and Moosdorf 2012) and GLobal HYdrogeology MaPS (GLHYMPS) (Gleeson, Moosdorf et al. 2014). Figure 4 presents the results.
GLiM provides a high-resolution global lithological map assembled from existing regional geological maps; it has been widely used for constructing datasets (e.g. SoilGrids250m (Hengl, Mendes de Jesus et al. 2017)). However, the data quality of GLiM can vary between locations depending on the quality of the original regional geological maps. GLiM consists of three levels: the first level contains 16 lithological classes, and the additional two levels describe more specific lithological characteristics. For contiguous China, the compiled regional data sources (China 1991, Xinjiang 1992 [...]

Compared to CAMELS and CAMELS-CL, one design consideration of the proposed dataset is that it should be better prepared for data-driven research, so we aim to generate as many types of catchment-scale data as possible, since advanced data-driven methods can learn the representation of inputs automatically. To this end, we determined and recorded each lithological class's contribution to the catchment instead of recording just the first and second most frequent classes. GLiM is represented by 1,235,400 polygons; the polygons are converted to raster format for the basin-scale lithological type statistics.
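
The per-class contribution described above can be sketched as a cell count over the rasterised class map inside a catchment. The following is a minimal illustration under stated assumptions: the function name, the toy raster and the boolean catchment mask are hypothetical, classes are encoded as non-negative integers, and all cells are treated as having equal area (which would need an area weighting on a geographic grid).

```python
import numpy as np


def class_fractions(class_raster, catchment_mask, n_classes):
    """Return the fraction of catchment cells falling in each class index."""
    classes = np.asarray(class_raster)[np.asarray(catchment_mask, dtype=bool)]
    if classes.size == 0:
        raise ValueError("catchment mask selects no raster cells")
    # Count how many cells inside the catchment belong to each class.
    counts = np.bincount(classes.ravel(), minlength=n_classes)
    return counts / counts.sum()


# Toy 3x3 class raster; the catchment covers the top two rows only.
raster = np.array([[0, 0, 1],
                   [0, 2, 1],
                   [3, 3, 3]])
mask = np.array([[True, True, True],
                 [True, True, True],
                 [False, False, False]])
fractions = class_fractions(raster, mask, n_classes=4)
print(fractions)
```

Storing the whole fraction vector, rather than only the two most frequent classes, keeps the information a data-driven model may need. The same counting applies to land cover classes.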
GLobal HYdrogeology MaPS (GLHYMPS) provides a global estimate of subsurface permeability and porosity, two critical characteristics for the hydrological classification of soils. Porosity and permeability influence an area's infiltration capacity. Soil with high porosity is likely to contain large amounts of water, and highly permeable soil transmits water relatively quickly. Based on the high-resolution map of GLiM, which can differentiate fine and coarse-grained sediments and sedimentary rocks, GLHYMPS determined subsurface permeability from the different permeabilities of rock types. For the proposed dataset, we calculated the catchment arithmetic mean for porosity. Following (Gleeson, Smith et al. 2011), the geometric mean on a logarithmic scale is used to represent subsurface permeability. The summary of geological characteristics is presented in Table 3.

Following (Addor, Newman et al. 2017), we also computed the average rooting depth (50% and 90%) for each catchment based on the IGBP classification using a two-parameter method (Zeng 2001). The root depth distribution of vegetation affects the ground's water holding capacity and the topsoil layer's annual evapotranspiration (Desborough 1997 [...]

Location and topography
The catchments' boundary files are obtained from the global drainage basin dataset (GDBD) (Masutomi, Inui et al. 2009). The GDBD dataset was derived from digital elevation models (DEMs) at high resolution (100 m-1 km), and errors were corrected either automatically or manually. Additionally, GDBD provides population and population density estimates for catchments, and these two indicators are also included in our dataset as a measure of human intervention. Global Runoff Data Centre (Center 2005) discharge gauging stations were used for referencing the derived basins. In contiguous China, GDBD has a high average match area rate (AMAR) and good geographic agreement with existing global drainage basin data. Based on this high-quality dataset, precise geographic and topographic information can be derived. See Fig. 6 for a summary. The CAMELS dataset provides only two parameters (two area estimates) for describing the catchment shape; however, the physical characteristics of a catchment can affect its runoff volume and runoff hydrograph under a storm.
To provide a complete description of the catchment shape, we computed several geometrical parameters of the catchment related to the runoff process, including catchment form factor, shape factor, compactness coefficient, circulatory ratio and the elongation ratio (Subramanya 2013). A summary of the location and topography attributes can be found in Table 3.
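
For reference, the five shape parameters can be computed from catchment area A, perimeter P and catchment length L. The sketch below uses the common textbook definitions (form factor A/L², shape factor L²/A, compactness coefficient P/(2√(πA)), circulatory ratio 4πA/P², elongation ratio 2√(A/π)/L); the exact formulas and the definition of catchment length used for the dataset are not stated in the text, so the function and its argument names are illustrative assumptions.

```python
import math


def shape_parameters(area_km2, perimeter_km, length_km):
    """Common textbook catchment shape descriptors from A, P and L."""
    return {
        # Form factor: area relative to the square of the catchment length.
        "form_factor": area_km2 / length_km ** 2,
        # Shape factor: reciprocal of the form factor.
        "shape_factor": length_km ** 2 / area_km2,
        # Compactness coefficient: perimeter over that of an equal-area circle.
        "compactness_coefficient": perimeter_km / (2.0 * math.sqrt(math.pi * area_km2)),
        # Circulatory ratio: area over that of a circle with the same perimeter.
        "circulatory_ratio": 4.0 * math.pi * area_km2 / perimeter_km ** 2,
        # Elongation ratio: diameter of the equal-area circle over the length.
        "elongation_ratio": 2.0 * math.sqrt(area_km2 / math.pi) / length_km,
    }


# Sanity check on a circular catchment of radius 10 km (length = diameter).
circle = shape_parameters(math.pi * 10.0 ** 2, 2.0 * math.pi * 10.0, 20.0)
print(circle)
```

For a perfectly circular catchment the compactness coefficient, circulatory ratio and elongation ratio all equal 1, which makes a convenient check; elongated catchments give a compactness coefficient above 1 and circulatory and elongation ratios below 1.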

Soil
The proposed dataset has a total of 54 soil attributes (Table 3) derived from (Hengl, Mendes de Jesus et al. 2017), (Dai, Xin et al. 2019) and (Shangguan, Dai et al. 2013). The summary result is shown in Fig. 7. Five categories of soil characteristics (pH in H2O, organic carbon content, depth to bedrock, cation-exchange capacity, and bulk density) are determined from SoilGrids. Unlike CAMELS, whose reported results are obtained by a linear weighted combination of the different soil layers, and CAMELS-BR, whose products are soil characteristics at a depth of 30 cm, we computed soil characteristics at all soil layers provided by SoilGrids so that advanced models can learn directly from the raw inputs.
To be consistent with CAMELS, we also determined saturated water content and saturated hydraulic conductivity (Dai, Xin et al. 2019). We also introduced the thermal conductivity of unfrozen saturated soils (Dai, Xin et al. 2019). (Dai, Xin et al. 2019) provides a global estimate of soil hydraulic and thermal parameters using multiple Pedotransfer Functions (PTFs) based on SoilGrids. Based on the SoilGrids and GSDE (Shangguan, Dai et al. 2014) datasets, (Dai, Xin et al. 2019) produced six soil layers with a spatial resolution of 30×30 arc-seconds. The vertical resolution of (Dai, Xin et al. 2019) is the same as that of SoilGrids, with six intervals of 0-0.05 m, 0.05-0.15 m, 0.15-0.30 m, 0.30-0.60 m, 0.60-1.00 m, and 1.00-2.00 m. As with SoilGrids, we determined and recorded catchment soil characteristics for all these layers.

To provide an even more complete description of the soil, we determined seven more soil characteristics (Shangguan, Dai et al. 2013), including soil profile depth, porosity, clay/silt/sand content, rock fragment, and soil organic carbon content. (Shangguan, Dai et al. 2013) provides physical and chemical attributes of soils derived from 8979 soil profiles at 30×30 arc-second resolution; the polygon linkage method was used to derive the spatial distribution of soil properties. The profile attribute database and soil map are linked under a framework that avoids uncertainty in taxon referencing. Depth to bedrock controls many physical and chemical processes in soil. The distribution of depth to bedrock in contiguous [...] nutrient the soil can store such that it influences the growth of vegetation. Cation exchange capacity is positively correlated with soil organic matter content and clay content, so it is generally low in sandy and silty soils.
The spatial variability of cation exchange capacity in contiguous China is characterised by (i) high values in peat and forested areas in the Qinghai-Tibet Plateau and in central and northeast China and (ii) extremely low values in desert areas such as the northwest. Soil hydraulic and thermal properties are greatly affected by soil organic matter (SOM). Soil organic matter has a similar distribution to the cation exchange capacity: high in the peat and forested areas such as northeast China and low in the north and northwest.

Meteorological time series
There have been many studies based on SURF_CLI_CHN_MUL_DAY in China (Liu, Xu et al. 2004, Xu, Gao et al. 2009, Huang, Han et al. 2016, Liu, Zheng et al. 2017), such as trend analysis of the pan evaporation (Liu, Yang et al. 2010 [...]

The meteorological time series include pressure, temperature, relative humidity, precipitation, evaporation, wind speed, sunshine duration, ground surface temperature and potential evapotranspiration (see Table 4 for a summary). Because the early site distribution is sparse, we only used records from 1990 to 2018 to construct the dataset to ensure the data quality.
The interpolation method used is inverse distance weighting, since it shows better performance than the other methods compared.
A catchment-scale raster is extracted from the interpolated national raster using the open-source rasterio package. For all variables, we take the arithmetic mean over the extracted catchment raster as the catchment mean. Potential evapotranspiration (PET) is estimated based on Penman's equation and other catchment meteorological variables.
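
The two steps above, inverse distance weighting of site observations onto a grid followed by averaging over the cells of a catchment, can be sketched with NumPy alone. This is a simplified illustration rather than the dataset's actual pipeline: the function name, the distance exponent of 2, the planar coordinates and the station values are assumptions, and a plain boolean mask stands in for the rasterio-based catchment extraction.

```python
import numpy as np


def idw_interpolate(station_xy, station_values, grid_xy, power=2.0):
    """Inverse distance weighting of station values onto target points."""
    station_xy = np.asarray(station_xy, dtype=float)    # shape (n_stations, 2)
    values = np.asarray(station_values, dtype=float)    # shape (n_stations,)
    grid_xy = np.asarray(grid_xy, dtype=float)          # shape (n_points, 2)
    # Pairwise distances between every target point and every station.
    dist = np.linalg.norm(grid_xy[:, None, :] - station_xy[None, :, :], axis=2)
    # Clamp tiny distances so a point on top of a station gets a huge but finite weight.
    dist = np.maximum(dist, 1e-12)
    weights = 1.0 / dist ** power
    return (weights @ values) / weights.sum(axis=1)


# Two hypothetical stations and three grid cell centres on a line.
stations = [(0.0, 0.0), (2.0, 0.0)]
observations = [0.0, 10.0]
cells = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
grid_values = idw_interpolate(stations, observations, cells)

# Catchment mean: arithmetic mean over the cells flagged as inside the catchment.
inside_catchment = np.array([False, True, True])
catchment_mean = grid_values[inside_catchment].mean()
print(grid_values, catchment_mean)
```

In practice the target points would be the cell centres of the national raster, and the mask would come from the catchment boundary polygon.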

Normal-Camels-YR: Normalized catchment attributes and meteorology for the Yellow River basin
Apart from the dataset providing the catchment attributes and meteorological forcing for contiguous China, we also offer a self-contained dataset covering the Yellow River basin with normalized streamflow measurements. The streamflow data are normalized to have zero mean and a standard deviation of 1 for each basin. The Normal-Camels-YR dataset is designed to support machine learning and deep learning research related to hydrology. In particular, fifty-four watersheds are less affected by human activities (the selection is based on the Global Reservoir and Dam database (GRanD) (Lehner, Liermann et al. 2011), which provides the locations of reservoirs and dams globally), which makes them suitable for rainfall-runoff modelling research. For most machine learning and deep learning algorithms (e.g., neural network-based and tree-based algorithms), data normalization will not affect model performance. Other research, such as trend analysis, can also be carried out. The Normal-Camels-YR dataset is self-contained, fully describes the Yellow River basin, and is particularly helpful for hydrology research on the Yellow River.
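
The per-basin normalization described above is a standard z-score transform. A minimal sketch follows, assuming the population standard deviation and ignoring missing values; the function name and the sample series are hypothetical, and the text does not state which estimator was actually used.

```python
import numpy as np


def normalize_streamflow(flow):
    """Z-score one basin's streamflow series: zero mean, unit standard deviation."""
    flow = np.asarray(flow, dtype=float)
    mean = np.nanmean(flow)   # missing observations (NaN) are ignored
    std = np.nanstd(flow)     # population standard deviation (ddof=0)
    return (flow - mean) / std, mean, std


# Hypothetical streamflow record for a single basin (arbitrary units).
normalized, basin_mean, basin_std = normalize_streamflow([1.0, 2.0, 3.0, 4.0, 5.0])
print(normalized, basin_mean, basin_std)
```

Because each basin is normalized separately, the transform preserves the shape and timing of each hydrograph while removing differences in magnitude between basins.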
During the dataset development, basins with too few observations were removed, resulting in discontinuous basin identifiers.
Normal-Camels-YR covers 102 gauges in the Yellow River basin, providing basin boundary shapefiles, static attributes and normalized streamflow measurements for each basin. The covered basins have areas ranging from 134 to 804,421 square kilometres. The time resolution of the streamflow measurements is seven days, and the mean record length is 684 measurements, which means the mean period of the streamflow measurements for each basin is over 13 years.
The meteorological variables included in Normal-Camels-YR are slightly different; daily maximum and minimum values are introduced for some variables (Table 5).

Data availability and software packages used
The proposed dataset is freely available at http://doi.org/10.5281/zenodo.4704017. The files provided are (i) several separate files containing 120+ catchment attributes, (ii) the daily meteorological time series in a zip file, (iii) the catchment boundaries used to compute the attributes and extract the time series, (iv) the Normal-Camels-YR dataset, (v) an attribute description file and (vi) a readme file. The code used to generate the dataset is mainly based on several publicly available packages: rasterio, gdal, pyshp, geopandas, fiona, and xarray. Accompanying code for generating the dataset for any watershed will be released soon.

Conclusion
This paper presents a novel dataset for hydrological research in contiguous China. Neither a catchment attributes dataset nor a catchment-scale meteorological time series dataset had previously been proposed for the study area.
All catchments delineated from the DEM are studied, covering contiguous China. The dataset includes daily meteorological forcing time series for 4875 catchments, comprising precipitation, temperature, evapotranspiration, wind speed, ground surface temperature, pressure, humidity, sunshine duration and derived potential evapotranspiration. The proposed time series dataset is derived from the quality-controlled site observation dataset SURF_CLI_CHN_MUL_DAY. We will also release the complementary code for generating the meteorological time series of any shapefile within contiguous China based on the SURF_CLI_CHN_MUL_DAY dataset (freely available for Chinese researchers). The dataset has longer time series (from 1990 to 2018) and more meteorological variables than the previously proposed datasets. The dataset also includes 120+ catchment attributes, including soil, land cover, geology, climate indices and topography for each catchment. We produced a series of maps depicting the distributions of the catchment attributes in contiguous China. These maps present the regional variation of the attributes; we also describe the relationships between them. The integration of multiple data sources into one dataset at the catchment scale greatly simplifies the data compilation process in research. With the dataset, hypotheses can be tested and conclusions drawn under a wide range of conditions rather than at a few specific locations. Together with the Normal-Camels-YR dataset, the proposed dataset can help explore how different basin characteristics influence hydrological behaviours, learn how hydrological behaviours transfer between different basins, and develop general frameworks for large-scale model evaluation and benchmarking in China.

Appendix A: Modified Penman's equation
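Penman's equation (Subramanya 2013), incorporating some modifications to the original formula, is:

$$PET = \frac{A H_n + E_a \gamma}{A + \gamma}$$

where $PET$ is the daily potential evapotranspiration in mm per day; $A$ is the slope of the saturation vapour pressure ($e_w$) vs temperature ($t$) curve at the mean air temperature, in mm of mercury per degree Celsius; $H_n$ is the net radiation in mm of evaporable water per day; $E_a$ is a parameter including wind speed and saturation deficit; $\gamma$ is the psychrometric constant = 0.49 mm of mercury per degree Celsius.

The relationship between $e_w$ and $t$ is defined as:

$$e_w = 4.584 \exp\left(\frac{17.27\, t}{237.3 + t}\right)$$

The net radiation $H_n$ is estimated as:

$$H_n = H_a (1 - r)\left(a + b\,\frac{n}{N}\right) - \sigma T_a^4 \left(0.56 - 0.092\sqrt{e_a}\right)\left(0.10 + 0.90\,\frac{n}{N}\right)$$

where $H_a$ is the incident solar radiation outside the atmosphere on a horizontal surface, in mm of evaporable water per day (a function of latitude and period of the year, see Table A1); $a$ is a constant depending upon the latitude $\phi$ and is given by $a = 0.29 \cos\phi$; $b$ is a constant = 0.52; $n$ is the sunshine duration in hours; $N$ is the maximum possible hours of bright sunshine (a function of latitude, see Table A2); $r$ is the reflection coefficient; $\sigma$ is the Stefan-Boltzmann constant = $2.01 \times 10^{-9}$ mm/day; $T_a$ is the mean air temperature in degrees kelvin; $e_a$ is the actual mean vapour pressure in the air in mm of mercury.

The parameter $E_a$ is estimated as:

$$E_a = 0.35\left(1 + \frac{u_2}{160}\right)(e_w - e_a)$$

where $u_2$ is the wind speed at 2 m above ground in km/day; $e_w$ is the saturation vapour pressure at mean air temperature in mm of mercury; $e_a$ is the actual vapour pressure.

Table A2. Maximum possible hours of bright sunshine $N$ (hours), by latitude and month.

| Latitude | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10° | 11.6 | 11.8 | 12.1 | 12.4 | 12.6 | 12.7 | 12.6 | 12.4 | 12.9 | 11.9 | 11.7 | 11.5 |
| 20° | 11.1 | 11.5 | 12.0 | 12.6 | 13.1 | 13.3 | 13.2 | 12.8 | 12.3 | 11.7 | 11.2 | 10.9 |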
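The modified Penman calculation described in this appendix can be sketched in Python as follows. This is a minimal sketch rather than the code used to build the dataset: it assumes the textbook form of the equations in Subramanya (2013), with the constants quoted in this appendix, and all function and argument names are hypothetical. The values of `H_a` and `N` must be looked up by the caller (Tables A1 and A2), and the inputs in the usage example are illustrative only.

```python
import math

GAMMA = 0.49     # psychrometric constant, mm of mercury per degree Celsius
SIGMA = 2.01e-9  # Stefan-Boltzmann constant, mm/day
B = 0.52         # constant b in the short-wave radiation term


def saturation_vapour_pressure(t_c):
    """Saturation vapour pressure e_w (mm of mercury) at temperature t_c (deg C)."""
    return 4.584 * math.exp(17.27 * t_c / (237.3 + t_c))


def svp_slope(t_c):
    """Slope A = d(e_w)/dt of the saturation vapour pressure curve."""
    return saturation_vapour_pressure(t_c) * 17.27 * 237.3 / (237.3 + t_c) ** 2


def penman_pet(t_c, e_a, u2, n, N, H_a, lat_deg, r):
    """Daily potential evapotranspiration (mm/day).

    t_c: mean air temperature (deg C); e_a: actual vapour pressure (mm Hg);
    u2: wind speed at 2 m (km/day); n, N: actual and maximum possible
    sunshine hours; H_a: extraterrestrial radiation (mm of evaporable water
    per day); lat_deg: latitude (degrees); r: reflection coefficient.
    """
    e_w = saturation_vapour_pressure(t_c)
    A = svp_slope(t_c)
    a = 0.29 * math.cos(math.radians(lat_deg))
    T_a = t_c + 273.0  # mean air temperature in kelvin
    sun = n / N
    # Net radiation: absorbed short-wave minus outgoing long-wave radiation.
    H_n = (H_a * (1 - r) * (a + B * sun)
           - SIGMA * T_a ** 4 * (0.56 - 0.092 * math.sqrt(e_a)) * (0.10 + 0.90 * sun))
    # Aerodynamic term: wind speed combined with the saturation deficit.
    E_a = 0.35 * (1 + u2 / 160.0) * (e_w - e_a)
    return (A * H_n + GAMMA * E_a) / (A + GAMMA)


# Illustrative inputs (not taken from the dataset): 19 deg C, 75 % relative humidity.
e_a_example = 0.75 * saturation_vapour_pressure(19.0)
pet = penman_pet(t_c=19.0, e_a=e_a_example, u2=85.0, n=9.0, N=10.716,
                 H_a=9.506, lat_deg=28.0667, r=0.25)
print(f"PET = {pet:.2f} mm/day")
```

The result is a weighted average of the radiation term `H_n` and the aerodynamic term `E_a`, with weights `A` and `GAMMA`, so warmer conditions (larger `A`) shift the estimate towards the radiation term.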

Appendix B: Correlation analysis of catchment attributes
To explore the potential connections between various types of watershed attributes, we conducted a correlation analysis using the Pearson correlation coefficient. The results can be found in Table B1, which shows the five most strongly correlated attributes for each attribute, and in Fig. S1, the correlation matrix. The results show that the correlations between variables are consistent with general understanding, which supports the plausibility of the dataset:
(1) Subsurface permeability and porosity are highly correlated with geological attributes.
(3) Root depth is most correlated with land cover types.
(4) In China, savanna is mainly distributed in the southern coastal areas, so it is positively correlated with average rainfall (0.604).
(5) Sand is positively correlated with saturated hydraulic conductivity (0.86), while clay is negatively correlated (-0.763), and catchments with a lot of rainfall are less likely to have soil with high hydraulic conductivity (-0.647).
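The ranking behind a table such as Table B1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code of this study: the function names are hypothetical, the attribute names and values are synthetic, and ranking by the absolute value of the Pearson coefficient is an assumption about how "most relevant" is defined.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_correlated(attributes, k=5):
    """For each attribute, list the k other attributes with the largest |r|."""
    ranking = {}
    for name, values in attributes.items():
        pairs = [
            (other, pearson(values, other_values))
            for other, other_values in attributes.items()
            if other != name
        ]
        # Strong negative correlations rank as high as strong positive ones.
        pairs.sort(key=lambda item: abs(item[1]), reverse=True)
        ranking[name] = pairs[:k]
    return ranking


# Synthetic values for five hypothetical catchments (not from the dataset).
attrs = {
    "sand_frac": [10.0, 20.0, 30.0, 40.0, 50.0],
    "clay_frac": [45.0, 38.0, 30.0, 24.0, 15.0],
    "ks": [1.0, 2.1, 2.9, 4.2, 5.0],
    "mean_precip": [5.0, 3.0, 4.0, 1.0, 2.0],
}
top = top_correlated(attrs, k=5)
for other, r in top["ks"]:
    print(f"ks vs {other}: r = {r:+.3f}")
```

Sorting on the absolute value keeps the sign of each coefficient available for interpretation, as in the positive and negative relationships between soil texture and hydraulic conductivity reported above.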