Hydrometeorological attributes for 897 catchments in Brazil

We introduce a new catchment dataset for large-sample hydrological studies in Brazil. This dataset encompasses daily time series of observed streamflow from 3679 gauges, as well as meteorological forcing (precipitation, evapotranspiration, and temperature) for 897 selected catchments. It also includes 65 attributes covering a range of topographic, climatic, hydrologic, land cover, geologic, soil and human intervention variables, as well as data quality indicators. This paper describes how the hydrometeorological time series and attributes were produced, their primary limitations and their main spatial features. To facilitate comparisons with catchments from other countries, the data follow the same standards as the previous CAMELS (Catchment Attributes and MEteorology for Large-sample Studies) datasets for the United States, Chile, and Great Britain. CAMELS-BR complements the other CAMELS datasets by providing data for hundreds of catchments in the tropics and the Amazon rainforest. Importantly, precipitation and evapotranspiration uncertainties are assessed using several gridded products and quantitative estimates of water consumption are provided to characterize human impacts on water resources. By extracting and combining data from these different data products and making CAMELS-BR publicly available, we aim to create new opportunities for hydrological research in Brazil and to facilitate the inclusion of Brazilian basins in continental to global large-sample studies. We envision that this dataset will enable the community to gain new insights into the drivers of hydrological behavior, better characterize extreme hydroclimatic events, and explore the impacts of climate change and human activities on water resources in Brazil. The CAMELS-BR dataset is freely available at https://doi.org/10.5281/zenodo.3709337 (Chagas et al., 2020).


Introduction
Large-scale hydrological research relies on data from large samples of catchments to formulate general conclusions on hydrological processes and models (Gupta et al., 2014; Addor et al., 2019). Hydrometeorological datasets with large spatial and temporal coverage are the basis to improve hydrological understanding with appropriate statistical robustness. For example, multiple studies used large-sample datasets to investigate the drivers of hydrological change (e.g., Slater et al., 2015; institutions' local repositories (e.g., ANA, 2019a) or web scraping techniques to access these data in an automated fashion, (ii) there is little consistency in data format across regions and stations, and (iii) current datasets do not systematically provide catchment attributes characterizing the hydroclimate, landscape, and anthropogenic influences. Further, the difficulty of accessing national meteorological daily time series has led users to compute them from other gridded global databases (e.g., Xavier et al., 2016; Beck et al., 2017c; Sun et al., 2018). All these difficulties hinder large-sample hydrological studies in Brazil where, unsurprisingly, nationwide studies (e.g., Siqueira et al., 2018; Bartiko et al., 2019) are less common than in North America or Europe. Consequently, studies in Brazil generally include only a reduced number of stream gauges and catchment attributes, and are restricted to specific regions, such as the Amazon (e.g., Tomasella et al., 2011; Paiva et al., 2013; Latrubesse et al., 2017; Levy et al., 2018), or the La Plata basin (e.g., Collischonn et al., 2001; Pasquini and Depetris, 2007; Melo et al., 2016; Lima et al., 2017; Chagas and Chaffe, 2018).
To overcome these limitations, we produced and made publicly available a new dataset for large-sample hydrological studies in Brazil, CAMELS-BR. It includes daily streamflow time series from 3679 stream gauges and, for a selected group of 897 catchments, daily meteorological time series and 65 catchment attributes describing properties such as topography, climate, land cover, geology, soil, and human intervention. All catchment attributes and time series are in an easily readable file format and on a quickly accessible database. We follow standards defined by the previous CAMELS and CAMELS-CL datasets, thus allowing direct comparisons with them. Most attributes rely on data products that cover the whole of South America, so they are spatially consistent across Brazil. To reduce the risk of data misinterpretation, we describe the major limitations of the data sources and indices computed. By synthesizing hydrological information from thousands of catchments in Brazil into a single dataset, we allow researchers to skip the arduous task of collecting and preprocessing large quantities of disparate data.

Streamflow data
We provide daily streamflow time series for two sets of gauges (Table 1). The first set comprises 3679 streamflow gauges and is provided by the Brazilian National Water Agency (ANA, 2019a). We refer to these as "raw streamflow" time series, as they are readily available from ANA (2019a). Their values are unchanged but, to ease their processing, we converted the native files (i.e., Excel files with daily streamflows not arranged in chronological order) to a new file format (i.e., text files with daily streamflow in chronological order). ANA estimates daily streamflow either by (i) taking two daily stream stage measurements, one in the morning (at 7 am) and another in the afternoon (at 5 pm), which are averaged and transformed into discharge using a stage-discharge relationship (rating curve); or (ii) resorting to regionalization methods when no stream stage measurements are available (no further details on the methods are provided by ANA). The raw streamflow time series cover different periods, ranging from a few days to more than a century. Additionally, although ANA performs data quality checks, these time series include inconsistencies such as typographical errors and days with missing data. The 3679 gauges are irregularly distributed throughout the country (Fig. 1a). Overall, their spatial distribution is denser and their time series longer in the Southern Atlantic, Southeastern Atlantic, and Paraná hydrographic regions (Fig. 1a and 1b).
The second set of streamflow time series includes 897 gauges, and here we simply refer to them as "streamflow" time series (Table 1). This is the set of gauges used to compute the catchment attributes. It is a subset of the previous 3679 gauges, which resulted from two selection criteria. Firstly, we selected only gauges that have less than 5 % of missing streamflow data between the water years 1990 (starting on September 1, 1989) and 2009 (ending on August 31, 2009). We chose the water years from 1990 to 2009 because (i) it is the period with the largest number of stream gauges with available data (Fig. 2), and (ii) it coincides with the period of analysis from other CAMELS datasets (Addor et al., 2017; Alvarez-Garreton et al., 2018), allowing for direct comparisons with them. Secondly, we only considered catchments for which boundaries have been delimited by Do et al. (2018) and for which there is a good match with the area estimated by the data provider (see Sect. 3).
Although the hydrological signatures introduced below were computed using data from 1990 to 2009, the time series for the 897 stream gauges include data from 1980 to 2018 when available, to enable complementary analyses by other users.
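The first selection criterion can be sketched in a few lines. This is a minimal illustration, not the authors' processing code: the function names, the dictionary-based series layout, and the use of None for missing days are assumptions made here for the example.

```python
from datetime import date, timedelta


def missing_fraction(series, start, end):
    """Fraction of days in [start, end] without a valid streamflow value.

    `series` maps datetime.date -> float, with None (or an absent key)
    marking a missing day. This layout is an assumption of the sketch.
    """
    n_days = (end - start).days + 1
    n_missing = 0
    for offset in range(n_days):
        if series.get(start + timedelta(days=offset)) is None:
            n_missing += 1
    return n_missing / n_days


def passes_selection(series, max_missing=0.05):
    """Apply the 'less than 5 % missing' rule over water years 1990 to 2009."""
    start, end = date(1989, 9, 1), date(2009, 8, 31)
    return missing_fraction(series, start, end) < max_missing
```

A gauge with a complete record passes, while one that lacks two full calendar years inside the window (about 10 % of the days) does not.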
We individually screened the 897 selected streamflow time series between 1990 and 2009 for the following errors: zeroes or repeated values instead of missing values, abrupt changes resulting from changes in measurement instruments or rating curves, annual streamflow larger than annual precipitation, and unrealistic daily streamflow values (i.e., larger than 1,000 mm day⁻¹). Gauges affected by such errors were not included in the set of 897 catchments. In addition, we summarized the streamflow metadata provided by ANA as follows. For each daily streamflow measurement, we provide two pieces of information (Table 1). The first metadata variable, "qual_control_by_ana", was set to 1 if the data were quality checked by ANA and to 0 otherwise. The second metadata variable, "qual_flag", indicates the reliability of streamflow estimates. It is also provided by ANA and consists of the following quality flags: 0, when there is no description; 1, streamflow resulted from stream stage measurements and the rating curve; 2, streamflow estimated by ANA without stream stage measurements; 3, streamflow values marked as doubtful; and 4, when the stream water level falls outside the range of the stream stage, e.g., when the river ran dry. To summarize the ANA metadata (i.e., q_qual_control_perc and q_stream_stage_perc; Table 2), 80 % of the 897 gauges had at least 90 % of their data over 1990-2009 checked for inconsistencies (Fig. 1c). The Amazon, São Francisco, and Paraná regions have the lowest frequency of quality controls in Brazil. Furthermore, the streamflow estimates from 64 % of the 897 catchments were derived from stream stage measurements for 90 % of the days over 1990-2009 (Fig. 1d).
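The two summary attributes can be illustrated as follows. The sketch assumes that q_qual_control_perc is the percentage of days with qual_control_by_ana equal to 1 and that q_stream_stage_perc is the percentage of days with qual_flag equal to 1; the function name and the list-based inputs are hypothetical.

```python
def flag_percentages(qual_control_by_ana, qual_flag):
    """Return (q_qual_control_perc, q_stream_stage_perc) for one gauge.

    Both inputs are equally long lists with one entry per day.
    Assumed definitions: share of days quality checked by ANA, and share
    of days whose flow comes from stage readings and the rating curve
    (qual_flag == 1).
    """
    n = len(qual_flag)
    if n == 0 or n != len(qual_control_by_ana):
        raise ValueError("inputs must be non-empty and of equal length")
    checked = sum(1 for c in qual_control_by_ana if c == 1)
    from_stage = sum(1 for f in qual_flag if f == 1)
    return 100.0 * checked / n, 100.0 * from_stage / n
```

For four days with control values 1, 1, 0, 1 and flags 1, 2, 1, 0, the function returns 75 % and 50 %.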

Meteorological data
Meteorological daily time series data are provided for 897 catchments (Table 1). These include (i) precipitation from CHIRPS v2.0 (Funk et al., 2015), CPC (NOAA, 2019a), and MSWEP v2.2 (Beck et al., 2019); (ii) potential evapotranspiration from GLEAM v3.3a (Miralles et al., 2011; Martens et al., 2017); (iii) actual evapotranspiration from GLEAM v3.3a and MGB South America (Siqueira et al., 2018); and (iv) minimum, maximum, and average temperature from CPC (NOAA, 2019b). The datasets were selected because of their high spatial resolution, because their full coverage of South America allows for consistency across all catchments, and because they are commonly used, which enables comparisons with other studies. The daily values represent the average of all cells whose centroids fall within the catchment; all such cells contribute equally to the average, whether the catchment fully covers them or not. However, some catchments do not contain the centroid of any cell. For those, we computed the daily values as the average of all cells partially covered by the catchment. A significant limitation of the meteorological data is that, because the grids of the adopted products have resolutions ranging from 0.05° (ca. 5.5 km at the equator) to 0.5° (ca. 55 km at the equator), some catchments are smaller than a single cell. This implies the assumption that each meteorological variable is homogeneous within catchments smaller than a single cell, even though this might not always be the case. This limitation has to be kept in mind particularly when using the CPC precipitation data (resolution of 0.5°; NOAA, 2019a), as precipitation is the meteorological variable with the highest spatial heterogeneity amongst those used in CAMELS-BR.
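The averaging rule can be expressed compactly. In this sketch the geometric tests (centroid inside the catchment, partial overlap) are assumed to have been computed beforehand with a GIS tool; the dictionary keys and the function name are hypothetical.

```python
def catchment_mean(cells):
    """Average one day's gridded values over a catchment.

    `cells` is a list of dicts with the (assumed) keys:
      'value'           - the cell's value for that day
      'centroid_inside' - True if the cell centroid lies inside the catchment
      'overlaps'        - True if the catchment covers any part of the cell
    """
    # Primary rule: equal-weight mean of cells whose centroid is inside.
    selected = [c["value"] for c in cells if c["centroid_inside"]]
    if not selected:
        # Fallback for small catchments that contain no centroid:
        # use every cell that is at least partially covered.
        selected = [c["value"] for c in cells if c["overlaps"]]
    if not selected:
        raise ValueError("catchment does not overlap any cell")
    return sum(selected) / len(selected)
```

Because the weights are equal, a cell that is only marginally covered counts as much as a fully covered one, which is the simplification described above.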
In addition to GLEAM v3.3a, estimates of actual evapotranspiration (ET) were obtained from the MGB model version for South America (Siqueira et al., 2018). The MGB is a conceptual, semi-distributed hydrologic-hydrodynamic model that discretizes the basin (or a set of basins) into irregular unit-catchments and further into hydrological response units by combinations of land use and soil types, where both water and energy balance are computed. The model calculates ET using the Penman-Monteith equation based on CRU meteorological data (i.e., temperature, pressure, radiation, and wind speed) and MSWEP v1.1 precipitation data (Beck et al., 2017b). Surface resistance is adjusted according to the availability of water in the soil that is updated during the water budget. The MGB also computes the evaporation of flooded areas and intercepted water from the canopy with the Penman equation. Regular ET cells of 0.5° resolution were generated by aggregating unit-catchments using their areas as weights.
The long-term water balance is accurate for most catchments, using either the estimated evapotranspiration from GLEAM (Fig. 3a) or MGB (Fig. 3b). Both evapotranspiration data sources indicate that the highest data uncertainties occur in the Amazon and smaller catchments in the Paraná and the Southeastern Atlantic regions since those catchments are further away from the 1:1 line in Fig. 3a-b. The same conclusions are derived from visualizing the runoff coefficient as a function of the humidity index (Fig. 3c). In addition, there are remarkable differences between GLEAM and MGB estimates, where evapotranspiration from GLEAM is substantially higher in the Amazon basin and substantially lower in the Eastern and the Western NE Atlantic regions.
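A water balance check of this kind can be sketched as below. The sketch assumes that the balance compares long-term mean precipitation minus evapotranspiration with streamflow, that the runoff coefficient is Q/P, and that the humidity index is P/PET; the function name and argument names are hypothetical, and all inputs share the same units (e.g., mm day⁻¹).

```python
def water_balance_summary(p_mean, et_mean, q_mean, pet_mean):
    """Long-term water balance diagnostics for one catchment.

    Assumed definitions (not confirmed by the text above):
      residual           = P - ET - Q  (zero on the 1:1 line)
      runoff_coefficient = Q / P
      humidity_index     = P / PET
    """
    return {
        "residual": p_mean - et_mean - q_mean,
        "runoff_coefficient": q_mean / p_mean,
        "humidity_index": p_mean / pet_mean,
    }
```

A large absolute residual flags a catchment where precipitation, evapotranspiration, and streamflow data are mutually inconsistent, without indicating which of the three is at fault.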

Topographic indices
Even though ANA (2019a) provides estimates of the areas of most gauged Brazilian catchments, the catchment boundaries are not publicly available. Hence, in this study, we used the catchment boundaries provided by Do et al. (2018), who used the HydroSHEDS 15 arc-sec resolution digital elevation model (DEM) and delineated the catchments with a procedure similar to Lehner (2012) for more than 3,000 gauges in Brazil. For each streamflow gauge, Do et al. (2018) positioned the outlet at the center of all the DEM grid cells within a radius of 5 km from the gauge coordinates indicated by the metadata. They then selected the grid cell (and associated catchment boundaries) leading to the catchment area most similar to the one indicated by ANA (2019a). The main limitation of the procedure of Do et al. (2018) is that catchment boundaries were not manually inspected.
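The outlet selection step described above amounts to a nearest-area search among candidate cells. The following is only an illustration of that idea, not the code of Do et al. (2018); the function name, the tuple layout, and the use of an absolute area difference as the similarity measure are assumptions.

```python
def select_outlet(candidates, reference_area_km2):
    """Pick the candidate outlet whose delineated area best matches the
    reference area reported by the data provider.

    `candidates` is a list of (cell_id, delineated_area_km2) tuples for the
    DEM cells within the search radius of the gauge (hypothetical layout).
    Returns (cell_id, area_km2, relative_area_error).
    """
    if not candidates:
        raise ValueError("no candidate cells within the search radius")
    cell_id, area = min(
        candidates, key=lambda c: abs(c[1] - reference_area_km2)
    )
    relative_error = abs(area - reference_area_km2) / reference_area_km2
    return cell_id, area, relative_error
```

The returned relative error could then be compared against a tolerance to decide whether the match with the provider's area is good enough to keep the catchment.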
Using those catchment boundaries, we computed four topographic attributes (Table 2), namely gauge elevation, catchment mean elevation, mean slope, and area. The area of the catchments ranged from 10.8 km² (i.e., in the upper São Francisco hydrographic region) to 4.7 million km² (i.e., the Amazon basin at Óbidos). Approximately 30 % of the analyzed catchments are smaller than a thousand km², 43 % are between 1 and 10 thousand km², and 27 % are larger than 10 thousand km². The largest basins are in the Amazon and the Tocantins-Araguaia hydrographic regions (Fig. 4a). Combined with the Paraguay basin, those regions are usually characterized by low elevations (Fig. 4b), flat slopes (Fig. 4c), and large proportions of wetlands (see Sect. 6.2). The smaller catchments are located along the mountain belts on the eastern coast of Brazil, particularly in the southern and southeastern parts of the country. Those are also the catchments with the steepest slopes. Additionally, many catchments with intermediate elevation ranges (i.e., between 500 and 900 m) are in the central part of the country, which comprises the Brazilian highlands. Note that, since we computed the average attribute value (unless otherwise noted) of each catchment, the attributes become less representative as the area of the catchment increases.

Data and methods
We computed thirteen climatic indices (Table 3) for the water years 1990 to 2009, i.e., the first water year starts on 1st September 1989 and the last one finishes on 31st August 2009. This is to facilitate inter-dataset comparability. We used precipitation data from CHIRPS v2.0 (Funk et al., 2015) to compute the indices since it has the highest spatial resolution among the three adopted precipitation products (i.e., CHIRPS v2.0, CPC, and MSWEP v2.2) and relies on both remote-sensing and gauge-based data.
The mean precipitation, mean potential evapotranspiration, and the aridity index are considered to capture long-term climatic conditions. The aridity index is the ratio of mean potential evapotranspiration to mean precipitation, which acts as a first-order control on the partitioning of precipitation into streamflow (Budyko et al., 1974; Blöschl et al., 2013). Those indices are complemented by the precipitation seasonality index (p_seasonality, Table 3), which relies on sine curves to approximate the monthly climatology of temperature and precipitation. While, for Brazil, the annual precipitation cycle is captured quite well, a sine curve provides a relatively rough approximation of the temperature cycle, particularly in the center of the country (around the state of Goiás; Berghuijs and Woods, 2016). Hence, in addition to p_seasonality, we extracted the asynchronicity index proposed by Feng et al. (2019), which relies on information theory and has the advantage of being non-parametric (in particular, it does not assume sinusoidality). The indices of extreme climatic conditions include the frequency, duration, and the most common season of high precipitation events and dry days. Dry days are defined as days with precipitation less than 1 mm, so that the index is not compromised by underdetected precipitation events (Haylock and Nicholls, 2000).
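Two of these indices follow directly from their definitions and can be sketched as below. The function names are hypothetical, and the dry-day frequency is returned here as a fraction of days, which is an assumption about units (the published attribute may be expressed in days per year).

```python
def aridity_index(daily_pet, daily_precip):
    """Ratio of mean potential evapotranspiration to mean precipitation.

    Values above 1 indicate that potential evapotranspiration exceeds
    precipitation in the long term.
    """
    mean_pet = sum(daily_pet) / len(daily_pet)
    mean_precip = sum(daily_precip) / len(daily_precip)
    return mean_pet / mean_precip


def dry_day_fraction(daily_precip, threshold_mm=1.0):
    """Fraction of days with precipitation below the 1 mm threshold.

    Multiply by 365.25 to express the result in days per year.
    """
    dry = sum(1 for p in daily_precip if p < threshold_mm)
    return dry / len(daily_precip)
```

For example, four days with 0, 0.5, 2, and 10 mm of precipitation and a constant potential evapotranspiration of 4 mm day⁻¹ give an aridity index of 1.28 and a dry-day fraction of 0.5.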

Spatial variability in climatic indices
The mean daily precipitation in Brazil is highest in the Amazon and in Southern Brazil, where it on average exceeds 5 mm day⁻¹ (1825 mm year⁻¹) (Fig. 5a). The lowest mean precipitation occurs in Northeastern Brazil, which is also where mean potential evapotranspiration exceeds the mean precipitation (aridity index > 1, Fig. 5b). Northeastern Brazil (in particular, the states of Maranhão, Piauí, and Ceará) also has the highest values of asynchronicity index in the country (not shown), which corresponds to Mediterranean climates. The precipitation regime is highly seasonal in most of the country, particularly in the central-west and southeastern Brazil (Fig. 5c). This seasonality is regulated by the South American Monsoon System (Raia and Cavalcanti, 2008; Carvalho et al., 2011), with peaks in the austral summer (Fig. 5f) and several dry months during the austral winter (Fig. 5i). Southern Brazil has a distinct regime, with uniform precipitation throughout the year caused by a combination of large-scale phenomena and a diversity of sources of atmospheric moisture (Seager et al., 2010; Martinez and Dominguez, 2014). The Amazon basin, which extends into both hemispheres, has contrasting precipitation regimes between the north (with a peak in austral winter) and the south (with a peak in austral summer) related to alternating warming of each hemisphere (Marengo and Espinoza, 2016). This seasonality is substantially diminished downstream in the Amazon.
The number of high precipitation and dry days is highest in the catchments along the coast (Fig. 5d and 5g), which is also where the smallest catchments are located. Both indices are significantly correlated with catchment area (p-value < 0.001), so a regional analysis of both indices should be carried out with caution since large catchments are located in the Amazon and Tocantins-Araguaia basins. On the other hand, the duration of high precipitation (Fig. 5e) and dry day events (Fig. 5h) does not correlate with catchment area. Their spatial distribution is remarkably similar to that of the aridity index, except for the Tocantins-Araguaia basin, which has long dry periods but not necessarily long high precipitation events. Summer is the most common season of extreme precipitation in the majority of Brazil, with two main exceptions (Fig. 5f): (i) part of the coast of Northern Brazil; and (ii) Southern Brazil. This is possibly linked to mesoscale convective systems over Southeastern South America (Salio et al., 2007), to sea surface temperature anomalies in the Atlantic Ocean (Liebman et al., 2010), and to the El Niño Southern Oscillation phenomenon, as those regions are particularly affected by it (Grimm, 2011; Tedeschi et al., 2013).

Data and methods
We computed thirteen hydrological signatures (Table 4) that represent a wide range of hydrological information for the water years from 1990 to 2009. The hydrological signatures were computed following the same approach as in the CAMELS, CAMELS-CL, and CAMELS-GB datasets. Intermediate streamflow conditions were evaluated with the mean daily flow and its ratio to mean daily precipitation. These were complemented by baseflow information, a fundamental component that sustains streamflow during dry periods (Smakhtin, 2001). The baseflow index is the ratio of long-term baseflow to long-term total streamflow. We used the digital filter from Ladson et al. (2013) to separate the baseflow component from the hydrograph. The variability of streamflow was evaluated with the slope of the flow duration curve and the streamflow elasticity indices. The slope of the flow duration curve is defined as the slope between the log-transformed 33rd and 66th long-term percentiles of daily streamflow (Yadav et al., 2007; Sawicz et al., 2011). High values of that index suggest highly variable streamflow, caused either by a high seasonality of streamflow or by a flashy response to precipitation events (Yokoo and Sivapalan, 2011; McMillan et al., 2017).
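The slope of the flow duration curve can be sketched as follows. The sketch assumes non-exceedance percentiles with linear interpolation and a denominator of 0.66 - 0.33, which is one common convention; the exact convention used to build the dataset is not stated above, and the function names are hypothetical.

```python
import math


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an ascending list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lower = int(math.floor(pos))
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    low_val, up_val = sorted_values[lower], sorted_values[upper]
    return low_val + frac * (up_val - low_val)


def slope_fdc(daily_flow):
    """Slope between the log-transformed 33rd and 66th flow percentiles.

    Returns None when the 33rd percentile is zero, because the logarithm
    is then undefined (rivers that are dry for a large share of the time).
    """
    flows = sorted(daily_flow)
    q33 = quantile(flows, 0.33)
    q66 = quantile(flows, 0.66)
    if q33 <= 0:
        return None
    return (math.log(q66) - math.log(q33)) / (0.66 - 0.33)
```

A constant flow yields a slope of zero, and the slope grows as the mid-range flows spread apart on the log scale.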
Streamflow elasticity is an indicator of the sensitivity of mean annual flow to changes in mean annual precipitation (Sankarasubramanian et al., 2001). For example, a streamflow elasticity value of 2 indicates that a 1 % change in mean annual precipitation generates a 2 % change in mean annual flow. Extreme streamflow conditions were analyzed using signatures based on the magnitude, frequency, and duration of high and low flow events. High and low flow events were defined through long-term thresholds, based on the median and mean flow, respectively (Olden and Poff, 2003). The magnitude of high and low flow events was characterized using the 95th and the 5th percentiles, respectively. There are two primary limitations to the hydrological signatures used in this study. First, several signatures might scale with catchment area. Since catchment area varies substantially among hydrographic regions, spatial analyses should be carefully conducted. Second, we did not check for temporal dependencies of consecutive high or low flow events, for example when two flood peaks occur within a couple of days from each other and both may be related to a single extreme precipitation event. Many criteria exist to identify independent high flow events (Hall et al., 2014; Archfield et al., 2016) and low flow events (Fleig et al., 2006; Van Loon, 2015), which might lead to differences in the analyzed signatures.
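The elasticity concept can be made concrete with a short sketch. It assumes the non-parametric median estimator commonly attributed to Sankarasubramanian et al. (2001); whether this exact variant was used for the dataset is not stated above, and the function name is hypothetical.

```python
from statistics import mean, median


def streamflow_elasticity(annual_precip, annual_flow):
    """Median-based estimate of streamflow elasticity to precipitation.

    For each year, the anomaly ratio (Q - Qmean) / (P - Pmean) is scaled
    by Pmean / Qmean; the estimate is the median over all years.
    Years with P equal to Pmean are skipped to avoid division by zero.
    """
    p_bar = mean(annual_precip)
    q_bar = mean(annual_flow)
    ratios = [
        ((q - q_bar) / (p - p_bar)) * (p_bar / q_bar)
        for p, q in zip(annual_precip, annual_flow)
        if p != p_bar
    ]
    return median(ratios)
```

With annual precipitation of 900, 1000, and 1100 mm and annual flow of 400, 500, and 600 mm, a 10 % increase in precipitation accompanies a 20 % increase in flow, and the function returns 2.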

Spatial variability in hydrological signatures
The spatial distribution of mean daily flows (Fig. 6a) and runoff ratio (Fig. 6b) closely resembles that of mean daily precipitation. These are notably high in Southern Brazil and parts of the Amazon, and low in Northeastern Brazil. The mean half-flow date (i.e., when the cumulative discharge since 1st September reaches half of the annual discharge) follows a gradient ranging from February and March in the Eastern Atlantic region to May in the Amazon and on the northern coast (Fig. 6c). Steep slopes of the flow duration curve occur especially in the tributaries of Southern Amazon, the Tocantins-Araguaia basin, the Eastern Atlantic hydrographic region and in parts of Southern Brazil (Fig. 6d). Some catchments have undefined values, meaning that they have zero flow for more than 33 % of the time. Since the slopes of the flow duration curve indicate the overall streamflow variability, they are spatially similar to several other hydrological signatures. They are, most noticeably: (i) negatively correlated with the baseflow index (Fig. 6e), hence catchments with high baseflow may be highly resilient to dry periods (Fan, 2015); (ii) positively correlated with streamflow precipitation elasticity (Fig. 6f), which indicates variability at the interannual timescale; (iii) negatively correlated with the 5th percentile of streamflow (i.e., low flows; Fig. 6l); and (iv) positively correlated with the frequency and duration of low flow events (Fig. 6j and 6k). However, note that some regions do not follow those patterns. In particular, catchments in Southern Amazon and in the Tocantins-Araguaia basin have high baseflow indices despite steep slopes of the flow duration curve. This possibly implies that the variability in those catchments is related to a high seasonality, rather than to a flashy response to precipitation events.
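The half-flow date of a single water year can be computed as sketched below; the signature reported in the dataset is the mean over all water years. The function name and the list-based input are assumptions made for this example.

```python
from datetime import date, timedelta


def half_flow_date(daily_flow, year_start):
    """Date on which cumulative discharge reaches half of the annual total.

    `daily_flow` lists the daily flows of one water year, the first value
    corresponding to `year_start` (1st September in this dataset).
    Returns None when the annual total is zero.
    """
    total = sum(daily_flow)
    if total <= 0:
        return None
    cumulative = 0.0
    for offset, q in enumerate(daily_flow):
        cumulative += q
        if cumulative >= total / 2:
            return year_start + timedelta(days=offset)
    return None
```

For a perfectly uniform hydrograph over the 365 days starting on 1st September 1989, the half-flow date falls on 2nd March 1990; later dates indicate that most of the flow arrives late in the water year.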
High flow days are more frequent and their events are longer in Southern Brazil, in the Eastern Atlantic region, and on the coast of northeastern Brazil (Fig. 6g and 6h). Those regions also have the most frequent and longest low flow events.

Land cover characteristics

Data and methods
Each catchment was described using ten land cover classes (Table 5)

Spatial variability in land cover characteristics
Croplands are widespread in Brazil, especially in the highlands, in Southern Brazil, and on the eastern coast of Northeastern Brazil (Fig. 7a and 7d). Out of the 897 CAMELS-BR catchments, 52.4 % have croplands or mosaics of croplands and natural vegetation as the dominating land cover (Fig. 7c). Croplands are particularly noticeable in the Uruguay and Paraná hydrographic regions. Even though GlobCover2009 does not cover the same period as the hydrological signatures (i.e., 1990-2009), croplands were already extensive in almost all states in Brazil in the 1980s and pastures in the 1960s (Leite et al., 2012; Dias et al., 2016). This is true except for Southern Amazon, where agricultural expansion has led to one of the highest deforestation rates in the world since the 1980s (Song et al., 2018).
Aside from the Amazon, catchments dominated by forests are located in mountain belts, i.e., in steep slope regions in Southern and Southeastern Brazil (Fig. 7b). Shrublands occur mainly in the driest regions of the country (Fig. 7e), but they are not the predominant land cover in these regions. Natural wetlands or water bodies are largely present in the Amazon, Tocantins-Araguaia, and Paraguay hydrographic regions (not shown). Some catchments in the Paraná, Uruguay, and São Francisco basins are also substantially covered by water bodies. However, those are mainly artificial reservoirs (see Sect. 9.3). The CAMELS-BR catchments typically have a low fraction of their area considered to be "impervious areas", such as urban land covers; only 0.2 % of the catchments have more than 5 % of impervious areas (not shown). Besides, grasslands, bare soil areas, and permanent snow are rare in the CAMELS-BR catchments (not shown).

Data and methods
The geology of the catchments was described using seven geologic attributes (Table 6). The first and second most common geologic class, their fractions, and the percentage of the catchment covered by carbonate rocks were extracted from the Global Lithological Map (GLiM; Hartmann and Moosdorf, 2012). GLiM was created by assembling information from 92 regional lithological maps. In the Brazilian territory, it relies on data from the Brazilian Geological Survey at the 1:1 million scale (Schobbenhaus et al., 2004). We considered only the first level of the GLiM geologic classes, which classifies lithology into 16 groups. The additional second and third levels provide more specific geologic information but were not included in this study. We note that two geologic classes cover a particularly broad variety of rocks. First, the "unconsolidated sediments" class is quite unspecific with regard to sediment types and grain sizes (it includes sediments such as alluvial, swamp, and dune deposits). Second, catchments dominated by the "metamorphic rocks" class can have a wide range of lithologies, from shales to gneiss and quartzite.
We extracted the subsurface permeability and porosity indices from the GLobal HYdrogeology MaPS 2.0 (GLHYMPS; Gleeson et al., 2014; Huscroft et al., 2018), which is modeled based on information from the GLiM and the Global Unconsolidated Sediments Map (GUM; Börker et al., 2018). Subsurface permeability indicates how easily water can flow through the subsurface. GLHYMPS modeled it only for saturated conditions (Huscroft et al., 2018), so it is not adequate to characterize regions dominated by unsaturated processes, e.g., deeply weathered soils. The subsurface porosity indicates the fraction of void spaces in a material and controls the water storage capacity in the subsurface. A major caveat of GLHYMPS data is that it is only adequate for analyses at the regional scale, i.e., over spatial units greater than 5 km (Gleeson et al., 2014).
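Extracting the first and second most common geologic classes and their fractions reduces to ranking class areas within each catchment. The sketch below assumes that the area covered by each first-level GLiM class has already been computed with a GIS overlay; the function name and the dictionary input are hypothetical.

```python
def dominant_classes(class_areas):
    """Return the two most common classes and their area fractions.

    `class_areas` maps a geologic class name to the area it covers within
    the catchment (any consistent unit). The result is a list of
    (class_name, fraction) tuples, most common class first.
    """
    total = sum(class_areas.values())
    if total <= 0:
        raise ValueError("catchment has no classified area")
    ranked = sorted(class_areas.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, area / total) for name, area in ranked[:2]]
```

A catchment with 60 % metamorphic rocks, 30 % acid plutonic rocks, and 10 % unconsolidated sediments would thus be described by the first two classes and the fractions 0.6 and 0.3.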

Spatial variability in geological characteristics
The catchments on the eastern coast have lithologies dominated by either metamorphic or acid plutonic rocks (Fig. 8a and 8b), related to high elevation and steep slopes in this region. These catchments also have low subsurface permeability (Fig. 8g) and the lowest subsurface porosity rates in the country (Fig. 8f). In Southern Brazil, basic volcanic lithology is widespread, which encompasses basaltic rocks. Southern Brazil has the most homogeneous lithological types of the country (Fig. 8c and 8d), where more than 80 % of the catchment areas are usually characterized by a single lithological type. However, subsurface porosity and permeability are highly heterogeneous, extending from middle-range to high porosity values and from middle-range to low permeability values.
Sedimentary rocks occur on a large scale in the São Francisco, Parnaíba, Western Northeast Atlantic, and part of the Amazon hydrographic regions. The Northern Amazon is characterized mostly by metamorphic or plutonic rocks, while the Western Amazon has either siliciclastic or mixed sedimentary lithologies. On the other hand, unconsolidated sedimentary lithologies occur particularly downstream in the Amazon, the Tocantins, and the Paraguay basins. These basins also have flat slopes and large proportions of wetlands, which allow alluvial particles to settle. Most of the catchments with a high proportion of sedimentary rocks have high subsurface porosity, although their permeability varies according to the grain sizes of these rocks. Carbonate sedimentary rocks, such as karst or limestone, are more common in the São Francisco basin and in western Amazon (Fig. 8e). Those rocks are also present in some isolated and smaller catchments in Southern Brazil, in the Paraguay, and in the Tocantins-Araguaia basins.

Data and methods
We provide six soil characteristics (Table 7). Five of those were extracted from SoilGrids250m (Hengl et al., 2017; Shangguan et al., 2017), a collection of soil maps for the world at the 250 m resolution. SoilGrids250m maps are the result of a model using approximately 150,000 soil profiles, with predictions based on machine learning methods and 158 remote-sensing covariates including climate, vegetation, geomorphology, and lithology (Hengl et al., 2017). Although SoilGrids250m generated predictions for several soil depths, in this work we only computed soil characteristics over a depth of 30 cm.
SoilGrids250m is based on a machine-learning model that explains large proportions of the variance of most observed variables, including 69 % of the variance of organic carbon content and more than 70 % of the soil textures (i.e., clay, silt, and sand content).
The soil characteristics might be highly correlated with other attributes from CAMELS-BR since they are modeled based on climatic and landscape covariates. Organic carbon content and clay content have modeled depth to bedrock as a predominant variable (Hengl et al., 2017). Other variables are also important, such as temperature and geomorphological characteristics (e.g., surface slope). The predictions of sand content are based primarily on depth to bedrock and precipitation, both at similar weights. Out of the five variables considered from SoilGrids250m, depth to bedrock is the most problematic prediction, with 59 % of its variance explained by the model (Shangguan et al., 2017). It has precipitation as the predominant covariate, which accounts for the control of weathering rates and soil production. Other decisive covariates are vegetation dynamics and geomorphological characteristics, which account for factors such as soil erosion (Shangguan et al., 2017).
The sixth soil characteristic is the water table depth, based on a 1 km resolution global model by Fan et al. (2013). Combined with depth to bedrock, water table depth can be an indicator of water storage potential in the catchment, which is related to baseflow and the supply of water for the vegetation during dry periods (Fan et al., 2013, 2019). The most important variables in the predictions of the water table depth of that model are, in decreasing order of importance, surface slope, elevation, precipitation, and temperature (Fan et al., 2013). Note that groundwater abstractions are not represented in the model, so water table depth data must be used with caution when analyzing catchments with intense anthropogenic intervention.

Spatial variability in soil characteristics
Soil texture in CAMELS-BR is characterized by (i) a predominance of clay content in Southern Brazil, in parts of Southeastern Brazil, particularly at higher elevations, and in the northeastern Amazon (Fig. 9c); (ii) similar values of clay, sand, and silt content in the southern tributaries of the Western Amazon (Fig. 9a to 9c); and (iii) a wide predominance of sand content in the rest of the country (Fig. 9a). As expected, the aridity index is closely related to the spatial distribution of the soil texture, since climatic attributes are important covariates in SoilGrids250m predictions (Hengl et al., 2017). The predominance of clay in Southern Brazil and in part of Southeastern Brazil might be linked to their lithological classes, i.e., basic volcanic rocks in the former and acid plutonic rocks in the latter, since their spatial distributions coincide.
Organic carbon content is most pronounced in parts of the Amazon and in regions with high clay content (Fig. 9d). The depth to bedrock is higher in Central and Northeastern Brazil, frequently above 30 m (Fig. 9e). On the other hand, only 14 % of the catchments have depths to bedrock lower than 10 m, all of them located in Southern Brazil. Regarding water table depths, there is a clear gradient with higher depths on the eastern coast of Brazil (i.e., exceeding 30 m deep) to lower depths towards the Amazon (i.e., less than 10 m deep). Amongst all six soil characteristics, water table depth has the lowest correspondence to climate, i.e., to mean precipitation and aridity index. It is mostly correlated with catchment slopes, as previously indicated by Fan et al. (2013).

Data and methods for consumptive water use
We computed four indices of human intervention in the catchments (Table 8). Two are the total consumptive water use in the catchment in 2017, one normalized by catchment area and the other normalized by mean annual streamflow. Consumptive water use refers to water withdrawals that do not return to the catchment, for example, by evaporating, transpiring, or being incorporated into manufactured products. The water uses are based on the Manual of Consumptive Water Use in Brazil (ANA, 2019c), which estimated the monthly water use of each municipality in Brazil. These estimates are the sum of water demands from six categories:
(i) Irrigation: demand based on water balance models that estimate the quantity of water needed by irrigated crops but not supplied by precipitation or soil moisture (ANA, 2019c). The spatial extent of irrigated croplands was characterized using the national censuses of agriculture (e.g., IBGE, 2007) and remote-sensing images from Landsat, Sentinel-2, and Moderate Resolution Imaging Spectroradiometer (MODIS) (ANA, 2019b).
(ii) Livestock: demand estimated by multiplying the number of livestock units by their corresponding daily drinking requirements. The number and type of livestock of each municipality were derived from the national censuses of agriculture.
(iii) Households: demand estimated by multiplying the number of people in a municipality by their per capita domestic water use.
(iv) Industry: demand estimated by multiplying the number of employees in several industrial categories from each municipality by the per capita water use of each category.
(v) Mining: demand estimated by combining the water use coefficient with the annual production of several types of mineral extraction.
(vi) Thermoelectricity: demand estimated by applying a water use coefficient to the annual electricity production of each thermoelectric plant in the country.
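The bookkeeping behind these six categories can be sketched as follows. This is only an illustration under stated assumptions, not the ANA (2019c) procedure: the function names, the coefficient values, and the example municipality are hypothetical, all quantities are assumed to be already converted to a common unit (m³ per day), and the irrigation demand is taken as a given input because it comes from water balance models.

```python
# Hypothetical sketch: none of the coefficients below are ANA (2019c) values.

def count_based_demand(counts, unit_use):
    """Sum of (count x per-unit water use) over the types listed in `counts`."""
    return sum(n * unit_use[kind] for kind, n in counts.items())


def municipal_demand(irrigation, livestock, population, per_capita_use,
                     employees, production, energy, coeffs):
    """Return the demand of each category and their total (same unit for all)."""
    demand = {
        "irrigation": irrigation,  # taken as given (water balance model output)
        "livestock": count_based_demand(livestock, coeffs["livestock"]),
        "households": population * per_capita_use,
        "industry": count_based_demand(employees, coeffs["industry"]),
        "mining": count_based_demand(production, coeffs["mining"]),
        "thermoelectricity": count_based_demand(energy, coeffs["thermo"]),
    }
    demand["total"] = sum(demand.values())
    return demand


# Hypothetical coefficients and inputs for one municipality.
coeffs = {
    "livestock": {"cattle": 0.05, "pigs": 0.0125},
    "industry": {"food": 0.5},
    "mining": {"iron_ore": 0.25},
    "thermo": {"gas": 2.0},
}
example = municipal_demand(
    irrigation=1000.0,
    livestock={"cattle": 2000, "pigs": 800},
    population=10000,
    per_capita_use=0.125,
    employees={"food": 100},
    production={"iron_ore": 400.0},
    energy={"gas": 50.0},
    coeffs=coeffs,
)
print(example)
```

Keeping the per-category values alongside the total makes it possible to see which category dominates the demand of a municipality.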
These water demand estimates do not differentiate surface water from groundwater. Even though groundwater abstraction is extensive in the eastern part of Brazil (Fan et al., 2013), it is estimated that most of the water use in South America comes from surface water (Wada et al., 2014). To estimate the consumptive water use of each catchment, we divided the values of each municipality by its area. We assumed the water use to be spatially homogeneous throughout the municipality territory and transferred the data for each municipality onto a 500 m spatial resolution raster.
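One way to picture this step is the sketch below. It assumes a hypothetical toy grid in which each 500 m cell carries the identifier of the municipality it falls in and a flag indicating whether it lies inside the catchment. The function names, the grid, and the numbers are illustrative only and do not reproduce the processing used for CAMELS-BR.

```python
CELL_AREA_KM2 = 0.5 * 0.5  # area of one 500 m x 500 m raster cell


def use_per_km2(use_by_muni, area_by_muni_km2):
    """Divide each municipality's water use by its area (spatially uniform use)."""
    return {m: use_by_muni[m] / area_by_muni_km2[m] for m in use_by_muni}


def catchment_totals(muni_grid, catchment_mask, intensity):
    """Accumulate water use and area over the raster cells inside the catchment."""
    total_use = 0.0
    n_cells = 0
    for muni_row, mask_row in zip(muni_grid, catchment_mask):
        for muni, inside in zip(muni_row, mask_row):
            if inside:
                total_use += intensity[muni] * CELL_AREA_KM2
                n_cells += 1
    return total_use, n_cells * CELL_AREA_KM2


def water_use_indices(total_use, catchment_area_km2, mean_annual_flow):
    """Return water use per km2 and as a percentage of mean annual flow."""
    return total_use / catchment_area_km2, 100.0 * total_use / mean_annual_flow


# Hypothetical toy example: two municipalities on a 2 x 3 grid.
intensity = use_per_km2({"A": 1000.0, "B": 400.0}, {"A": 10.0, "B": 2.0})
muni_grid = [["A", "A", "B"], ["A", "B", "B"]]
catchment_mask = [[True, True, False], [False, True, True]]
total_use, area_km2 = catchment_totals(muni_grid, catchment_mask, intensity)
per_area, pct_of_flow = water_use_indices(total_use, area_km2, mean_annual_flow=3000.0)
print(total_use, area_km2, per_area, pct_of_flow)
```

The two returned indices correspond to the two normalizations described above, by catchment area and by mean annual streamflow, provided the water use and the flow are expressed in the same volume units.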
There are three major limitations to using the consumptive water use estimated by ANA (2019c). First, evaporation from artificial reservoirs was not included in the computation. Thus, water use might be underestimated, particularly in the northeastern part of Brazil, i.e., in the driest part of the country. Second, the dataset comes on an irregular grid, since municipality areas vary significantly. The smallest municipalities are usually within 500 km of the coast and their areas are mostly a few hundred km². In contrast, the western part of Brazil and the Amazon usually have municipalities larger than a thousand km². Hence, the consumptive water use of small catchments in the western part of Brazil should be interpreted with caution because these catchments are smaller than the municipalities on which the input data are based. The third limitation is that the consumptive water use of South America outside Brazil was not estimated by ANA (2019c) and was not considered in this study. This particularly affects the basins in the Amazon, since they extend over large areas outside Brazil. That said, anthropogenic intervention in these basins is low: only three basins with international borders in the Amazon are more than 10 % covered by croplands or by a mosaic of croplands and natural vegetation, and none has more than 0.05 % of impervious land cover such as urban areas.

Data and methods for reservoirs
The other two indices for human intervention are related to flow regulation (Table 8), i.e., the sum of the total storage capacity of all reservoirs in the catchment and its ratio to the total annual flow of the catchment (i.e., the degree of regulation). We combined three reservoir databases: the global GRanD v1.3 database and the ONS (2019) and ANA (2018) databases for Brazil. The procedure for combining the three databases was:
(i) We included all reservoirs from GRanD v1.3 in South America.
(ii) For each GRanD reservoir, we visually compared the inundated area with the one indicated by the polygons from the water bodies maps from Pekel et al. (2016). When the inundated areas differed substantially, we substituted the former with the latter and updated the size of the inundated area.
(iii) Out of more than 24,000 reservoirs from the ONS (2019) and ANA (2018) databases, we included only those that have their inundated areas (Pekel et al., 2016) visible at the 1:500,000 scale. Although our goal was to only include reservoirs larger than approximately 0.5 km², some smaller reservoirs were also included. We computed the size of the inundated areas of those reservoirs according to the polygons from Pekel et al. (2016).
(iv) To check for duplicates in the databases, we manually inspected all dam points and their inundated areas.
(v) Finally, the storage capacities of reservoirs updated in step (ii) or included in step (iii) were recalculated using their inundated areas and, when available, information on dam height. We applied two equations determined by Lehner et al. (2011, Technical Document) with a statistical regression using data from 5824 reservoirs worldwide. When information on dam height was available, we applied Eq. (1):
$$V = 0.678\,(A\,h)^{0.9229} \qquad (1)$$
where V is the reservoir storage capacity in 10⁶ m³, A is the size of the inundated area in km², and h is the dam height in m. When information on dam height was not available, we used Eq. (2):
$$V = 30.684\,A^{0.9578} \qquad (2)$$
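Step (v) and the resulting degree of regulation can be illustrated with the short sketch below. The function names and the example reservoirs are hypothetical. The coefficients of Eq. (2) are those given above, whereas the coefficients used for Eq. (1) are assumed from the GRanD technical documentation (Lehner et al., 2011) and should be verified against that source before any use.

```python
def storage_capacity(area_km2, dam_height_m=None):
    """Estimate reservoir storage capacity V (10^6 m3) from inundated area A (km2)."""
    if area_km2 <= 0:
        raise ValueError("inundated area must be positive")
    if dam_height_m is not None:
        # Eq. (1): coefficients assumed from Lehner et al. (2011), to be verified.
        return 0.678 * (area_km2 * dam_height_m) ** 0.9229
    # Eq. (2): used when dam height is not available.
    return 30.684 * area_km2 ** 0.9578


def degree_of_regulation(capacities, annual_flow):
    """Total storage capacity divided by total annual flow (same volume units)."""
    return sum(capacities) / annual_flow


# Hypothetical reservoirs: (inundated area in km2, dam height in m or None).
reservoirs = [(2.0, 25.0), (0.8, None)]
capacities = [storage_capacity(a, h) for a, h in reservoirs]
print(capacities, degree_of_regulation(capacities, annual_flow=500.0))
```

A degree of regulation of 0.1 returned by this sketch corresponds to 10 % when expressed as a percentage.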

Spatial variability in human intervention indices
The spatial distribution of human intervention indices reveals that, unlike the catchments in the original CAMELS for the United States, catchments in CAMELS-BR can be significantly impacted by human activities. In 17.8 % of the catchments, annual consumptive water use is greater than 5 % of the mean annual flow. Those catchments are principally in the driest parts of the country, i.e., in the São Francisco, Eastern Atlantic, Eastern Northeast Atlantic, and upper Paraná hydrographic regions (Fig. 10b). Nevertheless, water uses greater than 20 % of the mean annual flow are rare, occurring in only 3.9 % of the catchments.
The correspondence between arid climates and high consumptive water uses may be attributed to two main causes. First, in the most arid catchments, the mean annual flow is typically a third of that of the rest of the country, which, unsurprisingly, leads to higher water uses in proportion to the annual flow. Second, crops in drier climates require frequent irrigation and considerable rates of water withdrawal. On the other hand, we observe that the central and southeastern regions of Brazil have the greatest values of water use normalized by catchment area (Fig. 10a). Catchments in those regions are commonly occupied by either irrigated croplands or populous metropolitan areas, which are respectively the first and second categories with the highest water demands in Brazil (ANA, 2019c).
The degree of regulation is related to catchment area (Fig. 10c), meaning that the most regulated catchments are located downstream in the river basins. The main rivers with high regulation are the Paraná, Uruguay, São Francisco, Tocantins-Araguaia, Parnaíba, and Paraíba do Sul rivers. In those regions, 19.2 % and 7.2 % of the catchments have a degree of regulation greater than 10 % and 50 %, respectively. These values nearly double in the driest regions of the country (i.e., the Eastern Atlantic, São Francisco, Eastern Northeast Atlantic, and Parnaíba hydrographic regions): 37.6 % and 22.1 % of the catchments have a degree of regulation greater than 10 % and 50 %, respectively. Therefore, the driest catchments of the CAMELS-BR dataset have the highest human intervention rates, both in terms of consumptive water use and reservoir regulation.

Data availability
The CAMELS-BR dataset is freely available at https://doi.org/10.5281/zenodo.3709337 (Chagas et al., 2020). The files provided are (i) the 65 attributes in a zip file, (ii) the daily time series in zip files, (iii) the catchment boundaries used to compute the attributes and extract the time series, computed by Do et al. (2018) and Gudmundsson et al. (2018), and (iv) a readme file.

Conclusions
So far, large-sample hydrological studies in Brazil have lacked a comprehensive and easily accessible dataset. Here, we introduced CAMELS-BR, a new dataset comprising streamflow time series for 3,679 catchments in Brazil and, for a selected quality-controlled set of 897 catchments, meteorological time series and 65 catchment attributes. The attributes cover a wide range of fundamental properties for large-sample hydrological research, such as topography, land cover, geology, soil, and human intervention characteristics. We strived to make CAMELS-BR as comparable as possible to the other CAMELS datasets (Addor et al., 2017; Alvarez-Garreton et al., 2018) by using common naming conventions, scripts, and datasets. We also discussed the major limitations of the data to limit the risk of misinterpretation and misuse.
Even though CAMELS-BR is a step forward for hydrological research in Brazil, there are several opportunities for expanding the dataset in the future. For example, future versions of CAMELS-BR could include additional catchment attributes critical to understanding hydrological processes, such as drainage density and basin morphometry (Shen et al., 2017). Further, an updated version should better characterize heterogeneities within each catchment, both for the time series and attributes. Additionally, since data uncertainties are omnipresent (Montanari, 2007; Blöschl et al., 2019b; Addor et al., 2019), they should be further explored by including additional data sources.
By simplifying the access to hydrological data, we aim to encourage further large-sample hydrological studies in Brazil, to facilitate the inclusion of Brazilian catchments in global large-sample studies, and to increase the transparency and reproducibility of these studies. We believe the data introduced here will, in particular, prove useful to explore the drivers of catchment behavior, to anticipate hydrological changes, and to study the impacts of human activities on the water cycle. We see CAMELS-BR as a resource designed to serve the broad water science community and to help with water resources management at regional, national, and continental scales.

Author contribution
PC and VC initiated the investigation. VC, PC, NA, FF, AF, RP, and VS designed the study. VC processed the data and created the figures. VC and NA computed the catchment attributes. VC prepared the manuscript with contributions from all co-authors.

The grey line indicates the limits of hydrographic regions.