Monthly gridded data product of northern wetland methane emissions based on upscaling eddy covariance observations

. Natural wetlands constitute the largest and most uncertain source of methane (CH 4 ) to the atmosphere and a large fraction of them are found in the northern latitudes. These emissions are typically estimated using process (“bottom-up”) or inversion (“top-down”) models. However, estimates from these two types of models are not independent of each other since the top-down estimates usually rely on the a priori estimation of these emissions obtained with process models. Hence, independent spatially explicit validation data are needed. Here we utilize a random forest (RF) machine-learning technique to upscale CH 4 eddy covariance ﬂux measurements from 25 sites to estimate CH 4 wetland emissions from the northern latitudes (north of 45 ◦ N). Eddy covariance data from 2005 to 2016 are used for model development. The model is then used to predict emissions during 2013 and 2014. The predictive performance of the RF model is evaluated using a leave-one-site-out cross-validation scheme. The performance (Nash–Sutcliffe model efﬁciency = 0 . 47) is comparable to previous studies upscaling net ecosystem exchange of carbon dioxide and studies comparing process model output against site-level CH 4 emission data. The global distribution of wetlands is one major source of uncertainty for upscaling CH 4 . Thus, three wetland distribution maps are utilized in the upscaling. Depending on the wetland distribution map, the annual emissions for the northern wetlands yield 32 (22.3–41.2, 95 % conﬁdence interval calculated from a RF model ensemble), 31 (21.4–39.9) or 38 (25.9–49.5) Tg(CH 4 ) yr − 1 . To further evaluate the uncertainties of the upscaled CH 4 ﬂux data products we also compared them against output from two process models (LPX-Bern and WetCHARTs), and methodological issues related to CH 4 ﬂux upscaling are discussed. The monthly upscaled CH 4 ﬂux data products are available at https://doi.org/10.5281/zenodo.2560163 (Peltola et al., 2019).


Introduction
Methane (CH 4 ) is the second most important anthropogenic greenhouse gas (GHG) in terms of radiative forcing after carbon dioxide (CO 2 ): 34 times (GWP 100 , including climatecarbon feedbacks) as strong as CO 2 (Ciais et al., 2013).CH 4 has contributed ∼ 20% to the cumulative GHG-related global warming (Etminan et al., 2016).Deriving constraints on CH 4 sources and sinks is thus of utmost importance.The net atmospheric CH 4 budget is well constrained by precise CH 4 mole fraction measurements around the globe, yet the contribution of individual sources and sinks to this aggregated budget remains poorly understood.This is primarily due to lack of data to constrain the modelling results (Saunois et al., 2016).In order to make more accurate predictions of the atmospheric CH 4 budget in a changing climate, the response of the various sources and sinks to different drivers needs to be better identified and quantified.
Natural wetlands are the largest and quantitatively most uncertain source of CH 4 to the atmosphere (Saunois et al., 2016).An ensemble of land surface models estimated global CH 4 emissions from wetlands for the period 2003-2012 to be 185 Tg(CH 4 ) yr −1 (range 153-227 Tg(CH 4 ) yr −1 ) and for the same period inversion models estimated it to be 167 Tg(CH 4 ) yr −1 (range 127-202 Tg(CH 4 ) yr −1 ) (Saunois et al., 2016).This discrepancy between bottom-up (process model) and top-down (inversion model) estimates, as well as the range of variability, exemplifies the large uncertainty of the current estimate for natural wetland CH 4 emissions.Sources of this uncertainty can be roughly divided into two categories: (1) uncertainty related to the global areal extent of wetlands (e.g.Petrescu et al., 2010;Bloom et al., 2017a;Zhang et al., 2016) and (2) uncertainties related to the key CH 4 emission drivers and responses to these drivers (e.g.Bloom et al., 2017a;Saunois et al., 2017).Evaluation of the emission estimates is thus urgently needed, and results from these efforts will lead to refined process models.Pro-cess model improvements will also directly affect the uncertainty of inversion results since they provide important a priori information to the inversion models (Bergamaschi et al., 2013).
Boreal and arctic wetlands comprise up to 50 % of the total global wetland area (e.g.Lehner and Döll, 2004) and the wetlands in these northern latitudes substantially contribute to total terrestrial wetland CH 4 emissions (ca.27 %, based on the sum of regional budgets for boreal North America, Europe and Russia in Saunois et al., 2016).In wetlands, CH 4 is produced by methanogenic Archaea under anaerobic conditions, and hence the production takes place predominantly under water-saturated conditions (e.g.Whalen, 2005).The microbial activity and the resulting CH 4 production is thus controlled by the quality and quantity of the available substrates, competing electron acceptors, and temperature (Le Mer and Roger, 2001).Once produced, the CH 4 can be emitted to the atmosphere via three pathways: ebullition, molecular diffusion through soil matrix and water column, or plant transport.If plants capable of transporting CH 4 are present, plant transport is generally the dominating emission pathway (Knoblauch et al., 2015;Kwon et al., 2017;Waddington et al., 1996;Whiting and Chanton, 1992).A large fraction of CH 4 transported via molecular diffusion is oxidized into CO 2 by methanotrophic bacteria in the aerobic layers of wetland soils and hence never reaches the atmosphere (Sundh et al., 1995), whereas CH 4 transported via ebullition and plant transport can largely bypass oxidation (Le Mer and Roger, 2001;McEwing et al., 2015).Furthermore, processes related to permafrost (e.g.active layer, thermokarst) and snow cover dynamics (e.g.snow melt, insulation) have an impact on CH 4 flux seasonality and variability (Friborg et al., 1997;Helbig et al., 2017;Mastepanov et al., 2008;Zona et al., 2016;Zhao et al., 2016).Hence, wetland CH 4 emissions to the atmosphere largely depend on interplay between various controls, including water table position, temperature, vegetation composition, methane consumption, availability of substrates and competing electron acceptors.
During the past 2 decades, eddy covariance (EC) measurements of wetland CH 4 emissions have become more common, due to rapid development in sensor technology (e.g.Detto et al., 2011;Peltola et al., 2013Peltola et al., , 2014)).The latest generation of low-power and low-maintenance instruments are rugged enough for long-term field deployment (Nemitz et al., 2018;McDermitt et al., 2010); thus, the number of sites where CH 4 flux measurements have been made is increasing.Due to this progress, EC CH 4 flux synthesis studies have been emerging (Petrescu et al., 2015;Knox et al., 2019).Similar progress was made with CO 2 and energy flux measurements in the 1990s and now these measurements form the backbone of the global EC measurement network FLUXNET (https://fluxnet.fluxdata.org/,last access: 6 August 2019), whose data have provided invaluable insights into terrestrial carbon and water cycles.Some of the most important results have been obtained by upscaling FLUXNET observations using machine-learning algorithms to evaluate terrestrial carbon balance components and evapotranspiration (Beer et al., 2010;Bodesheim et al., 2018;Jung et al., 2010Jung et al., , 2011Jung et al., , 2017;;Mahecha et al., 2010).These results are now widely used by the modelling community to evaluate process model performance (e.g.Wu et al., 2017) and to validate satellite-derived carbon cycle data products (e.g.Sun et al., 2017;Y. Zhang et al., 2017).
In this study, we synthesized EC CH 4 flux data from 25 EC CH 4 flux sites and developed an observation-based monthly gridded data product of northern wetland CH 4 emissions.We focus on northern wetlands (north of 45 • N) due to their significance in the global CH 4 budget and relatively good data coverage and process understanding, at least compared to tropical systems (Knox et al., 2019).High-latitude regions are projected to warm during the next century at a faster rate than any other region, which will likely significantly impact the carbon cycling of wetland ecosystems (Tarnocai, 2009;Z. Zhang et al., 2017) and permafrost areas of the Arctic boreal region (Schuur et al., 2015).To date, CH 4 emission estimates for northern wetlands are typically based on process models (Bohn et al., 2015;Bloom et al., 2017a;Chen et al., 2015;Melton et al., 2013;Stocker et al., 2013;Wania et al., 2010;Watts et al., 2014;Zhang et al., 2016) or inversion modelling (Bohn et al., 2015;Bruhwiler et al., 2014;Spahni et al., 2011;Thompson et al., 2017;Thonat et al., 2017;Warwick et al., 2016), yet scaling of existing chamber measurements to the northern wetland area has also been published (Zhu et al., 2013).However, CH 4 emission estimates obtained with the former two approaches are not independent since the attribution of CH 4 emissions derived using inversion models to different emission sources (e.g.wetlands) depends largely on a priori estimates of these emissions (i.e.process models for wetland emissions), highlighting the tight coupling between these two approaches (Bergamaschi et al., 2013;Spahni et al., 2011).Hence, the main objective of this study is to produce an independent data-driven estimate of northern wetland CH 4 emissions.This product could be used as an additional constraint for the wetland emissions and hence aid in process model refinement and development.Additionally, the drivers causing CH 4 flux variability at the ecosystem scale are also evaluated and methodological issues are discussed, which will support future CH 4 wetland flux upscaling studies.

Materials and methods
Data from flux measurement sites (Fig. 1) were acquired and used together with forcing data to estimate CH 4 emissions from northern wetlands with a monthly time resolution using a random forest (RF) modelling approach.Both in situ measurements and remote sensing are utilized in this study.In this section, the RF approach is briefly introduced (Sect.2.1)  and data selection, quality filtering, gap filling and aggregation to monthly values are described (Sect.2.3).We identified 40.7 site years available for analysis, measured between 2005 and 2016.To perform upscaling to all wetlands north of 45 • N, gridded data products of the flux drivers and wetland distribution maps were needed.These products are presented in Sect.2.4 and 2.5, respectively.Finally, the upscaled wetland CH 4 emissions are compared against process model outputs, with the models briefly described in Sect.2.6.
Here, wetlands are defined as terrestrial ecosystems with water table positions near the land surface and with plants that have adapted to these waterlogged conditions.We exclude lakes, reservoirs and rivers from the study, in addition to ecosystems with significant human influence (e.g.drainage, rewetting).We consider peat-forming wetlands (i.e.mires), which can be further classified as fens and bogs based on hydrology, as well as wetlands with hydric mineral soils.Tundra wetlands may have only a shallow peat layer or none at all.Unified classifications for wetlands are still lacking, and typically different countries follow their own classification scheme, albeit some joint classification schema have been developed (e.g.Ramsar Classification System for Wetland Type).

Random forest algorithm
Random forest (RF) is a machine-learning algorithm that can be used for classification or regression analyses (Breiman, 2001).In this study the RF models consist of a large ensemble of regression trees.Each individual regression tree is built by training it with a random subset of training data and the trees are trained independently of each other.The RF model output is then the average of all the predictions made by individual regression trees in the forest.Hence, the RF algorithm applies the bootstrap aggregation (bagging) algorithm and takes full advantage of the fact that ensemble averaging decreases the noise of the prediction.In addition to random selection of training data, the predictor variables used in split nodes are also selected from a random sample of all predictors, which minimizes the possible correlation between trees in the forest (Breiman, 2001) and decreases the possibility of overfitting.The predictor variables can be either categorical or continuous.The variables are then used in the split nodes to divide the data into two (e.g.categorical variable true or false or a continuous variable, such as temperature above or below 5 • C).
Performance of RF algorithms to predict CO 2 and energy fluxes across FLUXNET sites have been compared against other machine-learning algorithms, such as artificial neural networks and multivariate regression splines, by Tramontana et al. (2016), who showed that differences between methods were negligible.We anticipate a similarly negligible effect of machine-learning algorithm choice for CH 4 fluxes.For a thorough description of the RF algorithm for flux upscaling purposes, the reader is referred to Bodesheim et al. (2018) (and references therein).
In this study, the RF models were developed using the MATLAB 9.4.0 (R2018a) TreeBagger function with default values similar to those of Bodesheim et al. (2018).These settings included a minimum of five samples in a leaf node and used mean squared error (MSE) as a metric for deciding the split criterion in split nodes.Each trained forest consisted of 300 randomized regression trees.

RF model development for CH 4 flux gap filling
Our RF algorithm was used for gap filling the daily CH 4 flux time series.The performance of the RF model was evaluated against "out-of-bag" (OOB) data (approximately one-third of data for each tree).Since each individual tree in the RF model was trained using a subset of training data, the rest of the data (i.e.OOB data) can be used as independent validation data to evaluate the prediction performance of that particular regression tree and hence the whole forest (Breiman, 2001).Only the five most important predictors were retained for the gap-filling models for each site.The relative importance of predictors (e.g.air temperature) was evaluated by randomly shuffling the predictor data and then estimating the increase in MSE when model output is compared against OOB data (Breiman, 2001).For important predictors, MSE will increase significantly due to shuffling, whereas the effect of shuffling on MSE is minor for less important predictors.Note that this procedure was executed separately for each site, and thus different predictors may have been used for different sites for gap filling.

RF model development for CH 4 flux upscaling
For upscaling purposes, one RF model was developed using all the available data in order to maximize the information content for the global (> 45 • N) CH 4 flux map.The model performance or uncertainty, however, was evaluated using two approaches.(1) The predictive performance of the model was assessed using the widely used "leave-one-siteout" cross-validation scheme (e.g.Jung et al., 2011).In order to avoid correlation between training data and validation data, sites located nearby (closer than 100 km) were excluded from the training data (Roberts et al., 2016).(2) The uncertainty of the upscaled fluxes was estimated by bootstrapping.The 200 independent RF models were trained using a bootstrap sample of the available data.This yielded 200 predictions for each grid cell and time step in the upscaled CH 4 flux map.The variability over this prediction ensemble was used as an uncertainty measure following, e.g.Aalto et al. (2018) and Zhu et al. (2013).This uncertainty estimate reflects the ability of the RF model to capture the dependence of CH 4 flux on the used predictors in the available data.However, it does not have any reference to actual in situ CH 4 fluxes, unlike the model predictive performance estimated with crossvalidation.
Predictors for the RF model used in the upscaling were determined following Moffat et al. (2010).First, the RF models were trained for each site using one predictor at a time (see all the predictors in Table 1).The single predictor which yielded the best match against validation data (leave-onesite-out scheme) was selected as the primary driver.Then, the RF models were trained again with the primary driver, plus each of the other predictors in turn as secondary drivers.Then the RF model performance was again evaluated and the best predictor pair was selected for the next round.This procedure was continued until all the predictors were included in the RF model.The smallest set of predictors capable of producing optimal RF model performance was used for flux upscaling.

Metrics for model performance evaluation
The RF model performance was evaluated against independent validation data using a set of statistical metrics, which were related to different aspects of model performance.During the RF model training MSE was optimized as follows: where o and p are vectors containing the observed and predicted values, respectively, and the overbar denotes averaging.
The Nash-Sutcliffe model efficiency (NSE; Nash and Sutcliffe, 1970) was used to evaluate how well the model was able to predict validation data when compared against a ref-erence (typically the mean of the validation data): where i is index running over all the n values in the o and p vectors.When NSE is equal to 1, there is a perfect match between prediction and observations.Values above 0 imply that the model predicts the observations better than the mean of observations and values below 0 indicate that the predictive capacity of the model is worse than the mean of validation data.Note that NSE calculated with Eq. ( 2) above is equivalent to the coefficient of determination calculated using residual sum of squares and total sum of squares.However, following the approach used in previous upscaling studies (e.g.Bodesheim et al., 2018;Tramontana et al., 2016), we opted to call this metric NSE.Instead, the coefficient of determination (R 2 ) was estimated as the squared Pearson correlation coefficient.Note that R 2 and NSE are equal when there is no bias between o and p and the residuals follow Gaussian distribution.Pearson correlation coefficients obtained with different model runs are compared using Fisher's r to z transformation.
The standard deviation (σ ) of the model residuals was used to evaluate the spread of model residual values (RE): whereas biases between model predictions and validation data were used to estimate the systematic uncertainty in the upscaled fluxes (BE): Note that RE equals RMSE when there is no systematic difference between the model predictions and observations (i.e. when BE equals zero).

Data from eddy covariance flux measurement sites
Data were acquired from 25 sites that (1) measure CH 4 fluxes with the EC technique, (2) are located north of 45 • N and (3) are wetlands as defined above and without substantial human influence on ecosystem functioning (see the site locations in Fig. 1 and the site list in Appendix A).The sites were evenly distributed among wetland types, including fens (n = 9), bogs (7), and wet tundra (9), as well as among different biomes, including tundra (11), boreal (8), and temperate (6) biomes, as defined in Olson et al., (2001).At 15 of the 25 sites, sedges (e.g.Rhynchospora alba, Eriophorum vaginatum, Carex limosa) were the dominant vascular plant functional type in the flux measurement source area.Most of the sites (18 out of 25) were located north of 60 • N and the highest densities of sites were in Fennoscandia and Alaska (Fig. 1).The magnitude of monthly CH 4 flux data varied between sites and the median time series length was 14.5 months of CH 4 flux data per site.Overall, the dataset spanned between 2005 and 2016.The sites represent northern wetlands sufficiently well to create an upscaled CH 4 flux product based on EC data.Sites are referred to with their FLUXNET IDs and if these were not available then new temporary site IDs were generated for this study (see Appendix A).Site principal investigators (PIs) provided CH 4 fluxes and their potential drivers (air temperature and pressure, precipitation, wind speed and direction, friction velocity, net ecosystem exchange of CO 2 and its components -i.e.canopy photosynthesis and ecosystem respiration -photosynthetically active radiation, water table depth, and soil temperature) .However, out of the in situ measurements only air temperature and precipitation were used for developing the RF model for flux upscaling since gridded data products of the other potentially important drivers were not readily available and/or the data for the other drivers were missing from several sites.
The 30 min averaged flux data were acquired from 21 sites and daily data were provided for 4 sites.The flux time series were quality filtered by removing fluxes with the worst quality flag (based on 0,1,2 flagging scheme, Mauder et al., 2013) and with friction velocity below a site-specific threshold (if friction velocity and threshold were available for the site).After filtering, daily medians were calculated if the daily data coverage was above 29 out of 48 half-hourly data points (daily data coverage at a minimum of 10 data points for sites without a diel pattern in CH 4 flux) and no gap filling was done to the time series prior to calculation of daily values.While this may cause slight systematic bias in the daily flux values, this bias is unlikely to be significant because the mag-nitude of diel patterns in CH 4 fluxes is typically moderate (e.g.Long et al., 2010) or negligible (e.g.Rinne et al., 2018), although at sites with Phragmites cover a relatively strong diurnal cycle can be observed (e.g.Kim et al., 1999;Kowalska et al., 2013).
Unlike the CH 4 flux data, the other in situ data from the sites were gap filled prior to the calculation of daily values.The gap filling was done only if the daily data coverage was above 60 % and days with lower data coverage, no daily values calculated.Shorter gaps (< 2 h) were filled with linear interpolation, whereas longer gaps (between 2 to 14.5 h) were replaced with mean diurnal variation within a 30 d moving window.However, for precipitation, daily sums were calculated without any gap filling.Besides the measurements at the sites, potential solar radiation (R pot ) and its time derivative (der(R pot )) were calculated based on latitude and time of measurement.In order to remove the R pot latitudinal dependence it was normalized to be between 0 and 1 before usage.
CH 4 flux drivers measured in situ, in addition to the remote-sensing data (Sect.2.3.2), were used for the gap filling of CH 4 time series with the RF algorithm (Sect.2.1.1).For each site the gap-filling models generally agreed well with the independent validation data (mean NSE = 0.74 and mean RMSE = 9 nmol m −2 s −1 ).After gap filling, the CH 4 flux time series were aggregated to monthly values if the monthly data coverage prior to gap filling was at least 20 %.
The daily time series of air temperature and precipitation measured at the sites were gap filled using the WATCH Forcing Data methodology applied to ERA-Interim (WFDEI) data (Weedon et al., 2014).Prior to using the WFDEI data for gap filling, the data were bias-corrected for each site as is typically done for climate or weather reanalysis data Earth Syst.Sci.Data, 11, 1263Data, 11, -1289Data, 11, , 2019 www.earth-syst-sci-data.net/11/1263/2019/ (e.g.Räisänen and Räty, 2013;Räty et al., 2014).For precipitation, the mean of WFDEI data were simply adjusted to match site mean precipitation.For air temperature, the bias correction was done for each month separately using quantile mapping with smoothing within a moving 7-month window.Quantile mapping compares the cumulative distribution functions (CDFs) of WFDEI and site measurements against each other and adjusts the WFDEI data so that after adjustment its CDF matches with the CDF of the site measurements (e.g.Räisänen and Räty, 2013).After gap filling the daily time series with WFDEI data, monthly and annual precipitation were calculated, in addition to monthly mean air temperature.

Remote-sensing data
Several data products from the Moderate Resolution Imaging Spectrometer (MODIS) were used in this study to derive various driving variables.For RF model development the following data products at 500 or 1000 m spatial resolution were used: MOD10A1 snow cover (Hall and Rigs, 2016), MOD11A2 daytime and night-time land surface temperature (LSTd and LSTn, Wan et al., 2015a), MOD13A3 enhanced vegetation index (EVI, Didan, 2015), and MOD09A1 surface reflectance (Vermote, 2015).More elaborate data products estimating ecosystem gross primary productivity (GPP) and net primary productivity (NPP; MOD17) were not included here for two reasons: (1) many of the sites included here were misclassified in the land cover map used in MOD17 (e.g. as woody savanna), hence severely influencing the estimated GPP and NPP (Zhao et al., 2005), and (2) sites that were correctly classified as permanent wetlands were in fact assigned a fill value and removed from the product since the product is not strictly valid for these areas (Lees et al., 2018).All the remote-sensing data products were quality filtered using the quality flags provided along with the data.The MODIS snow cover ranged from 0 (no snow) to 100 (full snow cover) and was converted to a simple snow cover flag (SC) consisting of 0 and 1 depending whether the snow cover data were below or above 50, respectively.A vector containing days since snow melt (DSSM) was calculated using the snow cover flag and normalized to 0 (beginning) and 1 (end) for each growing season (Mastepanov et al., 2013).The MOD09A1 surface reflectance at bands 2 (841-876 nm) and 5 (1230-1250 nm) were used to calculate the simple ratio water index (SRWI = band 2/band 5), following Zarco-Tejada and Ustin (2001).SRWI showed spurious values when there was snow cover, and hence these points were replaced with the mean SRWI observed at each site when there was no snow.Meingast et al. (2014) showed that SRWI can be used as a proxy for wetland water table depth, although their results were affected by changes in vegetation cover, which might hinder across-site comparability in this study.Additionally, following the temperature and greenness modelling approach (Sims et al., 2008), a product of EVI and LSTd was included in the analysis as a proxy for GPP, following a previous peatland study (Schubert et al., 2010).The remote-sensing data were provided with daily (MOD10A1), 8 d (MOD09A1, MOD11A2) or monthly (MOD13A3) time resolution and the data were aggregated to monthly means prior to usage.

Additional categorical variables
The sites were also classified based on the presence of permafrost in the source area (present or absent) and according to biome type.Biome types (temperate, boreal and tundra) were determined from Olson et al. (2001) and the information about the permafrost was provided by the site PIs.Furthermore, the data were categorized based on wetland type and sedge cover as in Treat et al. (2018) and Turetsky et al. (2014).However, such information is not available in the gridded format needed for upscaling; nevertheless, inclusion of these variables can be used to assess how much they increase the predictive performance of the model.

Gridded datasets used in flux upscaling
For upscaling CH 4 fluxes using the developed RF model, the LST data were acquired from the aggregated product MOD11C3 (Wan et al., 2015b) and snow cover data from MOD10CM (Hall and Riggs, 2018).Distribution of permafrost in the northern latitudes were estimated using the circum-Arctic map of permafrost derived by National Snow and Ice Data Center (Brown et al., 2002).The resolution of the gridded data was adjusted to match the resolution of the wetland maps using bilinear interpolation if needed.Additionally, land and ocean masks (Jet Propulsion Laboratory, 2013) were utilized when processing the gridded datasets.

Wetland maps
Upscaled fluxes were initially estimated in flux densities per wetland area, i.e. amount of CH 4 per area of wetland per unit of time.To create a gridded product of CH 4 emissions from the northern wetlands, these upscaled flux densities were converted into (amount of CH 4 ) per (grid cell area) per (unit of time) using different wetland maps.Wetland mapping is an ongoing field of research and the usage of different wetland maps contributes to the uncertainty of global wetland CH 4 emission estimates (e.g.Bloom et al., 2017a;Z. Zhang et al., 2017).Hence, three different wetland maps (PEATMAP, DYPTOP and GLWD) were used in this study to evaluate how much they affect the overall estimates of northern highlatitude wetland CH 4 emissions.
The recently developed static wetland map PEATMAP (Xu et al., 2018) combines detailed geospatial information from various sources to produce a global map of wetland extent.Here, the polygons in PEATMAP were converted to fractions of wetland in 0.5 • grid cells.While PEATMAP is focused on mapping peatlands, marshes and swamps (typically on mineral soil) are included in the product for certain areas in the northern latitudes.However, most of the wetlands in the northern latitudes are peatlands, and thus PEATMAP is suitable for our upscaling purposes.The dynamic wetland map estimated by the DYPTOP model (Stocker et al., 2014) was used by aggregating peat and inundated areas to form one dynamic wetland map with 1 • resolution.The widely used Global Lakes and Wetlands Database (GLWD, Lehner and Döll, 2004) is a static wetland map with 30 arcsec resolution and since it has been widely used here it provided a point of reference for the other two maps.The map was aggregated to 0.5 • resolution and lakes, reservoirs and rivers were excluded from the aggregated map.

Process models
The upscaled CH 4 fluxes were compared against the output from two process models: LPX-Bern (Spahni et al., 2013;Stocker et al., 2013;Zürcher et al., 2013) and the model ensemble WetCHARTs version 1.0 (Bloom et al., 2017a(Bloom et al., , 2017b)).LPX-Bern is a dynamic global vegetation model that models carbon and nitrogen cycling in terrestrial ecosystems.The model has a separate peatland module with peatlandspecific plant functional types (for more details, see Spahni et al., 2013).The wetland extent in LPX-Bern was dynamically estimated using the DYPTOP approach with 1 • resolution (Stocker et al., 2014).WetCHARTs combines several prescribed wetland maps with different gridded products for heterotrophic respiration and temperature sensitivity (Q 10 ) parameterizations for CH 4 production to form a model ensemble of wetland CH 4 emissions (Bloom et al., 2017b).
Here we used the extended ensemble of WetCHARTs.

Selecting the predictors for the RF model
The predictors in Table 1 were selected one by one using the procedure described in Sect.2.1.2.The order in which the predictors were selected is shown in Fig. 2. LSTn alone gave NSE = 0.29.After including the category permafrost presence and absence, R pot , SC and biome class increased NSE to 0.47.However, the influence of SC and biome class on the model performance was marginal based on the small increase in NSE.Additional predictors did not increase the model performance further because (1) they were strongly correlated with a predictor already included in the model (e.g.T air is correlated with LSTn) or (2) the predictors did not contain any information about CH 4 flux variability.The model response to predictors other than biome category was physically reasonable (e.g.permafrost and snow cover decrease fluxes, close to exponential dependence on LSTn), whereas the response to biome category was contrary to expectations.The RF model estimated the CH 4 flux magnitude from the different biomes to be in the following order: tundra < temperate < boreal.However, in prior studies it has been shown to be in the following order: tundra < boreal < temperate (Knox et al., 2019;Treat et al., 2018;Turetsky et al., 2014).This discrepancy may be due to the limited number of measurement sites and related sampling bias problems.Hence, in order not to upscale an incorrect pattern of decreasing CH 4 emissions when moving from boreal to temperate regions, the biome class was omitted from upscaling.In the subsequent analysis and flux upscaling only the four first predictors (LSTn, permafrost category, R pot and SC) are utilized.
We further tested whether information about wetland type or sedge cover would improve the model performance even though these categorical variables were not available in gridded format and hence were not usable for upscaling.Including the sedge flag increased the NSE to 0.53, although the increase in Pearson correlation was not statistically significant (p > 0.05, comparison of correlation coefficients using Fisher's r to z transformation).Also, wetland type did not have a statistically significant influence on the model performance (p > 0.05 and NSE = 0.49 if type included).Using Earth Syst.Sci.Data, 11, 1263Data, 11, -1289Data, 11, , 2019 www.earth-syst-sci-data.net/11/1263/2019/ too many categorical variables in a RF model may be problematic because each site may end up with a unique combination of categorical variables.
The most important predictor for the model was temperature, similar to numerous studies showing that wetland CH 4 emissions are strongly correlated to soil temperature (Christensen et al., 2003;Helbig et al., 2017;Jackowicz-Korczyński et al., 2010;Rinne et al., 2018;Yvon-Durocher et al., 2014;Knox et al., 2019).Selection of LSTn as the primary driver instead of the other temperature variables was likely an outcome of the available data and the algorithm used to select the drivers.With slightly different dataset (more sites) other temperature variables (e.g.T air ) might have been more important drivers for the CH 4 flux variability.Estimating apparent Q 10 from the RF model LSTn dependence yielded a value of 1.90 ± 0.03 and for validation data it was slightly higher (1.97 ± 0.06) (Fig. 3).These values are comparable to the ones reported in Turetsky et al. (2014) for CH 4 chamber measurements at bog and fen sites.The temperature dependence of CH 4 production is modelled in many process models with the parameter Q 10 value close to 2 (Xu et al., 2016b), which agrees with the CH 4 emission temperature dependence shown here.However, one should note that CH 4 oxidation also depends on temperature and the derived apparent Q 10 value describes the temperature dependence of surface CH 4 emission, which is always a combination of CH 4 production and oxidation.

Model agreement with validation data
The overall systematic bias (BE) between the RF predictions and validation data was negligible (Fig. 4), whereas the spread of the data (RE) was more pronounced (Fig. 4).Fol- When considering the model performance for each site separately, the agreement shows different characteristics (see Fig. 5 for four examples).For individual sites the magnitude of BE is typically somewhat higher (median of absolute value of BE approximately 11 nmol m −2 s −1 ), whereas RE is lower than for the overall agreement (median RE approximately 10 nmol m −2 s −1 ).These results indicate that the upscaled CH 4 fluxes have, in general, relatively low bias and high random error, whereas individual pixels in the upscaled CH 4 map may have higher bias but lower random error.
The mean annual cycle of CH 4 emission predicted by the RF model agrees well with the mean annual cycle cal- culated from the validation data (not shown).During the non-growing season the RF model slightly overestimates the fluxes (15 % overestimation) but such differences were negligible during the rest of the year (< 1 %).However, for individual sites CH 4 emission seasonality agrees less.For instance, at US-Los the modelled CH 4 emissions start to increase 1 month earlier in the spring (Fig. 5b).The nongrowing season fluxes are overestimated at four example sites (FI-Sii, US-Los, US-Atq and RU-Ch2; Fig. 5).The mean flux magnitude is modelled well at FI-Sii (Fig. 5a), whereas at US-Los (Fig. 5b) and US-Atq (Fig. 5c) the RF model overestimates and at RU-Ch2 (Fig. 5d) underestimates the CH 4 emissions.The flux bias had a relatively large impact on site-specific NSE.For example, for US-Atq NSE was −1.85, meaning that the observation mean would be a better predictor for this site than the RF model (see the NSE definition in Sect.2.2).The RF model is not able to replicate the between-year differences in CH 4 emissions at the example sites.Capturing interannual variability has also been difficult in previous upscaling studies of CO 2 and energy fluxes (e.g.Tramontana et al., 2016).
In general, the RF model performance was better for permafrost-free sites than for sites with permafrost (r = 0.66 and r = 0.51, respectively; p < 0.05), which is likely related to the fact that at sites with permafrost the MODIS LSTn is not as directly related to the soil temperature than at sites without permafrost.Hence, LSTn is not as good proxy for the temperature that is controlling both CH 4 production and consumption and this results in a worse performance than at sites without permafrost.

Upscaled CH 4 fluxes
The RF model developed in this study was used together with the gridded input datasets (Sect.2.4) and wetland distribution maps (Sect.2.5) to estimate CH 4 emissions from northern wetlands in 2013 and 2014.The mean CH 4 emissions of the 2 years from the RF model are plotted in Fig. 6 together with CH 4 wetland emission maps from the process model LPX-Bern and model ensemble WetCHARTs.Differences between the process model estimations and upscaled fluxes are shown in Fig. 7.In general, the spatial patterns are similar among emission maps, which is not surprising given that the spatial variability is largely controlled by the underlying wetland distributions.One noteworthy difference is that WetCHARTs, RF-PEATMAP (i.e.RF modelling with PEATMAP) and RF-GLWD show higher emissions from western Canada than LPX-Bern or the upscaled fluxes using the wetland map from that process model (RF-DYPTOP).The other difference is that RF-GLWD show negligible emissions from Fennoscandia (Fig. 6c).These differences are related to differences in the underlying wetland maps.While the wetland maps differ, there is no consensus on which is more accurate, so comparisons indicate the uncertainty in upscaling emanating from uncertainties in wetland distribution.
Three statistical metrics (NSE, R 2 and RE) were calculated between RF-DYPTOP and LPX-Bern for each grid cell (Fig. 8).The figure illustrates how well the temporal variability of CH 4 emissions estimated by RF-DYPTOP and LPX-Bern agree in each grid cell.NSE values are low in areas where the systematic difference between RF-DYPTOP and LPX-Bern was high (compare Figs. 8a and 7a) since the bias strongly penalizes NSE.The R 2 values are high throughout the study domain, likely due to the fact that the seasonal cycle of CH 4 emissions dominated the temporal variability in most of the grid cells and the seasonal cycles were in phase between RF-DYPTOP and LPX-Bern.RE values calculated between RF-DYPTOP and LPX-Bern were high in areas where the emissions estimated by RF-DYPTOP were also high (compare Figs. 8c and 6a).This is likely due to the fact that, even though the seasonal cycles were in phase, their amplitudes were different which increased the variability between LPX-Bern and RF-DYPTOP (i.e. increase in RE).
The uncertainties of the upscaled fluxes were estimated from the spread of predictions made with the ensemble of 200 RF models (Fig. 9).The uncertainty mostly scales with the flux magnitude (compare Fig. 6a-c with Fig. 9a-c), meaning that grid cells with high fluxes tend to also have high uncertainties.However, the relative flux uncertainty does have some geographical variation (Fig. 9d-f).The highest relative uncertainties are typically at the highest and lowest latitudes of the study domain.In these locations the dependencies between the predictors and the CH 4 flux are not as well defined as in the locations with lower uncertainties leading to larger spread in the ensemble of RF model prediction.
For instance, at low latitudes LSTn may go beyond the range of LSTn values in the training data (see the range in Fig. 3), and hence the RF model predictions are not well constrained in these situations.On the other hand, lower relative uncertainties are typically obtained for locations close to the measurement sites incorporated in this study (compare Figs. 1  and 9), since the dependencies between the predictors and the CH 4 flux are better defined.
The seasonalities of the upscaled fluxes and CH 4 fluxes from process models are similar with the highest CH 4 emissions in July-August and the lowest in February.This seasonal pattern is consistent throughout the study domain (Fig. 10).Warwick et al. (2016) and Thonat et al. (2017) showed that the northern wetland CH 4 emissions should peak in August-September in order to correctly explain the seasonality of atmospheric CH 4 mixing ratios and isotopes mea-sured across the Arctic.Hence, the wetland CH 4 emissions presented here are peaking approximately 1 month too early to perfectly match with their findings.CH 4 flux magnitude agrees well between WetCHARTs and the upscaled flux during spring and midsummer (April-July), whereas LPX-Bern estimates lower fluxes (0 % and 26 % difference, respectively).During late summer and autumn (August-October) both process models estimate slightly lower fluxes than the upscaled estimate (17 % and 19 % difference, respectively).The upscaled fluxes also show somewhat higher emissions during the non-growing season (November-March) than the two process models (27 % and 35 % difference; see Table 2), and the upscaled estimates of non-growing season emissions are relatively close to a recent model estimate (Treat et al., 2018).This result promotes the recent notion that process models might be underestimating non-growing season fluxes at high latitudes (e.g.Treat et al., 2018;Xu et al., 2016a;Zona et al., 2016).Treat et al. (2018) adjusted WetCHARTs model output so that it matches with their estimates of non-growing season CH 4 emissions and then estimated annual wetland CH 4 emissions north of 40 • N to be 37 ± 7 Tg(CH 4 ) yr −1 using this adjusted model output.The estimates derived here for the annual emissions using the three wetland maps are similar (see Table 2), especially when considering our slightly smaller study domain (above 45 • N).The two process models included in this study estimated slightly lower mean annual emissions than the upscaled fluxes (11 % and 26 % difference between the mean upscaled estimate and WetCHARTs and LPX-Bern, respectively; see also Table 2).However, given the uncertainties in upscaling as well as in process models, this can be regarded as relatively good agreement.Different process models may be driven with different climate forcing data and they may have discrepancies in the underlying wetland distributions, in addition to the different parameterizations and descriptions of the processes behind the CH 4 emissions.These sources of uncertainty should be recognized when models are compared against each other or against upscaling products.
In order to further evaluate the agreement between the upscaled fluxes and process models we focused on two specific regions: Hudson Bay lowlands (HBL) and western Siberian lowlands (WSL) (see locations in Fig. 1).The upscaled fluxes indicate higher annual emissions for both subdomains compared to the two process models or previously published estimate (Table 2).For WSL the upscaled estimates are within the range of variability observed between process models and inversion modelling in WETCHIMP-WSL (Bohn et al., 2015) and close to Thompson et al. (2017).The upscaled estimates by Glagolev et al. (2011) might underestimate CH 4 emissions from the WSL area (Bohn et al., 2015).Furthermore, the process models in Bohn et al. (2015) are likely underestimating the non-growing season CH 4 emissions which might partly explain the discrepancy to the upscaled estimates in this study.Hence, the upscaled CH 4 emission esti-  mates for the WSL area, while large, are still in a reasonable range.
For HBL, the discrepancy between upscaled emission estimates and the estimates based on process models or previous studies is larger (Table 2).The upscaling results agree with Zhang et al. (2016) and Melton et al. (2013) but show emissions that are twice as large for the HBL than the other estimates (Table 2).This cannot be explained by wetland mapping since the difference also holds when the DYP-TOP wetland map is used in upscaling.There are only few  2013) and this study is similar to the discrepancy between this study and the inversion estimates (Pickett-Heaps et al., 2011;Thompson et al., 2017) at the whole HBL scale.Pickett-Heaps et al. (2011) and Thompson et al. (2017) show near zero CH 4 emissions during October-April and onset of CH 4 emissions in mid-May or even June, largely dependent on when the ground was free of snow and unfrozen.This is somewhat surprising given the fact that only 32 % of wetlands in the area are underlain by permafrost (based on amalgam of PEATMAP and the permafrost map), and hence the soils are likely not completely frozen and some non-growing season CH 4 emissions are likely to occur in such conditions (e.g.Treat et al., 2018).The upscaled non-growing season CH 4 emissions show on average 1.1 Tg(CH 4 ) yr −1 emissions for the HBL area.This partly, but not completely, explains the discrepancy between the CH 4 emission estimates for the HBL area.All these results suggest that the upscaled product likely overestimates CH 4 emissions from the HBL area.

Comparing the RF model predictive performance to previous studies
The RF model performance was worse when compared against independent validation data than what has been achieved in previous upscaling studies for GPP and energy fluxes (R 2 > 0.7) and ecosystem respiration (R eco ; R 2 > 0.6) (e.g.Jung et al., 2010;Tramontana et al., 2016).However, the RF model performance for monthly CH 4 emissions was comparable to net ecosystem exchange of CO 2 (NEE) (R 2 < 0.5) (e.g.Jung et al., 2010;Tramontana et al., 2016).Likely reasons for this finding include, for instance, that for other fluxes there is simply more data available from several sites spanning the globe.For example, the La Thuile synthesis dataset used by Jung et al. (2010) and Tramontana et al. (2016) consists of 965 site years of data from over 252 EC stations.Here we have data from 25 sites with CH 4 fluxes.Furthermore, the drivers (or proxies for the drivers) of, for example, GPP and energy fluxes are more easily available from remote-sensing (e.g.MODIS) and weather forecasting re-analysis datasets (e.g.WFDEI).In contrast, CH 4 emissions are more related to below-ground processes, thus drivers for these processes are more difficult to measure remotely.Also, there are temporal lags between changes in drivers (e.g.LSTn) and CH 4 fluxes in response to these changes.Consequently, training a machine-learning model such as RF on such data is difficult since the RF model assumes a instantaneous relationship between the change and response.However, one should also note that GPP or R eco are never directly measured with the EC technique, they are always at least partly derived products (Lasslop et al., 2009;Reichstein et al., 2005).Hence, direct functional relationships between GPP and R eco and their environmental drivers are inherently included in these flux estimates, whereas NEE and CH 4 emissions are directly measured without additional modelling.Also, both NEE and CH 4 emissions are differences between component fluxes (NEE:    1) and time periods.The values are given in Tg(CH 4 ) yr −1 .Note that estimates from some reference studies are not for the same period as the one studied here (2013)(2014).For WetCHARTs the mean of the model ensemble together with the range (in parentheses) is given, whereas for the upscaling results the 95 % confidence intervals for the estimated emissions are given.This study, RF-DYPTOP 4.6 (3.1-6.0)7.0 (5.2-8.8)6.2 (4.6-7.8)30.6 (21.4-39.9)This study, RF-GLWD 4.9 (3.4-6.5)6.8 (5.0-8.5)8.0 (5.8-10.2) 37.6 (25.9-49.GPP and R eco ; CH 4 flux: production and oxidation).Therefore, GPP and R eco upscaling algorithms show better correspondence with validation data than for NEE or CH 4 emissions and the results for NEE would be the correct point of reference for the RF model performance presented here.While the RF model performance in this study was inferior to previous upscaling studies for other fluxes when evaluated using different statistical metrics, it was still comparable to what has been shown before for several process models for CH 4 emissions (McNorton et al., 2016;Wania et al., 2010;Zürcher et al., 2013;Zhu et al., 2014;Xu et al., 2016a) = 0.92) is found between simulated and measured cumulative site emissions after calibrating the model against the measurements.While Xu et al. (2016a) did not explicitly show any statistical metrics, their model (CLM4.5)comparison against site level CH 4 flux data seemed to be somewhat better than in Wania et al. (2010) or McNorton et al. (2016).Xu et al. (2016a) emphasize the importance of non-growing season emissions and the fact that their model was clearly underestimating these emissions.Zhu et al. (2014) calibrated their model (TRIPLEX-GHG) for each measurement site by changing, e.g. the Q 10 for CH 4 production and CH 4 to CO 2 release ratio to be site-specific and found on average R 2 = 0.64 when comparing the calibrated model against measurements at 17 CH 4 flux measurement sites.However, their findings are not directly comparable to the RF model agreement with validation data shown here due to their model calibration against data before comparison.Nevertheless, their results show that, even after calibration, the process models are not fully able to capture the CH 4 flux variability in measurements.Miller et al. (2014) argued that the structure of some of the process models is so complex that the required forcing variables may not be reliable at larger spatial scales.All of these five models (JULES, LPJ-WHyME, LPX-Bern, CLM4.5 and TRIPLEX-GHG) are contributing to the global CH 4 budget estimation within the Global Methane Project (Saunois et al., 2016), highlighting that these results summarize the agreement between state-of-the-art process models and field measurements.

Missing predictors
In this study a statistical model was developed using the RF algorithm, and the model was able to yield R 2 = 0.47 against monthly CH 4 flux validation data.Our upscaling using the RF model focused on 2013-2014, as these were the years with the largest overlap of collected data.However, all data from all the years (2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016) were used to develop and validate the model.The incomplete match between the RF model and validation data is likely caused by the fact that not all the possible drivers causing inter-and intra-site variability in CH 4 emissions were included in the analysis, and hence all the variability could not be explained by the model.Christensen et al. (2003) were able to explain practically all the variability (R 2 = 0.92) in annual CH 4 emissions in their multi-site chamber study with only two predictors: temperature and the availability of substrates for CH 4 production.Also, Yvon-Durocher et al. (2014) speculate that the amount of substrates for microbial CH 4 production explains across-site variability of CH 4 fluxes in their data.However, gridded data on spatially explicit substrate information are currently nonexistent.Hence, proxies for the substrates available for methanogenesis are needed.The current paradigm on wetland CH 4 emissions is that most of the emitted CH 4 is produced from recently fixed carbon being used as precursors for the CH 4 -producing Archaea (e.g.Chanton et al., 1995;Whiting and Chanton, 1993).Most process models are based on the premise that a certain fraction of ecosystem net primary productivity (NPP) is available and used for CH 4 production or alternatively a fraction of heterotrophic respiration is allocated to CH 4 emissions (e.g.Xu et al., 2016b).Thus, NPP (or GPP) could potentially be included as a predictor for the RF model and used as a proxy for the amount of substrates available for CH 4 production.However, the RF model performance in this study was not enhanced if variables closely related to NPP (EVI and the product of EVI and LSTd) were included as predictors.Also, Knox et al. (2019) did not find GPP as an important predictor of CH 4 emission variability in their multi-site synthesis study.
Using NPP (or proxies for it) for the RF model development might be an oversimplification, since it has been shown that the deep-rooted sedges and their NPP are especially important for CH 4 production (Joabsson and Christensen, 2002;Ström et al., 2003Ström et al., , 2012;;Waddington et al., 1996).Hence, information about plant functional types (PFTs) would be needed to better explain the CH 4 flux variability (Davidson et al., 2017;Gray et al., 2013).Furthermore, the fraction of the fixed carbon allocated to the roots and released as root exudates (hence, available for CH 4 production) varies between species and root age (Proctor and He, 2017;Ström et al., 2003), further complicating the connection between NPP and CH 4 emissions.The sedges also act as conduits for CH 4 , allowing the CH 4 produced below water level to rapidly escape to the atmosphere and bypass the oxic zone in which the CH 4 might have otherwise been oxidized (Waddington et al., 1996;Whiting and Chanton, 1992).Besides sedges, Spaghnum mosses are also important because methanotrophic bacteria that live in symbiosis with these mosses significantly decrease the CH 4 emissions to the atmosphere when they are present (Larmola et al., 2010;Liebner et al., 2011;Parmentier et al., 2011b;Raghoebarsing et al., 2005;Sundh et al., 1995).In a modelling study, Li et al. (2016) showed that it was essential to consider the vegetation differences between sites when modelling CH 4 emissions from two northern peatlands.Hence, ideally one should have gridded information on wetland species composition and associated NPP across the high latitudes to significantly improve the upscaling results.Unfortunately, such information is not yet available and therefore modelled estimates could be used (e.g.LPX-Bern, which includes several peatland-specific PFTs allowed to freely evolve during the model run) (Spahni et al., 2013).However, in such cases the upscaled CH 4 emission estimates would no longer be independent of the model and therefore would be less suitable for model validation.We also note that many process models have only one PFT per wetland.
Different variables related to water input to the ecosystem (i.e.P , P ann ) or surface moisture (SRWI) did not enhance the RF model predictive performance, not only reflecting that water table depth (WTD) is not solely controlled by input of water via precipitation but also that evapotranspiration and lateral flows affect wetland WTD, data that were missing from our study.These findings are consistent with previous studies (e.g.Christensen et al., 2003;Rinne et al., 2018;Pugh et al., 2018 andKnox et al., 2019), who showed only a modest CH 4 flux dependence on WTD in wetlands and peatlands.In contrast, several chamber-based studies have shown a positive relationship between WTD and CH 4 fluxes (Granberg et al., 1997;Olefeldt et al., 2012;Treat et al., 2018;Turetsky et al., 2014).In general, chamber-based studies often show spatial dependency of CH 4 flux on WTD, whereas studies done at an ecosystem scale with EC generally do not show temporal WTD dependency, albeit there are exceptions (e.g.Zona et al., 2009).This might indicate that WTD controls metre-scale spatial heterogeneity of CH 4 flux between microtopographical features (e.g.Granberg et al., 1997) but not temporal variability on the ecosystem scale, provided that WTD stays relative close to the surface.Also, the chamber studies tend to observe spatial variation, which can be indirectly influenced by WTD via its influence on plant communities, whereas EC studies observe typically temporal variation in sub-annual timescales.However, the effect of WTD might be masked by a confounding effect caused by plant phenology, since vegetation biomass often peaks at the same time as the WTD is at its lowest.While the variables related to WTD did not increase the RF model performance, WTD might still play a role in controlling ecosystem-scale CH 4 variability when it is exceptionally high or low.For instance, the year 2006 was exceptionally dry at the Siikaneva fen, and hence CH 4 emissions during that year were lower than on average (see Fig. 5a).However, in order to accurately capture such dependencies with machine-learning techniques (such as RF), they should be frequent enough so that the model can learn these dependencies.
RF model performance was better at permafrost-free than at sites with permafrost, which might indicate that the LSTn might not be an appropriate proxy for the temperature controlling the CH 4 production and oxidation rates at sites with permafrost.Also, no information on the development of the seasonally unfrozen, hydrologically and biogeochemically active layer was included in the RF model.Furthermore, Zona et al. (2016) showed strong hysteresis between soil temperatures and CH 4 emissions at their permafrost sites in Alaska, whereas Rinne et al. (2018) show a synchronous exponential dependence between soil temperature and CH 4 emissions at a boreal fen without permafrost.The hysteresis observed in Zona et al. (2016) could be explained by the fact that part of the produced CH 4 at these permafrost sites is stored below ground for several months before it is being emitted to the atmosphere, causing a temporal lag between soil temperature and observed surface flux.In any case, more knowledge of soil processes (soil thawing and freezing, CH 4 production and storage) is needed before the CH 4 emissions from these permafrost ecosystems can be extrapolated to other areas with greater confidence.
It should be emphasized that the drivers causing acrosssite variability in ecosystem-scale CH 4 emissions are, in general, unknown since studies comparing EC CH 4 fluxes from multiple wetland sites have only recently been published (Baldocchi, 2014;Knox et al., 2019;Petrescu et al., 2015).Most previous CH 4 synthesis studies were based on plotscale measurements (Bartlett and Harriss, 1993;Olefeldt et al., 2012;Treat et al., 2018;Turetsky et al., 2014).How-ever, the CH 4 flux responses to environmental drivers and their relative importance might be different at an ecosystem scale since CH 4 fluxes typically show significant spatial variability at sub-metre scale (e.g.Sachs et al., 2010).Furthermore, the temporal coverage of plot-scale measurements with chambers is usually relatively poor, whereas EC measurements provide continuous data on ecosystem scale.This study and Knox et al. (2019) show that temperature is important when predicting CH 4 flux variability in a multi-site CH 4 flux dataset, but a significant fraction of CH 4 flux variability is still left unexplained.It remains a challenge for future EC CH 4 flux synthesis studies to discover the drivers explaining the rest of the variability.

Quality and representativeness of CH 4 flux data
The RF model performance may improve if instrumentation, measurement setup and the data processing are harmonized across sites, since these discrepancies between flux sites might have caused spurious differences in CH 4 fluxes.These differences would have created additional variability in the synthesis dataset, which would in turn (1) influence the training of RF model and (2) decrease, for example, NSE values obtained against validation data, since there would be artificial variability in the validation data, which is not related to the predictors.In this study, the site PIs processed the data themselves using different processing codes, albeit the gap filling was done centrally in a standardized way.
While these issues mentioned above could impact the upscaling results shown here, prior studies have shown that the usage of different instruments or processing codes does not significantly impact CH 4 flux estimates.For instance, Mammarella et al. (2016) showed that the usage of different processing codes (EddyPro and EddyUH) resulted, in general, in a 1 % difference in long-term CH 4 emissions.On the other hand, CH 4 instrument cross comparisons have shown small differences (typically less than 7 %) between the long-term CH 4 emission estimates derived using different instruments (Goodrich et al., 2016;Peltola et al., 2013Peltola et al., , 2014)).While these studies show consistent CH 4 emissions, they also stress that the data should be carefully processed to achieve such good agreement across processing codes and instruments.In addition, many issues related to, for example, friction velocity filtering and gap filling of CH 4 fluxes are still unresolved, and the role of short-term emission bursts, which are common in methane flux time series, needs to be further investigated (e.g.Schaller et al., 2017).However, recently Nemitz et al. (2018) advanced these issues by proposing a methodological protocol for EC measurements of CH 4 fluxes used to standardize CH 4 flux measurements within the ICOS measurement network (Franz et al., 2018).
A total of 25 flux measurement sites were included in this study and they were distributed across the Arctic boreal region (see Fig. 1).The measurements were largely concentrated in Fennoscandia and Alaska, whereas data from, for  (Hargrove et al., 2003;Hoffman et al., 2013;Papale et al., 2015;Sulkava et al., 2011).
However, before such analysis can be done, the main drivers causing across-site variability in ecosystem-scale CH 4 fluxes should be better identified.Most of the CH 4 flux data here and in the literature have been recorded during the growing season when the CH 4 fluxes are at a maximum, whereas year-round continuous CH 4 flux measurements are not as common.This is likely due to the harsh conditions in the Arctic during winter that make continuous high-quality flux measurements very demanding (e.g.Goodrich et al., 2016;Kittler et al., 2017a) but also in part since the large-scale importance of non-growing season emissions has just recently been recognized (Kittler et al., 2017b;Treat et al., 2018;Xu et al., 2016a;Zona et al., 2016).For upscaling year-round CH 4 emissions, continuous measurements are vital to accurately constrain also the non-growing season emissions and their drivers.

Data availability
The presented upscaled CH 4 flux maps (RF-DYPTOP, RF-PEATMAP and RF-GLWD), their uncertainties and the underlying CH 4 flux densities are accessible via an opendata repository Zenodo (Peltola et al., 2019).The datasets are saved in netCDF files and they are accompanied by a readme file.The dataset can be downloaded from https://doi.org/10.5281/zenodo.2560163.

Conclusions
Methane (CH 4 ) emission data comprising over 40 site years from 25 eddy covariance flux measurement sites across the Arctic boreal region were assembled and upscaled to estimate CH 4 emissions from northern (> 45 • N) wetlands.The upscaling was done using the random forest (RF) algorithm.The performance of the RF model was evaluated against independent validation data utilizing the leave-one-site-out scheme, which yielded value of 0.47 for both the Nash-Sutcliffe model efficiency and R 2 .These results are similar to previous upscaling studies for the net ecosystem exchange of carbon dioxide (NEE) but worse for the individual components of NEE or energy fluxes (e.g.Jung et al., 2010;Tramontana et al., 2016).The performance is also comparable to studies where process models are compared against site CH 4 flux measurements (McNorton et al., 2016;Wania et al., 2010;Zürcher et al., 2013;Zhu et al., 2014;Xu et al., 2016a).Hence, despite the relatively high fraction of unexplained variability in the CH 4 flux data, the upscaling results are useful for comparing against models and could be used to evaluate model results.The three gridded CH 4 wetland flux estimates and their uncertainties are openly available for further usage (Peltola et al., 2019).
The upscaling to the regions > 45 • N resulted in mean annual CH 4 emissions comparable to prior studies on wetland CH 4 emissions from these areas (Bruhwiler et al., 2014;Chen et al., 2015;Spahni et al., 2011;Treat et al., 2018;Watts et al., 2014;Zhang et al., 2016;Zhu et al., 2013) and hence, in general, support the prior modelling results for the northern wetland CH 4 emissions.When compared to two validation areas, the upscaling likely overestimated CH 4 emissions from the Hudson Bay lowlands, whereas emission estimates for the western Siberian lowlands were in a reasonable range.Future CH 4 flux upscaling studies would benefit from long-term continuous CH 4 flux measurements, centralized data processing and better incorporation of CH 4 flux drivers (e.g.wetland vegetation composition and carbon cycle) from remote-sensing data needed for scaling the fluxes from the site level to the whole Arctic boreal region.

Figure 1 .
Figure 1.Map showing the locations of the EC measurements.The distribution of wetlands shown in the figure is based on Xu et al. (2018).Hudson Bay lowlands (50-60 • N, 75-96 • W) and western Siberian lowlands (52-74 • N, 60-94.5 • E) are highlighted with dashed red lines.

Figure 2 .
Figure 2. Evolution of statistical metrics during RF model development.Predictors were added to the RF model starting from the left of the figure and accumulate along the x axis.For instance, the x tick label "SC" shows the RF model performance when LSTn, Permafrost, Rpot and SC were used as predictors in the model.See the x tick label explanations in Table 1.The error bars denote 1σ uncertainty of the values estimated with bootstrapping.

Figure 3 .
Figure 3. Dependence of monthly mean CH 4 emissions on monthly mean land surface temperature at night (LSTn) derived from MODIS data.Eddy covariance measurements are shown with filled markers (unique colour for each site) and random forest model predictions for each site are given with black dots.

Figure 4 .
Figure 4. Relation between monthly mean CH 4 fluxes predicted by the RF model and independent validation data.Monthly average values from the same site are identified by unique colours and a least-squares linear fit to data from each site is also plotted using the same colour.Site means are shown with markers with black edges.The dashed line shows the 1 : 1 line.The shaded area shows the uncertainty range estimated from the RE CH 4 flux dependence (see text for further details).The statistics in the figure are calculated using the monthly data.See Appendix A for an explanation of site names.

Figure 5 .
Figure 5.Time series of modelled CH 4 emissions (red lines) together with validation data (circles) at four example sites: (a) Siikaneva oligotrophic fen in Finland, (b) Lost Creek shrub fen in Wisconsin, USA, (c) Atqasuk wet tundra in Alaska, USA, and (d) Chersky wet tundra in northeastern Siberia, Russia.Dashed vertical lines denote a new year.Note the changes in y axis scales.Site-specific model performance metrics are also included.

Figure 6 .
Figure 6.Mean annual CH 4 wetland emissions during years 2013-2014 estimated by upscaling EC data using the RF model and three wetland maps (a, b, c) and process models (d, e).Grid cells with low CH 4 wetland emissions (below 0.1 g(CH 4 ) m −2 yr −1 ) are shown in grey.The flux rates refer to total unit area in a grid cell.

Figure 7 .
Figure 7. Difference in mean annual CH 4 wetland emissions during the years 2013-2014 estimated by upscaling EC data using the RF model with different wetland maps and process models.All the CH 4 emission maps were aggregated to 1 • resolution before comparison.The flux rates refer to total unit area in a grid cell.

Figure 8 .
Figure 8. NSE, R 2 and RE calculated between RF-DYPTOP and LPX-Bern.Grid cells with low CH 4 wetland emissions (below 0.1 g(CH 4 ) m −2 yr −1 ) are shown in grey.RE values refer to total unit area in a grid cell.

Figure 9 .
Figure 9. Absolute (a-c) and relative (d-f) uncertainties of the upscaled CH 4 fluxes using different wetland maps.Uncertainty is estimated as 1σ variability of the predictions by 200 RF models developed by bootstrapping the training data (Sect.2.1.2).Grid cells with low CH 4 wetland emissions (below 0.1 g(CH 4 ) m −2 yr −1 ) are shown in grey.The absolute uncertainties refer to total unit area in a grid cell.

Figure 10 .
Figure 10.Monthly time series of zonal mean CH 4 fluxes.The upscaled fluxes with different wetland maps are shown in (a, b, c) and wetland CH 4 emissions estimated with the two process models are given in (d, e).

5 )a
Approximately north of 47 • N. b Approximately north of 45 • N. c Mean annual CH 4 emissions from eight models ±1σ of interannual variation in the model estimates for the period 1993-2004.d Process model tuned to match atmospheric observations.e North of 40 • N. f Mean ±1σ over the LPJ-wsl model results using different wetland extents for the period 1980-2000.
. For instance, McNorton et al. (2016) validated the land surface model JULES against CH 4 flux data from 13 sites and found R 2 = 0.10 between the validation data and the model.Wania et al. (2010) found on average RMSE = 29 nmol m −2 s −1 and RMSE = 42 nmol m −2 s −1 with and without tuning their model LPJ-WHyME against CH 4 flux data from seven sites, respectively.Zürcher et al. (2013) found the time-integrated CH 4 flux to be well represented by LPX-Bern model across different sites.A tight correlation (R 2

Table 1 .
Description of input variables for RF model development for upscaling.Data were aggregated to monthly values (see text) unless otherwise noted below.

Table 2 .
Annual CH 4wetland emissions in different subdomains (Hudson Bay lowlands and western Siberian lowlands; see Fig. (McDermitt et al., 2010L areas, were missing.Long-term EC CH 4 flux measurements are largely missing from these vast wetland areas, casting uncertainty on wetland CH 4 emissions from these areas.The location of a flux site is typically restricted by practical limitations related to, for example, ease of access and availability of grid power.Hence, open-path instruments with low power requirements potentially open up new areas for flux measurements(McDermitt et al., 2010), yet they need continuous maintenance, which is not necessarily easy in remote locations.However, one could argue that the geographical location of flux sites is not vital for upscaling, more important is that the available data represents well the full range of CH 4 fluxes across the northern latitudes and more importantly the CH 4 flux responses to the environmental drivers.Also, sites should ideally cover all different wetlands with varying plant species composition, whereas geographical representation is not necessarily as important.CH 4 flux site representativeness could be potentially assessed in the same vein as in previous studies for other measurement networks www.earth-syst-sci-data.net/11/1263/2019/ Earth Syst.Sci.Data, 11, 1263-1289, 2019

Table A1 .
Description of eddy covariance sites included in this study.