BorFIT: A Novel LiDAR-Based Training Dataset for Individual Tree Segmentation and Species Detection in northern boreal Forests

Schladebach, Jacob; Heim, Birgit; Enguehard, Léa; Wieczorek, Mareike; Broers, Jakob; Jackisch, Robert; Gloy, Josias; Hao, Kunyan; Tretton, James; Gorshunova, Anna; Kruse, Stefan

doi:10.5194/essd-2025-340

Preprints

20 Aug 2025

Status: a revised version of this preprint is currently under review for the journal ESSD.

Jacob Schladebach, Birgit Heim, Léa Enguehard, Mareike Wieczorek, Jakob Broers, Robert Jackisch, Josias Gloy, Kunyan Hao, James Tretton, Anna Gorshunova, and Stefan Kruse

Abstract. BorFIT is a novel training data set designed to assist in the segmentation of individual trees and the detection of species from LiDAR point clouds, thus contributing to deep learning-based forestry applications. Recent advancements in AI-supported individual tree detection have shown significant progress; however, satisfactory results remain elusive in dense and structurally complex boreal forests. We compiled a training data set designed to remedy this issue. It comprises 384 LiDAR point clouds, each with an area of 20 m × 20 m, in the form of reference plots, with up to 200 manually segmented and species-classified trees per point cloud. We carried out LiDAR surveys at 146 sites between 2021 and 2024 in East Siberia (Yakutia), northwest Canada, and Alaska (USA), selected along a bioclimatic gradient to represent the circumboreal region. From each LiDAR-transect-derived point cloud, we extracted a minimum of four reference plots (each 20 m × 20 m) based on maximum tree heights within the plots to systematically sample the apparent tree density gradient. We manually segmented identifiable trees within each reference plot point cloud, yielding 16,530 individual trees in total. Following segmentation, we trained four randomForest classifiers to predict the species of every segmented tree. The predicted tree species include: Picea mariana (Britton, Sterns & Poggenb.), Picea sitchensis ((Bong.) Carrière), Picea glauca ((Moench) Voss), Pinus contorta (Douglas ex Loudon), Abies lasiocarpa ((Hook.) Nutt.), Larix laricina ((Du Roi) K.Koch), Betula papyrifera (Marshall), Betula neoalaskana ((Regel) Ashburner & McAll.), Populus balsamifera (L.), Populus tremuloides (Michx.), Pinus sylvestris (Thunb.) and Alnus glutinosa ((L.). The data offer the means for 3D space analysis of species distribution and stand structure around the circumboreal region. Furthermore, they can be used as a training data set for artificial intelligence (AI) applications and thereby improve our understanding of the boreal forest’s vegetation reorganization in response to significant global warming.

Received: 10 Jun 2025 – Discussion started: 20 Aug 2025

Competing interests: At least one of the (co-)authors is a member of the editorial board of Earth System Science Data.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on essd-2025-340', Anonymous Referee #1, 18 Sep 2025
REVIEW: BorFIT: A Novel LiDAR-Based Training Dataset for Individual Tree Segmentation and Species Detection in northern boreal Forests
General Comments:
The dataset is relevant, especially because of the large area it covers, and is worth publishing. The data processing pipeline is standard and does not bring new contributions. Regarding the ESSD data quality criteria, more clarifications are required from the authors. The Introduction and Methods are generally well described but would benefit from some clarifications. In the Introduction, certain statements could be revised to improve clarity and completeness. In the Dataset and Methods section, additional details on data acquisition, as well as on the final accuracy and spatial resolution of the dataset, would strengthen the manuscript. A major point to consider is adopting a clearer overall structure. It would be particularly useful for future users if the paper included dedicated sections on data validation and dataset usability. Finally, please double-check the Classification field in the point clouds, as some values appear unusual.
Specific comments:
ABSTRACT:
LiDAR is too broad a term. I would recommend using UAV-LiDAR or UAV laser scanning system in the abstract to make it clear from the beginning which data collection method is meant.
INTRODUCTION:
P1, Lines 27–30: I am not sure I agree with the authors that this is the most used method nowadays. Some decades ago, yes, but nowadays there are many approaches, and stem detection has increased a lot, especially for boreal forests where the stems are often well visible. However, I agree that canopy-top or stem detection are the most common approaches for tree detection, although there are many other methods for segmenting trees depending on the spatial resolution; for instance, cluster-based ones are also very popular. I suggest mentioning more than one methodology here (“the most used are”) and linking to a review. There are reviews on tree detection and tree segmentation that could be used here.
Line 33 - LiDAR can also be based on phase, not only on ToF, but I agree that ToF is the most common option. Maybe this should be clarified, since it is presented here as a theoretical explanation?
Line 34 – “When properly segmented”, “these point clouds”. It seems that something is missing in this sentence, because point clouds were not mentioned in the previous sentences. Please revise the consistency and connection here. Also, the connection with the next sentence, which starts to mention tree detection, could be improved. Maybe you could include a sentence mentioning that to obtain individual tree point clouds automatically, tree detection and coarse-to-fine segmentation are often required. Should this be a new paragraph?
Line 35 – Biomass estimation with UAV is possible to some extent, but there is still a lot of discussion about its accuracy. The UAV point cloud would need to be very dense. Even leaf and wood separation in UAV data is still quite challenging. I would reconsider this statement. Please check the results reported by Brede et al., 2019:
Brede, B., Calders, K., Lau, A., Raumonen, P., Bartholomeus, H. M., Herold, M., & Kooistra, L. (2019). Non-destructive tree volume estimation through quantitative structure modelling: Comparing UAV laser scanning with terrestrial LIDAR. Remote Sensing of Environment, 233, 111355.
Line 38 – The authors mention that “AI highly depend on the selection of appropriate algorithms,” which I agree with, but it also often depends on the training data, which are the most problematic aspect and the reason for the paper. I think this should be included here too. This comes later, on line 41, in the species classification topic, but it is also very important for tree segmentation. Also, the connection with the next processing step could be improved (lines 39–40). I agree that segmentation has a huge impact on species classification, in which noise from neighboring trees can certainly interfere with the result. This was the main problem in the paper by Shcherbacheva et al., 2024. Could the authors elaborate more on this connection in the text before moving to the species classification topic?
Shcherbacheva, A., Campos, M. B., Wang, Y., Liang, X., Kukko, A., Hyyppä, J., ... & Puttonen, E. (2024). A study of annual tree-wise LiDAR intensity patterns of boreal species observed using a hyper-temporal laser scanning time series. Remote Sensing of Environment, 305, 114083.
In the dataset review, it would be interesting to also mention the FOR-instance dataset, which is a UAV laser scanning dataset aimed at semantic and instance segmentation of individual trees. Species information is also provided there. Overall, the review of available open datasets could be expanded. There are other datasets that provide individual trees, species, and leaf–wood separation, which could also be mentioned, even if they contain fewer samples. I think that extending the review here would improve the discussion of this dataset's contribution and make clearer the differences between the proposed open dataset and existing ones, particularly in terms of coverage, spatial resolution, and temporal resolution.
Puliti, S., Pearse, G., Surový, P., Wallace, L., Hollaus, M., Wielgosz, M., & Astrup, R. (2023). For-instance: a uav laser scanning benchmark dataset for semantic and instance segmentation of individual trees. arXiv preprint arXiv:2309.01279.
Also in the introduction (same comment as for the abstract), the authors mention that BorFIT comprises “384 LiDAR point clouds collected from 146 locations across bio-climatic gradients in Siberia, Canada, and Alaska, with each point cloud containing up to 200 individual trees with assigned species”. It is very important to make clear here that the data were collected by UAV laser scanning (ULS). This gives the user an insight into the spatial resolution of this dataset. Consider including this information in the abstract as well. Consider also applying this comment to the abstract presented on the PANGAEA data server.
Line 53 - Typo error: tree species “classificaiton"
DATASET AND METHODS
Study region
Line 61 – The first sentence of the dataset description is a bit confusing: “UAV LiDAR surveys in addition to extensive multisensor drone overflights”. What does this mean? What type of multisensor setup? Laser scanning was not included on those platforms.
In the research area description and in Table 1, it would be important to add the exact extent of the area covered; “several km” is too vague. Some coordinates delimiting the area would also be welcome to make the extent of the research area clearer.
Are the reference plots managed forest? From the point clouds, it looks like some areas have very sparse vegetation. Is this natural?
Data acquisition and postprocessing
What is meant here by “inaccurate trajectory strips”? Could the authors explain more clearly which criteria were used to consider a trajectory inaccurate? Was any checkpoint or other feature used to verify the accuracy of the georeferencing? What is the estimated georeferencing accuracy?
Line 77 – The first sentence of the data processing section is a bit difficult to read in the current format. Could the authors revise to make it clearer?
Line 79-80 [The LiDAR mapper followed….] Could the authors explain this sentence better? Is this information really needed? Instead, other technical specifications that are presented as supplementary could be in the main text. I think the specifications of the GNSS/INS system used and the expected accuracy of the position and attitude estimation for the direct georeferencing are much more important here and should be clarified. It is also not clear how the laser scanning system was set up: which frequency, angular resolution, FoV, aperture angle, laser beam divergence, etc. were used? A table with all the specifications of the sensors used is important to give future users the information they need.
Figure 3 is hard to interpret. Could the author colorize the point cloud by height for instance, since the main important feature is the size of the trees?
Line 94 – The term reference plot “cut-outs” is a bit strange. I think it is enough to say that 20 m × 20 m plots are selected or generated from the transects, or perhaps to use the word segmented or sampled. Try to standardize the terminology.
Tree height groups 1 to 10 need to be better explained with quantitative values; small, medium, and high do not say much without knowing the forest context. I suggest adding a table here with the range of heights in each level and the dominant species/plot area in each of those. This can help users later to search for the data that would be useful for different purposes.
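As an illustration of the kind of table requested here, the following minimal sketch derives quantitative height ranges for a set of groups from per-plot maximum tree heights. It assumes equal-width bins between the lowest and highest plot maximum, which is only one possible grouping rule; the manuscript's actual definition of groups 1 to 10 is not given on this page. The function name `height_group_table` and the example heights are hypothetical.

```python
def height_group_table(max_heights, n_groups=10):
    """Bin per-plot maximum tree heights (m) into equal-width groups.

    Returns a list of rows (group_number, lower_m, upper_m, n_plots).
    Equal-width binning is an assumption made for illustration only.
    """
    if not max_heights:
        return []
    lo, hi = min(max_heights), max(max_heights)
    width = (hi - lo) / n_groups
    counts = [0] * n_groups
    for h in max_heights:
        # The tallest plot falls on the upper edge, so clamp it into the last group.
        idx = 0 if width == 0 else min(int((h - lo) / width), n_groups - 1)
        counts[idx] += 1
    return [
        (g + 1, lo + g * width, lo + (g + 1) * width, counts[g])
        for g in range(n_groups)
    ]


# Hypothetical plot maxima in metres, not values from the dataset.
example = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
for group, lower, upper, n in height_group_table(example, n_groups=5):
    print(f"group {group}: {lower:.1f}-{upper:.1f} m, {n} plot(s)")
```

A dominant-species column could be added per group in the same way once species labels per plot are available.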
Tree segmentation
The authors mention that the training data sets are based on the reference plots annotated with “00”, where species were documented during a field survey, but it is not clear whether these plots contain an example of every species to be classified. If this is not the case, how would it affect the classification results if the samples are not representative? Later the authors also mention that they used distinct training data subsets and four random forest models, which is fine but should be presented more clearly from the beginning. At present, the sequence in which the information is presented is a bit confusing and mixed. I suggest reformulating the logic of the explanation here.
Would it be possible to assign and discuss some type of data quality measure regarding the data acquisition and segmentation? For example, whether the tree is fully visible (stem and canopy boundary seem well represented), whether there are occlusions, or whether there is noise from other trees even after the manual segmentation?
Figure 4 – Which species is presented in Figure 4? This is very minor, but normally it is easier to present individual trees with their height rather than with the absolute terrain values from the georeferencing, so that the tree size can be read quickly.
How were these eleven structural and two spectral variables selected? Could the authors link to previous works that also performed species classification, to give more support to the method and the features chosen here? The selection of geometric variables seems a bit unusual, since most of them focus on the height and volume of the canopy. It might also be useful to discuss the contribution of these features to the classification. Did you not include LiDAR intensity? Normally it is very good information for separating deciduous and coniferous trees.
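The height-based presentation suggested for Figure 4 can be sketched as follows. This is a minimal example, assuming a segmented tree is available as a list of (x, y, z) tuples in georeferenced coordinates and that the lowest point of the tree approximates the ground at the stem base; on sloping terrain a terrain model would be a more robust reference. The function names and coordinates are hypothetical.

```python
def normalize_heights(points):
    """Shift z so that the lowest point of the tree sits at 0 m."""
    if not points:
        return []
    z_min = min(z for _, _, z in points)
    return [(x, y, z - z_min) for x, y, z in points]


def tree_height(points):
    """Tree height as the vertical extent of the segmented points."""
    zs = [z for _, _, z in points]
    return max(zs) - min(zs)


# Hypothetical points with absolute elevations in metres.
tree = [(0.0, 0.0, 312.4), (0.1, 0.0, 318.9), (0.0, 0.2, 315.0)]
print(normalize_heights(tree))
print(round(tree_height(tree), 2))
```

Plotting the normalized z values puts the stem base at 0 m, so the tree size can be read directly from the axis.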
Results and Discussion
I’m not sure that the traditional section titles, such as Results and Discussion, are the most appropriate for a dataset paper. It is important to keep in mind that this paper will serve as a reference for future users of the dataset, who will rely on it both to understand the data processing steps and to cite it in their own work. Therefore, I suggest reorganizing the structure to better align with the goals of a dataset description paper. Many dataset papers diverge from the traditional Results/Discussion format. Adopting a similar approach would benefit this paper as well. See, for instance: https://doi.org/10.5194/essd-17-4569-2025 and https://doi.org/10.1038/s41597-024-04143-w
A possible outline could be:
Introduction

Methods and Data Processing – detailing all steps used to produce the final dataset.

Data Validation – evaluating the dataset’s accuracy, including classification performance and the related conclusions currently placed in the Discussion. I think the Class errors presented as supplementary are super important and could be in the main text.

Dataset Properties – presenting the content from Sections 3.1 and 3.3, which contain excellent visualizations. These could be reframed as a description of the dataset from an ecological/forest perspective.

Usability – a section that is currently missing, which would explicitly discuss and clarify the potential applications of the dataset, based on both the technical properties (from the validation) and the ecological/forest context (from the dataset description).

Regarding data validation, the species classification results could be discussed in more detail, clearly stating which classes users should be more cautious with when using the dataset. Including a table with estimated performance per species could be very helpful. From the data processing point of view, manual segmentation is often the most accurate approach and I do not have much to comment about that, but it would also be valuable to provide additional details on how spatial resolution varies and how data acquisition influences the dataset.
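A per-species performance table of the kind suggested above could be derived from paired reference and predicted labels, as in the following minimal sketch. It is not the authors' evaluation code: the function name `per_class_report` and the toy labels are hypothetical, and "class error" is taken here to mean 1 − recall, the share of reference trees of a species that were assigned to a different species.

```python
from collections import Counter


def per_class_report(true_labels, predicted_labels):
    """Return precision, recall and class error for every label."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    correct = Counter(t for t, p in pairs if t == p)
    n_true = Counter(true_labels)
    n_pred = Counter(predicted_labels)
    report = {}
    for c in classes:
        # None marks a metric that is undefined because the class never occurs.
        recall = correct[c] / n_true[c] if n_true[c] else None
        precision = correct[c] / n_pred[c] if n_pred[c] else None
        report[c] = {
            "n_reference": n_true[c],
            "precision": precision,
            "recall": recall,
            "class_error": None if recall is None else 1.0 - recall,
        }
    return report


# Toy labels for illustration only, not results from the dataset.
truth = ["Larix", "Larix", "Picea", "Picea", "Betula"]
preds = ["Larix", "Picea", "Picea", "Picea", "Betula"]
for species, row in per_class_report(truth, preds).items():
    print(species, row)
```

Reporting the number of reference trees next to each metric helps users judge how reliable the estimate for a rare species is.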
Why is the link to the PANGAEA paper included? I think the reference alone is enough. Only the links to your dataset should be presented in the Data availability section, to avoid confusion. Currently, six links are presented, which makes it a bit confusing how to actually find the BorFIT data set as a whole. Maybe the five related links could be referenced somewhere else, and only one link to the BorFIT data set should be clearly presented under Data availability? It would also be nice to have some indication of the documentation related to the R code.
When checking the PANGAEA repository and connecting it with the paper, it would also be good to include in the Data availability section how the files are named and how the data are organized in the repository, for instance that they are presented at plot level. It is also important to mention that the classes Trees and Species are extra bytes in the LAZ files and need to be considered when reading the files, since they are not standard fields in the LAZ structure. What is being saved in the Classification and Point Source ID fields? They look a bit strange. Please check.
Is the georeferencing of EN23608(1)reference_plot_10_predicted.laz correct? It looks as though the trees are somehow not vertical to the ground, or the hill is very steep, but the coordinates are somehow not aligned with E, N, up? Could the authors double-check?
The first authors of the dataset do not match the paper's author list. This is not an issue from my point of view, but the authors might consider clearly stating the contributions of each author.
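For readers who want to inspect the extra bytes and the Classification and Point Source ID fields themselves, the following is a minimal sketch using the third-party `laspy` package (with a LAZ backend such as `lazrs` installed). It lists whatever extra dimensions a file declares rather than assuming their exact names, since the names used in the BorFIT files are not confirmed on this page. The helper names `value_counts` and `inspect_plot` are hypothetical, and scalar (non-array) extra dimensions are assumed.

```python
from collections import Counter


def value_counts(values):
    """Count how often each distinct value occurs in a per-point attribute."""
    return dict(Counter(values))


def inspect_plot(path):
    """Summarise standard and extra per-point fields of one plot file.

    Requires laspy plus a LAZ backend; both are imported lazily so that
    value_counts() stays usable without them.
    """
    import laspy
    import numpy as np

    las = laspy.read(path)
    extra_names = list(las.point_format.extra_dimension_names)
    summary = {
        "extra_dimensions": extra_names,
        "classification": value_counts(np.asarray(las.classification).tolist()),
        "point_source_id": value_counts(np.asarray(las.point_source_id).tolist()),
    }
    for name in extra_names:
        # Extra dimensions are not standard LAS fields and are accessed by name.
        summary[name] = value_counts(np.asarray(las[name]).tolist())
    return summary
```

Calling `inspect_plot("path/to/plot.laz")` (a placeholder path) returns a dictionary whose value counts make unusual codes in the standard fields easy to spot.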
Citation: https://doi.org/10.5194/essd-2025-340-RC1
- AC1: 'Reply on RC1', Jacob Schladebach, 26 Sep 2025
  
  Dear Referee,
  Please see the attached file for our reply.
  
  Best regards
  Jacob Schladebach
  
  Citation: https://doi.org/10.5194/essd-2025-340-AC1
- AC2: 'Reply on RC1', Jacob Schladebach, 26 Sep 2025
  
  Dear Referee,
  please find attached Part 2 of our comments.
  Best
  Jacob Schladebach
  
  Citation: https://doi.org/10.5194/essd-2025-340-AC2
RC2:
'Comment on essd-2025-340', Anonymous Referee #2, 19 Sep 2025

This paper introduces an interesting data set of segmented trees with species identifications in boreal forests across North America and eastern Siberia. While this is not a novel area of research and the paper makes no methodological innovations, having additional model outputs for this task may be beneficial to others looking to harmonize data from multiple acquisitions for their own modeling.
I have significant concerns (and confusion) about the combination of data with and without spectral information into a single data set. Given the two continents mapped here were modeled separately, the significantly different forest compositions across continents, and most North American sites had spectral information while the Siberian sites uniformly lacked it, it seems to me like users would be better served by having two distinct data sets made available. I don't think that this would need to significantly change the approach taken here, but rather treating "BorFIT" as the modeling approach taken and then having "BorFIT Siberia" as a distinct data product from "BorFIT North America" would possibly make the modeling and data processing steps used clearer.
I broadly agree with Reviewer 1's comments on alternative paper structures.
Specific comments follow.
Line 6: specify what type of lidar (UAV?)
Line 11: Think this should be "random forest". I personally object to classifying RF models as "AI" but I understand that's the fashion these days.
Line 25: Consider rephrasing this sentence. "Ecological attributions" is a particularly odd phrase.
Line 27: I disagree that this is the most common approach for segmentation. There's plenty of research on satellite, airborne, and UAV lidar for segmentation, and even imagery-based approaches are more likely to use hyperspectral imagery than just color and IR.
Line 30: What do you mean by "large areas"? I'd suggest UAV based lidar is probably the second least effective remote sensing approach (after only terrestrial lidar) for large area monitoring given range and speed limitations.
Line 40-48: I was confused by this segment. You appear to be attempting to describe why the boreal region presents special challenges, but none of these thoughts coalesce into a coherent explanation. Naming individual data sets in this context is similarly confusing. Consider rephrasing to make the challenges of this region more obvious. Suggest introducing BorFIT in a new paragraph.
Fig 1: Capitalize "georeferenced". Consider resizing elements to not break words. "randomForest" should be two words.
Fig 2: Great map. Consider a box behind the legend for legibility.
Line 71: Please provide more detail about how sites were positioned. Were there formal criteria used or was this more ad-hoc?
Table 1: I have a lot of questions about how sites were selected. Why the large variation in number of point clouds and reference plots per region? Why was Alaska sampled twice?
Line 95: Can you explain why you used this approach for selecting reference plot locations -- the stratification, 20m plot size, and number of plots selected? I don't have any concerns about the approach, but explaining the rationale (or citing to where this method was introduced, if appropriate) would be beneficial.
Line 110: Should be "issues arose"
Page 8 last paragraph: rather than "the same is true for _Larix_" consider "while _Larix_ was the most abundant genus in Yakutia"
Line 135: Should be "random forest". Should cite Breiman 2001.
Table 4: Please provide citations for these indices and metrics.
Line 153: Please clarify: was species composition predicted on structural parameters, or were the RGB data predicted?
156: "random forest". If you're using the randomForest R package, you should cite it in this section.
165: "random forest"
211: Did you manually segment 16,530 trees? You should mention this number earlier, in the "manual segmentation" methods section. If these are instead classified by the models, please define what "successful" means -- you only view about 4500 trees as being classified with a high enough probability to include in graphs in this paper.
220: "random forest"
228: I'd still change this to "random forest", although I think the syntax means you could be referring to your specific model (in which case no space is appropriate) rather than the model category as a whole
Line 252: Remove "mean accuracy 82%", these models cannot be applied to the same geographic areas or even to the same subsets of data collected for this publication, and as such their ensemble accuracy is irrelevant. "Accuracies ranging from 73% - 91%" seems more appropriate.
Line 283: "The same" (not Same)

Citation: https://doi.org/10.5194/essd-2025-340-RC2
- AC3: 'Reply on RC2', Jacob Schladebach, 02 Oct 2025
  
  Dear Referee,
  
  please find attached our response regarding your comments.
  
  Best regards
  Jacob Schladebach
  
  Citation: https://doi.org/10.5194/essd-2025-340-AC3

Data sets

Reference dataset of individual trees from the Tundra-Taiga-Ecotone and Northern boreal forests BorFIT Stefan Kruse, Jacob Schladebach, Jakob Broers, Kunyan Hao, James Tretton, Anna Gorshunova https://doi.pangaea.de/10.1594/PANGAEA.980505

Model code and software

BorFIT Jacob Schladebach https://github.com/StefanKruse/BorFIT/tree/main

Short summary

BorFIT is a novel training dataset for LiDAR point cloud segmentation and tree species detection in boreal forests. Comprising 384 plots across Siberia, Canada, and Alaska, it features 16,530 manually segmented trees of 12 species. BorFIT supports AI applications for analyzing species distribution, stand structure, and boreal forest response to climate change.

