Machine Learning for Uncovering Trees' Genetic Origins

Jenn Hoskins
9th June, 2025


The distribution of pairwise geographic distances shows that the sampled pedunculate oak (Quercus robur) trees (b) covered a much larger area than the European beech (Fagus sylvatica) trees (a), providing essential context for evaluating the accuracy of the study's geographic assignment models.

Image adapted from: Degen et al. / CC BY (Source)

Key Findings

  • In European forests, scientists used genetic data and machine learning to predict where trees such as beech and oak originally came from.
  • Two of the methods tested, grid-based Gaussian process regression and deep learning, proved the most accurate of the five compared, and helped flag mislabelled or wrongly sourced trees.
Recent research from the Thünen Institute, Bashkir State Agrarian University, and Khalifa University ([1]) addresses a pressing issue in forestry: how to accurately trace the geographic origin of trees using genetic data. This problem is critical for ensuring that reforestation projects use the right seed source, for promoting the legal trade of timber, and for detecting historical long-distance seed transfers or potential mislabelling in forest trials.

For years, scientists have used genetic assignment approaches to match an unknown sample with one of several pre-established populations. However, such "discrete" assignment methods run into problems when genetic variation is in fact continuous across a region. Early work on related species, including studies on red oak ([2]) and mahogany ([3]), has demonstrated both the power and the pitfalls of genetic assignment. In more traditional applications, populations were often delimited by political borders or historical groupings, leading to potential mismatches when natural genetic gradients span countries or other arbitrary boundaries.

In the current study, five continuous assignment methods are compared: nearest neighbour (NN), direct Gaussian process regression (GPR-D), grid-based Gaussian process regression (GPR-G), genomic prediction (GP), and deep learning (DL). Unlike traditional approaches that force a sample into a predefined group, these methods predict a sample's geographic origin on a continuous scale, offering a finer resolution of where a tree might have originated.

Two extensive genome-wide single nucleotide polymorphism (SNP) datasets form the basis for this evaluation: one comprising 30,000 SNPs from 865 European beech (Fagus sylvatica) trees and another with 381 SNPs from 1,883 pedunculate oak (Quercus robur) trees. SNPs, which are single-base differences in DNA, serve as markers that capture variation across the genome.
Because genetic differences among trees can arise gradually over distances rather than in abrupt shifts between populations, the continuous assignment methods used here are especially suited to scenarios where there is no clear-cut boundary between groups. In simpler terms, they allow researchers to predict the origin as a set of geographic coordinates that closely mirror the natural variation observed in the landscape.

An important aspect of the study is the accuracy of geographic predictions. For the beech dataset, the grid-based Gaussian process regression (GPR-G) and deep learning (DL) methods proved most accurate, producing median distances of only 55 km and 76 km, respectively, between the true origin and the predicted location. When applied to the oak data, these methods similarly yielded the best results, with median distances of 263 km and 278 km. To put these numbers in context, nearly 90% of the samples were assigned within a relative error (measured as the distance error divided by the maximum distance among tree pairs) of less than 8% for the best method, highlighting the potential for precise geographic assignment.

The study also identified outliers, which are samples that deviate significantly from the expected pattern. In the beech data, 35 individual trees and 10 groups were flagged as outliers; for the oak data, 27 individuals and 18 groups were detected. Such outliers may be explained by factors like mislabelling during field trials or historical human practices involving long-distance seed transfer. The identification of these outliers is similar in concept to earlier research in forensic science that identified illegal timber trade cases through genetic mismatches ([3],[4]). This type of careful analysis helps uncover hidden patterns in seed movement and can help ensure that timber on the market comes from legally verified sources.
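The two accuracy measures quoted above, the median distance error and the relative error, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses great-circle (haversine) distance with a mean Earth radius of 6371 km, and it takes the maximum pairwise distance from the true sample coordinates. The article does not say how the study computed distances, and the function names here are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def assignment_errors(true_coords, predicted_coords):
    """Return (median error in km, list of relative errors per sample).

    Relative error = distance error / maximum distance among all tree pairs,
    following the definition given in the text.
    """
    errors = [haversine_km(t, p) for t, p in zip(true_coords, predicted_coords)]
    max_pair = max(
        haversine_km(a, b)
        for i, a in enumerate(true_coords)
        for b in true_coords[i + 1:]
    )
    relative = [e / max_pair for e in errors]
    return median(errors), relative


# Made-up coordinates purely for illustration (not data from the study).
true_coords = [(50.0, 8.0), (52.0, 13.0), (47.0, 11.0)]
predicted_coords = [(50.3, 8.4), (51.8, 12.5), (47.1, 11.9)]
median_km, relative = assignment_errors(true_coords, predicted_coords)
share_below_8pct = sum(r < 0.08 for r in relative) / len(relative)
```

Dividing by the maximum pairwise distance makes errors comparable between datasets that cover very different areas, which matters here because the oak samples span a much larger region than the beech samples.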
Earlier research on tree species like the red oak ([2]) showed that genetic diversity can follow continuous geographic gradients. By integrating methods that account for this continuity, the current study addresses a key limitation in previous approaches. Moreover, wildlife forensic science (including efforts that focus on tracing ivory or fish to their origin) has emphasized the need for reliable reference data and robust computational approaches to minimize error rates ([4]). The present research builds on these principles and extends them by systematically comparing different algorithms on large datasets from two important European tree species.

The nearest neighbour approach, for example, works by looking for the most genetically similar individuals in a reference dataset. While this method has been effective in earlier applications with species like Sapelli ([5]), the continuous nature of genetic variation in forests demanded more sophisticated techniques. In that context, Gaussian process regression and deep learning showed superior performance, likely due to their capability to model nonlinear relationships and spatial patterns more effectively.

In summary, this study provides evidence that continuous assignment methods offer a viable solution for accurately pinpointing the geographic origins of tree samples. The research not only reinforces findings from earlier genetic studies ([2],[3]) but also moves the field forward by presenting methods that handle the complexities of real-world genetic variation. The integration of advanced statistical and computational methods points toward a future where forensic applications in forestry (including seed sourcing, conservation, and legal timber trade) can be conducted with greater precision and confidence.
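As a rough illustration of the nearest neighbour idea described above, the sketch below finds the reference trees most genetically similar to a query sample and averages their coordinates. Several things here are assumptions made for the example rather than details from the study: genotypes are coded as allele counts (0, 1, or 2) per SNP, genetic distance is the mean absolute difference in allele counts, the prediction is a plain average of the k nearest trees' latitude and longitude, and all names and data are hypothetical.

```python
def genetic_distance(g1, g2):
    """Mean absolute difference in allele counts across SNPs (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)


def nearest_neighbour_origin(sample, reference, k=3):
    """Predict (lat, lon) as the mean location of the k most similar trees.

    `reference` is a list of (genotype, (lat, lon)) pairs.
    """
    ranked = sorted(reference, key=lambda ref: genetic_distance(sample, ref[0]))
    nearest = ranked[:k]
    lat = sum(coords[0] for _, coords in nearest) / len(nearest)
    lon = sum(coords[1] for _, coords in nearest) / len(nearest)
    return lat, lon


# Toy reference set, invented for illustration only.
reference = [
    ([0, 0, 1, 2], (48.0, 9.0)),
    ([0, 1, 1, 2], (49.0, 10.0)),
    ([2, 2, 0, 0], (55.0, 20.0)),
]
predicted = nearest_neighbour_origin([0, 0, 1, 2], reference, k=2)
```

Averaging raw latitude and longitude is a simplification that is only reasonable over fairly small areas. The sketch also shows why such an approach is limited to locations near existing reference trees, whereas regression-style methods can interpolate between them.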

Genetics, Ecology, Plant Science

References

Main Study

1) Machine learning techniques for continuous genetic assignment of geographic origin of forest trees

Published 6th June, 2025

https://doi.org/10.1371/journal.pone.0324994


Related Studies

2) Back to America: tracking the origin of European introduced populations of Quercus rubra L.

https://doi.org/10.1139/gen-2016-0187


3) Verifying the geographic origin of mahogany (Swietenia macrophylla King) with DNA-fingerprints.

https://doi.org/10.1016/j.fsigen.2012.06.003


4) Wildlife forensic science: A review of genetic geographic origin assignment.

https://doi.org/10.1016/j.fsigen.2015.02.008


5) A nearest neighbour approach by genetic distance to the assignment of individual trees to geographic origin.

https://doi.org/10.1016/j.fsigen.2016.12.011


