Inference of Ancestry, Sex and Trait Liabilities From Whole Genome Single Nucleotide Polymorphisms

Northfield Mount Hermon School
Sep 7, 2025

Lorcan Purcell

As the human genome is 99.6% identical between individuals, comparing genomes in their entirety
would waste both computational resources and effort (1). Instead, the variants that differ
between individuals are used for comparison. Such variants include repeat regions, inversions
and deletions. However, these classes of variation are not assayed by the approaches used in this study.
Rather, single nucleotide polymorphisms (SNPs) are used, which are single nucleotide changes that
are found within at least 1% of the global population (distinct from rare mutations that may only
be carried by a single individual) (2). SNPs are the most common form of genetic variation between people
and most often occur in the regions of DNA between genes (2). While a significant majority of SNPs
have benign effects, there are others that can help explain an individual’s susceptibility to a particular
disease or condition (3). One major application of SNP-based analysis is within genealogical companies
such as Ancestry.com. As part of its analysis process, Ancestry reports genotyping 730,525 SNPs,
which it claims “account for majority of common genetic variation in European and other worldwide
populations”. These SNPs are then compared to population reference panels in order to infer an individual’s
genetic makeup (4). Through this investigation, a similar, albeit slightly reductive, process will be
carried out to infer an individual’s ancestry, sex and predisposition to certain traits.
While this investigation does not fall under the realm of hypothesis testing, it remains an important
proof-of-principle project with significant didactic value.
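
To make the idea of comparing genotypes against population reference panels concrete, the following is a minimal sketch rather than the method used by Ancestry or in this study. It assumes Hardy-Weinberg proportions and independent SNPs, and every name and value in it (REFERENCE_PANEL, POP_A, POP_B, the allele frequencies) is hypothetical and chosen only for illustration. Each population is scored by the log-likelihood of the individual's genotypes given that population's allele frequencies, and the highest-scoring population is reported.

```python
import math

# Hypothetical alternate-allele frequencies at three SNPs for two made-up
# reference populations. These are illustrative values, not real data.
# Frequencies must lie strictly between 0 and 1 so the logarithm is defined.
REFERENCE_PANEL = {
    "POP_A": [0.10, 0.80, 0.50],
    "POP_B": [0.60, 0.20, 0.50],
}


def log_likelihood(genotypes, freqs):
    """Log-probability of the observed genotypes under one population.

    Each genotype is the count of alternate alleles at a SNP (0, 1 or 2).
    Assumes Hardy-Weinberg proportions and independence between SNPs.
    """
    total = 0.0
    for g, p in zip(genotypes, freqs):
        # Binomial probability of g alternate alleles out of 2 chromosomes.
        total += math.log(math.comb(2, g) * p**g * (1 - p) ** (2 - g))
    return total


def best_matching_population(genotypes, panel=REFERENCE_PANEL):
    """Return the highest-scoring population and all per-population scores."""
    scores = {pop: log_likelihood(genotypes, freqs) for pop, freqs in panel.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical individual: 0, 2 and 1 alternate alleles at the three SNPs.
    best, scores = best_matching_population([0, 2, 1])
    print(best, scores)
```

A real analysis would use hundreds of thousands of SNPs and would typically model admixture between populations instead of assigning a single best match.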