January 20, 2012

Archaic DNA data mining for dummies

I have repeatedly stressed how full genome sequencing will allow us to detect archaic DNA in modern humans, so I thought of writing a simple post where I lay out the rationale behind my conviction.

The age of the microarray

Microarrays test for a few 10^5 variants in the human genome. Conceptually, we can view the difference between two individuals as follows:


As you can see, these two individuals differ in a couple of locations tested by the microarray and are the same in one.

The age of the full genome

What will happen when we use full genomes? All the unknown positions in the two sequences will be known.

This may end up looking like this (Possibility #1):


i.e., the sites that were polymorphic in the microarray were the only ones that were polymorphic, and the rest of the two sequences are carbon copies of each other.

Or, it may end up looking like this (Possibility #2):


i.e., there are additional differences between the two individuals that were not captured by the microarray.

In the second scenario, there are 6 differences between the two sequences, compared to only 2 in the first. Since mutations accumulate roughly in proportion to time, the two sequences in the second scenario share a much older common ancestor than in the first.
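The counting itself is trivial to do in code. Here is a minimal Python sketch; the sequences are made-up toy strings (not the ones from the figures), and the function name is only illustrative.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical toy data: one reference and two alternative "second individuals".
reference = "ACGTTAGCATTACGGA"
possibility_1 = "ACGATAGCATTACGCA"  # differs only at the two microarray sites
possibility_2 = "ACGATATCACTGTGCA"  # differs at four additional sites

d1 = count_differences(reference, possibility_1)
d2 = count_differences(reference, possibility_2)

# Under a constant mutation rate, the ratio of differences is a crude
# estimate of the ratio of the two common-ancestor ages.
relative_age = d2 / d1
print(d1, d2, relative_age)
```

With 2 versus 6 differences the second pair looks about three times as old, although with so few mutations such a ratio is very noisy.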

By scanning stretches of DNA in full genomes, it is possible to identify regions where the number of mutations between two sequences is so high (expressed, e.g., as a fraction of the number of differences between humans and chimps) that the common ancestor must have lived a very long time ago, even millions of years ago.
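Such a scan could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the three sequences are already aligned and of equal length, the window size, step and `min_fraction` threshold are arbitrary values I picked, and the toy data are invented.

```python
def windowed_differences(seq_a, seq_b, window, step):
    """Return (start, n_differences) for each full window along two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        chunk_a = seq_a[start:start + window]
        chunk_b = seq_b[start:start + window]
        results.append((start, sum(a != b for a, b in zip(chunk_a, chunk_b))))
    return results


def flag_deep_windows(human_a, human_b, chimp, window, step, min_fraction):
    """Flag windows where human-human differences reach min_fraction of human-chimp differences."""
    human_human = windowed_differences(human_a, human_b, window, step)
    human_chimp = windowed_differences(human_a, chimp, window, step)
    flagged = []
    for (start, d_hh), (_, d_hc) in zip(human_human, human_chimp):
        # Skip windows with no human-chimp differences to avoid dividing by zero.
        if d_hc > 0 and d_hh / d_hc >= min_fraction:
            flagged.append((start, d_hh, d_hc))
    return flagged


# Hypothetical toy data: the two humans are identical in the first window
# and differ at four sites in the second.
human_a = "A" * 20
human_b = "A" * 10 + "CCCC" + "A" * 6
chimp = "AG" * 10

deep = flag_deep_windows(human_a, human_b, chimp, window=10, step=10, min_fraction=0.5)
print(deep)
```

A real analysis would of course have to deal with alignment, sequencing errors and variation in mutation rate along the genome, none of which is handled here.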

In some cases, we will be able to directly compare these sequences to actual archaic hominins, which is how Mendez et al. were able to infer archaic introgression from a Denisova-like hominin into Melanesians. But, even in the absence of archaic DNA, a good case for archaic admixture can still be made.

Balancing selection

Balancing selection is one mechanism whereby two very different sequences could be maintained for a very long time in the human population. The major histocompatibility complex is one part of the human genome where this is believed to take place.

In its best-known form, balancing selection occurs when heterozygotes have a selective advantage over homozygotes. In "regular" evolution, one allele drives the other to extinction, either by simple chance (drift) or because it confers an advantage (directional selection). In balancing selection the two alleles are maintained because people who carry both of them (heterozygotes) outbreed people who carry only one or the other (homozygotes).
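To see why heterozygote advantage keeps both alleles around, here is a small Python sketch of the textbook one-locus, two-allele model. It assumes an infinite, randomly mating population (so no drift), with relative fitnesses 1 - s, 1 and 1 - t for the AA, Aa and aa genotypes; the parameter values are purely illustrative.

```python
def next_frequency(p, s, t):
    """One generation of selection on allele A (frequency p) with heterozygote advantage."""
    q = 1.0 - p
    w_AA, w_Aa, w_aa = 1.0 - s, 1.0, 1.0 - t
    mean_fitness = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    # Frequency of A after selection: A-bearing genotypes weighted by fitness.
    return p * (p * w_AA + q * w_Aa) / mean_fitness


def simulate(p0, s, t, generations):
    """Iterate the deterministic recursion starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s, t)
    return p


# Whether A starts rare or common, it settles at the same intermediate frequency.
from_rare = simulate(0.01, s=0.1, t=0.3, generations=2000)
from_common = simulate(0.99, s=0.1, t=0.3, generations=2000)
print(from_rare, from_common)
```

In this model the frequency of A converges to t / (s + t), here 0.75, regardless of the starting point, so neither allele is ever lost.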

It is, however, possible to distinguish between sequences maintained by balancing selection and those that are not. For example, one can examine the functional consequence of polymorphism, or survey the geographical distribution of the variant sequences.

Recombination

A different issue is that of recombination. Recombination slices up genome sequences and stitches up new sequences that are a combination of those inherited from one's father and mother. Going back to our previous example:


Now consider this:


You can see that the two sequences now appear more similar to each other. This could be because a stretch of DNA (ATTA, in blue) from the top sequence has been stitched into the bottom one.

If there has been archaic admixture in modern humans, we cannot expect to find very long stretches of archaic DNA. Rather, we expect to find a pastiche of archaic and modern sequence due to multiple generations of recombination. For really old admixture events recombination may obliterate all traces of admixture altogether!
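To get a feel for the numbers, one can use a rule of thumb from standard admixture models (my addition here, not something derived above): segments inherited from an admixture event t generations ago have a mean length of roughly 1/t Morgans. The Python sketch below converts this to base pairs, assuming about 1 cM per megabase and a generation time of 25 years; both are rough illustrative values.

```python
def expected_tract_length_bp(generations, bp_per_centimorgan=1_000_000):
    """Approximate mean admixture tract length in base pairs (mean of 1/t Morgans)."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    # 1/t Morgans = 100/t centimorgans; convert cM to bp with the assumed rate.
    return 100 * bp_per_centimorgan / generations


def years_to_generations(years, years_per_generation=25):
    """Convert years to generations using an assumed generation time."""
    return years / years_per_generation


for years in (1_000, 10_000, 100_000):
    generations = years_to_generations(years)
    print(years, "years:", round(expected_tract_length_bp(generations)), "bp")
```

Under these assumptions, admixture 100,000 years ago leaves segments averaging only about 25,000 bp, which falls easily between the test points of a microarray.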

This is why full genome sequencing is important, since it allows us to look at arbitrarily small lengths of DNA. Archaic sequences of various lengths may lurk in between the test points covered by microarrays, and by comparing full genomes we have a chance of uncovering them.

It may not, however, be possible to detect archaic admixture in very small lengths, because of statistics: 10 mutations in a length of 100 and 100 mutations in a length of 1,000 both give the same age estimate, but the latter has a much tighter confidence interval.
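The effect can be illustrated with a quick Python calculation. It treats the number of mutations as Poisson-distributed and uses a normal approximation for the 95% interval, which is crude for counts as small as 10, so take it as a sketch rather than a proper estimate.

```python
import math


def divergence_interval(mutations, length, z=1.96):
    """Return (rate, low, high): per-site divergence with an approximate 95% interval."""
    rate = mutations / length
    # For a Poisson count k, the standard error of k/length is sqrt(k)/length.
    half_width = z * math.sqrt(mutations) / length
    return rate, max(0.0, rate - half_width), rate + half_width


short_rate, short_low, short_high = divergence_interval(10, 100)
long_rate, long_low, long_high = divergence_interval(100, 1000)

print(short_rate, short_low, short_high)
print(long_rate, long_low, long_high)
```

Both stretches give a divergence of 0.1 per site, but the interval is roughly 0.038 to 0.162 for the short stretch and 0.080 to 0.120 for the long one, i.e., about three times narrower.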


Full genome sequencing will allow us to detect archaic DNA in modern humans by identifying regions of DNA that have common ancestors that are much older than the genomewide average. Some of these regions may be explained by balancing selection, while traces of others may have been lost by recombination. Nonetheless, not all of the evidence will have disappeared (especially for events in the last 100-200 thousand years), so expect it to surface sooner or later.


  1. You hit the point, recombination. Just as mutations, by comparing the DNA will be curious to delve and discover the laws that govern the evolution.

  2. As I understand, this is what Anders Pålssen is doing in "Fennoscandia Biology Project, pilot"? Very impressive works.

  3. In principle perfect approach. If we have star-like population expansion and the diversity is preserved (at least one segment of MRCA), then we can infer the genome of MRCA. However going further is impossible in principle (other lineages did not contribute to current DNA). Given all that, would you estimate that this approach could work maximum 5000 years back (that is our most likely TMRCA)?

