December 01, 2008

Haplotype outliers and Y-chromosome age estimation

In a large collection of Y-chromosome haplotypes from a specific haplogroup and location, there are invariably a number of outliers, i.e., haplotypes that are too distant from the rest of the group. Consider the following table, which presents the number of mutations between pairs of haplotypes (a-f):

    a  b  c  d  e
b   1
c   2  1
d   1  2  2
e   3  2  1  2
f   7  5  6  6  5

While haplotypes a-e are all within 1-3 mutations of each other, haplotype f is 5-7 mutations away from every other haplotype. It looks like it "doesn't belong".

Haplotypes such as f present a challenge:
  1. Are they true outliers? They might be an artifact of lab error, or simply extreme examples of normal variation. In the above example, if more haplotypes had been sampled, many more "pals" of f might have been found, and it would no longer appear to be isolated.
  2. If they are true outliers, how did they end up in the collection?
This is not idle speculation: visit any forum dedicated to genetic genealogy, and you will find both (i) people who have too many exact and close matches, and who are urged to upgrade their test results to a higher number of markers so that only their real close relatives "stand out" from the crowd, and (ii) people who have no close matches, or very few, and whose haplotypes seem to hang in mid-air, unconnected to any other set of Y chromosomes.

Spawn of the shipwrecked sailor

A popular explanation for outliers is that they are of foreign origin, the result of a chance event. According to this explanation, the distinctiveness of the outliers is due to being the product of a rare occurrence: a shipwrecked sailor, a lost explorer, a slave far from home, and so on.

To substantiate this as an explanation, it suffices to show that what is an "outlier" in a certain population X, is actually normal in another population Y. Then, it can be easily seen that the outlier may have ultimate origins in Y.

Relic of a bygone age

A different explanation is that outliers are relics of a previous age. Consider a country in which some important technological innovation, say farming, iron, or the bow, is introduced. Pretty soon, the inhabitants who acquire the innovation may multiply in numbers at the expense of their more isolated neighbors. Fast forward into the future, and the gene pool will be dominated by the closely related haplotypes of the "adopters", while the haplotypes of the "non-adopters" will stand out in the total population as oddities.

Implications for age estimation

Determining the cause of an outlier has important implications for determining the age of the common ancestor of the whole group:
  1. If the outlier is of foreign origin, then one must reject it, and estimate the age of the remaining, more homogeneous haplotypes. This will lead to a younger age than if the entire group were used.
  2. If the outlier is a relic, then one must incorporate it, and downgrade the statistical weight of the larger, more populous group; otherwise the age estimate will be dominated by the recently expanding group. This will lead to an older age than if the entire group were used.
As a practical example, there are 17 mutations among the 10 pairs of haplotypes a-e (average = 1.7) and 46 mutations among the 15 pairs of haplotypes a-f (average ≈ 3.1). The average number of mutations between f and the rest is, on the other hand, 5.8. If we purged f from the set, we would arrive at a young age (based on 1.7); if we did nothing, at an intermediate age (based on 3.1); and if we treated f and the young group (a-e) on an equal footing, at an old age (based on 5.8).
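The three averages can be checked directly against the table. The following is a minimal Python sketch, not part of any established age-estimation tool: the names DIST, mean_pairwise and mean_between are hypothetical, and the numbers are simply the table entries as printed. Summing those entries gives 17 mutations over the 10 pairs within a-e, and 46 over all 15 pairs within a-f.

```python
from itertools import combinations

# Pairwise mutation counts copied from the table (lower triangle).
# Keys are (earlier letter, later letter).
DIST = {
    ("a", "b"): 1,
    ("a", "c"): 2, ("b", "c"): 1,
    ("a", "d"): 1, ("b", "d"): 2, ("c", "d"): 2,
    ("a", "e"): 3, ("b", "e"): 2, ("c", "e"): 1, ("d", "e"): 2,
    ("a", "f"): 7, ("b", "f"): 5, ("c", "f"): 6, ("d", "f"): 6, ("e", "f"): 5,
}


def mean_pairwise(names):
    """Average number of mutations over all pairs drawn from `names`."""
    pairs = list(combinations(sorted(names), 2))
    return sum(DIST[p] for p in pairs) / len(pairs)


def mean_between(outlier, group):
    """Average number of mutations between one haplotype and a group."""
    total = sum(DIST[tuple(sorted((outlier, g)))] for g in group)
    return total / len(group)


core = "abcde"
print(round(mean_pairwise(core), 2))        # f purged
print(round(mean_pairwise(core + "f"), 2))  # nothing done
print(round(mean_between("f", core), 2))    # f vs. a-e on an equal footing
```

These are average mutation counts only; turning them into ages in years would additionally require a mutation rate and a generation length, neither of which is assumed here.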


The treatment of outliers in the existing literature is problematic. The default position seems to be not to analyze a haplotype group's substructure, and to use all sampled haplotypes. This may lead to either a substantial overestimation of the age (if foreign outliers are included), or a substantial underestimation (if relic outliers are given equal weight with the more populous main group).


For any collection of haplotypes, the first step should be to calculate the distribution of pairwise distances to detect outliers. Subsequently, a search of public databases or the literature should be performed to see if said outliers appear to be of foreign origin. Depending on this search (*), appropriate correction (inclusion/weighting) should be used in age estimation.

(*) Taking into account that the detection of foreign haplotypes depends on adequate sampling of the source population; hence, the absence of matches in other populations does not imply non-foreign origin.
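The first step, screening the pairwise distances for outliers, could be sketched as follows in Python. This is one possible heuristic among many and is an assumption of this example, not a method prescribed above: each haplotype's mean distance to all others is compared with the median of those means, and a haplotype is flagged when it exceeds the median by more than k median absolute deviations. The cutoff k = 3 and the names DIST, distance, mean_distance_to_others and flag_outliers are all hypothetical.

```python
from statistics import median

# Pairwise mutation counts from the example table (lower triangle).
DIST = {
    ("a", "b"): 1,
    ("a", "c"): 2, ("b", "c"): 1,
    ("a", "d"): 1, ("b", "d"): 2, ("c", "d"): 2,
    ("a", "e"): 3, ("b", "e"): 2, ("c", "e"): 1, ("d", "e"): 2,
    ("a", "f"): 7, ("b", "f"): 5, ("c", "f"): 6, ("d", "f"): 6, ("e", "f"): 5,
}


def distance(x, y):
    """Look up a pairwise distance regardless of argument order."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def mean_distance_to_others(names):
    """Mean number of mutations from each haplotype to every other one."""
    return {
        x: sum(distance(x, y) for y in names if y != x) / (len(names) - 1)
        for x in names
    }


def flag_outliers(names, k=3.0):
    """Flag haplotypes whose mean distance is far above the typical value.

    Assumed rule: mean distance > median + k * MAD. If the MAD is zero,
    anything above the median is flagged, so inspect such cases by hand.
    """
    means = mean_distance_to_others(names)
    centre = median(means.values())
    mad = median(abs(m - centre) for m in means.values())
    threshold = centre + k * mad
    return [x for x in names if means[x] > threshold]


print(flag_outliers("abcdef"))
```

On the example table this flags only f. A flagged haplotype is a candidate for the database and literature search, not a verdict: the rule cannot tell a foreign outlier from a relic.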


just passing by said...

The USA is one big mess when it comes to DNA genealogy. Most white western Europeans came from the British Isles, so that skews the results toward that population. As for my mitochondrial DNA, I can trace it back to colonial times, but it doesn't seem to fall in with typical British. It belongs to a very small (thus far) group within U5b2, under the "11653" mutation and separate from the typically British group with the "4732" mutation. What I want to know is whether there are others in my group with British roots, or whether they have continental roots, say, from Germany.

Anonymous said...

Interesting, Dienekes.
Just curious: can we extend this all the way to T, and Mt haplo? How do they show?

Maju said...

Very interesting meditation, Dienekes. I'd say that "relics of the past" are not impossible at all and may actually be a high percentage of the cases. But it's difficult to decide, sure.

To the bypasser: much of the ancestry in the USA is from continental origins (Germany especially), even that dating from colonial times. Dutch and Germans in particular were not rare among "British" colonists. That is especially true of Pennsylvania, whose colonial population was mostly of German origin, but may also be the case in others of the original states (Dutch in New York, Swedes in Delaware, French Huguenots all around, etc.). I would certainly not discard such origins, even if genuinely colonial.

Anonymous said...

Just to clarify, because it's a very common misconception Europeans have about the US and many Americans have about themselves: German is the #1 ancestry of citizens of the United States, not Irish or English, which are #2 and #3 respectively. Just because the US is a predominantly English-speaking nation does not mean most Americans are English in paternity and/or maternity. They're not. A full 1/4 to 1/3 of US citizens are entirely or partially ethnic German in background. This is from both the 1990s and 2000 census data. Of course, Germany didn't exist as a unified state until the 1870s, but the notion of one German ethnolinguistic identity dates to the late Middle Ages and early Renaissance.

Dienekes said...

"Just to clarify, because it's a very common misconception Europeans have about the US and many Americans have about themselves: German is the #1 ancestry of citizens of the United States, not Irish or English, which are #2 and #3 respectively."

I doubt that. It may very well be that Americans with deep roots going back to colonial times simply put "American" in census forms, whereas those of German descent who migrated in more recent times put their more specific origin.

English is probably the most important single European ethnic group contributing to the American population.