While haplotypes a-e are all within 1-3 mutations of each other, haplotype f is 5-7 mutations away from every other haplotype. It looks like it "doesn't belong".
Haplotypes such as f present a challenge:
- Are they true outliers? They might be an artifact of lab error, or simply extreme examples of normal variation. In the above example, if more haplotypes had been sampled, many more "pals" of f might have been found, and it would no longer appear isolated.
- If they are true outliers, how did they end up in the collection?
Spawn of the shipwrecked sailor
A popular explanation for outliers is that they are of foreign origin, the result of a chance event. According to this explanation, the outliers are distinctive because they are the product of a rare occurrence: a shipwrecked sailor, a lost explorer, a slave far from home, and so on.
To substantiate this explanation, one needs to show that what is an "outlier" in a certain population X is actually normal in another population Y. The outlier may then plausibly have its ultimate origins in Y.
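One way to make the "outlier in X, normal in Y" check concrete is to compare how far the outlier is from its nearest neighbor in each population. The sketch below is only an illustration: the marker values, the population contents, and the distance measure (sum of absolute repeat-count differences, which assumes stepwise mutation) are all hypothetical choices, not data or methods from any study.

```python
# Hypothetical toy data: repeat counts at eight made-up markers.
OUTLIER_F = (14, 12, 23, 11, 14, 12, 30, 17)

POPULATION_X = {  # the population in which f looks isolated
    "a": (14, 12, 23, 10, 13, 11, 29, 16),
    "b": (15, 12, 23, 10, 13, 11, 29, 16),
    "c": (14, 13, 23, 10, 13, 11, 29, 16),
    "d": (14, 12, 24, 10, 13, 11, 29, 16),
    "e": (15, 13, 23, 10, 13, 11, 29, 16),
}

POPULATION_Y = {  # a candidate source population (invented for illustration)
    "y1": (14, 12, 23, 11, 14, 12, 30, 18),
    "y2": (14, 12, 23, 11, 14, 13, 30, 17),
    "y3": (13, 12, 23, 11, 14, 12, 30, 17),
}


def distance(h1, h2):
    """Sum of absolute repeat-count differences (assumes stepwise mutation)."""
    return sum(abs(x - y) for x, y in zip(h1, h2))


def nearest_distance(query, population):
    """Distance from `query` to its closest haplotype in `population`."""
    return min(distance(query, h) for h in population.values())


print(nearest_distance(OUTLIER_F, POPULATION_X))  # 5
print(nearest_distance(OUTLIER_F, POPULATION_Y))  # 1
```

A small nearest-neighbor distance in Y only makes a foreign origin plausible; it does not by itself establish the direction of movement.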
Relic of a bygone age
A different explanation is that outliers are relics of a previous age. Consider a country in which some important technological innovation, say farming, iron, or the bow, is introduced. Pretty soon, the inhabitants who acquire the innovation may multiply in numbers, at the expense of their more isolated neighbors. Fast forward into the future, and the gene pool will be dominated by the closely related haplotypes of the "adopters", while the haplotypes of the "non-adopters" will stand out in the total population as oddities.
Implications for age estimation
The cause of an outlier has important implications for estimating the age of the common ancestor of the whole group:
- If the outlier is of foreign origin, then one must reject it, and age the remaining, more homogeneous haplotypes. This will lead to a younger age than if the entire group were used.
- If the outlier is a relic, then one must incorporate it, and downgrade the statistical weight of the larger, more populous group; otherwise the age estimate will be dominated by the recently expanding group. This will lead to an older age than if the entire group were used with equal weights.
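The direction of both effects can be seen in a small sketch. It uses the weighted mean pairwise distance as a rough stand-in for diversity, on the assumption that more diversity means an older common ancestor; it does not convert anything to years, which would require a mutation rate and a demographic model. The toy haplotypes, the distance measure, and the weights (0.2 for each main-group haplotype, 1.0 for the relic) are hypothetical.

```python
from itertools import combinations

# Hypothetical toy haplotypes: repeat counts at eight made-up markers.
HAPLOTYPES = {
    "a": (14, 12, 23, 10, 13, 11, 29, 16),
    "b": (15, 12, 23, 10, 13, 11, 29, 16),
    "c": (14, 13, 23, 10, 13, 11, 29, 16),
    "d": (14, 12, 24, 10, 13, 11, 29, 16),
    "e": (15, 13, 23, 10, 13, 11, 29, 16),
    "f": (14, 12, 23, 11, 14, 12, 30, 17),
}


def distance(h1, h2):
    """Sum of absolute repeat-count differences (assumes stepwise mutation)."""
    return sum(abs(x - y) for x, y in zip(h1, h2))


def weighted_mean_pairwise_distance(haplotypes, weights):
    """Mean pairwise distance, each pair weighted by the product of its weights."""
    numerator = 0.0
    denominator = 0.0
    for p, q in combinations(sorted(haplotypes), 2):
        w = weights.get(p, 0.0) * weights.get(q, 0.0)
        numerator += w * distance(haplotypes[p], haplotypes[q])
        denominator += w
    return numerator / denominator


# Default practice: every sampled haplotype counts equally.
equal = {name: 1.0 for name in HAPLOTYPES}
# Foreign-origin case: the outlier is rejected (weight zero).
without_f = {**equal, "f": 0.0}
# Relic case: the expanded main group is downweighted (illustrative weights).
relic = {name: (1.0 if name == "f" else 0.2) for name in HAPLOTYPES}

for label, weights in [("f rejected", without_f), ("equal", equal), ("relic", relic)]:
    print(label, round(weighted_mean_pairwise_distance(HAPLOTYPES, weights), 2))
# f rejected 1.6
# equal 3.07
# relic 4.74
```

Rejecting f lowers the diversity measure, and downweighting the main group raises it, relative to the equal-weight default.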
The treatment of outliers in the existing literature is problematic. The default position seems to be not to analyze a haplotype group's substructure, and to use all sampled haplotypes. This may lead to either a substantial overestimation of the age (if foreign outliers are included), or a substantial underestimation (if relic outliers are given equal weight with the more populous main group).
For any collection of haplotypes, the first step should be to calculate the distribution of pairwise distances to detect outliers. Subsequently, a search of public databases or the literature should be performed to see if said outliers appear to be of foreign origin. Depending on the result of this search (*), the appropriate correction (exclusion of foreign outliers, or reweighting when relics are included) should be used in age estimation.
(*) Taking into account that the detection of foreign haplotypes depends on adequate sampling of the source population; hence, the absence of matches in other populations does not imply non-foreign origin.
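The first step, calculating the distribution of pairwise distances, could be sketched as follows. This is a minimal illustration under stated assumptions: the six haplotypes are invented toy data, distance is the sum of absolute repeat-count differences (a stepwise-mutation assumption), and the rule "flag a haplotype whose nearest neighbor is more than 3 mutations away" is an arbitrary threshold chosen for the example, not an established criterion.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy haplotypes: repeat counts at eight made-up markers.
HAPLOTYPES = {
    "a": (14, 12, 23, 10, 13, 11, 29, 16),
    "b": (15, 12, 23, 10, 13, 11, 29, 16),
    "c": (14, 13, 23, 10, 13, 11, 29, 16),
    "d": (14, 12, 24, 10, 13, 11, 29, 16),
    "e": (15, 13, 23, 10, 13, 11, 29, 16),
    "f": (14, 12, 23, 11, 14, 12, 30, 17),
}


def distance(h1, h2):
    """Sum of absolute repeat-count differences (assumes stepwise mutation)."""
    return sum(abs(x - y) for x, y in zip(h1, h2))


def pairwise_distances(haplotypes):
    """Map each unordered pair of names to the distance between them."""
    return {
        (p, q): distance(haplotypes[p], haplotypes[q])
        for p, q in combinations(sorted(haplotypes), 2)
    }


def distance_distribution(haplotypes):
    """Count how many pairs fall at each distance."""
    return Counter(pairwise_distances(haplotypes).values())


def nearest_neighbor_distance(haplotypes):
    """For each haplotype, the distance to its closest other haplotype."""
    nearest = {}
    for (p, q), d in pairwise_distances(haplotypes).items():
        nearest[p] = min(d, nearest.get(p, d))
        nearest[q] = min(d, nearest.get(q, d))
    return nearest


def flag_outliers(haplotypes, threshold):
    """Names whose nearest neighbor is more than `threshold` mutations away."""
    nearest = nearest_neighbor_distance(haplotypes)
    return sorted(name for name, d in nearest.items() if d > threshold)


print(sorted(distance_distribution(HAPLOTYPES).items()))
# [(1, 5), (2, 4), (3, 1), (5, 1), (6, 3), (7, 1)]
print(flag_outliers(HAPLOTYPES, threshold=3))  # ['f']
```

The gap in the distribution between distance 3 and distance 5 is what singles out f here; with real data the gap is rarely this clean, and the threshold would have to be chosen with the number of markers and their mutation rates in mind.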