(You can scroll down to the Results section, if you are not interested in the technical stuff)
When I proposed Clusters Galore in November 2010, I was pleasantly surprised to see that very fine scale population structure could be uncovered using a combination of two algorithms:
- A dimensionality reduction technique (such as PCA or MDS) applied to dense genotypic data
- MCLUST, a state-of-the-art model-based normal mixture clustering algorithm that has enough chops to uncover clusters of arbitrary size, shape, and orientation in multidimensional space
I explained how to carry out Clusters Galore analysis here. The most recent analysis of West Eurasians can be found here.
I have been thinking ever since about ways to improve the methodology. Since MCLUST is hard to best in my experience, I thought that improvement could be produced in the first step of the analysis; I have tried various ideas about choosing how many dimensions to retain, based on tests of normality, Tracy-Widom statistics, or some newer ideas. My conclusion has been that one could expect only small, incremental improvements from any of these ideas.
After reading an abstract by Myers et al. in last year's ICHG, I realized that further improvement in resolution might be had by exploiting the linkage structure of dense genotype data, i.e., the pattern of co-inheritance of alleles along a stretch of chromosome.
Since I didn't want to reinvent the wheel, I found the paintmychromosomes website, which I've also covered here, noting that it is unclear to what extent the ability of this methodology to infer fine-scale population structure is due to its exploitation of linkage. Unfortunately, the processing pipeline for this technique is computationally daunting, and I would probably have to wait months to carry out any meaningful experiment on the types of datasets I'm used to working with.
An hour's worth of coding may save you a month of runtime, so I decided to look elsewhere.
I've been experimenting with fastIBD for a while now, so I invested some time to write some auxiliary code that would help me use it for my purposes. fastIBD finds identical-by-descent segments in a collection of individuals. It also has several attractive properties:
- It is fast
- It does its own phasing
- It runs within BEAGLE, a very well-known genetic analysis software package
In principle one could do a single fastIBD run over an entire dataset, but the memory footprint is prohibitive. So, rather than beg for money for a bigger computer, I took ten minutes to write some code that combines the results of 22 fastIBD runs (one per chromosome). I also wrote some code that calculates how much (in Morgans) IBD sharing exists in a pair of individuals.
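The per-chromosome bookkeeping described above might look roughly like the following. This is only a minimal sketch: it assumes that the segments from each run have already been converted to `(id_a, id_b, start_cm, end_cm)` tuples with positions in centimorgans. That tuple layout and the function name `total_ibd_sharing` are hypothetical and are not fastIBD's actual output format. It also assumes that segments reported for the same pair do not overlap.

```python
from collections import defaultdict


def total_ibd_sharing(segments_by_chrom):
    """Sum IBD segment lengths (in Morgans) per pair across chromosomes.

    segments_by_chrom maps a chromosome label to a list of
    (id_a, id_b, start_cm, end_cm) tuples. This layout is a hypothetical
    simplification, not the real fastIBD output format.
    """
    sharing = defaultdict(float)
    for chrom, segments in segments_by_chrom.items():
        for id_a, id_b, start_cm, end_cm in segments:
            # Use an unordered pair key so (A, B) and (B, A) are merged.
            pair = tuple(sorted((id_a, id_b)))
            # Convert centimorgans to Morgans while accumulating.
            sharing[pair] += (end_cm - start_cm) / 100.0
    return dict(sharing)


# Toy input: two chromosomes, three individuals (made-up values).
example = {
    "1": [("A", "B", 10.0, 25.0), ("B", "C", 0.0, 5.0)],
    "2": [("B", "A", 40.0, 45.0)],
}
sharing = total_ibd_sharing(example)
```

Keying on the sorted pair means the order in which a run reports the two individuals does not matter, which is convenient when results from separate runs are merged.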
fastIBD is tunable in various ways, but the default parameters seem to work fine for my purposes.
The end result of my labors is an NxN matrix (if there are N individuals) of IBD-based distances between individuals, measured as the genetic length, in Morgans, that is not IBD between each pair. This can then be fed into R's MDS routine, and then it's business as usual.
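To make the distance-then-MDS step concrete, here is a minimal pure-Python sketch. It assumes the distance for a pair is the total genetic length minus the length shared IBD. The `genome_length` value is an illustrative placeholder, and the function names are hypothetical. The MDS part is plain classical (Torgerson) scaling done by double centering and power iteration, a stand-in for R's routine rather than a copy of it, and it only recovers dimensions whose eigenvalues are positive.

```python
import math
import random


def ibd_distance_matrix(ids, sharing, genome_length):
    """Distance = genetic length (Morgans) NOT shared IBD by each pair."""
    n = len(ids)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            pair = tuple(sorted((ids[i], ids[j])))
            dist[i][j] = dist[j][i] = genome_length - sharing.get(pair, 0.0)
    return dist


def classical_mds(dist, k=2, iters=500, seed=0):
    """Classical MDS: double-center squared distances, then extract the
    top-k eigenvectors by power iteration with deflation."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # B = -1/2 * J * D^2 * J (dist is assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for dim in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if lam <= 1e-12:
            break  # no positive eigenvalue left to use
        for i in range(n):
            coords[i][dim] = v[i] * math.sqrt(lam)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Toy usage with made-up sharing values and a placeholder genome length.
ids = ["A", "B", "C"]
toy_sharing = {("A", "B"): 0.20, ("B", "C"): 0.05}
dist = ibd_distance_matrix(ids, toy_sharing, genome_length=35.0)
coords = classical_mds(dist, k=2)
```

For a real dataset a library eigen-solver would be far faster than power iteration; the point here is only to show what the distance matrix feeds into.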
Results
I assembled a North European dataset for testing my ideas. A thing that always bugged me was the lack of ability to detect much population structure in the British Isles. So, I hoped that the added "punch" of using fastIBD would finally uncover this structure. All analyses were run with 256,932 SNPs.
Clusters Galore (fastIBD)
The first 26 MDS dimensions deviated from normality according to a Shapiro-Wilk test, and MCLUST found a total of 21 clusters using these dimensions.
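The idea of keeping only the leading dimensions that deviate from normality can be sketched as follows. Two things here are assumptions and not the analysis actually run. First, to stay within the standard library, the sketch swaps the Shapiro-Wilk test for the simpler Jarque-Bera test, which uses skewness and kurtosis and is only asymptotically valid. Second, the stopping rule (stop at the first dimension that looks normal) and the 0.05 threshold are illustrative guesses.

```python
import math
from statistics import NormalDist


def jarque_bera_pvalue(values):
    """Approximate p-value of the Jarque-Bera normality test.

    NOTE: this is a stand-in for Shapiro-Wilk, not the same test.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    if m2 == 0.0:
        return 1.0  # constant column: nothing to test
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square with 2 degrees of freedom.
    return math.exp(-jb / 2.0)


def leading_non_normal_dims(coords, alpha=0.05):
    """Count leading dimensions whose values reject normality."""
    kept = 0
    for dim in range(len(coords[0])):
        column = [row[dim] for row in coords]
        if jarque_bera_pvalue(column) >= alpha:
            break  # assumed rule: stop at the first normal-looking dimension
        kept += 1
    return kept


# Toy data: dimension 0 is bimodal (two "blobs"), dimension 1 is
# close to normal (evenly spaced normal quantiles).
bimodal = [-1.0] * 50 + [1.0] * 50
normalish = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
toy_coords = [[bimodal[i], normalish[i]] for i in range(100)]
n_dims = leading_non_normal_dims(toy_coords)
```

The intuition is that a dimension carrying cluster structure tends to look multimodal and therefore non-normal, while a dimension that is mostly noise tends to look normal.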
| Population | N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Russian_D | 22 | 2 | 20 | | | | | | | | | | | | | | | | | | | |
| Irish_D | 22 | | | 19 | 2 | 1 | | | | | | | | | | | | | | | | |
| Polish_D | 23 | | 23 | | | | | | | | | | | | | | | | | | | |
| German_D | 21 | | 1 | | | 18 | 2 | | | | | | | | | | | | | | | |
| Finnish_D | 17 | | | | | | | | 10 | 7 | | | | | | | | | | | | |
| Swedish_D | 13 | | | | | 1 | | 12 | | | | | | | | | | | | | | |
| English_D | 12 | | | | 12 | | | | | | | | | | | | | | | | | |
| British_D | 13 | | | 1 | 12 | | | | | | | | | | | | | | | | | |
| Norwegian_D | 11 | | | | | | | 11 | | | | | | | | | | | | | | |
| Lithuanian_D | 10 | | 1 | | | | | | | | 9 | | | | | | | | | | | |
| Dutch_D | 9 | | | | 5 | 3 | | 1 | | | | | | | | | | | | | | |
| British_Isles_D | 8 | | | | 8 | | | | | | | | | | | | | | | | | |
| Mixed_Scandinavian_D | 4 | | | | | | | 4 | | | | | | | | | | | | | | |
| Danish_D | 3 | | | | | | | 3 | | | | | | | | | | | | | | |
| Ukrainian_D | 2 | | 2 | | | | | | | | | | | | | | | | | | | |
| Latvian_D | 1 | | 1 | | | | | | | | | | | | | | | | | | | |
| Estonian_D | 1 | | 1 | | | | | | | | | | | | | | | | | | | |
| Russian | 25 | | 1 | | | | | | | | | | 24 | | | | | | | | | |
| Orcadian | 15 | | | | | | | | | | | 15 | | | | | | | | | | |
| Lithuanians | 10 | | 1 | | | | | | | | 9 | | | | | | | | | | | |
| Belorussian | 9 | | 8 | | | | | | | | 1 | | | | | | | | | | | |
| Orkney_1KG | 25 | | | | | | | | | | | 19 | | 2 | 2 | 2 | | | | | | |
| Kent_1KG | 38 | | | | 34 | | | | | | | | | | | | 2 | 2 | | | | |
| Cornwall_1KG | 33 | | | | 1 | | | | | | | | | | | | | | 30 | 2 | | |
| Argyll_1KG | 4 | | | | | | | | | | | | | | | | | | 1 | | 3 | |
| FIN30 | 30 | | | | | | | | 14 | 16 | | | | | | | | | | | | |
| Ukranians_Y | 20 | | 20 | | | | | | | | | | | | | | | | | | | |
| Mordovians_Y | 15 | 13 | | | | | | | | | | | | | | | | | | | | 2 |
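A population-by-cluster count table of this kind can be tallied from two parallel lists, one holding each individual's population label and one holding the cluster assigned to that individual. The sketch below uses made-up toy labels (`PopA`, `PopB`) and a hypothetical function name, and is not tied to MCLUST's actual output format.

```python
from collections import Counter


def population_by_cluster(populations, clusters):
    """Count individuals per (population, cluster) combination."""
    counts = Counter(zip(populations, clusters))
    table = {}
    for (pop, cluster), n in counts.items():
        table.setdefault(pop, {})[cluster] = n
    return table


# Toy labels for five individuals (illustrative only).
pops = ["PopA", "PopA", "PopA", "PopB", "PopB"]
assigned = [3, 3, 4, 4, 4]
table = population_by_cluster(pops, assigned)
```

Clusters in which a population has no members are simply absent from that population's row, which corresponds to the blank cells in the table.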
The clusters could be labeled as:
1. Mordovian
2. Slavic
3. Irish
4. English/British
5. German
6. Mini-cluster of 2 related Germans?
7. Scandinavian
8. Finnish 1
9. Finnish 2
10. Lithuanian
11. Orkney
12. Vologda Russians (HGDP)
13. Mini-cluster of 2 Orkney
14. Mini-cluster of 2 Orkney
15. Mini-cluster of 2 Orkney
16. Mini-cluster of 2 Kent English
17. Mini-cluster of 2 Kent English
18. Cornwall
19. Mini-cluster of 2 Cornwall
20. Argyll
21. Mini-cluster of 2 Mordovians
So, it seems that my intuition was correct. There is a fairly clean division of Lithuanians and Slavs that was much more muddled whenever it came up before, a clean division of Mordvins and Russians, and a fairly comprehensive split of British Isles populations: a quite clean Irish cluster, a Cornwall cluster, an Argyll cluster, and a Kent/English cluster. Note that the British_D and British_Isles_D populations consist mostly of English plus some other British Isles individuals, so I am not very surprised that they fall in the main English cluster.
The entire analysis (fastIBD + MCLUST) took a few hours to run.
Clusters Galore (PLINK/MDS)
For comparison, using the "classical" Clusters Galore with PLINK's MDS facility, there were 15 non-normal dimensions. A total of 16 clusters were inferred:
| Population | N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Russian_D | 22 | 2 | 15 | | 5 | | | | | | | | | | | | |
| Irish_D | 22 | | | | | | 6 | 6 | 10 | | | | | | | | |
| Polish_D | 23 | | 20 | 3 | | | | | | | | | | | | | |
| German_D | 21 | | 1 | 8 | | | 4 | 3 | 5 | | | | | | | | |
| Finnish_D | 17 | | | | | | | | | | 17 | | | | | | |
| Swedish_D | 13 | | | | | 12 | | 1 | | | | | | | | | |
| English_D | 12 | | | | | | 5 | 2 | 5 | | | | | | | | |
| British_D | 13 | | | | | | 1 | 3 | 9 | | | | | | | | |
| Norwegian_D | 11 | | | | | 5 | 2 | 3 | 1 | | | | | | | | |
| Lithuanian_D | 10 | | 10 | | | | | | | | | | | | | | |
| Dutch_D | 9 | | | | | | 2 | 4 | 3 | | | | | | | | |
| British_Isles_D | 8 | | | | | | 3 | 1 | 4 | | | | | | | | |
| Mixed_Scandinavian_D | 4 | | | | | 4 | | | | | | | | | | | |
| Danish_D | 3 | | | | | 1 | | 1 | 1 | | | | | | | | |
| Ukrainian_D | 2 | | 1 | 1 | | | | | | | | | | | | | |
| Latvian_D | 1 | | 1 | | | | | | | | | | | | | | |
| Estonian_D | 1 | | 1 | | | | | | | | | | | | | | |
| Russian | 25 | 8 | 1 | | 16 | | | | | | | | | | | | |
| Orcadian | 15 | | | | | | | | | 6 | | 9 | | | | | |
| Lithuanians | 10 | | 10 | | | | | | | | | | | | | | |
| Belorussian | 9 | | 9 | | | | | | | | | | | | | | |
| Orkney_1KG | 25 | | | | | | | 2 | | 14 | | 5 | 2 | 2 | | | |
| Kent_1KG | 38 | | | | | | 15 | 8 | 11 | | | | | | 2 | 2 | |
| Cornwall_1KG | 33 | | | | | | 17 | 1 | 13 | | | | | | | | 2 |
| Argyll_1KG | 4 | | | | | | | 2 | 2 | | | | | | | | |
| FIN30 | 30 | | | | | 1 | | | | | 29 | | | | | | |
| Ukranians_Y | 20 | | 20 | | | | | | | | | | | | | | |
| Mordovians_Y | 15 | 11 | 1 | | 3 | | | | | | | | | | | | |
Some of the distinctions are lost: the 2 Finnish clusters are rolled into 1; Lithuanians and Slavs are rolled into 1; there are 3 British Isles clusters with substantial overlap between different populations as well as with Scandinavians; Mordovians and Russians overlap; and there is no German cluster.
It seems pretty clear to me that Clusters Galore (fastIBD) is the way to go into the future for this type of analysis, and hopefully further refinements to the methodology and the addition of more project participants will add even more resolution.
Clustering relies on (i) the ability to detect "blobs" of individuals, and (ii) the existence of such "blobs" of individuals. Clusters Galore (fastIBD edition) seems to be pretty good at doing (i), but it is only as good as the data it's fed. For example, currently the Dutch seem split between the English and the Germans, but I have little doubt that if their sample size were to grow, they would also form their own specific cluster.
If you are a Project participant from these groups, you can find the results of this run in this spreadsheet.