When I proposed Clusters Galore in November 2010, I was pleasantly surprised to see that very fine-scale population structure could be uncovered using a combination of two algorithms:
- A dimensionality reduction technique (such as PCA or MDS) applied to dense genotypic data
- MCLUST, a state-of-the-art model-based normal mixture clustering algorithm that had enough chops to uncover clusters of arbitrary size, shape, and orientation in multidimensional space
I have been thinking ever since about ways to improve the methodology. Since MCLUST is hard to best in my experience, I thought that any improvement would have to come in the first step of the analysis; I have tried various ideas about choosing how many dimensions to retain, based on tests of normality, Tracy-Widom statistics, or some newer ideas. My conclusion has been that one could expect only incremental improvements from any of these ideas.
After reading an abstract by Myers et al. in last year's ICHG, I realized that further improvement in resolution might be had by exploiting the linkage structure of dense genotype data, i.e., the pattern of co-inheritance of alleles along a stretch of chromosome.
Since I didn't want to reinvent the wheel, I found the paintmychromosomes website, which I've also covered here, noting that it is unclear to what extent the ability of this methodology to infer fine-scale population structure is due to its exploitation of linkage. Unfortunately, the processing pipeline for this technique is computationally daunting, and I would probably have to wait months to carry out any meaningful experiment on the types of datasets I'm used to working with.
An hour's worth of coding may save you a month of runtime, so I decided to look elsewhere.
I've been experimenting with fastIBD for a while now, so I invested some time to write some auxiliary code that would help me use it for my purposes. fastIBD finds identical-by-descent segments in a collection of individuals. It also has several attractive properties:
- It is fast
- It does its own phasing
- It runs within BEAGLE, a very well-known genetic analysis package
fastIBD is tunable in various ways, but the default parameters seem to work fine for my purposes.
The end result of my labors is an NxN matrix (if there are N individuals) of IBD-based distances between individuals (measured in Morgans that are not IBD between the two individuals). This can then be fed into R's MDS routine, and then it's business as usual.
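To make the MDS step concrete, here is a minimal Python sketch of classical (Torgerson) MDS applied to a symmetric distance matrix. It is an illustration only: the analysis in this post used R's MDS routine, and treating that as the classical variant is an assumption. The function names are hypothetical, and plain power iteration with deflation stands in for a proper eigensolver.

```python
import math
import random


def double_center(dist):
    """Turn a symmetric distance matrix into the Gram matrix B = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # The matrix is symmetric, so column means equal row means.
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpair(b, iters=500, seed=0):
    """Power iteration: finds the eigenpair with the largest absolute eigenvalue."""
    rng = random.Random(seed)
    n = len(b)
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue estimate for the unit vector v.
    lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v


def classical_mds(dist, k=2):
    """Return n rows of k coordinates; dimensions without a positive eigenvalue stay 0."""
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * k for _ in range(n)]
    for d in range(k):
        lam, v = top_eigenpair(b)
        if lam <= 1e-12:
            # No positive variance left (or a negative eigenvalue dominates).
            break
        scale = math.sqrt(lam)
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate: remove the component just extracted before the next pass.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords
```

Calling `classical_mds(distance_matrix, k=26)` would give one row of coordinates per individual, which is the kind of input a clustering step such as MCLUST then works on. For matrices of realistic size a library eigensolver would be the sensible choice; the loop above is only meant to show the arithmetic.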
Results
I assembled a North European dataset for testing my ideas. A thing that always bugged me was the lack of ability to detect much population structure in the British Isles. So, I hoped that the added "punch" of using fastIBD would finally uncover this structure. All analyses were run with 256,932 SNPs.
Clusters Galore (fastIBD)
The first 26 MDS dimensions deviated from normality according to a Shapiro-Wilk test, and MCLUST found a total of 21 clusters using these dimensions.
Population | N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
Russian_D | 22 | 2 | 20 | |||||||||||||||||||
Irish_D | 22 | 19 | 2 | 1 | ||||||||||||||||||
Polish_D | 23 | 23 | ||||||||||||||||||||
German_D | 21 | 1 | 18 | 2 | ||||||||||||||||||
Finnish_D | 17 | 10 | 7 | |||||||||||||||||||
Swedish_D | 13 | 1 | 12 | |||||||||||||||||||
English_D | 12 | 12 | ||||||||||||||||||||
British_D | 13 | 1 | 12 | |||||||||||||||||||
Norwegian_D | 11 | 11 | ||||||||||||||||||||
Lithuanian_D | 10 | 1 | 9 | |||||||||||||||||||
Dutch_D | 9 | 5 | 3 | 1 | ||||||||||||||||||
British_Isles_D | 8 | 8 | ||||||||||||||||||||
Mixed_Scandinavian_D | 4 | 4 | ||||||||||||||||||||
Danish_D | 3 | 3 | ||||||||||||||||||||
Ukrainian_D | 2 | 2 | ||||||||||||||||||||
Latvian_D | 1 | 1 | ||||||||||||||||||||
Estonian_D | 1 | 1 | ||||||||||||||||||||
Russian | 25 | 1 | 24 | |||||||||||||||||||
Orcadian | 15 | 15 | ||||||||||||||||||||
Lithuanians | 10 | 1 | 9 | |||||||||||||||||||
Belorussian | 9 | 8 | 1 | |||||||||||||||||||
Orkney_1KG | 25 | 19 | 2 | 2 | 2 | |||||||||||||||||
Kent_1KG | 38 | 34 | 2 | 2 | ||||||||||||||||||
Cornwall_1KG | 33 | 1 | 30 | 2 | ||||||||||||||||||
Argyll_1KG | 4 | 1 | 3 | |||||||||||||||||||
FIN30 | 30 | 14 | 16 | |||||||||||||||||||
Ukranians_Y | 20 | 20 | ||||||||||||||||||||
Mordovians_Y | 15 | 13 | 2 |
The clusters could be labeled as:
- Mordovian
- Slavic
- Irish
- English/British
- German
- Mini-cluster of 2 related Germans?
- Scandinavian
- Finnish 1
- Finnish 2
- Lithuanian
- Orkney
- Vologda Russians (HGDP)
- Mini-cluster of 2 Vologda Russians
- Mini-cluster of 2 Vologda Russians
- Mini-cluster of 2 Vologda Russians
- Mini-cluster of 2 Kent English
- Mini-cluster of 2 Kent English
- Cornwall
- Mini-cluster of 2 Cornwall
- Argyll
- Mini-cluster of 2 Mordovians
The entire analysis (fastIBD + MCLUST) took a few hours to run.
Clusters Galore (PLINK/MDS)
For comparison, using the "classical" Clusters Galore with PLINK's MDS facility, there were 15 non-normal dimensions. A total of 16 clusters were inferred:
Population | N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
Russian_D | 22 | 2 | 15 | 5 | |||||||||||||
Irish_D | 22 | 6 | 6 | 10 | |||||||||||||
Polish_D | 23 | 20 | 3 | ||||||||||||||
German_D | 21 | 1 | 8 | 4 | 3 | 5 | |||||||||||
Finnish_D | 17 | 17 | |||||||||||||||
Swedish_D | 13 | 12 | 1 | ||||||||||||||
English_D | 12 | 5 | 2 | 5 | |||||||||||||
British_D | 13 | 1 | 3 | 9 | |||||||||||||
Norwegian_D | 11 | 5 | 2 | 3 | 1 | ||||||||||||
Lithuanian_D | 10 | 10 | |||||||||||||||
Dutch_D | 9 | 2 | 4 | 3 | |||||||||||||
British_Isles_D | 8 | 3 | 1 | 4 | |||||||||||||
Mixed_Scandinavian_D | 4 | 4 | |||||||||||||||
Danish_D | 3 | 1 | 1 | 1 | |||||||||||||
Ukrainian_D | 2 | 1 | 1 | ||||||||||||||
Latvian_D | 1 | 1 | |||||||||||||||
Estonian_D | 1 | 1 | |||||||||||||||
Russian | 25 | 8 | 1 | 16 | |||||||||||||
Orcadian | 15 | 6 | 9 | ||||||||||||||
Lithuanians | 10 | 10 | |||||||||||||||
Belorussian | 9 | 9 | |||||||||||||||
Orkney_1KG | 25 | 2 | 14 | 5 | 2 | 2 | |||||||||||
Kent_1KG | 38 | 15 | 8 | 11 | 2 | 2 | |||||||||||
Cornwall_1KG | 33 | 17 | 1 | 13 | 2 | ||||||||||||
Argyll_1KG | 4 | 2 | 2 | ||||||||||||||
FIN30 | 30 | 1 | 29 | ||||||||||||||
Ukranians_Y | 20 | 20 | |||||||||||||||
Mordovians_Y | 15 | 11 | 1 | 3 |
Some of the distinctions are lost: 2 Finnish clusters rolled into 1; Lithuanians and Slavs rolled into 1; there are 3 British Isles clusters with substantial overlap between different populations as well as with Scandinavians; Mordovians and Russians overlap; no German cluster.
It seems pretty clear to me that Clusters Galore (fastIBD) is the way to go into the future for this type of analysis, and hopefully further refinements to the methodology and the addition of more project participants will add even more resolution.
Clustering relies on (i) the ability to detect "blobs" of individuals, and (ii) the existence of such "blobs" of individuals. Clusters Galore (fastIBD edition) seems to be pretty good at doing (i), but it's only as good as the data it's fed. For example, currently the Dutch seem split between the English and the Germans, but I have little doubt that if their sample sizes were to grow, they would also form their own specific cluster.
If you are a Project participant from these groups, you can find the results of this run in this spreadsheet.
12 comments:
Nice,
I would like to try this analysis.
Is the code that calculates how much IBD sharing exists in a pair of individuals available?
I would be REALLY grateful if you could help me.
THANK YOU.
fastIBD outputs a list of segments together with the probability that they are IBD. You can calculate the length of these segments in Morgans using recombination files, such as the following.
http://hapmap.ncbi.nlm.nih.gov/downloads/recombination/latest/rates/
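As a rough illustration of that calculation, here is a Python sketch that converts a segment's start and end positions into a genetic length by linear interpolation on a recombination map. It assumes the map for one chromosome has already been loaded into two parallel, sorted lists (physical position in bp and cumulative genetic position in cM); the function names and this in-memory layout are hypothetical, and parsing of the downloaded files is left out.

```python
from bisect import bisect_right


def genetic_position_cm(bp, map_bp, map_cm):
    """Interpolate the cumulative genetic position (cM) at a physical position (bp).

    map_bp must be sorted ascending; map_cm holds the cumulative cM at each map_bp.
    Positions outside the map are clamped to its first or last entry.
    """
    if bp <= map_bp[0]:
        return map_cm[0]
    if bp >= map_bp[-1]:
        return map_cm[-1]
    hi = bisect_right(map_bp, bp)   # first map point strictly to the right of bp
    lo = hi - 1
    frac = (bp - map_bp[lo]) / (map_bp[hi] - map_bp[lo])
    return map_cm[lo] + frac * (map_cm[hi] - map_cm[lo])


def segment_length_morgans(start_bp, end_bp, map_bp, map_cm):
    """Genetic length of the segment [start_bp, end_bp] in Morgans (1 M = 100 cM)."""
    start_cm = genetic_position_cm(start_bp, map_bp, map_cm)
    end_cm = genetic_position_cm(end_bp, map_bp, map_cm)
    return (end_cm - start_cm) / 100.0
```

With a toy map `map_bp = [0, 1000, 2000]` and `map_cm = [0.0, 1.0, 3.0]`, a segment from 500 bp to 1500 bp spans 0.5 cM to 2.0 cM, i.e. 0.015 Morgans.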
Dienekes,
Will you be releasing a utility based on this, for non-participants - due to other family members already being participants - and/or mixed-ethnicity individuals?
I don't see an easy way to do that, since this is based on DNA segments, and a tool like that would almost certainly have to deploy such segments which would violate the no-distribution principle for participant data.
I was particularly struck by how the English cluster with the Dutch. In the fastIBD version all 12 English cluster with 5 of 9 Dutch, while 3 of the other 4 Dutch cluster with Germans and 1 with Norwegians and Swedes. By comparison, only 2 of 22 Irish cluster with the English; only 1 of 33 Cornish, none of the 4 Argyll (in western Scotland) and none of the Orcadian cluster with the English. In contrast the Russian_D, Polish and Ukrainian and other Slavs all cluster together even though they separated at the brink of recorded history.
That surely corroborates the traditional historical account that the English are descended from "Anglo-Saxon" west Germanic tribes who migrated from Angeln and Lower Saxony in Germany and from Jutland (Denmark) to England via -- and who also settled -- the Frisian coastal regions of Holland and Germany. Frisian and English are closely related west Germanic languages that ordinary folk have found to be mutually intelligible.
Quote:
http://en.wikipedia.org/wiki/Frisians
"When conditions improved Frisia would receive an influx of new settlers, mostly Angles and Saxons, and these would eventually be referred to as 'Frisians', though they were not necessarily descended from the ancient Frisii. It is these 'new Frisians' who are largely the ancestors of the medieval and modern Frisians.[9]"
Researchers at UCL and some other geneticists have argued in favour of the same account.
Quote:
http://news.bbc.co.uk/1/hi/wales/2076470.stm
"Academics at University College in London comparing a sample of men from the UK with those from an area of the Netherlands where the Anglo-Saxons are thought to have originated found the English subjects had genes that were almost identical. But there were clear differences between the genetic make-up of Welsh people studied."
* * *
By the way, it is intriguing that the two Russian samples barely cluster at all! Different regions?
Thank you for your help Dienekes.
When I have the money I will join your Project.
I am from a small town in southern Italy where people speak a strange kind of Greek. The region is called Grecia Salentina. Do you know it!?
I was particularly struck by how the English cluster with the Dutch. In the fastIBD version all 12 English cluster with 5 of 9 Dutch, while 3 of the other 4 Dutch cluster with Germans and 1 with Norwegians and Swedes. By comparison, only 2 of 22 Irish cluster with the English; only 1 of 33 Cornish, none of the 4 Argyll (in western Scotland) and none of the Orcadian cluster with the English. In contrast the Russian_D, Polish and Ukrainian and other Slavs all cluster together even though they separated at the brink of recorded history.
Neither the inclusion of different populations in a single cluster (as is the case for Slavs), nor the assignment of individuals from a single population to multiple clusters (as is the case for the Dutch) has a simple interpretation.
In the latter case, it could be an artifact of the small size of Dutch_D. Clusters are formed when populations are distinctive relative to other included populations, and when they have enough individuals to make this distinction apparent.
In the former case, differences between Slavs may emerge with e.g., higher sample sizes, or higher number of clusters (after all, Mclust chooses the number of clusters via BIC which balances fit with parsimony, so it's conservative), etc.
It should also be noted that if two populations belong to the same cluster, this does not mean that there are no differences between them. Rather it means that in the context of the included populations there are no major "gaps" in the distribution of individuals.
By sampling more individuals, one usually "fills in the gaps" of the genetic map: this has a detrimental effect on the ability to infer clusters. At the same time, one "raises the peaks" of the mountains of the genetic landscape.
Clusters become apparent when the "mountains" of the genetic landscape (i.e. blobs of similar individuals) dominate the valleys of transitions between groups.
For the large number of individual clusters, the results look plausible. Color me skeptical regarding the micro-clusters in places where populations have been stable for so long that cryptic relatedness is a serious problem.
I'd be interested to see how the analysis would change if only one individual from each two or three person microcluster were included and the data were re-run (alternately, perhaps one could set a minimum cluster size of three or four).
That surely corroborates the traditional historical account that the English are descended from "Anglo-Saxon" west Germanic tribes who migrated from Angeln and Lower Saxony in Germany and from Jutland (Denmark) to England via -- and who also settled -- the Frisian coastal regions of Holland and Germany. Frisian and English are closely related west Germanic languages that ordinary folk have found to be mutually intelligible.
The British geneticist, Mark Thomas, made this same point 10 years ago in his research on the Y-chromosome. Nice to see genetic research backing up historical claims.
http://mbe.oxfordjournals.org/content/19/7/1008.full
That surely corroborates the traditional historical account that the English are descended from "Anglo-Saxon" west Germanic tribes who migrated from Angeln and Lower Saxony in Germany and from Jutland (Denmark) to England via -- and who also settled -- the Frisian coastal regions of Holland and Germany.
I maintain that this connection - in addition to what you mention - also goes farther back in time.
(i) The agriculturalists who entered the Isles were likely identical to those settling along the "North Sea" in the first place (but different from the "natives", who likely were majority southwestern after the Younger Dryas, rather than northeastern like much of the central/northeastern continent).
(ii) There is evidence from the historical record as well as from excavated typical Germanic houses that there were Germanic settlements during, and perhaps even before Roman times along England's east coast. Which is not surprising, given generally strong and peaceful trade in the region for many centuries, if not more than a millennium.
Hi, I'm new to this area.
I want to get the population structure of soybean using fastIBD (Beagle). Now that I have fastIBD results, how do I get the matrix from "The end result of my labors is an NxN matrix"?
@Qinsi
fastIBD outputs a file ending in
*.fibd.gz
This file contains shared segments between individuals. Gunzip it, and you'll see that each line has five elements:
- IDs of the two individuals
- beginning and end marker index
- probability
This is explained in the Beagle manual.
What you need to do is, for each line, calculate the length of the segment by subtracting the physical positions (in bp) of its two markers: end minus beginning. If there is a recombination map for whatever species you are dealing with, then you can do the subtraction in cM instead.
You then have an NxN matrix which you initialize with 0, and whenever the i-th individual shares a segment with the j-th individual, you add the length of the segment to the (i, j) entry. In the end, you get a matrix of sharing between different individuals. If you subtract this from its maximum value, then you get a matrix of not-sharing, which can be used as a distance matrix.
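The steps above can be sketched in Python roughly as follows. This is a minimal illustration under several assumptions: the gunzipped lines are whitespace-separated with the five fields in the order described, `marker_pos` is a list giving the position (bp or cM) of each marker index, an end index past the last marker is clamped, and the probability column is read but not used for filtering. The function names, and the choice to set the diagonal of the distance matrix to 0, are hypothetical and not part of fastIBD itself.

```python
def sharing_matrix(lines, ids, marker_pos):
    """Accumulate total shared segment length for every pair of individuals.

    lines      -- iterable of text lines: id1 id2 start_index end_index probability
    ids        -- list of individual IDs; its order defines the matrix rows/columns
    marker_pos -- position of each marker, indexed by marker index (assumption)
    """
    index = {name: k for k, name in enumerate(ids)}
    n = len(ids)
    shared = [[0.0] * n for _ in range(n)]
    last = len(marker_pos) - 1
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        id1, id2 = fields[0], fields[1]
        start, end = int(fields[2]), int(fields[3])
        # Clamp indices so an end index one past the last marker still works.
        length = marker_pos[min(end, last)] - marker_pos[min(start, last)]
        i, j = index[id1], index[id2]
        shared[i][j] += length
        shared[j][i] += length
    return shared


def to_distance(shared):
    """Subtract each off-diagonal entry from the maximum sharing value."""
    n = len(shared)
    peak = max(shared[i][j] for i in range(n) for j in range(n) if i != j)
    # Diagonal is set to 0 so the result behaves like a distance matrix.
    return [[0.0 if i == j else peak - shared[i][j] for j in range(n)]
            for i in range(n)]
```

In practice one would read the lines from the gunzipped `*.fibd` file and process each chromosome with its own `marker_pos` list, summing the per-chromosome sharing matrices before calling `to_distance`.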