In the following, I have included all populations from "Human genetic variation: the first ? components" that had "West Eurasian" admixture of at least 75% at K=3.
I have also added some additional populations from Xing et al. and a number of Dodecad Ancestry Project populations, including some making their debut; sample sizes in several old populations have increased due to participation in the current submission opportunity.
The number of markers is ~37k in order to include the greatest variety of populations, but as will be seen, the ability to detect structure is not greatly diminished.
Some South Asian populations were above the 75% cutoff and form their own cluster at the bottom, with the isolated Kalash some way off. The island of Sardinia is its own island in genetic space as well.
The most distinctive feature of this plot is the separation of Europeans from West Asians. The big hole is framed by Chuvash (bottom), Greeks, Italians, and European Jews (top), Europeans (left), and West Asians (right), and probably reflects barriers to gene flow posed by the Black Sea and the Aegean.
A fairly linear cluster to the right of this hole contrasts people from the Caucasus (Urkarah, bottom) with those from Arabia (Yemenese Jews, Saudis, Bedouins).
Using the Galore approach on just the first two MDS dimensions resulted in 13 clusters:
The distinctiveness of several populations is discovered by MCLUST using just the first two dimensions, confirming our visual impression, e.g., #13: Chuvash, #12: Kalash, #3: Sardinian.
Other clusters correspond to multiple populations, e.g., #9: Caucasus, and #10: Arabians.
In the latter case, as I have mentioned several times before, we should not conclude that these populations are identical, but see whether they can be divided using additional MDS dimensions.
Indeed, using just 4 dimensions, MCLUST infers 46 clusters.
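To make the dependence on the number of MDS dimensions concrete, here is a minimal sketch of the general idea: embed individuals with classical MDS, then choose the number of Gaussian clusters by BIC. It is not the actual pipeline used for the results above. MCLUST is an R package; scikit-learn's GaussianMixture stands in for it here, and the simulated genotypes, the Euclidean distance, and all function names are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def simulate_genotypes(n_pops, n_per_pop, n_snps, seed=0):
    """Toy data (assumption): each population gets its own random allele frequencies."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=(n_pops, n_snps))
    geno = np.vstack([rng.binomial(2, freqs[p], size=(n_per_pop, n_snps))
                      for p in range(n_pops)])
    pops = np.repeat(np.arange(n_pops), n_per_pop)
    return geno.astype(float), pops


def pairwise_distances(x):
    """Euclidean distance between all pairs of rows (a stand-in for an IBS-based distance)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(dist, n_dims):
    """Classical (Torgerson) MDS: embed an n x n distance matrix in n_dims dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_dims]       # indices of the largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def cluster_by_bic(coords, max_clusters, seed=0):
    """Fit mixtures with 1..max_clusters components and keep the lowest-BIC fit.

    scikit-learn's BIC is lower-is-better; MCLUST reports it with the opposite sign.
    """
    best = None
    for k in range(1, max_clusters + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              n_init=3, random_state=seed).fit(coords)
        bic = gmm.bic(coords)
        if best is None or bic < best[0]:
            best = (bic, k, gmm.predict(coords))
    return best[1], best[2]


if __name__ == "__main__":
    geno, pops = simulate_genotypes(n_pops=3, n_per_pop=40, n_snps=300)
    dist = pairwise_distances(geno)
    for n_dims in (2, 4):
        n_clusters, labels = cluster_by_bic(classical_mds(dist, n_dims), max_clusters=6)
        print(f"{n_dims} MDS dimensions -> {n_clusters} clusters")
```

Running the same clustering on embeddings with different numbers of dimensions is the comparison made above between 2 and 4 dimensions; on real data the cluster count found will depend on the samples, the markers, and the covariance models allowed.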
Even more clusters can be inferred with the usual set of 177k markers and more MDS dimensions, but, for now, I just wanted to make the point that even the smaller number of SNPs suffices to uncover population variation.
This allows us to make use of genotyping efforts based on different chips, even when these have relatively few markers in common with most of the populations included in the Dodecad Project.
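The trade-off between marker count and population coverage comes from intersecting the SNP sets of the data sets being merged: every additional chip can only shrink the common set. A small sketch of that step, with made-up chip names and SNP IDs (none of them come from the actual data sets):

```python
def shared_markers(panels):
    """Return the sorted SNP IDs that are present in every panel."""
    if not panels:
        return []
    snp_sets = [set(snps) for snps in panels.values()]
    return sorted(set.intersection(*snp_sets))


# Hypothetical chips and SNP IDs, for illustration only.
panels = {
    "chip_a": ["rs1", "rs2", "rs3", "rs4", "rs5"],
    "chip_b": ["rs2", "rs3", "rs5", "rs6"],
    "chip_c": ["rs3", "rs5", "rs7"],
}

common = shared_markers(panels)
print(common)  # ['rs3', 'rs5']
```

In practice the merge would also need to reconcile strand and allele coding between chips, which this sketch leaves out.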