July 15, 2014

k-means and structure

I was reading one of the many negative reviews of Nicholas Wade's new book when I came across this statement:
"The problem is that Structure, which uses an algorithm called “k-means,”"
I pointed out that Structure does not use k-means, and a small discussion ensued on Twitter. I see that the above statement has now been removed from the article, but an endnote on the topic remains:
*Originally, I wrote that STRUCTURE uses the k-means algorithm. Some population geneticists thought that I oversimplified what STRUCTURE does. Different clustering algorithms make different assumptions. STRUCTURE is indeed very similar to k-means, but with a particular error structure – binomial instead of gaussian. This is a fine technical detail compared with the principal point, which is that k is picked by the user, and does not emerge from the data automatically. To learn more, see this Twitter chain and this and this. Thanks to Graham Coop at UC Davis.
I did not intend to spend more time on this, but since the author of the article invited me to comment at more than 140 characters on the topic, I thought it was a good idea to do so.

k-means is completely unrelated to the structure algorithm of Pritchard and Stephens. Remember that structure can be run in either a no-mixture or a mixture mode. In both modes, the input is a set of N individuals and K, the number of ancestral populations. In the no-mixture mode, individuals are assigned to one of K populations, while in the mixture mode, their ancestry proportions from K populations are inferred. (Incidentally, allele frequencies in the K ancestral populations are also inferred, although usually not reported).

k-means has no mixture mode, but rather it is a clustering algorithm which assigns individuals to K populations. Thus, it can be used to solve the same problem as the no-mixture mode of structure. The two algorithms solve this problem in entirely different ways. Saying that structure uses k-means is equivalent to saying that any partitioning method into k groups uses k-means.
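For concreteness, here is a minimal sketch of the textbook k-means procedure (Lloyd's algorithm) in Python. It is not the R implementation or any code used in the studies discussed here. The function names, the toy points, and the deterministic choice of starting centers are assumptions made purely for illustration (real implementations usually start from random centers). The thing to notice is the output: exactly one hard label per individual.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Textbook Lloyd's algorithm; returns one hard cluster label per point.

    Assumption for illustration: starting centers are taken at evenly
    spaced positions in the input list, so the result is deterministic.
    """
    n = len(points)
    centers = [points[(i * n) // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels = kmeans(toy_points, k=2)
```

Nothing in this procedure models allele frequencies or ancestry; it only moves centers and reassigns labels until they stop changing.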

More importantly, structure is commonly used in mixture mode, including in the landmark paper by Rosenberg et al. (2002) that both Wade and the author of the review refer to. In this mode, structure does not even solve the same problem as k-means. Rather than find some partitioning of N individuals into K disjoint clusters, it estimates the mixture proportions of each of N individuals into all K populations. In practice (including the paper by Rosenberg et al. 2002), many individuals often have most (or all) of their ancestry from one or a few of the K populations. If humans had no structure at a particular K, the algorithm could very well produce a jumbled mess of different colors. Instead it produces neat ancestral populations that correspond well to what may be instantly recognizable as major human groups.
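To make the difference in the problem being solved concrete, here is a hedged sketch of the likelihood that underlies an admixture model of this kind, assuming unlinked biallelic markers: an individual with ancestry proportions q draws each allele copy with frequency equal to the q-weighted average of the K population allele frequencies. This is only the likelihood; structure itself places priors on these quantities and explores them by MCMC, which is not reproduced here. The function name, variable names, and toy numbers are hypothetical.

```python
import math


def admixture_log_likelihood(genotypes, Q, P):
    """Log-likelihood of genotypes under a simple admixture model.

    genotypes[i][l]: copies (0, 1 or 2) of a reference allele carried by
                     individual i at locus l.
    Q[i][k]:         ancestry proportion of individual i from population k.
    P[k][l]:         reference allele frequency in population k at locus l.
    Assumes every mixed frequency lies strictly between 0 and 1.
    """
    total = 0.0
    for g_row, q_row in zip(genotypes, Q):
        for l, g in enumerate(g_row):
            # Individual-specific allele frequency: a weighted average
            # over all K populations, not a single assigned cluster.
            f = sum(q * P[k][l] for k, q in enumerate(q_row))
            total += (math.log(math.comb(2, g))
                      + g * math.log(f)
                      + (2 - g) * math.log(1.0 - f))
    return total


# Hypothetical toy example: K = 2 populations, 4 loci.
P = [[0.9, 0.9, 0.1, 0.1],
     [0.1, 0.1, 0.9, 0.9]]
unmixed = [[2, 2, 0, 0]]   # looks like population 0 at every locus
mixed = [[1, 1, 1, 1]]     # heterozygous everywhere

ll_unmixed_as_pure = admixture_log_likelihood(unmixed, [[1.0, 0.0]], P)
ll_unmixed_as_half = admixture_log_likelihood(unmixed, [[0.5, 0.5]], P)
ll_mixed_as_pure = admixture_log_likelihood(mixed, [[1.0, 0.0]], P)
ll_mixed_as_half = admixture_log_likelihood(mixed, [[0.5, 0.5]], P)
```

In this toy setting the first individual is better explained by ancestry from a single population, while the second is better explained by an even mixture, an answer that a hard assignment to one cluster cannot express.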

The reader is invited to look at any standard implementation of k-means, such as the one in R, to be convinced that k-means does not even produce the same output as structure. The point is a trivial one, but k-means estimates N parameters (the cluster label for each of N individuals), whereas structure estimates N(K-1) parameters (the mixture proportions of N individuals in K populations; only K-1 numbers are needed per individual, as they have to add up to unity).
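The counting argument can be written out in a few lines of Python. This is only a sketch of the bookkeeping described above; the helper names and the example values of N and K are hypothetical and are not taken from any of the papers or programs mentioned.

```python
def kmeans_label_count(n_individuals):
    """One hard cluster label per individual."""
    return n_individuals


def admixture_free_parameters(n_individuals, k):
    """K proportions per individual, minus one because they sum to 1."""
    return n_individuals * (k - 1)


def complete_proportions(free_values):
    """Recover the K-th proportion from the K-1 free ones."""
    return list(free_values) + [1.0 - sum(free_values)]


def one_hot(label, k):
    """A hard label written as a degenerate proportion vector."""
    return [1.0 if j == label else 0.0 for j in range(k)]


# Hypothetical sizes, chosen only to show the arithmetic.
n, k = 1000, 5
labels_needed = kmeans_label_count(n)                # N
proportions_needed = admixture_free_parameters(n, k)  # N(K-1)
full_row = complete_proportions([0.7, 0.2])           # third entry is implied
```

A hard assignment is the special case in which every row of proportions is a `one_hot` vector, which is why the no-mixture problem can be posed to both kinds of algorithm while the mixture problem cannot.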

The only thing these algorithms have in common is that they require that the user input K. This point has been used by the plethora of negative reviews of Wade's book to argue that the classification of humans into biological races is arbitrary as it is subjective (it relies on user input of K).

This is a rather weak objection, for at least a couple of reasons. First, K can also be estimated from data: there are clustering algorithms (such as fineStructure) that do not require user input of K. They identify a value of K and organize the K ancestral populations into a hierarchical tree whose deep splits correspond exactly to the continental human races. Another popular algorithm, ADMIXTURE, proposes a cross-validation procedure to choose K. So, the choice of K can be automated and need not be subjective.

The more important reason against the "subjective K" objection is that it does not in any way invalidate the partitioning of humans into different numbers of groups K at different levels of granularity. This is reasonably easy to understand: the whole field of taxonomy divides living things into a hierarchical structure. In some cases it is useful to speak of vertebrates, and in others it's useful to speak of mammals, or primates, etc. In humans it's sometimes useful to speak of the entire species H. sapiens in contradistinction to other species, when studying what is common to humans, and sometimes it is useful to speak of major populations of H. sapiens (such as Europeans or East Asians), or minor ones (e.g., Mongols and Vietnamese), when studying how human groups differ from one another. These groupings are not arbitrary, but appear when biological traits (e.g., SNPs) are subjected to various types of analysis (including structure and similar algorithms).

18 comments:

Kostas said...

Have you read the book? If so what's your opinion of it? Would it be possible for you to maybe write a review?

What do you think of statements like "jews are adapted to capitalism like Tibetans are to high altitude" that (according to the reviews) are included in it?

Charles Nydorf said...

It is interesting to compare the controversies over the definition of species to the ones about the definition of race. I think that the arguments are the same but more emotion and politics enter into the discussion of race.

Ethio Helix said...

“Another popular algorithm, ADMIXTURE, proposes a cross-validation procedure to choose K. So, the choice of K can be automated and need not be subjective.”

For a well sampled global dataset ADMIXTURE's cross validation curve would never trough at K=5, or even K=7, Wade's darling K values that seemingly correlate with modern social constructs of 'race' and/or 'continental' groupings; instead it troughs at K values much greater than 10, usually 13 or 14.

If K=5 should be deemed a popular standard for some, there is absolutely no reason why K=2 should not be deemed a standard for others. Surely the fact that at K=2 Europeans and West Asians appear to be mixtures of Africans and East Asians has no bearing on the adoption of such a standard, right?

Grognard said...

Well, going back to the guy's original post:

"First, let’s examine Wade’s many straw men. He argues against the claim that all the people of the world have essentially similar genetic makeup (a position not actually held by any credible geneticists or anthropologists) and argues that races, indeed, are groups delineated from one another by hard biological evidence. To bolster this claim, Wade cites a 2002 landmark study in Science magazine of fifty-two populations from around the world. That study employed a computer program called Structure, which uses differences in DNA to identify distinct groups of humans."

This isn't true in any real sense, only true for obvious coding DNA. Of course people don't have different core proteins. The Taj Mahal has mostly the same basic materials as modern skyscrapers, in much different proportions and form. Most brain genes found are actually in RNA not coding.

"Interactions between an individual’s genome and his or her environment can have profound effects on developmental outcomes. In many developed countries, IQ scores have risen by several points per decade, a phenomenon called the Flynn effect. Its discoverer, James R. Flynn, reports that this change has been happening for many decades. For instance, in verbal and performance IQ, an average Danish 14-year-old in 1982 scored 20 points higher than the average person of the same age in his parents’ generation in 1952. The rapidity of the change suggests that some environmental factor, whether educational, nutritional, or other, has had substantial effects on brain development."

This isn't really true either. First IQ tests were mainly on basically illiterate people and this trend has long since reversed in spite of the fact this is not the case any more and everyone (in first world anyway) is literate.



However, while I can't see the original comment, his criticism of using this wide definition of race is valid, and his conclusions are worse. The evolutionary acceleration is nothing to do with any particular race and in fact these changes are mostly fixed in the entire population of Europe or Asia (or both). Wade shows a feeble understanding of natural selection if he thinks of it as another way of mixing: after not many generations, only the exact gene selected on is being passed around and virtually no changes in race will happen. Changes in race are a result of conquest or migrations, mixing. Natural selection is nothing to do with mixing and will happen even for very loosely connected populations! Of course the guy in the article tries to argue that race doesn't exist at all, which is also bogus. There have been vast changes to the human genome as populations increase but they are not anything to do with race.

The view of history that the gentle and patient English (lol!) had an industrial revolution after careful thought is also laughable. They dispossessed the whole peasantry with the enclosure act, then with cheap labor available worked them all to death in large factories owned by a few people. By this time they were already virtual rulers of the world, so it's no surprise their wealth attracted the finest minds from around the world, which combined with immense wealth was what made this possible.

Fanty said...

"jews are adapted to capitalism like Tibetans are to high altitude"

Yeah, that's one of those things that are illegal to say, even if they were true.

One of the modern versions of "there are things that rotate around Jupiter".

DOH! Revoke or burn! Here sign this: Adaption does not exist, especially not if Jews are involved! SIGN IT OR DIE! ;-)



Dienekes said...

For a well sampled global dataset ADMIXTURE's cross validation curve would never trough at K=5, or even K=7, Wade's darling K values that seemingly correlate with modern social constructs of 'race' and/or 'continental' groupings; instead it troughs at K values much greater than 10, usually 13 or 14.

Indeed. It is somewhat ironic that modern criticisms of human biological race concept often emphasize intermediacy and arbitrariness of the races of traditional physical anthropology, but in fact the traditional racial classification schemes erred on the side of caution and we can now identify dozens of human races (if we wanted to).

The traditional races are the broad clusterings of human populations that were obvious to people even before the development of anthropometric methods and statistics. We now know that there are many more human populations that could be identified as races; this does not invalidate the traditional races, it just means we can now do better.

It's like we used to think there were animals and plants, and we can now divide both animals and plants into many subgroups. Or, we used to think there were planets, and we can now divide planets into gas giants and rocky worlds. Or, we used to think there were "flu" viruses and we can now identify various kinds and strains of flu.

Ethio Helix said...

“We now know that there are many more human populations that could be identified as races; this does not invalidate the traditional races, it just means we can now do better.”

But they do invalidate traditional races as they were defined back then, unless one proposes to completely change the definition of the term 'race' today from its original conceptualization, there is very little evidence that the original conceptualization put forward by the traditional anthropologists can stand up to modern genetic scrutiny.

For starters, only recently do we know that all humans on all continents are (for the most part) of relatively recent African origin, in the past, the conceptualization of races put forward by those anthropologists primarily hinged on the fact that human like populations independently evolved into modern populations separately for hundreds of thousands of years on different continents resulting in the traditional 'races', this as we know today is not true, human beings from all around the globe are a young African species.

Then there is the issue of using outside physical characteristics to determine biological affinity: a quick look at the physical characteristics of people like Andamanese islanders or Australian Aborigines would, on the surface, imply immediate genetic affinity to Africans, but of course that is not at all the case.

“Or, we used to think there were planets, and we can now divide planets into gas giants and rocky worlds.”

I disagree with that analogy and the general reasoning of 'refinement of what we used to know', first of all we are not dealing here with plants or animals or things that are external to us, instead we are studying ourselves, our own history, which would naturally introduce a different level of bias.

What happened is that we used to think that we were totally different from each other, even almost a different species, and now we know that we are actually biologically speaking extremely alike. The difference between how close we used to think we are and how close we really are causes all these 'race debates'. If the old anthropologists had started from the premise that we were already close, even without the evidence that we have today, there would have been no reason for these race debates. However, it is understandable that scientists from the established culture would emphasize differences rather than similarities; how else could you justify the subjugation of peoples outside the establishment?

Dienekes said...

But they do invalidate traditional races as they were defined back then, unless one proposes to completely change the definition of the term 'race' today from its original conceptualization

"Race" was used originally much more freely than to refer to the continental races. It was often a synonym for ethnic group or nation in the old days. You can search for things like "Chinese race" in the 19th century in places like Google Books and you'll find many examples.

Then there is the issue of using outside physical characteristics to determine biological affinity: a quick look at the physical characteristics of people like Andamanese islanders or Australian Aborigines would, on the surface, imply immediate genetic affinity to Africans, but of course that is not at all the case.

If one looks at enough morphological traits there is no overlap. Anyway, the fact that some people might not readily distinguish different groups on appearance doesn't invalidate the ability to distinguish such groups. Some people wouldn't be able to distinguish horse or dog breeds, but those are easily distinguished by horse experts or horse geneticists.

I disagree with that analogy and the general reasoning of 'refinement of what we used to know', first of all we are not dealing here with plants or animals or things that are external to us, instead we are studying ourselves, our own history, which would naturally introduce a different level of bias.

I don't understand where the bias is supposed to be coming from. Feed unlabeled individuals into any clustering algorithm and you'll invariably see that the major groupings of mankind will appear. Really good algorithms will be able to find even finer structure (like the clusters galore approach).

What happened is that we used to think that we were totally different from each other, even almost a different species, and now we know that we are actually biologically speaking extremely alike

This is a caricature of what people used to think. Here is what Carleton Coon wrote in 1939:

"If, as above, we define race as a group of people reasonably unified in the physical sense and living in one place, difficulties at once arise. How are we to draw the borderline between that place and the next? Where does one race leave off and the next begin? There are those who assert that a race is merely an artificially assumed point on the smooth and glassy surface of a geographical continuum, for what may be the concentration point for an extreme condition in one criterion will be an intermediate point in others. This assertion is, to a certain extent, true. If we view the panorama of living races on a two dimensional map, we can but agree that a race in this sense is merely a reasonably homogeneous group of people who occupy a given arbitrary point upon a terrestrial continuum. In regions of geographical smoothness one condition blends broadly and gently into another; in regions cut up by geographical barriers, such as deserts or mountains, the contrasts are sharper and the transitions more rapid."

Even "evil typologists" like Coon were well aware of the facts of human variation.

David Jacobson said...

It is certainly true that k-means has nothing to do with Structure. K-means is a simple clustering algorithm that is relatively easy to understand. Structure involves a complex probabilistic simulation for estimating admixture among populations. I am still working on understanding how it works. Even with an understanding of how it works, I suspect that it would not be easy to really establish what its results mean. My own evaluation of some of its results is that they show a number of geographies where humans have reached some kind of genetic equilibrium. The results also seem to give a reasonable idea about some patterns of gene flow within geographies and admixture between populations.

The complete human genome sequences of the 1000 Genomes Project have the potential to provide more exact information about the populations they cover. Even with a minimal scratching of the surface, there is no doubt about a large difference in the variants of African, European, and Asian populations. Nor is there any question about the fact that the American population displays some kind of complex admixture between all of the other three. However, the very large amount of structure and difference obviously visible in this data suggests that Structure is only capturing a very low resolution statistical analysis of data that probably does not conform to any simple statistical distribution.

Unknown said...

@Dienekes - “Feed unlabeled individuals into any clustering algorithm, you'll invariably see that the major groupings of mankind will appear. Really good algorithms will be able to find even finer structure (like the clusters galore approach)”

Yes, indeed, we can see even finer structure if we go to the birth records and simply divide up humans according to immediate families. And there will be exponentially more variation than we can possibly see at the level of race.

Conversely, we can compare humans and other mammals to insects and the “major groupings of mankind” seem to be based on petty differences. It’s a matter of scale and reference. In the bigger evolutionary picture, human variation is ridiculously tiny.

No doubt there is genetic variation in humans. Heck, there are surprising variations within immediate families and even between identical twins.

But the important question is why that variation is important. I’ve only read excerpts of Nicholas Wade’s book. But it’s a bit obvious what he is doing.

That unfortunate quote -- “jews are adapted to capitalism like Tibetans are to high altitude" -- is ironic for a number of reasons, not the least of which is the paper you posted suggesting that Tibetans acquired the trait by admixture with Denisovans!

I’m sure it never occurred to Wade that it might be the other way around -- that capitalism has adapted to Jews (and quite a few others in equal measure.)

What Wade believes is equivalent to believing that our bodies adapted to the shape of automobiles instead of the other way around.

It is a fundamental misunderstanding of how what some call “cultural evolution” is a different process than biological evolution.

But it’s more than that: Wade is just bad at understanding biological evolution’s most fundamental and obvious element.

Diversity is one of the most important ways that a swarm of related organisms survives a basically unpredictable environment. In fact, it’s the only reason we humans are here despite the hundred million extinctions before us. Diversity saved us.

If, for example, all Caucasians were suddenly wiped out by a Caucasian-targeted microbe or a meteor or a trojan horse in our genes, well, then, thanks to human diversity, there will still be humans around. Just not Caucasians.

Locrian said...

I think there is a persistent logical fallacy at work in those who deny the existence of races, a fallacy that may be characterised as “If there is no sharp distinction then there is no distinction at all”. They apply this fallacy quite selectively, having no trouble with breeds of dogs, or breeds of horses, but are over-sensitive when it is types of humans. This fallacy often comes up in other areas as well, for example in those who claim that there being no sharp distinction between colours in the colour spectrum means that there is no distinction between them at all, and that it is all subjective, culturally determined, etc. The persistent cry is “where do we draw the line?” This type of fallacy seems particularly beloved by anthropologists (at least in my local environs), and they also seem the group most disposed to the “there are no races” thesis. They see only continuity, not difference, and any perception of difference must be subjective, culturally determined, the result of bias, etc. (Whereas perception of continuity they seem to think is not subjective, not culturally determined.)

That this IS a fallacy is revealed by noting that continuity is completely compatible with hard, objective differences. A continuous curve may have turning points — places say, where the derivative is zero — separating parts of the curve in a perfectly objective way. Continuity is perfectly compatible with objective difference.

So take two clouds in the sky, with intermediate water vapour between them — a good model for population clusters on the Earth. A curve from one density to another will trough and go to zero before climbing to the other distribution. First and second derivatives can reveal objective differences the description of the continuity of the curve misses. There need be nothing subjective, or culturally relative, or biased about it. A continuity of differences does not mean no difference to any level of description.

When those who deny the existence of races go on to say, “and even if there are differences they are small, insignificant differences” then this IS subjective, and I would say, quite nakedly so. For whether something is significant surely DOES depend upon what you consider important, and what value you place on those differences. So if there is an argument about subjectivity and values then it is at that point that the subjectivity enters. Nothing prevents someone from regarding all such differences as unimportant, or as important.

bhallmar said...

You are correct that structure absolutely does not implement k-means, which is obvious if you look at the original paper(s) and the mathematical model. The admixture model is an example of what is more generally known as latent Dirichlet allocation:
http://jmlr.org/papers/v3/blei03a.html
http://mistis.inrialpes.fr/statlearn/slides/Statlearn11_Francois.pdf

Unknown said...

Locrian wrote:
"When those who deny the existence of races go on to say, “and even if there are differences they are small, insignificant differences” then this IS subjective, and I would say, quite nakedly so. For whether something is significant surely DOES depend upon what you consider important, and what value you place on those differences."

Objectively, the differences are small, if you are talking about evolution in general. If you are talking about what Wade seems to be talking about -- a genetic difference that somehow relates to capitalism or rule of law -- the diversity in a small Irish village objectively has as much consequence as genetic differences he would assign to race. If reproductive differential is favoring some traits over others, they certainly don't match up with what Nicholas Wade thinks they are. And that's objective.

John Fuerst said...

These arguments against biological race are incredible.

Ethio Helix: "But they do invalidate traditional races..For starters...in the past, the conceptualization of races put forward by those anthropologists primarily hinged on the fact that human like populations independently evolved into modern populations separately for hundreds of thousands of years"

No. The early debates were between the monogenists who proposed that human groupings represented different races, that is, lineages -- originally thought to have descended from Adam several thousand years prior and the polygenists, who proposed that human groupings represented different species. The racialists argued that different human groups were not so different. Coon did propose early divergence, but this wasn't the "traditional" model which was conceived in the seventeen hundreds by e.g., Kant and Buffon and done so in line with the Biblical narrative.

Ethio Helix: "Then there is the issue of using outside physical characteristics to determine biological affinity."

Race, not to be confused with the early vague variety concept, was, from the start, conceptualized in terms of genealogy. The term comes from the French, Noble de Race, or noble lineage, and the, at the time, English term for breeds. Race has almost always been understood in terms of pedigree and generation; morphological indexes were simply used to index "propinquity of descent" degree of differentiation.

John Fuerst said...

Ethio Helix: "What happened is that we used to think that we were totally different from each other... it is understandable that scientists from the established culture would emphasize differences rather than similarities"

This is just poor historiography. Why, if this is true, did the monogenist racialists argue, contra the polygenists, that human populations were surprisingly similar? Why was polygenism deemed heretical in the first place?

Unknown: "But the important question is why that variation is important. I’ve only read excerpts of Nicholas Wade’s book."

Genetic variation across races is important insofar as phenotypic variation across these same groups is considered to be so and insofar as this latter variation can be explained by the former. Is the phenotypic variation, e.g., in x cancer rates or IQ, really that important? I don't think so, but I imagine that your local sociologist or macroeconomist does.

bhallmar said...

In case anyone doubts what the human population genetics community thinks about Wade's thesis, the opinion is clear.

http://cehg.stanford.edu/letter-from-population-geneticists/