May 01, 2014

The Geographic Population Structure (GPS) algorithm of Elhaik et al. (2014) is basically wrong

In the previous post I showed that the new paper by Elhaik et al. presents as its own (without citation) two of my ideas that were published on the web ~2 years before the paper was submitted.

I also took some time to evaluate the new aspect of their "geographical positioning system" (GPS) which is an algorithm to determine the geographic position of samples given their genetic distances to a group of reference populations. This is described under the heading "Calculating the biogeographical origin of a test sample" of their paper and I include a screenshot of it on the left to help you follow along.

From Equation (2) it is easy to see that the predicted position of the test sample is shifted away from the position of the best-matching reference population (Positionbest) and towards the other reference populations (Position(m)). The contribution of each reference population is weighted by wm, which is the ratio of the distance of the closest population to the distance of the m-th reference population.

A little basic geometry (right) informs us that points on the circle have a constant ratio of distances (d2/d1) to points B and A.

That is, BC/CA = BD/DA = BP/PA

In terms of the Elhaik et al. (2014) algorithm this constant ratio is wm, so if A and B are two reference populations, and the test sample is e.g., either C or D, then the same constant weight applies (and this is true for all points on the circle).
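The constant-ratio property can be checked numerically. What follows is only a minimal sketch, assuming a flat Euclidean plane and using plain geometric distances as a stand-in for genetic distances; the function names `apollonius_circle` and `distance_ratios` and the sample coordinates are made up for illustration and are not taken from the paper. It builds the circle from its two points on the line AB (the C and D of the figure) and measures the ratio of distances at several points along it.

```python
import math

def apollonius_circle(a, b, k):
    """Centre and radius of the circle of points P with |PA| / |PB| == k (k != 1)."""
    if k <= 0 or math.isclose(k, 1.0):
        raise ValueError("k must be positive and different from 1")
    # C divides AB internally and D externally, both in the ratio k; CD is a diameter.
    t_c = k / (1 + k)
    t_d = k / (k - 1)
    c = (a[0] + t_c * (b[0] - a[0]), a[1] + t_c * (b[1] - a[1]))
    d = (a[0] + t_d * (b[0] - a[0]), a[1] + t_d * (b[1] - a[1]))
    centre = ((c[0] + d[0]) / 2, (c[1] + d[1]) / 2)
    radius = math.dist(c, d) / 2
    return centre, radius

def distance_ratios(a, b, k, n=8):
    """Ratio |PA| / |PB| at n evenly spaced points P on the circle."""
    centre, radius = apollonius_circle(a, b, k)
    ratios = []
    for i in range(n):
        angle = 2 * math.pi * i / n
        p = (centre[0] + radius * math.cos(angle),
             centre[1] + radius * math.sin(angle))
        ratios.append(math.dist(p, a) / math.dist(p, b))
    return ratios

# Hypothetical reference positions A and B, with A the closer one (ratio 0.5).
A, B = (0.0, 0.0), (4.0, 0.0)
ratios = distance_ratios(A, B, 0.5)
print([round(r, 6) for r in ratios])
```

Every printed ratio should agree with the chosen k up to rounding, so any rule that depends on the test sample only through this ratio cannot tell the points on the circle apart.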

In practical terms, the algorithm of Elhaik et al. (2014) will predict the same geographical locations for all points on the circle. This will be perfectly accurate for C and biased for every other point on the circle (with D being the absolute worst).

It is actually easy to test whether the test population is like C or like D: in the case of C, CA+CB=AB. This is a simple test of collinearity that exploits the fact that we know not only the distances of the test population to the reference ones, but also the distance of the two reference populations from each other. And, indeed, it's easy to see that for a test population P we can estimate genetic distances AP, PB, and AB, and these uniquely define the circle on which the point must lie. Do this for all pairs of reference populations, find the distribution of the intersections of these circles, find a peak of this distribution (if such exists), et voilà: you have a sound mechanism for localizing individuals based on genetic distances. I expect to see something like this in Nature Communications circa 2016.
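As a rough illustration of the idea above, here is a toy sketch under strong assumptions that are mine and not part of the post or the paper: references sit on a flat Euclidean plane, genetic distances are exactly proportional to geographic ones, and there is no noise. The names `is_between`, `circle_intersections` and `localize`, the tolerance values, and the coordinates are all hypothetical. The "peak" is taken crudely as the intersection point with the most other intersections nearby; a real implementation would need a proper density estimate and geodesic distances.

```python
import math
from itertools import combinations

def is_between(d_ca, d_cb, d_ab, tol=1e-6):
    """Collinearity test: C lies on segment AB when CA + CB equals AB."""
    return abs(d_ca + d_cb - d_ab) <= tol * max(d_ab, 1.0)

def apollonius_circle(a, b, k):
    """Circle of points P with |PA| / |PB| == k, or None when k is ~1 (a line)."""
    if math.isclose(k, 1.0, rel_tol=1e-9):
        return None
    t_c, t_d = k / (1 + k), k / (k - 1)
    c = (a[0] + t_c * (b[0] - a[0]), a[1] + t_c * (b[1] - a[1]))
    d = (a[0] + t_d * (b[0] - a[0]), a[1] + t_d * (b[1] - a[1]))
    return ((c[0] + d[0]) / 2, (c[1] + d[1]) / 2), math.dist(c, d) / 2

def circle_intersections(c1, r1, c2, r2):
    """Intersection points of two circles (empty list if they do not meet)."""
    dist = math.dist(c1, c2)
    if dist == 0 or dist > r1 + r2 or dist < abs(r1 - r2):
        return []
    along = (r1**2 - r2**2 + dist**2) / (2 * dist)   # c1 to the common chord
    half = math.sqrt(max(r1**2 - along**2, 0.0))     # half the chord length
    ux, uy = (c2[0] - c1[0]) / dist, (c2[1] - c1[1]) / dist
    mx, my = c1[0] + along * ux, c1[1] + along * uy
    return [(mx - half * uy, my + half * ux), (mx + half * uy, my - half * ux)]

def localize(refs, dists, radius=1e-3):
    """Return the intersection point with the most other intersections nearby."""
    circles = []
    for i, j in combinations(range(len(refs)), 2):
        circle = apollonius_circle(refs[i], refs[j], dists[i] / dists[j])
        if circle is not None:
            circles.append(circle)
    points = []
    for (c1, r1), (c2, r2) in combinations(circles, 2):
        points.extend(circle_intersections(c1, r1, c2, r2))
    if not points:
        return None
    return max(points, key=lambda p: sum(math.dist(p, q) <= radius for q in points))

# Four hypothetical reference positions and a made-up "true" sample location.
refs = [(0.0, 0.0), (10.0, 0.0), (0.0, 8.0), (7.0, 9.0)]
true_p = (3.0, 2.0)
dists = [math.dist(true_p, r) for r in refs]
estimate = localize(refs, dists)
print(estimate)
```

In this noise-free toy setup the estimate should land very close to `true_p`. At least four references are used on purpose: with only three, the three circles share two common points and the peak is ambiguous.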

8 comments:

Andrés said...

The method also has a fundamentally wrong approach: that men move freely and equally in all directions. But there are such things as glaciers, seas, mountains and deserts.

stéphane Mazieres said...

Dear,
many of your comments and reappraisals could have been the topic of a paper instead of being "thrown" in a blog. That's part of the game to shout out and loud ideas which could be borrowed in subsequent studies. Furthermore, there's no way to cite appropriately the blog. I totally understood Eran's not to add value to very pertinent -though not published- hints.

Dienekes said...
This comment has been removed by the author.
Dienekes said...

many of your comments and reappraisals could have been the topic of a paper instead of being "thrown" in a blog.

Perhaps I didn't have an interest to write a paper and wanted to put my ideas out there so that everyone could see them, read them, appraise them as they saw fit?

Furthermore, there's no way to cite appropriately the blog.

You can't cite a URL or put something in the acknowledgements of the paper if you don't wanna cite it?

I totally understood Eran's not to add value to very pertinent -though not published- hints.

These were not "hints", but rather a full description of a method (as evidenced by the fact that other people were able to use it to create "calculators" based on their unsupervised ADMIXTURE runs).

"Though not published" is incorrect. My method was published on the public web and has been subjected to critique by anyone who wanted to to critique it (including the many hundreds if not thousands of people who got estimates of ancestry based on it).

Intellectual honesty requires one to cite what was previously known, described or otherwise communicated irrespective of the venue in which it appeared.

terryt said...

"Perhaps I didn't have an interest to write a paper and wanted to put my ideas out there so that everyone could see them, read them, appraise them as they saw fit?"

To me that is what science is about. Unfortunately it is often driven by egos.

Tatiana Tatarinova said...

Dienekes, as one of the co-authors of the GPS paper, I apologize for not referencing your blog in the paper. It was not intentional, since I was not aware of the existence of your method when the GPS algorithm was developed. I will find out if it is possible to add the reference.

eurologist said...

Dienekes,

I have full appreciation of your viewpoint. I have presented methods and results at scientific meetings with published abstracts, and within months was not cited in papers that simply copied my work.

I have invited a researcher to present in a session I headed, only to find later that he published a paper copying the ideas I presented in that session.

The scientific process is broken. Many scientists realize this, but many older ones adhere to the broken system out of self-interest. Many of their grad students and post-docs still follow them because they believe it's the best way to get ahead.

We need a science-process revolution.

I am now partially retired, but am still active in science process improvement advocacy.

Roy said...

Just a thought... Dienekes, you have great material. Is a book aimed at a general audience (à la Jared Diamond) out of the question?