September 06, 2012

Y-chromosome phylogeny using Complete Genomics data

Since taking a trip to Innsbruck on a day's notice is out of the question, and since I don't want to wait for the eventual journal publication, I figured I'd try my hand at using the Complete Genomics data for myself to build a Y-chromosome phylogeny.

With the aid of PhyML (default parameters and 100 bootstrap replicates), here is what I get:

Note that the above was done by isolating the Y-SNPs in 28 unrelated males in the data. I also threw out all SNPs that had no-calls. I tried to infer terminal classifications for the different individuals based on current ISOGG nomenclature, although it's possible that there are downstream mutations that I missed. NA18940, which is cut off in the figure, is D2a-M116.1, and NA19649 is R1b1a2a1a1b2a1a-L20. I couldn't quite figure out NA19670.
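For anyone curious what the no-call filtering amounts to, here is a minimal Python sketch. It is not the script I actually ran: the dict-of-lists input, the `drop_no_call_sites` name, and the use of "N" as the no-call marker are all assumptions for illustration, and the real Complete Genomics files encode calls differently.

```python
def drop_no_call_sites(genotypes, no_call="N"):
    """Keep only the sites where every sample has a called allele.

    genotypes: dict mapping sample ID -> list of allele calls, one per site
    (all lists the same length). `no_call` is a hypothetical placeholder for
    a missing call; adapt it to however your input marks no-calls.
    """
    samples = list(genotypes)
    n_sites = len(genotypes[samples[0]])
    # A site survives only if no sample carries the no-call marker there.
    keep = [
        i for i in range(n_sites)
        if all(genotypes[s][i] != no_call for s in samples)
    ]
    filtered = {s: [genotypes[s][i] for i in keep] for s in samples}
    return filtered, keep


# Toy data, not real genotypes: sites 2 and 3 each contain a no-call.
toy = {
    "sample_A": ["A", "C", "N", "T"],
    "sample_B": ["A", "T", "G", "T"],
    "sample_C": ["G", "C", "G", "N"],
}
filtered, kept = drop_no_call_sites(toy)
print(kept)      # [0, 1]
print(filtered)
```

Dropping a whole site as soon as one individual has a no-call is strict, but it keeps the alignment free of missing data before it goes to the tree-building step.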

Here is the tree code for anyone who wants to play with it:

  (NA21732_MKK_E1b1b1a1c-V22:0.00295296,NA21737_MKK_E1b1b1a1c-V22:0.00270971,
  (NA20510_TSI_E1b1b1a1b-V13:0.02092798,((NA19239_YRI_E1a2-P110:0.08297556,
  (NA18940_JPT_D2a-M116.1:0.12168757,((NA12891_Utah_I1a4-Z63:0.01039303,
  (NA06994_CEU_I1a3a1b-Z73:0.00945249,NA20511_TSI_I1a3a1a-Z140:0.01697537)54:0.00049670)100:0.07005125,
  (NA19670_MXL_?:0.08181470,(NA18558_CHB_NO-M214:0.10352315,
  (NA19735_MXL_Q1a3a1-M3:0.05725892,(NA20845_GIH_R2a1a-L294:0.05384443,
  ((NA20846_GIH_R1a1a1b2-Z93:0.01116683,NA20850_GIH_R1a1a1b2-Z93:0.00810758)100:0.02945897,
  (((NA12889_Utah_R1b1a2a1a1b5-DF19:0.01118764,HG00731_PUR_R1b1a2a1a1b1-DF27:0.00955759)24:0.00020607,
  NA07357_CEU_R1b1a2a1a1b2-U152:0.00963759)21:0.00012284,
  (NA19649_MXL_R1b1a2a1a1b2a1a-L20:0.05342631,
  (NA20509_TSI_R1b1a2a1a1b2c3-Z146:0.00943290,NA10851_CEU_R1b1a2a1a1b3-L21:0.01307073)15:0.00019824)5:0.00000002)100:0.03211886)100:0.00848072)100:0.00921281
  )100:0.02274049)100:0.00299324)54:0.00000005)100:0.03274296)100:0.01917298)100:0.00999561,
  ((NA18504_YRI_E1b1a1a1f1a1-U174:0.01490192,
  (NA19026_LWK_E1b1a1a1f1a1-U174:0.01033155,NA19834_ASW_E1b1a1a1f1a1-U174:0.00961706)75:0.00031406)100:0.00787632,
  (NA18501_YRI_E1b1a-V38:0.00966164,
  ((NA19020_LWK_E1b1a-V38:0.01193590,NA19025_LWK_E1b1a-V38:0.00793484)100:0.00317626,
  (NA19700_ASW_E1b1a-V38:0.01130782,NA19703_ASW_E1b1a-V38:0.01054823)77:0.00022885)100:0.00122837)100:0.00457000)100:0.04581828)100:0.04898306)100:0.01922685);
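If you just want the tip labels and their branch lengths without installing a tree library, something like the following Python sketch should do. It is only an illustration: `read_leaves` and the regular expression are my own hypothetical helpers, not part of PhyML, and they assume unquoted labels of the form sample_population_haplogroup as used in this tree. Whitespace is stripped first, so a tree that has been wrapped across lines still parses. The demo string is an excerpt (the I1 clade), closed with ";" so that it is valid Newick on its own.

```python
import re

# Excerpt of the tree (the I1 clade only), wrapped across two lines on purpose.
NEWICK = """(NA12891_Utah_I1a4-Z63:0.01039303,(NA06994_CEU_I1a3a1b-Z73:0.00945249,
NA20511_TSI_I1a3a1a-Z140:0.01697537)54:0.00049670);"""

# A tip is a "label:length" pair that directly follows "(" or ",".
# Internal nodes follow ")" (bootstrap:length), so they are not matched.
LEAF = re.compile(r"[(,]([^(),:;]+):([0-9.eE+-]+)")


def read_leaves(newick):
    """Return (sample, population, haplogroup, branch_length) for each tip."""
    text = "".join(newick.split())  # undo any line wrapping
    rows = []
    for label, length in LEAF.findall(text):
        # Split on the first two underscores only; the haplogroup label
        # keeps its own punctuation (e.g. "I1a4-Z63" or "?").
        sample, population, haplogroup = label.split("_", 2)
        rows.append((sample, population, haplogroup, float(length)))
    return rows


for row in read_leaves(NEWICK):
    print(row)
```

The same function can be pointed at the full tree string; for anything beyond listing tips (rerooting, drawing, distances between tips), a proper Newick parser is the better choice.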

6 comments:

royking said...

Nice tree, Dienekes! But you should add the number of Y SNPs for each branch, so that you and others can estimate dates for various lineages.

Fanty said...

Am I missing something, or does it claim that R1b is more closely related to the Asian R1a than to the European R1a?

Dienekes said...

There is no European R1a here.

Fanty said...

"There is no European R1a here."

Ah! All right. Missed that it's an R2, not R1.

When I googled L294 I got a list of individuals including Czechs and Slovaks, so I was fooled into thinking this must be the Euro R1a. X-D

GregRM said...

In case it helps, our collaborative spreadsheet at https://docs.google.com/spreadsheet/ccc?key=0Agq_ez43qXCjdFlxemtlUnZ1Qk01cVhMRVBFcm5WX3c&authkey=CIOag_UD#gid=12 has NA19670 as G2a3b1a2 (L497+). (I didn't personally contribute to this particular categorization.)

Dienekes said...

@GregRM,

Thanks! I had read somewhere that within F, haplogroup G branches off early, and this seems consistent with that.

I think the deep phylogeny will be near perfectly resolved soon, based on the papers that are coming out.