Monday, November 26, 2012

Extensive Doctoral Thesis on Ethiopian Y and mtDNA

I was contacted earlier by Dr. Chris Plaster about a doctoral thesis on Ethiopian Y & mtDNA that was completed 2 years ago but remained under embargo until about two months ago. This is the first time I have come across it, and since it is 204 pages long I have not yet had a chance to go through it thoroughly. Suffice it to say that this is the most extensive work on Ethiopian NRY & mtDNA that I have seen to date, although the resolution leaves a lot to be desired. I will update this post as I read it more thoroughly over the next few days/weeks...

Variation in Y chromosome, mitochondrial DNA and labels of identity on Ethiopia

Some numbers and figures that caught my attention at first glance:

The Discussion section also has some interesting things to say, especially with respect to haplogroups A3b2 and J, but also about the remaining haplogroups found in Ethiopia.

UPDATE (11/27/2012) - Received higher-resolution results for a portion of the NRY data from Dr. Plaster; this typing was carried out later and is not included in the thesis:

Link to Source Document

UPDATE2: Interactive Chart of Figure 3.2 (for improved legibility)
UPDATE3 (11/28/2012) - Analyzing Ethiopian E-M34 haplotypes.

One of the more curious results with respect to the NRY haplogroups in this dataset is the high frequency (24.6%) of E-M34 in the Maale samples. Previously, Cruciani '04 (see here and here for details) had found E-M34 widespread in Ethiopia with a more northerly concentration (Amhara - 24%, Ethiopian Jews - 14%, Oromo - 8%, Wolayta - 8%). This newer data, however, shows the opposite, i.e. a more southerly concentration of E-M34 (Maale - 25%, Amhara - 13%, Oromo - 10%, Afar - 4%).

To explain the apparently lower diversity of Ethiopian E-M34 haplotypes relative to ones found in the Near East, Cruciani '04 had also proposed that the lineage may have back migrated to East Africa from the Near East, although not completely abandoning the possibility that it may also have originated in East Africa.

Luckily, the data for the Ethiopian E-M34 haplotypes found in this paper is accompanied by STR data: 23 independent 15-marker E-M34 haplotypes.

This gave me a chance to compare the diversity of these haplotypes with the non-Ethiopian E-M34 haplotypes that are available at the haplozone site. A compilation of 67-marker haplotypes for E1b1b1 and its subclades from this site can be downloaded from here. For this analysis only E-M123 and E-M84 haplotypes are used.

The method I used to compare the haplotypes is the same as outlined in this previous blog post. The only difference is that I am constrained by the number of markers available to me, so I have used the following 14 markers to compare haplotype diversity/TMRCA:

DYS19 DYS388 DYS390 DYS391 DYS392 DYS393 DYS389I DYS389II DYS437 DYS438 DYS439 DYS448 DYS456  Y GATA H4

The remaining marker, DYS635, is unfortunately missing from the haplozone site and is therefore not included in the analysis.
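The exact procedure is described in the linked post and is not reproduced here. Purely as an illustration of the general idea, the sketch below uses a common variance-based estimator (average squared distance, ASD, from a putative ancestral haplotype); it is an assumption that this resembles the calculation behind the results, and the function names, toy haplotypes and the 0.002 mutation rate are all hypothetical.

```python
from statistics import median, multimode

def ancestral_haplotype(haplotypes, how="median"):
    """Per-marker median or modal repeat count across all haplotypes."""
    columns = list(zip(*haplotypes))
    if how == "median":
        return [median(col) for col in columns]
    # multimode returns every tied mode; take the smallest for determinism
    return [min(multimode(col)) for col in columns]

def asd_tmrca(haplotypes, rates, years_per_generation=28, how="median"):
    """Average squared distance from the ancestral haplotype, divided by the
    per-marker mutation rate, averaged over markers, converted to years."""
    root = ancestral_haplotype(haplotypes, how)
    n = len(haplotypes)
    generations_per_marker = []
    for j, (anc, mu) in enumerate(zip(root, rates)):
        asd = sum((h[j] - anc) ** 2 for h in haplotypes) / n
        generations_per_marker.append(asd / mu)
    generations = sum(generations_per_marker) / len(generations_per_marker)
    return generations * years_per_generation

# Toy data (hypothetical): 4 haplotypes x 2 markers, made-up rate per generation
toy = [[13, 12], [14, 12], [13, 12], [13, 13]]
toy_rates = [0.002, 0.002]
print(asd_tmrca(toy, toy_rates, 28), asd_tmrca(toy, toy_rates, 33))
```

Because the estimate scales linearly with the years-per-generation value, switching from 28 to 33 years simply multiplies every result by 33/28.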

The TMRCA results for the 3 different datasets (E-M34_Plaster, E-M123_Haplozone and E-M84_Haplozone) are as follows:

E-M34_Plaster:
Sample size:23
Years/Generation:28 - 33
TMRCA Range:4590 - 7558
Mean TMRCA:6055
Median TMRCA:5920

Year/Generation =28 detailed:
finalsummary =

  [1,2] = Chandler;14 Markers  TMRCA(Median)--5920.7 TMRCA(Modal)--6412.9
  [1,3] = Stafford;14 Markers  TMRCA(Median)--5120.6 TMRCA(Modal)--5593.2
  [1,4] = Burgarella_Navascues;14 Markers  TMRCA(Median)--4988.7 TMRCA(Modal)--5635.7
  [1,5] = Ballantyne;14 Markers  TMRCA(Median)--4590.1 TMRCA(Modal)--4993.9

E-M123_Haplozone:
Sample size:129
Years/Generation:28 - 33
TMRCA Range:4120 - 6131
Mean TMRCA:5067
Median TMRCA:5147

Year/Generation =28 detailed:
finalsummary =

  [1,2] = Chandler;14 Markers  TMRCA(Median)--5202.1 TMRCA(Modal)--5202.1
  [1,3] = Stafford;14 Markers  TMRCA(Median)--4405.3 TMRCA(Modal)--4405.3
  [1,4] = Burgarella_Navascues;14 Markers  TMRCA(Median)--4121 TMRCA(Modal)--4121
  [1,5] = Ballantyne;14 Markers  TMRCA(Median)--4330 TMRCA(Modal)--4330

E-M84_Haplozone:
Sample size:69
Years/Generation:28 - 33
TMRCA Range:3666 - 5124
Mean TMRCA:4458
Median TMRCA:4347

Year/Generation =28 detailed:
finalsummary =

  [1,2] = Chandler;14 Markers  TMRCA(Median)--4347.9 TMRCA(Modal)--4347.9
  [1,3] = Stafford;14 Markers  TMRCA(Median)--3666.2 TMRCA(Modal)--3666.2
  [1,4] = Burgarella_Navascues;14 Markers  TMRCA(Median)--3885.7 TMRCA(Modal)--3885.7
  [1,5] = Ballantyne;14 Markers  TMRCA(Median)--4219.2 TMRCA(Modal)--4219.2
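
One reading of the "TMRCA Range" lines, offered as an assumption rather than a documented rule, is that the lower bound is the smallest estimate at 28 years per generation and the upper bound is the largest estimate rescaled to 33 years per generation. The sketch below checks this against the first (23-sample) block; the function name is hypothetical.

```python
# 28-year estimates (median, modal) copied from the first detailed block above
estimates_28 = {
    "Chandler": (5920.7, 6412.9),
    "Stafford": (5120.6, 5593.2),
    "Burgarella_Navascues": (4988.7, 5635.7),
    "Ballantyne": (4590.1, 4993.9),
}

def tmrca_range(estimates, base_gen=28, max_gen=33):
    """Smallest estimate at base_gen and largest estimate rescaled to max_gen."""
    values = [v for pair in estimates.values() for v in pair]
    low = min(values)
    high = max(values) * max_gen / base_gen
    return low, high

low, high = tmrca_range(estimates_28)
print(f"{low:.0f} - {high:.0f}")  # prints "4590 - 7558"
```

The same rescaling gives upper bounds of about 6131 and 5124 for the other two blocks, matching their reported ranges.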

There is no need to fixate on the absolute TMRCA numbers; the relative TMRCA numbers are more informative, since the same mutation rate sets are used for all 3 datasets. In addition, the absolute TMRCA is not very informative due to the low number of markers. For instance, if I use these same 14 markers to compute a mean TMRCA across all 4 mutation rate sets for the E-M35 balanced dataset, I get 7,038 YBP, whereas with 46 markers across all mutation rate sets I get a mean TMRCA of 11,984 YBP, and with 66 markers (limited to the Chandler mutation rates) I get a mean TMRCA of 14,802 YBP. So a reasonable number of markers is needed before the absolute TMRCA starts to plateau at a meaningful value.

However, the relative TMRCAs clearly show the Ethiopian E-M34 haplotypes to be more diverse, and thus putatively older, than both the E-M84 and E-M123 haplotypes from haplozone, and that in itself is quite interesting.
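
As a quick check of the relative comparison, the sketch below divides the Ethiopian median-based estimate by each Haplozone estimate within the same mutation rate set. The numbers are copied from the three detailed blocks (28 years per generation); the table layout and function name are assumptions made for illustration.

```python
# Median-based TMRCA estimates at 28 years/generation, from the blocks above
medians_28 = {
    "Chandler": {"E-M34_Plaster": 5920.7, "E-M123_Haplozone": 5202.1, "E-M84_Haplozone": 4347.9},
    "Stafford": {"E-M34_Plaster": 5120.6, "E-M123_Haplozone": 4405.3, "E-M84_Haplozone": 3666.2},
    "Burgarella_Navascues": {"E-M34_Plaster": 4988.7, "E-M123_Haplozone": 4121.0, "E-M84_Haplozone": 3885.7},
    "Ballantyne": {"E-M34_Plaster": 4590.1, "E-M123_Haplozone": 4330.0, "E-M84_Haplozone": 4219.2},
}

def relative_to(reference, table):
    """Ratio of the reference dataset's TMRCA to every other dataset's TMRCA."""
    return {
        rate_set: {
            name: row[reference] / value
            for name, value in row.items()
            if name != reference
        }
        for rate_set, row in table.items()
    }

ratios = relative_to("E-M34_Plaster", medians_28)
for rate_set, row in ratios.items():
    print(rate_set, {name: round(r, 2) for name, r in row.items()})
```

Every ratio comes out above 1, i.e. the Ethiopian estimate is the larger one under each of the four mutation rate sets.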

UPDATE4 (11/29/2012) - Analyzing Ethiopian J-M267 haplotypes.

As above, I compared the 48 J-M267 haplotypes from this paper with non-Ethiopian J-M267 haplotypes from the FTDNA projects database. The results were as follows:

J-M267 (this paper):
Sample size:48
Years/Generation:28 - 33
TMRCA Range:12188 - 21364
Mean TMRCA:15006
Median TMRCA:14448

Year/Generation =28 detailed:
Finalsummary =

  [1,1] = Chandler;14 Markers  TMRCA(Median)--18128 TMRCA(Modal)--18128
  [1,2] = Stafford;14 Markers  TMRCA(Median)--12460 TMRCA(Modal)--12460
  [1,3] = Burgarella_Navascues;14 Markers  TMRCA(Median)--12331 TMRCA(Modal)--12331
  [1,4] = Ballantyne;14 Markers  TMRCA(Median)--12189 TMRCA(Modal)--12189

J-M267 (FTDNA):
Sample size:573
Years/Generation:28 - 33
TMRCA Range:11288 - 31985
Mean TMRCA:17597
Median TMRCA:16324

Year/Generation =28 detailed:
finalsummary =

  [1,1] = Chandler;14 Markers  TMRCA(Median)--18873 TMRCA(Modal)--27139
  [1,2] = Stafford;14 Markers  TMRCA(Median)--11955 TMRCA(Modal)--16285
  [1,3] = Burgarella_Navascues;14 Markers  TMRCA(Median)--11289 TMRCA(Modal)--15253
  [1,4] = Ballantyne;14 Markers  TMRCA(Median)--12084 TMRCA(Modal)--16363

Note again that the 66/46-marker mean TMRCA for the FTDNA dataset was considerably lower (9901) than that of the above 14-marker dataset, again highlighting the impact of marker combination and count on the absolute TMRCA. However, it is clear from the above that the FTDNA J-M267 haplotypes are relatively more diverse than the Ethiopian haplotypes from the current paper (unlike the case for E-M34 above).
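
The J-M267 comparison can also be broken down by mutation rate set and by the choice of ancestral haplotype. The tally below, a sketch using only the 28-year figures listed above (variable and function names are hypothetical), counts how often the FTDNA estimate exceeds the Ethiopian one.

```python
# (median, modal) TMRCA at 28 years/generation, copied from the two blocks above
ethiopia = {
    "Chandler": (18128, 18128),
    "Stafford": (12460, 12460),
    "Burgarella_Navascues": (12331, 12331),
    "Ballantyne": (12189, 12189),
}
ftdna = {
    "Chandler": (18873, 27139),
    "Stafford": (11955, 16285),
    "Burgarella_Navascues": (11289, 15253),
    "Ballantyne": (12084, 16363),
}

def count_larger(a, b, index):
    """Number of rate sets where dataset a has a larger estimate than b."""
    return sum(1 for rate_set in a if a[rate_set][index] > b[rate_set][index])

median_wins = count_larger(ftdna, ethiopia, 0)
modal_wins = count_larger(ftdna, ethiopia, 1)
print(f"median: {median_wins} of 4, modal: {modal_wins} of 4")
# prints "median: 1 of 4, modal: 4 of 4"
```

The modal-based estimates and the reported means (17597 vs 15006) both place the FTDNA haplotypes as the more diverse set, while the median-based estimates are much closer to each other.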

Another interesting find in this paper with respect to the J-lineage is the reporting of one case of J (x M267, M172) in the Maale, a first such find in Ethiopia that I am aware of.

UPDATE5 (11/30/2012) - Analyzing Ethiopian E-V32 haplotypes.

To finalize the series of TMRCA calculations I have been doing, I performed the same calculations on the E-V32 dataset vs Haplozone. Interestingly, it seems as though the E-V32 lineages in Haplozone are older than the ones in the Plaster paper. Since we already know that E-V32 is for the most part restricted to Eastern Africa, a reasonable explanation is that (a) most of the Haplozone E-V32 haplotypes may have relatively recent East African ancestry, a possibility since a majority of the haplotypes are from the Arabian peninsula and the Near East, and/or (b) there are already a few East African (Somali) haplotypes within the E-V32_Haplozone dataset. (Note: the self-declared origins of the E-V32 haplotypes from Haplozone were: 11 from the Near East (Qatar, UAE, Jordan, Saudi and Yemen), 2 from Africa (Egypt and Somalia) and 4 of unknown origin.)

Below is the summary of the TMRCA comparisons I have done thus far. Each bar within each dataset represents the mean TMRCA when the years per generation are set to 28 and 33 and the putative ancestral haplotype is set to the median and modal repeats for the specified mutation rate set.

Also, note that the 72 E-M35_Plaster haplotypes are a composite of 18 E-V32, 4 E-V22, 23 E-M34, 1 E-M281 and 26 E-V6 haplotypes, whereas the 180 E-M35_Haplozone haplotypes are a composite of 20 E-V13, 20 E-V22, 20 E-V12, 60 E-M81 and 60 E-M123 haplotypes.


  1. Very promising. I'll try to get some time tomorrow to read it in depth.

    So far I have stopped at the last two graphs and, contrary to what you said once, it does seem that Cushitic peoples are less influenced than Semitic ones by Eurasian lineages: even the Afar, who live closer to Arabia than the Amhara do, show less Eurasian genetics. It's not a simple story but it does seem like Semitics show yet another layer of Eurasian influx.

    Obvious correlations:

    Y-DNA K (J, T) - mtDNA N (M too but less obvious - see below).

    Y-DNA E1b1a7 - mtDNA L2 (possibly some specific subclade(s))

    Possibly Y-DNA E2 with some mtDNA L0/L1 subclade.

    Regarding M (which would be all M1, which is so highly derived under M, i.e. M → M1'20'51 → M1, that it MUST be a returning lineage from Asia) it seems like it may have spread with African Y-DNA lineages, at least in Ethiopia. Otherwise we would be in the strange case in which the immigrant Y-DNA is smaller than the mtDNA - not unheard of (see North Africa for example) but that requires an explanation. Such an explanation may be a later expansion within Africa led by native African Y-DNA (for example that of early Afroasiatic languages in the Epipaleolithic) but I'm not sure how it fits Ethiopian prehistory.

    1. I believe what I said was that the relevant Cushitic speakers you would want to compare Semitic speakers with would be those that belong to the 'Central Cushitic' category, also known as Agew. The paper includes some Agew samples from Gojam (Awi) and Wello/Sekota (Hemra).

    2. It's not too important and actually the differences seem minor in many cases but there is some trend. Even the Agew, so similar in genetic split to the Amhara and your example of choice, have more A and less J, K*, P*. But not really important.

      I'm following with interest the updates (totally splitting an appearance of uniformity in E1* - a very illustrative example of why resolution in sequencing does matter) and have blogged a bit myself on the matter (link), mostly on the issue of Eurasian and African lineages and what they may mean.

      After consulting with my pillow, I think that the structure may indicate a similar pattern to that found in North Africa, where I understand that the African-originated Afroasiatic wave (Epipaleolithic probably) overlaid earlier Eurasian layers (on top of whatever African earlier layers, of course). Only that seems to explain that there is less "Eurasian" Y-DNA (F-derived) than "Eurasian" mtDNA (M and N derived). That does not reject possible (post-)Neolithic layer(s) from West Asia as icing on the cake (for example Y-DNA J2 and mtDNA K look totally like that) but this seems a relatively thin layer.

  2. It would be most helpful to know the linguistic affiliations of the identified groups other than the Afar, Amhara, Anuak, Maale, and Oromo.

    I'd also be quite interested to hear the gist of the discussion section's treatment of haplogroups A3b2 and J* - the J1/J2 piece pretty much speaks for itself.

    The data look quite suitable for a principal components analysis chart (ideally including both mtDNA and Y-DNA in the same analysis) which might show some of the trends that would otherwise be less obvious.

    1. Linguistic affiliations and geography of the various populations are explained in relative depth in the first sections of the thesis, Andrew.

      There are many PCAs also for your amusement. :)

  3. Interesting study. I am most surprised at the relatively high levels of J among some of the Omotic groups. This weakens the Ethiopian origin theory of Afro-Asiatic substantially. Perhaps future studies with deeper resolution will clarify the situation better.

    1. It depends how you interpret J1 in Africa. I think that it has been in the region since Paleolithic times but detailed substructure study is needed to confirm or deny (or partly confirm and partly reject) this hypothesis.

      In NW Africa for example, the expansion of Afroasiatic would seem related to Y-DNA J1, as well as to E1b-M78 (but probably not M81, which seems specifically Northwestern and probably of much older local roots) and old haplotype studies (Semino 2004, fig. 4) on the matter may suggest a distinct African-centered J1 subclade (AFAIK this matter is yet to be clarified, right?)

      So I'd say that imagining J1 as (post-)Neolithic West Asian inflow is probably shallow and incorrect. It is a West Asian backflow but IMO it is older and related to other very old Eurasian backflows like mtDNA M1, X1, etc., which IMO must be dated in the Upper Paleolithic (maybe related to the formation of the LSA c. 50 Ka ago?)

      At least I imagine IJ splitting around the time of the first colonization of Europe (maybe as early as c. 49 Ka according to the most recent datings) and the J1/J2/J* split would happen soon after that. However I can't be sure of the dates because it requires good prehistorical understanding in material terms (archaeology) to determine the plausible frames of cultural flows and therefore possible demic ones. And I do not know enough.

    2. Why exactly would the finding of J1 in Omotic speakers "weaken the Ethiopian origin theory of Afro-Asiatic substantially" ??

      What makes the J1 lineage a weaker candidate (relative to E-M35) for the dispersal of Proto-Afroasiatic speakers is not only the fact that it has a weaker frequency overlap with the major linguistic branches of Afroasiatic, but also the fact that the putative area of the lineage's origin according to both STR and SNP diversity, i.e. the greater area of the Zagros mountains, is at best home to only one, arguably relatively very young, branch of Afroasiatic, namely Semitic. This however is not the case for the Ethiopian region where E-M35 was likely born, as it definitively harbors 2 (Cushitic and Omotic) and potentially 4 (+ Semitic and Ongota) independent branches of Afroasiatic.

  4. Where are the Oromo samples from? Are they from all over Ethiopia or?

    1. Assuming that you are asking where the samples were collected from:
      49 - Jimma, 48 - Addis, 29 - Gamo Gofa, 15 - Gambela, 8 - Konso/Kefa/Gurage

      Figures 2.4.2 and 2.4.3 show maps of the weighted mean collection location for all populations sampled.

      This of course doesn't necessarily mean that the donors were all born at these locations, or that their parents were born there.

    2. Yeah, that's what I was asking about. Thanks a lot for the reply, man.