## August 29, 2008

### Determining whether an individual is in a mixed sample using microarrays

This is a very important new paper, blogged about by Gene Expression and the Spittoon.

#### A short explanation

DNA from many different individuals may be "mixed up", either literally (e.g. at a crime scene) or figuratively (in an allele frequency table where individual genotypes are averaged).

If you have the genotype of a particular individual, can you tell whether or not he is included in the mix?

The surprising answer is yes, even if the person contributes less than 1% to the mixture, provided that you study a large number of markers, such as the hundreds of thousands of SNPs on chips made by companies such as Illumina or Affymetrix.

An individual's DNA shifts the sample's allele frequencies by very small amounts. If the individual is included in the mix, then, averaged over many loci, the sample's allele frequencies will deviate from those of the reference population in the direction of the individual's own genotype.

Let's give a non-genetic analogy (I'm making these numbers up, but they'll do). If the Chinese height average is 1.75m, and a sample of Chinese has a height average of 1.8m, then Yao Ming is more likely to be in that sample than in one whose average matches the population's.

Of course, using one trait, it is impossible to conclude firmly that Yao Ming is in the sample: any number of tall Chinese could raise the sample average. But, averaged over many traits, Yao Ming's individual traits would stand out, whereas those of other tall Chinese men would not.

The power of this technique relies on using a very large number of variables, which has become possible with the use of microarray chips measuring hundreds of thousands of polymorphisms.
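The idea can be tried out in a small simulation. What follows is only a sketch under stated assumptions, not the paper's implementation: the per-SNP distance D = |Y - Pop| - |Y - M| (Y the person's genotype as a frequency, Pop a reference sample's frequency, M the mixture's frequency) and the t statistic over SNPs are modeled on my reading of the approach, and every name and number here (`simulate`, `membership_t`, the pool size of 50, the 20,000 SNPs, the uniform allele frequencies) is made up for illustration.

```python
import math
import random


def genotype(p, rng):
    """One person's genotype at one SNP, as an allele frequency: 0, 0.5 or 1."""
    return ((rng.random() < p) + (rng.random() < p)) / 2


def person(freqs, rng):
    """Draw a random unrelated individual from the population."""
    return [genotype(p, rng) for p in freqs]


def pooled_frequencies(people):
    """Per-SNP allele frequency of an equal-parts mixture of `people`."""
    n = len(people)
    return [sum(column) / n for column in zip(*people)]


def membership_t(individual, mixture, reference):
    """t statistic of D_j = |Y_j - Pop_j| - |Y_j - M_j| across all SNPs.

    D_j is positive when the person sits closer to the mixture than to the
    reference at SNP j. Each D_j is tiny and noisy; the signal only shows
    up once it is averaged over many SNPs.
    """
    d = [abs(y - r) - abs(y - m)
         for y, m, r in zip(individual, mixture, reference)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)


def simulate(n_snps=20000, pool_size=50, seed=1):
    """Return (t for a person inside the mixture, t for a person outside it)."""
    rng = random.Random(seed)
    # Assumed population allele frequencies, one per SNP.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
    pool = [person(freqs, rng) for _ in range(pool_size)]
    mixture = pooled_frequencies(pool)
    # The reference is a separate sample from the same population.
    reference = pooled_frequencies(
        [person(freqs, rng) for _ in range(pool_size)])
    insider = pool[0]                # contributes 1/pool_size of the mixture
    outsider = person(freqs, rng)    # not in the mixture at all
    return (membership_t(insider, mixture, reference),
            membership_t(outsider, mixture, reference))


if __name__ == "__main__":
    t_in, t_out = simulate()
    print("in mixture: t = %.1f, not in mixture: t = %.1f" % (t_in, t_out))
```

With these toy settings the insider makes up 2% of the pool. His statistic should come out large and positive, while the outsider's should hover around zero. Using more SNPs sharpens the contrast, which is why chips with hundreds of thousands of markers matter for contributions below 1%.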

#### Why is this important?

The forensic applications are clear: people's DNA gets mixed up all the time, yet investigators are interested in determining whether a particular individual (e.g. a criminal or a missing person) was present at a scene.

The scientific implications are less clear, but more troubling. From now on, releasing a table of "allele frequencies" in a sample can't be guaranteed to mask the identities of individuals.

Suppose someone asked you to participate in a scientific study, and you were told that no individual genetic information would be disclosed to the public, but only averaged information over all participants.

You can no longer be content with that promise. Someone who has acquired your genotype can now figure out whether or not you participated in the study.

What is troubling, at least for me, is the proposed solution to this problem:

> Considering privacy issues with genetic data, it is now clear that further research is needed to determine how to best share data while fully masking identity of individual participants. However, since sharing only summary data does not completely mask identity, greater emphasis is needed for providing mechanisms to confidentially share and combine individual genotype data across studies, allowing for more robust meta-analysis such as for gene-environment and gene-gene interactions.

In other words, the proposed solution would deprive the public of access to any type of genetic information produced by studies the public actually pays for. Instead, it is proposed that individuals' genotype data be shared among scientists themselves.

But who decides who can get access to the data? If an obscure "scientist" from some far-away land asks for data for a study he is conducting, is he entitled to it? Or will the data be shared only within a close-knit group, making it more difficult to evaluate independently, or to create derivative applications (such as EURO-DNA-CALC, which would have been impossible without allele frequency data)?

PLoS Genet 4(8): e1000167. doi:10.1371/journal.pgen.1000167

Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays

Nils Homer et al.

Abstract

We use high-density single nucleotide polymorphism (SNP) genotyping microarrays to demonstrate the ability to accurately and robustly determine whether individuals are in a complex genomic DNA mixture. We first develop a theoretical framework for detecting an individual's presence within a mixture, then show, through simulations, the limits associated with our method, and finally demonstrate experimentally the identification of the presence of genomic DNA of specific individuals within a series of highly complex genomic mixtures, including mixtures where an individual contributes less than 0.1% of the total genomic DNA. These findings shift the perceived utility of SNPs for identifying individual trace contributors within a forensics mixture, and suggest future research efforts into assessing the viability of previously sub-optimal DNA sources due to sample contamination. These findings also suggest that composite statistics across cohorts, such as allele frequency or genotype counts, do not mask identity within genome-wide association studies. The implications of these findings are discussed.