Sunday, September 8, 2019

Gene Fact #1

As it turns out, my genetics class was just starting to discuss genomes and genomics when this fact-checking project began, and the first fact presented in "A User's Guide: Your Genes - 100 Things You Never Knew" (a Time Inc. Specials publication, by National Geographic) happened to be relevant to that topic:

1. We humans share 99% of our genes with chimpanzees and bonobos

Initial Reaction

I knew that this putative Fact would be tricky to ask students to work on, because it combines an easily vetted numerical value with two subtle, perhaps trivial, issues of definition:

  • what does "share" mean?
  • what definition of "gene" should we use?

Also, I anticipated that this topic could be rejected out of hand (that is, without any investigation or critical thinking) by creationists or others who don't want to consider whether we humans are most closely related to non-human primate species like chimpanzees.

Initial Student Responses

Of all of the ~60 written summaries by students:

A few cited a review paper by Khodosevich, Lebedev and Sverdlov, "Endogenous retroviruses and human evolution" (2002) Comp Funct Genom, which states in the second sentence of the introduction, "The average DNA sequence difference between human and chimpanzee is only 1.24% [7] and probably only 0.5% in active coding regions [9]," but some students didn't even make it that far, stopping at the first sentence of the abstract, "Humans share about 99% of their genomic DNA with chimpanzees and bonobos."

Another review, by Varki and Altheide, "Comparing the human and chimpanzee genomes: Searching for needles in a haystack" (2005) Genome Research, was often cited to refute the Fact, perhaps because its abstract states, "The difference between the two genomes is actually not ∼1%, but ∼4%—comprising ∼35 million single nucleotide differences and ∼90 Mb of insertions and deletions."

Britten (2002) PNAS says it all in the title, "Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels."

In Ebersberger et al. (2002) "Genomewide Comparison of DNA Sequences between Humans and Chimpanzees" Am J Hum Genet., the authors studied about 1.9 million nucleotides and found an average sequence difference of 1.24%.

Most cited Prüfer et al. "The bonobo genome compared with the chimpanzee and human genomes" (2012) Nature, in which the authors state that, in the single-copy autosomal regions they analyzed, the average percent identity between human and bonobo genomes is 98.7%, with bonobo-chimpanzee percent identity at 99.6%.

My Feedback to the Students

If you'd like to hear our in-class discussion of this process (and see the slides I used), please visit this link to YouTube.

Much student discussion focused on whether it was reasonable to round 98.7% (e.g. from Prüfer et al.) up to 99% (as stated in the Fact). Although this may seem like a relatively trivial point, deciding whether it matters for calling a statement a fact is not easy!

I raised one concern with using the Khodosevich paper as evidence supporting the Fact. This is a review paper, meaning that it didn't technically adhere to my requirement of a primary research article (such as Prüfer et al.) that would be the original source of data that could be used to support or refute a conclusion. If students had found, read, and directly cited reference [7] from the above quote from Khodosevich, then that would be more appropriate. This assumes, of course, that reference 7 really does provide evidence that supports the Fact. The practice of citing literature based only on its title, or perhaps after only reading the abstract, does occur in science. Hence, some practical skepticism is warranted! Don't blindly believe that a citation supports an author's position until you read the citation.

I also offered students two technical, genetics-related viewpoints.

First, the Fact, as stated, discusses the percentage of genes shared between the species. However, most of the students cited research that was not tallying genes shared in common, but rather nucleotides (the letters comprising a chromosome's sequence) shared in common. It is not an obvious logical step to conclude that 98.7% identity at the DNA sequence level would result in 98.7% (or 99%, with rounding) of genes being shared between humans and chimpanzees.

Second, many of the genomics studies that students cited analyzed different regions of chromosomes. Some looked only at single-copy regions of chromosomes. For these DNA sequences, it is easy to locate the corresponding (homologous) chimpanzee version, so the two sequences can be directly compared to find percent identity.

ATACATAG (Human)
ATAGATAG (Chimp)
Of these eight nucleotides (which I invented for the purpose of this example), only one is different between humans and chimps, so 7/8 are identical (87.5% identity). This was the type of approach used by Ebersberger et al. and by Prüfer et al.
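
As a minimal sketch of that calculation (written in Python, with a hypothetical helper name, percent_identity, that does not come from either study), percent identity over two already-aligned sequences of equal length could be computed like this:

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of positions at which two equal-length, aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the positions where the two letters are the same.
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# The invented eight-nucleotide example from the text.
human = "ATACATAG"
chimp = "ATAGATAG"
print(percent_identity(human, chimp))  # 87.5
```

Real genome comparisons first have to work out which positions correspond to each other (the alignment step); this sketch assumes that has already been done.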

Other chromosomal differences exist, like insertions and deletions (indels), and duplications of large (or small) stretches of DNA. Scientists can interpret these sorts of differences in various ways.

ATAC-----ATAG (Human)
ATAGGGGGGATAG (Chimp)

Here, in the middle of the same eight nucleotides as the first example, there is a five-nucleotide insertion of G in chimpanzee (or, just as likely, a five-nucleotide deletion of G in human). If scientists align the two sequences like this, then they might only analyze the alignable sequences, which would result in the same calculation as above: 7 of 8 alignable nucleotides are identical, so 87.5% identity. However, if the structural variants (like indels) are included in the calculation, then only seven of the thirteen total positions match (53.8% identity). This latter approach was employed by Varki and Altheide and also by Britten. Notably, Britten found 1.4% divergence in alignable nucleotides, with an additional 3.4% divergence based on indels. The sum is thus 4.8% (which was rounded to 5% in the manuscript title - probably not deceptively, but also not accurately!)
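
To make the two ways of counting concrete, here is a small Python sketch applied to the toy alignment above. The function names are hypothetical, and treating every gap position as a mismatch is an assumption made for illustration, not a description of how any of the cited studies scored indels.

```python
GAP = "-"


def identity_alignable_only(seq_a: str, seq_b: str) -> float:
    """Percent identity over positions where neither sequence has a gap."""
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if GAP not in (a, b)]
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def identity_counting_indels(seq_a: str, seq_b: str) -> float:
    """Percent identity over all positions; a gap position counts as a mismatch."""
    matches = sum(a == b and a != GAP for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# The invented alignment from the text, with a five-nucleotide indel.
human = "ATAC-----ATAG"
chimp = "ATAGGGGGGATAG"
print(identity_alignable_only(human, chimp))             # 87.5
print(round(identity_counting_indels(human, chimp), 1))  # 53.8
```

The same pair of sequences gives two very different numbers depending only on whether the indel is counted.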

Thus, different studies can reach different conclusions because of the analysis methods used. And, of course, those subtle differences don't make it into the paper titles (and sometimes not into the abstracts) - and they definitely don't trickle down into popular science and media coverage of these types of studies!

Comparing research studies that arrive at apparently different conclusions

I made two final points for the students. I suggested that they might consider how much DNA sequence was analyzed in each study. This could be used to decide the relative importance of each study when arriving at a conclusion about which of the various calculations of human-chimp percent divergence (here we've seen citations reporting 0.5, 1, 1.24, 4, and 5%) is perhaps most accurate. Britten's analysis of 779,000 nucleotides resulted in a conclusion of 5% divergence, while Ebersberger et al. analyzed 1,900,000 nucleotides and found 1.24% divergence. 
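
One rough way to put those two sample sizes side by side is sketched below in Python. The helper name sequence_shares is hypothetical, and the only numbers used are the ones quoted above. Keep in mind that the two studies did not measure the same thing (Britten counted indels, Ebersberger et al. did not), so the amount of sequence analyzed cannot settle the disagreement by itself.

```python
# Nucleotides analyzed and divergence reported, as quoted in the text above.
studies = {
    "Britten (2002)": {"nucleotides": 779_000, "divergence_pct": 5.0},
    "Ebersberger et al. (2002)": {"nucleotides": 1_900_000, "divergence_pct": 1.24},
}


def sequence_shares(studies: dict) -> dict:
    """Return each study's share (in percent) of the combined sequence analyzed."""
    total = sum(s["nucleotides"] for s in studies.values())
    return {name: 100.0 * s["nucleotides"] / total for name, s in studies.items()}


shares = sequence_shares(studies)
# List the studies from most to least sequence analyzed.
for name in sorted(studies, key=lambda n: studies[n]["nucleotides"], reverse=True):
    s = studies[name]
    print(f"{name}: {s['nucleotides']:,} nt "
          f"({shares[name]:.1f}% of combined sequence), "
          f"{s['divergence_pct']}% divergence")
```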

I also suggested that they might consider the ages of the studies. It might be important, if you really want to be sure of your facts, that you not cite an old study that might have been conducted with perhaps less precise methods than we have at present, or that was since corrected in more recent studies. This is not to say that old studies using old approaches or tools are necessarily flawed and inherently worse than more current research, but some people (I understand) still think the earth is flat, citing really old literature, despite a plethora of more recent work that has pretty much eroded confidence in the flat-earth stance. When performing a literature review, it is best practice to read studies spanning the time when a particular topic has been researched. I know that my audience here doesn't necessarily have that sort of time, which is why I hope that this information literacy project will fill that need!

Student Decision: Fact or Fiction?

Ultimately, 30 of 51 students (58.8%) agreed that it is a fact that "We humans share 99% of our genes with chimpanzees and bonobos."

From my perspective, it seems like the top two aspects that caused less-than-overwhelming support for this Fact, as stated, were that

  • the specific value of 99% did not appear in research literature
  • the studies students found did not assess shared genes (as in the Fact) but instead shared DNA sequence

In other words, I think that this Fact didn't garner more support because of rounding and because the Fact, as written, misconstrues the actual basis of the research supporting it.

What resources did students use to find supporting research literature?

The week before this information literacy project began, I had started showing students in class how to use the NCBI PubMed literature database to find research publications. So, I also asked students to report how they identified the research literature they cited for this first fact-checking assignment.

29% PubMed
17% Google Scholar
29% Other form of search via web browser
2% EasyBib
2% Public Library of Science website
3% EBSCO

and, to my delight and surprise:

19% used some form of resource at the Henry Madden Library on our campus. I only teach upper-division and graduate classes, so I'm not familiar with how much exposure students get to using our library resources. I'm glad they're taking advantage of what our library has to offer!

Fact-checking "100 Things You Never Knew About Your Genes"

If my goals are
  • to produce a curated bibliography for 39 of the DNA facts
  • to help demonstrate how to be information literate
then I owe it to readers to be transparent about the process. I think this is an important way to model information literacy: explain the process of fact-checking, particularly how decisions are made about how the copious information reviewed ultimately gets whittled down and passed on to the audience in digestible pieces.

Remember, this process is arguably the reason that scientific research conclusions get blown out of proportion or misconstrued, because the goals of media are to outcompete other outlets for audience and to give the audience information in a format that the audience wants. And, more and more, that means quick bites of information that necessarily exclude important details for understanding the assumptions, meaning, value and/or impact of a research study.

For example, if you only read the title of Beall and Tracy's 2013 paper in Psychological Science, "Women are more likely to wear red or pink at peak fertility," you might take this claim at face value. However, this study fails critical thinking and good information literacy practices in many ways, including that the title does not specify which women. You have to delve into the research paper itself to learn how many women were surveyed, and where in the world they live. This is where critical thinking is important. When I see the title of this study, I start asking questions like, "What do they mean by 'more likely' - how much more likely than women wearing other colors?" "Where in the world did these women come from - might there be a cultural bias in what colors are worn?" "When was the study conducted: was it around Christmastime, when it might be more likely that women are wearing red?" "What control experiment was conducted - what about men, for example? Or women of ages outside of the fertility range?" And, because I'm red-green color-blind, another really important question to ask: "Who decided what shades and hues count as red or pink? Did women self-report, or did the researchers take photographs of the clothing and objectively define and measure both red and pink?"

So, this is how the fact-checking will work in my class:

When we begin to study a topic that is relevant to a group of DNA facts, then I'll present that group to the students and ask each student to select one to scrutinize. As we study that topic, the students will be introduced to concepts and vocabulary that might be important for them to understand the research literature they'll access for fact-checking their chosen claim. By the end of the topic, each student finds one primary peer-reviewed research manuscript that contains evidence either supporting or refuting the fact. Using that single-source information, each student writes and sends me a short (~1-2 paragraph) summary of the information, along with the citation to the primary literature.

Then, on my end, I compile and read through the student summaries as well as many of the commonly-cited research studies. I look for common misconceptions in how some of my students might have misinterpreted data from the studies, and I identify whether multiple sources tend to agree or not on the fact. So that I can give my students feedback on their work (which they will hopefully use to improve their information literacy and critical thinking skills), I give a short in-class presentation the day after their summaries are due, which includes:
  • summarizing various student perspectives about whether the fact is supported or not
  • addressing misconceptions related to genetics concepts that were evident in the written summaries
  • highlighting strengths and weaknesses related to the credibility of sources that were used
Finally, after discussion, I put the matter to a yes or no vote:

"Based on the information found in published research literature, is it reasonable for this claimed fact to be called a fact?"