Molecular Cell Biology

Eric Wasiolek

Molecular Cell Biology - 6151

Paper 3 Analysis - Aharoni et al.

Professor: Dr. Uhde-Stone

Student: Eric Wasiolek

Date: October 17, 2005

Summary:

Aharoni et al. performed three experiments to measure the differential expression of genes in ripening strawberries. The researchers measured the expression of 1701 genes in strawberries at three developmental stages, green, white, and turning (half red and half white), and compared expression at each stage with that of red (ripe) strawberries, using microarray technology. The researchers found 401 differentially expressed genes. Controls included switching the fluorescent dyes in the green-red comparison. Small variances/noise in the data were 'cleaned' using statistical techniques. The differentially expressed genes were functionally characterized through sequence searches, in enough detail to categorize them into information, energy, communication, or uncharacterized groups, with some subcategorization. Particular attention was paid to the SAAT gene, which showed a 16-fold increase in expression in ripened (red) strawberries, in part because SAAT belongs to the enzymatic family of alcohol acyltransferases (AATs), which esterify alcohols and acyl-CoA to volatile esters, which are strongly implicated in the increased flavor and color of ripened fruits.

Critical Analysis:

General Comments - The Paper is a Disappointment

As I analyzed this paper, it became clear that it falls into two parts. First, the researchers performed a microarray analysis to determine which genes were upregulated or downregulated during the 4 developmental stages of strawberry ripening. The researchers found 401 genes which were either up or down regulated through these stages. But then the paper jumps, in an almost non sequitur fashion, to an analysis specifically of the SAAT gene. My major criticism of this paper is: why did the researchers concentrate so heavily on one gene, the SAAT gene, at the expense of further analysis of the 400 other up/downregulated genes involved in strawberry ripening? The answer seems to be that the researchers were relying on their previous biochemical knowledge that AATs are critical in fruit ripening, and SAAT is just a special case of this. In effect, then, the researchers did no more than corroborate what was already known: that AATs, particularly SAAT, are critical in fruit (strawberry) ripening. Rather than pursuing a more informative analysis of what the microarray differential expression data could have told us about ALL of the genes critical in strawberry ripening, the exact role of each of these genes insofar as determinable, their coregulation, or their precise interactions, the authors reduced their study, in effect, to a mere corroboration of something that was already generally known. Hence, I am basically disappointed by this paper. It could have been a much more informative research project than it became.

In keeping with the authors' own division of their paper into two parts, I first analyze the authors' analysis of differential gene expression in the stages of strawberry ripening, and then criticize their analysis of SAAT gene expression in strawberry ripening.

Part I - critical analysis of differential gene expression in stages of strawberry ripening

The first, and potentially more informative, part of the paper involved three microarray experiments to analyze which genes were differentially expressed (either up or down regulated) in four stages of strawberry ripening: a green (G) unripe stage, a white (W) more ripe stage, a turning (T) stage where the strawberry is half white and half red, and a red (R) stage where the strawberry is ripe. The red (R) stage is used as the reference stage.

Analysis of the microarray experimental methods:

In the methods section, the experimenters indicate that:

a. Red ripe strawberry tissue and corolla tissue from petunias were used to construct cDNA libraries; these were directionally cloned, which resulted in 20,000 plaques.

b. High quality plasmid DNA from 1701 strawberry and 480 petunia colonies, picked randomly, was extracted robotically. 1100 of the 1701 strawberry DNAs and all 480 of the petunia DNAs were partially sequenced. Petunias were used to provide specificity.

Critical Analysis: The fact that the 1701 colonies were randomly picked from a larger pool indicates a potentially huge problem. The microarray must contain targets (for the cDNA probes) that include a non-redundant set of all genes for all mRNA expressed in the two tissues (the test tissue and the reference tissue) in each experiment. To the extent that this is not the case, some mRNA critical in fruit ripening might not even be tested.

c. Total mRNA from tissues representing each of the 4 stages of strawberry ripening was extracted, and reverse transcriptase was used to prepare 21-nucleotide long cDNAs. Cy3-dCTP and Cy5-dCTP were used to fluorescently label the cDNA. The cDNA was separated from the mRNA by denaturing (boiling), amplified by PCR, and then arrayed in duplicate on the microarray slide (the genechip). The cDNA probes were arrayed in duplicate with 4362 probes per array ((1701 strawberry cDNAs + 480 petunia cDNAs) x 2 (duplicates) = 4362 total probes). Arranged in 16 x 16 subarrays, the first 12 columns were dedicated to strawberry cDNA and the last 4 columns to petunia cDNA.

Critical Analysis: Note that a 21-mer is long enough to uniquely identify and hybridize to a specific gene, as there are 4^21, or approximately 4.4 trillion, unique oligonucleotides that can be formed with a sequence of 21 nucleotides, and in yeast there are fewer than 100,000 genes. The paper also implies here that there were 16 subarrays of 16 x 16 spots on the microarray (Figure 2), but this doesn't make sense arithmetically, as this would yield only 4096 probes, and we are told that there are 4362 probes.
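
The arithmetic in this comment is easy to check directly. The short Python sketch below uses only the figures quoted above (1701 strawberry and 480 petunia cDNAs spotted in duplicate, and 16 subarrays of 16 x 16 spots); the variable names are my own and are not taken from the paper.

```python
# Number of distinct 21-mers over the four-letter nucleotide alphabet.
distinct_21mers = 4 ** 21

# Probes reported by the authors: every cDNA is spotted twice.
strawberry_cdnas = 1701
petunia_cdnas = 480
reported_probes = (strawberry_cdnas + petunia_cdnas) * 2

# Capacity implied by 16 subarrays of 16 x 16 spots each.
implied_capacity = 16 * 16 * 16

print(distinct_21mers)                     # 4398046511104
print(reported_probes)                     # 4362
print(implied_capacity)                    # 4096
print(reported_probes - implied_capacity)  # 266 probes unaccounted for
```

The 266-probe gap is the discrepancy noted above; the layout described in the paper cannot hold all of the reported probes unless there are more subarrays or larger subarrays than I have assumed.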

d. cDNA is hybridized: portions of each fluorescently labelled sample (e.g. the G and the R) were mixed and applied to the microarray. Hybridization of the cDNAs to the matching DNA probe (gene) then occurs.

e. Scanning: After being washed the arrays were dried and scanned with a laser. One scan was performed for each dye. The integrated optical density of each individual probe on the array was measured to quantify the color and intensity of each probe. Since the computer keeps track of the gene (sequence) associated with each probe and its color and intensity, the color and intensity of expression for each gene can be recorded as the expression data resulting from two samples (the test and the reference sample, e.g. the green and the red strawberry samples).

Critical Analysis: There is no major problem at this step, except that some false data will be generated by hybridization not occurring neatly within a defined circular area. The scanning relies on using a grid to place a defined circle fitting the size of the DNA spots and measuring the integrated absorbance within each such circle. Statistical methods are usually used to clean this data and weed out the false data.

f. Statistical analysis and controls were used to clean the data (assure good data).

That was a discussion of the methods. The experiment basically involved performing the procedure 3 times.

Analysis of the microarray experiments per se:

Experiment one mixes the sample from the green strawberry mRNA (Cy5) with the red strawberry mRNA (Cy3).

Experiment two mixes the sample from the white strawberry mRNA with the red strawberry mRNA (Cy3).

Experiment three mixes the sample from the turning strawberry mRNA with the red strawberry mRNA (Cy3).

Control experiment: Experiment one was repeated switching the dyes, with red labelled with Cy5 and green labelled with Cy3. This was done to exclude artifacts, i.e. to verify that the dye labelling wasn't affecting the expression results.

Critical Analysis: Three experiments were necessary to compare expression of genes in green, white, and turning strawberries to red (ripe) strawberries. Note that the red strawberry is used as the reference tissue in each case and the other less ripe strawberry staged tissues as the test tissue. Since microarrays measure DIFFERENTIAL expression of genes, it really doesn't matter whether the red strawberry is the reference or the green, for example, so long as the same reference is always used. But this choice results in graphs of differential expression data appearing in a certain way, as in Figure 3B, where 1 represents the red strawberry as the reference, and expression ratios above or below 1 indicate up or downregulation of gene expression relative to how much that gene is expressed in red strawberries. The graph would appear differently if for example the green strawberry was used as the reference.
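
The point about the choice of reference can be made concrete with a small Python sketch. The two intensities are made-up illustrative values, not measurements from the paper. Swapping which sample is the reference turns the ratio into its reciprocal, so the log base 2 ratio simply changes sign.

```python
import math

# Illustrative intensities only (not measured values).
green, red = 100.0, 1600.0

ratio_red_reference = green / red    # test / reference, with red as reference
ratio_green_reference = red / green  # same data, with green as reference

print(ratio_red_reference, math.log2(ratio_red_reference))      # 0.0625 -4.0
print(ratio_green_reference, math.log2(ratio_green_reference))  # 16.0 4.0
```

The information content is the same in both cases; only the direction in which up and down regulation is drawn on a graph changes.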

A Critical Analysis of the Resulting MicroArray Data (Results):

The results that are provided by the authors can be summarized in the following table:

		Experiment 1	Experiment 2	Experiment 3
		G/R	W/R	T/R
p < .05	total	247	168	137
	up	177	105	60
	down	70	63	77
p < .01	total	126	87	76
	up
	down
p < .001	total	68	42	27
	up
	down

Figure 1: A summary of the authors' results.

That is, the authors indicated the number of differentially expressed genes in the three comparisons at different levels of statistical significance. But this information is quite limited. What does it tell us? Just that, for example, in the green versus red strawberries, there were 247 genes that showed a differential level of expression. But we want to know more than this. We want to know WHICH genes were differentially expressed. The counts tell us only how much overall difference in expression there is between the samples, and nothing about the type of differential expression; nor is this level of differential expression compared to that of other ripening fruits or other developmental processes, which would help to evaluate these numbers. I'm surprised that the authors didn't provide this information, even in a condensed form, such as the top 10 most differentially expressed genes (in terms of amount of differential expression) and how much that differential expression was (to what extent up or down regulated). A simple table like the one below could have provided this information. Note that the authors did indicate, though only at the .05 (95% confidence) significance level, how many genes were upregulated versus downregulated. Still, this doesn't tell us much.

Much better would have been a table like this:

	Experiment 1				Experiment 2				Experiment 3
	Green	Red	Intensity	Log(2)	White	Red	Intensity	Log(2)	Turning	Red	Intensity	Log(2)
GENE	(test)	(reference)	Ratio	IntensRat	(test)	(reference)	Ratio	IntensRat	(test)	(reference)	Ratio	IntensRat

*SAAT*	100	1600	16.00	4.00	600	1600	2.67	1.42	110	1600	14.55	3.86
*PDC*	89	1025	11.52	3.53	401	1025	2.56	1.35	802	1025	1.28	0.35
*Gene3*	80	800	10.00	3.32	320	800	2.50	1.32	560	800	1.43	0.51
*Gene4*	59	305	5.17	2.37	141	305	2.16	1.11	223	305	1.37	0.45
*Gene5*	160	140	0.88	-0.19	153	140	0.92	-0.13	146	140	0.96	-0.06
*Etc…*

Figure 2. Proposed Table the Authors Should Have Included.

A table like Figure 2 would be much more informative. It would show WHICH genes were up or down regulated and HOW MUCH each gene was up or down regulated, both by giving the actual expression level data (which could be omitted for simplification) AND the degree or relative amount of up or down regulation relative to the red strawberry reference. This table would indicate, for example, that SAAT was 16 times more expressed in red strawberries than in green strawberries (i.e., the intensity ratio, here calculated as reference/test, is critical). Furthermore, this expression differential (the intensity ratio) could be expressed in log base 2, so that down regulations appear negative and up regulations positive. The up and down regulation data could be presented in different colors for clarification. Of course, not all differentially expressed genes could be presented in the paper, but maybe the top ten or fifteen most differential (up or down regulated), with the full data set being available online. Note that the values in the above table are bogus and serve just to illustrate the type of table needed.
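
As a rough sketch of how such a table could be produced from raw intensities, the Python below recomputes the Experiment 1 columns of the proposed table and ranks the genes by strength of differential expression. The intensities are the same bogus values used in the table, and the function and variable names are my own.

```python
import math

# Made-up intensities from the proposed table: gene -> (green test, red reference).
intensities = {
    "SAAT": (100, 1600),
    "PDC": (89, 1025),
    "Gene3": (80, 800),
    "Gene4": (59, 305),
    "Gene5": (160, 140),
}

def fold_change(test, reference):
    """Return (ratio, log2 ratio), with the ratio taken as reference/test."""
    ratio = reference / test
    return ratio, math.log2(ratio)

rows = {gene: fold_change(t, r) for gene, (t, r) in intensities.items()}

# Rank by absolute log2 ratio so strong up and down regulation both rise to the top.
ranked = sorted(rows, key=lambda gene: abs(rows[gene][1]), reverse=True)

for gene in ranked:
    ratio, log_ratio = rows[gene]
    print(f"{gene:6s} ratio={ratio:6.2f} log2={log_ratio:6.2f}")
```

Taking the first ten or fifteen entries of `ranked` would give the condensed "top genes" list argued for above.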

In all fairness, the authors did present Figure 3B of the Aharoni paper which indicated in a graphic format the up or down regulation of 27 genes. However, these aren't the 27 most differentially expressed genes, just 27 genes showing homology to a metallothionein gene or an auxin-induced gene, and the graph doesn't indicate WHICH 27 genes are depicted.

Next a Sequence Analysis of the Most Differentially Expressed Genes Would Be in Order.

Next, it would be most reasonable to take this set of most differentially expressed genes and do a sequence analysis of them. A simple BLAST would suffice for a start, to see if similar sequences could be found which would indicate the functions of these differentially expressed genes. Again, the authors either failed to take this simple, obvious step to try to characterize the functionality of the most differentially expressed genes, or they did this but didn't present this important data. The authors DID do a BLAST search, but the only information they presented as a result of this was a general classification of 239 differentially expressed genes into four main categories: information (DNA, RNA, protein), energy, communication, or unknown. There is some indication that there was subcategorization; for example, communication was subdivided into hormones, detoxification, signal stress, defense, etc. The problem is that whatever functional classification of the differentially expressed genes the authors did was NOT presented in detail, and yet this is CRITICAL information to determine what exactly is going on metabolically in different stages of ripening. All that was presented was that of 177 cDNA clones identified as being upregulated during the red stage, 53% were related to energy (which means metabolism, i.e., the functional area where we would expect most of the differentially expressed genes to be), 27% to communication, and 20% to information. For every significantly differentially expressed gene, a BLAST search should be done, and its precise functional characterization, as much as possible, should be given.

A Cluster Analysis of the MicroArray Data Should Have Been Done to Indicate Gene Coregulation

Another OBVIOUS endeavor that the authors should have engaged in but didn't was a cluster analysis of the microarray data. This would give, in addition to the gene specific and also gene family information that a BLAST search functional analysis would have given, an indication of the important topic of which of these differentially expressed genes seem to be coregulated. Both types of information together start to build a more precise picture of the genetic events which give rise to the various stages of the ripening of a strawberry.
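
To illustrate the kind of analysis meant here, the sketch below groups genes whose log base 2 expression profiles across the three comparisons are strongly correlated. It is a deliberately simple greedy grouping by Pearson correlation, standing in for a proper hierarchical or k-means cluster analysis, and it is not something the authors did. The profiles are the made-up log ratios from my proposed table, and the 0.9 threshold is an arbitrary assumption.

```python
import math

# Made-up log2 ratios for the G/R, W/R and T/R comparisons (not real data).
profiles = {
    "SAAT": [4.00, 1.42, 3.86],
    "PDC": [3.53, 1.35, 0.35],
    "Gene3": [3.32, 1.32, 0.51],
    "Gene4": [2.37, 1.11, 0.45],
    "Gene5": [-0.19, -0.13, -0.06],
}

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def group_by_correlation(profiles, threshold=0.9):
    """Greedily place each gene in the first group holding a correlated member."""
    groups = []
    for gene, profile in profiles.items():
        for group in groups:
            if any(pearson(profile, profiles[m]) >= threshold for m in group):
                group.append(gene)
                break
        else:
            groups.append([gene])  # no correlated group found, start a new one
    return groups

groups = group_by_correlation(profiles)
print(groups)  # [['SAAT'], ['PDC', 'Gene3', 'Gene4'], ['Gene5']]
```

With only three comparisons per gene, correlations like these are far too fragile to trust on their own; the sketch only shows the shape of the computation. Genes that land in the same group are candidates for coregulation and would be the natural starting point for the sequence comparisons discussed in this analysis.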

Part II - Critical analysis of SAAT gene expression in strawberry ripening

Now we move to the strangest aspect of, and biggest problem with, this paper: the non sequitur movement from an incomplete microarray analysis of the differential gene expression involved in strawberry ripening to a complete concentration on ONE GENE: the SAAT gene. Jumping from an incomplete microarray analysis to providing much information on ONE differentially expressed gene is based, in my opinion, simply upon the fact that the authors had much prior knowledge that the AAT enzymes were critical in flavor and color genesis in other fruits. This results in the authors adding little to science or to the knowledge of strawberry ripening that isn't already known: that strawberries have an AAT, named SAAT, which acts exactly as we already know AATs act. AATs, including SAAT, are enzymes that esterify acyl-CoA and alcohols (the substrates) to volatile esters (the product), which are critical in the color and taste of ripened fruit.

The authors point out that SAAT is 16 times more expressed in red strawberries than in green ones. So, this is one of the more differentially expressed genes. However, it is also indicated in the article that some genes were differentially expressed as much as 22.5-fold, yet there was no extensive analysis of these genes. Furthermore, information in the article such as the volatile ester emission during strawberry ripening indicated by gas chromatography merely tells us what is already known, that ripening fruits release and are rich in volatile esters. Much of the rest of the paper is taken up with the specifics of the SAAT sequence, of the SAAT enzymatic activity, etc., losing sight of the valuable information that could have been derived from a proper analysis of the microarray data about ALL of the genes involved in fruit ripening, their precise form of progressive expression, and their coregulation.

Post-Microarray Sequence Analysis - Good

It was laudable, at least, that the authors performed a sequence analysis of the SAAT gene. This sequence analysis did indicate that SAAT is part of the acyltransferase family, but again, this tells us nothing new. We knew that AATs were acyltransferases and that SAAT is an AAT. What interested me in the authors' sequence analysis of SAAT is how confused they were about the terms 'homology,' 'similarity,' and 'identity,' and how little they understood about how largely non-identical protein sequences can be highly related functionally (in terms of their enzymatic activity) and why. I have written a number of comments below that explain how this is possible.

The authors performed a BLAST search on the SAAT cDNA. Although the authors seemed to be making a common confusion between sequence 'homology', 'similarity', and 'identity' (see footnote on this topic), they show that the closest matches to SAAT are only 29% identical. Although the authors think this is a problem, it is really not, as I will explain, and the authors do the right thing anyway by looking for consensus regions or highly conserved motifs between SAAT and the family of acyltransferases which it seems to be a part of. Bioinformatics researchers have shown that sequences that have as little as 25% similarity (therefore even lower identity) may be evolutionarily and functionally related.

So the first useful bit of information the authors provide by doing a sequence analysis of SAAT is that it belongs to a family of acyltransferases, which are enzymes known to be involved in volatile ester formation, the molecules that give fruit its flavor and ripe color.

The researchers take the BEAT gene (benzyl alcohol acetyltransferase, from the flower Clarkia breweri), which came up in the protein BLAST similarity search, but lament that there is only a 19.4% similarity between the amino acids of the two proteins produced by SAAT and BEAT. Again, this is not a problem if you understand the nature of proteins. First of all, in most proteins, about 80% of the amino acids (one by one) could be changed without having any effect on the protein's catalytic activity or function. This is because a protein's catalytic or binding activity or function depends specifically on small sequences of amino acids on the surface of the protein. The large number of amino acids in the interior have no direct effect on binding or catalytic activity, although indirectly they jointly have an effect on the protein's 3D conformation(s), which affect binding or catalytic activity. Furthermore, not even the smaller percentage of amino acids on the protein's surface are all involved in binding or catalytic activity, just small stretches of them, called binding or active sites. Compound this with the fact that chemically similar amino acids can to some extent be substituted for each other (acids with acids, bases with bases, uncharged polars with uncharged polars, etc.), and you can understand why quite non-identical amino acid sequences can have quite similar function or biochemical activity. Without having a full understanding of why low amino acid identity or similarity may have little effect on functional similarity, the authors, surprisingly, still do the right thing. They look for common conserved motifs between the proteins. Here they found 3 conserved motifs between BEAT and SAAT: HXXXD, DFGWG (near the carboxyl terminus), and LSXTLXXXYXXXG (near the amino terminus). These motifs may be common active or binding sites that give the proteins similar catalytic activity.

Additional bioinformatics the authors should have done: 3D Protein Viewing.

In addition, the authors should have verified, with a 3D protein viewer, whether these conserved motifs were in fact on the proteins' surfaces. If so, there would be a stronger case for believing that they accounted for the common catalytic activity; if not, the case would be weaker. Note that many 3D protein viewers allow a Perl-style interface where motifs may be entered as regular expressions, yielding the 3D amino acid location of such motifs in a target protein molecule.
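
The regular expression idea is simple enough to sketch in a few lines of Python. The code below treats each X in the three motifs reported by the authors as "any residue" and returns where each motif starts in a sequence. The sequence used here is a hypothetical toy string that I built only so that it contains the motifs; it is NOT the SAAT or BEAT protein, and a real check would of course also need the 3D coordinates from a structure viewer.

```python
import re

# The three conserved motifs reported for BEAT and SAAT; X means any residue.
MOTIFS = ["HXXXD", "DFGWG", "LSXTLXXXYXXXG"]

def motif_to_regex(motif):
    """Translate a motif into a regular expression where X matches any residue."""
    return re.compile(motif.replace("X", "."))

def find_motifs(sequence, motifs=MOTIFS):
    """Map each motif to the 1-based start positions where it occurs."""
    return {
        motif: [m.start() + 1 for m in motif_to_regex(motif).finditer(sequence)]
        for motif in motifs
    }

# Hypothetical toy sequence built only to contain the motifs; NOT a real protein.
toy_sequence = "MLSATLAAAYAAAGKKHAAADPPDFGWGEE"

hits = find_motifs(toy_sequence)
print(hits)  # {'HXXXD': [17], 'DFGWG': [24], 'LSXTLXXXYXXXG': [2]}
```

The positions returned this way are what one would then look up in the 3D structure to see whether the residues lie on the surface.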

Additional Bioinformatics the authors should have done: sequence analysis of Other highly differentially expressed genes.

Additionally, why did the authors restrict their sequence similarity search to just one gene, SAAT? Why not do a cluster analysis of the microarray data to determine which genes may be coregulated, and look for sequence similarity between the coregulated genes? Note that in some cases there may not be much sequence similarity between coregulated genes, since these genes may represent different enzymatic steps in a biochemical pathway. But where, for example, there are redundant steps in the pathways, these redundant enzymes should be similar or at least have similar conserved surface domains.

Future Potential ‘Wider’ Applicability

The authors’ approach to characterizing developmental processes by their progressive differential gene expression, if done correctly, has a wide application. It could be done in any developmental biological process, including embryogenesis, tissue formation, organogenesis, and aging, to help us start to understand the orchestrated biochemical events that account for these developmental processes.

Footnote:

The differences between 'identity,' 'homology,' and 'similarity' that the authors don't seem to understand.

In sequence searches, 'identity' refers to two sequences or subsequences with exact matches of nucleotides or amino acids. To talk of 'amount of identity' is to talk of the number of matches relative to the number of nucleotides or amino acids in the sequence. 'Similarity' refers to a score which two sequences or subsequences receive for how 'similar' they are. Sequences which are quite 'similar' may be very non-identical, i.e., the two sequences may have few matches, yet be highly similar. This is due to the fact that 'similar' sequences may be created by 'insertions' and 'deletions' of sequences over time. Also, in proteins, different amino acids may be 'similar' due to their 'chemical similarity' or 'evolutionary relationship' even though they do not match (this is evident in the use of the PAM and BLOSUM amino acid substitution matrices in sequence alignment algorithms). Finally, 'homology' specifically implies that two sequences are 'evolutionarily related.' Two homologous sequences can be dissimilar and certainly have low identity, and sometimes two similar sequences are not homologous, as in sequence similarity by chance rather than by evolutionary development.
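
The difference between identity and similarity can be shown with a toy calculation. In the Python sketch below, two short made-up peptide sequences, assumed to be already aligned without gaps, are scored both ways. The four chemical classes are a crude simplification that I chose for illustration; real alignment programs score substitutions with PAM or BLOSUM matrices rather than with fixed classes.

```python
# Simplified chemical classes (an assumption for illustration; real aligners
# score substitutions with matrices such as PAM or BLOSUM instead).
CLASSES = {
    "acidic": "DE",
    "basic": "KRH",
    "polar": "STNQCY",
    "nonpolar": "AVLIMFWPG",
}
CLASS_OF = {aa: name for name, members in CLASSES.items() for aa in members}

def identity_and_similarity(a, b):
    """Return (% identity, % similarity) for two gap-free aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    identical = sum(x == y for x, y in zip(a, b))
    # A position counts as similar if both residues share a chemical class,
    # so every identical position is also a similar one.
    similar = sum(CLASS_OF[x] == CLASS_OF[y] for x, y in zip(a, b))
    return 100 * identical / len(a), 100 * similar / len(a)

# Two made-up ten-residue peptides, not taken from any real protein.
identity, similarity = identity_and_similarity("DKSLAVETGW", "EKTIAVDSKR")
print(identity, similarity)  # 30.0 80.0
```

Here only 3 of 10 positions match exactly, yet 8 of 10 are chemically conservative, which is exactly how a pair of proteins can show low identity and still be highly similar.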
