%% CELLCYCLE Example of gene expression profile analysis %% % This demo belongs to the collection of Case Studies in Computational % Genomics, mostly based on classic papers, and mostly based on the contents % of the book % % Introduction to Computational Genomics, A Case Studies Approach % Cambridge University Press, 2006 % Nello Cristianini and Matthew Hahn % % The demos of the other case studies, pointers to software and papers that % are available on-line can be found on the website: % % www.computational-genomics.net % % This demo is also available on-line at web('http://www.computational-genomics.net/case_studies/cellcycle_demo.html'); % Demo by Elisa Ricci. Thanks to Chi Nguyen. %% Introduction % The yeast (Saccharomyces cerevisiae) is a unicellular fungus found naturally % in grapevines and responsible of wine-making fermenting sugars and producing % alchool. In this example we show some methods used in gene expression % analysis for the study of its general cell cycle. % From being budded off from its parent cell, to reproducing its own offspring, % each yeast go through a number of typical step that also involve changes in % gene expression, turning whole pathways on and off. % Today the study of such phenomena is possible through the technology of % microarray that can measure the expression level of every gene in a cell. % With the gene expression data, genes can be clustered on the basis of the % similarity of their expression profiles. % Here we examine the expressions of the entire yeast genome through two % rounds of the cell cycle. The temporal expression of genes are measured % by microarray at 24 time points every five hours. In detail we have the % expression profile of about 6000 genes. % The data set can be obtained from page of the project at Stanford: web('http://cellcycle-www.stanford.edu/'); %% % You can download directly the data in .mat format from the website % www.computational-genomics.net and load them into the % MATLAB workspace load cellcycle.mat %% % You can verify the real number of genes in the data set. The % cell array 'ccgene' contains the name of all genes. There are 6178 genes. numel(ccgene) %% % Let us analyze for example the first gene. ccgene{1} %% % A plot shows the expression profile for this ORF plot(time, ccvalues(1,:)) xlabel('Time (Hours)'); ylabel('Log2 Relative Expression Level'); %% Missing data % As frequently happens in this type of analysis, the original matrices used % in that type of experiments have various entries with missing values. It is % due to error of measurement or in the construction of the array. Here due to % the high number of gene we simply remove those with missing values and % those with low variability. nanInd = any(isnan(ccvalues),2); ccvalues(nanInd,:) = []; ccgene(nanInd) = []; numel(ccgene) %% [mask, ccvalues, ccgene] = genelowvalfilter(ccvalues,ccgene,'absval',log2(3)); numel(ccgene) %% % We use also the *geneentropyfilter* to remove genes whose profiles have % low entropy: [mask, ccvalues, ccgene] = geneentropyfilter(ccvalues,ccgene,'prctile',15); numel(ccgene) %% Cluster analysis % In order to identify genes that vary within the cell cycle, we can cluster genes % with the K-means clustering algorithm and select those clusters which have a % temporal expression profile that goes up and down. We decide to use the K-means % clustering algorithm. As distance measure between the data points we consider % d=1-corr, where corr indicates the sample correlation between two data points. % We set the number of clusters equal to 16 (arbitrary choice); totalcluster = 16; [clusters, ctrs] = kmeans(ccvalues, totalcluster, 'dist','corr', 'rep',5,... 'disp','final'); %% % Here in each plot the gene profiles in the same cluster are plotted % together. figure for c = 1:totalcluster subplot(4,4,c); plot(time,ccvalues((clusters == c),:)'); axis tight end suptitle('K-Means Clustering of Profiles'); %% % Then only the centroid profiles are plotted for each cluster so that the % expression pattern can be examined more clearly. figure for c = 1:totalcluster subplot(4,4,c); plot(time,ctrs(c,:)'); axis tight axis off % turn off the axis end suptitle('K-Means Clustering of Centroid Profiles'); %% % The last plot can be used to determine if a cluster contains periodic genes % by comparing the center to periodic functions such as cosines of various % frequencies and phases. We can check the degree of periodicity of each % cluster centre PER=zeros(totalcluster,10); for c = 1:totalcluster for phase=1:10 PER(c,phase)=(cos(time/0.9+phase)*ctrs(c,:)'); end end [r,c]=find(PER>1.2); %tuned threshold periodic_clusters=unique(r) %% % We outputed the clusters that look most periodic. To do that some % arbitrary choice have been made: the period of the sinewave is arbitrarily % fixed to 0.9, the phase is changed, the threshold for a signal to be % considered periodic have been fixed to 1.2. % We also identify all periodic genes. periodic_genes=0; for c = 1:size(periodic_clusters) clu=find(clusters==c); periodic_genes=union(periodic_genes,clu); end periodic_genes=setdiff(periodic_genes,0); ccgene(periodic_genes) %% References % % P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, % P.O. Brown, D. Botstein, B. Futcher, *Comprehensive Identification of Cell % Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray % Hybridization*. Molecular Biology of the Cell 9, 3273-3297, 1998.