%% EVENINGELEMENT Example of identification of regulatory sequences %% % This demo belongs to the collection of Case Studies in Computational % Genomics, mostly based on classic papers, and mostly based on the contents % of the book % % Introduction to Computational Genomics, A Case Studies Approach % Cambridge University Press, 2006 % Nello Cristianini and Matthew Hahn % % The demos of the other case studies, pointers to software and papers that % are available on-line can be found on the website: % % www.computational-genomics.net % % This demo is also available on-line at web('http://www.computational-genomics.net/case_studies/eveningelement_demo.html'); % Demo by Elisa Ricci. Thanks to Chi Nguyen. %% Introduction % Arabidopsis thaliana (also known as mouse-ear cress) is a small, weedy % organism that has become the model genetic and genomic study system for % plants. Its genome is approximately 120 Mb long, with five chromosomes and % 29K genes. % In this example we study in detail the circadian clock (i.e. the 24-hour % cycle of the physiological processes) of this plant. It is known that in % plants (but also in humans and other animals) the circadian clock % synchronize itself with the external day-night cycle. In particular in % Arabidopsis each cell keeps track of the day-night cycle indipendently of % all other cells. This is due to the activity of a few key proteins that % allow the plants to turn on and off large groups of genes needed at % different times of day and night. They do so by binding to specific % regulatory sequences called trascription factor binding sites. % Regulatory DNA is the sequence surrounding a gene that specify proper % trascription. It is a mosaic of short sequence motifs and semirandom DNA. % These short motifs are usually found upstream of coding regions but they % can also be found downstream. It is extremely difficult to identify these % due to their short length, to a certain pecentage of variability in the % sequence and because one does not know the motifs and much less the % location. % In this example we investigate some algorithms to find % regulatory sequences for clock-regulated genes in Arabidopsis activated in % the evening. Here we make the simplifying assumption that the motifs we are % looking for is identical in every istance where it is found. %% Load data in the MATLAB workspace % Therefore we can consider only the clock-regulated genes. They are 437 (6% % of the genes on the chip). They have been clustered in three groups according % to the temporal axis in which the experiment was done: the different time % intervals are labeled as 0, 4, 8, 12, 16 and 20 and correspond to every 4 hours % in a a 24 hour day/night cycle starting from 8 AM. % Cluster 1 correspond to genes whose expression peaks in phases 0 and 4, % cluster 2 to phases 8, 12 and cluster 3 to phases 16, 20. % You can find the associated sequences in the website www.computational-genomics.net % and load them directly in the MATLAB workspace, % distinguishing between the clusters, and concatenate them. The data from % which we look for binding sites are those genes of cluster 2 (191), that is % genes highly expressed in the evening. The genes of the others clusters % are used as background. load clockregulated.mat seq1=''; seq2=''; for ind = 1:191 % cluster2 seq1=[seq1 '*' up437_set2(ind).Seq{:}]; end sr1=seqrcomplement(seq1); seq1=[seq1 '*' sr1]; for ind = 1:246 %cluster 1 and 3 seq2=[seq2 '*' up437_others(ind).Seq{:}]; end sr2=seqrcomplement(seq2); seq2=[seq2 '*' sr2]; %% Identifying motifs % For simplicity we decide to look only at motifs of fixed length (9): we consider % all word of length 9 whose frequency in the evening cluster is very % different from its frequency in the rest of the data. Therefore we % examine all the words of length 9 in both sequences seq1 and seq2 % (considering also the reverse complement). Motifs found are scored and % sorted in descendin order by margin (the difference between their frequency % in cluster 2 and that in cluster 1-3). The top 10 9-mers are computed and % shown. L=9; %Motif length numMotifs=10*2; %10 highest scores elements and their reverse complements [nmers,buckets,buc1,buc2] = diff_nmers_sign(seq1,seq2,L,numMotifs) %% % The obtained set of motifs contains a lot of repeats (either of single % letters or of 2-mers). They likely have no biological significance and they % must be filtered out. [anmers,abuckets,abuc1,abuc2]=repeat_filter(nmers,buckets,buc1,buc2) %% % After eliminating the repeating element, we can observe that the most significant % element is the motif AAAATATCT. We known from the study of Harmer et al. % that it corresponds to the evening element (word of 9 bases found upstream % of genes turned on un the evening). Its margin is 0.00014. % We notice that 2 of the other 3 top motifs are simply variants of the evening % element (AAATATCTT and AAAAATATC). % To assess the significance of the value found for the margin of the % evening element we perform 100 random splits of the data and measure the % margin of the highest-scoring element. num_trial=100; margin=zeros(1,num_trial); for i=1:num_trial seq1r=''; seq2r=''; indran=randperm(437); for ind = 1:191 % cluster2 seq1r=[seq1r '*' up437(indran(ind)).Seq{:}]; end sr1=seqrcomplement(seq1r); seq1r=[seq1r '*' sr1]; for ind = 192:437 % cluster2 seq2r=[seq2r '*' up437(indran(ind)).Seq{:}]; end sr2=seqrcomplement(seq2r); seq2r=[seq2r '*' sr2]; [nmers,b,b1,b2] = diff_nmers_sign(seq1r,seq2r,L,numMotifs); [anmers,ab,ab1,ab2]=repeat_filter(nmers,b,b1,b2); if ab margin(i)=ab(1); end end find(margin>abuckets(1)) %% % In 100 trials we never observe a margin larger than 0.000147462. %% % We can look in detail at the frequency of the evening element among all % the clock regulated genes. The arrays EEcount and Ngenes show that not all the genes % of the second cluster have the evening element, nor this motif is limited % only to these genes. CircadianTime=0:4:20; EEcount=zeros(length(CircadianTime),1); Ngenes=zeros(length(CircadianTime),1); for ind=1:437 [indices]= strfind(up437(ind).Seq{:}, 'AAAATATCT'); [indices1]= strfind(seqrcomplement(up437(ind).Seq{:}), 'AAAATATCT'); ind1=(up437(ind).clusNum)/4+1; Ngenes(ind1)=Ngenes(ind1)+1; EEcount(ind1)=EEcount(ind1)+length(indices)+length(indices1); end Ngenes EEcount %% References % % S. L. Harmer, J. B. Hogenesch, M. Straume, C. H.S. Chang, B. Han, Z.Tong % X. Wang, J. A. Kreps, and S. A. Kay, *Orchestrated Transcription of Key % Pathways in Arabidopsis by the Circadian Clock*, Science, Vol. 290. no. 5499, % pp. 2110 - 2113, Dec 2000.