%% JUKESCANTOR Example of Jukes-Cantor model of a sequence evolution %% % This demo belongs to the collection of Case Studies in Computational % Genomics, mostly based on classic papers, and mostly based on the contents % of the book % % Introduction to Computational Genomics, A Case Studies Approach % Cambridge University Press, 2006 % Nello Cristianini and Matthew Hahn % % The demos of the other case studies, pointers to software and papers that % are available on-line can be found on the website: % % www.computational-genomics.net % % This demo is also available on-line at web('http://www.computational-genomics.net/case_studies/juckescantor_demo.html'); % Demo by Elisa Ricci. %% Introduction % One of the crucial tasks in computational genomics is represented by the % estimation of the genetic distance between two homologous sequences, that % is the number of of substitutions that have accumulated between them since % they diverged from their common ancestor. % The problem is not easy since it means not simply count the number of % position at which the two sequence differ. This underestimate the true % genetic distance due to multiple substitutions occurred at the same site. % To solve this problem, a common approach is to correct the observed genetic % distance between two sequences by using a probabilistic model. % Several models have been developed in the past. The simplest (the well known % Jukes-Cantor model) assumes that each substitution has the same probability. % In this demo we show the effect of the JC correction. %% Observed genetic distance % Starting with a random sequence of length 1000, you must perform one % random substitution at each of the 2000 epochs. L=1000; M=2000; %% % In order to get statistics you may average all over 10 experiments. R=10; %% % A random substitution is performed for each experiment every epoch and % the difference between the current sequence and the original one are % calculated. for n=1:R seq=randseq(L); seq1=seq; for i=1:M j=floor(rand*(L-1))+1; seq1(j)=randseq(1); v(i,n)=L-length(find((seq-seq1)==0)); %OBSERVED differences end end %% % Now you can perform some statistics: the mean and the standard deviation % of the differences are calculated. mm=mean(v'); ss=std(v'); %% Jukes Cantor correction % You must apply the JC correction to the observed differences between the % two sequences. An estimate of the genetic distance and of its standard % deviation is obtained. p=mm/L; E=sqrt(p.*(1-p)./(L*((1-(4/3).*p).^2))); d = -3/4 * log(1 - p * 4/3); %% % The error bars for the two cases are computed. figure; subplot(2,1,1), errorbar(mm,ss) title('Simulating the Effect of Multiple Substitutions') xlabel('True Number of Substitutions'); ylabel('Observed Num. of Substitutions') subplot(2,1,2), errorbar(d*L,E) xlabel('True Number of Substitutions'); ylabel('Estimated Num. of Substitutions') %% % From the plot you may notice that after 2000 substitutions, less than % half are actually observable. Moreover applying the JC correction it % turns out an almost linear relation between mean and std deviation. This % is due to the fact that the JC model consider equal probability of % substitution among all symbols.