%% MAMMOTH Example of sequence comparison through genetic distance %% % This demo belongs to the collection of Case Studies in Computational % Genomics, mostly based on classic papers, and mostly based on the contents % of the book % % Introduction to Computational Genomics, A Case Studies Approach % Cambridge University Press, 2006 % Nello Cristianini and Matthew Hahn % % The demos of the other case studies, pointers to software and papers that % are available on-line can be found on the website: % % www.computational-genomics.net % % This demo is also available on-line at web('http://www.computational-genomics.net/case_studies/mammoth_demo.html'); % Demo by Elisa Ricci. Thanks to Meng Mao. %% Introduction % As shown in the case of the human example, many question about the % origin and the evolution of humans can be answered with the analysis and the % comparison of mithocondrial dna sequences. Here we examine the relationship % between the ancient wooly mammoth and the African and Asian elephants. % Wooly mammoth is an extinct specie which lived in the Ice Age. The African % and Asian elephants are the closest relatives of these Ice Age giants. % The phylogenetic relationship of the mammoth to the African and Asian elephants % has been assessed only recently, thanks to the reconstruction of several complete % mitochondrial genome sequences of wooly mammoth by use of multiplex polymerase % chain reaction (PCR). % A standard procedure in such analysis consists in considering the mitochondrial % sequence, removing the hypervariable regions and comparing the remaining % part. In this way we expect to discard the noisiest part and to have sufficient % variation in the coding part of the genome. % The genome sequence of wooly mammoth and modern African and Asian % elephants can be obtained from the NCBI website. Between all possible DNA sequences % available we select just three and we collect them in a .mat file available in % the website www.computational-genomics.net. load mammoth %% % You can also download the sequences from GenBank with the MATLAB function % *getgenbank* % mammoth=getgenbank('DQ188829','sequenceonly','true'); % african=getgenbank('AJ224821','sequenceonly','true'); % asian=getgenbank('NC_005129','sequenceonly','true'); %% Statistical analysis % Before performing phylogenetic analysis, let us do some simple statistics % of the sequences to see, if possible, some meaningful differences between % them. % First we can count the nucleotides and plot their density. basecount(mammoth,'Chart', 'Pie'); title('Nucleotide composition') figure ntdensity(mammoth); %% figure basecount(african,'Chart', 'Pie'); title('Nucleotide composition') figure ntdensity(african) %% figure basecount(asian,'Chart', 'Pie'); title('Nucleotide composition') figure ntdensity(asian) %% % There is no significant difference between the three animals. %% Analysing ORFs % We perform ORF analysis to examine all ORFs of the sequences and to % determine the minimum reliable ORF length by permutation testing. We assume % significant ORFs those with length equal to or greater than the top 5% of % random ORFs. mrnd=mammoth(randperm(length(mammoth))); orf=seqorfs(mrnd,'MINIMUMLENGTH',1); ORFLength=[]; for i=1:3 frameStop=orf(i).Stop; frameStart=orf(i).Start; frameStart=frameStart(1:length(frameStop)); ORFLength=[ORFLength; (frameStop-frameStart)']; end empirical_threshold=prctile(ORFLength,95) orf=seqshoworfs(mrnd,'MINIMUMLENGTH',empirical_threshold/3); %% afrnd=african(randperm(length(african))); orf=seqorfs(afrnd,'MINIMUMLENGTH',1); ORFLength=[]; for i=1:3 frameStop=orf(i).Stop; frameStart=orf(i).Start; frameStart=frameStart(1:length(frameStop)); ORFLength=[ORFLength; (frameStop-frameStart)']; end empirical_threshold=prctile(ORFLength,95) orf=seqshoworfs(mrnd,'MINIMUMLENGTH',empirical_threshold/3); %% asrnd=asian(randperm(length(asian))); orf=seqorfs(asrnd,'MINIMUMLENGTH',1); ORFLength=[]; for i=1:3 frameStop=orf(i).Stop; frameStart=orf(i).Start; frameStart=frameStart(1:length(frameStop)); ORFLength=[ORFLength; (frameStop-frameStart)']; end empirical_threshold=prctile(ORFLength,95) orf=seqshoworfs(mrnd,'MINIMUMLENGTH',empirical_threshold/3); %% % As before, these results shows the similarity between the three species. %% Phylogenetic analysis % To perform phylogentic analysis we decide to exclude the Dloop from the sequences as % explained above. The obtained sequences are used as input to the MATLAB % function *seqpdist* elephants(1).Sequence=mammoth(1:15421); elephants(1).Header='Wooly mammoth'; elephants(2).Sequence=african(1:15417); elephants(2).Header='African'; elephants(3).Sequence=asian(1:15419); elephants(3).Header='Asian'; %% % We compute the distance matrix with the Jukes-Cantor correction. Note % that the distance computations in the following are very time consuming % and could run out of memory. Therefore the distance value are also given %distances = seqpdist(elephants,'Method','Jukes-Cantor','alphabet','NT'); distances=[0.0517 0.0474 0.0503]; tree2= seqlinkage(distances,'UPGMA',elephants); plot(tree2,'orient','bottom'); title('UPGMA tree') ylabel('Genetic Distance') NJtree = seqneighjoin(distances,'equivar',elephants); plot(NJtree,'orient','bottom'); title('Neighbor joining tree') ylabel('Genetic Distance') %% Global alignment % We perform also global alignment between each pair of the two elephants % to see if some useful innformation can be extracted from this. As above, % due to the length of the sequence, the global alignment could run out % of memory. % [s1, a1]=nwalign(mammoth, african); % [s2, a2]=nwalign(mammoth, asian); % [s3, a3]=nwalign(african, asian); %% % Observing the score of the alignments (s1=38792, s2=39065, s3=39085) % we see that none is significantly greater that other pair of sequences. % Therefore only phylogenetic analysis seems to be useful for investigating % the relationship between the mammoth and the modern elephants. %% Phylogenetic tree of primates % To aim of comparison, we plot the phylogenetic tree also for human, chimpanzee and gorilla. % The dna sequences can be download from the GenBank database with the command *genbank* human=getgenbank('DQ301816','sequenceonly','true'); chimpanzee=getgenbank('X93335','sequenceonly','true'); gorilla=getgenbank('NC_001645','sequenceonly','true'); %% % They can also be obtained in a .mat file from the website % www.computational-genomics.net. %load primates %% % As before we eliminate the D-loop sequences, which have high mutation % rate, from the complete sequences when computing Jukes-Cantor distance from % each pair of mtDNA sequences. primates(1).Sequence=human(579:16027); primates(1).Header='human'; primates(2).Sequence=chimpanzee(1:15448); primates(2).Header='chimpanzee'; primates(3).Sequence=gorilla(1:15446); primates(3).Header='gorilla'; %distances2 = seqpdist(primates,'Method','Jukes-Cantor','alphabet','NT'); distances2=[0.0906 0.1135 0.1084]; tree2= seqlinkage(distances2,'UPGMA',primates); plot(tree2,'orient','bottom'); title('UPGMA tree') ylabel('Genetic Distance') NJtree = seqneighjoin(distances2,'equivar',primates); plot(NJtree,'orient','bottom'); ylabel('Genetic Distance') title('Neighbor joining tree') %% % From the comparison between the two trees we can observe that the length of % the internal branch for the elphants tree is shorter than that of the % primates tree. It means that the time for the divergence between the % mammoth and the Asian elephant has been shorter than that of the divergence % between human and chimpanzee. %% References % % J. Krause et al, *Multiplex amplification of the mammoth mitochondrial % genome and the evolution of Elephantidae*. Nature, Vol 439, 2006.