%% SARS Example of phylogenetic analysis %% % This demo belongs to the collection of Case Studies in Computational % Genomics, mostly based on classic papers, and mostly based on the contents % of the book % % Introduction to Computational Genomics, A Case Studies Approach % Cambridge University Press, 2006 % Nello Cristianini and Matthew Hahn % % The demos of the other case studies, pointers to software and papers that % are available on-line can be found on the website: % % www.computational-genomics.net % % This demo is also available on-line at web('http://www.computational-genomics.net/case_studies/sars_demo.html'); % Demo by Elisa Ricci. %% Introduction % SARS (Severe Acute Respiratory Syndrome) was an illness that first appeared % in the second half of 2002 in Guangdong Province (China). The disease is % now known to be caused by the SARS coronavirus (SARS CoV), a novel % coronavirus. SARS was first reported in Asia (at the Vietnam French Hospital % of Hanoi) in February 2003. Over the next few months, the illness spread % to several countries all over the world. % The origin and diffusion of SARS epidemic is studied in this example. % The genome SARS-CoV was completed in April 2003 by a Canadian group of % researchers. It is a 29571 base long single stranded RNA sequence. You can % download it directly in .mat format from the website % www.computational-genomics.net and load it into the % MATLAB workspace load sars_can %% % If you have a live web connection, you can obtain it from the Genbank database % (accession number AY274119.3): %sars_can=getgenbank('AY274119.3','sequenceonly','true'); %% Finding ORFs % The SARS genome has the typical structure of coronaviruses, with 5 or 6 genes % in a caracteristic order. The associated webpage gives some useful information % about those genes. web('http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=30248028'); %% % As observed in the HIV example, the detection of gene in case of viruses is % complicated due to overlapping ORFs and other problem. % The simplest algorithm to find ORFs in this sequence consists in selecting % the statistically significant ORFs between all those found by the % function *seqorfs*. This algorithm can identify most of the genes. [orf n]=seqorfs(sars_can,'MINIMUMLENGTH',1); [orfr nr]=seqorfs(sars_can(randperm(length(sars_can))),'MINIMUMLENGTH',1); ORFLengthr=[]; for i=1:6 ORFLengthr=[ORFLengthr; orfr(i).Length']; end empirical_threshold=prctile(ORFLengthr,95) [orf n]=seqorfs(sars_can,'MINIMUMLENGTH',empirical_threshold/3); n %% % As example we can observe in detail the ORF of the 3rd frame. Let us focus % our attention on the sequence between (21492-25257). It coded for the Spike % protein that is one of the characteristic gene of coronaviruses. Our % following analysis is based on that protein. SARS_ORF=[orf(3).Start' orf(3).Stop' orf(3).Length' orf(3).Frame'] %% Reconstructing the Epidemic % During SARS epidemic in 2003 many groups isolated and published the % sequence of SARS. Now most of these sequences are available at the Genbank database % and can be used to answer many questions about the origin and the nature % of epidemic performing genomic sequence analysis. %% Identifying the Host % A first question than we can answer about SARS epidemic is related to its % origin. The SARS virus was recognized early on as a coronavirus, having % the same genes in the same order as other known coronaviruses. However it % was very different from any other human coronavirus and hence its origin was % likely to be from another animal. Here we compare various animal % coronaviruses (including the palm civet). % The associated protein sequences can be obtained directly in .mat format % from the website www.computational-genomics.net and loaded into the % MATLAB workspace: load CoV %% % If you have a live web connection the associated protein % sequences can be also downloaded from the Genbank database with the MATLAB function % *getgenpept*. % data = {'Bovine CoV1' 'AAL57308.1'; % 'Bovine CoV2' 'AAK83356.1'; % 'Human CoV OC43' 'NP_937950.1'; % 'Porcine HEV3' 'AAL80031'; % 'Murine HV1' 'YP_209233.1'; % 'Murine HV2' 'NP_045300.1'; % 'IBV3' 'NP_040831.1'; % 'Porcine PEDV' 'NP_598310.1'; % 'Canine CoV1' 'S41453'; % 'Feline CoV4' 'BAA06805'; % 'Human Sars CoV' 'AAP41037.1'; % 'Palm civet' 'AAV49723'; % }; % % % NumCoV=length(data); % % for ind = 1:NumCoV % CoV(ind).Header = data{ind,1}; % CoV(ind).Sequence = getgenpept(data{ind,2},'sequenceonly','true'); % end %% % Note that the human Sars CoV correspond to the spike protein previously % determined with our ORF finding algorithm. %% % We can consider the pairwise distance between sequences with Jukes-Cantor correction % and perform the neighbor joining algorithm. The MATLAB function *seqpdist* % can be used for that. Note that this computation is a very time consuming % task and can take some minutes. distances = seqpdist(CoV,'method','jukes-cantor','indels','pairwise-delete','squareform',true); NJtree = seqneighjoin(distances,'equivar',CoV); h = plot(NJtree,'orient','left'); title('Neighbor-joining tree using Jukes-Cantor model'); %% % The plot shows the phylogenetic tree. It clearly shows that human SARS is most % closely related to Hymalaian palm civet coronavirus and is quite distant % from other human coronavirus. %% The epidemic tree % We can perform phylogenetic analysis also to reconstruct the diffusion of % epidemic between humans. % We have chosen 13 sequences for which the date and the location of the % sample is known and we used them to perform phylogenetic analysis. % They can be obtained from the GenBank database. sars_data = { 'GZ01' 'AY278489' 'DEC-16-2002' 'Guangzhou (Guangdong)' ; 'ZS-A' 'AY394997' 'DEC-26-2002' 'Zhongshan (Guangdong)' ; 'ZS-C' 'AY395004' 'JAN-04-2003' 'Zhongshan (Guangdong)' ; 'GZ-B' 'AY394978' 'JAN-24-2003' 'Guangzhou (Guangdong)' ; 'HZS-2A' 'AY394983' 'JAN-31-2003' 'Guangzhou Hospital' ; 'GZ-50' 'AY304495' 'FEB-18-2002' 'Guangzhou (Guangdong)' ; 'CUHK-W1' 'AY278554' 'FEB-21-2003' 'Hong Kong' ; 'Urbani' 'AY278741' 'FEB-26-2003' 'Hanoi' ; 'Tor 2' 'AY274119' 'FEB-27-2003' 'Toronto' ; 'Sin2500' 'AY283794' 'MAR-01-2003' 'Singapore' ; 'TW1' 'AY291451' 'MAR-08-2003' 'Taiwan' ; 'CUHK-AG01' 'AY345986' 'MAR-19-2003' 'Hong Kong' ; 'CUHK-L' 'AY394999' 'MAY-15-2003' 'Hong Kong' ; }; NumSars=size(sars_data,1); %% % Here we perform our analysis using only the sequences of the spike % protein. You can extract them easily from the complete sequences obtained % from the Genbank database. Here we show how to do for the first sequence. sars= getgenbank(sars_data{1,2}); temp_seq = sars(1).Sequence; CDSs = sars(1).CDS(3); human_spike(1).Header = CDSs.product; human_spike(1).Sequence = temp_seq(CDSs(1).indices(1):CDSs(1).indices(2)); %% % For sake of simplicity we have collected them in a FASTA file available in % the website www.computational-genomics.net. You can % easily read it with the MATLAB function *fastaread* human_spike=fastaread('sars_spike.fasta'); palmcivet_spike=fastaread('spike_palmcivet.fasta'); all_spike=human_spike; all_spike(NumSars+1).Header=palmcivet_spike.Header; all_spike(NumSars+1).Sequence=palmcivet_spike.Sequence; %% % If you prefer, you can also load the data from a % MAT-file using the command % load sars_data % <== Uncomment this if no Internet connection %% % A neighbor joining tree of the Spike protein can be constructed. The distance % matrix can be obtained with the Jukes-Cantor corrections on genetic distance % calculated from global alignment of the Spike nucleotide sequences. distances = seqpdist(all_spike,'method','jukes-cantor','alphabet','NT','indels','pairwise-delete','squareform',true); %% % The distance matrix can be also plotted. figure imagesc(distances); NJtree = seqneighjoin(distances,'equivar',all_spike); h = plot(NJtree,'orient','left'); title('Neighbor-joining tree using Jukes-Cantor model'); %% % The tree depicts the story of the epidemic. The early infections occurred in the % Guangdong province and at the Hotel Metropole: their sequences in fact are % very similar. Examining all the branches of the tree we must reconstruct the % story of the diffusion of SARS epidemic. %% Area of origin % Even if it is known that the first cases of the SARS epidemic appear in the % Guangdong province we can study the area of origin of it. The statistical % method of multidimensional scaling can be used for that. As previously % said in the phylogentic analysis, the area of origin of the virus is in % the Guangdong province. distances = seqpdist(human_spike,'method','jukes-cantor','alphabet','NT','indels','pairwise-delete','squareform',true); [Y,eigvals] = cmdscale(distances); figure plot(Y(:,1),Y(:,2),'*'); for i=1:NumSars text(Y(i,1)+.00001,Y(i,2),human_spike(i).Header); end title('Multidimensional scaling - Genetic distance between SARS spike genes'); %% Date of origin % Another question that we can try to answer is related to the date of origin of the % epidemic. Knowing the date of each sequence, it is interesting to observe % the progression of mutations over time. To do that we first consider the % pairwise distance between sequences. Here the Kimura model is used instead of the % Jukes-Cantor correction. distance_Kimura=seqpdist(all_spike,'method','Kimura','squareform',true,'alphabet','NT','indels','pairwise-delete'); for i=1:13 score(i)=distance_Kimura(14,i); end date=[-16 -6 4 24 31 49 52 57 58 60 67 78 135]; plot(date,score,'k*'); hold on; [P,S] = polyfit(date,score,1); x=[-150:.1:150]; [y,delta]= polyconf(P,x,S,0.05); plot(x,y,'b-'); hold on; plot(x,y+delta,'r-'); hold on; plot(x,y-delta,'r-'); z=zeros(1,length(x)); hold on; plot(x,z,'k-'); %% % Relative to the sequence of the palm civet, we see that the genetic distance % increases with time in a roughtly linear manner. Interpolating in a least-square % manner these data, one can approximate date for the origin of the epidemic. % Any date compatible with zero distance from the palm civet is a plausible % start for the epidemic. We estimate an interval centered around 16 september % 2002. The 95% confidence interval are also shown in the plot. Despite its % simplicity the method seems to produce correct results because the % earliest reported infections dated in the second half of 2002.