%% HIV Example of quantitative analysis of natural selection. %% % This demo belongs to the collection of Case Studies in Computational % Genomics, mostly based on classic papers, and mostly based on the contents % of the book % % Introduction to Computational Genomics, A Case Studies Approach % Cambridge University Press, 2006 % Nello Cristianini and Matthew Hahn % % The demos of the other case studies, pointers to software and papers that % are available on-line can be found on the website: % % www.computational-genomics.net % % This demo is also available on-line at web('http://www.computational-genomics.net/case_studies/hiv_demo.html'); % Demo by Elisa Ricci. %% Introduction % In 1983 the infectious agent resposible of the well-known disease of % Acquired Immmune Deficiency Syndrome (AIDS) was identified. It was called % Human Immunodeficiency Virus (HIV). Despite this discovery of fundamental % importance, at the present there is no cure for this disease and no effective % vaccine against HIV infection. The main difficulty is that our immune system % as well as any drugs cannot deal with the inner nature of this virus that % evolves constantly and rapidly. % The genome of HIV has been sequenced hundreds of time since the eighties, so % it is possible to study the differences between many individual genomes. % This can help us to gain a general understanding of how virus evolves. % In particular there are specific regions of its proteins that are recognized % and attacked by our immune system: these regions mostly show the % signature of adaptive evolution of the changing virus. Other regions % instead remain invariant having important biological functions and not % being involved in interactions with the immune system. % In this example a genome wide analysis of natural selection in HIV is % performed. %% ORF Finding % The HIV genome has nine ORFs but 15 proteins are made in all (some genes % are translated into large polyproteins which are then cleaved by a % virus-encoded protease into smaller proteins). Some ORFs overlap sligthly % and there are also some introns. All this factor complicate the detection of % genes. % Let's consider for example a sequence of an HIV genome. It can be downloaded % from the GenBank database with the MATLAB function *getgenbank*. hiv=getgenbank('NC_001802','SequenceOnly',true); %% % You can also download directly the .mat file from the website % www.computational-genomics.net. %load hiv %% % The simplest algorithm to find ORFs in this sequence consists in selecting the % statistically significant ORFs between all those found by the function *seqorfs* [orf n]=seqorfs(hiv,'MINIMUMLENGTH',1); %% % Following this approach a single-nucleotide permutation test on the sequence % is performed to determine a threshold of minimum length for an ORF to be % considered gene. The threshold is chosen in order to keep as candidate genes % all ORFs of length equal to or greater than the top 5% of random ORFs. [orfr nr]=seqorfs(hiv(randperm(length(hiv))),'MINIMUMLENGTH',1); ORFLengthr=[]; for i=1:6 ORFLengthr=[ORFLengthr; orfr(i).Length']; end empirical_threshold=prctile(ORFLengthr,95) [orf n]=seqorfs(hiv,'MINIMUMLENGTH',empirical_threshold/3); %% % The informations associated with the detected ORFs can be saved in an array % and shown for all the reading frames of the positive strand. HIV_ORF=[]; for i=1:3 HIV_ORF=[HIV_ORF; orf(i).Start' orf(i).Stop' orf(i).Length' orf(i).Frame']; end HIV_ORF %% % Note that there are more genes in the HIV genome, but some protein-coding % regions are missed because of the semplicity of the algorithm. Hidden % Markov Model should be used for that. %% Ka/Ks % The Ka/Ks ratio is a measure, introduced by the Japanese Kimura, of % natural selection, comparing the number of non-synonimous to synonymous % substitutios in a gene. Here we compute the Ka/Ks ratio for the genes of % two different sequences of HIV-1 (Genbank accession % numbers AF033819 and M27323). hiv1_all=getgenbank('AF033819'); hiv1=hiv1_all.Sequence; hiv2_all=getgenbank('M27323'); hiv2=hiv2_all.Sequence; %% % As before, you can also download directly the .mat file from the website % www.computational-genomics.net. %load hivcomparison %% % We must compare corresponding protein sequence. Therefore is necessary to % individuate corresponding regions in the sequence. % The field Features of the structures can be used for that purpose. % GAG POL VIF VPX/VPU VPR TAT REV ENV NEF % seq1 1 2 3 7 4 5 6 8 9 % seq2 1 2 3 7 4 5 6 8 9 ind1=[1 2 3 7 4 5 6 8 9]; ind2=[1 2 3 7 4 5 6 8 9]; for ind=1:9 temp_seq =hiv1; CDSs = hiv1_all.CDS; seq1= temp_seq(CDSs(ind1(ind)).indices(1):CDSs(ind1(ind)).indices(2)); g1(ind).Sequence = seq1(str2num(CDSs(ind1(ind)).codon_start):end); temp_seq = hiv2; CDSs = hiv2_all.CDS; seq2= temp_seq(CDSs(ind2(ind)).indices(1):CDSs(ind2(ind)).indices(2)); g2(ind).Sequence = seq2(str2num(CDSs(ind2(ind)).codon_start):end); end %% % After aligning the corresponding ORFs, the Ka/Ks ratio can be calculated. for i=1:9 [sc,alig]= nwalign(nt2aa(g1(i).Sequence),nt2aa(g2(i).Sequence)); seq1 = seqinsertgaps(g1(i).Sequence,alig(1,:)); seq2 = seqinsertgaps(g2(i).Sequence,alig(3,:)); %Calculate synonymous and nonsynonymous substitution rates. [dn,ds] = dnds(seq1,seq2); KaKs(i)=dn/ds; end KaKs %% The ENV gene % For the analysis of single genes is convenient to calculate the Ka/Ks ratio % locally (using for example a sliding window) instead of averaging its value % over the entire gene. % For example, in this case we study the ENV gene. This gene codes % for the envelope glycoprotein gp160, which is a precursor to two % proteins: gp41 and gp120. In particular, the last one is responsible of two % important processes: to maintain continously recognition of host cells and % simultaneously to avoid detection by the immune system. These roles are % carried out by different parts of the gp120 protein, therefore measuring % the global Ka/Ks ratio obscures both processes. It is necessary to % measure the effect of selection on smaller region of the gene, by % computing Ka/Ks in sliding windows. env1 = g1(8).Sequence; env2 = g2(8).Sequence; [sc,alig] = nwalign(nt2aa(env1),nt2aa(env2)); seq1 = seqinsertgaps(env1,alig(1,:)); seq2 = seqinsertgaps(env2,alig(3,:)); S=[seq1; seq2]; %% % The following plots the local KaKs ratio with a sliding window of 60 % codons. The blue line shows the Ks values, the red one the Ka values and % the green the ratio KaKs. KaKs_ratio_WL(S,60); title('Sliding window analysis - ENV protein'); %% % As expected the rate of non-synonymous substitution is greater than the rate % of synonimous ones. These region are crearly recognized by the human % immune system. There are also many regions with Ka/Ks less than one: they % corresponds to regions necessary to the virus to recognize the receptors % of its host cells. % From the plot it is also evident that the average ratio for the entire % gene is less than one. As a confirm we can calculate it. [dn ds vardn vards] = dnds(seq1, seq2); KaKs=dn/ds %% GAG Polyprotein % The same analysis of above can be repeated for the GAG polyprotein. GAG % is another genes common to all the retroviruses. It contains around 1500 % nucleotides and encodes four different proteins wich form the building % block for the viral core. This protein contain additional epitopes % recognized by the human himmune system, and hence it is under constant % pressure to escape by evolving. gag1 = g1(1).Sequence; gag2 = g2(1).Sequence; [sc,alig] = nwalign(nt2aa(gag1),nt2aa(gag2)); seq1 = seqinsertgaps(gag1,alig(1,:)); seq2 = seqinsertgaps(gag2,alig(3,:)); S=[seq1; seq2]; KaKs_ratio_WL(S,60); title('Sliding window analysis - GAG protein');