%% HAEMOPHIILUS Example of sequence statistics and gene finding with MATLAB % This demonstration looks at some statistics about the DNA content of the % Haemophilus Influenzae. It shows also some algorithms for finding ORFs % and for assessing their significance. %% % This demo belongs to the collection of Case Studies in Computational % Genomics, mostly based on classic papers, and mostly based on the contents % of the book % % Introduction to Computational Genomics, A Case Studies Approach % Cambridge University Press, 2006 % Nello Cristianini and Matthew Hahn % % The demos of the other case studies, pointers to software and papers that % are available on-line can be found on the website: web('http://www.computational-genomics.net/'); % Demo by Elisa Ricci. %% Introduction % Haemophilus influenzae is a non-motile Gram-negative coccobacillus first % described in 1892 by Dr. Robert Pfeiffer during an influenza pandemic. % It is generally aerobic, but can grow as a facultative anaerobe. It is % responsible for a wide range of clinical diseases. % The Haemophilus influenzae genome was the first to be sequenced and % assembled in a free-living organism. It contains about 1.8 million base % pairs and is estimated to have 1,740 genes. %% Accessing NCBI data from the MATLAB workspace % The dna sequence can be obtained from the GenBank database with the % accession number NC_000907. Using the *getgenbank* function with the % 'SequenceOnly' flag just the nucleotide sequence is loaded into the % MATLAB workspace. Hflu = getgenbank('NC_000907','SequenceOnly',true); %% % The sequence can be also downloaded from the website www.computational-genomics.net % Then you can load the data from a .mat file using the command load Hflu %% % The MATLAB *whos* command gives information about the size of the % sequence. whos Hflu %% %The total length of the Haemophilus Influenzae genome is 1830138 bp. %% Statistical sequence analysis % You will examine some statistics properties of the sequence. You can look % at the composition of the nucleotides with the *basecount* function. bases=basecount(Hflu) %% % The four nucleotides are not used at equal frequency across the genome: A % and T are much common than G and C. %% % Others symbols occur in the sequence. They correspond to positions in the % sequence that are are not clearly one base or another and they are due, % for example, to sequencing uncertainties. In this case for example: % N: aNy base % K: G or T % R: A or G (puRine) % Y: C or T (pYrimidine) % M: A or C (aMino) % S: C or G % W: A or T % Instead of grouping other simbols together, you can count them individually. bases=basecount(Hflu,'Others','Full') %% % You can also calculate the frequency of each nucleotide. [bases frequency]=basecountfrequency(Hflu,'Others','Full') %% % It shold be pointed out that these are on the 5'-3' strand of DNA, but % we know exactly the frequency of all the the bases on the other strand % because of the complementarity of the double helix. Anyway we could % verify it using again the *basecountfrequency* function. [bases frequency]=basecountfrequency(seqrcomplement(Hflu),'Others','Full') %% CG content % The local fluctuations in the frequencies of nucleotides provide % different interesting information. The local base composition by a sliding % window of variable size can be measured. In the following the window % size is assumed 90000 bp and 20000 bp respectively. ntdensity1(Hflu,'window',90000) %% ntdensity1(Hflu,'window',20000) %% % The phenomenon called horizontal gene transfer can be observed in the % second plot: the H. influenzae genome contains a 30Kb region of unusual % GC content from 1.56Mb to 1.59Mb. %% k-mer frequency and motif bias % Now look at the dimers in the sequence and display the 2-mer frequencies % in the table matrix. The simbols others than A, C, G, T are not % considered. [dimers,matrix] = dimercount(Hflu) %% % You can also plot the frequency of certain di-nucleotides of interest: % for example considering the local fluctuations in the frequencies % of dimers AT and CG. density = dndensity1(Hflu) %% % As a further analysis you can also compare the observed frequency of each % dimer with the one expected under the multinomial model. This is called % genome signature. gs=gensign(Hflu) %% % You can look at codons using *codoncount*. That function counts the % codons on a particular reading frame. With no options, the function shows % the codon counts on the first reading frame. codoncount(Hflu) %% Exploring the Open Reading Frames (ORFs) % In a nucleotide sequence an obvious thing to look for is if there are any % open reading frames. The function *seqorfs* can be used to determine the % ORFs in a sequence. orf=seqorfs(Hflu); %% % The variable orf is a structure with information about the start/stop % positions of each ORFs, its length and which reading frame it is in. % Here the minimun number of codons for an ORF to be considered valid is 10 % (default value). %% % The minimun number of codons for an ORF to be considered valid can be % also set. Here a threshold of 70 and 80 amino acids are considered % leading to about 2175 and 1966 ORFs respectively. [orf n_70]=seqorfs(Hflu,'MINIMUMLENGTH',70); n_70 %% [orf n_80]=seqorfs(Hflu,'MINIMUMLENGTH',80); n_80 %% % You can also look for the total number of ORFs [orf n]=seqorfs(Hflu,'MINIMUMLENGTH',1); n %% Detecting significant ORFs % The classic approach to decide whether an ORF is a good candidate as a % gene is to calculate the probability of seeing an ORF of a certain length % L in a random sequence. To test the significance of ORFs a % single-nucleotide permutation test can be used. [orf1 n1]=seqorfs(Hflu(randperm(length(Hflu))),'MINIMUMLENGTH',1); n1 %% % The ORF Length of the original and of the randomized sequences can be % compared. ORFLength=[]; for i=1:6 ORFLength=[ORFLength; orf(i).Length']; end ORFLength1=[]; for i=1:6 ORFLength1=[ORFLength1; orf1(i).Length']; end %% % A threshold must be chosen in order to keep as candidate genes those ORFs % in the H. Influenzae genome that are longer than most (or all) the ORFs % in the random sequence. max_threshold=max(ORFLength1) n_max=length(find(ORFLength>=max_threshold)) %% % You can also use a more tolerant threshold in order to keep all ORFs of % length equal to or greater than the top 1% of random ORFs. empirical_threshold=prctile(ORFLength1,99) n_emp=length(find(ORFLength>=empirical_threshold))