%% OLFACTORYRECEPTORS Example of use of HMM in sequence analysis %% % This demo belongs to the collection of Case Studies in Computational % Genomics, mostly based on classic papers, and mostly based on the contents % of the book % % Introduction to Computational Genomics, A Case Studies Approach % Cambridge University Press, 2006 % Nello Cristianini and Matthew Hahn % % The demos of the other case studies, pointers to software and papers that % are available on-line can be found on the website: % % www.computational-genomics.net % % This demo is also available on-line at web('http://www.computational-genomics.net/case_studies/olfactoryreceptors_demo.html'); % Demo by Tara Thiemann, Elisa Ricci. %% Introduction % Proteins are composed of a sequence of amino acids. These aminoacids have % various atomic compositions and structures that lead to different % properties. The first part of this demo focuses on the property of % hydrophobicity. % % Olfactory receptors (OR) are part of a family of proteins that have 7 % transmembrane regions. That is they pass through the cell membrane 7 % times. The interior of the cell membrane is hydrophobic while both the % exterior and interior of the cell are hydrophilic. Therefore, the % regions of the protein that pass through the membrane should contain % mostly hydrophobic amino acids while the portion outside of the membrane % should be mostly hydrophilic. %% Segmenting odorant receptors % You will create a segmentation Hidden Markov Model to predict the % transmembrane (hydrophobic) regions of an olfactory receptor (OR) protein. % In this example you will consider 347 OR due to Zozulya, et al., 2001. % You can download the 347 sequences of amino acids from the website % www.computational-genomics.net and read them with the MATLAB function % *fastaread* orseqs = fastaread('347OR.fasta'); %% % Considering only the first sequence, you can plot various property of that % using the tool MATLAB proteinplot. or1 = orseqs(1).Sequence proteinplot(or1) %% % In the protein plot window, you must change the selected property % to hydrophobicity (Kyte & Doolittle). Under Edit -> Change Configuration, % you must change the window size to 19.0. Now you need to export the figure % so that you can plot the results of the model over it. To do this, select % File -> Export Figure. Then hold the figure for future use. Here the % final profile is loaded for simplicity. open('hydrophobicity.fig'); % Comment this if you want to use the proteinplot hold on %% % Altough you can pick out a number of hydrophobic and hydrophilic peaks in % this graph simply by eye, an HMM can help to delineate exactly the % various regions of interest. A model is assumed that consider the protein % sequence being generated by a stochastic process that alternates between % two hidden states: "out of the membrane" and "in the membrane". % The HMM is trained using 20 OR sequences. for i=1:20 intseqs(i) = {aa2int(orseqs(i).Sequence)}; end %% % The probability transition matrix is 2X2 for the 2 states: out of the % membrane and in the membrane. You must enter an estimate of the transition % probabilities as initial guess. T = [0.95 0.05; 0.05 0.95]; %% % The probability emission matrix is 2X20 (2 states and 20 amino acids). % Also in this case you must enter the initial guesses of the emission % probabilities. These guesses of emission are based on the hydrophobicity % of the amino acids. For example, in the 'in the membrane' state, a % hydrophilic amino acids has a higher probability of emission that one that % is hydrophobic. E = [0.018 0.067 0.067 0.067 0.018 0.067 0.067 0.067 0.067 0.01 0.01 0.067 0.018 0.018 0.067 0.067 0.067 0.067 0.067 0.01; 0.114 0.007 0.007 0.007 0.114 0.007 0.007 0.025 0.007 0.114 0.114 0.007 0.114 0.114 0.025 0.025 0.025 0.025 0.025 0.114]; %% % Now, starting from this initial guesses the true emission and transition % matrices must be estimated from our sequences using the EM algorithm. The % Matlab function *hmmtrain* is used to this purpose. [estT, estE] = hmmtrain(intseqs, T, E) %% % Finally the Viterbi algorithm can be used to detemine the states path, so % to automatically segment the protein into its component regions. The % MATLAB function *hmmviterbi* is used, receiving as input the emission % and transition matrices previously estimated from *hmmtrain*. % You can also plot the states over the hydrophobicity plot. estimatedStates = hmmviterbi(aa2int(or1),estT,estE); plot(estimatedStates) hold off %% Profile HMMs for odorant receptors % You will now turn to protein families and multiple alignment. Protein % families are groups of proteins that have similar structure and function. % Profile HMMs for specific families can be developed from the multiple % alignment of members of the family. Profile HMMs create a useful % position-based scoring system. Then, homologues can be compared back to % the HMM. In MATLAB, you can use profile HMM to perform multiple sequence % alignments. % % Profile HMMs can be found for many protein families and the PFAM website. web('http://www.sanger.ac.uk/Software/Pfam') %% % A first question to be answerd is to detemine which pHMM must be used % for the olfactory receptors. You can use the first OR sequence and % a randomized version of that to compare to the pHMMs. randor1 = randseq(length(or1), 'fromstructure', aacount(or1)); %% % There are over 7000 pHMMs available at the PFAM site. You can search all % the HMMs by entering the sequence in the 'search by protein sequence' on % the PFAM site. For sake of time, we will only compare the sequences % to the first 4 pHMMs. % The MATLAB function *hmmprofalign* aligns the sequences to the selected % profile HMM while *gethmmprof* retrieves the model. For PFAM accession % number PF00001, you must simply enter gethmmprof(1). seqs = {or1,randor1}; for i = 1:4 for j = 1:2 [score(i,j)] = hmmprofalign(gethmmprof(i), seqs(j)); end end score %% % You can see that the fake protein did not show a good alignment with % any of the HMMs. The real olfactory receptor matches PF00001. You will % use this HMM to align the olfactory receptors. % % Therefore, first of all, you should get the pHMM from the PFAM database. % Then you can retrieve multiple aligned sequences from the PFAM database % using the MATLAB function *gethmmalignment*. hmm7tm = gethmmprof(1); seqs = gethmmalignment(1, 'type', 'seed'); disp([char(seqs.Header) char(seqs.Sequence)]) %% % As above you can perform alignment between the chosen pHMM and our first % 30 up to 347 OR sequences. The MATLAB function *hmmmerge* allows an % easier viewing. for i=1:30 [Score(i), Seqs(i).Aligned] = hmmprofalign(hmm7tm, orseqs(i).Sequence); end hmmprofmerge(Seqs,Score) %% % It must be pointed out why it is better to use profile HMM instead % of a pairwise alignment. Each time you align a sequence to an HMM, it is % like aligning it to the hundreds of sequences that have been used to % create the HMM. This gives you more certainty in the results of the % alignment. % As confirm, you can consider the following. The family to which we aligned % the OR is the Rhodopsin family. Here you can try to perform an alignment % with just one of the sequences used to develop the HMM. % First, you must retrieve the sequences for the rhodopsin receptor and % one of the odorant receptors. rhod = getgenpept('NP_002368','sequenceonly',true); or = orseqs(300).Sequence; %% % The rhodopsin sequence can be also downloaded from the website % www.computational-genomics.net % load rhod %% % Now you can perform a paiwise global alignment of the two sequences. The % BLOSUM30 is used as scoring matrix. Here the penalties for opening and % extending a gap in the alignment are set to 5. [Score, Alignment] = nwalign(or, rhod, 'scoringmatrix','blosum30','gapopen',5,'extendgap',5) %% % Now you must assess the significance of this alignment. Use *randperm* % for this and compare the alignment and scores. The scores are fairly % close. It is difficult to tell if the alignment is significant. perm = randperm(length(or)); randor = or(perm); [Score, Alignment] = nwalign(randor, rhod,'scoringmatrix','blosum30','gapopen',5,'extendgap',5) %% % On the contrary, if you align both sequences with the HMM, the % significance of the alignment is more apparent. [score_or, align_or] = hmmprofalign(hmm7tm,or) [score_rand, align_rand] = hmmprofalign(hmm7tm,randor) %% % Therefore, the alignment with pHMM has much more power than pairwise % alignment since it includes the characteristics of all the sequences used % to create the model. %% Phylogenetic Tree % In the last part of this demo, you will create a phylogenetic tree from % member of this protein family. The olfactory receptors are actually part % of a much large protein family known as the G-Protein-Coupled Receptors. % All of these proteins are 7-transmembrane, but they detect molecules % other than odorants. There are 5 main groups of GPCRs: Adhesion, % Secretin, Glutamate, Frizzled/TAS2, Rhodopsin (Fredriksson, et al.,2003). % You will use a few of these groups to create the tree. % First, sequences can be retrieved from the GenBank database using the % *getgenbank* function. data = {'Adhesion 1' 'NP_001775'; 'Adhesion 2' 'NP_001965'; 'Glutamate 1' 'NP_000830'; 'Glutamate 2' 'NP_000836'; 'Rhod-Alpha 1' 'NP_001051'; 'Rhod-Alpha 2' 'NP_000946'; 'Rhod-Delta 1' 'NP_002368'; 'Rhod-Delta 2' 'NP_473372'}; for prot = 1:8 seqs(prot).Header = data{prot,1}; seqs(prot).Sequence = getgenpept(data{prot,2},'sequenceonly','true'); end %% % For convenience the data are directly available in the website www.computational-genomics.net % and you can load them from a .mat file using the command % load proteinfamily %% % You can calculate the UPGMA distances using Jukes-Cantor correction, % so you build the tree. distances = seqpdist(seqs,'Method','Jukes-Cantor'); tree = seqlinkage(distances,'UPGMA',seqs) %% % Now plot the tree. h = plot(tree,'orient','bottom'); ylabel('Evolutionary distance') %% % Adding two of the olfactory receptors sequences, you must recreate the tree. data2 = {'Olfactory 1';'Olfactory 2'}; for prot = 1:2 seqs(prot+8).Header = data2{prot,1}; seqs(prot+8).Sequence = orseqs(prot).Sequence; end distances = seqpdist(seqs,'Method','Jukes-Cantor'); tree = seqlinkage(distances,'UPGMA',seqs) h = plot(tree,'orient','bottom'); ylabel('Evolutionary distance') %% % You can see that the members of the GPCR groups were grouped together and % that the olfactory receptors fell within the Rhodopsin group. This % matches what we previously knew from matching the ORs with the correct % profile HMM. However, UPGMA grouped the ORs with the Alpha-Rhodopsins while the % maximum parsimony method used by Fredriksson et al. group them with the % Delta-Rhodopsins. %% References % Axel, R. *The Molecular Logic of Smell*. 1995. Scientific American % 273(4):154-159. % % Buck, L. and R. Axel. 1991. *A Novel Multigene Family May Encode Odorant % Receptors: A Molecular Basis for Odor Recognition*. Cell 65:175-187. % % Eddy, S. *Profile Hidden Markov Models*. 1998. Bioinformatics. % 14(9):755-763. % % Fredriksson, R., M. Lagerstrom, L. Lundin and H. Schioth. 2003. *The % G-Protein-Coupled Receptors in the Human Genome Form Five Main Families*. % Phylogenetic Analysis, Paralogon Groups, and Fingerprints. Molecular % Pharmacology 63:1256-1272. % % Mombaerts, R. 2004. *Genes and Ligands for Odorant, Vomeronasal, and Taste % Receptors*. Nature Reviews 5:263-278. % % Zozulya, S., F. Echeverri and T. Nguyen. 2001. *The human olfactory % receptor repertoire*. Genome Biology 2(6):research0018.1-0018.12.