Mining protein loops using a structural alphabet and statistical exceptionality

Leslie Regad1, Juliette Martin2,3, Gregory Nuel4 and Anne-Claude Camproux1

1MTI, Inserm UMR-S 973, University of Paris Diderot - Paris 7, 35 rue Hélène Brion, F-75205, Paris Cedex 13, France
2MIG UR1077, INRA, F-78350 Jouy-en-Josas, France
3University of Lyon, Lyon; University of Lyon 1; IFR 128, CNRS, UMR 5086, IBCP, 7 passage du Vercors, F-69367 Lyon, France
4MAP5, UMR CNRS 8145, University of Paris-Descartes, F-75006 Paris, France


Abstract:

Background: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is difficult. Regular secondary structures, helices and strands, have been widely studied, whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyse. Due to data sparsity, long loops have rarely been systematically studied.

Results: We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs. This method is based on the structural alphabet HMM-SA. HMM-SA simplifies a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called a structural letter. The difficult task of structurally grouping huge data sets is thus easily accomplished by handling structural-letter strings as in conventional protein sequence analysis.
We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to their structural-letter sequence, called a structural word. This approach permits a systematic analysis of loops of all sizes since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of the total loop length is covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (less than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation with at least four significant positions, and 87% of long loops contain at least one such word.
We complement our analysis with the detection of statistically overrepresented patterns of structural letters as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop length. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints.
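As an illustration of what "over-represented" means for a structural word, the sketch below compares the observed count of a word with the count expected under a first-order Markov model of the structural-letter strings. This is a deliberately crude stand-in and not the statistic used in this work, which relies on exact pattern distributions (see Nuel et al. in the publication list). The function names, the plug-in estimator and the Poisson-style score are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def ngram_counts(seqs, n):
    """Count all overlapping n-letter substrings over a set of strings."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts


def expected_count_m1(word, pairs, singles):
    """Approximate expected count of `word` under a first-order Markov model.

    Plug-in estimate: product of the counts of the consecutive letter pairs,
    divided by the product of the counts of the interior letters.
    """
    num = 1.0
    for i in range(len(word) - 1):
        num *= pairs[word[i:i + 2]]
    den = 1.0
    for letter in word[1:-1]:
        den *= singles[letter]
    return num / den if den else 0.0


def exceptionality_score(word, seqs):
    """Return (observed, expected, score); score > 0 hints at over-representation."""
    obs = ngram_counts(seqs, len(word))[word]
    exp = expected_count_m1(word, ngram_counts(seqs, 2), ngram_counts(seqs, 1))
    # Poisson-style normalisation, an assumption of this sketch
    score = (obs - exp) / sqrt(exp) if exp > 0 else float("nan")
    return obs, exp, score


# Toy structural-letter string (made up, not taken from the loop bank)
obs, exp, score = exceptionality_score("ABAB", ["ABAB"])
```

A strongly positive score would correspond to the OR class and a strongly negative one to the UR class. Choosing thresholds and correcting for the number of words tested are left out of this sketch.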

Conclusions/Significance: We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, it is the first time that pattern mining helps to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might permit to decrease the complexity of long-loops analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.


Data:

  • list of protein chains (pdb code)


  • Description of words:
  • This file gives for each word:
  • Occ: Word occurrence
  • Zmax: The Zmax value associated with the word. This parameter indicates the sequence propensity of the word.
  • RMSD: The average RMSd (root mean square deviation) associated with the word. This criterion corresponds to the RMSd computed on the different fragments encoded by the same word.
  • type: The statistical type of this word (NS=not significant, OR=over represented, UR=under-represented).
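A minimal sketch of how the word description could be summarised per statistical type is given below. The file layout is an assumption: a whitespace-separated table with a header line, a hypothetical `word` column, and the `Occ`, `Zmax`, `RMSD` and `type` fields listed above. The toy rows are invented for the example and are not taken from the actual file.

```python
def parse_word_table(text):
    """Parse a whitespace-separated table with a header line into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        rec = dict(zip(header, ln.split()))
        rec["Occ"] = int(rec["Occ"])
        rec["Zmax"] = float(rec["Zmax"])
        rec["RMSD"] = float(rec["RMSD"])
        rows.append(rec)
    return rows


def summarize_by_type(rows):
    """Map each type (NS, OR, UR) to (number of words, mean RMSD)."""
    groups = {}
    for rec in rows:
        groups.setdefault(rec["type"], []).append(rec["RMSD"])
    return {t: (len(v), sum(v) / len(v)) for t, v in groups.items()}


# Invented rows in the assumed layout
toy_table = """
word Occ Zmax RMSD type
ABCD 120 5.2 0.50 OR
BCDE 45 1.1 1.00 NS
CDEF 300 7.9 0.70 OR
"""

rows = parse_word_table(toy_table)
summary = summarize_by_type(rows)
```

Comparing the mean RMSD of the OR group with that of the NS group is one simple way to look at the structural variability of over-represented words.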


  • Description of fragments:
  • This file gives for each fragment:
  • word: the word (4 structural letters) encoding the fragment
  • position: position of the word in the structural-letter sequence of the protein structure. These positions do not always correspond to the position of the fragment in the pdb files.
  • prot: the pdb code of the protein chain where the fragment is extracted
  • AAseq: amino acid sequence of the fragment
  • flank: flanking regions of the loop containing the fragment (EE: loop linking 2 β-strands, EH: loop linking a β-strand and an α-helix, HE: loop linking an α-helix and a β-strand, HH: loop linking 2 α-helices).
  • type: loop type according to the loop length: SL: short loops (less than 9 structural letters (12 residues)); LL: long loops (at least 10 structural letters (13 residues))
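The fields above can be combined as in the following sketch. It assumes each fragment record has already been read into a dictionary keyed by the field names; the records, chain codes and amino-acid sequences shown are invented. The helper `letters_to_residues` only restates the arithmetic implied by the description: each structural letter covers four residues and consecutive letters overlap by three, so n letters span n + 3 residues.

```python
from collections import Counter, defaultdict


def letters_to_residues(n_letters):
    """Number of residues spanned by n overlapping four-residue letters."""
    return n_letters + 3


def flank_profile(fragments):
    """Map each word to a Counter of its flank types (EE, EH, HE, HH)."""
    profile = defaultdict(Counter)
    for frag in fragments:
        profile[frag["word"]][frag["flank"]] += 1
    return profile


def shared_by_short_and_long(fragments):
    """Return the words observed in both short (SL) and long (LL) loops."""
    seen = defaultdict(set)
    for frag in fragments:
        seen[frag["word"]].add(frag["type"])
    return {word for word, types in seen.items() if {"SL", "LL"} <= types}


# Invented fragment records using the field names listed above
toy_fragments = [
    {"word": "ABCD", "position": 12, "prot": "toy1A", "AAseq": "GSTPDGK", "flank": "EE", "type": "SL"},
    {"word": "ABCD", "position": 40, "prot": "toy2B", "AAseq": "GNTADGR", "flank": "EE", "type": "LL"},
    {"word": "BCDE", "position": 13, "prot": "toy1A", "AAseq": "STPDGKL", "flank": "HH", "type": "SL"},
]

profile = flank_profile(toy_fragments)
shared = shared_by_short_and_long(toy_fragments)
```

With this arithmetic a word of 4 letters spans 7 residues, 9 letters span 12 residues and 10 letters span 13 residues, which matches the loop lengths quoted above.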

Remarks:

The word extraction is based on the structural alphabet HMM-SA (Hidden Markov Model - Structural Alphabet; Camproux et al. 1999, 2004). This structural alphabet simplifies the three-dimensional protein structure into a one-dimensional sequence of structural letters. The structures of the protein set were simplified using the software available at http://bioserv.rpbs.jussieu.fr/cgi-bin/SA-Encode
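Once loops are encoded as structural-letter strings, word extraction reduces to a sliding window over each string. The sketch below shows this step under simple assumptions: loops are given as plain Python strings, words are four letters long (seven residues), and a word is called recurrent when it is observed more than 30 times. The function names and the toy strings are hypothetical and do not come from the SA-Encode software.

```python
from collections import Counter

WORD_LEN = 4  # four overlapping structural letters span seven residues


def extract_words(letters, word_len=WORD_LEN):
    """Return every overlapping word of `word_len` letters in one loop string."""
    return [letters[i:i + word_len] for i in range(len(letters) - word_len + 1)]


def count_words(loops):
    """Count word occurrences over a collection of loop strings."""
    counts = Counter()
    for loop in loops:
        counts.update(extract_words(loop))
    return counts


def recurrent_words(counts, min_occ=30):
    """Keep the words observed more than `min_occ` times."""
    return {word: n for word, n in counts.items() if n > min_occ}


# Toy loop strings (made up, not taken from the data set)
toy_loops = ["ABCDEF", "BCDE", "XABCD"]
counts = count_words(toy_loops)
frequent = recurrent_words(counts, min_occ=1)
```

Because words are extracted with a window rather than per loop, loops of any length contribute to the same word counts, and loops shorter than four letters contribute no word.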

Previous publications about this work:

  • A Hidden Markov Model Applied to the Protein 3D Structure Analysis
    Regad L, Guyon F, Maupetit J, Tuffery P, Camproux AC.
    Computational Statistics & Data Analysis 2008, 52(6): 3198-3207.

  • Identification of non Random Motifs in Loops Using a Structural Alphabet
    Regad L, Martin J, Camproux AC.
    Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 2006.

Previous publications about the structural alphabet HMM-SA:

  • Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity.
    Camproux AC, Tuffery P.
    Biochim Biophys Acta. 2005, 1724(3): 394-403.

  • A hidden markov model derived structural alphabet for proteins.
    Camproux AC, Gautier R, Tuffery P.
    J Mol Biol. 2004, 339(3): 591-605.

  • Hidden Markov model approach for identifying the modular framework of the protein backbone.
    Camproux AC, Tuffery P, Chevrolat JP, Boisvieux JF, Hazout S.
    Protein Eng. 1999, 12(12): 1063-73.

Previous publications about statistically exceptional words in loops:

  • Exact distribution of pattern in a set of random sequences generated by a Markov source: application to biological data.
    Nuel G, Regad L, Martin J, Camproux AC.
    Algo Mol Biol. 2009, in press.




  • Last update : Dec. 2009