About This WebServer
About
Procleave is an online webserver for the computational identification of caspase substrate cleavage sites from substrate primary sequences. Procleave is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/Procleave/. In addition to the conventional two-state (cleavage or non-cleavage) prediction, Procleave generates an estimated probability score for each candidate caspase cleavage site, thus providing a quantitative evaluation of caspase substrate specificity. As the identification of substrate cleavage sites can provide valuable information about the biological functions of caspases and facilitate the discovery of novel substrates, Procleave is anticipated to be a useful tool in the in vitro screening of putative caspase substrates and to contribute to a more comprehensive knowledge of caspase families and their biological roles.
This webserver uses multiple sequence features as input to its support vector regression predictors, including binary encoding amino acid sequence profiles (BEAA) and bi-profile Bayesian (BPB) signatures. The BPB signatures are derived from the amino acid sequence and from predicted structural features (secondary structure, solvent accessibility and natively unstructured regions) using the bi-profile Bayesian feature extraction method (Shao et al., 2009). When using local sequence-derived profiles, Procleave successfully predicted 82.2% of the substrate cleavage sites, with a Matthews Correlation Coefficient (MCC) of 0.667. Incorporating relevant structural features such as predicted solvent accessibility and natively unstructured region information further improved the prediction performance. Furthermore, the bi-profile Bayesian signatures were found to significantly improve the prediction performance and yielded the best results, with an overall accuracy of 87.6% and an MCC of 0.747, which is better than the performance of the state-of-the-art CASVM method.
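For readers unfamiliar with the two performance measures quoted above, the following minimal sketch shows how overall accuracy and the MCC are computed from the four confusion-matrix counts. The function names are illustrative, and the counts in the usage note are invented and are not taken from the Procleave evaluation.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all sites (cleavage and non-cleavage) predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here rather than taken from the Procleave description).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(accuracy(40, 45, 5, 10))
    print(round(mcc(40, 45, 5, 10), 3))
```

Unlike accuracy, the MCC uses all four counts symmetrically, which is why it is often reported alongside accuracy when the numbers of positive and negative samples differ.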
Methodology
We extracted four types of sequence profiles based on the bi-profile Bayesian feature extraction approach: (1) bi-profile Bayesian amino acid profile (BPBAA); (2) bi-profile Bayesian secondary structure profile (BPBSS); (3) bi-profile Bayesian solvent accessibility profile (BPBSA); (4) bi-profile Bayesian disordered profile (BPBDISO). Each profile was obtained by calculating the frequency of each amino acid, or of the corresponding structural type, at each position of the cleavage and non-cleavage peptide sequences in the current dataset.
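The frequency calculation described above can be sketched as follows. This is a minimal illustration of the bi-profile idea, not the Procleave implementation: the function names and the toy peptides are invented, and the same code would apply to strings of structural states (for example secondary structure types) instead of amino acids.

```python
from collections import Counter


def position_frequencies(peptides):
    """Per-position frequency of each symbol in a set of equal-length peptides."""
    length = len(peptides[0])
    total = len(peptides)
    profile = []
    for i in range(length):
        counts = Counter(p[i] for p in peptides)
        profile.append({symbol: c / total for symbol, c in counts.items()})
    return profile


def bpb_encode(peptide, pos_profile, neg_profile):
    """Encode a peptide as its positional frequencies in the cleavage profile
    followed by its positional frequencies in the non-cleavage profile."""
    pos = [pos_profile[i].get(s, 0.0) for i, s in enumerate(peptide)]
    neg = [neg_profile[i].get(s, 0.0) for i, s in enumerate(peptide)]
    return pos + neg


if __name__ == "__main__":
    # Toy peptides, invented for illustration only.
    cleaved = ["DEVD", "DEVD", "DQTD", "VEID"]
    uncleaved = ["AKLG", "DKSG", "AELV", "GKLD"]
    pos_profile = position_frequencies(cleaved)
    neg_profile = position_frequencies(uncleaved)
    print(bpb_encode("DEVD", pos_profile, neg_profile))
```

A peptide of length n therefore yields a 2n-dimensional vector: the first n values describe how typical it is of known cleavage sites, and the last n values describe how typical it is of non-cleavage sites.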
Binary encoding amino acid sequence profiles (BEAA)
For this sequence encoding scheme, caspase substrate sequences were transformed into n-dimensional vectors using an orthonormal encoding scheme, in which each amino acid is represented by a 20-dimensional binary vector composed of zero and one elements [25, 37], e.g. Ala (10000000000000000000), Cys (01000000000000000000), Asp (00100000000000000000), …, Tyr (00000000000000000001). For simplicity, we refer to this binary encoding amino acid sequence profile as the ‘BEAA’ encoding scheme. Since a larger sequence window is expected to provide more local sequence information, we used a sliding window approach to derive the local sequence profiles based on the ‘BEAA’ scheme and examined the corresponding prediction accuracy. The window size w is defined as the number of residues in the local sequence window surrounding the cleavage site, from the P8 to P8' positions, arranged in either a symmetrical or non-symmetrical manner (Figure 1): i.e. w = 3 (P4-P1), 4 (P2-P2' or P3-P1'), 5 (P3-P2' or P4-P1'), 6 (P3-P3'), …, 14 (P7-P7'), 16 (P8-P8'), etc.
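As an illustration of the BEAA scheme and the sliding window, the sketch below encodes a local window around a candidate cleavage site. The function names and the example sequence are hypothetical. Two behaviours are assumptions made here and not described for Procleave: residues outside the 20 standard amino acids are encoded as all zeros, and windows that run past the sequence termini are skipped.

```python
# Alphabetical one-letter order, so Ala is first and Tyr is last,
# matching the example vectors given in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """20-dimensional binary vector for one residue (all zeros if unknown)."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue)
    if idx >= 0:
        vec[idx] = 1
    return vec


def window(sequence, p1_index, n_left, n_right):
    """Return residues P{n_left}..P1 followed by P1'..P{n_right}'.

    p1_index is the 0-based index of the P1 residue. Returns None when
    the window would extend beyond either end of the sequence.
    """
    start = p1_index - n_left + 1
    end = p1_index + n_right + 1
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def beaa_encode(peptide):
    """Concatenate the one-hot vectors of all residues in the window."""
    return [bit for residue in peptide for bit in one_hot(residue)]


if __name__ == "__main__":
    seq = "MADEVDGSK"               # invented sequence
    local = window(seq, 5, 4, 2)    # P4-P2' around the Asp at index 5
    print(local, len(beaa_encode(local)))
```

A window of w residues produces a vector of 20 × w binary elements, so the P4-P2' window in the usage example gives 120 features.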
Predicted structural information extraction
Based on the observation of structural determinants of caspase substrate specificity, we also incorporated into the Procleave predictors structural information predicted by state-of-the-art algorithms, specifically secondary structure, solvent accessibility and natively unstructured regions.
Secondary structure was predicted using the PSIPRED program, which provides one of the most accurate predictions of protein secondary structure and generates a probability profile over three secondary structure assignments (helix, strand and coil) for each residue in a protein. For a given residue, we extracted the w×3 matrix from the PSIPRED output file according to the selected sliding window size w, and incorporated this matrix into the SVR model.
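The w×3 matrix extraction can be sketched as below. The sketch assumes the per-residue probabilities have already been read from the PSIPRED output into a list of (helix, strand, coil) tuples; parsing the file itself is not shown. The function names are hypothetical, and padding positions beyond the sequence termini with zeros is an assumption made here, not a documented Procleave behaviour.

```python
def ss_window_matrix(ss_probs, start, w, pad=(0.0, 0.0, 0.0)):
    """Return w rows of (helix, strand, coil) probabilities.

    ss_probs: one 3-tuple per residue of the full protein.
    start:    0-based index of the first residue in the window.
    Positions outside the protein are filled with the pad row.
    """
    rows = []
    for i in range(start, start + w):
        if 0 <= i < len(ss_probs):
            rows.append(tuple(ss_probs[i]))
        else:
            rows.append(pad)
    return rows


def flatten(matrix):
    """Flatten the w x 3 matrix into a single feature vector of length 3w."""
    return [value for row in matrix for value in row]


if __name__ == "__main__":
    # Invented probabilities for a five-residue protein.
    probs = [
        (0.1, 0.1, 0.8),
        (0.7, 0.1, 0.2),
        (0.8, 0.1, 0.1),
        (0.6, 0.2, 0.2),
        (0.1, 0.2, 0.7),
    ]
    print(flatten(ss_window_matrix(probs, 1, 3)))
```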
Solvent accessibility was predicted using the SSpro program implemented in the SCRATCH package. SSpro predicts the solvent accessibility status of each residue in a protein sequence, and its output is binary: each residue is either “exposed” or “buried”. Solvent accessibility was encoded as binary units in the SVR model.
Natively unstructured (disordered) regions were predicted using the DISOPRED2 server, which is one of the leading servers for predicting natively disordered regions in proteins. As native disorder is often functionally important and commonly associated with molecular recognition and substrate binding, this information might be relevant for improving prediction performance. The probability of each residue being disordered, as generated by DISOPRED2, is used as input to the SVR models.
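The encoding of these two structural features can be sketched as follows. The per-residue state labels ("e" for exposed, "b" for buried), the function names, and the interleaved layout of the resulting vector are all hypothetical choices made for illustration; the sketch also assumes the window lies fully inside the sequence.

```python
def encode_accessibility(states, exposed_label="e"):
    """Map per-residue accessibility states to 1 (exposed) or 0 (buried)."""
    return [1 if s == exposed_label else 0 for s in states]


def structural_features(acc_states, disorder_probs, start, w):
    """For each of w residues from index start, emit the binary
    accessibility unit followed by the disorder probability."""
    acc = encode_accessibility(acc_states)
    features = []
    for i in range(start, start + w):
        features.append(acc[i])
        features.append(disorder_probs[i])
    return features


if __name__ == "__main__":
    # Invented predictions for a five-residue protein.
    acc_states = "eebbe"
    disorder = [0.9, 0.8, 0.1, 0.2, 0.7]
    print(structural_features(acc_states, disorder, 1, 3))
```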
Bi-profile Bayesian signatures
More recently, a novel approach called Bi-profile Bayes was proposed and applied to the prediction of methylation sites in proteins, where it was shown to significantly improve prediction performance (Shao et al., 2009). The rationale for applying it in the current study is that peptide sequences that can be cleaved by caspases should exhibit different features or characteristics from those that cannot be cleaved. Therefore, representing each sample in a bi-feature manner, using bi-profile Bayesian signatures derived from experimentally verified data, should be more informative than the single binary encoding amino acid sequence profile (BEAA) described above. This approach is particularly useful when dealing with an unbalanced dataset that contains fewer positive samples than negative samples. More details about bi-profile Bayesian signature extraction can be found in Shao et al. (2009). In this study, we integrate the bi-profile Bayesian signatures to predict caspase cleavage sites from primary substrate sequences. The architecture of the Procleave server is shown in Figure 1.
Figure 1. The architecture of the Procleave server.
Support vector regression (SVR) implementation and parameter selection
In this study, we used support vector regression (SVR) to build models that quantitatively estimate the cleavage probability of caspase substrates. Support vector machine (SVM) is a supervised machine learning technique based on the structural risk minimization principle from statistical learning theory. As a powerful machine learning technique, SVM has been applied successfully to a wide range of classification problems in recent years. In practice, SVM has two modes: the classification mode (support vector classification, SVC) and the regression mode (support vector regression, SVR). In contrast to SVC, SVR predicts the raw values of the tested samples rather than class labels. It is especially effective when the input data are high-dimensional and related to the target in a non-linear way. For the implementation of the SVR approach, we used the SVM_light package, an implementation of Vapnik’s SVM for support vector classification, regression and pattern recognition. We selected the radial basis function (RBF) kernel with ε=0.01, γ=0.01 and C=100.0 to build the prediction models. This parameter set was optimized based on 5-fold cross-validation.
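To make the roles of γ and ε concrete, the sketch below implements the RBF kernel, the ε-insensitive loss used in SVR training, and the general form of an SVR prediction. It is written in plain Python for illustration and is not the SVM_light code; the function names, support vectors, coefficients and bias in the usage note are invented. The parameter C, which weights the loss against model complexity during training, is not shown because training itself is omitted.

```python
import math

GAMMA = 0.01    # RBF kernel width parameter reported in the text
EPSILON = 0.01  # width of the insensitive tube reported in the text


def rbf_kernel(x, y, gamma=GAMMA):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def eps_insensitive_loss(y_true, y_pred, epsilon=EPSILON):
    """Errors smaller than epsilon cost nothing; larger errors cost linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_predict(x, support_vectors, coeffs, bias, gamma=GAMMA):
    """Weighted sum of kernel values against the support vectors, plus a bias."""
    return sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coeffs)
    ) + bias


if __name__ == "__main__":
    # Invented support vectors and coefficients, for illustration only.
    svs = [[1.0, 0.0], [0.0, 1.0]]
    coeffs = [0.5, -0.2]
    print(svr_predict([1.0, 0.0], svs, coeffs, bias=0.1))
```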
Input
Input the sequence:
Sequence encoding schemes selection:
Email address:
Users should enter an email address in order to receive the plain text output and a link to the final prediction results for the caspase cleavage sites in the submitted sequence.
Output
Two types of output results will be given by the Procleave webserver:
1) Plain text outputs containing the query substrate sequence information, residue position, residue name, predicted solvent accessibility, predicted natively unstructured region, cleavage probability score, and the identified cleavage sites (positions P4-P2' are given). The identified substrate cleavage sites (P1-P1' positions) are indicated with an asterisk (*).
2) Graphical representations of the predicted probability score for each residue position in the query, providing a visual presentation of the distribution of the predicted cleavage sites along the primary substrate sequence.
References
1. Algeciras-Schimnich A, Barnhart BC, Peter ME: Apoptosis-independent functions of killer caspases. Curr Opin Cell Biol 2002, 14:721-726.
2. Launay S, Hermine O, Fontenay M, Kroemer G, Solary E, Garrido C: Vital functions for lethal caspases. Oncogene 2005, 24:5137-5148.
3. Talanian RV, Quinlan C, Trautz S, Hackett MC, Mankovich JA, Banach D, Ghayur T, Brady KD, Wong WW: Substrate specificities of caspase family proteases. J Biol Chem 1997, 272:9677-9682.
4. Mahrus S, Trinidad JC, Barkan DT, Sali A, Burlingame AL, Wells JA: Global sequencing of proteolytic cleavage sites in apoptosis by specific labeling of protein N termini. Cell 2008, 134:866-876.
5. Nicholson DW: Caspase structure, proteolytic substrates, and function during apoptotic cell death. Cell Death Differ 1999, 6:1028-1042.
6. Hotchkiss RS, Nicholson DW: Apoptosis and caspases regulate death and inflammation in sepsis. Nat Rev Immunol 2006, 6:813-822.
7. Earnshaw WC, Martins LM, Kaufmann SH: Mammalian caspases: structure, activation, substrates, and functions during apoptosis. Annu Rev Biochem 1999, 68:383-424.
8. Fischer U, Janicke RU, Schulze-Osthoff K: Many cuts to ruin: a comprehensive update of caspase substrates. Cell Death Differ 2003, 10:76-100.
9. Timmer JC, Salvesen GS: Caspase substrates. Cell Death Differ 2007, 14:66-72.
10. Luthi AU, Martin SJ: The CASBAH: a searchable database of caspase substrates. Cell Death Differ 2007, 14:641-650.
11. Dix MM, Simon GM, Cravatt BF: Global mapping of the topography and magnitude of proteolytic events in apoptosis. Cell 2008, 134:679-691.
12. Johnson CE, Kornbluth S: Caspase cleavage is not for everyone. Cell 2008, 134:720-721.
13. Lohmuller T, Wenzler D, Hagemann S, Kiess W, Peters C, Dandekar T, Reinheckel T: Toward computer-based cleavage site prediction of cysteine endopeptidases. Biol Chem 2003, 384:899-909.
14. Yang ZR: Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks. Bioinformatics 2005, 21:1831-1837.
15. Garay-Malpartida HM, Occhiucci JM, Alves J, Belizario JE: CaSPredictor: a new computer-based tool for caspase substrate prediction. Bioinformatics 2005, 21 (Suppl 1):i169-i176.
16. Singh GP, Ganapathi M, Sandhu KS, Dash D: Intrinsic unstructuredness and abundance of PEST motifs in eukaryotic proteomes. Proteins 2006, 62:309-315.
17. Backes C, Kuentzer J, Lenhof HP, Comtesse N, Meese E: GraBCas: a bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences. Nucleic Acids Res 2005, 33:W208-W213.
18. Wee LJ, Tan TW, Ranganathan S: CASVM: web server for SVM-based prediction of caspase substrates cleavage sites. Bioinformatics 2007, 23:3241-3243.
19. Wee LJ, Tan TW, Ranganathan S: SVM-based prediction of caspase substrate cleavage sites. BMC Bioinformatics 2006, 7 (Suppl 5):S14.
20. Ju W, Valencia CA, Pang H, Ke Y, Gao W, Dong B, Liu R: Proteome-wide identification of family member-specific natural substrate repertoire of caspases. Proc Natl Acad Sci USA 2007, 104:14294-14299.
21. Enoksson M, Li J, Ivancic MM, Timmer JC, Wildfang E, Eroshkin A, Salvesen GS, Tao WA: Identification of proteolytic cleavage sites by quantitative proteomics. J Proteome Res 2007, 6:2850-2858.
22. Enoksson M, Salvesen GS: Proteolytic needles in the cellular haystack. Nat Chem Biol 2008, 4:651-652.
23. Schilling O, Overall CM: Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites. Nat Biotechnol 2008, 26:685-694.
24. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292:195-202.
25. Rawlings ND, Morton FR, Kok CY, Kong J, Barrett AJ: MEROPS: the peptidase database. Nucleic Acids Res 2008, 36:D320-D325.
26. Finn RD, Tate J, Mistry J, Coggill PC, Sammut JS, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam protein families database. Nucleic Acids Res 2008, 36:D281-D288.
27. Cheng J, Randall AZ, Sweredoski MJ, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res 2005, 33:W72-W76.
28. Shao J, Xu D, Tsai SN, Wang Y, Ngai SM: Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PLoS ONE 2009, 4:e4920.
Citation:
If you use Procleave, please cite our paper:
Song J, Tan H, Hayashida M, Akutsu T and Whisstock JC (2012). Procleave: a tool for protease-specific substrate cleavage site prediction using conditional random fields, in preparation