Supplementary tables and text



Table S1: The partition of the data used in the 5-fold cross-validation to evaluate attribute subsets and the final classifier

Group  Proteins  Total SAPs  Disease-associated  Polymorphism
1      105       686         449                 237
2      104       688         450                 238
3      105       688         450                 238
4      105       688         450                 238
5      103       688         450                 238
All    522       3438        2249                1189
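The Proteins column suggests that all SAPs of a given protein fall into the same group. One way to build such a partition is sketched below. It is a greedy heuristic chosen for illustration, not necessarily the procedure used for Table S1, and it balances only the total SAP count, not the ratio of disease-associated to polymorphism SAPs. The function name `partition_by_protein` and the input dictionary are hypothetical.

```python
def partition_by_protein(sap_counts, n_groups=5):
    """Greedily assign whole proteins to groups with similar SAP totals.

    sap_counts: dict mapping a protein identifier to its number of SAPs
                (hypothetical input format).
    Returns (groups, totals): the protein lists and the SAP total per group.
    """
    groups = [[] for _ in range(n_groups)]
    totals = [0] * n_groups
    # Place the largest proteins first; ties are broken by identifier so
    # the result is deterministic.
    ordered = sorted(sap_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for protein, count in ordered:
        smallest = totals.index(min(totals))  # group with the fewest SAPs so far
        groups[smallest].append(protein)
        totals[smallest] += count
    return groups, totals
```

Because a protein is never split, no SAP from a test-fold protein can appear in the training folds.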



Table S2: The mutual information between solvent accessibility and the status of SAP

Solvent accessibility type  Wildtype  Variant  Difference  Absolute value of difference
All-atoms absolute          0.0475    0.0395   0.0199      0.0257
All-atoms relative          0.0482    0.0384   0.0080      0.0106
Total-side absolute         0.0552    0.0437   0.0184      0.0249
Total-side relative         0.0524    0.0405   0.0093      0.0110
Main-chain absolute         0.0067    0.0109   0.0043      0.0076
Main-chain relative         0.0069    0.0112   0.0040      0.0052
Non-polar absolute          0.0475    0.0304   0.0324      0.0264
Non-polar relative          0.0462    0.0366   0.0146      0.0155
All-polar absolute          0.0150    0.0278   0.0157      0.0184
All-polar relative          0.0171    0.0272   0.0081      0.0119


Note: To calculate the mutual information, the continuous solvent accessibility values were first discretized into 5 equal-frequency bins.
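The calculation described in the note can be sketched as follows. This is a minimal illustration rather than the original implementation: the rank-based binning (which may split tied values across bins) and the base-2 logarithm are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def equal_frequency_bins(values, n_bins=5):
    """Map each value to a bin index 0..n_bins-1 with roughly equal bin sizes.

    Binning is by rank, so tied values can land in different bins
    (a simplification made for this sketch).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = min(rank * n_bins // n, n_bins - 1)
    return bins

def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi
```

For one row of Table S2, `xs` would be the binned accessibility values and `ys` the SAP status labels (disease-associated or polymorphism).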


Table S3: A summary of the final attribute set (in grouped style)

Attribute group                     Number of components  Data from (W = wildtype, V = variant, D = difference)  Novel (Y/N)
13 Å structural neighbor profile    20                    W                                                      Y
Nearby functional sites             10                    W                                                      Y
Disordered regions                  1                     W                                                      Y
HLA families                        1                     W                                                      Y
Structure model energy              1                     D                                                      Y
Aggregation properties              3                     W, D                                                   Y
Side-chain solvent accessibilities  2                     W                                                      Y
C-beta density                      1                     W                                                      N
Residue frequencies                 3                     W, V, D                                                N
Conservation scores                 7                     W                                                      N
BLOSUM score                        1                     D                                                      N
GRANTHAM score                      1                     D                                                      N
Secondary structures                3                     W, D                                                   N
Dihedral angles                     3                     D                                                      N
Hydrogen bond                       1                     D                                                      N
Disulfide bond                      1                     D                                                      N
RMSD                                1                     D                                                      N


For more details, see the tabular view of each attribute of the final dataset, which is available in CSV format.


Table S4: The 1R rank of all the single attributes using 5-fold cross-validation and bucket size 14

1R Rank Attribute
78.01 The protein containing the SAP is in HLA family or not
75.77 Variant residue frequency
75.39 Frequency difference between wildtype and variant residue
74.26 Wildtype residue frequency
70.33 Conservation score
68.12 The least structural distance between the SAP and a disulfide-bonded Cys
68.12 The Pro component of the structural neighbor profile
67.42 The Tyr component of the structural neighbor profile
66.87 C-beta density
66.84 GRANTHAM score
66.32 The Leu component of the structural neighbor profile
66.14 The conservation score of the 3rd left position from the SAP
66.08 BLOSUM 62 score
66.06 The Ile component of the structural neighbor profile
65.47 The Val component of the structural neighbor profile
65.42 The Asp component of the structural neighbor profile
65.42 The Phe component of the structural neighbor profile
65.42 The Ala component of the structural neighbor profile
65.42 The Cys component of the structural neighbor profile
65.42 The TANGO score of the wildtype residue
65.42 The altered number of the hydrogen bonds caused by the SAP
65.42 The altered number of the disulfide bonds caused by the SAP
65.42 SAP alters the secondary structure or not
65.42 Secondary structure of the wildtype residue
65.42 SAP changes the fragment of beta-aggregation or not
65.42 SAP alters the 3-mer secondary structure or not
65.42 The least structural distance between the SAP and a BINDING site
65.42 The least structural distance between the SAP and a METAL site
65.42 The Arg component of the structural neighbor profile
65.42 The least structural distance between the SAP and an ACT site
65.42 The Thr component of the structural neighbor profile
65.42 The least sequence distance between the SAP and a MOD_RES site
65.42 The SAP is in disordered region or not
65.42 The SAP is in TRANSMEM region or not
65.42 The least structural distance between the SAP and a MOD_RES site
65.42 The least sequence distance between the SAP and a METAL site
65.42 The least sequence distance between the SAP and a BINDING site
65.42 The Met component of the structural neighbor profile
65.42 The His component of the structural neighbor profile
65.42 The Asn component of the structural neighbor profile
65.39 The Lys component of the structural neighbor profile
65.39 The Ser component of the structural neighbor profile
65.36 The Gln component of the structural neighbor profile
65.33 The Glu component of the structural neighbor profile
65.27 The least sequence distance between the SAP and an ACT site
65.27 The difference of TANGO score
65.24 The Gly component of the structural neighbor profile
65.04 The Trp component of the structural neighbor profile
64.92 The conservation score of the 1st right position from the SAP
64.83 The conservation score of the 1st left position from the SAP
64.78 Side chain absolute solvent accessibilities
64.72 The conservation score of the 3rd right position from the SAP
64.28 The conservation score of the 2nd left position from the SAP
64.05 The conservation score of the 2nd right position from the SAP
63.61 The absolute value of the difference between the PSI angles
63.61 Side chain relative solvent accessibilities
63.55 The absolute value of the difference between the PHI angles
63.38 The backbone RMSD of 5 consecutive residues at the SAP site
63.21 The absolute value of the difference between the CHI1 angles
63.00 The difference of the energies evaluated by MODELLER
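As an illustration of how a single attribute can be scored, the sketch below implements the core 1R idea for an attribute that is already categorical: each attribute value predicts the class most often seen with it in the training data. The discretization of numeric attributes controlled by the bucket size and the 5-fold cross-validation loop are omitted, and the function names are hypothetical. Assuming the scores in Table S4 are percent accuracies, always predicting the majority class of Table S1 would give 2249/3438, or about 65.42%, which matches the score shared by many attributes in the table.

```python
from collections import Counter, defaultdict

def train_one_r(values, labels):
    """Learn a one-attribute rule: each value predicts its majority class."""
    by_value = defaultdict(Counter)
    for v, y in zip(values, labels):
        by_value[v][y] += 1
    # Fallback class for attribute values never seen during training.
    default = Counter(labels).most_common(1)[0][0]
    rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
    return rule, default

def one_r_accuracy(rule, default, values, labels):
    """Percent of instances whose class the rule predicts correctly."""
    hits = sum(rule.get(v, default) == y for v, y in zip(values, labels))
    return 100.0 * hits / len(labels)
```

In a cross-validated setting, `train_one_r` would be fitted on four groups and `one_r_accuracy` evaluated on the held-out group.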

 


The attributes not described in the main text are described below:

Secondary structures

Protein secondary structures were assigned by STRIDE (Frishman and Argos, 1995). STRIDE can produce 7 different structural states: H (α-helix), C (coil), T (turn), G (3-10 helix), B (isolated bridge), E (extended) and I (π-helix). In our dataset, only the first 6 states were assigned. The structural state of the variant and wildtype protein at the SAP position and the 3-mer structural state string (the SAP site plus one left and one right neighbor) were extracted as attributes. Two further attributes indicate whether the single-site state and the 3-mer state string, respectively, are the same in the variant and wildtype proteins.
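The extraction of these attributes can be sketched as below, assuming the STRIDE assignments are available as one-letter state strings for the wildtype and variant structures. The function name, the dictionary keys, and the "-" padding used at chain termini are illustrative assumptions.

```python
def secondary_structure_attributes(wt_states, var_states, pos):
    """Build secondary structure attributes for the SAP at 0-based index pos.

    wt_states, var_states: strings of one-letter STRIDE states
    (H, C, T, G, B, E), one character per residue, equal length.
    """
    def three_mer(states):
        # Pad with "-" when the SAP sits at a chain terminus (assumption).
        left = states[pos - 1] if pos > 0 else "-"
        right = states[pos + 1] if pos + 1 < len(states) else "-"
        return left + states[pos] + right

    wt3, var3 = three_mer(wt_states), three_mer(var_states)
    return {
        "wt_state": wt_states[pos],
        "var_state": var_states[pos],
        "wt_3mer": wt3,
        "var_3mer": var3,
        "state_changed": wt_states[pos] != var_states[pos],
        "three_mer_changed": wt3 != var3,
    }
```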

Dihedral angles

The main-chain dihedral angles φ and ψ and the first side-chain dihedral angle χ1 were calculated using MMTSB (Feig, et al., 2004). φ and ψ correlate strongly with secondary structure, as reflected in the Ramachandran plot, while χ1 approximates the side-chain orientation. The φ, ψ and χ1 angles of the variant and the wildtype protein at the SAP position were all used as attributes. For each angle, the difference between the variant and wildtype protein and the absolute value of that difference were also added as attributes.
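Because dihedral angles are periodic, a plain subtraction can overstate the change (for example, 170° and -170° differ by only 20°). The sketch below wraps the difference into the interval (-180°, 180°]. Whether the original attributes used such wrapping is not stated in the text, so this is an assumption, and the function names are hypothetical.

```python
def angle_difference(variant_deg, wildtype_deg):
    """Signed difference variant - wildtype in degrees, wrapped to (-180, 180]."""
    d = (variant_deg - wildtype_deg) % 360.0  # now in [0, 360)
    if d > 180.0:
        d -= 360.0
    return d

def dihedral_attributes(variant_deg, wildtype_deg):
    """Return the (difference, absolute difference) pair for one dihedral angle."""
    d = angle_difference(variant_deg, wildtype_deg)
    return d, abs(d)
```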

RMSD -- root mean square deviation

The backbone RMSD values were calculated after fitting all Cα atoms of the variant protein to the corresponding Cα atoms of the wildtype protein. The RMSD of the residue at the SAP position, the RMSD of 2n+1 residues (the SAP position plus n left and n right neighbors; n = 1, 2, 3, 4, 5), and the RMSD of all residues were evaluated as candidate attributes.
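The windowed RMSD can be sketched as follows. The sketch assumes the two structures have already been superimposed, so the fitting step itself is not shown, and that each residue is given as a list of backbone atom coordinates in the same order in both structures. Truncating the window at chain termini is also an assumption, and the function name is hypothetical.

```python
import math

def window_rmsd(wt_coords, var_coords, pos, n):
    """RMSD over residues pos-n .. pos+n of two pre-superimposed structures.

    wt_coords, var_coords: one entry per residue, each a list of (x, y, z)
    tuples for its backbone atoms. pos is the 0-based SAP index.
    """
    start = max(0, pos - n)
    end = min(len(wt_coords), pos + n + 1)  # window is cut at the termini
    squared = []
    for wt_res, var_res in zip(wt_coords[start:end], var_coords[start:end]):
        for a, b in zip(wt_res, var_res):
            squared.append(sum((p - q) ** 2 for p, q in zip(a, b)))
    return math.sqrt(sum(squared) / len(squared))
```

With n = 2 the window covers 5 consecutive residues, which corresponds to the RMSD attribute listed in Table S4.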

BLOSUM62 and GRANTHAM matrix

BLOSUM matrices are a series of amino acid substitution matrices derived from the BLOCKS database and are widely used in sequence alignment and similarity searches (Henikoff and Henikoff, 1992). The GRANTHAM matrix quantifies the physicochemical differences between amino acids (Grantham, 1974). For each SAP, the corresponding scores in the BLOSUM62 and GRANTHAM matrices were used as attributes.

 

References:

Feig, M., et al. (2004) MMTSB Tool Set: enhanced sampling and multiscale modeling methods for applications in structural biology, J Mol Graph Model, 22, 377-395.
Frishman, D. and Argos, P. (1995) Knowledge-based protein secondary structure assignment, Proteins, 23, 566-579.
Grantham, R. (1974) Amino acid difference formula to help explain protein evolution, Science, 185, 862-864.
Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks, Proc Natl Acad Sci U S A, 89, 10915-10919.

 

 

Copyright© 2006-2007, CBI All Rights Reserved.