The dataset and source code

File List
File name Description
sap_5parts.tar.gz Data Code Code

Detailed description:

The archive sap_5parts.tar.gz contains the dataset, split into the 5-group partition described in the Methods section of our manuscript. We use this partition to obtain the 5-fold cross-validation accuracy. The part[1-5].csv files are human-readable, while the part[1-5].libsvm files are ready for use with LIBSVM.

The two code files listed above are our own wrappers for calling LIBSVM. The corresponding script shipped with LIBSVM can only perform the default random or stratified cross-validation, so we wrote these two wrappers ourselves. The former was obtained by modifying the original script, while the latter is called by the former.

A description of the fields in the csv files can be found HERE.

The 5-group partition of the dataset is made at the protein level:

  1. All SAPs from the same protein are placed in a single group, i.e. they are not permitted to reside in two or more different groups.
  2. Each group has nearly the same number of SAPs.
  3. The ratio of "Disease" SAPs to "Polymorphism" SAPs is kept nearly the same in each group, that is, about 2 to 1.
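
The three constraints above can be illustrated with a minimal Python 3 sketch. It is not the procedure used for the manuscript, which this page does not describe; it is a simple greedy heuristic shown only to make the constraints concrete. The function name partition_by_protein and the (protein_id, label) input format are assumptions made for illustration.

```python
from collections import defaultdict


def partition_by_protein(saps, n_groups=5):
    """Greedily assign whole proteins to groups (illustrative heuristic).

    saps: iterable of (protein_id, label) pairs, where label is
    "Disease" or "Polymorphism" (hypothetical input format).
    Returns a dict mapping protein_id -> group index (0-based).
    """
    # Count [Disease, Polymorphism] SAPs for every protein.
    counts = defaultdict(lambda: [0, 0])
    for protein_id, label in saps:
        counts[protein_id][0 if label == "Disease" else 1] += 1

    # Ideal number of SAPs of each class per group.
    target_d = sum(c[0] for c in counts.values()) / n_groups
    target_p = sum(c[1] for c in counts.values()) / n_groups

    def deviation(d, p):
        # Squared distance of a group's class counts from the targets.
        return (d - target_d) ** 2 + (p - target_p) ** 2

    groups = [[0, 0] for _ in range(n_groups)]
    assignment = {}
    # Place the largest proteins first; ties are broken by id so the
    # result is deterministic.
    for protein_id in sorted(counts, key=lambda k: (-sum(counts[k]), k)):
        d, p = counts[protein_id]

        def delta(g):
            # Change in the group's deviation if this protein is added.
            gd, gp = groups[g]
            return deviation(gd + d, gp + p) - deviation(gd, gp)

        best = min(range(n_groups), key=delta)
        groups[best][0] += d
        groups[best][1] += p
        # A protein is stored under exactly one group (constraint 1).
        assignment[protein_id] = best
    return assignment
```

Because a protein is never split, constraint 1 holds by construction; constraints 2 and 3 are only approximated, since the heuristic balances the per-class counts as well as the protein sizes allow.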

Using them with LIBSVM

To reproduce the results presented in the manuscript, follow these steps:

  1. System requirements

    A Linux/Unix platform with Perl and Python installed. A C++ compiler such as GCC and the make utility are also required to build LIBSVM. (Almost any modern Linux distribution already comes equipped with them. The Windows platform should also work, but we have not tested it yet. Sorry for the inconvenience.)

  2. Make a directory for testing

    Go to a clean directory, say, /tmp/testsap.
    [yezq@ala]$ cd /tmp
    [yezq@ala]$ mkdir testsap
    [yezq@ala]$ cd testsap
  3. Download LIBSVM

    Get version 2.82 from HERE or get the newest version from its original site.

  4. Uncompress it and build it.

    [yezq@ala]$ tar xzf libsvm-2.82.tar.gz
    [yezq@ala]$ cd libsvm-2.82
    [yezq@ala]$ make
  5. Change the working directory to tools

    [yezq@ala]$ cd tools
  6. Download the supplementary files and uncompress them.

    Download sap_5parts.tar.gz and the two code files from the file list above, and then uncompress them.
    [yezq@ala]$ gunzip *.gz
    [yezq@ala]$ tar xf sap_5parts.tar
  7. Make the Perl and Python scripts executable.

    [yezq@ala]$ chmod u+x
    [yezq@ala]$ chmod u+x
  8. Run 5-fold cross-validation.

    [yezq@ala]$ cp sap_5parts/part[1-5].libsvm .
    [yezq@ala]$ ./ -out sap_5parts.out

    Note that this step is computationally intensive. After about half an hour (on a computer with two 2.2 GHz Xeon CPUs), the results in sap_5parts.out should look like THIS. The results are also printed to the screen. If you have more CPUs, you can change the line "nr_local_workers =3" in "" to "nr_local_workers=N", where N represents the number of your CPUs.
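
As an illustration of what cross-validation over predefined groups involves, the following Python 3 sketch prepares the five train/test splits: each part file is held out once, and the remaining four are concatenated into a training file. This is not the wrapper distributed here, only a minimal sketch; the function name make_fold_files and the output file names are hypothetical.

```python
import os


def make_fold_files(part_files, out_dir):
    """Build one training file per held-out part.

    part_files: list of paths to files in LIBSVM format, one per group.
    Returns a list of (train_path, test_path) pairs, one per fold.
    """
    os.makedirs(out_dir, exist_ok=True)
    folds = []
    for i, test_path in enumerate(part_files, start=1):
        # Hypothetical naming scheme for the generated training files.
        train_path = os.path.join(out_dir, "train_without_part%d.libsvm" % i)
        with open(train_path, "w") as out:
            for other in part_files:
                if other == test_path:
                    continue  # the held-out group is never used for training
                with open(other) as src:
                    for line in src:
                        if line.strip():  # skip blank lines
                            out.write(line.rstrip("\n") + "\n")
        folds.append((train_path, test_path))
    return folds
```

With the part files from this step in the current directory, make_fold_files(["part%d.libsvm" % i for i in range(1, 6)], "folds") would write five training files. Each (train, test) pair could then be passed to LIBSVM's svm-train and svm-predict programs, and the predictions over all five held-out parts pooled to compute the overall accuracy.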

  9. Find the line with the highest ACC or MCC:

    1 -7 Records: 3438      ACC= 0.826062   MCC= 0.604331   BER= 0.224120

    That is, when (log2C, log2γ) = (1, -7), the overall 5-fold cross-validation accuracy is 82.61% and the MCC is about 0.60. Furthermore, by using a smaller step in the grid search, you can fine-tune (log2C, log2γ) in the vicinity of (1, -7) to get higher accuracy. This grid search is important: as the result file shows, the ACC can be as low as 65% and the MCC as low as 0 if we choose (log2C, log2γ) = (-1, -13).
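
Picking the best line by eye is error-prone for a large grid, so a small Python 3 sketch for doing it programmatically follows. It assumes that every result line has exactly the layout of the sample line in this step, and that the second column is the log2 of the kernel parameter gamma. The names parse_results and best_by are hypothetical and are not part of LIBSVM or of the scripts distributed here.

```python
import re

# Matches lines such as:
# "1 -7 Records: 3438      ACC= 0.826062   MCC= 0.604331   BER= 0.224120"
LINE_RE = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Records:\s*(\d+)"
    r"\s+ACC=\s*(-?[\d.]+)\s+MCC=\s*(-?[\d.]+)\s+BER=\s*(-?[\d.]+)"
)


def parse_results(lines):
    """Return one dict per line that matches the assumed layout."""
    records = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a result line
        records.append({
            "log2c": float(match.group(1)),
            "log2g": float(match.group(2)),
            "records": int(match.group(3)),
            "acc": float(match.group(4)),
            "mcc": float(match.group(5)),
            "ber": float(match.group(6)),
        })
    return records


def best_by(records, key):
    """Return the record with the highest value for key ("acc" or "mcc")."""
    return max(records, key=lambda record: record[key])
```

For example, with records = parse_results(open("sap_5parts.out")), the call best_by(records, "mcc") would return the grid point with the highest MCC.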

  10. The plots demonstrating the grid search of LIBSVM parameters:

    To illustrate the grid search over the LIBSVM parameters, we made two contour plots: ACC and MCC vs. (log2C, log2γ). The data are based on the output of steps 8 and 9.

    Plot 1: The contour of ACC vs. (log2C, log2γ)

    Plot 2: The contour of MCC vs. (log2C, log2γ)
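
For readers who want to draw similar contours from their own result file, the Python 3 sketch below rearranges (log2C, log2 gamma, value) triples into the rectangular grid that contouring routines usually expect. It is an assumption about how such plots could be prepared, not a description of how the plots above were made; the function name to_grid is hypothetical.

```python
def to_grid(points, fill=float("nan")):
    """Arrange scattered grid-search results into a 2-D table.

    points: iterable of (log2c, log2g, value) triples.
    Returns (c_values, g_values, z), where z[i][j] is the value at
    (c_values[j], g_values[i]); missing combinations are set to fill.
    """
    points = list(points)
    c_values = sorted({c for c, _, _ in points})
    g_values = sorted({g for _, g, _ in points})
    c_index = {c: j for j, c in enumerate(c_values)}
    g_index = {g: i for i, g in enumerate(g_values)}
    # One row per log2 gamma value, one column per log2C value.
    z = [[fill] * len(c_values) for _ in g_values]
    for c, g, value in points:
        z[g_index[g]][c_index[c]] = value
    return c_values, g_values, z
```

The returned layout (rows indexed by the y axis, columns by the x axis) is the one accepted by, for example, matplotlib's contourf(c_values, g_values, z), if that library is available.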


Copyright© 2006-2007, CBI All Rights Reserved.