MsPI is a software tool for protein identification from peptide mass fingerprinting (PMF) data.
It is freely available to non-profit institutions at http://aimed11.unipv.it/MsPI. Other organizations must request written authorization from mspi@aimed11.unipv.it before using this software. The procedure implemented in version 1.2 is described in the paper "A Perl Procedure for protein identification by Peptide Mass Fingerprinting" by Tiengo et al., submitted to BMC Bioinformatics.

The distribution directory contains the following files:

- README (this file) contains the main instructions for installing the software tool and for using the Perl scripts and the associated ASCII files.

- MsPI_x.x.zip files contain version x.x of the software tool.

The Perl scripts and text files are briefly described below.


How to install MsPI
-------------------

1) If Perl is not already installed, download the ActivePerl distribution from http://www.activestate.com/Products/activeperl (or your preferred Perl distribution) and install it.

2) Download the desired version of the tool (MsPI_x.x.zip file) and extract all files into the MsPI root directory with your preferred software.

3) Install the required Perl modules (see the following section) if they are not yet installed.


Required Perl modules
---------------------

- Cwd -> This module provides some functions for determining the path name of the current working directory.

- LWP -> This module provides the program "lwp-download", which downloads files from a specified FTP address.

- IO::Zlib -> This module provides some functions to read and write gzip/zlib compressed files.

- Math::BigFloat -> This module supports floating-point numbers of arbitrary precision.

- Math::BigInt -> This module supports integers of arbitrary size.


Configuration file
------------------

- settings.ini -> This file contains the main configuration parameters for the MsPI tool.
It is stored in the MsPI_x.x/src directory.
Each line ends with a newline.
The parameters are, in order:
* the platform used (e.g. WIN or other)
* the FTP address of the Swiss-Prot database (e.g. ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase)
* the accession numbers of bovine trypsin (e.g. P00760 Q29463)
* the accession numbers of porcine trypsin (e.g. P00761 Q9N2D1)
* the accession numbers of the keratins (e.g. P04264 P35908 P35527 P13645)
* the contaminant frequency threshold (e.g. 1e-5)
* the maximum number of consecutive missed cleavages (MCs) allowed when creating the reference database (e.g. 2)
* a flag for the presence of post-translational modifications (PTMs) in the reference database (Y or N)
* the maximum number of variable PTMs allowed in the same peptide when creating the reference database (e.g. 2)
* the lower bound of the acquisition mass range (e.g. 800)
* the upper bound of the acquisition mass range (e.g. 5000)
* the relative tolerance window on the electrophoretic MW, expressed as a fraction (from 0 to 1)
* the pI tolerance window on the electrophoretic pI (e.g. 1)
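
As an illustration of the layout described above, the following Perl sketch maps the lines of settings.ini onto named fields. It is not taken from the MsPI sources: the key names, the helper names parse_settings and read_settings, and the assumption that each line holds one bare value (no "key=value" syntax, no comments) are all hypothetical. The value 0.2 for the MW tolerance is an arbitrary example within the documented 0 to 1 range.

```perl
use strict;
use warnings;

# Hypothetical key names, in the order the parameters are listed above.
my @KEYS = qw(
    platform swissprot_ftp bovine_trypsin porcine_trypsin keratins
    contaminant_threshold max_mc ptm_flag max_variable_ptm
    mass_min mass_max mw_tolerance pi_tolerance
);
# Parameters assumed to hold a space-separated list of accession numbers.
my @LIST_KEYS = qw(bovine_trypsin porcine_trypsin keratins);

# Map the lines of settings.ini onto named fields.
# Assumption: one bare value per line, blank lines ignored.
sub parse_settings {
    my @values = grep { length }
                 map  { my $v = $_; $v =~ s/^\s+|\s+$//g; $v } @_;
    die sprintf("Expected %d parameters, found %d\n",
                scalar @KEYS, scalar @values)
        unless @values == @KEYS;
    my %s;
    @s{@KEYS} = @values;
    $s{$_} = [ split ' ', $s{$_} ] for @LIST_KEYS;
    return \%s;
}

# Read a settings file from disk and parse it.
sub read_settings {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    my @lines = <$fh>;
    close $fh;
    return parse_settings(@lines);
}

# Demo with the example values quoted in this README.
my $s = parse_settings(
    "WIN\n",
    "ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase\n",
    "P00760 Q29463\n",
    "P00761 Q9N2D1\n",
    "P04264 P35908 P35527 P13645\n",
    "1e-5\n", "2\n", "Y\n", "2\n", "800\n", "5000\n",
    "0.2\n",    # arbitrary illustrative MW tolerance
    "1\n",
);
printf "%s platform, mass range %d-%d, %d keratin accessions\n",
    $s->{platform}, $s->{mass_min}, $s->{mass_max},
    scalar @{ $s->{keratins} };
```

With a real installation, read_settings("settings.ini") could be called from the MsPI_x.x/src directory instead of passing the lines by hand.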


Perl scripts
------------

They are stored in the MsPI_x.x/src directory.

- swiss2MsPI.pl -> This script downloads the latest release of the Swiss-Prot database from the address specified in the settings.ini file (line 2) and stores it in the MsPI_x.x/tmp directory.

- create_database.pl -> This script creates the reference protein database from the Swiss-Prot database downloaded by the swiss2MsPI.pl routine.
It uses the file organism.txt to build a menu for selecting the organisms of interest, and the files table_mw.txt and table_pi.txt to compute the molecular weight (MW) and the isoelectric point (pI) of each processed protein. For every protein it includes the missed cleavages (MCs) and the post-translational modifications (PTMs), and it computes the number of peptides that fall in the specified acquisition range. Finally, it defines the contaminant mass list and computes the frequency of the contaminants in the reference protein database.

- create_database_random.pl -> This script creates the database of random proteins used for the statistical evaluation of the results.

- make_grid.pl -> This script is called by the score_PMF.pl routine to estimate the peptide mass probability density in the reference and random databases. This estimate is needed when scoring method 3 is used and the peptide mass distribution is not assumed to be uniform.

- score_PMF.pl -> This script performs the protein identification. It requires an input file with the main parameters (see the "Example of input file" section) as a command-line argument. If the input file defines an acquisition mass range different from the one in the settings.ini file, the routine recomputes the number of peptides of each protein in the reference database for the new mass range.
If scoring method 3 is chosen and the peptide mass distribution in the reference database is not assumed to be uniform, the routine make_grid.pl is called. The identification results are written to the file chosen by the user in the MsPI_x.x/results directory.
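
To make the notion of consecutive missed cleavages concrete, here is a minimal Perl sketch of an in-silico tryptic digestion. It is not the code of create_database.pl. The cleavage rule (cut after K or R unless the next residue is P) is the commonly used trypsin rule and is assumed here, the helper names tryptic_fragments and digest are hypothetical, and the protein sequence is made up for the demo.

```perl
use strict;
use warnings;

# Split a protein into fully cleaved tryptic fragments.
# Assumed rule: cleave after K or R, except when followed by P.
sub tryptic_fragments {
    my ($protein) = @_;
    return grep { length } split /(?<=[KR])(?!P)/, $protein;
}

# Return every peptide made of 1 .. ($max_mc + 1) adjacent fragments,
# i.e. peptides containing at most $max_mc consecutive missed cleavages.
sub digest {
    my ($protein, $max_mc) = @_;
    my @frag = tryptic_fragments($protein);
    my @peptides;
    for my $i (0 .. $#frag) {
        for my $mc (0 .. $max_mc) {
            last if $i + $mc > $#frag;
            push @peptides, join '', @frag[ $i .. $i + $mc ];
        }
    }
    return @peptides;
}

# Demo on a made-up sequence, allowing one missed cleavage.
print "$_\n" for digest('AAKCCRPAKGGR', 1);
```

Restricting the resulting peptides to the acquisition mass range would then require their masses, which depend on the residue masses in table_mw.txt.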


Text files
----------

They are stored in the MsPI_x.x/src directory.
Each line of these files ends with a newline.

- organism.txt -> It contains the names of the organisms that may be included in the reference protein database:
*Mus musculus
*Homo sapiens
*...
*All
The keyword "All" includes all the organisms.

- table_mw.txt -> It contains the list of amino acids with their monoisotopic and average weights. Each amino acid is represented by its one-letter code. The elements on each line are separated by a tab character:
*A	71.03711	71.0788
*C	103.00919	103.1388
*...

- table_pi.txt -> It contains the amino acids and the groups that influence the pI of a protein. Each line reports the amino acid letter, the pKr and the charge polarity, separated by a tab character:
*Y	10.07	-
*H	6.0	+
*C_term	3.1	-
*N_term	8	+
*...

- table_ptm.txt -> It contains the PTMs that can be included in the reference protein database. Each line reports the name, the amino acid involved (one-letter code), the monoisotopic weight, the average weight and the type (fixed F or variable V), separated by a tab character:
*CAM	C	57.021464	57.0513	F
*...
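
The three tab-separated tables above can be read with the same small parser. The Perl sketch below shows one plausible way to use them: a monoisotopic peptide mass computed as the sum of the residue weights, plus fixed PTM shifts, plus one water molecule, and a pI estimated by bisection on the net charge given by the Henderson-Hasselbalch equation. This is an assumption about how such tables are typically used, not the code of create_database.pl. All sub names are hypothetical, the leading "*" of the listings is assumed not to be part of the files, and only the rows quoted in this README are used, so the demo peptides are limited to the residues A, C, H and Y. The "//" operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Monoisotopic mass of one water molecule, added once per peptide chain.
use constant WATER_MONO => 18.010565;

# Split tab-separated table lines into array refs of fields.
sub parse_table {
    my @rows;
    for my $line (@_) {
        (my $l = $line) =~ s/\r?\n\z//;
        next if $l =~ /^\s*$/;
        push @rows, [ split /\t/, $l ];
    }
    return @rows;
}

# Monoisotopic peptide mass: residue masses + fixed PTM shifts + water.
sub peptide_mass {
    my ($seq, $mono, $fixed) = @_;
    my $mass = WATER_MONO;
    for my $aa (split //, $seq) {
        die "No mass for residue '$aa'\n" unless exists $mono->{$aa};
        $mass += $mono->{$aa} + ($fixed->{$aa} // 0);
    }
    return $mass;
}

# Net charge at a given pH, summed over every ionisable group found in
# the pK table, plus the two termini (Henderson-Hasselbalch).
sub net_charge {
    my ($seq, $pk, $ph) = @_;
    my %count = (N_term => 1, C_term => 1);
    $count{$_}++ for split //, $seq;
    my $q = 0;
    for my $g (grep { exists $pk->{$_} } keys %count) {
        my ($pka, $sign) = @{ $pk->{$g} };
        $q += $sign eq '+'
            ?  $count{$g} / (1 + 10 ** ($ph - $pka))
            : -$count{$g} / (1 + 10 ** ($pka - $ph));
    }
    return $q;
}

# pI by bisection: the net charge decreases monotonically with pH.
sub isoelectric_point {
    my ($seq, $pk) = @_;
    my ($lo, $hi) = (0, 14);
    for (1 .. 60) {
        my $mid = ($lo + $hi) / 2;
        if (net_charge($seq, $pk, $mid) > 0) { $lo = $mid } else { $hi = $mid }
    }
    return ($lo + $hi) / 2;
}

# Demo using only the rows quoted in this README.
my %mono  = map { $_->[0] => $_->[1] }
            parse_table("A\t71.03711\t71.0788\n", "C\t103.00919\t103.1388\n");
my %fixed = map  { $_->[1] => $_->[2] }
            grep { $_->[4] eq 'F' }
            parse_table("CAM\tC\t57.021464\t57.0513\tF\n");
my %pk    = map { $_->[0] => [ $_->[1], $_->[2] ] }
            parse_table("Y\t10.07\t-\n", "H\t6.0\t+\n",
                        "C_term\t3.1\t-\n", "N_term\t8\t+\n");

printf "mass(ACA, CAM on C) = %.6f\n", peptide_mass('ACA', \%mono, \%fixed);
printf "pI(HY)              = %.2f\n", isoelectric_point('HY', \%pk);
```

Variable PTMs (type V) are not handled here: they would multiply the number of candidate masses per peptide, up to the limit set in settings.ini.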


Example of input file
---------------------

This is an example of the input file required to perform a protein identification with the score_PMF.pl routine.
It must be stored in the MsPI_x.x/data directory.
Each line of this file ends with a newline.

* peak list file (e.g. band.txt)
* results file (e.g. results.txt)
* electrophoretic MW (0 if unknown)
* electrophoretic pI (0 if unknown)
* mass tolerance in daltons (D) or ppm (P) (e.g. D 0.5 or P 100)
* lower bound of the acquisition mass range
* upper bound of the acquisition mass range
* scoring method (1, 2, 3a for a uniform peptide mass distribution or 3b for a non-uniform peptide mass distribution in the reference database)
* p-value threshold (e.g. 0.05)
* contaminant flag (1 for removing bovine trypsin and other contaminants, 2 for porcine trypsin and other contaminants, 3 for bovine trypsin only, 4 for porcine trypsin only, 5 for contaminants only or 0 to not remove the contaminants)
* maximum number of consecutive MCs allowed in the search (e.g. 2)
* number of PTMs allowed in the search (e.g. 2)
* charge of the m/z data (e.g. 0 or 1)
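
The mass tolerance line above can be interpreted as in the following Perl sketch, which parses a "D 0.5" or "P 100" specification and counts how many query masses match a list of theoretical peptide masses. It is an illustration only, not the matching code of score_PMF.pl. The sub names are hypothetical, the ppm window is assumed to be relative to the theoretical mass, and the masses in the demo are made-up numbers.

```perl
use strict;
use warnings;

# Parse a tolerance specification such as "D 0.5" or "P 100".
sub parse_tolerance {
    my ($spec) = @_;
    my ($unit, $value) = $spec =~ /^\s*([DP])\s+([0-9.eE+-]+)\s*$/
        or die "Bad tolerance specification '$spec'\n";
    return { unit => $unit, value => $value + 0 };
}

# True if the observed mass lies within the tolerance window around the
# theoretical mass. Assumption: ppm is relative to the theoretical mass.
sub within_tolerance {
    my ($observed, $theoretical, $tol) = @_;
    my $window = $tol->{unit} eq 'P'
        ? $theoretical * $tol->{value} / 1e6
        : $tol->{value};
    return abs($observed - $theoretical) <= $window;
}

# Count the query masses that match at least one theoretical mass.
sub count_matches {
    my ($queries, $theoretical, $tol) = @_;
    my $n = 0;
    for my $q (@$queries) {
        $n++ if grep { within_tolerance($q, $_, $tol) } @$theoretical;
    }
    return $n;
}

# Demo with made-up masses.
my $tol = parse_tolerance('P 100');
my $hits = count_matches([ 1000.05, 2000.50 ], [ 1000.00, 2000.00 ], $tol);
print "matched query masses: $hits\n";
```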


If the p-value threshold is set to 0, the statistical validation of the results is not performed and the random database is not used.


Output file
-----------

The output file reports the list of candidate proteins ranked by score.
It is stored in the MsPI_x.x/results directory.
Each line contains a candidate protein. For each candidate protein, the following information is reported: the accession number, the ID, the organism, a brief description, the MW, the pI, the total number of amino acids, the number of query masses submitted, the number of matches, the number of all possible peptides, the score, the p-value (if computed), the quality index (if computed), the percent protein coverage by the matched peptides and the list of matched peptides (only in the complete file).


How to run MsPI
---------------

1) Create the input file (see the "Example of input file" section).

2) Open the settings.ini file in the MsPI_x.x/src directory to configure some parameters. In particular, set:

- the platform used
- the Swiss-Prot FTP address
- the contaminant frequency threshold
- the maximum number of consecutive MCs allowed
- the PTMs flag used to build the reference database
- the maximum number of variable PTMs allowed in the same peptide
- the lower bound of the acquisition mass range
- the upper bound of the acquisition mass range
- the MW % tolerance window on the electrophoretic MW
- the pI tolerance window on the electrophoretic pI

3) Open the table_ptm.txt file in the MsPI_x.x/src directory to set the PTMs to be included in the reference database.

4) Run the Perl routines as follows:

 - to download a new release of the Swiss-Prot database: perl swiss2MsPI.pl
 - to create the reference database: perl create_database.pl
 - to create the random database: perl create_database_random.pl
 - to perform a protein identification: perl score_PMF.pl input_file
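
The commands of step 4 can also be chained from a small driver script. The Perl sketch below is a convenience wrapper and not part of the MsPI distribution: the script name run_mspi.pl, the sub names and the --download and --rebuild flags are hypothetical, and it assumes it is started from the MsPI_x.x/src directory so that the four scripts are found.

```perl
use strict;
use warnings;

# Build the list of commands, in the order given in step 4.
# Each entry is [label, program, arguments...].
sub pipeline_steps {
    my ($input_file, %opt) = @_;
    my @steps;
    push @steps, [ 'download Swiss-Prot', 'perl', 'swiss2MsPI.pl' ]
        if $opt{download};
    push @steps, [ 'reference database', 'perl', 'create_database.pl' ],
                 [ 'random database',    'perl', 'create_database_random.pl' ]
        if $opt{rebuild};
    push @steps, [ 'identification', 'perl', 'score_PMF.pl', $input_file ];
    return @steps;
}

# Run each command and stop at the first failure.
sub run_pipeline {
    for my $step (@_) {
        my ($label, @cmd) = @$step;
        print "== $label: @cmd\n";
        system(@cmd) == 0
            or die "Step '$label' failed (status $?)\n";
    }
}

# Nothing is executed unless an input file is given, for example:
#   perl run_mspi.pl input_file --download --rebuild
if (@ARGV) {
    my ($input_file, @flags) = @ARGV;
    my %opt = map { (my $f = $_) =~ s/^--//; ($f => 1) } @flags;
    run_pipeline(pipeline_steps($input_file, %opt));
}
```

Because create_database.pl presents a menu for choosing the organisms, the wrapper should be run from an interactive terminal when --rebuild is used.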