
In September 2002, Duke University Medical Center will host the First Annual Proteomics Data Mining Conference. Here you can read an extended abstract of the results obtained by applying multileveled self-organization to one of the two problems set out by the program committee:

"BACKGROUND
Over the last several years there has been tremendous interest in proteomics, accompanied by significant changes in technology. As the field continues to expand, there is an increasing need to mine large amounts of data.

A number of approaches have been used to determine differential protein expression, although much of the data mining has revolved around 2D gels. Other techniques including mass spectrometry, which has typically been used to identify specific proteins by peptide fingerprinting, are emerging as possible discovery methods. Mass spectrometry has the capability to produce information-rich patterns of protein expression, although data analysis remains difficult as most schemes are relatively crude and have not been well developed. Although similar types of clustering and discriminant problems have been explored in the field of genomics, relatively little has been done in proteomics.

There are clear differences between the data format in genomics and proteomics, however, particularly given the complexity of mass spectrometry. The purpose of this conference is to present several problems in mass spectrometry analysis and encourage interested individuals to submit solutions in the form of an extended abstract.

CONFERENCE PROBLEMS EXPLAINED
There are two fundamental problems put forward as a challenge and basis for this conference. Anyone may submit one or more solutions. The Rules and Guidelines are outlined in the following section.

Problem # 1
Data from forty-one clinical specimens are in the Problem #1 data sets (labeled X01-X41). Each file contains data produced from one clinical specimen, and represents the proteins found by mass spectrometry from that specimen (Please refer to No. 3 below for an explanation of the data format.). Some of the specimens are from a group (Group A) with a specific disease and some are from a group (Group B) without disease. Can you separate these 41 specimens into the two groups, Group A and B, and what are the specific differences in protein expression (or patterns) between the two groups?

Problem # 2
In the Problem # 2 data sets, there are 2 folders: Group A and Group B. There are 24 spectra in Group A and 17 spectra in Group B. Each file contains data produced from one clinical specimen, and represents the proteins found by mass spectrometry from that specimen. Each file in Group A is a different specimen, but all in the group have the same disease. Each file in Group B is a different specimen, but all in the group are without disease. Can you find differences between Group A and Group B, and determine what the specific differences in protein expression between the two groups are? It is suggested that you use some of the files for cross validation.

There will be two separate data sets for each Problem: a Processed Data Set (PDS) and a Raw Data Set (RDS). You may elect to solve the problems using one or both of these data sets.

The PDS contains only the peaks (or proteins) from the raw data set. The offset in the mass spectrometry baseline has not been eliminated. All PDS files are Excel files. The data for Problem #1 is contained in one file, named PROB1PDS.xls. The data for Problem #2 is in 2 files, named PROB2PDS_A.xls and PROB2PDS_B.xls. Data for each specimen exist as sheets within each parent Excel file. Data in each sheet are arranged in 3 columns.

The first column is the "fraction number", ranging from 1 to 20. Fractionation of the sample prior to mass spectrometry is advantageous in the effort to visualize as many proteins as possible. If complex samples are analyzed by mass spectrometry without fractionation, a total of 30-50 peaks (proteins) would be seen. With fractionation, the number of peaks in any fraction remains around 30-50, but the total number of proteins visualized in that complex sample increases because there are more fractions.

Note, some proteins may be found in more than one fraction (this has to do with the separation technique), but remember that such peaks represent only one protein. The mass accuracy of any m/z determination is approximately 0.1%.

The second column is the mass-to-charge ratio, or m/z, of the peak (x-axis). The third column is the peak height (y-axis) or abundance of each protein ion.

Note, the absolute peak height in any fraction cannot necessarily be compared to that in any other fraction. If you want to explore the height as a discriminating parameter, some sort of normalization within each fraction is recommended. It appears that the ratio of any two peaks within a given fraction may hold diagnostic information."
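The data layout described in the problem statement (fraction number, m/z, peak height) and its two hints (normalize within each fraction; consider ratios of peaks within a fraction) can be illustrated with a minimal Python sketch. This is not the method used for the results reported on this site. The function names, the choice of total-height normalization, the use of the quoted 0.1% mass accuracy as a matching tolerance, and the toy peak values are all assumptions made for illustration only.

```python
import math
from collections import defaultdict

# Toy peaks as (fraction, mz, height), mirroring the three sheet columns.
# These numbers are invented for illustration; they are not conference data.
peaks = [
    (1, 5000.0, 200.0),
    (1, 8000.0, 600.0),
    (2, 5003.0, 50.0),
    (2, 12000.0, 150.0),
]

def normalize_within_fraction(peaks):
    """Scale heights so they sum to 1 inside each fraction.

    Total-height scaling is an assumed choice; the problem statement only
    recommends "some sort of normalization within each fraction".
    """
    totals = defaultdict(float)
    for fraction, _mz, height in peaks:
        totals[fraction] += height
    return [(f, mz, h / totals[f]) for f, mz, h in peaks if totals[f] > 0]

def same_mz(mz_a, mz_b, rel_tol=0.001):
    """Treat two m/z values as one protein if they agree within rel_tol.

    The default of 0.1% is taken from the quoted mass accuracy; a looser
    tolerance may be appropriate when both values carry measurement error.
    """
    return math.isclose(mz_a, mz_b, rel_tol=rel_tol)

def peak_ratio(peaks, fraction, mz_a, mz_b, rel_tol=0.001):
    """Height ratio of two peaks in one fraction, or None if unavailable."""
    def height_at(target):
        hits = [h for f, mz, h in peaks
                if f == fraction and same_mz(mz, target, rel_tol)]
        return max(hits) if hits else None

    a, b = height_at(mz_a), height_at(mz_b)
    if a is None or b is None or b == 0:
        return None
    return a / b

normalized = normalize_within_fraction(peaks)
print(normalized)
print(same_mz(5000.0, 5003.0))
print(peak_ratio(peaks, 1, 8000.0, 5000.0))
```

A ratio of two peaks within the same fraction is unchanged by any normalization that rescales the whole fraction by a single factor. Under that assumption, within-fraction ratios can be compared between specimens even though absolute heights cannot be compared across fractions.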


© 2001 Plum Amazing