|
Bioinformatics
| Project Director, Co-PI | John Rose, UGA |
| Project Director (tools and analysis) | Dawei Lin, UGA |
| Senior programmer | Jeremy Praissman, UGA |
| Statistician/Programmer | Xi (Catherine) Chen, UGA |
| Project Director (databases) | Alan Gingle, UGA |
| Database Programmer | Victor Babson Jr., UGA |
| UAB Bioinformatics leader | Mike Carson, UAB |
| Database Programmer | David Johnson, UAB |
| Web Designer | Jun Tsao, UAB |
The protein structure initiative at SECSG is a combined effort
of protein production, NMR, crystallography, robotics and
informatics to produce structures in a high-throughput manner.
As part of the effort, it is critical to collect and integrate
interdisciplinary data and to analyze them on a large scale.
The bioinformatics team at SECSG is: (1) developing software
and database tools to track experimental results, (2) streamlining
the integration and correlation of information from disparate
sources, (3) regularly updating important databases related
to our selected target proteins, (4) automatically disseminating
information to the research community, and (5) carrying out
high-throughput computation on a multi-processor computer cluster
to analyze biological information, predict and model protein
structures, find crystal and NMR structure solutions, and
validate structures on a large scale. The goal of the bioinformatics
team is to transform massively heterogeneous data into a
human-comprehensible form, and thereby facilitate the study of
important biological problems through experimental means.
Our strategy for experimental data collection is to make maximal
use of the domain expertise of each individual group. The informatics
personnel in experimental groups, who know what data are
important and how to best organize them, are responsible
for building and maintaining their own Laboratory Information
Management System (LIMS). They also provide database interfaces
in the form of Common Gateway Interface (CGI) query strings,
Excel spreadsheets, tab-delimited tables and Extensible Markup
Language (XML) format files. The bioinformatics team is responsible
for developing generic parsers for these database interfaces
and for converting files in these different formats into an XML-based
exchange layer, which is independent of operating system
and database. The applications we intend to develop will
access information from the XML layer. This strategy provides
both the ability to comprehensively track experimental results
and the flexibility to accommodate database changes. One
application we implemented based on the above strategy is
our progress report (http://www.secsg.org/cgi-bin/report.pl).
This report is automatically generated each week by integrating
information from the protein production, NMR and X-ray crystallography
groups. From this report, readers can not only see our progress
toward protein structure determination but also find
the correlation between the crystallization experiment and
how well the protein is folded in solution as checked by
NMR.
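As an illustration of the conversion step described above, the following is a minimal Python sketch that turns a tab-delimited table into XML. It is not the SECSG parser: the element names (`records`, `record`, `field`), the column names, and the target identifiers are all hypothetical and chosen only for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tsv_to_xml(tsv_text, root_tag="records", row_tag="record"):
    """Convert tab-delimited text with a header row into an XML string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    root = ET.Element(root_tag)
    for row in reader:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            if column is None:
                # Extra cells beyond the header have no column name; skip them.
                continue
            # Each column becomes <field name="..."> so that arbitrary
            # header text never has to be a valid XML tag name.
            field = ET.SubElement(record, "field", name=column)
            field.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical export from one group's LIMS (made-up identifiers).
example = "target_id\tstatus\nT0001\tcrystallized\nT0002\tpurified\n"
xml_text = tsv_to_xml(example)
print(xml_text)
```

Storing column names as attributes rather than tag names is one possible design choice; it lets the same parser accept spreadsheets whose headers contain spaces or punctuation.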
An automated structure deposition procedure that uses the same
strategy is under development. Basic information about each protein
and the experimental information leading to its structure solution
are extracted from the different experimental and bioinformatics
databases. The information will be integrated and transformed
into a format accepted by the Protein Data Bank (PDB).
We also plan to validate structures against more rigorous rules
than the PDB requires, using tools such as WhatCheck and Procheck
and checks such as hydrogen atom clash validation. This procedure
will assure high-quality structure depositions from the SECSG.
Protein sequence and structure information changes
daily. It is important to know which newly characterized proteins
or deposited structures are related to the proteins or protein
homologues we are working on, so that we can prioritize
our work and choose experimental protocols. For these reasons,
we have automated regular searches of sequence and structure
databases. We use PSI-Blast to search non-redundant
databases every month, and Blast2 and a refined PSI-Blast
procedure to search PDB sequences every week. The search
hits are integrated into the progress report mentioned previously,
so readers can see experimental progress and changes in structure
information at the same time. We are seeking collaboration
with computational biology groups to deploy their
protein structure prediction and modeling pipelines locally,
so that they are closely connected to our high-throughput experiments.
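One way to surface what has changed between two scheduled searches is a per-target set difference over the hit lists. The sketch below assumes each search run has already been reduced to a mapping from target name to hit identifiers; that data layout, the function name, and all identifiers are hypothetical and not taken from the SECSG pipeline.

```python
def new_hits_by_target(previous, current):
    """Map each target to the hits found in `current` but not in `previous`."""
    report = {}
    for target, hits in current.items():
        fresh = set(hits) - set(previous.get(target, ()))
        if fresh:
            # Sort so the report is stable from one run to the next.
            report[target] = sorted(fresh)
    return report


# Made-up results from two consecutive weekly searches.
last_week = {"target_A": ["hit_1", "hit_2"], "target_B": []}
this_week = {"target_A": ["hit_1", "hit_2", "hit_3"], "target_B": ["hit_9"]}

print(new_hits_by_target(last_week, this_week))
```

Targets with no new hits are omitted from the result, so an empty dictionary means nothing changed since the previous run.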
While many tools exist for parsing biological databases and
interfacing with existing analysis programs, there is currently
a lack of support infrastructure for correlating biological
information from disparate sources. We are developing a toolkit
that attempts to address this need using mathematical set and
combinatorial graph techniques. The toolkit, which is implemented
in Perl (http://www.perl.org) and uses bioperl (http://bio.perl.org),
has already been applied to analyzing neighbor gene relatedness
for complete bacterial genomes released at NCBI. It will
be extended to support future correlation of genes with other
genomic, structural and functional information.
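To make the set and graph idea concrete, here is a small Python sketch (the toolkit itself is written in Perl, and this is not its code). It assumes each genome is given as an ordered list of gene labels that are already comparable across genomes, treats every pair of adjacent genes as an undirected edge, and counts how many genomes share each edge. The genome and gene names are made up, and circular chromosomes are ignored for simplicity.

```python
from collections import Counter


def adjacent_pairs(gene_order):
    """Return the set of unordered neighboring gene pairs in one genome."""
    return {
        frozenset((a, b))
        for a, b in zip(gene_order, gene_order[1:])
        if a != b
    }


def conserved_neighbors(genomes, min_genomes=2):
    """Count, per gene pair, how many genomes keep the two genes adjacent."""
    counts = Counter()
    for gene_order in genomes.values():
        # A set per genome ensures each genome contributes at most once per pair.
        counts.update(adjacent_pairs(gene_order))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Hypothetical gene orders for two genomes.
genomes = {
    "genome_A": ["g1", "g2", "g3", "g4"],
    "genome_B": ["g3", "g2", "g1", "g5"],
}
conserved = conserved_neighbors(genomes)
print(conserved)
```

The returned dictionary can be read as a weighted graph: each key is an edge between two genes, and the value is the number of genomes in which they are neighbors.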
Most bioinformatics and crystallography programs do not use
parallel techniques, and parallelizing a program is a non-trivial
effort. Considering the relatively short running time of
each job, ranging from several minutes to several hours, and
the rapid drop in computer prices, the most economical approach is
to run many such jobs at the same time on multiple processors. With
a 64-node dual-processor Linux cluster donated by IBM,
we implemented sequence comparison, pattern discovery, and a
crystallographic structure solving pipeline in a high-throughput
environment. Within the first two weeks of implementing
the pipeline, we solved the first automatically determined
structure in SECSG. We are developing visualization tools
to help mine the large datasets generated by high-throughput
computation.
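The approach of running many independent serial jobs side by side, rather than parallelizing each program, can be sketched on a single machine with Python's standard library. This is only an illustration under stated assumptions: a real cluster would normally dispatch jobs through a batch scheduler, and `run_jobs` and the toy job below are hypothetical names, not part of the SECSG pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs(job, inputs, workers=4):
    """Apply `job` to every input concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, inputs))


def sequence_length(sequence):
    """Stand-in for one independent job, e.g. a call to an external program."""
    return len(sequence)


# Made-up inputs; each one is processed independently of the others.
results = run_jobs(sequence_length, ["MKTAYIAK", "MSL", "MAHHHHHH"])
print(results)
```

Threads are a reasonable fit when each job mostly waits on an external program started with `subprocess.run`; for CPU-bound Python functions, `ProcessPoolExecutor` from the same module offers the same `map` interface.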
|