SECSG : Southeast Collaboratory for Structural Genomics

Bioinformatics

Project Director, Co-PI	John Rose, UGA
Project Director (tools and analysis)	Dawei Lin, UGA
Senior programmer	Jeremy Praissman, UGA
Statistician/Programmer	Xi (Catherine) Chen, UGA
Project Director (databases)	Alan Gingle, UGA
Database Programmer	Victor Babson Jr., UGA
UAB Bioinformatics leader	Mike Carson, UAB
Database Programmer	David Johnson, UAB
Web Designer	Jun Tsao, UAB

The protein structure initiative at SECSG combines protein production, NMR, crystallography, robotics and informatics to produce structures in a high-throughput manner. As part of this effort, it is critical to collect, integrate and analyze interdisciplinary data on a large scale. The bioinformatics team at SECSG is (1) developing software and database tools to track experimental results, (2) streamlining the integration and correlation of information from disparate sources, (3) regularly updating important databases related to our selected target proteins, (4) automatically disseminating information to the research community, and (5) carrying out high-throughput computation on a multi-processor computer cluster to analyze biological information, predict and model protein structures, find crystal and NMR structure solutions, and validate structures on a large scale. The goal of the bioinformatics team is to transform massively heterogeneous data into a human-comprehensible form and thereby facilitate the experimental study of important biological problems.

Our strategy for experimental data collection is to make maximal use of the domain expertise of each individual group. The informatics personnel in the experimental groups, who know which data are important and how best to organize them, are responsible for building and maintaining their own Laboratory Information Management System (LIMS). They also provide database interfaces in the form of Common Gateway Interface (CGI) query strings, Excel spreadsheets, tab-delimited tables and Extensible Markup Language (XML) files. The bioinformatics team is responsible for developing generic parsers for these database interfaces and for converting the different file formats into an XML-based exchange layer, which is independent of operating system and database. The applications we intend to develop will access information from the XML layer. This strategy provides both the ability to track experimental results comprehensively and the flexibility to accommodate database changes. One application we have implemented with this strategy is our progress report ( https://secsg.org/cgi-bin/report.pl ). The report is generated automatically each week by integrating information from the protein production, NMR and X-ray crystallography groups. From this report, readers can not only see our progress toward protein structure determination but also find the correlation between the outcome of a crystallization experiment and how well the protein is folded in solution, as checked by NMR.
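The conversion of a tab-delimited export into an XML exchange record can be sketched as follows. This is a minimal illustration in Python, not the SECSG parser: the element names (`records`, `record`, `field`) and the sample columns (`target_id`, `stage`) are assumptions made up for the example, and the real exchange schema is not described here.

```python
import csv
import io
import xml.etree.ElementTree as ET


def table_to_xml(tsv_text, root_tag="records", row_tag="record"):
    """Convert a tab-delimited table (header row first) into an XML tree.

    Element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.Element(root_tag)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            if column is None:
                # Skip cells that have no matching header column.
                continue
            field = ET.SubElement(record, "field", name=column)
            field.text = value or ""
    return root


# Hypothetical LIMS export: two targets and their current stage.
sample = "target_id\tstage\nT0001\tcrystallized\nT0002\tpurified\n"
xml_root = table_to_xml(sample)
print(ET.tostring(xml_root, encoding="unicode"))
```

Because every source format is reduced to the same generic record structure, an application that reads the XML layer does not need to change when a group alters its underlying database, only the parser for that group's export does.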

An automated structure deposition procedure that uses the same strategy is under development. The basic information about each protein and the experimental information leading to its structure solution are extracted from the various experimental and bioinformatics databases. This information will be integrated and transformed into a format accepted by the Protein Data Bank (PDB). We also plan to validate structures against more rigorous rules than the PDB requires, using checks such as WhatCheck, Procheck and hydrogen-atom clash validation. This procedure will help ensure high-quality structure depositions from the SECSG.

Protein sequence and structure information changes daily. It is important to know which newly characterized proteins or deposited structures are related to the proteins, or protein homologues, that we are working on, so that we can prioritize our work and choose experimental protocols. For this reason, we have automated regular searches of sequence and structure databases. We use PSI-Blast to search non-redundant databases every month, and Blast2 and a refined PSI-Blast procedure to search PDB sequences every week. The search hits are integrated into the progress report mentioned previously, so readers can see experimental progress and changes in structure information at the same time. We are seeking collaborations with computational biology groups so that we can run their protein structure prediction and modeling pipelines locally, closely connected to our high-throughput experiments.

While many tools exist for parsing biological databases and interfacing with existing analysis programs, there is currently a lack of supporting infrastructure for correlating biological information from disparate sources. We are developing a toolkit that attempts to address this need using mathematical set and combinatorial graph techniques. The toolkit, which is implemented in Perl ( http://www.perl.org ) and uses BioPerl ( http://bio.perl.org ), has already been applied to analyzing neighbor gene relatedness in the complete bacterial genomes released at NCBI. It will be extended to support future correlation of genes with other genomic, structural and functional information.
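As a rough illustration of how set and graph operations can express a neighbor-gene comparison, the sketch below finds gene-family pairs that are adjacent in every genome and stores them as an undirected graph. It is written in Python for brevity rather than in Perl, it is not the toolkit's code, and the genome names, family labels and the definition of "neighbor" (immediately adjacent in gene order) are all assumptions made for the example.

```python
from collections import defaultdict


def neighbor_pairs(gene_order):
    """Return the set of unordered adjacent pairs in one genome's gene order."""
    return {
        frozenset((left, right))
        for left, right in zip(gene_order, gene_order[1:])
        if left != right  # ignore tandem copies of the same family
    }


def conserved_neighbors(genomes):
    """Intersect the neighbor-pair sets of all genomes."""
    pair_sets = [neighbor_pairs(order) for order in genomes.values()]
    return set.intersection(*pair_sets) if pair_sets else set()


def adjacency_graph(pairs):
    """Build an undirected graph (node -> set of neighbors) from pairs."""
    graph = defaultdict(set)
    for pair in pairs:
        a, b = tuple(pair)
        graph[a].add(b)
        graph[b].add(a)
    return graph


# Hypothetical gene orders, with each gene labeled by its family.
genomes = {
    "genome_A": ["famA", "famB", "famC", "famD"],
    "genome_B": ["famD", "famB", "famA", "famC"],
}
conserved = conserved_neighbors(genomes)
graph = adjacency_graph(conserved)
print(sorted(sorted(pair) for pair in conserved))
```

Representing each genome as a set of pairs means that correlating further sources (for example, structural or functional annotations keyed by the same family labels) reduces to additional set intersections or graph lookups.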

Most bioinformatics and crystallography programs do not use parallel techniques, and parallelizing such a program is a non-trivial effort. Because each job runs for a relatively short time, from several minutes to several hours, and computer prices are falling quickly, the most economical approach is to run many jobs at once on multiple processors. With a 64-node, dual-processor Linux cluster donated by IBM, we implemented sequence comparison, pattern discovery and crystallographic structure solution pipelines in a high-throughput environment. Within the first two weeks of running the pipeline, we solved the first automatically determined structure in SECSG. We are developing visualization tools to help mine the large datasets generated by high-throughput computation.
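The idea of running many unmodified serial programs side by side, rather than parallelizing each one, can be sketched on a single multi-processor machine as follows. This is only an illustration in Python: the SECSG cluster scheduler and job commands are not described here, so the commands below are placeholders that launch trivial Python one-liners in place of real analysis programs.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one external program to completion.

    Returns (command, exit code, captured standard output).
    """
    done = subprocess.run(command, capture_output=True, text=True)
    return command, done.returncode, done.stdout.strip()


def run_all(commands, max_workers=4):
    """Run independent jobs concurrently, at most max_workers at a time."""
    # Each worker thread only waits on its child process, so the child
    # processes themselves can execute on separate processors.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))


# Placeholder jobs standing in for serial analysis programs.
commands = [[sys.executable, "-c", f"print({n} * {n})"] for n in range(4)]
results = run_all(commands)
for command, code, output in results:
    print(code, output)
```

Since the jobs share no state, throughput scales with the number of available processors without any change to the programs themselves.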