We are interested in:
|
Protein Structure Prediction
|
Protein structure prediction refers to the effort to construct the three-dimensional
shape of protein molecules from their amino acid sequences
by computation.
Our lab has developed a number of algorithms for protein 3D structure
prediction, including
I-TASSER
for iterative protein structure assembly,
QUARK
for ab initio protein folding,
and MUSTER and
LOMETS
for protein template structure identification, some of which are
regarded as among the best available and are widely used by the community.
The Critical Assessment of Structure Prediction (CASP)
is a community-wide experiment, held every two years since 1994, that is designed to benchmark the
state of the art in protein structure prediction.
Our lab participated as "Zhang-Server" in the automated structure prediction
section of the experiments held in 2006, 2008 and 2010.
The method ranked first in all three experiments (CASP 7-9) (Table 1).
Table 1. Top ten groups in automated structure prediction in CASP 7-9, ranked
based on cumulative GDT-TS score of first model.
(Data were taken from http://predictioncenter.org.
When multiple servers come from the same lab, only the best server is listed.)
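The GDT-TS score used for the ranking in Table 1 can be illustrated with a minimal sketch. It assumes the model and the native C-alpha coordinates are already superposed and uses the conventional 1, 2, 4 and 8 Angstrom cutoffs; the official score additionally searches over many superpositions, which is omitted here. The function name and the toy coordinates are hypothetical.

```python
import math


def gdt_ts(model, native, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS (0-100) for two already-superposed C-alpha traces.

    Assumption: no superposition search is performed, so this is only an
    approximation of the score reported by the CASP assessors.
    """
    if len(model) != len(native) or not native:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Per-residue distance between model and native positions.
    distances = [math.dist(a, b) for a, b in zip(model, native)]
    # Fraction of residues within each distance cutoff.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical four-residue example (coordinates in Angstroms).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]
print(gdt_ts(model, native))  # 56.25
```

Because the score averages four cutoffs, a model with a few badly placed residues can still score well, which is one reason it is often reported alongside RMSD.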

The most difficult problem in protein structure prediction is the modeling
of proteins that have no solved structures that can be used as templates,
commonly referred to as "ab initio" or "free modeling (FM)".
Figure 1 shows a successful example of ab initio modeling
on an FM target (T0604_1) in CASP9, where the first model by the I-TASSER server has an RMSD of 2.66 Angstroms
to the X-ray crystal structure.

Figure 1. The first model by the I-TASSER server versus
the crystal structure of T0604_1, an FM target in CASP9.
This is the VP0956
protein from Vibrio parahaemolyticus, solved by the Northeast Structural Genomics Consortium.
Despite the successes, there are still significant unsolved problems in
protein structure prediction,
which will be the target of our lab in the next few years.
These include:
- How to build structures of experimental resolution (below 1-2 Angstroms,
useful for drug screening) when homologous templates are available?
- How to identify distantly homologous templates with accurate query-template alignments?
- How to fold proteins (especially the beta-proteins) with correct topology by
ab initio modeling, when no templates exist?
- How to fold membrane proteins?
|
Protein Function Prediction
|
Given the amino acid sequence, can we tell what the protein molecule does in living cells?
We have developed COFACTOR for protein function prediction, based on the
sequence-to-structure-to-function paradigm.
From the amino acid sequence, 3D structures are first constructed by I-TASSER. The
functional insights (including enzyme classification, gene ontology, and ligand binding
specificity) are then deduced by the local and global comparison of the structural
models with proteins of known functions (Figure 2).

Figure 2. Protein function annotation based on the sequence-to-structure-to-function
paradigm. The right
panel shows the functional homologs identified by global (a) and local (b)
matches of I-TASSER models.
COFACTOR was tested in the community-wide CASP9 experiment as "I-TASSER_FUNCTION" in
the Server section and as "ZHANG" in the Human section; the two entries ranked in the first two
positions in both the Z-score and the Matthews correlation coefficient (MCC) rankings
when compared with the experimental data (Figure 2a).

Figure 2a.
Mean MCC Z-scores of the best ten groups in the Function Prediction in CASP9.
(The picture was taken from the presentation by the CASP9 assessor Dr. T Schwede).
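For readers unfamiliar with the Matthews correlation coefficient mentioned above, the following is a minimal sketch of how it is computed from binary predictions. It assumes the predictions are per-residue yes/no labels (for example, binding-site residue or not); the label vectors are made up for illustration and are not CASP9 data.

```python
import math


def mcc(predicted, observed):
    """Matthews correlation coefficient for two equal-length binary label lists."""
    if len(predicted) != len(observed):
        raise ValueError("label lists must have equal length")
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)          # true positives
    tn = sum(1 for p, o in pairs if not p and not o)  # true negatives
    fp = sum(1 for p, o in pairs if p and not o)      # false positives
    fn = sum(1 for p, o in pairs if not p and o)      # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical labels for an eight-residue protein (1 = binding residue).
observed = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(round(mcc(predicted, observed), 4))  # 0.4667
```

Unlike plain accuracy, the MCC stays informative when binding residues are a small minority of the sequence, which is the usual situation.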
|
Protein Design
|
Protein design refers to the effort to design new protein molecules with a desired 3D structure
and function. It is the reverse of protein structure prediction, and
solving the problem therefore relies heavily on how well we understand
the principles of protein folding (Figure 3).

Figure 3. Protein design is a reverse procedure of protein structure prediction.
We recently designed a number of new protein sequences using a physics-based atomic
force field, with the lowest free-energy state searched by Monte Carlo simulation,
followed by sequence-based clustering. The designed protein sequences can be folded by
I-TASSER with an RMSD <2 Angstroms to the target structure in 62% of cases,
even though the I-TASSER force field differs significantly from that used in the design.
Figure 4 shows three representative
examples of the target protein structure and I-TASSER model of the designed sequences.

Figure 4.
I-TASSER models of designed sequences (red) versus crystal structure of target proteins (green)
for calcium-binding domain of Calx (3E9TA), odorant binding protein (2ERBA), and peptidyl-tRNA
hydrolase (1WN2A). The sequence identities of the designed and target sequences are all below 30%.
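The Monte Carlo sequence search described above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the energy function below is a hypothetical burial-matching score standing in for the physics-based atomic force field, the Metropolis criterion is assumed as the acceptance rule, and the sequence-based clustering step is omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")  # simplified, hypothetical classification


def toy_energy(sequence, buried):
    """Hypothetical energy: -1 when residue type matches burial, +1 otherwise."""
    energy = 0.0
    for residue, is_buried in zip(sequence, buried):
        matches = (residue in HYDROPHOBIC) == is_buried
        energy += -1.0 if matches else 1.0
    return energy


def design_sequence(buried, steps=5000, temperature=1.0, seed=0):
    """Metropolis Monte Carlo over sequences; returns the lowest-energy one seen."""
    rng = random.Random(seed)
    current = [rng.choice(AMINO_ACIDS) for _ in buried]
    current_e = toy_energy(current, buried)
    best, best_e = list(current), current_e
    for _ in range(steps):
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)  # propose a point mutation
        new_e = toy_energy(current, buried)
        delta = new_e - current_e
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e = new_e
            if current_e < best_e:
                best, best_e = list(current), current_e
        else:
            current[pos] = old  # reject: restore the previous residue
    return "".join(best), best_e


# Hypothetical burial pattern for a 12-residue target (True = buried).
buried = [True, False, True, True, False, False, True, False, True, False, False, True]
sequence, energy = design_sequence(buried)
print(sequence, energy)
```

In a real design run the toy score would be replaced by the full-atom energy evaluated on the target backbone, and many independent trajectories would be clustered to pick the final sequences.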
We are now working on the design of protein molecules associated with human breast cancer
and are seeking experimental validation.
To speed up the design and test procedure, we are developing new protein design methods
with the aid of protein threading and fold-recognition methods.
|
Modeling of G Protein-Coupled Receptor and
Ligand-Receptor Interactions
|
G protein-coupled receptors, or GPCRs, are integral membrane proteins embedded in
the cell surface that transmit signals to cells in response to stimuli and
mediate physiological functions through interaction with heterotrimeric
G proteins (Figure 5). Many diseases involve the malfunction of these receptors,
making them important drug targets.
More than 50% of all modern drugs target GPCRs, and GPCR-targeting drugs account for
25% of the 100 top-selling drugs worldwide.

Figure 5. GPCRs comprise the largest family of membrane proteins
and act as cell receptors for cellular signal transduction.
We are working on the development of a new GPCR modeling tool, GPCR-ITASSER, which
extends I-TASSER by incorporating protein-membrane interactions and
mutagenesis restraints into the knowledge-based force field.
The ligand-GPCR
interactions are then modeled by
BSP-SLIM,
a blind molecular docking tool designed for low-resolution protein-ligand
docking. The method was tested (as "UMich-Zhang") in the recent community-wide
GPCR Dock experiment in 2010.
Figure 6 shows the results of our lab on all three ligand-GPCR complexes,
where the first receptor models deviate by 2.4 and 1.6 Angstroms from the
crystal structure in the transmembrane region for the CXCR4 chemokine and
dopamine D3 receptors, respectively. The three ligands, the antagonists
IT1t, CVX15, and eticlopride, all sit in the same pocket as in the
crystal structure (Figure 6).

Figure 6.
The first ligand-receptor docking model generated by GPCR-ITASSER and BSP-SLIM in GPCR Dock 2010.
Left: CXCR4 chemokine receptor with IT1t; middle: CXCR4 chemokine receptor with CVX15; right: dopamine D3
receptor with eticlopride.
Table 2 shows a summary of the top 10 groups (out of 35) in GPCR Dock 2010,
together with the cumulative Z-score on all three targets for both receptor
and ligand models.
The most significant success of our models is on the distant homology target
CXCR4/CVX15, as Kufareva et al. (the assessors) commented,
"Modeling the CXCR4/CVX15
peptide complex represented the biggest challenge of GPCR Dock 2010.
The top model of this complex (by UMich-Zhang) has the Z-score of 2.45 thus
far exceeding other models in accuracy."
Table 2. The best 10 groups in GPCR Dock 2010 based on total Z-score of receptor
and ligand models.
(Data were taken from Kufareva et al. Structure. 2011, 19:1108)
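A cumulative Z-score of the kind used in Table 2 can be sketched as follows. The exact procedure of the assessors is not reproduced here: the sketch assumes a plain population standard deviation with no outlier removal, and that a lower raw value (such as RMSD) is better, so the sign is flipped. Group names and scores are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values, higher_is_better=True):
    """Z-scores of raw values; sign is flipped when lower raw values are better."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    sign = 1.0 if higher_is_better else -1.0
    return [sign * (v - mu) / sigma for v in values]


def cumulative_z(per_target_scores, higher_is_better=True):
    """Sum each group's Z-score over all targets.

    per_target_scores maps target name -> {group name: raw score}.
    """
    totals = {}
    for scores in per_target_scores.values():
        groups = list(scores)
        zs = z_scores([scores[g] for g in groups], higher_is_better)
        for group, z in zip(groups, zs):
            totals[group] = totals.get(group, 0.0) + z
    return totals


# Hypothetical RMSD values (Angstroms) for three groups on two targets.
rmsd = {
    "target_1": {"group_A": 2.0, "group_B": 4.0, "group_C": 6.0},
    "target_2": {"group_A": 3.0, "group_B": 3.0, "group_C": 6.0},
}
totals = cumulative_z(rmsd, higher_is_better=False)
for group in sorted(totals, key=totals.get, reverse=True):
    print(group, round(totals[group], 3))
```

Normalizing per target before summing keeps one very hard target from dominating the ranking, which is why Z-scores rather than raw deviations are accumulated.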

We are now working on the application of the GPCR-ITASSER and BSP-SLIM pipeline to
the modeling of
all GPCRs in the human genome, to generate high-resolution structures of
the receptors as well as the ligand-receptor associations with the aid of
experimental restraints collected in
GPCRRD.
One goal is to systematically annotate the physiological roles of all GPCRs in
associated pathways and to identify new therapies to regulate these interactions.
|
Amyloid Diseases and Fiber Aggregation
|
When a cell creates a protein, it can make either the actual protein or
some peptide fragments.
The fragments can sometimes "mis-fold" into insoluble protein fibers of
uniform beta-pleated sheets, called amyloid fibers.
When these fibers accumulate abnormally in tissues and organs, the
patient may suffer from serious amyloidosis.
If this happens in the brain, for example,
degeneration of neuronal processes and synaptic abnormalities may appear,
resulting in amyloid diseases including
Alzheimer's, Parkinson's, and Huntington's diseases (Figure 7).

Figure 7. A normal aged brain (left) versus an Alzheimer's patient's brain
with aggregated amyloid fibers (right).
To understand the mechanisms of amyloid fiber formation, we developed
a new approach to the asymptotic solution of the
fiber aggregation master equation.
It was found that four distinct stages,
lag phase, exponential growth phase, breaking phase and static phase,
dominate the fiber formation process. Amyloid proteins can thus be
classified into four hierarchical groups based on the
fiber formation half-time and growth rate (Figure 8).

Figure 8.
Amyloid proteins consist of four distinct types according to
nucleation mechanism.
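The two quantities used for the classification above, the fiber formation half-time and the growth rate, can be read off a measured or simulated time course. The sketch below is a generic illustration, not the asymptotic solution of the master equation: it assumes a monotonically increasing fiber-mass curve and uses a logistic curve purely as hypothetical test data.

```python
import math


def half_time_and_max_rate(times, mass):
    """Return (half-time, maximum growth rate) of a fiber-mass time course.

    Half-time is the first time the mass reaches the midpoint between its
    initial and final values (linear interpolation between samples). The
    growth rate is the largest finite-difference slope.
    """
    if len(times) != len(mass) or len(times) < 2:
        raise ValueError("need at least two matching time and mass samples")
    half = mass[0] + 0.5 * (mass[-1] - mass[0])
    t_half = None
    max_rate = 0.0
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        m0, m1 = mass[i], mass[i + 1]
        rate = (m1 - m0) / (t1 - t0)
        max_rate = max(max_rate, rate)
        if t_half is None and m0 <= half <= m1 and m1 > m0:
            t_half = t0 + (half - m0) * (t1 - t0) / (m1 - m0)
    return t_half, max_rate


# Hypothetical sigmoidal time course: lag, growth, then plateau.
times = [0.5 * i for i in range(41)]
mass = [1.0 / (1.0 + math.exp(-0.5 * (t - 10.0))) for t in times]
t_half, max_rate = half_time_and_max_rate(times, mass)
print(round(t_half, 3), round(max_rate, 4))
```

Plotting proteins by these two numbers is one simple way to see whether they fall into separate groups.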
|
Modeling of Protein-Protein Interactions
|
On average, each protein interacts (at least transiently) with about 9 other proteins,
forming complicated interaction networks within a cell (Figure 9).
Since most proteins carry out their biological function through the interaction with
other proteins, many diseases can be treated by designing new drugs to inhibit or
activate the protein-protein interactions, where the knowledge of the protein-protein
complex structures is essential.

Figure 9.
Rhodopseudomonas palustris protein-protein interaction network.
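The "about 9 partners per protein" figure corresponds to the mean degree of the interaction network. As a minimal sketch, assuming the network is available as a list of undirected protein pairs (the edge list below is hypothetical), the mean degree can be computed like this:

```python
from collections import defaultdict


def mean_degree(edges):
    """Average number of distinct interaction partners per protein.

    Self-interactions and duplicate pairs are ignored; proteins that appear
    only in self-interactions are not counted.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)
    if not neighbors:
        return 0.0
    return sum(len(partners) for partners in neighbors.values()) / len(neighbors)


# Hypothetical four-protein network.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("B", "A")]
print(mean_degree(edges))  # 2.0
```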
To predict the 3D structure of protein-protein complexes from sequence, we developed a
new dimeric threading algorithm,
COTH,
to recognize template structures of protein complexes from databases of solved complex
structures. COTH aligns multiple-chain sequences simultaneously against the PDB library
using scoring functions that combine multiple sequence profiles and structural information,
with the assistance of interface predictions from
BSpred. The COTH algorithm
demonstrated a significant advantage over other homology-based
template identification methods (Figure 10).

Figure 10. TM-score of templates identified by COTH
versus that from other homology-based methods.
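The TM-score used in Figure 10 can be illustrated with a minimal sketch. It assumes the aligned residue pairs and their distances after superposition are already known; the real TM-score additionally maximizes over superpositions. The lower bound of 0.5 Angstroms on d0 is a practical guard for very short chains and is an assumption of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (Angstroms) of the TM-score."""
    d0 = 1.24 * max(target_length - 15, 0) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)  # assumed floor to keep d0 positive for short chains


def tm_score(distances, target_length):
    """TM-score for a fixed alignment and superposition.

    distances: distance in Angstroms for each aligned residue pair.
    target_length: number of residues in the target (normalization length).
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if len(distances) > target_length:
        raise ValueError("more aligned pairs than target residues")
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical example: 100-residue target, 80 residues aligned at 2 Angstroms.
print(round(tm_score([2.0] * 80, 100), 3))
```

Because the sum is normalized by the target length, unaligned residues lower the score, so a template is rewarded for both accuracy and coverage.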
Following COTH, we are working on the development of Dimer-ITASSER by extending
the I-TASSER algorithm for multiple-chain full-length complex structure prediction.
Since the folding principles of protein domains and complexes are essentially the same,
we hope to exploit the I-TASSER iterative threading assembly methodology to significantly
refine the template structures identified by COTH, with the focus on the modeling
of binding-induced side-chain and backbone conformational changes.
One of the long-term goals is to utilize the developed Dimer-ITASSER
to reconstruct the structure-based protein-protein network across genomes.
A systematic, atom-level description of protein-protein complexes will be essential
for the understanding of cellular processes and for the development of novel reagents to
regulate the protein-protein interaction networks.
|
Alternative Splicing and Protein Isoforms
|
After many years of wild guesses, it is now known that the number of human genes is
about 20,000, which is surprisingly close to that of a roundworm or a fruit fly.
How can such a small number of genes give rise to an organism as complex as Mozart
or Einstein? One of the important mechanisms that the human body uses to
increase the complexity of the
proteome is RNA alternative splicing, i.e. different proteins can be generated
by combining different sets of exons from the same gene (Figure 11);
another mechanism is post-translational modification (PTM).
It is believed that more than 90% of human genes have alternative splicing
isoforms.

Figure 11. An illustration of RNA alternative splicing, where the combination of
Exons 1, 2, 3 gives rise to Protein A and 1, 2, 4 to Protein B.
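The exon combinations in Figure 11 can be written out as a toy example. The exon sequences below are invented for illustration; only the pattern (exons 1, 2, 3 for Protein A and exons 1, 2, 4 for Protein B) comes from the figure.

```python
def splice(exons, order):
    """Join the selected exons, in the given order, into one transcript."""
    return "".join(exons[number] for number in order)


# Hypothetical exon sequences for a single gene.
exons = {1: "ATGGCC", 2: "GAAACC", 3: "TTTTAA", 4: "CTGTGA"}

# Exon usage taken from Figure 11.
isoforms = {"Protein A": (1, 2, 3), "Protein B": (1, 2, 4)}

transcripts = {name: splice(exons, order) for name, order in isoforms.items()}
for name, transcript in transcripts.items():
    print(name, transcript)
```

The two transcripts share everything contributed by exons 1 and 2 and differ only in the last exon, which is how isoforms can keep a high overall sequence identity while differing in a specific motif.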
Many RNA alternative splicing products result in diseases, including cancer.
To examine the role of alternatively spliced isoforms in human breast cancer
at the level of protein structure,
we modeled the structures of three pairs of protein isoforms
(calumenin, cell division cycle 42, and polypyrimidine tract-binding protein)
which were found only in breast cancer cells by our collaborators (Menon and Omenn).
We observed that, despite the high sequence identity (>90%),
structural variations between isoforms occur at the critical active motifs.
This observation opens a new avenue to the study of cancer on the
basis of protein structure variations in alternatively spliced isoforms
(Figure 12).

Figure 12. Sequence and structure variations in the breast cancer associated isoforms,
ENSP00000249364 and
ENSP00000408838, from the calumenin proteins. The arrows point to the important variation
motifs occurring in the
isoforms. The left panel shows the
TM-align structural alignment
of the models generated by I-TASSER for the isoforms.
One issue in the identification of alternatively spliced genes is the high error
rate in high-throughput protein identification and quantification data.
To improve data quality using protein structure prediction,
we are applying I-TASSER modeling and its confidence
scoring system to distinguish between true and false RNA alternative splicing
isoforms from the human genome sequences.
|
Ligand Screening and Structure-Based Drug Design
|
In terms of the lock-and-key metaphor, drug design is essentially a procedure to
find an appropriate compound molecule (the key) which can match well with the
active site pocket of the target protein (the lock). Therefore, an
important step of structure-based rational drug design is to use the
experimental or predicted 3D structure
of the target protein to screen compound databases with the purpose of
identifying appropriate drugs which can inhibit or activate the protein (Figure 13).

Figure 13. A successful example of structure-based drug design by Bugg et al. in the 1990s,
in which a molecule was designed
to inhibit the enzyme purine nucleoside phosphorylase (PNP). PNP normally takes up individual nucleosides (a)
and cleaves the purine from the sugar, giving rise to a free purine base and a phosphorylated sugar (b).
A tightly fitting compound blocks the binding pocket and therefore inhibits the activity of the PNP enzyme (c).
We recently developed a composite approach for drug-like compound identification,
which combines structure-based virtual screening with quantitative structure-activity
relationship (QSAR) modeling.
When applying the approach to the epidermal growth factor receptor (EGFR),
an important target protein associated with brain, lung, bladder and colon tumors,
we found that two compounds (2 and 21) have significant EGFR-inhibitory activities (Figure 14).
The experimental assay to test the ability of the compounds to inhibit the receptor proteins is in
progress.

Figure 14.
Binding structure of two compounds screened from the ZINC library which have inhibitory
activity on the epidermal growth factor receptor (EGFR), an important tumor target protein.
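How the structure-based and QSAR results are combined is not described on this page. One simple possibility, shown here purely as an assumption, is a consensus ranking that averages each compound's rank from the two methods. The compound names, docking energies and predicted activities below are hypothetical.

```python
def consensus_rank(docking_energy, qsar_activity):
    """Order compounds by the sum of their docking rank and QSAR rank.

    docking_energy: compound -> docking energy (lower is better).
    qsar_activity:  compound -> predicted activity (higher is better).
    Only compounds scored by both methods are ranked; ties break by name.
    """
    compounds = sorted(set(docking_energy) & set(qsar_activity))
    by_dock = sorted(compounds, key=lambda c: docking_energy[c])
    by_qsar = sorted(compounds, key=lambda c: -qsar_activity[c])
    dock_rank = {c: i for i, c in enumerate(by_dock, start=1)}
    qsar_rank = {c: i for i, c in enumerate(by_qsar, start=1)}
    return sorted(compounds, key=lambda c: (dock_rank[c] + qsar_rank[c], c))


# Hypothetical screening results for four compounds.
docking = {"cpd_1": -6.0, "cpd_2": -9.1, "cpd_3": -7.4, "cpd_4": -8.2}
qsar = {"cpd_1": 5.1, "cpd_2": 7.5, "cpd_3": 7.9, "cpd_4": 6.0}
ranking = consensus_rank(docking, qsar)
print(ranking)  # ['cpd_2', 'cpd_3', 'cpd_4', 'cpd_1']
```

Rank-based combination avoids having to put docking energies and predicted activities on a common scale, at the cost of discarding the size of the score differences.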