BFG@University of Richmond

Thursday, December 01, 2005

human genome fun

The human genome sequencing project caused quite a stir back in the day, as the public and private sequencing efforts "took issue" with each other's approach, methodology, philosophy, and attitude (did I miss anything?). Here's a look back (with links):


Celera Genomics genome sequence paper, February 2001


PNAS article "On the sequencing of the human genome", by Waterston, Lander, and Sulston, March 2002
Celera Genomics' response to "On the sequencing of the human genome"

PNAS article "Whole-genome disassembly", by Phil Green

Science editorial "Not Wicked, Perhaps, but Tacky", Donald Kennedy, August 2002
Craig Venter's reply to "Not Wicked, Perhaps, but Tacky"

PNAS article "More on the sequencing of the human genome", by Waterston, Lander, and Sulston, March 2003

Celera Genomics reply to "More on the sequencing of the human genome"

Monday, November 28, 2005

more ecoinformatics links


Partnership for Biodiversity Informatics
Paper about Electronic Field Guides
Tropical Plant Database@Missouri Botanical Garden
Digital Atlas of Virginia Flora
Ecoinformatics.org
SEEK Collaborative
Article about Remote Sensing Technology
Long Term Ecological Research (LTER) Network
Vegbank.org
Trees and Shrubs of the University of Richmond Campus

Thursday, November 17, 2005

Ecoinformatics and Complex Systems Analysis


Monday's class will include a whirlwind tour of ecoinformatics, which is the subset of bioinformatics involved with collecting, organizing, and analyzing environmental data. Here are some links:
foodwebs.org
SEEK
ecoinformatics.org
ecoinformatics@RIT

Monday, November 14, 2005

Bird flu in the news


Bird flu 'out of control' in Chinese province
The bioweapon is in the post
Roche Makes Assurances on Tamiflu Supply
Quicker Turnaround for Flu Tests
Deadly flu virus can be sent through the mail
Flu in circulation
Google Avian Flu News
Pubmed search for "avian flu"

tree of life and genome sequencing links


European Ribosomal RNA Database
Tree of Life Web Project
NCBI's Entrez Genome Project
NCBI Trace Archive (raw sequence data)

Sunday, November 13, 2005

aluminum foil helmets enhance receipt of government radio signals


There are some who believe that the government sends secret signals to people through tooth fillings or secretly placed transceivers. One approach to this problem is a helmet made of aluminum foil or a metal colander, which is believed to block invasive signals. Researchers at MIT have tested this approach by measuring radio wave conductance of three helmet designs. Some signals are attenuated, but satellite and cell phone frequencies are enhanced, in effect acting like an antenna! The authors propose that the government and multinational corporations have promoted aluminum foil helmet use as a way to enhance their ability to surreptitiously send signals to selected individuals. Click on the title for more information. LOL

the iceworm cometh


Can worms that live in frozen glaciers yield clues to extraterrestrial life forms? Could similar worms live in Jupiter? NASA would like to know...

phylogenetic tree software


Now that you know of the awesome power of the multiple sequence alignment, how can you develop models for homolog relatedness? Here are some useful tools.
Tree-puzzle: free download of maximum likelihood analysis program
Modeltest: free download of program that computes most likely maximum likelihood assumptions (of 56 possible models)
Phylogeny programs: The mother of all phylogeny software pages. A pretty comprehensive list of free, fee-based, and web-interface software for phylogenetic analysis, both distance-based and character-based. Wow.

multiple sequence alignment tools


Here are some useful tools for doing multiple sequence alignments:
A nice interface for doing homology searches and multiple sequence alignments is available here. (thanks Nicole)

Friday, November 04, 2005

ab initio protein structure prediction


The "holy grail" of structural genomics is ab initio protein structure prediction; i.e., the ability to reliably predict tertiary structure from primary sequence. There are structure comparison "engines" and databases such as FATCAT, DALI, PDB, CATH, SCOP, SPICE, MSD, VAST, and NCBI's Structure page. However, there are not too many ab initio sites for predicting tertiary structure. HMMSTR is a fun site for such predictions, in the same way that PSORT is fun. Depending on the size of your protein sequence, HMMSTR will do a PSI-BLAST alignment, predict secondary structure, and even generate a PDB file for viewing the predicted structure, which can be seen in RasMol, Cn3D, or your favorite structure viewer. Here's a link to my results for the first TPR repeat of Drosophila kinesin light chain protein. As Nile Rodgers once wrote, these are the good times.

Wednesday, November 02, 2005

Prion Protein Predicted Properties

Prion protein (PrP) is a small glycoprotein found in high quantity in the brains of humans or animals infected with a number of degenerative neurological diseases such as Kuru, Creutzfeldt-Jakob disease (CJD), scrapie or bovine spongiform encephalopathy (BSE).
PrP is encoded in the host genome and expressed both in normal and infected cells. It has a tendency to aggregate yielding polymers called rods.
(Source: ExPASy)

Structurally, PrP is a protein consisting of a signal peptide, followed by an N-terminal domain that contains tandem repeats of a short motif (PHGGGWGQ in mammals, PHNPGY in chicken), itself followed by a highly conserved domain of about 140 residues that contains a disulfide bond. Finally comes a C-terminal hydrophobic domain post-translationally removed when PrP is attached to the extracellular side of the cell membrane by a GPI-anchor. (Source: ExPASy)

Location Prediction
65.2 %: cytoplasmic
13.0 %: mitochondrial
13.0 %: nuclear
4.3 %: cytoskeletal
4.3 %: Golgi

Source: PSORT

48.0 %: extracellular, including cell wall
20.0 %: cytoplasmic
12.0 %: nuclear
8.0 %: vesicles of secretory system
8.0 %: endoplasmic reticulum
4.0 %: mitochondrial

Source: WoLF PSORT

Conclusion: localization prediction databases = not quite so accurate
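The disagreement between the two predictors can be made explicit with a few lines of Python. This is only a minimal sketch: the percentages are copied from the two lists above, and the helper names (`top_call`, `shared_compartments`) are made up for illustration and are not part of either tool.

```python
# Predicted localization percentages for PrP, copied from the lists above.
psort = {
    "cytoplasmic": 65.2,
    "mitochondrial": 13.0,
    "nuclear": 13.0,
    "cytoskeletal": 4.3,
    "Golgi": 4.3,
}
wolf_psort = {
    "extracellular, including cell wall": 48.0,
    "cytoplasmic": 20.0,
    "nuclear": 12.0,
    "vesicles of secretory system": 8.0,
    "endoplasmic reticulum": 8.0,
    "mitochondrial": 4.0,
}


def top_call(prediction):
    """Return the compartment with the highest predicted percentage."""
    return max(prediction, key=prediction.get)


def shared_compartments(a, b):
    """Return the compartments that appear in both predictions, sorted."""
    return sorted(set(a) & set(b))


print("PSORT top call:     ", top_call(psort))
print("WoLF PSORT top call:", top_call(wolf_psort))
print("Listed by both:     ", shared_compartments(psort, wolf_psort))
```

The two top calls differ, which is the point of the conclusion above: a single predictor's best guess should be checked against a second tool or the literature.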

Molecular Weight (Da) : 27663
Molecular Class: Membrane Bound Ligand
Molecular Function: Receptor Binding


A general summary of the Prion Protein, found at hprd.org...



UBE3A Protein Property Prediction

Though it has been known for some time that Angelman Syndrome is caused by the lack of a functional maternal copy of UBE3A, the biochemical causes for the disease symptoms are still poorly understood.

The INTERPRO prediction package identified the HECT domain as the sole domain of this protein, suggesting a role as a ubiquitin-protein ligase. It also classified UBE3A as an IMP dehydrogenase/GMP reductase family protein.

Searching for localization signals, PSORT suggested a primarily extracellular role for UBE3A, giving the total distribution as follows:
52%: extracellular, including cell wall
12%: cytoplasmic
12%: nuclear
12%: endoplasmic reticulum
8%: vesicles of secretory system
4%: mitochondrial

According to the Human Protein Reference Database, the primary localization of this protein is in the cytoplasm, though it also occurs in the nucleus, so PSORT at least was close to the right track for this one.

The COILS program predicted a significant probability of a coiled-coil structure in the region between amino acid residues 150 and 200, as the graphical output below shows.




TMpred suggested the possibility of anywhere from one to three trans-membrane domains in the protein, though interestingly the PSORT prediction does not suggest that UBE3A is membrane-bound. The Human Protein Reference Database similarly does not suggest a membrane-bound role for UBE3A, though it does reference numerous protein-protein interactions of UBE3A in the ubiquitin pathway. Since many protein-protein interactions rely upon hydrophobic interactions, it could be that hydrophobic stretches of UBE3A are giving a false positive score for transmembrane domains.


ExPASy predicted the molecular weight to be just over 100 kDa and the pI to be 5.12.

In 1999, the crystal structure of UBE3A was reported. With the crystal structural data, the need for structural prediction software is greatly reduced for this protein, though it is interesting to analyze the predictions against the actual structure. From both the reported data and the visibly apparent structure, this protein shows no transmembrane domains, contrary to the prediction given by TMpred. With the structural data available, the next step in better understanding the protein is to model its various interactions with different ligands and proteins to understand the ubiquitin protein degradation pathway more fully, and in particular its relationship to Angelman Syndrome.

Menkes Syndrome- Copper-transporting ATPase 1


Just a quick review: Menkes Syndrome is an X-linked recessive disorder that disrupts copper metabolism and decreases a cell's ability to absorb copper. Research suggests that this disease is caused by a mutation in the gene encoding Cu(2+)-transporting ATPase, alpha polypeptide (OMIM #309400).

Uniprot Characterization:

Function: May supply copper to copper-requiring proteins within the secretory pathway, when localized in the trans-Golgi network. Under conditions of elevated extracellular copper, it relocalizes to the plasma membrane where it functions in the efflux of copper from cells.

Catalytic Activity: ATP + H2O + Cu2+(In) = ADP + phosphate + Cu2+(Out).

Subunit: Monomer

Subcellular Location: Integral membrane protein. Cycles constitutively between the trans-Golgi network (TGN) and the plasma membrane. Predominantly found in the TGN and relocalized to the plasma membrane in response to elevated copper levels. Isoform 3 may be cytosolic. Isoform 5 is located in the endoplasmic reticulum.

Psort- Results of the k-NN Prediction
k = 9/23
82.6 %: plasma membrane
17.4 %: endoplasmic reticulum
>> prediction for QUERY is pla (k=23)


Domain: The C-terminal di-leucine, 1487-Leu-Leu-1488, is an endocytic targeting signal which functions in retrieving the recycling protein from the plasma membrane to the TGN. Mutation of the di-leucine signal results in the accumulation of the protein in the plasma membrane.



metal binding domain


Secondary Structure
COILS:




Searched for repeats using REP, no repeats found
Visual structure of metal binding domains- similar

SOSUI- SHOW!!
This amino acid sequence is of a MEMBRANE PROTEIN which have 9 transmembrane helices.

Average of hydrophobicity : 0.097200

Using TopPred2...



Characterization:
Glycosylation sites (NetOGlyc):



Phosphorylation Sites predicted (NetPhos 2.0):
Serine: 61
Threonine: 17
Tyrosine: 8




Sulfinator: no hits

Human Protein Reference Database Link

ProtParam Info:

amino acid sequence:
1 MDPSMGVNSV TISVEGMTCN SCVWTIEQQI GKVNGVHHIK VSLEEKNATI IYDPKLQTPK 60
61 TLQEAIDDMG FDAVIHNPDP LPVLTDTLFL TVTASLTLPW DHIQSTLLKT KGVTDIKIYP 120
121 QKRTVAVTII PSIVNANQIK ELVPELSLDT GTLEKKSGAC EDHSMAQAGE VVLKMKVEGM 180
181 TCHSCTSTIE GKIGKLQGVQ RIKVSLDNQE ATIVYQPHLI SVEEMKKQIE AMGFPAFVKK 240
241 QPKYLKLGAI DVERLKNTPV KSSEGSQQRS PSYTNDSTAT FIIDGMHCKS CVSNIESTLS 300
301 ALQYVSSIVV SLENRSAIVK YNASSVTPES LRKAIEAVSP GLYRVSITSE VESTSNSPSS 360
361 SSLQKIPLNV VSQPLTQETV INIDGMTCNS CVQSIEGVIS KKPGVKSIRV SLANSNGTVE 420
421 YDPLLTSPET LRGAIEDMGF DATLSDTNEP LVVIAQPSSE MPLLTSTNEF YTKGMTPVQD 480
481 KEEGKNSSKC YIQVTGMTCA SCVANIERNL RREEGIYSIL VALMAGKAEV RYNPAVIQPP 540
541 MIAEFIRELG FGATVIENAD EGDGVLELVV RGMTCASCVH KIESSLTKHR GILYCSVALA 600
601 TNKAHIKYDP EIIGPRDIIH TIESLGFEAS LVKKDRSASH LDHKREIRQW RRSFLVSLFF 660
661 CIPVMGLMIY MMVMDHHFAT LHHNQNMSKE EMINLHSSMF LERQILPGLS VMNLLSFLLC 720
721 VPVQFFGGWY FYIQAYKALK HKTANMDVLI VLATTIAFAY SLIILLVAMY ERAKVNPITF 780
781 FDTPPMLFVF IALGRWLEHI AKGKTSEALA KLISLQATEA TIVTLDSDNI LLSEEQVDVE 840
841 LVQRGDIIKV VPGGKFPVDG RVIEGHSMVD ESLITGEAMP VAKKPGSTVI AGSINQNGSL 900
901 LICATHVGAD TTLSQIVKLV EEAQTSKAPI QQFADKLSGY FVPFIVFVSI ATLLVWIVIG 960
961 FLNFEIVETY FPGYNRSISR TETIIRFAFQ ASITVLCIAC PCSLGLATPT AVMVGTGVGA 1020
1021 QNGILIKGGE PLEMAHKVKV VVFDKTGTIT HGTPVVNQVK VLTESNRISH HKILAIVGTA 1080
1081 ESNSEHPLGT AITKYCKQEL DTETLGTCID FQVVPGCGIS CKVTNIEGLL HKNNWNIEDN 1140
1141 NIKNASLVQI DASNEQSSTS SSMIIDAQIS NALNAQQYKV LIGNREWMIR NGLVINNDVN 1200
1201 DFMTEHERKG RTAVLVAVDD ELCGLIAIAD TVKPEAELAI HILKSMGLEV VLMTGDNSKT 1260
1261 ARSIASQVGI TKVFAEVLPS HKVAKVKQLQ EEGKRVAMVG DGINDSPALA MANVGIAIGT 1320
1321 GTDVAIEAAD VVLIRNDLLD VVASIDLSRE TVKRIRINFV FALIYNLVGI PIAAGVFMPI 1380
1381 GLVLQPWMGS AAMAASSVSV VLSSLFLKLY RKPTYESYEL PARSQIGQKS PSEISVHVGI 1440
1441 DDTSRNSPKL GLLDRIVNYS RASINSLLSD KRSLNSVVTS EPDKHSLLVG DFREDDDTAL

Number of amino acids: 1500
Molecular weight: 163373.7
Theoretical pI: 5.85

Atomic composition:
Carbon C 7246
Hydrogen H 11739
Nitrogen N 1929
Oxygen O 2203
Sulfur S 70

Formula: C7246H11739N1929O2203S70
Total number of atoms: 23187

Extinction coefficient at 280nm: 92450 M-1 cm-1
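As a quick sanity check on the ProtParam numbers above, the atom counts can be summed and weighed. This is a minimal sketch: the composition is copied from the listing above, while the standard atomic weights are my own input (rounded), so the mass should only be expected to agree with the reported molecular weight to within a dalton or so.

```python
# Atomic composition reported by ProtParam (copied from the listing above).
composition = {"C": 7246, "H": 11739, "N": 1929, "O": 2203, "S": 70}

# Rounded standard atomic weights in Da (an assumption of this sketch,
# not values taken from the ProtParam output).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

# Total atom count is just the sum of the per-element counts.
total_atoms = sum(composition.values())

# Mass implied by the formula: count times atomic weight, summed over elements.
formula_mass = sum(ATOMIC_MASS[element] * count
                   for element, count in composition.items())

print("Total atoms:", total_atoms)
print("Mass from formula (Da):", round(formula_mass, 1))
```

The atom total should match the "Total number of atoms" line, and the formula mass should land very close to the reported molecular weight.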

Alpha-synuclein protein properties

Alpha-synuclein

Function

Alpha-synuclein is a protein encoded by the SNCA gene. It has been found to play a role in autosomal dominant Parkinson Disease. It is believed to be involved in regulating dopamine release and transport, and in lessening the responsiveness to apoptotic stimuli. The SMART tool on the EMBL site provided a lot of interesting methods of exploring the protein, including a protein interaction network. Alpha-synuclein interacts with many proteins, most notably with UBE2L6, SNCAIP, and PARK2, all three of which in turn interact with one another. Each of these proteins has been found to play a role in PD.

Family, Domains, and Motifs

Alpha-synuclein is a member of the synuclein family. Members of the family are characterized by a “highly conserved alpha-helical lipid-binding motif with similarity to the class-A2 lipid-binding domains of the exchangeable apolipoproteins” (EMBL-EBI).
Both the ProfileScan Server and the PROSCAN revealed 3 motifs in the synuclein family:
  1. Casein kinase II phosphorylation site
  2. Tyrosine kinase phosphorylation site
  3. N-myristoylation site
UniProt identified the NAC domain, which is involved in fibril formation. It also suggests that the C-terminus could be involved in the regulation of the aggregation and size determination of filaments.

Physical Properties

The 3-D structure from the Protein Data Bank:















Length: 140AA
Molecular Weight: 14460 Da

Coiled-coil predictions
COILS predicted no coiled-coils

Phosphorylation predictions

NetPhos 2.0 Prediction Server predicted 1 serine, 3 threonine, and 3 tyrosine phosphorylation sites.

Sulfation predictions
The Sulfinator predicted 3 sulfated tyrosines.

Transmembrane prediction: conflicting results?
The DAS site predicts where transmembrane regions are within the protein. The alpha-synuclein prediction looks like this:











It predicts three possible transmembrane segments: one from aa 67-77, one from aa 69-75, and one from aa 87-91.
However...the TMpred server graphs a similar output, but does not find any significant transmembrane regions:















Localization

Alpha-synuclein is generally found in the presynaptic terminals of brain tissue.
From PSORT II:
52.2 %: cytoplasmic
30.4 %: nuclear
8.7 %: mitochondrial
4.3 %: vacuolar
4.3 %: vesicles of secretory system
From WoLF PSORT:
cyto: 13 (40.6%), mito: 9 (28.1%), extr: 6 (18.8%), nucl: 4 (12.5%)

From literature:
Primarily cytoplasmic. Also found in the nucleus.
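The WoLF PSORT line above gives both a count and a percentage for each compartment, and the percentages appear to be simply each count divided by the total. A minimal sketch of that arithmetic follows, assuming the counts are nearest-neighbor votes; the function name `neighbor_percentages` is made up for illustration.

```python
def neighbor_percentages(counts):
    """Convert per-compartment counts into percentages of the total."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Counts copied from the WoLF PSORT output for alpha-synuclein above.
wolf_counts = {"cyto": 13, "mito": 9, "extr": 6, "nucl": 4}
percentages = neighbor_percentages(wolf_counts)

for label, pct in percentages.items():
    print(f"{label}: {wolf_counts[label]} ({pct:.1f}%)")
```

Seen this way, the "40.6% cytoplasmic" call rests on 13 of 32 neighbors, which is a useful reminder of how coarse these percentages are.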

Colon Cancer MSH2 Mismatch Repair Protein

OMIM identified the MSH2 protein as a mismatch repair protein that, if mutated, can lead to unrepaired DNA damage and colon cancer. MSH2 was found to be homologous to the E. coli MutS gene. Purified MSH2 protein binds specifically to DNA containing insertion-deletion loop-type (IDL) mismatched nucleotides.



NCBI conserved domain database Blast search (rpsblast) using Homo sapiens MSH2 protein (AAB59565) reported MSH2 to be a member of the MutS DNA mismatch repair family. The molecular function of MSH2 is DNA and ATP-binding and the biological process is mismatch repair.


Domain relatives found from the rpsblast search include: MutS domain I found in proteins of the MutS family (DNA mismatch repair protein), MutS domain II also in the MutS family, MutSd DNA-binding domain of the MutS mismatch repair family, and MutS domain V of the MutS family, which was found to contain a Walker A motif structurally similar to ATPase domain in ABC transporters. Other members of the MutS family include Eukaryotic MSH1, 2, 3, 4, 5, and 6 proteins.


EMBL-EBI search also identified MSH2 as a mismatch repair protein with a dual role in DNA repair and apoptosis. Many proteins that are involved in DNA repair processes also play roles in apoptosis, preventing carcinogenesis when DNA damage is too extensive to repair. MSH2 functions in the nucleus as a heterodimer with MSH6 to replace mispaired bases.



The domains identified by the Human Protein Reference Database include MUTSd (aa's 321-645) and MUTSac (aa's 662-849) of the MutS family of DNA mismatch repair proteins. MUTSac is the ATPase binding domain of MSH2.


The coiled-coil motif (aa's 552-580) for MSH2 was identified by the Human Protein Reference Database. A coiled coil is a structural motif in which two or three alpha helices wind around each other to form a bundle.



The COILS program predicts coiled coil regions in proteins and calculates the probability that the sequence will adopt a coiled-coil conformation.








InterPro database search identified 5 domains for MSH2.

DNA mismatch repair protein MutS, C-terminal

DNA mismatch repair protein MutS, N-terminal

MutS III

MutS II

MutS IV

Marfan Syndrome – Fibrillin 1 Protein

Fibrillin 1 (NP_000129) is encoded by the FBN1 gene on chromosome 15 and is a crucial component of extracellular microfibrils. As a result, Fibrillin 1 is found in the elastic and non-elastic connective tissue of the body. Point mutations in the FBN1 gene have been demonstrated to affect the function of Fibrillin 1. This behavior has been well documented as the cause of Marfan Syndrome. The following websites and databases were used to understand more about the structure and function of Fibrillin 1 protein.

Primary/Secondary Structure

The FASTA format of Fibrillin 1 was crucial to most database searches and illustrated the primary structure of Fibrillin 1:

FASTA FORMAT – from NCBI

Compute pI/MW
Provided information on the length, molecular weight, and pI
Length – 2871
Molecular Weight – 312237.48
pI – 4.81
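For the molecular weight part, the calculation is presumably the sum of the average residue masses plus one water for the free termini. Here is a minimal sketch under that assumption; the residue mass table is the commonly used set of average masses, the function name `average_mass` is made up, and the short PHGGGWGQ repeat quoted in the prion post stands in for the full 2871-residue Fibrillin 1 sequence, which is not reproduced here.

```python
# Average residue masses (Da) for the 20 standard amino acids, i.e. the
# mass each residue contributes once it is part of a peptide chain.
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER = 18.01528  # one water is added for the free N- and C-termini


def average_mass(sequence):
    """Average molecular weight (Da) of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


# Example peptide: the mammalian PrP repeat motif quoted in the prion post.
print(round(average_mass("PHGGGWGQ"), 2))
```

This ignores signal peptide removal, disulfide bonds, and glycosylation, so for a heavily processed protein like Fibrillin 1 the computed value is only a starting point.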

COILS
Provided a prediction of where coiled-coil regions are in a protein, based on scanning windows of 14, 21, and 28 residues along the amino acid chain. Results are illustrated several ways (graphical output shown below). Peaks above 50% indicate a strong probability that coils exist. For Fibrillin 1, very few coil regions were observed.















Tertiary Structure

InterProScan was useful for relating the Fibrillin 1 protein to the structural domains of other proteins. It provided many follow-up websites that included information on the protein family, structural features, and examples of the many domains. One drawback of the program was the long time it took to provide results. However, the amount of information provided was useful and worth the wait. The results are shown below. As you can see, the matches indicated Ca2+ binding domains, which support the function of the protein.
Full Results:
SEARCH RESULTS

NCBI’s Structure Database was useful in providing a three dimensional crystal structure of the Fibrillin 1 protein. Only a representative portion was shown. The entire protein contains about 60 domains with 47 being able to bind Ca2+.
















Posttranslational Modifications

Posttranslational modifications are important to the function of many proteins. Certain websites contain programs that can predict where certain modifications can take place.

Sulfinator
This website predicted tyrosine sulfation sites.

Fibrillin 1 demonstrated 4 sites out of 93 tyrosines (See Below)
Tyrosine Positions-

434
1004
2849
2853

Phosphorylation Prediction
This website predicted the phosphorylation of serine, threonine, and tyrosine. The data can be viewed in several formats, including graphical (See Below). Peaks above the threshold indicate that phosphorylation will most likely occur.

Number of Phosphorylation sites predicted in Fibrillin 1: Ser: 65 Thr: 26 Tyr: 32











Glycosylation Prediction
This website predicted O-glycosylation sites in mammalian proteins

Fibrillin 1 illustrated two sites of O-glycosylation
Threonine - 371 and 2101

Protein Localization

TargetP
This site was useful in estimating the location of a protein in the cell

Fibrillin was theorized to be localized in
secretory pathway - 86%
mitochondrion - 13%
other - 1%

Protein Analysis of Beta-Hexosaminidase A

  • BLAST search identifies the human HEXA protein product (NCBI P06865) as a member of the single domain family 'Glycosyl hydrolase family 20,' with a catalytic domain identified (aa's 167-488) containing a TIM barrel, and 'domain 2' (aa's 35-165) containing a zincin-like fold: NCBI
  • Unprocessed precursor aa length = 529, MW = 60,689 Da

    Gene Ontology Terms:
    Molecular Fxn:
    beta-N-acetylhexosaminidase A activity, hydrolase activity, acting on glycosyl bonds (EBI InterProScan)
    Biological Process: carbohydrate metabolism, glycosphingolipid metabolism (EBI InterProScan)
    Cellular Component: lysosome
  • see also EBI QuickGO
  • Subcellular localization prediction by http://psort.nibb.ac.jp/cgi-bin/runpsort.pl almost certainly flawed:

34.8 %: cytoplasmic 26.1 %: extracellular, including cell wall 13.0 %: mitochondrial 13.0 %: nuclear 4.3 %: Golgi 4.3 %: vacuolar 4.3 %: endoplasmic reticulum

  • A deletion mutation study suggests that the GSEP sequence beginning at position 283 in the alpha subunit confers its binding ability to the GM2-GM2AP complex. The beta active site can hydrolyze neutral substrates comparable to GM2, but its inability to bind the negatively charged carbohydrate (i.e. removal of the aligned GSEP sequence) suggests a biological role, namely, prevention of the non-productive binding of GM2 to the beta active site, which is incapable of catalyzing negative substrates (Zarghooni et al., 2004).

  • Post-translational Modification: includes mannose 6-phosphate recognition particle for lysosomal targeting - associated with asparagine-linked oligosaccharide chains (Sonderfeld-Fresko et al., 1989).
  • O-glycosylation:
  • Sequence T 275 0.533
  • Targeting Prediction:TargetP 1.1 incorrectly predicted a secretory pathway: TargetP;
  • Transmembrane Domain Prediction: One transmembrane single helix predicted by SOSUI:






No.  N terminal  transmembrane region      C terminal  type     length
1    3           SSRLWFSLLLAAAFAGRATALWP   25          PRIMARY  23

...and supported by TMpred:

  • - A study suggests Hex A ultimately associates with the cell membrane following its lysosomal origin: PubMed
- BIND confirms Hex A's role in lipid metabolism, its interaction with Hex B and its vacuolar localization
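
To get a feel for why the N-terminal segment reported by SOSUI above looks like a membrane helix, its average hydropathy can be computed directly. This is a rough sketch that uses the Kyte-Doolittle hydropathy scale as a stand-in; SOSUI and TMpred each use their own scoring methods, so this is not a reimplementation of either. The function name `mean_hydropathy` is made up for illustration.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment):
    """Average Kyte-Doolittle hydropathy over a stretch of residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in segment) / len(segment)


# Residues 3-25 of Hex A, the segment SOSUI reported as a primary helix.
helix = "SSRLWFSLLLAAAFAGRATALWP"
print(len(helix), "residues, mean hydropathy", round(mean_hydropathy(helix), 2))
```

A hydrophobic stretch this close to the N-terminus is also what a signal peptide looks like, which may be why TargetP pointed to the secretory pathway for the same protein.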