About The Ocean Protein Portal

The Ocean Protein Portal (OPP) is a recently launched prototype data-sharing platform for ocean metaproteomics data. A workflow diagram is provided here: OceanPortalWorkflow.pdf. A video tutorial can be found here. A brief tutorial on using the Ocean Protein Portal is provided here: PortalWorkflow.pdf.

A review publication that describes best practices for data sharing of ocean metaproteomic data that helped to guide this portal’s development is now available (https://pubs.acs.org/doi/pdf/10.1021/acs.jproteome.8b00761).

Introduction to Proteins

Proteins comprise roughly half of the mass in organisms, and serve a variety of critical cellular functions. Enzymes are proteins that catalyze chemical reactions in cellular metabolism. Many of these reactions are fundamental to the global biogeochemical cycles of chemical elements and molecules, and hence learning about the distribution of these catalysts can inform our understanding of biogeochemistry. Proteins also can be transporters of key nutritional or toxic elements into or out of cells, as well as serving key structural and regulatory functions. In this regard, the measurement of proteins, or proteomics, can provide valuable information regarding the biochemical functions that are occurring. When proteins are measured in natural environments that contain a great diversity of organisms not easily separated, the analysis of proteins is known as metaproteomics.

Introduction to the Ocean Protein Portal (OPP): Where is my protein in the Oceans?

The OPP is intended to allow those interested in a particular protein or function to explore where their protein of interest exists in the ocean. It provides a suite of search capabilities and filters to interrogate metaproteomic datasets that have been deposited into the OPP. For example, proteins can be found by their functional name (product name) or by KEGG, Pfam, or Enzyme Commission number information (Boolean search terms are enabled). Alternatively, full amino acid protein sequences can be pasted into the search box. Those sequences will be digested in silico into tryptic peptides, and all exact matches between those tryptic peptides and peptides discovered by mass spectrometry will be returned. Example searches for the OPP include metal-related proteins (search the word iron or nickel), important enzymes (Rubisco, superoxide dismutase), and specific Enzyme Commission numbers within the protein annotation (searching 1.15.1.1 in product name will return nickel superoxide dismutase).
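The sequence search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OPP's actual implementation: it uses the common trypsin convention (cleave after K or R unless the next residue is P), allows no missed cleavages, and the function names and the `min_length` cutoff are hypothetical.

```python
def tryptic_digest(sequence, min_length=6):
    """Digest a protein sequence in silico into tryptic peptides.

    Assumed rule: cleave after K or R unless the next residue is P.
    Peptides shorter than min_length (a hypothetical cutoff) are dropped.
    """
    sequence = "".join(sequence.split()).upper()  # remove whitespace and newlines
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_length]


def match_observed(query_sequence, observed_peptides):
    """Return the query's tryptic peptides that exactly match observed peptides."""
    observed = set(observed_peptides)
    return [p for p in tryptic_digest(query_sequence) if p in observed]


# Toy query sequence and a made-up set of peptides "observed" by mass spectrometry.
query = "AAAAAAKCCCCCCRPDDDDDDKEEEEEE"
hits = match_observed(query, {"AAAAAAK", "GGGGGGK"})
print(hits)
```

Only peptides that are identical to an observed peptide are returned; a query peptide that differs by even one residue produces no hit.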

Introduction to the METATRYP Least Common Ancestor Search: Who makes my Protein?

Protein sequence information can also provide taxonomic information regarding its biological source. This is done by comparing tryptic peptide sequences to those found within the representative microbial genomes, single amplified genomes, and metagenomes in the METATRYP database. A Least Common Ancestor (LCA) for a peptide is the taxonomic branch point shared by all organisms in the database that carry that exact peptide sequence. Previous pairwise genomic analysis of 50 microbial genomes found that the number of shared tryptic peptides between ocean microbial species is generally less than 5% and often less than 1% of all tryptic peptides within a genome (Saito et al., 2015), allowing significant opportunity for taxonomic analysis within tryptic peptide space if metaproteomic taxonomy tools such as METATRYP are employed. The METATRYP database is searched for shared tryptic peptides by OPP users in real time through an API to the independent METATRYP site. More information about METATRYP, including the microbial genomes and metagenomes in its database, is available at the standalone METATRYP page.
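As a rough illustration of the LCA idea, the sketch below takes the lineages of every organism that contains a given peptide and returns the most specific taxon they all share. It is an assumption-laden sketch rather than METATRYP's code: the function name is hypothetical, and the lineages are simplified, illustrative root-to-leaf lists, not records from the METATRYP database.

```python
def least_common_ancestor(lineages):
    """Return the most specific taxon shared by every lineage.

    Each lineage is a list of taxon names ordered from root to leaf.
    Returns None if there are no lineages or nothing is shared.
    """
    if not lineages:
        return None
    shared = None
    for ranks in zip(*lineages):      # walk all lineages one rank at a time
        if len(set(ranks)) != 1:      # lineages diverge at this rank
            break
        shared = ranks[0]
    return shared


# Simplified, illustrative lineages (assumed for this example only).
pro_a = ["Bacteria", "Cyanobacteria", "Prochlorococcus", "Prochlorococcus strain A"]
pro_b = ["Bacteria", "Cyanobacteria", "Prochlorococcus", "Prochlorococcus strain B"]
syn_a = ["Bacteria", "Cyanobacteria", "Synechococcus", "Synechococcus strain A"]

print(least_common_ancestor([pro_a, pro_b]))         # peptide shared within one genus
print(least_common_ancestor([pro_a, pro_b, syn_a]))  # peptide shared across genera
```

A peptide found in a single genome resolves to that genome, while a peptide shared across genera resolves only to a broader group, which is why a low fraction of shared tryptic peptides between species is useful for taxonomic assignment.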

Ocean Datasets Ingested into the Ocean Protein Portal

A list of datasets currently within the OPP is here. If you are interested in depositing data to the OPP, please contact us at oceanproteinportal@whoi.edu. Inquiries to the OPP about dataset submission are welcome, although since the OPP is a prototype we may not be able to accommodate all submissions. Raw files are recommended for deposition at mass spectral repositories such as those affiliated with ProteomeXchange. Laboratory culture studies are recommended for deposition at other repositories, as the OPP is not currently scoped/funded to host laboratory studies.

Links to relevant data and tools

Links are embedded within the OPP for sequence analysis (NCBI BLAST-P), as well as for various identification numbers for protein families, enzyme classifications, and taxonomy (Pfam, KEGG, NCBI Taxon ID). There are also links to corresponding environmental data at BCO-DMO (www.BCO-DMO.org) to provide connections to biogeochemical and biological datasets.

Metaproteomic Data Units: User Beware

Ocean metaproteomics is a young and rapidly advancing science in which abundances may be reported in a variety of relative units or in absolute units of concentration. The OPP currently hosts spectral count data, but was designed with the ability to host a variety of data types.

Data Use Policies

The current OPP platform is intended to provide insights into where proteins exist in the oceans. The OPP is adopting data use policies similar to those of the GEOTRACES program, where correct attribution and citation are viewed as an important aspect of the data policy. Moreover, the 2018 Workshop participants for Best Practices in Data Sharing (see the review here: https://pubs.acs.org/doi/pdf/10.1021/acs.jproteome.8b00761) recommended that users interested in using metaproteomic datasets in publications contact the data generators and consider discussing collaboration. This serves two important purposes. First, given the youth of the metaproteomic data type, there is a danger that non-expert users will misinterpret or misuse data, especially in cross-dataset comparisons and normalizations, and publication of incorrect data use could damage community confidence in metaproteomics. Second, attribution to and collaboration with the data generators will create a valuable incentive to share future datasets in the OPP’s data search and visualization environment, rather than solely depositing data in raw spectra repositories, and hence supports the sustainability of the OPP.

Selection of Datasets using Data and Physical (Membrane) Filters

Although the OPP contains limited datasets during its Beta Technical release, filters are provided so that, as more data are deposited in the coming years, users can select which datasets are searched. These include expeditions, dates, geographic locations, depth, and filter size(s). The last criterion allows selection of the microbial communities being searched, with 0.2 µm being a typical filter pore size that captures the entire microbial community. Larger filter size fractions can select for or against eukaryotic protists (phytoplankton and mixotrophs) and even sinking particles.
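The kind of metadata filtering described above might look like the following sketch. Everything here is hypothetical and for illustration only: the `Sample` record, its field names, the `select_samples` function, and the expedition labels are assumptions, not the OPP's data model or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    """Hypothetical sample metadata record (field names are assumptions)."""
    expedition: str
    collected: date
    depth_m: float
    filter_pore_um: float


def select_samples(samples, expedition=None, depth_range=None, filter_pore_um=None):
    """Keep samples that satisfy every filter that was supplied."""
    selected = []
    for s in samples:
        if expedition is not None and s.expedition != expedition:
            continue
        if depth_range is not None and not (depth_range[0] <= s.depth_m <= depth_range[1]):
            continue
        if filter_pore_um is not None and s.filter_pore_um != filter_pore_um:
            continue
        selected.append(s)
    return selected


# Made-up samples from two made-up expeditions.
samples = [
    Sample("CRUISE-A", date(2016, 10, 3), 20.0, 0.2),
    Sample("CRUISE-A", date(2016, 10, 3), 300.0, 0.2),
    Sample("CRUISE-A", date(2016, 10, 4), 20.0, 3.0),
    Sample("CRUISE-B", date(2017, 5, 12), 50.0, 0.2),
]

# Upper 100 m of one expedition, whole-community (0.2 µm) filters only.
surface = select_samples(samples, expedition="CRUISE-A",
                         depth_range=(0, 100), filter_pore_um=0.2)
print(len(surface))
```

Leaving a filter unset (`None`) means it is not applied, so the same function covers a broad search across all datasets and a narrow search of one size fraction at one depth range.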

Differences from DNA Searches

Notably, typical DNA-based sequence searches (e.g. BLAST) differ from how sequence searches in the OPP are set up. Proteomic identifications are based on the determination of exact molecular weights from mass spectra, which are then matched to genome or metagenome sequences. Hence the sequence variation within metaproteomics is captured through exact mapping to genomic and metagenomic sequences and the sequence variation therein. In other words, any tryptic peptide generated from a user query must have 100% identity with a peptide discovered in the mass spectra by matching to masses predicted from DNA sequence. In this model, tryptic peptides (and the amino acid sequences within them) can be considered the basal unit of information within these metaproteomic datasets, confirmed by stringent parent and fragment ion data. For this reason, the OPP will only return tryptic peptides that have been measured in the environment, rather than those that may merely be related to what has been measured. It is possible to conduct an NCBI BLAST-P search of identified proteins of interest from the Protein Data/Sequence popup.
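To illustrate why identification rests on exact sequences, the sketch below computes the monoisotopic mass of a peptide from standard amino acid residue masses plus one water. It is a simplified illustration, not the search procedure behind the OPP: real identifications also rely on fragment ion spectra and dedicated search software, and the function name is hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one H2O for the free N- and C-termini


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in peptide) + WATER


reference = "LAVEGK"   # toy peptide
variant = "LSVEGK"     # same peptide with a single A -> S substitution
shift = peptide_mass(variant) - peptide_mass(reference)
print(round(shift, 4))
```

A single substitution changes the predicted mass by many times the measurement error of a modern instrument, so a closely related sequence that was not itself in the search database is not identified as a match. This is the sense in which the OPP returns exact peptides rather than similar ones.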

OPP Schedule and Sustainability

The OPP is a prototype funded by a two-year NSF EarthCube grant (https://nsf.gov/awardsearch/showAward?AWD_ID=1639714&HistoricalAwards=false) to Mak Saito and Danie Kinkade (WHOI) from September 2016 to September 2018. It was explicitly designed so that sustainability costs would be moderate, allowing it to become a sustainable data sharing portal for ocean metaproteomic datasets. It is the hope of the OPP team that the Ocean Protein Portal will be viewed as a useful resource by the ocean chemistry and biology, biochemical protein science, and educational communities in the coming years. Please support the OPP and provide submissions and feedback, so that this prototype can continue to serve a broad scientific community.

Release Schedule

The Ocean Protein Portal Prototype was released on 2/26/19 at a Town Hall at the Aquatic Sciences meeting in Puerto Rico. Prior to that, the OPP Beta Technical version was released in the summer of 2018 (6/7/18) for the EarthCube all hands meeting (DC), the American Society for Mass Spectrometry meeting (San Diego), and the SciPy scientific computing meeting in Austin, Texas. The METATRYP software (2.0) that operates behind the OPP was released as a standalone web interface in February 2018 at the Ocean Science Meeting. The METATRYP command line software was released in 2015 (Saito et al., 2015; Proteomics). Feedback and improvements were scheduled for incorporation in the Fall of 2018, along with ocean metaproteomic data ingestion, with a target date for a Scientific Release at the February 24th 2019 Aquatic Sciences Meeting.

About OPP: Software Infrastructure

The OPP and METATRYP are hosted on virtual machines at WHOI, using Python, Elasticsearch, Django, JavaScript, PostgreSQL, OceanMap, Bokeh, and Matplotlib.

OceanMap attribution: Satellite: Tiles © Esri — Source: Esri, i-cubed, USDA, USGS, AEX, GeoEye, Getmapping, Aerogrid, IGN, IGP, UPR-EGP, and the GIS User Community

Acknowledgements and Support

The development of the OPP was supported by an NSF EarthCube grant: “Laying the Foundation for an Ocean Protein Portal”. The underlying METATRYP peptide taxonomic software was developed under a grant from the Gordon and Betty Moore Foundation Marine Microbiology Initiative program. The OPP team is a collaboration between the Saito laboratory, the Information Services Application group, and the Biological and Chemical Oceanography Data Management Office, all at the Woods Hole Oceanographic Institution. Consulting services were provided by the RPS group. The 2016-2018 OPP development team consists of Mak Saito (PI), Jaci Saunders, and Noelle Held (Saito lab); Nick Symmonds, David Gaylord, and Joe Futrelle (WHOI IS Applications group); Danie Kinkade (Co-PI) and Adam Shepherd (BCO-DMO); and Michael Chagnon and Paul Duffy (RPS). The efforts of the participants of the Data Sharing Workshop for Ocean Metaproteomics (May 2017) were also instrumental in developing best practices for ocean metaproteomics data sharing. Metagenomic resources for METATRYP were provided by Chris Dupont at the J.C. Venter Institute.

Feedback

Feedback regarding the Ocean Protein Portal is actively encouraged so we may improve its functionality. Please write us at oceanproteinportal@whoi.edu.