Scientists Propose OntoProtein For Protein Models

OntoProtein is the first general framework to incorporate structured knowledge from Gene Ontology into protein pre-training.

Author: Suleman Shah
Reviewer: Han Ju
May 08, 2022
Self-supervised protein language models have been successful at learning protein representations. With increased computing capacity, existing protein language models pre-trained on millions of sequences can scale from millions to billions of parameters and yield striking improvements. However, these prevalent techniques rarely incorporate knowledge graphs, which can provide rich, structured factual information for better protein representations. Knowledge graphs that encode useful biological knowledge can help represent proteins more faithfully when combined with data from other sources.
OntoProtein was proposed by researchers from Zhejiang University in China, led by Ningyu Zhang and Zhen Bi. It is the first general framework to bring structure from Gene Ontology into protein pre-training models. The researchers constructed a new large-scale knowledge graph that links Gene Ontology terms with proteins, in which every node is described by gene annotation text or a protein sequence. They also proposed a novel contrastive learning method with knowledge-aware negative sampling that jointly optimizes the knowledge graph embedding and the protein embedding during pre-training. OntoProtein outperforms existing methods built on pre-trained protein language models at predicting how proteins interact with each other and how they function.
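To make the graph construction concrete, here is a minimal sketch of how such protein-to-GO-term triples might be represented in code. The node IDs, relation label, and class names are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch of knowledge graph triples linking proteins and GO terms.
# Identifiers and the relation label are examples, not the dataset's schema.
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    # A protein node carries its amino-acid sequence; a GO node carries
    # its annotation text, as described above.
    description: str
    is_protein: bool

@dataclass
class Triple:
    head: Node       # typically a protein
    relation: str    # e.g. an annotation relation such as "enables"
    tail: Node       # typically a Gene Ontology term

# Example: a protein annotated with a molecular-function GO term.
p53 = Node("P04637", "MEEPQSDPSV...", is_protein=True)  # truncated sequence
go_term = Node("GO:0003700",
               "DNA-binding transcription factor activity",
               is_protein=False)
annotation = Triple(p53, "enables", go_term)
```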

OntoProtein

OntoProtein is the first broad framework to incorporate external knowledge graphs into protein pre-training: a protein pre-training model with gene ontology embedding. The researchers proposed a hybrid encoder that represents both English text and protein sequences, together with contrastive learning using knowledge-aware negative sampling to jointly optimize the knowledge graph embedding and the protein sequence embedding during pre-training. For the knowledge embedding, they encoded the node descriptions (GO annotations) as the corresponding entity embeddings. They used all three branches of Gene Ontology, molecular function, cellular component, and biological process, and developed a knowledge-aware negative sampling strategy for the knowledge embedding objective. Through masked language modeling, OntoProtein inherits the strong protein comprehension of protein language models; at the same time, supervision from the knowledge graph lets it integrate biological knowledge into its protein representations via the knowledge embedding objective. This objective is agnostic to the downstream protein task: the model architecture and training objectives can be adapted to different kinds of tasks.
Illustration of protein molecule
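As a rough illustration of the contrastive objective described above, the sketch below implements a margin-based loss over one positive triple and several hard negatives, using a TransE-style scoring function. The scoring function, margin value, and function names are assumptions for illustration, not the authors' exact formulation; "knowledge-aware" negatives are stood in for by pre-sampled tensors.

```python
# Sketch of a contrastive knowledge-embedding loss with negative sampling.
# TransE-style scoring is an assumption here, not necessarily the paper's choice.
import torch
import torch.nn.functional as F

def transe_score(head, rel, tail):
    # Higher (less negative) score means a more plausible triple.
    return -torch.norm(head + rel - tail, p=2, dim=-1)

def knowledge_embedding_loss(h, r, t, neg_tails, margin=1.0):
    # h, r, t: (batch, dim) embeddings of positive triples.
    # neg_tails: (batch, k, dim) knowledge-aware negative tail embeddings,
    # i.e. hard negatives chosen with some knowledge-guided sampling scheme.
    pos = transe_score(h, r, t)                                    # (batch,)
    neg = transe_score(h.unsqueeze(1), r.unsqueeze(1), neg_tails)  # (batch, k)
    # Hinge loss: push each positive above every negative by `margin`.
    return F.relu(margin - pos.unsqueeze(1) + neg).mean()

# Toy usage with random tensors standing in for encoder outputs.
batch, k, dim = 4, 8, 128
h = torch.randn(batch, dim, requires_grad=True)
r = torch.randn(batch, dim, requires_grad=True)
t = torch.randn(batch, dim, requires_grad=True)
neg_tails = torch.randn(batch, k, dim)
loss = knowledge_embedding_loss(h, r, t, neg_tails)
loss.backward()
print(float(loss))
```

In a full pipeline, the protein embeddings would come from the protein encoder and the GO-term embeddings from the text encoder of the hybrid model, so this loss trains both jointly.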

Masked Protein Modeling And Knowledge Embedding

To build OntoProtein, the researchers combined a masked protein modeling objective with the knowledge embedding objective. They generated a novel knowledge graph dataset by combining Gene Ontology with publicly annotated proteins, used it to pre-train the model, and then evaluated the model on many downstream tasks. The Tasks Assessing Protein Embeddings (TAPE) benchmark was employed to assess protein representation learning. TAPE covers three categories of tasks: structure prediction, evolutionary understanding, and protein engineering. To analyze OntoProtein, the researchers chose six representative datasets, including secondary structure (SS) and contact prediction. Protein-protein interactions (PPI), the high-specificity physical contacts formed between two or more protein molecules, are treated as a sequence classification task and evaluated on three datasets of different sizes.
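For readers unfamiliar with masked protein modeling, the sketch below shows the corruption step, analogous to BERT's masked language modeling but over amino-acid tokens. The vocabulary, masking rate, and function name are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the masking step in masked protein modeling.
# The model is trained to recover the original residues at masked positions.
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
MASK_TOKEN = "<mask>"

def mask_sequence(seq, mask_rate=0.15, seed=None):
    """Replace roughly `mask_rate` of residues with a mask token; return the
    corrupted token list and the {position: original residue} targets."""
    rng = random.Random(seed)
    tokens, targets = list(seq), {}
    for i, residue in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = residue     # ground truth the model must predict
            tokens[i] = MASK_TOKEN
    return tokens, targets

corrupted, targets = mask_sequence("MKTAYIAKQR", mask_rate=0.3, seed=0)
print(corrupted)  # e.g. ['M', '<mask>', 'T', ...]
print(targets)    # e.g. {1: 'K', ...}
```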
OntoProtein outperforms strong baselines on most of these tests. It surpasses the TAPE Transformer and ProtBert on structure and contact prediction, demonstrating that it benefits from informative biological knowledge graphs during pre-training, and it also performs well at fluorescence prediction. OntoProtein, on the other hand, does not do as well on protein engineering, homology, and stability prediction, all of which are regression tasks. This is most likely due to the absence of sequence-level goals in the pre-training objective. The proposed method can be viewed as joint pre-training over human language and protein sequences, the language of life. This research aims to read the code of the language of life by pre-training protein representations enriched with gene-level knowledge.

Conclusion

The researchers are the first to incorporate external factual knowledge from Gene Ontology into protein models. OntoProtein (protein pre-training with gene ontology embedding) is the first broad framework to integrate external knowledge graphs into protein pre-training. Experimental findings on widely used protein tasks show that effective knowledge injection aids in understanding and uncovering the language of life. Furthermore, OntoProtein is compatible with the model parameters of many pre-trained protein language models, so users can initialize OntoProtein from existing pre-trained parameters without changing the architecture. These promising findings point to future work on improving OntoProtein by infusing more helpful knowledge through gene ontology selection and by extending the technique to other sequence generation challenges for protein design.