Distributed Representations For Biological Sequence Analysis
2016 · Dhananjay Kimothi, Akshay Soni, Pravesh Biyani, et al.
Abstract
Biological sequence comparison is a key step in inferring the relatedness of various organisms and the functional similarity of their components. Thanks to Next Generation Sequencing efforts, an abundance of sequence data is now available to be processed for a range of bioinformatics applications. Embedding a biological sequence over a nucleotide or amino acid alphabet in a lower-dimensional vector space makes the data more amenable to use by current machine learning tools, provided the quality of the embedding is high and it captures the most meaningful information in the original sequences. Motivated by recent advances in the text document embedding literature, we present a new method, called seq2vec, to represent a complete biological sequence in a Euclidean space. The new representation has the potential to capture the contextual information of the original sequence necessary for sequence comparison tasks. We test our embeddings with protein sequence classification and retrieval
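To apply a document-embedding method to a biological sequence, the sequence first has to be turned into something resembling a list of words. The abstract does not say how seq2vec does this. The sketch below assumes overlapping k-mers with stride 1, a common choice in sequence analysis. The function name `kmer_tokens`, the default `k = 3`, and the toy protein fragment are hypothetical and chosen only for illustration.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers (stride 1).

    Assumption: k-mers play the role of "words"; the source abstract
    does not specify the tokenization used by seq2vec.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.strip().upper()
    # A sequence shorter than k yields an empty token list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Hypothetical toy protein fragment, for illustration only.
tokens = kmer_tokens("MKVLAAGIV", k=3)
print(tokens)
```

Each token list could then be treated as one "document" and passed to a document-embedding model of the reader's choice to obtain a fixed-length vector per sequence.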
Related papers
- Vector Embeddings By Sequence Similarity And Context For Improved Compression, Similarity Search, Clustering, Organization, And Manipulation Of cDNA Libraries (2023)
- Fast And Scalable Gene Embedding Search: A Comparative Study Of FAISS And ScaNN (2025)
- Site2vec: A Reference Frame Invariant Algorithm For Vector Embedding Of Protein-ligand Binding Sites (2020)
- Learned Indexing In Proteins: Extended Work On Substituting Complex Distance Calculations With Embedding And Clustering Techniques (2022)
- VERSE: Versatile Graph Embeddings From Similarity Measures (2018)
- SEEC: Semantic Vector Federation Across Edge Computing Environments (2020)
- Search Efficient Binary Network Embedding (2019)
- A Survey On Efficient Processing Of Similarity Queries Over Neural Embeddings (2022)