Vector Embeddings By Sequence Similarity And Context For Improved Compression, Similarity Search, Clustering, Organization, And Manipulation Of cDNA Libraries
2023 · Daniel H. Um, David A. Knowles, Gail E. Kaiser
Abstract
This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). FASTA/FASTQ files have several current limitations, such as their large file sizes, slow processing speeds for mapping and alignment, and contextual dependencies. These challenges significantly hinder investigations and tasks that involve finding similar sequences. The solution lies in transforming sequences into an alternative representation that facilitates easier clustering into similar groups compared to the raw sequences themselves. By assigning a unique vector embedding to each short sequence, it is possible to more efficiently cluster and improve upon compression performance for the string representations of cDNA libraries. Furthermore, through learning alternative coordinate vector embeddings based on the contexts of codon triplets, we can demonstrate clustering based on amino acid properties. Finally, using this sequence embed
Related papers
- Distributed Representations For Biological Sequence Analysis (2016) · 0.00
- Fast And Scalable Gene Embedding Search: A Comparative Study Of FAISS And ScaNN (2025) · 2.26
- Learned Indexing In Proteins: Extended Work On Substituting Complex Distance Calculations With Embedding And Clustering Techniques (2022) · 5.84
- Utilizing Low-dimensional Molecular Embeddings For Rapid Chemical Similarity Search (2024) · 4.52
- Site2vec: A Reference Frame Invariant Algorithm For Vector Embedding Of Protein-ligand Binding Sites (2020) · 5.84
- LeanVec: Searching Vectors Faster By Making Them Fit (2023) · 0.00
- Nearest Neighbor Search With Compact Codes: A Decoder Perspective (2021) · 3.58
- Search Efficient Binary Network Embedding (2019) · 3.58