3.1 Reframing Biology as Language: A Conceptual Shift
For a long time, bioinformatics has wrestled—sometimes productively, sometimes not—with a deceptively simple question: how does a linear biological sequence give rise to complex, three-dimensional, functional reality? Proteins, after all, are not merely chains of amino acids; they are dynamic entities, folding, interacting, and responding to their environment with remarkable precision. And yet, despite decades of research grounded in biochemical principles and structural biology, the translation from sequence to function has remained only partially resolved. The difficulty lies not in the lack of data—if anything, the opposite is true—but in how that data is interpreted.
Traditional approaches leaned heavily on evolutionary logic. Tools such as BLAST and HMMER were built on the assumption that similarity implies shared ancestry, and by extension, shared structure and function (Altschul et al., 1997; Eddy, 1998; Finn et al., 2011). In many cases, this assumption holds. But it also introduces a subtle limitation: these methods are, in essence, retrospective. They excel at recognizing patterns that have already been observed but struggle when confronted with novelty—proteins without homologs, or sequences that diverge significantly from known families. As biological datasets expanded, particularly with repositories like the Protein Data Bank (Berman et al., 2000), this limitation became increasingly visible. There was, quite simply, more unknown than known.
It is here—perhaps somewhat unexpectedly—that ideas from natural language processing (NLP) began to reshape the field. The notion that biological sequences could be treated as a kind of language was not entirely new, but it gained real traction only when computational frameworks matured enough to support it. Amino acids, in this view, become tokens; sequences become sentences; and evolutionary constraints resemble grammar. The analogy is imperfect, certainly, but it is also surprisingly powerful. It allows us to think of biology not just as chemistry, but as information—structured, contextual, and, importantly, learnable.
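The token analogy can be made concrete: a protein sequence is tokenized character by character, much as a sentence is split into words. A minimal sketch follows; the vocabulary layout and special tokens are illustrative conventions, not taken from any particular model:

```python
# Map each of the 20 standard amino acids to an integer token ID,
# reserving the lowest IDs for special tokens, as masked language
# models conventionally do.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL = ["<pad>", "<cls>", "<eos>", "<mask>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + list(AMINO_ACIDS))}

def tokenize(sequence: str) -> list:
    """Convert a protein sequence into token IDs, framed by <cls>/<eos>."""
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB[res] for res in sequence]
    ids.append(VOCAB["<eos>"])
    return ids

# A short peptide becomes a "sentence" of integer tokens.
print(tokenize("MKTV"))  # [1, 14, 12, 20, 21, 2]
```

From the model's point of view, nothing downstream distinguishes these integers from the word IDs of an English sentence; the biology enters only through the statistics of which tokens co-occur.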
3.2 From Sequential Models to Attention Mechanisms
Early attempts to model biological sequences using deep learning borrowed from architectures designed for text processing. Recurrent neural networks (RNNs), and more specifically long short-term memory (LSTM) networks, were among the first to be adapted for this purpose (Hochreiter & Schmidhuber, 1997). Their appeal was straightforward: they process sequences step by step, maintaining a form of memory that, in principle, captures context.
Yet, in practice, this approach proved limiting. Biological sequences are not just long—they are complex in a way that challenges sequential processing. Interactions between residues may span hundreds of positions, and capturing these long-range dependencies is critical for understanding protein folding and function. RNNs, constrained by their stepwise nature, often fail to maintain such distant relationships effectively. The well-documented vanishing gradient problem exacerbates this issue, gradually diminishing the influence of earlier sequence elements as processing continues.
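The vanishing gradient problem admits a back-of-the-envelope illustration: backpropagation through a recurrent network multiplies the error signal by a step-to-step Jacobian factor at every position, so when that factor sits below one, the contribution of early residues decays geometrically. A toy calculation, where the factor 0.9 is an arbitrary stand-in for the Jacobian norm:

```python
# Backpropagating through T recurrent steps multiplies the gradient by the
# step-to-step Jacobian norm each time; values below 1 shrink the signal
# from early positions geometrically.
def gradient_magnitude(jacobian_norm: float, steps: int) -> float:
    return jacobian_norm ** steps

print(gradient_magnitude(0.9, 10))   # ~0.35: context 10 residues back survives
print(gradient_magnitude(0.9, 300))  # ~2e-14: a residue 300 positions away is invisible
```

For a protein whose fold depends on contacts hundreds of residues apart, this decay is not a numerical nuisance but a representational ceiling.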
The introduction of the Transformer architecture marked a decisive departure from this paradigm (Vaswani et al., 2017). By replacing recurrence with self-attention, Transformers allow every element in a sequence to interact with every other element simultaneously. This is not merely a computational convenience—it fundamentally changes what the model can represent. In the context of proteins, it means that distant residues can be linked directly, without the need for sequential propagation of information.
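At the heart of the Transformer is scaled dot-product attention, in which every position scores its affinity to every other position and aggregates information according to those scores. A minimal single-head sketch in NumPy; real models additionally learn query, key, and value projection matrices, which are omitted here for clarity:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention with queries, keys, and values
    all taken to be the input itself (learned projections omitted)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # pairwise position-position affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ x                            # each residue mixes all others in one step

# Five "residues" with 8-dimensional embeddings: positions 0 and 4 interact
# directly, with no sequential propagation in between.
rng = np.random.default_rng(0)
out = self_attention(rng.normal(size=(5, 8)))
print(out.shape)  # (5, 8)
```

The single matrix product `x @ x.T` is the crux: distance along the chain simply does not appear in the computation, which is exactly the property long-range residue contacts demand.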
There is, perhaps, a tendency to describe this transition as a clean break. In reality, it feels more like an accumulation of insights reaching a tipping point. The mathematical formalism of attention aligns almost uncannily with the biological reality of protein structures, where spatial proximity often overrides linear distance. The result is a model architecture that, almost by design, captures the kind of relationships that matter most in molecular systems.
3.3 Pre-training and the Emergence of Biological Representations
If the architecture provides the framework, pre-training provides the substance. Large language models in bioinformatics rely on self-supervised learning, a strategy that, at first glance, seems almost paradoxical. Instead of learning from labeled data—structures, functions, or experimental annotations—these models learn from raw sequences alone.
Two primary paradigms have emerged. The first, exemplified by BERT, uses masked language modeling, where portions of a sequence are hidden and the model is trained to predict them (Devlin et al., 2018). The second, associated with generative models like GPT, involves predicting the next token in a sequence (Radford et al., 2018). Both approaches, though conceptually simple, lead to surprisingly rich representations.
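The masked objective is easy to state in code. Below is a minimal sketch of the masking step alone; the predictive model itself is omitted, and the 15% rate follows the convention popularized by BERT:

```python
import random

MASK = "<mask>"

def mask_sequence(sequence: str, rate: float = 0.15, seed: int = 0):
    """Hide a random fraction of residues; the training target is the
    original residue at each masked position (BERT-style objective)."""
    rng = random.Random(seed)
    tokens, targets = [], {}
    for i, residue in enumerate(sequence):
        if rng.random() < rate:
            tokens.append(MASK)
            targets[i] = residue  # what the model must learn to predict
        else:
            tokens.append(residue)
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

The generative, GPT-style objective needs no masking at all: the target at position i is simply the residue at position i + 1.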
Through exposure to massive datasets—often comprising billions of sequences—models begin to internalize patterns that reflect underlying biological principles. Secondary structure motifs, residue co-evolution, and even aspects of protein stability appear to be encoded within the learned representations. What is striking is that this knowledge emerges without explicit supervision. The model is not told what a helix is, or how a binding site functions; it infers these patterns indirectly, through statistical regularities in the data.
This raises an interesting question: what, exactly, are these models learning? The answer is not entirely clear. They are not learning biology in the traditional sense, nor are they simply memorizing sequences. Instead, they appear to construct a latent representation space where biologically meaningful relationships are preserved. It is, in a way, an intermediate form of understanding—one that is useful, even if it is not fully interpretable.
3.4 From Sequence to Structure: Bridging a Longstanding Gap
Perhaps the most tangible impact of large language models has been in protein structure prediction. For many years, accurate prediction depended on multiple sequence alignments (MSAs), which provide evolutionary context by comparing related sequences. While effective, this approach is computationally intensive and inherently limited by the availability of homologous sequences.

Recent advances suggest that this dependency may not be as absolute as once thought. Protein language models, trained on single sequences, have demonstrated an ability to infer structural features directly. In some cases, they can predict three-dimensional conformations with remarkable accuracy, drawing on patterns embedded in their learned representations rather than explicit evolutionary comparisons.

This development resonates, interestingly, with earlier theoretical insights. Anfinsen’s principle—that sequence determines structure (Anfinsen, 1973)—is not new. What is new is the ability to operationalize that principle computationally, at scale. By integrating Transformer architectures with established structural knowledge, these models are narrowing the gap between sequence data and functional understanding.
The implications extend beyond basic science. In virology, for instance, the ability to rapidly assess the structural impact of mutations allows for early detection of potentially concerning variants. In therapeutic design, it enables the engineering of proteins with desired properties—binding specificity, stability, or catalytic efficiency—before they are synthesized in the lab.
3.5 Extending the Framework: Genomics and Cellular Systems
While proteins have been the primary focus, the application of large language models is not limited to proteomics. Genomic sequences, with their vast non-coding regions, present a different kind of challenge—one that is, perhaps, even more abstract. Regulatory elements such as promoters and enhancers do not follow simple patterns, and their function depends heavily on context.

Transformer-based models, with their capacity for capturing global dependencies, are well-suited to this task. By analyzing entire genomic regions simultaneously, they can identify subtle patterns associated with gene regulation. This has led to improved predictions of transcription factor binding sites and regulatory interactions, offering new insights into gene expression mechanisms.
In transcriptomics, the paradigm shifts again. Here, the focus is not on sequences but on expression profiles—high-dimensional representations of cellular states. Treating these profiles as “sentences” allows language models to classify cell types, identify subpopulations, and even infer developmental trajectories. The analogy to language is, admittedly, stretched in this context, but it remains conceptually useful.
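One way to make this stretch of the analogy concrete is rank-value encoding, a scheme used by several single-cell models: genes are ordered by their expression level, and the resulting ordered list of gene tokens serves as the cell's "sentence". A toy sketch, with a hypothetical five-gene profile as input:

```python
def expression_to_sentence(profile: dict, length: int = 5) -> list:
    """Rank-value encoding: order genes by expression so a cell's profile
    becomes an ordered 'sentence' of gene tokens, highest expressed first."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    return ranked[:length]

# A hypothetical expression profile (gene -> normalized expression).
cell = {"CD3E": 8.1, "GAPDH": 12.4, "MS4A1": 0.2, "ACTB": 11.0, "CD19": 0.1}
print(expression_to_sentence(cell, 3))  # ['GAPDH', 'ACTB', 'CD3E']
```

The ordering, rather than the raw magnitudes, becomes the signal a Transformer attends over, which is what lets the machinery built for sequences apply to a fundamentally non-sequential measurement.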
3.6 Toward Rational Drug Discovery
The integration of large language models into drug discovery represents one of the most promising—and perhaps most consequential—applications. Traditional drug development is notoriously inefficient, characterized by high costs and low success rates. A significant portion of this inefficiency stems from the difficulty of predicting how a molecule will interact with its biological target.

Language models offer a different approach. By embedding both proteins and chemical compounds into a shared representation space, they enable the prediction of drug–target interactions with increasing accuracy. Representations derived from sequence data can be combined with chemical encodings, such as SMILES strings, to model binding affinity and specificity.
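The shape of such a pipeline can be sketched with stand-in components. In the toy below, both the protein and the compound are reduced to 20-dimensional character-frequency vectors and compared with cosine similarity; in a real system, both featurizers would be learned encoders (a protein language model and a chemical language model) trained jointly so that interacting pairs score highly. The aspirin SMILES string serves purely as an example input, and the SMILES alphabet is an arbitrary toy choice:

```python
import math
from collections import Counter

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard amino acids
SMILES_ALPHABET = "CNOSPFI()=#123456789"  # 20 common SMILES characters (toy choice)

def embed(text: str, alphabet: str) -> list:
    """Toy 'encoder': normalized character frequencies over a fixed alphabet.
    A real pipeline would substitute a learned sequence embedding."""
    counts = Counter(text)
    total = len(text) or 1
    return [counts[c] / total for c in alphabet]

def cosine(u, v) -> float:
    """Similarity of two equal-length vectors, here standing in for a
    learned drug-target interaction score."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"  # SMILES for acetylsalicylic acid
score = cosine(embed(protein, AA_ALPHABET), embed(aspirin, SMILES_ALPHABET))
```

What the toy preserves from the real systems is only the architecture of the idea: two modalities mapped into vectors of the same dimension, with a cheap geometric comparison replacing an expensive wet-lab experiment as the first filter.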
This does not eliminate uncertainty, of course. Biological systems remain complex, and in silico predictions must be validated experimentally. But it does shift the balance—from blind screening toward informed design. Molecules can be prioritized based on predicted efficacy, reducing the need for exhaustive testing and accelerating the overall pipeline.
3.7 Challenges, Interpretability, and the Limits of Abstraction
Despite their successes, large language models are not without limitations. The most frequently cited concern is interpretability. These models can produce accurate predictions, but the reasoning behind those predictions is often opaque. For experimental scientists, this lack of transparency can be problematic. A prediction, no matter how accurate, is difficult to trust without some understanding of its basis.

Efforts to address this issue include attention visualization and attribution methods, which attempt to highlight the parts of a sequence that contribute most to a prediction. While promising, these approaches are still evolving and do not fully resolve the underlying challenge.

There are also practical considerations. Training large models requires significant computational resources, raising questions about accessibility and sustainability. Moreover, generative models may produce sequences that are syntactically valid but biologically implausible—a form of “hallucination” that underscores the need for careful validation.
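Attribution can be illustrated without any real model. Occlusion, one of the simplest such methods, masks one position at a time and records how much the prediction changes; in the sketch below the "model" is a deliberately trivial hydrophobicity score, so the attributions are exactly interpretable:

```python
def occlusion_attribution(sequence: str, score_fn, mask_char: str = "X") -> list:
    """Per-residue importance by occlusion: mask each position in turn and
    measure how far the model's score moves from the unmasked baseline."""
    base = score_fn(sequence)
    deltas = []
    for i in range(len(sequence)):
        occluded = sequence[:i] + mask_char + sequence[i + 1:]
        deltas.append(base - score_fn(occluded))
    return deltas

# Stand-in "model": fraction of hydrophobic residues (purely illustrative).
HYDROPHOBIC = set("AILMFVWY")

def score(seq: str) -> float:
    return sum(r in HYDROPHOBIC for r in seq) / len(seq)

attr = occlusion_attribution("MKVLAA", score)
# Masking the hydrophobic M (position 0) lowers the score, so it receives
# positive attribution; masking the polar K (position 1) changes nothing.
```

With a genuine language model in place of `score`, the same loop yields a per-residue importance profile, though the interpretation is rarely this clean.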
3.8 Conclusion: Toward a Unified Biological Intelligence
What emerges from these developments is not simply a new set of tools, but a shift in perspective. Biology, increasingly, is being understood as an information system—one that can be modeled, decoded, and, to some extent, rewritten. Large language models do not replace traditional methods; rather, they complement them, offering new ways to interpret complex data.

The idea of a “linguistic turn” in biology may still feel tentative, perhaps even metaphorical. And yet, it captures something real: the growing recognition that biological sequences carry information in ways that resemble language, and that this information can be learned through statistical models.

As these approaches mature, they may bring us closer to a more unified understanding of life—one in which sequence, structure, and function are not treated as separate domains, but as interconnected aspects of a single system. Whether this vision will fully materialize remains to be seen. But the direction, at least, seems increasingly clear.