
AI-Designed Bacteriophages: A Breakthrough in Synthetic Biology

Updated: Apr 10

In molecular biology and synthetic biology, one of the ultimate ambitions has long been the ability to design and construct entirely new living systems at the scale of whole genomes. With the rapid advancement of DNA sequencing and synthesis technologies, our ability to read and edit genetic sequences has greatly improved. However, most previous gene-editing techniques have relied on modifying DNA sequences that already exist in natural organisms. Designing a functional genome entirely from scratch has remained an enormous challenge.


DNA itself is a chain built from four kinds of deoxyribonucleotides, whose bases are adenine, thymine, cytosine, and guanine. For simplicity, these are commonly represented by the letters A, T, C, and G. For example, the coding region of the human β-actin gene contains 1,128 nucleotides arranged in the following sequence:


5'-ATGGATGATGATATCGCCGCGCTCGTCGTCGACAACGGCTCCGGCATGTGCAAGGCCGGCTTCGCGGGCGACGATGCCCCCCGGGCCGTCTTCCCCTCCATCGTGGGGCGCCCCAGGCACCAGGGCGTGATGGTGGGCATGGGTCAGAAGGATTCCTATGTGGGCGACGAGGCCCAGAGCAAGAGAGGCATCCTCACCCTGAAGTACCCCATCGAGCACGGCATCGTCACCAACTGGGACGACATGGAGAAAATCTGGCACCACACCTTCTACAATGAGCTGCGTGTGGCTCCCGAGGAGCACCCCGTGCTGCTGACCGAGGCCCCCCTGAACCCCAAGGCCAACCGCGAGAAGATGACCCAGATCATGTTTGAGACCTTCAACACCCCAGCCATGTACGTTGCTATCCAGGCTGTGCTATCCCTGTACGCCTCTGGCCGTACCACTGGCATCGTGATGGACTCCGGTGACGGGGTCACCCACACTGTGCCCATCTACGAGGGGTATGCCCTCCCCCATGCCATCCTGCGTCTGGACCTGGCTGGCCGGGACCTGACTGACTACCTCATGAAGATCCTCACCGAGCGCGGCTACAGCTTCACCACCACGGCCGAGCGGGAAATCGTGCGTGACATTAAGGAGAAGCTGTGCTACGTCGCCCTGGACTTCGAGCAAGAGATGGCCACGGCTGCTTCCAGCTCCTCCCTGGAGAAGAGCTACGAGCTGCCTGACGGCCAGGTCATCACCATTGGCAATGAGCGGTTCCGCTGCCCTGAGGCACTCTTCCAGCCTTCCTTCCTGGGCATGGAGTCCTGTGGCATCCACGAAACTACCTTCAACTCCATCATGAAGTGTGACGTGGACATCCGCAAAGACCTGTACGCCAACACAGTGCTGTCTGGCGGCACCACCATGTACCCTGGCATTGCCGACAGGATGCAGAAGGAGATCACTGCCCTGGCACCCAGCACAATGAAGATCAAGATCATTGCTCCTCCTGAGCGCAAGTACTCCGTGTGGATCGGCGGCTCCATCCTGGCCTCGCTGTCCACCTTCCAGCAGATGTGGATCAGCAAGCAGGAGTATGACGAGTCCGGCCCCTCCATCGTCCACCGCAAATGCTTCTAG-3'

 

DNA (Image source: Darryl Leja for the National Human Genome Research Institute, CC0 1.0)

DNA sequences in genes are first transcribed into RNA molecules, and for many genes the RNA is further translated into proteins. These proteins, together with non-coding RNAs that are not translated, constitute essential molecular components that sustain cellular life and support a wide range of metabolic processes. Different DNA sequences lead to different RNA and protein products, forming the basis of the tens of thousands of genes found in the human genome.
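
As a rough illustration of this flow from DNA to RNA to protein, the sketch below transcribes a coding sequence into mRNA and translates it with the standard genetic code. The function names are illustrative choices, and the example input is simply the first 30 nucleotides of the β-actin sequence shown above; real genes also involve regulation, splicing, and protein folding, none of which is modeled here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with amino acids listed in TCAG codon order
# (first base varies slowest, third base fastest). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Convert coding-strand DNA into mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate coding-strand DNA codon by codon until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    start_of_actb = "ATGGATGATGATATCGCCGCGCTCGTCGTC"
    print(transcribe(start_of_actb))
    print(translate(start_of_actb))  # MDDDIAALVV
```

Each group of three letters (a codon) maps to one amino acid, which is why different DNA sequences give different protein products.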


Gene-editing technologies essentially modify DNA sequences in order to produce RNA or proteins with desired functions. If DNA sequences are synthesized artificially but arranged randomly without biological constraints, the resulting RNA or proteins typically fail to function properly. Their structures must satisfy specific physical and chemical requirements in order to operate correctly within living systems. Because RNA and proteins are large and complex molecules, designing them from scratch so that they meet these requirements is extremely difficult. In some cases, a gene composed of thousands of nucleotides may lose its function when only a single nucleotide is altered. In other cases the mutation has no effect, depending on where it occurs within the sequence. Before the emergence of powerful artificial intelligence methods, designing complete functional genes entirely from the ground up was therefore an extraordinarily difficult task.


In recent years, the rapid development of artificial intelligence, particularly large language models, has opened a new path for addressing this challenge. Language models are not limited to processing human language; they can also process biological sequences. Because DNA sequences can be written as strings of the letters A, T, C, and G, models trained on massive genomic datasets can learn statistical patterns shaped by biological evolution. Once trained, these models can generate new sequences that are more likely to satisfy biological, physical, and chemical constraints.
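
To give a feel for what "learning statistical patterns" from sequences means, here is a deliberately tiny stand-in: a k-mer (Markov) model that counts which base tends to follow each short context. Genome language models such as those discussed in this article are large neural networks and work very differently in detail; the function names, the value of k, and the pseudocount below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_kmer_model(sequences, k=3):
    """Count how often each base follows each length-k context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def next_base_probs(model, context, pseudocount=1.0):
    """Estimate P(next base | context) with add-one style smoothing."""
    observed = model.get(context.upper(), Counter())
    total = sum(observed.values()) + 4 * pseudocount
    return {base: (observed[base] + pseudocount) / total for base in "ACGT"}


if __name__ == "__main__":
    # Toy training data; a real model would see vast numbers of genomes.
    model = fit_kmer_model(["ATGATGATG"], k=2)
    print(next_base_probs(model, "AT"))  # G is strongly favored after "AT"
    print(next_base_probs(model, "CC"))  # unseen context: uniform
```

A model of this kind assigns higher probability to continuations it has seen often, which is the basic idea that much larger sequence models extend to long-range genomic structure.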


One study (King et al., 2025) followed this approach by using two genome language models, Evo 1 and Evo 2, to generate complete bacteriophage genomes and test whether these synthetic genomes could function within living bacteria. Bacteriophages are viruses that infect bacteria. In this work, the researchers focused on ΦX174, a phage that infects Escherichia coli.


Simulated structure of the ΦX174 virion (Image source: Zlir'a, CC BY-SA 3.0)

ΦX174 belongs to the family Microviridae and has a very small genome of 5,386 nucleotides (about 5.4 kilobases). Despite its small size, the genome contains eleven genes and multiple regulatory elements. ΦX174 occupies an important place in the history of molecular biology: it was the first DNA virus whose genome was completely sequenced and also one of the earliest systems used in gene-editing research. Because of this long research history, it serves as a well-characterized experimental model. The researchers therefore used the ΦX174 genome as a template and trained the language models to generate new phage genomes that retained a similar overall architecture while introducing evolutionary novelty.


The genome of ΦX174, containing 11 genes (Image source: Emmanuel Douzery, CC BY-SA 4.0)

The research pipeline consisted of several stages. First, the language models were pretrained and fine-tuned on extensive collections of bacteriophage DNA sequences so that they could learn the genomic characteristics of Microviridae phages. Next, conserved regions from the ΦX174 genome were used as prompts that allowed the model to perform autoregressive generation of complete genome sequences.
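
Prompted autoregressive generation can be sketched in a few lines: start from a prompt, ask the model for a probability distribution over the next base, sample one base, append it, and repeat. The code below is only a schematic of that loop. `toy_next_probs` is a made-up placeholder standing in for a trained model, and nothing here reflects the actual interface or sampling settings used in the study.

```python
import random


def toy_next_probs(context: str) -> dict:
    """Hypothetical stand-in for a trained model.

    Returns P(next base | context). This toy version merely makes
    repeating the previous base less likely than the alternatives.
    """
    probs = {base: 0.3 for base in "ACGT"}
    probs[context[-1]] = 0.1
    return probs


def generate(prompt: str, next_probs, total_length: int, seed: int = 0) -> str:
    """Extend `prompt` one base at a time until `total_length` is reached."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    seq = list(prompt.upper())
    while len(seq) < total_length:
        probs = next_probs("".join(seq))
        bases = list(probs)
        weights = [probs[base] for base in bases]
        seq.append(rng.choices(bases, weights=weights, k=1)[0])
    return "".join(seq)


if __name__ == "__main__":
    genome = generate("ATGC", toy_next_probs, total_length=60)
    print(genome)
```

Because every new base is conditioned on everything generated so far, the prompt steers the whole output, which is why conserved regions of a known genome make useful starting points.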


After generation, the candidate genomes underwent a series of computational filtering steps. Basic sequence quality checks were applied, including constraints on genome length and GC content and the avoidance of long homopolymer runs. Because bacteriophages infect specific hosts through surface proteins that recognize cellular receptors, the researchers also examined the spike protein sequences responsible for host recognition, to make it more likely that the generated phages would retain the ability to infect E. coli. Additional filters promoted evolutionary novelty by discouraging sequences that were overly similar to known phage genomes while encouraging diversity in predicted protein sequences. The team also developed a specialized gene-annotation method tailored to ΦX174-like genomes, because conventional annotation tools often struggle to identify overlapping genes within this viral family.
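
The basic sequence quality checks mentioned above (length, GC content, homopolymer runs) are straightforward to express in code. The sketch below shows one way such a filter might look; every threshold in it is a placeholder chosen for illustration, not a value reported by the study, and the host-recognition, novelty, and annotation steps are not modeled.

```python
import re


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(run.group(0)) for run in runs), default=0)


def passes_basic_filters(
    seq: str,
    min_len: int = 4000,   # hypothetical bounds around a ~5.4 kb genome
    max_len: int = 6000,
    min_gc: float = 0.30,  # hypothetical GC window
    max_gc: float = 0.65,
    max_run: int = 10,     # hypothetical homopolymer limit
) -> bool:
    """Return True if the candidate passes all three quality checks."""
    return (
        min_len <= len(seq) <= max_len
        and min_gc <= gc_content(seq) <= max_gc
        and longest_homopolymer(seq) <= max_run
    )


if __name__ == "__main__":
    candidate = "ATGC" * 1250           # 5,000 nt toy sequence
    print(passes_basic_filters(candidate))
    print(passes_basic_filters("A" * 20 + candidate))  # long poly-A run
```

Cheap checks like these are typically run first so that more expensive analyses are only applied to candidates that already look plausible.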


After these successive filtering steps, 302 candidate genomes were selected for experimental testing. Researchers chemically synthesized DNA corresponding to these genome sequences and introduced them into E. coli cells to determine whether functional phages could be produced. Successful phage activity was detected by observing plaque formation and bacterial growth inhibition.


Among the 302 candidate genomes, 285 were successfully synthesized. Of these, 16 (about 5.6%) produced viable bacteriophages capable of replicating within E. coli and infecting additional bacterial cells. This result demonstrated that AI-generated genome sequences were not only computationally plausible but also capable of functioning within real biological systems.


Importantly, these newly synthesized phages were not merely minor variants of ΦX174. Many exhibited substantial evolutionary novelty. Some genomes contained additional genes, while others lacked genes that are normally present in ΦX174. In some cases certain genes were substantially longer or shorter than their ΦX174 counterparts. One particularly surprising example involved a generated phage in which gene J had been replaced by its homolog from another phage, G4. Previous experimental work suggested that such a substitution would normally render the virus non-functional, yet the AI-generated version remained viable and retained the ability to infect bacteria.


The newly generated phages also exhibited diverse phenotypes. Some lysed bacterial cells faster than ΦX174, while others achieved higher population growth during infection. In competition experiments where multiple phages infected the same bacterial population, several AI-generated phages outperformed the original ΦX174 strain, indicating higher overall fitness.


These findings also have important implications for medicine. Bacteriophage therapy has long been considered a potential alternative to antibiotics, especially in the face of increasing antibiotic resistance. However, bacteria can rapidly evolve resistance against individual phages. When the researchers combined the 16 newly generated phages with ΦX174 into a mixed “cocktail,” they found that the mixture could overcome bacterial resistance that ΦX174 alone could not defeat. Genetic analysis revealed that recombination among the phages during infection produced new variants capable of infecting resistant bacterial strains.


The results suggest that AI-generated phage diversity may provide a powerful strategy for designing therapeutic phage cocktails that remain effective even as bacteria evolve resistance. More broadly, the study demonstrates that genome language models can explore evolutionary possibilities beyond those realized in nature, producing functional biological systems with novel properties.


As artificial intelligence continues to advance, it may eventually become possible to design entire cells and synthesize complete microorganisms tailored for specific purposes in medicine, environmental management, and industrial biotechnology. Such capabilities could usher in a new era of life science centered on the deliberate design of living systems. At the same time, the ability to create synthetic organisms also raises serious biosafety concerns. Artificially designed microbes might escape into natural environments or even be misused as biological weapons, making careful oversight and responsible governance essential.

 

Author: Shui-Ye You


Reference:

King SH et al. (2025). Generative design of novel bacteriophages with genome language models. bioRxiv.



(Paid content. Unauthorized reproduction or use is prohibited.)



