XRDS: Crossroads, The ACM Magazine for Students

Unsupervised Learnings of Protein Large Language Models for Make Benefit Glorious Sector of Biotech

Tags: Computing industry, Life and medical sciences

Like many in the protein design space, I have been both amused and confused by the development of AI-based protein design tools by big-name tech companies. First, there was Google DeepMind's AlphaFold, a revolutionary tool that predicts a protein's structure from its sequence alone [1]. For the first time, a computational tool could predict the shape of a protein—an important determinant of protein function—from sequence, with accuracy comparable to expensive and time-consuming experimental methods. Google had both ample motivation, considering the general importance of protein structure prediction to biology and biology-adjacent fields, and a track record of dabbling in weird side quests, namely DeepMind's previous work creating AlphaGo, AlphaZero, and AlphaStar. Compared to an AI that plays StarCraft, a protein structure prediction tool did not seem too bizarre.

At first, it was not clear to me that a trend was emerging. But when Meta, Salesforce, and ByteDance published their own models [2, 3, 4], it became apparent that something was afoot, particularly given that all three groups created protein language models (think ChatGPT, but for protein sequences). Companies are incentivized to do things that result in profit, so what was in it for them? Are big tech companies doing this out of a sense of corporate responsibility, feigned corporate responsibility, a fear of missing out, a hubristic notion that they can "solve biology," all of the above, or none of the above? Is ByteDance trying to design proteins to do a TikTok dance? Admittedly, I still have no clue. But recent developments in the space provide some leads and, more importantly, a glimpse into the biotechnology landscape.

An understanding of how these models work can answer a key question: Why are big tech companies building state-of-the-art protein language models rather than research universities or other actors? In short, corporations are uniquely positioned due to in-house expertise and infrastructure. It is not that university labs are unable to create protein language models; important deep learning models, such as ProteinMPNN and ProtGPT2, were developed and trained in academic labs [5, 6].

Many protein large language models (LLMs) are trained using the same transformer architecture that has driven the recent explosion in chatbots, like ChatGPT, which can converse with humans, respond to queries, and write computer code. Instead of text, protein sequences are used to train these models. While a transformer is made up of many components, self-attention is the most critical, enabling transformer models to learn long-range dependencies in the data. In the case of protein sequence data, understanding these dependencies is vital to decoding protein structure: amino acids far apart in the sequence that covary with one another are likely to be close to one another in the 3D structure. Unlike AlphaFold, which extracts such contacts from alignments of related sequences, a transformer language model can learn them simply by scaling up its number of parameters, both speeding up structure prediction and simplifying the underlying algorithm [2]. Once trained, these protein LLMs can be sampled to generate sequences designed to fold or function like specific types of natural proteins, yet with low similarity to any natural sequence [6, 7]. As it turns out, the relationships between amino acids in a protein sequence can be represented much like the relationships between words in human sentences—lowering the barrier to entry for companies that already have the incentive to develop large language models for other tasks.
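The self-attention mechanism described above can be sketched in a few lines of NumPy. Everything here is illustrative: the sequence, embeddings, and weight matrices are random stand-ins for trained parameters, and a real protein LLM stacks many such attention layers with far larger dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "protein": each one-letter amino acid code maps to an embedding
# vector. In a trained model these embeddings and weights are learned;
# here they are random, purely to show the mechanics of self-attention.
sequence = "MKTAYIAKQR"                 # ten residues
alphabet = "ACDEFGHIKLMNPQRSTVWY"       # the 20 standard amino acids
d_model = 16                            # embedding dimension (toy size)
embed = rng.normal(size=(len(alphabet), d_model))
x = np.stack([embed[alphabet.index(aa)] for aa in sequence])

# Scaled dot-product self-attention: every residue attends to every
# other residue, so dependencies between positions far apart in the
# sequence are modeled directly.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d_model)                 # (10, 10) pairwise scores
scores -= scores.max(axis=1, keepdims=True)         # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = weights @ v                                   # contextualized residues

# Each row of `weights` is a distribution over which other residues
# that position "looks at" -- the kind of signal from which trained
# models recover covarying (and hence spatially close) residue pairs.
print(weights.shape)    # (10, 10)
print(out.shape)        # (10, 16)
```

A trained model would produce attention weights that concentrate on covarying residue pairs; with random weights, as here, only the shapes and the row-wise normalization are meaningful.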
Training cost is another obstacle these companies are able to largely remove. ProtGPT2, developed at the University of Bayreuth, required 128 NVIDIA A100 GPUs training for four days. Using on-demand prices current as of this writing, it would cost just shy of $43,000 to train, not to mention the difficulty of obtaining A100 instances given the current high demand for GPUs [6]. Having in-house capabilities to scale up these models sidesteps these concerns.
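The back-of-envelope arithmetic behind that figure is simple. The per-GPU-hour rate below is my assumption for illustration; any on-demand A100 price near $3.50/hour reproduces a total just shy of $43,000.

```python
# Rough training-cost estimate for a ProtGPT2-scale run:
# 128 A100s for four days at an assumed on-demand hourly rate.
n_gpus = 128
days = 4
usd_per_gpu_hour = 3.50          # assumed on-demand A100 price

gpu_hours = n_gpus * days * 24   # total A100-hours consumed
cost = gpu_hours * usd_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours -> ${cost:,.0f}")   # 12,288 GPU-hours -> $43,008
```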


The new protein LLM startups in the space would benefit from focus and specificity in their early days, before branching out to fulfill their greater missions.


Nevertheless, these factors explain only the relative ease of development for large tech companies, not their reasons for developing protein large language models in the first place. One reason may simply be curiosity. Many of the authors have ties to academia. Quanquan Gu, director of AI research at ByteDance and a co-creator of ByteDance's protein LLM, LM-Design, is an associate professor at UCLA. Alex Rives, former scientific lead of Meta's Evolutionary Scale Modeling (ESM) team, which developed Meta's protein LLM, has an offer to join the Broad Institute of MIT and Harvard (more on this later) [8]. But as 2022 and 2023 rolled on, the fates of each project, and their associated teams, diverged.

The first break came in mid-2022, when Ali Madani, who led the machine learning research initiatives at Salesforce (including the effort that produced Salesforce's protein LLM, ProGen), launched a separate company. Madani's Profluent Bio aims to commercialize its protein design LLM for novel protein design [9]. Next came the disbanding of the ESM team at Meta. "Meta has tried to align its research strategy to understand more how to create advanced intelligence that can help Meta as a business, rather than just some curiosity projects," according to a former scientist and manager who was part of the ESM team [10]. Alex Rives, whom I mentioned earlier, went on to found EvolutionaryScale with a founding staff of eight former ESM members (though there is speculation he may leave to take the aforementioned faculty position, given his role as interim CEO). In the meantime, Isomorphic Labs, which spun off from DeepMind in 2021, continues to work closely with DeepMind to use AlphaFold to speed up drug discovery [8, 11]. While their specific paths differed, each of the three aforementioned groups ended up as a separate business, with the accompanying pressure to launch a product and make money in its own right. However, each company risks making a mistake common in the biotechnology sector, especially among platform technologies: a lack of sufficient focus on a specific problem or product.

"We're just a biological speculation

Sittin' here, vibratin'

And we don't know what we're vibratin' about"

Drew Endy opened with these lines from "Biological Speculation" by Funkadelic in a 2014 TED Talk meant to advocate for a better genetically engineered future [12]. That same year, Ginkgo Bioworks became the first biotechnology startup to join Y Combinator under CEO Jason Kelly, a former student of Endy's [13]. Almost 10 years later, the company still does not know what it's vibratin' about. In 2021, Ginkgo Bioworks was sued after activist short-selling firm Scorpion Capital published a report claiming the company misrepresented its revenues following its merger with the special purpose acquisition company (SPAC) Soaring Eagle Acquisition Corp., which took Ginkgo Bioworks public [14, 15]. At the heart of the allegations was the claim that Ginkgo spun out companies, such as Allonnia, Joyn Bio, and Motif Foodworks, whose investments would then be used to pay for Ginkgo's biofoundry services. In short, the net effect would be that Ginkgo paid companies to be its customers, artificially inflating its revenues.

After the initial Scorpion Capital report in 2021, Ginkgo Bioworks traded at about $9.50 per share [15]. At the time of this writing, two years later, it is down to $1.53 per share. I am skeptical that simple greed alone explains why Ginkgo misrepresented revenues. Instead, the alleged fraud conceals a critical weakness in the business, one that I hear when Jason Kelly pitches biotech startups to forget about building their own labs and contract their work out to Ginkgo: it is essentially a platform company without customers. The lack of focus can be seen everywhere from their website^a ("What if you could grow anything") to their promotional videos.^b While diversification might be acceptable in mature markets for companies with a proven track record, Ginkgo neither has a proven track record nor operates in mature markets.
What Ginkgo does have is hype, distracting institutional and lay investors alike with grand visions of the future. Ginkgo aims to create probiotics, remediate wastewater, manufacture cannabinoids, ferment food proteins…. Vibratin' about everything in the end means vibratin' about nothing.


Why are big tech companies building state-of-the-art protein language models rather than research universities or other actors?


Ginkgo's biofoundry is a quite different platform from a protein LLM, but they share critical similarities. Both are "halfway platforms" in that they are not sufficient to create functional products by themselves. And, crucially, neither has done the hard work to apply the platform toward a market-vetted product. One example of a company that has done this work is Neoleukin Therapeutics, spun out of the University of Washington's Institute for Protein Design. The Institute, under David Baker, with its general expertise in protein design, collaborated with Chris Garcia's group at Stanford, which has specific expertise in cytokine therapeutics, to create a novel designed protein with not only enhanced binding to a cancer-relevant receptor but also secondary properties, such as thermostability and solubility, that are otherwise inaccessible through standard directed evolution methods [16]. Both Ginkgo and the new protein LLM startups in the space would benefit from focus and specificity in their early days, before branching out to fulfill their greater missions, lest they become as extinct as the T. rex.

References

[1] Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 (2021), 583–589.

[2] Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 6637 (2023), 1123–1130.

[3] Nijkamp, E., Ruffolo, J., Weinstein, E., Naik, N., and Madani, A. ProGen2: Exploring the boundaries of protein language models. arXiv (2022).

[4] Zheng et al. Structure-informed language models are protein designers. arXiv (2023).

[5] Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 6615 (2022), 49–56.

[6] Ferruz, N., Schmidt, S., and Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications 13, 4348 (2022).

[7] Madani, A., Krause, B., Greene, E.R. et al. Large language models generate functional protein sequences across diverse families. Nature Biotechnology 41 (2023), 1099–1106.

[8] Cai, K. Ex-Meta researchers have raised $40 million from Lux Capital for an AI biotech startup. Forbes. Oct. 5, 2023; www.forbes.com/sites/kenrickcai/2023/08/25/evolutionaryscale-ai-biotech-startup-meta-researchers-funding

[9] Profluent. Profluent launches to design new proteins with generative AI. Jan. 26, 2023; www.profluent.bio/press/profluent-launches-to-design-new-proteins-with-generative-ai

[10] Criddle, C. and Murphy, H. Meta disbands protein-folding team in shift towards commercial AI. Financial Times. Aug. 7, 2023; www.ft.com/content/919c05d2-b894-4812-aa1a-dd2ab6de794a

[11] Isomorphic Labs Team and Google DeepMind AlphaFold Team. A glimpse of the next generation of AlphaFold. Isomorphic Labs. Oct. 31, 2023; www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold

[12] Endy, D. Synthetic biology - what should we be vibrating about? YouTube. June 5, 2014. Accessed Nov. 20, 2023; https://www.youtube.com/watch?v=rf5tTe_i7aA

[13] Kirsner, S. Dollars and scents. Boston Globe. July 24, 2015; https://www.newspapers.com/article/the-boston-globe-ginkgo-bioworks-45-mill/115061705/

[14] Jackson, S. Biotech Ginkgo sued over revenue statements for $15 bln SPAC deal. Reuters. Nov. 19, 2021; www.reuters.com/legal/litigation/biotech-ginkgo-sued-over-revenue-statements-15-bln-spac-deal-2021-11-19/

[15] Hale, C. Ginkgo Bioworks 'a hoax for the ages,' says activist short seller firm. Fierce Biotech. Oct. 6, 2021; www.fiercebiotech.com/medtech/ginkgo-bioworks-suffers-short-attack-firm-calling-it-a-hoax-for-ages

[16] Silva, D.A., Yu, S., Ulge, U.Y. et al. De novo design of potent and selective mimics of IL-2 and IL-15. Nature 565 (2019), 186–191.

Author

Albin Hartwig is a pseudonym, as the author wishes to remain anonymous. They are a Ph.D. student at the California Institute of Technology.

Footnotes

a. www.ginkgobioworks.com/about/

b. https://www.youtube.com/watch?v=Lvp2Kw3hjt8

Figures

Figure UF1. Screenshot of the "Get to Know Ginkgo" promotional video featuring a Tyrannosaurus rex for no apparent reason.


This work is licensed under a Creative Commons Attribution 4.0 International License.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2023 ACM, Inc.