ESM3: The Protein Language Model That Unifies Sequence, Structure and Function

ESM3 from EvolutionaryScale is a generative protein language model that reasons jointly over sequence, structure, and function. First released in 2024 and published in Science, the model has 98 billion parameters in its largest version. It accepts any combination of partial sequence, partial structure, and functional annotations as conditioning input and generates completions across all three modalities. The model treats protein sequence, structure, and function as three channels of the same underlying biological information, trained jointly through a masked prediction objective applied to all three.
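To make the masked prediction objective concrete, here is a minimal sketch of how three aligned token tracks might be masked independently before training. It is an illustration under stated assumptions, not ESM3's actual data pipeline: the function `mask_tracks`, the track names, the structure token ids, the function tags, and the per-track mask rates are all hypothetical.

```python
import random

MASK = "<mask>"

def mask_tracks(tracks, mask_rates, seed=0):
    """Mask each track independently.

    Returns (masked inputs, targets). A target is None wherever the
    token stayed visible, so a loss would only be computed at masked positions.
    """
    rng = random.Random(seed)
    masked, targets = {}, {}
    for name, tokens in tracks.items():
        rate = mask_rates[name]
        masked[name], targets[name] = [], []
        for tok in tokens:
            if rng.random() < rate:
                masked[name].append(MASK)
                targets[name].append(tok)
            else:
                masked[name].append(tok)
                targets[name].append(None)
    return masked, targets

# One entry per residue in every track (all values are made up for illustration).
tracks = {
    "sequence": list("MSKGEELF"),
    "structure": [17, 17, 402, 402, 9, 9, 9, 88],      # hypothetical structure token ids
    "function": ["none"] * 4 + ["fluorescence"] * 4,   # hypothetical per-residue tags
}

# Hide all of the structure, half of the sequence, and none of the function track.
masked, targets = mask_tracks(
    tracks, {"sequence": 0.5, "structure": 1.0, "function": 0.0}
)
print(masked["sequence"])
print(masked["structure"])
```

Setting a track's rate to 1.0 hides it entirely and a rate of 0.0 supplies it as pure conditioning, which is one way to picture how a single objective can cover "any combination" of partial inputs.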

The VQ-VAE Structural Tokenization

To make protein structure tractable as a language model input, ESM3 encodes 3D backbone coordinates with a Vector Quantized Variational Autoencoder (VQ-VAE) that converts continuous coordinate representations into discrete structural tokens. This lets the transformer treat backbone geometry the same way it treats amino acid sequence tokens: as discrete elements in a vocabulary over which attention operates. The VQ-VAE approach introduces a quantization loss that requires careful training to prevent codebook collapse, in which most encodings are assigned to a small number of codebook entries and the rest of the vocabulary goes unused.
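The quantization step and the collapse problem can be sketched in a few lines. This is a toy illustration, not ESM3's tokenizer: the 2D codebook, the example encodings, and the helper names `quantize` and `codebook_perplexity` are assumptions. Usage perplexity is one common diagnostic for collapse; the source does not say which diagnostic EvolutionaryScale used.

```python
import math
from collections import Counter

def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]

def codebook_perplexity(token_ids):
    """exp(entropy) of token usage.

    Equals the number of distinct tokens when usage is uniform and
    drops toward 1.0 as usage concentrates on a single entry.
    """
    counts = Counter(token_ids)
    total = len(token_ids)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

# Toy 4-entry codebook and four encoder outputs (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
encodings = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (1.1, 0.9)]

tokens = quantize(encodings, codebook)
print(tokens)                             # each encoding lands on a different entry
print(codebook_perplexity(tokens))        # close to 4: healthy usage
print(codebook_perplexity([0, 0, 0, 0]))  # 1.0: fully collapsed usage
```

A perplexity far below the codebook size on real data would suggest that much of the structural vocabulary is going unused.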

The GFP Design Demonstration

EvolutionaryScale demonstrated ESM3's generative capability by generating a sequence for a new green fluorescent protein with only 58% sequence identity to the closest known fluorescent protein, then synthesizing and characterizing it experimentally. The protein folded and fluoresced. The authors estimate that this degree of divergence corresponds to more than 500 million years of natural evolution, a timescale that reaches back to the Cambrian. The demonstration was compelling as a capability proof. It was not a drug discovery result.
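For readers unfamiliar with the metric, the sketch below shows one common way to compute percent sequence identity from two sequences that have already been aligned. The function name `percent_identity`, the toy sequences, and the choice to skip gapped columns are assumptions for illustration. Conventions for the denominator vary, and the source does not state which one the ESM3 authors used.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

# Toy pre-aligned fragments (not real GFP sequences).
print(percent_identity("ACDE", "ACDF"))              # 3 of 4 columns match: 75.0
print(percent_identity("MSKGEELFTG", "MSKGAELF-G"))  # gapped column is skipped
```

On this definition, 58% identity means that roughly four in ten aligned positions differ from the nearest known protein.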

Limitations

ESM3 was trained predominantly on soluble, single-domain proteins with well-characterized structures in the PDB. Membrane proteins, intrinsically disordered proteins, and large multi-domain complexes are underrepresented in training data. The model’s performance on these classes is substantially worse than on soluble globular proteins.

This data scarcity illustrates a constraint that compute-optimal scaling cannot solve: when the supply of training data is capped by what biology and experiment can provide rather than by how much of the web can be crawled, adding compute does not close the gap. ESM3’s 98B parameter count is well beyond what the available structural data can optimally train.
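A rough back-of-the-envelope calculation shows the scale of data that "compute-optimal" implies. The sketch below uses the rule of thumb of about 20 training tokens per parameter reported for text language models (Hoffmann et al., 2022). Applying that ratio to protein data is an assumption made here for illustration only, and the function name is hypothetical.

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Training tokens suggested by a fixed tokens-per-parameter heuristic."""
    return n_params * tokens_per_param

ESM3_PARAMS = 98_000_000_000  # 98B, as stated in the article

needed = compute_optimal_tokens(ESM3_PARAMS)
print(f"{needed:.2e} tokens")  # 1.96e+12 tokens under the assumed ratio
```

Under that assumed ratio a 98B-parameter model would call for on the order of two trillion training tokens, which gives a sense of why limited structural data becomes the binding constraint.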

Primary source: Hayes T et al., Science 2025 (ESM3).
