Mike's Notes
Impressive and socially useful. The original article has many links.
Last Updated
19/04/2026
How to Accelerate Protein Structure Prediction at Proteome-Scale
By: Christian Dallago, Kyle Tretina, Kyle Gion, and Neel Patel
NVIDIA Developer: 09/04/2026
Chris Dallago is a computer scientist turned bioinformatician who
passionately models biological mechanisms using machine learning. He has
advanced bio-sequence representation learning, contributing to its
establishment, notably in transformer models. Chris is dedicated to
solving scarce-data problems, such as designing proteins for therapeutic
and industrial applications.
Kyle Tretina is a product marketing leader at NVIDIA, focused on
advancing AI for digital biology and drug discovery. He drives the
strategy and storytelling behind BioNeMo and our work with BioPharma,
shaping how next-generation foundation models and GPU-accelerated
microservices transform molecular and protein design. With a PhD in
molecular microbiology and immunology, Kyle bridges science and strategy,
translating breakthroughs in AI, chemistry, and biology into platforms
that accelerate discovery for researchers, startups, and pharmaceutical
companies worldwide.
Kyle Gion is a product manager for Research at NVIDIA, where he
translates R&D in digital biology and molecular science into impactful
products. He focuses on guiding research that applies computational
biology, computational chemistry, and AI to life sciences, drawing on
experience that spans both building scientific software and developing
cystic fibrosis therapies. Kyle earned his bachelor's and master's degrees
in Chemical Engineering from Brown University.
Neel Patel is a drug discovery scientist at NVIDIA, focusing on
cheminformatics and computational structural biology. Before joining
NVIDIA, Neel was a computational chemist in big pharma, where he worked on
structure-based drug design. He holds a Ph.D. from the University of
Southern California. He lives in San Diego with his family and enjoys
hiking and traveling.
Proteins rarely function in isolation as individual monomers. Most
biological processes are governed by proteins interacting with other
proteins, forming protein complexes whose structures are described, in the
hierarchy of protein structure, as the quaternary level. This is one level
of complexity above the tertiary level, the 3D structure of monomers, which
has become widely accessible through the Protein Data Bank and, more
recently, the emergence of AlphaFold2.
Structural information for the vast majority of complexes remains
unavailable. While the AlphaFold Protein Structure Database (AFDB), jointly
developed by Google DeepMind and EMBL’s European Bioinformatics Institute
(EMBL-EBI), transformed access to monomeric protein structures,
interaction-aware structural biology at the proteome scale has remained a
bottleneck with unique challenges:
- Massive combinatorial interaction space
- High computational cost for multiple sequence alignment (MSA) generation and protein folding
- Inference scaling across millions of complexes
- Confidence calibration and benchmarking
- Dataset consistency and biological interpretability
In recent work, we extended the AFDB with large-scale predictions of
homomeric protein complexes generated by a high-throughput pipeline based on
AlphaFold-Multimer—made possible by NVIDIA accelerated computing.
Additionally, we predicted heteromeric complexes to compare the accuracy of
different complex prediction modalities.
In particular, for the predictions of these datasets, we leveraged
kernel-level accelerations from MMseqs2-GPU for MSA generation, and NVIDIA
TensorRT and NVIDIA cuEquivariance for deep-learning-based protein folding.
We then mapped the workload to HPC-scale inference by maximizing the
utilization of all available GPUs, including scale-out to multiple
clusters.
This blog describes the major principles we adopted to increase protein
folding throughput, from adopting accelerated libraries and SDKs to
optimizations that reduce the computational complexity of the workload.
These principles can help you set up a similar pipeline yourself, borrowing
from the techniques we used to create this new dataset.
So, if you are a:
- Computational biologist scaling structure prediction pipelines
- AI researcher training generative protein models
- HPC engineer optimizing GPU workloads
- Bioinformatics team building structural resources
you will learn how to:
- Design a proteome-scale complex prediction strategy
- Separate MSA generation from structure inference for efficiency
- Scale AlphaFold-Multimer workflows across GPU clusters
Prerequisites
Technical knowledge
- Python and shell scripting
- SLURM as the HPC workload scheduler
- Basic structural biology
- Familiarity with AlphaFold/ColabFold/OpenFold or similar pipelines
Infrastructure
- A multi-GPU, multi-node NVIDIA DGX H100 SuperPod cluster (the setup we describe scaling on)
- High-speed storage for MSAs and intermediate outputs
Software
- Access to MMseqs2-GPU
- Familiarity with TensorRT
- If not using a model with integrated cuEquivariance, knowledge of triangular attention and multiplication operations
Procedure/Steps
1. Define the dataset you’d like to compute
Begin by defining the scope of prediction. Because predicting protein
complexes can become a combinatorial problem, it's useful to decide which
complexes are most worth computing. In some cases, if your proteomes are
small enough, an all-against-all (dimeric) complex prediction might be
tractable; however, this could change if you want to predict large datasets
of proteomes.
Here’s how we decided to go about it:
- Homomeric complexes: We selected all proteomes represented in the AFDB and sorted them by perceived importance (e.g., proteomes of human concern or commonly accessed ones). This allowed us to rank proteomes for computation in a particular order, making execution more manageable.
- Heteromeric complexes: This is where things can get complicated, fast. For our heteromeric runs, we focused on complexes originating from several reference proteomes and from proteomes included in the WHO list of important proteomes. Because there is an intractable number of complexes that can be derived from these proteomes, we restricted our runs to dimers (complexes of two proteins) within the same proteome (no inter-proteome complexes) that had "physical" interaction evidence in STRING. As we sought coverage, we considered all interactions reported in STRING for these proteomes rather than filtering further. Evidence in the literature suggests that filtering for STRING scores >700 can further reduce the number of inputs while increasing the likelihood of well-predicted complexes.
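The STRING-based selection above can be sketched as a small filter. The column names (`protein1`, `protein2`, `combined_score`) follow STRING's physical-links file format, and the 700 cutoff is the score threshold mentioned above; how you load the file is up to you (e.g., `csv.DictReader` over the space-delimited download).

```python
def select_dimer_candidates(rows, min_score=700):
    """Keep intra-proteome pairs with physical-interaction evidence.

    `rows` is an iterable of dicts with keys 'protein1', 'protein2',
    and 'combined_score' (STRING scores are scaled 0-1000).
    """
    candidates = set()
    for row in rows:
        if int(row["combined_score"]) <= min_score:
            continue
        # Deduplicate A-B vs. B-A orderings of the same pair.
        candidates.add(tuple(sorted((row["protein1"], row["protein2"]))))
    return sorted(candidates)
```

For full-coverage runs like ours, drop the score filter and keep only the physical-evidence requirement (i.e., use the physical-links file as-is).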
2. Decouple MSA generation from structure prediction
MSA generation and structure inference are both compute-intensive but scale
differently, as we recently presented in a white paper. We thus approached
these computations as separate steps and implemented separate SLURM
pipelines. In general, for optimal use of a node, we set up MSA generation
and structure prediction this way.
MSA generation
We generated MSAs using colabfold_search with the MMseqs2-GPU backend.
While MMseqs2-GPU scales across the GPUs on a node natively, we chose to
spawn one MMseqs2-GPU server process per GPU for easier process management.
In colabfold_search, the GPUs are only used for the ungapped filter stages,
not the subsequent alignment stages (which are multithreaded CPU processes).
Therefore, by monitoring the colabfold_search output, we can stack
colabfold_search calls and start the next one as soon as the GPU is no
longer used by the previous one, reducing GPU idle time.
Although this approach oversubscribes CPU resources, in practice we found
that on a DGX H100 node, three staggered colabfold_search processes can
increase overall throughput by up to 25%, at the expense of slower
processing of individual input chunks.
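A minimal sketch of the staggering idea: block on each process's log until it signals that its GPU stages are done, then launch the next one, capping the number of live processes. The marker string is an assumption (check your colabfold_search logs for the real line that marks the start of the CPU-only alignment phase), and a production version would keep draining stdout in a background thread rather than abandoning the pipe after the marker.

```python
import subprocess
import sys

# Hypothetical marker: the first log line of the CPU-only phase.
GPU_DONE_MARKER = "Alignment"

def staggered_launch(commands, marker=GPU_DONE_MARKER, max_live=3):
    """Start each command once the previous one has printed `marker`,
    keeping at most `max_live` processes alive at a time."""
    live, codes = [], []
    for cmd in commands:
        # Cap oversubscription: reap the oldest process first if at the cap.
        if len(live) == max_live:
            codes.append(live.pop(0).wait())
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        live.append(proc)
        # Block until this process signals that its GPU stages are done.
        for line in proc.stdout:
            if marker in line:
                break
    codes.extend(p.wait() for p in live)
    return codes
```

In the real pipeline, each command would be a `colabfold_search` invocation pinned to one GPU via `CUDA_VISIBLE_DEVICES`.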
When determining reasonable input chunk sizes, there are two factors to
consider. Smaller chunk sizes result in more chunks, which means more
per-process overhead, such as database loading, which can take a couple of
minutes each time, even on fast storage. (Pre-staging the databases on the
fastest storage available, such as the on-node SSD, also helps with
throughput.) On the other hand, larger chunks take more time to finish; on a
SLURM cluster with a job time limit, this results in more unfinished chunks.
The sweet spot depends on the cluster configuration, but for our DGX H100
nodes with a 4-hour wall-time limit, a chunk size of 300 sequences worked
well with the staggered colabfold_search approach.
Structure prediction
To increase structure prediction throughput, we leveraged both
optimizations in data handling for JAX-based folding through ColabFold and
accelerated tooling developed at NVIDIA, including TensorRT and
cuEquivariance for OpenFold-based folding.
Deep learning inference parameters
First, we selected inference parameters that struck a good balance between
accuracy and speed. All deep learning inference pipelines (ColabFold and
OpenFold) thus used:
- Weights: 1x weights from AlphaFold-Multimer (model_1_multimer_v3)
- Recycles: four, with early stopping
- Relaxation: none
- MSAs: frozen MSAs generated through colabfold_search (using MMseqs2-GPU), as described above
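For the ColabFold path, these settings map onto a colabfold_batch invocation roughly as sketched below. The flag names reflect ColabFold's CLI as we recall it; verify them against `colabfold_batch --help` for your installed version (relaxation is off simply because `--amber` is not passed, and the input is a directory of precomputed .a3m MSAs).

```python
def build_colabfold_cmd(msa_dir, out_dir, num_recycle=4):
    """Assemble a colabfold_batch command matching the inference
    parameters listed above. Flag names are best-effort; check
    `colabfold_batch --help` before relying on them."""
    return [
        "colabfold_batch",
        "--model-type", "alphafold2_multimer_v3",
        "--num-models", "1",            # 1x multimer weights only
        "--num-recycle", str(num_recycle),
        # Relaxation stays off by not passing --amber.
        msa_dir,                        # directory of precomputed .a3m MSAs
        out_dir,
    ]
```

The returned list can be handed directly to `subprocess.run` inside a SLURM batch script.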
Accuracy validation
Homodimer PDB set (125 proteins):

| Model | High (DockQ > 0.8) | Medium (> 0.6) | Acceptable (> 0.3) | Incorrect (> 0) | Usable | Mean DockQ |
|---|---|---|---|---|---|---|
| ColabFold | 52 | 37 | 12 | 21 | 89 (72.95%) | 0.637 |
| OpenFold with TensorRT and cuEquivariance | 53 | 39 | 10 | 20 | 92 (75.41%) | 0.647 |
Table 1. A comparison of interface accuracy between ColabFold and
OpenFold (accelerated by TensorRT and cuEquivariance) across a benchmark
set of 125 homodimer proteins.
As we used different inference pipelines, we performed accuracy validation
using a curated benchmark set of 125 X-ray resolved PDB homodimers released
after AlphaFold2 was introduced, thus minimizing the potential for
information leakage.
Predicted complexes for each deep learning implementation were compared
against experimental reference structures using DockQ, which evaluates
interface accuracy via the fraction of native contacts (Fnat), fraction of
non-native contacts (Fnonnat), interface RMSD (iRMS), and ligand RMSD after
receptor alignment (LRMS), and assigns standard CAPRI classifications of
high, medium, acceptable, or incorrect.
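The binning used in Table 1 can be expressed as a small helper. The thresholds (0.8 / 0.6 / 0.3) come from the table header; DockQ itself is computed by the external DockQ tool, not here.

```python
def capri_class(dockq):
    """Map a DockQ score in [0, 1] to the quality bin used in Table 1."""
    if dockq > 0.8:
        return "high"
    if dockq > 0.6:
        return "medium"
    if dockq > 0.3:
        return "acceptable"
    return "incorrect"
```

Applying this over a benchmark's DockQ scores and counting the bins reproduces the per-class tallies reported in the table.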
Across the PDB homodimer benchmark, OpenFold accelerated through TensorRT
and cuEquivariance reproduces ColabFold interface accuracy, achieving a
similar fraction of “high” scoring predictions and comparable mean DockQ
scores. This indicates that the accelerated implementations preserve
interface-level structural accuracy relative to the ColabFold
baseline.
MSA preparation and sequence packing
For ColabFold-based homodimer inferences, higher throughput can be achieved
by packing homodimers of equal length into a batch for processing, sorted by
their MSA depth in descending order. This reduces the number of JAX
recompilations, thereby increasing end-to-end throughput. This trick,
however, does not work when processing heterodimers, because the lengths of
the individual chains differ.
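The packing trick above can be sketched as follows: bucket queries by sequence length so JAX compiles once per length, and order each bucket by MSA depth, deepest first. The `(name, length, msa_depth)` tuple layout is illustrative.

```python
from collections import defaultdict

def pack_by_length(queries):
    """Return batches of equal-length queries for JAX-based folding.
    Each batch is sorted by descending MSA depth; batches are emitted
    longest sequence length first."""
    buckets = defaultdict(list)
    for name, length, depth in queries:
        buckets[length].append((name, depth))
    batches = []
    for length in sorted(buckets, reverse=True):
        # Deepest MSAs first within each equal-length batch.
        batch = sorted(buckets[length], key=lambda item: -item[1])
        batches.append([name for name, _ in batch])
    return batches
```

Since every batch shares one padded length, the JAX model recompiles at most once per batch instead of once per query.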
For OpenFold, whether for homodimers or heterodimers, this packing strategy
is not needed, as the method doesn’t require re-compilation. However, given
a dependency between sequence length and execution time, reserving longer
sequences for individual jobs may be beneficial if operating with specific
SLURM runtimes. To further optimize the process, input featurizations
(CPU-bound) were performed for the next input query alongside the inference
step for the current query (GPU-bound).
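The CPU/GPU overlap described above amounts to a one-slot prefetch pipeline, sketched here with a background thread. The `featurize` and `infer` callables stand in for OpenFold's real featurization and inference steps.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_inference(queries, featurize, infer):
    """Run infer(featurize(q)) for each query, featurizing the next
    query on a CPU thread while the current inference runs."""
    if not queries:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Prefetch features for the first query.
        future = pool.submit(featurize, queries[0])
        for i in range(len(queries)):
            feats = future.result()
            if i + 1 < len(queries):
                # Kick off the next (CPU-bound) featurization before
                # the current (GPU-bound) inference starts.
                future = pool.submit(featurize, queries[i + 1])
            results.append(infer(feats))
    return results
```

With featurization hidden behind inference, the GPU stays busy end to end instead of idling between queries.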
Additionally, OpenFold’s throughput was enhanced through the integration of
the NVIDIA cuEquivariance library and NVIDIA TensorRT SDK. These modular
libraries and SDKs can be leveraged to accelerate operations common in
protein structure AI and general inference AI workloads, respectively. We
previously described how TensorRT can be leveraged to accelerate OpenFold
inference.
3. Optimize GPU utilization with SLURM
As alluded to in the previous section, depending on the available hardware,
you can increase throughput by "packing" GPUs and nodes. SLURM is a great
orchestrator, and we divided the inference workflows into SLURM scripts to:
- Pack multiple predictions per node
- Match GPU memory to sequence length
- Reduce idle time between jobs
- Separate short vs. long sequence queues
Our workload was mapped to an NVIDIA DGX H100 SuperPod HPC system. We could
thus deploy inference across NVIDIA H100 GPUs on multi-node clusters,
leveraging exclusive execution on single nodes and packing each GPU with as
many processes as needed to saturate GPU utilization for both MSA processing
and deep learning inference.
Helpful tips:
- Group jobs by total residue length
- Monitor GPU memory fragmentation
- Use asynchronous I/O to avoid disk bottlenecks
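Grouping jobs by total residue length into separate queues can be sketched as below; the 1,500-residue cutoff is an illustrative number, not the one used in production.

```python
def split_queues(jobs, cutoff=1500):
    """Partition (name, total_residues) jobs into short and long queues.
    Each queue is sorted by size so similar-length jobs can be packed
    onto the same nodes and submitted with matching wall-time limits."""
    short = sorted((j for j in jobs if j[1] <= cutoff), key=lambda j: j[1])
    long_ = sorted((j for j in jobs if j[1] > cutoff), key=lambda j: j[1])
    return short, long_
```

Each queue would then map to its own SLURM partition or job array, with longer wall times and larger GPU memory reserved for the long queue.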
4. Make quality predictions accessible to the world
In partnership with EMBL-EBI, the Steineggerlab at Seoul National
University, and Google DeepMind, we explored complex structure prediction
analysis. We highlight that predicting these biological systems remains
challenging. Unlike monomer prediction, where the predicted Local
Distance Difference Test (pLDDT) score can inform overall prediction
quality and yield a balanced set of plausible predictions, assessing
interface plausibility for complexes is much harder. This is because
assessing complexes involves global and per-chain confidence metrics, as
well as local confidence metrics at the
interface.
Simply put, is the interface between two monomers plausible, and is it
predicted in the right pocket? These questions are much harder to answer
than more “local” questions about monomer likelihood, given the very limited
data available. Therefore, we make available a set of high-confidence
structures through the AlphaFold Database, thereby enabling, for the first
time, exploration of protein complexes. We intend to refine our approach
further and expand the universe of available protein complexes in the
AlphaFold Database.
Getting started
Proteome-scale quaternary structure prediction requires more than just
running AlphaFold-Multimer at scale. Success depends on:
- Evidence-driven interaction selection
- Decoupled and optimized compute workflows
- GPU-aware job orchestration
- Confidence calibration and validation
- Dataset health monitoring
By combining STRING-guided selection, MMseqs2-GPU acceleration, and NVIDIA
H100-powered multimer inference, this work extends AFDB into a unified,
interaction-aware structural resource.
This infrastructure enables:
- Variant interpretation at interfaces
- Systems-level structural biology
- Drug target validation
- Generative protein design benchmarking
Resources
Read more about the project here:
https://research.nvidia.com/labs/dbr/assets/data/manuscripts/afdb.pdf
Accelerated libraries and SDKs are available here:
- MMseqs2-GPU
- NVIDIA cuEquivariance
- NVIDIA TensorRT
If you wish to deploy MSA search and protein folding easily, you can get
accelerated inference pipelines through NVIDIA’s Inference Microservices
(NIMs):
- MSA Search NIM
- OpenFold2 NIM
The predictions from this effort are available through
https://alphafold.com