High-Performance Optimization via GPU Acceleration
Is CPU-based optimization falling behind?
Introduction
In designing a decision-making algorithm that proceeds from initiation to termination, the use case matters. No universal algorithmic structure or set of hyperparameters works best in all settings. According to the No Free Lunch (NFL) theorem, no algorithm outperforms every other across all classes of decision problems, or even across all instances of a single problem. This article brings together theoretical and practical perspectives, vendor and community experience, and concrete guidance on when and how to prefer graphics processing units (GPUs: many thousands of simple cores optimized for data-parallel workloads, with high memory bandwidth) over central processing units (CPUs: fewer, more complex cores with larger per-core caches and larger system memory capacity), and on how hybrid designs are changing what is feasible for large-scale decision optimization problems.
Recent smoothed-analysis results show that the classical simplex algorithm has near-optimal noise dependence, up to polylogarithmic factors (Ref). That revival of interest in simplex variants sits alongside a surge of work on parallel first-order and primal–dual methods, most notably the Primal–Dual Hybrid Gradient (PDHG) method and its LP specialization, Primal–Dual Linear Programming (PDLP), which are suited to modern massively parallel hardware such as GPUs and replace large factorizations with sparse matrix–vector products (SpMV) and other highly parallel primitives. These developments prompt a practical question: should we continue to refine tried-and-true CPU simplex or interior-point method (IPM) pipelines (second-order methods that converge in relatively few iterations but require expensive factorizations of Karush–Kuhn–Tucker (KKT) linear systems), or redesign parts of the solver stack to exploit GPU-friendly building blocks?
Theory and where algorithm families differ
Linear programs (LPs), continuous optimization problems with linear objectives and constraints, are typically solved “in one shot” by optimization algorithms such as simplex or interior-point methods. Simplex walks along vertices of the feasible polytope (basic feasible solutions) and is well-suited to warm starts and incremental changes; interior-point methods iterate in the interior of the polytope and often require tens to a few hundred iterations, each dominated by factorization of large sparse linear systems.
First-order methods like PDHG/PDLP present a different tradeoff, with many lightweight iterations dominated by sparse matrix–vector multiplication (SpMV) and simple linear algebra primitives rather than expensive factorizations. That changes the performance model: PDLP’s per-iteration cost scales with the number of nonzeros (nnz) rather than cubically with matrix dimension, making it attractive for enormous, sparse LPs and for the LP relaxations that dominate per-node work in MIP branch-and-bound.
However, first-order methods typically yield lower per-iteration accuracy and may require more iterations or a crossover step to produce a numerically “clean” basic solution suitable for downstream exact processing. Thus, algorithm choice is an exercise in balancing iteration cost, convergence speed, numerical precision, and engineering complexity.
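To make the iteration-cost contrast concrete, here is a minimal PDHG sketch for an LP in standard form (min cᵀx subject to Ax = b, x ≥ 0), written with NumPy/SciPy sparse matrices. This is an illustrative sketch under simple assumptions (the function name `pdhg_lp` and the fixed step-size rule are choices made here), not PDLP itself, which adds presolve, diagonal preconditioning, restarts, and adaptive step sizes; each iteration costs two SpMVs plus vector operations, with no factorization.

```python
# Minimal PDHG sketch for an LP in standard form: min c^T x  s.t.  A x = b, x >= 0.
# Illustrative only: real PDLP adds presolve, preconditioning, restarts, and
# adaptive step sizes. The same structure maps directly onto GPU arrays.
import numpy as np
import scipy.sparse.linalg as spla

def pdhg_lp(A, b, c, iters=10_000, tol=1e-6):
    m, n = A.shape
    # Step sizes from the operator norm: tau * sigma * ||A||^2 <= 1.
    op_norm = spla.svds(A, k=1, return_singular_vectors=False)[0]
    tau = sigma = 0.9 / op_norm
    x, y = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        # Primal step: gradient of the Lagrangian in x, then project onto x >= 0.
        x_new = np.maximum(x - tau * (c - A.T @ y), 0.0)
        # Dual step on the extrapolated primal point (the "hybrid" in PDHG).
        y = y + sigma * (b - A @ (2 * x_new - x))
        x = x_new
        # Cheap convergence check: primal feasibility and negative reduced costs.
        primal_res = np.linalg.norm(A @ x - b)
        dual_res = np.linalg.norm(np.minimum(c - A.T @ y, 0.0))
        if max(primal_res, dual_res) < tol:
            break
    return x, y
```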
Why GPUs are suddenly more relevant
GPUs deliver very high memory bandwidth and excel at massively parallel operations such as SpMV, batched BLAS (basic linear algebra subprograms), and other primitives that PDLP and related methods depend on. NVIDIA’s CUDA ecosystem (libraries such as cuSPARSE, memory managers like RMM, and higher-level toolkits such as cuOpt) exposes this capability and makes it practical to build GPU-native or hybrid solvers.
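As a small illustration of how these primitives are reached from Python, the sketch below uses CuPy, whose sparse routines dispatch SpMV to cuSPARSE. It assumes CuPy and a CUDA-capable device are available; the matrix dimensions and density are arbitrary placeholders.

```python
# Sketch: running the SpMV kernel at the heart of PDLP-style methods on a GPU
# via CuPy (backed by cuSPARSE). Assumes CuPy and a CUDA device are available.
import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cpsp

# Build a large random sparse constraint matrix on the host ...
A_host = sp.random(1_000_000, 500_000, density=1e-6, format="csr", dtype=np.float64)
x_host = np.random.rand(500_000)

# ... move it to device memory once, then keep iterating on the GPU.
A_gpu = cpsp.csr_matrix(A_host)
x_gpu = cp.asarray(x_host)

y_gpu = A_gpu @ x_gpu                 # SpMV executed by cuSPARSE
residual_norm = cp.linalg.norm(y_gpu)  # stays on the device; no host transfer
print(float(residual_norm))            # only a single scalar returns to the host
```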
Open-sourcing GPU tooling (notably NVIDIA’s cuOpt) has accelerated integration across the ecosystem. Commercial vendors, open projects, and universities can now experiment, extend, and embed GPU kernels into broader workflows. For many extremely large, sparse problems and for batched/heuristic search patterns, these GPU-friendly kernels open paths to near-real-time reoptimization that were previously impossible on CPU-only stacks.
Empirical evidence and the COPT story
Benchmarks from vendors and research groups show highly variable outcomes that depend on problem structure, size, and the full workflow (including crossover and final polishing). A striking example from COPT (in collaboration with academic teams) highlights the change in what's feasible:
| Solver / Year | Hardware | Time to optimality |
|---|---|---|
| Barrier method (2009) | CPU | months |
| COPT (2023) | modern CPU | ~16 hours |
| cuPDLP-C (2023) | NVIDIA H100 GPU | ~15 minutes |
That zib03 instance (see MIPLIB) illustrates how GPU-accelerated PDLP implementations can convert previously intractable instances into practical runs. COPT’s broader benchmarking further shows PDLP/first-order solvers scaling much better than IPM on very large LPs (e.g., rmine21, rmine25), competitive behavior on some satellite or timetabling instances, and orders-of-magnitude improvements for certain conic (SOCP) and semidefinite (SDP) problems when combined with low-rank factorizations and GPU batching (cuLoRADS).
COPT also reports meaningful impacts on MIP workflows: accelerating LP relaxations and generating high-quality incumbents with GPU heuristics can reduce short-time optimality gaps (for example, a reported reduction in average gap from ~35.3% to ~26.3% within a 5-minute budget, and a ~20% average reduction in security-constrained unit commitment (SCUC) solution time when integrating GPU PDLP with CPU simplex/barrier components).
How vendors and projects are positioning solutions
- NVIDIA (cuOpt): open-sourced ecosystem providing PDLP kernels, routing accelerators, and integration hooks so that researchers and vendors can embed GPU primitives into broader solver stacks.
- COPT: demonstrates GPU-first wins on extreme LPs and conic programs (PDLP, and PDCS for conic problems), plus cuLoRADS for low-rank SDP, highlighting real reductions in solve times and MIP gap improvements when GPU heuristics are used.
- Gurobi: shows per-instance variability; PDHG/PDLP wins on some giant LPs while barrier/hybrid approaches are better for other shapes, with crossover/accuracy targets often deciding the final outcome.
- FICO Xpress: reports up to ~50× speedups when a PDHG-type solver runs fully on GPU for very large LPs (gains scale with nnz and appear primarily above tens of millions of nonzeros).
- HiGHS: rapid experimentation with cuPDLP and GPU connectors; observed reductions in short-time MIPLIB gaps when combining GPU PDLP with classical solvers.
- GAMS: practical connectors and concurrent execution modes (run multiple algorithms/hardware paths concurrently and take the first finish), plus validation tools and cloud engine access to test GPU flows.
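The “run several paths and take the first finish” idea that GAMS exposes as a concurrent mode can be illustrated in plain Python. The sketch below is a generic racing pattern, not a GAMS or vendor API; the two solver callables are hypothetical placeholders standing in for, say, a CPU barrier solve and a GPU PDLP solve of the same model.

```python
# Sketch of a "concurrent race": launch several solve paths for the same model
# and take whichever finishes first. Both callables are hypothetical placeholders.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def solve_cpu_barrier(model):
    raise NotImplementedError("wrap a CPU interior-point/barrier solve here")

def solve_gpu_pdlp(model):
    raise NotImplementedError("wrap a GPU first-order (PDLP-style) solve here")

def race(model):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(solve_cpu_barrier, model): "cpu_barrier",
            pool.submit(solve_gpu_pdlp, model): "gpu_pdlp",
        }
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            # Best effort: a production setup would also signal the losing
            # solver to stop, since cancel() cannot interrupt a running call.
            f.cancel()
        winner = next(iter(done))
        return futures[winner], winner.result()
```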
Practical guidance: what to do first, and when to add GPUs
- Benchmark on your real instances. Vendor demos highlight potential but results are highly instance-dependent. Use representative testbeds.
- Prioritize RAM and memory bandwidth. For MILP/MIP workloads (LPs augmented with integer/binary variables, which create an exponential search tree of nodes), ensure enough system memory before expecting parallel threads or GPU offload to help; for optimization solvers, memory capacity and bandwidth are often the primary constraints, and solvers frequently duplicate matrix data across threads.
- Tune modelling and solver settings. Reformulations, presolve, cuts, heuristics, and parameter tuning frequently produce larger gains than raw hardware upgrades.
- Adopt hybrid patterns where they fit. Use GPU PDLP/PDHG for large LP relaxations, GPU heuristics for incumbents, and then CPU simplex/barrier crossover for numerical polish. This hybrid orchestration is the most pragmatic route today (a minimal sketch follows this list).
- Manage tolerances and validate quality. First-order solvers may be faster at relaxed tolerances; always check primal/dual feasibility and consider crossover for final solutions.
- Prototype in the cloud if needed. Cloud GPU instances and services can let you evaluate feasibility before buying hardware.
- Plan for engineering complexity. Expect to build asynchronous scheduling, selective offload, and fast interconnect usage (NVLink, a high-bandwidth interconnect technology used to reduce CPU<>GPU and GPU<>GPU transfer overheads) to capture end-to-end gains.
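Below is a minimal sketch of the hybrid orchestration pattern referenced in the list above, under the assumption of two hypothetical wrappers, `gpu_pdlp_solve` and `cpu_crossover` (neither is a real library API): a GPU first-order pass at a relaxed tolerance produces an approximate primal/dual pair, its residuals are validated, and the point is handed to a CPU crossover for final polish, with a fall-back to a cold-start CPU solve if the approximate point is too poor.

```python
# Sketch of the hybrid GPU-then-CPU pattern. The two wrappers below are
# hypothetical placeholders, not real library APIs.
import numpy as np

def gpu_pdlp_solve(A, b, c, tol):
    """Placeholder: GPU PDLP-style solve returning an approximate (x, y) pair."""
    raise NotImplementedError("wrap your GPU first-order solver here")

def cpu_crossover(A, b, c, x_start=None, y_start=None):
    """Placeholder: CPU simplex/barrier crossover returning a clean basic solution."""
    raise NotImplementedError("wrap your CPU solver / crossover here")

def hybrid_solve(A, b, c, loose_tol=1e-4, accept_tol=1e-3):
    # Phase 1: fast approximate GPU solve (many cheap SpMV-dominated iterations).
    x_approx, y_approx = gpu_pdlp_solve(A, b, c, tol=loose_tol)

    # Validate quality before trusting the approximate point:
    # primal feasibility (Ax = b) and sign of the reduced costs (c - A^T y >= 0).
    primal_res = np.linalg.norm(A @ x_approx - b)
    dual_res = np.linalg.norm(np.minimum(c - A.T @ y_approx, 0.0))

    # Phase 2: CPU crossover from the approximate point if it is usable,
    # otherwise fall back to a cold-start CPU solve.
    if max(primal_res, dual_res) <= accept_tol:
        return cpu_crossover(A, b, c, x_start=x_approx, y_start=y_approx)
    return cpu_crossover(A, b, c)
```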
Common pitfalls
- Small or highly sequential problems often see no benefit (or slower performance) on GPU.
- Device memory limits on GPUs require careful batching; excessive CPU<>GPU transfers can nullify speedups.
- Pythonic element-wise logic must be moved to vectorized GPU kernels or libraries to benefit.
- First-order methods can trade precision for speed. Crossover/post-processing is frequently necessary and can dominate time to final optimality.
- Amdahl’s Law: the overall speedup is limited by the non-parallel fraction of the workflow.
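For reference, Amdahl’s Law makes that last bound explicit; the worked numbers below are a hypothetical illustration, not a measurement:

$$
S_{\text{overall}} = \frac{1}{(1 - p) + p/s}
$$

where p is the fraction of total runtime that is accelerated and s is the speedup achieved on that fraction. For example, if 90% of a workflow’s time is offloaded (p = 0.9) and that part runs 50× faster (s = 50), the end-to-end speedup is 1/(0.1 + 0.9/50) ≈ 8.5×, not 50×.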
Conclusion
As GPUs gain larger device memory, improved sparse libraries, tensor units, and faster interconnects, and as open-source projects (cuOpt, cuPDLPx, cuPDLP-C, cuLoRADS) mature, an expanding set of optimization workloads will benefit practically from GPU acceleration. The most promising patterns are hybrid workflows that combine GPU speed for massive, parallelizable kernels with CPU robustness for final numerical refinement. For practitioners, the pragmatic stance is unchanged: prioritize modelling and RAM/CPU infrastructure, then evaluate targeted GPU acceleration for problem classes and instances that map to high-bandwidth, massively parallel patterns.
One question that remains is whether GPU-accelerated LP solving removes the main bottleneck for large MILPs, or whether faster branch-and-bound fathoming is the more significant factor. A few questions one should be curious about:
- Can branching be parallelized at scale, or is it inherently sequential and “wait-and-see,” with only the “wait” part now taking less time?
- If LPs/MILPs run much faster, are we really solving models that reflect reality?
- And won’t GPU acceleration speed up other algorithms, like x-heuristics (meta, math, learn, grad, hybrid, sim)?