NVIDIA GPU Query Engine reference architecture accelerates database queries 7.5x over Central Processing Unit
NVIDIA has detailed GQE, a reference architecture for running high-throughput SQL queries natively on GPUs. By leveraging NVLink-C2C and nvCOMP decompression, GQE delivers up to 25.5x speedups on analytical query workloads.
Impact: Medium
Why it matters
GQE addresses major memory and I/O bandwidth bottlenecks, allowing databases to perform high-throughput query execution directly on GPU hardware without exhausting SM resources.
TL;DR
- 01GQE accelerates SQL queries through three layers: query, data, and execution.
- 02It utilizes Blackwell's Decompression Engine for zero-SM LZ77 decompression.
- 03Delivers a 7.5x aggregate speedup on benchmarks, reaching up to 25.5x for individual queries.
Key facts
- Average Speedup
- 7.5x over CPU databases
- Peak Query Speedup
- 25.5x on TPC-H SF1000
- Key Software Libraries
- cuDF, CCCL, nvCOMP, nvSHMEM
- Hardware Interconnect
- NVIDIA NVLink-C2C
The GQE Architectural Layers
GQE transitions SQL queries to hardware-level execution through three primary layers: 1. Query Layer: Parses SQL into logical plans using Substrait, making it compatible with engines like Apache DataFusion, then compiles them into optimized physical plans. 2. Data Layer: Organizes CPU-side memory into non-contiguous column partitions grouped into row groups. It orchestrates asynchronous chunked transfers (cudaMemcpyBatchAsync) to GPU device memory on-demand. 3. Execution Layer: Schedules relational operator tasks across concurrent CUDA streams, overlapping data movement with computation.
Advanced GPU Compression and Decompression
To expand effective memory capacity, GQE integrates the nvCOMP library. On the Blackwell architecture, the hardware-level Blackwell Decompression Engine (DE) can decompress standard LZ77-based formats (such as LZ4, Snappy, and Deflate) at ultra-high throughput without consuming any streaming multiprocessor (SM) resources. Overall, GQE's optimizations deliver an aggregate 7.5x speedup over CPU databases, with individual query gains reaching up to 25.5x.
✓ When to use
- When planning large-scale SQL query acceleration on high-bandwidth systems like NVIDIA GB200 NVL4.