GPU-accelerated PostgreSQL that replaces distributed analytics clusters.
Most enterprise data warehouses are under 100TB, yet companies spend $250K-$2M/year on multi-node clusters where network shuffle accounts for 50-80% of query time. A single NVIDIA Blackwell GPU moves data at 8,000 GB/s internally, 640x faster than a 100GbE network can shuffle between nodes.
For the vast majority of analytical workloads, one GPU server outperforms an entire cluster.
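The bandwidth ratio above is simple arithmetic. A minimal check (numbers from the text; treating 100 Gbit/s Ethernet as a flat 12.5 GB/s, ignoring protocol overhead):

```python
# Back-of-envelope check of the bandwidth claim: GPU-internal memory
# bandwidth vs. what a 100GbE link can shuffle between cluster nodes.
gpu_internal_bw_gbs = 8000      # NVIDIA Blackwell internal bandwidth, GB/s
network_bw_gbs = 100 / 8        # 100 Gbit/s Ethernet ~= 12.5 GB/s

ratio = gpu_internal_bw_gbs / network_bw_gbs
print(f"{ratio:.0f}x")          # -> 640x
```

In practice the gap is wider still, since real network shuffle also pays serialization and protocol costs that on-GPU data movement avoids.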
SCADA/BaM GPU-initiated I/O that lets GPU threads read directly from NVMe storage, bypassing the CPU entirely. Massive thread parallelism hides storage latency.
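The latency-hiding idea can be sketched on the CPU with `asyncio` (illustrative only; SCADA does this with thousands of GPU threads issuing requests against NVMe queues, and `fake_nvme_read` is a hypothetical stand-in, not a real API):

```python
import asyncio
import time

async def fake_nvme_read(block_id, latency_s=0.01):
    # Stand-in for one storage request with fixed latency.
    await asyncio.sleep(latency_s)
    return block_id

async def main():
    t0 = time.perf_counter()
    # 100 requests in flight at once: total wall time is roughly ONE
    # request latency, not 100x -- parallelism hides the latency.
    results = await asyncio.gather(*(fake_nvme_read(i) for i in range(100)))
    elapsed = time.perf_counter() - t0
    print(f"{len(results)} reads in {elapsed:.3f}s")

asyncio.run(main())
```

The same principle scales on a GPU: with massive thread parallelism, there are always enough outstanding reads to keep the NVMe device saturated while individual requests wait.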
PostgreSQL + CUDA MPS Standard PostgreSQL for compatibility and writes; GPU offload for analytical reads. Up to 48 concurrent clients share a single GPU context.
Heap Block Storage cupug can efficiently access traditional Postgres heap storage blocks from the GPU, making it compatible with all existing Postgres tables.
Column Block Storage An optional columnar layout that accelerates analytical workloads by storing data column-by-column, so a scan reads only the columns a query actually touches.
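Why a columnar layout helps can be shown with a toy byte count (illustrative only: the 104-byte record below is an assumption for the sketch, not Postgres's actual heap tuple format or cupug's on-disk layout):

```python
# Aggregating one column: a row (heap) layout drags every column's bytes
# through memory, while a columnar layout reads only that column.
rows = [(i, i * 2.0, b"x" * 92) for i in range(1000)]  # (id, amount, padding)

# Row-oriented: each record is 4 + 8 + 92 = 104 bytes; scanning 'amount'
# still touches all of it.
heap_bytes = sum(4 + 8 + len(pad) for _, _, pad in rows)

# Column-oriented: the 'amount' column alone is 8 bytes per row.
column_bytes = 8 * len(rows)

print(heap_bytes // column_bytes)  # -> 13 (13x less data scanned)
```

The wider the rows and the narrower the query, the bigger this ratio gets, which is why columnar storage pays off for analytics but not for point lookups that need whole rows.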
cuVS Vector Search Index cupug provides integration with the cuVS vector search library, scaling vector search use cases into the billions.
The result is a drop-in replacement for analytics clusters: same PostgreSQL interface, 5-10x lower TCO, 5-10x better query performance.
This site contains the technical architecture and business analysis for the project:
Architecture How reads (GPU/SCADA) and writes (CPU/WAL) are split across the system
SCADA GPU-initiated storage I/O, software caching, and warp coalescing
Multi-Process Server CUDA MPS integration for concurrent PostgreSQL backends