Apache Data Sketches

Apache Data Sketches is an open-source library of probabilistic data structures designed for fast, approximate analytics on massive datasets

BigQuery now integrates Apache Data Sketches, enabling fast, resource-efficient approximate analytics at large scale

Sketches summarize large datasets with minimal memory and computational overhead, often requiring only a single data pass
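
To make the single-pass, bounded-memory idea concrete, here is a minimal toy sketch in Python of a K-Minimum-Values (KMV) distinct counter, the idea that Theta Sketch generalizes. It is an illustration only, not the Apache Data Sketches implementation: the class name `KMVSketch`, the helper `hash01`, the default `k`, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-random float in [0, 1) (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Toy K-Minimum-Values sketch: remembers only the k smallest hash values seen."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest retained hash sits at index 0
        self._members = set()  # retained hashes, used to ignore duplicates

    def update(self, item):
        h = hash01(item)
        if h in self._members:
            return  # duplicate of a retained item: nothing changes
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest retained one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: count is exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest  # standard KMV estimator


# One pass over a stream in which every key appears twice.
sketch = KMVSketch(k=1024)
for i in range(50_000):
    sketch.update(f"user-{i}")
    sketch.update(f"user-{i}")
print("approximate distinct count:", round(sketch.estimate()))
```

The sketch never stores more than `k` hash values regardless of stream length, and its relative error shrinks roughly as one over the square root of `k`.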

They are ideal for non-additive aggregation tasks like distinct counts, quantiles, and most-frequent items, which are costly to compute exactly on large datasets
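
A small Python example may help show what "non-additive" means here. The data and variable names below are made up for illustration: per-partition distinct counts and medians cannot simply be combined into the overall answer, whereas a plain row count can.

```python
import statistics

# Hypothetical per-day data: user IDs and response times.
day1_users = ["alice", "bob", "carol"]
day2_users = ["bob", "carol", "dave"]
day1_latency = [1, 2, 3]
day2_latency = [4, 100]

# Row counts are additive: per-day counts sum to the overall count.
total_rows = len(day1_users) + len(day2_users)

# Distinct counts are not: summing per-day results double-counts repeat users.
sum_of_daily_distinct = len(set(day1_users)) + len(set(day2_users))
true_distinct = len(set(day1_users + day2_users))

# Medians are not additive either: the median of per-day medians is not the overall median.
median_of_medians = statistics.median(
    [statistics.median(day1_latency), statistics.median(day2_latency)]
)
true_median = statistics.median(day1_latency + day2_latency)

print("rows:", total_rows)
print("distinct users: summed", sum_of_daily_distinct, "vs true", true_distinct)
print("median latency: combined", median_of_medians, "vs true", true_median)
```

Without a mergeable summary, the exact answer requires going back to the raw rows for every new grouping or time window.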

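Key sketch types include Theta Sketch (distinct counting and set expressions), HyperLogLog (HLL) and CPC Sketch (distinct counting), Tuple Sketch (a Theta Sketch extension that attaches summary values to keys), and KLL Sketch, REQ Sketch, and T-Digest (quantiles)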

Sketches can be merged, which makes otherwise non-additive aggregations additive and easy to parallelize across distributed systems
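
The merge property can be illustrated with a simplified HyperLogLog-style counter in Python. This is a teaching sketch under stated assumptions, not the library's HLL: the precision `P = 12`, the SHA-256 hash, and the function names `new_sketch`, `update`, `merge`, and `estimate` are all choices made for this example. Two partitions are summarized independently, then combined with a register-wise maximum.

```python
import hashlib
import math

P = 12       # assumed precision: 2**12 = 4096 registers
M = 1 << P


def _hash64(item):
    """64-bit hash of an item (SHA-256 truncated; an arbitrary choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return [0] * M


def update(registers, item):
    h = _hash64(item)
    idx = h >> (64 - P)                   # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)      # remaining 52 bits
    # 1-based position of the leftmost 1-bit in the remaining bits (53 if all zero).
    rank = (64 - P) - rest.bit_length() + 1
    if rank > registers[idx]:
        registers[idx] = rank


def merge(a, b):
    """Merging is a register-wise max, so it can happen in any order or grouping."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)    # small-range correction (linear counting)
    return raw


# Two overlapping partitions, sketched independently (for example on different workers).
part_a, part_b = new_sketch(), new_sketch()
for i in range(0, 60_000):
    update(part_a, f"user-{i}")
for i in range(40_000, 100_000):
    update(part_b, f"user-{i}")

combined = merge(part_a, part_b)
print("estimated distinct users across both partitions:", round(estimate(combined)))
```

Because each register keeps only a maximum, merging the two partition sketches yields exactly the registers that a single sketch over all the data would hold, so overlapping users are not double-counted.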

Using Apache Data Sketches in BigQuery helps lower query costs, simplifies architecture, and enables efficient, scalable analytics for cloud-scale data