Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

Curated from Cloudflare Blog

If you run high-throughput data pipelines on ClickHouse, this article is worth reading closely. It walks through a real-world incident in which a partitioning change to a petabyte-scale cluster caused critical billing jobs to stall, even though standard metrics showed no obvious errors. The authors methodically trace the slowdown to severe lock contention in ClickHouse's query planner, a bottleneck that was not visible in those metrics. What makes the post valuable is its detailed post-mortem and the practical steps taken to resolve and prevent the issue, including upstream patches. For practitioners, the takeaway is clear: don't rely solely on standard metrics when performance degrades. Look at internal system behavior, especially concurrency and locking, and remember that even a small change can have large-scale consequences.
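To make the point about locking concrete, here is a minimal, generic sketch in Python. It is not ClickHouse code and not the authors' method; the function name `measure_lock_waits` and its parameters are hypothetical, chosen only for illustration. It shows how threads that share one lock can spend most of their time waiting without raising a single error, which is the kind of slowdown that error-rate metrics alone tend to miss.

```python
import threading
import time


def measure_lock_waits(num_workers, hold_seconds, shared):
    """Run num_workers threads; return how long each waited to acquire its lock.

    shared=True  -> every worker contends for one lock (a stand-in for a hot
                    internal lock; purely illustrative, not ClickHouse's design).
    shared=False -> each worker uses a private lock, so nobody has to wait.
    """
    one_lock = threading.Lock()
    waits = []
    waits_guard = threading.Lock()  # protects the results list

    def worker():
        lock = one_lock if shared else threading.Lock()
        requested = time.perf_counter()
        with lock:
            # Time spent blocked before the lock was granted.
            waited = time.perf_counter() - requested
            time.sleep(hold_seconds)  # simulated work done while holding the lock
        with waits_guard:
            waits.append(waited)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waits


if __name__ == "__main__":
    contended = measure_lock_waits(8, 0.05, shared=True)
    uncontended = measure_lock_waits(8, 0.05, shared=False)
    print(f"max wait, shared lock:   {max(contended):.3f}s")
    print(f"max wait, private locks: {max(uncontended):.3f}s")
```

Every worker completes successfully in both runs; only the time spent waiting differs. Recording lock wait time directly, rather than inferring health from error counts, is one way to surface this class of bottleneck.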

When a partitioning change to our petabyte-scale ClickHouse cluster caused critical billing jobs to stall, standard metrics showed no obvious errors. This post explores how we identified severe lock contention in ClickHouse's query planner and built upstream patches to fix it.

— Cloudflare Blog
