Question 1

What makes TabbyDB different from standard Apache Spark?

Accepted Answer

TabbyDB is a specialized fork of Apache Spark designed to optimize complex queries. It significantly reduces compilation time and memory usage for queries with nested joins, complex case statements, and large query trees. Unlike stock Spark, TabbyDB fixes these issues at the root — without disabling optimizer rules or requiring any code changes.

Question 2

Can TabbyDB handle extremely large and complex query trees?

Accepted Answer

Yes. The Project node capping algorithm prevents tree size from growing unboundedly — the root cause of most OutOfMemory failures in complex programmatic query patterns. Queries that previously failed with OOM now complete in minutes.
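To make the growth problem concrete, here is a minimal toy model in plain Python. It is not Spark code and it does not show TabbyDB's capping algorithm, which this FAQ does not describe. It only illustrates, under the assumption that each derived column references the previous one twice, how fully inlining a chain of projections can make an expression tree grow exponentially. The names `size` and `inline_chain` are hypothetical helpers.

```python
# Toy model, not Spark internals: an expression is either a column name (str)
# or a tuple (op, left, right).

def size(expr):
    """Count the nodes in an expression tree."""
    if isinstance(expr, str):
        return 1
    _, left, right = expr
    return 1 + size(left) + size(right)


def inline_chain(steps):
    """Return the fully inlined expression for a chain of derived columns,
    where each step is defined as c_i = c_(i-1) + c_(i-1)."""
    expr = "c0"
    for _ in range(steps):
        # Substituting the previous definition into both operands
        # duplicates the whole subtree built so far.
        expr = ("add", expr, expr)
    return expr


for steps in (1, 5, 10, 15):
    # Node count follows 2**(steps + 1) - 1: 3, 63, 2047, 65535
    print(steps, size(inline_chain(steps)))
```

Fifteen chained steps already produce tens of thousands of nodes in this model, which is why bounding tree size matters for programmatically generated queries.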

Question 3

Why is the gain 13% on AWS Hive, 17% on Ampere M1, and 50%+ with Iceberg?

Accepted Answer

Two factors combine. First, hardware: AWS used r6gd.4xlarge nodes while the Ampere M1 benchmark ran on a 2-node Ampere M1 cluster — a different hardware architecture. Second, version: the AWS benchmark used TabbyDB 4.0.1, while the Ampere M1 benchmark used TabbyDB 4.1.1. Between those releases, an Exchange reuse bug was fixed along with additional optimizations, which compound on top of the hardware difference. The Iceberg gain is more dramatic because Iceberg's Manifest Files allow the pushed-down Broadcast Hash Join key filters to prune files at the scan layer, before data is even read. That combination of deep scan-level pruning and Iceberg's metadata structure produces a far larger speedup than Hive Parquet alone.
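The scan-level pruning idea can be sketched in a few lines of plain Python. This is an illustrative model only, assuming each data file carries lower and upper bounds for the join key column (the kind of per-file statistics Iceberg manifests record). The `DataFile` class and `prune_files` function are hypothetical and are not TabbyDB or Iceberg APIs.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int  # smallest join-key value in the file
    upper: int  # largest join-key value in the file


def prune_files(files, join_keys):
    """Keep only files whose [lower, upper] range contains at least one
    key from the broadcast (build) side of the join."""
    keys = sorted(set(join_keys))
    kept = []
    for f in files:
        # Find the first key that is >= the file's lower bound.
        i = bisect_left(keys, f.lower)
        if i < len(keys) and keys[i] <= f.upper:
            kept.append(f)
    return kept


files = [
    DataFile("part-a.parquet", 1, 10),
    DataFile("part-b.parquet", 11, 20),
    DataFile("part-c.parquet", 21, 30),
]
# Keys from the small side of the join; only two files can contain matches.
survivors = prune_files(files, [5, 25])
print([f.path for f in survivors])
```

In this sketch `part-b.parquet` is never opened, because no join key falls inside its range. The more files that can be ruled out this way, the less data the scan has to read.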

Question 4

I ran TPC-DS on TabbyDB and see the same performance as stock Spark. Why?

Accepted Answer

The TPC-DS toolkit defaults to partitioned tables. To see the difference, set the partitioning flag to false and modify data generation to sort locally on the date column while writing partition splits. With this setup, data generation is 6-7x faster, and benchmark results show approximately 13%–17% better performance at 1TB–3TB scale with Hive and 50%+ with Iceberg (as tested on 50GB of data).
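One plausible reason the local sort matters, stated here as an assumption rather than something this FAQ confirms, is that sorted data gives each written chunk a narrow min/max range on the date column, so range-based skipping can rule out more chunks. The toy Python below models a single writer's rows. The helper names and the chunk size of 10 are hypothetical.

```python
def chunk_ranges(values, chunk_size):
    """Split values into consecutive chunks and return (min, max) per chunk."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]


def chunks_to_read(ranges, lo, hi):
    """Count chunks whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for cmin, cmax in ranges if cmin <= hi and cmax >= lo)


# A deterministic scrambled ordering of date keys 0..99 (7 and 100 are coprime,
# so every key appears exactly once).
unsorted_dates = [(i * 7) % 100 for i in range(100)]
sorted_dates = sorted(unsorted_dates)

unsorted_hits = chunks_to_read(chunk_ranges(unsorted_dates, 10), 40, 45)
sorted_hits = chunks_to_read(chunk_ranges(sorted_dates, 10), 40, 45)
print("chunks read without local sort:", unsorted_hits)
print("chunks read with local sort:", sorted_hits)
```

With the sorted layout only one chunk overlaps the filter range, while the scrambled layout forces several chunks to be read for the same filter.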

Question 5

If TabbyDB on non-partitioned tables matches stock Spark on partitioned tables, why should I use TabbyDB?

Accepted Answer

Three reasons. First, Spark's Dynamic Partition Pruning only works when the join key is the partition column, which is not the case for every real-world query. Second, partitioning data is expensive: creating a 1TB partitioned TPC-DS dataset took 6+ hours versus 40 minutes for the non-partitioned version. Third, TPC-DS queries don't stress compile time, whereas real DataFrame API queries can spend far more time compiling than executing. TabbyDB eliminates that cost.

Question 6

Do I lose any functionality compared to stock Spark?

Accepted Answer

No. TabbyDB guarantees 100% Apache Spark API and runtime compatibility: every feature, every configuration, every behavior is preserved. It is a fork that fixes the engine, not one that trades functionality for speed. In fact, TabbyDB also resolves critical functionality bugs that have remained open in Apache Spark for years. TabbyDB can only improve performance. It will never regress.

Question 7

Is TabbyDB compatible with existing Spark DataFrame APIs?

Accepted Answer

100% compatible. Use your existing code unchanged. TabbyDB guarantees full compatibility with open-source Spark — now and in future releases.

Frequently Asked Questions