Apache Spark — Reimagined

Spark Queries That Take Hours. Now Done in Minutes.

TabbyDB is a drop-in fork of Apache Spark that eliminates compile-time blowup, OutOfMemory failures, and nested join underperformance — at the optimizer level. Zero code changes. Zero cluster changes.

Download Trial Version

Try Live Demo →

13% TPC-DS Improvement at 1TB–2TB on AWS using Hive

17% TPC-DS Improvement at 3TB on Ampere M1 using Hive

50%+ Iceberg Performance Gain at 50GB

Complex query compilation: 1–8 hrs → minutes

Run TPC-DS locally on your own cluster and verify the results yourself

25+ Apache Spark JIRA Fixes

Drop-in replacement — zero code changes

The Spark Problem

The Real Bottleneck Is in Apache Spark

Production workloads hit walls that benchmarks never reveal. Analyzer and optimizer inefficiencies silently balloon compilation times, pushing complex queries into OOM failures.

Root Cause: Query Planning

Compile time can take minutes to hours — but Spark's UI only registers a query after plan submission. The bottleneck is in planning, not execution, and it's invisible to your metrics.
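
One way to make planning time visible is to time the two phases separately. The helper below is a minimal sketch in plain Python; `time_phases`, `plan`, and `execute` are illustrative names, not part of Spark or TabbyDB. With PySpark, one option might be to pass `lambda: df.explain()` as `plan` and `lambda: df.collect()` as `execute`, on the assumption that `explain()` forces analysis and optimization without running the job.

```python
import time
from typing import Callable, Dict


def time_phases(plan: Callable[[], object], execute: Callable[[], object]) -> Dict[str, float]:
    """Run `plan`, then `execute`, and return wall-clock seconds for each."""
    start = time.perf_counter()
    plan()      # expected to trigger analysis/optimization only
    planned = time.perf_counter()
    execute()   # expected to run the query itself
    done = time.perf_counter()
    return {"planning_s": planned - start, "execution_s": done - planned}


# Stand-in callables so the sketch runs without a cluster.
timings = time_phases(lambda: time.sleep(0.05), lambda: None)
print(timings)
```

If `planning_s` dominates, runtime tuning is unlikely to help, which is the situation this section describes.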

Why Tuning Fails

Compile-time problems are routinely misdiagnosed as runtime issues. Runtime tuning doesn't reduce planning time. More compute doesn't mean faster planning. The fixes you try don't touch the root cause.

Workarounds Backfire

Disabling optimizer rules can reduce compile time — but at the cost of runtime performance. Every workaround forces a tradeoff: faster planning vs. slower execution. TabbyDB eliminates this tradeoff.

Minutes to hours of delay, wasted compute, missed SLAs — with no clear path to fix it in stock Spark.

The Solution

TabbyDB — Turbocharged Apache Spark

Fully compatible with Apache Spark 4.1.1. Same APIs. Same clusters. Better engine. Drop in the jars and get back hours.

Compile Time

Intelligent Compile-Time Optimizations

Fundamental improvements to critical optimizer rules: optimized constraint propagation, early project collapsing during analysis, reduced Hive Metastore calls, and targeted rule application to avoid expensive tree traversals. Complex queries that took 8 hours now compile in minutes.

Memory

Scalable Query Tree Management

Safely collapses project nodes early in the query lifecycle, preventing unbounded query plan growth and reducing memory pressure during compilation. Faster compilation and dramatically reduced risk of out-of-memory failures on deeply nested workloads.

Runtime

Advanced Broadcast Hash Join Handling

Dynamic file pruning driven by Broadcast Hash Join data on non-partitioned columns: the fix Spark never shipped. It extends pruning to joins on non-partition columns, which reduces data scan time. 13% improvement on TPC-DS at 1TB–2TB.

Cache

Improved Cache Lookup Efficiency

Enhances how cached in-memory query plans are matched and reused, increasing the likelihood of successful cache hits. Higher cache reuse and lower execution overhead — especially for repeated or structurally similar queries.

Iceberg

Apache Iceberg — 50%+ Faster

Early testing on 50GB non-partitioned Iceberg tables shows 50%+ improvement. Gains are expected to increase further at 1TB–3TB scale. The Iceberg performance layer the ecosystem has been waiting for.

Deployment

Seamless Spark Compatibility. No Lock-In.

Maintains full compatibility with Apache Spark APIs, features, and tooling while delivering every performance improvement above. Replace Spark jars with TabbyDB jars. To revert: swap back. No code rewrites. No cluster changes. No disabling of optimizer rules.

Performance That Redefines Complex Query Execution

13%

TPC-DS Runtime Improvement

1TB–2TB on AWS r6gd.4xlarge — non-partitioned Hive Parquet tables

17%

TPC-DS Runtime Improvement

3TB on 2-node Ampere M1 cluster, 1506GB RAM, 384 cores — non-partitioned Hive Parquet tables

50%+

Iceberg Performance Gain

50GB non-partitioned tables. Gains expected to increase at 1TB–3TB scale

8 hrs → minutes

Compile Time Reduction

Complex DataFrame API queries — hours to compile, now minutes

TPC-DS gains shown on non-partitioned tables. Standard TPC-DS (partitioned) shows equivalent performance to stock Spark by design.

See the performance difference for yourself. Run the same query on stock Spark and TabbyDB side by side in our live Zeppelin notebooks — two demos, each targeting a different root cause. Note: running the Stock Spark paragraph may take 5–12 minutes.

Demo: Constraints Impact

Demo: Tree Size Impact

TPC-DS Benchmark Results

1 TB TPC-DS Benchmark — 6 nodes AWS r6gd.4xlarge

Live Demo

Compare Performance in Real Time

Run the same query on stock Spark vs TabbyDB in our hosted Zeppelin notebooks. No setup. No install. See the difference for yourself.

Demo: Constraints Impact

See how constraint propagation blowup affects query compilation time

Demo: Tree Size Impact

See how query tree growth causes OutOfMemory failures — and how TabbyDB handles it

Reproducible Benchmarks

Run the TPC-DS Benchmark on Your Own Machine

Don't take our word for it. Reproduce the exact benchmark used in our published results — on your own cluster, using both stock Spark 4.1.1 and TabbyDB 4.1.1, and compare the query timings yourself.

With non-partitioned, date-sorted tables: data generation is 6–7× faster — and benchmark performance is near-identical to what you would get with partitioned tables on stock Spark.

What you get

Single-node standalone cluster — one machine acting as both master and worker

Same generated data, both engines — stock Spark 4.1.1 and TabbyDB 4.1.1

Non-partitioned, date-sorted tables — the configuration that exposes TabbyDB's gains

Side-by-side query timings for all TPC-DS queries
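
Once both runs have produced per-query timings, the comparison itself is simple arithmetic. The sketch below assumes "improvement" means the percent reduction in total runtime over the queries both engines completed; that definition, the function name `runtime_improvement_pct`, and the sample numbers are illustrative assumptions, not the published methodology or measured results.

```python
from typing import Dict


def runtime_improvement_pct(stock_s: Dict[str, float], tabby_s: Dict[str, float]) -> float:
    """Percent reduction in total runtime over the queries present in both runs."""
    common = sorted(set(stock_s) & set(tabby_s))
    if not common:
        raise ValueError("no queries in common")
    stock_total = sum(stock_s[q] for q in common)
    tabby_total = sum(tabby_s[q] for q in common)
    return 100.0 * (stock_total - tabby_total) / stock_total


# Placeholder timings in seconds, NOT measurements.
stock = {"q1": 10.0, "q2": 30.0}
tabby = {"q1": 8.0, "q2": 24.0}
print(f"{runtime_improvement_pct(stock, tabby):.1f}%")
```

Restricting the totals to queries present in both runs keeps a failed or skipped query on one engine from skewing the comparison.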

Under the Hood

Spark SQL Modules Optimized in TabbyDB

Side by side: the modules TabbyDB touches versus the stock Spark pipeline.

TabbyDB

Stock Spark

What We Fixed

25+ Apache Spark JIRA Issues — Resolved

Filed, triaged, and fixed in TabbyDB. Many are still open upstream.

Performance Issues

SPARK-33152: Constraint Propagation causing compile times to run into hours

SPARK-36786: Inefficiency in PushDownPredicates for complex expressions

SPARK-44662: Dynamic file pruning for non-partition column joins

SPARK-45373: Minimizing calls to HMS layer for repeated table references

SPARK-45866: Reuse of Exchange broken in AQE when runtime filters pushed down

SPARK-45959: Uncapped tree size in analysis phase; compilation runs into hours

SPARK-46671: Redundant filter creation from buggy Constraint Propagation rule

SPARK-47609: Cached Plan lookup may miss valid plans

SPARK-49618: Canonicalization differences in Union causing failure in re-use of exchange or cached plans

SPARK-49881: Minimizing cost of DeduplicateRelations in the analyzer

SPARK-54881: BooleanSimplification using transformExpressionsUp instead of Down, inefficient in some cases

SPARK-55072: Inferring new Constraint misses IsNotNull on Left Leg when Outer Join converts to Inner Join

SPARK-55110: Order of BooleanSimplification and SimplifyBinaryComparison rules is suboptimal for idempotency

SPARK-57126: Canonicalization of DynamicPruningSubquery is broken

SPARK-57127: Canonicalization of JoinExec is broken

ICEBERG-16563: The canonicalization of Filter Expressions (data and runtime) is broken in SparkRuntimeFilterableScan

Functional Issues

SPARK-47320: Self-join inconsistencies and exceptions

SPARK-47217: DeduplicateRelations may cause failure in plan resolution

SPARK-49727: Data loss when POJO Dataset converted to DataFrame and back

SPARK-49789: Exception encoding POJOs with generic type fields

SPARK-51016: Incorrect results during retry when join column is indeterminate

SPARK-45658: Canonicalization of DynamicPruningSubquery is broken

SPARK-53264: Incorrect nullability when correlated subquery converted to Left Outer Join

SPARK-55185: Adding InferFiltersFromConstraints to Optimization batch causes idempotency break

SPARK-55241: Idempotency of SQL Streaming with Joins broken when InferFiltersFromConstraints and PropagateEmptyRelation are added as Optimization rules

Why Choose TabbyDB

More Than a Faster Engine

Built for teams running complex, production-critical Apache Spark workloads, where query compilation time and optimizer behavior matter as much as execution speed.

Engine-Level Enhancements

Improvements are made at the optimizer and execution engine level — not as application-layer patches. Every workload benefits, without tuning or code changes.

Built for Production Stability

Beyond raw performance, TabbyDB fixes 25+ Spark performance and functionality bugs — self-join inconsistencies, data loss in POJO conversions, broken idempotency in streaming joins.

Compatible With Everything You Have

Full compatibility with existing Spark APIs, tooling, and workflows. No migration. No vendor lock-in. If you ever need to roll back, swap the jars.

No Shortcuts. Ever.

TabbyDB never disables optimizer rules as a workaround. Doing so trades away runtime execution speed — the optimizer can no longer do its job. Every fix is implemented at the root cause, in the analyzer, optimizer, or execution engine, so you get optimal compilation and runtime performance.

Open to Collaboration

Partner With the Team Behind TabbyDB

If you're encountering functional or performance issues in Apache Spark — particularly in the SQL or optimizer layer — we're open to collaborating on solutions tailored to your workload or codebase.

Whether it's diagnosing a bottleneck, validating a fix, or contributing targeted improvements, we're happy to engage.

Get Started Today

See the Difference on a Sample Query

Run the same query on stock Spark and TabbyDB side by side in our live Zeppelin notebooks — two demos, each targeting a different root cause. No setup required.

Demo: Constraints Impact

Demo: Tree Size Impact

Technical Depth

Read the Algorithms

Every optimization in TabbyDB is documented. Read the white paper before you decide. Then run the benchmark.

Co-presented at the Databricks Spark Summit, 2021

Optimizer Performance

Constraint Propagation Rule Optimization

The new algorithm that solves Constraint blow-up from permutational logic in stock Spark.
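
To give a feel for why permutational expansion blows up, here is a toy model. It assumes that a constraint over several attributes is re-derived once for every combination of each attribute and its aliases; that assumption, and the names `expand_constraint`, `attrs`, and `aliases`, are illustrative only and are not taken from Spark's or TabbyDB's code.

```python
from itertools import product
from typing import Dict, List, Tuple


def expand_constraint(attrs: Tuple[str, ...], aliases: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """Enumerate every rewrite of a constraint over `attrs` obtained by replacing
    each attribute with itself or one of its aliases."""
    choices = [[a] + aliases.get(a, []) for a in attrs]
    return list(product(*choices))


# Three attributes with two aliases each: 3 * 3 * 3 variants of one constraint.
variants = expand_constraint(
    ("a", "b", "c"),
    {"a": ["a1", "a2"], "b": ["b1", "b2"], "c": ["c1", "c2"]},
)
print(len(variants))  # 27
```

The count is the product of (1 + number of aliases) over the attributes involved, so it grows multiplicatively as a query adds aliases.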

Analyzer Performance

Capping the Query Plan Size

Collapsing project nodes in the analysis phase to prevent tree size explosion in complex DataFrame API queries.
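
The idea can be sketched on a toy plan. The model below tracks only node kinds in a single-child plan and merges each run of directly stacked `Project` nodes into one; a real implementation would also have to merge the projected expressions. `Node`, `depth`, and `collapse_projects` are hypothetical names for this illustration, not TabbyDB's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str                       # "Project", "Scan", ...
    child: Optional["Node"] = None  # single-child plans only, for brevity


def depth(node: Optional[Node]) -> int:
    """Number of nodes from `node` down to the leaf."""
    count = 0
    while node is not None:
        count += 1
        node = node.child
    return count


def collapse_projects(root: Optional[Node]) -> Optional[Node]:
    """Return a copy of the plan with each run of stacked Projects merged into one."""
    kinds = []
    node = root
    while node is not None:
        if not (node.kind == "Project" and kinds and kinds[-1] == "Project"):
            kinds.append(node.kind)
        node = node.child
    rebuilt = None
    for kind in reversed(kinds):  # rebuild from the leaf upwards
        rebuilt = Node(kind, rebuilt)
    return rebuilt


# A scan with 200 Projects stacked on top, as a loop of column additions might produce.
plan = Node("Scan")
for _ in range(200):
    plan = Node("Project", plan)

print(depth(plan), depth(collapse_projects(plan)))  # 201 2
```

Every later rule that walks the tree then visits 2 nodes instead of 201, which is where the compile-time and memory savings would come from in this toy setting.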

Runtime Performance

Broadcast Hash Join Key Pushdown

Dynamic file pruning for non-partitioned columns — the runtime performance fix Spark never shipped.
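
A simplified picture of file pruning from join keys: if each data file carries min/max statistics for the join column, any file whose range contains none of the keys from the broadcast side can be skipped. The sketch below assumes integer keys and per-file min/max stats; `FileStats` and `prune_files` are hypothetical names, and this is not TabbyDB's implementation.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileStats:
    path: str
    min_key: int  # smallest join-column value in the file
    max_key: int  # largest join-column value in the file


def prune_files(files: Iterable[FileStats], build_keys: Iterable[int]) -> List[FileStats]:
    """Keep only files whose [min_key, max_key] range contains some build-side key."""
    keys = sorted(set(build_keys))
    kept = []
    for f in files:
        i = bisect_left(keys, f.min_key)  # index of the first key >= min_key
        if i < len(keys) and keys[i] <= f.max_key:
            kept.append(f)
    return kept


files = [FileStats("part-0", 0, 99), FileStats("part-1", 100, 199), FileStats("part-2", 200, 299)]
kept = prune_files(files, build_keys=[5, 250])
print([f.path for f in kept])  # ['part-0', 'part-2']
```

Pruning of this kind is most effective when file ranges overlap little, for example when the data is sorted on the join column.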

Benchmarks

TPC-DS Benchmark Details

Full breakdown of methodology, configuration, and results for 1TB and 2TB benchmarks on AWS.

Optimizer Performance

Common Subexpression Extraction

Applying expensive optimizer rules only once to complex repeated sub-expressions — avoiding redundant tree traversal.
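
As a rough illustration of applying a rule once per repeated sub-expression, the sketch below rewrites a nested-tuple expression bottom-up and optionally memoizes results by subtree. The tuple representation, `rewrite`, and `count_rule_calls` are assumptions made for this example; memoization is a stand-in technique here, not a description of how TabbyDB extracts common subexpressions.

```python
from typing import Callable, Dict, Optional, Tuple, Union

Expr = Union[str, Tuple]  # a leaf name, or (operator, operand, operand, ...)


def rewrite(expr: Expr, rule: Callable[[Expr], Expr],
            cache: Optional[Dict[Expr, Expr]] = None) -> Expr:
    """Apply `rule` bottom-up; with a cache, each distinct subtree is processed once."""
    if cache is not None and expr in cache:
        return cache[expr]
    if isinstance(expr, tuple):
        rebuilt = (expr[0],) + tuple(rewrite(c, rule, cache) for c in expr[1:])
    else:
        rebuilt = expr
    result = rule(rebuilt)
    if cache is not None:
        cache[expr] = result
    return result


def count_rule_calls(expr: Expr, use_cache: bool) -> int:
    """Count how often a stand-in rule fires while rewriting `expr`."""
    calls = 0

    def rule(e: Expr) -> Expr:  # stand-in for an expensive optimizer rule
        nonlocal calls
        calls += 1
        return e

    rewrite(expr, rule, {} if use_cache else None)
    return calls


shared = ("add", "x", "y")
expr = ("mul", shared, shared)
print(count_rule_calls(expr, use_cache=False), count_rule_calls(expr, use_cache=True))  # 7 4
```

The gap between the two counts widens quickly when the repeated sub-expression is large or appears many times.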

How It Works

Live in 15 Minutes. Revert in 30 Seconds.

Step 01

Download TabbyDB jars

Download the Spark 4.1.1 build

Step 02

Replace existing Spark jars

Swap them in — no configuration changes

Step 03

Run your pipelines unchanged

Same APIs. Same code. Faster engine.

Zero-friction rollback — swap the jars and you're back. No config, no code, no cluster changes.
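
The swap and rollback can be scripted. The sketch below is one possible approach and rests on a loud assumption: each replacement jar has exactly the same file name as the stock jar it replaces. The source does not list the jar names, so check that against the actual download. `swap_jars` and `rollback` are hypothetical helpers, not part of the TabbyDB distribution.

```python
import shutil
from pathlib import Path
from typing import List


def swap_jars(jars_dir: Path, replacement_dir: Path, backup_dir: Path) -> List[str]:
    """Back up each jar that has a same-named replacement, then copy the replacement in."""
    new_jars = sorted(replacement_dir.glob("*.jar"))
    # Assumption: names match one-to-one; refuse to proceed otherwise.
    missing = [j.name for j in new_jars if not (jars_dir / j.name).exists()]
    if missing:
        raise FileNotFoundError(f"no existing jar named: {missing}")
    backup_dir.mkdir(parents=True, exist_ok=True)
    swapped = []
    for new_jar in new_jars:
        target = jars_dir / new_jar.name
        backup = backup_dir / new_jar.name
        if not backup.exists():           # never overwrite an earlier backup
            shutil.copy2(target, backup)  # keep the original
        shutil.copy2(new_jar, target)     # drop in the replacement
        swapped.append(new_jar.name)
    return swapped


def rollback(jars_dir: Path, backup_dir: Path) -> None:
    """Restore every backed-up jar over its replacement."""
    for old_jar in backup_dir.glob("*.jar"):
        shutil.copy2(old_jar, jars_dir / old_jar.name)
```

In this sketch `jars_dir` would typically be `$SPARK_HOME/jars`, and the cluster would need a restart after either operation.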

Our commitment

Performance guarantee

KwikQuery will never regress you relative to stock Apache Spark. Full stop.

Extended test coverage

Our testing has already uncovered and fixed real Spark bugs that may still be present in the upstream release.

Maximum-speed fixes

Any functional issue — whether from our changes or lurking in stock Spark — is our highest priority to fix.

We're confident, though we won't claim perfection — no engineering team can. What we can promise: you are never on your own.

Downloads

Three Commands. One Faster Spark.

Drop in the TabbyDB jars and run your existing pipelines. Roll back in seconds.

tabbydb-install.sh

# Download the complete TabbyDB Spark distribution
$ wget https://github.com/perf-apt/tabbydb-trial-releases/releases/download/licensed-release/tabbydb-4.1.1-bin-tabbydb.tgz
$ tar -xzf tabbydb-4.1.1-bin-tabbydb.tgz
$ export SPARK_HOME=$PWD/tabbydb-4.1.1-bin-tabbydb

# Run your pipelines
$ $SPARK_HOME/bin/spark-submit your_app.py

Complete Spark + Iceberg distribution — for new clusters

Fresh Install

Complete TabbyDB distribution — for new clusters

TabbyDB 4.1.1 Full Install (Latest)

Complete Spark + Iceberg, Linux x86_64, JDK 17+ · 248 MB

Convert Existing Spark

Drop-in jars for your existing Spark cluster

TabbyDB 4.1.1 Jars (Latest)

Drop-in jars with Iceberg support — replace 8 Spark jars · 62 MB

Iceberg Runtime Jar (4.1.1)

Replace iceberg-spark-runtime to unlock 50%+ Iceberg gain · ~30 MB

FAQs

Questions Engineers Ask

The honest answers we give when a data platform team is evaluating TabbyDB.

TabbyDB is a drop-in replacement for Apache Spark: it is a fork that preserves 100% API compatibility with Spark 4.1.1. You replace the Spark jars with TabbyDB jars, restart your cluster, and run the same code unchanged. No config changes, no rewrites, and you can revert at any time by swapping the jars back.

Ready When You Are

Stop Waiting for Spark to Compile.

Download the trial. Run it on your actual queries. See the difference on your own cluster.

Download Trial Version

Compare Performance Live →

100% Apache Spark API compatible. No code changes. Full rollback in seconds.