KwikQuery

Accelerating Complex Query Execution

TabbyDB enhances Apache Spark by eliminating the performance bottlenecks that affect intricate queries (complex case logic, extensive joins, and large query trees). The result is faster execution and reduced resource consumption, enabling enterprises to handle demanding data workloads more efficiently.

Check out the performance difference between stock Spark and KwikQuery's TabbyDB by clicking the button for comparing performance. It leads to Zeppelin notebooks where the same query can be run on stock Spark and on TabbyDB. Please note that running the paragraph on stock Spark may take anywhere from 5 to 12 minutes.

TPC-DS is not a realistic benchmark for the performance of Apache Spark as an analytics engine. The complexity of its SQL queries is limited because the input is a string and because of limits on the level of nesting.

In our limited testing at the 50 GB scale factor of the TPC-DS benchmark, Spark with Hive-managed, non-partitioned tables showed an overall 28% improvement in execution time. This figure does not reflect the impact of compile-time optimizations, because the TPC-DS queries are not that complex.

Analytic queries, especially those generated by looping logic with the DataFrame APIs, can become extremely large. Stock Spark has been seen to take hours (from over 1 hour up to 8 hours) to compile them, and even then may fail with an OutOfMemory error.

KwikQuery's TabbyDB can bring those times down to realistic levels of minutes or seconds.
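
To see why loop-generated queries can get so large, here is a toy model in plain Python. It does not use Spark, and the `Expr` class and both functions are hypothetical names introduced only for illustration. It assumes that each loop iteration derives a new column from the previous one and references it twice, and that those references end up inlined into a single expression tree; whether a given Spark version actually inlines them depends on its analyzer and optimizer rules. Under these assumptions the node count roughly doubles with every iteration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """A node in a toy expression tree (hypothetical; not a Spark class)."""
    op: str
    children: Tuple["Expr", ...] = ()


def node_count(expr: Expr) -> int:
    """Count nodes the way a tree-walking rule would visit them."""
    return 1 + sum(node_count(child) for child in expr.children)


def build_looped_expression(steps: int) -> Expr:
    """Model a loop that rewrites a column `steps` times.

    Each step stands for: CASE WHEN prev > 0 THEN prev + 1 ELSE 0 END,
    with the previous expression inlined at both references (assumption).
    """
    current = Expr("col")
    for _ in range(steps):
        condition = Expr(">", (current, Expr("lit")))
        then_branch = Expr("+", (current, Expr("lit")))
        current = Expr("case_when", (condition, then_branch, Expr("lit")))
    return current


# Print how the tree grows with the number of loop iterations.
for steps in (1, 5, 10):
    print(steps, node_count(build_looped_expression(steps)))
# prints:
# 1 8
# 5 218
# 10 7162
```

In this model the tree holds 7 * 2^k - 6 nodes after k iterations, so a loop of a few dozen steps already describes a tree far too large to traverse rule by rule. Real plans differ in detail, but the sketch shows how modest-looking looping code can produce the very large query trees described above.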

Performance Issues
SPARK-33152: Constraint Propagation rule causing query compilation times to run into hours.
SPARK-36786: Inefficiency in PushDownPredicates rule affecting complex expressions.
SPARK-44662: Dynamic file pruning for non-partition-column joins.
SPARK-45373: Minimizing calls to the HMS layer. The issue affects Hive Metastore based tables when a query references the same tables repeatedly.
SPARK-45866: Reuse of Exchange broken in AQE when runtime filters are pushed down to scan.
SPARK-45959: Uncapped tree size in analysis phase, causing compilation to run into hours.
SPARK-46671: Redundant filter creation due to buggy Constraint Propagation rule.
SPARK-47609: Cached Plan lookup may miss picking a valid plan.
SPARK-49618: Canonicalization differences in Union may cause failure in re-use of exchange or cached plans.
SPARK-49881: Minimizing the cost of DeduplicateRelations in the analyzer.

Functional Issues
SPARK-47320: Self-join inconsistencies and exceptions.
SPARK-49727: Data loss when a POJO Dataset is converted into a DataFrame and back.
SPARK-49789: Exception in encoding POJOs with generic type fields.
SPARK-51016: Incorrect results on retry when the join column is indeterminate.
SPARK-45658: Canonicalization of DynamicPruningSubquery is broken.
SPARK-53264: Incorrect nullability when a correlated subquery is converted to a Left Outer Join.
SPARK-47217: DeduplicateRelations may cause failure in plan resolution.

Get in touch

Spark Query Taking Forever?

Crashing With 'Out of Memory' Errors? - WE GET IT!

INTRODUCING - KwikQuery's TabbyDB

Turbocharged fork of Apache Spark for lightning-fast queries and unstoppable data performance.

Many real-world queries take hours or fail entirely.

The Solution: KwikQuery's TabbyDB

Performance Enhancements That Redefine Complex Query Execution

Intelligent Compile-Time Optimizations

Advanced Broadcast Hash Join Handling

Improving Cache Lookup

Scalable Query Tree Management

Seamless Integration with Apache Spark Features

TabbyDB Performance Metrics

Issues Resolved in TabbyDB

Performance Issues

Functional Issues

Why Choose TabbyDB?

White Papers

Frequently Asked Questions about KwikQuery's TabbyDB

What makes TabbyDB different from standard Apache Spark?

Can TabbyDB handle extremely large and complex query trees efficiently?

Is TabbyDB compatible with existing Spark DataFrame APIs?

Ready to optimize your complex queries?

Contact