FAQs
Find clear answers to common questions about TabbyDB’s capabilities, performance, and integration to help you make the most of our advanced query engine.
What makes TabbyDB different from standard Apache Spark?
TabbyDB is a specialized fork of Apache Spark designed to optimize complex queries. It significantly reduces compilation time and memory usage for queries with nested joins, complex case statements, and large query trees through intelligent compile-time and runtime enhancements.
Can TabbyDB handle extremely large and complex query trees efficiently?
Yes, TabbyDB is built to manage vast and intricate query structures. Its optimizations improve both speed and resource consumption, enabling faster execution of queries that would otherwise take hours to compile and run.
When I run the TPC-DS benchmark on TabbyDB, I see the same performance as open-source Spark. So how is it different?
The TPC-DS toolkit from Databricks that is used to generate the data is configured by default to create partitioned tables. To see the performance difference, pass the partitioning flag as false and make a small modification to the toolkit's data generation code so that data is sorted locally on the date column when each partition split is written. Data generation should then be 6 to 7 times faster. When you run the benchmark on this data, you should see at least 16% better performance. In fact, the time taken is nearly the same as with partitioned tables, with far less data generation time.
Below is the patch to apply to the TPC-DS toolkit source code to generate non-partitioned, locally date-sorted splits.
patch-for-non-partitioned-date-sorted-splits.
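To make "sorted locally" concrete, here is a minimal pure-Python sketch. It is not the patch itself and does not use Spark: the function name `sort_splits_locally`, the sample rows, and the round-robin way of forming splits are all assumptions made for illustration. Each split is sorted by date on its own, so no split depends on any other and the data as a whole is not globally ordered. In Spark's DataFrame API the analogous operation is `sortWithinPartitions`.

```python
def sort_splits_locally(rows, num_splits, key):
    """Deal rows round-robin into num_splits chunks, then sort each chunk independently."""
    splits = [rows[i::num_splits] for i in range(num_splits)]
    return [sorted(split, key=key) for split in splits]


# Hypothetical sample rows: (sale date as an ISO string, item id).
rows = [
    ("2001-03-09", "a"),
    ("2001-01-15", "b"),
    ("2001-02-20", "c"),
    ("2001-01-02", "d"),
    ("2001-03-01", "e"),
    ("2001-02-11", "f"),
]

# ISO date strings sort correctly as plain strings, so the date field is the key.
splits = sort_splits_locally(rows, num_splits=2, key=lambda row: row[0])

for index, split in enumerate(splits):
    print(index, [sale_date for sale_date, _ in split])
```

Every split comes out ordered by date, while the concatenation of the splits is not, which is the difference between a local sort and a global one.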
Also, for now the runtime Broadcast Hash Join performance enhancement is implemented only for Hive-stored, Parquet-based tables (both managed and external) and for Iceberg tables (with TabbyDB 4.1.1). For Iceberg to use the Broadcast Var Pushdown, please download the iceberg-tabbydb-runtime jar from the download section.
Early testing of Iceberg with TabbyDB on a 50 GB dataset shows a performance gain of 46% for non-partitioned tables, with data generated so that it is sorted locally on the date column.
The gains are expected to rise as the data size increases, so for 1 TB or 2 TB they are expected to exceed 46%.
If the runtime performance of TabbyDB queries on non-partitioned tables is comparable to that of stock Spark queries on partitioned tables, why should I even consider TabbyDB?
- The runtime performance of partitioned-table queries in stock Spark comes from Dynamic Partition Pruning, which kicks in if and only if the join key column is a partition column. In real-world scenarios, not all queries join on a partition column.
- There is a hidden cost associated with partitioning the data. To give some perspective, creating a 1 TB partitioned dataset for the TPC-DS benchmark took more than 6 hours, compared with 40 minutes for non-partitioned, locally date-sorted tables.
- TPC-DS queries are not complex enough to stress compilation time. Real-world queries generated through the DataFrame API can be far more complex, to the point where compilation time far exceeds run time. TabbyDB keeps the compilation cost to a bare minimum.
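As a quick back-of-the-envelope check on the data generation figures quoted above for the 1 TB dataset, the arithmetic can be written out as below. The variable names are illustrative. Because the partitioned run took "more than 6 hours", the results are lower bounds.

```python
# Figures quoted for the 1 TB TPC-DS dataset.
partitioned_minutes = 6 * 60   # "more than 6 hours", so this is a lower bound
local_sorted_minutes = 40      # non-partitioned, locally date-sorted tables

speedup = partitioned_minutes / local_sorted_minutes
minutes_saved = partitioned_minutes - local_sorted_minutes

print(f"Data generation is at least {speedup:.0f}x faster")
print(f"At least {minutes_saved} minutes saved per 1 TB dataset")
```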
Do I lose any functionality or performance compared to stock Spark when I use TabbyDB?
No. TabbyDB is a strict superset: it adds performance improvements without removing or compromising any Spark feature, functionality, or runtime behavior.
For workloads that don't hit the bottlenecks that TabbyDB targets (such as TPC-DS on partitioned tables), results will be identical to stock Spark within normal variance.
Is TabbyDB compatible with existing Spark DataFrame APIs?
TabbyDB maintains 100% compatibility with the Apache Spark code base, so corporate users can keep using their programmatic query methods and benefit from the enhanced performance without making any TabbyDB-specific changes to their existing code.
TabbyDB also guarantees 100% compatibility with open-source Spark going forward.
Still have a question?
If you have any questions, whether technical or license related, please do not hesitate to contact us. We will reply promptly.
Email us:
asif.shahid@kwikquery.com
taha.hussain@kwikquery.com
