
Spark Query Taking Forever?
Crashing with OutOfMemory errors? WE GET IT!
INTRODUCING: KwikQuery's kqDB
A turbocharged fork of Apache Spark for lightning-fast queries and unstoppable data performance.
Many real-world queries take hours or fail entirely.
- Query compile time is not even counted in the Spark UI
- For large, complex queries, more time may be spent planning than executing
- Runtime performance of nested join queries is unsatisfactory
- The UI registers a query only after its plan is submitted, leaving a possible bottleneck hidden
- The workaround suggested by providers is disabling optimizer rules, which hurts runtime performance
The Result
- Wasted compute
- Hours of delay
- Unmet SLAs
The Solution: KwikQuery
- We tackle the root causes:
  - Query plan bloat
  - Compile-time overhead
  - Suboptimal runtime performance of nested join queries
- Absolutely no disabling of rules!
- No code rewrites
- No cluster changes
- Dynamic file pruning using broadcast hash join data, for better runtime performance
Accelerating Complex Query Execution
KwikQuery enhances Apache Spark by eliminating the performance bottlenecks that affect intricate queries (complex case logic, extensive joins, and large query trees), speeding up execution and reducing resource consumption so enterprises can handle demanding data workloads more efficiently.
See the performance difference between stock Spark and KwikQuery's kqDB by clicking the performance-comparison button. It leads to Zeppelin notebooks where the same query can be run on both stock Spark and kqDB. Please note that running the paragraph on stock Spark may take anywhere between 5 and 12 minutes.
Performance Enhancements That Redefine Complex Query Execution
Intelligent Compile-Time Optimizations
Fundamental changes to the algorithms of critical rules, such as constraint propagation, collapsing Project nodes early (in the analysis phase), and minimizing calls to the Hive metastore, together with many more thoughtful modifications, dramatically improve compile-time performance.
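For context, here is a minimal sketch of the stock-Spark workaround that kqDB aims to make unnecessary. The self-join below is a hypothetical stand-in for a query whose constraint set grows combinatorially; spark.sql.constraintPropagation.enabled is the standard Spark SQL conf for switching the rule off wholesale, at the cost of the runtime filter inference it provides.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("constraint-propagation-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical query shape that stresses constraint propagation: a self-join
// on several aliased, equated columns multiplies the candidate constraints.
val base = spark.range(1000).selectExpr("id AS a", "id AS b", "id AS c", "id AS d")
val joined = base.as("l").join(base.as("r"),
  $"l.a" === $"r.a" && $"l.b" === $"r.b" && $"l.c" === $"r.c" && $"l.d" === $"r.d")

// Stock-Spark workaround: disable the rule entirely, trading away the
// inferred filters it would otherwise contribute at runtime.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
joined.explain()  // compiles quickly, but without inferred constraints

// kqDB's approach, per the white paper below: keep the rule enabled and
// replace its permutational constraint-generation logic instead.
```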
Advanced Broadcast Hash Join Handling
Our fork optimizes broadcast hash joins on non-partitioned columns to perform dynamic file pruning, boosting the runtime performance of nested join queries. In limited TPC-DS testing, it showed a 28% improvement in execution time compared to stock Spark.
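A minimal sketch of the join shape this targets, with hypothetical paths and column names. Stock Spark's dynamic partition pruning (spark.sql.optimizer.dynamicPartitionPruning.enabled) only helps when the fact table is partitioned on the join key; kqDB's enhancement reuses the broadcasted dimension keys to skip fact files even when the key is a plain, non-partitioned column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical tables: a large fact table NOT partitioned on cust_id,
// and a small, selective dimension that Spark will broadcast.
val sales   = spark.read.parquet("/data/sales")
val premium = spark.read.parquet("/data/customers").filter("tier = 'premium'")

// Stock Spark broadcasts `premium` and probes it row by row; kqDB additionally
// pushes the broadcasted cust_id values down as a runtime file filter, so
// fact files whose min/max statistics cannot match are never read at all.
val result = sales.join(broadcast(premium), Seq("cust_id"))
result.explain()
```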
Improving Cache Lookup
The cache lookup for in-memory plans has been made more intelligent, increasing the hit rate of successful lookups. The higher hit rate can have a huge impact on runtime performance.
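A minimal sketch of the lookup in question, using a hypothetical data path. Spark's CacheManager substitutes a cached InMemoryRelation into a new query when it finds a cached plan producing the same result; how many semantically equivalent but syntactically different plans that matching catches depends on how far plan normalization goes, and kqDB's claim is a higher hit rate on exactly these lookups.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val orders = spark.read.parquet("/data/orders")   // hypothetical dataset

// Materialize a cached plan.
orders.filter("status = 'OPEN' AND qty > 10").cache().count()

// Same result, written with the predicates reordered. Whether this scan is
// served from the cache depends on how the lookup canonicalizes the two
// plans before comparing them; a miss silently recomputes from the files.
val again = orders.filter("qty > 10 AND status = 'OPEN'")
again.explain()   // an InMemoryTableScan in the physical plan signals a hit
```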
Scalable Query Tree Management
New rules and an algorithm change allow Project nodes to be collapsed in the analysis phase itself, capping the tree size. This yields tremendous savings in compile time and prevents out-of-memory errors.
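A minimal sketch of the bloat being capped. Each withColumn call wraps the logical plan in another Project node; stock Spark only merges them later, in the optimizer's CollapseProject rule, after the analyzer has already walked the full tree. The node counts below are approximate and version-dependent.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.Project
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Looping DataFrame logic, a common shape in generated ETL code: every
// iteration stacks one more Project node on top of the logical plan.
var df = spark.range(10).toDF("c0")
for (i <- 1 to 200)
  df = df.withColumn(s"c$i", col(s"c${i - 1}") + 1)

val analyzed  = df.queryExecution.analyzed       // pre-optimizer plan
val optimized = df.queryExecution.optimizedPlan  // post-CollapseProject plan
println(analyzed.collect  { case p: Project => p }.size)  // on the order of 200
println(optimized.collect { case p: Project => p }.size)  // collapsed to ~1
```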
Seamless Integration with Apache Spark Features
While boosting performance, KwikQuery's kqDB retains full compatibility with Apache Spark’s APIs and features, allowing users to leverage familiar tools with enhanced speed.
Spark SQL Modules Optimized in kqDB
Spark SQL Processing
KwikQuery Performance Metrics
N× performance improvements
TPC-DS is not a realistic benchmark for Apache Spark as an analytics engine: its queries arrive as SQL strings, which limits their complexity and level of nesting compared to programmatically generated plans.
In our limited testing at the 50 GB scale factor of the TPC-DS benchmark, on Spark with Hive-managed non-partitioned tables, an overall 28% improvement in execution time was observed. This does not account for the compile-time optimizations, as TPC-DS queries are not complex enough to exercise them.
Analytic queries, especially those created by looping logic over the DataFrame APIs (the pattern sketched under Scalable Query Tree Management above), can become extremely large; stock Spark has been seen to take hours (from over 1 hour up to 8) to compile them, and even then may fail with OutOfMemory errors.
KwikQuery's kqDB brings those times down to realistic levels of minutes or seconds.

White Papers
- White Paper: Constraint Propagation Rule
Describes the new constraint propagation algorithm, which solves the constraint blow-up caused by the permutational nature of the logic in stock Spark.
- White Paper: Capping the Query Plan Size
Describes collapsing the Project nodes in the analysis phase itself, preventing extremely large tree sizes for query plans created using the DataFrame APIs.
- White Paper: Pushdown of Broadcasted Keys as Runtime Filters
Describes the runtime performance enhancement of utilizing the broadcasted keys as filters for file pruning when non-partitioned columns are used as join keys.
Frequently Asked Questions about KwikQuery's kqDB
Find clear answers to common questions about kqDB’s capabilities, performance, and integration to help you make the most of our advanced query engine.
What makes kqDB different from standard Apache Spark?
kqDB is a specialized fork of Apache Spark designed to optimize complex queries. It significantly reduces compilation time and memory usage for queries with nested joins, complex case statements, and large query trees through intelligent compile-time and runtime enhancements.
Can kqDB handle extremely large and complex query trees efficiently?
Yes, kqDB is built to manage vast and intricate query structures. Its optimizations improve both speed and resource consumption, enabling faster execution of queries that would typically take hours to compile and run.
Is kqDB compatible with existing Spark DataFrame APIs?
kqDB maintains full compatibility with Apache Spark’s DataFrame APIs, allowing corporate users to continue using their programmatic query methods while benefiting from enhanced performance without changing their existing codebase.
Ready to optimize your complex queries?
Download kqDB and experience the performance improvements KwikQuery offers for demanding data workloads.