Blogs

Enhancing Apache Superset Dashboards Using Zettaprof for Faster Query Execution

Introduction

Chatbot Implementation for Enhanced Customer Support and Data Analysis

A major IT Company has implemented an AI-driven chatbot to streamline customer support and provide quick, accurate answers to user queries. This intelligent chatbot is powered by natural language processing (NLP) models that understand and respond to a wide range of questions. The chatbot efficiently processes user queries and delivers appropriate responses. It is designed to support both structured and unstructured questions, ensuring a seamless experience for users. The collected data undergoes analysis to uncover trends, patterns, and performance metrics. Apache Superset, an open-source data exploration and visualization platform, is integrated to create interactive dashboards.

Image: Chatbot

In the company, Apache Superset powers the analytics dashboard that provides real-time KPIs for chatbot performance. These KPIs include:

Total Number of Questions
Percentage of Positive Feedback
Percentage of Negative Feedback
Total Number of Accepted Answers

Image: Chatbot Dashboard

While Superset effectively visualizes the data, its reliance on Apache Spark as a backend data source led to significant performance issues. Query execution times were slow, resulting in visual lags and poor user experience. To solve this problem, Zettaprof, a profiler tool running on top of Apache Spark, was introduced. By analyzing query execution plans and implementing its recommendations, you can achieve up to 3x runtime improvement for dashboard performance.

The Problem: Slow Dashboard Performance

The Superset dashboards use Apache Spark as the data source to query chatbot performance data. While Apache Spark is robust, it struggles to handle large-scale queries efficiently. As a result:

Query execution was slow.
Dashboard visuals were delayed.
KPIs were not updated in real-time, impacting decision-making.

The chatbot data source consists of the following two tables:

S.No.	Schema	Tables	Columns	Size
1	feedback_sch	feedback_tbl	id int, question_id int, like_or_dislike int, selected_reasons string, additional_feedback string, created_at timestamp	1605 MB
2	questions_tbl	questions_sch	question_id int, question string, answer string, image_list string, component string, version_added string, version_expired string, created_at string, updated_at string	800 MB

The feedback_tbl table contains feedback-related data, including user reactions (likes/dislikes), selected reasons for responses, and additional feedback with timestamps. On the other hand, the questions_sch table stores detailed chatbot question metadata, such as the question text, answers, components, versioning information, and timestamps.

Identified Challenge: Optimize query execution to improve Superset dashboard performance without changing the data source.

The Solution: Leveraging Zettaprof with Apache Spark

To resolve the performance bottleneck, we introduced Zettaprof — a query execution profiler tool that works on top of Apache Spark. Zettaprof identifies performance issues, highlights inefficient query patterns, and recommends actionable improvements. By following its insights, you can optimise Spark configurations and achieve significant improvements.

Flow-Digram:

Image: Zettaprof Integration

Pre-Optimization Stage: Baseline Configuration

Before implementing Zettaprof's recommendations, the Spark job was configured as follows:

Pre-Implementation	Configs
spark.executor.instances	1
spark.executor.cores	300
spark.executor.memory	700g
spark.driver.memory	20g
spark.driver.cores	3
spark.sql.shuffle.partitions	200

Key Observations (Pre-Optimization):

Join Type: Shuffled Hash Join
Query Execution Time: ~30.6 Seconds
Auto/Manual: Auto-selected join type
Configuration Issues:
- Overprovisioning of memory and cores
- Suboptimal shuffle partitions
- Inefficient join execution

This baseline configuration lacked efficient resource utilization, leading to high latency in query execution.

Details Captured by Zettaprof:

Recommendations by Zettaprof:

Post-Optimization Stage: Zettaprof Recommendations

Zettaprof analyzed the query plan and recommended the following improvements:

Resource Optimization:
- Reduce executor memory and cores
- Increase the number of executor instances for parallelism
Partition Management:
- Optimize shuffle partitions
- Set a reasonable max file partition size
Join Optimization:
- Enable Broadcast Join for small tables using spark.sql.autoBroadcastJoinThreshold.

The optimized Spark job configuration:

Post-Implementation	Configs
spark.executor.instances	60
sspark.executor.cores	5
spark.executor.memory	2g
spark.driver.memory	2g
spark.driver.cores	5
spark.sql.shuffle.partitions	200
spark.sql.files.maxPartitionBytes	134217728
spark.serializer	org.apache.spark.serializer.KryoSerializer
spark.sql.autoBroadcastJoinThreshold	250m
spark.driver.maxResultSize	1g

Key Improvements (Post-Optimization):

Join Type: Broadcast Join
Query Execution Time: 10.2 Seconds (~3x faster)
Auto/Manual: Manually optimized join type
Efficient Resource Usage:
- Executors and cores are balanced for parallelism.
- Broadcast joins significantly reduce shuffle overhead.

Details Captured by Zettaprof:

Performance Comparison

Configuration Stage	Join Type	Execution Time	Auto/Manual
Pre-Optimization	Shuffled Hash Join	30.6 Seconds	Auto
Post-Optimization	Broadcast Join	10.2 Seconds	Manual

Result: A 3x improvement in query execution time was achieved, leading to faster dashboard loading and a seamless user experience.

Conclusion

By leveraging Zettaprof's recommendations and optimizing Spark configurations, one can successfully reduce the query runtime from 0.51 minutes to 0.17 minutes. This significant improvement enhanced the Apache Superset dashboard's responsiveness, enabling real-time KPI tracking for chatbot performance.

The key takeaway is that tools like Zettaprof are invaluable in identifying bottlenecks and driving optimizations in big data workflows. Combined with Apache Spark, you can unlock performance efficiencies that ensure scalable and real-time analytics.

Next Steps

Extend Zettaprof analysis to other Superset dashboards.
Automate Spark configuration tuning for new queries.
Explore further optimizations like materialized views and caching mechanisms.

Let's Talk