Enhancing Apache Superset Dashboards Using Zettaprof for Faster Query Execution


Introduction


Chatbot Implementation for Enhanced Customer Support and Data Analysis


A major IT Company has implemented an AI-driven chatbot to streamline customer support and provide quick, accurate answers to user queries. This intelligent chatbot is powered by natural language processing (NLP) models that understand and respond to a wide range of questions. The chatbot efficiently processes user queries and delivers appropriate responses. It is designed to support both structured and unstructured questions, ensuring a seamless experience for users. The collected data undergoes analysis to uncover trends, patterns, and performance metrics. Apache Superset, an open-source data exploration and visualization platform, is integrated to create interactive dashboards.

Chatbot

Image: Chatbot

In the company, Apache Superset powers the analytics dashboard that provides real-time KPIs for chatbot performance. These KPIs include:


Chatbot Dashboard

Image: Chatbot Dashboard

While Superset effectively visualizes the data, its reliance on Apache Spark as a backend data source led to significant performance issues. Query execution times were slow, resulting in visual lags and poor user experience. To solve this problem, Zettaprof, a profiler tool running on top of Apache Spark, was introduced. By analyzing query execution plans and implementing its recommendations, you can achieve up to 3x runtime improvement for dashboard performance.


The Problem: Slow Dashboard Performance

The Superset dashboards use Apache Spark as the data source to query chatbot performance data. While Apache Spark is robust, it struggles to handle large-scale queries efficiently. As a result:


The chatbot data source consists of the following two tables:


S.No. Schema Tables Columns Size
1 feedback_sch feedback_tbl id int, question_id int, like_or_dislike int, selected_reasons string, additional_feedback string, created_at timestamp 1605 MB
2 questions_tbl questions_sch question_id int, question string, answer string, image_list string, component string, version_added string, version_expired string, created_at string, updated_at string 800 MB

The feedback_tbl table contains feedback-related data, including user reactions (likes/dislikes), selected reasons for responses, and additional feedback with timestamps. On the other hand, the questions_sch table stores detailed chatbot question metadata, such as the question text, answers, components, versioning information, and timestamps.

Identified Challenge: Optimize query execution to improve Superset dashboard performance without changing the data source.


The Solution: Leveraging Zettaprof with Apache Spark

To resolve the performance bottleneck, we introduced Zettaprof — a query execution profiler tool that works on top of Apache Spark. Zettaprof identifies performance issues, highlights inefficient query patterns, and recommends actionable improvements. By following its insights, you can optimise Spark configurations and achieve significant improvements.


Flow-Digram:

Zettaprof Integration

Image: Zettaprof Integration


Pre-Optimization Stage: Baseline Configuration


Before implementing Zettaprof's recommendations, the Spark job was configured as follows:


Pre-Implementation Configs
spark.executor.instances 1
spark.executor.cores 300
spark.executor.memory 700g
spark.driver.memory 20g
spark.driver.cores 3
spark.sql.shuffle.partitions 200

Key Observations (Pre-Optimization):


This baseline configuration lacked efficient resource utilization, leading to high latency in query execution.

Details Captured by Zettaprof:

Enhanced Dashboard Performance

Recommendations by Zettaprof:

Enhanced Dashboard Performance


Post-Optimization Stage: Zettaprof Recommendations


Zettaprof analyzed the query plan and recommended the following improvements:

The optimized Spark job configuration:


Post-Implementation Configs
spark.executor.instances 60
sspark.executor.cores 5
spark.executor.memory 2g
spark.driver.memory 2g
spark.driver.cores 5
spark.sql.shuffle.partitions 200
spark.sql.files.maxPartitionBytes 134217728
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.autoBroadcastJoinThreshold 250m
spark.driver.maxResultSize 1g

Key Improvements (Post-Optimization):


Details Captured by Zettaprof:

Enhanced Dashboard Performance

Performance Comparison


Configuration Stage Join Type Execution Time Auto/Manual
Pre-Optimization Shuffled Hash Join 30.6 Seconds Auto
Post-Optimization Broadcast Join 10.2 Seconds Manual

Result: A 3x improvement in query execution time was achieved, leading to faster dashboard loading and a seamless user experience.


Conclusion


By leveraging Zettaprof's recommendations and optimizing Spark configurations, one can successfully reduce the query runtime from 0.51 minutes to 0.17 minutes. This significant improvement enhanced the Apache Superset dashboard's responsiveness, enabling real-time KPI tracking for chatbot performance.

The key takeaway is that tools like Zettaprof are invaluable in identifying bottlenecks and driving optimizations in big data workflows. Combined with Apache Spark, you can unlock performance efficiencies that ensure scalable and real-time analytics.


Next Steps


Let's Talk
GET YOUR DIGITAL TRANSFORMATION STARTED