A major IT Company has implemented an AI-driven chatbot to streamline customer support and provide quick, accurate answers to user queries. This intelligent chatbot is powered by natural language processing (NLP) models that understand and respond to a wide range of questions. The chatbot efficiently processes user queries and delivers appropriate responses. It is designed to support both structured and unstructured questions, ensuring a seamless experience for users. The collected data undergoes analysis to uncover trends, patterns, and performance metrics. Apache Superset, an open-source data exploration and visualization platform, is integrated to create interactive dashboards.
Image: Chatbot
In the company, Apache Superset powers the analytics dashboard that provides real-time KPIs for chatbot performance. These KPIs include:
Image: Chatbot Dashboard
While Superset effectively visualizes the data, its reliance on Apache Spark as a backend data source led to significant performance issues. Query execution times were slow, resulting in visual lags and poor user experience. To solve this problem, Zettaprof, a profiler tool running on top of Apache Spark, was introduced. By analyzing query execution plans and implementing its recommendations, you can achieve up to 3x runtime improvement for dashboard performance.
The Superset dashboards use Apache Spark as the data source to query chatbot performance data. While Apache Spark is robust, it struggles to handle large-scale queries efficiently. As a result:
The chatbot data source consists of the following two tables:
S.No. | Schema | Tables | Columns | Size |
---|---|---|---|---|
1 | feedback_sch | feedback_tbl | id int, question_id int, like_or_dislike int, selected_reasons string, additional_feedback string, created_at timestamp | 1605 MB |
2 | questions_tbl | questions_sch | question_id int, question string, answer string, image_list string, component string, version_added string, version_expired string, created_at string, updated_at string | 800 MB |
The feedback_tbl table contains feedback-related data, including user reactions (likes/dislikes), selected reasons for responses, and additional feedback with timestamps. On the other hand, the questions_sch table stores detailed chatbot question metadata, such as the question text, answers, components, versioning information, and timestamps.
Identified Challenge: Optimize query execution to improve Superset dashboard performance without changing the data source.
To resolve the performance bottleneck, we introduced Zettaprof — a query execution profiler tool that works on top of Apache Spark. Zettaprof identifies performance issues, highlights inefficient query patterns, and recommends actionable improvements. By following its insights, you can optimise Spark configurations and achieve significant improvements.
Image: Zettaprof Integration
Before implementing Zettaprof's recommendations, the Spark job was configured as follows:
Pre-Implementation | Configs |
---|---|
spark.executor.instances | 1 |
spark.executor.cores | 300 |
spark.executor.memory | 700g |
spark.driver.memory | 20g |
spark.driver.cores | 3 |
spark.sql.shuffle.partitions | 200 |
This baseline configuration lacked efficient resource utilization, leading to high latency in query execution.
Details Captured by Zettaprof:
Recommendations by Zettaprof:
Zettaprof analyzed the query plan and recommended the following improvements:
spark.sql.autoBroadcastJoinThreshold.
The optimized Spark job configuration:
Post-Implementation | Configs |
---|---|
spark.executor.instances | 60 |
sspark.executor.cores | 5 |
spark.executor.memory | 2g |
spark.driver.memory | 2g |
spark.driver.cores | 5 |
spark.sql.shuffle.partitions | 200 |
spark.sql.files.maxPartitionBytes | 134217728 |
spark.serializer | org.apache.spark.serializer.KryoSerializer |
spark.sql.autoBroadcastJoinThreshold | 250m |
spark.driver.maxResultSize | 1g |
Details Captured by Zettaprof:
Configuration Stage | Join Type | Execution Time | Auto/Manual |
---|---|---|---|
Pre-Optimization | Shuffled Hash Join | 30.6 Seconds | Auto |
Post-Optimization | Broadcast Join | 10.2 Seconds | Manual |
Result: A 3x improvement in query execution time was achieved, leading to faster dashboard loading and a seamless user experience.
By leveraging Zettaprof's recommendations and optimizing Spark configurations, one can successfully reduce the query runtime from 0.51 minutes to 0.17 minutes. This significant improvement enhanced the Apache Superset dashboard's responsiveness, enabling real-time KPI tracking for chatbot performance.
The key takeaway is that tools like Zettaprof are invaluable in identifying bottlenecks and driving optimizations in big data workflows. Combined with Apache Spark, you can unlock performance efficiencies that ensure scalable and real-time analytics.