ZettaLens provides natural language access to internal Google Drive knowledge, powered by a locally hosted Llama 3.1 8B Instruct model for privacy, governance, and on-prem control.
Semiconductor engineering documentation spans specs, ECOs, bring-up logs, validation notes, and characterization data. ZettaLens provides citation-grounded, permission-aware, semiconductor-aware answers while keeping documents, embeddings, prompts, and inference inside the company network.
Semiconductor teams operate in a high-entropy documentation environment. This project delivers a private RAG system that converts that sprawl into a citation-grounded, access-controlled question-answering experience without sending data to a cloud LLM.
ZettaLens implements an on-prem RAG pipeline backed by a locally hosted Llama 3.1 8B Instruct model to meet data privacy and governance requirements. Operationally, it includes: a Google Drive folder-to-knowledge-base synchronization script with MD5 tracking, a web UI for knowledge base management, multi-document/multi-format ingestion, an API-driven auto-accuracy checker, and a scalable deployment pattern (multi-instance services behind NGINX load balancing).
The core problem is not the absence of information; it is the cost of locating trustworthy, scoped answers. A "correct" answer in chip development is conditional on context: stepping/revision, operating corner, temperature, workload assumptions, and the document's authority (errata vs characterization vs draft spec).
The assistant behaves like a careful engineer: it cites sources, states assumptions, distinguishes draft from authoritative docs, and says "insufficient evidence" instead of inventing limits. Retrieval quality is the product; generation is the summarizer.
The platform is built around four subsystems: (a) ingestion and parsing, (b) indexing, (c) retrieval and answering, and (d) governance/operations. The system treats retrieval quality as the primary product; generation is a structured summarizer over retrieved evidence.
Figure 1. End-to-end architecture: ingestion → indexing → retrieval → glossary steering → local LLM → cited answers.
Google Drive behaves like a living knowledge lake. The ingestion strategy focuses on incremental updates, reproducible indexing, and permission-aware retrieval.
To support multiple dynamic documents and concurrent usage, ZettaLens includes a Python synchronization script that mirrors a specified Google Drive folder into a target knowledge base. The script maintains a registry of files and their MD5 checksums; newly added or changed files are downloaded, passed to the ZettaLens ingestion API, and indexed into the target knowledge base, as sketched below. The sync can be run on demand or scheduled at a fixed interval.
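The following is a minimal sketch of that sync loop. It assumes an already-authenticated Drive API v3 client (`drive`); the ZettaLens ingestion endpoint and its payload shape are placeholders, not the actual API.

import io
import json
import pathlib

import requests
from googleapiclient.http import MediaIoBaseDownload

REGISTRY = pathlib.Path("sync_registry.json")        # file_id -> last seen MD5
INGEST_URL = "http://zettalens.internal/api/ingest"  # hypothetical endpoint

def sync_folder(drive, folder_id: str, kb_name: str) -> None:
    registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}

    # List files in the mirrored folder with their MD5 checksums (pagination omitted).
    resp = drive.files().list(
        q=f"'{folder_id}' in parents and trashed = false",
        fields="files(id, name, md5Checksum)",
    ).execute()

    for f in resp.get("files", []):
        if registry.get(f["id"]) == f.get("md5Checksum"):
            continue  # unchanged since last sync

        # Download the new or changed file into memory.
        buf = io.BytesIO()
        downloader = MediaIoBaseDownload(buf, drive.files().get_media(fileId=f["id"]))
        done = False
        while not done:
            _, done = downloader.next_chunk()

        # Hand the file to the ZettaLens ingestion API for indexing into the target KB.
        requests.post(
            INGEST_URL,
            data={"knowledge_base": kb_name},
            files={"file": (f["name"], buf.getvalue())},
            timeout=300,
        )
        registry[f["id"]] = f.get("md5Checksum")

    REGISTRY.write_text(json.dumps(registry, indent=2))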
The knowledge base pipeline supports multi-document updates and broad format coverage.
ZettaLens has been validated to ingest and index DOCX, TXT, PDF, HTML, CSV, and XLSX sources, and the knowledge base UI supports add/remove operations to manage evolving corpora.
Semiconductor documents are table-heavy. The parser preserves tables as structured artifacts where possible, so limits stay tied to conditions (corner, temp, stepping, test mode) instead of being shredded into unrelated text chunks.
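As an illustration only (not the actual parser output), a limits table can be chunked row by row so each limit value keeps its conditions as retrievable metadata:

import pandas as pd

def table_to_chunks(df: pd.DataFrame, doc_id: str, table_title: str) -> list[dict]:
    """Turn a limits table into row-level chunks that retain their conditions."""
    chunks = []
    for _, row in df.iterrows():
        chunks.append({
            "doc_id": doc_id,
            "table": table_title,
            # Text the embedder sees: the full row, conditions included.
            "text": "; ".join(f"{col}={row[col]}" for col in df.columns),
            # Structured metadata that retrieval can filter or boost on.
            "metadata": {c: str(row[c]) for c in ("corner", "temp", "stepping", "test_mode")
                         if c in df.columns},
        })
    return chunks

# Example: a Vmin limits table with its conditions preserved per row (values are invented).
limits = pd.DataFrame([
    {"corner": "SS", "temp": "-40C", "stepping": "B0", "test_mode": "func", "vmin_mv": 680},
    {"corner": "TT", "temp": "25C",  "stepping": "B0", "test_mode": "func", "vmin_mv": 640},
])
print(table_to_chunks(limits, doc_id="char_rpt_B0", table_title="Core Vmin limits"))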
The web UI makes the system usable by engineers: authentication, knowledge base management, chat-based Q&A, feedback capture, and result visualization. The UI is intentionally simple: the goal is to reduce friction between "I have a question" and "I have cited evidence."
Figure 2. Login screen (ZettaLens UI).
Figure 3. Knowledge base document repository with parsing status and controls.
Figure 4. Chat experience with attached citations.
Figure 5. API-driven evaluation output: per-query response time and semantic similarity scores.
Semiconductor corpora are acronym-dense and entity-heavy (register names, IP blocks, stepping labels, test modes). Pure vector retrieval can miss exact identifiers; pure lexical retrieval can miss semantic intent. A hybrid approach combines vector similarity with keyword/BM25 style retrieval, merges candidates, and applies reranking based on semantic relevance and metadata (doc type, authority, revision, and scope).
Figure 6. Hybrid retrieval: vector + lexical → merge/dedupe → rerank.
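A minimal merge step might look like the sketch below, where vector_search and bm25_search are stand-ins for the two retrievers and reciprocal rank fusion (RRF) is one common way to combine their candidate lists before metadata-aware reranking:

from collections import defaultdict

def hybrid_retrieve(query: str, vector_search, bm25_search, k: int = 10, rrf_k: int = 60):
    vec_hits = vector_search(query, top_k=50)   # semantic candidates
    lex_hits = bm25_search(query, top_k=50)     # exact-identifier candidates

    # Merge and dedupe with reciprocal rank fusion: score = sum of 1 / (rrf_k + rank).
    scores: dict[str, float] = defaultdict(float)
    for hits in (vec_hits, lex_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)

    fused = sorted(scores, key=scores.get, reverse=True)
    # A reranker using doc type, authority, revision, and scope would reorder
    # fused[:k] here before the chunks are handed to the LLM.
    return fused[:k]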
The single biggest relevance lever is domain grounding. A curated glossary is used as an active component of the pipeline: it disambiguates overloaded terms (e.g., Vmin, corner, training), expands internal aliases and synonyms, and injects retrieval hints so the model sees the right context before generating an answer. An example glossary entry:
{
  "term": "Vmin",
  "definition": "Minimum supply voltage meeting functional + timing requirements under specified conditions.",
  "units": ["mV"],
  "contexts": ["DVFS", "AVFS", "SRAM margin", "IR drop"],
  "confusions": ["Not IO Vmin", "Not retention Vmin"],
  "aliases": ["SOC_VMIN_TARGET", "SV_VMIN"],
  "related": ["SS corner", "cold temp", "droop", "guardband"],
  "retrieval_hints": {
    "boost_doc_types": ["characterization", "silicon_validation"],
    "prefer_sections": ["limits", "conditions", "tables"],
    "boost_tags": ["stepping", "corner"]
  }
}
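To show how such an entry can steer retrieval, the sketch below expands a query with aliases and turns retrieval_hints into boost filters. Field names mirror the JSON above; the function itself is illustrative, not the production code.

import json

def apply_glossary(query: str, glossary: dict[str, dict]) -> tuple[str, dict]:
    expanded_terms = [query]
    boosts = {"doc_types": [], "tags": [], "sections": []}
    for term, entry in glossary.items():
        if term.lower() in query.lower():
            # Expand internal aliases so lexical retrieval can match exact identifiers.
            expanded_terms.extend(entry.get("aliases", []))
            # Turn retrieval hints into boost filters for the reranker.
            hints = entry.get("retrieval_hints", {})
            boosts["doc_types"] += hints.get("boost_doc_types", [])
            boosts["tags"] += hints.get("boost_tags", [])
            boosts["sections"] += hints.get("prefer_sections", [])
    return " OR ".join(expanded_terms), boosts

# Hypothetical glossary file containing entries like the Vmin example, keyed by term.
glossary = json.load(open("glossary.json"))
expanded_query, boosts = apply_glossary("What is the cold Vmin at SS corner?", glossary)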
ZettaLens includes an API-based auto-accuracy checking script to enable end-to-end testing of retrieval accuracy, metadata handling, and performance. The evaluation harness also supports benchmarking across different LLMs and embedding models.
In practice, this allows teams to track: (a) response latency, (b) semantic similarity between expected and generated answers, (c) retrieval hit-rate for authoritative documents, and (d) regression detection after pipeline changes.
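A stripped-down version of that evaluation loop is shown below, with a hypothetical /api/chat endpoint and a local sentence-transformers model standing in for the real harness:

import time

import requests
from sentence_transformers import SentenceTransformer, util

QA_SET = [  # curated (question, expected answer) pairs; values here are invented
    ("What is the B0 cold Vmin at SS corner?",
     "680 mV at -40C, SS corner, functional test mode."),
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")

for question, expected in QA_SET:
    start = time.perf_counter()
    resp = requests.post(
        "http://zettalens.internal/api/chat",                 # hypothetical endpoint
        json={"knowledge_base": "silicon_docs", "query": question},
        timeout=120,
    )
    latency = time.perf_counter() - start
    answer = resp.json()["answer"]                            # assumed response field

    # Semantic similarity between the expected and generated answers.
    sim = util.cos_sim(embedder.encode(expected), embedder.encode(answer)).item()
    print(f"{latency:.2f}s  sim={sim:.3f}  {question}")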
In testing on the provided document set, the following configuration offered the best tradeoff between resource use and answer quality:
To support multiple engineers concurrently, the recommended deployment pattern runs multiple instances of the LLM, reranker, and embedder, fronted by NGINX for load balancing. This isolates heavy inference workloads, improves tail latency, and enables incremental horizontal scaling.
In the local deployment, the LLM service was hosted using vLLM. A single shared NVIDIA A100 GPU supported 10 concurrent users under interactive workloads.
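Because vLLM exposes an OpenAI-compatible API, the RAG service can query the local model with the standard OpenAI client. The endpoint URL and prompts below are illustrative assumptions about the deployment, not the exact production values:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server (no external traffic).
client = OpenAI(base_url="http://llm-node:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system",
         "content": "Answer only from the provided context and cite your sources."},
        {"role": "user",
         "content": "Context:\n<retrieved chunks>\n\nQuestion: What is the B0 cold Vmin?"},
    ],
    temperature=0.1,
)
print(completion.choices[0].message.content)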
Figure 7. Deployment pattern: multiple inference services behind NGINX load balancing.
Once a private RAG system is working, the next frontier is reliability at scale: better evaluation sets, stronger permission-aware ranking, and more helpful workflows (tickets, summaries, "show me the table", etc.).
Figure 8. Roadmap sketch: scaling, user management, agentic retrieval, and full RAG evaluation.