CL · Jun 12, 2025

Assessing RAG and HyDE on 1B vs. 4B-Parameter Gemma LLMs for Personal Assistants Integration

arXiv:2506.21568v1 · 1 citation

Originality: Synthesis-oriented

AI Analysis

This work addresses resource efficiency for deploying LLMs in edge and privacy-sensitive personal assistant applications, but is incremental as it compares existing augmentation methods on new model scales.

This study evaluated Retrieval-Augmented Generation (RAG) and Hypothetical Document Embeddings (HyDE) on 1B and 4B-parameter Gemma LLMs for privacy-first personal assistants. RAG reduced latency by up to 17% and eliminated factual hallucinations, while HyDE improved semantic relevance but increased response time by 25-40% and introduced a non-negligible hallucination rate in personal-data retrieval.

Resource efficiency is a critical barrier to deploying large language models (LLMs) in edge and privacy-sensitive applications. This study evaluates the efficacy of two augmentation strategies, Retrieval-Augmented Generation (RAG) and Hypothetical Document Embeddings (HyDE), on compact Gemma LLMs of 1 billion and 4 billion parameters, within the context of a privacy-first personal assistant. We implement short-term memory via MongoDB and long-term semantic storage via Qdrant, orchestrated through FastAPI and LangChain, and expose the system through a React.js frontend. Across both model scales, RAG consistently reduces latency by up to 17% and eliminates factual hallucinations when responding to user-specific and domain-specific queries. HyDE, by contrast, enhances semantic relevance, particularly for complex physics prompts, but incurs a 25-40% increase in response time and a non-negligible hallucination rate in personal-data retrieval. Comparing 1B to 4B models, we observe that scaling yields marginal throughput gains for baseline and RAG pipelines, but magnifies HyDE's computational overhead and variability. Our findings position RAG as the pragmatic choice for on-device personal assistants powered by small-scale LLMs.
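
To make the contrast between the two strategies concrete, the following is a minimal sketch of how their retrieval steps differ. It is not the paper's implementation. The bag-of-words `embed` function stands in for a real embedding model, the in-memory `DOCS` list stands in for a vector store such as Qdrant, and `fake_draft_answer` is a hypothetical stub in place of a Gemma generation call. All of these names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(search_text, docs, k=1):
    """Return the k documents most similar to search_text."""
    query_vec = embed(search_text)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def rag_retrieve(question, docs, k=1):
    """RAG-style retrieval: embed the user's question directly."""
    return retrieve(question, docs, k)


def hyde_retrieve(question, docs, draft_answer, k=1):
    """HyDE-style retrieval: first draft a hypothetical answer, then embed that draft."""
    hypothetical_doc = draft_answer(question)  # one extra generation call before retrieval
    return retrieve(hypothetical_doc, docs, k)


def fake_draft_answer(question):
    """Hypothetical stub for an LLM call; a real system would generate this text."""
    return "Kinetic energy equals one half times mass times velocity squared."


# Hypothetical stored memories and notes, invented for this example.
DOCS = [
    "The user's dentist appointment is on Friday at 10 am.",
    "Kinetic energy of a body is one half of its mass times its velocity squared.",
    "Qdrant stores vectors for semantic search.",
]

personal_hit = rag_retrieve("When is my dentist appointment?", DOCS)
physics_hit = hyde_retrieve("How much energy does a moving object carry?", DOCS, fake_draft_answer)
```

In this sketch the HyDE path makes one additional generation call before retrieval, which is one plausible source of the added response time reported above. The drafted text is also invented by the model rather than taken from stored data, so for personal-data queries it can steer retrieval toward the wrong record.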
