## Ancient Wisdom, Modern Tools: Exploring Retrieval-Augmented LLMs for Ancient Indian Philosophy
## Priyanka Mandikal
Department of Computer Science, UT Austin mandikal@utexas.edu
Figure 1: The dissemination of knowledge through the ages. Over time, methods of storing and transmitting knowledge have evolved from oral teachings to computers and the internet, significantly increasing the distribution and proliferation of human knowledge. The emerging LLM technology represents a new paradigm shift in this process.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Flowchart: Evolution of Knowledge Dissemination Technologies
### Overview
The image depicts a horizontal flowchart illustrating the progression of knowledge dissemination technologies over time. It traces the evolution from oral traditions to modern AI-driven systems, emphasizing technological advancements and their impact on knowledge distribution.
### Components/Axes
- **Horizontal Axis**: Time (left to right progression)
- **Vertical Arrows**:
- "Technological advancements" (upward)
- "Knowledge distribution & proliferation" (upward)
- **Stages**:
1. **Oral teachings** (icon: group of people under a tree)
2. **Hand-written manuscripts** (icon: scroll)
3. **Printing press** (icon: mechanical press)
4. **Computers and the Internet** (icon: computer with globe)
5. **LLMs** (icon: interconnected nodes)
### Detailed Analysis
- **Flow Direction**: All stages are connected by rightward arrows, indicating chronological progression.
- **Technological Advancements**: Vertical upward arrows link each stage to the next, emphasizing incremental improvements in technology.
- **Knowledge Distribution**: Vertical upward arrows also highlight increasing accessibility and reach of knowledge at each stage.
- **Icons/Text**: Each stage includes a minimalist icon and label, with no additional textual data or numerical values.
### Key Observations
- **Linear Progression**: Knowledge dissemination technologies evolve in a clear, sequential manner.
- **Dual Axes**: The diagram visually reinforces that technological progress and knowledge proliferation are interdependent.
- **Modern Focus**: The final stage (LLMs) is represented by an abstract network diagram, suggesting complexity and interconnectedness.
### Interpretation
This flowchart demonstrates how each technological leap (from oral traditions to LLMs) has exponentially increased the speed, scale, and accessibility of knowledge sharing. The upward vertical arrows imply that advancements in technology directly correlate with broader societal access to information. The transition from physical (scrolls, printing presses) to digital (computers, internet) and finally to AI-driven systems (LLMs) reflects a shift from localized, manual processes to global, automated networks. The absence of quantitative data suggests the focus is on conceptual rather than statistical evolution.
</details>
## Abstract
LLMs have revolutionized the landscape of information retrieval and knowledge dissemination. However, their application in specialized areas is often hindered by factual inaccuracies and hallucinations, especially in long-tail knowledge distributions. We explore the potential of retrieval-augmented generation (RAG) models for long-form question answering (LFQA) in a specialized knowledge domain. We present VedantaNY-10M, a dataset curated from extensive public discourses on the ancient Indian philosophy of Advaita Vedanta. We develop and benchmark a RAG model against a standard, non-RAG LLM, focusing on transcription, retrieval, and generation performance. Human evaluations by computational linguists and domain experts show that the RAG model significantly outperforms the standard model in producing factual and comprehensive responses having fewer hallucinations. In addition, a keyword-based hybrid retriever that emphasizes unique low-frequency terms further improves results. Our study provides insights into effectively integrating modern large language models with ancient knowledge systems.
Proceedings of the 1st Machine Learning for Ancient Languages Workshop, Association for Computational Linguistics (ACL) 2024. The dataset, code, and evaluation are available at: https://sites.google.com/view/vedantany-10m
ॐ केनेषितं पतति प्रेषितं मनः || 1.1 || यन्मनसा न मनुते येनाहुर्मनो मतम् । तदेव ब्रह्म त्वं विद्धि नेदं यदिदमुपासते || 1.6 ||
- "By whom willed and directed, does the mind alight upon its objects?"
- "What one cannot comprehend with the mind, but by which they say the mind comprehends, know that alone to be Brahman, not this which people worship here."
- Kena Upanishad, >3000 B.C.E.
## 1 Introduction
Answer-seeking has been at the heart of human civilization. Humans have climbed mountains and crossed oceans in search of answers to the greatest questions concerning their own existence. Over time, ancient wisdom has travelled from the silent solitude of mountain caves and forest hermitages into the busy cities and plains of the world. Technology has played a major role in this transmission, significantly increasing the distribution and proliferation of human knowledge. In recent times, large language models (LLMs) trained on large swathes of the internet have emerged as de facto question-answering machines. Recent studies on the societal impact of LLMs (Malhotra, 2021; Yiu et al., 2023) highlight their growing significance as cultural technologies. Analogous to earlier technologies like writing, print, and the internet, the power of LLMs can be harnessed meaningfully to preserve and disseminate human knowledge (Fig. 1).
Generic LLMs have proven to be highly effective for broad knowledge domains. However, they often struggle in niche and less popular areas, encountering issues such as factual inaccuracies and hallucinations in long-tail knowledge distributions (Kandpal et al., 2023; Mallen et al., 2023). Moreover, their inability to verify responses against authentic sources is particularly problematic in these domains, where LLMs can generate highly inaccurate answers with unwarranted confidence (Kandpal et al., 2023; Menick et al., 2022). In response to these limitations, there has been growing interest in retrieval-augmented generation (RAG) models (Karpukhin et al., 2020; Lewis et al., 2020b; Izacard et al., 2022; Ram et al., 2023). These models integrate external datastores to retrieve relevant knowledge and incorporate it into LLMs, demonstrating higher factual accuracy and reduced hallucinations compared to conventional LLMs (Shuster et al., 2021; Borgeaud et al., 2022). Updating these external datastores with new information is also more efficient and cost-effective than retraining LLMs. In this vein, we argue that RAG models show immense potential for enhancing study in unconventional, niche knowledge domains that are often underrepresented in pre-training data. Their ability to provide verified, authentic sources when answering questions is particularly advantageous for end-users.
In this work, we develop and evaluate a RAG-based language model specialized in the ancient Indian philosophy of Advaita Vedanta (Upanishads, >3000 B.C.E.; Bhagavad Gita, 3000 B.C.E.; Shankaracharya, 700 C.E.). To ensure that the LLM has not been previously exposed to the source material, we construct VedantaNY-10M, a custom philosophy dataset comprising transcripts of over 750 hours of public discourses on YouTube from the Vedanta Society of New York. We evaluate standard non-RAG and RAG models on this domain and find that RAG models perform significantly better. However, they still encounter issues such as irrelevant retrievals, sub-optimal retrieval passage length, and retrieval-induced hallucinations. In early attempts to mitigate some of these issues, we find that traditional sparse retrievers have a unique advantage over dense retrievers in niche domains with specific terminology, Sanskrit terms in our case. Consequently, we propose a keyword-based hybrid retriever that effectively combines sparse and dense embeddings to upsample low-frequency or domain-specific terms.
We conduct an extensive evaluation comprising both automatic metrics and human evaluation by computational linguists and domain experts. The models are evaluated along three dimensions: transcription, retrieval, and generation. Our findings are twofold. First, RAG LLMs significantly outperform standard non-RAG LLMs along all axes, offering more factual, comprehensive, and specific responses while minimizing hallucinations, with an 81% preference rate. Second, the keyword-based hybrid RAG model further outperforms the standard deep-embedding based RAG model in both automatic and human evaluations. Our study also includes detailed long-form responses from the evaluators, with domain experts specifically indicating the likelihood of using such LLMs to supplement their daily studies. Our work contributes to the broader understanding of how emerging technologies can continue the legacy of knowledge preservation and dissemination in the digital age.
## 2 Related Work
Language models for ancient texts. Sommerschield et al. (2023) recently conducted a thorough survey of machine learning techniques applied to the study and restoration of ancient texts. Spanning digitization (Narang et al., 2019; Moustafa et al., 2022), restoration (Assael et al., 2022), attribution (Bogacz and Mara, 2020; Paparigopoulou et al., 2022) and representation learning (Bamman and Burns, 2020), a wide range of use cases have benefited from the application of machine learning to the study of ancient texts. Recently, Lugli et al. (2022) released a digital corpus of romanized Buddhist Sanskrit texts, training and evaluating embedding models such as BERT and GPT-2. However, the use of LLMs as a question-answering tool to enhance understanding of ancient esoteric knowledge systems has not yet been systematically studied. To the best of our knowledge, ours is the first work that studies the effects of RAG models in the niche knowledge domain of ancient Indian philosophy.
Retrieval-Augmented LMs. In current LLM research, retrieval-augmented generation models (RAGs) are gaining popularity (Izacard et al., 2022; Ram et al., 2023; Khandelwal et al., 2020; Borgeaud et al., 2022; Menick et al., 2022). A key area of development in RAGs has been their architecture. Early approaches involved finetuning the language model on open-domain question-answering before deployment. MLM approaches such as REALM (Guu et al., 2020) introduced a two-stage process combining retrieval and reading, while DPR (Karpukhin et al., 2020) focused on pipeline training for question answering. RAG (Lewis et al., 2020b) used a generative approach with no explicit language modeling. Very recently, in-context RALM (Ram et al., 2023) showed that retrieved passages can be used to augment the input to the LLM in-context, without any fine-tuning as in prior work. In this work, we adopt an in-context retrieval-augmented methodology similar to Ram et al. (2023), where neither the retriever nor the generator is fine-tuned. This also enables us to use any combination of retrieval and generation models that best suits our application.
Applications of RAGs. The applications of RAGs are diverse and evolving. ATLAS (Izacard et al., 2022) and GopherCite (Menick et al., 2022) have shown how fine-tuning and reinforcement learning from human feedback can enhance RAGs' ability to generate verifiable answers from reliable sources. Prompting techniques have also seen innovation. kNN-Prompt (Shi et al., 2022) extended kNN-LM for zero- or few-shot classification tasks, and retrieval in-context approaches (Ram et al., 2023; Shi et al., 2023) have proven effective in utilizing retrieval at the input stage. Retrieval-LMs have been shown to be particularly valuable for handling long-tail or less frequent entities (Kandpal et al., 2023; Mallen et al., 2023), updating knowledge (Izacard et al., 2022), improving parameter efficiency (Izacard et al., 2022; Mallen et al., 2023), and enhancing verifiability (Bohnet et al., 2022), making them increasingly relevant in a wide range of applications. In our work, we examine the application of RAGs to long-tail knowledge, conducting an extensive study on the niche knowledge domain of ancient Indian philosophy.
Evaluation of LFQA. The field of long-form question answering (LFQA) is an emerging area of active research (Krishna et al., 2021; Nakano et al., 2021; Xu et al., 2023). Recently, Xu et al. (2023) conducted a thorough examination of various LFQA metrics, encompassing both human and automatic evaluation methods, and found that existing automatic metrics do not always align with human preferences. Following their suggestion, we place special emphasis on conducting an extensive human evaluation utilizing the expertise of experienced computational linguists and domain experts.
Figure 2: Sanskrit terms in VedantaNY-10M. Frequently occurring Sanskrit terms in the corpus.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Word Cloud: Sanskrit Philosophical Terms
### Overview
The image is a dense word cloud composed of Sanskrit terms related to Hindu philosophy, spirituality, and yoga. Words vary in size (proportional to frequency/importance) and color, arranged in a grid-like pattern. No explicit axes, legends, or numerical data are present.
### Components/Axes
- **Text Elements**:
- Words are the sole components, with no graphical axes or legends.
- Colors appear arbitrary but may imply categorical grouping (e.g., orange for core concepts like "Brahma," green for practices like "yoga").
- Font sizes correlate with prominence (e.g., "Brahma" is largest, "sati" is smallest).
### Detailed Analysis
- **Key Terms**:
- **Brahma** (orange, largest): Central position, repeated 4x.
- **Karma** (purple), **Yoga** (pink), **Vedanta** (green): Prominent mid-sized terms.
- **Atma** (orange), **Jnana** (yellow), **Sarvam** (blue): Smaller but repeated.
- **Sati** (green), **Hari** (blue): Smallest terms, positioned peripherally.
- **Color Distribution**:
- Orange dominates (core concepts: Brahma, Atma, Jnana).
- Green (practices: Yoga, Vedanta, Sati).
- Purple (abstract concepts: Karma, Dharma).
- Blue/Yellow (secondary terms: Sarvam, Jnana, Hari).
### Key Observations
1. **Hierarchy**: "Brahma" is the most emphasized term, repeated and centrally placed.
2. **Thematic Clusters**:
- Orange cluster: Ontological concepts (Brahma, Atma).
- Green cluster: Practices (Yoga, Vedanta).
- Purple cluster: Abstract principles (Karma, Dharma).
3. **Repetition**: "Brahma" and "Jnana" appear multiple times, suggesting thematic recurrence.
### Interpretation
The word cloud emphasizes **Hindu metaphysics**, with "Brahma" (the supreme creator god) as the focal point. Repeated terms like "Jnana" (knowledge) and "Karma" (action) suggest a focus on spiritual practice and cosmic order. The absence of a legend implies color-coding is intuitive (e.g., orange for foundational concepts). The peripheral placement of smaller terms like "Sati" (ritual suicide, now obsolete) and "Hari" (Vishnu's name) indicates lower thematic priority. This visualization likely serves as a thematic map for a text analyzing Hindu philosophy, prioritizing interconnected concepts over quantitative data.
</details>
## 3 The VedantaNY-10M Dataset
We first describe our niche-domain dataset creation process. The custom dataset for our study needs to satisfy the following requirements: (1) Niche: it must be a specialized niche knowledge domain within the LLM's long-tail distribution. (2) Novel: the LLM must not have previously encountered the source material. (3) Authentic: the dataset should be authentic and representative of the knowledge domain. (4) Domain experts: experts should be available to evaluate the model's effectiveness and utility.
Knowledge domain. To satisfy the first requirement, we choose our domain to be the niche knowledge system of Advaita Vedanta, a 1300-year-old Indian school of philosophy (Shankaracharya, 700 C.E.) based on the Upanishads (>3000 B.C.E.), Bhagavad Gita (3000 B.C.E.) and Brahmasutras (3000 B.C.E.) 1 . It is a contemplative knowledge tradition that employs a host of diverse tools and techniques, including analytical reasoning, logic, linguistic paradoxes, metaphors and analogies, to enable the seeker to enquire into their real nature. Although a niche domain, this knowledge system has been continuously studied and rigorously developed over millennia, offering a rich and structured niche for the purposes of our study. Being a living tradition, it offers the additional advantage of providing experienced domain experts to evaluate the language models in this work.
Composition of the dataset. Considering the outlined criteria, we introduce VedantaNY-10M, a curated philosophy dataset of public discourses.
1 Currently there exists no consensus on accurately dating these ancient scriptures. The Upanishads (which are a part of the Vedas) have been passed on orally for millennia and are traditionally not given a historic date. However, they seem to have been compiled and systematically organized sometime around 3000 B.C.E. by Vyasa. Likewise, the time period of Adi Shankaracharya also varies, and he is usually placed between 450 B.C.E. and 700 C.E.
To maintain authenticity while ensuring that the LLM hasn't previously been exposed to the source material, we curate our dataset from a collection of YouTube videos on Advaita Vedanta, sourced from the Vedanta Society of New York. It contains 10M tokens and encompasses over 750 hours of philosophical discourses by Swami Sarvapriyananda, a learned monk of the Ramakrishna Order. These discourses provide a rich and comprehensive exposition of the principles of Advaita Vedanta, making them an invaluable resource for our research.
Languages and scripts. The dataset primarily features content in English, accounting for approximately 97% of the total material. Sanskrit, the classical language of Indian philosophical literature, constitutes around 3% of the dataset. The Sanskrit terms are transliterated into the Roman script. To accommodate the linguistic diversity and the specific needs of the study, the dataset includes words in both English and Sanskrit, without substituting the Sanskrit terms with any English translations. Translating ancient Sanskrit technical terms having considerably nuanced definitions into English is a non-trivial problem (Malhotra and Babaji, 2020). Hence, our dual-language approach ensures that the Sanskrit terms and concepts are accurately represented and accessible, thereby enhancing the authenticity of our research material. Frequently occurring Sanskrit terms in the corpus are shown in Fig. 2. For excerpts from passages, please refer to Appendix Table 2.
## 4 In-context RAG for niche domains
We now discuss the methodology adopted to build an in-context retrieval augmented chatbot from the custom dataset described above.
We first define a generic chatbot C_g that does not use retrieval as C_g: q → a_g, where q is the user query and a_g is the answer generated by the chatbot. Now, let D_t represent the textual data corpus from our knowledge domain and R be the retriever. Our goal is to build a retrieval-augmented generation chatbot C_r: q × R(D_t, q) → a_r that generates answer a_r for the query by retrieving relevant context from D_t using R. An overview of our approach is illustrated in Fig. 3. We first build D_t from 765 hours of public discourses on Advaita Vedanta introduced in Sec. 3. When deployed, the system processes q by first using retriever R to identify the top-k most relevant passages P from D_t using a similarity metric. Subsequently, a large language model (LLM) is prompted with both the query and the retrieved passages in-context, following Ram et al. (2023), to generate a contextually relevant response.
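The retrieve-then-prompt mapping C_r can be sketched as follows. This is a minimal illustration with toy random embeddings: in the actual system the vectors would come from an embedding model and the prompt would be sent to an LLM, and the function names here are ours, not the paper's.

```python
import numpy as np

def retrieve(query_vec, passage_vecs, passages, k=3):
    """Return the top-k passages ranked by cosine similarity sim(q, z)."""
    q = query_vec / np.linalg.norm(query_vec)
    Z = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = Z @ q                   # cosine similarity of q with every passage
    top = np.argsort(-scores)[:k]    # argTop-k over the datastore
    return [passages[i] for i in top]

def build_prompt(query, retrieved):
    """Place the retrieved passages in-context ahead of the query."""
    context = "\n\n".join(f"Passage {i+1}: {p}" for i, p in enumerate(retrieved))
    return (f"Answer the question using only the passages below.\n\n"
            f"{context}\n\nQuestion: {query}\nAnswer:")

# Toy run: 4 passages with random 8-d "embeddings"; querying with a passage's
# own vector should retrieve that passage first (self-similarity is 1).
rng = np.random.default_rng(0)
passages = ["P1", "P2", "P3", "P4"]
vecs = rng.normal(size=(4, 8))
top = retrieve(vecs[2], vecs, passages, k=2)
prompt = build_prompt("What is Brahman?", top)
```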
We now describe each of the components in detail. We follow a four-stage process as follows:
Transcription. We first need to create a dense textual corpus targeted at our niche domain. Since our dataset consists of YouTube videos, we first employ a transcription model to transcribe the audio into text. Our video corpus D v consists of 612 videos totaling 765 hours of content, with an average length of 1.25 hours per video. We extract audio content from D v and transcribe it using OpenAI's Whisper large-v2 model (Radford et al., 2023). This step converts the spoken discourses into a transcribed textual corpus D t consisting of 10M tokens in total. Since Whisper is a multi-lingual model, it has the capacity to support the dual-language nature of our dataset. We evaluate the transcription quality of Whisper in Sec. C.1.
Datastore creation. The transcribed text in D_t is then segmented into shorter chunks called passages P, consisting of 1500 characters each. These chunks are then processed by a deep embedder to produce dense embedding vectors z_dense. These embedded chunks are stored in a vector database D_z. Ultimately, we store approximately 25,000 passage embeddings z ∈ D_z, each representing a discrete chunk of the philosophical discourse in D_t.
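The chunking step can be sketched as a simple fixed-size splitter. This is a minimal sketch: the paper does not specify how chunk boundaries are chosen, so backing off to the nearest whitespace (to avoid cutting words) is our assumption.

```python
def chunk_text(text, size=1500):
    """Split a transcript into ~size-character passages, breaking at whitespace."""
    passages, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        # back off to the last whitespace so words are not cut mid-token
        if end < len(text):
            ws = text.rfind(" ", start, end)
            if ws > start:
                end = ws
        passages.append(text[start:end].strip())
        start = end
    return [p for p in passages if p]

passages = chunk_text("word " * 1000, size=1500)
```

Each passage would then be embedded and stored with its index in the vector database.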
Retrieval. To perform retrieval-augmented generation, we first need to build a retrieval system R: D_z × q → P that retrieves contextually relevant textual passages P from D_t given D_z and q. The retriever performs the operation P = D_t[argTop-k_{z ∈ D_z} sim(q, z)], where we use cosine similarity as the similarity metric. Standard RAG models employ state-of-the-art deep embedders to encode documents and retrieve them during inference. However, these semantic embeddings can struggle to disambiguate specific niche terminology in custom domains (Mandikal and Mooney, 2024). This can be particularly problematic in datasets with long-tail distributions such as ours. In addition, retrieved fixed-length passages are sub-optimal: short, incomplete contexts can be particularly damaging for LFQA, while longer contexts can contain unnecessary information that confuses the generation model. To mitigate these two issues, we experiment with two key changes: (1) a keyword-based
Figure 3: Overview of the RAG model. We present VedantaNY-10M, a dataset derived from over 750 hours of public discourses on the ancient Indian philosophy of Advaita Vedanta, and build a retrieval-augmented generation (RAG) chatbot for this knowledge domain. At deployment, given a query q , the retriever R first retrieves the top-k most relevant passages P from the datastore using a hybrid keyword-based retriever. It then refines this retrieved context using a keyword-based context reshaper to adjust the passage length. Finally, an LLM is prompted with the query and the refined passages in-context. We conduct an extensive evaluation with computational linguists and domain experts to assess the model's real-world utility and identify challenges.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Diagram: Knowledge Retrieval and Chatbot System Architecture
### Overview
The diagram illustrates a three-stage system for processing queries using a knowledge base of Swami Sarvapriyananda's teachings. The flow progresses from a Datastore (D) containing multimedia content, through a Retriever (R) that processes passages, to a Chatbot (C) that generates answers using an LLM (Large Language Model).
### Components/Axes
1. **Datastore (D)**:
- Contains 12 multimedia entries featuring Swami Sarvapriyananda
- Titles include:
- "HANDUVYA UPANISHAD CHAPTER 2 KARMA-33"
- "VEDANTASARA OF SANATANDA"
- "SUNYAM THE VOID"
- "PURNAM THE FULL"
- "NO MIND"
- Visual elements: Orange/yellow color scheme, monk imagery, Sanskrit text
2. **Retriever (R)**:
- Hybrid Retrieval: Deep + Sparse
- Keyword Extraction
- Context Refiner
- Processes passages (Passage 1 to Passage k)
3. **Chatbot (C)**:
- Query (q) → Keyword Extraction → Prompt (p)
- LLM (Large Language Model)
- Answer (a)
### Flow Direction
- Left-to-right progression: Datastore → Retriever → Chatbot
- Arrows indicate data flow between components
- Vertical alignment of passages in Retriever suggests sequential processing
### Key Observations
1. The system integrates multimedia content with text-based retrieval
2. Hybrid retrieval combines deep learning and sparse methods
3. Context refinement precedes LLM processing
4. All passages originate from Swami Sarvapriyananda's teachings
5. No numerical data or quantitative metrics are shown
### Interpretation
This architecture demonstrates a RAG (Retrieval-Augmented Generation) system specialized for Swami Sarvapriyananda's teachings. The hybrid retrieval approach suggests optimization for both semantic understanding (deep) and keyword matching (sparse). The context refiner likely improves passage relevance before LLM processing, while the multimedia datastore provides rich source material. The absence of quantitative metrics implies this is a conceptual diagram rather than a performance benchmark.
The system's design prioritizes:
1. Content specificity (Swami's teachings)
2. Multimodal input (text + images)
3. Semantic understanding (hybrid retrieval)
4. Context-aware generation (LLM with refined prompts)
</details>
hybrid retriever to focus on unique low-frequency words, and (2) a context-refiner to meaningfully shorten or expand retrieved context.
1. Keyword-based retrieval. To emphasize the importance of key terminology, we first employ keyword extraction and named-entity recognition techniques on the query q to extract important keywords κ . During retrieval, we advocate for a hybrid model combining both deep embeddings as well as sparse vector space embeddings. We encode the full query in the deep embedder and assign a higher importance to keyphrases in the sparse embedder. The idea is to have the sparse model retrieve domainspecific specialized terms that might otherwise be missed by the deep model. Our hybrid model uses a simple weighted combination of the query-document similarities in the sparse and dense embedding spaces. Specifically, we score a document D for query q and keywords κ using the ranking function:
s(D; q, κ) = λ · Sim(z_d(q), z_d(D)) + (1 − λ) · Sim(z_s(κ), z_s(D))
where z_d and z_s denote the dense and sparse embedding functions, and Sim is cosine similarity measuring the angle between such vector embeddings. In our experiments, we set λ = 0.2. Amongst the top-n retrieved passages, we choose the k passages containing the maximum number of unique keywords.
2. Keyword-based context refinement. Furthermore, we refine the retrieved passages by leveraging the extracted keywords in a heuristic-based refinement operation that produces P′ = Ref(P, κ). We first expand each selected passage to include one preceding and one succeeding passage, then locate the first and last occurrences of the extracted keywords in the expanded text and trim it from the first occurrence to the last. This can either expand or shorten the original passage depending on the placement of keywords, ensuring that the retrieved context contains the information relevant for the generation model.
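The two steps above can be sketched as follows. This is a toy illustration, not the paper's implementation: the embedding vectors are stand-ins for the actual dense and sparse encoders, we assume λ weights the dense term so that the keyword-driven sparse match dominates when λ = 0.2, and joining neighboring passages with a space is our simplification.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_score(q_dense, d_dense, kw_sparse, d_sparse, lam=0.2):
    # Weighted combination of dense and sparse similarities; with lam = 0.2,
    # most of the weight falls on the keyword-driven sparse match.
    return lam * cosine(q_dense, d_dense) + (1 - lam) * cosine(kw_sparse, d_sparse)

def refine_context(passages, idx, keywords):
    """Expand passage idx with its neighbors, then trim from the first to the
    last keyword occurrence (keyword-based context refinement)."""
    lo, hi = max(0, idx - 1), min(len(passages), idx + 2)
    text = " ".join(passages[lo:hi])
    low = text.lower()
    starts = [low.find(k.lower()) for k in keywords if k.lower() in low]
    ends = [low.rfind(k.lower()) + len(k) for k in keywords if k.lower() in low]
    if not starts:
        return passages[idx]   # no keyword found: keep the original passage
    return text[min(starts):max(ends)]
```

Depending on where the keywords fall in the neighboring passages, the refined context can be longer or shorter than the original fixed-length chunk.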
Generation. For answer generation, we construct a prompt p from the query q and the retrieved passages (P′_1, P′_2, ..., P′_k) in context. Finally, we invoke the chatbot C_r to synthesize an answer a_r from the constructed prompt. For an example of the constructed RAG bot prompt, please refer to Fig. 5. This four-stage process produces a retrieval-augmented chatbot that can generate contextually relevant responses for queries in our niche domain.
Implementation Details. For embedding and generation, we experiment with both closed and open-source language models. For the RAG vs. non-RAG comparison, we use OpenAI's text-embedding-ada-002 model (Brown et al.) as the embedder and GPT-4-turbo (OpenAI, 2023) as the LLM for both C_r and C_g. For comparing RAG model variants, we use the open-source nomic-embed-text-v1 (Nussbaum et al., 2024) as our deep embedder and Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024) as our generation model. For keyword extraction, we use an ensemble of different models including OpenKP (Xiong et al., 2019), KeyBERT (Grootendorst, 2020) and SpanMarker (Aarsen, 2020). We experimented with using language models such as ChatGPT for keyword extraction, but the results were very poor, as also corroborated in Song et al. (2024). For further implementation details of the evaluation metrics, see Appendix Sec. A. The VedantaNY-10M dataset, code, and evaluation are publicly available at https://github.com/priyankamandikal/vedantany-10m.
## 5 Evaluation
We now evaluate the model along two axes: automatic evaluation metrics and a human evaluation survey. To ensure a broad and comprehensive evaluation, we categorize the questions into five distinct types, each designed to test different aspects of the model's capabilities:
1. Anecdotal: Generate responses based on stories and anecdotes narrated by the speaker in the discourses.
2. Comparative: Analyze and compare different concepts, philosophies, or texts. This category tests the model's analytical skills and its ability to draw parallels and distinctions.
3. Reasoning: Require logical reasoning, critical thinking, and the application of principles to new scenarios.
4. Scriptural: Test the model's ability to reference, interpret, and explain passages from religious or philosophical texts.
5. Terminology: Probe the model's understanding of specific technical terms and concepts.
For a sample set of questions across the above five categories, please refer to Appendix Table 4.
## 5.1 Automatic Evaluation
Inspired by Xu et al. (2023), we conduct an extensive automatic evaluation of the two RAG models on our evaluation set. We describe each metric type below and provide implementation details in Appendix Sec. A. Due to the lack of gold answers, we are unable to report reference-based metrics.
Answer-only metrics: We assess features like fluency and coherence by analyzing responses with specific metrics: (1) Self-BLEU (Zhu et al., 2018) for text diversity, where higher scores suggest less diversity, applied in open-ended text generation; (2) GPT-2 perplexity for textual fluency, used in prior studies on constrained generation. We also consider (3) Word and (4) Sentence counts as length-based metrics, owing to their significant influence on human preferences (Sun et al., 2019; Liu et al., 2022; Xu et al., 2023).
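Of these, Self-BLEU is straightforward to illustrate: each generated answer is scored with BLEU against the remaining answers, and the scores are averaged, so higher values indicate less diverse generations. Below is a simplified pure-Python sketch (bigram precision and a shortest-reference brevity penalty); the actual metric of Zhu et al. (2018) uses the standard BLEU implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=2):
    """Modified n-gram precision up to max_n with a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0
        # clip each n-gram count by its maximum count in any single reference
        max_ref = Counter()
        for r in refs:
            for g, c in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0:
        return 0.0
    ref_len = min(len(r) for r in refs)  # shortest reference, a simplification
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def self_bleu(texts, max_n=2):
    """Average BLEU of each text against all the others; higher = less diverse."""
    return sum(bleu(t, texts[:i] + texts[i + 1:], max_n)
               for i, t in enumerate(texts)) / len(texts)
```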
(Question, answer) metric: To ensure answers are relevant to the posed questions, we model p ( q | a ) for ranking responses with RankGen (Krishna et al., 2022). Leveraging the T5-XXL architecture, this encoder is specially trained via contrastive learning to evaluate model generations based on their relevance to a given prefix, in this context, the question. A higher RankGen score indicates a stronger alignment between the question and the answer, serving as a measure of relevance.
(Answer, evidence) metric: A key challenge in LFQA is assessing answer correctness without dedicated factuality metrics, akin to summarization's faithfulness. We apply QAFactEval (Fabbri et al., 2022), originally for summarization, to LFQA by considering the answer as a summary and evidence documents as the source. Answers deviating from source content, through hallucinations or external knowledge, will score lower on this metric.
## 5.2 Human Evaluation
We have three experienced domain experts evaluate the models across the five categories. Each of these experts is closely associated with Vedanta Society of New York, and has extensively studied the philosophy in question for up to a decade on average, being well-versed with domain-specific terminology and conceptual analysis. We conduct the human survey along two dimensions: retrieval and generation. For retrieval, we evaluate relevance and completeness, and for generation we evaluate factual correctness and completeness. In addition, we ask the reviewers to provide free-form justifi-
Figure 4: Human evaluation: RAG vs non-RAG. Both transcription and retrieval performance receive high scores from the evaluators. For generation, the RAG model outperforms the generic model across various metrics, particularly in factuality, completeness and specificity, while being marginally lower in ease of understanding.
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Composite Dashboard: RAG vs Non-RAG Performance Analysis
### Overview
The image presents a comparative analysis of RAG (Retrieval-Augmented Generation) and Non-RAG systems across multiple performance dimensions. It combines pie charts, bar graphs, and segmented circular diagrams to visualize transcription quality, retrieval relevance, generation capabilities, hallucination rates, and external knowledge utilization.
### Components/Axes
1. **Transcription Section**
- Pie chart labeled "Transcription Quality"
- Single value: 4.48 (green segment)
- No explicit axis markers
2. **Retrieval Section**
- Pie chart labeled "Relevance"
- Single value: 4.34 (green segment)
- No explicit axis markers
3. **Generation Section**
- Bar chart comparing Non-RAG (red) and RAG (green)
- Categories: Factuality, Completeness, Specificity, Ease of understanding, Faithfulness
- Y-axis: 0.00 to 5.00 in 1.00 increments
- Legend: Red = Non-RAG, Green = RAG
4. **Hallucinations Section**
- Circular diagram with three segments:
- Green: No hallucinations
- Yellow: Not sure
- Red: Yes hallucinations
- Two versions: Non-RAG and RAG
5. **Outside Knowledge Section**
- Circular diagram with three segments:
- Green: No
- Yellow: Not sure
- Red: Yes
- Single version labeled "RAG"
### Detailed Analysis
**Transcription Quality**
- RAG system achieves 4.48/5.00 transcription quality
**Retrieval Relevance**
- RAG system achieves 4.34/5.00 relevance score
**Generation Performance**
| Category | Non-RAG | RAG |
|------------------------|---------|-------|
| Factuality | ~3.1 | ~4.0 |
| Completeness | ~3.6 | ~4.2 |
| Specificity | ~3.3 | ~4.3 |
| Ease of understanding | ~4.2 | ~4.1 |
| Faithfulness | - | ~3.9 |
**Hallucination Rates**
- Non-RAG:
- No: 40%
- Not sure: 30%
- Yes: 30%
- RAG:
- No: 47.5%
- Not sure: 25%
- Yes: 27.5%
**Outside Knowledge Utilization**
- RAG system:
- Yes: 43.6%
- Not sure: 18.5%
- No: 38.5%
### Key Observations
1. The RAG model outperforms the non-RAG model on all generation metrics except Ease of understanding, where the two are comparable (~4.2 vs ~4.1)
2. Hallucination rates decrease modestly with RAG (30% Yes → 27.5% Yes)
3. Evaluators judge 43.6% of RAG answers to draw on outside knowledge, with 38.5% judged not to
4. Faithfulness is evaluated only for RAG, since it measures fidelity to retrieved passages
5. Non-RAG answers draw more "Not sure" hallucination judgments (30% vs RAG's 25%)
### Interpretation
The data suggests RAG systems demonstrate superior performance across multiple dimensions:
- **Accuracy**: Higher transcription quality and generation scores indicate better factual accuracy
- **Reliability**: Lower hallucination rates (27.5% vs 30%) suggest more trustworthy outputs
- **Knowledge Integration**: Higher confidence in external knowledge utilization (43.6% Yes)
- **Completeness**: Better performance in generating comprehensive responses
Notably, Faithfulness is not reported for the non-RAG model because the metric measures fidelity to retrieved passages, which the non-RAG model does not use. The consistent green dominance in the pie charts across sections visually reinforces RAG's stronger performance profile.
</details>
cation for their choices, which proves to be very useful in analyzing the two models.
Relevance: Defined as the relevance of the retrieved passages to the user query, this metric is scored on a scale from 1 to 5 (1 = Not at all relevant, 5 = Extremely relevant).
Correctness: Factual accuracy of the generated answer (1 = Inaccurate, 5 = No inaccuracies).
Completeness: Measures whether the retrieved passage and generated answer comprehensively cover all parts of the query (1 = Not at all comprehensive, misses crucial points; 5 = Very comprehensive).
## 5.3 Results: RAG vs Non-RAG
We first conduct a human evaluation survey with 5 computational linguists and 3 domain experts comparing the RAG and non-RAG models. To evaluate the generation capabilities of our models, we consider five metrics: factuality, completeness, specificity, ease of understanding, and faithfulness. Fig. 4 compares the RAG model against a baseline non-RAG model across these dimensions. The RAG model substantially outperforms the non-RAG model across various metrics, particularly in factuality, completeness, and specificity, while being marginally lower in ease of understanding. Sample responses are shown in Figs. 7-11.
## 5.4 Results: Standard RAG vs Keyword-based RAG
We report results in Table 1. The keyword-based RAG model shows strong improvement across all automatic metrics while significantly outperforming the standard model in the human evaluation. Among the answer-only metrics, the keyword model tends to produce longer, more comprehensive answers (higher word and sentence counts) that are also more fluent (lower perplexity). The question-answer RankGen metric (Krishna et al., 2022) scores an answer's relevance to the question; the keyword model's higher score indicates more relevant answers. Most notably, the keyword model does very well on QAFactEval (Fabbri et al., 2022), which evaluates faithfulness by comparing the summary (in our case, the answer) against the evidence (the retrieved passages). A higher score indicates greater faithfulness of the answer to the retrieved passages, i.e., fewer hallucinations and less reliance on outside knowledge.
For the human evaluation in Table 1, we report scores normalized between 0 to 1. A relevance rating of 0.82 for keyword-based RAG vs 0.59 for standard RAG indicates a strong alignment between the retrieved content and the users' queries for our model, demonstrating the efficacy of the retrieval process. Conversely, the standard model
Table 1: Automatic and human evaluation: standard RAG (M1) vs keyword-based RAG (M2). We report both automatic and human evaluation metrics calculated on 25 {question, answer, retrievals} triplets across 5 question categories. The keyword-based RAG model shows strong improvement across all automatic metrics while significantly outperforming the standard model in the human evaluation.
| Metric (M1 / M2) | Mean | Anecdotal | Comparative | Reasoning | Scriptural | Terminology |
|------------------------|-------------|-------------|-------------|-------------|-------------|-------------|
| **Automatic metrics** | | | | | | |
| *Answer-only* | | | | | | |
| GPT2-PPL ↓ | 16.6 / 15.3 | 16.6 / 16.6 | 16.9 / 15.7 | 13.9 / 11.9 | 14.2 / 14.7 | 21.5 / 17.7 |
| Self-BLEU ↓ | 0.12 / 0.13 | 0.11 / 0.05 | 0.10 / 0.06 | 0.15 / 0.27 | 0.13 / 0.16 | 0.09 / 0.14 |
| # Words ↑ | 196 / 227 | 189 / 189 | 174 / 206 | 218 / 282 | 225 / 243 | 216 / 261 |
| # Sentences ↑ | 9.0 / 10.1 | 8.2 / 7.6 | 7.8 / 9.4 | 9.6 / 11.8 | 10.0 / 10.6 | 9.4 / 11.0 |
| *(Question, answer)* | | | | | | |
| RankGen ↑ | 0.46 / 0.48 | 0.42 / 0.52 | 0.44 / 0.47 | 0.41 / 0.43 | 0.51 / 0.52 | 0.52 / 0.46 |
| *(Answer, retrievals)* | | | | | | |
| QAFactEval ↑ | 1.36 / 1.60 | 1.01 / 1.14 | 1.53 / 1.94 | 1.18 / 1.61 | 1.52 / 1.36 | 1.56 / 1.95 |
| **Human evaluation** | | | | | | |
| *Retrieval* | | | | | | |
| Relevance ↑ | 0.59 / 0.82 | 0.41 / 0.88 | 0.79 / 0.85 | 0.73 / 0.83 | 0.48 / 0.73 | 0.55 / 0.81 |
| Completeness ↑ | 0.52 / 0.79 | 0.41 / 0.86 | 0.72 / 0.79 | 0.57 / 0.83 | 0.37 / 0.68 | 0.52 / 0.79 |
| *Answer* | | | | | | |
| Correctness ↑ | 0.61 / 0.86 | 0.40 / 0.89 | 0.81 / 0.88 | 0.71 / 0.85 | 0.52 / 0.81 | 0.63 / 0.89 |
| Completeness ↑ | 0.58 / 0.85 | 0.42 / 0.92 | 0.80 / 0.85 | 0.72 / 0.81 | 0.49 / 0.77 | 0.63 / 0.91 |
sometimes fails to disambiguate unique terminology and retrieves incorrect passages (see Fig. 12). In assessing the accuracy of the generated answer, the keyword-based RAG model significantly outperforms the standard model, indicating better alignment with verifiable facts. Fig. 13 shows a factually inaccurate response from the generic model. The keyword model achieves higher completeness scores for both the retrieval and generation. Sample responses are shown in Figs. 12-16.
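For reference, the human ratings in Table 1 are reported on a 0-1 scale. A minimal sketch of this normalization, assuming standard min-max scaling of the 1-5 survey ratings (the exact formula is an assumption, not stated in the text):

```python
def normalize_rating(score: float, lo: float = 1.0, hi: float = 5.0) -> float:
    """Min-max normalize a Likert rating from [lo, hi] to [0, 1]."""
    return (score - lo) / (hi - lo)

# Under this convention, a mean rating of 4.28 on the 1-5 scale maps to 0.82.
print(round(normalize_rating(4.28), 2))  # → 0.82
```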
## 6 Challenges
The evaluation in Sec. 5 shows that the RAG model provides responses that are not only more aligned with the source material but are also more comprehensive, specific, and user-friendly compared to the responses generated by the generic language model. In this section, we discuss the challenges we encountered while building the retrieval-augmented chatbot for the niche knowledge domain of ancient Indian philosophy introduced in this work.
Transcription. Our requirement of using a niche data domain with long-tail knowledge precludes source material that the LLM has previously been exposed to. To ensure this, we construct a textual corpus derived from automated transcripts of YouTube discourses. These transcripts can contain errors such as missing punctuation, incorrect transcriptions, and transliterations of Sanskrit terms. A sample of such errors is shown in Appendix Table 3. A proofreading mechanism and/or improved transcription models can alleviate these issues to a large extent.
Spoken vs written language. Unlike traditional textual corpora that are compiled from written sources, our dataset is derived from spoken discourses. Spoken language is often more verbose and less structured than written text, with the speaker frequently jumping between concepts mid-sentence. This unstructured nature of the text can be unfamiliar for a language model trained extensively on written text. A peculiar failure case arising from this issue is shown in Appendix Fig. 6. This can be addressed by converting the spoken text into a more structured prose format with the help of well-crafted prompts to LLMs, followed by human proofreading.
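One way to implement the restructuring step is to wrap each transcript chunk in a conversion prompt before indexing. The template below is purely illustrative; the paper does not specify such a prompt, so its wording is an assumption:

```python
# Hypothetical prompt template for converting spoken transcripts into
# structured written prose prior to indexing; wording is illustrative.
RESTRUCTURE_PROMPT = (
    "The following text is an automatic transcript of a spoken lecture. "
    "Rewrite it as well-structured written prose: fix punctuation, merge "
    "fragmented sentences, and keep all Sanskrit terms exactly as they "
    "appear. Do not add or remove content.\n\nTranscript:\n{transcript}"
)

def build_restructure_prompt(transcript: str) -> str:
    """Fill the template with a raw transcript chunk."""
    return RESTRUCTURE_PROMPT.format(transcript=transcript)
```

The LLM's output would then be human-proofread before entering the corpus, as suggested above.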
Context length. The passages retrieved in the standard model are of a fixed length and can sometimes be too short for many queries, especially for long-form answering. For instance, the retrieved passage may include a snippet from the middle of the full context, leading to incomplete or incoherent chatbot responses (Fig. 11). This prompted us to employ a keyword-based context-expansion mechanism to provide a more complete context. While this results in much better answer generation, the retrieved passage may contain too much information, making it difficult for the generator to reason effectively. Moreover, the increase in the number of tokens increases processing time. Future work can explore more advanced retrieval models capable of handling longer contexts and summarizing them effectively before input to the LLM.
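As an illustration, neighbor-based context expansion can be sketched as follows. This is a simplified stand-in for the keyword-based expansion mechanism described above; the fixed window size and pre-chunked transcript are assumptions:

```python
def expand_context(chunks: list[str], hit_idx: int, window: int = 2) -> str:
    """Return the retrieved chunk plus up to `window` neighboring chunks on
    each side, clipped at transcript boundaries, joined into one passage."""
    lo = max(0, hit_idx - window)
    hi = min(len(chunks), hit_idx + window + 1)
    return " ".join(chunks[lo:hi])

# The chunk at index 3 is expanded with its neighbors for fuller context.
chunks = [f"chunk{i}" for i in range(6)]
print(expand_context(chunks, 3, window=1))  # → "chunk2 chunk3 chunk4"
```

A larger window yields more complete context at the cost of more tokens for the generator to process, which is exactly the trade-off noted above.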
Retrieval-induced hallucinations. There are scenarios when the RAG model latches onto a particular word or phrase in the retrieved passage and hallucinates a response that is not only irrelevant but also factually incorrect. A sample of such a hallucination is in Fig. 10. This is a more challenging problem to address. However, retrieval models that can extract the full context, summarize it and remove irrelevant information should be capable of mitigating this issue to a reasonable extent.
## 7 Conclusion
In this work, we integrate modern retrieval-augmented large language models with the ancient Indian philosophy of Advaita Vedanta. Toward this end, we present VedantaNY-10M, a large dataset curated from automatic transcriptions of extensive philosophical discourses on YouTube. Validating these models along various axes using both automatic and human evaluation provides two key insights. First, RAG models significantly outperform non-RAG models, with domain experts expressing a strong preference for using such RAG models to supplement their daily studies. Second, the keyword-based hybrid RAG model demonstrates the merits of integrating classical and contemporary deep learning techniques for retrieval in niche and specialized domains. While much work remains, our study underscores the potential of integrating modern machine learning techniques to unravel ancient knowledge systems.
## Limitations and Future Work
While our study demonstrates the utility of integrating retrieval-augmented LLMs with ancient knowledge systems, there are limitations and scope for future work. First, this study focuses on a single niche domain of Advaita Vedanta as taught by one teacher. Expanding this study to include other ancient philosophical systems, such as the Vedantic schools of Vishishtadvaita and Dvaita, as well as various Buddhist and Jain traditions, would be a valuable extension. Second, incorporating primary scriptural sources, in addition to spoken discourses, would enhance the authenticity of the RAG model's outputs. Third, while we only experiment with RAG models in this study, fine-tuning the language models themselves on philosophy datasets is an interesting future direction. Fourth, the context refiner is currently heuristic-based and may not generalize well to all scenarios. Replacing it with a trained refiner using abstractive or extractive summarization techniques would considerably improve its utility and efficiency. Fifth, expanding the evaluation set and involving more subjects would considerably strengthen the study's robustness. Finally, while the language models in this work operate primarily in English and Latin script, building native LLMs capable of functioning in the original Sanskrit language of the scriptures using Devanagari script is essential future work.
## Acknowledgments
The author would like to thank Prof. Kyle Mahowald for his insightful course on form and functionality in LLMs, which guided the evaluation of the language models presented in this paper. Fangyuan Xu provided valuable information on automatic metrics for LFQA evaluation. The author extends their gratitude to all the human evaluators who took the survey and provided valuable feedback, with special thanks to Dr. Anandhi who coordinated the effort among domain experts. Finally, the author expresses deep gratitude to the Vedanta Society of New York and Swami Sarvapriyananda for the 750+ hours of public lectures that served as the dataset for this project.
## Ethics Statement
All data used in this project has been acquired from public lectures on YouTube delivered by Swami Sarvapriyananda of the Vedanta Society of New York. While our study explores integrating ancient knowledge systems with modern machine learning techniques, we recognize their inherent limitations. These knowledge traditions have always emphasized the importance of the teacher in transmitting knowledge. We do not see LLMs as replacements for monks and teachers of these ancient traditions, but only as tools to supplement analysis and study. Moreover, users of these tools need to be made well aware that these models can and do make errors, and should therefore seek guidance from qualified teachers to carefully progress on the path.
## References
- Tom Aarsen. 2020. SpanMarker for Named Entity Recognition.
- Yannis Assael, Thea Sommerschield, Brendan Shillingford, Mahyar Bordbar, John Pavlopoulos, Marita Chatzipanagiotou, Ion Androutsopoulos, Jonathan Prag, and Nando de Freitas. 2022. Restoring and attributing ancient texts using deep neural networks. Nature , 603.
- David Bamman and Patrick J Burns. 2020. Latin bert: A contextual language model for classical philology. arXiv preprint arXiv:2009.10053 .
- Bhagavad Gita. 3000 B.C.E. The Bhagavad Gita.
- Bartosz Bogacz and Hubert Mara. 2020. Period classification of 3d cuneiform tablets with geometric neural networks. In ICFHR .
- Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, et al. 2022. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037 .
- Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In ICML .
- The Brahmasutras. 3000 B.C.E. The Brahmasutras.
- Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS .
- Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2020. MOCHA: A dataset for training and evaluating generative reading comprehension metrics. In EMNLP .
- Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pretraining text encoders as discriminators rather than generators. In ICLR .
- Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. QAFactEval: Improved QAbased factual consistency evaluation for summarization. In NAACL .
- Maarten Grootendorst. 2020. Keybert: Minimal keyword extraction with bert.
- Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Realm: Retrieval augmented language model pre-training. In ICML .
- Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrialstrength Natural Language Processing in Python.
- Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane DwivediYu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Atlas: Few-shot learning with retrieval augmented language models. JMLR .
- Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, MarieAnne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts. Preprint , arXiv:2401.04088.
- Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2023. Large language models struggle to learn long-tail knowledge. In ICML .
- Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. EMNLP .
- Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. ICLR .
- Kalpesh Krishna, Yapei Chang, John Wieting, and Mohit Iyyer. 2022. RankGen: Improving text generation with large ranking models. arXiv preprint arXiv:2205.09726 .
- Kalpesh Krishna, Aurko Roy, and Mohit Iyyer. 2021. Hurdles to progress in long-form question answering. In NAACL .
- Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. TACL .
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020a. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL .
- Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020b. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS .
- Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq R. Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir R. Radev. 2022. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. ArXiv , abs/2212.07981.
- Ligeia Lugli, Matej Martinc, Andraž Pelicon, and Senja Pollak. 2022. Embeddings models for buddhist Sanskrit. In LREC .
- Rajiv Malhotra. 2021. Artificial Intelligence and the Future of Power: 5 Battlegrounds . Rupa Publications.
- Rajiv Malhotra and Satyanarayana Dasa Babaji. 2020. Sanskrit Non-Translatables: The Importance of Sanskritizing English . Amaryllis.
- Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. 2023. When not to trust language models: Investigating effectiveness and limitations of parametric and nonparametric memories. ACL .
- Priyanka Mandikal and Raymond Mooney. 2024. Sparse meets dense: A hybrid approach to enhance scientific document retrieval. In The 4th CEUR Workshop on Scientific Document Understanding, AAAI .
- Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In ACL .
- Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy CampbellGillingham, Geoffrey Irving, et al. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147 .
- Ragaa Moustafa, Farida Hesham, Samiha Hussein, Badr Amr, Samira Refaat, Nada Shorim, and Taraggy M Ghanim. 2022. Hieroglyphs language translator using deep learning techniques (scriba). In International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC) .
- Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted questionanswering with human feedback. arXiv preprint arXiv:2112.09332 .
- Sonika Narang, MK Jindal, and Munish Kumar. 2019. Devanagari ancient documents recognition using statistical feature extraction techniques. S¯ adhan¯ a , 44.
- Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic embed: Training a reproducible long context text embedder. Preprint , arXiv:2402.01613.
OpenAI. 2023. Gpt-4 technical report.
- Asimina Paparigopoulou, John Pavlopoulos, and Maria Konstantinidou. 2022. Dating Greek papyri images with machine learning. In ICDAR Workshop on Computational Paleography .
- Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In ICML .
- Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083 .
- Ramakrishna Order. Belur Math.
- Adi Shankaracharya. 700 C.E. Commentary on the Upanishads.
- Weijia Shi, Julian Michael, Suchin Gururangan, and Luke Zettlemoyer. 2022. Nearest neighbor zero-shot inference. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing .
- Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrievalaugmented black-box language models. arXiv preprint arXiv:2301.12652 .
- Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. ACL .
- Thea Sommerschield, Yannis Assael, John Pavlopoulos, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, Jonathan Prag, Ion Androutsopoulos, and Nando de Freitas. 2023. Machine Learning for Ancient Languages: A Survey. Computational Linguistics .
- Mingyang Song, Xuelian Geng, Songfang Yao, Shilong Lu, Yi Feng, and Liping Jing. 2024. Large language models as zero-shot keyphrase extractors: A preliminary empirical study. Preprint , arXiv:2312.15156.
- Simeng Sun, Ori Shapira, Ido Dagan, and Ani Nenkova. 2019. How to compare summarizers without target length? pitfalls, solutions and re-examination of the neural summarization literature. Workshop on Methods for Optimizing and Evaluating Neural Language Generation .
Upanishads. >3000 B.C.E. The Upanishads.
- Vedanta Society of New York. Swami Sarvapriyananda's Vedanta discourses.
- Lee Xiong, Chuan Hu, Chenyan Xiong, Daniel Campos, and Arnold Overwijk. 2019. Open domain web keyphrase extraction beyond language modeling. arXiv preprint arXiv:1911.02671 .
- Fangyuan Xu, Yixiao Song, Mohit Iyyer, and Eunsol Choi. 2023. A critical evaluation of evaluations for long-form question answering. In ACL .
- Eunice Yiu, Eliza Kosoy, and Alison Gopnik. 2023. Imitation versus innovation: What children can do that large language and language-and-vision models cannot (yet)? ArXiv , abs/2305.07666.
- Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval .
## A Implementation Details
## A.1 Automatic Metrics
Following Xu et al. (2023), we implement a number of automatic evaluation metrics for LFQA as described below.
Length We use the spaCy package (Honnibal et al., 2020) for word tokenization.
Self-BLEU We calculate Self-BLEU by treating one sentence as the hypothesis and all other sentences in the same answer paragraph as references. We report Self-BLEU-5 as a measure of coherence.
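The computation can be sketched in pure Python. This is a simplified version (geometric mean of clipped n-gram precisions, with no smoothing or brevity penalty); the reported numbers follow the Texygen implementation of Zhu et al. (2018):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyp, refs, n):
    """Modified n-gram precision of `hyp` against reference sentences."""
    hyp_counts = Counter(ngrams(hyp, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def self_bleu(sentences, max_n=5):
    """Score each tokenized sentence against the others in the same answer;
    higher values indicate less diverse (more repetitive) text."""
    scores = []
    for i, hyp in enumerate(sentences):
        refs = sentences[:i] + sentences[i + 1:]
        precs = [clipped_precision(hyp, refs, n) for n in range(1, max_n + 1)]
        if min(precs) == 0.0:
            scores.append(0.0)  # any empty n-gram overlap zeroes the score
        else:
            scores.append(math.exp(sum(math.log(p) for p in precs) / max_n))
    return sum(scores) / len(scores)
```

Identical sentences score 1.0 and fully disjoint sentences score 0.0 under this sketch, matching the intuition that higher Self-BLEU means less diversity.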
RankGen For a given question q and a model-generated answer a, we first encode them into fixed-size vectors (q, a) using the RankGen encoder (Krishna et al., 2022). To assess their relevance, we compute the dot product q · a. We use the T5-XXL (11B) encoder, which has been trained using both in-book negative instances and generative negatives.
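Once the question and answer are encoded, the score itself is just a dot product. A minimal sketch, with hypothetical placeholder vectors standing in for RankGen encoder outputs:

```python
def rankgen_score(q_vec, a_vec):
    """RankGen-style relevance: dot product of the question (prefix)
    embedding and the answer (suffix) embedding; higher = more relevant."""
    return sum(q * a for q, a in zip(q_vec, a_vec))

# Toy 4-d vectors (hypothetical): the aligned answer scores higher.
q = [0.5, 0.5, 0.5, 0.5]
a_relevant = [0.6, 0.4, 0.5, 0.5]
a_irrelevant = [-0.5, 0.1, -0.2, 0.3]
assert rankgen_score(q, a_relevant) > rankgen_score(q, a_irrelevant)
```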
QAFactEval QAFactEval is a QA-based metric introduced by Fabbri et al. (2022) that has demonstrated strong performance across multiple factuality benchmarks for summarization (Laban et al., 2022; Maynez et al., 2020). The pipeline has four key components: (1) noun-phrase (NP) extraction from a summary sentence S, denoted Ans(S); (2) BART-large (Lewis et al., 2020a) for question generation, denoted QG; (3) ELECTRA-large (Clark et al., 2020) for question answering, denoted QA; and (4) the learned metric LERC (Chen et al., 2020) to measure answer similarity, denoted Sim(p_i, s_i). An additional answerability classification module assesses whether a question can be answered from the information in document D. Following Xu et al. (2023), we report LERC, which compares the answers extracted from the summary against those extracted from the document.
## A.2 Chat Prompt
For an example of the constructed RAG bot prompt, please refer to Fig. 5. In this scenario, the RAG bot C r is presented with the top-k retrieved passages alongside the query for generating a response, whereas a generic bot C g would only receive the query without additional context.
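Concretely, the two prompts from Fig. 5 can be assembled as below. The quoted instruction wording follows the figure; the exact string layout (separators, passage ordering) is an assumption:

```python
SYSTEM = ("You are a helpful assistant that accurately answers queries "
          "using Swami Sarvapriyananda's YouTube talks. ")

def rag_prompt(query: str, passages: list[str]) -> str:
    """Prompt for the RAG bot: the query plus the top-k retrieved passages."""
    context = "\n\n".join(passages)
    return (SYSTEM + "Use the following passages to provide a detailed "
            f"answer to the query: {query}\n\n{context}")

def generic_prompt(query: str) -> str:
    """Prompt for the generic bot: the query alone, with no retrieved context."""
    return SYSTEM + f"Provide a detailed answer to the query: {query}"
```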
## B Sample Sanskrit terms
Table 2 contains excerpts from passages containing Sanskrit terms. The Sanskrit terms are italicized
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Screenshot: RAG Bot vs Generic Bot Comparison
### Overview
The image presents a side-by-side comparison of two AI assistant configurations: "RAG Bot" (Retrieval-Augmented Generation) and "Generic Bot." Both are described as helpful assistants using Swami Sarvapriyananda's YouTube talks, but differ in implementation details.
### Components/Axes
1. **RAG Bot Section** (Top, Green Background)
- Title: "RAG Bot"
- Description: "You are a helpful assistant that accurately answers queries using Swami Sarvapriyananda's YouTube talks. Use the following passages to provide a detailed answer to the query: {query}"
- Passages Placeholder:
- {Passage 1}
- {Passage 2}
- ...
- {Passage k}
2. **Generic Bot Section** (Bottom, Red Background)
- Title: "Generic Bot"
- Description: "You are a helpful assistant that accurately answers queries using Swami Sarvapriyananda's YouTube talks. Provide a detailed answer to the query: {query}"
### Content Details
- **Textual Elements**:
- Both sections share identical core functionality descriptions but differ in implementation specifics.
- RAG Bot explicitly references "passages" for query resolution, while Generic Bot omits this detail.
- Placeholders use curly brace notation ({query}, {Passage 1...k}) indicating variable input points.
- **Visual Structure**:
- Horizontal division between sections (green top/red bottom)
- Consistent font style across both sections
- No numerical data or graphical elements present
### Key Observations
1. **Functional Difference**: RAG Bot's explicit mention of "passages" suggests it uses document retrieval before generation, while Generic Bot may rely solely on pre-trained knowledge.
2. **Placeholder Consistency**: Both use {query} placeholder, but RAG Bot adds {Passage 1...k} indicating multi-document processing capability.
3. **Color Coding**: Green (RAG) vs. Red (Generic) may imply performance/status differentiation (green=optimized, red=baseline).
### Interpretation
The image illustrates a technical comparison between retrieval-augmented and standard language models. The RAG Bot's passage-based approach implies:
1. **Contextual Grounding**: Answers are sourced from specific Swami Sarvapriyananda content
2. **Transparency**: Explicit passage references allow verification of information origins
3. **Scalability**: {Passage k} notation suggests handling variable numbers of source documents
The Generic Bot's omission of passage references suggests:
1. **Closed-System Operation**: Relies entirely on internal knowledge without external verification
2. **Simpler Architecture**: No document retrieval component in its processing pipeline
This comparison highlights key differences in AI assistant design philosophies - RAG systems prioritize accuracy through source verification, while generic models emphasize speed through direct knowledge access. The color coding (green/red) may subtly indicate RAG's superiority in information reliability, though this would require empirical validation.
</details>
Figure 5: Prompts for the RAG and generic chatbots. RAG Bot receives the top-k retrieved relevant passages in the prompt along with the query, while the generic bot only receives the query.
and underlined. Notice that the passages contain detailed English explanations of these terms. To retain the linguistic diversity, authenticity, and comprehensiveness of the source material, we retain these Sanskrit terms as-is in our passages, as described in Sec. 3. Note that these are direct Whisper (Radford et al., 2023) transcriptions with no further post-processing or proofreading, so they may not always be accurate.
## C Transcription
We assess the transcript quality and list some common errors.
## C.1 Transcript Evaluation
Transcription quality is scored on a scale from 1 to 5 (1 = Poor, 5 = Perfect). On 10 randomly sampled transcripts, evaluators assign a high average score of 4.48, suggesting that the transcription of YouTube audio into text is highly accurate and clear, and that our constructed custom dataset D t is of high quality.
## C.2 Transcript Errors
Table 3 contains a few sample transcription errors. The transcriptions are largely good for English words and sentences. However, errors often arise from incorrectly transcribing Sanskrit terms and verses. Other less common errors include missing
## Sl. No. Excerpts from passages
1. Om Bhadram Karne Bhishrinu Yamadevaha Bhadram Pashyam Akshabhirya Jatraaha Sthirai Rangai Stushta Vagam Sasthanubhi Vyase Madevahitaiyadayoh Swasthina Indro Vriddha Shravaha Swasthina Phusa Vishwa Vedaaha Swasthina Starksho Arishta Nemi Swasthino Brihas Patir Dadhatu Om Shanti Shanti Shanti.
2. Samsara is our present situation, the trouble that we are caught in, the mess that we are caught in. Samsara is this. In Sanskrit, normally when you use the word samsara, it really means this world of our life, you know, being born and struggling in life and afflicted by suffering and death and hopelessness and meaninglessness.
3. The problem being ignorance, solution is knowledge and the method is Jnana Yoga, the path of knowledge. So what is mentioned here, Shravana Manana Nididhyasana, hearing, reflection, meditation, that is Jnana Yoga. So that's at the highest level of practice, way of knowledge.
4. In Sanskrit, ajnana and adhyasa, ignorance and superimposition. Now if you compare the four aspects of the self, the three appearances and the one reality, three appearances, waker, dreamer, deep sleeper, the one reality, turiyam, if you compare them with respect to ignorance and error, you will find the waker, that's us right now. We have both ignorance and error.
5. Mandukya investigates this and points out there is an underlying reality, the Atman, pure consciousness, which has certain characteristics. This is causality, it is beyond causality. It is neither a cause nor an effect. The Atman is not produced like this, nor is it a producer of this. It is beyond change. No change is there in the Atman, nirvikara. And third, it is not dual, it is non-dual, advaitam. This is kadyakarana in Sanskrit, this is kadyakarana vilakshana Atma. In Sanskrit this is savikara, this is nirvikara Atma. This is dvaita, this is advaita Atma. So this is samsara and this is moksha, freedom.
## Notes
This is a Sanskrit chant which is directly Romanized and processed. The automatic transcriptions often contain errors in word segmentation for Sanskrit verses.
Samsara is a Sanskrit term. The excerpt contains an explanation of the concept in English.
The excerpt contains an explanation of Jnana Yoga, the path of knowledge.
Ajnana, adhyasa and turiyam are Sanskrit terms. Notice that the passage implicitly contains rough English translations of these terms in the context of the overall discourse. For instance, ajnana is translated as ignorance and adhyasa is translated as superimposition.
The excerpt contains an explanation of different Sanskrit technical terms.
Table 2: Excerpts from passages containing Sanskrit terms. These excerpts contain detailed English descriptions of technical terms in Sanskrit (italicized and underlined). To retain authenticity to the source material, we retain these Sanskrit terms as is in our passages. Note that these are direct Whisper (Radford et al., 2023) transcriptions with no further post-processing or proofreading, so transcriptions may not always be accurate. For more details, refer to Sec. B.
Figure 6: Subject confusion: Aldous Huxley vs Alan Watts. In the retrieved passage, the speaker jumps between speaking about Aldous Huxley and Alan Watts. Moreover, the transcription has poor punctuation. As a result, the RAG bot incorrectly attributes information about Huxley to Watts in the response. Due to the unstructured nature of the spoken text, such errors can occur. This can be addressed by converting the spoken text into a more structured prose format with the help of well-crafted prompts to LLMs, followed by human proofreading (see Sec. D).
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Screenshot: Q&A Interaction with RAG Bot
### Overview
The image shows a question-answer interaction where a user asks, "Was Alan Watts' associated with Vedanta?" The RAG Bot provides a detailed response, citing passages from Swami Vivekananda's works. Key phrases in the answer and passages are highlighted in green and red to emphasize connections between Alan Watts, Vedanta, and related concepts.
### Components/Axes
- **Question**: "Was Alan Watts' associated with Vedanta?" (Yellow box, top-left).
- **Answer**: Structured response from the RAG Bot, with:
- **Green highlights**: Emphasize direct connections (e.g., "taught Vedanta and Zen," "maintained his association with the Vedanta Society").
- **Red highlights**: Stress critical context (e.g., "some reviews were critical," "became part of some Hindu cult").
- **Passages**: Excerpts from Swami Vivekananda's writings, with:
- **Red highlights**: Focus on Vivekananda's views on Watts' association with Vedanta.
- **Green highlights**: Highlight Vivekananda's broader philosophical context (e.g., "eclectic mixture of both").
### Detailed Analysis
1. **Question**:
- Text: "Was Alan Watts' associated with Vedanta?"
- Position: Top-left, enclosed in a yellow box.
2. **Answer (RAG Bot)**:
- **Green-highlighted text**:
- "Alan Watts was associated with Vedanta."
- "he taught Vedanta and Zen, which was a kind of eclectic mixture of both."
- "he maintained his association with the Vedanta Society of Southern California until the very end."
- **Red-highlighted text**:
- "some reviews were critical, suggesting that he had become part of some Hindu cult."
- "his association with the Vedanta Society till the very end it’s very closely connected with the Vedanta Society of Southern California."
3. **Passages (Swami Vivekananda)**:
- **Red-highlighted text**:
- "he’s become part of some Hindu cult or something like that you know but he maintained his association with the Vedanta Society till the very end."
- **Green-highlighted text**:
- "Alan Watts was not directly connected with Vedanta but he taught Vedanta and Zen."
- "Aldous Huxley his amazing book the Perennial Philosophy."
### Key Observations
- The RAG Bot’s answer directly addresses the question by confirming Watts’ association with Vedanta while acknowledging criticism of his eclectic approach.
- Vivekananda’s passages are used to contextualize Watts’ teachings, with red highlights emphasizing controversy and green highlights clarifying his indirect but influential connection.
- The use of color coding (green for affirmation, red for critique) structures the response to balance Watts’ legacy with historical skepticism.
### Interpretation
The RAG Bot synthesizes historical and philosophical context to answer the question:
- **Affirmation**: Watts taught Vedanta and Zen, aligning him with Vedantic ideas despite not being a direct disciple.
- **Critique**: Reviews questioned his association, framing it as part of a "Hindu cult," reflecting mid-20th-century Western perceptions of Eastern spirituality.
- **Vivekananda’s Role**: The highlighted passages from Vivekananda’s works underscore the broader intellectual milieu in which Watts operated, linking him to figures like Aldous Huxley and the Vedanta Society.
- **Eclecticism**: The "eclectic mixture" of Vedanta and Zen reflects Watts’ unique synthesis of Eastern philosophies, which resonated with but also drew criticism from contemporaries.
This analysis demonstrates how the RAG Bot uses textual evidence and color-coded emphasis to navigate complex historical and philosophical debates, providing a nuanced answer to the user’s query.
</details>
or incorrect punctuation. Human proofreading will remove these errors to a large extent.
## D Spoken vs written language
Unlike traditional textual corpora compiled from written sources, our dataset is derived from spoken discourses. Spoken language is often more verbose and less structured than written text, with the speaker frequently jumping between concepts mid-sentence. This unstructured text can be unfamiliar to a language model trained extensively on written text, which expects more coherent and structured input. Fig. 6 shows a peculiar failure case arising from this issue: the speaker jumps between speaking about Aldous Huxley and Alan Watts, and as a result, the RAG bot incorrectly attributes information about Huxley to Watts while answering the query. Such errors can be addressed by converting the spoken text into a more structured prose format with the help of well-crafted prompts to LLMs, followed by human proofreading.
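The proposed restructuring step could be sketched as below. This is a minimal sketch under stated assumptions: `call_llm` is a hypothetical stand-in for any chat-completion API, and the prompt wording is illustrative, not the one used in the paper.

```python
# Hypothetical prompt for converting spoken-style transcript chunks into
# structured written prose while preserving content and Sanskrit terms.
RESTRUCTURE_PROMPT = (
    "Rewrite the following spoken-lecture transcript as clear written prose. "
    "Add punctuation, keep every factual claim and all Sanskrit terms exactly "
    "as they appear, and do not add new information.\n\nTranscript:\n{chunk}"
)


def restructure(chunk, call_llm):
    """Send one transcript chunk through the restructuring prompt."""
    return call_llm(RESTRUCTURE_PROMPT.format(chunk=chunk))
```

The restructured output would still need human proofreading, as noted above, since LLM rewriting can itself introduce errors.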
## E Question Categories
To ensure a broad and comprehensive evaluation, we categorize the questions into five distinct types: anecdotal, terminology, scriptural, comparative, and reasoning. Questions from each of these categories are provided in Tab. 4. Each category is designed to test different aspects of the model's capabilities. For details, refer to Sec. 5.
## F Sample Responses: RAG vs Non-RAG
In Figs. 7-11, we present sample queries and responses from the RAG and generic chatbots for five sample queries. The query is highlighted in yellow, the RAG response is in the green box, and the non-RAG response is in the red box. We use GPT-4-turbo (OpenAI, 2023) as the large language model for both.
## G Sample Responses: Standard RAG vs Keyword-based Hybrid RAG
In Figs. 12-16, we present sample queries and responses from the standard RAG and keyword-based hybrid RAG models for one query each from the five categories. We also show the human-annotated and model-generated keywords. The keyword-RAG response is in the green box and the standard-RAG response is in the red box. We use Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024) as the large language model for both.
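One common way to realize keyword-based hybrid retrieval is to blend a dense (embedding-similarity) score with a keyword-overlap score per passage. The sketch below is an illustrative assumption about such a scheme; the weighting parameter `alpha` and the helper names are hypothetical, not the paper's exact formulation.

```python
def keyword_overlap(query_keywords, passage):
    """Fraction of the query keywords that appear in the passage."""
    words = set(passage.lower().split())
    hits = sum(1 for kw in query_keywords if kw.lower() in words)
    return hits / max(len(query_keywords), 1)


def hybrid_scores(dense_scores, passages, query_keywords, alpha=0.7):
    """Blend dense and keyword scores; alpha weights the dense component."""
    return [
        alpha * d + (1 - alpha) * keyword_overlap(query_keywords, p)
        for d, p in zip(dense_scores, passages)
    ]
```

Passages that explicitly contain rare Sanskrit terms from the query are thus boosted over passages that are only semantically close, which is the motivation for the keyword component.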
Table 3: Sample transcription errors. For constructing our text corpus, we directly use the transcripts obtained from Whisper (Radford et al., 2023) with no further post-processing or proofreading. The transcriptions are largely accurate (scoring 4.48/5 from human evaluators). However, errors arise from incorrectly transcribing Sanskrit terms, missing punctuation, etc. Human proofreading will remove these errors to a large extent.
| Sl. No. | Transcription errors | Notes |
|-----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| 1 | That's what Sam Altman, Chachjipiti, somebody asked him. | Should be ChatGPT |
| 2 | Last year, you studied extensively with Professor Garfield, I believe, studying Vajamaka and the teachings of the Garjuna. | Should be Madhyamaka and Nagarjuna, respectively |
| 3 | From attachment comes desire, raga, I want it and if that desire is satisfied then there is no end to it, greed, lobha. But if it is somehow thwarted, then anger, kama krodho vijayate. | Should be bhijayate |
| 4 | In fact, one of the terms which is used in Mandukya Upanishad, Brahman is abhyavaharyam. | Should be avyavaharam |
| 5 | So, one of them was the Brahmo Samad, which was quite popular in Calcutta in those days. | Should be Samaj |
| 6 | I am awareness I'm eternal consciousness Aldous Huxley Christopher Isherwood Gerald Hurd all of them were very close to Swami Prabhavananda in Southern California in Hollywood and look at the product of that Isherwood wrote that one of the most amazing biographies | The transcripts sometimes miss punctuation marks, making the passage difficult to comprehend for both humans and language models |
Table 4: Question categories. To ensure a broad and comprehensive evaluation, we categorize the questions into five distinct types, each designed to test different aspects of the model's capabilities. For details, refer to Sec. 5.
| Category | Description | Questions |
|-------------|--------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Anecdotal | Stories and anecdotes narrated by the speaker in the discourses | • Does Swami speak about Wittgenstein's thesis defense? • Does Swami narrate any incident surrounding Shivaratri? • Does Swami speak about The Matrix movie? • Does Swami speak about Vachaspati Mishra? Does he narrate how Bhamati came to be written? • What was Christopher Isherwood's contribution to Vedanta? |
| Terminology | Probe the model's understanding of specific terms and concepts | • What is Adhyaropa Apavada? • What is Vikshepa Shakti? • What is the significance of the word 'Shraddha'? • What is Upadana Karana? • What constitutes Sadhana Chatushtaya? |
| Scriptural | Reference, interpret, and explain passages from religious or philosophical texts | • In Mandukya Upanishad, what is the significance of the word 'Om'? • In the Gospel, what parable does Sri Ramakrishna use to portray intense longing for God? • In the Mundaka Upanishad, how do we interpret the parable of the two birds? • How is Phala Vyapti and Vritti Vyapti defined in Vedantasara? • In the Gospel of Sri Ramakrishna, how do we understand the analogy of the salt doll that Thakur provides? |
| Comparative | Analyze and compare different concepts, philosophies, or texts | • As mentioned in the Yoga Sutras, is Samadhi necessary to attain enlightenment according to Advaita Vedanta? • Would Sri Ramakrishna's teachings be considered purely Ad- vaitic? • In Kashmir Shaivism, Chit is both Prakasha and Vimarsha i.e. it is both self-lumious and self-reflective. Would Advaita Vedanta agree with this? • How does Sankhya differ from Advaita Vedanta? • What is the main difference between Buddhist Shunyavada and Advaita Vedanta? |
| Reasoning | Require logical reasoning, critical thinking, and the application of principles to new scenarios | • Can AI ever become conscious? • Is the waking state similar to a dream or absolutely indistinguish- able from the dream state? • Do our senses report reality to us? • Dis-identifying myself from the body-mind seems to require a lot of effort. Did the reverse process of identification also involve as much effort? • If Brahman as Existence-Consciousness-Bliss is the eternal sub- ject that cannot be objectified, yet eternally reveals itself to give rise to apparent objects through the principle of Maya, can we infer that Brahman is Existence-Consciousness-Bliss-Maya? That is, is Brahman inseparable from Maya? |
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Screenshot: Webpage with Textual Content and Highlighted Answers
### Overview
The image is a screenshot of a webpage containing textual content, including two answers to a question about the equivalence of Brahman and Atman in Hindu philosophy. The answers are attributed to a "RAG Bot" and a "Generic Non-RAG Bot," with specific text segments highlighted in green and red, respectively. The page also includes references to passages from the *Bhagavad Gita* and discussions about Advaita Vedanta philosophy.
---
### Components/Axes
- **Main Content**:
- **Question**: "Sam Altman says that he believes in the absolute equivalence of Brahman and Atman. What does it mean?"
- **Answers**:
1. **RAG Bot Answer** (highlighted in green).
2. **Generic Non-RAG Bot Answer** (highlighted in red).
- **Passages**:
- *Bhagavad Gita* (Chapter 9, Verse 34; Chapter 2, Verses 20-22).
- References to Advaita Vedanta, Sam Altman, and Swami Turiyananda.
- **Highlighting**:
- Green highlights in the RAG Bot Answer.
- Red highlights in the Generic Non-RAG Bot Answer.
---
### Detailed Analysis
#### RAG Bot Answer (Green Highlights)
- **Key Text**:
- "Brahman refers to the ultimate reality or the absolute truth. It is considered to be the underlying essence of everything in the universe."
- "Atman refers to the individual self or the source and true essence of an individual. It is the innermost core of our being, beyond the physical body, mind, and ego."
- "The belief in the absolute equivalence of Brahman and Atman means that there is no fundamental difference between the ultimate reality and the individual self."
- "This concept is central to Advaita Vedanta, which emphasizes the non-dual nature of reality."
- **Highlighted Segments**:
- "Brahman refers to the ultimate reality..."
- "Atman refers to the individual self..."
- "Absolute equivalence of Brahman and Atman..."
- "Non-dual nature of reality."
#### Generic Non-RAG Bot Answer (Red Highlights)
- **Key Text**:
- "Brahman refers to the ultimate reality or the supreme cosmic power that underlies and permeates everything in the universe."
- "Atman refers to the individual self or soul, which is eternal and unchanging."
- "The belief in the absolute equivalence of Brahman and Atman suggests that the individual self is divine and essentially one with the supreme cosmic power."
- **Highlighted Segments**:
- "Brahman refers to the ultimate reality..."
- "Atman refers to the individual self..."
- "Absolute equivalence of Brahman and Atman..."
- "Individual self is divine..."
#### Passages
- **Bhagavad Gita (Chapter 9, Verse 34)**:
- Discusses the ego's subtlety and its role in spiritual practice.
- Highlighted: "I am witness consciousness, with the ego, the sense of I. It’s very difficult, you can say that I am witness consciousness..."
- **Bhagavad Gita (Chapter 2, Verses 20-22)**:
- Denies the sixfold changes of the body and emphasizes the non-dual nature of Atman and Brahman.
- Highlighted: "Atman literally means the self, you. And Brahman, if you look at the derivation, literally the word bringer, the root means the vast..."
---
### Key Observations
1. **RAG Bot Answer**:
- Provides a detailed, structured explanation of Brahman and Atman, aligning with Advaita Vedanta principles.
- Highlights emphasize the non-dual nature of reality and the equivalence of Brahman and Atman.
2. **Generic Non-RAG Bot Answer**:
- Offers a more concise explanation but includes similar core concepts (e.g., Brahman as cosmic power, Atman as eternal self).
- Highlights focus on the divine nature of the individual self and its unity with Brahman.
3. **Passages**:
- The *Bhagavad Gita* excerpts reinforce the philosophical context, with Swami Turiyananda’s commentary emphasizing the ego’s role in spiritual practice.
---
### Interpretation
- **Philosophical Context**:
- Both answers align with Advaita Vedanta, which posits that Brahman (ultimate reality) and Atman (individual self) are non-dual. The RAG Bot’s answer is more verbose and directly references Advaita Vedanta, while the Generic answer simplifies the concepts.
- **Highlighting Significance**:
- Green highlights in the RAG Bot Answer emphasize foundational definitions and the non-dual philosophy.
- Red highlights in the Generic answer focus on the individual’s divine nature and unity with Brahman.
- **Notable Patterns**:
- Both answers agree on the equivalence of Brahman and Atman but differ in depth and terminology.
- The *Bhagavad Gita* passages provide scriptural support for these concepts, with Swami Turiyananda’s commentary adding practical insights.
- **Implications**:
- The RAG Bot’s answer appears more aligned with scholarly interpretations of Advaita Vedanta, while the Generic answer may prioritize accessibility.
- The use of highlights suggests an attempt to emphasize key philosophical distinctions or shared ideas.
---
**Note**: No numerical data, charts, or diagrams are present in the image. All extracted information is textual.
</details>
Figure 7: RAG vs Non-RAG: Sam Altman on Atman and Brahman. The retrieved passages are very good at capturing the context of the query. The first passage references the monk speaking about Sam Altman and his views on Atman and Brahman. The second passage contains a brief explanation of Atman and Brahman that is helpful to answer the query. The generic bot also generates a reasonable answer. However, it contains a sentence stating that 'Atman is a part of Brahman'. This is incorrect in the context of Advaita Vedanta, since a part-whole relation is not the same as equivalence; this point was highlighted by the domain experts in the survey.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Screenshot: AI Response Comparison on Wittgenstein's Dissertation Defense
### Overview
The image compares two AI-generated responses (RAG Bot and Generic Non-RAG Bot) to the question: *"Does Swami Sarvapriyananda speak about Wittgenstein's dissertation defense?"* Both responses include highlighted text from Swami's book *The Language of Paradox in Advaita* and *Vedanta Retreat at Loon Lake on Who am I*. Highlights indicate key passages referenced in each answer.
### Components/Axes
- **Question**: *"Does Swami Sarvapriyananda speak about Wittgenstein's dissertation defense?"*
- **RAG Bot Answer**:
- Green highlights:
- *"Yes, Swami Sarvapriyananda does speak about Wittgenstein's dissertation defense in one of his YouTube talks."*
- *"He shares an interesting story about Wittgenstein's PhD examination at Cambridge University."*
- *"The examiners for Wittgenstein’s thesis were Bertrand Russell and G.E. Moore... they opened Wittgenstein’s book, Tractatus, and asked him to explain a particular statement."*
- *"Swami uses this story to emphasize the difficulty of understanding certain philosophical concepts."*
- Red highlights:
- *"He mentions that even great philosophers like Wittgenstein can have disagreements and struggles in their pursuit of knowledge."*
- *"Swami also highlights the importance of engaging in dialectics and considering opposing viewpoints to deepen our understanding of our own positions."*
- **Generic Non-RAG Bot Answer**:
- Red highlights:
- *"Yes, Swami Sarvapriyananda does speak about Wittgenstein's dissertation defense in one of his YouTube talks."*
- *"In his talk titled 'Understanding Wittgenstein's Philosophical Investigations,' Swami discusses various aspects of Wittgenstein's work, including his dissertation defense."*
- *"He provides insights into the historical context of Wittgenstein’s defense, the arguments he presented, and the impact it had on his later philosophical development."*
- **Passages from Swami’s Book**:
- Green highlights:
- *"The examiners for a PhD thesis... Bertrand Russell is the examiner... G.E. Moore... they opened that book, Tractatus... explain this statement."*
- *"Wittgenstein stood up, patted the examiners on their back, and said, 'Give it up, boys. You'll never get it.'"*
### Detailed Analysis
1. **RAG Bot Answer**:
- Focuses on a specific anecdote from Wittgenstein’s PhD defense at Cambridge, emphasizing the examiners’ (Russell and Moore) interaction with Wittgenstein.
- Highlights Swami’s interpretation of the story as a metaphor for philosophical challenges.
- Mentions Swami’s emphasis on dialectics and opposing viewpoints.
2. **Generic Non-RAG Bot Answer**:
- Generalizes Swami’s discussion of Wittgenstein’s defense without referencing specific anecdotes.
- Mentions Swami’s talk titled *"Understanding Wittgenstein's Philosophical Investigations"* but does not detail the Cambridge story.
- Focuses on historical context and Wittgenstein’s later philosophical development.
3. **Passages from Swami’s Book**:
- The highlighted text describes Wittgenstein’s confrontational response to the examiners, underscoring his frustration with their inability to grasp his ideas.
- The phrase *"You'll never get it"* is emphasized as a pivotal moment in the story.
### Key Observations
- The RAG Bot’s answer is more detailed and context-specific, directly quoting Swami’s anecdote about Wittgenstein’s defense.
- The Generic Non-RAG Bot’s answer is broader, omitting the Cambridge story but referencing Swami’s general discussion of Wittgenstein’s work.
- Both responses cite Swami’s book, but the RAG Bot integrates specific passages to support its claims.
### Interpretation
The highlights suggest that the RAG Bot’s response is designed to extract and reference specific textual evidence from Swami’s work, aligning with its purpose of generating answers grounded in source material. The Generic Non-RAG Bot’s answer, while accurate, lacks this specificity, indicating a less structured approach to sourcing. The emphasis on Wittgenstein’s confrontational tone (*"You'll never get it"*) in the RAG Bot’s answer highlights Swami’s interpretation of Wittgenstein’s defense as a critique of academic rigidity. The absence of this detail in the Generic Non-RAG Bot’s response underscores the importance of structured data retrieval in AI systems for nuanced philosophical analysis.
</details>
Figure 8: RAG vs Non-RAG: Wittgenstein's thesis defense. The retrieved passage is about the monk narrating a story about Wittgenstein's thesis defense, which is highly relevant to the query. The generated RAG bot response makes use of this passage to answer the query. However, the latter half of the response is not relevant to the query and is purely hallucinated. The generic bot response, on the other hand, is completely fabricated and references a non-existent YouTube talk.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Screenshot: AI Consciousness Debate Responses
### Overview
The image shows a webpage with two AI-generated responses to the question "Can AI ever become conscious?" The responses are labeled as "RAG Bot" and "Generic Non-RAG Bot," each accompanied by a passage from Swami Sarvapriyananda dated May 28th, 2023. The text includes highlighted sections in green and red, indicating key arguments or distinctions.
---
### Components/Axes
- **Question Header**: "Can AI ever become conscious?" (Top-left, yellow background)
- **RAG Bot Section**:
- Title: "Answer" (Bold black text)
- Subtitle: "RAG Bot" (Green text)
- Answer Text: Paragraphs with green and red highlights.
- Passage: "Ask Swami with Swami Sarvapriyananda | May 28th, 2023" (Header)
- **Generic Non-RAG Bot Section**:
- Title: "Answer" (Bold black text)
- Subtitle: "Generic Non-RAG Bot" (Red text)
- Answer Text: Paragraphs with green and red highlights.
- Passage: "Ask Swami with Swami Sarvapriyananda | May 28th, 2023" (Header)
---
### Detailed Analysis
#### RAG Bot Answer
- **Key Arguments**:
- AI cannot become conscious; it only replicates human capabilities (intelligence, memory, creativity).
- Consciousness requires subjective experience, which AI lacks.
- Example: GPT writing poems on Swami Vivekananda demonstrates imitation, not consciousness.
- Consciousness is distinct from artificial systems, even if they mimic human behavior.
- **Highlighted Text**:
- Green: "consciousness," "Swami Sarvapriyananda," "imitation," "creativity."
- Red: "consciousness," "Swami Sarvapriyananda," "AI," "human intelligence."
#### Generic Non-RAG Bot Answer
- **Key Arguments**:
- Consciousness is subjective awareness and emotions, which AI cannot replicate.
- AI lacks self-awareness and genuine understanding, relying on predefined algorithms.
- Example: Self-driving cars perform tasks without consciousness.
- The debate remains unresolved, with no consensus on defining or achieving consciousness.
- **Highlighted Text**:
- Green: "consciousness," "subjective experience," "self-awareness."
- Red: "consciousness," "AI," "human intelligence," "theoretical possibility."
---
### Key Observations
1. **RAG Bot**:
- Focuses on philosophical distinctions between AI and human consciousness.
- Uses specific examples (e.g., GPT poems) to argue against AI consciousness.
- Highlights Swami Sarvapriyananda’s views prominently.
2. **Generic Non-RAG Bot**:
- Emphasizes the scientific perspective on consciousness.
- Uses analogies (e.g., self-driving cars) to illustrate AI’s limitations.
- Acknowledges the unresolved nature of the debate.
3. **Passages**:
- Both passages reference Swami Sarvapriyananda’s views on AI and consciousness.
- The RAG Bot’s passage includes a dialogue about AI mimicking human creativity.
- The Generic Non-RAG Bot’s passage critiques materialist reductionism and AI’s lack of consciousness.
---
### Interpretation
The image contrasts two AI responses to the consciousness debate:
- The **RAG Bot** adopts a structured, example-driven approach, emphasizing Swami Sarvapriyananda’s arguments about AI’s inability to replicate subjective experience.
- The **Generic Non-RAG Bot** provides a broader, scientific critique, highlighting AI’s reliance on algorithms and lack of self-awareness.
Both responses agree that AI cannot achieve consciousness but differ in their reasoning:
- The RAG Bot focuses on imitation vs. genuine understanding.
- The Generic Non-RAG Bot stresses the absence of subjective experience and emotions.
The highlighted text underscores critical terms (e.g., "consciousness," "imitation") and distinguishes the bots’ methodologies. The passages reinforce Swami Sarvapriyananda’s perspective, suggesting the RAG Bot’s answer is more aligned with his views. The debate remains unresolved, reflecting ongoing philosophical and scientific disagreements about AI’s potential.
</details>
Figure 9: RAG vs Non-RAG: AI and Consciousness. The retrieved passages are again highly relevant to the query and the generated answer is coherent. However, it does contain a statement about 'a mirror reflecting a face' which is not present in the passages. This was pointed out by the experts as a relevant statement possibly drawn from outside knowledge. The generic bot's response is highly general and does not represent the monk's views on the topic.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Screenshot: Answers to "Do our senses report reality to us?"
### Overview
The image shows a webpage with two answers to the question "Do our senses report reality to us?" from two AI bots:
1. **RAG Bot** (green box)
2. **Generic Non-RAG Bot** (red box)
Both answers reference Swami Sarvapriyananda and discuss the idea that human senses do not directly report reality but provide a filtered, subjective interpretation.
---
### Components/Axes
- **Text Blocks**: Two distinct answers with highlighted excerpts.
- **Highlighting**:
- **Green Highlights**: Key premises (e.g., "supported by modern cognitive science and evolutionary science").
- **Red Highlights**: Examples or conclusions (e.g., "virtual reality movie" in RAG Bot; "evolutionary advantage" in Generic Non-RAG Bot).
- **References**:
- Swami Sarvapriyananda (mentioned in both answers).
- Advaita Vedanta (philosophy discussed in RAG Bot’s answer).
- Evolutionary science (cited in both answers).
---
### Detailed Analysis
#### RAG Bot Answer
- **Key Points**:
- Senses do not report reality directly.
- Supported by **modern cognitive science** and **evolutionary science**.
- Sensory organs (eyes, ears) provide information but are not infallible.
- Example: Virtual reality immersion causes detachment from "true identity."
- **Advaita Vedanta** philosophy explains that consciousness is separate from perceived entities.
- Misidentification with the mind/body leads to suffering.
#### Generic Non-RAG Bot Answer
- **Key Points**:
- Senses provide a **filtered, interpreted version** of reality.
- Limited by biological constraints (e.g., visible spectrum, frequency range).
- Brain processes sensory inputs, introducing biases and cultural conditioning.
- Evolutionary perspective: Senses evolved to ensure survival, not to reveal "ultimate reality."
- Example: Animals’ senses help them fit into their environment.
---
### Key Observations
1. **Shared Premise**: Both answers agree that senses do not report objective reality.
2. **Divergent Explanations**:
- RAG Bot emphasizes **Vedantic philosophy** (consciousness as separate from perception).
- Generic Non-RAG Bot focuses on **evolutionary biology** (senses as survival tools).
3. **Highlighted Emphasis**:
- RAG Bot stresses **subjective experience** and **true nature of consciousness**.
- Generic Non-RAG Bot highlights **biological limitations** and **evolutionary advantages**.
---
### Interpretation
The answers reflect two lenses for understanding perception:
- **Philosophical (RAG Bot)**: Reality is an illusion (Maya) created by the mind; true reality is consciousness (Atman/Brahman).
- **Scientific (Generic Non-RAG Bot)**: Senses are evolutionary adaptations optimized for survival, not objective truth.
Both critiques challenge the assumption that sensory input equals reality, but they differ in their proposed solutions:
- RAG Bot advocates for **spiritual realization** (Vedanta) to transcend sensory illusion.
- Generic Non-RAG Bot suggests **critical awareness** of sensory limitations to avoid distortion.
The highlighted text underscores the tension between subjective experience and objective truth, a central theme in both cognitive science and Eastern philosophy.
</details>
Figure 10: RAG vs Non-RAG: Senses and reality. In this scenario, the retrieved passages provide only part of the context needed to answer the query, along with additional irrelevant information. For example, one passage mentions cognitive scientist Donald Hoffman, inadvertently leading the LLM to generate a response linking the Swami's views to Hoffman's concept of the world as a virtual reality. This is an instance where the LLM diverges from the query by relying on incomplete information from a passage. Retrieval models capable of processing longer contexts and summarizing them effectively before they are fed to the LLM could mitigate this. Despite this, the rest of the response aligns reasonably well with Advaita Vedanta philosophy, as does the generic bot's response, which is also congruent with the philosophical context.
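The retrieve-then-summarize mitigation suggested in the caption above can be sketched as a small pipeline. This is a minimal illustration, not the system used in this work: `summarize` and `answer` stand in for LLM calls and are passed in as plain callables, and `max_chars` is an assumed threshold for deciding when a passage needs condensing.

```python
def summarize_then_answer(query, passages, summarize, answer, max_chars=500):
    """Condense long retrieved passages before prompting the LLM.

    `summarize` and `answer` are placeholders for LLM calls, supplied as
    callables so the pipeline itself stays model-agnostic.
    """
    condensed = [
        # Only long passages are condensed; short ones pass through intact.
        summarize(p) if len(p) > max_chars else p
        for p in passages
    ]
    context = "\n\n".join(condensed)
    return answer(query, context)
```

In practice, the two callables would wrap separate LLM prompts (one for summarization, one for question answering), keeping the final prompt within the answering model's context window.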
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Screenshot: Conversation Between User and AI Models on Dream Reality
### Overview
The image shows a text-based conversation where a user asks about the reality of dreams versus waking states, referencing Swami Sarvapriyananda and Vedanta philosophy. Two AI models respond: "RAG Bot" and "Generic Non-RAG Bot." The conversation includes quoted passages from the *Mandukya Upanishad* (Chapters 1 and 2) and *Sarvapriyananda*'s teachings.
---
### Components/Axes
- **User Question**: "Dreams feel absolutely real while dreaming. What is the guarantee that I am not fast asleep in my bed and dreaming right now?"
- **RAG Bot Answer**: A detailed philosophical explanation referencing Vedanta, consciousness, and the nature of reality.
- **Generic Non-RAG Bot Answer**: A shorter response discussing empirical testing and lucid dreaming.
- **Passages**: Quoted text from *Mandukya Upanishad* (Chapters 1 and 2) and *Sarvapriyananda*'s commentary.
---
### Detailed Analysis
#### User Question
- Directly asks about the ontological certainty of dreams vs. waking states, invoking skepticism about perception.
#### RAG Bot Answer
- **Key Themes**:
- Vedanta philosophy: Distinction between *waking state* (reality) and *dream state* (illusion).
- Consciousness as the ultimate reality: "The feeling of reality in dreams does not provide a guarantee of permanence or truth."
- Reference to Swami Sarvapriyananda’s teachings on the impermanence of dream objects.
- Highlighted text: "The reality of the waking state is dependent on the perceiving consciousness, just as the reality of the dream state is dependent on the dreaming mind."
#### Generic Non-RAG Bot Answer
- **Key Themes**:
- Empirical testing: Suggests reality testing (e.g., checking clocks, performing actions) to distinguish dreams from waking states.
- Acknowledges limitations: "Dreams can be incredibly vivid and realistic."
- Highlights the role of consciousness: "The dream world seems real because the dreamer considers themselves real."
#### Passages
1. **Chapter 2 Karika 6 (Mandukya Upanishad)**:
- Discusses the nature of the dream state: "The person in the dream... is actually not there. You are actually not there. They are all your own mind."
- Emphasizes the illusory nature of dream objects: "These waking entities do not exist apart from consciousness."
2. **Chapter 1 Mantra 5-6 (Mandukya Upanishad)**:
- Describes the dream state as a "waking state for the dreamer," where objects are "not felt to be temporary or about to disappear."
- Highlights the relationship between the dreamer and dream objects: "The dream world seems real because the dreamer considers themselves real."
---
### Key Observations
- **Philosophical vs. Empirical Approaches**:
- RAG Bot focuses on metaphysical distinctions (Vedanta), while the Generic Non-RAG Bot emphasizes practical testing.
- **Consciousness as Central**:
- Both answers tie reality to consciousness, but RAG Bot frames it as an absolute truth, whereas the Generic Bot treats it as a perceptual construct.
- **Quoted Text Highlights**:
- The passages stress the impermanence of dream objects and the dreamer’s subjective experience.
---
### Interpretation
The conversation contrasts two perspectives on reality:
1. **RAG Bot’s Vedantic View**: Reality is rooted in consciousness. Dreams are illusions because their objects lack permanence and are products of the mind. The waking state is "real" only insofar as it is experienced by the conscious self.
2. **Generic Non-RAG Bot’s Pragmatic View**: Reality can be tested empirically (e.g., through inconsistencies in dream logic). However, it acknowledges the difficulty of distinguishing dreams from waking states due to their subjective nature.
The quoted passages from the *Mandukya Upanishad* reinforce the idea that both dreams and waking states are mental constructs, with the ultimate reality lying beyond both. Swami Sarvapriyananda’s teachings further clarify that the dream state is a "waking state for the dreamer," where objects feel real but are ultimately ephemeral.
The dialogue underscores a tension between philosophical certainty (RAG Bot) and empirical uncertainty (Generic Bot), reflecting broader debates in epistemology about the nature of perception and reality.
</details>
Figure 11: RAG vs Non-RAG: Waking and dreaming. The passages retrieved for the query are relevant but offer only a segment of a broader context. As noted earlier, this can be mitigated by using more advanced retrieval models that summarize longer contexts before they are fed to the LLM. The RAG bot's response, while relevant, is lengthy and complex, making it challenging to comprehend. The generic bot's response, although more concise and coherent, is completely general and bears no relation to Advaita Vedanta philosophy.
Figure 12: Keyword-RAG vs Standard-RAG: Anecdotal. The keyword-based retriever accurately retrieves the correct snippet, while the dense retriever confuses 'Shivaratri' with 'Shivamayananda', possibly owing to similar embeddings, and misses the context entirely. As a result, the generated standard-RAG answer is meaningless and lacks substance. The keyword-RAG model correctly identifies the two key incidents relating to Shivaratri and summarizes them effectively.
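The contrast between exact keyword matching and dense-embedding similarity can be made concrete with a toy keyword retriever. The sketch below is illustrative only (the function names and the plain term-frequency scoring are assumptions, similar in spirit to BM25-style term matching); its point is that under exact matching, 'Shivaratri' can never partially match 'Shivamayananda'.

```python
from collections import Counter

def keyword_score(query_keywords, passage):
    # Exact term-frequency overlap: a keyword either matches a token or it
    # does not, so near-miss names contribute nothing to the score.
    tokens = Counter(passage.lower().split())
    return sum(tokens[k.lower()] for k in query_keywords)

def retrieve(query_keywords, corpus, top_k=2):
    # Rank passages by keyword overlap and return the best top_k.
    ranked = sorted(corpus, key=lambda p: keyword_score(query_keywords, p),
                    reverse=True)
    return ranked[:top_k]
```

A dense retriever, by contrast, embeds both query and passages and ranks by vector similarity, which is precisely where similar-sounding but unrelated terms can collide in embedding space.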
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Screenshot: Comparison of Keyword-based RAG vs Standard RAG for Shivaratri Incident Analysis
### Overview
The image compares two approaches to answering the query "Does Swami narrate any incident surrounding Shivaratri?" through two text-based response formats: Keyword-based RAG (left) and Standard RAG (right). Both responses reference Hindu spiritual teachings and Swami narratives, with highlighted text indicating key points.
### Components/Axes
**Header Section:**
- **Category:** Anecdotal
- **Query:** "Does Swami narrate any incident surrounding Shivaratri?"
- **Keywords:** Shivaratri
- **Human-annotated:** Shivaratri
- **Model ensemble:** Shivaratri
**Left Section (Keyword-based RAG):**
- **Title:** Keyword-based RAG
- **Source:** "Raja Yoga: The Path of Meditation (Part 1)" by Swami Sarvapriyananda
- **Highlighted Text:**
- "Shivaratri" (multiple instances)
- "fasting on Shivaratri"
- "mathematics teacher approached a Swami"
- "young children are able to fast"
**Right Section (Standard RAG):**
- **Title:** Standard RAG
- **Source:** "79 Bhagavad Gita I Chapter 6 Verse 15" by Swami Sarvapriyananda
- **Highlighted Text:**
- "Swami Shivamayananda" (multiple instances)
- "living form of the chosen deity"
- "Brahman in all beings"
### Detailed Analysis
**Keyword-based RAG Content:**
- Describes Swami Sarvapriyananda's personal experience fasting on Shivaratri for the first time as a monk
- Includes a story about a mathematics teacher questioning fasting practices
- Emphasizes Shivaratri's significance through repeated mentions (4x)
- Contains direct quotes about fasting traditions and spiritual discipline
**Standard RAG Content:**
- Focuses on Swami Shivamayananda's teachings about Brahman realization
- Contains 3 direct references to Shivaratri (1x explicit mention, 2x contextual)
- Includes philosophical discussion about divine presence in all forms
- Contains a narrative about a vice president's spiritual experience
### Key Observations
1. Both responses contain Shivaratri references despite different approaches
2. Keyword-based RAG shows 4x direct mentions vs 1x in Standard RAG
3. Standard RAG emphasizes broader philosophical concepts (Brahman, living forms)
4. Both sources reference Swami teachings but different Swamis (Sarvapriyananda vs Shivamayananda)
5. Highlighted text in Standard RAG focuses on metaphysical concepts rather than specific incidents
### Interpretation
The comparison reveals fundamental differences in RAG approaches:
- **Keyword-based RAG** prioritizes direct incident retrieval through explicit keyword matching (Shivaratri appears 4x)
- **Standard RAG** demonstrates contextual understanding, connecting Shivaratri concepts to broader spiritual teachings
- The highlighted text shows that while Standard RAG doesn't explicitly mention Shivaratri, it addresses related concepts (divine forms, fasting)
- Both approaches validate Swami teachings about Shivaratri but through different narrative lenses
- The mathematical teacher anecdote in Keyword-based RAG provides concrete incident verification
- Standard RAG's focus on Brahman realization suggests deeper philosophical connections to Shivaratri practices
The analysis demonstrates how RAG systems handle religious queries differently - one through direct incident retrieval, the other through conceptual mapping to broader spiritual teachings.
</details>
Figure 13: Keyword-RAG vs Standard-RAG: Terminology. The keyword-RAG model retrieves a comprehensive exposition on the concept of Shraddha, loosely translated as conviction, in the context of qualifications for the study of Advaita Vedanta. The standard RAG retrieves a passage containing the word, but one that is not directly related to what the questioner intends. This appears to be an unfortunate false positive for standard RAG, caused by the query's meaning being inadequately stated or merely implied.
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Screenshot: Comparison of Keyword-based RAG and Standard RAG Systems
### Overview
The image compares two approaches to Retrieval-Augmented Generation (RAG) systems: **Keyword-based RAG** (green-highlighted text) and **Standard RAG** (red-highlighted text). Both sections analyze the significance of the Sanskrit term "Shraddha" in Hindu spiritual traditions, with distinct philosophical interpretations and source citations.
---
### Components/Axes
- **Headers**:
- **Category**: Terminology
- **Query**: "What is the significance of the word 'Shraddha'?"
- **Keywords**: Shraddha, significance
- **Sections**:
- **Keyword-based RAG** (left, green-highlighted text)
- **Standard RAG** (right, red-highlighted text)
- **Sources**:
- **Keyword-based RAG**:
- *123. Bhagavad Gita | Chapter 10 Verse 16-22 | Swami Sarvapriyananda*
- **Standard RAG**:
- *15. Mandukya Upanishad | Chapter 1 Mantra 12 & Karika 22-23 | Swami Sarvapriyananda*
---
### Detailed Analysis
#### Keyword-based RAG (Green Highlights)
- **Definition**:
- "Shraddha" is translated as "faith" but emphasizes **reverence, respect, and acceptance** for teachers, teachings, and texts.
- Requires an "open mind and heart" to avoid judgment or dismissal of spiritual concepts.
- **Philosophical Role**:
- Acts as a foundation for meaningful learning by bridging gaps between current understanding and teachings.
- Example: Swami Sarvapriyananda stresses that questioning is essential to cultivate devotion ("bhakti") and spiritual growth.
- **Source Text**:
- Quotes from the *Bhagavad Gita* (Chapter 10, Verses 16-22) highlight the importance of humility and curiosity in spiritual inquiry.
#### Standard RAG (Red Highlights)
- **Definition**:
- Focuses on **ritualistic practices** (e.g., funeral rites for ancestors) and their psychological/spiritual benefits.
- Describes "Shraddha" as a process of **letting go** of grief and separation from loved ones.
- **Psychological/Spiritual Impact**:
- Rituals provide closure for the living and help the deceased transition to a "higher path" (gati) in the afterlife.
- Emphasizes collective karma and the role of rituals in spiritual evolution.
- **Source Text**:
- Quotes from the *Mandukya Upanishad* (Chapter 1, Mantra 12 & Karika 22-23) discuss the symbolism of rituals and their role in liberation (moksha).
---
### Key Observations
1. **Keyword-based RAG** prioritizes **attitudinal readiness** (reverence, humility) as prerequisites for spiritual learning.
2. **Standard RAG** emphasizes **ritualistic action** (e.g., Shraddha ceremonies) as a means of psychological release and spiritual progression.
3. Both systems reference Swami Sarvapriyananda but draw from different texts (*Bhagavad Gita* vs. *Mandukya Upanishad*), reflecting divergent interpretive frameworks.
---
### Interpretation
- **Keyword-based RAG** aligns with Swami Sarvapriyananda’s devotional approach, where faith is a mindset enabling deeper engagement with spiritual teachings. The absence of Shraddha risks superficial understanding or rejection of complex concepts.
- **Standard RAG** reflects a ritual-centric view, where Shraddha ceremonies address grief and facilitate ancestral transitions. This approach ties spiritual practices to tangible outcomes (e.g., karma, gati).
- **Contrast**: The former focuses on **internal attitude** toward learning, while the latter emphasizes **external rituals** for spiritual and psychological well-being. Both, however, agree on Shraddha’s foundational role in Hindu spirituality.
---
### Notes on Language
- All text is in English, with direct references to Sanskrit terms (e.g., "Shraddha," "gati," "karma") and their contextual meanings.
- No non-English text is present beyond source citations.
</details>
<details>
<summary>Image 14 Details</summary>

### Visual Description
## Screenshot: Comparative Analysis of Buddhist Shunyavada and Advaita Vedanta
### Overview
The image is a screenshot of a comparative analysis between Buddhist Shunyavada and Advaita Vedanta philosophies, focusing on their views of the self. It includes two sections:
1. **Keyword-based RAG** (green border)
2. **Standard RAG** (red border)
Both sections discuss the core differences in their interpretations of existence, self, and impermanence, citing sources from Buddhist and Hindu texts.
---
### Components/Axes
#### Header Section
- **Category**: Comparative
- **Query**: "What is the main difference between Buddhist Shunyavada and Advaita Vedanta?"
- **Keywords**:
- Human-annotated: "Buddhist Shunyavada, Advaita"
- Model ensemble: "difference, buddhist shunyavada, advaita vedanta"
#### Keyword-based RAG Section
- **Title**: Keyword-based RAG
- **Content**:
- Discusses the concept of "self" (Atman) in Advaita Vedanta vs. "no permanent self" (Anatta) in Shunyavada.
- Highlights impermanence (Anitya) in Shunyavada and the existence of a permanent self in Advaita.
- Notes shared truths between the philosophies (e.g., emptiness, momentariness).
- **Source**:
- Advaita Vedanta (Talk 1) - Swami Sarvapriyananda
#### Standard RAG Section
- **Title**: Standard RAG
- **Content**:
- Compares Nagarjuna’s Shunyavada with Advaita Vedanta, addressing misinterpretations (e.g., nihilism vs. Brahman).
- Cites Swami Sarvapriyananda’s analysis of texts like the Mandukya Upanishad.
- Emphasizes that both philosophies reject nihilism but differ in their views on existence (vivarta vs. satya).
- **Source**:
- 65. Mandukya Upanishad | Chapter 4 Karika 74-78 | Swami Sarvapriyananda
---
### Detailed Analysis
#### Keyword-based RAG
- **Main Argument**:
- Advaita Vedanta posits a permanent, unchanging self (Atman), while Shunyavada denies any permanent self.
- Shunyavada views all phenomena (body, mind, sensations) as impermanent and momentary.
- **Key Phrases**:
- "The body is not the self" (Advaita) vs. "no permanent self" (Shunyavada).
- "Everything is transient" (Shunyavada) vs. "impermanent" (Advaita).
#### Standard RAG
- **Main Argument**:
- Traditional Shunyavada is misrepresented as nihilism, but Nagarjuna’s texts clarify that it acknowledges dependent origination (pratityasamutpada).
- Advaita Vedanta’s Brahman is an eternal substance, contrasting with Shunyavada’s emphasis on emptiness (śūnyatā).
- **Key Phrases**:
- "Nagarjuna’s Shunyavada is not nihilism" (highlighted).
- "Brahman as eternal substance" (highlighted).
---
### Key Observations
1. **Contrasting Views on Self**:
- Advaita Vedanta asserts a permanent self (Atman), while Shunyavada rejects it entirely.
2. **Misinterpretations**:
- Traditional Advaitins dismiss Shunyavada as nihilism, but modern scholars (e.g., Swami Sarvapriyananda) argue both philosophies share deeper truths.
3. **Textual Sources**:
- Both sections cite Swami Sarvapriyananda, indicating a focus on his interpretations of Buddhist and Hindu texts.
---
### Interpretation
- **Philosophical Divide**:
The image highlights a fundamental disagreement: Advaita Vedanta’s eternal self vs. Shunyavada’s rejection of permanence. However, both philosophies agree on the impermanence of the body-mind complex.
- **Role of Misinterpretation**:
The Standard RAG section critiques traditional Advaitin dismissals of Shunyavada as nihilism, suggesting a need for nuanced understanding of Nagarjuna’s middle path.
- **Shared Truths**:
Despite differences, both philosophies acknowledge concepts like emptiness and momentariness, implying overlapping insights into the nature of reality.
---
**Note**: No numerical data or visual trends are present. The analysis is based solely on textual content and philosophical arguments.
</details>
Figure 14: Keyword-RAG vs Standard-RAG: Comparative. This is a case where both models retrieve meaningful passages, with the keyword-based model providing greater context due to retrieval expansion. The standard model also appears to draw on outside knowledge not mentioned in the passage.
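The retrieval expansion mentioned in the caption above can be sketched simply: once a chunk is matched, its neighbouring chunks are also returned so the LLM sees the surrounding context rather than an isolated snippet. The function below is a hypothetical illustration under that assumption, not the system's actual implementation; `window` controls how many chunks on each side of a hit are included.

```python
def expand_retrieval(chunks, hit_indices, window=1):
    """Expand each retrieved hit with its neighbouring chunks.

    `chunks` is the transcript split into ordered segments; `hit_indices`
    are the positions matched by the retriever. Overlapping windows are
    merged by collecting indices into a set before sorting.
    """
    expanded = set()
    for i in hit_indices:
        for j in range(max(0, i - window), min(len(chunks), i + window + 1)):
            expanded.add(j)
    return [chunks[j] for j in sorted(expanded)]
```

This keeps the retrieved evidence in its original order, at the cost of a longer prompt, which is why it pairs naturally with a summarization step before the final LLM call.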
Figure 15: Keyword-RAG vs Standard-RAG: Scriptural. Both models retrieve from the correct scripture. However, while the retrieved passage in keyword-RAG is very comprehensive and accurately captures the context of the question, the standard RAG misses the point of the question and gives a generic answer.
<details>
<summary>Image 15 Details</summary>

### Visual Description
## Textual Comparison: Keyword-based RAG vs Standard RAG
### Overview
The image presents a side-by-side comparison of two RAG (Retrieval-Augmented Generation) model outputs analyzing the significance of the word "Om" in the Mandukya Upanishad. The comparison highlights differences in content structure, source attribution, and interpretive focus between keyword-based and standard RAG approaches.
### Components/Axes
1. **Header Section** (Blue background):
- **Category**: Scriptural
- **Query**: "In Mandukya Upanishad, what is the significance of the word 'Om'?"
- **Keywords**:
- Human-annotated: "Om significance", "Mandukya Upanishad"
- Model ensemble: "Om", "Mandukya Upanishad"
2. **Left Section** (Green background):
- **Title**: Keyword-based RAG
- **Content**: Detailed explanation of "Om" as representing the universe's gross (A), subtle (U), and causal (Ma) levels of existence. Emphasizes the three letters' symbolic meaning and their connection to waking, dreaming, and deep sleep states.
- **Source**:
- "48. Mandukya Upanishad | Chapter 3 Karika 37-39 | Swami Sarvapriyananda"
3. **Right Section** (Red background):
- **Title**: Standard RAG
- **Content**: Focuses on "Om" as a Mahavakya (profound statement) encapsulating "I am Atma Brahman" (the self is ultimate reality). Discusses meditation on "Om" to connect with pure consciousness.
- **Source**:
- "13. Mandukya Upanishad | Chapter 1 Mantra 9-11 | Swami Sarvapriyananda"
### Content Details
**Keyword-based RAG**:
- Describes "Om" as a symbol of the universe's three levels of existence (gross, subtle, causal).
- Explains the three letters (A, U, Ma) correspond to physical world, dreaming state, and deep sleep state respectively.
- Mentions "Ma" as the causal level of pure consciousness beyond manifestation.
- References the Upanishad's teaching about transcending dreamer/sleeper identities to realize Atma (true self).
**Standard RAG**:
- Presents "Om" as a Mahavakya ("I am Atma Brahman") representing ultimate truth.
- Emphasizes meditation on "Om" to transcend physical reality and connect with Brahman.
- Uses metaphors like "snake/rope" to illustrate mistaking transient phenomena for ultimate reality.
- References the concept of "Brahman" as the formless, all-pervading consciousness.
### Key Observations
1. **Source Variation**:
- Keyword-based RAG cites later chapters (3.37-39) focusing on symbolic analysis.
- Standard RAG references earlier chapters (1.9-11) emphasizing Mahavakyas.
2. **Interpretive Focus**:
- Keyword-based approach emphasizes ontological levels (gross/subtle/causal).
- Standard RAG prioritizes epistemological realization (self as Brahman).
3. **Linguistic Nuance**:
- Both sections use Sanskrit terms (e.g., "Atma", "Brahman") but differ in explanatory depth.
- Keyword-based RAG includes technical terms like "Manas" (mind) and "Jnana" (knowledge).
### Interpretation
The comparison reveals how keyword selection influences RAG output:
- **Keyword-based RAG** (green) provides a structural analysis of "Om" as a cosmic symbol, aligning with later Upanishadic commentaries that dissect its phonetic and metaphysical dimensions.
- **Standard RAG** (red) emphasizes the word's role as a meditative tool for self-realization, reflecting earlier Upanishadic teachings on Mahavakyas as direct paths to Brahman realization.
The divergence suggests that keyword density and chapter focus shape the model's retrieval strategy. The Standard RAG's emphasis on "Mahavakya" and "Brahman" indicates prioritization of core Upanishadic teachings, while the Keyword-based approach reflects a more analytical, systematic breakdown of the concept. This demonstrates how RAG systems can produce qualitatively different interpretations of the same scriptural concept based on input parameters.
</details>
<details>
<summary>Image 16 Details</summary>

### Visual Description
## Screenshot: Comparison of Keyword-based RAG and Standard RAG for Reasoning Query
### Overview
The image compares two approaches to answering a philosophical query about the similarity between waking and dream states. It presents two text blocks labeled "Keyword-based RAG" (green) and "Standard RAG" (red), each containing extracted passages from different sources discussing the nature of consciousness states.
### Components/Axes
- **Left Section (Keyword-based RAG)**:
- Header: "Keyword-based RAG" in green
- Content: Philosophical analysis of waking vs. dream states from Swami Sarvapriyananda's Vedanta Retreat lecture
- Source: "Vedanta Retreat at Loon Lake on Who am I by Swami Sarvapriyananda Lecture 11"
- Highlighted text: Key distinctions between waking (vyavaharika jiva) and dream (pratibhasika jiva) states
- **Right Section (Standard RAG)**:
- Header: "Standard RAG" in red
- Content: Analysis from Mandukya Upanishad (Chapter 1, Mantra 5-6)
- Source: "4. Mandukya Upanishad | Chapter 1 Mantra 5-6 | Swami Sarvapriyananda"
- Highlighted text: Emphasis on the relativity of both states
### Detailed Analysis
**Keyword-based RAG (Left Section)**:
1. **Core Argument**:
- Waking state (vyavaharika jiva) vs. dream state (pratibhasika jiva)
- Both are temporary/illusory but differ in perceived permanence
- Dream state merges back into waking consciousness upon awakening
2. **Key Passages**:
- "The waking state is experienced as more real and permanent than the dream state"
- "From the ultimate standpoint of Vedanta, both are illusory"
- "The dream jiva merges back into the waking jiva upon awakening"
3. **Source Context**:
- Draws from Advaita Vedanta philosophy
- References concepts like Brahman, Atman, and Maya
**Standard RAG (Right Section)**:
1. **Core Argument**:
- Similarities: Both states involve consistent perception of objects/relatives
- Differences: Waking state has "feeling of permanence"
- Philosophical nuance: "Acknowledges relative reality without equating states"
2. **Key Passages**:
- "In the dream state, objects and experiences change from one dream to another"
- "The waking world is more powerful than the dream world"
- "Reality is not a state but what appears as these states"
3. **Source Context**:
- Draws from Mandukya Upanishad (ancient Sanskrit text)
- References concepts like Turiya (absolute reality) and dream states
### Key Observations
1. Both approaches agree on fundamental similarities:
- Both states involve illusory perception of reality
- Both involve consistent experience of objects/relatives
2. Key differences:
- Keyword-based RAG emphasizes the temporary nature of both states
- Standard RAG focuses on the relative permanence of waking state
- Different source materials (modern lecture vs. ancient text)
3. Highlighted text patterns:
- Green highlights: Focus on state distinctions and transitions
- Red highlights: Emphasis on relative reality and philosophical nuance
### Interpretation
The comparison reveals how different RAG implementations handle philosophical queries:
1. **Keyword-based RAG** prioritizes:
- Clear state distinctions (waking vs. dream)
- Temporal aspects of consciousness
- Modern philosophical interpretations
2. **Standard RAG** emphasizes:
- Relative reality of both states
- Ancient textual sources (Mandukya Upanishad)
- Nuanced philosophical positions (Turiya, Brahman)
The analysis suggests that RAG implementation choices affect:
- Depth of philosophical nuance captured
- Source material selection
- Emphasis on state transitions vs. state characteristics
- Balance between similarity and distinction in consciousness states
The highlighted text patterns indicate that the query's focus on "similarity" vs. "indistinguishability" influences how each RAG system processes and presents the information.
</details>
Figure 16: Keyword-RAG vs Standard-RAG: Reasoning. The passage retrieved by keyword-RAG is technical and comprehensive, and the generated answer effectively summarizes its main points. The standard model also performs well, although its explanation is less effective owing to the quality of the retrieval.