# BayesAgent: Bayesian Agentic Reasoning Under Uncertainty via Verbalized Probabilistic Graphical Modeling
> Correspondence to Hengguan Huang.
## Abstract
Human cognition excels at transcending sensory input and forming latent representations that structure our understanding of the world. While Large Language Model (LLM) agents demonstrate emergent reasoning and decision-making abilities, they lack a principled framework for capturing latent structures and modeling uncertainty. In this work, we explore for the first time how to bridge LLM agents with probabilistic graphical models (PGMs) to address agentic reasoning under uncertainty. To this end, we introduce Verbalized Probabilistic Graphical Modeling (vPGM), a Bayesian agentic framework that (i) guides LLM agents in following key principles of PGMs through natural language and (ii) refines the resulting posterior distributions via numerical Bayesian inference. Unlike many traditional probabilistic methods requiring substantial domain expertise, vPGM bypasses expert‐driven model design, making it well‐suited for scenarios with limited assumptions. We evaluate our method on several agentic reasoning tasks, both closed-ended and open-ended. Our results indicate that the method effectively enhances confidence calibration and text generation quality.
Code and Appendix — https://github.com/xingbpshen/agentic-reasoning-vpgm
## Introduction
In addressing complex reasoning problems, such as solving challenging science questions, the human brain is thought to have the capability to go beyond mere sensory input, potentially forming insights into latent patterns of the world. This ability suggests that humans might have a sophisticated skill to interpret the underlying structures and uncertainties (Tenenbaum et al. 2011), although the exact mechanisms remain the subject of ongoing research and debate. As of now, such depth of understanding demonstrated by humans has not been fully achieved in artificial intelligence (AI) systems (Lake et al. 2017; Bender and Koller 2020; Zheng et al. 2021; Sumers et al. 2023).
While large language models (LLMs) have demonstrated impressive capabilities in processing and generating human language (Devlin et al. 2018; Brown et al. 2020; Achiam et al. 2023), their performance is often constrained by the scope of their training data. These models, built primarily on vast corpora of text, excel at generating responses that are syntactically coherent and contextually relevant. Recent advances such as chain-of-thought (CoT) prompting (Wei et al. 2022) and the emergence of agentic paradigms (Yao et al. 2023; Schick et al. 2023) have extended their capabilities toward interactive and compositional agentic reasoning. However, when operating as autonomous agents in uncertain or partially observable environments, where implicit knowledge and the ability to integrate and reason over undisclosed information from multiple sources become essential, skills that humans typically employ in complex reasoning, LLM agents often struggle. This limitation arises not only from their dependence on surface-level linguistic correlations but also from the absence of a principled Bayesian framework to capture latent structures and model uncertainty.
In this work, we explore for the first time how to bridge LLM agents with probabilistic graphical models (PGMs) to address agentic reasoning under uncertainty. To this end, we introduce Verbalized Probabilistic Graphical Modeling (vPGM), a Bayesian agentic framework that combines the strengths of LLM agentic reasoning with explicit numerical Bayesian inference. Unlike traditional Bayesian inference frameworks (Griffiths, Kemp, and Tenenbaum 2008; Bielza and Larrañaga 2014; Wang and Yeung 2020; Abdullah, Hassan, and Mustafa 2022), which typically require substantial domain expertise, vPGM bypasses expert-driven model design, making it well-suited for scenarios with limited assumptions. Specifically, although Bayesian structure learning methods (Kitson et al. 2023) facilitate the discovery of Bayesian networks, they often require expert domain knowledge for manual validation of statistical dependencies or rely on computationally expensive scoring functions to assess the graphical model’s goodness of fit to the data. Our approach leverages the knowledge and reasoning capabilities of LLMs by guiding them to simulate Bayesian reasoning principles, while augmenting uncertainty quantification through a learnable Bayesian surrogate, thus significantly reducing the reliance on expert input.
Concretely, our method consists of three initial stages: (1) Graphical Structure Discovery, in which the LLM is prompted to identify latent variables and their probabilistic dependencies; (2) Prompting-Based Inference, where LLMs are guided to infer verbalized posterior distributions of each latent variable given new input data; and (3) Predictions under Uncertainty, where confidence in the final predictions is achieved by computing the expected value of the conditional predictive distribution over the inferred latent variables. Furthermore, to fully leverage the multiple response samples generated by LLMs within the vPGM framework and enhance uncertainty quantification, we extend vPGM with numerical Bayesian inference techniques that infer posterior distributions over predictions and augment confidence calibration through a theoretically guaranteed differentiable calibration loss function.
We evaluate our method on several agentic reasoning tasks, designed in both closed-ended and open-ended answering formats. The experiments demonstrate improvements in confidence calibration and the quality of generated responses, highlighting the efficacy of vPGM in enhancing the probabilistic reasoning capabilities of LLM agents.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparison of AI Reasoning Systems (Chameleon vs. BayesVPGM)
### Overview
This image is a technical diagram comparing the reasoning processes of two AI systems, "Chameleon" and "BayesVPGM (ours)", when presented with the same visual science question. The diagram illustrates how each system processes the input, the tools or methods they employ, and the final answer they generate, highlighting a failure case for Chameleon and a success case for the proposed BayesVPGM method.
### Components/Axes
The diagram is divided into two primary horizontal sections:
1. **Top Section (Chameleon System):**
* **Input (Left):** A "Question" box containing an image of two beakers (Solution A and Solution B) and a multiple-choice question.
* **Agent Tools (Center):** A vertical block listing tools used: "Knowledge Retriever", "Image Captioner", and "OCR".
* **Chameleon Pipeline (Right):** A flow from "Solution Generator" to "Answer Generator", culminating in a final answer box with a red "X" icon.
2. **Bottom Section (BayesVPGM System):**
* **Input (Left):** A box labeled "Latent Variables + CPDs" with symbols Z₁, Z₂, etc.
* **Verbalized PGM Inference (Left-Center):** A blue box detailing probabilistic reasoning steps.
* **LLM (Center):** A vertical block representing a Large Language Model.
* **Verbalized Inference Results (Center-Right):** A purple box showing the LLM's assessment of probabilities.
* **Numerical Bayesian Inference (Right):** A gray box leading to the final answer.
* **Final Output (Bottom-Right):** A green box with a checkmark icon, labeled "BayesVPGM (ours)".
### Detailed Analysis
**1. The Question (Common Input):**
* **Image:** Two beakers labeled "Solution A" and "Solution B". Both have a "Solvent volume: 25 mL". Solution A contains 3 pink particles. Solution B contains 6 pink particles.
* **Text:** "Which solution has a higher concentration of pink particles? (A) Same (B) Solution A **(C) Solution B**" (The correct answer, (C), is bolded in the diagram).
**2. Chameleon System Process & Output:**
* **Knowledge Retriever Output:** "A solution is made up of two or more substances that are completely mixed. In a solution, solute particles are mixed into a solvent..."
* **Image Captioner Output:** "A close-up picture of a **wii game controller**." (The phrase "wii game controller" is highlighted in red, indicating an error).
* **OCR Output:** "None detected."
* **Solution Generator Reasoning:** "To determine which solution has a higher concentration...Therefore, the answer is B. Probability (0.852)."
* **Final Answer Generator Output:** "**Answer (B) with Probability (0.852)**" accompanied by a red "X" icon, indicating this is incorrect.
**3. BayesVPGM System Process & Output:**
* **Verbalized PGM Inference Steps:**
* `P(Z₁|X)`: "assess the probability of external knowledge relevance given knowledge retrieval outputs."
* `P(Z₂|Z₁, X)`: "integrate the information from Z₁ and assess the probability of discrepancy between visual information and the given question or the context."
* **Verbalized Inference Results (from LLM):**
* Assessment of Z₁: "Given the lack of useful retrieved knowledge and Bing search response, the probability of Z₁ capturing the essential knowledge and context accurately is low: `P(Z₁|X) = 0.2`"
* Assessment of Z₂: "Detected Text: None provided. Image Caption: Mentions **a wii game controller**, which is **not relevant to the question or the context**... the probability of Z₂ accurately reflecting the meaning difference and assigning appropriate weightage is low: `P(Z₂|Z₁, X) = 0.2`"
* **Final Output:** Answer (C) with Probability (0.510) accompanied by a green checkmark icon, indicating this is correct.
### Key Observations
1. **Critical Failure in Chameleon:** The "Image Captioner" tool in the Chameleon pipeline catastrophically misidentifies the beaker diagram as "a wii game controller." This erroneous visual input propagates through the system.
2. **Chameleon's Overconfidence:** Despite the flawed visual input, the Chameleon "Solution Generator" produces a high-confidence (0.852) but incorrect answer (B).
3. **BayesVPGM's Error Detection:** The BayesVPGM system explicitly identifies the irrelevance of the "wii game controller" caption (`P(Z₂|Z₁, X) = 0.2`), demonstrating a capacity for self-critique and uncertainty quantification.
4. **Probabilistic Reasoning:** BayesVPGM uses verbalized conditional probabilities (`P(Z₁|X)`, `P(Z₂|Z₁, X)`) to model its own uncertainty about the quality of its reasoning steps before arriving at a final numerical probability.
5. **Outcome:** The system with explicit uncertainty modeling (BayesVPGM) arrives at the correct answer (C), albeit with lower confidence (0.510), while the system without it (Chameleon) fails confidently.
### Interpretation
This diagram serves as a case study and argument for the proposed "BayesVPGM" method. It demonstrates a scenario where a standard multimodal AI pipeline (Chameleon) fails due to a severe error in one of its sub-components (the image captioner). The failure is compounded because the system lacks a mechanism to question or down-weight the confidence of that faulty component.
The BayesVPGM approach is presented as a solution. By "verbalizing" its probabilistic graphical model (PGM) inference, it forces the underlying LLM to explicitly reason about the reliability of its own tools and retrieved information. The low probabilities assigned to the latent variables (`Z₁`, `Z₂`) reflect the system's awareness that its inputs are unreliable. This self-aware uncertainty allows the final Bayesian inference step to correctly discount the misleading information and converge on the right answer, even if with modest confidence.
The core message is that for robust AI reasoning, especially in multimodal settings, it is crucial to move beyond generating single-point answers and instead model and communicate the system's own uncertainty about its reasoning process. The diagram visually contrasts the "black-box" failure of one system with the "self-reflective" success of the other.
</details>
Figure 1: Example of inference using the BayesVPGM. The Chameleon framework erroneously assigns high confidence to the answer despite its LLM agents capturing irrelevant information. Conversely, our BayesVPGM accurately identifies this discrepancy and assigns low confidence. Here, we show a simplified inference prompt. See Appendix for detailed examples.
## Related Work
Research on large language models (LLMs) has recently transitioned from static prompting toward LLM agents or agentic systems capable of agentic reasoning, tool use, and interactive decision-making. We discuss both threads respectively, highlighting their limitations and how our proposed vPGM addresses a key missing component: probabilistic latent-variable reasoning and uncertainty calibration for agentic reasoning tasks.
#### LLM Prompting
Prompting methods for LLMs form a long-standing research line centered on training-free steering of model responses. Early approaches include in-context learning (Brown et al. 2020), where models are conditioned on task-specific demonstrations, and instruction prompting (Wang et al. 2022b; Ouyang et al. 2022), which embeds explicit task instructions directly into natural-language prompts. A major development is Chain-of-Thought (CoT) prompting (Wei et al. 2022), which elicits intermediate reasoning steps to enhance complex reasoning. Subsequent variants extend CoT to more flexible or automated settings: zero-shot CoT (Kojima et al. 2022), automatic rationale generation (Auto-CoT) (Zhang et al. 2022; Shum, Diao, and Zhang 2023; Yao et al. 2024), self-consistency decoding (Wang et al. 2022a), and chain-of-continuous-thought (Hao et al. 2024), which embeds reasoning trajectories in a latent space. Additionally, Xiong et al. (2023) built upon the consistency-based method and conducted an empirical study on confidence elicitation for LLMs. In contrast, our proposed vPGM tackles the confidence elicitation problem from a Bayesian perspective, following the principles of the more theoretically grounded PGM framework.
#### LLM Agents and Agentic Systems
Building on these prompting advances, LLM prompting has evolved into LLM agents, which interleave reasoning with actions, tool use, and interaction with external environments. ReAct (Yao et al. 2023) combines natural-language reasoning with tool calls and environment feedback; Toolformer (Schick et al. 2023) uses self-supervised signals to teach LLMs when and how to invoke tools, and ADAS (Wang et al. 2025) automates the design of agentic system architectures. These systems mark a shift from passive text generation to interactive, tool-augmented behavior. However, existing agentic approaches typically lack a principled probabilistic framework: they do not explicitly model latent variables, quantify uncertainty, or perform Bayesian belief updating, which limits their applicability in settings that require calibrated agentic reasoning under uncertainty.
#### Concurrent Work
Several concurrent works explore the use of LLMs for probabilistic or causal modeling, but they are largely orthogonal to our contribution. Recent causal-discovery studies (Wan et al. 2025; Constantinou, Kitson, and Zanga 2025) focus on learning causal relationships and counterfactuals, whereas vPGM targets non-causal probabilistic latent-variable reasoning and uncertainty calibration for multi-source agentic tasks. BIRD (Feng et al. 2025) introduces a Bayesian inference wrapper for LLMs, yet it is restricted to binary decision-making and is therefore not directly applicable to our multi-class and open-ended outputs. In contrast, vPGM provides a unified Bayesian framework for latent-variable reasoning and calibrated uncertainty within LLM agents.

> Footnote: Although we set $n \le 4$ in this example, the LLM may generate the maximum number of variables. To reduce redundancy, we can add additional constraints to encourage a more compact representation.
## Our Method: Verbalized Probabilistic Graphical Modeling (vPGM)
Verbalized Probabilistic Graphical Modeling (vPGM) is a Bayesian agentic reasoning approach that leverages Large Language Model (LLM) agents to simulate key principles of Probabilistic Graphical Models (PGMs) in natural language. Unlike many existing probabilistic methods that demand extensive domain knowledge and specialized training, vPGM bypasses the need for expert-based model design, making it suitable for handling complex reasoning tasks where domain assumptions are limited or data are scarce.
### Overview of vPGM
From an application standpoint, vPGM can be embedded into a range of complex reasoning systems, such as agentic reasoning tasks (see Figure 1). Our approach factorizes the overall reasoning process into three core steps: (1) Graphical Structure Discovery, in which the LLM is prompted to identify latent variables and their probabilistic dependencies (see Figure 2); (2) Prompting-Based Inference, where LLMs are guided to infer verbalized posterior distributions of each latent variable given new input data; and (3) Predictions under Uncertainty, where confidence in the final predictions is achieved by computing the expected value of the conditional predictive distribution over the inferred latent variables.
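The three steps above can be sketched as a minimal pipeline skeleton. The `query_llm` helper and the prompt strings below are hypothetical placeholders for illustration only, not the paper’s released code or prompts:

```python
# A hedged skeleton of the three-stage vPGM pipeline described above.
# `query_llm` is a stand-in for an actual LLM API call (hypothetical).
def query_llm(prompt: str) -> str:
    # Replace with a real LLM client; here we return a stub for illustration.
    return "stubbed response for: " + prompt[:30]

def vpgm_pipeline(task_description: str, test_input: str) -> str:
    # (1) Graphical Structure Discovery: latent variables + dependencies
    structure = query_llm(
        f"Identify latent variables and their probabilistic "
        f"dependencies for: {task_description}")
    # (2) Prompting-Based Inference: verbalized posteriors of each latent variable
    posteriors = query_llm(
        f"Given the structure {structure}, infer verbalized posterior "
        f"distributions for input: {test_input}")
    # (3) Predictions under Uncertainty: expectation over the inferred latents
    return query_llm(
        f"Using {posteriors}, report the answer with its "
        f"expected posterior probability.")
```

In practice, each stage would be a separate prompt to the same underlying LLM, with the discovered structure cached and reused across test inputs.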
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: LLM-Based Latent Variable and Graphical Model Discovery Pipeline
### Overview
The image is a technical flowchart illustrating a two-stage process that uses a Large Language Model (LLM) to first discover latent variables and a graphical model from input data, and then to construct an inference prompt for performing inference on Conditional Probability Distributions (CPDs). The flow proceeds from left to right.
### Components/Axes
The diagram is composed of several distinct blocks connected by directional arrows, indicating data or process flow.
**1. Input Block (Leftmost):**
* **Position:** Far left.
* **Content:** A rectangular box containing three labeled, stacked horizontal bars.
* Top bar (Gray): **"Data Pairs"**
* Middle bar (Light Yellow): **"Context"**
* Bottom bar (Light Orange): **"Knowledge"**
* Below the bars: An ellipsis **"..."** indicating additional, unspecified input types.
* **Function:** Represents the combined input fed into the first LLM stage.
**2. First Processing Stage:**
* **Label (Above):** **"Prompting"**
* **Component:** A vertical gray rectangle labeled **"LLM"** with a stylized brain/chip icon at the bottom.
* **Input:** An arrow from the Input Block.
* **Output:** An arrow pointing to the central discovery block.
**3. Central Discovery Block:**
* **Label (Below):** **"Latent Variable + Graphical Model Discovery"**
* **Content:** A large, dotted-outline rectangle divided into two sections by a vertical dotted line.
* **Left Section:** Contains circles labeled **"Z₁"**, **"Z₂"**, and a vertical ellipsis **"⋮"**, representing a set of discovered latent variables.
* **Right Section:** Contains a directed acyclic graph (DAG) with nodes and arrows.
* **Nodes:** Circles labeled **"Z₁"**, **"Z₂"**, **"Z₃"**, **"Z₄"**, and a shaded circle labeled **"Y"**.
* **Edges (Arrows):** Show dependencies: Z₁ → Z₃, Z₂ → Z₃, Z₂ → Z₄, Z₃ → Y, Z₄ → Y.
* **Function:** Visualizes the output of the first LLM stage: a set of latent variables (Z₁, Z₂...) and a learned graphical model structure showing probabilistic relationships between latent variables (Z) and an observed variable (Y).
**4. Second Processing Stage:**
* **Label (Above):** **"Inference Prompt Construction"**
* **Component:** A second vertical gray rectangle identical to the first, labeled **"LLM"** with the same icon.
* **Input:** An arrow from the Central Discovery Block.
* **Output:** An arrow pointing to the final output block.
**5. Output Block (Rightmost):**
* **Position:** Far right.
* **Content:** A light blue rounded rectangle containing a document icon.
* **Label (Below):** **"For CPDs Inference"**
* **Function:** Represents the final output: a constructed prompt ready to be used for inferring Conditional Probability Distributions (CPDs) based on the discovered model.
### Detailed Analysis
The diagram outlines a clear, sequential pipeline:
1. **Input Aggregation:** Multiple forms of information (Data Pairs, Context, Knowledge) are combined.
2. **First LLM Interaction (Prompting):** The aggregated input is used to prompt an LLM.
3. **Model Discovery:** The LLM's output is not a direct answer but a structured discovery: a set of latent variables (`Z₁`, `Z₂`, etc.) and a probabilistic graphical model (a DAG) defining how these latent variables and an observed variable `Y` relate to each other.
4. **Prompt Construction:** The discovered model structure is then fed into a second LLM interaction, specifically to construct a prompt.
5. **Final Inference Task:** The constructed prompt is designed for the specific task of performing inference on Conditional Probability Distributions (CPDs), which are fundamental components of the discovered graphical model.
### Key Observations
* **Two-Stage LLM Use:** The architecture explicitly uses the LLM twice for distinct purposes: first for *discovery* (unsupervised/structured output), and second for *prompt engineering* (task-specific application).
* **Latent Variable Focus:** The core of the discovery is identifying unobserved variables (`Z`) that explain the data, a common technique in causal and probabilistic modeling.
* **Graphical Model as Intermediate Representation:** The discovered DAG serves as a crucial, interpretable intermediate representation between raw data and the final inference task.
* **Task Specificity:** The pipeline is not generic; it is specifically geared towards probabilistic inference (CPDs), suggesting an application in causal discovery, Bayesian networks, or similar fields.
### Interpretation
This diagram represents a sophisticated methodology for **automating the construction of probabilistic models** using LLMs. The process suggests that instead of manually defining a model's structure and variables, an LLM can be prompted to *discover* them from raw data and context. The second stage then translates this discovered structure into an effective prompt for performing downstream inference.
The pipeline implies a **Peircean abductive reasoning** process: the first LLM stage takes observed data and context (the *sign*) and abduces the most plausible underlying structure (the *latent variables and their relations*—the *object*). The second stage then uses this hypothesized structure to make specific inferences (the *interpretant*).
The key innovation is using the LLM not just as a text generator, but as a **reasoning engine for model discovery and meta-prompting**. This could significantly lower the barrier for applying complex probabilistic models, as the LLM handles the technical steps of variable identification and structure learning. The "Knowledge" input bar is particularly notable, suggesting the model can incorporate prior domain knowledge to guide the discovery process, making it a form of **knowledge-enhanced machine learning**. The final output for "CPDs Inference" indicates the ultimate goal is quantitative, probabilistic reasoning about the relationships within the discovered model.
</details>
Figure 2: Overview of the vPGM’s learning framework. CPDs represent conditional probability distributions. We omit the observed variable $X$ for clarity.
### Graphical Structure Discovery
Our method begins by formulating a specialized prompt (see the appendix for the prompt) to uncover latent variables for compositional reasoning. The prompt comprises several key elements: (1) General Task Description, a concise statement of the reasoning objective; (2) Input-Output Data Pairs, which illustrate representative data samples; (3) Contextual Information, providing any essential background or domain insights; and (4) Prior Knowledge and Constraints, specifying constraints such as the maximum number of latent variables and predefined dependencies among them.
After identifying a set of latent variables $Z=\{Z_1,Z_2,\dots,Z_n\}$ (see the appendix for examples of latent variables), we further prompt LLMs to determine how each latent variable depends on the others. An example set of dependencies obtained from the LLM is $\{X → Z_1, X → Z_2, X → Z_3, X → Z_4, Z_1 → Z_3, Z_2 → Z_3, Z_2 → Z_4, Z_3 → Z_4, Z_4 → Y\}$, where each relationship $a → b$ indicates that $b$ is conditionally dependent on $a$. Like traditional PGMs, our verbalized PGM (vPGM) encodes these dependencies as conditional probability distributions $P\bigl(Z_i \mid \mathrm{Pa}(Z_i)\bigr)$. However, instead of relying on explicit distributional forms, vPGM uses natural language descriptions (see Appendix for detailed examples) to specify each conditional relationship, reducing the need for extensive domain expertise or parameter estimation.
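The example dependency set above can be stored as a small edge list from which each variable’s parent set $\mathrm{Pa}(Z_i)$ is read off. This is an illustrative sketch, not part of the authors’ implementation:

```python
# Store the LLM-discovered dependencies (the example set above) as an edge
# list and recover each node's parents, i.e. Pa(.) in P(Z_i | Pa(Z_i)).
from collections import defaultdict

edges = [("X", "Z1"), ("X", "Z2"), ("X", "Z3"), ("X", "Z4"),
         ("Z1", "Z3"), ("Z2", "Z3"), ("Z2", "Z4"),
         ("Z3", "Z4"), ("Z4", "Y")]

parents = defaultdict(list)
for a, b in edges:          # a -> b means b is conditionally dependent on a
    parents[b].append(a)

# e.g. the CPD of Z4 is conditioned on parents["Z4"] == ['X', 'Z2', 'Z3']
```

Each entry of `parents` then indexes one verbalized conditional relationship in the natural-language specification of the vPGM.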
### Prompting-Based Bayesian Inference
Traditionally, Bayesian inference focuses on inferring posterior distributions over model parameters given a probabilistic model and new observations. In the context of LLMs, however, it is reformulated as generating prompts that simulate posterior inference under the vPGM framework, leveraging its discovered structure and new observations. This approach leverages the advanced reasoning capabilities of LLMs to produce instructions enabling them to simulate Bayesian inference principles. An example prompt is: “Generate the prompt that guides LLMs through step-by-step probabilistic reasoning based on the provided task description, discovered PGM, and testing data…”
### Prediction Under Uncertainty
Agentic reasoning tasks often involve significant uncertainty. For instance, an LLM agent (e.g., an image captioner) may produce noisy outputs, introducing aleatoric uncertainty. Under the vPGM framework, this variability is captured by the verbalized posterior distributions of latent variables. After constructing the verbalized posterior $P(Z\mid X)$ via prompting-based Bayesian inference, we quantify confidence in the final predictions by taking the expected value of $P(Y\mid Z)$ over $Z$:
$$
\mathbb{E}_{P(Z\mid X)}\bigl[P(Y\mid Z)\bigr] \approx \sum_{Z} P(Y\mid Z)\, P(Z\mid X), \tag{1}
$$
where $X$ denotes observed inputs, and $Z$ is sampled by querying the LLM using vPGM’s Bayesian inference prompt. In practice, both $P(Z\mid X)$ and $P(Y\mid Z)$ are simulated within a single prompt (see detailed examples in the Appendix). Consequently, the expected posterior probabilities can be approximated by averaging the numerical values of $P(Y\mid Z)$ generated by the LLM during these inference steps.
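As a concrete illustration of this averaging, the sketch below estimates Eq. (1) as a Monte-Carlo mean; the per-query probability vectors are made-up values standing in for LLM outputs:

```python
# Approximate E_{P(Z|X)}[P(Y|Z)] by averaging the per-sample class
# probabilities returned by repeated vPGM inference queries.
import numpy as np

# Hypothetical outputs: row i holds P(Y|Z) for the i-th sampled Z
samples = np.array([
    [0.2, 0.3, 0.5],   # query 1: P(Y=A|Z), P(Y=B|Z), P(Y=C|Z)
    [0.1, 0.3, 0.6],   # query 2
    [0.3, 0.2, 0.5],   # query 3
])

# Monte-Carlo estimate of Eq. (1): column-wise mean over queries
p_y_given_x = samples.mean(axis=0)
```

Because each row is a valid probability vector, the averaged estimate remains a valid distribution over answers.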
## Bayesian-Enhanced vPGM: BayesVPGM
When repeatedly querying a Large Language Model (LLM) under the vPGM framework, we obtain multiple samples of responses, i.e., categorical predictions and their numerical probabilities. A natural question is how to leverage these data to better capture the underlying uncertainty in the LLM’s predictions. To do so, we propose to infer a posterior distribution over predictions, denoted $q(y\mid\tilde{x})$, where $\tilde{x}$ denotes the categorical predictions.
### Posterior Inference Under a Dirichlet Prior
We specify the form of the posterior $q(y\mid\tilde{x}) = \mathrm{Cat}(\boldsymbol{\pi})$, where $\boldsymbol{\pi}=(\pi_1,\dots,\pi_K)$ lies in the probability simplex over $K$ categories. To incorporate prior beliefs, we place a Dirichlet prior on $\boldsymbol{\pi}$: $\boldsymbol{\pi} \sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_K)$, with $\alpha_k=\lambda\, p(y=k\mid Z)$ for some hyperparameter $\lambda>0$, reflecting the vPGM’s initial belief in category $k$.
Next, suppose we query the LLM under the vPGM framework $n$ times, obtaining labels $\{y_1,\dots,y_n\}$. For each category $k$, let $n_k$ be the number of labels that fall into that category. Assuming these labels are drawn i.i.d. from $\mathrm{Cat}(\boldsymbol{\pi})$, the likelihood is $P\bigl(\{y_i\}\mid\boldsymbol{\pi}\bigr) = \prod_{k=1}^K \pi_k^{n_k}$. By Bayes’ rule, the posterior distribution is then
$$
q(y\mid\tilde{x}) \propto \Bigl(\prod_{k=1}^K \pi_k^{n_k}\Bigr)\times\Bigl(\prod_{k=1}^K \pi_k^{\alpha_k-1}\Bigr) = \prod_{k=1}^K \pi_k^{n_k+\alpha_k-1},
$$
i.e., a $\mathrm{Dirichlet}(n_1+\alpha_1,\dots,n_K+\alpha_K)$ distribution. The posterior mean of $\pi_k$ becomes
$$
\pi_k^{(\mathrm{mean})} = \frac{n_k+\alpha_k}{\sum_{j=1}^K \bigl(n_j+\alpha_j\bigr)}.
$$
Consequently, we adopt $q(y\mid\tilde{x}) = \mathrm{Cat}\bigl(\boldsymbol{\pi}^{(\mathrm{mean})}\bigr)$ as our final predictive distribution, which balances empirical label frequencies with the original vPGM’s numerical probabilities.
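The Dirichlet-multinomial update above reduces to a few lines. The counts, probabilities, and $\lambda$ in the sketch below are illustrative values, not results from the paper:

```python
# Combine empirical label counts n_k from repeated LLM queries with the
# Dirichlet prior alpha_k = lambda * p_vpgm(y=k|Z) to get the posterior mean.
import numpy as np

def posterior_mean(counts, p_vpgm, lam):
    counts = np.asarray(counts, dtype=float)
    alpha = lam * np.asarray(p_vpgm, dtype=float)  # prior pseudo-counts
    return (counts + alpha) / (counts + alpha).sum()

# e.g. 10 queries split 6/3/1, vPGM probabilities (0.5, 0.3, 0.2), lambda = 5
pi = posterior_mean([6, 3, 1], [0.5, 0.3, 0.2], lam=5.0)
```

Larger $\lambda$ pulls the predictive distribution toward the vPGM’s verbalized probabilities; $\lambda \to 0$ recovers the raw empirical frequencies.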
### Optimizing $λ$ via a Differentiable Calibration Loss
One key limitation of this posterior distribution is its reliance on a manually tuned $λ$ , which governs how strongly the vPGM’s numerical probabilities influence the final outcome. To automate this process and improve calibration, we introduce a differentiable calibration loss that learns $λ$ through gradient‐based optimization.
Specifically, we minimize the following loss function with respect to $λ$ :
$$
L\bigl(\boldsymbol{\pi}(\lambda)\bigr) = L_c\bigl(\boldsymbol{\pi}(\lambda)\bigr) + \beta\, L_v\bigl(\boldsymbol{\pi}(\lambda)\bigr), \tag{2}
$$
where $\boldsymbol{\pi}(\lambda)=(\pi_1^{(\mathrm{mean})},\dots,\pi_K^{(\mathrm{mean})})$ is the posterior-mean vector, $L_c$ is a standard classification loss (e.g., cross-entropy), and $L_v$ is a differentiable class-wise alignment term; $\beta$ is a hyperparameter balancing the two losses. Let $j$ index the categories, and let $\bar{\pi}_j=\frac{1}{n}\sum_{i=1}^n \pi_j^{(i)}$ be the average predicted probability of class $j$ over a mini-batch of size $n$. Likewise, let $\bar{y}_j=\frac{1}{n}\sum_{i=1}^n y_j^{(i)}$ be the empirical fraction of class $j$, where $y_j^{(i)}\in\{0,1\}$ indicates whether sample $i$ belongs to class $j$. Inspired by class-wise expected calibration error (Kull et al. 2019), which aligns predictions to empirical frequencies on a per-category basis but whose binning procedure impedes differentiability, we define:
$$
L_v\bigl(\boldsymbol{\pi}\bigr) = \frac{1}{K}\sum_{j=1}^K \Bigl|\bar{\pi}_j - \bar{y}_j\Bigr|, \tag{3}
$$
using a bin‐free version of class-wise expected calibration error.
To minimize $L\bigl(\boldsymbol{\pi}\bigr)$ with respect to $\lambda$, we employ a quasi-Newton method, L-BFGS (Broyden 1967). By exploiting approximate second-order curvature information, this solver typically converges more rapidly than plain gradient descent.
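To make the optimization concrete, the sketch below evaluates the combined loss of Eq. (2), with the bin-free term of Eq. (3), over a grid of $\lambda$ values. A simple 1-D grid search stands in for the L-BFGS solver used in the paper, and all numbers are illustrative:

```python
# Evaluate L(pi(lambda)) = L_c + beta * L_v for a scalar lambda and pick the
# minimizer by grid search (a stand-in for the paper's L-BFGS solver).
import numpy as np

def loss(lam, counts, p_vpgm, labels, beta=1.0):
    # counts: (n, K) per-sample query counts; p_vpgm: (n, K); labels: (n,) ints
    alpha = lam * p_vpgm
    pi = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
    n, K = pi.shape
    ce = -np.log(pi[np.arange(n), labels]).mean()      # L_c: cross-entropy
    onehot = np.eye(K)[labels]
    # L_v (Eq. 3): bin-free class-wise calibration term
    cal = np.abs(pi.mean(axis=0) - onehot.mean(axis=0)).mean()
    return ce + beta * cal

# Illustrative data: two samples, three classes
counts = np.array([[6., 3., 1.], [2., 7., 1.]])
p_vpgm = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]])
labels = np.array([0, 1])

grid = np.linspace(0.1, 20.0, 200)
lam_star = grid[np.argmin([loss(l, counts, p_vpgm, labels) for l in grid])]
```

Because $\lambda$ is a single scalar, even this coarse search is cheap; a quasi-Newton method simply reaches the minimum in fewer loss evaluations.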
**Theorem 1 (Global Optimum Implies Perfect ECE)**
*Let $\{(u_i,y_i)\}_{i=1}^n$ be the training set with features $u_i\in\mathbb{R}^d$ and one-hot labels $y_{ik}$. For any parameter vector $\theta$, let $g_\theta:\mathbb{R}^d\to\Delta^{K-1}$ be a function that produces class probabilities $\widehat{p}_{ik}(\theta)=g_\theta(u_i)_k$. The empirical version of Eq. (2) is*
$$
L(\theta)=-\frac{1}{n}\sum_{i=1}^n\sum_{k=1}^K y_{ik}\,\log\widehat{p}_{ik}(\theta)+\beta\,\frac{1}{K}\sum_{k=1}^K\bigl|\bar{\widehat{p}}_k(\theta)-\bar{y}_k\bigr|,
$$
*where $\beta>0$, $\bar{\widehat{p}}_k(\theta)=\tfrac{1}{n}\sum_i\widehat{p}_{ik}(\theta)$, and $\bar{y}_k=\tfrac{1}{n}\sum_i y_{ik}$. Then a parameter vector $\theta^\star$ is a global minimiser of $L$ iff*
$$
\widehat{p}_{ik}(\theta^\star)=\frac{\sum_{i'=1}^n y_{i'k}\,\mathbb{I}\{u_i=u_{i'}\}}{\sum_{i'=1}^n \mathbb{I}\{u_i=u_{i'}\}},
$$
*where $\mathbb{I}\{u_i=u_{i'}\}$ is an indicator function equal to 1 if the feature inputs $u_i$ and $u_{i'}$ are identical, and 0 otherwise. In that case, the class-wise expected calibration error $\mathrm{ECE}_{\mathrm{class}}(\theta)\triangleq\frac{1}{K}\sum_k|\bar{\widehat{p}}_k(\theta)-\bar{y}_k|$ satisfies $\mathrm{ECE}_{\mathrm{class}}(\theta^\star)=0$.*
The proof is provided in the Appendix. Although the cross-entropy term in the loss function Eq. (2) pulls predictions toward one-hot labels while the calibration term enforces class-wise average alignment, Theorem 1 shows that both objectives can attain their minima simultaneously.
## Experiments
We evaluate the efficacy of the proposed vPGM and BayesVPGM in modeling uncertainty across three agentic reasoning tasks. The first is a closed-ended task, ScienceQA (Lu et al. 2022), and the second an open-ended task, ChatCoach (Huang et al. 2024); both require reasoning with undisclosed information from multiple sources. We then introduce a negative control experiment derived from A-OKVQA (Schwenk et al. 2022) to investigate whether latent variables can enhance confidence calibration by detecting mismatches in the presence of misinformation. See Appendix for the detailed experimental configurations.
### Science Question Answering
The Science Question Answering (ScienceQA) benchmark, introduced by Lu et al. (2022), is a comprehensive benchmark for multi-modal question answering across a diverse range of scientific disciplines, including physics, mathematics, biology, and the humanities. It features 4,241 question-answer pairs covering various topics and contexts. The task demands integrating information from multiple sources or LLM agents (e.g., Bing search results, image captions), a process that can introduce errors and increase the complexity of reasoning. Given these challenges, ScienceQA serves as an ideal testbed for evaluating how effectively vPGM identifies latent structures and models uncertainty. See the Appendix for more detailed experimental setups.
#### Baseline Methods
We compare vPGM/BayesVPGM with the following baseline methods:
- Chain-of-Thought: A non-tool-augmented baseline, Chain-of-Thought (CoT) prompting (Wei et al. 2022), equipped with verbalized confidence estimation by prompting the model to provide a numerical confidence for its selected answer.
- Chameleon: A tool-augmented baseline, Chameleon (Lu et al. 2023), which we equip with verbalized confidence estimation.
- Chameleon+: Extends Chameleon with a state-of-the-art uncertainty quantification framework that combines verbalized confidence estimation with self-consistency measurement (Wang et al. 2022a), as recommended by Xiong et al. (2023).
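One simple way to combine verbalized confidence with self-consistency, sketched below, is to weight the majority answer's verbalized confidence by its agreement ratio across $M$ sampled responses; the exact aggregation rule in Xiong et al. (2023) may differ, so this is an illustrative assumption:

```python
from collections import Counter

def hybrid_confidence(samples):
    """samples: list of (answer, verbalized_confidence) pairs from M
    stochastic runs of the same query. Returns the majority answer and a
    hybrid score: the self-consistency agreement ratio times the mean
    verbalized confidence of the agreeing samples (our assumed combination)."""
    answers = [a for a, _ in samples]
    top, count = Counter(answers).most_common(1)[0]
    agreeing = [c for a, c in samples if a == top]
    return top, (count / len(samples)) * (sum(agreeing) / len(agreeing))

# Two of three samples agree on "B" with high verbalized confidence.
ans, conf = hybrid_confidence([("B", 0.9), ("B", 0.8), ("A", 0.7)])
print(ans, conf)
```

Disagreement among samples then lowers the reported confidence even when each individual response is verbally overconfident.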
#### Evaluation Metrics
In line with previous evaluation settings on confidence calibration (Naeini, Cooper, and Hauskrecht 2015; Guo et al. 2017; Xiong et al. 2023), we adopt the expected calibration error (ECE) to evaluate model confidence, represented as numeric probabilistic predictions. The ECE quantifies the divergence between predicted probabilities and observed accuracy across confidence levels (bins). Throughout our experiments, we fix the number of confidence bins at 10, with uniform bin widths. In addition, we evaluate a method's ability to solve problems correctly by measuring accuracy (Acc.).
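Concretely, ECE with 10 equal-width bins can be computed as follows (a minimal sketch; the function and variable names are ours):

```python
import numpy as np

def expected_calibration_error(confidences, corrects, n_bins=10):
    """ECE over equal-width confidence bins: each bin's |accuracy - confidence|
    gap is weighted by the fraction of samples falling in that bin."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(corrects, dtype=float)
    # Map each confidence to a bin index; confidence 1.0 goes to the last bin.
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return ece

# Three high-confidence predictions (two correct) and one low-confidence miss.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.1], [1, 1, 0, 0]))
```

A perfectly calibrated model, whose per-bin accuracy matches its per-bin mean confidence, would score an ECE of zero under this definition.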
| Method | $N$ | $M$ | Acc. | ECE |
| --- | --- | --- | --- | --- |
| CoT | – | 1 | 84.63 | 8.96 |
| Chameleon | – | 1 | 85.29 | 9.62 |
| Chameleon+ | – | 3 | 85.17 | 8.65 |
| vPGM (Ours) | 2 | 3 | 85.49 | 2.31 |
| vPGM (Ours) | 3 | 3 | 86.38 | 1.67 |
| vPGM (Ours) | 4 | 3 | 86.54 | 2.15 |
| BayesVPGM (Ours) | 2 | 3 | 85.49 | 1.81 |
| BayesVPGM (Ours) | 3 | 3 | 86.38 | 1.05 |
| BayesVPGM (Ours) | 4 | 3 | 86.54 | 1.50 |
Table 1: Accuracy (%) and ECE $(\times 10^2)$ on ScienceQA for different methods and numbers of latent variables $N$. $M$ is the number of sampled responses. The best and second-best results within each base model are bolded and underlined, respectively. Llama3-8B-Instruct (Dubey et al. 2024) serves as our test-time engine. See the Appendix for results using other LLMs.
#### Results
Table 1 details the performance of different methods on the ScienceQA dataset. Chameleon yields the highest (worst) ECE ($\times 10^2$) of 9.62, indicating serious overconfidence in handling complex reasoning tasks even with the assistance of external tools. In comparison, our vPGM outperforms these methods in both accuracy and ECE, owing to its superior ability to capture latent structural information that the baselines overlook. Figure 3 shows the reliability diagrams of Chameleon+ and BayesVPGM; BayesVPGM's near-perfect alignment with the ideal calibration curve across all bins highlights its precision in confidence calibration (see the Appendix for ablation results and token-level computational costs).
#### Qualitative Study on the Inferred Latent Variables
Figure 1 presents a case study of BayesVPGM's inference, qualitatively assessing the model's ability to exploit latent structural information for improving confidence estimation. Here, vPGM employs its latent variables to critically assess the relevance of retrieved information. For example, when faced with irrelevant data from external tools such as Bing search or inaccurate captions from image captioners, the baseline Chameleon erroneously maintains high confidence in its predictions. In contrast, BayesVPGM carefully adjusts its confidence, assigning lower probabilities when essential contextual knowledge is missing or incorrect, a process made particularly effective through the inference of latent variables $Z_1$ and $Z_2$. These observations highlight the importance of inferring latent structures for improving the reliability of compositional reasoning systems.
[Figure 3(a): reliability diagram for Chameleon+. Axes: Confidence (x) vs. Average Accuracy within Bin (y); ECE = 0.0865.]
[Figure 3(b): reliability diagram for BayesVPGM. Same axes; ECE = 0.0105.]
Figure 3: Reliability diagrams of (a) Chameleon+ and (b) BayesVPGM ($N=3, M=3$) on ScienceQA (see the Appendix for diagrams of Chameleon and vPGM). BayesVPGM achieves a much lower ECE than Chameleon+ and closely approaches the ideal confidence calibration curve (the diagonal dashed line).
| Method | Detection BLEU-2 | Detection ROUGE-L | Detection BERTScore | Correction BLEU-2 | Correction ROUGE-L | Correction BERTScore |
| --- | --- | --- | --- | --- | --- | --- |
| Instruction Prompting | 27.4 | 3.3 | 67.6 | 1.4 | 2.1 | 61.6 |
| Vanilla CoT | 17.7 | 2.7 | 64.1 | 0.1 | 2.3 | 58.1 |
| Zero-shot CoT | 27.6 | 1.9 | 69.0 | 3.0 | 0.9 | 58.8 |
| GCoT | 34.2 | 3.7 | 72.4 | 1.6 | 2.0 | 65.4 |
| vPGM (Ours) | 37.2 | 2.3 | 76.3 | 1.7 | 2.0 | 68.3 |
| Human | 76.6 | 6.0 | 90.5 | 33.5 | 3.6 | 84.1 |
Table 2: Results of various methods on the detection and correction of medical terminology errors.
### Communicative Medical Coaching
The Communicative Medical Coaching benchmark, ChatCoach, introduced by Huang et al. (2024), establishes a complex multi-agent dialogue scenario involving doctors, patients, and a medical coach across 3,500 conversation turns. The medical coach is tasked with detecting inaccuracies in medical terminology used by doctors (detection task) and suggesting appropriate corrections (correction task). These tasks require integrating external medical knowledge, inherently introducing uncertainty into response formulation. This benchmark was chosen to test vPGM's ability to generalize across complex open-ended reasoning tasks. BayesVPGM is not applied in this setting, as it assumes a categorical output distribution. See the Appendix for more details on the experiments and implementation.
#### Baseline Methods
For comparative analysis, we benchmark vPGM against these approaches:
- Vanilla Instruction Prompting: This method involves prompting the LLM with direct instructions for dialogue generation.
- Zero-shot Chain of Thought (CoT) (Kojima et al. 2022): A straightforward CoT approach where the LLM is prompted to sequentially articulate a reasoning chain.
- Vanilla CoT (Wei et al. 2022): This method builds upon the basic CoT by providing the LLM with a set of examples that include detailed reasoning steps.
- Generalized CoT (GCoT) (Huang et al. 2024): An advanced version of CoT, designed to improve the generation of structured feedback and integration of external knowledge effectively. It represents a state-of-the-art method in the ChatCoach benchmark.
#### Evaluation Metrics
Following Huang et al. (2024), we employ the conventional automated metrics BLEU-2, ROUGE-L, and BERTScore. BLEU-2 measures the precision of bi-gram overlaps, offering insight into the lexical accuracy of the generated text against reference answers. ROUGE-L assesses sentence-level similarity, using the longest common subsequence to evaluate structural coherence and the alignment of sequential n-grams. BERTScore provides a semantic similarity assessment, using BERT embeddings to compare generated outputs and reference texts at a deeper semantic level. As specified in Huang et al. (2024), we use GPT-4 to extract medical terminology errors and corresponding corrections from the Coach Agents' feedback; automated metrics are then computed on these extracted elements against human annotations.
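The bi-gram precision at the heart of BLEU-2 can be sketched in a few lines (a simplified illustration that omits BLEU's unigram component and brevity penalty):

```python
from collections import Counter

def bigram_precision(candidate, reference):
    """Clipped (modified) bigram precision, the core ingredient of BLEU-2:
    each candidate bigram counts at most as often as it appears in the
    reference. Brevity penalty and unigram precision are omitted here."""
    cand = Counter(zip(candidate, candidate[1:]))
    ref = Counter(zip(reference, reference[1:]))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(n, ref[bg]) for bg, n in cand.items())
    return clipped / total

print(bigram_precision("the cat sat".split(), "the cat sat down".split()))  # 1.0
```

Clipping prevents a candidate from inflating its score by repeating a reference bigram more often than the reference contains it.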
#### Results
We present the performance of various methods in Table 2. The noticeable difference between machine-generated outputs and human benchmarks across all metrics highlights the inherent challenges in communicative medical coaching. In the detection of medical terminology errors, vPGM leads with superior BLEU-2 (37.2) and BERTScore (76.3), underscoring its proficiency in identifying inaccuracies. In the correction task, while vPGM achieves a standout BERTScore of 68.3, surpassing all baselines, it scores lower on BLEU-2 and ROUGE-L. This variation is attributed to the ambiguity in doctors’ inputs, which can yield multiple valid responses, affecting metrics that rely on exact matches.
### A-OKVQA Negative Control: Studying Latent Variables Under Misinformation
#### Data Simulation
A-OKVQA (Schwenk et al. 2022) is a Visual Question Answering dataset that challenges models to perform commonsense reasoning about a scene, often beyond the reach of simple knowledge-base queries. Crucially, it provides ground-truth image captions and rationales for each question. We leverage these annotations to construct a negative control experiment: A-OKVQA-clean (603 data points) retains the correct image caption and rationale (near single-hop reasoning), while A-OKVQA-noisy (603 data points) randomly shuffles the rationale, thus introducing misinformation and forcing a multi-hop check for consistency. In this experiment, we adopt a vPGM with 2 latent variables (see the Appendix for the inference prompt and an example query). Refer to the Appendix for more details on data configurations.
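The clean/noisy construction above amounts to permuting rationales across examples. A minimal sketch (field names such as `rationale` are hypothetical stand-ins for the dataset's annotations):

```python
import random

def make_negative_control(examples, seed=0):
    """Build clean/noisy splits in the spirit of the A-OKVQA control
    experiment: the clean copy keeps each example's own rationale, while the
    noisy copy randomly permutes rationales across examples, injecting
    misinformation that forces a consistency check."""
    rng = random.Random(seed)
    clean = [dict(e) for e in examples]
    rationales = [e["rationale"] for e in examples]
    rng.shuffle(rationales)  # note: a few rationales may land back in place
    noisy = [dict(e, rationale=r) for e, r in zip(examples, rationales)]
    return clean, noisy
```

Shuffling preserves the marginal distribution of rationales while breaking their pairing with questions, so any calibration gap between the two splits is attributable to the mismatch itself.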
#### Overall Performance Under Noisy Conditions
Table 3 shows the overall accuracy (Acc.) and expected calibration error (ECE) on the A-OKVQA-noisy dataset. Both vPGM and BayesVPGM outperform Chameleon+ on accuracy (61.03% vs. 59.04%) and yield lower ECE, indicating that latent variables detect mismatch and improve confidence calibration.
#### Mismatch Detection Through $Z_2$
To investigate how latent variables facilitate mismatch detection, we track $P\bigl(Z_2\mid \mathrm{Pa}(Z_2)\bigr)$, where $Z_2$ indicates whether the rationale is aligned with the image caption. As shown in Table 4, the mean probability of $Z_2$ is considerably higher in the Clean set than in the Noisy set (0.86 vs. 0.42), and mismatch identification accuracy in the Noisy condition reaches 87%. These findings demonstrate BayesVPGM's capacity to robustly detect cases with inconsistencies or irrelevant content.
#### Latent Variable Correlation Analysis
We additionally compute Pearson correlations (Pcc.) between numerical conditional probabilities of the latent variables ( $Z_1$ and $Z_2$ ) and the final answer $Y$ . In the Noisy case, $Pcc(Z_2,Y)$ surpasses $Pcc(Z_1,Y)$ (0.55 versus 0.35), indicating that $Z_2$ exerts a stronger influence on the final prediction when mismatches are present. Conversely, in the Clean subset, $Z_1$ and $Z_2$ exhibit nearly equal correlation with $Y$ , yet about 22% of the Clean data is incorrectly flagged by $Z_2$ as mismatched, potentially introducing noisy confidence adjustments at $Y$ . This suggests a trade-off: while latent variables excel at detecting misinformation and improving calibration in Noisy settings, they can slightly degrade calibration when no mismatch actually exists.
| Method | Acc. | ECE |
| --- | --- | --- |
| Chameleon+ | 59.04 | 11.75 |
| vPGM (Ours) | 61.03 | 10.54 |
| BayesVPGM (Ours) | 61.03 | 9.85 |
Table 3: General performance on A-OKVQA-noisy data (accuracy in % and ECE in $\times 10^2$).
| Metric | Clean | Noisy |
| --- | --- | --- |
| Mean $P\bigl(Z_2\mid \mathrm{Pa}(Z_2)\bigr)$ | 0.86 | 0.42 |
| Noise Identification Acc. | 78% | 87% |
| $Pcc\bigl(Z_1,Y\bigr)$ | 0.50 | 0.35 |
| $Pcc\bigl(Z_2,Y\bigr)$ | 0.51 | 0.55 |
Table 4: Analysis of the latent variables on A-OKVQA-clean and A-OKVQA-noisy.
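The correlation analysis reported in Table 4 can be reproduced in a few lines; here is a sketch with synthetic numbers (not the paper's data), correlating a latent variable's verbalized conditional probability with the binary correctness of the final answer $Y$:

```python
import numpy as np

# Synthetic illustration of the Pearson correlation (Pcc.) computation:
# each entry pairs the verbalized probability P(Z2 | Pa(Z2)) for one example
# with whether the final answer Y turned out to be correct.
p_z2 = np.array([0.9, 0.2, 0.8, 0.3, 0.7, 0.1])  # P(Z2 | Pa(Z2)) per example
y = np.array([1, 0, 1, 0, 1, 0])                 # final answer correct?
pcc = np.corrcoef(p_z2, y)[0, 1]
print(round(pcc, 3))  # -> 0.965
```

A high correlation here would indicate, as in the Noisy condition of Table 4, that the latent mismatch variable strongly drives the final prediction.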
## Conclusion
We introduce Verbalized Probabilistic Graphical Modeling (vPGM), a Bayesian agentic framework that (1) directs LLM agents to simulate core principles of Probabilistic Graphical Models (PGMs) through natural language and (2) refines the resulting posterior distributions via numerical Bayesian inference. Applied within agentic workflows, vPGM enables LLM agents to perform probabilistic latent-variable reasoning with calibrated uncertainty. This approach discovers latent variables and dependencies without requiring extensive domain expertise, making it well-suited to settings with limited assumptions. Our empirical results on agentic reasoning tasks demonstrate substantial improvements in both confidence calibration and text generation quality. These results highlight the potential of merging Bayesian principles with LLM agents to enhance AI systems' capacity for modeling uncertainty and reasoning under uncertainty.
## Acknowledgments
We thank all reviewers, SPC, and AC for their valuable comments. S.B. acknowledges funding from the MRC Centre for Global Infectious Disease Analysis (reference MR/X020258/1), funded by the UK Medical Research Council (MRC). This UK funded award is carried out in the frame of the Global Health EDCTP3 Joint Undertaking. S.B. is funded by the National Institute for Health and Care Research (NIHR) Health Protection Research Unit in Modelling and Health Economics, a partnership between UK Health Security Agency, Imperial College London and LSHTM (grant code NIHR200908). H.W. is partially supported by Amazon Faculty Research Award, Microsoft AI & Society Fellowship, NSF CAREER Award IIS-2340125, NIH grant R01CA297832, and NSF grant IIS-2127918. We acknowledge support from OpenAI’s Researcher Access Program. Disclaimer: “The views expressed are those of the author(s) and not necessarily those of the NIHR, UK Health Security Agency or the Department of Health and Social Care.” S.B. acknowledges support from the Novo Nordisk Foundation via The Novo Nordisk Young Investigator Award (NNF20OC0059309). S.B. acknowledges the Danish National Research Foundation (DNRF160) through the chair grant. S.B. acknowledges support from The Eric and Wendy Schmidt Fund For Strategic Innovation via the Schmidt Polymath Award (G-22-63345) which also supports H.H. and L.M.
## References
- Abdullah, Hassan, and Mustafa (2022) Abdullah, A. A.; Hassan, M. M.; and Mustafa, Y. T. 2022. A review on bayesian deep learning in healthcare: Applications and challenges. IEEE Access, 10: 36538–36562.
- Achiam et al. (2023) Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Bender and Koller (2020) Bender, E. M.; and Koller, A. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185–5198.
- Bielza and Larrañaga (2014) Bielza, C.; and Larrañaga, P. 2014. Bayesian networks in neuroscience: a survey. Frontiers in Computational Neuroscience, 8: 131.
- Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
- Broyden (1967) Broyden, C. G. 1967. Quasi-Newton methods and their application to function minimisation. Mathematics of Computation, 21(99): 368–381.
- Constantinou, Kitson, and Zanga (2025) Constantinou, A. C.; Kitson, N. K.; and Zanga, A. 2025. Using GPT-4 to Guide Causal Machine Learning. Expert Systems with Applications.
- Devlin et al. (2018) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dubey et al. (2024) Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. 2024. The llama 3 herd of models. arXiv e-prints, arXiv–2407.
- Feng et al. (2025) Feng, Y.; Zhou, B.; Lin, W.; and Roth, D. 2025. BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models. In International Conference on Learning Representations (ICLR).
- Griffiths, Kemp, and Tenenbaum (2008) Griffiths, T. L.; Kemp, C.; and Tenenbaum, J. B. 2008. Bayesian models of cognition. In Annual Meeting of the Cognitive Science Society, 2004; This chapter is based in part on tutorials given by the authors at the aforementioned conference as well as the one held in 2006. Cambridge University Press.
- Guo et al. (2017) Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On calibration of modern neural networks. In International Conference on Machine Learning, 1321–1330. PMLR.
- Hao et al. (2024) Hao, S.; Sukhbaatar, S.; Su, D.; Li, X.; Hu, Z.; Weston, J.; and Tian, Y. 2024. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
- Huang et al. (2024) Huang, H.; Wang, S.; Liu, H.; Wang, H.; and Wang, Y. 2024. Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset. In Findings of the Association for Computational Linguistics: ACL 2024.
- Kitson et al. (2023) Kitson, N. K.; Constantinou, A. C.; Guo, Z.; Liu, Y.; and Chobtham, K. 2023. A survey of Bayesian Network structure learning. Artificial Intelligence Review, 56(8): 8721–8814.
- Kojima et al. (2022) Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35: 22199–22213.
- Kull et al. (2019) Kull, M.; Perello Nieto, M.; Kängsepp, M.; Silva Filho, T.; Song, H.; and Flach, P. 2019. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. Advances in Neural Information Processing Systems, 32.
- Lake et al. (2017) Lake, B. M.; Ullman, T. D.; Tenenbaum, J. B.; and Gershman, S. J. 2017. Building machines that learn and think like people. Behavioral and Brain Sciences, 40: e253.
- Lu et al. (2022) Lu, P.; Mishra, S.; Xia, T.; Qiu, L.; Chang, K.-W.; Zhu, S.-C.; Tafjord, O.; Clark, P.; and Kalyan, A. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35: 2507–2521.
- Lu et al. (2023) Lu, P.; Peng, B.; Cheng, H.; Galley, M.; Chang, K.-W.; Wu, Y. N.; Zhu, S.-C.; and Gao, J. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842.
- Naeini, Cooper, and Hauskrecht (2015) Naeini, M. P.; Cooper, G.; and Hauskrecht, M. 2015. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.
- Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730–27744.
- Schick et al. (2023) Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; and Scialom, T. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36: 68539–68551.
- Schwenk et al. (2022) Schwenk, D.; Khandelwal, A.; Clark, C.; Marino, K.; and Mottaghi, R. 2022. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, 146–162. Springer.
- Shum, Diao, and Zhang (2023) Shum, K.; Diao, S.; and Zhang, T. 2023. Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data. arXiv preprint arXiv:2302.12822.
- Sumers et al. (2023) Sumers, T. R.; Yao, S.; Narasimhan, K.; and Griffiths, T. L. 2023. Cognitive architectures for language agents. arXiv preprint arXiv:2309.02427.
- Tenenbaum et al. (2011) Tenenbaum, J. B.; Kemp, C.; Griffiths, T. L.; and Goodman, N. D. 2011. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022): 1279–1285.
- Wan et al. (2025) Wan, G.; Lu, Y.; Wu, Y.; Hu, M.; and Li, S. 2025. Large Language Models for Causal Discovery: Current Landscape and Future Directions. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25), 10687–10695. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
- Wang et al. (2025) Wang, C.; Singh, A.; Shyam, P.; Andreas, J.; Krueger, D.; and Rocktäschel, T. 2025. Automated Design of Agentic Systems. In International Conference on Learning Representations.
- Wang and Yeung (2020) Wang, H.; and Yeung, D.-Y. 2020. A survey on Bayesian deep learning. ACM Computing Surveys (csur), 53(5): 1–37.
- Wang et al. (2022a) Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; and Zhou, D. 2022a. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- Wang et al. (2022b) Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N. A.; Khashabi, D.; and Hajishirzi, H. 2022b. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
- Wei et al. (2022) Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824–24837.
- Xiong et al. (2023) Xiong, M.; Hu, Z.; Lu, X.; Li, Y.; Fu, J.; He, J.; and Hooi, B. 2023. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. arXiv preprint arXiv:2306.13063.
- Yao et al. (2024) Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; and Narasimhan, K. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
- Yao et al. (2023) Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K. R.; and Cao, Y. 2023. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
- Zhang et al. (2022) Zhang, Z.; Zhang, A.; Li, M.; and Smola, A. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493.
- Zheng et al. (2021) Zheng, L.; Guha, N.; Anderson, B. R.; Henderson, P.; and Ho, D. E. 2021. When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 159–168.