# Neuro-symbolic Action Masking for Deep Reinforcement Learning
**Authors**: Shuai Han, Mehdi Dastani, Shihan Wang
*Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), May 25–29, 2026, Paphos, Cyprus.* C. Amato, L. Dennis, V. Mascardi, J. Thangarajah (eds.). 2026. Utrecht University, Utrecht, the Netherlands.
## Abstract
Deep reinforcement learning (DRL) may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations, together with manually specified action masking techniques to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learns symbolic models of high-dimensional states, consistent with given domain constraints, in a minimally supervised manner during the DRL process. Based on the learned symbolic model of states, NSAM learns action masks that rule out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple domains with constraints, and experimental results demonstrate that NSAM significantly improves the sample efficiency of DRL agents while substantially reducing constraint violations.
Key words and phrases: Deep reinforcement learning, neuro-symbolic learning, action masking
doi: JWPH6906
## 1. Introduction
With the powerful representation capability of neural networks, deep reinforcement learning (DRL) has achieved remarkable success in a variety of complex domains that require autonomous agents, such as autonomous driving autodriving4_1; autodriving4_2; autodriving4_3, resource management resourcem4_1; resourcem4_2, algorithmic trading autotrading4_1; autotrading4_2 and robotics roboticRL4_1; roboticRL4_2; roboticRL4_3. However, in real-world scenarios, agents face the challenges of learning policies from few interactions roboticRL4_2 and of keeping violations of domain constraints to a minimum during training and execution autodriving_safe. To address these challenges, an increasing number of neuro-symbolic reinforcement learning (NSRL) approaches have been proposed, aiming to exploit the structural knowledge of the problem to improve sample efficiency shindo2024blendrl; RM; nsrl2025_planning or to constrain the actions agents select PLPG; PPAM; nsrl2024_plpg_multi.
Among these NSRL approaches, a promising practice is to exclude infeasible actions for the agent. We use the term infeasible actions throughout the paper; such actions can also be regarded as unsafe, unethical or, in general, undesirable actions. This is typically achieved by assuming a predefined symbolic grounding nsplanning or label function RM that maps high-dimensional states into symbolic representations, and by manually specifying action masking techniques actionmasking_app1; actionmasking_app3; actionmasking_app4. However, predefining the symbolic grounding function is often expensive neuroRM, as it requires complete knowledge of the environmental states, and can be practically impossible when the states are high-dimensional or infinite. Learning symbolic grounding from environmental states is therefore crucial for NSRL approaches and remains a highly challenging problem neuroRM.
In particular, there are three main challenges. First, real-world environments must often satisfy complex constraints expressed in a domain-specific language, which makes learning the symbolic grounding function difficult ahmed2022semantic. Second, obtaining full supervision for learning symbolic representations in DRL environments is unrealistic, as those environments rarely provide the ground-truth symbolic description of every state. Finally, even if symbolic grounding can be learned, integrating it into reinforcement learning to achieve end-to-end learning remains a challenge.
To address these challenges, we propose Neuro-symbolic Action Masking (NSAM), a framework that integrates symbolic reasoning into deep reinforcement learning. The basic idea is to use probabilistic sentential decision diagrams (PSDDs) to learn symbolic grounding. PSDDs serve two purposes: they guarantee that any learned symbolic model satisfies the domain constraints expressed in a domain-specific language kisa2014probabilistic, and they allow the agent to represent probability distributions over symbolic models conditioned on high-dimensional states. In this way, PSDDs bridge the gap between numerical states and symbolic reasoning without requiring manually defined mappings. Based on the learned PSDDs, NSAM combines action preconditions with the inferred symbolic model of numeric states to construct action masks, thereby filtering out infeasible actions. Crucially, this process relies only on minimal supervision in the form of action explorability feedback, rather than a full symbolic description of every state. Finally, NSAM is trained end-to-end, where improvements in symbolic grounding and policy optimization mutually reinforce each other.
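The masking step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical grounding model that outputs independent per-proposition probabilities (`prop_probs`), decodes a symbolic model by thresholding, and marks an action explorable only if its precondition holds in that model.

```python
import numpy as np

def action_mask(prop_probs, preconditions, threshold=0.5):
    """Build a binary action mask from inferred proposition probabilities.

    prop_probs    : dict mapping proposition name -> P(prop is true | state),
                    e.g. produced by a learned grounding model (hypothetical).
    preconditions : dict mapping action name -> callable evaluating the
                    action's precondition on a truth assignment.
    """
    # Decode a symbolic model (truth assignment) of the current state.
    model = {p: prob >= threshold for p, prob in prop_probs.items()}
    # An action is explorable iff its precondition holds in that model.
    return np.array([preconditions[a](model) for a in sorted(preconditions)],
                    dtype=np.float32)

# Hypothetical Visual-Sudoku-style example: filling 1 at (1,1) requires that
# no peer cell already contains a 1.
probs = {"one_at_1_2": 0.1, "one_at_2_1": 0.9}
pre = {"fill_1_at_1_1": lambda m: not m["one_at_1_2"] and not m["one_at_2_1"]}
mask = action_mask(probs, pre)  # -> [0.0]: the action is masked out
```

In a masked policy, such a mask is typically applied by adding $\log(\text{mask})$ to the action logits, so that masked actions receive zero probability.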
We evaluate NSAM on four DRL decision-making domains with domain constraints, and compare it against a series of state-of-the-art baselines. Experimental results demonstrate that NSAM not only learns more efficiently, consistently surpassing all baselines, but also substantially reduces constraint violations during training. The results further show that the symbolic grounding plays a crucial role in exploiting underlying knowledge structures for DRL.
## 2. Problem setting
We study reinforcement learning (RL) on a Markov Decision Process (MDP) RL1998 $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{T},R,\gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a finite set of actions, $\mathcal{T}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is a transition function, $\gamma\in[0,1)$ is a discount factor and $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$ is a reward function. An agent employs a policy $\pi$ to interact with the environment. At time step $t$, the agent takes action $a_{t}$ according to the current state $s_{t}$. The environment transitions to the next state $s_{t+1}$ according to the transition probability $\mathcal{T}$, and the agent receives the reward $r_{t}$. Then, the next round of interaction begins. The goal of the agent is to find the optimal policy $\pi^{*}$ that maximizes the expected return $\mathbb{E}[\sum_{t=0}^{T}\gamma^{t}r_{t}\mid\pi]$, where $T$ is the terminal time step.
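As a concrete reading of the objective, the return $\sum_{t=0}^{T}\gamma^{t}r_{t}$ of one episode can be accumulated backwards over the observed rewards; the sketch below is purely illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{T} gamma^t * r_t for one episode,
    iterating backwards so each reward is discounted exactly once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```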
To augment RL with symbolic domain knowledge, we extend the normal MDP with the following modules $(\mathcal{P},\mathcal{AP},\phi)$, where $\mathcal{P}=\{p_{1},\dots,p_{K}\}$ is a finite set of atomic propositions (each $p\in\mathcal{P}$ represents a Boolean property of a state $s\in\mathcal{S}$), $\mathcal{AP}=\{(a,\varphi)\mid a\in\mathcal{A},\varphi\in L(\mathcal{P})\}$ is the set of actions with their preconditions, and $L(\mathcal{P})$ denotes the propositional language over $\mathcal{P}$. We use $(a,\varphi)$ to state that action $a$ is explorable in a state if and only if its precondition $\varphi$ holds in that state. (All actions $a\in\mathcal{A}$ can in principle be chosen by the agent; we use the term explorable to distinguish actions whose preconditions are satisfied, i.e., safe, ethical, desirable actions, from those whose preconditions are not satisfied, i.e., unsafe, unethical, undesirable actions.) Finally, $\phi\in L(\mathcal{P})$ is a domain constraint. We use $[\![\phi]\!]=\{\bm{m}\mid\bm{m}\models\phi\}$ to denote the set of all possible symbolic models of $\phi$, where a model is a truth assignment to all propositions in $\mathcal{P}$.
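The tuple $(\mathcal{P},\mathcal{AP},\phi)$ has a direct computational reading. The sketch below (with hypothetical proposition names and an invented constraint) enumerates $[\![\phi]\!]$, the truth assignments over $\mathcal{P}$ satisfying a domain constraint $\phi$, and evaluates an action's precondition on a given model.

```python
from itertools import product

P = ["p1", "p2", "p3"]                       # atomic propositions

def phi(m):
    """Domain constraint: here, p1 and p2 cannot both be true."""
    return not (m["p1"] and m["p2"])

# [[phi]]: all models (truth assignments over P) that satisfy phi
models = [dict(zip(P, bits)) for bits in product([False, True], repeat=len(P))]
sat_models = [m for m in models if phi(m)]

# An action paired with its precondition, (a, varphi) in AP:
# the action is explorable in a state iff its precondition holds there.
precondition = lambda m: m["p1"] and not m["p3"]
explorable = precondition({"p1": True, "p2": False, "p3": False})
```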
To illustrate how symbolic domain knowledge $(\mathcal{P},\mathcal{AP},\phi)$ is reflected in our formulation, we consider the Visual Sudoku task as a concrete example. In this environment, each state is represented as a non-symbolic image input. The properties of a state can be described using propositions in $\mathcal{P}$. For example, the properties of the state in Figure 1(a) include "position (1,1) is number 1", "position (1,2) is empty", etc. Each action $a$ of filling a number in a certain position corresponds to a symbolic precondition $\varphi$, represented by $(a,\varphi)\in\mathcal{AP}$. For example, the action "filling number 1 at position (1,1)" requires that both propositions "position (1,2) is number 1" and "position (2,1) is number 1" are false. Finally, $\phi$ is used to constrain the set of possible states; e.g., "position (1,1) is number 1" and "position (1,1) is number 2" cannot both be simultaneously true for a given state. Leveraging this knowledge raises the following problems.
(a) [Image: sudo1.png]
(b) [Image: sudo2.png]
(c) [Image: sudo3.png]
Figure 1. Example states in the Visual Sudoku environment
(P1) Numerical–symbolic gap. Domain knowledge is expressed in terms of symbolic properties of states, but only raw numerical states are available.
(P2) Constraint satisfaction. The truth values of propositions in $\mathcal{P}$ mapped from a DRL state $s$ must satisfy domain constraints $\phi$ .
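Problem (P2) is what rules out decoding each proposition independently: the pointwise most likely truth values may jointly violate $\phi$. The brute-force sketch below (hypothetical names; per-proposition probabilities assumed independent, a simplification that a PSDD avoids by representing the constrained joint distribution directly) searches only over satisfying assignments.

```python
from itertools import product

def constrained_decode(prop_probs, phi):
    """Return the most probable truth assignment that satisfies phi (P2).

    Thresholding each proposition on its own can violate the domain
    constraint; restricting the search to satisfying assignments cannot.
    """
    names = sorted(prop_probs)
    best, best_p = None, -1.0
    for bits in product([False, True], repeat=len(names)):
        m = dict(zip(names, bits))
        if not phi(m):                       # skip assignments violating phi
            continue
        p = 1.0
        for n in names:                      # independence assumption
            p *= prop_probs[n] if m[n] else 1.0 - prop_probs[n]
        if p > best_p:
            best, best_p = m, p
    return best

# "position (1,1) is number 1" and "position (1,1) is number 2" are exclusive.
# Independent thresholding would set both to True; constrained decoding won't.
phi = lambda m: not (m["pos11_is_1"] and m["pos11_is_2"])
m = constrained_decode({"pos11_is_1": 0.9, "pos11_is_2": 0.6}, phi)
```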
(a) Distribution [Image: pr.png, an example probability distribution over truth assignments to $p_1,p_2,p_3$]
(b) SDD [Image: psdd_sdd.png]
(c) PSDD [Image: psdd.png]
<details>
<summary>vtree.png Details</summary>

(Condensed figure description) A binary vtree whose leaves 0, 2 and 4 are labeled p1, p2 and p3, respectively; internal node 1 is the parent of leaves 0 and 2, and root node 3 is the parent of node 1 and leaf 4.
</details>
(d) Vtree
<details>
<summary>nml.png Details</summary>

(Condensed figure description) A general SDD/PSDD fragment: a single top OR gate whose n inputs are AND gates, the i-th taking inputs prime_i and sub_i, with the wires into the OR gate carrying parameters α1, ..., αn.
</details>
(e) A general fragment
Figure 2. (a) An example of a joint distribution over three propositions $p_{1},p_{2}$ and $p_{3}$ with the constraint $(p_{1}\leftrightarrow p_{2})\lor p_{3}$. (b) An SDD circuit with "OR" and "AND" logic gates representing the constraint $(p_{1}\leftrightarrow p_{2})\lor p_{3}$. (c) The PSDD circuit representing the distribution in Fig. 2(a). (d) The vtree used to group variables. (e) A general fragment showing the structure of SDDs and PSDDs.
(P3) Minimal supervision. The RL environment cannot provide full ground truth of propositions at each state.
(P4) Differentiability. The symbolic reasoning with $\varphi$ introduces a non-differentiable process, which conflicts with gradient-based DRL algorithms that require differentiable policies.
(P5) End-to-end learning. Achieving end-to-end training over the prediction of propositions, symbolic reasoning over preconditions, and optimization of the policy is challenging.
In summary, the above challenges fall into three categories. (P1–P3) concern learning symbolic models from high-dimensional states in DRL, which we address in Section 3. (P4) relates to the differentiability barrier when combining symbolic reasoning with gradient-based DRL, which we tackle in Section 4. (P5) raises the need for end-to-end training, which we present in Section 5.
## 3. Learning Symbolic Grounding
This section introduces how NSAM learns symbolic grounding. At a high level, the goal is to learn whether an action is explorable in a state. Specifically, the agent receives minimal supervision from state transitions after executing an action $a$ . Using this supervision, NSAM learns to estimate the symbolic model of the high-dimensional input state, which is in turn used to check the satisfiability of action preconditions. To achieve this, Section 3.1 presents a knowledge compilation step to encode domain constraints into a symbolic structure, while Section 3.2 explains how this symbolic structure is parameterized and learned from minimal supervision.
### 3.1. Compiling the Knowledge
To address P2 (Constraint satisfaction), we introduce the Probabilistic Sentential Decision Diagram (PSDD) kisa2014probabilistic. PSDDs are designed to represent probability distributions $Pr(\bm{m})$ over possible models, where any model $\bm{m}$ that violates domain constraints is assigned zero probability conditionalPSDD. For example, consider the distribution in Figure 2(a). The first step in constructing a PSDD is to build a Boolean circuit that captures the entries whose probability values are always zero, as shown in Figure 2(b). Specifically, the circuit evaluates to $0$ for model $\bm{m}$ if and only if $\bm{m}\not\models\phi$ . The second step is to parameterize this Boolean circuit to represent the (non-zero) probability of valid entries, yielding the PSDD in Figure 2(c).
To obtain the Boolean circuit in Figure 2(b), we represent the domain constraint $\phi$ using a general data structure called a Sentential Decision Diagram (SDD) sdd. An SDD is a normal form of a Boolean formula that generalizes the well-known Ordered Binary Decision Diagram (OBDD) OBDD; OBDD2. SDD circuits satisfy specific syntactic and semantic properties defined with respect to a binary tree, called a vtree, whose leaves correspond to propositions (see Figure 2(d)). Following Darwiche's definition sdd; psdd_infer1, an SDD normalized for a vtree $v$ is a Boolean circuit defined as follows: If $v$ is a leaf node labeled with variable $p$ , the SDD is either $p$ , $\neg p$ , $\top$ , $\bot$ , or an OR gate with inputs $p$ and $\neg p$ . If $v$ is an internal node, the SDD has the structure shown in Figure 2(e), where $\textit{prime}_{1},\ldots,\textit{prime}_{n}$ are SDDs normalized for the left child $v^{l}$ , and $\textit{sub}_{1},\ldots,\textit{sub}_{n}$ are SDDs normalized for the right child $v^{r}$ . SDD circuits alternate between OR gates and AND gates, with each AND gate having exactly two inputs. The OR gates are mutually exclusive in that at most one of their inputs evaluates to true under any circuit input sdd; psdd_infer1.
A PSDD is obtained by annotating each OR gate in an SDD with parameters $(\alpha_{1},\ldots,\alpha_{n})$ over its inputs kisa2014probabilistic; psdd_infer1, where $\sum_{i}\alpha_{i}=1$ (see Figure 2(e)). The probability distribution defined by a PSDD is as follows. Let $\bm{m}$ be a model that assigns truth values to the PSDD variables, and suppose the underlying SDD evaluates to $0$ under $\bm{m}$ ; then $Pr(\bm{m})=0$ . Otherwise, $Pr(\bm{m})$ is the product of the parameters on the OR-gate inputs that are activated when evaluating $\bm{m}$ , starting from the output gate.
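To make these semantics concrete, the following minimal Python sketch evaluates $Pr(\bm{m})$ on a hand-built PSDD for the constraint $(p_{1}\leftrightarrow p_{2})\lor p_{3}$ of Figure 2. The tuple encoding of nodes and all parameter values (0.6, 0.33, etc.) are illustrative assumptions, not the paper's implementation.

```python
from itertools import product

# Node encodings (hypothetical): ("lit", var, sign) for a literal,
# ("true",) for a constant-true sub, and ("decision", elements) for a
# parameterized OR gate whose elements are (prime, sub, alpha) triples
# with the alphas summing to 1.
def pr(node, m):
    kind = node[0]
    if kind == "lit":
        return 1.0 if m[node[1]] == node[2] else 0.0
    if kind == "true":
        return 1.0
    # decision node: by determinism, at most one element is nonzero
    return sum(a * pr(p, m) * pr(s, m) for p, s, a in node[1])

# Hand-built PSDD for (p1 <-> p2) v p3, with made-up parameters.
equiv = ("decision", [(("lit", 1, True), ("lit", 2, True), 0.33),
                      (("lit", 1, False), ("lit", 2, False), 0.67)])
xor = ("decision", [(("lit", 1, True), ("lit", 2, False), 0.5),
                    (("lit", 1, False), ("lit", 2, True), 0.5)])
p3_any = ("decision", [(("lit", 3, True), ("true",), 0.5),
                       (("lit", 3, False), ("true",), 0.5)])
root = ("decision", [(equiv, p3_any, 0.6), (xor, ("lit", 3, True), 0.4)])
```

Invalid models receive probability $0$ and the probabilities of the valid models sum to $1$, reflecting the two defining properties stated above.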
The key advantage of using PSDDs in our setting is twofold. First, PSDDs strictly enforce domain constraints by assigning zero probability to any model $\bm{m}$ that violates $\phi$ conditionalPSDD, thereby ensuring logical consistency (P2). Second, by ruling out impossible truth assignments through domain knowledge, PSDDs effectively reduce the scale of the probability distribution to be learned ahmed2022semantic.
In addition, PSDDs support tractable probabilistic queries PCbooks; psdd_infer1. While PSDD compilation can be computationally expensive, since circuit size can grow exponentially in the number of propositions and constraints, it is a one-time offline cost. Once compilation is completed, PSDD inference takes time linear in the circuit size, making symbolic reasoning efficient during both training and execution psdd_infer1.
### 3.2. Learning the parameters of PSDD in DRL
To address P1 (Numerical–symbolic gap), we need to learn distributions over models that satisfy the domain constraints. Inspired by recent deep supervised learning work on PSDDs ahmed2022semantic, we parameterize the PSDD using the output of a gating function $g$ . This gating function is a neural network that maps high-dimensional RL states to PSDD parameters $\Theta=g(s)$ . This design allows the PSDD to represent state-conditioned distributions over propositions through its learned parameters, while strictly adhering to domain constraints (via its structure defined by the symbolic knowledge $\phi$ ). The overall process is shown in Figure 3. We use $Pr(\bm{m}\mid\bm{\Theta}=g(s),\bm{m}\models\phi)$ to denote the probability of a model $\bm{m}$ that satisfies the domain constraint $\phi$ given the state $s$ (this is calculated by the PSDD in Figure 3).
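A minimal sketch of such a gating function, assuming one linear layer per OR gate (architecture, shapes, and names are illustrative): it maps a state vector to one softmax-normalized parameter vector per OR gate, so that each gate's parameters sum to $1$ as PSDD semantics require.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax for one gate's parameter vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def make_gating_function(state_dim, gate_sizes, seed=0):
    rng = np.random.default_rng(seed)
    # one weight matrix per OR gate; shapes are placeholders
    weights = [rng.normal(size=(k, state_dim)) for k in gate_sizes]
    def g(s):
        # state s -> list of parameter vectors, one per OR gate
        return [softmax(W @ s) for W in weights]
    return g

g = make_gating_function(state_dim=4, gate_sizes=[2, 2, 2])
theta = g(np.ones(4))
```

In practice $g$ would be a deeper network trained by gradient descent, but the normalization per OR gate is the essential structural requirement.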
After initializing $g$ and the PSDD according to the structure in Figure 3, we obtain a distribution over $\bm{m}$ such that for all $\bm{m}\not\models\phi$ , $Pr(\bm{m}\mid\bm{\Theta}=g(s),\bm{m}\not\models\phi)=0$ . However, for the distribution over the $\bm{m}$ that do satisfy $\phi$ , we still need to learn from data, by adjusting the parameters of the gating function $g$ , to capture the probabilities of the different $\bm{m}$ . To train the PSDD from minimal supervision signals (problem (P3) in Section 2), we construct the supervision data from $\Gamma_{\phi}$ , which consists of tuples $(s,a,s^{\prime},y)$ where the transitions $(s,a,s^{\prime})$ are explored from the environment and $y$ is calculated by:
$$
y=\begin{cases}1,&\text{if }s\;\text{and}\;s^{\prime}\;\text{do not violate}\;\phi,\\
0,&\text{otherwise.}\end{cases} \tag{1}
$$
That is, action $a$ is labeled as explorable (i.e., $y=1$ ) in state $s$ if it does not lead to a violation of the domain constraint $\phi$ ; otherwise, action $a$ is not explorable (i.e., $y=0$ ).
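The labeling rule of Eq. (1) can be sketched as follows; the constraint checker `violates` and the toy "battery must stay positive" rule are stand-ins for the labeled data or automatic violation-detection mechanism mentioned in the text.

```python
def explorable_label(violates, s, s_prime):
    """y = 1 iff neither s nor s' violates the constraint phi (Eq. 1)."""
    return 0 if (violates(s) or violates(s_prime)) else 1

# Hypothetical constraint checker for illustration only.
violates = lambda state: state["battery"] <= 0
```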
<details>
<summary>Framework.png Details</summary>

(Condensed figure description) The gating function g (a neural network) takes the input state s and configures the parameters of a PSDD circuit whose root computes Pr; the propositional theory φ is compiled into an SDD circuit via knowledge compilation, and this SDD defines the structure of the PSDD.
</details>
Figure 3. The architecture design to calculate the probability of symbolic model $\bm{m}$ given DRL state $s$ .
Unlike a fully supervised setting, which would expensively require labeling every propositional variable in $\mathcal{P}$ , Eq. (1) only requires labeling whether a given state violates the domain constraint $\phi$ , which is a minimal supervision signal. In practice, the annotation of $y$ can be obtained either (i) by providing labeled data on whether the resulting state $s^{\prime}$ violates the constraint $\phi$ book2006, or (ii) via an automatic constraint-violation detection mechanism autochecking1; autochecking2.
We emphasize that the action preconditions $\varphi$ and the domain constraints $\phi$ are two separate elements and are treated differently. We first automatically generate training data to learn PSDD parameters by constructing tuples $(s,a,s^{\prime},y)$ as defined in Equation (1). The label $y$ in the tuples $(s,a,s^{\prime},y)$ is then used as an indicator for action preconditions. Specifically, we use $y$ to label whether action $a$ is executable in state $s$ : if a transition $(s,a,s^{\prime})$ is explored by the DRL policy with non-violating states $s$ and $s^{\prime}$ , then $y=1$ , meaning that action $a$ is explorable in $s$ ; otherwise $y=0$ . We thus use $y$ in $(s,a,s^{\prime},y)$ as a minimal supervision signal to estimate the probability of the precondition of action $a$ being satisfied in a non-violating $s$ during PSDD training.
By continuously rolling out the DRL agent in the environment, we store $(s,a,s^{\prime},y)$ into a buffer $\mathcal{D}$ . After collecting sufficient data, we sample batches from $\mathcal{D}$ and update $g$ via stochastic gradient descent SDG; ADAM. Concretely, the update proceeds as follows. Given samples $(s,a,s^{\prime},y)$ , we first use the current PSDD to estimate the probability that action $a$ is explorable in state $s$ , i.e., the probability that $s$ satisfies the precondition $\varphi$ associated with $a$ in $\mathcal{AP}$ :
$$
\hat{P}(a|s)=\sum_{\bm{m}\models\varphi}Pr(\bm{m}|\bm{\Theta}=g(s),\bm{m}\models\phi) \tag{2}
$$
Note that $\hat{P}(a|s)$ here does not represent a policy as in standard DRL; rather, it denotes the estimated probability that action $a$ is explorable in state $s$ . As shown in Equation (2), this probability is calculated by aggregating the probabilities of all models $\bm{m}$ that satisfy the precondition $\varphi$ . In addition, to evaluate whether $\bm{m}\models\varphi$ , we assign truth values to the leaf variables of $\varphi$ 's SDD circuit based on $\bm{m}$ and propagate them bottom-up through the "OR" and "AND" gates, where the Boolean value at the root indicates satisfiability.
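As an illustration of this bottom-up check and of the aggregation in Equation (2), the sketch below uses a hypothetical tuple encoding of SDD nodes, a toy precondition $\varphi = p_{1}\land\neg p_{2}$, and a uniform toy distribution; the brute-force enumeration over models is for clarity only, since actual PSDD inference avoids it.

```python
from itertools import product

# Bottom-up evaluation of m |= varphi on an SDD-like circuit.
def sat(node, m):
    kind = node[0]
    if kind == "lit":
        return m[node[1]] == node[2]
    if kind == "and":
        return sat(node[1], m) and sat(node[2], m)
    if kind == "or":
        return any(sat(c, m) for c in node[1])
    return kind == "true"

def explorable_prob(pr, varphi_sdd, variables):
    # hat-P(a|s): aggregate Pr(m) over all models m satisfying varphi (Eq. 2)
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        m = dict(zip(variables, values))
        if sat(varphi_sdd, m):
            total += pr(m)
    return total

# Toy precondition varphi = p1 AND NOT p2, uniform distribution over 2 vars.
varphi = ("and", ("lit", 1, True), ("lit", 2, False))
uniform = lambda m: 0.25
```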
Given the probability estimated from Equation (2), we compute the cross-entropy loss CROSSENTR by comparing it with the explorability label $y$ . Specifically, for a single data point $(s,a,s^{\prime},y)$ , the loss is:
$$
L_{g}=-\left[y\cdot \log\hat{P}(a|s)+(1-y)\cdot \log\left(1-\hat{P}(a|s)\right)\right] \tag{3}
$$
The intuition behind this loss is straightforward: at each $s$ , it encourages the PSDD to assign higher probability to actions that are explorable (when $y=1$ ) and lower probability to those that are not explorable (when $y=0$ ).
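Equation (3) transcribes directly into code; the clamping constant below is an implementation detail added for numerical stability, not part of the formulation.

```python
import math

def gating_loss(p_hat, y, eps=1e-12):
    """Binary cross-entropy between hat-P(a|s) and the label y (Eq. 3)."""
    p = min(max(p_hat, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```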
## 4. Combining symbolic reasoning with gradient-based DRL
Through the training of the gating function defined in Equation (3), the PSDD in Figure 3 can predict, for a given DRL state, a distribution over the symbolic models $\bm{m}$ of the atomic propositions in $\mathcal{P}$ . This distribution can then be used to evaluate the truth values of the preconditions in $\mathcal{AP}$ and to reason about the explorability of actions. However, directly applying the symbolic logical formulas of preconditions to select actions results in non-differentiable decision-making sg2_1, which prevents gradient flow during policy optimization. This raises a key challenge of integrating symbolic reasoning with gradient-based DRL training in a way that preserves differentiability, i.e., problem (P4) in Section 2.
To address this issue, we employ the PSDD to perform maximum a posteriori (MAP) query PCbooks, obtaining the most likely model $\hat{\bm{m}}$ for the current state. Based on $\hat{\bm{m}}$ and the precondition $\varphi$ of each action $a$ , we re-normalize the action probabilities from a policy network. In this way, the learned symbolic representation from the PSDD can be used to constrain action selection, while the underlying policy network still provides a probability distribution that can be updated through gradient-based optimization.
Concretely, before the DRL agent makes a decision, we first use the PSDD to obtain the most likely model describing the state:
$$
\hat{\bm{m}}=\operatorname*{argmax}_{\bm{m}}\,Pr(\bm{m}|\bm{\Theta}=g(s),\bm{m}\models\phi) \tag{4}
$$
Importantly, the argmax operation on the PSDD does not require enumerating all possible $\bm{m}$ . Instead, it can be computed in time linear in the PSDD size by exploiting its structural properties of decomposability and determinism (see psdd_infer1). This linear-time inference makes PSDDs particularly attractive for DRL, where efficient evaluation of candidate actions is essential anokhinhandling.
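A sketch of the MAP query on a PSDD, with a hypothetical tuple encoding (`("lit", var, sign)`, `("true",)`, and `("decision", [(prime, sub, alpha), ...])`) and illustrative parameters. Decomposability (primes and subs range over disjoint variables) and determinism make per-element maximization exact; without caching, the recursion is linear only for tree-shaped circuits.

```python
def map_query(node):
    """Return (max probability, maximizing partial assignment) for a PSDD node."""
    kind = node[0]
    if kind == "lit":
        return 1.0, {node[1]: node[2]}
    if kind == "true":
        return 1.0, {}
    best_p, best_m = -1.0, {}
    for prime, sub, alpha in node[1]:
        pp, mp = map_query(prime)
        ps, ms = map_query(sub)
        p = alpha * pp * ps
        if p > best_p:
            best_p, best_m = p, {**mp, **ms}
    return best_p, best_m

# Tiny example: constraint (p1 <-> p2) v p3 with made-up parameters.
equiv = ("decision", [(("lit", 1, True), ("lit", 2, True), 0.33),
                      (("lit", 1, False), ("lit", 2, False), 0.67)])
xor = ("decision", [(("lit", 1, True), ("lit", 2, False), 0.5),
                    (("lit", 1, False), ("lit", 2, True), 0.5)])
p3_any = ("decision", [(("lit", 3, True), ("true",), 0.5),
                       (("lit", 3, False), ("true",), 0.5)])
root = ("decision", [(equiv, p3_any, 0.6), (xor, ("lit", 3, True), 0.4)])

best_p, m_hat = map_query(root)
```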
After obtaining the symbolic model of the state, we renormalize the probability of each action $a$ according to its precondition $\varphi$ :
$$
\pi^{+}(s,a,\phi)=\frac{\pi(s,a)\cdot C_{\varphi}(\hat{\bm{m}})}{\sum_{a^{\prime}\in\mathcal{A}}\pi(s,a^{\prime})\cdot C_{\varphi^{\prime}}(\hat{\bm{m}})} \tag{5}
$$
where $\pi(s,a)$ denotes the probability of action $a$ at state $s$ predicted by the policy network, $C_{\varphi}(\hat{\bm{m}})$ is the evaluation of the SDD encoding of $\varphi$ under the model $\hat{\bm{m}}$ , and $\varphi^{\prime}$ is the precondition of action $a^{\prime}$ . The input of Equation (5) explicitly includes $\phi$ , as $\phi$ is required for evaluating the model $\hat{\bm{m}}$ in Equation (4). Intuitively, $C_{\varphi}(\hat{\bm{m}})$ acts as a symbolic mask: it equals $1$ if $\hat{\bm{m}}\models\varphi$ (i.e., the precondition is satisfied) and $0$ otherwise. As a result, actions whose preconditions are violated are excluded from selection, while the probabilities of the remaining actions are renormalized into a new distribution. It is important to note that during execution, we use the PSDD (trained with $y$ via Equations (2) and (3)) to infer the most probable symbolic model of the current state (Equation (4)), and can therefore formally verify whether each action's precondition is satisfied under this symbolic model (via $C_{\varphi}$ in Equation (5)).
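Equation (5) amounts to an element-wise mask followed by renormalization, sketched here with hypothetical action names; handling the edge case where every action is masked out is a design choice left open by the formula.

```python
def mask_and_renormalize(pi, mask):
    """pi: action -> probability from the policy network;
    mask: action -> 0/1, i.e., C_varphi(m_hat) for each action (Eq. 5)."""
    weights = {a: pi[a] * mask[a] for a in pi}
    z = sum(weights.values())
    if z == 0.0:
        raise ValueError("all actions masked out")  # edge case, design choice
    return {a: w / z for a, w in weights.items()}

pi_plus = mask_and_renormalize({"left": 0.5, "right": 0.3, "stay": 0.2},
                               {"left": 1, "right": 0, "stay": 1})
```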
According to prior work, such $0$ - $1$ masking and renormalization still yield a valid policy gradient, thereby preserving the theoretical guarantees of policy optimization actionmasking_vPG. In practice, we optimize the masked policy $\pi^{+}$ using the Proximal Policy Optimization (PPO) objective schulman2017ppo. Concretely, the loss is:
$$
\mathcal{L}_{\text{PPO}}(\pi^{+})=\mathbb{E}_{t}\!\left[\min\!\Big(\mathfrak{r}_{t}(\pi^{+})\,\hat{A}_{t},\text{clip}(\mathfrak{r}_{t}(\pi^{+}),1-\epsilon,1+\epsilon)\,\hat{A}_{t}\Big)\right] \tag{6}
$$
where $\mathfrak{r}_{t}(\pi^{+})$ denotes the probability ratio between the new and old masked policies, $\text{clip}$ is the clipping function, and $\hat{A}_{t}$ is the advantage estimate schulman2017ppo. In this way, the masked policy can be trained with PPO to effectively exploit symbolic action preconditions, leading to safer and more sample-efficient learning.
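A minimal NumPy sketch of the clipped surrogate in Equation (6) over a batch of timesteps (the ratio and advantage values are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Clipped PPO surrogate (Eq. 6): mean over the batch of
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()

obj = ppo_clip_objective(np.array([1.5, 0.5]), np.array([1.0, -1.0]))
```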
## 5. End-to-end training framework
After deriving the gating function loss of PSDD in Equation (3) and the DRL policy loss in Equation (6), we now introduce an end-to-end training framework that combines the two components.
Before presenting the training procedure, we first summarize how the agent makes decisions, as illustrated in Figure 4. At each time step, the state $s$ is first fed into the symbolic grounding module, whose internal structure is shown in Figure 3. Within this module, the PSDD produces the most probable symbolic description of the state, i.e., a model $\hat{\bm{m}}$ , according to Equation (4). The agent then leverages the preconditions in $\mathcal{AP}$ (following Equation (5)) to mask the action distribution from the policy network, and samples an action from the renormalized distribution to interact with the environment.
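The decision procedure just described can be sketched as a single step function; `policy`, `map_model`, and `precond_sat` are placeholder callables standing in for the policy network, the MAP query of Equation (4), and the precondition check $C_{\varphi}$ of Equation (5).

```python
import random

def act(s, policy, map_model, precond_sat, rng=random):
    m_hat = map_model(s)                              # Eq. (4): symbolic model of s
    pi = policy(s)                                    # action -> probability
    weights = {a: p * (1.0 if precond_sat(a, m_hat) else 0.0)
               for a, p in pi.items()}                # Eq. (5): symbolic mask
    z = sum(weights.values())
    if z == 0.0:
        raise ValueError("all actions masked out")    # edge case, design choice
    probs = {a: w / z for a, w in weights.items()}    # renormalize
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Toy usage: only "b"'s precondition holds, so "b" is always chosen.
choice = act(
    s=None,
    policy=lambda s: {"a": 0.7, "b": 0.3},
    map_model=lambda s: {},
    precond_sat=lambda a, m: a == "b",
)
```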
Figure 4. An illustration of the decision process of our agent, where the symbolic grounding module is as in Figure 3 and $\hat{\bm{m}}$ is calculated via the PSDD by Equation (4).
Algorithm 1 Training framework.
1: Compile $\phi$ as SDD to obtain structure of PSDD
2: Initialize gating network $g$ according to the structure of PSDD
3: Initialize policy network $\pi$ , total step $T\leftarrow 0$
4: Initialize a data buffer $\mathcal{D}$ for learning PSDD
5: for $Episode=1\to M$ do
6: Reset $Env$ and get $s$
7: while not terminal do
8: Calculate action distribution before masking $\pi(s,a)$
9: Calculate $\Theta=g(s)$ and assign parameter $\Theta$ to PSDD
10: Calculate $\hat{\bm{m}}$ in Equation (4)
11: Calculate action distribution after masking $\pi^{+}(s,a,\phi)$
12: Sample an action $a$ from $\pi^{+}(s,a,\phi)$
13: Execute $a$ and get $r$ , $s^{\prime}$ from $Env$
14: Obtain the truth-value $y$ according to Equation (1)
15: Store $(s, y)$ into $\mathcal{D}$
16: if terminal then
17: Update policy $\pi^{+}(s,a,\phi)$ using the trajectory of this episode with Equation (6)
18: end if
19: if $(T+1) \bmod freq_{g} = 0$ then
20: Sample batches from $\mathcal{D}$
21: Update gating function $g$ with Equation (3)
22: end if
23: $s\leftarrow s^{\prime}$ , $T\leftarrow T+1$
24: end while
25: end for
To achieve end-to-end training, we propose Algorithm 1. The key idea of this training framework is to periodically update the gating function of the PSDD during the agent's interaction with the environment, while simultaneously training the policy network under the guidance of action masks. As the RL agent explores, it continuously generates minimally supervised feedback for the PSDD via $\Gamma_{\phi}$, thereby improving the quality of the learned action masks. In turn, the improved action masking reduces infeasible actions and guides the agent toward higher rewards and more informative trajectories, which accelerates policy learning.
Concretely, before training starts, the domain constraint $\phi$ is compiled into an SDD in Line 1, which determines both the structure of the PSDD and the output dimensionality of the gating function. Lines 2 $\sim$ 4 initialize the gating function, the RL policy network, and a replay buffer $\mathcal{D}$ that stores minimally supervised feedback for PSDD training. In Lines 5 $\sim$ 25, the agent interacts with the environment and jointly learns the gating function and policy network. At each step (Lines 8 $\sim$ 11), the agent computes the masked action distribution based on the current gating function and policy network, which is crucial to minimizing the selection of infeasible actions during training. At the end of each episode (Lines 16 $\sim$ 18), the policy network is updated using the trajectory of that episode; during this update, the gating function is kept frozen. In addition, the gating function is periodically updated (Lines 19 $\sim$ 22) with frequency $freq_{g}$. This periodic update enables the PSDD to provide increasingly accurate action masks in subsequent interactions, which simultaneously improves policy optimization and reduces constraint violations.
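The interleaved update schedule of Algorithm 1 can be sketched as a training loop. Everything below is a stubbed illustration, not the paper's code: the environment, policy output, symbolic mask, and truth-value feedback are all placeholders, and the class and function names are our own.

```python
import random

class StubEnv:
    """Toy episodic environment: 20 steps per episode (an assumption)."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, a):
        self.t += 1
        return random.random(), float(self.t), self.t >= 20  # (r, s', done)

def masked_distribution(pi, mask):
    """Equation (5): zero out infeasible actions and renormalize."""
    masked = [p * m for p, m in zip(pi, mask)]
    z = sum(masked) or 1.0
    return [p / z for p in masked]

def train(env, episodes=5, freq_g=8, n_actions=3):
    buffer, trajectory, T = [], [], 0   # D, episode trajectory, total steps
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            pi = [1.0 / n_actions] * n_actions   # policy network output (stub)
            mask = [1.0] * n_actions             # C_phi(m_hat) via PSDD (stub)
            pi_plus = masked_distribution(pi, mask)
            a = random.choices(range(n_actions), weights=pi_plus)[0]
            r, s2, done = env.step(a)
            y = 1                                # truth value, Equation (1) (stub)
            buffer.append((s, y))
            trajectory.append((s, a, r))
            if done:
                trajectory.clear()  # policy update with Equation (6) goes here
            if (T + 1) % freq_g == 0 and buffer:
                pass                # gating-function update with Equation (3)
            s, T = s2, T + 1
    return T

random.seed(0)
print(train(StubEnv()))
```

The policy is updated once per episode while the gating function is refreshed every `freq_g` environment steps, mirroring Lines 16 $\sim$ 18 and 19 $\sim$ 22 of Algorithm 1.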
## 6. Related work
Symbolic grounding in neuro-symbolic learning. In the literature on neuro-symbolic systems NSsys; NSsys2, symbol grounding refers to learning a mapping from raw data to symbolic representations, which is considered a key challenge for integrating neural networks with symbolic knowledge neuroRM. Various approaches have been proposed to address this challenge in deep supervised learning. sg1_1; sg1_2; sg1_3 leverage logic abduction or consistency checking to periodically correct the output symbolic representations. To achieve end-to-end differentiable training, the most common approach is to embed symbolic knowledge into neural networks through differentiable relaxations of logic operators, such as fuzzy logic sg2_2 or Logic Tensor Networks (LTNs) sg2_1. These methods approximate logical operators with smooth functions, allowing symbolic supervision to be incorporated into gradient-based optimization sg2_2; sg2_3; sg2_4; neuroRM. More recently, advances in probabilistic circuits psdd_infer1; PCbooks have given rise to efficient methods that embed symbolic knowledge via differentiable probabilistic representations, such as PSDDs kisa2014probabilistic. In these methods, symbolic knowledge is first compiled into an SDD sdd to initialize the structure, after which a neural network is used to learn the parameters for predicting symbolic outputs ahmed2022semantic. This class of approaches has been successfully applied to structured output prediction tasks, including multi-label classification ahmed2022semantic and routing psdd_infer2.
Symbolic grounding is also crucial in DRL. NRM neuroRM learns to capture the symbolic structure of reward functions in non-Markovian reward settings. In contrast, our approach learns symbolic properties of states to constrain actions under Markovian reward settings. KCAC KCAC has extended PSDDs to MDPs with combinatorial action spaces, where symbolic knowledge is used to constrain action composition. Our work also uses PSDDs but differs from KCAC: we use action preconditions as symbolic knowledge to determine the explorability of each individual action in a standard DRL setting, whereas KCAC incorporates knowledge about valid combinations of actions in a DRL setting with combinatorial action spaces.
Action masking. In DRL, action masking refers to masking out invalid actions during training so that actions are sampled from a valid set actionmasking_vPG. Empirical studies in early real-world applications show that masking invalid actions can significantly improve the sample efficiency of DRL actionmasking_app1; actionmasking_app2; actionmasking_app3; actionmasking_app4; actionmasking_app5; actionmasking_app6. Following the systematic discussion of action masking in DRL actionmasking_review, actionmasking_onoffpolicy investigates the impact of action masking on both on-policy and off-policy algorithms. Works such as actionmasking_continous; actionmasking_continous2 extend action masking to continuous action spaces. actionmasking_vPG proves that binary action masking yields valid policy gradients during learning. In contrast to these approaches, our method does not assume that the set of invalid actions is predefined by the environment. Instead, we learn the set of invalid actions in each state for DRL using action precondition knowledge.
Another line of work employs a logical system (e.g., linear temporal logic LTL1985) to restrict the agent's actions shielding1; shielding2; shielding3; PPAM. These approaches require a predefined symbol grounding function to map states to their symbolic representations, whereas our method learns such a function (via the PSDD) from data. PLPG PLPG learns the probability of applying shielding with action constraints formulated in probabilistic logic. By contrast, our preconditions are hard constraints expressed in propositional logic: if the precondition of an action evaluates to false, the action is strictly not explorable.
Cost-based safe reinforcement learning. In addition to action masking, a complementary approach is to jointly optimize rewards and a safety-related cost function to improve RL safety. In these cost-based settings, a policy is considered safe if its expected cumulative cost remains below a pre-specified threshold safeexp1; PPOlarg. A representative foundation of such cost-based approaches is the constrained Markov decision process (CMDP) framework CMDP1994, which maximizes expected reward while keeping costs below a threshold. Subsequent works often adopt Lagrangian relaxation to incorporate constraints into the optimization objective ppo-larg1; ppo-larg2; ppo-larg3; ppo-larg4; PPOlarg. However, these methods often suffer from unsafe behaviors in the early stages of training highvoil1. To address this, safe exploration approaches emphasize controlling the cost during exploration in unknown environments shielding1; shielding2. Recently, SaGui safeexp1 employed imitation learning and policy distillation to enable agents to acquire safe behaviors from a teacher agent during early training. RC-PPO RCPPO augmented unsafe states to allow the agent to anticipate potential future losses. While constraints can in principle be reformulated as cost functions, our approach does not rely on cost-based optimization; instead, we directly exploit the constraints to learn masks that avoid violations of action constraints.
## 7. Experiment
The experimental design aims to answer the following questions:
Q1: Without predefined symbolic grounding, can NSAM leverage symbolic knowledge to improve the sample efficiency of DRL?
Q2: By jointly learning symbolic grounding and masking strategies, can NSAM significantly reduce constraint violations during exploration, thereby enhancing safety?
Q3: In NSAM, is symbolic grounding with PSDDs more effective than replacing it with a module based on standard neural network?
Q4: In what ways does symbolic knowledge contribute to the learning process of NSAM?
### 7.1. Environments
We evaluate NSAM on four challenging reinforcement learning domains with logical constraints, as shown in Figure 5. Across all these environments, agents receive inputs in the form of representations without a predefined symbolic grounding, such as vectors or images.
(a) Sudoku
(b) N-queens
(c) Graph coloring
(d) Visual Sudoku
Figure 5. Four tasks with logical constraints
(a) Sudoku 3 $\times$ 3
<details>
<summary>S44.png Details</summary>

Learning curves: Evaluate Reward (-4 to 2, y-axis) vs. episode (0-3000, x-axis) for seven series, each a mean line with a min/max band. The red series jumps to the maximum reward (~2) by episode ~700 and holds it with no further variability; the yellow series is the only other one to reach positive reward (~1.5 by episode 3000); the cyan series shows no improvement, remaining near -4.
</details>
(b) Sudoku 4 $\times$ 4
<details>
<summary>S55.png Details</summary>

Learning curves: Evaluate Reward (-6 to 2, y-axis) vs. episode (0-3000, x-axis), mean lines with min/max bands. The red series rises sharply to ~2 by episode 1000 and plateaus; the yellow series climbs gradually to ~1; the remaining series stay between -6 and -1, with the cyan series nearly flat at ~-6.
</details>
(c) Sudoku 5 $\times$ 5
<details>
<summary>4queens.png Details</summary>

Learning curves: Evaluate Reward (-0.75 to 1.0, y-axis) vs. episode (0-3000, x-axis), mean lines with min/max bands. The red series reaches the maximum reward (1.0) by episode ~1000 and maintains it; the yellow series grows steadily to ~0.8-0.9; the green and orange series stabilize early at 0; the teal series stays low with high variability; the dark teal series rises initially, then declines to ~-0.25.
</details>
(d) 4 Queens
<details>
<summary>6queens.png Details</summary>

Learning curves: Evaluate Reward (-1.0 to 1.0, y-axis) vs. episode (0-3000, x-axis), mean lines with min/max bands. The red and magenta series both reach the maximum reward (1.0), red by episode ~1000 and magenta by ~2500; the green series hovers near 0; the yellow and orange series fluctuate between -0.5 and 0; the cyan series stays lowest, around -1.0 to -0.5.
</details>
(e) 6 Queens
<details>
<summary>8queens.png Details</summary>

Learning curves: Evaluate Reward (-1.5 to 1.0, y-axis) vs. episode (0-3000, x-axis), mean lines with min/max bands. The red series reaches the maximum reward (1.0) by episode ~1500 and stabilizes; the magenta series surges to ~1.0 around episode 2500; the remaining series stay between -1.5 and 0 with limited improvement.
</details>
(f) 8 Queens
<details>
<summary>10queens.png Details</summary>

Learning curves: Evaluate Reward (-2.0 to 1.0, y-axis) vs. episode (0-3000, x-axis), mean lines with min/max bands. The red series spikes to the maximum reward (1.0) after episode ~2000 and stabilizes; the other series improve only gradually or remain flat between -2.0 and -0.5, with variance narrowing as training progresses.
</details>
(g) 10 Queens
<details>
<summary>Graph1.png Details</summary>

Learning curves: Evaluate Reward (-1.5 to 1.0, y-axis) vs. episode (0-3000, x-axis), mean lines with min/max bands. The red series reaches the maximum reward (1.0) by episode ~1000; the yellow series rises steadily to ~0.9; the magenta and teal series trend upward with volatility; the green series improves only after episode 2000; the orange (~-1.0) and cyan (~-1.5) series show no improvement.
</details>
(h) Graph 1
<details>
<summary>Graph2.png Details</summary>

Learning curves: Evaluate Reward (-3 to 1, y-axis) vs. episode (0-2500, x-axis), mean lines with min/max bands. The red series jumps from ~-3 to ~1 around episode 1750 and plateaus; the yellow series improves modestly to ~-1.2; all remaining series cluster between -3 and -2.5 with little to no improvement.
</details>
(i) Graph 2
<details>
<summary>Graph3.png Details</summary>

### Visual Description
Line chart of **Evaluate Reward** (y-axis, -3 to 1) against **Episode** (x-axis, 0 to 1200) for the seven methods, with shaded variance bands. The red (NSAM) and magenta (PLPG) curves rise sharply around episode 600 and reach ~1 by episode 800, the magenta curve with more fluctuation afterwards. The yellow (KCAC) curve improves gradually to between -0.5 and 0 by episode 1200, while the green (PPO), cyan (RC-PPO), and orange (PPO-Lagrangian) curves remain near -3 to -2 with minimal improvement. Variance bands are wide early for the improving curves and narrow after convergence.
</details>
(j) Graph 3
<details>
<summary>Graph4.png Details</summary>

### Visual Description
Line chart of **Evaluate Reward** (y-axis, -2.5 to 1.0) against **Episode** (x-axis, 0 to ~1250), seven curves with confidence bands. The red (NSAM) curve rises sharply around episodes 400-500, reaches 1.0 by episode 600, and stays there. The magenta (PLPG) curve is highly volatile, repeatedly swinging between peaks of 1.0 and troughs near -2.0; the yellow (KCAC) curve improves gradually from ~-1.75 to ~0.25. The green (PPO), teal (Rainbow), orange (PPO-Lagrangian), and cyan (RC-PPO) curves stay flat between -2.5 and -2.0 with narrow variance bands.
</details>
(k) Graph 4
<details>
<summary>SV22.png Details</summary>

### Visual Description
Line chart "Reward vs Steps (Mean Min/Max)": **Evaluate Reward** (y-axis, -1.5 to 1.0) against **Episode** (x-axis, 0 to 2000), seven curves with min/max shading. The red (NSAM), green (PPO), and orange (PPO-Lagrangian) curves rise sharply between episodes 250 and 500 and reach rewards near 1.0 by episode 2000, with variance narrowing over training. The yellow (KCAC), magenta (PLPG), and dark teal (Rainbow) curves fluctuate between -1.5 and 0.0, while the cyan (RC-PPO) curve stays flat near -1.5 throughout.
</details>
(l) Visual Sudoku 2 $\times$ 2
<details>
<summary>SV33.png Details</summary>

### Visual Description
Line chart "Reward vs Steps (Mean Min/Max)": **Evaluate Reward** (y-axis, -3 to 1.5) against **Episode** (x-axis, 0 to 2000). The red (NSAM) curve rises steeply, crosses 0 near episode 350, and plateaus around 1.4 with a narrow variance band; it is the only curve to reach positive reward. The magenta (PLPG) curve climbs to roughly -0.3 but has the widest variance band, while the green (PPO), orange (PPO-Lagrangian), yellow (KCAC), and dark teal (Rainbow) curves settle between -1.3 and -0.6. The cyan (RC-PPO) curve barely improves, ending near -2.3.
</details>
(m) Visual Sudoku 3 $\times$ 3
<details>
<summary>SV44.png Details</summary>

### Visual Description
Line chart "Reward vs Steps (Mean Min/Max)": **Evaluate Reward** (y-axis, -4 to 2) against **Episode** (x-axis, 0 to 2000), with six visible curves. The red (NSAM) curve climbs steadily, crossing 0 around episode 900 and ending near 1.9; the magenta (PLPG) curve improves moderately to ~-0.5. The green (PPO), yellow (KCAC), and dark teal (Rainbow) curves plateau between -2.5 and -1.2, while the cyan (RC-PPO) curve stays flat at -4 until ~episode 750 and then fluctuates around -3.5. Variance bands are widest for the two top-performing curves.
</details>
(n) Visual Sudoku 4 $\times$ 4
<details>
<summary>SV55.png Details</summary>

### Visual Description
Line chart "Reward vs Steps (Mean Min/Max)": **Evaluate Reward** (y-axis, -6 to 2) against **Episode** (x-axis, 0 to 1600). The red (NSAM) curve shows the steepest improvement, rising from ~-5 to ~1 by episode 1600 with a wide variance band; the pink (PLPG) curve improves moderately to ~-1.5. The yellow (KCAC) and green (PPO) curves hover around -3 to -2.5, while the cyan (RC-PPO) curve stays lowest, around -5 to -4.5. Final ordering: red > pink > green ≈ yellow > cyan.
</details>
(o) Visual Sudoku 5 $\times$ 5
<details>
<summary>legends.png Details</summary>

### Visual Description
Horizontal legend shared by all panels of Figure 6, mapping line colors to methods: teal = Rainbow, green = PPO, orange = PPO-Lagrangian, yellow = KCAC, cyan = RC-PPO, magenta = PLPG, red = NSAM (ours).
</details>
Figure 6. Comparison of learning curves on four domains. As the board size in Sudoku and the queen number in N-Queens increase, the learning task becomes more challenging. Graphs 1 $\sim$ 4 denote four graph coloring tasks with different topologies.
Sudoku. Sudoku is a decision-making domain with logical constraints sudokuRL. In this domain, the agent fills one cell with a number at each step until the board is complete. Action preconditions arise naturally: filling a cell with a number requires that the same number does not already appear in the same row or column. Prior work sudokuRL; sudokuMLP shows that existing DRL algorithms struggle to solve Sudoku without predefined symbolic grounding functions.
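The row/column precondition above can be expressed as a Boolean action mask. The following is an illustrative sketch only (the function name and board encoding are our own, not the paper's NSAM implementation): an action (row, column, number) is feasible iff the cell is empty and the number is absent from that row and column.

```python
import numpy as np

def sudoku_action_mask(board: np.ndarray) -> np.ndarray:
    """Boolean mask over (row, col, num) actions for an n x n board.

    board[r, c] == 0 means the cell is empty; otherwise it holds a
    number in 1..n. Action (r, c, num) is feasible iff the cell is
    empty and num appears in neither row r nor column c.
    Illustrative sketch; not the paper's NSAM implementation.
    """
    n = board.shape[0]
    mask = np.zeros((n, n, n), dtype=bool)  # mask[r, c, num - 1]
    for r in range(n):
        for c in range(n):
            if board[r, c] != 0:
                continue  # occupied cell: every fill action is infeasible
            for num in range(1, n + 1):
                if num not in board[r, :] and num not in board[:, c]:
                    mask[r, c, num - 1] = True
    return mask
```

In a masked policy, infeasible entries of this tensor would have their logits set to $-\infty$ before sampling, so the agent can only select feasible fills.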
| Domain | Task | Rainbow Rew. | Rainbow Viol. | PPO Rew. | PPO Viol. | PPO-Lagrangian Rew. | PPO-Lagrangian Viol. | KCAC Rew. | KCAC Viol. | RC-PPO Rew. | RC-PPO Viol. | PLPG Rew. | PLPG Viol. | NSAM (ours) Rew. | NSAM (ours) Viol. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sudoku | 2×2 | 1.3 | 36.2% | 1.3 | 14.9% | 1.3 | 0.7% | 1.3 | 9.2% | 0.3 | 11.3% | -0.8 | 5.9% | 1.3 | 0.1% |
| | 3×3 | 0.8 | 88.2% | -0.4 | 99.6% | -0.4 | 99.9% | 1.5 | 57.1% | -2.3 | 99.0% | -0.9 | 8.9% | 1.6 | 0.3% |
| | 4×4 | -2.6 | 100% | -2.6 | 100% | -3.4 | 100% | 1.0 | 91.2% | -3.8 | 99.9% | -2.2 | 15.7% | 2.1 | 0.6% |
| | 5×5 | -4.5 | 100% | -5.2 | 100% | -5.3 | 100% | -0.5 | 94.9% | -5.8 | 100% | -3.3 | 18.3% | 2.7 | 4.3% |
| N-Queens | N=4 | -0.3 | 97.8% | 0.0 | 99.8% | 0.0 | 100% | 0.7 | 78.7% | -0.3 | 93.6% | 0.1 | 10.4% | 1.0 | 2.3% |
| | N=6 | 0.0 | 100% | 0.0 | 100% | -0.1 | 100% | -0.1 | 100% | -0.8 | 100% | 1.0 | 12.1% | 1.0 | 1.3% |
| | N=8 | -0.4 | 100% | -0.1 | 100% | -0.3 | 100% | -0.2 | 100% | -1.3 | 100% | 1.0 | 41.6% | 1.0 | 1.1% |
| | N=10 | -1.2 | 100% | -0.6 | 100% | -1.0 | 100% | -0.8 | 100% | -1.7 | 100% | -1.6 | 98.2% | 1.0 | 1.5% |
| Graph Coloring | G1 | -0.2 | 88.7% | 0.0 | 98.6% | -1.0 | 100% | 0.8 | 43.9% | -1.4 | 99.1% | 0.2 | 5.9% | 1.0 | 0.7% |
| | G2 | -2.8 | 100% | -2.8 | 100% | -2.8 | 100% | -1.1 | 98.7% | -3.1 | 98.7% | -2.8 | 18.7% | 1.0 | 0.7% |
| | G3 | -2.7 | 100% | -2.6 | 100% | -2.5 | 100% | -0.3 | 90.0% | -3.1 | 98.9% | 0.1 | 7.5% | 1.0 | 0.4% |
| | G4 | -2.2 | 100% | -2.3 | 100% | -2.1 | 100% | 0.3 | 85.4% | -2.8 | 75.8% | -2.1 | 11.2% | 1.0 | 0.2% |
| Visual Sudoku | 2×2 | -1.1 | 61.8% | 1.2 | 60.0% | 1.1 | 25.0% | -0.4 | 32.2% | -1.4 | 68.3% | -0.5 | 16.1% | 1.2 | 6.1% |
| | 3×3 | -1.4 | 96.0% | -0.6 | 99.6% | -0.8 | 100% | -0.9 | 40.3% | -2.3 | 95.4% | -0.4 | 34.6% | 1.5 | 1.0% |
| | 4×4 | -2.5 | 100% | -1.3 | 100% | -1.4 | 100% | -1.7 | 39.4% | -3.7 | 96.3% | -0.5 | 38.2% | 1.9 | 0.6% |
| | 5×5 | -3.0 | 100% | -2.2 | 100% | -2.5 | 100% | -2.9 | 88.5% | -5.1 | 99.7% | -1.4 | 53.7% | 0.8 | 2.5% |
Table 1. Comparison of final reward (Rew.) and violation rate (Viol.) during training.
N-Queens. The N-Queens problem requires placing $N$ queens on a chessboard so that no two attack each other, making it a classic domain with logical constraints nqueens. The agent places one queen per step on the chessboard until all $N$ queens are placed safely. This task fits naturally within our extended MDP framework, where action preconditions arise: a queen can be placed at a position if and only if no already-placed queen can attack it.
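The placement precondition above can be sketched as a feasibility check over board positions; this is an illustrative helper of our own, not the paper's code:

```python
def queen_placement_mask(queens, n):
    """queens: list of (row, col) positions already placed on an n x n board.
    Returns the set of (row, col) positions where a new queen can be placed,
    i.e., positions not attacked by any already-placed queen (same row,
    same column, or same diagonal)."""
    safe = set()
    for r in range(n):
        for c in range(n):
            attacked = any(
                r == qr or c == qc or abs(r - qr) == abs(c - qc)
                for qr, qc in queens
            )
            if not attacked:
                safe.add((r, c))
    return safe
```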
Graph Coloring. Graph coloring is an NP-hard problem and serves as a reinforcement learning domain with logical constraints Graphc. The agent sequentially colors the nodes of an undirected graph using a limited set of colors. Action preconditions arise naturally: a node may be colored with a given color if and only if none of its neighbors has already been assigned that color.
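The same precondition can be enumerated over node-color pairs; the adjacency-dict representation and `None`-means-uncolored convention are our own assumptions for illustration:

```python
def coloring_action_mask(adj, colors, num_colors):
    """adj: dict node -> list of neighbor nodes; colors: dict node -> color
    index or None. Returns the set of feasible (node, color) actions: the
    node is uncolored and no neighbor already carries that color."""
    feasible = set()
    for node, nbrs in adj.items():
        if colors.get(node) is not None:
            continue  # already-colored nodes admit no coloring action
        used = {colors[v] for v in nbrs if colors.get(v) is not None}
        for k in range(num_colors):
            if k not in used:
                feasible.add((node, k))
    return feasible
```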
Visual Sudoku. Visual Sudoku follows the same rules as standard Sudoku but uses image-based digit representations instead of vector inputs. Following the prior visual Sudoku benchmark SudokuV_SATNET, we generate boards by randomly sampling digits from the MNIST dataset mnist. Visual Sudoku 5×5 has a 140×140-dimensional state space and a 125-dimensional action space, and the corresponding PSDD contains 125 atomic propositions and 782 clauses as constraints. This environment poses an additional challenge: the digits are high-dimensional, uncertain representations, which increases the difficulty of symbolic grounding.
### 7.2. Hyperparameters
We run NSAM on a 2.60 GHz AMD Rome 7H12 CPU and an NVIDIA GeForce RTX 3070 GPU. For the policy and value functions, NSAM uses three-layer fully connected networks with 64 neurons per layer, while the gating function (Figure 3) uses a three-layer fully connected network with 128 neurons per layer. In the Visual Sudoku environment with image-based inputs, the policy and value functions are equipped with a convolutional encoder consisting of two convolutional layers (kernel sizes of 5 and 3, stride 2), followed by ReLU activations. The encoder output is then connected to a two-layer fully connected network with 256 neurons per layer. Similarly, the gating function in Visual Sudoku incorporates the same convolutional encoder, whose output is connected to a three-layer fully connected network with 128 neurons per layer. The learning rates for the actor, critic, and gating networks are set to $3\times 10^{-4}$ , $3\times 10^{-4}$ , and $2\times 10^{-4}$ , respectively. The gating function is trained using the Adam optimizer with a batch size of 128, updated every 1,000 time steps with 1,000 gradient updates per interval. Other hyperparameters follow the standard PPO settings. The code of NSAM is open-sourced at: https://github.com/shan0126/NSRL.
### 7.3. Baselines
We compare NSAM with representative baselines from four categories. (1) Classical RL: Rainbow Rainbow and PPO schulman2017ppo, two standard DRL methods, where Rainbow integrates several DQN enhancements and PPO is known for robustness and sample efficiency. (2) Neuro-symbolic RL: KCAC KCAC, which incorporates domain constraint knowledge to reduce invalid actions and represents the state-of-the-art in action-constrained DRL. (3) Action masking: PLPG PLPG, a state-of-the-art method that learns soft action masks for automatic action filtering. (4) Cost-based safe DRL: PPO-Lagrangian PPOlarg and RC-PPO RCPPO, which jointly optimize rewards and costs under the CMDP framework. For these methods, we construct the cost function using $(s,a,s^{\prime},y)\in\Gamma_{\phi}$ , where $y=0$ increases the cost by 1 wachi2023safe.
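For illustration, the cost construction described above, together with one common form of the Lagrangian-penalized surrogate used by such baselines, can be sketched as follows. `lagrangian_advantage` is our own hedged approximation of a typical PPO-Lagrangian objective, not the baselines' exact code:

```python
def transition_costs(transitions):
    """Map labeled transitions (s, a, s2, y) to CMDP costs: a transition
    with y == 0 (action not explorable in that state) incurs cost 1,
    otherwise cost 0, as described in the baselines setup."""
    return [1.0 if y == 0 else 0.0 for (_, _, _, y) in transitions]

def lagrangian_advantage(reward_adv, cost_adv, lam):
    """Sketch of a typical PPO-Lagrangian per-sample surrogate: the reward
    advantage penalized by the lambda-weighted cost advantage, rescaled so
    magnitudes stay bounded as lambda grows. Illustrative form only."""
    return (reward_adv - lam * cost_adv) / (1.0 + lam)
```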
### 7.4. Learning efficiency and final performance
To answer Q1, we compare our method with all baselines on 16 tasks across four domains. The learning curves are shown in Figure 6. The shaded regions indicate the upper and lower bounds of performance over 5 runs with different random seeds. During the learning process of NSAM, we apply the normalization defined in Eq. (5), which plays a critical role in excluding unsafe or unexplorable actions from the RL policy. On the Sudoku tasks, as the board size increases and the state-action space grows, NSAM exhibits a slight decrease in sample efficiency but consistently outperforms all baselines. A similar trend is observed in the N-Queens domain as the number of queens increases. In the Graph Coloring tasks, NSAM converges to the optimal policy regardless of the graph topology among Graphs 1 $\sim$ 4. For the Visual Sudoku tasks, NSAM shows small fluctuations in performance after convergence due to the occurrence of previously unseen images. These fluctuations remain minor, indicating that NSAM's symbolic grounding and learned policy generalize well to unseen digit images. The final converged reward values are reported in Table 1. Overall, NSAM achieves stable and competitive performance, consistently matching or outperforming the baselines.
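The normalization in Eq. (5) is not reproduced here, but its role, renormalizing the policy over feasible actions only, can be sketched with a generic masked softmax (our own illustrative form, not necessarily the paper's exact equation):

```python
import math

def masked_softmax(logits, mask):
    """Renormalize a policy over feasible actions only. mask[i] = 1 keeps
    action i; mask[i] = 0 forces its probability to exactly 0. If no action
    is feasible, fall back to a uniform distribution."""
    exps = [math.exp(l) if m else 0.0 for l, m in zip(logits, mask)]
    z = sum(exps)
    if z == 0.0:
        return [1.0 / len(logits)] * len(logits)
    return [e / z for e in exps]
```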
### 7.5. Less violation
To answer Q2, we record constraint violations during training for each task. An episode is considered to violate constraints if any transition $(s,a,s^{\prime},y)$ is labeled with $y=0$ , i.e., the action $a$ is not explorable in that state. The violation rate is computed as the ratio of episodes violating constraints to the total number of episodes. The final violation rates are summarized in Table 1. NSAM achieves significantly lower violation rates than all other methods.
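The violation-rate computation described above can be written out directly; the tuple layout `(s, a, s2, y)` follows the paper's notation, while the episode container is our own assumption:

```python
def violation_rate(episodes):
    """episodes: list of episodes, each a list of (s, a, s2, y) transitions.
    An episode violates constraints if any of its transitions carries y == 0;
    the rate is the number of violating episodes over the total number."""
    violating = sum(
        1 for ep in episodes if any(y == 0 for (_, _, _, y) in ep)
    )
    return violating / len(episodes)
```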
(viol1.png: violation rate vs. training step, mean with min/max band, for PPO, PPO-Lagrangian, and Ours.)
(a) Sudoku 2 $\times$ 2
(viol2.png: violation rate vs. training step, mean with min/max band, for PPO, PPO-Lagrangian, and Ours.)
(b) Sudoku 3 $\times$ 3
Figure 7. Violation rate during training
(ab3.png: evaluation reward vs. episode, mean with min/max band, for the two ablation variants.)
(a) Sudoku 3 $\times$ 3
(ab4.png: evaluation reward vs. episode, mean with min/max band, for NSAM and NSAM-PSDD.)
(b) Sudoku 4 $\times$ 4
Figure 8. Ablation study on PSDD in NSAM
We further compare the change in violation rates during training for NSAM, PPO, and PPO-Lagrangian, as shown in Figure 7. In the early stages of training, NSAM exhibits a slightly higher violation rate because the PSDD parameters are not yet trained, which may cause inaccurate evaluation of the action preconditions $\varphi$ . However, as training progresses, the violation rate rapidly decreases to near zero. Throughout training, the violation rate of NSAM remains consistently lower than that of PPO and PPO-Lagrangian.
### 7.6. Ablation study
To answer Q3, we conduct an ablation study on the symbolic grounding module by replacing the PSDD with a standard three-layer fully connected neural network (128 neurons per layer). The experimental results are shown in Figure 8. Unlike PSDDs, neural networks struggle to efficiently exploit symbolic knowledge and cannot guarantee logical consistency in their predictions. As a result, the policy trained with this ablated grounding module exhibits highly unstable performance, confirming the importance of the PSDD for reliable symbolic grounding in NSAM.
### 7.7. Exploiting knowledge structure
To answer Q4, we design a special experiment where NSAM and PPO-Lagrangian are trained using only a single transition, as illustrated on the left side of Fig. 9. The right side of the figure shows the heatmaps of action probabilities after training. With the cost function defined via $\Gamma_{\phi}$ , PPO-Lagrangian can only leverage this single negative transition to reduce the probability of the specific action in that transition. As a result, in the heatmap of action probabilities after training, only the probability of this one action decreases. In contrast, our method can exploit the structural knowledge of action preconditions to infer, from the explorability of one action, that a set of related actions is also infeasible. Consequently, in the heatmap, our method simultaneously reduces the probabilities of four actions. This demonstrates that our approach can utilize knowledge to generalize the policy from very limited experience, which can significantly improve the sample efficiency of DRL.
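The structural generalization illustrated in Figure 9 can be sketched abstractly: once a single observed transition falsifies one precondition atom, every action whose precondition contains that atom can be masked at once. The atom and action names below are hypothetical, chosen only to mirror the 2×2 Sudoku example:

```python
def propagate_infeasibility(preconditions, violated_atom):
    """preconditions: dict action -> set of atoms that must hold for the
    action to be explorable. A single observed violation falsifies one atom;
    this returns every action sharing that atom, so all of them can be
    masked from one experience (illustrative sketch, not the paper's code)."""
    return {a for a, pre in preconditions.items() if violated_atom in pre}
```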
(1data.png: left, the single training transition "Fill 1 at row 2 col 1" on a 2×2 board; right, heatmaps of action probabilities after training for Ours and PPO-Lagrangian over "Fill 1" and "Fill 2" at the four positions.)
Figure 9. Policy training result from a single transition.
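The probability patterns in Figure 9 are what one would expect from masking logits before the softmax: infeasible actions receive probability exactly 0, and the remaining mass is shared among feasible actions. The following is a minimal illustrative sketch (not the NSAM implementation); the 8-action layout and the feasibility pattern are hypothetical, chosen so that, with uniform logits, the four feasible actions each receive probability 0.25, matching the concentrated heatmaps above, while an unmasked uniform policy would assign 0.125 to every action.

```python
import numpy as np

def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over logits with infeasible actions (mask == 0) set to -inf,
    so they receive exactly zero probability."""
    masked = np.where(mask.astype(bool), logits, -np.inf)
    shifted = masked - masked.max()          # numerical stability
    exp = np.exp(shifted)                    # exp(-inf) == 0.0
    return exp / exp.sum()

# 8 actions: 2 fill values x 4 grid positions (illustrative layout).
logits = np.zeros(8)                         # uniform preferences
mask = np.array([1, 0, 0, 1, 0, 1, 1, 0])    # hypothetical feasibility pattern

probs = masked_softmax(logits, mask)
# Feasible actions share the mass (0.25 each); infeasible actions get 0.
```

In an actor-critic agent the mask would be applied to the policy head's logits at every step, so gradients only flow through feasible actions.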
## 8. Conclusions and future work
In this paper, we proposed NSAM, a novel framework that integrates symbolic reasoning with deep reinforcement learning through PSDDs. NSAM addresses the key challenges of learning symbolic grounding from high-dimensional states with minimal supervision, ensuring logical consistency, and enabling end-to-end differentiable training. By leveraging action precondition knowledge, NSAM learns effective action masks that substantially reduce constraint violations while improving the sample efficiency of policy optimization. Our empirical evaluation across four domains demonstrates that NSAM consistently outperforms baselines in terms of sample efficiency and violation rate. In this work, the symbolic knowledge is expressed in propositional form. A promising direction for future work is to investigate richer forms of symbolic knowledge as action preconditions, such as temporal logics, or to design learning frameworks for settings where constraints are unknown or incorrect. Extending NSAM to broader real-world domains is also an important future direction.
We sincerely thank the anonymous reviewers. This work is partly funded by the China Scholarship Council (CSC).
## References