2504.02654v1
# SymDQN: Symbolic Knowledge and Reasoning in Neural Network-based Reinforcement Learning
**Authors**:
- Ivo Amador
- Nina Gierasimczuk (Technical University of Denmark)
- ivo.amador@gmail.com, nigi@dtu.dk
## Abstract
We propose a learning architecture that allows symbolic control and guidance in reinforcement learning with deep neural networks. We introduce SymDQN, a novel modular approach that augments the existing Dueling Deep Q-Networks (DuelDQN) architecture with modules based on the neuro-symbolic framework of Logic Tensor Networks (LTNs). The modules guide action policy learning and allow reinforcement learning agents to display behavior consistent with reasoning about the environment. Our experiment is an ablation study performed on the modules. It is conducted in a reinforcement learning environment of a 5x5 grid navigated by an agent that encounters various shapes, each associated with a given reward. The underlying DuelDQN attempts to learn the optimal behavior of the agent in this environment, while the modules facilitate shape recognition and reward prediction. We show that our architecture significantly improves learning, both in terms of performance and the precision of the agent. The modularity of SymDQN allows reflecting on the intricacies and complexities of combining neural and symbolic approaches in reinforcement learning.
## 1 Introduction
Despite its rapidly growing impact on society, Artificial Intelligence technologies are tormented by reliability issues, such as lack of interpretability, propagation of biases, difficulty in generalizing across domains, and susceptibility to adversarial attacks. A possible way towards more interpretable, controlled and guided algorithms leads through the field of neuro-symbolic AI, which explores new ways of integrating symbolic, logic-based knowledge in neural networks (NNs). In particular, the framework of Logic Tensor Networks (LTNs, for short) Serafini and d’Avila Garcez (2016); Badreddine et al. (2022) enhances learning by interpreting first-order logic formulas concretely on the data used by NN algorithms. Such formulas express properties of the data and, given a fuzzy semantics, can be integrated into the loss function, thus guiding the learning process.
In this paper, we apply LTNs to a reinforcement learning problem. By integrating LTNs in the training process, our learning agent uses logic to learn the structure of the environment, to predict how different objects in the environment interact with each other, and to guide its actions by performing elementary reasoning about rewards. We investigate how such integration affects the learning performance of the robust, established and well-studied framework of the Dueling Deep Q-Network (DuelDQN, for short) Wang et al. (2016). The structure of the paper is as follows. In Section 2 we briefly recall Logic Tensor Networks and elements of the underlying Real Logic. In Section 3 we introduce our methodology: the SymDQN architecture and its training process. We follow up with the presentation of the experiment in Section 4. We discuss the results in Section 5. Section 6 concludes and outlines directions for future work.
#### Related Work
Since its conception, the framework of LTNs has been applied in various domains. In computer vision, LTNs were used to inject prior knowledge about object relationships and properties, improving interpretability and accuracy in object detection Donadello et al. (2017). Their addition to convolutional neural networks improves robustness on noisy data Manigrasso et al. (2021). They enhance the accuracy of reasoning tasks in open-world and closed-world scenarios Wagner and d’Avila Garcez (2022). In Bianchi and Hitzler (2019), LTNs are leveraged for deductive reasoning tasks. Finally, in reinforcement learning, LTNs were used to integrate prior knowledge into agents, improving both the learning rate and the robustness to environmental changes Badreddine and Spranger (2019). The latter work is similar to ours in the choice of tools, but it differs in its methodology. In Badreddine and Spranger (2019), the LTN is a separate pre-training module which interacts with DuelDQN only by creating inputs. In contrast, our SymDQN integrates the LTN in the training process (making it learn alongside DuelDQN).
Our work uses logic to adjust elements of a reinforcement learning framework. In that, it is related to reward shaping approaches, where the learner is given external symbolic information about the environment, e.g., in the form of linear time logic formulas (also known as restraining bolts) in Giacomo et al. (2019) or of an induced automaton in Furelos-Blanco et al. (2021). In a way, the LTN approach is similar: logical formulas adjust the reinforcement learning process. However, our technique is a more distributed form of reward shaping. First, the formulas of Real Logic are used as guides to obtain a symbolic representation of the environment, then to predict immediate rewards from encountering the objects of the environment. Finally, a logical formula is used to help the learner align the $q$ -values (the agent’s long term policy) with the predicted immediate rewards of symbolically represented objects. In other words, we restrain the reinforcement learner by expecting it to reason about its behavior as it learns, and we investigate the impact of this restriction on learning precision and performance. We will come back to this issue in Section 5.1, after we have detailed all the components.
## 2 Real Logic
Real Logic is the basis of the functioning of LTNs. In this section we provide a rudimentary introduction (for a full exposition consult Badreddine et al. (2022)). Let us start with a simple example.
**Example 1**
*Consider two datasets: a dataset of humans (with two features: age and gender), and a dataset of pets (with three features: height, weight and color). Assume that Alice appears in the dataset of humans (for instance as a five-year-old female), and Max and Mittens are listed in the dataset of pets. To be able to talk about Alice, Max and Mittens, we need a logical language that includes constants referring to objects (particular rows of the datasets). Note that such constants can be of different types: in our example, humans are described by two features, while pets by three.*
The signature of the language of Real Logic $\mathcal{L}$ contains a set $\mathcal{C}$ of constant symbols, a set $\mathcal{F}$ of function symbols, a set $\mathcal{P}$ of predicate symbols, and a set $\mathcal{X}$ of variable symbols. Let $\mathcal{D}$ be a non-empty set of domain symbols (that represent types). Domain symbols are used by functions $\mathbf{D}$ , $\mathbf{D_{in}}$ , and $\mathbf{D_{out}}$ which for a given element of the signature output its type, in the following way. $\mathbf{D}:\mathcal{X}\cup\mathcal{C}\rightarrow\mathcal{D}$ specifies the types for variables and constants; $\mathbf{D_{in}}:\mathcal{F}\cup\mathcal{P}\rightarrow\mathcal{D}^{\ast}$ specifies the types of the sequence of arguments allowed by function and predicate symbols ( $\mathcal{D}^{\ast}$ stands for the set of all finite sequences of elements from $\mathcal{D}$ ); $\mathbf{D_{out}}:\mathcal{F}\rightarrow\mathcal{D}$ specifies the type of the range of a function symbol.
**Example 2**
*Continuing Example 1, let the language of pet-ownership $\mathcal{L}_{pets}$ have the signature consisting of the set of constants $\mathcal{C}=\{\textsc{Alice},\textsc{Max},\textsc{Mit}\}$ , a set of function symbols $\mathcal{F}=\{\textsc{owner}\}$ , a set of predicate symbols $\mathcal{P}=\{\textsc{isOwner}\}$ , and two variable symbols $\mathcal{X}=\{\textsc{pet},\textsc{person}\}$ . Further, we have two domain symbols, one for the domain of humans and one for pets, $\mathcal{D}=\{H,P\}$ . Then, our domain functions can be defined in the following way. $\mathbf{D}(\textsc{Alice})=H$ (Alice is a constant of type $H$ ), $\mathbf{D}(\textsc{Max})=\mathbf{D}(\textsc{Mit})=P$ (Max and Mittens are of type $P$ ). Further, each dataset will have its own variable: $\mathbf{D}(\textsc{pet})=P$ , $\mathbf{D}(\textsc{person})=H$ . We also need to specify inputs for predicates: $\mathbf{D_{in}}(\textsc{isOwner})=HP$ (isOwner is a predicate taking two arguments, a human and a pet). Finally, for functions, we need both the input and the output types: $\mathbf{D_{in}}(\textsc{owner})=P$ , and $\mathbf{D_{out}}(\textsc{owner})=H$ (owner takes as input a pet and outputs the human who owns it).*
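To make such a signature concrete, here is a minimal Python sketch of a possible grounding. All feature values, and the crisp truth degrees, are invented for illustration; the names `owner` and `is_owner` mirror the symbols of Example 2 but are not part of the formal framework:

```python
# Hypothetical groundings for the signature of Example 2.
ALICE = (5.0, 1.0)            # human: (age, gender) -- toy values
MAX = (0.6, 12.0, 0.2)        # pet: (height, weight, color) -- toy values
MIT = (0.3, 4.0, 0.8)

_OWNERS = {MAX: ALICE, MIT: ALICE}

def owner(pet):
    # Function symbol owner: P -> H.
    return _OWNERS[pet]

def is_owner(person, pet):
    # Predicate symbol isOwner: H x P -> [0, 1] (crisp here; in general
    # it would be a learnable function producing a truth degree).
    return 1.0 if owner(pet) == person else 0.0
```

In the actual framework, these groundings are tensors and (learnable) tensor functions rather than Python tuples and dictionaries.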
The language of Real Logic corresponds to first-order logic, and so it allows for more complex expressions. The set of terms consists of constants, variables, and function symbols and is defined in the following way: each $t\in \mathcal{X}\cup\mathcal{C}$ is a term of domain $\mathbf{D}(t)$ ; if $t_{1},\ldots,t_{n}$ are terms, then $t_{1}\ldots t_{n}$ is a term of the domain $\mathbf{D}(t_{1})...\mathbf{D}(t_{n})$ ; if $t$ is a term of the domain $\mathbf{D_{in}}(f)$ , then $f(t)$ is a term of the domain $\mathbf{D_{out}}(f)$ . Finally, the formulae of Real Logic are as follows: $t_{1}=t_{2}$ is an atomic formula for any terms $t_{1}$ and $t_{2}$ with $\mathbf{D}(t_{1})=\mathbf{D}(t_{2})$ ; $P(t)$ is an atomic formula if $\mathbf{D}(t)=\mathbf{D_{in}}(P)$ ; if $\varphi$ and $\psi$ are formulae and $x_{1},\ldots,x_{n}$ are variable symbols, then $\neg\varphi$ , $\varphi\wedge\psi$ , $\varphi\vee\psi$ , $\varphi\rightarrow\psi$ , $\varphi\leftrightarrow\psi$ , $\forall x_{1}\ldots x_{n}\varphi$ and $\exists x_{1}\ldots x_{n}\varphi$ are formulae.
Let us turn to the semantics of Real Logic. Domain symbols allow grounding the logic in numerical, data-driven representations—to be precise, Real Logic is grounded in tensors in the field of real numbers. Tensor grounding is the key concept that allows the interplay of Real Logic with Neural Networks. It refers to the process of mapping high-level symbols to tensor representations, allowing integration of reasoning and differentiable functions. A grounding $\mathcal{G}$ assigns to each domain symbol $D\in\mathcal{D}$ a set of tensors $\mathcal{G}(D)\subseteq\bigcup_{n_{1}\ldots n_{d}\in\mathbb{N}^{\ast}}\mathbb{ R}^{n_{1}\times\ldots\times n_{d}}$ . For every $D_{1}\ldots D_{n}\in\mathcal{D}^{\ast}$ , $\mathcal{G}(D_{1}\ldots D_{n})=\mathcal{G}(D_{1})\times\ldots\times\mathcal{G} (D_{n})$ . Given a language $\mathcal{L}$ , a grounding $\mathcal{G}$ of $\mathcal{L}$ assigns to each constant symbol $c$ , a tensor $\mathcal{G}(c)$ in the domain $\mathcal{G}(\mathbf{D}(c))$ ; to a variable $x$ it assigns a finite sequence of tensors $d_{1}\ldots d_{k}$ , each in $\mathcal{G}(\mathbf{D}(x))$ , representing the instances of $x$ ; to a function symbol $f$ it assigns a function taking tensors from $\mathcal{G}(\mathbf{D_{in}}(f))$ as input, and producing a tensor in $\mathcal{G}(\mathbf{D_{out}}(f))$ as output; to a predicate symbol $P$ it assigns a function taking tensors from $\mathcal{G}(\mathbf{D_{in}}(P))$ as input, and producing a truth-degree in the interval $[0,1]$ as output.
In other words, $\mathcal{G}$ assigns to a variable a concatenation of instances in the domain of the variable. This treatment of free variables in Real Logic departs from their usual interpretation in FOL: the application of functions and predicates to terms with free variables results in the point-wise application of the function or predicate to the sequence representing all instances of the variable (see p. 5 of Badreddine et al. (2022) for examples). Semantically, logical connectives are fuzzy operators applied recursively to the suitable subformulas: conjunction is interpreted by a t-norm, disjunction by a t-conorm, and implication and negation by their fuzzy counterparts. The semantics for formulae with quantifiers ($\forall$ and $\exists$) is given by symmetric and continuous aggregation operators $Agg:\bigcup_{n\in\mathbb{N}}[0,1]^{n}\rightarrow[0,1]$. Intuitively, quantifiers reduce the dimensions associated with the quantified variables.
**Example 3**
*Continuing our running example, we could enrich our signature with predicates $Dog$ and $Cat$ . Then, $Dog(Max)$ might return $0.8$ , while $Dog(Mit)$ might return $0.3$ , indicating that Max is likely a dog, while Mittens is not. In practice, the truth degrees for these atomic formulas could be obtained for example by a Multi-layer Perceptron (MLP), followed by a sigmoid function, taking the object’s features as input and returning a value in $[0,1]$ . For a new function symbol $age$ , $age(Max)$ could be an MLP, taking Max’s features, and outputting a scalar representing Max’s age. An example of a formula could be $Dog(Max)\vee Cat(Max)$ , which could return $0.95$ indicating that Max is almost certainly either a dog or a cat. A formula with a universal quantifier could be used to express that Alice owns all of the cats $\forall pet(Cat(pet)\rightarrow isOwner(Alice,pet))$ .*
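A predicate such as $Dog$ could, as the example suggests, be realized by a small MLP followed by a sigmoid. The sketch below uses plain NumPy with random, untrained weights, so the returned truth degree is arbitrary; only the shape of the computation (features in, truth degree in $(0,1)$ out) matters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pet features (height, weight, color), toy values.
max_features = np.array([0.6, 12.0, 0.2])

class MLPPredicate:
    """One-hidden-layer MLP followed by a sigmoid: features -> truth degree."""
    def __init__(self, in_dim, hidden=8):
        self.W1 = rng.normal(size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        z = h @ self.W2 + self.b2
        return float(1 / (1 + np.exp(-z[0])))  # truth degree in (0, 1)

Dog = MLPPredicate(3)   # untrained, so Dog(max_features) is arbitrary
```

In an LTN, the weights of such a predicate are trained jointly with the rest of the model so that the knowledge base becomes maximally satisfied.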
Real Logic allows some flexibility in the choice of appropriate fuzzy operators for the semantics of connectives and quantifiers. However, note that not all fuzzy operators are suitable for differentiable learning van Krieken et al. (2022). In Appendix B of Badreddine et al. (2022), the authors discuss some particularly suitable fuzzy operators. In this work, we follow their recommendation and adopt the Product Real Logic semantics (product t-norm for conjunction, standard negation, the Reichenbach implication, p-mean for the existential, and p-mean error for the universal quantifier).
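The Product Real Logic operators named above have simple closed forms; a NumPy sketch, following the definitions in Badreddine et al. (2022):

```python
import numpy as np

def and_prod(a, b):                 # conjunction: product t-norm
    return a * b

def or_probsum(a, b):               # disjunction: probabilistic-sum t-conorm
    return a + b - a * b

def not_std(a):                     # standard negation
    return 1.0 - a

def implies_reichenbach(a, b):      # Reichenbach implication
    return 1.0 - a + a * b

def exists_pmean(xs, p=2):          # existential quantifier: p-mean
    xs = np.asarray(xs, dtype=float)
    return float(np.mean(xs ** p) ** (1.0 / p))

def forall_pmean_error(xs, p=2):    # universal quantifier: p-mean error
    xs = np.asarray(xs, dtype=float)
    return float(1.0 - np.mean((1.0 - xs) ** p) ** (1.0 / p))
```

Note that all of these operators are differentiable almost everywhere, which is precisely what makes them usable inside a loss function.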
LTNs make use of Real Logic: they learn parameters that maximize the aggregated satisfiability of the formulas in a so-called knowledge base containing formulas of Real Logic. This framework is the basis of a PyTorch implementation known as the LTNtorch library Carraro (2022), of which we make substantial use in our experiments.
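The core learning principle, maximizing satisfiability, can be illustrated without the library. The toy below (invented data; finite-difference gradients standing in for the automatic differentiation that LTNtorch provides) trains a two-parameter predicate so that $\forall x\, P(x)$ becomes nearly true on the sample:

```python
import math

# Toy data (invented): instances on which "forall x. P(x)" should hold.
xs = [0.6, 0.7, 0.9]

def P(x, w, b):
    # Predicate grounded as a sigmoid of a linear function of x.
    return 1 / (1 + math.exp(-(w * x + b)))

def sat(w, b, p=2):
    # Satisfiability of "forall x. P(x)" under the p-mean-error aggregator.
    errs = [(1 - P(x, w, b)) ** p for x in xs]
    return 1 - (sum(errs) / len(errs)) ** (1 / p)

# Gradient ascent on satisfiability (equivalently, descent on 1 - sat).
w, b, lr, eps = 0.0, 0.0, 1.0, 1e-4
for _ in range(500):
    gw = (sat(w + eps, b) - sat(w - eps, b)) / (2 * eps)
    gb = (sat(w, b + eps) - sat(w, b - eps)) / (2 * eps)
    w, b = w + lr * gw, b + lr * gb
```

After training, `sat(w, b)` is close to 1, i.e., the knowledge base (here, a single universally quantified formula) is almost fully satisfied.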
## 3 Methodology
### 3.1 The game
The environment used for the experiments was a custom Gymnasium Towers et al. (2024) environment ShapesGridEnv designed for the experiments in Badreddine and Spranger (2019), see Fig. 1. The game is played on an image showing a 5x5 grid with cells occupied by one agent, represented by the symbol ‘+’, and a number of other objects: circles, squares, and crosses. The agent moves across the board (up, right, down, left) and when it enters a space occupied by an object, it ‘consumes’ that object. Each object shape is associated with a reward. The agent’s goal is to maximize its cumulative reward throughout an episode. An episode terminates when either all shapes with positive reward have been consumed, or when a predefined number of steps has been reached.
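The game mechanics just described can be sketched as a minimal environment. This is an illustration only, not the actual ShapesGridEnv code; the class and method names are ours, and the rewards follow the configuration used in our experiments:

```python
# Minimal sketch of the ShapesGridEnv game mechanics (illustrative).
REWARDS = {"cross": 1, "circle": -1, "square": 0}
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # up, right, down, left

class MiniShapesGrid:
    def __init__(self, shapes, max_steps=50):
        self.shapes = dict(shapes)            # {(row, col): shape name}
        self.agent = (0, 0)
        self.steps, self.max_steps = 0, max_steps

    def step(self, action):
        dr, dc = MOVES[action]
        r = min(4, max(0, self.agent[0] + dr))  # stay on the 5x5 board
        c = min(4, max(0, self.agent[1] + dc))
        self.agent, self.steps = (r, c), self.steps + 1
        # Entering an occupied cell consumes the shape and yields its reward.
        reward = REWARDS.get(self.shapes.pop((r, c), None), 0)
        # Episode ends when no positively rewarded shape remains, or on timeout.
        done = self.steps >= self.max_steps or not any(
            REWARDS[s] > 0 for s in self.shapes.values())
        return reward, done
```

For example, an agent starting at the top-left corner that steps right onto a cross collects +1, and the episode terminates if that was the last positively rewarded shape.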
Figure 1: ShapesGridEnv environment
We chose this environment for its simplicity, and because it allows comparing our setting with that of Badreddine and Spranger (2019). The environment is very flexible in its parameters: density (the minimum and maximum number of shapes instantiated; in our case the maximum is $18$ ), rewards (in our case $+1$ for a cross, $-1$ for a circle, and $0$ for a square), colors (in our case a white background with black objects), and maximum episode length (for us, 50). Altering these configurations allows investigating the adaptability of the learner, as in Badreddine and Spranger (2019).
### 3.2 The Model
A suitable approach to learning to play such a game is the existing Dueling Deep Q-Network (DuelDQN) Wang et al. (2016). The architecture is composed of several convolutional layers, which extract relevant features from the raw image input and pass them to the two main components, a Value Stream and an Advantage Stream (see Fig. 2). The Value Stream estimates how good it is for the agent to be in the given state, while the Advantage Stream estimates how good it is to perform each action in that state. The two streams are then combined to calculate the final output.
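The combination step is standardly done by subtracting the mean advantage, the identifiable aggregation from Wang et al. (2016); a NumPy sketch:

```python
import numpy as np

def dueling_q(value, advantages):
    # Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')), the standard
    # mean-subtracted dueling aggregation from Wang et al. (2016).
    return value + (advantages - advantages.mean())
```

The mean subtraction resolves the non-identifiability of the split: adding a constant to $V(s)$ and subtracting it from every $A(s,a)$ would otherwise leave $Q(s,a)$ unchanged.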
(Figure: three convolutional layers, Conv2d 3→32, Conv2d 32→64, Conv2d 64→128, each followed by an ELU; then Linear 128×50×50→256, which splits into a Value Stream (Linear 256→128, ELU, Linear 128→1, outputting $V(s)$) and an Advantage Stream (Linear 256→128, ELU, Linear 128→4, outputting $A(s,a)$), recombined into $Q(s,a)$.)
Figure 2: Network Architecture of DuelDQN, with the convolutional layers in white, and the dense layers in red
The DuelDQN architecture will be our starting point. We will extend it with new symbolic, cognitively-motivated components:
- shape recognition (ShapeRecognizer),
- reward prediction (RewardPredictor),
- action reasoning (ActionReasoner), and
- action filtering (ActionFilter).
In the following, we will discuss each component in more detail.
#### ShapeRecognizer
The function of ShapeRecognizer is to estimate the likelihood that a given observation is of each possible kind. ShapeRecognizer comprises a pre-processing function, which divides the initial raw image into 25 patches, and a Convolutional Neural Network (CNN), which processes each patch and outputs a 5-dimensional tensor.
The numbers chosen for the patch count and the output dimension are an instance of soft knowledge injection: the environment is a 5x5 grid, so dividing the image into 25 patches immediately separates the content of each cell, and 5 is the number of different objects that each patch might contain: empty, agent, circle, cross, square. This allows the agent to perform a multi-label classification of each object type. The theoretical intention of this module is to give the agent the possibility of a symbolic understanding of the different entities in the environment, by labeling their types.
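The patch pre-processing described above amounts to slicing the board image into a 5x5 grid of equal cells; a sketch (the function name and the 250x250 test resolution are ours; the actual input resolution may differ):

```python
import numpy as np

def split_into_patches(img, grid=5):
    # Split an HxW (or HxWxC) board image into grid*grid equal patches,
    # one per cell of the underlying game grid.
    h, w = img.shape[0] // grid, img.shape[1] // grid
    return [img[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(grid) for c in range(grid)]
```

Each of the 25 resulting patches is then classified independently by the CNN.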
Given the simple nature of the ShapesGridEnv, representing the environment is very easy. The state is composed of $25$ positions, each occupied by one of five shapes (empty, agent, circle, square, cross), which results in a state space of size $5^{25}$ . To generate this representation, we start by instantiating five one-hot representations of the classes, which are stored in the LTN Variable $shape\_types$ . Then, for each state, the agent keeps in memory all the distinct patches it has seen so far, together with a list of the patches present in the current state.
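The five one-hot class encodings stored in $shape\_types$ can be built directly; a plain-NumPy sketch (the paper stores them in an LTN Variable; the `predicted_shape` helper is ours, for illustration):

```python
import numpy as np

SHAPES = ["empty", "agent", "circle", "square", "cross"]
shape_types = np.eye(len(SHAPES))  # one one-hot row per class

def predicted_shape(logits):
    # Map a 5-dimensional ShapeRecognizer output to the one-hot encoding
    # of its most likely class (illustrative helper).
    return shape_types[int(np.argmax(logits))]
```

The one-hot rows act as the symbolic labels against which the ShapeRecognizer's 5-dimensional outputs are compared.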
<details>
<summary>extracted/6333762/unique_patch.png Details</summary>

### Visual Description
## Diagram: Symbol Transformation Process
### Overview
The image illustrates a three-stage transformation process, moving from a sparse 3x3 grid of symbols, through an intermediate 5x5 grid with empty cells, to a final linear sequence of five symbols. The process is depicted left-to-right, connected by block arrows indicating the direction of transformation. The diagram contains no textual labels, axis titles, or numerical data. It is a purely visual, symbolic representation of a mapping or extraction operation.
### Components/Axes
The diagram consists of three distinct visual components arranged horizontally:
1. **Left Component (Input Grid):** A 3x3 grid containing various pixelated symbols.
2. **Middle Component (Intermediate Grid):** A 5x5 grid containing a subset of the symbols from the left grid, placed in specific cells, with many cells left empty (white squares with black borders).
3. **Right Component (Output Sequence):** A horizontal row of five symbols, representing the final output.
4. **Connectors:** Two right-pointing block arrows (`→`) connect the left grid to the middle grid, and the middle grid to the right sequence.
### Detailed Analysis
**1. Left Input Grid (3x3):**
* **Top Row:** A plus sign (`+`) in the center cell. An 'X' shape in the right cell.
* **Middle Row:** A square with a thick black border in the left cell. A circle in the center cell. The right cell is empty (white).
* **Bottom Row:** A circle in the left cell. An 'X' shape in the center cell. A square with a thick black border in the right cell.
**2. Middle Intermediate Grid (5x5):**
This grid appears to be a spatial expansion or re-sampling of the input. Symbols are placed in specific coordinates, with empty cells filling the rest.
* **Row 1 (Top):** Empty, Empty, Plus sign (`+`), Empty, 'X' shape.
* **Row 2:** Empty, Empty, Empty, Empty, Circle.
* **Row 3:** Empty, Square (thick border), Circle, Empty, Empty.
* **Row 4:** Empty, Empty, Circle, Empty, 'X' shape.
* **Row 5 (Bottom):** Empty, Empty, Empty, Square (thick border), Square (thick border).
**3. Right Output Sequence:**
A linear, ordered list of five symbols. From left to right:
1. Empty square (white with black border).
2. Plus sign (`+`).
3. 'X' shape.
4. Circle.
5. Square with a thick black border.
### Key Observations
* **Symbol Set:** The process involves four distinct symbol types: plus (`+`), 'X' shape, circle, and square (thick border); the "empty" cell acts as a fifth state.
* **Role in the Paper:** Given the figure caption, the two grids appear to be environment states, and the right-hand sequence lists the unique labels obtained from them: one label per distinct shape type, including the empty cell. This is consistent with axioms A1-A3, which require every shape to carry exactly one unique label.
</details>
Figure 3: The process of obtaining unique labels
Once the variables have been set up, the ShapeRecognizer module can be used to estimate the likelihood that a grid cell contains a given unique shape. To guide its learning, the aggregated satisfaction degree of three axioms is maximized. The axioms represent the first instance of actual knowledge injection in the system:
$$
\forall s\ \exists l\ IS(s,l) \tag{A1}
$$
$$
\neg\exists s\ l_{1}\ l_{2}\ (IS(s,l_{1})\wedge IS(s,l_{2})\wedge(l_{1}\neq l_{2})) \tag{A2}
$$
$$
\forall s_{1}\ s_{2}\ l\ ((IS(s_{1},l)\wedge(s_{1}\neq s_{2}))\rightarrow\neg IS(s_{2},l)) \tag{A3}
$$
In the above formulas, $s$ stands for a shape, $l$ stands for a label and $IS(x,y)$ stands for $x$ has label $y$ . A1 says that every shape has a label; A2 says that no shape has two different labels; A3 says that different shapes cannot have the same label. At each step of every episode, the aggregated satisfaction (truth) degree of these axioms is calculated, and its inverse, $1-AggSat(A1,A2,A3)$ , is used as a loss to train the agent.
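As an illustration, the aggregated satisfaction of A1-A3 can be computed directly from the recognizer's fuzzy outputs. The sketch below is a simplification of what the LTN framework computes; the choice of min/max (Gödel-style) fuzzy operators and the matrix representation of $IS$ are assumptions, not the paper's exact configuration:

```python
import numpy as np

def agg_sat(IS: np.ndarray) -> float:
    """Aggregated satisfaction of A1-A3 under min/max (Goedel-style) fuzzy semantics.

    IS[s, l] is the fuzzy truth degree of "shape s has label l".
    """
    S, L = IS.shape
    # A1: every shape has a label  (forall s exists l: IS(s, l))
    a1 = np.min(np.max(IS, axis=1))
    # A2: no shape has two different labels
    worst = 0.0
    for s in range(S):
        for l1 in range(L):
            for l2 in range(L):
                if l1 != l2:
                    worst = max(worst, min(IS[s, l1], IS[s, l2]))
    a2 = 1.0 - worst
    # A3: different shapes cannot share a label (IS(s1, l) -> not IS(s2, l))
    a3 = 1.0
    for s1 in range(S):
        for s2 in range(S):
            if s1 != s2:
                for l in range(L):
                    a3 = min(a3, max(1.0 - IS[s1, l], 1.0 - IS[s2, l]))
    return min(a1, a2, a3)

# Training loss for the recognizer: 1 - AggSat(A1, A2, A3)
```

With a one-hot assignment of distinct labels (e.g., `np.eye(4)`) the satisfaction is 1 and the loss vanishes; if two shapes share a label, A3 drives the satisfaction down and the loss up.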
Intuitively, ShapeRecognizer gives the learner a way to distinguish between different shapes. In that, our approach is somewhat similar to the framework of semi-supervised symbol grounding Umili et al. (2023).
#### RewardPredictor
Once the environment is symbolically represented, we will make the agent understand the properties of different objects and their dynamics. The only truly dynamic element in the environment is the agent itself—nothing else moves. The agent can move to a cell that was previously occupied by a different shape, which results in the shape being consumed, and the agent being rewarded with the value of the respective shape. Hence, there are three key pieces of knowledge that the learner must acquire to successfully navigate the environment. It must identify which shape represents the agent, it must understand how each action influences the future position of the agent, and it must associate each shape with its respective reward.
The task of self-recognition can be approached in numerous ways, depending on the information we have about the environment and on our understanding of its dynamics. In our approach, leveraging the ShapeRecognizer, in each episode we count the occurrences of each shape in the environment and accumulate them in a variable that tracks each shape’s count over time. The agent is the only shape whose count is constant and equal to one across all states. This is enough to quickly determine with confidence which shape represents the agent. This step demonstrates a specific advantage of using the neuro-symbolic framework: our reinforcement learning agent is now equipped with a memory of previous states of the environment (i.e., the shape counts), which can then be used to make decisions or to further process symbolic knowledge.
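The counting scheme can be sketched as follows; this is an illustrative simplification, and the representation of a state as a list of recognized shape labels is an assumption:

```python
from collections import Counter

def identify_agent(states):
    """Return the shape labels whose count is exactly one in every observed state.

    states: iterable of states, each a list of recognized shape labels.
    The agent is the only shape whose count stays constant at one.
    """
    candidates = None
    for state in states:
        counts = Counter(state)
        once = {shape for shape, n in counts.items() if n == 1}
        candidates = once if candidates is None else candidates & once
    return candidates

states = [
    ["plus", "cross", "cross", "circle"],
    ["plus", "circle", "circle"],
    ["plus", "cross"],
]
assert identify_agent(states) == {"plus"}  # only the agent's label survives
```

After enough states have been observed, the intersection leaves only the agent's label.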
Understanding the impact of different actions is crucial for the agent to make informed decisions in the environment. Each action represents a direction (up, right, down, left), and taking the action leads to one of two outcomes: if the agent attempts to move against the edge of the environment, it remains in the same cell; otherwise, it moves one cell in the given direction. Given the simplicity of these dynamics, we define a function that takes a position and an action as input and outputs the resulting position.
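Such a transition function is straightforward to write down. The sketch below assumes 0-indexed (row, column) coordinates and the action ordering up/right/down/left, neither of which is specified in the paper:

```python
def next_position(pos, action, size=5):
    """Resulting position after taking `action` from `pos` on a size x size grid.

    Actions: 0 = up, 1 = right, 2 = down, 3 = left (assumed ordering).
    Moving against an edge leaves the agent in the same cell.
    """
    row, col = pos
    d_row, d_col = [(-1, 0), (0, 1), (1, 0), (0, -1)][action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < size and 0 <= new_col < size:
        return (new_row, new_col)
    return (row, col)

assert next_position((0, 0), 0) == (0, 0)  # blocked by the top edge
assert next_position((0, 0), 2) == (1, 0)  # moves one cell down
```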
Our RewardPredictor is a multi-layer perceptron (MLP) that takes a ShapeRecognizer prediction as input and outputs a single scalar. The module is trained to predict the reward associated with each object shape, using the symbolic representations generated by the ShapeRecognizer paired with high-level reasoning on the training procedure. It thus gives the agent high-level knowledge of the reward associated with any shape, and consequently with any action.
In reinforcement learning environments, agents learn action policies by maximizing their expected rewards. When building an agent that symbolically represents and reasons about the environment, one of the key elements is the agent’s ability to understand how to obtain rewards. Given that the agent has the capability of identifying the shapes in the grid, recognizing its own shape, and calculating the position it will take given an action, it can now determine the shape that will be consumed by that action. By using the RewardPredictor module and passing it this shape, the agent obtains a prediction of the reward associated with that shape. Over time, by calculating the loss between that prediction and the actual reward obtained after taking an action, the module learns to accurately predict the reward associated with every shape.
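The training signal for this module is simply the discrepancy between the predicted and the observed reward. The following tabular stand-in illustrates the update logic; the paper's module is an MLP over ShapeRecognizer outputs, so this class, its names, and the learning rate are simplifying assumptions:

```python
class TabularRewardPredictor:
    """Tabular stand-in for the MLP RewardPredictor: one estimate per shape label."""

    def __init__(self, n_shapes, lr=0.1):
        self.estimates = [0.0] * n_shapes
        self.lr = lr

    def predict(self, shape):
        # shape: index of the recognized shape (e.g., argmax of ShapeRecognizer output)
        return self.estimates[shape]

    def update(self, shape, reward):
        # Gradient step on the squared error between prediction and actual reward.
        self.estimates[shape] += self.lr * (reward - self.estimates[shape])

predictor = TabularRewardPredictor(n_shapes=4)
for _ in range(100):
    predictor.update(2, 1.0)  # shape 2 repeatedly yields reward 1
assert abs(predictor.predict(2) - 1.0) < 1e-3
```

Over repeated interactions the estimate for each shape converges to its true reward, which is the behavior the MLP learns via its loss.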
#### ActionReasoner
Once the agent can predict the expected reward of its own actions, we can guide its policy learning so that it acts in the way it expects will yield the highest immediate reward. To achieve this, we specify an axiom ensuring that the $q$-value outputs of the Q-Network align with the predicted rewards: the expected reward of each possible action is computed with the RewardPredictor, and the $q$-values are obtained by calling the SymDQN. The axiom expresses the following condition: if the reward prediction of action $a_{1}$ is higher than that of action $a_{2}$, then the $q$-value of $a_{1}$ must also be higher than the $q$-value of $a_{2}$. Learning is then guided by the LTN framework, with the following formula of Real Logic used in the loss function.
$$
\forall\,\mathrm{Diag}((r_{1},q_{1}),(r_{2},q_{2}))\ (r_{1}>r_{2}) : (q_{1}>q_{2}) \tag{A4}
$$
Two standard operators of Real Logic Badreddine et al. (2022) are applied in this axiom: $Diag$ and guarded quantification with the condition $(r_{1}>r_{2})$. Firstly, the $Diag$ operator restricts the range of the quantifier: instead of running over all combinations of $r_{1},r_{2},q_{1},q_{2}$, it only runs over the (reward, $q$-value) pairs that correspond to the same action. Specifically, when $r_{1}$ is the predicted reward of action ‘up’, $q_{1}$ is the $q$-value of action ‘up’. Secondly, we use guarded quantification, restricting the range of the quantifier to only those cases in which $(r_{1}>r_{2})$. Had we used implication instead, the whole conditional would evaluate to true whenever the antecedent $(r_{1}>r_{2})$ is false. This is problematic when the majority of pairs do not fulfill the antecedent: the universal quantifier then evaluates to true on most instances, even if the important ones, with a true antecedent, are false. Guarded quantification yields a satisfaction degree much closer to the value we are interested in.
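The effect of $Diag$ and the guard can be made concrete with a small numerical sketch. This is a simplification of Real Logic: the sigmoid used as the fuzzy truth of $q_{1}>q_{2}$, its steepness, and the mean aggregator are assumptions, not the paper's exact operators:

```python
import math

def reasoner_sat(r, q, steepness=10.0):
    """Aggregated satisfaction of the ActionReasoner axiom over (reward, q-value) pairs.

    Diag: index i always pairs r[i] with q[i] (same action).
    Guard: only pairs with r[i] > r[j] contribute to the aggregate.
    """
    truths = []
    for i in range(len(r)):
        for j in range(len(r)):
            if r[i] > r[j]:  # guarded quantification
                # fuzzy truth degree of q[i] > q[j]
                truths.append(1.0 / (1.0 + math.exp(-steepness * (q[i] - q[j]))))
    return sum(truths) / len(truths) if truths else 1.0

# q-values aligned with predicted rewards -> high satisfaction, small loss 1 - sat
assert reasoner_sat([0.0, 1.0], [0.2, 0.9]) > 0.9
# q-values inverted w.r.t. rewards -> low satisfaction, large loss
assert reasoner_sat([0.0, 1.0], [0.9, 0.2]) < 0.1
```

Note how pairs that fail the guard simply do not enter the aggregate, instead of counting as vacuously true, which is exactly the advantage over implication described above.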
#### ActionFilter
Our learner can now predict the reward for each shape. For each action in a given state, it then knows which shapes could be consumed and what their corresponding immediate reward is. ActionFilter eliminates the actions for which the difference between their predicted reward and the maximum immediately obtainable reward in that state exceeds a predefined threshold (we set it at 0.5). This allows a balance between the strictness of symbolically selecting the immediately best actions and the information about rewards available in the network as a whole. This is represented in Fig. 4.
<details>
<summary>extracted/6333762/fig-action-filter.png Details</summary>

### Visual Description
## Diagram: Action Filtering in SymDQN
### Overview
The image illustrates the decision-making process behind ActionFilter in the grid environment. The flow moves from left to right: the 5x5 environment state, a 3x3 local view, and the numerical evaluation that leads to the final action choice.
### Components
1. **Left (Environment State):** A 5x5 grid containing empty cells, 'X' shapes, circles (two of them with thicker borders), and the agent, drawn as a plus sign (`+`) in row 4, column 4.
2. **Middle (Local View):** A 3x3 grid with one circle and two 'X' shapes, apparently a focused view of the cells surrounding the agent. Arrows connect the left grid to this view and to a green box labeled "SymDQN".
3. **Right (Evaluation & Output):** Two text boxes with dark blue borders, connected by a downward arrow.
   * **Top Box:** `r = [0, 1, 1, -1]` and `best actions = [1, 2]` — the predicted rewards for the four actions, and the actions whose predicted reward lies within the threshold of the best one.
   * **Bottom Box:** `q = [0.34,-0.2,-0.13,0.3]`, `filter q =[-∞,-0.2, -0.13, -∞]`, and `action = 2` — the network's q-values, the q-values after masking the filtered-out actions with `-∞`, and the selected action.
### Key Observations
* Actions 0 and 3 carry the highest raw q-values (0.34 and 0.3), but their predicted rewards (0 and -1) fall more than the threshold below the best predicted reward (1), so the filter masks them with `-∞`.
* The final choice, `action = 2`, is the action with the highest filtered q-value (`-0.13` > `-0.2`): the symbolic reward prediction overrides the raw q-values where they disagree.
</details>
Figure 4: The process of action filtering
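Using the numbers shown in Fig. 4, the filtering step amounts to masking with $-\infty$ the q-value of every action whose predicted reward falls more than the threshold below the best one. A minimal sketch, assuming plain Python lists for rewards and q-values:

```python
import math

def action_filter(rewards, q_values, threshold=0.5):
    """Mask the q-values of actions whose predicted reward is more than
    `threshold` below the maximum immediately obtainable reward."""
    best = max(rewards)
    return [q if best - r <= threshold else -math.inf
            for r, q in zip(rewards, q_values)]

r = [0, 1, 1, -1]
q = [0.34, -0.2, -0.13, 0.3]
filtered = action_filter(r, q)          # [-inf, -0.2, -0.13, -inf]
action = filtered.index(max(filtered))  # 2, as in Fig. 4
```

Taking the argmax over the filtered q-values then reproduces the action selection shown in the figure.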
ActionFilter severely restricts action choice. To prevent it from forcing the outcomes, we switch it off during the training period; when the agent is actually running in the environment, ActionFilter is used to optimize decision making. Strategies for incorporating this mechanism into training itself remain to be studied, as we want to maintain the asymptotic optimality of DuelDQNs while enhancing them with reasoning where relevant.
With ShapeRecognizer, RewardPredictor, ActionReasoner and ActionFilter integrated with the original DuelDQN, the complete architecture of SymDQN can be seen in Fig. 5.
<details>
<summary>extracted/6333762/SymDQN_arch.png Details</summary>

### Visual Description
## Neural Network Architecture Diagram: Dueling DQN with Auxiliary Tasks
### Overview
The image displays a detailed architectural diagram of a deep neural network designed for reinforcement learning. It combines a convolutional neural network (CNN) feature extractor with auxiliary modules for shape recognition and reward prediction, culminating in a Dueling Deep Q-Network (DQN) structure that outputs action-values Q(s, aᵢ). The flow proceeds from left to right, with visual input processed by the CNN and auxiliary tasks processed in parallel before their features are integrated.
### Components/Axes
The diagram is composed of interconnected blocks representing layers and modules, with arrows indicating data flow. Key components are color-coded:
* **Gray 3D Blocks**: Convolutional layers (Conv2d).
* **Green 3D Blocks**: Auxiliary task modules (ShapeRecognizer, RewardPredictor).
* **Red Vertical Bars**: Fully connected (Linear) layers.
* **Text Labels**: Layer names, input/output dimensions, and activation functions (ELU) are placed adjacent to their respective components.
**Spatial Layout:**
* **Top-Left to Top-Center**: The main CNN feature extraction branch.
* **Bottom-Left**: The auxiliary task branch (ShapeRecognizer and RewardPredictor).
* **Center-Right**: The integration point and subsequent Dueling network architecture (Value and Advantage streams).
* **Far-Right**: The final output node.
### Detailed Analysis
**1. Main CNN Branch (Top, Gray Blocks):**
* **Input**: Implicitly an image (3 channels, e.g., RGB).
* **Layer 1**: `Conv2d 3→32` followed by `ELU` activation. Output is a 32-channel feature map.
* **Layer 2**: `Conv2d 32→64` followed by `ELU` activation. Output is a 64-channel feature map.
* **Layer 3**: `Conv2d 64→128` followed by `ELU` activation. Output is a 128-channel feature map.
* **Flatten & Linear**: The output is flattened and passed to a `Linear (128*50*50)→256` layer. This suggests the spatial dimensions of the final convolutional feature map are 50x50. The output is a 256-dimensional vector.
**2. Auxiliary Task Branch (Bottom, Green Blocks):**
* **ShapeRecognizer**: `3→5`. This module takes a 3-dimensional input (possibly shape descriptors or a small feature vector) and outputs a 5-dimensional representation.
* **RewardPredictor**: `5→1`. This module takes the 5-dimensional output from the ShapeRecognizer and predicts a scalar reward (1-dimensional output).
* **Feature Integration**: The outputs from the auxiliary branch are not used directly as final predictions in this diagram. Instead, a feature vector (implied to be derived from these modules) is passed to a `Linear (128*5*5)→256` layer. This suggests the auxiliary modules process a 5x5 spatial input with 128 channels, flattening to 128*5*5=3200 before projection to 256 dimensions.
**3. Feature Fusion & Dueling Architecture (Right, Red Bars):**
* The 256-dimensional outputs from the **main CNN branch** and the **auxiliary branch** are concatenated or summed (the diagram shows them merging at a single point) to form a combined feature vector.
* This combined vector is passed through a shared `Linear 256→128` layer.
* The output then splits into two streams:
* **Value Stream V(s)**: `Linear 128→1` followed by `ELU`. Outputs a scalar state-value V(s).
* **Advantage Stream A(s, aᵢ)**: `Linear 128→4` followed by `ELU`. Outputs a 4-dimensional advantage vector, suggesting there are 4 possible actions (aᵢ).
* **Final Output Q(s, aᵢ)**: The value and advantage streams are combined (typically as Q(s,a) = V(s) + A(s,a) - mean(A(s,a'))) to produce the final action-value output `Q(s, aᵢ)`, represented by a single red bar.
### Key Observations
1. **Hybrid Architecture**: The model integrates pure visual feature learning (CNN) with explicit auxiliary tasks (shape recognition, reward prediction). This is a form of multi-task or auxiliary-task learning, often used to improve representation learning and sample efficiency in reinforcement learning.
2. **Dueling DQN Structure**: The clear separation into Value V(s) and Advantage A(s, aᵢ) streams is the hallmark of a Dueling DQN architecture, which can lead to more stable learning by separately estimating state values and action advantages.
3. **Dimensionality Flow**: The diagram meticulously notes the changing dimensionality at each step (e.g., `3→32`, `128*50*50→256`), providing a clear blueprint for implementation.
4. **Activation Function**: The Exponential Linear Unit (`ELU`) is used consistently after convolutional and linear layers, except before the final output nodes of the Value and Advantage streams.
5. **Action Space**: The advantage stream outputs 4 values (`Linear 128→4`), indicating the environment has a discrete action space of size 4.
### Interpretation
This diagram represents a sophisticated reinforcement learning agent's "brain." The architecture suggests the agent is designed for an environment where visual perception is crucial (hence the deep CNN). The inclusion of **ShapeRecognizer** and **RewardPredictor** as auxiliary tasks is a strategic design choice. By forcing the network to simultaneously learn to recognize shapes and predict rewards, it likely develops more robust and generalizable internal representations of the environment. This can lead to faster learning and better performance, especially in environments with sparse rewards or visual complexity.
The **Dueling DQN** head is a proven technique for improving value estimation. By learning which states are valuable regardless of the action (V(s)) and which actions offer the most advantage in a given state (A(s, aᵢ)), the agent can make more nuanced decisions. The final output `Q(s, aᵢ)` provides the estimated long-term return for taking each of the 4 possible actions in a given state, which the agent would use to select the best action.
In summary, this is not a generic network but a purpose-built architecture for a visual reinforcement learning task, incorporating modern techniques (auxiliary tasks, dueling streams) to enhance learning efficiency and decision quality. The explicit dimensionality notes make it a technical blueprint ready for implementation.
</details>
Figure 5: SymDQN network architecture integrating DuelDQN with ShapeRecognizer and RewardPredictor modules
## 4 The experiment
By comparing the baseline DuelDQN model with our SymDQN model, this study attempts to answer the following questions:
1. Does the SymDQN converge to a stable action policy faster than the baseline DuelDQN?
2. Does the SymDQN outperform the baseline DuelDQN in average reward accumulation?
3. Is SymDQN more precise in its performance than DuelDQN, i.e., is it better at avoiding shapes with negative rewards?
In the experiment, we analyze the impact that each individual modification has on the performance of the SymDQN, comparing them between each other and the baseline DuelDQN. We hence consider five experimental conditions:
- **DuelDQN**: the baseline model, with no symbolic components;
- **SymDQN**: DuelDQN enriched with ShapeRecognizer (which uses A1-A3) and with RewardPredictor;
- **SymDQN(AR)**: SymDQN with ActionReasoner (which uses axiom A4);
- **SymDQN(AF)**: SymDQN with ActionFilter;
- **SymDQN(AR,AF)**: SymDQN enriched with both ActionReasoner and ActionFilter.
Our experiment runs for $250$ epochs, after which the empirically observed rate of learning of all the variations is no longer significant. Each epoch contains $50$ episodes of training, after which the agent’s performance is evaluated as the average score over $50$ new episodes. The score is defined as the ratio of the obtained score to the maximum score obtainable in a given episode. The other performance measure we consider is the percentage of negative-reward objects consumed. The hardware and software specification and the hyperparameters of the experiment can be found in Appendix A.
## 5 Results
In this section, we will report on the results of our ablation study, which isolates the impact of each component on the agent’s performance. We will first focus on the obtained score comparison among different experimental conditions, and later we will report on the precision of the agent.
We first compare our best-performing condition, SymDQN(AF) with the baseline learning of the pure non-symbolic DuelDQN, see Fig. 6.
<details>
<summary>extracted/6333762/SymDQNAF-DuelingDQN.png Details</summary>

### Visual Description
## Line Chart: Performance Comparison of SymDQN(AF) vs. Baseline Over Training Epochs
### Overview
The image is a line chart comparing the performance of two models, "SymDQN(AF)" and "Baseline," over the course of 250 training epochs. The performance metric is the score ratio (the obtained score divided by the maximum obtainable score, as stated in the figure caption), so higher values indicate better performance. The chart demonstrates a clear and consistent performance gap between the two models throughout the training process.
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** "Epochs"
* **Scale:** Linear scale from 0 to 250, with major tick marks at intervals of 50 (0, 50, 100, 150, 200, 250).
* **Y-Axis (Vertical):**
* **Label:** Score ratio (obtained score / maximum obtainable score)
* **Scale:** Linear scale from 0.1 to 0.8, with major tick marks at intervals of 0.1 (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8).
* **Legend:**
* **Position:** Bottom-right corner of the chart area.
* **Entry 1:** "SymDQN(AF)" - Represented by a solid green line with circular markers (●).
* **Entry 2:** "Baseline" - Represented by a solid black line with triangular markers (▲).
* **Grid:** A light gray, dashed grid is present in the background, aligned with the major tick marks on both axes.
### Detailed Analysis
**1. SymDQN(AF) (Green Line with Circles):**
* **Trend:** The line shows a very rapid initial increase, followed by a sustained high level with moderate fluctuations.
* **Data Points (Approximate):**
* Epoch 0: ~0.14
* Epoch 50: ~0.65
* Epoch 100: ~0.68
* Epoch 150: ~0.72
* Epoch 200: ~0.71
* Epoch 250: ~0.70
* **Visual Trend:** The line ascends steeply from epoch 0 to approximately epoch 25, reaching ~0.5. It continues a strong upward trend until around epoch 50. From epoch 50 to 250, the line plateaus in the 0.65-0.75 range, exhibiting noticeable volatility but no significant long-term upward or downward drift.
**2. Baseline (Black Line with Triangles):**
* **Trend:** The line shows a more gradual, steady increase over the entire training period, with less initial volatility than SymDQN(AF).
* **Data Points (Approximate):**
* Epoch 0: ~0.05
* Epoch 50: ~0.20
* Epoch 100: ~0.50
* Epoch 150: ~0.60
* Epoch 200: ~0.65
* Epoch 250: ~0.60
* **Visual Trend:** The line starts very low and climbs slowly until around epoch 50. The rate of increase accelerates between epochs 50 and 125. From epoch 125 onward, the line continues to trend upward but with significant oscillations, peaking near 0.68 around epoch 190 before dipping again.
### Key Observations
1. **Performance Gap:** The SymDQN(AF) model consistently outperforms the Baseline model at every measured epoch. The gap is largest in the early training phase (epochs 0-75) and narrows somewhat in the later stages but never closes.
2. **Learning Speed:** SymDQN(AF) demonstrates significantly faster initial learning, achieving a score ratio above 0.5 by epoch ~20, a milestone the Baseline model does not reach until approximately epoch 95.
3. **Volatility:** Both models show performance fluctuations (noise) after the initial learning phase, which is typical in reinforcement learning training curves. The fluctuations for SymDQN(AF) appear slightly more pronounced in absolute terms during its plateau phase.
4. **Convergence:** Neither model shows a clear sign of having fully converged by epoch 250, as both lines are still exhibiting oscillations. However, SymDQN(AF) appears to have reached a performance plateau around the 0.7 mark.
### Interpretation
The data strongly suggests that the SymDQN(AF) algorithm is more sample-efficient and achieves a higher final performance level than the Baseline algorithm on this task.
* **Faster Policy Improvement:** The steep initial slope of the SymDQN(AF) curve indicates it extracts more useful learning signal from early experiences, leading to a rapid improvement in its policy.
* **Higher Asymptotic Performance:** The sustained plateau at a higher value (~0.7 vs. ~0.6-0.65) suggests SymDQN(AF) is capable of learning a more optimal or effective policy in the long run.
* **Stability Trade-off:** While SymDQN(AF) learns faster and better, the persistent fluctuations in both curves indicate that the learning process for both models remains somewhat unstable even after hundreds of epochs. This could be due to the inherent stochasticity of the environment, the exploration strategy, or the optimization dynamics.
* **Practical Implication:** If training time or computational resources are limited, SymDQN(AF) is the clearly superior choice, as it reaches a usable performance level much sooner. Even with unlimited training time, it appears to settle at a better final performance.
</details>
Figure 6: SymDQN(AF) (in green) vs. DuelDQN (in black): $x$ -axis represents epochs and $y$ -axis represents the ratio of obtained score in the episode and the maximum obtainable score in that episode.
Clearly, the performance of the SymDQN agent equipped with ActionFilter is superior to DuelDQN both in terms of quicker convergence (high initial learning rate) and overall end performance.
Let us now look at different versions of our SymDQN to better understand what contributes to its performance. In Fig. 7 we show the performance of all four versions of SymDQN.
<details>
<summary>extracted/6333762/SymDQN_performance_new.png Details</summary>

### Visual Description
## Line Chart: Performance of SymDQN Variants vs. Baseline Over Training Epochs
### Overview
The image is a line chart comparing the performance of five reinforcement learning conditions over the course of 250 training epochs. The performance metric is the score ratio (obtained score divided by the maximum obtainable score), so higher values indicate better performance. The chart shows that all conditions improve over time, but with distinct learning curves and final performance levels.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Epochs". It represents training time, ranging from 0 to 250 with major tick marks every 50 epochs (0, 50, 100, 150, 200, 250).
* **Y-Axis (Vertical):** Represents the score ratio (obtained score / maximum obtainable score), ranging from 0.0 to 0.8 with major tick marks every 0.1 units (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8).
* **Legend:** Located in the bottom-right corner of the chart area. It contains five entries, each associating a colored line with a specific algorithm name.
* Green line with circle markers: `SymDQN(AF)`
* Purple line with triangle-up markers: `SymDQN(AR)`
* Blue line with square markers: `SymDQN()`
* Orange line with diamond markers: `SymDQN(AR, AF)`
* Black line with triangle-down markers: `Baseline`
### Detailed Analysis
**Trend Verification & Data Points (Approximate):**
The following describes the visual trend for each series before listing approximate values.
1. **SymDQN(AF) (Green, circles):**
* **Trend:** Shows the fastest initial learning and maintains the highest performance throughout. It rises steeply from near 0, surpassing 0.5 by epoch ~25, and continues a generally upward, albeit noisy, trajectory.
* **Key Points:** Starts ~0.02. Reaches ~0.5 at epoch ~25. Peaks near ~0.76 around epoch 180. Ends at approximately 0.70 at epoch 250.
2. **SymDQN() (Blue, squares):**
* **Trend:** Demonstrates strong, steady growth. It starts low but climbs consistently, eventually becoming the second-best performer and closely approaching the top performer in the later stages.
* **Key Points:** Starts ~0.02. Crosses 0.4 around epoch 60. Reaches ~0.6 around epoch 120. Ends at approximately 0.68 at epoch 250.
3. **Baseline (Black, triangles down):**
* **Trend:** Shows a steady, monotonic increase with less volatility than the SymDQN variants. It starts very low but maintains a consistent upward slope, ending as the third-best performer.
* **Key Points:** Starts ~0.05. Reaches 0.3 around epoch 75. Hits 0.5 around epoch 130. Ends at approximately 0.65 at epoch 250.
4. **SymDQN(AR, AF) (Orange, diamonds):**
* **Trend:** Has a very rapid initial jump in the first ~10 epochs, then plateaus and grows more slowly with significant fluctuations. It generally performs below the Baseline and SymDQN() after the initial phase.
* **Key Points:** Starts ~0.14. Jumps to ~0.4 by epoch 10. Fluctuates between 0.4 and 0.5 for most of the training. Ends at approximately 0.60 at epoch 250.
5. **SymDQN(AR) (Purple, triangles up):**
* **Trend:** Exhibits the slowest overall growth and the lowest final performance. It has a gradual, noisy ascent and remains below all other methods for the majority of the training period.
* **Key Points:** Starts ~0.02. Reaches 0.2 around epoch 40. Hovers around 0.4-0.5 from epoch 100 onward. Ends at approximately 0.58 at epoch 250.
### Key Observations
* **Performance Hierarchy:** A clear and consistent ranking is established after approximately epoch 50 and holds until the end: SymDQN(AF) > SymDQN() > Baseline > SymDQN(AR, AF) > SymDQN(AR).
* **Convergence:** All lines show signs of plateauing or converging by epoch 250, though with continued noise. The gap between the top performer (SymDQN(AF)) and the others remains significant.
* **Initial Conditions:** SymDQN(AR, AF) starts at a notably higher value (~0.14) than the others (all near 0.02-0.05), suggesting a different initialization or an early advantage that is not sustained.
* **Volatility:** The SymDQN variants (green, blue, purple, orange) exhibit more volatility (higher frequency, larger amplitude fluctuations) in their learning curves compared to the smoother, more monotonic ascent of the Baseline (black).
### Interpretation
The data suggests that the `SymDQN(AF)` algorithm is the most effective method for maximizing the "% Negative Objects" metric in this specific task, demonstrating both fast learning and high final performance. The `SymDQN()` variant (without specified AR/AF components) is a strong second, eventually outperforming the `Baseline`.
The components `AR` and `AF` appear to have contrasting effects when added to the core SymDQN architecture:
* Adding `AF` alone (green line) yields the best results.
* Adding both `AR` and `AF` (orange line) provides a strong initial boost but leads to lower long-term performance and higher instability compared to the Baseline.
* Adding `AR` alone (purple line) results in the poorest performance, suggesting it may hinder learning in this context.
The `Baseline` method, while not the top performer, shows reliable and stable improvement, indicating it is a robust but potentially sub-optimal approach. The higher volatility in the SymDQN methods might reflect more aggressive exploration or sensitivity to batch data during training. The chart effectively demonstrates that architectural choices (the inclusion of AF/AR modules) significantly impact both the learning dynamics and the final outcome in this reinforcement learning scenario.
</details>
Figure 7: All versions of SymDQN: SymDQN(AF) (in green), SymDQN(AR,AF) (in red), SymDQN (in blue) and SymDQN(AR) (in purple); the $x$ -axis represents epochs and the $y$ -axis represents the ratio of the obtained score in the episode to the maximum obtainable score in that episode. We report standard deviations in the Appendix.
From this graph we can conclude that the presence of ActionReasoner, despite giving a slight boost to the initial learning rate, hampers the overall performance of the agent (red and purple graphs). The presence of ActionFilter, on the other hand, improves the initial performance (green and red graphs).
Let us now move to another performance measure: the precision of the agent in avoiding objects of the shape associated with negative rewards. We now compare all five experimental conditions, see Fig. 8.
<details>
<summary>extracted/6333762/precision_new.png Details</summary>

Line chart of "% Negative Objects" ($y$-axis, 0.00-0.20) against epochs ($x$-axis, 0-250) for SymDQN(AF), SymDQN(AR), SymDQN(), SymDQN(AR, AF), and the baseline: SymDQN(AF) and SymDQN(AR, AF) drop to approximately zero early and stay there, SymDQN(AR) declines noisily towards zero, while SymDQN() and the baseline remain highest throughout.
</details>
Figure 8: Agent’s precision in all conditions: SymDQN(AF) (in green), SymDQN(AR,AF) (in red), SymDQN(AR) (in purple), SymDQN (in blue), and DuelDQN (in black); the $x$ -axis represents epochs and the $y$ -axis represents the percentage of negative-reward objects consumed by the agent. Note that the green and red lines overlap.
While the presence of ActionReasoner (in purple) yields a significant improvement in precision, it is the ActionFilter that eradicates negative rewards completely (red and green graphs). The baseline DuelDQN and the pure SymDQN perform similarly: neither learns to avoid negative rewards completely.
### 5.1 Interpretation and Discussion
The integration of symbolic knowledge into reinforcement learning, as demonstrated by SymDQN, provides several insights into the potential of neuro-symbolic approaches in AI. The ability of SymDQN to extract and utilize key environmental features drives a significant boost in initial learning rate and overall performance, suggesting that symbolic representations can provide a valuable advantage to neural networks, enabling them to rapidly leverage the features for better decision-making. The ActionFilter provides a dramatic enhancement in early-stage performance, allowing the model to make good decisions as soon as the symbolic representation is available. By leveraging the symbolic representation and understanding of the environment, ActionFilter prunes sub-optimal actions, aligning the agent’s behavior with a symbolic understanding of the environment. The role of ActionReasoner is less clear: while providing a slight boost in initial performance, it hampers the overall learning rate. It seems that by forcing the model output to comply with the logical axiom, it diminishes its ability to capture information that is not described by the logical formulas.
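To make this concrete, below is a minimal, hypothetical sketch of how an (A4)-style alignment between $q$-values and predicted immediate rewards can be rendered as a fuzzy penalty added to the loss. The Gaussian `closeness` predicate, the weight `lam`, and the `width` parameter are illustrative assumptions, not the paper's implementation:

```python
import math

def reasoner_penalty(q_selected, predicted_rewards, lam=0.1, width=1.0):
    """Illustrative (A4)-style penalty: grade how 'close' each selected
    q-value is to the predicted immediate reward with a fuzzy predicate
    closeness(q, r) = exp(-((q - r) / width)^2), aggregate by averaging
    (a simple 'forall'), and penalize low satisfaction."""
    closeness = [math.exp(-((q - r) / width) ** 2)
                 for q, r in zip(q_selected, predicted_rewards)]
    satisfaction = sum(closeness) / len(closeness)
    return lam * (1.0 - satisfaction)
```

When $q$-values match predicted rewards exactly the penalty vanishes; large disagreements push it towards `lam`, which also illustrates how over-weighting immediate rewards can suppress long-horizon value estimates.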
The two components, ActionReasoner and ActionFilter, use symbolic information to adjust (the impact of) $q$ -values and could be seen as a form of reward shaping. ActionReasoner uses the axiom (A4) to align $q$ -values with predicted immediate rewards. As we can see in Fig. 7, this ‘reward shaping’ process is detrimental to the overall performance. A possible reason can be illustrated with the following example. Assume the agent is separated from a multitude of positive shapes by a thin wall of negative shapes. The long-term perspective of sacrificing some reward by crossing the wall can be blocked by attaching too much value to the immediate punishment. Note, however, that although ineffective for overall performance, this ‘reward shaping’ makes the agent more cautious/precise (see Fig. 8). While ActionFilter does not shape the reward function directly (as it is turned off in the training phase), it performs reasoning based on rewards (Fig. 4). In a given state, it eliminates the possibility of executing actions for which $q$ -values and immediate rewards differ too drastically.
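The pruning step can be sketched as follows; this is a schematic reading of the mechanism, with the `tolerance` threshold and the list-based interface chosen for illustration rather than taken from the paper's code:

```python
def filtered_action(q_values, predicted_rewards, tolerance=1.0):
    """ActionFilter-style selection sketch: discard actions whose q-value
    and predicted immediate reward differ by more than `tolerance`, then
    act greedily among the remaining ones."""
    allowed = [i for i, (q, r) in enumerate(zip(q_values, predicted_rewards))
               if abs(q - r) <= tolerance]
    if not allowed:  # never mask every action; fall back to plain greedy
        allowed = list(range(len(q_values)))
    return max(allowed, key=lambda i: q_values[i])
```

For example, an action with an inflated $q$-value but a strongly disagreeing predicted reward is filtered out, consistent with the eradication of negative rewards seen in Fig. 8.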
The advantages of SymDQN come with trade-offs. The computational cost introduced by the additional components is non-trivial, and the logical constraints imposed on the learning might hamper performance in more complex environments. In this respect, the use of LTNs in reinforcement learning sheds light on the ‘thinking fast and slow’ effects in learning. Firstly, the use of axioms (A1)-(A3) in ShapeRecognizer gives a sharp increase in the initial performance due to the understanding of the environment structure (Fig. 8). Apart from that, adjusting the reward function with ActionReasoner and ActionFilter increases precision (as is normally assumed about System 2 type behavior), but it can also hamper the overall performance, as it does in the case of ActionReasoner (see Fig. 7).
## 6 Conclusions and Future Work
This research introduces a novel modular approach to integrating symbolic knowledge with deep reinforcement learning through the implementation of SymDQN. We contribute to the field of Neuro-Symbolic AI in several ways. Firstly, we demonstrate a successful integration of LTNs into reinforcement learning, a promising and under-explored research direction. This integration touches on key challenges: interpretability, alignment, and knowledge injection. Secondly, SymDQN augments reinforcement learning through symbolic environment representation, modeling of object relationships and dynamics, and guiding action based on symbolic knowledge, effectively improving both initial learning rates and the end performance of the reinforcement learning agents. These contributions advance the field of neuro-symbolic AI, bridging the gap between symbolic reasoning and deep learning systems. Our findings demonstrate the potential of integrative approaches in creating more aligned, controllable and interpretable models for safe and reliable AI systems.
### 6.1 Future Work
We see several potential avenues for future research.
**Enhancing Environment Representation.** It is easy to see that the ShapeRecognizer can be adapted to any grid-like environment. It could also be further developed to represent more complex environments symbolically, e.g., by integrating a more advanced object-detection component (e.g., Faster-LTN Manigrasso et al. (2021)). With the addition of precise bounding box detection and multi-class labeling, the component could be extended to also perform hierarchical representations, e.g., recognizing independent objects and their parts or constructing abstract taxonomies.
**Automatic Axiom Discovery.** The investigation of automatic axiom discovery through iterative learning, or meta-learning, is an interesting direction that opens the door to knowledge extraction from a model (see, e.g., Hasanbeig et al. (2021); Umili et al. (2021); Meli et al. (2024)). Theoretically, given enough time and randomization, Q-learning converges to an optimal decision policy in any environment, so the iterative development and assessment of axioms might allow us to extract knowledge from deep learning systems that outperform human experts.
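For reference, the convergence claim concerns the classical tabular update, sketched below; this is distinct from the deep DuelDQN used in our experiments, and the dict-backed table and parameter values are illustrative:

```python
def q_learning_step(Q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.99):
    """One tabular Q-learning update,
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)),
    where Q is a dict keyed by (state, action) pairs."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]
```

Under standard conditions (every state-action pair visited infinitely often, appropriately decaying `alpha`), this update converges to the optimal $q$-values, which is the basis for the iterative axiom-assessment idea above.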
**Broader Evaluation.** While a version of SymDQN was shown to be advantageous, it was only tested in a single, simple environment. A broader suite of empirical experiments in more complex environments, such as Atari games or Pac-Man, is necessary to understand the generalization capabilities of the findings. These environments provide more complex and diverse challenges, potentially offering deeper insights into the advantages of SymDQN.
## Appendix A Appendix
#### Hardware and software specifications
The experiments, hyper-parameter tuning, and model variation comparisons were performed on a computing center machine (Tesla V100 with either 16 or 32GB). The coding environment used was Python 3.9.5 and PyTorch 2.3.1 with CUDA Toolkit 12.1. To integrate LTNs, the LTNtorch library Carraro (2022), a PyTorch implementation by Tommaso Carraro, was used. For experiment tracking, the ClearML platform cle (2023) was used. The code is provided as a separate .zip file.
#### Hyperparameters
The hyper-parameters used for training were: explore steps: 25000; update steps: 1000; initial epsilon: 0.95; end epsilon: 0.05; memory capacity: 1000; batch size: 16; gamma: 0.99; learning rate: 0.0001; maximum gradient norm: 1. The semantics of quantifiers in the LTN used $p=8$ .
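The $p=8$ setting refers to the generalized-mean semantics of quantifiers in LTNs; the universal quantifier is commonly grounded by the pMeanError aggregator of Badreddine et al. (2022), sketched below as a minimal re-implementation for illustration (not the LTNtorch code itself):

```python
def forall_pmean_error(truth_values, p=8):
    """pMeanError grounding of 'forall': 1 - (mean((1 - a_i)^p))^(1/p)
    over fuzzy truth values a_i in [0, 1]; larger p makes the aggregate
    focus on the least-satisfied instances."""
    n = len(truth_values)
    return 1.0 - (sum((1.0 - a) ** p for a in truth_values) / n) ** (1.0 / p)
```

With all instances fully satisfied the aggregate is 1; at $p=8$ a single strong violation drags it down sharply, which is what lets the axioms act as near-hard constraints in the loss.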
#### Standard deviations
Performance was assessed through 5 independent experimental runs per configuration; each run evaluated 50 distinct environments to calculate average scores per epoch. Values are reported for epochs 50, 150, and 250 (smoothed, rolling window 5); ( $\ast$ ) stands for SymDQN( $\ast$ ). The first table reports overall performance (rewards, r) and standard deviations (sd); the second table reports the ratio of negative shapes collected (the inverse of precision, p) and sd.
| Condition | r (50) | sd | r (150) | sd | r (250) | sd |
| --- | --- | --- | --- | --- | --- | --- |
| (AF) | 0.65 | 0.03 | 0.70 | 0.01 | 0.71 | 0.01 |
| (AR) | 0.26 | 0.03 | 0.47 | 0.01 | 0.53 | 0.04 |
| () | 0.43 | 0.05 | 0.65 | 0.02 | 0.66 | 0.02 |
| (AR,AF) | 0.40 | 0.02 | 0.53 | 0.02 | 0.59 | 0.02 |
| DuelDQN | 0.24 | 0.06 | 0.57 | 0.02 | 0.64 | 0.01 |

| Condition | p (50) | sd | p (150) | sd | p (250) | sd |
| --- | --- | --- | --- | --- | --- | --- |
| (AF) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| (AR) | 0.02 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 |
| () | 0.17 | 0.02 | 0.12 | 0.02 | 0.08 | 0.00 |
| (AR,AF) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| DuelDQN | 0.19 | 0.03 | 0.14 | 0.01 | 0.10 | 0.00 |
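The smoothing mentioned above ("rolling window 5") can be sketched as a plain rolling mean; whether the original uses a trailing or centered window is not specified, so this trailing version is an assumption:

```python
def rolling_mean(values, window=5):
    """Trailing rolling mean: one output per full window of `window`
    consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```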
## References
- Badreddine and Spranger (2019) Samy Badreddine and Michael Spranger. Injecting prior knowledge for transfer learning into reinforcement learning algorithms using logic tensor networks. CoRR, abs/1906.06576, 2019.
- Badreddine et al. (2022) Samy Badreddine, Artur d’Avila Garcez, Luciano Serafini, and Michael Spranger. Logic tensor networks. Artificial Intelligence, 303:103649, 2022.
- Bianchi and Hitzler (2019) Federico Bianchi and Pascal Hitzler. On the capabilities of logic tensor networks for deductive reasoning. In AAAI Spring Symposium Combining Machine Learning with Knowledge Engineering, 2019.
- Carraro (2022) Tommaso Carraro. LTNtorch: PyTorch implementation of Logic Tensor Networks, March 2022.
- cle (2023) ClearML - your entire MLOps stack in one open-source tool, 2023. Software available from http://github.com/allegroai/clearml.
- Donadello et al. (2017) Ivan Donadello, Luciano Serafini, and Artur d’Avila Garcez. Logic tensor networks for semantic image interpretation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1596–1602, 2017.
- Furelos-Blanco et al. (2021) Daniel Furelos-Blanco, Mark Law, Anders Jonsson, Krysia Broda, and Alessandra Russo. Induction and exploitation of subgoal automata for reinforcement learning. J. Artif. Intell. Res., 70:1031–1116, 2021.
- Giacomo et al. (2019) Giuseppe De Giacomo, Luca Iocchi, Marco Favorito, and Fabio Patrizi. Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications. In J. Benton, Nir Lipovetzky, Eva Onaindia, David E. Smith, and Siddharth Srivastava, editors, Proceedings of the Twenty-Ninth International Conference on Automated Planning and Scheduling, ICAPS 2019, Berkeley, CA, USA, July 11-15, 2019, pages 128–136. AAAI Press, 2019.
- Hasanbeig et al. (2021) Mohammadhosein Hasanbeig, Natasha Yogananda Jeppu, Alessandro Abate, Tom Melham, and Daniel Kroening. Deepsynth: Automata synthesis for automatic task segmentation in deep reinforcement learning. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 7647–7656. AAAI Press, 2021.
- Manigrasso et al. (2021) Francesco Manigrasso, Filomeno Davide Miro, Lia Morra, and Fabrizio Lamberti. Faster-LTN: A neuro-symbolic, end-to-end object detection architecture. In Igor Farkaš, Paolo Masulli, Sebastian Otte, and Stefan Wermter, editors, Artificial Neural Networks and Machine Learning – ICANN 2021, pages 40–52, Cham, 2021. Springer International Publishing.
- Meli et al. (2024) Daniele Meli, Alberto Castellini, and Alessandro Farinelli. Learning logic specifications for policy guidance in pomdps: an inductive logic programming approach. Journal of Artificial Intelligence Research, 79:725–776, February 2024.
- Serafini and d’Avila Garcez (2016) Luciano Serafini and Artur S. d’Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. CoRR, abs/1606.04422, 2016.
- Towers et al. (2024) Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium, June 2024.
- Umili et al. (2021) Elena Umili, Emanuele Antonioni, Francesco Riccio, Roberto Capobianco, Daniele Nardi, Giuseppe De Giacomo, et al. Learning a symbolic planning domain through the interaction with continuous environments. In Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL), workshop at ICAPS 2021, 2021.
- Umili et al. (2023) Elena Umili, Roberto Capobianco, and Giuseppe De Giacomo. Grounding LTLf Specifications in Image Sequences. In Proceedings of the 20th International Conference on Principles of Knowledge Representation and Reasoning, pages 668–678, 8 2023.
- van Krieken et al. (2022) Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing differentiable fuzzy logic operators. Artificial Intelligence, 302:103602, 2022.
- Wagner and d’Avila Garcez (2022) Benedikt Wagner and Artur S. d’Avila Garcez. Neural-symbolic reasoning under open-world and closed-world assumptions. In Proceedings of the AAAI 2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE 2022), 2022.
- Wang et al. (2015) Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581, 2015.
- Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 1995–2003. JMLR.org, 2016.