# Reasoning in Neurosymbolic AI
**Authors**: Son Tran, Edjard Mota, Artur d’Avila Garcez
> School of Information Technology, Deakin University, Melbourne, Victoria 3125, Australia
> Instituto de Computação, Universidade Federal do Amazonas, 69067-005, Manaus, Brazil
## Abstract
Knowledge representation and reasoning in neural networks has been a long-standing endeavor that has recently attracted much attention. The principled integration of reasoning and learning in neural networks is a main objective of the area of neurosymbolic Artificial Intelligence (AI). In this chapter, a simple energy-based neurosymbolic AI system is described that can represent and reason formally about any propositional logic formula. This creates a powerful combination of learning from data and knowledge and logical reasoning. We start by positioning neurosymbolic AI in the context of the current AI landscape that is unsurprisingly dominated by Large Language Models (LLMs). We identify important challenges of data efficiency, fairness and safety of LLMs that might be addressed by neurosymbolic systems with formal reasoning capabilities. We then discuss the representation of logic by the specific energy-based system, including illustrative examples and an empirical evaluation of the correspondence between logical reasoning and energy minimization using Restricted Boltzmann Machines (RBMs). The system, called Logical Boltzmann Machine (LBM), can find all satisfying assignments of a class of logical formulae by searching through a very small percentage of the possible truth-value assignments. Learning from data and knowledge in LBM is also evaluated empirically and compared with a purely-symbolic, a purely-neural and a state-of-the-art neurosymbolic system, achieving better learning performance in five out of seven data sets. The results reported in this chapter, presented in an accessible way, are expected to reignite research on the use of neural networks as massively-parallel models for logical reasoning and to promote the principled integration of reasoning and learning in deep networks.
LBM is also evaluated in the role of an interpretable neural module that can be added on top of complex neural networks, such as convolutional networks and encoder-decoder networks, to implement any given set of logical constraints, e.g. fairness or safety requirements. LBM is further evaluated when deployed in the solution of the connectionist Boolean satisfiability (SAT) problem, maximum satisfiability (MaxSAT) and approximate optimization problems, when certain logical rules may be given a higher priority or a penalty according to a confidence value. We conclude the chapter with a discussion of the importance of positioning neurosymbolic AI within a broader framework of formal reasoning and accountability in AI, discussing the challenges for neurosymbolic AI to tackle the various known problems of reliability of deep learning. We close with an opinion on the risks of AI and future opportunities for neurosymbolic AI.

Keywords: Neurosymbolic AI, Restricted Boltzmann Machines, Logical Reasoning, SAT Solving, MaxSAT, Energy-based Learning, Constrained Optimization, Modular Deep Learning.
## 1 What is Reasoning in Neural Networks?
Increasing attention has been devoted in recent years to knowledge representation and reasoning in neural networks. The principled integration of reasoning and learning in neural networks is a main objective of the field of neurosymbolic Artificial Intelligence (AI) [9, 34]. In neurosymbolic AI, typically, an algorithm is provided that translates some form of symbolic knowledge representation into the architecture and initial set of parameters of a neural network. Ideally, a theorem then shows that the neural network can be used as a massively-parallel model of computation capable of reasoning about such knowledge. Finally, when trained with data and knowledge, the network is expected to produce better performance, either a higher accuracy or faster learning than when trained from data alone. Symbolic knowledge may be provided to a neural network in the form of general rules which are known to be true in a given domain, or rules which are expected to be true across domains when performing transfer and continual learning. When rules are not available to start with, they can be extracted from a trained network. When rules are contradicted by data, they can be revised as part of the learning process. This has been shown to offer a flexible framework whereby knowledge and data, neural networks and symbolic descriptions are combined, leading to a better understanding of complex network models with the interplay between learning and reasoning.
This chapter includes a general discussion of how neurosymbolic AI can contribute to the goals of reasoning in neural networks and a specific illustration of a neurosymbolic system for reasoning in propositional logic with restricted Boltzmann machines (RBMs) [44]. We will describe a neurosymbolic system, called Logical Boltzmann Machines (LBM), capable of (i) representing any propositional logic formula in a restricted Boltzmann machine, (ii) reasoning efficiently from such a formula, and (iii) learning from such knowledge representation and data. LBM comes with an algorithm to translate any set of propositional logical formulae into a Boltzmann machine and a proof of equivalence between the logical formulae and the energy-based connectionist model; in other words, a proof of soundness of the translation algorithm from logical formulae to neural networks. Specifically, the network is shown to assign minimum energy to the assignments of truth-values that satisfy the formulae. This provides a new way of performing reasoning in symmetrical neural networks by employing the network to search for the models of a logical theory, that is, to search for the assignments of truth-values that map the logical formulae to $true$. We use the term model to refer to logical models and to neural network models. When the intended meaning is not clear from the context, we shall use the term logical model. If the number of variables is small, inference can be carried out analytically by sorting the free-energy of all possible truth-value assignments. Otherwise, Gibbs sampling is applied in the search for logical models. We start, however, with a general discussion of reasoning in current AI, including large language models.
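The correspondence between satisfiability and minimum free energy can be previewed with a toy example. The sketch below hand-crafts a single-hidden-unit RBM whose free energy is minimized exactly at the satisfying assignment of the formula $x \land y$; the weights are our own illustrative choice and are not produced by the LBM translation algorithm, which is described later in the chapter.

```python
import math
from itertools import product

def free_energy(v, W, b, c):
    """Free energy of a binary RBM: F(v) = -c.v - sum_j softplus(b_j + W_j.v)."""
    hidden = sum(math.log(1.0 + math.exp(b[j] + sum(W[j][i] * v[i] for i in range(len(v)))))
                 for j in range(len(b)))
    visible = sum(c[i] * v[i] for i in range(len(v)))
    return -visible - hidden

# Hand-crafted weights (an illustration, not the LBM translation algorithm):
# one hidden unit that activates only when both x and y are true, i.e. x AND y.
W = [[5.0, 5.0]]   # hidden unit's connections to visible units x, y
b = [-7.5]         # hidden bias: 5x + 5y - 7.5 > 0 only when x = y = 1
c = [0.0, 0.0]     # visible biases

# Rank all truth-value assignments by free energy (ascending).
ranking = sorted(product([0, 1], repeat=2), key=lambda v: free_energy(v, W, b, c))
print(ranking[0])  # prints (1, 1): the satisfying assignment has minimum energy
```

This is the analytical inference mode mentioned above: with few variables, all assignments can be enumerated and sorted by free energy; with many variables, Gibbs sampling takes the place of enumeration.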
### 1.1 Reasoning in Large Language Models
Since the release of GPT4 by OpenAI in March 2023, a fierce debate has developed around the risks of AI, Big Tech companies have released various proprietary and open-source competitors to ChatGPT, and the European Union has passed the regulatory AI Act in record time. Leading figures disagreed on what should be done about the risks of AI. Some claimed that Big Tech is best placed to take care of safety, others argued in favor of open source, and others still argued for regulation of AI and social media. As society contemplates the impact of AI on everyday life, the secrecy surrounding AI technology fueled fears of existential risk and even claims of an upcoming AI bubble burst. Large Language Models (LLMs) such as ChatGPT, Gemini, Claude, Mistral and DeepSeek are a great engineering achievement, are impressive at text summarization and language translation, may improve productivity of those who are knowledgeable enough to spot the LLM’s mistakes, but have great potential to deceive those who aren’t.
There are various technical and non-technical reasons why LLMs and current AI may not be deployed in practice: lack of trust or fairness, reliability issues and public safety as in the case of self-driving cars that use the same technology as LLMs. Fixing reliability issues case-by-case with Reinforcement Learning has proved to be too costly. A common risk mitigation strategy has been to adopt a human-in-the-loop approach: making sure that a human is ultimately responsible for decision making. However, in the age of Agentic AI, where at least some decisions are made by the machine, simply apportioning blame or liability to a human does not address the problem. It is necessary to empower the user of AI, the data scientist and the domain expert to be able to interpret, question and if necessary intervene in the AI system. Neural networks that are accompanied by symbolic descriptions and sound reasoning capabilities will be an important tool in this process of empowering users of AI.
Consider LLMs’ ability to produce code. If GPT4 were allowed to work, not as a stand-alone computer program, but in a loop whereby the code can be executed and data collected from execution to improve the code automatically, one can see how such a self-improving LLM with autonomy may pose a serious risk to current computer systems. Recent experiments, however, indicated that the opposite, self-impairment, may also happen in practice, producing a degradation in performance. We will argue that the emerging field of neurosymbolic AI can address such failures and that there must be a better way, other than very costly post-hoc model alignment, of achieving AI that can offer certain logical guarantees to network training.
LLMs have been considered to be general purpose because they will provide an answer to any question. They do that by doing only one thing: predicting the probability of the next word (token) in a sentence. Having made a choice of the next word, LLMs will apply the same calculations recursively to build larger sentences. They are called auto-regressive machine learning models because they perform regression on the discrete tokens to learn such probabilities, and apply recursively the learned function $f$ to choose the word that comes at time t+1 given the words that are available at time t, that is, $x_{t+1}=f(x_{t})$ . Artificial General Intelligence (AGI), however, is best measured by the ability to adapt to novelty. It will require effective learning from fewer data, the ability to reason reliably about the knowledge that has been learned, the extraction of compact descriptions from trained networks and the consolidation of knowledge learned from multiple tasks, using analogy to enable extrapolation to new situations at an adequate level of abstraction. It has been almost two years since GPT4 was released. The competition has caught up. Reliable data seem to have been exhausted. Performance increments obtained with increase in scale have not produced AGI. It is fair to say that the “scale is all you need” claim has not been confirmed. Notwithstanding, domain-specific AI systems that can exhibit intelligence at the level of humans or higher already exist. These systems exhibit intelligence in specialized tasks: targeted medical diagnoses, protein folding, various closed-world two-player strategy games.
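The auto-regressive recipe $x_{t+1}=f(x_{t})$ can be made concrete with a deliberately tiny sketch. Here $f$ is a greedy bigram model that conditions only on the previous token; a real LLM is a transformer network over subword tokens that conditions on the whole preceding context, but the recursive application of a learned next-token function is the same.

```python
from collections import Counter, defaultdict

# A toy "corpus"; real LLMs learn from vast text with transformer networks.
corpus = "a b c a b c a b".split()

# "Learning": estimate next-token counts (a bigram model standing in for f).
counts = defaultdict(Counter)
for t in range(len(corpus) - 1):
    counts[corpus[t]][corpus[t + 1]] += 1

def f(x_t):
    """Greedy next-token choice: the most frequent successor of x_t."""
    return counts[x_t].most_common(1)[0][0]

# Recursive application: x_{t+1} = f(x_t), starting from "a".
sequence = ["a"]
for _ in range(5):
    sequence.append(f(sequence[-1]))
print(" ".join(sequence))  # prints: a b c a b c
```

The sketch makes the limitation discussed above visible: the model can only regurgitate the statistics of its corpus, and nothing in the next-token objective itself provides reasoning about what has been generated.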
When LLMs make things up, such as non-existent citations, they are said to hallucinate. AGI will require systems that never hallucinate (that is, reason reliably), that can form long-term plans and act on those plans to achieve a goal, and that can handle exceptions as they materialize, addressing shifts in data distribution not case-by-case, but requiring far less data labeling. This is very different from current LLMs, which seem to have difficulty handling exceptions. For this reason, hallucinations are not going away and the cost of post-hoc model alignment has spiraled in the last two years.
As a case in point, take the o1 LLM system released by OpenAI in September 2024; o1 was claimed to “think before it answers” and to be capable of “truly general reasoning”. Widely seen as a re-branding of the much anticipated GPT5, which was promised to be at AGI level, the little that we know about o1 is that it improved on reasoning and code generation benchmarks, and yet it can be stubbornly poor at simple tasks such as multiplication, formal reasoning, planning or the formidable ARC AGI challenge (see https://arcprize.org/). Let’s assume that OpenAI’s o1 system is best described as “GPT-Go”, a pre-trained transformer to which a tree search is incorporated in the style of Google DeepMind’s earlier Alpha-Go system. The tree search uses “Chain of Thought” (CoT) prompting: generation of synthetic data using the transformer neural network itself in a chain that breaks down a prompt into sub-prompts (sub-problems to be solved in stages). o1’s “thinking” time is presumably needed to build the tree for the CoT. And it’s this breaking of the problem into sub-problems that is expected to improve performance on reasoning tasks since this is how reasoning tasks are solved.
Leaving aside the practical question of how long users will be happy to wait for an answer, the main issue with o1 and successors is a lack of reliability of the synthetic data generation and combinatorial nature of CoT: CoT may solve one reasoning task well today only to fail at an analogous reasoning task tomorrow due to simple naming variations [31]. With synthetic data generation from GPT-like auto-regressive models having been shown to impair model performance, the quality of the data decreases and the model continues to hallucinate [42].
What we are seeing in practice is that eliminating hallucinations is very difficult. And there is another concern: regurgitation. The New York Times (NYT) lawsuit against OpenAI argues that ChatGPT can basically reproduce (regurgitate) copyrighted NYT texts with minimal prompting. Whether regurgitation can be fixed remains to be seen. Efforts in this direction have been focused on a simple technique called RAG (Retrieval Augmented Generation) that fetches facts from external sources. What is clear is that further research is needed to make sense of how LLMs generalize to new situations, to find out whether performance depends on task familiarity or true generalization. In the meantime, there will be many relevant but domain-specific applications of LLMs in areas where the system has been deemed to have been controlled reasonably well or where controlling it isn’t crucial.
In neurosymbolic AI, instead of adjusting the input to fix a misbehaving LLM as done with CoT, the idea is to control the architecture or the loss function of the system. Neurosymbolic AI integrates learning and reasoning to make model development parsimonious by following this recipe: (1) extract symbolic descriptions as learning progresses, (2) reason formally about what has been learned, (3) compress the neural network as knowledge is instilled back into the network. Reasoning in neurosymbolic AI follows the tradition of knowledge representation in AI. It requires the definition of a semantics for deep learning and it measures the capabilities of neural networks w.r.t. formally-defined, sound and approximate reasoning, providing a much needed measure of the accumulation of errors in the AI system.
### 1.2 AI from a Neurosymbolic Perspective
It is paradoxical that computers were invented to provide fast calculations and sound reasoning, and yet the latest AI may fail at calculations as simple as multiplication (even though a typical artificial neural network will rely on millions of correct multiplications as part of its internal computations). The first wave of AI in the 1980s was knowledge-based, well-founded and inefficient if compared with deep learning. The second wave from the 2010s was data-driven, distributed and efficient but unsound if compared with knowledge bases. It is clear that neural networks are here to stay, but the problems with deep learning have been stubbornly difficult to fix using neural networks alone. Next, we discuss how solving these problems will require the use of symbolic AI alongside neural networks. The third wave of AI, we argue, will be neurosymbolic [15].
In order to understand the achievements and limitations of AI, it is helpful to consider the AGI debate (see https://www.youtube.com/watch?v=JGiLz_Jx9uI) with its focus on what is missing from current AI systems, i.e. the technological innovation that may bring about better AI or AGI. Simply put, such innovation may be described as the ability to apply knowledge learned from a task by a neural network to a novel task without requiring too much data.
With AI experts John Hopfield and Geoff Hinton awarded the 2024 Nobel Prize in Physics, and AI expert Demis Hassabis awarded the 2024 Nobel Prize in Chemistry (with David Baker and John Jumper), one can say that the era of computation as the language of science has begun. Hassabis led the team at Google DeepMind that created AlphaFold, an AI model capable of predicting with high accuracy the 3D structure of proteins given their amino acid sequence. AlphaFold is arguably the greatest achievement of AI to date, even though it is squarely an application-specific (or narrow) AI by comparison with LLMs. From particle physics to drug discovery, energy efficiency and novel materials, AI is being adopted as the process by which scientific research is carried out. However, as noted above, the lack of a description or explanation capable of conveying a deeper sense of understanding of the solution being offered by AI is something that is very unsatisfactory. In a great feat of engineering, computer scientists will solve very challenging problems in science to a high degree of accuracy without necessarily improving their own understanding of the solutions, which are provided by very large neural networks trained on amounts of data that are not humanly possible to inspect.
The risks of current AI together with this unsatisfactory lack of explainability confirm the need for neurosymbolic AI as an alternative approach. As mentioned, neurosymbolic AI uses the technology of knowledge extraction to interpret, ask what-if questions and if necessary intervene in the AI system, controlling learning in ways that can offer correctness or fairness guarantees and, with this process, producing a more compact, data efficient system. We start to see a shift towards such explainable neurosymbolic AI systems being deployed as part of a risk-based approach. As argued in [36], effective regulation goes hand in hand with accountability in AI, the definition of a risk mitigation strategy and the use of technology itself such as explainable AI technology [33] to mitigate risks. We shall return to this discussion at the end of the paper.
For more than 20 years, a small group of researchers have been advocating for neurosymbolic AI. Already around the turn of the 21st century, the importance of artificial neural networks as an efficient computational model for learning was clear to that group. But the value of symbol manipulation and abstract reasoning offered by symbolic logic was also obvious to them. Many before them have contributed to neurosymbolic AI. In fact, it could be argued that neurosymbolic AI starts together with connectionism itself, with the aptly titled 1943 paper by McCulloch and Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity, and with John Von Neumann’s 1952 Lectures on Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components, indicating that the gap between distributed vector representations (embeddings) and localist symbolic representations in logic was not as big as some might imagine. Even Alan Turing’s 1948 Intelligent Machinery introduced a type of neural network called a B-type machine. All of this, of course, before the term Artificial Intelligence was coined ahead of the now famous Dartmouth Workshop in 1956. Since then the field has separated into two: symbolic AI and connectionist AI (or neural networks). This has slowed progress as the two research communities went their separate ways with different conferences, journals and associations. Following the temporary success of symbolic AI in the 1980s and the success of deep learning since 2015 with its now obvious limitations, the time is right for revisiting the approaches of the founding fathers of computer science and developing neurosymbolic AI that is fit for the 21st century. As a step in this direction, in what follows, we illustrate how a single bi-directional network layer in the form of a restricted Boltzmann machine can implement the full semantics of propositional logic, formally defined.
## 2 Background: Logic and Restricted Boltzmann Machines
Differently from general-purpose Large Language Models, domain-specific Artificial Intelligence, such as the protein-folding AlphaFold system, aims to develop systems for specific purposes, augmenting human abilities to handle tasks that might otherwise take many years to solve. This goal of domain-specific AI is analogous to the invention of the Archimedean lever, which amplified physical strength and enabled humanity to make leaps in construction, mobility and physical labor. AI can be a mental lever that enhances our ability to deal with problems requiring mental activity in a volume or intensity that is difficult to accomplish in feasible time or with precision. Modeling such abstract human mental activity is a highly complex task and we shall focus on representing two well-studied aspects: learning and reasoning.
A key step in this endeavor is to choose an appropriate language to represent the problem at hand. In the context of this paper, such a choice will be deemed to be suitable if it allows the development of efficient algorithms to perform learning from data and reasoning about what was learned, or if it allows one to identify patterns of solutions that will lead to adequate decisions. Traditional AI has separated the study of reasoning and learning with a focus on either knowledge elicitation by hand for the purpose of sound reasoning or statistical learning from large amounts of data. In neurosymbolic AI this artificial separation is removed. The neurosymbolic cycle seeks to enable AI systems to learn a little and reason a little in an integrated fashion. Learning takes place in the usual way within a neural network but reasoning has to be formalized, whether taking place inside or outside the network. Instead of simply measuring the reasoning capabilities of networks using benchmarks, neurosymbolic AI seeks to offer guarantees of correct reasoning. It is crucial to pay attention to the many years of research in knowledge representation and reasoning within Computer Science logic. While learning may benefit from the use of natural language and other available multimodal data, sound reasoning requires a formal language. A choice of language adequate to the problem influences the system’s ability to find a solution.
Formal logic, particularly Propositional Logic, is the most straightforward language for representing propositions about the problem domain. Propositional logic is the simplest formal language for representation: a branch of mathematics and logic that deals with simple declarative statements, called propositions, which can be true or false. As we shall see, in the context of neurosymbolic systems, statements are not purely true or false, but are associated with confidence values, probability intervals or degrees of truth denoting the intrinsic uncertainty of AI problems. It is therefore incorrect to assume that the use of logic is incompatible with uncertainty reasoning or limited to crisp, true-or-false statements. In its most general form, logic includes fuzzy and many-valued logics and various other forms of non-classical reasoning. We start, however, with propositional logic.
Think of propositions as the fundamental building blocks for reasoning. For instance, “it is raining” is a proposition because its truth can be determined by examining the current weather conditions. We typically use symbols such as $P$ , $Q$ , or $R$ to represent these propositions. Any symbol, including indices, can be used as long as it is clear that they represent a specific proposition. To combine or modify these propositions, we use logical connectives or operators: AND ( $\land$ ), OR ( $\lor$ ), NOT ( $\lnot$ ), IMPLICATION ( $→$ ), and BI-CONDITIONAL ( $\leftrightarrow$ ). For example, if $P$ represents “it is raining” and $Q$ represents “I have an umbrella,” then $P\land Q$ means “it is raining AND I have an umbrella”. The operators allow us to compose complex relationships among ideas in a precise way.
A syntactically correct expression in logic is said to be a Well-Formed Formula (WFF). A WFF in propositional logic is constructed according to the following rules:
1. Any atomic proposition (e.g., $P$, $Q$, $R$) is a WFF.
1. If $A$ is a WFF then $\lnot A$ (the negation of $A$ ) is also a WFF.
1. If $A$ and $B$ are WFFs then $(A\land B)$ , $(A\lor B)$ , $(A→ B)$ , and $(A\leftrightarrow B)$ are also WFFs.
1. Nothing else is a WFF.
For example, the expression $(P\land Q)→ R$ is a WFF because it follows these rules: $P$ , $Q$ , and $R$ are atomic propositions, $(P\land Q)$ is a valid combination using the AND operator, and the entire expression forms a valid implication. On the other hand, expressions like $P\land\lor Q$ are not WFFs because they violate the rules.
Propositional logic is also known as Boolean Logic, named after George Boole, a pioneer in the formalization of logical reasoning. Interestingly, George Boole is the great-great-grandfather of Geoffrey Hinton, a leading figure in the field of neural networks. Boole proposed his Laws of Thought using a simplified notation where $1$ and $0$ denote true and false, respectively. This binary representation aligns naturally with the semantic interpretation of neural networks and fits seamlessly into the reasoning method to be presented in this paper.
By adhering to the rules of WFFs, we ensure that our logical expressions are unambiguous and well-structured (compositional), providing a solid foundation for further exploration of propositional logic and its applications. In the remainder of this paper, unless otherwise specified, we shall use WFF to refer specifically to a subset of WFFs consisting only of formulas constructed using combinations of negation ( $\lnot$ ), conjunction ( $\land$ ), and disjunction ( $\lor$ ). If other logical connectives, such as implication ( $→$ ) or bi-conditional ( $\leftrightarrow$ ), are included, we will explicitly clarify this deviation from the specific subset, noting that in Classical Logic $A\leftrightarrow B$ is equivalent to $(A→ B)\land(B→ A)$ and that $A→ B$ is equivalent to $\neg A\lor B$ .
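The semantics of the connectives, including the rewriting of $→$ and $\leftrightarrow$ in terms of $\lnot$, $\land$ and $\lor$, can be made concrete with a small evaluator. This is a sketch under our own representation choice: WFFs as nested tuples, atoms as strings.

```python
def evaluate(wff, assignment):
    """Evaluate a WFF under a truth-value assignment.
    WFFs are nested tuples: ('not', A), ('and', A, B), ('or', A, B),
    ('implies', A, B), ('iff', A, B); atomic propositions are strings."""
    if isinstance(wff, str):
        return assignment[wff]
    op = wff[0]
    if op == 'not':
        return not evaluate(wff[1], assignment)
    if op == 'and':
        return evaluate(wff[1], assignment) and evaluate(wff[2], assignment)
    if op == 'or':
        return evaluate(wff[1], assignment) or evaluate(wff[2], assignment)
    if op == 'implies':   # A -> B is equivalent to (not A) or B
        return evaluate(('or', ('not', wff[1]), wff[2]), assignment)
    if op == 'iff':       # A <-> B is equivalent to (A -> B) and (B -> A)
        return evaluate(('and', ('implies', wff[1], wff[2]),
                                ('implies', wff[2], wff[1])), assignment)
    raise ValueError(f"not a WFF: {wff!r}")

# (P AND Q) -> R is false only when P and Q are true but R is false.
wff = ('implies', ('and', 'P', 'Q'), 'R')
print(evaluate(wff, {'P': True, 'Q': True, 'R': False}))   # prints False
print(evaluate(wff, {'P': True, 'Q': False, 'R': False}))  # prints True
```

Note how the cases for 'implies' and 'iff' are defined entirely by rewriting into the $\{\lnot, \land, \lor\}$ subset, mirroring the equivalences stated above.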
### 2.1 Illustrating Logical Reasoning with the Sudoku Puzzle
Sudoku is more than just a number puzzle (see Figure 1); it is a gateway to understanding the power of logical thinking. This globally beloved puzzle challenges us to impose order on apparent chaos, using nothing but numbers and logic. At its core, Sudoku is about solving constraints, ensuring that every row, column, and sub-grid (or block) adheres to a simple strict rule (containing one and only one of the elements of a given set). The same principle of constraint satisfaction is a cornerstone of Artificial Intelligence and computational problem-solving. By learning how to express Sudoku’s rules logically, we unlock the secrets of this captivating game and the tools to tackle more complex problem solving. Let’s explore how propositional logic can elegantly capture the rules of Sudoku as a way to illustrate structured reasoning.
For simplicity, we consider a smaller version of Sudoku, using a $4× 4$ grid instead of the standard $9× 9$ . This simplified puzzle divides the board into four $2× 2$ blocks or sub-grids, each containing four positions (or cells). Blocks are counted from left to right and top to bottom: block 1 is on top of block 3, and block 2 is on top of block 4. Positions within each block are also counted from left to right and top to bottom. Each cell in the grid must contain a number from 1 to 4, with no repetition allowed in any row, column, or $2× 2$ block. In the real Sudoku puzzle, each block is $3× 3$ and the set of possible elements is {1,2,…,9} with the board having 9 blocks in total. Figure 1 depicts an example of an initial setting for a Sudoku $4× 4$ board, followed by two possible transitions placing number 3 in two possible cells satisfying the constraints. Two possible final states are also shown, each derived from the above two states if every movement satisfies the constraints of the puzzle.
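The $4× 4$ constraints just described can be stated directly as an executable check. The sketch below uses our own representation choice, a grid as a list of four rows, and numbers blocks left to right, top to bottom, as in the text.

```python
def valid_4x4(grid):
    """Check the 4x4 Sudoku constraints: every row, column and 2x2 block
    must contain each of the numbers 1..4 exactly once."""
    target = {1, 2, 3, 4}
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(4)} for c in range(4)]
    # Blocks 1..4, left to right then top to bottom (1 above 3, 2 above 4).
    blocks = [{grid[r][c] for r in (br, br + 1) for c in (bc, bc + 1)}
              for br in (0, 2) for bc in (0, 2)]
    return all(group == target for group in rows + cols + blocks)

solved = [[1, 2, 3, 4],
          [3, 4, 1, 2],
          [2, 1, 4, 3],
          [4, 3, 2, 1]]
print(valid_4x4(solved))   # prints True

broken = [row[:] for row in solved]
broken[0][0] = 4           # repeats 4 in row 1, column 1 and block 1
print(valid_4x4(broken))   # prints False
```

A constraint checker of this kind is the procedural counterpart of the propositional encoding developed next: each `group == target` test corresponds to a conjunction of "at least one" and "at most one" formulae.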
Figure 1: An initial Sudoku board and two branches generated by placing a 3 at position 3 of blocks 1 and 3, respectively, and corresponding final states satisfying the constraints of the game.
Solving Sudoku involves reasoning about these constraints, making it a good example for introducing logical notation. To model the problem using propositional logic, one can systematically represent the constraints in terms of propositional variables encoding the relationships between numbers, positions, rows, columns and blocks. The rules dictate that every row, column and block must include the numbers 1 to 4 exactly once. By encoding the problem in this way, one can use symbolic logical reasoning to systematically explore possible solutions while respecting all constraints. The rules are encoded as follows:
Logical Variables:
Let the proposition $B_{i,j,k}$ denote that block $i$ at position $j$ (that is, the cell $(i,j)$) contains the number $k$. Formally, $B_{i,j,k}$ is true if and only if $k\in\{1,2,3,4\}$ is in position $j$ of block $i$, $1\leq i\leq 4$, $1\leq j\leq 4$.
Logical Constraints:
The constraints ensure that the numbers are placed correctly according to the rules of Sudoku. These constraints can be grouped into four categories:
1. Each cell must contain a number (cell $(i,j)$ contains a 1 or a 2 or a 3 or a 4): $B_{i,j,1}\lor B_{i,j,2}\lor B_{i,j,3}\lor B_{i,j,4}$ . When needed, we shall write:
$$
\bigvee_{k=1}^{4}B_{i,j,k}\quad\text{as shorthand notation for }B_{i,j,1}\lor B_{i,j,2}\lor B_{i,j,3}\lor B_{i,j,4}.
$$
There cannot be two or more numbers in the same cell (notice that $\neg(A\wedge B)$ implies $\neg(A\wedge B\wedge C)$):
$$
\neg(B_{i,j,k_{1}}\wedge B_{i,j,k_{2}}),\quad\text{for all }k_{1}\neq k_{2}.
$$
The above two rules can be written compactly as:
$$
\left(\bigvee_{k=1}^{4}B_{i,j,k}\right)\wedge\left(\bigwedge_{k_{1}<k_{2}}\neg(B_{i,j,k_{1}}\wedge B_{i,j,k_{2}})\right),
$$
where $\bigwedge_{i}x_{i}$ is shorthand for $x_{1}\wedge x_{2}\wedge...$ and $k_{1}<k_{2}$ is used to avoid repetition. Notice that $A\wedge B$ is logically equivalent to $B\wedge A$ .
1. Each number appears exactly once per row. For each row across the entire board and each number $k$ , exactly one position in that row must contain $k$ . This is expressed as:
$$
\bigvee_{j=1}^{4}B_{i,j,k}
$$
along with the constraint that there cannot be two or more occurrences of the same number on the same row:
$$
\neg(B_{i,j_{1},k}\wedge B_{i,j_{2},k}),\quad\text{for all }j_{1}\neq j_{2}.
$$
In compact form:
$$
\left(\bigvee_{j=1}^{4}B_{i,j,k}\right)\wedge\left(\bigwedge_{j_{1}<j_{2}}\neg(B_{i,j_{1},k}\wedge B_{i,j_{2},k})\right).
$$
1. Each number appears exactly once per column. In compact form (as above):
$$
\left(\bigvee_{i=1}^{4}B_{i,j,k}\right)\wedge\left(\bigwedge_{i_{1}<i_{2}}\neg(B_{i_{1},j,k}\wedge B_{i_{2},j,k})\right).
$$
1. Each number appears exactly once per block. For each $2× 2$ block and each number $k$ , exactly one position within the block must contain $k$ . For example, for the top-left block:
$$
\bigvee_{(i,j)\in\{(1,1),(1,2),(2,1),(2,2)\}}B_{i,j,k},
$$
along with the constraint:
$$
\neg(B_{i_{1},j_{1},k}\wedge B_{i_{2},j_{2},k}),\quad\text{for all distinct pairs }(i_{1},j_{1})\neq(i_{2},j_{2}).
$$
In compact form:
$$
\left(\bigvee_{(i,j)\in\text{block}}B_{i,j,k}\right)\wedge\left(\bigwedge_{(i_{1},j_{1})<(i_{2},j_{2})}\neg(B_{i_{1},j_{1},k}\wedge B_{i_{2},j_{2},k})\right).
$$
The complete set of constraints for the $4× 4$ Sudoku puzzle is the conjunction of all the above conditions over all cells, rows, columns and blocks. This logical formula guarantees that every number appears exactly once in each row, column, and block, satisfying the rules of Sudoku. It also provides a systematic framework for reasoning about the puzzle.
**Example 1**
*For block 1, position 1, we have:*
- $B_{1,1,1}\lor B_{1,1,2}\lor B_{1,1,3}\lor B_{1,1,4}$
- $\neg B_{1,1,1}\lor\neg B_{1,1,2}$
- $\neg B_{1,1,1}\lor\neg B_{1,1,3}$
- $\neg B_{1,1,1}\lor\neg B_{1,1,4}$
- $\neg B_{1,1,2}\lor\neg B_{1,1,3}$
- $\neg B_{1,1,2}\lor\neg B_{1,1,4}$
- $\neg B_{1,1,3}\lor\neg B_{1,1,4}$
Some observations about this representation:
- This notation provides a framework whereby each indexed proposition $B_{i,j,k}$ is assigned a truth-value True or False.
- Each rule above is called a clause (a disjunction of logic literals) and the complete set of clauses would be significantly larger to cover all rows, columns and blocks.
- This representation can be used as input to a satisfiability (SAT) solver to find solutions to the Sudoku puzzle, that is, assignments of truth-values True or False to each literal that will provably satisfy the puzzle’s constraints.
This Boolean logic representation allows us to express the Sudoku problem as a set of constraints that must be satisfied simultaneously. By finding a truth assignment to the variables that satisfy all the clauses, we determine a valid solution to the Sudoku puzzle.
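To make the encoding concrete, the clause generation above can be sketched in Python. This is a minimal illustration, not code from the chapter: function and variable names are ours, cells are indexed by (row, column) for readability, and a clause is a list of (variable, polarity) pairs with polarity `False` denoting negation.

```python
from itertools import combinations, product

def var(i, j, k):
    """Proposition B_{i,j,k}: cell (i, j) contains number k (1-based)."""
    return (i, j, k)

def sudoku_clauses(n=4):
    """CNF clauses for an n x n Sudoku (n a perfect square)."""
    b = int(n ** 0.5)
    clauses, groups = [], []
    # each cell: at least one number, and at most one number
    for i, j in product(range(1, n + 1), repeat=2):
        clauses.append([(var(i, j, k), True) for k in range(1, n + 1)])
        for k1, k2 in combinations(range(1, n + 1), 2):
            clauses.append([(var(i, j, k1), False), (var(i, j, k2), False)])
    # rows, columns and blocks: each number exactly once per group
    for i in range(1, n + 1):
        groups.append([(i, j) for j in range(1, n + 1)])  # row i
        groups.append([(j, i) for j in range(1, n + 1)])  # column i
    for bi, bj in product(range(b), repeat=2):
        groups.append([(bi * b + r, bj * b + c)
                       for r in range(1, b + 1) for c in range(1, b + 1)])
    for cells in groups:
        for k in range(1, n + 1):
            clauses.append([(var(i, j, k), True) for i, j in cells])
            for (i1, j1), (i2, j2) in combinations(cells, 2):
                clauses.append([(var(i1, j1, k), False), (var(i2, j2, k), False)])
    return clauses

def satisfies(assignment, clauses):
    """True if every clause has at least one literal matching the assignment."""
    return all(any(assignment.get(v, False) == pol for v, pol in c) for c in clauses)

# a valid 4x4 board, written row by row
board = [[1, 2, 3, 4],
         [3, 4, 1, 2],
         [2, 1, 4, 3],
         [4, 3, 2, 1]]
truth = {var(i + 1, j + 1, k): (board[i][j] == k)
         for i, j, k in product(range(4), range(4), range(1, 5))}
print(satisfies(truth, sudoku_clauses()))  # True for a valid board
```

For the $4\times 4$ board this generates 448 clauses, which gives a sense of how quickly the clause set grows even for small puzzles.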
2.2 Sudoku with Strategies of Sampling
1. Reasoning Strategy based on Unused Numbers:
To control which number to pick based on the bank of numbers not yet placed on the board, let us illustrate how additional constraints may be introduced that ensure unused numbers are considered first. A strategy such as this could be learned from observation of game plays as well as specified by hand.
For each empty cell $(i,j)$ , define $U(i,j)$ as the set of numbers $k$ such that $k$ is not already used in the corresponding row, column or block of cell $(i,j)$ .
The constraint ensuring the selection of an unused number $k$ can be expressed as:
$$
\bigvee_{k\in U(i,j)}B_{i,j,k}
$$
where $U(i,j)$ is defined as:
$$
U(i,j)=\{k\mid B_{i,j^{\prime},k}\text{ is False for all }j^{\prime}\neq j,\ B_{i^{\prime},j,k}\text{ is False for all }i^{\prime}\neq i,\text{ and }B_{i^{\prime},j^{\prime},k}\text{ is False for all }(i^{\prime},j^{\prime})\in\text{block}(i,j)\text{ with }(i^{\prime},j^{\prime})\neq(i,j)\}.
$$
Here, $\text{block}(i,j)$ denotes the set of positions in the same block as $(i,j)$ .
1. Priority Constraint for Unused Numbers:
To prioritize the use of unused numbers, we can add a preference rule that assigns higher priority to considering numbers from $U(i,j)$ ahead of other possibilities.
Formally, let $P(i,j,k)$ represent the priority of placing number $k$ in cell $(i,j)$ . The priority can be defined as:
$$
P(i,j,k)=\begin{cases}1&\text{if }k\in U(i,j)\\
0&\text{otherwise}\end{cases}
$$
The constraint ensuring that the highest priority is given to unused numbers can be expressed as:
$$
\bigvee_{k\in U(i,j)}(P(i,j,k)\wedge B_{i,j,k})
$$
The complete set of logical constraints for the 4x4 Sudoku puzzle now includes the original Sudoku constraints along with additional reasoning strategies that prioritize the use of unused numbers. These constraints ensure that every number appears exactly once in each row, column, and block while also guiding the generation of solutions (that is, the assignment of truth-values to the literals) by leveraging the bank of unused numbers. By incorporating these, the Sudoku solving process becomes systematic and more efficient as it should reduce the likelihood of the process getting stuck and having to backtrack when searching for a solution, or analogously in the case of a neural network getting stuck in local minima.
2.3 Restricted Boltzmann Machines
An RBM [44] is a two-layer neural network with bidirectional (symmetric) connections, which is characterised by a function called the energy of the RBM:
$$
{\it E}(\mathbf{x},\mathbf{h})=-\sum_{i,j}w_{ij}x_{i}h_{j}-\sum_{i}a_{i}x_{i}-\sum_{j}b_{j}h_{j} \tag{1}
$$
where $a_{i}$ and $b_{j}$ are the biases of input unit $x_{i}$ and hidden unit $h_{j}$, respectively, and $w_{ij}$ is the connection weight between $x_{i}$ and $h_{j}$. This RBM represents a joint probability distribution $p(\mathbf{x},\mathbf{h})=\frac{1}{Z}e^{-\frac{1}{\tau}{\it E}(\mathbf{x},\mathbf{h})}$, where $Z=\sum_{\mathbf{x},\mathbf{h}}e^{-\frac{1}{\tau}{\it E}(\mathbf{x},\mathbf{h})}$ is the partition function, parameter $\tau$ is called the temperature of the RBM, $\mathbf{x}=\{x_{i}\}$ is the set of visible units and $\mathbf{h}=\{h_{j}\}$ is the set of hidden units of the RBM.
Training RBMs normally makes use of the Contrastive Divergence learning algorithm [19], whereby each input vector from the training set is propagated to the hidden layer of the network and back to the input a number of times ( $n$ ) using a probabilistic selection rule to decide at each time whether or not a neuron should be activated (with activation value in $\{0,1\}$ ). The weight assigned to the connection between input neuron $x_{i}$ and hidden neuron $h_{j}$ is adjusted according to a simple update rule based on the difference between the value of $x_{i}h_{j}$ at time $1$ and time $n$ . More precisely, $\Delta W_{ij}=\eta((x_{i}h_{j})_{1}-(x_{i}h_{j})_{n})$ , where $\eta$ is a learning rate (a small positive real number).
3 Symbolic Reasoning with Energy-based Neural Networks
The content of this section is based on [52].
Over the years, many neurosymbolic approaches have used a form of knowledge representation based on if-then rules [49, 13, 50, 12, 56, 29, 51], written $B\leftarrow A$ (make $B$ $True$ if $A$ is $True$) to distinguish it from classical implication ($A\rightarrow B$). Under the convention that $1$ represents $True$ and $0$ represents $False$, given $B\leftarrow A$ and input $1$ to neuron $A$, a neurosymbolic system would infer that neuron $B$ should have activation value approximately $1$. Given input $0$ to neuron $A$, it would infer that $B$ should have activation approximately $0$.
Logical Boltzmann Machines (LBM) allow for a richer representation than if-then rules by using full propositional logic. Next, we review LBM's immediate related work, define a mapping from any propositional logic formula to an LBM, and describe how reasoning takes place by sampling and energy minimization. We also evaluate the scalability of reasoning in LBM and of learning by combining knowledge and data, comparing results on benchmarks with a symbolic, a neurosymbolic and a neural network-based approach.
3.1 Related Work
One of the earliest works on the integration of neural networks and symbolic knowledge is known as KBANN (Knowledge-Based Artificial Neural Network) [49], which encodes if-then rules into a hierarchical multilayer perceptron. In another early approach [8], a single-hidden-layer recurrent neural network is proposed to support logic programming rules. An extension of that approach to work with first-order logic programs, called Connectionist Inductive Logic Programming (CILP++) [13], uses the concept of propositionalisation from Inductive Logic Programming (ILP), whereby first-order variables can be treated as propositional atoms in the neural network. Also based on first-order logic programs, [12] propose a differentiable ILP approach that can be implemented by neural networks, and [6] maps stochastic logic programs into a differentiable function also trainable by neural networks. These are all supervised learning approaches.
Early work in neurosymbolic AI has also shown a correspondence between propositional logic and symmetrical neural networks [38], in particular Hopfield networks, which nevertheless did not scale well with the number of variables. Among unsupervised learning approaches, Penalty Logic [37] was the first work to integrate nonmonotonic logic, in the form of weighted if-then rules, into symmetrical neural networks. However, Penalty Logic required the use of higher-order Hopfield networks, which can be difficult to construct (building such higher-order networks requires transforming the energy function into quadratic form by adding hidden variables not present in the original logic formulae) and inefficient to train with the learning algorithm for Boltzmann machines. More recently, several attempts have been made to extract and encode symbolic knowledge into RBMs trained with the more efficient Contrastive Divergence learning algorithm [35, 50]. Such approaches explored the structural similarity between symmetric networks and logical rules with bi-conditional implication, but do not have a proof of soundness. By contrast, and similarly to Penalty Logic, LBM is provably equivalent to the logic formulae encoded in the RBM. Differently from Penalty Logic, LBM does not require the use of higher-order networks.
Alongside the above approaches, which translate symbolic representations into neural networks (normally if-then rules translated into a feedforward or recurrent network), there are hybrid approaches that combine neural networks and symbolic AI systems as communicating modules of a neurosymbolic system. These include DeepProbLog [29] and Logic Tensor Networks (LTN) [41]. DeepProbLog adds a neural network module to probabilistic logic programming such that an atom of the logic program can be represented by a network module. LTN and various approaches derived from it use real-valued logic to constrain the loss function of the neural network given statements in first-order logic. Both DeepProbLog and LTN use backpropagation, differently from the approach adopted here, which uses Contrastive Divergence.
Finally, approaches focused on reasoning include SAT solving using neural networks. In [17, 7], the maximum satisfiability problem is mapped onto Boltzmann machines and higher-order Boltzmann machines, which are used to solve the combinatorial optimization task in parallel, similarly to [38]. In [53], the SAT problem is redefined as a soft (differentiable) task and solved approximately by deep networks with the objective of integrating logical reasoning and learning, as in the case of the approaches discussed earlier. This soft version of the SAT problem is therefore different from the satisfiability problem. A preliminary evaluation of our approach in comparison with symbolic SAT solvers shows that our approach allows the use of up to approximately 100 variables. This is well below the capability of symbolic SAT solvers. A way of improving the performance of neural SAT solvers may well be to consider approximate solutions as done by soft SAT solvers, including neuroSAT [40]. Although still not beating SAT solvers, neuroSAT showed promise at addressing out-of-distribution learning after training on random SAT problems.
In our experiments on learning, the focus is on benchmark neurosymbolic AI tasks with available data and knowledge, obtained from [13]. We therefore compare LBM with a state-of-the-art ILP symbolic system ALEPH [46], standard RBMs as a purely-neural approach closest to LBM, and with CILP++ as a neurosymbolic system. It is worth noting, however, that CILP++ is a neurosymbolic system for supervised learning while LBMs use unsupervised learning, and it is worth investigating approaches for semi-supervised learning and other combinations of such systems. Further comparisons and evaluations on both reasoning and learning are underway.
3.2 Knowledge Representation in RBMs
Before we present LBM, let's contrast the simple $B\leftarrow A$ example used earlier with classical logic. Given $A\rightarrow B$ as knowledge (in classical logic, $A\rightarrow B$ is equivalent to $\neg A\vee B$, i.e. True if $A$ is False regardless of the truth-value of $B$), if neuron $A$ is assigned input value $1$ in the corresponding neurosymbolic network, we expect the network to converge to a stable state where neuron $B$ has value approximately $1$, similarly to the example seen earlier. This is because the truth-value of WFF $A\rightarrow B$ is True given an assignment of truth-values True to its constituent literals $A$ and $B$. Now, $A\rightarrow B$ is False when $A$ is True and $B$ is False. Hence, if neuron $B$ is assigned input $0$, we expect the network to converge to a stable state where $A$ is approximately $0$ ($A\rightarrow B$ is True when $A$ is False and $B$ is False). What if $A$ is assigned input $0$ (or $B$ is assigned input $1$)? In these cases, $A\rightarrow B$ is satisfied whether $B$ is $1$ or $0$ (respectively, whether $A$ is $1$ or $0$). Differently from $B\leftarrow A$, the network will converge to one of the two options that satisfy the formula.
From this point forward, unless stated otherwise, we will treat assignments of truth-values to logical literals and binary input vectors denoting the activation states of neurons indistinguishably.
**Definition 1**
*Let $s_{\varphi}(\mathbf{x})∈\{0,1\}$ denote the truth-value of a WFF $\varphi$ given an assignment of truth-values $\mathbf{x}$ to the literals of $\varphi$ , where truth-value $True$ is mapped to 1 and truth-value $False$ is mapped to 0. Let ${\it E}(\mathbf{x},\mathbf{h})$ denote the energy function of an energy-based neural network $\mathcal{N}$ with visible units $\mathbf{x}$ and hidden units $\mathbf{h}$ . $\varphi$ is said to be equivalent to $\mathcal{N}$ if and only if for any assignment of values to $\mathbf{x}$ there exists a function $\psi$ such that $s_{\varphi}(\mathbf{x})=\psi({\it E}(\mathbf{x},\mathbf{h}))$ .*
Definition 1 is similar to that of Penalty Logic [37], where all assignments of truth-values satisfying a WFF $\varphi$ are mapped to global minima of the energy function of network $\mathcal{N}$ . In our case, by construction, assignments that do not satisfy the WFF will, in addition, be mapped to maxima of the energy function. To see how this is the case, it will be useful to define strict and full DNFs, as follows.
**Definition 2**
*A strict DNF (SDNF) is a DNF with at most one conjunctive clause (a conjunction of literals) that maps to $True$ for any choice of assignment of truth-values $\mathbf{x}$. A full DNF is a DNF where each propositional variable must appear, as a positive or a negative literal, at least once in every conjunctive clause (sometimes called a canonical DNF).*
For example, to turn DNF $A\vee B$ into an equivalent full DNF, one needs to map it to $(A\wedge\neg B)\vee(\neg A\wedge B)\vee(A\wedge B)$ , according to the truth-table for $A\vee B$ . For any given assignment of truth-values to $A$ and $B$ , at most one of the above three conjunctive clauses will be $True$ , by definition of the truth-table. Not every SDNF is also a full DNF though, e.g. $(a\wedge b)\vee\neg b$ is a SDNF that is not a full DNF.
**Lemma 1**
*Let $\mathcal{S}_{T_{j}}$ denote the set of indices of the positive literals $\mathrm{x}_{t}$ in a conjunctive clause $j$. Let $\mathcal{S}_{K_{j}}$ denote the set of indices of the negative literals $\mathrm{x}_{k}$ in $j$. Any SDNF $\varphi\equiv\bigvee_{j}(\bigwedge_{t}\mathrm{x}_{t}\wedge\bigwedge_{k}\neg\mathrm{x}_{k})$ can be mapped onto an energy function:
$$
{\it E}(\mathbf{x})=-\sum_{j}\Big(\prod_{t\in\mathcal{S}_{T_{j}}}x_{t}\prod_{k\in\mathcal{S}_{K_{j}}}(1-x_{k})\Big).
$$*
Proof: Each conjunctive clause $\bigwedge_{t}\mathrm{x}_{t}\wedge\bigwedge_{k}\neg\mathrm{x}_{k}$ in $\varphi$ corresponds to the product $\prod_{t}x_{t}\prod_{k}(1-x_{k})$, which maps to $1$ if and only if $x_{t}$ is $True$ ($x_{t}=1$) and $x_{k}$ is $False$ ($x_{k}=0$) for all $t\in\mathcal{S}_{T_{j}}$ and $k\in\mathcal{S}_{K_{j}}$. Since $\varphi$ is a SDNF, $\varphi$ is $True$ if and only if exactly one conjunctive clause is $True$, that is, $\sum_{j}(\prod_{t\in\mathcal{S}_{T_{j}}}x_{t}\prod_{k\in\mathcal{S}_{K_{j}}}(1-x_{k}))=1$. Hence, the neural network with energy function ${\it E}$ is such that $s_{\varphi}(\mathbf{x})=-{\it E}(\mathbf{x})$. ∎
**Theorem 1**
*Any SDNF $\varphi\equiv\bigvee_{j}(\bigwedge_{t}\mathrm{x}_{t}\wedge\bigwedge_{k}\neg\mathrm{x}_{k})$ can be mapped onto an RBM with energy function:
$$
{\it E}(\mathbf{x},\mathbf{h})=-\sum_{j}h_{j}\Big(\sum_{t\in\mathcal{S}_{T_{j}}}x_{t}-\sum_{k\in\mathcal{S}_{K_{j}}}x_{k}-|\mathcal{S}_{T_{j}}|+\epsilon\Big), \tag{2}
$$
such that $s_{\varphi}(\mathbf{x})=-\frac{1}{\epsilon}\min_{\mathbf{h}}{\it E}(\mathbf{x},\mathbf{h})$, where $0<\epsilon<1$ and $|\mathcal{S}_{T_{j}}|$ is the number of positive literals in conjunctive clause $j$ of $\varphi$.*
Proof: Lemma 1 states that any SDNF $\varphi$ can be mapped onto the energy function ${\it E}=-\sum_{j}(\prod_{t\in\mathcal{S}_{T_{j}}}x_{t}\prod_{k\in\mathcal{S}_{K_{j}}}(1-x_{k}))$. For each expression $\tilde{e}_{j}(\mathbf{x})=-\prod_{t\in\mathcal{S}_{T_{j}}}x_{t}\prod_{k\in\mathcal{S}_{K_{j}}}(1-x_{k})$, we define an energy expression associated with hidden unit $h_{j}$ as $e_{j}(\mathbf{x},h_{j})=-h_{j}(\sum_{t\in\mathcal{S}_{T_{j}}}x_{t}-\sum_{k\in\mathcal{S}_{K_{j}}}x_{k}-|\mathcal{S}_{T_{j}}|+\epsilon)$. The term $e_{j}(\mathbf{x},h_{j})$ is minimised with value $-\epsilon$ when $h_{j}=1$, written $\min_{h_{j}}(e_{j}(\mathbf{x},h_{j}))=-\epsilon$. This is because $-(\sum_{t\in\mathcal{S}_{T_{j}}}x_{t}-\sum_{k\in\mathcal{S}_{K_{j}}}x_{k}-|\mathcal{S}_{T_{j}}|+\epsilon)=-\epsilon$ if and only if $x_{t}=1$ and $x_{k}=0$ for all $t\in\mathcal{S}_{T_{j}}$ and $k\in\mathcal{S}_{K_{j}}$. Otherwise, $-(\sum_{t\in\mathcal{S}_{T_{j}}}x_{t}-\sum_{k\in\mathcal{S}_{K_{j}}}x_{k}-|\mathcal{S}_{T_{j}}|+\epsilon)>0$ and $\min_{h_{j}}(e_{j}(\mathbf{x},h_{j}))=0$ with $h_{j}=0$. By repeating this process for each $\tilde{e}_{j}(\mathbf{x})$, we obtain the energy function ${\it E}(\mathbf{x},\mathbf{h})=-\sum_{j}h_{j}(\sum_{t\in\mathcal{S}_{T_{j}}}x_{t}-\sum_{k\in\mathcal{S}_{K_{j}}}x_{k}-|\mathcal{S}_{T_{j}}|+\epsilon)$, which is such that $s_{\varphi}(\mathbf{x})=-\frac{1}{\epsilon}\min_{\mathbf{h}}{\it E}(\mathbf{x},\mathbf{h})$. ∎
It is well-known that any WFF $\varphi$ can be converted into DNF. If such a DNF is not a SDNF then, by definition, there is an assignment satisfying $\varphi$ for which more than one conjunctive clause maps to $True$. This group of conjunctive clauses can always be converted into a full DNF according to its truth-table. By definition, any such full DNF is also a SDNF. Therefore, any WFF can be converted into SDNF and, from Theorem 1, it follows that any WFF can be represented by the energy function of an RBM. The conversion of WFFs into full DNF can be computationally expensive. Sometimes, the logic is provided already in canonical DNF form or in Conjunctive Normal Form (CNF), i.e. as a conjunction of disjunctions. We will see later that any WFF expressed in CNF can be converted into an RBM's energy function efficiently, without the need to convert into SDNF first. This covers the most common forms of propositional knowledge representation. Next, we describe a method for converting logical formulae into SDNF, which we use in the empirical evaluations that follow. Consider a clause $\gamma$ such that:
$$
\gamma\equiv\bigvee_{t\in\mathcal{S}_{T}}\neg\mathrm{x}_{t}\vee\bigvee_{k\in\mathcal{S}_{K}}\mathrm{x}_{k} \tag{3}
$$
where $\mathcal{S}_{T}$ now denotes the set of indices of the negative literals, and $\mathcal{S}_{K}$ denotes the set of indices of the positive literals in the clause (dually to the conjunctive clause case). Clause $\gamma$ can be rearranged into $\gamma\equiv\gamma^{\prime}\vee\mathrm{x}^{\prime}$ , where $\gamma^{\prime}$ is obtained by removing $\mathrm{x}^{\prime}$ from $\gamma$ ( $\mathrm{x}^{\prime}$ can be either $\neg\mathrm{x}_{t}$ or $\mathrm{x}_{k}$ for any $t∈\mathcal{S}_{T}$ and $k∈\mathcal{S}_{K}$ ). We have:
$$
\gamma\equiv(\neg\gamma^{\prime}\wedge\mathrm{x}^{\prime})\vee\gamma^{\prime} \tag{4}
$$
because $(\neg\gamma^{\prime}\wedge\mathrm{x}^{\prime})\vee\gamma^{\prime}\equiv(\gamma^{\prime}\vee\neg\gamma^{\prime})\wedge(\gamma^{\prime}\vee\mathrm{x}^{\prime})\equiv True\wedge(\gamma^{\prime}\vee\mathrm{x}^{\prime})$. By De Morgan's law ($\neg(\mathrm{a}\vee\mathrm{b})\equiv\neg\mathrm{a}\wedge\neg\mathrm{b}$), we can always convert $\neg\gamma^{\prime}$ (and therefore $\neg\gamma^{\prime}\wedge\mathrm{x}^{\prime}$) into a conjunctive clause.
By applying (4) repeatedly, we eliminate one variable from the clause at a time, moving it into a new conjunctive clause. Given an assignment of truth-values, either the clause $\gamma^{\prime}$ will be True or the conjunctive clause ($\neg\gamma^{\prime}\wedge\mathrm{x}^{\prime}$) will be True, e.g. $a\vee b\equiv a\vee(\neg a\wedge b)$. Therefore, the SDNF for clause $\gamma$ in Eq. (3) is:
$$
\bigvee_{p\in\mathcal{S}_{T}\cup\mathcal{S}_{K}}\Big(\bigwedge_{t\in\mathcal{S}_{T},\,t<p}\mathrm{x}_{t}\wedge\bigwedge_{k\in\mathcal{S}_{K},\,k<p}\neg\mathrm{x}_{k}\wedge\mathrm{x}^{\prime}_{p}\Big) \tag{5}
$$
where the literals are eliminated in a fixed order (here, the order of their indices), so that $t<p$ and $k<p$ range over the literals eliminated before $\mathrm{x}^{\prime}_{p}$, each conjunctive clause negating all previously eliminated literals. If $p\in\mathcal{S}_{T}$ then $\mathrm{x}^{\prime}_{p}\equiv\neg\mathrm{x}_{p}$; otherwise, $\mathrm{x}^{\prime}_{p}\equiv\mathrm{x}_{p}$. As an example of the translation into SDNF, consider the translation of an if-then statement (logical implication) below.
**Example 2**
*Translation of if-then rules into SDNF. Consider the formula $\gamma\equiv(x_{1}\wedge x_{2}\wedge\neg x_{3})\rightarrow y$. Using our notation:
$$
\gamma\equiv\Big(\bigwedge_{t\in\{1,2\}}\mathrm{x}_{t}\wedge\bigwedge_{k\in\{3\}}\neg\mathrm{x}_{k}\Big)\rightarrow\mathrm{y} \tag{6}
$$
Converting to DNF:
$$
\Big(\mathrm{y}\wedge\bigwedge_{t\in\{1,2\}}\mathrm{x}_{t}\wedge\bigwedge_{k\in\{3\}}\neg\mathrm{x}_{k}\Big)\vee\bigvee_{t\in\{1,2\}}\neg\mathrm{x}_{t}\vee\bigvee_{k\in\{3\}}\mathrm{x}_{k} \tag{7}
$$
Applying the variable elimination method to the clause $\neg\mathrm{x}_{1}\vee\neg\mathrm{x}_{2}\vee\mathrm{x}_{3}$, we obtain the SDNF of $\gamma$:
$$
(\mathrm{y}\wedge\mathrm{x}_{1}\wedge\mathrm{x}_{2}\wedge\neg\mathrm{x}_{3})\vee(\neg\mathrm{x}_{1})\vee(\mathrm{x}_{1}\wedge\neg\mathrm{x}_{2})\vee(\mathrm{x}_{1}\wedge\mathrm{x}_{2}\wedge\mathrm{x}_{3}) \tag{8}
$$*
3.3 Reasoning in RBMs
We have seen how propositional logic formulae can be mapped onto the energy functions of RBMs. In this section, we discuss the deployment of such RBMs for logical reasoning.
3.3.1 Reasoning as Sampling
There is a direct relationship between inference in RBMs and logical satisfiability, as follows.
**Lemma 2**
*Let $\mathcal{N}$ be an RBM with energy function ${\it E}$. Let $\varphi$ be a WFF such that $s_{\varphi}(\mathbf{x})=-{\it E}(\mathbf{x})$. Let $\mathcal{A}$ be the set of indices of variables in $\varphi$ that have been assigned either True or False; we use $\mathbf{x}_{\mathcal{A}}$ to denote the set $\{x_{\alpha}\mid\alpha\in\mathcal{A}\}$. Let $\mathcal{B}$ be the set of indices of variables that have not been assigned a truth-value in $\varphi$; we use $\mathbf{x}_{\mathcal{B}}$ to denote $\{x_{\beta}\mid\beta\in\mathcal{B}\}$. Performing Gibbs sampling on $\mathcal{N}$ given $\mathbf{x}_{\mathcal{A}}$ is equivalent to searching for an assignment of truth-values for $\mathbf{x}_{\mathcal{B}}$ that satisfies $\varphi$.*
Proof: Theorem 1 has shown that the assignments of truth-values to $\varphi$ are partially ordered according to the RBM's energy function, such that the models of $\varphi$ (mapping $\varphi$ to 1) correspond to minima of the energy function. We say that the satisfiability of $\varphi$ is inversely proportional to the RBM's rank function: when the satisfiability of $\varphi$ is maximum ($s_{\varphi}(\mathbf{x})=1$), $-{\it E}(\mathbf{x})$ produces the highest rank. A value of $\mathbf{x}_{\mathcal{B}}$ that minimises the energy function also maximises satisfiability, $s_{\varphi}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}})\propto-\min_{\mathbf{h}}{\it E}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}},\mathbf{h})$, because:
$$
\mathbf{x}_{\mathcal{B}}^{*}=\operatorname*{arg\,min}_{\mathbf{x}_{\mathcal{B}},\mathbf{h}}{\it E}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}},\mathbf{h})=\operatorname*{arg\,max}_{\mathbf{x}_{\mathcal{B}}}\big(s_{\varphi}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}})\big) \tag{9}
$$
We can consider an iterative process to search for truth-values $\mathbf{x}_{\mathcal{B}}^{*}$ by minimising an RBM’s energy function. This can be done using gradient descent or contrastive divergence with Gibbs sampling. The goal is to update the values of $\mathbf{h}$ and then $\mathbf{x}_{\mathcal{B}}$ in parallel until convergence to minimise ${\it E}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}},\mathbf{h})$ while keeping the other variables ( $\mathbf{x}_{\mathcal{A}}$ ) fixed. The gradients amount to:
$$
\frac{\partial-{\it E}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}},\mathbf{h})}{\partial h_{j}}=\sum_{i\in\mathcal{A}\cup\mathcal{B}}x_{i}w_{ij}+\theta_{j},\qquad\frac{\partial-{\it E}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}},\mathbf{h})}{\partial x_{\beta}}=\sum_{j}h_{j}w_{\beta j}+\theta_{\beta} \tag{10}
$$
In the case of Gibbs sampling, given the assigned variables $\mathbf{x}_{\mathcal{A}}$ , the process starts with a random initialization of $\mathbf{x}_{\mathcal{B}}$ and proceeds to infer values for the hidden units $h_{j}$ and then the unassigned variables $x_{\beta}$ in the visible layer of the RBM, using the conditional distributions $h_{j}\sim p(h_{j}|\mathbf{x})$ and $x_{\beta}\sim p(x_{\beta}|\mathbf{h})$ , respectively, where $\mathbf{x}=\{\mathbf{x}_{\mathcal{A}},\mathbf{x}_{\mathcal{B}}\}$ and:
$$
p(h_{j}|\mathbf{x})=\frac{1}{1+e^{-\frac{1}{\tau}\big(\sum_{i}x_{i}w_{ij}+\theta_{j}\big)}}=\frac{1}{1+e^{-\frac{1}{\tau}\frac{\partial-{\it E}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}},\mathbf{h})}{\partial h_{j}}}},\qquad p(x_{\beta}|\mathbf{h})=\frac{1}{1+e^{-\frac{1}{\tau}\big(\sum_{j}h_{j}w_{\beta j}+\theta_{\beta}\big)}}=\frac{1}{1+e^{-\frac{1}{\tau}\frac{\partial-{\it E}(\mathbf{x}_{\mathcal{B}},\mathbf{x}_{\mathcal{A}},\mathbf{h})}{\partial x_{\beta}}}} \tag{11}
$$
It can be seen from Eq.(11) that the distributions are monotonic functions of the negative energy’s gradient over $\mathbf{h}$ and $\mathbf{x}_{\mathcal{B}}$ . Therefore, performing Gibbs sampling on them can be seen as moving towards a local minimum that is equivalent to an assignment of truth-values that satisfies $\varphi$ . Each step of Gibbs sampling, calculating $\mathbf{h}$ and then $\mathbf{x}$ to reduce the energy, should intuitively generate an assignment of truth-values that gets closer to satisfying the formula $\varphi$ . ∎
3.3.2 Reasoning as Lowering Free Energy
When the number of unassigned variables is not large, it should be possible to calculate the above probabilities directly. In this case, one can infer the assignments of $\mathbf{x}_{\mathcal{B}}$ using the conditional distribution:
$$
P(\mathbf{x}_{\mathcal{B}}|\mathbf{x}_{\mathcal{A}})=\frac{e^{-\mathcal{F}_{\mathcal{B}}(\mathbf{x}_{\mathcal{A}},\mathbf{x}_{\mathcal{B}})}}{\sum_{\mathbf{x}^{\prime}_{\mathcal{B}}}e^{-\mathcal{F}_{\mathcal{B}}(\mathbf{x}_{\mathcal{A}},\mathbf{x}^{\prime}_{\mathcal{B}})}} \tag{12}
$$
where $\mathcal{F}_{\mathcal{B}}=\sum_{j}\big(-\log(1+e^{c(\sum_{i\in\mathcal{A}\cup\mathcal{B}}w_{ij}x_{i}+\theta_{j})})\big)$ is known as the free energy, $\mathbf{x}^{\prime}_{\mathcal{B}}$ ranges over all combinations of truth-value assignments to the literals in $\mathbf{x}_{\mathcal{B}}$, and $c$ is a non-negative real number that we call a confidence value. Each free energy term $-\log(1+e^{c(\sum_{i\in\mathcal{A}\cup\mathcal{B}}w_{ij}x_{i}+\theta_{j})})$ is a negative softplus function scaled by $c$, as shown in Figure 2: it returns a negative output for a positive input and a close-to-zero output for a negative input.
Figure 2: Free energy term $-\log(1+e^{cx})$ for different confidence values $c$ .
Each free energy term is associated with a conjunctive clause in the SDNF through the weighted sum $\sum_{i\in\mathcal{A}\cup\mathcal{B}}w_{ij}x_{i}+\theta_{j}$ . Therefore, if a truth-value assignment of $\mathbf{x}_{\mathcal{B}}$ does not satisfy the formula $\varphi$ , all free energy terms will be close to zero. When $\varphi$ is satisfied, one free energy term will be $-\log(1+e^{c\epsilon})$ , for a choice of $0<\epsilon<1$ from Theorem 1. Thus, the more likely a truth assignment is to satisfy the formula, the lower the free energy. Formally:
$$
s_{\varphi}(\mathbf{x})=-\frac{1}{c\epsilon}\min_{\mathbf{h}}E(\mathbf{x},\mathbf{h})=\lim_{c\rightarrow\infty}-\frac{1}{c\epsilon}\mathcal{F}(\mathbf{x}) \tag{13}
$$
Figure 3 shows the average values of the energy function and free energy for CNFs with 55 clauses as the number of satisfied clauses increases. The CNF is satisfied if and only if all 55 clauses are satisfied. As can be seen, the relationships are linear. Minimum energy and free energy values converge with an increasing value of $c$ .
(a) (c=1)
(b) (c=5)
</details>
(c) (c=10)
Figure 3: Linear correlation between satisfiability of a CNF and minimization of the free energy function for various confidence values $c$ . Source: [52].
3.4 Logical Boltzmann Machines
We are now in a position to present a translation algorithm to build an RBM from logical formulae. The energy function of the RBM is derived from Theorem 1, given a formula in SDNF. The weights and biases of the RBM are obtained from the energy function $E(\mathbf{x},\mathbf{h})=-(\sum_{i}\theta_{i}x_{i}+\sum_{j}\theta_{j}h_{j}+\sum_{ij}x_{i}W_{ij}h_{j})$ , where $\theta_{i}$ are the biases of the visible units, $\theta_{j}$ are the biases of the hidden units, and $W_{ij}$ is the symmetric weight between visible unit $i$ and hidden unit $j$ . For each conjunctive clause of the form $\bigwedge_{t\in\mathcal{S}_{T}}\mathrm{x}_{t}\wedge\bigwedge_{k\in\mathcal{S}_{K}}\neg\mathrm{x}_{k}$ in the formula, we create an energy term $-h_{j}(\sum_{t\in\mathcal{S}_{T}}x_{t}-\sum_{k\in\mathcal{S}_{K}}x_{k}-|\mathcal{S}_{T}|+\epsilon)$ . The disjunctions in the SDNF are implemented in the RBM simply by creating one hidden neuron $h_{j}$ per disjunct.
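The translation just described can be sketched in a few lines of Python (ours, not the authors' implementation; function names such as `sdnf_to_rbm` are our own). Each clause is given as the pair of index sets $(\mathcal{S}_{T},\mathcal{S}_{K})$ of its positive and negative literals:

```python
# Sketch: build RBM weights from an SDNF per Theorem 1. Each conjunctive
# clause maps to one hidden unit; c is the confidence value and epsilon the
# margin from Theorem 1. All names are ours.

import numpy as np

def sdnf_to_rbm(clauses, num_vars, c=5.0, epsilon=0.5):
    """clauses: list of (positive_indices, negative_indices) per disjunct."""
    W = np.zeros((num_vars, len(clauses)))   # visible-to-hidden weights
    b = np.zeros(len(clauses))               # hidden biases
    for j, (pos, neg) in enumerate(clauses):
        W[list(pos), j] = c                  # positive literals: +c
        W[list(neg), j] = -c                 # negative literals: -c
        b[j] = c * (epsilon - len(pos))      # bias: c(eps - |S_T|)
    return W, b

def free_energy(x, W, b):
    """F(x) = -sum_j log(1 + exp(x.W_j + b_j)), with zero visible biases."""
    return -np.sum(np.log1p(np.exp(x @ W + b)))

# XOR example from the text: (x xor y) <-> z, variables indexed (x, y, z).
clauses = [((), (0, 1, 2)), ((1, 2), (0,)), ((0, 2), (1,)), ((0, 1), (2,))]
W, b = sdnf_to_rbm(clauses, num_vars=3)
sat = free_energy(np.array([0, 1, 1]), W, b)    # satisfying assignment
unsat = free_energy(np.array([1, 1, 1]), W, b)  # unsatisfying assignment
print(sat < unsat)  # satisfying assignments reach lower free energy
```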
Learning in LBM combines learning from data $\mathcal{D}$ with the knowledge provided by logical formulae. Learning with data and knowledge is expected to improve accuracy or training time. If the logical formula is empty, the weights and biases are initialized randomly and one has a standard RBM. Learning in this case is an approximation of parameters $\Theta$ over a set of preferred models $\mathcal{D}=\{\mathbf{x}^{(n)}|n=1,..,N\}$ of an unknown formula $\varphi^{*}$ . Consider the case where the data set $\mathcal{D}$ is complete, i.e. it contains all preferred models of an unknown $\varphi^{*}$ . We will show that learning an RBM to represent the SDNF of $\varphi^{*}$ is possible. Consider the gradient of the negative log-likelihood ( $-\ell$ ) of an RBM:
$$
\frac{\partial(-\ell)}{\partial\Theta}=\mathbf{E}\left[\frac{\partial E(\mathbf{x},\mathbf{h})}{\partial\Theta}\right]_{\mathbf{h}|\mathbf{x}\in\mathcal{D}}-\mathbf{E}\left[\frac{\partial E(\mathbf{x},\mathbf{h})}{\partial\Theta}\right]_{\mathbf{h},\mathbf{x}} \tag{14}
$$
where $\mathbf{E}$ denotes the expected value. This function is not convex, so the RBM may not always converge to $\varphi^{*}$ . Consider now the case where $\mathcal{D}$ is incomplete. At a local minimum, we have $\frac{\partial(-\ell)}{\partial w_{ij}}=-\frac{1}{N}\sum_{\mathbf{x}\in\mathcal{D}}x_{i}p(h_{j}|\mathbf{x})+\sum_{\mathbf{x}}x_{i}p(h_{j}|\mathbf{x})p(\mathbf{x})\approx 0$ . A solution to this is $p(h_{j}|\mathbf{x})p(\mathbf{x})\approx\frac{p(h_{j}|\mathbf{x})}{N}$ if $\mathbf{x}\in\mathcal{D}$ , and $0$ otherwise. This can be achieved by having either $p(h_{j}|\mathbf{x})\approx 0$ or $p(\mathbf{x})\approx 0$ for all $\mathbf{x}\notin\mathcal{D}$ , and $p(\mathbf{x})\approx\frac{1}{N}$ for $\mathbf{x}\in\mathcal{D}$ . Since $p(\mathbf{x})=\frac{1}{Z}\sum_{\mathbf{h}}\exp(-E(\mathbf{x},\mathbf{h}))$ , for a training example (preferred model) $\mathbf{x}$ we have $\sum_{\mathbf{x}}\sum_{\mathbf{h}}\exp(-E(\mathbf{x},\mathbf{h}))\approx N\sum_{\mathbf{h}}\exp(-E(\mathbf{x},\mathbf{h}))$ . Hence, a solution is obtained if $\sum_{\mathbf{h}}\exp(-E(\mathbf{x},\mathbf{h}))$ is equally large for all $\mathbf{x}\in\mathcal{D}$ , and much smaller otherwise. We can further factorize this sum to obtain $\sum_{\mathbf{h}}\exp(-E(\mathbf{x},\mathbf{h}))\propto\prod_{j}(1+\exp(\sum_{i}w_{ij}x_{i}+\theta_{j}))$ . Now, suppose that an LBM with parameters $\Theta=(W^{*},\theta^{*})$ represents an unknown formula $\varphi^{*}$ . Assuming that the LBM has large and equal confidence values $c_{\infty}$ in its free-energy function (as discussed in Section 3.3.2), this LBM allows only one hidden unit to be activated for a satisfying assignment $\mathbf{x}$ ; for an unsatisfying assignment, all hidden units are deactivated. Therefore, one can choose $c_{\infty}$ large enough to guarantee that a solution is found, because $\prod_{j}(1+\exp(\sum_{i}w^{*}_{ij}x_{i}+\theta^{*}_{j}))\approx\exp(c_{\infty}\epsilon)$ if $\mathbf{x}\in\mathcal{D}$ .
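As a sanity check of Eq. (14), both expectations can be computed by exact enumeration for a tiny RBM. The sketch below is our own (a single hidden unit, zero visible biases, and all names ours), not the authors' implementation:

```python
# Exact gradient of the negative log-likelihood w.r.t. the weights of a
# tiny RBM with one hidden unit, by enumerating all visible configurations.

import itertools
import numpy as np

def p_h_given_x(x, w, b):
    """p(h=1 | x) for a single hidden unit: sigmoid(w.x + b)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def nll_gradient_w(D, w, b):
    """Gradient of -l w.r.t. w: model expectation minus data expectation
    of x_i * p(h=1|x), matching the local-minimum condition in the text."""
    data_term = -np.mean([x * p_h_given_x(x, w, b) for x in D], axis=0)
    # With zero visible biases, sum_h exp(-E(x,h)) = 1 + exp(w.x + b).
    xs = [np.array(v, dtype=float)
          for v in itertools.product([0, 1], repeat=len(w))]
    unnorm = np.array([1.0 + np.exp(w @ x + b) for x in xs])
    px = unnorm / unnorm.sum()               # p(x) by exact enumeration
    model_term = sum(p * x * p_h_given_x(x, w, b) for p, x in zip(px, xs))
    return data_term + model_term

D = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]  # toy "preferred models"
grad = nll_gradient_w(D, np.array([0.5, -0.5]), 0.0)
print(grad)
```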
**Example 3**
We use the symbol $\oplus$ to denote exclusive-or, that is, $\mathrm{x}\oplus\mathrm{y}\equiv((\mathrm{x}\wedge\neg\mathrm{y})\vee(\neg\mathrm{x}\wedge\mathrm{y}))$ . The formula $\varphi\equiv(\mathrm{x}\oplus\mathrm{y})\leftrightarrow\mathrm{z}$ can be converted into the SDNF:
$$
\varphi\equiv(\neg\mathrm{x}\wedge\neg\mathrm{y}\wedge\neg\mathrm{z})\vee(\neg\mathrm{x}\wedge\mathrm{y}\wedge\mathrm{z})\vee(\mathrm{x}\wedge\neg\mathrm{y}\wedge\mathrm{z})\vee(\mathrm{x}\wedge\mathrm{y}\wedge\neg\mathrm{z})
$$
For each conjunctive clause in $\varphi$ , a corresponding term is added to the energy function. An RBM for the XOR formula $\varphi$ can be built as shown in Figure 4 for a choice of $\epsilon=0.5$ and zero biases for the visible units ( $\theta_{i}=0$ ). The energy function of this RBM is:
$$
E=-h_{1}(-x-y-z+0.5)-h_{2}(x+y-z-1.5)-h_{3}(x-y+z-1.5)-h_{4}(-x+y+z-1.5)
$$
Figure 4: An RBM equivalent to the XOR formula $(\mathrm{x}\oplus\mathrm{y})\leftrightarrow\mathrm{z}$ .

Table 1 shows the equivalence between $\min_{\mathbf{h}}E(\mathbf{x},\mathbf{h})$ and the truth-table for the XOR formula. The above example illustrates, in a simple case, the value of using SDNF, in that it produces a direct translation into an RBM, as also illustrated below.
Table 1: Energy function and truth-table for the formula $((\mathrm{x}\wedge\neg\mathrm{y})\vee(\neg\mathrm{x}\wedge\mathrm{y}))\leftrightarrow\mathrm{z}$ .

| $\mathrm{x}$ | $\mathrm{y}$ | $\mathrm{z}$ | $\varphi$ | $\min_{\mathbf{h}}E(\mathbf{x},\mathbf{h})$ |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | $True$ | $-0.5$ |
| 0 | 0 | 1 | $False$ | $0$ |
| 0 | 1 | 0 | $False$ | $0$ |
| 0 | 1 | 1 | $True$ | $-0.5$ |
| 1 | 0 | 0 | $False$ | $0$ |
| 1 | 0 | 1 | $True$ | $-0.5$ |
| 1 | 1 | 0 | $True$ | $-0.5$ |
| 1 | 1 | 1 | $False$ | $0$ |
**Example 4**
We have seen that the SDNF of $(\mathrm{x}_{1}\wedge\mathrm{x}_{2}\wedge\neg\mathrm{x}_{3})\rightarrow\mathrm{y}$ is $(\mathrm{y}\wedge\mathrm{x}_{1}\wedge\mathrm{x}_{2}\wedge\neg\mathrm{x}_{3})\vee(\mathrm{x}_{1}\wedge\mathrm{x}_{2}\wedge\mathrm{x}_{3})\vee(\mathrm{x}_{1}\wedge\neg\mathrm{x}_{2})\vee\neg\mathrm{x}_{1}$ . We need an RBM with only 3 hidden units to represent this SDNF, because the disjunct $\neg\mathrm{x}_{1}$ , like any term of the energy function with a single variable, is implemented in the RBM via the bias of $\mathrm{x}_{1}$ . For a positive literal $\mathrm{x}$ , the energy term $-h(x-1+\epsilon)$ can be replaced by $-x\epsilon$ ; for a negative literal $\neg\mathrm{x}$ , the energy term $-h(-x+\epsilon)$ can be replaced by $-(1-x)\epsilon$ . This is possible because, in order to minimize the energy, $h=1$ if and only if $x=0$ (in the case of $\neg\mathrm{x}$ ), thus $h=1-x$ . Therefore, $-h(-x+\epsilon)=-(1-x)(-x+\epsilon)=-(-x+\epsilon+x^{2}-x\epsilon)=-(1-x)\epsilon$ , since $x=x^{2}$ for $x\in\{0,1\}$ . The energy function with $\epsilon=0.5$ is $E=e_{\mathrm{y}}+\sum_{i=1}^{3}e_{i}$ , where each term corresponds to one disjunct of the SDNF. The number of hidden units grows linearly with the number of disjuncts in the formula. The computationally expensive part is the translation from WFF to SDNF, in case it is needed.
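The bias simplification above is easy to verify by brute force over $h,x\in\{0,1\}$ (a quick check of ours, not from the source):

```python
# Verify: for a negative literal, min over h in {0,1} of -h(-x + eps)
# equals -(1-x)*eps; for a positive literal, min of -h(x - 1 + eps)
# equals -x*eps, for binary x.

eps = 0.5
for x in (0, 1):
    # Negative literal NOT x.
    assert min(-h * (-x + eps) for h in (0, 1)) == -(1 - x) * eps
    # Positive literal x.
    assert min(-h * (x - 1 + eps) for h in (0, 1)) == -x * eps
print("single-literal bias simplification verified")
```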
3.5 Experimental Results
3.5.1 Reasoning
We deployed LBM to search for satisfying truth assignments of variables in large formulae. Let us define a class of formulae as:
$$
\varphi\equiv\bigwedge_{i=1}^{M}\mathrm{x}_{i}\wedge(\bigvee_{j=M+1}^{M+N}\mathrm{x}_{j}) \tag{15}
$$
A formula in this class has $2^{M+N}$ possible truth-value assignments of the variables, of which $2^{N}-1$ map the formula to $true$ (call this the satisfying set). Converting to SDNF as done before, but now for the class of formulae, we obtain:
$$
\varphi\equiv\bigvee_{j=M+1}^{M+N}(\bigwedge_{i=1}^{M}\mathrm{x}_{i}\wedge\bigwedge_{j^{\prime}=j+1}^{M+N}\neg\mathrm{x}_{j^{\prime}}\wedge\mathrm{x}_{j}) \tag{16}
$$
Figure 5: Percentage coverage as a measure of completeness as sampling progresses in the RBM. $100\%$ coverage is achieved for the class of formulae with different values of M and N, averaged over 100 runs. The number of samples needed to achieve $100\%$ coverage is much lower than the number of possible assignments ( $2^{M+N}$ ). For example, when M=20, N=10, all satisfying assignments are found after approximately $7.5\times 10^{6}$ samples are provided as input to the RBM, whereas the number of possible assignments is approximately 1 billion, a ratio of sample size to the search space of $0.75\%$ . The ratio for M=30, N=10 is even lower at $0.37\%$ . Source: [52].
Applying Theorem 1 to construct an RBM from $\varphi$ , we use Gibbs sampling to find the models of a formula given random initial truth assignments to all the variables. A sample is accepted as a satisfying assignment (a model) if its free energy is lower than or equal to $-\log(1+\exp(c\epsilon))$ , with $c=5$ and $\epsilon=0.5$ . We evaluate the coverage and accuracy of the accepted samples. Coverage is measured as the proportion of the satisfying set that is accepted over time; in this experiment, it is the number of satisfying assignments in the set of accepted samples divided by $2^{N}-1$ , and it can be seen as a measure of completeness. Accuracy is measured as the percentage of samples accepted by the RBM that do satisfy the logical formula.
We test different values of $M∈\{20,25,30\}$ and $N∈\{3,4,5,6,7,8,9,10\}$ . LBM achieves $100\%$ accuracy in all cases, meaning that all accepted samples do satisfy the formula, as expected given Theorem 1. Figure 5 shows the coverage as Gibbs sampling progresses (measured each time a batch of random samples is collected). Four cases are considered: M=20 and N=5, M=20 and N=10, M=25 and N=10, and M=30 and N=10.
In each case, we run the sampling process 100 times and report the average results with standard deviations. The satisfying set is small, and therefore the number of samples needed to achieve $100\%$ coverage is much lower than the number of possible assignments ( $2^{M+N}$ ). For example, when M=20, N=10, all satisfying assignments are found after 7.5 million samples are collected, whereas the number of possible assignments is approximately 1 billion, producing a ratio of sample size to search-space size of just $0.75\%$ . The ratio for M=30, N=10 is even lower, at $0.37\%$ of the roughly $10^{12}$ possible assignments.
Figure 6: Time taken by LBM to collect all satisfying assignments compared with the size of the search space (i.e. the number of possible assignments, up to 1 billion (1e9)) as N increases from 3 to 10 with fixed M=20. LBM only needs around 10 seconds for $N\le 8$ , $\sim 25$ seconds for $N=9$ , and $\sim 68$ seconds for $N=10$ . The curve grows exponentially, similarly to the search space size, but at a much lower scale. Source: [52].
Figure 6 shows the time taken to collect all satisfying assignments for different N in $\{3,4,5,6,7,8,9,10\}$ with $M=20$ . LBM needed around 10 seconds for $N\le 8$ , $25$ seconds for $N=9$ , and $68$ seconds for $N=10$ . As expected, the curve grows exponentially, similarly to the search-space curve, but at a much smaller scale.
3.5.2 Learning from Data and Knowledge
We now evaluate LBM at learning the same Inductive Logic Programming (ILP) benchmark tasks used by the neurosymbolic system CILP++ [13], in comparison with the state-of-the-art ILP system Aleph [46]. As mentioned earlier, Aleph, CILP++ and a fully-connected standard RBM were chosen as the natural symbolic, neurosymbolic and neural systems, respectively, for comparison. An initial LBM is constructed from the clauses provided as background knowledge. This process creates one hidden neuron per clause. Further hidden neurons with random weights are added for training and validation from data. Satisfying assignments of each clause can be selected as training or validation examples: for instance, given clause $\mathrm{x}_{1}\wedge\neg\mathrm{x}_{2}→ y$ , the assignment $x_{1}=True,x_{2}=False,y=True$ is converted into the vector $[x_{1},x_{2},y]=(1,0,1)$ for training or validation. Both the LBM and the standard RBM are trained discriminatively, using the conditional distribution $p(y|\mathbf{x})$ for inference as in [26]. In both cases, all network weights are free parameters for learning, with some weights having been initialized by the background knowledge in the case of the LBM; that is, the background knowledge can be revised during learning from data.
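The conversion of a clause into training vectors can be sketched as follows: enumerate the truth assignments, keep those that satisfy the clause, and emit each as a binary vector. This is a toy illustration of the step described above; the helper names are ours:

```python
from itertools import product

def clause(x1, x2, y):
    """The example clause from the text: x1 AND (NOT x2) -> y."""
    return (not (x1 and not x2)) or bool(y)

# Every satisfying assignment becomes a training/validation vector [x1, x2, y].
vectors = [list(v) for v in product([0, 1], repeat=3) if clause(*v)]
print(vectors)  # includes [1, 0, 1]; excludes the single falsifying assignment [1, 0, 0]
```

Only the assignment $x_{1}=1, x_{2}=0, y=0$ falsifies the implication, so seven of the eight assignments yield training vectors.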
Seven data sets with available data and background knowledge (BK) are used: Mutagenesis (examples of molecules tested for mutagenicity and BK provided in the form of rules describing relationships between atom bonds) [47], KRK (King-Rook versus King chess endgame with examples provided by the coordinates of the pieces on the board and BK in the form of row and column differences) [3], UW-CSE (Entity-Relationship diagram with data about students, courses taken, professors, etc. and BK describing the relational structure) [39], and the Alzheimer’s benchmark: Amine, Acetyl, Memory and Toxic (a set of examples for each of four properties of a drug design for Alzheimer’s disease with BK describing bonds between the chemical structures) [23]. With the clauses converted into their equivalent set of preferred models in the form of vectors such as $[x_{1},x_{2},y]$ above, and combined with the available data, for the Mutagenesis and KRK tasks, $2.5\%$ of the data is used to build the initial LBM. For the larger data sets UW-CSE and Alzheimer’s, $10\%$ of the data is used as BK. The remaining data are used for training and validation based on 10-fold cross validation for each data set, except for UW-CSE that uses 5 folds for the sake of comparison. The number of hidden units added to the LBM is chosen arbitrarily at $50$ . The standard RBM without BK is given a higher degree of freedom with $100$ hidden units. Results are shown in Table 2. The results for Aleph and CILP++ are obtained from [13]. It can be seen that LBM has the best performance in 5 out of 7 data sets. Some of the results of the LBM and RBM are comparable when the BK can be learned from the examples, as in the case of the Alzheimer’s amine data set. In these cases, training the LBM is faster than the RBM. Aleph is better than all other models in the alz-acetyl data set. This task must rely more heavily on the correctness of the BK than the data. 
CILP++, however, is considerably faster than Aleph and can achieve comparable results. Although direct comparisons of running time between CILP++ and LBM are not possible, LBM's running times look promising.
Table 2: Cross-validation performance of LBM against purely-symbolic system Aleph, neurosymbolic system CILP++ and a standard RBM on 7 benchmark data sets for neurosymbolic AI. We run cross-validation on RBM and LBM 100 times and report the average results with $95\%$ confidence interval. Source: [52].
| Data set | Aleph | CILP++ | RBM | LBM |
| --- | --- | --- | --- | --- |
| Mutagenesis | 80.85 (±10.5) | 91.70 (±5.84) | 95.55 (±1.36) | **96.28** (±1.21) |
| KRK | 99.60 (±0.51) | 98.42 (±1.26) | 99.70 (±0.11) | **99.80** (±0.09) |
| UW-CSE | 84.91 (±7.32) | 70.01 (±2.2) | 89.14 (±0.46) | **89.43** (±0.42) |
| alz-amine | 78.71 (±5.25) | 78.99 (±4.46) | **79.13** (±1.14) | 78.25 (±1.07) |
| alz-acetyl | **69.46** (±3.6) | 65.47 (±2.43) | 62.93 (±0.31) | 66.82 (±0.28) |
| alz-memory | 68.57 (±5.7) | 60.44 (±4.11) | 68.54 (±0.97) | **71.84** (±0.88) |
| alz-toxic | 80.50 (±3.98) | 81.73 (±4.68) | 82.71 (±1.18) | **84.95** (±1.04) |
3.6 Extensions of Logical Boltzmann Machines
3.6.1 Translating CNF into RBMs
In the general case, translation to SDNF can be costly. When knowledge is provided in CNF form, it is useful to be able to translate the CNF directly into the RBM without the need for an intermediate step.
Every WFF can be converted into CNF. A CNF is a conjunction of clauses. Formally:
$$
\varphi_{\text{CNF}}\equiv\bigwedge_{m=1}^{M}\Big(\bigvee_{t\in\mathcal{S}^{m}_{T}}x_{t}\vee\bigvee_{k\in\mathcal{S}^{m}_{K}}\neg x_{k}\Big) \tag{17}
$$
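As a worked instance of this notation, consider the (hypothetical) formula $(x_{1}\vee\neg x_{2})\wedge(x_{3}\vee\neg x_{1})$ , which has $M=2$ clauses and index sets

$$
\mathcal{S}^{1}_{T}=\{1\},\qquad \mathcal{S}^{1}_{K}=\{2\},\qquad \mathcal{S}^{2}_{T}=\{3\},\qquad \mathcal{S}^{2}_{K}=\{1\}.
$$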
We apply the same transformation into SDNF to each clause of the CNF. The result is a conjunction of $M$ SDNFs (which, taken as a whole, is not an SDNF):
$$
\varphi_{\text{CNF}}\equiv\bigwedge_{m=1}^{M}\Big(\bigvee_{t\in\mathcal{S}^{m}_{T}}\neg\mathrm{x}_{t}\vee\bigvee_{k\in\mathcal{S}^{m}_{K}}\mathrm{x}_{k}\Big)\equiv\bigwedge_{m=1}^{M}\bigvee_{p\in\mathcal{S}^{m}_{T}\cup\mathcal{S}^{m}_{K}}\Big(\bigwedge_{t\in\mathcal{S}^{m}_{T}\backslash p}\mathrm{x}_{t}\wedge\bigwedge_{k\in\mathcal{S}^{m}_{K}\backslash p}\neg\mathrm{x}_{k}\wedge\mathrm{x}^{\prime}_{p}\Big) \tag{18}
$$
where $\mathrm{x}^{\prime}_{p}\equiv\neg\mathrm{x}_{p}$ if $p∈\mathcal{S}^{m}_{T}$ ; otherwise $\mathrm{x}^{\prime}_{p}\equiv\mathrm{x}_{p}$ .
This transformation increases the space complexity from $\mathcal{O}(M× N)$ to $\mathcal{O}(M× N^{2})$ , where $M$ is the number of clauses and $N$ is the number of variables. This should not be a problem for current computing systems, especially since inference with RBMs can be highly parallelized.
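To see where the $\mathcal{O}(M× N^{2})$ bound comes from: each clause with $L\le N$ literals expands into $L$ conjuncts, each of which becomes one hidden unit connected to the clause's $L$ variables. A minimal sketch of this count (an illustrative helper of ours, not code from [52]):

```python
def lbm_size(cnf):
    """cnf: list of clauses, each a list of literals.
    Returns (hidden units, weights) under the clause-wise SDNF expansion:
    one conjunct (hidden unit) per literal, each touching all the clause's variables."""
    hidden = sum(len(clause) for clause in cnf)           # at most M * N
    weights = sum(len(clause) ** 2 for clause in cnf)     # at most M * N^2
    return hidden, weights

print(lbm_size([[1, -2], [3, -1]]))  # (4, 8)
```

For $M$ clauses of at most $N$ literals each, this gives at most $M× N$ hidden units and $M× N^{2}$ weights.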
Although the formula in Eq.(18) is not an SDNF, the equivalence between the CNF and the LBM still holds:
$$
s_{\varphi}=\begin{cases}
1 & \text{when } -\frac{1}{\epsilon}\min_{\mathbf{h}}E(\mathbf{x},\mathbf{h})=M\\
0 & \text{otherwise}
\end{cases} \tag{19}
$$
Eq.(19) holds because the CNF is satisfied if and only if all $M$ SDNFs are satisfied. Under such circumstances, $\min_{\mathbf{h}}E(\mathbf{x},\mathbf{h})=-M\epsilon$ . Otherwise, $\min_{\mathbf{h}}E(\mathbf{x},\mathbf{h})=-M^{\prime}\epsilon$ , where $M^{\prime}<M$ .
When a confidence value $c$ is used, the number of satisfied clauses of the CNF is proportional to the minimized energy function and, as $c$ increases, also to the free-energy function.
3.6.2 Towards using LBM as a SAT Solver
The Boolean satisfiability (SAT) problem is a fundamental problem in Computer Science: it was the first problem proven to be NP-complete. A formula is satisfiable if and only if there exists an assignment of truth-values mapping the formula to True. In practice, formulae in SAT problems are represented in Conjunctive Normal Form (CNF).
As discussed in Section 3.6.1, a formula in CNF can be converted into an LBM, and the number of satisfied clauses of the CNF formula is proportional to the minimized energy function and the free-energy function of the LBM. This relationship allows us to solve SAT problems by transforming them into an optimization task: finding the minimum of the energy or free-energy function.
To make this approach computationally feasible, we focus on minimizing the free-energy function, as it is both easier to compute and differentiable. This transformation converts the discrete SAT problem into a continuous optimization problem. Instead of searching for solutions in a Boolean space (where variables $x$ are either 0 or 1), we search in a continuous space for parameters $\theta$ , where each Boolean variable $x$ is represented as a sigmoid function:
$$
x=\sigma(\theta)=\frac{1}{1+\exp(-\theta)}.
$$
This mapping ensures that $x$ smoothly transitions between 0 and 1 as $\theta$ changes, enabling gradient-based optimization techniques to be applied. To illustrate this process, consider a simple SAT problem with two variables:
$$
(\neg x_{1}\lor\neg x_{2})\land(x_{1}\lor\neg x_{2})\land(\neg x_{1}\lor x_{2}).
$$
Figure 7 visualizes the landscape of the LBM’s energy and free-energy functions for different values of $\theta_{1}$ and $\theta_{2}$ , where $x_{1}=\sigma(\theta_{1})$ and $x_{2}=\sigma(\theta_{2})$ . The plots reveal that when both $\theta_{1}$ and $\theta_{2}$ are negative (corresponding to $x_{1},x_{2}≈ 0$ ), the functions approach their minima. This corresponds to a satisfying assignment of the CNF formula, illustrating how the optimization process identifies valid solutions.
We also analyze the impact of the confidence value $c$ on the landscapes of the energy and free-energy functions. Figures 7(a), 7(b), 7(c), and 7(d) show that confidence values do not significantly alter the landscape of the energy function. However, for the free-energy function (Figures 7(e), 7(f), 7(g), 7(h)), smaller values of $c$ result in smoother landscapes. While this smoothing effect can facilitate optimization by reducing sharp transitions, it also narrows the gap between local minima and the global minimum. Conversely, higher values of $c$ raise the barriers between optimal regions, making it more challenging to locate the global optimum. This trade-off highlights the importance of carefully selecting $c$ based on the specific characteristics of the SAT problem being solved.
In summary, the LBM framework provides an approach to solving SAT problems by converting them into continuous optimization tasks. By leveraging the differentiability of the free-energy function and the flexibility of sigmoid mappings, this approach bridges logical reasoning and numerical optimization. Future work should explore adaptive strategies for adjusting confidence values to balance smoothness and optimality and consider ways to enhance performance in the case of specific classes of SAT problems.
(a) c=0.1
(b) c=0.5
(c) c=1
(d) c=5
(e) c=0.1
<details>
<summary>extracted/6466920/figs/fe_2var_c0.5.png Details</summary>

*(3D surface plot: free energy as a function of θ₁ and θ₂, each in [-4, 4]; two minima of about -4.00 near (-3, -3) and (3, -3), and a broad maximum of about -3.75 near (0, 2).)*
</details>
(f) c=0.5
<details>
<summary>extracted/6466920/figs/fe_2var_c1.0.png Details</summary>

*(3D surface plot: free energy as a function of θ₁ and θ₂, each in [-4, 4]; a minimum of about -3.9 near (0, 0) and a broad maximum of about -3.4 along θ₂ ≈ 4.)*
</details>
(g) c=1
<details>
<summary>extracted/6466920/figs/fe_2var_c5.0.png Details</summary>

*(3D surface plot: free energy as a function of θ₁ and θ₂, each in [-4, 4], ranging from about -7 to -3, with multiple local minima and saddle points.)*
</details>
(h) c=5
Figure 7: Energy function and free-energy function with different confidence values.
3.6.3 Comparison to Other Approaches
Unlike recent neural network-based approaches to SAT solving [40, 53], which rely on large datasets generated by traditional SAT solvers for training, our method eliminates this dependency. Instead, we directly convert SAT problems (typically expressed in CNF) into LBM representations. This data-independent transformation should simplify the workflow by reducing the need for extensive pre-processing or model training. When compared with other SAT-solving methods that utilize Boltzmann Machines (BM), such as those in [17, 7], our approach leverages the much simpler structure of Restricted Boltzmann Machines (RBMs). These earlier methods often employ dense or higher-order structures, which are computationally complex and challenging to implement. By contrast, our use of RBMs maintains a streamlined architecture without requiring modifications to the network (e.g. adding configurations) to map SAT problems to BMs. Finally, while LBM is not yet competitive with state-of-the-art SAT solvers in terms of raw performance, it seems to open up a promising direction for further research. Unlike traditional SAT solvers, LBM is in essence a neural network. This should enable a seamless integration of reasoning and learning. Furthermore, LBM does not require prior knowledge of SAT problem structures or specific solving tactics such as backtracking or unit propagation. It is important to note that our current implementation relies solely on off-the-shelf optimization methods, with room for significant improvement through parallelization and the incorporation of advanced SAT-solving strategies.
3.6.4 SAT Solving Methodology and Initial Experimental Results
We used random SAT problems [1] as a case study. To solve SAT problems using LBM, we follow these steps:
- Convert a CNF into an RBM using the transformation described earlier.
- Apply various inference and optimization techniques to find satisfying assignments or determine unsatisfiability. Specifically:
- Use Gibbs sampling to minimize the energy function and search for satisfying assignments.
- Employ gradient-based methods from TensorFlow to optimize the differentiable free energy function.
- Utilize stochastic optimization methods from the Scipy library, such as dual_annealing and differential_evolution.
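As an illustration of these steps, the sketch below (ours, not the authors' implementation) converts a hypothetical three-variable CNF into the free energy of an LBM by expanding each clause into the conjunctive clauses of its SDNF, one hidden unit per conjunctive clause, and then minimizes the free energy with Scipy's `dual_annealing`. The literal encoding and the values `eps = 0.5` and `c = 5` are assumptions for illustration only.

```python
import numpy as np
from itertools import product
from scipy.optimize import dual_annealing

# Hypothetical CNF over x0, x1, x2; a literal is (variable index, polarity).
cnf = [[(0, True), (1, True)],     # x0 v x1
       [(0, False), (2, True)],    # -x0 v x2
       [(1, False), (2, False)]]   # -x1 v -x2
n, eps, c = 3, 0.5, 5.0

# One hidden unit per conjunctive clause of each clause's SDNF:
# enumerate the clause's satisfying local assignments (full expansion).
units = []                          # pairs (weight vector w, bias b)
for clause in cnf:
    idx = [i for i, _ in clause]
    for bits in product([0, 1], repeat=len(idx)):
        if any(bit == int(pol) for bit, (_, pol) in zip(bits, clause)):
            w = np.zeros(n)
            for bit, i in zip(bits, idx):
                w[i] = 1.0 if bit else -1.0
            units.append((w, -sum(bits) + eps))

def free_energy(x):
    # FE(x) = -sum_j softplus(c * (w_j . x + b_j))
    return -sum(np.logaddexp(0.0, c * (w @ x + b)) for w, b in units)

res = dual_annealing(free_energy, bounds=[(0.0, 1.0)] * n, seed=0, maxiter=200)
assignment = np.round(res.x).astype(int)
satisfied = all(any(assignment[i] == int(p) for i, p in cl) for cl in cnf)
```

Because the free energy is a sum of concave terms, its minima over the unit cube lie at binary vertices, so rounding the continuous minimizer recovers a truth-value assignment.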
Our experiments produced the following results:
- Gibbs Sampling:
- Gibbs sampling can find satisfying assignments for CNFs with fewer than 40 variables. For larger formulas, Gibbs sampling often gets stuck in local minima, making it challenging to determine satisfiability. In such cases, if the free energy function does not decrease after 1000 steps of Gibbs sampling, we conclude that the formula is likely unsatisfiable.
- Gradient-Based Optimization (TensorFlow):
- Gradient-based methods are applied to minimize the differentiable free energy function. However, these methods are prone to getting trapped in local minima, especially for SAT problems with more than 20 variables.
- Attempts to smooth the energy landscape by adjusting confidence values did not significantly improve performance, highlighting a difficulty in solving larger SAT instances with this approach.
- Stochastic Optimization (Scipy):
- Among the stochastic optimization methods available in Scipy, dual_annealing and differential_evolution showed better scalability. These methods successfully solved SAT problems with up to 100 variables.
- This suggests that stochastic global optimization techniques may offer a viable alternative for solving larger SAT instances with RBMs.
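For completeness, the Gibbs-sampling search mentioned above can be sketched as a standard alternating sampler over the visible and hidden units of an RBM. The sketch below is ours, with an assumed confidence value `c = 5`, and tracks the lowest-energy visible state; it is illustrated on a tiny two-unit LBM encoding the clause $x_1 \vee x_2$ via its SDNF $(x_1 \wedge \neg x_2) \vee x_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_search(W, b_h, c=5.0, n_steps=1000):
    """Alternating Gibbs sampling on an RBM with zero visible biases,
    tracking the lowest-energy visible state seen so far."""
    n_h, n_v = W.shape
    x = rng.integers(0, 2, n_v).astype(float)
    best_x, best_E = x.copy(), np.inf
    for _ in range(n_steps):
        # sample hidden units given visible units
        h = (rng.random(n_h) < 1.0 / (1.0 + np.exp(-c * (W @ x + b_h)))).astype(float)
        # sample visible units given hidden units
        x = (rng.random(n_v) < 1.0 / (1.0 + np.exp(-c * (W.T @ h)))).astype(float)
        E = -h @ (W @ x + b_h)     # RBM energy of the current joint state
        if E < best_E:
            best_x, best_E = x.copy(), E
    return best_x, best_E

# Hypothetical two-unit LBM for x1 v x2 in SDNF (x1 ^ -x2) v x2, eps = 0.5
W = np.array([[1.0, -1.0],
              [0.0, 1.0]])
b_h = np.array([-0.5, -0.5])
x_best, E_best = gibbs_search(W, b_h)
```

Any state with negative energy switches on a hidden unit whose pre-activation is positive, which for this construction can only happen at a satisfying assignment.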
3.6.5 Implementing Penalty Logic in LBM
The closest work to LBM is Penalty Logic [37], which represents propositional formulae in Hopfield networks and Boltzmann machines. In its first step, Penalty Logic creates hidden variables to reduce a formula $\varphi$ to a conjunction of sub-formulas $\bigwedge_{i}\varphi_{i}$, each with at most three variables. This naming step makes conversion into an energy function easier, but some of the terms in the energy function may consist of hidden variables and therefore cannot be converted into an RBM. For example, a negative term $-h_{1}xy$ of a higher-order Boltzmann Machine would be transformed into the quadratic term $-h_{2}h_{1}-h_{2}x-h_{2}y+5h_{2}$ with $-h_{2}h_{1}$ forming a connection between two hidden units, which is not allowed in RBMs. The three-variable term $-h_{1}xy$ is implemented in the higher-order network as a hypergraph. The variable $h_{2}$ is introduced to turn the hyper-edge into normal edges between each of the three variables and $h_{2}$, with an appropriate bias value for the new node $h_{2}$, in this example a value of 5. Contrast the LBM for XOR in Figure 4 with the RBM for XOR built using Penalty Logic: first, one computes the higher-order energy function ${\it E}^{p}=4xyz-2xy-2xz-2yz+x+y+z$, then transforms it to quadratic form by adding a hidden variable $h_{1}$ to obtain ${\it E}^{p}=2xy-2xz-2yz-8xh_{1}-8yh_{1}+8zh_{1}+x+y+z+12h_{1}$, which is not an energy function of an RBM, so one keeps adding hidden variables until the energy function of an RBM is obtained, in this case: ${\it E}^{p}=-8xh_{1}-8yh_{1}+8zh_{1}+12h_{1}-4xh_{2}+4yh_{2}+2h_{2}-4yh_{3}-4zh_{3}+6h_{3}-4xh_{4}-4zh_{4}+6h_{4}+3x+y+z.$
The LBM system converts any set of formulae $\Phi=\{\varphi_{1},...,\varphi_{n}\}$ into an RBM by applying Theorem 1 to each formula $\varphi_{i}∈\Phi$. In the case of Penalty Logic, formulae are weighted. Given a set of weighted formulae $\Phi=\{w_{1}:\varphi_{1},...,w_{n}:\varphi_{n}\}$, one can also construct an equivalent RBM where each energy term generated from formula $\varphi_{i}$ is multiplied by $w_{i}$. In both cases, the assignments that minimise the energy of the RBM are the assignments that maximise the satisfiability of $\Phi$, i.e. the (weighted) sum of the truth-values of the formulae.
**Lemma 3**
*Given a weighted knowledge-base $\Phi=\{w_{1}:\varphi_{1},...,w_{n}:\varphi_{n}\}$, there exists an equivalent RBM $\mathcal{N}$ such that $s_{\Phi}(\mathbf{x})=-\frac{1}{\epsilon}\min_{\mathbf{h}}{\it E}(\mathbf{x},\mathbf{h})$, where $s_{\Phi}(\mathbf{x})$ is the sum of the weights of the formulae in $\Phi$ that are satisfied by assignment $\mathbf{x}$.*
A formula $\varphi_{i}$ can be decomposed into a set of (weighted) conjunctive clauses from its SDNF. If there exist two conjunctive clauses such that one is subsumed by the other then the subsumed clause is removed and the weight of the remaining clause is replaced by the sum of their weights. Identical conjunctive clauses are treated in the same way: one of them is removed and the weights are added. From Theorem 1, we know that a conjunctive clause $\bigwedge_{t∈\mathcal{S}_{T_{j}}}\mathrm{x}_{t}\wedge\bigwedge_{k∈\mathcal{S}_{K_{j}}}\neg\mathrm{x}_{k}$ is equivalent to an energy term $e_{j}(\mathbf{x},h_{j})=-h_{j}(\sum_{t∈\mathcal{S}_{T_{j}}}x_{t}-\sum_{k∈\mathcal{S}_{K_{j}}}x_{k}-|\mathcal{S}_{T_{j}}|+\epsilon)$ where $0<\epsilon<1$. A weighted conjunctive clause $w^{\prime}:\bigwedge_{t∈\mathcal{S}_{T_{j}}}\mathrm{x}_{t}\wedge\bigwedge_{k∈\mathcal{S}_{K_{j}}}\neg\mathrm{x}_{k}$, therefore, is equivalent to an energy term $w^{\prime}e_{j}(\mathbf{x},h_{j})$. For each weighted conjunctive clause, we can add a hidden unit $j$ to an RBM with connection weights $w_{tj}=w^{\prime}$ for all $t∈\mathcal{S}_{T_{j}}$, and $w_{kj}=-w^{\prime}$ for all $k∈\mathcal{S}_{K_{j}}$. The bias for this hidden unit will be $w^{\prime}(-|\mathcal{S}_{T_{j}}|+\epsilon)$. The weighted knowledge-base and the RBM are equivalent because $s_{\Phi}(\mathbf{x})\propto-\frac{1}{\epsilon}\min_{\mathbf{h}}{\it E}(\mathbf{x},\mathbf{h})$, where $s_{\Phi}(\mathbf{x})$ is the sum of the weights of the clauses that are satisfied by $\mathbf{x}$.
**Example 5**
*(Nixon diamond problem) Consider the following weighted knowledge-base from the original Penalty Logic paper [37] (the weights of 1000 and 10 are taken from the original paper):

$$
\displaystyle 1000:\mathrm{n}→\mathrm{r}\quad\text{Nixon is a Republican.}
$$
$$
\displaystyle 1000:\mathrm{n}→\mathrm{q}\quad\text{Nixon is a Quaker.}
$$
$$
\displaystyle 10:\mathrm{r}→\neg\mathrm{p}\quad\text{Republicans are not pacifists.}
$$
$$
\displaystyle 10:\mathrm{q}→\mathrm{p}\quad\text{Quakers are pacifists.}
$$
<details>
<summary>extracted/6466920/figs/diamond.png Details</summary>

*(Diagram of the RBM for the Nixon diamond problem: visible units n, q, r, p connected to hidden units h₁ through h₇, with the weights and biases given in the Figure 8 caption.)*
</details>
Figure 8: The RBM for the Nixon diamond problem has 4 input neurons $\{n,q,r,p\}$ and 7 hidden neurons (shown in grey) as a result of the conversion into SDNF of the 4 weighted clauses shown in Example 5. Converting all four weighted clauses above into SDNF produces eight conjunctive clauses. For example, weighted clause $1000:\mathrm{n}→\mathrm{r}\equiv 1000:(\mathrm{n}\wedge\mathrm{r})\vee(\neg\mathrm{n})$. After adding the weights of clause ($\neg\mathrm{n}$), which appears twice, an RBM is created (Figure 8) representing the following unique conjunctive clauses with their corresponding confidence values: $1000:\mathrm{n}\wedge\mathrm{r},\quad 2000:\neg\mathrm{n},\quad 1000:\mathrm{n}\wedge\mathrm{q},\quad 10:\mathrm{r}\wedge\neg\mathrm{p},\quad 10:\neg\mathrm{r},\quad 10:\mathrm{q}\wedge\mathrm{p},\quad 10:\neg\mathrm{q}.$ With $\epsilon=0.5$, this RBM has energy function: ${\it E}=-h_{1}(1000n+1000r-1500)-h_{2}(-2000n+1000)-h_{3}(1000n+1000q-1500)-h_{4}(10r-10p-5)-h_{5}(-10r+5)-h_{6}(10q+10p-15)-h_{7}(-10q+5).$*
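The Figure 8 energy function can be checked numerically against Lemma 3: for every truth-value assignment, minimising over the hidden units and scaling by $-1/\epsilon$ should return the sum of the weights of the satisfied conjunctive clauses. A short sanity check of ours, with the weight matrix read off the energy function above:

```python
import numpy as np
from itertools import product

# Hidden-unit weights and biases read off the Figure 8 energy function;
# visible-unit order is (n, q, r, p), eps = 0.5.
W = np.array([[ 1000,    0, 1000,   0],   # h1: 1000 : n ^ r
              [-2000,    0,    0,   0],   # h2: 2000 : not n
              [ 1000, 1000,    0,   0],   # h3: 1000 : n ^ q
              [    0,    0,   10, -10],   # h4: 10 : r ^ not p
              [    0,    0,  -10,   0],   # h5: 10 : not r
              [    0,   10,    0,  10],   # h6: 10 : q ^ p
              [    0,  -10,    0,   0]],  # h7: 10 : not q
             dtype=float)
b = np.array([-1500, 1000, -1500, -5, 5, -15, 5], dtype=float)
eps = 0.5

# The same clauses symbolically: (weight, positive vars, negated vars).
clauses = [(1000, [0, 2], []), (2000, [], [0]), (1000, [0, 1], []),
           (10, [2], [3]), (10, [], [2]), (10, [1, 3], []), (10, [], [1])]

checks = []
for bits in product([0, 1], repeat=4):
    x = np.array(bits, dtype=float)
    # a hidden unit is switched on only when its pre-activation is positive
    min_E = np.minimum(0.0, -(W @ x + b)).sum()
    s = sum(w for w, pos, neg in clauses
            if all(x[i] == 1 for i in pos) and all(x[i] == 0 for i in neg))
    checks.append(abs(s - (-min_E / eps)) < 1e-9)
```

The equality holds exactly for all 16 truth-value assignments of $(n,q,r,p)$.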
4 Logical Boltzmann Machines for MaxSAT
MaxSAT - shorthand for Maximum Satisfiability - is a computational problem that extends the classical SAT (Boolean satisfiability) problem. In MaxSAT, the goal is to find an assignment of truth values to the variables of a Boolean formula that maximizes the number of satisfied clauses. The formula is typically represented in CNF. We denote the number of satisfied clauses given an assignment $\mathbf{x}$ as $\sum_{m}\mathbbm{1}(\mathbf{x}\models\varphi_{m})$. Here, $\mathbf{x}\models\varphi_{m}$ denotes that an assignment $\mathbf{x}$ satisfies the clause $\varphi_{m}$ of the CNF and $\mathbbm{1}$ is an indicator function mapping a clause-satisfying assignment to 1, and everything else to 0. Differently from SAT, where the goal is to find any satisfying assignment, MaxSAT seeks the assignment that satisfies the maximum number of clauses, making it a combinatorial optimization problem:
$$
\mathbf{x}^{*}=\operatorname*{arg\,max}_{\mathbf{x}}\sum_{m}\mathbbm{1}(\mathbf{x}\models\varphi_{m}) \tag{20}
$$
For example, consider the CNF formula:
$$
\phi=(\neg x_{1}\lor\neg x_{2})\land(\neg x_{1}\lor x_{2})\land(x_{1}\lor\neg x_{2})\land(x_{1}\lor x_{2}).
$$
The goal is to find an assignment $x∈\{0,1\}^{n}$ (where $n$ is the number of variables, in this example $n=2$) that maximizes the number of satisfied clauses. We know that $\phi$ is unsatisfiable since each clause is falsified by exactly one of the four possible assignments, so the four clauses jointly rule out all assignments. An optimal solution will satisfy three of the four clauses. This relaxation of the SAT task makes MaxSAT particularly suited for real-world optimization problems where constraints may need to be prioritized or relaxed to achieve the best overall solution. However, this flexibility comes at a cost: NP-hardness, meaning that finding exact solutions becomes computationally infeasible as the number of variables grows. Many real-world applications, however, require approximate solutions to the MaxSAT problem, with the main challenge being how to balance accuracy and computation time.
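This can be verified by exhaustive enumeration. A short sketch, where the encoding of literals as index/polarity pairs is ours:

```python
from itertools import product

# The four clauses of phi, as (variable index, required truth-value) literals
clauses = [[(0, False), (1, False)],   # -x1 v -x2
           [(0, False), (1, True)],    # -x1 v  x2
           [(0, True),  (1, False)],   #  x1 v -x2
           [(0, True),  (1, True)]]    #  x1 v  x2

def n_satisfied(x):
    return sum(any(x[i] == v for i, v in clause) for clause in clauses)

# every assignment satisfies exactly three clauses, so phi is unsatisfiable
counts = [n_satisfied(x) for x in product([False, True], repeat=2)]
```

Since no assignment reaches four, $\phi$ is unsatisfiable, and any assignment is a MaxSAT optimum with three satisfied clauses.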
Symbolic MaxSAT solvers have been the focus of intensive research for many years, gaining popularity among researchers and finding application in various domains from AI and computer-aided design to automated reasoning. Recent advancements in MaxSAT solvers have demonstrated significant improvements, with state-of-the-art solvers capable of scaling to millions of variables and clauses.
MaxSAT has proven to be valuable in software analysis [43], hardware verification [32, 22], combinatorial optimization [25], bioinformatics [45], and data analysis [4]. Despite these achievements, the growing complexity of real-world problems has spurred interest in alternative approaches, such as leveraging the differentiable capabilities of Machine Learning to address MaxSAT by transforming the discrete problem into a continuous optimization task.
In recent years, there has been a growing interest in connectionist solvers. These solvers aim to represent MaxSAT problems using neural networks to benefit from the parallel computation and learning capabilities of such ML systems and from specialized neural network accelerators, such as GPUs and TPUs, to further enhance performance. Beyond providing an alternative approach to solving complex reasoning problems, connectionist MaxSAT solvers may promote the development of interpretable classification models [28], facilitating novel neurosymbolic learning and reasoning [53, 52] with the promise of offering more transparent decision making in AI.
Deep learning-inspired MaxSAT solvers rely on real-valued weights often learned from example solutions [53]. In a related attempt, [27] seeks to train a Graph Neural Network for MaxSAT problem-solving. Unlike symbolic approaches, ML-based methods typically require a degree of supervision and may be criticized for their opacity. We take a different approach and use LBM along with a global optimization method called dual annealing, a modified version of a meta-heuristic method known as simulated annealing, to search for MaxSAT solutions. Using dual annealing, we will search the energy landscape of the RBM for a global minimum corresponding to a MaxSAT solution.
Research that has also focused on representing MaxSAT problems in connectionist networks without relying on explicit learning from examples includes [18], where the MaxSAT problem is mapped to a combinatorial optimization framework based on a high-order Boltzmann Machine used to search for an approximate solution to SAT. However, the efficiency problems of Boltzmann Machines are well known. They have motivated the use of RBMs, first in [52] and leading up to this work. In a more recent development also using RBMs, RBMSAT was proposed to construct an RBM that represents the probability of an assignment w.r.t. the number of clauses that it satisfies [54]. The goal there is to conduct a heuristic search for solutions using block Gibbs sampling on neural network accelerators. Our goal, instead, is to make use of an interpretable RBM layer, as shown e.g. in Figure 13 where each clause can be read off the LBM with binary weights.
**Example 6**
*An AND-gate $\mathrm{x}_{1}\wedge\neg\mathrm{x}_{2}$ is represented by a free energy $FE=-\log(1+\exp(c×(x_{1}-x_{2}-0.5)))$. Figure 9 illustrates the correspondence between the free energy and the truth-values for different values of $c$. Similarly, Figure 10 shows the free energy of an OR-gate (that is, a clause) $\mathrm{x}_{1}\vee\mathrm{x}_{2}$. This clause is transformed into SDNF $(\mathrm{x}_{1}\wedge\neg\mathrm{x}_{2})\vee\mathrm{x}_{2}$ and the corresponding free energy is $FE=-\log(1+\exp(c×(x_{1}-x_{2}-0.5)))-\log(1+\exp(c×(x_{2}-0.5)))$. As expected, the satisfying assignments are those that maximize the negative free energy.*
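The free energy of the OR-gate in Example 6 can be evaluated directly at the four truth-value assignments; the confidence value `c = 5` below is an assumed choice for illustration:

```python
import math

c = 5.0

def fe_or(x1, x2):
    # free energy of x1 v x2 via its SDNF (x1 ^ -x2) v x2, as in Example 6
    return (-math.log(1 + math.exp(c * (x1 - x2 - 0.5)))
            - math.log(1 + math.exp(c * (x2 - 0.5))))

# negative free energy at each truth-value assignment
neg_fe = {(x1, x2): -fe_or(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

The only falsifying assignment, $(0,0)$, yields a markedly lower negative free energy than the three satisfying assignments, as the example states.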
As we have seen already when using LBM as a SAT solver, a clause $\varphi_{m}$ of a CNF, once converted into SDNF, can be represented in an RBM with the energy function $E_{m}=\sum_{j}e_{j}$, where $j$ ranges over the conjunctive clauses of the SDNF, and, therefore, the energy function of the CNF will be:
$$
{\it E}(\mathbf{x})=\sum_{m}E_{m} \tag{21}
$$
The free energy of each clause corresponds to the truth-value of the clause, i.e. $\mathbbm{1}(\mathbf{x}\models\varphi_{m})\propto-FE_{m}(\mathbf{x})$. The free energy of the entire CNF, $FE(\mathbf{x})=\sum_{m}FE_{m}(\mathbf{x})$, therefore, corresponds to the number of satisfied clauses, that is:
$$
\displaystyle\sum_{m}\mathbbm{1}(\mathbf{x}\models\varphi_{m})\propto-FE(\mathbf{x}) \tag{22}
$$
An assignment that maximizes the number of satisfied clauses in a MaxSAT problem therefore also minimizes the free energy of the LBM. Consequently, solving MaxSAT problems is equivalent to searching for a state of minimum free energy in the RBM.
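On the running example, this equivalence can be checked by enumerating the per-clause free energies and summing them as in Equation 21. The sketch below is ours, with assumed values `c = 5` and `eps = 0.5`, and sums clause free energies directly without merging duplicate conjunctive clauses:

```python
import math
from itertools import product

c, eps = 5.0, 0.5
# the four clauses of phi, as (variable index, required truth-value) literals
clauses = [[(0, 0), (1, 0)], [(0, 0), (1, 1)], [(0, 1), (1, 0)], [(0, 1), (1, 1)]]

def clause_fe(clause, x):
    # sum of -softplus terms over the clause's SDNF conjunctive clauses
    fe = 0.0
    for bits in product([0, 1], repeat=len(clause)):
        if any(b == v for b, (_, v) in zip(bits, clause)):
            pre = sum(x[i] if b else -x[i] for b, (i, _) in zip(bits, clause))
            fe -= math.log(1 + math.exp(c * (pre - sum(bits) + eps)))
    return fe

def n_sat(x):
    return sum(any(x[i] == v for i, v in cl) for cl in clauses)

fe = {x: sum(clause_fe(cl, x) for cl in clauses) for x in product([0, 1], repeat=2)}
x_star = min(fe, key=fe.get)
```

In this example every assignment satisfies exactly three clauses, so any minimiser of the free energy is a MaxSAT optimum; on instances with unequal clause counts, the free-energy minimiser singles out the optimum.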
<details>
<summary>extracted/6466920/figs/andc1_.png Details</summary>

*(3D surface plot over x₁, x₂ ∈ [0, 1] of the correspondence between truth-values and negative free energy at c = 1; the surface is nearly planar.)*
</details>
(a) c=1
(b) c=5
(c) c=10
Figure 9: Free energy corresponding to an AND gate with different confidence values (Example 6).
(a) c=1
(b) c=5
(c) c=10
Figure 10: Free energy corresponding to an OR gate with different confidence values (Example 6).
4.1 LBM with Dual Annealing
When representing CNF in a RBM, one option for solving the MaxSAT problem is to utilize stochastic search methods like Gibbs sampling. However, in our scenario, Gibbs sampling exhibits slow convergence, necessitating potentially extensive computational resources to reach equilibrium. To mitigate this challenge, we have adopted simulated annealing, a meta-heuristic technique renowned for addressing global optimization problems [24]. Specifically, we employ dual annealing, which seamlessly integrates global and local search strategies to enhance efficiency [55].
In this approach, the search algorithm initially employs simulated annealing to identify a candidate region within the search space where global optima are likely to be situated. Subsequently, a local search is conducted to precisely pinpoint the location of the global optima. This hybrid strategy combines the exploration capabilities of simulated annealing with the exploitation capabilities of local search, offering a robust and efficient approach to solving MaxSAT problems.
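As an illustration of this hybrid strategy, the following sketch uses SciPy's `dual_annealing` to minimize the number of unsatisfied clauses of a small, invented CNF. The clause set and the simple clause-counting objective are stand-ins for the RBM free-energy function that LBM actually minimizes; they are our assumptions for the example, not the instances used in this chapter.

```python
import numpy as np
from scipy.optimize import dual_annealing

# A toy CNF over 4 variables; each literal is (index, sign),
# with sign = +1 for x_i and -1 for NOT x_i.
clauses = [[(0, 1), (1, -1)],   # x0 OR NOT x1
           [(1, 1), (2, 1)],    # x1 OR x2
           [(2, -1), (3, 1)],   # NOT x2 OR x3
           [(0, -1), (3, -1)]]  # NOT x0 OR NOT x3

def energy(x):
    """Number of unsatisfied clauses after rounding x in [0,1]^n to booleans."""
    bits = np.round(x).astype(int)
    unsat = 0
    for clause in clauses:
        sat = any((bits[i] == 1) if s > 0 else (bits[i] == 0) for i, s in clause)
        unsat += 0 if sat else 1
    return float(unsat)

# Global (annealing) plus local search over the unit hypercube.
result = dual_annealing(energy, bounds=[(0.0, 1.0)] * 4, maxiter=200, seed=0)
assignment = np.round(result.x).astype(int)
print("assignment:", assignment, "unsatisfied clauses:", int(result.fun))
```

In LBM the objective would be the network's free energy rather than a clause count, but the search loop is the same: the annealing phase explores the relaxed space globally before local refinement.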
4.1.1 Experimental Results of LBM for MaxSAT
As a preliminary evaluation, we compare the performance of LBM at solving MaxSAT problems with Loandra, a state-of-the-art MaxSAT solver. Loandra was chosen as the benchmark solver due to its performance in the 2023 MaxSAT Evaluation. Experiments were carried out on a set of challenging MaxSAT instances known as MaxCut, each containing 1,260 clauses. Six distinct MaxCut problem instances were chosen from the MaxSAT 2016 benchmark, and both solvers were run on each of the six instances with a timeout of 300 seconds per run. The instances were chosen to represent diverse problem structures so as to evaluate the solvers without making assumptions about the CNF structure. All experiments were conducted on a standard desktop computer with an AMD Ryzen 7 5800X 8-core processor and 32 GB of DDR4 RAM. The main evaluation metric was the number of satisfied clauses, indicating a solver's ability to maximize clause satisfaction within the given time constraint.
| Statistic | (a) Loandra | (b) LBM |
| --- | --- | --- |
| count | 6 | 6 |
| mean | 1089.00 | 1078.67 |
| std | 8.51 | 3.50 |
| min | 1078 | 1075 |
| 25% | 1083.50 | 1075.75 |
| 50% | 1088.00 | 1078.50 |
| 75% | 1095.50 | 1080.50 |
| max | 1100 | 1084 |
Figure 11: Statistics for clause satisfaction for (a) Loandra (1089 clauses satisfied on average) and (b) LBM for MaxSAT (1078 clauses satisfied on average).
Figure 12: Performance of LBM and Loandra on 6 MaxCut experiments (0 to 5).
Figure 12 shows the number of clauses satisfied by the symbolic SAT solver Loandra and by LBM for MaxSAT in each of the 6 experiments. As a state-of-the-art solver, Loandra performed better than LBM in 5 out of 6 cases. However, the difference is relatively small and further investigation is warranted, with numerous directions for future exploration. One possible direction is to investigate alternative global optimization methods such as evolutionary strategies. Another obvious direction is the implementation of parallel computation to improve the scalability of LBM search, similar to the approach employed in RBMSAT. As future work, the Image Sudoku task introduced in [48] is an interesting extension to consider: the interplay between object recognition and reasoning intrinsic to that task aligns well with LBM's integration of learning and reasoning, which we describe next.
5 Integrating Learning and Reasoning in Logical Boltzmann Machines
A key development in neurosymbolic AI will be the ability to add verified modules to existing networks. An RBM that can be shown to be provably equivalent to a given logical formula could act as one such module. A neurosymbolic module placed on top (at the output) of a larger network may serve to constrain that network’s output to satisfy certain properties. In this process, the specification of a neurosymbolic module may benefit from the use of a logical language that is richer (more expressive) than propositional logic. In particular, the use of first-order logic may be required.
In first-order logic, instead of using propositions and connectives alone, logical predicates are used to represent relations among quantified variables. Statements such as $\forall X\,\exists Y.\,R(X,Y)$ are used to denote compactly a relation $R$ between variables $X$ and $Y$ in a domain such that for all the values that variable $X$ may admit, there is a value (an instance) of variable $Y$ for which the relation $R$ holds True.
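Over a finite domain this quantifier semantics can be checked directly by nested iteration, as the short sketch below shows; the domain and the relation $R$ are invented purely for illustration.

```python
# Checking "forall X exists Y. R(X, Y)" over a small finite domain.
# R is a toy relation chosen only for this example.
X_domain = Y_domain = range(1, 5)

def R(x, y):
    return (x + y) % 4 == 0

# The universal quantifier becomes all(...), the existential becomes any(...).
holds = all(any(R(x, y) for y in Y_domain) for x in X_domain)
print(holds)
```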
As an example, if we were to rewrite into first-order logic the Boolean logical notation provided earlier for the Sudoku puzzle, we could use a ternary predicate $board(B,P,V)$ , in which:
- $B$ represents the block index (1 to 4),
- $P$ represents the position index within the block (1 to 4), and
- $V$ represents the value (1 to 4),
to make the problem description considerably more compact.
A first-order representation of the constraints of the Sudoku puzzle would include:
1. Existence: $\forall B,P\,\exists V.\,board(B,P,V)$
This ensures that every position in every block is filled with at least one value.
2. Uniqueness: $\forall B,P,V_{1},V_{2}.\,((board(B,P,V_{1})\land board(B,P,V_{2}))\rightarrow V_{1}=V_{2})$
This ensures that no position in the board can hold multiple values.
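Grounding these two first-order constraints over the finite 4x4 domain recovers a propositional encoding mechanically. The sketch below enumerates the resulting clauses; the string naming of the grounded variables is our own convention for the example.

```python
from itertools import product

# Domains for block, position and value, as in the text (1 to 4).
BLOCKS, POSITIONS, VALUES = range(1, 5), range(1, 5), range(1, 5)

def var(b, p, v):
    """Name of the grounded proposition board(b, p, v)."""
    return f"board({b},{p},{v})"

# Existence: for every block and position, at least one value holds.
existence = [[var(b, p, v) for v in VALUES]
             for b, p in product(BLOCKS, POSITIONS)]

# Uniqueness: the implication (board(b,p,v1) AND board(b,p,v2)) -> v1 = v2
# becomes, for each pair v1 < v2, the clause NOT board(b,p,v1) OR NOT board(b,p,v2).
uniqueness = [[f"-{var(b, p, v1)}", f"-{var(b, p, v2)}"]
              for b, p in product(BLOCKS, POSITIONS)
              for v1 in VALUES for v2 in VALUES if v1 < v2]

print(len(existence), "existence clauses,", len(uniqueness), "uniqueness clauses")
```

The 16 existence clauses and 96 uniqueness clauses illustrate the compactness gained: two first-order statements stand for over a hundred propositional clauses.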
We will illustrate one way of integrating LBM as a logical layer on top of deep networks applied to a semantic image interpretation task. The task is to predict the relations between objects and their parts in an image. It requires the use of first-order logic. The knowledge base consists of symbolic facts expressing when an object type is normally part of another object type, e.g. ${part}(\text{Screen},\text{TV})$ , where Screen and TV are variables, denoting that TVs have screens. The knowledge base also includes a first-order rule connecting any two visual scenes ( $X_{1}$ and $X_{2}$ ) with the symbols of the logic, as follows:
$$
\forall X_{1},X_{2}\,\exists T_{1},T_{2}.\,(({type}(X_{1},T_{1})\wedge{type}(X_{2},T_{2}))\rightarrow({partOf}(X_{1},X_{2})\leftrightarrow{part}(T_{1},T_{2}))) \tag{23}
$$
where $X_{1}$ , $X_{2}$ are real-valued variables representing visual features of objects (an embedding, pixel values, etc.), as done in [41], and $T_{1}$ , $T_{2}$ are symbolic variables representing object types. Predicate ${type}$ is $True$ when an object, as defined by its visual features, is deemed to be of a given type (e.g. an object class). Given two visual scenes with their corresponding type classifications, ${type}(X_{1},T_{1})$ and ${type}(X_{2},T_{2})$ , one visual scene will be part of the other, $partOf(X_{1},X_{2})$ , if and only if the object type of the former is deemed to be part of the object type in the latter, ${part}(T_{1},T_{2})$ .
In order to implement the above rule, we use Faster-RCNN to extract features from object images, from which we build two Neural Network Regressors (NNRs): $\mathcal{N}^{type}$ and $\mathcal{N}^{po}$, as learned functions for ${type}$ and ${partOf}$, respectively, as done in [11]. To represent first-order logic in LBM, we combine LBM with the NNRs: each NNR represents a predicate in the formulae and outputs a truth-value for that predicate, and LBM takes as input the truth-values of the predicates coming from the outputs of the NNRs. In essence, LBM sits on top of the NNRs, connecting the predicates according to the connectives of the corresponding logic formulae. Finally, we use an autoencoder $\mathcal{N}^{pt}$ to implement the relation ${part}(T_{1},T_{2})$ between the symbolic variables $T_{1}$ and $T_{2}$, following [51]. Let ${p}^{po}=\mathcal{N}^{po}(X_{1},X_{2})$, ${p}^{pt}=\mathcal{N}^{pt}(T_{1},T_{2})$, ${p}^{t_{1}}=\mathcal{N}^{type}(X_{1},T_{1})$ and ${p}^{t_{2}}=\mathcal{N}^{type}(X_{2},T_{2})$, with ${p}^{po},{p}^{pt},{p}^{t_{1}},{p}^{t_{2}}\in\{0,1\}$ according to some choice of threshold. The first-order rule (23) can then be converted to SDNF, as follows:
$$
({p}^{t_{1}}\wedge{p}^{t_{2}})\rightarrow({p}^{po}\leftrightarrow{p}^{pt})\equiv\neg{p}^{t_{1}}\vee({p}^{t_{1}}\wedge\neg{p}^{t_{2}})\vee({p}^{t_{1}}\wedge{p}^{t_{2}}\wedge{p}^{po}\wedge{p}^{pt})\vee({p}^{t_{1}}\wedge{p}^{t_{2}}\wedge\neg{p}^{po}\wedge\neg{p}^{pt})
$$
From this SDNF, we build an LBM as the logical layer on top of the neural networks $\mathcal{N}^{type}$, $\mathcal{N}^{po}$ and $\mathcal{N}^{pt}$. Figure 13 shows the overall network architecture.
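The construction can be sketched numerically. Taking a mutually exclusive DNF of the grounded rule $({p}^{t_{1}}\wedge{p}^{t_{2}})\rightarrow({p}^{po}\leftrightarrow{p}^{pt})$, each conjunct becomes one hidden unit, with weight $+c$ for a positive literal, $-c$ for a negated one, and hidden bias $-c(k-0.5)$ for $k$ positive literals. The decomposition used below and the choice $c=5$ are ours, for illustration; the check at the end confirms that the rule's satisfying assignments all have strictly lower free energy than the violating ones.

```python
import itertools
import numpy as np

c = 5.0  # confidence value (arbitrary for this sketch)
# Variable order: [p_t1, p_pt, p_po, p_t2].
# Conjuncts as (positive-literal indices, negated-literal indices):
conjuncts = [([], [0]),            # NOT p_t1
             ([0], [3]),           # p_t1 AND NOT p_t2
             ([0, 1, 2, 3], []),   # p_t1 AND p_pt AND p_po AND p_t2
             ([0, 3], [1, 2])]     # p_t1 AND p_t2 AND NOT p_po AND NOT p_pt

W = np.zeros((4, len(conjuncts)))
b = np.zeros(len(conjuncts))
for j, (pos, neg) in enumerate(conjuncts):
    W[pos, j] = c                   # +c per positive literal
    W[neg, j] = -c                  # -c per negated literal
    b[j] = -c * (len(pos) - 0.5)    # hidden bias -c(k - 0.5)

def free_energy(x):
    # F(x) = -sum_j log(1 + exp(x.W_j + b_j)), with zero visible biases.
    return -np.sum(np.log1p(np.exp(x @ W + b)))

def rule(x):
    t1, pt, po, t2 = x
    return (not (t1 and t2)) or (po == pt)

sat, unsat = [], []
for x in itertools.product([0, 1], repeat=4):
    (sat if rule(x) else unsat).append(free_energy(np.array(x)))

print(len(sat), len(unsat), max(sat) < min(unsat))
```

Note that the biases produced for these four conjuncts with $c=1$ are $0.5$, $-0.5$, $-3.5$ and $-1.5$, matching the values shown on the LBM nodes in Figure 13.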
Figure 13: Learning and reasoning about the PartOf relation in object images by grounding symbolic concepts into Convolutional Neural Networks and adding a logical layer in the form of an LBM module implementing the rule $({p}^{t_{1}}\wedge{p}^{t_{2}})→({p}^{po}\leftrightarrow{p}^{pt})$ .
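The correspondence between logical satisfiability and energy minimization can be checked exhaustively for this rule. The sketch below is a minimal illustration, assuming the variable order $(t_{1},t_{2},po,pt)$, literal weights of $\pm 1$ and the hidden biases $0.5$, $-0.5$, $-3.5$, $-1.5$ shown in Figure 13: it enumerates all 16 truth-value assignments and confirms that the energy, minimized over the hidden units, reaches its minimum of $-0.5$ exactly on the satisfying assignments of $({p}^{t_{1}}\wedge{p}^{t_{2}})→({p}^{po}\leftrightarrow{p}^{pt})$.

```python
import itertools

# One hidden unit per SDNF term; literal weights are +1 (positive
# literal) / -1 (negated literal), and the bias of a term with k
# positive literals is -(k - 0.5). Variable order: (t1, t2, po, pt).
TERMS = [
    ((-1, 0, 0, 0), 0.5),    # NOT t1
    ((1, -1, 0, 0), -0.5),   # t1 AND NOT t2
    ((1, 1, 1, 1), -3.5),    # t1 AND t2 AND po AND pt
    ((1, 1, -1, -1), -1.5),  # t1 AND t2 AND NOT po AND NOT pt
]

def energy(x):
    """RBM energy minimized over hidden states: -sum_j max(0, w_j.x + b_j)."""
    return -sum(max(0.0, sum(w * xi for w, xi in zip(ws, x)) + b)
                for ws, b in TERMS)

def satisfies(x):
    t1, t2, po, pt = (bool(v) for v in x)
    return (not (t1 and t2)) or (po == pt)

for x in itertools.product([0, 1], repeat=4):
    # Satisfying assignments sit at the energy minimum -0.5;
    # all violating assignments have energy 0.
    assert (energy(x) == -0.5) == satisfies(x)
```

Of the 16 assignments, the 14 satisfying ones all sit at energy $-0.5$ while the 2 violating ones sit at energy $0$, so sampling-based energy minimization over this landscape would favour exactly the satisfying assignments.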
Reasoning in the LBM can inform learning in $\mathcal{N}$ by backpropagating inferred knowledge to update the weights of the CNN, regressor or autoencoder. In particular, we train the entire system by minimizing the following loss function, where $[\mathcal{N}^{type}(x_{1},t_{1}),\mathcal{N}^{type}(x_{2},t_{2})]$ denotes the concatenation of the outputs from the $\mathcal{N}^{type}$ networks and $||x||^{2}_{2}$ is the squared Euclidean norm:
$$||\mathcal{N}^{po}(x_{1},x_{2})-\text{LBM}({p}^{po}|\mathcal{K}(x_{1},x_{2}))||^{2}_{2}+||[\mathcal{N}^{type}(x_{1},t_{1}),\mathcal{N}^{type}(x_{2},t_{2})]-\text{LBM}({p}^{t_{1}},{p}^{t_{2}}|\mathcal{K}(x_{1},x_{2}))||^{2}_{2}$$
where $x_{1},x_{2}$ and $\mathcal{K}(x_{1},x_{2})$ are obtained from the training data; $\mathcal{K}$ denotes the knowledge pertaining to $x_{1},x_{2}$ , i.e. the type of $x_{1}$ , type of $x_{2}$ , and whether $x_{1}$ is part of $x_{2}$ . We use $\text{LBM}({p}^{po}|\mathcal{K}(x_{1},x_{2}))$ and $\text{LBM}({p}^{t_{1}},{p}^{t_{2}}|\mathcal{K}(x_{1},x_{2}))$ to denote the application of LBM to infer the value of ${p}^{po}$ and of the pair $[{p}^{t_{1}},{p}^{t_{2}}]$ , respectively. For example, the LBM is used to infer ${p}^{po}$ , which is used in turn to update $\mathcal{N}^{po}$ .
Given, for instance, an image $x_{1}$ of a screen and an image $x_{2}$ of a TV monitor, let us assume that we do not know whether $x_{1}$ is part of $x_{2}$ . But if the $\mathcal{N}^{type}$ networks tell us that ${type}(x_{1},\text{Screen})\equiv{True}$ and ${type}(x_{2},\text{TV})\equiv{True}$ , and $\mathcal{K}(x_{1},x_{2})$ also includes ${part}(\text{Screen},\text{TV})$ , then the LBM can infer that ${p}^{po}$ should be $True$ . Finally, this signal from the LBM’s reasoning, obtained from the first term of the loss function, is used during learning to update the parameters of $\mathcal{N}^{po}$ . Similarly, the second term of the loss function leverages information about the type of the objects to update $\mathcal{N}^{type}$ .
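As a shape-only sketch of this loss (the numeric values below are made up; in the chapter the vectors come from the $\mathcal{N}$ networks and from LBM inference over $\mathcal{K}(x_{1},x_{2})$):

```python
def nesy_loss(n_po, n_type1, n_type2, lbm_po, lbm_types):
    """Squared-error loss: network outputs are pulled towards the
    truth values the LBM infers from the knowledge K(x1, x2)."""
    type_concat = n_type1 + n_type2  # [N^type(x1,t1), N^type(x2,t2)]
    sq = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sq(n_po, lbm_po) + sq(type_concat, lbm_types)

# Hypothetical values: the LBM infers p_po = True (encoded as 1.0)
# while N^po currently outputs 0.3; minimizing the loss pushes
# N^po towards the inferred truth value.
loss = nesy_loss([0.3], [0.9], [0.8], [1.0], [1.0, 1.0])
# loss = (0.3-1)^2 + (0.9-1)^2 + (0.8-1)^2 ≈ 0.54
```

In a full implementation the same quantity would be computed on tensors so that its gradient flows back into the $\mathcal{N}^{po}$ and $\mathcal{N}^{type}$ parameters.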
We compared the LBM-enhanced model with three other neurosymbolic systems: Deep Logic Networks (DLN) [50], Logic Tensor Networks (LTN) [11, 2], and Compositional Neural Logic Programming (CNLP) [51]. The task and data set were the same as in [11], except that the rule $({p}^{t_{1}}\wedge{p}^{t_{2}})→({p}^{po}\leftrightarrow{p}^{pt})$ was used only by LBM. The area under the curve (AUC) results shown in Table 3 indicate that the use of this single first-order rule for reasoning with the LBM, together with training of the Faster-RCNN model also used in [11], produces higher performance than LTN in the prediction of the part-of relation in images, and performance comparable to that of CNLP. For object type prediction, the LBM model performs better than CNLP, DLN and LTN.
Table 3: Comparison of neurosymbolic approaches; area under the curve (AUC) in the semantic image interpretation task. Source: [52].
| System | Part-of | Type |
| --- | --- | --- |
| DLN | $0.791± 0.032$ | $0.605± 0.024$ |
| CNLP | $0.816± 0.004$ | $\mathbf{0.644± 0.015}$ |
| LTN | $0.800$ | $0.598$ |
| LBM | $\mathbf{0.828± 0.002}$ | $\mathbf{0.645± 0.027}$ |
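For reference, the AUC metric in Table 3 can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (the rank, or Mann-Whitney, formulation of ROC AUC). A toy sketch with hypothetical scores:

```python
def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney formulation: the fraction of
    positive/negative pairs ranked correctly (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical part-of scores: three positive pairs, one negative pair.
auc = roc_auc([0.9, 0.8, 0.3, 0.4], [1, 1, 1, 0])
# two of the three positives outrank the negative: auc = 2/3
```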
6 Challenges for Neurosymbolic AI
We introduced an approach and neurosymbolic system to reason about symbolic knowledge in an energy-based neural network. Unlike most LLM and Chain-of-Thought (CoT) approaches, where reasoning capabilities are expected to emerge and are measured post-hoc using benchmarks, we use logic to provide a formal definition of reasoning. Only once a well-defined semantics is provided can one show a correspondence between networks and various forms of reasoning. We showed an equivalence between propositional logic and RBMs. This finding led to a novel system, named Logical Boltzmann Machine, integrating learning and reasoning in neural networks. Future work will focus on scaling up the application to SAT and learning from data and knowledge. Extensions include the use of weighted clauses and weighted SAT with a parallel implementation, as well as evaluations on relational learning tasks.
Equipped with a proof and algorithm showing how RBMs can implement a category of required constraints, it is possible to imagine how an RBM may be added as a module to an existing network imposing such constraints on the network. This RBM module becomes a verifiable component of the system, implementing for example a fairness or safety requirement as argued for in [16]. Next, we discuss how this simple idea may open up directions for research addressing some of the biggest challenges for current AI: data efficiency, fairness, safety and ultimately trust.
6.1 Nonmonotonic Logic
Consider the task of commonsense reasoning, the human-like ability to make sense of ordinary situations, such as making judgments about the nature of objects. It encompasses intuitive psychology (e.g. reasoning about intentions) and naive physics (a natural understanding of the physical world, including spatial and temporal reasoning). Commonsense reasoning requires an ability to jump to conclusions based on incomplete information, and to retract or revise such conclusions when more information becomes available. There have been many attempts to formalize commonsense, not least the work of John McCarthy, who coined the term Artificial Intelligence. Because it requires jumping to conclusions and therefore handling logical inconsistencies in a non-classical way, commonsense reasoning is modeled by nonmonotonic logics, some of which are undecidable. LLMs have achieved much better results than logical formalizations of commonsense on reasoning benchmarks. It turns out that it is easier to learn commonsense from data than to formalize it logically. Yet, making sense of what has been learned by LLMs has proven to be a daunting task, and achieving it might help tremendously with efforts to understand the limits of the formalization of commonsense in logic. As neural networks start to be deployed successfully in various fields of scientific discovery, achieving a true understanding of the processes at play will demand such an ability to explain the network’s reasoning.
6.2 Planning
AGI will require the ability to plan towards a goal and the ability to ask questions in order to achieve that goal. Planning requires the ability to break down goals into sub-goals while reasoning with partial information over time. Having a goal, in turn, requires the provision of a description of the current state and the goal state, and a mechanism that reduces the difference (e.g. some distance function) between the current state and the goal state by changing the current state in a reasonable way. If the mechanism is not working, one may decide to change the goal or even change the mechanism itself.
Whether in Google’s AlphaGo or Sudoku, the goal state is to win the game, and because it is a closed environment, simulation can be used to learn to minimize the distance function without the need for an explicit description. In open-ended situations, the problem becomes much harder. An explicit description is one that can be manipulated by asking questions: “what might happen if I were to make this or that change?” without making the change. An explicit description needs, therefore, to be amenable to symbolic manipulation. We argue that in open-ended scenarios, an explicit description needs furthermore to be abstracted from the situation given only a few examples. Reasoning in the form of symbol manipulation on that abstract description can then take place that will be of a different nature from reasoning carried out in terms of pattern matching with similarity and distance functions.
6.3 Learning from its Mistakes
An AGI system should be able to learn from its mistakes, interact with users by asking questions, describe its understanding and improve its performance in a controlled way towards a goal, even if changing the goal and the mechanism for performance improvement. Controlled at the level of its symbolic description, the system can be made safe. With reasoning taking place at both the pattern-matching and the more abstract levels, as advocated in D. Kahneman’s Thinking, Fast and Slow [21], the system will be able to adapt to novelty from only a few examples, check its understanding, multi-task and reuse knowledge from one task to another, thus improving data and energy efficiency in comparison with the demands of current AI systems.
Adapting to novelty (thereby addressing the long-tail distribution problem and the out-of-distribution problem in Machine Learning) requires creating compact representations (in the brain or the mind), but also being able to change that representation from time to time in order to obtain new insight. It is the change of representation that allows one to look at a problem from a new angle. Think of the difference between two computer programs, both correct for their purpose, one so-called spaghetti code and the other an example of what programmers like to refer to as beautiful code. The former may be faster to run, but the latter is neat, easy to understand and easy to update and reuse. This is the core challenge of the latest research in neurosymbolic AI: extraction of relevant descriptions at the right level of abstraction from complex neural networks, sound application of reasoning and learning with various forms of representation - spatial, temporal, epistemic, normative, multimodal, nonmonotonic - and efficient knowledge and data reuse and extrapolation to multiple tasks in different application domains.
Consider the kind of program learned by the latest GPT-based chatbots. Transformer neural networks work, in essence, by mapping sets to sets rather than sequences to sequences (GPT-based chatbots require positional encoding to handle sequences). Interestingly, in neurosymbolic AI, the computation of the semantics of various logics by neural networks is also done by mapping sets to sets recursively [34]. A great innovation of deep learning was multi-headed attention, which is akin to representing binary relations in neurosymbolic networks. In certain domains of application, though, such as protein interaction, the ability to represent not just binary but n-ary relations, such as bond(Compound_Name, Bond_Type, Atom_1, Atom_2), may be very useful. This is the case in the Mutagenesis task and data set, where a chemical compound may have a bond of a certain type between any two atoms and the goal is to identify mutagenic compounds given the atom bonds. Hypergraph neural networks have been used recently to represent and learn n-ary relations [10]. Finally, the neurosymbolic (NeSy) framework of fibring neural networks [14] has been shown recently to offer a common representational foundation for both graph neural networks and transformers via a proof of correspondence with propositional modal logic with self-fibring. These are some of the exciting recent developments in neurosymbolic AI (see the NeSy conference series for more [5]), contributing both to a better understanding of deep learning and to the development of new formalisms for learning and reasoning.
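As a toy illustration of this point (the compound and atom names below are hypothetical), an n-ary relation such as bond(Compound_Name, Bond_Type, Atom_1, Atom_2) is naturally a set of tuples, i.e. a hyperedge joining more than two nodes, rather than an adjacency structure over pairs:

```python
# Hypothetical Mutagenesis-style facts: each 4-tuple is one hyperedge.
bonds = {
    ("d1", "aromatic", "c1", "c2"),
    ("d1", "single", "c2", "h1"),
    ("d2", "double", "o1", "c3"),
}

def atoms_bonded_in(compound):
    """All atoms participating in any bond of the given compound."""
    return {a for (c, _, a1, a2) in bonds if c == compound for a in (a1, a2)}

atoms = atoms_bonded_in("d1")  # {"c1", "c2", "h1"}
```

Flattening such tuples into binary edges loses the grouping by bond, which is precisely the information hypergraph neural networks are designed to preserve.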
7 Conclusion
Many influential leaders have been pointing out the risks of current AI and arguing for the adoption of regulation. While it is clear that worldwide regulation is not achievable in the current geopolitical climate (see [30]), an alternative argument is that digital technology itself can offer, as part of an adequate accountability ecosystem, a new path to safer AI. In this new path, neural models can be validated symbolically by adopting the neurosymbolic cycle: train a little, reason a little, repeat. This is quite different from the current scale-is-all-you-need approach or what the EU AI Act has achieved. Regulation without accountability tends to increase weak competitiveness and may not decrease risks.
At first impression, the need for accountability in AI and the risks of current AI may seem to be quite disconnected from the technical contributions of this paper. However, we argue that the kind of formalization offered here is key to accountability, fairness and ultimately a safer AI. If neurosymbolic AI can show that compact network modules behave according to a given formal semantics then these modules can be composed in ways that will offer guarantees to the overall system. Of course, this continues to be an important research challenge, but results such as the ones reported in this paper point to an alternative to the current approach to AI, best illustrated by Figure 13, where a requirement (or a guardrail) can be implemented as a neurosymbolic network module on top of an existing complex network. For this reason, we conclude with a short summary and opinion on the need for accountability in AI.
The need for accountability in AI is now center stage, as indicated by the following quote from [20]: “A long-standing concern among analysts of AI development is the possibility of a race to the bottom in which multiple players feel pressure to neglect safety and security challenges in order to remain competitive. Perceptions - and therefore signals - are key variables in this scenario. Most actors would presumably prefer to have time to ensure their AI systems are reliable, but the desire to be first, the pressure to go to market, and the idea that competitors might be cutting corners can all push developers to be less cautious. Accordingly, signaling has an important role to play in mitigating race-to-the-bottom dynamics. Parties developing AI systems could emphasize their commitment to restraint, their focus on developing safe and trustworthy systems, or both. Ideally, credible signals on these points can reassure other parties that all sides are taking due care, mitigating pressure to race to the bottom”. In [36], the authors go further, arguing for an accountability in AI ecosystem. They propose to map out the general principles of AI into industry-specific mechanisms, having stated as early as 2021: “at present the ecosystem is unbalanced, which can be seen in the failures of certain mechanisms that have been attempted by leading technology companies. By taking an ecosystem perspective, we can identify certain elements that need developing and bolstering in order for the system as a whole to function effectively. Corporate governance mechanisms such as standardized processes and internal audit frameworks, leading up to potential external accreditation, need to be made to work together in ways that go beyond regulatory requirements, especially in technologies’ early period of evolution and deployment when regulation lags practice.”
As part of a case study carried out with a global software provider operating primarily in the gambling sector with a focus on online gambling, [36] reports on the use of AI to help reduce harm from gambling. The application of AI in responsible gambling has been a relevant use case because of the high regulatory focus, divergent regulatory perspectives worldwide, and a longstanding debate over ethical dilemmas relating to an increase in gambling addiction. Results are drawn from the risk profiling of gambling behavior using neural networks and explainability. The neural network performs classification of problem gambling. Explainable AI evaluates indirect gender bias and the need for algorithmic fairness. Results are analyzed in connection with the proposed accountability ecosystem and its operationalization. The AI accountability ecosystem has as stakeholders: corporate actors, market counterparts, civil society and government, alongside mechanisms such as internal auditing, external accreditation, investigative journalism, risk-based regulation and market shaping. Two key elements of the accountability ecosystem are discussed in detail: (i) interventions to reduce bias and (ii) increased transparency via model explainability. The benefits of having an industry-specific accountability process are illustrated in that it can be documented, reviewed, benchmarked, challenged and improved upon, “both to build trust that the underlying ethical principle is being taken seriously and to identify specific areas to do more.” [36]. The paper’s conclusions support the importance of industry-specific approaches in the operationalization of accountability principles in AI, noting how different metrics, priorities and accountability processes arise in online gambling compared to what might arise in other industries. Taken alongside relevant regulatory efforts on information security and privacy, accountability in AI is expected to reduce the risks of imbalances in regulation.
Widespread use of GPT-style chatbots is expected to increase productivity but also to magnify errors, as humans become complacent in the use of the technology. When trying to distinguish genuine from malicious websites, people have learned over the years to look for grammatical errors, the quality of images and other cues. Learning whether or not to trust the output of LLMs is much harder. At this unique juncture, when AI leaves the research laboratory and enters everyday life, new ways of doing the things that we are used to and take for granted will need to be decided upon and implemented quickly, until a better way of doing AI emerges that offers safety guarantees to AI systems.
References
- [1] Saeed Amizadeh, Sergiy Matusevych, and Markus Weimer. Learning to solve circuit-sat: An unsupervised differentiable approach. In ICLR, 2019.
- [2] Samy Badreddine, Artur d’Avila Garcez, Luciano Serafini, and Michael Spranger. Logic tensor networks. Artificial Intelligence, 303:103649, 2022.
- [3] Michael Bain and Stephen Muggleton. Learning optimal chess strategies. In K. Furukawa, D. Michie, and S. Muggleton, editors, Machine intelligence 13: machine intelligence and inductive learning, pages 291–309. Oxford University Press, Inc., New York, NY, USA, 1994.
- [4] Jeremias Berg, Antti Hyttinen, and Matti Jarvisalo. Applications of maxsat in data analysis. In Daniel Le Berre and Matti Jarvisalo, editors, Proceedings of Pragmatics of SAT 2015 and 2018, volume 59 of EPiC Series in Computing, pages 50–64. EasyChair, 2019.
- [5] Tarek R. Besold, Artur d’Avila Garcez, Ernesto Jiménez-Ruiz, Roberto Confalonieri, Pranava Madhyastha, and Benedikt Wagner, editors. Neural-Symbolic Learning and Reasoning - 18th International Conference, NeSy 2024, Barcelona, Spain, September 9-12, 2024, Proceedings, Part I, volume 14979 of Lecture Notes in Computer Science. Springer, 2024.
- [6] William W. Cohen, Fan Yang, and Kathryn Mazaitis. Tensorlog: Deep learning meets probabilistic dbs. CoRR, abs/1707.05390, 2017.
- [7] A. d’Anjou, M. Graña, F. J. Torrealdea, and M. C. Hernandez. Solving satisfiability via Boltzmann machines. IEEE Trans. Pattern Anal. Mach. Intell., 15(5):514–521, may 1993.
- [8] A. d’Avila Garcez, K. Broda, and D. Gabbay. Symbolic knowledge extraction from trained neural networks: A sound approach. Artif. Intel., 125(1–2):155–207, 2001.
- [9] A. d’Avila Garcez, L. C. Lamb, and D. M. Gabbay. Neural-Symbolic Cognitive Reasoning. Springer, 2009.
- [10] João Pedro Gandarela de Souza, Gerson Zaverucha, and Artur d’Avila Garcez. Hypergraph neural networks with logic clauses. In International Joint Conference on Neural Networks, IJCNN 2024, Yokohama, Japan, June 30 - July 5, 2024, pages 1–8. IEEE, 2024.
- [11] I. Donadello, L. Serafini, and A. S. d’Avila Garcez. Logic tensor networks for semantic image interpretation. In IJCAI-17, pages 1596–1602, 2017.
- [12] R. Evans and E. Grefenstette. Learning explanatory rules from noisy data. JAIR, 61:1–64, 2018.
- [13] M. França, G. Zaverucha, and A. d’Avila Garcez. Fast relational learning using bottom clause propositionalization with artificial neural networks. Mach. Learning, 94(1):81–104, 2014.
- [14] Artur d’Avila Garcez and Dov M. Gabbay. Fibring neural networks. In Proceedings of the 19th National Conference on Artifical Intelligence, AAAI’04, page 342–347. AAAI Press, 2004.
- [15] Artur d’Avila Garcez and Luís C. Lamb. Neurosymbolic AI: the 3 rd wave. Artif. Intell. Rev., 56(11):12387–12406, March 2023.
- [16] Eleonora Giunchiglia, Alex Tatomir, Mihaela Cătălina Stoian, and Thomas Lukasiewicz. Ccn+: A neuro-symbolic framework for deep learning with requirements. International Journal of Approximate Reasoning, 171:109124, 2024. Synergies between Machine Learning and Reasoning.
- [17] C. Hernandez, F. Albizuri, A. DAnjou, M. Graña, and Francisco Torrealdea. Efficient solution of max-sat and sat via higher order boltzmann. Revista Investigación Operacional, 22, 01 2001.
- [18] M. C. Hernandez, F. X. Albizuri, A. d’Anjou, M. Graña, and F. J. Torrealdea. High-order Boltzmann machines for MAX-SAT and SAT. In Stan Gielen and Bert Kappen, editors, ICANN ’93, pages 479–479, London, 1993. Springer London.
- [19] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.
- [20] Andrew Imbrie, Owen Daniels, and Helen Toner. Decoding intentions. https://cset.georgetown.edu/publication/decoding-intentions/, October 2023. Center for Security and Emerging Technology [Online; accessed 20-Jan-2025].
- [21] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, New York, 2011.
- [22] Marcel Kevin Tiepelt and Tilak Raj Singh. Finding pre-production vehicle configurations using a maxsat framework. In 18th International Configuration Workshop, pages 117–122. École des Mines d’Albi-Carmaux, 2016.
- [23] Ross D. King, Michael J. E. Sternberg, and Ashwin Srinivasan. Relating chemical activity to structure: An examination of ilp successes. New Generation Computing, 13(3), Dec 1995.
- [24] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
- [25] Mohit Kumar, Samuel Kolb, Stefano Teso, and Luc De Raedt. Learning max-sat from contextual examples for combinatorial optimisation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):4493–4500, Apr. 2020.
- [26] Hugo Larochelle, Michael Mandel, Razvan Pascanu, and Yoshua Bengio. Learning algorithms for the classification restricted boltzmann machine. J. Mach. Learn. Res., 13(1):643–669, March 2012.
- [27] Minghao Liu, Pei Huang, Fuqi Jia, Fan Zhang, Yuchen Sun, Shaowei Cai, Feifei Ma, and Jian Zhang. Can graph neural networks learn to solve the maxsat problem? (student abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 37(13):16264–16265, Sep. 2023.
- [28] Dmitry Malioutov and Kuldeep S. Meel. Mlic: A maxsat-based framework for learning interpretable classification rules. August 2018.
- [29] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural probabilistic logic programming. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3749–3759. Curran Associates, Inc., 2018.
- [30] Chris Miller. Chip War: The Fight for the World’s Most Critical Technology. Scribner, New York, 2022.
- [31] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2024.
- [32] Antonio Morgado, Mark Liffiton, and Joao Marques-Silva. Maxsat-based mcs enumeration. In Armin Biere, Amir Nahir, and Tanja Vos, editors, Hardware and Software: Verification and Testing, pages 86–101, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
- [33] Kwun Ho Ngan, James Phelan, Esma Mansouri-Benssassi, Joe Townsend, and Artur d’Avila Garcez. Closing the neural-symbolic cycle: Knowledge extraction, user intervention and distillation from convolutional neural networks. In Artur d’Avila Garcez, Tarek R. Besold, Marco Gori, and Ernesto Jiménez-Ruiz, editors, Proceedings of the 17th International Workshop on Neural-Symbolic Learning and Reasoning, La Certosa di Pontignano, Siena, Italy, July 3-5, 2023, volume 3432 of CEUR Workshop Proceedings, pages 19–43. CEUR-WS.org, 2023.
- [34] Simon Odense and Artur d’Avila Garcez. A semantic framework for neurosymbolic computation. Artif. Intell., 340:104273, 2025.
- [35] L. de Penning, A. d’Avila Garcez, L.C. Lamb, and J-J. Meyer. A neural-symbolic cognitive agent for online learning and reasoning. In IJCAI, pages 1653–1658, 2011.
- [36] Chris Percy, Simo Dragicevic, Sanjoy Sarkar, and Artur d’Avila Garcez. Accountability in AI: from principles to industry-specific accreditation. CoRR, abs/2110.09232, 2021.
- [37] G. Pinkas. Reasoning, nonmonotonicity and learning in connectionist networks that capture propositional knowledge. Artif. Intell., 77(2):203–247, 1995.
- [38] Gadi Pinkas. Symmetric neural networks and propositional logic satisfiability. Neural Comput., 3(2):282–291, June 1991.
- [39] Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107–136, February 2006.
- [40] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a SAT solver from single-bit supervision. In International Conference on Learning Representations, 2019.
- [41] Luciano Serafini and Artur d’Avila Garcez. Learning and reasoning with logic tensor networks. In AI*IA, pages 334–348, 2016.
- [42] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. ArXiv, abs/2305.17493, 2023.
- [43] Xujie Si, Xin Zhang, Radu Grigore, and Mayur Naik. Maximum satisfiability in software analysis: Applications and techniques. In Rupak Majumdar and Viktor Kuncak, editors, Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I, volume 10426 of Lecture Notes in Computer Science, pages 68–94. Springer, 2017.
- [44] P. Smolensky. Constituent structure and explanation in an integrated connectionist/symbolic cognitive architecture. In Connectionism: Debates on Psychological Explanation. 1995.
- [45] Volker Sperschneider. Bioinformatics: Problem Solving Paradigms. Springer Publishing Company, Incorporated, 2008.
- [46] A. Srinivasan. The Aleph manual. http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html, 2007. Accessed: 2021-01-23.
- [47] A. Srinivasan, S. H. Muggleton, R.D. King, and M.J.E. Sternberg. Mutagenesis: Ilp experiments in a non-determinate biological domain. In Proceedings of the 4th International Workshop on Inductive Logic Programming, volume 237 of GMD-Studien, pages 217–232, 1994.
- [48] Wolfgang Stammer, Antonia Wüst, David Steinmann, and Kristian Kersting. Neural concept binder. CoRR, abs/2406.09949, 2024.
- [49] G. Towell and J. Shavlik. Knowledge-based artificial neural networks. Artif. Intel., 70:119–165, 1994.
- [50] S. Tran and A. Garcez. Deep logic networks: Inserting and extracting knowledge from deep belief networks. IEEE T. Neur. Net. Learning Syst., (29):246–258, 2018.
- [51] Son N. Tran. Compositional neural logic programming. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 3059–3066. International Joint Conferences on Artificial Intelligence Organization, 8 2021. Main Track.
- [52] Son N. Tran and Artur d’Avila Garcez. Neurosymbolic reasoning and learning with restricted boltzmann machines. Proceedings of the AAAI Conference on Artificial Intelligence, 37(5):6558–6565, Jun. 2023.
- [53] Po-Wei Wang, Priya L. Donti, Bryan Wilder, and J. Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6545–6554. PMLR, 2019.
- [54] David Warde-Farley, Vinod Nair, Yujia Li, Ivan Lobov, Felix Gimeno, and Simon Osindero. Solving maxsat with matrix multiplication, 2023.
- [55] Y Xiang, D.Y Sun, W Fan, and X.G Gong. Generalized simulated annealing algorithm and its application to the thomson model. Physics Letters A, 233(3):216–220, 1997.
- [56] Fan Yang, Zhilin Yang, and William W Cohen. Differentiable learning of logical rules for knowledge base reasoning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2319–2328. Curran Associates, Inc., 2017.