## A NOVEL MULTI-STEP FINITE-STATE AUTOMATON FOR ARBITRARILY DETERMINISTIC TSETLIN MACHINE LEARNING
## A PREPRINT
## K. Darshana Abeyrathna
Centre for Artificial Intelligence Research University of Agder Grimstad, Norway darshana.abeyrathna@uia.no
## Rishad Shafik
Microsystems Research Group School of Engineering Newcastle University, UK rishad.shafik@newcastle.ac.uk
## Adrian Wheeldon
Microsystems Research Group School of Engineering Newcastle University, UK adrian.wheeldon@newcastle.ac.uk
## Ole-Christoffer Granmo
Centre for Artificial Intelligence Research University of Agder Grimstad, Norway ole.granmo@uia.no
## Alex Yakovlev
Microsystems Research Group School of Engineering Newcastle University, UK alex.yakovlev@newcastle.ac.uk
## Jie Lei
Microsystems Research Group School of Engineering Newcastle University, UK jie.lei@newcastle.ac.uk
## Morten Goodwin
Centre for Artificial Intelligence Research University of Agder Grimstad, Norway morten.goodwin@uia.no
July 7, 2020
## ABSTRACT
Due to the high energy consumption and scalability challenges of deep learning, there is a critical need to shift research focus towards dealing with energy consumption constraints. Tsetlin Machines (TMs) are a recent approach to machine learning that has demonstrated significantly reduced energy usage compared to comparable neural networks, while performing competitively accuracy-wise on several benchmarks. However, TMs rely heavily on energy-costly random number generation to stochastically guide a team of Tsetlin Automata (TAs) to a Nash equilibrium of the TM game. In this paper, we propose a novel finite-state learning automaton that can replace the Tsetlin Automata in TM learning, for increased determinism. The new automaton uses multi-step deterministic state jumps to reinforce sub-patterns. Simultaneously, flipping a coin to skip every dth state update ensures diversification by randomization. The d-parameter thus allows the degree of randomization to be finely controlled: d = 1 makes every update random, while d = ∞ makes the automaton completely deterministic. Our empirical results show that, overall, only substantial degrees of determinism reduce accuracy. Energy-wise, random number generation constitutes a significant part of the switching energy consumption of the TM, and high d-values save up to 11 mW of power for the larger datasets. We can thus use the new d-parameter to trade off accuracy against energy consumption, facilitating low-energy machine learning.
## 1 Introduction
State-of-the-art deep learning (DL) requires massive computational resources, resulting in high energy consumption [1] and scalability challenges [2]. There is thus a critical need to shift research focus towards dealing with energy consumption constraints [3]. Tsetlin Machines [4] (TMs) are a recent approach to machine learning (ML) that has demonstrated significantly reduced energy usage compared to comparable neural networks [5]. Using a linear combination of conjunctive clauses in propositional logic, the TM has obtained competitive performance in terms of accuracy [6, 7, 8], memory footprint [5, 8], energy [5], and learning speed [5, 8] on diverse benchmarks (image classification, regression, and natural language understanding). Furthermore, the rules that TMs build appear interpretable, similar to the branches in a decision tree (e.g., in the form: if X satisfies condition A and not condition B then Y = 1) [6]. The reported small memory footprint and low energy consumption make the TM particularly attractive for addressing the scalability and energy challenges in ML.
Recent progress on TMs. Recent research reports several distinct TM properties. The TM can be used in convolution, providing competitive performance on MNIST, Fashion-MNIST, and Kuzushiji-MNIST, in comparison with CNNs, K-Nearest Neighbor, Support Vector Machines, Random Forests, Gradient Boosting, BinaryConnect, Logistic Circuits and ResNet [8]. The TM has also achieved promising results in text classification by using the conjunctive clauses to capture textual patterns [6]. By introducing clause weights, it has been demonstrated that the number of clauses can be reduced by up to 50×, without loss of accuracy [9]. Further, hyper-parameter search can be simplified with multi-granular clauses, eliminating the pattern specificity parameter [10]. By indexing the clauses on the features that falsify them, up to an order of magnitude faster inference and learning has been reported [11]. Additionally, regression TMs compare favorably with Regression Trees, Random Forest Regression, and Support Vector Regression [7]. In [12], the TM is equipped with integer-weighted clauses, learnt by a stochastic searching on the line (SSL) automaton. The integer-weighted TM outperforms simple Multi-Layered Artificial Neural Networks, Decision Trees, Support Vector Machines, K-Nearest Neighbor, Random Forest, Gradient Boosted Trees (XGBoost), Explainable Boosting Machines (EBMs), as well as the standard TM. For continuous features, a scheme for adaptive threshold-based binarization using SSLs was proposed in [13]. Instead of using TAs to represent all unique thresholds, as in [14], merely two SSLs per feature adaptively adjust the thresholds. Finally, TMs have recently been shown to be fault-tolerant, completely masking stuck-at faults [15].
Paper Contributions. TMs rely heavily on energy-costly random number generation to stochastically guide a team of TAs to a Nash equilibrium of the TM game. In this paper, we propose a novel finite-state learning automaton that can replace the TAs of the TM, for increased determinism. The new automaton uses multi-step deterministic state jumps to reinforce sub-patterns. Simultaneously, flipping a coin to skip every dth state update ensures diversification by randomization. The d-parameter thus allows the degree of randomization to be finely controlled. We further evaluate the scheme empirically on five datasets, demonstrating that the new d-parameter can be used to trade off accuracy against energy consumption.
## 2 A Multi-Step Finite-State Learning Automaton
The origins of Learning Automata (LA) [16] can be traced back to the work of M. L. Tsetlin in the early 1960s [17]. The objective of an LA is to learn the optimal action through trial and error in a stochastic environment. Various types of LAs are available depending on the nature of the application [18]. Due to their computational simplicity, we here focus on two-action finite-state LA, which we extend by introducing a novel periodically changing structure (variable structure).
In general, the action that an LA performs next is decided by its present state (its memory). An LA interacts with its environment iteratively. In each iteration, the environment randomly produces a reward or a penalty in response to the action selected by the LA, according to an unknown probability distribution. If the LA receives a reward, it reinforces the performed action by moving to a 'deeper' state. If the action results in a penalty, the LA moves towards the middle state, weakening the performed action and ultimately jumping to the other action. In this manner, with a sufficient number of states, an LA converges to the action with the highest probability of producing rewards - the optimal action - with probability arbitrarily close to 1.0 [16].
The transitions between states can be deterministic or stochastic. Deterministic transitions occur with probability 1.0, while stochastic transitions are performed randomly according to a preset probability. If the transition probabilities change over time, we have a variable-structure automaton; otherwise, we have one with fixed structure. The pioneering Tsetlin Automaton (TA), depicted in Figure 1, is a deterministic fixed-structure finite-state automaton [17]. The state transition graph in the figure depicts a TA with 2N memory states. States 1 to N map to Action 1 and states N+1 to 2N map to Action 2.
Figure 1: Transition graph of a two-action Tsetlin Automaton with 2N memory states.
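To make the transition rules concrete, here is a minimal Python sketch of the two-action TA of Figure 1. The class and method names are illustrative, not taken from any TM implementation:

```python
class TsetlinAutomaton:
    """Two-action Tsetlin Automaton with 2n memory states (1..2n)."""

    def __init__(self, n):
        self.n = n
        self.state = n  # start at the shallow end of Action 1

    def action(self):
        return 1 if self.state <= self.n else 2

    def reward(self):
        # A reward deepens the current action: towards state 1 for
        # Action 1, towards state 2n for Action 2 (boundary states stay put).
        if self.state <= self.n:
            self.state = max(1, self.state - 1)
        else:
            self.state = min(2 * self.n, self.state + 1)

    def penalty(self):
        # A penalty moves towards the middle, eventually flipping the
        # action once the state crosses the N / N+1 boundary.
        self.state += 1 if self.state <= self.n else -1
```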
Figure 2: Transition graph of the Multi-Step Variable Structure Finite-State Learning Automaton.
While the TA changes state in single steps, the deterministic Krinsky Automaton introduces multi-step state transitions [16]. The purpose is to reinforce an action more strongly when it is rewarded, and more weakly when penalized. The Krinsky Automaton behaves as a TA when the response from the environment is a penalty. However, when it is a reward, any state from 2 to N transitions to state 1, and any state from N+1 to 2N-1 transitions to state 2N. In effect, N consecutive penalties are needed to offset a single reward.
Another variant of LA is the Krylov Automaton, which makes both deterministic and stochastic single-step transitions [16]. The state transitions of the Krylov Automaton are identical to those of a TA for rewards. However, when it receives a penalty, it performs the corresponding TA state change randomly, with probability 0.5.
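Both variants can be sketched as one-line modifications of the TA update above (again illustrative, assuming the state convention of Figure 1):

```python
import random

def krinsky_reward(state, n):
    # A reward jumps directly to the deepest state of the current action.
    return 1 if state <= n else 2 * n

def krylov_penalty(state, n):
    # A penalty performs the ordinary TA penalty step, but only with
    # probability 0.5; otherwise the state is left unchanged.
    if random.random() < 0.5:
        return state + 1 if state <= n else state - 1
    return state
```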
We now introduce our new type of LA, the multi-step variable-structure finite-state LA (MVF-LA), shown in Figure 2. The MVF-LA has two kinds of feedback, strong and weak. As covered in the next section, strong feedback is required by the TM to strongly reinforce frequent sub-patterns, while weak feedback is required to make the TM forget infrequent ones. To achieve this, weak feedback only triggers one-step transitions, whereas strong feedback triggers s-step transitions. Thus, a single strong feedback is offset by s instances of weak feedback. Further, the MVF-LA has a variable structure that changes periodically. That is, the MVF-LA switches between two different transition graph structures, one deterministic and one stochastic. The deterministic structure is as shown in the figure, while the stochastic structure introduces a transition probability of 0.5 for every transition. The switch between structures is performed so that every dth transition is stochastic, while the remaining transitions are deterministic.
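The following is a minimal sketch of the MVF-LA update, under our reading of Figure 2. The reinforce() interface, the action indices, and the handling of d = ∞ are our own illustrative choices, not the authors' implementation:

```python
import random

class MVFLA:
    """Multi-step variable-structure finite-state LA with 2n states."""

    def __init__(self, n, s, d):
        self.n, self.s, self.d = n, s, d  # states per action, strong step size, determinism
        self.state = n                    # start at the Action 1 / Action 2 boundary
        self.t = 0                        # transition counter

    def action(self):
        return 1 if self.state <= self.n else 2

    def _permitted(self):
        # Every d'th transition is stochastic: it is performed with
        # probability 0.5. All remaining transitions are deterministic.
        self.t += 1
        if self.d != float("inf") and self.t % self.d == 0:
            return random.random() < 0.5
        return True

    def reinforce(self, target_action, strong):
        # Move towards the deep end of target_action: s steps on strong
        # feedback, a single step on weak feedback.
        if not self._permitted():
            return
        step = self.s if strong else 1
        if target_action == 1:
            self.state = max(1, self.state - step)
        else:
            self.state = min(2 * self.n, self.state + step)
```

With d = 1, every update is stochastic, recovering TM-style randomized learning; with d = ∞, the automaton never consults the random number generator.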
## 3 The Arbitrarily Deterministic TM (ADTM)
In this section, we introduce the details of the ADTM, shown in Figure 3, where the TA is replaced with the new MVF-LA. The purpose of the ADTM is to control stochasticity, thus allowing management of energy consumption.
## 3.1 ADTM Inference
Input Features. Like the TM, an ADTM takes a feature vector of $o$ propositional variables as input, $X = [x_1, x_2, x_3, \ldots, x_o]$, to be classified into one of two classes, $y = 0$ or $y = 1$. These features are extended with their negations to produce a set of literals: $L = [x_1, x_2, \ldots, x_o, \lnot x_1, \lnot x_2, \ldots, \lnot x_o] = [l_1, l_2, \ldots, l_{2o}]$.
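For 0/1-valued inputs, the literal vector is simply the input concatenated with its elementwise negation, as in this one-line helper (our own naming):

```python
def literals(X):
    # X is a list of 0/1 propositional inputs; the negated literals
    # are appended after the originals, giving 2o literals in total.
    return X + [1 - x for x in X]

assert literals([1, 0, 1]) == [1, 0, 1, 0, 1, 0]
```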
Figure 3: The ADTM structure.
Clauses. Patterns are represented by $m$ conjunctive clauses. As shown for Clause 1 in the figure, a clause in the TM comprises $2o$ TAs, each controlling the inclusion of a specific literal. Let the set $I_j \subseteq \{1, \ldots, 2o\}$ denote the indexes of the literals that are included in clause $j$. When evaluating clause $j$ on the input literals $L$, the included literals are ANDed: $c_j = \bigwedge_{k \in I_j} l_k$, $j = 1, \ldots, m$. Note that the output of an empty clause, $I_j = \emptyset$, is 1 during learning and 0 during inference.
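As a small sketch (our own naming), clause evaluation reduces to an AND over the included literals, with the empty-clause convention made explicit:

```python
def clause_output(L, include, learning=False):
    # `include` is the index set I_j of literals included in the clause.
    if not include:
        return 1 if learning else 0  # empty clause: 1 in learning, 0 in inference
    return int(all(L[k] for k in include))
```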
Classification. In order to identify the sub-patterns associated with each of the two classes of a two-class ADTM, the clauses are grouped in two. The number of clauses employed is a user-set parameter $m$. Half of the clauses are assigned positive polarity ($c_j^+$); the other half are assigned negative polarity ($c_j^-$). The clause outputs, in turn, are combined into a classification decision through summation and thresholding using the unit step function $u(v) = 1 ~\text{if}~ v \ge 0 ~\text{else}~ 0$:

$$\hat{y} = u\left(\sum_{j=1}^{m/2} c_j^{+}(X) - \sum_{j=1}^{m/2} c_j^{-}(X)\right). \tag{1}$$

That is, classification is based on a majority vote, with the positive clauses voting for $y = 1$ and the negative clauses for $y = 0$.
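Using clause_output() from the sketch above, the vote in Eqn. 1 can be written as:

```python
def classify(L, positive_includes, negative_includes):
    # positive_includes / negative_includes: lists of index sets, one per clause.
    v = (sum(clause_output(L, I) for I in positive_includes)
         - sum(clause_output(L, I) for I in negative_includes))
    return 1 if v >= 0 else 0  # unit step u(v)
```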
## 3.2 The MVF-LA Game and Orchestration Scheme
The MVF-LAs in the ADTM are updated by so-called Type I and Type II feedback. The type of feedback is decided by the class of the current training sample (X, y) and the polarity of the clause (positive or negative). Clauses with positive polarity receive Type I feedback when the target output is y = 1, and Type II feedback when the target output is y = 0. For clauses with negative polarity, Type I feedback replaces Type II, and vice versa. In the following, we focus only on clauses with positive polarity.
Type I feedback: The number of clauses that receive Type I feedback is controlled by selecting them stochastically according to Eqn. 2:

$$p_{\mathrm{Type\,I}} = \frac{T - \max(-T, \min(T, v))}{2T}. \tag{2}$$
Above, $v = \sum_{j=1}^{m/2} c_j^{+}(X) - \sum_{j=1}^{m/2} c_j^{-}(X)$ is the aggregated clause output and $T$ is a user-set parameter that decides how many clauses should be involved in learning a particular sub-pattern. Further, Type I feedback consists of two kinds of sub-feedback: Type Ia and Type Ib. Type Ia feedback stimulates recognition of patterns by reinforcing the include action of MVF-LAs whose corresponding literal value is 1, but only when the clause output is also 1. Note that an action is reinforced either by rewarding the action itself, or by penalizing the other action. Type Ia feedback is strong, with step size s (cf. Figure 2). Type Ib feedback combats over-fitting by reinforcing the exclude action of MVF-LAs when the corresponding literal is 0 or when the clause output is 0. Type Ib feedback is weak, to facilitate learning of frequent patterns (cf. Figure 2).
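Continuing the MVF-LA sketch from Section 2, Type I feedback for a single positive-polarity clause might look as follows. The action-to-include mapping and the per-clause sampling are our own illustrative assumptions:

```python
import random

EXCLUDE, INCLUDE = 1, 2  # assumed mapping of the two MVF-LA actions

def type_i(las, L, clause_out, v, T):
    # `las` holds one MVF-LA per literal; the clause is selected for
    # feedback with the probability given by Eqn. 2.
    p = (T - max(-T, min(T, v))) / (2 * T)
    if random.random() > p:
        return
    for k, la in enumerate(las):
        if clause_out == 1 and L[k] == 1:
            la.reinforce(INCLUDE, strong=True)   # Type Ia: strong, s-step
        else:
            la.reinforce(EXCLUDE, strong=False)  # Type Ib: weak, one step
```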
Table 1: Performance of TM and ADTM with different d on Bankruptcy dataset.
| | TM | ADTM | ADTM | ADTM | ADTM | ADTM | ADTM |
|------|-------|--------|--------|--------|--------|--------|--------|
| | TM | d=1 | d=10 | d=100 | d=500 | d=1000 | d=5000 |
| F1 | 0.998 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 | 0.988 |
| Acc. | 0.998 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 | 0.987 |
Figure 4: Training and testing accuracy per epoch on the Bankruptcy dataset.
Type II feedback: Clauses are also selected stochastically for receiving Type II feedback, according to Eqn. 3:

$$p_{\mathrm{Type\,II}} = \frac{T + \max(-T, \min(T, v))}{2T}. \tag{3}$$
Type II feedback combats false positive clause output by seeking to alter clauses that output 1 so that they instead output 0 . This is achieved simply by reinforcing inclusion of literals of value 0 . Thus, when the clause output is 1 and the corresponding literal value of an MVF-LA is 0 , the include action of the MVF-LA is reinforced. Type II feedback is strong , with step size s . Recall that in all of the above MVF-LA update steps, the parameter d decides the determinism of the updates.
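Mirroring the Type I sketch above (and reusing its random import, EXCLUDE/INCLUDE constants, and reinforce() interface), Type II feedback for a positive-polarity clause can be sketched as:

```python
def type_ii(las, L, clause_out, v, T):
    # Selection follows Eqn. 3; feedback only alters clauses that output 1.
    p = (T + max(-T, min(T, v))) / (2 * T)
    if random.random() > p or clause_out == 0:
        return
    for k, la in enumerate(las):
        if L[k] == 0:
            la.reinforce(INCLUDE, strong=True)  # strong, s-step inclusion
```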
## 4 Empirical Evaluation
We now study the performance of ADTM empirically using five real-world datasets. As baselines, ADTM is compared against regular TMs and seven other state-of-the-art machine learning approaches: Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Decision Trees (DTs), K-Nearest Neighbor (KNN), Random Forest (RF), Gradient Boosted Trees (XGBoost) [19], and Explainable Boosting Machines (EBMs) [20]. For comprehensiveness, three ANN architectures are used: ANN-1, with one hidden layer of 5 neurons; ANN-2, with two hidden layers of 20 and 50 neurons; and ANN-3, with three hidden layers of 20, 150, and 100 neurons. The performance of these predictive models is summarized in Table 6. We compute both F1-score (F1) and accuracy (Acc.) as performance measures. However, due to class imbalance, we emphasize F1-score when comparing the performance of the different predictive models.
## 4.1 Bankruptcy
The Bankruptcy dataset contains historical records of 250 companies 1 . The outcome, Bankruptcy or Non-bankruptcy, is characterized by six categorical features. We thus binarize the features using thresholding [14] before feeding them into the ADTM. We first tune the hyper-parameters of the TM, with the best performance obtained for m = 100 (number of clauses), s = 3 (step size for MVF-LA), and T = 10 (summation target). Each MVF-LA contains 100 states per action. The impact of determinism is reported in Table 1, for varying levels of determinism. As seen, performance is indistinguishable for d-values 1, 10, and 100, where the ADTM achieves its highest classification accuracy. However, notice the slight decrease of F1-score and accuracy when determinism is further increased to 500, 1000, and 5000.
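As an illustration of threshold-based binarization in the spirit of [14] (the exact encoding convention in [14] may differ), each unique value of a feature can serve as a threshold, yielding one bit per threshold:

```python
def binarize_column(column):
    # One bit per unique value: bit t is set when the raw value reaches
    # that threshold. Categorical codes are assumed to be ordered integers.
    thresholds = sorted(set(column))
    return [[int(v >= t) for t in thresholds] for v in column]

assert binarize_column([2, 1, 3]) == [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
```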
Figure 4 shows how training and testing accuracy evolve over the training epochs. Only high determinism seems to influence learning speed and accuracy significantly.
1 Available from https://archive.ics.uci.edu/ml/datasets/qualitative_bankruptcy.
Table 2: Performance of TM and ADTM with different d on Balance Scale dataset.
| | TM | ADTM | ADTM | ADTM | ADTM | ADTM | ADTM |
|------|-------|--------|--------|--------|--------|--------|--------|
| | TM | d=1 | d=10 | d=100 | d=500 | d=1000 | d=5000 |
| F1 | 0.945 | 0.982 | 0.983 | 0.982 | 0.968 | 0.951 | 0.911 |
| Acc. | 0.948 | 0.980 | 0.981 | 0.980 | 0.935 | 0.894 | 0.793 |
Figure 5: Training and testing accuracy per epoch on the Balance Scale dataset.
The performance of the other considered machine learning models is compiled in Table 6. The best performance in terms of F1-score among the other models is obtained by ANN-3. However, ANN-3 is outperformed by the ADTM for all d-values except d = 5000.
## 4.2 Balance Scale
The Balance Scale dataset 2 contains three classes: balance scale tipped to the right, tipped to the left, or in balance. The class is decided by the size of the weights on both sides of the scale and the distance of each weight from the center. Hence, the classes are characterized by four features. However, to make the output binary, we remove the 'balanced' class, ending up with 576 data samples. The ADTM is equipped with 100 clauses, and each MVF-LA is given 100 states per action. The remaining two parameters, s and T, are fixed at 3 and 10, respectively. Table 2 contains the results of the TM and ADTM on the Balance Scale dataset. Even though the ADTM uses the same number of clauses as the TM, its F1-score and accuracy are better when all the MVF-LA updates are stochastic (d = 1). The performance of the ADTM remains the same until the determinism parameter surpasses 100; after that, performance degrades gradually.
The progress of training and testing accuracy per epoch can be found in Figure 5. Each ADTM setup reaches its peak training and testing accuracy and stabilizes within a few training epochs. As can be seen, accuracy is maintained up to d = 100, thus reducing random number generation to 1% without accuracy loss. Among the results listed in Table 6 for the other machine learning approaches, EBM achieves the highest F1-score and accuracy.
## 4.3 Breast Cancer
The Breast Cancer dataset 3 contains 286 patient records related to the recurrence of breast cancer (201 without recurrence and 85 with recurrence). The recurrence of breast cancer is to be estimated using nine features: Age, Menopause, Tumor Size, Inv Nodes, Node Caps, Deg Malig, Side (left or right), the Position of the Breast, and whether Irradiated or not. Some of the patient samples have missing feature values; these samples are removed from the dataset in the present experiment.
The ADTM is arranged with the following parameter setup: m = 100, s = 5, T = 10, and 100 states per MVF-LA action. The classification accuracy of the TM and ADTM is summarized in Table 3. In contrast to the previous two datasets, the performance of both the TM and ADTM is considerably lower, and it further decreases with increasing determinism. However, the F1-scores obtained by all the other considered machine learning models are also low, i.e., less than 0.500. The highest F1-score among them is obtained by ANN-1 and KNN, both of which are lower than what the ADTM achieves for d-values up to 100.
2 Available from http://archive.ics.uci.edu/ml/datasets/balance+scale.
3 Available from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer
Table 3: Performance of TM and ADTM with different d on Breast Cancer dataset.
| | TM | ADTM | ADTM | ADTM | ADTM | ADTM | ADTM |
|------|-------|--------|--------|--------|--------|--------|--------|
| | TM | d=1 | d=10 | d=100 | d=500 | d=1000 | d=5000 |
| F1 | 0.531 | 0.568 | 0.531 | 0.501 | 0.490 | 0.501 | 0.488 |
| Acc. | 0.703 | 0.702 | 0.698 | 0.691 | 0.690 | 0.690 | 0.693 |
Figure 6: Training and testing accuracy per epoch on the Breast Cancer dataset.
The training and testing accuracy progress per epoch is reported in Figure 6, showing a clear degradation of performance with increasing determinism.
Table 4: Performance of TM and ADTM with different d on Liver Disorders dataset.
| | TM | ADTM | ADTM | ADTM | ADTM | ADTM | ADTM |
|------|-------|--------|--------|--------|--------|--------|--------|
| | TM | d=1 | d=10 | d=100 | d=500 | d=1000 | d=5000 |
| F1 | 0.648 | 0.705 | 0.694 | 0.692 | 0.692 | 0.689 | 0.692 |
| Acc. | 0.533 | 0.610 | 0.610 | 0.612 | 0.612 | 0.610 | 0.611 |
Figure 7: Training and testing accuracy per epoch on the Liver Disorders dataset.
## 4.4 Liver Disorders
The Liver Disorders dataset 4 was created by BUPA Medical Research and Development Ltd. (hereafter 'BMRDL') during the 1980s as part of a larger health-screening database. The dataset consists of 7 attributes. However, McDermott and Forsyth [21] claim that many researchers have used the dataset incorrectly, treating the Selector attribute as the class label. Based on the recommendation of McDermott and Forsyth, we instead use the Number of Half-Pint Equivalents of Alcoholic Beverages as the dependent variable, binarized using the threshold ≥ 3. The Selector attribute is discarded. The remaining attributes represent the results of various blood tests, and we use them as features.
Here, the ADTM is given 10 clauses per class, with s = 3 and T = 10. Each MVF-LA action possesses 100 states. The performance of the ADTM for different levels of determinism is summarized in Table 4.
4 Available from https://archive.ics.uci.edu/ml/datasets/Liver+Disorders.
Table 5: Performance of TM and ADTM with different d on Heart Disease dataset.
| | TM | ADTM | ADTM | ADTM | ADTM | ADTM | ADTM |
|------|-------|--------|--------|--------|--------|--------|--------|
| | TM | d=1 | d=10 | d=100 | d=500 | d=1000 | d=5000 |
| F1 | 0.687 | 0.759 | 0.766 | 0.767 | 0.760 | 0.762 | 0.605 |
| Acc. | 0.672 | 0.778 | 0.780 | 0.783 | 0.773 | 0.781 | 0.633 |
Figure 8: Training and testing accuracy per epoch on the Heart Disease dataset.
For d = 1, the F1-score of the ADTM is better than what is achieved with the standard TM. In contrast to the performance on the previous datasets, the F1-score of the ADTM on the Liver Disorders dataset does not decrease significantly with d; instead, it fluctuates around 0.690. Unlike on the other datasets, the ADTM with d = 1 actually requires more training rounds than for larger d-values before it learns the final MVF-LA actions. The ADTM with d = 1 is also unable to reach a similar level of training accuracy compared to higher d-values. Despite the diverse learning speeds, testing accuracy becomes similar after roughly 50 training rounds. The other considered machine learning models obtain somewhat similar F1-scores on this dataset, with DT, RF, and EBM peaking with scores higher than 0.700.
## 4.5 Heart Disease
The Heart Disease dataset 5 concerns the prediction of heart disease. To this end, 13 features are available, selected from a total of 75. Out of the 13 features, 6 are real-valued, 3 are binary, 3 are nominal, and one is ordered. In this case, the ADTM is built on 100 clauses. The number of state transitions when the feedback is strong, s, is set to 3, while the target, T, is set to 10. The number of states per MVF-LA action is 100. As can be seen in Table 5, the ADTM provides better performance than the TM in terms of F1-score and accuracy when d = 1. The F1-score increases with the d-value and peaks at d = 100; it then fluctuates, reaching its lowest value of 0.605 at d = 5000.
Figure 8 shows that the training and testing accuracy for d = 5000 is consistently low over the training epochs, compared to the other d-values. The other d-values, after initial variations, perform similarly from roughly the 75th epoch onward. Among the other machine learning algorithms, EBM provides the best F1-score. Even though ANN-1, ANN-2, DT, RF, and XGBoost obtain better F1-scores than the TM, the F1-scores of the ADTM for d equal to 1, 10, 100, 500, and 1000 are higher still, surpassed only by EBM.
## 5 Effects of Determinism on Energy Consumption
In the hardware implementation of the TM, power is consumed by the pseudorandom number generators (PRNGs) when generating a new random number [5]. This is referred to as switching power. In the TM, every TA update is randomized, and switching power is consumed by the PRNGs on every cycle. Additionally, power is consumed by the PRNGs whilst idle. We term this leakage power. Leakage power is always consumed whilst the PRNGs are powered up, even when they are not generating new numbers.
In the ADTM with the hybrid TA, where the determinism parameter d is introduced, d = 1 is equivalent to a TM where every TA update is randomized, while d = ∞ means the ADTM is fully deterministic and no random numbers are required from the PRNG. If a TA update is randomized only on every d'th cycle, the PRNGs need only be actively switched (and therefore consume switching power) for a 1/d fraction of the entire training procedure. The switching power consumed by the PRNGs accounts for 7% of the total system power when using a traditional TA (equivalent to d = 1).
5 Available from https://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29.
Table 6: Classification accuracy of state-of-the-art machine learning models on all the datasets.
| Model | Bankruptcy F1 | Bankruptcy Acc. | Balance Scale F1 | Balance Scale Acc. | Breast Cancer F1 | Breast Cancer Acc. | Liver Disorder F1 | Liver Disorder Acc. | Heart Disease F1 | Heart Disease Acc. |
|---------|------|------|------|------|------|------|------|------|------|------|
| ANN-1 | 0.995 | 0.994 | 0.990 | 0.990 | 0.458 | 0.719 | 0.671 | 0.612 | 0.738 | 0.772 |
| ANN-2 | 0.996 | 0.995 | 0.995 | 0.995 | 0.403 | 0.683 | 0.652 | 0.594 | 0.742 | 0.769 |
| ANN-3 | 0.997 | 0.997 | 0.995 | 0.995 | 0.422 | 0.685 | 0.656 | 0.602 | 0.650 | 0.734 |
| DT | 0.993 | 0.993 | 0.986 | 0.986 | 0.276 | 0.706 | 0.728 | 0.596 | 0.729 | 0.781 |
| SVM | 0.994 | 0.994 | 0.887 | 0.887 | 0.384 | 0.678 | 0.622 | 0.571 | 0.679 | 0.710 |
| KNN | 0.995 | 0.994 | 0.953 | 0.953 | 0.458 | 0.755 | 0.638 | 0.566 | 0.641 | 0.714 |
| RF | 0.949 | 0.942 | 0.859 | 0.860 | 0.370 | 0.747 | 0.729 | 0.607 | 0.713 | 0.774 |
| XGBoost | 0.983 | 0.983 | 0.931 | 0.931 | 0.367 | 0.719 | 0.656 | 0.635 | 0.701 | 0.788 |
| EBM | 0.993 | 0.992 | 1.000 | 1.000 | 0.389 | 0.745 | 0.710 | 0.629 | 0.783 | 0.824 |
Table 7: Comparative power per datapoint with two different d values.
| | Bankruptcy | Breast Cancer | Balance Scale | Liver Disorder | Heart Disease |
|----------------|--------------|-----------------|-----------------|------------------|-----------------|
| Power (d=1) | 6.94 mW | 15.8 mW | 7.7 mW | 12.6 mW | 148 mW |
| Power (d=5000) | 6.45 mW | 14.7 mW | 7.2 mW | 11.8 mW | 137.6 mW |
With d = 100, this is reduced to 0.07% of the system power, and with d = 5000 it is reduced further to 0.001%. As d increases in the ADTM, the switching power consumed by the PRNGs thus tends to zero.
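To illustrate why the switching activity scales as 1/d, the following minimal Python sketch implements a two-action automaton in which only every d'th update consults the PRNG, via a coin flip that may skip the update. The class name, jump size, and skip rule are assumptions for illustration, not the paper's exact MVF-LA specification.

```python
import random

class HybridTA:
    """Sketch of a two-action automaton with n_states per action.
    Updates are deterministic multi-step jumps, except every d'th
    update, which a coin flip may skip."""

    def __init__(self, n_states=100, step=3, d=100):
        self.n_states = n_states   # states per action
        self.step = step           # deterministic multi-step jump size
        self.d = d                 # determinism parameter
        self.state = n_states      # start at the action boundary
        self.updates = 0           # counter locating every d'th update

    def action(self):
        # States 1..n_states select action 0, the rest action 1.
        return 0 if self.state <= self.n_states else 1

    def update(self, reward):
        self.updates += 1
        if self.updates % self.d == 0 and random.random() < 0.5:
            return  # only here is the PRNG switched: once per d updates
        # Reward moves deeper into the current action; penalty moves
        # towards the opposite action.
        if self.action() == 0:
            delta = -self.step if reward else self.step
        else:
            delta = self.step if reward else -self.step
        self.state = min(max(self.state + delta, 1), 2 * self.n_states)
```

Since the PRNG is consulted on only one update in d, its switching activity, and hence its switching power, falls off as 1/d.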
In the special case of d = ∞, the PRNGs are no longer required for TA updates since the TAs are fully deterministic. We can therefore omit the PRNGs from the design and prevent their leakage power from being consumed. The leakage power of the PRNGs accounts for 32% of the total system power. On top of the switching power savings, this amounts to 39% of the system power, meaning large power, and therefore energy, savings can be made with the ADTM.
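A simple back-of-the-envelope model ties these figures together: switching power scales with 1/d, while leakage persists until the PRNGs are removed at d = ∞. The 7% and 32% shares are taken from the text; the linear scaling and the function name are assumptions.

```python
def prng_power_share(d, switching=0.07, leakage=0.32):
    """Approximate PRNG share of total system power for a given d,
    using the 7% switching (at d = 1) and 32% leakage shares above."""
    if d == float("inf"):
        return 0.0  # PRNGs omitted entirely: no switching, no leakage
    return switching / d + leakage

for d in (1, 100, 5000, float("inf")):
    print(f"d = {d}: PRNGs use about {prng_power_share(d):.4%} of system power")
```

For d = 1 this reproduces the combined 39% share quoted above; for d = ∞ the PRNG contribution vanishes entirely.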
Table 7 shows the comparative training power consumption per datapoint (i.e., with all TAs being updated concurrently) for two different d-values: d = 1 and d = 5000. Typically, the overall power is higher for larger datasets, as they require an increased number of concurrent TAs as well as PRNGs. As can be seen, increasing the d-value reduces the power consumption by roughly 11 mW in the case of the Heart Disease dataset. This saving is made by reducing the switching activity in the PRNGs, as explained above. Larger d-values yield further savings as the concurrent switching activity of the PRNGs is reduced.
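As a quick sanity check, the savings implied by Table 7 can be computed directly; on every dataset the relative saving is close to the 7% switching-power share quoted earlier.

```python
# Power per datapoint from Table 7 (mW) at d = 1 and d = 5000.
power_mw = {
    "Bankruptcy":     (6.94, 6.45),
    "Breast Cancer":  (15.8, 14.7),
    "Balance Scale":  (7.7, 7.2),
    "Liver Disorder": (12.6, 11.8),
    "Heart Disease":  (148.0, 137.6),
}
for dataset, (p_d1, p_d5000) in power_mw.items():
    saving = p_d1 - p_d5000
    print(f"{dataset}: {saving:.2f} mW saved ({saving / p_d1:.1%})")
```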
## 6 Conclusion
In this paper, we proposed a novel finite-state learning automaton (MVF-LA) that can replace the Tsetlin Automaton in TM learning, for increased determinism and thus reduced energy usage. The new automaton uses multi-step deterministic state jumps to reinforce sub-patterns, while flipping a coin to skip every d'th state update ensures diversification by randomization. The new d-parameter thus allows the degree of randomization to be finely controlled: d = 1 makes every update random, and d = ∞ makes the automaton fully deterministic. Our empirical results show that, overall, only substantial degrees of determinism reduce accuracy. Energy-wise, the pseudorandom number generators contribute to the switching energy consumption of the TM, which can be completely eliminated with d = ∞. We can thus use the new d-parameter to trade off accuracy against energy consumption, facilitating low-energy machine learning.
## Acknowledgement
The authors gratefully acknowledge the contributions from Jonathan Edwards at Temporal Computing on strategies for deterministic Tsetlin Machine learning.
## References
- [1] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP. In ACL , 2019.
- [2] J. Chen and X. Ran. Deep Learning With Edge Computing: A Review. Proc. of the IEEE , 107(8):1655-1674, 2019.
- [3] Eva García-Martín, Crefeda Faviola Rodrigues, Graham Riley, and Håkan Grahn. Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing, 134:75-88, 2019.
- [4] Ole-Christoffer Granmo and B John Oommen. Solving Stochastic Nonlinear Resource Allocation Problems Using a Hierarchy of Twofold Resource Allocation Automata. IEEE Transactions on Computers, 2010.
- [5] Adrian Wheeldon, Rishad Shafik, Tousif Rahman, Jie Lei, Alex Yakovlev, and Ole-Christoffer Granmo. Learning Automata based Energy-efficient AI Hardware Design for IoT. Philosophical Transactions of the Royal Society A , 2020.
- [6] Geir Thore Berge, Ole-Christoffer Granmo, Tor Oddbjørn Tveit, Morten Goodwin, Lei Jiao, and Bernt Viggo Matheussen. Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications. IEEE Access , 7:115134-115146, 2019.
- [7] K. Darshana Abeyrathna, Ole-Christoffer Granmo, Xuan Zhang, Lei Jiao, and Morten Goodwin. The Regression Tsetlin Machine - A Novel Approach to Interpretable Non-Linear Regression. Philosophical Transactions of the Royal Society A , 378, 2019.
- [8] Ole-Christoffer Granmo, Sondre Glimsdal, Lei Jiao, Morten Goodwin, Christian W Omlin, and Geir Thore Berge. The Convolutional Tsetlin Machine. arXiv preprint arXiv:1905.09688 , 2019.
- [9] Adrian Phoulady, Ole-Christoffer Granmo, Saeed Rahimi Gorji, and Hady Ahmady Phoulady. The Weighted Tsetlin Machine: Compressed Representations with Clause Weighting. In Ninth International Workshop on Statistical Relational AI (StarAI 2020) , 2020.
- [10] Saeed Rahimi Gorji, Ole-Christoffer Granmo, Adrian Phoulady, and Morten Goodwin. A Tsetlin Machine with Multigranular Clauses. In Lecture Notes in Computer Science: Proceedings of the Thirty-ninth International Conference on Innovative Techniques and Applications of Artificial Intelligence (SGAI-2019) , volume 11927. Springer, 2019.
- [11] Saeed Gorji et al. Increasing the Inference and Learning Speed of Tsetlin Machines with Clause Indexing. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems . Springer, 2020.
- [12] K Darshana Abeyrathna, Ole-Christoffer Granmo, and Morten Goodwin. Extending the Tsetlin Machine with Integer-Weighted Clauses for Increased Interpretability. arXiv preprint arXiv:2005.05131, 2020.
- [13] K Darshana Abeyrathna, Ole-Christoffer Granmo, and Morten Goodwin. Adaptive Sparse Representation of Continuous Input for Tsetlin Machines Based on Stochastic Searching on the Line. In Preparation , 2020.
- [14] Kuruge Darshana Abeyrathna, Ole-Christoffer Granmo, Xuan Zhang, and Morten Goodwin. A scheme for continuous input to the Tsetlin Machine with applications to forecasting disease outbreaks. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems , pages 564-578. Springer, 2019.
- [15] Rishad Shafik, Adrian Wheeldon, and Alex Yakovlev. Explainability and Dependability Analysis of Learning Automata based AI Hardware. In IEEE 26th International Symposium on On-Line Testing and Robust System Design (IOLTS) . IEEE, 2020.
- [16] Kumpati S Narendra and Mandayam AL Thathachar. Learning Automata: An Introduction . Courier Corporation, 2012.
- [17] Michael Lvovitch Tsetlin. On behaviour of finite automata in random medium. Avtomat. i Telemekh, 22(10):1345-1354, 1961.
- [18] M. A. L. Thathachar and P. S. Sastry. Networks of Learning Automata: Techniques for Online Stochastic Optimization. Kluwer Academic Publishers, 2004.
- [19] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794, 2016.
- [20] Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. Interpretml: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223 , 2019.
- [21] James McDermott and Richard S Forsyth. Diagnosing a disorder in a classification benchmark. Pattern Recognition Letters , 73:41-43, 2016.