## Benchmarking Defeasible Reasoning with Large Language Models - Initial Experiments and Future Directions
Ilias Tachmazidis 1 , Sotiris Batsakis 2 , 1 , Grigoris Antoniou 3
1 School of Computing and Engineering, University of Huddersfield, UK
2 Hellenic Mediterranean University, Greece 3 Leeds Beckett University, UK
i.tachmazidis@hud.ac.uk, sbatsakis@hmu.gr, g.antoniou@leedsbeckett.ac.uk
## Abstract
Large Language Models (LLMs) have gained prominence in the AI landscape due to their exceptional performance. It is therefore essential to gain a better understanding of their capabilities and limitations, including their capacity for nonmonotonic reasoning. This paper proposes a benchmark that corresponds to various defeasible rule-based reasoning patterns. We modified an existing benchmark for defeasible logic reasoners by translating defeasible rules into text suitable for LLMs. We conducted preliminary experiments on nonmonotonic rule-based reasoning using ChatGPT and compared its behaviour with the reasoning patterns defined by defeasible logic.
## 1 Introduction
Large Language Models (LLMs) have recently caught people's attention due to their exceptional performance in various language-related tasks; they are the underlying technology behind chatbots such as ChatGPT 1 . Large Language Models such as LaMDA (Thoppilan et al. 2022) and GPT (OpenAI 2023) are based on training deep neural networks with billions of parameters on huge lexical datasets, often employing human judgment in a semi-supervised (e.g., reinforcement learning) training setting (Lambert et al. 2022; Ouyang et al. 2022). The exceptional, often human-level, performance of LLMs in various tasks has led to a widespread discussion about the potential benefits and dangers of such technologies in various areas and in human society in general, including petitions to pause research on more capable LLMs (Letters 2023). For example, GPT-4 achieved human-level performance in various academic and professional exams, including a score in the top 10% of test takers in the Uniform Bar Examination; this performance is attributed to a large degree to scaling LLMs to larger training datasets and more complex models with larger numbers of parameters (OpenAI 2023).
Despite the impressive performance of Large Language Models, including their ability to demonstrate emerging intelligent behaviour and reasoning capabilities, to the point of being considered forerunners of Artificial General Intelligence (Bubeck et al. 2023), several issues related to LLMs have been identified, such as the energy cost of training LLMs (Luccioni, Viguier, and Ligozat 2022;
1 Available at: https://chat.openai.com/
Strubell, Ganesh, and McCallum 2019), the difficulty of controlling their behaviour (Luccioni and Viviano 2021) and of ensuring conformity with stakeholders' requirements and norms, as well as interpreting their functionality (Bowman 2023). The interpretability of LLMs is a crucial issue, since neural-network-based LLMs appear to be 'black boxes', in contrast to logic-based systems; although various attempts exist to deal with this problem, including the use of LLMs to interpret LLMs (Bills et al. 2023), it remains unresolved. In addition, since LLMs are trained on vast amounts of raw text, they tend to replicate their input rather than apply robust reasoning (Bender et al. 2021). Training LLMs on raw text instead of structured knowledge bases integrating machine-readable semantics contributes to the difficulty of achieving efficient reasoning, an issue examined in various works such as (Zhang et al. 2022) and surveyed in (Huang and Chang 2022). Various attempts to integrate Knowledge Graphs (KGs) into LLMs have been proposed (Zhen et al. 2022; Yin et al. 2022) as a solution to the latter issue, but recent advances in LLM capabilities, including high performance on academic and professional exams (OpenAI 2023), illustrate the need for an updated evaluation of the reasoning capabilities of LLMs. This updated evaluation should take into account recent developments in the field, including the deployment of systems such as ChatGPT that exploit the benefits of scalability (Kaplan et al. 2020), and the demonstrated ability of LLMs to adjust to new tasks given just a small number of examples (Brown et al. 2020). Furthermore, LLM capabilities with respect to important formalisms such as defeasible reasoning have not yet been examined in detail. This kind of reasoning is important for cases where knowledge is incomplete and conflicting, which is the case in many application areas, including law and healthcare.
In previous work (Antoniou and Batsakis 2023), preliminary experiments on LLM defeasible reasoning were performed, but a systematic analysis involving the construction of a benchmark containing several examples of different reasoning patterns was missing.
This work is an initial step towards developing a deep understanding of the reasoning capabilities of LLMs, with emphasis on nonmonotonic reasoning. To this end, we propose a benchmark for LLMs by modifying an existing benchmark for defeasible logic reasoners. The proposed benchmark corresponds to various reasoning patterns that are described in the following. Furthermore, we conducted preliminary experiments on nonmonotonic rule-based reasoning using ChatGPT and compared its behaviour with the reasoning patterns defined by defeasible logic.
## 2 Background
A defeasible theory D is a triple (F, R, >) where F is a finite set of facts (literals), R a finite set of rules, and > a superiority relation (an acyclic relation on R).
A rule r consists of (a) its antecedent (or body) A(r), which is a finite set of literals, (b) an arrow, and (c) its consequent (or head) C(r), which is a literal. There are three types of rules: strict rules, defeasible rules and defeaters, represented by the arrows →, ⇒ and ⇝ respectively. Strict rules are rules in the classical sense: whenever the premises are indisputable (e.g., facts) then so is the conclusion. Defeasible rules are rules that can be defeated by contrary evidence. Defeaters are rules that cannot be used to draw any conclusions; their only use is to prevent some conclusions.
Given a set R of rules, we denote the set of all strict rules in R by R_s, and the set of strict and defeasible rules in R by R_sd. R[q] denotes the set of rules in R with consequent q. If q is a literal, ∼q denotes the complementary literal (if q is a positive literal p then ∼q is ¬p; and if q is ¬p, then ∼q is p).
A conclusion of D is a tagged literal and can have one of the following four forms:
- +∆q, meaning that q is definitely provable in D.
- -∆q, meaning that we have proved that q is not definitely provable in D.
- +∂q, meaning that q is defeasibly provable in D.
- -∂q, meaning that we have proved that q is not defeasibly provable in D.
Provability is defined below. It is based on the concept of a derivation (or proof) in D = (F, R, >). A derivation is a finite sequence P = P(1), ..., P(n) of tagged literals satisfying the conditions shown below. The conditions are essentially inference rules phrased as conditions on proofs. P(1..i) denotes the initial part of the sequence P of length i. For more details on provability and an explanation of the intuition behind the conditions below, see (Maher 2004).
As a representative example, the condition for -∂ is:

```
-∂: We may append P(i+1) = -∂q if
  (1) -∆q ∈ P(1..i) and
  (2) (2.1) ∀r ∈ R_sd[q] ∃a ∈ A(r): -∂a ∈ P(1..i) or
      (2.2) +∆∼q ∈ P(1..i) or
      (2.3) ∃s ∈ R[∼q] such that
            (2.3.1) ∀a ∈ A(s): +∂a ∈ P(1..i) and
            (2.3.2) ∀t ∈ R_sd[q] either ∃a ∈ A(t): -∂a ∈ P(1..i) or not(t > s)
```
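The proof conditions above can be operationalised for the simple theories used in this paper. The following Python sketch is our own construction (the literal encoding 'a0'/'-a0' and the representation of rules as (name, body, head) tuples are assumptions): it treats facts as definitely provable, equates failure to derive with -∂, and lets cyclic support fail, which matches defeasible logic on the theories of Section 3 but is not a full implementation of the conditions of (Maher 2004).

```python
# Minimal defeasible-provability checker (sketch, our own construction).
# Literals are strings such as 'a0' and '-a0'; rules are (name, body, head)
# tuples of defeasible rules; sup is a set of (superior, inferior) rule names.

def neg(q):
    """Complementary literal: 'a0' <-> '-a0'."""
    return q[1:] if q.startswith('-') else '-' + q

def defeasibly_provable(q, facts, rules, sup, _active=None):
    """True approximates +∂q, False approximates -∂q (simple theories only)."""
    if _active is None:
        _active = set()
    if q in facts:                 # facts are definitely provable (+∆),
        return True                # which overrides any defeasible attack
    if q in _active:               # cyclic support (cf. circle(n)) fails
        return False
    _active = _active | {q}
    prove = lambda p: defeasibly_provable(p, facts, rules, sup, _active)
    supporting = [(name, body) for name, body, head in rules if head == q]
    attacking = [(name, body) for name, body, head in rules if head == neg(q)]
    applicable = [name for name, body in supporting
                  if all(prove(a) for a in body)]
    if not applicable:             # no applicable rule supports q
        return False
    for s_name, s_body in attacking:
        if all(prove(a) for a in s_body):
            # an applicable attacker must be beaten by an applicable supporter
            if not any((t, s_name) in sup for t in applicable):
                return False
    return True
```

On the benchmark theories this reproduces the expected conclusions, e.g. chain(2) proves a0, circle(2) does not, and levels-(2) proves a1 but not a0.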
## 3 Dataset
We propose a dataset of scalable test theories inspired by (Maher et al. 2001), whose authors focused on evaluating the efficiency of existing defeasible reasoning systems. Here, we focus on a translation of rules into text suitable for LLMs. The proposed dataset covers typical defeasible inference patterns, allowing a comparison between inputs for reasoning systems and LLMs.
Empty. First, we skip the empty() theory as it contains no facts, rules or priorities. The empty() theory serves as a baseline for reasoning systems; however, there is no meaningful evaluation of an LLM in the absence of text.
Chain. Our first theory is chain(n), where a_0 is at the end of a chain of n rules a_{i+1} ⇒ a_i, with a single fact a_n initiating the chain of inference (no priorities defined). For chain(2), the defeasible rules are as follows:
```
>> A0000002
r1: A0000002 => A0000001
r2: A0000001 => A0000000
```
Note that '>> A0000002' denotes a fact following the syntax of SPINdle (Rohaninezhad, Arif, and Noah 2015). Based on the fact A0000002, rule r1 infers that A0000001 is defeasibly provable, while rule r2 infers that A0000000 is defeasibly provable as well. In this work, the structure of the theories is aimed at determining through logical inference whether A0000000 is provable or not. Subsequently, the translation of chain(2) into plain text is as follows:
- A0000002 is an Arkon. If A0000002 is an Arkon, then typically A0000001 is an Arkon. If A0000001 is an Arkon, then typically A0000000 is an Arkon.
Notice the pattern: facts are expressed as statements, while rules are expressed as if-then statements, with the keyword 'typically' denoting the defeasible nature of the rule. Given the already identified affinity of ChatGPT for using other background knowledge when predicates and atoms are real-world entities, we use imaginary names of species on an imaginary planet, following (Ford and Billington 2000). Here, we use 'Arkon' in order to ask ChatGPT:
Is A0000000 an Arkon?
The theory chains(n) is a version of chain(n) with strict rules. For chains(2), the rules are as follows:

```
>> A0000002
r1: A0000002 -> A0000001
r2: A0000001 -> A0000000
```
The translation of chains(2) into plain text is as follows:
- A0000002 is an Arkon. If A0000002 is an Arkon, then A0000001 is an Arkon. If A0000001 is an Arkon, then A0000000 is an Arkon.
Notice the absence of the keyword 'typically' in the if-then statements.
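For illustration, the chain(n) and chains(n) theories and their textual translations can be generated mechanically. The following Python sketch follows the patterns shown above; the function names and exact output formatting are our own assumptions:

```python
# Sketch of a generator for chain(n) / chains(n) and their translations
# (our own helpers; literal naming A%07d matches the examples above).

def chain(n, strict=False):
    arrow = '->' if strict else '=>'
    lit = lambda i: 'A%07d' % i
    lines = ['>> ' + lit(n)]                  # single initiating fact a_n
    for i in range(n):                        # rules a_{i+1} => a_i
        lines.append('r%d: %s %s %s'
                     % (i + 1, lit(n - i), arrow, lit(n - i - 1)))
    return '\n'.join(lines)

def chain_text(n, strict=False):
    hedge = '' if strict else 'typically '    # 'typically' marks defeasibility
    lit = lambda i: 'A%07d' % i
    sents = ['%s is an Arkon.' % lit(n)]      # the fact, as a statement
    for i in range(n):                        # rules, as if-then statements
        sents.append('If %s is an Arkon, then %s%s is an Arkon.'
                     % (lit(n - i), hedge, lit(n - i - 1)))
    return ' '.join(sents)
```

For example, `chain(2)` yields the chain(2) listing above, and `chain_text(2, strict=True)` drops 'typically', as in the chains(2) translation.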
Circle. In defeasible logic, cyclical chains of reasoning do not lead to inferences. More specifically, in the theory circle(n), a_0 is part of a circle of n rules a_{(i+1) mod n} ⇒ a_i (no facts or priorities defined). For circle(2), the defeasible rules are as follows:
```
r1: A0000000 => A0000001
r2: A0000001 => A0000000
```
Due to the cyclical nature of the rules, no defeasible conclusion is inferred for either A0000000 or A0000001. The translation of circle(2) into plain text is as follows:
- If A0000000 is an Arkon, then typically A0000001 is an Arkon. If A0000001 is an Arkon, then typically A0000000 is an Arkon.
The theory circles(n) is a version of circle(n) with strict rules. For circles(2), the rules are as follows:

```
r1: A0000000 -> A0000001
r2: A0000001 -> A0000000
```
The translation of circles(2) into plain text is as follows:
- If A0000000 is an Arkon, then A0000001 is an Arkon. If A0000001 is an Arkon, then A0000000 is an Arkon.
Notice again the absence of the keyword 'typically' in the if-then statements.
Directed Acyclic Graph (DAG). In order to consider more complex inference structures, we define the theory dag(n,k), where a_0 is the root of a k-branching tree of depth nk in which every literal occurs n times. The inference process is initiated by k facts, namely a_{nk+1}, ..., a_{nk+k} (no priorities defined). For dag(2,2), the defeasible rules are as follows:
```
>> A0000006
>> A0000005
r1: A0000006, A0000005 => A0000004
r2: A0000005, A0000004 => A0000003
r3: A0000004, A0000003 => A0000002
r4: A0000003, A0000002 => A0000001
r5: A0000002, A0000001 => A0000000
```
Notice that nk+1 (here 5) rules are generated, with k (here 2) facts, namely A0000006 and A0000005, making rule r1 applicable and inferring A0000004. By applying rules r1, r2, r3, r4 and r5 we can eventually infer A0000000. The translation of dag(2,2) into plain text is as follows:
- A0000006 is an Arkon. A0000005 is an Arkon.
- If A0000006 is an Arkon and A0000005 is an Arkon, then typically A0000004 is an Arkon.
- If A0000005 is an Arkon and A0000004 is an Arkon, then typically A0000003 is an Arkon.
- If A0000004 is an Arkon and A0000003 is an Arkon, then typically A0000002 is an Arkon.
- If A0000003 is an Arkon and A0000002 is an Arkon, then typically A0000001 is an Arkon.
- If A0000002 is an Arkon and A0000001 is an Arkon, then typically A0000000 is an Arkon.
Notice that multiple predicates in the body of a rule are connected through the keyword 'and' in the if part of the if-then statement.
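The dag(n,k) construction can likewise be sketched in Python (our own helper, mirroring the dag(2,2) listing above):

```python
# Sketch of a generator for the dag(n,k) theories: k initiating facts and
# nk+1 rules whose bodies slide down the literal indices (our own helper).

def dag(n, k):
    lit = lambda i: 'A%07d' % i
    # k facts a_{nk+1} .. a_{nk+k}, listed highest-index first
    lines = ['>> ' + lit(n * k + i) for i in range(k, 0, -1)]
    for j in range(1, n * k + 2):             # nk+1 rules r1 .. r_{nk+1}
        head = n * k + 1 - j                  # rule r_j concludes a_{nk+1-j}
        body = ', '.join(lit(head + d) for d in range(k, 0, -1))
        lines.append('r%d: %s => %s' % (j, body, lit(head)))
    return '\n'.join(lines)
```

Calling `dag(2, 2)` reproduces the facts and five rules shown above.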
Levels. All theories mentioned above contained no conflicts. However, conflict resolution is an integral part of defeasible reasoning. Thus, the theory levels-(n) defines a cascade of n conclusions, namely there are rules true ⇒ a_i and a_{i+1} ⇒ ¬a_i, for 0 ≤ i < n (no facts or priorities defined). For levels-(2), the defeasible rules are as follows:
```
r1: => A0000001
r2: A0000002 => -A0000001
r3: => A0000000
r4: A0000001 => -A0000000
```
Notice that when the body of a rule (here r1 and r3) is empty, then syntactically all preconditions are considered to be met. Negative conclusions such as ¬A0000001 are prefixed with the minus sign, namely -A0000001. Since there is no fact supporting A0000002, rule r2 does not apply, thus we conclude A0000001 based on rule r1. Subsequently, since both rules r3 and r4 apply, we cannot conclude A0000000. Notice an emerging pattern where A0000000 cannot be proved for even n, while A0000000 can be proved for odd n (due to alternating conflicts on subsequent levels). The translation of levels-(2) into plain text is as follows:
- A0000001 is typically an Arkon. If A0000002 is an Arkon, then typically A0000001 is not an Arkon. A0000000 is typically an Arkon. If A0000001 is an Arkon, then typically A0000000 is not an Arkon.
The theory levels(n) is a version of levels-(n) where, in addition, there are superiority statements stating that, for odd i, rule a_{i+1} ⇒ ¬a_i is superior to true ⇒ a_i (introducing n/2 priorities). For levels(2), the rules and priorities are as follows:
```
r1: => A0000001
r2: A0000002 => -A0000001
r2 > r1
r3: => A0000000
r4: A0000001 => -A0000000
```
Notice that, due to the priority, if rule r2 were applicable (e.g. with A0000002 given as a fact) then ¬A0000001 would have been inferred (instead of A0000001 as inferred here). The translation of levels(2) into plain text is as follows:
- A0000001 is typically an Arkon, unless A0000002 is an Arkon (namely then A0000001 is not an Arkon). A0000000 is typically an Arkon. If A0000001 is an Arkon, then typically A0000000 is not an Arkon.
A similar inference pattern emerges where A0000000 cannot be proved for even n, while A0000000 can be proved for odd n (even though the process of conflict resolution on subsequent levels is different for levels(n) compared to levels-(n)).
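A generator for levels-(n) and levels(n) can be sketched as follows (our own helper; the priority syntax 'r2 > r1' follows the listings above):

```python
# Sketch of a generator for levels-(n) (priorities=False) and levels(n)
# (priorities=True): a cascade of empty-body rules and their attackers.

def levels(n, priorities=True):
    lit = lambda i: 'A%07d' % i
    lines, rule_no = [], 1
    for i in range(n - 1, -1, -1):          # from the top of the cascade to a_0
        lines.append('r%d: => %s' % (rule_no, lit(i)))              # true => a_i
        lines.append('r%d: %s => -%s' % (rule_no + 1, lit(i + 1), lit(i)))
        if priorities and i % 2 == 1:       # for odd i the attacking rule wins
            lines.append('r%d > r%d' % (rule_no + 1, rule_no))
        rule_no += 2
    return '\n'.join(lines)
```

`levels(2, priorities=False)` reproduces the levels-(2) listing, and `levels(2)` adds the single priority r2 > r1.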
Hierarchies. The authors of (Maher et al. 2001) defined the theories: (i) tree(n,k), where a_0 is the root of a k-branching tree of depth n in which every literal occurs once, and (ii) teams(n), where every literal is disputed, with two rules for a_i and two rules for ¬a_i, and the rules for a_i are superior to the rules for ¬a_i (this situation is repeated recursively to a depth n). In this work, we define hierarchies(n,k), where a_0 is the root of a k-branching tree of depth n in which every literal occurs once. In addition, every literal (internal node of the tree) is disputed, with k/2 rules for a_i and k/2 rules for ¬a_i, where k is even, and the rules for a_i are superior to the rules for ¬a_i. Each external node of the tree is a fact, namely there are k^n facts. For hierarchies(2,2), the rules and priorities are as follows:
```
>> A0000006
>> A0000005
>> A0000004
>> A0000003
r1: A0000006 => A0000002
r2: A0000005 => -A0000002
r1 > r2
r3: A0000004 => A0000001
r4: A0000003 => -A0000001
r3 > r4
r5: A0000002 => A0000000
r6: A0000001 => -A0000000
r5 > r6
```
The translation of hierarchies(2,2) into plain text is as follows:
- A0000006 is an Arkon. A0000005 is an Arkon. A0000004 is an Arkon. A0000003 is an Arkon. If A0000006 is an Arkon, then typically A0000002 is an Arkon. If A0000005 is an Arkon, then typically A0000002 is not an Arkon, unless A0000006 is also an Arkon (namely then A0000002 is an Arkon). If A0000004 is an Arkon, then typically A0000001 is an Arkon.
- If A0000003 is an Arkon, then typically A0000001 is not an Arkon, unless A0000004 is also an Arkon (namely then A0000001 is an Arkon). If A0000002 is an Arkon, then typically A0000000 is an Arkon.
- If A0000001 is an Arkon, then typically A0000000 is not an Arkon, unless A0000002 is also an Arkon (namely then A0000000 is an Arkon).
Here, A0000000 can be proved for any given parameters n and k, since conflicts are always resolved in favour of a_i.
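The hierarchies(n,k) construction can be sketched as follows (our own helper, assuming breadth-first node numbering and a one-to-one pairing of each superior rule with one inferior rule, which reproduces the hierarchies(2,2) listing above):

```python
# Sketch of a generator for hierarchies(n,k): a k-branching tree whose leaves
# are facts, with each internal node supported by half of its children and
# attacked by the other half (our own helper; pairing of priorities assumed).

def hierarchies(n, k):
    lit = lambda i: 'A%07d' % i
    internal = sum(k ** i for i in range(n))        # internal nodes 0..internal-1
    # k^n leaf facts, listed highest-index first
    lines = ['>> ' + lit(i)
             for i in range(internal + k ** n - 1, internal - 1, -1)]
    rule_no = 1
    for node in range(internal - 1, -1, -1):        # deepest internal nodes first
        kids = [node * k + c for c in range(1, k + 1)]
        con, pro = kids[:k // 2], kids[k // 2:]     # upper half supports a_node
        pro_rules = []
        for c in reversed(pro):                     # k/2 rules for a_node
            lines.append('r%d: %s => %s' % (rule_no, lit(c), lit(node)))
            pro_rules.append(rule_no)
            rule_no += 1
        for j, c in enumerate(reversed(con)):       # k/2 rules for -a_node
            lines.append('r%d: %s => -%s' % (rule_no, lit(c), lit(node)))
            lines.append('r%d > r%d' % (pro_rules[j], rule_no))
            rule_no += 1
    return '\n'.join(lines)
```

`hierarchies(2, 2)` yields the four facts, six rules and three priorities of the example above.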
We consider out of the scope of this work the theory mix(m,n,k) from (Maher et al. 2001), where there are m defeasible rules for a_0 and m defeaters against a_0, and each rule has n atoms in its body (each atom can be established by a chain of strict rules of length k).
Table 1 summarises the number of facts, rules and priorities generated for each theory as a function of given parameters. Notice that the numbers in Table 1 do not necessarily match the numbers in (Maher et al. 2001) as we have modified theory definitions.
Table 1: Sizes of scalable theories.
| Theory | Facts | Rules | Priorities |
|------------------|---------|--------------------|----------------------|
| chain(n) | 1 | n | 0 |
| chains(n) | 1 | n | 0 |
| circle(n) | 0 | n | 0 |
| circles(n) | 0 | n | 0 |
| dag(n,k)         | k       | nk+1               | 0                    |
| levels-(n)       | 0       | 2n                 | 0                    |
| levels(n)        | 0       | 2n                 | n/2                  |
| hierarchies(n,k) | k^n     | k·∑_{i=0}^{n-1} k^i | (k/2)·∑_{i=0}^{n-1} k^i |
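The entries of Table 1 can be cross-checked programmatically; the following Python helper (our own, hypothetical) encodes the formulas of the table:

```python
# Sketch: theory sizes (facts, rules, priorities) as in Table 1
# (our own helper encoding the table's formulas).

def sizes(theory, n, k=None):
    if theory in ('chain', 'chains'):
        return (1, n, 0)
    if theory in ('circle', 'circles'):
        return (0, n, 0)
    if theory == 'dag':
        return (k, n * k + 1, 0)
    if theory == 'levels-':
        return (0, 2 * n, 0)
    if theory == 'levels':
        return (0, 2 * n, n // 2)           # one priority per odd level
    if theory == 'hierarchies':
        internal = sum(k ** i for i in range(n))   # number of internal nodes
        return (k ** n, k * internal, (k // 2) * internal)
    raise ValueError('unknown theory: %s' % theory)
```

For instance, `sizes('hierarchies', 2, 2)` gives (4, 6, 3), matching the hierarchies(2,2) example.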
## 4 Experimental Results
The proposed dataset is scalable, namely increasingly larger theories can be generated for increasing values of parameters n and k . However, as a first step in this work, we focus on relatively small and readable theories in order to assess empirically the inference patterns of ChatGPT. We used GPT-4o in order to assess theories: chain(8) 2 , chains(8) 3 ,
2 https://chatgpt.com/share/d8819744-9d90-42cb-aeba95a28769f08e
3 https://chatgpt.com/share/15f3f0a5-e8c9-4156-90507a28b62cc189
circle(8) 4 , circles(8) 5 , dag(3,2) 6 , levels-(5) 7 , levels(5) 8 , hierarchies(2,4) 9 . ChatGPT was given the following instructions for each theory:
<!-- formula-not-decoded -->
In addition, each prompt was based on the following template (namely, ' { theory } ' was substituted with each evaluated theory):
Based on the following knowledge alone:
{ theory }
Is A0000000 an Arkon?
<!-- formula-not-decoded -->
Each generated theory, such as chain(8) , was evaluated over four settings:
- A0000000 is not an Arkon with statements provided in random order (-∂-rand),
- A0000000 is an Arkon with statements provided in random order (+∂-rand),
- A0000000 is not an Arkon with statements provided in sequential order (-∂-seq),
- A0000000 is an Arkon with statements provided in sequential order (+∂-seq).
We evaluated statements provided in random order first, in order to observe any differences in generated responses when the same theory is provided in sequential order. Notice that the ChatGPT conversations provided as links in footnotes contain the four settings in the following order: -∂-rand, +∂-rand, -∂-seq, +∂-seq. Results are summarised in Table 2.
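The construction of the four prompt variants can be sketched as follows (a Python sketch; the shuffling for the *-rand settings and the fixed seed are our own assumptions, while the template follows the one shown above):

```python
import random

# Sketch of prompt assembly for the rand/seq settings: the template follows
# the paper; the seeded shuffle for *-rand is our own reconstruction.

def build_prompt(statements, order='seq', seed=0):
    stmts = list(statements)
    if order == 'rand':
        random.Random(seed).shuffle(stmts)  # reproducible random order
    return ('Based on the following knowledge alone:\n'
            + '\n'.join(stmts)
            + '\nIs A0000000 an Arkon?')
```

The -∂ vs +∂ settings differ only in the theory passed in (e.g. whether the initiating fact makes A0000000 provable), not in the template.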
For theory chain(8) we notice that + ∂ -rand and + ∂ -seq have similar inference patterns, namely starting from provided facts, each rule is applied until a final conclusion is reached (interestingly, statements provided in random order do not change the inference sequence). We also notice that while -∂ -seq evaluates all rules sequentially (even
4 https://chatgpt.com/share/a73155a9-9c3d-498d-8b2d9532a9cd6d54
5 https://chatgpt.com/share/3b381c57-f881-4631-92e03974c446a5df
6 https://chatgpt.com/share/db79edc0-1f1e-4fec-bba1b16fb6bd15a8
7 https://chatgpt.com/share/f332e764-91df-4672-b6d01fa95196c96e
8 https://chatgpt.com/share/a1b2ae89-8f49-4f25-a7eafc94ac3e942d
9 https://chatgpt.com/share/bc1181d4-1f94-4a2f-8c9bcf11fb28c35d
Table 2: ChatGPT inference results (for results annotated with †, readers are referred to comments included in the main text).
| Theory           | -∂-rand  | +∂-rand  | -∂-seq   | +∂-seq   |
|------------------|----------|----------|----------|----------|
| chain(8)         | Correct  | Correct  | Correct  | Correct  |
| chains(8)        | Correct  | Correct  | Correct  | Correct  |
| circle(8)        | Correct  | Correct  | Correct  | Correct  |
| circles(8)       | Correct  | Correct  | Correct  | Correct  |
| dag(3,2)         | †Error   | Correct  | Correct  | Correct  |
| levels-(5)       | †Error   | †Error   | Error    | Error    |
| levels(5)        | †Error   | †Correct | †Error   | †Correct |
| hierarchies(2,4) | †Correct | †Correct | †Correct | †Correct |
after encountering A1111113 ), the inference of -∂ -rand moves backwards once A1111113 is encountered. Theory chains(8) follows similar inference patterns based on if-then statements.
For theory circle(8) we notice that for both -∂ -rand and -∂ -seq the circle is identified and no conclusion can be drawn. For + ∂ -rand and + ∂ -seq we introduced a fact that proves the circle, inference started from the given fact leading to the inference of A0000000 . Theory circles(8) follows similar inference patterns based on if-then statements. However, for -∂ -rand the circle was not explicitly identified as justification for inference.
For theory dag(3,2) , -∂ -rand exhibited unusual patterns (i.e. a potential hallucination), namely A1111114 was replaced with A0000004 (incorrectly), leading to the inference of A0000000 (while a rule based on A0000005 and A0000003 leading to A0000002 was not given as input). Interestingly, for -∂ -seq the inference pattern was correct with A1111114 breaking the chain of inference (this indicates that the sequence of statements can have an effect on the inference process, which is not the case for standard reasoners). The inference pattern for + ∂ -rand was correct, even though it is unclear why the 'Chain of Reasoning' did not include the conjunction of two premises. The inference pattern for + ∂ -seq was correct, with a well formed 'Chain of Reasoning'.
Theory levels-(5) introduces conflicting rules, with explanations provided not matching expected inference for defeasible reasoning. This might be attributed to the fact that ChatGPT starts from A0000000 working backwards, while the lack of priorities over conflicting rules introduces confusion. It is worth pointing out that -∂ -seq introduced the statement 'A0000002 is typically an Arkon.' (a fact not given as input), i.e. a potential hallucination.
Theory levels(5) contains priority statements, which provide some clarity. However, for -∂-rand, since there are no priorities for the rules concerning A0000002, the inference does not match defeasible reasoning (it seems that ChatGPT does not follow the notion of defeasible reasoning, where both A0000002 and ¬A0000002 might not be provable). Conversely, for +∂-rand the combination of priorities and rules with failing premises (namely, the resolution of A0000002 knowing that A0000003 is not an Arkon) leads to conclusions that follow defeasible reasoning. Interestingly, the inference steps of -∂-seq are well structured, closely resembling defeasible reasoning (with the exception of resolving A0000002 without a clear priority, which leads to an error). The inference of +∂-seq showed that the sequence of statements (where well structured sequences lead to more intuitive inference steps), as well as the presence of priorities and rules with failing premises, can affect the inference process.
For theory hierarchies(2,4) , -∂ -rand and + ∂ -rand lead to correct inferences due to the presence of priorities and rules with failing premises. However, due to the random order of statements, the explanation of inferences can be challenging to follow. This is not the case for -∂ -seq and + ∂ -seq , which exhibited correct and well structured inference steps.
Overall, the following observations can be made:
- ChatGPT seems to adopt the closed-world assumption, where facts are considered as true and missing information as false,
- Monotonic rules are applied leading to new conclusions,
- Conflicting rules are resolved in the presence of priorities and rules with failing premises,
- The presence of conflicting rules supporting both p and ¬ p , with no clear priority, does not lead to a conclusion that neither p nor ¬ p can be inferred,
- Additional facts or rules (not provided as input) could be automatically introduced (i.e. a potential hallucination),
- A well structured sequence of statements (given as input) increases the readability of the inference process (compared to equivalent theories structured as statements in random order).
## 5 Conclusion
This work is a first step towards gaining a better understanding of reasoning capabilities of LLMs with respect to nonmonotonic reasoning. We proposed a benchmark tailored to LLMs through the modification of an existing benchmark for defeasible logic reasoners. A range of reasoning patterns was covered by the proposed benchmark. Preliminary experiments indicated encouraging results for monotonic reasoning as well as certain challenges in the context of nonmonotonic rule-based reasoning. Future work will focus on expanding our exploration of reasoning patterns that might pose a challenge to LLMs. Furthermore, while this work was focused on small and readable theories, future efforts will examine the effect of increasingly larger theories on the reasoning process of LLMs.
## References
Antoniou, G., and Batsakis, S. 2023. Defeasible reasoning with large language models-initial experiments and future directions. In CEUR Workshop Proceedings , volume 3485, 7687. CEUR Workshop Proceedings.
Bender, E. M.; Gebru, T.; McMillan-Major, A.; and Shmitchell, S. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , 610-623.
Bills, S.; Cammarata, N.; Mossing, D.; Tillman, H.; Gao, L.; Goh, G.; Sutskever, I.; Leike, J.; Wu, J.; and Saunders, W. 2023. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/inde
Bowman, S. R. 2023. Eight things to know about large language models. arXiv preprint arXiv:2304.00612 .
Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language models are few-shot learners.
Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y. T.; Li, Y.; Lundberg, S.; et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 .
Ford, M., and Billington, D. 2000. Strategies in human nonmonotonic reasoning. Computational Intelligence 16(3):446-468.
Huang, J., and Chang, K. C.-C. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403 .
Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; and Amodei, D. 2020. Scaling laws for neural language models.
Lambert, N.; Castricato, L.; von Werra, L.; and Havrilla, A. 2022. Illustrating reinforcement learning from human feedback (rlhf). Hugging Face Blog . https://huggingface.co/blog/rlhf.
Letters, F. O. 2023. Pause giant ai experiments: An open letter. Future of Life Institute. https://futureoflife.org/openletter/pause-giant-ai-experiments .
Luccioni, A., and Viviano, J. 2021. What's in the box? an analysis of undesirable content in the common crawl corpus. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) , 182-189.
Luccioni, A. S.; Viguier, S.; and Ligozat, A.-L. 2022. Estimating the carbon footprint of bloom, a 176b parameter language model. arXiv preprint arXiv:2211.02001 .
Maher, M. J.; Rock, A.; Antoniou, G.; Billington, D.; and Miller, T. 2001. Efficient defeasible reasoning systems. Int. J. Artif. Intell. Tools 10(4):483-501.
Maher, M. J. 2004. Propositional Defeasible Logic has Linear Complexity. CoRR cs.AI/0405090.
OpenAI. 2023. Gpt-4 technical report.
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; Schulman, J.; Hilton, J.; Kelton, F.; Miller, L.; Simens, M.; Askell, A.; Welinder, P.; Christiano, P.; Leike, J.; and Lowe, R. 2022. Training language models to follow instructions with human feedback.
Rohaninezhad, M.; Arif, S. M.; and Noah, S. A. M. 2015. A grounder for spindle defeasible logic reasoner. Expert Systems with Applications 42(20):7098-7109.
Strubell, E.; Ganesh, A.; and McCallum, A. 2019. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , 3645-3650.
Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.-T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239 .
Yin, D.; Dong, L.; Cheng, H.; Liu, X.; Chang, K.-W.; Wei, F.; and Gao, J. 2022. A survey of knowledgeintensive nlp with pre-trained language models. arXiv preprint arXiv:2202.08772 .
Zhang, H.; Li, L. H.; Meng, T.; Chang, K.-W.; and Broeck, G. V. d. 2022. On the paradox of learning to reason from data. arXiv preprint arXiv:2205.11502 .
Zhen, C.; Shang, Y.; Liu, X.; Li, Y.; Chen, Y.; and Zhang, D. 2022. A survey on knowledge-enhanced pre-trained language models. arXiv preprint arXiv:2212.13428 .