## Zero-Shot Question Answering over Financial Documents using Large Language Models
Karmvir Singh Phogat, Chetan Harsha, Sridhar Dasaratha, Shashishekar Ramakrishna, Sai Akhil Puranam
EY Global Delivery Services India LLP
{Karmvir.Phogat, Chetan.Harsha, Sridhar.Dasaratha}@gds.ey.com
{Shashishekar.R, Sai.Puranam}@gds.ey.com
## Abstract
We introduce a large language model (LLM) based approach to answer complex questions requiring multi-hop numerical reasoning over financial reports. While LLMs have exhibited remarkable performance on various natural language and reasoning tasks, complex reasoning problems often rely on few-shot prompts that require carefully crafted examples. In contrast, our approach uses novel zero-shot prompts that guide the LLM to encode the required reasoning into a Python program or a domain specific language. The generated program is then executed by a program interpreter, thus mitigating the limitations of LLMs in performing accurate arithmetic calculations.
We evaluate the proposed approach on three financial datasets using some of the recently developed generative pretrained transformer (GPT) models and perform comparisons with various zero-shot baselines. The experimental results demonstrate that our approach significantly improves the accuracy for all the LLMs over their respective baselines. We provide a detailed analysis of the results, generating insights to support our findings. The success of our approach demonstrates the enormous potential to extract complex domain specific numerical reasoning by designing zero-shot prompts to effectively exploit the knowledge embedded in LLMs.
## 1 Introduction
In recent years, the development of large language models (LLMs) has achieved significant advances in natural language processing (NLP). Typically, LLMs are pretrained on large corpora of text from the internet, which has given rise to the capability of adapting to a wide variety of new tasks from different domains without the need for huge amounts of task specific data. Scaling up the size of these models has not only improved sampling efficiency and performance (Kaplan et al., 2020) but also introduced reasoning capabilities (Wei et al., 2022a,b; Kojima et al., 2022).
LLMs have been shown to perform well on tasks requiring reasoning capabilities in various domains, including code writing (Chen et al., 2021a), math problem solving (Lewkowycz et al., 2022; Polu et al., 2023), dialogue (Glaese et al., 2022; Thoppilan et al., 2022), common sense reasoning (Shwartz et al., 2020; Chowdhery et al., 2022) and symbolic reasoning (Wei et al., 2022b; Wang et al., 2023). The design of the prompt, known as prompt engineering, plays a significant role in adapting the pretrained LLMs to new tasks with little or no task specific training data. Recently, there has been extensive work (Liu et al., 2023) which demonstrates the importance of prompt design in usage of the LLMs and unlocking their reasoning capabilities. However, (Mahowald et al., 2023) argue that LLMs cannot combine elementary knowledge with common sense reasoning. (Valmeekam et al., 2022) claim that benchmarks on which LLMs show reasoning capabilities are simplistic and cannot be used as evidence. (Bubeck et al., 2023; Bang et al., 2023) show that LLMs face challenges in numerical reasoning. Hence, adapting LLMs to new domains requires prompt engineering and a system design that can overcome the limitations of LLMs.
Question answering in the financial domain is an active area of research that could potentially benefit from the use of LLMs with appropriate system design. Financial question answering involves numerous steps and complex numerical reasoning with precise arithmetic calculations, making it more challenging than classical question answering problems (Yang et al., 2018; Rajpurkar et al., 2018). Typically for complex problems, few-shot prompt based approaches have been used (Wei et al., 2022b; Chen et al., 2023). However, it has been shown that the output of LLMs is sensitive to the few-shot samples used as well as to the ordering of those samples (Lu et al., 2022). Further, the samples can contain a large number of tokens, and providing multiple samples for few-shot prompts increases the number of input tokens, sometimes even exceeding the input limit of the LLM. Hence, designing and using few-shot prompts for financial question answering can become quite challenging.
We propose a new approach using zero-shot prompts for financial question answering with LLMs, thus eliminating the requirement to create hand-crafted examples. These prompts contain high-level instructions to guide the encoding of the financial reasoning process into a Python program (ZS-FinPYT) or a domain specific language (ZS-FinDSL). For ZS-FinPYT, we achieve the zero-shot system through instructions that lay out the high-level approach to generate a valid Python program, while for ZS-FinDSL we enable the same by identifying a program structure for robust domain-specific language (DSL) program extraction. In both cases, the generated program is executed externally by a program executor to provide the final answer. We evaluate the latest GPT-x models on their ability to perform financial reasoning, as they have shown state-of-the-art performance on various tasks involving question answering and reasoning (OpenAI, 2023; Frieder et al., 2023; Kung et al., 2023). Specifically, we explore the use of the GPT models text-davinci-003, gpt-3.5-turbo and gpt-4 in answering financial questions.
We evaluate the proposed approach on three financial question answering datasets, with three different GPT models and compare with various baselines. The experimental results demonstrate that our approach significantly improves the accuracy for all models. The success of our approach demonstrates the enormous potential to extract complex domain specific numerical reasoning by carefully designing LLM based systems for specific applications and crafting prompts to effectively exploit the knowledge embedded in the LLMs.
## 2 Background
NLP techniques have proven useful for solving various problems in the financial domain, such as sentiment analysis to assist market prediction (Day and Lee, 2016; Akhtar et al., 2017) and fraud detection for risk management (Han et al., 2018; Wang et al., 2019). Financial domain specific language models have been trained on large scale financial data and fine-tuned for specific problems (Liu et al., 2021). (Chen et al., 2021b) introduce a large-scale question answering dataset, FinQA, and propose FinQANet with a retriever-generator architecture based on pretrained BERT-like models.
With the introduction of LLMs, it has become feasible to directly use these language models without domain specific pretraining. (Chen et al., 2022) propose a large-scale financial dataset, ConvFinQA for conversational question answering. They propose a few-shot prompt (with 16 exemplars) based approach using GPT-3 text-davinci-002 model to generate a DSL program.
One of the key techniques which significantly improves reasoning abilities of LLMs is chainof-thought prompting introduced by (Wei et al., 2022b). They propose a few-shot prompt that consists of triples: <input, chain-of-thought, output>, where the chain-of-thought (CoT) is a series of intermediate natural language reasoning steps that leads to the final output. (Kojima et al., 2022) demonstrate that reasonable zero-shot learning is achieved by simply adding 'Let's think step by step' to the prompt and using a two-prompt approach: the first prompt to extract the reasoning path and the second to extract the final answer. Unlike our approach, which avoids performing calculations using the LLM, both of these approaches utilize the LLM for generating mathematical expressions that encode the reasoning and perform arithmetic at each step.
Program of thoughts (PoT) prompting (Chen et al., 2023) and Program-aided Language Models (PAL) (Gao et al., 2023) are approaches that are conceptually similar to our proposed technique. However, (Chen et al., 2023) show only limited zero-shot prompting experiments for financial data sets. Their results indicate that few-shot prompting significantly outperforms the zero-shot prompts. (Gao et al., 2023) discuss only few-shot prompting and do not show any results on financial data sets. In contrast, our work focuses entirely on optimizing zero-shot prompts that generate Python program or domain specific language for financial question answering. We further demonstrate that carefully designed zero-shot prompts for financial question answering can achieve comparable results with few-shot methods.
## 3 Zero-shot Prompting for Financial Domains
We introduce a novel zero-shot template-based prompting for financial question answering. These prompts are designed to generate executable programs for answering questions. Generating executable programs and executing them externally enables accurate mathematical calculations, which eliminates arithmetic errors. We follow the prompt guidelines described in (Reynolds and McDonell, 2021) and employ the following principles for designing zero-shot prompting for question answering:
Signifier: A signifier is a pattern which keys the intended behavior. A task specific signifier directly elucidates the task at hand. The sentence - 'Read the following passage and then answer the question', specifically describes the question answering task that is to be performed.
Memetic proxy: A memetic proxy is a concept in which a character or characteristic situation is used as a proxy for an intention. '#Python' can be a memetic proxy for the LLM to clarify the intention that the response should have a Python program.
Constraining behavior: In addition to directing the LLM on the desirable response, it is important for the prompt to inform the LLM of undesirable responses. Instructions restricting undesirable LLM responses fall into the constraining behavior category.
Meta prompting: A meta prompt is a short phrase or a fill-in-the-blank template encapsulating a more general intention that will unfold into a more specific prompt when combined with additional information such as the task at hand. In the question answering task, the sentence -'Let us think step by step.', elicits step-by-step reasoning in LLMs for answering questions.
Inspired by these prompt design principles, we present two zero-shot prompting techniques: ZS-FinPYT prompt that enables LLMs to generate Python executable programs and ZS-FinDSL prompt that enables LLMs to generate executable domain specific language programs. We also discuss two baseline zero-shot prompting techniques, one using a simple dual prompt and another using zero-shot chain-of-thought prompting (ZS-CoT) motivated by (Kojima et al., 2022). For reproducibility purposes, we provide exact prompts for all techniques and datasets.
## 3.1 Zero-shot FinPYT
The ZS-FinPYT prompt is a collection of instructions that directs the LLM to generate a valid Python program that can be executed by the exec function. Based on preliminary experiments, we identified the following requirements for the proposed prompt:
- (a) The prompt should describe the task such that it enables the LLM to generate consistent programs for answering the questions.
- (b) The final answer to a question must be stored in a specified Python variable to enable consistent extraction of the executed answer.
- (c) The LLM generated program should not include non-executable statements, ensuring seamless execution of the Python programs.
The ZS-FinPYT prompt is designed to accommodate the above requirements in the following manner:
Direct task specification using the signifier: We use the following signifier for explicitly specifying the question answering task:
Read the following passage and then write Python code to answer the question:
Passage: text + table
Question: ask question?
Answer this question by following the below instructions.
The signifier explicitly calls out the task of writing a Python program to answer the question after reading the passage, where the passage and the question are identified with the identifiers 'Passage:' and 'Question:' respectively. Furthermore, the prompt directs the LLM to follow certain instructions while answering the question.
Direct sub-task specification using the signifier: The sub-task of storing the final answer to a specific Python variable is described as a part of instructions to the LLM:
Define the Python variable which must begin with a character. Assign values to variables required for the calculation. Create Python variable "ans" and assign the final answer (bool/float) to the variable "ans".
Constraining LLM behavior: To ensure naming conventions are followed and prevent the generation of non-executable statements, we include the following instructions in the prompt:
Define the Python variable which must begin with a character. Don't include non-executable statements and include them as part of comments.
Memetic proxy phrases: Certain memetic proxy phrases are employed to implicitly convey intentions. For instance, the memetic phrase '#Comment: . . . ' guides the LLM to understand that comments are always preceded by the '#' character. Similarly, the memetic phrase '#Python' instructs the LLM to generate a Python program.
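Putting these pieces together, the full prompt can be assembled programmatically. The sketch below abridges the instruction wording (the exact text appears in Figure 1), and the helper name `build_finpyt_prompt` is our own:

```python
# Sketch of assembling a ZS-FinPYT-style prompt. The instruction wording is
# abridged from Figure 1; the helper name `build_finpyt_prompt` is ours.
PYT_INSTRUCTIONS = (
    "Answer this question by following the below instructions.\n"
    "Instructions:\n"
    "Define the Python variable which must begin with a character.\n"
    "Assign values to variables required for the calculation.\n"
    'Create Python variable "ans" and assign the final answer (bool/float) '
    'to the variable "ans".\n'
    "Don't include non-executable statements and include them as part of "
    "comments. #Comment: ...\n"
    "#Python"
)

def build_finpyt_prompt(passage: str, question: str) -> str:
    """Combine the signifier, the input passage/question and the instructions."""
    return (
        "Read the following passage and then write Python code to answer "
        "the question:\n"
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        f"{PYT_INSTRUCTIONS}"
    )
```

Note that the assembled string ends with the '#Python' memetic proxy, cueing the model to respond with code.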
The ZS-FinPYT prompt for the FinQA dataset is depicted in Figure 1.
Figure 1: ZS-FinPYT prompt for FinQA
(Figure 1 depicts the assembled ZS-FinPYT prompt: the input passage, question and instruction list, the '#Python' signifier, and the Python code generated by the LLM as the response.)
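Because the prompt forces the final answer into the variable 'ans', execution and answer extraction reduce to running the generated code in a fresh namespace and reading one variable back. A minimal sketch (error handling omitted; the helper name is ours):

```python
def run_generated_program(code: str):
    """Execute LLM-generated Python with exec() in a fresh namespace.

    The ZS-FinPYT instructions require the final answer to be assigned to
    the variable "ans", so extraction is a simple dictionary lookup.
    """
    namespace: dict = {}
    exec(code, namespace)
    return namespace.get("ans")

# Example: a program of the kind the prompt is designed to elicit.
generated = """
# Comment: revenue grew from 7.0 to 8.4 million
revenue_2019 = 7.0
revenue_2020 = 8.4
ans = (revenue_2020 - revenue_2019) / revenue_2019
"""
run_generated_program(generated)  # roughly 0.2, i.e. 20% growth
```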
## 3.2 Zero-shot FinDSL
The zero-shot FinDSL (ZS-FinDSL) is a zero-shot prompting technique for program generation in a domain specific language (DSL). We use a DSL similar to (Chen et al., 2021b) with two differences: we do not include table operators, and we instead add max and min operators. The output of the system is a DSL program that is extracted using a Python script and executed using a language interpreter. In the ZS-FinDSL technique, we adopt a dual prompt approach to extract reasoning for answering questions and generate the corresponding DSL program.
## 3.2.1 Reasoning Extraction Prompt
The reasoning extraction prompt of ZS-FinDSL consists of two parts:
Direct task specification using the signifier: The question answering task is specified explicitly using the following signifier:
Read the following passage and then answer the question:
Passage: text + table
Question: ask question?
Meta prompting for reasoning: For generating step by step reasoning for answering the question, the following meta prompt is used:
Answer this question by finding the relevant values and performing step by step calculations.
## 3.2.2 Program Extraction Prompt
The primary goal of the program extraction prompt is to extract DSL programs from the LLM's response obtained through the reasoning extraction prompt. To achieve this, the program extraction prompt involves specifying the task of program extraction and constraining the LLM's behavior by incorporating domain-specific knowledge.
Direct task specification using the signifier: The program extraction task is specified using the following signifier:
Question: ask question?
Answer: Answer with reasoning from LLM.
Task: From the above question-answer, extract the calculations that were performed to arrive at the answer. The calculations should be provided in the following format:
{"PROGRAM": {"#0": (OPERATION: [arithmetic/logic], ARG1: [float/int], ARG2: [float/int]), "#1": (OPERATION: [arithmetic/logic], ARG1: "#0", ARG2: [float/int/#int]), ...}, "ANSWER": [numerical/boolean]}
Constraining LLM behavior: To ensure consistent program extraction, we limit the mathematical operations to the set specified by the DSL. These operations are commonly used for financial question answering. Moreover, we constrain the program's output to numerical or boolean values to make it executable. The following instructions are passed to the LLM to ensure consistent program extraction:
Operation should strictly be restricted to {add, subtract, multiply, divide, exponent, greater-than, max, min} only. When evaluated the program should only generate numerical or boolean values.
The ZS-FinDSL prompt for the FinQA dataset is shown in Figure 2.
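The paper executes the extracted DSL programs with the interpreter script provided by FinQA. Purely for illustration, a toy interpreter for the restricted operation set above, with '#k' references resolving to earlier step results, might look like this (all names here are our own):

```python
import operator

# Toy interpreter for the restricted DSL operation set (illustrative only;
# the paper executes programs with the interpreter provided by FinQA).
OPS = {
    "add": operator.add, "subtract": operator.sub,
    "multiply": operator.mul, "divide": operator.truediv,
    "exponent": operator.pow, "greater-than": operator.gt,
    "max": max, "min": min,
}

def run_dsl(steps):
    """steps: list of (operation, arg1, arg2); '#k' refers to step k's result."""
    results = []
    for op, a, b in steps:
        def resolve(x):
            # A string of the form '#k' is a back-reference to step k.
            if isinstance(x, str) and x.startswith("#"):
                return results[int(x[1:])]
            return x
        results.append(OPS[op](resolve(a), resolve(b)))
    return results[-1]

run_dsl([("subtract", 6.0, 4.0), ("divide", "#0", 4.0)])  # (6.0 - 4.0) / 4.0 -> 0.5
```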
## 3.3 Zero-shot Standard Dual
A standard template-based prompting approach for question answering is the zero-shot standard dual (ZS-STD) prompt, which has an LLM answering prompt and an answer extraction prompt. In the LLM answering prompt, the question is appended below the passage and then the trigger word 'Answer:' is added for the LLM to generate the answer. The answer extraction prompt takes the LLM generated answer along with the question and appends the memetic proxy phrase 'The final answer (float/int/boolean) is' for extracting the final answer. The ZS-STD prompt for the FinQA dataset question answering is shown in Figure 3.
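The two ZS-STD prompts can be assembled from simple templates. The sketch below paraphrases the templates described above (the helper names are ours):

```python
def zs_std_answer_prompt(passage: str, question: str) -> str:
    """First prompt: the question is appended below the passage,
    followed by the trigger word 'Answer:'."""
    return f"{passage}\nQuestion: {question}\nAnswer:"

def zs_std_extraction_prompt(question: str, llm_answer: str) -> str:
    """Second prompt: the memetic proxy phrase triggers extraction
    of the final answer from the LLM's free-form response."""
    return (f"Question: {question}\nAnswer: {llm_answer}\n"
            "The final answer (float/int/boolean) is:")
```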
## 3.4 Zero-shot Chain of Thoughts
Similar to the zero-shot reasoners (Kojima et al., 2022), the zero-shot chain-of-thought (ZS-CoT) prompt is derived from the ZS-STD prompt by adding the reasoning trigger sentence 'Let us think step by step.' after the word 'Answer:'. The answer extraction prompt of ZS-CoT is identical to that of the ZS-STD prompt. The ZS-CoT prompt for the
Figure 3: ZS-STD prompt for FinQA
Figure 2: ZS-FinDSL prompt for FinQA

(Figure 2 depicts the dual-prompt flow: a reasoning extraction prompt over the passage and question, followed by a program extraction prompt that converts the reasoned answer into a DSL program.)
FinQA dataset question answering is described in Figure 4.
All prompts for TATQA are identical to those for FinQA; for the ConvFinQA dataset, the prompts are slightly modified to handle conversational questions, as shown in Appendix A.
## 4 Experiments
## 4.1 Experimental Design
Datasets: We conduct our experiments on three financial question answering datasets: FinQA (Chen et al., 2021b), ConvFinQA (Chen et al., 2022) and TATQA (Zhu et al., 2021), as summarized in Table 1. For our evaluations, we use the test split of FinQA, while for ConvFinQA and TATQA we use the dev set, as answers for the test splits of these datasets are not available. The evaluations for TATQA are restricted to questions of arithmetic type. The question answering task is to answer the questions using a passage containing text and table content. The table content is represented in a textual format using the strategy adopted in (Chen, 2022). In the textual format, the table columns are separated by '|', the rows are separated by '\n' and empty cells are filled with '-'.
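This table linearization can be sketched as follows (a minimal illustration; the actual conversion follows (Chen, 2022)):

```python
def table_to_text(rows):
    """Linearize a table into text: columns joined with '|', rows with
    newlines, and empty cells replaced by '-'."""
    return "\n".join(
        " | ".join(str(cell) if cell not in (None, "") else "-" for cell in row)
        for row in rows
    )

table = [["year", "revenue"], [2019, 7.0], [2020, ""]]
print(table_to_text(table))
# year | revenue
# 2019 | 7.0
# 2020 | -
```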
Large Language Models: We experimented with three Azure OpenAI 1 LLMs: text-davinci-003, gpt-3.5-turbo and gpt-4. The Python programs generated using LLMs are executed using the Python function exec. The domain specific programs are executed using the Python script provided by FinQA. 2 In order to achieve a more precise and predictable outcome, the LLM parameters are set as follows: temperature = 0, top_prob = 0.95, max_tokens = 1000.
Evaluation Metrics: For all the financial
1 https://oai.azure.com/
2 https://github.com/czyssrs/FinQA
Figure 4: ZS-CoT prompt for FinQA
(Figure 4 depicts the two-stage ZS-CoT flow: a reasoning extraction prompt ending with 'Answer: Let us think step by step.', followed by an answer extraction prompt ending with 'The final answer (float/int/boolean) is:'.)
For all three financial datasets (FinQA, ConvFinQA and TATQA), we implement the evaluation strategy discussed in program of thoughts prompting (Chen et al., 2023) on GitHub 3 with slight modifications. The LLM responses vary in form for questions with answers in thousands, millions, and percentages. For example, for the gold answer 7 million, the GPT response may be 7 million or 7,000,000; for the gold answer 23%, the GPT response may be 23% or 0.23. The evaluation strategy is modified to handle such cases. We relax the evaluations for ZS-CoT (Kojima et al., 2022) and standard dual prompting because LLMs using these prompting techniques generate answers instead of programs. Since LLMs cannot perform precise mathematical calculations (especially with high-precision floats and large numbers), we provide a tolerance while comparing the GPT final answer with the gold answer. The evaluation is implemented using the Python function isclose with a relative tolerance (rel_tol) of 0.001. The isclose function returns True while comparing the GPT final answer ($\hat{a}$) with the gold answer ($\tilde{a}$) if and only if the condition

$$|\hat{a} - \tilde{a}| \leq 0.001 \times \max(|\hat{a}|, |\tilde{a}|)$$

is satisfied.

Baselines: We consider two baselines for the zero-shot prompting setting: the ZS-STD prompt and the ZS-CoT prompt. These zero-shot prompting techniques are evaluated with all three Azure OpenAI models (text-davinci-003, gpt-3.5-turbo, gpt-4) on all three financial datasets (FinQA, ConvFinQA and TATQA).

## 4.2 Main Results

The evaluation results for the proposed prompting techniques ZS-FinPYT and ZS-FinDSL along with the baselines ZS-STD and ZS-CoT are summarized in Table 2. The ZS-FinPYT and ZS-FinDSL methods significantly outperform the ZS-STD prompt for all datasets and across all LLMs. ZS-FinPYT achieves 4.5% to 47% and ZS-FinDSL achieves 5.22% to 38.72% improvement in accuracy over ZS-STD. The increase in accuracy for text-davinci-003 and gpt-3.5-turbo is much higher than that for gpt-4, as for gpt-4 the base model already performs reasonably well. These results indicate that our prompts are able to induce the required reasoning and successfully output the required Python programs or domain specific languages.

3 https://github.com/wenhuchen/Program-of-Thoughts
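The relaxed comparison used in the evaluation can be sketched as a small helper built on Python's `math.isclose`. The function name and the specific scale-normalization factors below are illustrative assumptions, not the authors' exact implementation:

```python
from math import isclose

def answers_match(pred: float, gold: float, rel_tol: float = 0.001) -> bool:
    """Tolerance-based comparison of a model's final answer with the gold
    answer, allowing for scale/format mismatches (millions, percentages)."""
    # Direct comparison with the paper's relative tolerance of 0.001.
    if isclose(pred, gold, rel_tol=rel_tol):
        return True
    # Illustrative normalizations: gold "7 million" vs. predicted 7,000,000,
    # or gold 23 (a percentage) vs. predicted 0.23.
    for scale in (100, 1_000, 1_000_000):
        if isclose(pred, gold * scale, rel_tol=rel_tol) or \
           isclose(pred * scale, gold, rel_tol=rel_tol):
            return True
    return False

print(answers_match(7_000_000, 7))  # million vs. raw value -> True
print(answers_match(0.23, 23))      # fraction vs. percentage -> True
print(answers_match(41.2, 41.21))   # within 0.1% relative tolerance -> True
```

By default `math.isclose` uses only the relative tolerance (its absolute tolerance defaults to 0), which matches the condition above.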
Both methods also made significant improvements over the ZS-CoT method for text-davinci-003 and gpt-3.5-turbo, with ZS-FinPYT achieving 3% to 33.22% and ZS-FinDSL achieving 0% to 24.94% improvement over ZS-CoT on the different datasets. For gpt-4, our approach slightly outperforms ZS-CoT for all datasets, with improvements in the range of 1.5-3.5%. However, it is important to highlight that ZS-CoT lacks the ability to provide precise answers, and its accuracy is measured using a relaxed metric, while our methods generate precise answers and an exact metric is used to measure accuracy.
In general, the ZS-FinPYT approach gave better results than ZS-FinDSL for the text-davinci-003 and gpt-3.5-turbo models across the different datasets. For gpt-4, both methods are comparable.
We also carried out an evaluation of the OpenAI models using few-shot PoT prompting, as shown in Table 3. The comparisons indicate the excellent performance of our zero-shot method: we are within 10% of the few-shot results, in many cases almost the same, and in a few cases even surpass the few-shot performance.
| Dataset | Split | Examples | Input | Output |
|-----------|---------|-----------|------------------------------------|---------------|
| FinQA | Test | 1147 | Table + Text + Question | Number+Binary |
| ConvFinQA | Dev | 421 | Table + Text + Multi-turn Question | Number+Binary |
| TATQA | Dev † | 718 | Table + Text + Question | Number+Binary |
Table 1: Financial question answering datasets for evaluation
| Models | FinQA | ConvFinQA | TATQA |
|------------------------------|---------|-------------|---------|
| ZS-STD (text-davinci-003) | 22.58 | 13.30 | 39.97 |
| ZS-CoT (text-davinci-003) | 41.15 | 27.08 | 68.94 |
| ZS-FinDSL (text-davinci-003) | 56.76 | 52.02 | 68.25 |
| ZS-FinPYT (text-davinci-003) | 66.60 | 60.30 | 78.40 |
| ZS-STD (gpt-3.5-turbo) | 32.26 | 47.74 | 49.03 |
| ZS-CoT (gpt-3.5-turbo) | 53.01 | 52.49 | 74.09 |
| ZS-FinDSL (gpt-3.5-turbo) | 61.12 | 60.81 | 77.86 |
| ZS-FinPYT (gpt-3.5-turbo) | 66.52 | 67.45 | 85.00 |
| ZS-STD (gpt-4) | 63.64 | 72.45 | 77.58 |
| ZS-CoT (gpt-4) | 74.19 | 75.30 | 90.11 |
| ZS-FinDSL (gpt-4) | 77.33 | 77.67 | 90.53 |
| ZS-FinPYT (gpt-4) | 77.51 | 76.95 | 93.00 |
Table 2: Comparison results of various models on different datasets.
## 4.3 Performance Analysis
We conduct a performance analysis on the FinQA dataset for two models, gpt-4 and gpt-3.5-turbo; see Table 4 for details. The FinQA questions are divided into various categories to gain further insights.
Performance on text and table questions: The FinQA questions are divided into three sets depending on where the information required to answer the question is available: table-only questions, text-only questions, and table-text questions.
Performance regarding program steps: The FinQA questions are divided into three sets based on the number of steps required to produce the answer: 1-step programs, 2-step programs, and >2-step programs.
Performance regarding question types: The FinQA questions are divided into numerical and boolean type questions.
The key findings are listed below:
- The models achieve the highest accuracy on table-only questions. As tables are structured and the tables in this dataset are simple, it may be easier for the LLMs to accurately extract the values than to extract them from unstructured text.
- Questions with multi-hop reasoning are challenging. As would be expected, both models find it easier to answer questions requiring one- or two-hop reasoning than questions needing more than two hops.
- Numerical questions are more challenging than boolean questions. In general, the gpt-4 and gpt-3.5-turbo models excel at answering boolean questions over arithmetic questions. However, gpt-3.5-turbo's performance declines with the ZS-FinDSL prompt for boolean questions as compared to arithmetic questions. Examination of a few cases indicated that gpt-3.5-turbo has greater difficulty in writing DSL programs correctly for boolean questions.
## 4.4 Error Analysis
We sampled 50 test cases from the FinQA results of the text-davinci-003 model and examined the entire system output in detail to gain further insight into the obtained results. As expected, the ZS-STD prompt results in brief answers, with a sentence or a value as the output and no details on the reasoning, potentially contributing to its poor performance. On the other hand, LLM responses with ZS-CoT detail the reasoning behind the answers and show significantly better performance than ZS-STD. However, arithmetic errors result in a substantial drop in performance for both the ZS-STD prompt and ZS-CoT.
The ZS-FinPYT and ZS-FinDSL approaches
| Models | FinQA | ConvFinQA | TATQA |
|-----------------------------------|---------|-------------|---------|
| Few-shot PoT (text-davinci-003) ⋆ | 72.27 | 69.35 | 83.21 |
| ZS-FinPYT (text-davinci-003) | 66.60 | 60.30 | 78.40 |
| Few-shot PoT (gpt-3.5-turbo) ⋆ | 67.39 | 65.79 | 74.75 |
| ZS-FinPYT (gpt-3.5-turbo) | 66.52 | 67.45 | 85.00 |
| Few-shot PoT (gpt-4) ⋆ | 78.46 | 82.42 | 91.89 |
| ZS-FinPYT (gpt-4) | 77.51 | 76.95 | 93.00 |
- ⋆ Few-shot PoT uses 4-shots selected from the few-shots used in (Chen et al., 2023).
Table 3: Performance of ZS-FinPYT and few-shot PoT on different datasets.
| Methods | ZS-FinPYT (gpt-4) | ZS-FinPYT (gpt-3.5-turbo) | ZS-FinDSL (gpt-4) | ZS-FinDSL (gpt-3.5-turbo) |
|--------------------------------------|-------|-------|-------|-------|
| overall accuracy | 77.51 | 66.52 | 77.33 | 61.12 |
| Performance on table and text | | | | |
| table-only questions | 80.91 | 71.36 | 81.36 | 63.94 |
| text-only questions | 74.45 | 58.39 | 73.36 | 60.22 |
| table-text questions | 67.44 | 55.81 | 68.22 | 48.84 |
| Performance regarding program steps | | | | |
| 1-step programs | 80.73 | 69.27 | 79.82 | 62.08 |
| 2-step programs | 77.02 | 64.79 | 77.26 | 63.08 |
| >2-step programs | 54.76 | 53.57 | 58.33 | 44.05 |
| Performance regarding question types | | | | |
| boolean questions | 90.00 | 95.00 | 85.00 | 45.00 |
| numerical questions | 77.28 | 66.02 | 77.20 | 61.40 |
Table 4: Performance breakdown of various models on FinQA dataset.
demonstrated detailed reasoning. In the case of ZS-FinPYT, the task of writing a Python program triggers reasoning, while in the case of ZS-FinDSL there are two prompts, where the first is a meta prompt that drives the reasoning similar to ZS-CoT. These techniques produce programs instead of answers and therefore mitigate arithmetic errors. Hence, the proposed techniques significantly outperform ZS-CoT. The ZS-FinDSL performance is lower than ZS-FinPYT because the program extraction step fails in some cases where the reasoning step is correct. One possible explanation is that the GPT systems have likely been trained on huge amounts of Python code and can therefore generate Python programs efficiently, whereas for ZS-FinDSL the instruction must convey how to write the domain specific program. This may be driving the slightly higher error rate of ZS-FinDSL. Some demonstrative examples supporting these observations may be found in Appendix B.
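The execution step that mitigates arithmetic errors can be sketched as follows: the generated program is run by a Python interpreter and the variable `ans` is read back. The sample program text and the bare `exec` call are illustrative; in practice, untrusted model output should be run in a sandbox.

```python
# Illustrative LLM-generated program following the prompt's conventions:
# variables begin with a character, the final answer is assigned to `ans`,
# and non-executable statements appear only as comments.
generated_program = """
# Comment: change in revenue = revenue_2017 - revenue_2016
revenue_2017 = 7_000_000.0
revenue_2016 = 6_500_000.0
ans = revenue_2017 - revenue_2016
"""

namespace = {}
exec(generated_program, namespace)  # sandbox untrusted code in practice
print(namespace["ans"])  # prints 500000.0
```

Because the arithmetic is performed by the interpreter rather than by the LLM, the result is exact regardless of the magnitude or precision of the operands.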
## 5 Conclusion
We proposed zero-shot prompting techniques to answer complex questions requiring multi-hop numerical reasoning over financial reports. The prompts guide the LLM to encode the required reasoning into a program that is executed by a program interpreter. The approach demonstrated excellent results on three financial datasets, achieving significant improvements over the respective baselines. We hope that our work will motivate a principled approach to prompt design with other LLMs.
## Limitations
In this paper, we only experiment with the GPT-x series of LLMs. While this work shows the tremendous potential for zero-shot financial reasoning with LLMs, it is possible that better performance may be obtained with other LLMs. Moreover, the prompts we have proposed are designed to address specific problems observed with the three GPT models considered in this work. Other LLMs may behave differently and will likely need modifications to the prompts to work effectively.
While we experiment with and find zero-shot prompts that are effective for both ZS-FinPYT and ZS-FinDSL, and the error analysis provided insights into failures, there are also unexplained failures in reasoning, and more research is needed to understand the behavior of LLMs in certain cases. For ZS-FinDSL, we observed some patterns that result in failure of program extraction. However, it is unclear what drives these failures, and we leave that for future work.
For cases where the reasoning was incorrect, the system may provide an explanation with a high level of confidence. Our prompt currently does not address or control for such behavior. In practice, this can pose challenges for use in real-world systems.
## Disclaimer
The views reflected in this article are the views of the authors and do not necessarily reflect the views of the global EY organization or its member firms.
## References
- Md Shad Akhtar, Abhishek Kumar, Deepanway Ghosal, Asif Ekbal, and Pushpak Bhattacharyya. 2017. A Multilayer Perceptron based Ensemble Technique for Fine-grained Financial Sentiment Analysis. In Proceedings of the 2017 conference on empirical methods in natural language processing , pages 540-546.
- Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv preprint arXiv:2302.04023 .
- Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 .
- Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, et al. 2021a. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 .
- Wenhu Chen. 2022. Large Language Models are few (1)-shot Table Reasoners. arXiv preprint arXiv:2210.06710 .
- Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of Thoughts
Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Transactions on Machine Learning Research .
- Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting Hao Huang, Bryan Routledge, et al. 2021b. FINQA: A Dataset of Numerical Reasoning over Financial Data. In 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021 , pages 3697-3711. Association for Computational Linguistics (ACL).
- Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. arXiv preprint arXiv:2210.03849 .
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, et al. 2022. PaLM: Scaling Language Modeling with Pathways. arxiv:2204.02311 .
- Min-Yuh Day and Chia-Chou Lee. 2016. Deep learning for financial sentiment analysis on finance news providers. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) , pages 1127-1134. IEEE.
- Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, et al. 2023. Mathematical Capabilities of ChatGPT. arXiv preprint arXiv:2301.13867 .
- Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided Language Models. In Proceedings of the 40th International Conference on Machine Learning , volume 202 of Proceedings of Machine Learning Research , pages 10764-10799. PMLR.
- Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375 .
- Jingguang Han, Utsab Barman, Jer Hayes, Jinhua Du, Edward Burgin, and Dadong Wan. 2018. NextGen AML: Distributed Deep Learning based Language Technologies to Augment Anti Money Laundering Investigation. In Proceedings of ACL 2018, System Demonstrations , pages 37-42.
- Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361 .
- Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large
Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems , volume 35, pages 22199-22213. Curran Associates, Inc.
- Tiffany H. Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, et al. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health , pages 1-12.
- Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, et al. 2022. Solving Quantitative Reasoning Problems with Language Models. In Advances in Neural Information Processing Systems , volume 35, pages 3843-3857. Curran Associates, Inc.
- Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pretrain, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys , 55(9):1-35.
- Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. 2021. FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence , pages 4513-4519.
- Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming fewshot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 8086-8098.
- Kyle Mahowald, Anna A Ivanova, Idan A Blank, Nancy Kanwisher, Joshua B Tenenbaum, and Evelina Fedorenko. 2023. Dissociating language and thought in large language models: a cognitive perspective. arXiv preprint arXiv:2301.06627 .
- OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 .
- Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. 2023. Formal Mathematics Statement Curriculum Learning. In The Eleventh International Conference on Learning Representations 2023 . OpenReview.net.
- Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages 784-789.
- Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems , pages 1-7.
- Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised Commonsense Question Answering with Self-Talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 4615-4629.
- Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239 .
- Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change). In NeurIPS 2022 Foundation Models for Decision Making Workshop .
- Weikang Wang, Jiajun Zhang, Qian Li, Chengqing Zong, and Zhifei Li. 2019. Are you for real? detecting identity fraud via dialogue interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 1762-1771.
- Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations 2023 . OpenReview.net.
- Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research .
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022b. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems , volume 35, pages 24824-24837. Curran Associates, Inc.
- Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 2369-2380.
- Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and TatSeng Chua. 2021. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages 3277-3287.
## A Prompts for ConvFinQA
The ConvFinQA prompts are slightly modified to handle conversational questions.
ZS-FinPYT for ConvFinQA: For gpt-4, we use a single prompt in which the last question in the series is clearly marked and the system is instructed to answer that last question, as shown in Figure 5. For gpt-3.5-turbo and text-davinci-003, we use a dual-prompt approach consisting of a reasoning extraction prompt and a program generation prompt; see Figure 6. The reasoning extraction prompt generates answers with reasoning for all the questions in a conversation, and the program generation prompt generates a Python program answering the last question.
ZS-FinDSL for ConvFinQA: The ZS-FinDSL for ConvFinQA, see Figure 7, is a dual prompt consisting of a reasoning prompt and a program extraction prompt that are similar to the corresponding prompts for FinQA. The reasoning prompt instructs the LLM to generate answers with reasoning for all questions in a conversation. The program extraction prompt instructs the LLM to generate a program that performs the calculations needed to answer the last question.
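For concreteness, a minimal interpreter for such a restricted DSL might look like the following. The step representation here (a list of `(operation, arg1, arg2)` tuples, where the string `"#i"` refers to the result of step `i`) is a simplification rather than the paper's exact program syntax; the operation set follows the restriction stated in the prompt.

```python
import operator

# Operation set restricted as in the ZS-FinDSL prompt.
OPS = {
    "add": operator.add, "subtract": operator.sub,
    "multiply": operator.mul, "divide": operator.truediv,
    "exponent": operator.pow, "greater than": operator.gt,
    "less than": operator.lt, "min": min, "max": max,
}

def run_dsl(steps):
    """Evaluate [(op, arg1, arg2), ...]; "#i" refers to the result of
    step i. Returns the last step's numerical or boolean value."""
    results = []
    def resolve(x):
        # A "#i" string argument references an earlier step's result.
        return results[int(x[1:])] if isinstance(x, str) and x.startswith("#") else x
    for op, a, b in steps:
        results.append(OPS[op](resolve(a), resolve(b)))
    return results[-1]

# Percentage change of a value from 100 to 120: (120 - 100) / 100
print(run_dsl([("subtract", 120, 100), ("divide", "#0", 100)]))  # prints 0.2
```

Restricting the operation set in this way keeps the extracted programs easy to validate and guarantees that evaluation yields only numerical or boolean values.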
ZS-STD and ZS-CoT for ConvFinQA: The LLM answering prompt of ZS-STD (see Figure 8) and the reasoning extraction prompt of ZS-CoT (see Figure 9) instruct the LLM to answer the questions of a conversation. The answer extraction prompt of both of these techniques then extracts the final answer.
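The two-stage flow of ZS-CoT can be sketched as follows. Here `complete` stands in for any LLM completion API, and the prompt wording is paraphrased from the figures rather than copied verbatim.

```python
def zs_cot_answer(complete, passage, questions):
    """Two-stage ZS-CoT: a reasoning extraction call followed by an
    answer extraction call that isolates the final value."""
    base = (
        f"Read the following passage and then answer the questions:\n"
        f"Passage: {passage}\nQuestions: {questions}\n"
        "Answer: Let us think step by step."
    )
    reasoning = complete(base)  # reasoning extraction stage
    final = complete(base + "\n" + reasoning +
                     "\nFinal answer (float/int/boolean) is:")  # answer extraction
    return final.strip()

# Example with a stub LLM that returns a fixed string per stage:
print(zs_cot_answer(
    lambda p: "42" if "Final answer" in p else "reasoning...",
    "revenue was 42 in 2020.",
    "what was the revenue in 2020?"))  # prints "42"
```

Because the second call appends the model's own reasoning before asking for the final answer, the extraction step only has to isolate a value already present in the reasoning text.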
## B Error Analysis Examples
We show some examples from the FinQA dataset with the corresponding responses from the text-davinci-003 model under various prompts. These examples demonstrate both successful attempts and failure cases.
We begin by showing some examples where the ZS-FinDSL (text-davinci-003) system generates correct reasoning and the corresponding program generation succeeds; see Figure 10 and Figure 11. Similarly, Figure 12 and Figure 13 show successful Python program generation by the ZS-FinPYT (text-davinci-003) system.
In most of the cases, the LLM answering prompt of ZS-STD (text-davinci-003) generates only a value or a sentence, see Figure 14 and Figure 15 for details. In some cases, the answer extraction step fails as shown in Figure 16.
The LLM responses with ZS-CoT detail the reasoning behind the answers and show significantly better performance than ZS-STD. However, arithmetic errors result in a substantial drop in performance for both the ZS-STD prompt and ZS-CoT. Examples demonstrating arithmetic errors are shown in Figure 17 and Figure 18.
The ZS-FinDSL performance is lower than ZS-FinPYT because the program extraction step fails in some cases where the reasoning step is correct, as shown in Figure 19 and Figure 20.
<details>
<summary>Image 5 Details</summary>

Diagram summary: a single ZS-FinPYT prompt for gpt-4. The passage (text + table), the series of questions, and the clearly marked last question form the input. The instructions direct the LLM to define Python variables, assign the values required for the calculation, assign the final answer (bool/float) to a variable "ans", and keep non-executable statements as comments; the LLM then outputs executable Python code. Color coding marks the signifier, memetic proxy, constraining behavior, and input portions of the prompt.
</details>
Figure 5: ZS-FinPYT (gpt-4) prompt for ConvFinQA
<details>
<summary>Image 6 Details</summary>

Diagram summary: the dual-prompt ZS-FinPYT pipeline. The reasoning extraction prompt presents the passage (text + table) and questions and instructs the LLM to answer by finding the relevant values and performing step-by-step calculations. The program generation prompt then takes the questions and the LLM's reasoned answer, together with instructions for producing executable Python code that assigns the final answer (bool/float) to a variable "ans"; the LLM outputs the Python program. Color coding marks the signifier, memetic proxy, constraining behavior, meta prompt, and input portions of the prompts.
</details>
Figure 6: ZS-FinPYT (gpt-3.5-turbo, text-davinci-003) prompt for ConvFinQA
<details>
<summary>Image 7 Details</summary>

Diagram summaries: Figure 7 (ZS-FinDSL for ConvFinQA) shows a reasoning prompt over the passage (text + table) and questions, followed by a program extraction prompt that asks the LLM to extract the calculations performed for the last question as a program whose operations are strictly restricted to add, subtract, multiply, divide, exponent, greater than, less than, min, and max, with "#" references to earlier step results, and which must evaluate to a numerical or boolean value. Figure 8 (ZS-STD) shows a direct answering prompt followed by an answer extraction prompt ("Final answer (float/int/boolean) is:"). Figure 9 (ZS-CoT) shows a reasoning extraction prompt ending with "Answer: Let us think step by step." Color coding marks the signifier, memetic proxy, constraining behavior, meta prompt, and input portions of the prompts.
* A yellow line labeled "Meta prompt" connects the input to the LLM.
**Diagram 6: Answer Extraction (Figure 8)**
* **Input:** "Questions: ask question?".
* **LLM:** "Answer with reasoning from LLM."
* **Output:** "Final answer (float/int/boolean) is:".
* **Prompt/Data Flow:**
* A blue line labeled "Signifier" connects the input to the LLM.
* A yellow line, representing the final answer generated by the LLM, connects the LLM to the output.
### Key Observations
* The diagrams consistently use the same visual structure, highlighting a standardized process.
* The "Program Extraction" diagram contains the most detailed textual information, outlining the specific format and constraints for the LLM's output.
* The prompts labeled "Signifier", "Memetic proxy", "Constraining behavior", and "Meta prompt" appear to guide the LLM's reasoning and output generation.
* The diagrams illustrate a multi-stage process, starting with input, progressing through LLM processing, and culminating in an answer or program output.
### Interpretation
These diagrams demonstrate two different prompting strategies – ZS-FinDSL and ZS-STD – for tackling the ConvFinQA dataset. The diagrams visually represent how LLMs are used to process text and tables, extract reasoning, generate programs (in the case of ZS-FinDSL), and ultimately provide answers. The different colored lines represent different types of prompts or data flow, suggesting that the researchers are experimenting with various techniques to improve the LLM's performance. The detailed instructions within the "Program Extraction" diagram indicate a focus on structured output and adherence to specific constraints. The consistent use of the LLM box and input/output blocks suggests a standardized workflow for evaluating and comparing the effectiveness of different prompting strategies. The diagrams are a visual representation of a research process aimed at improving the ability of LLMs to solve complex financial question-answering tasks. The use of terms like "Signifier" and "Memetic proxy" suggests a theoretical framework related to information transfer and representation within the LLM.
</details>
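The restricted DSL described in the program-extraction prompt above can be executed with a very small interpreter. The sketch below is an illustration under assumptions, not the paper's implementation: the tuple-based program representation and the function name `run_program` are ours; the operation names follow the allowed list in the prompt, and `"#k"` arguments refer to the result of step `k`.

```python
# Minimal interpreter sketch for the restricted DSL in the program-extraction
# prompt. The (operation, arg1, arg2) tuple representation is an assumption
# made for illustration.
OPERATIONS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "exponent": lambda a, b: a ** b,
    "greater than": lambda a, b: a > b,
    "less than": lambda a, b: a < b,
    "min": min,
    "max": max,
}

def run_program(steps):
    """Execute DSL steps; "#k" arguments reference the result of step k."""
    results = []

    def resolve(arg):
        # "#0", "#1", ... point back at earlier intermediate results.
        if isinstance(arg, str) and arg.startswith("#"):
            return results[int(arg[1:])]
        return arg

    for op, arg1, arg2 in steps:
        results.append(OPERATIONS[op](resolve(arg1), resolve(arg2)))
    return results[-1]

# Example: a two-step percent-change program of the kind shown in the appendix.
ans = run_program([("subtract", 39.2, 28.2), ("divide", "#0", 28.2)])
```

Evaluating the generated program outside the LLM, as the paper proposes, is what sidesteps the model's own arithmetic mistakes.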
Figure 9: ZS-CoT prompt for ConvFinQA
Figure 10: An example where ZS-FinDSL extracted correct reasoning and successfully generated the DSL program.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Document: UPS 2010 - Page 52
### Overview
This document appears to be an excerpt from a financial report (UPS 2010, page 52) presenting contractual obligations and commitments. It includes a table of values, a question related to the data, a "Gold Program" solution, and reasoning/program extraction responses from a system called "ZS-FinDSL".
### Components/Axes
The document contains the following components:
* **Header:** "UPS/2010/page_52.pdf-1"
* **Passage:** A textual description of the data.
* **Table:** A table with rows representing "Commitment Type" and columns representing years (2011-2012, 2014, 2015, "After 2016", "Total").
* **Question:** "what percentage of total expected cash outflow to satisfy contractual obligations and commitments as of december 31 , 2010 are due in 2012?"
* **Gold Program/Answer:** A proposed solution and its result.
* **ZS-FinDSL Responses:** Outputs from a reasoning and program extraction system.
### Content Details
The table data is as follows (values in millions, inferred from the ZS-FinDSL reasoning). As transcribed, the header omitted a 2013 column even though the Total row carries seven values that sum exactly to the $23,556 total; the header below adds that column, and the one untranscribed cell in each of the first two rows is assumed to be the 2013 cell and marked "…":
| Commitment Type | 2011 | 2012 | 2013 | 2014 | 2015 | After 2016 | Total |
|---|---|---|---|---|---|---|---|
| Capital Leases | $18 | $19 | … | $20 | $21 | $12 | $209 |
| Other Liabilities | $69 | $67 | … | $64 | $58 | $43 | $339 |
| Total | $2,944 | $1,334 | $3,515 | $2,059 | $820 | $12,884 | $23,556 |
The question asks for the percentage of the total expected cash outflow (as of December 31, 2010) that is due in 2012.
The "Gold Program" proposes the calculation: divide(1334, 23556) and provides an answer of 0.05663.
The ZS-FinDSL reasoning states: "Total expected cash outflow to satisfy contractual obligations and commitments as of December 31, 2010 = $23,556 (million), Cash outflow due in 2012 = $1,334 (million). Percentage of total expected cash outflow due in 2012 = (1,334/23,556) x 100 = 5.65%".
The ZS-FinDSL program-extraction prompt response is transcribed as: `{"#":"operation: 'divide'", "arg1":"1,334", "arg2":"23,556"}, "Answer": "282"}` (the braces are unbalanced and the "282" value is inconsistent with the surrounding reasoning, so this line appears garbled in transcription).
The extracted ZS-FinDSL program is: `divide(1334, 23556)`.
The ZS-FinDSL "Executed Answer" is: 0.05663.
### Key Observations
* There is a discrepancy between the "Gold Program" answer (0.05663) and the ZS-FinDSL reasoning which calculates a percentage of 5.65%. The "Gold Program" appears to provide a decimal representation of the ratio, while the ZS-FinDSL provides the percentage.
* The ZS-FinDSL outputs include both a programmatic representation of the calculation and the final result.
* The table values are stated in millions of dollars, indicating significant financial commitments.
### Interpretation
This document demonstrates a financial analysis process. The table presents contractual obligations categorized by type and year. The question tests the ability to extract relevant data from the table and perform a percentage calculation. The "Gold Program" and ZS-FinDSL responses represent automated or semi-automated solutions to the question. The discrepancy between the "Gold Program" and ZS-FinDSL results highlights the importance of understanding the context and units of the calculated values (ratio vs. percentage). The ZS-FinDSL system appears to be designed to break down the problem into a programmatic form, execute the calculation, and provide the answer. The inclusion of both the program and the result suggests a focus on transparency and verifiability.
</details>
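The ratio-vs-percentage mismatch noted above (0.05663 vs. 5.65%) is a recurring evaluation issue for this kind of task. One plausible normalization, shown here as a sketch and not as the paper's actual evaluation code (the function name `answers_match` and the tolerance are our assumptions), is to compare the prediction against the gold answer at several scale factors:

```python
def answers_match(predicted, gold, rel_tol=0.01):
    """Return True if predicted equals gold up to a percent/ratio scale
    factor and a small relative tolerance. A heuristic sketch, not the
    paper's evaluation code."""
    if gold == 0:
        return abs(predicted) <= rel_tol
    # Try the same scale, percent -> ratio, and ratio -> percent.
    for scale in (1.0, 0.01, 100.0):
        if abs(predicted * scale - gold) <= rel_tol * abs(gold):
            return True
    return False

# 5.65 (a percentage) matches the gold ratio 0.05663 after dividing by 100.
```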
Figure 11: An example where ZS-FinDSL extracted correct reasoning and successfully generated the DSL program.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Document: Financial Statement Extract & Reasoning
### Overview
This document appears to be an extract from a financial statement (HUM/2009/page_105.pdf-2) alongside a question, a "Gold Program" solution, and a reasoning explanation generated by a "ZS-FinDSL" system. The document focuses on calculating the percentage change in expected volatility from 2008 to 2009.
### Components/Axes
The document is structured into several sections:
1. **Passage:** Contains financial data related to "humana inc. notes to consolidated financial statements 2014 (continued) value, or the excess of the market value..." with years 2009, 2008, and 2007.
2. **Data Points:** Specific values are listed for:
* Weighted-average fair value at grant date: $14.24 | $17.95 | $21.07
* Expected option life (years): 4.6 | 5.1 | 4.8
* Expected volatility: 39.2% | 28.2% | 28.9%
* Risk-free interest rate at grant date: 1.9% | 2.9% | 4.5%
3. **Question:** "what was the percent of the change of the expected volatility from 2008 to 2009?"
4. **Gold Program:** `subtract(39.2, 28.2), divide(#0, 28.2)`
5. **Gold Answer:** 0.39007
6. **ZS-FinDSL Reasoning Extraction Prompt Response:** A step-by-step explanation of the calculation.
7. **ZS-FinDSL Program Extraction Prompt Response:** A structured representation of the program: `{"Program": [...], "Answer": "39%"}`
8. **ZS-FinDSL Program:** `subtract(39.2, 28.2), divide(#0, 28.2)`
9. **ZS-FinDSL Executed Answer:** 0.39007
### Detailed Analysis or Content Details
Let's break down the data and calculations:
* **Weighted-average fair value at grant date:**
* 2009: $14.24
* 2008: $17.95
* 2007: $21.07
* **Expected option life (years):**
* 2009: 4.6 years
* 2008: 5.1 years
* 2007: 4.8 years
* **Expected volatility:**
* 2009: 39.2%
* 2008: 28.2%
* 2007: 28.9%
* **Risk-free interest rate at grant date:**
* 2009: 1.9%
* 2008: 2.9%
* 2007: 4.5%
The question asks for the percentage change in expected volatility from 2008 to 2009. The "Gold Program" calculates this as follows:
1. Subtract the 2008 expected volatility (28.2%) from the 2009 expected volatility (39.2%): 39.2% - 28.2% = 11%
2. Divide the result (11%) by the 2008 expected volatility (28.2%): 11% / 28.2% = 0.39007 (approximately)
The "Gold Answer" is given as 0.39007. The ZS-FinDSL system provides both a textual explanation and a structured program representation. The final answer from the ZS-FinDSL system is also 0.39007, and the textual answer is 39%.
### Key Observations
* The expected volatility increased significantly from 2008 (28.2%) to 2009 (39.2%).
* The ZS-FinDSL system correctly calculates the percentage change in expected volatility.
* There is a difference in representation between the "Gold Answer" (0.39007, a ratio) and the ZS-FinDSL textual answer (39%), which is the same value expressed as a rounded percentage.
### Interpretation
This document demonstrates a financial calculation related to option pricing. The expected volatility is a key input in option pricing models (like Black-Scholes). The calculation presented aims to quantify the change in this volatility over time. The use of a "Gold Program" and a "ZS-FinDSL" system suggests an automated approach to financial analysis and risk management. The system is capable of both performing the calculation and providing a human-readable explanation of the process. The increase in expected volatility from 2008 to 2009 could indicate increased market uncertainty or risk during that period. The fact that the system provides both a decimal and percentage representation of the result suggests a focus on both precision and interpretability.
</details>
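As a quick check of the arithmetic in this example (plain Python mirroring the two-step Gold Program; this verification is ours, not part of the figure):

```python
# Verify the HUM expected-volatility calculation.
step0 = 39.2 - 28.2       # subtract(39.2, 28.2)
ans = step0 / 28.2        # divide(#0, 28.2)
assert round(ans, 5) == 0.39007   # matches the Gold Answer
# The textual "39%" is the same value, expressed as a rounded percentage.
assert round(ans * 100) == 39
```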
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Screenshot: Hologic Financial Statement & Question Answering
### Overview
This is a screenshot of a question-answering system interface, likely related to financial document analysis. It displays a passage from a Hologic, Inc. financial statement, a question posed to the system, the "Gold Program" answer, and a "ZS-FinPYT Program" execution with its answer. The document appears to be a PDF excerpt.
### Components/Axes
The screenshot is divided into several sections:
* **Header:** "HOLX/2009/page\_151.pdf-1" - indicating the source document.
* **Passage:** A block of text containing financial data.
* **Question:** A natural language question.
* **Gold Program:** Displays the function call and answer.
* **ZS-FinPYT Program:** Displays the code and answer.
### Detailed Analysis or Content Details
**Passage Content:**
The passage is a table summarizing non-vested restricted stock units activity. The table has the following columns:
* **Non-vested Shares:** Number of shares.
* **Number of Shares:** Number of shares.
* **Weighted-Average Grant-Date Fair Value:** Value in thousands, except per share data.
The table contains two rows of data:
* **Non-vested at September 27, 2008:** 1,461 shares | $31.23
* **Non-vested at September 26, 2009:** 2,770 shares | $21.96
**Question:**
"by what percentage did the average price of wti crude oil increase from 2011 to 2013?"
**Gold Program:**
* **Function Call:** `multiply(2770, 21.96)`
* **Gold Answer:** 60829.2
**ZS-FinPYT Program:**
* **Code:**
* `non_vested_shares = 2770`
* `weighted_average_grant_date_fair_value = 21.96`
* `ans = non_vested_shares * weighted_average_grant_date_fair_value`
* `print(ans) # prints 60,532.2`
* **ZS-FinPYT Executed Answer:** 60829.2
### Key Observations
* The question is about WTI crude oil prices, but the passage is about Hologic stock units. This suggests the system is attempting to answer a question using irrelevant context.
* The "Gold Program" and "ZS-FinPYT Program" arrive at the same executed answer (60829.2), despite the question being unrelated to the provided passage.
* The transcribed comment on the `print` statement (60,532.2) disagrees with the "ZS-FinPYT Executed Answer" (60829.2); since 2770 × 21.96 = 60,829.2, the comment is most likely a transcription error.
### Interpretation
The screenshot shows the system performing a calculation on numbers found in the passage even though the question, about WTI crude oil prices, is unrelated to the Hologic data provided. Both programs nevertheless produce the same numerical answer, indicating that the computation proceeded without any check that the passage actually supports the question. This illustrates a core challenge for question-answering systems over documents: generating a syntactically valid, executable program is not the same as answering the question asked, and a robust system must detect when the retrieved context is irrelevant.
</details>
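A quick arithmetic check of the HOLX example (plain Python; this verification is ours, not part of the original figure) shows that the executed answer is the true product, so the transcribed `# prints 60,532.2` comment is most plausibly a transcription error:

```python
# Recompute the ZS-FinPYT program from the figure.
non_vested_shares = 2770
weighted_average_grant_date_fair_value = 21.96
ans = non_vested_shares * weighted_average_grant_date_fair_value
# Matches the executed answer 60829.2, not the transcribed 60,532.2.
assert abs(ans - 60829.2) < 1e-6
```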
Figure 12: An example of successful Python program generation by ZS-FinPYT.
Figure 13: An example of successful Python program generation by ZS-FinPYT.
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Screenshot: Question Answering System Output
### Overview
This image is a screenshot of a question answering system's output, likely related to financial data extraction and reasoning. It presents a passage of text, a question, and the answers generated by two different programs: "Gold Program" and "ZS-FinPYT Program". The context appears to be related to commitments to invest in funds managed by Goldman Sachs.
### Components/Axes
The screenshot is divided into several sections:
* **Header:** "GS/2015/page\_188.pdf-4" - likely a file name and page number. Located at the top-center.
* **Passage:** A block of text starting with "Passage: the goldman sachs group , inc . and subsidiaries notes to consolidated financial statements commercial lending . . .".
* **Question:** "Question: in billions, what was the total for 2015 and 2014 relating to commitments to invest in funds managed by the firm?"
* **Gold Program:** Displays the program's reasoning ("add(2.86, 2.87)") and answer ("Gold Answer: 5.73").
* **ZS-FinPYT Program:** Displays the program's code-like reasoning and answer ("ZS-FinPYT Executed Answer: 5.73").
### Detailed Analysis or Content Details
**Passage Text:**
"Passage: the goldman sachs group , inc . and subsidiaries notes to consolidated financial statements commercial lending . the … of these amounts, $2.86 billion and $2.87 billion as of december 2015 and december 2014, respectively, … in millions | As of December 2015
…
2021 - thereafter | 1,160
Total | $2,575"
**Question:**
"Question: in billions, what was the total for 2015 and 2014 relating to commitments to invest in funds managed by the firm?"
**Gold Program:**
* Reasoning: "add(2.86, 2.87)"
* Answer: "Gold Answer: 5.73"
**ZS-FinPYT Program:**
* Code:
* "total\_2015 = 2.86 #in billions"
* "total\_2014 = 2.87 #in billions"
* "ans = total\_2015 + total\_2014 #in billions"
* Answer: "ZS-FinPYT Executed Answer: 5.73"
### Key Observations
* Both programs arrive at the same answer: 5.73 billion.
* The passage provides the values for 2015 ($2.86 billion) and 2014 ($2.87 billion).
* The Gold Program uses a simple addition function.
* The ZS-FinPYT Program explicitly defines variables for each year and then performs the addition.
* The passage also contains other data points (2021 - 1,160, Total - $2,575) that are not relevant to the question.
### Interpretation
The screenshot demonstrates a question answering system's ability to extract relevant numerical data from a text passage and perform a simple arithmetic operation to answer a question. Both programs correctly identify the necessary values ($2.86 billion and $2.87 billion) and calculate their sum (5.73 billion). The ZS-FinPYT program provides a more verbose, code-like representation of the reasoning process, while the Gold Program uses a more concise functional notation. The presence of irrelevant data in the passage suggests the system has some level of filtering capability to focus on the information pertinent to the question. The units are explicitly stated as "in billions" in both the question and the ZS-FinPYT program's code, indicating an awareness of the required scale.
</details>
Figure 14: An example where ZS-STD's LLM answering prompt generates one value response.
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Document Extract: MRO/2013/page_39.pdf-3
### Overview
The image presents a snippet of a document, likely a question-answering system output, related to financial data from MRO (Marathon Oil) in 2013. It includes a passage of text, benchmark data for various energy commodities, a question, and the system's "Gold" and "ZS-STD" answers.
### Components/Axes
The document is structured into the following sections:
* **Filename:** `MRO/2013/page_39.pdf-3` (top-center)
* **Passage:** A textual excerpt describing management discussion and analysis of financial condition.
* **Benchmark Data:** A list of commodity prices for different years.
* **Question:** A query regarding the percentage increase in WTI crude oil price.
* **Gold Program:** The computational steps taken to arrive at the "Gold Answer".
* **Gold Answer:** The system's calculated answer.
* **ZS-STD LLM Answering Prompt Response:** The response from a Large Language Model.
* **ZS-STD Extracted Answer:** The extracted answer from the LLM.
### Content Details
Here's a detailed transcription of the content:
* **Passage:** "item 7. management 2019s discussion and analysis of financial condition and results of operations each of our segments is organized and managed based upon both geographic location and the nature … benchmark | 2013 | 2012 | 2011"
* **WTI crude oil (Dollars per bbl):** "| $98.05 | $94.15 | $95.11"
* **Brent (Europe) crude oil (Dollars per bbl):** "| $108.64 | $111.65 | $111.26"
* **Henry Hub natural gas (Dollars per mmbtu):** "| $3.65 | $2.79 | $4.04"
* **Question:** "by what percentage did the average price of wti crude oil increase from 2011 to 2013?"
* **Gold Program:** "subtract(98.05, 95.11), divide(#0, 95.11)"
* **Gold Answer:** "0.03091"
* **ZS-STD LLM Answering Prompt Response:** "3.9%"
* **ZS-STD Extracted Answer:** "3.9"
### Key Observations
* The "Gold Answer" is a decimal value (0.03091), representing the percentage increase as a fraction.
* The "ZS-STD" answers (both prompt response and extracted answer) are formatted as a percentage (3.9%).
* The "Gold Program" shows the calculation: (2013 price - 2011 price) / 2011 price.
* The question specifically asks for a percentage increase.
### Interpretation
The document demonstrates a question-answering system's ability to extract numerical data from a text passage, perform a calculation, and provide an answer. The "Gold" approach provides a precise decimal value, while the "ZS-STD" approach reports a rounded percentage. Evaluating the Gold Program directly, (98.05 - 95.11)/95.11 ≈ 0.0309, i.e. roughly 3.1%, so the ZS-STD answer of 3.9% is an arithmetic error rather than merely a formatting difference. The inclusion of benchmark data for Brent crude oil and Henry Hub natural gas situates the question in a broader energy-market context.
</details>
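As a quick check of the arithmetic in this figure (our verification in plain Python, not part of the original image):

```python
# Gold Program: subtract(98.05, 95.11), divide(#0, 95.11)
increase = (98.05 - 95.11) / 95.11
assert round(increase, 5) == 0.03091   # matches the Gold Answer
# As a percentage this is about 3.1%, not the 3.9% reported by ZS-STD.
assert round(increase * 100, 1) == 3.1
```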
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Screenshot: UNP/2007/page_25.pdf-4
### Overview
This is a screenshot of a document, likely a PDF page, displaying a question-answering system's output. It presents a passage of text, a question based on that passage, the "Gold Program" (presumably the correct solution method), the "Gold Answer", and the responses from a "ZS-STD LLM Answering Prompt Response" and "ZS-STD Extracted Answer".
### Components/Axes
The screenshot is divided into several sections:
* **Header:** "UNP/2007/page_25.pdf-4" - likely a file name or identifier.
* **Passage:** A block of text describing fuel prices and oil prices in 2007.
* **Data:** Numerical data related to "Millions of Dollars", "Dividends paid", and "Free cash flow" for the years 2005, 2006, and 2007.
* **Question:** "what was change in millions of free cash flow from 2005 to 2006?"
* **Gold Program:** "subtract(516, 234)" - the mathematical operation to solve the question.
* **Gold Answer:** "282.0" - the correct numerical answer.
* **ZS-STD LLM Response:** "The change in millions of free cash flow from 2005 to 2006 was an increase of $282 million." - a natural language response.
* **ZS-STD Extracted Answer:** "282.0" - the numerical answer extracted by the LLM.
### Detailed Analysis or Content Details
Here's a breakdown of the data presented:
* **Millions of Dollars:** 2007 | 2006 | 2005 (column headers only; consistent with the free-cash-flow row order below)
* **Dividends paid:** (364) | (322) | (314) - values are negative, likely representing an outflow.
* **Free cash flow:** $487 | $516 | $234 - values are in millions of dollars.
The question asks for the change in free cash flow from 2005 to 2006. The "Gold Program" correctly identifies this as subtracting the free cash flow in 2005 ($234) from the free cash flow in 2006 ($516). The "Gold Answer" is 282.0. The ZS-STD LLM provides both a natural language explanation and the correct numerical answer.
### Key Observations
* The LLM correctly answers the question and provides a clear explanation.
* The "Gold Program" provides a concise representation of the solution.
* The data is presented in a simple tabular format.
* The passage reads "2022 fuel prices 2013 crude oil prices increased at a steady rate in 2007...", which looks confusing because "2022" and "2013" are not dates: they are Unicode codepoints (U+2022, a bullet, and U+2013, an en dash) left behind by the dataset's text extraction, so the intended text is "• fuel prices – crude oil prices increased at a steady rate in 2007...". The same artifact appears in "management 2019s" (U+2019, an apostrophe).
### Interpretation
The screenshot demonstrates a question-answering system's ability to extract information from a passage, perform a simple calculation, and provide a correct answer, validated against a "Gold Standard". The inclusion of both a programmatic solution ("Gold Program") and a natural language response from the LLM suggests a hybrid approach to question answering. The apparent date inconsistencies in the passage ("2022", "2013") are Unicode codepoint artifacts of the text extraction (a bullet and an en dash), not genuine dates. The negative values for "Dividends paid" indicate cash outflows, while the positive values for "Free cash flow" represent cash inflows.
</details>
Figure 15: An example where ZS-STD's LLM answering prompt generates one sentence response.
Figure 16: An example where answer extraction failed for ZS-STD.
<details>
<summary>Image 14 Details</summary>

### Visual Description
## Text Extraction: Hologic (HOLX) Financial Data
### Overview
The image presents a snippet of text extracted from a financial document, specifically notes to consolidated financial statements for HOLX/2009/page_151.pdf. It details restricted stock unit activity, including non-vested shares, number of shares, and weighted-average grant-date fair value. It also includes a question regarding a crude oil price increase and responses from a "Gold Program" and a "ZS-STD LLM".
### Components/Axes
The text is structured into several sections:
* **Document Header:** "HOLX/2009/page\_151.pdf-1"
* **Passage:** A description of the data presented.
* **Data Table:** Listing of non-vested shares information.
* **Question:** "by what percentage did the average price of wti crude oil increase from 2011 to 2013?"
* **Gold Program Response:** "multiply(2770, 21.96)" and "Gold Answer: 60829.2"
* **ZS-STD LLM Response:** "The total fair value of non-vested shares as of September 26, 2009 is $59,812." and "ZS-STD Extracted Answer: float"
### Detailed Analysis or Content Details
**Data Table Content:**
| | Number of Shares | Weighted-Average Grant-Date Fair Value |
|---|---|---|
| Non-vested as of September 27, 2008 | 1,461 | $31.23 |
| Granted | 1,669 | 14.46 |
| Vested | (210) | 23.87 |
| Forfeited | (150) | 23.44 |
| Non-vested as of September 26, 2009 | 2,770 | 21.96 |
**Gold Program:**
* Operation: multiply(2770, 21.96)
* Answer: 60829.2
**ZS-STD LLM:**
* Response: "The total fair value of non-vested shares as of September 26, 2009 is $59,812."
* Extracted Answer Type: float
### Key Observations
* The data table shows changes in non-vested shares over a period, including grants, vesting, and forfeitures.
* The Gold Program provides a calculation result, likely related to the number of shares and fair value.
* The LLM provides a different answer, calculating the total fair value of non-vested shares.
* There is a disconnect between the question about crude oil prices and the provided data and responses, which focus on stock units.
### Interpretation
The image presents financial data on restricted stock units alongside responses to an unrelated question about crude oil prices. The data table provides a snapshot of stock-unit activity, showing the net change in non-vested shares. The Gold Program multiplies the 2009 non-vested share count by its weighted-average fair value, while the LLM instead reports a total fair value of $59,812 for the non-vested shares. The mismatch between the question and the document suggests an issue with the prompt or data-retrieval process, in which the system incorrectly associated the question with this financial document. Finally, "ZS-STD Extracted Answer: float" indicates that answer extraction failed: the extraction step returned the literal type word "float" rather than a numerical value, consistent with the figure caption.
</details>
Figure 17: An example of arithmetic errors made by ZS-STD and ZS-CoT prompts.
<details>
<summary>Image 15 Details</summary>

### Visual Description
## Document: Financial Statement Analysis - Page 45
### Overview
The image presents a snippet of a document, likely related to financial statement analysis. It contains a passage of text, a table of obligations, a question, and responses from different programs ("Gold Program", "ZS-STD LLM", and "ZS-CoT"). The focus is on calculating the percentage change in cash flows from operations between 2015 and 2016.
### Components/Axes
The document is structured as follows:
* **Header:** "FIS/2016/page\_45.pdf-3" (top-center)
* **Passage:** A block of text describing cash flows from operations in 2016, 2015, and 2014.
* **Table:** A table detailing "Type of Obligations" categorized by duration (Total, Less than 1 Year, 1-3 Years, 3-5 Years, More than 5 Years) with corresponding values.
* **Question:** "what was the percentage change in cash flows from operations from 2015 to 2016?"
* **Responses:** Sections for "Gold Program", "ZS-STD LLM", and "ZS-CoT", each containing a response and extracted answer.
### Detailed Analysis or Content Details
**Passage:**
"financial statements . as of december 31, 2016 . . .
cash flows from operations cash flows from operations were $ 1925 million , $ 1131 million and $ 1165 million in 2016, 2015 and 2014 respectively. . . ."
**Table:**
| Type of Obligations | Total | Less than 1 Year | 1-3 Years | 3-5 Years | More than 5 Years |
|---|---|---|---|---|---|
| Total | $14,429 | $1,068 | $2,712 | $3,264 | $7,385 |
**Question:**
"what was the percentage change in cash flows from operations from 2015 to 2016?"
**Gold Program:**
* Program: "subtract(1925, 1131), divide(#0, 1131)"
* Gold Answer: 0.70203
**ZS-STD LLM Answering Prompt Response:**
"The cash flows from operations increased by 794 million from 2015 to 2016, which is a 69.7% increase."
* ZS-STD Extracted Answer: 69.7
**ZS-CoT Reasoning Prompt Response:**
"Cash flows from operations in 2015 = $1131 million, Cash flows from operations in 2016 = $1925 million. Therefore, the percentage change in cash flows from operations from 2015 to 2016 is (1925-1131)/1131 = 70.1%."
* ZS-CoT Extracted Answer: 70.1
### Key Observations
* There are discrepancies in the calculated percentage change. The "Gold Program" yields 70.203%, while the "ZS-STD LLM" reports 69.7%, and the "ZS-CoT" reasoning arrives at 70.1%.
* The "ZS-CoT" provides a clear step-by-step reasoning process, making it easier to verify the calculation.
* The passage provides cash flow data for 2016 ($1925 million), 2015 ($1131 million), and 2014 ($1165 million).
### Interpretation
The document demonstrates a percentage-change calculation in financial analysis, with the three approaches ("Gold", "ZS-STD LLM", and "ZS-CoT") attempting the same problem and disagreeing. The correct value is (1925 - 1131)/1131 ≈ 70.2%, so the ZS-STD answer of 69.7% and the ZS-CoT answer of 70.1% are both small arithmetic errors made while evaluating the expression in text, which is exactly the failure mode this figure illustrates. The "ZS-CoT" approach, with its explicit reasoning, makes the calculation easier to verify and the error easier to localize. The table of obligations does not relate directly to the question, suggesting it is context for a broader analysis. The example shows why delegating arithmetic to a program interpreter, rather than to the LLM itself, improves accuracy.
</details>
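A quick check of the FIS numbers (our verification in plain Python, not part of the original figure) confirms the Gold Answer and quantifies the LLM errors:

```python
# Gold Program: subtract(1925, 1131), divide(#0, 1131)
pct_change = (1925 - 1131) / 1131
assert round(pct_change, 5) == 0.70203   # matches the Gold Answer
# The correct percentage is about 70.2%, so 69.7% (ZS-STD) and
# 70.1% (ZS-CoT) are both arithmetic errors.
assert round(pct_change * 100, 1) == 70.2
```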
<details>
<summary>Image 16 Details</summary>

### Visual Description
## Document: Financial Statement Analysis - Question & Answer
### Overview
This image presents a snippet of a document, likely related to financial statement analysis. It includes a passage from a financial report, a question regarding the percentage decline in an allowance for doubtful accounts, and multiple responses generated by different programs (Gold Program, ZS-STD LLM, and ZS-CoT Reasoning Prompt). The document header indicates the file name: "RSG/2018/page_94.pdf-1".
### Components/Axes
The document is structured into distinct sections:
* **Header:** "RSG/2018/page_94.pdf-1" (top-center)
* **Passage:** A block of text containing financial data for the years 2018, 2017, and 2016.
* **Question:** "as of december 31, 2018 what was the percentage decline in the allowance for doubtful accounts?"
* **Gold Program:** Displays the calculation performed and the resulting answer.
* **ZS-STD LLM Answering Prompt Response:** Provides a textual answer and the extracted numerical answer.
* **ZS-CoT Reasoning Prompt Response:** Presents the reasoning process and the extracted numerical answer.
### Detailed Analysis or Content Details
**Passage Data (Approximate Values):**
| | 2018 | 2017 | 2016 |
|-----------------------|-------|-------|-------|
| Balance at beginning of year | $38.9 | $44.0 | $46.7 |
| Additions charged to expense | $34.8 | $30.6 | $20.4 |
| Accounts written-off | $39.4 | $35.7 | $23.1 |
| Balance at end of year | $34.3 | $38.9 | $44.0 |
**Question:**
"as of december 31, 2018 what was the percentage decline in the allowance for doubtful accounts?"
**Gold Program:**
* Program: "subtract(34.3, 38.9), divide(#0, 38.9)"
* Gold Answer: -0.11825
**ZS-STD LLM Answering Prompt Response:**
* Response: "The percentage decline in the allowance for doubtful accounts as of December 31, 2018 was 11.3% (34.3 - 38.9 / 38.9)."
* ZS-STD Extracted Answer: 11.3
**ZS-CoT Reasoning Prompt Response:**
* Response: "The balance at the beginning of 2018 was $38.9 million. The balance at the end of 2018 was $34.3 million. Therefore, the percentage decline in the allowance for doubtful accounts is 12.2% ($38.9 - $34.3 / $38.9)."
* ZS-CoT Extracted Answer: 12.2
### Key Observations
* There are discrepancies in the calculated percentage decline. The Gold Program yields -11.825%, while the ZS-STD LLM provides 11.3%, and the ZS-CoT Reasoning Prompt gives 12.2%.
* The ZS-STD LLM and ZS-CoT Reasoning Prompt responses include the calculation within the textual answer, making it easier to verify.
* The Gold Program only provides the calculation and the final answer, without context.
* The ZS-CoT Reasoning Prompt provides a more detailed explanation of the reasoning process.
### Interpretation
The document demonstrates a question-answering task over financial data, with the three approaches producing inconsistent results. The gold program computes (34.3 - 38.9) / 38.9 = -0.11825; the negative sign simply reflects that the balance fell, so its magnitude, 11.825%, is the percentage decline the question asks for. Both the ZS-STD answer (11.3%) and the ZS-CoT answer (12.2%) state essentially the correct formula in words but execute the arithmetic incorrectly, which is the failure mode this figure illustrates. The ZS-CoT response's explicit reasoning at least makes the error easy to locate, underscoring the value of verifiable intermediate steps when an LLM performs calculations directly.
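The gold program's two-step calculation can be replayed in a few lines of Python; the `subtract` and `divide` helpers below are illustrative stand-ins for the DSL operations shown above, not an official implementation:

```python
# Replay the gold program "subtract(34.3, 38.9), divide(#0, 38.9)".
# Helper names mirror the DSL operations shown in the figure; they are
# illustrative, not taken from any published implementation.

def subtract(a: float, b: float) -> float:
    return a - b

def divide(a: float, b: float) -> float:
    return a / b

step0 = subtract(34.3, 38.9)   # end-of-year minus beginning-of-year balance
answer = divide(step0, 38.9)   # relative change
print(round(answer, 5))        # -0.11825, i.e. an 11.825% decline
```

Executing the arithmetic in an interpreter rather than in the LLM's text output is exactly what avoids the 11.3%/12.2% errors above.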
</details>
Figure 18: An example of arithmetic errors made by ZS-STD and ZS-CoT prompts.
Figure 19: An example where program extraction failed for ZS-FinDSL prompt.
<details>
<summary>Image 17 Details</summary>

### Visual Description
## Screenshot: UNP/2007/page_25.pdf-4
### Overview
This is a screenshot of a document, likely a PDF page, displaying a passage of text, a question, a "Gold Program" answer, and reasoning/program output from "ZS-FinDSL". The document appears to be related to financial data, specifically free cash flow and dividends.
### Components/Axes
The screenshot is divided into several sections:
* **Header:** "UNP/2007/page\_25.pdf-4" (top-center)
* **Passage:** A block of text describing fuel prices and crude oil increases in 2007.
* **Data:** Lists of values for "Millions of Dollars", "Dividends paid", and "Free cash flow" for the years 2007, 2006, and 2005.
* **Question:** "what was change in millions of free cash flow from 2005 to 2006?"
* **Gold Program:** Displays "subtract(516, 234)" and "Gold Answer: 282.0"
* **ZS-FinDSL Reasoning:** Explains the calculation for the change in free cash flow.
* **ZS-FinDSL Program:** Shows the program code used for the calculation.
* **ZS-FinDSL Executed Answer:** Displays "564".
### Detailed Analysis or Content Details
**Passage:**
"• fuel prices – crude oil prices increased at a steady rate in 2007 . . ."
**Data:**
| Millions of Dollars | 2007 | 2006 | 2005 |
|---------------------|-------|-------|-------|
| Dividends paid | (364) | (322) | (314) |
| Free cash flow | $487 | $516 | $234 |
**Question:**
"what was change in millions of free cash flow from 2005 to 2006?"
**Gold Program:**
The program is "subtract(516, 234)".
The Gold Answer is 282.0.
**ZS-FinDSL Reasoning:**
"The change in millions of free cash flow from 2005 to 2006 was an increase of $282 million ($516 million - $234 million = $282 million)."
**ZS-FinDSL Program:**
```json
{
  "Prog#0": { "operation": "subtract", "arg1": "516", "arg2": "234" },
  "Prog#1": { "operation": "add", "arg1": "0", "arg2": "282" }
}
```
**ZS-FinDSL Executed Answer:**
564
### Key Observations
There is a discrepancy between the "Gold Answer" (282.0) and the "ZS-FinDSL Executed Answer" (564). The reasoning provided by ZS-FinDSL correctly computes the difference between the 2006 and 2005 free cash flow as $282 million, and Prog#0 encodes exactly that subtraction. The extracted program, however, contains a redundant second step: if the literal "0" in Prog#1 is resolved as a reference to Prog#0's result, the interpreter computes 282 + 282 = 564, doubling the correct answer.
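The doubled result of 564 can be reproduced with a minimal DSL interpreter that resolves step references; the interpreter below is a hypothetical sketch written only to reproduce the displayed numbers, under the assumption that a bare "0" argument points at Prog#0's result:

```python
# Hypothetical interpreter for the extracted ZS-FinDSL program shown above.
# Assumption: a bare "0" argument is resolved as a reference to the result
# of Prog#0, which reproduces the executed answer of 564.

OPS = {"subtract": lambda a, b: a - b, "add": lambda a, b: a + b}

program = [
    {"operation": "subtract", "arg1": "516", "arg2": "234"},  # Prog#0
    {"operation": "add", "arg1": "0", "arg2": "282"},         # Prog#1 (redundant)
]

results = []
for step in program:
    def resolve(arg):
        # Treat an integer that indexes an earlier step as a reference to it.
        if arg.isdigit() and int(arg) < len(results):
            return results[int(arg)]
        return float(arg)
    value = OPS[step["operation"]](resolve(step["arg1"]), resolve(step["arg2"]))
    results.append(value)

print(results[-1])  # 564.0: Prog#1 adds Prog#0's 282 to the literal 282
```

Under this reading, Prog#0 alone already yields the correct answer of 282; the spurious Prog#1 is what breaks the executed result.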
### Interpretation
The document presents a simple financial calculation: determining the change in free cash flow from 2005 to 2006. The gold program and the ZS-FinDSL reasoning agree on $282 million, so the core logic is sound; the failure occurs at program extraction, where a spurious extra operation doubles the executed result to 564. The example underscores the importance of verifying the executed output of generated programs even when the accompanying natural-language reasoning is correct.
</details>
Figure 20: An example where program extraction failed for ZS-FinDSL prompt.
<details>
<summary>Image 18 Details</summary>

### Visual Description
## Text Extraction: Financial Document Snippet
### Overview
The image presents a snippet of a financial document, likely related to tax implications and accounting adjustments. It contains a passage summarizing financial changes, a question posed regarding these changes, and the reasoning/answer provided by a "Gold Program" and "ZS-FinDSL" system.
### Components/Axes
The document is structured into the following sections:
* **Header:** "ADBE/2008/page_89.pdf-4" - indicating the source document and page number.
* **Passage:** A textual summary of financial changes.
* **Question:** A specific query related to the financial changes described in the passage.
* **Gold Program:** Output from a "Gold Program" including a function call and answer.
* **ZS-FinDSL Reasoning:** Explanation of the reasoning process by the ZS-FinDSL system.
* **ZS-FinDSL Program:** The program code used by ZS-FinDSL.
* **ZS-FinDSL Executed Answer:** The numerical result calculated by ZS-FinDSL.
### Detailed Analysis or Content Details
**Passage:**
"summary fin 48 changes during fiscal 2008, our aggregate changes in our total gross amount of unrecognized tax . . .
future tax returns, would result $ 57.7 million of unrecognized tax benefits would decrease the effective tax rate and $ 68.9 million would decrease goodwill . . .
Beginning balance as of December 1, 2007 | $201,808
Settlements with taxing authorities | (81,213)
Lapse of statute of limitations | (3,512)"
**Question:**
"if the total fin 48 gross liability for unrecognized tax benefits at november 28, 2008 were recognized in the future, is the amount of unrecognized tax benefits that would decrease the effective tax rate greater than the amount that would decrease goodwill?"
**Gold Program:**
`greater(57.7, 68.9)`
`no`
**ZS-FinDSL Reasoning:**
"The amount of unrecognized tax benefits that would decrease the effective tax rate is $57.7 million. The amount that would decrease goodwill is $68.9 million. Therefore, the amount of unrecognized tax benefits that would decrease the effective tax rate is less than the amount that would decrease goodwill."
**ZS-FinDSL Program:**
`{"Program": "#0", "operation": "subtract", "arg1": "68.9", "arg2": "57.7"},`
`"Answer": "False"`
**ZS-FinDSL Executed Answer:**
`11.2`
### Key Observations
* The "Gold Program" and "ZS-FinDSL" both arrive at the same conclusion ("no" or "False") regarding the question.
* ZS-FinDSL provides a numerical difference of 11.2, which represents the difference between 68.9 and 57.7.
* The passage details specific amounts related to tax benefits and goodwill adjustments.
* The question focuses on a comparative analysis of these amounts.
### Interpretation
This snippet shows both systems processing a comparative question over financial text. The gold program `greater(57.7, 68.9)` performs a boolean comparison and correctly returns "no", and the ZS-FinDSL reasoning reaches the same conclusion in words, with its "Answer" field set to "False". The extracted ZS-FinDSL program, however, encodes a `subtract` operation rather than a comparison, so the executed answer is the numeric difference 11.2 (68.9 - 57.7) instead of the boolean the question requires, which is why this case is labeled a program-extraction failure despite the underlying reasoning being correct.
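The mismatch between the boolean question and the numeric executed answer can be seen by replaying both programs; the `greater` and `subtract` helpers below are illustrative stand-ins for the DSL operations named in the figure:

```python
# Replay the gold program and the extracted ZS-FinDSL program.
# Helper names mirror the DSL operations shown above and are
# illustrative, not an official implementation.

def greater(a: float, b: float) -> bool:
    return a > b

def subtract(a: float, b: float) -> float:
    return a - b

gold = greater(57.7, 68.9)        # False -> "no": 57.7 is not greater than 68.9
extracted = subtract(68.9, 57.7)  # a numeric difference, not a boolean

print(gold, round(extracted, 1))  # False 11.2
```

The two results agree in spirit but differ in type, so an answer extractor expecting a yes/no value would fail on the executed 11.2.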
</details>