arXiv:2311.14722
## Zero-Shot Question Answering over Financial Documents using Large Language Models
Karmvir Singh Phogat, Chetan Harsha, Sridhar Dasaratha, Shashishekar Ramakrishna, Sai Akhil Puranam
EY Global Delivery Services India LLP
{Karmvir.Phogat, Chetan.Harsha, Sridhar.Dasaratha}@gds.ey.com
{Shashishekar.R, Sai.Puranam}@gds.ey.com
## Abstract
We introduce a large language model (LLM) based approach to answer complex questions requiring multi-hop numerical reasoning over financial reports. While LLMs have exhibited remarkable performance on various natural language and reasoning tasks, complex reasoning problems often rely on few-shot prompts that require carefully crafted examples. In contrast, our approach uses novel zero-shot prompts that guide the LLM to encode the required reasoning into a Python program or a domain specific language. The generated program is then executed by a program interpreter, thus mitigating the limitations of LLMs in performing accurate arithmetic calculations.
We evaluate the proposed approach on three financial datasets using some of the recently developed generative pretrained transformer (GPT) models and perform comparisons with various zero-shot baselines. The experimental results demonstrate that our approach significantly improves the accuracy for all the LLMs over their respective baselines. We provide a detailed analysis of the results, generating insights to support our findings. The success of our approach demonstrates the enormous potential to extract complex domain specific numerical reasoning by designing zero-shot prompts to effectively exploit the knowledge embedded in LLMs.
## 1 Introduction
In recent years, the development of large language models (LLMs) has achieved significant advances in natural language processing (NLP). Typically, LLMs are pretrained on large corpora of text from the internet, which has given rise to the capability of adapting to a wide variety of new tasks from different domains without the need for huge amounts of task specific data. Scaling up the size of these models has not only improved sampling efficiency and performance (Kaplan et al., 2020) but also introduced reasoning capabilities (Wei et al., 2022a,b; Kojima et al., 2022).
LLMs have been shown to perform well on tasks requiring reasoning capabilities in various domains, including code writing (Chen et al., 2021a), math problem solving (Lewkowycz et al., 2022; Polu et al., 2023), dialogue (Glaese et al., 2022; Thoppilan et al., 2022), common sense reasoning (Shwartz et al., 2020; Chowdhery et al., 2022) and symbolic reasoning (Wei et al., 2022b; Wang et al., 2023). The design of the prompt, known as prompt engineering, plays a significant role in adapting the pretrained LLMs to new tasks with little or no task specific training data. Recently, there has been extensive work (Liu et al., 2023) which demonstrates the importance of prompt design in usage of the LLMs and unlocking their reasoning capabilities. However, (Mahowald et al., 2023) argue that LLMs cannot combine elementary knowledge with common sense reasoning. (Valmeekam et al., 2022) claim that benchmarks on which LLMs show reasoning capabilities are simplistic and cannot be used as evidence. (Bubeck et al., 2023; Bang et al., 2023) show that LLMs face challenges in numerical reasoning. Hence, adapting LLMs to new domains requires prompt engineering and a system design that can overcome the limitations of LLMs.
Question answering in the financial domain is an active area of research which could potentially benefit from the use of LLMs with appropriate system design. Financial question answering involves numerous steps and complex numerical reasoning with precise arithmetic calculations, making it more challenging than classical question answering problems (Yang et al., 2018; Rajpurkar et al., 2018). Typically, for complex problems, few-shot prompt based approaches have been used (Wei et al., 2022b; Chen et al., 2023). However, it has been shown that the output of LLMs is sensitive to the few-shot samples used as well as to the ordering of those samples (Lu et al., 2022). Further, the samples can contain a large number of tokens, and providing multiple samples for few-shot prompts would increase the number of input tokens, sometimes even exceeding the context limit of LLMs. Hence, designing and using few-shot prompts for financial question answering can become quite challenging.
We propose a new approach using zero-shot prompts for financial question answering with LLMs, thus eliminating the requirement to create hand crafted examples. These prompts contain high-level instructions that guide the encoding of the financial reasoning process into a Python program (ZS-FinPYT) or a domain specific language (ZS-FinDSL). For ZS-FinPYT, we achieve the zero-shot system through instructions that lay out the high-level approach to generate a valid Python program, while for ZS-FinDSL we enable the same by identifying a program structure for robust domain specific language (DSL) program extraction. In both cases, the generated program is executed externally by a program executor to provide the final answer. We evaluate the latest GPT-x models on their ability to perform financial reasoning, as they have shown state-of-the-art performance on various tasks involving question answering and reasoning (OpenAI, 2023; Frieder et al., 2023; Kung et al., 2023). Specifically, we explore the use of the GPT models text-davinci-003, gpt-3.5-turbo and gpt-4 in answering financial questions.
We evaluate the proposed approach on three financial question answering datasets, with three different GPT models and compare with various baselines. The experimental results demonstrate that our approach significantly improves the accuracy for all models. The success of our approach demonstrates the enormous potential to extract complex domain specific numerical reasoning by carefully designing LLM based systems for specific applications and crafting prompts to effectively exploit the knowledge embedded in the LLMs.
## 2 Background
NLP techniques have proven useful for solving various problems in the financial domain, such as sentiment analysis to assist market prediction (Day and Lee, 2016; Akhtar et al., 2017) and fraud detection for risk management (Han et al., 2018; Wang et al., 2019). Financial domain specific language models have been trained on large scale financial data and fine tuned for specific problems (Liu et al., 2021). (Chen et al., 2021b) introduce a large-scale question answering dataset, FinQA, and propose FinQANet, a retriever-generator architecture based on pretrained BERT-like models.
With the introduction of LLMs, it has become feasible to directly use these language models without domain specific pretraining. (Chen et al., 2022) propose a large-scale financial dataset, ConvFinQA for conversational question answering. They propose a few-shot prompt (with 16 exemplars) based approach using GPT-3 text-davinci-002 model to generate a DSL program.
One of the key techniques that significantly improves the reasoning abilities of LLMs is chain-of-thought prompting, introduced by (Wei et al., 2022b). They propose a few-shot prompt that consists of triples: <input, chain-of-thought, output>, where the chain-of-thought (CoT) is a series of intermediate natural language reasoning steps that lead to the final output. (Kojima et al., 2022) demonstrate that reasonable zero-shot learning is achieved by simply adding 'Let's think step by step' to the prompt and using a two-prompt approach: the first prompt extracts the reasoning path and the second extracts the final answer. Unlike our approach, which avoids performing calculations using the LLM, both of these approaches utilize the LLM to generate mathematical expressions that encode the reasoning and to perform arithmetic at each step.
Program of thoughts (PoT) prompting (Chen et al., 2023) and Program-aided Language Models (PAL) (Gao et al., 2023) are approaches that are conceptually similar to our proposed technique. However, (Chen et al., 2023) show only limited zero-shot prompting experiments for financial datasets; their results indicate that few-shot prompting significantly outperforms the zero-shot prompts. (Gao et al., 2023) discuss only few-shot prompting and do not show any results on financial datasets. In contrast, our work focuses entirely on optimizing zero-shot prompts that generate Python programs or domain specific language programs for financial question answering. We further demonstrate that carefully designed zero-shot prompts for financial question answering can achieve comparable results with few-shot methods.
## 3 Zero-shot Prompting for Financial Domains
We introduce a novel zero-shot template-based prompting approach for financial question answering. These prompts are designed to generate executable programs for answering questions. The program generation and external execution enable accurate mathematical calculations, which eliminates arithmetic errors. We follow the prompt guidelines described in (Reynolds and McDonell, 2021) and employ the following principles for designing zero-shot prompts for question answering:
Signifier: A signifier is a pattern which keys the intended behavior. A task specific signifier directly elucidates the task at hand. The sentence - 'Read the following passage and then answer the question', specifically describes the question answering task that is to be performed.
Memetic proxy: A memetic proxy is a concept in which a character or characteristic situation is used as a proxy for an intention. '#Python' can be a memetic proxy for the LLM to clarify the intention that the response should have a Python program.
Constraining behavior: In addition to directing the LLM on the desirable response, it is important for the prompt to inform the LLM of undesirable responses. Instructions restricting undesirable LLM responses fall into the constraining behavior category.
Meta prompting: A meta prompt is a short phrase or a fill-in-the-blank template encapsulating a more general intention that will unfold into a more specific prompt when combined with additional information such as the task at hand. In the question answering task, the sentence 'Let us think step by step.' elicits step-by-step reasoning in LLMs for answering questions.
Inspired by these prompt design principles, we present two zero-shot prompting techniques: ZS-FinPYT prompt that enables LLMs to generate Python executable programs and ZS-FinDSL prompt that enables LLMs to generate executable domain specific language programs. We also discuss two baseline zero-shot prompting techniques, one using a simple dual prompt and another using zero-shot chain-of-thought prompting (ZS-CoT) motivated by (Kojima et al., 2022). For reproducibility purposes, we provide exact prompts for all techniques and datasets.
## 3.1 Zero-shot FinPYT
The ZS-FinPYT prompt is a collection of instructions that directs the LLM to generate a valid Python program that can be executed by the exec function. Based on preliminary experiments, we identified the following requirements for the proposed prompt:
- (a) The prompt should describe the task such that it enables the LLM to generate consistent programs for answering the questions.
- (b) The final answer to a question must be stored in a specified Python variable to enable consistent extraction of the executed answer.
- (c) The generated program should not include non-executable statements, so that the Python program can be executed seamlessly.
The ZS-FinPYT prompt is designed to accommodate the above requirements in the following manner:
Direct task specification using the signifier: We use the following signifier for explicitly specifying the question answering task:
Read the following passage and then write Python code to answer the question: Passage: text + table Question: ask question? Answer this question by following the below instructions.
The signifier explicitly calls out the task of writing a Python program to answer the question after reading the passage, where the passage and the question are identified with the identifiers 'Passage:' and 'Question:' respectively. Furthermore, the prompt directs the LLM to follow certain instructions while answering the question.
Direct sub-task specification using the signifier: The sub-task of storing the final answer to a specific Python variable is described as a part of instructions to the LLM:
Define the Python variable which must begin with a character. Assign values to variables required for the calculation. Create Python variable "ans" and assign the final answer (bool/float) to the variable "ans".
Constraining LLM behavior: To ensure naming conventions are followed and prevent the generation of non-executable statements, we include the following instructions in the prompt:
Define the Python variable which must begin with a character. Don't include non-executable statements and include them as part of comments.
Memetic proxy phrases: Certain memetic proxy phrases are employed to implicitly convey intentions. For instance, the memetic phrase '#Comment: . . . ' guides the LLM to understand that comments are always preceded by the '#' character. Similarly, the memetic phrase '#Python' instructs the LLM to generate a Python program.
The ZS-FinPYT prompt for the FinQA dataset is depicted in Figure 1.
Figure 1: ZS-FinPYT prompt for FinQA
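The prompt components above combine into a single prompt string. A minimal sketch of how such a ZS-FinPYT prompt might be assembled (the function name is illustrative and the instruction wording follows Figure 1; this is a reconstruction, not the authors' code):

```python
def build_zs_finpyt_prompt(passage: str, question: str) -> str:
    """Assemble the ZS-FinPYT zero-shot prompt: signifier, sub-task
    instructions, constraining instructions, and memetic proxies."""
    return (
        "Read the following passage and then write Python code "
        "to answer the question:\n"
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "Answer this question by following the below instructions.\n"
        "Define the Python variable which must begin with a character.\n"
        "Assign values to variables required for the calculation.\n"
        'Create Python variable "ans" and assign the final answer '
        '(bool/float) to the variable "ans".\n'
        "Don't include non-executable statements and include them as "
        "part of comments. #Comment: ...\n"
        "Python executable code is:\n"
        "#Python\n"
    )
```

The passage slot receives the serialized text and table content, and the trailing '#Python' memetic proxy cues the model to respond with code only.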
## 3.2 Zero-shot FinDSL
The zero-shot FinDSL (ZS-FinDSL) is a zero-shot prompting technique for program generation in a domain specific language (DSL). We use a DSL similar to (Chen et al., 2021b) with two differences: we do not include table operators, and we add max and min operators. The output of the system is a DSL program that is extracted using a Python script and executed using a language interpreter. In the ZS-FinDSL technique, we adopt a dual prompt approach to extract the reasoning for answering a question and to generate the corresponding DSL program.
## 3.2.1 Reasoning Extraction Prompt
The reasoning extraction prompt of ZS-FinDSL consists of two parts:
Direct task specification using the signifier: The question answering task is specified explicitly using the following signifier:
```
Read the following passage and then answer the question:
Passage: text + table
Question: ask question?
```
Meta prompting for reasoning: For generating step by step reasoning for answering the question, the following meta prompt is used:
Answer this question by finding the relevant values and performing step by step calculations.
## 3.2.2 Program Extraction Prompt
The primary goal of the program extraction prompt is to extract DSL programs from the LLM's response obtained through the reasoning extraction prompt. To achieve this, the program extraction prompt involves specifying the task of program extraction and constraining the LLM's behavior by incorporating domain-specific knowledge.
Direct task specification using the signifier: The program extraction task is specified using the following signifier:
Question: ask question? Answer: Answer with reasoning from LLM. Task: From the above question-answer, extract the calculations that were performed to arrive at the answer. The calculations should be provided in the following format: {"PROGRAM": {"#0": {"OPERATION": "[arithmetic/logic]", "ARG1": "[float/int]", "ARG2": "[float/int]"}, "#1": {"OPERATION": "[arithmetic/logic]", "ARG1": "#0", "ARG2": "[float/int/#int]"}, ...}, "ANSWER": "[numerical/boolean]"}
Constraining LLM behavior: To ensure consistent program extraction, we limit the mathematical operations to the set specified by the DSL. These operations are commonly used for financial question answering. Moreover, we constrain the program's output to numerical or boolean values to make it executable. The following instructions are passed to the LLM to ensure consistent program extraction:
Operation should strictly be restricted to {add, subtract, multiply, divide, exponent, greater-than, max, min} only. When evaluated the program should only generate numerical or boolean values.
The ZS-FinDSL prompt for the FinQA dataset is shown in Figure 2.
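To make the execution step concrete, a minimal interpreter for step-based DSL programs in the format above might look like the following. This is an illustrative sketch under the stated operation set; the paper itself uses the FinQA reference script, not this code:

```python
import operator

# Restricted operation set from the ZS-FinDSL constraining instructions.
OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
    "exponent": operator.pow,
    "greater-than": operator.gt,
    "max": max,
    "min": min,
}

def run_dsl(program: dict):
    """Execute steps "#0", "#1", ... in order; an argument may be a
    numeric literal or a reference ("#k") to an earlier step's result."""
    results = {}

    def resolve(arg):
        if isinstance(arg, str) and arg.startswith("#"):
            return results[arg]
        return float(arg)

    for i in range(len(program)):
        step = program[f"#{i}"]
        results[f"#{i}"] = OPS[step["OPERATION"]](
            resolve(step["ARG1"]), resolve(step["ARG2"])
        )
    # The last step's value is the program's answer (numerical or boolean).
    return results[f"#{len(program) - 1}"]
```

For example, the percent-change program {"#0": {"OPERATION": "subtract", "ARG1": 100, "ARG2": 80}, "#1": {"OPERATION": "divide", "ARG1": "#0", "ARG2": 80}} evaluates to 0.25.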
## 3.3 Zero-shot Standard Dual
A standard template-based prompting approach for question answering is the zero-shot standard dual (ZS-STD) prompt, which has an LLM answering prompt and an answer extraction prompt. In the LLM answering prompt, the question is appended below the passage and the trigger word 'Answer:' is added for the LLM to generate the answer. The answer extraction prompt takes the LLM generated answer along with the question and appends a memetic proxy phrase, 'The final answer (float/int/boolean) is', for extracting the final answer. The ZS-STD prompt for FinQA question answering is shown in Figure 3.
## 3.4 Zero-shot Chain of Thoughts
Similar to the zero-shot reasoners (Kojima et al., 2022), the zero-shot chain-of-thought (ZS-CoT) prompt is derived from the ZS-STD prompt by adding the reasoning trigger sentence 'Let us think step by step.' after the word 'Answer:'. The answer extraction prompt of ZS-CoT is identical to that of ZS-STD. The ZS-CoT prompt for FinQA question answering is described in Figure 4.

All prompts for TATQA are identical to those for FinQA; for the ConvFinQA dataset, the prompts are slightly modified to handle conversational questions, as shown in Appendix A.

Figure 2: ZS-FinDSL prompt for FinQA (reasoning extraction followed by program extraction)

Figure 3: ZS-STD prompt for FinQA (LLM answering followed by answer extraction)
## 4 Experiments
## 4.1 Experimental Design
Datasets: We conduct our experiments on three financial question answering datasets: FinQA (Chen et al., 2021b), ConvFinQA (Chen et al., 2022) and TATQA (Zhu et al., 2021), as summarized in Table 1. For our evaluations, we use the test split of FinQA, while for ConvFinQA and TATQA we use the dev sets, as answers for the test splits of these datasets are not available. The evaluations for TATQA are restricted to questions of arithmetic type. The question answering task is to answer the questions using the passage containing text and table content. The table content is represented in a textual format using the strategy adopted in (Chen, 2022). In the textual format, the table columns are separated by '|', the rows are separated by '\n' and empty cells are filled with '-'.
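The table linearization described above can be sketched as follows (the function name is illustrative, and the exact spacing around the '|' separator is a guess; the paper only specifies the separator characters):

```python
def table_to_text(rows):
    """Serialize a table (list of rows) into the textual passage format:
    '|' between columns, '\n' between rows, '-' for empty cells."""
    return "\n".join(
        " | ".join("-" if cell in (None, "") else str(cell) for cell in row)
        for row in rows
    )
```

Applied to a two-row table, table_to_text([["Year", "Revenue"], ["2020", ""]]) yields "Year | Revenue\n2020 | -".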
Large Language Models: We experimented with three Azure OpenAI 1 LLMs: text-davinci-003, gpt-3.5-turbo and gpt-4. The Python programs generated by the LLMs are executed using the Python function exec. The domain specific programs are executed using the Python script provided by FinQA. 2 In order to achieve a more precise and predictable outcome, the LLM parameters are set as follows: temperature = 0, top\_prob = 0.95, max\_tokens = 1000.
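Running the generated program and reading back the required "ans" variable can be done with a small harness around exec. This is a minimal sketch (in practice, model-generated code is untrusted and should be sandboxed):

```python
def execute_generated_code(code: str):
    """Run LLM-generated Python in a fresh namespace and return the
    value bound to the required variable "ans", or None on failure."""
    namespace = {}
    try:
        # NOTE: exec on model output; use a sandbox for untrusted code.
        exec(code, namespace)
    except Exception:
        return None  # a non-executable program counts as a failed answer
    return namespace.get("ans")
```

For instance, execute_generated_code("revenue = 7.0\ncost = 2.0\nans = revenue - cost") returns 5.0, while syntactically invalid output returns None.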
Evaluation Metrics: For all the financial
1 https://oai.azure.com/
2 https://github.com/czyssrs/FinQA
Figure 4: ZS-CoT prompt for FinQA
datasets (FinQA, ConvFinQA and TATQA), we implement the evaluation strategy discussed in program of thoughts prompting (Chen et al., 2023) on GitHub 3 with slight modifications. The LLM responses vary in nature for questions whose answers involve thousands, millions, and percentages. For example, for the gold answer 7 million, the GPT response may be 7 million or 7,000,000; for the gold answer 23%, the GPT response may be 23% or 0.23. The evaluation strategy is modified to handle such cases. We relax the evaluations for ZS-CoT (Kojima et al., 2022) and standard dual prompting because LLMs using these prompting techniques generate answers instead of programs. Since LLMs cannot perform precise mathematical calculations (especially with high-precision floats and large numbers), we provide a tolerance while comparing the GPT final answer with the gold answer. The evaluation is implemented using the Python function isclose with a relative tolerance (rel\_tol) of 0.001. The isclose function returns True when comparing the GPT final answer (â) with the gold answer (ã) if and only if the condition

$$|\hat{a} - \tilde{a}| \leq \text{rel\_tol} \cdot \max(|\hat{a}|, |\tilde{a}|)$$

is satisfied.

Baselines: We consider two baselines in the zero-shot prompting setting: the ZS-STD prompt and the ZS-CoT prompt. These zero-shot prompting techniques are evaluated with all three Azure OpenAI models (text-davinci-003, gpt-3.5-turbo, gpt-4) on all three financial datasets (FinQA, ConvFinQA and TATQA).

## 4.2 Main Results

The evaluation results for the proposed prompting techniques ZS-FinPYT and ZS-FinDSL along with the baselines ZS-STD prompt and ZS-CoT are summarized in Table 2. The ZS-FinPYT and ZS-FinDSL methods significantly outperform the ZS-STD prompt for all datasets and across all LLMs. ZS-FinPYT achieves 4.5% to 47% and ZS-FinDSL achieves 5.22% to 38.72% improvement in accuracy over ZS-STD. The increases in accuracy for text-davinci-003 and gpt-3.5-turbo are much higher than those for gpt-4, as the gpt-4 base model already performs reasonably well. These results indicate that our prompts are able to induce the required reasoning and successfully output the required Python programs or domain specific language programs.
3 https://github.com/wenhuchen/Program-of-Thoughts
Both methods also made significant improvements over the ZS-CoT method for text-davinci-003 and gpt-3.5-turbo, with ZS-FinPYT achieving 3% to 33.22% and ZS-FinDSL achieving 0% to 24.94% improvement over ZS-CoT on the different datasets. For gpt-4, our approach slightly outperforms ZS-CoT on all datasets, with improvements in the range of 1.5-3.5%. However, it is important to highlight that ZS-CoT lacks the ability to provide precise answers and its accuracy is measured using a relaxed metric, while our method generates precise answers and an exact metric is used to measure accuracy.
In general, the ZS-FinPYT approach gave better results than ZS-FinDSL for the text-davinci-003 and gpt-3.5-turbo models across the different datasets. For gpt-4, both methods are comparable.
We also carried out an evaluation of OpenAI models using few-shot PoT prompting, as shown in Table 3. The comparisons indicate the strong performance of our zero-shot method: it is within 10% of few-shot performance, in many cases nearly matching it, and in a few cases even surpassing it.
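For concreteness, the relaxed comparison described in the evaluation setup can be sketched in Python. This is a simplified illustration, not the exact evaluation script adapted from the Program of Thoughts repository: `normalize` handles a few of the unit and percentage variants mentioned above, and `answers_match` applies the `math.isclose` check with `rel_tol = 0.001`:

```python
import math

def normalize(pred: str) -> float:
    """Map answer strings such as '7 million', '7,000,000', or '23%' to floats.

    Simplified illustration; the real evaluation script handles more formats.
    """
    s = pred.strip().lower().replace(",", "")
    scale = 1.0
    if s.endswith("%"):
        s, scale = s[:-1], 0.01
    for word, mult in (("billion", 1e9), ("million", 1e6), ("thousand", 1e3)):
        if s.endswith(word):
            s, scale = s[: -len(word)].strip(), mult
            break
    return float(s) * scale

def answers_match(pred: str, gold: str) -> bool:
    # Accept the prediction iff |a_pred - a_gold| <= rel_tol * max(|a_pred|, |a_gold|).
    return math.isclose(normalize(pred), normalize(gold), rel_tol=0.001)

print(answers_match("7,000,000", "7 million"))  # True
print(answers_match("23%", "0.23"))             # True
```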
| Dataset | Split | Example | Input | Output |
|-----------|---------|-----------|------------------------------------|---------------|
| FinQA | Test | 1147 | Table + Text + Question | Number+Binary |
| ConvFinQA | Dev | 421 | Table + Text + Multi-turn Question | Number+Binary |
| TATQA | Dev †| 718 | Table + Text + Question | Number+Binary |
Table 1: Financial question answering datasets for evaluation
| Models | FinQA | ConvFinQA | TATQA |
|------------------------------|---------|-------------|---------|
| ZS-STD (text-davinci-003) | 22.58 | 13.30 | 39.97 |
| ZS-CoT (text-davinci-003) | 41.15 | 27.08 | 68.94 |
| ZS-FinDSL (text-davinci-003) | 56.76 | 52.02 | 68.25 |
| ZS-FinPYT (text-davinci-003) | 66.60 | 60.30 | 78.40 |
| ZS-STD (gpt-3.5-turbo) | 32.26 | 47.74 | 49.03 |
| ZS-CoT (gpt-3.5-turbo) | 53.01 | 52.49 | 74.09 |
| ZS-FinDSL (gpt-3.5-turbo) | 61.12 | 60.81 | 77.86 |
| ZS-FinPYT (gpt-3.5-turbo) | 66.52 | 67.45 | 85.00 |
| ZS-STD (gpt-4) | 63.64 | 72.45 | 77.58 |
| ZS-CoT (gpt-4) | 74.19 | 75.30 | 90.11 |
| ZS-FinDSL (gpt-4) | 77.33 | 77.67 | 90.53 |
| ZS-FinPYT (gpt-4) | 77.51 | 76.95 | 93.00 |
Table 2: Comparison results of various models on different datasets.
## 4.3 Performance Analysis
We conduct a performance analysis on the FinQA dataset for two models, gpt-4 and gpt-3.5-turbo; see Table 4 for details. The FinQA questions are divided into various categories to gain further insights.

Performance on text and table questions: The FinQA questions are divided into three sets depending on where the information required to answer the question is available: table-only questions, text-only questions, and table-text questions.

Performance regarding program steps: The FinQA questions are divided into three sets based on the number of steps required to produce the answer: 1-step programs, 2-step programs, and >2-step programs.

Performance regarding question types: The FinQA questions are divided into numerical and boolean type questions.
The key findings are listed below:
The models achieve the highest accuracy on table-only questions. As tables are structured and the tables in this dataset are simple, it may be easier for the LLMs to extract values accurately from them than from unstructured text.
Questions with multi-hop reasoning are challenging. As would be expected, both models find it easier to answer questions requiring one or two hops of reasoning than questions needing more than two.
Numerical questions are more challenging than boolean questions. In general, the gpt-4 and gpt-3.5-turbo models excel at answering boolean questions over arithmetic questions. However, gpt-3.5-turbo's performance declines with the ZS-FinDSL prompt for boolean questions as compared to arithmetic questions. Examination of a few cases indicated that gpt-3.5-turbo has greater difficulty in writing DSL programs correctly for boolean questions.
## 4.4 Error Analysis
We sampled 50 test cases from the FinQA dataset results of the text-davinci-003 model and examined the entire system output in detail to gain further insight into the obtained results. As expected, the ZS-STD prompt results in brief answers, with a sentence or value as the output and no details of the reasoning, potentially contributing to its poor performance. On the other hand, LLM responses with ZS-CoT detail the reasoning behind the answers and show significantly better performance than ZS-STD. However, arithmetic errors result in a substantial drop in performance for both the ZS-STD prompt and ZS-CoT.
The ZS-FinPYT and ZS-FinDSL approaches
| Models | FinQA | ConvFinQA | TATQA |
|-----------------------------------|---------|-------------|---------|
| Few-shot PoT (text-davinci-003) ⋆ | 72.27 | 69.35 | 83.21 |
| ZS-FinPYT (text-davinci-003) | 66.60 | 60.30 | 78.40 |
| Few-shot PoT (gpt-3.5-turbo) ⋆ | 67.39 | 65.79 | 74.75 |
| ZS-FinPYT (gpt-3.5-turbo) | 66.52 | 67.45 | 85.00 |
| Few-shot PoT (gpt-4) ⋆ | 78.46 | 82.42 | 91.89 |
| ZS-FinPYT (gpt-4) | 77.51 | 76.95 | 93.00 |
- ⋆ Few-shot PoT uses 4-shots selected from the few-shots used in (Chen et al., 2023).
Table 3: Performance of ZS-FinPYT and few-shot PoT on different datasets.
| Methods | ZS-FinPYT (gpt-4) | ZS-FinPYT (gpt-3.5-turbo) | ZS-FinDSL (gpt-4) | ZS-FinDSL (gpt-3.5-turbo) |
|--------------------------------------|-------|-------|-------|-------|
| overall accuracy | 77.51 | 66.52 | 77.33 | 61.12 |
| Performance on table and text | | | | |
| table-only questions | 80.91 | 71.36 | 81.36 | 63.94 |
| text-only questions | 74.45 | 58.39 | 73.36 | 60.22 |
| table-text questions | 67.44 | 55.81 | 68.22 | 48.84 |
| Performance regarding program steps | | | | |
| 1 step programs | 80.73 | 69.27 | 79.82 | 62.08 |
| 2 step programs | 77.02 | 64.79 | 77.26 | 63.08 |
| >2 step programs | 54.76 | 53.57 | 58.33 | 44.05 |
| Performance regarding question types | | | | |
| boolean questions | 90.00 | 95.00 | 85.00 | 45.00 |
| numerical questions | 77.28 | 66.02 | 77.20 | 61.40 |
Table 4: Performance breakdown of various models on FinQA dataset.
demonstrated detailed reasoning. In the case of ZS-FinPYT, the task of writing a Python program triggers reasoning, while in the case of ZS-FinDSL there are two prompts, where the first is a meta prompt that drives the reasoning similar to ZS-CoT. These techniques produce programs instead of answers and therefore mitigate arithmetic errors. Hence, the proposed techniques significantly outperform ZS-CoT. The ZS-FinDSL performance is lower than ZS-FinPYT because the program extraction step fails in some cases where the reasoning step is correct. One possible explanation is that the GPT systems have likely been trained on huge amounts of Python programs and hence can generate Python programs efficiently, whereas for ZS-FinDSL the instruction itself must convey how to write out the domain specific program. This may be driving the slightly higher error rate of ZS-FinDSL. Some demonstrative examples supporting these observations may be found in Appendix B.
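The execution step that makes these approaches robust to arithmetic errors can be sketched as follows. This is a minimal illustration, not the authors' exact harness: it assumes the generated snippet assigns its final answer to a variable named `ans`, as the ZS-FinPYT instructions require, and a real system should sandbox untrusted generated code:

```python
def run_generated_program(code: str):
    """Execute an LLM-generated Python snippet and return its 'ans' variable.

    Illustrative only: a real deployment should sandbox untrusted generated code.
    """
    namespace = {}
    exec(code, namespace)  # the snippet assigns its final answer to 'ans'
    return namespace["ans"]

# A snippet in the style the ZS-FinPYT instructions ask the LLM to produce:
generated = """
# Comment: growth rate = (current - previous) / previous
previous_revenue = 100.0
current_revenue = 112.0
ans = (current_revenue - previous_revenue) / previous_revenue
"""
print(run_generated_program(generated))  # 0.12
```

Because the interpreter, not the LLM, performs the division, high-precision arithmetic is exact regardless of the model's numerical abilities.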
## 5 Conclusion
We proposed zero-shot prompting techniques to answer complex questions requiring multi-hop numerical reasoning over financial reports. The prompts guide the LLM to encode the required reasoning into a program that is executed by a program interpreter. The approach demonstrated excellent results on three financial datasets, achieving significant improvements over the respective baselines. We hope that our work will motivate a principled approach to prompt design with other LLMs.
## Limitations
In this paper, we only experiment with the GPT-x series of LLMs. While this work shows the tremendous potential for zero-shot financial reasoning with LLMs, it is possible that better performance may be obtained with other LLMs. Moreover, the prompts we have proposed are designed to address specific problems observed with the three GPT models considered in this work. Other LLMs may behave differently and will likely need modifications to the prompts to work effectively.
While we experiment and find zero-shot prompts that are effective for both ZS-FinPYT and ZS-FinDSL, and the error analysis provided insights into failures, there are also unexplained failures in reasoning, and more research is needed to understand the behavior of LLMs in certain cases. For ZS-FinDSL, we observed some patterns that result in failure of program extraction. However, it is unclear what drives these failures, and we leave that for future work.
For cases where the reasoning was incorrect, the system may provide an explanation with a high level of confidence. Our prompt currently does not address or control for such behavior. This can pose challenges for practical use in real-world systems.
## Disclaimer
The views reflected in this article are the views of the authors and do not necessarily reflect the views of the global EY organization or its member firms.
## References
- Md Shad Akhtar, Abhishek Kumar, Deepanway Ghosal, Asif Ekbal, and Pushpak Bhattacharyya. 2017. A Multilayer Perceptron based Ensemble Technique for Fine-grained Financial Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages 540-546.
- Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv preprint arXiv:2302.04023 .
- Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 .
- Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, et al. 2021a. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 .
- Wenhu Chen. 2022. Large Language Models are few(1)-shot Table Reasoners. arXiv preprint arXiv:2210.06710 .
- Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of Thoughts
Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Transactions on Machine Learning Research .
- Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, et al. 2021b. FinQA: A Dataset of Numerical Reasoning over Financial Data. In 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021 , pages 3697-3711. Association for Computational Linguistics (ACL).
- Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. arXiv preprint arXiv:2210.03849 .
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, et al. 2022. PaLM: Scaling Language Modeling with Pathways. arxiv:2204.02311 .
- Min-Yuh Day and Chia-Chou Lee. 2016. Deep learning for financial sentiment analysis on finance news providers. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) , pages 1127-1134. IEEE.
- Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, et al. 2023. Mathematical Capabilities of ChatGPT. arXiv preprint arXiv:2301.13867 .
- Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided Language Models. In Proceedings of the 40th International Conference on Machine Learning , volume 202 of Proceedings of Machine Learning Research , pages 10764-10799. PMLR.
- Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375 .
- Jingguang Han, Utsab Barman, Jer Hayes, Jinhua Du, Edward Burgin, and Dadong Wan. 2018. NextGen AML: Distributed Deep Learning based Language Technologies to Augment Anti Money Laundering Investigation. In Proceedings of ACL 2018, System Demonstrations , pages 37-42.
- Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361 .
- Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large
Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems , volume 35, pages 22199-22213. Curran Associates, Inc.
- Tiffany H. Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, et al. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health , pages 1-12.
- Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, et al. 2022. Solving Quantitative Reasoning Problems with Language Models. In Advances in Neural Information Processing Systems , volume 35, pages 3843-3857. Curran Associates, Inc.
- Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pretrain, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys , 55(9):1-35.
- Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. 2021. FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence , pages 4513-4519.
- Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming fewshot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 8086-8098.
- Kyle Mahowald, Anna A Ivanova, Idan A Blank, Nancy Kanwisher, Joshua B Tenenbaum, and Evelina Fedorenko. 2023. Dissociating language and thought in large language models: a cognitive perspective. arXiv preprint arXiv:2301.06627 .
- OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 .
- Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. 2023. Formal Mathematics Statement Curriculum Learning. In The Eleventh International Conference on Learning Representations 2023 . OpenReview.net.
- Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages 784-789.
- Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems , pages 1-7.
- Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised Commonsense Question Answering with Self-Talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 4615-4629.
- Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239 .
- Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change). In NeurIPS 2022 Foundation Models for Decision Making Workshop .
- Weikang Wang, Jiajun Zhang, Qian Li, Chengqing Zong, and Zhifei Li. 2019. Are you for real? detecting identity fraud via dialogue interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 1762-1771.
- Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations 2023 . OpenReview.net.
- Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research .
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022b. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems , volume 35, pages 24824-24837. Curran Associates, Inc.
- Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 2369-2380.
- Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and TatSeng Chua. 2021. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages 3277-3287.
## A Prompts for ConvFinQA
The ConvFinQA prompts are slightly modified to handle conversational questions.
ZS-FinPYT for ConvFinQA: For gpt-4, we use a single prompt where the last question in the series of questions is clearly marked and the system is instructed to answer that last question, as shown in Figure 5. For gpt-3.5-turbo and text-davinci-003, we use a dual-prompt approach consisting of a reasoning extraction prompt and a program generation prompt; see Figure 6. The reasoning extraction prompt generates answers with reasoning for all the questions in a conversation, and the program generation prompt generates a Python program answering the last question.
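Assuming a generic text-completion callable `llm`, the dual-prompt flow can be sketched as two chained calls; the prompt wording below is paraphrased from Figure 6 and is illustrative rather than the exact prompt text:

```python
# Dual-prompt ZS-FinPYT flow for ConvFinQA (prompt wording paraphrased from
# Figure 6; `llm` is any callable mapping a prompt string to a completion).
REASONING_PROMPT = (
    "Read the following passage and then answer the questions:\n"
    "Passage: {passage}\n"
    "Questions: {questions}\n"
    "Answer the questions by finding the relevant values and performing "
    "step by step calculations.\nAnswer:"
)
PROGRAM_PROMPT = (
    "Questions: {questions}\n"
    "Answer: {reasoning}\n"
    "Task: Write a Python code to answer the last question.\n"
    "Python executable code is:\n#Python"
)

def convfinqa_python_program(passage: str, questions: str, llm) -> str:
    """First elicit reasoning over all turns, then a program for the last question."""
    reasoning = llm(REASONING_PROMPT.format(passage=passage, questions=questions))
    return llm(PROGRAM_PROMPT.format(questions=questions, reasoning=reasoning))
```

Chaining the first completion into the second prompt is what lets the program generation step focus only on the final turn of the conversation.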
ZS-FinDSL for ConvFinQA: The ZS-FinDSL for ConvFinQA, see Figure 7, is a dual prompt consisting of a reasoning prompt and a program extraction prompt that are similar to the corresponding prompts for FinQA. The reasoning prompt instructs the LLM to generate answers with reasoning for all questions in a conversation. The program extraction prompt instructs the LLM to generate a program for performing the calculations to answer the last question.
ZS-STD and ZS-CoT for ConvFinQA: The LLM answering prompt of ZS-STD (see Figure 8) and the reasoning extraction prompt of ZS-CoT (see Figure 9) instruct the LLM to answer the questions of a conversation. Then the answer extraction prompt of both of these techniques extracts the final answer.
## B Error Analysis Examples
We show some examples from the FinQA dataset with the corresponding responses from the text-davinci-003 model under various prompts. These examples demonstrate successful attempts and failure cases under the various prompts.
We begin by showing some examples where the ZS-FinDSL (text-davinci-003) system generates correct reasoning and the corresponding program generation succeeds; see Figure 10 and Figure 11. Similarly, Figure 12 and Figure 13 show successful Python program generation by the ZS-FinPYT (text-davinci-003) system.
In most cases, the LLM answering prompt of ZS-STD (text-davinci-003) generates only a value or a sentence; see Figure 14 and Figure 15 for details. In some cases, the answer extraction step fails, as shown in Figure 16.
The LLM responses with ZS-CoT detail the reasoning behind the answers and show significantly better performance than ZS-STD. However, arithmetic errors result in a substantial drop in performance for both the ZS-STD prompt and ZS-CoT. Examples demonstrating arithmetic errors are shown in Figure 17 and Figure 18.
The ZS-FinDSL performance is lower than ZS-FinPYT because the program extraction step fails in some cases where the reasoning step is correct, as shown in Figure 19 and Figure 20.
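For reference, the restricted DSL format specified in the ZS-FinDSL prompt (operations limited to add, subtract, multiply, divide, exponent, greater-than, max, and min, with "#k" arguments referencing earlier steps) can be evaluated with a small interpreter. The sketch below assumes the JSON-like program structure shown in Figure 7:

```python
# Minimal interpreter for the restricted ZS-FinDSL program format (a sketch;
# step results are stored under their "#k" keys so later steps can reference them).
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "exponent": lambda a, b: a ** b,
    "greater-than": lambda a, b: a > b,
    "max": max,
    "min": min,
}

def run_dsl(program: dict):
    """Evaluate steps "#0", "#1", ... in order and return the last result."""
    results, last = {}, None

    def resolve(arg):
        # "#k" references an earlier step; anything else is a literal number.
        return results[arg] if isinstance(arg, str) and arg.startswith("#") else float(arg)

    for step, spec in program.items():
        results[step] = OPS[spec["OPERATION"]](resolve(spec["ARG1"]), resolve(spec["ARG2"]))
        last = results[step]
    return last

# Example: (100 - 80) / 80
prog = {"#0": {"OPERATION": "subtract", "ARG1": 100, "ARG2": 80},
        "#1": {"OPERATION": "divide", "ARG1": "#0", "ARG2": 80}}
print(run_dsl(prog))  # 0.25
```

Restricting the operation set in this way keeps the extracted programs trivially checkable, at the cost of the extraction failures discussed above.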
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Diagram: LLM Python Code Generation
### Overview
The image illustrates a process where a user reads text and a table, answers a question by writing Python code, and provides instructions for the code. This input is then processed by an LLM (Large Language Model) to generate Python code. The diagram includes a legend explaining the color-coding of different elements.
### Components/Axes
* **Top Section:** A light blue rounded rectangle containing instructions and context for the user.
* Text: "Read the following text and table, and then answer the last question by writing a Python code:"
* "Passage: text + table"
* "Questions: ask a series of questions?"
* "Last Question: ask last question of the series?"
* "Answer the last question by following the below instructions."
* "Instructions:"
* "Define the Python variable which must begin with a character."
* "Assign values to variables required for the calculation."
* "Create Python variable "ans" and assign the final answer (bool/float) to the variable "ans"."
* "Don't include non-executable statements and include them as part of comments. #Comment: ..."
* "Python executable code is:"
* "#Python" (orange text)
* **Middle Section:**
* An arrow pointing downwards from the bottom of the top section to the top of the bottom section.
* An icon representing an LLM (Large Language Model) with interconnected nodes.
* Text: "LLM" below the icon.
* **Bottom Section:** A pink rounded rectangle representing the output of the LLM.
* Text: "Python code from the LLM."
* **Legend (Bottom):**
* Black square: "Signifier"
* Orange square: "Memetic proxy"
* Pink square: "Constraining behavior"
* Blue square: "Input"
### Detailed Analysis or ### Content Details
* **User Input:** The user is instructed to read text and a table, answer a question using Python code, and follow specific instructions for writing the code.
* **LLM Processing:** The LLM takes the user's input and generates Python code.
* **Output:** The output is Python code generated by the LLM.
* **Color Coding:** The legend indicates that different elements in the process are color-coded to represent their function:
* Signifier (Black): Not explicitly shown in the diagram, but likely refers to elements that signify meaning or intent.
* Memetic Proxy (Orange): The "#Python" tag is colored orange, indicating it acts as a memetic proxy.
* Constraining Behavior (Pink): The bottom section "Python code from the LLM" is pink, indicating it represents a constraining behavior.
* Input (Blue): The top section containing the instructions is light blue, indicating it represents the input.
### Key Observations
* The diagram illustrates a workflow where user input is processed by an LLM to generate Python code.
* The color-coding provides additional information about the function of different elements in the process.
* The instructions emphasize the importance of defining variables correctly and including comments in the code.
### Interpretation
The diagram demonstrates the use of an LLM to generate Python code based on user instructions and input. The color-coding helps to clarify the roles of different elements in the process, such as the input, the LLM's processing, and the resulting code. The instructions provided to the user highlight the importance of clear and well-documented code, which is essential for effective collaboration and maintainability. The diagram suggests that the LLM is intended to assist users in writing Python code by automating the generation process based on specific instructions and input data.
</details>
Figure 5: ZS-FinPYT (gpt-4) prompt for ConvFinQA
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Diagram: Reasoning Extraction and Program Generation
### Overview
The image presents a diagram illustrating the process of reasoning extraction and program generation using a Large Language Model (LLM). It outlines the steps involved in taking a passage and questions, extracting relevant information, and generating a Python program to answer the questions.
### Components/Axes
The diagram is divided into two main sections: "Reasoning extraction" on the left and "Program generation" on the right.
**Reasoning Extraction (Left Side):**
* **Header:** "Reasoning extraction" (orange background)
* **Input Box (Top):** Light blue box containing the following text:
* "Read the following passage and then answer the questions:"
* "Passage: text + table"
* "Questions: ask question?"
* "Answer the questions by finding the relevant values and performing step by step calculations."
* "Answer:"
* **LLM Icon:** A multi-colored node diagram labeled "LLM"
* **Output Box (Bottom):** Pink box containing the text: "Answer with reasoning from LLM."
* **Flow:** An arrow flows from the Input Box to the LLM icon, and another arrow flows from the LLM icon to the Output Box.
**Program Generation (Right Side):**
* **Header:** "Program generation" (orange background)
* **Input Box (Top):** Light blue box containing the following text:
* "Questions: ask question?"
* "Answer: Answer with reasoning from LLM."
* "Task: Write a Python code to answer the last question by following the below instructions."
* "Instructions:"
* "Define the Python variable which must begin with a character."
* "Assign values to variables required for the calculation."
* "Create Python variable "ans" and assign the final answer (bool/float) to the variable "ans"."
* "Don't include non-executable statements and include them as part of comments. #Comment: ..."
* "Python executable code is:"
* "#Python"
* **LLM Icon:** A multi-colored node diagram labeled "LLM"
* **Output Box (Bottom):** Pink box containing the text: "Python program generated by the LLM."
* **Flow:** An arrow flows from the Input Box to the LLM icon, and another arrow flows from the LLM icon to the Output Box.
**Legend (Bottom):**
* Black square: "Signifier"
* Orange square: "Memetic proxy"
* Pink square: "Constraining behavior"
* Green square: "Meta prompt"
* Blue square: "Input"
### Detailed Analysis or ### Content Details
**Reasoning Extraction:**
1. The process begins with an input passage containing text and a table, along with a question.
2. The LLM processes this information to find relevant values and perform calculations.
3. The LLM outputs an answer with reasoning.
**Program Generation:**
1. The process starts with a question and the answer with reasoning from the LLM.
2. The task is to write a Python code to answer the last question, following specific instructions.
3. The instructions include defining Python variables, assigning values, creating an "ans" variable, and including comments.
4. The LLM generates a Python program.
### Key Observations
* The diagram highlights the use of an LLM in both reasoning extraction and program generation.
* The input and output boxes provide context for the LLM's role in each process.
* The instructions for program generation emphasize the importance of defining variables, assigning values, and including comments in the Python code.
### Interpretation
The diagram illustrates a system where an LLM is used to first extract reasoning from a given passage and questions, and then to generate a Python program based on that reasoning. This suggests a process of automated problem-solving where the LLM not only provides an answer but also generates the code necessary to arrive at that answer. The separation into "Reasoning extraction" and "Program generation" suggests a modular approach, where the LLM's reasoning capabilities are leveraged to create executable code. The instructions provided to the LLM for program generation are crucial for ensuring the code is well-structured and includes necessary elements like variable definitions and comments.
</details>
Figure 6: ZS-FinPYT (gpt-3.5-turbo, text-davinci-003) prompt for ConvFinQA
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Diagram: Prompt Engineering Strategies for ConvFinQA
### Overview
The image presents three diagrams illustrating different prompt engineering strategies (ZS-FinDSL, ZS-STD, and a third unnamed strategy) used with a Large Language Model (LLM) for the ConvFinQA task. Each diagram outlines the flow of information and processing steps involved in answering questions based on a given passage and table.
### Components/Axes
**General Components (Present in all diagrams):**
* **Input Text Box (Light Blue):** Contains the initial prompt instructions and data.
* **LLM Icon:** Represents the Large Language Model processing the input.
* **Output Text Box (Light Pink):** Displays the LLM's response or generated output.
* **Arrows:** Indicate the flow of information between components.
* **Legend (Bottom):** Defines the color-coded elements:
* Black: Signifier
* Orange: Memetic proxy
* Green: Constraining behavior (Only in Figure 7)
* Pink: Meta prompt
* Blue: Input
**Figure 7: ZS-FinDSL prompt for ConvFinQA**
* **Top-Left Box (Reasoning extraction):**
* Title: Reasoning extraction
* Text:
* Read the following passage and then answer the questions:
* Passage: text + table
* Questions: ask question?
* Answer the questions by finding the relevant values and performing step by step calculations.
* Answer:
* **Top-Right Box (Program extraction):**
* Title: Program extraction
* Text:
* Questions: ask question?
* Answer: Answer with reasoning from LLM.
* Task: From the above question-answer, extract the calculations that were performed to arrive at the answer to the last question. The calculations should be provided in the following format:
* {"PROGRAM": {"#0": {OPERATION:"[arithmetic/logic]", ARG1:"[float/int]", ARG2:"[float/int]"}, "#1":{OPERATION: [arithmetic/logic], ARG1:"#0", ARG2:"[float/int/#int]"}, ...}, "ANSWER": "[numerical/boolean]"}
* Operation should strictly be restricted to {add, subtract, multiply, divide, exponent, greater-than, max, min} only.
* When evaluated the program should only generate numerical or boolean values.
* Solution:
* **Bottom-Left Box:** Answer with reasoning from LLM.
* **Bottom-Right Box:** Program generated by the LLM.
**Figure 8: ZS-STD prompt for ConvFinQA**
* **Top-Left Box (LLM answering):**
* Title: LLM answering
* Text:
* Read the following passage and then answer the questions:
* Passage: text + table
* Questions: ask question?
* Answer:
* **Top-Right Box (Answer extraction):**
* Title: Answer extraction
* Text:
* Questions: ask question?
* Answer: Answer from LLM.
* The final answer (float/int/boolean) is:
* **Bottom-Left Box:** Answer from LLM.
* **Bottom-Right Box:** Final answer generated by the LLM.
**Figure 9: ZS-CoT prompt for ConvFinQA**
* **Top-Left Box (Reasoning extraction):**
* Title: Reasoning extraction
* Text:
* Read the following passage and then answer the questions:
* Passage: text + table
* Questions: ask question?
* Answer: Let us think step by step.
* **Top-Right Box (Answer extraction):**
* Title: Answer extraction
* Text:
* Questions: ask question?
* Answer: Answer with reasoning from LLM.
* The final answer (float/int/boolean) is:
* **Bottom-Left Box:** Answer with reasoning from LLM.
* **Bottom-Right Box:** Final answer generated by the LLM.
### Detailed Analysis
**Figure 7: ZS-FinDSL prompt for ConvFinQA**
1. **Reasoning extraction:** The LLM receives a passage and table, and is prompted to answer questions by finding relevant values and performing step-by-step calculations.
2. **Program extraction:** The LLM extracts the calculations performed to arrive at the answer. The calculations are formatted as a program with arithmetic/logic operations on float/int values. The output is a numerical or boolean answer.
3. The "Reasoning extraction" box connects to the "Program extraction" box with a gray arrow.
4. The LLM icon below the "Reasoning extraction" box connects to the "Answer with reasoning from LLM" box with a downward arrow.
5. The LLM icon below the "Program extraction" box connects to the "Program generated by the LLM" box with a downward arrow.
**Figure 8: ZS-STD prompt for ConvFinQA**
1. **LLM answering:** The LLM receives a passage and table, and is prompted to answer questions.
2. **Answer extraction:** The LLM extracts the final answer, which is a float, integer, or boolean value.
3. The "LLM answering" box connects to the "Answer extraction" box with a gray arrow.
4. The LLM icon below the "LLM answering" box connects to the "Answer from LLM" box with a downward arrow.
5. The LLM icon below the "Answer extraction" box connects to the "Final answer generated by the LLM" box with a downward arrow.
**Figure 9: ZS-CoT prompt for ConvFinQA**
1. **Reasoning extraction:** The LLM receives a passage and table, and is prompted to answer questions by thinking step by step.
2. **Answer extraction:** The LLM extracts the final answer, which is a float, integer, or boolean value.
3. The "Reasoning extraction" box connects to the "Answer extraction" box with a gray arrow.
4. The LLM icon below the "Reasoning extraction" box connects to the "Answer with reasoning from LLM" box with a downward arrow.
5. The LLM icon below the "Answer extraction" box connects to the "Final answer generated by the LLM" box with a downward arrow.
### Key Observations
* Figure 7 (ZS-FinDSL) involves a two-step process: reasoning extraction followed by program extraction.
* Figure 8 (ZS-STD) is a simpler, direct answer extraction process.
* Figure 9 (ZS-CoT) follows the same two-box structure as ZS-STD but adds the prompt "Let us think step by step" in the reasoning extraction phase.
* The legend provides a color-coding scheme for different elements in the diagrams.
### Interpretation
The diagrams illustrate different approaches to prompt engineering for question answering tasks using LLMs. ZS-FinDSL aims to extract the reasoning process as a program, while ZS-STD directly extracts the answer. ZS-CoT combines elements of both, prompting the LLM to think step by step before extracting the answer. The choice of prompt engineering strategy can significantly impact the LLM's performance and the interpretability of its reasoning process. The diagrams highlight the importance of carefully designing prompts to elicit the desired behavior from the LLM.
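The JSON program format specified in Figure 7's prompt lends itself to a small interpreter. A minimal sketch in Python (the function name `run_program`, the strict key casing, and the last-step-wins return convention are illustrative assumptions, not the paper's implementation):

```python
import json

# Supported operations, matching the set listed in the Figure 7 prompt.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "exponent": lambda a, b: a ** b,
    "greater-than": lambda a, b: a > b,
    "max": max,
    "min": min,
}

def run_program(dsl: str):
    """Execute a ZS-FinDSL-style program and return the last step's value."""
    steps = json.loads(dsl)["PROGRAM"]
    results = {}
    last = None
    for key in sorted(steps, key=lambda k: int(k.lstrip("#"))):
        step = steps[key]
        args = []
        for name in ("ARG1", "ARG2"):
            raw = str(step[name]).replace(",", "")
            # An argument of the form "#n" refers to an earlier step's result.
            args.append(results[raw] if raw.startswith("#") else float(raw))
        results[key] = OPS[step["OPERATION"]](*args)
        last = results[key]
    return last

# Example: percentage change from 1131 to 1925, as a two-step program.
program = ('{"PROGRAM": {'
           '"#0": {"OPERATION": "subtract", "ARG1": "1925", "ARG2": "1131"}, '
           '"#1": {"OPERATION": "divide", "ARG1": "#0", "ARG2": "1131"}}}')
print(round(run_program(program), 5))  # 0.70203
```

Executing the program with such an interpreter, rather than trusting the LLM's own arithmetic, is what mitigates the calculation errors discussed in the paper.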
</details>
Figure 9: ZS-CoT prompt for ConvFinQA
Figure 10: An example where ZS-FinDSL extracted correct reasoning and successfully generated the DSL program.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Table and Text Analysis: Contractual Obligations and Cash Outflow
### Overview
The image presents a table outlining contractual commitments over several years, followed by a question regarding the percentage of total expected cash outflow due in 2012. It also includes "Gold" and "ZS-FinDSL" program outputs related to answering the question.
### Components/Axes
* **Title:** UPS/2010/page\_52.pdf-1
* **Table Headers:**
* Commitment Type
* 2011
* 2012
* 2013
* 2014
* 2015
* After 2016
* Total
* **Commitment Types:**
* Capital Leases
* Other Liabilities
* Total
* **Question:** What percentage of total expected cash outflow to satisfy contractual obligations and commitments as of December 31, 2010 are due in 2012?
* **Gold Program:** divide(1334, 23556)
* **Gold Answer:** 0.05663
* **ZS-FinDSL Reasoning Extraction Prompt Response:** Text explaining the calculation.
* **ZS-FinDSL Program Extraction Prompt Response:** JSON format of the program.
* **ZS-FinDSL Program:** divide(1334, 23556)
* **ZS-FinDSL Executed Answer:** 0.05663
### Detailed Analysis
**Table Data:**
| Commitment Type | 2011 | 2012 | 2013 | 2014 | 2015 | After 2016 | Total |
| :---------------- | :------ | :------ | :------ | :------ | :------ | :--------- | :-------- |
| Capital Leases | \$18 | \$19 | \$19 | \$20 | \$21 | \$112 | \$209 |
| Other Liabilities | 69 | 67 | 64 | 58 | 43 | 38 | 339 |
| Total | \$2,944 | \$1,334 | \$3,515 | \$2,059 | \$820 | \$12,884 | \$23,556 |
**Textual Information:**
* **Passage:** contractual commitments we have contractual obligations and commitments in the form of capital leases, operating leases ...
* **Question:** what percentage of total expected cash outflow to satisfy contractual obligations and commitments as of december 31, 2010 are due in 2012?
* **Gold Program:** divide(1334, 23556)
* **Gold Answer:** 0.05663
* **ZS-FinDSL Reasoning Extraction Prompt Response:** Total expected cash outflow to satisfy contractual obligations and commitments as of December 31, 2010 = $23,556 (million), Cash outflow due in 2012 = $1,334 (million), Percentage of total expected cash outflow due in 2012 = (1,334/23,556) x 100 = 5.65%
* **ZS-FinDSL Program Extraction Prompt Response:** {"Program": {"#0": {operation:"divide", arg1:"1,334", arg2:"23,556"}}, "Answer": "282"}
* **ZS-FinDSL Program:** divide(1334, 23556)
* **ZS-FinDSL Executed Answer:** 0.05663
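The extracted program reduces to a single division; the decimal and percentage forms can be checked directly:

```python
# Gold and extracted program: divide(1334, 23556)
ans = 1334 / 23556
print(round(ans, 5))        # 0.05663 -- matches the gold answer
print(round(ans * 100, 2))  # 5.66 -- the reasoning response's 5.65% is rounded down
```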
### Key Observations
* The table provides a breakdown of financial commitments over several years, categorized by "Capital Leases" and "Other Liabilities."
* The "Total" row represents the sum of all commitments for each year.
* The question asks for the percentage of total expected cash outflow due in 2012 relative to the total expected cash outflow as of December 31, 2010.
* Both the "Gold Program" and "ZS-FinDSL Program" calculate the answer using the same division: 1334 / 23556.
* The ZS-FinDSL Reasoning Extraction Prompt Response provides a more detailed explanation and arrives at a percentage of 5.65%.
### Interpretation
The image presents a financial analysis scenario where the goal is to determine the proportion of cash outflow due in 2012 compared to the total expected cash outflow as of the end of 2010. The table provides the necessary data, and the subsequent calculations (both "Gold" and "ZS-FinDSL") aim to answer the posed question. The ZS-FinDSL response provides a more complete answer by converting the result of the division into a percentage. The "Program Extraction Prompt Response" shows the structured data format used by the ZS-FinDSL system. The difference between the "Gold Answer" (0.05663) and the ZS-FinDSL percentage (5.65%) is simply a matter of scale: multiplying the decimal by 100 (and rounding) expresses the result as a percentage.
</details>
Figure 11: An example where ZS-FinDSL extracted correct reasoning and successfully generated the DSL program.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Document Analysis: Financial Statement Data and Reasoning
### Overview
The image presents a snippet from a technical document, specifically a page (page_105.pdf-2) related to Humana Inc.'s consolidated financial statements for 2014. It includes financial data for the years 2009, 2008, and 2007, focusing on weighted-average fair value, expected option life, expected volatility, and risk-free interest rate. The document also features a question about the percentage change in expected volatility from 2008 to 2009, along with a "Gold Program" and a "ZS-FinDSL Reasoning Extraction Prompt Response" that provide solutions and reasoning.
### Components/Axes
* **Header:** "HUM/2009/page\_105.pdf-2"
* **Passage:** Contains introductory text and financial data.
* Years: 2009, 2008, 2007
* Weighted-average fair value at grant date: $14.24 (2009), $17.95 (2008), $21.07 (2007)
* Expected option life (years): 4.6 (2009), 5.1 (2008), 4.8 (2007)
* Expected volatility: 39.2% (2009), 28.2% (2008), 28.9% (2007)
* Risk-free interest rate at grant date: 1.9% (2009), 2.9% (2008), 4.5% (2007)
* **Question:** "what was the percent of the change of the expected volatility from 2008 to 2009?"
* **Gold Program:** Presents a calculation to answer the question.
* Program: subtract(39.2, 28.2), divide(#0, 28.2)
* Answer: 0.39007
* **ZS-FinDSL Reasoning Extraction Prompt Response:** Provides a step-by-step explanation of the calculation.
* Step 1: 39.2% - 28.2% = 11%
* Step 2: 11% / 28.2% = 0.39 or 39%
* **ZS-FinDSL Program Extraction Prompt Response:** Presents the calculation in a JSON-like format.
* Program: subtract(39.2, 28.2), divide(#0, 28.2)
* Answer: 39%
* **ZS-FinDSL Executed Answer:** 0.39007
### Detailed Analysis
The document provides financial data for Humana Inc. over three years (2009, 2008, 2007). The weighted-average fair value at grant date decreased from $21.07 in 2007 to $14.24 in 2009. The expected option life fluctuated, reaching 5.1 years in 2008. The expected volatility was highest in 2009 at 39.2%. The risk-free interest rate decreased from 4.5% in 2007 to 1.9% in 2009.
The question focuses on calculating the percentage change in expected volatility from 2008 to 2009. The "Gold Program" and "ZS-FinDSL Reasoning Extraction Prompt Response" both provide methods to calculate this change. The "Gold Program" uses the formula: subtract(39.2, 28.2), divide(#0, 28.2), which results in an answer of 0.39007. The "ZS-FinDSL Reasoning Extraction Prompt Response" breaks down the calculation into two steps: first, subtracting the 2008 volatility from the 2009 volatility (39.2% - 28.2% = 11%), and then dividing the result by the 2008 volatility (11% / 28.2% = 0.39 or 39%).
The "ZS-FinDSL Program Extraction Prompt Response" presents the same calculation in a JSON-like format, which is likely used for automated processing. The "ZS-FinDSL Executed Answer" confirms the result of the calculation as 0.39007.
### Key Observations
* The weighted-average fair value at grant date decreased over the three years.
* The expected volatility increased significantly from 2008 to 2009.
* The risk-free interest rate decreased over the three years.
* Both the "Gold Program" and "ZS-FinDSL" methods arrive at the same answer for the percentage change in expected volatility.
### Interpretation
The document snippet appears to be part of a financial analysis or report. The data suggests a changing financial landscape for Humana Inc. between 2007 and 2009. The decrease in weighted-average fair value and risk-free interest rate, coupled with the increase in expected volatility, could indicate increased market uncertainty or specific challenges faced by the company during that period.
The inclusion of the question and the detailed calculation methods ("Gold Program" and "ZS-FinDSL") suggests an effort to explain or verify the percentage change in expected volatility. The "ZS-FinDSL" responses, particularly the JSON-like format, indicate the use of automated tools for financial analysis and reasoning. The consistency between the different calculation methods reinforces the accuracy of the result.
</details>
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Text Extraction and Analysis of Financial Data
### Overview
The image presents financial data related to non-vested shares of a company, likely Hologic, Inc., along with a question about the percentage increase in the average price of WTI crude oil between 2011 and 2013. It also includes code snippets and their corresponding answers from two different programs ("Gold Program" and "ZS-FinPYT Program") that calculate the value of non-vested shares.
### Components/Axes
The image is divided into several sections:
1. **Header:** "HOLX/2009/page\_151.pdf-1"
2. **Passage:** A description of the data's context, indicating it's from the table of contents of Hologic, Inc.'s consolidated financial statements, specifically related to restricted stock units activity.
3. **Data Table:** Contains information about non-vested shares, including the date, number of shares, and weighted-average grant-date fair value.
4. **Question:** Poses a question about the percentage increase in the average price of WTI crude oil from 2011 to 2013.
5. **Gold Program:** Shows a simple multiplication calculation and its answer.
6. **ZS-FinPYT Program:** Presents a Python-like code snippet and its executed answer.
### Detailed Analysis
**1. Header:**
* Text: HOLX/2009/page\_151.pdf-1
* Position: Top center
**2. Passage:**
* Text: "Passage: table of contents hologic, inc . notes to consolidated financial statements ( continued ) ( in thousands , except per share data ) a summary of the company's restricted stock units activity during ..."
* Position: Below the header
**3. Data Table:**
* Columns: Non-vested Shares, Number of Shares, Weighted-Average Grant-Date Fair Value
* Rows:
* Non-vested at September 27, 2008: 1,461 shares, $31.23 fair value
* Non-vested at September 26, 2009: 2,770 shares, $21.96 fair value
**4. Question:**
* Text: "Question: by what percentage did the average price of wti crude oil increase from 2011 to 2013?"
* Position: Below the data table
**5. Gold Program:**
* Program: multiply(2770, 21.96)
* Answer: 60829.2
**6. ZS-FinPYT Program:**
* Code:
* `non_vested_shares = 2770`
* `weighted_average_grant_date_fair_value = 21.96`
* `ans = non_vested_shares * weighted_average_grant_date_fair_value`
* `print(ans)  # prints 60,532.2`
* Executed Answer: 60829.2
### Key Observations
* The "Gold Program" and "ZS-FinPYT Program" both calculate the same value (60829.2), which is the product of 2770 and 21.96.
* The ZS-FinPYT program's `print(ans)` line carries a comment claiming an output of 60,532.2, which conflicts with the executed answer of 60829.2. Since 2770 × 21.96 = 60,829.2, the inline comment, not the execution, is in error.
* The question about WTI crude oil price increase is unrelated to the non-vested shares data.
### Interpretation
The image presents a mix of financial data and a seemingly unrelated question. The primary focus is on the calculation of the value of non-vested shares, which is performed using two different methods. The "Gold Program" provides a direct calculation, while the "ZS-FinPYT Program" demonstrates a more programmatic approach.
The 60,532.2 figure in the ZS-FinPYT program's comment does not match the executed answer of 60,829.2; because 2770 × 21.96 = 60,829.2, the comment appears to be an LLM transcription slip rather than a calculation issue, and executing the program recovers the correct value.
The question about WTI crude oil price increase appears to be a separate problem, possibly included as a test or exercise within the same document. It is not directly related to the non-vested shares data presented.
</details>
Figure 12: An example of successful Python program generation by ZS-FinPYT.
Figure 13: An example of successful Python program generation by ZS-FinPYT.
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Text Extraction and Analysis
### Overview
The image presents a text-based document, likely a page from a financial report or analysis. It includes a passage of text, a question, and two different program outputs (Gold Program and ZS-FinPYT Program) that attempt to answer the question. The document appears to be related to Goldman Sachs' financial statements.
### Components/Axes
* **Header:** "GS/2015/page\_188.pdf-4" (located at the top-center)
* **Passage:** A block of text containing information about Goldman Sachs' financial statements, specifically mentioning amounts of $2.86 billion and $2.87 billion as of December 2015 and December 2014, respectively. It also mentions "2021 - thereafter | 1,160" and "Total | $2,575".
* **Question:** "in billions, what was the total for 2015 and 2014 relating to commitments to invest in funds managed by the firm?"
* **Gold Program:**
* Program: "add(2.86, 2.87)"
* Answer: "5.73"
* **ZS-FinPYT Program:**
* Program:
* `total_2015 = 2.86  # in billions`
* `total_2014 = 2.87  # in billions`
* `ans = total_2015 + total_2014  # in billions`
* Executed Answer: "5.73"
### Detailed Analysis
* **Passage Details:**
* Mentions Goldman Sachs Group, Inc. and subsidiaries.
* Refers to consolidated financial statements and commercial lending.
* States $2.86 billion as of December 2015.
* States $2.87 billion as of December 2014.
* Mentions "2021 - thereafter | 1,160" and "Total | $2,575".
* **Question Details:**
* Asks for the total commitments to invest in funds managed by the firm for 2015 and 2014, in billions.
* **Gold Program Details:**
* The program uses a simple addition function "add(2.86, 2.87)".
* The calculated answer is 5.73.
* **ZS-FinPYT Program Details:**
* The program assigns 2.86 to "total\_2015" and 2.87 to "total\_2014".
* It then adds these two variables to calculate "ans".
* The executed answer is 5.73.
### Key Observations
* Both the Gold Program and the ZS-FinPYT Program arrive at the same answer: 5.73.
* The values used in both programs (2.86 and 2.87) are directly extracted from the "Passage" section, corresponding to the amounts for December 2015 and December 2014, respectively.
* The question asks for the total of these two amounts.
### Interpretation
The image presents a problem-solving scenario where the task is to calculate the total commitments to invest in funds managed by Goldman Sachs for 2015 and 2014. The passage provides the necessary data, and two different programs (Gold Program and ZS-FinPYT Program) independently calculate the correct answer, which is 5.73 billion dollars. The ZS-FinPYT program provides a more detailed breakdown of the calculation steps, while the Gold Program uses a more concise function. The consistency in the answers from both programs reinforces the accuracy of the calculation.
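Running a generated snippet like the one above with a program interpreter can be sketched using Python's `exec`. The isolated-namespace pattern and the convention that the snippet leaves its result in `ans` are assumptions for illustration (a production system would additionally sandbox untrusted generated code):

```python
# Program text as generated by the LLM (copied from the example above).
generated = """
total_2015 = 2.86  # in billions
total_2014 = 2.87  # in billions
ans = total_2015 + total_2014  # in billions
"""

namespace = {}
exec(generated, namespace)  # run the generated snippet in an isolated namespace
print(round(namespace["ans"], 2))  # 5.73
```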
</details>
Figure 14: An example where ZS-STD's LLM answering prompt generates one value response.
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Text Analysis: Financial Data and Question Answering
### Overview
The image presents financial data related to crude oil and natural gas prices for the years 2011, 2012, and 2013. It also includes a question about the percentage increase in the average price of WTI crude oil from 2011 to 2013, along with a "Gold Program" solution and the response from a "ZS-STD LLM."
### Components/Axes
* **Title:** MRO/2013/page\_39.pdf-3
* **Passage:** A snippet from a management discussion and analysis of financial condition and results of operations.
* **Benchmark Table:**
* Columns: 2013, 2012, 2011
* Rows:
* WTI crude oil (Dollars per bbl)
* Brent (Europe) crude oil (Dollars per bbl)
* Henry Hub natural gas (Dollars per mmbtu)
* **Question:** "by what percentage did the average price of wti crude oil increase from 2011 to 2013?"
* **Gold Program:** The calculation used to determine the correct answer.
* **Gold Answer:** The correct answer to the question.
* **ZS-STD LLM Answering Prompt Response:** The answer provided by the LLM.
* **ZS-STD Extracted Answer:** The answer extracted by the LLM.
### Detailed Analysis
**Benchmark Table Data:**
| Benchmark | 2013 | 2012 | 2011 |
| ------------------------------ | ------- | ------- | ------- |
| WTI crude oil (Dollars per bbl) | $98.05 | $94.15 | $95.11 |
| Brent (Europe) crude oil (Dollars per bbl) | $108.64 | $111.65 | $111.26 |
| Henry Hub natural gas (Dollars per mmbtu) | $3.65 | $2.79 | $4.04 |
**Question and Answers:**
* **Question:** by what percentage did the average price of wti crude oil increase from 2011 to 2013?
* **Gold Program:** subtract(98.05, 95.11), divide(#0, 95.11)
* **Gold Answer:** 0.03091 (or 3.091%)
* **ZS-STD LLM Answering Prompt Response:** 3.9%
* **ZS-STD Extracted Answer:** 3.9
### Key Observations
* The price of WTI crude oil increased from 2012 to 2013, but decreased from 2011 to 2012.
* The price of Brent crude oil decreased from 2011 to 2013.
* The price of Henry Hub natural gas decreased from 2011 to 2012, and increased from 2012 to 2013.
* The LLM's answer (3.9%) is noticeably different from the "Gold Answer" (3.091%), an error of roughly 0.8 percentage points.
### Interpretation
The data provides a snapshot of crude oil and natural gas prices over three years. The question focuses on the percentage increase of WTI crude oil between 2011 and 2013. The "Gold Program" and "Gold Answer" represent the correct calculation and result, while the "ZS-STD LLM" responses indicate the performance of a language model in answering the question. The discrepancy between the "Gold Answer" and the LLM's answer suggests a potential area for improvement in the LLM's accuracy. The passage context suggests this data is part of a larger financial analysis.
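Such near-miss answers are typically scored with a numeric tolerance rather than string matching. A small sketch (the tolerance value and the percent-to-decimal normalization are illustrative assumptions, not the paper's exact metric):

```python
def is_correct(predicted: float, gold: float, tol: float = 0.005) -> bool:
    """Score a numeric prediction against the gold answer with an absolute
    tolerance, normalizing percentage-style answers to decimals first."""
    if abs(predicted) > 1 and abs(gold) <= 1:
        predicted /= 100  # e.g. 3.9 (%) -> 0.039
    return abs(predicted - gold) <= tol

print(is_correct(3.9, 0.03091))    # False: 0.039 is well off the gold 0.03091
print(is_correct(3.091, 0.03091))  # True
```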
</details>
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Text Extraction: Financial Data and Question Answering
### Overview
The image presents financial data for the years 2007, 2006, and 2005, including dividends paid and free cash flow in millions of dollars. It also includes a question about the change in free cash flow from 2005 to 2006, along with a "Gold Program" calculation and answer, and a "ZS-STD LLM Answering Prompt Response" and extracted answer.
### Components/Axes
* **Title:** UNP/2007/page\_25.pdf-4
* **Passage:** fuel prices - crude oil prices increased at a steady rate in 2007 ...
* **Data Table:**
* Row 1: Millions of Dollars | 2007 | 2006 | 2005
* Row 2: Dividends paid | (364) | (322) | (314)
* Row 3: Free cash flow | $487 | $516 | $234
* **Question:** what was change in millions of free cash flow from 2005 to 2006?
* **Gold Program:** subtract(516, 234)
* **Gold Answer:** 282.0
* **ZS-STD LLM Answering Prompt Response:** The change in millions of free cash flow from 2005 to 2006 was an increase of $282 million.
* **ZS-STD Extracted Answer:** 282.0
### Detailed Analysis
* **Dividends Paid (Millions of Dollars):**
* 2007: (364)
* 2006: (322)
* 2005: (314)
* **Free Cash Flow (Millions of Dollars):**
* 2007: $487
* 2006: $516
* 2005: $234
* **Question and Answer:** The question asks for the change in free cash flow from 2005 to 2006. The "Gold Program" calculates this as 516 - 234 = 282.0. The LLM response confirms this, stating an increase of $282 million.
### Key Observations
* Dividends paid are negative values, likely representing cash outflows.
* Free cash flow increased from 2005 to 2006 and decreased from 2006 to 2007.
* The LLM's answer matches the "Gold Answer," indicating a correct response.
### Interpretation
The data provides a snapshot of the company's financial performance over three years. The question focuses on the change in free cash flow, a key indicator of financial health. The consistency between the "Gold Program" calculation and the LLM's response suggests that the LLM correctly interpreted and answered the question based on the provided data. The passage at the beginning seems to be unrelated to the rest of the content.
</details>
Figure 15: An example where ZS-STD's LLM answering prompt generates one sentence response.
Figure 16: An example where answer extraction failed for ZS-STD.
<details>
<summary>Image 14 Details</summary>

### Visual Description
## Table: Hologic Stock Units Activity
### Overview
The image presents a table summarizing the activity of Hologic's restricted stock units. It includes data on non-vested shares, number of shares, and weighted-average grant-date fair value for different periods. Additionally, it contains a question related to the percentage increase in the average price of WTI crude oil from 2011 to 2013, along with a "Gold Program" calculation and a "ZS-STD LLM Answering Prompt Response".
### Components/Axes
The table has the following columns:
* **Non-vested Shares**: Describes the status of the shares (e.g., Non-vested, Granted, Vested, Forfeited).
* **Number of Shares**: The number of shares associated with each status.
* **Weighted-Average Grant-Date Fair Value**: The fair value of the shares at the grant date.
### Detailed Analysis
Here's a breakdown of the table data:
* **Non-vested at September 27, 2008**: 1,461 shares, \$31.23 fair value.
* **Granted**: 1,669 shares, \$14.46 fair value.
* **Vested**: (210) shares, \$23.87 fair value.
* **Forfeited**: (150) shares, \$23.44 fair value.
* **Non-vested at September 26, 2009**: 2,770 shares, \$21.96 fair value.
Additional information:
* **Question**: "by what percentage did the average price of wti crude oil increase from 2011 to 2013?"
* **Gold Program**: `multiply(2770, 21.96)`
* **Gold Answer**: 60829.2
* **ZS-STD LLM Answering Prompt Response**: "The total fair value of non-vested shares as of September 26, 2009 is \$59,812."
* **ZS-STD Extracted Answer**: float (the extraction step returned the type placeholder rather than a numeric value, i.e. extraction failed)
### Key Observations
* The table tracks changes in non-vested shares over time, including grants, vesting, and forfeitures.
* The "Gold Program" calculation seems to be related to the number of non-vested shares and their fair value as of September 26, 2009.
* The "ZS-STD LLM Answering Prompt Response" gives a total fair value of \$59,812 for the non-vested shares as of September 26, 2009, which differs from the "Gold Answer" of 60,829.2 (2,770 × \$21.96).
### Interpretation
The table provides a snapshot of the company's stock-based compensation program. It shows how the number of non-vested shares changes due to grants, vesting, and forfeitures. The question about WTI crude oil price increase seems unrelated to the stock unit data and may be part of a broader assessment. The "Gold Program" and "ZS-STD LLM Answering Prompt Response" suggest an attempt to calculate or verify the total value of non-vested shares, possibly for financial reporting or analysis purposes. The discrepancy between the "Gold Answer" (\$60,829.2) and the LLM's \$59,812 points to an arithmetic error by the model, since 2,770 × \$21.96 = \$60,829.20 exactly.
</details>
Figure 17: An example of arithmetic errors made by ZS-STD and ZS-CoT prompts.
<details>
<summary>Image 15 Details</summary>

### Visual Description
## Document Analysis: Financial Statement and Question Answering
### Overview
The image presents a financial statement excerpt, a question related to the percentage change in cash flows, and several attempts to answer the question using different methods (Gold Program, ZS-STD LLM, and ZS-CoT Reasoning). It compares the answers derived from these methods.
### Components/Axes
* **Header:** "FIS/2016/page\_45.pdf-3"
* **Passage:** A snippet from financial statements as of December 31, 2016, mentioning cash flows from operations for 2016, 2015, and 2014.
* **Table:**
* Columns: Type of Obligations, Total, Less than 1 Year, 1-3 Years, 3-5 Years, More than 5 Years
* Row: Total, with corresponding dollar values for each time period.
* **Question:** "What was the percentage change in cash flows from operations from 2015 to 2016?"
* **Gold Program:** Presents the formula used to calculate the answer.
* **Gold Answer:** The numerical answer derived from the Gold Program.
* **ZS-STD LLM Answering Prompt Response:** The response generated by the ZS-STD LLM, including a textual explanation and the extracted answer.
* **ZS-STD Extracted Answer:** The numerical answer extracted by the ZS-STD LLM.
* **ZS-CoT Reasoning Prompt Response:** The response generated by the ZS-CoT reasoning, including the reasoning steps and the final answer.
* **ZS-CoT Extracted Answer:** The numerical answer extracted by the ZS-CoT reasoning.
### Detailed Analysis
* **Passage:**
* Cash flows from operations: $1925 million (2016), $1131 million (2015), $1165 million (2014).
* **Table:**
* Total Obligations:
* Total: $14,429
* Less than 1 Year: $1,068
* 1-3 Years: $2,712
* 3-5 Years: $3,264
* More than 5 Years: $7,385
* **Question:** As stated above.
* **Gold Program:**
* Formula: subtract(1925, 1131), divide(#0, 1131)
* **Gold Answer:** 0.70203
* **ZS-STD LLM Answering Prompt Response:**
* Response: "The cash flows from operations increased by 794 million from 2015 to 2016, which is a 69.7% increase."
* Extracted Answer: 69.7
* **ZS-CoT Reasoning Prompt Response:**
* Response: "Cash flows from operations in 2015 = $1131 million, Cash flows from operations in 2016 = $1925 million. Therefore, the percentage change in cash flows from operations from 2015 to 2016 is (1925-1131)/1131 = 70.1%."
* Extracted Answer: 70.1
### Key Observations
* The Gold Program answer (0.70203) is a decimal, while the other answers are percentages. To compare, the Gold Program answer should be multiplied by 100, resulting in 70.203%.
* The ZS-STD LLM reports a 69.7% increase, although 794/1131 actually evaluates to about 70.2%, an arithmetic slip.
* The ZS-CoT reasoning states the correct formula, (1925 - 1131)/1131, but evaluates it as 70.1% rather than 70.203%.
* The difference between the ZS-STD LLM and ZS-CoT answers is 0.4 percentage points.
* The Gold Program answer (70.203%) is very close to the ZS-CoT answer (70.1%).
### Interpretation
The image demonstrates a comparison of different methods for answering a financial question. The ZS-CoT reasoning is the most transparent, since it includes the calculation steps, and its 70.1% is close to the Gold Program's 70.203%; the ZS-STD answer of 69.7% is further off. Both LLM responses nonetheless contain small arithmetic errors, which underscores the importance of calculation transparency in financial question answering and motivates delegating the final arithmetic to a program interpreter.
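The gold calculation is easy to verify directly, which makes both LLM deviations visible:

```python
# Gold program: subtract(1925, 1131), divide(#0, 1131)
gold = (1925 - 1131) / 1131
print(round(gold, 5))        # 0.70203 -- the Gold Program's answer
print(round(gold * 100, 1))  # 70.2 -- neither 69.7 (ZS-STD) nor 70.1 (ZS-CoT)
```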
</details>
<details>
<summary>Image 16 Details</summary>

### Visual Description
## Document Analysis: Financial Statement Analysis
### Overview
The image presents a financial statement analysis, specifically focusing on the allowance for doubtful accounts. It includes a passage from a financial statement, a question regarding the percentage decline in the allowance, and several attempted solutions using different methods (Gold Program, ZS-STD LLM, and ZS-CoT Reasoning).
### Components/Axes
* **Header:** "RSG/2018/page\_94.pdf-1"
* **Passage:** Excerpt from "republic services, inc. notes to consolidated financial statements (continued) high quality financial..."
* Data table with years 2018, 2017, and 2016.
* Balance at beginning of year: $38.9 (2018), $44.0 (2017), $46.7 (2016)
* Additions charged to expense: 34.8 (2018), 30.6 (2017), 20.4 (2016)
* Accounts written-off: (39.4) (2018), (35.7) (2017), (23.1) (2016)
* Balance at end of year: $34.3 (2018), $38.9 (2017), $44.0 (2016)
* **Question:** "as of december 31, 2018 what was the percentage decline in the allowance for doubtful accounts?"
* **Gold Program:**
* Program: subtract(34.3, 38.9), divide(#0, 38.9)
* Answer: -0.11825
* **ZS-STD LLM Answering Prompt Response:**
* Response: "The percentage decline in the allowance for doubtful accounts as of December 31, 2018 was 11.3% (34.3 - 38.9 / 38.9)."
* Extracted Answer: 11.3
* **ZS-CoT Reasoning Prompt Response:**
* Response: "The balance at the beginning of 2018 was $38.9 million. The balance at the end of 2018 was $34.3 million. Therefore, the percentage decline in the allowance for doubtful accounts is 12.2% ($38.9 - $34.3 / $38.9)."
* Extracted Answer: 12.2
### Detailed Analysis
The image presents a problem related to calculating the percentage decline in the allowance for doubtful accounts. It provides the relevant financial data for the years 2016, 2017, and 2018. The question specifically asks for the percentage decline as of December 31, 2018.
The image then shows three different approaches to solving the problem:
1. **Gold Program:** This approach uses a specific formula "subtract(34.3, 38.9), divide(#0, 38.9)" and arrives at an answer of -0.11825.
2. **ZS-STD LLM:** This approach provides a response stating the percentage decline as 11.3% and shows the calculation (34.3 - 38.9 / 38.9). The extracted answer is 11.3.
3. **ZS-CoT Reasoning:** This approach provides a reasoning prompt, stating the beginning and ending balances for 2018 and then calculates the percentage decline as 12.2% ($38.9 - $34.3 / $38.9). The extracted answer is 12.2.
### Key Observations
* The "Gold Program" result is a decimal (-0.11825), while the other two are percentages (11.3% and 12.2%).
* The ZS-STD LLM and ZS-CoT Reasoning approaches provide different answers (11.3% vs. 12.2%) despite using similar data.
* The provided calculation in the ZS-STD LLM response (34.3 - 38.9 / 38.9) does not directly translate to 11.3%. The calculation should be (38.9 - 34.3) / 38.9 = 0.11825, which is 11.825%.
* The provided calculation in the ZS-CoT Reasoning response ($38.9 - $34.3 / $38.9) also does not directly translate to 12.2%. The calculation should be (38.9 - 34.3) / 38.9 = 0.11825, which is 11.825%.
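The correct computation, following the Gold Program's subtract-then-divide steps, can be sketched in Python (variable names here are illustrative):

```python
# Allowance for doubtful accounts (in $ millions), from the passage.
begin_2018 = 38.9  # balance at beginning of 2018
end_2018 = 34.3    # balance at end of 2018

# Percentage decline: (end - begin) / begin, matching the Gold Program
# subtract(34.3, 38.9), divide(#0, 38.9).
decline = (end_2018 - begin_2018) / begin_2018
print(round(decline, 5))  # -0.11825, i.e. a decline of about 11.825%
```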
### Interpretation
The image highlights the challenge of accurately calculating financial metrics, even from seemingly straightforward data. The Gold Program returns a decimal (-0.11825), which corresponds to a decline of 11.825% once converted to a percentage. The ZS-STD LLM and ZS-CoT approaches give different answers (11.3% and 12.2%), and in both cases the stated calculation does not match the extracted answer, indicating arithmetic errors in the responses. The correct percentage decline, based on the provided data, is approximately 11.825%.
</details>
Figure 18: An example of arithmetic errors made by ZS-STD and ZS-CoT prompts.
Figure 19: An example where program extraction failed for ZS-FinDSL prompt.
<details>
<summary>Image 17 Details</summary>

### Visual Description
## Document Analysis: Financial Data and Reasoning
### Overview
The image presents a financial passage, a question related to the passage, and solutions generated by a "Gold Program" and a "ZS-FinDSL" program. It involves calculating the change in free cash flow between 2005 and 2006. The image also includes the reasoning and the program structure used by the ZS-FinDSL.
### Components/Axes
* **Header:** "UNP/2007/page\_25.pdf-4"
* **Passage:** Contains financial data for the years 2007, 2006, and 2005, including dividends paid and free cash flow in millions of dollars.
* **Question:** Asks for the change in millions of free cash flow from 2005 to 2006.
* **Gold Program:** Presents a simple subtraction operation and the correct answer.
* **ZS-FinDSL Reasoning:** Explains the calculation of the change in free cash flow.
* **ZS-FinDSL Program:** Shows the program structure used to arrive at the answer.
* **ZS-FinDSL Executed Answer:** Presents the final answer calculated by the ZS-FinDSL program.
### Detailed Analysis
**Passage Data:**
| Category | 2007 | 2006 | 2005 |
| ----------------- | ----- | ----- | ----- |
| Dividends paid | (364) | (322) | (314) |
| Free cash flow | $487 | $516 | $234 |
**Question:**
* What was the change in millions of free cash flow from 2005 to 2006?
**Gold Program:**
* Operation: subtract(516, 234)
* Answer: 282.0
**ZS-FinDSL Reasoning:**
* The change in millions of free cash flow from 2005 to 2006 was an increase of $282 million ($516 million - $234 million = $282 million).
**ZS-FinDSL Program:**
```json
{
  "Program": {
    "#0": {
      "operation": "subtract",
      "arg1": "516",
      "arg2": "234"
    },
    "#1": {
      "operation": "add",
      "arg1": "#0",
      "arg2": "282"
    },
    "Answer": "282"
  }
}
```
**ZS-FinDSL Executed Answer:**
* 564
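A minimal interpreter for this style of program, sketched under the assumption that steps execute in order, `#k` arguments refer to earlier results, and the value of the last step is returned (the `execute` helper and `OPS` table are hypothetical, not the paper's implementation), reproduces the erroneous 564:

```python
# Hypothetical operator table for a small arithmetic DSL.
OPS = {"subtract": lambda a, b: a - b, "add": lambda a, b: a + b}

def execute(program: dict) -> float:
    """Evaluate '#k' steps in order; '#k' arguments refer to earlier results."""
    results, last = {}, None
    for key, step in program.items():
        if not key.startswith("#"):
            continue  # skip non-step entries such as "Answer"
        args = [results[a] if a.startswith("#") else float(a)
                for a in (step["arg1"], step["arg2"])]
        results[key] = OPS[step["operation"]](*args)
        last = key
    return results[last]  # value of the final step

# The program generated by ZS-FinDSL for this example.
prog = {
    "#0": {"operation": "subtract", "arg1": "516", "arg2": "234"},
    "#1": {"operation": "add", "arg1": "#0", "arg2": "282"},
    "Answer": "282",
}
print(execute(prog))  # 564.0: the spurious add step doubles the correct 282
```

Executing only step #0 would give the correct 282.0; the unnecessary #1 step is what produces 564.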
### Key Observations
* The "Gold Program" provides the correct answer (282.0) using a simple subtraction.
* The "ZS-FinDSL Reasoning" correctly explains the calculation and arrives at the same correct answer ($282 million).
* The "ZS-FinDSL Program" outlines the steps taken by the ZS-FinDSL to arrive at the answer.
* The "ZS-FinDSL Executed Answer" is incorrect (564), indicating a potential error in the program's execution or logic.
### Interpretation
The image demonstrates a scenario where a financial question is posed and two different methods are used to solve it. The Gold Program represents a straightforward, correct solution. The ZS-FinDSL reasoning is correct, but the generated program appends an unnecessary step that adds the subtraction result (#0) to 282, doubling the correct value and producing the incorrect executed answer of 564. The discrepancy between sound reasoning and a faulty generated program highlights the importance of verifying the output of automated systems, even when the underlying logic appears correct.
</details>
Figure 20: An example where program extraction failed for ZS-FinDSL prompt.
<details>
<summary>Image 18 Details</summary>

### Visual Description
## Document Analysis: Tax Liability and Unrecognized Tax Benefits
### Overview
The image presents a snippet from a document, likely related to financial reporting and tax liability analysis. It includes a passage summarizing changes in unrecognized tax benefits, a question regarding the relationship between amounts decreasing the effective tax rate and goodwill, and a comparison of a "Gold Program" answer with a "ZS-FinDSL" reasoning and program execution.
### Components/Axes
* **Header:** "ADBE/2008/page\_89.pdf-4"
* **Passage:** A summary of financial changes during fiscal year 2008 related to unrecognized tax benefits.
* **Question:** A query about the relative amounts of unrecognized tax benefits affecting the effective tax rate and goodwill.
* **Gold Program:** A program that compares 57.7 and 68.9.
* **Gold Answer:** The answer provided by the "Gold Program," which is "no."
* **ZS-FinDSL Reasoning:** The reasoning provided by the ZS-FinDSL system.
* **ZS-FinDSL Program:** The program code used by the ZS-FinDSL system.
* **ZS-FinDSL Executed Answer:** The numerical result of the ZS-FinDSL program execution, which is 11.2.
### Detailed Analysis
* **Passage:**
* Summary of fin 48 changes during fiscal 2008.
* Aggregate changes in the total gross amount of unrecognized tax.
* $57.7 million of unrecognized tax benefits would decrease the effective tax rate.
* $68.9 million would decrease goodwill.
* Beginning balance as of December 1, 2007: $201,808
* Settlements with taxing authorities: (81,213)
* Lapse of statute of limitations: (3,512)
* **Question:**
* Asks if the amount of unrecognized tax benefits that would decrease the effective tax rate is greater than the amount that would decrease goodwill.
* **Gold Program:**
* `greater(57.7, 68.9)`
* **Gold Answer:**
* `no`
* **ZS-FinDSL Reasoning:**
* States that $57.7 million would decrease the effective tax rate.
* States that $68.9 million would decrease goodwill.
* Concludes that the amount decreasing the effective tax rate is less than the amount decreasing goodwill.
* **ZS-FinDSL Program:**
* `{"Program": ["#0": {operation:"subtract", arg1:"68.9", arg2:"57.7"}, "Answer": "False"}`
* **ZS-FinDSL Executed Answer:**
* `11.2`
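A minimal Python sketch of the comparison the Gold Program performs (the `greater` helper is illustrative, assuming the operator returns "yes"/"no" as in the gold answer):

```python
def greater(a: float, b: float) -> str:
    """Boolean comparison operator in the style of the Gold Program."""
    return "yes" if a > b else "no"

# Is the amount decreasing the effective tax rate ($57.7M) greater than
# the amount decreasing goodwill ($68.9M)?
print(greater(57.7, 68.9))  # no
```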
### Key Observations
* The passage provides financial data related to unrecognized tax benefits and their impact on the effective tax rate and goodwill.
* The question seeks to compare the magnitudes of these impacts.
* The "Gold Program" directly compares the two amounts (57.7 and 68.9) and returns "no," indicating that 57.7 is not greater than 68.9.
* The "ZS-FinDSL" system provides reasoning consistent with the "Gold Program" and executes a subtraction operation (68.9 - 57.7), resulting in 11.2.
* The ZS-FinDSL program also returns "False" as an answer, which is not consistent with the numerical result of 11.2.
### Interpretation
The document snippet presents a scenario involving the analysis of unrecognized tax benefits and their impact on financial statements. The Gold Program answers the question with a direct comparison of the two amounts. The ZS-FinDSL reasoning is consistent with the Gold Program, but the generated program is malformed JSON and encodes a subtraction rather than a comparison, so the executed answer of 11.2 is inconsistent with the program's own stated answer of "False". This mismatch suggests an error in how the ZS-FinDSL program was generated or parsed.
The data suggests that the amount decreasing goodwill ($68.9 million) is greater than the amount decreasing the effective tax rate ($57.7 million). The ZS-FinDSL's subtraction confirms this difference numerically.
</details>