# CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning
**Authors**:
- Yang Liu (Sun Yat-sen University)
- Weixing Chen (Sun Yat-sen University)
- Guanbin Li (Sun Yat-sen University)
- Liang Lin (Sun Yat-sen University)
## Abstract
We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, and model generalization and robustness. These methods are implemented in PyTorch and run on NVIDIA GPUs. The toolbox provides not only training and inference code but also pretrained model weights. We believe this toolbox is by far the most complete visual-linguistic causal reasoning toolbox. We hope the toolbox and benchmark serve the growing research community as a flexible toolkit for re-implementing existing methods and developing new causal reasoning methods. Code and models are available at https://github.com/HCPLab-SYSU/CausalVLR. The project is under active development by the contributors of HCP-Lab (https://www.sysu-hcp.net), and we will keep this document updated.
## 1 Introduction
The emergence of vast amounts of heterogeneous multi-modal data, including images [2; 9], videos [13; 11; 10; 15; 28], language [8; 7; 25], audio [3], and multi-sensor [12; 14; 30; 18; 23; 5] data, has led to the application of large language models (LLMs) such as ChatGPT [22] and ChatGLM [29] to various vision and language tasks, with promising performance. However, current LLMs rely heavily on fitting extensive knowledge distributions and often capture spurious correlations across modalities. Consequently, they struggle to learn a reliable chain of thought (CoT) [24] that reflects the essential causal relations within multi-modal knowledge, which limits their generalization and cognitive abilities. Fortunately, causal inference [19; 20; 16; 17] provides a promising alternative for learning robust and reliable cross-modal models, owing to its ability to achieve robust representation and model learning with good cognitive ability. For a detailed review of causal inference and visual representation learning, please refer to our review paper [16].
Visual-linguistic reasoning endeavors to comprehend both visual and linguistic content while performing various reasoning tasks, such as visual question answering (VQA), visual dialog, image/video captioning, and medical report generation. However, to date, no comprehensive open-source framework has been available for causality-aware visual-linguistic reasoning. With the aim of offering a high-quality toolbox and a unified benchmark, we have developed CausalVLR: a PyTorch-based open-source toolbox and benchmark designed specifically for visual-linguistic causal reasoning. Figure 1 provides an overview of CausalVLR.
CausalVLR offers several key features: (1) Modular design: We decompose the visual-linguistic reasoning framework into different components, allowing for the easy construction of customized visual-linguistic reasoning frameworks by combining various modules. (2) Support for multiple frameworks out of the box: The toolbox provides support for popular and contemporary visual-linguistic reasoning frameworks. (3) High efficiency: All basic modules and operations are executed on GPUs to ensure optimal performance. (4) State of the art: The toolbox is derived from the codebase developed by the experienced HCP-Lab team, specializing in causal inference and visual-linguistic reasoning, and continuous improvements are made. In addition to introducing the codebase and benchmark results, we also share our experiences and best practices for visual-linguistic causal reasoning. Ablation experiments involving hyperparameters, architectures, and training strategies are conducted and discussed. Our aim is to contribute to future research and facilitate comparisons between different methods. The remaining sections are organized as follows: First, we present the various supported methods and highlight important features of CausalVLR. Next, we present the benchmark results. Finally, we showcase representative studies on selected baselines. Table 1 provides a summary of representative algorithms from the HCP-Lab team.
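To make the modular-design idea concrete, a customized reasoning pipeline can be assembled from interchangeable components. The sketch below uses generic stand-in classes (illustrative only, not CausalVLR's actual API) to show the composition pattern:

```python
class FeatureExtractor:
    """Stand-in visual/linguistic feature extractor (illustrative only)."""
    def __call__(self, inputs):
        return {"visual": inputs["video"], "linguistic": inputs["question"]}

class CausalIntervention:
    """Stand-in causal-intervention module (illustrative only)."""
    def __call__(self, feats):
        feats["deconfounded"] = True  # mark features as deconfounded
        return feats

class FusionHead:
    """Stand-in cross-modal fusion + answer head (illustrative only)."""
    def __call__(self, feats):
        return "answer" if feats.get("deconfounded") else "biased answer"

class Pipeline:
    """Chain arbitrary modules; swapping one module swaps a whole stage."""
    def __init__(self, *modules):
        self.modules = modules
    def __call__(self, inputs):
        out = inputs
        for m in self.modules:
            out = m(out)
        return out

vqa = Pipeline(FeatureExtractor(), CausalIntervention(), FusionHead())
print(vqa({"video": "clip.mp4", "question": "What caused the crash?"}))  # answer
```

Because each stage shares the same call interface, replacing, say, the intervention module with a different causal-discovery component requires no change to the rest of the pipeline.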
<details>
<summary>extracted/5291756/Fig1.png Details</summary>

### Visual Description
## Diagram: Visual-Linguistic Reasoning Pipeline
### Overview
This diagram illustrates a pipeline for visual-linguistic reasoning, outlining the key stages from feature extraction to benchmark evaluation. The pipeline is structured around a toolbox that feeds into various downstream tasks. The diagram uses boxes and arrows to represent processes and data flow.
### Components/Axes
The diagram is divided into several main sections: Feature Extraction, Causal Reasoning, Relation Modeling, Cross-modal Feature Fusion, Inference, and a "VQA" (Visual Question Answering) section containing downstream tasks and a benchmark.
* **Feature Extraction:** Inputs are "Images/Videos" and "Text/Language".
* **Causal Reasoning:** Contains "Causal Discovery", "Causal Intervention", and "Counterfactual Reasoning".
* **Relation Modeling:** Contains "Spatial-temporal Modeling" and "Linguistic Semantics Discovery".
* **Cross-modal Feature Fusion:** Contains "Visual-Linguistic Feature Fusion".
* **Inference:** Contains "Visual-Linguistic Reasoning".
* **VQA Section:** Contains "Image/Video Captioning", "Medical Report Generation", "Medical Diagnosis", "Model Generalization and Robustness", and "Embodied AI Tasks".
* **Benchmark:** "Visual-Linguistic Reasoning Datasets".
* **Toolbox:** A large, central box labeled "Toolbox".
* Arrows indicate the flow of information between components.
### Detailed Analysis or Content Details
The diagram shows a hierarchical flow of information.
1. **Feature Extraction:** Images/Videos and Text/Language are inputs to the system.
2. **Causal Reasoning:** The outputs of Feature Extraction feed into Causal Reasoning, which branches into Causal Discovery, Causal Intervention, and Counterfactual Reasoning.
3. **Relation Modeling:** Feature Extraction also feeds into Relation Modeling, which branches into Spatial-temporal Modeling and Linguistic Semantics Discovery.
4. **Cross-modal Feature Fusion:** The outputs of Causal Reasoning and Relation Modeling converge into Cross-modal Feature Fusion, specifically "Visual-Linguistic Feature Fusion".
5. **Inference:** The output of Cross-modal Feature Fusion feeds into Inference, resulting in "Visual-Linguistic Reasoning".
6. **VQA Section:** The output of the Toolbox (which encompasses the previous stages) feeds into a variety of downstream tasks within the VQA section:
* Image/Video Captioning
* Medical Report Generation
* Medical Diagnosis
* Model Generalization and Robustness
* Embodied AI Tasks
7. **Benchmark:** The VQA section is evaluated using "Visual-Linguistic Reasoning Datasets" as a benchmark.
The diagram uses dashed arrows to indicate the flow from the Toolbox to the downstream tasks and from the downstream tasks to the Benchmark. There is a circular arrow between the VQA section and the Benchmark, suggesting an iterative process of evaluation and improvement.
### Key Observations
The diagram emphasizes the importance of integrating causal reasoning and relation modeling for effective visual-linguistic understanding. The Toolbox acts as a central hub for processing and integrating information from different modalities. The VQA section highlights the diverse applications of this technology, including medical diagnosis and embodied AI. The iterative loop between the VQA section and the benchmark suggests a continuous cycle of development and evaluation.
### Interpretation
This diagram represents a conceptual framework for building intelligent systems that can reason about visual and linguistic information. The pipeline structure suggests a modular approach, where each stage performs a specific function. The inclusion of causal reasoning and relation modeling indicates a focus on developing systems that can go beyond simple pattern recognition and understand the underlying relationships between objects and events. The diverse range of downstream tasks in the VQA section demonstrates the potential of this technology to address a wide range of real-world problems. The diagram suggests that the quality of the "Toolbox" is critical to the performance of the entire system. The benchmark serves as a crucial component for evaluating and improving the system's capabilities. The diagram does not provide specific data or numerical values, but rather a high-level overview of the system architecture and data flow. It is a conceptual illustration of a research direction rather than a report of specific results.
</details>
Figure 1: Overview of CausalVLR. CausalVLR is a Python open-source framework for causal relation discovery and causal inference that implements state-of-the-art causal learning algorithms for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, medical diagnosis, model generalization and robustness, etc.
## 2 Algorithms
This section summarizes representative state-of-the-art (SOTA) algorithms for visual question answering (VQA) and medical report generation. All algorithms are implemented in PyTorch, and the CausalVLR library will be continuously updated in the coming years. Below, we give a concise introduction to each selected algorithm.
Table 1: Representative visual-linguistic causal reasoning algorithms in CausalVLR.
| Task | Algorithm | Highlight | Pytorch |
| --- | --- | --- | --- |
| Causal CoT | CausalGPT [21] | Causality-aware CoT | Yes |
| VQA | CMCIR [8; 6] | Causal front-door and back-door interventions | Yes |
| VQA | VCSR [25] | Visual causal scene discovery | Yes |
| Image Captioning | AAAI2023 [26] | Knowledge consensus | Yes |
| Medical Report Generation | VLCI [1] | Implicit visual causal intervention | Yes |
| Medical Diagnosis | TPAMI2023 [4] | Causality-aware medical diagnosis | Yes |
| Model Generalization and Robustness | CVPR2023 [27] | Counterfactual Fine-tuning | Yes |
### 2.1 CausalGPT
CaCo-CoT [21]. The framework of causal-consistency CoT (CaCo-CoT) is shown in Figure 2. The contribution of CaCo-CoT is described as follows.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Chain-of-Thought (CoT) Reasoning Frameworks
### Overview
The image presents a comparative diagram illustrating three different Chain-of-Thought (CoT) reasoning frameworks: CoT, Self-Consistent CoT, and Causal-Consistent CoT. The diagram visually outlines the stages and processes involved in each framework, from initial reasoning steps to final answer selection. It highlights the differences in how each approach handles reasoning consistency and evaluation.
### Components/Axes
The diagram is divided into three main sections, labeled (a), (b), and (c), representing the three CoT frameworks. Each section is further divided into "Reasoning Stage" and "Evaluation Stage". Key elements include:
* **Robot Icons:** Represent reasoning agents.
* **t<sub>1</sub>...t<sub>n</sub>:** Represent steps in the CoT process.
* **A<sub>1</sub>...A<sub>n</sub>, A<sup>1</sup><sub>j</sub>...A<sup>k</sup><sub>j</sub>:** Represent answers from reasoning agents.
* **S<sub>1</sub>...S<sub>n</sub>, S<sub>j</sub>, S<sup>k</sup><sub>j</sub>:** Represent reasoning statements.
* **Answer Pool A:** A collection of potential answers.
* **Checkmarks/X Marks:** Indicate evaluation outcomes.
* **Legend (Top-Right):** Defines the symbols used in the diagram:
* `t`: A step from the CoT
* `S<sub>j</sub>`: The reasoning statement from agent j
* `A<sup>i</sup><sub>j</sub>`: The answer from reasoner agent i
* `A<sup>k</sup><sub>j</sub>`: Counterfactual answer for agent j, where k ≠ j
* `A<sup>f</sup><sub>j</sub>`: Re-reasoning answer for agent j
### Detailed Analysis or Content Details
**(a) CoT:**
* The process begins with a series of reasoning steps (t<sub>1</sub>...t<sub>n</sub>) performed by a single robot icon.
* These steps generate an answer (Answer: A) directly.
* The diagram shows a single path from reasoning steps to the final answer.
**(b) Self-Consistent CoT:**
* Multiple reasoning steps (t<sub>1</sub>...t<sub>n</sub>) are performed by multiple robot icons.
* Each set of steps generates an answer (A<sub>1</sub>...A<sub>n</sub>).
* These answers are collected in an "Answer Pool A".
* A "Top-1 Selection" process chooses the final answer (Answer: A) from the pool.
**(c) Causal-Consistent CoT:**
* This framework is divided into "Reasoning Stage" and "Evaluation Stage".
* **Reasoning Stage:** Similar to (b), multiple reasoning steps (t<sub>1</sub>...t<sub>n</sub>) are performed by multiple robot icons, generating reasoning statements (S<sub>1</sub>...S<sub>n</sub>) and answers (A<sup>1</sup><sub>j</sub>...A<sup>k</sup><sub>j</sub>) which are collected in an "Answer Pool A".
* **Evaluation Stage:**
* "New Round Reasoning" is performed by a robot icon.
* "Statement Evaluation" assesses the reasoning statements (S<sub>j</sub>).
* "Counterfactual Evaluation" evaluates answers based on alternative scenarios (A<sup>k</sup><sub>j</sub>).
* "Rationale Reconsidering" refines the reasoning (S<sup>j</sup><sub>j</sub>).
* The final answer (Answer: A) is selected based on the evaluation results, indicated by checkmarks (positive evaluation) and X marks (negative evaluation).
### Key Observations
* The complexity increases from (a) to (c). CoT is the simplest, while Causal-Consistent CoT is the most elaborate.
* Self-Consistent CoT introduces the concept of an "Answer Pool" and selection, improving robustness.
* Causal-Consistent CoT adds a dedicated "Evaluation Stage" with multiple evaluation steps, aiming for more reliable and consistent reasoning.
* The use of checkmarks and X marks in (c) visually represents the evaluation process and the selection of the final answer.
### Interpretation
The diagram illustrates a progression in CoT reasoning frameworks, each building upon the previous one to address limitations and improve performance. The core idea is to move beyond single-path reasoning (CoT) to leverage multiple reasoning paths (Self-Consistent CoT) and then rigorously evaluate those paths to ensure consistency and reliability (Causal-Consistent CoT). The inclusion of counterfactual evaluation in Causal-Consistent CoT suggests an attempt to identify and mitigate potential biases or errors in the reasoning process. The diagram highlights the increasing sophistication of techniques aimed at making large language models more trustworthy and accurate in their reasoning abilities. The diagram does not provide any quantitative data, but rather a qualitative comparison of the frameworks. It is a conceptual illustration of the different approaches.
</details>
Figure 2: Comparison between existing approaches and causal-consistency CoT (CaCo-CoT). In CaCo-CoT, reasoner agents and an evaluator agent cooperate to facilitate a causally consistent reasoning process, thereby minimizing factual and inferential errors.
To ensure the faithfulness of a solution's reasoning by inspecting its causal consistency, an evaluator is introduced. The evaluator first examines the statements in the solution step by step while considering the whole reasoning structure, which follows a non-causal direction. For further causal-consistency assessment, it then applies a counterfactual answer to the original question and looks for any contradiction therein. Finally, it returns a revised answer after this reconsideration. The group of agents reaches a consensual answer when the most frequent solution is accepted by the evaluator. On science question answering and commonsense reasoning tasks, a series of experiments and comparisons demonstrates that our causal-consistency framework outperforms existing approaches.
The official codes and data are available at https://github.com/HCPLab-SYSU/CausalVLR.
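The cooperation protocol described above can be sketched in a few lines: reasoners propose answers, the majority candidate is accepted only if the evaluator judges it causally consistent, and otherwise a revised answer is selected. The agent functions below are toy stand-ins for LLM calls, not CaCo-CoT's actual implementation:

```python
from collections import Counter

def caco_cot_consensus(reasoners, evaluator, question):
    """Minimal sketch of the CaCo-CoT cooperation protocol."""
    # Reasoning stage: each reasoner agent proposes an answer.
    answers = [r(question) for r in reasoners]
    candidate, _ = Counter(answers).most_common(1)[0]
    # Evaluation stage: the evaluator checks statements and applies
    # a counterfactual consistency test (abstracted into one call here).
    if evaluator(question, candidate):
        return candidate  # consensus reached on the majority answer
    # Otherwise fall back to the best remaining answer after reconsideration.
    remaining = [a for a in answers if a != candidate]
    return Counter(remaining).most_common(1)[0][0] if remaining else candidate

# Toy agents (stand-ins for LLM calls).
reasoners = [lambda q: "B", lambda q: "B", lambda q: "A"]
evaluator = lambda q, a: a == "B"  # accepts the majority answer in this toy case
print(caco_cot_consensus(reasoners, evaluator, "Which option?"))  # B
```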
### 2.2 VQA
CMCIR [8]. The contributions of CMCIR are listed as follows:
- We propose a causality-aware event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR), to discover true causal structures via causal intervention on the integration of visual and linguistic modalities and achieve robust event-level visual question answering performance. To the best of our knowledge, we are the first to discover cross-modal causal structures for the event-level visual question answering task.
- We introduce a linguistic back-door causal intervention module guided by linguistic semantic relations to mitigate the spurious biases and uncover the causal dependencies within the linguistic modality. To disentangle the visual spurious correlations, we propose a Local-Global Causal Attention Module (LGCAM) that aggregates the local and global visual representations by front-door causal intervention.
- We construct a Spatial-Temporal Transformer (STT) that models the multi-modal co-occurrence interactions between the visual and linguistic knowledge, to discover the fine-grained interactions among linguistic semantics, spatial, and temporal representations.
- To adaptively fuse the causality-aware visual and linguistic features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that leverages the hierarchical linguistic semantic relations to learn the global semantic-aware visual-linguistic features.
- Extensive experiments on the SUTD-TrafficQA, TGIF-QA, MSVD-QA, and MSRVTT-QA datasets show the effectiveness of our CMCIR in discovering visual-linguistic causal structures and achieving promising event-level visual question answering performance.
The official codes and data are available at https://github.com/HCPLab-SYSU/CMCIR.
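For reference, the back-door and front-door interventions mentioned above instantiate the standard adjustment formulas from causal inference (generic notation, not CMCIR-specific). With $Z$ a set of variables satisfying the back-door criterion, and $M$ a mediator satisfying the front-door criterion:

```latex
% Back-door adjustment: block confounding paths by conditioning on Z
P(Y \mid do(X = x)) = \sum_{z} P(Y \mid x, z)\, P(z)

% Front-door adjustment: route the causal effect through mediator M
P(Y \mid do(X = x)) = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid x', m)\, P(x')
```

Intuitively, the linguistic back-door module conditions on semantic confounders within the language modality, while the LGCAM approximates the front-door sum by using local/global visual representations as the mediator.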
VCSR [25]. The contributions of VCSR are listed as follows:
- We propose the Visual Causal Scene Refinement (VCSR) framework, to explicitly discover true causal visual scenes from the perspective of causal front-door intervention. To the best of our knowledge, we are the first to discover visual causal scenes for video question answering.
- We build the Causal Scene Separator (CSS) module that learns to discover a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner.
- We introduce the Question-Guided Refiner (QGR) module that refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention.
The official codes and data are available at https://github.com/HCPLab-SYSU/CausalVLR.
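The scene-separation idea can be illustrated with a toy sketch: rank video frames by a question-relevance score and split them into a causal set and its complement. This is illustrative only; VCSR's CSS module learns these scores end-to-end and estimates the causal effect contrastively rather than using a fixed top-k split:

```python
def separate_scenes(frame_scores, top_k):
    """Toy causal/non-causal scene split by question-relevance score."""
    # Rank frame indices by descending relevance score.
    order = sorted(range(len(frame_scores)), key=lambda i: -frame_scores[i])
    causal = sorted(order[:top_k])       # most question-relevant frames
    non_causal = sorted(order[top_k:])   # the complement scene set
    return causal, non_causal

scores = [0.1, 0.9, 0.3, 0.8, 0.2]  # hypothetical per-frame relevance
print(separate_scenes(scores, 2))   # ([1, 3], [0, 2, 4])
```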
### 2.3 Medical Report Generation
VLCI [1]. The contributions of VLCI are listed as follows:
- To implicitly mitigate cross-modal confounders and discover the true cross-modal causality, we propose visual-linguistic causal front-door intervention modules VDM and LDM. The VDM aims to disentangle the region-based features from images in the encoder, and the LDM aims to eliminate the spurious correlations caused by the visual-linguistic embedding.
- To alleviate the problem of unpaired data when pre-training on visual-linguistic RRG data, we combine PLM and MIM for cross-modal pre-training in various data situations (e.g., unpaired data or a single modality), which is efficient and easy to implement.
- We propose a lightweight Visual-Linguistic Causal Intervention (VLCI) framework for RRG, which introduces mediators without additional knowledge, to implicitly deconfound the visual-linguistic confounder by causal front-door intervention. Experimental results show that VLCI achieves state-of-the-art performance on two datasets IU-Xray and MIMIC-CXR.
The official codes and data are available at https://github.com/WissingChen/VLCI.
## 3 Conclusions and future works
This paper presents CausalVLR, an open-source toolbox that offers a comprehensive collection of state-of-the-art methods for causal relation discovery and causal inference in visual-linguistic reasoning tasks. We highlight the contributions of representative state-of-the-art visual-linguistic causal reasoning methods. CausalVLR aims to enhance the development of causal inference in the visual-linguistic reasoning domain by providing easily accessible open-source implementations, benchmarking models, and computational resources. We plan to incorporate additional state-of-the-art algorithms and benchmarks in future updates.
## References
- [1] Weixing Chen, Yang Liu, Ce Wang, Guanbin Li, Jiarui Zhu, and Liang Lin. Visual-linguistic causal intervention for radiology report generation. arXiv preprint arXiv:2303.09117, 2023.
- [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [3] Haoyuan Lan, Yang Liu, and Liang Lin. Audio-visual contrastive learning for self-supervised action recognition. arXiv preprint arXiv:2204.13386, 2022.
- [4] Junfan Lin, Keze Wang, Ziliang Chen, Xiaodan Liang, and Liang Lin. Towards causality-aware inferring: A sequential discriminative approach for medical diagnosis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [5] Junfan Lin, Yuying Zhu, Lingbo Liu, Yang Liu, Guanbin Li, and Liang Lin. Denselight: Efficient control for large-scale traffic signals with dense feedback. arXiv preprint arXiv:2306.07553, 2023.
- [6] Yang Liu, Guanbin Li, and Liang Lin. Cross-modal causal relational reasoning for event-level visual question answering. arXiv preprint arXiv:2207.12647, 2022.
- [7] Yang Liu, Guanbin Li, and Liang Lin. Causality-aware visual scene discovery for cross-modal question reasoning. arXiv preprint arXiv:2304.08083, 2023.
- [8] Yang Liu, Guanbin Li, and Liang Lin. Cross-modal causal relational reasoning for event-level visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [9] Yang Liu, Jing Li, ZhaoYang Lu, Tao Yang, and ZiJian Liu. Combining multiple features for cross-domain face sketch recognition. In Biometric Recognition: 11th Chinese Conference, CCBR 2016, Chengdu, China, October 14-16, 2016, Proceedings 11, pages 139–146. Springer, 2016.
- [10] Yang Liu, Zhaoyang Lu, Jing Li, and Tao Yang. Hierarchically learned view-invariant representations for cross-view action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(8):2416–2430, 2018.
- [11] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao. Global temporal representation based cnns for infrared action recognition. IEEE Signal Processing Letters, 25(6):848–852, 2018.
- [12] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao. Deep image-to-video adaptation and fusion networks for action recognition. IEEE Transactions on Image Processing, 29:3168–3182, 2019.
- [13] Yang Liu, Zhaoyang Lu, Jing Li, Chao Yao, and Yanzi Deng. Transferable feature representation for visible-to-infrared cross-dataset human action recognition. Complexity, 2018:1–20, 2018.
- [14] Yang Liu, Keze Wang, Guanbin Li, and Liang Lin. Semantics-aware adaptive knowledge distillation for sensor-to-vision action recognition. IEEE Transactions on Image Processing, 30:5573–5588, 2021.
- [15] Yang Liu, Keze Wang, Lingbo Liu, Haoyuan Lan, and Liang Lin. Tcgl: Temporal contrastive graph for self-supervised video representation learning. IEEE Transactions on Image Processing, 31:1978–1993, 2022.
- [16] Yang Liu, Yu-Shen Wei, Hong Yan, Guan-Bin Li, and Liang Lin. Causal reasoning meets visual representation learning: A prospective study. Machine Intelligence Research, pages 1–27, 2022.
- [17] Yang Liu, Yushen Wei, Hong Yan, Guanbin Li, and Liang Lin. Causal reasoning with spatial-temporal representation learning: A prospective study. arXiv preprint arXiv:2204.12037, 2022.
- [18] Jianyuan Ni, Raunak Sarbajna, Yang Liu, Anne HH Ngu, and Yan Yan. Cross-modal knowledge distillation for vision-to-sensor action recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4448–4452. IEEE, 2022.
- [19] Judea Pearl. Causality. Cambridge university press, 2009.
- [20] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
- [21] Ziyi Tang, Ruilin Wang, Weixing Chen, Keze Wang, Yang Liu, Tianshui Chen, and Liang Lin. Towards causalgpt: A multi-agent approach for faithful knowledge reasoning via promoting causal consistency in llms. arXiv preprint arXiv:2308.11914, 2023.
- [22] Eva AM Van Dis, Johan Bollen, Willem Zuidema, Robert van Rooij, and Claudi L Bockting. Chatgpt: five priorities for research. Nature, 614(7947):224–226, 2023.
- [23] Kuo Wang, LingBo Liu, Yang Liu, GuanBin Li, Fan Zhou, and Liang Lin. Urban regional function guided traffic flow prediction. Information Sciences, 634:308–320, 2023.
- [24] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.
- [25] Yushen Wei, Yang Liu, Hong Yan, Guanbin Li, and Liang Lin. Visual causal scene refinement for video question answering. In Proceedings of the ACM International Conference on Multimedia (MM '23), pages 377–386, New York, NY, USA, 2023. Association for Computing Machinery.
- [26] Yang Wu, Pengxu Wei, and Liang Lin. Scene graph to image synthesis via knowledge consensus. In AAAI, 2023.
- [27] Yao Xiao, Ziyi Tang, Pengxu Wei, Cong Liu, and Liang Lin. Masked images are counterfactual samples for robust fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20301–20310, 2023.
- [28] Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, and Liang Lin. Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5606–5618, 2023.
- [29] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. ICLR, 2023.
- [30] Yuying Zhu, Yang Zhang, Lingbo Liu, Yang Liu, Guanbin Li, Mingzhi Mao, and Liang Lin. Hybrid-order representation learning for electricity theft detection. IEEE Transactions on Industrial Informatics, 19(2):1248–1259, 2022.