## The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning
Jack Hessel* 1 , Jena D. Hwang* 1 , Jae Sung Park 2 , Rowan Zellers 2 , Chandra Bhagavatula 1 , Anna Rohrbach 3 , Kate Saenko 4 , and Yejin Choi 1 , 2
1 Allen Institute for AI { jackh,jenah,chandrab } @allenai.org 2 Paul G. Allen School of Computer Science & Engineering, University of Washington { jspark96,rowanz,yejin } @cs.washington.edu 3 University of California, Berkeley anna.rohrbach@berkeley.edu 4 Boston University and MIT-IBM Watson AI saenko@bu.edu
Abstract. Humans have a remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can't help but draw probable inferences beyond the literal scene based on our everyday experience and knowledge about the world. For example, if we see a '20 mph' sign alongside a road, we might assume the street sits in a residential area (rather than on a highway), even if no houses are pictured. Can machines perform similar visual reasoning?
We present Sherlock , an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents. We adopt a free-viewing paradigm: participants first observe and identify salient clues within images (e.g., objects, actions) and then provide a plausible inference about the scene, given the clue. In total, we collect 363K (clue, inference) pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using our corpus, we test three complementary axes of abductive reasoning. We evaluate the capacity of models to: i) retrieve relevant inferences from a large candidate corpus; ii) localize evidence for inferences via bounding boxes; and iii) compare plausible inferences to match human judgments on a newly collected diagnostic corpus of 19K Likert-scale judgments. While we find that fine-tuning CLIP-RN50x64 with a multitask objective outperforms strong baselines, significant headroom exists between model performance and human agreement. Data, models, and a leaderboard are available at http://visualabduction.com/ .
You know my method.
It is founded upon the observation of trifles.
Fig. 1: We introduce Sherlock : a corpus of 363K commonsense inferences grounded in 103K images. Annotators highlight localized clues (color bubbles) and draw plausible abductive inferences about them (speech bubbles). Our models are able to predict localized inferences (top predictions are shown), but we quantify a large gap between machine performance and human agreement.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Screenshot: Traffic Accident Analysis Interface
### Overview
The image depicts a web-based interface analyzing a traffic accident scene. A photograph of a highway accident is annotated with color-coded text boxes, each highlighting specific visual clues and inferred conclusions. The interface appears to be part of an AI-driven system that extracts contextual information from visual data.
### Components/Axes
- **Image Elements**:
- A highway scene with a flipped semi-truck/trailer, emergency responders, and a police car.
- Annotations:
- **Yellow Box**: Highlights the overturned semi-truck and trailer.
- **Blue Box**: Marks patches of snow on the grassy roadside.
- **Green Box**: Focuses on a white license plate with red English-style numbers.
- **Text Boxes**:
- Positioned to the right of the image, each linked to a specific visual clue via arrows.
- Colors correspond to the annotated regions (yellow, blue, green).
### Detailed Analysis
1. **Yellow Text Box (Semi-Truck/Trailer)**:
- Label: "large semi truck and trailer on its side laying on a freeway"
- Inference: "There was a major accident that occurred minutes ago. The people are inspecting damage to the vehicles in the accident."
2. **Blue Text Box (Snow Patches)**:
- Label: "patches of snow spread throughout grass on the side of freeway"
- Inference: "Cold weather is causing hazardous conditions at this location. The roads are very icy."
3. **Green Text Box (License Plate)**:
- Label: "a white license plate with five red English style numbers displayed"
- Inference: "This accident happened in an English-speaking country. This is Ohio."
### Key Observations
- The interface uses color-coded annotations to link visual elements (vehicles, weather, license plates) to contextual inferences.
- The system identifies environmental factors (icy roads due to snow) and geographical context (Ohio, English-speaking country).
- No numerical data or charts are present; analysis relies on textual descriptions and spatial grounding.
### Interpretation
This interface demonstrates a system designed to infer critical details from accident scenes by analyzing visual clues. The yellow box emphasizes vehicle damage as the primary event, while the blue and green boxes contextualize the accident’s causes (weather) and location (Ohio). The absence of numerical data suggests the tool prioritizes qualitative analysis over quantitative metrics. The use of color-coding ensures clarity in associating observations with conclusions, though the system’s reliance on textual inference may lack the precision of sensor-based data.
</details>
## 1 Introduction
The process of making the most plausible inference in the face of incomplete information is called abductive reasoning [47], personified by the iconic visual inferences of the fictional detective Sherlock Holmes. 5 Upon viewing a scene, humans can quickly synthesize cues to arrive at abductive hypotheses that go beyond what is captured in the frame. Concrete cues are diverse: people take into account the emotion and mood of the agents, speculate about the rationale for the presence/absence of objects, and zero in on small, contextual details; all the while accounting for prior experiences and (potential mis)conceptions. 6 Fig. 1 illustrates: snow may imply dangerous road conditions, an Ohio license plate may suggest the location of the accident, and a blue sign may indicate this road is an interstate. Though not all details are equally important, certain salient details shape our abductive inferences about the scene as a whole [56]. This type of visual information is often left unstated.
We introduce Sherlock , a new dataset of 363K commonsense inferences grounded in 103K images. Sherlock makes explicit typically-unstated cognitive processes: each image is annotated with at least 3 inferences which pair depicted details (called clues) with commonsense conclusions that aim to go beyond what is literally pictured (called inferences). Sherlock is more diverse than many existing visual commonsense corpora like Visual Commonsense Reasoning [75]
5 While Holmes rarely makes mistakes, he frequently misidentifies his mostly abductive process of reasoning as 'deductive.' [39,8]
6 The correctness of abductive reasoning is certainly not guaranteed. Our goal is to study perception and reasoning without endorsing specific inferences (see § 3.1).
Table 1: Comparison between Sherlock and prior annotated corpora addressing visual abductive reasoning from static images. Sherlock showcases a unique data collection paradigm, leading to a rich variety of non-human centric (i.e., not solely grounded in human references) visual abductive inferences.
| Dataset | # Images | Format | bboxes? | free- viewing? | human- centric? |
|----------------------|------------|----------------|-----------|------------------|-------------------|
| VCR [75] | 110K | QA | ✓ | | ✓ |
| VisualCOMET [44] | 59K | If/Then KB | ✓ | | ✓ |
| Visual7W [79] | 47K | QA | ✓ | partial | |
| Visual Madlibs [72] | 11K | FiTB | ✓ | partial | ✓ |
| Abstract Scenes [65] | 4.3K | KB | | | |
| Why In Images [49] | 792 | KB | | | ✓ |
| BD2BB [48] | 3.2K | If/Then | | ✓ | ✓ |
| FVQA [66] | 2.2K | QA+KB | | | |
| OK-VQA [36] | 14K | QA | | ✓ | |
| KB-VQA [67] | 700 | QA | ✓ | | |
| Sherlock | 103K | clue/inference | ✓ | ✓ | |
and VisualCOMET [44], 7 due to its free-viewing data collection paradigm: we purposefully do not pre-specify the types of clues/inferences allowed, leaving it to humans to identify the most salient and informative elements and their implications. Other forms of free viewing, like image captioning, may not be enough: a typical caption for Fig. 1 may mention the accident and perhaps the snow, but smaller yet important details needed to comprehend the larger scene (like the blue freeway sign or the Ohio plates) may not be mentioned explicitly [5]. Dense captioning corpora [22] attempt to overcome this problem by highlighting all details, but they do so without accounting for which details are salient (and why).
Using our corpus, we propose three complementary tasks that evaluate different aspects of machine capacity for visual abductive reasoning:
1. Retrieval of Abductive Inferences: given an image+region, the algorithm scores a large set of candidate inferences and is rewarded for assigning a high score to the gold annotation.
2. Localization of Evidence: the algorithm selects a bounding box within the image that provides the best evidence for a given inference.
3. Comparison of Plausibility: the algorithm scores a small set of plausible inferences for a given image+region, and is rewarded for aligning its scores with human judgments over those sets.
In our setup, a single model undertakes all of these tasks: we ask algorithms to score the plausibility of an inference given an image and a bounding box contained within it. 8 We can directly compare models in their capacity to perform abductive reasoning, without relying on indirect generation evaluation metrics.
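Concretely, all three tasks reduce to ranking with a single plausibility scorer. The sketch below is illustrative only: the `score` callable, its signature, and the helper names are assumptions for exposition, not the paper's released evaluation code.

```python
# Sketch of the shared scoring interface underlying the three Sherlock tasks.
# A single hypothetical score(image, bbox, inference) -> float drives all of them.
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]
Score = Callable[[str, BBox, str], float]

def retrieval_rank(score: Score, image: str, bbox: BBox,
                   gold: str, candidates: List[str]) -> int:
    """Task 1 (retrieval): rank of the gold inference among candidates (1 = best)."""
    scores = {c: score(image, bbox, c) for c in candidates}
    ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
    return ranked.index(gold) + 1

def localize(score: Score, image: str, boxes: List[BBox], inference: str) -> BBox:
    """Task 2 (localization): pick the box that best supports the inference."""
    return max(boxes, key=lambda b: score(image, b, inference))
```

Task 3 (comparison) uses the same scorer: the model's scores over a small candidate set are compared against human Likert judgments.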
Model predicted inferences are given in Fig. 1. The model is a fine-tuned CLIP [51] augmented to allow bounding boxes as input, enabling users to specify particular regions for the model to make abductive inferences about. Our best model, a multitask version of CLIP RN50x64 , outperforms strong baselines like UNITER [9] and LXMERT [61] primarily because it pays specific attention to the
7 For instance, 94% of visual references in [75] are about depicted actors, and [44] even requires KB entries to explicitly regard people; see Fig. 2.
8 We reserve generative evaluations (e.g., BLEU/CIDEr) for future work: shortcuts (e.g., outputting the technically correct 'this is a photo' for all inputs) make generation evaluation difficult in the abductive setting (see § 6). Nonetheless, generative models can be evaluated in our setup; we experiment with one in § 5.1.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Screenshot: Visual Reasoning Task Interface
### Overview
The image depicts a visual reasoning task interface with a photograph on the left and structured text on the right. The photograph shows a bar scene with labeled visual elements ("Clue A" and "Clue B") and a question about a person's action. The right side contains a multiple-choice question, answer options, and an event description.
### Components/Axes
- **Left Panel (Photograph)**:
- **Scene**: A bar with patrons, a counter, and a cash register.
- **Labels**:
- **Clue A**: Green box highlighting a beer sign on the wall (text: "LITE").
- **Clue B**: Orange box highlighting USD currency on a pitcher.
- **Annotations**:
- "Person1" (pink box) and "Person5" (pink box) identify individuals.
- Textual hints:
- "CLUE A: a beer sign on the wall → this is the USA"
- "CLUE B: USD hanging on a pitcher → alcohol is served here"
- **Right Panel (Textual Reasoning)**:
- **Question**: "What is Person1 doing?"
- **Answer Options**:
1. He is dancing.
2. He is giving a speech.
3. Person1 is getting his medicine.
4. He is ordering a drink from Person5.
- **Event Description**:
- "Event: Person5 mans the register and takes order"
- "Before Person5 needed to... write down orders"
- "Because Person5 wanted to... have everyone pay for their orders"
### Detailed Analysis
- **Photograph Elements**:
- **Clue A** (green box): Positioned on the wall, labeled "LITE" (likely a beer brand).
- **Clue B** (orange box): Located on a pitcher, labeled "USD" (U.S. Dollar).
- **Person1** (pink box): Standing with arms crossed, facing the counter.
- **Person5** (pink box): Behind the counter, near the cash register.
- **Textual Content**:
- **Question**: Directly asks about Person1's action.
- **Options**: Four plausible actions, with Option 4 being the correct answer (highlighted in pink).
- **Event Context**: Explains Person5's role in taking orders and writing them down to ensure payment.
### Key Observations
1. **Correct Answer**: Option 4 ("ordering a drink from Person5") aligns with the event description.
2. **Clue Integration**:
- Clue A (USA beer sign) and Clue B (USD) contextualize the setting as a U.S. bar where alcohol is served.
- Person5's role as a cashier/order taker supports the conclusion that Person1 is ordering a drink.
3. **Visual-Textual Link**: The pink boxes (Person1/Person5) and colored clue boxes guide the reasoning process.
### Interpretation
This task tests the ability to integrate visual and textual information to infer actions in a scene. The clues (beer sign, USD) establish the environment, while the event description provides explicit context for Person5's role. The correct answer (Option 4) relies on connecting Person1's position (at the counter) with Person5's role (order taker). The interface design uses color-coded boxes to emphasize key elements, aiding in spatial grounding and logical deduction.
**Note**: No numerical data or charts are present; the task focuses on qualitative reasoning.
</details>
Fig. 2: Side-by-side comparison of VCR [75], VisualCOMET [44], and Sherlock on a representative instance. Sherlock showcases a wider range of (non-human centric) situational contexts.
correct input bounding box. We additionally show that 1) for all tasks, reasoning about the full context of the image (rather than just the region corresponding to the clue) results in the best performance; 2) a text-only model cannot solve the comparison task even when given oracle region descriptions; and 3) a multi-task model fit on both clues/inferences at training time performs best even when only inferences are available at test time.
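One simple way to condition an image encoder like CLIP on a region, sketched below, is to render the bounding box directly onto the pixels before encoding. This is a minimal illustration assuming a pixel-level highlight; the highlight color, line width, and function name are illustrative choices, not the paper's exact recipe.

```python
# Minimal sketch: make a region visible to a pixel-only image encoder by
# drawing the bounding box onto a copy of the image before encoding.
from PIL import Image, ImageDraw

def highlight_region(img: Image.Image, bbox, color=(255, 0, 255), width=4):
    """Return a copy of `img` with `bbox` = (x1, y1, x2, y2) outlined.

    The highlighted copy can then be passed to the image encoder unchanged,
    so the region is conveyed without modifying the model architecture.
    """
    out = img.copy()
    ImageDraw.Draw(out).rectangle(bbox, outline=color, width=width)
    return out
```

A design note: because the region is encoded in pixel space, the same pretrained visual backbone can be fine-tuned without adding region-embedding parameters.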
We foresee Sherlock as a difficult diagnostic benchmark for vision-and-language models. On our comparison task, in terms of pairwise accuracy, our best model lags significantly behind human agreement (headroom also exists for retrieval and localization). We release code, data, and models at http://visualabduction.com/ .
## 2 Related Work
Abductive reasoning. Abduction, a form of everyday reasoning first framed by Peirce [46,47], involves the creation of explanatory hypotheses based on limited evidence. Humans use abduction to reconcile seemingly disconnected observations and arrive at meaningful conclusions [56], but readily retract them in the presence of new evidence [1]. In linguistics, abduction for communicated meaning (in an impoverished conversational context) is systematized through conversational maxims [15]. In images, [5] show that different object types have different likelihoods of being mentioned in image captions (e.g., 'fireworks' is always mentioned if depicted, but 'fabric' is not), but that object type alone does not dictate salience for abductive inferences: e.g., a TV in a living room may not be as conceptually salient as a TV in a bar, which may signal a particular type of bar. Abductive reasoning has recently received attention in language processing tasks [6,50,11,45], proof writing [60], and discourse processing [17,42], among others.
Beyond visual recognition. Several tasks that go beyond image description/recognition have been proposed, including visual and analogical reasoning [43,77,21,3], scene semantics [23], commonsense interactions [65,49], temporal/causal reasoning [26,71], and perceived importance [5]. Others have explored commonsense reasoning tasks posed over videos, which usually have more input available than a single frame [63,20,31,74,13,32,78,12,34,19] (inter alia).
Visual abductive reasoning. Sherlock builds upon prior grounded visual abductive reasoning efforts (Table 1). Corpora like Visual Commonsense Reasoning (VCR) [75], VisualCOMET [44], and Visual7W [79] are most similar to Sherlock in providing benchmarks for rationale-based inferences (i.e., the why and how). But Sherlock differs in format and content (Fig. 2). Instead of annotated QA pairs as in [79,75], where one option is definitively correct, free-text clue/inference pairs allow for broader types of image descriptions, lending themselves to softer and richer notions of reasoning (see § 4): inferences are not definitively correct vs. incorrect; rather, they span a range of plausibility. Deviating from the constrained, human-centric annotation of [44], Sherlock clue/inference pairs support a broader range of topics via our open-ended annotation paradigm (see § 3). Sherlock 's inferences can be grounded on any number of visual objects in an image, from figures central to the image (e.g., persons, animals, objects) to background cues (e.g., time, location, circumstances).
## 3 Sherlock Corpus
The Sherlock corpus contains a total of 363K abductive commonsense inferences grounded in 81K Visual Genome [29] images (photographs from Flickr) and 22K Visual Commonsense Reasoning (VCR) [75] images (still-frames from movies). Images have an average of 3.5 observation pairs , each consisting of:
- clue : an observable entity or object in the image, along with bounding box(es) specifying it (e.g., 'people wearing nametags').
- inference : an abductive inference associated with the clue; not immediately obvious from the image content (e.g., 'the people don't know each other').
Both clues and inferences are represented as free text in English; both have an average length of seven tokens, and each clue has a mean/median of 1.17/1.0 bounding boxes. We divide the 103K annotated images into training/validation/test sets of 90K/6.6K/6.6K. Further details are available in § A.
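For concreteness, an observation pair might be represented as follows. The field names and example values here are hypothetical, chosen to mirror the description above, not the dataset's released schema.

```python
# Illustrative (hypothetical) representation of one Sherlock observation pair.
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class ObservationPair:
    image_id: str
    clue: str            # literal, grounded description of something depicted
    inference: str       # abductive conclusion that goes beyond the pixels
    bboxes: List[BBox]   # one or more regions grounding the clue
    confidence: float    # annotator Likert rating in {1/3, 2/3, 3/3}

pair = ObservationPair(
    image_id="example_image",
    clue="people wearing nametags",
    inference="the people don't know each other",
    bboxes=[(120, 40, 260, 210)],
    confidence=2 / 3,  # 'likely'
)
```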
Annotation process. We crowdsource our dataset via Amazon Mechanical Turk (MTurk). For each data collection HIT, a manually qualified worker is given an image and prompted for 3 to 5 observation pairs . For each observation pair , the worker is asked to write a clue, highlight the regions in the image corresponding to the clue, and write an inference triggered by the clue. To discourage purely deductive reasoning, the workers are actively encouraged to think beyond the literally depicted scene, while working within real-world expectations. Crowdworkers also self-report Likert ratings of confidence in the correctness of their abductive inferences along a scale of 'definitely' = 3/3, 'likely' = 2/3, and 'possibly' = 1/3. The resulting inferences span this range (31%, 51%, 18%, respectively). To validate corpus quality, we run a validation round for 17K observation pairs in which crowdworkers provide ratings for acceptability (is the annotation reasonable?), bboxes (are the boxes reasonably placed for the clue?), and interestingness (how interesting is the annotation?). We find that 97.5% of the observation pairs are acceptable with 98.3% accurate box placement; and 71.9% of inferences are found to be interesting.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Sankey Diagram: Topic Interconnections in Visual Media
### Overview
The image is a Sankey diagram illustrating the relationships between "Clue Topics" (left) and "Inference Topics" (right), with colored arrows representing the strength of connections. Percentages indicate the proportion of connections between topics. The diagram emphasizes how topics overlap or influence one another, with thicker arrows denoting stronger associations.
### Components/Axes
- **Left Axis (Clue Topics)**:
- Labels: "eating & dining" (11%), "nature scenes" (7%), "everyday outdoor scenes" (10%), "environment & landscape" (6%), "gatherings" (8%), "signs & writings" (7%), "everyday objects" (16%), "attire" (11%), "actions & activities" (15%), "vehicles & traffic" (9%).
- Percentages are approximate and represent the relative frequency of each clue topic.
- **Right Axis (Inference Topics)**:
- Labels: "eating & dining" (11%), "time and weather" (12%), "nature & animals" (8%), "everyday scenes" (15%), "object & categorization" (17%), "occasions & events" (11%), "persons & characterization" (19%), "vehicles & travel" (6%).
- **Arrows**:
- Colored lines connect clue topics to inference topics.
- Thickness of arrows correlates with the strength of the connection (e.g., thicker arrows = higher percentage).
- **Legend**:
- Located on the right side, matching colors to inference topics (e.g., orange for "eating & dining," green for "nature & animals").
### Detailed Analysis
- **Clue Topics**:
- "everyday objects" (16%) has the highest frequency, with arrows connecting to "object & categorization" (17%) and "everyday scenes" (15%).
- "actions & activities" (15%) links to "persons & characterization" (19%) and "occasions & events" (11%).
- "eating & dining" (11%) connects to "eating & dining" (11%) and "time and weather" (12%).
- **Inference Topics**:
- "object & categorization" (17%) and "persons & characterization" (19%) are the most frequently inferred topics.
- "time and weather" (12%) and "everyday scenes" (15%) show moderate connections.
- **Flow Patterns**:
- Arrows from "everyday objects" to "object & categorization" are the thickest, indicating a strong association.
- "actions & activities" (15%) has a significant flow to "persons & characterization" (19%), suggesting a thematic link between activities and character analysis.
- "vehicles & traffic" (9%) has minimal connections, with only a small arrow to "vehicles & travel" (6%).
### Key Observations
1. **Dominant Connections**:
- "everyday objects" and "actions & activities" are central hubs, with strong ties to inference topics.
- "persons & characterization" (19%) is the most frequently inferred topic, likely due to its broad applicability.
2. **Weak Connections**:
- "vehicles & traffic" (9%) and "environment & landscape" (6%) have sparse connections, suggesting they are less central to the topic network.
3. **Overlap**:
- Some topics (e.g., "eating & dining") appear in both clue and inference categories, indicating self-referential relationships.
### Interpretation
The diagram highlights how visual media topics are interconnected, with "everyday objects" and "actions & activities" serving as key nodes. The strong link between "actions & activities" and "persons & characterization" suggests that human behavior and traits are frequently inferred from visual contexts. The sparse connections for "vehicles & traffic" and "environment & landscape" may indicate these topics are less commonly analyzed in isolation. The diagram underscores the importance of contextual relationships in visual media analysis, where topics often overlap and influence one another.
**Note**: Percentages are approximate, and the diagram’s color coding (e.g., orange for "eating & dining") was cross-verified with the legend to ensure accuracy. The spatial layout places clue topics on the left and inference topics on the right, with arrows flowing between them.
</details>
## 3.1 Dataset Exploration
Sherlock 's abductive inferences cover a wide variety of real-world experiences, from observations about unseen yet probable details of the image (e.g., 'smoke at an outdoor gathering' → 'something is being grilled') to elaborations on the expected social context (e.g., 'people wearing nametags' → '[they] don't know each other'). Some inferences are highly likely to be true (e.g., 'wet pavement' → 'it has rained recently'); others are less definitively verifiable, but nonetheless plausible (e.g., 'large trash containers' → 'there is a business nearby'). Even the inferences crowdworkers mark as 3/3 confident are almost always abductive: e.g., wet pavement strongly, but not always, indicates rain. Through a rich array of natural observations, Sherlock provides a tangible view into the abductive inferences people use on an everyday basis (more examples in Fig. 14).
Assessing topic diversity. To gauge the diversity of objects and situations represented in Sherlock , we run an LDA topic model [7] over the observation pairs . The topics span a range of common everyday objects, entities, and situations (Fig. 3). Inference topics associated with the clues include within-category associations (e.g., 'baked potatoes on a ceramic plate' → 'this [is] a side dish') and cross-category associations (e.g., 'a nametag' (attire) → 'she works here' (characterization)). Many topics are not human-centric: whereas 94%/100% of grounded references in VCR/VisualCOMET are to people, a manual analysis of 150 clues reveals that only 36% of Sherlock observation pairs are grounded on people.
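The clue-topic to inference-topic flows visualized in Fig. 3 amount to a co-occurrence tally, which can be sketched as below. The topic labels in the example are toy hand-assigned stand-ins; in the paper, topics are derived automatically with LDA.

```python
# Sketch: tally (clue_topic, inference_topic) flows as percentages of all
# observation pairs, the quantity a Sankey diagram like Fig. 3 visualizes.
from collections import Counter

def topic_flows(pairs):
    """Map each (clue_topic, inference_topic) flow to its share (in %) of all pairs."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {flow: 100.0 * n / total for flow, n in counts.items()}

# Toy example with hand-assigned topic labels:
flows = topic_flows([
    ("attire", "persons & characterization"),
    ("attire", "persons & characterization"),
    ("everyday objects", "object & categorization"),
    ("signs & writings", "occasions & events"),
])
```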
Intended use cases. We manually examine 250 randomly sampled observation pairs to better understand how annotators referenced protected characteristics (e.g., gender, color, nationality). A majority of inferences (243/250) are not directly about protected characteristics, though a perceived gender is often made explicit via pronoun usage, e.g., 'she is running.' As an additional check, we pass 30K samples of our corpus through the Perspective API. 9 A manual examination of 150 cases marked as 'most toxic' reveals mostly false positives (89%), though 11% of this sample do contain lewd content (mostly prompted by
9 https://www.perspectiveapi.com/ ; November 2021 version. The API (which itself is imperfect and has biases [18,38,55]) assigns toxicity value 0-1 for a given input text. Toxicity is defined as 'a rude, disrespectful, or unreasonable comment that is likely to make one leave a discussion.'
Fig. 3: Overview of the topics represented in the clues and inferences in Sherlock . This analysis shows that Sherlock covers a variety of topics commonly accessible in the natural world. Color of the connections reflect the clue topic.
<details>
<summary>Image 4 Details</summary>

### Visual Description
Icon/Small Image (24x26)
</details>
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Screenshot: Textual Analysis with Visual Context
### Overview
The image is a screenshot featuring two primary components:
1. A **photograph** of an outdoor scene with a **green bounding box** highlighting a person.
2. A **text box** containing nine statements, some crossed out, with one sentence highlighted and underlined.
A **robot icon** with a question mark is positioned at the bottom left, connected by an arrow to the highlighted text.
---
### Components/Axes
#### Photograph
- **Subject**: Outdoor environment with trees, a road, and a person (highlighted in green).
- **Details**: No explicit labels or axis markers.
#### Text Box
- **Content**: Nine statements (see below).
- **Formatting**:
- First eight statements are **crossed out**.
- Ninth statement ("It is not during rush hour") is **highlighted and underlined**.
#### Robot Icon
- **Design**: Gray robot with red accents, question mark symbol.
- **Position**: Bottom-left corner, connected via arrow to the highlighted text.
---
### Detailed Analysis
#### Text Box Content
1. **Crossed-out statements**:
- "The traffic is bad in this area"
- "this man needs glasses to see"
- "Pots, pans, and food are stored here"
- "it has many items the person likes to eat"
- "the person is on the go"
- "he is baking cookies for a party he is attending tomorrow"
- "this is the person drinking the tea"
- "there's no one inside the building"
2. **Highlighted statement**:
- "It is not during rush hour" (underlined, connected to robot icon).
#### Spatial Relationships
- The **robot icon** (bottom-left) points to the highlighted text via an arrow.
- The **green bounding box** in the photograph isolates a person, suggesting relevance to the text.
---
### Key Observations
1. **Contradictory Statements**:
- The first eight statements describe scenarios (e.g., traffic, cooking, tea-drinking) that are **crossed out**, implying they are incorrect or irrelevant.
- The ninth statement ("It is not during rush hour") is emphasized, suggesting it is the correct or critical conclusion.
2. **Robot Icon**:
- The question mark implies uncertainty or a query, directing attention to the highlighted text as the answer.
3. **Photograph Context**:
- The green box around the person may indicate a focus on their activity, but no explicit link to the text is provided.
---
### Interpretation
- The **highlighted statement** ("It is not during rush hour") likely resolves a contradiction among the crossed-out scenarios. For example:
- If it were rush hour, traffic would be bad (contradicting the first statement).
- The person’s activity (e.g., baking cookies, drinking tea) might align with non-rush-hour behavior.
- The **robot icon** acts as a "verifier," signaling that the highlighted text is the correct inference.
- The **photograph** provides visual context but lacks direct textual correlation, leaving the relationship between the image and text ambiguous.
---
### Conclusion
The image appears to be part of a **problem-solving or quiz interface**, where the robot icon poses a question, and the text box presents hypotheses. The highlighted statement ("It is not during rush hour") is the validated answer, supported by the elimination of other scenarios. The photograph’s role remains unclear without additional context.
</details>
- (a) Retrieval of abductive inferences
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Diagram: Shopping Context Analysis System
### Overview
[Figure 4(a) illustration: an image of a shopping scene with a queried region; candidate inference texts such as "People can purchase them", "She is there for shopping", and "The price for the towels" are retrieved and scored by the model.]
</details>
(b) Localization of evidence
<details>
<summary>Image 7 Details</summary>

[Figure 4(b,c) illustration: a black-and-white photograph of a group of uniformed men with a highlighted region; candidate inferences ("they are part of an organization", "they are porters", "this is during WWII", "they ... are saying goodbye") are rated for plausibility.]
</details>
(c) Comparison of plausibility
Fig. 4: We pose three tasks over Sherlock : In retrieval , models are tasked with finding the ground-truth inference across a wide range of inferences, some much more plausible/relevant than others. In localization , models must align regions within the same image to several inferences written about that image. For comparison , we collect 19K Likert ratings from human raters across plausible candidates, and models are evaluated in their capacity to reconstruct human judgments across the candidates. Despite intrinsic subjectivity, headroom exists between human agreement and model performance, e.g., on the comparison task.
visual content in the R-rated VCR movies) or stigmas related to, e.g., gender and weight. See § A.4 for a more complete discussion.
While our analysis suggests that the relative magnitude of potentially offensive content is low in Sherlock , we still advocate against deployed use-cases that run the risk of perpetuating potential biases: our aim is to study abductive reasoning without endorsing the correctness or appropriateness of particular inferences. We foresee Sherlock as 1) a diagnostic corpus for measuring machine capacity for visual abductive reasoning; 2) a large-scale resource to study the types of inferences people may make about images; and 3) a potentially helpful resource for building tools that require understanding abductions specifically, e.g., for detecting purposefully manipulative content posted online, it could be useful to specifically study what people might assume about an image (rather than what is objectively correct; more details in Datasheet ( § F) [14]).
## 4 From Images to Abductive Inferences
We operationalize our corpus with three tasks, which we call retrieval, localization, and comparison. Notationally, we say that an instance within the Sherlock corpus consists of an image i , a region specified by N bounding boxes r = {⟨x_i^1, x_i^2, y_i^1, y_i^2⟩}_{i=1}^{N} , 10 a clue c corresponding to a literal description of r 's contents, and an inference f that an annotator associated with i , r , and c . We consider:
10 As discussed in § 3, N has a mean/median of 1.17/1.0 across the corpus.
1. Retrieval of Abductive Inferences: For a given image/region pair ( i , r ), how well can models select the ground-truth inference f from a large set of candidates ( ∼ 1K) covering a broad swath of the corpus?
2. Localization of Evidence: Given an image i and an inference f written about an (unknown) region within the image, how well can models locate the proper region?
3. Comparison of Plausibility: Given an image/region pair ( i , r ) and a small set ( ∼ 10) of relevant inferences, can models predict how humans will rank their plausibility?
Each task tests a complementary aspect of visual abductive reasoning (Fig. 4): retrieval tests across a broad range of inferences, localization tests within individual images, and comparison tests for correlation with human judgment. Nonetheless, the same model can undertake all three tasks if it implements the following interface:
## Sherlock Abductive Visual Reasoning Interface
- Input: An image i , a region r within i , and a candidate inference f .
- Target: A score s , where s is proportional to the plausibility that f could be inferred from ( i , r ).
That is, we assume a model m : ( i , r , f ) → R that scores inference f 's plausibility for ( i , r ). Notably, the interface takes as input inferences, but not clues: our intent is to focus evaluation on abductive reasoning, rather than the distinct setting of literal referring expressions. 11 Clues can be used for training m ; as we will see in § 5, our best-performing model does, in fact, use clues at training time.
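This interface can be sketched in a few lines of Python; the names below (`AbductiveScorer`, `toy_scorer`) are illustrative, not from the released code:

```python
from typing import Protocol, Sequence

class AbductiveScorer(Protocol):
    """m : (i, r, f) -> R, scoring inference f's plausibility for (i, r)."""
    def __call__(self, image, region, inference: str) -> float: ...

def rank_inferences(m: AbductiveScorer, image, region,
                    candidates: Sequence[str]) -> list:
    """Order candidate inferences by descending plausibility score;
    the same scorer also drives localization and comparison."""
    return sorted(candidates, key=lambda f: m(image, region, f), reverse=True)

# A toy stand-in scorer (hypothetical): score by text length only.
def toy_scorer(image, region, inference: str) -> float:
    return float(len(inference))

ranked = rank_inferences(toy_scorer, None, None, ["a", "abc", "ab"])
# → ["abc", "ab", "a"]
```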
## 4.1 Retrieval of Abductive Inferences
For retrieval evaluation, at test time, we are given an ( i , r ) pair, and a large ( ∼ 1K) 12 set of candidate inferences f ∈ F , only one of which was written by an annotator for ( i , r ); the others are randomly sampled from the corpus. In the im → txt direction, we compute the mean rank of the true item (lower=better) and P @1 (higher=better); in the txt → im direction, we report mean rank (lower=better).
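A minimal sketch of the im → txt metrics, assuming a precomputed list of candidate scores per ( i , r ) pair (the toy scores and gold indices are illustrative):

```python
def retrieval_metrics(scores, gold_index):
    """scores: model scores for each candidate inference for one (i, r) pair.
    Returns (rank of the gold candidate, whether it is ranked first).
    Rank is 1-based: 1 means the gold inference scored highest."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    rank = order.index(gold_index) + 1
    return rank, rank == 1

def mean_rank_and_p_at_1(all_scores, all_gold):
    """Aggregate mean rank (lower=better) and P@1 (higher=better)."""
    ranks, hits = [], []
    for scores, gold in zip(all_scores, all_gold):
        r, hit = retrieval_metrics(scores, gold)
        ranks.append(r)
        hits.append(hit)
    return sum(ranks) / len(ranks), sum(hits) / len(hits)

# Two toy queries over 4 candidates each; gold candidates are indices 2 and 1.
mr, p1 = mean_rank_and_p_at_1([[0.1, 0.3, 0.9, 0.2], [0.8, 0.1, 0.2, 0.0]],
                              [2, 1])
# → mean rank 2.0 (gold ranked 1st, then 3rd), P@1 = 0.5
```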
## 4.2 Localization of Evidence
Localization assesses a model's capacity to select the region within an image that most directly supports a given inference. Following prior work on literal referring expression localization [28,25,73] (inter alia), we experiment in two settings: 1) we are given all the ground-truth bounding boxes for an image, and 2) we are given only automatic bounding box proposals from an object detection model.
11 In § B.1, for completeness, we give results on the retrieval and localization setups, but testing on clues instead.
12 Our validation/test sets contain about 23K inferences. For efficiency we randomly split into 23 equal sized chunks of about 1K inferences, and report retrieval averaged over the resulting splits.
Table 2: Test results for all models across all three tasks. CLIP RN50x64 outperforms all models in all setups, but significant headroom exists, e.g., on Comparison between the model and human agreement.
| | Retrieval: im → txt ( ↓ ) | Retrieval: txt → im ( ↓ ) | Retrieval: P @1 im → txt ( ↑ ) | Localization: GT-Box/Auto-Box ( ↑ ) | Comparison: Val/Test Human Acc ( ↑ ) |
|-----------------------------|--------------------|-------------|---------------|-----------------------|--------------------------|
| Random | 495.4 | 495.4 | 0.1 | 30.0/7.9 | 1.1/-0.6 |
| Bbox Position/Size | 257.5 | 262.7 | 1.3 | 57.3/18.8 | 5.5/1.4 |
| LXMERT | 51.1 | 48.8 | 14.9 | 69.5/30.3 | 18.6/21.1 |
| UNITER Base | 40.4 | 40.0 | 19.8 | 73.0/33.3 | 20.0/22.9 |
| CLIP ViT-B/16 | 19.9 | 21.6 | 30.6 | 85.3/38.6 | 20.1/21.3 |
| CLIP RN50x16 | 19.3 | 20.8 | 31.0 | 85.7/38.7 | 21.6/23.7 |
| CLIP RN50x64 | 19.3 | 19.7 | 31.8 | 86.6/39.5 | 25.1/26.0 |
| ↰ + multitask clue learning | 16.4 | 17.7 | 33.4 | 87.2 / 40.6 | 26.6 / 27.1 |
| Human + (Upper Bound) | - | - | - | 92.3/(96.2) | 42.3/42.3 |
GT bounding boxes. We assume an image i , the set of 3+ inferences F written for that image, and the (unaligned) set of regions R corresponding to F . The model must produce a one-to-one assignment of F to R in the context of i . In practice, we score all possible F × R pairs via the abductive visual reasoning interface, and then compute the maximum linear assignment [30] using lapjv's implementation of [24]. The evaluation metric is the accuracy of this assignment, averaged over all images. To quantify an upper bound, a human rater performed the assignment for 101 images, achieving an average accuracy of 92.3%.
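The assignment step can be sketched as follows; for the handful of inferences per image a brute-force search over permutations suffices, though the paper uses a Jonker-Volgenant solver (lapjv) instead:

```python
from itertools import permutations

def best_assignment(score_matrix):
    """score_matrix[f][r]: model score for pairing inference f with region r.
    Brute-force maximum linear assignment (fine for small per-image sets).
    Returns (assignment, total score) where assignment[f] = region index."""
    n = len(score_matrix)
    best, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(score_matrix[f][perm[f]] for f in range(n))
        if total > best_total:
            best, best_total = perm, total
    return best, best_total

# Toy 3x3 score matrix: the optimal pairing is f0->r1, f1->r0, f2->r2.
scores = [[0.2, 0.9, 0.1],
          [0.8, 0.3, 0.2],
          [0.1, 0.2, 0.7]]
assignment, total = best_assignment(scores)
# → assignment (1, 0, 2), total 2.4
```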
Auto bounding boxes. We compute 100 bounding box proposals per image by applying Faster-RCNN [54] with a ResNeXt101 [69] backbone trained on Visual Genome to all the images in our corpus. Given an image i and an inference f that was written about the image, we score all 100 bounding box proposals independently and take the highest scoring one as the prediction. We count a prediction as correct if it has IoU > 0.5 with a true bounding box that corresponds to that inference, 13 and incorrect otherwise. 14
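The IoU criterion can be sketched as below (a minimal version, assuming axis-aligned (x1, y1, x2, y2) boxes):

```python
def iou(box_a, box_b):
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def proposal_correct(pred_box, gt_boxes, thresh=0.5):
    """A predicted proposal counts as correct if IoU > thresh with ANY
    ground-truth box for the inference (annotators could draw several)."""
    return any(iou(pred_box, gt) > thresh for gt in gt_boxes)

ok = proposal_correct((0, 0, 10, 10), [(100, 100, 110, 110), (1, 1, 10, 10)])
# → True (second ground-truth box has IoU 0.81)
```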
## 4.3 Comparison of Plausibility
We assess model capacity to make fine-grained assessments given a set of plausible inferences. For example, in Fig. 4c (depicting a group of men marching and carrying bags), human raters are likely to say that they are military men and that the photo was taken during WWII, and unlikely to see them as porters despite them carrying bags. Our evaluation assumes that a performant model's predictions should correlate with the (average) relative judgments made by humans, and we seek to construct a corpus that supports evaluation of such reasoning.
13 Since the annotators were able to specify multiple bounding boxes per observation pair , we count a match to any of the labeled bounding boxes.
14 A small number of images do not have a ResNeXt bounding box with IoU > 0.5 with any ground truth bounding box: in § 5.1, we show that most instances (96.2%) are solvable with this setup.
Constructing sets of plausible inferences. We use a performant model checkpoint fine-tuned for the Sherlock tasks 15 to compute the similarity score between all ( i , r , f ) triples in the validation/test sets. Next, we perform several filtering steps: 1) we only consider pairs where the negative inference received a higher score than the ground-truth according to the model; 2) we perform soft text deduplication to downsample inferences that are semantically similar; and 3) we perform hard text deduplication, allowing each inference to appear verbatim at most 3 times. Then, through an iterative process, we uniquely sample a diverse set of 10 inferences per ( i , r ) that meet these filtering criteria. This results in a set of 10 plausible inference candidates for each of 485/472 validation/test images. More details are in § E. In a retrieval sense, these plausible inferences can be viewed as 'hard negatives': i.e., none are the gold annotated inference, but a strong model nonetheless rates them as plausible.
Human rating of plausible inferences. Using MTurk, we collect two annotations of each candidate inference on a three-point Likert scale ranging from 1 (bad: 'irrelevant'/'verifiably incorrect') to 3 (good: 'statement is probably true; the highlighted region supports it.'). We collect 19K annotations in total (see § E for full details). Because abductive reasoning involves subjectivity and uncertainty, we expect some amount of intrinsic disagreement between raters. 16 We measure model correlation with human judgments on this set via pairwise accuracy. For each image, for all pairs of candidates that are rated differently on the Likert scale, the model gets an accuracy point if it orders them consistently with the human rater's ordering. Ties are broken randomly but consistently across all models. For readability, we subtract the accuracy of a random model (50%) and multiply by two to form the final accuracy metric.
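A minimal sketch of the pairwise accuracy metric, including the random-baseline rescaling; the exact tie-breaking mechanics here are an assumption beyond what the text specifies:

```python
import random

def pairwise_accuracy(model_scores, human_ratings, seed=0):
    """For all candidate pairs rated differently by humans, the model gets a
    point when its score ordering agrees; model-score ties are broken randomly
    but reproducibly. Returns accuracy rescaled so a random model scores 0:
    (acc - 0.5) * 2, as in the final metric."""
    rng = random.Random(seed)
    correct, total = 0, 0
    n = len(model_scores)
    for a in range(n):
        for b in range(a + 1, n):
            if human_ratings[a] == human_ratings[b]:
                continue  # only differently-rated pairs count
            total += 1
            if model_scores[a] == model_scores[b]:
                correct += rng.random() < 0.5  # consistent random tie-break
            elif (model_scores[a] > model_scores[b]) == (human_ratings[a] > human_ratings[b]):
                correct += 1
    return (correct / total - 0.5) * 2

metric = pairwise_accuracy([0.9, 0.2, 0.5], [3, 1, 2])
# → 1.0 (model ordering agrees with all differently-rated pairs)
```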
## 5 Methods and Experiments
Training objective. To support the interface described in § 4, we train models m : ( i , r , f ) → R that score inference f 's plausibility for ( i , r ). We experiment with several different V+L backbones as detailed below; for each, we train by optimizing model parameters to score truly corresponding ( i , r , f ) triples more highly than negatively sampled ( i , r , f fake ) triples.
LXMERT [61] is a vision+language transformer [64] model pre-trained on Visual Genome [29] and MSCOCO [33]. The model is composed of three transformer encoders [64]: an object-relationship encoder (which takes in ROI features+locations with a max of 36, following [2]), a language encoder that processes word tokens, and a cross modality encoder. To provide region information r , we calculate the ROI feature of r and always place it in the first object token to the visual encoder (this is a common practice for, e.g., the VCR dataset [75]).
15 Specifically, a CLIP RN50x16 checkpoint that achieves strong validation retrieval performance (comparable to the checkpoint of the reported test results in § 5.1); model details in § 5.
16 In § 5.1, we show that models achieve significantly less correlation compared to human agreement.
We follow [9] to train the model in 'image-text retrieval' mode, using a triplet loss with margin m = 0.2 between the cosine similarity scores of the positive triple ( i , r , f ) and two negative triples ( i , r , f fake ) and ( i fake , r fake , f ).
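This margin-based objective can be sketched on scalar similarity scores (standing in for the model's cosine similarities):

```python
def triplet_loss(s_pos, s_neg, margin=0.2):
    """Margin ranking loss on similarity scores: pushes the positive
    triple's score above the negative's by at least the margin."""
    return max(0.0, margin - s_pos + s_neg)

def retrieval_mode_loss(s_pos, s_neg_text, s_neg_image, margin=0.2):
    """Two negatives per positive, as in the retrieval-mode setup:
    a mismatched inference (i, r, f_fake) and a mismatched image
    (i_fake, r_fake, f)."""
    return (triplet_loss(s_pos, s_neg_text, margin)
            + triplet_loss(s_pos, s_neg_image, margin))

# First negative already beyond the margin (zero loss), second inside it.
loss = retrieval_mode_loss(0.8, 0.5, 0.7)
```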
UNITER [9] consists of a single, unified transformer that takes in image and text embeddings. We experiment with the Base version pre-trained on MSCOCO [33], Visual Genome [29], Conceptual Captions [57], and SBU Captions [41]. We apply the same strategy of region-of-reference-first passing and train with the same triplet loss following [9].
CLIP. We finetune the ViT-B/16 , RN50x16 , and RN50x64 versions of CLIP [51]. Text is represented via a 12-layer text transformer. For ViT-B/16 , images are represented by a 12-layer vision transformer [10], whereas for RN50x16 / RN50x64 , images are represented by EfficientNet-scaled ResNet50 [16,62].
We modify CLIP to incorporate the bounding box as input. Inspired by a similar process from [76,70], to pass a region to CLIP, we simply draw a bounding box on the image in pixel space: we use a green-bordered / opaque purple box as depicted in Fig. 5b (early experiments proved this more effective than modifying CLIP's architecture). To enable CLIP to process the widescreen images of VCR, we apply it twice to the input using overlapping square regions, i.e., graphically, like this: [ 1 [ 2 ] 1 ] 2 , and average the resulting embeddings. We finetune using InfoNCE [59,40]. We sample a batch of truly corresponding ( i , r , f ) triples, render the regions r in their corresponding images, and then construct all possible negative ( i , r , f fake ) triples in the batch by aligning each inference to each ( i , r ). We use the biggest minibatch size possible using 8 GPUs with 48GB of memory each: 64, 200, and 512 for RN50x64 , RN50x16 , and ViT-B/16 , respectively.
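A minimal, framework-free sketch of in-batch InfoNCE over a similarity matrix (real training would use a deep-learning framework's batched, differentiable version):

```python
import math

def info_nce(sim_matrix, temperature=1.0):
    """In-batch InfoNCE: sim_matrix[k][j] is the similarity between the k-th
    rendered (image, region) and the j-th inference in the batch; the diagonal
    holds the true pairs. Returns the mean cross-entropy over rows."""
    losses = []
    for k, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        log_z = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_z - logits[k])  # -log softmax at the true pair
    return sum(losses) / len(losses)

# Toy 2x2 batch: true pairs (diagonal) are far more similar -> small loss.
loss = info_nce([[5.0, 0.0], [0.0, 5.0]])
```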
Multitask learning. All models thus far only utilize inferences at training time. We experiment with a multitask learning setup using CLIP that additionally trains with clues. In addition to training using our abductive reasoning objective, i.e., InfoNCE on inferences, we mix in an additional referring expression objective, i.e., InfoNCE on clues. Evaluation remains the same: at test time, we do not assume access to clues. At training time, for each observation, half the time we sample an inference (to form ( i , r , f )), and half the time we sample a clue (to form ( i , r , c )). The clue/inference mixed batch of examples is then handed to CLIP, and a gradient update is made with InfoNCE as usual. To enable the model to differentiate between clues/inferences, we prefix the texts with clue: / inference: , respectively.
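The clue/inference mixing can be sketched as below; the observation dictionary keys are illustrative:

```python
import random

def build_multitask_batch(observations, seed=0):
    """Each observation holds both a clue and an inference for an (i, r) pair.
    Half the time we train on the inference, half on the clue, prefixing the
    text so the model can tell the two objectives apart."""
    rng = random.Random(seed)
    batch = []
    for obs in observations:
        if rng.random() < 0.5:
            batch.append((obs["image"], obs["region"],
                          "inference: " + obs["inference"]))
        else:
            batch.append((obs["image"], obs["region"],
                          "clue: " + obs["clue"]))
    return batch

obs = [{"image": "img0", "region": (0, 0, 5, 5),
        "clue": "a 20 mph sign",
        "inference": "this is a residential area"}] * 4
batch = build_multitask_batch(obs)
```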
Baselines. In addition to a random baseline, we consider a content-free version of our CLIP ViT-B/16 model that is given only the position/size of each bounding box. In place of the image, we pass a mean pixel value across the entire image and draw the bounding box on the image using an opaque pink box (see § 5.2).
## 5.1 Results
Table 2 contains results for all the tasks: In all cases, our CLIP-based models perform best, with RN50x64 outperforming its smaller counterparts. Incorporating the multitask objective pushes performance further. While CLIP performs the
| | P @1 ( ↑ ) | Val/Test Human ( ↑ ) |
|------------------------------|--------------|------------------------|
| CLIP ViT-B/16 | 30.5 | 20.1/21.2 |
| ↰ Position only | 1.3 | 5.5/1.4 |
| ↰ No Region | 18.1 | 16.8/19.0 |
| ↰ No Context | 24.8 | 18.1/17.8 |
| ↰ Only Context | 18.9 | 17.4/16.3 |
| ↰ Trained w/ only Clues | 23 | 16.2/19.7 |
| ↰ Crop, no Widescreen | 27.8 | 23.1/21.8 |
| ↰ Resize, no Widescreen | 27.7 | 19.4/20.6 |
| ↰ Zero shot w/ prompt | 12 | 10.0/9.5 |
(a)
Fig. 5: We perform ablations by varying the input data (top rows of (a)) and the modeling components (bottom rows of (a)). Figure (b) depicts our image input ablations, which are conducted by drawing in pixel-space directly, following [76]. Having no context may make it difficult to situate the scene more broadly; here, neatly stacked cups could be in a bar, a hotel, a store, etc. Access to only the context of the dining room is also insufficient. For the modeling ablations, cropping/resizing decreases performance on retrieval ( P @1), but not comparison (Val/Test Human).
<details>
<summary>Image 8 Details</summary>

[Figure 5(b) illustration: for the text "the kitchen is part of a restaurant", the image input is rendered four ways: Position Only, No Context, Only Context, and No Region.]
</details>
best, UNITER is more competitive on comparison and less competitive on retrieval and localization. We speculate this has to do with the nature of each task: retrieval requires models to reason about many incorrect examples, whereas the inferences in the comparison task are usually relevant to the objects in the scene. In § C, we provide ablations demonstrating that CLIP models outperform UNITER even when trained with a smaller batch size. Compared to human agreement on comparison, our best model only gets 65% of the way there (27% vs. 42%).
## 5.2 Ablations
We perform data and model ablations on CLIP ViT-B/16 . Results are in Fig. 5.

Input ablations. Each part of our visual input is important. Aside from the position-only model, the biggest drop-off in performance results from not passing the region as input to CLIP: e.g., P @1 for im → txt retrieval nearly halves, dropping from 31 to 18, suggesting that CLIP relies on the local region information to reason about the image. Removing the region's content ('Only Context') unsurprisingly hurts performance, but so does removing the surrounding context ('No Context'). That is, the model performs best when it can reason about the clue and its full visual context jointly. On the text side, we trained a model with only clues; retrieval and comparison performance both drop, which suggests that clues and inferences carry different information (additional results in § B.1).

Model ablations. We considered two alternate image processing configurations. Instead of doing two CLIP passes per image to facilitate widescreen processing ( § 5), we consider (i) center cropping and (ii) pad-and-resizing. Both take less computation, but provide less information to the model. Cropping removes the
Fig. 6: Validation retrieval perf. ( P @1) vs. comparison acc. for CLIP checkpoints.
<details>
<summary>Image 10 Details</summary>

[Figure 6 illustration: scatter plot of pairwise human accuracy on the comparison task (y-axis) vs. P@1 retrieval performance (x-axis) for CLIP ViT-B/16, RN50x16, and RN50x64 checkpoints.]
</details>
Fig. 7: Error analysis: examples of false positives and false negatives predicted by our model on the comparison task's validation set.
<details>
<summary>Image 11 Details</summary>

[Figure 7 illustration: six example panels contrasting model and human ratings of inferences, e.g., a "FILBERT LANE" street sign with the inference "People can park their cars on Filbert street for as long as they want", a florist shop, a room with old metal frame windows, and two panels with the inference "They are hiding from someone".]
</details>
sides of images, whereas pad-and-resize lowers the resolution significantly. The bottom half of the table in Fig. 5a reports the results: both configurations lower performance on retrieval tasks, but there's less impact for comparison.
Better retrieval → better comparison. In Fig. 6, we observe a high correlation between the retrieval performance ( P @1) of our (single-task) CLIP model checkpoints and human accuracy on the comparison task. For the smaller RN50x16 and ViT-B/16 models, this effect cannot simply be explained by training time; for RN50x16 , the Pearson correlation between training steps and comparison performance is 81, whereas the correlation between P @1 and comparison performance is 91. Overall, it is plausible that a model with higher precision at retrieval could help further bridge the gap on the comparison task.
Oracle text-only models are insufficient. One potential concern with our setup is that clues may map one-to-one onto inferences, e.g., if all soccer balls in our corpus were mapped onto 'the owner plays soccer' (and vice versa). We compare to an oracle baseline that makes this pessimistic assumption (complementing our 'No Context' ablation, which provides a comparable context-free visual reference to the clue). We give the model oracle access to the ground-truth clues. Following [6], we use T5-Large v1.1 [52] to map clues to inferences with no access to the image by fitting P (inference | clue) in a sequence-to-sequence fashion; training details are in § B. The resulting text-only clue → inference model, when given the clue 'chipped paint and rusted umbrella poles' , estimates likely inferences, for example: 'the area is in a disrepair' , 'the city does not care about its infrastructure' , etc. The text-only oracle under-performs vs. CLIP despite the fact that, unlike CLIP, it is given the ground-truth clue: on comparison, it achieves 22.8/19.3 val/test accuracy, significantly lower than the 26.6/27.1 that our best vision+language model achieves. This is probably because global scene context cannot be fully summarized via a local referring expression. In the prior 'chipped paint and rusted umbrella poles' example, the true inference, 'this beach furniture does not get put inside at night' , requires additional visual context beyond the clue: chipped paint and a rusty umbrella alone may not provide enough context to infer that this furniture is beach furniture.
## 5.3 Error Analysis
We conduct a quantitative error analysis of multitask CLIP RN50x64 on the comparison task. We select the 340 validation images with the highest human agreement and split them into two groups: one where the model performed above average, and one where it performed below average. We then attempt to predict which group an image falls into using logistic regression with 5-fold cross-validation. Overall, errors are difficult to predict. Surface-level features of the images/inferences are not very predictive: relative to a 50% ROC AUC baseline, CLIP ViT-B/16 image features achieve 55%, and the mean SentenceBERT [53] embedding of the inference achieves 54%. More predictive of model errors (though not available a priori ) are human Likert ratings: a single-feature model using mean human agreement achieves 57% AUC (more human agreement = better model performance).
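The error-prediction setup can be sketched as follows, with synthetic stand-in features in place of the CLIP/SentenceBERT embeddings (all data below is randomly generated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins (ours, not the paper's data): 340 images, each with a
# feature vector and a binary label for above/below-average model performance.
X = rng.normal(size=(340, 16))  # e.g., image or text embedding features
y = (X[:, 0] + rng.normal(scale=3.0, size=340) > 0).astype(int)  # weak signal

# 5-fold cross-validated ROC AUC of a logistic-regression error predictor.
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=5, scoring="roc_auc")
mean_auc = aucs.mean()
```

With only a weak planted signal, the cross-validated AUC sits modestly above chance, mirroring the hard-to-predict errors reported above.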
Fig. 7 gives qualitative examples of false positives/negatives. The types of abductive reasoning the model falls short on are diverse. In the boat example, the model fails to notice that a florist has set up shop on a ship deck; in the window example, the model misinterprets the bars over the windows as being outside the building rather than inside and attached to a bed-frame. The model is capable of reading some simple signs, but, as highlighted by [37], reasoning about the semantics of written text placed in images remains a challenge, e.g., a 'no parking' sign is misidentified as an 'okay to park' sign. Overall, the difficult-to-categorize nature of these examples suggests that the Sherlock corpus makes for a difficult benchmark for visual abductive reasoning.
## 6 Conclusion
We introduce Sherlock , a corpus for visual abductive reasoning containing 363K clue/inference observation pairs across 103K images. Our work complements existing abductive reasoning corpora, both in format (free-viewing, free-text) and in diversity (not human-centric). Beyond providing a challenging vision+language benchmark, we hope Sherlock can serve as a resource for studying visual abductive reasoning more broadly. Future work includes:
1. Salience: in Sherlock , annotators specify salient clues; how/why does salience differ from other free-viewing setups, like image captioning?
2. Ambiguity: when/why do people (justifiably) come to different conclusions?
3. Generative evaluation metrics: evaluating generation in the abductive setting, i.e., without a definitive notion of correctness, remains a challenge.
Acknowledgments. This work was funded by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the DARPA SemaFor program, and the Allen Institute for AI. AR was additionally supported in part by the DARPA PTG program, as well as BAIR's industrial alliance program. We additionally thank the UC Berkeley Semafor group for helpful discussions and feedback.
## References
1. Aliseda, A.: The logic of abduction: an introduction. In: Springer Handbook of Model-Based Science, pp. 219-230 (2017)
2. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
3. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual Question Answering. In: ICCV (2015)
4. Bender, E.M., Friedman, B.: Data statements for natural language processing: Toward mitigating system bias and enabling better science. TACL 6 , 587-604 (2018)
5. Berg, A.C., Berg, T.L., Daume, H., Dodge, J., Goyal, A., Han, X., Mensch, A., Mitchell, M., Sood, A., Stratos, K., et al.: Understanding and predicting importance in images. In: CVPR (2012)
6. Bhagavatula, C., Bras, R.L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., tau Yih, W., Choi, Y.: Abductive commonsense reasoning. In: ICLR (2020)
7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. JMLR 3 , 993-1022 (2003)
8. Carson, D.: The abduction of Sherlock Holmes. International Journal of Police Science & Management 11 (2), 193-202 (2009)
9. Chen, Y.C., Li, L., Yu, L., Kholy, A.E., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: Universal image-text representation learning. In: ECCV (2020)
10. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
11. Du, L., Ding, X., Liu, T., Qin, B.: Learning event graph knowledge for abductive reasoning. In: ACL (2021)
12. Fang, Z., Gokhale, T., Banerjee, P., Baral, C., Yang, Y.: Video2Commonsense: Generating commonsense descriptions to enrich video captioning. In: EMNLP (2020)
13. Garcia, N., Otani, M., Chu, C., Nakashima, Y.: KnowIT vqa: Answering knowledge-based questions about videos. In: AAAI (2020)
14. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Iii, H.D., Crawford, K.: Datasheets for datasets. Communications of the ACM (2021)
15. Grice, H.P.: Logic and conversation. In: Speech acts, pp. 41-58. Brill (1975)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
17. Hobbs, J.R., Stickel, M.E., Appelt, D.E., Martin, P.: Interpretation as abduction. Artificial intelligence 63 (1-2), 69-142 (1993)
18. Hosseini, H., Kannan, S., Zhang, B., Poovendran, R.: Deceiving google's perspective api built for detecting toxic comments. arXiv preprint arXiv:1702.08138 (2017)
19. Ignat, O., Castro, S., Miao, H., Li, W., Mihalcea, R.: WhyAct: Identifying action reasons in lifestyle vlogs. In: EMNLP (2021)
20. Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: Tgif-QA: Toward spatio-temporal reasoning in visual question answering. In: CVPR (2017)
21. Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
22. Johnson, J., Karpathy, A., Fei-Fei, L.: Densecap: Fully convolutional localization networks for dense captioning. In: CVPR (2016)
23. Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: CVPR (2015)
24. Jonker, R., Volgenant, A.: A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38 (4), 325-340 (1987)
25. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: Referring to objects in photographs of natural scenes. In: EMNLP (2014)
26. Kim, H., Zala, A., Bansal, M.: CoSIm: Commonsense reasoning for counterfactual scene imagination. In: NAACL (2022)
27. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
28. Krahmer, E., Van Deemter, K.: Computational generation of referring expressions: A survey. Computational Linguistics 38 (1), 173-218 (2012)
29. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV (2016)
30. Kuhn, H.W.: The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), 83-97 (1955)
31. Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVQA+: Spatio-temporal grounding for video question answering. In: ACL (2020)
32. Lei, J., Yu, L., Berg, T.L., Bansal, M.: What is more likely to happen next? video-and-language future event prediction. In: EMNLP (2020)
33. Lin, T.Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
34. Liu, J., Chen, W., Cheng, Y., Gan, Z., Yu, L., Yang, Y., Liu, J.: Violin: A large-scale dataset for video-and-language inference. In: CVPR (2020)
35. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
36. Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: A visual question answering benchmark requiring external knowledge. In: CVPR (2019)
37. Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: Visual question answering by reading text in images. In: ICDAR (2019)
38. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., Gebru, T.: Model cards for model reporting. In: FAccT (2019)
39. Niiniluoto, I.: Defending abduction. Philosophy of science 66 , S436-S451 (1999)
40. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
41. Ordonez, V., Kulkarni, G., Berg, T.L.: Im2text: Describing images using 1 million captioned photographs. In: NeurIPS (2011)
42. Ovchinnikova, E., Montazeri, N., Alexandrov, T., Hobbs, J.R., McCord, M.C., Mulkar-Mehta, R.: Abductive reasoning with a large knowledge base for discourse processing. In: IWCS (2011)
43. Park, D.H., Darrell, T., Rohrbach, A.: Robust change captioning. In: ICCV (2019)
44. Park, J.S., Bhagavatula, C., Mottaghi, R., Farhadi, A., Choi, Y.: VisualCOMET: Reasoning about the dynamic context of a still image. In: ECCV (2020)
45. Paul, D., Frank, A.: Generating hypothetical events for abductive inference. In: *SEM (2021)
46. Peirce, C.S.: Philosophical writings of Peirce, vol. 217. Courier Corporation (1955)
47. Peirce, C.S.: Pragmatism and pragmaticism, vol. 5. Belknap Press of Harvard University Press (1965)
48. Pezzelle, S., Greco, C., Gandolfi, G., Gualdoni, E., Bernardi, R.: Be different to be better! a benchmark to leverage the complementarity of language and vision. In: Findings of EMNLP (2020)
49. Pirsiavash, H., Vondrick, C., Torralba, A.: Inferring the why in images. Tech. rep. (2014)
50. Qin, L., Shwartz, V., West, P., Bhagavatula, C., Hwang, J., Bras, R.L., Bosselut, A., Choi, Y.: Back to the future: Unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning. In: EMNLP (2020)
51. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)
52. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020)
53. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using siamese BERT-networks. In: EMNLP (2019)
54. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. NeurIPS (2015)
55. Sap, M., Card, D., Gabriel, S., Choi, Y., Smith, N.A.: The risk of racial bias in hate speech detection. In: ACL (2019)
56. Shank, G.: The extraordinary ordinary powers of abductive reasoning. Theory & Psychology 8 (6), 841-860 (1998)
57. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
58. Shazeer, N., Stern, M.: Adafactor: Adaptive learning rates with sublinear memory cost. In: ICML (2018)
59. Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: NeurIPS (2016)
60. Tafjord, O., Mishra, B.D., Clark, P.: ProofWriter: Generating implications, proofs, and abductive statements over natural language. In: Findings of ACL (2021)
61. Tan, H., Bansal, M.: LXMERT: Learning cross-modality encoder representations from transformers. In: EMNLP (2019)
62. Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: ICML (2019)
63. Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., Fidler, S.: MovieQA: Understanding stories in movies through question-answering. In: CVPR (2016)
64. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
65. Vedantam, R., Lin, X., Batra, T., Zitnick, C.L., Parikh, D.: Learning common sense through visual abstraction. In: ICCV (2015)
66. Wang, P., Wu, Q., Shen, C., Dick, A., Van Den Hengel, A.: FVQA: Fact-based visual question answering. TPAMI 40 (10), 2413-2427 (2017)
67. Wang, P., Wu, Q., Shen, C., Hengel, A.v.d., Dick, A.: Explicit knowledge-based reasoning for visual question answering. In: IJCAI (2017)
68. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P.,
Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: State-of-the-art natural language processing. In: EMNLP: System Demonstrations (2020)
69. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
70. Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T.S., Sun, M.: CPT: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797 (2021)
71. Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: CLEVRER: Collision events for video representation and reasoning. In: ICLR (2020)
72. Yu, L., Park, E., Berg, A.C., Berg, T.L.: Visual Madlibs: Fill in the blank image generation and question answering. In: ICCV (2015)
73. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: ECCV (2016)
74. Zadeh, A., Chan, M., Liang, P.P., Tong, E., Morency, L.P.: Social-iq: A question answering benchmark for artificial social intelligence. In: CVPR (2019)
75. Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: Visual commonsense reasoning. In: CVPR (2019)
76. Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J.S., Cao, J., Farhadi, A., Choi, Y.: MERLOT: multimodal neural script knowledge models. In: NeurIPS (2021)
77. Zhang, C., Gao, F., Jia, B., Zhu, Y., Zhu, S.C.: Raven: A dataset for relational and analogical visual reasoning. In: CVPR (2019)
78. Zhang, H., Huo, Y., Zhao, X., Song, Y., Roth, D.: Learning contextual causality from time-consecutive images. In: CVPR Workshops (2021)
79. Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7W: Grounded question answering in images. In: CVPR (2016)
## Supplementary Material
## A Sherlock Data Collection and Evaluation
The dataset was collected during February 2021. The data is in English, and HITs were open to workers from the US, Canada, Great Britain, and Australia. We targeted a worker payment rate of $15/hour for all our HITs. For data collection and qualification, average pay came to $16-$20/hour, with the median worker compensated $12/hour. We hash Worker IDs to preserve anonymity. A sample data collection HIT is shown in Fig. 11 (with instructions shown in Fig. 10).
## A.1 Qualification of Workers
As a means of ensuring high-quality annotations, 266 workers were manually selected through qualification and training rounds. In the qualification round, workers were presented with three images and asked to provide three observation pairs per image. Each worker response was manually evaluated: a total of 297 workers who submitted at least 8 reasonable observation pairs out of 9 qualified for training.
The process of creating bounding boxes and linking these boxes to observation pairs was complex enough to necessitate a training stage. For the training round, qualified workers were given a standard data collection HIT (Fig. 11) at higher pay to account for the time needed to learn the process. An additional training round was offered to a small pool of workers to ensure everyone was on the same page regarding the instructions and the mechanics of the HIT. 266 workers completed the training (the remaining 31 did not return for the training round). In this paper, we use the term qualified workers to refer to workers who completed both the qualification and training rounds.
## A.2 Data Collection
As described in § 3, we collected a total of 363K observation pairs , each consisting of a clue and an inference. Further example annotations are shown in Fig. 14.
Image sourcing. For VCR images, we use the subset also annotated by VisualCOMET [44]; we limit our selection to images that contain at least 3 unique entities (persons or objects). For Visual Genome, during early annotation rounds, crowdworkers shared that particular classes of images were common and less interesting (e.g., grazing zebras, sheep in pastures). In response, we performed a semantic de-duplication step: we hierarchically clustered extracted CLIP ViT-B/32 features [51] into 80K clusters and sampled a single image from each resulting cluster. We annotate 103K images in total, and divide them into training/validation/test sets of 90K/6.6K/6.6K, aligned with the community-standard splits for these corpora.
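A minimal sketch of this de-duplication step follows, using random stand-in features and small counts (the real pipeline clusters CLIP ViT-B/32 features into 80K clusters):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Stand-ins for CLIP ViT-B/32 image features (random here for illustration;
# the real pipeline uses many more images and 80K clusters).
features = rng.normal(size=(200, 32))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Hierarchically cluster, then keep one representative image per cluster.
n_clusters = 50
labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(features)
keep = [int(np.flatnonzero(labels == c)[0]) for c in range(n_clusters)]
```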
Bounding boxes. For each clue in an observation pair , workers were asked to draw one or more bounding boxes around image regions relevant to the clue. For example, for the clue 'a lot of architectural decorations' given for the lower-right image in Fig. 14, the worker chose to box each architectural feature separately in its own bounding box. While not strictly enforced, we encouraged workers to use at most 3 bounding boxes per clue, with allowance for more if necessitated by the image and the observation pair , at the worker's individual discretion.
## A.3 Corpus Validation
To verify annotation quality, we run a validation over 17K observation pairs . For each observation pair , we present three independent crowdworkers with its associated image and its annotation: the clue with its corresponding region boxed in the image, and the inference along with its confidence rating. The workers are then asked to rate the observation pair along three dimensions: (1) acceptability of the observation pair (is it reasonable given the image?); (2) appropriateness of the bounding boxes (do they appropriately represent the clue?); and (3) interestingness of the observation pair . The annotation template of the HIT is shown in Fig. 12.
## A.4 Details on exploration of social biases
The clues and inferences we collect from crowdsource workers are abductive, and thus uncertain. While this type of reasoning is an important aspect of human cognition, such heuristics and assumptions may reflect false and harmful social biases. As a concrete example: early in our collection process, during a qualifying round, we asked 70 workers to annotate an image of a bedroom where action figures were placed on the bed. Many said the bedroom likely belonged to a male child, citing the action figures as evidence. We again emphasize that our goal is to study heuristic reasoning, without endorsing the particular inferences themselves.
Sample analysis. While curating the corpus, we (the authors) have examined several thousand annotations. To supplement this qualitative experience, we conducted a close reading of a random sample of 250 inferences, focused on references to protected characteristics of people and potentially offensive/NSFW cases.
During both our informal inspection and close reading, we observed similar patterns. As in other vision-and-language corpora depicting humans, the most common reference to a protected characteristic was perceived gender, e.g., annotators often assumed depicted people were 'a man' or 'a woman' (and sometimes also assumed age, e.g., 'an old man'). Aside from such perceived attributes standing in for identity, the large majority of inferences are not specifically/directly about protected characteristics and are SFW (243/250 in our sample). The small number of exceptions included: assumptions about the gender of owners of items, similar to the action figure example above (1/250 cases); speculation about the race of an individual based on a sweater logo (1/250); and commenting on bathing suits with respect to gender (1/250).
Since still frames in VCR are taken from movies, some depict potentially offensive imagery, e.g., movie gore, dated tropes, etc. The images in VCR come with the following disclaimer, which we also endorse (via visualcommonsense.com): 'many of the images depict nudity, violence, or miscellaneous problematic things (such as Nazis, because in many movies Nazis are the villains). We left these in though, partially for the purpose of learning (probably negative but still important) commonsense implications about the scenes. Even then, the content covered by movies is still pretty biased and problematic, which definitely manifests in our data (men are more common than women, etc.).'
Statistical analysis. While the random-sample analysis suggests that a vast majority of annotations in our corpus do not reference protected characteristics and are SFW, as an additional check, we passed a random set of 30K clues/inferences (10K each from training/val/test) through the Perspective API. 17 While the API itself is imperfect and has its own biases [18,38,55], it nonetheless provides additional information about potentially harmful content in our corpus. We examined the 50 clue/inference pairs per split marked as most likely to be toxic. Most of these were false positives, e.g., 'a dirty spoon' was marked as potentially toxic, likely because of the word 'dirty.' But this analysis did highlight a very small amount of lewd/NSFW/offensive content. Out of the 30K cases filtered through the Perspective API, we discovered 6 cases of weight stigmatization, 2 (arguably) lewd observations, 1 dark comment about a cigarette leading to an early death, 1 (arguable) case of insensitivity to mental illness, 6 cases of sexualized content, and 1 (arguable) case where someone was highlighted for wearing non-traditionally-gendered clothing.
## B Additional Modeling Details
After light hyperparameter tuning on the validation set, the best learning rate for fine-tuning our CLIP models was found to be 1e-5 with AdamW [35,27]. We use a linear learning rate warmup over 500 steps for RN50x16 and ViT-B/16 , and 1000 steps for RN50x64 . Our largest model, RN50x64 , takes about 24 hours to converge when trained on 8 Nvidia RTX6000 cards. For data augmentation during training, we use PyTorch's RandomCrop , RandomHorizontalFlip , RandomGrayscale , and ColorJitter . For our widescreen CLIP variants, data augmentations are executed on each half of the image independently. We compute visual/textual embeddings via a forward pass of the respective branches of CLIP; for our widescreen model, we simply average the resultant embeddings for each side of the image. To compute the similarity score, we use cosine similarity,
17 https://www.perspectiveapi.com/ ; November 2021 version.
| Model | Retrieval im → txt ( ↓ ) | Retrieval P @1 im → txt ( ↑ ) | Localization GT-Box/Auto-Box ( ↑ ) |
|--------------------|------|------|-----------|
| RN50x64 -inference | 12.8 | 43.4 | 92.5/41.4 |
| RN50x64 -clue | 6.2 | 54.3 | 94.7/53.3 |
| RN50x64 -multitask | 5.4 | 57.5 | 95.3/54.3 |
Table 3: Retrieval and localization results when clues, rather than inferences, are used at evaluation time. This task is more akin to referring-expression retrieval/localization than abductive commonsense reasoning. While the clue retrieval/localization setups are easier overall (i.e., referring expressions are easier for both models to reason about), the model trained for abductive reasoning, RN50x64 -inference, performs worse than the model trained on referring expressions, RN50x64 -clue.
and then scale the resulting similarities using a logit scaling factor, following [51]. Training is checkpointed every 300 gradient steps, and the checkpoint with best validation P @1 retrieval performance is selected.
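The cosine-similarity scoring with logit scaling described above can be sketched as follows; the scaling constant and toy embeddings are illustrative, not the trained model's values:

```python
import numpy as np

def scaled_similarity(image_emb, text_emb, logit_scale=100.0):
    """Cosine similarity between image/text embeddings, multiplied by a
    logit scaling factor following [51]; the constant here is illustrative."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return logit_scale * image_emb @ text_emb.T

# Widescreen variant: embed each image half separately, then average.
left_emb = np.array([[1.0, 0.0, 0.0]])
right_emb = np.array([[0.0, 1.0, 0.0]])
image_emb = (left_emb + right_emb) / 2.0

text_emb = np.array([[1.0, 1.0, 0.0]])
score = scaled_similarity(image_emb, text_emb)[0, 0]
```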
Ablation details. For all ablations, we use the ViT-B/16 version of CLIP for training speed: this version is more than twice as fast as our smallest ResNet, and enabled us to try more ablation configurations.
A cleaner training corpus. Evaluations are reported over version 1.1 of the Sherlock validation/test sets. However, our models are trained on version 1.0, which contains 3% more data; early experiments indicate that the removed data does not significantly impact model performance. This data was removed because we discovered that a small number of annotators were misusing the original collection interface. We encourage follow-up work to use version 1.1, but include version 1.0 for the sake of replicability.
T5 model details. We train T5-Large to map from clues to inferences using the Huggingface transformers library [68]; we parallelize using the Huggingface accelerate package. We use Adafactor [58] with learning rate .001 and batch size 32, train for 5 epochs, and select the checkpoint with the best validation loss.
## B.1 Results on Clues instead of Inferences
Whereas inferences capture abductive conclusions, clues are more akin to referring expressions. While inferences are our main focus at evaluation time, Sherlock also contains an equal number of clues, which act as literal descriptions of image regions: Sherlock thus provides a new dataset of 363K localized referring expressions grounded in the image regions of VisualGenome and VCR. As a pointer towards future work, we additionally report results for the retrieval and
localization setups, but testing on clue texts instead of inference texts. We do not report over our human-judged comparison sets, because our raters only observed inferences in that case. Table 3 gives results for two models in this setting: both are RN50x64 models trained with widescreen processing and with clues highlighted in pixel space, but one is trained on inferences and one on clues.
## C Batch Size Ablation
We hypothesize that the nature of the hard negatives the models encounter during training is related to their performance. Because UNITER and LXMERT are bidirectional, they are quadratically more memory-intensive than CLIP: as a result, for those models, we were only able to train with 18 negative examples per positive (c.f. CLIP ViT-B/16 , which uses 511 negatives). To check that batch size/number of negatives was not the only reason CLIP outperformed UNITER, we conducted an experiment varying ViT-B/16 's batch size from 4 to 512; the results are given in Fig. 8. Batch size does not explain all performance differences: with a batch size of only 4, our weakest CLIP-based model still localizes better than UNITER, and, at batch size 8, it surpasses UNITER's retrieval performance.
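The link between batch size and negative count comes from the in-batch contrastive objective: with batch size B, each positive pair is scored against the other B-1 pairs in the batch. A minimal NumPy sketch of a symmetric InfoNCE-style loss [40,59] follows; the function names and logit scale are illustrative, not the paper's exact implementation:

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce_loss(image_emb, text_emb, logit_scale=10.0):
    """Symmetric in-batch contrastive objective: with batch size B, each
    positive pair is contrasted against the other B-1 in-batch negatives,
    so batch size controls the number of negatives."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = logit_scale * image_emb @ text_emb.T  # (B, B) similarity matrix
    diag = np.arange(len(logits))                  # positives on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2.0

# Aligned embeddings yield a low loss; misaligned ones, a high loss.
emb = np.eye(4)
matched = info_nce_loss(emb, emb)
shuffled = info_nce_loss(emb, emb[::-1])
```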
## D Clues and inferences vs. literal captions
Fig. 8: The effect of batch size on the performance of ViT-B/16 . UNITER batch size is 256. Performance on all tasks increases with batch size, but appears to saturate, particularly for comparison.
<details>
<summary>Image 14 Details</summary>

### Visual Description
Line chart plotting ViT-B/16 comparison accuracy, GT localization, and retrieval P@1 against CLIP batch size (4 to 512), with dashed reference lines at 20.0 (UNITER comparison accuracy) and 19.8 (UNITER retrieval P@1). All three metrics improve with batch size and largely plateau beyond batch size 32.
</details>
Fig. 9: The SentenceBERT [53] cosine similarity between clues/inferences and MSCOCO captions; MSCOCO caption self-similarity included for reference. On average, clues are closer to MSCOCO captions than inferences.
<details>
<summary>Image 15 Details</summary>

### Visual Description
## Line Chart: Similarity Distributions to MSCOCO
### Overview
The chart compares the similarity distributions of three methods ("Inferences," "Clues," and "COCO-self") to the MSCOCO dataset. The x-axis represents similarity scores (ranging from -0.2 to 1.0), and the y-axis represents density. Three density curves are plotted, with vertical dashed lines marking key similarity thresholds at 0.4, 0.5, and 0.8.
### Components/Axes
- **X-axis**: "Similarity to MSCOCO" (scale: -0.2 to 1.0, markers at -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
- **Y-axis**: "Density" (no explicit scale, but curves suggest normalized distributions).
- **Legend**: Located in the top-left corner, with colors:
- Green: Inferences
- Orange: Clues
- Purple: COCO-self
- **Dashed Lines**: Vertical lines at x = 0.4 (green), 0.5 (orange), and 0.8 (purple).
### Detailed Analysis
1. **Inferences (Green)**:
- Starts near 0 at x = -0.2, rises to a peak at ~0.4, then declines.
- Density drops sharply after 0.4, approaching zero by x = 0.8.
- Overlaps with "Clues" between x = 0.2 and 0.6.
2. **Clues (Orange)**:
- Begins at x = 0, peaks at ~0.5, then declines.
- Density remains non-zero up to x = 0.8, with a gradual slope.
- Overlaps with "Inferences" in the 0.2–0.6 range.
3. **COCO-self (Purple)**:
- Starts at x = 0.4, rises sharply to a peak at ~0.8, then declines.
- Density is zero for x < 0.4 and x > 0.9.
- Highest peak (taller than other curves) at x = 0.8.
### Key Observations
- **Peak Differences**:
- "COCO-self" peaks at 0.8, significantly higher than "Inferences" (0.4) and "Clues" (0.5).
- "Inferences" and "Clues" share overlapping density regions between 0.2 and 0.6.
- **Thresholds**:
- Dashed lines at 0.4, 0.5, and 0.8 align with the peaks of each method, suggesting these are critical similarity benchmarks.
- **Data Gaps**:
- No data below x = -0.2 for any method.
- "COCO-self" has no density below 0.4 or above 0.9.
### Interpretation
The chart supports the caption's point. Held-out MSCOCO captions are, unsurprisingly, most similar to other MSCOCO captions: the COCO-self curve, peaking near 0.8, serves as a reference upper bound. Clues (peaking near 0.5) sit closer to literal captions than inferences do (peaking near 0.4), consistent with clues describing the same literal objects and actions that captions describe, while inferences go beyond literal image content. The overlap between the Inferences and Clues curves in the 0.2–0.6 range shows the two text types are not fully separable by caption similarity alone.
</details>
We ran additional analyses to explore the textual similarity between Sherlock's clues and inferences vs. literal image descriptions. For 2K images, we computed textual similarity using S-BERT [53] cosine similarity between MS COCO captions and Sherlock clues/inferences; results are shown in Fig. 9. As a baseline, we include COCO self-similarity computed against held-out captions. Clues are more similar to COCO captions than inferences are, presumably because clues reference the same types of literal objects/actions that are described in literal captions.
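The similarity computation can be sketched with plain NumPy, assuming sentence embeddings (e.g., from SentenceBERT) have already been computed offline; the function names and the max-over-captions aggregation below are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def similarity_to_coco(text_emb, caption_emb):
    """For each clue/inference embedding, its max cosine similarity
    to any of the image's COCO caption embeddings."""
    return cosine_sim_matrix(text_emb, caption_emb).max(axis=1)
```

Histogramming these per-text similarities for clues, inferences, and held-out captions yields the three density curves in Fig. 9.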
## E Comparison Human Evaluation Set Details
We aim to sample a diverse and plausible set of candidate inferences for images to form our comparison set. Our process is a heuristic effort designed to elicit 'interesting' annotations from human raters. Even if the process is imperfect at surfacing interesting candidates, the resulting set is still a valid representation of human judgment, because we show each inference to annotators and ask them to rate its plausibility directly. We start by assuming all inferences could be sampled for a given image+region, and then filter according to several heuristics.
First, we use a performant RN50x16 checkpoint as a means of judging plausibility of inferences. This checkpoint achieves 18.5/20.6/31.5 im2txt/txt2im/P@1 respectively on retrieval on v1.0 of the Sherlock corpus; this is comparable to the RN50x16 checkpoint we report performance on in our main results section. We use this checkpoint to score all validation/test (image+region, inference) possibilities.
Global filters. We assume that if the model already retrieves the ground-truth inference with high accuracy, the instance is probably less interesting: for each image, we disqualify all inferences that receive a lower plausibility estimate from our RN50x16 checkpoint than the ground-truth inference does (this also discards the ground-truth inference itself). This step ensures that the negative inferences we sample are, according to the model, more plausible than the ground truth. Next, we reduce repetitiveness of our inference texts in two ways. First, we perform the same semantic de-duplication via hierarchical clustering as described in § 3: clustering is computed on SentenceBERT [53] representations of inferences ( all-MiniLM-L6-v2 ). We compute roughly 18K clusters (corresponding to 80% of the dataset size) and sample a single inference from each cluster: this removes 20% of the corpus from consideration while maintaining diversity, because each of the 18K clusters is represented. Second, we perform a hard de-duplication by allowing at most three verbatim copies of each inference to be sampled.
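The two de-duplication steps can be sketched as follows; this minimal illustration assumes cluster assignments from the hierarchical clustering are already available, and the helper names are ours:

```python
import random
from collections import defaultdict

def sample_one_per_cluster(inferences, cluster_ids, seed=0):
    """Semantic de-dup: keep one inference per (precomputed) cluster."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for inf, cid in zip(inferences, cluster_ids):
        buckets[cid].append(inf)
    return [rng.choice(members) for members in buckets.values()]

def cap_verbatim_copies(inferences, max_copies=3):
    """Hard de-dup: allow at most `max_copies` verbatim copies of a text."""
    counts = defaultdict(int)
    kept = []
    for inf in inferences:
        if counts[inf] < max_copies:
            counts[inf] += 1
            kept.append(inf)
    return kept
```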
Local filters. After these global filters, we begin the iterative sampling process for each image+region. If, after all filtering, a given image+region has fewer than 20 candidates to select from, we do not consider it further. Then, in a greedy fashion, we build up the candidate set by selecting the remaining inference with i) the highest model plausibility that is ii) maximally dissimilar to the inferences already sampled for this image, according to the SentenceBERT representations. Both objectives are cosine similarities in vector spaces (one between image and text, one between text and text). We assign weights so that the image-text similarity (corresponding to RN50x16 plausibility) is 5x more important than the text-text dissimilarity (corresponding to SentenceBERT diversity). After iteratively constructing a diverse and plausible set of 10 inferences for a given image under this process, we globally disqualify the sampled inferences so that no inference is sampled more than once per image (unless it is a verbatim duplicate, in which case it may be sampled up to 3 times).
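A minimal sketch of the greedy selection, assuming precomputed plausibility scores and text embeddings; the 5x weighting mirrors the description above, but the exact scoring form (a weighted difference of the two cosine terms) is an illustrative assumption:

```python
import numpy as np

def greedy_candidate_set(plausibility, text_emb, k=10, w_plaus=5.0):
    """Greedily build a candidate set: at each step, pick the inference
    with the best trade-off of model plausibility (weighted 5x) against
    max cosine similarity to the inferences already picked."""
    plaus = np.asarray(plausibility, dtype=float)
    emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    k = min(k, len(plaus))
    picked = [int(np.argmax(plaus))]           # seed with the most plausible
    while len(picked) < k:
        sim_to_picked = (emb @ emb[picked].T).max(axis=1)
        score = w_plaus * plaus - sim_to_picked
        score[picked] = -np.inf                # no repeats
        picked.append(int(np.argmax(score)))
    return picked
```

With a large `w_plaus`, plausibility dominates; shrinking it makes the diversity term decisive.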
Finally, for every image for which we can sample a set of 10 inferences, we sort the images by how promising the sets are collectively, using a weighted sum of: the (globally ranked) average length of the sampled inferences, the (globally ranked) diversity of the set of 10 (measured by mean all-pairs SentenceBERT cosine similarity; lower = more diverse), and 5x the (globally ranked) average plausibility according to RN50x16 . We collect 2 human judgments for each of the 10 inferences for the top 500 images from each of the val/test sets (1K images total) according to this heuristic ranking. The resulting 20K human judgments formed v1 of the Sherlock comparison corpus; v1.1 has 19K judgments.
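The final image-level ordering can be sketched as a weighted sum of global ranks; this rank-then-sum formulation is one illustration consistent with the description, not the authors' exact code:

```python
import numpy as np

def rank_images(avg_len, diversity_sim, avg_plaus, w_plaus=5.0):
    """Order candidate sets, best first, by a weighted sum of global ranks:
    longer inferences, lower mean pairwise similarity (= more diverse),
    and (weighted 5x) higher model plausibility all rank better."""
    def to_rank(values, ascending):
        v = np.asarray(values, dtype=float)
        order = np.argsort(v if ascending else -v)
        ranks = np.empty(len(v))
        ranks[order] = np.arange(len(v))
        return ranks                              # 0 = best
    score = (to_rank(avg_len, ascending=False)
             + to_rank(diversity_sim, ascending=True)
             + w_plaus * to_rank(avg_plaus, ascending=False))
    return list(np.argsort(score))                # lowest total rank first
```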
Crowdworking details. For the comparison task, we designed an additional HIT to collect human feedback on the retrieved inferences. In the HIT, workers were presented with images in which the appropriate clue region was highlighted. They were then shown the inferences and asked to rate each on a Likert scale of 1-3: 1 for 'irrelevant' or 'verifiably incorrect', 2 for 'statement is probably true, but there is a better highlighted region to support it', and 3 for 'statement is probably true and the highlighted region supports it'. A sample evaluation HIT is shown in Fig. 13. Human agreement on this setup is reported as accuracy in § 5.1.
## F Datasheet for Sherlock
In this section, we present a Datasheet [14,4] for Sherlock .
## 1. Motivation for Datasheet Creation
- Why was the dataset created? Sherlock was created to support the study of visual abductive reasoning. Broadly speaking, in comparison to corpora which focus on concrete, objective facets depicted within visual scenes (e.g., the presence/absence of objects), we collected Sherlock with the goal of better understanding the types of abductive inferences that people make about images. All abductive inferences carry uncertainty. We aim to study the inferences we collect, but do not endorse their objectivity, and do not advocate for use cases that risk perpetuating them.
- Has the dataset been used already? The annotations we collect are novel, but the images are sourced from two widely-used, existing datasets: Visual Genome [29] and VCR [75].
- What (other) tasks could the dataset be used for? Aside from our retrieval/localization setups, Sherlock could be useful as a pretraining corpus for models that aim to capture information about what people might assume about an image, rather than what is literally depicted in that image. One potentially promising case: if a malicious actor were posting emotionally manipulative content online, it might be helpful to study the types of assumptions people might make about their posts, rather than the literal contents of the post itself.
- Who funded dataset creation? This work was funded by DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the DARPA SemaFor program, and the Allen Institute for AI.
## 2. Data Composition
- What are the instances? We refer to the instances as clues/inferences, which are authored by crowdworkers. As detailed in the main text of the paper, a clue is a bounding box coupled with a free-text description of the literal contents of that bounding box. An inference is an abductive conclusion that the crowdworker thinks could be true about the clue.
- How many instances are there? There are 363K commonsense inferences grounded in 81K Visual Genome images and 22K VCR images.
- What data does each instance consist of? Each instance contains three things: a clue (a short English literal description of a portion of the image), an inference (a short English description of a conclusion associated with the clue that aims to be not immediately obvious from the image content), and a bounding box specifying the region of interest.
- Is there a label or target associated with each instance? We discuss in the paper several tasks, which involve predicting inferences, bounding boxes, etc.
- Is any information missing from individual instances? Not systematically - in rare circumstances, we had to discard some instances because of malformed crowdworking inputs.
- Are relationships between individual instances made explicit? Yes - the annotations for a given image are all made by the same annotator and are aggregated based on that.
- Does the dataset contain all possible instances or is it a sample? This is a natural language sample of abductive inferences; it would probably be impossible to enumerate all of them.
- Are there recommended data splits? Yes, they are provided.
- Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. Yes: some annotations are repeated by crowdworkers. When we collected the corpus of Likert judgments for evaluation, we performed both soft and hard deduplication steps, ensuring that the text people were evaluating wasn't overly repetitive.
- Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? It links to the images provided by Visual Genome and VCR. If images were removed from those corpora, our annotations would no longer be grounded.
## 3. Collection Process
- What mechanisms or procedures were used to collect the data? We collected data using Amazon Mechanical Turk.
- How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred or derived from other data? Paid crowdworkers provided the annotations.
- If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? We downsample common image types via a semantic deduplication step. Specifically, some of our crowdworkers rightfully pointed out that it is difficult to say interesting things about endless pictures of zebras; these types of images are common in Visual Genome. So, we performed hierarchical clustering on the images from that corpus and then sampled 1 image from each of 80K clusters. The result is a downsampling of images with similar feature representations. We stopped receiving comments about zebras after this deduplication step.
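The step above uses hierarchical clustering over image features; as a lightweight stand-in that illustrates the same idea, a greedy near-duplicate filter keeps an image only if it is sufficiently dissimilar from every image kept so far (the feature vectors and threshold value here are illustrative assumptions):

```python
import numpy as np

def greedy_image_dedup(features, threshold=0.9):
    """Keep an image only if its cosine similarity to every
    already-kept image is below `threshold`; returns kept indices."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept = []
    for i in range(len(feats)):
        if not kept or float((feats[kept] @ feats[i]).max()) < threshold:
            kept.append(i)
    return kept
```

Two near-identical zebra images would collapse to one kept index, while a visually distinct image survives the filter.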
- Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? Crowdworkers constructed the corpus via a Mechanical Turk HIT we designed. Our target was to pay $15/hour. A post-hoc analysis revealed that crowdworkers were paid a median of $12/hr and a mean of $16-20/hour, depending on the round.
- Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The main data was collected in February 2021.
## 4. Data Preprocessing
- Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? Yes, significant preprocessing was conducted. The details are in
- Was the 'raw' data saved in addition to the preprocessed, cleaned, labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the 'raw' data. The concept of 'raw' data is difficult to specify in our case. We detail the data we release in the main body of the paper.
- Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point. We plan to release some software related to modeling, and we also provide appendices that detail the crowdworking labeling efforts.
- Does this dataset collection/processing procedure achieve the motivation for creating the dataset stated in the first section of this datasheet? If not, what are the limitations? We think so. It's difficult to fully specify the abductive reasoning process of humans. But we think our work goes a step beyond existing corpora.
## 5. Dataset Distribution
- How will the dataset be distributed?
The dataset is available at http://visualabduction.com/ .
- When will the dataset be released/first distributed? What license (if any) is it distributed under?
The dataset is released under CC-BY 4.0 and the code is released under Apache 2.0.
- Are there any copyrights on the data?
The copyright for the new annotations is held by AI2 with all rights reserved.
- Are there any fees or access restrictions?
No - our annotations are freely available.
## 6. Dataset Maintenance
- Who is supporting/hosting/maintaining the dataset?
The dataset is hosted and maintained by AI2.
- Will the dataset be updated? If so, how often and by whom?
We do not currently have plans to update the dataset regularly.
- Is there a repository to link to any/all papers/systems that use this dataset?
No, but if future work finds this work helpful, we hope they will consider citing this work.
- If others want to extend/augment/build on this dataset, is there a mechanism for them to do so?
People are free to remix, use, extend, build, critique, and filter the corpus: we would be excited to hear more about use cases either via our github repo, or via personal correspondence.
## 7. Legal and Ethical Considerations
- Were any ethical review processes conducted (e.g., by an institutional review board)?
Crowdworking studies that involve no personal disclosures and use standard computer vision corpora are not required to be reviewed by our IRB. While we are not lawyers, this opinion is based on United States federal regulation 45 CFR 46, under which this study qualifies as exempt and does not require IRB review.
- (a) We do not collect personal information. Information gathered is strictly limited to general surveys probing general world knowledge.
- (b) We take precautions to anonymize Mechanical Turk WorkerIDs such that the identity of the human subjects cannot be readily ascertained (directly or indirectly).
- (c) We do not record or include any interpersonal communication or contact between investigator and subject.
Specifically:
- We do not have access to the underlying personal records and will record information in such a manner that the identity of the human subject cannot readily be ascertained.
- Information generated by participants is non-identifying without turning over the personal records attached to these worker IDs.
- We do not record or include any interpersonal communication or contact between investigator and subject.
- Does the dataset contain data that might be considered confidential?
Potentially, yes. Most of the content in the corpus that would be considered potentially private/confidential would likely be depicted in the images of Visual Genome (VCR are stills from movies where actors onscreen are presumably aware of their public actions). While we distribute no new images, if an image is removed from Visual Genome (or VCR), it will be removed from our corpus as well.
- Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
As detailed in the main body of the paper, we searched for toxic content using a mix of close reading of instances and the Perspective API from Google. In doing so, we identified a small fraction of instances that could be construed as offensive. For example, in a sample of 30K instances, we discovered 6 cases that are arguably offensive (e.g., stigmatizing depicted people's weight based on visual cues). Additionally, some of the images from VCR, gathered from popular movies, can depict potentially offensive/disturbing content. The scenes can be 'R-rated': some images depict movie violence with zombies, and some of the movies have Nazis as villains, so some of the screenshots depict Nazi symbols. We reproduce VCR's content warning about such imagery in § A.2.
- Does the dataset relate to people?
Yes: the corpus depicts people, and the annotations are frequently abductive inferences that relate to people. As detailed in the main body of the paper, 36% of inferences (or more) are grounded on people, and many inferences that are not directly grounded on people may relate to them. Moreover, given that we aim to study abduction, which is an intrinsically subjective process, the annotations themselves are, at least in part, reflections of the annotators themselves.
- Does the dataset identify any subpopulations (e.g., by age, gender)?
We don't explicitly disallow identification by gender or age: in the clues/inferences, people often use gendered pronouns or age-related language in reference to people who are depicted (e.g., 'the old man'). Furthermore, while we undertook the sampled/statistical toxicity analysis detailed in the main body of the paper, we have not manually verified that all 363K clue/inference pairings are free of any reference to a subpopulation. For example, we observed one case wherein an author speculated that an individual's country of origin was Morocco, clued by the observation that they were wearing a fez. Like the other observations in our corpus, this is not necessarily an objectively true inference, even if the fez is a hat worn in Morocco.
- Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?
The data collection process specifically instructs workers to avoid identifying any individual in particular (e.g., actors in movie scenes). Instead, they are specifically instructed to use general identifiers to describe people (e.g. 'student', 'old man', 'engineer'). In our experience with working with the corpus, we haven't encountered any instances where our annotators specifically identified anyone, e.g., by name. The images contained in VCR and Visual Genome that we source from do contain uncensored images of faces. But, if images are removed from those corpora, they will be removed from Sherlock as well, as we do not plan to re-host the images ourselves.
<details>
<summary>Image 17 Details</summary>

### Visual Description
## Screenshot: Task Instructions for Image Analysis
### Overview
The image displays a structured task instruction page for a user study or data annotation task. The content is organized into sections with clear headings, bullet points, and formatting (bold, colored text) to guide participants through a two-part process involving image analysis and clue/indication extraction.
---
### Components/Axes
- **Header**: Blue banner with text "Instructions (click to expand/collapse)" and a thank-you message for participating in a "HIT" (likely Amazon Mechanical Turk task).
- **Main Content**:
- **Section 1**: "Your task" with a directive to analyze an image for observable clues and indications.
- **Section 2**: "PART 1" with three steps:
1. Identify 3 observable clues (e.g., "an open algebra math workbook").
2. Draw bounding boxes around clues.
3. Repeat steps 1–2 for all observations.
- **Section 3**: "PART 2" with instructions to provide indications (interpretations of clues) and rate their likelihood (certain, likely, possible).
- **Bonus Opportunity**: Up to 2 additional clue/indication sets for bonus pay.
- **Rules**: Six numbered guidelines for clue/indication formatting, including:
- Use noun phrases for clues (e.g., "the book under the table").
- Avoid contradictions in indications.
- Exclude plain descriptions of actions or thoughts.
- Use weather observations if salient.
- Avoid gendered pronouns.
- Review examples and "How to Pick Good Clues/Indications" section.
---
### Detailed Analysis
#### Textual Content
- **Header**:
- "Instructions (click to expand/collapse)"
- "Thanks for participating in this HIT!"
- **Your Task**:
- "In this task, we are asking you to put on your detective thinking cap. Given an image, find observable clues that might indicate information about a person, situation, or setting that may not be necessarily obvious in the image (we will call this indication)."
- **PART 1**:
- "Examine the image and find 3 observable clues."
- "An observable clue MUST be something in the picture (e.g., an open algebra math workbook)."
- Steps 1–3 emphasize iterative analysis and bounding box annotation.
- **PART 2**:
- "For each observable clue, provide an indication."
- Indications are non-obvious interpretations (e.g., "an open algebra math workbook might indicate a high school student studying").
- Likelihood ratings: certain, likely, possible.
- **Bonus Opportunity**: Up to 2 additional clue/indication sets for bonus pay.
- **Rules**:
1. **Observable Clues**: Noun phrases with spatial details (e.g., "the book under the table").
2. **Indications**: Complete sentences, realistic, non-contradictory.
3. Exclude plain descriptions of actions/thoughts.
4. Use weather observations if relevant.
5. Avoid gendered pronouns (use "they" if needed).
6. Review examples and "How to Pick Good Clues/Indications" section.
---
### Key Observations
- **Formatting**: Critical terms like "observable clues" (blue) and "indication" (orange) are color-coded for emphasis.
- **Iterative Process**: Participants must analyze images in two phases: clue identification (Part 1) and interpretation (Part 2).
- **Quality Control**: Rules enforce specificity (e.g., spatial details for clues) and consistency (e.g., avoiding contradictions).
- **Incentive Structure**: Bonus pay for additional clue/indication sets encourages thoroughness.
---
### Interpretation
This task appears to be part of a crowdsourced data collection effort, likely for training machine learning models or validating human reasoning. Participants are asked to:
1. **Identify Clues**: Extract explicit, observable details from images (e.g., objects, settings).
2. **Generate Indications**: Infer implicit meanings or contexts (e.g., linking a math workbook to a student).
3. **Rate Certainty**: Assign likelihood scores to indications to gauge confidence.
The structured rules ensure data quality by standardizing clue descriptions (noun phrases with spatial context) and filtering out irrelevant or contradictory interpretations. The bonus opportunity incentivizes participants to provide deeper analysis, potentially enriching the dataset. The task’s focus on "non-obvious" clues suggests an emphasis on uncovering latent patterns or contextual inferences, which could be critical for applications like scene understanding or behavioral prediction.
</details>
Fig. 10: Instructions for Sherlock data collection HIT.
<details>
<summary>Image 18 Details</summary>

### Visual Description
## Screenshot: Web-Based Annotation Interface for Object Detection
### Overview
The image depicts a web interface for a multi-step annotation task involving object detection and contextual analysis. The interface is divided into two primary sections: **Part 1** (observation and bounding box creation) and **Part 2** (contextual indication filling). The example image shows a person flying a kite in a park with the Lincoln Memorial in the background.
---
### Components/Axes
#### Part 1: Make Your Observations and Bound Them in Boxes
- **Instructional Text**:
- "Observe image below, then:"
- Step 1: Choose observation number from dropdown (default: 1) and write observed clues in text field.
- Step 2: Draw bounding boxes by clicking/dragging; 1-3 boxes allowed. Remove boxes via "x" in corner.
- Step 3: Repeat steps 1-2 for additional observations.
- **UI Elements**:
- Dropdown labeled "Observation #" (options: 1-5, with 1 pre-selected).
- Text field labeled "I spy..." for observed clues.
- Example image thumbnail (kite-flying scene) with zoom selection tool.
- "Reload" button for image refresh.
#### Part 2: Fill in the Indications
- **Observation 1 (Required)**:
- Header: Pink background with "Observation 1 (required)".
- Fields:
- "I spy..." (text input).
- "It might indicate that..." (text input).
- Certainty radio buttons: "possible," "likely," "certain" (with "possible" pre-selected).
- **Observation 2 (Required)**:
- Header: Teal background with "Observation 2 (required)".
- Identical structure to Observation 1.
- **Observation 3 (Required)**:
- Header: Brown background with "Observation 3 (required)".
- Identical structure to Observation 1.
- **Footer Note**: "Observations 1-3 are required; 4 & 5 are bonus/optional."
---
### Detailed Analysis
#### Part 1
- **Image Example**:
- Scene: Outdoor park with grass, trees, and the Lincoln Memorial (white neoclassical building with columns) in the background.
- Foreground: Person (back to camera) wearing dark jacket and jeans, holding a kite string. Two kites visible in the sky.
- Additional elements: American flag on a pole, distant pedestrians, and a clear blue sky.
- **Bounding Box Instructions**: Users must manually draw boxes around key objects (e.g., person, kite, monument) using click-and-drag. Boxes are not required to be perfect.
#### Part 2
- **Contextual Analysis Fields**:
- Each observation requires a textual description of the observed object ("I spy...") and a hypothesis about its contextual significance ("It might indicate that...").
- Certainty levels ("possible," "likely," "certain") suggest a probabilistic framework for annotations, likely for training machine learning models.
---
### Key Observations
1. **Structured Annotation Workflow**: The task enforces a strict sequence: observe → box → contextualize.
2. **Certainty Calibration**: The inclusion of "possible/likely/certain" options implies a need for confidence scoring in annotations.
3. **Bonus Observations**: Optional fields (4 & 5) suggest flexibility for advanced users or additional data collection.
4. **Example Image Complexity**: The kite-flying scene includes multiple overlapping elements (person, kite, monument, flag), testing the annotator’s ability to isolate key objects.
---
### Interpretation
This interface is designed for **computer vision training**, where annotators label objects (via bounding boxes) and infer contextual relationships (via text). The certainty levels ("possible/likely/certain") may map to confidence scores in a machine learning pipeline. The example image’s complexity (multiple objects, background elements) highlights the challenge of distinguishing foreground vs. background in real-world scenarios. The structured workflow ensures consistency in data collection, critical for model training. The "I spy..." and "It might indicate that..." fields bridge visual and semantic understanding, enabling models to learn both object recognition and contextual reasoning.
</details>
Fig. 11: Template setup for Sherlock data collection HIT. Instructions are shown in Fig. 10.
<details>
<summary>Image 19 Details</summary>

### Visual Description
## Screenshot: HIT Task Instructions for Image-Observation Evaluation
### Overview
This image depicts a task instruction page for a Human Intelligence Task (HIT) on Amazon Mechanical Turk. The interface guides workers to evaluate pairs of images and observations by assessing the appropriateness of bounding boxes, the reasonableness of the observation, and the interest level of the observation. The layout includes a collapsible header, structured task steps, an example, and a concluding note.
### Components/Axes
- **Header**: Blue bar with collapsible "Instructions" text.
- **Main Content**:
- **Task Description**: Text outlining the worker's responsibilities.
- **Task Steps**:
1. **Bounding Box Appropriateness**:
- Options: *Appropriate*, *Mostly Appropriate*, *Entirely Off*.
- Criteria: Coverage of key elements (e.g., "flowers" with 1-3 boxes).
2. **Observation Reasonableness**:
- Options: *Highly Reasonable*, *Relatively Reasonable*, *Unreasonable*.
- Criteria: Logical connection between image and observation.
3. **Observation Interest**:
- Options: *Very Interesting*, *Interesting*, *Caption-like*, *Not At All Interesting*.
- **Example**: Textual scenario involving *Harry Potter* and *Dumbledore*.
- **Note**: Yellow-highlighted advisory to avoid overthinking answers.
### Detailed Analysis
- **Bounding Box Appropriateness**:
- *Appropriate*: All key elements are boxed (e.g., "flowers" with 1-3 boxes).
- *Mostly Appropriate*: Most elements boxed but missing some key elements.
- *Entirely Off*: Boxes irrelevant or missing entirely.
- **Observation Reasonableness**:
- *Highly Reasonable*: Observation fully aligns with the image.
- *Relatively Reasonable*: Observation makes partial sense but lacks full agreement on details.
- *Unreasonable*: Observation is nonsensical for the image.
- **Observation Interest**:
- *Very Interesting*: Clever or astute observation.
- *Interesting*: Subjectively engaging observation.
- *Caption-like*: Descriptive but obvious (e.g., "states what’s happening").
- *Not At All Interesting*: Lacks engagement.
### Key Observations
- The task emphasizes **contextual reasoning** (e.g., accepting "mostly appropriate" bounding boxes if key elements are covered).
- The example illustrates **reasonableness evaluation** (e.g., a false observation in a movie context is still valid for uninformed viewers).
- The note discourages overanalysis, prioritizing **intuitive judgments**.
### Interpretation
This task design reflects a crowdsourcing workflow for training or validating computer vision models. Workers are asked to:
1. **Annotate images** by identifying key elements via bounding boxes.
2. **Validate observations** for logical consistency with the image.
3. **Assess engagement** to ensure observations are meaningful or novel.
The example highlights the importance of **contextual awareness** (e.g., distinguishing between factual accuracy and subjective reasonableness). The note suggests the task values **efficiency over perfection**, aligning with real-world scenarios where rapid, intuitive judgments are critical. The structured options reduce ambiguity, ensuring consistent data collection for downstream analysis (e.g., model training or quality control).
</details>
<details>
<summary>Image 20 Details</summary>

### Visual Description
## Photograph: Motocross Race Action Shot
### Overview
The image captures a dynamic motocross race scene with two riders mid-turn on a muddy track. A green horizontal line overlays the image, likely for technical analysis (e.g., trajectory or speed tracking). Spectators are visible in the background, and a partially obscured banner with text ("motmeis" and "mot") is present.
### Components/Axes
- **Foreground**: Two motocross riders on dirt bikes, mid-action.
- **Background**: Spectators, a yellow-and-red banner with text ("motmeis" in red, "mot" in yellow).
- **Overlay**: A bright green horizontal line spanning the width of the image.
- **Track**: Muddy terrain with visible tire tracks and airborne dirt particles.
### Detailed Analysis
- **Rider 1 (Left)**:
- Bike number: **905** (white on orange bike).
- Gear: White helmet, black-and-white motocross suit.
- Position: Slightly behind Rider 2, leaning into the turn.
- **Rider 2 (Right)**:
- Bike number: **69** (black on green bike).
- Gear: Green-and-black motocross suit, white helmet with red accents.
- Position: Leading, with a more aggressive lean into the turn.
- **Banner Text**:
- "motmeis" (red, partially obscured by Rider 1’s bike).
- "mot" (yellow, partially obscured by Rider 2’s bike).
- **Green Line**:
- Horizontal, spans the entire width of the image.
- Positioned ~40% from the top, cutting through both riders’ midsections.
### Key Observations
- The green line’s placement suggests it may be used for motion analysis (e.g., tracking rider alignment or speed).
- Rider 2’s bike (69) is visibly ahead, indicating a competitive lead.
- The banner text ("motmeis" and "mot") likely references sponsors or event branding, though the full text is obscured.
- Mud and dirt particles in the air emphasize high-speed action and track conditions.
### Interpretation
The image highlights the intensity of motocross racing, with riders navigating a challenging turn on a muddy track. The green overlay line implies technical analysis, possibly for post-race performance evaluation. The banner text ("motmeis" and "mot") suggests sponsorship ties to motocross or motor sports, though the incomplete text limits definitive identification. Rider 2’s lead and aggressive posture contrast with Rider 1’s slightly trailing position, underscoring the race’s competitive nature. The airborne dirt and spectators in the background further contextualize the event as a high-energy, spectator-driven sport.
</details>
Fig. 12: Instructions and template setup for Sherlock data validation HIT.
<details>
<summary>Image 21 Details</summary>

### Visual Description
## Screenshot: Survey Interface for Observation Evaluation
### Overview
The image depicts a structured survey interface designed to evaluate observation pairs. It includes three sequential evaluation questions with multiple-choice options, each accompanied by highlighted keywords. The interface uses color-coded text to emphasize specific terms and phrases.
### Components/Axes
1. **Header Section**:
- **Label**: "Observation Pair" (bold, black text on dark gray background).
- **Content**:
- "I spy: a crowd watching the motorcyclists" (blue text).
- "It indicates that (likely) this is an event featuring professional and skilled riders" (orange text for "It indicates that," gray text for the rest).
2. **Bounding Box Appropriateness Question**:
- **Label**: "Are the bounding boxes appropriate for the observation pair?" (bold, white text on dark gray background).
- **Options**:
- "Appropriate" (bold, black text).
- "Mostly Appropriate (with some wrong or key missing elements)" (bold, black text).
- "Entirely Off (or missing)" (bold, black text).
- **Highlight**: "bounding boxes" (green text).
3. **Reasonableness Question**:
- **Label**: "Is the observation pair reasonable?" (bold, white text on dark gray background).
- **Options**:
- "Highly Reasonable (reasonable & I agree)" (bold, black text).
- "Relatively Reasonable (reasonable though I don't fully agree on details)" (bold, black text).
- "Unreasonable (makes little to no sense)" (bold, black text).
- **Highlight**: "observation pair" (yellow text).
4. **Interest Question**:
- **Label**: "How interesting is the observation?" (bold, white text on dark gray background).
- **Options**:
- "Very Interesting (clever, astute)" (bold, black text).
- "Interesting" (bold, black text).
- "Caption-like (just states what's obviously happening in the image)" (bold, black text).
- "Not At All Interesting" (bold, black text).
- **Highlight**: "observation" (yellow text).
### Detailed Analysis
- **Textual Structure**:
- The interface follows a top-down flow, with each section separated by horizontal dark gray bars.
- Key terms (e.g., "bounding boxes," "observation pair," "observation") are highlighted in green and yellow to draw attention.
- Parenthetical explanations clarify the intent of each option (e.g., "(reasonable & I agree)" for "Highly Reasonable").
- **Color Coding**:
- Blue text highlights the observation description ("I spy...").
- Orange text emphasizes the inference ("It indicates that...").
- Green and yellow highlights denote technical terms ("bounding boxes," "observation pair," "observation").
### Key Observations
1. The survey evaluates three dimensions:
- **Bounding box accuracy** (spatial alignment).
- **Reasonableness** (logical consistency).
- **Interest level** (engagement potential).
2. Parenthetical explanations provide context for each option, ensuring clarity.
3. Color highlights guide the user’s focus to critical terms.
### Interpretation
This interface is likely part of a user study or data annotation task, where participants assess the quality of generated observations (e.g., for computer vision or NLP systems). The structured questions ensure standardized evaluations, while color highlights and parenthetical notes reduce ambiguity. The progression from observation description to evaluation suggests a workflow for validating automated systems’ outputs.
**Note**: No numerical data or visual trends are present, as this is a textual survey interface.
</details>
<details>
<summary>Image 22 Details</summary>

### Visual Description
## Screenshot: Amazon Mechanical Turk HIT Task Interface
### Overview
This image depicts a user interface for a Human Intelligence Task (HIT) on Amazon Mechanical Turk. The task involves evaluating machine-generated statements about an image containing a highlighted region. Users must rate each statement as "Good," "Okay," or "Bad" based on its alignment with the image content, particularly the highlighted area.
### Components/Axes
- **Header Section**:
- Title: "Instructions (click to expand/collapse)"
- Text: "Thanks for participating in this HIT!"
- **Task Description**:
- Instructions for rating 10 machine-generated statements about an image with a highlighted region.
- Rating criteria:
- **Good**: Statement is true for the image, and the highlighted region is the best part supporting the conclusion.
- **Okay**: Statement could be true, but a different region would be better, or uncertainty exists.
- **Bad**: Statement is verifiably incorrect, irrelevant, or not justified by the image/region.
- **Important Note**: Users MUST base ratings on the highlighted region.
- **Notes**:
- Assess statements individually.
- Forgive minor spelling/grammar errors.
- **Image Section**:
- A photograph of a social gathering (e.g., a bar) with a highlighted Lite beer logo.
- Text overlay: "(Click on the image to view the original.)"
- **Rating Interface**:
- Two example machine statements labeled `Machine statement 1` and `Machine statement 2`.
- For each statement, three radio buttons for "Good," "Okay," and "Bad" with descriptive criteria.
### Detailed Analysis
- **Image Content**:
- A group of people in a bar setting.
- Highlighted region: A Lite beer logo (pink background with white text).
- Visible objects: Drinks, a cash register, and a menu.
- **Textual Content**:
- Example statements are placeholders (`${machine_statement_1}`, `${machine_statement_2}`).
- Rating criteria emphasize the highlighted region's relevance to the statement's validity.
- **UI Elements**:
- Expand/collapse buttons for instructions and examples.
- Radio buttons for rating options.
### Key Observations
1. The task prioritizes the highlighted region as the basis for evaluation.
2. Statements are assessed individually, even if they conflict with prior conclusions.
3. The interface allows for flexibility in rating (e.g., accepting both "The person’s a high school teacher" and "The person’s a professor" as "Good" or "Okay" if the image supports both interpretations).
4. Minor errors in statements (e.g., "man" vs. "men") are to be overlooked.
### Interpretation
This HIT task is designed to improve machine learning models by crowdsourcing human judgment on the relevance of generated statements to specific image regions. The emphasis on the highlighted region suggests the machine’s focus on key visual elements, and raters must determine whether the statements align with those elements. The example statements illustrate scenarios where contextual ambiguity (e.g., professions) requires raters to weigh visual evidence against textual claims. The task underscores the importance of spatial grounding in image-to-text alignment, as raters must reconcile abstract statements with concrete visual data.
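One plausible way to combine the Good/Okay/Bad judgments described above into a per-statement score is an ordinal mapping averaged across raters. The 2/1/0 scoring and the helper function below are assumptions for illustration, not the paper's actual aggregation scheme.

```python
from statistics import mean

# Hypothetical ordinal mapping for the three HIT rating options; the
# 2/1/0 values are an illustrative assumption.
RATING_SCORE = {"Good": 2, "Okay": 1, "Bad": 0}

def aggregate_ratings(ratings):
    """Average the rater judgments collected for one machine statement."""
    return mean(RATING_SCORE[r] for r in ratings)

# Three raters judged the same statement about the highlighted region.
score = aggregate_ratings(["Good", "Okay", "Good"])
```

A mean over an ordinal scale is a common, if lossy, choice; majority vote or per-rater agreement statistics would be equally reasonable alternatives.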
</details>
<details>
<summary>Image 23 Details</summary>

### Visual Description
## Annotated Photographs: Contextual Observations and Inferences
### Overview
The image comprises four annotated photographs, each depicting distinct scenes with overlaid text boxes, magnifying glass icons, and labels. These annotations provide contextual observations, inferences, and speculative conclusions about the subjects, environments, and activities within the images. The annotations use color-coded text boxes (pink, blue, green, yellow, orange) to categorize observations and assign likelihoods (e.g., "Likely," "Possibly").
---
### Components/Axes
1. **Main Labels**:
- Top-left: "Concerned look on face"
- Top-right: "Wall of drinks in the back"
- Middle: "Smoke, an outdoor gathering with food"
- Bottom-left: "A single family home across the street"
- Bottom-right: "A lot of architectural decoration and a grand entrance on a beautiful brick building"
2. **Magnifying Glass Captions**:
- Top-left: "Likely something is happening in the store"
- Top-right: "Possibly there is an airplane hangar beyond this station"
- Middle: "Possibly something is being grilled to eat at the party"
- Bottom-left: "Likely this is a residential neighborhood"
- Bottom-right: "Possibly this is a museum"
3. **Text Box Labels and Content**:
- **Pink Boxes** (e.g., "Business suit and coat worn on person"):
- "Likely this person just left work"
- "Likely her skin is sensitive"
- "A woman wearing a wide brim hat"
- **Blue Boxes** (e.g., "Covered wrapped in arms"):
- "Likely there's a baby in the cover"
- "Likely he needs to relax"
- "Smooth asphalt in the driveway"
- **Green Boxes** (e.g., "Smoke, an outdoor gathering with food"):
- "Possibly something is being grilled to eat at the party"
- "A lot of people gathered, tables with food, a colorful sign"
- "A woman is holding hand with a man walking down the pavement"
- **Yellow Boxes** (e.g., "Wall of drinks in the back"):
- "Likely this is a store"
- "Likely this is a lunch party"
- **Orange Boxes** (e.g., "Wet pavement"):
- "Definitely it is raining"
- "Possibly this is a museum"
---
### Detailed Analysis
1. **Top-left Image**:
- A woman in a business suit holds a baby.
- **Observations**:
- Pink box: "Business suit and coat worn on person" → "Likely this person just left work."
- Blue box: "Covered wrapped in arms" → "Likely there's a baby in the cover."
2. **Top-right Image**:
- A train station with a green train and distant wing of an airplane.
- **Observations**:
- Orange box: "Wall of drinks in the back" → "Likely this is a store."
- Green box: "Wing of airplane in distance" → "Possibly there is an airplane hangar beyond this station."
3. **Middle Image**:
- Outdoor gathering with people, tables, and smoke.
- **Observations**:
- Green box: "Smoke, an outdoor gathering with food" → "Possibly something is being grilled to eat at the party."
- Pink box: "A woman wearing a wide brim hat" → "Likely her skin is sensitive."
- Blue box: "A man smoking a cigarette" → "Likely he needs to relax."
4. **Bottom-left Image**:
- A child near a hedge and asphalt driveway.
- **Observations**:
- Pink box: "A big hedgerow next to asphalt" → No explicit inference.
- Green box: "A single family home across the street" → "Likely this is a residential neighborhood."
- Orange box: "Wet pavement" → "Definitely it is raining."
5. **Bottom-right Image**:
- A grand brick building with parked cars and pedestrians.
- **Observations**:
- Orange box: "A lot of architectural decoration and a grand entrance on a beautiful brick building" → "Possibly this is a museum."
- Green box: "A woman is holding hand with a man walking down the pavement" → "Likely they are husband and wife."
- Blue box: "Some cars parked on the side of the street with tall buildings around it" → No explicit inference.
---
### Key Observations
- **Likelihood Indicators**:
- "Likely" (pink/blue boxes) suggests high confidence in observations (e.g., "Likely this person just left work").
- "Possibly" (green/orange boxes) indicates speculative inferences (e.g., "Possibly this is a museum").
- **Recurring Themes**:
- Human activity (e.g., "Likely he needs to relax," "Likely they are husband and wife").
- Environmental context (e.g., "Wet pavement," "Smooth asphalt").
- **Color Coding**:
- Pink/blue boxes focus on human subjects and immediate actions.
- Green/yellow/orange boxes emphasize environmental or contextual details.
---
### Interpretation
The annotations function as a hybrid of observational notes and speculative reasoning, using color and likelihood labels to structure interpretations. For example:
- The use of "Likely" in pink/blue boxes ties human behavior (e.g., wearing a business suit, holding a baby) to plausible conclusions (e.g., "just left work," "there's a baby in the cover").
- "Possibly" in green/orange boxes highlights uncertain but contextually plausible inferences (e.g., "airplane hangar," "museum").
- Environmental cues (e.g., "wet pavement," "smooth asphalt") ground observations in physical reality, while architectural details (e.g., "grand entrance") suggest institutional or public spaces.
The annotations collectively demonstrate how visual cues (e.g., clothing, weather, architecture) are used to infer narratives about people, places, and activities. The absence of explicit legends implies a standardized color-coding system for categorizing observations (e.g., pink for human subjects, green for environmental context).
</details>
Fig. 13: Instructions and template setup for Sherlock model evaluation HIT.
<!-- image -->
<!-- image -->
Fig. 14: Examples of clue and inference pair annotations in Sherlock over images from Visual Genome and VCR. For each observation pair, an inference (speech bubble) is grounded in a concrete clue (color bubble) present in the image. A confidence score (in order of decreasing confidence: 'Definitely' > 'Likely' > 'Possibly') for each inference is shown in yellow.
<!-- image -->
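The confidence ordering in the Fig. 14 caption ('Definitely' > 'Likely' > 'Possibly') can be encoded ordinally, e.g., to sort inferences by annotator confidence. The numeric ranks below are an assumption for illustration only.

```python
# Illustrative ordinal encoding of the confidence markers attached to
# Sherlock inferences; the numeric values are an assumption, chosen only
# to reproduce the ordering 'Definitely' > 'Likely' > 'Possibly'.
CONFIDENCE_RANK = {"Definitely": 3, "Likely": 2, "Possibly": 1}

inferences = [
    ("Possibly this is a museum", "Possibly"),
    ("Definitely it is raining", "Definitely"),
    ("Likely this is a residential neighborhood", "Likely"),
]

# Sort most-confident first.
ordered = sorted(inferences, key=lambda p: CONFIDENCE_RANK[p[1]], reverse=True)
```

Any strictly decreasing assignment of ranks would serve equally well; only the relative order matters.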