Toward Safe and Responsible AI Agents:
A Three-Pillar Model for
Transparency, Accountability, and Trustworthiness
Edward C. Cheng
echeng04@stanford.edu
Jeshua Cheng
jeshua.cheng@inquiryon.com
Alice Siu
asiu@stanford.edu
Abstract – This paper presents a conceptual and operational framework for developing and operating safe and trustworthy AI agents based on a Three-Pillar Model grounded in transparency, accountability, and trustworthiness. Building on prior work in Human-in-the-Loop systems, reinforcement learning, and collaborative AI, the framework defines an evolutionary path toward autonomous agents that balances increasing automation with appropriate human oversight. The paper argues that safe agent autonomy must be achieved through progressive validation, analogous to the staged development of autonomous driving, rather than through immediate full automation. Transparency and accountability are identified as foundational requirements for establishing user trust and for mitigating known risks in generative AI systems, including hallucinations, data bias, and goal misalignment, such as the inversion problem. The paper further describes three ongoing work streams supporting this framework: public deliberation on AI agents conducted by the Stanford Deliberative Democracy Lab, cross-industry collaboration through the Safe AI Agent Consortium, and the development of open tooling for an agent operating environment aligned with the Three-Pillar Model. Together, these contributions provide both conceptual clarity and practical guidance for enabling the responsible evolution of AI agents that operate transparently, remain aligned with human values, and sustain societal trust.
Keywords— Generative AI, AI Agent, Human-in-the-Loop, HITL, RLHF, Responsible AI, Trustworthy AI
1. Introduction
The emergence of AI agents marks a new phase in the evolution of generative AI. While traditional chatbots focus on generating text-based responses, AI agents extend this capability into real-world action. These systems can execute tasks, reason over goals, and make decisions on behalf of humans. This shift from text generation to autonomous task execution holds the key to unlocking the economic and practical value of generative AI. Yet, as these systems gain autonomy and agency, the risks of error, bias, and misalignment also multiply. When AI agents make consequential real-life decisions, such as transferring funds, filling drug prescriptions, drafting contracts, or guiding robotic actions, their mistakes may lead to financial losses, privacy breaches, or even physical harm. These errors may arise from training biases, lack of situational context, hallucinated reasoning, or misalignment between user intent and model objectives. Consequently, the field is confronted with an urgent challenge: how to ensure safe, transparent, and accountable AI agents that enhance productivity without compromising accuracy, trust, or human values.
A growing body of recent research has emerged to address this challenge, focusing on the Human-in-the-Loop (HITL) paradigm and its extensions as a means to govern, calibrate, and align AI agent behavior. These works explore how human expertise, oversight, and ethical grounding can be woven into the AI learning and action loop to produce systems that are both impactful and controllable. Collectively, they represent a growing consensus that human-AI collaboration, rather than full automation, is the most promising pathway toward efficient, effective, and safe AI agents that deliver higher productivity gains [1].
To organize this literature, we can group the representative surveys into three major thematic clusters that trace the conceptual evolution of safe AI agent design:
1. Foundational theories of human-in-the-loop AI and machine learning.
2. Operational frameworks and platforms for human-AI collaboration.
3. Emerging approaches for uncertainty alignment and human-governed AI agents.
1.1. Foundational Theories of Human-in-the-Loop AI
Early research established the theoretical and ethical foundations for integrating humans into the AI lifecycle. Zanzotto (2019) proposed Human-in-the-loop Artificial Intelligence (HitAI) as both a moral and structural correction to the unregulated growth of autonomous AI [2]. He argued that humans are not mere annotators but the original “knowledge producers” whose insights underpin AI performance and thus must remain central to both credit and control. Wu et al. (2022) expanded this notion through a systematic survey of HITL for machine learning, framing it as a data-centric methodology that unites human cognition with computational scalability. They demonstrated that effective human involvement improves labeling efficiency, interpretability, and robustness, forming the foundation for iterative feedback loops in model development [3].
Building on these theoretical bases, Mosqueira-Rey et al. (2023) presented a unifying taxonomy of Human-in-the-Loop Machine Learning (HITL-ML) paradigms [4]. They identified key interaction modes, including Active Learning, Interactive ML, Machine Teaching, Curriculum Learning, and Explainable AI. They revealed that human-AI relationships exist along a continuum of control: from machine-driven query optimization to human-driven knowledge transfer and interpretation. These early frameworks collectively redefined HITL not simply as supervision but as shared agency between human reasoning and machine inference, setting the epistemic groundwork for subsequent advances in safety and transparency.
Extending the HITL perspective beyond technical design, recent studies from MIT Sloan introduced a management-oriented framework known as AI Alignment. This paradigm emphasizes that model accuracy, reliability in real-world contexts, and stakeholder relevance must be achieved through continuous human engagement. It reframes human involvement not only as a safeguard but also as a means for organizations to learn and adapt as they deploy AI. Grounded in empirical case studies, this framework shows that practices such as expert feedback and stakeholder participation are essential for building safe, context-aware AI systems [5]. A complementary MIT Sloan study found that asking critical safety questions early in the AI development process helps prevent systemic errors and security vulnerabilities, further reinforcing the importance of proactive human oversight [6].
1.2. Operational Frameworks for Safe and Collaborative AI Agents
As Human-in-the-Loop principles matured, a second wave of research shifted toward practical frameworks and system architectures that enable effective human-AI collaboration in real-world, embodied environments. Bellos and Siskind (2025) exemplify this transition by introducing a structured evaluation framework, a multimodal dataset, and an augmented-reality (AR) AI agent designed to guide humans through complex physical tasks such as cooking and battlefield medicine. Their empirical studies demonstrate that interactive, context-aware guidance significantly improves task success rates, reduces procedural errors, and enhances user experience. Importantly, their results also show that exposure to AI-assisted guidance leads to measurable improvements in subsequent unassisted task performance, indicating that AI agents can support not only immediate task completion but also longer-term human skill acquisition. These findings position AI agents as collaborative partners that augment human capability rather than as purely automated systems [7].
In parallel, Mozannar et al. (2025) introduced Magentic-UI, an open-source user-interface platform for human-in-the-loop agentic systems. Built on Microsoft’s Magentic-One framework, it enables users to co-plan, co-execute, approve, and verify AI actions in complex digital tasks such as coding and document handling [8]. The platform embeds human oversight through structured, repeatable mechanisms. It supports co-planning, co-tasking, action approval, and answer verification, establishing a controlled environment for studying trust calibration, safety, and usability in AI agents. Together, these efforts move the field from abstract advocacy to practical system engineering, demonstrating that safety and transparency can be designed into agent interfaces, workflows, and orchestration protocols.
1.3. Emerging Approaches for Uncertainty-Aware and Human-Governed AI Agents
Recent work has deepened the mathematical and procedural foundations of safety and alignment. Retzlaff et al. (2024) surveyed the domain of Human-in-the-Loop Reinforcement Learning (HITL-RL), arguing that reinforcement learning (RL) inherently depends on human feedback and should be understood as a HITL paradigm. Their work outlined design requirements such as feedback quality, trust calibration, and explainability for moving from human-guided to human-governed learning [9]. Complementing this, Ren et al. (2023) proposed the KNOWNO (“Know When You Don’t Know”) framework for LLM-driven robotic planners to identify critical moments that require human involvement. By employing conformal prediction to quantify uncertainty, KNOWNO enables robots to detect when their confidence falls below a safety threshold and proactively request human input to ensure safe and reliable task execution [10]. This model of uncertainty alignment provides formal statistical guarantees on task success while minimizing unnecessary human intervention. This work represents a crucial step toward self-aware, help-seeking agents.
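To make the escalation mechanism concrete, the sketch below illustrates the general split-conformal recipe that KNOWNO builds on: calibrate a score threshold on held-out tasks, then escalate to a human whenever the resulting prediction set contains more than one plausible action. The scoring function, names, and numbers here are illustrative assumptions, not KNOWNO's actual implementation.

```python
import numpy as np

def calibrate_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Split conformal calibration: return the (1 - alpha)-quantile of
    nonconformity scores, with the standard finite-sample correction."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q, 1.0)))

def prediction_set(option_scores: dict[str, float], threshold: float) -> set[str]:
    """Keep every candidate action whose nonconformity score falls below
    the calibrated threshold (here: score = 1 - model confidence)."""
    return {a for a, conf in option_scores.items() if (1.0 - conf) <= threshold}

def act_or_ask(option_scores: dict[str, float], threshold: float) -> str:
    """Act autonomously only when the prediction set is a singleton;
    otherwise escalate to a human (an empty set also escalates)."""
    candidates = prediction_set(option_scores, threshold)
    if len(candidates) == 1:
        return f"execute: {candidates.pop()}"
    return f"ask human to choose among: {sorted(candidates) or sorted(option_scores)}"

# Example: nonconformity scores from held-out calibration tasks, then one decision.
cal = np.array([0.05, 0.12, 0.30, 0.08, 0.22, 0.15, 0.40, 0.10])
tau = calibrate_threshold(cal, alpha=0.2)
print(act_or_ask({"place order": 0.95, "cancel order": 0.55}, tau))
```

The statistical guarantee comes from the calibration step: under exchangeability, the true option falls inside the prediction set with probability at least 1 - alpha, so acting only on singletons bounds the task failure rate while keeping human requests to the ambiguous cases.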
At a broader institutional level, research from Harvard University has expanded the discussion of AI safety to include ethics, governance, and societal accountability. Allen et al. (2024) proposed a democratic model of power-sharing liberalism, emphasizing human flourishing, shared authority, and institutional accountability. They argued that AI governance must move beyond risk management to actively promote public goods, equality, and autonomy through inclusive participation and transparent oversight. Their framework identifies six core governance tasks: mitigating harm, managing emergent capabilities, preventing misuse, advancing public benefit, building human capital, and strengthening democratic capacity [11]. Complementing this perspective, Barroso and Mello (2024) examined AI as both a revolutionary and perilous force shaping humanity’s future, calling for a global governance framework grounded in human dignity, transparency, accountability, and democratic oversight [12]. Together, these contributions frame AI not as a force to restrain but as a catalyst for renewing democracy and reinforcing collective well-being.
Finally, Natarajan et al. (2025) reframed the entire discussion through the concept of AI-in-the-Loop (AI2L). Their analysis reveals that many systems labeled as HITL should be considered AI2L, where humans, not AI, remain the decision-makers. They argue that this distinction is critical for designing systems that emphasize collaboration over automation, human impact over algorithmic efficiency, and co-adaptive intelligence over substitution [13]. This reorientation marks a philosophical inflection point: moving from human-assisted AI to AI-assisted humanity.
1.4. Toward a Framework for Safe, Transparent AI Agents
Across these studies, a clear trajectory emerges. The field has progressed from recognizing the ethical necessity of human oversight, to engineering collaborative systems, to developing experimentally grounded mechanisms for uncertainty and governance. Collectively, these efforts affirm that the challenge of AI agent safety, transparency, and alignment is both urgent and tractable. Embedding humans as teachers, collaborators, and governors within the AI lifecycle consistently improves reliability and trustworthiness, yet fragmentation persists across methodologies and evaluation metrics.
This paper takes the next step, synthesizing these developments into a unified conceptual framework and a set of guiding principles that integrate HITL, AI2L, uncertainty alignment, and human-governed learning into a progressively improving autonomous environment. Together, these foundations define an operational setting for a new generation of AI agents that are transparent by design, collaborative by nature, and accountable in operation, with the explicit goal of enabling increasing levels of autonomy in a safe, controlled, and trustworthy manner.
2. The Evolution Path Toward Autonomous Agents
The vision of achieving fully autonomous AI agents represents one of the most ambitious goals in artificial intelligence. However, this vision cannot be realized in a single leap. It must evolve through progressive stages of validation and oversight, where human involvement is reduced only as confidence in the system’s performance and alignment grows through proven safety, reliability, and accountability. This evolutionary approach has clear precedents in other industries, particularly in the development of autonomous driving.
2.1. Lessons from Autonomous Driving
The field of autonomous driving provides an instructive example of how automation can evolve responsibly. Early driver-assist systems such as adaptive cruise control and lane-keeping support were designed to assist rather than replace human judgment. These systems required the driver to keep a foot on the pedals, hands on the wheel, and eyes on the road at all times. As perception models, control algorithms, and sensor fusion technologies advanced, vehicles began to handle more complex scenarios independently, such as automatic parking and highway lane changes. At this stage, the human driver could briefly disengage from active control but still had to monitor the road and be prepared to intervene if necessary.
[Figure: a four-panel diagram of decreasing driver engagement, from an attentive driver with eyes on the road, hands on the wheel, and a foot on the pedal, to a relaxed supervising driver, to a fully reclined occupant, to an empty seat with no driver present.]
Figure 1: Fully Autonomous Driving Emerged Through a Gradual, Evolving Process
This gradual and transparent evolution allowed engineers to identify edge cases, improve algorithms, and refine user interfaces based on real-world feedback. Most importantly, it allowed societal trust to grow incrementally. Each technological improvement was accompanied by clearer communication about the system’s limitations and capabilities. Drivers learned when to rely on the system and when to take over. Through testing, validation, and iterative learning, both the technology and its human users matured together. Only through this patient process did the industry approach Level 4 and Level 5 autonomy, where vehicles can operate without human intervention in most or all conditions [12, 13]. The success of this journey lies not only in technological innovation but also in earning human trust through transparency, communication of system limits, and clear accountability.
2.2. Parallels in AI Agent Development
A similar path must be followed in the evolution of autonomous AI agents. These systems act on behalf of humans in both digital and physical environments, making decisions that can have significant consequences. Just as early autonomous vehicles required drivers to remain attentive, current AI agents still depend on Human-in-the-Loop (HITL) oversight to ensure that their actions align with human intent. Human involvement serves as both a safeguard and a source of learning, helping the system adapt responsibly. As discussed in the introduction, research by Wu et al. (2022), Mosqueira-Rey et al. (2023), and Retzlaff et al. (2024) consistently shows that HITL systems improve interpretability, accountability, and model reliability [3, 4, 9]. Rather than viewing human oversight as administrative overhead, it should be recognized as a critical step in the learning and governance process that helps agents mature progressively.
2.3. HITL as a Mechanism for Trust and Safety
Human oversight is particularly essential during the intermediate stages of agent development and deployment. At this point, agents are capable of complex reasoning but still lack the contextual, ethical, and situational awareness required for independent operation [16]. Well-designed HITL mechanisms allow humans to validate outputs, correct errors, and prevent harm caused by hallucinations, data biases, or incorrect assumptions. This feedback loop not only safeguards users but also enables the system to learn and improve over time. As the system demonstrates consistent accuracy and reliability, the level of human intervention can be reduced. However, this reduction must be based on measurable improvements, not assumptions.
The importance of this gradual approach becomes even more evident in trust-sensitive domains such as finance, human resources, healthcare, legal services, and other areas that require regulatory compliance. Human oversight ensures shared responsibility between humans and AI, maintaining compliance with both legal standards and societal expectations. Just as self-driving systems underwent years of supervised testing before being trusted on public roads, autonomous AI agents must demonstrate reliability before operating independently in high-stakes environments. Yet their journey to full automation will likely unfold more rapidly, driven by the accelerating pace of AI research and development.
2.4. A Collaborative Path Toward Full Autonomy
The journey toward fully autonomous agents is both a technological and social process. Technological progress enables higher levels of independence, while social acceptance depends on observable safety and accountability. Research from Bellos (2025) and Mozannar (2025) has shown that when humans and AI collaborate effectively, the result is higher task success rates, improved trust, and greater user confidence [7, 8]. Collaboration thus provides a bridge between current assisted systems and the future of full autonomy.
This process can be viewed as four evolutionary stages of AI agency:
1. Assisted Agents: Humans make decisions while AI supports them through recommendations and reasoning.
2. Collaborative Agents: Humans and AI share responsibility in decision-making and task execution, combining human contextual understanding with AI computational precision and scalability. Human participation remains essential within the agentic workflow, as it enriches situational and semantic context, ensuring that AI agents produce responses and actions that are relevant, accurate, and aligned with user intent and real-world constraints [16].
3. Supervised Autonomy: AI operates independently in constrained environments while remaining accountable through human review.
4. Full Autonomy with Human Governance: AI functions independently within transparent, auditable frameworks that preserve human oversight at the policy level.
Advancement through these stages must be validated by evidence of safety, predictability, and alignment with human intent. This staged process mirrors the progression that made autonomous driving successful. Skipping these steps would risk premature deployment and loss of confidence, which could set back both innovation and adoption.
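As one illustration of how evidence-gated advancement might be operationalized, the sketch below encodes the four stages with hypothetical promotion gates. The thresholds, field names, and the explicit human sign-off flag are placeholder assumptions, not recommendations from this paper.

```python
from enum import IntEnum

class AutonomyStage(IntEnum):
    ASSISTED = 1
    COLLABORATIVE = 2
    SUPERVISED_AUTONOMY = 3
    FULL_AUTONOMY = 4

# Hypothetical promotion gates: minimum evidence required before an agent
# may advance one stage. All numbers are illustrative placeholders.
PROMOTION_GATES = {
    AutonomyStage.ASSISTED:            {"min_tasks": 100,  "min_success_rate": 0.95},
    AutonomyStage.COLLABORATIVE:       {"min_tasks": 500,  "min_success_rate": 0.98},
    AutonomyStage.SUPERVISED_AUTONOMY: {"min_tasks": 2000, "min_success_rate": 0.995},
}

def eligible_for_promotion(stage: AutonomyStage, tasks_completed: int,
                           success_rate: float, human_approved: bool) -> bool:
    """Advance only on measured evidence *and* explicit human sign-off,
    never by default."""
    gate = PROMOTION_GATES.get(stage)
    if gate is None:  # already at full autonomy
        return False
    return (tasks_completed >= gate["min_tasks"]
            and success_rate >= gate["min_success_rate"]
            and human_approved)

print(eligible_for_promotion(AutonomyStage.COLLABORATIVE, 650, 0.985, True))  # True
```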
2.5. Toward Trustworthy Autonomy
True autonomy cannot be declared by design; it must be demonstrated through experience and data. Each stage of progress should confirm that the agent can act responsibly and transparently within defined boundaries. Embedding Human-in-the-Loop principles throughout the development process ensures that autonomy and trust grow in tandem. As seen in autonomous driving, confidence arises from steady progress and accountable design. While AI agents may reach maturity more quickly due to faster digital feedback loops and lower physical risks, their path to autonomy must still be guided by the same principles of transparency, validation, and ethical oversight.
3. A Three-Pillar Model for a Safe AI-Agent Operating Environment
In the previous sections, we demonstrated that as AI systems evolve from passive chatbots to fully autonomous agents capable of acting on behalf of humans, the potential for both benefit and harm expands dramatically. In addition, as AI agents evolve to become increasingly independent of humans, their autonomy must emerge through a gradual, trust-building process in which human oversight and collaboration remain essential until AI systems demonstrate consistent reliability and alignment.
Building on these foundations, this section proposes that to enable this evolutionary process to unfold safely and productively, AI agents must operate within a structured environment designed to support growth, supervision, and accountability. Without such an environment, autonomous evolution would occur in an uncontrolled manner, exposing organizations and individuals to unacceptable risks.
To address this need, we propose a Three-Pillar Model (3PM) to support a safe AI-agent operating environment. This model defines the fundamental principles and environmental conditions required to develop, deploy, and operate safe autonomous agents while maintaining a balance between automation and human collaboration. The three pillars are:
1. Transparency of AI Agents ensures visibility into how agents operate across their life cycles.
2. Accountability in Decision-Making provides mechanisms to attribute and explain decisions made by both humans and AI.
3. Trustworthiness through Human-AI Collaboration establishes confidence in agentic systems through well-timed human oversight and fallback safeguards.
Together, these pillars create the foundation for a safe and productive ecosystem where AI agents and humans can share responsibilities and co-evolve toward higher levels of autonomy. They support the long-term goal of achieving responsible, human-aligned AI while ensuring that enterprises can realize measurable return on investment through efficient, reliable, and trustworthy automation.
3.1. Pillar One: Transparency and Building Trust with AI
Transparency provides the visibility necessary for humans to understand, monitor, guide, and audit agent behavior. It allows operators to know how the agent works, what it is doing, and why it acts in a particular way. This visibility is critical during the evolutionary path described earlier, because it enables humans to supervise and calibrate the agent’s performance as autonomy increases.
Every agent instance passes through a lifecycle consisting of three stages: initiation, active operation, and completion or termination. Transparency must exist throughout each stage to make the process comprehensible and auditable.
[Figure: agent lifecycle state diagram. Begin -Create-> Initiating; Initiating -Config/Prompt-> Initiated; Initiated -Launch-> Active; Active -Notify-> Wait (for HITL); Wait -Callback-> Active; Active -Complete-> Finish; Active -Cancel-> Abort; Finish/Abort -Close-> End.]
Figure 2: Agent State Transition Diagram Within the 3-Pillar Model
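The transitions in Figure 2 can be encoded directly as a small state machine in which any unlisted transition is rejected and every accepted one is journaled. The following is a minimal sketch of one possible encoding, not a reference implementation:

```python
# Allowed transitions of the agent lifecycle in Figure 2, encoded as a
# lookup table. A transition not listed here is rejected, which makes
# every state change explicit and journal-able.
TRANSITIONS: dict[tuple[str, str], str] = {
    ("begin", "create"):           "initiating",
    ("initiating", "config"):      "initiated",
    ("initiating", "prompt"):      "initiated",
    ("initiated", "launch"):       "active",
    ("active", "notify"):          "wait_for_hitl",
    ("wait_for_hitl", "callback"): "active",
    ("active", "complete"):        "finish",
    ("active", "cancel"):          "abort",
    ("finish", "close"):           "end",
    ("abort", "close"):            "end",
}

class AgentLifecycle:
    def __init__(self) -> None:
        self.state = "begin"
        self.journal: list[tuple[str, str, str]] = []  # (from, event, to)

    def fire(self, event: str) -> str:
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            raise ValueError(f"illegal transition: {event!r} from {self.state!r}")
        self.journal.append((self.state, event, nxt))  # state-transition record
        self.state = nxt
        return nxt

agent = AgentLifecycle()
for ev in ["create", "prompt", "launch", "notify", "callback", "complete", "close"]:
    agent.fire(ev)
print(agent.state)    # end
print(agent.journal)  # full, auditable state-transition history
```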
Initiation State. During initiation, a human defines the scope, context, and objectives of the agent’s work. This stage establishes the foundation for safe collaboration. For example, a Research Agent tasked with supporting a product’s go-to-market strategy must receive a clearly defined configuration that includes market segments, data sources, and success criteria. By setting these parameters, the human ensures that the agent’s goals are properly aligned with organizational objectives and ethical standards. This stage also serves as a point of human control, where configurations, role definitions, and constraints can be verified before the agent begins operation.
Active State. Once launched, the agent enters its active state, where it performs the actions for which it was designed. For instance, a Research Agent may conduct web searches and synthesize findings. Likewise, a Payment Agent may initiate payment transactions. A Collection Letter Agent may draft personalized communications based on debtor information and credit conditions. During this phase, activity recording and observability become essential. The environment must automatically generate activity journals that record the agent’s decisions, interactions, and results.
These logs enable oversight and provide a transparent record for post-task evaluation. Moreover, during this phase, the Human-in-the-Loop (HITL) mechanism plays an important role. When the agent encounters uncertainty or ambiguity, it may consult a human collaborator for guidance. Depending on task complexity and risk level, human involvement can vary from direct supervision to collaborative decision-making to minimal observation. Transparency allows both sides to know when and why such handoffs occur.
Abort State. Both human operators and authorized AI subsystems should have the ability to abort or suspend an active agent when necessary. Abort events may occur if the agent cannot fulfill its mission due to missing resources, time constraints, or safety violations. The authority to abort should follow clearly defined governance rules, reflecting the contractual and regulatory conditions under which the agent operates.
Finish State. When an agent finishes or terminates its task, it should produce a clear output along with a record of its entire operation. Transparency requires three complementary forms of documentation:
1. State transition records: Marking changes from initiation to finish.
2. Work progress records: Showing the detailed actions taken by the agent.
3. HITL records: Capturing every human-AI interaction and decision.
These records serve as the backbone of transparency within the agent operating environment. They allow developers, regulators, and users to reconstruct events, assess system performance, and identify opportunities for improvement. Without sufficient transparency, human collaborators cannot effectively supervise agent behavior, learn from outcomes, or develop trust in autonomous agent systems. While these three record types are not exhaustive, they represent the minimum information required to achieve acceptable transparency. In practice, the agent system may also maintain additional journals, such as system logs, user feedback logs, performance metrics, and other operational traces, to further support monitoring, analysis, and continuous improvement.
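As one possible realization, the sketch below models the three record types as typed, append-only journal entries. The field names and the `ActivityJournal` container are assumptions chosen for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

@dataclass
class StateTransitionRecord:
    """Marks a lifecycle change (initiation through finish)."""
    ts: str
    from_state: str
    to_state: str
    event: str

@dataclass
class WorkProgressRecord:
    """Documents a concrete action the agent performed."""
    ts: str
    action: str
    detail: str

@dataclass
class HITLRecord:
    """Captures one human-AI interaction and its outcome."""
    ts: str
    who: str
    what: str
    response: str

@dataclass
class ActivityJournal:
    """Append-only container for the three record types."""
    records: list = field(default_factory=list)

    def log(self, record) -> None:
        self.records.append(record)

journal = ActivityJournal()
journal.log(StateTransitionRecord(now(), "initiated", "active", "launch"))
journal.log(WorkProgressRecord(now(), "generate_email", "drafted HTML announcement"))
journal.log(HITLRecord(now(), "initiator", "approval", "modify: change event date"))
```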
3.2. Pillar Two: Accountability and Responsibility
While transparency answers what happened, accountability answers why it happened and who is responsible. In the previous section on the evolutionary path, we emphasized that autonomy must be earned gradually. Accountability provides the ethical and operational framework that makes this process safe. As AI agents gain more independence, the environment must ensure that each decision, whether made by a human or AI, is traceable to its source and understandable and explainable in context.
Achieving accountability requires comprehensive decision journaling that records not only the outcomes but also the reasoning and contextual factors behind each choice. This is closely related to the principle of explainability in AI. Agents must be able to provide, upon request, the rationale for their decisions, including the data sources consulted, the constraints considered, and the degree of confidence associated with their outputs.
A practical example illustrates this need. Suppose an automated food-ordering agent failed to account for a customer’s allergy to wheat or soy, resulting in a serious medical incident. In such a case, assigning responsibility requires a clear understanding of each participant’s role in the agentic workflow. Was the customer’s input ambiguous? Did a human worker at the restaurant fail to verify the order details during preparation? Did the AI agent miscommunicate the constraints? Or did the underlying language model generate an inaccurate summary of the order that omitted critical information? Without explicit records of each decision and the reasoning behind it, no clear accountability can be established or assigned.
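A decision journal that could answer such questions might look like the following sketch, which attaches rationale, data sources, constraints, and confidence to each decision. The schema and the food-ordering values are hypothetical, chosen only to mirror the example above.

```python
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    """One journaled decision: enough context to answer, after the fact,
    'why was this done and who decided it?'"""
    actor: str               # "human:<id>" or "agent:<id>"
    decision: str            # the choice that was made
    rationale: str           # reasoning the actor supplied or generated
    data_sources: list[str]  # inputs consulted for this decision
    constraints: list[str]   # constraints considered (policy, allergy, budget...)
    confidence: float        # actor's stated confidence in [0, 1]

record = DecisionRecord(
    actor="agent:food-ordering-01",
    decision="substitute soy sauce with tamari",
    rationale="customer profile lists wheat allergy; tamari is wheat-free",
    data_sources=["customer_profile", "menu_ingredients_db"],
    constraints=["allergy:wheat", "allergy:soy?unconfirmed"],
    confidence=0.62,         # low confidence: a candidate for human review
)
```

With such a record, the unconfirmed soy constraint and the low confidence value would both be visible after the incident, making it possible to assign responsibility to the right step in the workflow.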
Accountability serves both corrective and developmental purposes. From a legal or regulatory perspective, it ensures that organizations can assign responsibility when things go wrong. From a technical perspective, it enables learning and continuous improvement. By identifying which part of the agentic workflow led to an undesirable outcome, engineers and the AI system itself can make targeted improvements to prevent recurrence. Accountability thus becomes the engine of continuous improvement within the agent ecosystem, reinforcing the learning loop necessary for safe autonomy and growing trust.
3.3. Pillar Three: Trustworthiness & Human-in-the-Loop
The third pillar, trustworthiness, unites and builds on the previous two. Transparency makes operations visible, accountability clarifies responsibility, and trustworthiness converts these attributes into confidence and a willingness to rely on autonomous systems.
As discussed in the evolutionary path section, human trust is not granted by design but earned through consistent, observable, and reliable performance. During the early phases of adoption, enterprises and end users will trust AI agents only if they can see clear boundaries of control and know that humans can intervene when necessary. Therefore, the operating environment must include mechanisms to specify risk thresholds and escalation rules that determine when human oversight is required.
For example, in domains such as finance or healthcare, high-risk actions such as large transactions or clinical recommendations should automatically trigger human review. These checkpoints form structured Human-in-the-Loop interventions that ensure oversight at critical moments. Conversely, in high-volume, low-risk tasks, AI may operate independently for greater efficiency. Over time, as the system demonstrates reliability, the frequency of human interventions can be gradually reduced, following the same incremental trust-building logic that was illustrated in the autonomous driving analogy. However, any decision to increase the level of autonomy must be explicitly approved by a human authority and clearly documented. In addition, periodic spot checks should be conducted to verify safety and correctness, even after incremental advances in autonomous decision-making have been introduced.
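A minimal sketch of such an escalation rule is given below; the domains, action names, thresholds, and the trust-score field are illustrative assumptions, not prescriptions from this paper.

```python
# Hypothetical escalation policy: route an action to human review when it
# crosses a risk threshold. All names and numbers are placeholders.
HIGH_RISK_DOMAINS = {"finance", "healthcare", "legal"}

def requires_human_review(domain: str, action: str,
                          amount: float = 0.0,
                          autonomy_trust: float = 0.0) -> bool:
    if domain in HIGH_RISK_DOMAINS and action in {"transfer", "prescribe", "sign"}:
        return True                  # structured HITL checkpoint
    if domain == "finance" and amount > 10_000:
        return True                  # large transactions always escalate
    return autonomy_trust < 0.9      # low demonstrated reliability escalates

print(requires_human_review("finance", "transfer", amount=50_000))             # True
print(requires_human_review("marketing", "send_email", autonomy_trust=0.97))   # False
```

Raising the autonomy-trust threshold or shrinking the high-risk set corresponds to the documented, human-approved expansions of autonomy described above.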
Trustworthiness also recognizes that in some contexts, AI can be more dependable than humans. Machines do not suffer from fatigue, emotional fluctuation, or inconsistency, and in repetitive or data-intensive tasks, AI may exhibit higher reliability than human operators. Accordingly, a trustworthy operating environment must support mutual confidence. Humans must trust AI agents to function within clearly defined safety boundaries, while AI systems must be designed to rely on validated human inputs and to defer judgment appropriately when required. The objective is not blind reliance but calibrated trust, grounded in empirical performance evidence and shared accountability. To support this calibration, every decision and every change must be properly recorded and remain auditable.
Finally, trustworthiness ensures that when failures occur, they do not propagate unchecked. The environment must include robust fallback and recovery mechanisms that detect anomalies based on historical patterns, suspend automated actions, and transfer control to human operators before harm occurs. These safety measures ensure that risk remains manageable in very large-scale deployments with thousands of concurrently operating agents, even as autonomy levels continue to increase.
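One simple form of such an anomaly-triggered fallback is sketched below, under the assumption that a per-agent operational metric is monitored against its historical distribution; the metric, z-score cutoff, and action names are illustrative.

```python
import statistics

def anomaly_guard(history: list[float], latest: float, z_cut: float = 3.0) -> str:
    """Suspend automation when the latest metric deviates from the
    historical pattern by more than z_cut standard deviations."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history) or 1e-9   # guard against zero variance
    z = abs(latest - mu) / sigma
    if z > z_cut:
        return "suspend_and_escalate"   # hand control to a human operator
    return "continue"

# e.g. per-hour transaction volume of one agent suddenly spikes
print(anomaly_guard([102, 98, 110, 95, 105], 480))  # suspend_and_escalate
```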
3.4. Integrating the Three Pillars in the Evolutionary Process
The Three-Pillar Model is not a theoretical abstraction but a practical extension of the evolutionary approach described earlier. As agents progress from Assisted to Collaborative, to Supervised Autonomy, and ultimately to Full Autonomy under Human Governance, the balance among the three pillars must evolve in parallel with each successive stage of autonomy.
In early stages, transparency plays the dominant role, ensuring that every action is observable, explainable, and auditable. As systems progress into collaborative stages, accountability becomes increasingly important because humans and AI share responsibility for decisions and outcomes. In the later stages, once agents have demonstrated consistent reliability and alignment, trustworthiness becomes the decisive factor that enables increasing levels of autonomy. Importantly, companies and users will always retain the ability to determine the degree of autonomy they are comfortable and willing to grant to different agents operating in their environments. This flexibility allows organizations to balance efficiency with risk tolerance, enabling a gradual and confident transition toward greater autonomy while maintaining control and trust throughout the process.
These pillars together form a feedback ecosystem in which humans and AI learn from each other. Transparency provides data for accountability. Accountability identifies what needs improvement. Trustworthiness motivates greater delegation of control. Through this cycle, autonomy grows safely and progressively.
In conclusion, the 3PM for agent creation, deployment, and operation establishes the essential conditions for safe evolution toward autonomous agents. It ensures that the journey from collaboration to independence occurs within a structure that is observable, responsible, and trustworthy. Only through such an environment can enterprises accelerate adoption, build user confidence, and achieve the full potential of AI agents while preserving human values and safety.
4. A Sample Use Case: Group Email Agent
To illustrate the application of the Three-Pillar Model within a practical context, we consider a Group Email Agent operating in an enterprise-grade agentic environment. This use case demonstrates how transparency, accountability, and trustworthiness jointly ensure safe and effective collaboration between humans and AI.
A Group Email Agent is a common and valuable application for enterprises that need to compose, review, and distribute communications to internal employees, customers, or business partners. Such messages can include policy updates, marketing announcements, product release communications, event invitations, or crisis management notifications. Because of their wide impact, group emails typically require coordination among multiple stakeholders, including representatives from the business unit, marketing and communications teams, legal and compliance departments, and senior management. These participants contribute to drafting, editing, verifying, and approving both the message content and the list of recipients. The Group Email Agent acts as an author, a coordinator, and an executor, automating repetitive tasks while preserving human oversight where contextual understanding and judgment are critical.
Figure 3 displays the agent activity records captured by the system throughout the lifecycle of a Group Email Agent instance. These records include state transitions, detailed task progress, and Human-in-the-Loop interactions, illustrating how the operating environment maintains continuous transparency and traceability from initiation to completion.
[Figure: screenshot of an agent activity journal for an email instance with status "Active". Entries include an API call "generate-send", LLM email-generation events (model: gpt-oss:20b), the HITL configuration {"enable":"true","how":"amp","what":"approval","when":"after email is generated","who":"initiator"}, the insertion of an approval work item, and the initiator's "modify" response requesting an event-date change.]
Figure 3: Agent Activity Records Captured While Running a Group Email Agent
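The journal in Figure 3 records the instance's HITL configuration as a small who/what/when/how object. The sketch below shows one way an operating environment might resolve such a configuration into a pending approval work item; the resolution logic, helper names, and placeholder email address are assumptions for illustration, not the deployed system's code.

```python
# HITL configuration of the form shown in Figure 3's activity journal.
hitl_config = {
    "enable": "true",
    "how": "amp",                         # channel used to reach the human
    "what": "approval",                   # action requested of the human
    "when": "after email is generated",   # trigger condition
    "who": "initiator",                   # role that must respond
}

def resolve_hitl(config: dict, event_log: list[str], initiator_email: str):
    """Hypothetical resolution: once the trigger condition is met, open a
    work item on the configured channel and wait for the human response."""
    if config["enable"] != "true":
        return None
    if "generated html email" not in " ".join(event_log):
        return None                       # 'when' condition not yet satisfied
    return {
        "channel": config["how"],
        "action": config["what"],
        "assignee": initiator_email if config["who"] == "initiator" else None,
        "status": "waiting for human response",
    }

log = ["Calling LLM to generate email", "generated html email at 08:24:37"]
print(resolve_hitl(hitl_config, log, "initiator@example.com"))
```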
Figure 4 demonstrates a user interface (UI) portal that enables an authorized human participant to provide contextual inputs, review agent activities, and intervene when necessary. This interface supports Human-in-the-Loop collaboration by allowing users to configure, guide, or correct the agent’s actions in real time, ensuring that human oversight remains an integral part of the agent’s operational workflow.
[Image: Group Email Agent instance review screen. The header shows the instance ID email-20251020082318, an "active" status badge, and an "Abort Agent" button. A collapsed, AI-generated Summary section sits above an expanded Action card that asks the reviewer to act on a proposed change ("Change the event date to October 31, include the day of the week. Also change the time to 1-2 pm.") via Approve, Reject, and Modify buttons.]
Figure 4: UI for Obtaining Human-in-the-Loop Inputs During an Agentic Workflow
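To make the review workflow in Figure 4 concrete, the sketch below shows how a pending work item carrying a proposed change might be resolved by a human decision. It is a minimal illustration only: the `WorkItem` type, its fields, and `resolve_work_item` are hypothetical names, not part of the portal's actual API.

```python
# A minimal sketch of resolving the Approve / Reject / Modify decision
# shown in Figure 4. The WorkItem type, its fields, and resolve_work_item
# are hypothetical illustrations, not the portal's actual API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkItem:
    instance_id: str           # e.g. "email-20251020082318"
    proposed_change: str       # natural-language instruction under review
    status: str = "pending"

def resolve_work_item(item: WorkItem, decision: str,
                      revision: Optional[str] = None) -> WorkItem:
    """Record the human decision and update the work item accordingly."""
    if decision == "approve":
        item.status = "approved"               # agent may apply the change
    elif decision == "reject":
        item.status = "rejected"               # proposed change is discarded
    elif decision == "modify":
        # The reviewer amends the instruction before approving it.
        item.proposed_change = revision or item.proposed_change
        item.status = "approved"
    else:
        raise ValueError(f"unknown decision: {decision}")
    return item

item = WorkItem("email-20251020082318",
                "Change the event date to October 31, include the day of "
                "the week. Also change the time to 1-2 pm.")
resolve_work_item(item, "approve")
```

Keeping the decision record separate from the agent's execution logic in this way is what allows every approval, rejection, and modification to be journaled for later audit.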
Figure 5 illustrates how users can interact with the large language model (LLM) to discover, query, and inspect the progress of agents operating within the environment. Through conversational interfaces, users can retrieve explanations, review activity logs, and monitor task completion status. This interactive transparency fosters mutual understanding and trust between humans and AI agents, allowing confidence to grow naturally as agents demonstrate reliability and accountability over time.
[Image: chat window ("Chat (with gpt-oss:20b)") for querying an agent instance. The agent's key take-aways report three email-generation attempts, each saving the HTML and awaiting HITL approval via the AMP channel, before the HITL step was skipped and the email sent. Asked how long the run took, the agent answers 8 minutes 52 seconds, from "llm generating email" at 15:05:35 to "sent email" at 15:14:27, with a start/end/duration breakdown table.]
Figure 5: Transparency Enables Human-AI Collaboration with Trustworthiness in Agent Operation
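The duration query answered in Figure 5 is a simple computation over timestamped activity records. The following sketch reproduces that arithmetic under an assumed record format; the field names are illustrative and are not the environment's actual log schema.

```python
# A minimal sketch of the duration query answered in Figure 5: the elapsed
# time between the first and last timestamped activity records. The record
# format is an assumption for illustration, not the environment's log schema.
from datetime import datetime

records = [
    {"ts": "2025-10-20 15:05:35", "event": "llm generating email"},
    {"ts": "2025-10-20 15:14:27", "event": "sent email"},
]

def workflow_duration(records):
    """Return the span between the earliest and latest record as 'M min Ss'."""
    stamps = sorted(datetime.strptime(r["ts"], "%Y-%m-%d %H:%M:%S")
                    for r in records)
    minutes, seconds = divmod(int((stamps[-1] - stamps[0]).total_seconds()), 60)
    return f"{minutes} min {seconds}s"

print(workflow_duration(records))  # -> 8 min 52s
```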
By systematically capturing and recording agent activities, the operating environment enables a high degree of transparency that supports comprehensive analytics on both agent behavior and Human-in-the-Loop interactions. This transparency makes it possible to surface aggregated insights through a dashboard component, which serves as a central interface for monitoring, managing, and improving a large-scale agent operating environment. The dashboard plays a critical role in supporting operational oversight, performance evaluation, and continuous improvement, while also informing decisions about when and how to safely increase the level of autonomy within agentic workflows.
Figure 6 illustrates the dashboard view, which presents a collection of analytic charts summarizing agent execution patterns, lifecycle states, intervention frequencies, and HITL engagement metrics. These visualizations allow users to quickly assess system health, identify bottlenecks, detect anomalous behavior, and understand where human involvement is most frequently required. By consolidating this information at scale, the dashboard enables organizations to manage thousands of concurrently operating agents in a controlled and informed manner.
In addition to static visualization, the dashboard integrates interactive analysis through a natural language interface powered by a large language model. As shown in Figure 7, selecting a chart allows users to open an LLM-driven chat window that generates a contextual analysis report explaining observed trends and patterns. Users can further engage in dialogue with the LLM to ask follow-up questions, explore root causes, and derive business insights related to efficiency, risk, and workflow optimization. This combination of visual analytics and conversational analysis supports deeper understanding of agentic behavior and helps users identify targeted opportunities to refine processes, improve safety, and incrementally advance the autonomy of the overall agentic workflow system.
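As one concrete example of the aggregation behind such charts, the sketch below computes the weekly HITL rate plotted in Figure 6 from per-run records. The record shape is assumed purely for illustration and is not a published schema.

```python
# A minimal sketch of one aggregation behind the Figure 6 dashboard: the
# weekly HITL rate, i.e. the share of runs that raised a human work item.
# The per-run record shape is assumed for illustration only.
from collections import defaultdict
from datetime import date

runs = [
    {"week": date(2025, 8, 11), "hitl": True},
    {"week": date(2025, 8, 11), "hitl": True},
    {"week": date(2025, 8, 11), "hitl": False},
]

def hitl_rate_by_week(runs):
    """Map each week to the percentage of runs that involved HITL review."""
    counts = defaultdict(lambda: [0, 0])     # week -> [hitl runs, all runs]
    for run in runs:
        counts[run["week"]][0] += int(run["hitl"])
        counts[run["week"]][1] += 1
    return {week: 100.0 * h / n for week, (h, n) in counts.items()}

print(hitl_rate_by_week(runs))  # {datetime.date(2025, 8, 11): 66.66...}
```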
[Image: "Realtime adoption & quality trends" dashboard, filtered to 08/01/2025-12/03/2025 at weekly granularity across five agent types. Volume & Adoption panel: total agents over time (ramping from roughly 700 to a peak near 6,500 before declining), volume by agent type, and top agent types (Invoice-Payment leads at about 38,000 runs). Quality & HITL panel: finished vs aborted runs, error distribution (roughly 95% of runs error-free), and HITL rate, which falls from about 80% to about 40% over the period. Performance & Operations panel: average duration by agent type (Invoice-Payment longest at roughly 300 minutes), queue wait trend (stable near 50 seconds), and a concurrency heatmap.]
Figure 6: Analytic Charts Illustrating Realtime Adoption and Quality Trends of the Agentic System
[Image: chat window ("Chat (llama3.1)") in which the user asks the LLM to analyze the "Total agents over time" chart. The response compares it with the HITL rate charts, attributing the growth around 2025-09-01 to agents initiated or reviewed by top users, and observing that the decline from 2025-11-03 to 2025-11-10 is less pronounced in the HITL charts, suggesting a user-specific behavior or technical issue rather than a broader trend.]
[Image: generated analysis report with Key Takeaways, Possible Explanations (changes in user behavior, technical issues), and Recommendations for further investigation, followed by a "Regenerate report" link and a command input field for follow-up requests.]
Figure 7: LLM-Integrated Chat Interface Enabling Analytic Insights From Agent Dashboard Charts
6. Next Steps and Future Work
The work presented in this paper establishes both a conceptual framework and an operational foundation for developing safe, transparent, and trustworthy AI agents through the 3PM. Building on insights from prior research in Human-in-the-Loop systems and safe AI, the model is extended into a comprehensive operational approach supported by practical principles and implementation guidelines. As a practical and applicable framework, the 3PM is lightweight, easy to understand, and straightforward to apply in the development, deployment, and operation of large-scale, enterprise-grade agentic systems. At the same time, it is grounded in a complete and coherent theoretical foundation and is designed to evolve as the scope, scale, and complexity of the agentic environment expand. The next phase of this initiative focuses on translating these principles into real-world practice, ensuring that both industry and society can fully benefit from the responsible adoption of autonomous agents. To achieve this, three primary work streams have been initiated to extend, validate, and operationalize the ideas introduced in this study.
6.1 Public Deliberation through the Stanford Deliberative Democracy Lab
The first work stream involves a collaboration with the Deliberative Democracy Lab (DDL) at Stanford University, which is conducting a series of public deliberative forums focused on the social and ethical dimensions of AI agents [17, 18]. These forums bring together a diverse range of stakeholders, including AI industry leaders, policymakers, researchers, and members of the public, to engage in structured, open discussions about the Three Pillars of transparency, accountability, and trustworthiness.
In the initial phase, the DDL will conduct forums in North America and in India. The goal of this initiative is to bridge the gap between technological innovation and societal readiness. By involving the public in open, informed conversations, this work stream seeks to better understand how people perceive the risks and benefits of autonomous agents, what level of transparency they expect, and what safeguards they require to build trust. The insights from these dialogues will guide both technical and policy frameworks, ensuring that the development of AI agents aligns with public values and expectations across both business and consumer contexts.
Through these deliberative processes, the AI community can establish mutual understanding and legitimacy around agent governance, helping society evolve toward an era of AI-enabled collaboration rather than resistance or fear.
6.2 Industry Collaboration through the Safe AI Agent Consortium
The second work stream focuses on industry collaboration through the Safe AI Agent Consortium, an emerging alliance of leading organizations that share a commitment to advancing the responsible use of autonomous agents [19]. The consortium’s core members include Anthropic, Cohere, DoorDash, Meta, Microsoft, Oracle, PayPal, Stanford, and other key players across the academic, technology, and enterprise sectors.
This group is jointly developing a set of industry guidelines and best practices grounded in the 3PM. These guidelines aim to operationalize the concepts of transparency, accountability, and trustworthiness in a way that developers, implementers, and users can readily apply to real-world AI systems. By creating common standards for agent design, documentation, observability, and governance, the consortium seeks to promote safe adoption of AI agents at scale. This initiative enables enterprises to capture productivity gains without compromising human oversight or public trust.
The consortium’s open work may also expand to developing shared benchmarks, safety testing protocols, and interoperability frameworks for agent operating environments. These outcomes will serve as practical tools for both startups and large organizations to evaluate the maturity, safety, and reliability of their agentic systems. Through collective action and transparency among participants, this initiative aspires to make safety and responsibility a competitive advantage in the growing agent economy.
6.3 Open Tools and the Three-Pillar Agent Operating Environment
The third work stream extends this research into applied development and community tooling. The objective is for industry leaders and startups to design and release a set of open-source tools and frameworks that embody the 3PM and accelerate the adoption of safe agentic systems. This includes the creation of an agent operating environment, as illustrated in this paper, that integrates transparency, accountability, and trustworthiness by design across the full agent lifecycle.
This environment will provide a standardized foundation for safe and effective agent operations, offering key capabilities such as the following (a minimal configuration sketch follows the list):
- Agent activity logging and lifecycle tracking to ensure full transparency and traceability across initiation, execution, and completion stages.
- Decision journaling and explainability modules to support accountability by recording the reasoning, context, and outcomes behind each agent decision.
- Configurable human oversight controls and fallback mechanisms to maintain trustworthiness and provide dynamic risk management through defined intervention thresholds.
- AI-generated analytics derived from agent activity logs and decision journals, with LLMs deployed throughout the 3PM operating environment to enable interactive monitoring, health assessment, and insight generation. This capability allows users to better understand system behavior, identify improvement opportunities, and make informed decisions about progressively increasing levels of agent autonomy.
- AI-assisted 24x7 monitoring of agentic workflows to continuously learn behavioral patterns, detect anomalies, and trigger timely human involvement when necessary to preserve system safety and security.
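As a sketch of how the configurable oversight controls listed above might be expressed, the snippet below routes high-stakes or low-confidence actions to a human reviewer before the agent proceeds. All field names, threshold values, and action names are illustrative assumptions, not part of the tooling described in this paper.

```python
# A minimal sketch of a configurable oversight control from the list above:
# an intervention policy that routes high-stakes or low-confidence actions
# to a human reviewer. All field names, threshold values, and action names
# are illustrative assumptions, not part of the tooling described here.
OVERSIGHT_POLICY = {
    "autonomy_level": 2,        # 0 = fully supervised ... 5 = fully autonomous
    "confidence_floor": 0.85,   # below this, escalate to a human
    "always_review": {"funds_transfer", "contract_signature"},
}

def requires_human(action, confidence, policy=OVERSIGHT_POLICY):
    """Return True when the action must wait for human approval."""
    if action in policy["always_review"]:
        return True                          # high-stakes actions always pause
    if confidence < policy["confidence_floor"]:
        return True                          # uncertain actions escalate
    return policy["autonomy_level"] < 1      # supervised mode reviews everything

assert requires_human("funds_transfer", 0.99)
assert not requires_human("send_newsletter", 0.95)
```

A policy of this shape makes the intervention threshold itself auditable configuration rather than behavior buried in agent code, which is what allows autonomy to be raised incrementally as trust accrues.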
By providing a shared technical foundation, this work stream aims to lower the entry barrier for organizations to adopt AI agents. It allows developers to embed safety and governance principles from the outset, rather than retrofitting compliance and oversight after deployment. The tools will be open for collaboration and extension by the research and developer communities, designed to integrate with existing agentic interoperability standards such as the Model Context Protocol (MCP) and the Agent-to-Agent (A2A) communication protocol. This openness will encourage cross-industry experimentation, validation, and interoperability, fostering a unified ecosystem where safe, transparent, and accountable AI agents can evolve and operate seamlessly across different environments.
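For a sense of what MCP interoperability looks like on the wire, the sketch below builds a JSON-RPC 2.0 `tools/call` request of the kind an MCP client sends to a server. The tool name and arguments are hypothetical; the MCP specification remains the authoritative reference for the message format.

```python
# A minimal sketch of MCP interoperability at the wire level: MCP is built
# on JSON-RPC 2.0, and a client invokes a server-exposed tool with a
# "tools/call" request. The tool name and arguments here are hypothetical;
# the MCP specification is the authoritative reference for the format.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "group_email_agent.review_status",       # hypothetical tool
        "arguments": {"instance_id": "email-20251020082318"},
    },
}
print(json.dumps(request, indent=2))  # payload an MCP client would send
```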
Through continuous contribution from the developer community and iterative improvement, the resulting ecosystem will foster a trusted agent economy in which innovation can advance both responsibly and efficiently. Over time, this environment may serve as a reference implementation for regulators, researchers, and practitioners seeking to harmonize safety and governance standards across industries and geographic regions, thereby accelerating the safe and scalable adoption of autonomous agents worldwide.
7. Conclusion
This paper has presented a conceptual and operational framework for developing safe, transparent, and trustworthy AI agents through the Three-Pillar Model (3PM), consisting of Transparency, Accountability, and Trustworthiness. Building upon prior research in Human-in-the-Loop (HITL) systems, reinforcement learning with human feedback, and collaborative AI, this model provides a practical foundation for guiding the evolution of AI agents from assisted to fully autonomous operation. The framework emphasizes that autonomy must be achieved through a gradual, verifiable process in which trust is earned over time, rather than assumed by design.
We have argued that the development of autonomous agents parallels the evolutionary path of autonomous driving, where safety, reliability, and human confidence were cultivated through progressive stages of shared control. Similarly, the journey toward trustworthy AI autonomy requires environments that support visibility, ethical reasoning, and human collaboration. The proposed Three-Pillar Model ensures that every stage of agent development and deployment remains transparent, accountable, and grounded in timely and appropriate human oversight. Transparency provides observability into agent behavior and decision-making processes; accountability ensures that both actions and decisions are traceable, explainable, and correctable; and trustworthiness transforms these safeguards into lasting confidence among users, organizations, and the broader public.
To move from concept to practice, this research has initiated three complementary work streams. The first engages the public through the Deliberative Democracy Lab at Stanford University, facilitating informed dialogue between citizens and AI industry leaders about the social implications of agent transparency, accountability, and trust. The second advances industry collaboration through the Safe AI Agent Consortium, uniting leading technology organizations to establish shared best practices, evaluation benchmarks, and governance standards for safe agentic systems. The third work stream focuses on open tooling, with the goal of developing an open agent operating environment that embodies the Three-Pillar principles and supports interoperability among both native and external agents through protocols, including the Model Context Protocol (MCP), Agent-to-Agent (A2A) communication, Agent Communication Protocol (ACP), and Agent Network Protocol (ANP).
Through these efforts, the 3PM progresses from theoretical construct to actionable framework, enabling the responsible evolution of autonomous agents. Through sustained collaboration across academia, industry, and society, we can shape a future in which AI agents operate in alignment with human values, advancing innovation while upholding safety, transparency, and ethical integrity.
References
1. Sanders, T. How AI Agents Are Overcoming Market Hype to Deliver Real Business Impact. 2025 AI Agents G2 Insight Report, October 2025. https://company.g2.com/news/2025-ai-agent-report
2. Zanzotto, F.M. Viewpoint: Human-in-the-loop Artificial Intelligence. Journal of Artificial Intelligence Research 64 (2019) 243-252. February 2019.
3. Wu, X. et al. A Survey of Human-in-the-Loop for Machine Learning. arXiv:2108.00941 (v3). April 2022. https://arxiv.org/abs/2108.00941
4. Mosqueira-Rey, E. et al. Human-in-the-Loop Machine Learning: A State of the Art. Artificial Intelligence Review (2023) 56:3005–3054. August 2022. https://link.springer.com/article/10.1007/s10462-022-10246-w
5. Wixom, B., Someh, I., and Gregory, R. AI Alignment: A New Management Paradigm. MIT Center for Information Systems Research (MIT CISR). No. XX-11. November 2020. https://cisr.mit.edu/publication/2020_1101_AI-Alignment_WixomSomehGregory
6. Burnham, K. New framework helps companies build secure AI systems. MIT Sloan School of Management. July 2025. https://mitsloan.mit.edu/ideas-made-to-matter/new-framework-helps-companies-build-secure-ai-systems
7. Bellos, F. et al. Towards Effective Human-in-the-Loop Assistive AI Agents. arXiv:2507.18374 (v1). July 2025. https://arxiv.org/abs/2507.18374
8. Mozannar, H. et al. Magentic-UI: Towards Human-in-the-loop Agentic Systems. Microsoft Research AI Frontiers. arXiv:2507.22358 (v1). July 2025. https://arxiv.org/abs/2507.22358
9. Retzlaff, C. O. et al. Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities. Journal of Artificial Intelligence Research 79 (2024) 359-415. January 2024.
10. Ren, A. Z. et al. Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. Google DeepMind. arXiv:2307.01928 (v2). September 2023. https://arxiv.org/abs/2307.01928
11. Allen, D. et al. A Roadmap for Governing AI: Technology Governance and Power Sharing Liberalism. Ash Center for Democratic Governance and Innovation, Harvard Kennedy School. January 2024. https://ash.harvard.edu/wp-content/uploads/2024/01/340040_hks_ashgovroadmap_v2.pdf
12. Barroso, L. R. and Mello, P. P. C. Artificial Intelligence: Promises, Risks, and Regulation: Something New Under the Sun. Carr Center for Human Rights Policy, Harvard Kennedy School. December 2024. https://www.hks.harvard.edu/sites/default/files/2024-12/24_Barroso_Digital_v3.pdf
13. Natarajan, S. et al. Human-in-the-loop or AI-in-the-loop? Automate or Collaborate? The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25). March 2025.
14. Wang, J., Zhang, L., Huang, Y., and Zhao, J. Safety of Autonomous Vehicles. Journal of Advanced Transportation. October 2020. https://doi.org/10.1155/2020/8867757
15. Khan, M. A. et al. Level-5 Autonomous Driving—Are We There Yet? A Review of Research Literature. ACM Computing Surveys (CSUR), Vol. 55, Issue 2, Article No. 27. January 2022. https://doi.org/10.1145/3485767
16. Cheng, J. Context-Aware Prompt Enhancement (CAPE) Framework for a Multi-Agent Application System. Inquiryon, Inc. July 2025.
17. Siu, A. Industry-Wide Deliberative Forum Invites Public to Weigh In on the Future of AI Agents. First public announcement. June 2025. https://deliberation.stanford.edu/industry-wide-deliberative-forum-invites-public-weigh-future-ai-agents
18. Siu, A. DoorDash and Microsoft join Industry-Wide Deliberative Forum on Future of AI Agents. Second public announcement. August 2025. https://deliberation.stanford.edu/doordash-and-microsoft-join-industry-wide-deliberative-forum-future-ai-agents
19. Katsanevas, A. et al. AI Agent for Good: Alignment, Safety, & Impact. 2025 Summer Symposium Hosted by Stanford Deliberative Democracy Lab. July 2025. https://deliberation.stanford.edu/ai-agent-good-alignment-safety-impact