# DAC-h3: A Proactive Robot Cognitive Architecture to Acquire and Express Knowledge About the World and the Self
Clément Moulin-Frier*, Tobias Fischer*, Maxime Petit, Grégoire Pointeau, Jordi-Ysard Puigbo, Ugo Pattacini, Sock Ching Low, Daniel Camilleri, Phuong Nguyen, Matej Hoffmann, Hyung Jin Chang, Martina Zambelli, Anne-Laure Mealier, Andreas Damianou, Giorgio Metta, Tony J. Prescott, Yiannis Demiris, Peter Ford Dominey, and Paul F. M. J. Verschure
Abstract -This paper introduces a cognitive architecture for a humanoid robot to engage in a proactive, mixed-initiative exploration and manipulation of its environment, where the initiative can originate from both the human and the robot. The framework, based on a biologically-grounded theory of the brain and mind, integrates a reactive interaction engine, a number of state-of-the-art perceptual and motor learning algorithms, as well as planning abilities and an autobiographical memory. The architecture as a whole drives the robot behavior to solve the symbol grounding problem, acquire language capabilities, execute goal-oriented behavior, and express a verbal narrative of its own experience in the world. We validate our approach in human-robot interaction experiments with the iCub humanoid robot, showing that the proposed cognitive architecture can be applied in real time within a realistic scenario and that it can be used with naive users.
Index Terms -Cognitive Robotics, Distributed Adaptive Control, Human-Robot Interaction, Symbol Grounding, Autobiographical Memory
Manuscript received December 31, 2016; revised July 25, 2017; accepted August 09, 2017. The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. FP7-ICT612139 (What You Say Is What You Did project), as well as the ERC's CDAC project: Role of Consciousness in Adaptive Behavior (ERC-2013-ADG 341196). M. Hoffmann was supported by the Czech Science Foundation under Project GA17-15697Y. P. Nguyen received funding from ERC's H2020 grant agreement No. 642667 (SECURE). T. Prescott and D. Camilleri received support from the EU Seventh Framework Programme as part of the Human Brain (HBP-SGA1, 720270) project.
C. Moulin-Frier, J.-Y. Puigbo, S. C. Low, and P. F. M. J. Verschure are with the Laboratory for Synthetic, Perceptive, Emotive and Cognitive Systems, Universitat Pompeu Fabra, 08002 Barcelona, Spain. P. F. M. J. Verschure is also with Institute for Bioengineering of Catalonia (IBEC), The Barcelona Institute of Science and Technology (BIST) and Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain.
- T. Fischer, M. Petit, H. J. Chang, M. Zambelli, and Y. Demiris are with the Personal Robotics Laboratory, Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ, U.K.
- G. Pointeau, A.-L. Mealier, and P. F. Dominey are with the Robot Cognition Laboratory of the INSERM U846 Stem Cell and Brain Research Institute, Bron 69675, France.
- U. Pattacini, P. Nguyen, M. Hoffmann, and G. Metta are with the Italian Institute of Technology, iCub Facility, Via Morego 30, Genova, Italy. M. Hoffmann is also with the Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague.
- D. Camilleri, A. Damianou, and T. J. Prescott are with the Department of Computer Science, University of Sheffield, U.K. A. Damianou is now at Amazon.com.
Digital Object Identifier 10.1109/TCDS.2017.2754143
*C. Moulin-Frier and T. Fischer contributed equally to this work.
## I. INTRODUCTION
The so-called Symbol Grounding Problem (SGP, [1], [2], [3]) refers to the way in which a cognitive agent forms an internal and unified representation of an external world referent from the continuous flow of low-level sensorimotor data generated by its interaction with the environment. In this paper, we focus on solving the SGP in the context of human-robot interaction (HRI), where a humanoid iCub robot [4] acquires and expresses knowledge about the world by interacting with a human partner. Solving the SGP is of particular relevance in HRI, where a repertoire of shared symbolic units forms the basis of an efficient linguistic communication channel between the robot and the human.
To solve the SGP, several questions should be addressed:
- How are unified symbolic representations of external referents acquired from the multimodal information collected by the agent (e.g., visual, tactile, motor)? This is referred to as the Physical SGP [5], [6].
- How can a shared lexicon, grounded in the sensorimotor interactions between two (or more) agents, be acquired? This is referred to as the Social SGP [6], [7].
- How is this lexicon then used for communication and collective goal-oriented behavior? This refers to the functional role of physical and social symbol grounding.
This paper addresses these questions by proposing a complete cognitive architecture for HRI and demonstrating its abilities on an iCub robot. Our architecture, called DAC-h3 , builds upon our previous research projects in conceiving biologically grounded cognitive architectures for humanoid robots based on the Distributed Adaptive Control theory of mind and brain (DAC, presented in the next section). In [8] we proposed an integrated architecture for generating a socially competent humanoid robot, demonstrating that gaze, eye contact and utilitarian emotions play an essential role in the psychological validity or social salience of HRI (DAC-h1). In [9], we introduced a unified robot architecture, an innovative Synthetic Tutor Assistant (STA) embodied in a humanoid robot whose goal is to interactively guide learners in a science-based learning paradigm through rich multimodal interactions (DAC-h2).
DAC-h3 is based on a developmental bootstrapping process where the robot is endowed with an intrinsic motivation to act and relate to the world in interaction with social peers.
Levinson [10] refers to this process as the human interaction engine : a set of capabilities including looking at objects of interest and interaction partners, pointing to these entities [11], demonstrating curiosity as a desire to acquire knowledge [12], and showing, telling and sharing this knowledge with others [11], [13]. These capabilities are also consistent with the desiderata for developmental cognitive architectures proposed in [14], stating that a cognitive architecture's value system should manifest both exploratory and social motives, reflecting the psychology of development defended by Piaget [15] and Vygotsky [16].
This interaction engine drives the robot to proactively control its own acquisition and expression of knowledge, favoring the grounding of acquired symbols by learning multimodal representations of entities through interaction with a human partner. In DAC-h3 , an entity refers to an internal or external referent: it can be either an object, an agent, an action, or a body part. In turn, the acquired multimodal and linguistic representations of entities are recruited in goal-oriented behavior and form the basis of a persistent concept of self through the development of an autobiographical memory and the expression of a verbal narrative.
We validate the proposed architecture following a human-robot interaction scenario where the robot has to learn concepts related to its own body and its vicinity in a proactive manner and express those concepts in goal-oriented behavior. We show a complete implementation running in real-time on the iCub humanoid robot. The interaction depends on the internal dynamics of the architecture, the properties of the environment, and the behavior of the human. We analyze a typical interaction in detail and provide videos showing the robustness of our system in various environments (https://github.com/robotology/wysiwyd). Our results show that the architecture autonomously drives the iCub to acquire a number of concepts about the present entities (objects, humans, and body parts), whilst proactively maintaining the interaction with a human and recruiting those concepts to express more complex goal-oriented behavior. We also run experiments with naive subjects to test the effect of the robot's proactivity level on the interaction.
In Section II we position the current contribution with respect to related works in the field and rely on this analysis to emphasize the specificity of our approach. Our main contribution is described in Section III and consists of the proposal and implementation of an embodied and integrated cognitive architecture for the acquisition of multimodal information about external world referents, as well as a context-dependent lexicon shared with a human partner and used in goal-directed behavior and verbal narrative generation. The experimental validation of our approach on an iCub robot is provided in Section IV, followed by a discussion in Section V.
## II. RELATED WORKS AND PRINCIPLES OF THE PROPOSED ARCHITECTURE
Designing a cognitive robot that is able to solve the SGP requires a set of heterogeneous challenges to be addressed. First, the robot has to be driven by a cognitive architecture bridging the gap between low-level reactive control and symbolic knowledge processing. Second, it needs to interact with its environment, including social partners, in a way that facilitates the acquisition of symbolic knowledge. Third, it needs to actively maintain engagement with the social partners for a fluent interaction. Finally, the acquired symbols need to be used in high-level cognitive functions dealing with linguistic communication and autobiographical memory.
In this section, we review related works on each of these topics along with a brief overview of the solution adopted by the DAC-h3 architecture.
## A. Functionally-driven vs. biologically-inspired approaches in social robotics
The methods used to conceive socially interactive robots derive predominantly from two approaches [17]. Functionally-designed approaches are based on reverse engineering methods, assuming that a deep understanding of how the mind operates is not a requirement for conceiving socially competent robots (e.g. [18], [19], [20]). Biologically-inspired robots, in contrast, are based on theories from the natural and social sciences, with two main advantages expected from constraining cognitive models by biological knowledge: robots that are more understandable to humans, as they reason using similar principles, and an efficient experimental benchmark against which the underlying theories of learning can be confronted, tested and refined (e.g. [21], [22], [23]). One specific approach, used by Demiris and colleagues for the mirror neuron system, is to decompose computational models implemented on robots into brain operating principles which can then be linked and compared to neuroimaging and neurophysiological data [23].
The proposed DAC-h3 cognitive architecture takes advantage of both methods. It is based on an established biologically-grounded cognitive architecture of the brain and the mind (the DAC theory, presented below) that is adapted for the HRI domain. However, while the global structure of the architecture is constrained by biology, the implementation of specific modules can be driven by their functionality, i.e. using state-of-the-art methods from machine learning that are powerful at implementing particular functions without being directly constrained by biological knowledge.
## B. Cognitive architectures and the SGP
Another distinction in approaches for conceiving social robots, which is of particular relevance for addressing the SGP, reflects a divergence from the more general field of cognitive architectures (or unified theories of cognition [24]). Historically, two opposing approaches have been proposed to formalize how cognitive functions arise in an individual agent from the interaction of interconnected information processing modules in a cognitive architecture. Top-down approaches rely on a symbolic representation of a task, which has to be decomposed recursively into simpler ones to be executed by the agent. These rely principally on methods from symbolic artificial intelligence (from the General Problem Solver [25] to Soar [26] or ACT-R [27]). Although relatively powerful at solving abstract symbolic problems, top-down architectures are not able to solve the SGP per se because they presuppose the existence of symbols. Thus they are not suitable for addressing the problem of how these symbols can be acquired from low-level sensorimotor signals. The alternative, bottom-up approaches, instead implement behavior without relying on complex knowledge representation and reasoning. This is typically the case in behavior-based robotics [28], emphasizing lower-level sensory-motor control loops as a starting point of behavioral complexity, as in the Subsumption architecture [29]. These approaches are not suitable for solving the SGP either, because they do not consider symbolic representation to be a necessary component of cognition (referred to as intelligence without representation in [28]).
Interestingly, this historical distinction between top-down, representation-based and bottom-up, behavior-based approaches still holds in the domain of social robotics [30], [31]. Representation-based approaches rely on the modeling of psychological aspects of social cognition (e.g. [32]), whereas behavior-based approaches emphasize the role of embodiment and reactive control to enable the dynamic coupling of agents [33]. Solving the SGP, both in its physical and social aspects, therefore requires the integration of bottom-up processes for acquiring and grounding symbols in the physical interaction with the (social) environment and top-down processes for taking advantage of the abstraction, reasoning and communication abilities provided by the acquired symbol system. This has been referred to as the micro-macro loop, i.e., a bilateral relationship between an emerged symbol system at the macro level and a physical system consisting of communicating and collaborating agents at the micro level [34].
Several contributions in social robotics rely on such hybrid architectures integrating bottom-up and top-down processes (e.g. [35], [36], [37], [38]). In [35], an architecture called embodied theory of mind was developed to link high-level cognitive skills to the low-level perceptual abilities of a humanoid, implementing joint attention and intentional state understanding. In [36] and [37], the architectures combine deliberative planning, reactive control, and motivational drives for controlling robots in interaction with humans.
In this paper, we adopt the principles of the Distributed Adaptive Control theory of the mind and the brain (DAC, [39], [40]). DAC is a hybrid architecture which posits that cognition is based on the interaction of four interconnected control loops operating at different levels of abstraction (see Figure 1). The first level is called the somatic layer and corresponds to the embodiment of the agent within its environment, with its sensors and actuators as well as the physiological needs (e.g. for exploration or safety). Extending bottom-up approaches with drive reduction mechanisms, complex behavior is bootstrapped in DAC from the self-regulation of an agent's physiological needs when combined with reactive behaviors (the reactive layer ). This reactive interaction with the environment drives the dynamics of the whole architecture [41], bootstrapping learning processes for solving the physical SGP (the adaptive layer ) and the acquisition of higher-level cognitive representations such as abstract goal selection, memory and planning (the contextual layer ). These high-level representations in turn modulate the activity at the lower levels via top-down pathways shaped by behavioral feedback. The control flow in DAC is therefore distributed, both from bottom-up and top-down interactions between layers, as well as from lateral information processing within each layer.
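The layered control flow described above can be sketched as a single simplified update loop. This is a toy illustration only, not the actual DAC-h3 implementation (which is a set of distributed YARP modules); all class, method, and field names here are hypothetical.

```python
# Toy sketch of the four DAC layers and their bottom-up / top-down coupling.
# Names and numeric constants are illustrative assumptions, not DAC-h3 code.

class ReactiveLayer:
    """Drive-reduction control: keeps internal needs in homeostatic range."""
    def __init__(self, needs):
        self.needs = needs  # e.g. {"exploration": 0.5, "safety": 0.9}

    def step(self, sensations):
        # Select the most depleted need and emit a corrective behavior.
        drive = min(self.needs, key=self.needs.get)
        return {"behavior": "reduce_" + drive, "drive": drive}

class AdaptiveLayer:
    """Learns multimodal representations from the reactive interaction."""
    def __init__(self):
        self.entities = {}  # label -> multimodal features

    def step(self, sensations, reactive_out):
        label = sensations.get("label")
        if label is not None:
            self.entities.setdefault(label, {}).update(
                sensations.get("features", {}))
        return self.entities

class ContextualLayer:
    """Stores episodes and selects goals that modulate lower layers."""
    def __init__(self):
        self.episodes = []

    def step(self, entities, reactive_out):
        self.episodes.append((reactive_out["drive"], dict(entities)))
        # Top-down bias: the selected goal re-weights the reactive drives.
        return {"goal": reactive_out["drive"]}

def dac_step(sensations, reactive, adaptive, contextual):
    r = reactive.step(sensations)            # bottom-up: reactive layer
    entities = adaptive.step(sensations, r)  # bottom-up: adaptive layer
    ctx = contextual.step(entities, r)       # bottom-up: contextual layer
    # Top-down pathway: acting on the goal partially satisfies the drive.
    reactive.needs[ctx["goal"]] = min(1.0, reactive.needs[ctx["goal"]] + 0.1)
    return r, entities, ctx
```

The point of the sketch is the distributed control flow: each call to `dac_step` propagates sensations upward through the three control layers and then applies a top-down modulation of the reactive drives, mirroring the bilateral coupling between layers described in the text.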
## C. Representation learning for solving the SGP
As we have seen, a cognitive architecture solving the SGP needs to bridge the gap between low-level sensorimotor data and symbolic knowledge. Several methods have been proposed for compressing multimodal signals into symbols. A solution based on geometrical structures was offered by Gärdenfors with the notion of conceptual spaces (e.g., [42]), whereby similarity between concepts is derived from distances in this space. Lieto et al. [43] advocate the use of the conceptual spaces as the lingua franca for different levels of representation.
Another approach has been proposed in [44], [45], which involves a single class of mental representations called 'Semantic Pointers'. These representations are particularly suited in solving the SGP as they support binding operations of various modalities, which in turn result in a single representation. This representation (which might have been initially formed by an input of a single modality) can then trigger a corresponding concept, whose occurrence leads to simulated stimuli in the other modalities. Furthermore, while semantic pointers can be represented as vectors, the vector representation can be transformed in neural activity which makes the implementation biologically plausible and allows mapping to different brain areas.
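The binding operation underlying semantic pointers can be illustrated with circular convolution, the vector-binding operator commonly used in this family of models. The following self-contained sketch (naive O(n²) convolution, toy dimensionality) shows how two modality vectors combine into a single representation from which one component can be approximately recovered; it is an illustration of the general technique, not the implementation of [44], [45].

```python
# Binding two modality vectors with circular convolution, and unbinding
# one of them via the approximate inverse (index-reversal involution).
# Dimensionality and distributions are illustrative choices.
import math
import random

def circular_convolution(a, b):
    """c[j] = sum_i a[i] * b[(j - i) mod n] -- the binding operator."""
    n = len(a)
    return [sum(a[i] * b[(j - i) % n] for i in range(n)) for j in range(n)]

def involution(a):
    """Approximate inverse for unbinding: a*[j] = a[(-j) mod n]."""
    return [a[0]] + a[:0:-1]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(0)
dim = 256
# Random high-dimensional "modality" vectors with roughly unit norm.
visual = [random.gauss(0, 1 / math.sqrt(dim)) for _ in range(dim)]
tactile = [random.gauss(0, 1 / math.sqrt(dim)) for _ in range(dim)]

bound = circular_convolution(visual, tactile)  # single joint representation
recovered = circular_convolution(bound, involution(visual))  # unbind visual

sim_tactile = cosine(recovered, tactile)
sim_visual = cosine(recovered, visual)
# sim_tactile is large while sim_visual stays near zero: the bound vector
# can be queried with one modality to retrieve the other, noisily.
```

Note the asymmetry this demonstrates: `bound` resembles neither input on its own, yet unbinding with one modality yields a noisy copy of the other, which is what makes such vectors usable as compressed multimodal symbols.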
Other approaches consider symbols as fundamentally sensorimotor units. For example, Object-Action Complexes (OACs) build symbolic representations of sensorimotor experience and behaviors through the learning of object affordances [46] (for a review of affordance-based approaches, see [47]). In [48], a framework founded on joint perceptuo-motor representations is proposed, integrating declarative episodic and procedural memory systems for combining experiential knowledge with skillful know-how.
In DAC-h3, visual, tactile, motor and linguistic information about the present entities is collected proactively through reactive control loops triggering sensorimotor exploration in interaction with a human partner. Abstract representations are learned on-line using state-of-the-art machine learning methods in each modality (see Section III). An entity is therefore represented internally in the robot's memory as the association between abstracted multimodal representations and linguistic labels.
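The resulting internal representation of an entity can be pictured as a record associating per-modality feature abstractions with a linguistic label. The dataclass below is a hypothetical sketch of such a record; the field names are illustrative and do not reflect the actual Object Property Collector schema.

```python
# Illustrative entity record: multimodal features + linguistic label.
# Field names and the grounding criterion are assumptions for exposition.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Entity:
    """Association of abstracted multimodal features with a linguistic label."""
    kind: str = "object"                # object | agent | action | body part
    label: Optional[str] = None         # linguistic label, e.g. "toy"
    modalities: Dict[str, List[float]] = field(default_factory=dict)

    def is_grounded(self) -> bool:
        # An entity counts as grounded once a shared label is associated
        # with at least one non-linguistic modality.
        return self.label is not None and bool(self.modalities)

toy = Entity(kind="object")
toy.modalities["visual"] = [0.12, 0.55, 0.33]  # e.g. abstracted visual features
toy.label = "toy"                              # provided by the human partner
```

In this picture, solving the physical SGP amounts to filling the `modalities` dictionary from sensorimotor experience, while solving the social SGP amounts to filling `label` through interaction with the human.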
## D. Interaction paradigms and autonomous exploration
Learning symbolic representations from sensorimotor signals requires an autonomous interaction of a robot with the physical and social world. Several interaction paradigms have been proposed for grounding a lexicon in the physical interaction of a robot with its environment. Since the pioneering paradigm of language games proposed in [49], a number of multiagent models have been proposed showing how particular properties of language can self-organize out of repeated dyadic interactions between agents of a population (e.g. [50], [51]).
In the domain of HRI, significant progress has been made in allowing robots to interact with humans, for example in
learning shared plans [52], [53], [54], learning to imitate actions [55], [56], [57], [58], and learning motor skills [59] that can be used for engaging in joint activities. Other contributions have focused on lexicon acquisition through the transfer of sensorimotor and linguistic information between a teacher and a learner through imitation [60] or action [61], [62]. However, in most of these interactions, the human is in charge and the robot follows the human's lead: the choice of which concept to learn is left to the human, and the robot must identify it. In this case, the robot must solve the referential indeterminacy problem described by Quine [63], where the robot language learner has to infer which external concept the human speaker referred to. However, acquiring symbols by interacting with other agents is not only a unidirectional process of information transfer between a teacher and learner [10].
Autonomous exploration and proactive behavior solve this problem by allowing robots to take the initiative in exploring their environment [64] and interacting with people [65]. The benefit of these abilities for knowledge acquisition has been demonstrated in several developmental robotics experiments. In [66], it is shown how a combination of social guidance and intrinsic motivation improves the learning of object visual categories in HRI. A similar mechanism is adopted in [67] for learning complex sensorimotor mappings in the context of vocal development. In [68], planning conflicts due to the uncertainty of the detected human's intention are resolved by proactive execution of the corresponding task that optimally reduces the system's uncertainty. In [69], the task is to acquire human-understandable labels for novel objects and to learn how to manipulate them. This is realized through a mixed-initiative interaction scenario, and it is shown that proactivity improves the predictability and success of human-robot interaction.
A central aspect of the DAC-h3 architecture is the robot's ability to act proactively in a mixed-initiative scenario. This allows self-monitoring of the robot's own knowledge acquisition process, removing dependence on the human's initiative. Interestingly, proactivity in a sense reverses the referential indeterminacy problem mentioned above by shifting the responsibility of solving ambiguities to the agent endowed with the adequate prior knowledge to solve them, i.e., the human in an HRI context. The robot is now in charge of the concepts it wants to learn, and can use joint attention behaviors to guide the human toward the knowledge it wants to acquire. In the proposed system, this is realized through a set of behavioral control loops, by self-regulating knowledge acquisition, and by proactively requesting missing information about entities from the human partner.
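A minimal sketch of such a self-regulated, proactive query loop might look as follows. The drive dynamics, thresholds, and question template are illustrative assumptions, not the behavioral control loops actually used in DAC-h3.

```python
# Hedged sketch of proactive knowledge acquisition: a homeostatic
# "knowledge" drive decays as unlabeled entities accumulate, and when it
# leaves its comfort range the robot (not the human) initiates a query.
# All names, thresholds, and the decay rule are illustrative.

def select_query(entities, drive_level, low=0.4):
    """Return a question about an unlabeled entity when the knowledge
    drive drops below its homeostatic lower bound, else None."""
    if drive_level >= low:
        return None  # drive within range: no proactive query needed
    unknown = [name for name, e in entities.items() if e.get("label") is None]
    if not unknown:
        return None  # everything in the scene is already grounded
    target = unknown[0]  # e.g. the most salient unknown entity
    return "What is the name of this {}?".format(entities[target]["kind"])

scene = {
    "obj1": {"kind": "object", "label": "toy"},
    "obj2": {"kind": "object", "label": None},
}
# Knowledge drive decays with each unlabeled entity in the scene.
drive = 1.0 - 0.7 * sum(1 for e in scene.values() if e["label"] is None)
question = select_query(scene, drive)
```

The key design point the sketch captures is that the query is triggered by the robot's internal drive state rather than by a human prompt, which is what shifts the resolution of referential ambiguity onto the human partner.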
## E. Language learning, autobiographical memory and narrative expression
We have just described the main components that a robot requires to solve the SGP: a cognitive architecture able to process both low-level sensorimotor data and high-level symbolic representations, mechanisms for linking these two levels in abstract multimodal representations, as well as autonomous behaviors for proactively interacting with the environment. The final challenge to address concerns the use of the acquired symbols for higher-level cognition, including language learning, autobiographical memory and narrative expression.
Several works address language learning in robotics. The cognitive architecture of iTalk [70] focuses on modeling the emergence of language by learning about the robot's embodiment, learning from others, as well as learning linguistic capability. Cangelosi et al. [71] propose that action, interaction and language should be considered together, as they develop in parallel and each influences the others. Antunes et al. [72] assume that language is already learned, and address the issue that linguistic input typically does not have a one-to-one mapping to actions. They propose to perform reasoning and planning on three different layers (low-level robot perception and action execution, mid-level goal formulation and plan execution, and high-level semantic memory) to interpret human instructions. Similarly, [73] proposes a system that uses language capabilities to recognize novel objects in one shot. In these works, language is typically used to understand the human and perform actions, but not necessarily to talk about past events that the robot has experienced.
A number of works investigate the expression of past events by developing narratives based on acquired autobiographical memories [74], [75], [76]. In [75], a user study is presented which suggests that a robot's narrative allows humans to gain insight into long-term human-robot interaction from the robot's perspective. The method in [76] takes user preferences into account when referring to past interactions. Similarly to our framework, it is based on the implementation of, and cooperation between, episodic and semantic memories coupled with a dialog system. However, no learning capabilities (neither language nor knowledge) are introduced by the authors.
In the proposed DAC-h3 architecture, the acquired lexicon allows the robot to execute action plans for achieving goal-oriented behavior from human speech requests. Relevant information throughout the interaction of the robot with humans is continuously stored in an autobiographical memory used for the generation of a narrative self, i.e., a verbal description of the robot's own history over the long term (able to store and verbally describe interactions from a long time ago, e.g. over a period of several months).
In the next section, we describe how the above features are implemented in a coherent cognitive architecture made up of functional YARP [77] modules running in real-time on the iCub robot.
## III. THE DAC-H3 COGNITIVE ARCHITECTURE
This section presents the DAC-h3 architecture in detail, which is an instantiation of the DAC architecture for human-robot interaction. The proposed architecture provides a general framework for designing autonomous robots which act proactively to 1) maintain social interaction with humans, 2) bootstrap the association of multimodal knowledge with the environment, further enriching the interaction through goal-oriented action plans, and 3) express a verbal narrative. It allows a principled organization of various functional modules into a biologically grounded cognitive architecture.
## A. Layer and module overview
In DAC-h3 , the somatic layer consists of an iCub humanoid robot equipped with advanced motor and sensory abilities for interacting with humans and objects. The reactive layer ensures the autonomy of the robot through drive reduction mechanisms implementing proactive behaviors for acquiring and expressing knowledge about the current scene. This allows the bootstrapping of adaptive learning of multimodal representations about entities in the adaptive layer . More specifically, the adaptive layer learns high-level multimodal representations (visual, tactile, motor and linguistic) for the categorization of entities (objects, agents, actions and body parts) and associates them in unified representations. Those representations form the basis of an episodic memory for goal-oriented behavior through planning in the contextual layer , which deals with goal representation and action planning. Within the contextual layer, an autobiographical memory of the robot is formed that can be expressed in the form of a verbal narrative.
The complete DAC-h3 architecture is shown in Figure 1. It is composed of structural modules reflecting the cognitive modules proposed by the DAC theory. Each structural module may rely on one or more functional modules implementing more specific functionalities (e.g. dealing with motor control, object perception, and scene representation). The complete system described in this section therefore brings together several state-of-the-art algorithms for cognitive robotics and integrates them into a structured cognitive architecture grounded in the principles of the DAC theory. In the remainder of this section, we describe each structural module layer by layer, as well as its interaction with the functional modules , and provide references giving more detail on the individual modules.
## B. Somatic layer
The somatic layer corresponds to the physical embodiment of the system. We use the iCub robot, an open source humanoid platform developed for research in cognitive robotics [4]. The iCub is a 104 cm tall humanoid robot with 53 degrees of freedom (DOF). The robot is equipped with cameras in its articulated eyes allowing stereo vision, and tactile sensors in the fingertips, palms of the hand, arms and torso. The iCub is augmented with an external RGB-D camera above the robot head for agent detection and skeleton tracking. Finally, an external microphone and speakers are used for speech recognition and synthesis, respectively.
The somatic layer also contains the physiological needs of the robot that will drive its reactive behaviors, as described in the following section on the reactive layer .
## C. Reactive layer
Following DAC principles, the reactive layer oversees the self-regulation of the internal drives of a cognitive agent from the interaction of sensorimotor control loops. The drives aim at self-regulating internal state variables (the needs of the somatic layer ) within their respective homeostatic ranges. In biological
Figure 1. The DAC-h3 cognitive architecture (see Section III) is an implementation of the DAC theory of the brain and mind (see Section II-B) adapted for HRI applications. The architecture is organized as a layered control structure with tight coupling within and between layers: the somatic, reactive, adaptive and contextual layers. Across these layers, a columnar organization exists that deals with the processing of states of the world or exteroception (left, red), the self or interoception (middle, blue) and action (right, green). The role of each layer and their interaction is described in Section III. White boxes connected with arrows correspond to structural modules implementing the cognitive modules proposed in the DAC theory. Some of these structural modules rely on functional modules , indicated by acronyms in the boxes next to the structural modules. Acronyms refer to the following functional modules. SR: Speech Recognizer; PASAR: Prediction, Anticipation, Sensation, Attention and Response; AD: Agent Detector; ARE: Action Rendering Engine; OR: Object Recognition; LRH: Language Reservoir Handler; SSM: Synthetic Sensory Memory; PT: Perspective Taking; SRL: Sensorimotor Representation Learning; KSL: Kinematic Structure Learning; OPC: Object Property Collector; ABM: Autobiographical Memory; NSL: Narrative Structure Learning.
terms, such an internal state variable could, for example, reflect the current glucose level in an organism, with the associated homeostatic range defining the minimum and maximum values of that level. A drive for eating would then correspond to a self-regulation mechanism where the agent actively searches for food whenever its glucose level is below the homeostatic minimum and stops eating even if food is present whenever it is above the homeostatic maximum. A drive is therefore defined as the real-time control loop triggering appropriate behaviors whenever the associated internal state variable goes out of its homeostatic range, as a way to self-regulate its value in a dynamic and autonomous way.
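The drive mechanism described above can be sketched as a simple control loop. The following is a minimal illustration under assumed names (`Drive`, `regulate` and the behavior labels are hypothetical, not part of the DAC-h3 code base):

```python
# Minimal sketch of a homeostatic drive (hypothetical names, not the DAC-h3
# API). An internal variable is kept within [minimum, maximum]; leaving the
# range triggers a corrective behavior, as in the glucose example above.

class Drive:
    def __init__(self, name, value, minimum, maximum):
        self.name = name
        self.value = value        # internal state variable (e.g. glucose level)
        self.minimum = minimum    # lower homeostatic boundary
        self.maximum = maximum    # upper homeostatic boundary

    def regulate(self):
        """Return a corrective behavior, or None while within range."""
        if self.value < self.minimum:
            return "seek"   # e.g. actively search for food
        if self.value > self.maximum:
            return "stop"   # e.g. stop eating even if food is present
        return None         # within the homeostatic range: no action needed

hunger = Drive("eating", value=0.2, minimum=0.3, maximum=0.9)
assert hunger.regulate() == "seek"
```

Running the loop repeatedly and feeding back the behavior's effect on `value` gives the dynamic, autonomous self-regulation described in the text.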
In the social robotics context that is considered in this paper, the drives of the robot do not reflect biological needs as above but are rather related to knowledge acquisition and expression in social interaction. At the foundation of this developmental
bootstrapping process is the intrinsic motivation to interact and communicate. As described by Levinson [10] (see Introduction), part of the human interaction engine is a set of capabilities that include the motivation to interact and communicate through universal (language-independent) means, including looking at objects of interest and at the interaction partner, as well as pointing to these objects. These reactive capabilities are built into the reactive layer of the architecture, forming the core of the DAC-h3 interaction engine. These interaction primitives allow the DAC-h3 system and the human to share attention on specific entities (body parts, objects, or agents), and to bootstrap learning mechanisms in the adaptive layer that associate visual, tactile, motor and linguistic representations of entities, as described in the next section.
Currently, the architecture implements the following two drives: one for knowledge acquisition and one for knowledge expression. However, DAC-h3 is designed in a way that facilitates the addition of new drives for further advancements (see Section V). First, a drive for knowledge acquisition provides the iCub with an intrinsic motivation to acquire new knowledge about the current scene. The internal variable associated with this drive is modulated by the number of entities in the current scene with missing information (e.g. unknown name, or missing property). The self-regulation of this drive is realized by proactively asking the human to provide missing information about entities, for instance, their name via speech, synchronized with gaze and pointing; or asking the human to touch the robot's skin associated with a specific body part.
Second, a drive for knowledge expression allows the iCub to proactively express its acquired knowledge by interacting with the human and objects. The internal variable associated with this drive is modulated by the number of entities in the current scene without missing information. The self-regulation is then realized by triggering actions toward the known entities, synchronized with verbal descriptions of those actions (e.g. pointing towards an object or moving a specific body part, while verbally referring to the considered entity).
The implementation of these drives is realized through the three structural modules described below, interacting with each other as well as with the surrounding layers: 1) sensations, 2) allostatic controller, and 3) behaviors (see Figure 1).
1) Sensations: The sensations module deals with low-level sensing to provide relevant information for meaning extraction in the adaptive layer. Specifically, the module detects the presence and position of other agents and their body parts (agent detector functional module; of interest are the head location for gazing at the partner and the location of the hands to detect pointing actions) based on the input of the RGB-D camera. Similarly, it detects objects based on a texture analysis and extracts their location using the stereo vision capabilities of the iCub [78]. The prediction, anticipation, sensation, attention and response functional module (PASAR; [79]) computes the saliency of agents based on their motion (increased velocity leads to increased saliency); similarly, the saliency of objects is increased if they move or the partner points at them. Finally, the speech recognition functional module extracts text from human speech sensed by a microphone using the Microsoft™ Speech API. The functionalities of the sensations module can therefore be summarized as dimensionality reduction and saliency computation, and the resulting data are used for bootstrapping knowledge in higher layers of the architecture.
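As an illustration of the saliency computation performed in this module, the sketch below shows how agent and object saliency could be modulated by motion and pointing; the formulas, gains and function names are hypothetical, not the actual PASAR implementation [79]:

```python
# Illustrative saliency computation in the spirit of PASAR (hypothetical
# formulas and names; the actual functional module is described in [79]).

def agent_saliency(velocity, gain=1.0):
    """Higher velocity leads to higher saliency, clipped to [0, 1]."""
    return min(1.0, gain * velocity)

def object_saliency(is_moving, is_pointed_at, base=0.1):
    """Objects gain saliency when moving or when the partner points at them."""
    saliency = base
    if is_moving:
        saliency += 0.5
    if is_pointed_at:
        saliency += 0.4
    return min(1.0, saliency)

# A moving object that the partner points at becomes maximally salient.
assert object_saliency(True, True) > object_saliency(False, False)
```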
2) Allostatic Controller: In many situations, several potentially conflicting drives can be active at the same time (in the case of this paper, the drive for knowledge acquisition and the drive for knowledge expression). Such conflicts can be resolved through the concept of an allostatic controller [80], [81], defined as a set of simple homeostatic control loops together with a scheduling mechanism that ensures an efficient global regulation of the internal state variables. The scheduling is decided according to the internal state of the robot and the output of the sensations module. The decision of which drive to follow depends on several factors: the distance of each drive level from its homeostatic boundaries, as well as predefined drive priorities (in DAC-h3, knowledge acquisition has priority over knowledge expression, which results in a curious personality).
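The scheduling principle just described (distance to the homeostatic boundary, with ties broken by predefined priorities) can be illustrated with a small sketch; all names and drive parameters are hypothetical:

```python
# Sketch of allostatic drive scheduling (hypothetical API): among the drives
# that are out of their homeostatic range, pick the one to regulate based on
# how far it lies from its boundary, using a fixed priority order for ties.

def boundary_distance(level, minimum, maximum):
    """How far the drive level lies outside [minimum, maximum] (0 if inside)."""
    if level < minimum:
        return minimum - level
    if level > maximum:
        return level - maximum
    return 0.0

def select_drive(drives, priority):
    """drives: {name: (level, min, max)}; priority: list, most important first."""
    out_of_range = {
        name: boundary_distance(*params)
        for name, params in drives.items()
        if boundary_distance(*params) > 0.0
    }
    if not out_of_range:
        return None  # all drives satisfied: nothing to regulate
    rank = {name: i for i, name in enumerate(priority)}
    # Largest boundary distance first; priority rank breaks ties.
    return min(out_of_range, key=lambda n: (-out_of_range[n], rank[n]))

drives = {
    "knowledge_acquisition": (0.1, 0.4, 1.0),   # far below its minimum
    "knowledge_expression": (0.35, 0.4, 1.0),   # slightly below its minimum
}
priority = ["knowledge_acquisition", "knowledge_expression"]
assert select_drive(drives, priority) == "knowledge_acquisition"
```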
3) Behaviors: To regulate the aforementioned drives, the allostatic controller is connected to the behaviors module, and each drive is linked to corresponding behaviors intended to bring it back into its homeostatic range whenever needed. The positive influence of such a drive regulation mechanism on the acceptance of HRI by naive users has been demonstrated in previous papers [82], [83].
The drive for knowledge acquisition is regulated by requiring information about entities through coordinated behaviors. Those behaviors depend on the type of the considered entity:
- In the case of an object, the robot produces speech (e.g. 'What is this object?') while pointing and gazing at the unknown object.
- In the case of an agent, the robot produces speech (e.g. 'Who are you?') while looking at the unknown human.
- In the case of a body part, the robot either asks for the name (e.g. 'How do you call this part of my body?') while moving it or, if the name is already known from a previous interaction, asks the human to touch the body part while moving it (e.g., 'Can you touch my index while I move it, please?').
The multimodal information collected through these behaviors will be used to form unified representations of entities in the adaptive layer (see next section).
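The mapping from entity types to coordinated request behaviors listed above can be summarized in a small dispatch sketch; the function and behavior labels are illustrative, and the real behaviors module coordinates speech, gaze and motor commands rather than returning tuples:

```python
# Hypothetical dispatch from entity type to a knowledge-acquisition behavior,
# mirroring the three cases described in the text (object, agent, body part).

def acquisition_behavior(entity_type, name_known=False):
    """Return the coordinated primitives used to request missing information."""
    if entity_type == "object":
        return ("say:What is this object?", "point", "gaze")
    if entity_type == "agent":
        return ("say:Who are you?", "gaze")
    if entity_type == "body_part":
        if name_known:
            # Name already acquired: ask for a touch while moving the part.
            return ("say:Can you touch it while I move it?", "move")
        return ("say:What do you call this part of my body?", "move")
    raise ValueError(f"unknown entity type: {entity_type}")

assert acquisition_behavior("object") == ("say:What is this object?", "point", "gaze")
```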
The drive for knowledge expression is regulated by executing actions towards known entities, synchronized with speech sentences parameterized by the entities' linguistic labels acquired in the adaptive layer (see next section). Motor actions are realized through the action rendering engine (ARE [84]) functional module, which allows executing complex actions such as push, reach, take, and look in a coordinated, human-like fashion. Language production abilities are implemented in the form of predefined grammars (for example, asking for the name of an object). Semantic words associated with entities are not present at the reactive level, but are provided by the learned associations operating in the adaptive layer. The iSpeak module implements a bridge between the iCub and a voice synthesizer by synchronizing the produced utterance with the lip movements of the iCub to realize a more vivid interaction [85].
## D. Adaptive layer
The adaptive layer oversees the acquisition of a state space of the agent-environment interaction by binding visual, tactile, motor and linguistic representations of entities. It integrates functional modules for maintaining an internal representation of the current scene, visually categorizing entities, recognizing and sensing body parts, extracting linguistic labels from human speech, and learning associations between multimodal representations. These are grouped into three structural modules described below: perceptions, associations and action selection (see Figure 1).
1) Perceptions: The object recognition functional module [86] learns to categorize objects directly from the visual information given by the iCub's eyes using recent deep convolutional networks. The bounding boxes of the objects found by the sensations module are fed to the learning module for the recognition stage. The output of the system consists of the 2D (in the image plane) and 3D (in the world frame) positions of the identified objects, along with the corresponding classification scores, as stored in the objects properties collector memory (explained below).
There are two functional modules related to language understanding and language production, both integrated within the language reservoir handler (LRH). The comprehension of narrative discourse module receives a sentence and produces the representation of the corresponding meaning, and can thus transform human speech into meaning. The module for narrative discourse production receives a representation of meaning and generates the corresponding sentence (meaning to speech). The meaning is represented in terms of PAOR: predicate(arguments), where the arguments correspond to the thematic roles (agent, object, recipient). Both modules are implemented as recurrent neural networks based on reservoir computing [87], [88], [89].
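The PAOR meaning format can be illustrated with a small sketch; the class and its naive meaning-to-text rendering are hypothetical stand-ins, whereas the actual LRH modules are reservoir networks, not rule-based converters:

```python
# Illustrative PAOR meaning structure: predicate(arguments), with arguments
# filling the thematic roles agent, object and recipient (hypothetical class).

from dataclasses import dataclass
from typing import Optional

@dataclass
class PAOR:
    predicate: str
    agent: str
    object: Optional[str] = None     # roles beyond agent are optional
    recipient: Optional[str] = None

    def to_sentence(self):
        """Naive meaning-to-text rendering, for illustration only."""
        parts = [self.agent, self.predicate]
        if self.object:
            parts.append(self.object)
        if self.recipient:
            parts.append("to " + self.recipient)
        return " ".join(parts)

meaning = PAOR(predicate="gives", agent="John", object="the cube", recipient="Mary")
assert meaning.to_sentence() == "John gives the cube to Mary"
```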
The synthetic sensory memory (SSM) module is currently employed for face recognition and action recognition using a fusion of RGB-D data and object location data as provided by the sensations module. In terms of action recognition, SSM has been trained to automatically segment and recognize the following actions: push, pull, lift, drop, wave, and point, while also actively recognizing whether the current action is known or unknown. More generally, it has been shown that SSM provides abilities for pattern learning, recall, pattern completion, imagination and association [90]. The SSM module is inspired by the role of the hippocampus in fusing multiple sensory input streams and representing them in a latent feature space [91]. During recall, SSM classifies incoming sensory data and returns a label along with an uncertainty measure for that label, as is the case for the action and face recognition tasks within DAC-h3. SSM is also capable of imagining novel inputs or reconstructing previously encountered inputs and sending the corresponding generated sensory data. This allows for the replay of memories, as detailed in [92].
2) Associations: The associations structural module produces unified representations of entities by associating the multimodal categories formed in the perception module. Those unified representations are formed in the objects properties collector (OPC), a functional module storing all information associated with a particular entity at the present moment in a proto-language format, as detailed in [82]. An entity is defined as a concept which can be manipulated and is thus the basis for emerging knowledge. In DAC-h3, each entity has an associated name, which might be unknown if the entity has been discovered but not yet explored. Higher-level entities such as objects, body parts and agents have additional intrinsic properties. For example, an object also has a location and dimensions associated with it. Furthermore, whether the object is currently present is encoded as well, and if so, its saliency value (as computed by the PASAR module described in Section III-C). A body part, on the other hand, is an entity which contains a proprioceptive property (i.e. a specific joint) and a tactile information property (i.e. an association with tactile sensors). Thus, the OPC allows integrating multiple modalities of one and the same entity to ground the knowledge about the self, other agents, and objects, as well as their relations. Relations can be used to link several instances in an ontological model (see Section III-E1: Episodic Memory).
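The entity properties listed above can be illustrated as a small data model; the class and field names are hypothetical, mirroring the OPC description rather than its actual storage format:

```python
# Illustrative data model for OPC-style entities (hypothetical fields): objects
# carry location/dimensions/presence/saliency, body parts carry joint and
# tactile associations; a name may still be unknown after discovery.

class Entity:
    def __init__(self, name=None):
        self.name = name            # None until the entity has been explored

    @property
    def is_known(self):
        return self.name is not None

class Obj(Entity):
    def __init__(self, name=None, location=None, dimensions=None,
                 present=False, saliency=0.0):
        super().__init__(name)
        self.location = location
        self.dimensions = dimensions
        self.present = present
        self.saliency = saliency        # computed by PASAR when present

class BodyPart(Entity):
    def __init__(self, name=None, joint=None, tactile_ids=None):
        super().__init__(name)
        self.joint = joint                     # proprioceptive property
        self.tactile_ids = tactile_ids or []   # associated skin sensors

cube = Obj(location=(0.3, 0.1, 0.0), present=True, saliency=0.6)
assert not cube.is_known    # discovered but not yet explored (no name)
cube.name = "cube"
assert cube.is_known
```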
Learning the multimodal associations that form the internal representations of entities relies on the behavior generated by the knowledge acquisition drive operating at the reactive level (see previous section). Multimodal information about entities generated by those behaviors is bound together by registering the acquired information in the specific data format used by the OPC. For instance, the language reservoir handler module described above deals with speech analysis to extract entity labels from human replies (e.g. 'this is a cube'; {P:is, A:this, O:cube, R:∅}). The extracted labels are associated with the acquired multimodal information, which depends on the entity type: visual representations generated by the object recognition module in the case of an object or by the agent detector in the case of an agent, as well as motor and touch information in the case of a body part.
The associations of representations can also be applied to the developmental robot itself (instead of external entities as above), to acquire motor capabilities or to learn the links between motor joints and skin sensors of its body [93]. Learning self-related representations of the robot's own body schema is realized by the sensorimotor representation learning functional module dedicated to forward model learning [94]. The module receives sensory data collected from the robot's sensors (e.g. cameras, skin, joint encoders) and allows accurate prediction of the next state given the current state and an action. Importantly, the forward model is learned based on sensory experiences rather than based on known mechanical properties of the robot's body.
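Forward-model learning of this kind can be illustrated with a deliberately simple tabular sketch (the actual SRL module [94] learns from continuous sensory streams; all names here are hypothetical):

```python
# Minimal sketch of forward-model learning from experienced transitions
# (hypothetical tabular learner). The model predicts the next state from a
# (state, action) pair, learned from the robot's own sensory experiences
# rather than from known mechanical properties of its body.

class ForwardModel:
    def __init__(self):
        self.transitions = {}   # (state, action) -> next_state

    def observe(self, state, action, next_state):
        """Record one sensorimotor experience."""
        self.transitions[(state, action)] = next_state

    def predict(self, state, action):
        """Predict the next state; None if the pair was never experienced."""
        return self.transitions.get((state, action))

model = ForwardModel()
model.observe("arm_down", "lift", "arm_up")
assert model.predict("arm_down", "lift") == "arm_up"
```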
The kinematic structure learning functional module [95], [96] estimates an articulated kinematic structure of arbitrary objects (including the robot's body parts and humans) using visual input from the iCub's eye cameras. This again is based on sensory experiences rather than known properties of the agents, which is important to autonomously identify the abilities of other agents. Based on the estimated articulated kinematic structures [95], we also allow the iCub to anchor the kinematic structure joints of two objects by observing their movements [96], formulating the problem as one of finding corresponding kinematic joint matches between two articulated kinematic structures. This
Figure 2. Examples of the kinematic structure correspondences which the iCub has found. The top figure shows the correspondences between the left and right arm of the iCub, which can be used to infer the body part names of one arm if the corresponding names of the other arm are known. Similarly, the bottom figure shows correspondences between the robot's body and the human's body.
allows the iCub to infer correspondences between its own body parts (its left arm and its right arm), as well as between its own body and the body of the human as retrieved by the agent detector [93] (see Figure 2).
Finally, based on these correspondences, the perspective taking functional module [97] enables the robot to reason about the state of the world from the partner's perspective. This is important in situations where the views of the robot and the human diverge, for example, due to objects which are hidden to the human but visible to the robot. More importantly, perspective taking is thought to be an essential element for successful cooperation and to ease communication, for example by resolving ambiguities [98]. By mentally aligning the self-perspective with that of the human partner, this module allows algorithms concerned with the visuospatial perception of the world to reason as if the input was acquired from an egocentric perspective. Learning algorithms trained on egocentric data can thus be applied to data acquired from the human's perspective without the need to adapt them.
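The geometric core of this mental alignment can be sketched as a change of reference frame; the helper below is a hypothetical 2D illustration, not the actual perspective taking module [97]:

```python
# Sketch of the geometric core of perspective taking (hypothetical helper):
# re-express a world-frame point in the partner's egocentric frame, so that
# egocentric-trained algorithms can be applied unchanged to the partner's view.

import math

def to_partner_frame(point, partner_pos, partner_yaw):
    """Translate by the partner position, then rotate by -yaw (2D, radians)."""
    dx = point[0] - partner_pos[0]
    dy = point[1] - partner_pos[1]
    c, s = math.cos(-partner_yaw), math.sin(-partner_yaw)
    return (c * dx - s * dy, s * dx + c * dy)

# A partner standing at (1, 0) facing the robot (yaw = pi) sees the origin
# one unit in front of them along their own x axis.
x, y = to_partner_frame((0.0, 0.0), (1.0, 0.0), math.pi)
assert abs(x - 1.0) < 1e-9 and abs(y) < 1e-9
```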
3) Action Selection: The action selection module uses the information from associations to provide context to the behaviors module at the reactive level. This context corresponds to entity names, which are provided as parameters to the behaviors module, for instance when pointing at a specific object or using the object's linguistic label in the parameterized grammars defined at the reactive level. This module also deals with the scheduling of action plans from the contextual layer according to the current state of the system, as explained below.
## E. Contextual layer
The contextual layer deals with higher-level cognitive functions that extend the time horizon of the cognitive agent, such as an episodic memory, goal representation, planning, and the formation of a persistent autobiographical memory of the robot's interaction with the environment. These functions rely on the unified representations of entities acquired at the adaptive level. The contextual layer consists of three functional modules that are described below: 1) episodic memory, 2) goals and action plans, and 3) autobiographical memory, which is used to generate narrative structure.
1) Episodic Memory: The episodic memory relies on advanced functions of the object property collector (OPC) to store and associate information about entities in a uniform format based on the interrogative words 'who' (is acting), 'what' (they are doing), 'where' (it happens), 'when' (it happens), 'why' (it is happening) and 'how', called the H5W data structure [82]. It is used for goal representation and as the elements of the autobiographical memory. The H5W questions have been argued to be the main questions any conscious being must answer to survive in the world [99], [100].
The concept of relations is the core of the H5W framework. A relation links up to five concepts and assigns them semantic roles to form a solution to the H5W problem. We define a relation as a set of five edges connecting those nodes in a directed and labeled manner. The labels of those edges are chosen so that the relation models a typical English sentence of the form: Relation → Subject Verb [Object] [Place] [Time]. The brackets indicate that the components are optional; the minimal relation is therefore composed of two entities representing a subject and a verb.
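The relation structure can be sketched as follows; the class is a hypothetical illustration of the labeled-edge definition above, with the subject and verb mandatory and the remaining roles optional:

```python
# Sketch of the H5W relation structure (hypothetical class): up to five
# labeled, directed edges modeling Relation -> Subject Verb [Object] [Place]
# [Time], where the minimal relation is a subject and a verb.

class Relation:
    def __init__(self, subject, verb, object=None, place=None, time=None):
        if not subject or not verb:
            raise ValueError("a relation needs at least a subject and a verb")
        self.subject, self.verb = subject, verb
        self.object, self.place, self.time = object, place, time

    def edges(self):
        """Return only the labeled edges that are actually present."""
        labels = ["subject", "verb", "object", "place", "time"]
        values = [self.subject, self.verb, self.object, self.place, self.time]
        return {l: v for l, v in zip(labels, values) if v is not None}

r = Relation("iCub", "takes", object="cube", place="table")
assert r.edges() == {"subject": "iCub", "verb": "takes",
                     "object": "cube", "place": "table"}
```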
2) Goals and action plans: Goals can be provided to the iCub through human speech, from which a meaning is extracted by the language reservoir handler, forming the representation of a goal in the goals module. Each goal refers to an appropriate predefined action plan, defined as a state transition graph with states represented by nodes and actions represented by edges. The action plans module extracts sequences of actions from this graph, with each action being associated with a pre- and a post-condition state. Goals and action plans can be parameterized by the name of a considered entity. For example, if the human asks the iCub to take the cube, this loads an action plan for the goal 'Take an object' which consists of two actions: 'Ask the human to bring the object closer' and 'Pull the object'. In this case, each action is associated with a pre- and post-condition state in the form of a region of space where the object is located. In the action selection module of the adaptive layer, the plan is instantiated toward a specific object according to the knowledge retrieved from the associations module (e.g. allowing the current position of the cube to be retrieved). The minimal sequence of actions achieving the goal is then executed according to the perceived current state, updated in real time, repeating each action until its post-condition is met (or ceasing the effort after a predefined timeout).
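The execution scheme described above (pre- and post-conditions, repeating an action until its post-condition holds, giving up after a timeout) can be sketched as follows; all names are hypothetical, with a retry limit standing in for the predefined timeout:

```python
# Sketch of action-plan execution (hypothetical API): each action carries a
# pre- and post-condition as boolean predicates over the perceived state; an
# action is repeated until its post-condition holds, up to a retry limit.

def execute_plan(plan, get_state, do_action, max_tries=3):
    """plan: list of (action, precondition, postcondition).
    Returns True if every post-condition was eventually satisfied."""
    for action, pre, post in plan:
        if not pre(get_state()):
            return False                  # plan not applicable from this state
        for _ in range(max_tries):
            do_action(action)
            if post(get_state()):
                break
        else:
            return False                  # ceased the effort (timeout stand-in)
    return True

# Toy world for 'Take an object': the object starts far, is brought near, pulled.
state = {"object": "far"}
effects = {"ask_bring_closer": "near", "pull": "taken"}

def get_state():
    return state

def do_action(a):
    state["object"] = effects[a]

plan = [
    ("ask_bring_closer", lambda s: s["object"] == "far",
                         lambda s: s["object"] == "near"),
    ("pull",             lambda s: s["object"] == "near",
                         lambda s: s["object"] == "taken"),
]
assert execute_plan(plan, get_state, do_action)
assert state["object"] == "taken"
```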
Although quite rigid in its current implementation, in the sense that action plans are predefined instead of being learned from the interaction, this planning ability allows closing the loop of the whole architecture, where drive regulation mechanisms at the reactive layer can now be bypassed through contextual goal-oriented behavior. Limitations of this system are discussed in Section V: Conclusions.
3) Autobiographical Memory: The autobiographical memory (ABM [101], [102], [103]) collects long-term information (days, months, years) about interactions, motivated by human declarative long-term memory situated in the medial temporal lobe and the distinction between facts and events [101]. It stores data (e.g. object locations, human presence) from the beginning to the end of an episode by taking snapshots of the environmental information from the episodic memory containing the pre-conditions and effects of episodes. This allows the generation of high-level concepts extracted by knowledge-based reasoning. In addition, the ABM captures continuous information during an episode (e.g. images from the camera, joint values) [102], which can be used by reasoning modules focusing on the action itself, leading to the production of a procedural memory (e.g. through learning from motor babbling or imitation) [103].
The narrative structure learning module builds on the language processing and ABM capabilities. Narrative structure learning occurs in three phases: 1) First, the iCub acquires experience in a given scenario, which generates the meaning representation in the ABM. 2) The iCub then formats each story in terms of initial states, goal states, actions and results (IGARF graph [89]). 3) The human then provides a narration (which is understood using the reservoir system explained in Section III-D) for the scenario.
By mapping the events of the narration to the events of the story, the robot can extract the meaning of different discourse function words (such as 'because'). It can thus automatically generate the corresponding form-meaning mappings that define the individual grammatical constructions, and their sequencing, which together constitute the narrative construction of a new narrative.
## F. Summary on the DAC-h3 architecture
The DAC-h3 architecture therefore integrates several state-of-the-art algorithms for cognitive robotics into a structured cognitive architecture grounded in the principles of the DAC theory. The drive reduction mechanisms in the reactive layer allow a complex control of the iCub robot, which proactively interacts with humans. In turn, this bootstraps the adaptive learning of multimodal representations about entities in the adaptive layer. Those representations form the basis of an episodic memory for goal-oriented behavior through planning in the contextual layer. The life-long interaction of the robot with humans continuously feeds an autobiographical memory able to retrieve past experiences on request and to express them verbally in a narrative. Altogether, this allows the iCub to interact with humans in complex scenarios, as described in the next section.
## IV. EXPERIMENTAL RESULTS
This section validates the cognitive architecture described in the previous section in a real demonstration with an iCub humanoid robot interacting with objects and a human. We first describe the experimental setup, followed by the behaviors provided to the robot (self-generated and human-requested). Finally, we analyze the DAC-h3 system in two ways: a complete version reporting the full complexity of the system through multiple video demonstrations and the detailed analysis of a particular interaction, as well as a simplified version showing the effect of the robot's proactivity level on naive users.
The code for reproducing these experiments on any iCub robot is available open-source at https://github.com/robotology/wysiwyd. It consists of all modules described in the last section, implemented in either C++ or Python, and relies on the YARP middleware [77] for defining their connections and ensuring their parallel execution in real time.

Figure 3. The setup consists of an iCub robot interacting with objects on the table and a human in front of it. The table is separated (indicated by horizontal lines) into three areas: I for the area only reachable by the iCub, S for the shared area, and H for the human-only area (compare with Section IV-A).
## A. Experimental setup
We consider an HRI scenario where the iCub and a human face each other with a table in the middle and objects placed on it. The surface of the table is divided into three distinct areas, as shown in Figure 3:
- 1) an area which is only reachable by the iCub ( I ),
- 2) an area which is only reachable by the human ( H ), and
- 3) an area which is reachable by both agents ( S for Shared ).
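For illustration, the three areas can be modeled as bands along the table's depth axis, so that an object's perceived position maps to a region label. The boundary values below are arbitrary placeholders, not the calibrated values used in the experiments:

```python
def table_region(y, icub_reach=0.3, human_reach=0.6):
    """Map a distance from the iCub's edge of the table (in meters)
    to a region label: 'I' (iCub only), 'S' (shared), 'H' (human only)."""
    if y < icub_reach:
        return "I"
    if y < human_reach:
        return "S"
    return "H"
```

Such a labeling is all that the planning layer needs: pre- and post-conditions of actions like 'Pull the object' can then be stated directly over the symbols I, S, and H.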
The behaviors available to the iCub are the following:
- 'Acquire missing information about an entity', which is described in more detail in Section IV-B1.
- 'Express the acquired knowledge', which is described in more detail in Section IV-B2.
- 'Move an object on the table', either by pushing it from region I to S or pulling it from region S to I ,
- 'Ask the human to move an object', either by asking to push the object from region H to S or by asking to pull it from region S to H .
- 'Show learned representations on screen' while explaining what is being shown, e.g. displaying the robot kinematic structure learned from a previous arm babbling phase.
- 'Interact verbally with the human' while looking at her/him. This is used for replying to some human requests as described in Section IV-C.
These behaviors are implemented in the behaviors module and can be triggered from two distinct pathways as shown in Figure 1. The behaviors for acquiring and expressing knowledge are triggered through the drive reduction mechanisms implemented in the allostatic controller (Section III-C) and are self-generated by the robot. The remaining behaviors are triggered from the action selection module (Section III-D), scheduling action sequences from the goals and action plans modules (Section III-E). In the context of the experiments described in this section, these behaviors are requested by the human partner. We describe these two pathways in the two following subsections.
## B. Self-generated behavior
Two drives for knowledge acquisition and knowledge expression implement the interaction engine of the robot (see Section III-C). They regulate the knowledge acquisition process of the iCub and proactively maintain the interaction with the human. The generated sensorimotor data feeds the adaptive layer of the cognitive architecture to acquire multimodal information about the present entities (see Section III-D). In the current experiment, the entities are objects on the table, body parts (fingers of the iCub), human partners, and actions. The acquired multimodal information depends on the considered entity. Object representations are based on visual categorization and stereo-vision based 3D localization performed by the object recognition functional module. Body part representations associate motor and touch events. Agent and action representations are learned from visual input in the synthetic sensory memory module presented in Section III-D1. Each entity is also associated with a linguistic label learned by self-regulating the two drives detailed below.
1) Drive to acquire knowledge: This drive maintains a curiosity-driven exploration of the environment by proactively requesting the human to provide information about the present entities, e.g. naming an object or touching a body part. The drive level decays proportionally to the amount of missing information about the present entities (e.g. the unknown name of an entity). When it falls below a given threshold, it triggers a behavior following a generic pattern of interaction, instantiated according to the nature of the knowledge to be acquired. The behavior begins by establishing joint attention between the human and the robot toward the entity that the robot wants to learn about. After the attention has been attracted toward the desired entity, the iCub asks for the missing information (e.g. the name of an object or of the human, or in the case of a body part the name and touch information) and the human replies accordingly. In a third step, this information is passed to the adaptive layer and the knowledge of the robot is updated.
Each time the drive level reaches the threshold, an entity is chosen in a pseudo-random way within the set of perceived entities with missing information, with priority given to requesting the name of a detected unknown human partner. Once a new agent enters the scene, the iCub asks for her/his name, which is stored alongside representations of the face in the synthetic sensory memory module. Similarly, the robot stores all objects it has previously encountered in its episodic memory, implemented by the object property collector module. When the chosen entity is an object, the robot asks the human to provide its name while pointing at it. Then, the visual representation of the object computed by the object recognition module is mapped to the name. When the chosen entity is a body part (left-hand fingers), the iCub first raises its hand and moves a random finger to attract the attention of the human. Then it asks for the name of that body part. This provides a mapping between the robot's joint identifier and the joint's name. This mapping can be extended to include tactile information by asking the human to touch the body part which is being moved by the robot.
Once a behavior has been triggered, the drive is reset to its default value and decays again as explained above (the amount of the decay being reduced according to what has been acquired).
2) Drive to express knowledge: This drive regulates how the iCub expresses the acquired knowledge through synchronized speech, pointing and gaze. It aims at maintaining the interaction with the human by proactively informing her/him about its current state of knowledge. The drive level decays proportionally to the amount of already acquired information about the present entities. When below a given threshold (meaning that a significant amount of information has been acquired), it triggers a behavior alternating gazing toward the human and a known entity, synchronized with speech expressing the knowledge verbally, e.g. 'This is the octopus', or 'I know you, you are Daniel'. Once such a behavior has been triggered, the drive is reset to its default value and decays again as explained above (the amount of the decay changing according to what is learned by satisfying the drive for knowledge acquisition).
These two drives allow the robot to balance knowledge acquisition and expression in an autonomous and dynamic way. At the beginning of the interaction, the robot has little knowledge about the current entities and therefore favors behaviors for knowledge acquisition. By acquiring more and more knowledge, it progressively switches to behaviors for knowledge expression. If new entities are introduced, e.g. a new object or another human, it will switch back to triggering more behaviors for knowledge acquisition and so on.
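The coupled dynamics of the two drives, acquisition decaying with the amount of missing knowledge and expression decaying with the amount of acquired knowledge, each resetting to its default value when its threshold triggers a behavior, can be sketched as a toy simulation. The decay rate, threshold, and default value below are illustrative, and the freezing of drives during behavior execution is omitted for brevity:

```python
def step_drives(drives, missing, known, dt=1.0,
                threshold=0.2, default=1.0, rate=0.05):
    """Advance both drive levels by one tick and report triggered behaviors.

    `missing` and `known` count unknown vs. learned properties of the
    perceived entities; each drive decays proportionally to its count.
    """
    triggered = []
    drives["acquire"] -= rate * missing * dt
    drives["express"] -= rate * known * dt
    for name in ("acquire", "express"):
        if drives[name] <= threshold:
            triggered.append(name)       # the associated behavior fires...
            drives[name] = default       # ...and the drive is reset
    return triggered
```

Early in an interaction `missing` is large and `known` is small, so acquisition behaviors dominate; as labels are learned the balance shifts toward expression, reproducing the switching behavior described above.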
## C. Human-requested behavior
The representations which are acquired through satisfying the drives introduced above allow a more complex interaction through goal-oriented behavior managed by the contextual layer (see Figure 1 and Section III-E). Goals can be provided to the iCub from human speech and a meaning is extracted by the language reservoir handler, forming the representation of a goal in the goals module. Each goal is associated with an action plan in the form of a sequence of actions together with their pre- and post-conditions in the action plans module. The action selection module takes care of the execution of the plan according to the associations known to the robot, triggering
Figure 4. State transition graph used for generating the action plans of the goals 'Give' and 'Take'. Nodes correspond to the table regions indicated in Figure 3. Arrows correspond to the actions to be executed for realizing a transition. According to the observed current state of the object, the iCub will execute the action which brings it closer to the goal state. For example, if the goal is to take an object which is in the human area, the robot will first ask the human to push it, and subsequently pull it into its own area.
the appropriate behaviors according to its current perception of the scene updated in real time. Goal achievement bypasses the reactive behavior described in the previous subsection by freezing all the drive levels during the execution of the plan. The available goals are described below.
- 1) Give or take an object: These goals are generated from a human verbal request, e.g. 'Give me the octopus' or 'Take the cube'. Here, the goal is represented as a region on the table, either the human area H (for the 'Give' goal) or the iCub area I (for the 'Take' goal), where the mentioned object should be placed. Action plans are generated from the state-transition graph shown in Figure 4. State perception is updated in real time according to the perceived location of the object computed through stereo-vision in the object recognition module.
- 2) Point to an object: This goal is generated through a verbal request, e.g. 'Point to the octopus' . If the mentioned object is not known to the iCub, it will first ask the human to point to it to learn the new association between the name and the object's visual representation. Once the name is known, or if it was already known, the iCub will point to the object.
- 3) Say the name of a recognized action: This goal is generated through a verbal request, e.g. 'How do you call this action?' formulated just after the human has performed an action on an object. Six actions can be recognized by the synthetic sensory memory module: 'push', 'pull', 'lift', 'drop', 'wave', and 'point'. The reply from the iCub provides the name of the action and the object as well as the hand used by the human, e.g. 'You pushed the cube with your left hand' .
- 4) Say what happened during a past interaction with a human: This goal is generated through a verbal request, e.g. 'What have you done the other day?'. Based on its past interactions with the environment and with humans, the iCub has stored all the relevant information in its autobiographical memory (see Section III-E), including its own drives, motivations, and actions, as well as the actions of the human (both spoken and physically enacted). The narrative handler functional module can generate a narrative discourse from the content of the autobiographical memory and generate an action plan in the form of a sequence of sentences. The human can request more detailed information about an event using sentences like 'What happened next?' or 'Why did you do that?', this latter question being answered according to the stored drive levels and goal states of the considered events. The robot can also learn a set of questions that it can re-use in the context of another story. Figure 5 shows an example of a generated narrative.
- 5) Show the learned kinematic structure: As for the previous goals, this goal is generated through verbal requests. When asked 'What have you learned from your arm babbling?', the iCub directs the human to look at the screen where the kinematic structures of its arms are displayed. Lines which connect nodes of the kinematic structures indicate the correspondences which the iCub has found between its left and right arm. Similarly, the iCub displays the correspondences which it has found between one of its arms and the body of the human (see Figure 2). This knowledge is further employed to point to the human's arm, which is interesting as both the name and the kinematic location of the human's arm are inferred from self-learned representations and from mapping these representations to the partner.
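The state-transition graph of Figure 4, used for the 'Give' and 'Take' goals above, can be written down as a small lookup table: given the object's current region and the goal region, the table yields the next action to execute. Region labels follow Figure 3; the action names are illustrative:

```python
# (current region, goal region) -> next action to execute
TRANSITIONS = {
    ("H", "I"): "ask human to push",   # H -> S
    ("S", "I"): "pull",                # S -> I
    ("I", "H"): "push",                # I -> S
    ("S", "H"): "ask human to pull",   # S -> H
}

def plan_to_goal(current, goal):
    """List the actions that move an object from `current` to `goal`,
    stepping through the ordered regions I - S - H one at a time."""
    order = ["I", "S", "H"]
    i, g = order.index(current), order.index(goal)
    step = 1 if g > i else -1
    actions = []
    while i != g:
        actions.append(TRANSITIONS[(order[i], goal)])
        i += step
    return actions
```

For example, taking an object that lies in the human area first requires asking the human to push it into the shared area, and then pulling it, exactly the two-step plan described for Figure 4.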
First I wanted to get the toy.
First you have the toy.
Then I fail to grasp it.
After I fail to grasp, I reasoned.
Because I reasoned, I ask for the toy to you.
Finally you gave me the toy.
Now I have the toy.
I have the toy because you gave it to me.
You gave the toy to me because I wanted it.
Figure 5. Example of a narrative generated by the robot. The language reservoir handler decomposes the words in the narrative discourse into three categories: the discourse function words (DFW) which direct the discourse from one sentence to the next, the open class words (OCW) which correspond to the meaningful words in terms of vocabulary of the sentence, and the closed class words (CCW) which have a grammatical function in the sentence (see [89]).
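As a toy illustration of this three-way split, the word sets below are tiny hand-made stand-ins for the categories that the reservoir system actually learns from data:

```python
# Tiny hand-made stand-ins for the reservoir's learned word categories.
DFW = {"first", "then", "after", "because", "finally", "now"}
CCW = {"i", "you", "the", "to", "it", "me", "for"}

def categorize(sentence):
    """Tag each word as DFW, CCW, or OCW (open class, i.e. content word)."""
    tags = []
    for raw in sentence.split():
        w = raw.lower().strip(".,")
        if w in DFW:
            tags.append((w, "DFW"))
        elif w in CCW:
            tags.append((w, "CCW"))
        else:
            tags.append((w, "OCW"))
    return tags
```

Running this over a line of Figure 5 such as 'Finally you gave me the toy' tags 'finally' as a DFW, 'gave' and 'toy' as OCW, and the remaining pronouns and determiners as CCW.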
## D. Scenario Progression
We first show how the full DAC-h3 system we have just described is able to generate a complex human-robot interaction by providing videos of live interactions (see https://github.com/robotology/wysiwyd) and a detailed description of a particular interaction. Then, in the next subsection, we analyze more specifically the effect of the robot's proactivity level on naive users. In both cases, we consider a mixed-initiative scenario, where the iCub behaves autonomously as described in Section IV-B, and so does the human. The human can interrupt the robot's behavior by formulating verbal requests as described in Section IV-C. The scenario can follow various paths according to the interaction between the iCub's internal drive dynamics, its perception of the environment, and the behavior of the human.
Here, we describe one particular instance of the scenario. Figure 6 shows the corresponding drive dynamics and human-robot interactions, and Figure 1 shows the connections between the modules of the cognitive architecture. Each of the numbered items below refers to its corresponding number in Figure 6.
- 1) At the beginning of the interaction, the iCub has only limited knowledge about the current scene. In the sensations module, the agent detector detects the presence of a human and extracts its skeleton. The object recognition module performs blob detection for extracting objects on the table from the visual input of the eye cameras and
Figure 6. Drive level dynamics during a typical mixed-initiative scenario described in Section IV-D. Each drive starts at its default value and decays following the dynamics described in Section IV-B. When reaching a given threshold (dashed horizontal line) the associated behavior is triggered (green rectangles), the corresponding drive level is reset to its default value and both drive levels are frozen for the duration of the behavior. Human behavior is indicated by the red rectangles, being either a reply to a question asked by the iCub (small rectangles) or a request to the iCub triggering goal oriented behavior (here: 'Take the cube'). The numbers refer to the description of the scenario progression in the main text.
trains a classifier to categorize them. The 3D locations of the present objects are also computed in object recognition through stereo vision. This forms a first, incomplete representation of the scene in the episodic memory, where the object property collector registers the location and type of each detected entity (here, objects and an agent). It also contains slots for unknown body parts of the iCub, here the five fingers of its right hand.
- 2) The presence of a large amount of missing information (unknown objects, human and body parts) in the sensations module causes the drive for knowledge acquisition to decay rapidly in the allostatic controller, while the drive for knowledge expression is kept constant (since there is no knowledge to express yet).
- 3) When the knowledge acquisition drive level is below threshold, it triggers the associated behavior ( behaviors module) for acquiring information about an entity. The choice of the unknown entity is pseudo-random, with priority for requesting the name of an unknown human. This makes the robot look at the human. The visual input is passed to the perceptions module where the synthetic sensory memory segments the face from the background and attempts to recognize it from previously seen faces. If it does not recognize the face, the robot asks, 'I do not know you, who are you?'. The human can then reply, e.g., saying 'I am Daniel'. The level of the drive is reset to its default value and both drives are frozen during the behavior execution.
- 4) The perceived speech is analyzed by the language reservoir handler in perceptions to extract the name 'Daniel', which is associated with the face representation in the associations module. Thus, the next time the iCub interacts with this person, it will recognize him directly instead of asking for his name.
- 5) Once this interaction is achieved, the drives continue to decay. Since the iCub has just acquired more information, the decay of the drive for knowledge acquisition is slower and that of the drive for knowledge expression is faster. Still, the drive for knowledge acquisition reaches the threshold first, and the behavior for acquiring information is therefore triggered again. This time, the random choice of an unknown entity makes the robot point to an object and ask, 'What is this object?'. The human replies, e.g., 'This is the cube'. The language reservoir handler extracts the name of the object from the reply and the associations module associates it with the visual representation of the pointed object from object recognition . The cube can now be referred to by its name.
- 6) The drives continue to decay. This time, the drive for knowledge expression reaches the threshold first, triggering the behavior for expressing the acquired knowledge. A known entity is chosen, in this example the cube, which the robot points at while saying 'This is a cube'.
- 7) The human asks 'Take the cube'. A meaning is extracted by the language reservoir handler in perceptions and forms the representation of a goal to achieve in the goal module (here the desired location of the object, i.e. the region of the iCub I for the goal 'take', see Figure 4). An action plan is built in action plans with the sequence of two actions, 'Ask the human to push the object' then 'Pull the object', together with their pre- and post-conditions in terms of states ( I , S or H ). The action selection module takes care of the realization of the plan. First, it instantiates the action plan toward the considered object, here the cube, through its connection with associations . Then, it executes each action until its associated post-condition is met (repeating it up to three times before giving up). Since the cube is in the human area H , the iCub first triggers the behavior for interacting verbally with the human, asking 'Can you please bring the cube closer to the shared area?'. The human pushes the cube to the shared area S and the state transition is noticed by the robot thanks to the real-time object localization performed in the object recognition module. The robot then triggers a motor action to pull the cube. Once the goal is achieved (i.e. the cube is in I ), the drive levels, which were frozen during this interaction, resume their decay.
- 8) The drive for knowledge acquisition reaches the threshold first. The associated behavior now chooses to acquire the name of a body part. The robot triggers the behavior for raising its hand and moving a random unknown body part, here the middle finger. It looks at the human and asks 'How do you call this part of my body?'. The name of the body part is extracted from the human's reply and is associated with the joint that was moved in associations .
The interaction continues, following the dynamics of the drives and punctuated by requests from the human. Once all available information about the present entities is acquired, the drive for knowledge acquisition stops decaying. However, the robot still maintains the interaction through its drive for knowledge expression, and the human can still formulate requests for goal-oriented behavior. When new entities are introduced, e.g. an unknown object or another human entering
the scene, the drive for knowledge acquisition decays again and the process continues.
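The drive dynamics described in steps 2) to 8) can be condensed into a simple control loop. The sketch below is illustrative only: the class, function, and behavior names are ours and do not come from the DAC-h3 code, and we assume the linear decay and the priority of acquisition over expression stated above.

```python
# Illustrative sketch of the two-drive allostatic controller (names are ours).

class Drive:
    def __init__(self, name, default=0.5, threshold=0.25):
        self.name = name
        self.level = default
        self.default = default
        self.threshold = threshold
        self.frozen = False  # drives are frozen during behavior execution

    def decay(self, rate, dt):
        if not self.frozen:
            self.level = max(0.0, self.level - rate * dt)

    def below_threshold(self):
        return self.level < self.threshold

    def reset(self):
        self.level = self.default


def step(acquisition, expression, n_unknown, n_known, delta, dt):
    """One control step: decay both drives proportionally to the number of
    unknown/known entities, and return the behavior to trigger, if any."""
    acquisition.decay(n_unknown * delta, dt)
    expression.decay(n_known * delta, dt)
    # Acquisition takes priority when both drives are below threshold.
    if acquisition.below_threshold():
        return "acquire"   # e.g. ask the name of an unknown entity
    if expression.below_threshold():
        return "express"   # e.g. point at a known entity and name it
    return None
```

With three unknown entities and no known ones, only the acquisition drive decays, so the first triggered behavior is always an acquisition, matching the opening of the interaction above.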
## E. Effect of the robot's proactivity level on naive users
We now test the DAC-h3 architecture with naive users performing a collaborative task with the iCub robot. To this aim, we conduct experiments with six subjects and compare different configurations of the DAC-h3 drive system reflecting different levels of the robot's proactivity.
To better control the experiment, we simplify the setup described above by limiting it to object tagging and manipulation. This means that in this study we do not use the functionalities for agent or body part tagging, action recognition, kinematic structure learning, and narrative discourse. The iCub can only proactively tag or point at objects on the table, whereas the human can reply to object tagging requests and provide orders for taking or giving an object. These orders trigger action plans combining object manipulation by the iCub with requests to the human to do so, as described above. We note that, due to the distributed implementation of the DAC-h3 system as interacting YARP modules operating in parallel, deactivating functionalities simply requires not launching the corresponding modules and does not require any modification of the code.
Three objects are placed on the table: an octopus , a blue cube and a duck . Initially, the names of the objects are unknown to the iCub, and the objects are placed as shown in Figure 7 (left). The task given to the subjects is to achieve a goal configuration of object positions (Figure 7, right). The experiment terminates when the task is achieved.
To do so, the subject is instructed to interact with the iCub in the following way. At any time during the interaction, s/he can give speech orders in the form of the sentences 'Give me the <object name>' and 'Please take the <object name>' , where <object name> is the name of one of the three objects. These names are provided to the subject before the experiment starts to make the speech recognition more robust. Whenever the iCub asks the subject for the name of an object, s/he can reply with 'This is the <object name>' . Moreover, whenever the iCub asks the subject to show a specific object, s/he can point to it using her/his right hand. This latter behavior is added to the state transition graph described in Figure 4 and is executed when the subject requests an action on an object which is unknown to the iCub. To increase the difficulty of the task, the subject is asked not to move objects on her/his own initiative, but only when the iCub asks her/him to do so.
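The three sentence templates above could be matched with a simple pattern matcher. The sketch below is an assumption-laden illustration, not the language reservoir handler used in DAC-h3: the intent labels are ours, and the object vocabulary is simplified to single-word names.

```python
import re

# Illustrative parser for the three speech-order templates used in the study.
# Intent names are ours; object names are simplified to single words.
KNOWN_OBJECTS = {"octopus", "cube", "duck"}

PATTERNS = [
    (re.compile(r"^give me the (\w+)$", re.I), "give"),
    (re.compile(r"^please take the (\w+)$", re.I), "take"),
    (re.compile(r"^this is the (\w+)$", re.I), "tag"),
]

def parse_order(sentence):
    """Return (intent, object_name), or None if the sentence does not match."""
    for pattern, intent in PATTERNS:
        m = pattern.match(sentence.strip())
        if m and m.group(1).lower() in KNOWN_OBJECTS:
            return intent, m.group(1).lower()
    return None
```

Restricting the vocabulary to the three announced object names, as done in the experiment, is what makes this kind of recognition robust to misheard words.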
Achieving the goal configuration requires a complex interaction between the subject and the iCub. For example, moving the octopus from the human region to the iCub region requires first informing the iCub which object is the octopus, and then asking the iCub to take it. Informing the robot about the name of an object can occur either on the iCub's initiative through the knowledge acquisition drive, or on the human's initiative by requesting an action on this object (when the object is unknown to the iCub, it will first ask the human to point at it). Since the octopus is not within the reach of the iCub, the robot will first ask the human to move it to the shared region, before executing the motor action for pulling the object into the iCub region.
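The plan-and-retry logic for the goal 'take <object>' can be sketched as follows. The region labels I/S/H follow Figure 3; the function and action names are ours, not identifiers from the DAC-h3 code, and the `attempt` callback stands in for triggering a behavior and re-observing the world.

```python
# Minimal sketch of plan execution for bringing an object into the iCub area I.

def plan_take(region):
    """Action sequence bringing an object from `region` into the iCub area I."""
    return {
        "H": ["ask_human_to_push", "pull_object"],  # out of reach: ask first
        "S": ["pull_object"],                       # shared area: just pull
        "I": [],                                    # goal already satisfied
    }[region]

def execute(region, attempt, max_retries=3):
    """Run the plan, re-observing the object's region after each attempt and
    retrying each action up to three times before giving up."""
    postconditions = {"ask_human_to_push": "S", "pull_object": "I"}
    for action in plan_take(region):
        post = postconditions[action]
        for _ in range(max_retries):
            region = attempt(action, region)  # trigger behavior, then observe
            if region == post:
                break
        else:
            return region  # post-condition never met: give up
    return region
```

With a cooperative human (every attempt succeeds), `execute("H", ...)` walks the object H → S → I, mirroring the octopus example above.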
Figure 7. The experimental task. Each panel shows a top view of the table, where the three letters I , S and H indicate the iCub, Shared and Human areas as in Figure 3. Starting from the initial configuration of object positions on the left, the subject is asked to achieve the goal configuration on the right. This is done by interacting with the iCub following the instructions described above. The three objects are: an octopus (orange object on the left of the table), a blue cube (blue object in the middle) and a duck (yellow object on the right).
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Diagram: Object Configuration Transformation
### Overview
The image displays a two-panel diagram illustrating a transformation from an "Initial configuration" to a "Goal configuration." It depicts the rearrangement of three distinct objects across three labeled positions or lanes. The diagram is likely from a robotics, planning, or puzzle-solving context, demonstrating a target state for an object manipulation task.
### Components/Axes
* **Panels:** Two side-by-side rectangular panels.
* **Left Panel:** Labeled "Initial configuration" at the top.
* **Right Panel:** Labeled "Goal configuration" at the top.
* **Positions/Lanes:** Each panel contains three horizontal lanes, demarcated by dashed lines. They are labeled on the left side with single letters:
* **I** (Top lane)
* **S** (Middle lane)
* **H** (Bottom lane)
* **Objects:** Three distinct 3D-rendered objects are present in each panel.
* A **blue cube**.
* A **yellow cube**.
* An **orange sphere**.
* **Transformation Indicator:** A solid black arrow points from the center of the "Initial configuration" panel to the center of the "Goal configuration" panel, indicating the direction of change.
### Detailed Analysis
**Initial Configuration (Left Panel):**
* **Lane I (Top):** Contains the **blue cube**. Positioned roughly in the center of the lane.
* **Lane S (Middle):** Contains the **yellow cube**. Positioned roughly in the center of the lane.
* **Lane H (Bottom):** Contains the **orange sphere**. Positioned roughly in the center of the lane.
**Goal Configuration (Right Panel):**
* **Lane I (Top):** Contains the **orange sphere**. Positioned roughly in the center of the lane.
* **Lane S (Middle):** Contains the **yellow cube**. Positioned roughly in the center of the lane.
* **Lane H (Bottom):** Contains the **blue cube**. Positioned roughly in the center of the lane.
**Transformation Summary:**
The change involves a swap of the objects between lanes I and H. The object in lane S (the yellow cube) remains in its original position.
### Key Observations
1. **Object Permanence & Identity:** The same three objects (blue cube, yellow cube, orange sphere) are present in both configurations. Their identities (color and shape) are preserved.
2. **Fixed Point:** The yellow cube in the middle lane (S) is the only object that does not change position between the initial and goal states.
3. **Symmetric Swap:** The transformation is a clean, symmetric exchange: the object from the top lane moves to the bottom lane, and the object from the bottom lane moves to the top lane.
4. **Spatial Consistency:** Within each panel, objects are centered within their respective lanes. The lane labels (I, S, H) are consistently placed on the far left.
### Interpretation
This diagram defines a clear **rearrangement goal**. It provides the "before" and "after" states for a task, implying that an agent (e.g., a robot, a software planner) must execute actions to move the objects from the initial layout to the goal layout.
The data suggests a task with the following constraints or characteristics:
* **Selective Movement:** Only two of the three objects require relocation. The system must recognize that the yellow cube is already in its goal state and should not be moved.
* **Specific Target Mapping:** The goal is not just a general "sort by color/shape" but a precise mapping of object identity to lane identity (Blue Cube -> H, Orange Sphere -> I, Yellow Cube -> S).
* **Underlying Logic:** The labels "I", "S", "H" could be abbreviations for specific locations, states, or categories (e.g., Input, Storage, Home; or In, Safe, Hazard). The transformation might represent a logical operation like swapping input and home positions while keeping a storage item fixed. The absence of other data (like coordinates or action sequences) indicates this is a high-level specification, not a detailed motion plan.
</details>
We run the experiment with six naive subjects in three different conditions. The three conditions correspond to three different levels of the robot's proactivity, defined by setting different drive decays in the allostatic controller module. The two drives for knowledge acquisition and knowledge expression are initialized to a default value of 0.5. Then, they decay linearly at a rate of n_obj · δ units/s, where δ is a constant defining the proactivity level. For the knowledge acquisition drive, n_obj is the number of perceived objects which are unknown to the robot. For the knowledge expression drive, it is the number of known objects. Therefore, the drive for knowledge acquisition (resp. knowledge expression) decays proportionally to the number of unknown objects (resp. known objects). We define three conditions: medium proactivity (δ = 0.01 for knowledge acquisition, δ = 0.004 for knowledge expression), slow proactivity (δ for both drives 2.5 times lower than in medium proactivity) and fast proactivity (δ for both drives 2.5 times higher than in medium proactivity). Corresponding behaviors are triggered when a drive value goes below 0.25. For example, for the knowledge acquisition drive in the medium-proactive condition (δ = 0.01), at the beginning of the interaction when all objects are unknown (n_obj = 3), it takes approximately 8 seconds for the drive to decay from the default value of 0.5 to the threshold of 0.25.
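The 8-second figure follows directly from the decay parameters, and the same arithmetic gives the corresponding times for the other two conditions. A quick sanity check (the helper name is ours; the numeric values come from the text):

```python
# Time for a drive to decay linearly from its default value to the trigger
# threshold, given the per-object decay rate delta and the object count.

def time_to_threshold(n_obj, delta, default=0.5, threshold=0.25):
    return (default - threshold) / (n_obj * delta)

# All three objects unknown at the start of the interaction:
t_medium = time_to_threshold(n_obj=3, delta=0.01)       # ~8.3 s
t_slow = time_to_threshold(n_obj=3, delta=0.01 / 2.5)   # ~20.8 s
t_fast = time_to_threshold(n_obj=3, delta=0.01 * 2.5)   # ~3.3 s
```

As objects become known, n_obj shrinks for the acquisition drive and grows for the expression drive, so the robot's initiatives shift from tagging to pointing over the course of the interaction.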
Figure 8 shows the interaction diagrams for the six subjects grouped by condition. In the slow-proactive condition, we observe that the task can be solved very rapidly, as seen with subject S1. Here the iCub initiated only the first action ( Tagging (knowledge acquisition) ). With drives decaying more slowly, the robot acts less often on its own initiative, so the subject leads the interaction according to his/her own goal. Since the iCub rarely acquires information by itself, the subject has to request actions on objects which are not yet known to the robot, triggering tagging behaviors ( Tagging (human points) ) prior to asking for or moving objects. However, solving the task can also take quite long in this condition, as seen with subject S2. This can be due to several factors, such as the robot's failure to perceive or manipulate objects, or the time required by the subject to fully understand the system. It can also be due to the personality or mood of the subject, who may her/himself be more or less proactive.
At the other extreme, in the fast-proactive condition (bottom of the figure), we observe that the interaction is dominated
Figure 8. Interaction diagrams for the six subjects ( S1 to S6, two per condition) interacting with the iCub to solve the task described in Figure 7. Top: slow proactivity condition. Middle: medium proactivity. Bottom: fast proactivity. In each condition, the first five rows show the interaction of one subject (e.g. S3 in the medium proactivity condition) and the last five rows the interaction of another subject (e.g. S4 in the medium proactivity condition). For each subject, the colored bars show the times (x-axis) when specific actions (y-axis) are executed by the iCub. The first two rows ( Tagging (knowledge acquisition) and Pointing (knowledge expression) ) are actions initiated by the iCub through its drive regulation system in the allostatic controller, as described in section IV-B. The last three rows ( Tagging (human points) , Robot moves object and Ask human to move object ) are actions initiated by the human through a speech order and executed by the iCub through action plans in the contextual layer, as described in section IV-C. They are sequenced according to the current state of the world and the current knowledge of the robot. The vertical arrow in each subject's interaction plot shows the time at which all three object names are known to the robot.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Timeline Diagram: Robot-Human Interaction Proactivity Levels
### Overview
The image displays a series of six timeline charts (labeled S1 through S6) organized into three main sections based on the robot's proactivity level: "Slow robot's proactivity," "Medium robot's proactivity," and "Fast robot's proactivity." Each chart visualizes the temporal occurrence of specific actions initiated by either a robot or a human over a 600-second period. The charts use color-coded horizontal bars to represent discrete action events.
### Components/Axes
* **Main Sections (Top to Bottom):**
* Slow robot's proactivity (Contains panels S1 and S2)
* Medium robot's proactivity (Contains panels S3 and S4)
* Fast robot's proactivity (Contains panels S5 and S6)
* **X-Axis (Common to all panels):** Labeled "Time in s" at the bottom of the entire figure. It spans from 0 to 600 seconds, with major tick marks and labels at 0, 100, 200, 300, 400, 500, and 600. A dotted vertical grid aligns with these ticks.
* **Panel Structure (S1-S6):** Each panel is a self-contained timeline. The left side of each panel contains a legend defining the action categories.
* **Legend (Embedded in each panel, left side):** Actions are grouped by initiator.
* **Robot initiated:**
* `Tagging (knowledge acquisition)` - Blue bar
* `Pointing (knowledge expression)` - Green bar
* **Human initiated:**
* `Tagging (human points)` - Pink bar
* `Robot moves object` - Orange bar
* `Ask human to move object` - Purple bar
* **Annotations:** Small black upward-pointing arrows (↑) appear in panels S1, S2, S3, S4, and S5, indicating specific moments in time.
### Detailed Analysis
**Panel S1 (Slow Proactivity):**
* **Robot Actions:** A single, short blue `Tagging` event occurs near t=0s. No green `Pointing` events.
* **Human Actions:** A pink `Tagging` event occurs around t=120s. Multiple orange `Robot moves object` events are scattered between ~t=150s and t=350s. A single purple `Ask human to move object` event occurs near t=320s.
* **Trend:** Sparse robot activity. Human actions are intermittent, with object movement being the most frequent human-initiated action.
**Panel S2 (Slow Proactivity):**
* **Robot Actions:** A single blue `Tagging` event near t=0s. Two green `Pointing` events occur between t=400s and t=450s.
* **Human Actions:** A pink `Tagging` event near t=220s. Orange `Robot moves object` events occur in three clusters: ~t=100-150s, ~t=280-320s, and a single event near t=480s. Two purple `Ask human to move object` events occur near t=260s and t=500s.
* **Trend:** Similar to S1, with delayed and infrequent robot pointing. Human object movement requests are more clustered.
**Panel S3 (Medium Proactivity):**
* **Robot Actions:** A single blue `Tagging` event near t=0s. Two green `Pointing` events occur between t=200s and t=250s.
* **Human Actions:** A pink `Tagging` event near t=100s. Orange `Robot moves object` events occur near t=130s and t=320s. Three purple `Ask human to move object` events occur near t=80s, t=180s, and t=350s.
* **Trend:** Robot pointing occurs earlier than in the Slow condition. Human actions are more evenly distributed.
**Panel S4 (Medium Proactivity):**
* **Robot Actions:** A single blue `Tagging` event near t=0s. Two green `Pointing` events occur near t=250s and t=420s.
* **Human Actions:** A pink `Tagging` event near t=180s. A cluster of orange `Robot moves object` events occurs between t=80-120s, with another cluster between t=300-400s. Two purple `Ask human to move object` events occur near t=100s and t=280s.
* **Trend:** Robot pointing is delayed. Human object movement requests are highly clustered in two distinct periods.
**Panel S5 (Fast Proactivity):**
* **Robot Actions:** Three blue `Tagging` events occur in quick succession between t=0-100s. Three green `Pointing` events occur between t=100-250s, with another near t=380s.
* **Human Actions:** A single pink `Tagging` event near t=180s. Orange `Robot moves object` events occur near t=320s and t=420s. A single purple `Ask human to move object` event occurs near t=200s.
* **Trend:** Marked increase in robot-initiated actions (both tagging and pointing), especially early in the timeline. Human-initiated actions are less frequent.
**Panel S6 (Fast Proactivity):**
* **Robot Actions:** Three blue `Tagging` events occur between t=0-100s. A dense series of green `Pointing` events occurs throughout the timeline, with clusters near t=100-200s, t=300-400s, and a final event near t=580s.
* **Human Actions:** No pink `Tagging` events. Orange `Robot moves object` events occur in a long cluster from ~t=250s to t=600s. Two purple `Ask human to move object` events occur near t=260s and t=410s.
* **Trend:** Highest density of robot actions, with pointing being very frequent. Human activity is dominated by a sustained period of requesting object movement in the latter half of the session.
### Key Observations
1. **Proactivity Gradient:** There is a clear visual increase in the frequency and earlier onset of robot-initiated actions (blue and green bars) from the Slow to Fast sections.
2. **Action Sequencing:** In all panels, robot `Tagging (knowledge acquisition)` (blue) is the first action, occurring at or near t=0s.
3. **Human Response Pattern:** Human `Tagging (human points)` (pink) typically occurs after the initial robot tagging. The frequency of human `Ask human to move object` (purple) appears relatively stable across conditions, while `Robot moves object` (orange) shows more variability in clustering.
4. **Arrow Annotations:** The black upward arrows (↑) consistently appear immediately following a human `Tagging (human points)` (pink) event in panels S1, S2, S3, and S4. In S5, the arrow follows a robot `Pointing` event. This suggests the arrow marks a significant event or trigger point in the interaction sequence.
5. **Temporal Clustering:** Human-initiated `Robot moves object` (orange) actions show strong temporal clustering, especially in S4 and S6, indicating periods of concentrated activity.
### Interpretation
This diagram illustrates the results of an experiment or observation studying how a robot's level of proactivity (Slow, Medium, Fast) influences the temporal dynamics of a collaborative task with a human. The data suggests that:
* **Increased Robot Proactivity Leads to Earlier and More Frequent Robot Initiatives:** The "Fast" condition is characterized by the robot performing knowledge acquisition (tagging) and expression (pointing) actions sooner and more often. This likely represents a robot that is more assertive in sharing information or guiding the task.
* **Human Behavior Adapts to Robot Pace:** In the Fast condition (S5, S6), human-initiated tagging is reduced or absent, and human requests for object movement become more sustained (S6). This could indicate the human is following the robot's lead more closely or that the robot's high proactivity reduces the need for the human to initiate certain actions.
* **The Interaction is Structured Around Key Events:** The consistent placement of the black arrow after a human tagging event (in most panels) implies this action is a critical juncture, possibly triggering a change in the robot's behavior or the task phase. The exception in S5 (arrow after robot pointing) may indicate a shift in dynamics under high robot proactivity.
* **Collaboration Rhythm:** The clustering of human "move object" requests suggests the task has phases of object manipulation. The robot's proactivity level appears to affect the timing and density of these phases, not just the robot's own actions.
In essence, the chart provides a visual proof that tuning a robot's proactivity parameter significantly alters the choreography of human-robot interaction, affecting when and how often each agent contributes to the shared task.
</details>
by the iCub's self-generated actions ( Tagging (knowledge acquisition) and Pointing (knowledge expression) ). There is little time for the subject to issue speech orders between two consecutive actions of the iCub, and it takes some practice to manage to do so. As a consequence, solving the task can take quite a long time, as seen with subject S6. However, the names of all objects tend to be acquired faster in this condition (vertical arrows). This is because the drive for knowledge acquisition decays rapidly, especially at the beginning of the interaction when all objects are unknown, pushing the iCub to ask for the name of everything around it. Indeed, the first three actions are Tagging (knowledge acquisition) for both S5 and S6.
Finally, in the medium-proactive condition, we observe a mixed initiative between the iCub and the subject. A possible positive effect of such moderate robot proactivity is that it maintains the social interaction when the subject is less proactive, preventing long silent gaps while still largely allowing the human to take the initiative. We observe a better interplay between iCub- and human-initiated actions in this condition.
Subjects are also asked to fill in questionnaires before and after the interaction. The pre-questionnaire aims to characterize the personality traits of the subject, while the post-questionnaire evaluates subjective feelings about the interaction. As the main focus of this paper is on presenting a coherent cognitive architecture rather than on evaluating the emerging human-robot interaction, we do not analyze these data here. Providing a solid statistical analysis will require acquiring data from many more subjects, which will be the aim of a follow-up paper. However, the presented human-robot interaction study shows that proactivity is an important factor in HRI and that it can be controlled by simple modifications of the drive dynamics. As observed in Figure 8, the robot's proactivity level has a clear influence on the resulting interaction by modulating the interplay between robot and human initiatives.
## V. CONCLUSION AND FUTURE WORK
This paper has introduced DAC-h3 , a proactive robot cognitive architecture to acquire and express knowledge about the world and the self. The architecture is based on the Distributed Adaptive Control (DAC) theory of the brain and mind, which provides a biologically grounded framework for organizing various functional modules into a coherent cognitive architecture. Those modules implement state-of-the-art algorithms modeling various cognitive functions for autonomous self-regulation, whole-body motor control, multimodal perception, knowledge representation, natural language processing, and goal-oriented behavior. They are all implemented using the YARP robotic middleware on the iCub robot, ensuring their parallel execution in real time and providing synchronous and asynchronous communication protocols among modules.
The implementation of DAC-h3 is flexible, so that existing modules can easily be replaced by more sophisticated ones in the future. Moreover, most modules can be configured according to the user's needs, for example by adding new drives or more complex grammars into the system. This makes DAC-h3 a general framework for designing autonomous robots, especially in HRI setups. The underlying open-source code contains some modules which are specific to the iCub robot (modules related to action execution, which work on both real and simulated iCubs), but all cognitive nodes (autobiographical memory, allostatic controllers, etc.) can be freely used with other robots (most easily with YARP-driven robots, but thanks to YARP-ROS intercommunication also with ROS-driven robots).
The main contribution of this paper is not the modeling of the specific functional modules, which have already been published (see Section III), but rather the integration of a heterogeneous collection of modules into a coherent and operational cognitive architecture. To this aim, the DAC-h3 architecture is organized as a layered control structure with tight coupling within and between layers (Figure 1 and Section III): the somatic , reactive , adaptive , and contextual layers. Across these layers, a columnar organization deals with the processing of states of the world or exteroception, the self or interoception, and action. Two main control loops generate the behavior of the robot. First, a reactive-adaptive control loop ensures autonomy and proactivity through the self-regulation of internal drives for knowledge acquisition and expression. It allows the robot to proactively manage its own knowledge acquisition process and to maintain the interaction with a human partner, while associating multimodal information about entities with their linguistic labels. Second, an adaptive-contextual control loop allows the robot to satisfy human requests, triggering goal-oriented behavior relying on the acquired knowledge. These goal-oriented behaviors are related to action planning for object passing, pointing, action recognition, narrative expression, and kinematic structure learning demonstration.
We have shown that these two control loops lead to a well-defined interplay between robot-initiated and human-initiated behaviors, which allows the robot to acquire multimodal representations of entities and link them with linguistic symbols, as well as to use the acquired knowledge for goal-oriented behavior. This allows the robot to learn reactively as well as proactively. Reactive learning occurs in situations where the robot needs to obtain new knowledge to execute a human order (e.g. grasping an object with an unknown label), and thus leads to an efficient interaction for acquiring the information before acting according to the human's desire. At the same time, the robot can also learn proactively to optimize its cognitive development by triggering learning interactions itself, which allows it to learn without having to wait for the human to teach new concepts. Moreover, this is expected to reduce the cognitive load of the human teacher, as the robot 1) chooses the target entity of the learning interaction, 2) engages joint attention by making the human aware of the target entity, and 3) asks for the corresponding label. Therefore, the human only needs to provide the label, without having to be concerned about the prior knowledge of the robot.
We have implemented the entire DAC-h3 architecture and presented an HRI scenario where an iCub humanoid robot interacts with objects and a human to acquire information about the present objects and agents as well as its own body parts. We have analyzed a typical interaction in detail, showing how DAC-h3 is able to dynamically balance the knowledge acquisition and expression processes according to the properties of the environment, and to deal with a mixed initiative scenario where both the robot and the human are behaving autonomously. In a series of video recordings, we show the ability of DAC-h3 to adapt to different situations and environments. We have also conducted experiments with naive subjects on a simplified version of the scenario, showing how the robot's proactivity level influences the interaction.
Adapting the proactivity level also provides a step towards robot personalities. The curiosity (i.e. favoring proactive learning) or talkativeness (i.e. favoring communication about its own knowledge) of the robot is determined by the decay rates of the corresponding drives. Thus, the personality of the robot can be altered by a simple modification of the decay values, as done for skill refinement by Puigbo et al. [104].
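To make the role of decay rates concrete, the following sketch (again our illustration rather than the DAC-h3 code; the class, thresholds, and parameter values are all assumed) shows how a drive decaying toward its trigger threshold at a configurable rate determines how often the associated behavior is initiated:

```python
class Drive:
    """Homeostatic drive that decays over time; when its level falls
    below a threshold, the associated behavior is triggered."""

    def __init__(self, name, decay_rate, threshold=0.25):
        self.name = name
        self.decay_rate = decay_rate  # higher decay -> more frequent triggering
        self.threshold = threshold
        self.level = 1.0              # fully satisfied at start

    def step(self, dt=1.0):
        """Decay the drive; return True if it now demands satisfaction."""
        self.level = max(0.0, self.level - self.decay_rate * dt)
        return self.level < self.threshold

    def satisfy(self):
        """Reset the drive after its behavior has been executed."""
        self.level = 1.0

# A "curious" personality: the knowledge-acquisition drive decays fast,
# so proactive learning interactions are triggered often.
curious = Drive("acquire_knowledge", decay_rate=0.10)
# A less curious robot: the same drive decays five times slower.
placid = Drive("acquire_knowledge", decay_rate=0.02)

triggers = sum(curious.step() for _ in range(20))
# With these illustrative values, 13 of the 20 steps trigger a
# proactive interaction for the curious robot, and none for the placid one.
```

Changing a single scalar per drive thus shifts the balance between acquiring knowledge, expressing it, and staying quiet, without touching the rest of the architecture.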
The current work has the following limitations. First, some of the available abilities deserve to be better integrated into the HRI scenario. For example, this is the case for the kinematic structure learning process, which is currently executed in a separate learning phase instead of being fully integrated within the interaction scenario. Similarly, the narrative can only be generated from specific chunks of the robot's history as recorded in the autobiographical memory. Second, in this paper, we do not provide a statistical analysis of the HRI experiments. The reason is that we focus on the description of the entire architecture and on its theoretical principles. A thorough statistical analysis will require collecting considerably more data to fully demonstrate the utility of some of these principles, for example how proactivity helps to solve the referential indeterminacy problem, as well as the effect of the robot's autonomy on the acceptability of the system by naive users. Third, although DAC-h3 can solve parts of the Symbol Grounding Problem (SGP), it still presupposes a symbolic concept of an entity that is given a priori to the system. Therefore, our contribution lies more in the ability to proactively acquire multimodal information about these entities and link it to linguistic labels that can later be reused to express complex goal-oriented behavior.
We are currently extending the proposed architecture in the following ways. First, we are better integrating some of the available abilities within the interaction scenario, as mentioned above. This will allow the knowledge acquisition process to start from scratch in a fully autonomous way. Second, we are considering the use of more biologically plausible and/or computationally scalable models for some of the existing modules, namely the action planning and action selection modules. These are currently algorithmic implementations using predefined action plans. We want to replace them with an existing model of rule learning grounded in the neurobiology of the prefrontal cortex, which can learn optimal action policies from experience to maximize long-term reward [105]. An interesting feature of this model for solving the SGP is that it relies on neural memory units encoding sensorimotor contingencies, with causal relationships learned through adaptive connections between them. An alternative solution is to use state-of-the-art AI planners such as those based on the Planning Domain Definition Language (PDDL), where multi-agent planning extensions are particularly relevant in the context of social robotics [106]. Third, we are also integrating more low-level reactive control abilities through an acquired notion of peripersonal space [107], whereby the robot will be able to optimize its own action primitives in real time to maintain safety distances from aversive objects (e.g. a spider) while executing reaching actions toward other objects. Finally, we are working on a self-exploration process to autonomously discover the area reachable by the robot, similarly to Jamone et al. [108], and subsequently applying this self-model to the human partner to estimate his/her reachability.
## REFERENCES
- [1] S. Harnad, 'The symbol grounding problem,' Physica D: Nonlinear Phenomena , vol. 42, no. 1, pp. 335 - 346, 1990.
- [2] S. Coradeschi and A. Saffiotti, 'An introduction to the anchoring problem,' Robotics and Autonomous Systems , vol. 43, no. 2-3, pp. 85 - 96, 2003.
- [3] T. Taniguchi, T. Nagai, T. Nakamura, N. Iwahashi, T. Ogata, and H. Asoh, 'Symbol emergence in robotics: a survey,' Advanced Robotics , vol. 30, no. 11-12, pp. 706-728, 2016.
- [4] G. Metta et al. , 'The iCub humanoid robot: An open-systems platform for research in cognitive development,' Neural Networks , vol. 23, no. 8, pp. 1125-1134, 2010.
- [5] P. Vogt, 'The physical symbol grounding problem,' Cognitive Systems Research , vol. 3, no. 3, pp. 429-457, 2002.
- [6] P. Vogt and F. Divina, 'Social symbol grounding and language evolution,' Interaction Studies , vol. 8, no. 1, pp. 31-52, 2007.
- [7] A. Cangelosi, 'The grounding and sharing of symbols,' Pragmatics & Cognition , vol. 14, no. 2, pp. 275-285, 2006.
- [8] S. Lallée et al. , 'Towards the synthetic self: making others perceive me as an other,' Paladyn Journal of Behavioral Robotics , vol. 6, no. 1, 2015.
- [9] V. Vouloutsi et al. , 'Towards a Synthetic Tutor Assistant: The EASEL Project and its Architecture,' in International Conference on Living Machines , 2016, pp. 353-364.
- [10] S. C. Levinson, 'On the human interaction engine,' in Wenner-Gren Foundation for Anthropological Research, Symposium 134 . Berg, 2006, pp. 39-69.
- [11] U. Liszkowski, M. Carpenter, A. Henning, T. Striano, and M. Tomasello, 'Twelve-month-olds point to share attention and interest,' Developmental science , vol. 7, no. 3, pp. 297-307, 2004.
- [12] D. E. Berlyne, 'A theory of human curiosity,' British Journal of Psychology. General Section , vol. 45, no. 3, pp. 180-191, 1954.
- [13] M. Tomasello, M. Carpenter, J. Call, T. Behne, and H. Moll, 'Understanding and sharing intentions: The origins of cultural cognition,' Behavioral and brain sciences , vol. 28, no. 5, pp. 675-691, 2005.
- [14] D. Vernon, C. von Hofsten, and L. Fadiga, 'Desiderata for developmental cognitive architectures,' Biologically Inspired Cognitive Architectures , vol. 18, pp. 116-127, 2016.
- [15] J. Piaget, The construction of reality in the child . New York, NY, US: Basic Books, 1954.
- [16] L. Vygotsky, Mind in Society: Development of Higher Psychological Processes . Cambridge: Harvard University Press, 1978.
- [17] T. Fong, I. Nourbakhsh, and K. Dautenhahn, 'A survey of socially interactive robots,' Robotics and Autonomous Systems , vol. 42, no. 3, pp. 143-166, 2003.
- [18] J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun, 'Towards robotic assistants in nursing homes: Challenges and results,' Robotics and Autonomous Systems , vol. 42, no. 3, pp. 271-281, 2003.
- [19] I. R. Nourbakhsh, J. Bobenage, S. Grange, R. Lutz, R. Meyer, and A. Soto, 'An affective mobile robot educator with a full-time job,' Artificial Intelligence , vol. 114, no. 1, pp. 95-124, 1999.
- [20] A. G. Di Nuovo, D. Marocco, S. Di Nuovo, and A. Cangelosi, 'Autonomous learning in humanoid robotics through mental imagery,' Neural Networks , vol. 41, pp. 147-155, 2013.
- [21] B. Adams, C. Breazeal, R. A. Brooks, and B. Scassellati, 'Humanoid robots: a new kind of tool,' IEEE Intelligent Systems and their Applications , vol. 15, no. 4, pp. 25-31, 2000.
- [22] K. Dautenhahn and A. Billard, 'Bringing up robots or - the psychology of socially intelligent robots: From theory to implementation,' in Conference on Autonomous Agents , 1999, pp. 366-367.
- [23] Y. Demiris, L. Aziz-Zadeh, and J. Bonaiuto, 'Information Processing in the Mirror Neuron System in Primates and Machines,' Neuroinformatics , vol. 12, no. 1, pp. 63-91, 2014.
- [24] A. Newell, Unified theories of cognition . Harvard University Press, 1990.
- [25] A. Newell, J. C. Shaw, and H. A. Simon, 'Report on a general problem-solving program,' IFIP Congress , pp. 256-264, 1959.
- [26] J. E. Laird, A. Newell, and P. S. Rosenbloom, 'SOAR: An architecture for general intelligence,' Artificial Intelligence , vol. 33, no. 1, pp. 1-64, 1987.
- [27] J. R. Anderson, The Architecture of Cognition . Harvard University Press, 1983.
- [28] R. A. Brooks, 'Intelligence without representation,' Artificial Intelligence , vol. 47, no. 1-3, pp. 139-159, 1991.
- [29] R. Brooks, 'A robust layered control system for a mobile robot,' IEEE Journal on Robotics and Automation , vol. 2, no. 1, pp. 14-23, 1986.
- [30] Á. Miklósi and M. Gácsi, 'On the utilization of social animals as a model for social robotics,' Frontiers in Psychology , vol. 3, no. 75, pp. 1-10, 2012.
- [31] K. Dautenhahn, 'Socially intelligent robots: dimensions of human-robot interaction.' Philosophical transactions of the Royal Society of London. Series B, Biological sciences , vol. 362, no. 1480, pp. 679-704, 2007.
- [32] P. Carruthers and P. Smith, Theories of Theories of Mind . Cambridge University Press, 1996.
- [33] E. Di Paolo and H. De Jaegher, 'The interactive brain hypothesis,' Frontiers in Human Neuroscience , vol. 6, no. 163, pp. 1-16, 2012.
- [34] T. Taniguchi, 'Symbol emergence in robotics for long-term human-robot collaboration,' in IFAC Symposium on Analysis, Design, and Evaluation of Human-Machine Systems , 2016, pp. 144 - 149.
- [35] B. Scassellati, 'Investigating models of social development using a humanoid robot,' in IEEE International Joint Conference on Neural Networks , 2003, pp. 2704-2709.
- [36] A. Stoytchev and R. Arkin, 'Combining deliberation, reactivity, and motivation in the context of a behavior-based robot architecture,' in IEEE International Symposium on Computational Intelligence in Robotics and Automation , 2001, pp. 290-295.
- [37] M. Malfaz, Á. Castro-González, R. Barber, and M. A. Salichs, 'A Biologically Inspired Architecture for an Autonomous and Social Robot,' IEEE Transactions on Autonomous Mental Development , vol. 3, no. 3, pp. 232-246, 2011.
- [38] M. Scheutz, G. Briggs, R. Cantrell, E. Krause, T. Williams, and R. Veale, 'Novel mechanisms for natural human-robot interactions in the diarc architecture,' in Proceedings of AAAI Workshop on Intelligent Robotic Systems , 2013.
- [39] P. F. M. J. Verschure, T. Voegtlin, and R. J. Douglas, 'Environmentally mediated synergy between perception and behaviour in mobile robots,' Nature , vol. 425, no. 6958, pp. 620-624, 2003.
- [40] P. F. M. J. Verschure, C. M. A. Pennartz, and G. Pezzulo, 'The why, what, where, when and how of goal-directed choice: neuronal and computational principles,' Philosophical Transactions of the Royal Society B: Biological Sciences , vol. 369, no. 1655, pp. 1-14, 2014.
- [41] J.-Y. Puigbò, C. Moulin-Frier, and P. F. Verschure, 'Towards self-controlled robots through distributed adaptive control,' in Biomimetic and Biohybrid Systems , 2016, pp. 490-497.
- [42] P. Gärdenfors, Conceptual Spaces. The Geometry of Thought . Cambridge, MIT Press, 2000.
- [43] A. Lieto, A. Chella, and M. Frixione, 'Conceptual spaces for cognitive architectures: A lingua franca for different levels of representation,' Biologically Inspired Cognitive Architectures , vol. 19, pp. 1-9, 2017.
- [44] C. Eliasmith, T. C. Stewart, X. Choo, T. Bekolay, T. DeWolf, Y. Tang, and D. Rasmussen, 'A large-scale model of the functioning brain,' Science , vol. 338, no. 6111, pp. 1202-1205, 2012.
- [45] P. Blouw, E. Solodkin, P. Thagard, and C. Eliasmith, 'Concepts as Semantic Pointers: A Framework and Computational Model,' Cognitive Science , vol. 40, no. 5, pp. 1128-1162, 2016.
- [46] N. Krüger et al. , 'Object-Action Complexes: Grounded abstractions of sensory-motor processes,' Robotics and Autonomous Systems , vol. 59, no. 10, pp. 740-757, 2011.
- [47] L. Jamone, E. Ugur, A. Cangelosi, L. Fadiga, A. Bernardino, J. Piater, and J. Santos-Victor, 'Affordances in psychology, neuroscience and robotics: a survey,' IEEE Transactions on Cognitive and Developmental Systems , 2016.
- [48] D. Vernon, M. Beetz, and G. Sandini, 'Prospection in Cognition: The Case for Joint Episodic-Procedural Memory in Cognitive Robotics,' Frontiers in Robotics and AI , vol. 2, no. 19, pp. 1-14, 2015.
- [49] L. Steels, 'The synthetic modeling of language origins,' Evolution of Communication , vol. 1, no. 1, pp. 1-34, 1997.
- [50] F. Kaplan, 'Semiotic schemata: Selection units for linguistic cultural evolution,' in International Conference on Artificial Life , 2000, pp. 372-381.
- [51] C. Moulin-Frier, J. Diard, J.-L. Schwartz, and P. BessiĂšre, 'COSMO ('Communicating about Objects using Sensory-Motor Operations'): a Bayesian modeling framework for studying speech communication and the emergence of phonological systems,' Journal of Phonetics , vol. 53, pp. 5-41, 2015.
- [52] S. Lallée et al. , 'Cooperative human robot interaction systems: IV. Communication of shared plans with Naïve humans using gaze and speech,' in IEEE/RSJ International Conference on Intelligent Robots and Systems , 2013, pp. 129-136.
- [53] S. Lallée et al. , 'Towards a platform-independent cooperative human robot interaction system: III an architecture for learning and executing actions and shared plans,' IEEE Transactions on Autonomous Mental Development , vol. 4, no. 3, pp. 239-253, 2012.
- [54] M. Petit et al. , 'The coordinating role of language in real-time multimodal learning of cooperative tasks,' IEEE Transactions on Autonomous Mental Development , vol. 5, no. 1, pp. 3-17, 2013.
- [55] S. Calinon, F. D'halluin, E. L. Sauser, D. G. Caldwell, and A. G. Billard, 'Learning and reproduction of gestures by imitation,' IEEE Robotics & Automation Magazine , vol. 17, no. 2, pp. 44-54, 2010.
- [56] Y. Demiris and A. Meltzoff, 'The robot in the crib: A developmental analysis of imitation skills in infants and robots,' Infant and Child Development , vol. 17, no. 1, pp. 43-53, 2008.
- [57] S. Schaal, 'Is imitation learning the route to humanoid robots?' Trends in cognitive sciences , vol. 3, no. 6, pp. 233-242, 1999.
- [58] K. Lee, Y. Su, T.-K. Kim, and Y. Demiris, 'A syntactic approach to robot imitation learning using probabilistic activity grammars,' Robotics and Autonomous Systems , vol. 61, no. 12, pp. 1323-1334, 2013.
- [59] M. Ewerton, G. Maeda, J. Peters, and G. Neumann, 'Learning motor skills from partially observed movements executed at different speeds,' in IEEE/RSJ International Conference on Intelligent Robots and Systems , 2015, pp. 456-463.
- [60] A. Billard and K. Dautenhahn, 'Grounding communication in autonomous robots: an experimental study,' Robotics and Autonomous Systems , vol. 24, no. 1, pp. 71-79, 1998.
- [61] A. Cangelosi, E. Hourdakis, and V. Tikhanoff, 'Language acquisition and symbol grounding transfer with neural networks and cognitive robots,' in IEEE International Joint Conference on Neural Network Proceedings , 2006, pp. 1576-1582.
- [62] D. Marocco, A. Cangelosi, K. Fischer, and T. Belpaeme, 'Grounding action words in the sensorimotor interaction with the world: experiments with a simulated iCub humanoid robot,' Frontiers in Neurorobotics , vol. 4, no. 7, pp. 1-15, 2010.
- [63] W. V. O. Quine, Word and object . MIT Press, 1960.
- [64] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner, 'Intrinsic motivation systems for autonomous mental development,' IEEE Transactions on Evolutionary Computation , vol. 11, no. 2, pp. 265-286, 2007.
- [65] C. Breazeal and B. Scassellati, 'Infant-like social interactions between a robot and a human caregiver,' Adaptive Behavior , vol. 8, no. 1, pp. 49-74, 2000.
- [66] S. Ivaldi et al. , 'Object learning through active exploration,' IEEE Transactions on Autonomous Mental Development , vol. 6, no. 1, pp. 56-72, 2013.
- [67] C. Moulin-Frier, S. M. Nguyen, and P.-Y. Oudeyer, 'Self-Organization of Early Vocal Development in Infants and Machines: The Role of Intrinsic Motivation,' Frontiers in Psychology , vol. 4, no. 1006, 2014.
- [68] O. C. Schrempf, U. D. Hanebeck, A. J. Schmid, and H. Worn, 'A novel approach to proactive human-robot cooperation,' in IEEE International Workshop on Robot and Human Interactive Communication , 2005, pp. 555-560.
- [69] I. Lutkebohle et al. , 'The curious robot -structuring interactive robot learning,' in IEEE International Conference on Robotics and Automation , 2009, pp. 4156-4162.
- [70] F. Broz et al. , 'The ITALK Project: A Developmental Robotics Approach to the Study of Individual, Social, and Linguistic Learning,' Topics in Cognitive Science , vol. 6, no. 3, pp. 534-544, 2014.
- [71] A. Cangelosi et al. , 'Integration of action and language knowledge: A roadmap for developmental robotics,' IEEE Transactions on Autonomous Mental Development , vol. 2, no. 3, pp. 167-195, 2010.
- [72] A. Antunes, L. Jamone, G. Saponaro, A. Bernardino, and R. Ventura, 'From human instructions to robot actions: Formulation of goals, affordances and probabilistic planning,' in IEEE International Conference on Robotics and Automation , 2016, pp. 5449-5454.
- [73] E. A. Krause, M. Zillich, T. E. Williams, and M. Scheutz, 'Learning to Recognize Novel Objects in One Shot through Human-Robot Interactions in Natural Language Dialogues,' in AAAI Conference on Artificial Intelligence , 2014, pp. 2796-2802.
- [74] J. Dias, W. C. Ho, T. Vogt, N. Beeckman, A. Paiva, and E. André, 'I know what I did last summer: Autobiographic memory in synthetic characters,' in International Conference on Affective Computing and Intelligent Interaction , 2007, pp. 606-617.
- [75] D. S. Syrdal, K. Dautenhahn, K. L. Koay, and W. C. Ho, 'Views from within a narrative: Evaluating long-term human-robot interaction in a naturalistic environment using open-ended scenarios,' Cognitive Computation , vol. 6, no. 4, pp. 741-759, 2014.
- [76] G. Sieber and B. Krenn, 'Towards an episodic memory for companion dialogue,' in International Conference on Intelligent Virtual Agents , 2010, pp. 322-328.
- [77] P. Fitzpatrick, G. Metta, and L. Natale, 'Towards long-lived robot genes,' Robotics and Autonomous Systems , vol. 56, no. 1, pp. 29-45, 2008.
- [78] S. Fanello et al. , '3D Stereo Estimation and Fully Automated Learning of Eye-Hand Coordination in Humanoid Robots,' in IEEE-RAS International Conference on Humanoid Robots , 2014, pp. 1028-1035.
- [79] Z. Mathews, S. B. i Badia, and P. F. Verschure, 'PASAR: An integrated model of prediction, anticipation, sensation, attention and response for artificial sensorimotor systems,' Information Sciences , vol. 186, no. 1, pp. 1-19, 2012.
- [80] M. Sanchez-Fibla et al. , 'Allostatic control for robot behavior regulation: a comparative rodent-robot study,' Advances in Complex Systems , vol. 13, no. 3, pp. 377-403, 2010.
- [81] M. S. Fibla, U. Bernardet, and P. F. Verschure, 'Allostatic control for robot behaviour regulation: An extension to path planning,' in IEEE/RSJ International Conference on Intelligent Robots and Systems , 2010, pp. 1935-1942.
- [82] S. Lallee and P. F. Verschure, 'How? Why? What? Where? When? Who? Grounding Ontology in the Actions of a Situated Social Agent,' Robotics , vol. 4, no. 2, pp. 169-193, 2015.
- [83] V. Vouloutsi, K. Grechuta, S. Lallée, and P. F. Verschure, 'The influence of behavioral complexity on robot perception,' in Conference on Biomimetic and Biohybrid Systems , 2014, pp. 332-343.
- [84] U. Pattacini, F. Nori, L. Natale, and G. Metta, 'An experimental evaluation of a novel minimum-jerk cartesian controller for humanoid robots,' in IEEE/RSJ International Conference on Intelligent Robots and Systems , 2010, pp. 1668-1674.
- [85] A. Parmiggiani et al. , 'The Design of the iCub Humanoid Robot,' International Journal of Humanoid Robotics , vol. 9, no. 4, 2012.
- [86] G. Pasquale, C. Ciliberto, L. Rosasco, and L. Natale, 'Object identification from few examples by improving the invariance of a deep convolutional neural network,' in IEEE/RSJ International Conference on Intelligent Robots and Systems , 2016, pp. 4904-4911.
- [87] X. Hinaut and P. F. Dominey, 'Real-Time Parallel Processing of Grammatical Structure in the Fronto-Striatal System: A Recurrent Network Simulation Study Using Reservoir Computing,' PloS one , vol. 8, no. 2, pp. 1-18, 2013.
- [88] X. Hinaut, M. Petit, G. Pointeau, and P. F. Dominey, 'Exploring the acquisition and production of grammatical constructions through humanrobot interaction with echo state networks,' Frontiers in Neurorobotics , vol. 8, no. 16, pp. 1-17, 2015.
- [89] A.-L. Mealier, G. Pointeau, P. Gärdenfors, and P. F. Dominey, 'Construals of meaning: The role of attention in robotic language production,' Interaction Studies , vol. 17, no. 1, pp. 41-69, 2016.
- [90] A. Damianou, C. H. Ek, L. Boorman, N. D. Lawrence, and T. J. Prescott, 'A top-down approach for a synthetic autobiographical memory system,' in Biomimetic and Biohybrid Systems , 2015, pp. 280-292.
- [91] A. C. Damianou and N. D. Lawrence, 'Deep gaussian processes.' in Artificial Intelligence and Statistics Conference , 2013, pp. 207-215.
- [92] D. Camilleri, A. Damianou, H. Jackson, N. Lawrence, and T. Prescott, 'iCub Visual Memory Inspector: Visualising the iCub's Thoughts,' in Conference on Biomimetic and Biohybrid Systems , 2016, pp. 48-57.
- [93] M. Zambelli, T. Fischer, M. Petit, H. J. Chang, A. Cully, and Y. Demiris, 'Towards anchoring self-learned representations to those of other agents,' in Workshop on Bio-inspired Social robot Learning in Home Scenarios at IEEE/RSJ International Conference on Intelligent Robots and Systems , 2016.
- [94] M. Zambelli and Y. Demiris, 'Online Multimodal Ensemble Learning using Self-learnt Sensorimotor Representations,' in IEEE Transactions on Cognitive and Developmental Systems , vol. 9, no. 2, 2017, pp. 113-126.
- [95] H. J. Chang and Y. Demiris, 'Highly Articulated Kinematic Structure Estimation combining Motion and Skeleton Information,' IEEE Transactions on Pattern Analysis and Machine Intelligence , 2017, to be published.
- [96] H. J. Chang, T. Fischer, M. Petit, M. Zambelli, and Y. Demiris, 'Kinematic structure correspondences via hypergraph matching,' in IEEE Conference on Computer Vision and Pattern Recognition , 2016, pp. 4216-4225.
- [97] T. Fischer and Y. Demiris, 'Markerless Perspective Taking for Humanoid Robots in Unconstrained Environments,' in IEEE International Conference on Robotics and Automation , 2016, pp. 3309-3316.
- [98] M. Johnson and Y. Demiris, 'Perceptual Perspective Taking and Action Recognition,' International Journal of Advanced Robotic Systems , vol. 2, no. 4, pp. 301-308, 2005.
- [99] T. J. Prescott, N. Lepora, and P. F. M. J. Verschure, 'A future of living machines? International trends and prospects in biomimetic and biohybrid systems,' in Conference on Bioinspiration, Biomimetics and Bioreplication , 2014.
- [100] P. Verschure, 'Formal minds and biological brains II: From the mirage of intelligence to a science and engineering of consciousness,' in IEEE Intelligent Systems Trends and Controversies , 2013, pp. 33-36.
- [101] G. Pointeau, M. Petit, and P. F. Dominey, 'Successive Developmental Levels of Autobiographical Memory for Learning Through Social Interaction,' IEEE Transactions on Autonomous Mental Development , vol. 6, no. 3, pp. 200-212, Sep. 2014.
- [102] M. Petit, T. Fischer, and Y. Demiris, 'Lifelong Augmentation of MultiModal Streaming Autobiographical Memories,' IEEE Transactions on Cognitive and Developmental Systems , vol. 8, no. 3, pp. 201-213, 2016.
- [103] M. Petit, T. Fischer, and Y. Demiris, 'Towards the emergence of procedural memories from lifelong multi-modal streaming memories for cognitive robots,' in Workshop on Machine Learning Methods for HighLevel Cognitive Capabilities in Robotics at IEEE/RSJ International Conference on Intelligent Robots and Systems , 2016.
- [104] J.-Y. Puigbo, C. Moulin-Frier, V. Vouloutsi, M. Sanchez-Fibla, I. Herreros, and P. F. J. Verschure, 'Skill refinement through cerebellar learning and human haptic feedback: An iCub learning to paint experiment,' in IEEE-RAS International Conference on Humanoid Robots , 2015, pp. 447-452.
- [105] A. Duff, M. Sanchez Fibla, and P. F. M. J. Verschure, 'A biologically based model for the integration of sensory-motor contingencies in rules and plans: A prefrontal cortex based extension of the Distributed Adaptive Control architecture,' Brain Research Bulletin , vol. 85, no. 5, pp. 289-304, 2011.
- [106] D. L. Kovacs, 'A Multi-Agent Extension of PDDL3.1,' in Workshop on International Planning Competition at the International Conference on Automated Planning and Scheduling , 2012, pp. 19-27.
- [107] A. Roncone, M. Hoffmann, U. Pattacini, L. Fadiga, and G. Metta, 'Peripersonal space and margin of safety around the body: learning tactile-visual associations in a humanoid robot with artificial skin,' PLoS ONE , vol. 11, no. 10, pp. 1-32, 2016.
- [108] L. Jamone, M. Brandao, L. Natale, K. Hashimoto, G. Sandini, and A. Takanishi, 'Autonomous online generation of a motor representation of the workspace for intelligent whole-body reaching,' Robotics and Autonomous Systems , vol. 62, no. 4, pp. 556-567, 2014.