Social robots are gaining in performance. However, they are still a long way from mastering all aspects of human interaction. At the Gipsa-lab in Grenoble, researchers teach them to show their intentions by demonstrating how to adjust their behaviour (gestures, speech, gaze) to the context and the person they are interacting with.
Thanks to her articulated eyes, Nina, the robot of the Gipsa-lab, can better understand humans.
Photo © Cyril Frésillon/Gipsa Lab/CNRS Photothèque
In his interactive guide Mind Reading published in 2004, the British psychologist Simon Baron-Cohen distinguishes more than 400 emotions that can be conveyed through voice and facial expressions: irritation, disappointment, indignation, relief, enthusiasm, etc. While gestures and postures of the whole body participate in expressing our mental state, the face is particularly informative. The eyes and mouth are among the areas that are most “read” by our interlocutors. Studies have even shown that the morphology of the human eye has evolved to maximise the accuracy with which the direction of the gaze is analysed by others (see inset “The Social Eye”). This “cooperative” eye facilitates the development of a deep theory of mind (*) for the highly social beings we have become.
What about machines? Many humanoid robots have no visible face (Asimo, Jibo); others have one, but it is not animated (Nao, Pepper) or is replaced by a flat screen (Baxter). Some, however, such as Sophia, developed by Hanson Robotics, or the geminoids designed by the Japanese robot engineer Hiroshi Ishiguro, are built with considerable physical realism. Their faces are made of synthetic skin and their eyes have lenses that imitate our own pupils.
Paradoxically, this resemblance may be a source of disappointment for the onlooker, who expects the interaction to be more human-like the more realistic the robot looks. However, it is technically very difficult to reproduce on a machine the 38 muscles that make up the human face and their complex interweaving. The result: when the subcutaneous skeleton of a humanoid robot like Sophia does not have enough degrees of freedom to reproduce human movements, in particular on the face, the robot makes us feel uneasy (see inset “The Uncanny Valley”). Moreover, as with humanoid robots with a less realistic look, most social robots do not have a functional cooperative eye: their cameras are often fixed, set in the middle of the forehead or on the torso. Their designers equip them with 2D and 3D cameras, rangefinders, radars and microphones that measure and analyse our every act and gesture, but often forget that their creations, in turn, need to “show” their intentions.
And that’s where the problem lies! Should we maintain the asymmetry between humans and robots by limiting or omitting their ability to express their intentions? The debate is open: some people defend humanity and a superior social intelligence reserved for humans alone, while others promote an unbridled transhumanism fuelled by the superhuman performance of artificial intelligence in motor functions (locomotion, catching objects in flight, etc.) and cognitive functions (calculation, memory, etc.). Our position, more nuanced, is to provide robots with the ability to fulfil functions identical to those of humans without necessarily making them exact copies of humans. These functions may concern locomotion and the handling of objects or, in our case, verbal communication and social interaction. At Gipsa-lab in Grenoble, we defend the idea that, while robots are increasingly efficient at perceiving and understanding their environment, they should be given the ability to share their experience with us, without forcing us to learn new codes of communication.
Research in this field is still in its infancy. Robots act on the environment around them, but few studies have attempted to verify that their actions (especially communicative ones) are correctly perceived and accepted by their human interlocutors. This is what we wanted to test with our humanoid robot, Nina, developed in partnership with the Italian Institute of Technology in Genoa. Nina is an augmented version of the iCub robot, designed in 2006 to study robot cognition. She is equipped with a new eyelid articulation mechanism, complex ear pavilions with integrated microphones and a mouth covered with an elastic fabric, articulated by five motors, behind which lies a loudspeaker to give her a voice. The mouth movements are controlled by an audiovisual speech synthesiser developed in our laboratory. The system is first trained on videos of human discussions and can then calculate and execute realistic movements of the mouth and face while producing the corresponding synchronous speech. In tests conducted in a noisy environment (the background noise of a cocktail party), we showed that the movements of the jaw and lips calculated by our system did actually improve the human listeners’ audiovisual perception of speech.
We then worked on the robot’s eyes and eyelids, asking ourselves this question: which parameters provide better social interaction? By testing various plastic capsules around the cameras fitted into Nina’s articulated eyes, we showed that the direction of the gaze was more accurately estimated by her human interlocutors when the relative sizes of the white sclera (**) and the coloured irises were properly scaled and when the upper eyelid movements were coupled with the gaze direction. When Nina gazes at a point of interest, her interlocutors know it. This ability to implicitly signal her focus of interest to others is one of the bases of joint attention and of the construction of a theory of mind. Thanks to this sharing of implicit information, Nina’s interlocutors will be able to make elaborate, reliable hypotheses about her (re)actions, and the robot will be able to build coordinated and fluid interactive behaviour.
But how can a robot be taught which behaviours are relevant and sensitive to the context? There are many programming techniques. One possibility is to provide the robot with a cognitive model it can use to reason about its environment (what do these people want? who is close enough to grasp an object? etc.) and to react to orders, questions, statements and other people’s doubts, so that it can plan its own actions. This mentalist approach is gradually being complemented or replaced by statistical learning and artificial intelligence techniques that directly map the signals perceived by the robot (speech, gestures, etc.) to the actions to be executed (look over there, point to an object, say a word, etc.). These so-called “end-to-end” learning techniques are used to capture the regularities of interactive behaviours.
Among the various statistical learning techniques that can be applied, the first is known as “developmental” learning: the robot learns everything alone by trial and error, which requires a prohibitive number of trials. In addition, the algorithm requires a clear definition of what counts as right (or wrong) interactive social behaviour, and a way to reward the robot. However, no such universal definition exists. Another option is learning by observation or imitation. For a robot, this consists in observing human tutors who perform the task before reproducing it. The problem is that the robot is not as agile as a human: our facial movements, for example, are far richer and more diverse than what any robot can produce. It will therefore have difficulty transposing the human performance to its limited sensorimotor (and cognitive) abilities. Furthermore, this approach assumes that human partners will react to the robot in the same way as to the human tutors.
Nothing is less certain: although we cannot help projecting human cognitive faculties onto a device that takes initiatives, the robot remains a technological object. Our theory of mind has learned to separate agents (who act on the world and have intentions) from objects (which are passively subject to the actions of agents and to the forces of nature); we attribute separate value systems to each of these categories, and in the case of robots these systems come into conflict. It is therefore difficult to transpose behaviours observed between humans to those expected from a social robot. An alternative approach called “learning by demonstration” takes the machine’s sensorimotor limitations into account by allowing the tutor to act directly on the robot’s actuators (the motors and joints), in the same way as a puppeteer. To do so, the robot is set to a docile or “passive” mode, in which it follows the gestures of the human pilot and limits itself to compensating for the effect of its own weight on its movements.
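To make the idea of a “passive” mode more concrete, here is a minimal sketch of gravity compensation for a single rigid link rotating about one joint. It is a textbook simplification, not a description of Nina’s actual controller: the function names, the single-link model and the numerical values are all assumptions chosen for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def gravity_torque(mass_kg, com_distance_m, angle_rad):
    """Torque (in N*m) needed at the joint to hold one rigid link still.

    angle_rad is measured from the horizontal, so the torque is largest
    when the link is horizontal and vanishes when it hangs vertically.
    com_distance_m is the distance from the joint to the centre of mass.
    """
    return mass_kg * G * com_distance_m * math.cos(angle_rad)


def passive_mode_command(mass_kg, com_distance_m, angle_rad):
    # Hypothetical passive mode: the controller adds no motion of its own.
    # It only cancels the link's weight, so the tutor's hand decides
    # where the link goes and the link stays put when released.
    return gravity_torque(mass_kg, com_distance_m, angle_rad)


# A 1 kg forearm whose centre of mass is 0.5 m from the elbow.
print(passive_mode_command(1.0, 0.5, 0.0))          # held horizontally
print(passive_mode_command(1.0, 0.5, math.pi / 2))  # held vertically
```

A real humanoid has many coupled joints, so the compensation would be computed from a full dynamic model of the body, but the principle is the same: the motors carry the robot’s weight and nothing more.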
We chose this option to teach Nina social-communicative behaviours. We connected the robot to an “immersive teleoperation” platform, where the demonstration is conducted from the “inside”. Concretely, using a virtual reality headset fitted with movement sensors, including a binocular eye tracker, the tutor becomes the “pilot”, acting on and perceiving the robot’s environment through the robot’s body, its actuators and its sensors. Nina passively follows the behaviours of the pilot. It stores in its memory all the sensorimotor signals experienced during interactions between the pilot and other people in front of the robot. Once this behavioural memory contains enough examples of interactions, data analysis and statistical modelling are used to endow Nina with autonomous interactive behaviours.
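The two stages described above, recording what the pilot perceives and does, then deriving autonomous behaviour from those records, can be sketched as follows. This is a deliberately simple illustration: the class name, the two-number perception vectors, the action labels and the nearest-neighbour rule are all assumptions, standing in for the much richer signals and statistical models the text refers to.

```python
import math


class BehaviouralMemory:
    """Stores (perception, action) pairs recorded while a pilot drives the robot."""

    def __init__(self):
        self.examples = []  # list of (perception tuple, action label)

    def record(self, perception, action):
        """Called during teleoperation: keep what was perceived and what was done."""
        self.examples.append((tuple(perception), action))

    def nearest(self, perception):
        """Return (distance, action) for the stored example closest to `perception`."""
        if not self.examples:
            raise ValueError("no demonstrations recorded yet")
        query = tuple(perception)
        return min(
            (math.dist(stored, query), action)
            for stored, action in self.examples
        )

    def act(self, perception):
        """Autonomous mode: replay the action of the most similar demonstration."""
        return self.nearest(perception)[1]


# Hypothetical features: (partner is speaking, an object just moved).
memory = BehaviouralMemory()
memory.record((1.0, 0.0), "look_at_speaker")
memory.record((0.0, 1.0), "look_at_object")

print(memory.act((0.9, 0.1)))  # closest to the first demonstration
```

Replaying the nearest demonstration is the simplest possible model; it only serves to show how a memory filled during piloting can later drive the robot without the pilot.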
As part of the Sombrero project, funded by the French National Research Agency, our ambition is to teach Nina how to conduct neuropsychological interviews autonomously. These interviews aim to evaluate the episodic memory (***) of patients suspected of suffering from Alzheimer’s disease or other types of dementia, by having them take a standard naming test on sixteen items. They are usually conducted by psychologists. This type of short dialogue (one interview takes about twenty minutes) is well suited to the kind of task that can be entrusted to a social robot. It is indeed a repetitive task, with a standard protocol, leading to a completed interaction with well-defined roles. The goal is not to replace the practitioner in making a diagnosis but to separate patients who pass the robot test from those who need to consult a specialist. The challenges of this screening task should not, however, be underestimated: adapting to thousands of candidates’ psychological profiles, paying constant attention, and displaying courteous goodwill and unfailing empathy are all difficult. This laborious task can be highly useful when performed reliably and discreetly: a consultation with a robot could be seen as less “involving” for the patient than one with a human professional. Such early screening would benefit a growing number of patients, since half of them are not currently diagnosed.
Unlike companion robots, for which the (ethical and technical) difficulty consists of building a long-term relationship with a few humans, our social robot targets short interactions with many patients. There are plenty of short-interaction scenarios, from the interviewer robot to the collaborative industrial robot, by way of game playing or shop reception. The challenge here is to ensure that the robot can rapidly configure pre-learned models to adapt them to the motor, perceptual and cognitive abilities of the human it is in contact with, about whom it knows nothing or almost nothing. At a time when rapid progress in artificial intelligence and robotics gives rise to enthusiasm, fear and fantasy, robot engineers can provide technological tools that contribute to the debate. In particular, they can teach the robot to know the limits of its own skills. As shown by Tay, the conversational agent released by Microsoft on Twitter in 2016, any interactive system is by definition influenced by what it processes, which can lead it to stray outside the scope of its pre-defined skills if it is solicited by a malicious user or if it overestimates its ability to reframe the interaction. Within 24 hours, Tay became racist and misogynist, forcing Microsoft to disconnect it.
It is therefore up to the robot to know whether or not it is capable of processing a dialogue input and to estimate whether it is still within its socio-communicative know-how. It is thus essential to fit our social robot with “red button” software, for which we, robot engineers and other specialists in social or cognitive interaction, are responsible. This software must enable the robot to determine automatically whether its interaction model is able to manage an exchange with a human, and to answer “I don’t know” or return to a standby state when the situation is too far removed from the task the model was trained to perform or from the social skills it masters.
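One simple way to picture such a “red button” is a test of how far the current situation lies from anything seen during training. The sketch below is only an illustration of that idea: the function name, the feature vectors, the action labels and the distance threshold are all hypothetical, and an actual safeguard would rely on a more principled estimate of the model’s confidence.

```python
import math


def respond(perception, demonstrations, max_distance):
    """Pick the action of the closest demonstration, or abstain.

    demonstrations: list of (feature tuple, action) pairs seen in training.
    max_distance:   hypothetical threshold beyond which the input is
                    treated as outside the robot's trained know-how.
    """
    query = tuple(perception)
    distance, action = min(
        (math.dist(features, query), action)
        for features, action in demonstrations
    )
    if distance > max_distance:
        # The "red button": refuse to improvise on unfamiliar input.
        return "I don't know"
    return action


# Hypothetical training data for an interview task.
demonstrations = [
    ((0.0, 0.0), "greet"),
    ((1.0, 1.0), "ask_next_item"),
]

print(respond((0.1, 0.0), demonstrations, max_distance=0.5))  # familiar input
print(respond((5.0, 5.0), demonstrations, max_distance=0.5))  # far from training
```

The point of the design is that abstaining is an explicit, testable outcome rather than something left to chance: the threshold can be tuned by the engineers responsible for the robot.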
(*) Theory of mind is the ability to estimate the mental states of others.
(**) The sclera is the white, opaque membrane forming the white of the eye. Depending on the species, it is more or less pigmented and more or less visible.
(***) Episodic memory is the memory of autobiographical events.
THE SOCIAL EYE
The hypothesis of the “cooperative eye” was proposed in 2001 by a Japanese team, and then taken up by anthropologists at the Max Planck Institute in Germany (2). It posits that the morphology and appearance of the human eye (a white sclera contrasting strongly with the iris and facial skin) evolved to make the gaze easier for others to read and thereby promote cooperation. To show this, scientists compared the shape and appearance of the eye (the height/width ratio and the exposed surface of the sclera) in 874 adult animals from 88 species, the colour contrast between the sclera, iris and skin in 92 species, and the eye movements of 26 species. The result: the human eye has the largest exposed sclera, and its horizontal dimension is the largest of all the primates. This morphology favours eye movements over head rotation: 61% of visual explorations in humans can be made just by moving the eyes, compared with 20% to 30% in chimpanzees. Furthermore, 85 of the 92 species observed have a brown sclera, and only humans have a white one. Better still, we are the only ones to have a sclera that is lighter than both the skin and the iris. Some authors believe eye pigmentation offers a selective advantage: it camouflages the gaze, and hence the intentions. In humans, it is probable that this trait disappeared in favour of a gaze that facilitates the communication of intentions for social cooperation.
THE UNCANNY VALLEY
The Uncanny Valley is a theory proposed in 1970 by the Japanese robot engineer Masahiro Mori, which states that the more an android robot looks like us, the more monstrous its imperfections seem. Only beyond a certain degree of realism in the imitation would humanoid robots be better accepted. This valley refers to the Freudian concept of unheimlich, an event which takes place in a familiar situation but which gives rise to anxiety or even terror. More recently, scientists at Osaka University in Japan proposed a more complex mapping, which takes account of our expectations linked to the static appearance of the robot and of our impressions resulting from its movements. The authors believe there is an optimal “hill” where the static appearance and the dynamic behaviour match and do not make the observer ill at ease.
Gérard Bailly and Frédéric Elisei
Gérard Bailly is a CNRS research director at Gipsa-lab, where he leads the Cognitive Robotics, Interactive Systems and Speech Processing (Crissp) team. Frédéric Elisei is a CNRS research engineer in the same team.