Humans deal with a continuous flow of diverse stimuli. Likewise, humanoids need multi-modal perceptual systems that integrate their sensors seamlessly. One way to do this is to let the sensors continually compete for dominance. At the ElectroTechnical Laboratory in Japan, G. Cheng and Y. Kuniyoshi have developed a humanoid with 24 degrees of freedom and joint receptors with encoders and temperature sensing. The humanoid uses six PCs to control hearing, vision, motor output and integration. The robot itself is lightweight and flexible, allowing it to interact comfortably and safely with humans. In a visual and auditory tracking task, the robot tracks a person by sight, sound or both while mimicking that person's upper-body motion. The focus of the work was to show that the robot can track people using a multi-sensory approach that is not task-specific and does not need to switch between sensor modalities. The key point is that the perceptual subsystems needed for mimicry, tracking, vision and auditory processing should not be treated as separate tasks and pursued in isolation, but as essential capabilities that together contribute to high-utility humanlike behavior.
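The idea of sensors competing for dominance without an explicit switch can be illustrated with a minimal sketch. It is not the method Cheng and Kuniyoshi used; it assumes each modality reports a bearing to the person plus a confidence value, and it blends them by confidence so that whichever sense is currently more certain dominates the result. The function name `fuse_bearing`, the confidence scale and the restriction to bearings in front of the robot (so that plain averaging of angles is valid) are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical interface: each modality reports (bearing_deg, confidence).
# A confidence of 0.0 means the modality currently detects nothing.
Estimate = Tuple[float, float]


def fuse_bearing(estimates: Dict[str, Estimate]) -> Optional[float]:
    """Blend per-modality bearings, weighting each by its confidence.

    Assumes bearings lie in a limited range in front of the robot
    (for example -90 to 90 degrees), so a plain weighted mean is valid.
    Returns None when no modality has any confidence.
    """
    total = sum(conf for _, conf in estimates.values())
    if total <= 0:
        return None
    return sum(bearing * conf for bearing, conf in estimates.values()) / total
```

When vision loses the person, its confidence drops to zero and the audio estimate takes over on its own; no separate mode switch is needed.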
That said, humanoid roboticists agree that vision is the most crucial sensing modality for enabling rich, humanlike interactions with the environment. Of course, computer vision has long been a hard problem and an essential field of study in its own right. The first major problem is that many factors are confounded in image data through a many-to-one mapping. For instance, how can a humanoid infer 3-dimensional reality from a 2-dimensional image? Another problem is the sheer volume of data to be processed. For a long time, computer vision research assumed that the goal was to acquire as much data about the environment as possible. This approach proved computationally intractable. Rather than treating perceptual systems as passive receptors that merely collect any and all data, we are beginning to create perceptual systems that interact with humans and with the physical environment, actively constructing a perception of reality rather than just passively receiving it.
The emotion-exhibiting head of Kismet. MIT Artificial Intelligence Laboratory.
Luiz-Marcos Garcia, Antonio A.F. Oliveira, Roderic A. Grupen, David S. Wheeler and Andy H. Fagg use attentional mechanisms to focus a humanoid robot on visual areas of interest [6]. On top of this capability, the authors have implemented a learning system that allows the robot to autonomously recognize and categorize the environmental elements it extracts. Robots must be equipped to exploit perceptual cues such as sound, movement, color intensity or human body language (pointing, gazing and so on). For rich sensor modalities such as vision, perception is as much a process of excluding input as of receiving it.
Humans naturally find certain perceptual features interesting. Features such as color, motion and face-like shapes are likely to attract our attention [5]. MIT has been working to create a variety of perceptual feature detectors that are particularly relevant to interacting with people and objects. These include low-level feature detectors attuned to quickly moving objects, highly saturated color, and colors representative of skin tones. The robot's attention is determined by a combination of these low-level perceptual stimuli, and their relative weightings are modulated by high-level behavior and motivational influences [6]. For a task involving human interaction, the perceptual category "face" may be given higher priority than it would be for a surveillance task, where the robot must attend most closely to motion and color.
Graphical representation of Kismet's attentional system. MIT Artificial Intelligence Laboratory.
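The weighted combination described above can be sketched in a few lines. This is an illustration under stated assumptions, not MIT's implementation: feature maps are small grids of activations, the attention target is simply the location with the highest weighted sum, and the task names, gain values and function names (`TASK_GAINS`, `combine_maps`, `attention_target`) are hypothetical. The skin-tone map stands in for the "face" category.

```python
from typing import Dict, List, Tuple

FeatureMap = List[List[float]]  # rows of per-location activations in [0, 1]

# Hypothetical task-dependent gains; names and values are illustrative only.
TASK_GAINS: Dict[str, Dict[str, float]] = {
    "social": {"motion": 1.0, "color": 1.0, "skin_tone": 3.0},
    "surveillance": {"motion": 3.0, "color": 2.0, "skin_tone": 0.5},
}


def combine_maps(maps: Dict[str, FeatureMap], gains: Dict[str, float]) -> FeatureMap:
    """Return the per-location weighted sum of all feature maps."""
    names = list(maps)
    rows = len(maps[names[0]])
    cols = len(maps[names[0]][0])
    return [
        [sum(gains[n] * maps[n][r][c] for n in names) for c in range(cols)]
        for r in range(rows)
    ]


def attention_target(saliency: FeatureMap) -> Tuple[int, int]:
    """Return the (row, col) of the most salient location."""
    locations = (
        (r, c) for r in range(len(saliency)) for c in range(len(saliency[0]))
    )
    return max(locations, key=lambda rc: saliency[rc[0]][rc[1]])


# Toy 2x2 scene: fast motion at top-left, a skin-toned patch at bottom-right.
maps = {
    "motion": [[0.9, 0.0], [0.0, 0.0]],
    "color": [[0.0, 0.2], [0.0, 0.0]],
    "skin_tone": [[0.0, 0.0], [0.0, 0.5]],
}
social_focus = attention_target(combine_maps(maps, TASK_GAINS["social"]))
patrol_focus = attention_target(combine_maps(maps, TASK_GAINS["surveillance"]))
```

With these toy values the same scene yields different targets: the social gains favor the skin-toned patch, while the surveillance gains favor the moving region. Only the gains change; the low-level detectors stay the same.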
A sufficiently salient stimulus in any modality can capture the robot's attention, just as a human watching a film might respond to sudden motion in the adjacent seat. MIT has built a number of intuitive arbitration rules into the system; for example, all else being equal, larger objects are considered more salient than smaller ones. The goal is for the robot to be responsive to unexpected events but also able to filter out superfluous ones. Otherwise, the robot would become a slave to every whim of its environment. MIT has found that its attention model enables people to intuitively provide the right cues to direct the robot's attention. Actions such as shaking an object, moving closer to the intended listener, waving a hand and altering tone of voice all help the robot focus on appropriate aspects of its environment.
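One way to picture this kind of arbitration is the sketch below. It is an assumption-laden illustration rather than the rules MIT implemented: the `Stimulus` fields, the `min_salience` floor that filters out superfluous events, and the `switch_margin` a newcomer must exceed to pull attention away from the current focus are all hypothetical. The only rule taken directly from the text is that, at equal salience, the larger object wins.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Stimulus:
    modality: str    # e.g. "vision" or "audio"
    salience: float  # higher means more attention-grabbing
    size: float      # apparent size; used only to break ties


def select_focus(
    stimuli: Iterable[Stimulus],
    current: Optional[Stimulus] = None,
    min_salience: float = 0.2,   # hypothetical floor for ignoring minor events
    switch_margin: float = 0.1,  # hypothetical margin needed to change focus
) -> Optional[Stimulus]:
    """Pick the stimulus to attend to, across all modalities."""
    candidates = [s for s in stimuli if s.salience >= min_salience]
    if not candidates:
        return current
    # All else being equal, the larger object is treated as more salient.
    best = max(candidates, key=lambda s: (s.salience, s.size))
    # Keep the current focus unless the newcomer is clearly more salient.
    if current is not None and best.salience < current.salience + switch_margin:
        return current
    return best
```

In the film example, a faint rustle falls below the floor and is ignored, whereas a sudden jolt in the next seat clears both the floor and the margin and takes over the focus.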