Event Abstract

Acoustic Packaging and the Learning of Words

  • 1 Bielefeld University, Applied Informatics Group, Faculty of Technology, Germany
  • 2 Bielefeld University, Research Institute for Cognition and Robotics, Germany
  • 3 Bielefeld University, Faculty of Linguistics and Literature, Germany
  • 4 Bielefeld University, Emergentist Semantics Group, CITEC, Germany

In tutoring scenarios involving a human tutor and an infant learner, the infant needs to be able to segment the continuous stream of multimodal information that s/he perceives into meaningful units. When the tutor demonstrates actions while commenting on them, both speech and visual information play an important role in the segmentation process [1]. At the utterance level, speech helps the learner attend to particular units of the action stream and connect them. Within an utterance, emphasis highlights certain aspects of the speech stream, thus helping the learner identify relevant semantic information – for example, the color of a described object or the goal of an ongoing movement. Visual cues such as motion can help the learner temporally segment the visual stream into chunks. Furthermore, motion in conjunction with visual saliency can help the learner track important objects which are part of the ongoing action. By combining information grounded in both the visual and auditory modalities, the learner can take first steps towards word learning, for example by connecting a color term with the color properties of the current object.
The idea that language helps infants structure the action stream they perceive has been proposed and termed acoustic packaging in [1]. A computational model that segments a continuous stream of speech and action demonstrations into acoustic packages has been proposed in [2]. Here, acoustic packages are designed as bottom-up units for further learning and feedback processes. Robotic systems that learn actions could make use of acoustic packaging to segment actions into meaningful parts in a way similar to infants. To take first steps towards word learning, it is important to further refine the action segmentation so that highlighted parts can be identified. Furthermore, feedback is important to communicate to the tutor what the robot has understood from the tutor's action demonstrations [3]. That way, the robot can show the tutor that it has learned the correct words or expressions for the action or term the tutor has focused on. For example, if the tutor shows a cup and focuses on the cup's color, s/he will probably emphasize the color term. By repeating this (emphasized) color term, the robot shows its understanding of the essence of the dialogue.
In this poster we describe how we identify emphasized parts in the acoustic modality and object properties in the visual modality. This information is linked by our acoustic packaging system and used by a feedback module, described below, to provide feedback on the iCub robot.

Acoustic Packaging
The main tasks of our acoustic packaging system [2] are to deliver bottom-up segmentation hypotheses about the action presented and to form early learning units. The temporal segmentation component produces acoustic packages by associating segments from the visual and the acoustic modality. The system considers temporal synchrony an amodal cue that indicates which segments should be packaged together: when overlapping speech and motion segments are detected, an acoustic package is created. In the initial version of the system, the acoustic modality is segmented into utterances, and the visual modality is segmented into motion peaks by detecting local minima in the amount of motion over time. The modules in our system (see Figure 1) exchange events through a central memory, the so-called Active Memory [4]. In the following, we give an overview of our implementations of acoustic prominence detection and color saliency based tracking, and of their integration into our architecture.
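The synchrony-based packaging step can be illustrated with a minimal sketch. This is not the system's actual implementation; it assumes that both modalities are already reduced to (start, end) time intervals, and simply pairs each utterance with the motion peaks it overlaps:

```python
def overlaps(a, b):
    """True if time intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def acoustic_packages(utterances, motion_peaks):
    """Form one package per utterance that overlaps at least one motion
    peak; the package's temporal span covers all associated segments."""
    packages = []
    for u in utterances:
        linked = [m for m in motion_peaks if overlaps(u, m)]
        if linked:
            start = min(u[0], *(m[0] for m in linked))
            end = max(u[1], *(m[1] for m in linked))
            packages.append({"utterance": u, "motion": linked,
                             "span": (start, end)})
    return packages

# One utterance overlapping two of three motion peaks yields one package
# spanning the utterance and both overlapping peaks.
pkgs = acoustic_packages([(0.5, 2.0)], [(0.4, 1.0), (1.2, 1.8), (3.0, 3.5)])
```

The actual system operates incrementally on event streams from the Active Memory rather than on complete interval lists, but the overlap criterion is the core of the synchrony cue.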

Prominence Detection
We understand the perceptual prominence of linguistic units as a unit's degree of standing out from its environment [5]. This results in two main requirements for the module that automatically detects perceptually prominent units. First, the speech stream has to be segmented into linguistic units, which in our case are syllables. In our implementation, a modified version of the Mermelstein algorithm [6] is used to segment utterances into syllables. The basic idea of this algorithm is to identify significant minima in the signal's energy envelope as syllable boundaries. The second step is to rate the syllables according to acoustic parameters that correlate with perceived prominence. We implemented a simplified version of an algorithm described in [5], which focuses on spectral emphasis to rate the syllable segments. The syllable segment with the highest spectral emphasis rating is considered the most prominent syllable in the utterance (see Figure 2). The spectral emphasis feature is calculated by bandpass filtering the signal in the 500 Hz to 4000 Hz band and computing the RMS energy. The utterance hypotheses segmented by the system are extended with this information and made available to other modules via the Active Memory.
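The two steps can be sketched as follows. This is a deliberately simplified illustration, not the module's actual code: the boundary detector here uses a crude significance test on an energy envelope, and the bandpass is approximated in the FFT domain rather than with the original filter design:

```python
import numpy as np

def syllable_boundaries(energy, min_drop=0.3):
    """Simplified Mermelstein-style segmentation: local minima of the
    energy envelope that lie well below both neighbouring peaks are
    treated as syllable boundaries."""
    bounds = []
    for i in range(1, len(energy) - 1):
        if energy[i] < energy[i - 1] and energy[i] < energy[i + 1]:
            left_peak = max(energy[:i])
            right_peak = max(energy[i + 1:])
            if energy[i] < min_drop * min(left_peak, right_peak):
                bounds.append(i)
    return bounds

def spectral_emphasis(signal, sr, lo=500.0, hi=4000.0):
    """RMS energy of the 500-4000 Hz band, using a crude FFT-domain
    bandpass as a stand-in for a proper filter."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    band = np.fft.irfft(spec, n=len(signal))
    return float(np.sqrt(np.mean(band ** 2)))
```

Rating each syllable segment with `spectral_emphasis` and taking the argmax then yields the most prominent syllable of the utterance.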

Color Saliency Based Tracking
The motion based action segmentation can provide a temporal segmentation of the video signal, but it cannot deliver detailed spatial information or local visual features about moving objects in the tutoring situation. Our approach is based on the assumption that during action demonstrations the objects are moved and have typical toy coloring. Thus, the visual signal is masked using a motion history image to focus on the changing parts of the visual signal. The pixels of the changing regions are clustered in the YUV color space, using the UV coordinates for the distance function. The clusters are ranked according to their distance to the center of mass of all clusters, and the top ranked clusters are considered salient. Several heuristics are applied to filter out artifacts such as background regions uncovered by the moving object. The top ranked clusters are tracked over time based on spatial and color distance. The top ranked trajectory forms the motion hypothesis for the object presented by the tutor (see Figure 3).
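The clustering and ranking step might look roughly like the following sketch. It is an illustration under simplifying assumptions, not the system's implementation: it uses a tiny k-means with a deterministic farthest-first initialisation (the actual clustering method is not specified above), takes the motion-masked pixels' UV values as given, and omits the filtering heuristics and tracking:

```python
import numpy as np

def rank_color_clusters(uv_pixels, n_clusters=3, iters=10):
    """Cluster moving pixels in UV (chrominance) space with a small
    k-means, then rank cluster means by their distance from the centre
    of mass of all clusters; colours far from the average are salient."""
    pts = np.asarray(uv_pixels, dtype=float)
    # Deterministic farthest-first initialisation of the cluster centers.
    first = np.argmax(np.linalg.norm(pts - pts.mean(axis=0), axis=1))
    centers = [pts[first]]
    while len(centers) < n_clusters:
        d = np.min([np.linalg.norm(pts - c, axis=1) for c in centers], axis=0)
        centers.append(pts[np.argmax(d)])
    centers = np.array(centers)
    # Standard k-means refinement.
    for _ in range(iters):
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = pts[labels == k].mean(axis=0)
    # Rank by distance from the center of mass of all cluster means.
    com = centers.mean(axis=0)
    saliency = np.linalg.norm(centers - com, axis=1)
    order = np.argsort(-saliency)
    return centers[order], saliency[order]
```

In this ranking, a distinctly colored toy stands out against near-neutral background clusters, which is what the toy-typical-coloring assumption exploits.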

Feedback Based on Prominent Syllables and Object Properties Linked by Acoustic Packages
In our scenario, a human tutor sits in front of a robot and demonstrates cup stacking to the system. A typical acoustic package contains an utterance hypothesis (including prominence ratings) and a trajectory hypothesis. The idea of our feedback module is to use the trajectory color information and the prominent syllables in the acoustic packages to associate semantically relevant syllables from speech with properties of the object presented. While the tutor demonstrates and explains his or her actions, the acoustic packages are clustered using the color feature of the object trajectories. The tutor can then evaluate what the system has learned by only showing the cups without explaining the actions. In this case, the feedback module complements the speech modality by replaying the most prominent syllable from a package with a similar trajectory color. A neighborhood of two syllables is included in the replay as a heuristic to ensure that a full word is captured and to compensate for possible oversegmentation effects.
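The matching and replay logic can be sketched as below. The package dictionaries, their field names, and the interpretation of the neighborhood as up to two syllables on each side of the peak are illustrative assumptions; the real system replays stored audio rather than text tokens:

```python
import numpy as np

def nearest_package(packages, query_uv):
    """Select the stored acoustic package whose trajectory colour (mean
    UV value) is closest to the colour observed in the silent demo."""
    dists = [np.linalg.norm(np.asarray(p["uv"]) - query_uv) for p in packages]
    return packages[int(np.argmin(dists))]

def syllables_to_replay(package, context=2):
    """Return the most prominent syllable plus up to `context` neighbours
    on each side, to capture a full word despite oversegmentation."""
    ratings = package["prominence"]
    peak = int(np.argmax(ratings))
    lo, hi = max(0, peak - context), min(len(ratings), peak + context + 1)
    return package["syllables"][lo:hi]

# Hypothetical stored packages: syllable sequences with prominence
# ratings and the mean trajectory colour in UV space.
pkgs = [
    {"syllables": ["the", "green", "cup"], "prominence": [0.2, 0.9, 0.4],
     "uv": (100, 160)},
    {"syllables": ["a", "red", "one"], "prominence": [0.1, 0.8, 0.3],
     "uv": (180, 90)},
]
best = nearest_package(pkgs, np.array([105, 155]))
replay = syllables_to_replay(best, context=2)
```

Here a greenish query colour retrieves the package recorded with the green cup, and the replayed window around the prominent syllable covers the emphasized color term.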

First tests on the iCub robot showed that the acoustic packaging system, extended with the color based tracking module and the prominence detection module, is able to link semantically relevant parts of the utterance with object properties. Our feedback module can then provide feedback to the tutor based on the acoustic packages that establish this link. However, further work is necessary to close the loop between tutor and robot by implementing strategies for handling corrections and other types of feedback regarding the quality and relevance of the acoustic packages.

Figure 1: System overview with highlighted layers and their relation to the acoustic packaging system.
Figure 2: Cue visualization tool showing the segmentation and association of speech, prominence, motion peaks, and trajectories to acoustic packages.
Figure 3: Two examples of tracking results using the color saliency based tracking module. The images show a test subject demonstrating cup stacking to an infant. The color property of the trajectory is automatically determined from the salient regions tracked.



The authors gratefully acknowledge the financial support from the FP7 European Project ITALK (ICT-214668).


[1] K. Hirsh-Pasek and R. M. Golinkoff, The Origins of Grammar: Evidence from Early Language Comprehension. The MIT Press, 1996.
[2] L. Schillingmann, B. Wrede, and K. J. Rohlfing, “A Computational Model of Acoustic Packaging”, IEEE Transactions on Autonomous Mental Development, vol. 1, no. 4, pp. 226–237, Dec. 2009.
[3] A. L. Vollmer, K. S. Lohan, K. Fischer, Y. Nagai, K. Pitsch, J. Fritsch, K. J. Rohlfing, and B. Wrede, “People Modify Their Tutoring Behavior in Robot-Directed Interaction for Action Learning”, in International Conference on Development and Learning, Shanghai, China, 2009.
[4] J. Fritsch and S. Wrede, “An Integration Framework for Developing Interactive Robots”, in Software Engineering for Experimental Robotics, D. Brugali, Ed. Springer, 2007, pp. 291–305.
[5] F. Tamburini and P. Wagner, “On automatic prominence detection for German”, in Interspeech 2007, 2007, pp. 1809–1812.
[6] P. Mermelstein, “Automatic segmentation of speech into syllabic units”, Journal of the Acoustical Society of America, vol. 58, no. 4, pp. 880–883, 1975.

Keywords: color saliency, Feedback, human robot interaction, multimodal action segmentation, prominence

Conference: IEEE ICDL-EPIROB 2011, Frankfurt, Germany, 24 Aug - 27 Aug, 2011.

Presentation Type: Poster Presentation

Topic: Grounding

Citation: Schillingmann L, Wagner P, Munier C, Wrede B and Rohlfing K (2011). Acoustic Packaging and the Learning of Words. Front. Comput. Neurosci. Conference Abstract: IEEE ICDL-EPIROB 2011. doi: 10.3389/conf.fncom.2011.52.00020

Copyright: The abstracts in this collection have not been subject to any Frontiers peer review or checks, and are not endorsed by Frontiers. They are made available through the Frontiers publishing platform as a service to conference organizers and presenters.

The copyright in the individual abstracts is owned by the author of each abstract or his/her employer unless otherwise stated.

Each abstract, as well as the collection of abstracts, are published under a Creative Commons CC-BY 4.0 (attribution) licence (https://creativecommons.org/licenses/by/4.0/) and may thus be reproduced, translated, adapted and be the subject of derivative works provided the authors and Frontiers are attributed.

For Frontiers’ terms and conditions please see https://www.frontiersin.org/legal/terms-and-conditions.

Received: 11 Apr 2011; Published Online: 12 Jul 2011.

* Correspondence: Mr. Lars Schillingmann, Bielefeld University, Applied Informatics Group, Faculty of Technology, Bielefeld, Germany, lschilli@techfak.uni-bielefeld.de