In this paper we describe the acoustic emotion recognition
system built at the Speech Technology Group of the Universidad
Politecnica de Madrid (Spain) to participate in the INTERSPEECH
2009 Emotion Challenge. Our proposal is based on a
Dynamic Bayesian Network (DBN) that models the temporal
dynamics of the emotional information in speech.
The selected features (MFCC, F0, Energy, and their variants) are
modelled as separate streams, and the F0-related features are integrated
under a Multi-Space Distribution (MSD) framework to
properly model the dual (voiced/unvoiced) nature of F0.
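As a minimal illustration of these two ideas (and not the system's actual implementation), the following Python sketch scores an F0 observation under an MSD with a voiced and an unvoiced sub-space, and combines per-stream log-likelihoods with stream weights; the class name, default parameter values, and weights are hypothetical.

    import math

    class MSDF0Stream:
        """Multi-Space Distribution over F0: a 1-D Gaussian on log-F0 for
        the voiced sub-space plus a zero-dimensional sub-space for
        unvoiced frames. All parameter values are illustrative."""

        def __init__(self, w_voiced=0.7, mean=4.8, var=0.09):
            self.w_voiced = w_voiced          # prior mass of the voiced sub-space
            self.w_unvoiced = 1.0 - w_voiced  # mass of the unvoiced (0-dim) sub-space
            self.mean = mean                  # Gaussian mean of log-F0 (voiced frames)
            self.var = var                    # Gaussian variance of log-F0

        def log_likelihood(self, f0):
            """f0 is None for unvoiced frames, a frequency in Hz otherwise."""
            if f0 is None:                    # unvoiced: only the 0-dim space contributes
                return math.log(self.w_unvoiced)
            x = math.log(f0)
            return (math.log(self.w_voiced)
                    - 0.5 * math.log(2.0 * math.pi * self.var)
                    - (x - self.mean) ** 2 / (2.0 * self.var))

    def combined_log_likelihood(stream_logliks, stream_weights):
        """Multi-stream observation score: weighted sum of per-stream
        log-likelihoods (the usual stream-exponent combination)."""
        return sum(w * ll for w, ll in zip(stream_weights, stream_logliks))

For example, a voiced frame with F0 = 120 Hz would be scored with log_likelihood(120.0) and an unvoiced frame with log_likelihood(None); the weights in combined_log_likelihood are tuning parameters of the multi-stream combination.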
Experimental evaluation on the challenge test set shows unweighted recalls of
67.06% and 38.24% for the 2-class and 5-class tasks, respectively.
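Here, unweighted recall (the challenge's primary measure, also known as unweighted average recall, UA) is the mean of the per-class recalls, so every class counts equally regardless of how many instances it has:

    \mathrm{UA} = \frac{1}{K} \sum_{k=1}^{K} \frac{\text{correctly classified instances of class } k}{\text{instances of class } k}

With K = 2 and K = 5 classes in the two tasks, chance level is 50% and 20%, respectively.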
In the 2-class case, we achieve results similar to
the baseline while using 8.5 times fewer features. In the 5-class case, we
achieve a statistically significant 6.5% relative improvement over the baseline.