Audio processing in a multimodal framework