The Automated Audio Captioning (AAC) task centers on generating natural language descriptions from audio inputs. Because the input (audio) and the output (text) are different modalities, AAC systems typically rely on an audio encoder to extract relevant information from the sound as feature vectors, which a decoder then uses to generate text descriptions.
This area of research is critical for developing systems that enable machines to better interpret and interact with the surrounding acoustic environment. Recognizing its importance, the Detection and Classification of Acoustic Scenes and Events (DCASE) community has hosted annual AAC competitions since 2020, attracting over 26 teams globally from both academia and industry.

Listen to the result at Audio Example of a Recording Environment in a Forest.
In this post, we dive into the core innovations behind our winning submission to the DCASE 2024 AAC Challenge, with the DCASE 2024 Workshop to be hosted in Tokyo, Japan, October 23–25.
The CMU-NVIDIA solution report:
- Enhances the encoder-decoder architecture by employing multiple audio encoders.
- Uses LLM-based task-activating prompting to enrich the captions during post-editing.
This architecture improves the system’s ability to capture diverse audio features by using encoders with different granularities. The multi-encoder approach enables us to deliver richer, more complementary information to the decoder, significantly enhancing performance.
Professor Shinji Watanabe from the Language Technologies Institute (LTI) at Carnegie Mellon University (CMU) said, “This is a cool way to showcase our team’s efforts, collaborating with open-source researchers, to contribute to advancements in the audio- and language-understanding communities.”
Multi-agent collaboration for enhanced performance
One of the most innovative aspects of our approach lies in the multi-agent collaboration between distinct encoder models, which proved to be a key factor in boosting performance. By integrating multiple encoders with varying granularities, such as BEATs and ConvNeXt, we achieved enhanced coverage of audio features.
This strategy of fusing encoders has similarities with recent breakthroughs in multimodal AI research, such as MERL and CMU’s 2023 solution, where the combination of distinct agents—each specializing in different aspects of a task—yields superior results.
In our system, we adopted an encoder fusion strategy similar to the concepts used in those papers, enabling us to use the strengths of each encoder. We further considered enrichment based on textual hypotheses, drawing on two recent works from NVIDIA Research in Taiwan: GenTranslate (ACL 2024) and the Generative Image Captioning (GIC) evaluation (EMNLP 2024), which enable customization of descriptive richness.
For example, GenTranslate demonstrates how multiple language models can cooperate to improve speech translation accuracy across languages, highlighting the efficacy of multi-agent systems in generative speech translation tasks, while the GIC evaluation work applies hypothesis-based enrichment to image captioning.
Both examples underscore the value of integrating complementary models for complex tasks, reinforcing the potential of our approach to significantly elevate AAC performance. In the following sections, we describe how the core techniques are used in GPU-based pretraining and post-editing pipelines.
Advanced NVIDIA computing technology, such as Taipei-1 (ranked 38th on the TOP500 list of the world’s supercomputers), also played an important role in accelerating this state-of-the-art research and development, together with the NVIDIA DGX and NVIDIA OVX platforms.

Figure 2 shows modeling based on encoder fusion, caption filtering, and generative summarization. The generative summarization part builds on NVIDIA Research’s previous work on GenTranslate.
Core acoustic modeling techniques behind the model
The architecture of our system is inspired by last year’s winning open-source model from CMU and MERL and introduces several improvements:
- Multi-encoder fusion: We employed two pretrained audio encoders (BEATs and ConvNeXt) to generate complementary audio representations. This fusion enables the decoder to attend to a wider pool of feature sets, leading to more accurate and detailed captions.
- Multi-layer aggregation: Different layers of the encoders capture varying aspects of the input audio, and by aggregating outputs across all layers, we further enriched the information fed into the decoder.
- Generative caption modeling: To optimize the generation of natural language descriptions, we applied a large language model (LLM)–based summarization process, similar to techniques used in RobustGER. This step consolidates multiple candidate captions into a single, fluent output, using LLMs to ensure both grammatical coherence and a human-like feel to the descriptions.
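The first two ideas in the list above can be illustrated with a minimal sketch in plain Python. The source does not specify how the layers are weighted or how the two encoder outputs are combined, so everything here is an assumption: the function names `aggregate_layers` and `fuse_encoders` are hypothetical, the layer weights are a softmax over per-layer logits, and fusion is done by concatenating the two frame sequences along the time axis after they share a feature dimension.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs, layer_logits):
    """Weighted sum over encoder layers (assumed aggregation scheme).

    layer_outputs[l][t][d] is the feature d of frame t at layer l.
    layer_logits holds one (learnable, in a real model) logit per layer.
    """
    weights = softmax(layer_logits)
    n_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, layer_outputs))
            for d in range(dim)
        ]
        for t in range(n_frames)
    ]


def fuse_encoders(seq_a, seq_b):
    """Concatenate two frame sequences along time (assumed fusion scheme).

    A decoder that cross-attends to the result can draw on both encoders.
    """
    if seq_a and seq_b and len(seq_a[0]) != len(seq_b[0]):
        raise ValueError("feature dimensions must match before fusion")
    return seq_a + seq_b


# Toy stand-ins for a fine-grained and a coarse-grained encoder:
# 2 layers each, feature dimension 2, with 4 and 2 frames respectively.
fine_layers = [
    [[1.0, 0.0] for _ in range(4)],
    [[3.0, 2.0] for _ in range(4)],
]
coarse_layers = [
    [[0.0, 4.0] for _ in range(2)],
    [[2.0, 0.0] for _ in range(2)],
]

fine = aggregate_layers(fine_layers, layer_logits=[0.0, 0.0])
coarse = aggregate_layers(coarse_layers, layer_logits=[0.0, 0.0])
fused = fuse_encoders(fine, coarse)
```

In a real system the encoder outputs would be tensors from pretrained models and the layer logits would be trained jointly with the decoder. The toy lists only show the shape of the computation.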
Multi-agent collaboration with audio-text-LLM integration
Beyond the multi-encoder architecture, we also developed a new multi-agent collaboration inference pipeline. Inspired by recent research showing the benefits of nucleus sampling in AAC tasks, we improved upon traditional beam search methods.
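For readers unfamiliar with nucleus (top-p) sampling, the following is a minimal sketch of the standard technique for a single decoding step. It is not taken from the submission: the function name `nucleus_sample` and the toy probabilities are illustrative assumptions. Unlike beam search, which deterministically keeps the highest-scoring hypotheses, sampling repeatedly from the nucleus yields a more diverse pool of candidate captions.

```python
import random


def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample a token index from the smallest set of most likely tokens
    whose cumulative probability reaches top_p."""
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for idx in ranked:
        nucleus.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    # random.choices accepts unnormalized weights, so the nucleus
    # probabilities are renormalized implicitly.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


# Toy next-token distribution over a 4-token vocabulary.
next_token_probs = [0.5, 0.3, 0.15, 0.05]
token = nucleus_sample(next_token_probs, top_p=0.7)
```

With `top_p=0.7`, only the two most likely tokens are eligible here, so the low-probability tail is never sampled.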
Our inference process follows a three-stage pipeline:
- CLAP-based caption filtering: We generate multiple candidate captions and filter out less relevant ones using a Contrastive Language-Audio Pretraining (CLAP) model, reducing the number of candidates by half.
- Hybrid reranking: The remaining captions are then ranked using our hybrid reranking method to select the top k-best captions.
- LLM summarization: Finally, we use an LLM with task-activating prompting (that is, a conditional prompt such as “Do you know audio captioning?”) to summarize the k-best captions into a single, coherent caption, ensuring the final output captures all critical aspects of the audio.
This novel inference pipeline leverages the strengths of both audio processing and language modeling, significantly improving the model’s ability to produce contextually accurate captions. The refined decoded text serves as a form of feature map for the downstream text agent.
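The three stages described above can be sketched as follows. This is a hedged illustration, not the submitted implementation: the source does not define the hybrid reranking score, so the weighted mix of a decoder log-probability and an audio-text similarity (with weight `alpha`) is an assumption, as are all function names, dictionary keys, prompt wording beyond the quoted question, and the toy scores. In practice the similarity would come from a CLAP model and the prompt would be sent to an LLM.

```python
def filter_by_audio_similarity(candidates):
    """Stage 1: keep the half of the candidates most similar to the audio."""
    ranked = sorted(candidates, key=lambda c: c["clap_sim"], reverse=True)
    return ranked[: max(1, len(ranked) // 2)]


def hybrid_rerank(candidates, k, alpha=0.5):
    """Stage 2: rank by a mix of decoder score and audio-text similarity.

    The linear mix and alpha are assumptions made for illustration.
    """
    def score(c):
        return alpha * c["decoder_logprob"] + (1 - alpha) * c["clap_sim"]

    return sorted(candidates, key=score, reverse=True)[:k]


def build_summarization_prompt(captions):
    """Stage 3: build a task-activating prompt for the summarizing LLM."""
    lines = [
        "Do you know audio captioning?",
        "Summarize the candidate captions below into one fluent caption "
        "that keeps every important sound event.",
        "",
    ]
    lines += [f"{i}. {c['text']}" for i, c in enumerate(captions, start=1)]
    return "\n".join(lines)


# Hypothetical candidates with made-up scores, for illustration only.
candidates = [
    {"text": "birds chirp in a forest", "clap_sim": 0.62, "decoder_logprob": -0.40},
    {"text": "birds sing while wind blows through trees", "clap_sim": 0.58, "decoder_logprob": -0.55},
    {"text": "a person talks", "clap_sim": 0.10, "decoder_logprob": -0.30},
    {"text": "music plays", "clap_sim": 0.05, "decoder_logprob": -0.90},
]

kept = filter_by_audio_similarity(candidates)
k_best = hybrid_rerank(kept, k=2)
prompt = build_summarization_prompt(k_best)
```

Filtering before reranking means the more expensive LLM step only sees captions that already agree with the audio, which is the division of labor between the audio agent and the text agent that the pipeline relies on.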
Impact and performance
Our multi-encoder system achieved a Fluency Enhanced Sentence-BERT Evaluation (FENSE) score of 0.5442, outperforming the baseline score of 0.5040. By incorporating multi-agent systems, we have opened new avenues for further improving AAC tasks.
Future work will explore integrating more advanced fusion techniques and examining how further collaboration between specialized agents can enhance both the granularity and quality of the generated captions.
We hope that our contributions inspire continued exploration in multi-agent AI systems and encourage other teams to adopt similar strategies for fusing diverse models to handle complex multimodal tasks like AAC.
In Figure 3, a higher score means more detail and richer information captured from the audio context.

Using GPU technology for performance and scalability
Our solution outperformed other participants in the challenge by more than 10% (relative) in semantic understanding scores, thanks to the synergy between multi-encoder fusion and LLM-driven summarization. This success underscores the potential of multi-agent, multi-modality systems in advancing general-purpose understanding.
The use of LLM-based many-to-one textual correction was a critical innovation in this process, enabling the model to better use the computational power of the text modeling agent. This method retrieves and refines hidden information embedded in the audio, improving the system’s overall performance.
This approach builds on NVIDIA’s state-of-the-art work in multimodal AI, such as the GenTranslate model, which excels in multilingual speech and text translation. Similarly, the recent Audio Flamingo project, Synthio project, and dataset from NVIDIA Applied Deep Learning Research (ADLR) also demonstrated the power of advanced pretraining techniques for audio encoders.
These systems, alongside our winning AAC solution, have all benefited from NVIDIA A100 and NVIDIA H100 GPUs, accelerating AI development and pushing the boundaries of what’s possible in multimodal learning. Huck Yang from NVIDIA Research has been invited to join the Technical Panel Discussion on audio-language technologies during the DCASE Workshop 2024 program.