
Custom Wake Word AI Avatar System for United Nations Conference
The Client
The CTBTO (Comprehensive Nuclear-Test-Ban Treaty Organization) is a United Nations body based in Vienna, responsible for monitoring global compliance with the international nuclear test ban treaty. Their annual conference brings together delegates, diplomats, and scientists from member states across the world - representing dozens of languages and cultures.
For their Vienna 2025 conference, they wanted to push the boundaries of how AI could serve a high-profile international audience - starting with a smarter, hands-free way for delegates to interact with an AI avatar in a live conference environment.
The Problem
For its 2025 Vienna conference, the CTBTO needed a real-time AI avatar system that could be activated without any touch interaction. The system had to work amid conference background noise, support multiple languages for international delegates, and replace button-based activation with natural wake word detection. In short, they required a solution that could handle the complexity of a United Nations conference setting while providing seamless, hands-free interaction with the avatar.
The Solution
We built an integrated system combining proprietary wake word detection with AI avatar generation:
- AI avatar with real-time speech recognition and natural conversation capabilities
- Wake word activation supporting 6 languages (Chinese, Russian, Arabic, Spanish, French, English)
- Advanced audio processing for real-time voice interaction and response generation
- Personalized digital identity creation with customizable avatar appearance and behavior
- Background noise filtering for accurate detection in conference environments
- Hands-free activation replacing manual button interaction
The Result
The system achieved 95% wake word detection accuracy across all 6 languages - inside a live conference hall with ambient crowd noise, simultaneous translation audio, and overlapping conversations. That environment was the hardest possible test, and the system passed it.
Over 500 delegates engaged with the avatar during the conference. The experience was entirely hands-free: no buttons, no touchscreens, no trained interaction pattern required. Delegates could walk up and start talking in their native language.
The engineering challenge that shaped the whole build was specificity: general-purpose voice detection degrades fast in noisy multilingual environments. The solution required training wake word models per language independently and layering in ambient noise suppression before detection - not after. That architecture decision was what made 95% accuracy achievable in the field rather than just in testing.
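The ordering decision above can be sketched in miniature: clean each audio frame first, then score it against independently trained per-language detectors. Everything here is illustrative (the crude spectral-gate stand-in, the 0.5 acceptance threshold, the detector callables); it shows only the pipeline shape, not the production models.

```python
# Illustrative pipeline: noise suppression BEFORE detection, then one
# independently trained detector per language. All names and thresholds
# are assumptions for the sketch, not the production implementation.

def suppress_noise(frame, noise_floor):
    """Crude spectral-gate stand-in: zero out components below the floor."""
    return [b if abs(b) > noise_floor else 0.0 for b in frame]

def run_detectors(frame, detectors):
    """Score the frame with each language's own model; accept the best
    candidate only if it clears a confidence threshold."""
    scores = {lang: detect(frame) for lang, detect in detectors.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.5 else None

def process(frame, noise_floor, detectors):
    clean = suppress_noise(frame, noise_floor)  # before detection, not after
    return run_detectors(clean, detectors)
```

Suppressing noise upstream means every detector sees the same cleaned signal, which is what keeps per-language accuracy consistent in a live hall rather than only on clean test audio.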