Overview#

Orchestrator is a real-time conversation system for building personalized multimodal AI interaction workflows. It combines speech recognition (ASR), text conversation (LLM), text-to-speech (TTS), emotion analysis (Classification & Reaction), memory management (Memory), and 3D animation generation (Audio2Face & Speech2Motion). Its modular design supports multiple AI service providers, streaming processing, and end-to-end conversation management.

Main application scenarios: personalized role-playing, customized virtual companions, education and training, intelligent customer service, office assistants, etc.

Core Features#

Technical Features#

Multimodal Interaction: Voice interaction, text conversation, 3D animation generation
Real-time Streaming Processing: Real-time data stream processing with low-latency response
Multi-AI Service Provider Support: Integration with mainstream AI services including SenseNova, OpenAI, Anthropic, Gemini, xAI, DeepSeek, MiniMax, ElevenLabs, Volcano Engine, etc.
Intelligent Memory Management: Multi-level conversation memory, relationship status, and emotional state management
Emotional Intelligence Analysis: Real-time analysis of character emotional changes, relationship changes, and triggered actions
Highly Scalable Architecture: Modular design, easy to add new AI services and custom features

Customization Capabilities#

Character Customization: Custom character personalities, voices, emotions, and actions
Interaction Customization: Flexible configuration of conversation modes, reaction mechanisms, and memory management
Service Combination: Support for combining multiple AI service providers, flexible selection based on scenario requirements
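
As an illustration of service combination, the sketch below picks one adapter per stage from a lookup table. The adapter class names come from the AI Services tables in this document; the stage keys, provider keys, and the `select_adapters` helper are hypothetical and do not describe the actual configuration format.

```python
# Adapter class names are taken from the AI Services tables; the stage and
# provider keys are hypothetical names chosen for this sketch.
AVAILABLE_ADAPTERS = {
    "conversation": {
        "openai": "OpenAIConversationClient",
        "anthropic": "AnthropicConversationClient",
        "deepseek": "DeepSeekConversationClient",
    },
    "asr": {
        "openai": "OpenAIRealtimeASRClient",
        "huoshan": "HuoshanASRClient",
    },
    "tts": {
        "elevenlabs": "ElevenLabsTTSClient",
        "huoshan": "HuoshanTTSClient",
    },
}


def select_adapters(selection: dict) -> dict:
    """Map each stage to the adapter class name of the chosen provider."""
    chosen = {}
    for stage, provider in selection.items():
        try:
            chosen[stage] = AVAILABLE_ADAPTERS[stage][provider]
        except KeyError:
            raise ValueError(
                f"no adapter for stage={stage!r}, provider={provider!r}"
            ) from None
    return chosen
```

A scenario that favors voice quality might, for example, pass `{"conversation": "anthropic", "asr": "openai", "tts": "elevenlabs"}`, while an unsupported combination fails early with a `ValueError`.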

System Architecture#

Project Structure#

orchestrator/
├── proxy.py                   # Core orchestrator, manages DAG workflows
├── service/                   # Web service layer
│   ├── server.py              # FastAPI server, provides WebSocket interface
│   ├── requests.py            # Request data models
│   └── responses.py           # Response data models
├── conversation/              # Conversation management module
│   ├── conversation_adapter.py        # Text conversation adapter base class
│   ├── audio_conversation_adapter.py  # Audio conversation adapter base class
│   ├── openai_conversation_client.py  # OpenAI text conversation client
│   ├── openai_audio_client.py         # OpenAI audio conversation client
│   ├── anthropic_conversation_client.py # Anthropic conversation client
│   ├── gemini_conversation_client.py   # Gemini conversation client
│   ├── xai_conversation_client.py      # xAI conversation client
│   ├── deepseek_conversation_client.py # DeepSeek conversation client
│   ├── sensechat_conversation_client.py # SenseChat conversation client
│   ├── sensenova_conversation_client.py # SenseNova conversation client
│   └── minimax_conversation_client.py # MiniMax conversation client
├── generation/                # Generation management module
│   ├── speech_recognition/    # Speech Recognition (ASR)
│   │   ├── asr_adapter.py     # ASR adapter base class
│   │   ├── huoshan_asr_client.py # Volcano Engine ASR
│   │   ├── openai_realtime_asr_client.py # OpenAI real-time ASR
│   │   ├── sensetime_asr_client.py      # SenseTime ASR
│   │   └── softsugar_asr_client.py      # Softsugar ASR
│   ├── text2speech/          # Text-to-Speech (TTS)
│   │   ├── tts_adapter.py     # TTS adapter base class
│   │   ├── chatterbox_tts_client.py     # Chatterbox TTS
│   │   ├── elevenlabs_tts_client.py     # ElevenLabs TTS
│   │   ├── huoshan_tts_client.py        # Volcano Engine TTS
│   │   ├── sensenova_tts_client.py      # SenseNova TTS
│   │   ├── sensetime_tts_client.py      # SenseTime TTS
│   │   └── softsugar_tts_client.py      # Softsugar TTS
│   ├── speech2motion/        # Speech-to-Motion
│   │   ├── speech2motion_adapter.py     # S2M adapter base class
│   │   └── speech2motion_streaming_client.py # S2M streaming client
│   └── audio2face/           # Audio-to-Face
│       ├── audio2face_adapter.py        # A2F adapter base class
│       └── audio2face_streaming_client.py # A2F streaming client
├── memory/                   # Memory management module
│   ├── memory_adapter.py     # Memory adapter base class
│   ├── memory_manager.py     # Memory manager
│   ├── memory_processor.py   # Memory processor
│   ├── task_manager.py       # Task manager
│   ├── openai_memory_client.py  # OpenAI memory client
│   ├── xai_memory_client.py  # xAI memory client
│   ├── sensenova_memory_client.py # SenseNova memory client
│   └── minimax_memory_client.py # MiniMax memory client
├── classification/           # Classification module
│   ├── classification_adapter.py # Classification adapter base class
│   ├── openai_classification_client.py # OpenAI classification client
│   ├── gemini_classification_client.py # Gemini classification client
│   ├── sensenova_classification_client.py # SenseNova classification client
│   ├── minimax_classification_client.py # MiniMax classification client
│   └── xai_classification_client.py    # xAI classification client
├── reaction/                # Reaction module
│   ├── reaction_adapter.py   # Reaction adapter base class
│   ├── openai_reaction_client.py # OpenAI reaction client
│   ├── gemini_reaction_client.py # Gemini reaction client
│   ├── sensenova_reaction_client.py # SenseNova reaction client
│   ├── minimax_reaction_client.py # MiniMax reaction client
│   └── xai_reaction_client.py    # xAI reaction client
├── aggregator/              # Data aggregators
│   ├── conversation_aggregator.py # Conversation aggregator
│   ├── tts_reaction_aggregator.py # TTS reaction aggregator
│   ├── blendshapes_aggregator.py  # Facial expression aggregator
│   └── callback_aggregator.py     # Callback aggregator
├── io/                      # Data storage interfaces
│   ├── config/              # Configuration storage
│   │   ├── database_config_client.py # Database configuration client
│   │   ├── dynamodb_config_client.py # DynamoDB configuration client
│   │   └── mongodb_config_client.py  # MongoDB configuration client
│   └── memory/              # Memory storage
│       ├── database_memory_client.py # Database memory client
│       ├── dynamodb_memory_client.py # DynamoDB memory client
│       └── mongodb_memory_client.py  # MongoDB memory client
├── data_structures/         # Data structure definitions
└── utils/                   # Utility modules

Core Components#

1. Conversation Management Module (Conversation)#

Function: Handles text and audio conversations, supports multiple large language models
Core Components:
- ConversationAdapter: Text conversation adapter base class, handles streaming text conversations
- AudioConversationAdapter: Audio conversation adapter base class, handles real-time voice interactions
- Supported providers: SenseNova, OpenAI, Anthropic, Gemini, xAI, DeepSeek, MiniMax, etc.
Features: Streaming output support, long context, multimodal conversations
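
The streaming behavior described above can be sketched with an async generator. The names `StreamingConversationAdapter`, `stream_reply`, `EchoConversationClient`, and `collect_reply` are hypothetical and are not the actual `ConversationAdapter` interface; the echo client only stands in for a real provider.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator


class StreamingConversationAdapter(ABC):
    """Hypothetical base class: yields a reply as incremental text chunks."""

    @abstractmethod
    def stream_reply(self, user_text: str) -> AsyncIterator[str]:
        """Return an async iterator over chunks of the reply."""


class EchoConversationClient(StreamingConversationAdapter):
    """Stand-in provider that streams the user's words back one at a time."""

    async def stream_reply(self, user_text: str) -> AsyncIterator[str]:
        for word in user_text.split():
            await asyncio.sleep(0)  # yield control, as a network read would
            yield word + " "


async def collect_reply(adapter: StreamingConversationAdapter, user_text: str) -> str:
    """Consume the stream; a downstream node could forward each chunk instead."""
    chunks = []
    async for chunk in adapter.stream_reply(user_text):
        chunks.append(chunk)
    return "".join(chunks).strip()
```

Because every client would expose the same streaming method, the caller does not need to know which provider is behind it.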

2. Text-to-Speech Module (TTS)#

Function: Converts text to natural speech, supports multiple voices and emotional expressions
Core Components:
- TextToSpeechAdapter: TTS adapter base class, handles streaming audio generation
- Supported providers: ElevenLabs, Volcano Engine, SenseNova, SenseTime, Softsugar, etc.
Features: Multiple voices, multiple emotions, multi-language support, real-time synthesis

3. Speech Recognition Module (ASR)#

Function: Real-time streaming speech recognition with multi-language support
Core Components:
- ASRAdapter: ASR adapter base class, handles streaming speech recognition
- Supported providers: OpenAI, Volcano Engine, SenseTime, Softsugar, etc.
Features: Multi-language support, streaming recognition

4. Memory Management Module (Memory)#

Function: Multi-level conversation memory, emotional state, relationship state management
Core Components:
- MemoryAdapter: Memory adapter base class
- MemoryManager: Memory manager, handles conversation history and context
- MemoryProcessor: Memory processor, analyzes and manages memory data
Features: Multi-level memory storage, emotional state tracking, relationship state management
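
One way to picture multi-level memory is a small recent-turn window backed by a store of older, condensed turns, with emotional and relationship state tracked alongside. The `LayeredMemory` class below is a hypothetical sketch of that idea, not the actual `MemoryManager` or `MemoryProcessor` implementation; truncation stands in for whatever summarization the real system performs.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LayeredMemory:
    """Hypothetical two-level memory: recent turns plus condensed older turns."""

    short_term_size: int = 4
    short_term: deque = field(default_factory=deque)
    long_term: list = field(default_factory=list)
    emotion: str = "neutral"
    relationship: str = "stranger"

    def add_turn(self, speaker: str, text: str) -> None:
        self.short_term.append((speaker, text))
        # When the recent window overflows, move the oldest turn to long-term
        # storage. Truncating to 40 characters is a placeholder for a summary.
        while len(self.short_term) > self.short_term_size:
            old_speaker, old_text = self.short_term.popleft()
            self.long_term.append(f"{old_speaker}: {old_text[:40]}")

    def build_context(self) -> str:
        """Assemble prompt context from both levels and the tracked state."""
        lines = [f"[emotion={self.emotion} relationship={self.relationship}]"]
        lines += [f"(earlier) {item}" for item in self.long_term]
        lines += [f"{speaker}: {text}" for speaker, text in self.short_term]
        return "\n".join(lines)
```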

5. Emotion Analysis Module (Classification & Reaction)#

Function: Real-time emotion analysis, user intent classification, reaction generation
Core Components:
- ClassificationAdapter: Classification adapter, analyzes user intent
- ReactionAdapter: Reaction adapter, analyzes character emotional changes, relationship changes, and triggered actions
Features: Real-time emotion analysis, intent classification, personalized reaction generation

6. 3D Animation Generation Module#

Function: Speech-to-motion conversion, audio-to-facial expression conversion
Core Components:
- Speech2MotionAdapter: Speech-to-motion adapter
- Audio2FaceAdapter: Audio-to-facial expression adapter
Features: Real-time motion generation, facial expression synchronization, 3D animation output

7. Data Aggregators (Aggregator)#

Function: Coordinates data flow between multiple modules, ensures data synchronization
Core Components:
- ConversationAggregator: Conversation aggregator, coordinates conversation flow
- TTSReactionAggregator: TTS reaction aggregator, synchronizes voice and reactions
- BlendshapesAggregator: Facial expression aggregator
Features: Data flow coordination, real-time synchronization, error handling
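
The synchronization role of an aggregator can be sketched as a buffer that holds partial results until every input for the same segment has arrived. `PairingAggregator` and its sequence-id scheme are assumptions made for illustration; the actual `TTSReactionAggregator` may coordinate its inputs differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PairingAggregator:
    """Hypothetical aggregator: pairs audio chunks with reactions by sequence id."""

    pending_audio: Dict[int, bytes] = field(default_factory=dict)
    pending_reactions: Dict[int, str] = field(default_factory=dict)

    def add_audio(self, seq: int, chunk: bytes) -> Optional[dict]:
        self.pending_audio[seq] = chunk
        return self._emit_if_complete(seq)

    def add_reaction(self, seq: int, reaction: str) -> Optional[dict]:
        self.pending_reactions[seq] = reaction
        return self._emit_if_complete(seq)

    def _emit_if_complete(self, seq: int) -> Optional[dict]:
        # Emit only when both halves for this sequence id have arrived,
        # regardless of which one came first.
        if seq in self.pending_audio and seq in self.pending_reactions:
            return {
                "seq": seq,
                "audio": self.pending_audio.pop(seq),
                "reaction": self.pending_reactions.pop(seq),
            }
        return None
```

Each `add_*` call returns `None` while the pair is incomplete and the combined record once it is, so the caller can forward complete records downstream as soon as they exist.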

8. Core Orchestrator (Proxy)#

Function: Manages DAG workflows, coordinates interactions between all modules
Core Components:
- Proxy: Main orchestrator, manages complex AI interaction workflows
- Supports multiple conversation modes: audio conversation, text conversation, mixed mode
Features: DAG workflow management, module coordination, process control

DAG Workflow Architecture#

The system uses a Directed Acyclic Graph (DAG) architecture to manage complex AI interaction workflows. Each conversation request creates a DAG instance containing multiple processing nodes and dependencies.
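
To make the DAG idea concrete, the sketch below orders a set of processing nodes by their dependencies using Python's standard `graphlib` module. The node names and edges are hypothetical and only loosely mirror the modules described above; they are not the actual graph built by `proxy.py`.

```python
from graphlib import TopologicalSorter

# Hypothetical nodes; each maps to the set of nodes it depends on.
DEPENDENCIES = {
    "asr": set(),
    "conversation": {"asr"},
    "reaction": {"conversation"},
    "tts": {"conversation"},
    "audio2face": {"tts"},
    "speech2motion": {"tts"},
}


def execution_order(dependencies: dict) -> list:
    """Return one valid order in which every node follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def ready_batches(dependencies: dict) -> list:
    """Group nodes into batches whose members could run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

With these assumed edges, `ready_batches(DEPENDENCIES)` groups `reaction` with `tts` and `audio2face` with `speech2motion`, which shows how independent branches of a request could proceed in parallel.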

Diagram Legend:

Solid arrows (→): One-time complete data transmission between nodes in a single generation request
Dashed arrows (⇢): Streaming data transmission between nodes in a single generation request
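
The difference between the two arrow types can be sketched with `asyncio`: a queue models a streaming edge that delivers chunks as they are produced, and a future models a one-time edge that delivers the complete result once. The function names and the `END_OF_STREAM` sentinel are hypothetical and say nothing about how the system implements its edges.

```python
import asyncio

END_OF_STREAM = None  # sentinel marking the end of a streamed edge


async def producer(stream_edge: asyncio.Queue, complete_edge: asyncio.Future) -> None:
    """Upstream node: streams chunks, then delivers the full result once."""
    chunks = ["Hel", "lo ", "world"]
    for chunk in chunks:
        await stream_edge.put(chunk)  # dashed arrow: streaming transmission
    await stream_edge.put(END_OF_STREAM)
    complete_edge.set_result("".join(chunks))  # solid arrow: one-time transmission


async def streaming_consumer(stream_edge: asyncio.Queue) -> list:
    """Downstream node that can start working on the first chunk immediately."""
    received = []
    while (chunk := await stream_edge.get()) is not END_OF_STREAM:
        received.append(chunk)
    return received


async def run_demo() -> tuple:
    stream_edge: asyncio.Queue = asyncio.Queue()
    complete_edge = asyncio.get_running_loop().create_future()
    streamed, _ = await asyncio.gather(
        streaming_consumer(stream_edge),
        producer(stream_edge, complete_edge),
    )
    return streamed, await complete_edge
```

A node fed by a streaming edge can begin before its upstream node finishes, whereas a node fed by a one-time edge has to wait for the complete result.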

Workflow Diagrams:

Complete Audio Conversation Flow (audio_chat_with_text_llm_v4)
Express Audio Conversation Flow (audio_chat_with_audio_llm_v4)
Complete Text Conversation Flow (text_chat_with_text_llm_v4)
Express Text Conversation Flow (text_chat_with_audio_llm_v4)
Direct Generation Flow (direct_generation_v4)

AI Services#

LLM#

Provider	Adapter Class	Default Model
OpenAI	`OpenAIConversationClient`	`gpt-4.1-2025-04-14`
Anthropic	`AnthropicConversationClient`	`claude-sonnet-4-5-20250929`
Google	`GeminiConversationClient`	`gemini-2.5-flash-lite`
DeepSeek	`DeepSeekConversationClient`	`deepseek-chat`
xAI	`XAIConversationClient`	`grok-3`
MiniMax	`MiniMaxConversationClient`	`MiniMax-M2.7`
SenseNova	`SenseChatConversationClient`	`SenseChat-5-1202` (Large Language Model)
SenseNova	`SenseNovaConversationClient`	`sensenova-6.7-flash-lite`
OpenAI	`OpenAIAudioClient`	`gpt-4o-mini-realtime-preview-2024-12-17`

OpenAI-compatible LLM credentials are read from user settings. MiniMax uses minimax_api_key and the default base URL https://api.minimaxi.com/v1; SenseNova uses sensenova_api_key and the default base URL https://token.sensenova.cn/v1.
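
A minimal sketch of how such settings might be resolved is shown below. The `minimax_api_key` and `sensenova_api_key` keys and the default base URLs come from the paragraph above; the `ProviderCredentials` class, the `resolve_credentials` helper, and the `<provider>_base_url` override key are hypothetical.

```python
from dataclasses import dataclass

# Default base URLs as stated in this document.
DEFAULT_BASE_URLS = {
    "minimax": "https://api.minimaxi.com/v1",
    "sensenova": "https://token.sensenova.cn/v1",
}


@dataclass(frozen=True)
class ProviderCredentials:
    api_key: str
    base_url: str


def resolve_credentials(provider: str, user_settings: dict) -> ProviderCredentials:
    """Read the API key from user settings and fall back to the default URL."""
    if provider not in DEFAULT_BASE_URLS:
        raise ValueError(f"unknown provider: {provider!r}")
    key_name = f"{provider}_api_key"
    api_key = user_settings.get(key_name)
    if not api_key:
        raise KeyError(f"user settings are missing '{key_name}'")
    # "<provider>_base_url" is an assumed override key, not a documented setting.
    base_url = user_settings.get(f"{provider}_base_url", DEFAULT_BASE_URLS[provider])
    return ProviderCredentials(api_key=api_key, base_url=base_url)
```

Failing early on a missing key keeps a misconfigured provider from surfacing later as an opaque authentication error in the middle of a conversation.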

ASR#

Provider	Adapter Class
OpenAI	`OpenAIRealtimeASRClient`
Volcano Engine	`HuoshanASRClient`
SenseTime	`SensetimeASRClient`
Softsugar	`SoftSugarASRClient`

TTS#

Provider	Adapter Class
Volcano Engine	`HuoshanTTSClient`
Softsugar	`SoftSugarTTSClient`
SenseNova	`SensenovaTTSClient`
ElevenLabs	`ElevenLabsTTSClient`
SenseTime	`SensetimeTTSClient`

Memory#

Provider	Adapter Class	Default Model
OpenAI	`OpenAIMemoryClient`	`gpt-4.1-mini-2025-04-14`
xAI	`XAIMemoryClient`	`grok-3`
MiniMax	`MiniMaxMemoryClient`	`MiniMax-M2.7`
SenseNova	`SenseNovaMemoryClient`	`sensenova-6.7-flash-lite`

Classification#

Provider	Adapter Class	Default Model
OpenAI	`OpenAIClassificationClient`	`gpt-4.1-mini-2025-04-14`
xAI	`XAIClassificationClient`	`grok-3`
Gemini	`GeminiClassificationClient`	`gemini-2.5-flash-lite`
MiniMax	`MiniMaxClassificationClient`	`MiniMax-M2.7`
SenseNova	`SenseNovaClassificationClient`	`sensenova-6.7-flash-lite`

Reaction#

Provider	Adapter Class	Default Model
OpenAI	`OpenAIReactionClient`	`gpt-4.1-mini-2025-04-14`
xAI	`XAIReactionClient`	`grok-3`
Gemini	`GeminiReactionClient`	`gemini-2.5-flash-lite`
MiniMax	`MiniMaxReactionClient`	`MiniMax-M2.7`
SenseNova	`SenseNovaReactionClient`	`sensenova-6.7-flash-lite`