Overview#
Orchestrator is a real-time conversational AI system for building personalized multimodal interaction workflows. It covers speech recognition (ASR), text conversation (LLM), text-to-speech (TTS), emotion analysis (Classification & Reaction), memory management (Memory), and 3D animation generation (Audio2Face & Speech2Motion). Its modular design supports multiple AI service providers and provides streaming processing and end-to-end conversation management.
Main application scenarios: personalized role-playing, customized virtual companions, education and training, intelligent customer service, office assistants, etc.
Core Features#
Technical Features#
Multimodal Interaction: Voice interaction, text conversation, 3D animation generation
Real-time Streaming Processing: Data is processed as a stream between modules for low-latency responses
Multi-AI Service Provider Support: Integration with mainstream AI services including SenseNova, OpenAI, Anthropic, Gemini, xAI, DeepSeek, MiniMax, ElevenLabs, Volcano Engine, etc.
Intelligent Memory Management: Multi-level conversation memory, relationship status, and emotional state management
Emotional Intelligence Analysis: Real-time analysis of character emotional changes, relationship changes, and triggered actions
Highly Scalable Architecture: Modular design, easy to add new AI services and custom features
Customization Capabilities#
Character Customization: Custom character personalities, voices, emotions, and actions
Interaction Customization: Flexible configuration of conversation modes, reaction mechanisms, and memory management
Service Combination: Support for combining multiple AI service providers, flexible selection based on scenario requirements
System Architecture#
Project Structure#
orchestrator/
├── proxy.py # Core orchestrator, manages DAG workflows
├── service/ # Web service layer
│ ├── server.py # FastAPI server, provides WebSocket interface
│ ├── requests.py # Request data models
│ └── responses.py # Response data models
├── conversation/ # Conversation management module
│ ├── conversation_adapter.py # Text conversation adapter base class
│ ├── audio_conversation_adapter.py # Audio conversation adapter base class
│ ├── openai_conversation_client.py # OpenAI text conversation client
│ ├── openai_audio_client.py # OpenAI audio conversation client
│ ├── anthropic_conversation_client.py # Anthropic conversation client
│ ├── gemini_conversation_client.py # Gemini conversation client
│ ├── xai_conversation_client.py # xAI conversation client
│ ├── deepseek_conversation_client.py # DeepSeek conversation client
│ ├── sensechat_conversation_client.py # SenseChat conversation client
│ ├── sensenova_conversation_client.py # SenseNova conversation client
│ └── minimax_conversation_client.py # MiniMax conversation client
├── generation/ # Generation management module
│ ├── speech_recognition/ # Speech Recognition (ASR)
│ │ ├── asr_adapter.py # ASR adapter base class
│ │ ├── huoshan_asr_client.py # Volcano Engine ASR
│ │ ├── openai_realtime_asr_client.py # OpenAI real-time ASR
│ │ ├── sensetime_asr_client.py # SenseTime ASR
│ │ └── softsugar_asr_client.py # Softsugar ASR
│ ├── text2speech/ # Text-to-Speech (TTS)
│ │ ├── tts_adapter.py # TTS adapter base class
│ │ ├── chatterbox_tts_client.py # Chatterbox TTS
│ │ ├── elevenlabs_tts_client.py # ElevenLabs TTS
│ │ ├── huoshan_tts_client.py # Volcano Engine TTS
│ │ ├── sensenova_tts_client.py # SenseNova TTS
│ │ ├── sensetime_tts_client.py # SenseTime TTS
│ │ └── softsugar_tts_client.py # Softsugar TTS
│ ├── speech2motion/ # Speech-to-Motion
│ │ ├── speech2motion_adapter.py # S2M adapter base class
│ │ └── speech2motion_streaming_client.py # S2M streaming client
│ └── audio2face/ # Audio-to-Face
│ ├── audio2face_adapter.py # A2F adapter base class
│ └── audio2face_streaming_client.py # A2F streaming client
├── memory/ # Memory management module
│ ├── memory_adapter.py # Memory adapter base class
│ ├── memory_manager.py # Memory manager
│ ├── memory_processor.py # Memory processor
│ ├── task_manager.py # Task manager
│ ├── openai_memory_client.py # OpenAI memory client
│ ├── xai_memory_client.py # xAI memory client
│ ├── sensenova_memory_client.py # SenseNova memory client
│ └── minimax_memory_client.py # MiniMax memory client
├── classification/ # Classification module
│ ├── classification_adapter.py # Classification adapter base class
│ ├── openai_classification_client.py # OpenAI classification client
│ ├── gemini_classification_client.py # Gemini classification client
│ ├── sensenova_classification_client.py # SenseNova classification client
│ ├── minimax_classification_client.py # MiniMax classification client
│ └── xai_classification_client.py # xAI classification client
├── reaction/ # Reaction module
│ ├── reaction_adapter.py # Reaction adapter base class
│ ├── openai_reaction_client.py # OpenAI reaction client
│ ├── gemini_reaction_client.py # Gemini reaction client
│ ├── sensenova_reaction_client.py # SenseNova reaction client
│ ├── minimax_reaction_client.py # MiniMax reaction client
│ └── xai_reaction_client.py # xAI reaction client
├── aggregator/ # Data aggregators
│ ├── conversation_aggregator.py # Conversation aggregator
│ ├── tts_reaction_aggregator.py # TTS reaction aggregator
│ ├── blendshapes_aggregator.py # Facial expression aggregator
│ └── callback_aggregator.py # Callback aggregator
├── io/ # Data storage interfaces
│ ├── config/ # Configuration storage
│ │ ├── database_config_client.py # Database configuration client
│ │ ├── dynamodb_config_client.py # DynamoDB configuration client
│ │ └── mongodb_config_client.py # MongoDB configuration client
│ └── memory/ # Memory storage
│ ├── database_memory_client.py # Database memory client
│ ├── dynamodb_memory_client.py # DynamoDB memory client
│ └── mongodb_memory_client.py # MongoDB memory client
├── data_structures/ # Data structure definitions
└── utils/ # Utility modules
Core Components#
1. Conversation Management Module (Conversation)#
Function: Handles text and audio conversations, supports multiple large language models
Core Components:
ConversationAdapter: Text conversation adapter base class, handles streaming text conversations
AudioConversationAdapter: Audio conversation adapter base class, handles real-time voice interactions
Supported providers: SenseNova, OpenAI, Anthropic, Gemini, xAI, DeepSeek, MiniMax, etc.
Features: Streaming output support, long context, multimodal conversations
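The adapter pattern described above can be illustrated with a minimal sketch. The class and method names here (`ConversationAdapterSketch`, `stream_reply`, `EchoConversationClient`) are hypothetical and are not taken from the Orchestrator codebase; the sketch only shows how an abstract base class with a streaming method lets provider clients be swapped without changing the caller.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator


class ConversationAdapterSketch(ABC):
    """Hypothetical stand-in for a text conversation adapter base class."""

    @abstractmethod
    def stream_reply(self, user_text: str) -> AsyncIterator[str]:
        """Yield the reply in chunks as they become available."""


class EchoConversationClient(ConversationAdapterSketch):
    """Toy provider that streams the user's words back one at a time."""

    async def stream_reply(self, user_text: str) -> AsyncIterator[str]:
        for word in user_text.split():
            await asyncio.sleep(0)  # yield control, as a network read would
            yield word


async def collect_reply(adapter: ConversationAdapterSketch, user_text: str) -> list[str]:
    """Consume the stream; a real pipeline would forward each chunk downstream."""
    return [chunk async for chunk in adapter.stream_reply(user_text)]
```

A caller depends only on the base class, so replacing `EchoConversationClient` with another subclass requires no change to `collect_reply`.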
2. Text-to-Speech Module (TTS)#
Function: Converts text to natural speech, supports multiple voices and emotional expressions
Core Components:
TextToSpeechAdapter: TTS adapter base class, handles streaming audio generation
Supported providers: ElevenLabs, Volcano Engine, SenseTime, Softsugar, etc.
Features: Multiple voices, multiple emotions, multi-language support, real-time synthesis
3. Speech Recognition Module (ASR)#
Function: Real-time speech recognition, supports multiple languages and real-time processing
Core Components:
ASRAdapter: ASR adapter base class, handles streaming speech recognition
Supported providers: OpenAI, SenseTime, Softsugar, etc.
Features: Multi-language support, streaming recognition
4. Memory Management Module (Memory)#
Function: Multi-level conversation memory, emotional state, relationship state management
Core Components:
MemoryAdapter: Memory adapter base class
MemoryManager: Memory manager, handles conversation history and context
MemoryProcessor: Memory processor, analyzes and manages memory data
Features: Multi-level memory storage, emotional state tracking, relationship state management
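One way to picture multi-level memory is a bounded short-term buffer that spills older turns into a long-term store, alongside emotion and relationship fields. The sketch below is an assumption for illustration only: `MemorySketch`, its fields, and the truncation used in place of summarization are hypothetical and do not describe how `MemoryManager` or `MemoryProcessor` actually work.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MemorySketch:
    """Hypothetical two-level memory with emotion and relationship state."""

    short_term_limit: int = 4
    short_term: deque = field(default_factory=deque)
    long_term: list = field(default_factory=list)
    emotion: str = "neutral"
    relationship: str = "stranger"

    def add_turn(self, speaker: str, text: str) -> None:
        """Record a turn; move the oldest turns to long-term storage on overflow."""
        self.short_term.append((speaker, text))
        while len(self.short_term) > self.short_term_limit:
            old_speaker, old_text = self.short_term.popleft()
            # A real system would probably summarize here; this sketch truncates.
            self.long_term.append(f"{old_speaker}: {old_text[:40]}")

    def build_context(self) -> str:
        """Assemble a prompt context from state, long-term notes, and recent turns."""
        lines = [f"[emotion={self.emotion}, relationship={self.relationship}]"]
        lines += [f"(earlier) {note}" for note in self.long_term]
        lines += [f"{speaker}: {text}" for speaker, text in self.short_term]
        return "\n".join(lines)
```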
5. Emotion Analysis Module (Classification & Reaction)#
Function: Real-time emotion analysis, user intent classification, reaction generation
Core Components:
ClassificationAdapter: Classification adapter, analyzes user intent
ReactionAdapter: Reaction adapter, analyzes character emotional changes, relationship changes, and triggered actions
Features: Real-time emotion analysis, intent classification, personalized reaction generation
6. 3D Animation Generation Module#
Function: Speech-to-motion conversion, audio-to-facial expression conversion
Core Components:
Speech2MotionAdapter: Speech-to-motion adapter
Audio2FaceAdapter: Audio-to-facial expression adapter
Features: Real-time motion generation, facial expression synchronization, 3D animation output
7. Data Aggregators (Aggregator)#
Function: Coordinates data flow between multiple modules, ensures data synchronization
Core Components:
ConversationAggregator: Conversation aggregator, coordinates conversation flow
TTSReactionAggregator: TTS reaction aggregator, synchronizes voice and reactions
BlendshapesAggregator: Facial expression aggregator
Features: Data flow coordination, real-time synchronization, error handling
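To make the synchronization idea concrete, here is a minimal sketch of an aggregator that pairs two streams by segment index and releases pairs in order. It is an illustrative assumption, not the actual `TTSReactionAggregator`: the class name `PairingAggregator`, the index-based pairing, and the method names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PairingAggregator:
    """Hypothetical aggregator: emits (index, audio, reaction) once both halves arrive."""

    audio: dict = field(default_factory=dict)
    reactions: dict = field(default_factory=dict)
    next_index: int = 0

    def add_audio(self, index: int, chunk: bytes) -> list:
        self.audio[index] = chunk
        return self._drain()

    def add_reaction(self, index: int, reaction: str) -> list:
        self.reactions[index] = reaction
        return self._drain()

    def _drain(self) -> list:
        """Release every consecutive segment whose audio and reaction are both buffered."""
        ready = []
        while self.next_index in self.audio and self.next_index in self.reactions:
            i = self.next_index
            ready.append((i, self.audio.pop(i), self.reactions.pop(i)))
            self.next_index += 1
        return ready
```

Buffering by index means either stream may run ahead of the other without downstream consumers ever seeing segments out of order.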
8. Core Orchestrator (Proxy)#
Function: Manages DAG workflows, coordinates interactions between all modules
Core Components:
Proxy: Main orchestrator, manages complex AI interaction workflows
Supports multiple conversation modes: audio conversation, text conversation, mixed mode
Features: DAG workflow management, module coordination, process control
DAG Workflow Architecture#
The system uses a Directed Acyclic Graph (DAG) architecture to manage complex AI interaction workflows. Each conversation request creates a DAG instance containing multiple processing nodes and dependencies.
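As a rough illustration of per-request DAG execution, the sketch below runs nodes in dependency order using Python's standard `graphlib.TopologicalSorter`. The node names, the handler signature, and the `run_dag` helper are assumptions made for this example; they do not reflect the real node set or scheduling logic in `proxy.py`.

```python
from graphlib import CycleError, TopologicalSorter


def run_dag(dependencies: dict, handlers: dict, request: dict) -> dict:
    """Run each node after its dependencies; return every node's result."""
    results = {}
    # static_order() raises CycleError if the graph is not acyclic.
    for node in TopologicalSorter(dependencies).static_order():
        inputs = {dep: results[dep] for dep in dependencies.get(node, ())}
        results[node] = handlers[node](request, inputs)
    return results


# Hypothetical graph: each key maps to the set of nodes it depends on.
EXAMPLE_DEPENDENCIES = {
    "asr": set(),
    "llm": {"asr"},
    "tts": {"llm"},
    "audio2face": {"tts"},
    "speech2motion": {"tts"},
}

# Toy handlers that only tag their input, standing in for real service calls.
EXAMPLE_HANDLERS = {
    "asr": lambda request, inputs: request["audio"].decode(),
    "llm": lambda request, inputs: f"reply to {inputs['asr']}",
    "tts": lambda request, inputs: f"audio({inputs['llm']})",
    "audio2face": lambda request, inputs: f"face({inputs['tts']})",
    "speech2motion": lambda request, inputs: f"motion({inputs['tts']})",
}
```

Calling `run_dag(EXAMPLE_DEPENDENCIES, EXAMPLE_HANDLERS, {"audio": b"hi"})` executes the five nodes in an order that respects every dependency. This sequential version ignores concurrency and streaming, which a real-time system would need.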
Diagram Legend:
Solid arrows (→): One-time complete data transmission between nodes in a single generation request
Dashed arrows (⇢): Streaming data transmission between nodes in a single generation request
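The difference between the two arrow types can be sketched with `asyncio`: a streaming edge hands chunks to the next node while the producer is still running, whereas a one-time edge passes a single complete value after the producer finishes. The node functions and the queue-with-sentinel design below are assumptions for illustration, not the project's actual transport.

```python
import asyncio

END = object()  # sentinel marking the end of a stream


async def llm_node(prompt: str, stream_out: asyncio.Queue) -> str:
    """Push each chunk downstream as it is produced, then return the full text."""
    chunks = []
    for word in prompt.split():
        await stream_out.put(word)  # dashed arrow: streaming transmission
        chunks.append(word)
    await stream_out.put(END)
    return " ".join(chunks)


async def tts_node(stream_in: asyncio.Queue) -> list:
    """Consume chunks as they arrive, without waiting for the complete reply."""
    spoken = []
    while (chunk := await stream_in.get()) is not END:
        spoken.append(f"<audio:{chunk}>")
    return spoken


async def memory_node(full_text: str) -> dict:
    """Receive one complete value (solid arrow: one-time transmission)."""
    return {"stored": full_text}


async def run_example(prompt: str):
    queue: asyncio.Queue = asyncio.Queue()
    llm_task = asyncio.create_task(llm_node(prompt, queue))
    tts_task = asyncio.create_task(tts_node(queue))
    full_text = await llm_task  # memory waits for the whole reply
    memory = await memory_node(full_text)
    audio = await tts_task
    return audio, memory
```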
Workflow Diagrams:
Complete Audio Conversation Flow (audio_chat_with_text_llm_v4)
Express Audio Conversation Flow (audio_chat_with_audio_llm_v4)
Complete Text Conversation Flow (text_chat_with_text_llm_v4)
Express Text Conversation Flow (text_chat_with_audio_llm_v4)
Direct Generation Flow (direct_generation_v4)
AI Services#
LLM#
| Provider | Adapter Class | Default Model |
|---|---|---|
| OpenAI | | |
| Anthropic | | |
| | | |
| DeepSeek | | |
| xAI | | |
| MiniMax | | |
| SenseNova | | |
| SenseNova | | |
| OpenAI | | |
OpenAI-compatible LLM credentials are read from user settings. MiniMax uses minimax_api_key and the default base URL https://api.minimaxi.com/v1; SenseNova uses sensenova_api_key and the default base URL https://token.sensenova.cn/v1.
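The credential lookup described above can be sketched as follows. The settings keys `minimax_api_key` and `sensenova_api_key` and the two default base URLs come from the text; everything else is an assumption, including the `resolve_credentials` helper, the `ProviderCredentials` type, and the `<provider>_base_url` override key.

```python
from dataclasses import dataclass

# Defaults as stated in the documentation.
DEFAULT_BASE_URLS = {
    "minimax": "https://api.minimaxi.com/v1",
    "sensenova": "https://token.sensenova.cn/v1",
}
API_KEY_FIELDS = {
    "minimax": "minimax_api_key",
    "sensenova": "sensenova_api_key",
}


@dataclass(frozen=True)
class ProviderCredentials:
    api_key: str
    base_url: str


def resolve_credentials(provider: str, user_settings: dict) -> ProviderCredentials:
    """Read the provider's API key from user settings and pick a base URL."""
    key_field = API_KEY_FIELDS[provider]
    api_key = user_settings.get(key_field)
    if not api_key:
        raise KeyError(f"user settings are missing {key_field!r}")
    # Hypothetical override key; the real setting name, if any, may differ.
    base_url = user_settings.get(f"{provider}_base_url", DEFAULT_BASE_URLS[provider])
    return ProviderCredentials(api_key=api_key, base_url=base_url)
```

The resulting key and base URL are the two values an OpenAI-compatible client typically needs to target a non-OpenAI endpoint.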
ASR#
| Provider | Adapter Class |
|---|---|
| OpenAI | |
| Volcano Engine | |
| SenseTime | |
| Softsugar | |
TTS#
| Provider | Adapter Class |
|---|---|
| Volcano Engine | |
| Softsugar | |
| SenseNova | |
| ElevenLabs | |
| SenseTime | |
Memory#
| Provider | Adapter Class | Default Model |
|---|---|---|
| OpenAI | | |
| xAI | | |
| MiniMax | | |
| SenseNova | | |
Classification#
| Provider | Adapter Class | Default Model |
|---|---|---|
| OpenAI | | |
| xAI | | |
| Gemini | | |
| MiniMax | | |
| SenseNova | | |
Reaction#
| Provider | Adapter Class | Default Model |
|---|---|---|
| OpenAI | | |
| xAI | | |
| Gemini | | |
| MiniMax | | |
| SenseNova | | |