Overview#
Web App provides an intuitive graphical interface for customizing and interacting with 3D avatars. Depending whether the user has entered a chat session, the web app can be divided into two functional parts: the Web Fronted and Web Client. The web client is developed using Babylon.js and packaged into a Next.js project bootstrapped with create-next-app.
Core Features#
User Authentication: supports login state management. Authentication state is globally managed through Redux with local storage persistence, ensuring session continuity.
Avatar Customization: each avatar is highly configurable β including 3D models, LLMs, prompts, voices and background environments.
Real-time AI Chat: chatting with an embodied AI is as simple as holding the microphone button and speaking. The system integrated LLM, ASR, and TTS for natural voice/text conversations and supports multi-turn dialogue, session persistence, and real-time message streaming.
Chat State Management: a dedicated finite state machine (FSM) governs the entire chat lifecycle, ensuring consistent behavior and reliable state transitions.
Runtime Animation Pipeline: since the system streams character audio, facial expressions, and body motions in real time, a robust runtime animation pipeline is implemented to receive, organize, and replay streamed data, featuring mechanisms such as adaptive motion blending, connection-loss recovery, and network health estimation to ensure responsiveness and interactivity.
Efficient Resource Management: for reusable resources such as idle animations, an asset manager leverages the browserβs IndexedDB to efficiently cache data, avoiding redundant network requests and minimizing loading time.
Advanced 3D Features: supports eye focus and cloth simulation for enhanced realism. The eye focus mechanism constrains eye joint rotations so that the avatar naturally gazes toward the camera. The cloth simulation, powered by Havok Physics, delivers dynamic and physically plausible clothing deformations to garments and accessories during real-time interactions.
Project Structure#
dlp3d.ai
βββ app
β βββ api
β βββ babylon
β βββ components
β βββ constants
β βββ contexts
β βββ data_structures
β βββ features
β βββ hooks # Reusable state components
β βββ layouts
β βββ library/babylonjs # Web client core implementation
β β βββ config # Character and backend service configuration
β β βββ core # Basic type definitions, etc.
β β βββ runtime # Runtime modules
β β β βββ animation # Runtime animation storage data structures, character animation modules
β β β βββ asset # Asset management, motion asset download
β β β βββ audio # Streamed audio player
β β β βββ camera # Camera implementation
β β β βββ character # Character implementation, eye tracking mechanism, etc.
β β β βββ fsm # State definitions, state machine implementation, etc.
β β β βββ gui # GUI implementation
β β β βββ io # Motion file server, Protobuf definitions
β β β βββ light # Lighting settings
β β β βββ physics # Cloth simulation implementation
β β β βββ request # http request to orchestrator/backend API
β β β βββ stream # Streamed data management
β β β βββ runtime.ts # Motion, audio, and facial expression playback & synchronization
β β βββ globalState.ts # Globally stored variables
β β βββ onSceneReady.ts # Setup scene
β β βββ utils # Helper functions and tools, e.g., logger
β βββ request
β βββ store
β βββ styles
β βββ themes
β βββ types
β βββ utils
βββ configs
βββ dockerfiles
βββ docs
βββ public
Core Components#
Stream#
Animation data is uploaded and received in a streamed manner. We designed and implemented a series of data structures to efficiently manipulate the stream data:
orchestrator_v4_pb: Defines the Protobuf communication protocol
StreamChunk: Unpacks raw WebSocket packets with type identifier and indexing info
StreamBlock: Organizes animation data into time-bounded, frame-based blocks for efficient playback.
NetworkStream: Tracks network performance metrics for health analysis
We employ the following components to mange the stream connection:
Orchestrator Client. The web client does not directly communicate with LLM, TTS, or other AI services. Instead, it interacts with the orchestrator. The orchestrator client handles WebSocket communication with the orchestrator service, supporting audio, facial expression, and motion data streaming. It provides complete streaming media processing functionality, including data transition, multi-stream buffering, stream state management and error handling.
Adaptive Buffer Size Estimator. Due to varying network conditions between users and backend services, animation playback speed may exceed the stream data receiving rate. When the stream buffer is exhausted, we need to determine the optimal buffer size required for continued playback. The adaptive buffer size estimator dynamically adjusts buffer sizes based on network stream reception patterns to ensure smooth streaming playback.
Runtime Pipeline#
The Runtime Pipeline is the core system responsible for organizing, and replaying streamed character animations in real-time. The pipeline is composed of two components:
RuntimeAnimationGroup. Stores both skeletal and facial animations for characters in multiple animation buffers (long idle, local, streamed).
Runtime. The runtime updates characterβs audio, motion and face by advancing animation clocks. The built-in motion blending system ensures seamless switch between different animation buffers. Regarding the physics, the runtime synchronizes dynamic bones with skeletal animations.
FSM (Finite State Machine)#
The brain of the 3DAC system, governing the entire chat lifecycle and ensuring consistent behavior across all system states. It defines a series of states and manages state transitions throughout the system. The FSM uses a condition-based event queue to process external events such as user interactions, network events. Itβs core responsibilities include: character loading, asset download, stream lifecycle management, animation playback coordination with runtime pipeline, error handling and recovery. For more details, please refer to Finite State Machine
Asset Manager#
The asset manager is responsible for requesting assets from the backend services and converting them into runtime-ready data structures, which leverages the browserβs IndexedDB to efficiently cache reusable resources (such as idle animations, audio clips, and character models), avoiding redundant network requests and minimizing loading time. Static assets like motion files are downloaded via the motion file client. Assets that may varies with the user settings, e.g., audio is determined by the TTS choice, are generated using the orchestrator client.
Character#
The module for managing 3D character models, animations, and physics. Each character is converted from raw MMD to glTF format for web compatibility. A character has the following core components:
RuntimeAnimationGroup: Stores multiple animation buffers(long idle, local, streamed) that contains skeletal and facial animations
DynamicBoneSolver: Handles cloth and hair physics simulation
Meshes & Skeletons: mesh data and bone hierarchy
Relationship & Emotion: Tracks relationship stage with user and character emotional states
EyeTrackingController: Constrains eye joints to naturally gaze toward the user
GUI#
Manages GUI elements within the chat. The audio controls is used to determine whether the user is talking to the character. When the stream data is exhausted, the GUI system will give visual feedback indicating that the stream is in recovery status. The characterβs emotional change, as well as the relationship with the user, will also be displayed in an interactive manner.
Physics#
This delivers dynamic and physically plausible deformations to garments, hair, and accessories during real-time interactions. For dynamics, 6DOF constraints are added to the connected bones. For collisions, dynamic bones and rigid bodies are presented as SDF shapes(e.g., box, sphere, capsule). Hence, we can efficiently handle collisions without notable computational overhead. All constraints are stored as physics metadata and will be downloaded with the character model together.