1. The Necessity of Streaming

As discussed in our previous article, employing an "end-to-end streaming" approach for digital humans enables near real-time responses to user voice queries, significantly enhancing the user experience.

2. Challenges in Streaming Processing

An "end-to-end streaming" invocation method requires us to segment voice data as input, process it into corresponding video blocks, and then return these processed video blocks to the playback client for segmented playback. This achieves a near real-time user experience.

Crucially, these video blocks must be seamlessly concatenable, meaning the last frame of a preceding video segment should be identical to the first frame of the subsequent video segment.
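Assuming the video blocks are decoded into NumPy frame arrays, this seamlessness condition amounts to a simple frame comparison; the check below is purely illustrative and not part of any particular implementation:

```python
import numpy as np

def is_seamless(prev_block: np.ndarray, next_block: np.ndarray) -> bool:
    """Two blocks concatenate seamlessly when the last frame of the preceding
    block is pixel-identical to the first frame of the following block.
    Each argument is an array of frames with shape (n_frames, H, W, C)."""
    return bool(np.array_equal(prev_block[-1], next_block[0]))
```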

This requirement can, in principle, be met by calling a large video generation model's API and specifying the first frame of each new segment. While the computational power of these large models yields highly accurate lip-syncing for digital humans, generation speed is entirely bound by the model's inference performance. Consequently, the latency often exceeds acceptable thresholds for real-time interaction, making this approach impractical for many applications.

3. Xbit Tech's Innovative Approach

Xbit Tech has developed an innovative method that achieves "near real-time" responsiveness in speech-to-video synthesis scenarios where extremely high lip-sync precision is not a critical requirement. The core idea is to pre-generate a video of the digital human speaking. For each voice synthesis request, a segment of this pre-generated video matching the duration of the synthesized speech is extracted sequentially and combined with the speech to form the result video.

Here's how it works:

Let t1 be the duration of the speech to be synthesized,

And t2 be the duration of the pre-generated video.

  • Case 1: t1 ≤ t2

    In this scenario, we simply extract a t1-duration segment sequentially from the "pre-generated video" and synthesize it with the speech to produce the result video.

  • Case 2: t1 > t2

    To accommodate longer speech durations, the "pre-generated video" is played in reverse and seamlessly appended to the original "pre-generated video," forming a "stitched video." Since the forward copy ends on the same frame the reversed copy begins with, the join is seamless, and the playback duration of the "stitched video" becomes 2 * t2.

    - Subcase 1: t1 ≤ 2 * t2

      A t1-duration segment is extracted sequentially from the "stitched video" and synthesized with the speech to produce the result video.

    - Subcase 2: t1 > 2 * t2

      The "stitched video" is duplicated and seamlessly appended to itself again. Repeating this principle, the "stitched video" can be extended indefinitely, so a "result video" can be synthesized for speech of any duration t1 (see the sketch after this list).

In practical applications, it is also necessary to use an in-memory database or a lightweight database such as SQLite to store, for each request ID, the extraction offset reached so far within the "pre-generated video." This way, consecutive chunks of the same request continue from the correct position, and the server returns correct results when handling concurrent requests.
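The sketch below shows one way to keep that per-request offset in SQLite using Python's standard sqlite3 module; the table schema and function names are illustrative assumptions rather than a prescribed design:

```python
import sqlite3

# Lightweight store for the extraction offset each request has reached in the
# pre-generated video, so the next chunk of the same request continues where
# the previous one ended. Table and column names are illustrative only.
conn = sqlite3.connect("offsets.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS extraction_offsets ("
    " request_id TEXT PRIMARY KEY,"
    " offset_seconds REAL NOT NULL)"
)

def next_start_offset(request_id: str, t1: float) -> float:
    """Return the start offset for this request's new segment and advance the
    stored offset by t1, the duration of the newly synthesized speech.
    A real deployment would wrap this read-modify-write in a transaction."""
    row = conn.execute(
        "SELECT offset_seconds FROM extraction_offsets WHERE request_id = ?",
        (request_id,),
    ).fetchone()
    start = row[0] if row else 0.0
    conn.execute(
        "INSERT OR REPLACE INTO extraction_offsets (request_id, offset_seconds)"
        " VALUES (?, ?)",
        (request_id, start + t1),
    )
    conn.commit()
    return start
```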