
A New System for Temporally Consistent Stable Diffusion Video Characters

A new initiative from the Alibaba Group offers one of the best methods I’ve seen for generating full-body human avatars from a Stable Diffusion-based foundation model.

Titled MIMO (MIMicking with Object Interactions), the system uses a range of popular technologies and modules, including CGI-based human models and AnimateDiff, to enable temporally consistent character replacement in videos – or, alternatively, to control a character with a user-defined skeletal pose.

Here we see characters interpolated from a single image source and driven by a predefined motion:

[Click video below to play]

From single image sources, three different characters are driven through a 3D pose sequence (far left) using the MIMO system. See the project website and associated YouTube video (embedded at the end of this article) for more examples and superior resolution. Source: https://menyifang.github.io/projects/MIMO/index.html

Generated characters, which can also be sourced from video frames and obtained in various other ways, can be integrated into real-world footage.

MIMO generates three separate encodings, one each for character, scene and occlusion (i.e., matting, when an object or person passes in front of the depicted character). These encodings are integrated at inference time.

[Click video below to play]

MIMO can replace original characters with photorealistic or stylized characters that follow the motion of the target video. See the project website and associated YouTube video (embedded at the end of this article) for more examples and superior resolution.

The system was trained on the Stable Diffusion V1.5 model, using a custom dataset curated by the researchers and composed of both real and simulated videos.

The big problem with diffusion-based video is temporal stability, where the content of the video flickers or ‘evolves’ in ways that are undesirable for consistent character representation.

MIMO instead effectively uses a single image as a map for consistent guidance, which can be orchestrated and limited by the interstitial SMPL CGI model.

Because the source reference is consistent and the base model on which the system is trained has been extended with adequately representative motion samples, the system’s capabilities for temporally consistent output are well above the common standard for diffusion-based avatars.

[Click video below to play]

Further examples of pose-driven MIMO characters. See the project website and associated YouTube video (embedded at the end of this article) for more examples and superior resolution.

It is becoming increasingly common for single images to be used as a source for effective neural representations, either on their own or combined with text prompts in a multimodal manner. A popular example is LivePortrait, a face transfer system that can generate highly plausible deepfaked faces from a single facial image.


The researchers believe that the principles used in the MIMO system can be extended to other and new types of generative systems and frameworks.

The new paper is titled MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling, and comes from four researchers at Alibaba Group’s Institute for Intelligent Computing. The work has a video-laden project page and an accompanying YouTube video, which is also embedded at the bottom of this article.

Method

MIMO achieves an automatic and unsupervised separation of the three spatial components mentioned above, in an end-to-end architecture (i.e., all sub-processes are integrated into the system, and the user need only provide the input material).

The conceptual scheme for MIMO. Source: https://arxiv.org/pdf/2409.16160


Objects in source videos are translated from 2D to 3D, initially using the monocular depth estimator Depth Anything. The human element in each frame is extracted using methods adapted from the Tune-A-Video project.
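As a rough illustration of this first step, the sketch below runs a Depth Anything-family monocular depth estimator over a single extracted video frame via the Hugging Face transformers pipeline. The checkpoint name and file paths are assumptions for illustration only, not necessarily what MIMO itself uses.

```python
# Minimal sketch: monocular depth estimation on one video frame.
# The 'depth-estimation' pipeline and the checkpoint name below are
# assumptions; MIMO's own depth setup may differ.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    task="depth-estimation",
    model="LiheYoung/depth-anything-small-hf",  # assumed checkpoint name
)

frame = Image.open("frame_0001.png")   # one extracted video frame (placeholder path)
result = depth_estimator(frame)

depth_map = result["depth"]            # PIL image of relative depth
depth_map.save("frame_0001_depth.png")
```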

These elements are then translated into video-based volumetric facets via Facebook Research’s Segment Anything 2 architecture.
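For the segmentation stage, a sketch along the following lines shows how a per-object mask can be propagated through a video with the Segment Anything 2 video predictor. The API calls follow the public facebookresearch/sam2 repository, and the config path, checkpoint path, frame directory and click coordinates are placeholders; treat this as illustrative rather than as MIMO’s actual integration.

```python
# Minimal sketch: propagating a person mask through a video with SAM 2.
# Config/checkpoint paths and the click prompt are placeholders; the API
# follows the public facebookresearch/sam2 repository, not MIMO's code.
import torch
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",   # placeholder config
    "checkpoints/sam2.1_hiera_large.pt",    # placeholder checkpoint
)

with torch.inference_mode():
    state = predictor.init_state(video_path="frames_dir/")  # directory of JPEG frames

    # One positive click on the person in the first frame
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[480, 320]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt to obtain a mask for every frame
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```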

The scene layer itself is obtained by removing objects detected in the other two layers, essentially automatically creating a rotoscope-like mask.

A set of latent motion codes is extracted for the human element and anchored to a standard CGI-based SMPL human model, whose movements provide the context for the depicted human content.

A 2D feature map for the human content is obtained with a differentiable rasterizer derived from a 2020 NVIDIA initiative. By combining the 3D data obtained from SMPL with the 2D data obtained via the NVIDIA method, the latent codes representing the ‘neural person’ gain a solid correspondence with their final context.
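The differentiable rasterization described here corresponds, in spirit, to what NVIDIA’s nvdiffrast library provides. The sketch below shows how posed mesh vertices carrying per-vertex features can be rasterized into a 2D feature map; the clip-space positions, feature dimension and resolution are illustrative placeholders, and this is not MIMO’s actual rendering code.

```python
# Minimal sketch: rasterizing per-vertex features of a posed mesh into a 2D
# feature map with nvdiffrast. Inputs are random placeholders, not real SMPL
# geometry or MIMO's latent codes.
import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeCudaContext()

# Assumed inputs: clip-space vertex positions [1, V, 4], int32 triangle
# indices [F, 3], and per-vertex latent features [1, V, C].
pos_clip = torch.rand(1, 6890, 4, device="cuda")        # SMPL has 6,890 vertices
faces = torch.randint(0, 6890, (13776, 3), device="cuda", dtype=torch.int32)
vertex_feats = torch.rand(1, 6890, 16, device="cuda")   # C = 16 feature channels

# Rasterize the mesh, then interpolate vertex features over covered pixels
rast_out, _ = dr.rasterize(glctx, pos_clip, faces, resolution=[512, 512])
feature_map, _ = dr.interpolate(vertex_feats, rast_out, faces)

print(feature_map.shape)  # [1, 512, 512, 16] 2D feature map for the human layer
```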

At this point, it is necessary to establish a reference commonly needed in architectures that use SMPL – a canonical pose. This is broadly similar to Da Vinci’s ‘Vitruvian Man’, in that it represents a zero-pose template that can accept content and then be deformed, bringing the (effectively) mapped content with it.


These deformations, or ‘deviations from the norm’, represent human movement, while the SMPL model preserves the latent codes that constitute the extracted human identity, and thus correctly represents the resulting avatar in terms of pose and texture.
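To make the canonical-pose idea concrete, the sketch below uses the smplx library to build an SMPL body in its zero (rest) pose and then in an arbitrary deformed pose. The model path is a placeholder, and the random pose stands in for real motion data; the A-pose referenced in the paper would simply be one particular fixed choice of joint rotations.

```python
# Minimal sketch: an SMPL body in its canonical (zero) pose versus a posed
# version. The model path is a placeholder, and the random pose below is
# illustrative rather than a real captured motion.
import torch
import smplx

model = smplx.create(
    model_path="models/",   # placeholder path to the SMPL model files
    model_type="smpl",
    gender="neutral",
)

betas = torch.zeros(1, 10)                  # body shape coefficients (identity)

# Canonical template: zero global orientation and zero joint rotations
canonical = model(
    betas=betas,
    global_orient=torch.zeros(1, 3),
    body_pose=torch.zeros(1, 69),           # 23 joints x 3 axis-angle parameters
)

# A deformed (posed) instance of the same identity
posed = model(
    betas=betas,
    global_orient=torch.zeros(1, 3),
    body_pose=0.3 * torch.randn(1, 69),     # placeholder pose
)

print(canonical.vertices.shape, posed.vertices.shape)  # both [1, 6890, 3]
```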

An example of a canonical pose in an SMPL figure. Source: https://www.researchgate.net/figure/Layout-of-23-joints-in-the-SMPL-models_fig2_351179264


As for the issue of entanglement (the extent to which trained data can prove inflexible when pushed beyond its trained boundaries and associations), the authors* state:

‘To completely disentangle the appearance from posed video frames, an ideal solution is to learn the dynamic human representation from the monocular video and transform it from the posed space to the canonical space.

‘Considering efficiency, we employ a simplified method that directly transforms the posed human image to the canonical result in standard A-pose using a pretrained human repose model. The synthesized canonical appearance image is fed to ID encoders to obtain the identity.

‘This simple design enables full disentanglement of identity and motion attributes. Following [Animate Anyone], the ID encoders include a CLIP image encoder and a reference-net architecture to embed the global and local features, [respectively].’

For the scene and occlusion aspects, a shared and fixed Variational Autoencoder (VAE) – in this case derived from a 2013 publication – is used to embed the scene and occlusion elements into the latent space. Incongruities are handled by an inpainting method from the 2023 ProPainter project.
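As a minimal sketch of the embedding step, the code below encodes an image layer into the latent space with the Stable Diffusion V1.5 VAE via the diffusers library. The checkpoint name is the commonly used public one (it may need to point at a current mirror), the 0.18215 scaling factor is the standard SD default, and the scene/occlusion composition logic itself is not shown.

```python
# Minimal sketch: embedding a scene (or occlusion) layer into latent space
# with a frozen Stable Diffusion V1.5 VAE. Checkpoint name, file path and the
# standard scaling factor are public defaults, not MIMO-specific values.
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision import transforms

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"   # assumed checkpoint
).eval()

to_tensor = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),   # map pixels to [-1, 1]
])

scene_layer = load_image("scene_layer.png")   # e.g. the inpainted background
x = to_tensor(scene_layer).unsqueeze(0)       # [1, 3, 512, 512]

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor

print(latents.shape)   # [1, 4, 64, 64] latent code for the scene layer
```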

Once composed and retouched in this way, both the background and any occluding objects in the video provide a matte for the moving human avatar.

These parsed attributes are then fed into a U-Net backbone based on the Stable Diffusion V1.5 architecture. The full scene code is merged with the host system’s own latent noise, while the human components are integrated via self-attention and cross-attention layers.

The denoised result is then output through the VAE decoder.
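Conceptually, the merging described above amounts to concatenating the scene/occlusion latents with the noisy latent along the channel dimension, while identity features enter through cross-attention. The toy PyTorch module below sketches that wiring under those assumptions; all dimensions are illustrative, and this is not MIMO’s actual U-Net.

```python
# Toy sketch of the conditioning wiring: scene/occlusion latents are
# concatenated with the noisy latent, identity tokens enter via
# cross-attention. Dimensions are illustrative; this is not MIMO's U-Net.
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    def __init__(self, latent_ch=4, cond_ch=8, dim=320, id_dim=768):
        super().__init__()
        self.proj_in = nn.Conv2d(latent_ch + cond_ch, dim, kernel_size=3, padding=1)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True,
                                                kdim=id_dim, vdim=id_dim)
        self.proj_out = nn.Conv2d(dim, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latent, scene_occ_latent, id_tokens):
        # Merge scene/occlusion codes with the noisy latent along channels
        x = self.proj_in(torch.cat([noisy_latent, scene_occ_latent], dim=1))
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                       # [B, H*W, C]
        tokens, _ = self.self_attn(tokens, tokens, tokens)          # spatial self-attention
        tokens, _ = self.cross_attn(tokens, id_tokens, id_tokens)   # identity injection
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.proj_out(x)

block = ConditionedBlock()
out = block(torch.randn(1, 4, 64, 64),   # noisy latent
            torch.randn(1, 8, 64, 64),   # scene + occlusion latents
            torch.randn(1, 77, 768))     # identity tokens (e.g. CLIP image features)
print(out.shape)                          # [1, 4, 64, 64]
```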

Data and testing

For training, the researchers created a human video dataset titled HUD-7K, consisting of 5,000 real character videos and 2,000 synthetic animations created by the En3D system. The real videos did not require annotation, due to the non-semantic nature of the figure-extraction procedures in MIMO’s architecture. The synthetic data was fully annotated.

The model was trained on eight NVIDIA A100 GPUs (although the paper does not specify whether these were the 40GB or 80GB VRAM models), for 50 iterations, using 24 video frames and a batch size of four, until convergence.


The motion module for the system was trained on the weights of AnimateDiff. During the training process, the weights of the VAE encoder/decoder and the CLIP image encoder were kept frozen (as opposed to full fine-tuning, which would have a much broader effect on a foundation model).
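In practice, this kind of selective freezing is just a matter of switching off gradients on the VAE and the CLIP image encoder while leaving the motion module trainable, as in the sketch below. The three component objects are stand-in placeholders for whatever modules hold those weights in a given training script, not MIMO’s actual classes.

```python
# Minimal sketch: freeze the VAE and CLIP image encoder, train only the
# motion module. The three modules below are placeholder stand-ins for the
# real components in a training script.
import torch
import torch.nn as nn

vae = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1))   # stands in for the VAE
image_encoder = nn.Linear(768, 768)                   # stands in for the CLIP image encoder
motion_module = nn.Linear(320, 320)                   # stands in for the motion module

def freeze(module: nn.Module) -> None:
    """Disable gradients and switch the module to eval mode."""
    module.eval()
    for param in module.parameters():
        param.requires_grad_(False)

freeze(vae)
freeze(image_encoder)

# Only the remaining (motion) parameters are passed to the optimizer
trainable = [p for p in motion_module.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(sum(p.numel() for p in trainable), "trainable parameters")
```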

Although MIMO was not tested against analogous systems, the researchers trialed it on difficult, out-of-distribution motion sequences sourced from AMASS and Mixamo. These movements include climbing, playing and dancing.

They also tested the system on in-the-wild human videos. In both cases, the paper reports ‘high robustness’ for these unseen 3D motions, from different viewpoints.

Although the paper provides several static image results demonstrating the effectiveness of the system, MIMO’s actual performance is best assessed via the extensive video results on the project page and in the YouTube video below (from which the videos at the start of this article are derived).

The authors conclude:

‘Experimental results [demonstrate] that our method allows not only flexible character, motion, and scene control, but also advanced scalability for arbitrary characters, generality for novel 3D motion, and applicability to interactive scenes.

‘We also [believe] that our solution, considering its inherent 3D nature and its automatic encoding of 2D video into hierarchical spatial components, could inspire future research in 3D-aware video synthesis.

‘Moreover, our framework is not only well suited to generating character videos, but can also potentially be adapted to other controllable video synthesis tasks.’

Conclusion

It is refreshing to see an avatar system based on Stable Diffusion proving capable of such temporal stability – not least because Gaussian avatar approaches appear to be gaining the high ground in this particular research sector.

The stylized avatars shown in the results are effective, and while the level of photorealism that MIMO can currently produce is not equal to what Gaussian Splatting is capable of, the manifold benefits of creating temporally consistent people in a semantically-based Latent Diffusion Model (LDM) are significant.

* My conversion of the authors’ inline citations into hyperlinks and, where necessary, into external explanatory hyperlinks.

First published on Wednesday, September 25, 2024

