Abstract
Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works heavily rely on intermediate representations — including pose skeletons to represent motion or masked background to represent environment — which inevitably leads to information loss. Skeleton maps suffer from inherent ambiguity under complex scenarios; character masks limit body-shape flexibility; and depth-ambiguous overlapping skeletons cause misinterpretation in multi-character interactions.
To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving videos latents to the sequence, the model obtains all required visual information from the input. To overcome the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and curate a pipeline to synthesize MotionPair-60K — a heterogeneous dataset of 60K motion pairs spanning animation, replacement, and multi-character tasks. We introduce in-context mask conditioning and mode-specific RoPE as unified soft guidance. To mitigate synthetic-data bias in detailed regions (e.g. fingers), we propose Bias-Aware DPO for post-training refinement. Extensive experiments demonstrate that SCAIL-2 substantially outperforms existing state-of-the-art approaches across all tasks, while unlocking emerging zero-shot capabilities such as animal-driven animation and mesh-based control.
Paper: (coming soon)
Code: https://github.com/zai-org/SCAIL-2
Model: https://huggingface.co/zai-org/SCAIL-2
Repackaged Models for ComfyUI: https://huggingface.co/Comfy-Org/SCAIL-2/tree/main/diffusion_models
Project Page: https://teal024.github.io/SCAIL-2/
