WAN 2.2 S2V: Fast Sound-to-Video AI with Workflow & One-Click Windows Installer
Transform any audio clip and image into a cinematic talking video in minutes, no deep technical skills required. WAN 2.2 S2V is a breakthrough AI video model that turns a portrait photo and an audio clip into natural, lip-synced video, ideal for creators, educators, and digital artists seeking fast, high-quality results on Windows PCs.
What is WAN 2.2 S2V?
This model animates a static image to the rhythm of your audio, driving realistic, well-synchronized mouth movements, expressions, and gestures. The installer comes pre-loaded with a streamlined ComfyUI workflow, letting anyone generate videos up to a minute long: just upload an image and your voice or music, then add a short text prompt to steer actions and scene details.
Preloaded Models within the Installer (Low VRAM)
umt5-xxl-encoder-Q5_K_M.gguf (ComfyUI\models\clip)
https://huggingface.co/city96/umt5-xxl-encoder-gguf/tree/main
wan_2.1_vae.safetensors (ComfyUI\models\vae)
https://huggingface.co/Kijai/WanVideo_comfy/tree/main
Wan2.2-S2V-14B-Q2_K.gguf (ComfyUI\models\unet)
https://huggingface.co/QuantStack/Wan2.2-S2V-14B-GGUF/tree/main
wav2vec2_large_english_fp16.safetensors (ComfyUI\models\audio_encoders)
https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/tree/main/split_files/audio_encoders
2xLexicaRRDBNet_Sharp.pth upscale model (ComfyUI\models\upscale_models)
https://huggingface.co/Thelocallab/2xLexicaRRDBNet_Sharp/blob/main/2xLexicaRRDBNet_Sharp.pth
Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V1.1 (ComfyUI\models\loras)
https://huggingface.co/lightx2v/Wan2.2-Lightning/tree/main/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V1.1
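Each file above must land in a specific ComfyUI subfolder, or the workflow's loader nodes will not find it. One quick way to verify an install is a path check like the sketch below. This is a hypothetical helper, not part of the installer; the folder and file names are taken from the list above, and the Lightning LoRA is omitted because its exact filename inside that repository folder can vary.

```python
from pathlib import Path

# Expected locations of the preloaded low-VRAM models, relative to the
# ComfyUI root folder (mirrors the list above).
EXPECTED_MODELS = {
    "models/clip": "umt5-xxl-encoder-Q5_K_M.gguf",
    "models/vae": "wan_2.1_vae.safetensors",
    "models/unet": "Wan2.2-S2V-14B-Q2_K.gguf",
    "models/audio_encoders": "wav2vec2_large_english_fp16.safetensors",
    "models/upscale_models": "2xLexicaRRDBNet_Sharp.pth",
}

def missing_models(comfy_root: str) -> list[str]:
    """Return the relative paths of expected model files not found under comfy_root."""
    root = Path(comfy_root)
    return [
        f"{folder}/{name}"
        for folder, name in EXPECTED_MODELS.items()
        if not (root / folder / name).is_file()
    ]
```

Running `missing_models(r"C:\path\to\ComfyUI")` on a fresh install should return an empty list; any entries it returns are files to re-download into the indicated folder.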
The standard Wan2.2-S2V-14B diffusion models (FP16 and FP8) are not packaged with the installer. Download them from Comfy Org’s Hugging Face repository for best results!
Link - Comfy Org Wan2.2 S2V Diffusion Models
Speed
Generate crisp 480p talking portrait videos in about 10–15 minutes with an RTX 4050 (6GB VRAM). The workflow automatically adapts to your hardware; faster and higher-res results are possible on better GPUs.
System Requirements
NVIDIA RTX 30XX, 40XX, or 50XX series GPU (FP16 support required; GTX 10XX/20XX cards are untested)
CUDA-compatible GPU with 8 GB VRAM recommended (the low-VRAM GGUF workflow has been run on a 6 GB RTX 4050)
Windows OS
40 GB free storage recommended
What’s Included
Portable, pre-configured Windows installer for ComfyUI and Wan2.2-S2V
Simple, flat audio-to-video workflow—no manual setup needed
Automatic download for all required nodes and models
Flexible workflow: upload your audio, character portrait, and describe your scene in a prompt box
Usage Notes
Works with either GGUF or standard diffusion models.
Instantly toggle workflow sections with the Fast Groups Bypasser.
Portrait-style images (single character) recommended for best results; just type your video description and let the model go to work.
