WAN 2.2 S2V: Fast Sound-to-Video AI with Workflow & One-Click Windows Installer

Transform any audio and image into cinematic talking videos within minutes—no deep tech skills required! WAN 2.2 S2V is a breakthrough AI video model that turns a portrait photo and an audio clip into natural, lip-synced video, ideal for creators, educators, and digital artists seeking fast, high-quality results on Windows PCs.

 

What is WAN 2.2 S2V?

 

This model takes a static image and matches it to the rhythm of your audio, driving realistic mouth movements, expressions, and gestures with strong synchronization and expressive animation. The installer comes pre-loaded with a streamlined workflow for ComfyUI, letting anyone easily generate up to minute-long videos: just upload an image, your voice or music, and a short text prompt to steer actions and scene details.

 

Preloaded Models within the Installer (Low VRAM)

  • umt5-xxl-encoder-Q5_K_M.gguf (ComfyUI\models\clip)

    https://huggingface.co/city96/umt5-xxl-encoder-gguf/tree/main

  • wan_2.1_vae.safetensors (ComfyUI\models\vae)

    https://huggingface.co/Kijai/WanVideo_comfy/tree/main

  • Wan2.2-S2V-14B-Q2_K.gguf (ComfyUI\models\unet)

    https://huggingface.co/QuantStack/Wan2.2-S2V-14B-GGUF/tree/main

  • wav2vec2_large_english_fp16.safetensors (ComfyUI\models\audio_encoders)

    https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/tree/main/split_files/audio_encoders

  • 2xLexicaRRDBNet_Sharp.pth upscale model (ComfyUI\models\upscale_models)

    https://huggingface.co/Thelocallab/2xLexicaRRDBNet_Sharp/blob/main/2xLexicaRRDBNet_Sharp.pth

  • Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V1.1 (ComfyUI\models\loras)

    https://huggingface.co/lightx2v/Wan2.2-Lightning/tree/main/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V1.1
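
If you ever place models manually instead of using the installer, the folder mapping above can be captured in a short script. A minimal sketch (the `comfy_root` location is an assumption; point it at your own ComfyUI portable folder):

```python
from pathlib import Path

# Subfolder under ComfyUI\models for each preloaded file, per the list above.
MODEL_DIRS = {
    "umt5-xxl-encoder-Q5_K_M.gguf": "clip",
    "wan_2.1_vae.safetensors": "vae",
    "Wan2.2-S2V-14B-Q2_K.gguf": "unet",
    "wav2vec2_large_english_fp16.safetensors": "audio_encoders",
    "2xLexicaRRDBNet_Sharp.pth": "upscale_models",
}

def expected_path(comfy_root, filename):
    """Return the path where ComfyUI expects the given model file."""
    return Path(comfy_root) / "models" / MODEL_DIRS[filename] / filename

def missing_models(comfy_root):
    """List any preloaded models not yet present under comfy_root."""
    return [f for f in MODEL_DIRS if not expected_path(comfy_root, f).exists()]
```

Running `missing_models(...)` against your install is a quick way to confirm nothing failed to download before you launch the workflow.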

 

The standard Wan2.2-S2V-14B diffusion models (FP16 and FP8) are not packaged with the installer. For higher-quality results, download them from Comfy Org’s Hugging Face repository.

Link - Comfy Org Wan2.2 S2V Diffusion Models
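
Hugging Face serves repository files at a predictable `resolve` URL, so the direct-download link for any model listed above can be built programmatically. A minimal sketch (the example repo and filename are taken from the model list above; other files follow the same pattern):

```python
def hf_file_url(repo_id, path, revision="main"):
    """Build the direct-download URL for a file in a Hugging Face repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{path}"

# Example: the Q2_K GGUF diffusion model from the list above.
url = hf_file_url("QuantStack/Wan2.2-S2V-14B-GGUF", "Wan2.2-S2V-14B-Q2_K.gguf")
```

The resulting URL can be fed to a browser or any downloader; save the file into the matching `ComfyUI\models` subfolder shown in the list above.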

 

Speed

Generate crisp 480p talking portrait videos in about 10–15 minutes with an RTX 4050 (6GB VRAM). The workflow automatically adapts to your hardware; faster and higher-res results are possible on better GPUs.

 

System Requirements

  • Nvidia RTX 30XX, 40XX, or 50XX series GPU (FP16 support required; GTX 10XX/20XX not tested)

  • CUDA-compatible GPU with at least 6 GB VRAM (8 GB or more recommended)

  • Windows OS

  • 40 GB free storage recommended

 

What’s Included

  • Portable, pre-configured Windows installer for ComfyUI and Wan2.2-S2V

  • Simple, flat audio-to-video workflow—no manual setup needed

  • Automatic download for all required nodes and models

  • Flexible workflow: upload your audio, character portrait, and describe your scene in a prompt box

 

Usage Notes

  • Works with either GGUF or standard diffusion models.

  • Instantly toggle workflow sections with the Fast Groups Bypasser.

  • Portrait-style images (single character) recommended for best results; just type your video description and let the model go to work.

Price: $4.00