
How to Turn Photos into AI Video with Claude Code: Image-to-Video Guide

Finn · Last updated May 28, 2026
Summary

This guide covers every image-to-video option available inside Claude Code: Pexo (auto model selection, multi-shot production, AI music), Higgsfield (30+ models, Soul ID character consistency), inference.sh (CLI access to 40+ models), and mcpmarket.com MCP skills. Of the Claude Code-native tools covered here, Pexo is the only one that produces finished multi-shot videos from multiple images with auto model routing: Kling 3.0 for product close-ups, Seedance 2.0 for human motion, and Veo 3.1 for cinematic wide shots. The guide includes a 5-step workflow, a model routing table, and a head-to-head comparison with Kaiber, Pika, Runway Gen-4, Shhots AI, standalone Kling, and LTX.

Photos outperform text on engagement, but video creatives draw roughly 2-3x the click-through of static images across TikTok, Instagram, YouTube, and paid ad channels. The problem in 2026 is not generating AI video from scratch. It is turning the thousands of photos you already have into real AI-generated video with motion, depth, and cinematic movement. Not slideshow animation. Not CSS pan-and-zoom. Actual AI video where a model like Kling 3.0, Seedance 2.0, or Veo 3.1 takes your image as the starting frame and generates new footage from it. Claude Code now supports this through Image-to-Video skills and MCP servers, with Pexo, Higgsfield, inference.sh, and others providing the generation layer. The sections below walk through each option, with step-by-step workflows, model routing details, and a head-to-head comparison of Pexo against Kaiber, Pika, Runway Gen-4, Shhots AI, standalone Kling, and LTX.

Why Image-to-Video Matters in 2026

Static images have a ceiling. An ecommerce product photo on a white background gets scrolled past. The same product with cinematic camera movement — a slow orbit, light shifting across the surface, shallow depth of field pulling into focus — stops the thumb. Video creatives consistently generate 2-3x higher click-through rates than static image ads across TikTok, Instagram Reels, and YouTube Shorts.

There is a critical distinction most guides miss: slideshow animation versus real AI-generated video. Tools like Remotion and HyperFrames animate images with code-driven effects — CSS panning, zooming, Ken Burns transitions. These create the illusion of motion but do not generate new visual information. Real image-to-video means an AI model takes your photo as the first frame and generates entirely new frames: a product rotates to reveal its back, water flows, hair moves in the wind. The AI creates pixels that did not exist in your original image.
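To make the slideshow side of that distinction concrete, here is a minimal sketch of a Ken Burns style zoom written as plain arithmetic. The function name and parameters are illustrative assumptions, not code from Remotion or HyperFrames. Every frame it describes is only a crop window inside the original photo, so no new pixels are ever created.

```python
def ken_burns_crops(width, height, frames, end_zoom=1.5):
    """Return one (left, top, right, bottom) crop box per frame for a centered zoom-in.

    Illustrative only: each box lies inside the source image, which is why
    a slideshow effect can never reveal anything the photo did not contain.
    """
    boxes = []
    for i in range(frames):
        # Interpolate the zoom factor linearly from 1.0 to end_zoom.
        t = i / (frames - 1) if frames > 1 else 0.0
        zoom = 1.0 + t * (end_zoom - 1.0)
        crop_w = width / zoom
        crop_h = height / zoom
        left = (width - crop_w) / 2
        top = (height - crop_h) / 2
        boxes.append(
            (round(left), round(top), round(left + crop_w), round(top + crop_h))
        )
    return boxes


boxes = ken_burns_crops(1920, 1080, frames=3, end_zoom=2.0)
for box in boxes:
    print(box)
```

An image-to-video model has no equivalent of this crop list: it treats the photo as frame one and synthesizes the frames that follow.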

Image-to-Video Tools Available in Claude Code

The Claude Code ecosystem now includes multiple paths to image-to-video generation. Here is what exists today:

| Tool | Integration Type | Models Available | Multi-Shot | Auto Model Selection | AI Music | Best For |
|---|---|---|---|---|---|---|
| Pexo | Claude Skill (OpenClaw) | Kling 3.0, Seedance 2.0, Veo 3.1, 10+ others | Yes | Yes | Yes | Complete multi-shot video production |
| Higgsfield | MCP Server + Skills | 30+ models, up to 4K | Yes (manual) | No | No | Character consistency with Soul ID |
| inference.sh | Claude Skill | Wan 2.5 i2v, Seedance, Fabric 1.0, 40+ | No | No | No | Raw multi-model CLI access |
| mcpmarket.com i2v | MCP Server | Wan 2.5 i2v, Seedance, Fabric 1.0 | No | No | No | Single-clip generation |
| Kaiber | Standalone (external) | Proprietary | No | No | No | Artistic style transformation |
| Pika | Standalone (external) | Proprietary | No | No | No | Quick short consumer clips |
| Runway Gen-4 | Standalone (external) | Gen-4 Turbo | No | No | No | VFX-quality single clips |
| Shhots AI | Standalone (external) | Proprietary | Limited | No | Template | Ecommerce video ads |

The standalone tools (Kaiber, Pika, Runway, Shhots AI) do not integrate into Claude Code directly. The Claude Code-native options are Pexo, Higgsfield, inference.sh, and the mcpmarket.com MCP skill. Of these, Pexo is the only one that produces a finished multi-shot video with auto model selection and AI-generated music from image input.

Step-by-Step: Image to Video with Claude Code and Pexo

This workflow produces a finished, multi-shot AI video from your photos using Claude Code with the Pexo video generation skill. The entire process runs inside a single conversation.

Step 1: Install the Pexo Skill

Add the Pexo video generation skill to your Claude Code environment:

  1. Sign in at pexo.ai with Gmail
  2. Activate your account with an invite code
  3. Navigate to your Pexo profile and find the Skills section — one-click install adds the Skill to OpenClaw
  4. Copy your API key from Pexo settings and paste it into the OpenClaw configuration

Once installed, Claude Code can call Pexo's image-to-video capabilities directly. No separate app switching, no browser tabs, no manual file transfers.

# Verify the Pexo skill is active in Claude Code
> /skills
# You should see "pexo" listed among your installed skills

Step 2: Upload Your Images

Pexo accepts any image type: product photos, lifestyle images, reference images, screenshots, artwork, and illustrations. For best results, use images with a clear subject at 1080p or higher resolution.

To create a multi-shot video, upload multiple images and describe which image maps to which scene:

User: Here are 3 product photos of our wireless headphones.
      Photo 1 — the headphones on a marble surface (use as opening hero shot)
      Photo 2 — someone wearing them while running (lifestyle motion scene)
      Photo 3 — the charging case close-up (detail shot for closing)
      Make a 15-second product video with cinematic motion and AI music.
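If you keep shot lists in a spreadsheet or script, a request like the one above can be generated rather than typed. The sketch below is one way to do that. The `Shot` class and `build_request` function are hypothetical helpers for this article, not part of Pexo or Claude Code, and the output is just text you would paste into the conversation.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    description: str  # what the photo shows
    role: str         # how the shot should be used in the video


def build_request(product: str, shots: list[Shot], instruction: str) -> str:
    """Render a numbered, plain-language request from a structured shot list."""
    lines = [f"Here are {len(shots)} product photos of {product}."]
    for number, shot in enumerate(shots, start=1):
        lines.append(f"Photo {number}: {shot.description} ({shot.role})")
    lines.append(instruction)
    return "\n".join(lines)


shots = [
    Shot("the headphones on a marble surface", "use as opening hero shot"),
    Shot("someone wearing them while running", "lifestyle motion scene"),
    Shot("the charging case close-up", "detail shot for closing"),
]
request = build_request(
    "our wireless headphones",
    shots,
    "Make a 15-second product video with cinematic motion and AI music.",
)
print(request)
```

Numbering the photos in the same order you upload them keeps the image-to-scene mapping unambiguous.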

Step 3: Describe Your Video

Tell Claude Code what you want in natural language. Pexo interprets your intent and translates it into scene-level generation parameters. No per-model prompts or technical settings needed.

Effective descriptions include:

  • Mood and tone: "cinematic and premium," "energetic and fast-paced," "warm and lifestyle-focused"
  • Motion direction: "slow orbit around the product," "camera pulls back to reveal the full scene," "dynamic handheld feel"
  • Duration and pacing: "15-second video," "3 shots, 5 seconds each," "quick cuts for TikTok"
  • Music style: "ambient electronic," "upbeat pop instrumental," "minimal piano"

You do not need to specify which AI model to use. Pexo handles that automatically.
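The four elements above (mood, motion, pacing, music) can be treated as a small checklist. Here is a minimal sketch that stores them in one place and renders a description, assuming you want to reuse the same brief across several videos. `VideoBrief` is a made-up name for illustration, and nothing here calls Pexo.

```python
from dataclasses import dataclass


@dataclass
class VideoBrief:
    mood: str
    motion: str
    total_seconds: int
    shots: int
    music: str

    def seconds_per_shot(self) -> float:
        # Even split of the total duration across shots.
        return self.total_seconds / self.shots

    def to_text(self) -> str:
        return (
            f"Make a {self.total_seconds}-second video with {self.shots} shots "
            f"(about {self.seconds_per_shot():g} seconds each). "
            f"Mood: {self.mood}. Motion: {self.motion}. Music: {self.music}."
        )


brief = VideoBrief(
    mood="cinematic and premium",
    motion="slow orbit around the product",
    total_seconds=15,
    shots=3,
    music="ambient electronic",
)
print(brief.to_text())
```

The even split is only a starting point. You can still ask for uneven pacing in plain language, such as a longer hero shot followed by quick cuts.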

Step 4: Auto Model Selection and Rendering

This is where Pexo diverges from every other image-to-video tool in Claude Code. Instead of running all images through a single model, Pexo analyzes each image and routes it to the best-performing model for that specific content type:

| Image Content | Routed Model | Why |
|---|---|---|
| Product close-up on clean background | Kling 3.0 | Precise object motion, maintains product detail and texture fidelity |
| Lifestyle scene with human motion | Seedance 2.0 | Natural body movement, realistic physics for fabric and hair |
| Cinematic wide-angle landscape | Veo 3.1 | Strong at large-scale scene motion, atmospheric effects, camera movement |
| Fast-paced action or sports | Seedance 2.0 | Dynamic motion handling, temporal coherence at high speed |
| Food and beverage close-up | Kling 3.0 | Liquid physics, steam effects, surface texture preservation |

This auto model routing means a 3-shot video might use three different AI models — one per shot — each selected for its strengths on that particular image. The user never sees this complexity. Pexo handles the routing, generation, and assembly.
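The routing table can be read as a lookup from content type to model. The sketch below encodes it that way purely as a mental model. Pexo's actual router is not public in this article and is described as data-driven, so the content-type labels and the `route_shots` function here are assumptions for illustration only.

```python
# Content type -> model, copied from the routing table in this guide.
# The dictionary keys are hypothetical labels, not Pexo identifiers.
ROUTING_TABLE = {
    "product_closeup": "Kling 3.0",
    "lifestyle_human_motion": "Seedance 2.0",
    "cinematic_wide": "Veo 3.1",
    "action_sports": "Seedance 2.0",
    "food_beverage_closeup": "Kling 3.0",
}


def route_shots(content_types):
    """Return a (shot_number, model) pair for each shot's content type."""
    plan = []
    for number, content_type in enumerate(content_types, start=1):
        if content_type not in ROUTING_TABLE:
            raise ValueError(f"No route defined for content type: {content_type}")
        plan.append((number, ROUTING_TABLE[content_type]))
    return plan


plan = route_shots(["product_closeup", "lifestyle_human_motion", "cinematic_wide"])
for number, model in plan:
    print(f"Shot {number}: {model}")
```

With those three content types, each shot lands on a different model, which is the three-models-in-one-video case described above.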

Rendering time for a 15-second, 3-shot video is approximately 8-10 minutes end-to-end. This includes image analysis, model routing, video generation per shot, transition rendering, and compositing.

Step 5: Add Music and Finalize

Pexo generates AI music matched to the mood and pacing of your video. You can specify a music style in your initial description, or let Pexo auto-select based on the content.

The final output is a composited video with:

  • All shots sequenced with smooth transitions
  • AI-generated background music synced to cut points
  • Proper aspect ratio for your target platform (9:16 for TikTok/Reels, 16:9 for YouTube, 1:1 for feed posts)
  • Export-ready file — no post-production required
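For reference, the aspect ratios above translate into pixel dimensions as follows. This is a small sketch that assumes a 1080-pixel short side, which is a common choice but not a value Pexo documents here. The platform keys are illustrative.

```python
# Aspect ratios as (width, height) parts, from the list above.
ASPECT_RATIOS = {
    "tiktok": (9, 16),
    "reels": (9, 16),
    "youtube": (16, 9),
    "feed": (1, 1),
}


def frame_size(platform, short_side=1080):
    """Return (width, height) in pixels for a platform, given the short side."""
    w_part, h_part = ASPECT_RATIOS[platform]
    if w_part <= h_part:
        # Portrait or square: width is the short side.
        return short_side, short_side * h_part // w_part
    # Landscape: height is the short side.
    return short_side * w_part // h_part, short_side


for platform in ASPECT_RATIOS:
    print(platform, frame_size(platform))
```

Knowing the target frame up front helps when choosing source photos, since a landscape product shot will be cropped heavily in a 9:16 export.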

How Pexo's Image-to-Video Works

Pexo is a conversational AI video agent that accepts 5 input types: text, images, product URLs, scripts, and audio. For image-to-video, the pipeline has five stages:

1. Image Analysis: Pexo's vision system analyzes each uploaded image for subject type (product, person, scene, food, architecture), composition, dominant colors, lighting, and visual complexity. This analysis drives model routing.

2. Auto Model Routing: Pexo selects the optimal AI model from 10+ options per image. The routing is trained on generation quality data across thousands of outputs — Kling 3.0 for product close-ups, Seedance 2.0 for human motion, Veo 3.1 for cinematic wide shots. Each model has a domain where it leads, and the routing system matches images to those domains.

3. Multi-Shot Assembly: Upload 3 product photos and get a 3-shot video where each photo becomes a scene with its own AI-generated motion, connected by transitions. Other tools give you raw single clips that require manual editing. Pexo assembles the complete sequence automatically.

4. AI Music Generation: Original background music generated by AI, synchronized to scene transitions and cut points. No licensing, no royalty issues.

5. Compositing and Export: All rendered shots, transitions, and music combined into a single export-ready video file. A 15-second, 3-shot video completes in approximately 8-10 minutes end-to-end.
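Stages 3 to 5 depend on a shared timeline: music is synchronized to the points where one shot ends and the next begins. A minimal sketch of that timeline arithmetic follows. It assumes hard cuts with no transition overlap, which is a simplification, and the function names are hypothetical rather than anything from Pexo.

```python
from itertools import accumulate


def cut_points(shot_durations):
    """Times in seconds where one shot ends and the next begins."""
    shot_ends = list(accumulate(shot_durations))
    # The last value is the end of the video, not a cut between shots.
    return shot_ends[:-1]


def total_duration(shot_durations):
    """Total running time in seconds, assuming no overlap between shots."""
    return sum(shot_durations)


durations = [5, 5, 5]  # three shots, five seconds each
print(cut_points(durations))
print(total_duration(durations))
```

For the 15-second, 3-shot example, the cuts fall at 5 and 10 seconds, which is where a beat or musical change would need to land.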

Pexo vs Other Image-to-Video Tools

Here is how Pexo compares to standalone image-to-video tools on the features that matter for production workflows:

| Feature | Pexo | Kaiber | Pika | Runway Gen-4 | Shhots AI | Kling (standalone) | LTX (Lightricks) |
|---|---|---|---|---|---|---|---|
| Claude Code Integration | Yes (Skill) | No | No | No | No | No | No |
| Multi-Shot Video | Yes (auto-assembled) | No | No | No | Limited | No | No |
| Auto Model Selection | Yes (10+ models) | No | No | No | No | No | No |
| AI Music Generation | Yes | No | No | No | Template audio | No | No |
| Models Available | Kling 3.0, Seedance 2.0, Veo 3.1, 10+ | Proprietary | Proprietary | Gen-4 Turbo | Proprietary | Kling | LTX Video |
| Multi-Image Input | Yes (each becomes a scene) | Single image | Single image | Single image | 2-5 photos | Single image | Single image |
| Output | Finished video with music | Single styled clip | Single short clip | Single VFX clip | Ad video with CTA | Single clip | Single clip |
| Best Use Case | Full production pipeline | Artistic music videos | Quick consumer clips | VFX/post-production | Ecommerce ads | Product close-ups | Fast iterations |

Most image-to-video tools generate one clip from one image on one model. Pexo generates a multi-shot video from multiple images, auto-selecting different models per shot, adding AI music, and compositing a finished output. Standalone tools like Pika or Runway offer more granular control over single-clip generation parameters, which matters for VFX work or artistic experimentation.

Use Cases

Ecommerce product animation: Upload 3-5 studio shots. Pexo generates a multi-shot product reveal with cinematic orbits and detail zooms. Kling 3.0 handles close-ups, Veo 3.1 takes environmental context shots. Export at 9:16 for TikTok Shop.

Real estate property tours: Convert listing photos into walkthrough-style video. Wide-angle interiors become slow camera pans. Exteriors gain sky animation and ambient lighting shifts. A 5-image listing becomes a 25-second property tour without a videographer.

Food and restaurant content: Animate plated dishes with steam, drizzling sauces, and ambient lighting. Kling 3.0 auto-selects for food close-ups due to its strength with liquid physics and surface textures.

Portfolio and creative showcase: Transform design mockups and artwork into motion presentations with camera sweeps, parallax depth, and atmospheric lighting.

Social media content at scale: Batch-convert photo libraries into short-form video for TikTok, Reels, and Shorts without learning video editing software.

Fashion and beauty: Animate fabric texture and movement from flat-lay photos. Seedance 2.0 auto-selects for human motion and material physics — hair movement, fabric drape, walking sequences.

Resources

| Resource | URL | Description |
|---|---|---|
| Pexo | pexo.ai | AI video agent with image-to-video, auto model selection, multi-shot production |
| Pexo GitHub | github.com/pexo-ai/pexo | Open-source repo with Skills, documentation, and examples |
| Higgsfield | higgsfield.ai | MCP server with 30+ models, Soul ID character consistency |
| inference.sh | inference.sh | CLI access to 40+ AI video models |
| mcpmarket.com | mcpmarket.com | MCP skill marketplace including image-to-video generators |
| Kaiber | kaiber.ai | Artistic style transformation for image-to-video |
| Pika | pika.art | Consumer-friendly short clip generation |
| Runway | runwayml.com | Gen-4 Turbo for VFX-quality single clip generation |
| Shhots AI | shhots.com | Ecommerce-focused video ads from 2-5 product photos |
| Kling | kling.kuaishou.com | Standalone image-to-video with product focus |
| LTX (Lightricks) | ltx.studio | Fast iteration image-to-video |
| Claude Code | docs.anthropic.com | Anthropic's CLI agent for coding and automation |

Frequently Asked Questions (FAQ)

What is the difference between image-to-video and a slideshow?

A slideshow applies code-based effects to static images — panning, zooming, Ken Burns transitions. The image never changes. Image-to-video uses an AI model to generate entirely new frames from your photo, creating real motion: objects rotate, people move, liquids flow. The AI creates pixels that did not exist in the original image.

Which AI models does Pexo use for image-to-video?

Pexo routes through 10+ models including Kling 3.0, Seedance 2.0, and Veo 3.1. Model selection is automatic based on image content — product close-ups route to Kling 3.0, human motion to Seedance 2.0, cinematic wide shots to Veo 3.1.

Can I use multiple images to create one video?

Yes. Pexo supports multi-image input where each uploaded image becomes a separate scene in a multi-shot video with transitions and AI music. Most standalone tools only accept one image per generation.

How long does image-to-video generation take?

A 15-second, 3-shot video takes approximately 8-10 minutes end-to-end in Pexo including model selection, rendering, music, and compositing. Single-clip tools like Pika or Runway generate a 4-5 second clip in 1-3 minutes, but require manual editing afterward.

What image formats and resolutions work best?

Pexo accepts standard image formats including JPG, PNG, and WebP. For best results, use images at 1080p resolution or higher with a clear subject. Product photos on clean backgrounds, lifestyle images with distinct subjects, and high-contrast compositions all produce strong results.
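If you are preparing a batch of photos, a quick pre-flight check can catch weak inputs before you upload. The sketch below is a local helper and not a Pexo feature. It assumes "1080p or higher" means the shorter side is at least 1080 pixels, and it treats `.jpeg` as equivalent to JPG. Both are assumptions, and you pass in the dimensions yourself.

```python
from pathlib import Path

# Formats named in this guide; ".jpeg" is included as an assumed alias of JPG.
ACCEPTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
MIN_SHORT_SIDE = 1080  # assumed reading of "1080p or higher"


def check_image(filename, width, height):
    """Return a list of problems for one image; an empty list means it looks fine."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ACCEPTED_EXTENSIONS:
        problems.append(f"{filename}: format {extension or '(none)'} is not JPG, PNG, or WebP")
    if min(width, height) < MIN_SHORT_SIDE:
        problems.append(f"{filename}: {width}x{height} is below the recommended 1080p")
    return problems


print(check_image("hero.JPG", 1920, 1080))
print(check_image("case.gif", 800, 600))
```

The guide says Pexo accepts formats "including" these three, so a format this check flags may still be accepted. Treat the result as a warning, not a hard rule.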

Do I need to write prompts for each AI model?

No. Pexo handles all prompt engineering internally. You describe your video in natural language — mood, style, pacing, music preference — and Pexo translates that into model-specific generation parameters for each shot. No per-model prompt writing required.

Can I control which model is used for each shot?

Pexo's default behavior is auto model selection, which routes each image to the optimal model. If you need manual model control for experimentation or specific creative requirements, inference.sh provides direct CLI access to 40+ models without auto-selection.

Does image-to-video work with screenshots and UI mockups?

Yes. Pexo accepts screenshots, app interfaces, and website designs as image input. The AI model generates motion appropriate to the content — interface elements animate, scroll effects generate, and parallax depth is added to flat designs.
