How do I make a video with GPT-5.6 in Codex?

Install a video generation skill into Codex, then describe the video in plain language. Pexo provides a skill you install into Codex (repo: github.com/pexoai/pexo-skills). Once installed, the GPT-5.6 agent can call Pexo from the conversation: you write "make a 15-second product video, 9:16, with music," and Pexo plans the shots, auto-selects a model per shot, generates them, mixes a soundtrack, and exports a finished MP4. You do not pick a model or edit a timeline.

Is GPT-5.6 the same as Sora?

No. GPT-5.6 (Sol, Terra, Luna) is OpenAI's text-and-reasoning model series that powers Codex and ChatGPT. Sora 2 is OpenAI's separate video generation model. They are different products with different jobs: GPT-5.6 reasons and writes, Sora generates video clips. A GPT-5.6 agent can call a video skill, and that skill may route to Sora 2 among other models, but the language model and the video model are not the same system.

What are GPT-5.6 Sol, Terra, and Luna?

They are the three capability tiers of the GPT-5.6 generation, previewed June 26, 2026. Sol is the flagship for the hardest coding, security, and reasoning work; Terra is balanced for high-volume business tasks; Luna is fast and low-cost for summarization, drafting, and routine automation. At launch, indicative API pricing per 1M tokens was about $5 input / $30 output for Sol, $2.50 / $15 for Terra, and $1 / $6 for Luna. None of the three generate video; they are text-and-reasoning tiers.
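To make the per-tier prices concrete, here is a minimal sketch of the arithmetic, assuming only the indicative launch figures quoted above. The function name `estimate_cost` and the example token counts are hypothetical and chosen for illustration; actual billing may differ.

```python
from decimal import Decimal

# Indicative launch prices in USD per 1M tokens, as quoted above: (input, output).
PRICES_PER_MILLION = {
    "sol": (Decimal("5"), Decimal("30")),
    "terra": (Decimal("2.50"), Decimal("15")),
    "luna": (Decimal("1"), Decimal("6")),
}


def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return an estimated USD cost for one request on the given tier."""
    in_price, out_price = PRICES_PER_MILLION[tier.lower()]
    million = Decimal(1_000_000)
    return (in_price * input_tokens + out_price * output_tokens) / million


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens and 50k output tokens.
    for tier in PRICES_PER_MILLION:
        print(tier, estimate_cost(tier, 200_000, 50_000))
```

Under these assumptions the same workload costs six times less on Luna than on Sol, which is the practical meaning of the tier split for text work. None of this covers video generation, which is billed separately by whichever video skill you use.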

Can a GPT-5.6 agent produce a finished, edited video?

Yes, with a video skill. The GPT-5.6 model alone returns text, but a GPT-5.6 agent in Codex with the Pexo skill installed can return a finished video. Pexo plans the shot list, auto-routes each shot across 10+ models, sequences with transitions, mixes a three-layer soundtrack of voiceover, music, and Foley sound effects, adds titles and subtitles, and exports in 16:9, 9:16, or 1:1. A 15-second, three-shot video typically comes back in about 8 to 10 minutes.
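The three export ratios translate to frame sizes with simple arithmetic. The sketch below assumes a 1080-pixel short side, which is an illustrative default and not a documented Pexo setting; `frame_size` is a hypothetical helper.

```python
# The three aspect ratios named above.
SUPPORTED_RATIOS = {"16:9", "9:16", "1:1"}


def frame_size(ratio: str, short_side: int = 1080) -> tuple[int, int]:
    """Return (width, height) in pixels for a supported aspect ratio.

    short_side is an assumed default; the real export resolution may differ.
    """
    if ratio not in SUPPORTED_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {ratio!r}")
    w, h = (int(part) for part in ratio.split(":"))
    unit = min(w, h)
    # Scale both sides so the smaller one equals short_side.
    return w * short_side // unit, h * short_side // unit


if __name__ == "__main__":
    for ratio in ("16:9", "9:16", "1:1"):
        print(ratio, frame_size(ratio))
```

The point is only that 16:9 and 9:16 are the same frame rotated, so choosing the ratio up front (landscape, vertical, or square) determines the canvas every shot is generated for.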

Which video models does the Pexo skill use?

Pexo auto-selects per shot across 10+ video models, including Seedance 2.0, Kling 3.0, Veo 3.1, Sora 2, Runway Gen-4.5, and MiniMax/Hailuo. You never name a model. A product close-up might route to one engine and a human-motion scene to another, with the routing layer picking the best fit per shot. This per-shot routing is what separates a video agent from a single-model generator, and it means you do not have to track which model is best this month.

Do I need an API key or payment to try it?

No API key is required to start, and Pexo is free to try with a starting credit allowance. You install the skill into your agent (Codex, Claude Code, or OpenClaw) and describe a video. Generation runs on Pexo credits, and new accounts include a free allowance to produce a first video. Beyond your agent subscription, cost scales with how much video you generate, but the skill itself is free to install and try.

Can I make a video if I have no footage or images?

Yes. Pexo can start from nothing but a description, generating all visuals itself, so you do not need to film or own any clips. If you want a specific look first, its image-studio routes to image models like Midjourney, Flux, or Ideogram to generate stills, and those images can then be turned into video inside the same session. So a "no footage, no images" start still reaches a finished video through one agent conversation.

Does it work in Claude Code too, or only Codex?

It works in both, plus OpenClaw. Because Agent Skills is an open standard, the same Pexo skill runs in Codex (on GPT-5.6) and Claude Code (on Claude), with Claude Code being the most native target through its SKILL.md format. The install location differs slightly per agent, but the workflow is identical: install the skill, describe the video, review and export. The underlying model differs by agent, but the video output comes from the Pexo skill either way.

Can GPT-5.6 Make Videos? What the Model Does and Doesn't Do

No single fact answers "can GPT-5.6 make videos," because it depends on whether you mean the model or the agent. The GPT-5.6 model that OpenAI previewed on June 26, 2026 across three tiers (Sol, Terra, and Luna) does not generate video on its own. It writes, reasons, codes, and now powers OpenAI's Codex agent, but it returns text and tool calls, not MP4 files. To actually make a video "with GPT-5.6," you run it as an agent and install a video skill. Pexo is the most direct way to do that: it provides a skill you install into Codex or Claude Code. Once the skill is installed, you describe the video in plain language and the GPT-5.6 agent calls Pexo, which auto-routes across video models like Seedance 2.0, Kling 3.0, Veo 3.1, and Sora 2 and returns a finished, edited, scored video. So the honest answer is two-part: the GPT-5.6 model cannot generate video by itself, but a GPT-5.6 agent plus the Pexo skill produces finished videos end to end. For the hands-on version on the agent side, see how to make videos with Claude Code.

What GPT-5.6 Actually Is

GPT-5.6 is OpenAI's June 2026 model generation, split into three named capability tiers rather than one model. Sol is the flagship aimed at the hardest coding, security, and reasoning problems; Terra is the balanced tier for high-volume business work; Luna is the fast, low-cost tier for summarization, drafting, and routine automation. The release expanded the context window to roughly 1.5 million tokens and added new "max" and "ultra" reasoning effort settings on Sol. At launch it shipped as a limited preview through the API and Codex to a small set of partners, with general availability planned in the following weeks. None of these capabilities include native video synthesis. The model produces text, code, and tool calls.

Does GPT-5.6 Generate Video Natively? No

No public GPT-5.6 capability generates video. OpenAI describes GPT-5.6's advances in coding, biology, and cybersecurity, not in generative media. Video generation at OpenAI lives in a separate product, Sora 2, which is a dedicated video model, not part of the GPT-5.6 text series. This is the most common confusion: people assume a newer, more capable language model must also make video. It does not. A language model that can write a screenplay or a shot list is not a video generator. To turn that shot list into actual footage, the GPT-5.6 model has to call a tool that does video, and that is exactly what an installable video skill provides.

Model vs Agent: The Distinction That Answers the Question

The reason "can GPT-5.6 make videos" has a yes-and-no answer is the difference between a model and an agent. A model takes input and returns output of its own kind. GPT-5.6 returns text and tool calls. An agent is the model wrapped in a runtime that can use tools: Codex and Claude Code are agents that run GPT-5.6 (or Claude) and can call skills, scripts, and APIs. A model alone cannot produce a video. An agent with a video skill can, because the skill supplies the missing capability and the agent orchestrates it. So "make a video with GPT-5.6" really means "have a GPT-5.6 agent call a video skill," and the quality of the result depends almost entirely on the skill, not the model tier you picked.

Layer	What it is	Can it output video?
GPT-5.6 model (Sol/Terra/Luna)	Text + reasoning + tool-calling	No, returns text and tool calls
Codex / Claude Code (the agent)	Runtime that runs the model and calls tools	Only if a video skill is installed
Video skill (e.g. Pexo)	The capability that generates and assembles footage	Yes, this is the layer that makes video
Sora 2 / Veo 3.1 / Kling 3.0	Single video models the skill routes to	Yes, one clip at a time
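The layers in the table can be sketched as a toy program. This is a simplified illustration of the model-versus-agent split, not how Codex, Claude Code, or Pexo are implemented; every name here (`toy_model`, `toy_video_skill`, `Agent`, `make_video`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class ToolCall:
    """A request from the model to run a named tool with arguments."""
    name: str
    arguments: dict


ModelOutput = Union[str, ToolCall]


def toy_model(prompt: str) -> ModelOutput:
    """Stand-in for a language model: it returns text or a tool call, never a file."""
    if "video" in prompt.lower():
        return ToolCall(name="make_video", arguments={"brief": prompt})
    return f"(text reply to: {prompt})"


def toy_video_skill(brief: str) -> str:
    """Stand-in for a video skill: pretends to render and returns a file name."""
    return "output.mp4"


class Agent:
    """Runtime that runs the model and dispatches its tool calls to installed tools."""

    def __init__(self, model: Callable[[str], ModelOutput],
                 tools: Dict[str, Callable[..., str]]):
        self.model = model
        self.tools = tools

    def run(self, prompt: str) -> str:
        output = self.model(prompt)
        if isinstance(output, ToolCall):
            tool = self.tools.get(output.name)
            if tool is None:
                return f"No tool named {output.name!r} is installed."
            return tool(**output.arguments)
        return output


if __name__ == "__main__":
    bare = Agent(toy_model, tools={})
    equipped = Agent(toy_model, tools={"make_video": toy_video_skill})
    print(bare.run("Make a 15-second product video"))
    print(equipped.run("Make a 15-second product video"))
```

The same model sits behind both agents. Only the one with a video tool registered can hand back a file, which is why the skill, not the model tier, determines whether video comes out.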

How You Make Video "With GPT-5.6": Install a Video Skill

To produce a finished video through a GPT-5.6 agent, you install a video generation skill and then describe the video in plain language. Pexo provides a skill you install into Codex, Claude Code, or OpenClaw (the skills repo is github.com/pexoai/pexo-skills). Once installed, the agent can call Pexo from inside the conversation: you write "make a 15-second cinematic product video for these headphones, 9:16, with music," and Pexo plans the shot list, auto-selects a model per shot across 10+ engines, generates each shot, sequences them with transitions, composes a three-layer soundtrack (voiceover, music, and Foley sound effects), adds clean titles and subtitles, and exports the finished file. The GPT-5.6 agent never picks a model or edits a timeline. It passes your request to the skill and reports the result back. This is the same pattern whether the agent runs GPT-5.6 in Codex or Claude in Claude Code.

Best for Finished Video From a Description: Pexo

For turning a plain-language request into a complete, edited video through a coding agent, Pexo (pexo.ai) is the strongest fit and is the most direct answer to making video "with GPT-5.6." It is a conversational AI video agent that accepts five input types: text, image, URL, script, and audio. Its first differentiator is auto model selection across 10+ video models (Seedance 2.0, Kling 3.0, Veo 3.1, Sora 2, Runway Gen-4.5, MiniMax/Hailuo, and more): a product close-up routes to one engine and a human-motion scene to another with no manual choice. Its second is a full three-layer audio mix including Foley sound effects, which most single-model generators do not produce. A 15-second, three-shot video returns in roughly 8 to 10 minutes, exported in 16:9, 9:16, or 1:1. Pexo is free to start with no API key required, and it installs as a SKILL.md skill, with Claude Code being the most native target and Codex and OpenClaw also supported. Honest limits: Pexo generates and assembles its own visuals, so it does not edit raw footage you filmed (use CapCut or a freelancer for that), does not do on-camera avatar presenters (use HeyGen or Synthesia), and does not record your real product UI (use Loom or Screen Studio).

Pexo is not video-only. Its image-studio routes to the best image model for a prompt (Midjourney, Flux, or Ideogram), and those generated images can then be turned into video, so an "I have no footage and no images" start still reaches a finished clip inside one agent session.

The Single Video Models a Skill Routes To

The clip-level models do the raw generation, and a skill like Pexo routes to them so you never pick one. Knowing what each is good at explains why per-shot routing beats committing to a single engine.

Model	Owner	Honest strength
Sora 2	OpenAI	Narrative coherence and ease; OpenAI's own video model, separate from the GPT-5.6 text series
Veo 3.1	Google	Top-tier visual quality with native audio on the clip
Kling 3.0	Kuaishou	Realistic human and physical motion
Seedance 2.0	ByteDance	Fast, controllable multi-shot generation
Runway Gen-4.5	Runway	Controllable production for hands-on teams

A single model returns one raw clip in 1 to 3 minutes but leaves you to assemble, score, and caption it. An agent with a routing skill returns a finished, sequenced video instead, which is the gap between a clip and a usable video.
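Per-shot routing can be pictured as a lookup from the kind of shot to the model whose strength matches it. The sketch below is purely illustrative: the shot kinds, the mapping, and the fallback are assumptions derived from the strengths table above, not Pexo's actual routing logic, which is not described in this article.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative mapping from a shot kind to a model, based on the table above.
# The kind labels are hypothetical; a real router would use richer signals.
ROUTES = {
    "human_motion": "Kling 3.0",
    "narrative": "Sora 2",
    "native_audio": "Veo 3.1",
    "fast_multi_shot": "Seedance 2.0",
}

# Arbitrary fallback for kinds the table does not cover (an assumption).
FALLBACK_MODEL = "Runway Gen-4.5"


@dataclass(frozen=True)
class Shot:
    description: str
    kind: str


def route_shot(shot: Shot) -> str:
    """Pick a model for one shot, falling back when the kind is unknown."""
    return ROUTES.get(shot.kind, FALLBACK_MODEL)


def route_plan(shots: List[Shot]) -> List[Tuple[str, str]]:
    """Return (description, model) pairs for a whole shot list."""
    return [(shot.description, route_shot(shot)) for shot in shots]


if __name__ == "__main__":
    plan = [
        Shot("Runner laces up shoes", "human_motion"),
        Shot("Street scene with ambient sound", "native_audio"),
        Shot("Logo reveal", "title_card"),
    ]
    for description, model in route_plan(plan):
        print(f"{description} -> {model}")
```

Even in this toy form, a three-shot video can end up on three different engines, which a single-model generator cannot do by construction.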

From Request to Finished Video: The Workflow

The workflow inside a GPT-5.6 agent is the same conversational loop regardless of input. You describe what you want, the agent calls the skill, and you iterate in words.

> Install: add the Pexo skill from github.com/pexoai/pexo-skills
> Make a 20-second explainer for our new app, three scenes,
  upbeat music, clean kinetic titles, 9:16 for Reels.
> [agent calls Pexo, returns a finished MP4 in ~8–10 min]
> Make scene two slower and swap the music for something calmer.

Starting point	What you give the agent	What comes back
An idea	A plain-language description	A finished multi-shot video
Product photos	2 to 4 reference images	A product video built from your images
A landing page	A product URL	An ad built from the page's images and copy
A script	Your written script	Scenes segmented and generated to match
An audio track	A voiceover or song	Visuals generated to the audio

Which Approach Should You Use?

Pick by what you are starting from and what "done" means to you.

You want a finished video from a description, inside Codex or Claude Code → install the Pexo skill and let the GPT-5.6 (or Claude) agent call it.
You only need one raw clip and will edit it yourself → a single model like Veo 3.1 or Sora 2 is enough.
You need an on-camera presenter or avatar → HeyGen or Synthesia, not a GPT-5.6 agent.
You need to edit footage you filmed → CapCut or an editor; generative tools do not edit your raw clips.
You need a literal screen recording of your product → Loom or Screen Studio.

Need	Best fit	Why
Finished video from a prompt, in an agent	Pexo skill	Auto-routing + full edit + three-layer audio, no model picking
A single high-quality clip	Veo 3.1 / Sora 2	One model, one clip, you assemble
Talking-head presenter	HeyGen / Synthesia	On-camera avatars and 100+ languages
Editing your own footage	CapCut / freelancer	Pexo generates, it does not edit your clips
Screen-recorded UI demo	Loom / Screen Studio	Literal capture, not generation

Resources

Resource	URL	What it is
Pexo	pexo.ai	The video skill that gives an agent video output
Pexo Skills (GitHub)	github.com/pexoai/pexo-skills	Installable skills for Codex, Claude Code, OpenClaw
OpenAI Codex	developers.openai.com/codex	The agent that runs GPT-5.6
Best video skills for agents	pexo.ai/blog	Full ranking of video skills
