State of Prompting · Apr 2026
Seedance landed inside Runway and CapCut in the same April week. Sora is set to go dark on April 26. Here’s what changed for practitioners - and what the data says about where prompting is headed.
All prompt analysis and statistics in this report are drawn from the ummerr/prompts dataset — a classified collection of real-world generative AI prompts. Industry data comes from public research and product announcements; arena rankings are from Artificial Analysis (Apr 2026).
Three things changed the landscape this month. Everything below is downstream.
In 2023, the dominant idea was simple: write a better prompt, get a better output. By 2025, that had quietly collapsed - not through debate, but through tooling.
Midjourney introduced --sref (style reference) and --cref (character reference). Runway, Kling, and Veo made image-to-video a core feature. Creators stopped describing their characters and started uploading character sheets. Style boards replaced style adjectives.
The reason is straightforward: a photo of a face carries more identity information than any sentence describing one. References preserve subject-specific detail that prose simply cannot encode.
Character reference: consistent faces and identity across every shot - no description needed
Style reference: lock the visual look to an image instead of trying to describe it in words
Pose and composition reference: control body position and framing using a skeleton or layout image
"The primitive era of prompt engineering - characterized by trial-and-error iteration and artisanal prompt crafting - died somewhere between late 2024 and early 2025."
Death of Prompt Engineering: AI Orchestration in 2026 - BigBlue Academy
Andrej Karpathy named the successor in mid-2025: context engineering - what information the AI sees matters more than how you phrase the request. For image and video generation, context means the full brief: reference images, audio clips, previous frames, and text. The skill is knowing what to include and what to leave out.
Don't stack five style references hoping the model blends them. Pick one. Competing references produce averaged, muddied results.
Only include what's relevant to this frame. Don't carry forward every reference from your last five shots.
Models already know cinematic language, lighting, and art movements. Supply what they don't know: your character, your palette, your style.
For multi-scene work, keep a consistent core brief - character, palette, look - rather than re-explaining each time.
Upload a face, a style frame, or a composition sketch. Then write directorial notes on top. This inverts the 2023 workflow - and it's what the best practitioners in the dataset already do.
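To make that workflow concrete, here is a minimal sketch of a reference-first brief in Python. The field names and the render_brief helper are illustrative conventions for organizing your own assets, not any tool's actual upload API.

# A minimal sketch of the "references first, notes second" brief.
# Field names and render_brief() are our own convention, not a real API.
brief = {
    "references": {
        "character": "refs/maya_character_sheet.png",   # identity: upload it, don't describe it
        "style": "refs/style_frame_neon_rain.jpg",       # look: one style board, not five
        "composition": "refs/pose_sketch_01.png",        # framing: skeleton or layout image
    },
    "directorial_notes": (
        "Slow dolly forward, rain-soaked street at night, "
        "amber practicals, handheld energy in the last second."
    ),
}

def render_brief(b: dict) -> str:
    """Flatten the brief into an upload list plus a short directorial note."""
    refs = "\n".join(f"- {role}: {path}" for role, path in b["references"].items())
    return f"References:\n{refs}\n\nNotes: {b['directorial_notes']}"

print(render_brief(brief))

The ordering is the point: identity, style, and composition arrive as files, and the text is reduced to direction.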
Veo leads T2V. Grok leads I2V. Gemini leads T2I. No model wins everywhere. Run the same prompt through two tools before spending time on iteration - the model gap is larger than the prompt gap.
"Gimbal tracking shot, rear suspension compressing on impact" gives the model physics to simulate. "Cinematic and dramatic" gives it nothing. The best video prompts in the dataset read like shot lists, not poetry.
Veo 3.1, Kling 3.0, and Grok now generate audio in the same pass as video. If you don't describe sound in the brief, it becomes an afterthought. Describe dialogue, ambient noise, and effects alongside the visual.
Everything in this report is grounded in real prompts from real practitioners. Browse them, shuffle them, see what actually goes viral - then adapt.
Video prompting is a different skill from image prompting. Each of the major tools has a distinct personality - a prompt that works on one can fail on another.
"Modern prompting requires stopping description of what things look like and instead describing the forces acting on them."
Google's Veo 3.1 leads the T2V arena on Artificial Analysis (as of Apr 2026), with Google variants sweeping the top of the leaderboard. Native audio generation, 1080p output, and deep integration with Google infrastructure. Works best with structured, ingredient-list prompts and reference images.
What works
Lead with subject and shot type. Upload reference images instead of describing them. Use labelled sections for dialogue and sound effects. Provide a start frame and end frame and it fills in the motion.
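A rough sketch of that structure follows - the section labels are ours for organizing a brief, not official Veo syntax, and the filenames are placeholders:

# Structured, ingredient-list brief: subject and shot type first,
# then labelled sections for dialogue and sound. Labels are illustrative.
sections = {
    "Subject & shot": "Medium tracking shot - courier cycling through a wet market at dawn",
    "References": "start_frame.png, end_frame.png (the model fills in the motion between them)",
    "Dialogue": 'Vendor (calling out): "Mind the crates!"',
    "Sound effects": "bicycle bell, rain on tarps, low market chatter",
}

prompt = "\n".join(f"{label}: {content}" for label, content in sections.items())
print(prompt)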
ByteDance's breakout model and arguably the most hyped release of Q1 2026. Excels at reference-based generation - feed it character sheets, style boards, or scene photos and it maintains extraordinary fidelity across clips. Native lip-sync, audio generation, and timestamp syntax. The model that made "upload first, prompt second" the default workflow for video creators.
What works
Lead with reference images - character sheets, style frames, environment photos. Use [Xs]: timestamp syntax for multi-cut sequences. Describe motion and forces rather than aesthetics. Let the references carry the visual identity.
xAI's video model. Leads the I2V and Video Edit arenas on Artificial Analysis (as of Apr 2026). Generates clips up to 15 seconds. Supports video extension and iterative chat editing - refine with natural language rather than rewriting.
What works
Use comma-separated ingredient prompts rather than prose. Feed a reference image to anchor style and subject. Use iterative chat refinement rather than rewriting from scratch.
The model that pioneered storyboard-mode prompting - up to 6 distinct camera cuts from a single prompt. KlingAI variants sit near the top of the Video Edit arena (as of Apr 2026). Native lip-sync, speaker attribution, and the most granular shot-by-shot control of any current model.
What works
Use Custom Storyboard mode for full control. Structure each shot as: Scene → Characters → Action → Camera → Audio. Label dialogue per speaker. Give it as many reference files as you have.
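One way to keep that Scene → Characters → Action → Camera → Audio order honest across shots is to build every shot from the same field list. A minimal sketch - the labels follow the order above, but the scaffolding is ours, not Kling's storyboard format:

# Each shot is filled from the same five fields, in the same order,
# so nothing gets dropped between cuts. Character descriptors repeat verbatim.
FIELDS = ["Scene", "Characters", "Action", "Camera", "Audio"]

shots = [
    {"Scene": "Dim cafe, late evening",
     "Characters": "LENA - red coat, short dark hair",
     "Action": "enters, scans the room",
     "Camera": "wide shot, slow push-in",
     "Audio": 'LENA: "Is he here yet?"'},
    {"Scene": "Same cafe, corner table",
     "Characters": "LENA - red coat, short dark hair",
     "Action": "sits down, orders coffee",
     "Camera": "medium, static",
     "Audio": "ambient chatter, espresso machine"},
]

for i, shot in enumerate(shots, 1):
    print(f"Shot {i}")
    for field in FIELDS:
        print(f"  {field}: {shot[field]}")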
Google's Gemini models lead the T2I arena and rank near the top of Image Edit (Artificial Analysis, as of Apr 2026). The Flash variant leads T2I; the Pro variant leads editing. Native multimodal understanding means it handles text-in-image and complex compositions better than dedicated image models.
What works
Be explicit about text placement, composition, and style. For edits, describe what to change conversationally - it understands context from the source image.
OpenAI is shutting down both the Sora consumer app and API. At its peak, Sora 2 was the only non-Google model in the T2V top 5 - but at ~$1.30 per 10-second clip and ~11.3M videos/day, the $5.4B annualized burn rate was never sustainable.
What works
Migrate to Veo 3.1 (T2V #1) or Kling 3.0 for video generation. No Sora endpoint will remain available.
A reliable structure for video prompts
For scenes with multiple actions: use timed segments - (0–5s), (5–12s) - rather than describing everything at once. Physics-based tools handle sequential instructions better than simultaneous ones.
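A minimal sketch of that segmentation - the tuple format is our own convention, not a tool requirement:

# Turn a list of timed actions into "(0-5s): ..." segments,
# so the model gets sequential instructions rather than one simultaneous pile.
segments = [
    (0, 5, "gimbal tracking shot, motorbike approaches the ramp"),
    (5, 12, "rear suspension compresses on landing, dust kicks up"),
    (12, 15, "slow pan as the rider coasts to a stop"),
]

prompt = " ".join(f"({start}-{end}s): {action}." for start, end, action in segments)
print(prompt)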
The most underrated shift: sound. Kling 3.0 and Veo 3.1 now generate audio - effects, ambient noise, dialogue - in the same pass as the video. Describe it in the brief from the start or it becomes an afterthought.
The breakout model of Q1 became the default layer of Q2. Seedance 2.0 launched in February with a unified audio-video architecture and went viral almost immediately - a two-line-prompt clip of “Tom Cruise” vs “Brad Pitt” on a rooftop spread across X within days and triggered an MPA cease-and-desist. What made April different was distribution: Runway added Seedance 2.0 with an Unlimited plan on April 12, and CapCut began rolling it out across 100+ countries.
Runway Unlimited and CapCut global rollout. The model showed up inside the tools creators already had open.
Native multi-shot via [0s], [5s] timestamp blocks and Shot switch markers - one prompt, an edited sequence out.
MPA cease-and-desist, plus individual letters from Netflix, Warner, Disney, Paramount, and Sony. Unresolved - and not slowing adoption.
"Seedance 2.0 is now on Runway as the viral AI model continues its takeover."
Why it matters for your prompts. If you still write single-shot prose prompts, you’re leaving Seedance’s best feature on the floor. Structure the prompt as timestamped shots with a shared constants block up top (character, location, color grade), then let each block handle camera, action, and audio. The Multi-Shot section below has the full grammar.
Single-shot AI video is B-roll. Multi-shot AI video is an edited scene. Kling 3.0's February 2026 launch popularized the technique - and it's now the standard for anything with narrative structure.
Multi-shot prompting describes two or more distinct camera cuts in a single prompt. The model generates them as a coherent sequence - same characters, consistent environment, natural transitions. The underlying research (Kuaishou's MultiShotMaster, arXiv 2512.03041) modified how the model handles position embeddings to deliberately break continuity at shot boundaries while keeping character identity stable across them.
Shot 1 (0–4s): Wide - rain-soaked city street, amber streetlights, slow dolly forward. Shot 2 (4–8s): Medium - woman in red coat running through alley, tracking shot. Shot 3 (8–12s): Close-up - catching breath, eyes wide. [breathless]: "They found us."
Up to 6 shots · native lip-sync · speaker attribution
[0s]: Wide shot - character enters a dimly lit cafe, looking around curiously. [Shot switch] [5s]: Medium - sitting down, ordering coffee with a warm smile. [Shot switch] [10s]: Close-up - eyes react as someone enters. Warm golden lighting.
Uses Shot switch or Cut to as scene markers
The Continuity Lock. Open every multi-shot prompt with a shared constants block - time of day, location, character description, color grade, visual style. This is the "lock sheet" that anchors all shots to the same world. Repeat the same character descriptors verbatim in every shot. Even small wording changes can cause face drift.
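A sketch of the continuity lock in practice, using the timestamp and Shot switch grammar shown above. The Python scaffolding is illustrative only; the point is that the constants block is written once and interpolated verbatim into every shot, so descriptors never drift.

# One constants block, defined once, repeated verbatim in every shot.
LOCK = {
    "character": "MAYA - mid-30s, red wool coat, short black hair",
    "location": "rain-soaked downtown street, night",
    "grade": "teal-and-amber, soft film grain",
}

shots = [
    (0, "wide - {character} crosses the empty intersection, slow dolly forward"),
    (5, "medium - {character} ducks into a phone booth, tracking shot"),
    (10, 'close-up - {character}, eyes wide. [whispered]: "They found us."'),
]

header = f"Constants: {LOCK['character']}. {LOCK['location']}. Grade: {LOCK['grade']}."
body = " [Shot switch] ".join(f"[{t}s]: {desc.format(**LOCK)}" for t, desc in shots)
print(header)
print(body)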
Where it breaks down. Character consistency degrades past 4–5 shots. Hard cuts between very different environments (outdoor → indoor, day → night) produce visual seams. Timestamps are probabilistic - the model interprets them, not executes them literally. No current model stores character profiles between sessions: if you come back tomorrow, re-anchor with the same reference image.
A year ago, most AI video tools had one input: a text box. Today every major platform accepts text, images, audio, and video in combination. (Capabilities verified via Vivideo, PXZ, and official documentation, Mar 2026)
Text-only prompts leave most of the available control unused. The tools that accept reference images, audio clips, and video deliver substantially better results when you use those inputs.
Aurora (xAI) is the outlier - renders named real people where other tools refuse, and supports iterative chat editing. Prompt with comma-separated ingredients, not prose.
The claims above come from industry reports. This section is different - it's what we see in real prompts sourced from viral posts on X. Every prompt below is real - click shuffle to see more.
Live data from Insights. Full methodology on the Methodology page. Browse all prompts →
Look at which themes go viral and a pattern shows up: stylized work — abstract, fantasy, sci-fi, horror — is overrepresented relative to realism-demanding themes like portraits, landscapes, and product shots. One reading is that stylized themes forgive the physics and anatomy errors current models still make; another is that stylized content is simply more shareable. The data shows the skew, not the cause.
A plausible explanation: realness is the hardest dimension for current models. Output can be high-resolution and prompt-faithful, but still look wrong when physics or anatomy violates cognitive expectations - and viewers flinch at the same moment whether it’s a face, a car, or a hand.
Our data is consistent with the hypothesis that practitioners gravitate toward themes where current models look best - leaning into stylized work and away from realistic portraits, architectural renders, and product shots. It’s suggestive, not causal: the dataset tracks viral posts, and stylized content has always been disproportionately shareable.
Watch this number. The forgiving-to-demanding ratio is a proxy for how much the community trusts model realism. As Seedance 2.0, Veo 3.1, and newer Kling versions close the realness gap, expect this distribution to shift. The stylized-first era may be a temporary artifact of model limitations, not creative preference.
Live data from the dataset. The “forgiving” vs “demanding” theme classification is our own, not a standard framework. “Forgiving” = themes where unrealistic output is aesthetically acceptable. “Demanding” = themes where viewers expect physical/anatomical accuracy.
On April 26, 2026, Sora goes dark - the consumer app and ChatGPT video generation both shut down; the API follows September 24. OpenAI announced the full shutdown on March 24, six months after launch. At its peak Sora 2 was the only non-Google model in the T2V top 5, but the economics were never close to working.
Cost estimates sourced from Remio/Forbes analysis and Cantor Fitzgerald research. Shutdown announcement via @soraofficialapp.
The lesson. The model did not survive its own economics - OpenAI is deprecating the API alongside the consumer app. Each 10-second clip cost ~$1.30 to generate; at 11.3 million videos a day that's ~$15M daily and $5.4B annually - against a company already losing twice what it earns. The field consolidated around Google (Veo 3.1), xAI (Grok), and Kling/Seedance/Runway. Unlike other shutdowns, there's no API fallback this time.