A corpus of organic, in-the-wild generative AI prompts sourced from high-engagement posts on X/Twitter and a small set of Reddit communities, covering image and video generation. Every entry reflects a real practitioner decision: what to generate, how to phrase it, and which model to use. High engagement acts as an organic quality signal: these prompts were judged worth sharing by thousands of practitioners. See methodology for classification details.
Established prompt datasets — DrawBench, PartiPrompts, T2I-CompBench — were designed for model evaluation: synthetic or crowdsourced prompts used to benchmark generation quality. They serve that purpose well. This dataset serves a different one.
This collection captures the organic prompt distribution from practitioners who actively use generative AI tools and share their results publicly. The selection mechanism — social engagement — is imperfect but meaningful: a prompt that accumulates high view and repost counts has passed a form of community judgment. The goal is to document practitioner behavior, not to compete with evaluation benchmarks.
| Dataset | Size | Source | Modality | Provenance | Engagement | Year |
|---|---|---|---|---|---|---|
| DrawBench | 200 | Synthetic (LLM) | Image | None | None | 2022 |
| PartiPrompts | 1,632 | Crowdworkers (Google) | Image | None | None | 2022 |
| T2I-CompBench | 6,000 | Synthetic (GPT-4) | Image | None | None | 2023 |
| GenAI-Bench | 1,200 | LLM + human mix | Image + Video | None | None | 2024 |
| EvalCrafter | 700 | LLM + real users | Video | None | None | 2024 |
| VBench | 1,600 | Manual per dimension | Video | None | None | 2024 |
| T2VEval-Bench | 1,783 | LLM + manual | Video | None | Lab MOS | 2025 |
| ummerr/prompts (this) | - | Organic / in-the-wild | Image + Video | Full (URL + author) | Viral filter | Mar 2026 |
Study what the actual distribution of prompts looks like across modalities, models, and technique types - as opposed to the synthetic or curated distributions used in most benchmarks. Useful for calibrating evaluation sets to real practitioner behavior.
Each entry is sourced from high-engagement posts (high views, reposts, saves). This creates a weak but organic quality label: prompts that practitioners found compelling enough to share and reshare. Researchers can study whether engagement correlates with automated quality metrics.
The dataset covers image and video generation with structured technique labels. Most existing prompt datasets are image-only. This enables cross-modal comparison: how does a T2V prompt differ structurally from a T2I prompt for the same subject?
Each entry includes a detected model field. Researchers can study how prompt style, length, technique invocation, and reference usage vary across models - Midjourney vs. FLUX vs. Kling vs. Sora - and how practitioner prompting strategies adapt to model capabilities.
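As a minimal sketch of this kind of per-model analysis, the snippet below groups JSONL rows by detected_model and reports mean prompt length in words. Field names follow the schema table; the toy rows are illustrative, not real dataset entries.

```python
import json
from collections import defaultdict

def prompt_stats_by_model(jsonl_lines):
    """Mean extracted_prompt length (in words) per detected_model."""
    lengths = defaultdict(list)
    for line in jsonl_lines:
        entry = json.loads(line)
        model = entry.get("detected_model") or "unknown"
        prompt = entry.get("extracted_prompt") or ""
        if prompt:
            lengths[model].append(len(prompt.split()))
    return {m: sum(v) / len(v) for m, v in lengths.items()}

# Toy rows for illustration; field names match the schema table.
rows = [
    json.dumps({"detected_model": "Midjourney v6.1",
                "extracted_prompt": "a lighthouse at dusk, 35mm film grain"}),
    json.dumps({"detected_model": "FLUX",
                "extracted_prompt": "portrait of an astronaut"}),
]
stats = prompt_stats_by_model(rows)
```

The same grouping extends naturally to technique-invocation rates or requires_reference frequency per model.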
bookmarked_at and created_at timestamps enable temporal slicing. Study how the prompt distribution evolves as new models are released, how quickly practitioners adopt new techniques, and how model market share shifts over time in the practitioner community.
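A temporal slice can be taken directly off those timestamps. A minimal sketch, assuming the timestamps are serialised as ISO 8601 strings (the export format is not specified here):

```python
from datetime import datetime

def slice_by_date(entries, start_iso, end_iso, field="bookmarked_at"):
    """Entries whose timestamp falls in [start, end); assumes ISO 8601 strings."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    out = []
    for e in entries:
        ts = e.get(field)
        if ts and start <= datetime.fromisoformat(ts) < end:
            out.append(e)
    return out

# Toy entries, not real dataset rows.
entries = [
    {"id": 1, "bookmarked_at": "2024-03-01T12:00:00+00:00"},
    {"id": 2, "bookmarked_at": "2025-01-15T09:30:00+00:00"},
]
q1_2024 = slice_by_date(entries, "2024-01-01T00:00:00+00:00",
                        "2024-04-01T00:00:00+00:00")
```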
The requires_reference and reference_type fields capture which prompts require a reference image as input, and what kind (face, style, subject, pose, background). Useful for studying how practitioners use image conditioning vs. text-only prompting across different task types.
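A quick breakdown over those two fields might look like the following sketch, using the field names and enum values from the schema table:

```python
from collections import Counter

def reference_breakdown(entries):
    """Count reference_type values among prompts that require a reference image."""
    return Counter(
        e.get("reference_type") or "unspecified"
        for e in entries
        if e.get("requires_reference")
    )

# Toy entries, not real dataset rows.
entries = [
    {"requires_reference": True, "reference_type": "face_person"},
    {"requires_reference": True, "reference_type": "style_artwork"},
    {"requires_reference": False, "reference_type": None},
]
counts = reference_breakdown(entries)
```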
Four derived views for common research slices. Each is a filtered subset of the master dataset above — every row here also appears in the master download. Downloadable separately in JSONL, CSV, JSON, and a PII-stripped research variant.
- Image, text-only: image prompts that generate from text alone, with no reference image required.
- Image, reference-based: image prompts that require a user-supplied reference image (face, style, subject, pose, or scene).
- Video, text-only: video prompts generated from text only, with no reference or source image conditioning.
- Video, conditioned: video prompts conditioned on an input image or video (image-to-video, reference-to-video, video-to-video).
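The four views can be reproduced from the master file. A minimal sketch, assuming the prompt_category values and requires_reference flag from the schema table (the view names here are illustrative, not the official download names):

```python
def view_of(entry):
    """Map a master-file entry to one of the four derived views (names illustrative)."""
    cat = entry.get("prompt_category") or ""
    ref = bool(entry.get("requires_reference"))
    if cat.startswith("image_"):
        return "image_reference" if ref else "image_text_only"
    if cat.startswith("video_"):
        # i2v prompts are image-conditioned by definition, even without the flag
        return "video_conditioned" if (ref or cat == "video_i2v") else "video_text_only"
    return None  # audio and other categories fall outside the four views

# Toy entries, not real dataset rows.
examples = [
    {"prompt_category": "image_t2i", "requires_reference": False},
    {"prompt_category": "video_i2v", "requires_reference": True},
    {"prompt_category": "video_t2v", "requires_reference": False},
]
views = [view_of(e) for e in examples]
```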
Prompts are drawn exclusively from posts that practitioners chose to share publicly. This systematically over-represents prompts that produced visually impressive or socially shareable results, and under-represents failed attempts, iterative drafts, and everyday utility prompts.
This bias is also the dataset's signal: understanding the distribution of prompts that practitioners consider share-worthy is itself a research question. Sourcing across both Twitter/X and Reddit partially offsets pure aesthetics bias - Reddit communities reward technical depth and reproducibility.
Source content is dominated by English-language posts from Twitter/X and a small set of Reddit communities. Non-English prompts, closed communities, Discord servers, and professional workflows are not represented. The dataset likely reflects the aesthetics and interests of a specific online subculture rather than the broader global practitioner population.
Ingestion spans multiple subreddits with different community cultures (r/midjourney, r/StableDiffusion, r/FluxAI, r/kling_ai), broadening the range of styles, use cases, and practitioner skill levels captured.
Category labels, theme tags, art style tags, model attribution, and extracted prompt text are assigned by Claude Sonnet 4.6 - not human annotators. Errors cluster around ambiguous multi-technique prompts, unfamiliar or emerging tools, non-English content, and prompts where the model name is absent from the post text.
The classifier uses structured tool-use output with strict enum validation, reducing free-form hallucination. Confidence scores are stored alongside labels. The raw source text is always preserved, so reclassification is non-destructive.
The dataset records what practitioners shared, not whether the prompt reliably produces good results across seeds, model versions, or hardware configurations. High engagement reflects community judgment, not empirical reproducibility.
Source post URLs and media_urls are retained for every entry, allowing researchers to inspect original post context, attached output media, and community reception. Engagement is best treated as a weak positive label, not a ground-truth quality rating.
Ingestion began in 2024 and runs on a rolling basis. Older models (pre-2023) are under-represented; newly released models may lag until ingestion catches up. Coverage of any given model reflects its social media footprint, not its market share or capability.
bookmarked_at and created_at timestamps are preserved for every entry, making temporal filtering straightforward. Model attribution is stored as free-text alongside a normalised canonical slug, so analyses can distinguish between model generations.
This dataset focuses on professional and creative prompt engineering. NSFW and sexualized content, which represents a significant share of public image generation activity, has been excluded from collection.
The exclusion is deliberate — the dataset targets craft-oriented prompting rather than exhaustive coverage of all generation use cases. Researchers studying the full distribution of generative AI usage should account for this gap.
Deduplication is exact-match only on the source post ID. Reposts, quote-tweets, and community re-shares of the same underlying prompt may appear as distinct entries. Downstream fine-tuning or similarity studies should apply semantic deduplication.
Each entry retains its author_handle and tweet_url, making provenance traceable. Semantic deduplication can be applied against the extracted_prompt field, which strips social framing and hashtags to surface the underlying prompt text.
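For exploratory passes, semantic deduplication need not require embeddings. Below is a lightweight word-shingle Jaccard sketch over extracted_prompt; an embedding-based approach would be the stronger choice at scale, and the 0.8 threshold is an illustrative assumption:

```python
import re

def _shingles(text, k=3):
    """Lowercased k-word shingles, ignoring punctuation and hashtags."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def near_duplicates(prompts, threshold=0.8):
    """Index pairs whose shingle-set Jaccard similarity meets the threshold.

    O(n^2) pairwise comparison; fine for exploratory passes over subsets.
    """
    sets = [_shingles(p) for p in prompts]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            inter = len(sets[i] & sets[j])
            union = len(sets[i] | sets[j])
            if union and inter / union >= threshold:
                pairs.append((i, j))
    return pairs

# Toy prompts: the first two differ only in casing and a trailing hashtag.
prompts = [
    "a red fox running through snow, cinematic lighting",
    "A red fox running through snow, cinematic lighting #ai",
    "portrait of an astronaut on mars",
]
dupes = near_duplicates(prompts)
```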
- Twitter/X (primary source): bookmarked posts from practitioners sharing AI generation workflows, filtered for high engagement (views, reposts, and saves). Includes media outputs, threads, and referenced works.
- Reddit: r/midjourney, r/StableDiffusion, r/FluxAI, r/kling_ai, r/PromptEngineering. Reddit sourcing broadens demographic coverage; these communities reward technical reproducibility alongside visual impact.
- Manual: hand-entered prompts from tutorials, blog posts, or community shares not covered by automated ingestion.
| Field | Description |
|---|---|
| id | Primary key |
| tweet_id | Original post ID - enables deduplication and provenance tracing |
| tweet_text | Full text of the source post (unmodified) |
| author_handle | Platform username of the practitioner who shared the prompt |
| author_name | Display name |
| tweet_url | Canonical URL - links to output media and original engagement context |
| media_urls | Output image/video URLs attached to the post |
| source | Ingestion origin: twitter, reddit, or manual |
| category | Top-level bucket: prompts, tech_ai_product, career_productivity, or uncategorized |
| prompt_category | Modality + technique: image_t2i, video_t2v, video_i2v, audio, etc. |
| extracted_prompt | Clean prompt text extracted from post + comments - social framing stripped |
| detected_model | AI model mentioned in the post (free text, normalised to a canonical slug, e.g. "Midjourney v6.1") |
| prompt_themes | Visual themes: person, cinematic, landscape, scifi, fantasy, etc. |
| art_styles | Art styles: photorealistic, anime, oil_painting, pixel_art, etc. |
| requires_reference | True if prompt requires a reference image as input |
| reference_type | One of: face_person, style_artwork, subject_object, pose_structure, scene_background |
| is_thread | True if post is a multi-tweet thread |
| thread_tweets | Array of {tweet_id, tweet_text} for threaded posts |
| confidence | Classifier confidence score (0–1) |
| rationale | LLM reasoning for the category assignment |
| user_notes | Human curator notes |
| bookmarked_at | When the post was originally bookmarked |
| created_at | Row insertion timestamp |
| updated_at | Last modification timestamp |
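A loader for the JSONL export might look like the sketch below. Field names follow the schema table above; the one-object-per-line layout and the choice of required fields are assumptions, and the sample rows are toy data:

```python
import json

# Fields treated as required here; adjust to your analysis needs.
REQUIRED = ("id", "tweet_id", "tweet_url", "extracted_prompt", "prompt_category")

def load_entries(jsonl_text):
    """Parse a JSONL export, keeping only rows that carry the required fields."""
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if all(entry.get(f) is not None for f in REQUIRED):
            entries.append(entry)
    return entries

# Toy sample: one complete row, one incomplete row that gets dropped.
sample = "\n".join([
    json.dumps({"id": 1, "tweet_id": "t1", "tweet_url": "https://example.com/1",
                "extracted_prompt": "a castle in fog",
                "prompt_category": "image_t2i"}),
    json.dumps({"id": 2}),
])
good = load_entries(sample)
```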
Free to use, share, and adapt for any purpose - including commercial - with appropriate credit. The original prompt texts remain the intellectual property of their authors; this dataset provides structured metadata for research purposes.
@dataset{ummerr_prompts_2025,
title = {ummerr/prompts: An In-the-Wild Generative AI Prompt Dataset},
author = {ummerr},
year = {2025},
url = {https://prompts.ummerr.com/dataset},
note = {Organic prompts sourced from high-engagement posts on X/Twitter.
Covers image and video generation with structured
metadata, model attribution, and technique labels.},
license = {CC BY 4.0}
}