Methodology
Most prompt datasets exist to evaluate models. This one exists to study practitioners. It captures what people actually type into image and video generators when they're trying to make something worth sharing — then classifies it so you can search, filter, and find patterns across a thousand-plus real-world prompts.
Primary source: high-engagement posts on X/Twitter — the platform where most AI generation work gets shared publicly. Secondary: Reddit communities (r/midjourney, r/StableDiffusion, r/FluxAI, r/kling_ai), which tend to reward technical depth and reproducibility over pure aesthetics. Supplementary: manual entries from tutorials and blog posts not covered by automated ingestion.
“High engagement” means substantial view counts, reposts, and saves. This is a convenience sample filtered by community reception, not a random sample. That's deliberate — the filtering is the signal. A prompt that thousands of practitioners chose to amplify has passed a form of peer review that no annotation rubric can replicate.
This dataset is not representative of all prompting behavior. It skews toward successful outputs shared for engagement — visually impressive, socially shareable results. Failed attempts, iterative drafts, and everyday utility prompts are systematically under-represented. Platform coverage is English-dominant, X/Twitter-heavy. Professional workflows, Discord communities, and non-English practitioners are largely absent.
That said: the bias is the dataset's signal. “What do practitioners consider worth sharing?” is itself a research question, and this dataset is purpose-built to answer it. If your analysis requires unbiased sampling, look elsewhere. If you want to study the social dynamics of prompt craft — what goes viral, what gets copied, what techniques spread — this is the dataset.
Each prompt is classified using a structured schema designed for this dataset. The categories and labels are our own — built to describe practitioner behavior (what technique was used, what model, what visual intent) rather than to evaluate output quality.
Each prompt is classified across five structured dimensions:
Modality + technique (e.g., video_t2v, image_i2i, image_character_ref)
Subject matter (person, cinematic, landscape, sci-fi, fantasy, etc.)
Aesthetic approach (photorealistic, anime, oil painting, pixel art, etc.)
Whether the prompt requires a reference image and what kind (face, style, subject, pose, scene)
Which AI model is mentioned or inferred (Midjourney, Kling, Flux, etc.)
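To make the five dimensions concrete, here is a sketch of what one classified entry might look like. Field names and values are illustrative assumptions, not the dataset's actual column names:

```python
# One hypothetical classified entry. Field names are assumptions
# chosen to mirror the five dimensions above, not the real schema.
entry = {
    "prompt": "cinematic portrait of an astronaut, 85mm lens, golden hour",
    "modality_technique": "image_t2i",    # modality + technique
    "subjects": ["person", "cinematic"],  # subject matter tags
    "aesthetic": "photorealistic",        # aesthetic approach
    "reference_required": None,           # or "face", "style", "subject", "pose", "scene"
    "model": "Midjourney",                # model mentioned or inferred
}
```

A record like this is what the search and filter layers operate over: every dimension is a flat, enumerable field rather than free text.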
Each prompt passes through a tiered pipeline of Claude models — smaller models handle routine classification, larger models handle ambiguity. Confidence scores are stored with every label. The original post text is always preserved, so reclassification is non-destructive.
Initial pass: is this a generation prompt? Filter out non-prompt posts, determine modality (image vs. video), and flag ambiguous cases for escalation.
Core classification: extract the clean prompt, detect model, assign category + technique, tag themes and art styles, identify reference requirements. Uses structured tool-use output with strict enum validation.
Escalation pass: complex or ambiguous prompts handed up from the core (Sonnet) tier, including multi-technique workflows, unfamiliar tools, non-English content, and prompts requiring deeper contextual reasoning.
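The routing between tiers can be sketched as a small function: accept a label only if it passes strict enum validation and clears a confidence threshold, otherwise escalate to a larger model. The enum values, threshold, and function name are illustrative assumptions, not the pipeline's actual implementation:

```python
# Hypothetical routing logic for the tiered pipeline. The label set
# and the 0.8 threshold are assumptions for illustration.
VALID_LABELS = {
    "image_t2i", "image_i2i", "image_character_ref",
    "video_t2v", "video_i2v", "video_v2v",
}

def route(label: str, confidence: float, threshold: float = 0.8) -> str:
    """Decide whether a smaller model's classification stands."""
    if label not in VALID_LABELS:   # strict enum validation: unknown
        return "escalate"           # labels go to the larger model
    if confidence < threshold:      # ambiguous cases also escalate
        return "escalate"
    return "accept"
```

Because the confidence score is stored alongside every label and the original post text is preserved, any entry that was accepted under one threshold can be re-routed later without data loss.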
Included: Image generation (text-to-image, image-to-image, character references, reference-guided generation) and video generation (text-to-video, image-to-video, reference-to-video, video-to-video).
Excluded: LLM / text-generation prompts, audio generation, NSFW and sexualized content. This is a generative media dataset focused on craft-oriented prompting.
Ingestion is rolling — new prompts arrive as practitioners share them. Reddit communities get periodic sweeps. The dataset is a living collection, not a static release. Every entry carries timestamps (bookmarked_at, created_at) so you can slice by time period.
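Slicing by time period reduces to filtering on those timestamps. A minimal sketch, assuming entries are dicts carrying ISO-8601 strings in a `bookmarked_at` field (the real storage format may differ):

```python
# Hypothetical time-window filter over dataset entries. The sample
# records and the half-open [start, end) convention are assumptions.
from datetime import datetime

entries = [
    {"prompt": "neon city at dusk", "bookmarked_at": "2025-03-14T09:30:00"},
    {"prompt": "watercolor fox", "bookmarked_at": "2025-06-02T18:05:00"},
]

def in_window(entry: dict, start: str, end: str) -> bool:
    """True if the entry was bookmarked within [start, end)."""
    ts = datetime.fromisoformat(entry["bookmarked_at"])
    return datetime.fromisoformat(start) <= ts < datetime.fromisoformat(end)

# Entries bookmarked in Q2 2025:
q2 = [e for e in entries if in_window(e, "2025-04-01T00:00:00", "2025-07-01T00:00:00")]
```

The same filter works against `created_at` to slice by when a prompt was originally posted rather than when it was ingested.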
The short version: survivorship bias, platform skew, LLM classifier errors, and engagement-as-proxy-for-quality are all real concerns. We document each one honestly.
Full limitations with mitigations on the dataset page →

@dataset{ummerr_prompts_2025,
  title   = {ummerr/prompts: An In-the-Wild Generative AI Prompt Dataset},
  author  = {ummerr},
  year    = {2025},
  url     = {https://prompts.ummerr.com/dataset},
  note    = {Organic prompts sourced from high-engagement posts on X/Twitter.
             Covers image and video generation with structured metadata,
             model attribution, and technique labels.},
  license = {CC BY 4.0}
}