Methodology

How this dataset is built

Most prompt datasets exist to evaluate models. This one exists to study practitioners. It captures what people actually type into image and video generators when they're trying to make something worth sharing — then classifies it so you can search, filter, and find patterns across a thousand-plus real-world prompts.

Source

Primary source: high-engagement posts on X/Twitter — the platform where most AI generation work gets shared publicly. Secondary: Reddit communities (r/midjourney, r/StableDiffusion, r/FluxAI, r/kling_ai), which tend to reward technical depth and reproducibility over pure aesthetics. Supplementary: manual entries from tutorials and blog posts not covered by automated ingestion.

“High engagement” means substantial view counts, reposts, and saves. This is a convenience sample filtered by community reception, not a random sample. That's deliberate — the filtering is the signal. A prompt that thousands of practitioners chose to amplify has passed a form of peer review that no annotation rubric can replicate.

Selection bias

This dataset is not representative of all prompting behavior. It skews toward successful outputs shared for engagement — visually impressive, socially shareable results. Failed attempts, iterative drafts, and everyday utility prompts are systematically under-represented. Platform coverage is English-dominant, X/Twitter-heavy. Professional workflows, Discord communities, and non-English practitioners are largely absent.

That said: the bias is the dataset's signal. “What do practitioners consider worth sharing?” is itself a research question, and this dataset is purpose-built to answer it. If your analysis requires unbiased sampling, look elsewhere. If you want to study the social dynamics of prompt craft — what goes viral, what gets copied, what techniques spread — this is the dataset.

Classification schema

Each prompt is classified using a structured schema designed for this dataset. The categories and labels are our own — built to describe practitioner behavior (what technique was used, what model, what visual intent) rather than to evaluate output quality.

The schema has five dimensions:

Category: modality + technique (e.g., video_t2v, image_i2i, image_character_ref)

Visual theme: subject matter (person, cinematic, landscape, sci-fi, fantasy, etc.)

Art style: aesthetic approach (photorealistic, anime, oil painting, pixel art, etc.)

Reference type: whether the prompt requires a reference image and what kind (face, style, subject, pose, scene)

Model family: which AI model is mentioned or inferred (Midjourney, Kling, Flux, etc.)
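The five dimensions can be sketched as a typed record. This is a minimal illustration, not the dataset's actual storage format: the class and field names (`PromptRecord`, `visual_themes`, etc.) and the specific enum values are assumptions drawn from the examples above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Category(str, Enum):
    # Modality + technique labels; values taken from the examples in the schema
    VIDEO_T2V = "video_t2v"
    IMAGE_I2I = "image_i2i"
    IMAGE_CHARACTER_REF = "image_character_ref"

class ReferenceType(str, Enum):
    # Kinds of reference image a prompt may require
    FACE = "face"
    STYLE = "style"
    SUBJECT = "subject"
    POSE = "pose"
    SCENE = "scene"

@dataclass
class PromptRecord:
    prompt_text: str                                        # cleaned prompt text
    category: Category
    visual_themes: list[str] = field(default_factory=list)  # e.g. ["cinematic", "sci-fi"]
    art_styles: list[str] = field(default_factory=list)     # e.g. ["photorealistic"]
    reference_type: Optional[ReferenceType] = None          # None when no reference image is needed
    model_family: Optional[str] = None                      # e.g. "Midjourney", "Kling", "Flux"

rec = PromptRecord(
    prompt_text="cinematic drone shot over a neon city at dusk",
    category=Category.VIDEO_T2V,
    visual_themes=["cinematic", "sci-fi"],
    model_family="Kling",
)
```

Making `reference_type` optional mirrors the schema: most text-to-image and text-to-video prompts need no reference, so absence is a meaningful value rather than missing data.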

Classification pipeline

Each prompt passes through a tiered pipeline of Claude models — smaller models handle routine classification, larger models handle ambiguity. Confidence scores are stored with every label. The original post text is always preserved, so reclassification is non-destructive.

01 · Haiku

Triage

Initial pass: is this a generation prompt? Filter out non-prompt posts, determine modality (image vs. video), and flag ambiguous cases for escalation.

02 · Sonnet

Structured labeling

Core classification: extract the clean prompt, detect model, assign category + technique, tag themes and art styles, identify reference requirements. Uses structured tool-use output with strict enum validation.

03 · Opus

Edge cases

Complex or ambiguous prompts escalated from Sonnet: multi-technique workflows, unfamiliar tools, non-English content, and prompts requiring deeper contextual reasoning.
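The escalation logic above can be sketched as a tier walk that stops once confidence clears a threshold. Everything here is illustrative: the threshold value, the scoring heuristic, and the `classify_with` stub are assumptions; a real implementation would call the Claude API with a tool-use schema and enum validation rather than this toy scorer.

```python
CONFIDENCE_THRESHOLD = 0.8           # assumed cutoff; the real value isn't published
TIERS = ["haiku", "sonnet", "opus"]  # triage -> structured labeling -> edge cases

def classify_with(tier: str, post_text: str) -> tuple[str, float]:
    """Stand-in for a model call. Returns (label, confidence).
    Toy scoring only: longer posts read as 'harder', and larger tiers
    report proportionally higher confidence."""
    difficulty = min(len(post_text) / 400.0, 1.0)
    confidence = 1.0 - difficulty + 0.3 * TIERS.index(tier)
    return "video_t2v", min(confidence, 1.0)

def classify(post_text: str) -> dict:
    """Walk the tiers, escalating while confidence stays below threshold.
    Each label is stored with its confidence and the tier that produced it,
    so reclassification can overwrite labels without touching the post text."""
    for tier in TIERS:
        label, confidence = classify_with(tier, post_text)
        if confidence >= CONFIDENCE_THRESHOLD or tier == TIERS[-1]:
            return {"label": label, "confidence": confidence, "classified_by": tier}

result = classify("a short, unambiguous text-to-video prompt")
```

A short, clear post resolves at the Haiku tier; a long, ambiguous one falls through to Opus as the final arbiter, matching the triage-to-edge-case flow described above.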

Scope

Included: Image generation (text-to-image, image-to-image, character references, reference-guided generation) and video generation (text-to-video, image-to-video, reference-to-video, video-to-video).

Excluded: LLM / text-generation prompts, audio generation, NSFW and sexualized content. This is a generative media dataset focused on craft-oriented prompting.

Update cadence

Ingestion is rolling — new prompts arrive as practitioners share them. Reddit communities get periodic sweeps. The dataset is a living collection, not a static release. Every entry carries timestamps (bookmarked_at, created_at) so you can slice by time period.
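Slicing by time period reduces to a timestamp filter. The records below are invented for illustration; only the `created_at` field name comes from the source (entries also carry `bookmarked_at`), and ISO 8601 storage is an assumption.

```python
from datetime import datetime, timezone

# Toy records; real entries carry bookmarked_at and created_at timestamps
records = [
    {"prompt": "oil painting of a fox", "created_at": "2025-03-14T09:00:00+00:00"},
    {"prompt": "anime city at night",   "created_at": "2025-07-02T18:30:00+00:00"},
]

def in_window(record: dict, start: datetime, end: datetime) -> bool:
    """True if the record's created_at falls in [start, end)."""
    ts = datetime.fromisoformat(record["created_at"])
    return start <= ts < end

# All prompts created in Q3 2025
q3_start = datetime(2025, 7, 1, tzinfo=timezone.utc)
q3_end = datetime(2025, 10, 1, tzinfo=timezone.utc)
q3 = [r for r in records if in_window(r, q3_start, q3_end)]
```

Half-open windows (`start <= ts < end`) keep adjacent quarters from double-counting boundary entries.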

Known limitations

The short version: survivorship bias, platform skew, LLM classifier errors, and engagement-as-proxy-for-quality are all real concerns. We document each one honestly.

Full limitations with mitigations on the dataset page →

Citation

BibTeX
@dataset{ummerr_prompts_2025,
  title        = {ummerr/prompts: An In-the-Wild Generative AI Prompt Dataset},
  author       = {ummerr},
  year         = {2025},
  url          = {https://prompts.ummerr.com/dataset},
  note         = {Organic prompts sourced from high-engagement posts on X/Twitter.
                  Covers image and video generation with structured
                  metadata, model attribution, and technique labels.},
  license      = {CC BY 4.0}
}