Dataset · CC BY 4.0 · Multi-Modal · In-the-Wild · Engagement-Filtered

ummerr/prompts

A corpus of organic, in-the-wild generative AI prompts, sourced primarily from high-engagement posts on X/Twitter and covering image and video generation. Every entry reflects a real practitioner decision: what to generate, how to phrase it, and which model to use. High engagement acts as an organic quality signal - these prompts were judged worth sharing and resharing by the practitioner community. See methodology for classification details.

image-generation · video-generation · in-the-wild-prompts · practitioner-behavior · engagement-filtered
Downloads: JSONL · CSV · JSON. Licensed CC BY 4.0 - cite as ummerr/prompts.
Research Export (JSONL): strips author PII and raw post text for X/Twitter ToS and GDPR compliance. Researchers can rehydrate original posts via tweet_id.
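A minimal sketch of consuming the research export, assuming one JSON record per line and the tweet_id field described in the schema below (load_export and ids_for_rehydration are illustrative helper names, not part of any published tooling):

```python
import json

def load_export(path):
    """Load the research-export JSONL: one JSON record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def ids_for_rehydration(records):
    """Collect tweet_ids so original posts can be re-fetched via the X API."""
    return [r["tweet_id"] for r in records if r.get("tweet_id")]
```

Rehydration itself goes through the platform's tweet-lookup endpoints, subject to the researcher's own API access.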

Why This Dataset Exists

Established prompt datasets — DrawBench, PartiPrompts, T2I-CompBench — were designed for model evaluation: synthetic or crowdsourced prompts used to benchmark generation quality. They serve that purpose well. This dataset serves a different one.

This collection captures the organic prompt distribution from practitioners who actively use generative AI tools and share their results publicly. The selection mechanism — social engagement — is imperfect but meaningful: a prompt that accumulates high view and repost counts has passed a form of community judgment. The goal is to document practitioner behavior, not to compete with evaluation benchmarks.

| Dataset | Size | Source | Modality | Provenance | Engagement | Curated |
|---|---|---|---|---|---|---|
| DrawBench | 200 | Synthetic (LLM) | Image | None | None | 2022 |
| PartiPrompts | 1,632 | Crowdworkers (Google) | Image | None | None | 2022 |
| T2I-CompBench | 6,000 | Synthetic (GPT-4) | Image | None | None | 2023 |
| GenAI-Bench | 1,200 | LLM + human mix | Image + Video | None | None | 2024 |
| EvalCrafter | 700 | LLM + real users | Video | None | None | 2024 |
| VBench | 1,600 | Manual per dimension | Video | None | None | 2024 |
| T2VEval-Bench | 1,783 | LLM + manual | Video | None | Lab MOS | 2025 |
| ummerr/prompts (this) | - | Organic / in-the-wild | Image + Video | Full (URL + author) | Viral filter | Mar 2026 |

Research Applications

In-the-wild prompt distribution

Study what the actual distribution of prompts looks like across modalities, models, and technique types - as opposed to the synthetic or curated distributions used in most benchmarks. Useful for calibrating evaluation sets to real practitioner behavior.
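For example, the category mix can be computed directly from the prompt_category field - a sketch assuming the JSONL records have been loaded into a list of dicts:

```python
from collections import Counter

def category_distribution(records):
    """Share of each prompt_category (e.g. image_t2i, video_i2v) in the corpus."""
    counts = Counter(r.get("prompt_category", "unknown") for r in records)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.most_common()}
```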

Engagement as a quality signal

Each entry is sourced from high-engagement posts (high views, reposts, saves). This creates a weak but organic quality label: prompts that practitioners found compelling enough to share and reshare. Researchers can study whether engagement correlates with automated quality metrics.

Multi-modal prompt structure analysis

The dataset covers image and video generation with structured technique labels. Most existing prompt datasets are image-only. This enables cross-modal comparison: how does a T2V prompt differ structurally from a T2I prompt for the same subject?

Model-conditioned prompt analysis

Each entry includes a detected model field. Researchers can study how prompt style, length, technique invocation, and reference usage vary across models - Midjourney vs. FLUX vs. Kling vs. Sora - and how practitioner prompting strategies adapt to model capabilities.
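One such analysis - mean prompt length per model - can be sketched from the detected_model and extracted_prompt fields (prompt_length_by_model is an illustrative helper, not dataset tooling):

```python
from collections import defaultdict

def prompt_length_by_model(records):
    """Mean extracted-prompt length (in words) per detected model."""
    buckets = defaultdict(list)
    for r in records:
        model, prompt = r.get("detected_model"), r.get("extracted_prompt")
        if model and prompt:
            buckets[model].append(len(prompt.split()))
    return {m: sum(v) / len(v) for m, v in buckets.items()}
```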

Temporal adoption analysis

bookmarked_at and created_at timestamps enable temporal slicing. Study how the prompt distribution evolves as new models are released, how quickly practitioners adopt new techniques, and how model market share shifts over time in the practitioner community.
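A temporal slice might look like the following sketch. It assumes bookmarked_at values parse with Python's datetime.fromisoformat; the card does not document the exact timestamp format, and 'Z'-suffixed timestamps need Python 3.11+ or pre-processing:

```python
from datetime import datetime

def slice_by_bookmark_date(records, start_iso, end_iso):
    """Return entries whose bookmarked_at falls in [start, end)."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    out = []
    for r in records:
        ts = r.get("bookmarked_at")
        if ts and start <= datetime.fromisoformat(ts) < end:
            out.append(r)
    return out
```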

Reference image usage patterns

The requires_reference and reference_type fields capture which prompts require a reference image as input, and what kind (face, style, subject, pose, background). Useful for studying how practitioners use image conditioning vs. text-only prompting across different task types.
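A sketch of computing reference-usage rates per category from the requires_reference and prompt_category fields:

```python
from collections import defaultdict

def reference_rate_by_category(records):
    """Fraction of entries in each prompt_category that require a reference image."""
    totals, with_ref = defaultdict(int), defaultdict(int)
    for r in records:
        cat = r.get("prompt_category", "unknown")
        totals[cat] += 1
        if r.get("requires_reference"):
            with_ref[cat] += 1
    return {c: with_ref[c] / totals[c] for c in totals}
```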

Task Categories

Image

Text → Image
Image → Image
Reference → Image
Character Ref

Video

Text → Video
Image → Video
Reference → Video
Video → Video

Curation & Collection

Selection mechanism. Posts are identified via X/Twitter search and bookmark capture from practitioner accounts actively sharing AI generation work. Selection is biased toward high-engagement content - posts with substantial view counts, reposts, and saves. This is not a random sample; it is a practitioner-judged quality filter. Prompts that circulated widely did so because other practitioners found them useful, reproducible, or instructive.
Extraction & labeling. Each entry is classified by Claude Sonnet 4.6 using structured tool-use output with strict enum validation. The classifier assigns modality + technique category, detects the target model from post text, extracts the clean prompt (stripping social framing and hashtags), and tags visual themes and art styles. Confidence scores are stored alongside all labels; the original post text is always preserved for reclassification.
Coverage. Image generation (text-to-image, image-to-image, character references, reference-guided generation) and video generation (text-to-video, image-to-video, reference-to-video, video-to-video). No LLM / text-generation prompts - this is a generative media dataset.
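The strict-enum validation step can be approximated as follows. The enum set below is an illustrative guess assembled from the category names on this card - the actual ingestion pipeline's schema is not published:

```python
# Illustrative enum set mirroring this card's category names;
# the real pipeline's enums may differ.
PROMPT_CATEGORIES = {
    "image_t2i", "image_i2i", "image_ref", "image_cref",
    "video_t2v", "video_i2v", "video_ref", "video_v2v",
}

def validate_label(record):
    """Reject classifier output whose fields fall outside the allowed ranges."""
    errors = []
    if record.get("prompt_category") not in PROMPT_CATEGORIES:
        errors.append(f"invalid prompt_category: {record.get('prompt_category')!r}")
    conf = record.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append(f"confidence out of range: {conf!r}")
    return errors
```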

Limitations

Limitation: Selection and survivorship bias

Prompts are drawn exclusively from posts that practitioners chose to share publicly. This systematically over-represents prompts that produced visually impressive or socially shareable results, and under-represents failed attempts, iterative drafts, and everyday utility prompts.

Mitigation

This bias is also the dataset's signal: understanding the distribution of prompts that practitioners consider share-worthy is itself a research question. Sourcing across both Twitter/X and Reddit partially offsets pure aesthetics bias - Reddit communities reward technical depth and reproducibility.

Limitation: Platform and demographic skew

Source content is dominated by English-language posts from Twitter/X and a small set of Reddit communities. Non-English prompts, closed communities, Discord servers, and professional workflows are not represented. The dataset likely reflects the aesthetics and interests of a specific online subculture rather than the broader global practitioner population.

Mitigation

Ingestion spans multiple subreddits with different community cultures (r/midjourney, r/StableDiffusion, r/FluxAI, r/kling_ai), broadening the range of styles, use cases, and practitioner skill levels captured.

Limitation: LLM-assisted classification errors

Category labels, theme tags, art style tags, model attribution, and extracted prompt text are assigned by Claude Sonnet 4.6 - not human annotators. Errors cluster around ambiguous multi-technique prompts, unfamiliar or emerging tools, non-English content, and prompts where the model name is absent from the post text.

Mitigation

The classifier uses structured tool-use output with strict enum validation, reducing free-form hallucination. Confidence scores are stored alongside labels. The raw source text is always preserved, so reclassification is non-destructive.

Limitation: Engagement ≠ controlled quality validation

The dataset records what practitioners shared, not whether the prompt reliably produces good results across seeds, model versions, or hardware configurations. High engagement reflects community judgment, not empirical reproducibility.

Mitigation

Source post URLs and media_urls are retained for every entry, allowing researchers to inspect original post context, attached output media, and community reception. Engagement is best treated as a weak positive label, not a ground-truth quality rating.

Limitation: Temporal and model coverage skew

Ingestion began in 2024 and runs on a rolling basis. Older models (pre-2023) are under-represented; newly released models may lag until ingestion catches up. Coverage of any given model reflects its social media footprint, not its market share or capability.

Mitigation

bookmarked_at and created_at timestamps are preserved for every entry, making temporal filtering straightforward. Model attribution is stored as free-text alongside a normalised canonical slug, so analyses can distinguish between model generations.

Limitation: Content scope exclusions

This dataset focuses on professional and creative prompt engineering. NSFW and sexualized content, which represents a significant share of public image generation activity, has been excluded from collection.

Mitigation

The exclusion is deliberate — the dataset targets craft-oriented prompting rather than exhaustive coverage of all generation use cases. Researchers studying the full distribution of generative AI usage should account for this gap.

Limitation: Near-duplicate prompts

Deduplication is exact-match only on the source post ID. Reposts, quote-tweets, and community re-shares of the same underlying prompt may appear as distinct entries. Downstream fine-tuning or similarity studies should apply semantic deduplication.

Mitigation

Each entry retains its author_handle and tweet_url, making provenance traceable. Semantic deduplication can be applied against the extracted_prompt field, which strips social framing and hashtags to surface the underlying prompt text.
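A cheap first pass at deduplication is to normalise extracted_prompt (lowercase, strip punctuation, collapse whitespace) and collapse exact matches on the normalised form; embedding-based similarity can then handle true paraphrases. A sketch:

```python
import re

def normalize(prompt):
    """Cheap canonical form: lowercase, strip punctuation, collapse whitespace."""
    stripped = re.sub(r"[^\w\s]", "", prompt.lower())
    return re.sub(r"\s+", " ", stripped).strip()

def dedupe_prompts(records):
    """First-seen wins; entries differing only in casing/punctuation collapse."""
    seen = {}
    for r in records:
        key = normalize(r.get("extracted_prompt") or "")
        if key and key not in seen:
            seen[key] = r
    return list(seen.values())
```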

Data Sources

Twitter / X
Primary

Primary source. Bookmarked posts from practitioners sharing AI generation workflows. Filtered for high engagement - views, reposts, and saves. Includes media outputs, threads, and referenced works.

Reddit
Secondary

r/midjourney, r/StableDiffusion, r/FluxAI, r/kling_ai, r/PromptEngineering. Reddit sourcing broadens demographic coverage - communities reward technical reproducibility alongside visual impact.

Manual
Supplementary

Hand-entered prompts from tutorials, blog posts, or community shares not covered by automated ingestion.

Schema

| Field | Description |
|---|---|
| id | Primary key |
| tweet_id | Original post ID - enables deduplication and provenance tracing |
| tweet_text | Full text of the source post (unmodified) |
| author_handle | Platform username of the practitioner who shared the prompt |
| author_name | Display name |
| tweet_url | Canonical URL - links to output media and original engagement context |
| media_urls | Output image/video URLs attached to the post |
| source | Ingestion origin: twitter \| reddit \| manual |
| category | Top-level bucket: prompts \| tech_ai_product \| career_productivity \| uncategorized |
| prompt_category | Modality + technique: image_t2i, video_t2v, video_i2v, audio, etc. |
| extracted_prompt | Clean prompt text extracted from post + comments - social framing stripped |
| detected_model | AI model mentioned in the post, normalised to a canonical slug (e.g. "Midjourney v6.1") |
| prompt_themes | Visual themes: person, cinematic, landscape, scifi, fantasy, etc. |
| art_styles | Art styles: photorealistic, anime, oil_painting, pixel_art, etc. |
| requires_reference | True if the prompt requires a reference image as input |
| reference_type | face_person \| style_artwork \| subject_object \| pose_structure \| scene_background |
| is_thread | True if the post is a multi-tweet thread |
| thread_tweets | Array of {tweet_id, tweet_text} for threaded posts |
| confidence | Classifier confidence score (0–1) |
| rationale | LLM reasoning for the category assignment |
| user_notes | Human curator notes |
| bookmarked_at | When the post was originally bookmarked |
| created_at | Row insertion timestamp |
| updated_at | Last modification timestamp |
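A hypothetical record illustrating the schema - all values are invented for illustration and do not correspond to any real post:

```json
{
  "id": 1042,
  "tweet_id": "1790000000000000000",
  "tweet_text": "Prompt for this one below 👇 #midjourney",
  "author_handle": "example_handle",
  "author_name": "Example Name",
  "tweet_url": "https://x.com/example_handle/status/1790000000000000000",
  "media_urls": ["https://pbs.twimg.com/media/example.jpg"],
  "source": "twitter",
  "category": "prompts",
  "prompt_category": "image_t2i",
  "extracted_prompt": "a red fox in deep snow, golden hour, 35mm film grain",
  "detected_model": "Midjourney v6.1",
  "prompt_themes": ["landscape", "cinematic"],
  "art_styles": ["photorealistic"],
  "requires_reference": false,
  "reference_type": null,
  "is_thread": false,
  "thread_tweets": [],
  "confidence": 0.93,
  "rationale": "Post shares a text-to-image prompt with a Midjourney version tag.",
  "user_notes": null,
  "bookmarked_at": "2025-03-02T18:41:00Z",
  "created_at": "2025-03-02T18:45:12Z",
  "updated_at": "2025-03-02T18:45:12Z"
}
```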

License & Citation

License: Creative Commons Attribution 4.0 (CC BY 4.0)

Free to use, share, and adapt for any purpose - including commercial - with appropriate credit. The original prompt texts remain the intellectual property of their authors; this dataset provides structured metadata for research purposes.

BibTeX
@dataset{ummerr_prompts_2025,
  title        = {ummerr/prompts: An In-the-Wild Generative AI Prompt Dataset},
  author       = {ummerr},
  year         = {2025},
  url          = {https://prompts.ummerr.com/dataset},
  note         = {Organic prompts sourced from high-engagement posts on X/Twitter.
                  Covers image and video generation with structured
                  metadata, model attribution, and technique labels.},
  license      = {CC BY 4.0}
}