Dataset · CC BY 4.0 · Multi-Modal · In-the-Wild · Engagement-Filtered

ummerr/prompts

A corpus of organic, in-the-wild generative AI prompts, sourced primarily from high-engagement posts on X/Twitter and covering image and video generation. Every entry reflects a real practitioner decision: what to generate, how to phrase it, and which model to use. High engagement acts as an organic quality signal - these prompts were judged worth sharing and resharing by the practitioner community. See methodology for classification details.

image-generation · video-generation · in-the-wild-prompts · practitioner-behavior · engagement-filtered
Downloads: JSONL · CSV · JSON. Licensed CC BY 4.0 - cite as ummerr/prompts.
Research Export (JSONL): strips author PII and raw post text for X/Twitter ToS and GDPR compliance. Researchers can rehydrate original posts via tweet_id.
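A minimal sketch of consuming the research export, assuming one JSON record per line and the tweet_id field described in the schema below (load_export and ids_for_rehydration are illustrative helper names, not part of any published tooling):

```python
import json

def load_export(path):
    """Load the research-export JSONL: one JSON record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def ids_for_rehydration(records):
    """Collect tweet_ids so original posts can be re-fetched via the X API."""
    return [r["tweet_id"] for r in records if r.get("tweet_id")]
```

Rehydration itself goes through the platform's tweet-lookup endpoints, subject to the researcher's own API access.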

Why This Dataset Exists

Established prompt datasets — DrawBench, PartiPrompts, T2I-CompBench — were designed for model evaluation: synthetic or crowdsourced prompts used to benchmark generation quality. They serve that purpose well. This dataset serves a different one.

This collection captures the organic prompt distribution from practitioners who actively use generative AI tools and share their results publicly. The selection mechanism — social engagement — is imperfect but meaningful: a prompt that accumulates high view and repost counts has passed a form of community judgment. The goal is to document practitioner behavior, not to compete with evaluation benchmarks.

| Dataset | Size | Source | Modality | Provenance | Engagement | Curated |
|---|---|---|---|---|---|---|
| DrawBench | 200 | Synthetic (LLM) | Image | None | None | 2022 |
| PartiPrompts | 1,632 | Crowdworkers (Google) | Image | None | None | 2022 |
| T2I-CompBench | 6,000 | Synthetic (GPT-4) | Image | None | None | 2023 |
| GenAI-Bench | 1,200 | LLM + human mix | Image + Video | None | None | 2024 |
| EvalCrafter | 700 | LLM + real users | Video | None | None | 2024 |
| VBench | 1,600 | Manual per dimension | Video | None | None | 2024 |
| T2VEval-Bench | 1,783 | LLM + manual | Video | None | Lab MOS | 2025 |
| ummerr/prompts (this) | - | Organic / in-the-wild | Image + Video | Full (URL + author) | Viral filter | Mar 2026 |

Research Applications

In-the-wild prompt distribution

Study what the actual distribution of prompts looks like across modalities, models, and technique types - as opposed to the synthetic or curated distributions used in most benchmarks. Useful for calibrating evaluation sets to real practitioner behavior.
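For example, the category mix can be computed directly from the prompt_category field - a sketch assuming the JSONL records have been loaded into a list of dicts:

```python
from collections import Counter

def category_distribution(records):
    """Share of each prompt_category (e.g. image_t2i, video_i2v) in the corpus."""
    counts = Counter(r.get("prompt_category", "unknown") for r in records)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.most_common()}
```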

Engagement as a quality signal

Each entry is sourced from high-engagement posts (high views, reposts, saves). This creates a weak but organic quality label: prompts that practitioners found compelling enough to share and reshare. Researchers can study whether engagement correlates with automated quality metrics.

Multi-modal prompt structure analysis

The dataset covers image and video generation with structured technique labels. Most existing prompt datasets are image-only. This enables cross-modal comparison: how does a T2V prompt differ structurally from a T2I prompt for the same subject?

Model-conditioned prompt analysis

Each entry includes a detected model field. Researchers can study how prompt style, length, technique invocation, and reference usage vary across models - Midjourney vs. FLUX vs. Kling vs. Sora - and how practitioner prompting strategies adapt to model capabilities.
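One such analysis - mean prompt length per model - can be sketched from the detected_model and extracted_prompt fields (prompt_length_by_model is an illustrative helper, not dataset tooling):

```python
from collections import defaultdict

def prompt_length_by_model(records):
    """Mean extracted-prompt length (in words) per detected model."""
    buckets = defaultdict(list)
    for r in records:
        model, prompt = r.get("detected_model"), r.get("extracted_prompt")
        if model and prompt:
            buckets[model].append(len(prompt.split()))
    return {m: sum(v) / len(v) for m, v in buckets.items()}
```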

Temporal adoption analysis

bookmarked_at and created_at timestamps enable temporal slicing. Study how the prompt distribution evolves as new models are released, how quickly practitioners adopt new techniques, and how model market share shifts over time in the practitioner community.
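A temporal slice might look like the following sketch. It assumes bookmarked_at values parse with Python's datetime.fromisoformat; the card does not document the exact timestamp format, and 'Z'-suffixed timestamps need Python 3.11+ or pre-processing:

```python
from datetime import datetime

def slice_by_bookmark_date(records, start_iso, end_iso):
    """Return entries whose bookmarked_at falls in [start, end)."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    out = []
    for r in records:
        ts = r.get("bookmarked_at")
        if ts and start <= datetime.fromisoformat(ts) < end:
            out.append(r)
    return out
```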

Reference image usage patterns

The requires_reference and reference_type fields capture which prompts require a reference image as input, and what kind (face, style, subject, pose, background). Useful for studying how practitioners use image conditioning vs. text-only prompting across different task types.
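A sketch of computing reference-usage rates per category from the requires_reference and prompt_category fields:

```python
from collections import defaultdict

def reference_rate_by_category(records):
    """Fraction of entries in each prompt_category that require a reference image."""
    totals, with_ref = defaultdict(int), defaultdict(int)
    for r in records:
        cat = r.get("prompt_category", "unknown")
        totals[cat] += 1
        if r.get("requires_reference"):
            with_ref[cat] += 1
    return {c: with_ref[c] / totals[c] for c in totals}
```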

Task Categories

Image

Text → Image
Image → Image
Reference → Image
Character Ref

Video

Text → Video
Image → Video
Reference → Video
Video → Video

Curation & Collection

Selection mechanism. Posts are identified via X/Twitter search and bookmark capture from practitioner accounts actively sharing AI generation work. Selection is biased toward high-engagement content - posts with substantial view counts, reposts, and saves. This is not a random sample; it is a practitioner-judged quality filter. Prompts that circulated widely did so because other practitioners found them useful, reproducible, or instructive.
Extraction & labeling. Each entry is classified by Claude Sonnet 4.6 using structured tool-use output with strict enum validation. The classifier assigns modality + technique category, detects the target model from post text, extracts the clean prompt (stripping social framing and hashtags), and tags visual themes and art styles. Confidence scores are stored alongside all labels; the original post text is always preserved for reclassification.
Coverage. Image generation (text-to-image, image-to-image, character references, reference-guided generation) and video generation (text-to-video, image-to-video, reference-to-video, video-to-video). No LLM / text-generation prompts - this is a generative media dataset.
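The strict-enum validation step can be approximated as follows. The enum set below is an illustrative guess assembled from the category names on this card - the actual ingestion pipeline's schema is not published:

```python
# Illustrative enum set mirroring this card's category names;
# the real pipeline's enums may differ.
PROMPT_CATEGORIES = {
    "image_t2i", "image_i2i", "image_ref", "image_cref",
    "video_t2v", "video_i2v", "video_ref", "video_v2v",
}

def validate_label(record):
    """Reject classifier output whose fields fall outside the allowed ranges."""
    errors = []
    if record.get("prompt_category") not in PROMPT_CATEGORIES:
        errors.append(f"invalid prompt_category: {record.get('prompt_category')!r}")
    conf = record.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append(f"confidence out of range: {conf!r}")
    return errors
```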

Limitations

Limitation: Selection and survivorship bias

Prompts are drawn exclusively from posts that practitioners chose to share publicly. This systematically over-represents prompts that produced visually impressive or socially shareable results, and under-represents failed attempts, iterative drafts, and everyday utility prompts.

Mitigation

This bias is also the dataset's signal: understanding the distribution of prompts that practitioners consider share-worthy is itself a research question. Sourcing across both Twitter/X and Reddit partially offsets pure aesthetics bias - Reddit communities reward technical depth and reproducibility.

Limitation: Platform and demographic skew

Source content is dominated by English-language posts from Twitter/X and a small set of Reddit communities. Non-English prompts, closed communities, Discord servers, and professional workflows are not represented. The dataset likely reflects the aesthetics and interests of a specific online subculture rather than the broader global practitioner population.

Mitigation

Ingestion spans multiple subreddits with different community cultures (r/midjourney, r/StableDiffusion, r/FluxAI, r/kling_ai), broadening the range of styles, use cases, and practitioner skill levels captured.

Limitation: LLM-assisted classification errors

Category labels, theme tags, art style tags, model attribution, and extracted prompt text are assigned by Claude Sonnet 4.6 - not human annotators. Errors cluster around ambiguous multi-technique prompts, unfamiliar or emerging tools, non-English content, and prompts where the model name is absent from the post text.

Mitigation

The classifier uses structured tool-use output with strict enum validation, reducing free-form hallucination. Confidence scores are stored alongside labels. The raw source text is always preserved, so reclassification is non-destructive.

Limitation: Engagement ≠ controlled quality validation

The dataset records what practitioners shared, not whether the prompt reliably produces good results across seeds, model versions, or hardware configurations. High engagement reflects community judgment, not empirical reproducibility.

Mitigation

Source post URLs and media_urls are retained for every entry, allowing researchers to inspect original post context, attached output media, and community reception. Engagement is best treated as a weak positive label, not a ground-truth quality rating.

Limitation: Temporal and model coverage skew

Ingestion began in 2024 and runs on a rolling basis. Older models (pre-2023) are under-represented; newly released models may lag until ingestion catches up. Coverage of any given model reflects its social media footprint, not its market share or capability.

Mitigation

bookmarked_at and created_at timestamps are preserved for every entry, making temporal filtering straightforward. Model attribution is stored as free-text alongside a normalised canonical slug, so analyses can distinguish between model generations.

Limitation: Content scope exclusions

This dataset focuses on professional and creative prompt engineering. NSFW and sexualized content, which represents a significant share of public image generation activity, has been excluded from collection.

Mitigation

The exclusion is deliberate — the dataset targets craft-oriented prompting rather than exhaustive coverage of all generation use cases. Researchers studying the full distribution of generative AI usage should account for this gap.

Limitation: Near-duplicate prompts

Deduplication is exact-match only on the source post ID. Reposts, quote-tweets, and community re-shares of the same underlying prompt may appear as distinct entries. Downstream fine-tuning or similarity studies should apply semantic deduplication.

Mitigation

Each entry retains its author_handle and tweet_url, making provenance traceable. Semantic deduplication can be applied against the extracted_prompt field, which strips social framing and hashtags to surface the underlying prompt text.
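A cheap first pass at deduplication is to normalise extracted_prompt (lowercase, strip punctuation, collapse whitespace) and collapse exact matches on the normalised form; embedding-based similarity can then handle true paraphrases. A sketch:

```python
import re

def normalize(prompt):
    """Cheap canonical form: lowercase, strip punctuation, collapse whitespace."""
    stripped = re.sub(r"[^\w\s]", "", prompt.lower())
    return re.sub(r"\s+", " ", stripped).strip()

def dedupe_prompts(records):
    """First-seen wins; entries differing only in casing/punctuation collapse."""
    seen = {}
    for r in records:
        key = normalize(r.get("extracted_prompt") or "")
        if key and key not in seen:
            seen[key] = r
    return list(seen.values())
```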

Data Sources

Twitter / X
Primary

Primary source. Bookmarked posts from practitioners sharing AI generation workflows. Filtered for high engagement - views, reposts, and saves. Includes media outputs, threads, and referenced works.

Reddit
Secondary

r/midjourney, r/StableDiffusion, r/FluxAI, r/kling_ai, r/PromptEngineering. Reddit sourcing broadens demographic coverage - communities reward technical reproducibility alongside visual impact.

Manual
Supplementary

Hand-entered prompts from tutorials, blog posts, or community shares not covered by automated ingestion.

Schema

| Field | Description |
|---|---|
| id | Primary key |
| tweet_id | Original post ID - enables deduplication and provenance tracing |
| tweet_text | Full text of the source post (unmodified) |
| author_handle | Platform username of the practitioner who shared the prompt |
| author_name | Display name |
| tweet_url | Canonical URL - links to output media and original engagement context |
| media_urls | Output image/video URLs attached to the post |
| source | Ingestion origin: twitter \| reddit \| manual |
| category | Top-level bucket: prompts \| tech_ai_product \| career_productivity \| uncategorized |
| prompt_category | Modality + technique: image_t2i, video_t2v, video_i2v, audio, etc. |
| extracted_prompt | Clean prompt text extracted from post + comments - social framing stripped |
| detected_model | AI model mentioned in the post, normalised to a canonical slug (e.g. "Midjourney v6.1") |
| prompt_themes | Visual themes: person, cinematic, landscape, scifi, fantasy, etc. |
| art_styles | Art styles: photorealistic, anime, oil_painting, pixel_art, etc. |
| requires_reference | True if the prompt requires a reference image as input |
| reference_type | face_person \| style_artwork \| subject_object \| pose_structure \| scene_background |
| is_thread | True if the post is a multi-tweet thread |
| thread_tweets | Array of {tweet_id, tweet_text} for threaded posts |
| confidence | Classifier confidence score (0–1) |
| rationale | LLM reasoning for the category assignment |
| user_notes | Human curator notes |
| bookmarked_at | When the post was originally bookmarked |
| created_at | Row insertion timestamp |
| updated_at | Last modification timestamp |
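A hypothetical record illustrating the schema - all values are invented for illustration and do not correspond to any real post:

```json
{
  "id": 1042,
  "tweet_id": "1790000000000000000",
  "tweet_text": "Prompt for this one below 👇 #midjourney",
  "author_handle": "example_handle",
  "author_name": "Example Name",
  "tweet_url": "https://x.com/example_handle/status/1790000000000000000",
  "media_urls": ["https://pbs.twimg.com/media/example.jpg"],
  "source": "twitter",
  "category": "prompts",
  "prompt_category": "image_t2i",
  "extracted_prompt": "a red fox in deep snow, golden hour, 35mm film grain",
  "detected_model": "Midjourney v6.1",
  "prompt_themes": ["landscape", "cinematic"],
  "art_styles": ["photorealistic"],
  "requires_reference": false,
  "reference_type": null,
  "is_thread": false,
  "thread_tweets": [],
  "confidence": 0.93,
  "rationale": "Post shares a text-to-image prompt with a Midjourney version tag.",
  "user_notes": null,
  "bookmarked_at": "2025-03-02T18:41:00Z",
  "created_at": "2025-03-02T18:45:12Z",
  "updated_at": "2025-03-02T18:45:12Z"
}
```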

License & Citation

License: Creative Commons Attribution 4.0 (CC BY 4.0)

Free to use, share, and adapt for any purpose - including commercial - with appropriate credit. The original prompt texts remain the intellectual property of their authors; this dataset provides structured metadata for research purposes.

BibTeX
@dataset{ummerr_prompts_2025,
  title        = {ummerr/prompts: An In-the-Wild Generative AI Prompt Dataset},
  author       = {ummerr},
  year         = {2025},
  url          = {https://prompts.ummerr.com/dataset},
  note         = {Organic prompts sourced from high-engagement posts on X/Twitter.
                  Covers image and video generation with structured
                  metadata, model attribution, and technique labels.},
  license      = {CC BY 4.0}
}