366 decode tests across real WordPress renditions and Cloudinary CDN transforms. Three watermark candidates. Every number — including the ones we didn't expect.
Our lab benchmarks showed promising watermark recovery rates, but lab conditions don't reflect reality. We needed to know: what actually survives when images pass through real WordPress and CDN infrastructure? Where is the hard limit, and can a different architecture overcome it?
The 476 total tests across this paper break down as: 366 real-world pipeline tests (core benchmark), 55 controlled thumbnail tests (TrustMark Q, software resize), and 55 WAM benchmark tests. All use the same 11 original images.
| Candidate | Lab Result | Real-World Result | Delta |
|---|---|---|---|
| Baseline BCH_5 / Q / single-pass | 94% (114/121) | 89% (109/122) | −5% |
| token40_Q BCH_SUPER / Q / multi-pass | 98% (119/121) | 91% (111/122) | −7% |
| token40_P BCH_SUPER / P / multi-pass | 97% (117/121) | 75% (91/122) | −22% |
token40_Q (BCH_SUPER encoding, Q variant, multi-pass decode) is now the production default for all new marks. The P variant remains available as a specialized "crop-resilient" mode for use cases where heavy cropping is expected (social media thumbnails, e-commerce product shots).
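The multi-pass decode strategy mentioned above can be sketched as follows. This is a minimal illustration only: the function names, call signatures, and stand-in decoders are assumptions for this sketch, not the actual MarkMyAI internals.

```python
def multi_pass_decode(image_bytes, decoders):
    """Try each decoder in priority order; return the first hit.

    `decoders` is an ordered list of (name, decode_fn) pairs,
    e.g. BCH_SUPER first, then the legacy BCH_5 fallback.
    """
    for name, decode_fn in decoders:
        token = decode_fn(image_bytes)
        if token is not None:
            return name, token  # first successful pass wins
    return None, None  # no watermark recovered


# Illustrative stand-in decoders (real ones would call the watermark model):
decoders = [
    ("BCH_SUPER", lambda img: "a1b2c3d4e5" if b"marked" in img else None),
    ("BCH_5", lambda img: "legacy-id" if b"legacy" in img else None),
]
```

The ordering matters: new marks decode on the first pass, while legacy BCH_5 marks only cost the extra fallback attempt.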
v1.1 · March 13, 2026
150×150 limit confirmed as physical. Controlled benchmark (55 tests) + wm_strength=1.5 spot-test (0/4 improvement). The 300px floor is the reliable recovery boundary for TrustMark Q. → Chapter 05
v1.2 · March 13, 2026
WAM (Meta AI, ICLR 2025) overcomes the limit: 10/11 (91%) at 150×150 vs. TrustMark's 7/11 (64%), +27%. All 4 TrustMark-specific failures resolved. WAM identified as the v2 embedding architecture. → Chapter 06
How we set up the benchmark, what infrastructure we used, and why this test design matters.
We marked 11 original images with each of 3 watermark candidates, producing 33 marked images. Each was uploaded to real WordPress and Cloudinary infrastructure, where automatic processing generated multiple renditions. The 11 images were deliberately selected for diversity: landscapes, portraits, abstract, text-heavy, low-contrast, mixed JPEG and PNG, and a wide resolution range (667px to 5824px longest side). This diversity is intentional — it stress-tests the watermark across the image characteristics most likely to affect signal retention.
| Parameter | Value |
|---|---|
| Original images | 11 (from test_series/00_series_02/test_originals) |
| Candidates | 3 (baseline_q_bch5_single, token40_q_super_multi, token40_p_super_multi) |
| Marked images uploaded | 33 (11 × 3) |
| WordPress renditions per image | 7 (thumbnail, medium, medium_large, large, 1536×1536, 2048×2048, full) |
| Cloudinary transforms per image | 5 (w1200_q85_jpg, w800_q60_jpg, w800_q80_webp, w800_h800_crop, w400_q70_jpg) |
| Real-world pipeline decode tests | 366 (core benchmark) |
| Controlled thumbnail tests (TrustMark Q, March 13) | 55 (11 × 5 sizes, software-only) |
| WAM benchmark (March 13) | 55 (11 × 5 sizes, Colab T4 GPU) |
| Total decode tests (all runs) | 476 |
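As a rough sketch of how the proportional WordPress renditions come about (sizes other than the thumbnail fit the image inside a maximum box without upscaling; the thumbnail is the exception, a hard 150×150 crop), the scaling step approximates to:

```python
def wp_fit(width, height, max_w, max_h):
    """Proportionally fit an image inside max_w x max_h, never upscaling.

    Approximates WordPress's constrained-resize behavior for the
    non-cropping sizes (medium, medium_large, large, 1536x1536, 2048x2048);
    this is an illustration, not the actual WordPress code.
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)


# A 4000x3000 original yields a 1024x768 "large" rendition:
print(wp_fit(4000, 3000, 1024, 1024))  # (1024, 768)
# A 667px original is already below the 2048 cap, so no scaling occurs:
print(wp_fit(667, 500, 2048, 2048))    # (667, 500)
```

This also explains the varying counts in the tables below: the 1536 and 2048 renditions simply don't exist for originals smaller than those caps.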
**WordPress**
- Host: ki-welt.ch (production WP 6.x)
- Theme: standard theme
- MarkMyAI plugin: deactivated (to avoid double-marking)
- Upload: REST API with Bearer auth

**Cloudinary**
- Tier: Free
- Transforms: 5 real-world presets
- Formats: JPEG, WebP
- Upload: programmatic API
Lab benchmarks apply transformations one at a time in controlled sequence. Real-world pipelines apply compound transformations: WordPress might resize, re-encode to JPEG at a different quality, strip metadata, and change color profiles — all in one step. Cloudinary adds its own format conversion, quality optimization, and smart cropping.
For each decoded rendition, we recorded one of three outcomes:
| Outcome | Definition |
|---|---|
| verified | Watermark recovered and matches the expected token |
| partial | Watermark signal detected but token match inconclusive |
| not_found | No watermark signal detected in the decoded image |
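A minimal sketch of how a decode attempt maps onto these three outcomes (the function and argument names are illustrative, not the actual benchmark code):

```python
def classify_outcome(decoded_token, expected_token):
    """Map a single decode attempt to verified / partial / not_found."""
    if decoded_token is None:
        return "not_found"   # no watermark signal detected at all
    if decoded_token == expected_token:
        return "verified"    # recovered token matches the expected one
    return "partial"         # signal present, but token match inconclusive
```

Only `verified` counts as a success in the recovery figures throughout this chapter.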
WordPress automatically generates 7 image sizes from each upload. Here's how each watermark candidate performed across all renditions.
| Rendition | Dimensions | Baseline | token40_Q | token40_P |
|---|---|---|---|---|
| full | Original | 11/11 | 11/11 | 10/11 |
| 2048×2048 | max 2048 px | 6/6 | 6/6 | 6/6 |
| 1536×1536 | max 1536 px | 8/8 | 8/8 | 8/8 |
| large | max 1024 px | 10/10 | 10/10 | 10/10 |
| medium_large | max 768 px | 10/10 | 10/10 | 10/10 |
| medium | max 300 px | 10/11 | 11/11 | 3/11 |
| thumbnail | 150×150 crop | 3/11 | 4/11 (real WP) · 7/11 software-only¹ | 1/11 |
Visual summary — recovery rates for all three candidates across WordPress rendition sizes:
Note: 2048×2048 and 1536×1536 renditions are only generated for images larger than those dimensions. Counts reflect actually generated renditions.
¹ Software-only benchmark (March 13, 2026): 11 images freshly marked with token40_Q, resized via PIL LANCZOS (no JPEG re-encode). The 64%→36% gap between software-only and real WordPress is attributable to WordPress's JPEG re-encode at ~Q80 quality.
Cloudinary is one of the most popular image CDNs. We tested five common transform presets that reflect real production usage.
| Transform | Description | Baseline | token40_Q | token40_P |
|---|---|---|---|---|
| w1200_q85_jpg | Hero image | 11/11 | 11/11 | 10/11 |
| w800_q60_jpg | Aggressive JPEG | 11/11 | 11/11 | 9/11 |
| w800_q80_webp | Standard WebP | 11/11 | 11/11 | 10/11 |
| w800_h800_crop | Square crop (center) | 8/11 | 8/11 | 10/11 |
| w400_q70_jpg | Small thumbnail | 10/11 | 10/11 | 4/11 |
This benchmark reveals a fundamental trade-off in watermark design. TrustMark [1] uses a GAN-based encoder-decoder architecture where the embedding strategy differs between variants:
**Q variant:** Concentrates the watermark signal in frequency-domain features (DCT coefficient perturbations) that survive lossy quantization. This makes Q robust against JPEG/WebP compression, resizing, and format conversion, which are exactly the transforms CMS and CDN pipelines apply.

Best for: standard publishing, CMS, CDN delivery, blog images.

**P variant:** Distributes the signal across redundant spatial blocks, so that even if large portions of the image are cropped away, surviving blocks still carry enough payload for recovery. The trade-off: each block needs a minimum resolution to be decoded, which explains P's collapse below ~400px.

Best for: social media thumbnails, e-commerce product shots where cropping is the primary transform.
Not all images are equal. Some subjects carry watermark signals better than others. Here's the full breakdown.
| Image | Baseline | token40_Q | token40_P | Notes |
|---|---|---|---|---|
| pic_01 | 11/12 | 10/12 | 10/12 | Consistent across candidates |
| pic_02 | 9/10 | 10/10 | 2/10 | P extremely weak, Q perfect |
| pic_03 | 11/11 | 11/11 | 8/11 | Q maintains perfection |
| pic_04 | 6/8 | 6/8 | 6/8 | Difficult image for all |
| pic_05 | 12/12 | 11/12 | 11/12 | Near-perfect across all |
| pic_06 | 11/12 | 11/12 | 10/12 | Slight P weakness |
| pic_07 | 9/11 | 10/11 | 8/11 | Q fixes 1 baseline failure |
| pic_08 | 9/12 | 11/12 | 10/12 | Q fixes 2 baseline failures |
| pic_09 | 12/12 | 12/12 | 11/12 | Best image overall |
| pic_10 | 10/12 | 10/12 | 9/12 | P slightly weaker |
| pic_11 | 9/10 | 9/10 | 6/10 | P significant drop |
pic_09: All candidates near-perfect. High-contrast landscape with rich texture — ideal for watermark embedding.
pic_05: 12/12 for baseline, 11/12 for Q and P. Well-balanced content.
pic_04: Only 6/8 for all three candidates. Low-frequency content with large uniform areas — difficult for any watermark method.
pic_02: P variant catastrophic failure (2/10), while Q achieves perfect 10/10.
What we decided based on these results, and how it affects the product.
Based on 366 real-world decode tests across WordPress and Cloudinary infrastructure, token40_Q (BCH_SUPER encoding, Q variant, multi-pass decode) has been selected as the production default.
| Criterion | Result |
|---|---|
| Higher overall recovery than baseline? | Yes — 91% vs 89% |
| Worse than baseline on any transform? | Never |
| Closes known gaps? | Yes — fixes medium (300px) from 10/11 to 11/11 |
| Multi-pass decode overhead acceptable? | Yes — decode ~2× slower but still < 5s |
| Aspect | Before | After |
|---|---|---|
| Encoding | BCH_5 (61-bit mark_id) | BCH_SUPER (40-bit token) |
| Token derivation | mark_id embedded directly | SHA-256(mark_id)[0:10] → 40-bit token |
| Default variant | Q only | Q (default) + P (crop-resilient, opt-in) |
| Decode strategy | Single-pass | Multi-pass (try BCH_SUPER first, fallback to BCH_5) |
| Backward compatibility | — | Legacy marks remain decodable via fallback |
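The token derivation in the table above (SHA-256 of the mark_id, truncated to the first 10 hex characters, i.e. 40 bits) can be sketched as follows. The function name and the UTF-8 encoding of mark_id are assumptions of this sketch:

```python
import hashlib


def derive_token40(mark_id: str) -> str:
    """First 10 hex chars of SHA-256(mark_id) = a 40-bit token."""
    return hashlib.sha256(mark_id.encode("utf-8")).hexdigest()[:10]


token = derive_token40("example-mark-id")
print(token)  # 10 hex characters, deterministic for a given mark_id
```

Because the token is a deterministic hash truncation, the 40-bit payload can always be re-derived from the stored mark_id for verification.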
The watermark variant is controlled via the `wm_variant` parameter on `POST /v1/mark`:
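For example, a request payload might look like the following. This is a hedged sketch: only the endpoint path and the `wm_variant` parameter come from this document; the other field names and values are illustrative placeholders.

```python
import json


def build_mark_request(image_name, variant="standard"):
    """Build an illustrative POST /v1/mark payload.

    Only `wm_variant` and the endpoint path are documented; the
    remaining fields are placeholders for this sketch.
    """
    if variant not in ("standard", "crop-resilient"):
        raise ValueError("unknown wm_variant")
    return {
        "method": "POST",
        "path": "/v1/mark",
        "body": json.dumps({"image": image_name, "wm_variant": variant}),
    }


req = build_mark_request("hero.jpg", variant="crop-resilient")
```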
| Use Case | Recommended Variant |
|---|---|
| Blog images, editorial content, website hero images | standard (Q) |
| CMS or CDN delivery (WordPress, Cloudinary, imgix) | standard (Q) |
| Social media profile pictures (heavy square crop) | crop-resilient (P) |
| E-commerce product shots (aspect ratio changes) | crop-resilient (P) |
| Images served below 400px width | standard (Q) |
After confirming that the 150×150 limit is fundamental to TrustMark's architecture, we benchmarked WAM — a different embedding approach from Meta AI — against the same test matrix.
WAM (Watermark Anything with Localized Messages, ICLR 2025) uses a VAE-based embedder and a SAM-based extractor trained on Meta's SA-1B dataset. Its key design property: the decoder is trained to recover watermarks from localized image patches as small as 10% of the image surface. This directly addresses the thumbnail problem.
| Property | TrustMark Q | WAM (MIT) |
|---|---|---|
| Architecture | GAN encoder + resolution scaling | VAE embedder + SAM extractor |
| Payload | 40 bit + BCH_SUPER ECC | 32 bit, no explicit ECC |
| Recovery metric | Exact token match (binary) | Bit accuracy (0.0–1.0) |
| Localized extraction | No (requires full image pattern) | Yes (trained on 10% patches) |
| License | MIT | MIT (wam_mit.pth) |
| Inference speed | ~18s/image (Railway, incl. network) | ~30ms/image (local T4 GPU) |
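WAM reports a bit-accuracy score rather than a binary token match. A minimal sketch of that metric over a 32-bit payload (the function is illustrative, not WAM's actual extractor code):

```python
def bit_accuracy(expected: int, decoded: int, nbits: int = 32) -> float:
    """Fraction of payload bits recovered correctly (1.0 = perfect)."""
    mask = (1 << nbits) - 1
    wrong = bin((expected ^ decoded) & mask).count("1")
    return 1.0 - wrong / nbits


print(bit_accuracy(0xDEADBEEF, 0xDEADBEEF))          # 1.0
print(bit_accuracy(0xDEADBEEF, 0xDEADBEEF ^ 0b100))  # 0.96875 (one bit off)
```

Under this metric, pic_02's score of 0.875 in the table below corresponds to 4 of 32 payload bits decoded incorrectly.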
| Size | WAM | TrustMark Q (ref.) | Delta |
|---|---|---|---|
| Baseline (original) | 11/11 (100%) | 11/11 (100%) | — |
| 150×150 (square crop) | 10/11 (91%) | 7/11 (64%) | +3 images / +27% |
| 300px wide | 11/11 (100%) | 11/11 (100%) | — |
| 768px wide | 11/11 (100%) | 11/11 (100%) | — |
| 1024px wide | 11/11 (100%) | 11/11 (100%) | — |
| Total | 54/55 (98%) | 51/55 (93%) | +3 / +5% |
| Image | WAM bit_acc | WAM 150px | TrustMark 150px | Change |
|---|---|---|---|---|
| pic_01.jpg | 1.000 | OK | FAIL | FAIL → OK |
| pic_02.jpg | 0.875 | FAIL | OK | OK → FAIL |
| pic_03.jpg | 1.000 | OK | OK | — |
| pic_04.jpg | 1.000 | OK | FAIL | FAIL → OK |
| pic_05.png | 0.906 | OK | OK | — |
| pic_06.png | 1.000 | OK | OK | — |
| pic_07.jpg | 1.000 | OK | FAIL | FAIL → OK |
| pic_08.png | 1.000 | OK | OK | — |
| pic_09.png | 1.000 | OK | OK | — |
| pic_10.jpg | 1.000 | OK | FAIL | FAIL → OK |
| pic_11.jpg | 1.000 | OK | OK | — |
Every benchmark has constraints. Here's what ours can and cannot tell you.
| Limitation | Impact | Mitigation |
|---|---|---|
| Sample size: 11 images | Results may not generalize to all image types | Diverse subjects selected; lab benchmark used 121 transforms |
| Single WordPress instance | Different hosts may use different JPEG quality settings | Standard WP 6.x with default settings represents the common case |
| Cloudinary Free Tier | Paid tiers may use different optimization algorithms | Transforms explicitly specified; results are deterministic |
| No social media platforms tested | Twitter, Instagram, LinkedIn apply their own processing | Planned for future benchmark; CDN transforms approximate these |
| Thumbnails (< 200px) excluded from practical scope for TrustMark | Recovery rates at very small sizes are low for TrustMark | Confirmed as a physical limit of TrustMark's architecture. wm_strength=1.5 spot-test produced 0/4 improvements. Fingerprint + audit trail provide the alternative verification path. WAM (see Chapter 06) achieves 91% at 150×150 and is the identified architectural solution for v2. |
All scripts and data are available in the MarkMyAI repository:
| Artifact | Path |
|---|---|
| Original test images | test_series/00_series_02/test_originals/ |
| Full results (366 tests) | test_series/.../realworld-pipeline/results.json |
| WordPress collection script | scripts/collect-wordpress-renditions.mjs |
| Cloudinary collection script | scripts/collect-cloudinary-renditions.mjs |
| Orchestration script | scripts/collect-series-realworld.mjs |
| Benchmark runner | scripts/test-watermark-series.mjs |
| Technical results doc | docs/technical/realworld-pipeline-benchmark-results.md |
| Lab benchmark spec | docs/technical/token40-benchmark-spec.md |
| Controlled thumbnail benchmark (March 13, 2026) | test_series/00_series_02/run_2026-03-13_thumbnail-recovery/results.json |
| wm_strength spot-test (March 13, 2026) | test_series/00_series_02/run_2026-03-13_wmstrength15-spottest/results.json |
| Thumbnail recovery script | scripts/test-thumbnail-recovery.py |
| wm_strength spot-test script | scripts/test-wmstrength-spottest.py |
| WAM benchmark (March 13, 2026) | test_series/00_series_02/run_2026-03-13_wam-benchmark/results.json |
| WAM Colab notebook | scripts/wam-benchmark.ipynb |
| Research analysis & landscape (2026) | research/watermark-agent/ |
Social media platforms: Test recovery after Twitter, Instagram, and LinkedIn processing pipelines.
Larger sample sizes: Expand to 50+ images with automated image characteristic classification.
✓ wm_strength variation: Tested (March 13, 2026). wm_strength=1.5 yields 0/4 improvement at 150×150. Concluded — no further tuning warranted.
✓ Tiling-Embedding hypothesis: Tested (March 13, 2026). 0/10 matches in tile/quadrant decoding. TrustMark decoder requires the full spatial distribution pattern. Concluded — approach infeasible.
Messenger apps: Test WhatsApp, Telegram, Signal image compression.
Multi-generation: Test watermark survival after mark → share → re-mark → share cycles.
✓ Alternative architecture benchmark (WAM): Completed March 13, 2026. WAM achieves 10/11 (91%) at 150×150 vs. TrustMark's 7/11 (64%). All 4 TrustMark-specific failures resolved. WAM identified as candidate for v2 embedding layer.
InvisMark benchmark: Originally planned; weights link broken (OneDrive, open issue since Nov 2025). Superseded by WAM benchmark which provides the same answer.
[1] Bui, Agarwal & Collomosse (2025). TrustMark: Robust Watermarking and Watermark Removal for Arbitrary Resolution Images. ICCV 2025. (Preprint: arXiv:2311.18297, 2023.) University of Surrey & Adobe Research.
[2] C2PA (Coalition for Content Provenance and Authenticity). C2PA Technical Specification v2.1. c2pa.org, September 2024.
[3] EU AI Act, Regulation (EU) 2024/1689. Article 50: Transparency obligations for deployers.
[4] Zhu, Kaplan, Johnson & Fei-Fei (2018). HiDDeN: Hiding Data With Deep Networks. ECCV 2018, pp. 682–697. LNCS vol. 11219.
[5] Xu, Hu, Lei, Li, Lowe, Gorevski, Wang, Ching & Deng (Microsoft, 2025). InvisMark: Invisible and Robust Watermarking for AI-Generated Image Provenance. WACV 2025. (256-bit payload, PSNR ~51 dB, >97% bit accuracy, open-source.)
[6] Sander, Fernandez, Durmus, Furon & Douze (Meta AI, 2025). Watermark Anything with Localized Messages. ICLR 2025. arXiv:2411.07231. Code & MIT-licensed weights: github.com/facebookresearch/watermark-anything
476 tests. Real infrastructure.
Every number published.
We found a limit. Then we found a way past it.
Now we know what to build next.
Data Availability
Raw results (366 decode tests as JSON), methodology details, and test parameters are documented in the accompanying blog article. Source data available on request.
markmyai.com/blog/real-world-pipeline-benchmark-watermarks
Data requests: hello@markmyai.com
markmyai.com
© 2026 MarkMyAI · Dominic Tschan · Waltenschwil, Switzerland