Real-World Pipeline Benchmark:
Do Invisible Watermarks Survive?

366 decode tests across real WordPress renditions and Cloudinary CDN transforms. Three watermark candidates. Every number — including the ones we didn't expect.

🔬 Research Paper · March 2026 · v1.2
markmyai.com · March 2026 · Internal Research

Executive Summary

The Question We Needed to Answer

Our lab benchmarks showed promising watermark recovery rates. But lab conditions don't reflect reality. We needed to know: what actually survives when images pass through real WordPress and CDN infrastructure? And where is the hard limit — and can a different architecture overcome it?

The 476 total tests across this paper break down as: 366 real-world pipeline tests (core benchmark), 55 controlled thumbnail tests (TrustMark Q, software resize), and 55 WAM benchmark tests. All use the same 11 original images.

476
Total Decode Tests
91%
Best Recovery Rate (Real-World)
+27%
WAM vs. TrustMark at 150px

Key Findings

Candidate Lab Result Real-World Result Delta
Baseline BCH_5 / Q / single-pass 94% (114/121) 89% (109/122) −5%
token40_Q BCH_SUPER / Q / multi-pass 98% (119/121) 91% (111/122) −7%
token40_P BCH_SUPER / P / multi-pass 97% (117/121) 75% (91/122) −22%
Bottom line: token40_Q is the best candidate in both lab and real-world conditions. It never performed worse than the baseline on any single transform category. The −7% lab-to-real gap is expected — real pipelines apply compound transformations that synthetic tests can't perfectly replicate.

Decision

token40_Q (BCH_SUPER encoding, Q variant, multi-pass decode) is now the production default for all new marks. The P variant remains available as a specialized "crop-resilient" mode for use cases where heavy cropping is expected (social media thumbnails, e-commerce product shots).

v1.1 · March 13, 2026

150×150 limit confirmed as physical. Controlled benchmark (55 tests) + wm_strength=1.5 spot-test (0/4 improvement). The 300px floor is the reliable recovery boundary for TrustMark Q. → Chapter 05

v1.2 · March 13, 2026

WAM (Meta AI, ICLR 2025) overcomes the limit: 10/11 (91%) at 150×150 vs. TrustMark's 7/11 (64%), +27%. All 4 TrustMark-specific failures resolved. WAM identified as the v2 embedding architecture. → Chapter 06

Chapter 01

Methodology

How we set up the benchmark, what infrastructure we used, and why this test design matters.

Test Design

We marked 11 original images with each of 3 watermark candidates, producing 33 marked images. Each was uploaded to real WordPress and Cloudinary infrastructure, where automatic processing generated multiple renditions. The 11 images were deliberately selected for diversity: landscapes, portraits, abstract, text-heavy, low-contrast, mixed JPEG and PNG, and a wide resolution range (667px to 5824px longest side). This diversity is intentional — it stress-tests the watermark across the image characteristics most likely to affect signal retention.

On sample size: 11 images is a small sample by academic standards. We acknowledge this. Three factors support the validity of the findings: (1) the images were selected to maximize content diversity, not cherry-picked; (2) a parallel lab benchmark (121 transform combinations per candidate) independently confirmed the ranking; (3) all failure cases were reproduced consistently across runs, suggesting systematic rather than stochastic behavior.
ParameterValue
Original images11 (from test_series/00_series_02/test_originals)
Candidates3 (baseline_q_bch5_single, token40_q_super_multi, token40_p_super_multi)
Marked images uploaded33 (11 × 3)
WordPress renditions per image7 (thumbnail, medium, medium_large, large, 1536×1536, 2048×2048, full)
Cloudinary transforms per image5 (w1200_q85_jpg, w800_q60_jpg, w800_q80_webp, w800_h800_crop, w400_q70_jpg)
Real-world pipeline decode tests366 (core benchmark)
Controlled thumbnail tests (TrustMark Q, March 13)55 (11 × 5 sizes, software-only)
WAM benchmark (March 13)55 (11 × 5 sizes, Colab T4 GPU)
Total decode tests (all runs)476

Infrastructure

WordPress

Host: ki-welt.ch (production WP 6.x)

Theme: Standard theme

MarkMyAI Plugin: Deactivated (to avoid double-marking)

Upload: REST API with Bearer auth

Cloudinary CDN

Tier: Free

Transforms: 5 real-world presets

Formats: JPEG, WebP

Upload: Programmatic API

Why Real-World Matters

Lab benchmarks apply transformations one at a time in controlled sequence. Real-world pipelines apply compound transformations: WordPress might resize, re-encode to JPEG at a different quality, strip metadata, and change color profiles — all in one step. Cloudinary adds its own format conversion, quality optimization, and smart cropping.

Decode setup: All decoding was performed by the local TrustMark FastAPI worker on localhost:8001. Each rendition was fetched from the live infrastructure and decoded against the expected watermark token.

What We Measured

For each decoded rendition, we recorded one of three outcomes:

OutcomeDefinition
verifiedWatermark recovered and matches the expected token
partialWatermark signal detected but token match inconclusive
not_foundNo watermark signal detected in the decoded image

Chapter 02

WordPress Rendition Results

WordPress automatically generates 7 image sizes from each upload. Here's how each watermark candidate performed across all renditions.

Rendition Dimensions Baseline token40_Q token40_P
full Original 11/11 11/11 10/11
2048×2048 max 2048 px 6/6 6/6 6/6
1536×1536 max 1536 px 8/8 8/8 8/8
large max 1024 px 10/10 10/10 10/10
medium_large max 768 px 10/10 10/10 10/10
medium max 300 px 10/11 11/11 3/11
thumbnail 150×150 crop 3/11 4/11 (real WP)
7/11 software-only¹
1/11

Key Observations

token40_Q closes the medium gap: The baseline loses 1 of 11 images at 300px. Q recovers all 11 — a meaningful improvement for blog thumbnails and sidebar images, which are commonly served at this size.
Thumbnails are a confirmed hard limit — and the boundary has been precisely established. At 150×150 pixels (square crop), watermark recovery is unreliable for all candidates: Baseline 3/11, Q 4/11, P 1/11 (real WordPress, March 11). A controlled software-only test (March 13) showed 7/11 = 64% for token40_Q — the gap between real-world and software-only results is explained by WordPress's additional JPEG re-encode at ~Q80. A wm_strength=1.5 spot-test on all 4 failing images produced 0/4 improvements, confirming this is a physical pixel-density limit, not a configuration issue. The 300px floor (WordPress Medium) is the reliable recovery boundary.
token40_P collapses at small sizes. Only 3/11 at medium (300px), 1/11 at thumbnail. The P variant's perceptual encoding trades compression resilience for crop resilience — small images suffer disproportionately.

WordPress Recovery by Size

Visual summary — recovery rates for all three candidates across WordPress rendition sizes:

768px+ (Q) — 45/45
100%
300px medium (Q) — 11/11
100%
300px medium (Baseline) — 10/11
91%
300px medium (P) — 3/11
27%
150px thumbnail (Q) — 4/11
36%

Note: 2048×2048 and 1536×1536 renditions are only generated for images larger than those dimensions. Counts reflect actually generated renditions.

¹ Software-only benchmark (March 13, 2026): 11 images freshly marked with token40_Q, resized via PIL LANCZOS (no JPEG re-encode). The 64%→36% gap between software-only and real WordPress is attributable to WordPress's JPEG re-encode at ~Q80 quality.

Chapter 03

Cloudinary CDN Transform Results

Cloudinary is one of the most popular image CDNs. We tested five common transform presets that reflect real production usage.

Transform Description Baseline token40_Q token40_P
w1200_q85_jpg Hero image 11/11 11/11 10/11
w800_q60_jpg Aggressive JPEG 11/11 11/11 9/11
w800_q80_webp Standard WebP 11/11 11/11 10/11
w800_h800_crop Square crop (center) 8/11 8/11 10/11
w400_q70_jpg Small thumbnail 10/11 10/11 4/11

Key Observations

Baseline and Q are flawless on standard CDN transforms. Hero images, aggressive JPEG, and WebP conversion: 11/11 for both. This means that for the vast majority of CDN-delivered content, the watermark survives without exception.
Square crop is the biggest challenge — and P's strength. Cloudinary's center crop removes ~36% of pixel data. Baseline and Q: 8/11. But P recovers 10/11. This confirms P's design purpose: it trades compression resilience for crop resilience.
P breaks down at small sizes. At 400px with Q70 JPEG: only 4/11. The same pattern as WordPress — P's perceptual encoding doesn't have enough signal at low resolutions.

The Two Dimensions of Robustness

This benchmark reveals a fundamental trade-off in watermark design. TrustMark [1] uses a GAN-based encoder-decoder architecture where the embedding strategy differs between variants:

Q Variant: Compression Resilient

Concentrates the watermark signal in frequency-domain features (DCT coefficient perturbations) that survive lossy quantization. This makes Q robust against JPEG/WebP compression, resizing, and format conversion — exactly the transforms CMS and CDN pipelines apply.

Best for: Standard publishing, CMS, CDN delivery, blog images.

P Variant: Crop Resilient

Distributes the signal across redundant spatial blocks, so that even if large portions of the image are cropped away, surviving blocks still carry enough payload for recovery. The trade-off: each block needs minimum resolution to be decoded, which explains P's collapse below ~400px.

Best for: Social media thumbnails, e-commerce product shots where cropping is the primary transform.


Overall Recovery Rates

89%
Baseline
91%
token40_Q
75%
token40_P

Chapter 04

Per-Image Performance

Not all images are equal. Some subjects carry watermark signals better than others. Here's the full breakdown.

Image Baseline token40_Q token40_P Notes
pic_01 11/12 10/12 10/12 Consistent across candidates
pic_02 9/10 10/10 2/10 P extremely weak, Q perfect
pic_03 11/11 11/11 8/11 Q maintains perfection
pic_04 6/8 6/8 6/8 Difficult image for all
pic_05 12/12 11/12 11/12 Near-perfect across all
pic_06 11/12 11/12 10/12 Slight P weakness
pic_07 9/11 10/11 8/11 Q fixes 1 baseline failure
pic_08 9/12 11/12 10/12 Q fixes 2 baseline failures
pic_09 12/12 12/12 11/12 Best image overall
pic_10 10/12 10/12 9/12 P slightly weaker
pic_11 9/10 9/10 6/10 P significant drop

Image-Level Insights

Most Robust Images

pic_09: All candidates near-perfect. High-contrast landscape with rich texture — ideal for watermark embedding.

pic_05: 12/12 for baseline, 11/12 for Q and P. Well-balanced content.

Most Challenging Images

pic_04: Only 6/8 for all three candidates. Low-frequency content with large uniform areas — difficult for any watermark method.

pic_02: P variant catastrophic failure (2/10), while Q achieves perfect 10/10.


Key takeaway: token40_Q is the most consistent performer. It achieves the highest or equal-highest recovery rate on 9 of 11 images. It never drops below the baseline on any single image. The P variant shows high variance — excellent on some images, catastrophic on others.

Chapter 05

Decision & Implementation

What we decided based on these results, and how it affects the product.

Verdict: token40_Q Is the New Default

Based on 366 real-world decode tests across WordPress and Cloudinary infrastructure, token40_Q (BCH_SUPER encoding, Q variant, multi-pass decode) has been selected as the production default.

CriterionResult
Higher overall recovery than baseline?Yes — 91% vs 89%
Worse than baseline on any transform?Never
Closes known gaps?Yes — fixes medium (300px) from 10/11 to 11/11
Multi-pass decode overhead acceptable?Yes — decode ~2× slower but still < 5s

Migration Details

AspectBeforeAfter
Encoding BCH_5 (61-bit mark_id) BCH_SUPER (40-bit token)
Token derivation mark_id embedded directly SHA-256(mark_id)[0:10] → 40-bit token
Default variant Q only Q (default) + P (crop-resilient, opt-in)
Decode strategy Single-pass Multi-pass (try BCH_SUPER first, fallback to BCH_5)
Backward compatibility Legacy marks remain decodable via fallback

API: Choosing the Variant

The watermark variant is controlled via the wm_variant parameter on POST /v1/mark:

// Default: Q variant (best all-round)
{ "image_url": "...", "wm_variant": "standard" }

// Crop-resilient: P variant
{ "image_url": "...", "wm_variant": "crop-resilient" }

When to Use P (Crop-Resilient)?

Use CaseRecommended Variant
Blog images, editorial content, website hero imagesstandard (Q)
CMS or CDN delivery (WordPress, Cloudinary, imgix)standard (Q)
Social media profile pictures (heavy square crop)crop-resilient (P)
E-commerce product shots (aspect ratio changes)crop-resilient (P)
Images served below 400px widthstandard (Q)

Chapter 06

WAM Benchmark: Breaking the 150px Barrier

After confirming that the 150×150 limit is fundamental to TrustMark's architecture, we benchmarked WAM — a different embedding approach from Meta AI — against the same test matrix.

What is WAM?

WAM (Watermark Anything with Localized Messages, ICLR 2025) uses a VAE-based embedder and a SAM-based extractor trained on Meta's SA-1B dataset. Its key design property: the decoder is trained to recover watermarks from localized image patches as small as 10% of the image surface. This directly addresses the thumbnail problem.

PropertyTrustMark QWAM (MIT)
ArchitectureGAN encoder + resolution scalingVAE embedder + SAM extractor
Payload40 bit + BCH_SUPER ECC32 bit, no explicit ECC
Recovery metricExact token match (binary)Bit accuracy (0.0–1.0)
Localized extractionNo (requires full image pattern)Yes (trained on 10% patches)
LicenseMITMIT (wam_mit.pth)
Inference speed~18s/image (Railway, incl. network)~30ms/image (local T4 GPU)

Results: Same 11 Images × 5 Sizes

Size WAM TrustMark Q (ref.) Delta
Baseline (original) 11/11 (100%) 11/11 (100%)
150×150 (square crop) 10/11 (91%) 7/11 (64%) +3 images / +27%
300px wide 11/11 (100%) 11/11 (100%)
768px wide 11/11 (100%) 11/11 (100%)
1024px wide 11/11 (100%) 11/11 (100%)
Total 54/55 (98%) 51/55 (93%) +3 / +5%

Per-Image 150×150 Detail

ImageWAM bit_accWAM 150pxTrustMark 150pxChange
pic_01.jpg1.000OKFAILFAIL → OK
pic_02.jpg0.875FAILOKOK → FAIL
pic_03.jpg1.000OKOK
pic_04.jpg1.000OKFAILFAIL → OK
pic_05.png0.906OKOK
pic_06.png1.000OKOK
pic_07.jpg1.000OKFAILFAIL → OK
pic_08.png1.000OKOK
pic_09.png1.000OKOK
pic_10.jpg1.000OKFAILFAIL → OK
pic_11.jpg1.000OKOK
All 4 TrustMark-specific failures resolved. Every image that TrustMark failed to decode at 150×150 (pic_01, pic_04, pic_07, pic_10) was correctly recovered by WAM with bit_acc=1.000. This is not coincidental — it is WAM's localized extraction architecture working exactly as designed.
1 new failure (pic_02, bit_acc=0.875). This image was handled correctly by TrustMark. The reason is under investigation. Net result: WAM recovers +3 images over TrustMark at 150×150.
Production implication: WAM is the clear candidate for a next-generation embedding layer. A migration requires a full worker rebuild (PyTorch + GPU inference). This is not a quick fix but a concrete roadmap item for v2.

Chapter 07

Limitations & Reproducibility

Every benchmark has constraints. Here's what ours can and cannot tell you.

Known Limitations

LimitationImpactMitigation
Sample size: 11 images Results may not generalize to all image types Diverse subjects selected; lab benchmark used 121 transforms
Single WordPress instance Different hosts may use different JPEG quality settings Standard WP 6.x with default settings represents the common case
Cloudinary Free Tier Paid tiers may use different optimization algorithms Transforms explicitly specified; results are deterministic
No social media platforms tested Twitter, Instagram, LinkedIn apply their own processing Planned for future benchmark; CDN transforms approximate these
Thumbnails (< 200px) excluded from practical scope for TrustMark Recovery rates at very small sizes are low for TrustMark Confirmed as a physical limit of TrustMark's architecture. wm_strength=1.5 spot-test produced 0/4 improvements. Fingerprint + audit trail provide the alternative verification path. WAM (see Chapter 06) achieves 91% at 150×150 and is the identified architectural solution for v2.

Reproducibility

All scripts and data are available in the MarkMyAI repository:

ArtifactPath
Original test imagestest_series/00_series_02/test_originals/
Full results (366 tests)test_series/.../realworld-pipeline/results.json
WordPress collection scriptscripts/collect-wordpress-renditions.mjs
Cloudinary collection scriptscripts/collect-cloudinary-renditions.mjs
Orchestration scriptscripts/collect-series-realworld.mjs
Benchmark runnerscripts/test-watermark-series.mjs
Technical results docdocs/technical/realworld-pipeline-benchmark-results.md
Lab benchmark specdocs/technical/token40-benchmark-spec.md
Controlled thumbnail benchmark (March 13, 2026)test_series/00_series_02/run_2026-03-13_thumbnail-recovery/results.json
wm_strength spot-test (March 13, 2026)test_series/00_series_02/run_2026-03-13_wmstrength15-spottest/results.json
Thumbnail recovery scriptscripts/test-thumbnail-recovery.py
wm_strength spot-test scriptscripts/test-wmstrength-spottest.py
WAM benchmark (March 13, 2026)test_series/00_series_02/run_2026-03-13_wam-benchmark/results.json
WAM Colab notebookscripts/wam-benchmark.ipynb
Research analysis & landscape (2026)research/watermark-agent/

Future Work

Social media platforms: Test recovery after Twitter, Instagram, and LinkedIn processing pipelines.

Larger sample sizes: Expand to 50+ images with automated image characteristic classification.

✓ wm_strength variation: Tested (March 13, 2026). wm_strength=1.5 yields 0/4 improvement at 150×150. Concluded — no further tuning warranted.

✓ Tiling-Embedding hypothesis: Tested (March 13, 2026). 0/10 matches in tile/quadrant decoding. TrustMark decoder requires the full spatial distribution pattern. Concluded — approach infeasible.

Messenger apps: Test WhatsApp, Telegram, Signal image compression.

Multi-generation: Test watermark survival after mark → share → re-mark → share cycles.

✓ Alternative architecture benchmark (WAM): Completed March 13, 2026. WAM achieves 10/11 (91%) at 150×150 vs. TrustMark's 7/11 (64%). All 4 TrustMark-specific failures resolved. WAM identified as candidate for v2 embedding layer.

InvisMark benchmark: Originally planned; weights link broken (OneDrive, open issue since Nov 2025). Superseded by WAM benchmark which provides the same answer.


References

[1] Bui, Agarwal & Collomosse (2025). TrustMark: Robust Watermarking and Watermark Removal for Arbitrary Resolution Images. ICCV 2025. (Preprint: arXiv:2311.18297, 2023.) University of Surrey & Adobe Research.

[2] C2PA (Coalition for Content Provenance and Authenticity). C2PA Technical Specification v2.1. c2pa.org, September 2024.

[3] EU AI Act, Regulation (EU) 2024/1689. Article 50: Transparency obligations for deployers.

[4] Zhu, Kaplan, Johnson & Fei-Fei (2018). HiDDeN: Hiding Data With Deep Networks. ECCV 2018, pp. 682–697. LNCS vol. 11219.

[5] Xu, Hu, Lei, Li, Lowe, Gorevski, Wang, Ching & Deng (Microsoft, 2025). InvisMark: Invisible and Robust Watermarking for AI-Generated Image Provenance. WACV 2025. (256-bit payload, PSNR ~51 dB, >97% bit accuracy, open-source.)

[6] Sander, Fernandez, Durmus, Furon & Douze (Meta AI, 2025). Watermark Anything with Localized Messages. ICLR 2025. arXiv:2411.07231. Code & MIT-licensed weights: github.com/facebookresearch/watermark-anything

476 tests. Real infrastructure.
Every number published.

We found a limit. Then we found a way past it.
Now we know what to build next.

Data Availability

Raw results (366 decode tests as JSON), methodology details, and test parameters are documented in the accompanying blog article. Source data available on request.

markmyai.com/blog/real-world-pipeline-benchmark-watermarks

Data requests: hello@markmyai.com

markmyai.com

© 2026 MarkMyAI · Dominic Tschan · Waltenschwil, Switzerland