366 decode tests across real WordPress renditions and Cloudinary CDN transforms. Three watermark candidates. Every number — including the ones we didn't expect.
Our lab benchmarks showed promising watermark recovery rates, but lab conditions don't reflect reality. We needed to know: what actually survives when images pass through real WordPress and CDN infrastructure? Where is the hard limit, and can a different architecture overcome it?
The 476 total tests across this paper break down as: 366 real-world pipeline tests (core benchmark), 55 controlled thumbnail tests (TrustMark Q, software resize), and 55 WAM benchmark tests. All use the same 11 original images.
| Candidate | Lab Result | Real-World Result | Delta |
|---|---|---|---|
| Baseline BCH_5 / Q / single-pass | 94% (114/121) | 89% (109/122) | −5% |
| token40_Q BCH_SUPER / Q / multi-pass | 98% (119/121) | 91% (111/122) | −7% |
| token40_P BCH_SUPER / P / multi-pass | 97% (117/121) | 75% (91/122) | −22% |
token40_Q (BCH_SUPER encoding, Q variant, multi-pass decode) is now the production default for all new marks. The P variant remains available as a specialized "crop-resilient" mode for use cases where heavy cropping is expected (social media thumbnails, e-commerce product shots).
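The multi-pass decode strategy mentioned above can be sketched as follows. This is a minimal illustration only: the function names, call signatures, and stand-in decoders are assumptions for this sketch, not the actual MarkMyAI internals.

```python
def multi_pass_decode(image_bytes, decoders):
    """Try each decoder in priority order; return the first hit.

    `decoders` is an ordered list of (name, decode_fn) pairs,
    e.g. BCH_SUPER first, then the legacy BCH_5 fallback.
    """
    for name, decode_fn in decoders:
        token = decode_fn(image_bytes)
        if token is not None:
            return name, token  # first successful pass wins
    return None, None  # no watermark recovered


# Illustrative stand-in decoders (real ones would call the watermark model):
decoders = [
    ("BCH_SUPER", lambda img: "a1b2c3d4e5" if b"marked" in img else None),
    ("BCH_5", lambda img: "legacy-id" if b"legacy" in img else None),
]
```

The ordering matters: new marks decode on the first pass, while legacy BCH_5 marks only cost the extra fallback attempt.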
v1.1 · March 13, 2026
150×150 limit confirmed as physical. Controlled benchmark (55 tests) + wm_strength=1.5 spot-test (0/4 improvement). The 300px floor is the reliable recovery boundary for TrustMark Q. → Chapter 05
v1.2 · March 13, 2026
WAM (Meta AI, ICLR 2025) overcomes the limit: 10/11 (91%) at 150×150 vs. TrustMark's 7/11 (64%), +27%. All 4 TrustMark-specific failures resolved. WAM identified as the v2 embedding architecture. → Chapter 06
How we set up the benchmark, what infrastructure we used, and why this test design matters.
We marked 11 original images with each of 3 watermark candidates, producing 33 marked images. Each was uploaded to real WordPress and Cloudinary infrastructure, where automatic processing generated multiple renditions. The 11 images were deliberately selected for diversity: landscapes, portraits, abstract, text-heavy, low-contrast, mixed JPEG and PNG, and a wide resolution range (667px to 5824px longest side). This diversity is intentional — it stress-tests the watermark across the image characteristics most likely to affect signal retention.
| Parameter | Value |
|---|---|
| Original images | 11 (from test_series/00_series_02/test_originals) |
| Candidates | 3 (baseline_q_bch5_single, token40_q_super_multi, token40_p_super_multi) |
| Marked images uploaded | 33 (11 × 3) |
| WordPress renditions per image | 7 (thumbnail, medium, medium_large, large, 1536×1536, 2048×2048, full) |
| Cloudinary transforms per image | 5 (w1200_q85_jpg, w800_q60_jpg, w800_q80_webp, w800_h800_crop, w400_q70_jpg) |
| Real-world pipeline decode tests | 366 (core benchmark) |
| Controlled thumbnail tests (TrustMark Q, March 13) | 55 (11 × 5 sizes, software-only) |
| WAM benchmark (March 13) | 55 (11 × 5 sizes, Colab T4 GPU) |
| Total decode tests (all runs) | 476 |
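As a rough sketch of how the proportional WordPress renditions come about (sizes other than the thumbnail fit the image inside a maximum box without upscaling; the thumbnail is the exception, a hard 150×150 crop), the scaling step approximates to:

```python
def wp_fit(width, height, max_w, max_h):
    """Proportionally fit an image inside max_w x max_h, never upscaling.

    Approximates WordPress's constrained-resize behavior for the
    non-cropping sizes (medium, medium_large, large, 1536x1536, 2048x2048);
    this is an illustration, not the actual WordPress code.
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)


# A 4000x3000 original yields a 1024x768 "large" rendition:
print(wp_fit(4000, 3000, 1024, 1024))  # (1024, 768)
# A 667px original is already below the 2048 cap, so no scaling occurs:
print(wp_fit(667, 500, 2048, 2048))    # (667, 500)
```

This also explains the varying counts in the tables below: the 1536 and 2048 renditions simply don't exist for originals smaller than those caps.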
**WordPress**
- Host: ki-welt.ch (production WP 6.x)
- Theme: standard theme
- MarkMyAI plugin: deactivated (to avoid double-marking)
- Upload: REST API with Bearer auth

**Cloudinary**
- Tier: Free
- Transforms: 5 real-world presets
- Formats: JPEG, WebP
- Upload: programmatic API
Lab benchmarks apply transformations one at a time in controlled sequence. Real-world pipelines apply compound transformations: WordPress might resize, re-encode to JPEG at a different quality, strip metadata, and change color profiles — all in one step. Cloudinary adds its own format conversion, quality optimization, and smart cropping.
For each decoded rendition, we recorded one of three outcomes:
| Outcome | Definition |
|---|---|
| verified | Watermark recovered and matches the expected token |
| partial | Watermark signal detected but token match inconclusive |
| not_found | No watermark signal detected in the decoded image |
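A minimal sketch of how a decode attempt maps onto these three outcomes (the function and argument names are illustrative, not the actual benchmark code):

```python
def classify_outcome(decoded_token, expected_token):
    """Map a single decode attempt to verified / partial / not_found."""
    if decoded_token is None:
        return "not_found"   # no watermark signal detected at all
    if decoded_token == expected_token:
        return "verified"    # recovered token matches the expected one
    return "partial"         # signal present, but token match inconclusive
```

Only `verified` counts as a success in the recovery figures throughout this chapter.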
WordPress automatically generates 7 image sizes from each upload. Here's how each watermark candidate performed across all renditions.
| Rendition | Dimensions | Baseline | token40_Q | token40_P |
|---|---|---|---|---|
| full | Original | 11/11 | 11/11 | 10/11 |
| 2048×2048 | max 2048 px | 6/6 | 6/6 | 6/6 |
| 1536×1536 | max 1536 px | 8/8 | 8/8 | 8/8 |
| large | max 1024 px | 10/10 | 10/10 | 10/10 |
| medium_large | max 768 px | 10/10 | 10/10 | 10/10 |
| medium | max 300 px | 10/11 | 11/11 | 3/11 |
| thumbnail | 150×150 crop | 3/11 | 4/11 (real WP) · 7/11 software-only¹ | 1/11 |
Visual summary — recovery rates for all three candidates across WordPress rendition sizes:
Note: 2048×2048 and 1536×1536 renditions are only generated for images larger than those dimensions. Counts reflect actually generated renditions.
¹ Software-only benchmark (March 13, 2026): 11 images freshly marked with token40_Q, resized via PIL LANCZOS (no JPEG re-encode). The 64%→36% gap between software-only and real WordPress is attributable to WordPress's JPEG re-encode at ~Q80 quality.
Cloudinary is one of the most popular image CDNs. We tested five common transform presets that reflect real production usage.
| Transform | Description | Baseline | token40_Q | token40_P |
|---|---|---|---|---|
| w1200_q85_jpg | Hero image | 11/11 | 11/11 | 10/11 |
| w800_q60_jpg | Aggressive JPEG | 11/11 | 11/11 | 9/11 |
| w800_q80_webp | Standard WebP | 11/11 | 11/11 | 10/11 |
| w800_h800_crop | Square crop (center) | 8/11 | 8/11 | 10/11 |
| w400_q70_jpg | Small thumbnail | 10/11 | 10/11 | 4/11 |
This benchmark reveals a fundamental trade-off in watermark design. TrustMark [1] uses a GAN-based encoder-decoder architecture where the embedding strategy differs between variants:
**Q variant:** Concentrates the watermark signal in frequency-domain features (DCT coefficient perturbations) that survive lossy quantization. This makes Q robust against JPEG/WebP compression, resizing, and format conversion, which are exactly the transforms CMS and CDN pipelines apply.

Best for: standard publishing, CMS, CDN delivery, blog images.

**P variant:** Distributes the signal across redundant spatial blocks, so that even if large portions of the image are cropped away, surviving blocks still carry enough payload for recovery. The trade-off: each block needs a minimum resolution to be decoded, which explains P's collapse below ~400px.

Best for: social media thumbnails, e-commerce product shots where cropping is the primary transform.
Not all images are equal. Some subjects carry watermark signals better than others. Here's the full breakdown.
| Image | Baseline | token40_Q | token40_P | Notes |
|---|---|---|---|---|
| pic_01 | 11/12 | 10/12 | 10/12 | Consistent across candidates |
| pic_02 | 9/10 | 10/10 | 2/10 | P extremely weak, Q perfect |
| pic_03 | 11/11 | 11/11 | 8/11 | Q maintains perfection |
| pic_04 | 6/8 | 6/8 | 6/8 | Difficult image for all |
| pic_05 | 12/12 | 11/12 | 11/12 | Near-perfect across all |
| pic_06 | 11/12 | 11/12 | 10/12 | Slight P weakness |
| pic_07 | 9/11 | 10/11 | 8/11 | Q fixes 1 baseline failure |
| pic_08 | 9/12 | 11/12 | 10/12 | Q fixes 2 baseline failures |
| pic_09 | 12/12 | 12/12 | 11/12 | Best image overall |
| pic_10 | 10/12 | 10/12 | 9/12 | P slightly weaker |
| pic_11 | 9/10 | 9/10 | 6/10 | P significant drop |
pic_09: All candidates near-perfect. High-contrast landscape with rich texture — ideal for watermark embedding.
pic_05: 12/12 for baseline, 11/12 for Q and P. Well-balanced content.
pic_04: Only 6/8 for all three candidates. Low-frequency content with large uniform areas — difficult for any watermark method.
pic_02: P variant catastrophic failure (2/10), while Q achieves perfect 10/10.
What we decided based on these results, and how it affects the product.
Based on 366 real-world decode tests across WordPress and Cloudinary infrastructure, token40_Q (BCH_SUPER encoding, Q variant, multi-pass decode) has been selected as the production default.
| Criterion | Result |
|---|---|
| Higher overall recovery than baseline? | Yes — 91% vs 89% |
| Worse than baseline on any transform? | Never |
| Closes known gaps? | Yes — fixes medium (300px) from 10/11 to 11/11 |
| Multi-pass decode overhead acceptable? | Yes — decode ~2× slower but still < 5s |
| Aspect | Before | After |
|---|---|---|
| Encoding | BCH_5 (61-bit mark_id) | BCH_SUPER (40-bit token) |
| Token derivation | mark_id embedded directly | SHA-256(mark_id)[0:10] → 40-bit token |
| Default variant | Q only | Q (default) + P (crop-resilient, opt-in) |
| Decode strategy | Single-pass | Multi-pass (try BCH_SUPER first, fallback to BCH_5) |
| Backward compatibility | — | Legacy marks remain decodable via fallback |
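The token derivation in the table above (SHA-256 of the mark_id, truncated to the first 10 hex characters, i.e. 40 bits) can be sketched as follows. The function name and the UTF-8 encoding of mark_id are assumptions of this sketch:

```python
import hashlib


def derive_token40(mark_id: str) -> str:
    """First 10 hex chars of SHA-256(mark_id) = a 40-bit token."""
    return hashlib.sha256(mark_id.encode("utf-8")).hexdigest()[:10]


token = derive_token40("example-mark-id")
print(token)  # 10 hex characters, deterministic for a given mark_id
```

Because the token is a deterministic hash truncation, the 40-bit payload can always be re-derived from the stored mark_id for verification.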
The watermark variant is controlled via the `wm_variant` parameter on `POST /v1/mark`:
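For example, a request payload might look like the following. This is a hedged sketch: only the endpoint path and the `wm_variant` parameter come from this document; the other field names and values are illustrative placeholders.

```python
import json


def build_mark_request(image_name, variant="standard"):
    """Build an illustrative POST /v1/mark payload.

    Only `wm_variant` and the endpoint path are documented; the
    remaining fields are placeholders for this sketch.
    """
    if variant not in ("standard", "crop-resilient"):
        raise ValueError("unknown wm_variant")
    return {
        "method": "POST",
        "path": "/v1/mark",
        "body": json.dumps({"image": image_name, "wm_variant": variant}),
    }


req = build_mark_request("hero.jpg", variant="crop-resilient")
```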
| Use Case | Recommended Variant |
|---|---|
| Blog images, editorial content, website hero images | standard (Q) |
| CMS or CDN delivery (WordPress, Cloudinary, imgix) | standard (Q) |
| Social media profile pictures (heavy square crop) | crop-resilient (P) |
| E-commerce product shots (aspect ratio changes) | crop-resilient (P) |
| Images served below 400px width | standard (Q) |
After confirming that the 150×150 limit is fundamental to TrustMark's architecture, we benchmarked WAM — a different embedding approach from Meta AI — against the same test matrix.
WAM (Watermark Anything with Localized Messages, ICLR 2025) uses a VAE-based embedder and a SAM-based extractor trained on Meta's SA-1B dataset. Its key design property: the decoder is trained to recover watermarks from localized image patches as small as 10% of the image surface. This directly addresses the thumbnail problem.
| Property | TrustMark Q | WAM (MIT) |
|---|---|---|
| Architecture | GAN encoder + resolution scaling | VAE embedder + SAM extractor |
| Payload | 40 bit + BCH_SUPER ECC | 32 bit, no explicit ECC |
| Recovery metric | Exact token match (binary) | Bit accuracy (0.0–1.0) |
| Localized extraction | No (requires full image pattern) | Yes (trained on 10% patches) |
| License | MIT | MIT (wam_mit.pth) |
| Inference speed | ~18s/image (Railway, incl. network) | ~30ms/image (local T4 GPU) |
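WAM reports a bit-accuracy score rather than a binary token match. A minimal sketch of that metric over a 32-bit payload (the function is illustrative, not WAM's actual extractor code):

```python
def bit_accuracy(expected: int, decoded: int, nbits: int = 32) -> float:
    """Fraction of payload bits recovered correctly (1.0 = perfect)."""
    mask = (1 << nbits) - 1
    wrong = bin((expected ^ decoded) & mask).count("1")
    return 1.0 - wrong / nbits


print(bit_accuracy(0xDEADBEEF, 0xDEADBEEF))          # 1.0
print(bit_accuracy(0xDEADBEEF, 0xDEADBEEF ^ 0b100))  # 0.96875 (one bit off)
```

Under this metric, pic_02's score of 0.875 in the table below corresponds to 4 of 32 payload bits decoded incorrectly.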
| Size | WAM | TrustMark Q (ref.) | Delta |
|---|---|---|---|
| Baseline (original) | 11/11 (100%) | 11/11 (100%) | — |
| 150×150 (square crop) | 10/11 (91%) | 7/11 (64%) | +3 images / +27% |
| 300px wide | 11/11 (100%) | 11/11 (100%) | — |
| 768px wide | 11/11 (100%) | 11/11 (100%) | — |
| 1024px wide | 11/11 (100%) | 11/11 (100%) | — |
| Total | 54/55 (98%) | 51/55 (93%) | +3 / +5% |
| Image | WAM bit_acc | WAM 150px | TrustMark 150px | Change |
|---|---|---|---|---|
| pic_01.jpg | 1.000 | OK | FAIL | FAIL → OK |
| pic_02.jpg | 0.875 | FAIL | OK | OK → FAIL |
| pic_03.jpg | 1.000 | OK | OK | — |
| pic_04.jpg | 1.000 | OK | FAIL | FAIL → OK |
| pic_05.png | 0.906 | OK | OK | — |
| pic_06.png | 1.000 | OK | OK | — |
| pic_07.jpg | 1.000 | OK | FAIL | FAIL → OK |
| pic_08.png | 1.000 | OK | OK | — |
| pic_09.png | 1.000 | OK | OK | — |
| pic_10.jpg | 1.000 | OK | FAIL | FAIL → OK |
| pic_11.jpg | 1.000 | OK | OK | — |
Every benchmark has constraints. Here's what ours can and cannot tell you.
| Limitation | Impact | Mitigation |
|---|---|---|
| Sample size: 11 images | Results may not generalize to all image types | Diverse subjects selected; lab benchmark used 121 transforms |
| Single WordPress instance | Different hosts may use different JPEG quality settings | Standard WP 6.x with default settings represents the common case |
| Cloudinary Free Tier | Paid tiers may use different optimization algorithms | Transforms explicitly specified; results are deterministic |
| No social media platforms tested | Twitter, Instagram, LinkedIn apply their own processing | Planned for future benchmark; CDN transforms approximate these |
| Thumbnails (< 200px) excluded from practical scope for TrustMark | Recovery rates at very small sizes are low for TrustMark | Confirmed as a physical limit of TrustMark's architecture. wm_strength=1.5 spot-test produced 0/4 improvements. Fingerprint + audit trail provide the alternative verification path. WAM (see Chapter 06) achieves 91% at 150×150 and is the identified architectural solution for v2. |
All scripts and data are available in the MarkMyAI repository:
| Artifact | Path |
|---|---|
| Original test images | test_series/00_series_02/test_originals/ |
| Full results (366 tests) | test_series/.../realworld-pipeline/results.json |
| WordPress collection script | scripts/collect-wordpress-renditions.mjs |
| Cloudinary collection script | scripts/collect-cloudinary-renditions.mjs |
| Orchestration script | scripts/collect-series-realworld.mjs |
| Benchmark runner | scripts/test-watermark-series.mjs |
| Technical results doc | docs/technical/realworld-pipeline-benchmark-results.md |
| Lab benchmark spec | docs/technical/token40-benchmark-spec.md |
| Controlled thumbnail benchmark (March 13, 2026) | test_series/00_series_02/run_2026-03-13_thumbnail-recovery/results.json |
| wm_strength spot-test (March 13, 2026) | test_series/00_series_02/run_2026-03-13_wmstrength15-spottest/results.json |
| Thumbnail recovery script | scripts/test-thumbnail-recovery.py |
| wm_strength spot-test script | scripts/test-wmstrength-spottest.py |
| WAM benchmark (March 13, 2026) | test_series/00_series_02/run_2026-03-13_wam-benchmark/results.json |
| WAM Colab notebook | scripts/wam-benchmark.ipynb |
| Research analysis & landscape (2026) | research/watermark-agent/ |
Social media platforms: Test recovery after Twitter, Instagram, and LinkedIn processing pipelines.
Larger sample sizes: Expand to 50+ images with automated image characteristic classification.
✓ wm_strength variation: Tested (March 13, 2026). wm_strength=1.5 yields 0/4 improvement at 150×150. Concluded — no further tuning warranted.
✓ Tiling-Embedding hypothesis: Tested (March 13, 2026). 0/10 matches in tile/quadrant decoding. TrustMark decoder requires the full spatial distribution pattern. Concluded — approach infeasible.
Messenger apps: Test WhatsApp, Telegram, Signal image compression.
Multi-generation: Test watermark survival after mark → share → re-mark → share cycles.
✓ Alternative architecture benchmark (WAM): Completed March 13, 2026. WAM achieves 10/11 (91%) at 150×150 vs. TrustMark's 7/11 (64%). All 4 TrustMark-specific failures resolved. WAM identified as candidate for v2 embedding layer.
InvisMark benchmark: Originally planned; weights link broken (OneDrive, open issue since Nov 2025). Superseded by WAM benchmark which provides the same answer.
[1] Bui, Agarwal & Collomosse (2025). TrustMark: Robust Watermarking and Watermark Removal for Arbitrary Resolution Images. ICCV 2025. (Preprint: arXiv:2311.18297, 2023.) University of Surrey & Adobe Research.
[2] C2PA (Coalition for Content Provenance and Authenticity). C2PA Technical Specification v2.1. c2pa.org, September 2024.
[3] EU AI Act, Regulation (EU) 2024/1689. Article 50: Transparency obligations for deployers.
[4] Zhu, Kaplan, Johnson & Fei-Fei (2018). HiDDeN: Hiding Data With Deep Networks. ECCV 2018, pp. 682–697. LNCS vol. 11219.
[5] Xu, Hu, Lei, Li, Lowe, Gorevski, Wang, Ching & Deng (Microsoft, 2025). InvisMark: Invisible and Robust Watermarking for AI-Generated Image Provenance. WACV 2025. (256-bit payload, PSNR ~51 dB, >97% bit accuracy, open-source.)
[6] Sander, Fernandez, Durmus, Furon & Douze (Meta AI, 2025). Watermark Anything with Localized Messages. ICLR 2025. arXiv:2411.07231. Code & MIT-licensed weights: github.com/facebookresearch/watermark-anything
476 tests. Real infrastructure.
Every number published.
We found a limit. Then we found a way past it.
Now we know what to build next.
Data Availability
Raw results (366 decode tests as JSON), methodology details, and test parameters are documented in the accompanying blog article. Source data available on request.
markmyai.com/blog/real-world-pipeline-benchmark-watermarks
Data requests: hello@markmyai.com
markmyai.com
© 2026 MarkMyAI · Dominic Tschan · Waltenschwil, Switzerland