Research · March 11, 2026 · 7 min read

We Tested Whether Our Watermarks Survive the Real Web. Here's What We Found.

For months we've been saying that invisible watermarks can survive what C2PA can't — social sharing, CDN compression, WordPress renditions. That claim needs real numbers behind it. So we ran a proper benchmark. Here is what the data actually shows.

The short version

In a controlled comparison across 11 images and 11 transformations each, our leading experimental watermark candidate recovered provenance in 119 out of 121 test cases — versus 114 out of 121 for our current default. Failures went from 7 down to 2. A second experimental variant was perfect on social-style pipelines but weaker under aggressive compression.

Why we are telling you this

We could have just shipped a better watermark and announced “improved robustness.” That's what most vendors do. We think that's the wrong approach.

If you are using MarkMyAI to prove AI image provenance under EU AI Act Article 50, you are trusting that our marking actually holds up when your images travel through the web. That trust should be grounded in real data, not marketing language.

So here is the honest version: what we tested, what improved, what still failed, and what we have not yet proven.

The problem we are trying to solve

When you mark an image with MarkMyAI, the invisible watermark encodes a unique identifier directly into the pixel data. That identifier links back to your proof record — even after C2PA metadata has been stripped.

The challenge is that every resize, format conversion, JPEG re-compression, or social media re-encode applies pixel-level changes to the image. The watermark has to survive those changes. If it does not, the recovery path fails.
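To make that concrete, here is a minimal sketch, using Pillow and NumPy rather than anything from our own stack, of how much a single JPEG re-save perturbs pixel values. This is the noise floor an invisible watermark has to survive:

```python
# Sketch: measure the pixel-level damage from one JPEG re-encode.
# Illustration only; this is not our decoder.
import io

import numpy as np
from PIL import Image


def jpeg_roundtrip(img: Image.Image, quality: int) -> Image.Image:
    """Re-encode an image as JPEG at the given quality, then decode it again."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


original = Image.open("marked.png").convert("RGB")  # any watermarked image
for quality in (90, 75, 50, 30):  # high -> severe compression
    degraded = jpeg_roundtrip(original, quality)
    a = np.asarray(original, dtype=np.int16)
    b = np.asarray(degraded, dtype=np.int16)
    diff = np.abs(a - b)
    print(f"quality={quality}: mean pixel change {diff.mean():.2f}, "
          f"max {diff.max()}, pixels changed {(diff > 0).mean():.1%}")
```

Even at high quality, a large share of pixels move at least a little; the watermark signal has to remain decodable underneath that.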

We have been working on a new watermark configuration that trades some payload capacity for much stronger error correction. The hypothesis: a shorter identifier with better error tolerance should survive more aggressive transformations than our current approach. This benchmark is our first systematic test of that hypothesis.
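As a toy model of that trade (a plain repetition code, not the production error-correction scheme), the sketch below fixes an embedding budget and compares a longer identifier with few repeats against a shorter identifier with more repeats, under simulated bit flips. The budget, payload lengths, and flip rate are made-up illustration values:

```python
# Toy model of the capacity-vs-robustness trade-off using a repetition code.
# The production scheme is different; this only illustrates the principle.
import random

BUDGET_BITS = 256  # assumed fixed embedding capacity of the carrier image


def embed(payload_bits: list[int], repeats: int) -> list[int]:
    """Repeat each payload bit `repeats` times (must fit the budget)."""
    assert len(payload_bits) * repeats <= BUDGET_BITS
    return [b for b in payload_bits for _ in range(repeats)]


def channel(bits: list[int], flip_rate: float) -> list[int]:
    """Simulate transformation noise as independent bit flips."""
    return [b ^ (random.random() < flip_rate) for b in bits]


def decode(received: list[int], repeats: int) -> list[int]:
    """Majority vote over each group of `repeats` copies."""
    return [int(sum(received[i:i + repeats]) > repeats / 2)
            for i in range(0, len(received), repeats)]


def recovery_rate(payload_len: int, flip_rate: float, trials: int = 2000) -> float:
    repeats = BUDGET_BITS // payload_len
    payload = [random.getrandbits(1) for _ in range(payload_len)]
    ok = sum(decode(channel(embed(payload, repeats), flip_rate), repeats) == payload
             for _ in range(trials))
    return ok / trials


for payload_len in (64, 32):  # longer ID vs shorter ID, same budget
    print(f"{payload_len}-bit ID, {BUDGET_BITS // payload_len}x repetition: "
          f"{recovery_rate(payload_len, flip_rate=0.15):.1%} full recoveries")
```

In this toy setup, the shorter payload recovers its full identifier far more often than the longer one under the same noise, which is the intuition behind the new configuration.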

How we tested it

We took 11 original images — a mix of photos, editorial images, and product shots — and marked each one with three different watermark configurations. Then we put each marked image through 11 transformations representing the kinds of processing that happen in real publishing workflows:

What we tested

  • 11 original images
  • 3 watermark candidates
  • 11 transformations per image
  • 363 decode checks total

Transformations included

  • JPEG at 4 quality levels
  • Resize and center crop
  • WebP conversion
  • PNG re-save
  • Social-media-style pipeline

The key discipline: all three configurations were tested on the exact same source images and the exact same transformations. No cherry-picking. The same image, marked three ways, run through the same gauntlet.
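For readers who want the shape of the harness, the sketch below implements a gauntlet like this with Pillow. Since our codecs are not public, `run` takes the embed/decode functions of whichever codec is under test, and the social pipeline here is a rough stand-in (downscale plus medium-quality JPEG) rather than a faithful platform replica:

```python
# Sketch of the benchmark loop. Transformations use Pillow; the codec under
# test is passed in as plain embed/decode callables.
import io

from PIL import Image


def jpeg(quality):
    def apply(img):
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        return Image.open(buf).convert("RGB")
    return apply


def resize(factor):
    return lambda img: img.resize((int(img.width * factor), int(img.height * factor)))


def center_crop(factor):
    def apply(img):
        w, h = int(img.width * factor), int(img.height * factor)
        left, top = (img.width - w) // 2, (img.height - h) // 2
        return img.crop((left, top, left + w, top + h))
    return apply


def reencode(fmt):
    def apply(img):
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format=fmt)
        buf.seek(0)
        return Image.open(buf)
    return apply


def social_pipeline(img):
    # Rough stand-in for a platform re-encode: downscale, then medium JPEG.
    return jpeg(70)(resize(0.60)(img))


TRANSFORMS = {
    "jpeg_high": jpeg(90), "jpeg_medium": jpeg(75),
    "jpeg_aggressive": jpeg(50), "jpeg_severe": jpeg(30),
    "resize_75": resize(0.75), "resize_50": resize(0.50),
    "crop_80": center_crop(0.80),
    "webp": reencode("WEBP"), "png": reencode("PNG"),
    "social": social_pipeline,
}


def run(images, candidates, identifier, embed, decode):
    """Count exact-match decodes per candidate: every image, every transform."""
    results = {}
    for name, config in candidates.items():
        passes = 0
        for img in images:
            marked = embed(img, identifier, config)
            for transform in TRANSFORMS.values():
                passes += (decode(transform(marked), config) == identifier)
        results[name] = passes
    return results
```

The sketch scores only exact identifier matches, which is the strictest reasonable reading of "verified recovery"; a partially recovered payload counts as a failure.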

Current baseline

Our production watermark today. The comparison point everything else is measured against.

Experimental candidate Q (leading candidate)

A new configuration with a shorter embedded identifier and significantly stronger error correction. Designed to be a better all-rounder.

Experimental candidate P

A more aggressive variant that originally showed promise on very hard social-style cases. This benchmark is the first time we have pitted it directly against the baseline.

The results

| Configuration | Verified recoveries | Note |
| --- | --- | --- |
| Current baseline | 114 / 121 | Production default today |
| Experimental candidate Q | 119 / 121 | Best overall — 5 more recoveries than baseline |
| Experimental candidate P | 117 / 121 | Best on the hardest social-style pipeline |

The headline number: the leading experimental candidate cut failures from 7 to 2. That is not a rounding error — it's a 71% reduction in failure cases on the same set of images.

Where the difference actually came from

| Transformation | Baseline | Candidate Q | Candidate P |
| --- | --- | --- | --- |
| JPEG compression (high / medium / aggressive / severe) | ✅ / ✅ / ✅ / ✅ | ✅ / ✅ / ✅ / ✅ | ✅ / ✅ / 10/11 / 8/11 |
| Resize 75% / 50% | 10/11 / ✅ | ✅ / ✅ | ✅ / ✅ |
| Center crop 80% | 10/11 | ✅ | ✅ |
| Convert to WebP | ✅ | ✅ | ✅ |
| Re-save as PNG | 10/11 | ✅ | ✅ |
| Social-media-style pipeline | 7/11 | 9/11 | ✅ |

(✅ = all 11 images recovered for that transformation.)

Two patterns stand out:

Candidate Q: the reliable all-rounder

It fixed every category where the baseline had gaps — resize, crop, re-saved PNG — without introducing any new failures. That kind of clean sweep across all transformation types is what makes it the better production candidate.

Candidate P: brilliant in one scenario, weaker in another

It was the only configuration to achieve a perfect score on the social-media-style pipeline. But it broke down on aggressive JPEG compression — something the baseline handles fine. A specialist, not a default.

The trade-off we had to make

Looking at the social-media pipeline alone, you might pick Candidate P. It hit 11 out of 11 there. That feels like the obvious winner.

But a production default has to hold up across the full range of what publishers actually encounter — not just the most photogenic test case. Across the complete benchmark, Candidate Q was stronger. It had fewer failures overall and no new regressions.

This is exactly the kind of decision that you only make well with controlled benchmarks. Intuition and one impressive result are not enough. The full picture is.

What this does not yet prove

These transformations were run in a controlled lab environment. We designed them to reflect real-world conditions — the JPEG settings, resize percentages, and social pipeline were chosen based on what platforms actually do — but we have not yet run the same marked images through a real WordPress installation, a live CDN, or an actual social-platform upload.

That is the next step. Before we change any production default, we need to see those numbers too.

We also want to be clear about scope: even a watermark that survives everything is not the whole answer. The invisible watermark is the recovery layer — it points back to your proof record when C2PA has already been stripped. C2PA, the audit trail, and the blockchain anchor each play different roles. Better watermark recovery makes the overall system stronger, but it does not make the other layers less important.
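In rough pseudocode, the layering reads like this. Every name here is illustrative rather than our public API; it only shows the order in which the layers come into play:

```python
# Illustrative layering of the verification path; helper names are
# hypothetical, not a real API.
def verify_provenance(image_bytes):
    # Layer 1: C2PA manifest, if the metadata survived.
    manifest = read_c2pa_manifest(image_bytes)        # hypothetical helper
    if manifest is not None and manifest.signature_valid:
        return lookup_proof_record(manifest.record_id)

    # Layer 2: invisible watermark, the recovery path after stripping.
    record_id = decode_watermark(image_bytes)         # hypothetical helper
    if record_id is not None:
        return lookup_proof_record(record_id)

    # Layer 3: nothing recoverable from the file itself; fall back to the
    # audit trail and blockchain anchor on the server side.
    return None
```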

Where we go from here

The leading experimental candidate is now our front-runner for a future default. The current baseline remains what ships today. We are not switching production defaults based on one benchmark run.

Next: the same marked images through real WordPress rendition flows and CDN pipelines. If those results confirm the same direction, the case for a production switch becomes much stronger.

We will publish those results here too, with the same level of detail.

More context
If you want to understand why this problem exists in the first place — why WordPress and CDNs create a gap between the original you mark and the image your audience sees — start here.