Google on the percentage that represents duplicate content


Google’s John Mueller recently answered a question about whether there is a percentage threshold for duplicate content that Google uses to identify and filter out duplicate content.

What percentage equals duplicate content?

The conversation actually started on Facebook when Duane Forrester (@DuaneForrester) asked if anyone knew if any search engine published a content overlap percentage at which content is considered duplicate.

Bill Hartzer (@bhartzer) took to Twitter to ask John Mueller and received an almost immediate response.

Bill tweeted:

“Hey @johnmu, is there a percentage that represents duplicate content?

For example, should we try to ensure that pages are at least 72.6 percent unique from other pages on our site?

Does Google even measure it?”

Google’s John Mueller responded:

How does Google detect duplicate content?

Google’s methodology for detecting duplicate content has been remarkably similar for many years.

In 2013, Matt Cutts (@mattcutts), then a software engineer at Google, posted an official Google video describing how Google detects duplicate content.

He started the video by stating that a lot of internet content is duplicated and that this is normal.

“It’s important to realize that if you look at content online, about 25% or 30% of all online content is duplicate content.

… People will quote a paragraph of the blog and then link to the blog, things like that.”

He went on to say that Google will not penalize this content because so much of the duplicate content is innocent and without spamming intent.

According to him, penalizing websites for duplicate content would negatively affect the quality of search results.

What Google does when it finds duplicate content is:

“…try to group everything together and treat it as if it’s just one piece of content.”

Matt continued:

“It’s just treated as something that we have to put together properly. And we need to ensure that it is properly ranked.”

He explained that Google then chooses which page to display in the search results and that it filters out duplicate pages to improve the user experience.

How Google handles duplicate content – 2020 version

Fast forward to 2020: Google released an episode of the Search Off the Record podcast in which the same topic is described in remarkably similar language.

Here is the relevant section of the podcast, starting at 06:44 into the episode:

“Gary Illyes: And then comes the next step, which is actually canonicalization and dupe detection.

Martin Splitt: Isn’t that the same thing, dupe detection and canonicalization?

Gary Illyes: [00:06:56] Well, it’s not, is it? Because first you have to detect the dupes, basically cluster them together, saying that all of these pages are dupes of each other,
and then you basically have to find a leader page for all of them.

… And that is canonicalization.

So you have duplication, which is the whole term, but within that you have cluster building, like dupe cluster building, and canonicalization.”

Gary then explains in technical terms exactly how they do it. Basically, Google doesn’t look at percentages; it compares checksums.

A checksum represents the content as a short string of numbers and letters. If the content is duplicated, the checksums will match (or, for near-duplicates, be similar).
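As an illustration, here is a minimal Python sketch of reducing content to a checksum with a standard hash function. This is not Google’s actual system (an exact hash like SHA-256 only matches exact duplicates; catching near-duplicates requires similarity hashing such as simhash), but it shows the basic idea of comparing checksums instead of the content itself:

```python
import hashlib

def checksum(content: str) -> str:
    """Reduce page content to a fixed-length checksum.

    The normalization step (lowercasing, collapsing whitespace) is a
    hypothetical simplification, not something Google has described.
    """
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

page_a = "Google detects duplicate content by comparing checksums."
page_b = "Google  detects duplicate content by comparing checksums. "
page_c = "An entirely different article about something else."

print(checksum(page_a) == checksum(page_b))  # True: same content after normalization
print(checksum(page_a) == checksum(page_c))  # False: different content
```

Comparing two fixed-length digests is far cheaper than comparing the full text of every pair of pages, which is presumably why this approach is, in Gary’s words, how “most people” at search engines do it.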

Gary explained it this way:

“So what we do for dupe detection is, well, we try to detect dupes.

And the way we do that is perhaps the way most people at other search engines do it, which is basically reducing the content to a hash or checksum and then comparing the checksums.”

Gary said Google does it this way because it’s easier (and apparently accurate).

Google detects duplicate content with checksums

So when we talk about duplicate content, there is probably no percentage threshold, no single number at which content is considered duplicate.

Instead, duplicate content is detected by reducing the content to a checksum, and then comparing those checksums.
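The two steps described in the podcast, grouping pages that share a checksum into a cluster and then choosing one leader (canonical) page per cluster, could be sketched like this. It is a toy illustration with made-up URLs; Google’s real leader selection weighs many signals this sketch does not model:

```python
from collections import defaultdict

# Hypothetical (url, checksum) pairs, as produced by a hashing step.
pages = [
    ("https://example.com/post", "abc123"),
    ("https://mirror.example.net/post", "abc123"),
    ("https://example.com/other", "def456"),
]

# Cluster building: group pages that share a checksum.
clusters = defaultdict(list)
for url, digest in pages:
    clusters[digest].append(url)

# Canonicalization: pick one leader page per cluster.
# Here we simply take the first URL seen; a real system would score
# candidates on signals like HTTPS, redirects, and sitemap inclusion.
canonical = {digest: urls[0] for digest, urls in clusters.items()}

print(canonical["abc123"])  # the leader page for the duplicate cluster
```

Only the leader page from each cluster is then shown in the search results, which matches Matt Cutts’ earlier description of Google grouping duplicates and filtering them out to improve the user experience.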

An additional conclusion is that there seems to be a difference between partial duplication, where some of the content is duplicated, and full duplication, where all of it is.
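One way to picture partial duplication, purely as a hypothetical sketch rather than anything Google has documented, is to checksum content paragraph by paragraph instead of as a whole, so that a page quoting one paragraph of another shows up as a partial overlap:

```python
import hashlib

def para_checksums(content: str) -> set:
    """Checksum each paragraph separately (paragraphs split on blank lines)."""
    return {
        hashlib.sha256(p.strip().encode("utf-8")).hexdigest()
        for p in content.split("\n\n")
        if p.strip()
    }

original = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
partial_copy = "First paragraph.\n\nA new commentary paragraph."

shared = para_checksums(original) & para_checksums(partial_copy)
overlap = len(shared) / len(para_checksums(partial_copy))
print(f"{overlap:.0%} of the copy's paragraphs are duplicated")  # prints "50% ..."
```

A whole-page checksum of these two pages would not match at all, while the per-paragraph view reveals the quoted passage, which is the kind of innocent partial duplication Matt Cutts said Google does not penalize.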

Featured Image: Shutterstock/Ezume Images

