There’s a lot of junk on the web. There is also a lot of good stuff on the web. And then there is the stuff that’s been lifted from the good and dropped amid the dross—the aggregation, the block-quotes, the straight-off copy-paste jobs.
The extent of that duplication now has a number: according to Matt Cutts, a longtime Google search engineer who developed the company’s family-friendly “SafeSearch” filter and now leads its web spam team, “something like 25% or 30% of the web’s content is duplicate content.”
That’s not necessarily a bad thing. Not all of the duplication is plagiarized or hastily created traffic-seeking junk. Inoffensive duplication includes quotes from blog posts that link back to the original, or the thousands of copies of technical manuals scattered across the web that are updated with small changes but remain largely the same.
Nonetheless, if search engines didn’t have a way to detect duplications, the internet would be almost unnavigable. Google’s approach, as you’ll almost certainly have noticed when you use it, is to omit pages that have very similar content, but to offer users the ability to see the similar results if they’re really interested. Things that are auto-created, however, like a blog that’s made up entirely of feeds from other blogs, might be treated as spam, Cutts says. That means most people will never encounter this large chunk of the internet. It also means there’s that much less you need to get through before you finish reading the entire internet this morning.
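To get a feel for how a search engine might spot near-duplicates at all, here is a minimal sketch using word shingles and Jaccard similarity, a classic textbook approach. This is purely illustrative: Google’s actual algorithm, threshold, and shingle size are not described in the source, and every name and number below is an assumption chosen for demonstration.

```python
# Illustrative near-duplicate detection via word shingles and Jaccard
# similarity. NOT Google's actual method; the shingle size k and the
# threshold are arbitrary values picked for this demo.

def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word shingles in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate(doc1: str, doc2: str, threshold: float = 0.8) -> bool:
    """Flag two documents as near-duplicates if their shingle sets overlap heavily."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold

original = "the quick brown fox jumps over the lazy dog near the river"
copied   = "the quick brown fox jumps over the lazy dog near the creek"
fresh    = "search engines filter duplicate pages from their results"

print(near_duplicate(original, copied))  # → True  (one word changed)
print(near_duplicate(original, fresh))   # → False (no shared shingles)
```

At web scale this pairwise comparison is far too slow, which is why real systems use fingerprinting tricks such as MinHash or SimHash to approximate the same similarity cheaply; the idea of “heavily overlapping content” is the same.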
Note: This post is based on a YouTube video posted by Cutts, which means in a manner of speaking, this too is duplicate content.