Avoiding Duplicate Content for SEO

Duplicate content is the same or very similar content that appears at more than one URL on the web. Search engines can penalize websites that host duplicate content, so you need a clear strategy to make sure your content does not get penalized.

For someone getting started with a content strategy, the first time duplicate content is likely to come up is if you want to syndicate your blog on another website. The best strategy is to agree on the canonical version of your content in advance. Decide which version should be shown by search engines and implement canonical tags so Google knows what you've decided. Read further to understand the issues.

What is duplicate content?

Search engines read individual pages on the web. Each page has a unique address or URL. An example of an individual URL is https://www.partnervine.com/blog/happy-customers. Search engines need to decide which page should be served up in a search, and they do not want to serve up the same page twice. Search engines penalize websites that they consider to be gaming the search engine rankings by posting the same content in multiple places.

Interestingly, unless you know where to look, search engines will not tell you that your content is considered duplicate, and the effects can be significant. For example, Google penalties range from pages dropping in search rankings to the removal of an entire website from search results. One way to check whether Google has penalized your website is Google Search Console (formerly known as Google Webmaster Tools). If your website has been penalized, the "Manual Actions" page lists the issues. In practice, the affected pages or the whole website either drop in rank or disappear from search results completely.

The image is from the Manual Actions section of the Google Search Console for LegalForms:

[Image: Manual Actions page in Google Search Console]

Duplicate content on your own website

The first place to avoid duplicate content is on your own website. Duplicate content is often inadvertent, and here are some of the reasons you may be hosting duplicate content without even knowing it:

Session ID URLs, breadcrumb links, tracking URLs, and permalinks that some content management systems generate automatically.
HTTP, HTTPS, WWW, and non-WWW versions of the same URL.
Mobile URLs such as m.domain.com.
Country URLs if you want search engines to provide separate results for different countries.
Multiple URLs created for filtering options like ratings, prices, or promotions.

Many content management systems (CMSs) and dynamic website builders aggravate the problem: they automatically generate multiple URLs for the same content, for example through tag and category pages.
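Several of the causes listed above come down to one page being reachable at many URLs. As a rough illustration, the Python sketch below collapses common variants (HTTP versus HTTPS, WWW and mobile hosts, tracking and session parameters) into one comparable form and groups the URLs that collide. The parameter names in IGNORED_PARAMS, the host prefixes, and the example.com URLs are assumptions chosen for illustration, not a complete or authoritative list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that usually do not change what a page shows.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid", "ref"}


def normalize_url(url: str) -> str:
    """Collapse common variants of a URL into one comparable form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Assumption: www. and m. hosts serve the same content as the bare domain.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Drop tracking and session parameters; sort the rest for a stable order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    # Force https and discard the fragment.
    return urlunsplit(("https", host, path, urlencode(query), ""))


def group_duplicates(urls):
    """Return only the groups in which several URLs normalize to the same form."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Running group_duplicates over a list of URLs exported from a crawl gives a first list of candidates to review. It only compares addresses, so it cannot tell whether the pages behind them really carry the same content.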

The fastest way to identify duplicate content is to take a sentence from your content and search for it on Google. Be sure to add quotation marks so that Google searches for the phrase exactly as written, i.e. "place the sentence in quotation marks." Google only uses the first 32 words of a query, so searching for a longer sample of your content will not make the match more exact.
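If you check many pages, building these exact-phrase queries by hand gets tedious. The Python sketch below wraps a sentence in quotation marks, trims it to the 32-word limit mentioned above, and produces a search link to open in a browser. The function names are hypothetical, and the sketch only builds the link; it does not send any request to Google.

```python
from urllib.parse import quote_plus

MAX_QUERY_WORDS = 32  # word limit described in the paragraph above


def exact_phrase_query(sentence: str, max_words: int = MAX_QUERY_WORDS) -> str:
    """Wrap the first max_words words of a sentence in quotation marks."""
    # Remove quotation marks inside the sentence so they cannot end the phrase early.
    words = sentence.replace('"', " ").split()[:max_words]
    return '"' + " ".join(words) + '"'


def search_url(sentence: str) -> str:
    """Build a Google search link for the exact phrase."""
    return "https://www.google.com/search?q=" + quote_plus(exact_phrase_query(sentence))
```

For example, search_url("Duplicate content is often inadvertent") returns a link that searches for that sentence as an exact phrase.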

In addition to checking directly in Google, there are many web services online that crawl the web and identify duplicate content. Although they are not search engines, they will give you an idea of how you are doing. Some of the websites are:

Siteliner. Siteliner identifies duplicate content on a single domain name. There is a free version that is a good place to start for smaller websites.
Moz Pro. Moz Pro’s Site Crawl performs an online audit of your site that includes a long list of deliverables, including pages that show up as duplicate content.
Screaming Frog. Screaming Frog’s SEO Spider Tool is a desktop application that crawls and audits your site, and it has a free version.

Duplicate content on third party websites

In addition to inadvertent duplicate content on your own website, your content may be duplicated on third party websites.

There are many tools to check if your content is being copied, including the following:

Copyscape. Copyscape is Siteliner's counterpart for the wider web: it crawls the web looking for duplicate versions of your content. It also has a free version and is a good place to start.
SEO Review Tools Duplicate Content Checker
Plagiarism Checker by Grammarly

Illegitimate duplication. The duplicate content may be illegitimate or inadvertent. If you are concerned about a website copying your content illegitimately, you can file a complaint with Google under the Digital Millennium Copyright Act (DMCA) in the US. The process is not entirely straightforward, though, and should only be used with proper advice. Apart from a DMCA filing, you should include copyright notices and terms of use for your content. When you want to stop a party from using your content, you can contact them yourself or hire a third party to do so.

Inadvertent duplication. If you are producing good content, a news website or other third party may want to republish your articles. If you let them republish an article without agreeing which version should appear in searches, you may be penalized for duplicate content. To avoid that situation, the version you want search engines to disregard should carry a canonical tag that points to the preferred version, as described below. Doing so lets you syndicate your content, extend the reach of your digital marketing, and lower the chance that your website is penalized for duplicate content. It's a virtuous circle for both you and the news website.

Deciding which URL appears in searches is particularly important for the smaller partner, as the duplicate content will represent a larger proportion of its domain. A news website that posts a substantial number of articles will not be at greater risk of a penalty with one more article. That might not be the case, though, for the website of a guest columnist or author who allows their content to be republished.

Canonical tags

Canonical tags are a simple technical solution to avoid being penalized for duplicate content. A canonical tag tells search engines that a web page should be disregarded in search results and points them to the ‘canonical’ version that should appear instead. Canonical tags have a simple and consistent syntax and are placed in the head section of a web page. Here is what one looks like:

<link rel="canonical" href="https://abc.com/sample-page/" />

You can break the tag into two parts:

rel="canonical" declares that the link points to the master, or canonical, version of the page
href="https://abc.com/sample-page/" gives the URL where the original content can be found

When it comes to using canonical tags, ensure that the pages have the same or comparable content. This means that the images and text should be close to identical. If the pages only relate to each other but do not have the same content, the search engines may ignore the tags, and Google might even disregard tags on your website in the future.
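To confirm that a republished page really points back to the version you agreed on, you can read the canonical link out of its HTML. The Python sketch below uses only the standard library; the class and function names are hypothetical, the sample page reuses the abc.com address from above, and fetching the page over the network is deliberately left out.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, so check each one.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def points_to(html: str, expected_url: str) -> bool:
    """Check that a page declares expected_url as its canonical version."""
    return find_canonical(html) == expected_url


sample_page = (
    "<html><head>"
    '<link rel="canonical" href="https://abc.com/sample-page/" />'
    "</head><body>Republished article</body></html>"
)
```

Here points_to(sample_page, "https://abc.com/sample-page/") is true. The comparison is an exact string match, so a trailing slash or an http address in place of https counts as a mismatch.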

As an alternative to a canonical tag, you can use a noindex tag, which tells Google that a web page should not show up in search results. A noindex tag is the simplest message for Google to read, so it is less likely to be ignored than a canonical tag. Canonical tags, however, help build links to the authoritative version of your content, so they should be better for the SEO of your content strategy.
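The text above does not show the noindex markup. The Python sketch below assumes its most common form, a robots meta tag in the head section such as <meta name="robots" content="noindex">, and checks whether a page carries it. The class and function names are hypothetical, and the sketch ignores other ways of sending the same signal, such as HTTP response headers.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from robots meta tags in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Assumption: only the generic "robots" and the "googlebot" names are checked.
        if (attributes.get("name") or "").lower() not in {"robots", "googlebot"}:
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, for example "noindex, follow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives
```

A check like this is useful on a syndicated copy, where a page that was meant to carry a canonical tag may have been given a noindex tag instead, or both.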

A practical example

I run the legal tech company LegalForms, and am a PartnerVine Fellow. As part of my fellowship, I write articles with the PartnerVine team. Since both companies are in legal tech, I want to be sure that our content strategies are aligned. So far, the articles I have written only appear on PartnerVine. If we wanted to though, we could share content using canonical tags to extend the reach of our articles, create a better user experience for both websites, and avoid being penalized for duplicate content by search engines. It's a better way to collaborate for search engine optimization.