How to Make Sure a Website Never Appears Again

What is duplicate content?

Duplicate content is content that appears on the Internet in more than one place. That "one place" is defined as a location with a unique website address (URL), so if the same content appears at more than one web address, you've got duplicate content.

While not technically a penalty, duplicate content can still sometimes impact search engine rankings. When there are multiple pieces of, as Google calls it, "appreciably similar" content in more than one location on the Internet, it can be difficult for search engines to decide which version is more relevant to a given search query.

Why does duplicate content matter?

For search engines

Duplicate content can present three main issues for search engines:

They don't know which version(s) to include/exclude from their indices.
They don't know whether to direct the link metrics (trust, authority, anchor text, link equity, etc.) to one page, or keep them separated between multiple versions.
They don't know which version(s) to rank for query results.

For site owners

When duplicate content is present, site owners can suffer rankings and traffic losses. These losses often stem from two main problems:

To provide the best search experience, search engines will rarely show multiple versions of the same content, and thus are forced to choose which version is most likely to be the best result. This dilutes the visibility of each of the duplicates.
Link equity can be further diluted because other sites have to choose between the duplicates as well. Instead of all inbound links pointing to one piece of content, they link to multiple pieces, spreading the link equity among the duplicates. Because inbound links are a ranking factor, this can then impact the search visibility of a piece of content.

The net result? A piece of content doesn't achieve the search visibility it otherwise would.

[Image: Duplicate content issues for search engines]

How do duplicate content issues happen?

In the vast majority of cases, website owners don't intentionally create duplicate content. But that doesn't mean it's not out there. In fact, by some estimates, up to 29% of the web is actually duplicate content!

Let's take a look at some of the most common ways duplicate content is unintentionally created:

1. URL variations

URL parameters, such as click tracking and some analytics code, can cause duplicate content issues. This can be a problem caused not only by the parameters themselves, but also by the order in which those parameters appear in the URL.

For example:

www.widgets.com/blue-widgets?c... is a duplicate of www.widgets.com/blue-widgets
www.widgets.com/blue-widgets?c...&cat=3 is a duplicate of www.widgets.com/blue-widgets?cat=3&color=blue

Similarly, session IDs are a common duplicate content creator. This occurs when each user that visits a website is assigned a different session ID that is stored in the URL.

[Image: Session IDs or parameters can create duplicate content]

Printer-friendly versions of content can also cause duplicate content issues when multiple versions of the pages get indexed.

[Image: Printer-friendly page versions can create duplicate content issues]

One lesson here is that when possible, it's often beneficial to avoid adding URL parameters or alternate versions of URLs (the information those contain can usually be passed through scripts).
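To see why parameter order and session IDs produce "different" URLs for the same page, it can help to reduce each URL to a comparison key. The following is a minimal sketch in Python, not a description of how any search engine works: the `IGNORED_PARAMS` set, the `sessionid` parameter name, and the `normalize_url` helper are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of parameters that never change what the page shows.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize_url(url: str) -> str:
    """Return a comparison key: ignored parameters dropped, the rest sorted."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    params.sort()  # parameter order no longer matters
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )


# Illustrative URLs: same page, different parameter order plus a session ID.
a = normalize_url("http://www.widgets.com/blue-widgets?color=blue&cat=3")
b = normalize_url(
    "http://www.widgets.com/blue-widgets?cat=3&color=blue&sessionid=abc123"
)
print(a == b)  # True
```

Two URLs that map to the same key are candidates for consolidation with one of the fixes described later in this article.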

2. HTTP vs. HTTPS or WWW vs. non-WWW pages

If your site has separate versions at "www.site.com" and "site.com" (with and without the "www" prefix), and the same content lives at both versions, you've effectively created duplicates of each of those pages. The same applies to sites that maintain versions at both http:// and https://. If both versions of a page are live and visible to search engines, you may run into a duplicate content issue.
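One way to reason about this is to pick a single preferred scheme and host and map every variant onto it. The sketch below assumes, purely for illustration, that https://www.site.com is the preferred version; `PREFERRED_SCHEME`, `PREFERRED_HOST`, `KNOWN_HOSTS`, and `preferred_url` are hypothetical names, and any explicit port in the input is dropped.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for illustration; choose whichever version your site
# actually treats as canonical.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.site.com"
KNOWN_HOSTS = {"site.com", "www.site.com"}


def preferred_url(url: str) -> str:
    """Map any http/https, www/non-www variant onto the preferred version."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in KNOWN_HOSTS:
        return url  # not one of our hosts, leave it untouched
    return urlunsplit(
        (PREFERRED_SCHEME, PREFERRED_HOST, parts.path or "/",
         parts.query, parts.fragment)
    )


for variant in (
    "http://site.com/page",
    "http://www.site.com/page",
    "https://site.com/page",
):
    print(variant, "->", preferred_url(variant))
```

All three variants resolve to the same address, which is the target a redirect or canonical link would point to.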

3. Scraped or copied content

Content includes not only blog posts or editorial content, but also product information pages. Scrapers republishing your blog content on their own sites may be a more familiar source of duplicate content, but there's a common problem for e-commerce sites, as well: product information. If many different websites sell the same items, and they all use the manufacturer's descriptions of those items, identical content winds up in multiple locations across the web.

How to fix duplicate content issues

Fixing duplicate content issues all comes down to the same central idea: specifying which of the duplicates is the "correct" one.

Whenever content on a site can be found at multiple URLs, it should be canonicalized for search engines. Let's go over the three main ways to do this: using a 301 redirect to the correct URL, using the rel=canonical attribute, or using the parameter handling tool in Google Search Console.

301 redirect

In many cases, the best way to combat duplicate content is to set up a 301 redirect from the "duplicate" page to the original content page.

When multiple pages with the potential to rank well are combined into a single page, they not only stop competing with one another; they also create a stronger relevancy and popularity signal overall. This will positively impact the "correct" page's ability to rank well.

[Image: Fixing duplicate content issues with 301 redirects]
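In practice, 301 redirects are usually configured in the web server or framework a site already uses. As a rough, server-agnostic illustration of the behavior, here is a minimal WSGI application in Python. The `REDIRECTS` mapping and its paths are hypothetical; only the "301 Moved Permanently" status and the `Location` header are standard HTTP.

```python
# Hypothetical mapping from duplicate paths to the page that should rank.
REDIRECTS = {
    "/blue-widgets-print": "/blue-widgets",
    "/widgets/blue": "/blue-widgets",
}


def app(environ, start_response):
    """Minimal WSGI app: answer known duplicate paths with a 301."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # A permanent redirect tells clients and crawlers the content moved.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"original content\n"]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve the app on port 8000; requesting one of the duplicate paths should then return a 301 pointing at /blue-widgets.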

Rel="canonical"

Another option for dealing with duplicate content is to use the rel=canonical attribute. This tells search engines that a given page should be treated as though it were a copy of a specified URL, and all of the links, content metrics, and "ranking power" that search engines apply to this page should actually be credited to the specified URL.

The rel="canonical" attribute is part of the HTML head of a web page and looks like this:

General format:

<head>...[other code that might be in your document's HTML head]...<link href="URL OF ORIGINAL PAGE" rel="canonical" />...[other code that might be in your document's HTML head]...</head>

The rel=canonical attribute should be added to the HTML head of each duplicate version of a page, with the "URL OF ORIGINAL PAGE" portion above replaced by a link to the original (canonical) page. (Make sure you keep the quotation marks.) The attribute passes roughly the same amount of link equity (ranking power) as a 301 redirect, and, because it's implemented at the page (instead of server) level, often takes less development time to implement.
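When auditing pages, it can be handy to check which URL a page declares as canonical. The following sketch uses Python's standard `html.parser` module to pull the first rel="canonical" link out of a page's HTML. The `CanonicalFinder` class, the `find_canonical` helper, and the sample page are illustrative assumptions, not part of any SEO tool.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> encountered."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Illustrative page source for a duplicate version of a page.
page = (
    "<head><title>Blue widgets</title>"
    '<link href="http://www.widgets.com/blue-widgets" rel="canonical" />'
    "</head>"
)
print(find_canonical(page))  # http://www.widgets.com/blue-widgets
```

A result of None means the page declares no canonical URL at all, which is worth flagging for any page known to have duplicates.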

Below is an example of what a canonical attribute looks like in action:

[Image: Using MozBar to identify canonical attributes.]

Here, we can see BuzzFeed is using the rel=canonical attribute to accommodate their use of URL parameters (in this case, click tracking). Although this page is accessible by two URLs, the rel=canonical attribute ensures that all link equity and content metrics are awarded to the original page (/no-one-does-this-anymore).

Meta Robots Noindex

One meta tag that can be particularly useful in dealing with duplicate content is meta robots, when used with the values "noindex, follow." Commonly called Meta Noindex,Follow and technically known as content="noindex,follow", this meta robots tag can be added to the HTML head of each individual page that should be excluded from a search engine's index.

General format:

<head>...[other code that might be in your document's HTML head]...<meta name="robots" content="noindex,follow">...[other code that might be in your document's HTML head]...</head>

The meta robots tag allows search engines to crawl the links on a page but keeps them from including the page itself in their indices. It's important that the duplicate page can still be crawled, even though you're telling Google not to index it, because Google explicitly cautions against restricting crawl access to duplicate content on your website. (Search engines like to be able to see everything in case you've made an error in your code. It allows them to make a [likely automated] "judgment call" in otherwise ambiguous situations.)

Using meta robots is a particularly good solution for duplicate content issues related to pagination.
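To verify that a page carries the intended directives, the meta robots tag can be read back out of the HTML. This is a small sketch using Python's standard `html.parser`; the `RobotsMetaParser` class and `robots_directives` helper are hypothetical names, and the sketch only looks at the generic name="robots" tag, not crawler-specific variants.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex,follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<head><meta name="robots" content="noindex,follow"></head>'
directives = robots_directives(page)
print("noindex" in directives and "follow" in directives)  # True
```

An empty set means the page has no meta robots tag, so it is indexable by default.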

Preferred domain and parameter handling in Google Search Console

Google Search Console allows you to set the preferred domain of your site (i.e. http://yoursite.com instead of http://www.yoursite.com) and specify whether Googlebot should crawl various URL parameters differently (parameter handling).

[Image: Google Search Console settings]

Depending on your URL structure and the cause of your duplicate content issues, setting up either your preferred domain or parameter handling (or both!) may provide a solution.

The main drawback to using parameter handling as your primary method for dealing with duplicate content is that the changes you make only work for Google. Any rules put in place using Google Search Console will not affect how Bing or any other search engine's crawlers interpret your site; you'll need to use the webmaster tools for other search engines in addition to adjusting the settings in Search Console.

Additional methods for dealing with duplicate content

Maintain consistency when linking internally throughout a website. For example, if a webmaster determines that the canonical version of a domain is www.example.com/, then all internal links should go to http://www.example.co... rather than http://example.com/pa... (notice the absence of www).
When syndicating content, make sure the syndicating website adds a link back to the original content and not a variation on the URL. (Check out our Whiteboard Friday episode on dealing with duplicate content for more information.)
To add an extra safeguard against content scrapers stealing SEO credit for your content, it's wise to add a self-referential rel=canonical link to your existing pages. This is a canonical attribute that points to the URL it's already on, the point being to thwart the efforts of some scrapers.

A self-referential rel=canonical link: The URL specified in the rel=canonical tag is the same as the current page URL.

While not all scrapers will port over the full HTML code of their source material, some will. For those that do, the self-referential rel=canonical tag will ensure your site's version gets credit as the "original" piece of content.
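Whether a canonical link is self-referential can be checked by resolving its href against the page's own URL and comparing the two. The sketch below is one simple way to do that in Python; `is_self_referential` is a hypothetical helper, the example URLs are made up, and the comparison is a plain string match after resolving relative links and dropping fragments.

```python
from urllib.parse import urldefrag, urljoin


def is_self_referential(page_url: str, canonical_href: str) -> bool:
    """True when the canonical href resolves to the page's own URL."""
    # Relative hrefs such as "/blue-widgets" are resolved against the page.
    resolved = urljoin(page_url, canonical_href)
    return urldefrag(resolved).url == urldefrag(page_url).url


# Illustrative checks with made-up URLs.
print(is_self_referential(
    "http://www.example.com/blue-widgets", "/blue-widgets"))
print(is_self_referential(
    "http://www.example.com/blue-widgets?ref=feed",
    "http://www.example.com/blue-widgets"))
```

The first check reports a self-referential canonical; the second does not, because the page URL carries a tracking parameter that the canonical URL omits, which is the kind of duplicate the canonical link is meant to consolidate.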

Source: https://moz.com/learn/seo/duplicate-content