Dec 23 / Kevin

Avoiding Duplicate Content

Over the years, there has been plenty of controversy and rumor about whether search engines, Google in particular, penalize sites that contain duplicate content, either copied from other sites or generated internally through different views of the same posts: categories, tags, archives, etc. Whether or not a penalty applies, you should make an effort to remove duplicates so that visitors arriving from search engines land on the page you want them to see, not an archive view or a page that has become outdated.

Duplicate content will always exist, no matter what efforts are taken to eliminate it from the web. The whole point of the Internet was to create an inter-linking web of content in which knowledge is shared among sites, even though it shouldn’t necessarily be exactly the same. Unless you are trying to manipulate the search engines with spam, there is really no reason to be afraid of receiving a penalty for having duplicate content on your own site. Simply aim to keep your content top-notch, and your pages will continue to be served near the top of the results.

What is Duplicate Content, Specifically?

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Most of the time when we see this, it’s unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and — worse yet — linked) via multiple distinct URLs, and so on. In some cases, content is duplicated across domains in an attempt to manipulate search engine rankings or garner more traffic via popular or long-tail queries.

Source: Official Google Webmaster Central Blog: Deftly Dealing with Duplicate Content

How Search Engines Determine Duplicate Content

The primary focus of search engines is to deliver the most relevant results for every search. To keep duplicates out of those results, they don’t really apply “duplicate content penalties,” but rather exclusion: the most “relevant” version of a page is indexed, while the other versions are excluded or moved down in the results. So, in a way, there is a penalty, but it doesn’t affect your entire site unless your entire site is filled with content taken from around the web.

[Diagram: How a Search Engine Determines Duplicate Content. Copyright Elliance, Inc.]
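
As a rough illustration of what “appreciably similar” can mean, the sketch below compares two passages by their overlapping word sequences (shingles) and reports a Jaccard similarity score. This is a common textbook technique, not a description of what Google or any other search engine actually runs; the function names and the shingle size of three words are assumptions made for the example.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size` in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts are treated as a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of two shingle sets: 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)
```

A score close to 1.0 suggests that two pages carry nearly the same text; where to draw the line between “similar” and “duplicate” is a judgment call that this sketch does not settle.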

How to Check for Duplicate Content

One of the leading tools for checking for duplicate content and plagiarism is Copyscape, which uses the links and text found on your pages. It is also quite easy to check by hand: take a distinctive phrase from a piece of text and search for it as an exact match, with quotation marks around the text, across all webpages.
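
The manual exact-match search can be sketched in a few lines of Python. The helper below only builds the search URL for a quoted phrase; it does not query anything. The function name, the 12-word cut-off, and the use of Google’s public search URL are assumptions for illustration.

```python
from urllib.parse import quote_plus


def exact_match_query(passage, max_words=12):
    """Build a search URL that looks for the start of `passage` as an exact phrase."""
    words = passage.split()[:max_words]
    # Remove straight double quotes so they cannot break the quoted phrase.
    phrase = " ".join(words).replace('"', "")
    return "https://www.google.com/search?q=" + quote_plus(f'"{phrase}"')
```

Open the resulting URL in a browser and look for results that are not your own site.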

Best Practices

There are several key ways to ensure that duplicate content never makes its way into search engines, starting with preventing it from being created in the first place.

Often, WordPress and other blog systems create category and tag pages that, unless you are using excerpts, display the full content of every post. If you have the ability to turn this option off, it would be a good idea to do so, unless these pages are an important part of your website. Otherwise, you should exclude all category and tag pages from search engines by editing your robots.txt file, which can be created or edited through the Google Webmaster Tools interface.

To start, you should exclude the following (although you can also exclude specific file types, such as PNG and JPG images). Excluding a file means that no one will be able to reach it directly through search engines.

Information and guides on how to use the robots.txt file can be found here.

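# Apply these rules to every crawler.
User-agent: *
# Feeds, trackbacks, and the /wp/ directory at the site root.
Disallow: /comments/feed/
Disallow: /feed/
Disallow: /feed/atom/
Disallow: /feed/rss/
Disallow: /rss/
Disallow: /trackback/
Disallow: /wp/
# The same feed and trackback paths under any post or archive URL.
# The * and $ pattern characters are honored by major crawlers such as Googlebot.
Disallow: /*/comments/feed/$
Disallow: /*/feed/$
Disallow: /*/feed/atom/$
Disallow: /*/feed/rss/$
Disallow: /*/rss/$
Disallow: /*/trackback/$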

Note: This is a general robots.txt file, which specifies which pages search engine robots should skip. It is worth researching what works for other people and what has been excluded on other sites. The file can be found at http://sitename.com/robots.txt.
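
Before uploading a robots.txt file, it can help to test which URLs it would block. The sketch below uses Python’s standard urllib.robotparser module for that. Two caveats: the standard-library parser only does plain prefix matching, so the sample rules are limited to the non-wildcard lines from the file above, and the helper name and sample URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules above: only plain prefixes, no * or $ patterns,
# because the standard-library parser does not interpret those.
RULES = """\
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp/
"""


def blocked_urls(rules, urls, agent="*"):
    """Return the URLs from `urls` that `rules` tells `agent` not to fetch."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]
```

For example, passing http://sitename.com/feed/ and http://sitename.com/post-1/ to blocked_urls should report only the feed URL as blocked.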

A lot of search engine optimization through duplicate content removal and selective inclusion comes down to what you want the search engine robots to see and what you want your readers to discover. If comment feeds show up in results, visitors will find less value in them and will be less likely to click through than if the individual post appeared in their place.

Non-www vs. www

To make all your pages redirect to one version manually (in my case, http://www.blogtipz.com redirects to http://blogtipz.com, so that no duplicate content is generated from this inconsistency), add this to your .htaccess file:

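RewriteEngine On

# Match the www host (dots escaped so they match literally), ignoring case.
RewriteCond %{HTTP_HOST} ^www\.yourdomain\.com$ [NC]
# Permanently redirect every path to the same path on the non-www host.
RewriteRule ^(.*)$ http://yourdomain.com/$1 [R=301,L]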

The “Index” File Attribute

Often, additional text in the form of /index, /index.php, etc. is displayed in the URL of select pages, particularly the main page of your blog. You want all versions of the page to resolve to the same URL, which you can do with the following mod_rewrite rule.

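RewriteEngine On
RewriteBase /
# Test the original request line rather than REQUEST_URI, so that the internal
# DirectoryIndex lookup for "/" does not trigger a redirect loop.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/index\.html[?\s] [NC]
# The trailing "?" drops any query string from the redirect target.
# Use whichever host (www or non-www) you settled on for the rest of the site.
RewriteRule ^index\.html$ http://www.yourdomain.com/? [R=301,L]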

How Archives, Pages Look to Search Engines When Not Indexed Properly

Here’s an example demonstrating how multiple versions of a single page could get indexed (you don’t want this to happen).

  1. http://example.com/post-1
  2. http://example.com/post-1/
  3. http://example.com/?s=post-1
  4. http://www.example.com/post-1
  5. http://www.example.com/post-1/
  6. http://www.example.com/category/category-name/post-1/
  7. http://example.com/2008/12/post-1/
  8. … and so on (all for one post)
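
To see how several of these variants collapse into one address, here is a minimal Python sketch of URL canonicalization. It assumes the non-www host and a trailing slash are the preferred form and that index.html and index.php are the index file names; the function name is hypothetical, and a real site would enforce the same choices with redirects rather than in a script.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Map www/non-www, trailing-slash, and index-file variants to one URL."""
    scheme, host, path, query, _fragment = urlsplit(url)
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]  # prefer the non-www host (an assumption)
    for index_name in ("index.html", "index.php"):
        if path.endswith("/" + index_name):
            path = path[: -len(index_name)]  # keep the trailing slash
    if not path.endswith("/"):
        path += "/"
    return urlunsplit((scheme, host, path, query, ""))
```

Variants 1, 2, 4, and 5 from the list all come out as http://example.com/post-1/. The search, category, and date-based variants (3, 6, and 7) are different views rather than different spellings of the same address, so they need to be handled by excluding them from indexing instead.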

Other Techniques

On pages that you do not want indexed, the simplest method is to add a robots meta tag with the “noindex, nofollow” value to the page’s head, which tells the search engine not to index that individual page or follow its links. This may be simpler than going through the removal process illustrated below, which can be found within the Google Webmaster Tools area.
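
To confirm that a page really carries the “noindex, nofollow” setting, you can inspect its HTML. The sketch below uses Python’s standard html.parser module to collect the values of any robots meta tag; the class and function names are hypothetical, and fetching the page itself is left out.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A page that should stay out of the index is expected to return a set containing "noindex"; an empty set means no robots meta tag was found.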

[Screenshot: Google Webmaster Tools - Remove Links]

To Remove Content from Google:

Log in to your Webmaster Tools area, click the site you want to remove content from, then select Tools > Remove URLs to access the removal wizard. Next, click the “New Removal Request” button. At this point, you are presented with several options:

- Individual URLs: web pages, images, or other files.

- A directory and all subdirectories in your site.

- Your entire site.

- Cached copy of a Google search result.

Simply paste the URLs that you want to exclude on the following pages (after selecting one of the options). Your request will then be either approved or denied (you have to meet the requirements) within 3-5 business days.

Another way to control which content reaches search engines is through Sitemaps, which indicate the pages that should be indexed and how often they are updated. If you are using a developer-hosted blog, a sitemap should (under most circumstances) be created for you; beyond that, it is out of your control.

If you are using WordPress.org (self-hosted), then you are able to edit the file and configure how it is presented to search engine robots. The main idea is to include as much content as possible, excluding tags, categories, archive pages, and comments. These provide very little value to readers/search engines, and shouldn’t be indexed. Several plugins to manually build sitemaps are available here.
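
As a sketch of the idea (not a replacement for a sitemap plugin), the Python below builds a minimal sitemap that lists posts and leaves out tag, category, and comment URLs. The excluded path prefixes and the function name are assumptions; adjust them to your own permalink structure. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Hypothetical prefixes for the views that should stay out of the sitemap.
EXCLUDED_PREFIXES = ("/tag/", "/category/", "/comments/")


def build_sitemap(base_url, paths):
    """Return sitemap XML listing every path that is not an excluded view."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in paths:
        if path.startswith(EXCLUDED_PREFIXES):
            continue  # skip tag, category, and comment pages
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + path
    return ET.tostring(urlset, encoding="unicode")
```

Calling build_sitemap with a list of site paths yields one url entry per post, and nothing for the excluded views.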

General Ideas

  • Don’t Use “Dirty” Permalinks - When you use the default permalink format, you often end up with duplicate content, because categories, tags, search results, etc. may follow the same internal linking structure while using different formats across your site, which is not good for search engines.
  • Don’t “Spread” Content - Every time you syndicate content across the web, you are duplicating content from your blog. In some cases, popular content has been copied word-for-word onto multiple sites, and the “spam” blogs trying to generate AdSense earnings come out the winners, receiving a higher ranking than the site that created the content, simply because the spammer’s blog is newer and contains more keywords related to the search terms.
  • Create/Register Specific Domains - Internet searchers want to find information that is specific to a certain area. Therefore, they are more likely to use the country-specific domain, rather than a folder or sub-domain that you have created on a larger, international domain. It may be more expensive initially, but you will be able to produce more content that gets higher ranking once you employ this method on your blog.
  • Redirect Content that has Moved - Having a lot of 404 pages suddenly appear in search engine results isn’t good: it means that a lot of your content isn’t being indexed at its current address. Use a plugin or create a rule in your .htaccess file to redirect content that has moved.
  • Know Your Content - Be sure that you know how the content on your blog is generated, whether through manual creation or automatically, as with archive pages. Don’t vary the format in which you write post titles, or you may end up with content that looks extremely similar. Also keep in mind that the non-www and www versions of your site are treated as different, so decide early on which one to use and set it both within Google Webmaster Tools and on your blog.
  • Printer-friendly Versions - Printer-friendly versions of pages should be excluded, as they count as duplicate content. Additionally, pages that collect posts from different periods in time (e.g. archive pages) are also duplicates.
  • Make Use of Excerpts - A common rule for bloggers is to use excerpts, or brief tidbits of information displayed from each post, ensuring that no “full” content is duplicated on your blog. Don’t include too many posts on the main page.
  • Copyright - Place a terms/copyright policy indicating that you check for duplicate content throughout the web and that it is illegal to copy content from your site.
