Search Engine Facts


Google's duplicate content patent

This month, Google was granted a patent titled "Duplicate document detection in a web crawler system". The patent explains how the search engine's content filter can work with a duplicate content server.

What is duplicate content?

The patent contains a definition of duplicate content:

"Duplicate documents are documents that have substantially identical content, and in some embodiments wholly identical content, but different document addresses."

The patent describes three scenarios in which duplicate documents are encountered by a web crawler:

Two pages, comprising any combination of regular web page(s) and temporary redirect page(s), are duplicate documents if they share the same page content, but have different URLs.

Two temporary redirect pages are duplicate documents if they share the same target URL, but have different source URLs.

A regular web page and a temporary redirect page are duplicate documents if the URL of the regular web page is the target URL of the temporary redirect page or the content of the regular web page is the same as that of the temporary redirect page.

A permanent redirect page is not directly involved in duplicate document detection because the crawlers are configured not to download the content of the redirecting page.
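
The three scenarios above can be read as a simple decision rule. The following Python sketch is only an illustration of that rule, not code from the patent: the Page class, its field names, and the kind labels are hypothetical, and the content comparison is a plain string equality that covers only the "wholly identical" case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical record for a crawled page (names are assumptions)."""
    url: str
    kind: str  # "regular", "temp_redirect", or "perm_redirect"
    content: Optional[str] = None     # downloaded body, if any
    target_url: Optional[str] = None  # redirect target, if any


def are_duplicates(a: Page, b: Page) -> bool:
    """Apply the three scenarios from the article to a pair of pages."""
    # Duplicates always have different document addresses.
    if a.url == b.url:
        return False
    # Permanent redirects are not part of duplicate detection.
    if "perm_redirect" in (a.kind, b.kind):
        return False
    # Scenario 1: same page content, different URLs.
    if a.content is not None and a.content == b.content:
        return True
    # Scenario 2: two temporary redirects with the same target URL.
    if a.kind == b.kind == "temp_redirect":
        return a.target_url is not None and a.target_url == b.target_url
    # Scenario 3: a regular page that is the target of a temporary redirect
    # (the "same content" half of this scenario is covered by scenario 1).
    if {a.kind, b.kind} == {"regular", "temp_redirect"}:
        regular, redirect = (a, b) if a.kind == "regular" else (b, a)
        return redirect.target_url == regular.url
    return False
```

For example, a regular page at one URL and a temporary redirect that points to that URL would be reported as duplicates, while a permanent redirect to the same URL would not.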

How does Google detect duplicate content?

According to the patent description, Google's web crawler consults the duplicate content server to check if a found page is a copy of another document. The algorithm then determines which version is the most important version.

Google can use different methods to detect duplicate content. For example, Google might take "content fingerprints" and compare them when a new web page is found.
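
To make the idea of a content fingerprint more concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not Google's method: the article does not say how fingerprints are computed, so this example simply hashes the normalized text with SHA-256, and the names content_fingerprint and FingerprintIndex are hypothetical. A hash like this only matches content that is identical after normalization; detecting "substantially identical" pages would need a different technique.

```python
import hashlib
from typing import Dict, Optional


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FingerprintIndex:
    """Hypothetical stand-in for a lookup of already-seen fingerprints."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # fingerprint -> first URL seen

    def check(self, url: str, text: str) -> Optional[str]:
        """Return the URL first seen with this content, or None if it is new."""
        fingerprint = content_fingerprint(text)
        first_url = self._seen.setdefault(fingerprint, url)
        return None if first_url == url else first_url
```

A crawler could call check() for every newly found page and treat a non-empty result as a sign that the page duplicates an earlier one.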

Interestingly, it's not always the page with the highest PageRank that is chosen as the most important URL for the content:

"In some embodiments, a canonical page of an equivalence class is not necessarily the document that has the highest score (e.g., the highest page rank or other query-independent metric)."
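
The article does not explain how the most important version is picked. Purely as an illustration of how the highest-scoring page could end up not being the canonical one, the sketch below keeps the current canonical URL unless a newly found duplicate beats its score by a margin. The function name, the margin parameter, and the rule itself are assumptions made for this example and are not taken from the patent text quoted above.

```python
from typing import Dict, Optional


def update_canonical(current: Optional[str], challenger: str,
                     scores: Dict[str, float], margin: float = 0.1) -> str:
    """Keep the current canonical URL unless the challenger clearly outscores it."""
    if current is None:
        # First page seen with this content becomes canonical by default.
        return challenger
    if scores[challenger] > scores[current] + margin:
        return challenger
    return current
```

Under this hypothetical rule, a duplicate with a slightly higher score would not replace the existing canonical page, so the canonical page is not always the one with the highest score.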

How does this affect your website?

If you want to get high rankings, it is easier to do so with unique content. Try to use as much original content as possible on your web pages.

If your website must use the same content as another website, make sure that your website has better inbound links than the other websites that carry the same content. Your website is then more likely to be chosen as the most important URL for that content.

If your website has unique content, you don't have to worry about potential duplicate content penalties. Optimize that content for search engines and make sure that your website has good inbound links. It's hard to outrank a website with well-optimized content and many good inbound links.

Copyright - Internet marketing and search engine ranking software

December 2009 search engine articles