URL canonicalization, duplicate-content problems in Google

This article is dedicated to URL canonicalization, centered mainly on avoiding duplicate-content problems in Google and on fixing type-in URLs, typos in inbound links, and badly-coded inbound links.

To be clear, a "canonical" domain is the single domain you want your site to be known by, and a canonical URL is the single URL you want your page to be known by.

Any others are non-canonical.

The word canonical is a religion-related term, and means "according to canon law, scripture or doctrine." But in general use, it just means "usual, standard, conventional, customary, or according to the rules." So as a Webmaster, you choose what single domain you want to use for your site, and what single URL should be used to request each of your pages.

This article also provides very good advice that it's best to avoid "stacked redirects" --multiple redirects invoked by a single client request-- while doing things like index page and domain canonicalization.
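
To make the "stacked redirect" problem concrete, here is a minimal sketch (not the routine presented below) of two ordinary canonicalization rules in a document-root .htaccess file. The host name www.example.com is a placeholder, and the sketch assumes the index file is named index.html.

```apache
RewriteEngine on
#
# Index-page rule first: it already targets the canonical domain, so a
# request for http://example.com/index.html is fixed in a single redirect.
# (THE_REQUEST is tested so that only direct client requests match, not
# the internal subrequest made when "/" is mapped to index.html.)
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html[?\ ]
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,L]
#
# Domain rule second: catches everything else on a non-canonical host.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

If the two rules were in the opposite order, a request for http://example.com/index.html would first be redirected to http://www.example.com/index.html and then, on the follow-up request, to http://www.example.com/ -- two redirects for one click. Ordering rules from most specific to least specific avoids that in simple cases like this one; the routine below extends the same goal to many fix-ups at once.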

Here is a domain/URL canonicalization and type-in fix-up routine that does the following:

  • Canonicalize the domain (e.g. redirect non-www and IP address to www)
  • Canonicalize my index pages (redirect "/index.html" to "/")
  • Remove multiple slashes in the URL
  • Remove spurious query strings (my sites' pages are mostly 'static' with a few exceptions)
  • Fix-up common typos in type-in URLs
  • Fix-up invalid inbound links caused by bad HTML mark-up
  • Fix-up URLs resulting from bad copy-and-pastes
  • Fix-up outdated or otherwise incorrect query strings
  • Suppress the fix-up redirect if the resulting URL does not resolve to an existing file
  • Suppress the fix-up if the link is on my own site (In this case, I want to see the 404 error)
  • Suppress the fix-up if the remote user is me or a site tester (Again, we want to see the 404 error)
  • Avoid recursion in mod_rewrite running in a per-directory .htaccess context
  • Avoid the nasty mod_rewrite bug in Apache 1.3.x
  • Do all of the above using a single 301-Moved Permanently redirect

    The result is a routine that can "correct" a request from a badly-coded link like:
    <a href="http://example.com/index,hmtl>for more info, click here</a>

    where the closing quote has been omitted on the link, "html" was mis-typed, and a comma was typed where the filetype-separator period should be.

    The result of a click on that link is a request for "http://example.com/index,hmtl%3Efor%20more%20info,%20click%20here%3C/a%3E"

    The code will redirect that to the canonical domain and index page URL "www.example.com/" using a single redirect, correcting the comma and "hmtl", and stripping off the spurious path info along the way. Or it can fix up multiple slashes or periods, or remove trailing punctuation from links improperly embedded in text, or automatically-linked in forum posts, e.g. "For help with this code, see http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html."

    This is done with a 301-Moved Permanently redirect, so that search engines are notified not to list or use the incorrect URL, but to replace it with the corrected/canonicalized one.

    This code is intended for the most common Apache hosting set-ups: shared virtual hosting on Apache 1.3.x, with configuration options limited to .htaccess files only.

    Update: It was discovered that Apache 2.0.52 has the same bug as Apache 1.3.x. Although the original bug report was closed with a statement that this bug was fixed in Apache 2.0.30, it was apparently not fixed completely. Therefore, the solution presented here applies to Apache 2.0 as well.

    This routine is the right solution for my sites, which follow *my* strict URL conventions, but likely not for yours. Modification will almost certainly be required. Like almost all mod_rewrite code, this is not a simple cut-and-paste or find-and-replace proposition.

    It does fix-ups on only the most common URL errors I have seen in my logs, but of course, there are many others; the code is not meant to exhaustively cover all possible errors, just the most common ones on my sites.

    This code is to be examined and perhaps modified by Webmasters who are conversant and comfortable with mod_rewrite and regular expressions. Again, this code is not an entry-level exercise, and the most likely result of trying to modify it without thoroughly understanding and testing it is a disaster -- the best of which would be an immediate server crash, and the worst of which might be to thoroughly trash your search engine rankings.

    Code:
    # .htaccess
    #
    # Specify IP address(es) used by Webmaster, admins, & testers. These may access
    # the server by its unique IP address without being redirected to the domain.
    # Also, URLs are *not* corrected for access by this group, in order to prevent
    # this code from "hiding" problems during development.
    # (Note that these addresses are those of your workstations, not your server.)
    SetEnvIf Remote_Addr ^192\.168\.1\. TestIP=true
    SetEnvIf Remote_Addr ^10\.10\.45\.3$ TestIP=true
    SetEnvIf Remote_Addr ^127\.0\.0\.[1-7]$ TestIP=true
    #
    #
    # Setup: Enable mod_rewrite, disable MultiViews
    Options +FollowSymLinks -MultiViews
    RewriteEngine on
    #
    # Redirect non-problematic URLs
    # Note: The fix-up code below is complex, and is intended for use to fix only
    # generally-specified problematic URL requests. For administrative redirection
    # of specific non-problematic URLs, 'normal' redirects should be placed here.
    #
    RewriteRule ^old_page\.html$ http://www.example.com/new_page.html [R=301,L]
    RewriteRule ^old_page2\.htm$ http://www.example.com/new_page2.htm [R=301,L]
    #
    #
    # URL FIXUP REDIRECT ROUTINE
    #
    # This code corrects various problems with URLs, presumably due to typos in
    # links from other sites. It is complicated by measures taken to avoid a
    # mod_rewrite bug in Apache 1.3. ( See http://archive.apache.org/gnats/7879 )
    # This code uses a single external redirect to correct all detected problems.
    #
    # Skip next two rules if lowercasing in progress
    # (Remove this rule if case-conversion plug-in below is removed)
    RewriteCond %{ENV:qLow} ^yes$ [NC]
    RewriteRule . - [S=2]
    #
    # Prevent recursion and over-writing of myURI and myQS
    RewriteCond %{ENV:qRed} ^yes$ [NC]
    RewriteRule .? - [L]
    #
    # Get the client-requested full URI and full query string
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ (/[^?]*)(\?[^\ ]*)?\ HTTP/
    RewriteRule .? - [E=myURI:%1,E=myQS:%2]
    #
    #
    ###############################################
    # Uppercase to lowercase conversion plug-in
    # (This section, along with the first noted rule
    # above, may be removed if not needed or wanted)
    #
    # Skip next 28 rules if no uppercase letters in URL
    RewriteCond %{ENV:myURI} ![A-Z]
    RewriteRule .? - [S=28]
    #
    # Else swap them out, one at a time
    RewriteCond %{ENV:myURI} ^([^A]*)A(.*)$
    RewriteRule . - [E=myURI:%1a%2]
    RewriteCond %{ENV:myURI} ^([^B]*)B(.*)$
    RewriteRule . - [E=myURI:%1b%2]
    RewriteCond %{ENV:myURI} ^([^C]*)C(.*)$
    RewriteRule . - [E=myURI:%1c%2]
    RewriteCond %{ENV:myURI} ^([^D]*)D(.*)$
    RewriteRule . - [E=myURI:%1d%2]
    RewriteCond %{ENV:myURI} ^([^E]*)E(.*)$
    RewriteRule . - [E=myURI:%1e%2]
    RewriteCond %{ENV:myURI} ^([^F]*)F(.*)$
    RewriteRule . - [E=myURI:%1f%2]
    RewriteCond %{ENV:myURI} ^([^G]*)G(.*)$
    RewriteRule . - [E=myURI:%1g%2]
    RewriteCond %{ENV:myURI} ^([^H]*)H(.*)$
    RewriteRule . - [E=myURI:%1h%2]
    RewriteCond %{ENV:myURI} ^([^I]*)I(.*)$
    RewriteRule . - [E=myURI:%1i%2]
    RewriteCond %{ENV:myURI} ^([^J]*)J(.*)$
    RewriteRule . - [E=myURI:%1j%2]
    RewriteCond %{ENV:myURI} ^([^K]*)K(.*)$
    RewriteRule . - [E=myURI:%1k%2]
    RewriteCond %{ENV:myURI} ^([^L]*)L(.*)$
    RewriteRule . - [E=myURI:%1l%2]
    RewriteCond %{ENV:myURI} ^([^M]*)M(.*)$
    RewriteRule . - [E=myURI:%1m%2]
    RewriteCond %{ENV:myURI} ^([^N]*)N(.*)$
    RewriteRule . - [E=myURI:%1n%2]
    RewriteCond %{ENV:myURI} ^([^O]*)O(.*)$
    RewriteRule . - [E=myURI:%1o%2]
    RewriteCond %{ENV:myURI} ^([^P]*)P(.*)$
    RewriteRule . - [E=myURI:%1p%2]
    RewriteCond %{ENV:myURI} ^([^Q]*)Q(.*)$
    RewriteRule . - [E=myURI:%1q%2]
    RewriteCond %{ENV:myURI} ^([^R]*)R(.*)$
    RewriteRule . - [E=myURI:%1r%2]
    RewriteCond %{ENV:myURI} ^([^S]*)S(.*)$
    RewriteRule . - [E=myURI:%1s%2]
    RewriteCond %{ENV:myURI} ^([^T]*)T(.*)$
    RewriteRule . - [E=myURI:%1t%2]
    RewriteCond %{ENV:myURI} ^([^U]*)U(.*)$
    RewriteRule . - [E=myURI:%1u%2]
    RewriteCond %{ENV:myURI} ^([^V]*)V(.*)$
    RewriteRule . - [E=myURI:%1v%2]
    RewriteCond %{ENV:myURI} ^([^W]*)W(.*)$
    RewriteRule . - [E=myURI:%1w%2]
    RewriteCond %{ENV:myURI} ^([^X]*)X(.*)$
    RewriteRule . - [E=myURI:%1x%2]
    RewriteCond %{ENV:myURI} ^([^Y]*)Y(.*)$
    RewriteRule . - [E=myURI:%1y%2]
    RewriteCond %{ENV:myURI} ^([^Z]*)Z(.*)$
    RewriteRule . - [E=myURI:%1z%2]
    #
    # Set lowercasing-in-progress flag
    RewriteRule . - [E=qLow:yes]
    #
    # If any uppercase characters remain, re-start
    # mod_rewrite processing from the beginning
    RewriteCond %{ENV:myURI} [A-Z]
    RewriteRule . - [N]
    #
    # If any characters were lowercased, set redirect required
    # flag and reset lowercasing-in-progress flag
    # (S=28 from above lands here)
    RewriteCond %{ENV:qLow} ^yes$ [NC]
    RewriteRule . - [E=qRed:yes,E=qLow:done]
    #
    # End Uppercase to lowercase conversion plug-in
    ###############################################
    #
    # Fix non-canonical domain requests (except for valid
    # subdomains & stats accessed by unique server IP/address)
    RewriteCond %{HTTP_HOST} !^(www|dev|test)\.example\.com(:80)?$
    RewriteCond %{HTTP_HOST}<>%{ENV:TestIP} !^192\.168\.0\.101(:80)?<>true$ [NC]
    RewriteCond %{HTTP_HOST}<>%{REQUEST_URI} !^192\.168\.0\.101(:80)?<>/stats/
    RewriteRule .? - [E=qRed:yes]
    #
    # Replace "hmtl" with "html"
    RewriteCond %{ENV:myURI} ^([^.,]+)[.,]+hmtl [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.html]
    #
    # Replace comma(s) or multiple filetype delimiter periods in page filepaths
    # with a single period (e.g. "/page,html" or "/page..html")
    RewriteCond %{ENV:myURI} ^([^,.]+)([,.]{2,}|,)((s?html?|php[1-9]?|pdf|xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%3]
    #
    # Remove invalid trailing characters
    RewriteCond %{ENV:myURI} ^([/0-9a-z._\-]*)[^/0-9a-z._\-] [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Fix additional directory paths appended to filenames (/logo.jpg/<directory_path>)
    RewriteCond %{ENV:myURI} ^([^.]+\.[^/]+)/
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Remove trailing punctuation
    RewriteCond %{ENV:myURI} ^(.*)[._\-]+$
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Remove multiple contiguous slashes in URL (up to three instances)
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=qRed:yes,E=myURI:%1/%2,C]
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=myURI:%1/%2,C]
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=myURI:%1/%2]
    #
    # Redirect direct client requests for "<anything>/index.html" to "<anything>/"
    RewriteCond %{ENV:myURI} ^(/([^/]+/)*)index\.html [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Redirect specific replaced/relocated pages to specific new pages
    # (Note: This is 'doing it the hard way,' and only URLs that have
    # been requested with typos/type-ins or other problems should be
    # included here. A straight 301 redirect rule located above all of
    # the code shown here can be used to redirect non-problematic URLs)
    RewriteCond %{ENV:myURI}<>/locales.html ^/location\.html<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>/about/widgets-intl.html ^/about/local-widgets\.html<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>/selector/widget-selector.html ^/selector/widgets[^.]+\.xls<>(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Redirect all pages in old directories to same-named pages in new directories
    RewriteCond /new_dir1<>%{ENV:myURI} ^([^<]+)<>/old_dir1(.+)$ [NC,OR]
    RewriteCond /new_dir2<>%{ENV:myURI} ^([^<]+)<>/old_dir2(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1%2]
    #
    # Redirect old filetype to new filetype
    RewriteCond %{ENV:myURI}<>.jpg ^(/[^.]+)\.jpeg<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>.php5 ^(/[^.]+)\.php4<>(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1%2]
    #
    # Correct bad query string on products page link
    RewriteCond %{ENV:myURI} ^/products\.php$
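    # (myQS includes its leading "?", so the pattern must allow for it;
    # otherwise a "product" parameter in first position is never matched)
    RewriteCond %{ENV:myQS} ^(\?([^&]+&)*)product=w1234(&.+)?$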
    RewriteRule . - [E=qRed:yes,E=myQS:%1product=w01234%3,S=2]
    #
    # Remove blank query strings from all URLs
    RewriteCond %{ENV:myQS} ^\?$
    RewriteRule .? - [E=qRed:yes,S=1]
    #
    # Remove spurious query strings from non-dynamic pages
    RewriteCond %{ENV:myQS} ^\?
    RewriteCond %{ENV:myURI} !^/(locales|test)\.html$
    RewriteCond %{ENV:myURI} !^/(cats|products)\.php$
    RewriteCond %{ENV:myURI} !^/cgi-bin/
    RewriteRule .? - [E=qRed:yes,E=myQS:?]
    #
    # Do the external 301 redirect only if the referrer is
    # not our own site, the resource exists at the corrected URL,
    # and the requesting IP is not that of our site tester.
    # (Note: Some of these conditions have been commented-out for code testing.
    # Once the code has been tested thoroughly, be sure to un-comment these lines.)
    RewriteCond %{ENV:qRed} ^yes$ [NC]
    #RewriteCond %{ENV:TestIP} !^true$ [NC]
    RewriteCond %{HTTP_REFERER} !^http://((www|dev|test)\.)?example\.(org|com)
    RewriteCond %{HTTP_REFERER} !^http://192\.168\.0\.101(:80)?/?
    #RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI} -f [OR]
    #RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI} -d
    RewriteRule .? http://www.example.com%{ENV:myURI}%{ENV:myQS} [R=301,L]
    #
    # ##### End URL fixup redirect routine #####
    Notes:

    The "<>" characters used in several RewriteConds above have no special meaning to mod_rewrite and are not regular-expression operators. They are merely a unique character string that I use to enable unambiguous matching of combined server variable values on a single line by clearly delineating one value from the other.
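
    As a small illustration of the delimiter technique, the sketch below tests two values in one RewriteCond and captures a piece of each. The directory names /old_dir and /new_dir and the variable name newPath are hypothetical, chosen only for this example.

    ```apache
    # TestString is the literal text "/new_dir", the delimiter "<>", then
    # the requested URL-path. For a request of /old_dir/page.html it
    # expands to:  /new_dir<>/old_dir/page.html
    # %1 captures everything before the delimiter (/new_dir) and
    # %2 captures the remainder of the path after /old_dir (/page.html).
    RewriteCond /new_dir<>%{REQUEST_URI} ^([^<]+)<>/old_dir(/.*)$
    RewriteRule .? - [E=newPath:%1%2]
    ```

    Without a delimiter, the pattern could not reliably tell where the first value ends and the second begins; any string that cannot occur in either value would serve equally well.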

    The rules are in a specific order; some of the later rules depend upon the actions of previous rules.

    Some of the rules have exclusions implemented using RewriteConds. You may not need them at all, or you will very likely need to modify them to suit your site.

    The order of the RewriteConds is intentional. In some cases, the given order is required so that back-references will function correctly, and in other cases they are ordered based on performance considerations. For example, it is good to avoid directory-exists and file-exists checks if possible, since they take a lot of time and CPU resources. So these are deferred until all other conditions are met.

    A very simple but effective way to test this code is to create a page of non-canonical, mis-typed, and malformed links to your site, and then click those links using a Mozilla or Firefox browser with the "Live HTTP Headers" extension enabled. The server response can then be examined in detail to be sure it's working as expected.

    Remember that the code was intentionally designed to *not* correct requests referred from your own site or to correct links when clicked-on by you or testers within your organization, as listed in the exclusion section at the top. It will make your life easier if you leave the RewriteConds in the 301-redirect rule commented-out until you have adapted this code to your site and have thoroughly tested it. Then un-comment those RewriteConds and re-test from a machine that is not part of your development and test network.

    A brief explanation of the techniques used here and the point of this exercise: Apache has a nasty mod_rewrite bug that prevents multiple internal rewrites from working properly, except for a few cases where subdirectories are not present in the URL-path. If an attempt is made to rewrite /dir/a.html to /dir/b.html, and then to rewrite /dir/b.html to /dir/c.html in a second RewriteRule, the resulting URL will be /dir/c.html/c.html. The more sequential rewrites are done, the more times the filepath will be added to the end of the URL. And of course, if you add yet another RewriteRule to try to remove it, you still end up with two repeats of the filepath!

    As was stated at the outset, it is best to avoid 'stacked' redirects, both to avoid confusing search engine robots, and to facilitate the efficient passing of PageRank/link-popularity through to the redirect target URL.

    Both of the above problems are addressed in this code through the use of environment variables: "qRed" to flag a queued external redirect, "myURI" to hold the URL as it is tested and modified, and "myQS" to hold the query string as it is tested and modified. By using the latter two variables to completely by-pass Apache's normal URI-handling variables, the Apache mod_rewrite bug is avoided.
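
    Stripped to its skeleton, the technique looks roughly like the sketch below. It is a condensed illustration of the routine above with only two fix-ups, not a replacement for it: the recursion guard, lowercasing, query-string handling, and the referrer/tester exclusions are all left out, and www.example.com is a placeholder.

    ```apache
    RewriteEngine on
    #
    # 1. Copy the path from the original request line into myURI.
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ (/[^?\ ]*)
    RewriteRule .? - [E=myURI:%1]
    #
    # 2. Each fix-up only edits myURI and/or raises the qRed flag.
    #    Non-canonical host: nothing to edit, just queue the redirect.
    RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
    RewriteRule .? - [E=qRed:yes]
    #
    #    "<anything>/index.html" becomes "<anything>/".
    RewriteCond %{ENV:myURI} ^(/([^/]+/)*)index\.html$ [NC]
    RewriteRule .? - [E=qRed:yes,E=myURI:%1]
    #
    # 3. One external redirect, issued only if some fix-up asked for it.
    RewriteCond %{ENV:qRed} ^yes$
    RewriteRule .? http://www.example.com%{ENV:myURI} [R=301,L]
    ```

    Because no rule before the last one rewrites the URL itself, a request for http://example.com/index.html is answered with a single 301 to http://www.example.com/ rather than one redirect per problem. In this sketch any query string is passed through unchanged.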

    Unfortunately, this makes the code at least twice as long as it would be without the bug, but I haven't found a better way to work around it.

    I have provided a "plug-in" for doing uppercase-to-lowercase conversion in .htaccess. I call it a "plug-in" because I structured it so that it can be easily added or removed with minimum impact on the other code. I *do not* suggest including or using this case-conversion code unless it is absolutely necessary to correct a pre-existing or emerging problem; it is potentially a very-slow, high-CPU-load routine because it will invoke a restart of all mod_rewrite processing if more than one instance of any given capital letter appears in the requested URL. As such, you should take all steps possible to avoid depending on it for any purpose other than to correct inbound links from other sites which are non-responsive to requests for link correction and are completely out of your control. It should certainly not be used to "allow" you to use mixed-case URLs on your own pages; the result is almost certain to be an overloaded server if your site is even moderately popular.
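
    For contrast, where access to the server or virtual-host configuration is available (which is not the .htaccess-only set-up this article targets), mod_rewrite's built-in tolower map is the usual way to lowercase a URL. A minimal sketch, assuming a virtual-host context, the placeholder host www.example.com, and the hypothetical map name lc:

    ```apache
    # Server or virtual-host configuration only: RewriteMap cannot be
    # declared in .htaccess.
    RewriteEngine on
    RewriteMap lc int:tolower
    #
    # Redirect any URL-path containing an uppercase letter to its
    # lowercased form. (In this context the pattern sees the leading slash.)
    RewriteCond %{REQUEST_URI} [A-Z]
    RewriteRule ^/(.*)$ http://www.example.com/${lc:$1} [R=301,L]
    ```

    Because RewriteMap cannot be declared in a .htaccess file, this option is unavailable on hosting limited to .htaccess, which is why the letter-by-letter plug-in exists.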

    I've tried to be specific in the description of the individual routines. Some of them may not be useable on your site. For example, the "Fix additional directory paths appended to filenames" routine cannot be used as-is on sites which have periods in directory paths. It would have to be re-coded or removed for use on such a site.

    This code came off a live server, and has therefore been fairly-thoroughly tested.

    If you use this code, JdMorgan will appreciate it if you'd attribute it to him.

    © 2002-2008 | Studio DOTBEYOND.COM | San Francisco, CA |
    All Rights Reserved