In which Function1.com plays with broken links and is redirected to its room
Let’s take a break for a moment from the world of portals, be it WebCenter or WebCenter Interaction, and take a look at the exciting life and times of our web site, Function1.com.
When we very recently refreshed our web site’s design, we also decided to change our blogging platform from MovableType to WordPress. As part of the migration, we carefully moved all of our blog content to the new site. However, we almost overlooked one detail that could have both made our content inaccessible to some of our readers and hurt our site’s search engine rankings.
Some background information: the URLs for the old blog posts were in the form http://www.function1.com/site/[year]/[month]/[post title].html. The URLs for the new blog posts are in the form http://www.function1.com/[year]/[month]/[post title]/. For example, the old blog post http://www.function1.com/site/2007/12/check-search-server-status-the.html can now be found at http://www.function1.com/2007/12/check-search-server-status-the-hard-way/. Note that in this example the old post title is a truncated form of the new one, so dropping /site/ and .html from an old URL is not always enough to arrive at the new URL.
So what’s the problem? Broken links, better known as 404 errors. Take a look at the following screenshot from Google Webmaster Tools:
Other sites link to our content, and specifically to our blog posts, and that is important to us. The table in the screenshot represents just a small subset of the links from other sites for which Google is now reporting crawl errors. The result is a lot of content that sits on our blog but cannot be reached, because it is referenced by a URL that is no longer valid. If people stumble across pages linking to our old blog posts, we of course want them to be able to get to our content. Now, do a Google search for “site:www.function1.com/site/”. As you can see, Google still holds references to many of our old URLs. Yes, Google is aware of our new URLs (for example, through our sitemap), but we don’t want to lose the history we’ve established with Google, especially since Big G really does care about our old links going dead.
If all of the above didn’t make sense, here’s a made-up example: what if Yahoo! decided to change its name to Oohay!, and in the process moved its domain from Yahoo.com to Oohay.com? The news stories, sports scores, images, movies, and all the other content would still exist, and we can be sure that Google would eventually discover all of the new URLs. But all of the old links to the content would effectively be severed, and as far as the search engines are concerned, it would be as if one site had shut down completely and a brand new site had sprung up. The PageRank that the old URLs had earned would be blown away.
In comes our hero, the 301 redirect. The 301 HTTP status code tells the browser (or a search engine crawler) that a requested resource has permanently moved to a new location. In other words, we can configure our web server so that a request for http://www.function1.com/site/2007/12/check-search-server-status-the.html returns a response telling the browser that the resource has moved permanently, and that the new location is http://www.function1.com/2007/12/check-search-server-status-the-hard-way/
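To make the exchange concrete, here is a minimal Python sketch of what a 301 looks like on the wire. It is not how our site is set up (we use Apache and an .htaccess file); it simply starts a throwaway local server that answers the old path with a 301 and a Location header, then requests that path without following the redirect. The names RedirectHandler, fetch_status, and demo are made up for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Old path and new URL taken from the example above.
OLD_PATH = "/site/2007/12/check-search-server-status-the.html"
NEW_URL = "http://www.function1.com/2007/12/check-search-server-status-the-hard-way/"


class RedirectHandler(BaseHTTPRequestHandler):
    """Toy handler (hypothetical): 301 for the one old path, 404 otherwise."""

    def do_GET(self):
        if self.path == OLD_PATH:
            self.send_response(301)               # "Moved Permanently"
            self.send_header("Location", NEW_URL)  # where the resource lives now
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_status(port, path):
    """Request a path and return (status, Location) without following redirects."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def demo(path=OLD_PATH):
    """Start the toy server on a free local port, make one request, shut down."""
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_status(server.server_address[1], path)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

A browser that receives this response requests the URL in the Location header on its own, which is why visitors never notice the hop.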
To implement these redirects, I spent some time setting up an .htaccess file that holds the mappings from all of our old blog post URLs, and some other old paths, to the valid URLs. Again, the point is to make sure that our content is still reachable via the old URLs, even though it now lives at the new URLs. Due to the number of blog posts that had to be redirected, I ended up writing a script to do a lot of this for me. If you are interested in the script, check it out at http://github.com/mshafrir/MovableType-to-WordPress-.htaccess-Generator/blob/master/htaccess.py. In any case, here’s a look at a slice of our current .htaccess file.
# Sample .htaccess file
#
# Explicit redirects of the October, 2009 blog posts
redirect 301 /site/2009/10/-google-web-toolkit-gwt.html http://www.function1.com/2009/12/integrating-a-google-web-toolkit-application-with-wci-and-the-imageserver/
redirect 301 /site/2009/10/oracle-open-world-and-the-nigh.html http://www.function1.com/2009/12/oracle-open-world-and-the-night-obama-stayed-at-my-hotel/
redirect 301 /site/2009/10/function1corp-attending-oow09.html http://www.function1.com/2009/10/function1corp-attending-oow09/
redirect 301 /site/2009/10/getting-the-band-back-together.html http://www.function1.com/2009/10/getting-the-band-back-together-function1-welcomes-casey-goodman-and-mike-shafrir/
#
# Redirects using regular expression matching
# Redirect http://www.function1.com/site/year/month/day/
# -> http://www.function1.com/year/month/day/
RedirectMatch 301 ^/site/([0-9][0-9][0-9][0-9])/([0-9][0-9])/([0-9][0-9])/$ http://www.function1.com/$1/$2/$3/
# Redirect http://www.function1.com/site/year/month/
# -> http://www.function1.com/year/month/
RedirectMatch 301 ^/site/([0-9][0-9][0-9][0-9])/([0-9][0-9])/$ http://www.function1.com/$1/$2/
# Redirect http://www.function1.com/site/year/
# -> http://www.function1.com/year/
RedirectMatch 301 ^/site/([0-9][0-9][0-9][0-9])/$ http://www.function1.com/$1/
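If you want to sanity-check rules like these before deploying them, something along the following lines may help. To be clear, this is not the script linked above; it is a rough Python sketch that (a) prints explicit redirect lines from a small, hand-written mapping and (b) approximates how the three RedirectMatch rules would resolve an old archive path. POST_MAP, explicit_redirects, and resolve are names invented for this example, and the matching is simplified: Apache's redirect directive matches on a path prefix, while this sketch only does exact lookups.

```python
import re

SITE = "http://www.function1.com"

# Hypothetical old-path -> new-path mapping; both entries are copied
# from the sample .htaccess file above.
POST_MAP = {
    "/site/2009/10/function1corp-attending-oow09.html":
        "/2009/10/function1corp-attending-oow09/",
    "/site/2009/10/getting-the-band-back-together.html":
        "/2009/10/getting-the-band-back-together-function1-welcomes-casey-goodman-and-mike-shafrir/",
}

# Python equivalents of the three RedirectMatch rules (day, month, year).
# Apache writes back-references as $1; Python templates use \1.
ARCHIVE_RULES = [
    (re.compile(r"^/site/([0-9]{4})/([0-9]{2})/([0-9]{2})/$"), r"/\1/\2/\3/"),
    (re.compile(r"^/site/([0-9]{4})/([0-9]{2})/$"), r"/\1/\2/"),
    (re.compile(r"^/site/([0-9]{4})/$"), r"/\1/"),
]


def explicit_redirects(post_map, site=SITE):
    """Render one 'redirect 301 old new' line per mapped blog post."""
    return [
        f"redirect 301 {old} {site}{new}"
        for old, new in sorted(post_map.items())
    ]


def resolve(path, post_map=POST_MAP, site=SITE):
    """Return the URL an old path would redirect to, or None if no rule applies."""
    if path in post_map:  # simplification: exact match only
        return site + post_map[path]
    for pattern, template in ARCHIVE_RULES:
        match = pattern.match(path)
        if match:
            return site + match.expand(template)
    return None


if __name__ == "__main__":
    print("\n".join(explicit_redirects(POST_MAP)))
    print(resolve("/site/2009/10/"))
```

Running every old URL from a sitemap or crawl report through a function like resolve is a cheap way to spot paths that no rule covers before Google finds them for you.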
For further reading, here’s an article on .htaccess files and 301 redirects.
And just to make sure I’m not making all this up, do a search for “site:www.function1.com/site/” on Google. Choose any link in the results, but before you click on it, note the URL in green. Finally, click on your chosen link, note that the content you expected came up, and then compare the URL in your location bar with the URL on the search results page. You’ve just observed 301 redirects in action.
Hopefully you’ve learned a little bit about the issues and risks that can arise from changing your site’s URL structure, which is a common side effect of migrating to a new blogging platform or content management system (CMS). Please feel free to post your comments or questions below.
