Thursday, November 1, 2012

URL Rewriting


URL rewriting is common practice these days, and most hosted services provide it (WordPress, Blogger, etc.).
With this practice, a URL like http://test.com/blog?year=2012&month=3&day=12 could become http://test.com/blog/2012/03/12.
For some people, it's only about user friendliness. For me, and in all the work we have done so far, it's all about search engine optimization (SEO), with the side effect of being a little more user friendly.


Terminology and definitions


We define as raw form any non-rewritten URL, provided as is by the application.
Examples:
  • www.tripfilms.com/watch.sdo
  • www.tripfilms.com/videos/123

We define as pretty form any rewritten URL, built from the information in its raw form.
Examples:
  • http://www.tripfilms.com/Travel_Video-v72264-Chongqing-Shibaozhai-Video.html
  • http://www.igougo.com/attractions-reviews-b7933-Las_Vegas-Bellagio_Fountains.html
  • http://www.virtualtourist.com/travel/Caribbean_and_Central_America/Puerto_Rico/San_Juan_Municipio/San_Juan-1645724/Things_To_Do-San_Juan-Old_San_Juan-BR-1.html

We define the process of injecting keywords into the URL as captioning. During this process, the rewriter is given a piece of information, retrieves the keywords related to it, and formats them into the URL based on pre-defined rules.
Here's a quick overview of the process:

  1. Client requests www.tripfilms.com/videos/123
  2. System loads information for video #123
  3. System generates caption based on information. Here we'll use the title ("Shibaozhai") and its location ("Chongqing"), i.e. caption="Chongqing Shibaozhai"
  4. Create URL based on the rule: "Travel_Video-v$(id)-$(caption)-Video.html"
  5. System redirects (or rewrites, if it's a link in a page we're serving up) to "Travel_Video-v123-Chongqing-Shibaozhai-Video.html"
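
Steps 3 and 4 above can be sketched in a few lines of Java. This is only an illustration, not the Tripfilms code: the class name, the method names and the choice of a dash as word separator are assumptions made for the example. Only the rule string comes from step 4.

```java
final class CaptionSketch {
    // Rule from step 4: "Travel_Video-v$(id)-$(caption)-Video.html"
    static final String RULE = "Travel_Video-v%d-%s-Video.html";

    /** Joins the given pieces of information into a URL-safe caption. */
    static String toCaption(String... parts) {
        String joined = String.join(" ", parts).trim();
        // Assumption: every run of non letter/digit characters becomes one dash.
        String dashed = joined.replaceAll("[^\\p{L}\\p{N}]+", "-");
        // Drop dashes left over at either end.
        return dashed.replaceAll("^-+|-+$", "");
    }

    /** Applies the rule to an id and the information loaded for it. */
    static String prettyUrl(long id, String location, String title) {
        return String.format(RULE, id, toCaption(location, title));
    }

    public static void main(String[] args) {
        // prints Travel_Video-v123-Chongqing-Shibaozhai-Video.html
        System.out.println(prettyUrl(123, "Chongqing", "Shibaozhai"));
    }
}
```

Keeping the rule in one place means the format can change later without touching the code that loads the video information.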

We define as strong rewriting any process where a variation in caption is detected and corrected automatically, regardless of the variation. This is the case for Tripfilms.com, IgoUgo.com and TripAdvisor.com, for example.

We define as weak rewriting any process where a variation in caption may not be detected and corrected automatically. This is the case with most blogs based on WordPress, for example.

State of the art


There are different schools of thought when it comes to URL rewriting implementations.

With a unique identifier


Tripfilms.com, IgoUgo.com and TripAdvisor.com, among others, use that technique:
http://www.tripadvisor.com/Hotel_Review-g33364-d224979-Reviews-Antlers_Hilton_Colorado_Springs-Colorado_Springs_El_Paso_County_Colorado.html
where g33364 and d224979 help the engine figure out exactly how and what to rewrite.
The end result is a perfectly reliable and worry-free rewriting, especially when it comes to user generated content. If the name of the hotel were to change, the engine would automatically redirect from the old form to the new one.
To test that this rewriting is a strong one, imagine this hotel used to be named "Antlers Hilton2 Colorado Springs".
Say I blogged about it yesterday and added a link to that hotel page on TripAdvisor.com. Today the hotel changed its name and TripAdvisor updated its data with "Antlers Hilton Colorado Springs", the new name.
If I click on the link from my blog:
http://www.tripadvisor.com/Hotel_Review-g33364-d224979-Reviews-Antlers_Hilton2_Colorado_Springs-Colorado_Springs_El_Paso_County_Colorado.html
I get redirected to the new URL (with "Antlers Hilton Colorado Springs"):
http://www.tripadvisor.com/Hotel_Review-g33364-d224979-Reviews-Antlers_Hilton_Colorado_Springs-Colorado_Springs_El_Paso_County_Colorado.html
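
The redirect above works because only the identifier in the URL is trusted; the caption is recomputed and compared. Here is a minimal sketch of that check, assuming a Tripfilms-style rule ("Travel_Video-v$(id)-$(caption)-Video.html") and an in-memory map standing in for the real data source. None of these names come from TripAdvisor or from urlrewriter4j.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrongRewriteSketch {
    // Hypothetical pattern: the id is captured, the caption is whatever sits after it.
    private static final Pattern PRETTY =
            Pattern.compile("^/Travel_Video-v(\\d{1,18})-(.*)-Video\\.html$");

    // Stand-in for the real data source: id -> current caption.
    private final Map<Long, String> captions;

    StrongRewriteSketch(Map<Long, String> captions) {
        this.captions = captions;
    }

    /**
     * Returns the path to redirect to, or empty when the path is already
     * canonical, does not match the rule, or refers to an unknown id.
     */
    Optional<String> redirectTarget(String path) {
        Matcher m = PRETTY.matcher(path);
        if (!m.matches()) {
            return Optional.empty();
        }
        long id = Long.parseLong(m.group(1));
        String current = captions.get(id);
        if (current == null) {
            return Optional.empty();
        }
        // Rebuild the canonical form from the id alone; the caption in the request is ignored.
        String canonical = "/Travel_Video-v" + id + "-" + current + "-Video.html";
        return canonical.equals(path) ? Optional.empty() : Optional.of(canonical);
    }
}
```

Whatever the caption in the incoming URL says (old name, typo, truncated text), the same canonical path comes out, which is what makes the rewriting strong.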


Without a unique identifier


Wordpress is the perfect example of an implementation of that approach:
http://sematblog.wordpress.com/2012/04/18/another-step-forward-for-semat/
It uses the date of the post in the URL, but there could be many posts from the same day, so the title of the post is used as well.
If that URL is used anywhere on the web (e.g. someone blogs about the post and links to it) and the title of the post then changes, WordPress needs a way to redirect the user from the old URL to the new one.
The URL alone does not make that possible, so WordPress (and others) keep a history of previous URLs to provide a similar feature.
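
To make the idea of a history concrete, here is a small sketch. It is not how WordPress actually implements it; the class and method names are made up for the example, and the maps stand in for whatever storage a real blog engine uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class SlugHistorySketch {
    // Every slug ever used, current or old -> post id.
    private final Map<String, Long> postIdBySlug = new HashMap<>();
    // Post id -> the slug currently in use.
    private final Map<Long, String> currentSlugByPostId = new HashMap<>();

    /** Publishes a post under a slug, or renames it: old slugs stay in the history. */
    void publish(long postId, String slug) {
        postIdBySlug.put(slug, postId);
        currentSlugByPostId.put(postId, slug);
    }

    /** Returns the slug to redirect to, or empty if the slug is current or unknown. */
    Optional<String> redirectTarget(String requestedSlug) {
        Long postId = postIdBySlug.get(requestedSlug);
        if (postId == null) {
            // Never recorded: this variation cannot be corrected.
            return Optional.empty();
        }
        String current = currentSlugByPostId.get(postId);
        return current.equals(requestedSlug) ? Optional.empty() : Optional.of(current);
    }
}
```

The sketch also shows why this is weak rewriting: only variations that were recorded in the history can be corrected. A mistyped or truncated slug resolves to nothing.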


Political debate: unique identifier or not


Some say that adding the unique identifier to the URL may hurt the SEO value of the page. As of today, June 2nd 2012, there hasn't been any formal proof (or reliable source) backing that claim.


Implementation Guidelines


URL rewriting must be a reversible process, going from raw form to pretty form, and vice versa.
The best implementations should respect the following rules:
  1. Be 100% reliable: we need to go from raw to pretty, and from pretty to raw, reliably and efficiently. Usually the id of the piece of information requested is used in the URL (see IgoUgo.com and Tripfilms.com). If the process is not reliable, problems will ensue (e.g. some patterns for dynamic content might match pages for static content).
  2. Permanently redirect previous forms to the new one: this is purely for SEO purposes, because it tells search engine bots that the old form (whether raw or not) should be replaced by the new one, to avoid duplicate content (i.e. 2 different URLs pointing to the same content).
  3. Be separate from front-end development. The idea here is that the rewriting happens without developers (including designers) dealing with it. Rules of rewriting may change (SEO is a process, not an event), but the raw form is most likely to remain the same forever.
  4. Avoid hitting the database for every URL to be rewritten: there's a major performance hit if that happens (especially if there is a lot of content or many links in any given page). Scalability must be taken into consideration.
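
One common way to satisfy rule #4 is to keep recently used captions in memory. The sketch below is a small LRU cache built on java.util.LinkedHashMap; it is an illustration rather than what Tripfilms.com runs, and the loader function is a placeholder for the real database (or index) lookup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

final class CaptionCacheSketch {
    private final LongFunction<String> loader; // placeholder for the expensive lookup
    private final Map<Long, String> cache;
    int loads = 0; // counts loader calls, for illustration only

    CaptionCacheSketch(int maxEntries, LongFunction<String> loader) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<Long, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the caption for an id, calling the loader only on a cache miss. */
    synchronized String captionFor(long id) {
        String caption = cache.get(id);
        if (caption == null) {
            caption = loader.apply(id);
            loads++;
            if (caption != null) {
                cache.put(id, caption);
            }
        }
        return caption;
    }
}
```

A page with fifty links to the same handful of videos then costs a handful of lookups instead of fifty. The trade-off is staleness: an entry has to be evicted or refreshed when the underlying title changes, otherwise rule #2 is violated until it expires.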


Tripfilms.com Implementation


We use a port of URLRewriter.NET called urlrewriter4j. It has more features than the original and can be used within the Spring Framework.

Configuration

web.xml

<filter>
 <filter-name>UrlRewriterFilter</filter-name>
 <filter-class>
  org.springframework.web.filter.DelegatingFilterProxy
 </filter-class>
 <init-param>
  <param-name>targetBeanName</param-name>
  <param-value>
   net.sourceforge.urlrewriter4j.RewriteFilter
  </param-value>
 </init-param>
 <init-param>
  <param-name>targetFilterLifecycle</param-name>
  <param-value>true</param-value>
 </init-param>
</filter>

<filter-mapping>
 <filter-name>UrlRewriterFilter</filter-name>
 <url-pattern>/*</url-pattern>
</filter-mapping>

urlrewriter-config.xml

<beans ...>

 <bean id="net.sourceforge.urlrewriter4j.RewriteFilter" 
  class="net.sourceforge.urlrewriter4j.core.RewriteFilter">
   <property name="beanFactory">
    <bean class="net.sourceforge.urlrewriter4j.core.configuration.SpringBeanFactory"/>
   </property>
   <property name="configuration" 
    value="classpath:/com/tripfilms/urlrewriter/urlrewriter.xml"/>
 </bean>

 <import resource="classpath:/com/tripfilms/urlrewriter/spring.xml"/>
</beans>

The captioning action implementation depends on the type of content being requested. For example, when requesting a video, we'll load the video information and use it in the caption (like title of the video, location of the video, etc.).

The validate-captioning action is important because it enforces rule #2 of the implementation guidelines above. Without it, we may end up with duplicate content. When the caption in the requested URL is not the expected one, the system redirects the client to the proper URL.

We use Solr to get caption data. The Solr index may be a little out of date (something always goes wrong), in which case we may not find the caption information needed to complete the rewriting process. In this situation we simply give up on rewriting and keep the raw form. If the request came in pretty form, a 404 is returned.
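
The fallback described above boils down to a small decision table. The sketch below spells it out; the enum and method names are invented for the example, and the Optional stands in for the result of the caption lookup.

```java
import java.util.Optional;

final class CaptionFallbackSketch {
    enum Outcome {
        PROCEED,   // caption found: continue with captioning / validation
        KEEP_RAW,  // no caption, raw request: serve the raw form as is
        NOT_FOUND  // no caption, pretty request: answer with a 404
    }

    /**
     * @param requestedAsPretty whether the incoming URL was in pretty form
     * @param caption           result of the caption lookup, empty if nothing was found
     */
    static Outcome decide(boolean requestedAsPretty, Optional<String> caption) {
        if (caption.isPresent()) {
            return Outcome.PROCEED;
        }
        return requestedAsPretty ? Outcome.NOT_FOUND : Outcome.KEEP_RAW;
    }
}
```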


Advanced Implementations


The implementation of a URL Rewriter can seriously hurt page load-times.
Here's a little mind dump on the potential approaches when it comes to implementation.



