PDA

View Full Version : Search Engine-Friendly URLs


theferf
10-18-2006, 10:05 PM
On today’s Internet, database driven or dynamic sites are very popular. Unfortunately the easiest way to pass information between your pages is with a query string. In case you don’t know what a query string is, it's a string of information tacked onto the end of a URL after a question mark.
http://www.sitepoint.com/images/books/phpmysql1/photo1.jpg
So, what’s the problem with that? Well, most search engines (with a few exceptions - namely Google) will not index any pages that have a question mark or other character (like an ampersand or equals sign) in the URL. So all of those popular dynamic sites out there aren’t being indexed - and what good is a site if no one can find it?

The solution? Search engine friendly URLs. There are a few popular ways to pass information to your pages without the use of a query string, so that search engines will still index those individual pages. I'll cover 3 of these techniques in this article. All 3 work in PHP with Apache on Linux (and while they may work in other scenarios, I cannot confirm that they do).

Method 1: PATH_INFO

Implementation:

If you look above this article on the address bar, you’ll see a URL like this: http://www.webmasterbase.com/article.php/999/12. SitePoint actually uses the PATH_INFO method to create their dynamic pages.

Apache has a "look back" feature that scans backwards down the URL if it doesn’t find what it's looking for. In this case there is no directory or file called "12", so it looks for "999". But it find that there's not a directory or file called "999" either, so Apache continues to look down the URL and sees "article.php". This file does exist, so Apache calls up that script. Apache also has a global variable called $PATH_INFO that is created on every HTTP request. What this variable contains is the script that's being called, and everything to the right of that information in the URL. So in the example we've been using, $PATH_INFO will contain article.php/999/12.

So, you wonder, how do I query my database using article.php/999/12? First you have to split this into variables you can use. And you can do that using PHP’s explode function:
$var_array = explode("/",$PATH_INFO);
Once you do that, you’ll have the following information:

$var_array[0] = "article.php"

$var_array[1] = 999

$var_array[2] = 12

So you can rename $var_array[1] as $article and $var_array[2] as $page_num and query your database.

Drawback:

There was previously one major drawback to this method. Google, and perhaps other search engines, would not index pages set up in this manner, as they interpreted the URL as being malformed. I contacted a Software Developer at Google and made them aware of the problem and I am happy to announce that it is now fixed.

There is the potential that other search engines may ignore pages set up in this manner. While I don't know of any, I can't be certain that none do. If you do decide to use this method, be sure to monitor your server logs for spiders to ensure that your site is being indexed as it should.
[BREAK=2]
Method 2: .htaccess Error Pages

Implementation:

The second method involves using the .htaccess file. If you're new to it, .htaccess is a file used to administer Apache access options for whichever directory you place it in. The server administrator has a better method of doing this using his or him configuration files, but since most of us don't own our own server, we don't have control over what the server administrator does. Now, the server admin can configure what users can do with their .htaccess file so this approach may not work on your particular server, however in most cases it will. If it doesn't, you should contact your server administrator.

This method takes advantage of .htaccess’ ability to do error handling. In the .htaccess file in whichever directory you wish to apply this method to, simply insert the following line:

ErrorDocument 404 /processor.php

Now make a script called processor.php and put it in that same directory, and you're done! Lets say you have the following URL: http://www.domain.com/directory/999/12/. And again in this example "999" and "12" do not exist, however, as you don't specify a script anywhere in the directory path, Apache will create a 404 error. Instead of sending a generic 404 header back to the browser, Apache sees the ErrorDocument command in the .htaccess file and calls up processor.php.

Now, in the first example we used the $PATH_INFO variable, but that won’t work this time. Instead we need to use the $REQUEST_URI variable, which contains everything in the URL after the domain. So in this case, it contains: /directory/999/12/.

The first thing you need to do in processor.php is send a new HTTP header. Remember, Apache thought this was a 404 error, so it wants to tell the browser that it couldn’t find a page.

So, put the following line in your processor.php:

header("HTTP/1.1 200 OK");

At this time I need to point out an important fact. In the first example you could specify what script processed your URL. In this example all URLs must be processed by the same script, processor.php, which makes things a little different. Instead of creating different URLs based on what you want to do, such as article.php/999/12 or printarticle.php/999/12 you only have 1 script that must do both.

So you must decide what to do based on the information processor.php receives - more specifically, by counting how many parameters are passed. For instance on my site, I use this method to generate my pages: I know that if there's just one parameter, such as http://www.online-literature.com/shakespeare/, that I need to load an author information page; if there are 2 parameters, such as http://www.online-literature.com/shakespeare/hamlet/, I know that I need to load a book information page; and finally if there are 3 parameters, such as http://www.online-literature.com/shakespeare/hamlet/3/, I know I need to load a chapter viewing page. Alternatively, you can simply use the first parameter to indicate the type of page to display, and then process the remaining parameters based on that.
New Revised 3rd Edition!
Build Your Own Database Driven Website Using PHP & MySQL
Build Your Own Database Driven Website Using PHP & MySQL

* Installation instructions for Windows, Linux and Mac OS X
* Instantly apply working code examples from the book to your Website
* Build a working Content Management System from scratch
* Master MySQL database administration
* Fully updated for PHP 5

* Download 4 Sample Chapters FREE

There are 2 ways you can accomplish this task of counting parameters. First you need to use PHP’s explode function to divide up the $REQUEST_URI variable. So if $REQUEST_URI = /shakespeare/hamlet/3/:

$var_array = explode("/",$REQUEST_URI);

Now note that, because of the positioning of the /’ there are actually 5 elements in this array. The first element, element 0, is blank, because it contains the information before the first /. The fifth element, element 4, is also blank, because it contains the information after the last /.

So now we need to count the elements in our $var_array. PHP has two functions that let us do this. We can use the sizeof() function as in this example:

$num = sizeof($var_array); // 5

You’ll notice that the sizeof() function counts every item in the array regardless of whether it's empty. The other function is count(), which is an alias for the sizeof() function.

Some search engines, like AOL, will automatically remove the trailing / from your URL, and this can cause problems if you’re using these functions to count your array. For instance http://www.online-literature.com/shakespeare/hamlet/ becomes http://www.online-literature.com/shakespeare/hamlet, and as there are 3 total elements in that array our processor.php would load an author page instead of a book page.

The solution is to create a function that will count only the elements in an array that actually hold data. This will allow you to leave off the ending / or allow any links from AOL'’s search engine to point to the proper place. An example of such a function is:

function count_all($arg)
{
// skip if argument is empty
if ($arg) {
// not an array, return 1 (base case)
if(!is_array($arg))
return 1;
// else call recursively for all elements $arg
foreach($arg as $key => $val)
$count += count_all($val);
return $count;
}
}

To get your count, access the function like this:

$num = count_all($url_array);

Once you know how many parameters you need, you can define them like this:

$author=$var_array[1];
$book=$var_array[2];
$chapter=$var_array[3];

Then you can use includes to call up the appropriate script, which will query your database and set up your page. Also if you get a result you’re not expecting, you can simply create your own error page for display to the browser.

Drawback:

The drawback of this method is that every page that's hit is seen by Apache as an error. Thus every hit creates another entry in your server error logs, which effectively destroys their usefulness. So if you use this method you sacrifice your error logs.