The ability to transfer information from one script to another is essential to modern dynamic web pages. Usually, the scripts will use the well-known GET method for this purpose. For example, to edit your personal information, PHP-Nuke calls the Your_Account module with an URL like:
http://www.yourdomain.com/modules.php?name=Your_Account&op=edituser
The "name" and "op" are so-called URL parameters and are passed to the modules.php script through the GET method. This is what happens besides the scenes:
The modules.php file includes mainfile.php, as practically every piece of PHP-Nuke code does, directly or indirectly (see Section 20.2 for blocks and Chapter 21 for modules). In mainfile.php, one of the first things checked is whether register_globals is set to OFF in your php.ini:
if (!ini_get("register_globals")) { import_request_variables('GPC'); }
If it is, the above code will call import_request_variables and import the GET, POST and Cookie variables (i.e. "name" and "op" in the example) into the global scope, just as if register_globals were set to ON. Using the types parameter, you can specify which request variables import_request_variables should import: the characters 'G', 'P' and 'C' stand for GET, POST and Cookie respectively, as in the example from mainfile.php above.
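To illustrate what this import means for our example URL, here is a rough, hypothetical equivalent of the import_request_variables('GPC') call above - not the actual implementation of that function, just a sketch of its effect:

<?php
// Sketch: import GET, POST and Cookie variables into the global
// scope, the effect of import_request_variables('GPC'). With 'GPC',
// later sources overwrite earlier ones (Cookie over POST over GET),
// which array_merge() mimics here.
foreach (array_merge($_GET, $_POST, $_COOKIE) as $key => $value) {
    $GLOBALS[$key] = $value;
}
// For the Your_Account URL above, module code can now simply use
// $name ("Your_Account") and $op ("edituser") as global variables.
?>

This is exactly the situation that register_globals = ON would have created automatically - which is why so much PHP-Nuke code relies on these variables being available as globals.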
The code goes on to submit each variable in the $_GET array to a series of checks that guard against misuse of the parameters for cracking purposes (see Section 23.4.3), but this will not be pursued further here (see Section 23.1 for the security perspective on PHP-Nuke). We are rather going to concentrate on a different aspect of URL parameter passing: the GET method of transferring parameters between scripts makes your web pages unfriendly to search engines - to the point that they may not be indexed at all!
When a search engine spider encounters an URL with many parameters while indexing your pages, it will ignore the URL and not index that particular page. Just how many parameters are too many for a search engine is difficult to say. The search engines are deliberately vague on this point, just as they are on almost every other point regarding their algorithms. For example, Google states the following in its Guidelines for Webmasters:
Allow search bots to crawl your sites without session ID's or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.
This does not mean that Google will not spider dynamic pages at all (although this seems to have been the case in the past). Sometime in the first months of 2003, Google's algorithms became intelligent enough to tackle the tricky problem of dynamic URLs - well, at least partly, see Google is now better at spidering dynamic sites. Observational evidence suggests that Google will now index a page whose URL contains no more than 2-3 parameters with short names (URLs with 2 parameters are the maximum for Google right now, said Google's Marissa Mayer at Search Engine Strategies Chicago, Day Two, Dec. 10th 2003).
Still, the problem remains for URLs with many parameters, as well as for session IDs. A page that uses session IDs can generate an infinite number of URLs for a spider to visit, all pointing to the same content. Such pages are blocked from being indexed not only by Google, but by other search engines as well.
If a bot ignores a page due to a session ID or a large number of GET parameters on the URL, it will also ignore all pages referenced by that page (unless it finds its way to them through some other link that it can follow). Since every PHP-Nuke module is accessible through an URL of the form
http://www.yourdomain.com/modules.php?name=Your_Account&op=edituser
and the PHP-Nuke forums through an URL with at least 4 parameters, one of which is a session ID,
http://www.yourdomain.com/modules.php?name=Forums&file=viewforum&f=1&sid=9bd5f57e4615bbd6d9e2677ea7cbb781
you run the risk that the majority of your pages will remain unknown to the search engines. As a rule of thumb we can say that, if this happens, it will cost you two thirds of your external referrals[1]. This can threaten your very web existence and mean the difference between success and failure for your website!
Why? Because search engines create multiple entry points into your website, a fact that many people fail to realize. Most people you know may be coming to your website through its main index.php page, mainly because it is easier to remember, or because it is the web address printed on the business card you gave them. But a well-indexed website will soon begin driving traffic to pages located deeper in the site. The search engines have rendered elaborately crafted entry pages almost obsolete: today, every page of your website can be an entry page.
The average webmaster also tends to overlook the fact that these interior pages often draw a different kind of visitor than the index page: users arriving there are much more qualified, because they are looking for information specific to a certain topic. And because they are looking for very specific information, they are also more likely to convert on a sale or action that you have prepared for them.
If the search engines are not able to spider your dynamic content because of the GET parameters in the URL, you are losing exactly these more qualified visitors - they often won't find your website at all. Thus it is very important that you find a way to make as much of your website as possible visible to the search engines.
If your budget affords it, you can choose the lazy way: some search engines, like Inktomi (FIXME: URL!), offer a Paid Inclusion Program. In a Paid Inclusion Program, it is you who submits a list of URLs for the search engine to crawl, not the search engine that finds them automatically. This way, the search engine can be sure that the list of URLs you submitted contains real content that is of importance to you, and that none of the URLs contain duplicate content of one and the same page (something that can easily happen with session IDs and an automatic spider, for example).
On the plus side of a Paid Inclusion Program, you will get your pages indexed, the URLs will be the correct ones and the world will be able to search for and find you. The downside is that you have to pay for each and every URL you want to have indexed. If this strains your budget, you will have to look for alternatives.
It turns out that such an alternative exists, thanks to the Swiss Army knife of URL manipulation called mod_rewrite.
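To give you a first taste, here is a minimal, hypothetical example - the static file name account-edituser.html is freely invented for illustration. Placed in an .htaccess file on an Apache server with mod_rewrite enabled, it serves the dynamic Your_Account URL from the beginning of this section under a static-looking address without any GET parameters:

# A request for account-edituser.html is internally rewritten to
# modules.php?name=Your_Account&op=edituser; the visitor (and the
# search engine spider) only ever sees the parameterless URL.
RewriteEngine On
RewriteRule ^account-edituser\.html$ modules.php?name=Your_Account&op=edituser [L]

A spider that follows a link to account-edituser.html sees what looks like an ordinary static page and has no reason to skip it.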
[1] A typical website will get about two thirds of its external traffic from search engines and one third from sites that link directly to it. Of course, your mileage may vary.