To create a search engine friendly PHP-Nuke site, you need a means of converting dynamic URLs to static ones. How can this be done? For example, you could convert
http://www.yourdomain.com/modules.php?name=Your_Account&op=userinfo&username=chris
to
http://www.yourdomain.com/userinfo-chris.html
Notice that the spider-friendly format contains no indication that any parameters are being passed at all. Rather, it simply looks like we are trying to access the file userinfo-chris.html on www.yourdomain.com, and this will not present any problems to the search engine spiders. Of course, we have to catch such a request and treat the “file” accordingly, i.e. split it internally into its parts, determine the parameter-value pairs and the name of the script to execute, and pass the parameters (name=Your_Account, op=userinfo, username=chris) to that script (modules.php) for execution.
We will describe GoogleTap, an ingenious solution to this problem, that combines the components we saw in the previous sections:
mod_rewrite (Section 25.2),
regular expressions (Section 25.3),
.htaccess file (Section 25.4).
Frontpage extensions will not work!
This does not need to upset you, since PHP-Nuke does not use Frontpage extensions. But for the sake of completeness, be warned that if you use Frontpage extensions, mod_rewrite will not work! Of course, if your ISP has installed them, but you don't use them, you also don't need to worry.
GoogleTap is a collection of files (header.php, footer.php, .htaccess and replacements for some modules and blocks) that combines the power of regular expressions (Section 25.3) and mod_rewrite (Section 25.2) into a PHP-Nuke setting, to implement a search engine friendly PHP-Nuke using URL manipulation. It grew out of attempts to make PHP-Nuke search engine friendly, that date back to the 5.x versions (see Mod_rewrite and Nuke, PHP Nuke & mod_rewrite and Google mod_rewrite fix), which in turn seem to be related to similar efforts in the PostNuke camp (see Search engine friendly URLs revisited and Hackin' the core - Mod_Rewrite).
Our treatment of GoogleTap is divided into three parts:
Requirements (Section 25.5.1.1),
Installation (Section 25.5.1.2),
How it works (Section 25.5.1.3).
To implement the solution of GoogleTap, your system needs to fulfil some software requirements:
mod_rewrite needs to be compiled and loaded into Apache (see Section 25.2 on how you can check this).
AllowOverride needs to be set to "All" for the directory your site resides in. Check with your ISP (but see also Section 25.4 for why this will make your web server much slower).
RewriteEngine needs to be turned "On" in the .htaccess file (this is already done in the .htaccess file that comes with GoogleTap).
It also needs to fulfil some hardware requirements: mod_rewrite has to check a lot of regular expressions for each page request. Regular expressions are very flexible, but can be very time consuming too, depending on their complexity. The current implementation of GoogleTap seems to strain the CPU quite a bit. Depending on the hardware, your hosting scheme (“root” or “virtual” server) and the number of HTTP requests you have to serve per second, this may make your pages prohibitively slow to load! Get the best hardware you can afford!
Monitor the load of your web server!
If you are on your own ("root") server, but your hardware is not up to the demands of mod_rewrite for computing power, you may find that your web server load is approaching dizzying heights. There have been reports that a high server load affects the stability of Apache, or causes memory leaks. This may or may not be true for your version and configuration. But in case you are experiencing a high server load, the following script of Zhen-Xjell (the nukecops Webmaster) may turn out to be very helpful:
Save it as loadavg.pl, adapt it to your case (location of the perl executable and the Apache start script) and set up a cron job to call loadavg.pl at regular time intervals. The script monitors the server load average and restarts the server process if it becomes too high, say, higher than 5 - see GoogleTap, mod_rewrite and SID defined.
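The monitoring idea described above can be sketched in a few lines. The following Python fragment is not Zhen-Xjell's loadavg.pl; it is only a minimal illustration of the same principle, assuming a Unix system. The threshold of 5 comes from the text, while the restart command (apachectl restart) and the function names are assumptions that you would have to adapt to your own Apache start script.

```python
import os
import subprocess

LOAD_THRESHOLD = 5.0                        # the "say, higher than 5" from the text
RESTART_COMMAND = ["apachectl", "restart"]  # assumption: adapt to your start script


def should_restart(load_1min, threshold=LOAD_THRESHOLD):
    """Return True when the 1-minute load average exceeds the threshold."""
    return load_1min > threshold


def check_and_restart(get_load=None, run=None, threshold=LOAD_THRESHOLD):
    """Read the load average and restart the web server if it is too high.

    get_load and run can be supplied by the caller, so the logic can be
    tried out without touching a real server.
    """
    get_load = get_load or os.getloadavg    # available on Unix systems only
    run = run or subprocess.run
    load_1min = get_load()[0]
    if should_restart(load_1min, threshold):
        run(list(RESTART_COMMAND), check=False)
        return True
    return False
```

A cron job would then call check_and_restart() at regular intervals, just as described for loadavg.pl.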
Do not download the Google_Tap_Beta_0.6.5 package!
At the very least, the .htaccess file contained there is missing some crucial lines. Besides that, you will not find any clear installation instructions and the files are probably already too old by the time you read this. Use the GT_Distro_10-22-03 package, or a newer one, instead! Please understand that this is beta software and still under constant development.
We will describe the manual installation of GoogleTap, as applies to the GT Distro 10-22-03 package (see Updated Google Tap Distribution 10-22-2003):
Open your header.php file and make the required changes as indicated in header.php_manualchanges.txt. Please use the included header.php as a reference. This means that you have to find the lines:
    if (eregi("header.php",$_SERVER['PHP_SELF'])) {
        Header("Location: index.php");
        die();
    }
in your existing header.php file and add
    ob_start();
    function replace_for_mod_rewrite(&$s) {
        $urlin = array(
            "'(?<!/)modules.php\?name=News&amp;file=article&amp;sid=([0-9]*)&amp;mode=([a-z]*)&amp;order=([0-9]*)&amp;thold=([0-9]*)'",
            "'(?<!/)modules.php\?name=News&amp;file=article&amp;sid=([0-9]*)'",
            "'(?<!/)modules.php\?name=News&amp;file=article&amp;sid=([0-9]*)'",
            "'(?<!/)modules.php\?name=News&amp;new_topic=([0-9]*)'",
            ...
        );
        $urlout = array(
            "article-\\1-\\2-\\3-\\4.html",
            "article\\1.html",
            "article\\1.html",
            "article-topic-\\1.html",
            "archive-\\1-\\2-\\3.html",
            "archive.html",
            ...
        );
        $s = preg_replace($urlin, $urlout, $s);
        return $s;
    }
The code shown here has been abbreviated for clarity - please refer to the original instructions in header.php_manualchanges.txt! You should leave the code of your original header.php untouched after the above lines. The final header.php should look like the header.php file that is included in the GT_Distro_10-22-03 package for your reference.
What you basically do with this in your header.php, is start output buffering and include a function (replace_for_mod_rewrite) with two arrays, urlin and urlout.
Open your footer.php file and make the required changes as indicated in footer.php_manualchanges.txt. Please use the included footer.php as a reference. This means that you have to add
    //Google Tap Footer Entry
    $contents = ob_get_contents();             // store buffer in $contents
    ob_end_clean();                            // delete output buffer and stop buffering
    echo replace_for_mod_rewrite($contents);   // display modified buffer to screen
    //End of Google Tap Footer
after the end of the foot function in your footer.php file. The final footer.php should look like the footer.php file that is included in the GT_Distro_10-22-03 package for your reference.
With this change, you just conclude the actions that you introduced in header.php: there, you started output buffering, here you have to store the buffer in a variable ($contents), before you stop buffering and call the new function you introduced in header.php, the replace_for_mod_rewrite() function, passing it the whole buffer content as argument.
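The interplay between header and footer can be illustrated outside PHP as well. The following Python sketch only mimics the mechanism: contextlib.redirect_stdout plays the role of ob_start(), print() plays the role of echo, and the replace_for_mod_rewrite() shown here is a deliberately trivial stand-in with one hard-coded replacement. The function names render_page and serve are assumptions made for the example.

```python
import contextlib
import io


def replace_for_mod_rewrite(page):
    """Trivial stand-in for the real function: one hard-coded replacement."""
    return page.replace(
        "modules.php?name=Your_Account&amp;op=userinfo&amp;username=chris",
        "userinfo-chris.html",
    )


def render_page():
    """Hypothetical page body; print() stands for PHP's echo."""
    print('<a href="modules.php?name=Your_Account&amp;op=userinfo'
          '&amp;username=chris">chris</a>')


def serve():
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):   # header.php: ob_start()
        render_page()                          # output lands in the buffer
    contents = buffer.getvalue()               # footer.php: ob_get_contents()
    return replace_for_mod_rewrite(contents)   # footer.php: rewrite, then echo
```

Nothing reaches the visitor until the very end, which is what allows the links of the complete page to be rewritten in a single pass.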
Open your includes/sessions.php file (in versions older than PHP-Nuke 6.5, it may be located in modules/Forums/includes) and make the required changes as indicated in includes/sessions.php_manualchanges.txt. Please use the included sessions.php as a reference. This means that you replace
    if ( !empty($SID) && !eregi('sid=', $url) ) {
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }
    return($url);
with
    if ( !empty($SID) && !eregi('sid=', $url) && !areyouabot() ) {
        $url .= ( ( strpos($url, "?") != false ) ? ( ( $non_html_amp ) ? "&" : "&amp;" ) : "?" ) . $SID;
    }
    return($url);
This introduces an extra check with the areyouabot() function. The session ID ($SID) is appended to the URL ($url) only if the areyouabot() function returns false. Already from the name, we can deduce that areyouabot() checks if the visitor is a search engine spider bot. Thus, the session ID will not be output on the URL that search engines get to see when they visit the site.
Of course, we still have to insert the code for the areyouabot() function. We can do this after the append_sid function of sessions.php, by including the code as shown in includes/sessions.php_manualchanges.txt:
    function areyouabot() {
        global $HTTP_SERVER_VARS;
        $RobotsList = array (
            "antibot", "appie", "architext", "bjaaland", "digout4u",
            "echo", "fast-webcrawler", "ferret",
            ...
        );
        $botID = strtolower($HTTP_SERVER_VARS['HTTP_USER_AGENT']);
        for ($i = 0; $i < count($RobotsList); $i++) {
            if ( strstr($botID, $RobotsList[$i]) ) {
                return TRUE;
            }
        }
        return FALSE;
    }
Please note that the above code has been abbreviated for clarity. Refer to includes/sessions.php_manualchanges.txt for the full code of areyouabot(). The final sessions.php should look like the includes/sessions.php file that is included in the GT_Distro_10-22-03 package for your reference.
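To see the effect of the extra check in isolation, here is a small Python sketch that mirrors the logic of the two PHP fragments. It is an illustration only: the robots list is the abbreviated one from the text, the user agent is passed in as a parameter instead of being read from the server variables, the session ID value sid=abc123 is made up, and the "?" test is simplified compared with the strpos() call of the original.

```python
# Abbreviated list, as in the text; the full list is in the distribution.
ROBOTS = ["antibot", "appie", "architext", "bjaaland", "digout4u",
          "echo", "fast-webcrawler", "ferret"]


def are_you_a_bot(user_agent, robots=ROBOTS):
    """Case-insensitive substring check of the user agent against the list."""
    agent = user_agent.lower()
    return any(name in agent for name in robots)


def append_sid(url, sid, user_agent, non_html_amp=False):
    """Append the session ID unless the URL already has one or a bot is visiting."""
    if sid and "sid=" not in url.lower() and not are_you_a_bot(user_agent):
        if "?" in url:
            separator = "&" if non_html_amp else "&amp;"
        else:
            separator = "?"
        url += separator + sid
    return url
```

A normal browser gets modules.php?name=Forums&amp;sid=abc123, while a visitor whose user agent contains one of the listed names gets the plain modules.php?name=Forums.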
Upload the distribution .htaccess to your PHP-Nuke root directory. This is the same directory that the header.php and footer.php are located within.
Make a backup of your original .htaccess file!
Your ISP may already have set up an .htaccess file for you. Even if you don't see one there, there might be one, as some systems will not show hidden files (such as .htaccess). Check very thoroughly that you don't overwrite an existing .htaccess file with important directives that are vital to your site. If you find one, make a backup copy of it and be sure that you have saved it somewhere in a safe and accessible place. If your site breaks and you can't access it with your browser anymore, you should still be able to upload the backup copy of your .htaccess and restore the previous settings.
In order for GT to work correctly with all modules currently supported, you will need to make a few minor adjustments to the following modules:
Sections
Statistics
Web_Links
Top
Your_Account
You can either overwrite your existing file with the one included in the distribution or make the manual changes as indicated in the readme files contained within each module. (Make sure you back up your existing files first! Included files are based on the ones located at cvs.nukecops.com).
The changes correct some links in the above modules. If you check, you will see that the corrections are really minimal, but they are nonetheless essential for GoogleTap to work correctly: they replace the ampersand (&) in the links with its equivalent “HTML entity” (&amp;) (but not in the language files, where the changes work in the opposite direction, replacing &amp; with &).
This is actually an omission on the part of the modules that GoogleTap is just trying to correct: to be HTML 4.x compliant, you can't put a bare ampersand (&) in a URL that appears in an HTML attribute, you have to put its HTML entity (&amp;) instead. Omitting this may not harm your links in most situations, but if you pass the URL through regular expressions, as GoogleTap does (see Section 25.5.1.3 for an explanation of how GoogleTap works), the ampersands have to be written consistently: the patterns expect the entity form, so a link written with a bare & would not be matched and would stay dynamic.
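The difference between the two forms is easy to demonstrate. The short Python sketch below is an illustration only and not part of GoogleTap: it uses the standard html module to write a dynamic URL the way it must appear inside an href attribute, and to read it back the way a browser does.

```python
import html

# The URL as the server understands it, with bare ampersands.
dynamic_url = "modules.php?name=Your_Account&op=userinfo&username=chris"

# The same URL as it must be written inside an HTML attribute.
href = html.escape(dynamic_url)
link = f'<a href="{href}">chris</a>'

# A browser decodes the entities again before it requests the URL.
decoded = html.unescape(href)
```

Here href holds modules.php?name=Your_Account&amp;op=userinfo&amp;username=chris, which is the form a pattern written with &amp; can match, while decoded is identical to the original dynamic_url.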
You can now upload the GT (Google Tap) converted blocks - currently supported are:
Scrolling Forums Block
Top 10 Downloads Block
Top 10 Web_Links Block
Sections Articles Block
To install, simply overwrite your existing block. Make sure they are enabled in the blocks administration panel.
The GoogleTap works as follows:
The header.php file starts output buffering. This delays the output - all contents are temporarily sent to an internal buffer. It also defines a function, replace_for_mod_rewrite(). This function takes a single argument (a variable) and replaces every occurrence of a member of the urlin array with the respective member of the urlout array (we have seen the urlin and urlout arrays already in Section 25.5.1.2). You can view urlin as an ordered collection of regular expressions (see Section 25.3) and urlout as the ordered collection of their replacements, where the first member of urlin corresponds to the first member of urlout, the second member of urlin to the second member of urlout and so forth.
The replace_for_mod_rewrite() function examines the contents of the variable that was passed to it as argument (remember that in PHP a variable can hold the contents of a whole page - in fact, in our case, it does exactly that!) and replaces every occurrence of the first regular expression in urlin with the first element of urlout. It does the same with all other regular expressions of urlin - if it finds a string that matches the nth regular expression, it replaces that string with the nth array member of urlout. The first five replacement pairs[1] are shown in Table 25-1.
Table 25-1. URL replacement with replace_for_mod_rewrite(): dynamic to static.
urlin: regular expression for dynamic URL | urlout: corresponding static URL |
'(?<!/)modules.php\?name=News &amp;file=article &amp;sid=([0-9]*) &amp;mode=([a-z]*) &amp;order=([0-9]*) &amp;thold=([0-9]*)' | article-\\1-\\2-\\3-\\4.html |
'(?<!/)modules.php\?name=News &amp;file=article &amp;sid=([0-9]*)' | article\\1.html |
'(?<!/)modules.php\?name=News &amp;file=article &amp;sid=([0-9]*)' | article\\1.html |
'(?<!/)modules.php\?name=News &amp;new_topic=([0-9]*)' | article-topic-\\1.html |
'(?<!/)modules.php\?name=Stories_Archive &amp;sa=show_month &amp;year=([0-9]*) &amp;month=([0-9]*) &amp;month_l=([a-zA-Z]*)' | archive-\\1-\\2-\\3.html |
From Table 25-1 we can already see that a URL of the form
(something)modules.php?name=News&amp;file=article&amp;sid=(some SID)&amp;mode=(some mode)&amp;order=(some order)&amp;thold=(something)
is replaced with
article-(some SID)-(some mode)-(some order)-(something).html
(first replacement pair of the table), while a URL of the form
(something)modules.php?name=News&amp;file=article&amp;sid=(some SID)
is replaced with
article(some SID).html
(the escaped numbers in the second column of Table 25-1 represent "backreferences": each one stands for the text matched by the corresponding subexpression inside the () of the expression in the first column. \1 stands for the first subexpression, \2 for the second and so forth).
Some explanations on the regular expressions used in the urlin column of Table 25-1 (see also Section 25.3):
(?<!/) is a so-called negative lookbehind assertion. It means that the pattern matches only if modules.php is not immediately preceded by a slash. This helps us convert only the relative links inside our PHP-Nuke site (so links to external sites, where modules.php follows a slash, will not be converted).
The question mark on the URL must be escaped, so we have to write \? to match it.
The ampersands (&) on the URL must be in the form of their HTML entities, “&amp;” (see SGML entities). That's why all affected modules will have to be changed to echo URLs that contain &amp; instead of & (this was done in Section 25.5.1.2).
sid=([0-9]*) indicates that you may have any digit from 0 to 9 after sid=. The * means that the occurrence may be 0 or more times.
mode=([a-z]*) indicates that you may have any lowercase alphabetical character after mode=. Again, the * means that the occurrence may be 0 or more times.
The parentheses () around a regular expression like [a-zA-Z]* or [0-9]* indicate that this is a subexpression that, if matched, will be stored in an internal numbered buffer that we can access as \1, \2, \3 and so forth. Thus, if sid=([0-9]*) is the first such subexpression in a URL, \1 will internally store the sid value, whatever it is, as long as it is matched by [0-9]*, which matches arbitrarily long sequences of digits. If it is the second one, then the sid value will be stored in \2.
The matched subexpressions (\1, \2, \3 etc.) are then used in the replacement strings of the urlout array to construct a simpler, static looking URL that nevertheless contains all necessary information to convert it back to the dynamic version (this backward conversion is carried out in the .htaccess file, see below).
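The replacement mechanism can be tried out with any regular expression engine that supports lookbehind assertions and backreferences. The following Python sketch is not the GoogleTap code; it reuses three of the pairs from Table 25-1 (written on one line, with &amp; for the ampersands) and applies them in order with re.sub, much as preg_replace does with its two arrays. The names URL_IN and URL_OUT are chosen for the example.

```python
import re

# Three of the pairs from Table 25-1; the order matters, since the longest
# pattern must be tried first.
URL_IN = [
    r"(?<!/)modules.php\?name=News&amp;file=article&amp;sid=([0-9]*)"
    r"&amp;mode=([a-z]*)&amp;order=([0-9]*)&amp;thold=([0-9]*)",
    r"(?<!/)modules.php\?name=News&amp;file=article&amp;sid=([0-9]*)",
    r"(?<!/)modules.php\?name=News&amp;new_topic=([0-9]*)",
]
URL_OUT = [
    r"article-\1-\2-\3-\4.html",
    r"article\1.html",
    r"article-topic-\1.html",
]


def replace_for_mod_rewrite(page):
    """Apply every (pattern, replacement) pair to the whole page, in order."""
    for pattern, replacement in zip(URL_IN, URL_OUT):
        page = re.sub(pattern, replacement, page)
    return page
```

With these pairs, a link to modules.php?name=News&amp;file=article&amp;sid=42 becomes article42.html, while http://www.example.org/modules.php?name=News&amp;new_topic=3 is left alone, because there modules.php is preceded by a slash and the lookbehind assertion fails.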
The replace_for_mod_rewrite() function is not called in header.php though - it is only defined there. As its name says, header.php outputs the standard PHP-Nuke header (as opposed to the custom HTML header, which is contained in includes/my_header.php, see Chapter 15). It would be too early to call replace_for_mod_rewrite() during header generation. Instead, everything is meticulously gathered in the “output buffer” of PHP (to learn more about output buffering in PHP, see Output Buffering With PHP and the PHP manual pages on output control functions).
As PHP-Nuke outputs the page contents (which continue to land in the output buffer), it eventually reaches the page footer. This is the job of footer.php. We already saw in Section 25.5.1.2 that it is the footer that calls replace_for_mod_rewrite(): it saves the buffer contents in a variable, cleans up the buffer and stops buffering, and then calls replace_for_mod_rewrite().
What are the contents of the output buffer at this stage? You guessed it - the whole page, all the HTML that comprises the “page source” (the HTML code you see when you hit the “View source” menu entry of your browser). Without GoogleTap, this would be the HTML code that would be sent to your browser to render the page, with all its text and links and...
Huh! Did I say links? What if your page contains dynamic links? In fact, it is very unusual for a PHP-Nuke page NOT to contain any dynamic links. Don't they have to be replaced with their static counterparts too? Of course they do! Thus, it should be clear to you by now that just by changing one URL from dynamic to static we are far from done - we have to do this for every URL contained in the HTML of our page.
Well, that's exactly what that call of replace_for_mod_rewrite() in the footer does: it passes the whole page as argument to replace_for_mod_rewrite(), which then does an elaborate “search and replace” in the whole page, according to the urlin and urlout arrays, as indicated by Table 25-1.
This completes the translation from dynamic to static: the browser gets a page where every dynamic link of PHP-Nuke has been replaced by a static one. The user - and the search engine - sees only static links on the page to follow.
What's more, through the changes we did in sessions.php during installation (see Section 25.5.1.2), we have already taken care that if the visitor is a search engine bot, it will not be shown the session ID on the URL. Normal users will continue to see it, though. With this ingenious method, we get the best of two worlds: users can rely on the security and comfort that PHP session management has to offer - and search engines will not be chased away from our site through some huge URLs.
But suppose that the user (or better, the search engine, for which we are actually going to all this trouble) now clicks on one of those static links on the page that we previously served through our header-footer trick - what then? How are we going to tell PHP-Nuke which link was meant? Remember, PHP-Nuke sits on the server and understands only links like
modules.php?name=Your_Account&op=userinfo&username=chris
Only then will it understand that you want to see the User Info profile for chris (see Figure 18-7) in the Your_Account module. But the user will not click on that link, because all dynamic links were already transformed to static ones (by replace_for_mod_rewrite()) before the page was served by the web server to his browser. The user will click on a link that looks like
userinfo-chris.html
This is because, according to the urlin and urlout arrays, which control dynamic-to-static translation of URLs in the replace_for_mod_rewrite() function of header.php, the regular expression
'(?<!/)modules.php\?name=Your_Account&amp;op=userinfo&amp;username=([a-zA-Z0-9_-]*)'
is translated to
userinfo-\\1.html
(the \1 in the second expression stands for the text matched by the subexpression that is enclosed in parentheses () in the first regular expression, in this case whatever follows the "username=" string).
For the reverse translation, from static to dynamic, the .htaccess file comes into play. It tells mod_rewrite to rewrite every URL that looks like
userinfo-(some username).html
to
modules.php?name=Your_Account&op=userinfo&username=(some username)
Here is the relevant part of .htaccess that is responsible for this translation:
    #Your Account
    RewriteRule ^userinfo-([a-zA-Z0-9_-]*).html modules.php?name=Your_Account&op=userinfo&username=$1
You will find rewrite rules like the above, for every regular expression pair that you encounter in the urlin and urlout arrays of the header.php file. Only the roles have been exchanged: the first expression is the static one, the second the dynamic.
This way, PHP-Nuke never sees the static URLs - when they arrive at the PHP interpreter, they have been already translated to the original, correct, dynamic URLs that PHP-Nuke knows how to treat.
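The reverse translation can be imitated in a few lines as well. The Python sketch below is only a model of what mod_rewrite does with the rule shown above, not a replacement for it: the pattern is copied from the .htaccess excerpt, $1 is written as \1 (the Python notation for the same backreference), and a requested path that matches the pattern is replaced as a whole by the substitution. The names REWRITE_RULES, rewrite and split_target are assumptions made for the example.

```python
import re
from urllib.parse import parse_qs

# (pattern, substitution) pairs, modelled on the .htaccess excerpt above.
REWRITE_RULES = [
    (r"^userinfo-([a-zA-Z0-9_-]*).html",
     r"modules.php?name=Your_Account&op=userinfo&username=\1"),
]


def rewrite(path):
    """Return the dynamic URL for a static path, or the path itself."""
    for pattern, substitution in REWRITE_RULES:
        match = re.search(pattern, path)
        if match:
            return match.expand(substitution)   # fills in the backreferences
    return path


def split_target(url):
    """Split a dynamic URL into the script name and its parameters."""
    script, _, query = url.partition("?")
    return script, parse_qs(query)
```

For userinfo-chris.html this yields modules.php together with the parameters name=Your_Account, op=userinfo and username=chris, which is the form PHP-Nuke knows how to handle; a path that matches no rule, such as index.php, passes through unchanged.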
While it is at it, by the way, .htaccess does some other neat things for us too. They don't have to do with the search-engine friendliness, but rather with the security of our site, and come as a nice by-product of the mod_rewrite technology:
It blocks direct access to files whose extensions indicate that they should never be served to a visitor (.inc, .tpl, .sql, .ini, .conf and so on):
    <FilesMatch "\.(inc|tpl|h|ihtml|sql|ini|conf|class|bin|spd|theme|module)$">
        deny from all
    </FilesMatch>
It redirects e-mail harvesting robots to a fake page (some emailsforyou.php):
    RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Black.Hole [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    ...
    RewriteRule ^.*$ emailsforyou.php [L]
Of course, this is only going to work for email robots that are not clever enough to fake the HTTP_USER_AGENT string they present to the server.
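The following Python sketch models the effect of these conditions, as an illustration only. It uses the five user agent patterns quoted above (the real list is longer) and the same anchored, case-sensitive matching that RewriteCond applies by default; the function name target_for is an assumption.

```python
import re

# The five patterns quoted in the excerpt; the distribution lists many more.
BLOCKED_AGENTS = [r"^Alexibot", r"^asterias", r"^BackDoorBot",
                  r"^Black.Hole", r"^BlackWidow"]


def target_for(user_agent, requested_path):
    """Send listed robots to the fake page, everybody else to their page."""
    if any(re.search(pattern, user_agent) for pattern in BLOCKED_AGENTS):
        return "emailsforyou.php"
    return requested_path
```

A robot that announces itself as BlackWidow/1.0 ends up on emailsforyou.php, but the same robot presenting a browser-like user agent such as Mozilla/5.0 gets the requested page, which is exactly the limitation mentioned above.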
[1] For better readability, each urlin regular expression was broken up inside its table cell, so that the parts of the regular expression that match a URL parameter of the dynamic URL are separated from each other. In reality, each regular expression is a single string on one line.