27.7. How to check the validity of thousands of web links

Figure 27-6. Administration panel: Web Links.


Problem: You need to check the validity of your PHP-Nuke web links. You have a lot of them, and the built-in checker of PHP-Nuke just hangs, probably because of the limit on CPU time consumption (usually 30 seconds). You would like to run a cron job to check for valid web links instead.

Solution: Download deadlinkcheck, let MySQL dump a list of your web links, and give it to deadlinkcheck to validate (see How to check the validity of thousands of weblinks, Weblinks Validation).

deadlinkcheck is a simple - you guessed right! - dead link checker. It only requires Perl 5.x to run, nothing else. It produces a nice HTML validation report that you can check every day, if you combine it with a scheduling facility like cron. To install it, follow these three simple steps:

  1. Extract the tar archive you downloaded:

    tar -xzvf dlc-0.4.0.tar.gz
  2. Change into the newly created directory dlc-0.4.0 and run the configure script:

    ./configure --prefix=/usr/local

    Substitute your own prefix. In the above example, deadlinkcheck will be installed in /usr/local/bin and the man page in /usr/local/man/man1 (unfortunately, the --mandir option to the configure script is not honoured).

  3. Run "make install". This completes installation. Do a "man deadlinkcheck" to get acquainted with the options.

The only thing you need now is a file with all those 5000 HTTP links - or were they 50000? No matter how many links you have to check, this is easily done:

Create a MySQL batch script, say weblinks.sql, that contains:

# Put your database here!
use phpnuke;
#
# Output all URLs stored in 'nuke_links_links'
#
select url from nuke_links_links;

Run it from the shell prompt:

mysql < weblinks.sql

(You may need to supply your MySQL user name and password, for example with the -u and -p options; see running MySQL in Batch Mode.) This should print a list of all the URLs on standard output. You only need to redirect it to a file:

mysql < weblinks.sql > urls_weblinks
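In batch mode the mysql client normally prints the column name (here "url") as the first line of its output, and that line is not a link. The following is a minimal sketch of a filter that tidies the dump before it is handed to deadlinkcheck. The function name clean_urls is hypothetical, and dropping empty lines and duplicate URLs is an assumption about what you want, not something deadlinkcheck requires.

```shell
# clean_urls (hypothetical helper): read a raw URL dump on stdin and
# write a tidy list on stdout.
clean_urls() {
    # - skip the first line if it is the column header "url"
    # - skip empty lines
    # - print each remaining URL only once, keeping the original order
    awk 'NR == 1 && $0 == "url" { next } NF > 0 && !seen[$0]++'
}
```

It could be used as "mysql < weblinks.sql | clean_urls > urls_weblinks". If only the header line bothers you, the -N (--skip-column-names) option of the mysql client suppresses it without any filter.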

Now all the above comes to its completion: write a cron job to execute a script that contains the following two commands:

mysql < weblinks.sql > urls_weblinks
deadlinkcheck -output deadlinks.html -HTMLoutput urls_weblinks
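As a sketch of how these two commands might be wrapped for cron, the script below changes into a working directory first (cron jobs do not start in your project directory) and keeps the previous URL list if the dump fails. The script name check_weblinks.sh, the directory in WORKDIR and the overridable MYSQL and DLC variables are all assumptions for illustration; the deadlinkcheck invocation itself is the one shown above.

```shell
#!/bin/sh
# check_weblinks.sh (hypothetical name): dump the web links and validate them.

WORKDIR="${WORKDIR:-$HOME/linkcheck}"   # assumed directory containing weblinks.sql
MYSQL="${MYSQL:-mysql}"                 # mysql client, overridable for testing
DLC="${DLC:-deadlinkcheck}"             # deadlinkcheck, overridable for testing

# The function body runs in a subshell, so the cd does not affect the caller.
check_weblinks() (
    cd "$WORKDIR" || exit 1
    # Dump into a temporary file so a failed dump never truncates the old list.
    "$MYSQL" < weblinks.sql > urls_weblinks.tmp || exit 1
    mv urls_weblinks.tmp urls_weblinks
    # Same invocation as in the text above.
    "$DLC" -output deadlinks.html -HTMLoutput urls_weblinks
)

# Do the work only when called as "check_weblinks.sh run".
if [ "${1:-}" = "run" ]; then
    check_weblinks
fi
```

A crontab entry for a nightly run at 03:30 could then look like this (the installation path is again an assumption):

    30 3 * * * /usr/local/bin/check_weblinks.sh run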

Next day, when the cron job is finished and you're enjoying the morning coffee, just open the file deadlinks.html with your browser and check the results.

Tip: How to find out the category of the invalid links

To find and correct an invalid link in PHP-Nuke, it is nice to know the category it is in. The solution we will present is not perfect, but you could find an even better one, if you were willing to tweak the code of the deadlinkcheck script. Use the following command in weblinks.sql:

select url, cid, sid from nuke_links_links;

The rest remains the same. At the end you again get a file that contains the validated links, sorted according to HTTP return code (2xx, 3xx and 4xx), with the difference that you now see the category ids in the anchor text! The links themselves show up in the status line of the browser when you move the mouse pointer over a category id. This makes use of a bug (in this case a feature) of deadlinkcheck: if a line contains more information than just the link to check, deadlinkcheck checks the link and uses (part of) that extra information as anchor text.
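If you prefer to look up the category of a reported link on the command line, the dump file itself can serve as a lookup table, since the mysql client separates the columns with tabs in batch mode. The following is a minimal sketch; the function name cid_of_url is hypothetical, and it assumes the file was produced with the three-column query above.

```shell
# cid_of_url (hypothetical helper): print the category id (second column)
# of every line whose first column is exactly the given URL.
#   $1 = URL to look up
#   $2 = dump file (defaults to urls_weblinks)
cid_of_url() {
    awk -F '\t' -v u="$1" '$1 == u { print $2 }' "${2:-urls_weblinks}"
}
```

For example, "cid_of_url http://www.example.com/" prints the cid of that link, or nothing if the URL is not in the file.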