18.13 Web Page Access Counting
18.13.1 Problem
You want to count the number of times a page has been accessed. This can be used to display a hit counter in the page. The same technique can be used to record other types of information as well, such as the number of times each of a set of banner ads has been served.
18.13.2 Solution
Implement a hit counter, keyed to the page you want to count.
18.13.3 Discussion
This section discusses access counting, using hit counters for the examples. Counters that display the number of times a web page has been accessed are not such a big thing as they used to be, presumably because page authors now realize that most visitors don't really care how popular a page is. Still, the general concept has application in several contexts. For example, if you're displaying banner ads in your pages (Recipe 17.8), you may be charging vendors by the number of times you serve their ads. To do so, you need to count the number of accesses for each one. You can adapt the technique shown in this section for purposes such as these.
There are several methods for writing a page that displays a count of the number of times it has been accessed. The most basic is to maintain the count in a file. When the page is requested, you open the file, read the count, increment it and write the new count back to the file and display it in the page. This has the advantage of being easy to implement and the disadvantage that it requires a counter file for each page that includes a hit count. It also doesn't work properly if two clients access the page at the same time, unless you implement some kind of locking protocol in the file access procedure. It's possible to reduce counter file litter by keeping multiple counts in a single file, but that makes it more difficult to access particular values within the file, and it doesn't solve the simultaneous-access problem. In fact, it makes it worse, because a multiple-counter file has a higher likelihood of being accessed by multiple clients simultaneously than does a single-counter file. So you end up implementing storage and retrieval methods for processing the file contents, and locking protocols to keep multiple processes from interfering with each other. Hmm . . . those sound suspiciously like the problems that MySQL already takes care of! Keeping the counts in the database centralizes them into a single table, SQL provides the storage and retrieval interface, and the locking problem goes away because MySQL serializes access to the table so that clients can't interfere with each other. Furthermore, depending on how you manage the counters, you may be able to update the counter and retrieve the new sequence value using a single query.
I'll assume that you want to log hits for more than one page. To do that, create a table that has one row for each page to be counted. This means it's necessary to have a unique identifier for each page, so that counters for different pages don't get mixed up. You could assign identifiers somehow, but it's easier just to use the page's path within your web tree. Web programming languages typically make this path easy to obtain; in fact, we've already discussed how to do so in Recipe 18.2. On that basis, you can create a hitcount table as follows:
CREATE TABLE hitcount ( path VARCHAR(255) BINARY NOT NULL, hits BIGINT UNSIGNED NOT NULL, PRIMARY KEY (path) );
This table definition involves some assumptions:
The BINARY keyword in the path column definition makes the column values case sensitive. That's appropriate for a web platform where pathnames are case sensitive, such as most versions of Unix. For Windows or for HFS+ filesystems under Mac OS X, filenames are not case sensitive, so you'd omit BINARY from the definition.
The path column has a maximum length of 255 characters, which limits you to page paths no longer than that. If you expect to require longer values, use a BLOB or TEXT type rather than VARCHAR. But in this case, you're still limited to indexing a maximum of the leftmost 255 characters of the column values, so you'd use a non-unique index rather than a PRIMARY KEY.
The mechanism works for a single document tree, such as when your web server is used to serve pages for a single domain. If you institute a hit count mechanism on a host that servers multiple virtual domains, you may want to add a column for the domain name. This value is available in the SERVER_NAME value that Apache puts into your script's environment. In this case, the hitcount table index would include both the hostname and the page path.
The general logic involved in hit counter maintenance is to increment the hits column of the record for a page, then retrieve the updated counter value. One way to do that is by using the following two queries:
UPDATE hitcount SET hits = hits + 1 WHERE path = 'page path'; SELECT hits FROM hitcount WHERE path = 'page path';
Unfortunately, if you use that approach, you may often not get the correct value. If several clients request the same page simultaneously, several UPDATE statements may be issued in close temporal proximity. The following SELECT statements then wouldn't necessarily get the corresponding hits value. This can be avoided by using a transaction or by locking the hitcount table, but that slows down hit counting. MySQL provides a solution that allows each client to retrieve its own count, no matter how many updates happen at the same time:
UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = 'page path'; SELECT LAST_INSERT_ID( );
The basis for updating the count here is LAST_INSERT_ID(expr), which was discussed in Recipe 11.17. The UPDATE statement finds the relevant record and increments its counter value. The use of LAST_INSERT_ID(hits+1) rather than just hits+1 tells MySQL to treat the value as though it were an AUTO_INCREMENT value. This allows it to be retrieved in the second query using LAST_INSERT_ID( ). The LAST_INSERT_ID( ) function returns a connection-specific value, so you always get back the value corresponding to the UPDATE issued on the same connection. In addition, the SELECT statement doesn't need to query a table, so it's very fast. A further efficiency may be gained by eliminating the SELECT query altogether, which is possible if your API provides a means for direct retrieval of the most recent sequence number. For example, in Perl, you can update the count and get the new value with a single query like this:
$dbh->do ( "UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = ?", undef, $page_path); $hits = $dbh->{mysql_insertid};
However, there's still a problem here. What if the page isn't listed in the hitcount table? In that case, the UPDATE statement finds no record to modify and you get a counter value of zero. You could deal with this problem by requiring that any page that includes a hit counter must be registered in the hitcount table before the page goes online. A friendlier alternate approach is to create a counter record automatically for any page that is found not to have one. That way, page designers can put counters in pages with no advance preparation. To make the counter mechanism easier to use, put the code in a utility function that takes a page path as its argument, handles the missing-record logic internally, and returns the count. Conceptually, the function acts like this:
update the counter if the update modifies a row retrieve the new counter value else insert a record for the page with the count set to 1
The first time you request a count for a page, the update modifies no rows because the page won't be listed in the table yet. The function creates a new counter and returns a value of one. For each request thereafter, the update modifies the existing record for the page and the function returns successive access counts.
In Perl, a hit-counting function might look like this, where the arguments are a database handle and the page path:
sub get_hit_count { my ($dbh, $page_path) = @_;
my $rows = $dbh->do ( "UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = ?", undef, $page_path); return ($dbh->{mysql_insertid}) if $rows > 0; # counter was incremented
# If the page path wasn't listed in the table, register it and # initialize the count to one. Use IGNORE in case another client # tries same thing at the same time.
$dbh->do ("INSERT IGNORE INTO hitcount (path,hits) VALUES(?,1)", undef, $page_path); return (1); }
The CGI.pm script_name( ) function returns the local part of the URL, so you use get_hit_count( ) like this:
my $hits = get_hit_count ($dbh, script_name ( )); print p ("This page has been accessed $hits times.");
The counting mechanism potentially involves multiple queries, and we haven't used a transactional approach, so the algorithm still has a race condition that can occur for the first access to a page. If multiple clients simultaneously request a page that is not yet listed in the hitcount table, each of them may issue the UPDATE query, find the page missing, and as a result issue the INSERT query to register the page and initialize the counter. The algorithm uses INSERT IGNORE to suppress errors if simultaneous invocations of the script attempt to initialize the counter for the same page, but the result is that they'll all get a count of one. Is it worth trying to fix this problem by using transactions or table locking? For hit counting, I'd say no. The slight loss of accuracy doesn't warrant the additional processing overhead. For a different application, the priority may be accuracy over efficiency, in which case you would opt for transactions to avoid losing a count.
A PHP version of the hit counter looks like this:
function get_hit_count ($conn_id, $page_path) { $query = sprintf ("UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = %s", sql_quote ($page_path)); if (mysql_query ($query, $conn_id) && mysql_affected_rows ($conn_id) > 0) return (mysql_insert_id ($conn_id));
# If the page path wasn't listed in the table, register it and # initialize the count to one. Use IGNORE in case another client # tries same thing at the same time.
$query = sprintf ("INSERT IGNORE INTO hitcount (path,hits) VALUES(%s,1)", sql_quote ($page_path)); mysql_query ($query, $conn_id); return (1); }
To use it, call the get_self_path( ) function that returns the script pathname (see Recipe 18.2):
$self_path = get_self_path ( ); $hits = get_hit_count ($conn_id, $self_path); print ("<p>This page has been accessed $hits times.</p>\n");
In Python, the function looks like this:
def get_hit_count (conn, page_path): cursor = conn.cursor ( ) cursor.execute (""" UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = %s """, (page_path,)) if cursor.rowcount > 0: # a counter was incremented count = cursor.insert_id ( ) cursor.close ( ) return (count)
# If the page path isn't listed in the table, register it and # initialize the count to one. Use IGNORE in case another client # tries same thing at the same time.
cursor.execute (""" INSERT IGNORE INTO hitcount (path,hits) VALUES(%s,1) """, (page_path,)) cursor.close ( ) return (1)
And is used as follows:
self_path = os.environ["SCRIPT_NAME"] count = get_hit_count (conn, self_path) print "<p>This page has been accessed %d times.</p>" % count
The recipes distribution includes demonstration scripts hit counter scripts for Perl, PHP, and Python under the apache directory. A JSP version is under the tomcat directory. Install any of these in your web tree, invoke it a few times, and watch the count increase. (First you'll need to create the hitcount table, as well as the hitlog table described in Recipe 18.14. Both tables can be created from the hits.sql script provided in the tables directory.)
|
No comments:
Post a Comment