Monday, November 2, 2009

18.13 Web Page Access Counting




I l@ve RuBoard










18.13 Web Page Access Counting




18.13.1 Problem



You want to count the number of times
a page has been accessed. This can be used to display a hit counter
in the page. The same technique can be used to record other types of
information as well, such as the number of times each of a set of
banner ads has been served.





18.13.2 Solution



Implement a hit counter, keyed to the page you want to count.





18.13.3 Discussion



This section discusses access counting, using hit counters for the
examples. Counters that display the number of times a web page has
been accessed are not such a big thing as they used to be, presumably
because page authors now realize that most visitors
don't really care how popular a page is. Still, the
general concept has application in several contexts. For example, if
you're displaying banner ads in your pages (Recipe 17.8), you may be charging vendors by the number of
times you serve their ads. To do so, you need to count the number of
accesses for each one. You can adapt the technique shown in this
section for purposes such as these.



There are several methods for writing
a page that displays a count of the number of times it has been
accessed. The most basic is to maintain the count in a file. When the
page is requested, you open the file, read the count, increment it
and write the new count back to the file and display it in the page.
This has the advantage of being easy to implement and the
disadvantage that it requires a counter file for each page that
includes a hit count. It also doesn't work properly
if two clients access the page at the same time, unless you implement
some kind of locking protocol in the file access procedure.
It's possible to reduce counter file litter by
keeping multiple counts in a single file, but that makes it more
difficult to access particular values within the file, and it
doesn't solve the simultaneous-access problem. In
fact, it makes it worse, because a multiple-counter file has a higher
likelihood of being accessed by multiple clients simultaneously than
does a single-counter file. So you end up implementing storage and
retrieval methods for processing the file contents, and locking
protocols to keep multiple processes from interfering with each
other. Hmm . . . those sound suspiciously like the problems that
MySQL already takes care of! Keeping the counts in the database
centralizes them into a single table, SQL provides the storage and
retrieval interface, and the locking problem goes away because MySQL
serializes access to the table so that clients can't
interfere with each other. Furthermore, depending on how you manage
the counters, you may be able to update the counter and retrieve the
new sequence value using a single query.



I'll assume
that you want to log hits for more than one page. To do that, create
a table that has one row for each page to be counted. This means
it's necessary to have a unique identifier for each
page, so that counters for different pages don't get
mixed up. You could assign identifiers somehow, but
it's easier just to use the page's
path within your web tree. Web programming languages typically make
this path easy to obtain; in fact, we've already
discussed how to do so in Recipe 18.2. On that
basis, you can create a hitcount table as follows:



CREATE TABLE hitcount
(
path VARCHAR(255) BINARY NOT NULL,
hits BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (path)
);


This table definition involves some assumptions:




  • The BINARY keyword in the path
    column definition makes the column values case sensitive.
    That's appropriate for a web platform where
    pathnames are case sensitive, such as most versions of Unix. For
    Windows or for HFS+ filesystems under Mac OS X, filenames are not
    case sensitive, so you'd omit
    BINARY from the definition.


  • The path column has a maximum length of 255
    characters, which limits you to page paths no longer than that. If
    you expect to require longer values, use a BLOB or
    TEXT type rather than VARCHAR.
    But in this case, you're still limited to indexing a
    maximum of the leftmost 255 characters of the column values, so
    you'd use a non-unique index rather than a
    PRIMARY KEY.


  • The mechanism works for a single document tree, such as when your web
    server is used to serve pages for a single domain. If you institute a
    hit count mechanism on a host that servers multiple virtual domains,
    you may want to add a column for the domain name. This value is
    available in the SERVER_NAME value that Apache
    puts into your script's environment. In this case,
    the hitcount table index would include both the
    hostname and the page path.



The general logic involved in hit counter maintenance is to increment
the hits column of the record for a page, then
retrieve the updated counter value. One way to do that is by using
the following two queries:



UPDATE hitcount SET hits = hits + 1 WHERE path = 'page path';
SELECT hits FROM hitcount WHERE path = 'page path';


Unfortunately, if you use that approach, you may often not get the
correct value. If several clients request the same page
simultaneously, several UPDATE statements may be
issued in close temporal proximity. The following
SELECT statements then wouldn't
necessarily get the corresponding hits value. This
can be avoided by using a transaction or by locking the
hitcount table, but that slows down hit counting.
MySQL provides a solution that allows each client to retrieve its own
count, no matter how many updates happen at the same time:



UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = 'page path';
SELECT LAST_INSERT_ID( );


The basis for updating the count here is
LAST_INSERT_ID(expr),
which was discussed in Recipe 11.17. The
UPDATE statement finds the relevant record and
increments its counter value. The use of
LAST_INSERT_ID(hits+1) rather than just
hits+1 tells MySQL to treat the value as though it
were an AUTO_INCREMENT value. This allows it to be
retrieved in the second query using LAST_INSERT_ID(
)
. The LAST_INSERT_ID( ) function
returns a connection-specific value, so you always get back the value
corresponding to the UPDATE issued on the same
connection. In addition, the SELECT statement
doesn't need to query a table, so
it's very fast. A further efficiency may be gained
by eliminating the SELECT query altogether, which
is possible if your API provides a means for direct retrieval of the
most recent sequence number. For example, in Perl, you can update the
count and get the new value with a single query like this:



$dbh->do (
"UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = ?",
undef, $page_path);
$hits = $dbh->{mysql_insertid};


However, there's still a problem here. What if the
page isn't listed in the hitcount
table? In that case, the UPDATE statement finds no
record to modify and you get a counter value of zero. You could deal
with this problem by requiring that any page that includes a hit
counter must be registered in the hitcount table
before the page goes online. A friendlier alternate approach is to
create a counter record automatically for any page that is found not
to have one. That way, page designers can put counters in pages with
no advance preparation. To make the counter mechanism easier to use,
put the code in a utility function that takes a page path as its
argument, handles the missing-record logic internally, and returns
the count. Conceptually, the function acts like this:



update the counter
if the update modifies a row
retrieve the new counter value
else
insert a record for the page with the count set to 1


The first time you request a count for a page, the update modifies no
rows because the page won't be listed in the table
yet. The function creates a new counter and returns a value of one.
For each request thereafter, the update modifies the existing record
for the page and the function returns successive access counts.



In Perl, a hit-counting function might
look like this, where the arguments are a database handle and the
page path:



sub get_hit_count
{
my ($dbh, $page_path) = @_;

my $rows = $dbh->do (
"UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = ?",
undef, $page_path);
return ($dbh->{mysql_insertid}) if $rows > 0; # counter was incremented

# If the page path wasn't listed in the table, register it and
# initialize the count to one. Use IGNORE in case another client
# tries same thing at the same time.

$dbh->do ("INSERT IGNORE INTO hitcount (path,hits) VALUES(?,1)",
undef, $page_path);
return (1);
}


The CGI.pm script_name( ) function returns the
local part of the URL, so you use get_hit_count( )
like this:



my $hits = get_hit_count ($dbh, script_name ( ));
print p ("This page has been accessed $hits times.");


The counting mechanism potentially involves multiple queries, and we
haven't used a transactional approach, so the
algorithm still has a race condition that can occur for the first
access to a page. If multiple clients simultaneously request a page
that is not yet listed in the hitcount table, each
of them may issue the UPDATE query, find the page
missing, and as a result issue the INSERT query to
register the page and initialize the counter. The algorithm uses
INSERT IGNORE to suppress
errors if simultaneous invocations of the script attempt to
initialize the counter for the same page, but the result is that
they'll all get a count of one. Is it worth trying
to fix this problem by using transactions or table locking? For hit
counting, I'd say no. The slight loss of accuracy
doesn't warrant the additional processing overhead.
For a different application, the priority may be accuracy over
efficiency, in which case you would opt for transactions to avoid
losing a count.



A PHP version of the hit counter looks
like this:



function get_hit_count ($conn_id, $page_path)
{
$query = sprintf ("UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1)
WHERE path = %s", sql_quote ($page_path));
if (mysql_query ($query, $conn_id) && mysql_affected_rows ($conn_id) > 0)
return (mysql_insert_id ($conn_id));

# If the page path wasn't listed in the table, register it and
# initialize the count to one. Use IGNORE in case another client
# tries same thing at the same time.

$query = sprintf ("INSERT IGNORE INTO hitcount (path,hits)
VALUES(%s,1)", sql_quote ($page_path));
mysql_query ($query, $conn_id);
return (1);
}


To use it, call the get_self_path( ) function that
returns the script pathname (see Recipe 18.2):



$self_path = get_self_path ( );
$hits = get_hit_count ($conn_id, $self_path);
print ("<p>This page has been accessed $hits times.</p>\n");


In Python, the function looks like this:



def get_hit_count (conn, page_path):
cursor = conn.cursor ( )
cursor.execute ("""
UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1)
WHERE path = %s
""", (page_path,))
if cursor.rowcount > 0: # a counter was incremented
count = cursor.insert_id ( )
cursor.close ( )
return (count)

# If the page path isn't listed in the table, register it and
# initialize the count to one. Use IGNORE in case another client
# tries same thing at the same time.

cursor.execute ("""
INSERT IGNORE INTO hitcount (path,hits) VALUES(%s,1)
""", (page_path,))
cursor.close ( )
return (1)


And is used as follows:



self_path = os.environ["SCRIPT_NAME"]
count = get_hit_count (conn, self_path)
print "<p>This page has been accessed %d times.</p>" % count


The recipes distribution includes demonstration
scripts hit counter scripts for Perl, PHP, and Python under the
apache directory. A JSP version is under the
tomcat directory. Install any of these in your
web tree, invoke it a few times, and watch the count increase. (First
you'll need to create the
hitcount table, as well as the
hitlog table described in Recipe 18.14. Both tables can be created from the
hits.sql script provided in the
tables directory.)










    I l@ve RuBoard



    No comments:

    Post a Comment