Step 1: Crawling the Web Site
Crawling a Web site begins with the first page and involves following every link found. For the mathematically inclined, crawling a site is equivalent to performing a breadth-first search on a connected, directed graph. A crawler is a program that automates this process. Think of it as a browser that can click on each link of a Web page by itself and traverse all the pages in the Web site. The crawler sends an HTTP "GET" request to a page, parses the HTML received, extracts all the hyperlinks from it, and recursively performs the same action on each link.
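To make the process concrete, here is a minimal breadth-first crawler sketch in Python. It is not part of the text being described; the start URL, the page limit, and the same-site restriction are illustrative assumptions.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects the href attribute of every <a> tag encountered.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])              # breadth-first queue of pages to visit
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            with urlopen(url) as resp:      # send the HTTP "GET" request
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                        # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)                   # parse the HTML, extract hyperlinks
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)              # repeat the process on each new link
                queue.append(link)
    return seen

Running crawl("http://www.example.com/") returns the set of same-site URLs discovered, which is the raw material for the linkage analysis steps that follow.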
Crawlers can be quite sophisticated. Instead of simply following links, they can also mirror an entire Web site on the local hard drive and extract other elements such as client-side scripts and HTML comments. We discussed some of these techniques in Chapter 7.
Crawling a Site Manually
If a Web site doesn't contain many pages, you can follow the hyperlinks manually with a browser and make a list of them. This technique is more accurate than using a crawler to gather links. One of the main drawbacks of automated crawling is that crawlers can't interpret client-side scripts, such as JavaScript, and therefore miss any hyperlinks those scripts contain.
A Closer Look at the HTTP Response Header
Each HTTP response has two parts: the HTTP response header and the data content. Usually, the data content is HTML, but it can also be a byte block representing a GIF image or some other object. Crawlers rely heavily on HTTP response headers while crawling a site. Consider this HTTP response header:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 16 Mar 2002 18:08:35 GMT
Connection: Keep-Alive
Content-Length: 496
Content-Type: text/html
Set-Cookie: ASPSESSIONIDQQGGGRHQ=DPHDNEMBEEHDNFMOPNPKIPHN; path=/
Cache-control: private
The first item to be inspected in the HTTP response header is the
HTTP response code, which appears in the first line of the HTTP response
header. In the preceding code snippet, the HTTP response code is
"200," which signifies that the HTTP request was processed properly
and that the appropriate response was generated. If the response code indicates
an error, the error occurred when requesting the resource. A "404"
response code indicates that the resource doesn't exist. A "403"
response code signifies that the resource is blocked from requests, but
nonetheless is present. Other HTTP response codes indicate that the resource
may have relocated or that some extra privileges are required to request that
resource. A crawler has to pay attention to these response codes and determine
whether to crawl farther.
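As a rough sketch of this decision, and of the Content-Type check described in the next paragraph, a crawler might gate its parsing on the response header as follows. The URL is a placeholder and the function name is ours, not something from the text.

from urllib.error import HTTPError
from urllib.request import urlopen

def fetch_if_html(url):
    try:
        resp = urlopen(url)                  # raises HTTPError for codes such as 403 or 404
    except HTTPError as err:
        print(f"{url}: HTTP {err.code}, not crawling further")
        return None
    content_type = resp.headers.get("Content-Type", "")
    if not content_type.startswith("text/html"):
        print(f"{url}: skipping content of type {content_type}")   # e.g., a GIF image
        return None
    return resp.read()                       # HTML body, ready for link extraction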
The next bit of important information returned in the HTTP response header, from a crawler's perspective, is the Content-Type field. It indicates the type of resource represented by the data in the HTTP content that follows the HTTP response header. Again, the crawler has to pay attention to the Content-Type. It makes no sense for a crawler to try to extract links from a GIF file, so crawlers usually pay attention only to "text/html" content.
Some Popular Tools for Site Linkage Analysis
Several tools, free as well as commercial, are available for crawling Web applications. We describe a few of them and discuss some of their key features in this section.
GNU wget
GNU wget is a simple command-line crawler and is available along with source code on http://www.wget.org/. Although wget was primarily intended for Unix platforms, a Windows binary is also available. Recall that we took a look at wget in Chapter 7, where we used it to mirror a Web site locally and search the retrieved HTML for patterns during source sifting. The advantages offered by wget are that it is simple to use, works from the command line, and is available on both Unix and Windows platforms. It is also very easy to use in shell scripts or batch files to further automate linkage analysis tasks.
Because wget offers the
ability to mirror Web site content, we can run several commands or scripts on
the mirrored content for various types of analysis.
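A short Python sketch of that workflow follows. The host name, the wget options, and the pattern being searched for are examples chosen for illustration, not requirements from the text.

import pathlib
import subprocess

site = "www.example.com"
# -m mirrors the site recursively, -k rewrites links for local browsing,
# and -np keeps wget from wandering above the starting directory.
subprocess.run(["wget", "-m", "-k", "-np", f"http://{site}/"], check=True)

# Simple source sifting over the mirrored pages: flag files containing HTML comments.
for page in pathlib.Path(site).rglob("*.htm*"):
    text = page.read_text(errors="replace")
    if "<!--" in text:
        print(f"{page}: contains HTML comments")

The same loop can be pointed at any other pattern of interest, such as e-mail addresses or embedded client-side scripts.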
BlackWidow from SoftByteLabs
SoftByteLabs' BlackWidow is a very fast Web site crawler for
the Windows platform. The crawling engine is multithreaded and retrieves Web pages
in parallel. BlackWidow also performs some basic source sifting techniques such
as those discussed in Chapter 7. Figure 8-2 shows BlackWidow crawling http://www.foundstone.com/.
On its tabs, you can view the progress of the crawling, thread by thread.
Figure 8-2. BlackWidow crawling one site with multiple threads
Figure 8-3 shows the site structure in a collapsible tree format. It helps us analyze how resources are grouped on the Web site. The BlackWidow GUI has other tabs that show e-mail addresses present on the pages, external links, and errors in retrieving links, if any. Like GNU wget, BlackWidow can also mirror a Web site, rewriting the URLs within hyperlinks so that they remain accessible from the local file system.
Figure 8-3. Structure of http://www.acme.com/
Funnel Web Profiler from Quest Software
Funnel Web Profiler from Quest Software can perform an
exhaustive analysis of a Web site. Quest Software has a trial version of Funnel
Web Profiler available for download from http://www.quest.com. Figure 8-4 shows Funnel Web Profiler in action, running on http://www.foundstone.com/. This tool has a nice graphical user interface, which provides information such as content grouping, a Web site map, cross-references, a crawled statistics list view, and a tree view, among other things.
Figure 8-4. Funnel Web Profiler, showing scan statistics for http://www.foundstone.com/
After the Web site scan is completed, Funnel Web Profiler
aggregates the information gathered and presents various representations and
statistics about the site information. For example, clicking on the Web Map tab
shows a graphical layout of the Web site and the pages in it. Figure 8-5 shows the Web map of http://www.foundstone.com/.
Each Web resource is represented as a node, and the entire Web map shows how
each node is linked with other nodes. The Web map presents a visual
representation of the Web site and reveals the layout and linking of resources.
Figure 8-5. Funnel Web Profiler's Web map for http://www.foundstone.com/
The Web map contains a cluster of linked nodes, with each
node's starting point identified. The top right corner gives a thumbnail
representation of the full Web map. It also allows the user to zoom in for a
more detailed view.
If we click on the List tab, we get a tabular list of all the Web resources on http://www.foundstone.com/, along with other information such as the type of resource, its size in bytes, and when it was modified. Figure 8-6 displays the list view of http://www.foundstone.com/.
Figure 8-6. List view of Web resources on http://www.foundstone.com/
Step 1 Wrap-Up
Some other tools, which we haven't covered in detail but are worth mentioning, are Teleport Pro from Tennyson Maxwell (http://www.tenmax.com/) and Sam Spade (http://www.samspade.org/). Teleport Pro runs on the Windows platform and is primarily used for mirroring Web sites. Teleport Pro allows users to define individual projects for mirroring sites, and site mirroring is quite fast with its multithreaded mirroring engine. Sam Spade is also a Windows-based tool that allows basic site crawling and source sifting. We now have quite a lot of information for performing a thorough analysis. Let's see what we can do with all this information.
Crawlers and Redirection
Automated Web crawlers sometimes get thrown off track when they encounter page redirection, particularly redirection performed within client-side scripts, because crawlers don't interpret or execute JavaScript. The following JavaScript code snippet has a redirection directive:
<SCRIPT LANGUAGE="JavaScript">
location.replace("./index.php3");
</script>
It instructs the browser to request index.php3. It will do so only when the script is actually executed during page rendering, which most crawlers never do, so the link to index.php3 is missed. However, if the redirection is performed by techniques such as the Content-Location response header or an HTTP-EQUIV <META> tag, a crawler can detect it by inspecting the server's response. The following two examples illustrate redirection with the HTTP response header and the <META> tag.
Redirection by Content-Location
The code snippet for this procedure is:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Wed, 27 Mar 2002 08:13:01 GMT
Connection: Keep-Alive
Content-Location: http://www.example.com/example/index.asp
Set-Cookie: ASPSESSIONIDQQGQGIWC=LNDJBOLAIFDAKJDBNDINOABF; path=/
Cache-control: private
Here we sent a GET request to a server, asking for the default resource in its root directory. The Content-Location field in the response header points to http://www.example.com/example/index.asp, indicating where the requested content actually resides, and a crawler that inspects this field can pick up and follow that URL.
Redirection by HTTP-EQUIV
We can insert <META> tags of several types in the HEAD section of an HTML page. A <META> tag with an HTTP-EQUIV attribute embeds the equivalent of an HTTP response header directive in the HTML itself. The following tag, for example, refreshes the browser and sends it to http://www.yahoo.com/ after two seconds:
<META HTTP-EQUIV=Refresh CONTENT="2; url=http://www.yahoo.com/">
Smart crawlers implement methods to parse redirection of this kind and continue crawling the page being redirected to.
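As a minimal sketch of how a crawler might detect these two redirection mechanisms, consider the following Python fragment. The function name and the regular expression are ours and are meant only as an illustration, not as part of any tool described above.

import re
from urllib.request import urlopen

META_REFRESH = re.compile(
    r"<meta[^>]+http-equiv=[\"']?refresh[\"']?[^>]*"
    r"content=[\"'][^\"']*url=([^\"'>\s]+)",
    re.IGNORECASE)

def find_redirect(url):
    resp = urlopen(url)
    # Redirection by Content-Location shows up in the response header.
    location = resp.headers.get("Content-Location")
    if location:
        return location
    # Redirection by HTTP-EQUIV shows up in a <META> tag in the HTML body.
    body = resp.read().decode("utf-8", errors="replace")
    match = META_REFRESH.search(body)
    return match.group(1) if match else None

If find_redirect returns a URL, the crawler can add it to its queue and continue crawling from the redirected location.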