19.6 Lockset Analysis


Data races are hard to reveal with testing because of the nondeterministic interleaving of threads in a concurrent program. Statically exploring the execution space is computationally expensive and suffers from an approximate model of computation, as discussed in Chapter 8. Dynamic analysis can greatly amplify the sensitivity of testing to detect potential data races, avoiding the pessimistic inaccuracy of finite state verification while reducing the optimistic inaccuracy of testing.


Data races are commonly prevented by imposing a locking discipline, such as the rule that every variable shared between threads must be protected by a mutual exclusion lock. Dynamic lockset analysis reveals potential data races by detecting violations of the locking discipline.


Lockset analysis identifies the set of mutual exclusion locks held by threads when accessing each shared variable. Initially, each shared variable is associated with all available locks. When a thread accesses a shared variable v, lockset analysis intersects the current set of candidate locks for v with the locks held by that thread. The set of candidate locks that remains after executing a set of test cases is the set of locks that were always held by threads accessing that variable. An empty set of locks for a shared variable v indicates that no lock consistently protects v.
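The core of the technique is a single intersection step per access. Below is a minimal sketch in C, assuming locksets are represented as bit vectors and that every access to a shared variable is instrumented; NVARS, lockset_init(), and report_race() are hypothetical scaffolding rather than part of any particular tool.


#include <stdio.h>
#include <stdint.h>

#define NVARS 128                    /* hypothetical bound on tracked variables */

typedef uint32_t lockset_t;          /* bit i set: lock i is a candidate */
#define ALL_LOCKS ((lockset_t)~0u)

static lockset_t candidates[NVARS];  /* one candidate lockset per variable */

static void report_race(int v)       /* hypothetical reporting hook */
{
    printf("potential data race on variable %d\n", v);
}

void lockset_init(void)
{
    for (int v = 0; v < NVARS; v++)
        candidates[v] = ALL_LOCKS;   /* initially, every lock is a candidate */
}

/* Called on every access to shared variable v by a thread
   currently holding exactly the locks in held. */
void lockset_access(int v, lockset_t held)
{
    candidates[v] &= held;           /* intersect candidates with locks held */
    if (candidates[v] == 0)
        report_race(v);              /* no lock consistently protects v */
}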



The analysis of the two threads in Figure 19.4 starts with two locks associated with variable x. When thread A locks lck1 to access x, the lockset of x is intersected with the locks held by A. When thread B locks lck2 to access x, the intersection of the lockset of x with the current set of locks becomes empty, indicating that no lock consistently protects x.






Thread      Program trace     locks held     lockset(x)
--------------------------------------------------------
                              { }            {lck1, lck2}
thread A    lock(lck1)        {lck1}
            x=x+1;                           {lck1}
            unlock(lck1)      { }
thread B    lock(lck2)        {lck2}
            x=x+1;                           { }
            unlock(lck2)      { }


Figure 19.4: Threads accessing the same shared variable with different locks. (Adapted from Savage et al. [SBN+97])

This simple locking discipline is violated by some common programming practices: Shared variables are frequently initialized without holding a lock; shared variables written only during initialization can be safely accessed without locks; and multiple readers can be allowed in mutual exclusion with single writers. Lockset analysis can be extended to accommodate these idioms.


Initialization can be handled by delaying analysis until after initialization. There is no easy way of knowing when initialization is complete, but we can consider initialization complete when the variable is first accessed by a second thread.


Safe simultaneous reads of unprotected shared variables can also be handled quite simply, by reporting lockset violations only when the variable is written by more than one thread. Figure 19.5 shows the state transition diagram that enables lockset analysis and determines when races are reported. The initial virgin state indicates that the variable has not been referenced yet. The first access moves the variable to the exclusive state. Additional accesses by the same thread do not modify the variable state, since they are considered part of the initialization procedure. Accesses by other threads move the variable to the shared or shared-modified state, depending on the type of access. The variable's lockset is updated in both the shared and shared-modified states, but violations of the policy are reported only if they occur in the shared-modified state. In this way, read-only concurrent accesses do not produce warnings.






Figure 19.5: The state transition diagram for lockset analysis with multiple read accesses.
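The transition logic of Figure 19.5 can be sketched in C as follows. The state names follow the figure; the owner-thread bookkeeping and the return convention are simplifications of my own.


typedef enum { VIRGIN, EXCLUSIVE, SHARED, SHARED_MODIFIED } vstate_t;

/* Advance the state of one shared variable on an access by thread t,
   where owner is the thread that performed the first access.
   Returns nonzero when lockset violations should be reported. */
int update_state(vstate_t *state, int owner, int t, int is_write)
{
    switch (*state) {
    case VIRGIN:
        *state = EXCLUSIVE;               /* first access: initialization */
        break;
    case EXCLUSIVE:
        if (t != owner)                   /* a second thread ends initialization */
            *state = is_write ? SHARED_MODIFIED : SHARED;
        break;
    case SHARED:
        if (is_write)                     /* a write escalates the state */
            *state = SHARED_MODIFIED;
        break;
    case SHARED_MODIFIED:
        break;                            /* absorbing state */
    }
    /* the lockset is updated in SHARED and SHARED_MODIFIED, but violations
       are reported only in SHARED_MODIFIED, so read-only sharing is silent */
    return *state == SHARED_MODIFIED;
}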

To allow multiple readers to access a shared variable and still report writers' data races, we can simply distinguish the set of locks held during all accesses from the set of locks held during write accesses.














Tests Passing; Solution Doesn't Work


One of the more frustrating project situations is to find that tests are reported as passing but the solution under test still doesn't work for observers outside the test team. In these cases, you want to identify why the tests do not seem to find the same issues that other users do. Figures 9.15 through 9.18 are examples of this case.



High Bug Find Rate


Frequently you see a high test pass rate but still a large incoming bug rate (or worse, customers or beta users are reporting lots of bugs that testing seems to be missing).


This can occur for several reasons:


  • The tests might be too gentle for this stage of the solution. In early iterations, gentle tests are good, but as the solution matures, tests should exercise broader scenarios and integrations. These tests might be missing.

  • Tests might be stale or be testing the wrong functionality.

  • It might be time to switch test techniques. (See Chapter 7.)


Consider Figures 9.15, 9.16, and 9.17.




Figure 9.15.

On the Quality Indicators chart, the test pass rate is high, but active bugs are also high.












Figure 9.16.

Correspondingly, on the Bug Rates chart, active bugs remain high because the find rate stays high.











Figure 9.17.

Tests aren't finding the bugs. On this report, many of the bugs found have no corresponding test. This might be a sign that testing is looking elsewhere. And if you expect regression testing to keep these bugs from recurring undetected, you will need regression tests that you don't have yet.








Tests Are Stale


Tests do not necessarily evolve at the same rate as the code under test. This risk is present especially when tests are heavily automated. In this situation, you see high test pass rates with ongoing code churn and diminishing code coverage (see Figure 9.18).




Figure 9.18.

This Quality Indicators chart shows a high rate of code churn and a low rate of code coverage from testing, yet test pass rates remain high. This suggests that the tests being run are not exercising the new code. Don't be lulled by the high test pass rate; these tests are clearly not testing all the new development work.
















Identifying Database Servers



Identifying database servers is even trickier than identifying front-end and internal application servers. Front-end and internal application servers are easier to identify because both communicate in HTTP, and their signatures work their way into various elements of HTTP, such as the URL, HTTP headers, and cookies.



In contrast, database servers communicate with internal application servers in SQL. The only elements of a URL that get passed to the database interface are the values being exchanged by means of various input fields and URL parameters. Thus the only way to identify back-end databases through URLs is to force them to generate errors that are reflected by the application server and end up being sent back to the Web browser.



Let's consider two URLs:



  • http://www.example.com/public/index.php?ID=27

  • http://www.example.org/Profile.cfm?id=3&page=1



The first URL has a PHP script, index.php, which seems to make use of a database, as suggested by the URL parameter ID=27. The second URL is a ColdFusion application, which again seems to perform database queries based on the parameter id.



Forcing the database servers to return an error involves tampering with the values passed to the parameters in both cases. For the first URL, we substitute a nonnumeric ID value for "27." For the second URL, we prematurely truncate the query by replacing the value 3 with a single quotation mark. Figures 6-11 and 6-12, respectively, show how the errors appear.
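For instance, the tampered requests might look like this (the substituted values are arbitrary):

  • http://www.example.com/public/index.php?ID=abc

  • http://www.example.org/Profile.cfm?id='&page=1

The nonnumeric ID breaks the type expected by the PHP script's query, while the single quotation mark leaves the ColdFusion query's SQL string unbalanced; in both cases the resulting database error is reflected back through the application server to the browser.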



Figure 6-11. Forcing a database error with PHP




Figure 6-12. Forcing a database error with ColdFusion




We leave it to you to figure out how much damage is done by simply obtaining the types of information displayed in these error messages! Hint: The ColdFusion SQL Server error message contains enough information to launch a URL request that could possibly cause remote command execution with Administrator privileges on the database server of the Web application.





Chapter 8. Bottlenecks


You've created a killer application. You store all your data as UTF-8, you receive and process email like it was candy, your data is well filtered, and you use more external services than you can count. It's going great, your users love you, and the venture capitalists are circling. And then your application. Grinds. To. A. Halt.


As applications grow, weak spots reveal themselves. Techniques that worked for 10 requests a second start to fail for 100 requests a second. Databases with 10,000 rows work fine but start to choke when they reach 100,000 or 1,000,000. In an ideal world, we would find and solve all of these problems before they happen in production, but there's always a chance we'll miss something.


In this chapter we'll look at techniques for identifying and fixing bottlenecks in our architecture, both before they happen and when they start to bog our systems down. We'll talk about ways to increase the performance we can get out of our existing hardware so we're making the most of what we have, before we move on to talking about scaling in the next chapter.












Have We Tested the Changes?


Throughout the lifecycle, the application changes. Regressions are bugs in the software under test that did not appear in previous versions. Regression testing is the term for testing a new version of the software to find those bugs. Almost all types of tests can be used as regression tests, but in keeping with the tenet of "Important problems fast," your regression testing strategy must be very efficient.


Ideally, you should test the most recent changes first. Not only does this mitigate the risk of unforeseen side effects of the changes, but also, if you do find and report bugs, the recent changes are still fresh in everyone's memory.


One of the challenges in most test teams is identifying what exactly the changes are. Fortunately, the daily build report shows you exactly what changesets have made it into the build and what work items (scenarios, QoS, tasks, and bugs) have been resolved, thereby identifying the functionality that should be tested first (see Figure 7.13). Moreover, if you have reasonable build verification tests (BVTs), then you can check their results and code coverage.





Figure 7.13.

One of the many purposes of the daily build report is to focus testing activity on the newly built functionality. This list of work items resolved in the build is like an automatic release note, showing what new functionality needs to be tested.















How Fool's Gold Pans Out


In conclusion, we hold the following software truths to be self-evident (or evident after careful examination, anyway):


  • The success of a software project depends on not writing source code too early in the project.

  • You can't trade defect count for cost or schedule unless you're working on life-critical systems. Focus on defect count; cost and schedule will follow.

  • Silver bullets are hazardous to a project's health, though software industry history suggests that vendors will continue to claim otherwise.

  • Half-hearted process improvement is an especially damaging kind of silver bullet because it undermines future improvement attempts.

  • Despite its name, software isn't soft, unless it's made that way in the first place, and making it soft is expensive.


The software world has had 50 years to learn these lessons. The most successful people and organizations have taken them to heart. Learning to resist software's fool's gold consistently is one of the first steps the software industry needs to take on the road to creating a true profession of software engineering.








    Defining the Process Domain



    Domain processes are the interrelated activities that are specific to the functioning of the organization for which a software project is developed. The first step is to determine the domain in which the product will eventually be used. For all domain analysis, the critical point of view is that of the ultimate end-user. Valid analyses and evaluations of options must be done from this view. If there is no "real" customer available, then we must rely on a secondary "voice of the customer." This is usually someone from the marketing organization. Even when a matrix is viewed from the "developer's" perspective, the customer is ever present.



    The measure of quality for a software project is based on how well the software solves specific domain-related problems. For the software customer, the view is from his business domain, not that of a computer scientist or software engineer. For this reason, to deliver quality software, the project manager must understand the domain for which the software solves specific needs. For software product development, there are six classes of product domains:



    1. Consumer

    2. Business

    3. Industrial

    4. Real-time

    5. Really timely

    6. Scientific



    Individuals buy consumer products for personal use. This use could be at home, while traveling, or at work. The key here is that the consumer market is a mass market and is usually addressing a price-sensitive purchaser. Examples of consumer products utilizing software are cellular phones, automobiles, televisions, personal computers, and personal digital assistants.



    The majority of software products are targeted at the business product domain. Here the key is providing a cost-effective product to the business customer that will improve overall business profits. These products are usually expensive compared to consumer products and have maintenance, service, and installation services available as a necessary part of the product. Examples of these types of products are database tools such as Oracle™, enterprise resource planning products such as PeopleSoft™, development tool suites such as WebSphere™ and VisualAge™, and operating systems such as Solaris™.



    Industrial products are a specific subclass of business products. These are software tools that are purchased for the specific purposes of machine automation, factory automation and integration, and embedded control software. These are special-purpose and usually focused on a specific industry such as automotive, food processing, and semiconductor fabrication. This domain has the highest percentage of product customization and integration with legacy systems. Examples of these products are factory automation software from FactoryWorks™, embedded development systems from Wind River, and process modeling tools such as Hewlett-Packard's Vee™.



    Real-time products are used to control processes that have a defined and finite time budget. Real-time systems are used for data collection of events that last less than a microsecond. Real-time products control embedded medical devices such as pacemakers, where information must literally be processed between heartbeats. These products also work in the interface between the collection of analog data such as voice or music and its conversion to digital data that can be stored on a magnetic disk or CD-ROM. All real-time software is written specifically for the target hardware on which it executes.



    Really timely, as opposed to real-time, software products must execute within a time budget that does not irritate the end user. Examples of this are the software that runs ATM machines and does credit card verification while ordering over the Internet. Most really timely software products are a part of either business or industrial software products. They are broken out as a subclass because of the potential for causing customer irritability if they do not function effectively.



    Scientific products simulate real-world activities using mathematics. Real-world objects are turned into mathematical models. Executing formulas simulates the actions of the real-world objects. For example, some of an airplane's flight characteristics can be simulated in the computer. Rivers, lakes, and mountains can be simulated. Virtually any object with known characteristics can be modeled and simulated. Simulations use enormous calculations and often require supercomputer speed. As personal computers become more powerful, more laboratory experiments will be converted into computer models that can be interactively examined by students without the risk and cost of the actual experiments. Members of this product domain are Matlab™ for large mathematical formula development, Analytica™ for developing large-scale business models, and Expert Choice for developing large-scale decision support systems. Scientific software products are usually special-purpose tool kits for problem solving.



    The question now arises, "What about the government market?" For the six classes of software product domains as defined, all of them could be "government" customers. Where the separation of government from private customers comes into play is in the areas of business plans, legal concerns, contracting, and risk analysis.



    Four classes of product systems look at ways that the software product will be built and delivered from the developer's perspective. These four have different product development plans and life cycles. Although all product development is an iterative process, in the real business world there is usually an existing product portfolio. During the conceptual stage, the project manager will have worked on the product concept and selected a preliminary life cycle. That earlier work influences the selection of one or more of these product system classes:



    1. New software product

    2. Re-engineering of existing product

    3. Component integration

    4. Heroic maintenance



    A new software product starts with a set of requirements and moves through its development life cycle to delivery. It will use some sort of development tools and possibly object libraries, where appropriate. This is building a truly new software product for taking advantage of a new technology such as the Internet or using a new set of programming tools such as Java. It may also be a new market opportunity because of changes in government regulations such as telecommunications or banking deregulation.



    Re-engineering existing product is simply that. This product already exists in a form that may use outmoded software technology or be hosted on obsolete hardware. An example would be a DOS-based data collection system that would be re-engineered to run on Linux.



    Taking available commercial-off-the-shelf (COTS) products and integrating them into a product is component integration. An example of this type of product is taking an available embedded database tool along with a script-generation tool and a graphical user interface (GUI) generator to produce a new product that is used for integrating factory equipment into the overall manufacturing execution system.



    Heroic maintenance occurs when a company wants to wring the last bit of revenue out of an existing software product that has been passed over for re-engineering. Software product companies take great care in the management and timing of the releases of new capabilities within their product portfolios. When completely new products or massively re-engineered products are released, there is always a potential to cannibalize existing customer sales instead of having new customers buy the new product. Timing decisions may result in the delay of the newest product and the release of the old product with heroic work done to dress it up in new clothes. An example of this is once again our DOS system: Instead of re-engineering the entire system, the command-line interface was replaced with a pseudo-GUI. This is known in the software industry as "same dog, new collar!"



    The first matrix to be developed by the project manager involves identifying the product domain type, illustrated in Figure 5-3. This product domain type resides at the intersection of the six product domain classes and the product type classes. A software product can be defined to exist in multiple cells on this matrix. For example, suppose that there is a new, Web-based software product for registering personal DVD movies in a trading club in which points are earned to borrow (or a small fee is paid, if points are insufficient). This product would "live" in the consumer and really timely product domain classes as both a new software product and component integration. Although the concept of the product is new and new software would be developed, there are many libraries of components available for use. This example is represented in the matrix with an X in the relevant cell.



    Figure 5-3. Step 1: Identify the Product Domain Type



    Another example product is an enhancement to an existing factory-integration product to take information from past process steps and determine the optimum process for the product through the factory based on past production yield information and customer orders. We can tell immediately that this will be re-engineering an existing product, but some new software also will be developed. This product may touch four of the product domain classes: business, industrial, real-time, and scientific. Business could be touched because of accessing historic information on production yields. Industrial and real-time apply because it will be operating on factory automation equipment. The scientific piece comes in with the optimization algorithms necessary for determining the best individual product flow through the factory. This example is represented in the matrix with an O in the relevant cell.



    The third part of defining the process domain is the product component classes. This set of classes is also viewed from the perspective of the end-user. There are six members of the class, and the key question to ask is, "What does the customer expect to have delivered?" The software development project manager must discover whether the end-user has a product expectation. Six product component classes exist:



    1. Software

    2. Hardware

    3. People

    4. Database

    5. Documentation

    6. Procedures



    If a project is to develop a "pure" software product, the end-user has an expectation that he will get an installation set of media or an access key to a remote site to download the product. This is the way most consumer software is purchased: the only items received are the media or a digital file.



    Many products are turnkey: The developed software is hosted on hardware. Buying a cellular phone usually dictates the software running within the hardware. Although the software is a critical system component, the customer purchases the hardware.



    People are a critical part of many software products. Enterprise-wide software systems used for financial management, factory control, and product development may require consulting services to be "purchased" along with the software to aid in product installation, integration, and adoption into a specific environment.



    Database products, although most definitely software, are separated as a distinct class because of the expectations that accompany the purchase of this class of complex software. A database product is usually purchased as a separate, general-purpose tool kit to be used as an adjunct to all of a company's other information systems. More "software" products are delivered with an embedded database package within the product. It is important for the customer to realize that he is purchasing not only the "new" software, but also a database product.



    Documentation is almost always a part of the product. In some cases, it may be books and manuals purchased as a "shrink-wrapped," off-the-shelf software product. Many complex enterprise software products have third-party authors writing usage and tips books sold through commercial bookstores. If downloaded, the digital files may include a "readme" file and possibly complete soft copy documentation. Acquiring software from some sources such as SourceForge (www.SourceForge.com) may provide no documentation other than the software's source code.



    Procedures or business rules are a final component class. In situations in which the customer is buying systems and software used for decision support, equipment control, and component integration, it is important for the customer to understand the procedure deliverables. Usually the custom development of the business rules for an organization is done by either the organization itself or consultants hired from the software company. This can be a very gray area, and it is important that the project manager understand all of the project deliverables early in the development life cycle, especially those that can cause customer dissatisfaction and demonstrate a lack of quality.



    Now that the third set of domain classes has been defined, the project manager can fill out the last two matrices. The next one to complete is the identification of the critical product components, shown in Figure 5-4. This matrix is a table of the product component classes matched with the classes of product systems. This matrix provides us with the deliverables for the defined product based on whether it is new software, re-engineered software, a component integration product, or the heroic maintenance of a legacy product within the company's portfolio. Remember, the Web example is the X and the factory integration is the O.



    Figure 5-4. Step 2: Identify Critical Product Components



    For example, our Web-based software product for registering personal DVD movies was determined to be both a new software product and component integration. The critical product components for this product are software and documentation. It is Web-based and runs from a browser running on the customer's personal hardware. The customer will see no database, people, or procedures. The only documentation may be the instructions on the Web page itself.



    Our other example product, an enhancement to an existing factory integration product, involves re-engineering an existing product and some new software development. Based on how the product is to be marketed, the customer will see all the component classes except hardware. He will expect software to be delivered along with a field applications engineer to do the installation and acceptance testing within the customer's factory. The customer will also expect a database to keep the real-time product status and yield information along with the procedures for running the optimization algorithms. Documentation will be critical to both a company's engineers doing the installation and the customers after the product is accepted.



    The third matrix that the project manager produces to define the domain is to link the product domains to the delivered components. This matrix shown in Figure 5-5 is a table of the product component classes matched with the product domain classes. This matrix provides us with the deliverables for the defined product based on whether it is going to be installed into a consumer, business, industrial, real-time, really timely, or scientific domain.



    Figure 5-5. Step 3: Link Product Domains with Components



    Using our two examples, the Web-based software product for registering personal DVD movies in a trading club would "live" in the consumer and really timely product domain classes. The deliverables would be software and documentation. The second example, an enhancement to an existing factory integration product, touches four of the product domain classes: business, industrial, real-time, and scientific. The deliverable components are software, people, database, documentation, and procedures.



    Chapter 4, "Selecting Software Development Life Cycles," provided the descriptions of commonly used software development life cycles and the selection criteria for each. When compared to the overall company versus product life cycles, the software development life cycle is assumed within the acquisition phase of the product life cycle. Figure 5-1 shows this.



    A project manager must understand the relationship within his organization of software development within product life cycles. A typical product development life cycle begins with a development or acquisition phase during which the product is built or acquired. Figure 5-6 represents the product development phase. The project manager works hand in hand with the product manager to plan for the manufacturing of the product. This phase is the production ramp. Investment is made on the infrastructure for product manufacturing, and first products are built. After the production ramp, the software portion of the product is out of the hands of the software project manager, except for problem fixes.



    Figure 5-6. Software Development Life Cycle



    Figure 5-7 shows the entire product life cycle plotted in months versus thousands of dollars invested. The dollars on the left side of the graph and below the zero line are the estimated investment in the product. The dollars above the zero line are the estimated revenue dollars that the product will earn. This type of information is usually developed by marketing and is a critical part of the return on investment that the product will make.



    Figure 5-7. Software Product Life Cycle



    Finally, the relationship between the development and product life cycles is graphically represented in Figure 5-8. This relationship is critical to keep in mind as the project manager works through the product development process. In the product world, the product life cycle drives decisions and investment. Only the investment part of the software development life cycle is important to product managers planning product portfolios.



    Figure 5-8. Software Product and Development Life Cycles















      8.6. The Mach VM User-Space Interface


      Mach provides a powerful set of routines to user programs for manipulating task address spaces. Given the appropriate privileges, a task can perform operations on another task's address space identically to its own. All routines in the Mach VM user interface require the target task as an argument.[12] Therefore, the routines are uniform in how they are used, regardless of whether the target task is the caller's own task or another.

      [12] Specifically, the target task is a send right to the control port of the target task.


      Since user address spaces have a one-to-one mapping with user tasks, there are no explicit routines to create or destroy an address space. When the first task (the kernel task) is created, the map field of its task structure is set to refer to the kernel map (kernel_map), which is created by kmem_init() [osfmk/vm/vm_kern.c] during VM subsystem initialization. For subsequent tasks, a virtual address space is created with the task and destroyed along with the task. We saw in Chapter 6 that the task_create() call takes a prototype task and an address space inheritance indicator as arguments. The initial contents of a newly created task's address map are determined from these arguments. In particular, the inheritance properties of the prototype task's address map determine which portions, if any, are inherited by the child task.


      // osfmk/kern/task.c

      kern_return_t
      task_create_internal(task_t parent_task,
                           boolean_t inherit_memory,
                           task_t *child_task)
      {
          ...
          if (inherit_memory)
              new_task->map = vm_map_fork(parent_task->map);
          else
              new_task->map = vm_map_create(pmap_create(0),
                                            (vm_map_offset_t)(VM_MIN_ADDRESS),
                                            (vm_map_offset_t)(VM_MAX_ADDRESS),
                                            TRUE);
          ...
      }


      vm_map_fork() [osfmk/vm/vm_map.c] first calls pmap_create() [osfmk/ppc/pmap.c] to create a new physical map and calls vm_map_create() [osfmk/vm/vm_map.c] to create an empty VM map with the newly created physical map. The minimum and maximum offsets of the new VM map are taken from the parent's map. vm_map_fork() then iterates over the VM map entries of the parent's address map, examining the inheritance property of each. These properties determine whether the child inherits any memory ranges from the parent, and if so, how (fully shared or copied). Barring inherited memory ranges, a newly created address space is otherwise empty. Before the first thread executes in a task, the task's address space must be populated appropriately. In the case of a typical program, several parties, such as the kernel, the system library, and the dynamic link editor, determine what to map into the task's address space.


      Let us now look at several Mach VM routines that are available to user programs. The following is a summary of the functionality provided by these routines:


      • Creating an arbitrary memory range in a task, including allocation of new memory

      • Destroying an arbitrary memory range, including one that is unallocated, in a task

      • Reading, writing, and copying a memory range

      • Sharing a memory range

      • Setting protection, inheritance, and other attributes of a memory range

      • Preventing the pages in a memory range from being evicted by wiring them in physical memory


      Note that in this section, we discuss the new Mach VM API that was introduced in Mac OS X 10.4. The new API is essentially the same as the old API from the programmer's standpoint, with the following key differences.


      • Routine names have the mach_ prefix; for example, vm_allocate() becomes mach_vm_allocate().

      • Data types used in routines have been updated to support both 64-bit and 32-bit tasks. Consequently, the new API can be used with any task.

      • The new and old APIs are exported by different MIG subsystems[13]: mach_vm and vm_map, respectively. The corresponding header files are <mach/mach_vm.h> and <mach/vm_map.h>, respectively.

        [13] We will look at MIG subsystems in Chapter 9.




      8.6.1. mach_vm_map()


      mach_vm_map() is the fundamental user-visible Mach routine for establishing a new range of virtual memory in a task. It allows fine-grained specification of the properties of the virtual memory being mapped, which accounts for its large number of parameters.


      kern_return_t
      mach_vm_map(vm_map_t target_task,
                  mach_vm_address_t *address,
                  mach_vm_size_t size,
                  mach_vm_offset_t mask,
                  int flags,
                  mem_entry_name_port_t object,
                  memory_object_offset_t offset,
                  boolean_t copy,
                  vm_prot_t cur_protection,
                  vm_prot_t max_protection,
                  vm_inherit_t inheritance);


      Given the importance of mach_vm_map(), we will discuss each of its parameters. We will not do so for all Mach VM routines.


      target_task specifies the task whose address space will be used for mapping. A user program specifies the control port of the target task as this argument, and indeed, the type vm_map_t is equivalent to mach_port_t in user space. Mach's IPC mechanism translates a vm_map_t to a pointer to the corresponding VM map structure in the kernel. We will discuss this translation in Section 9.6.2.


      When mach_vm_map() returns successfully, it populates the address pointer with the location of the newly mapped memory in the target task's virtual address space; this is the behavior when the VM_FLAGS_ANYWHERE bit is set in the flags argument. If the bit is not set, then address contains a caller-specified virtual address for mach_vm_map() to use. If the memory cannot be mapped at that address (typically because there is not enough free contiguous virtual memory beginning at that location), mach_vm_map() will fail. If the user-specified address is not page-aligned, the kernel will truncate it.


      size specifies the amount of memory to be mapped in bytes. It should be an integral number of pages; otherwise, the kernel will round it up appropriately.


      The mask argument of mach_vm_map() specifies an alignment restriction on the kernel-chosen starting address. A bit that is set in mask will not be set in the address; that is, it will be masked out. For example, if mask is 0x00FF_FFFF, the kernel-chosen address will be aligned on a 16MB boundary (the lower 24 bits of the address will be zero). This feature of mach_vm_map() can be used to emulate a virtual page size that is larger than the physical page size.
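      For example, a call along the following lines would ask the kernel to map one page of zero-filled memory at a 16MB-aligned address of its choosing. This is an illustrative sketch only: the wrapper function is hypothetical, error handling is omitted, and the flag and protection values are typical choices rather than requirements.


      #include <mach/mach.h>
      #include <mach/mach_vm.h>

      void map_16mb_aligned_page(void) // hypothetical example function
      {
          mach_vm_address_t address = 0; // kernel chooses; returned here
          kern_return_t kr;

          kr = mach_vm_map(mach_task_self(),             // our own task
                           &address,
                           (mach_vm_size_t)4096,         // one page
                           (mach_vm_offset_t)0x00FFFFFF, // mask out low 24 bits
                           VM_FLAGS_ANYWHERE,
                           MEMORY_OBJECT_NULL,           // default pager: zero fill
                           0,                            // offset into the object
                           FALSE,                        // map directly, do not copy
                           VM_PROT_DEFAULT,              // current protection
                           VM_PROT_ALL,                  // maximum protection
                           VM_INHERIT_NONE);
          if (kr == KERN_SUCCESS) {
              // address is now aligned on a 16MB boundary
          }
      }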





      Caveat Regarding Offsets and Sizes


      As we noted in Section 8.4, Mach VM API routines operate on page-aligned addresses and memory sizes that are integral multiples of the page size. In general, if a user-specified address is not the beginning of a page, the kernel truncates it; that is, the actual address used will be the beginning of the page in which the original address resides. Similarly, if a size-specifying argument contains a byte count that is not an integral number of pages, the kernel rounds the size up appropriately. The following macros are used for truncating and rounding offsets and sizes (note that rounding 0xFFFF_FFFF pages yields the value 1):


      // osfmk/mach/ppc/vm_param.h

      #define PPC_PGBYTES 4096
      #define PAGE_SIZE   PPC_PGBYTES
      #define PAGE_MASK   (PAGE_SIZE - 1)

      // osfmk/vm/vm_map.h

      #define vm_map_trunc_page(x) ((vm_map_offset_t)(x) & \
                                    ~((signed)PAGE_MASK))
      #define vm_map_round_page(x) (((vm_map_offset_t)(x) + PAGE_MASK) & \
                                    ~((signed)PAGE_MASK))




      The following are examples of individual flags (bits) that can be set in the flags argument.


      • VM_FLAGS_FIXED This is used to specify that the new VM region should be allocated at the caller-provided address, if possible. VM_FLAGS_FIXED is defined to be the value 0x0. Therefore, logically OR'ing this does not change the value of flags. It merely represents the absence of VM_FLAGS_ANYWHERE.

      • VM_FLAGS_ANYWHERE This is used to specify that the new VM region can be allocated anywhere in the address space.

      • VM_FLAGS_PURGABLE This is used to specify that a purgable VM object should be created for the new VM region. A purgable object has the special property that it can be put into a nonvolatile state in which its pages become eligible for reclamation without being paged out to the backing store.

      • VM_FLAGS_OVERWRITE This, when used along with VM_FLAGS_FIXED, is used to specify that the new VM region can replace existing VM regions if necessary.


      object is the critical argument of mach_vm_map(). It must be a Mach port naming a memory object, which will provide backing for the range being mapped. As we saw earlier, a memory object represents a range of pages whose properties are controlled by a single pager. The kernel uses the memory object port to communicate with the pager. When mach_vm_map() is used to map some portion of a task's address space with a memory object, the latter's pages are accessible by the task. Note that the virtual address at which such a page range appears in a given task is task-dependent. However, a page has a fixed offset within its memory object; this offset is what a pager works with.


      The following are some examples of memory objects used with mach_vm_map().


      • When a Mach-O executable is loaded for execution by the execve() system call, the file is mapped into the address space of the target process through the vm_map() kernel function [osfmk/vm/vm_user.c], with the object argument referring to the vnode pager.

      • If the object argument is the null memory object (MEMORY_OBJECT_NULL), or equivalently, MACH_PORT_NULL, mach_vm_map() uses the default pager, which provides initially zero-filled memory backed by the system's swap space. In this case, mach_vm_map() is equivalent to mach_vm_allocate() (see Section 8.6.3), albeit with more options for configuring the memory's properties.

      • The object argument can be a named entry handle. A task creates a named entry from a given mapped portion of its address space by calling mach_make_memory_entry_64(), which returns a handle to the underlying VM object. The handle so obtained can be used as a shared memory object: The memory it represents can be mapped into another task's address space (or the same task's address space, for that matter). We will see an example of using mach_make_memory_entry_64() in Section 8.7.5.



      There is also mach_make_memory_entry(), which is a wrapper around mach_make_memory_entry_64(). The latter is not 64-bit-only, despite what its name suggests.




      The offset argument specifies the beginning of the memory in the memory object. Along with size, this argument specifies the range of the memory to be mapped in the target task.


      If copy is TRUE, the memory is copied (with copy-on-write optimization) from the memory object to the target task's virtual address space. This way, the target receives a private copy of the memory. Thereafter, any changes made by the task to that memory will not be sent to the pager. Conversely, the task will not see changes made by someone else. If copy is FALSE, the memory is directly mapped.


      cur_protection specifies the initial current protection for the memory. The following individual protection bits can be set in a Mach VM protection value: VM_PROT_READ, VM_PROT_WRITE, and VM_PROT_EXECUTE. The values VM_PROT_ALL and VM_PROT_NONE represent all bits set (maximum access) and no bits set (all access disallowed), respectively. max_protection specifies the maximum protection for the memory.


      Thus, each mapped region has a current protection and a maximum protection. Once the memory is mapped, the kernel will not allow the current to exceed the maximum. Both the current and maximum protection attributes can be subsequently changed using mach_vm_protect() (see Section 8.6.5), although note that the maximum protection can only be lowered; that is, made more restrictive.


      inheritance specifies the mapped memory's initial inheritance attribute, which determines how the memory is inherited by a child task during a fork() operation. It can take the following values.


      • VM_INHERIT_NONE The range is undefined ("empty") in the child task.

      • VM_INHERIT_SHARE The range is shared between the parent and the child, allowing each to freely read from and write to the memory.

      • VM_INHERIT_COPY The range is copied (with copy-on-write and other, if any, optimizations) from the parent into the child.


      The inheritance attribute can be later changed using mach_vm_inherit() (see Section 8.6.6).




      8.6.2. mach_vm_remap()


      mach_vm_remap() takes already mapped memory in a source task and maps it in the target task's address space, with allowance for specifying the new mapping's properties (as in the case of mach_vm_map()). You can achieve a similar effect by creating a named entry from a mapped range and then remapping it through mach_vm_map(). In that sense, mach_vm_remap() can be thought of as a "turnkey" routine for memory sharing. Note that the source and target tasks could be the same task.


      kern_return_t
      mach_vm_remap(vm_map_t target_task,
                    mach_vm_address_t *target_address,
                    mach_vm_size_t size,
                    mach_vm_offset_t mask,
                    boolean_t anywhere,
                    vm_map_t src_task,
                    mach_vm_address_t src_address,
                    boolean_t copy,
                    vm_prot_t *cur_protection,
                    vm_prot_t *max_protection,
                    vm_inherit_t inheritance);


      The cur_protection and max_protection arguments return the protection attributes for the mapped region. If one or more subranges have differing protection attributes, the returned attributes are those of the range with the most restrictive protection.




      8.6.3. mach_vm_allocate()


      mach_vm_allocate() allocates a region of virtual memory in the target task. As noted earlier, its effect is similar to calling mach_vm_map() with a null memory object. It returns initially zero-filled, page-aligned memory. Like mach_vm_map(), it allows the caller to provide a specific address at which to allocate.


      kern_return_t
      mach_vm_allocate(vm_map_t target_task,
                       mach_vm_address_t *address,
                       mach_vm_size_t size,
                       int flags);
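      As a minimal usage sketch (an example of mine, not from the kernel sources), the following allocates a page anywhere in the caller's own address space and then invalidates it with mach_vm_deallocate(), described next:


      #include <mach/mach.h>
      #include <mach/mach_vm.h>

      void allocate_and_free_page(void) // hypothetical example function
      {
          mach_vm_address_t addr = 0;
          mach_vm_size_t size = 4096; // rounded up to a whole page anyway
          kern_return_t kr;

          kr = mach_vm_allocate(mach_task_self(), &addr, size, VM_FLAGS_ANYWHERE);
          if (kr == KERN_SUCCESS) {
              // memory at addr is zero-filled and page-aligned
              mach_vm_deallocate(mach_task_self(), addr, size);
          }
      }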




      8.6.4. mach_vm_deallocate()


      mach_vm_deallocate() invalidates the given range of virtual memory in the given address space.


      kern_return_t
      mach_vm_deallocate(vm_map_t target_task,
                         mach_vm_address_t address,
                         mach_vm_size_t size);


      It is important to realize that as used here, the terms allocate and deallocate subtly differ from how they are used in the context of a typical memory allocator (such as malloc(3)). A memory allocator usually tracks allocated memory: when you free allocated memory, the allocator will check that you are not freeing memory you did not allocate, or that you are not double-freeing memory. In contrast, mach_vm_deallocate() simply removes the given range, whether currently mapped or not, from the given address space.


      When a task receives out-of-line memory in an IPC message, it should use mach_vm_deallocate() or vm_deallocate() to free that memory when it is not needed. Several Mach routines dynamically (and implicitly) allocate memory in the address space of the caller. Typical examples of such routines are those that populate variable-length arrays, such as processor_set_tasks() and task_threads().
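      For example, a caller of task_threads() might clean up as follows. This is a sketch of the common idiom: the returned array is out-of-line memory in the caller's address space, and each element is a send right that should also be released.


      #include <stdint.h>
      #include <mach/mach.h>
      #include <mach/mach_vm.h>

      void list_and_release_threads(void) // hypothetical example function
      {
          thread_act_array_t threads;
          mach_msg_type_number_t count, i;

          if (task_threads(mach_task_self(), &threads, &count) != KERN_SUCCESS)
              return;

          for (i = 0; i < count; i++) // release each thread's send right
              mach_port_deallocate(mach_task_self(), threads[i]);

          // the array itself is out-of-line memory allocated by the kernel
          mach_vm_deallocate(mach_task_self(),
                             (mach_vm_address_t)(uintptr_t)threads,
                             count * sizeof(*threads));
      }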




      8.6.5. mach_vm_protect()


      mach_vm_protect() sets the protection attribute for the given memory range in the given address space. The possible protection values are the same as those we saw in Section 8.6.1. If the set_maximum Boolean argument is TRUE, new_protection specifies the maximum protection; otherwise, it specifies the current protection. If the new maximum protection is more restrictive than the current protection, the latter is lowered to match the new maximum.


      kern_return_t
      mach_vm_protect(vm_map_t target_task,
                      mach_vm_address_t address,
                      mach_vm_size_t size,
                      boolean_t set_maximum,
                      vm_prot_t new_protection);
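      For example, lowering the current protection of a previously allocated page to read-only might look like this (a fragment of mine, reusing addr from the earlier allocation sketch):


      kern_return_t kr;

      kr = mach_vm_protect(mach_task_self(),
                           addr,                 // e.g., from mach_vm_allocate()
                           (mach_vm_size_t)4096, // one page
                           FALSE,                // set current, not maximum
                           VM_PROT_READ);        // reads allowed, writes fault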




      8.6.6. mach_vm_inherit()


      mach_vm_inherit() sets the inheritance attribute for the given memory range in the given address space. The possible inheritance values are the same as those we saw in Section 8.6.1.


      kern_return_t
      mach_vm_inherit(vm_map_t target_task,
                      mach_vm_address_t address,
                      mach_vm_size_t size,
                      vm_inherit_t new_inheritance);





      8.6.7. mach_vm_read()


      mach_vm_read() transfers data from the given memory range in the given address space to dynamically allocated memory in the calling task. In other words, unlike most Mach VM API routines, mach_vm_read() implicitly uses the current address space as its destination. The source memory region must be mapped in the source address space. As with memory allocated dynamically in other contexts, it is the caller's responsibility to invalidate it when appropriate.


      kern_return_t
      mach_vm_read(vm_map_t target_task,
                   mach_vm_address_t address,
                   mach_vm_size_t size,
                   vm_offset_t *data,
                   mach_msg_type_number_t *data_count);


      The mach_vm_read_overwrite() variant reads into a caller-specified buffer. Yet another variant, mach_vm_read_list(), reads a list of memory ranges from the given map. The list of ranges is an array of mach_vm_read_entry structures [<mach/vm_region.h>]. The maximum size of this array is VM_MAP_ENTRY_MAX (256). Note that for each source address, memory is copied to the same address in the calling task.


      kern_return_t
      mach_vm_read_overwrite(vm_map_t target_task,
                             mach_vm_address_t address,
                             mach_vm_size_t size,
                             mach_vm_address_t data,
                             mach_vm_size_t *out_size);

      kern_return_t
      mach_vm_read_list(vm_map_t target_task,
                        mach_vm_read_entry_t data_list,
                        natural_t data_count);

      struct mach_vm_read_entry {
          mach_vm_address_t address;
          mach_vm_size_t size;
      };

      typedef struct mach_vm_read_entry mach_vm_read_entry_t[VM_MAP_ENTRY_MAX];
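      A sketch of reading a mapped page of the caller's own task into a kernel-allocated buffer, and then releasing that buffer as described in Section 8.6.4 (the example function is hypothetical):


      #include <mach/mach.h>
      #include <mach/mach_vm.h>

      void copy_range(mach_vm_address_t addr) // addr: some mapped page
      {
          vm_offset_t data;
          mach_msg_type_number_t data_count;
          kern_return_t kr;

          kr = mach_vm_read(mach_task_self(), addr, 4096, &data, &data_count);
          if (kr == KERN_SUCCESS) {
              // data points at a dynamically allocated copy of the range
              mach_vm_deallocate(mach_task_self(),
                                 (mach_vm_address_t)data, data_count);
          }
      }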





      8.6.8. mach_vm_write()


      mach_vm_write() copies data from a caller-specified buffer to the given memory region in the target address space. The destination memory range must already be allocated and writable from the caller's standpoint; in that sense, this is more precisely an overwrite call.


      kern_return_t
      mach_vm_write(vm_map_t target_task,
                    mach_vm_address_t address,
                    vm_offset_t data,
                    mach_msg_type_number_t data_count);




      8.6.9. mach_vm_copy()


      mach_vm_copy() copies one memory region to another within the same task. The source and destination regions must both already be allocated. Their protection attributes must permit reading and writing, respectively. Moreover, the two regions can overlap. mach_vm_copy() has the same effect as a mach_vm_read() followed by a mach_vm_write().


      kern_return_t
      mach_vm_copy(vm_map_t target_task,
                   mach_vm_address_t source_address,
                   mach_vm_size_t count,
                   mach_vm_address_t dest_address);




      Comparing Mach VM Routines with Mach IPC Routines for Memory Transfer


      Since large amounts of data (theoretically, entire address spaces) can be transferred through Mach IPC, it is interesting to note the difference between Mach VM routines and Mach IPC messaging when sending data from one task to another. In the case of a Mach VM routine such as mach_vm_copy() or mach_vm_write(), the calling task must have send rights to the control port of the target task. However, the target task does not have to participate in the transfer; it can be passive. In fact, it could even be suspended. In the case of Mach IPC, the sender must have send rights to a port that the receiving task has receive rights to. Additionally, the receiving task must actively receive the message. Moreover, Mach VM routines allow memory to be copied at a specific destination address in the target address space.







      8.6.10. mach_vm_wire()


      mach_vm_wire() alters the given memory region's pageability: If the wired_access argument is one of VM_PROT_READ, VM_PROT_WRITE, VM_PROT_EXECUTE, or a combination thereof, the region's pages are protected accordingly and wired in physical memory. If wired_access is VM_PROT_NONE, the pages are unwired. Since wiring pages is a privileged operation, mach_vm_wire() requires send rights to the host's control port. The host_get_host_priv_port() routine, which itself requires superuser privileges, can be used to acquire these rights.


      kern_return_t
      mach_vm_wire(host_priv_t host,
                   vm_map_t target_task,
                   mach_vm_address_t address,
                   mach_vm_size_t size,
                   vm_prot_t wired_access);



      Unlike other Mach VM routines discussed so far, mach_vm_wire() is exported by the host_priv MIG subsystem.






      8.6.11. mach_vm_behavior_set()


      mach_vm_behavior_set() specifies the expected page reference behavior (the access pattern) for the given memory region. This information is used during page-fault handling to determine which pages, if any, to deactivate based on the memory access pattern.


      kern_return_t
      mach_vm_behavior_set(vm_map_t target_task,
                           mach_vm_address_t address,
                           mach_vm_size_t size,
                           vm_behavior_t behavior);


      The behavior argument can take the following values:


      • VM_BEHAVIOR_DEFAULT the default behavior for all nascent memory

      • VM_BEHAVIOR_RANDOM random access pattern

      • VM_BEHAVIOR_SEQUENTIAL sequential access (forward)

      • VM_BEHAVIOR_RSEQNTL sequential access (reverse)

      • VM_BEHAVIOR_WILLNEED will need these pages in the near future

      • VM_BEHAVIOR_DONTNEED will not need these pages in the near future


      The kernel maps the VM_BEHAVIOR_WILLNEED and VM_BEHAVIOR_DONTNEED reference behavior specifications to the default behavior, which assumes a strong locality of reference.



      mach_vm_behavior_set() is analogous to the madvise() system call. In fact, the Mac OS X madvise() implementation is a simple wrapper around the in-kernel equivalent of mach_vm_behavior_set().




      Since the expected reference behavior is applied to a memory range, the behavior setting is recorded as part of the VM map entry structure (struct vm_map_entry [osfmk/vm/vm_map.h]). Upon a page fault, the fault handler uses the behavior setting to determine which, if any, of the active pages are uninteresting enough to be deactivated. This mechanism also uses the sequential and last_alloc fields of the VM object structure (struct vm_object [osfmk/vm/vm_object.h]). The sequential field records the sequential access size, whereas last_alloc records the last allocation offset in that object.


      If the reference behavior is VM_BEHAVIOR_RANDOM, the sequential access size is always kept as the page size, and no page is deactivated.


      If the behavior is VM_BEHAVIOR_SEQUENTIAL, the page-fault handler examines the current and last allocation offsets to see if the access pattern is indeed sequential. If so, the sequential field is incremented by a page size, and the immediate last page is deactivated. If, however, the access is not sequential, the fault handler resets its recording by setting the sequential field to the page size. No page is deactivated in this case. The handling of VM_BEHAVIOR_RSEQNTL is similar, except the notion of sequential is reversed.


      In the case of VM_BEHAVIOR_DEFAULT, the handler attempts to establish an access pattern based on the current and last offsets. If they are not consecutive (in units of a page), the access is deemed random, and no page is deactivated. If they are consecutive, whether increasing or decreasing, the handler increments the sequential field by a page size. If the pattern continues and the recorded sequential access size exceeds MAX_UPL_TRANSFER (256) pages, the page that is MAX_UPL_TRANSFER pages away (behind or forward, depending on the direction) is deactivated. While the recorded sequential access size remains less than MAX_UPL_TRANSFER, no page is deactivated. If, however, the pattern is broken, the sequential access size is reset to the page size.



      Page deactivation involves calling vm_page_deactivate() [osfmk/vm/vm_resident.c], which returns the page to the inactive queue.







      8.6.12. mach_vm_msync()


      mach_vm_msync() synchronizes the given memory range with its pager.


      kern_return_t
      mach_vm_msync(vm_map_t target_task,
                    mach_vm_address_t address,
                    mach_vm_size_t size,
                    vm_sync_t sync_flags);


      The sync_flags argument is the bitwise OR of synchronization bits defined in <mach/vm_sync.h>. The following are examples of valid combinations.


      • VM_SYNC_INVALIDATE flushes pages in the given memory range, returning only precious pages to the pager and discarding dirty pages.

      • If VM_SYNC_ASYNCHRONOUS is specified along with VM_SYNC_INVALIDATE, both dirty and precious pages are returned to the pager, but the call returns without waiting for the pages to reach the backing store.

      • VM_SYNC_SYNCHRONOUS is similar to VM_SYNC_ASYNCHRONOUS, but the call does not return until the pages reach the backing store.

      • When either VM_SYNC_ASYNCHRONOUS or VM_SYNC_SYNCHRONOUS is specified by itself, both dirty and precious pages are returned to the pager without flushing any pages.

      • If VM_SYNC_CONTIGUOUS is specified, the call returns KERN_INVALID_ADDRESS if the specified memory range is not mapped in its entirety; that is, if the range has a hole in it. Nevertheless, the call still completes its work as it would have if VM_SYNC_CONTIGUOUS were not specified.




      Precious Pages


      A precious page is used when only one copy of its data is desired. There may not be a copy of a precious page's data both in the backing store and in memory. When a pager provides a precious page to the kernel, it means that the pager has not necessarily retained its own copy. When the kernel must evict such pages, they must be returned to the pager, even if they had not been modified while resident.




      mach_vm_msync() is analogous to the msync() system call. In fact, the msync() implementation uses the in-kernel equivalent of mach_vm_msync(). POSIX.1 requires msync() to return an ENOMEM error if there are holes in the region being synchronized. Therefore, msync() always sets the VM_SYNC_CONTIGUOUS bit before calling the in-kernel version of mach_vm_msync(). If the latter returns KERN_INVALID_ADDRESS, msync() translates the error to ENOMEM.
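
      As a concrete illustration, here is a minimal user-space sketch of the calling convention. This assumes macOS headers and is our own example, not from the text; the anonymous test allocation may give the pager little or nothing to push out.

      #include <mach/mach.h>
      #include <mach/mach_vm.h>
      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
          mach_vm_address_t addr = 0;
          mach_vm_size_t    size = 4096;

          /* Allocate a page anywhere in our own address space. */
          kern_return_t kr = mach_vm_allocate(mach_task_self(), &addr, size,
                                              VM_FLAGS_ANYWHERE);
          if (kr != KERN_SUCCESS)
              return 1;

          *(volatile char *)(uintptr_t)addr = 1;   /* dirty the page */

          /* Synchronous: do not return until pages reach the backing store. */
          kr = mach_vm_msync(mach_task_self(), addr, size, VM_SYNC_SYNCHRONOUS);
          printf("mach_vm_msync returned %d\n", kr);

          mach_vm_deallocate(mach_task_self(), addr, size);
          return 0;
      }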




      8.6.13. Statistics


      System-wide VM statistics can be retrieved using the HOST_VM_INFO flavor of the host_statistics() Mach routine. The vm_stat command-line program also displays these statistics.


      $ vm_stat
      Mach Virtual Memory Statistics: (page size of 4096 bytes)
      Pages free: 144269.
      Pages active: 189526.
      Pages inactive: 392812.
      Pages wired down: 59825.
      "Translation faults": 54697978.
      Pages copy-on-write: 800440.
      Pages zero filled: 38386710.
      Pages reactivated: 160297.
      Pageins: 91327.
      Pageouts: 4335.
      Object cache: 205675 hits of 378912 lookups (54% hit rate)
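
      Programmatically, the same counters can be read with host_statistics(). The following minimal sketch prints a few of the fields shown in the vm_stat output above.

      #include <mach/mach.h>
      #include <stdio.h>

      int main(void)
      {
          vm_statistics_data_t   vmstat;
          mach_msg_type_number_t count = HOST_VM_INFO_COUNT;

          kern_return_t kr = host_statistics(mach_host_self(), HOST_VM_INFO,
                                             (host_info_t)&vmstat, &count);
          if (kr != KERN_SUCCESS)
              return 1;

          printf("Pages free:       %u\n", vmstat.free_count);
          printf("Pages active:     %u\n", vmstat.active_count);
          printf("Pages inactive:   %u\n", vmstat.inactive_count);
          printf("Pages wired down: %u\n", vmstat.wire_count);
          return 0;
      }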


      mach_vm_region() returns information about a memory region in the given address space. The address argument specifies the location at which mach_vm_region() starts to look for a valid region. The outbound values of address and size specify the range of the region actually found. The flavor argument specifies the type of information to retrieve, with info pointing to a structure appropriate for the flavor being requested. For example, the VM_REGION_BASIC_INFO flavor is used with a vm_region_basic_info structure. The count argument specifies the size of the input buffer in units of natural_t. For example, to retrieve information for the VM_REGION_BASIC_INFO flavor, the size of the input buffer must be at least VM_REGION_BASIC_INFO_COUNT. The outbound value of count specifies the size of the data filled in by the call.


      kern_return_t
      mach_vm_region(vm_map_t               target_task,
                     mach_vm_address_t      *address,
                     mach_vm_size_t         *size,
                     vm_region_flavor_t     flavor,
                     vm_region_info_t       info,
                     mach_msg_type_number_t *info_count,
                     mach_port_t            *object_name);


      Note that a task should be suspended before mach_vm_region() is called on it; otherwise, the results obtained may not provide a true picture of the task's VM situation.


      The mach_vm_region_recurse() variant recurses into submap chains in the given task's address map. The vmmap command-line program uses both variants to retrieve information about the virtual memory regions allocated in the given process.
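
      Here is a minimal sketch of calling mach_vm_region() on the current task with the VM_REGION_BASIC_INFO flavor. Note that we do not suspend our own task, so the snapshot is only advisory.

      #include <mach/mach.h>
      #include <mach/mach_vm.h>
      #include <stdio.h>

      int main(void)
      {
          mach_vm_address_t address = 0;   /* start searching from the bottom */
          mach_vm_size_t    size    = 0;
          vm_region_basic_info_data_t info;
          mach_msg_type_number_t count = VM_REGION_BASIC_INFO_COUNT;
          mach_port_t object_name = MACH_PORT_NULL;

          kern_return_t kr = mach_vm_region(mach_task_self(), &address, &size,
                                            VM_REGION_BASIC_INFO,
                                            (vm_region_info_t)&info,
                                            &count, &object_name);
          if (kr == KERN_SUCCESS)
              printf("first region: address %#llx, size %#llx, protection %d\n",
                     (unsigned long long)address, (unsigned long long)size,
                     info.protection);
          return 0;
      }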













      Section 2: Word Families





      Section 2 of Vocabulary Basics groups words into families: words that are related because they express related ideas. As you continue to build familiarity with a set of words for future understanding and use, you will examine relationships among words, using techniques that can be applied to many other concepts and word families as you develop your verbal abilities.



      Three ways to think about words are covered in these lessons:





      • contrasts



      • categories



      • shades of meaning





      In Lesson 7, you will contrast words and groups of words about communicating. Words about sending and receiving, positive and negative communications, and asking and answering are among those covered. The importance of communications in today's society cannot be overemphasized. Technological developments have made the communication of messages faster and wider-reaching than ever before.



      In Lesson 8, you will examine relationships among words that describe size and amount, categorize them, and put them "in order." You will learn that some words meaning "big," "small," "many," or "few" are usually used in specific situations. We call a small sum of money paltry, but we wouldn't call a small animal or person or house paltry.



      In Lesson 9, you will compare words that describe importance, which are not easily put in order and are often used only in certain contexts. Importance is not a concrete entity that can be weighed and measured. Consequently, words related to this concept must express the "weight" of importance through differences in meaning that may be subtle or slight, but significant. You will examine words related to degrees of importance and understand differences in shades of meaning.





      WHAT DO YOU KNOW?



      TRUE OR FALSE?



      Circle T or F to indicate whether each of the following statements is true or false. On a separate sheet of paper, state why you chose that answer.



















      Q1: (T / F) All words are neutral. They become positive or negative only by virtue of the way they are used.

      A1: F

      Q2: (T / F) Imply refers to what a speaker does. Infer refers to what a listener does.

      A2: T

      Q3: (T / F) Excessive, exorbitant, extravagant, and immoderate all mean "too many" or "too much."

      A3: T

      Q4: (T / F) The words some, several, and numerous are in order of quantity, from fewest to most.

      A4: F







      MULTIPLE CHOICE



      Circle the letter next to the answer that best completes the sentence.



      1: Because the speaker ________________ from his topic, the audience had difficulty following his logic.

      (a) delineated   (b) digressed   (c) revealed   (d) dictated

      A1: b

      2: The ________________ child was smaller than his playmates, but was healthy and happy.

      (a) miniature   (b) heavy   (c) diminutive   (d) puny

      A2: c

      3: The ________________ feature of the product is its ease of use.

      (a) urgent   (b) eminent   (c) salient   (d) imminent

      A3: c




























        Step 1: Crawling the Web Site



        Crawling a Web site begins with the first page and involves
        following every link found. For the mathematically inclined, crawling a
        site is equivalent to performing a breadth-first search on a connected
        directed graph. A crawler is a program that automates this process. Think
        of it as a browser that can click on each link of the Web page by itself
        and traverse all the pages in the Web site. The crawler sends an HTTP
        "GET" request to a page, parses the HTML received, extracts all the
        hyperlinks from it, and recursively performs the same action on each link.
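
        To make that loop concrete, here is a minimal breadth-first crawler
        sketch. The use of libcurl for the HTTP "GET" requests is our own choice
        rather than something this chapter prescribes, the href extraction is a
        naive string scan instead of real HTML parsing, and the starting URL is a
        placeholder.

        #include <curl/curl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define MAX_URLS 64

        static char *queue[MAX_URLS];
        static int head = 0, tail = 0;

        /* Enqueue a URL unless it has been seen already (the "visited" set). */
        static void enqueue(const char *url) {
            for (int i = 0; i < tail; i++)
                if (strcmp(queue[i], url) == 0) return;
            if (tail < MAX_URLS) queue[tail++] = strdup(url);
        }

        struct buf { char *data; size_t len; };

        /* libcurl write callback: accumulate the response body in memory. */
        static size_t on_body(char *p, size_t sz, size_t n, void *userp) {
            struct buf *b = userp;
            size_t add = sz * n;
            b->data = realloc(b->data, b->len + add + 1);
            memcpy(b->data + b->len, p, add);
            b->len += add;
            b->data[b->len] = '\0';
            return add;
        }

        int main(void) {
            curl_global_init(CURL_GLOBAL_DEFAULT);
            enqueue("http://www.example.com/");      /* placeholder start page */

            while (head < tail) {                    /* breadth-first traversal */
                struct buf b = { calloc(1, 1), 0 };
                CURL *h = curl_easy_init();
                curl_easy_setopt(h, CURLOPT_URL, queue[head]);
                curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, on_body);
                curl_easy_setopt(h, CURLOPT_WRITEDATA, &b);
                if (curl_easy_perform(h) == CURLE_OK) {
                    printf("crawled %s (%zu bytes)\n", queue[head], b.len);
                    /* Naive link extraction: absolute href="..." values only. */
                    for (char *p = b.data; (p = strstr(p, "href=\"http")); p++) {
                        char *q = strchr(p + 6, '"');
                        if (!q) break;
                        *q = '\0';
                        enqueue(p + 6);
                        *q = '"';
                    }
                }
                curl_easy_cleanup(h);
                free(b.data);
                head++;
            }
            curl_global_cleanup();
            return 0;
        }

        A production crawler would add URL canonicalization, relative-link
        resolution, and politeness controls, but the fetch, parse, extract, and
        enqueue cycle above is the essence of the technique.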



        Crawlers can be quite sophisticated. Instead of simply
        following links, they can also mirror an entire Web site on the local hard
        drive and extract other elements such as client-side scripts and
        HTML comments. We discussed some of these techniques in Chapter 7.



        Crawling a Site Manually



        If a Web site doesn't contain many pages, you can follow the
        hyperlinks by simply using a browser and making a list of them. This technique
        is more accurate than using a crawler to gather links. One of the main
        drawbacks of automated crawling is that crawlers can't interpret client-side
        scripts, such as JavaScript, and the hyperlinks they contain.



        A Closer Look at the HTTP Response Header



        Each HTTP response has two parts: the HTTP response
        header and the data content. Usually, data content is presented in HTML, but it
        also can be a byte block representing a GIF image or another object. Crawlers
        rely heavily on HTTP response headers while crawling a site. Consider this HTTP
        response header:



        HTTP/1.1 200 OK
        Server: Microsoft-IIS/5.0
        Date: Sat, 16 Mar 2002 18:08:35 GMT
        Connection: Keep-Alive
        Content-Length: 496
        Content-Type: text/html
        Set-Cookie: ASPSESSIONIDQQGGGRHQ=DPHDNEMBEEHDNFMOPNPKIPHN; path=/
        Cache-control: private


        The first item to be inspected in the HTTP response header is the
        HTTP response code, which appears in the first line of the HTTP response
        header. In the preceding code snippet, the HTTP response code is
        "200," which signifies that the HTTP request was processed properly
        and that the appropriate response was generated. If the response code indicates
        an error, the error occurred when requesting the resource. A "404"
        response code indicates that the resource doesn't exist. A "403"
        response code signifies that the resource is blocked from requests, but
        nonetheless is present. Other HTTP response codes indicate that the resource
        may have relocated or that some extra privileges are required to request that
        resource. A crawler has to pay attention to these response codes and determine
        whether to crawl farther.



        The next bit of important information returned in the HTTP
        response header, from a crawler's perspective, is the Content-Type field. It
        indicates the type of resource represented by the data in the HTTP content
        that follows the HTTP response header. Again, the crawler has to pay attention
        to the Content-Type: it makes no sense for a crawler to attempt to extract
        links from a GIF file, so crawlers usually pay attention only to
        "text/html" content.



        Some Popular Tools for Site Linkage Analysis



        Several tools, both free and commercial, are available for crawling
        Web applications. We describe a few of the tools and discuss some of their key
        features in this section.



        GNU wget


        GNU wget is a simple
        command-line crawler and is available along with source code on http://www.wget.org/.
        Although wget was primarily intended for Unix
        platforms, a Windows binary is also available. Recall that we took a look at
        wget in Chapter 7, where we used it for mirroring a
        Web site locally and searching for patterns within the retrieved HTML data for
        source sifting. The advantages offered by wget
        are that it is simple to use, works from the command line, and is available on
        both Unix and Windows platforms. It also is very easy to use in shell scripts
        or batch files to further automate linkage analysis tasks.



        Because wget offers the
        ability to mirror Web site content, we can run several commands or scripts on
        the mirrored content for various types of analysis.
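
        For example, a mirroring run might look like the following, where -m
        enables recursive mirroring and -k rewrites hyperlinks in the saved pages
        so that they work from the local copy (the target URL is a placeholder):

        $ wget -m -k http://www.example.com/

        The mirrored tree can then be searched with standard tools such as grep
        as part of source sifting.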



        BlackWidow from SoftByteLabs


        SoftByteLabs' BlackWidow is a very fast Web site crawler for
        the Windows platform. The crawling engine is multithreaded and retrieves Web pages
        in parallel. BlackWidow also performs some basic source sifting techniques such
        as those discussed in Chapter 7. Figure 8-2 shows BlackWidow crawling http://www.foundstone.com/.
        On its tabs, you can view the progress of the crawling, thread by thread.



        Figure 8-2. BlackWidow crawling one site with multiple threads




        Figure 8-3 shows the site structure in a
        collapsible tree format. It helps us analyze how resources are grouped on the
        Web site. The BlackWidow GUI has other tabs that show e-mail addresses that are
        present on the pages, external links, and errors in retrieving links, if any.
        As with GNU wget, BlackWidow also can be used
        to mirror a Web site where URLs occurring within hyperlinks are rewritten for
        accessibility from the local file system.



        Figure 8-3. Structure of http://www.acme.com/




        Funnel Web Profiler from Quest Software


        Funnel Web Profiler from Quest Software can perform an
        exhaustive analysis of a Web site. Quest Software has a trial version of Funnel
        Web Profiler available for download from http://www.quest.com. Figure 8-4
        shows Funnel Web Profiler in action, running on http://www.foundstone.com/.
        This tool has a nice graphical user interface, which provides information such
        as content grouping, a Web site map, cross-references, a crawled statistics
        list view, and a tree view, among other things.



        Figure 8-4. Funnel Web Profiler, showing scan statistics for http://www.foundstone.com/




        After the Web site scan is completed, Funnel Web Profiler
        aggregates the information gathered and presents various representations and
        statistics about the site information. For example, clicking on the Web Map tab
        shows a graphical layout of the Web site and the pages in it. Figure 8-5
        shows the Web map of http://www.foundstone.com/.
        Each Web resource is represented as a node, and the entire Web map shows how
        each node is linked with other nodes. The Web map presents a visual
        representation of the Web site and reveals the layout and linking of resources.



        Figure 8-5. Funnel Web Profiler's Web map for http://www.foundstone.com/




        The Web map contains a cluster of linked nodes, with each
        node's starting point identified. The top right corner gives a thumbnail
        representation of the full Web map. It also allows the user to zoom in for a
        more detailed view.



        If we click on the List tab, we get a tabular list of all the
        Web resources on http://www.foundstone.com/, along with other
        information such as the type of resource, its size in bytes, and when it was
        modified. Figure 8-6 displays the list view of http://www.foundstone.com/.



        Figure 8-6. List view of Web resources on http://www.foundstone.com/




        Step 1 Wrap-Up



        Some other tools, which we haven't covered in detail but are
        worth mentioning, are Teleport Pro from Tennyson Maxwell (http://www.tenmax.com/)
        and Sam Spade (http://www.samspade.org/). Teleport Pro runs
        on the Windows platform and is primarily used for mirroring Web sites. Teleport
        Pro allows users to define individual projects for mirroring sites. Site
        mirroring is quite fast with Teleport Pro's multithreaded mirroring engine. Sam
        Spade is also a Windows-based tool that allows basic site crawling and
        source sifting. We now have quite a lot of information for performing thorough
        analysis. Let's see what we can do with all this information.



        Crawlers and Redirection

        Automated Web crawlers sometimes get thrown off track when
        they encounter unusual linking techniques and page redirection. A few
        "smart" crawlers, however, can interpret these anomalies accurately
        and provide good crawled results. For example, a crawler may get confused
        when a redirection is encountered in a client-side script, because crawlers
        don't usually interpret client-side scripts such as JavaScript or VBScript.


        The following JavaScript code snippet has a redirection
        directive, which gets interpreted and executed on the browser:


        <SCRIPT LANGUAGE="JavaScript">
        location.replace("./index.php3");
        </SCRIPT>

        It instructs the browser to request index.php3, which the browser
        will do only if JavaScript execution is enabled. When a
        crawler encounters this instruction, it won't be able to interpret and
        execute the location.replace() statement, and it will fail to crawl
        index.php3.


        However, if the redirection is performed by techniques such
        as a Content-Location header response or an HTML <META> tag, the
        crawler could look for them and crawl the pages accordingly.


        The following two examples illustrate redirection with the
        HTTP response header and the <META> tag, respectively.








        Redirection by Content-Location


        The code snippet for this procedure is:


        HTTP/1.1 200 OK
        Server: Microsoft-IIS/5.0
        Date: Wed, 27 Mar 2002 08:13:01 GMT
        Connection: Keep-Alive
        Content-Location: http://www.example.com/example/index.asp
        Set-Cookie: ASPSESSIONIDQQGQGIWC=LNDJBOLAIFDAKJDBNDINOABF; path=/
        Cache-control: private

        Here we sent a GET request to a server,
        www.example.com, and requested the default Web resource on its root directory.
        Examining the header of the HTTP response, we see that it has a special
        field, Content-Location. This particular field forces the browser to request
        the URL http://www.example.com/example/index.asp.








        Redirection by HTTP-EQUIV


        We can insert <META> tags of several
        types in the HTML header section. The most common use of <META> tags is
        to list keywords associated with the HTML document. However, <META>
        tags can also be used for redirection. Using the HTTP-EQUIV clause within a
        <META> tag redirects the browser to a URL contained in it. The
        following <META> tag instructs the browser to refresh to
        http://www.yahoo.com/ after two seconds:


        <META HTTP-EQUIV=Refresh CONTENT="2; url=http://www.yahoo.com/">

        Smart crawlers implement methods to parse redirection
        responses such as those shown in the preceding examples. However, some
        crawlers, such as GNU wget, are unable to handle <META> tag redirection.
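
        As an illustration of what such a method might look like, the following
        sketch recognizes the two redirection forms shown above. It uses naive
        string scanning rather than proper header and HTML parsing, and the
        sample inputs are taken from the preceding examples.

        #include <stdio.h>
        #include <string.h>

        /* Return the redirect target found in the headers or HTML, or NULL. */
        static const char *find_redirect(const char *headers, const char *html) {
            static char target[512];
            const char *p = strstr(headers, "Content-Location:");
            if (p && sscanf(p, "Content-Location: %511s", target) == 1)
                return target;
            p = strstr(html, "HTTP-EQUIV=Refresh");
            if (p && (p = strstr(p, "url=")) != NULL &&
                sscanf(p + 4, "%511[^\">]", target) == 1)
                return target;
            return NULL;
        }

        int main(void) {
            const char *hdr =
                "HTTP/1.1 200 OK\r\n"
                "Content-Location: http://www.example.com/example/index.asp\r\n";
            const char *html =
                "<META HTTP-EQUIV=Refresh CONTENT=\"2; url=http://www.yahoo.com/\">";
            printf("header redirect: %s\n", find_redirect(hdr, ""));
            printf("meta redirect:   %s\n", find_redirect("", html));
            return 0;
        }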