8.6. The Mach VM User-Space Interface
Mach provides a powerful set of routines to user programs for manipulating task address spaces. Given the appropriate privileges, a task can perform operations on another task's address space exactly as it would on its own. All routines in the Mach VM user interface require the target task as an argument.[12] Therefore, the routines are used uniformly, regardless of whether the target task is the caller's own task or another.
Since user address spaces have a one-to-one mapping with user tasks, there are no explicit routines to create or destroy an address space. When the first task (the kernel task) is created, the map field of its task structure is set to refer to the kernel map (kernel_map), which is created by kmem_init() [osfmk/vm/vm_kern.c] during VM subsystem initialization. For subsequent tasks, a virtual address space is created with the task and destroyed along with the task. We saw in Chapter 6 that the task_create() call takes a prototype task and an address space inheritance indicator as arguments. The initial contents of a newly created task's address map are determined from these arguments. In particular, the inheritance properties of the prototype task's address map determine which portions, if any, are inherited by the child task.
vm_map_fork() [osfmk/vm/vm_map.c], invoked during task creation in osfmk/kern/task.c, first calls pmap_create() [osfmk/ppc/pmap.c] to create a new physical map and then calls vm_map_create() [osfmk/vm/vm_map.c] to create an empty VM map with the newly created physical map. The minimum and maximum offsets of the new VM map are taken from the parent's map. vm_map_fork() then iterates over the VM map entries of the parent's address map, examining the inheritance property of each. These properties determine whether the child inherits any memory ranges from the parent, and if so, how (fully shared or copied). Barring inherited memory ranges, a newly created address space is otherwise empty. Before the first thread executes in a task, the task's address space must be populated appropriately. In the case of a typical program, several parties (such as the kernel, the system library, and the dynamic link editor) determine what to map into the task's address space.
Let us now look at several Mach VM routines that are available to user programs. The following is a summary of the functionality provided by these routines:
Note that in this section, we discuss the new Mach VM API that was introduced in Mac OS X 10.4. The new API is essentially the same as the old API from the programmer's standpoint, with the following key differences.
8.6.1. mach_vm_map()
mach_vm_map() is the fundamental user-visible Mach routine for establishing a new range of virtual memory in a task. It allows fine-grained specification of the properties of the virtual memory being mapped, which accounts for its large number of parameters.
Given the importance of mach_vm_map(), we will discuss each of its parameters. We will not do so for all Mach VM routines.
target_task specifies the task whose address space will be used for mapping. A user program specifies the control port of the target task as this argument, and indeed, the type vm_map_t is equivalent to mach_port_t in user space. Mach's IPC mechanism translates a vm_map_t to a pointer to the corresponding VM map structure in the kernel. We will discuss this translation in Section 9.6.2.
address is a pointer that serves as both input and output. If the VM_FLAGS_ANYWHERE bit is set in the flags argument, the kernel chooses the location, and when mach_vm_map() returns successfully, it populates address with the location of the newly mapped memory in the target task's virtual address space. If this bit is not set, address contains a caller-specified virtual address for mach_vm_map() to use. If the memory cannot be mapped at that address (typically because there is not enough free contiguous virtual memory beginning at that location), mach_vm_map() will fail. If the user-specified address is not page-aligned, the kernel will truncate it to a page boundary.
size specifies the amount of memory to be mapped in bytes. It should be an integral number of pages; otherwise, the kernel will round it up to the next page boundary.
The mask argument of mach_vm_map() specifies an alignment restriction on the kernel-chosen starting address. A bit that is set in mask will not be set in the address; that is, it will be masked out. For example, if mask is 0x00FF_FFFF, the kernel-chosen address will be aligned on a 16MB boundary (the lower 24 bits of the address will be zero).
This feature of mach_vm_map() can be used to emulate a virtual page size that is larger than the physical page size.
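The address, size, and mask rules described above come down to simple bit arithmetic. The following is a minimal user-space sketch of that arithmetic, not the kernel's code: the sketch_ names are hypothetical, and the 4096-byte page size is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size, for illustration only; it is not queried from the system. */
#define SKETCH_PAGE_SIZE ((uint64_t)4096)

/* Truncate an address down to the start of its page, as is done
 * for a caller-specified address that is not page-aligned. */
static uint64_t sketch_trunc_page(uint64_t addr)
{
    return addr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Round a size up to an integral number of pages. */
static uint64_t sketch_round_page(uint64_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Return the lowest address at or above hint in which no bit of
 * mask is set. The mask must have the form 2^n - 1. */
static uint64_t sketch_align_with_mask(uint64_t hint, uint64_t mask)
{
    assert((mask & (mask + 1)) == 0);
    return (hint + mask) & ~mask;
}
```

With a mask of 0x00FFFFFF, every address the helper returns has its lower 24 bits clear, which is the 16MB alignment from the example above.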
The following are examples of individual flags (bits) that can be set in the flags argument.
object is the critical argument of mach_vm_map(). It must be a Mach port naming a memory object, which will provide backing for the range being mapped. As we saw earlier, a memory object represents a range of pages whose properties are controlled by a single pager. The kernel uses the memory object port to communicate with the pager. When mach_vm_map() is used to map some portion of a task's address space with a memory object, the latter's pages become accessible to the task. Note that the virtual address at which such a page range appears in a given task is task-dependent. However, a page has a fixed offset within its memory object; this offset is what a pager works with. The following are some examples of memory objects used with mach_vm_map().
There is also mach_make_memory_entry(), which is a wrapper around mach_make_memory_entry_64(). Despite its name, the latter is not 64-bit-only.
The offset argument specifies the beginning of the memory in the memory object. Along with size, this argument specifies the range of the memory to be mapped in the target task.
If copy is TRUE, the memory is copied (with copy-on-write optimization) from the memory object to the target task's virtual address space. This way, the target receives a private copy of the memory. Thereafter, any changes made by the task to that memory will not be sent to the pager. Conversely, the task will not see changes made by someone else. If copy is FALSE, the memory is directly mapped.
cur_protection specifies the initial current protection for the memory. The following individual protection bits can be set in a Mach VM protection value: VM_PROT_READ, VM_PROT_WRITE, and VM_PROT_EXECUTE. The values VM_PROT_ALL and VM_PROT_NONE represent all bits set (maximum access) and no bits set (all access disallowed), respectively.
max_protection specifies the maximum protection for the memory. Thus, each mapped region has a current protection and a maximum protection. Once the memory is mapped, the kernel will not allow the current protection to exceed the maximum. Both the current and maximum protection attributes can be subsequently changed using mach_vm_protect() (see Section 8.6.5), although the maximum protection can only be lowered, that is, made more restrictive.
inheritance specifies the mapped memory's initial inheritance attribute, which determines how the memory is inherited by a child task during a fork() operation. It can take the following values.
The inheritance attribute can be changed later using mach_vm_inherit() (see Section 8.6.6).
8.6.2. mach_vm_remap()
mach_vm_remap() takes already mapped memory in a source task and maps it in the target task's address space, with allowance for specifying the new mapping's properties (as in the case of mach_vm_map()). You can achieve a similar effect by creating a named entry from a mapped range and then remapping it through mach_vm_map(). In that sense, mach_vm_remap() can be thought of as a "turnkey" routine for memory sharing. Note that the source and target tasks could be the same task.
The cur_protection and max_protection arguments return the protection attributes for the mapped region. If one or more subranges have differing protection attributes, the returned attributes are those of the range with the most restrictive protection.
8.6.3. mach_vm_allocate()
mach_vm_allocate() allocates a region of virtual memory in the target task. As noted earlier, its effect is similar to calling mach_vm_map() with a null memory object. It returns page-aligned memory that is initially zero-filled. Like mach_vm_map(), it allows the caller to provide a specific address at which to allocate.
8.6.4. mach_vm_deallocate()
mach_vm_deallocate() invalidates the given range of virtual memory in the given address space.
It is important to realize that as used here, the terms allocate and deallocate subtly differ from how they are used in the context of a typical memory allocator (such as malloc(3)). A memory allocator usually tracks allocated memory: when you free allocated memory, the allocator will check that you are not freeing memory you did not allocate, or that you are not double-freeing memory. In contrast, mach_vm_deallocate() simply removes the given range, whether currently mapped or not, from the given address space.
When a task receives out-of-line memory in an IPC message, it should use mach_vm_deallocate() or vm_deallocate() to free that memory when it is no longer needed. Several Mach routines dynamically (and implicitly) allocate memory in the address space of the caller. Typical examples of such routines are those that populate variable-length arrays, such as processor_set_tasks() and task_threads().
8.6.5. mach_vm_protect()
mach_vm_protect() sets the protection attribute for the given memory range in the given address space. The possible protection values are the same as those we saw in Section 8.6.1. If the set_maximum Boolean argument is TRUE, new_protection specifies the maximum protection; otherwise, it specifies the current protection. If the new maximum protection is more restrictive than the current protection, the latter is lowered to match the new maximum.
8.6.6. mach_vm_inherit()
mach_vm_inherit() sets the inheritance attribute for the given memory range in the given address space. The possible inheritance values are the same as those we saw in Section 8.6.1.
8.6.7. mach_vm_read()
mach_vm_read() transfers data from the given memory range in the given address space to dynamically allocated memory in the calling task. In other words, unlike most Mach VM API routines, mach_vm_read() implicitly uses the current address space as its destination. The source memory region must be mapped in the source address space. As with memory allocated dynamically in other contexts, it is the caller's responsibility to invalidate the returned memory when appropriate.
The mach_vm_read_overwrite() variant reads into a caller-specified buffer. Yet another variant, mach_vm_read_list(), reads a list of memory ranges from the given map. The list of ranges is an array of mach_vm_read_entry structures [<mach/vm_region.h>]. The maximum size of this array is VM_MAP_ENTRY_MAX (256). Note that for each source address, memory is copied to the same address in the calling task.
8.6.8. mach_vm_write()
mach_vm_write() copies data from a caller-specified buffer to the given memory region in the target address space. The destination memory range must already be allocated and writable from the caller's standpoint; in that sense, this is more precisely an overwrite call.
8.6.9. mach_vm_copy()
mach_vm_copy() copies one memory region to another within the same task. The source and destination regions must both already be allocated. Their protection attributes must permit reading and writing, respectively. Moreover, the two regions can overlap. mach_vm_copy() has the same effect as a mach_vm_read() followed by a mach_vm_write().
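One consequence of the read-then-write equivalence is that an overlapping copy behaves as if the whole source were captured before any of the destination was modified. The following is a plain user-space analogy, assuming an ordinary byte array in place of a task's address space; it does not show how the kernel implements mach_vm_copy(), and sketch_copy_via_temp() is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes from base+src_off to base+dst_off through a
 * temporary buffer: the "read" step snapshots the source, and the
 * "write" step stores the snapshot, so overlapping ranges are safe.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
static int sketch_copy_via_temp(unsigned char *base, size_t src_off,
                                size_t dst_off, size_t len)
{
    unsigned char *tmp = malloc(len ? len : 1);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, base + src_off, len);   /* analogous to the read  */
    memcpy(base + dst_off, tmp, len);   /* analogous to the write */
    free(tmp);
    return 0;
}
```

For ordinary buffers, memmove() gives the same result without the temporary allocation; the two-step form is used here only to mirror the read and write calls named in the text.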
8.6.10. mach_vm_wire()
mach_vm_wire() alters the given memory region's pageability: If the wired_access argument is one of VM_PROT_READ, VM_PROT_WRITE, VM_PROT_EXECUTE, or a combination thereof, the region's pages are protected accordingly and wired in physical memory. If wired_access is VM_PROT_NONE, the pages are unwired. Since wiring pages is a privileged operation, mach_vm_wire() requires send rights to the host's control port. The host_get_host_priv_port() routine, which itself requires superuser privileges, can be used to acquire these rights.
Unlike the other Mach VM routines discussed so far, mach_vm_wire() is exported by the host_priv MIG subsystem.
8.6.11. mach_vm_behavior_set()
mach_vm_behavior_set() specifies the expected page reference behavior (the access pattern) for the given memory region. This information is used during page-fault handling to determine which pages, if any, to deactivate based on the memory access pattern.
The behavior argument can take the following values:
The kernel maps the VM_BEHAVIOR_WILLNEED and VM_BEHAVIOR_DONTNEED reference behavior specifications to the default behavior, which assumes a strong locality of reference. mach_vm_behavior_set() is analogous to the madvise() system call. In fact, the Mac OS X madvise() implementation is a simple wrapper around the in-kernel equivalent of mach_vm_behavior_set().
Since the expected reference behavior is applied to a memory range, the behavior setting is recorded as part of the VM map entry structure (struct vm_map_entry [osfmk/vm/vm_map.h]). Upon a page fault, the fault handler uses the behavior setting to determine which, if any, of the active pages are uninteresting enough to be deactivated. This mechanism also uses the sequential and last_alloc fields of the VM object structure (struct vm_object [osfmk/vm/vm_object.h]). The sequential field records the sequential access size, whereas last_alloc records the last allocation offset in that object.
If the reference behavior is VM_BEHAVIOR_RANDOM, the sequential access size is always kept as the page size, and no page is deactivated.
If the behavior is VM_BEHAVIOR_SEQUENTIAL, the page-fault handler examines the current and last allocation offsets to see if the access pattern is indeed sequential. If so, the sequential field is incremented by a page size, and the immediate last page is deactivated. If, however, the access is not sequential, the fault handler resets its recording by setting the sequential field to the page size. No page is deactivated in this case. The handling of VM_BEHAVIOR_RSEQNTL is similar, except the notion of sequential is reversed.
In the case of VM_BEHAVIOR_DEFAULT, the handler attempts to establish an access pattern based on the current and last offsets. If they are not consecutive (in units of a page), the access is deemed random, and no page is deactivated. If they are consecutive, whether increasing or decreasing, the handler increments the sequential field by a page size.
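The bookkeeping described above for VM_BEHAVIOR_RANDOM and VM_BEHAVIOR_SEQUENTIAL can be modeled in a few lines. This is a simplified sketch of the description in the text, not the kernel's fault handler: every sketch_ name is hypothetical, the 4096-byte page size is assumed, and the reverse-sequential and default behaviors are left out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uint64_t)4096)   /* assumed page size */

enum sketch_behavior { SKETCH_RANDOM, SKETCH_SEQUENTIAL };

/* Stand-ins for the two VM object fields named in the text. */
struct sketch_object {
    uint64_t sequential;   /* recorded sequential access size, in bytes */
    uint64_t last_alloc;   /* offset of the previous fault */
};

/* Record a fault at a page-aligned offset. Returns 1 and stores in
 * *victim the offset of the page to deactivate, or returns 0 if no
 * page should be deactivated. */
static int sketch_fault(struct sketch_object *obj, enum sketch_behavior b,
                        uint64_t offset, uint64_t *victim)
{
    int deactivate = 0;

    if (b == SKETCH_SEQUENTIAL &&
        offset == obj->last_alloc + SKETCH_PAGE_SIZE) {
        /* Pattern holds: grow the run and give up the previous page. */
        obj->sequential += SKETCH_PAGE_SIZE;
        *victim = obj->last_alloc;
        deactivate = 1;
    } else {
        /* Random behavior, or a broken run: reset the recording. */
        obj->sequential = SKETCH_PAGE_SIZE;
    }
    obj->last_alloc = offset;
    return deactivate;
}
```

A run of faults at offsets 0, 4096, and 8192 under the sequential behavior deactivates the pages at offsets 0 and 4096 in turn, and a jump to an unrelated offset resets the recorded size to one page.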
In the default case, if the consecutive pattern continues and the recorded sequential access size exceeds MAX_UPL_TRANSFER (256) pages, the page that is MAX_UPL_TRANSFER pages away (behind or forward, depending on the direction) is deactivated. While the recorded sequential access size remains less than MAX_UPL_TRANSFER, no page is deactivated. If, however, the pattern is broken, the sequential access size is reset to the page size.
Page deactivation involves calling vm_page_deactivate() [osfmk/vm/vm_resident.c], which returns the page to the inactive queue.
8.6.12. mach_vm_msync()
mach_vm_msync() synchronizes the given memory range with its pager.
The sync_flags argument is the bitwise OR of synchronization bits defined in <mach/vm_sync.h>. The following are examples of valid combinations.
mach_vm_msync() is analogous to the msync() system call. In fact, the msync() implementation uses the in-kernel equivalent of mach_vm_msync(). POSIX.1 requires msync() to return an ENOMEM error if there are holes in the region being synchronized. Therefore, msync() always sets the VM_SYNC_CONTIGUOUS bit before calling the in-kernel version of mach_vm_msync(). If the latter returns KERN_INVALID_ADDRESS, msync() translates the error to ENOMEM.
8.6.13. Statistics
System-wide VM statistics can be retrieved using the HOST_VM_INFO flavor of the host_statistics() Mach routine. The vm_stat command-line program also displays these statistics.
mach_vm_region() returns information about a memory region in the given address space. The address argument specifies the location at which mach_vm_region() starts to look for a valid region. The outbound values of address and size specify the range of the region actually found. The flavor argument specifies the type of information to retrieve, with info pointing to a structure appropriate for the flavor being requested. For example, the VM_REGION_BASIC_INFO flavor is used with a vm_region_basic_info structure. The count argument specifies the size of the input buffer in units of natural_t. For example, to retrieve information for the VM_REGION_BASIC_INFO flavor, the size of the input buffer must be at least VM_REGION_BASIC_INFO_COUNT. The outbound value of count specifies the size of the data filled in by the call.
Note that a task should be suspended before mach_vm_region() is called on it; otherwise, the results obtained may not provide a true picture of the task's VM situation.
The mach_vm_region_recurse() variant recurses into submap chains in the given task's address map. The vmmap command-line program uses both variants to retrieve information about the virtual memory regions allocated in the given process.
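The in/out convention for count described above (a capacity in natural_t units on input, the amount filled in on output) can be illustrated without the Mach headers. The sketch below is an assumption-laden model: sketch_region_info is a made-up structure, not the layout of vm_region_basic_info, and sketch_region_query() is a hypothetical stand-in for a region query routine.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int sketch_natural_t;   /* stand-in for natural_t */

/* Hypothetical info structure, for illustration only. */
struct sketch_region_info {
    sketch_natural_t protection;
    sketch_natural_t max_protection;
    sketch_natural_t inheritance;
    sketch_natural_t shared;
};

/* Size of the structure expressed in natural_t units. */
#define SKETCH_REGION_INFO_COUNT \
    ((unsigned int)(sizeof(struct sketch_region_info) / sizeof(sketch_natural_t)))

/* On entry, *count is the capacity of info in natural_t units.
 * Returns -1 if the buffer is too small; otherwise fills info,
 * sets *count to the number of units written, and returns 0. */
static int sketch_region_query(sketch_natural_t *info, unsigned int *count)
{
    struct sketch_region_info r = { 1, 7, 1, 0 };   /* made-up values */

    if (*count < SKETCH_REGION_INFO_COUNT)
        return -1;
    memcpy(info, &r, sizeof r);
    *count = SKETCH_REGION_INFO_COUNT;
    return 0;
}
```

A caller would typically initialize count from the flavor's COUNT constant before the call and consult it again afterward to learn how much of the buffer is valid.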
Friday, November 13, 2009