Friday, November 13, 2009

Section 8.6. The Mach VM User-Space Interface











8.6. The Mach VM User-Space Interface


Mach provides a powerful set of routines to user programs for manipulating task address spaces. Given the appropriate privileges, a task can perform operations on another task's address space identically to its own. All routines in the Mach VM user interface require the target task as an argument.[12] Therefore, the routines are uniform in how they are used, regardless of whether the target task is the caller's own task or another.

[12] Specifically, the target task is a send right to the control port of the target task.


Since user address spaces have a one-to-one mapping with user tasks, there are no explicit routines to create or destroy an address space. When the first task (the kernel task) is created, the map field of its task structure is set to refer to the kernel map (kernel_map), which is created by kmem_init() [osfmk/vm/vm_kern.c] during VM subsystem initialization. For subsequent tasks, a virtual address space is created with the task and destroyed along with the task. We saw in Chapter 6 that the task_create() call takes a prototype task and an address space inheritance indicator as arguments. The initial contents of a newly created task's address map are determined from these arguments. In particular, the inheritance properties of the prototype task's address map determine which portions, if any, are inherited by the child task.


// osfmk/kern/task.c

kern_return_t
task_create_internal(task_t parent_task,
boolean_t inherit_memory,
task_t *child_task)
{
...
if (inherit_memory)
new_task->map = vm_map_fork(parent_task->map);
else
new_task->map = vm_map_create(pmap_create(0),
(vm_map_offset_t)(VM_MIN_ADDRESS),
(vm_map_offset_t)(VM_MAX_ADDRESS), TRUE);
...
}


vm_map_fork() [osfmk/vm/vm_map.c] first calls pmap_create() [osfmk/ppc/pmap.c] to create a new physical map and calls vm_map_create() [osfmk/vm/vm_map.c] to create an empty VM map with the newly created physical map. The minimum and maximum offsets of the new VM map are taken from the parent's map. vm_map_fork() then iterates over the VM map entries of the parent's address map, examining the inheritance property of each. These properties determine whether the child inherits any memory ranges from the parent, and if so, how (fully shared or copied). Barring inherited memory ranges, a newly created address space is otherwise empty. Before the first thread executes in a task, the task's address space must be populated appropriately. In the case of a typical program, several partiessuch as the kernel, the system library, and the dynamic link editordetermine what to map into the task's address space.


Let us now look at several Mach VM routines that are available to user programs. The following is a summary of the functionality provided by these routines:


  • Creating an arbitrary memory range in a task, including allocation of new memory

  • Destroying an arbitrary memory range, including one that is unallocated, in a task

  • Reading, writing, and copying a memory range

  • Sharing a memory range

  • Setting protection, inheritance, and other attributes of a memory range

  • Preventing the pages in a memory range from being evicted by wiring them in physical memory


Note that in this section, we discuss the new Mach VM API that was introduced in Mac OS X 10.4. The new API is essentially the same as the old API from the programmer's standpoint, with the following key differences.


  • Routine names have the mach_ prefixfor example, vm_allocate() becomes mach_vm_allocate().

  • Data types used in routines have been updated to support both 64-bit and 32-bit tasks. Consequently, the new API can be used with any task.

  • The new and old APIs are exported by different MIG subsystems[13]: mach_vm and vm_map, respectively. The corresponding header files are <mach/mach_vm.h> and <mach/vm_map.h>, respectively.

    [13] We will look at MIG subsystems in Chapter 9.




8.6.1. mach_vm_map()


mach_vm_map() is the fundamental user-visible Mach routine for establishing a new range of virtual memory in a task. It allows fine-grained specification of the properties of the virtual memory being mapped, which accounts for its large number of parameters.


kern_return_t
mach_vm_map(vm_map_t target_task,
mach_vm_address_t *address,
mach_vm_size_t size,
mach_vm_offset_t mask,
int flags,
mem_entry_name_port_t object,
memory_object_offset_t offset,
boolean_t copy,
vm_prot_t cur_protection,
vm_prot_t max_protection,
vm_inherit_t inheritance);


Given the importance of mach_vm_map(), we will discuss each of its parameters. We will not do so for all Mach VM routines.


target_task specifies the task whose address space will be used for mapping. A user program specifies the control port of the target task as this argument, and indeed, the type vm_map_t is equivalent to mach_port_t in user space. Mach's IPC mechanism translates a vm_map_t to a pointer to the corresponding VM map structure in the kernel. We will discuss this translation in Section 9.6.2.


When mach_vm_map() returns successfully, it populates the address pointer with the location of the newly mapped memory in the target task's virtual address space. This is when the VM_FLAGS_ANYWHERE bit is set in the flags argument. If this bit is not set, then address contains a caller-specified virtual address for mach_vm_map() to use. If the memory cannot be mapped at that address (typically because there is not enough free contiguous virtual memory beginning at that location), mach_vm_map() will fail. If the user-specified address is not page-aligned, the kernel will truncate it.


size specifies the amount of memory to be mapped in bytes. It should be an integral number of pages; otherwise, the kernel will round it up appropriately.


The mask argument of mach_vm_map() specifies an alignment restriction on the kernel-chosen starting address. A bit that is set in mask will not be set in the addressthat is, it will be masked out. For example, if mask is 0x00FF_FFFF, the kernel-chosen address will be aligned on a 16MB boundary (the lower 24 bits of the address will be zero). This feature of mach_vm_map() can be used to emulate a virtual page size that is larger than the physical page size.





Caveat Regarding Offsets and Sizes


As we noted in Section 8.4, Mach VM API routines operate on page-aligned addresses and memory sizes that are integral multiples of the page size. In general, if a user-specified address is not the beginning of a page, the kernel truncates itthat is, the actual address used will be the beginning of the page in which the original address resides. Similarly, if a size-specifying argument contains a byte count that is not an integral number of pages, the kernel rounds the size up appropriately. The following macros are used for truncating and rounding offsets and sizes (note that rounding 0xFFFF_FFFF pages yields the value 1):


[View full width]
// osfmk/mach/ppc/vm_param.h

#define PPC_PGBYTES 4096
#define PAGE_SIZE PPC_PGBYTES
#define PAGE_MASK (PAGE_SIZE - 1)

// osfmk/vm/vm_map.h

#define vm_map_trunc_page(x) ((vm_map_offset_t)(x) & ~(
(signed)PAGE_MASK))
#define vm_map_round_page(x) (((vm_map_offset_t)(x) + PAGE_MASK) & \
~((signed)PAGE_MASK))




The following are examples of individual flags (bits) that can be set in the flags argument.


  • VM_FLAGS_FIXED This is used to specify that the new VM region should be allocated at the caller-provided address, if possible. VM_FLAGS_FIXED is defined to be the value 0x0. Therefore, logically OR'ing this does not change the value of flags. It merely represents the absence of VM_FLAGS_ANYWHERE.

  • VM_FLAGS_ANYWHERE This is used to specify that the new VM region can be allocated anywhere in the address space.

  • VM_FLAGS_PURGABLE This is used to specify that a purgable VM object should be created for the new VM region. A purgable object has the special property that it can be put into a nonvolatile state in which its pages become eligible for reclamation without being paged out to the backing store.

  • VM_FLAGS_OVERWRITE This, when used along with VM_FLAGS_FIXED, is used to specify that the new VM region can replace existing VM regions if necessary.


object is the critical argument of mach_vm_map(). It must be a Mach port naming a memory object, which will provide backing for the range being mapped. As we saw earlier, a memory object represents a range of pages whose properties are controlled by a single pager. The kernel uses the memory object port to communicate with the pager. When mach_vm_map() is used to map some portion of a task's address space with a memory object, the latter's pages are accessible by the task. Note that the virtual address at which such a page range appears in a given task is task-dependent. However, a page has a fixed offset within its memory objectthis offset is what a pager works with.


The following are some examples of memory objects used with mach_vm_map().


  • When a Mach-O executable is loaded for execution by the execve() system call, the file is mapped into the address space of the target process through the vm_map() kernel function [osfmk/vm/vm_user.c], with the object argument referring to the vnode pager.

  • If the object argument is the null memory object (MEMORY_OBJECT_NULL), or equivalently, MACH_PORT_NULL, mach_vm_map() uses the default pager, which provides initially zero-filled memory backed by the system's swap space. In this case, mach_vm_map() is equivalent to mach_vm_allocate() (see Section 8.6.3), albeit with more options for configuring the memory's properties.

  • The object argument can be a named entry handle. A task creates a named entry from a given mapped portion of its address space by calling mach_make_memory_entry_64(), which returns a handle to the underlying VM object. The handle so obtained can be used as a shared memory object: The memory it represents can be mapped into another task's address space (or the same task's address space, for that matter). We will see an example of using mach_make_memory_entry_64() in Section 8.7.5.



There is also mach_make_memory_entry(), which is a wrapper around mach_make_memory_entry_64(). The latter is not 64-bit-only, as its name suggests.




The offset argument specifies the beginning of the memory in the memory object. Along with size, this argument specifies the range of the memory to be mapped in the target task.


If copy is TRUE, the memory is copied (with copy-on-write optimization) from the memory object to the target task's virtual address space. This way, the target receives a private copy of the memory. Thereafter, any changes made by the task to that memory will not be sent to the pager. Conversely, the task will not see changes made by someone else. If copy is FALSE, the memory is directly mapped.


cur_protection specifies the initial current protection for the memory. The following individual protection bits can be set in a Mach VM protection value: VM_PROT_READ, VM_PROT_WRITE, and VM_PROT_EXECUTE. The values VM_PROT_ALL and VM_PROT_NONE represent all bits set (maximum access) and no bits set (all access disallowed), respectively. max_protection specifies the maximum protection for the memory.


Thus, each mapped region has a current protection and a maximum protection. Once the memory is mapped, the kernel will not allow the current to exceed the maximum. Both the current and maximum protection attributes can be subsequently changed using mach_vm_protect() (see Section 8.6.5), although note that the maximum protection can only be loweredthat is, made more restrictive.


inheritance specifies the mapped memory's initial inheritance attribute, which determines how the memory is inherited by a child task during a fork() operation. It can take the following values.


  • VM_INHERIT_NONE The range is undefined ("empty") in the child task.

  • VM_INHERIT_SHARE The range is shared between the parent and the child, allowing each to freely read from and write to the memory.

  • VM_INHERIT_COPY The range is copied (with copy-on-write and other, if any, optimizations) from the parent into the child.


The inheritance attribute can be later changed using mach_vm_inherit() (see Section 8.6.6).




8.6.2. mach_vm_remap()


mach_vm_remap() takes already mapped memory in a source task and maps it in the target task's address space, with allowance for specifying the new mapping's properties (as in the case of mach_vm_map()). You can achieve a similar effect by creating a named entry from a mapped range and then remapping it through mach_vm_map(). In that sense, mach_vm_remap() can be thought of as a "turnkey" routine for memory sharing. Note that the source and target tasks could be the same task.


kern_return_t
mach_vm_remap(vm_map_t target_task,
mach_vm_address_t *target_address,
mach_vm_size_t size,
mach_vm_offset_t mask,
boolean_t anywhere,
vm_map_t src_task,
mach_vm_address_t src_address,
boolean_t copy,
vm_prot_t *cur_protection,
vm_prot_t *max_protection,
vm_inherit_t inheritance);


The cur_protection and max_protection arguments return the protection attributes for the mapped region. If one or more subranges have differing protection attributes, the returned attributes are those of the range with the most restrictive protection.




8.6.3. mach_vm_allocate()


mach_vm_allocate() allocates a region of virtual memory in the target task. As noted earlier, its effect is similar to calling mach_vm_map() with a null memory object. It returns initially zero-filled, page-aligned memory. Like mach_vm_map(), it allows the caller to provide a specific address at which to allocate.


kern_return_t
mach_vm_allocate(vm_map_t target_task,
mach_vm_address_t address,
mach_vm_size_t size,
int flags);




8.6.4. mach_vm_deallocate()


mach_vm_deallocate() invalidates the given range of virtual memory in the given address space.


kern_return_t
mach_vm_deallocate(vm_map_t target_task,
mach_vm_address_t *address,
mach_vm_size_t size);


It is important to realize that as used here, the terms allocate and deallocate subtly differ from how they are used in the context of a typical memory allocator (such as malloc(3)). A memory allocator usually tracks allocated memorywhen you free allocated memory, the allocator will check that you are not freeing memory you did not allocate, or that you are not double-freeing memory. In contrast, mach_vm_deallocate() simply removes the given rangewhether currently mapped or notfrom the given address space.


When a task receives out-of-line memory in an IPC message, it should use mach_vm_deallocate() or vm_deallocate() to free that memory when it is not needed. Several Mach routines dynamicallyand implicitlyallocate memory in the address space of the caller. Typical examples of such routines are those that populate variable-length arrays, such as process_set_tasks() and task_threads().




8.6.5. mach_vm_protect()


mach_vm_protect() sets the protection attribute for the given memory range in the given address space. The possible protection values are the same as those we saw in Section 8.6.1. If the set_maximum Boolean argument is TRUE, new_protection specifies the maximum protection; otherwise, it specifies the current protection. If the new maximum protection is more restrictive than the current protection, the latter is lowered to match the new maximum.


kern_return_t
mach_vm_protect(vm_map_t target_task,
mach_vm_address_t address,
mach_vm_size_t size,
boolean_t set_maximum,
vm_prot_t new_protection);




8.6.6. mach_vm_inherit()


mach_vm_inherit() sets the inheritance attribute for the given memory range in the given address space. The possible inheritance values are the same as those we saw in Section 8.6.1.


kern_return_t
mach_vm_inherit(vm_map_t target_task,
mach_vm_address_t address,
mach_vm_size_t size,
vm_inherit_t new_inheritance);





8.6.7. mach_vm_read()


mach_vm_read() TRansfers data from the given memory range in the given address space to dynamically allocated memory in the calling task. In other words, unlike most Mach VM API routines, mach_vm_read() implicitly uses the current address space as its destination. The source memory region must be mapped in the source address space. As with memory allocated dynamically in other contexts, it is the caller's responsibility to invalidate it when appropriate.


kern_return_t
mach_vm_read(vm_map_t target_task,
mach_vm_address_t address,
mach_vm_size_t size,
vm_offset_t *data,
mach_msg_type_number_t *data_count);


The mach_vm_read_overwrite() variant reads into a caller-specified buffer. Yet another variantmach_vm_read_list()reads a list of memory ranges from the given map. The list of ranges is an array of mach_vm_read_entry structures [<mach/vm_region.h>]. The maximum size of this array is VM_MAP_ENTRY_MAX (256). Note that for each source address, memory is copied to the same address in the calling task.


kern_return_t
mach_vm_read_overwrite(vm_map_t target_task,
mach_vm_address_t address,
mach_vm_size_t size,
mach_vm_address_t data,
mach_vm_size_t *out_size);

kern_return_t
mach_vm_read_list(vm_map_t target_task,
mach_vm_read_entry_t data_list,
natural_t data_count);

struct mach_vm_read_entry {
mach_vm_address_t address;
mach_vm_size_t size;
};

typedef struct mach_vm_read_entry mach_vm_read_entry_t[VM_MAP_ENTRY_MAX];





8.6.8. mach_vm_write()


mach_vm_write() copies data from a caller-specified buffer to the given memory region in the target address space. The destination memory range must already be allocated and writable from the caller's standpointin that sense, this is more precisely an overwrite call.


kern_return_t
mach_vm_write(vm_map_t target_task,
mach_vm_address_t address,
vm_offset_t data,
mach_msg_type_number_t data_count);




8.6.9. mach_vm_copy()


mach_vm_copy() copies one memory region to another within the same task. The source and destination regions must both already be allocated. Their protection attributes must permit reading and writing, respectively. Moreover, the two regions can overlap. mach_vm_copy() has the same effect as a mach_vm_read() followed by a mach_vm_write().


kern_return_t
mach_vm_copy(vm_map_t target_task,
mach_vm_address_t source_address,
mach_vm_size_t count,
mach_vm_address_t dest_address);




Comparing Mach VM Routines with Mach IPC Routines for Memory Transfer


Since large amounts of datatheoretically, entire address spacescan be transferred through Mach IPC, it is interesting to note the difference between Mach VM routines and Mach IPC messaging when sending data from one task to another. In the case of a Mach VM routine such as mach_vm_copy() or mach_vm_write(), the calling task must have send rights to the control port of the target task. However, the target task does not have to participate in the transferit can be passive. In fact, it could even be suspended. In the case of Mach IPC, the sender must have send rights to a port that the receiving task has receive rights to. Additionally, the receiving task must actively receive the message. Moreover, Mach VM routines allow memory to be copied at a specific destination address in the target address space.







8.6.10. mach_vm_wire()


mach_vm_wire() alters the given memory region's pageability: If the wired_access argument is one of VM_PROT_READ, VM_PROT_WRITE, VM_PROT_EXECUTE, or a combination thereof, the region's pages are protected accordingly and wired in physical memory. If wired_access is VM_PROT_NONE, the pages are unwired. Since wiring pages is a privileged operation, vm_wire() requires send rights to the host's control port. The host_get_host_priv_port() routine, which itself requires superuser privileges, can be used to acquire these rights.


kern_return_t
mach_vm_wire(host_priv_t host,
vm_map_t target_task,
mach_vm_address_t address,
mach_vm_size_t size,
vm_prot_t wired_access);



Unlike other Mach VM routines discussed so far, mach_vm_wire() is exported by the host_priv MIG subsystem.






8.6.11. mach_vm_behavior_set()


mach_vm_behavior_set() specifies the expected page reference behaviorthe access patternfor the given memory region. This information is used during page-fault handling to determine which pages, if any, to deactivate based on the memory access pattern.


kern_return_t
mach_vm_behavior_set(vm_map_t target_task,
mach_vm_address_t address,
mach_vm_size_t size,
vm_behavior_t behavior);


The behavior argument can take the following values:


  • VM_BEHAVIOR_DEFAULT the default behavior for all nascent memory

  • VM_BEHAVIOR_RANDOM random access pattern

  • VM_BEHAVIOR_SEQUENTIAL sequential access (forward)

  • VM_BEHAVIOR_RSEQNTL sequential access (reverse)

  • VM_BEHAVIOR_WILLNEED will need these pages in the near future

  • VM_BEHAVIOR_DONTNEED will not need these pages in the near future


The kernel maps the VM_BEHAVIOR_WILLNEED and VM_BEHAVIOR_DONTNEED reference behavior specifications to the default behavior, which assumes a strong locality of reference.



mach_vm_behavior_set() is analogous to the madvise() system call. In fact, the Mac OS X madvise() implementation is a simple wrapper around the in-kernel equivalent of mach_vm_behavior_set().




Since the expected reference behavior is applied to a memory range, the behavior setting is recorded as part of the VM map entry structure (struct vm_map_entry [osfmk/vm/vm_map.h]). Upon a page fault, the fault handler uses the behavior setting to determine which, if any, of the active pages are uninteresting enough to be deactivated. This mechanism also uses the sequential and last_alloc fields of the VM object structure (struct vm_object [osfmk/vm/vm_object.h]). The sequential field records the sequential access size, whereas last_alloc records the last allocation offset in that object.


If the reference behavior is VM_BEHAVIOR_RANDOM, the sequential access size is always kept as the page size, and no page is deactivated.


If the behavior is VM_BEHAVIOR_SEQUENTIAL, the page-fault handler examines the current and last allocation offsets to see if the access pattern is indeed sequential. If so, the sequential field is incremented by a page size, and the immediate last page is deactivated. If, however, the access is not sequential, the fault handler resets its recording by setting the sequential field to the page size. No page is deactivated in this case. The handling of VM_BEHAVIOR_RSEQNTL is similar, except the notion of sequential is reversed.


In the case of VM_BEHAVIOR_DEFAULT, the handler attempts to establish an access pattern based on the current and last offsets. If they are not consecutive (in units of a page), the access is deemed random, and no page is deactivated. If they are consecutive, whether increasing or decreasing, the handler increments the sequential field by a page size. If the pattern continues and the recorded sequential access size exceeds MAX_UPL_TRANSFER (256) pages, the page that is MAX_UPL_TRANSFER pages away (behind or forward, depending on the direction) is deactivated. While the recorded sequential access size remains less than MAX_UPL_TRANSFER, no page is deactivated. If, however, the pattern is broken, the sequential access size is reset to the page size.



Page deactivation involves calling vm_page_deactivate() [osfmk/vm/vm_resident.c], which returns the page to the inactive queue.







8.6.12. mach_vm_msync()


mach_vm_msync() synchronizes the given memory range with its pager.


kern_return_t
mach_vm_msync(vm_map_t target_task,
mach_vm_address_t address,
mach_vm_size_t size,
vm_sync_t sync_flags);


The sync_flags argument is the bitwise OR of synchronization bits defined in <mach/vm_sync.h>. The following are examples of valid combinations.


  • VM_SYNC_INVALIDATE flushes pages in the given memory range, returning only precious pages to the pager and discarding dirty pages.

  • If VM_SYNC_ASYNCHRONOUS is specified along with VM_SYNC_INVALIDATE, both dirty and precious pages are returned to the pager, but the call returns without waiting for the pages to reach the backing store.

  • VM_SYNC_SYNCHRONOUS is similar to VM_SYNC_ASYNCHRONOUS, but the call does not return until the pages reach the backing store.

  • When either VM_SYNC_ASYNCHRONOUS or VM_SYNC_SYNCHRONOUS is specified by itself, both dirty and precious pages are returned to the pager without flushing any pages.

  • If VM_SYNC_CONTIGUOUS is specified, the call returns KERN_INVALID_ADDRESS if the specified memory range is not mapped in its entiretythat is, if the range has a hole in it. Nevertheless, the call still completes its work as it would have if VM_SYNC_CONTIGUOUS were not specified.




Precious Pages


A precious page is used when only one copy of its data is desired. There may not be a copy of a precious page's data both in the backing store and in memory. When a pager provides a precious page to the kernel, it means that the pager has not necessarily retained its own copy. When the kernel must evict such pages, they must be returned to the pager, even if they had not been modified while resident.




mach_vm_msync() is analogous to the msync() system call. In fact, the msync() implementation uses the in-kernel equivalent of mach_vm_sync(). POSIX.1 requires msync() to return an ENOMEM error if there are holes in the region being synchronized. Therefore, msync() always sets the VM_SYNC_CONTIGUOUS bit before calling the in-kernel version of mach_vm_msync(). If the latter returns KERN_INVALID_ADDRESS, msync() TRanslates the error to ENOMEM.




8.6.13. Statistics


System-wide VM statistics can be retrieved using the HOST_VM_INFO flavor of the host_statistics() Mach routine. The vm_stat command-line program also displays these statistics.


$ vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free: 144269.
Pages active: 189526.
Pages inactive: 392812.
Pages wired down: 59825.
"Translation faults": 54697978.
Pages copy-on-write: 800440.
Pages zero filled: 38386710.
Pages reactivated: 160297.
Pageins: 91327.
Pageouts: 4335.
Object cache: 205675 hits of 378912 lookups (54% hit rate)


mach_vm_region() returns information about a memory region in the given address space. The address argument specifies the location at which mach_vm_region() starts to look for a valid region. The outbound values of address and size specify the range of the region actually found. The flavor argument specifies the type of information to retrieve, with info pointing to a structure appropriate for the flavor being requested. For example, the VM_REGION_BASIC_INFO flavor is used with a vm_region_basic_info structure. The count argument specifies the size of the input buffer in units of natural_t. For example, to retrieve information for the VM_REGION_BASIC_INFO flavor, the size of the input buffer must be at least VM_REGION_BASIC_INFO_COUNT. The outbound value of count specifies the size of the data filled in by the call.


kern_return_t
mach_vm_region(vm_map_t target_task,
mach_vm_address_t *address,
mach_vm_size_t *size,
vm_region_flavor_t flavor,
vm_region_info_t info,
mach_msg_type_number_t *info_count,
mach_port_t *object_name);


Note that a task should be suspended before mach_vm_region() is called on it, otherwise the results obtained may not provide a true picture of the task's VM situation.


The mach_vm_region_recurse() variant recurses into submap chains in the given task's address map. The vmmap command-line program uses both variants to retrieve information about the virtual memory regions allocated in the given process.













No comments:

Post a Comment