Wednesday, October 14, 2009

9.5. Four in Four: It Just Won't Go



Many of the problems that we have come across while designing and developing JPC have been associated with overhead. When emulating a complete computing environment inside another, the only thing that prevents you from extracting 100% of the native performance is the overhead of the emulation itself. Some of these overheads are time related, such as in the processor emulation, whereas others are space related.


The most obvious place where spatial overheads cause a problem is the address space: the 4 GB memory space (32-bit addresses) of a virtual computer won't fit inside the 4 GB (or less) available on the real (host) hardware. Even with large amounts of host memory, we can't just declare byte[] memory = new byte[4 * 1024 * 1024 * 1024];. Somehow we must shrink our emulated address space to fit inside a single process on the host machine, ideally with plenty of room to spare!
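As an aside, that declaration would not even fail loudly. The short sketch below (plain Java, nothing JPC-specific) shows that the size expression overflows 32-bit int arithmetic before any allocation is attempted:

```java
public class ArraySizeDemo {
    public static void main(String[] args) {
        // The size expression is evaluated in 32-bit int arithmetic:
        // 4 * 1024 * 1024 * 1024 == 2^32, which wraps around to 0.
        int size = 4 * 1024 * 1024 * 1024;
        System.out.println(size);           // prints 0

        byte[] memory = new byte[size];     // quietly allocates an EMPTY array
        System.out.println(memory.length);  // prints 0

        // Writing the size as a long does not help either; a long is not
        // allowed as an array dimension, so this line would not compile:
        // byte[] m = new byte[4L * 1024 * 1024 * 1024];
    }
}
```

So even before considering memory pressure, Java's int-sized array dimensions rule out a single flat 4 GB array.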


To save space, we first observe that the 4 GB address space is invariably not full. A typical machine will not exceed 2 GB of physical RAM, and in most circumstances we can get away with significantly less. Because only a fraction of the address space is ever backed by physical RAM, we can shrink our 4 GB representation considerably.


The first step in designing our
emulated physical address space has its origin in a little peek at the future.
If we look up the road we will see that one of the features of the IA-32 memory
management unit will help guide our structure for the address space. In
protected mode, the memory management unit of the CPU carves the address space
into indivisible chunks that are 4 KB wide (known as pages). So the obvious
thing to do is to chunk our memory on the same scale.


Splitting our address space into 4 KB
chunks means our address space no longer stores the data directly. Instead, the
data is stored in atomic memory units, which are represented as various
subclasses of Memory. The address space then
holds references to these objects. The resultant structure and memory accesses
are shown in Figure 9-2.




Note: To optimize instanceof lookups, we
design the inheritance chain for Memory objects without using
interfaces.





Figure 9-2. Physical
address space block structure




This structure has a set of 2^20 blocks, and each block requires a 32-bit reference to hold it. If we hold these references in an array (the most obvious choice), we have a memory overhead of 4 MB (2^20 references at 4 bytes each), which is not significant for most instances.
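A minimal sketch of this block structure might look like the following. The class and field names here are illustrative assumptions, not JPC's actual API:

```java
// Sketch of the block-table idea; names are assumptions, not JPC's real API.
abstract class Memory {
    public abstract byte getByte(int offset);
    public abstract void setByte(int offset, byte value);
}

class RamBlock extends Memory {
    private final byte[] data = new byte[4096];   // one 4 KB page
    public byte getByte(int offset) { return data[offset]; }
    public void setByte(int offset, byte value) { data[offset] = value; }
}

class PhysicalAddressSpace {
    // 2^20 references, one per 4 KB block, cover the 32-bit address space.
    private final Memory[] blocks = new Memory[1 << 20];

    public byte getByte(int address) {
        // Top 20 bits select the block; bottom 12 bits are the page offset.
        return blocks[address >>> 12].getByte(address & 0xfff);
    }

    public void setByte(int address, byte value) {
        blocks[address >>> 12].setByte(address & 0xfff, value);
    }

    public void map(int address, Memory block) {
        blocks[address >>> 12] = block;
    }
}
```

Note that the 4 MB overhead is exactly the `blocks` array itself: 2^20 object references at 4 bytes per reference on a 32-bit JVM.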






Tip #7: instanceof is faster
on classes


Performing
instanceof on a class is far quicker than
performing it on an interface. Java's single inheritance model means that on a
class, instanceof is simply one subtraction and
one array lookup; on an interface, it is an array
search.



Where this overhead is a problem, we can
make further optimizations. Observe that memory in the physical address space
falls into three distinct categories:





RAM



Physical RAM is mapped from the zero
address upward. It is frequently accessed and low latency.




ROM



ROM chips can exist at any address.
They are infrequently accessed and low latency.




I/O



Memory-mapped I/O can exist at any
address. It is fairly frequently accessed, but is generally higher latency than
RAM.


For addresses that fall within the
RAM of the real machine, we use a one-stage lookup. This ensures that accesses
to RAM are as low latency as possible. For accesses to other addresses, those
occupied by ROM chips and memory-mapped I/O, we use a two-stage lookup, as in Figure
9-3.



Figure 9-3. Physical
address space with a two-stage lookup




Now a memory "get" from RAM has three stages:

return addressSpace.get(address);
return blocks[address >>> 12].get(address & 0xfff);
return memory[address];


And one from a higher address has four, because the middle line now performs two array lookups, one per stage of the table:

return addressSpace.get(address);
return blocks[address >>> 22][(address >>> 12) & 0x3ff].get(address & 0xfff);
return memory[address];
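Putting the two cases together, the mixed lookup might be sketched as follows. The names, and the 32 MB RAM size, are assumptions for illustration:

```java
// Sketch of the mixed one-/two-stage lookup; names and sizes are assumptions.
abstract class Memory {
    public abstract byte getByte(int offset);
}

class MappedAddressSpace {
    static final int RAM_SIZE = 32 * 1024 * 1024;   // assumed emulated RAM

    // One-stage table: a flat array of block references covering RAM only.
    private final Memory[] ramBlocks = new Memory[RAM_SIZE >>> 12];

    // Two-stage table for everything above RAM: 2^10 directories of
    // 2^10 blocks each, so an unmapped 4 MB region costs one null reference.
    private final Memory[][] highBlocks = new Memory[1 << 10][];

    public byte getByte(int address) {
        // Both sides of the comparison are non-negative after >>> 12,
        // so this is effectively an unsigned range check.
        if ((address >>> 12) < (RAM_SIZE >>> 12))
            return ramBlocks[address >>> 12].getByte(address & 0xfff);
        return highBlocks[address >>> 22][(address >>> 12) & 0x3ff]
                .getByte(address & 0xfff);
    }

    public void mapHigh(int address, Memory block) {
        int top = address >>> 22;
        if (highBlocks[top] == null)
            highBlocks[top] = new Memory[1 << 10];   // lazily built directory
        highBlocks[top][(address >>> 12) & 0x3ff] = block;
    }
}
```

RAM accesses stay on the short path; only ROM and memory-mapped I/O pay for the extra indirection.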


This two-layer optimization saves memory without creating a bottleneck in every RAM access. Each call and layer of indirection in a memory "get" performs a function. This is indirection the way it should be used: not for the sake of interfacing, but to achieve the finest balance of performance and footprint.


Lazy initialization is also used in JPC wherever there is a chance that storage will never be used. Thus a new JPC instance has a physical address space whose mappings point to Memory objects that occupy no space. When a 4 KB section of RAM is read from or written to for the first time, the backing object is fully initialized, as in Example 9-1.


Example 9-1. Lazy
initialization




public byte getByte(int offset)
{
    try {
        // Common case: the backing buffer already exists.
        return buffer[offset];
    } catch (NullPointerException e) {
        // First access only: allocate the buffer, then retry the read.
        buffer = new byte[size];
        return buffer[offset];
    }
}
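Fleshed out into a complete block (illustrative; JPC's real class differs in detail), with the same trick applied to writes, the pattern looks like this. Catching the NullPointerException keeps the null check out of the common path: on typical JVMs a try block that never throws costs essentially nothing, and the exception is paid for only once, on the very first access:

```java
// Complete lazy 4 KB block; names are illustrative, not JPC's actual class.
class LazyRamBlock {
    private static final int SIZE = 4 * 1024;
    private byte[] buffer;                    // stays null until first access

    public byte getByte(int offset) {
        try {
            return buffer[offset];            // common path: no null check
        } catch (NullPointerException e) {    // taken once, on first access
            buffer = new byte[SIZE];
            return buffer[offset];
        }
    }

    public void setByte(int offset, byte value) {
        try {
            buffer[offset] = value;
        } catch (NullPointerException e) {
            buffer = new byte[SIZE];
            buffer[offset] = value;
        }
    }
}
```

A freshly allocated Java array is zero-filled, so a lazy block correctly reads as zeroed RAM even before its first write.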


 

