Saturday, October 24, 2009

Recipe 4.11. Getting the N Smallest Items of an Array










Recipe 4.11. Getting the N Smallest Items of an Array





Problem


You want to find the
smallest few items in an array, or the
largest, or the most extreme according to some other measure.




Solution


If you only need to find the single smallest item according to some measure, use
Enumerable#min
. By default, it uses the <=> method to see whether one item is "smaller" than another, but you can override this by passing in a code block.



[3, 5, 11, 16].min
# => 3
["three", "five", "eleven", "sixteen"].min
# => "eleven"
["three", "five", "eleven", "sixteen"].min { |x,y| x.size <=> y.size }
# => "five"



Similarly, if you need to find the single
largest item, use
Enumerable#max
.



[3, 5, 11, 16].max
# => 16
["three", "five", "eleven", "sixteen"].max
# => "three"
["three", "five", "eleven", "sixteen"].max { |x,y| x.size <=> y.size }
# => "sixteen"



By default, arrays are sorted by their natural order: numbers are sorted by value, strings by their position in the ASCII collating sequence (basically alphabetical order, but all lowercase characters precede all uppercase characters). Hence, in the previous examples, "three" is the largest string, and "eleven" the smallest.


It gets more complicated when you need to get a number of the smallest or largest elements according to some measurement: say, the top 5 or the bottom 10. The simplest solution is to sort the list and skim the items you want off of the top or bottom.



l = [1, 60, 21, 100, -5, 20, 60, 22, 85, 91, 4, 66]
sorted = l.sort

#The top 5
sorted[-5…sorted.size]
# => [60, 66, 85, 91, 100]

#The bottom 5
sorted[0…5]
# => [-5, 1, 4, 20, 21]



Despite the simplicity of this technique, it's inefficient to sort the entire list unless the number of items you want to extract approaches the size of the list.




Discussion


The min and max methods work by picking the first element of the array as a "champion," then iterating over the rest of the list trying to find an element that can beat the current champion on the appropriate metric. When it finds one, that element becomes the new champion. An element that can beat the old champion can also beat any of the other contenders seen up to that point, so one run through the list suffices to find the maximum or minimum.


The naive solution to finding more than one
smallest item is to repeat this process multiple times. Iterate over the Array once to find the
smallest item, then iterate over it again to find the next-smallest item, and so on. This is naive for the same reason a bubble sort is naive: you're repeating many of your comparisons more times than necessary. Indeed, if you run this algorithm once for every item in the array (trying to find the n smallest items in an array of n items), you get a bubble sort.


Sorting the list beforehand is better when you need to find more than a small fraction of the items in the list, but it's possible to do better. After all, you don't really want to sort the whole list: you just want to sort the bottom of the list to find the smallest items. You don't care if the other elements are unsorted because you're not interested in those elements anyway.


To sort only the smallest elements, you can keep a sorted "stable" of champions, and kick the
largest champion out of the stable whenever you find an element that's smaller. If you encounter a number that's too large to enter the stable, you can ignore it from that point on. This process rapidly cuts down on the number of elements you must consider, making this approach faster than doing a sort.


The SortedList class from Recipe 4.7 is useful for this task. The min_n method below creates a SortedList "stable" that keeps its elements sorted based on the same block being used to find the minimum. It keeps the stable at a certain size by kicking out the
largest item in the stable whenever a smaller item is found. The max_n method works similarly, but the comparisons are reversed, and the smallest element in the stable is kicked out when a larger element is found.



module Enumerable
def min_n(n, &block)
block = Proc.new { |x,y| x <=> y } if block == nil
stable = SortedArray.new(&block)
each do |x|
stable << x if stable.size < n or block.call(x, stable[-1]) == -1
stable.pop until stable.size <= n
end
return stable
end

def max_n(n, &block)
block = Proc.new { |x,y| x <=> y } if block == nil
stable = SortedArray.new(&block)
each do |x|
stable << x if stable.size < n or block.call(x, stable[0]) == 1
stable.shift until stable.size <= n
end
return stable
end
end

l = [1, 60, 21, 100, -5, 20, 60, 22, 85, 91, 4, 66]
l.max_n(5)
# => [60, 66, 85, 91, 100]
l.min_n(5)
# => [-5, 1, 4, 20, 21]

l.min_n(5) { |x,y| x.abs <=> y.abs }
# => [1, 4, -5, 20, 21]





See Also


  • Recipe 4.7, "Making Sure a Sorted Array Stays Sorted"













No comments:

Post a Comment