Wednesday, October 28, 2009

12.14 Finding Cumulative Sums and Running Averages




I l@ve RuBoard










12.14 Finding Cumulative Sums and Running Averages




12.14.1 Problem



You have a set of observations measured over time and want to compute
the cumulative sum of the observations at each measurement point. Or
you want to compute a running average at each point.





12.14.2 Solution



Use a self-join to produce the sets of successive observations at
each measurement point, then apply aggregate functions to each set of
values to compute its sum or average.





12.14.3 Discussion



Recipe 12.13 illustrates how a self-join can produce
relative values from absolute values. A self-join can do the opposite as
well, producing cumulative values at each successive stage of a set
of observations. The following table shows a set of rainfall
measurements taken over a series of days. The values in each row show
the observation date and the amount of precipitation in inches:



mysql> SELECT date, precip FROM rainfall ORDER BY date;
+------------+--------+
| date | precip |
+------------+--------+
| 2002-06-01 | 1.50 |
| 2002-06-02 | 0.00 |
| 2002-06-03 | 0.50 |
| 2002-06-04 | 0.00 |
| 2002-06-05 | 1.00 |
+------------+--------+


To calculate cumulative rainfall for a given day, sum that
day's precipitation value with the values for all
the previous days. For example, the cumulative rainfall as of
2002-06-03 is determined like this:



mysql> SELECT SUM(precip) FROM rainfall WHERE date <= '2002-06-03';
+-------------+
| SUM(precip) |
+-------------+
| 2.00 |
+-------------+


If you want the cumulative figures for all days that are represented
in the table, it would be tedious to compute the value for each of
them separately. A self-join can do this for all days with a single
query. Use one instance of the rainfall table as a
reference, and determine for the date in each row the sum of the
precip values in all rows occurring up through
that date in another instance of the table. The following query shows
the daily and cumulative precipitation for each day:



mysql> SELECT t1.date, t1.precip AS 'daily precip',
-> SUM(t2.precip) AS 'cum. precip'
-> FROM rainfall AS t1, rainfall AS t2
-> WHERE t1.date >= t2.date
-> GROUP BY t1.date;
+------------+--------------+-------------+
| date | daily precip | cum. precip |
+------------+--------------+-------------+
| 2002-06-01 | 1.50 | 1.50 |
| 2002-06-02 | 0.00 | 1.50 |
| 2002-06-03 | 0.50 | 2.00 |
| 2002-06-04 | 0.00 | 2.00 |
| 2002-06-05 | 1.00 | 3.00 |
+------------+--------------+-------------+


The self-join can be extended to display the number
of days elapsed at each date, as well as the running averages for
amount of precipitation each day:



mysql> SELECT t1.date, t1.precip AS 'daily precip',
-> SUM(t2.precip) AS 'cum. precip',
-> COUNT(t2.precip) AS days,
-> AVG(t2.precip) AS 'avg. precip'
-> FROM rainfall AS t1, rainfall AS t2
-> WHERE t1.date >= t2.date
-> GROUP BY t1.date;
+------------+--------------+-------------+------+-------------+
| date | daily precip | cum. precip | days | avg. precip |
+------------+--------------+-------------+------+-------------+
| 2002-06-01 | 1.50 | 1.50 | 1 | 1.500000 |
| 2002-06-02 | 0.00 | 1.50 | 2 | 0.750000 |
| 2002-06-03 | 0.50 | 2.00 | 3 | 0.666667 |
| 2002-06-04 | 0.00 | 2.00 | 4 | 0.500000 |
| 2002-06-05 | 1.00 | 3.00 | 5 | 0.600000 |
+------------+--------------+-------------+------+-------------+


In the preceding query, the number of days elapsed and the
precipitation running averages can be computed easily using
COUNT( ) and AVG( ) because
there are no missing days in the table. If missing days are allowed,
the calculation becomes more complicated, because the number of days
elapsed for each calculation no longer will be the same as the number
of records. You can see this by deleting the records for the days
that had no precipitation to produce a couple of
"holes" in the table:



mysql> DELETE FROM rainfall WHERE precip = 0;
mysql> SELECT date, precip FROM rainfall ORDER BY date;
+------------+--------+
| date | precip |
+------------+--------+
| 2002-06-01 | 1.50 |
| 2002-06-03 | 0.50 |
| 2002-06-05 | 1.00 |
+------------+--------+


Deleting those records doesn't change the cumulative
sum or running average for the dates that remain, but does change how
they must be calculated. If you try the self-join again, it yields
incorrect results for the days-elapsed and average precipitation
columns:



mysql> SELECT t1.date, t1.precip AS 'daily precip',
-> SUM(t2.precip) AS 'cum. precip',
-> COUNT(t2.precip) AS days,
-> AVG(t2.precip) AS 'avg. precip'
-> FROM rainfall AS t1, rainfall AS t2
-> WHERE t1.date >= t2.date
-> GROUP BY t1.date;
+------------+--------------+-------------+------+-------------+
| date | daily precip | cum. precip | days | avg. precip |
+------------+--------------+-------------+------+-------------+
| 2002-06-01 | 1.50 | 1.50 | 1 | 1.500000 |
| 2002-06-03 | 0.50 | 2.00 | 2 | 1.000000 |
| 2002-06-05 | 1.00 | 3.00 | 3 | 1.000000 |
+------------+--------------+-------------+------+-------------+


To fix the problem, it's necessary to determine the
number of days elapsed a different way. Take the minimum and maximum
date involved in each sum and calculate a days-elapsed value from
them using the following expression:



TO_DAYS(MAX(t2.date)) - TO_DAYS(MIN(t2.date)) + 1


That value must be used for the days-elapsed column and for computing
the running averages. The resulting query is as follows:



mysql> SELECT t1.date, t1.precip AS 'daily precip',
-> SUM(t2.precip) AS 'cum. precip',
-> TO_DAYS(MAX(t2.date)) - TO_DAYS(MIN(t2.date)) + 1 AS days,
-> SUM(t2.precip) / (TO_DAYS(MAX(t2.date)) - TO_DAYS(MIN(t2.date)) + 1)
-> AS 'avg. precip'
-> FROM rainfall AS t1, rainfall AS t2
-> WHERE t1.date >= t2.date
-> GROUP BY t1.date;
+------------+--------------+-------------+------+-------------+
| date | daily precip | cum. precip | days | avg. precip |
+------------+--------------+-------------+------+-------------+
| 2002-06-01 | 1.50 | 1.50 | 1 | 1.5000 |
| 2002-06-03 | 0.50 | 2.00 | 3 | 0.6667 |
| 2002-06-05 | 1.00 | 3.00 | 5 | 0.6000 |
+------------+--------------+-------------+------+-------------+


As this example illustrates, calculation of cumulative values from
relative values requires only a column that allows rows to be placed
into the proper order. (For the rainfall table,
that's the date column.) Values
in the column need not be sequential, or even numeric. This differs
from calculations that produce difference values from cumulative
values (Recipe 12.13), which require that a table
have a column that contains an unbroken sequence.



The running averages in the rainfall examples are based on dividing
cumulative precipitation sums by number of days elapsed as of each
day. When the table has no gaps, the number of days is the same as
the number of values summed, making it easy to find successive
averages. When records are missing, the calculations become more
complex. What this demonstrates is that it's
necessary to consider the nature of your data and calculate averages
appropriately. The next example is conceptually similar to the
previous ones in that it calculates cumulative sums and running
averages, but it performs the computations yet another way.



The following table shows a marathon runner's
performance at each stage of a 26-kilometer run. The values in each
row show the length of each stage in kilometers and how long the
runner took to complete the stage. In other words, the values pertain
to intervals within the marathon and thus are relative to the whole:



mysql> SELECT stage, km, t FROM marathon ORDER BY stage;
+-------+----+----------+
| stage | km | t |
+-------+----+----------+
| 1 | 5 | 00:15:00 |
| 2 | 7 | 00:19:30 |
| 3 | 9 | 00:29:20 |
| 4 | 5 | 00:17:50 |
+-------+----+----------+


To calculate cumulative distance in kilometers at each stage, use a
self-join that looks like this:



mysql> SELECT t1.stage, t1.km, SUM(t2.km) AS 'cum. km'
-> FROM marathon AS t1, marathon AS t2
-> WHERE t1.stage >= t2.stage
-> GROUP BY t1.stage;
+-------+----+---------+
| stage | km | cum. km |
+-------+----+---------+
| 1 | 5 | 5 |
| 2 | 7 | 12 |
| 3 | 9 | 21 |
| 4 | 5 | 26 |
+-------+----+---------+


Cumulative distances are easy to compute because they can be summed
directly. The calculation for accumulating time values is a little
more involved. It's necessary to convert times to
seconds, sum the resulting values, and convert the sum back to a time
value. To compute the runner's average speed at the
end of each stage, take the ratio of cumulative distance over
cumulative time. Putting all this together yields the following
query:



mysql> SELECT t1.stage, t1.km, t1.t,
-> SUM(t2.km) AS 'cum. km',
-> SEC_TO_TIME(SUM(TIME_TO_SEC(t2.t))) AS 'cum. t',
-> SUM(t2.km)/(SUM(TIME_TO_SEC(t2.t))/(60*60)) AS 'avg. km/hour'
-> FROM marathon AS t1, marathon AS t2
-> WHERE t1.stage >= t2.stage
-> GROUP BY t1.stage;
+-------+----+----------+---------+----------+--------------+
| stage | km | t | cum. km | cum. t | avg. km/hour |
+-------+----+----------+---------+----------+--------------+
| 1 | 5 | 00:15:00 | 5 | 00:15:00 | 20.0000 |
| 2 | 7 | 00:19:30 | 12 | 00:34:30 | 20.8696 |
| 3 | 9 | 00:29:20 | 21 | 01:03:50 | 19.7389 |
| 4 | 5 | 00:17:50 | 26 | 01:21:40 | 19.1020 |
+-------+----+----------+---------+----------+--------------+


We can see from this that the runner's average pace
increased a little during the second stage of the race, but then
(presumably as a result of fatigue) decreased
thereafter.










    I l@ve RuBoard



    No comments:

    Post a Comment