Logarithmic Buckets

SQL Tips by Donald Burleson

A bird's-eye view onto big tables is essential for data analysis, with aggregation and grouping being the tools of the trade. Standard, equality-based grouping, however, fails to accommodate continuous distributions of data. Consider a TaxReturns table, for example. Would simple grouping by GrossIncome achieve anything? Undoubtedly it would show many individuals with identical incomes, but the big picture would escape us because of the sheer report size. Normally, we would like to know the distribution of incomes by ranges of values. Choosing the ranges carefully should produce a much more compact report.

There are two elementary methods for choosing ranges. With a linear scale the next range boundary point would be placed by a uniform offset. With a logarithmic scale the offset is increased exponentially. In the tax returns example a familiar linear scale of ranges is 0 - 10K, 10K - 20K, 20K - 30K, ?, while the logarithmic scale is ?, 1K ?10K, 10K - 100K, 100K - 1000K, -

Although a linear scale is conceptually simpler, a logarithmic scale is more natural. It appears that a linear scale has more precision where most tax returns are located, but this is solely by virtue of the magic number 10K that was chosen as an increment. Had a different figure been chosen, say 100K, then the precision would have been lost.

In a capitalist system income is not bounded. There are numerous individuals whose income goes far beyond the ranges where most people exist. As of 2005 there were 222 billionaires in the US. They all do not fit in the same tiny 10K bucket of incomes, of course. Most likely each would be positioned in its unique range, so 222 10K ranges is necessary in order to accommodate only them!

From logarithmic perspective, however, a $1B person is ?only? 5 ranges away from a $10K party. The logarithmic scale report is compact no matter how skewed the distribution is. Changing the factor of 10 into, say, 2 would increase the report size by a factor of log₂(10)≈3, which could still qualify as a compact report. In a sense, the choice of factor in a logarithmic scale is irrelevant.

Logarithmic Scale in Programming

I often hear a fellow programmer suggesting ?Let's allocate 2K of memory?. Why 2K? Well, this size sounds about right for this particular data structure. Fine, if this structure is static, but what about dynamic ones, like any collection type? Allocating 2K every time when we need to grow the structure may be fine by today's standard, but this magic number 2K would look ridiculous a decade later.

Logarithmic scale, however, provides an immortal solution. ?Last time I allocated X bytes of memory, let's allocate twice of that this time?. In short, there is no room for linear solutions in the world abiding by Moore's law.

Let's implement logarithmic scale grouping of incomes in SQL. All that is needed is power, log, and floor functions:

select power(10,floor(log(10,GrossIncome))), count(1)
from TaxReturns
group by floor(log(10,GrossIncome));

It is the floor(log(10,GrossIncome)) function that maps logarithmically spaced range boundaries into a distinct numbers. We group by this expression, although for the purpose of the readability of the result set, it is convenient to transform the column expression within the select list into conventional salary units by exponentiating it.

Skyline Query

Suppose you are shopping for a new car, and are specifically looking for a big car with decent gas mileage. Unfortunately, you are trying to satisfy the two conflicting goals. If querying the Cars relation in the database, then all models that are worse than others by both criteria can be ignored. The remaining set of cars is called the Skyline.

Figure 3.8 shows the Skyline of all cars in a sample set of three models. The vertical axis is seating capacity while the horizontal axis is the gas mileage. The vehicles are positioned in such a way that their locations match their respective profiles. The Hummer is taller than the Ferrari F1 racing car, which reflects the difference in their seating accommodations: 4 vs. 1. The Ferrari protrudes to the right of the Hummer, because it has superior gas mileage.

The Skyline is visualized as a contour of solid lines. The Ferrari outline is lost beneath the Roadster.

More formally the Skyline is defined as those points which are not dominated by any other point. A point dominates the other point if it is as good or better in all the dimensions. In the above example, the Roadster with mileage=20 and seating=2 dominates the Ferrari F1 with mileage=10 and seating=1. This condition can certainly be expressed in SQL. In this example, the Skyline query is:

select * from Cars c
where not exists (
   select * from Cars cc
   where cc.seats >= c.seats and cc.mileage > c.mileage
      or cc.seats > c.seats and cc.mileage >= c.mileage
);

Despite the apparent simplicity of this query, you would not be pleased with this query performance. Even if the seats and mileage columns were indexed, it would not help much as only half of the records on average meet each individual inequality condition. Certainly, there are not many records that satisfy the combined predicate, but leveraging an index to match it is not possible. Bitmapped indexes, which excel with Boolean expressions similar to is here, demand a constant on one side of the inequality predicate.

There is an efficient way to answer a Skyline query. Unfortunately, it is not aligned with the earlier Skyline SQL query attempt. Order all the data by either: seats, mileage, or mileage, seats. Here a composite index might be handy. Compare each record with its predecessor, and discard the one that is dominated by the other.

Let's assume that this example has been extended to 4 records:

Iterating record by record down, the Ferrari would be excluded first, because it is dominated by the neighbor BMW. Then, the BMW itself would be excluded, because it is dominated by the Roadster.

This is an excerpt from the new book SQL Design Patterns: The Expert Guide to SQL Programming by Vadim Tropashko

You can buy it direct from the publisher for 30%-off.

��