11.3. Aggregating by Range (Bands)
Some people have trouble writing SQL queries that return aggregates for bands. Such queries are actually quite easy to write using the case construct. By way of example, look at the problem of reporting on the distribution of tables by their total row counts. For instance, how many tables contain fewer than 100 rows, how many contain 100 to 10,000 rows, how many 10,000 to 1,000,000 rows, and how many tables store more than 1,000,000 rows?
Information about tables is usually accessible through data dictionary views: for instance, INFORMATION_SCHEMA.TABLES, pg_statistic, and pg_tables, dba_tables, syscat.tables, sysobjects and systabstats, and so on. In my explanation here, I'll assume the general case of a view named table_info, containing, among other things, the columns table_name and row_count. Using this table, a simple use of case and the suitable group by can give us the distribution by row_count that we are after:
select case
when row_count < 100
then 'Under 100 rows'
when row_count >= 100 and row_count < 10000
then '100 to 10000'
when row_count >= 10000 and row_count < 1000000
then '10000 to 1000000'
else
'Over 1000000 rows'
end as range,
count(*) as table_count
from table_info
where row_count is not null
group by case
when row_count < 100
then 'Under 100 rows'
when row_count >= 100 and row_count < 10000
then '100 to 10000'
when row_count >= 10000 and row_count < 1000000
then '10000 to 1000000'
else
'Over 1000000 rows'
end
There is only one snag here: group by performs a sort before aggregating data. Since we are associating a label with each of our aggregates, the result is, by default, alphabetically sorted on that label:
RANGE TABLE_COUNT
----------------- ------------
100 to 10000 18
10000 to 1000000 15
Over 1000000 rows 6
Under 100 rows 24
The ordering that would be logical to a human eye in such a case is to see Under 100 rows appear first, and then each band by increasing number of rows, with Over 1,000,000 rows coming last. Rather than trying to be creative with labels, the stratagem to solve this problem consists of two steps:
Performing the group by on two, instead of one, columns, associating with each label a dummy column, the only purpose of which is to serve as a sort key Wrapping up the query as a query within the from clause, so as to mask the sort key thus created and ensure that only the data of interest is returned
Here is the query that results from applying the preceding two steps:
select row_range, table_count
from ( -- Build a sort key to have bands suitably ordered
-- and hide it inside a subquery
select case
when row_count < 100
then 1
when row_count >= 100 and row_count < 10000
then 2
when row_count >= 10000 and row_count < 1000000
then 3
else
4
end as sortkey,
case
when row_count < 100
then 'Under 100 rows'
when row_count >= 100 and row_count < 10000
then '100 to 10000'
when row_count >= 10000 and row_count < 1000000
then '10000 to 1000000'
else
'Over 1000000 rows'
end as row_range,
count(*) as table_count
from table_info
where row_count is not null
group by case
when row_count < 100
then 'Under 100 rows'
when row_count >= 100 and row_count < 10000
then '100 to 10000'
when row_count >= 10000 and row_count < 1000000
then '10000 to 1000000'
else
'Over 1000000 rows'
end,
case
when row_count < 100
then 1
when row_count >= 100 and row_count < 10000
then 2
when row_count >= 10000 and row_count < 1000000
then 3
else
4
end) dummy
order by sortkey;
And following are the results from executing that query:
ROW_RANGE TABLE_COUNT
----------------- -----------
Under 100 rows 24
100 to 10000 18
10000 to 1000000 15
Over 1000000 rows 6
Aggregating by range (bands) requires building an artificial sort key to display results in desired order.
 |