Previous Page Next Page

11.3. Aggregating by Range (Bands)

Some people have trouble writing SQL queries that return aggregates for bands. Such queries are actually quite easy to write using the case construct. By way of example, look at the problem of reporting on the distribution of tables by their total row counts. For instance, how many tables contain fewer than 100 rows, how many contain 100 to 10,000 rows, how many 10,000 to 1,000,000 rows, and how many tables store more than 1,000,000 rows?

Information about tables is usually accessible through data dictionary views: for instance, INFORMATION_SCHEMA.TABLES, pg_statistic, and pg_tables, dba_tables, syscat.tables, sysobjects and systabstats, and so on. In my explanation here, I'll assume the general case of a view named table_info, containing, among other things, the columns table_name and row_count. Using this table, a simple use of case and the suitable group by can give us the distribution by row_count that we are after:

select case
         when row_count < 100
              then 'Under 100 rows'
          when row_count >= 100 and row_count < 10000
               then '100 to 10000'
          when row_count >= 10000 and row_count < 1000000
               then '10000 to 1000000'
          else
               'Over 1000000 rows'
        end as range,
        count(*) as table_count
from table_info
where row_count is not null
group by case
           when row_count < 100
                then 'Under 100 rows'
           when row_count >= 100 and row_count < 10000
                then '100 to 10000'
           when row_count >= 10000 and row_count < 1000000
                then '10000 to 1000000'
           else
                'Over 1000000 rows'
         end

There is only one snag here: group by performs a sort before aggregating data. Since we are associating a label with each of our aggregates, the result is, by default, alphabetically sorted on that label:

RANGE             TABLE_COUNT
----------------- ------------
100 to 10000                18
10000 to 1000000            15
Over 1000000 rows            6
Under 100 rows              24

The ordering that would be logical to a human eye in such a case is to see Under 100 rows appear first, and then each band by increasing number of rows, with Over 1,000,000 rows coming last. Rather than trying to be creative with labels, the stratagem to solve this problem consists of two steps:

  1. Performing the group by on two, instead of one, columns, associating with each label a dummy column, the only purpose of which is to serve as a sort key

  2. Wrapping up the query as a query within the from clause, so as to mask the sort key thus created and ensure that only the data of interest is returned

Here is the query that results from applying the preceding two steps:

select row_range, table_count
from ( -- Build a sort key to have bands suitably ordered
       -- and hide it inside a subquery
       select case
               when row_count < 100
                 then 1
               when row_count >= 100 and row_count < 10000
                 then 2
               when row_count >= 10000 and row_count < 1000000
                 then 3
               else
                 4
             end as sortkey,
             case
               when row_count < 100
                 then 'Under 100 rows'
               when row_count >= 100 and row_count < 10000
                 then '100 to 10000'
               when row_count >= 10000 and row_count < 1000000
                 then '10000 to 1000000'
               else
                 'Over 1000000 rows'
             end  as row_range,
             count(*) as table_count
      from table_info
      where row_count is not null
      group by case
                 when row_count < 100
                   then 'Under 100 rows'
                 when row_count >= 100 and row_count < 10000
                   then '100 to 10000'
                 when row_count >= 10000 and row_count < 1000000
                   then '10000 to 1000000'
                 else
                   'Over 1000000 rows'
               end,
               case
                 when row_count < 100
                   then 1
                 when row_count >= 100 and row_count < 10000
                   then 2
                 when row_count >= 10000 and row_count < 1000000
                   then 3
                 else
                   4
               end) dummy
order by sortkey;

And following are the results from executing that query:

ROW_RANGE         TABLE_COUNT
----------------- -----------
Under 100 rows             24
100 to 10000               18
10000 to 1000000           15
Over 1000000 rows           6

Aggregating by range (bands) requires building an artificial sort key to display results in desired order.


Previous Page Next Page