Grouping Data

So far, you've only applied your aggregate functions to either all the records in a table or a filtered set. This means that if you want to get, say, summed values for subsets of data in a database, it could mean several separate queries. For example:

SELECT SUM(UnitCost) FROM Products WHERE SupplierID = 1;
SELECT SUM(UnitCost) FROM Products WHERE SupplierID = 2;
SELECT SUM(UnitCost) FROM Products WHERE SupplierID = 3;

And so on. Obviously, this isn't ideal. As an alternative, SQL provides a way to divide tables into groups of rows and apply aggregate functions to groups rather than all the records returned by a query. This is when using aggregate functions becomes most powerful.

Filtering Group Data

To group data in SQL, you use the GROUP BY clause, with typical syntax as follows (typical because SELECT queries themselves are so variable):

SELECT ColumnA, AggFunc(AggFuncSpec) FROM Table
WHERE WhereSpec
GROUP BY ColumnA;

Here the values in ColumnA are enumerated, and groups are created for rows with matching values in ColumnA. For example, if there are 40 records with 15 different possible values in ColumnA, then 15 groups will be created. This means that 15 rows are returned, each consisting of an entry for the ColumnA values they share and the result of the aggregate function applied to the rows in a group.

Note that there's actually no need to return ColumnA in the query, but it tends to make sense so that you can identify the group. As with other queries involving aggregate functions, you can't include other columns in the data returned because this wouldn't make logical sense. For example:

SELECT ColumnA, ColumnB, AggFunc(AggFuncSpec) FROM Table WHERE WhereSpec
GROUP BY ColumnA;

Each group might have several rows with ColumnB values, and because the returned rows can only show one, this query is simply too vague to be interpreted, so an error will be raised.

The best way to look at this grouping behavior is to see an example.

GROUPING DATA

Enter and execute the following query:

SELECT StudentID, COUNT(*) AS HighPasses
FROM StudentExam
WHERE Mark > 70
GROUP BY StudentID;

The way in which a database will generally execute this query is to group together rows with the same value in the StudentID column and then execute the COUNT function to find out the number of exams that each of the students in your database has passed with a mark of greater than 70. The result looks as follows (only the first four rows are shown in the output):

StudentID   HighPasses

----------- -----------
1           1
3           2
4           1
6           1
...         ...

You'll notice that, in this example, if a student hasn't achieved any high passes, then he or she isn't listed at all.

Including all Rows in a Group (SQL Server)

In SQL Server, you can also add the ALL keyword to the GROUP BY clause:

SELECT ColumnA, AggFunc(AggFuncSpec) FROM Table
WHERE WhereSpec
GROUP BY ALL ColumnA;

If you do this, then groups will be selected from values in all the rows in the table, not just those filtered by the WHERE clause. However, the values calculated for these groups by aggregate functions do take the WHERE clause into account, meaning that some groups will be shown with misleading (or NULL) values calculated by these functions.

Compare the previous example with the following query:

SELECT StudentID, COUNT(*) AS HighPasses
FROM StudentExam
WHERE Mark > 70
GROUP BY ALL StudentID;

You now add the ALL keyword, meaning that all students will be considered, even if they have no exam marks this high. The result of this is that the COUNT function has no records to count for some students, hence the zero entries in the HighPasses column in the results and the increased number of results returned (again, only the first four rows are shown):

StudentID   HighPasses

----------- -----------
1           1
2           0
3           2
4           1
...         ...

Using the HAVING Clause

Suppose you want to filter your result set not by any actual value in the database but by the results of an aggregate function. For example, perhaps you want to find out which students did particularly well or badly in their exams in general. You don't want to filter based on any particular mark but on the average mark for all exams they took. You might think this would be as simple as adding the condition to the WHERE clause:

SELECT StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
WHERE AVG(Mark) < 50 OR AVG(Mark) > 70
GROUP BY StudentID;

Unfortunately, this won't work. The WHERE clause is used to filter data before the aggregate function is calculated, whereas you need to filter the data on the basis of the aggregate values. If you try to execute this query, you'll get an error.

The answer to the problem (as you've probably already guessed from the heading of this section) is the HAVING clause. This clause is placed after the GROUP BY clause and takes the following form:

HAVING FilterCondition

So, to find out which students scored on average more than 70 percent or less than 50 percent in their exams, you'd use this query:

SELECT StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
GROUP BY StudentID
HAVING AVG(Mark) < 50 OR AVG(Mark) > 70;

This returns the results:

StudentID    AverageMark

------------ ------------
2            48
3            80
7            71
10           73

You can also use HAVING in conjunction with WHERE if you want to filter the data both before and after the aggregate column is calculated. For example, let's suppose you want to restrict yourself to the results of the mathematics exams. There are three math exams, with ExamID values of 5, 8, and 11. So your new query will look like this:

SELECT StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
WHERE ExamID IN ( 5, 8, 11 )
GROUP BY StudentID
HAVING AVG(Mark) < 50 OR AVG(Mark) > 70;

The WHERE filter is applied before the aggregate values are calculated, so any rows in the StudentExam table where the ExamID field doesn't contain one of the values 5, 8, or 11 will be ignored completely. Once the aggregate values have been calculated, you use the HAVING clause to restrict the rows returned to those students whose average over these three exams is either particularly good or particularly bad. The result of this query is as follows:

StudentID    AverageMark

------------ ------------
2            39
10           71

Using Top-N Queries

Sometimes, for whatever reason, you won't want to return all the rows in a database that match the filter criteria. If you only want, say, the top 10 rows, then it makes little sense to get hundreds of rows from a database and only work with 10 of them because this means that more data is being exchanged between servers, which impairs performance. All of the database systems covered in this book support ways of doing this. Unfortunately, these are all different, so you'll look at each system in turn.

SQL Server and Access

SQL as implemented in Microsoft SQL Server and Access allows you to choose how many rows to return using the TOP keyword:

SELECT TOP RowCount ColumnA, ColumnB FROM Table;

The TOP RowCount section shown here can be added to any SELECT query. RowCount can be an absolute number, which will be how many rows to return, or it can be a number followed by the PERCENT keyword, meaning that a percentage of the total rows selected will be returned.

In its most basic usage, the TOP keyword simply returns the first matches found in a table:

SELECT TOP 5 CustomerID, CustomerName FROM Customers;

This query would return data from the first five rows in the Customers table.

However, one important point to note here is that the top rows are snipped off after any sorting is performed. This makes it possible to select, say, the rows with the highest or lowest values in a certain column:

SELECT TOP 10 PERCENT StudentID, Mark
FROM StudentExam
ORDER BY Mark;

The previous query returns the bottom 10 percent of marks in the StudentExam table—it's the bottom because ORDER BY Mark specifies an ascending order (the row with the lowest Mark value is the first one in the result set generated by the query).

RETRIEVING THE TOP FIVE STUDENTS (SQL SERVER/ACCESS)

Execute this query against the InstantUniversity database:

SELECT TOP 5 StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
GROUP BY StudentID
ORDER BY AVG(Mark) DESC;

Here you've retrieved the average mark scored by each student in his or her exams and ordered the student IDs according to this average, with the top average first in the result set. You use the TOP keyword to extract only the top five records, showing the five rows with the highest average mark:

StudentID    AverageMark

------------ ------------
3            80
10           73
7            71
6            68
9            68

Notice that you repeat AVG(Mark), rather than using the AverageMark alias, in the ORDER BY clause because Access doesn't allow computed aliases to be used in GROUP BY or ORDER BY clauses. This isn't a problem for SQL Server.

Also, notice that because SQL Server rounds the averages down to the nearest integer, the order isn't guaranteed to be accurate—in this case, the student with an ID of 9 (Andrew Forster, if you're asking) actually has a slightly higher average than Vic Andrews (with the ID of 6), but because they're both rounded down to 68, SQL Server treats them as equal.

This last point raises another issue—what if the cut-off point occurs in the middle of a sequence of several rows with the same value in the ORDER BY column? What if you asked for the top four values instead of the top five? SQL Server simply returns whatever rows it places at the top of the result set, so the record for Vic Andrews will be returned but not that for Andrew Foster.

To ensure that all rows with matching values are returned, you need to add WITH TIES to your query:

SELECT TOP 4 WITH TIES StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
GROUP BY StudentID
ORDER BY AverageMark DESC;

This returns the rows for both Vic Andrews and Andrew Foster, so five rows rather than four are returned. This is the default behavior for Access, so the WITH TIES keywords aren't supported.

Note

WITH TIES can only be used if the query has an ORDER BY clause.

MySQL

In MySQL, the LIMIT keyword is used in a similar way (although in a different place) as the TOP keyword:

SELECT ColumnA, ColumnB
FROM Table
LIMIT StartRecord, TotalRecords;

Here TotalRecords is the total number of rows to return, and StartRecord shows how many rows to omit. This is slightly more flexible than TOP (although no PERCENT equivalent exists) because you can get chunks of data at a time.

For example, you could use the following SQL query in MySQL:

SELECT CustomerID, CustomerName FROM Customers LIMIT 0, 5;

This would get data from the first five records in the Customers table. If only one number is supplied, this is taken as the number of rows to return, and a default of zero is assumed for the starting row. So, you could have written the previous query as this:

SELECT CustomerID, CustomerName FROM Customers LIMIT 5;

As with TOP, combining this keyword with ORDER BY can be a useful technique.

RETRIEVING THE TOP FIVE STUDENTS (MYSQL)

Execute this query against the InstantUniversity database:

SELECT StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
GROUP BY StudentID
ORDER BY AverageMark DESC
LIMIT 0, 5;

Again, you retrieve the average mark scored by each student in his or her exams and order the records accordingly. You add the LIMIT 0, 5 clause to specify that you want to retrieve only five records, starting with the first record (record number zero):

StudentID    AverageMark

------------ ------------
3            80
10           73.5
7            71.3333
9            68.6667
6            68.3333

DB2

DB2 uses a FETCH FIRST clause to limit the number of rows returned from a query. This has two basic forms; to retrieve just the top row, you can write the following:

SELECT CustomerID, CustomerName FROM Customers
FETCH FIRST ROW ONLY;

If you don't specify the number of rows to retrieve, only one row will be returned. To fetch more than one row, you must specify a positive integer value in the FETCH FIRST clause. For example, to fetch five rows, you can write this:

SELECT CustomerID, CustomerName FROM Customers
FETCH FIRST 5 ROWS ONLY;

Again, you can use this clause in conjunction with an ORDER BY clause to retrieve the highest or lowest values in the result set. The FETCH FIRST clause is placed at the end of the query, after the ORDER BY clause.

RETRIEVING THE TOP FIVE STUDENTS (DB2)

Enter and execute this query:

SELECT StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
GROUP BY StudentID
ORDER BY AverageMark DESC
FETCH FIRST 5 ROWS ONLY;

Here again you retrieve the five students with the highest average marks from the StudentExam table by averaging the Mark value for each separate StudentID using the AVG function. You limit the result set to five rows by appending the clause FETCH FIRST 5 ROWS ONLY to the query:

StudentID    AverageMark

------------ ------------
3            80
10           73
7            71
6            68
9            68

Note that, as with SQL Server, the order isn't guaranteed to be 100-percent accurate because the averages are rounded down to fit into the INT data type.

As with SQL Server, if the cut-off point comes in the middle of a group of records with the same value in the ORDER BY column, the values that are (arbitrarily) placed later won't be returned from the query.

To get around this, you need to use the RANK function, which assigns a rank to every row returned by a query. RANK has the following syntax:

RANK() OVER (ORDER BY ColumnName ASC | DESC)

You use RANK much like any other SQL function: to add a new column to the result set returned from the query. This column indicates what rank each row has when the results are ordered by the column named in the RANK() OVER clause. For example, to add a column indicating the rank for each of the students based on the average mark they scored in the exams they took, you'd use the following query:

SELECT RANK() OVER (ORDER BY AVG(Mark) DESC) AS Ranking,
        StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
GROUP BY StudentID;

So, to retrieve the top four students, taking any ties into account, you just need this query with a WHERE Ranking <= 4 clause, right? Nearly. Unfortunately, though, you can't access the computed Ranking column in the same query. Instead, you need to define this query as a subquery and then query that subquery, adding the WHERE clause to the outer query. The syntax for creating a subquery in DB2 is as follows:

WITH SubQueryName (SubQueryColumnList)
AS (SubQueryStatement)
... Outer query ...

So, when you put all that together for your top students query, you end up with this monster of a SQL statement:

WITH RankedMarks (Ranking, StudentID, AverageMark)
AS (
   SELECT RANK() OVER (ORDER BY AVG(Mark) DESC) AS Ranking,
           StudentID, AVG(Mark) AS AverageMark
   FROM StudentExam
   GROUP BY StudentID)
SELECT Ranking, StudentID, AverageMark
FROM RankedMarks
WHERE Ranking <= 4;

Oracle

Finally, Oracle allows a similar query to be performed using the ROWNUM keyword, where every record in a table is assigned a row number according to its position, separate from the ID of the row or any other data. This keyword, however, must be specified in a WHERE clause with a comparison operator, and so it isn't as flexible (and won't work with ORDER BY because reordered rows maintain their original row number).

SELECT CustomerID, CustomerName FROM Customers
WHERE ROWNUM <= 5;

To get around this, you can embed your SELECT statement as a subquery into an outer query, querying that subquery for the top n rows. For example, to retrieve the first five rows sorted alphabetically by CustomerName, use the following:

SELECT CustomerID, CustomerName FROM (
   SELECT CustomerID, CustomerName
   FROM Customers
   ORDER BY CustomerName ASC)
WHERE ROWNUM <= 5;

RETRIEVING THE TOP FIVE STUDENTS (ORACLE)

Enter and execute this query:

SELECT StudentID, AverageMark FROM (
   SELECT StudentID, AVG(Mark) AS AverageMark
   FROM StudentExam
   GROUP BY StudentID
   ORDER BY AverageMark DESC
)
WHERE ROWNUM <= 5;

Here you define a subquery that retrieves the average mark for each student over all exams taken, sorted in descending order (so the student with the highest average comes first):

SELECT StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
GROUP BY StudentID
ORDER BY AverageMark DESC

To retrieve the top five students, you embed this subquery into a SELECT statement and use the ROWNUM keyword to select just the top five rows:

StudentID    AverageMark

------------ ------------
3            80
10           73.5
7            71.3333333
9            68.6666667
6            68.3333333

Although rounding down to integers doesn't occur in this example, you still have the problem that some rows may be arbitrarily discarded if the cut-off point occurs in the middle of a group of rows with the same average mark. The way around this is similar to the DB2 approach—you use the RANK() function. This has the same syntax and is used in the same way as in DB2:

RANK() OVER (ORDER BY ColumnName ASC | DESC)

You use this to add a new column to your subquery, containing a value for each row that indicates the rank of that row in the result set. The subquery that uses this function has the same form as the DB2 subquery:

SELECT RANK() OVER (ORDER BY AVG(Mark) DESC) AS Ranking,
        StudentID, AVG(Mark) AS AverageMark
FROM StudentExam
GROUP BY StudentID

To retrieve the top four placed students (taking any ties into account), the outer query just needs to retrieve all the rows from this subquery that are ranked four or less:

SELECT Ranking, StudentID, AverageMark FROM (
   SELECT RANK() OVER (ORDER BY AVG(Mark) DESC) AS Ranking,
           StudentID, AVG(Mark) AS AverageMark
   FROM StudentExam
   GROUP BY StudentID)
WHERE Ranking <= 4;