Chapter 4: Summarizing and Grouping Data

The previous two chapters introduced the basics of SQL usage, and in fact this is all you'll need for some applications. However, there's a lot more that becomes possible when you delve into the wealth of additional facilities contained in SQL. In this chapter, you'll look at ways of summarizing and grouping data. This enables you to perform many versatile data manipulation techniques. For example, you can find out the maximum value or the average value in a column (say, the highest mark scored by a student in any exam). Or, perhaps more usefully, you can look at maximum or average values for columns based on groupings defined by other columns (so you could group the rows in a table according to the exam taken and then find out the minimum/maximum/average scores for each exam).

Of course, you could create much of this functionality using your own application code. After using basic SQL to extract data from a database, you can summarize and manipulate what you've extracted to your heart's content—but aggregate functions really are best done in SQL. After all, if you're going to examine hundreds of rows to calculate a single row summary answer, then it's much better to perform the aggregate calculation in the database and then send only the single row result over the network.

Summarizing Data

SQL includes several ways of looking at multiple rows simultaneously and extracting summary data. Most of this involves using aggregate functions. In the following sections, you'll look at how to use aggregate functions to achieve the following:

Count rows and columns
Obtain the sum of values from a single column
Calculate average values of columns
Get maximum and minimum values from columns
Limit the number of rows returned by a query and with which the previous calculations work

Performing Row and Column Counting

SQL includes a function called COUNT that you can use for counting the number of rows that meet a certain condition. You can use this function in several ways, but the basic syntax is the same in all cases. In general, you'll place this function inside a SELECT query as follows:

SELECT COUNT(CountSpecification) FROM Table;

As this is a SELECT statement, you can include a WHERE clause and anything else you might use in a SELECT statement. Without a filter, all rows will be processed by the COUNT function; otherwise, you'll only be applying it to the subset of the rows you've chosen.

CountSpecification can be one of the keywords described in Table 4-1.

Table 4-1: Count Specification Keywords
Count Specification	Meaning
*	Count all rows selected, including those with NULL values.
ALL Column	Count all rows with a non-NULL value for the specified column. This is the default operation if you simply specify Column without ALL or DISTINCT.
DISTINCT Column	Count all unique rows with a non-NULL value for the specified column.

Note

Note that the ALL and DISTINCT keywords can't be used with Microsoft Access databases in this context. Only * or simply Column will work.

One interesting feature is that you name the value returned in much the same way as you named calculated columns in Chapter 2, "Retrieving Data with SQL":

SELECT COUNT(CountSpecification) AS ColumnName FROM Table;

You can also use several COUNT functions at once, separated with commas just as if you were selecting several columns. However, you can't mix this function with column names as if it were a normal SELECT statement, at least not without considering groups, which you'll do later in this chapter. In other words, you can't do something like this:

SELECT COUNT(ReportsTo), FirstName, LastName
FROM Employees;

The previous variations are best learned with a few quick examples. The following is the simplest case:

SELECT COUNT(*) AS LegendCount FROM ArthurianLegends;

The previous expression returns the number of records in the table ArthurianLegends, returning it as a single entry in a column called LegendCount.

Here's another:

SELECT COUNT(ALL NameOfSwordInLegend) FROM ArthurianLegends;

This will return a single entry in an unnamed column, which indicates how many non-NULL values there are in the NameOfSwordInLegend column of the ArthurianLegends table.

Finally, the following:

SELECT COUNT(DISTINCT NameOfSwordInLegend)
FROM ArthurianLegends;

is practically the same as the previous, but this time you're only counting unique values. Should the text Excalibur appear several times in this column in different rows, it'll still only be counted once.

COUNTING ROWS

Execute the following query against the InstantUniversity database:

SELECT COUNT(*) AS NumberOfExams,
        COUNT(DISTINCT SustainedOn) AS UniqueDates,
        COUNT(Comments) AS ExamsWithComments
FROM Exam;

This query actually performs three separate counting calculations for all the records in the Exam table. Because you want to be able to tell the values apart, you name each calculation according to the calculation being performed.

The first calculation uses COUNT(*) to obtain the total number of rows in the table, storing it in a column called NumberOfExams.

The second calculation uses COUNT(DISTINCT SustainedOn) to determine the total number of unique entries in the SustainedOn column, storing the result in a column called UniqueDates.

The third calculation uses COUNT(Comments)—equivalent to COUNT(ALL Comments) because ALL is the default behavior—to count all the non-NULL entries in the Comments column, returning the result in a column called ExamsWithComments.

The output from this query is as follows:

NumberOfExams  UniqueDates  ExamsWithComments

-------------  -----------  -----------------
11             10           9

Even with this simple example, it should already be apparent that aggregate functions such as COUNT can be extremely handy.

Retrieving Column Totals

The next function you'll look at is SUM, used to calculate the total value of a column. Because this function performs a mathematical summing operation, it can only be used where the data type of the column you want to examine is appropriate. It works fine with numerical types, but it won't work at all with string values.

The syntax is much the same as with COUNT:

SELECT SUM(SumSpecification) FROM Table;

The only difference between the SUM function and the COUNT function is that you can't use a wildcard (*) in a SUM function. However, you specify ALL or DISTINCT in the same way (although not in Access) to select between all values or just unique ones. For example:

SELECT SUM(ALL Age) FROM Students;

This query will return a single entry in an unnamed row that's the sum of all entries in the Age column of a table called Students.

SUMMING COLUMNS

Enter and execute the following query:

SELECT SUM(Credits) AS TotalCredits FROM Course;

This simple example just adds up the entries in the Credits column of the Course table. It should give the following output:

TotalCredits

------------
55

Getting Column Averages

Column averages (arithmetic mean values) are easily calculated by dividing the sum of the column by the number of rows summed, but to make it even easier you can use the AVG function. This function works in the same way as those you've already examined:

SELECT AVG(AvgSpecification) FROM Table;

As with SUM, you can use ALL or DISTINCT in AvgSpecification (but, again, not in Microsoft Access). In both cases, NULL values are ignored.

Note

Note that only this function exists for average values—there are no equivalents for obtaining modal or median values. However, these are relatively easy to calculate with combinations of other techniques, so this isn't really a problem.

For example, the following query obtains the average age of the students in a Students database by getting the average value of data in the Age column:

SELECT AVG(ALL Age) FROM Students;

GETTING COLUMN AVERAGES

Execute this query against the InstantUniversity database:

SELECT AVG(Mark) AS AverageMark
FROM StudentExam
WHERE StudentID = 10;

This simple example uses the AVG function to calculate the average exam mark achieved by the student with an ID of 10, from data in the StudentExam table. Note that the answer you receive may depend on the RDBMS you're using. In some cases, the result will be converted into the same data type as the column. In SQL Server and DB2, for example, this column is of type INT, giving the answer shown shortly. In Oracle and Access, the column data type is NUMBER, which can hold floating-point values, so the answer will be 73.5. It's important to be aware of this to avoid getting false or inaccurate results.

Note
To get around this problem, you need to convert the column to a floating-point type before calculating the average. You'll look at data type conversion functions in Chapter 5, "Performing Calculations and Using Functions."

The results (with rounding) are as follows:

AverageMark

-----------
73

Retrieving Maximum and Minimum Values

The last two aggregate functions you'll look at here are MAX and MIN, which return maximum and minimum values of columns. These work in the same way as most of the other functions you've looked at in this chapter:

SELECT MAX(MaxSpecification) FROM Table;
SELECT MIN(MinSpecification) FROM Table;

As before, you can use ALL or DISTINCT in the specifications of these functions; although with these functions, this is for compatibility reasons only. Looking at either all values or only unique ones won't make a blind bit difference to the functionality here because how many times a value occurs is irrelevant when you're looking for extreme values.

MAXIMUM AND MINIMUM COLUMN VALUES

Enter and execute this query:

SELECT MAX(Mark) AS TopMark, MIN(Mark) AS BottomMark
FROM StudentExam
WHERE ExamID = 6;

In this, the last simple example of aggregate functions, you use the MAX and MIN functions to obtain the maximum and minimum values in the Mark column of the StudentExam table. This returns the top and bottom marks scored by any student in the electronics exam:

TopMark     BottomMark

----------- -----------
84          63