Combining Data From Queries

You may have noticed that, although you used data from various different tables in your subqueries, you only actually returned either a calculated column or column data from a single table. Sometimes it's useful to be able to retrieve data from multiple tables and to return all of that data, or a subset of that data, as a single set of rows.

SQL provides you with several set operators that allow you to combine SQL queries for this purpose—namely, UNION, UNION ALL, INTERSECT, and DIFFERENCE.

ANSI SQL refers to the DIFFERENCE operator, but it's referred to by other terms, depending on the database. Oracle calls this operator MINUS, and DB2 calls it EXCEPT.

The basic syntax for queries using these operators is as follows:

SELECT ColumnA, ColumnB FROM TableA
<Operator>
SELECT ColumnC, ColumnD FROM TableB;

The result will be two columns of data (generally called ColumnA and ColumnB) that contain data from all four columns. The actual set of data returned when you combine queries in this manner will vary depending on which operator you use.

The general rule for using these operators is that the data you extract from one table must have the same number of columns, and those columns must have the same data types, as the data extracted from other tables (or, at least, must be converted into the correct data types). So, as you can see, these operators were designed for use when you want to combine the contents of tables that have similar structure but different data. This means that in general, except of course where fortune (or indeed design) dictates, you have several tables with the same column data types.

The general technique for combining data from dissimilar tables is to use table joins, which you'll learn about in the next chapter.

A classic example of where you can use the UNION clause to great effect is when comparing an archived version of a table with the current version. Say, for example, that you archived the data in your InstantUniversity database on a yearly basis; you could then extract data from a table in the archived Class of 2002 database and compare it with data from the equivalent table in the current Class of 2003 database.

In general, however, you'll only extract a subset of the columns in each table. For example, if you were assembling a comparison between your products and those sold by a competitor, you might find that you could extract and combine the "similar" columns (product name, cost, and so on) from the tables carrying your own and your competitor's product information.

Let's start by examining the UNION operator.

Using the UNION Operator

The general syntax required to use UNION is simple:

SELECT ColumnA, ColumnB FROM TableA
UNION
SELECT ColumnC, ColumnD FROM TableB;

In general, whenever you use the UNION keyword, the column names are taken from the first SELECT query, but DB2 will assign arbitrary names if the column names aren't identical.

The UNION operator is supported in MySQL only from version 4.0 onward.

Suppose you have one table called Products with ProductID, ProductName, and ProductCost columns and one table called CompetitorProducts with ID, Name, and Cost columns, where the data types match. You could combine data from both these tables as follows:

SELECT ProductID, ProductName, ProductCost FROM Products
UNION
SELECT ID, Name, Cost FROM CompetitorProducts;

This would result in three columns of data—ProductID, ProductName, and ProductCost—containing data from both tables.

When you use UNION, you'll often need to include calculated columns or data type conversion to get column data to match. If, in the previous example, ProductID were a string value and ID were numeric, you would need to convert ID into a string. A handy way to do this would be to use the string conversion functions you looked at in the previous chapter. For example, you could use this on SQL Server:

SELECT ProductID, ProductName, ProductCost FROM Products
UNION
SELECT STR(ID), Name, Cost FROM CompetitorProducts;

Let's look at a working example.

COMBINING DATA WITH UNION

Enter and execute the following query:

SELECT Name, 'Professor' As Role FROM Professor
   WHERE ProfessorID = (
      SELECT ProfessorID FROM Class WHERE ClassID = 1)
UNION
SELECT Name, 'Student' FROM Student
   WHERE StudentID IN (
      SELECT StudentID FROM Enrollment WHERE ClassID = 1);

Note that this code won't work with MySQL 4.0 because it contains subqueries. A version of this query using table joins instead, which will work with MySQL, is available in the code download.

The query in this example extracts names from the Professor and Student tables that are involved in a specific class (subqueries are used here to match IDs with values in the Class and Enrollment tables, respectively). As well as extracting data, you provide a fixed value for a new calculated column, Role, which shows which table the name has come from:

   Name          Role

   ------------- ---------
   Anna Wolff    Student
   John Jones    Student
   Julia Picard  Student
   Prof. Dawson  Professor
   ...           ...

It's well worth noting here that the resultant data is sorted together rather than having two sorted lists on top of one another. You can use ORDER BY and filter on the result set just as you do with other data returned by queries.

Note
It's normal for names to be sorted by last name rather than by first name. You could do this using the FormatName() function created in Chapter 5, "Performing Calculations and Using Functions."

The previous example is pretty straightforward, but it doesn't quite fully illustrate how the UNION operator works. In set theoretical terms, a union of two sets of data will contain every member of each data set, but with each member only being counted once. So, the following:

{1, 2, 3, 4} UNION {3, 4, 5, 6}

produces this:

   {1, 2, 3, 4, 5, 6}

In order to demonstrate this, let's use another query, which is designed to extract the underperforming students:

SELECT StudentID
   FROM StudentExam
   WHERE Mark < 40
UNION
SELECT StudentID
   FROM Enrollment
   WHERE GRADE < 40
ORDER BY StudentID;

This query returns the StudentID for every student who received one or more exam mark under 40, and every student who received one or more class grade under 40. The results are as follows:

   StudentID

   -----------
   2
   5
   8

If you run each of the two queries involved in this UNION operation separately, you'll see that there's one student (with an ID of 5) who has received an exam mark under 10 and a class grade under 40. However, this student is only listed once in the result set. What in effect you have in this case is this:

{2, 5, 8} UNION {5},

which produces this:

   {2, 5, 8}

Keeping Duplicate Rows

SQL allows you to override the usual rules of set theory and return all members of each set of data regardless of duplicates. You do this by simply using the UNION ALL operator:

SELECT ColumnA, ColumnB FROM TableA
UNION ALL
SELECT ColumnC, ColumnD FROM TableB;

By simply adding the ALL keyword to your UNION operator, your result set will include duplicate data.

This is in contrast to other SELECT queries, which as you've seen have required you to use the DISTINCT keyword to remove duplicate data.

This means that the following:

{1, 2, 3, 4} UNION ALL {3, 4, 5, 6}

produces this:

   {1, 2, 3, 3, 4, 4, 5, 6}

This can be useful if you want to know the number of occurrences of a particular value in more than one table. Let's return to the underperforming students and perform the same query but using UNION ALL:

SELECT StudentID
   FROM StudentExam
   WHERE Mark < 40
UNION ALL
SELECT StudentID
   FROM Enrollment
   WHERE GRADE < 40
ORDER BY StudentID;

The results are as follows:

   StudentID

   -----------
   2
   5
   5
   8

Now, student 5, who was returned by both queries, is counted twice. This is quite useful because now you know that, although students 2 and 8 have suffered the odd bad exam grade, only student 5 is consistently underperforming in class.

Understanding Intersections and Differences

The last two operators are supported only by DB2 and Oracle, so we'll just present them briefly. The INTERSECT operator allows you to retrieve only the rows that occur in both queries. The following:

{1, 2, 3, 4} INTERSECT {3, 4, 5, 6}

produces this:

   {3, 4}

The EXCEPT (called MINUS in Oracle) operator allows you to retrieve the rows that occur in the first but not the second query. For example, the following:

{1, 2, 3, 4} EXCEPT (or MINUS) {3, 4, 5, 6}

produces this:

   {1, 2, 5, 6}

Let's demonstrate these by shaming the underachievers even more. The following query retrieves the IDs of all those students who scored less than 40 in at least one of their exams and were given at least one overall grade of less than 40:

SELECT StudentID
   FROM StudentExam
   WHERE Mark < 40
INTERSECT
SELECT StudentID
   FROM Enrollment
   WHERE GRADE < 40
ORDER BY StudentID;

The only row returned is, as you know, that for student 5:

   StudentID

   -----------
   5

Let's alter the query a bit to find only those students who have done very badly in one or more exam but who haven't also received at least one very good grade. You can do this using the EXCEPT operator (DB2):

-- DB2 only
SELECT StudentID
   FROM StudentExam
   WHERE Mark < 40
EXCEPT
SELECT StudentID
   FROM Enrollment
   WHERE GRADE > 69
ORDER BY StudentID;

In Oracle, you use the MINUS operator in the same way:

-- Oracle only
SELECT StudentID
   FROM StudentExam
   WHERE Mark < 40
MINUS
SELECT StudentID
   FROM Enrollment
   WHERE GRADE > 69
ORDER BY StudentID;

This query returns the IDs of all the students who scored less than 40 in an exam, minus the set of students who achieved a grade of 70 or more. Two of the usual suspects show up:

   StudentID

   -----------
   2
   5