Filtering Data

So far, all the queries you've seen have returned data from all the rows in a table. However, there are times when this isn't desirable. Often you'll want to extract only a subset of the rows in a table, or even a single row, based on specific criteria. SQL includes a whole host of tools to enable this kind of query, and in the following sections you'll examine them.

First, you'll look at how you can filter out any duplicate rows in your result set. Next, you'll look at how to extract certain rows based on simple conditions and how to combine these conditions for a powerful search facility. You'll move on to see how more advanced conditions and comparisons can give you even greater control over the rows returned, even in situations where you need to be vague in the conditions you set (that is, finding records based on small hints). Finally, you'll examine a topic that will come up time and time again: dealing with NULL data (where no value has been entered for that field).

Eliminating Duplicate Data using DISTINCT

You'll often encounter situations where the same data value occurs multiple times in the same column, whether by design or accident. Sometimes, you may want to ensure that you retrieve only unique rows of data from a query prior to processing. For example, from a Customers table during a purchase, you wouldn't want to charge a customer twice for a single order!

To filter out duplicate rows of data, you use the DISTINCT keyword. Let's start with the simplest case, where you're returning rows from a single column. If you only want to see unique row values for this column, the query would look as follows:

SELECT DISTINCT ColumnA FROM Table;

This can often be very useful. For example, single columns in a database may have a restricted set of values allowed. To find out all the different values in a column, you don't really want to have to read through every row in the table just to find the values that exist. To demonstrate this clearly, say that you wanted to query the Exam table in the InstantUniversity database to get a list of all the dates on which exams have been taken. You might try the following:

SELECT SustainedOn from Exam;

The result would look as follows:

SustainedOn

---------
3/12/2003
3/13/2003
3/11/2003
3/18/2003
3/19/2003
3/25/2003
3/26/2003
3/10/2003
3/21/2003
3/26/2003
3/24/2003

This is actually quite a simple case because there are only 11 rows and only one duplicate value, 3/26/2003, but in any event it's much more convenient to issue the following query:

SELECT DISTINCT SustainedOn FROM Exam;

The results are as follows:

SustainedOn

------------
3/10/2003
3/11/2003
3/12/2003
3/13/2003
3/18/2003
3/19/2003
3/21/2003
3/24/2003
3/25/2003
3/26/2003

This time the duplicate value has been removed and you have a simple list of unique exam dates—and as an added bonus you'll see that the dates have been automatically sorted into ascending order.

Let's move on to the case where you want to retrieve multiple columns. When you do this, you can guarantee that every row you obtain contains unique data for the specified set of columns with the following query:

SELECT DISTINCT ColumnA, ColumnB FROM Table;

For example, the following query will return only unique combinations of customer names and credit card details:

SELECT DISTINCT CustomerName, CreditCard FROM Customers;

Note that this doesn't mean you won't have duplicates in either column, only that each combination is unique. You could, for example, have several customers with the same name but with different credit card numbers. To illustrate this point, suppose you changed the query on InstantUniversity as follows:

SELECT DISTINCT ExamID, SustainedOn FROM Exam;

You would obtain the full 11 rows from the database. This is because there are 11 different combinations of these two fields, even though there's a duplicate in the SustainedOn column. Always remember that DISTINCT acts on the combination of columns you use in the query.

Using WHERE Conditions

The WHERE clause allows you to filter queries by demanding that the rows in your result set satisfy a particular condition:

SELECT ColumnA, ColumnB FROM Table
WHERE Condition;

Condition is very flexible, allowing you to test for equalities and inequalities in column data, ranges of values to look for, and much more. You achieve all this using a simple syntax that includes various operators and keywords that, when combined, allow you to search for pretty much anything.

Before you get to the more complex usages of this clause, though, you'll look at the simplest: equalities. These are particularly useful because they allow you to constrain the rows you obtain from queries in the tightest way. The syntax is as follows:

SELECT ColumnA, ColumnB FROM Table
WHERE ColumnC = Value;

The WHERE clause lets you apply a search condition to the rows of data gathered by your SELECT query. The search condition can take various forms, all of which can be evaluated to a Boolean value. In this case, you're only interested in rows for which the condition ColumnC = Value evaluates to TRUE; in other words, you're interested in those rows in which ColumnC has a value of Value. If this condition is TRUE, then the row is returned. If it evaluates to FALSE or UNKNOWN (because the presence of a NULL value), then it's ignored.

Note that there's no need for a WHERE clause of this type to use a column that's being returned by the query (here only ColumnA and ColumnB are returned for rows). However, in practice, because you know what's contained in ColumnC, there's no need to extract this data from the database.

For example, you might extract a single row from a Customers table based on the ID of the customer (assuming that customer IDs are unique):

SELECT FirstName, LastName, Address FROM Customers
WHERE CustomerID = 7;

Alternatively, you could extract all the customers with a certain last name:

SELECT FirstName, LastName, Address FROM Customers
WHERE LastName = 'Smith';

Of course, in many countries, searching for people called Smith in this way could land you with an awful lot of results!

You don't have to use literals in these comparisons. You could, for example, use equalities between columns:

SELECT FirstName, LastName, Address FROM Customers
WHERE LastName = FirstName;

Here all customers with identical first and last names will be found. For example, Major Major would be there.

Suppose you wanted to query the InstantUniversity database to find the name of the student whose ID is 3:

SELECT Name FROM Student
WHERE StudentID = 3;

This will yield you the following result:

   Name

   ----------------
   Emily Scarlett

This sort of query is extremely useful when you know the ID of a record and want to get other pertinent data about that record. Rather than extract a lot of potentially useless data in a single query, you can use this technique to home in on the data you actually want.

Using Other WHERE Clause Comparisons

As mentioned earlier, equalities are just one of the many things you can test for in a WHERE clause. You can use several other operators, which vary by RDBMS, as described in Table 2-1.

Table 2-1: Comparison Operators
Operator	RDBMSs	Meaning
=	All	Used to check if values are equal to one another
!=	All except Access	Used to check if values aren't equal to one another
^=	Oracle	Same meaning as !=
<>	All	Same meaning as !=
>	All	True if the first operand is greater than the second operand
>=	All	Used to check if the first operand is greater than or equal to the second
!<	SQL Server, DB2	Same as >= (Means "not less than," which is logically equivalent to "greater than or equal to")
<	All	True if the first operand is less than the second operand
<=	All	Used to check if the first operand is less than or equal to the second
!>	SQL Server, DB2	Same as <=

As you can see, there's some redundancy here, with the possibility of a choice of operator to use in some situations. The main reason for this is so you can use the operator that fits in best with your development experience—you're likely to have similar operators available in whatever programming language you're used to using.

WHERE CLAUSE COMPARISONS

An example SQL query that you could run on the InstantUniversity database is as follows:

SELECT ExamID, SustainedOn, Comments FROM Exam
WHERE SustainedOn > '2003-03-20';

In this example, you extract data from the Exam table, only getting data for exams that occur after March 20, 2003. The results from this query are as follows:

   ExamID SustainedOn Comments

   ------ ----------- -----------------------------
   6      3/25/2003
   7      3/26/2003   They'll enjoy this one
   9      3/21/2003   Two hours long. They might find it tricky
   10     3/26/2003   2hr test
   11     3/24/2003   Part two should be an hour and a half long

It's worth noting here that it's the RDBMS that's responsible for interpreting a date or time format string, and the specific string you need to use will depend on factors such as what the RDBMS can interpret and your operating system's locale settings. However, the format YYYY-MM-DD as used in this query works on all platforms. See Chapter 5, "Performing Calculations and Using Functions," for more details on working with date values.

Access expects dates to be enclosed, Visual Basic–style, within pound characters (#), so this query won't work with Access. Instead, you need to use the following:

SELECT ExamID, SustainedOn, Comments FROM Exam
WHERE SustainedOn > #2003-03-20#;

Combining WHERE Conditions

Often a single WHERE clause isn't enough to narrow down the data you require. You might want to get only those rows with specified data in one column and other specified data in another column. Alternatively, you might require rows that have certain data in one column or certain data in another column. Here you need to combine multiple WHERE clauses as follows:

SELECT ColumnA, ColumnB FROM Table
WHERE Condition LINK Condition;

Note that LINK in the previous query isn't a keyword. Instead, it refers to one of the keywords shown in Table 2-2.

Table 2-2: Basic Operators
Link Keyword	Meaning
AND	Both conditions must be met for the records found.
OR	Either condition must be met for the records found.

For example, suppose you wanted to specify particular conditions on the Occupation and FiscalStatus fields when querying the Customers table. You could ask that both of your conditions be met as follows:

SELECT CustomerID, CustomerName FROM Customers
WHERE Occupation = 'Carpenter' AND FiscalStatus = 'Loaded';

Alternatively, you could select rows where either one of the two conditions is met or both are met:

SELECT CustomerID, CustomerName FROM Customers
WHERE Occupation = 'Carpenter' OR FiscalStatus = 'Loaded';

Additionally, the NOT keyword can be applied to a condition, making the condition evaluate to true when it isn't met rather than when it is:

SELECT CustomerID, CustomerName FROM Customers
WHERE Occupation = 'Carpenter'
  AND NOT FiscalStatus = 'Bankrupt';

The following SQL statement queries the InstantUniversity database to select only those records from the StudentExam table where the ExamID field is equal to 1 and the IfPassed field is also 1:

SELECT StudentID, Mark, Comments FROM StudentExam
WHERE ExamID = 1 AND IfPassed = 1;

Here are the results:

   StudentID   Mark        Comments

   ----------- ----------- ------------------------
   1           55          Satisfactory
   8           71          Excellent, great work!

Performing Range Tests

It's possible to combine two comparison operators to select values that fall within a within a specified range. For example:

SELECT ExamID, SustainedOn, Comments FROM Exam
WHERE SustainedOn >= '2003-03-20'
AND SustainedOn <= '2003-03-24';

However, it's also possible to use another of the SQL keywords, BETWEEN, to make such range tests more readable:

SELECT ColumnA, ColumnB FROM Table
WHERE ColumnC BETWEEN LowerLimit AND UpperLimit;

Here the value of ColumnC is checked to see if it's greater than or equal to LowerLimit and less than or equal to UpperLimit. For example:

SELECT ExamID, SustainedOn, Comments FROM Exam
WHERE SustainedOn BETWEEN '2003-03-20' AND '2003-03-24';

You can also use NOT with BETWEEN to exclude a range:

SELECT ExamID, SustainedOn, Comments FROM Exam
WHERE SustainedOn NOT BETWEEN '2003-03-20' AND '2003-03-24';

Defining Set Membership

As well as checking to see if values fall inside (or outside) set ranges of values, it's also possible to check to see if values are one of a set number of possibilities.

Again, you can achieve this using operators and keywords that you've already met:

SELECT ColumnA, ColumnB FROM Table
WHERE ColumnC = Value1
OR ColumnC = Value2
OR ColumnC = Value3;

However, you can achieve this much more elegantly using the IN keyword in your WHERE clause, along with a set of possible values:

SELECT ColumnA, ColumnB FROM Table WHERE ColumnC IN (Set);

Here Set (enclosed in parentheses) is the set of values you're looking for in ColumnC. You can define sets in a number of ways. The simplest is probably to provide a set of literal values in the query:

SELECT ColumnA, ColumnB FROM Table
WHERE ColumnC IN ('Value1', 'Value2', 'Value3');

Alternatively, you can create your set by including by a completely separate SELECT query:

SELECT ColumnA, ColumnB FROM Table
WHERE ColumnC IN (SELECT ColumnD FROM Table2);

Note

This is an example of a subquery, a topic you'll look at in more detail in Chapter 6, "Combining SQL Queries."

As with the other keywords you've seen in this chapter, IN can be combined with NOT to search for values that don't occur in a set:

SELECT ColumnA, ColumnB FROM Table
WHERE ColumnC NOT IN ('Value1', 'Value2', 'Value3');

FINDING RECORDS IN A SET

As a more concrete example, you can try selecting information from the StudentExam table in the InstantUniversity database where the StudentID is one of three values:

SELECT StudentID, ExamID, Mark FROM StudentExam
WHERE StudentID IN (2, 7, 3);

This query produces the following results:

   StudentID   ExamID      Mark

   ----------- ----------- -----------
   2           3           44
   2           5           39
   2           6           63
   3           4           78
   3           7           82
   7           6           84
   7           8           62
   7           11          68

Using Pattern Matching

One of the most powerful ways of filtering rows is to search for text values in column data using patterns. You can, for example, look for rows with text columns that contain a certain word or for more complex patterns of characters. To do this, you use the LIKE keyword (you can also specify NOT LIKE) and supply a pattern to match:

SELECT ColumnA, ColumnB FROM Table
WHERE ColumnC LIKE Pattern;

Pattern is made up of literal characters plus certain wildcards that represent one or more characters. The wildcards shown in Table 2-3 are supported by all the database systems covered in this book.

Table 2-3: Wildcards
Wildcard	Meaning
%	Any string of characters, including strings of zero length
_	Any one character

Caution

Patterns are case sensitive in Oracle and DB2 but not in SQL Server, MySQL, or Access.

For example, to search for a string value that contains the text 'manager', you could use the following:

SELECT CustomerID, ContactName, ContactTitle FROM Customers
WHERE ContactTitle LIKE '%manager%';

You need to be aware of several issues when using patterns. Perhaps most important, you need to be very specific when you use them. Simply including one space too many could break the query, for example:

SELECT CustomerID, ContactName, ContactTitle FROM Customers
WHERE ContactTitle LIKE '%manager %';

This pattern searches for the string 'manager '—including a trailing space. If any columns contain the string 'manager' at the end of their value, there will be no trailing space, and the record won't be matched.

As another example, this query looks for four characters followed by a space at the start of the Name column in the Student table:

SELECT Name FROM Student WHERE Name LIKE '____ %';

This returns all the students whose first names are four characters long:

   Name

   -----------
   John Jones
   Gary Burton
   Anna Wolff

Another issue to be aware of is that wildcards are reserved characters and must be escaped if you want to include them in a pattern without their wildcard meaning. To do this, you need to include more information when these characters occur in the data you're searching. Let's say you want to look for percentages. The following search isn't good enough:

SELECT CustomerName, CustomerRating FROM Customers
WHERE CustomerRating LIKE '%50%%';

Here the % character is interpreted as a wildcard as usual, and the match will actually be for the string '50' rather than '50%'. To get round this, you can use an escape character to force the first % of the %% part of the pattern to be interpreted as the character % rather than a string:

SELECT CustomerName, CustomerRating FROM Customers
WHERE CustomerRating LIKE '%50<Escape Char>%%';

There's no escape character defined in SQL, so instead you need to specify your own using the ESCAPE keyword. You can choose any character you like for this, but you should try to choose one that doesn't occur elsewhere in the string, for example:

SELECT CustomerName, CustomerRating FROM Customers
WHERE CustomerRating LIKE '%50p%%' ESCAPE 'p';

Here p is used as an escape character and causes the first % character to be interpreted as a character, not as an escape code.

Access doesn't support this syntax, but you can escape the wildcard character by enclosing it in square brackets:

SELECT CustomerName, CustomerRating FROM Customers
WHERE CustomerRating LIKE '%50[%]%';

SQL Server also supports this syntax.

PATTERN MATCHING

Let's look at an example from the InstantUniversity database. Suppose you wanted to search for exam results in the StudentExam table that had been described as "great." You could try something like this:

SELECT StudentID, ExamID, Mark, Comments FROM StudentExam
WHERE Comments LIKE '%great%';

This works for SQL Server, MySQL, and Access because pattern matching is case insensitive in these systems. However, it will find 'great' but not 'Great' in DB2 and Oracle. To get around this, you need to convert the column to upper case before applying the match. You can do this in both DB2 and Oracle using the UPPER() function:

SELECT StudentID, ExamID, Mark, Comments FROM StudentExam
WHERE UPPER(Comments) LIKE '%GREAT%';

This query produces the following results:

   StudentID   ExamID      Mark        Comments

   ----------- ----------- ----------- ------------------------
   3           7           82          Great result!
   8           1           71          Excellent, great work!

However, this will also match records with a comment such as 'So far from great I'm expelling this student.' Be careful when using patterns because it's easy to forget things such as this, and sometimes the logical process you use to create a pattern will have unexpected results!

Performing Complex Pattern Matching

SQL Server, Access, and MySQL allow a more complex form of pattern matching based more or less closely on regular expressions. In the case of SQL Server and Access, this consists merely of extensions to the standard LIKE syntax, but MySQL uses a completely different keyword (REGEXP or RLIKE) for regular expression pattern matching.

SQL Server and Access

As well as the wildcards % and _, SQL Server and Access patterns can contain characters enclosed in square brackets. A match will be made if the entire pattern matches using any one of these characters. For example, '%[abcde]%' will match any expression that contains any of the letters A, B, C, D, or E anywhere (matches aren't case sensitive).

You can also use a hyphen to indicate a range of values, so you could also write the previous expression as '%[a-e%]'. You can therefore use [a-z] to represent any alphabetic character, [0-9] to match any numeric character, and [a-z0-9] to match any alphanumeric character.

In addition, SQL Server allows you to prefix the sign ^ to a character or range of characters in order to match only characters that aren't in that range; for example, [^a-z0-9] will match nonalphanumeric characters.

COMPLEX PATTERN MATCHING (ACCESS AND SQL SERVER)

Execute the following query against the InstantUniversity database:

SELECT Name FROM Student
WHERE Name LIKE '% [a-dw-z]%';

This query looks for all student names that contain a space, followed by a character in the range A–D or W–Z. The effect (because this sample data contains only students with two names) is to return a list of all the students whose second names start with one of the first four or last four letters of the alphabet:

   Name

   --------------
   Gary Burton
   Anna Wolff
   Vic Andrews
   Steve Alaska
   Mohammed Akbar

MySQL

The basic syntax for MySQL's regular expression matching is similar but provides more options. You can use square brackets to match one of a set of characters or a specific range of characters, as in SQL Server and Access, but there are some important differences. First, MySQL regular expression patterns are case sensitive, so to match all alphabetic characters, you need to use [a-zA-Z] instead of just [a-z]. Second, matches will be found anywhere in the string, regardless of whether anything comes before or after (unless you explicitly say otherwise), so you don't need (and in fact can't use) the % wildcard.

There are also some more character-matching patterns available, as shown in Table 2-4.

Table 2-4: Character-Matching Patterns
Character	Matches
^	The start of the string. For example, '^B' matches the first but not the second B in 'BOB'.
$	The end of the string. For example, 'B$' matches the second but not the first B in 'BOB'.
.	Any single character.
*	The previous character repeated zero or more times. For example, 'Ke*l' matches 'Kl', 'Kel', 'Keel', and so on.
{n}	The previous character repeated n times. For example, 'b{3}' matches 'bbb' but not 'bb'.

To use MySQL's regular expression matching, you use the REGEXP (or RLIKE) keyword instead of LIKE:

SELECT Name FROM Student WHERE Name REGEXP '^[Aa]';

This is equivalent to the following:

SELECT Name FROM Student WHERE Name LIKE 'A%';

REGULAR EXPRESSION MATCHING (MYSQL)

Execute the following query against the InstantUniversity database:

SELECT Name FROM Student
WHERE Name REGEXP '^[AM].*r$';

This query matches all student names that begin with either 'A' or 'M', followed by any number of characters and ending with the character 'r'. Two names in the sample data match these criteria:

   Name

   --------------
   Andrew Foster
   Mohammed Akbar