Recipe 2.3. Removing Specific Characters from a String


Problem

You want to strip certain characters (e.g., whitespace) from a string.

Solution

XSLT 1.0

Use translate with an empty replace string. For example, the following code can strip whitespace from a string:

translate($input," &#x9;&#xa;&#xd;", "")

XSLT 2.0

Using translate( ) is still a good idea in XSLT 2.0 because it will usually perform best. However, some string removal tasks are much more naturally implemented using regular expressions and the new replace( ) function:

(: \s matches all whitespace characters :)
replace($input,"\s","")
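
As a rough check that the two approaches agree, the sketch below applies both to the same value inside one XSLT 2.0 stylesheet. The sample input is made up for illustration, and the stylesheet can be run against any XML source document.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical input containing a space, tab, newline, and carriage return. -->
    <xsl:variable name="input" select="' a&#x9;b &#xa;c&#xd; '"/>
    <!-- XSLT 1.0 style: every listed character maps to nothing. -->
    <xsl:value-of select="translate($input, ' &#x9;&#xa;&#xd;', '')"/>
    <xsl:text>|</xsl:text>
    <!-- XSLT 2.0 style: the regex class \s covers the same four characters. -->
    <xsl:value-of select="replace($input, '\s', '')"/>
  </xsl:template>
</xsl:stylesheet>
```

Both calls should yield abc, so the expected output is abc|abc.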

Discussion

translate( ) is a versatile string function that is often used to compensate for missing string-processing capabilities in XSLT 1.0. Here you use the fact that translate( ) will not copy characters in the input string that are in the from string but do not have a corresponding character in the to string.

You can also use translate( ) to remove all but a specific set of characters from a string. For example, the following code removes all non-numeric characters from a string:

translate($string, 
          translate($string,'0123456789',''),'')

The inner translate( ) removes all characters of interest (e.g., numbers) to obtain a from string for the outer translate( ), which removes these non-numeric characters from the original string.
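
To make the two steps visible, here is a minimal sketch that stores the inner result in a variable before the outer call. The sample value and the variable name unwanted are illustrative assumptions, not part of the recipe.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical sample value; any string would do. -->
    <xsl:variable name="string" select="'Tel: (555) 123-4567'"/>
    <!-- Inner call: delete the digits, leaving only the characters to discard.
         For this input the result is 'Tel: () -'. -->
    <xsl:variable name="unwanted"
         select="translate($string,'0123456789','')"/>
    <!-- Outer call: delete those leftover characters from the original. -->
    <xsl:value-of select="translate($string,$unwanted,'')"/>
  </xsl:template>
</xsl:stylesheet>
```

For this input the output is 5551234567. Repeated characters in the from string (the space occurs twice here) are harmless, because translate( ) uses only the first occurrence.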

Sometimes you do not want to remove all occurrences of whitespace, but instead want to remove leading, trailing, and redundant internal whitespace. XPath has a built-in function, normalize-space( ), which does just that. If you ever need to normalize based on a character other than the space, you might use the following code (where C is the character you want to normalize):

translate(normalize-space(translate($input,"C "," C")),"C "," C")

However, this transformation won't work quite right if the input string contains whitespace characters other than spaces; i.e., tab (#x9), newline (#xA), and carriage return (#xD). The reason is that the code swaps space with the character to normalize, and then normalizes the resulting spaces and swaps back. If nonspace whitespace remains after the first transformation, it will also be normalized, which might not be what you want. Then again, the applications of non-whitespace normalizing are probably rare anyway. Here you use this technique to remove extra - characters:

<xsl:template match="/">
  <xsl:variable name="input"
       select=" '---this --is-- the way we normalize non-whitespace---' "/>
  <xsl:value-of
       select="translate(normalize-space(
                           translate($input,'- ',' -')),'- ',' -')"/>
</xsl:template>

The result is:

this -is- the way we normalize non-whitespace
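
The caveat about whitespace other than spaces can be seen with a small made-up input that contains tab characters. This is only a sketch of the failure mode, using a hypothetical value:

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical input: a pair of hyphens, then a pair of tab characters. -->
    <xsl:variable name="input" select="'a--b&#x9;&#x9;c'"/>
    <!-- Same swap, normalize, swap-back expression as in the recipe. -->
    <xsl:value-of
         select="translate(normalize-space(
                             translate($input,'- ',' -')),'- ',' -')"/>
  </xsl:template>
</xsl:stylesheet>
```

The output is a-b-c. The doubled hyphen is reduced to one as intended, but normalize-space( ) also collapses the two tabs to a single space, which the final swap then turns into a hyphen that was never in the input.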

XSLT 2.0

Another, more powerful way to remove undesired characters from a string is the XSLT 2.0 replace( ) function, which harnesses the power of regular expressions. Here we use replace( ) to normalize non-whitespace without the caveats of our XSLT 1.0 solution:

<xsl:template match="/">
  <xsl:variable name="input"
       select=" '---this --is-- the way we normalize non-whitespace---' "/>
  <xsl:value-of select="replace(replace($input,'-+','-'),'^-|-$','')"/>
</xsl:template>

This code uses two calls to replace( ). The inner call replaces multiple occurrences of - with a single -, and the outer call removes leading and trailing - characters.
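
The two calls can also be written as separate steps, which makes the intermediate value easy to inspect. The variable names collapsed and trimmed below are illustrative only; the sketch can be run against any XML source document.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:variable name="input"
         select="'---this --is-- the way we normalize non-whitespace---'"/>
    <!-- Step 1: collapse every run of hyphens to a single hyphen. -->
    <xsl:variable name="collapsed" select="replace($input,'-+','-')"/>
    <!-- Step 2: drop a hyphen at the very start or very end of the string. -->
    <xsl:variable name="trimmed" select="replace($collapsed,'^-|-$','')"/>
    <!-- Print both stages in brackets, one per line. -->
    <xsl:value-of
         select="concat('[', $collapsed, ']&#xa;[', $trimmed, ']')"/>
  </xsl:template>
</xsl:stylesheet>
```

The first line of output should be [-this -is- the way we normalize non-whitespace-] and the second [this -is- the way we normalize non-whitespace], matching the result of the XSLT 1.0 version.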

Using Regular Expressions

This chapter introduces one of the veteran programmer's favorite tools for advanced string manipulation: regular expressions (affectionately known as regex). The addition of regex capabilities to XSLT was on the top 10 list of almost every XSLT developer I know. This sidebar is intended for those developers who have not had the pleasure of working with regular expressions or who are too intimidated by them. This is not an exhaustive reference, but it should get you going.

A regex is a string that encodes a pattern to match in another string. The simplest pattern is a literal string itself; that is, the string "foo" can be used as a regular expression. It will match the string "foobar" starting at the first character. However, the real power of regular expressions is revealed only when you begin to wield the special meta-characters recognized by the language.

The most important meta-characters are those used to construct wildcards.

A period or dot (.) matches any single character.
A character class ([aeiou], [a-z], or [a-zA-Z]) matches a list, range, or combination of lists and ranges of characters.
Some common character classes are given special abbreviations. For example, \s is an abbreviation for whitespace characters including space, tab, carriage return, and newline, and \d is short for [0-9]. When there is a backslash abbreviation for a character class, the uppercase version often inverts the match. So, for example, \S matches non-whitespace and \D matches a non-digit. This is not universally true. For example, \n matches a newline, but \N does not mean non-newline (the same goes for \t, tab, and \r, carriage return).
One can negate a character class by beginning it with a ^. For example, [^aeiou] matches any character except these lowercase vowels. This also applies to ranges; [^0-9] is the same as \D.
Literals and wildcards are often mixed together. For example, d[aeiou]g matches "dag", "deg", "dig", "dog", and "dug", as well as any longer string that has these as substrings.
Equally important are the repetition meta-characters that allow preceding characters, wildcards, or combinations thereof to match repeatedly.
The * meta-character means to match the previous expression zero or more times. Hence, be* matches strings containing "b", "be", "bee", "beee", and so on. (10)* matches strings containing "10", "1010", "101010", and so on. Here the parentheses act as a grouping construct. If you remove the parentheses, you get 10*, and the repetition applies only to the 0.
The + meta-character means to match the previous expression one or more times. Hence, be+ matches strings containing "be", "bee", "beee", and so on, but not "b".
The ? meta-character means to match the previous expression zero or one time. Hence, be? matches strings containing "b" and "be".
Very often one needs to be specific with respect to where a regular expression matches. In particular, you will often only want to match a pattern at the start (^) or end ($) of a string, and sometimes you will want to match only if the pattern is anchored at both the start and the end. For example, "^be+" will match "bee keeper" but not "has been". The regex "be+$" will match "to be or not to be" but not "be he alive or be he dead". Further, "^be+$" will match "be" and "bee" but not "been" or "Abe".
The regex machinery presented thus far can handle most of the matching tasks you are likely to encounter. However, there are some so-called context-sensitive matches that cannot be handled by simple regex patterns. Consider wanting to match numbers that start and end with the same digit (11, 909, 3233, etc.). Pure regular expressions can't do this, but most regex engines, including the one specified for XPath 2.0, provide extensions to make this possible.
The facility requires two conventions. First, you mark the portion of the pattern you wish to reference later as a captured group by enclosing it in parentheses; second, you reference the group by its index. For example, (\d)\d*\1 is a regex that matches any number that starts and ends with the same digit. The group is the first digit (\d) and the reference is \1, which means "whatever the first group matched." As you might guess, you can have multiple groups, as in (\d)(\d)\1\2, which will match numbers like "1212" and "9999" but not "1213" or "1221". Back-references like \1 and \2 are used inside the regex itself, for example with the XPath 2.0 matches( ) function. A similar notation using a $ instead of a \ is reserved for cases where the reference occurs outside of the regular expression. This occurs in the function replace( ), where you want to refer to groups in the matching regex from the replacement string. For example, replace($someText, '(\d)\d*', '$1') will replace each sequence of one or more digits in $someText with the first digit in that sequence. This facility is also available in the xsl:analyze-string instruction. We discuss these facilities in more detail in Recipe 2.6 and Recipe 2.10.
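
A few of the patterns above can be tried directly in an XSLT 2.0 stylesheet. The following is a minimal sketch; the test strings are taken from the examples in this sidebar or made up for illustration, and it can be run against any XML source document.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Each item is evaluated in turn; results print one per line. -->
    <xsl:value-of separator="&#xa;" select="(
        matches('bee keeper', '^be+'),
        matches('has been', '^be+'),
        matches('909', '^(\d)\d*\1$'),
        matches('1213', '^(\d)(\d)\1\2$'),
        replace('abc 123 def 4567', '(\d)\d*', '$1')
      )"/>
  </xsl:template>
</xsl:stylesheet>
```

The expected output is true, false, true, false, and abc 1 def 4, each on its own line. The last line shows that replace( ) rewrites every matching digit sequence in the input, not only the first one.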

If you want to explore the world of regular expressions in more depth, you should check out Mastering Regular Expressions, Second Edition by Jeffrey E. F. Friedl (O'Reilly, 2002). If you want more depth on XSLT 2.0's regex flavor, consider XPath 2.0 by Michael Kay (Wrox, 2004) or the W3C recommendations at http://www.w3.org/TR/xquery-operators#string.match and http://www.w3.org/TR/xmlschema-2/#regexs.
