Recipe 2.3. Removing Specific Characters from a String
Problem
You want to strip certain
characters (e.g., whitespace)
from a string.
Solution
XSLT 1.0
Use translate with an empty replace string. For
example, the following code can strip whitespace from a string:
translate($input, " &#x9;&#xa;&#xd;", "")
XSLT 2.0
Using translate( ) is still a good idea in XSLT
2.0 because it will usually perform best. However, some string
removal tasks are much more naturally implemented using regular
expressions and the new replace( ) function:
(: \s matches all whitespace characters :)
replace($input,"\s","")
Discussion
translate( ) is a versatile string function that is often
used to compensate for missing string-processing capabilities in XSLT
1.0. Here you use the fact that translate( ) will
not copy characters in the input string that are in the
from string but do not have a corresponding
character in the to string.
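The dropping behavior is easiest to see when the to string is shorter than the from string. The following is a minimal XSLT 1.0 sketch; the literal 'abcabc' is an illustrative input chosen here, not taken from the recipe. The characters a and b are mapped to A and B, while c has no counterpart and is therefore removed.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- 'a' maps to 'A' and 'b' maps to 'B'. 'c' is the third character
         of the from string, but the to string has only two characters,
         so every 'c' is dropped from the result. -->
    <xsl:value-of select="translate('abcabc', 'abc', 'AB')"/>
    <!-- Output: ABAB -->
  </xsl:template>
</xsl:stylesheet>
```

With an empty to string, as in the Solution, no character of the from string has a counterpart, so all of them are deleted.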
You can also use translate( ) to remove all but a
specific set of characters from a string. For example, the following
code removes all non-numeric characters from a string:
translate($string,
translate($string,'0123456789',''),'')
The inner translate( ) removes all characters of
interest (e.g., numbers) to obtain a from string
for the outer translate( ), which removes these
non-numeric characters from the original string.
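To make the two steps visible, the nested call can be split with an intermediate variable. This is only a sketch: the sample phone number and the variable name unwanted are assumptions made for illustration.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical sample input -->
    <xsl:variable name="string" select="'Tel: (555) 123-4567'"/>

    <!-- Step 1 (the inner call): delete the digits, which leaves exactly
         the characters that should be removed: T e l : space ( ) space
         and a hyphen. -->
    <xsl:variable name="unwanted"
                  select="translate($string, '0123456789', '')"/>

    <!-- Step 2 (the outer call): use those leftovers as the from string,
         again with an empty to string, so only the digits survive. -->
    <xsl:value-of select="translate($string, $unwanted, '')"/>
    <!-- Output: 5551234567 -->
  </xsl:template>
</xsl:stylesheet>
```

The repeated space in the computed from string is harmless, because translate( ) uses only the first occurrence of a character in the from string.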
Sometimes you do not want to remove all occurrences of whitespace,
but instead want to remove leading, trailing, and redundant internal
whitespace. XPath has a built-in function, normalize-space( ), which
does just that. If you ever need to normalize based on characters
other than spaces, then you might use the following code (where C is
the character you want to normalize):
translate(normalize-space(translate($input,"C "," C")),"C "," C")
However, this transformation won't work quite right
if the input string contains whitespace characters other than spaces;
i.e., tab (#x9), newline (#xA), and carriage return (#xD). The reason
is that the code swaps space with the character to normalize, and
then normalizes the resulting spaces and swaps back. If nonspace
whitespace remains after the first transformation, it will also be
normalized, which might not be what you want. Then again, the
applications of non-whitespace normalizing are probably rare anyway.
Here you use this technique to remove extra -
characters:
<xsl:template match="/">
<xsl:variable name="input"
select=" '---this --is-- the way we normalize non-whitespace---' "/>
<xsl:value-of
select="translate(normalize-space(
translate($input,'- ',' -')),'- ',' -')"/>
</xsl:template>
The result is:
this -is- the way we normalize non-whitespace
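The caveat about whitespace other than spaces can be reproduced with a small input. In this sketch the input string is an assumption chosen to contain a tab (written as the character reference &#x9;); everything else is the same expression as in the recipe.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical input: a, two hyphens, b, a tab, c -->
    <xsl:variable name="input" select="'a--b&#x9;c'"/>

    <!-- The first swap turns the hyphens into spaces but leaves the tab
         alone. normalize-space( ) then collapses the two spaces AND
         converts the tab to a single space. The final swap turns both
         spaces into hyphens. -->
    <xsl:value-of
      select="translate(normalize-space(
                translate($input,'- ',' -')),'- ',' -')"/>
    <!-- Output: a-b-c  (the tab has become a hyphen) -->
  </xsl:template>
</xsl:stylesheet>
```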
XSLT 2.0
Another more powerful way to remove undesired characters from a
string is the use of the XSLT 2.0 replace()
function, which harnesses the power of
regular expressions. Here we use replace( ) to
normalize non-whitespace without the caveats of our XSLT 1.0
solution:
<xsl:template match="/">
<xsl:variable name="input"
select=" '---this --is-- the way we normalize non-whitespace---' "/>
<xsl:value-of select="replace(replace($input,'-+','-'),'^-|-$','')"/>
</xsl:template>
This code uses two calls to replace( ). The inner
call replaces multiple occurrences of - with a
single - and the outer call removes leading and
trailing - characters.
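The two calls can also be written as separate steps, which makes it easier to see that whitespace is left untouched. This is a sketch under assumptions: the input string (which contains a tab) and the variable name collapsed are illustrative only.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical input: a, two hyphens, b, a tab, c, two hyphens -->
    <xsl:variable name="input" select="'a--b&#x9;c--'"/>

    <!-- Step 1 (the inner call): collapse each run of hyphens into a
         single hyphen. -->
    <xsl:variable name="collapsed" select="replace($input, '-+', '-')"/>

    <!-- Step 2 (the outer call): remove a hyphen at the very start or
         at the very end of the string. -->
    <xsl:value-of select="replace($collapsed, '^-|-$', '')"/>
    <!-- Output: a-b, then the original tab, then c -->
  </xsl:template>
</xsl:stylesheet>
```

Unlike the translate( ) and normalize-space( ) combination, neither pattern mentions whitespace, so the tab passes through unchanged.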
This chapter introduces one
of the veteran programmer's favorite tools for
advanced string manipulation: regular expressions (affectionately
known as regex). The addition of regex capabilities to XSLT was on
the top 10 list of almost every XSLT developer I know. This sidebar
is intended for those developers who have not had the pleasure of
working with regular expressions or who are too intimidated by them.
This is not an exhaustive reference, but it should get you going.
A regex is a string that encodes a pattern to
match in another string. The simplest pattern is a literal string
itself; that is, the string "foo" can be used as a regular
expression. It will match the string "foobar" starting at the first
character. However, the real power of regular expressions is revealed
only when you begin to wield the special meta-characters recognized
by the language.
The most important meta-characters are those used to construct
wildcards. A period or dot (.) matches a single character. A character
class ([aeiou], [a-z], or [a-zA-Z]) matches a list, range, or
combination of lists and ranges of characters.

Some common character classes are given special abbreviations. For
example, \s is an abbreviation for whitespace characters including
space, tab, carriage return, and newline, and \d is short for [0-9].
When there is a backslash abbreviation for a character class, the
uppercase version often inverts the match. So, for example, \S matches
non-whitespace and \D matches a non-digit. This is not universally
true. For example, \n matches a newline, but \N does not mean
non-newline (the same goes for \t, tab, and \r, carriage return).

One can negate a character class by beginning it with a ^. For
example, [^aeiou] matches any character except these lowercase vowels.
This also applies to ranges; [^0-9] is the same as \D.

Literals and wildcards are often mixed together. For example,
d[aeiou]g matches "dag", "deg", "dig", "dog", and "dug", as well as
any longer string that has these as substrings.

Equally important are the repetition meta-characters that allow
preceding characters, wildcards, or combinations thereof to match
repeatedly. The * meta-character means to match the previous
expression zero or more times. Hence, be* matches strings containing
"b", "be", "bee", "beee", and so on. (10)* matches strings containing
"10", "1010", "101010", and so on. Here the parentheses act as a
grouping construct. If you remove the parentheses, you get 10*, and
the repetition applies only to the 0.

The + meta-character means to match the previous expression one or
more times. Hence, be+ matches strings containing "be", "bee", "beee",
and so on, but not "b". The ? meta-character means to match the
previous expression zero or one time. Hence, be? matches strings
containing "b" and "be".

Very often one needs to be specific with respect to where a regular
expression matches. In particular, you will often only want to match
a pattern at the start (^) or end ($) of a string, and sometimes you
will want to match only if the pattern is anchored at both the start
and the end. For example, "^be+" will match "bee keeper" but not
"has been". The regex "be+$" will match "to be or not to be" but not
"be he alive or be he dead". Further, "^be+$" will match "be" and
"bee" but not "been" or "Abe".

The regex machinery presented thus far can handle most of the
matching tasks you are likely to encounter. However, there are some
so-called context-sensitive matches that cannot be handled by simple
regex patterns. Consider wanting to match numbers that start and end
with the same digit (11, 909, 3233, etc.). Pure regular expressions
can't do this, but most regex engines, including the one specified
for XPath 2.0, provide extensions to make this possible.

The facility requires two conventions. The first requires you to mark
the portion of the pattern you wish to later reference as a captured
group using parentheses, and the second requires you to reference the
group by an index variable. For example, (\d)\d*\1 is a regex that
matches any number that starts and ends in the same digit. The group
is the first digit (\d) and the reference is \1, which means
"whatever the first group matched." As you might guess, you can have
multiple groups such as (\d)(\d)\1\2, which will match numbers like
"1212" and "9999" but not "1213" or "1221".

Back references like \1, \2, etc. are used with the XPath 2.0
matches( ) function. A similar notation using a $ instead of a \ is
reserved for cases where the reference occurs outside of the regular
expression itself. This occurs in the function replace( ), where you
want to refer to groups in the matching regex from the replacement
string. For example, replace($someText, '(\d)\d*', '$1') will replace
each sequence of one or more digits in $someText with the first digit
in that sequence. This facility is also available in the
xsl:analyze-string instruction. We discuss these facilities in more
detail in Recipes 2.6 and 2.10.
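A few of the patterns discussed in this sidebar can be tried directly with the XPath 2.0 matches( ) and replace( ) functions. The stylesheet below is a small sketch, not an exhaustive test: the input strings are taken from the text above or made up for illustration (for example, 'a123 b45'), and the patterns that use back references are anchored with ^ and $ so that the whole string must match.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Each item of the sequence is written on its own line. -->
    <xsl:value-of separator="&#xA;" select="(
      (: true: a literal pattern found inside the string :)
      matches('foobar', 'foo'),
      (: true: character class between two literals :)
      matches('dig', 'd[aeiou]g'),
      (: false: + requires at least one e after the b :)
      matches('b', 'be+'),
      (: false: ^ anchors the pattern at the start of the string :)
      matches('has been', '^be+'),
      (: true: first and last digit are the same :)
      matches('3233', '^(\d)\d*\1$'),
      (: false: the digits would have to repeat in the same order :)
      matches('1221', '^(\d)(\d)\1\2$'),
      (: a1 b4: every run of digits is replaced by its first digit :)
      replace('a123 b45', '(\d)\d*', '$1')
    )"/>
  </xsl:template>
</xsl:stylesheet>
```

The comments inside the select attribute use the XPath 2.0 comment syntax, (: ... :), so they are ignored when the expression is evaluated.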
If you want to explore the world of regular expressions in more
depth, you should check out Mastering Regular Expressions,
Second Edition by Jeffrey E. F. Friedl
(O'Reilly, 2002). If you want more depth on XSLT
2.0's regex flavor, consider XPath
2.0 by Michael Kay (Wrox, 2004) or the W3C
recommendation at http://www.w3.org/TR/xquery-operators#string.match
and http://www.w3.org/TR/xmlschema-2/#regexs.