Recipe 7.1. Dealing with Whitespace
Problem
You need to convert XML into formatted text, but whitespace issues are ruining the results.
Solution
Consider the following annotated XML sample. The symbols (newline),
(tab), and (space) mark whitespace-only text nodes that are often
overlooked but subject to being copied to the output:
Too much whitespace
Use xsl:strip-space to get rid of whitespace-only
nodes. This top-level element has a single attribute,
elements, which takes a whitespace-separated list
of the names of elements that you want stripped of extra whitespace. Here,
extra whitespace means whitespace-only text nodes. This means, for
example, that the whitespace separating words in the previous
comment element is significant because the text node
that contains it is not whitespace only. On the other hand, the whitespace
designated by the special symbols is whitespace only. A common idiom uses
<xsl:strip-space elements="*"/> to strip whitespace by default and
xsl:preserve-space (see later) to override
specific elements. In XSLT 2.0, you are also allowed to have
elements="*:Name", which tells the processor to
strip whitespace in all elements with the given local name regardless
of namespace.
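The strip-by-default idiom can be sketched as a small stylesheet that reports how many text-node children each element keeps once stripping has been applied. This is a minimal illustration, not a definitive implementation; the element name verse is a hypothetical stand-in for whatever element needs its whitespace-only nodes kept.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Strip whitespace-only text nodes from every element by default -->
  <xsl:strip-space elements="*"/>
  <!-- ...but keep them inside verse (hypothetical element name) -->
  <xsl:preserve-space elements="verse"/>

  <!-- For each element, print its name and the number of text-node
       children that remain after whitespace stripping -->
  <xsl:template match="*">
    <xsl:value-of select="name()"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="count(text())"/>
    <xsl:text>&#xA;</xsl:text>
    <xsl:apply-templates select="*"/>
  </xsl:template>
</xsl:stylesheet>
```

Running it against an indented document with and without the xsl:strip-space line is a quick way to see which text nodes were whitespace only.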
Use normalize-space to get rid of extra
whitespace. A common mistake is to assume that xsl:strip-space
takes care of "extra" whitespace
like that used to align text in the previous
comment element. This is not the case. The processor
always treats whitespace as significant when it is mixed with
nonwhitespace in an element's text. To
remove this extra space, use normalize-space, as
in <xsl:value-of select="normalize-space(comment)"/>.
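As a minimal sketch, assuming the input contains comment elements like the one in the annotated sample, a template can normalize each comment as it is written to the output:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Assumes the input has comment elements, as in the annotated sample -->
  <xsl:template match="comment">
    <!-- Trim both ends and collapse each internal run of whitespace
         to a single space -->
    <xsl:value-of select="normalize-space(.)"/>
    <!-- End each comment with one explicit newline -->
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Text outside comment elements still passes through the built-in templates unchanged, so this only tidies the elements the template matches.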
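Use translate to get rid of all whitespace. Another common mistake is to assume
normalize-space strips all whitespace. This is not
the case. Instead, it strips only leading and trailing whitespace and
converts multiple internal whitespace characters to single spaces. If
you need to strip all whitespace, use
translate(something, ' &#xA;&#xD;&#x9;', '').
Write the newline, carriage return, and tab as character references, because
the XML parser normalizes literal line breaks and tabs inside an attribute
value to spaces.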
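The following sketch shows the translate call in a complete stylesheet. The element name code is hypothetical; substitute whatever element holds the text you want compacted.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- code is a hypothetical element name used only for illustration -->
  <xsl:template match="code">
    <!-- The second argument lists space, tab, newline, and carriage
         return; the empty third argument means each is deleted -->
    <xsl:value-of select="translate(., ' &#x9;&#xA;&#xD;', '')"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Because the third argument is shorter than the second, every listed character with no counterpart is removed rather than replaced.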
Use an empty xsl:text element to prevent
terminating whitespace in the stylesheet from being considered
relevant. xsl:text is normally considered a way to preserve
whitespace. However, a strategically placed empty
xsl:text element can prevent trailing whitespace
in the stylesheet from being interpreted as significant.
Consider the results of the two modes in the following document and
stylesheet, shown in Example 7-1 to Example 7-3.
Example 7-1. Input
<numbers>
<number>10</number>
<number>3.5</number>
<number>4.44</number>
<number>77.7777</number>
</numbers>
Example 7-2. Processing numbers with and without an empty xsl:text element
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="numbers">
Without empty text element:
<xsl:apply-templates mode="without"/>
With empty text element:
<xsl:apply-templates mode="with"/>
</xsl:template>
<xsl:template match="number" mode="without">
<xsl:value-of select="."/>,
</xsl:template>
<xsl:template match="number" mode="with">
<xsl:value-of select="."/>,<xsl:text/>
</xsl:template>
</xsl:stylesheet>
Example 7-3. Output
Without empty text element:
10,
3.5,
4.44,
77.7777,
With empty text element:
10,3.5,4.44,77.7777,
Note that there is nothing magical about xsl:text
when it is used this way. It works just as well if you replace
<xsl:text/> with <xsl:if
test="0"/> (but don't do so unless you
enjoy confusing others). The effect is the placement of an element
node between the comma and the trailing newline, which creates a
whitespace-only node that will be ignored. Of course, some find this
confusing regardless, so you can also write
<xsl:text>,</xsl:text> if you prefer.
Too little whitespace
Use xsl:preserve-space to override
xsl:strip-space for specific elements. There is not much point in using
xsl:preserve-space unless you also use
xsl:strip-space, because the default
behavior preserves whitespace in the input document and in documents loaded
with the document() function.
Remember, MSXML strips whitespace-only text nodes by default. In this
case, you can use xsl:preserve-space to counteract
this nonconformance.
Use xsl:text to precisely specify text-output
spacing. All whitespace inside an xsl:text element is
preserved. This preservation allows precise control over whitespace
placement. Sometimes you can use xsl:text to
simply introduce line breaks:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="number">
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
However, the problem with
outputting newline characters directly is that some platforms (e.g.,
Microsoft's) expect a line break to be represented
as carriage return plus newline. Because XML parsers are
required to convert a literal carriage return plus newline into a single
newline, a line break typed into the stylesheet cannot adapt to the
platform, so there is no way to create a platform-independent stylesheet.
Fortunately, most Windows-based editors and the Windows command
prompt handle single newlines correctly. The one exception is the
Notepad editor that comes free with Windows.
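If a particular consumer insists on carriage return plus newline, one possible workaround, offered as a sketch rather than a recommendation, is to write the two characters as character references. Line-end normalization applies to literal characters in the stylesheet file, not to character references, so the carriage return should survive into the result with method="text". The price is that the stylesheet is then tied to that one line-ending convention.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="number">
    <xsl:value-of select="."/>
    <!-- Explicit carriage return (#xD) plus newline (#xA), written as
         character references so the parser does not fold them into
         a single newline -->
    <xsl:text>&#xD;&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```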
Use nonbreaking space characters. XSLT does not treat the character &#xA0; (nonbreaking space) as normal
whitespace. In particular, xsl:strip-space and
normalize-space() both leave this character untouched. If
you need to strip whitespace most of the time but have specific
instances when it should remain in place, you might try to use this
character in the XML input. Nonbreaking space is particularly useful
for HTML output, but may be of lesser value in other contexts
(depending on how the renderer handles it).
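For plain-text output, one way to put this to work is to normalize first and only then turn the protected characters into ordinary spaces. The sketch below assumes a hypothetical label element whose content uses &#xA0; wherever a space must survive normalization.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- label is a hypothetical element name used only for illustration -->
  <xsl:template match="label">
    <!-- normalize-space() collapses ordinary whitespace but leaves
         &#xA0; alone; translate() then maps each surviving &#xA0;
         to a regular space -->
    <xsl:value-of select="translate(normalize-space(.), '&#xA0;', ' ')"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The order of the two calls matters: translating first would expose the protected spaces to normalize-space() and defeat the purpose.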
Discussion
The solution section lists techniques for managing whitespace.
However, knowing the XSLT rules that underlie the techniques is also
useful.
The most important rules to know apply to both
the stylesheet and input document(s):
1. A text node is never stripped unless it contains only whitespace
characters (#x20, #x9, #xD, or #xA).
2. If a text node's ancestor element has an
xml:space attribute with a value of
preserve, and no closer ancestor element has
xml:space with a value of
default, then whitespace-only text nodes are not
stripped.
Although xml:space attributes are not all that common, you should
understand their effect in both the
stylesheet and the input document(s).
The chapter now looks at the rules for stylesheets and source
documents separately. For stylesheets, your options are simple.
The only stylesheet element for
which whitespace-only nodes are preserved by default is
xsl:text. Here, "by
default" means unless otherwise specified using
xml:space="preserve" as stated
earlier in rule 2. See Example 7-4 and
Example 7-5.
Example 7-4. Stylesheet demonstrating the effect of xsl:text and xml:space=preserve
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="numbers">
Without xml:space="preserve":
<xsl:apply-templates mode="without-preserve"/>
With xml:space="preserve":
<xsl:apply-templates mode="with-preserve"/>
</xsl:template>
<xsl:template match="number" mode="without-preserve">
<xsl:value-of select="."/><xsl:text> </xsl:text>
</xsl:template>
<xsl:template match="number" mode="with-preserve" xml:space="preserve">
<xsl:value-of select="."/><xsl:text> </xsl:text>
</xsl:template>
</xsl:stylesheet>
Example 7-5. Output
Without xml:space="preserve":
10 3.5 4.44 77.7777
With xml:space="preserve":
10
3.5
4.44
77.7777
The only whitespace introduced by the first number match is the
single space contained in the xsl:text element.
However, when you use xml:space="preserve" in the
second number match template, you pick up all the whitespace
contained in the element including the two line breaks (the first is
after the <xsl:template ...> and the second
is after the </xsl:text>).
For source documents, the rules are as follows:
1. Initially, the list of elements in which whitespace is preserved
includes all elements in the document.
2. If an element name matches a NameTest in an
xsl:strip-space element, then it is removed from
the list of whitespace-preserving element names.
3. If an element name matches a NameTest in an
xsl:preserve-space element, then it is added to
the list of whitespace-preserving element names.
A NameTest is a simple name (e.g.,
doc), a name with a namespace prefix (e.g.,
my:doc), a wildcard (e.g., *), or
a wildcard with a namespace prefix (e.g., my:*).
In XSLT 2.0, names of the form *:doc are also allowed. The default
priority and import precedence rules apply when conflicts exist
between xsl:strip-space and
xsl:preserve-space:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:my="http://www.ora.com/XSLTCookbook/ns/my">
<!-- Strip whitespace in all elements -->
<xsl:strip-space elements="*"/>
<!-- except those in the "my" namespace -->
<xsl:preserve-space elements="my:*"/>
<!-- and those named foo -->
<xsl:preserve-space elements="foo"/>
</xsl:stylesheet>
See Also
One of the most complete discussions of whitespace handling is
in Michael Kay's XSLT Programmer's Reference (Wrox, 2001). The
book also includes good coverage of the import precedence rules.