
Recipe 2.7. Replacing Text

Problem

You want to replace all occurrences of a substring within a target string with another string.

Solution

XSLT 1.0

The following recursive template replaces all occurrences of a search string with a replacement string:

<xsl:template name="search-and-replace">
     <xsl:param name="input"/>
     <xsl:param name="search-string"/>
     <xsl:param name="replace-string"/>
     <xsl:choose>
          <!-- See if the input contains the search string -->
          <xsl:when test="$search-string and 
                           contains($input,$search-string)">
          <!-- If so, then concatenate the substring before the search
          string to the replacement string and to the result of
          recursively applying this template to the remaining substring.
          -->
               <xsl:value-of 
                    select="substring-before($input,$search-string)"/>
               <xsl:value-of select="$replace-string"/>
               <xsl:call-template name="search-and-replace">
                    <xsl:with-param name="input"
                    select="substring-after($input,$search-string)"/>
                    <xsl:with-param name="search-string" 
                    select="$search-string"/>
                    <xsl:with-param name="replace-string" 
                        select="$replace-string"/>
               </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
               <!-- There are no more occurrences of the search string so 
               just return the current input string -->
               <xsl:value-of select="$input"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>
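
As a minimal sketch of how this template might be invoked, the following stylesheet assumes the template has been saved as a complete stylesheet in a file named search-and-replace.xslt and that the input document has a root element named doc. Both names are hypothetical and chosen only for illustration:

```xml
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

     <!-- Hypothetical file name: assumed to hold the
     search-and-replace template inside its own xsl:stylesheet -->
     <xsl:include href="search-and-replace.xslt"/>

     <xsl:output method="text"/>

     <!-- Assumes a root element named doc. Replaces every
     occurrence of "colour" in its string value with "color". -->
     <xsl:template match="/doc">
          <xsl:call-template name="search-and-replace">
               <xsl:with-param name="input" select="string(.)"/>
               <xsl:with-param name="search-string" select="'colour'"/>
               <xsl:with-param name="replace-string" select="'color'"/>
          </xsl:call-template>
     </xsl:template>

</xsl:stylesheet>
```

Note that the literal strings are wrapped in single quotes inside the select attributes; without them, XPath would treat colour and color as element names.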

If you want to replace only whole words, then you must ensure that the characters immediately before and after the search string are in the class of characters considered word delimiters. We chose the characters in the variable $punc plus whitespace to be word delimiters:

<xsl:template name="search-and-replace-whole-words-only">
  <xsl:param name="input"/>
  <xsl:param name="search-string"/>
  <xsl:param name="replace-string"/>
  <!-- The character that precedes $input in the original string. 
  Outside callers omit it, so a space (a delimiter) stands in for 
  the start of the string. -->
  <xsl:param name="prev-char" select="' '"/>
  <xsl:variable name="punc" 
    select="concat('.,;:( )[  ]!?$@&amp;&quot;',&quot;&apos;&quot;)"/>
  <xsl:choose>
    <!-- See if the input contains the search string. Testing 
    $search-string first guards against an empty search string, 
    which would otherwise recurse forever. -->
    <xsl:when test="$search-string and 
                    contains($input,$search-string)">
      <!-- If so, then test that the before and after characters are 
      word delimiters. -->
      <xsl:variable name="before" 
        select="substring-before($input,$search-string)"/>
      <xsl:variable name="before-char" 
        select="substring(concat($prev-char,$before),
                          string-length($before) + 1, 1)"/>
      <xsl:variable name="after" 
        select="substring-after($input,$search-string)"/>
      <xsl:variable name="after-char" 
        select="substring($after,1,1)"/>
      <xsl:value-of select="$before"/>
      <xsl:choose>
        <xsl:when test="(not(normalize-space($before-char)) or 
                         contains($punc,$before-char)) and 
                        (not(normalize-space($after-char)) or 
                         contains($punc,$after-char))"> 
          <xsl:value-of select="$replace-string"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$search-string"/>
        </xsl:otherwise>
      </xsl:choose>
      <xsl:call-template name="search-and-replace-whole-words-only">
        <xsl:with-param name="input" select="$after"/>
        <xsl:with-param name="search-string" select="$search-string"/>
        <xsl:with-param name="replace-string" select="$replace-string"/>
        <!-- Pass along the last character of the match so the next 
        match is not mistaken for the start of the string. -->
        <xsl:with-param name="prev-char" 
          select="substring($search-string,
                            string-length($search-string))"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <!-- There are no more occurrences of the search string so 
      just return the current input string -->
      <xsl:value-of select="$input"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

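Notice how we construct $punc using concat( ) so it contains both single and double quotes. A single XPath 1.0 string literal cannot contain its own delimiter because XPath and XSLT, unlike C, do not allow special characters to be escaped with a backslash (\). A select expression that needs both kinds of quote must therefore join two literals, one delimited by each kind of quote (the alternative is to give the variable a content body instead of a select attribute). XPath 2.0 allows the quotes to be escaped by doubling them up.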
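
To make the two quoting styles concrete, here is a small sketch that builds the same two-character string (a double quote followed by a single quote) both ways and compares the results. The variable names quotes-concat and quotes-doubled are illustrative only, and the doubled form requires an XSLT 2.0 processor:

```xml
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

     <xsl:output method="text"/>

     <!-- XPath 1.0 style: two literals, one delimited by each kind
     of quote, joined with concat( ) -->
     <xsl:variable name="quotes-concat"
          select="concat('&quot;',&quot;&apos;&quot;)"/>

     <!-- XPath 2.0 style: inside a single-quoted literal, two
     apostrophes stand for one apostrophe -->
     <xsl:variable name="quotes-doubled" select="'&quot;'''"/>

     <!-- Compare the two values; both should hold a double quote
     followed by a single quote -->
     <xsl:template match="/">
          <xsl:value-of select="$quotes-concat = $quotes-doubled"/>
     </xsl:template>

</xsl:stylesheet>
```

In both variables the double quote is written as &quot; only because the select attribute itself is delimited by double quotes; that is an XML requirement, not an XPath one.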


XSLT 2.0

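The functionality of search-and-replace is built into the 2.0 function replace( ), with the caveat that replace( ) treats its second argument as a regular expression rather than as a literal string. The functionality of search-and-replace-whole-words-only can easily be emulated using a regex that matches words: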

<xsl:function name="ckbk:search-and-replace-whole-words-only">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="search-string" as="xs:string"/>
    <xsl:param name="replace-string" as="xs:string"/>
    <xsl:sequence select="replace($input, concat('(^|\W)',$search-string,'(\W|$)'), 
    concat('$1',$replace-string,'$2'))"/>
</xsl:function>

Many regex engines use \b to match word boundaries, but XPath 2.0 does not support this.


Here we build up a regex by surrounding $search-string with (^|\W) and (\W|$) where \W means "not \w" or "not a word character." The ^ and $ handle the case when the word appears at the beginning or end of the string. We also need to put the matched \W character back into the text using references to the captured groups $1 and $2.

The function replace( ) is more powerful than the preceding XSLT 1.0 solutions because it uses regular expressions and can remember parts of the match and use them in the replacement via the variables $1, $2, etc. We explore replace( ) further in Recipe 2.10.
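
Because replace( ) interprets the search string as a regex and gives $ and \ special meaning in the replacement, a search string such as 5.0 or a replacement such as $5 will not behave literally. The following is a sketch of one way to guard against that, not a definitive implementation. The helper names ckbk:escape-regex, ckbk:escape-replacement, and ckbk:replace-whole-words-literal are hypothetical, and the ckbk namespace URI is a placeholder:

```xml
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     xmlns:ckbk="http://example.com/ckbk"
     exclude-result-prefixes="xs ckbk">

     <xsl:output method="text"/>

     <!-- Hypothetical helper: put a backslash in front of every regex
     metacharacter so the text is matched literally -->
     <xsl:function name="ckbk:escape-regex" as="xs:string">
          <xsl:param name="text" as="xs:string"/>
          <xsl:sequence select="replace($text,
               '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))',
               '\\$1')"/>
     </xsl:function>

     <!-- Hypothetical helper: escape \ and $, the only characters
     that are special in a replacement string -->
     <xsl:function name="ckbk:escape-replacement" as="xs:string">
          <xsl:param name="text" as="xs:string"/>
          <xsl:sequence select="replace($text, '(\\|\$)', '\\$1')"/>
     </xsl:function>

     <!-- Whole-word replace that treats both strings literally.
     An empty search string returns the input unchanged, because
     replace( ) raises an error for a pattern that can match a
     zero-length string. -->
     <xsl:function name="ckbk:replace-whole-words-literal"
          as="xs:string">
          <xsl:param name="input" as="xs:string"/>
          <xsl:param name="search-string" as="xs:string"/>
          <xsl:param name="replace-string" as="xs:string"/>
          <xsl:sequence select="
               if ($search-string = '') then $input
               else replace($input,
                    concat('(^|\W)',
                           ckbk:escape-regex($search-string),
                           '(\W|$)'),
                    concat('$1',
                           ckbk:escape-replacement($replace-string),
                           '$2'))"/>
     </xsl:function>

     <!-- The escaped pattern matches the literal text 5.0, so the
     number 500 is left alone, and $5 is inserted as plain text
     rather than read as a group reference -->
     <xsl:template match="/">
          <xsl:value-of select="ckbk:replace-whole-words-literal(
               'Cost: 5.0 and 500 total', '5.0', '$5')"/>
     </xsl:template>

</xsl:stylesheet>
```

This sketch shares one limitation with the function above: each match consumes the delimiter that follows it, so when two occurrences are separated by a single delimiter (as in "cat cat"), only the first is replaced. XPath 2.0 regexes have no lookahead, so working around this needs a different approach, such as applying the function twice.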

Discussion

Searching and replacing is a common text-processing task. The solution shown here is the most straightforward implementation of search and replace written purely in terms of XSLT. At first glance, it looks inefficient. For each occurrence of the search string, the code calls contains( ), substring-before( ), and substring-after( ). Presumably, each function rescans the input string for the search string, so this approach appears to perform two more searches than necessary. After some thought, you might come up with one of the seemingly more efficient solutions shown in Example 2-4 and Example 2-5.

Example 2-4. Using a temp string in a failed attempt to improve search and replace
<xsl:template name="search-and-replace">
     <xsl:param name="input"/>
     <xsl:param name="search-string"/>
     <xsl:param name="replace-string"/>
     <!-- Find the substring before the search string and store it in a 
     variable -->
     <xsl:variable name="temp" 
          select="substring-before($input,$search-string)"/>
     <xsl:choose>
          <!-- If $temp is not empty or the input starts with the search 
          string then we know we have to do a replace. This eliminates the 
          need to use contains( ). -->
          <xsl:when test="$temp or starts-with($input,$search-string)">
               <xsl:value-of select="concat($temp,$replace-string)"/>
               <xsl:call-template name="search-and-replace">
                    <!-- We eliminate the need to call substring-after
                    by using the length of temp and the search string 
                    to extract the remaining string in the recursive 
                    call. -->
                    <xsl:with-param name="input"
                    select="substring($input,string-length($temp)+
                         string-length($search-string)+1)"/>
                    <xsl:with-param name="search-string" 
                         select="$search-string"/>
                    <xsl:with-param name="replace-string" 
                         select="$replace-string"/>
               </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
               <xsl:value-of select="$input"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

Example 2-5. Using a temp integer in a failed attempt to improve search and replace
<xsl:template name="search-and-replace">
     <xsl:param name="input"/>
     <xsl:param name="search-string"/>
     <xsl:param name="replace-string"/>
     <!-- Find the length of the sub-string before the search string and 
     store it in a variable -->
     <xsl:variable name="temp" 
     select="string-length(substring-before($input,$search-string))"/>
     <xsl:choose>
     <!-- If $temp is not 0 or the input starts with the search 
     string then we know we have to do a replace. This eliminates the 
     need to use contains( ). -->
          <xsl:when test="$temp or starts-with($input,$search-string)">
               <xsl:value-of select="substring($input,1,$temp)"/>
               <xsl:value-of select="$replace-string"/>
                    <!-- We eliminate the need to call substring-after
                    by using temp and the length of the search string 
                    to extract the remaining string in the recursive 
                    call. -->
               <xsl:call-template name="search-and-replace">
                    <xsl:with-param name="input"
                         select="substring($input,$temp + 
                              string-length($search-string)+1)"/>
                    <xsl:with-param name="search-string"
                         select="$search-string"/>
                    <xsl:with-param name="replace-string"
                         select="$replace-string"/>
               </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
               <xsl:value-of select="$input"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

The idea behind both attempts is that if you remember the spot where substring-before( ) finds a match, then you can use this information to eliminate the need to call contains( ) and substring-after( ). You are forced to introduce a call to starts-with( ) to disambiguate the case in which substring-before( ) returns the empty string; this can happen when the search string is absent or when the input string starts with the search string. However, starts-with( ) is presumably faster than contains( ) because it doesn't need to scan past the length of the search string. The idea that distinguishes the second attempt from the first is the thought that storing an integer offset might be more efficient than storing the entire substring.

Alas, these supposed optimizations fail to produce any improvement when using the Xalan XSLT implementation and actually produce timing results that are an order of magnitude slower on some inputs when using either Saxon or XT! My first hypothesis regarding this unintuitive result was that the use of the variable $temp in the recursive call interfered with Saxon's tail-recursion optimization (see Recipe 2.6). However, experimenting with large inputs that have many matches failed to cause a stack overflow, so tail recursion was evidently still being optimized. My next suspicion was that, for some reason, the XPath substring( ) function is actually slower than the substring-before( ) and substring-after( ) calls. Michael Kay, the author of Saxon, indicated that Saxon's implementation of substring( ) was slow due to the complicated rules that substring( ) must implement, including floating-point rounding of arguments, handling special cases where the start or end point is outside the bounds of the string, and issues involving Unicode surrogate pairs. In contrast, substring-before( ) and substring-after( ) translate more directly into Java.

The real lesson here is that optimization is a tricky business, especially in XSLT, where there can be a wide disparity between implementations and where new versions continually apply new optimizations. Unless you are prepared to profile frequently, it is best to stick with simple solutions. An added advantage of obvious solutions is that they are likely to behave consistently across different XSLT implementations.

