
Recipe 8.4. Merging Documents with Identical Schema

Problem

You have two or more identically structured documents and you would like to merge them into a single document.

Solution

If the content of the documents is distinct or you are not concerned about duplicates, then the solution is simple:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   
<xsl:output method="xml" indent="yes"/>
   
<xsl:param name="doc2"/> 
   
<xsl:template match="/*">
  <xsl:copy>
    <xsl:copy-of select="* | document($doc2)/*/*"/>
  </xsl:copy>
</xsl:template>
   
</xsl:stylesheet>
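
Because the union in the select attribute mixes nodes from two different documents, XSLT 1.0 leaves the relative order of the two groups in the output to the processor. The following is a minimal sketch of a variant, assuming you want the first document's entries to come first and you also want to keep the root element's attributes. Neither assumption is stated as a requirement of this recipe.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<!-- URI of the second document, supplied by the caller -->
<xsl:param name="doc2"/>

<xsl:template match="/*">
  <xsl:copy>
    <!-- Keep the attributes of the first document's root element -->
    <xsl:copy-of select="@*"/>
    <!-- Children of the first document, in their original order -->
    <xsl:copy-of select="*"/>
    <!-- Then the children of the second document, in their original order -->
    <xsl:copy-of select="document($doc2)/*/*"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

Splitting the copy into separate instructions makes the output order explicit instead of relying on how a particular processor orders nodes from different documents.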

If duplicates exist among the input documents but you want the output document to contain unique entries, you can use the techniques discussed in Recipe 5.1 for removing duplicates. Consider the two documents in Example 8-12 and Example 8-13.

Example 8-12. Document 1
<people which="MeAndMyFriends">
     <person firstname="Sal" lastname="Mangano" age="38" height="5.75"/>
     <person firstname="Mike" lastname="Palmieri" age="28" height="5.10"/>
     <person firstname="Vito" lastname="Palmieri" age="38" height="6.0"/>
     <person firstname="Vinny" lastname="Mari" age="37" height="5.8"/>
</people>

Example 8-13. Document 2
<people which="MeAndMyCoWorkers">
     <person firstname="Sal" lastname="Mangano" age="38" height="5.75"/>
     <person firstname="Al" lastname="Zehtooney" age="33" height="5.3"/>
     <person firstname="Brad" lastname="York" age="38" height="6.0"/>
     <person firstname="Charles" lastname="Xavier" age="32" height="5.8"/>
</people>

This stylesheet merges the two documents and removes the duplicate element using xsl:sort and the exsl:node-set extension function:

<xsl:stylesheet version="1.0"  
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common">
    
    <xsl:import href="exsl.xsl" />

<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
   
<xsl:param name="doc2"/> 
<!-- Here we introduce a 'key' attribute to make removing duplicates -->
<!-- easier -->
<xsl:variable name="all">
  <xsl:for-each select="/*/person | document($doc2)/*/person">
    <xsl:sort select="concat(@lastname,@firstname)"/>
    <person key="{concat(@lastname, @firstname)}">
      <xsl:copy-of select="@* | node( )" />
    </person>
  </xsl:for-each>
</xsl:variable>
   
<xsl:template match="/">
     
<people>
     <xsl:for-each 
         select="exsl:node-set($all)/person[not(@key = 
                          preceding-sibling::person[1]/@key)]">
          <xsl:copy-of select="."/>
     </xsl:for-each>
</people>
     
</xsl:template>

</xsl:stylesheet>

Removing duplicates this way has three drawbacks. First, it alters the order of the elements, which might be undesirable. Second, it requires the use of the node-set extension in XSLT 1.0. Third, it is not generic: you must rewrite the entire stylesheet for every situation in which you want a non-duplicating merge.

One way to address these problems uses xsl:key:

<!-- Stylesheet: merge-simple-using-key.xslt -->
<!-- Include this stylesheet in another that defines the key -->
   
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:merge="http:www.ora.com/XSLTCookbook/mnamespaces/merge">
     
<xsl:param name="doc2"/> 
   
<xsl:template match="/*">
  <!--Copy the outermost element of the source document -->
  <xsl:copy>
    <!-- For each child in the source, determine if it should be 
    copied to the destination based on its existence in the other document.
    -->
    <xsl:for-each select="*">
    
      <!-- Call a template which determines a unique key value for this
           element. It must be defined in the including stylesheet. 
      -->  
      <xsl:variable name="key-value">
        <xsl:call-template name="merge:key-value"/>
      </xsl:variable>
      
      <xsl:variable name="element" select="."/>
      <!--This for-each is simply to change context 
          to the second document 
      -->
      <xsl:for-each select="document($doc2)/*">
        <!-- Use key as a mechanism for testing the presence 
             of the element in the second document. The 
             key should be defined by the including stylesheet
        -->
        <xsl:if test="not(key('merge:key', $key-value))">
          <xsl:copy-of select="$element"/>
        </xsl:if>
      </xsl:for-each>
      
    </xsl:for-each>
   
    <!--Copy all elements in the second document -->
    <xsl:copy-of select="document($doc2)/*/*"/>
    
  </xsl:copy>
</xsl:template>
   
</xsl:stylesheet>

The following stylesheet includes the previous one and defines the key and a template to retrieve the key's value:

<!-- This stylesheet defines uniqueness of elements in terms of a key. -->
<xsl:stylesheet version="1.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:merge="http:www.ora.com/XSLTCookbook/mnamespaces/merge">
   
<xsl:include href="merge-simple-using-key.xslt"/>
   
<!--A person is uniquely defined by the concatenation of 
    last and first names -->
<xsl:key name="merge:key" match="person" 
         use="concat(@lastname,@firstname)"/>
   
<xsl:output method="xml" indent="yes"/>
   
<!-- This template retrieves the key value for an element -->
<xsl:template name="merge:key-value">
  <xsl:value-of select="concat(@lastname,@firstname)"/>
</xsl:template>
   
</xsl:stylesheet>
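
To illustrate how the same included stylesheet might be reused for a different schema, here is a sketch for a hypothetical document whose root element contains item elements identified by a sku attribute. The item element and sku attribute are assumptions made only for illustration; they do not appear elsewhere in this recipe.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:merge="http:www.ora.com/XSLTCookbook/mnamespaces/merge">

<xsl:include href="merge-simple-using-key.xslt"/>

<!-- Hypothetical schema: an item is uniquely identified by its sku attribute -->
<xsl:key name="merge:key" match="item" use="@sku"/>

<xsl:output method="xml" indent="yes"/>

<!-- Must compute the same value as the use expression of the key above -->
<xsl:template name="merge:key-value">
  <xsl:value-of select="@sku"/>
</xsl:template>

</xsl:stylesheet>
```

Only the key definition and the merge:key-value template change; the merge logic in merge-simple-using-key.xslt is untouched. The merge namespace URI must be written exactly as in the included stylesheet so that the names merge:key and merge:key-value resolve to the same expanded names.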

A second way to merge and remove duplicates uses the value-based set operations discussed in Recipe 9.2. The solution is presented here; see that recipe for more information. Example 8-14 and Example 8-15 show the stylesheets.

Example 8-14. A reusable stylesheet that implements the merge in terms of a union
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:vset="http:/www.ora.com/XSLTCookbook/namespaces/vset">
   
<xsl:import href="../query/vset.ops.xslt"/>
   
<xsl:output method="xml" indent="yes"/>
   
<xsl:param name="doc2"/> 
   
<xsl:template match="/*">
  <xsl:copy>
    <xsl:call-template name="vset:union">
      <xsl:with-param name="nodes1" select="*"/>
      <xsl:with-param name="nodes2" select="document($doc2)/*/*"/>
    </xsl:call-template>
  </xsl:copy>
</xsl:template>
   
</xsl:stylesheet>

Example 8-15. A stylesheet defining what element equality means
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:vset="http:/www.ora.com/XSLTCookbook/namespaces/vset">
   
<xsl:import href="merge-using-vset-union.xslt"/>
   
<xsl:template match="person" mode="vset:element-equality">
  <xsl:param name="other"/>
  <xsl:if test="concat(@lastname,@firstname) = 
                concat($other/@lastname,$other/@firstname)">  
    <xsl:value-of select="true( )"/>
  </xsl:if>
</xsl:template>
   
</xsl:stylesheet>

The vset:union-based solution involves less new code than the key-based solution; however, for large documents, the xsl:key-based solution is likely to be faster.

Discussion

Merging documents is often necessary when separate individuals or processes produce parts of the document. Merging is also necessary when reconstituting a very large document that was split up to be processed in parallel or because it was too cumbersome to handle as a whole.

The examples in this section address the simple case in which just two documents are merged. To merge an arbitrary number of documents, you need a mechanism for passing a list of documents into the stylesheet. One technique uses a parameter containing all filenames separated by spaces and employs a simple tokenizer (Recipe 2.9) to extract the names. Another technique passes all the filenames in the source document, as shown in Example 8-16 and Example 8-17.

Example 8-16. XML listing the documents to be merged
<mergeDocs>
  <doc path="people1.xml"/>
  <doc path="people2.xml"/>
  <doc path="people3.xml"/>
  <doc path="people4.xml"/>
</mergeDocs>

Example 8-17. A stylesheet for merging the documents (assumes no duplicates are in the content)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   
<xsl:output method="xml" indent="yes"/>
   
<xsl:variable name="docs" select="/*/doc"/>
   
<xsl:template match="mergeDocs">
     <xsl:apply-templates select="doc[1]"/>
</xsl:template>
   
<!--Match the first doc to create the topmost element -->
<xsl:template match="doc">
  <xsl:variable name="path" select="@path"/>
  <xsl:for-each select="document($path)/*">
    <xsl:copy>
       <!-- Merge children of doc 1 -->
      <xsl:copy-of select="@* | *"/>
       <!--Loop over remaining docs to merge their children -->
      <xsl:for-each select="$docs[position( ) > 1]">
          <xsl:copy-of select="document(@path)/*/*"/>
      </xsl:for-each>
    </xsl:copy>
  </xsl:for-each> 
</xsl:template>
   
</xsl:stylesheet>
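
The first technique mentioned above, a parameter holding space-separated filenames, could be sketched roughly as follows. This sketch uses a small inline recursive split rather than the tokenizer from Recipe 2.9, and the parameter name docs and template name copy-children-of-docs are hypothetical. It assumes the source document is the first document to merge, that the filenames contain no spaces, and that the content has no duplicates.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<!-- Hypothetical parameter: space-separated URIs of the additional documents -->
<xsl:param name="docs" select="''"/>

<xsl:template match="/*">
  <xsl:copy>
    <!-- Attributes and children of the source document -->
    <xsl:copy-of select="@* | *"/>
    <!-- Children of every document named in the parameter -->
    <xsl:call-template name="copy-children-of-docs">
      <xsl:with-param name="list" select="normalize-space($docs)"/>
    </xsl:call-template>
  </xsl:copy>
</xsl:template>

<!-- Recursively take the first name off the list and copy that
     document's children, then continue with the rest of the list -->
<xsl:template name="copy-children-of-docs">
  <xsl:param name="list"/>
  <xsl:if test="$list">
    <!-- Appending a space lets substring-before handle the last name too -->
    <xsl:variable name="first"
                  select="substring-before(concat($list, ' '), ' ')"/>
    <xsl:copy-of select="document($first)/*/*"/>
    <xsl:call-template name="copy-children-of-docs">
      <xsl:with-param name="list" select="substring-after($list, ' ')"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>
```

The call to normalize-space collapses runs of whitespace so that each recursion step can split on a single space, and the recursion ends when substring-after returns the empty string.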

