Recipe 13.2. Working with Excel XML Spreadsheets
Problem
You want to export data from Excel to XML, but not in the
native format supported by Microsoft.
Solution
XSLT 1.0
If you have an Excel spreadsheet that looks like this:
Date
|
Price
|
Volume
|
---|
20010817
|
61.88
|
260163
|
20010820
|
62.7
|
241859
|
20010821
|
60.78
|
233989
|
20010822
|
60.66
|
387444
|
Then the Excel (XP or 2003) XML format looks like this:
<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-
microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html=
"http://www.w3.org/TR/REC-html40">
<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
<Author>Salvatore R. Mangano</Author>
<LastAuthor>Salvatore R. Mangano</LastAuthor>
<Created>2002-08-18T00:43:49Z</Created>
<LastSaved>2002-08-18T02:19:21Z</LastSaved>
<Company>Descriptix</Company>
<Version>10.3501</Version>
</DocumentProperties>
<OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office">
<DownloadComponents/>
<LocationOfComponents HRef="/"/>
</OfficeDocumentSettings>
<ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">
<WindowHeight>9915</WindowHeight>
<WindowWidth>10140</WindowWidth>
<WindowTopX>240</WindowTopX>
<WindowTopY>255</WindowTopY>
<ProtectStructure>False</ProtectStructure>
<ProtectWindows>False</ProtectWindows>
</ExcelWorkbook>
<Styles>
<Style ss:ID="Default" ss:Name="Normal">
<Alignment ss:Vertical="Bottom"/>
<Borders/>
<Font/>
<Interior/>
<NumberFormat/>
<Protection/>
</Style>
</Styles>
<Worksheet ss:Name="msft">
<Table ss:ExpandedColumnCount="3" ss:ExpandedRowCount="5" x:FullColumns="1"
x:FullRows="1">
<Row>
<Cell>
<Data ss:Type="String">Date</Data>
</Cell>
<Cell>
<Data ss:Type="String">Price</Data>
</Cell>
<Cell>
<Data ss:Type="String">Volume</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="Number">20010817</Data>
</Cell>
<Cell>
<Data ss:Type="Number">61.88</Data>
</Cell>
<Cell>
<Data ss:Type="Number">260163</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="Number">20010820</Data>
</Cell>
<Cell>
<Data ss:Type="Number">62.7</Data>
</Cell>
<Cell>
<Data ss:Type="Number">241859</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="Number">20010821</Data>
</Cell>
<Cell>
<Data ss:Type="Number">60.78</Data>
</Cell>
<Cell>
<Data ss:Type="Number">233989</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="Number">20010822</Data>
</Cell>
<Cell>
<Data ss:Type="Number">60.66</Data>
</Cell>
<Cell>
<Data ss:Type="Number">387444</Data>
</Cell>
</Row>
</Table>
<WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
<Selected/>
<Panes>
<Pane>
<Number>3</Number>
<ActiveRow>11</ActiveRow>
<ActiveCol>5</ActiveCol>
</Pane>
</Panes>
<ProtectObjects>False</ProtectObjects>
<ProtectScenarios>False</ProtectScenarios>
</WorksheetOptions>
</Worksheet>
</Workbook>
which is probably not what you had in mind!
This example conveniently maps an
Excel XML file to a simpler XML file.
Many spreadsheets created in Excel have a structure in which the
first row contains column names and subsequent rows contain data for
those columns.
One obvious mapping would convert the column names into element names
and the remaining cells into element content. The only missing pieces
of information are the names of the top-level element and the element
containing each row.
This
stylesheet takes these names as parameters with some obvious
defaults. It converts some of the useful metadata into comments and
throws away the Excel-specific stuff. This section provides several
other parameters that increase the generality of the conversion, such
as which row contains the column names, where the data starts, and
what to do about empty cells:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<!-- The name of the top-level element -->
<xsl:param name="topLevelName" select=" 'Table' "/>
<!-- The name of each row -->
<xsl:param name="rowName" select=" 'Row' "/>
<!-- The namespace to use -->
<xsl:param name="namespace"/>
<!-- The namespace prefix to use -->
<xsl:param name="namespacePrefix"/>
<!-- The character to use if column names contain whitespace -->
<xsl:param name="wsSub" select="'_'"/>
<!--Determines which row contains the col names-->
<xsl:param name="colNamesRow" select="1"/>
<!--Determines which row the data begins -->
<xsl:param name="dataRowStart" select="2"/>
<!-- If false then cells with null or whitespace-only content -->
<!-- will be skipped -->
<xsl:param name="includeEmpty" select="true( )"/>
<!-- If false then author and creation metadata will not be put -->
<!-- into a comment-->
<xsl:param name="includeComment" select="true( )"/>
<!--Normalize the namespacePrefix -->
<xsl:variable name="nsp">
<xsl:if test="$namespace">
<!-- Only use prefix if namespace is specified -->
<xsl:choose>
<xsl:when test="contains($namespacePrefix,':')">
<xsl:value-of
select="concat(translate(substring-before(
$namespacePrefix,
':'),' ',''),':')"/>
</xsl:when>
<xsl:when test="translate($namespacePrefix,' ','')">
<xsl:value-of
select="concat(translate($namespacePrefix,' ',''),':')"/>
</xsl:when>
<xsl:otherwise/>
</xsl:choose>
</xsl:if>
</xsl:variable>
<!--Get the names of all the columns with whitespace replaced by -->
<xsl:variable name="COLS" select="/*/*/*/ss:Row[$colNamesRow]/ss:Cell"/>
<xsl:template match="o:DocumentProperties">
<xsl:if test="$includeComment">
<xsl:text>
</xsl:text>
<xsl:comment>
<xsl:text>
</xsl:text>
<xsl:if test="normalize-space(o:Company)">
<xsl:text>Company: </xsl:text>
<xsl:value-of select="o:Company"/>
<xsl:text>
</xsl:text>
</xsl:if>
<xsl:text>Author: </xsl:text>
<xsl:value-of select="o:Author"/>
<xsl:text>
</xsl:text>
<xsl:text>Created on: </xsl:text>
<xsl:value-of select="translate(o:Created,'TZ',' ')"/>
<xsl:text>
</xsl:text>
<xsl:text>Last Author: </xsl:text>
<xsl:value-of select="o:LastAuthor"/>
<xsl:text>
</xsl:text>
<xsl:text>Saved on:</xsl:text>
<xsl:value-of select="translate(o:LastSaved,'TZ',' ')"/>
<xsl:text>
</xsl:text>
</xsl:comment>
</xsl:if>
</xsl:template>
<xsl:template match="ss:Table">
<xsl:element
name="{concat($nsp,translate($topLevelName,
' 	
',$wsSub))}"
namespace="{$namespace}">
<xsl:apply-templates select="ss:Row[position( ) >= $dataRowStart]"/>
</xsl:element>
</xsl:template>
<xsl:template match="ss:Row">
<xsl:element
name="{concat($nsp,translate($rowName,
' 	
',$wsSub))}"
namespace="{$namespace}">
<xsl:for-each select="ss:Cell">
<xsl:variable name="pos" select="position( )"/>
<!-- Get the correct column name even if there were empty -->
<!-- cols in original spreadsheet -->
<xsl:variable name="colName">
<xsl:choose>
<xsl:when test="@ss:Index and
$COLS[@ss:Index = current( )/@ss:Index]">
<xsl:value-of
select="$COLS[@ss:Index = current( )/@ss:Index]/ss:Data"/>
</xsl:when>
<xsl:when test="@ss:Index">
<xsl:value-of
select="$COLS[number(current( )/@ss:Index)]/ss:Data"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$COLS[$pos]/ss:Data"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<xsl:if test="$includeEmpty or
translate(ss:Data,' 	
','')">
<xsl:element
name="{concat($nsp,translate($colName,
' 	
',$wsSub))}"
namespace="{$namespace}">
<xsl:value-of select="ss:Data"/>
</xsl:element>
</xsl:if>
</xsl:for-each>
</xsl:element>
</xsl:template>
<xsl:template match="text( )"/>
</xsl:stylesheet>
The result of the transformation, with default parameter values, is
the much more direct XML representation that follows:
<Table>
<Row>
<Date>20010817</Date>
<Price>61.88</Price>
<Volume>260163</Volume>
</Row>
<Row>
<Date>20010820</Date>
<Price>62.7</Price>
<Volume>241859</Volume>
</Row>
<Row>
<Date>20010821</Date>
<Price>60.78</Price>
<Volume>233989</Volume>
</Row>
<Row>
<Date>20010822</Date>
<Price>60.66</Price>
<Volume>387444</Volume>
</Row>
</Table>
XSLT 2.0
The main improvements of using XSLT 2.0 is the ability to introduce
some helper functions to remove redundant code and the use of more
succinct XPath 2.0 syntax.
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:fn="http://www.w3.org/2005/02/xpath-functions"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:ckbk="http://www.oreilly.com/xsltckbk">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<!-- The name of the top-level element -->
<xsl:param name="topLevelName" select=" 'Table' " as="xs:string"/>
<!-- The name of each row -->
<xsl:param name="rowName" select=" 'Row' " as="xs:string"/>
<!-- The namespace to use -->
<xsl:param name="namespace" select=" '' " as="xs:string"/>
<!-- The namespace prefix to use -->
<xsl:param name="namespacePrefix" select=" '' " as="xs:string" />
<!-- The character to use if column names contain whitespace -->
<xsl:param name="wsSub" select="'_'" as="xs:string"/>
<!--Determines which row contains the col names-->
<xsl:param name="colNamesRow" select="1" as="xs:integer"/>
<!--Determines which row the data begins -->
<xsl:param name="dataRowStart" select="2" as="xs:integer"/>
<!-- If false then cells with null or whitespace-only content -->
<!-- will be skipped -->
<xsl:param name="includeEmpty" select="true( )" as="xs:boolean"/>
<!-- If false then author and creation metadata will not be put -->
<!-- into a comment-->
<xsl:param name="includeComment" select="true( )" as="xs:boolean"/>
<!--Normalize the namespacePrefix -->
<xsl:variable name="nsp" as="xs:string"
select="if (contains($namespacePrefix,':'))
then concat(translate(substring-before($namespacePrefix,':'),' ',''),':')
else
if (matches($namespacePrefix,'\W'))
then concat(translate($namespacePrefix,' ',''),':')
else '' "/>
<!--Get the names of all the columns-->
<xsl:variable name="COLS" select="/*/*/*/ss:Row[$colNamesRow]/ss:Cell"/>
<xsl:template match="o:DocumentProperties">
<xsl:if test="$includeComment">
<xsl:text>
</xsl:text>
<xsl:comment select="concat('
',
ckbk:comment(o:Company),
ckbk:comment(o:Author),
ckbk:comment(o:Created,'Created on'),
ckbk:comment(o:LastAuthor,'Last Author'),
ckbk:comment(o:LastSaved,'Saved on'))"/>
</xsl:if>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="ss:Table">
<xsl:element
name="{ckbk:makeName($nsp,$topLevelName,$wsSub)}"
namespace="{$namespace}">
<xsl:apply-templates select="ss:Row[position( ) ge $dataRowStart]"/>
</xsl:element>
</xsl:template>
<xsl:template match="ss:Row">
<xsl:element
name="{ckbk:makeName($nsp,$rowName,$wsSub)}"
namespace="{$namespace}">
<xsl:for-each select="ss:Cell">
<xsl:variable name="pos" select="position( )"/>
<!-- Get the correct column name even if there were empty -->
<!-- cols in original spreadsheet -->
<xsl:variable name="colName" as="xs:string"
select="if (@ss:Index and $COLS[@ss:Index = current( )/@ss:Index])
then $COLS[@ss:Index = current( )/@ss:Index]/ss:Datae
else
if (@ss:Index)
then $COLS[number(current( )/@ss:Index)]/ss:Data
else $COLS[$pos]/ss:Data"/>
<xsl:if test="$includeEmpty or
translate(ss:Data,' 	
','')">
<xsl:element
name="{ckbk:makeName($nsp,$colName,$wsSub)}"
namespace="{$namespace}">
<xsl:value-of select="ss:Data"/>
</xsl:element>
</xsl:if>
</xsl:for-each>
</xsl:element>
</xsl:template>
<xsl:template match="text( )"/>
<xsl:function name="ckbk:makeName" as="xs:string">
<xsl:param name="nsp" as="xs:string"/>
<xsl:param name="name" as="xs:string"/>
<xsl:param name="wsSub" as="xs:string"/>
<xsl:sequence select="concat($nsp,translate($name,
' 	
',$wsSub))"/>
</xsl:function>
<xsl:function name="ckbk:comment" as="xs:string">
<xsl:param name="elem"/>
<xsl:sequence select="ckbk:comment($elem, local-name($elem))"/>
</xsl:function>
<xsl:function name="ckbk:comment" as="xs:string">
<xsl:param name="elem"/>
<xsl:param name="label" as="xs:string"/>
<xsl:sequence select="if (normalize-space($elem))
then concat($label,': ',$elem,'
')
else '' "/>
</xsl:function>
</xsl:stylesheet>
Discussion
I almost did not include this recipe in the book because it initially
seemed trivial. However, I realized that a robust solution needs to
handle many special cases, and many implementations (including my
first) would miss them. For example, spreadsheets often contain empty
columns used
as
spacers. You
need to know how to handle them by looking for the
@ss:Index attribute. This book's
initial version also hardcoded many of the choices this version
exposes as parameters.
At least one obvious additional extension could be made to this
stylesheet: the handling of multiple
ss:Worksheet
elements. This handling could be done by specifying the worksheet
number as a parameter:
<xsl:param name="WSNum" select="1"/>
<xsl:variable name="COLS"
select="/*/ss:Worksheet[$WSNum]/*/ss:Row[$colNamesRow]/ss:Cell"/>
<xsl:template match="ss:Workbook">
<xsl:element name="{concat($nsp,translate($topLevelName,
' 	
',$wsSub))}"
namespace="{$namespace}">
<xsl:apply-templates select="ss:Worksheet[number($WSNum)]/ss:Table"/>
</xsl:element>
</xsl:template>
A more ambitious solution handles each
Worksheet in
a multiple
Worksheet document as a separate element in the
resulting document. This setup means that the column names can no
longer be handled as a global variable:
<xsl:template match="ss:Workbook">
<xsl:element name="{concat($nsp,translate($topLevelName,
' 	
',$wsSub))}"
namespace="{$namespace}">
<xsl:choose>
<xsl:when test="number($WSNum) > 0">
<xsl:apply-templates
select="ss:Worksheet[number($WSNum)]/ss:Table">
<xsl:with-param name="COLS"
select="ss:Worksheet[number($WSNum)]
/*/ss:Row[$colNamesRow]/ss:Cell"/>
</xsl:apply-templates>
</xsl:when>
<xsl:otherwise>
<xsl:for-each select="ss:Worksheet">
<xsl:element
name="{concat($nsp,translate(@ss:Name,
' 	
',$wsSub))}"
namespace="{$namespace}">
<xsl:apply-templates select="ss:Table">
<xsl:with-param name="COLS"
select="*/ss:Row[$colNamesRow]/ss:Cell"/>
</xsl:apply-templates>
</xsl:element>
</xsl:for-each>
</xsl:otherwise>
</xsl:choose>
</xsl:element>
</xsl:template>
<xsl:template match="ss:Table">
<xsl:param name="COLS"/>
<xsl:apply-templates select="ss:Row[position( ) >= $dataRowStart]">
<xsl:with-param name="COLS" select="$COLS"/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match="ss:Row">
<xsl:param name="COLS"/>
<!-- The rest is the same as original ... -->
</xsl:template>
The only trouble with this solution is that it assumes that the
column names have to be in the same row in each worksheet.
|