Microsoft Office files are actually zip files with a lot of XML and other stuff inside. It is remarkably difficult to get to the actual contents of them: What is in Excel cell A1B2 or what is written in this Word document. To help with this, the xtpxlib-xoffice component contains XProc (1.0 and 3.0) pipelines to extract contents from Excel (.xlsx) and Word (.docx) files.

The namespace prefix xtlxo: is bound to the namespace http://www.xtpxlib.nl/ns/xoffice (xmlns:xtlxo="http://www.xtpxlib.nl/ns/xoffice").

Especially the .docx (Word) conversions should be considered unfinished and experimental. Not everything is converted.

1.1 Converting from Excel (.xlsx)

The xtlxo:extract-xlsx pipeline takes an Excel .xlsx file and turns this into much more manageable XML. The schema for the resulting XML format is here.

Take for instance this simple Excel sheet:


Running this through the xtlxo:extract-xlsx pipeline returns something like this:

<?xml version="1.0" encoding="UTF-8"?>
<workbook xmlns="http://www.xtpxlib.nl/ns/xoffice"
          href="file:///path/to/excel.xlsx"
          timestamp="2019-12-11T12:50:20.252+01:00">
  
   <properties>
      … Sheet properties …
   </properties>
  
   <worksheet name="Sheet1">
      <row index="1">
         <cell index="1" ref="A1">
            <value>1</value>
         </cell>
         <cell index="2" ref="B1">
            <value>What's up?</value>
         </cell>
      </row>
      <row index="2">
         <cell index="1" ref="A2">
            <value>2</value>
            <formula>A1+1</formula>
         </cell>
         <cell index="2" ref="B2">
            <value>Cell with <span class="b">bold</span> in it</value>
         </cell>
      </row>
   </worksheet>
  
</workbook>

1.2 Converting to Excel (.xlsx)

The xtlxo:modify-xlsx pipeline takes a template Excel .xlsx file and changes this. The result will be written to a new Excel file.

It has the following features:

There are some things you need to take care of creating the template Excel file:

The XML for specifying the changes to the Excel file is quite simple. The schema can be found here. A simple example:

<xlsx-modifications xmlns="http://www.xtpxlib.nl/ns/xoffice">
  
  <worksheet name="TEST">
    
    <row name="NAMEDCELL" >
      <column name="NAMEDCELL" >
        <number>12345</number>
      </column>
      <column name="NAMEDCELL" offset="1">
        <string>One to the right</string>
      </column>
    </row>
    
    <row index="1">
      <column index="1">
        <string>Upper left-hand corner</string>
      </column>
      <column index="2">
        <number>6E3</number>
      </column>
    </row>
  </worksheet>
  
</xlsx-modifications>

1.3 Converting from Word (.docx)

The xtlxo:extract-docx pipeline takes a Word (.docx) file and turns this into an understandable XML format. This format is experimental, there is currently no schema for it.

As an example take this simple Word file:


Running this through the xtlxo:extract-docx pipeline returns something like:

<document xmlns="http://www.xtpxlib.nl/ns/xoffice"
          dref=""
          timestamp="2019-12-11T13:09:15.415+01:00">
   <properties>
      … document properties …      
   </properties>
  
   <p xml:space="preserve">Hello there!</p>
   <p xml:space="preserve">Something in <span class="b">Bold</span>!</p>
   <p class="ListBullet" xml:space="preserve">A list entry</p>
   <p class="ListBullet" xml:space="preserve">Another one</p>
   <p class="ListBullet" indent-left="360" indent-level="0" xml:space="preserve"/>
   <table>
      <tr>
         <td>
            <p class="ListBullet" indent-level="0" xml:space="preserve">Simple table header</p>
         </td>
         <td>
            <p class="ListBullet" indent-level="0" xml:space="preserve">More header</p>
         </td>
      </tr>
      <tr>
         <td>
            <p class="ListBullet" indent-level="0" xml:space="preserve">Column1, row 2</p>
         </td>
         <td>
            <p class="ListBullet" indent-level="0" xml:space="preserve">Column 2 row 2</p>
         </td>
      </tr>
   </table>
   <p class="ListBullet" indent-left="360" indent-level="0" xml:space="preserve"/>
  
</document>

 

There's an experimental pipeline xtlxo:create-docx to create Word documents (using a template Word document for things like styles, margins, etc.). If you feed this the same kind of XML you get from xtlxo:extract-docx, the result should be a valid, useable Word document with the new text in it. It's currently incomplete (it doesn't do tables for instance). Use at your own risk.