Archive for July, 2006

Open Document Tutorial part 2:

Tuesday, July 11th, 2006

This article is being archived here from its original publication in the 3Monkeyweb wiki.

When a document has been edited in both Word and OpenOffice, certain artifacts are introduced. One of my current projects is to clean up an 800-plus-page specification that has been translated between .doc and .odt many times. This has resulted in a bloated document with several inconsistencies. In my last article, I showed some underlying differences between the two formats, both from direct inspection and by inference. Now comes the task of tackling the .odt.

Getting started

I will be using oo_ms_oo_doc.odt as a reference document for this tutorial. As shown previously, .odt files are nothing more than .zip files. Unzip the file in an empty directory as follows.

% unzip ../oo_ms_oo_doc.odt
Archive:  ../oo_ms_oo_doc.odt
extracting: mimetype
inflating: layout-cache
inflating: content.xml
inflating: styles.xml
extracting: meta.xml
inflating: Thumbnails/thumbnail.png
inflating: settings.xml
inflating: META-INF/manifest.xml

I am going to skip all of the preamble information regarding .odt “packages” and jump right into the meat of the problem. The files we are most concerned with are “content.xml” and “styles.xml”. Open “styles.xml” in your favorite XML editor (I prefer oXygen). Ignoring the root element, <office:document-styles>, we see the first major element of the document, <office:font-face-decls>. This is the element we will attack first. I will use RELAX NG schema notation for elements.

<office:font-face-decls>

<define name="office-font-face-decls">
  <optional>
    <element name="office:font-face-decls">
      <zeroOrMore>
        <ref name="style-font-face"/>
      </zeroOrMore>
    </element>
  </optional>
</define>

As you can see, it is pretty simple. It only contains style-font-face refs. So let us take a look at that element.

<define name="style-font-face">
  <element name="style:font-face">
    <ref name="style-font-face-attlist"/>
    <optional>
      <ref name="svg-font-face-src"/>
    </optional>
    <optional>
      <ref name="svg-definition-src"/>
    </optional>
  </element>
</define>

For brevity (and simplicity), I’m going to ignore the optional “svg-font-face-src” and “svg-definition-src” refs. As a side note, I have not encountered these in real-world situations. We are left with an <office:font-face-decls> element that contains zero or more <style:font-face> elements. We can infer that the ref “style-font-face-attlist” is an attribute list and does not contain any elements. I have verified that this is indeed the case, but the complete definition is too lengthy to list here. Here is the complete schema.

Basic strategy

We will iterate through the fonts comparing certain attributes. When we find two fonts that are similar enough, we can replace one with the other and remove the duplicate. This will be accomplished in two steps.

  1. Identify potential substitutions
    Once all substitutions have been identified, a map file is written to disk. This file can then be edited to suit the particular interest of the user.
  2. Perform the substitutions
    Once the map file is ready the script is run a second time to make all of the replacements.
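The two-pass workflow above can be sketched roughly as follows. This is only an illustration in Python, not the Perl of the actual script, and it assumes a JSON map file; the file name and the helpers write_map and read_map are hypothetical.

```python
import json
import os
import tempfile

def write_map(mapping, path):
    """Pass 1: save the proposed substitutions so they can be hand-edited."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(mapping, fh, indent=2, ensure_ascii=False)

def read_map(path):
    """Pass 2: load the (possibly edited) map, keeping real substitutions only."""
    with open(path, encoding="utf-8") as fh:
        mapping = json.load(fh)
    # An empty replacement means "keep this font as it is".
    return {old: new for old, new in mapping.items() if new}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "font-face-decls.map.json")  # hypothetical name
    write_map({"StarSymbol": "", "Wingdings": "StarSymbol"}, path)
    substitutions = read_map(path)

print(substitutions)
```

Keeping the map on disk between the two passes is what lets the user override any automatic choice before anything is rewritten.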

The code
I chose to program this in Perl with the help of the package XML::Simple. It certainly could have been done with some XSL filters, but that would have been much more complicated. The complete Perl script font-face-decls.pl can be downloaded from the ODT Tools file repository. Remember, this was not intended as a production script. Therefore, I did not worry a lot about bounds checking, error handling, or just plain making it look pretty. If you would like to volunteer to help on this project and combine this and future tools into a well-rounded package, please contact me.

First we need to load “content.xml” and “styles.xml”, then extract the <office:document-styles> element. I simply read each file in as one big string by locally undef’ing $/, then use a regular expression to extract the <office:document-styles> element to a string. Finally, I use XMLin to convert the element to a Perl data structure. I could have actually extracted <office:font-face-decls>
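As a rough illustration of the loading step, here is a minimal sketch in Python rather than Perl. It uses the standard library XML parser instead of a regular expression plus XMLin, and an inline sample in place of reading “styles.xml” from disk. The helper name font_faces is hypothetical; the namespace URIs are the ones OpenDocument files declare.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in OpenDocument files.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
}

def font_faces(xml_text):
    """Return {font name: {attribute: value}} for every <style:font-face>."""
    root = ET.fromstring(xml_text)
    fonts = {}
    for face in root.findall("office:font-face-decls/style:font-face", NS):
        # ElementTree stores attribute names as "{namespace-uri}local-name";
        # keep only the local name for readability.
        attrs = {key.split("}", 1)[-1]: value for key, value in face.attrib.items()}
        fonts[attrs.pop("name")] = attrs
    return fonts

SAMPLE = """<office:document-styles
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0">
  <office:font-face-decls>
    <style:font-face style:name="StarSymbol" style:font-charset="x-symbol"/>
    <style:font-face style:name="Albany AMT" style:font-pitch="variable"/>
  </office:font-face-decls>
</office:document-styles>"""

fonts = font_faces(SAMPLE)
print(fonts)
```

Using a namespace-aware parser sidesteps the namespace wrapping that the XMLin approach needs, at the cost of slightly more verbose attribute names.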

I don’t want to work with two structures, so the first thing I do is combine the styles and content hashes into a single hash. We check to make sure any combined elements contain the same attributes, adding any extra attributes as well.
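The merge might look something like the sketch below (Python, not the original Perl). It assumes each file has already been reduced to a dictionary of font name to attributes; the function name merge_fonts and the conflict list are my own illustration, not part of the script.

```python
def merge_fonts(styles_fonts, content_fonts):
    """Combine two {font name: {attribute: value}} tables into one.

    Attributes found in only one table are added to the result. Attributes
    present in both with different values are reported rather than silently
    overwritten.
    """
    merged = {name: dict(attrs) for name, attrs in styles_fonts.items()}
    conflicts = []
    for name, attrs in content_fonts.items():
        target = merged.setdefault(name, {})
        for key, value in attrs.items():
            # setdefault returns the existing value when the key is already set.
            if target.setdefault(key, value) != value:
                conflicts.append((name, key, target[key], value))
    return merged, conflicts

styles_fonts = {"Albany AMT": {"font-pitch": "variable"}}
content_fonts = {
    "Albany AMT": {"font-pitch": "variable", "font-family-generic": "swiss"},
    "Wingdings": {"font-charset": "x-symbol"},
}
merged, conflicts = merge_fonts(styles_fonts, content_fonts)
print(merged)
print(conflicts)
```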

So what in the XML do we want to modify?
Let us compare <style:font-face> elements to determine where we might make some improvements.

The XML (edited for brevity)

<style:font-face style:name="StarSymbol"
style:font-charset="x-symbol"/>

<style:font-face style:name="Wingdings"
style:font-pitch="variable"
style:font-charset="x-symbol"/>

<style:font-face style:name="Symbol"
style:font-family-generic="roman"
style:font-pitch="variable"
style:font-charset="x-symbol"/>

<style:font-face style:name="Albany AMT1"
style:font-pitch="variable"/>

<style:font-face style:name="Albany AMT"
style:font-pitch="variable"/>

<style:font-face style:name="Lucidasans"
style:font-pitch="variable"/>

<style:font-face style:name="Thorndale AMT"
style:font-family-generic="roman"
style:font-pitch="variable"/>

<style:font-face style:name="Thorndale AMT1"
style:font-family-generic="roman"
style:font-pitch="variable"/>

Notice that “Albany AMT” and “Thorndale AMT” appear to be duplicated. Our first rule will be to replace any fonts whose names differ only by an appended sequential number. Next, we see that there are three fonts with an “x-symbol” font-charset. One symbol font is plenty, so we can replace all symbol fonts with a single symbol font. Finally, we notice that neither “Albany AMT” nor “Lucidasans” has a “style:font-family-generic” attribute. These both happen to belong to the “swiss” generic font family. Since we are attacking this in two steps, we will be able to modify the “font-face-decls.map” file in order to substitute one of these for the other. But let us consider the case where these two style:font-faces were described as follows.

The Hypothetical XML

<style:font-face style:name="Albany AMT"
style:font-family-generic="swiss"
style:font-pitch="variable"/>

<style:font-face style:name="Lucidasans"
style:font-family-generic="swiss"
style:font-pitch="variable"/>

If this were the case, then we could add a third rule to replace members of the same style:font-family-generic with a single font. Perhaps I will update the example data to show this operation, but for now just be aware that I have tested this rule on the 800-page gorilla, and it is included in the script.
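To make the three rules concrete, here is a minimal sketch in Python (the real script is Perl, and its exact matching logic may differ). The function name build_map is hypothetical, the font table uses the hypothetical “swiss” attributes from above, and picking the first font seen as the one to keep is my assumption.

```python
import re

# Sample fonts in document order, with the hypothetical "swiss" attributes.
FONTS = {
    "StarSymbol": {"font-charset": "x-symbol"},
    "Wingdings": {"font-pitch": "variable", "font-charset": "x-symbol"},
    "Symbol": {"font-family-generic": "roman", "font-pitch": "variable",
               "font-charset": "x-symbol"},
    "Albany AMT1": {"font-pitch": "variable"},
    "Albany AMT": {"font-family-generic": "swiss", "font-pitch": "variable"},
    "Lucidasans": {"font-family-generic": "swiss", "font-pitch": "variable"},
    "Thorndale AMT": {"font-family-generic": "roman", "font-pitch": "variable"},
    "Thorndale AMT1": {"font-family-generic": "roman", "font-pitch": "variable"},
}

def build_map(fonts):
    """Return {font name: replacement}, with "" meaning the font is kept."""
    mapping = {name: "" for name in fonts}

    # Rule 1: a name that is another font's name plus trailing digits.
    for name in fonts:
        base = re.sub(r"\d+$", "", name)
        if base != name and base in fonts:
            mapping[name] = base

    # Rule 2: collapse all symbol fonts onto the first one encountered.
    symbols = [n for n, a in fonts.items() if a.get("font-charset") == "x-symbol"]
    for name in symbols[1:]:
        mapping[name] = mapping[name] or symbols[0]

    # Rule 3: one font per generic family, skipping fonts handled above.
    keeper = {}
    for name, attrs in fonts.items():
        family = attrs.get("font-family-generic")
        if family is None or mapping[name] or name in symbols:
            continue
        first = keeper.setdefault(family, name)
        if first != name:
            mapping[name] = first
    return mapping

font_map = build_map(FONTS)
for name, replacement in font_map.items():
    print(f"{name!r} => {replacement!r}")
```

With the hypothetical data, rule 3 is the only rule that touches “Lucidasans”; without the “swiss” attributes that entry would stay empty and would have to be filled in by hand.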

Running the script

As I stated above, the script I wrote is not of my normal professional quality, so the run environment is pretty strict. Volunteers? You must run the script from the directory that contains your extracted .odt. To create the map file run the following command.

% font-face-decls.pl map

This will result in the output of two files, “font-face-decls.rpt” and “font-face-decls.map”, shown below.

font-face-decls.rpt

StarSymbol                             x-symbol
Wingdings                              x-symbol                 StarSymbol
Symbol                 roman           x-symbol                 StarSymbol
Albany AMT1                                                     Albany AMT
Albany AMT
Lucidasans
Thorndale AMT          roman
Thorndale AMT1         roman                                    Thorndale AMT

font-face-decls.map

{
  'StarSymbol' => '',
  'Wingdings' => 'StarSymbol',
  'Symbol' => 'StarSymbol',
  'Albany AMT1' => 'Albany AMT',
  'Albany AMT' => '',
  'Lucidasans' => '',
  'Thorndale AMT' => '',
  'Thorndale AMT1' => 'Thorndale AMT',
};

As suggested previously, we want to modify the map file in order to eliminate either “Albany AMT” or “Lucidasans”. Since “Albany AMT” is already being used as a replacement, we will replace “Lucidasans” with it as well. Therefore, edit the “Lucidasans” line of the map file to read:

'Lucidasans' => 'Albany AMT',

We are now ready to perform the substitutions in bulk. Run the script in “replace” mode as follows.

% font-face-decls.pl replace

We end up with two files, “content-new.xml” and “styles-new.xml”. If we examine either of these files, we will find that the <office:font-face-decls> element is reduced to three <style:font-face> elements, as seen here.

<office:font-face-decls> (Edited for brevity)

<style:font-face style:name="StarSymbol"
style:font-charset="x-symbol"/>

<style:font-face style:name="Albany AMT"
style:font-pitch="variable"/>

<style:font-face style:name="Thorndale AMT"
style:font-family-generic="roman"
style:font-pitch="variable"/>

Some interesting points

You may notice in the code that when replacing one font with another, we first remove the duplicate style, then substitute the replacement font name globally for the duplicated font name. There is a potential bug in the global substitution. Suppose the actual content of the document contained the duplicate font name in quotes, such as "Symbol". This would be replaced by "StarSymbol", which is an incorrect substitution. We should limit our substitution to the element <office:document-styles>. Volunteers?
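One possible way to avoid the problem, sketched here in Python and not taken from the Perl script, is to parse the XML and rewrite only the attributes that refer to fonts, leaving text content alone. I am assuming that fonts are referenced through the style:font-name, style:font-name-asian and style:font-name-complex attributes; the function name apply_map is hypothetical.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
# Assumed list of attributes that refer to a declared font by name.
FONT_REFS = [f"{{{STYLE}}}{n}" for n in
             ("font-name", "font-name-asian", "font-name-complex")]

def apply_map(root, mapping):
    """Drop replaced font declarations and retarget references to them."""
    for decls in list(root.iter(f"{{{OFFICE}}}font-face-decls")):
        for face in list(decls):
            if mapping.get(face.get(f"{{{STYLE}}}name")):
                decls.remove(face)
    for elem in root.iter():
        for attr in FONT_REFS:
            new_name = mapping.get(elem.get(attr))
            if new_name:
                elem.set(attr, new_name)

SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:font-face-decls>
    <style:font-face style:name="StarSymbol" style:font-charset="x-symbol"/>
    <style:font-face style:name="Symbol" style:font-charset="x-symbol"/>
  </office:font-face-decls>
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties style:font-name="Symbol"/>
    </style:style>
  </office:automatic-styles>
  <office:body><office:text>
    <text:p>Use the "Symbol" font.</text:p>
  </office:text></office:body>
</office:document-content>"""

root = ET.fromstring(SAMPLE)
apply_map(root, {"Symbol": "StarSymbol", "StarSymbol": ""})

remaining = [f.get(f"{{{STYLE}}}name") for f in root.iter(f"{{{STYLE}}}font-face")]
props = next(root.iter(f"{{{STYLE}}}text-properties"))
paragraph = next(root.iter("{urn:oasis:names:tc:opendocument:xmlns:text:1.0}p"))
print(remaining, props.get(f"{{{STYLE}}}font-name"), paragraph.text)
```

The paragraph text still contains "Symbol" in quotes, while the style now points at StarSymbol.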

Second, notice that we are replacing the duplicate font name in certain styles. That means that one or more styles may now be duplicated. We will investigate removing duplicate styles in the next installment.

Another point of interest is the fact that we wrap our string in the call to XMLin with a static string. This resolves namespace issues but could be implemented more cleanly.

Until next time…

Differences in OpenOffice .odt vs Microsoft Word .doc

Thursday, July 6th, 2006

This article is being archived here from its original publication in the 3Monkeyweb wiki.

This is the first in a series of articles detailing my experiences with directly manipulating .odt files. I currently have a project to clean up and unify an 800-plus-page document that has been converted among several formats over the years. It is distributed in .doc, .odt, .sxw and .pdf forms.

What am I working with

I’m starting out by creating several files that I can inspect for differences. The files can be found here. The files, with their sizes and a brief description of each, are listed below.

  • oo_doc.odt (19175) was created by copying and pasting a 750-word section of a reference article into OpenOffice Writer. I then added a character style and a list style. I applied these two styles and a third default style to the document, and applied the “Heading 1” style to all headers.
  • ms_doc.doc (33280) was created in the same manner as oo_doc.odt, using MS Word 2003 instead of OpenOffice Writer.
  • oo_doc.doc (21504), oo_doc.sxw (18787), oo_doc.rtf (19013), oo_doc_ms.xml (26148), and oo_doc_db.xml (5957) were all created by loading oo_doc.odt and saving it in the appropriate format. oo_doc_ms.xml is Microsoft Word 2003 XML and oo_doc_db.xml is DocBook XML.
  • ms_oo_doc.doc (22528) and ms_oo_doc.odt (19911) were created by loading ms_doc.doc with OpenOffice Writer and saving in the appropriate format.
  • oo_ms_doc.doc (31744) was created by loading oo_doc.doc with Microsoft Word 2003 and simply re-saving.
  • oo_ms_oo_doc.doc (22528) and oo_ms_oo_doc.odt (20653) were created by loading oo_ms_doc.doc with OpenOffice Writer and saving to the appropriate format.
  • ms_doc.rtf (22940) and ms_doc_ms.xml (26011) were created by loading ms_doc.doc in Microsoft Word 2003 and saving to the appropriate format. ms_doc_ms.xml is Microsoft Word 2003 XML.
  • ms_doc.zip (5465), oo_doc.zip (4608), and oo_odt.zip (18195) were all compressed with zip (not gzip, since zip is the compression engine OpenOffice uses) from ms_doc.doc, oo_doc.doc and oo_doc.odt respectively.

I will probably add more files to the repository as my investigation continues, but for now these will do.

A few observations

First, you will notice that the .odt files are generally smaller than the .doc files created from the same source within OpenOffice Writer: oo_doc is 19175 bytes as .odt vs 21504 as .doc, and oo_ms_oo_doc is 20653 as .odt vs 22528 as .doc. Although the sample size is small, the trend is clear: the .odt is more compact than the .doc. The second thing you might notice is that ms_doc.doc is highly compressible (84%, as ms_doc.zip) while oo_doc.odt is not (6%, as oo_odt.zip). I didn’t actually expect the .odt to compress at all, as it is already a compressed format, so that is a little strange. What is its real size? What is even more interesting is the compression on ms_doc.doc. Apparently the .doc format is highly compressible. So why doesn’t Microsoft compress the file? Imagine the bandwidth savings on all of those .doc email attachments. Finally, when an .odt is saved as .doc, re-saved in Word, then converted back to .odt, bloat is introduced.

The first observation can be attributed to differences in the file formats, so I’m not too interested in that. The third observation I will cover in my next post. So let us investigate the second observation, specifically the real size of the .odt. When I unzip the .odt, this is what I get.

> ls -al *
-rw-r--r-- 1 user users 13437 2006-07-06 18:11 content.xml
-rw-r--r-- 1 user users 18 2006-07-06 18:11 layout-cache
-rw-r--r-- 1 user users 1055 2006-07-06 18:11 meta.xml
-rw-r--r-- 1 user users 39 2006-07-06 18:11 mimetype
-rw-r--r-- 1 user users 6608 2006-07-06 18:11 settings.xml
-rw-r--r-- 1 user users 13138 2006-07-06 18:11 styles.xml

Configurations2:
total 0
drwxr-xr-x 2 user users 48 2006-07-06 18:11 .
drwxr-xr-x 6 user users 336 2006-07-06 14:25 ..

META-INF:
total 4
drwxr-xr-x 2 user users 80 2006-07-06 14:25 .
drwxr-xr-x 6 user users 336 2006-07-06 14:25 ..
-rw-r--r-- 1 user users 1173 2006-07-06 18:11 manifest.xml

Pictures:
total 0
drwxr-xr-x 2 user users 48 2006-07-06 18:11 .
drwxr-xr-x 6 user users 336 2006-07-06 14:25 ..

Thumbnails:
total 12
drwxr-xr-x 2 user users 80 2006-07-06 14:25 .
drwxr-xr-x 6 user users 336 2006-07-06 14:25 ..
-rw-r--r-- 1 user users 10261 2006-07-06 18:11 thumbnail.png

Well, a 10k .png will not help our cause. Configurations2, Pictures and Thumbnails are not required by the OpenDocument specification, so let us remove them. We need to modify META-INF/manifest.xml as well by removing the elements that reference them. After doing so and re-zipping, the resulting .odt is now 8232 bytes, much closer to the compressed .doc. Realize that for significantly large documents, thumbnail.png becomes much less of a factor. Also realize that if you have embedded graphics or special configuration information, it might not be a good idea to remove those directories. If we open and re-save the document, all of what we removed will be replaced. But since this is an optimization effort, and all we really wanted to discover is how well .odt maps to .doc, we are in the ballpark.
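The stripping step could be automated along the following lines. This is a sketch in Python under my own assumptions, not the procedure I actually followed by hand: the function names strip_package and demo_package are hypothetical, the package is built in memory so the example is self-contained, and “mimetype” is kept first and stored uncompressed, as OpenDocument packages expect.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
DROP = ("Thumbnails/", "Pictures/", "Configurations2/")

def strip_package(src, dst):
    """Copy an .odt package, leaving out optional directories."""
    ET.register_namespace("manifest", MANIFEST)
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():          # preserves the original order
            name = info.filename
            if name.startswith(DROP):
                continue
            data = zin.read(name)
            if name == "META-INF/manifest.xml":
                # Remove the manifest entries that point at dropped files.
                root = ET.fromstring(data)
                for entry in list(root):
                    path = entry.get(f"{{{MANIFEST}}}full-path", "")
                    if path.startswith(DROP):
                        root.remove(entry)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            method = zipfile.ZIP_STORED if name == "mimetype" else zipfile.ZIP_DEFLATED
            zout.writestr(name, data, compress_type=method)

def demo_package():
    """Build a tiny stand-in package in memory (not a complete document)."""
    manifest = (
        f'<manifest:manifest xmlns:manifest="{MANIFEST}">'
        '<manifest:file-entry manifest:full-path="content.xml"'
        ' manifest:media-type="text/xml"/>'
        '<manifest:file-entry manifest:full-path="Thumbnails/thumbnail.png"'
        ' manifest:media-type=""/>'
        '</manifest:manifest>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        z.writestr("content.xml", "<doc/>")
        z.writestr("Thumbnails/thumbnail.png", b"placeholder image bytes")
        z.writestr("META-INF/manifest.xml", manifest)
    buf.seek(0)
    return buf

out = io.BytesIO()
strip_package(demo_package(), out)
with zipfile.ZipFile(out) as result:
    names = result.namelist()
    new_manifest = result.read("META-INF/manifest.xml")
print(names)
```

For a real file, src and dst would be paths such as the original .odt and a new output name; as noted above, do not do this to documents with embedded graphics.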

What have we learned

Natively, OpenOffice is providing smaller file sizes than Microsoft Word. In our test case this was on the order of a forty percent reduction: oo_doc.odt at 19175 bytes versus ms_doc.doc at 33280, a reduction of 14105 bytes or 42%. It was also noted that when we tried to compress ms_doc.doc, we achieved a very substantial decrease in size. However, the practicality of unzipping and re-zipping a .doc file each time we want to edit it was called into question. We were also able to determine that the majority of the bloat in the .odt data was caused by the thumbnail feature of OpenOffice, and that for larger documents this should quickly become a non-factor.
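The percentages can be re-derived from the file sizes listed earlier. A small Python check follows; the helper name reduction is just for illustration, and it compares whole-file sizes, which is not necessarily the figure the zip tool reports.

```python
def reduction(original, reduced):
    """Percentage by which `reduced` is smaller than `original`."""
    return 100 * (original - reduced) / original

odt_vs_doc = round(reduction(33280, 19175))   # ms_doc.doc vs oo_doc.odt
doc_zipped = round(reduction(33280, 5465))    # ms_doc.doc vs ms_doc.zip
odt_stripped = round(reduction(19175, 8232))  # oo_doc.odt vs stripped .odt

print(odt_vs_doc, doc_zipped, odt_stripped)
```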

Next time

Point three, that saving in alternating sessions of OpenOffice and Word introduces bloat, will be covered. We will take a look at the <office:font-face-decls> element, and I will introduce a script that maintains a clean set of font-face-decls.