Archive for the 'Open Source' Category

OpenOffice .odt Opened Up – Part 1: Overview

Friday, January 12th, 2007

Overview

In the first article in this series, OpenOffice ODF/.odt compared to Microsoft Word .doc, I compared various file types for size efficiency. Of particular interest was the fact that OpenOffice Writer stores .odt files in a zip format, an implementation of PKZip to be exact. With this knowledge and the Open Document Format standard, we can investigate how certain elements of a document affect its size and overall efficiency.

My test cases were produced with the following software:

  • SuSE Linux 10.1
  • OpenOffice 2.0.2.7.1
  • zip 2.31 (March 8th 2005)

Starting Out

As we previously observed, .odt documents are stored in ZIP format. It is possible to store the document as a single XML file that conforms to the OpenOffice.org document type definition (DTD). It is also possible to store the document as several subdocuments, each with a different document root that represents a particular aspect of the document, such as content or style.
Quoting the Open Document Format for Office Applications (OpenDocument) v1.0 (Second Edition), (ODF Specification):

The OpenDocument format supports the following two ways of document representation:

  • As a single XML document.
  • As a collection of several subdocuments within a package (see section 17), each of which stores part of the complete document. Each subdocument has a different document root and stores a particular aspect of the XML document. For example, one subdocument contains the style information and another subdocument contains the content of the document. All types of documents, for example, text and spreadsheet documents, use the same document and subdocuments definitions.

There are four types of subdocuments, each with different root elements. Additionally, the single XML document has its own root element, for a total of five different supported root elements. The root elements are summarized in the following table:

Root Element | Subdocument Content | Subdoc. Name in Package
office:document | Complete office document in a single XML document. | n/a
office:document-content | Document content and automatic styles used in the content. | content.xml
office:document-styles | Styles used in the document content and automatic styles used in the styles themselves. | styles.xml
office:document-meta | Document meta information, such as the author or the time of the last save action. | meta.xml
office:document-settings | Application-specific settings, such as the window size or printer information. | settings.xml

So, what is in our reference .odt? We will use the Linux produced document from a prior article (oo_part1.odt) with XML compression disabled. We’ve done this so that the XML is more human readable. After we unzip the file using the Linux utility unzip, we have the raw files as shown below.

.odt unzipped directory tree

As you can see, all four subdocuments named in the specification are present, as well as several other files. In particular, META-INF/manifest.xml lists the contents of the package, including information such as the full path and type of each file.
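To make the package layout concrete, here is a minimal Python sketch of reading the member list and the manifest from an .odt. It is an illustration only, not one of the tools used for this article. The helper names (list_package, build_sample_package) are made up for the example, and the sample package is a tiny stand-in built in memory rather than a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/manifest.xml in OpenDocument packages.
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def list_package(odt_bytes):
    """Return (zip member names, {full-path: media-type}) for a package."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as pkg:
        names = pkg.namelist()
        root = ET.fromstring(pkg.read("META-INF/manifest.xml"))
    entries = {}
    for entry in root.iter("{%s}file-entry" % MANIFEST_NS):
        path = entry.get("{%s}full-path" % MANIFEST_NS)
        entries[path] = entry.get("{%s}media-type" % MANIFEST_NS)
    return names, entries


def build_sample_package():
    """Build a tiny stand-in package in memory (hypothetical, not a valid document)."""
    manifest = (
        '<manifest:manifest xmlns:manifest="%s">'
        '<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>'
        '<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="styles.xml"/>'
        "</manifest:manifest>" % MANIFEST_NS
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        pkg.writestr("content.xml", "<x/>")
        pkg.writestr("styles.xml", "<x/>")
        pkg.writestr("META-INF/manifest.xml", manifest)
    return buf.getvalue()


names, entries = list_package(build_sample_package())
print(names)
print(entries)
```

For a real document, the bytes would come from reading the .odt file from disk instead of build_sample_package().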

The file Thumbnails/thumbnail.png, although part of the package, is not part of the document. The thumbnail image should conform to the Thumbnail Managing Standard (TMS) at www.freedesktop.org, and therefore should be a 24-bit, non-interlaced PNG image with full alpha transparency. The required size for the thumbnails is 128×128 pixels.

Here is the thumbnail from our reference document.

thumbnail.png

Having the thumbnail available in the package allows other applications, such as file managers, to preview the document for the user. With a little creative programming, sites such as Google, Yahoo or Ask could extract this thumbnail and preview the document for users with little difficulty.
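As a sketch of the kind of check a previewing application might make, the following Python reads the image dimensions and interlace flag from a PNG's IHDR chunk. The function names are hypothetical, and sample_header builds only the signature and header chunk of an RGBA image, which is enough to exercise the parser but is not a complete PNG file.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_info(data):
    """Return (width, height, bit_depth, color_type, interlace) from the IHDR chunk."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type, 13 bytes of IHDR data.
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height, depth, color, _comp, _filt, interlace = struct.unpack(
        ">IIBBBBB", data[16:29]
    )
    return width, height, depth, color, interlace


def sample_header(width, height):
    """Build the signature plus IHDR chunk of an 8-bit RGBA, non-interlaced PNG."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)
    chunk = b"IHDR" + ihdr
    crc = struct.pack(">I", zlib.crc32(chunk) & 0xFFFFFFFF)
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + chunk + crc


width, height, depth, color, interlace = png_info(sample_header(128, 128))
fits = width <= 128 and height <= 128 and interlace == 0
print(width, height, depth, color, interlace, fits)
```

With a real package, the bytes passed to png_info would be the contents of Thumbnails/thumbnail.png read out of the zip archive.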

Document Elements

The office:document may contain any of the document elements listed below.

  • office:document-attrs
  • office:document-common-attrs
  • office:meta
  • office:settings
  • office:scripts
  • office:font-face-decls
  • office:styles
  • office:automatic-styles
  • office:master-styles
  • office:body

When the subdocument method is used, however, elements are restricted to certain subdocuments.

Elements in content.xml

  • office:document-content (subdocument root)
  • office:document-common-attrs
  • office:scripts
  • office:font-face-decls
  • office:automatic-styles
  • office:body

Elements in styles.xml

  • office:document-styles (subdocument root)
  • office:document-attrs
  • office:document-common-attrs
  • office:font-face-decls
  • office:styles
  • office:automatic-styles
  • office:master-styles

Elements in meta.xml

  • office:document-meta (subdocument root)
  • office:document-common-attrs
  • office:meta

Elements in settings.xml

  • office:document-settings (subdocument root)
  • office:document-common-attrs
  • office:settings
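The mapping between root elements and package files can be checked mechanically. The sketch below is an illustration in Python, with a hypothetical classify helper: it parses an XML string, looks at the root element in the office namespace, and reports which package file that root belongs to along with its top-level office: children. The sample input is a made-up, nearly empty meta subdocument.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Root element local name -> file name in the package (None for the single-file form).
ROOT_TO_FILE = {
    "document": None,
    "document-content": "content.xml",
    "document-styles": "styles.xml",
    "document-meta": "meta.xml",
    "document-settings": "settings.xml",
}


def classify(xml_text):
    """Return (package file name, list of top-level office:* child names)."""
    root = ET.fromstring(xml_text)
    prefix = "{%s}" % OFFICE_NS
    if not root.tag.startswith(prefix):
        raise ValueError("root element is not in the office namespace")
    local = root.tag[len(prefix):]
    if local not in ROOT_TO_FILE:
        raise ValueError("unknown root element: office:%s" % local)
    children = [c.tag[len(prefix):] for c in root if c.tag.startswith(prefix)]
    return ROOT_TO_FILE[local], children


sample = (
    '<office:document-meta xmlns:office="%s">'
    "<office:meta/>"
    "</office:document-meta>" % OFFICE_NS
)
print(classify(sample))
```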

What’s Up Next?

At this point we have a clear understanding of the subdocument method that OpenOffice applies to its ODF implementation, and we know what top level elements are handled by each subdocument.

In the next article, we will ease into the subdocument elements by exploring the office:document-meta and office:document-settings elements. These two elements are rather simple and will not require as much review compared to office:document-content or office:document-styles.
Until next time.

-3Monkeys


OpenOffice ODF/.odt compared to Microsoft Word .doc

Friday, December 29th, 2006

Overview

This is the first in a series of articles that will compare ODF, and in particular the OpenOffice implementation, with Microsoft Office and its various data formats across several measures. This article covers the efficiency of the .odt, .doc and .xml formats, with particular attention to native and compressed file sizes.

Methodology

My Windows test cases were generated using the following software:

  • Microsoft Windows XP Professional 2002, SP2
  • Microsoft Word 2003 (11.6368.6368) SP2
  • OpenOffice 2.0.3
  • Adobe Acrobat Standard 7.0.8 5/16/2006.

My Linux test cases were produced with the following software:

  • SuSE Linux 10.1
  • OpenOffice 2.0.2.7.1
  • Adobe Reader 7.0.8 05/22/2006

I needed a fairly large chunk of text for my test and decided on the November draft of the ISO/IEC C++ Standard, located at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1905.pdf (copy here). This is a very large document, so I used only the first seven chapters for my test case. To produce the target documents, I selected the contents from the beginning of the document through chapter 7 and copied them to the clipboard. I then pasted the clipboard into Microsoft Word under Windows and OpenOffice Writer under both Windows and Linux. For Microsoft Word, I saved the document as a native .doc and as .xml. For OpenOffice, I saved the document as a native .odt and exported it as .doc. I also saved the content as .txt with Notepad under Windows as a reference point. For archival purposes, I have mirrored all documents referred to in this article on the 3monkey wiki download area.

Raw Results

File | Size (bytes)
Microsoft Office .doc | 921,088
Microsoft Office .xml | 6,475,669
OpenOffice (XP) .odt | 154,892
OpenOffice (XP) .doc | 1,335,296
OpenOffice (Linux) .odt | 160,045
OpenOffice (Linux) .doc | 1,338,368
Notepad | 417,549

Observations

My first observation was the Linux OpenOffice implementation created slightly larger file sizes than the Windows implementation. This was probably due to the differing versions. I will revisit this in a later article if it is merited.

My next observation was that the OpenOffice .doc file was significantly larger than the Microsoft Word version. This is likely due to Microsoft's access to the complete .doc specification, and thus a better understanding of how to optimize the file content and size. For grins, I loaded the OpenOffice .doc with Microsoft Word and saved it natively. I also loaded the Microsoft Word .doc with OpenOffice and saved it both as a .doc and as an .odt. The results of these tests are below.

File | Size (bytes)
OO .doc loaded/saved in MS | 808,960
MS .doc loaded/saved in OO | 1,277,952
MS .doc loaded/saved as .odt in OO | 155,113

This produced some interesting results. First, even though the OpenOffice .doc file was larger than the native Microsoft Word version, loading and saving it with Word produced a file 12% smaller than the original native Word .doc (808,960 versus 921,088 bytes). This indicates that OpenOffice does not save all of the information regarding a document that Word does. This is further supported by the opposite transformation. When we load the Word document in OpenOffice and re-save it as a .doc, the result (1,277,952 bytes) is again smaller than the .doc OpenOffice produced on its own (1,335,296 bytes). This reduction, although not as significant, supports the idea that OpenOffice is not saving all of the information in its .doc format that Word does. By a cursory visual inspection, all of the documents seem to be equivalent. Without access to the .doc file format specification, it is difficult to infer whether or not the information loss is of consequence. In other words, the file size difference may be due to bloat in the native Word format or due to information loss by OpenOffice.

Next, most people will notice that the .odt versions are smaller than the .doc versions regardless of which application produced them. Furthermore, the .odt is a little over one-third the size of the raw text from Notepad. The reason the .odt is so much smaller is that the OpenOffice implementation compresses its output and decompresses it on the fly for input. This has both advantages and disadvantages. The primary disadvantage is load and save times. Since the file must be either compressed or decompressed, this takes extra CPU cycles. However, with the speed and efficiency of today's processors, this should be of little practical impact. The one obvious major advantage is file size. Not only does this save raw disk storage, it also lowers the bandwidth needed for mediums such as email and downloads.

I wondered what the results would be of compressing the .doc, .xml, .odt and .txt. I compressed all four formats using the Linux utility zip (as that is the underlying implementation for OpenOffice). The results (below) were fairly interesting and somewhat expected.

File Type | Original Size (bytes) | Compressed Size (bytes)
.doc | 921,088 | 179,648
.xml | 6,475,669 | 228,497
.odt | 154,892 | 153,456
.txt | 417,549 | 104,236

Notice that each format compresses to roughly the same size. The .xml is larger due to both its original size, and thus the number of segments that needed to be compressed, and its additional data compared to the other formats. The compressed .doc is roughly 17% larger than the .odt, which was itself only slightly compressed (perhaps due to a slight algorithm change). The .txt compressed to the smallest size because it carries no formatting, style or meta information and is simply the raw text. Seeing the vastly decreased storage with respect to the .doc, I wonder why Microsoft does not incorporate a compression strategy similar to OpenOffice's.
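The comparisons above are simple arithmetic on the table, and a few lines of Python make them easy to re-check. This is only a worked calculation over the published sizes. The percent_saved helper and the variable names are made up for the example.

```python
# (original size, zip-compressed size) in bytes, copied from the table above.
sizes = {
    ".doc": (921088, 179648),
    ".xml": (6475669, 228497),
    ".odt": (154892, 153456),
    ".txt": (417549, 104236),
}


def percent_saved(original, compressed):
    """Percentage of the original size removed by compression."""
    return 100.0 * (original - compressed) / original


for ext, (original, compressed) in sizes.items():
    print("%-5s %9d -> %7d  saved %5.1f%%" % (
        ext, original, compressed, percent_saved(original, compressed)))

# How much larger the compressed .doc is than the compressed .odt, in percent.
doc_vs_odt = 100.0 * (sizes[".doc"][1] / sizes[".odt"][1] - 1.0)
print("compressed .doc is %.1f%% larger than compressed .odt" % doc_vs_odt)
```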

Conclusion

From this limited data sample, I have to declare OpenOffice Writer the champion of round one. Perhaps if Microsoft Word employed a compressed output form, the outcome might have been different. It is actually a little strange that the OpenOffice format, which is based on a pure text format (XML), is compressed into a binary zip file, while the Microsoft Word format, which is a proprietary binary format, is not.

What Is Up Next?

For the most part, these test cases did not contain much formatting or style information, nor did they consider such elements as tables and graphs. I will investigate how these affect the efficiency in a later article. But before I do that, I will need to expose more of how ODF works. Therefore, the next few articles in this series will be a primer for the ODF specification.

Until next time…
-3Monkeys


Open Document Tutorial part 2:

Tuesday, July 11th, 2006

This article is being archived here from its original publication in the 3Monkeyweb wiki.

When a document has been edited in both Word and OpenOffice, certain artifacts are introduced. One of my current projects is to clean up an 800-plus-page specification that has been translated between .doc and .odt many times. This has resulted in a bloated document with several inconsistencies. In my last article, I showed some underlying differences in the two formats, both from direct inspection and inference. Now comes the task of tackling the .odt.

Getting started

I will be using oo_ms_oo_doc.odt as a reference document for this tutorial. As shown previously, .odt files are nothing more than .zip files. Unzip the file in an empty directory as follows.

% unzip ../oo_ms_oo_doc.odt
Archive:  ../oo_ms_oo_doc.odt
extracting: mimetype
inflating: layout-cache
inflating: content.xml
inflating: styles.xml
extracting: meta.xml
inflating: Thumbnails/thumbnail.png
inflating: settings.xml
inflating: META-INF/manifest.xml

I am going to skip all of the preamble information regarding .odt "packages" and jump right into the meat of the problem. The files we are most concerned with are "content.xml" and "styles.xml". Open "styles.xml" in your favorite XML editor (I prefer oXygen). Ignoring the root element, <office:document-styles>, we see the first major element of the document, <office:font-face-decls>. This is the element we will attack first. I will use Relax-NG Schema notation for elements.

<office:font-face-decls>

<define name="office-font-face-decls">
  <optional>
    <element name="office:font-face-decls">
      <zeroOrMore>
        <ref name="style-font-face"/>
      </zeroOrMore>
    </element>
  </optional>
</define>

As you can see, it is pretty simple. It only contains style-font-face refs. So let us take a look at that element.

<define name="style-font-face">
  <element name="style:font-face">
    <ref name="style-font-face-attlist"/>
    <optional>
      <ref name="svg-font-face-src"/>
    </optional>
    <optional>
      <ref name="svg-definition-src"/>
    </optional>
  </element>
</define>

For brevity (and simplicity), I'm going to ignore the optional "svg-font-face-src" and "svg-definition-src" refs. As a side note, I have not encountered these in real-world situations. We are left with an <office:font-face-decls> element that contains zero or more <style:font-face> elements. We can infer that the ref "style-font-face-attlist" is an attribute list and does not contain any elements. I have verified that this is indeed the case, but the complete definition is too lengthy to list here. Here is the complete schema.
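To show what that structure looks like once parsed, here is a small Python sketch (my script is in Perl, so this is a stand-in, not the script itself). It collects each <style:font-face> under <office:font-face-decls> into a dictionary keyed by style:name. The font_faces helper is hypothetical, attributes outside the style namespace are simply skipped, and the sample input is a cut-down styles document.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
}


def font_faces(xml_text):
    """Map each font's style:name to its remaining style:* attributes."""
    root = ET.fromstring(xml_text)
    style_prefix = "{%s}" % NS["style"]
    fonts = {}
    for face in root.iterfind(".//office:font-face-decls/style:font-face", NS):
        attrs = {
            key[len(style_prefix):]: value
            for key, value in face.attrib.items()
            if key.startswith(style_prefix)
        }
        fonts[attrs.pop("name")] = attrs
    return fonts


SAMPLE = (
    '<office:document-styles xmlns:office="%(office)s" xmlns:style="%(style)s">'
    "<office:font-face-decls>"
    '<style:font-face style:name="StarSymbol" style:font-charset="x-symbol"/>'
    '<style:font-face style:name="Thorndale AMT" style:font-family-generic="roman"'
    ' style:font-pitch="variable"/>'
    "</office:font-face-decls>"
    "</office:document-styles>" % NS
)

fonts = font_faces(SAMPLE)
print(fonts)
```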

Basic strategy

We will iterate through the fonts comparing certain attributes. When we find two fonts that are similar enough, we can replace one with the other and remove the duplicate. This will be accomplished in two steps.

  1. Identify potential substitutions
    Once all substitutions have been identified, a map file is written to disk. This file can then be edited to suit the particular interest of the user.
  2. Perform the substitutions
    Once the map file is ready the script is run a second time to make all of the replacements.
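The two-pass workflow can be sketched in a few lines. The version below is Python with a JSON map file, which is an assumption made for the illustration; the actual script is Perl and writes a Perl-style hash. The function names and the temporary file location are likewise hypothetical, and the manual edit between the two passes is simulated in code.

```python
import json
import os
import tempfile


def write_map(candidates, path):
    """Pass 1: save proposed substitutions so they can be reviewed by hand."""
    with open(path, "w") as handle:
        json.dump(candidates, handle, indent=2)


def read_map(path):
    """Pass 2: load the (possibly edited) map, keeping only real substitutions."""
    with open(path) as handle:
        loaded = json.load(handle)
    return {old: new for old, new in loaded.items() if new}


def resolve(active, name):
    """Follow the map until reaching a font that is not itself replaced."""
    seen = set()
    while name in active and name not in seen:
        seen.add(name)
        name = active[name]
    return name


map_path = os.path.join(tempfile.mkdtemp(), "font-face-decls.json")

# Pass 1: an empty string means "keep this font".
write_map({"StarSymbol": "", "Wingdings": "StarSymbol", "Lucidasans": ""}, map_path)

# Between the passes the user edits the file; simulated here.
with open(map_path) as handle:
    edited = json.load(handle)
edited["Lucidasans"] = "Albany AMT"
write_map(edited, map_path)

# Pass 2: only the non-empty entries are acted on.
active = read_map(map_path)
print(active)
```

Keeping the map as a plain file between the two passes is what lets the user override the automatic choices before anything in the document is changed.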

The code
I chose to program this in Perl with the help of the package XML::Simple. It certainly could have been done with some XSL filters, but that would have been much more complicated. The complete Perl script font-face-decls.pl can be downloaded from the ODT Tools file repository. Remember, this was not intended as a production script. Therefore, I did not worry a lot about bounds checking, errors, or just plain making it look pretty. If you would like to volunteer to help on this project and combine this and future tools into a well-rounded package, please contact me.

First we need to load "content.xml" and "styles.xml", then extract the <office:document-styles> element. I simply read each file in as one big string by locally undef'ing $/, then use a regular expression to extract the <office:document-styles> element to a string. Finally, I use XMLin to convert the element to a Perl data structure. I could have actually, extracted <office:font-face-decls>

I don’t want to work with two structures, so the first thing I do is combine the styles and content hashes into a single hash. We check to make sure any combined elements contain the same attributes, adding any extra attributes as well.
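The merge step might look like the following in Python. This is a sketch under the assumption that each file's fonts have already been reduced to a dictionary of name to attributes. The merge_fonts name and the conflict list are additions for the illustration and are not taken from the Perl script.

```python
def merge_fonts(styles_fonts, content_fonts):
    """Combine two {font name: {attribute: value}} dicts into one.

    Attributes present in only one source are added. Attributes that disagree
    keep the styles.xml value and are reported in the returned conflict list.
    """
    merged = {name: dict(attrs) for name, attrs in styles_fonts.items()}
    conflicts = []
    for name, attrs in content_fonts.items():
        target = merged.setdefault(name, {})
        for key, value in attrs.items():
            if key in target and target[key] != value:
                conflicts.append((name, key, target[key], value))
            else:
                target[key] = value
    return merged, conflicts


styles_fonts = {"Albany AMT": {"font-pitch": "variable"}}
content_fonts = {
    "Albany AMT": {"font-pitch": "variable", "font-family-generic": "swiss"},
    "Lucidasans": {"font-pitch": "variable"},
}

merged, conflicts = merge_fonts(styles_fonts, content_fonts)
print(merged)
print(conflicts)
```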

So what in the XML do we want to modify?
Let us compare <style:font-face> elements to determine where we might make some improvements.

The XML (edited for brevity)

<style:font-face style:name="StarSymbol"
style:font-charset="x-symbol"/>

<style:font-face style:name="Wingdings"
style:font-pitch="variable"
style:font-charset="x-symbol"/>

<style:font-face style:name="Symbol"
style:font-family-generic="roman"
style:font-pitch="variable"
style:font-charset="x-symbol"/>

<style:font-face style:name="Albany AMT1"
style:font-pitch="variable"/>

<style:font-face style:name="Albany AMT"
style:font-pitch="variable"/>

<style:font-face style:name="Lucidasans"
style:font-pitch="variable"/>

<style:font-face style:name="Thorndale AMT"
style:font-family-generic="roman"
style:font-pitch="variable"/>

<style:font-face style:name="Thorndale AMT1"
style:font-family-generic="roman"
style:font-pitch="variable"/>

Notice that "Albany AMT" and "Thorndale AMT" appear to be duplicated. Our first rule will be to replace any fonts whose names differ only by an appended sequential number. Next, we see that there are three fonts with an "x-symbol" font-charset. One symbol font is plenty, therefore we can replace all symbol fonts with a single symbol font. Finally, we notice that neither "Albany AMT" nor "Lucidasans" has a "style:font-family-generic" attribute. These both happen to belong to the "swiss" generic font family. Since we are attacking this in two steps, we will be able to modify the "font-face-decls.map" file in order to substitute one of these for the other. But let us consider the case where these two style:font-faces were described as follows.

The Hypothetical XML

<style:font-face style:name="Albany AMT"
style:font-family-generic="swiss"
style:font-pitch="variable"/>

<style:font-face style:name="Lucidasans"
style:font-family-generic="swiss"
style:font-pitch="variable"/>

If this were the case, then we could add a third rule to replace members of the same style:font-family-generic with a single font. Perhaps, I will update the example data to show this operation, but for now just be aware that I have tested this rule on the 800 page gorilla, and it is included in the script.
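The three rules can be written down compactly. The Python below is a sketch of the rules as described in this article, not a port of the Perl script, and propose_map is a hypothetical name. It assumes the fonts arrive as a dictionary in document order, with the style: prefix already stripped from attribute names, and it treats the first font seen in each group as the one to keep.

```python
import re


def propose_map(fonts):
    """fonts: {name: {attr: value}} in document order -> {name: replacement or ''}."""
    mapping = {}
    first_symbol = None
    first_by_generic = {}
    for name, attrs in fonts.items():
        replacement = ""
        base = re.sub(r"\d+$", "", name)
        if base != name and base in fonts:
            # Rule 1: same name apart from an appended number.
            replacement = base
        elif attrs.get("font-charset") == "x-symbol":
            # Rule 2: keep only the first symbol font.
            if first_symbol is None:
                first_symbol = name
            else:
                replacement = first_symbol
        elif "font-family-generic" in attrs:
            # Rule 3: keep only the first font of each generic family.
            generic = attrs["font-family-generic"]
            if generic in first_by_generic:
                replacement = first_by_generic[generic]
            else:
                first_by_generic[generic] = name
        mapping[name] = replacement
    return mapping


FONTS = {
    "StarSymbol": {"font-charset": "x-symbol"},
    "Wingdings": {"font-pitch": "variable", "font-charset": "x-symbol"},
    "Symbol": {"font-family-generic": "roman", "font-pitch": "variable",
               "font-charset": "x-symbol"},
    "Albany AMT1": {"font-pitch": "variable"},
    "Albany AMT": {"font-pitch": "variable"},
    "Lucidasans": {"font-pitch": "variable"},
    "Thorndale AMT": {"font-family-generic": "roman", "font-pitch": "variable"},
    "Thorndale AMT1": {"font-family-generic": "roman", "font-pitch": "variable"},
}

for name, replacement in propose_map(FONTS).items():
    print("%-16s => %s" % (name, replacement))
```

If "Albany AMT" and "Lucidasans" both carried font-family-generic="swiss", as in the hypothetical XML, rule 3 would map "Lucidasans" to "Albany AMT" without any manual edit.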

Running the script

As I stated above, the script I wrote is not of my normal professional quality, so the run environment is pretty strict. Volunteers? You must run the script from the directory that contains your extracted .odt. To create the map file run the following command.

% font-face-decls.pl map

This will result in the output of the two files "font-face-decls.rpt" and "font-face-decls.map", shown below.

font-face-decls.rpt

StarSymbol                             x-symbol
Wingdings                              x-symbol                 StarSymbol
Symbol                 roman           x-symbol                 StarSymbol
Albany AMT1                                                     Albany AMT
Albany AMT
Lucidasans
Thorndale AMT          roman
Thorndale AMT1         roman                                    Thorndale AMT

font-face-decls.map

{
  'StarSymbol' => '',
  'Wingdings' => 'StarSymbol',
  'Symbol' => 'StarSymbol',
  'Albany AMT1' => 'Albany AMT',
  'Albany AMT' => '',
  'Lucidasans' => '',
  'Thorndale AMT' => '',
  'Thorndale AMT1' => 'Thorndale AMT',
};

As suggested previously, we want to modify the map file in order to eliminate either "Albany AMT" or "Lucidasans". Since "Albany AMT" is already being used as a replacement, we will replace "Lucidasans" with it as well. Therefore, edit the "Lucidasans" line of the map file to read:

  'Lucidasans' => 'Albany AMT',

We are now ready to perform the substitutions in bulk. Run the script in "replace" mode as follows.

% font-face-decls.pl replace

We end up with two files, "content-new.xml" and "styles-new.xml". If we examine either of these files, we will find that the <office:font-face-decls> element is reduced to three <style:font-face> elements, as seen here.

<office:font-face-decls> (Edited for brevity)

<style:font-face style:name="StarSymbol"
style:font-charset="x-symbol"/>

<style:font-face style:name="Albany AMT"
style:font-pitch="variable"/>

<style:font-face style:name="Thorndale AMT"
style:font-family-generic="roman"
style:font-pitch="variable"/>

Some interesting points

You may notice in the code that when replacing one font with another, we first remove the duplicate style, then substitute the replacement font name globally for the duplicated font name. There is a potential bug in the global substitution. Suppose the actual content of the document contained the duplicate font name in quotes, such as "Symbol". This would be replaced by "StarSymbol", which is an incorrect substitution. We should limit our substitution to the element <office:document-styles>. Volunteers?
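One way to avoid touching document text is to rewrite only the attributes that reference fonts, rather than substituting over the whole file as a string. The Python sketch below does that for style:font-name and its -asian and -complex variants. It is an alternative approach offered for illustration, not what the Perl script does, and replace_font_refs and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# Attributes that refer to a font declaration by name.
FONT_ATTRS = [
    "{%s}%s" % (STYLE_NS, local)
    for local in ("font-name", "font-name-asian", "font-name-complex")
]


def replace_font_refs(root, mapping):
    """Rewrite font references held in attributes; element text is never touched."""
    changed = 0
    for elem in root.iter():
        for attr in FONT_ATTRS:
            old = elem.get(attr)
            if old in mapping and mapping[old]:
                elem.set(attr, mapping[old])
                changed += 1
    return changed


SAMPLE = (
    '<doc xmlns:style="%s">'
    '<style:text-properties style:font-name="Symbol"/>'
    '<p>Use the "Symbol" font here.</p>'
    "</doc>" % STYLE_NS
)

root = ET.fromstring(SAMPLE)
changed = replace_font_refs(root, {"Symbol": "StarSymbol"})
print(changed, root[0].get(FONT_ATTRS[0]), root.find("p").text)
```

Note that ElementTree may rename namespace prefixes when the tree is written back out unless they are registered first, so a real tool would need to handle that before saving.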

Second, notice that we are replacing the duplicate font name in certain styles. That means that one or more styles may now be duplicated. We will investigate removing duplicate styles in the next installment.

Another point of interest is that we wrap our string in a static string in the call to XMLin. This resolves namespace issues, but it could be implemented more cleanly.
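The namespace problem is easy to reproduce: a fragment cut out of a larger file uses prefixes such as office: and style: without declaring them, so a namespace-aware parser rejects it. The Python sketch below shows the wrapping idea, using a hypothetical parse_fragment helper and a generated wrapper element instead of a hard-coded string. It illustrates the technique in general and is not the Perl code.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
}


def parse_fragment(fragment):
    """Parse an extracted element whose namespace prefixes are not declared."""
    decls = " ".join(
        'xmlns:%s="%s"' % (prefix, uri) for prefix, uri in NAMESPACES.items()
    )
    wrapped = "<wrapper %s>%s</wrapper>" % (decls, fragment)
    # The fragment is the wrapper's only child.
    return ET.fromstring(wrapped)[0]


FRAGMENT = (
    "<office:font-face-decls>"
    '<style:font-face style:name="StarSymbol"/>'
    "</office:font-face-decls>"
)

try:
    ET.fromstring(FRAGMENT)
    bare_parse_failed = False
except ET.ParseError:
    # The prefixes are unbound without the wrapper's declarations.
    bare_parse_failed = True

element = parse_fragment(FRAGMENT)
print(bare_parse_failed, element.tag, len(element))
```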

Until next time…
