OpenOffice ODF/.odt compared to Microsoft Word .doc

Overview

This is the first in a series of articles comparing ODF, in particular the OpenOffice implementation, with Microsoft Office and its various data formats across several measures. This article covers the efficiency of the .odt, .doc and .xml formats, with particular attention to native and compressed file sizes.

Methodology

My Windows test cases were generated using the following software:

  • Microsoft Windows XP Professional 2002, SP2
  • Microsoft Word 2003 (11.6368.6368) SP2
  • OpenOffice 2.0.3
  • Adobe Acrobat Standard 7.0.8 05/16/2006

My Linux test cases were produced with the following software:

  • SuSE Linux 10.1
  • OpenOffice 2.0.2.7.1
  • Adobe Reader 7.0.8 05/22/2006

I needed a fairly large chunk of text for my test, so I decided on the November draft of the ISO/IEC C++ Standard, located at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1905.pdf (copy here). This is a very large document, so I used only the first seven chapters for my test case. To produce the target documents, I selected the contents from the beginning of the document through chapter 7 and copied the selection to the clipboard. I then pasted the clipboard into Microsoft Word under Windows and into OpenOffice Writer under both Windows and Linux. For Microsoft Word, I saved the document as a native .doc and as .xml. For OpenOffice, I saved the document as a native .odt and exported it as .doc. I also saved the content as .txt with Notepad under Windows as a reference point. For archival purposes, I have mirrored all documents referred to in this article on the 3monkey wiki download area.

Raw Results

File                        Size (bytes)
Microsoft Office .doc            921,088
Microsoft Office .xml          6,475,669
OpenOffice (XP) .odt             154,892
OpenOffice (XP) .doc           1,335,296
OpenOffice (Linux) .odt          160,045
OpenOffice (Linux) .doc        1,338,368
Notepad .txt                     417,549

Observations

My first observation was that the Linux OpenOffice implementation created slightly larger files than the Windows implementation. This was probably due to the differing versions. I will revisit this in a later article if it is merited.

My next observation was that the OpenOffice .doc file was significantly larger than the Microsoft Word version. This is likely due to Microsoft’s access to the complete .doc specification, and thus a better understanding of how to optimize the file content and size. For grins, I loaded the OpenOffice .doc with Microsoft Word and saved it natively. I also loaded the Microsoft Word .doc with OpenOffice and saved it both as a .doc and as an .odt. The results of these tests are below.

File                                   Size (bytes)
OO .doc loaded/saved in MS                  808,960
MS .doc loaded/saved in OO                1,277,952
MS .doc loaded/saved as .odt in OO          155,113
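
The relative changes implied by the round-trip sizes above can be checked with a short calculation. This is only a sketch of the arithmetic; the constant and function names are my own labels, and the byte counts are copied from the tables in this article.

```python
# Sizes in bytes, copied from the tables in this article.
MS_DOC = 921_088           # native Word .doc
OO_DOC = 1_335_296         # OpenOffice (XP) .doc
OO_DOC_VIA_MS = 808_960    # OpenOffice .doc loaded and re-saved in Word
MS_DOC_VIA_OO = 1_277_952  # Word .doc loaded and re-saved in OpenOffice


def percent_change(old: int, new: int) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # Negative values mean the re-saved file is smaller than the baseline.
    word_round_trip = percent_change(MS_DOC, OO_DOC_VIA_MS)
    oo_round_trip = percent_change(OO_DOC, MS_DOC_VIA_OO)
    print(f"Re-saved in Word vs. native Word .doc: {word_round_trip:+.1f}%")
    print(f"Re-saved in OpenOffice vs. OpenOffice .doc: {oo_round_trip:+.1f}%")
```

Both comparisons use the saving application's own native .doc as the baseline, which is what makes the two numbers comparable.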

This produced some interesting results. First, even though the OpenOffice .doc file was originally larger than the native Microsoft Word version, loading it into Word and saving it resulted in a file 12% smaller than the original native Word .doc. This indicates that OpenOffice does not save all of the information regarding a document that Word does. This is further supported by the opposite transformation. When we load the Word document in OpenOffice and re-save it as a .doc, the result (1,277,952 bytes) is again smaller than the .doc that OpenOffice produced from scratch (1,335,296 bytes). This reduction, although not as significant, supports the idea that OpenOffice is not saving all the information in its .doc format that Word does. By a cursory visual inspection, all of the documents seem to be equivalent. Without access to the .doc file format specification, it is difficult to infer whether the information loss is of consequence. In other words, the file size difference may be due to bloat in the native Word format or to information loss by OpenOffice.

Next, most people will notice that the .odt versions are smaller than the .doc versions, irregardless of which application produced them. Furthermore, the .odt is only a little over one-third the size of the raw text from Notepad. The reason the .odt is so much smaller is that the OpenOffice implementation applies compression to its output and decompresses it on the fly for input. This has both advantages and disadvantages. The primary disadvantage is load and save times: since the file must be compressed or decompressed, this takes extra CPU cycles. However, with the speed and efficiency of today’s processors, this should be of little practical impact. The one obvious major advantage is file size. Not only does this save raw disk storage, it also lowers the bandwidth needed for such mediums as email and downloads.

I wondered what the results would be of compressing the .doc, .xml, .odt and .txt. I compressed all four formats using the Linux utility zip (as that is the underlying implementation for OpenOffice). The results (below) were fairly interesting and somewhat expected.

File Type   Original Size (bytes)   Compressed Size (bytes)
.doc                      921,088                   179,648
.xml                    6,475,669                   228,497
.odt                      154,892                   153,456
.txt                      417,549                   104,236
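
To put the table above in relative terms, the space saved by zip for each format can be worked out as follows. This is just the arithmetic on the numbers already listed; the `SIZES` dictionary and the `savings_percent` helper are names I made up for the sketch.

```python
# (original, zipped) sizes in bytes, copied from the table above.
SIZES = {
    ".doc": (921_088, 179_648),
    ".xml": (6_475_669, 228_497),
    ".odt": (154_892, 153_456),
    ".txt": (417_549, 104_236),
}


def savings_percent(original: int, compressed: int) -> float:
    """Return the share of the original size that compression removed."""
    return (1.0 - compressed / original) * 100.0


if __name__ == "__main__":
    for extension, (original, compressed) in SIZES.items():
        saved = savings_percent(original, compressed)
        print(f"{extension}: {saved:.1f}% smaller after zip")
```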

Notice that each format compresses to roughly the same size. The .xml is larger due both to its original size, and thus the number of segments that needed to be compressed, and to the additional data it carries compared to the other formats. The .doc is roughly 17% larger than the compressed .odt (179,648 bytes against 153,456), and the .odt itself was only slightly compressed (perhaps due to a slight difference in algorithm). The .txt compressed more than the others because it carries no formatting, style or meta information and is simply the raw text. Seeing the vastly decreased storage for the .doc, I wonder why Microsoft does not incorporate a compression strategy similar to OpenOffice.
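
Since the .odt is already compressed, a second pass of zip has little left to remove, which likely accounts for most of the small change in its size. The sketch below illustrates that general effect with Python's standard zipfile module. The sample text is synthetic (random words from a tiny made-up vocabulary), not the test document from this article, and the helper names are my own.

```python
import io
import random
import zipfile


def sample_text(word_count: int, seed: int = 0) -> bytes:
    """Build synthetic, text-like data from a small made-up vocabulary."""
    vocabulary = ["the", "standard", "shall", "object", "type", "value",
                  "expression", "function", "template", "declaration"]
    generator = random.Random(seed)
    words = [generator.choice(vocabulary) for _ in range(word_count)]
    return " ".join(words).encode("ascii")


def zip_bytes(data: bytes, name: str = "payload") -> bytes:
    """Return a ZIP archive (DEFLATE) that holds data as a single member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, data)
    return buffer.getvalue()


if __name__ == "__main__":
    text = sample_text(20_000)
    once = zip_bytes(text)    # first pass: plain text compresses well
    twice = zip_bytes(once)   # second pass: input is already compressed
    print(f"raw text:     {len(text):>8,} bytes")
    print(f"zipped once:  {len(once):>8,} bytes")
    print(f"zipped twice: {len(twice):>8,} bytes")
```

The second archive ends up about the same size as the first, because DEFLATE output has little redundancy left and the ZIP container adds its own headers.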

Conclusion

From this limited data sample, I have to declare OpenOffice Writer the champion of round one. Perhaps if Microsoft Word employed a compressed output form, the outcome might have been different. It is actually a little strange that OpenOffice, which is based on a pure text format (XML), compresses its output into a binary zip file, and that Microsoft Word, which uses a proprietary binary format, does not.
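
Because an .odt is a zip container around XML, its members can be listed with any zip library. The following is a minimal sketch using Python's standard zipfile module. It builds a tiny stand-in package in memory rather than reading a real document, so the XML content is made up. The member names mimetype and content.xml are ones an ODF package normally contains, and the function names are my own.

```python
import io
import zipfile


def build_sample_package() -> bytes:
    """Build a tiny ODF-like package in memory (a stand-in, not a real document)."""
    body = "<text:p>Hello, ODF.</text:p>" * 500
    content = f"<office:document-content>{body}</office:document-content>"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # ODF packages store the mimetype member first and uncompressed.
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text",
                         compress_type=zipfile.ZIP_STORED)
        package.writestr("content.xml", content.encode("utf-8"),
                         compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def describe_package(data: bytes) -> list:
    """Return (name, uncompressed size, compressed size) for each member."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return [(info.filename, info.file_size, info.compress_size)
                for info in package.infolist()]


if __name__ == "__main__":
    for name, raw_size, packed_size in describe_package(build_sample_package()):
        print(f"{name:<12} {raw_size:>8,} -> {packed_size:>8,} bytes")
```

To inspect a real file instead, the bytes of any .odt on disk could be read and passed to `describe_package`.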

What Is Up Next?

For the most part, these test cases did not contain much formatting or style information, nor did they include elements such as tables and graphs. I will investigate how these affect efficiency in a later article. But before I do that, I will need to expose more of how ODF works. Therefore, the next few articles in this series will be a primer on the ODF specification.

Until next time…
-3Monkeys

11 Responses to “OpenOffice ODF/.odt compared to Microsoft Word .doc”

  1. meneame.net Says:

    OpenOffice odt versus Microsoft doc…

    A comparison, in English, of the ODT and DOC formats….

  2. 3monkeys » OpenOffice: .odt Opened Up Says:

    [...] In the first article in this series, OpenOffice ODF/.odt compared to Microsoft Word .doc, I compared various file types for size efficiency. Of particular interest was the fact that OpenOffice Writer stores .odts in a zip format, an implementation of PKZip to be exact. With this knowledge and the Open Document Format standard, we can investigate how certain elements of a document affect its size and overall efficiency. [...]

  3. Heliologue Says:

    In all fairness, you should probably be comparing ODF against Microsoft’s new XML format, which is less apples-to-oranges.

  4. Woooops Says:

    Which is better shouldn’t be judged by size.

  5. Hildegard Jasper Says:

    Krall…

    Useful, thank you!…

  6. Kevin Says:

    I think you meant “regardless” in the third paragraph under “Observations.” Irregardless means “with regard.”

  7. OpenOffice ODT compared with Microsoft DOC | Etixet Says:

    [...] I think I need to share this. But because of the copyright issue, I apologize and give the English version. Even just examining the tables is enough to understand it (respectively, File Types and [...]

  8. Jake Says:

    Kevin: I cannot find a single dictionary that says irregardless means with regard. My print dictionary says see: Regardless. Merriam-Webster online says “nonstandard : regardless”. All of these definitions: http://dictionary.reference.com/browse/irregardless say that it means regardless.

    It is nonstandard perhaps, but it does oddly mean regardless nonetheless.

  9. Jenrose Says:

    I discovered this with a file that saved to 300k or thereabouts in ODT, and almost 900 in TXT.

    Mind boggling.

  10. JoeG Says:

    The origin of irregardless is not known for certain, but the consensus among references is that it is a blend of irrespective and regardless, both of which are commonly accepted standard English words. By blending these words, an illogical word is created. “Since the prefix ir- means ‘not’ (as it does with irrespective), and the suffix -less means ‘without,’ irregardless is a double negative.”[1]

    thats from wiki.
    Basically irregardless is a bastardization of the english language
    derived from a lazy crunch of two words when one isnt sure which to use.
    and according to my father lol, its like nails on a chalkboard.

  11. Sean Says:

    LOL! That comment about irregardless reminds me of when my wife used to say “could you refrain from not doing that?” So I comply. I keep right on doing it. It used to really get her mad just because I listened to her! She hasn’t done that in a long time so I guess she got the hint.
