Archive for December, 2006

OpenOffice ODF/.odt compared to Microsoft Word .doc

Friday, December 29th, 2006

Overview

This is the first in a series of articles comparing ODF, in particular the OpenOffice implementation, with Microsoft Office and its various data formats across several measures. This article covers the storage efficiency of the .odt, .doc and .xml formats, with particular attention to native and compressed file sizes.

Methodology

My Windows test cases were generated using the following software:

  • Microsoft Windows XP Professional 2002, SP2
  • Microsoft Word 2003 (11.6368.6368) SP2
  • OpenOffice 2.0.3
  • Adobe Acrobat Standard 7.0.8 5/16/2006.

My Linux test cases were produced with the following software:

  • SuSE Linux 10.1
  • OpenOffice 2.0.2.7.1
  • Adobe Reader 7.0.8 05/22/2006

I needed a fairly large chunk of text for my test, so I decided on the November draft of the ISO/IEC C++ Standard, located at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1905.pdf (copy here). This is a very large document, so I used only the first seven chapters for my test case. To produce the target documents, I selected the contents from the beginning of the document through chapter 7 and copied them to the clipboard. I then pasted the clipboard into Microsoft Word under Windows and into OpenOffice Writer under both Windows and Linux. From Microsoft Word, I saved the document as a native .doc and as .xml. From OpenOffice, I saved the document as a native .odt and exported it as .doc. I also saved the content as .txt with Notepad under Windows as a reference point. For archival purposes, I have mirrored all documents referred to in this article on the 3monkey wiki download area.

Raw Results

File                        Size (bytes)
Microsoft Office .doc            921,088
Microsoft Office .xml          6,475,669
OpenOffice (XP) .odt             154,892
OpenOffice (XP) .doc           1,335,296
OpenOffice (Linux) .odt          160,045
OpenOffice (Linux) .doc        1,338,368
Notepad .txt                     417,549

Observations

My first observation was the Linux OpenOffice implementation created slightly larger file sizes than the Windows implementation. This was probably due to the differing versions. I will revisit this in a later article if it is merited.

My next observation was that the OpenOffice .doc file was significantly larger than the Microsoft Word version. This is likely due to Microsoft's access to the complete .doc specification, and thus a better understanding of how to optimize the file content and size. For grins, I loaded the OpenOffice .doc with Microsoft Word and saved it natively. I also loaded the Microsoft Word .doc with OpenOffice and saved it both as a .doc and as an .odt. The results of these tests are below.

File                                Size (bytes)
OO .doc loaded/saved in MS               808,960
MS .doc loaded/saved in OO             1,277,952
MS .doc loaded/saved as .odt in OO       155,113
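The relative sizes in the two tables above can be rechecked with a few lines of arithmetic. This is only a sketch: the dictionary keys and the helper name `percent_change` are made up for illustration, and the byte counts are copied from the tables.

```python
# Sizes in bytes, copied from the two tables above.
# The key names are hypothetical labels, not file names from the test.
SIZES = {
    "ms_doc": 921_088,           # Word, native .doc
    "oo_doc_xp": 1_335_296,      # OpenOffice (XP), exported .doc
    "oo_doc_via_ms": 808_960,    # OO .doc loaded/saved in Word
    "ms_doc_via_oo": 1_277_952,  # Word .doc loaded/saved in OpenOffice
    "oo_odt_xp": 154_892,        # OpenOffice (XP), native .odt
    "txt": 417_549,              # Notepad .txt
}


def percent_change(before: int, after: int) -> float:
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    s = SIZES
    # OO .doc re-saved by Word, relative to the native Word .doc
    print(round(percent_change(s["ms_doc"], s["oo_doc_via_ms"]), 1))     # -12.2
    # Word .doc re-saved by OO, relative to OO's own .doc export
    print(round(percent_change(s["oo_doc_xp"], s["ms_doc_via_oo"]), 1))  # -4.3
    # Word .doc re-saved by OO, relative to the native Word .doc
    print(round(percent_change(s["ms_doc"], s["ms_doc_via_oo"]), 1))     # 38.7
    # .odt size as a percentage of the raw text
    print(round(s["oo_odt_xp"] / s["txt"] * 100.0, 1))                   # 37.1
```

A negative value means the second file is smaller than the first.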

The round trips produced some interesting results. First, even though the OpenOffice .doc file was larger than the native Microsoft Word version, loading it into Word and saving it again produced a file about 12% smaller than the original native Word .doc. This suggests that OpenOffice does not save all of the information about a document that Word does. The opposite transformation points the same way: when the Word document is loaded in OpenOffice and re-saved as a .doc, the result is again smaller than OpenOffice's own .doc export, by about 4%, although it is still larger than the Word original. This reduction, although not as significant, is consistent with OpenOffice not saving all the information in its .doc output that Word does. On a cursory visual inspection, all of the documents appear equivalent. Without access to the .doc file format specification, it is difficult to tell whether the information loss is of any consequence. In other words, the file size difference may be due to bloat in the native Word format or to information loss by OpenOffice.

Next, most people will notice that the .odt versions are smaller than the .doc versions regardless of which application produced them. Furthermore, the .odt is a little over one-third the size of the raw text from Notepad. The .odt is so much smaller because the OpenOffice implementation compresses its output and decompresses it on the fly on input. This has both advantages and disadvantages. The primary disadvantage is load and save time: compressing or decompressing the file takes extra CPU cycles. However, with the speed and efficiency of today's processors, this should have little practical impact. The obvious major advantage is file size. Not only does this save raw disk storage, it also lowers the bandwidth needed for media such as email and downloads.

I wondered what the results would be of compressing the .doc, .xml, .odt and .txt. I compressed all four formats using the Linux utility zip (as that is the underlying implementation for OpenOffice). The results (below) were fairly interesting and somewhat expected.

File Type   Original Size (bytes)   Compressed Size (bytes)
.doc                      921,088                   179,648
.xml                    6,475,669                   228,497
.odt                      154,892                   153,456
.txt                      417,549                   104,236
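A comparison like the one in the table can be approximated with Python's standard zipfile module. This is a sketch, not the procedure used for the numbers above: those came from the Linux zip utility, so sizes produced this way will differ slightly. The function names `zipped_size` and `compare` are made up for illustration.

```python
import os
import tempfile
import zipfile


def zipped_size(path: str) -> int:
    """Return the size in bytes of a deflate zip archive holding only `path`."""
    fd, archive = tempfile.mkstemp(suffix=".zip")
    os.close(fd)
    try:
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.write(path, arcname=os.path.basename(path))
        return os.path.getsize(archive)
    finally:
        os.remove(archive)


def compare(paths):
    """Return (name, original size, zipped size) for each path."""
    return [
        (os.path.basename(p), os.path.getsize(p), zipped_size(p))
        for p in paths
    ]


if __name__ == "__main__":
    import sys

    # Usage: python compare_sizes.py test.doc test.xml test.odt test.txt
    for name, original, compressed in compare(sys.argv[1:]):
        print(f"{name:<30} {original:>12,} {compressed:>12,}")
```

Each file is archived on its own so that the compressed sizes stay comparable from one format to the next.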

Notice that each format compresses to roughly the same size. The .xml is larger due both to its original size, and thus the number of segments that needed to be compressed, and to the additional data it carries compared to the other formats. The .doc is roughly 17% larger than the .odt, which itself was only slightly compressed (perhaps due to a slight algorithm change). The .txt compressed more than the others because it carries no formatting, style or meta information and is simply the raw text. Seeing the vastly decreased storage for the .doc, I wonder why Microsoft does not incorporate a compression strategy similar to OpenOffice's.

Conclusion

From this limited data sample, I have to declare OpenOffice Writer the champion of round one. Perhaps if Microsoft Word employed a compressed output form, the outcome might have been different. It is actually a little strange that OpenOffice's format, which is based on a pure text format (XML), is compressed into a binary zip file, while Microsoft Word's, which is a proprietary binary format, is not.
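Because an .odt file is an ordinary zip package, its contents can be listed with any zip reader. Here is a minimal sketch using Python's zipfile module; the function name `list_package` and the file name `example.odt` are placeholders, not files from this test.

```python
import zipfile


def list_package(path: str):
    """Return (member name, uncompressed size, compressed size) per entry."""
    with zipfile.ZipFile(path) as zf:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in zf.infolist()
        ]


if __name__ == "__main__":
    import sys

    # Usage: python list_odt.py example.odt
    for name, size, packed in list_package(sys.argv[1]):
        print(f"{name:<40} {size:>12,} {packed:>12,}")
```

For a Writer document, the listing typically includes members such as content.xml, styles.xml and meta.xml, with the XML members stored in compressed form.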

What Is Up Next?

For the most part, these test cases did not contain much formatting or style information, nor did they include elements such as tables and graphs. I will investigate how these affect efficiency in a later article. But before I do that, I will need to expose more of how ODF works. Therefore, the next few articles in this series will be a primer for the ODF specification.

Until next time…
-3Monkeys

Happy Holidays

Sunday, December 24th, 2006

Happy Holidays all. This is a very busy time of year for me, as I suppose it is for a lot of you as well. I have not had the time to post any articles in the past few days, but I have started some research on a planned series of articles entitled “OpenOffice vs Microsoft Office”, a series that will explore at first the technical aspects of the two, and may later look at usability, adoption and long-term prospects as well.

I hope to post the first in this series on either Tuesday or Wednesday.

Again Happy Holidays.

3Monkeys

I Got Tagged … at a Bar

Wednesday, December 20th, 2006

So I walk into the local wings joint, order up a ten piece and fire up the laptop. What do I find? Derek has tagged me into the “List five things people don’t know about you and then get five more people to do the same” pyramid scheme meme. Five things?

  1. I’m an avid science-fiction/fantasy reader. At last count I had a library of over 3,000 books. What really sucks is moving them.
  2. My first programming job was at the ripe old age of 13. I was taking a computer camp at a local college and was paid by some of the undergrads to code their assignments for them.
  3. Forget about Sue, Stacy *IS* a guy’s name damn it! (well some of you already knew that)
  4. I went to school with a girl named, appropriately enough, Stacy Doss. No we didn’t get married so she wouldn’t have to change her name, but I did manage to smuggle a drill team award out of the deal.
  5. I really have no common sense at all. In fourth grade I broke my arm several weeks before the spring “Field Day”, an event where all the students participated in various events such as the egg toss, three-legged race, and of course the obstacle course. They tried to persuade me not to compete in the obstacle course but I was dead set on it. Good thing too, since I won by hooking my cast over the “wall”, thus slingshotting me over it and on to victory.

Now that I think about it, I could have used “Takes laptop to bar to work”, if I hadn’t used it in the title. DOH!!
So now on to the tagging: Lowell, Steve, Rick, MG and why not? 3Monkeys tags Monkey Bite’s Michael Calore.

And to answer that question everyone was wondering, the wings were great.

How did this get to me? Here is the path: Derek van Vliet » Chris Finke » C.K. Sample » Jason Calacanis » Amanda Congdon » Michael Ambs » Rick Rey » Steve Woolf » Steve Garfield » Jeff Pulver.