Archive for the 'archives' Category

RSS – Really Simple Syndication : What you need to know

Saturday, April 7th, 2007

ARCHIVE POST

What is RSS?

RSS, or Really Simple Syndication, is a method for, well, syndicating content. RSS files are no more than specially crafted XML (eXtensible Markup Language) files. They contain items, each usually with a description and content, and possibly other data such as an author, a date and a link to the actual content. Content is usually an article, forum post or blog entry, but can range from simple page changes to more complex audio or video objects, usually referred to as podcasts. You should also know that RSS is sometimes referred to as XML, as that is the underlying technology, or as RDF. Content may be referred to as a feed, a stream or a channel. Several different icons are in use to indicate RSS.
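To make the "items with a description, date and link" structure concrete, here is a minimal sketch that reads a tiny RSS 2.0 document with Python's standard library. The feed text, its titles and the example.com addresses are invented for illustration, and real feeds often carry more elements than these.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document; titles and example.com URLs are placeholders.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>Posts from an example blog</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>A short summary of the first post.</description>
      <pubDate>Sat, 07 Apr 2007 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>A short summary of the second post.</description>
      <pubDate>Sun, 08 Apr 2007 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def read_items(feed_xml):
    """Return a list of dicts, one per <item> in the feed's channel."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.findall("./channel/item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            "date": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["date"], "-", entry["title"], "->", entry["link"])
```

This is roughly what any feed reader does first: pull the items out of the XML so they can be shown in a friendlier form.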

How will this help me?

RSS allows everyone from a web publisher to a blogger to deliver content directly to the end user: you. This saves you the time and effort of visiting sites, blogs, forums and other web resources to see if there are any updates. You can stay current on events, news, and other content you find interesting. You're a sports nut? Try the ESPN Headline Feed. Collect coffee cups? An eBay search feed might be what you want. Maybe you want to be informed of the latest Firefox plug-ins. Or you want to know when 3Monkeys makes a new post. All of these and many, many more are available through RSS feeds.

How do I use these feeds?

By themselves, RSS feeds are rather cryptic when viewed in most browsers, because they are intended to be used by other software. This software comes in several flavors, most notably aggregators and tickers. Aggregators are either stand-alone programs or integrate into a browser, mail client or other application, in which case they are often called an extension, plug-in or add-on. Tickers come in the same flavors. Any of these options will allow you to make use of the various RSS resources available on the Internet.

What is an aggregator?

Aggregators collect RSS feeds and display their content in an easily readable form. Firefox allows for Live Bookmarks, which essentially create a bookmark group that is constantly populated by an RSS feed. Viewing the RSS feed is as simple as navigating the bookmark group. New feeds can be added by clicking the feed icon in the address bar on sites that syndicate an RSS feed. Another option is the Firefox InfoRSS extension, which uses the sidebar for feed listing. Many RSS aggregators are available for Linux, Windows and Mac, and more general listings exist as well.
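The core of what an aggregator does can be sketched in a few lines: read the items from several feeds and present them as one list, newest first. The sketch below is only an illustration and not how Live Bookmarks or InfoRSS are implemented. The feed names and items are made up, the helper `make_feed` exists only to build test data, and fetching feeds over the network is left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def make_feed(channel_title, entries):
    """Build a tiny RSS 2.0 string from (title, pubDate) pairs (test data only)."""
    items = "".join(
        f"<item><title>{title}</title><pubDate>{date}</pubDate></item>"
        for title, date in entries
    )
    return (f'<rss version="2.0"><channel><title>{channel_title}</title>'
            f"{items}</channel></rss>")


def aggregate(feeds):
    """Merge the items of several feed documents, newest first."""
    merged = []
    for feed_xml in feeds:
        channel = ET.fromstring(feed_xml).find("channel")
        source = channel.findtext("title", default="")
        for item in channel.findall("item"):
            # RSS 2.0 dates use the e-mail date format, e.g. "Sat, 07 Apr 2007 ..."
            published = parsedate_to_datetime(item.findtext("pubDate"))
            merged.append((published, source, item.findtext("title", default="")))
    return sorted(merged, reverse=True)


# Invented sample feeds standing in for real subscriptions.
FEEDS = [
    make_feed("Sports", [("Game recap", "Sat, 07 Apr 2007 08:00:00 GMT"),
                         ("Trade news", "Thu, 05 Apr 2007 12:00:00 GMT")]),
    make_feed("Browser add-ons", [("New extension", "Fri, 06 Apr 2007 18:30:00 GMT")]),
]

if __name__ == "__main__":
    for published, source, title in aggregate(FEEDS):
        print(published.strftime("%Y-%m-%d %H:%M"), f"[{source}]", title)
```

A real aggregator would add downloading, caching and tracking of which items have already been read, but the merge-and-sort step is the heart of it.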

How about tickers?

I personally use a ticker for the majority of my RSS content. Specifically, I use the RSS Ticker extension for Firefox. It displays RSS feed item titles in a scrolling bar, either along the bottom of the window or below the address bar. Feed updates can be configured to occur at set time intervals, and the content can be opened in the current window or a new tab. I can open a single item, all items in a feed or simply all items. Having the ticker just below the status line is unobtrusive and allows me to glance down and scan between coding sessions, picking out the interesting articles. RSS Ticker uses Firefox’s Live Bookmarks as its source, which means no extra work for me when adding feeds. Pretty cool.

Is this a fad?

Thousands upon thousands of sites use RSS today, with more starting each and every day. More mainstream users are coming to understand its usefulness and are subscribing to various feeds of interest. With RSS, information on the Internet becomes easier to find, and web developers can spread their information easily. So no, RSS is not a fad. In fact, it will be an essential part of the web for the immediate future, that is, until the next breakthrough technology comes along.

Differences in OpenOffice .odt vs Microsoft Word .doc

Thursday, July 6th, 2006

This article is being archived here from its original publication in the 3Monkeyweb wiki.

This is the first in a series of articles detailing my experiences with directly manipulating .odt files. I currently have a project to clean up and unify an 800-plus-page document that has been converted among several formats over the years. It is distributed in .doc, .odt, .sxw and .pdf forms.

What am I working with

I’m starting out by creating several files that I can inspect for differences. The files are available for download. Each file is listed below with its size in bytes and a brief description.

  • oo_doc.odt (19175) was created by copying and pasting a 750-word section of a reference article into OpenOffice Writer. I then added a character style and a list style. I applied these two styles and a third default style to the document, and applied the “Heading 1” style to all headers.
  • ms_doc.doc (33280) was created in the same manner as oo_doc.odt, using MS Word 2003 instead of OpenOffice Writer.
  • oo_doc.doc (21504), oo_doc.sxw (18787), oo_doc.rtf (19013), oo_doc_ms.xml (26148), and oo_doc_db.xml (5957) were all created by loading oo_doc.odt and saving it in the appropriate format. oo_doc_ms.xml is Microsoft Word 2003 XML and oo_doc_db.xml is DocBook XML.
  • ms_oo_doc.doc (22528) and ms_oo_doc.odt (19911) were created by loading ms_doc.doc with OpenOffice Writer and saving in the appropriate format.
  • oo_ms_doc.doc (31744) was created by loading oo_doc.doc with Microsoft Word 2003 and simply re-saving.
  • oo_ms_oo_doc.doc (22528) and oo_ms_oo_doc.odt (20653) were created by loading oo_ms_doc.doc with OpenOffice Writer and saving to the appropriate format.
  • ms_doc.rtf (22940) and ms_doc_ms.xml (26011) were created by loading ms_doc.doc in Microsoft Word 2003 and saving to the appropriate format. ms_doc_ms.xml is Microsoft Word 2003 XML.
  • ms_doc.zip (5465), oo_doc.zip (4608), and oo_odt.zip (18195) were created from ms_doc.doc, oo_doc.doc and oo_doc.odt respectively. All were compressed with zip, not gzip, as this is the compression engine OpenOffice uses.

I will probably add more files to the repository as my investigation continues, but for now these will do.

A few observations

First, you will notice that the .odt files are generally smaller than the .doc files created from the same source within OpenOffice Writer: oo_doc is 19175 bytes as .odt versus 21504 as .doc, and oo_ms_oo_doc is 20653 bytes as .odt versus 22528 as .doc. Although the sample size is small, the trend is clear: .odt is more compact than .doc. The second thing you might notice is that ms_doc.doc is highly compressible (84%) while oo_doc.odt is not (6%). I didn’t actually expect the .odt to compress at all, as it is already a compressed format, so that is a little strange. What, then, is its real size? What is even more interesting is the compression on ms_doc.doc. Apparently the .doc format is highly compressible. So why doesn’t Microsoft compress the file? Imagine the bandwidth savings on all of those .doc email attachments. Finally, when an .odt is saved as .doc, re-saved in Word, then converted back to .odt, bloat is introduced.
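For anyone who wants to check the compression figures, the arithmetic is simple enough to sketch. The sizes below are the ones from the file list, and the function name `percent_saved` is just a label chosen for this example. Because the sizes are whole .zip files, which include zip's own headers, the results may come out a point or so lower than the percentage the zip tool itself prints.

```python
def percent_saved(original_bytes, compressed_bytes):
    """Share of the original size removed by compression, in percent."""
    return 100.0 * (original_bytes - compressed_bytes) / original_bytes


# (original size, size of the resulting .zip) in bytes, from the file list.
SIZES = {
    "ms_doc.doc": (33280, 5465),
    "oo_doc.doc": (21504, 4608),
    "oo_doc.odt": (19175, 18195),
}

if __name__ == "__main__":
    for name, (original, zipped) in SIZES.items():
        print(f"{name}: {original} -> {zipped} bytes, "
              f"{percent_saved(original, zipped):.1f}% saved")
```

Either way the picture is the same: the .doc files shrink to a fraction of their size, while the .odt barely moves.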

The first observation can be attributed to differences in the file formats, so I’m not too interested in that. The third observation I will cover in my next post. So let us investigate the second observation, specifically the real size of the .odt. When I unzip the .odt, this is what I get.

> ls -al *
-rw-r--r-- 1 user users 13437 2006-07-06 18:11 content.xml
-rw-r--r-- 1 user users 18 2006-07-06 18:11 layout-cache
-rw-r--r-- 1 user users 1055 2006-07-06 18:11 meta.xml
-rw-r--r-- 1 user users 39 2006-07-06 18:11 mimetype
-rw-r--r-- 1 user users 6608 2006-07-06 18:11 settings.xml
-rw-r--r-- 1 user users 13138 2006-07-06 18:11 styles.xml

Configurations2:
total 0
drwxr-xr-x 2 user users 48 2006-07-06 18:11 .
drwxr-xr-x 6 user users 336 2006-07-06 14:25 ..

META-INF:
total 4
drwxr-xr-x 2 user users 80 2006-07-06 14:25 .
drwxr-xr-x 6 user users 336 2006-07-06 14:25 ..
-rw-r--r-- 1 user users 1173 2006-07-06 18:11 manifest.xml

Pictures:
total 0
drwxr-xr-x 2 user users 48 2006-07-06 18:11 .
drwxr-xr-x 6 user users 336 2006-07-06 14:25 ..

Thumbnails:
total 12
drwxr-xr-x 2 user users 80 2006-07-06 14:25 .
drwxr-xr-x 6 user users 336 2006-07-06 14:25 ..
-rw-r--r-- 1 user users 10261 2006-07-06 18:11 thumbnail.png

Well, a 10k .png will not help our cause. Configurations2, Pictures and Thumbnails are not required by the OpenDocument specification, so let us remove them. We also need to modify META-INF/manifest.xml by removing the elements that reference them. After doing so and re-zipping, the resulting .odt is 8232 bytes, much closer to the compressed .doc. Realize that for significantly larger documents, thumbnail.png becomes much less of a factor. Also realize that if you have embedded graphics or special configuration information, it might not be a good idea to remove those directories. If we open and re-save the document, everything we removed will be put back. But as an optimization exercise this has served its purpose: all we really wanted to discover is how well .odt maps to .doc, and we are in the ballpark.
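The removal steps can also be scripted. What follows is a minimal sketch in Python and not the procedure used to produce the 8232-byte figure, so the resulting size will differ. It assumes the manifest uses the standard OpenDocument manifest namespace, it keeps the mimetype entry first and uncompressed, and it rewrites the manifest with ElementTree, which does not preserve a DOCTYPE line if one is present. The function names and the command-line usage are choices made for this example.

```python
import io
import sys
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
MANIFEST_PATH = "META-INF/manifest.xml"
# Entries removed in the post; do not drop Pictures/ if the document embeds graphics.
DROP_PREFIXES = ("Thumbnails/", "Pictures/", "Configurations2/")


def prune_manifest(manifest_xml):
    """Remove manifest entries whose full-path points at a dropped directory."""
    root = ET.fromstring(manifest_xml)
    attr = f"{{{MANIFEST_NS}}}full-path"
    for entry in list(root):
        if entry.get(attr, "").startswith(DROP_PREFIXES):
            root.remove(entry)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def strip_optional(odt_bytes):
    """Return a copy of the .odt package without the optional entries."""
    ET.register_namespace("manifest", MANIFEST_NS)  # keep the manifest: prefix
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        names = [n for n in src.namelist() if not n.startswith(DROP_PREFIXES)]
        names.sort(key=lambda n: n != "mimetype")  # stable sort: mimetype first
        for name in names:
            data = src.read(name)
            if name == MANIFEST_PATH:
                data = prune_manifest(data)
            # The mimetype entry is stored uncompressed; everything else is deflated.
            method = zipfile.ZIP_STORED if name == "mimetype" else zipfile.ZIP_DEFLATED
            dst.writestr(name, data, compress_type=method)
    return out.getvalue()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python strip_odt.py input.odt output.odt (script name is hypothetical)
    with open(sys.argv[1], "rb") as fh:
        slim = strip_optional(fh.read())
    with open(sys.argv[2], "wb") as fh:
        fh.write(slim)
    print(f"wrote {sys.argv[2]}: {len(slim)} bytes")
```

Work on a copy of the document, since anything stored in the removed directories is gone from the stripped file.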

What have we learned

Natively, OpenOffice provides smaller file sizes than Microsoft Word. In our test case this was on the order of a forty percent reduction: oo_doc.odt at 19175 bytes versus ms_doc.doc at 33280 bytes, a reduction of 14105 bytes, or 42%. We also noted that compressing ms_doc.doc achieved a very substantial decrease in size. However, the practicality of unzipping and re-zipping a .doc file each time we want to edit it is questionable. We were also able to determine that the majority of the bloat in the .odt was caused by the thumbnail feature of OpenOffice, and that for larger documents this should quickly become a non-factor.

Next time

Next time I will cover point three: saving in alternating sessions of OpenOffice and Word introduces bloat. We will take a look at the <office:font-face-decls> element, and I will introduce a script that maintains a clean set of font-face-decls.