Leedberg.com

The online home for Greg Leedberg, since 1995.

Monday, May 29, 2006

No More Word Files!!

No, I'm not declaring that the Microsoft Word file format is dead, or soon to be so. Rather, I'm calling for people to realize why it should be dead (or at least marginalized!).

First, let's solidify what this argument is about. Microsoft Word is a word processor, developed by Microsoft. Over the years, it has become the most-used word processor around the world. Interestingly, it's quite expensive -- Microsoft Office Standard (which includes Word plus a handful of other less-used applications) costs in the neighborhood of $130.

Many arguments have been made against Word on the basis that it is closed-source, while there are free, open source alternatives out there, such as OpenOffice, KOffice, and AbiWord. I'm not here to make this argument. I think that corporations have a right to make money off of their products if they want to (and that other people have the right to make their software available for free, if they want to). Likewise, if someone thinks a product is worth what it costs, they have every right to pay for it. And if having the source code isn't important to them, so be it.

My argument is against the file format Word uses. Whenever you type up a document in Word, you likely save it such that it has a ".doc" extension. This signifies that it is stored in the Word file format. The Word file format is what's considered a "closed" file format. It's binary, so a human wouldn't be able to look at the raw contents and understand them. Worse yet, only Microsoft fully understands the format, and they don't release the specifications of that format. So, only products made by Microsoft can (in theory) fully and reliably read and write Word files. Contrast this with an open file format. Generally, an open file format's raw contents are human-readable, so it's easy to figure out what's going on in the file. Most importantly, the specification for the format is documented and publicly available, so that anybody is free to make a program that can read and write the format.

Ignoring the specific case of Microsoft Word, there are lots of problems with closed (or "proprietary") file formats in general. Most obviously, they lock you into a particular vendor's products. This means if you use Microsoft Word to create a document, to ensure full compatibility you will always need Microsoft Word in order to read that file. There are some free projects that have attempted to reverse-engineer the Word format, but none of them are 100% accurate. This hurts you in the present, since it means that any computers you own will have to have Microsoft products on them in order for you to carry your work between the computers. More frighteningly, this introduces lots of possible problems in the future. You will basically need a copy of Microsoft Word forever in order to continue to read your files. What if Microsoft goes out of business? Stops making Word? Stops making Word for the particular operating system you are currently using, forcing you to upgrade unwillingly? By creating this lock-in, the closed format decreases competition, as people are less likely to use a competing product if all of their existing files will be unreadable. This is true for any product that uses proprietary formats.

Also, using closed formats is a hindrance to open communication. If you want to type up a document in a closed format such as Word, and send it to someone, they have to have Word as well. This turns the closed file format almost into a "virus" of sorts -- it keeps spreading as people find a need to communicate with someone who already has it. If the person you want to send the file to doesn't have Word, you won't be able to share your information with them.

The above reasons are general arguments against all closed file formats, and they all apply to the Microsoft Word file format. But of course, the Word format has several of its own particular downsides. For one, if you are forced to upgrade to the newest version of Word in order to read your old files, you may very well find that the new version of Word can't actually read your old files. Even though Microsoft has the specifications to this closed format, it has a notorious reputation for somehow making it so that new versions of Word have problems reading certain older files. And of course since only Microsoft has the specification to this format, you're out of luck if you want to try and find some other program to use.

Also, it is a problem that Word is not cross-platform -- it is only available to people who run Microsoft Windows or Apple's operating systems. So, if you want to send a file to someone using some other operating system, such as Linux or BSD, there is simply no way for them to acquire a copy of Microsoft Word, and you are completely blocked from communicating with them. On this same train of thought, you have to keep in mind that Microsoft Office is a very, very expensive program to purchase. As I said above, it's $130 just for the most basic functionality. It is entirely possible that this is more than some people can afford, or is more than some people think Office is worth. It's not at all clear to me why someone would assume that their peer has purchased a program that costs this much money. Sending someone Word files may be putting pressure on them to spend the money for Office -- money they may not be able to spare.

So what's the solution? Clearly, the point I'm getting to is that we should try to use open formats rather than closed formats. Currently, the best example of an open format for word processing is OpenDocument. OpenDocument is an open, XML-based file specification that was developed by a committee of interested organizations. It incorporates the vast majority of word processing features that existing products such as Word offer. However, the specification is completely open, and anybody can produce a product that can read and write it. Several already do, most notably OpenOffice. It is expected that in the future there will be a plugin for Word that will allow it to use this format, and eventually it's likely that Word will even natively support it.
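To give a feel for what "open, XML-based" means in practice, here is a rough Python sketch. It relies on the fact that an OpenDocument text file is a ZIP archive whose text lives in a member named content.xml. To stay self-contained it builds a toy archive in memory that only mimics that layout; it is not a complete, valid OpenDocument file. The helper names (build_toy_archive, extract_paragraphs) and the sample text are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the OpenDocument specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Made-up sample content, shaped like the body of an OpenDocument text file.
SAMPLE_CONTENT = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    "<office:body><office:text>"
    "<text:p>Hello from an open format.</text:p>"
    "<text:p>Anyone can read this.</text:p>"
    "</office:text></office:body>"
    "</office:document-content>"
)


def build_toy_archive() -> bytes:
    """Build an in-memory ZIP with a content.xml member (not a full, valid .odt)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", SAMPLE_CONTENT)
    return buffer.getvalue()


def extract_paragraphs(odt_bytes: bytes) -> list[str]:
    """Return the text of every text:p paragraph found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


paragraphs = extract_paragraphs(build_toy_archive())
```

Nothing here needs a word processor at all, just a ZIP reader and an XML parser, which is exactly the kind of freedom a documented format gives you.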

OpenDocument is probably the best open format for word processing currently, but even if you don't use OpenDocument right now, you should at least use something more open than Word's default format (especially when you send a file to someone). Formats such as RTF, PDF, and HTML are relatively well-understood and/or open, and have both free and commercial readers and writers available for most operating systems. Conveniently, Word itself can natively read and write both RTF and HTML.

In conclusion, I think that the success of the Word file format is one of the worst things to happen to the computer industry -- ever. It's pretty bad for storing your own personal files, but it's especially bad for cases where you want to share your files with other people -- closed formats simply weren't designed for this. If you need to send a file to someone, please, please, don't send them a Word file. Convert it to something more open. And even if you store your everyday documents in Word format, consider saving your most important documents in a format that you know will still be accessible 10 years from now.

Of course, ideally, you should just use OpenDocument for everything.


Friday, May 12, 2006

Spam: How To Deal With It

As I've talked about before, I receive a large amount of spam. Not as much as I used to -- at its peak, I received about 200 spam messages a day -- but still too much. One would hope that after years of having to deal with spam, I would have some advice to pass to people, and I do.

The first, and most important, thing to do in order to deal with spam is to never, ever, put your email address on the web in plain text. Frequently people make websites and want to be able to receive feedback, so they give their email address. But, this is exactly how spammers make their email address lists. They have programs that crawl the web, just looking for email addresses to spam. Even putting it in a "mailto:" link makes it available to them. If you really want people to be able to send you mail from your site, find out if your hosting provider has some sort of formmail solution for you to use. For example, at Dreamhost, I can make a page that has a form that a user can fill out, the form gets submitted to a program, and the program is configured to send the message to me. My email address is not visible at all on the form page. Most hosting providers have something similar.
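To show how little effort harvesting takes, here is a rough Python sketch of the kind of pattern matching a crawler might run over a page. The regular expression is a simplified assumption on my part (real harvesters are surely more elaborate), and the sample page, its addresses, and the /cgi-bin/formmail path are all made up.

```python
import re

# A deliberately simple pattern for things that look like email addresses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest_addresses(page_source: str) -> list[str]:
    """Return each address-like string in the page, in order, without repeats."""
    found = []
    for candidate in EMAIL_PATTERN.findall(page_source):
        if candidate not in found:
            found.append(candidate)
    return found


# Hypothetical page: one plain-text address, one mailto: link, one form.
sample_page = """
<p>Questions? Write to webmaster@example.com.</p>
<a href="mailto:feedback@example.com">Send feedback</a>
<form action="/cgi-bin/formmail" method="post">...</form>
"""

harvested = harvest_addresses(sample_page)
```

Both the plain-text address and the one tucked inside the "mailto:" link get picked up, while the form gives the crawler nothing to find, because the address only exists in the server-side configuration.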

Crawling the web is one way that spammers get your address. Another way is when you give your email address out to companies online, who then go on to sell your address to spammers. To combat this, I use a free service called SpamGourmet. The way this works is that I set up an account with SpamGourmet. Let's say my account name is GregLeedberg (it isn't). You tell SpamGourmet what your real address is (so you do have to trust SpamGourmet itself). Then, whenever you need to give an email address to a company, you can, on the fly, make up a forwarding address through SpamGourmet that will forward through to your real address, but will only forward a certain number of emails. You don't even have to go to SpamGourmet to make the new address. All you have to do is give the company an address in the form [some unique identifier].[maximum number of messages you want forwarded].[account name]@spamgourmet.org. So if EvilCorporation wanted my address, I could give them an address such as evilcorporation.20.gregleedberg@spamgourmet.org. The first 20 messages would be forwarded to me, in case they are legitimate, and after that the address is no longer valid. So if EvilCorporation sells that address and it gets spammed, it will quickly stop forwarding me the messages. For each company you need to give an address to, you can come up with a new unique identifier, which will have its own message counter. I use this whenever I need to give an email address to someone I don't automatically trust.
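The address format is simple enough to sketch in a few lines of Python. The first function just assembles the address as described above. The ForwardingCounter class is only a toy model of the per-address message counter, written from my description of the behavior; it is not how SpamGourmet is actually implemented, and all of the names are made up. It also assumes the identifier and account name contain no dots of their own beyond the separators.

```python
def disposable_address(identifier: str, max_messages: int, account: str) -> str:
    """Build an address of the form identifier.count.account@spamgourmet.org."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    return f"{identifier}.{max_messages}.{account}@spamgourmet.org".lower()


class ForwardingCounter:
    """Toy model: each disposable address forwards only its first N messages."""

    def __init__(self) -> None:
        self.remaining: dict[str, int] = {}

    def should_forward(self, address: str) -> bool:
        local_part = address.split("@", 1)[0]
        # The limit is the middle field: identifier.limit.account
        _identifier, limit, _account = local_part.rsplit(".", 2)
        left = self.remaining.setdefault(address, int(limit))
        if left <= 0:
            return False  # counter exhausted, message is dropped
        self.remaining[address] = left - 1
        return True


address = disposable_address("evilcorporation", 20, "GregLeedberg")
counter = ForwardingCounter()
results = [counter.should_forward(address) for _ in range(21)]
```

In this model the first 20 entries of results are True and the 21st is False, and a second address (say, for a different company) would get its own independent counter.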

So, the above two methods try to reduce the number of spammers that have my email address. But inevitably, you will end up on some spam lists. What to do then?

Well, my first line of defense if I get spam is to use a free service called SpamCop. Usually, when you get a spam, the "from" line is obfuscated, as are any links within the email. You can send your spam to SpamCop, which has tools that are able to figure out where a spam is really coming from, and where the links are really going, and can then send complaints to the ISPs involved. And it seems that, on the whole, ISPs really do listen to SpamCop reports that they get. By using SpamCop, you can effectively shut down a spammer's current account. Of course, in time they will just open a new account somewhere else. But if we keep them moving and continually make it hard for them to send their spam, hopefully eventually they will give up, or fewer people will want to spam.

Lastly, even with all of these tactics, I still get spam. So as a last layer of protection, I just use Mozilla Thunderbird as my email client, which has excellent spam filtering algorithms which can learn to detect the spam that you get. This at least makes it so that I don't have to explicitly look at every single spam message I receive. Sure, some messages get past the Thunderbird filter, but it still significantly cuts down on what I have to see.

So that is how I have dealt with my huge spam problem over the past few years. It works okay, but I still wish that spammers would realize that what they do is really not an effective way of marketing. Annoying your potential customers doesn't win sales, it just causes backlash.


Sunday, May 07, 2006

My Backup Strategy

My computer is the centerpoint of my life. Partly, that's because I'm a geek, and a software engineer. But, I would wager a bet that many people, including non-geeks, are in the same situation. These days, our computers are home to at least our digital photographs, emails, word processed documents, music, financial data, and lots more depending on what you do with your computer. That being the case, it's easy to see that if any one of us were to lose our computer one day, we'd also be losing a lot of important data and memories.

Which is why everyone should back up their stuff. And, everyone should have a sensible backup strategy that protects in both small-scale and large-scale data loss scenarios. So today, I'm going to describe to you my personal backup strategy, which has developed over the span of my computing life, and which I now think is pretty good.

First, what do I back up? Some people back up every bit on their hard drive. I don't. What's important to me is exactly what I described above -- my pictures, documents, music, etc. Basically, my data. Not my program installations, registry settings, and so on. On my computer, I have a very strict separation of program data and user data. All of my personal data is kept on one partition (which just happens in this case to be an entire drive), and my "My Documents" directory is mapped to that partition. My Windows installation directory and all of my programs are installed to a separate partition / drive. I figure, if I lose my hard drive, I can always re-install my programs, so what's really important is to make sure I don't lose my data. This also significantly cuts down on the size of my backups.

Now that it's clear what I'm backing up, it's important to think of just what sorts of scenarios we're trying to protect our data from. The most common data loss scenario is simply a hard drive dying. This happens quite frequently. Another data loss scenario is that something physically happens to the computer itself -- a power surge, you drop the entire computer, the power supply catches fire and the case fills up with smoke, etc. This happens less frequently, but in the worst case this causes the entire computer to be unusable. Another data loss scenario is that something happens to the entire house/building where you store your computer -- fire, flooding, etc. Lastly, it's worth considering the scenario of a large-scale natural disaster, such as earthquake, hurricane, or even military attack. In those sorts of cases, your data probably won't be the first thing on your mind, but months later you'd probably start to wish that you had your old digital pictures and documents.

How do we protect against the most common scenario, of a hard drive failing? For this, I have two hard drives. One has all of my programs and OS on it, and one has all of my data. However, on my program-only drive, I also have a large partition that serves as a backup for the data-only drive. Every single night, I have an rsync script that performs an incremental backup of my data onto the backup partition. Since it is incremental, I am only moving the data that has changed, not the entire 40GB of data. Obviously, this plan requires that my backup partition be the same size as my data drive, and so my program drive needs to be significantly bigger than the data drive (so it can store all of my programs, plus the data backup). You just have to take that into account when upgrading drives. This backup is performed every night, since this is the most common type of data loss. So, if my data drive failed tomorrow, I would have a backup of that drive that is no more than 24 hours old.
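To illustrate what "incremental" means here, this is a rough Python sketch of the idea: walk the data drive and copy a file only if the backup copy is missing or differs in size or timestamp. This is not my actual script and it is not rsync, which does a great deal more (it can transfer only the changed parts of a file and can remove files that were deleted at the source, for example). The function names are made up, and the demo runs on throwaway temporary directories standing in for the data drive and the backup partition.

```python
import os
import shutil
import tempfile


def needs_copy(src: str, dst: str) -> bool:
    """Quick check: copy if the backup is missing or differs in size or timestamp."""
    if not os.path.exists(dst):
        return True
    src_info, dst_info = os.stat(src), os.stat(dst)
    return (src_info.st_size != dst_info.st_size
            or int(src_info.st_mtime) != int(dst_info.st_mtime))


def incremental_backup(source_dir: str, backup_dir: str) -> list[str]:
    """Copy only new or changed files; return their paths relative to source_dir."""
    copied = []
    for folder, _subdirs, files in os.walk(source_dir):
        relative = os.path.relpath(folder, source_dir)
        target_folder = os.path.join(backup_dir, relative)
        os.makedirs(target_folder, exist_ok=True)
        for name in files:
            src = os.path.join(folder, name)
            dst = os.path.join(target_folder, name)
            if needs_copy(src, dst):
                shutil.copy2(src, dst)  # copy2 also preserves the timestamp
                copied.append(os.path.normpath(os.path.join(relative, name)))
    return sorted(copied)


def write_file(path: str, text: str) -> None:
    with open(path, "w") as handle:
        handle.write(text)


# Demo on temporary directories (stand-ins for the real drives).
with tempfile.TemporaryDirectory() as data_drive, \
        tempfile.TemporaryDirectory() as backup_partition:
    write_file(os.path.join(data_drive, "taxes.txt"), "2005 return")
    write_file(os.path.join(data_drive, "notes.txt"), "first draft")

    first_run = incremental_backup(data_drive, backup_partition)   # everything is new
    second_run = incremental_backup(data_drive, backup_partition)  # nothing changed

    write_file(os.path.join(data_drive, "notes.txt"), "second draft, now longer")
    third_run = incremental_backup(data_drive, backup_partition)   # one file changed
```

The first run copies both files, the second copies nothing, and the third copies only the file that was edited, which is why a nightly run stays quick even when the full data set is large.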

With isolated drive failure covered, what if the entire computer were incapacitated (i.e., the power supply catches fire, smoke fills the case, and every component is killed, thus eliminating both my data drive and the backup of the data drive)? To protect against this, once a month I back up my data drive to DVD-RW media. This way, the backup is stored outside of the case, but is still accessible. I only do this once a month because this can't be automated with a script, so it requires more effort and time on my part. It's worth noting that DVD-RW (and CD-RW) media can't necessarily be trusted, so to make this backup more reliable, I actually use two different DVD-RW discs, which I alternate between each month. So I always have one disc which is no more than 1 month old, and one disc which is no more than 2 months old. This is better than using just one, only to find out when I need it that it's actually not been working for the past several months. It's also worth noting that since obviously a 4.7GB DVD is not enough to hold all of my data, this is actually just a selective backup. I leave out big media items, like ripped music files and home movies. In a crunch, I could do without those items.
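The monthly routine boils down to two small rules, sketched here in Python: which of the two discs gets overwritten this month, and which files are left out of the selective backup. The disc labels and the list of skipped extensions are assumptions for the sake of the example, not a record of my exact setup.

```python
import os
from datetime import date

DISC_LABELS = ("Disc A", "Disc B")                         # hypothetical labels
SKIPPED_EXTENSIONS = {".mp3", ".wav", ".avi", ".mpg"}      # assumed "big media" types


def disc_for(backup_date: date) -> str:
    """Alternate between the two discs from one calendar month to the next."""
    months_elapsed = backup_date.year * 12 + (backup_date.month - 1)
    return DISC_LABELS[months_elapsed % 2]


def include_in_monthly_backup(filename: str) -> bool:
    """Leave out big media files so the rest fits on a single disc."""
    extension = os.path.splitext(filename)[1].lower()
    return extension not in SKIPPED_EXTENSIONS


may_disc = disc_for(date(2006, 5, 7))
june_disc = disc_for(date(2006, 6, 4))
```

Counting months across year boundaries (rather than just using the month number) keeps the alternation going from December into January, so the older disc is always the one that gets overwritten.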

So now we've covered every data loss scenario except loss of the entire building, and large-scale disaster. I use just one backup method to protect against these two scenarios. For this, once a year I back up all of my data (including media) to a set of several DVDs, and then store this DVD set as far away from my computer as I possibly can. When I was at college, for instance, I would store these discs at home. The idea is, in the more likely case (of these two scenarios) that the building is lost due to fire or flood, you've got a backup set stored somewhere outside of the building that you can fall back on. Even in the more extreme case that your entire region is affected, hopefully the set is far enough away that it is still safe. Due to the lower chance of these scenarios happening, this set is only created once a year. So, worst case, you revert back to your data as of no more than 1 year ago. That's still better than no data at all. I use DVD-Rs here because they tend to be more reliable long-term than the re-writable kind. And since you can't erase them, once they are no longer the "latest" annual backups, you can keep them with the computer just so you can have some extreme roll-back capability, possibly spanning several years of these backups.

So, that's my backup strategy. You'll notice I don't use any special backup programs. Just rsync (a free, open source file sync'ing tool that comes with Linux and Cygwin) and any DVD/CD burning application. I think, on the whole, that backup programs are a scam.

If you don't already have a backup strategy, I hope that this will prompt you to start doing some sort of backup. And if you already do backups, I hope that maybe I've explored some scenarios you hadn't thought of, or given you some new ideas.


Monday, May 01, 2006

The Daily Show vs. The Colbert Report

One of my favorite television shows -- one of the few I've consistently watched religiously over the years -- is The Daily Show with Jon Stewart. I even saw Jon Stewart live once when he came to Cornell. The premise of the Daily Show is simple: it's a "fake" newscast. It doesn't aspire to be a full-fledged newscast like you would see on your local news channel at 5:00. Rather, it lampoons the day's current events, newsmakers, and oddities. It's structured just like a newscast (complete with contributors, in-depth segments, and sound bites), but it makes fun of what's going on in the news, rather than merely reporting it.

The humor is quite smart, too. For sure, not everyone in the world would get the humor. It's full of pop culture references, biting satire, and jokes that at times, surprisingly, actually require a pretty in-depth knowledge of what's going on in the world (a study once showed that Daily Show viewers were more knowledgeable about the news and better educated than O'Reilly viewers). The commentary frequently goes beyond just fun and games, and ends up making very intelligent, non-mainstream statements on the state of the world. Everyone is fair game on the Daily Show -- politicians, common people, and even the media itself. The Daily Show has even won 7 Emmys and 2 Peabody awards.

One of the most-liked contributors on The Daily Show was Stephen Colbert, and in 2005 he was put at the helm of the first Daily Show spin-off: The Colbert Report. Given his popularity, his new show attracted a massive audience almost by default. Both shows are based on the idea of being fake news shows, and I think many people felt that The Colbert Report would end up being effectively an extra half hour of The Daily Show.

But that ended up not being the case. Whereas the Daily Show is a fake newscast that makes fun of what's going on in the news, The Colbert Report is more accurately described as a parody of the news shows themselves (particularly those in the vein of the O'Reilly Factor). Stephen Colbert portrays a fictional character in the show, rather than being himself. He's an arrogant, ultra-conservative, narrow-minded newscaster with his own opinions to push. The show is a satire of the conservative media, as well as conservative politicians. Some news of the day is covered, but there is more of an emphasis on segments and interviews.

Based on the first few episodes of The Colbert Report, I had some positive reactions. First, I was pleased that it wasn't just trying to be a rip-off of The Daily Show. The whole idea of it being a parody of news talk shows was rather original. And one segment in particular -- "The Word" -- was consistently hilarious.

But initially, my overall reaction was actually negative. Interviews -- which are the highlight of The Daily Show -- were horrible at first. In The Daily Show, Stewart frequently asks "hard-hitting" questions of his guests, in the midst of more light-hearted questions. Colbert seemed to completely stay away from any questions of substance, instead trying to keep on building up the fake character he was trying to develop. Indeed, Colbert pretty much stayed away from asking his guests any questions, leading to quite lifeless interviews. The other segments on the show were pretty boring, and in the beginning the whole "ultra-conservative parody" angle wasn't played up as much; there was more of a focus on "arrogant" -- which got annoying fast.

However, since its debut, I would say that The Colbert Report has improved significantly. The interviews in particular are a lot smarter. Guests seem to be much better picked -- for instance, pairing up anti-Bush and ultra-liberal personalities against Colbert's fake character. In this setting, the contrast works amazingly well, and we even get some interesting conversations now and then. And overall, Colbert seems to be a lot more comfortable in his parody of O'Reilly-types now, focusing more on being narrow-minded and ultra-conservative. In this world where most of the media seems to fit that mold (but never cracks a laugh), it's very refreshing to see it taken to such a humorous extreme.

The Colbert Report has a lot of potential, especially if it continues to focus more on being a parody of a news media that takes itself too seriously. I actually watch The Colbert Report by choice now. It's refreshingly different from The Daily Show, so that neither one will likely eclipse the other in quality.

In the end, if you're looking for some smart political humor, try to check them both out.
