sorting text records on the first column only

It turns out that join is a strange beast. Let's say you have two files, which are textual records using TAB as separator, containing different columns beside the first one, which is the record key. Using join you can do cool stuff like join file1 file2 to obtain a join in the relations sense, by default on the first column.

The join manpage states that the two fields must be sorted on the join field. This is of course to avoid complexity explosion. The annoying thing is that if you forget to do that, you will miss tuples, without any warning, which is not good (and can actually be at least spotted avoiding complexity explosion ...). So yeah, if I'm here it's because after having been hit by this, I finally read the fine manpage :-)

But how to sort on the join field? Turns out that dear old sort -u file | sponge file is not enough. Why? Because by default it will use all keys to sort and then you will end up sorting also on data values; needless to say that such sorting methods is not stable among the two different files you want to sort in turn.

OK then, let's try with sort -u -k 1 file | sponge file. Not even, but close. ... because sort with this syntax will start sorting at key 1 (which is the default), but then continue sorting with the other keys, in the hope of doing you a favour.

The magic line is then sort -u -k 1,1 file | sponge file, which tells sort to (damn) sort using only the first (damned) key and stopping there. Thank you sort. Have a nice day.

I’m going to FOSDEM, the Free and Open Source Software Developers’ European Meeting

PS I've been sick this week, but I'm definitely getting better, so I can finally announce that I'm going to FOSDEM 2009, yay, see you there!