sorting text records on the first column only
It turns out that join
is a strange beast.
Let's say you have two files, which are textual records using TAB
as separator, containing different columns beside the first one,
which is the record key. Using join you can do cool stuff like
join file1 file2
to obtain a join in the relations
sense, by default on the first column.
The join manpage states that the two fields must be sorted on the join field. This is of course to avoid complexity explosion. The annoying thing is that if you forget to do that, you will miss tuples, without any warning, which is not good (and can actually be at least spotted avoiding complexity explosion ...). So yeah, if I'm here it's because after having been hit by this, I finally read the fine manpage
But how to sort on the join field? Turns out that dear old
sort -u file |
sponge
file
is not enough. Why? Because by default it will
use all keys to sort and then you will end up sorting also
on data values; needless to say that such sorting methods is not
stable among the two different files you want to sort in turn.
OK then, let's try with sort -u -k 1 file | sponge
file
. Not even, but close. ... because sort
with this syntax will
start sorting at key 1 (which is the default), but then
continue sorting with the other keys, in the hope of doing you a
favour.
The magic line is then sort -u -k 1,1 file | sponge
file
, which tells sort
to (damn) sort using
only the first (damned) key and stopping there. Thank you
sort
. Have a nice day.
PS I've been sick this week, but I'm definitely getting better, so I can finally announce that I'm going to FOSDEM 2009, yay, see you there!