From branderh Sat Oct 19 12:21:18 1996
Return-Path: <branderh>
Received: by debian.iaehv.nl
	id m0vEYWp-0002isC
	(Debian /\oo/\ Smail3.1.29.1 #29.37); Sat, 19 Oct 96 12:21 MET DST
Sender: branderh
Received: from gatekeep.ti.com (news.ti.com [192.94.94.33]) 
          by iaehv.IAEhv.nl (8.6.13/1.63) with ESMTP; pid 6684
          on Sat, 19 Oct 1996 05:26:57 +0200; id FAA06684
          efrom: meyering@asic.sc.ti.com; eto: <branderh@IAEhv.nl> 
Received: from asic.sc.ti.com ([156.117.183.136]) by gatekeep.ti.com (8.6.13) with ESMTP id WAA26804 for <branderh@IAEhv.nl>; Fri, 18 Oct 1996 22:26:25 -0500
Received: from appaloosa.asic.sc.ti.com (appaloosa [156.117.183.236]) by asic.sc.ti.com (8.7.6/8.7.3) with ESMTP id WAA27166 for <branderh@IAEhv.nl>; Fri, 18 Oct 1996 22:26:41 -0500 (CDT)
Received: (from meyering@localhost) by appaloosa.asic.sc.ti.com (8.7.6/8.7.3) id WAA16552; Fri, 18 Oct 1996 22:26:40 -0500 (CDT)
Sender: meyering@appaloosa.asic.sc.ti.com
To: branderh@IAEhv.nl
Subject: Re: sort ignore characters
References: <m0vDyhG-0002iIC@debian.iaehv.nl>
Reply-To: meyering@asic.sc.ti.com, meyering@na-net.ornl.gov
From: Jim Meyering <meyering@asic.sc.ti.com>
Date: 18 Oct 1996 22:26:37 -0500
In-Reply-To: branderh@IAEhv.nl's message of Thu, 17 Oct 1996 22:05:37 +0200 (MET DST)
Message-ID: <wpgbudzd47m.fsf@asic.sc.ti.com>
Lines: 54
X-Mailer: Red Gnus v0.52/Emacs 19.34
X-UIDL: 425f1f8a91d9e8415b13c677e503c39a
Status: RO

| I'm busy sorting wordlists containing 8-bit characters from the iso-8859-1
| set and we use some special character for marking the hyphenation points in
| each word.  A typical example looks like this:
| 
| (sorted with a self-made implementation: wordsort.c)
| 
| ab7sur7dis7tisch
| ab7sur7dist
| ab7surd
| aba7cus
| abac7tis
| aban7don
| 
| As you can see absur... comes earlier than abac...
| 
| (the following is sorted by gnu sort textutils 1.19m)
| 
| abac7tis
| aban7don
| aba7cus
| ab7surd
| ab7sur7dist
| ab7sur7dis7tisch
| 
| One might say that gnu sort is right here, but it seems to ignore the
| extended characters (the non-ascii ones).  This is not right; IMHO it
| should notice them.

It does compare them.  I think the problem is that your special
character compares `larger' than all others.
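
A quick way to check this is to sort two of the words bytewise. This is only a sketch: it assumes the hyphenation marker is the byte 0xB7 (octal 267), which is not stated in the original message, and it forces the C locale so that sort compares raw byte values.

```shell
# Assumption: the marker is byte 0xB7 (octal 267), which is larger than
# every ASCII letter. LC_ALL=C makes sort compare raw byte values.
first=$(printf 'ab\267surd\nabacus\n' | LC_ALL=C sort | head -n 1)

# The word without a marker in third position sorts first,
# because 'a' (0x61) is smaller than 0xB7.
echo "$first"    # prints: abacus
```

If the marker were a byte smaller than the letters, the order would be reversed, which would match the wordsort.c output quoted above.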

| Once it does notice them, I would like to suggest a new feature right
| away: ignore characters.  That would provide sorting where certain
| characters aren't seen (like the behaviour of gnu sort right now, but
| then user-definable).

Rather than trying to add a feature, let's try to do your sort
using POSIX.2 sort with some other tools:

Construct an aux file that looks like this:

  abactis abac7tis
  abandon aban7don
  abacus aba7cus
  absurd ab7surd
  absurdist ab7sur7dist
  absurdistisch ab7sur7dis7tisch

with the first field being the word with all special characters removed.
You could use this: (untested)

  cut -f 1 orig | tr -d 'SPECIAL-CHAR' | paste -d ' ' - orig > aux

Then, sort that just-created aux file on the first field and
use cut to remove the first field.
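
The whole procedure can be sketched as one small shell function. The name `sort_ignoring` and its two arguments are made up for illustration, and the sketch assumes one word per line with no spaces inside a word. It pairs each stripped word with its original line by position, sorts on the stripped key only, and then drops the key.

```shell
# Hypothetical helper: sort_ignoring MARKS FILE
# MARKS is a tr(1) set of characters to ignore, e.g. '\267' for byte 0xB7.
# Assumes one word per line and no spaces inside a word.
sort_ignoring() {
    marks=$1
    file=$2
    # Field 1: the word with the marks deleted; field 2: the original word.
    LC_ALL=C tr -d "$marks" < "$file" |
        paste -d ' ' - "$file" |
        # Sort on the stripped key only, bytewise.
        LC_ALL=C sort -k 1,1 |
        # Drop the key, leaving the original words in the new order.
        cut -d ' ' -f 2-
}
```

It would be called as `sort_ignoring '\267' orig > results`. The C locale is chosen here only to make the comparison predictable; for a case-insensitive or locale-aware order, adjust the sort options.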



Added by erick:
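And the solution is:
# Build "stripped-word original-word" pairs, sort on the stripped word
# (ignoring case), then drop the stripped word again.
tr -d '\267=+#-' < orig \
| paste -d' ' - orig \
| sort -f -k1,1 \
| cut -d' ' -f2- >results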


