The two lists contained in this archive are the product of the
"n-dicts" project (n being a variable whose value is currently
12).  The purpose of this project is to create a list of words
which approximates the common core of the vocabulary of American
English.

The methodology of the project is to record and correlate the words
listed in a number of small dictionaries.  The number of dictionaries
so recorded is now 12, comprising 8 ESL (English as a Second Language)
dictionaries and 4 "desk dictionaries".  The dictionaries chosen
vary widely by publisher, by style, by completeness and by depth.
One of them is a British dictionary with an international bent; the
remainder are dictionaries of American English (three from British
publishers).  The smallest of them contains about 20,000 entries, and
the largest 44,000.  (All told, there are about 74,000 entries,
many of which appear in only a single dictionary.)  All but two of
them were published in the last five years.

I tried two different ways of winnowing this data to produce lists of
common words.  Both have produced interesting results, included
herein.  One list, the 6of12 list, contains all words and phrases
listed in at least 6 of the 12 dictionaries.  One way of describing
this list is that it contains those words and phrases which a
(seeming) majority of lexicographers believe are relevant to people
learning English,
and/or to everyday usage.  This list contains about 32,000 words and
phrases.  The other list, the 2of12 list, is more inclusive in that it
includes words listed in as few as two of the source dictionaries, but
less inclusive in that it excludes items of various sorts, including
multiword phrases, proper names and abbreviations.  This list contains
about 42,000 words.  It is perhaps more suitable for use in areas
like spell checking or word games than the 6of12 list.  (Honesty
compels me to admit that neither 12dicts list is, by itself, a good
choice for spell checking, due to the absence of inflections, proper
names, Roman numerals, etc.)

A more precise description of the criteria by which the lists were
composed is as follows:

1.  The 6of12 list contains all non-excluded words and phrases which
    appear in 6 or more of the source dictionaries.
2.  Prefixes and suffixes are excluded.  Abbreviations are included;
    however, if they are entirely lower-case and alphabetic, they are
    terminated with a colon (":") so they can be easily distinguished
    from regular words.
3.  Inflections of included words are not themselves included unless
    they are separately defined or irregular.
4.  It sometimes occurs that different spellings of the same word
    are listed in 6 or more dictionaries, even though no single form
    is so listed.  In this case, if one spelling is clearly more
    accepted, this spelling and this spelling only is listed.  If all
    spellings seem equally accepted, one spelling has been selected
    arbitrarily for inclusion.
5.  The 6of12 list contains a significant number of words which do not
    meet either criterion 1 or 4.  These words, sometimes called
    "signature words", are discussed below.  All of these words are
    listed in at least one of the source dictionaries.
6.  In addition to the ":" suffix discussed above, other special
    suffix characters are used to mark words with certain character-
    istics, as discussed below.

1.  The 2of12 list contains all non-excluded words which appear in at
    least 2 of the source dictionaries.
2.  This list excludes capitalized words, multiword phrases, and
    abbreviations, as well as prefixes and suffixes.  It does not
    exclude hyphenated words or contractions.  If a word occurs in
    both a hyphenated and an unhyphenated form, the unhyphenated
    form is listed, even if the hyphenated form is generally
    preferred.
3.  The list excludes spellings which are considered (by a majority
    of the dictionaries listing them) to be non-American usage.  It
    also excludes secondary spellings which are mentioned by fewer
    than four of the source dictionaries.
4.  Inflections of included words are not themselves included unless
    they are separately defined, or irregular.
5.  The list also includes a small number of signature words, as
    discussed below.
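
The counting step behind criterion 1 of each list can be sketched in
Python as follows.  This is only an illustration, not the tooling
actually used to build the lists: the function name n_of_m and the toy
data are hypothetical, and the sketch ignores the exclusions,
spelling-variant rules and signature words described above.

```python
from collections import Counter


def n_of_m(dictionaries, threshold):
    """Return the sorted words listed in at least `threshold` dictionaries.

    `dictionaries` is an iterable of word collections, one per source
    dictionary (a hypothetical in-memory representation).
    """
    counts = Counter()
    for entries in dictionaries:
        # Count each word once per dictionary, even if it is listed twice.
        counts.update(set(entries))
    return sorted(word for word, n in counts.items() if n >= threshold)


# Hypothetical toy data: three tiny "dictionaries".
toy = [
    {"apple", "beer", "log"},
    {"apple", "beer", "kombu"},
    {"apple", "log", "bhoy"},
]
common = n_of_m(toy, 2)  # words listed in 2 or more of the 3
```

With twelve word collections and a threshold of 6 or 2, the same
function would give only a starting point for the 6of12 or 2of12 list;
the remaining criteria would still have to be applied separately.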

As indicated, both lists have been augmented with words (and, in the
case of the 6of12 list, phrases) which fail to meet the formal
requirements for inclusion.  In the case of the 6of12 list, 1024
words were added (about 3% of the total).  These are all words which,
in the judgment of the compiler, are as familiar as many of the words
which met the criteria for inclusion.  Examples of some of the sorts
of words which were added are:

1.  Words of the same category as other included words.  An example is
    the astrological sign "Cancer", which alone of all the astro-
    logical signs fails to appear in 6 or more of the dictionaries.
    Similarly added were the omitted holidays "Thanksgiving" and
    "Valentine's Day".
2.  Vulgarities, sexual terms and insults.  Some such words were
    already included, but most of the source dictionaries were quite
    squeamish about them.  These words are very widely known indeed;
    I hold that any list of "common" words which does not include the
    infamous f-word is simply discredited thereby.  Some may feel that
    it would have been better to leave some or all of these terms
    unmentioned.  Nevertheless, the expression of blasphemy,
    unwarranted contempt, and perverse lust, whether in words or in
    deeds, is a very human trait.  Suppressing the evidence of these
    aspects of the human condition in our language makes no more sense
    than excluding "leprosy", "gangrene" and "dementia", no matter how
    unpleasant they may be to contemplate.
3.  Conventional conversational phrases so common as to be practically
    invisible to native speakers.  Examples are "thank you", "good
    night", "uh-huh", "of course" and "gesundheit".
4.  Sports terminology, especially for football and baseball.  (If I,
    who am practically sports-blind, noticed this deficiency, it must
    be of major proportions indeed.)

Note that the signature words in the 6of12 list can be identified via
the suffix character "+", and eliminated if desired.
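
A minimal sketch of eliminating the signature words, assuming the list
is read one entry per line and that annotation characters occur only at
the end of an entry.  The names SUFFIX_CHARS, is_signature and
drop_signature_words, and the sample entries, are hypothetical.

```python
# Annotation characters used in the 6of12 list (assumed to appear only
# as a trailing run at the end of an entry).
SUFFIX_CHARS = ":&#<^=+"


def is_signature(entry):
    """True if the entry's trailing annotation characters include "+"."""
    entry = entry.strip()
    word = entry.rstrip(SUFFIX_CHARS)
    return "+" in entry[len(word):]


def drop_signature_words(entries):
    """Return the entries that are not marked as signature words."""
    return [e for e in entries if not is_signature(e)]


# Illustrative entries only, not copied from the actual list.
sample = ["Cancer+", "thank you+", "zebra", "abc:"]
remaining = drop_signature_words(sample)
```

Checking the whole trailing run, rather than only the last character,
allows for entries which carry "+" in combination with another suffix.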

A much smaller set of words (51) was added to the 2of12 list.  These
were of two sorts:

1.  Signature words from the 6of12 list which were not already present
    in the 2of12 list, and which are not excluded due to being
    abbreviations, phrases, etc.
2.  Inflections of irregular verbs not explicitly mentioned in 2
    source dictionaries, such as "outfought" and "reheard".

Some of the 6of12 list entries are annotated with a suffix character,
giving additional information about the associated word.  The
annotations can be easily removed with an editor or script if
they are unwanted.

These annotations are:

   : - The word is an otherwise unmarked abbreviation.  This suffix
       may appear in combination with another suffix.
   & - The word is primarily a non-American usage.
   # - The word is generally held to be a variant or less preferred
       form of another word.
   < - This form of a word is held to be the primary form by fewer
       dictionaries than some other form of the word.
   ^ - This form of the word was selected arbitrarily from a set of
       variants, none of which was clearly preferred.
   = - Roughly, this indicates a "second class" word.  More precisely,
       the word falls into one of the following classes:
       a.  The word is an inflection which was defined in the same
           entry as the base word.
       b.  The word is a derived word (-ly, -ness or -er/or) which
           was not defined in a separate entry.
       c.  The word appeared in a list of undefined words with a
           common prefix, such as un- or re-.
   + - The word is a signature word.

The words in the 2of12 list are not annotated.
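
As an illustration of removing the annotations with a script, the
following Python sketch splits each 6of12 entry into the bare word and
its suffix characters.  It assumes one entry per line and that these
characters never occur at the end of an unannotated word; the names
ANNOTATIONS, split_entry and strip_annotations are hypothetical.

```python
# Suffix characters and a short gloss of their meanings, as listed above.
ANNOTATIONS = {
    ":": "abbreviation",
    "&": "primarily non-American usage",
    "#": "variant or less preferred form",
    "<": "primary form in fewer dictionaries than another form",
    "^": "arbitrarily selected variant",
    "=": "second-class word",
    "+": "signature word",
}


def split_entry(line):
    """Split one line into (word, set of annotation characters)."""
    entry = line.strip()
    word = entry.rstrip("".join(ANNOTATIONS))
    return word, set(entry[len(word):])


def strip_annotations(lines):
    """Return the bare words, skipping blank lines."""
    return [split_entry(line)[0] for line in lines if line.strip()]


# Illustrative entries only, not copied from the actual list.
words = strip_annotations(["thank you+\n", "\n", "colour&\n", "zebra\n"])
```

Keeping the suffix characters as a set, rather than discarding them,
makes it possible to filter on any one annotation before stripping.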

Some history

It may have occurred to some to wonder about how something like the
n-dicts project came to be (though I assume that anyone who bothers
to download this archive must already have some idea that such a
project could be of interest).

Some years ago, there was a post to the sci.crypt newsgroup, on the
subject of creating PGP passphrases using randomly selected entries
from a supplied list of very short words.  (If this sounds interesting,
see http://world.std.com/~reinhold/diceware.html for an expanded
version of the post.)  The word list, which was extracted from
/usr/dict/words on some UNIX system, seemed to me ill-suited to
its intended purpose.  It included arcane acronyms (bstj, ncr),
misspellings (diety) and words of amazing obscurity (bhoy, kombu).
I decided I could do better (and eventually did).

This caused me to start downloading English word lists, of which there
are many, from the Internet.  I was not impressed by the overall
quality of these lists, and the few which were high-quality were all-
inclusive, burying the everyday words under a mountain of archaisms
and esoterica.

The flaws of the vast majority of these lists are worth recounting:

1.  Failure to proofread.  Many of these lists are littered with
    misspellings and typos, sometimes approaching gibberish.  (I
    presume, for instance, that the bizarre string "nondploe",
    which was found in a purported Scrabble (r) word list, is a typo
    for something more or less legitimate, but I have no idea what.)
    Working on my own lists has helped me understand that 100%
    accuracy is a very demanding goal, seldom actually achieved, but
    I still feel it reasonable to expect no more than 1 or 2 errors
    per 10,000 words.
2.  Acceptance of completely undocumented lazy spellings, such as
    "bullseye" and "courtmartial".
3.  Failure to respect capitalization.
4.  Failure to distinguish abbreviations from other entries.
5.  Treating esoteric computer jargon, and especially UNIX jargon,
    as everyday English.  (Beware any list which includes "emacs",
    "inode" and "lvalue".)
6.  Apparently random word selection.  The various /usr/dict/words
    files are compendia of all the above sins.  Noteworthy is the
    inclusion of a large set of apparently randomly chosen personal
    names (uncapitalized, of course, and missing "wanda", "marge",
    "polly" and "sid").
7.  Inconsistent inflection.  Some lists include all inflections of
    their vocabulary, while others include only singulars and
    infinitives.  Either policy is fine, and has its advantages.  I
    am personally very annoyed when inflected forms appear at random.
    I find this generally happens when a compiler merges several lists
    with different characteristics, with no attempt to reconcile their
    divergent styles.
8.  Omission of everyday words.  I've seen a list that includes
    "bremsstrahlung", yet omits "log" and "beer".  Or that includes
    "saxophone" but not "sax", and "rhinoceros" but not "rhino".  Of
    course, due to my original purpose in seeking out common short
    words, I found this especially annoying.

One result of my frustration with this situation was my working with
Mendel Cooper on ENABLE (for further information, check out
http://personal.riverusers.com/~thegrendel/software.html), which was
close to unique in having an active caretaker, one clearly concerned
with quality, and in being oriented towards American rather than
British English.  (A high-quality list oriented towards British
rather than American English can be downloaded from the URL
ftp://www.simtel.net/pub/simtelnet/win3/homeent/ukacd16.zip.)  But
ENABLE is an all-encompassing list and, even if it had been complete
at the time I started my search for a list of common words, would not
have been what I wanted for that reason.

I finally decided that only starting from scratch with a systematic
approach was likely to get me what I was looking for, and that
dictionaries intended for non-native speakers of English were the
best possible source for words that are in some cases so familiar
that we never think of them.  This has led to the 12dicts lists,
which I hope have managed to avoid the flaws recited above.

(I should acknowledge one form of inconsistency exhibited by the
12dicts lists, which is that sometimes related words are spelled
inconsistently.  For instance, the 2of12 list contains both
"broadminded" and "broad-mindedness".  This generally occurs as a
result of the methodology used to build the lists.  In the case of
"broadminded", only one dictionary listed the unhyphenated form
"broadmindedness", which was therefore excluded.  I felt unequal to
trying to correct these
inconsistencies, some of which are real and not mere artifacts of
12dicts, such as the contrast between "self-conscious" and
"unselfconscious".)

It is possible that in the future the "n" of n-dicts will increase
again, but, in fact, consideration of an additional dictionary now
seems to result in the discovery that its vocabulary matches 12dicts
pretty closely.  At the very least, this phenomenon gives me hope that
the n-dicts lists have at last met their goal, and will now be useful,
or at least interesting, to others.

The 12dicts lists were compiled by Alan Beale.  I explicitly release
them to the public domain, but request acknowledgment of their use.
Feel free to send comments, suggestions, inquiries and/or large sums
of money to me at biljir@pobox.com.  If you find 12dicts useful, I'd
love to hear about it.
