Chapter 3. A tour of the language pack

Of course, until you actually add some real grammatical rules to the language pack input files, the Perl module will function as a simple spell checker only. In this chapter I'll describe the syntax of the input files and some tricks for building them quickly.

In case you're just curious about a single file (what it does or how to create it), here are brief descriptions of each of the files, with links to the more detailed descriptions later in this chapter.

3.1. The lexicon

If you'd like your grammar checker to have at least the functionality of a spell checker, you'll need to assemble a large word list. (It is worth mentioning that, for some languages, it is possible to implement a tool that performs interesting checks without necessarily recognizing each word, e.g. Igbo "vowel harmony" rules.) Most languages will want a tagged list, with part-of-speech information associated with each word.

3.1.1. Parts of speech

Part-of-speech markup is added to input texts as XML tags; you'll need to choose these tags first. If you haven't provided me with a tagged word list (e.g. if you're just starting with a word list from a spell checker), the default language pack will simply tag all words with <U> ("unknown" part of speech). If you just want a fancy spell checker, this is sufficient. Otherwise you can place your tags (e.g. <N>, <V>, <N plural="y">, etc.) in pos-xx.txt and assign to each a numerical code, which is used internally. There are a couple of mild restrictions:

  • The numerical codes must be integers between 1 and 65535, excluding 10 (used as a file delimiter). [1]

  • Code 127 has a special meaning across all languages: it is used to mark up words which are correct but are very rare or might hide common misspellings. A good example in Irish is ata, which is a past participle meaning "swollen" but does not appear in my corpus of over 20 million words except as a misspelling of atá (a form of the verb "to be"). Words like yor and cant are well-known examples in English.

  • The XML tags must be single ASCII capital letters, excluding B, E, F, X, Y, and Z (which are all tags added to the XML stream by An Gramadóir while checking grammar; see the FAQ for explanations of these). This leaves 20 possible tags, which should be more than enough given that you can refine the semantics of your tags by adding XML attributes where appropriate.
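
The restrictions above are easy to check mechanically. Here is a minimal Python sketch; the function name check_pos_entry and the idea of passing each code and tag as a pair are assumptions made for illustration, since the exact layout of pos-xx.txt is not shown here. It treats every tag name as a single letter, which is what the count of 20 possible tags implies.

```python
import re

RESERVED_TAGS = set("BEFXYZ")  # tags An Gramadóir adds to the XML stream itself
FILE_DELIMITER = 10            # code reserved as a file delimiter


def check_pos_entry(code, tag):
    """Return a list of problems with one (code, tag) pair; empty if none.

    Hypothetical helper: 'code' is an int, 'tag' is a string like '<N plural="y">'.
    """
    problems = []
    if not 1 <= code <= 65535:
        problems.append(f"code {code} is outside 1..65535")
    elif code == FILE_DELIMITER:
        problems.append("code 10 is reserved as a file delimiter")

    # Split the tag into its element name and optional attributes.
    match = re.fullmatch(r"<([^\s<>]+)(\s[^<>]*)?>", tag)
    if match is None:
        problems.append(f"{tag!r} does not look like an XML start tag")
    else:
        name = match.group(1)
        if not re.fullmatch(r"[A-Z]", name):
            problems.append(f"tag name {name!r} is not a single ASCII capital letter")
        elif name in RESERVED_TAGS:
            problems.append(f"tag name {name!r} is reserved by An Gramadóir")
    return problems
```

For instance, check_pos_entry(32, '<N plural="y">') returns an empty list, while check_pos_entry(10, "<V>") and check_pos_entry(5, "<X>") each report one problem. The sketch does not treat code 127 specially; it is a legal code, just one with a fixed meaning.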

3.1.2. Main word list

The files lexicon-xx.bs and lexicon-xx.txt contain the main database of recognized words. The first of these is the compressed version that comes in the language pack tarball; the second is the uncompressed version that you should use for editing, adding words, part-of-speech tags, etc. If you don't see lexicon-xx.txt you can recreate it using:

$ make lexicon-xx.txt
	  

Conversely, if you ever do a make dist, the compressed version will be updated correctly, taking into account any additions or changes made to lexicon-xx.txt. The file lexicon-xx.txt contains one word per line followed by whitespace and one of the numerical grammatical codes from pos-xx.txt; e.g.:

Example 3-1. An excerpt from a fictional lexicon-en.txt

dipper 31
dire 36
direct 33
direct 36
direct 37
directed 36
direction 31
directional 36
directions 32
	    

Note that ambiguous words should be listed multiple times, once for each possible part of speech (we are thinking in the example above of the word direct as either a verb, adjective, or adverb). The word list need not be alphabetized, but this is probably a good idea for maintenance purposes! The only requirement is that all of the codes for a single ambiguous word must appear contiguously.
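
Since the contiguity requirement is easy to break when adding words by hand, a quick check before running make dist may be worthwhile. The following Python sketch is one way to do it; parse_lexicon and noncontiguous_words are hypothetical helper names, and the code assumes every non-blank line holds exactly a word, whitespace, and an integer code.

```python
def parse_lexicon(lines):
    """Turn lines like 'direct 33' into (word, code) pairs, skipping blank lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: word on the left, code on the right.
        word, code = line.rsplit(None, 1)
        entries.append((word, int(code)))
    return entries


def noncontiguous_words(entries):
    """Return words whose codes are interrupted by entries for another word."""
    finished = set()   # words whose run of entries has already ended
    offenders = []
    previous = None
    for word, _code in entries:
        if word != previous:
            if word in finished and word not in offenders:
                offenders.append(word)
            if previous is not None:
                finished.add(previous)
            previous = word
    return offenders


if __name__ == "__main__":
    with open("lexicon-xx.txt", encoding="utf-8") as handle:
        for word in noncontiguous_words(parse_lexicon(handle)):
            print(f"codes for {word!r} are not contiguous")
```

The excerpt in Example 3-1 passes this check; moving the line "dire 36" between two of the "direct" lines would cause "direct" to be reported. The UTF-8 encoding in the usage stanza is also an assumption.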

As noted earlier, in the default language pack, all grammatical codes are initially set to "1" (<U>) as placeholders, until a proper tagged word list can be constructed.

3.1.3. Replacements

The file eile-xx.bs is a "replacement" file which contains on each line a non-standard or dialect spelling of a legitimate word followed by a suggested replacement. The file earraidi-xx.bs is similar, but should be used for true misspellings. The only difference in functionality between the two files is how the replacements are reported to the end-user. I built the file eile-en.bs in the English language pack by collating the specifically American and British word lists that are distributed with ispell. The Irish file eile-ga.bs is a by-product of my work on dialect support for Irish language spell checkers. The replacement "word" is allowed to contain spaces, e.g.:

spellchecker spell checker
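
Because the replacement may contain spaces, a line has to be split at the first run of whitespace only. A minimal Python sketch, assuming plain-text lines of the form shown above and assuming the non-standard spelling itself never contains a space (parse_replacement is a hypothetical name):

```python
def parse_replacement(line):
    """Split one replacement line into (non-standard spelling, suggestion)."""
    # maxsplit=1 keeps any later spaces inside the suggested replacement.
    word, replacement = line.strip().split(None, 1)
    return word, replacement


def load_replacements(lines):
    """Build a dict mapping each non-standard spelling to its suggestion."""
    return dict(parse_replacement(line) for line in lines if line.strip())
```

With this, parse_replacement("spellchecker spell checker") gives the pair ("spellchecker", "spell checker").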
	  

3.1.4. Morphology

The file morph-xx.txt encodes morphological rules and other spelling changes for your language; it is structured as a sequence of substitutions, one per line, using Perl regular expression syntax, with fields separated by whitespace. When an unknown word is encountered, these replacements are applied recursively (depth first, to a maximum depth of 6) until a match is found.

So, for example, this file is where you can specify customized rules for decapitalization (the default language pack provides standard rules for this, while for Irish it is substantially more complicated). You can also use it to strip common prefixes and suffixes in much the same way as the "affix file" is used for ispell or for aspell (but, unlike those programs, allowing several levels of recursion). For Irish, morph-ga.txt is also used to encode many of the spelling reforms that were introduced as part of the "Official Standard" in the 1940s.

The syntax is simpler than it first appears. Each line represents a single rule, and contains four whitespace-separated fields. The first field contains the pattern to be replaced, the second field is the replacement (backreferences allowed, which moves us beyond the usual realm of finite state morphology), and the third field is a code indicating the "violence level" the change represents. Level -1 means that no message should be reported if the rule is applied and the modified word is found (as in the default rules which turn uppercase words into lowercase). Level 0 means that a message is given which just alerts the user that the surface form was not found in the database but that the modified version was. Level 1 indicates that the rule applies only to non-standard or variant forms and will be reported as such (e.g. for American English you could define a level 1 rule that changes ^anaesth to anesth, or globally changes centre to center, etc.). Level 2 indicates that the rule applies only when the surface form is truly incorrect in some way.
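
To make the depth-first search concrete, here is a rough Python sketch of how such rules could be applied to an unknown word. It is an illustration under stated assumptions, not the code the Perl module actually runs: parse_rule and lookup are hypothetical names, only the first three fields of a rule are used (anything after them is ignored), each rule is applied to every match in the word, and patterns and replacements follow Python's re syntax, so a backreference would be written \1 rather than $1.

```python
import re

MAX_DEPTH = 6  # maximum number of chained rule applications


def parse_rule(line):
    """Parse 'pattern replacement level'; any further fields are ignored."""
    fields = line.split()
    return re.compile(fields[0]), fields[1], int(fields[2])


def lookup(word, lexicon, rules, depth=0):
    """Depth-first search for a known word.

    Returns the list of levels of the rules applied (empty if 'word' is
    already in the lexicon), or None if nothing is found within MAX_DEPTH.
    """
    if word in lexicon:
        return []
    if depth >= MAX_DEPTH:
        return None
    for pattern, replacement, level in rules:
        candidate, count = pattern.subn(replacement, word)
        if count == 0 or candidate == word:
            continue  # rule does not apply, or changes nothing
        result = lookup(candidate, lexicon, rules, depth + 1)
        if result is not None:
            return [level] + result
    return None


# The two level 1 rules mentioned in the text, against a toy lexicon.
RULES = [parse_rule("^anaesth anesth 1"), parse_rule("centre center 1")]
LEXICON = {"anesthetic", "center"}
```

Here lookup("anaesthetic", LEXICON, RULES) returns [1], lookup("center", LEXICON, RULES) returns [], and an unrecognizable word returns None. How the levels of several chained rules are combined into a single message is not covered by the sketch.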

False positives can be avoided by placing words that are not morphologically productive in the file nocombo-xx.txt.

Notes

[1]

This is a white lie; the legal numerical codes are, in actuality, precisely those positive integers corresponding to Unicode code points. So this means there are more than a million possible codes (but it also means that you need to avoid the so-called surrogates, 55296 to 57343). Hopefully no one will ever need to know this.