3.2. Grammar checking

The grammar checker per se is generated from three input files that share the same basic syntax, to be described in the sections below. Complicated "meta" scripts convert these (more or less) human-readable files into the Perl scripts which actually find and mark up the grammatical errors.

3.2.1. Common structure of the *.in files

The structure of all three input files is essentially the same. I've included a flex/bison parser in the distribution that can be used for error-checking these files during development (see the poncin target in the Makefile). Also, those who might prefer a formal (BNF-like) grammar can look at the files ponc.in.l and ponc.in.y.

Lines beginning with a # or lines containing only whitespace are ignored. All other lines contain "rules", which are structured as follows:

phrase:action
  

A phrase is a simplified description of the regular expression you want to match in the marked up text stream. The phrase syntax is the same for all three files: one or more words, optionally surrounded by tags, separated by single spaces. A word can either be an explicit regular expression (e.g. [Aa]ch to match upper or lowercase ach) or one of a collection of macros defined in the file macra-xx.meta.pl (e.g. for Irish LENITEDDFST expands to the regular expression [DdFfSsTt]h[^<]*). Complicated regular expressions should be defined as macros in macra-xx.meta.pl; simple expressions such as optional substrings or alternation can be written directly in the phrase. When you do write them directly, use plain (capturing) parentheses rather than "non-capturing parentheses"; the conversion scripts will treat these correctly when generating the final Perl code.
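To make the line format and the macro mechanism concrete, here is a minimal Python sketch (not the actual conversion scripts, which are written in Perl). It assumes that the action never contains a colon, so a rule can be split at its last colon, and that macro names are runs of two or more capital letters. The MACROS table, both function names, and the sample rule are invented for illustration; only the LENITEDDFST expansion is taken from the text above.

```python
import re

# Hypothetical macro table; the real definitions live in macra-xx.meta.pl.
MACROS = {
    "LENITEDDFST": r"[DdFfSsTt]h[^<]*",
}

def parse_rules(lines):
    """Yield (phrase, action) pairs, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#") or not line.strip():
            continue
        # Assumption: the action contains no colon, so split at the last one.
        phrase, _, action = line.rpartition(":")
        yield phrase, action

def expand_macros(phrase):
    """Replace each known macro name in a phrase with its regular expression."""
    return re.sub(r"\b[A-Z]{2,}\b",
                  lambda m: MACROS.get(m.group(0), m.group(0)),
                  phrase)

sample = ["# a comment", "   ", "[Aa]ch LENITEDDFST:SEIMHIU"]
rules = list(parse_rules(sample))
expanded = expand_macros(rules[0][0])
```

Names that are not in the table are left untouched, so single-letter tag names such as B or Z are never mistaken for macros.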

The real power comes from being able to specify part of speech tags as regular expressions; these take one of the following forms:

3.2.2. Multi-word units

The file comhshuite-xx.in is the simplest of the three; each line contains a multiword "set phrase" in the phrase portion of the rule, followed by the part of speech tag that should be assigned to the given phrase as the action portion. For instance, the phrase le haghaidh appears in the Irish version, followed by the single (opening) part of speech tag <S>, indicating that it is to be treated as a preposition. Since this filter is applied before any disambiguation occurs, the phrase portion should consist of the words to be lumped together with no additional markup.

Dealing with idiomatic expressions in this way improves the performance of the part-of-speech tagger (in terms of both speed and accuracy). It also allows us to report an error when a word which is almost always used in a set phrase is mistakenly used in some other context.
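The effect of a set-phrase rule can be sketched in a few lines of Python. This is a simplification under stated assumptions: the input here is plain, untagged text rather than the real marked up stream, and the SET_PHRASES table and the function name are invented for illustration.

```python
import re

# Hypothetical in-memory form of comhshuite-ga.in: phrase -> opening tag.
SET_PHRASES = {"le haghaidh": "<S>"}

def tag_set_phrases(text, phrases=SET_PHRASES):
    """Wrap each set phrase in the tag given by its rule."""
    for phrase, open_tag in phrases.items():
        name = open_tag[1:-1].split()[0]          # "<S>" -> "S"
        # The phrase is used as a regular expression, bounded by non-word characters.
        pattern = re.compile(r"(?<!\w)" + phrase + r"(?!\w)")
        text = pattern.sub(
            lambda m, o=open_tag, n=name: f"{o}{m.group(0)}</{n}>", text)
    return text

tagged = tag_set_phrases("le haghaidh an lae")
```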

3.2.3. Disambiguation

The file aonchiall-xx.in contains rules for disambiguating parts of speech; for instance, the word an in Irish can either be the definite article or an interrogative particle. You will find a sequence of rules in aonchiall-ga.in which indicate, for instance, that if an is followed by a verb, preposition, or pronoun, we expect it to be an interrogative (and in most other cases it is the article). This kind of disambiguation is obviously a necessary preliminary step before one can try to apply grammatical rules depending on part of speech.

3.2.3.1. aonchiall-xx.in syntax

More specifically, the phrase portion of a rule in aonchiall-xx.in is required to contain a single word marked up with <B></B>. Naturally, this is the word to disambiguate and the phrase is the context in which the disambiguation is to occur. The full syntax used by An Gramadóir for an ambiguous word looks something like this:

<B><Z><J/><R/><V/></Z>direct</B>
	    

with the list of all possible part of speech tags given within the <Z> markup (note that trailing slashes are required on these tags to ensure valid XML). If you don't care about matching the part of speech tags for an ambiguous word, it is acceptable to leave out the <Z> markup entirely:

<B>direct</B>
	    

will match any ambiguous instance of the word "direct". It is also common to define macros to match certain regular expressions in the part of speech tags; for example, one could define a macro ANYNOUN to match any sequence of tags containing <N[^>]*/>; then the following will match all ambiguous words that can possibly be resolved as nouns:

<B><Z>ANYNOUN</Z>ANYTHING</B>
	    

As in comhshuite-xx.in, the "action" portion consists of a single part of speech tag, representing the disambiguated part of speech when the given phrase is matched.

The rules specified in aonchiall-xx.in are applied twice, each time in the order in which they appear. The second pass is quite useful for Irish, since it allows a rule to fire in cases where the parts of speech in its context were only disambiguated during the first pass.
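The following Python sketch shows why the second pass matters. It is not the generated Perl code: the two rules are hand-written regular expressions standing in for converted phrases, the English tags J, N and V and the sample text are chosen only for illustration, and the helper names are hypothetical.

```python
import re

def ambiguous(word):
    """Regex for an ambiguous word: <B><Z>...candidate tags...</Z>word</B>."""
    return r"<B><Z>(?:<[^>]+/>)+</Z>(" + word + r")</B>"

# (pattern, resolved tag) pairs, applied in the order given.
RULES = [
    # "direct" before an already-resolved noun is an adjective.
    (re.compile(ambiguous("direct") + r"(?= <N>)"), "J"),
    # "flights" before a verb is a noun.
    (re.compile(ambiguous("flights") + r"(?= <V>)"), "N"),
]

def disambiguate(text, rules=RULES, passes=2):
    for _ in range(passes):
        for pattern, tag in rules:
            text = pattern.sub(
                lambda m, t=tag: f"<{t}>{m.group(1)}</{t}>", text)
    return text

text = ("<B><Z><J/><R/><V/></Z>direct</B> "
        "<B><Z><N/><V/></Z>flights</B> <V>leave</V>")
one_pass = disambiguate(text, passes=1)
two_passes = disambiguate(text)
```

In the first pass the rule for "direct" cannot fire because its neighbour is still ambiguous; once "flights" has been resolved as a noun, the same rule succeeds in the second pass.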

The latest versions of the engine admit an extension that allows certain part-of-speech tags to be excluded in a given context. This is done by prefixing an exclamation point (!) to the action portion of the rule. So, for example, this rule (for Irish) indicates that eclipsed words should not be tagged as past tense verbs:

<B><Z>ANYPAST</Z>ECLIPSED</B>:!<V p="y" t="caite">
	    
Note also that the action portion can be given as a regular expression, and all matching tags will be eliminated from consideration:

[Dd]o <B><Z>ANYVERB</Z>ANYTHING</B>:!<V[^>]+>
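As a rough Python sketch of what an exclusion does to the list of candidate tags: the function name is hypothetical, the candidates are represented as a plain list of opening tags (an assumption; the engine works on the <Z> markup directly), and the attributes on the noun tag are invented.

```python
import re

def prune_candidates(candidates, action):
    """Drop every candidate tag matched by a '!'-prefixed action."""
    if not action.startswith("!"):
        raise ValueError("not an exclusion action: " + action)
    excluded = re.compile(action[1:])   # the rest of the action is a regex
    return [tag for tag in candidates if not excluded.fullmatch(tag)]

candidates = ['<N pl="n">', '<V p="y" t="caite">', '<V p="y" t="láith">']
no_verbs = prune_candidates(candidates, "!<V[^>]+>")
no_past = prune_candidates(candidates, '!<V p="y" t="caite">')
```

What the engine does once only one candidate is left is not covered by this sketch.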
	    

3.2.3.2. Unigram tagging

If an ambiguity is not resolved after two passes through aonchiall-xx.in, then the default behavior is simply to assign the candidate tag with the highest overall frequency in your language. The file unigram-xx.txt consists of a list of the legal part-of-speech tags for your language, sorted by frequency from highest to lowest. Sometimes it helps in disambiguation to "lump together" several tags (e.g. by stripping attributes that have no use in grammar checking). This can be achieved by placing appropriate substitutions in unigram-xx.pre. After you have a first version up and running, you can create or update unigram-xx.txt with this command:

$ cat big.txt | gramdev-xx.pl --minic > unigram-xx.txt
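The fallback itself amounts to picking the candidate that appears earliest in the frequency list. A minimal Python sketch, assuming the contents of unigram-xx.txt have already been read into a list (the function name and the tag order shown are made up):

```python
def most_frequent_tag(candidates, unigram_order):
    """Return the candidate that appears earliest in the frequency-sorted list."""
    rank = {tag: i for i, tag in enumerate(unigram_order)}
    # Tags missing from the list sort after every listed tag.
    return min(candidates, key=lambda tag: rank.get(tag, len(rank)))

unigram_order = ["<N>", "<V>", "<J>", "<R>"]   # hypothetical frequencies
choice = most_frequent_tag(["<R>", "<V>", "<J>"], unigram_order)
```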
	  

3.2.3.3. Unsupervised training algorithms

In fact, it is even possible for An Gramadóir to apply statistical methods to help find candidate rules for aonchiall-xx.in. I've implemented the algorithm from Eric Brill's paper "Unsupervised learning of disambiguation rules for part of speech tagging" so that the output is suitable for use in aonchiall-xx.in (and so that the highest-scoring rules come first). Run it as follows:

$ cat big.txt | gramdev-xx.pl --brill > rules.txt
	  

3.2.4. Rules and exceptions

The file rialacha-xx.in contains the grammatical rules proper, and lists any exceptions to these rules.

3.2.4.1. rialacha-xx.in syntax

The phrase portion of a rule in rialacha-xx.in is converted to a regular expression which matches a grammatical error. The action portion consists simply of a macro which expands to the error message you want to be displayed when the rule applies. These macros are defined in messages.txt. Perhaps the most common action for Irish is SEIMHIU, which expands to "Séimhiú ar iarraidh" ("Missing lenition"). Certain macros can also take a parameter inside curly braces: the action BACHOIR{ina} expands to "Ba chóir duit /ina/ a úsáid anseo" ("You ought to use /ina/ here") with the parameter inserted between the slashes.
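Here is a small Python sketch of how an action might be turned into its message. The MESSAGES table stands in for messages.txt, whose real format is not described here; the "{0}" placeholder and the function name are assumptions made for the example.

```python
import re

# Hypothetical stand-in for messages.txt.
MESSAGES = {
    "SEIMHIU": "Séimhiú ar iarraidh",
    "BACHOIR": "Ba chóir duit /{0}/ a úsáid anseo",   # "{0}" is an assumed placeholder
}

# A macro name, optionally followed by one parameter in curly braces.
ACTION = re.compile(r"([A-Z]+)(?:\{([^}]*)\})?")

def expand_action(action):
    """Expand an action such as SEIMHIU or BACHOIR{ina} to its message."""
    m = ACTION.fullmatch(action)
    if m is None or m.group(1) not in MESSAGES:
        raise KeyError(action)
    template = MESSAGES[m.group(1)]
    if m.group(2) is None:
        return template
    return template.format(m.group(2))

plain = expand_action("SEIMHIU")
with_param = expand_action("BACHOIR{ina}")
```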

Two very important rules are included in the default language pack:

<X>ANYTHING</X>:UNKNOWN
<F>ANYTHING</F>:UNCOMMON
	  

Words not found in the lexicon are marked up with the tag <X>, and so the first rule reports such words as "unknown". Words that are found in the lexicon, but appear there with part of speech code 127 (see above), are given the special tag <F> and so the second rule reports these as "uncommon".

3.2.4.2. Exceptions

In earlier versions, the exceptions were kept in a separate input file called eisceacht-xx.in. We now find it more convenient to store the exceptions together with the rules to which they apply in the file rialacha-xx.in. Following each rule, one has the option of including a block of patterns representing exceptions to the rule, that is, phrases that match the rule but are actually grammatical and should not be reported as errors. For instance, the word dhá ("two") generally causes lenition, but not when it is preceded by the possessive adjective ár. To implement this exception, it is placed in rialacha-ga.in immediately following the general rule, and with the action portion of the rule set to OK:

<A>[Dd]há</A> UNLENITED:SEIMHIU
<D>[Áá]r</D> <E><A>[Dd]há</A> UNLENITED</E>:OK
	  

When the exception requires more context than the rule itself, as in this example, the words corresponding to the rule must be enclosed within <E> tags to avoid potential ambiguities. You can specify as many exceptions as you like to a single rule, but note that exceptions only apply to the rule that they follow.
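The interplay between a rule and its exceptions can be sketched in Python as follows. This is an illustration only: the expansion given for UNLENITED is a guess, the sample text is written with ordinary closing tags and simplified to bare <N> noun tags, and the capturing group in the exception plays the role of the <E> markup.

```python
import re

# Assumed expansion: a noun starting with a lenitable consonant and no "h".
UNLENITED = r"<N>[BbCcDdFfGgMmPpSsTt][^h<][^<]*</N>"

RULE = re.compile(r"<A>[Dd]há</A> " + UNLENITED)
# Group 1 marks the part corresponding to the rule itself.
EXCEPTIONS = [
    re.compile(r"<D>[Áá]r</D> (<A>[Dd]há</A> " + UNLENITED + r")"),
]

def find_errors(text):
    """Return rule matches that are not covered by any exception."""
    excused = {m.span(1) for exc in EXCEPTIONS for m in exc.finditer(text)}
    return [m.group(0) for m in RULE.finditer(text)
            if m.span() not in excused]

flagged = find_errors("<A>dhá</A> <N>teach</N>")
excepted = find_errors("<D>ár</D> <A>dhá</A> <N>dteach</N>")
correct = find_errors("<A>dhá</A> <N>theach</N>")
```

Comparing spans rather than strings means an exception only excuses the particular occurrence it surrounds, not every occurrence of the same words in the sentence.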

3.2.4.3. Testing

It is a good idea to include one or more sample sentences for each rule in rialacha-xx.in. These are given on lines beginning with #., which for Irish I usually put on the line directly preceding the rule that they illustrate. When you build the Perl module, these sentences are extracted into the plain text file triail in the language pack directory, and are also used to generate a test script for the Perl module. The "expected output" when the grammar checker is applied to triail is stored in triail.xml. The command:

$ make test
            

will rebuild the Perl module and test scripts (if necessary), and then compare the results of checking triail with the contents of triail.xml, complaining with great bitterness when they differ. When new sample sentences are added, you'll need to update triail.xml; use

$ make triail.xml-update
            

to do this (but be sure before you update that you haven't accidentally broken any other rules).
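For reference, pulling the sample sentences out of a rules file is straightforward. The Python sketch below is not the actual build step (which is part of the Perl tooling); the function name and the sample sentence are invented.

```python
def extract_samples(lines):
    """Collect the sample sentences given on lines beginning with '#.'."""
    return [line[2:].strip() for line in lines if line.startswith("#.")]

rules_file = [
    "# an ordinary comment",
    "#. Tá dhá teach agam.",
    "<A>[Dd]há</A> UNLENITED:SEIMHIU",
]
samples = extract_samples(rules_file)
```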