Developing patterns for Irish

The Wikipedia page for TeX gives a brief description of the TeX hyphenation algorithm; for more detail, see Frank Liang's PhD thesis, where the algorithm was first described.

The Irish patterns consist of rules like the following:

al3i
a6ll
al2ann
geal5a

Roughly speaking, even numbers prevent a word from being broken at the given point, and odd numbers permit a break there. When two rules apply to the same point, the larger number wins. The first rule al3i permits a hyphen after the “l” in words like béaliata or galinneall. The second rule strongly prevents hyphenation before the “ll” in words like timpeallacht or fealltóir. The third rule weakly prevents hyphenation after the “l” in words like bialann or dialann. Note that it also applies to the verb gealann, where it would prevent a desirable hyphenation, but there it is overridden by the fourth rule, which permits a hyphen at the same spot (since 5 is greater than 2).
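The interaction of the four rules above can be checked with a minimal Python sketch of the pattern-matching step. This is an illustration only, not the code used to build or apply the Irish patterns: the function names are hypothetical, only the four sample rules are loaded, and TeX's minimum fragment lengths (\lefthyphenmin and \righthyphenmin) are ignored.

```python
import re

def parse_pattern(pattern):
    """Split 'al2ann' into its letters 'alann' and one weight per gap."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)   # weight sits in the gap before letters[pos]
        else:
            pos += 1
    return letters, weights

def hyphenate(word, patterns):
    """Return the word with '-' at every gap whose winning weight is odd."""
    padded = "." + word.lower() + "."       # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)        # scores[i] = gap before padded[i]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                # the largest weight seen at a gap wins
                scores[start + offset] = max(scores[start + offset], weight)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for j in range(1, len(word)):           # the gap before word[j] is scores[j + 1]
        if scores[j + 1] % 2 == 1:
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)

SAMPLE_RULES = ["al3i", "a6ll", "al2ann", "geal5a"]

for w in ["gealann", "bialann", "timpeallacht", "galinneall"]:
    print(w, "->", hyphenate(w, SAMPLE_RULES))
```

With only these four rules loaded, gealann gets a break after the “l” (5 beats 2), while bialann and timpeallacht get none. The full rule set would of course add further break points.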

One basic heuristic involves lenition: a word should never be broken between a lenitable consonant and the “h” that marks the lenition orthographically. Thus you will find patterns like c2h and d2h in the rule set. Conversely, if an “h” appears after a vowel or a non-lenitable consonant, the point before it is usually a good candidate for a hyphen, as in Bói-héam-ach or Faran-haít. This results in patterns of the form i1h and n5h6a.
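The lenition heuristic is easy to state as a check on an already hyphenated word. The sketch below is an assumption-laden illustration, not part of the pattern set: the helper name splits_lenition is hypothetical, and it simply treats b, c, d, f, g, m, p, s and t as the lenitable consonants.

```python
import re

# Consonants that can be lenited in Irish orthography (written with a following "h").
LENITABLE = "bcdfgmpst"

# A hyphen directly between a lenitable consonant and "h" splits the digraph.
_SPLIT_DIGRAPH = re.compile(rf"[{LENITABLE}]-h", re.IGNORECASE)

def splits_lenition(hyphenated):
    """Return True if a hyphen separates a lenitable consonant from its 'h'."""
    return _SPLIT_DIGRAPH.search(hyphenated) is not None

for candidate in ["comh-alta", "com-halta", "Bói-héam-ach", "Faran-haít"]:
    print(candidate, "->", "bad" if splits_lenition(candidate) else "ok")
```

A break before “h” after a vowel or after “n” passes the check, since neither can carry lenition; a break such as com-halta is flagged.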

Another basic heuristic is, for syncopated words, to include a hyphen at the point of syncopation; e.g. ciog-al and ciog-lach.


Results

The resulting hyphenation patterns are morphological rather than phonological (my personal preference). As a consequence, they do not always agree with the hyphenations I have found in actual printed texts. For instance:

These patterns    Corpus
--------------    -----------
Ceilt-each        Ceil-teach
siosc-adh         sios-cadh
craic-eann        crai-ceann
ceann-aithe       cean-naithe
tuairt-eáil       tuair-teáil
comh-alta         com-halta

The last example is of course an abomination of the worst kind.


Known bugs or ambiguities

The word “record” is a well-known example of an English word whose proper hyphenation depends on context (verb re-cord vs. noun rec-ord). Strict adherence to morphological hyphenation in Irish leads to a number of amusing (and highly improbable) ambiguities, many arising from the not particularly distinctive form of the imperfect autonomous:

In the rules file, these words are placed inside a \hyphenation{} statement so that TeX will not apply the usual rule set to them.

Note that there is also a potential difficulty with words like bainte, which can be viewed morphologically as bain+te (i.e. past participle) or as baint+e (genitive of a second declension noun). The same holds if the noun form admits a plural, e.g. bhaint+í could also be the imperfect autonomous bhain+tí. The current set of patterns is designed to allow the past participle hyphenation in most cases. Here are the other words for which this is relevant: athoscailte, bainte, ceilte, cigilte, coigilte, cuimilte, deighilte, déroinnte, diomailte, dúbailte, easmailte, eitilte, fóinte, foroinnte, fuascailte, meilte, múscailte, oscailte, roinnte, satailte, streachailte, tochailte, tomhailte, tríroinnte, tuirlingte. Other past participles have the same ambiguity “accidentally”: ciste (cist=“a cyst”), coirte (coirt=“tree bark”), deilte (deilt=“delta”), feilte (feilt=“felt”). The noun cruachta is a third declension example.

Finally, there are some true bugs. They are extremely rare: as of version 1.0 the patterns do not produce any hyphen points which are not in the database and miss just 10 out of 314,639 hyphen points. This is not to say that you won't discover any bad hyphenations, but that they are the fault of the underlying database and not of the algorithms used to produce the patterns.
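To make the two figures concrete, here is a rough sketch of how pattern output could be scored against database entries, counting missed and bad hyphen points separately. The function names are hypothetical and this is not the tool that produced the numbers above; it only shows what “missed” and “not in the database” mean, and what 10 out of 314,639 amounts to as a rate.

```python
def break_points(hyphenated):
    """Return the set of letter offsets at which a hyphenated word breaks."""
    points, pos = set(), 0
    for ch in hyphenated:
        if ch == "-":
            points.add(pos)
        else:
            pos += 1
    return points

def compare(expected, produced):
    """Count good, missed and bad hyphen points for a single word."""
    want, got = break_points(expected), break_points(produced)
    return {
        "good": len(want & got),     # in the database and found by the patterns
        "missed": len(want - got),   # in the database but not found
        "bad": len(got - want),      # produced but not in the database
    }

print(compare("ciog-lach", "cioglach"))          # one missed point, no bad ones
print(compare("ceann-aithe", "cean-naithe"))     # one missed and one bad point

# The figures quoted for version 1.0: 10 missed out of 314,639 hyphen points.
missed_percent = 10 / 314639 * 100
print(f"missed: {missed_percent:.4f}% of hyphen points")
```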