This is the old “About” page for the now-defunct aimsigh.com search engine that
I created in 2005. Even though hardly anyone used the site, it had some
nice features, and I'm posting this now since it contains a pretty good summary
of the main issues arising in Irish language Information Retrieval.
...[this page] is directed primarily at my colleagues working on natural language processing and minority languages, particularly at those who might be interested in creating similar “linguistically sophisticated” search engines. Because of this, I won't assume any prior knowledge of Irish linguistics in what follows.
There were two primary motivations for creating this site. The first was that existing search engines like Google and Yahoo!, as powerful as they can be for English search, are unsuitable for Irish in various ways that will be discussed in detail below. The second was a desire on my part to create a single tool that harnesses most of my previous work on language technology for Irish. These earlier projects include the first Irish spellchecker (GaelSpell), part-of-speech tagger, morphological analyzer, grammar checker (An Gramadóir), electronic thesaurus, and web-crawled corpora (both monolingual and bilingual).
Below I've listed some of the features of aimsigh.com that are not available with standard general-purpose search engines (and because of the marginalized position of Irish are unlikely ever to be so — more on this point below).
1. Spelling standardization
Irish underwent a major spelling reform in the 1940s and 1950s, introducing the so-called Caighdeán Oifigiúil (Official Standard). Compare the same sentence in pre-standard spelling (first) and in the standard (second):
An t-árd-cheannas ar na Fórsaíbh Cosanta is le dligheadh a riaghlóchar an modh ar a n-oibreochar é.
An t-ardcheannas ar na Fórsaí Cosanta is le dlí a rialófar an modh ar a n-oibreofar é.
Most writing in Irish today conforms to the standard, though certainly not 100%: many writers still prefer to use spellings and grammatical constructs that reflect more accurately their own dialect of Irish. In addition there are quite a few historical and legal documents on the web that use pre-standard orthography, most of them produced and published by the Irish government (one major source is the site www.achtanna.ie, which contains the full text of all Acts enacted by the Irish Parliament since 1922). So independent of the side one might take in the debate over the merits of the Caighdeán, these pre-standard and dialect documents are “out there”, and we are faced with the inescapable engineering challenge of making them easily available through a search interface.
To achieve this, the aimsigh.com engine employs a sophisticated “Irish standardizer”, which amounts to a finite state transducer encoding the morphological rules of non-standard Irish together with mappings to standardized forms. These rules are augmented with a large database of non-standard/standard word pairs that was extracted in part from a parallel corpus of English and Irish texts (read more about this here: Applications of parallel corpora to the development of monolingual language technologies). The end result is that if a user selects the box Litriú neamhchaighdeánach (Non-standard spelling) on the main aimsigh.com page and enters a word like Gaeilge (Irish language), any documents containing either Gaeilge or one of the non-standard spellings Gaolainn, Gaedhilge, Gaedhilg, Gaedhealg, Gaedhealaing, Gaeilic, Gaoidhealg, Gaodhalainn, Gaelainn, Gaeluinn, etc. will be retrieved...
A nice side-effect of this feature is that the standardization process also corrects common spelling errors, so if you can't remember how to spell ionannas and you search for ionnanas, you will still retrieve all documents containing the correct spelling. Conversely, a search for Údarás (“Authority”, spelled correctly) will turn up documents containing the misspellings Udarás or Údaras, which is probably the desired behavior, since such misspellings are remarkably common, even in presumably-edited texts.
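The word-pair half of this idea can be illustrated with a minimal sketch. It assumes a plain lookup table in place of the finite state transducer, and the table entries are only the examples quoted above; the names STANDARD_FORM, standardize, build_index and search are hypothetical and are not taken from the aimsigh.com code.

```python
import unicodedata

# Toy table of non-standard -> standard spellings, using only examples from
# the text. A real standardizer would pair a far larger table with
# morphological rules (the text describes a finite state transducer).
STANDARD_FORM = {
    "gaolainn": "gaeilge",
    "gaedhilge": "gaeilge",
    "gaelainn": "gaeilge",
    "gaeluinn": "gaeilge",
    "ionnanas": "ionannas",
    "udarás": "údarás",
    "údaras": "údarás",
}


def standardize(word: str) -> str:
    """Return the standard spelling of a word; unknown words pass through."""
    key = unicodedata.normalize("NFC", word).lower()
    return STANDARD_FORM.get(key, key)


def build_index(docs: dict) -> dict:
    """Build an inverted index keyed on standardized word forms."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(standardize(token), set()).add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Standardize the query the same way as the indexed tokens."""
    return index.get(standardize(query), set())


docs = {
    "a": "Teagasc na Gaedhilge",
    "b": "Foclóir Gaeilge",
    "c": "Udarás na Gaeltachta",
    "d": "ionannas cultúrtha",
}
index = build_index(docs)
```

Because documents and queries pass through the same function, search(index, "Gaeilge") finds both documents "a" and "b", and the misspelled query "ionnanas" still finds document "d".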
2. Initial mutations
The beginning of a word in Irish can be written in different ways depending on the grammatical context. For example, bean (woman) becomes an bhean after the definite article an and ár mbean after the possessive pronoun ár (our). Several other possibilities occur when a word begins with a vowel: athair (father) can become t-athair, n-athair, d'athair, etc. In most cases, the presence or absence of one of these mutations has no real effect on the semantics of the word in question, somewhat like the presence or absence of an initial capital on a (non-proper) noun in English. In other words, someone searching for information on lexicography (foclóireacht) would surely be just as happy to retrieve documents containing fhoclóireacht or bhfoclóireacht.
This behavior can be achieved by selecting the button Focail chlaochlaithe (Mutated words) on the main aimsigh.com page (I generally select this button for all of my own searches). For example, to find documents concerning the country of Sudan, one might search for the term Súdáin, but since this word generally follows the definite article in Irish, and is therefore prefixed with a “t”, it is much more effective to search with aimsigh.com...
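One way to picture this kind of search is as query expansion: the search term is replaced by the set of its possible mutated forms. The sketch below is a deliberately simplified set of rules that I am assuming for illustration (it ignores, for example, consonant clusters that resist lenition), and the function name mutated_forms is hypothetical.

```python
# Simplified mutation rules (an assumption for illustration, not a full grammar).
LENITABLE = set("bcdfgmpst")  # lenition inserts "h" after the initial consonant
ECLIPSIS = {"b": "mb", "c": "gc", "d": "nd", "f": "bhf",
            "g": "ng", "p": "bp", "t": "dt"}
VOWELS = set("aeiouáéíóú")


def mutated_forms(word: str) -> set:
    """Return the lowercase word together with its candidate mutated forms."""
    w = word.lower()
    if not w:
        return set()
    forms = {w}
    first, rest = w[0], w[1:]
    if first in VOWELS:
        # Prefixes that can appear before an initial vowel.
        forms |= {"t-" + w, "n-" + w, "h" + w, "d'" + w}
    else:
        if first in LENITABLE:
            forms.add(first + "h" + rest)
        if first in ECLIPSIS:
            forms.add(ECLIPSIS[first] + rest)
        if first == "s":
            # "t" prefixed to "s" after the definite article, as in tSúdáin.
            forms.add("t" + w)
    return forms
```

With these rules, mutated_forms("foclóireacht") yields foclóireacht, fhoclóireacht and bhfoclóireacht, and mutated_forms("Súdáin") includes tsúdáin, so a search over the expanded set would catch the form that follows the article.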
3. Inflectional morphology
Irish morphology is much more complicated than English morphology, and because of this, it is desirable to perform “stemmed” searches in many instances. For example, if one is interested in Irish language schools it is convenient to be able to search for the single term gaelscoil and retrieve all documents containing gaelscoil, gaelscoile (genitive), or gaelscoileanna (plural) as well as the mutated forms of these words (ghaelscoil, ngaelscoil, etc., nine words in all). Verbal morphology is even more complicated, with a single root word typically producing more than 50 inflected/mutated forms.
Stemmed searches can be performed by selecting
the button Focail chlaochlaithe infhillte (Mutated and inflected words).
For example, if you are interested in monetary policy, you would
naturally try to search for airgeadaíocht; while Google returns
only a couple of results for this query, we get several hundred with the
aimsigh.com stemming feature activated...
Stemmed searching has a mixed reputation in information retrieval circles, though this might be due in part to the fact that most research has been done on English or other languages with similarly limited morphological complexity. The other issue is that a lot of online stemming is done with “resource-light” approaches like the Porter algorithm; aimsigh.com instead uses a full lexicon and morphological analysis to guarantee correct stemming. It is worth noting also that Irish morphology is not nearly as complicated as that of languages like Basque, Swahili, or Hiligaynon, where a certain amount of stemming would seem to be absolutely essential.
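The difference between lexicon-based stemming and suffix stripping can be shown with a small sketch. It assumes a one-entry toy lexicon built from the gaelscoil example above and handles only lenition and eclipsis; LEXICON, surface_forms and lemma_of are hypothetical names, not the engine's own.

```python
# Toy lexicon: lemma -> inflected forms, taken from the gaelscoil example.
LEXICON = {"gaelscoil": ["gaelscoil", "gaelscoile", "gaelscoileanna"]}

LENITABLE = set("bcdfgmpst")
ECLIPSIS = {"b": "mb", "c": "gc", "d": "nd", "f": "bhf",
            "g": "ng", "p": "bp", "t": "dt"}


def surface_forms(form: str) -> list:
    """Return a form plus its lenited and eclipsed variants (simplified)."""
    forms = [form]
    first, rest = form[0], form[1:]
    if first in LENITABLE:
        forms.append(first + "h" + rest)
    if first in ECLIPSIS:
        forms.append(ECLIPSIS[first] + rest)
    return forms


# Invert the lexicon so every surface form points back to its lemma.
FORM_TO_LEMMA = {
    surface: lemma
    for lemma, forms in LEXICON.items()
    for form in forms
    for surface in surface_forms(form)
}


def lemma_of(token: str) -> str:
    """Look the token up in the lexicon; unknown tokens are left unchanged."""
    key = token.lower()
    return FORM_TO_LEMMA.get(key, key)
```

The table ends up with the nine surface forms mentioned above, and lemma_of("nGaelscoileanna") returns "gaelscoil". Since the mapping comes from a lexicon rather than from stripped suffixes, a word that merely happens to end in -eanna is never conflated with an unrelated stem.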
4. Only Irish language documents
Irish language documents on the web are drowned in a veritable sea of English, Spanish, German, etc., and simple searches with Google or Yahoo! are often fruitless because of this. One issue of course is that some Irish words accidentally coincide with English words: think of bean (woman), punt (pound), file (poet), or tine (fire). And English is not the only problem; if you search for a very Irish-looking word like luach (value) with a standard search engine, you'll turn up very few Irish documents because this word also means “calendar” in Hebrew. In addition, there are many lexical conflicts with Scottish Gaelic, which has a somewhat smaller but not inconsequential web presence; a search for a word like ceist will yield documents split roughly half-and-half between Scottish and Irish Gaelic. So perhaps, in frustration, you swear off ever using any words that happen to coincide with anything from any other written language, and decide to search for “Bunreacht na hÉireann” (Constitution of Ireland). As it turns out, the first seven hits on Google point to English documents! A related (but obviously less important) issue is the irritation of having to click through “Choose your language” splash pages on Irish governmental web sites (when indeed a choice is offered). You are taken to such a page for example when you search for Foras na Gaeilge with Google; the top hit for aimsigh.com is instead the Irish language home page....
This feature is, of course, not particularly remarkable in a technological sense; most search engines offer the ability to restrict results to particular languages. The problem is that they usually only offer a selection of the most prominent 30-35 languages on the web. Now without too much work (and granted a certain amount of volunteer help from native speakers) I was able to train statistical language recognizers for the “next” 150 or so languages and run web crawlers for each of them (see Corpus building for minority languages for more information). So I suspect that the restriction to 30+ languages on Google's site must be a user interface decision on their part, so that Swedish speakers won't have to scan through a list of 200 (or 500 or 1000) languages to find Swedish in a pull-down menu.
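For readers who have not built one, a statistical language recognizer can be surprisingly small. The sketch below assumes a character-trigram profile per language compared by cosine similarity, which is one common approach and not necessarily the one used here. The two training samples are single sentences borrowed from this page, far too little for real use.

```python
import math
from collections import Counter

# One tiny training sample per language (an assumption for illustration;
# a usable recognizer needs far more text per language).
SAMPLES = {
    "ga": "An t-ardcheannas ar na Fórsaí Cosanta is le dlí a rialófar "
          "an modh ar a n-oibreofar é.",
    "en": "Most writing in Irish today conforms to the standard, "
          "though certainly not 100%.",
}


def trigrams(text: str) -> Counter:
    """Count overlapping character trigrams of the lowercased text."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def identify(text: str) -> str:
    """Return the language whose profile is most similar to the text."""
    query = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

Adding a language is just a matter of adding another entry to SAMPLES, which is why the amount of available text, rather than the algorithm, tends to be the limiting factor for small languages.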
5. Non-standard representations of síntí fada
For the non-Irish-speaking readers, síntí fada are the acute accents that appear on many vowels in Irish. Back in the day before 8-bit email was widely available, messages sent to popular email discussion lists like GAELIC-L were written with the infamous slaiseanna to indicate the accents: u/rsce/alai/ = úrscéalaí (novelist) or te/acschomhad = téacschomhad (text file). As a consequence, the archives of such mailing lists (which form the single largest source of Irish language material on the web as of this writing) are essentially invisible to standard search engines (which would all index the above words as separate units u + rsce + alai or te + acschomhad). In contrast, the aimsigh.com engine automatically detects pages that use unusual conventions and converts them to a standard format for indexing....
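The slash convention itself is easy to undo. Here is a minimal sketch, assuming that a slash directly after a vowel always marks an accent and using a crude token-ratio heuristic for detection. The function names and the 0.05 threshold are my own assumptions for illustration.

```python
import re

# Map each plain vowel to its accented counterpart.
ACUTE = str.maketrans("aeiouAEIOU", "áéíóúÁÉÍÓÚ")
SLASH_VOWEL = re.compile(r"([aeiouAEIOU])/")


def restore_fadas(text: str) -> str:
    """Rewrite vowel-plus-slash sequences as accented vowels."""
    return SLASH_VOWEL.sub(lambda m: m.group(1).translate(ACUTE), text)


def uses_slash_convention(text: str, threshold: float = 0.05) -> bool:
    """Guess whether a page uses slashes for accents.

    The page is flagged when the share of whitespace-separated tokens
    containing a vowel followed by a slash reaches the threshold.
    """
    tokens = text.split()
    if not tokens:
        return False
    hits = sum(1 for token in tokens if SLASH_VOWEL.search(token))
    return hits / len(tokens) >= threshold
```

Here restore_fadas("u/rsce/alai/") gives "úrscéalaí" and restore_fadas("te/acschomhad") gives "téacschomhad". The conversion should only be applied to pages that pass the detection step, since ordinary text such as "he/she" or a URL would otherwise be corrupted.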
6. Decapitalization according to Irish conventions
When certain initial mutations (“t” and “n” before vowels) are added to uppercase words, the mutating letter is written in lowercase and without a hyphen: Acht > tAcht (a legislative act), or Ocht > nOcht (eight). On the other hand, the lowercase versions of the same words are written with hyphens: t-acht, n-ocht. The really bad news is that in these two cases, naïve conversion to lowercase produces completely different Irish words (tacht is a verb meaning “choke” and nocht is either a verb or adjective meaning “bare”). So if (for whatever reason) you enter tacht into Google, the Irish language results that are retrieved are all incorrect, referring to tAcht...
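A lowercasing routine that respects this convention takes only a few lines. The sketch below assumes that a lowercase t or n directly followed by an uppercase vowel is always a mutation prefix; the name irish_lower is hypothetical.

```python
import re

# A lowercase "t" or "n" followed by an uppercase vowel at the start of a word.
PREFIXED = re.compile(r"^([tn])([AEIOUÁÉÍÓÚ])")


def irish_lower(word: str) -> str:
    """Lowercase a word, restoring the hyphen after a t/n mutation prefix."""
    hyphenated = PREFIXED.sub(lambda m: m.group(1) + "-" + m.group(2), word)
    return hyphenated.lower()
```

So irish_lower("tAcht") returns "t-acht" and irish_lower("nOcht") returns "n-ocht", while an ordinary capitalized word such as "Tacht" still becomes "tacht" and the two stay distinct in the index.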
7. Automatic translation and augmentation of document titles
One of my main interests is in machine translation, and I have a rudimentary system in place for translating English text to Irish. This is used when documents are harvested from the web to translate boilerplate English titles into Irish; for example “GAELIC-L Archives — June 2000” would appear in aimsigh.com search results as “Cartlann GAELIC-L — Meitheamh 2000”. Or bilingual titles like “TG4 — Irish language television channel — Teilifis Gaeilge” [sic] are shortened and corrected to “TG4 — Teilifís Ghaeilge”. This makes for more effective searching (since we assume primarily Irish language search terms will be used) and also a more pleasing Irish-only visual experience when scanning results.
In addition, we augment useless titles with additional information to help clarify the contents of the document. Here “useless” means, in the worst case, no title at all, as is the case for articles in the newspaper Lá, but also refers to situations in which the title, even if translated, fails to provide useful clues as to the contents of the document, or fails to distinguish the document from others on the site; good examples are pages from the Irish Times Teanga Bheo site, the majority of which are titled simply “An Teanga Bheo — The Irish Times weekly Irish language site”. For the Lá articles, we do our best to extract the title from the body of the HTML document and construct a title from that. For the Teanga Bheo articles, it is possible to extract the date of publication from the URL; this is translated to Irish and appended to the title. Similar tricks are used for a number of other sites...
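The date trick can be sketched as follows. The URL layout (/YYYY/MMDD/) and the parenthesized output format are assumptions made purely for illustration, as is the example.org address; only the Irish month names are fixed.

```python
import re

MONTHS_GA = [
    "Eanáir", "Feabhra", "Márta", "Aibreán", "Bealtaine", "Meitheamh",
    "Iúil", "Lúnasa", "Meán Fómhair", "Deireadh Fómhair", "Samhain", "Nollaig",
]

# Hypothetical URL layout: .../2005/0614/... (year, then month and day).
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})(\d{2})/")


def augment_title(title: str, url: str) -> str:
    """Append an Irish-language date taken from the URL, if one is found."""
    match = DATE_IN_URL.search(url)
    if not match:
        return title
    year, month, day = (int(group) for group in match.groups())
    if not 1 <= month <= 12:
        return title
    return f"{title} ({day} {MONTHS_GA[month - 1]} {year})"
```

For instance, augment_title("An Teanga Bheo", "http://example.org/teangabheo/2005/0614/alt.html") returns "An Teanga Bheo (14 Meitheamh 2005)", which is enough to tell otherwise identically titled pages apart in a result list.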
8. Manually-curated document database
Above, we bemoaned the fact that Irish language documents form only the tiniest fraction of all those available on the web. On the other hand, once the Irish documents have been separated from the rest, the small size of the resulting database (hundreds of thousands of documents vs. billions) is a great advantage. For one thing, there seem to be quite a few Irish documents (thousands) out there that are not indexed at all by Google (some examples linked below); I suspect that we had better luck in turning these up by focusing our crawling on a relatively small number of especially productive (mostly .ie) domains...
Conversely, “spam sites” are becoming more of a problem for the large search engines, and quite a few searches for Irish language documents will turn up useless sites that simply repeat verbatim excerpts from various other pages. For example, if you search for the title “Irish Times weekly Irish language” on Google, there are 44 unique results returned of which only 5 appear to be legitimate. Because of the smaller scale of aimsigh.com it is an easy matter to detect such sites semi-automatically and then remove them from our indices...
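One plausible way to do the automatic half of this detection is to measure how much of a page consists of word sequences copied from pages already in the index. The sketch below uses word shingles for that; it is an assumed technique for illustration, not a description of the method actually used, and the names shingles and copied_fraction are hypothetical.

```python
def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def copied_fraction(page: str, others: list, k: int = 5) -> float:
    """Fraction of a page's shingles that also occur in the other pages."""
    own = shingles(page, k)
    if not own:
        return 0.0
    seen = set().union(*(shingles(other, k) for other in others))
    return len(own & seen) / len(own)
```

A page whose copied fraction is close to 1.0 against the rest of the index is a candidate for removal. At this scale the candidates can then be reviewed by hand, which keeps legitimate pages that quote other sources from being dropped by mistake.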
9. Irish-centric page ranking
Even when there are documents available that contain legitimate Irish text, it sometimes happens that the page ranking algorithms are skewed in favor of sites that are heavily linked from non-Irish web pages. A good example is the Open Directory Project dmoz.org which often gets high page rankings because of its wide popularity (especially among English and German speakers). Unfortunately almost no Irish language sites are contained in the directory and so results from dmoz.org or one of its many mirrors are of little use to an Irish speaker. This can be illustrated by searching for something like Eolaíocht with Google; about half of the returned results are from Open Directory mirror sites...
Another example is the “Foras na Gaeilge” query discussed and linked above; the default English page is listed as the second search result, while the default Irish page is nowhere to be found. When ranking pages, aimsigh.com only considers links emanating from other Irish language pages; while this system is still not perfect it seems to improve matters greatly.
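The effect of counting only links from Irish language pages can be illustrated with a PageRank-style computation. This is a minimal sketch under the assumption that ranking is a standard power iteration over a filtered link graph; the actual ranking formula is not described here, and the page names in the example are made up.

```python
def irish_pagerank(links: dict, irish_pages: set,
                   damping: float = 0.85, iterations: int = 50) -> dict:
    """PageRank-style scores that ignore links from non-Irish pages.

    links maps each source page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Keep outgoing links only for sources known to be Irish language pages.
    out = {p: (list(links.get(p, [])) if p in irish_pages else [])
           for p in pages}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without counted links is spread evenly.
        dangling = sum(rank[p] for p in pages if not out[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in out[p]:
                new[target] += damping * rank[p] / len(out[p])
        rank = new
    return rank


links = {
    "en1": ["directory"], "en2": ["directory"], "en3": ["directory"],
    "ga1": ["foras"], "ga2": ["foras"],
}
ranks = irish_pagerank(links, irish_pages={"ga1", "ga2", "foras"})
```

In this toy graph the page "directory" has more inbound links than "foras", but all of them come from non-Irish pages, so "foras" ends up with the higher score.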