An Gramadóir: History

Release History

The data on this page refer to the Irish version of An Gramadóir which was, through version 0.4, packaged together with the build scripts as gramadoir-0.x. Starting with 0.50, the Irish version is distributed independently as the Perl module Lingua::GA::Gramadoir.

Version	Release Date	Size of lexicon	Alt. forms	Idioms	Disambig rules	Gramm. rules	Excepts
0.1	18 Jul 2003	313,973	-	-	16	146	18
0.2	30 Jul 2003	314,027	22,292	-	22	173	18
0.3	21 Oct 2003	315,002	23,353	-	24	177	18
0.4	8 Jan 2004	315,041	24,107	214	331	361	77
0.50	28 Jul 2004	320,958	27,868	222	333	362	77
0.51	25 Aug 2004	321,088	31,077	222	333	362	77
0.60	3 Mar 2005	310,883	44,067	410	456	1573	492
0.70	10 Oct 2013	359,710	117,958	426	545	2821	871

Benchmarks

The benchmark corpus consists of approximately one megabyte of plain text from the online Irish monthly Beo!. It contains 192,406 words in 9,292 sentences. Times below are given in seconds (real computation time on my dual Xeon box running Gentoo Linux). WPM = words per minute.

v	TOT	WPM	ab	cu	co	a1	a2	un	rl	ei	as
0.1	220.16	52436	5.20	1.06	-	3.03	-	-	206.52	0.46	3.89
0.2	220.73	52301	5.10	1.13	-	3.38	-	-	206.80	0.43	3.88
0.3	241.61	47781	5.04	1.11	-	3.41	-	-	227.64	0.44	3.97
0.4	208.39	55398	15.06	2.60	6.39	133.83	26.46	0.76	19.25	1.04	3.00
0.50	216.36	53357	6.97	6.05	9.75	144.02	29.08	1.41	14.72	1.83	2.60
0.51	204.19	56537	6.95	6.63	8.97	130.11	26.89	1.26	19.03	1.65	2.62
0.60	193.00	59815	24.49	7.40	23.39	78.73	25.07	1.34	29.91	-	2.42
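
The WPM column can be reproduced from the corpus size and the TOT column. The following is a minimal sketch, assuming WPM is the corpus word count divided by total time in minutes and rounded to the nearest whole number; the function name is illustrative and not part of An Gramadóir.

```python
# Figures taken from the benchmark description and the TOT column of the table.
CORPUS_WORDS = 192_406

# Total run time in seconds for each benchmarked version.
total_seconds = {
    "0.1": 220.16,
    "0.2": 220.73,
    "0.3": 241.61,
    "0.4": 208.39,
    "0.50": 216.36,
    "0.51": 204.19,
    "0.60": 193.00,
}


def words_per_minute(words: int, seconds: float) -> int:
    """Convert a word count and a run time in seconds to whole words per minute."""
    return round(words * 60 / seconds)


for version, seconds in total_seconds.items():
    print(f"{version}\t{seconds:.2f}\t{words_per_minute(CORPUS_WORDS, seconds)}")
```

Under that assumption the computed values agree with the WPM column for every version listed.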

ChangeLog Summary

Version 0.60->0.70

Massive expansion of lexicon, especially handling of non-standard spellings for Caighdeánaitheoir
Nearly double the number of grammatical rules
Small bug fixes

Version 0.51->0.60

Lexicon additions, improvements, and bug fixes (some unnecessary inflections removed)
Rule set more than tripled in size, now covering a wide range of Irish grammar, including nearly all rules concerning missing or unnecessary initial mutations
Simplification of the tagset, some regular expression optimizations, less dependence on part-of-speech macros, and the restructuring of a few slow rules led to another 5% speed improvement despite the massively larger rule set
Added warnings for many “dangerous pairs” based on corpus analysis
Now correctly tokenizes numerals, including years, ordinals like 5ú, and plurals like 1950í.
Many morphological rules added for treating pre-standard orthography; together with work on the replacement file, these allow An Gramadóir to be used as a “normalizer” for indexing, information retrieval, etc.
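
The numeral forms mentioned above (years, ordinals like 5ú, plurals like 1950í) can be illustrated with a small classifier. This is only a sketch: the pattern, the function name, and the rule that a bare four-digit number counts as a year are assumptions made for illustration, not the tokenizer An Gramadóir actually uses.

```python
import re
from typing import Optional

# Hypothetical pattern: digits with an optional ordinal (ú) or plural (í) suffix.
NUMERAL = re.compile(r"(?P<digits>\d+)(?P<suffix>ú|í)?")


def classify_numeral(token: str) -> Optional[str]:
    """Return a rough label for a numeral token, or None if it is not one."""
    m = NUMERAL.fullmatch(token)
    if m is None:
        return None
    if m.group("suffix") == "ú":
        return "ordinal"
    if m.group("suffix") == "í":
        return "plural"
    # Assumption for this sketch: a bare four-digit number is treated as a year.
    if len(m.group("digits")) == 4:
        return "year"
    return "number"


for token in ["5ú", "1950í", "2004", "37", "bliain"]:
    print(token, classify_numeral(token))
```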

Version 0.50->0.51

Lexicon additions and improvements
Improved error trapping
Improved Perl code generation (consistent use of non-capturing parentheses, etc.) giving 5% speed improvement.
Perl implementation of developer options --brill, --freq, --ambig distributed in the language pack gramadoir-ga-0.51.
Added --no-unigram option to gram-ga.pl
POD documentation for gram-ga.pl
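
For readers unfamiliar with non-capturing parentheses: they group alternatives without recording the matched text, so the regex engine has less bookkeeping to do. The sketch below uses Python, whose `(?:...)` syntax is the same as Perl's; the pattern itself is a made-up example, not one of the generated rules, and it only shows what the construct does rather than measuring any speed difference.

```python
import re

# Hypothetical rule fragment: a possessive followed by a word.
capturing = re.compile(r"(mo|do)\s+(\w+)")
non_capturing = re.compile(r"(?:mo|do)\s+(\w+)")

text = "mo cat"

# Both patterns match the same span of text...
same_span = capturing.match(text).span() == non_capturing.match(text).span()

# ...but the non-capturing version records one group instead of two.
print(same_span, capturing.groups, non_capturing.groups)
print(non_capturing.match(text).group(1))
```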

Version 0.4->0.50

Complete rewrite of core engine entirely in Perl
Default output encoding is now utf8; added a --aschod option to change this
The default is now to report all spelling errors in a sentence (was at most two)
Added a complete morphological analyzer which greatly improves the error messages when words are not found in the lexicon.
The morphology engine also improves handling of late capitals, so words like d'Fhoras and Sean-Nós are now passed over silently as correct. Also, because the capitalized version of a word like hAire is in the lexicon, the lowercased version is no longer searched automatically, so the word gets the correct, unambiguous masculine POS tag. Similarly, in bPáirtí Glas, the first word is now recognized unambiguously as a noun, which has the added benefit of allowing Glas to be correctly recognized as an adjective.
Line numbers are now given where the error occurs (was the line number of the beginning of the sentence containing the error).
Non-standard words are now tagged so as to be reported as misspellings when the --litriu flag is given.
Doubled words only reported when there is no intervening punctuation; the two words together are now marked up as the erroneous text.
Bug from unescaped $ in bash version goes away with the Perl rewrite
Global highlighting bug fixed (e.g. re in gach re caused the re in toibreacha to be highlighted also).
No more line number attribute in intermediate XML
Use character entities &quot;, etc. in --api, --html, and --xml output
Added --api command line option which generates XML output suitable for use as an interface to other programs
Added command line options --aschur, --dath, --comheadan
I cracked and added English versions of the long command-line options
Improvements (adding to .neamhshuim) and bug fixes to Vim interface
Afrikaans localization
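
The doubled-word behaviour described above (report only when no punctuation intervenes, and mark both words together as the erroneous text) can be sketched with a back-reference. The pattern and function name are assumptions for illustration; the rule as implemented in An Gramadóir is not shown on this page.

```python
import re

# A word, whitespace only (no punctuation), then the same word again.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b")


def find_doubled_words(sentence: str) -> list:
    """Return each doubled pair as a single string, e.g. 'an an'."""
    return [m.group(0) for m in DOUBLED.finditer(sentence)]


print(find_doubled_words("tá an an fear ann"))     # the pair is reported together
print(find_doubled_words("tá sé, sé sin an fear")) # comma intervenes, nothing reported
```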

Version 0.3->0.4

Rule set more than doubled in size (improved generation of Perl code means no loss in efficiency, in fact a 15% improvement)
“tag 2-gram” rules added that flag unlikely part-of-speech combinations
Complete Brill-like rule-based tagger added with 331 disambiguation rules followed by a default unigram tagger
Added developer options --brill, --ilchiall, --minic, and --no-unigram useful for developing the tagger
Module for chunking of set phrases added
Language-dependent modules added for recognizing abbreviations; improves sentence segmentation
Added --aspell option which makes suggestions for misspelled words
Modularized more language-specific material (tolower, macro files, etc.)
Flagging of repeated words
Flagging of extremely rare words (which sometimes disguise misspellings)
Dutch, French, Mongolian, Romanian, and Slovak localizations
Now builds and runs cleanly in a UTF-8 locale
Vim interface gramadoir.vim included
Minor bug fixes

Version 0.2->0.3

Optional native language support with GNU gettext (with translations to Irish and German included)
Added ability to specify text encoding via --ionchod command line option
Modularized language-specific material; added trivial English port and the --teanga command line option
Ported to build cleanly under Mac OS X
Added new (mostly grammar) rules
Minor bug fixes
New extras in tarball: emacs interface, complete CVS ChangeLog

Version 0.1->0.2

Added the replacement file containing non-standard forms
Added user ignore file and --iomlan command line option
Added new disambiguation and grammar rules
More robust handling of exceptional input text
Minor bug fixes
