It even has its own flag!
I have had an on-again, off-again relationship with Esperanto. But, don't cry for me, Esperanto, the truth is I've never left you. Sometimes I have left it alone for years, and then come back to study it with abandon for a time. I've always found it fascinating. The grammar is clean, and the pronunciation is regular and easy. I find it fairly easy to read, at least with some of the simpler sentence constructs, and I like the complete regularity of the language. Esperanto is a flexional language, and the declensions of the various word types are completely, strictly regular. There are no special cases. There are no “yes, but...” rules. There aren't any exceptions to the grammar rules: none. That has always made me think that this would be an excellent language for use with computer programs. It should, in theory, be easy to write a program that would parse a sentence in Esperanto and then return some sort of object describing the parsed sentence. Add a little vocabulary, and you could provide a language interface to your program. With a little rudimentary language learning on your part, you've got a great way to control a program: a natural language interface to turn lights on and off in your house, control a robot, change TV channels, or whatever you like!
It seems like somebody should have already written one of these. It seems an obvious thing to do. I looked all over the web, but I couldn't find a simple parser for Esperanto that would give me an object describing a sentence. So I wrote one.
A word about Grammar
The thing is, people mean different things when they use the word “grammar”. Many people take a prescriptive, lawyerly approach, where they may judge some utterance to be either “good” or “bad” grammar. This type of grammar, where rules are prescribed and are not to be broken, is commonly referred to as a “prescriptive grammar”, and that approach has no use here.
In linguistics, there are many ways to approach grammar, and many meanings to the word. The meaning I am using here is that of a descriptive grammar. That is, I'm not concerned with what is right or wrong, but rather with the rules that describe the underlying construction of a particular language. For instance, how would I find a set of rules that, given a particular concept or thought, would tell me how to piece together an utterance in the target language, and have it adhere to what other speakers of the language expect? This is called a “generative grammar”, a concept championed by Noam Chomsky in several of his early books, notably Syntactic Structures. Dr. Chomsky has moved somewhat away from his early theories, but they will work for our purposes. Another key concept in what I wrote is termed a “transformational grammar” (also a concept from Chomsky), which allows a structure in some form to be transformed into another by some rule or set of rules. In practical terms, this allows us to do things such as deal with repeating conjunctions (“this and that and these and those”) or deal with the question construct in Esperanto. “Ĉu” at the beginning of a sentence turns it into a question, so in our object model that should become a top-level node containing “Ĉu”, with the rest of the sentence in a subnode. Transformational rules allow us to do these transforms.
Some approaches to computational linguistics attempt to discover these underlying grammatical constructs in a language. Generally, a very large corpus of sample material in the target language is fed into a computer program, and statistical and probabilistic methods are used to find relationships within the target material. This is certainly an interesting subject, as it leads directly into parallel processing and massively parallel systems (Hadoop and Google's MapReduce come to mind), but it is not the approach I used here. We already know the structure of Esperanto. It is well defined, and clearly laid out in many books, online courses, and so forth. What I intended to do was to take that written form, intended for humans, and turn that set of rules into a computer program that could recognize and digest utterances in our well-defined target language. Note also that this is one-way. The program does not take concepts and generate utterances in Esperanto, but only pulls apart existing utterances.
As the source of the rules and grammar, I used A Complete Grammar of Esperanto, by Ivy Kellerman Reed. This is one of the seminal works on Esperanto, and is available in many places online for free, such as this one at Project Gutenberg. I started reading the book as a refresher in Esperanto and, as I went, I wrote the computer program, using the examples and exercises in the book as test inputs for the program. I haven't yet completed the book, but so far, so good! And so far, it is quite good enough for the intended purpose, which is to be able to control computer programs or devices (such as a robot) by typing Esperanto sentences.
Some Things This Program Does Not Do
It doesn't care about meaning, for one. It strictly adheres to syntax and syntactically decomposing utterances in Esperanto. Semantics and deeper meaning are left to the program consuming the resultant object from this program.
It also is strongly bound to the syntax of the target language, and does not learn. If it encounters an utterance that it cannot decipher, it will produce an object either partially or wholly in error – however, since the rules of the language are so firm, it's pretty easy to identify things that do not conform to those rules.
It's not a translator. While it is conceivable that the output from this program could be passed through another program with a different set of rules and transforms, and produce utterances in some other language, I have not written any program to do that. I mainly intended this as input to some other program or device, in order to direct its actions.
How it Works
The code does two passes through an utterance. The first is the Grouper, which does more or less low-level parsing, such as dividing sentences into subphrases on commas or other punctuation, determining word type via endings, or recognizing special marker words such as "kaj" ("and") or “Ĉu” (the question marker). It also groups things into subphrases depending on the words' relationships to one another -- for instance, adjectives and nouns that are adjacent to each other and agree in case are grouped into a subphrase. This pass works recursively on the phrases it parses, and at the end of it the phrases and subphrases are grouped into a tree structure, with the individual words as leaf nodes. The following picture illustrates:
In the above example, the sentence, "Fortaj ĉevaloj marŝas kaj kuras en la verdaj kampoj." ("[The] Strong horses walk and run in the green fields.") is parsed. We can see that it has created a top-level group on punctuation (in this case, a period, "."), under which it has several subgroups corresponding to things like noun phrases and verb phrases. For more complicated sentences, we may also see prepositional phrases and so forth. This program keeps things as simple as it can -- for instance, if a word like a pronoun can be parsed by regular syntactic rules (such as by its ending), it doesn't get special processing. Remember, the program parses syntactically only, and disregards meaning in its processing.
In this example, the noun phrase is expanded, and we can see that it has grouped "Fortaj ĉevaloj" together, since they agree in number and case. The words have further attributes assigned to them, such as Fortaj being recognized as a plural adjective relating to the sentence's agent ("subject" could be considered analogous, in other classification systems). These attributes are simply that -- text that is assigned to the word during processing, akin to a tag cloud or other classification system. This allows for other processing to be done past what this program does, opening the door for things like statistical analysis of text, k-means classification, and so forth. Or, just simply acting on the input!
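The agreement-based grouping described above could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the project's actual code: `Word`, `tag`, `group_noun_phrases`, and the tiny `PARTICLES` table are names invented here for illustration, and the tagger only looks at the noun ending -o, the adjective ending -a, the plural -j, and the accusative -n.

```python
from dataclasses import dataclass, field

# Hypothetical closed set of words that must not be tagged by their ending.
PARTICLES = {"la", "kaj", "ĉu", "en"}


@dataclass
class Word:
    text: str
    attributes: set = field(default_factory=set)  # free-form text tags


def tag(text: str) -> Word:
    """Tag nouns and adjectives purely from their endings."""
    word = Word(text)
    stem = text.lower()
    if stem in PARTICLES:
        word.attributes.add("particle")
        return word
    accusative = stem.endswith("n")   # -n marks the accusative
    if accusative:
        stem = stem[:-1]
    plural = stem.endswith("j")       # -j marks the plural
    if plural:
        stem = stem[:-1]
    if stem.endswith("o"):
        word.attributes.add("noun")
    elif stem.endswith("a"):
        word.attributes.add("adjective")
    else:
        return word                   # not a noun or adjective: no tags
    if plural:
        word.attributes.add("plural")
    if accusative:
        word.attributes.add("accusative")
    return word


def is_nominal(word: Word) -> bool:
    return bool(word.attributes & {"noun", "adjective"})


def agreement(word: Word) -> tuple:
    """Number and case, the two features that must match."""
    return ("plural" in word.attributes, "accusative" in word.attributes)


def group_noun_phrases(words: list) -> list:
    """Merge adjacent nouns/adjectives that agree in number and case."""
    groups = []
    for word in words:
        previous = groups[-1] if groups else None
        if (previous and is_nominal(word) and is_nominal(previous[-1])
                and agreement(previous[-1]) == agreement(word)):
            previous.append(word)
        else:
            groups.append([word])
    return groups


sentence = "Fortaj ĉevaloj marŝas kaj kuras en la verdaj kampoj"
groups = group_noun_phrases([tag(t) for t in sentence.split()])
print([[w.text for w in g] for g in groups])
```

In this sketch the article "la" stays in its own group rather than joining "verdaj kampoj"; how the real Grouper treats the article is not stated in this post.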
To illustrate, here is "Fortaj", in more detail:
We can see here that the program has recognized "Fortaj" as flexional, and that it has 100% confidence in its results. The program is set up to use probability to determine how to classify a word but, as Esperanto is so regular, simply recognizing the ending of the word has proved to be reliable so far, and probabilistic methods of identification have not yet been needed. The program has a lexicon, and further builds that lexicon during parsing, but it does not have an entry for Fort, which means that the Translation property on the word object is null. Again, the program operates syntactically only, and assumes that any translation or action done on words it parses is up to the consumer of its results. It's the calling program's responsibility to know what to do with the words and phrases it finds.
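To make the idea of a word object with attributes, a confidence value, and an optional translation more concrete, here is a small Python sketch. It is an illustration under assumptions, not the program's real object model: the `Word` fields, the `classify` function, the `MARKERS` set, and the one-entry `LEXICON` are all hypothetical. It relies only on the regular endings (-o, -a, -e, -j, -n, and the verb endings -as, -is, -os, -us, -u, -i), and it would misclassify closed-class words such as pronouns, which a fuller version would need to look up separately.

```python
from dataclasses import dataclass
from typing import Optional

# Regular verb endings; insertion order is the order they are checked in.
VERB_ENDINGS = {"as": "present", "is": "past", "os": "future",
                "us": "conditional", "u": "imperative", "i": "infinitive"}
NOMINAL_ENDINGS = {"o": "noun", "a": "adjective", "e": "adverb"}
MARKERS = {"kaj", "ĉu", "la"}      # hypothetical table of special words
LEXICON = {"ĉeval": "horse"}       # hypothetical; note there is no "fort"


@dataclass
class Word:
    text: str
    root: str
    attributes: set
    confidence: float              # 1.0 when an ending matched, else 0.0
    translation: Optional[str]     # None when the lexicon has no entry


def classify(text: str, lexicon: dict) -> Word:
    lowered = text.lower()
    if lowered in MARKERS:
        return Word(text, lowered, {"marker"}, 1.0, lexicon.get(lowered))

    root, attributes = "", set()
    for ending, tense in VERB_ENDINGS.items():
        if lowered.endswith(ending) and len(lowered) > len(ending):
            root = lowered[:-len(ending)]
            attributes = {"flexional", "verb", tense}
            break
    else:
        stem = lowered
        accusative = stem.endswith("n")
        if accusative:
            stem = stem[:-1]
        plural = stem.endswith("j")
        if plural:
            stem = stem[:-1]
        kind = NOMINAL_ENDINGS.get(stem[-1:])
        if kind and len(stem) > 1:
            root = stem[:-1]
            attributes = {"flexional", kind}
            if plural:
                attributes.add("plural")
            if accusative:
                attributes.add("accusative")

    if not attributes:             # no regular ending recognized
        return Word(text, lowered, set(), 0.0, None)
    return Word(text, root, attributes, 1.0, lexicon.get(root))


fortaj = classify("Fortaj", LEXICON)
print(fortaj.root, sorted(fortaj.attributes), fortaj.confidence, fortaj.translation)
```

The fixed confidence of 1.0 mirrors the observation above that endings alone have been reliable so far; a probabilistic classifier could later fill in other values without changing the shape of the object.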
After groupings have been done, the next processing pass does any transformations that need to be done. At this point, we have a tree structure, as described above, which should be completely parsed. The transformation pass takes this parsed tree and applies further rules in order to produce the final result. For instance, one of the rules we have been talking about is that “Ĉu” marks a question. The transformation for this is that “Ĉu” forms a top node, with the rest of the sentence underneath it. A somewhat similar rule may be to group subphrases related to conjunctions under that conjunction -- for example, the equivalent of
this (abstract noun-like object)
that (abstract noun-like object)
these (abstract noun-like object)
could be transformed to:
this (abstract noun-like object)
that (abstract noun-like object)
these (abstract noun-like object)
For the transform mentioned above, here are before and after examples. Note in the before example, the word "Ĉu" does not have anything special about it, although the grouper did recognize it as a word needing a transformation. In the second example, we can see that new subgroups have been created under the word (see near the highlighted line in the second picture, below). Similarly, the question word or any of its subwords and subphrases could be marked in any number of ways (for instance, attributes on the words) for further processing or as a marker to the consuming program to take some action.
The transforms are implemented using the Visitor pattern, where each type of transform is implemented as a small class which is handed a Group object. The object does any transform, and returns true if it found anything on which to operate. The main loop recurses over the tree until all transforms return false, meaning they found nothing on which to make any changes. Each Transformer class is a small, simple class that does one operation on its input group.
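The loop described above, where small transformer classes are applied until none of them reports a change, might look something like this in Python. This is only a sketch under assumptions: `Group`, `QuestionTransformer`, `ConjunctionTransformer`, `visit`, `transform`, and `render` are hypothetical names, and the real classes may be organized quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    label: str                                   # word text or group name
    children: list = field(default_factory=list)


class QuestionTransformer:
    """Moves everything after a leading 'Ĉu' underneath the 'Ĉu' node."""

    def visit(self, group: Group) -> bool:
        kids = group.children
        if (len(kids) > 1 and kids[0].label.lower() == "ĉu"
                and not kids[0].children):
            kids[0].children = kids[1:]
            group.children = [kids[0]]
            return True
        return False


class ConjunctionTransformer:
    """Groups the neighbours of 'kaj' under it, one 'kaj' per call."""

    def visit(self, group: Group) -> bool:
        kids = group.children
        for i in range(1, len(kids) - 1):
            node = kids[i]
            if node.label.lower() != "kaj" or node.children:
                continue
            left, right = kids[i - 1], kids[i + 1]
            if left.label.lower() == "kaj" and left.children:
                # "a kaj b kaj c": extend the conjunction already built.
                left.children.append(right)
                group.children = kids[:i] + kids[i + 2:]
            else:
                node.children = [left, right]
                group.children = kids[:i - 1] + [node] + kids[i + 2:]
            return True
        return False


def apply_once(group: Group, transformers: list) -> bool:
    """Offer the group, then each subgroup, to every transformer."""
    changed = False
    for transformer in transformers:
        changed = transformer.visit(group) or changed
    for child in group.children:
        changed = apply_once(child, transformers) or changed
    return changed


def transform(root: Group, transformers: list) -> Group:
    while apply_once(root, transformers):        # stop when nothing changes
        pass
    return root


def render(group: Group) -> str:
    if not group.children:
        return group.label
    inner = ", ".join(render(child) for child in group.children)
    return f"{group.label}({inner})"


words = "Ĉu fortaj ĉevaloj marŝas kaj kuras".split()
tree = transform(Group("sentence", [Group(w) for w in words]),
                 [QuestionTransformer(), ConjunctionTransformer()])
print(render(tree))  # sentence(Ĉu(fortaj, ĉevaloj, kaj(marŝas, kuras)))
```

Because each transformer makes at most one change per call and reports whether it did, adding a new rule only means adding another small class to the list.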
The next steps that this program needs to take are more transformations, specifically ones that explicitly express grammar rules of Esperanto, or ones that mark leaf nodes in specific ways -- for instance, marking all leaf nodes that are related to a pronoun in the sentence, and so forth.
The program also needs to further encapsulate the chapters of the source text (A Complete Grammar of Esperanto), and make sure that it covers any of the more esoteric rules of the language.
Other next steps could be things like loosening its coupling to Esperanto specifically, and making it so that it can do data mining to find its own rules and transformations in sample text -- that is, making it learn on its own. This is, obviously, a large undertaking and I have purposely left it out of scope for this program.
Another possible expansion is to do the exact reverse of parsing -- take some programmatic object, and use the rules defined here to not parse text, but to generate correct text in Esperanto.
Where to Download the Code
The code for this project is available at the link below. Please let me know if you find it useful, and, if you expand on it, please let me know, so we can roll your changes in!
Linguistics is a fascinating subject, but it can be hard to figure out where to start. For a good overview, I recommend getting a copy of Teach Yourself Linguistics, by Jean Aitchison, who is the Emeritus Rupert Murdoch Professor of Language and Communication at Oxford. Despite her rather formidable title, Dr. Aitchison has written a very straightforward and easy-to-read introduction to linguistics. For an interesting read on language formation and development, The Language Instinct by Steven Pinker, a well-known professor in the Psych department at Harvard, is a classic. To go a bit deeper on a much more technical level, Syntactic Structures by Noam Chomsky dramatically changed the field of linguistics when it was published. It's worth following that link just to read the introduction and overview. Finally, for a good layman's introduction to everything related to cognition, neuroscience, and, of course, language and linguistics, I highly recommend the Brain Science Podcast, by Dr. Ginger Campbell. Dr. Campbell gives overviews of many fascinating books, far more than I for one could ever read, and has interesting interviews with leading researchers and authors working in fields such as cognitive science and neuroscience. I really can't recommend her podcast strongly enough. And she also goes to Dragon*Con, so how cool is that?!
And one more thought. I know, I know, as you've been reading this, you have been thinking, "Yes, yes, this is all fine and good, but, what about a free editor for Ancient Egyptian hieroglyphics?" So, without further ado or non sequiturs, here it is. JSesh, open source and available for most platforms. It's pretty awesome!