Thursday, April 2, 2015

You Can Do Magic

Well, Alchemi, anyway. 

Alchemi is an open source high performance computing framework, originally developed by the GRIDS Laboratory at the University of Melbourne, Australia, along with other contributors to the project. The original Alchemi project is available under GPL v2, as are derivative works such as mine, which is a fork and update of the original. The project described here can be found on SourceForge, at https://sourceforge.net/projects/alchemi2/ .

Alchemi is a grid computing framework akin to more commonly used frameworks such as MPI, or the commercially available Digipede framework. As with Digipede, in Alchemi you create a class in a .Net language that performs some unit of work, and that class is handed off to a worker node. The worker node executes the piece of work and then hands back the serialized class, with properties containing the results of the computation.

Alchemi is pretty neat, but it was pretty old code, and needed some updating. For instance, internally it used no generics, as the original code predates generics in the .Net framework. I also found the performance lacking, mainly due to the setup and teardown time needed for each instance of the GThread classes, which are what get handed to the worker nodes.

There were also some features that needed adding. The idea of an application, which is a collection of GThreads, was good as far as it went, but there also needed to be orchestrations of some sort, where one set of parallelized steps finishes and its output is handed to another set of steps to continue the work. In the new paradigm, these are Batch objects: collections of Application objects with dependencies between the Applications, which together describe an acyclic graph, or a dataflow diagram.

Also lacking was the ability to do online processing – when a worker node finishes with its data, it generally quits and returns. While this can be fine for static data, it doesn't allow for things such as a parallelized neural net waiting for data and acting on it as it arrives. For this, I used MSMQ to allow piping of data between processes running in parallel, and mailboxes (implemented in SQL Server) to allow control signals to be passed to worker nodes.

A completed batch, showing its steps. Steps are individual grid applications which are orchestrated as part of a batch.
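The dependency resolution a Batch has to do can be sketched independently of Alchemi. Here is a minimal illustration in Python (the project itself is C#, and the step names here are invented), using a topological sort to turn the Applications' dependency edges into a valid run order:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical batch: each application lists the applications whose
# output it depends on, together forming an acyclic graph.
batch = {
    "Parse":     set(),            # no dependencies; can run first
    "Filter":    {"Parse"},        # consumes Parse's output
    "Aggregate": {"Parse"},
    "Report":    {"Filter", "Aggregate"},
}

# static_order() raises CycleError if the graph is not acyclic,
# which is exactly the constraint a Batch imposes on its Applications.
run_order = list(TopologicalSorter(batch).static_order())
print(run_order)  # e.g. ['Parse', 'Filter', 'Aggregate', 'Report']
```

Any ordering in which every step runs after its dependencies is acceptable; steps with no mutual dependency (Filter and Aggregate above) are exactly the ones that could execute in parallel.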

I reworked a lot of the code to accomplish this, along with some general cleanup and better adherence to naming standards (GThreads are now renamed to GridThreads, there are no more non-generic collections, etc.), better code organization, and so on. I would estimate that about 70% of the code is rewritten or outright new at this point, while retaining (for the most part) the underlying remoting mechanisms that launch and manage worker threads.

The result is a much faster, much more fault tolerant and reliable cluster, capable of much higher throughput. For some applications, I am seeing throughput of 200,000 records a second on my seven node, 13 core cluster with six worker nodes (all Intel i5 machines, except for one ex-embedded machine, which has a Sempron). While not ready for the Top500 list, for what I need the cluster for (such as crunching through robotic sensor information), it's not bad at all – in fact, quite fast.

The compute cluster I built consists of six HP DC7800 mini desktop computers, and one very modified HP T5730W thin client machine. All machines are running Windows 7 Pro, except node 6, which runs Windows 8.1. A Drobo provides external storage, as well as serving as the NAS for our house in general.

Some Highlights

To improve throughput, there are new types of GridThreads, such as the MessageQueueGridThread, which reads from one MSMQ queue and outputs to another. It can operate in one of two modes: one where it processes a queue until no more records are available and then quits, and a second where it runs continually until it receives an external stop signal, allowing for online, non-batch processing. Queues can be dynamically allocated from a pool of known queues running on several machines in the cluster, allowing for dynamic piping of data between parallel processes in an application.

namespace Robotics.Alchemi.TubeTest.Grid
{
    using global::Alchemi.Core.GridThreads;
    using global::Alchemi.Core.Helpers;
    using global::Alchemi.Core.Storage;
    using System;
    using System.Collections.Generic;
    using System.Messaging;

    public static class AddPickles
    {
        public static string InMessageQueueAlias = "AddPicklesInqueue";

        public class AddPicklesStep : global::Alchemi.Core.BatchJobs.Step
        {
            public override IEnumerable GetThreadsToExecute(IEnumerable batchState, bool isLocalExecution)
            {
                yield return
                    new AddPicklesGridThread
                    {
                        StopSignalLabel = BatchInfo.StopSignalLabel,
                        StopType = MessageQueueGridThreadStopType.StopOnSignal,
                        ControlMessageQueueAlias = BatchInfo.ControlMessageQueueAlias,
                        InMessageQueueAlias = AddPickles.InMessageQueueAlias,
                        OutMessageQueueAlias = AddMustard.InMessageQueueAlias,
                    };
            }
        }

        public class AddPicklesGridThread : MessageQueueGridThread
        {
            protected override object ProcessMessage(Message message, string messageAsString)
            {
                return messageAsString + ", pickles";
            }
        }
    }
}

An example of a message queue enabled grid thread, and its associated batch step setup class.

Queues can be monitored within the console application. Within running grid applications, queues, which can exist on any arbitrary machine within the cluster, are dynamically reserved for a limited time by the application, which periodically re-reserves its queues. When a queue sits idle past its reservation time, it becomes available for use by another application.
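The reservation scheme amounts to a lease with periodic renewal. A rough sketch of the idea in Python (the actual implementation is C# against SQL Server; all names and the lease length here are invented):

```python
import time

RESERVATION_SECONDS = 60  # assumed lease length, for illustration

class QueuePool:
    """Leases queues from a known pool to applications; a lease that
    is not renewed in time makes the queue available to others."""
    def __init__(self, queue_names):
        # queue name -> (owner, lease expiry time); None means free
        self.leases = {name: None for name in queue_names}

    def reserve(self, app_id, now=None):
        now = time.time() if now is None else now
        for name, lease in self.leases.items():
            if lease is None or lease[1] < now:  # free, or lease expired
                self.leases[name] = (app_id, now + RESERVATION_SECONDS)
                return name
        return None  # no queue available right now

    def renew(self, app_id, name, now=None):
        now = time.time() if now is None else now
        owner, expiry = self.leases[name] or (None, 0)
        if owner == app_id and expiry >= now:
            self.leases[name] = (app_id, now + RESERVATION_SECONDS)
            return True
        return False  # lease was lost; the app must reserve again
```

The key property is that no explicit release is required: an application that crashes simply stops renewing, and its queues return to the pool when the lease runs out.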

In-memory tracking of executor nodes and managers has been removed in favor of SQL Server alone. This allows for better tracking and logging of node activity, applications, batches, and so on, and also allows access to the data through channels other than the manager, letting the manager focus on scheduling rather than servicing data requests better served by a dedicated, robust database system. This also allows for a more distributed system in the future, with Executors less dependent on a central control point.

Executors and Managers are more fault tolerant. Executors sense when the manager is unavailable and aggressively attempt to re-establish contact. While not perfect, this allows for a cluster that is tolerant of node failures and more readily routes around hardware failures.

Executors can be more directly controlled, and can be issued commands directly in order to stop/start an executor, shut down the executor's host machine, etc.

The Executors control screen, showing buttons that send commands to individual executors. As with most screens in the console application, this screen has an auto-refresh feature with an adjustable refresh frequency.

Executor (worker) nodes can be controlled directly by passing messages into a mailbox that they monitor. Mailbox functionality is also available programmatically to grid applications.

The front end application is greatly improved and extended. While a lot of the existing functionality remains – for instance, the performance graph – there is a lot of new functionality written using WPF, Entity Framework, and the MVVM pattern, allowing for a responsive application with a more modern look, feel, and functionality. Where possible, it bypasses the manager node and queries SQL Server directly, making it a normal data-centric application, independent of whether the cluster is up and running.

The main form of the console application. While not to regular UX standards, the [[HOME]] menu item brings the user back to this start screen, to allow quick switching between work areas. The legacy application is available through the 'Legacy Console' tile.

Most screens, such as the Applications screen shown here, have search and auto-refresh functionality. 

Although with some modifications, all of the legacy screens, such as the performance graph shown here, still exist and are readily accessible in the console application. 


Saturday, April 26, 2014

Gort, Klaatu barada Esperanto

Esperanto. It's a made up language. No really! It's a full-blown, complete language, but it is entirely made up. It's what's sometimes referred to as a synthetic or constructed language – as opposed to a natural language, such as English, Chinese, or Salish. (Interesting note: it's a controversial issue, but, some people think that Salishan languages don't have nouns). Worldwide, it has several million speakers, and has clubs formed around it, blogs in it, newspapers, etc. You can take online or mail-in courses to learn it. There are many books written in it, or translated to it. Google Translate even translates to/from Esperanto. The language has a long history, being formed in the late 1800s by one Ludwik Lazarus Zamenhof (a.k.a. “Doktoro Esperanto”), and is the world's most successful constructed language. The language's name means “one who hopes”, and it was the hope of the eponymous Doktoro that it would be a neutral language, not tied to any place or ideology, which could be used to promote peace and understanding.

It even has its own flag!

I have had an off again, on again relationship with Esperanto. But don't cry for me, Esperanto – the truth is I've never left you. Sometimes I have left it alone for years, and then come back to study it with abandon for a time. I've always found it fascinating. The grammar is clean, and the pronunciation is regular and easy. I find it fairly easy to read, at least with some of the simpler sentence constructs, and I like the complete regularity of the language. Esperanto is a flexional language, and the declensions of the various word types are completely, strictly regular. There are no special cases. There are no “yes, but...” rules. There aren't any exceptions to the grammar rules – none. Which has always made me think that this would be an excellent language for use with computer programs. It should, in theory, be easy to write a program that parses a sentence in Esperanto and returns some sort of object describing the parsed sentence. Add a little vocabulary, and you could provide a language interface to your program. A little rudimentary language learning on your part, and you've got a great way to control a program: a natural language interface to turn lights on and off in your house, control a robot, change TV channels, or whatever you like!

It seems like somebody should have already written one of these. It seems an obvious thing to do. I looked all over the web, but I couldn't find a simple parser for Esperanto that would give me an object describing a sentence. So I wrote one.

A word about Grammar

What, you're still here? But, I said “grammar”. Surely you're cringing.

The thing is, people mean different things when they use the word “grammar”. Many people take a prescriptive, lawyerly approach, where they may judge some utterance to be either “good” or “bad” grammar. This type of grammar, where rules are prescribed and are not to be broken, is commonly referred to as a “classical grammar”, and that approach has no use here.


In linguistics, there are many ways to approach grammar, and many meanings to the word. The meaning I am using here is that of a descriptive grammar – that is, I'm not concerned with what is right or wrong, but rather with what rules can describe the underlying constructs of a particular language. For instance, how would I find a set of rules that, given a particular concept or thought, would tell me how to piece together an utterance in the target language and have it adhere to what other speakers of the language expect? This is called a “generative grammar”, a concept championed by Noam Chomsky in several of his early books, notably Syntactic Structures. Dr. Chomsky has moved somewhat away from his early theories, but they will work for our purposes. Another key concept in what I wrote is termed a “transformational grammar” (also a concept from Chomsky), which allows a structure in one form to be transformed into another by some rule or set of rules. In practical terms, this lets us do things such as deal with repeating conjunctions (“this and that and these and those”), or with the question construct in Esperanto (“Ĉu” at the beginning of a sentence turns it into a question, and in our object model that should become a top level node containing “Ĉu”, with the rest of the sentence in a subnode). Transformational rules allow us to do these transforms.

Some approaches to computational linguistics attempt to discover these underlying grammatical constructs in a language. Generally, a very large corpus of sample material in the target language is fed into a computer program, and statistical and probabilistic methods are used to find relationships within the material. This is certainly an interesting subject, as it leads directly into parallel processing and massively parallel systems (Hadoop and Google's MapReduce come to mind), but it is not the approach I used here. We already know the structure of Esperanto. It is well defined, and clearly laid out in many books, online courses, and so forth. What I intended to do was to take that written form, intended for humans, and turn that set of rules into a computer program that could recognize and digest utterances in our well-defined target language. Note also that this is one-way. The program does not take concepts and generate utterances in Esperanto; it only pulls apart existing utterances.

As the source of the rules and grammar, I used A Complete Grammar of Esperanto, by Ivy Kellerman Reed. This is one of the seminal works on Esperanto, and is available in many places online for free, such as Project Gutenberg. I started reading the book as a refresher in Esperanto, and, as I went, I wrote the computer program, using the examples and exercises in the book as test inputs. I haven't yet completed the book, but so far, so good! And so far, quite good enough for the intended purpose, which is to be able to control computer programs or devices (such as a robot) by typing Esperanto sentences.

Some Things This Program Does Not Do

It doesn't care about meaning, for one. It strictly concerns itself with syntax, syntactically decomposing utterances in Esperanto. Semantics and deeper meaning are left to the program consuming the resultant object from this program.

It is also strongly bound to the syntax of the target language, and does not learn. If it encounters an utterance that it cannot decipher, it will produce an object that is partially or wholly in error – however, since the rules of the language are so firm, it's pretty easy to identify input that does not conform to those rules.

It's not a translator. While it is conceivable that the output from this program could be passed through another program with a different set of rules and transforms, and produce utterances in some other language, I have not written any program to do that. I mainly intended this as input to some other program or device, in order to direct its actions.

How it Works


Subphrase Grouping

The code does two passes through an utterance. The first is the Grouper, which does more or less low-level parsing: dividing sentences into subphrases on commas or other punctuation, determining word type via endings, and recognizing special marker words such as "kaj" ("and") or "Ĉu" (the question marker). It also groups words into subphrases depending on their relationships to one another – for instance, adjectives and nouns that are adjacent to each other and agree in case are grouped into a subphrase. At the end of this pass, which recursively looks at the phrases it parses, the phrases and subphrases are grouped into a tree structure, with the individual words as leaf nodes. The following picture illustrates:

In the above example, the sentence "Fortaj ĉevaloj marŝas kaj kuras en la verdaj kampoj." ("[The] strong horses walk and run in the green fields.") is parsed. We can see that the parser has created a top-level group on punctuation (in this case, a period, "."), under which it has several subgroups corresponding to things like noun phrases and verb phrases. For more complicated sentences, we may also see prepositional phrases and so forth. The program keeps things as simple as it can – for instance, if a word like a pronoun can be parsed by regular syntactic rules (such as by its ending), it doesn't get special processing. Remember, the program parses syntactically only, and disregards meaning in its processing.

In this example, the noun phrase is expanded, and we can see that it has grouped "Fortaj ĉevaloj" together, since they agree in number and type. The words have further attributes assigned to them, such as Fortaj being recognized as a plural adjective relating to the sentence's agent ("subject" could be considered analogous, in other classification systems). These attributes are simply that – text assigned to the word during processing, akin to a tag cloud or other classification system. This allows for other processing to be done past what this program does, opening the door for things like statistical analysis of text, k-means classification, and so forth. Or just simply acting on the input!

To illustrate, here is "Fortaj", in more detail:

We can see here that the program has recognized "Fortaj" as flexional, and that it has 100% confidence in its results. The program is set up to use probability to determine how to classify a word, but, as Esperanto is so regular, just simply recognizing the ending of the word has proved to be reliable so far, and probabilistic methods of identification have not yet been needed. The program has a lexicon, and further builds that lexicon during parsing, but, it does not have an entry for Fort[], which means that its Translation property on the word object is null. Again, the program operates syntactically only, and assumes that any translation or action done on words it parses is up to the consumer of its results. It's the calling program's responsibility to know what to do with the words and phrases it finds.
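Because the endings are so regular, a classifier needs little more than suffix checks. Here is a small sketch of that idea in Python (the project itself is C#; this is my own illustration, not its code), handling the noun/adjective/adverb/verb endings plus the plural -j and accusative -n suffixes, and ignoring particles and other special words:

```python
# Esperanto verb endings and their tenses/moods.
VERB_ENDINGS = {"as": "present", "is": "past", "os": "future",
                "us": "conditional", "u": "imperative", "i": "infinitive"}

def classify(word):
    """Classify an Esperanto word purely by its ending, returning a
    dict of attribute tags, roughly like the parser's attributes."""
    w = word.lower()
    attrs = {}
    # Strip the accusative -n and then the plural -j suffix, in that order
    # (Esperanto stacks them as root + o/a + j + n).
    if w.endswith("n"):
        attrs["case"] = "accusative"
        w = w[:-1]
    else:
        attrs["case"] = "nominative"
    if w.endswith("j"):
        attrs["number"] = "plural"
        w = w[:-1]
    else:
        attrs["number"] = "singular"
    if w.endswith("o"):
        attrs["type"] = "noun"
    elif w.endswith("a"):
        attrs["type"] = "adjective"
    elif w.endswith("e"):
        attrs["type"] = "adverb"
    else:
        for ending, tense in VERB_ENDINGS.items():
            if w.endswith(ending):
                attrs["type"], attrs["tense"] = "verb", tense
                break
    return attrs

print(classify("Fortaj"))    # adjective, plural, nominative
print(classify("ĉevalojn"))  # noun, plural, accusative
```

This is of course only the easy 100%-confidence path the post describes; a sketch like this mislabels particles such as "la", which is one reason the real parser treats special marker words separately.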



After groupings have been done, the next processing pass does any transformations that need to be done. At this point, we have a tree structure, as described above, which should be completely parsed. The transformation pass takes this parsed tree and further applies rules in order to produce the final result. For instance, one of the rules we have been talking about is that “Ĉu” marks a question. The transformation for this is that “Ĉu” forms a top node, with the rest of the sentence underneath it. A somewhat similar rule is to group subphrases related to conjunctions under that conjunction -- for example, the equivalent of

this  (abstract noun-like object)
and   (conjunction)
that  (abstract noun-like object)
and   (conjunction)  
these (abstract noun-like object)

could be transformed to:

and (conjunction) 
     this  (abstract noun-like object)
     that  (abstract noun-like object)
     these (abstract noun-like object)
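That regrouping is easy to sketch. In Python (invented names; the real code operates on the parser's Group tree, not flat lists):

```python
def regroup_conjunctions(nodes):
    """Transform a flat alternating list like [x, 'kaj', y, 'kaj', z]
    into a single conjunction node with the operands as children.
    Returns the list unchanged if no conjunction is present."""
    if "kaj" not in nodes:
        return nodes
    operands = [n for n in nodes if n != "kaj"]
    return [("kaj", operands)]

flat = ["this", "kaj", "that", "kaj", "these"]
print(regroup_conjunctions(flat))  # [('kaj', ['this', 'that', 'these'])]
```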

 For the transform mentioned above, here are before and after examples. Note in the before example, the word "Ĉu" does not have anything special about it, although the grouper did recognize it as a word needing a transformation. In the second example, we can see that new subgroups have been created under the word (see near the highlighted line in the second picture, below). Similarly, the question word or any of its subwords and subphrases could be marked in any number of ways (for instance, attributes on the words) for further processing or as a marker to the consuming program to take some action.

 After transform:

The transforms are implemented using the Visitor pattern, where each type of transform is a small class that is handed a Group object. The object performs its transform, and returns true if it found anything on which to operate. The main loop recurses over the tree until all transforms return false, meaning they found nothing left to change. Each Transformer class is a small, simple class that does one operation on its input group.
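The shape of that fixed-point loop can be sketched in Python (the project is C#; the transformer shown and all names here are invented for illustration):

```python
class StripEmptyGroups:
    """One small transformer: remove empty child groups.
    Returns True if it changed anything, following the pattern of
    each transform reporting whether it found work to do."""
    def visit(self, group):
        before = len(group["children"])
        group["children"] = [c for c in group["children"]
                             if not (isinstance(c, dict) and not c["children"])]
        return len(group["children"]) != before

def transform_tree(root, transformers):
    """Apply every transformer to every group in the tree, repeating
    whole passes until no transformer reports a change (a fixed point)."""
    def one_pass(group):
        changed = False
        for t in transformers:
            if t.visit(group):
                changed = True
        for child in group["children"]:
            if isinstance(child, dict):
                if one_pass(child):
                    changed = True
        return changed
    while one_pass(root):
        pass
    return root

# Strings stand in for words; dicts stand in for Group nodes.
tree = {"children": ["Ĉu", {"children": []},
                     {"children": ["vi", {"children": []}]}]}
transform_tree(tree, [StripEmptyGroups()])
print(tree)  # {'children': ['Ĉu', {'children': ['vi']}]}
```

Running each transformer to exhaustion this way means the transformers can stay tiny and single-purpose, since any work one of them exposes will be picked up on a later pass.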

Next Steps

The next steps that this program needs to take are more transformations, specifically ones that explicitly express grammar rules of Esperanto, or ones that mark leaf nodes in specific ways -- for instance, marking all leaf nodes that are related to a pronoun in the sentence, and so forth.

The program also needs to further encapsulate the chapters of the source text (A Complete Grammar of Esperanto), and make sure that it covers any of the more esoteric rules of the language.

Other next steps could include loosening its coupling to Esperanto specifically, making it able to mine sample text to find its own rules and transformations -- that is, making it learn on its own. This is, obviously, a large undertaking, and I have purposely left it out of scope for this program.

Another possible expansion is to do the exact reverse of parsing -- take some programmatic object, and use the rules defined here to not parse text, but to generate correct text in Esperanto.

Where to Download the Code


The code for this project is available at the link below. Please let me know if you find it useful, and, if you expand on it, please let me know, so we can roll your changes in!




Linguistics is a fascinating subject, but it can be hard to figure out where to start. For a good overview, I recommend getting a copy of Teach Yourself Linguistics, by Jean Aitchison, who is the Emeritus Rupert Murdoch Professor of Language and Communication at Oxford. Despite her rather formidable title, Dr. Aitchison has written a very straightforward and easy to read introduction to linguistics. For an interesting read on language formation and development, The Language Instinct by Steven Pinker, a well known professor in the Psych department at Harvard, is a classic. To go a bit deeper on a much more technical level, Syntactic Structures by Noam Chomsky dramatically changed the field of linguistics when it was published. It's worth following that link just to read the introduction and overview. Finally, for a good layman's introduction to everything related to cognition, neuroscience, and, of course, language and linguistics, I highly recommend the Brain Science Podcast, by Dr. Ginger Campbell. Dr. Campbell gives overviews of many fascinating books, far more than I for one could ever read, and has interesting interviews with leading researchers and authors working in fields such as cognitive science and neuroscience. I really can't recommend her podcast strongly enough. And she also goes to Dragon*Con, so how cool is that?!

And one more thought. I know, I know, as you've been reading this, you have been thinking, "Yes, yes, this is all fine and good, but, what about a free editor for Ancient Egyptian hieroglyphics?" So, without further ado or non sequiturs, here it is. JSesh, open source and available for most platforms. It's pretty awesome!