
Thursday, August 23, 2012

In which I bark orders at a robot, and it actually listens!


So, here's my latest adventure with HouseBot. Since I have the Kinect, and since the Kinect has directional microphones in it, I decided to do a little experimenting with the Microsoft Speech SDK. In an earlier iteration of HouseBot (well, same platform, much less powerful computer, much more finicky drive wheels... lots of improvements since then), I had played around with this too, but without quite as much success. The main reason was that I was using a microphone plugged directly into the computer instead of the Kinect microphones. With that microphone, you basically had to be right on top of it (and sometimes shout) to get it to respond. With the Kinect, since it's designed for gaming across a living room, you can be on the other side of the room and the mic will still pick you up.

So, I think I'll just jump right into the finished result, and then dive a little into how it's done. Here's a video of HouseBot responding to voice commands.

[Video: HouseBot responding to voice commands]
As you can see, she's still not going to get me a beer. *sigh*... Science, such a harsh mistress you are...


So how does it work?

In order to do something like this, you need to do a few things:

1. Build a robot
2. Build a vocabulary of the expected commands
3. Stream audio from the Kinect to the voice recognition code
4. On recognizing a command, take some action

Underneath it all, this uses the Microsoft Speech SDK, the Kinect for Windows SDK, and the Kinect for Windows Developer Toolkit (more info here). The code shown here, at least the setup code in the recognizer object, is an adaptation of the C# speech recognition example in the toolkit. I highly recommend the toolkit! I wrote all this in C# using Visual Studio 2010, but there's no reason it couldn't be developed in one of the Express editions or, in one of my favorite freeware products, SharpDevelop (which is also great for developing in IronPython!).

Keep in mind, there's a lot of support code for the robot that I'm not going to show here. It'll be fairly obvious where I'm calling into the robot, but the basic idea is that the robot has various tools attached to it (objects that implement ITool), such as the light, the turret, and the speech generator. The robot itself is a platform (implements IMobilityPlatform) and a sensor provider (implements ISensorProvider). So, if you see something that says UseTool, that method call is asking the tool to perform its core action (or some alternate action), and if you tell the platform to turn or move forward, that's the robot itself. Some tools, such as the turret or the voice generation tool, also have specialized actions, such as Say() on the voice tool.
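To make that a little more concrete, here's a rough sketch of what those interfaces look like, inferred from how the code below uses them. The real definitions in my framework differ in the details, and ISensor here is just a stand-in name:

using System.Collections.Generic;

// Inferred sketch -- the real Robotics.Framework interfaces may differ.
public interface ITool
{
    string Name { get; set; }
    bool InUse { get; set; }
    bool UseTool();                // perform the tool's core action
    bool UseTool(int action);      // perform an alternate action
    void Stop();
    bool PositionAt(double degreesX, double degreesY, double radius);
}

public interface IToolProvider
{
    IEnumerable<ITool> Tools { get; }    // e.g. the light, turret, voice
}

public interface IMobilityPlatform
{
    void Forward(int distance);    // distance units are whatever the drive code expects
    void Backup(int distance);
    void Turn(int degrees);        // negative turns left, positive turns right
    void Stop();
}

public interface ISensorProvider
{
    IDictionary<string, ISensor> Sensors { get; }    // keyed by sensor name
}

// Stand-in base type; the post only shows concrete sensors like StringSensor.
public interface ISensor
{
    string Name { get; set; }
}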

Here's what it looks like to consume all of this. First, a little setup:


      ITool recognizer;
      ITool light;
      KinectViewAngle kinectAngle;
      ArduinoStepperMotor turret;
      Voice voice;
      StringSensor speech;
      ObjectSensor objectSensor;
      Robotics.Framework.Oscillators.Timeout timeout = new Robotics.Framework.Oscillators.Timeout(new TimeSpan(hours: 0, minutes: 0, seconds: 2));

      public override void Setup()
        {
            // Look up the tools and sensors this behavior needs, mostly by name.
            recognizer = (Platform as IToolProvider).Tools.Where(tool => tool.Name == "Speech Recognizer").First();
            turret = (ArduinoStepperMotor)((Platform as IToolProvider).Tools.Where(tool => tool.Name == "Head Position Motor").First());
            voice = (Voice)((Platform as IToolProvider).Tools.Where(tool => tool.Name == "Speech").First());
            speech = (StringSensor)((Platform as ISensorProvider).Sensors["Speech Recognition"]);
            light = ((Platform as IToolProvider).Tools.Where(tool => tool.Name == "Light").First());
            kinectAngle = (KinectViewAngle)((Platform as IToolProvider).Tools.Where(tool => tool.Name == "Kinect View Angle 1").First());
            objectSensor = (ObjectSensor)((Platform as ISensorProvider).Sensors.Values.Where(sensor => sensor is ObjectSensor).First());
           
            speech.SensorChanged += new SensorEvent(speech_SensorChanged);

            recognizer.UseTool();
            timeout.Reset();
            voice.Say("ready.");
        }

And then we just wait for voice commands. A note: the recognizer object does the actual speech recognition, but to make things easier, it fills in a value on a StringSensor object. That's a base sensor type I use on the robotics platform to represent anything sensed as a string value (RFID tag reads, recognized speech, codes from an IR remote receiver, things like that).
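If it helps, here's a minimal sketch of the StringSensor idea. The SensorEvent delegate and SensorEventArgs type show up in the handler below, but their exact shapes (and this whole class) are my assumptions here:

using System;

public delegate void SensorEvent(object sender, SensorEventArgs args);

public class SensorEventArgs : EventArgs
{
}

// Assumed sketch of the framework's StringSensor.
public class StringSensor
{
    string stringValue = String.Empty;

    public string Name { get; set; }
    public event SensorEvent SensorChanged = delegate { };

    public string StringValue
    {
        get { return stringValue; }
        set
        {
            stringValue = value;
            // Raise the change event so subscribers (like the command
            // interpreter below) can react to the newly sensed string.
            SensorChanged(this, new SensorEventArgs());
        }
    }
}

This is the code that interprets commands: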

        bool isProcessing = false;

        // State flags. Say() is a small helper elsewhere in the class that
        // presumably wraps voice.Say() and honors isQuietMode.
        bool isWaitingForCommand = false;
        bool isUsingTurret = false;
        bool isQuietMode = false;

        void speech_SensorChanged(object sender, SensorEventArgs args)
        {
            if (isProcessing) return;

            isProcessing = true;

            if (!string.IsNullOrEmpty(speech.StringValue) && timeout.IsElapsed)
            {
                timeout.Reset();
                if (isWaitingForCommand)
                {
                    switch (speech.StringValue)
                    {
                        case "speech on":
                            isQuietMode = false;
                            Say("Speech is turned on now.");
                            break;

                        case "speech off":
                            Say("Speech is turned off.");
                            isQuietMode = true;
                            break;

                        case "center":
                            Say("Centering the turret.");
                            turret.Center();
                            break;

                        case "forward":
                            Say("Moving forward.");
                            Platform.Forward(16);
                            break;

                        case "backward":
                            Say("Moving back.");
                            Platform.Backup(10);
                            break;

                        case "use turret":
                            Say("Commands will move the turret.");
                            isUsingTurret = true;
                            break;

                        case "use motors":
                            Say("Commands will move the drive motors.");
                            isUsingTurret = false;
                            break;

                        case "left":
                            if (isUsingTurret)
                            {
                                Say("Turning turret left.");
                                turret.MoveDegrees(-45);
                            }
                            else
                            {
                                Say("Turning left.");
                                Platform.Turn(-45);
                            }
                            break;

                        case "right":
                            if (isUsingTurret)
                            {
                                Say("Turning turret right.");
                                turret.MoveDegrees(45);
                            }
                            else
                            {
                                Say("Turning right.");
                                Platform.Turn(45);
                            }
                            break;

                        case "turn around":
                            Say("Turning around.");
                            Platform.Turn(180);
                            break;

                        case "get beer":
                            Say("Ha. Go get your own beer! They are in the fridge.");
                            kinectAngle.PositionAt(0, 15, 0);
                            turret.MoveDegrees(45);
                            Thread.Sleep(1000);
                            kinectAngle.PositionAt(0, 0, 0);
                            turret.Center();
                            break;

                        case "stop":
                            Platform.Stop();
                            Say("Stopping. Say 'row bought' to continue.");
                            isWaitingForCommand = false;
                            break;

                        case "light on":
                            light.UseTool();
                            break;

                        case "light off":
                            light.Stop();
                            break;

                        case "look up":
                            kinectAngle.PositionAt(0, 15, 0);
                            break;

                        case "look down":
                            kinectAngle.PositionAt(0, -15, 0);
                            break;

                        case "look middle":
                            kinectAngle.PositionAt(0, 0, 0);
                            break;

                        case "good job":
                            Say("Thank you.");
                            break;

                        case "robot":
                            Say("I am awaiting commands.");
                            break;

                        case "status":
                            Say(objectSensor.GetStatus());
                            break;

                        case "yes":
                        case "no":
                            break;

                        default:
                            //Say(string.Format("The phrase, '{0}', does not map to a command.", speech.StringValue));
                            break;

                    }
                }
                else
                {
                    if (speech.StringValue == "robot")
                    {
                        isWaitingForCommand = true;
                        Say("I am listening.");
                    }

                    if (speech.StringValue == "good job")
                    {
                        Say("Thank you.");
                    }
                }
            }

            speech.StringValue = String.Empty;
            isProcessing = false;
        }
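
One more framework note: the Timeout object used above is just a debounce guard, so a single utterance doesn't fire a command twice. The real class isn't shown in this post, but an assumed minimal version looks something like this:

using System;

// Assumed sketch of Robotics.Framework.Oscillators.Timeout: IsElapsed turns
// true once the configured duration has passed since the last Reset().
public class Timeout
{
    readonly TimeSpan duration;
    DateTime lastReset = DateTime.MinValue;

    public Timeout(TimeSpan duration)
    {
        this.duration = duration;
    }

    public bool IsElapsed
    {
        get { return DateTime.UtcNow - lastReset >= duration; }
    }

    public void Reset()
    {
        lastReset = DateTime.UtcNow;
    }
}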

Behind the scenes, there's a little setup going on. In the recognizer object, we build a vocabulary, get a reference to the Kinect audio stream, and create the speech recognition engine using the Kinect-specific recognizer, like so:


namespace Robotics.Platform.HouseBot.Kinect
{
    using System;
    using System.Threading;
    using Microsoft.Kinect;
    using Microsoft.Speech.AudioFormat;
    using Microsoft.Speech.Recognition;
    using Robotics.Framework.Tools;

    public class KinectSpeechRecognizer : ITool
    {
        const double ConfidenceThreshold = 0.5;  // Speech utterance confidence below which we treat speech as if it hadn't been heard

        public delegate void SpeechRecognizedEventHandler(object sender, SpeechRecognizedEventArgs args);
        public event SpeechRecognizedEventHandler SpeechRecognized = delegate { };

        RecognizerInfo recognizer;
        private KinectSensor sensor;
        private SpeechRecognitionEngine speechEngine;

        public KinectSpeechRecognizer(KinectSensor newSensor)
        {
            sensor = newSensor;
            InUse = false;
        }

        private RecognizerInfo GetKinectRecognizer()
        {
            var recognizers = SpeechRecognitionEngine.InstalledRecognizers();
            foreach (RecognizerInfo recognizer in recognizers)
            {
                string value;
                recognizer.AdditionalInfo.TryGetValue("Kinect", out value);
                if (value == "True" && recognizer.Culture.Name == "en-US")
                {
                    return recognizer;
                }
            }

            return null;
        }

        private void InitializeSpeechRecognition()
        {
            int i = 0;

            while (recognizer == null && ++i < 10)
            {
                recognizer = GetKinectRecognizer();
                if (recognizer == null)
                    Thread.CurrentThread.Join(500);
            }

            if (recognizer == null)
                return;

            speechEngine = new SpeechRecognitionEngine(recognizer.Id);

            var phrases = new Choices();

            phrases.Add(new SemanticResultValue("go forward", "forward"));
            phrases.Add(new SemanticResultValue("back up", "backward"));
            phrases.Add(new SemanticResultValue("stop", "stop"));
            phrases.Add(new SemanticResultValue("turn left", "left"));
            phrases.Add(new SemanticResultValue("turn right", "right"));
            phrases.Add(new SemanticResultValue("turn the light on", "light on"));
            phrases.Add(new SemanticResultValue("turn the light off", "light off"));
            phrases.Add(new SemanticResultValue("use the turret", "use turret"));
            phrases.Add(new SemanticResultValue("use the drive motors", "use motors"));
            phrases.Add(new SemanticResultValue("center", "center"));
            phrases.Add(new SemanticResultValue("robot", "robot"));
            phrases.Add(new SemanticResultValue("look up", "look up"));
            phrases.Add(new SemanticResultValue("look down", "look down"));
            phrases.Add(new SemanticResultValue("look straight", "look middle"));
            phrases.Add(new SemanticResultValue("good job", "good job"));
            phrases.Add(new SemanticResultValue("get me a beer", "get beer"));

            var grammarBuilder = new GrammarBuilder { Culture = recognizer.Culture };
            grammarBuilder.Append(phrases);

            var grammar = new Grammar(grammarBuilder);
            speechEngine.LoadGrammar(grammar);

            speechEngine.SpeechRecognized += speechEngine_SpeechRecognized;

            // The Kinect mic array delivers 16 kHz, 16-bit mono PCM
            // (32000 bytes/sec average, 2-byte block align).
            var stream = sensor.AudioSource.Start();
            speechEngine.SetInputToAudioStream(stream, new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));

            speechEngine.RecognizeAsync(RecognizeMode.Multiple);
        }

        // Placeholder handlers: not wired up above, but handy hooks if you
        // want to trace detection, hypotheses, or rejections while debugging.
        void speechEngine_SpeechDetected(object sender, SpeechDetectedEventArgs e)
        {
            //  throw new NotImplementedException();
        }

        void speechEngine_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            // throw new NotImplementedException();
        }

        void speechEngine_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
        {
            // throw new NotImplementedException();
        }

        Robotics.Framework.Oscillators.Timeout timeout = new Robotics.Framework.Oscillators.Timeout(new TimeSpan(days: 0, hours: 0, minutes: 0, seconds: 1, milliseconds: 500));
        void speechEngine_SpeechRecognized(object sender, Microsoft.Speech.Recognition.SpeechRecognizedEventArgs e)
        {
            if (!timeout.IsElapsed)
                return;
            if (e.Result.Confidence >= ConfidenceThreshold)
                SpeechRecognized(this, new SpeechRecognizedEventArgs { RecognizedSpeech = e.Result.Semantics.Value.ToString(), });

            timeout.Reset();
        }

        public bool InUse { get; set; }

        public string Name { get; set; }

        public bool UseTool()
        {
            if (speechEngine == null)
                InitializeSpeechRecognition();
            else
                speechEngine.RecognizeAsync(RecognizeMode.Multiple);

            return true;
        }

        public bool UseTool(int action)
        {
            throw new NotImplementedException();
        }

        public void Stop()
        {
            if (speechEngine != null)
                speechEngine.RecognizeAsyncStop();
        }

        public bool PositionAt(double degreesX, double degreesY, double radius)
        {
            throw new NotImplementedException();
        }
    }
}
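
For completeness, the wiring between the recognizer tool and the speech StringSensor happens in the robot's support code, which I'm not showing. Roughly, and with made-up variable names, it looks like this (the custom SpeechRecognizedEventArgs with its RecognizedSpeech property is the one declared in the tool above):

// Assumed wiring sketch -- kinectSensor and the registration plumbing are stand-ins.
var recognizerTool = new KinectSpeechRecognizer(kinectSensor) { Name = "Speech Recognizer" };
var speechSensor = new StringSensor { Name = "Speech Recognition" };

recognizerTool.SpeechRecognized += (sender, args) =>
{
    // Setting StringValue raises SensorChanged, which runs speech_SensorChanged above.
    speechSensor.StringValue = args.RecognizedSpeech;
};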

So that's about it. One hurdle I had to overcome was making sure I had references to the correct SDK assemblies. When I set the reference to the Microsoft.Kinect assembly through the IDE, things didn't work correctly; I had to look at the csproj file in the Developer Toolkit example and manually edit my csproj to match. Once that was figured out, it was smooth sailing. Play with the confidence threshold if you get spurious recognitions you don't want. At one point I had it set to 0.9, meaning anything under 90% certainty gets ignored, and that actually seemed like a pretty good setting.

Enjoy!