Kinecting AR Drone Pt. I – Basic UI, Kinect camera & voice recognition

I’m back! One thing I forgot in my introduction post – what is the point of this series? The AR Drone is just a quadricopter, so what is the link with the Kinect?
The answer is simple – There is none.

So let’s create one! I had an idea – why not control my drone with my Kinect? We’ll turn the drone into a voice-and-gesture-controlled drone, integrate its camera into an application, and fool around with some tricks and its LEDs.
Because of the large amount of content, this tutorial is spread over three parts.

As we’ve seen in my Kinect Television tutorial, connecting to our Kinect and displaying the camera is easy!
I recommend reading that tutorial before starting, since I won’t go into detail on this here.

What you will learn

In part I we will learn the following things -

  • Integrate Kinect sensor in our application
  • Visualize the Kinect-state
  • Visualize the camera
  • Add speech recognition

Prerequisites

In order to follow this tutorial you’ll need a Kinect sensor (I’m using Kinect for Windows v1) and the following things –

  • Basic C# & WPF skills
  • Kinect for Windows SDK v1.8 (link)
  • Microsoft Speech Platform SDK v11 (link)

Chapter overview

  1. Exploring the template
  2. Connecting to our Kinect
  3. Visualizing the Kinect Camera
  4. Adding speech recognition

I. Exploring the template

The tutorial template is a basic WPF application that will visualize our Kinect (2) & drone (4) cameras as well as the state of each device (1 & 3).
In area 5 we will log output such as our recognized commands.

Template Overview

You can download the template here.

II. Connecting to our Kinect

First things first – since ‘Microsoft.Kinect’ isn’t a built-in namespace, we need to reference the SDK. Right-click References > Add Reference > Browse to C:/Program Files/Microsoft SDKs/Kinect/v1.8/Assemblies/Microsoft.Kinect.dll (or type it in the search bar) and include a ‘using’ in MainWindow.xaml.cs.

using Microsoft.Kinect;

We’re ready to rock ‘n roll.

Starting our sensor

Let’s start by creating a method called InitializeKinect that will be called in our constructor.

public MainWindow()
{
    InitializeComponent();

    // Kinect initialization
    InitializeKinect();
}

In this method we want to get the first KinectSensor-object that has a KinectStatus.Connected-status. This object is a representation of a physical sensor.

Once we’ve acquired the first sensor we’ll pass it into a new method called StartSensor that will start the sensor and required streams.

private void InitializeKinect()
{
    // Get current running sensor
    KinectSensor sensor = KinectSensor.KinectSensors.FirstOrDefault(sens => sens.Status == KinectStatus.Connected);

    // Initialize sensor
    StartSensor(sensor);
}

Starting the sensor is easily done by calling the Start-method, after you’ve enabled the stream(s) you want.
For now we’ll just start the sensor, after checking that the FirstOrDefault-query in InitializeKinect actually found a connected sensor – if not, we skip the method since there is no sensor to start.
It’s also important to store the representation in a variable for later use, e.g. _currentSensor.

/// <summary>
/// Representation of our Kinect-sensor
/// </summary>
private KinectSensor _currentSensor = null;

/// <summary>
/// Start a new sensor
/// </summary>
private void StartSensor(KinectSensor sensor)
{
    // Avoid crashes
    if (sensor == null)
	return;

    // Save instance
    _currentSensor = sensor;

    // Start sensor
    _currentSensor.Start();
}

The sensor will stay connected and running even when your application is shut down.
This means that when our application is closing we need to tell the sensor to shut down, if it is running.

We can do this in the Closing-event of our MainWindow where we call the Stop-method of our KinectSensor.

private void OnClosing(object sender, CancelEventArgs e)
{
    if(_currentSensor != null && _currentSensor.Status == KinectStatus.Connected)
	_currentSensor.Stop();
}
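
Note that this handler still needs to be wired up to the window’s Closing-event. You can do that in XAML or in code-behind – a minimal sketch, assuming you hook it up in the constructor:

public MainWindow()
{
    InitializeComponent();

    // Stop the sensor when the window closes
    Closing += OnClosing;

    // Kinect initialization
    InitializeKinect();
}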

Visualizing sensor state changes

Now that our sensor is running it might come in handy to see what its current status is – it might, for example, lose power.
For that reason I created a user control called KinectInfoControl that we will add to our UI.

Kinect State Control

Since this tutorial is focusing on the Kinect I won’t explain the control, but you can download the code here.
It basically exposes a property called CurrentStatus that we will use to databind our KinectStatus to.
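
For reference, a bindable property on a user control is declared as a DependencyProperty. Here is a minimal sketch of how CurrentStatus might be declared – the downloadable control may differ:

public partial class KinectInfoControl : UserControl
{
    public static readonly DependencyProperty CurrentStatusProperty =
        DependencyProperty.Register("CurrentStatus",
                                    typeof(KinectStatus),
                                    typeof(KinectInfoControl),
                                    new PropertyMetadata(KinectStatus.Undefined));

    public KinectStatus CurrentStatus
    {
        get { return (KinectStatus)GetValue(CurrentStatusProperty); }
        set { SetValue(CurrentStatusProperty, value); }
    }
}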

Based on the CurrentStatus-property, two converters will convert that value into something nice-looking.

  • KinectStatusToBrushConverter – Converts the KinectStatus to a Brush and can be Green (Connected), Red (Disconnected), Orange (Others – Warning).
    (code)
  • EnumToStringConverter – Converts the enumeration value to a decent string representation
    (code)

Add the user control and converters to your project and make sure that the references match your project.
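
If you’d rather write the converters yourself, here is a minimal sketch of what KinectStatusToBrushConverter could look like (it needs usings for System.Globalization, System.Windows.Data and System.Windows.Media; the downloadable version may differ):

public class KinectStatusToBrushConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
    {
        // Green when connected, red when disconnected, orange for everything else (warnings)
        switch ((KinectStatus)value)
        {
            case KinectStatus.Connected:
                return Brushes.Green;
            case KinectStatus.Disconnected:
                return Brushes.Red;
            default:
                return Brushes.Orange;
        }
    }

    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
    {
        // We only bind one-way
        throw new NotImplementedException();
    }
}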

To close the gap between our MainWindow and the user control we will create a property that we will use to databind directly to the UC.

private KinectStatus _kinectStatus = KinectStatus.Error;
public KinectStatus KinectStatus
{
    get { return _kinectStatus; }
    set
    {
		if (_kinectStatus != value)
		{
			_kinectStatus = value;
			OnPropertyChanged("KinectStatus");
		}
    }
}
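
The OnPropertyChanged-call only works if MainWindow implements INotifyPropertyChanged. If your version of the template doesn’t provide this helper yet, a minimal sketch (using System.ComponentModel):

public partial class MainWindow : Window, INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    /// <summary>
    /// Notify the UI that a property has changed
    /// </summary>
    private void OnPropertyChanged(string propertyName)
    {
        PropertyChangedEventHandler handler = PropertyChanged;

        if (handler != null)
            handler(this, new PropertyChangedEventArgs(propertyName));
    }
}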

Now that we have our property set for databinding we can add our KinectInfoControl to our UI and bind the KinectStatus property.

 <Grid x:Name="Kinect"
	  ...
	  Background="White">
	<Grid.ColumnDefinitions>
		<ColumnDefinition />
		<ColumnDefinition />
	</Grid.ColumnDefinitions>
	<Grid.RowDefinitions>
		<RowDefinition />
		<RowDefinition Height="480px" />
	</Grid.RowDefinitions>

	<Label ... />
	<Image ... />
	
	<uc:KinectInfoControl CurrentStatus="{Binding KinectStatus, ElementName=Window}"
				Grid.Row="0"
				Grid.Column="1"
				VerticalAlignment="Center"
				HorizontalAlignment="Right"
				Margin="25,15" />
</Grid>

Now that everything is set we’re ready to visualize the status.
Every time the sensor state changes we’d like to save that state and update the UI.

In order to do that we’ll need to listen to the StatusChanged-event.

KinectSensor.KinectSensors.StatusChanged += OnKinectStatusChanged;
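
A good place to subscribe is the InitializeKinect-method we wrote earlier, before we start the sensor – for example:

private void InitializeKinect()
{
    // Get current running sensor
    KinectSensor sensor = KinectSensor.KinectSensors.FirstOrDefault(sens => sens.Status == KinectStatus.Connected);

    // Listen for status changes of all sensors
    KinectSensor.KinectSensors.StatusChanged += OnKinectStatusChanged;

    // Initialize sensor
    StartSensor(sensor);
}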

Every time the event fires we’ll check that the current sensor isn’t null and that the DeviceConnectionId is the same.
Once we’ve passed that test we copy the status to our KinectStatus-property.

/// <summary>
/// Process a Kinect status change
/// </summary>
private void OnKinectStatusChanged(object sender, StatusChangedEventArgs e)
{
    if (_currentSensor == null || _currentSensor.DeviceConnectionId != e.Sensor.DeviceConnectionId)
		return;

    // Save new status
    KinectStatus = e.Sensor.Status;
}

When you test the application you will see that the control still has the default value after we’ve started our sensor.
This is because the state hasn’t changed since we started.

You can copy the status from the sensor to our KinectStatus-property in the StartSensor-method to update the UI.

private void StartSensor(KinectSensor sensor)
{
    ...

    // Save sensor status
    KinectStatus = _currentSensor.Status;

    ...
}

III. Visualizing the Kinect Camera

Time for some data processing. Let’s start with visualizing the camera!

To do so we need to expand our StartSensor-method: we enable the ColorStream for a specific ColorImageFormat and add an event handler for the ColorFrameReady-event, where we will process the camera information.

Add this code before we call the Start()-method of our sensor.

private void StartSensor(KinectSensor sensor)
{
    ...

    // Initialize color & skeletal tracking
    _currentSensor.ColorStream.Enable(ColorImageFormat.RgbResolution640x480Fps30);

    // Sub to events
    _currentSensor.ColorFrameReady += OnColorFrameReadyHandler;

    ...
}

As we learned in the Kinect Television tutorial, there are a lot more ColorImageFormats, but this one is the most suitable for our application.
We could go for 1280×960, but that only runs at 12 frames per second, which results in choppier video. (The Kinect v2 camera will nail this!)

More info on the ColorImageFormat here.

Processing the color information

We need to create two global variables that we will use to process our color data.
The first variable is a WriteableBitmap that will be assigned to our Image-control. Unlike a standard Bitmap, which inherits from Image, a WriteableBitmap inherits from BitmapSource and is designed for rendering images on a per-frame basis, like we are going to do. I recommend reading the documentation.

The second variable is a byte array that will serve as a buffer between the ColorImageFrame and our WriteableBitmap.

/// <summary>
/// WriteableBitmap that will draw the Kinect video output
/// </summary>
private WriteableBitmap _cameraVision = null;

/// <summary>
/// Buffer to copy the pixel data to
/// </summary>
private byte[] _pixelData = new byte[0];

Every time the ColorFrameReady-event is fired we’ll use the event args to retrieve the ColorImageFrame. This frame holds all the color information we need to visualize it.
After opening the frame we want to check that it isn’t null, since it’s possible that a frame has been skipped.

On each frame we check whether our _pixelData array has a length of 0 – if so, this is the first frame we’ve captured and we still need to initialize our variables and assign the WriteableBitmap to our Image-control.

After we’ve initialized our variables we will copy the data from the frame into our buffer.

Last step is to simply write the buffer to the WriteableBitmap so the UI gets updated.
To do so we call the WritePixels-method, where we specify the dimensions of the image, our buffer, the stride and the starting index of our buffer.

/// <summary>
/// Process color data
/// </summary>
private void OnColorFrameReadyHandler(object sender, ColorImageFrameReadyEventArgs e)
{
    using (ColorImageFrame colorFrame = e.OpenColorImageFrame())
    {
		if (colorFrame == null)
			return;

		// Initialize variables
		if (_pixelData.Length == 0)
		{
			// Create buffer
			_pixelData = new byte[colorFrame.PixelDataLength];

			// Create output rep
			_cameraVision = new WriteableBitmap(colorFrame.Width,
												colorFrame.Height,

												// DPI
												96, 96,

												// Current pixel format
												PixelFormats.Bgr32,

												// Bitmap palette
												null);

			// Hook image to Image-control
			KinectImage.Source = _cameraVision;
		}

		// Copy data from frame to buffer
		colorFrame.CopyPixelDataTo(_pixelData);

		// Update bitmap
		_cameraVision.WritePixels(	// Image size
									new Int32Rect(0, 0, colorFrame.Width, colorFrame.Height),

									// Buffer
									_pixelData,

									// Stride
									colorFrame.Width * colorFrame.BytesPerPixel,

									// Buffer offset
									0);
    }
}

IV. Adding speech recognition

Creating our VoiceCommands-enum

I created an enumeration that will contain all the commands we will support through speech.
These commands will be prefixed with “Drone” to prevent unwanted recognitions.

using System.ComponentModel;

public enum VoiceCommand
{
	[Description("Unknown")]
	Unknown = 0,
	[Description("Takeoff")]
	TakeOff = 1,
	[Description("Land")]
	Land = 2,
	[Description("Emergency Landing")]
	EmergencyLanding = 4
}

Retrieving the Kinect recognizer

To build our grammar we need some information about the Kinect and its speech recognizer, represented as a RecognizerInfo-object. Your computer has several RecognizerInfo-objects installed, one for each speech recognizer on the system.

If we want to get the recognizer for our Kinect we need to loop over that collection and take the first result whose additional info contains a key “Kinect” with a value “True”.
Next to that we want to specify the ‘en-US’ language pack for English commands. Let’s create a method that returns the RecognizerInfo for our Kinect and call it GetKinectRecognizer.

private RecognizerInfo GetKinectRecognizer()
{
    /* Create a function that checks if the additional info contains a key called "Kinect" and if it's true.
     * Also check if the culture is en-US so that we're using the English pack
     */
    Func<RecognizerInfo, bool> matchingFunc = r =>
    {
		string value;
		r.AdditionalInfo.TryGetValue("Kinect", out value);

		return "True".Equals(value, StringComparison.InvariantCultureIgnoreCase) &&
							 "en-US".Equals(r.Culture.Name, StringComparison.InvariantCultureIgnoreCase);
    };

    return SpeechRecognitionEngine.InstalledRecognizers().FirstOrDefault(matchingFunc);
}
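
Note that SpeechRecognitionEngine, RecognizerInfo and the other speech types live in the Microsoft Speech Platform SDK, not in System.Speech. Make sure you reference Microsoft.Speech.dll and add the corresponding usings (assuming the v11 SDK from the prerequisites):

using Microsoft.Speech.AudioFormat;
using Microsoft.Speech.Recognition;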

Setting up our grammar and linking it to our Kinect

To set up our grammar we will use the following properties –

  • SpeechRecognitionEngine will be used to build our grammar and start recognizing speech commands and listen to the corresponding events
  • KinectAudioSource represents the audio from the microphone array
  • A dictionary with our speech commands & corresponding enumeration value

/// <summary>
/// The RecognitionEngine used to build our grammar and start recognizing
/// </summary>
private SpeechRecognitionEngine _recognizer;

/// <summary>
/// The KinectAudioSource that is used.
/// Basically gets the audio from the microphone array
/// </summary>
private KinectAudioSource _audioSource;
		
/// <summary>
/// All speech commands and actions
/// </summary>
private readonly Dictionary<string, object> _speechActions = new Dictionary<string, object>()
{
    { "Take off", VoiceCommand.TakeOff },
    { "Land", VoiceCommand.Land },
    { "May day", VoiceCommand.EmergencyLanding }
};

With our variables and our new GetKinectRecognizer-method in place, we’re ready to initialize our speech recognition in a new method called InitializeSpeech.

We’ll start by checking that a vocabulary is specified and that our sensor is still connected before we call our new GetKinectRecognizer-method. Once we have a RecognizerInfo we’ll create a new SpeechRecognitionEngine based on the ID of our RecognizerInfo.

Up next is creating a Choices object that will contain all the commands (keys) from our dictionary.
We append this object to our GrammarBuilder after appending ‘Drone’, which creates the command pattern ‘Drone _CMD_’ where _CMD_ is one of our Choices values.
Then we pass our builder into a new Grammar object that we load into our recognizer so it knows what it should be listening to.

After we’ve hooked into the recognized & rejected events we can get the audio stream from our KinectSensor-object and link it to the recognizer.
First we tell the Kinect to focus on the strongest audio source by setting the BeamAngleMode to Adaptive, since I’m using high-volume loudspeakers.
Starting the audio stream returns a Stream-object that we use, together with a set of basic SpeechAudioFormatInfo, to set the input of the recognizer.

The last thing we need to do is tell the recognizer to start recognizing asynchronously, passing in RecognizeMode.Multiple so it keeps listening after a match.

Call this method at the end of our StartSensor-method.

private void InitializeSpeech()
{
    // Check if vocabulary is specified
    if (_speechActions == null || _speechActions.Count == 0)
		throw new ArgumentException("A vocabulary is required.");

    // Check sensor state
    if (_currentSensor.Status != KinectStatus.Connected)
		throw new Exception("Unable to initialize speech if sensor isn't connected.");

    // Get the RecognizerInfo of our Kinect sensor
    RecognizerInfo info = GetKinectRecognizer();

    // Let user know if there is none.
    if (info == null)
		throw new Exception("There was a problem initializing Speech Recognition.\nEnsure that you have the Microsoft Speech SDK installed.");

    // Create new speech-engine
    try
    {
		_recognizer = new SpeechRecognitionEngine(info.Id);

		if (_recognizer == null) throw new Exception();
    }
    catch (Exception)
    {
		throw new Exception("There was a problem initializing Speech Recognition.\nEnsure that you have the Microsoft Speech SDK installed.");
    }

    // Add our commands as "Choices"
    Choices cmds = new Choices();
    foreach (string key in _speechActions.Keys)
		cmds.Add(key);

    /*
     * The GrammarBuilder defines the required "flow" of the possible commands.
     * You can insert plain text, or a Choices object with all our values in it, in our case our commands
     * We also need to pass in our Culture so that it knows what language we're talking
     */
    GrammarBuilder cmdBuilder = new GrammarBuilder { Culture = info.Culture };
    cmdBuilder.Append("Drone");
    cmdBuilder.Append(cmds);

    // Create our speech grammar
    Grammar cmdGrammar = new Grammar(cmdBuilder);

    // Prevent crashes
    if (_currentSensor == null || _recognizer == null)
		return;

    // Load grammar into our recognizer
    _recognizer.LoadGrammar(cmdGrammar);

    // Hook into speech events
    _recognizer.SpeechRecognized += OnCommandRecognizedHandler;
    _recognizer.SpeechRecognitionRejected += OnCommandRejectedHandler;

    // Get the kinect audio stream
    _audioSource = _currentSensor.AudioSource;

    // Set the beamangle
    _audioSource.BeamAngleMode = BeamAngleMode.Adaptive;

    // Start the kinect audio
    Stream kinectStream = _audioSource.Start();

    // Assign the stream to the recognizer along with FormatInfo
    _recognizer.SetInputToAudioStream(kinectStream, new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));

    // Start recognizing and make sure to set RecognizeMode to Multiple or it will stop after the first recognition
    _recognizer.RecognizeAsync(RecognizeMode.Multiple);
}

Process the recognition results

Our very last task for Part I is to process the results of the recognizer when it recognizes or rejects a command.

When it recognizes a command we’ll first check when the last command was recognized, since the recognizer might pick up a command multiple times in a brief period, which would result in unwanted actions.
We will ignore new commands for 2 seconds after the previous one.

Each recognized command also has a Confidence-value that indicates how sure the recognizer is about its result; we will set our threshold at 0.8, or 80%.

The result will be converted into a VoiceCommand value by using a helper method, and we will log it to our output window by using an extension method that gets the value of the DescriptionAttribute in our enumeration. You can find it here.
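
In case you don’t want to grab the linked code, a minimal sketch of such a GetDescription extension method could look like this:

using System;
using System.ComponentModel;
using System.Reflection;

public static class EnumExtensions
{
    /// <summary>
    /// Get the DescriptionAttribute value of an enum member,
    /// falling back to its name if no attribute is present
    /// </summary>
    public static string GetDescription(this Enum value)
    {
        FieldInfo field = value.GetType().GetField(value.ToString());

        DescriptionAttribute attribute = (DescriptionAttribute)Attribute.GetCustomAttribute(field, typeof(DescriptionAttribute));

        return attribute != null ? attribute.Description : value.ToString();
    }
}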

/// <summary>
/// Timestamp of the last successfully recognized command
/// </summary>
private DateTime _lastCommand = DateTime.Now;

/// <summary>
/// A constant defining the delay between 2 successful voice recognitions.
/// A new command will be dropped if one was already recognized within this interval
/// </summary>
private const float _delayInSeconds = 2;

/// <summary>
/// Command recognized
/// </summary>
private void OnCommandRecognizedHandler(object sender, SpeechRecognizedEventArgs e)
{
    TimeSpan interval = DateTime.Now.Subtract(_lastCommand);

    if (interval.TotalSeconds < _delayInSeconds)
		return;

    if (e.Result.Confidence < 0.80f)
		return;

    // Retrieve the DroneAction from the recognized result
    VoiceCommand invokedAction = GetDroneAction(e.Result);

    // Log action
    WriteLog("Command '" + invokedAction.GetDescription() + "' recognized.");

    /*
     * Coming in Part II
     */

}

/// <summary>
/// Convert recognized command to an enumeration representation
/// </summary>
private VoiceCommand GetDroneAction(RecognitionResult recogResult)
{
    if (recogResult == null || string.IsNullOrEmpty(recogResult.Text))
		return VoiceCommand.Unknown;

    // Separate 'Drone' from the command in the grammar
    Match m = Regex.Match(recogResult.Text, "^Drone (.*)$");

    // Check if it matches
    if (m.Success)
    {
		// Get command from object
		KeyValuePair<string, object> cmd = _speechActions.FirstOrDefault(action => action.Key.ToLower() == m.Groups[1].ToString().ToLower());

		if (cmd.Value != null)
		{
			return (VoiceCommand)cmd.Value;
		}
		return VoiceCommand.Unknown;
    }
    else return VoiceCommand.Unknown;
}

For debugging purposes we will also log the rejected commands.

private void OnCommandRejectedHandler(object sender, SpeechRecognitionRejectedEventArgs e)
{
    WriteLog("Unkown command");
}
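
WriteLog is a small helper from the template that writes a message to the log area (5) of the UI. If you’re not using the template, a minimal stand-in – assuming a hypothetical ListBox named LogListBox – could be:

/// <summary>
/// Write a message to the log area of the UI
/// </summary>
private void WriteLog(string message)
{
    // Speech events arrive on a background thread,
    // so marshal the call onto the UI thread
    Dispatcher.Invoke(new Action(() => LogListBox.Items.Add(message)));
}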

Conclusion

In this detailed tutorial we’ve seen how to get an instance of our Kinect sensor and how to start it. Next to that we’ve enabled the color stream and visualized it along with the connection state of our sensor. We ended with enabling speech recognition based on a small set of commands and logged whether the recognizer recognized or rejected them.

My posts on Parts II & III will be more high-level and explain my pitfalls & thoughts on several topics.

This is how your application should look at the end of Part I. You can find my code here if you want to check it or in case I left something out by mistake.

Pt. I - Result

  • skyrien

    It’s been a while… is this project still ongoing? I’m interested to see the results!

    • Tom_Kerkhove

      Yes but I have very little time and some of it goes to the developer kit.
      No ETA for next post available yet..

      • skyrien

        Thanks Tom! Also, it looks like I misread the date–(thought it was posted in January, instead of last week. Being in the US, where we do things a little backwards).

        I’m following this blog, and am looking forward to what’s coming next!

  • Muhamad Fahruroji

    Hi Tom, have you finished this project? I have kinect and Ar Drone 2.0, it’s seem interesting to control my quadrotor with my body gesture, I really wait for it!

    • Tom_Kerkhove

      It is not finished yet but step II is ready! Writing a blog post on it asap!