Friday, November 4, 2011

Looking for a KDE related job? We are hiring!

We, the non-profit research organization simon listens e.V., are looking for qualified C++ / Qt / KDE hackers to join our team!

Initially, we are looking to fill part-time positions, but these can be extended to full-time afterwards.

While our projects mostly focus on speech recognition using our own KDE-based solution called simon, you do not need to know anything about speech recognition to join!

Interested? Contact me for more information or send me your resume right away: grasch at simon-listens dot org

Friday, October 14, 2011

simon meets MeeGo

I'm happy to report that since August, I can now officially call myself a Qt Ambassador!

As an Ambassador, I had the opportunity to apply for a loaned Nokia N950 to develop / port applications to MeeGo/Harmattan. I took Nokia up on their offer and the result is simone - a trimmed down, mobile version of simon. In other words: "simon embedded" or "simone".

The client features push-to-talk or automatic voice activity detection (configurable) and, thanks to simon's client / server architecture, uses little power on the device itself. Even with voice activity detection running, you should get many hours of continuous speech recognition out of a single charge.

simone can be used to replace the headset of a "full" simon installation, but it also includes a couple of default actions on the device itself. For example, you can use a voice-controlled quick dial feature or start / stop turn-by-turn navigation.


For more information and a live demo, have a look at the YouTube demonstration:

If you can't see the embedded video, try this direct link.

Tuesday, September 6, 2011

simon meets AT-SPI-2

Over the last couple of days I have again been working on what I started during this year's Desktop Summit: simon's AT-SPI 2 integration.
What started as a GSoC project idea back in April is now beginning to take shape.

The basic idea is still the same: first, integrate Sequitur into simon to be able to transcribe arbitrary words automatically. To facilitate this, Sequitur first needs to learn the transcription rules from a large dictionary, so I integrated a feature that lets users turn their shadow dictionary (which already supports many different formats) into a regular Sequitur model.
After this model generation process, the system is used to transcribe words for the AT-SPI plugin, but also when adding new words manually.
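The idea behind the model generation step can be illustrated with a deliberately naive sketch (plain Python with made-up phoneme symbols; a real grapheme-to-phoneme tool like Sequitur learns many-to-many alignments, not one phoneme per letter as this toy does):

```python
from collections import Counter, defaultdict

def train(dictionary):
    """Learn a naive letter-to-phoneme mapping from (word, phonemes) pairs.

    Only entries whose letter and phoneme counts match are used; real G2P
    systems align graphemes and phonemes properly instead."""
    votes = defaultdict(Counter)
    for word, phones in dictionary:
        if len(word) == len(phones):
            for letter, phone in zip(word.lower(), phones):
                votes[letter][phone] += 1
    return {letter: counter.most_common(1)[0][0]
            for letter, counter in votes.items()}

def transcribe(model, word):
    """Transcribe an unseen word letter by letter, '?' for unknown letters."""
    return [model.get(letter, "?") for letter in word.lower()]

# A tiny toy dictionary (hypothetical phoneme symbols):
lexicon = [
    ("bat", ["b", "ae", "t"]),
    ("tab", ["t", "ae", "b"]),
    ("ban", ["b", "ae", "n"]),
]
model = train(lexicon)
print(transcribe(model, "nab"))   # a word not in the dictionary
```

Even this toy version shows the principle: rules learned from a dictionary generalize to words the dictionary never contained.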

Thanks to Sequitur, simon can now automatically transcribe words that are definitely not in the shadow dictionary:
With this as the basic foundation and some help from Frederik and Joanie, I created a plugin that analyzes the UI of the currently active window, creates vocabulary and grammar for it, and associates commands with the user interface elements.

It's still in an early development stage (as is the AT-SPI 2 support of GTK and Qt), but the basic functionality already works. To check it out, either build and install the current development version of simon from Git (atspi branch) or have a look at the demonstration video below.

For RSS readers: AT-SPI demonstration on YouTube

Tuesday, August 9, 2011

Desktop Summit 2011

I just arrived back home (I flew home after the talks) from this year's Desktop Summit, and it was awesome! In retrospect I kind of regret not staying the whole week... Next year... :)

Anyway, I met tons of interesting people and had a lot of productive meetings and discussions. It's amazing what can get done in just a few minutes if the right people are sitting together.

If we (the KDE accessibility team) can implement even half of what was discussed in the last couple of days, I'm sure we're looking at a big step towards a truly accessible free desktop.

Oh and Martin: I'm looking forward to all those KWin effects for simon :P

Benefit Project Completed

After more than a year of hard work, we - the simon listens team - are proud to announce the completion of the Benefit project, which used simon along with other open source technologies (XBMC, Ubuntu, ...) to create an affordable, self-contained, voice-controlled multimedia solution especially suited for elderly people.
The created solution - including the speech model and scenarios - will be released under a free license very soon.

But in the meantime, you can already have a look at a short demo video on YouTube:

(Planet readers, click here)

Wednesday, June 8, 2011

GSoC Guest Post: Context Detection

This year, we have been given the opportunity to work with two students as part of Google's annual Summer of Code. Adam is working on context-dependent speech recognition (see below) and Alessandro is working on the Voxforge integration. Moreover, another student, Saurabh, is working on the Workspace integration as part of the Season of KDE.

So, as the start of what will hopefully become a series of blog posts by our new contributors, I asked Adam to write a bit about his progress and future plans for context-dependent speech recognition. This is what he wrote:

As part of the Google Summer of Code, I have been working to add
context-based activation and deactivation of scenarios in the KDE speech
recognition program simon. The simon program allows users to create or
download scenarios which, when activated, allow them to control other
programs such as web browsers, text editors, and games with speech.

When the number of commands that must be considered for speech
recognition in simon becomes too large (for example, if the scenarios
that are active have a large number of possible commands), the speed and
accuracy of the speech recognition can suffer to the point of
unusability. Context-based activation and deactivation of scenarios will
allow scenarios to be deactivated when they are not needed (for example,
when the program that they control is not opened, or when the program is
not the active window) so that the number of commands being considered
by speech recognition will be kept low enough to ensure accuracy and speed.

The context gathering system has been developed so that scenarios have a
"compound condition" which is a group of conditions under which the
scenario should activate. The compound condition becomes satisfied when
all of its conditions (which gather contexts) are satisfied. When the
compound condition becomes satisfied or unsatisfied, it communicates
this to its scenario, which then indicates to the scenario manager
whether or not it should be activated.

Compound conditions will be created with a user interface similar to
simon's command adding and editing interface. A scenario with no
conditions in its compound condition will always be active. This means
that any scenario made before this feature was added will maintain its
former functionality, but can be easily changed to (de)activate under
certain conditions.

The conditions of which the compound condition is composed are developed
as plugins (similarly to the command managers in simon), so it will be
easy to add new types of conditions. For example, one of the currently
developed plugins gathers information about running processes, so a
scenario can be activated under the condition that some process is
running or not running (for example a Rekonq scenario could have the
condition "'rekonq' is running"). The extensibility allowed by this
plugin system means that conditions such as "'Firefox' is the active
window" or "The user is connected to the internet" or "Fewer than 3
scenarios are currently active in simon" or any other type of condition
that could be determined by simon can be easily developed and used to
guide scenario activation and deactivation.

The next steps of my project include making the scenarios actually
activate and deactivate in response to conditions, making a parent/child
scenario relationship so that a single scenario can have child scenarios
with independent grammars and conditions (so that parts of the scenario
can be activated and deactivated independently), making more condition
plugins, and exploring the possibilities of what else simon would be
able to do with the contexts that it will be able to gather (for example
switching speech models based on the microphone that is being used).
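The compound-condition logic Adam describes can be sketched roughly like this (illustrative Python, not simon's actual C++ classes; the process list here is passed in by hand, whereas a real plugin would query the running system):

```python
class Condition:
    """Base class for a pluggable context condition."""
    def is_satisfied(self):
        raise NotImplementedError

class ProcessRunningCondition(Condition):
    """Satisfied when a given process name appears in a process list.

    A real plugin would gather the process list from the system itself."""
    def __init__(self, name, process_list):
        self.name = name
        self.process_list = process_list

    def is_satisfied(self):
        return self.name in self.process_list

class CompoundCondition:
    """Satisfied only when all child conditions are satisfied.

    An empty compound condition is always satisfied, so scenarios created
    before this feature keep their former always-active behavior."""
    def __init__(self, conditions=()):
        self.conditions = list(conditions)

    def is_satisfied(self):
        return all(c.is_satisfied() for c in self.conditions)

processes = ["plasma-desktop", "rekonq"]
rekonq_scenario = CompoundCondition(
    [ProcessRunningCondition("rekonq", processes)])
print(rekonq_scenario.is_satisfied())        # rekonq is running
print(CompoundCondition().is_satisfied())    # no conditions: always active
```

The `all(...)` check is what makes the compound condition flip between satisfied and unsatisfied as its child conditions change, which is the signal the scenario manager would act on.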

Tuesday, April 5, 2011

GSoC idea: Ubiquitous Speech Recognition

The Google Summer of Code application period for students closes in a couple of days, and I still have one last idea for simon for any student still looking for a project: Ubiquitous Speech Recognition.

Some of you might already know that simon supports recording (and recognizing) from multiple microphones simultaneously. Sound cards and microphones are comparatively cheap, and the server / client architecture of simon would even allow input from mobile phones, other PCs, etc.

We also have gadgets and home appliances getting smarter and smarter every year. KNX is getting increasingly popular, is already included in many new electrical installations and allows home automation at a very fair price.

Voice control is an intuitive way to interact with all kinds of devices and - compared to alternatives like touch screens and the like - also quite cheap. simon already has more than enough interfaces to connect to your favorite home automation controllers / hardware interfaces - something people are already doing.

However, speech recognition has traditionally relied on controlled environments. False positives are still a major issue, and recognition accuracy depends on the system being optimized for a specific situation.

Still: adapting the recognition to specific situations is already part of another GSoC idea (which fortunately already has a very promising student attached to it), so that leaves the voice activity detection as the remaining hassle.

The voice activity detection (in short: VAD) tells the system when to listen to the user and tries to distinguish between background noise and user input. Normally this is just one comparatively minor part of a speech recognition system, but when your whole apartment (or at least parts of it) is listening for voice input, it becomes kind of important :).

The current system in simon just compares the current "loudness" to a configurable threshold. This is fine for headset users but almost useless in the above scenario.
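That threshold-based approach boils down to something like this (a stdlib Python sketch with made-up sample values and an arbitrary threshold, not simon's actual audio code):

```python
import math

def rms(frame):
    """Root-mean-square loudness of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_voice(frame, threshold=1000.0):
    """Naive VAD: a frame counts as speech when its loudness exceeds a
    configurable threshold - fine for a headset, useless for a whole room."""
    return rms(frame) > threshold

# Synthetic frames: quiet background hum vs. loud speech-like samples.
silence = [10, -12, 8, -9] * 100
speech  = [4000, -3900, 4100, -3800] * 100
print(is_voice(silence), is_voice(speech))
```

It is easy to see why this breaks down in a noisy apartment: anything loud, whether a door slam or the TV, crosses the same loudness threshold as a spoken command.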

And here is where it's your turn to get creative: try to find a novel approach to separating voice commands from background noise.

For example: Use webcams and computer vision algorithms to determine if the user is even near a microphone at the time of the heard "command".

You could also define "eye contact" with a camera as the signal to activate the recognition.  Or maybe you could deactivate the system unless the user raises his hand before he speaks?

Another idea would be to let different microphones work together and subtract the similarities (to filter out global noise).

You can also use noise conditioning to remove the music playing over the PC speakers automatically from the input signal.

Or why not use the reception strength of the user's Bluetooth phone to determine which room they are currently in?

Bonus points for coming up with other ideas in the comment section!

Monday, April 4, 2011

GSoC idea: Voice Control for the Linux Desktop

As this worked so well last time, I want to use this blog post to present another idea for the Google Summer of Code 2011 that has not yet found an interested student.

The simon system currently has plugins to trigger shortcuts, simulate clicks and interact directly with applications through IPC technologies like DBus and JSON. This makes simon perfect for interacting with a vast variety of applications - as long as it is configured for each application beforehand.

To ease this configuration burden, we have the scenario system that allows users to exchange such configurations online. This repository already covers many of the "standard" applications.
Still: the user has to actively pick which applications to control, and if there is no scenario available for an application, things get a bit more complicated.

So how could we create dynamic scenarios that allow the user to control new applications without configuring anything?

Well let's look at what's needed to voice control an application.

First of all, we need to know what options are currently available.

Let's look at KWrite as an example application:

Just looking at the screenshot a human can quickly tell that there are at least the following commands: "New", "Open", "Save", "Save As", "File", "Edit", etc.

Well, if screen readers can read those options to the user, why shouldn't simon be able to parse them automatically as well?

With the upcoming AT-SPI-2 and the Qt accessibility bridge, the user interface (including buttons, menu items, etc.) is exported over DBus.

As elements can also be triggered (clicked / selected) over this interface, simon can easily "read" running applications and create appropriate commands.

Best of all: because screen readers are well established, many applications already make sure that this works properly.
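A rough sketch of the idea (plain Python over a hand-built mock widget tree; a real implementation would enumerate the accessible objects that AT-SPI-2 exports over DBus instead of constructing them by hand):

```python
from dataclasses import dataclass, field

@dataclass
class Widget:
    """Stand-in for an accessible object as AT-SPI-2 would expose it."""
    role: str
    name: str = ""
    actions: tuple = ()
    children: list = field(default_factory=list)

def collect_commands(widget, commands=None):
    """Walk the accessibility tree and map each labelled, triggerable
    element's name to the element - the raw material for voice commands."""
    if commands is None:
        commands = {}
    if widget.name and "click" in widget.actions:
        commands[widget.name.lower()] = widget
    for child in widget.children:
        collect_commands(child, commands)
    return commands

# A mock of KWrite's toolbar and menu bar, like the screenshot above:
window = Widget("frame", "KWrite", children=[
    Widget("tool bar", children=[
        Widget("push button", "New", actions=("click",)),
        Widget("push button", "Open", actions=("click",)),
        Widget("push button", "Save", actions=("click",)),
    ]),
    Widget("menu bar", children=[
        Widget("menu", "File", actions=("click",)),
        Widget("menu", "Edit", actions=("click",)),
    ]),
])
print(sorted(collect_commands(window)))
```

The same walk that produces the command names also keeps a reference to each element, so recognizing "save" can directly trigger the matching accessible action.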

Vocabulary and Grammar
Now that we have our commands in place, simon still needs to recognize all those words ("New", "Save", etc.) that are probably not in the user's active vocabulary.

As speech recognition systems need a phonetic description of each word, that is not trivial.

...if it weren't for Sequitur. Sequitur is a grapheme-to-phoneme converter that translates any given text into a phonetic description.

The system can be compared to a native speaker: Even if you have never heard a word spoken out loud you still have at least a rough idea about how to pronounce it. That's because there are certain rules in any language that you know even if you aren't aware of them.
Sequitur works in much the same way: it learns those rules by reading large dictionaries. With the generated model, it can transcribe even words that were not in the input dictionary.

In our tests, Sequitur proved to be very reliable, accurate and quite fast.

simon already allows the user to specify a dictionary large enough to act as the information source for Sequitur: the shadow dictionary. Because there are already import mechanisms for most major pronunciation dictionary formats, more than enough raw material to "feed" to Sequitur is already available.

Now that we have the vocabulary, setting up an appropriate grammar is very easy: just make sure that all the sentences of the created commands are allowed.
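That grammar step can be shown with a trivial sketch (illustrative Python; a real recognizer would compile these sentences into its own grammar format rather than a set lookup):

```python
def build_grammar(commands):
    """Build a minimal 'grammar' that allows exactly the sentences of the
    created commands, each stored as a tuple of lowercase words."""
    return {tuple(command.lower().split()) for command in commands}

def is_allowed(grammar, utterance):
    """Check whether an utterance is one of the allowed command sentences."""
    return tuple(utterance.lower().split()) in grammar

# Commands harvested from the UI, as in the KWrite example:
grammar = build_grammar(["New", "Open", "Save As"])
print(is_allowed(grammar, "save as"), is_allowed(grammar, "quit"))
```

Restricting the grammar to exactly the harvested commands is also what keeps recognition accurate: the recognizer never has to consider sentences that cannot trigger anything.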

For static models no training data is required so that's all that'd be needed.

With a combination of AT-SPI-2 and Sequitur one could quite easily extend the current simon version to automatically create working voice commands for all standard widgets of running applications.

This allows the user of a static model to comfortably use any application without any application-specific configuration at all.

Because AT-SPI-2 is a standard, the resulting system would automatically work with Qt and KDE applications as well as GNOME applications.

If you are interested in working on this idea, please send me an email.

Saturday, April 2, 2011

GSoC idea: Crowdsourcing Speech Model Training

There is still a week left for students to apply for Google's annual Summer of Code.

Following Lydia's recommendation on the mailing list, I've decided to showcase on this blog, for the remainder of the application period, some ideas for simon that are not yet taken by any student: if you'd like to implement one of these ideas, please feel free to send me a mail at grasch ate simon-listens ° org.

The first idea that is still up for grabs is simon's Voxforge integration. Voxforge is an ambitious project to create free (GPL) speech models for everyone. With the current Voxforge models, simon can already be used without any training at all: just download simon and the appropriate model for your language from the Voxforge website and start talking to your computer.

This works because the Voxforge models have been trained with lots and lots of voice recordings from people around the world. The resulting model is speaker-independent and works quite well for most people. If you need even more accuracy, just adapt the general model to your voice with a couple of training sessions and you are ready to go.

The current Voxforge model for English is quite good for command and control but nowhere near powerful enough for dictation. The models for other languages consist of even fewer samples. In the last five years, 624 identified users submitted voice recordings for the English model. Only 50 identified people submitted recordings for the German Voxforge model.

I think this is primarily because donating voice recordings (through the Java applet on the Voxforge homepage) is only done by those who are actively searching for ways to improve open source speech recognition. There is also no immediate payoff for the donors.

simon, on the other hand, reaches a wide array of people interested in open source speech recognition: more than 24,000 in the past 12 months.

Many of those users train simon to get the most out of their system. But those training samples never get submitted to Voxforge to improve the general model, because there is no easy way to do that.

I propose implementing an easy-to-use uploading system that allows users to submit their training samples directly to the Voxforge corpus at the press of a button.

Together with an automatic download of the Voxforge model for the selected language when simon is launched for the first time, this means that simon users can:
1. Get started with the general model even easier because they don't have to download it manually
2. If the recognition rate is too low, they can (and in our experience often will) train their model locally.
By submitting the recorded samples from their local training back to Voxforge, they not only submit valuable recordings - more often than not, they submit exactly those recordings that train words which couldn't be recognized with the previous Voxforge model.

And because users can immediately see whether their samples are helping or hurting (by checking if the recognition rate improves locally), the generated submissions should be of fairly high quality. There is even an immediate advantage for the end user: their recognition rate improves.

If you are interested in working on this proposal, please contact me at grasch ate simon-listens ° org.

Saturday, March 19, 2011

simon at GSoC 2011

I've just been officially approved as a mentor for KDE. There are already three ideas for simon on the ideas page.

It's the first time I'll (hopefully) be participating in GSoC (as a mentor) but I am very much looking forward to it.

I hope to find a few students interested in simon so if you want to do anything that has to do with speech recognition at all (even if it's not mentioned in the ideas page), just contact me.

Want to voice-control your lawn mower? Talk to your Roomba? We can do that :)

Let your ideas run wild!

CeBIT 2011

I've been to quite a few conferences in the past year, including corporate events like the AAL Forum in Denmark, but also this year's Akademy, the openSUSE Conference and the LinuxTage in Graz (where I'll be again this year, by the way). However, I've never been to anything like CeBIT.

We arrived a day early to inspect our booth space and to set up our equipment. I then had a quick stroll through the rest of our hall. That "quick stroll" took about an hour. And that was only "our" hall (hall 2) which isn't even the biggest one of the 17 - again: seventeen - halls in use.

But you all probably knew that the CeBIT, the biggest IT fair in the world, was quite big :). Still, it's something entirely different to walk through the halls yourself. Not that we had a lot of time to explore the exhibition - no, we were plenty busy :)

The simon booth had an ideal position in one of the most frequented halls right in the middle of the open source area. That meant we got a lot of foot traffic and had lots of interesting conversations. We also had a lot of people coming up to us telling us that they already use simon. This included stuff like home automation - a first for simon AFAIK.

We also met quite a few people who work in nursing homes, were quite impressed with simon and exchanged contact information with us. We got amazing feedback and, of course, a lot of feature requests, so we won't be running out of ideas any time soon :)

As a little thank-you to all the people behind simon, we sent out invitations, including free tickets for the event, to simon's many testers, contributors and translators. Many had to decline because they live too far away, but some could make it and met us in Germany. Among them was John Ambeliotis, author of jaNET, a personal assistant powered by simon.

John and me next to the simon listens booth

I think it's great to meet contributors in person instead of just communicating via e-mail, and I want to thank everybody again who came to visit us and expressed interest in simon!

Saturday, February 5, 2011

Waking from Hibernation

After a very busy January, I finally have some more time to work on simon again. Expect more updates than usual from me this month :)

In the first week(s), I will be working on getting the Akonadi integration in simon up and running. Using this new plugin, you will be able to schedule simon commands for specific dates / times using your conventional groupware infrastructure.
For example, you can use simon's dialog system to wake you up in the morning by scheduling an event in KOrganizer with a special (but configurable) prefix in the event's summary: by default, "[simon-command] Dialog/Welcome" would execute the welcome dialog at the start time of the event in your calendar.

The Akonadi plugin also provides the option to react to other, "normal" events by displaying a reminder about them - again using the dialog system. This is meant as a replacement for the KOrganizer reminder system on voice controlled systems.

You can even use the Akonadi plugin to schedule command executions from within simon.
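The summary-prefix convention could be parsed along these lines (an illustrative Python sketch; the prefix string matches the example above, everything else is hypothetical):

```python
COMMAND_PREFIX = "[simon-command] "

def parse_event_summary(summary, prefix=COMMAND_PREFIX):
    """Return the command encoded in a calendar event's summary, or None
    for a 'normal' event that should only trigger a spoken reminder.

    The default prefix matches the example above but would be configurable."""
    if summary.startswith(prefix):
        return summary[len(prefix):]
    return None

print(parse_event_summary("[simon-command] Dialog/Welcome"))  # a command event
print(parse_event_summary("Dentist appointment"))             # a normal event
```

Returning None for unprefixed summaries is what lets the same plugin handle both cases: command events get executed, everything else falls through to the reminder dialog.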

The dialog system has been spruced up and now sports support for user avatars and multiple texts per state (one is randomly selected during execution to make the system feel a bit more natural).

Other than that, I will also be working - again - on improving our speech model creation technique, and I have been assigned to look at some problems in ssc / sscd that seem to crop up in the Windows version.

The other parts of our Benefit project are also progressing well and we will be starting the first prototyping tests within the next couple of weeks. I hope I can get some video demonstrations done before the end of the month.

Of course, next to all that serious business, I couldn't resist hacking around a bit and developed a tiny proof of concept that demonstrates a novel approach to dealing with false positives in a speech recognition system that will basically be running 24 / 7.

The prototype uses OpenCV and the webcam of my notebook to run simple face detection on the current image. If no person is sitting in front of the laptop and looking at its screen, simon is automatically disabled. As soon as the user looks at the computer again, the system is re-activated.
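The gating logic itself is tiny and can be sketched independently of the vision part (plain Python; the detector here is a stub callable, where the actual prototype plugs in OpenCV face detection):

```python
class RecognitionGate:
    """Enable speech recognition only while a face is visible.

    `detector` is any callable returning True when a face is found in a
    frame; the proof of concept described above used OpenCV for this."""
    def __init__(self, detector):
        self.detector = detector
        self.active = False

    def tick(self, frame):
        """Run detection on one frame and update the recognition state."""
        self.active = self.detector(frame)
        return self.active

# Stub detector: for this sketch, a frame is just a flag.
gate = RecognitionGate(lambda frame: frame == "face")
print(gate.tick("empty"), gate.tick("face"))
```

Separating the gate from the detector is also what would make it easy to turn this into a regular simon plugin later: any context signal, not just a webcam, could drive the same on/off decision.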

This is just a proof of concept at this stage, but it is already working quite well. I hope to extend it into a regular simon plugin that uses simon's filter system to toggle recognition in simon 0.4. Yes, we are going multimodal :)