Hey Siri, can we talk … to our apps?


Last June, Apple surprised developers at WWDC with announcement after announcement of new tools, new APIs, and even a new language. However, when all was said and done, the one last trick developers were waiting for remained in Apple’s hat, at least for now.

Where was the Siri API?

To be sure, Siri is now more capable than ever. Arriving with the iPhone 4S in 2011, it brought a novel way to interact with our devices. It skyrocketed in mindshare, and even after the arrival of voice assistants from rivals Google, Microsoft, and now Amazon, it remains the first virtual assistant most people think of.

However, the more we used Siri, the more we realized how limited it was. The intelligence we perceived was mostly an illusion. Apple has kept improving Siri, and users can now interact with all kinds of system services using their voice, but interaction with third-party apps remains elusive, limited only to Apple-approved avenues.

Apple is undoubtedly working on an API for developers who want Siri-style voice accessibility in their apps, but does that mean developers should wait until it’s ready? Heck no! There are already more solutions available for iOS, Android, and even the web than Siri has jokes.*

(Almost) Built-in

One solution is nearly built into iOS. Starting with iOS 5, a “veritable Swiss-Army knife of linguistic functionality” is built into every iPhone, iPad, and iPod, and maybe even the Apple TV and the upcoming Apple Watch. NSLinguisticTagger is a powerful system-supplied class that can break a natural-language sentence down into nouns, verbs, numbers, places, other parts of speech, and more.

If you can provide a sentence to an NSLinguisticTagger, such as from Apple’s own Dictation feature or any number of third-party libraries, chances are you can glean a lot of user intent just by separating the important bits from the an’s, uh’s, and the’s.

For example, in a Swift playground:

import Cocoa

var sentence = "Hey Siri, what's on my calendar today?"

// Skip whitespace and punctuation, and treat multi-word names as one token.
let options: NSLinguisticTaggerOptions = (.OmitWhitespace | .OmitPunctuation | .JoinNames)
let taggerOptions = Int(options.rawValue)

// Ask for the tag schemes available for the user's preferred language.
let preferredLanguage = NSLocale.preferredLanguages()[0] as String
let tagSchemes = NSLinguisticTagger.availableTagSchemesForLanguage(preferredLanguage)

let tagger = NSLinguisticTagger(tagSchemes: tagSchemes, options: taggerOptions)

let fullRange = NSMakeRange(0, sentence.utf16Count)
tagger.string = sentence

// Walk the sentence, printing each token and its part of speech.
tagger.enumerateTagsInRange(fullRange,
    scheme: NSLinguisticTagSchemeLexicalClass,
    options: options) { (tag, tokenRange, sentenceRange, _) -> Void in
        let token = (sentence as NSString).substringWithRange(tokenRange)
        println("\(token): \(tag)")
}

produces,

Hey Siri: Noun
what: Pronoun
's: Verb
on: Preposition
my: Determiner
calendar: Noun
today: Noun

or,

var sentence = "Give me directions to my office.”

produces,

Give: Verb
me: Pronoun
directions: Noun
to: Preposition
my: Determiner
office: Noun

or the classic,


var sentence = "The quick, brown fox jumped over the
lazy dog.”

gives

The: Determiner
quick: Adjective
brown: Adjective
fox: Noun
jumped: Verb
over: Preposition
the: Determiner
lazy: Adjective
dog: Noun

Because NSLinguisticTagger can be configured with different options and with tag schemes for a given language, you can see how flexible and powerful a parser it can be. You could build everything from an old-school text adventure to an editor capable of highlighting parts of speech.
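To make that concrete, here is a minimal sketch of the text-adventure idea: keep only the verbs and nouns and treat them as a crude command. The verbsAndNouns function is a hypothetical helper, and the exact tags you get back depend on the language model, so treat the output as illustrative.

import Foundation

// Sketch: collect just the verbs and nouns from a sentence so a simple
// command handler can act on them. Names here are illustrative.
func verbsAndNouns(sentence: String) -> (verbs: [String], nouns: [String]) {
    let options: NSLinguisticTaggerOptions = (.OmitWhitespace | .OmitPunctuation | .JoinNames)
    let schemes = NSLinguisticTagger.availableTagSchemesForLanguage("en")
    let tagger = NSLinguisticTagger(tagSchemes: schemes, options: Int(options.rawValue))
    tagger.string = sentence

    var verbs = [String]()
    var nouns = [String]()
    tagger.enumerateTagsInRange(NSMakeRange(0, sentence.utf16Count),
        scheme: NSLinguisticTagSchemeLexicalClass,
        options: options) { (tag, tokenRange, _, _) -> Void in
            let token = (sentence as NSString).substringWithRange(tokenRange)
            if tag == NSLinguisticTagVerb { verbs.append(token) }
            else if tag == NSLinguisticTagNoun { nouns.append(token) }
    }
    return (verbs, nouns)
}

// "take the brass key" might come back as verbs: [take], nouns: [key],
// which is all a bare-bones command parser needs.
let (verbs, nouns) = verbsAndNouns("take the brass key")
println("verbs: \(verbs), nouns: \(nouns)")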

But what if you want to use voice to create the sentences?

Hmm, sounds familiar

A third-party solution from Nuance Communications called SpeechKit provides voice recognition and synthesis SDKs for iOS, Android, Windows Phone, and the web. Using Nuance’s SpeechKit libraries, a developer can record a user’s voice, ship it off to Nuance for processing, receive a best guess at what the user actually said (along with other, lower-confidence possibilities and suggestions), and finally speak an appropriate response in a pleasant-sounding voice.

If that voice sounds familiar, it should. Nuance provided Siri’s speech recognition engine, and SpeechKit uses the same voice synthesis as the original Siri.

To use Nuance’s SpeechKit, you initialize an SKRecognizer object with your authentication key, the language you want it to detect, and how you want SpeechKit to decide when to start processing speech (the user signals they are ready, with a button tap for example, or SpeechKit detects a long or short pause). SpeechKit’s cloud servers transcribe the voice and return a simple SKRecognition object.

@interface SKRecognition : NSObject
{
    NSArray *results;
    NSArray *scores;
    NSString *suggestion;
    NSObject *data;
}

The results array holds the sentence(s) or fragment(s) SpeechKit believes the user said, and scores holds the confidence in each of those results.

In the simplest case, you can send recognition.firstResult() to your NSLinguisticTagger parser, and start figuring out the user’s intent.
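As a rough sketch, assuming you have the hypothetical verbsAndNouns routine from earlier and an SKRecognition object delivered to your SpeechKit delegate, that hand-off could look something like this:

// Sketch only: `recognition` is the SKRecognition object SpeechKit delivers
// to your delegate; firstResult() is its highest-confidence transcription.
let transcription = recognition.firstResult()
let (verbs, nouns) = verbsAndNouns(transcription)
// From here, map the verbs and nouns onto whatever actions your app supports.
println("heard verbs \(verbs) and nouns \(nouns)")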

Want to tell your user something?

SKVocalizer.speakString("This is what I found for you on the Internet.")

Voila, your very own Siri!

Free and fast

Another popular, and free, solution is Open Ears by PolitePix. One huge advantage of the Open Ears implementation is that voice recognition happens completely locally on the device: there is no delay while recordings stream over the network and results come back from cloud processing. The popular traffic app Waze uses Open Ears to let drivers report traffic conditions hands-free after waving at their docked devices.

A big shortcoming of such a local system is that its vocabularies are much smaller than the seemingly unlimited dictionaries of cloud-based services. However, that smaller vocabulary is also the source of the speed Open Ears enjoys, and it means no network connection is required.

Open Ears also provides a development platform, allowing the purchase of plugins that extend its capabilities. One such plugin, RapidEars, promises live, real-time speech recognition with no perceivable delay.

Paved with good intentions

Two more similar, competing solutions may be the closest things yet to how Siri actually works. Wit.ai and API.ai both try to translate user input, whatever the user said, into user intent, with parameters. API.ai describes it as “natural language understanding.”

Both solutions listen to the user, break down what was said, and return a structure that describes what they believe the user intended. An intent could be “Start timer”, “Open Game”, or “Create Event”, even if what the user actually said was “I need a timer for boiling these eggs”, “I want to play Angry Birds”, or “I need to make a meeting with Steve when I have time this afternoon”. Those intents can carry arguments such as “5 minutes”, “Angry Birds”, and “Steve, 3:00 pm”.
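To picture what that means for your code, here is a hypothetical Swift model of an intent-with-parameters result. Neither Wit.ai nor API.ai returns exactly this structure, but both return something in this spirit:

// A hypothetical shape for an intent result; the field names are
// illustrative, not either service's actual response format.
struct Intent {
    let name: String                  // e.g. "Open Game"
    let parameters: [String: String]  // e.g. ["game": "Angry Birds"]
    let confidence: Double            // how sure the service is (0.0 to 1.0)
}

// "I want to play Angry Birds" might come back as:
let parsed = Intent(name: "Open Game",
                    parameters: ["game": "Angry Birds"],
                    confidence: 0.9)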

Like Siri’s natural language processing, these two solutions give that illusion of intelligence by making the connection between what a person said and what a person means.

The road to hell?

As exciting as such artificial intelligence can be, no lesser theorists and technologists than Stephen Hawking and Elon Musk are terrified of its potential impact on humanity. The recent feature film Her by Spike Jonze gives a glimpse of a near future where everybody talks to their personal devices rather than to the actual humans all around them. And a trailer for the new Terminator movie has already hit the YouTubes.

But as dire as those warnings sound, they really are just fiction and fretting. Developers are not going to boot up Skynet by adding voice recognition to their mobile apps. What they will do is make their apps more accessible than ever before.


*This is not true. Siri has an unlimited number of silly jokes.
