Could Siri be coded in Python?

Most of the magic behind Siri happens remotely.

I want to create my OWN version of Siri… except I don’t care about having it on my phone. I want my entire house to talk to me… more like Jarvis (from Iron Man).

I believe I have access to all the right resources to create this AI.
It breaks down into three major parts:
1) convert speech to text
2) query database populated with q&a
3) convert text to speech
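
The three parts chain together naturally. Here is a minimal Python sketch of that pipeline, assuming each stage is just a function you plug in. The names `run_pipeline`, `fake_stt`, `fake_qa`, and `fake_tts` are placeholders of mine, and the stand-in stages only fake the remote calls so the sketch runs offline.

```python
from typing import Callable


def run_pipeline(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    answer_question: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Chain the three stages: audio in, spoken answer out."""
    question = speech_to_text(audio)      # 1) convert speech to text
    answer = answer_question(question)    # 2) query the Q&A database
    return text_to_speech(answer)         # 3) convert text to speech


# Stand-in stages (hypothetical) so the sketch runs without any network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_qa(question: str) -> str:
    canned = {"what is 15 + 11?": "26"}
    return canned.get(question.lower(), "I don't know.")


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


spoken = run_pipeline(b"What is 15 + 11?", fake_stt, fake_qa, fake_tts)
```

Keeping the stages as separate functions means any one of them can be swapped for a real remote service later without touching the other two.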

Speech to Text

Most speech-to-text engines suck. Siri’s works exceptionally well because the engine isn’t on your phone… it’s remote. I suppose we could hack Siri by running a MITM attack on an iPhone, faking the SSL cert, and intercepting the Apple ID… OR we can do something much simpler. Google’s Chrome 11 browser includes a voice input function (which isn’t yet part of the HTML5 standard) that can convert your speech into text. This guy discovered that the conversion was happening remotely through an undocumented API call to Google. All we have to do is access this same API and we’ve got ourselves a free speech-to-text engine!

In case you don’t understand Perl, this is how you use the API:

POST to:

POST params: (which should include the contents of a .flac encoding of your voice, recorded in mono at 16000 Hz or 8000 Hz)
(which should read “audio/x-flac; rate=16000” or “audio/x-flac; rate=8000”, depending on your voice recording. This should also be mirrored in the Content-Type section of your header.)

Response: json text
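
For readers who would rather see this in Python than Perl, here is a rough sketch of building such a request with the standard library. The endpoint is passed in as an argument because the URL isn’t reproduced here, and `https://example.invalid/recognize` is only a placeholder. The function names are mine, and I’m assuming the raw FLAC bytes go in the request body.

```python
import json
import urllib.request


def build_speech_request(endpoint: str, flac_bytes: bytes, rate: int = 16000) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying FLAC audio."""
    if rate not in (8000, 16000):
        raise ValueError("rate must be 8000 or 16000")
    return urllib.request.Request(
        endpoint,
        data=flac_bytes,  # assumption: the FLAC file contents are the body
        headers={"Content-Type": f"audio/x-flac; rate={rate}"},
        method="POST",
    )


def send_speech_request(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON text that comes back."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder URL and dummy bytes; substitute the real endpoint and a real recording.
request = build_speech_request("https://example.invalid/recognize", b"fLaC", rate=16000)
```

Since the API is undocumented, the response layout could change at any time, so it is safer to inspect the decoded JSON before relying on particular keys.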

I used ffmpeg to convert my audio into the desired format:
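
The exact command isn’t reproduced here, so the following is only a sketch of how the conversion could be driven from Python. It uses ffmpeg’s standard `-ac` (channels) and `-ar` (sample rate) options. The file names are hypothetical, and running `convert_to_flac` assumes ffmpeg is installed and on your PATH.

```python
import subprocess
from typing import List


def flac_command(source: str, target: str, rate: int = 16000) -> List[str]:
    """Return an ffmpeg command that writes mono audio at the given sample rate."""
    return [
        "ffmpeg",
        "-i", source,       # input recording
        "-ac", "1",         # one audio channel (mono)
        "-ar", str(rate),   # output sample rate in Hz
        target,             # a .flac extension selects FLAC output
    ]


def convert_to_flac(source: str, target: str, rate: int = 16000) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(flac_command(source, target, rate), check=True)


# Hypothetical file names, for illustration only.
command = flac_command("question.m4a", "question.flac")
```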

So I recorded my voice on my iPhone 3GS asking “what day is it today?”, converted it to the appropriate .flac format, posted it to Google’s API, and this is what I got in response:


Database populated with Q&A

This is probably the most difficult part to obtain. Building it from scratch would require tons of data and advanced algorithms to interpret sentences constructed in various ways. I read somewhere that Siri was using Wolfram Alpha’s database… so I checked out Wolfram Alpha, and they have an engine that answers your questions. Not only that, they also offer an API service. (If you query fewer than 2000 times a month, it’s free!) So I signed up for the API service and tested it out. I asked it some simple questions like “What day is it today?” and “Who is the president of the United States?”. It returns all answers as well-formed XML.
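
To give a feel for working with that XML, here is a small Python sketch that pulls the answer text out of a reply. The sample below is hand-written in the general shape of a Wolfram Alpha response (pods containing subpods with plaintext), not a captured one. The helper name `first_answer` and the choice to look only for a pod titled “Result” are my own simplifications.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hand-written sample, simplified; a real reply carries many more attributes.
SAMPLE_XML = """
<queryresult success="true">
  <pod title="Input interpretation">
    <subpod><plaintext>15 + 11</plaintext></subpod>
  </pod>
  <pod title="Result">
    <subpod><plaintext>26</plaintext></subpod>
  </pod>
</queryresult>
"""


def first_answer(xml_text: str) -> Optional[str]:
    """Return the plaintext of the pod titled "Result", or None if there is none."""
    root = ET.fromstring(xml_text.strip())
    if root.get("success") != "true":
        return None
    for pod in root.findall("pod"):
        if pod.get("title") == "Result":
            text = pod.findtext("subpod/plaintext")
            if text:
                return text.strip()
    return None


answer = first_answer(SAMPLE_XML)
```

Not every question produces a pod with that title, so a fuller version would probably fall back to other pods before giving up.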

Again… sweet.

Text to Speech

This part is easy… and Google makes it even easier with yet another undocumented API! It’s straightforward. A simple GET request to:

Just replace the parameter with any sentence and you can hear Google’s female robot voice say anything you want.
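
One detail worth getting right is URL-encoding the sentence before it goes into the query string. Here is a minimal Python sketch of that step. Both the base URL and the parameter name `text` are hypothetical placeholders, since the real ones aren’t reproduced here.

```python
from urllib.parse import urlencode


def build_tts_url(base_url: str, text_param: str, sentence: str) -> str:
    """Attach a URL-encoded sentence to a GET endpoint."""
    return f"{base_url}?{urlencode({text_param: sentence})}"


# Placeholder endpoint and parameter name, for illustration only.
url = build_tts_url("https://example.invalid/tts", "text", "yes, master")
```

`urlencode` turns spaces into `+` and escapes punctuation such as commas, so arbitrary sentences survive the trip intact.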

Voice Input

I can either make my program run in a web browser or as a stand-alone app. Running it in the web browser is cool because I would then be able to run it from just about any machine. Unfortunately, HTML5 doesn’t have a means of recording voice. My options are a) only use Google Chrome, b) make a Flash app, or c) make a Java applet.

Anywho… no big deal.

Putting It All Together

It responds with this answer. Good girl.
It’s still missing the voice input portion of the code. Currently, it just accepts a .flac file. I wrote 3 chunks of code that I put together as one pipeline of an AI process. The advantage of this over Siri is that I can intervene at any time. I can have it listen for particular questions such as “who is your master?” and respond appropriately… but more importantly, I can have it listen for “Turn on my lights” or “turn on the TV” or “open the garage door” or “turn to channel 618”. Certain questions will have my bot send a signal to the appropriate Arduino-controlled light switch, garage switch, or IR blaster and respond with a “yes, master”. I’ll post videos when it’s done.
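
The “intervene” step could look something like the following Python sketch: check the recognized text against a table of known commands first, and only fall through to the Q&A stage when nothing matches. The device names, the `send_signal` stand-in (which just records what it would send instead of talking to an Arduino), and the exact matching rules are all assumptions of mine.

```python
import re
from typing import Callable, List

# Hypothetical command table: recognized phrase -> (device, action).
COMMANDS = {
    "turn on my lights": ("lights", "on"),
    "turn on the tv": ("tv", "on"),
    "open the garage door": ("garage", "open"),
}


def send_signal(device: str, action: str, sent: List[str]) -> None:
    """Stand-in for signalling real hardware; only records the request."""
    sent.append(f"{device}:{action}")


def handle(text: str, answer_question: Callable[[str], str], sent: List[str]) -> str:
    """Intercept known commands; pass everything else to the Q&A stage."""
    normalized = text.lower().strip(" ?!.")
    if normalized in COMMANDS:
        device, action = COMMANDS[normalized]
        send_signal(device, action, sent)
        return "yes, master"
    channel = re.fullmatch(r"turn to channel (\d+)", normalized)
    if channel:
        send_signal("ir_blaster", f"channel {channel.group(1)}", sent)
        return "yes, master"
    return answer_question(text)


sent: List[str] = []
reply = handle("Turn on my lights", lambda q: "(answer from the Q&A stage)", sent)
```

Exact phrase matching is brittle against speech recognition errors, so a real version would likely need looser matching.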

Here is a video of the prototype in action.

Updated to give you a link to a working demo. This version requires you to use the Chrome browser (thanks to Shiv Kokroo for generously providing hosting and a Wolfram app ID):

Working Demo

Click on the little microphone and try asking her a question like “how many legs does a spider have?” or “what is 15 + 11?” or “turn off the lights”. 🙂

Update: There is a follow-up to this post here.

The source code can be found on GitHub.
