Consumers can now experience the vast improvements that DeepMind has achieved with WaveNet, its human speech AI

Tech overlord DeepMind has made great strides in “perfecting” WaveNet, its human speech AI, and consumers can now experience it in Google Assistant. Just a year ago, DeepMind cracked an alternative way of producing text-to-speech (TTS) that sounds more realistic than previous methods.




Previously, TTS systems used one of two available methods, and both had their drawbacks. Concatenative text-to-speech demands countless hours of human input. A voice actor records large amounts of text, which is then broken down into small segments. The system rearranges these segments into various words and phrases. The audio libraries need to be updated ad infinitum.
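
To make the concatenative idea concrete, here is a toy sketch in Python. It is not how any real TTS product is built: the unit names, the sample values, and the `synthesize` function are all invented for illustration, and a real system would store recorded audio rather than a handful of numbers.

```python
# Toy illustration of concatenative synthesis: look up prerecorded
# segments and join them end to end. All names and data are hypothetical.

# Pretend "recordings": each unit maps to a short list of audio samples.
UNIT_LIBRARY = {
    "hel": [0.1, 0.3, 0.2],
    "lo": [0.0, -0.2],
    "wor": [0.4, 0.1, -0.1],
    "ld": [-0.3, 0.0],
}


def synthesize(units, library=UNIT_LIBRARY):
    """Concatenate the stored samples for each requested unit."""
    audio = []
    for unit in units:
        if unit not in library:
            # The drawback of this approach: anything missing from the
            # library has to be recorded before it can be spoken.
            raise KeyError(f"no recording for unit {unit!r}")
        audio.extend(library[unit])
    return audio


hello = synthesize(["hel", "lo"])
```

Asking for a unit that was never recorded raises an error, which is the toy version of having to keep updating the audio library.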




Other techniques have been time-consuming or imperfect-sounding

The second technique, called parametric TTS, creates a purely computer-generated voice. Engineers achieve this by programming the system with sets of parameters. Unsurprisingly, this method tends to sound extremely robotic. So if the voice samples WaveNet has produced are anything to go by, DeepMind has effectively raised the bar for text-to-speech across the industry.

DeepMind's creators sought to mimic the neural networks of the human brain when developing its AI, and WaveNet is built on a convolutional neural network. The network is trained on the waveforms of human speech recordings. It then synthesizes new waveforms itself, rather than using the recordings as the fabric of the speech (as with concatenative TTS).
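
For the curious, the idea of a network building a waveform piece by piece can be sketched in a few lines of Python. This is a toy under stated assumptions, not DeepMind's code: DeepMind has described WaveNet as using causal, dilated convolutions and producing audio one sample at a time, but the function names, the fixed filter weights, and the absence of any training here are our own simplifications.

```python
# Toy sketch of a causal (optionally dilated) convolution and a
# sample-by-sample generation loop. Hypothetical names, no training.


def causal_conv(signal, weights, dilation=1):
    """Each output depends only on the current and earlier inputs."""
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # samples before the start count as silence
                total += w * signal[idx]
        out.append(total)
    return out


def generate(seed, weights, steps, dilation=1):
    """Predict one sample from the past, append it, and repeat."""
    samples = list(seed)
    for _ in range(steps):
        nxt = 0.0
        for k, w in enumerate(weights):
            idx = len(samples) - 1 - k * dilation
            if idx >= 0:
                nxt += w * samples[idx]
        samples.append(nxt)
    return samples


waveform = generate([1.0, 1.0], [1.0, 1.0], steps=3)
```

A real model would replace the fixed weights with many learned layers, but the loop has the same shape: every new sample is computed only from samples that already exist.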


Perfect speech, exquisite music

In this way, WaveNet can apply finer details to its speech, such as lip smacks or accents. In fact, WaveNet is so advanced it can create exquisite music from scratch. One can almost hear the soft thud of the piano's wooden keys as well as the sound made when the strings are struck.



One drawback until now, however, has been the extraordinary amount of time it takes to create the waveforms. When DeepMind first announced WaveNet last year, it took 1 second to generate just 0.02 seconds of audio, which made the system impractical. The latest announcement reveals that DeepMind has rectified this. The improved WaveNet is 1,000 times faster than it was this time last year and can create 1 second of raw waveform in 50 milliseconds.
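
The 1,000-times figure can be checked against the two timings quoted above. The short Python sketch below does only that arithmetic; the function and variable names are ours, and the only inputs are the numbers given in this article.

```python
# Check the quoted speedup using only the timings given in the article.


def compute_ms_per_audio_second(compute_ms, audio_ms):
    """Milliseconds of computing needed for one second of audio."""
    return compute_ms * 1000 / audio_ms


# Last year: 1 second (1000 ms) of computing for 0.02 s (20 ms) of audio.
old_cost = compute_ms_per_audio_second(1000, 20)

# Now: 50 ms of computing for 1 second (1000 ms) of audio.
new_cost = compute_ms_per_audio_second(50, 1000)

speedup = old_cost / new_cost        # how many times faster than before
realtime_factor = 1000 / new_cost    # seconds of audio per second of computing
```

On these figures, last year's system needed 50 seconds of computing for every second of speech, while the new one produces audio about 20 times faster than it plays back.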


Startling advancements in AI technology

These startling advancements in AI technology have no doubt alarmed many people. Entrepreneurs such as Elon Musk and scientists like Stephen Hawking have been very public about their fear of AI getting out of control and potentially destroying humanity. Indeed, it took DeepMind only a year to bring WaveNet to this monumental level. What will they achieve by this time next year? What will they have achieved ten years from now?

In a video released by ColdFusion in April this year, the narrator describes unbelievable things that AI has already achieved. He explains how AI can now create a video from a single still image. AI is also surpassing humans at predicting human behavior. In another example, StackGAN is an AI that can create an accurate picture from a description of just a few words.

What do you think? Could AI take over the human race? Would this AI takeover affect us negatively or positively? Leave your comments below.


References: Futurism, ColdFusion, Fortune

Photo credit: Ars Electronica via Visualhunt / CC BY-NC-ND