
Talk, don’t type!

The potential for automatic speech recognition in financial markets

31 August 2016


Humans have been talking for 100,000 years. Now, with the latest developments in machine learning and computing power, machines are smart enough to listen. Not only can machines recognize speech, they can understand the meaning of it. We are on the cusp of a profound change in behavior. 

The use and accuracy of automatic speech recognition (ASR) technology have increased dramatically in the last two years. Two factors are driving this. The first is the vast sum of money being spent on ASR by some of the biggest players in technology: Google, Apple, Amazon, Baidu, Microsoft and IBM. What drives this investment is the second factor: user behavior.

360 million voice searches per day

Two years ago, in 2014, Google voice searches were still close to zero. Now, of the 3 billion Google searches done each day, 12% are voice searches. That is 360 million voice searches per day. On mobile devices the share is even higher, at 20%. These numbers matter because the accuracy of ASR depends on the volume of speech data sets that feed into machine learning. What ASR does, in effect, is predict what you are probably saying, based on what millions of users have actually said in the past. The more data sets there are, the more accurate ASR’s predictions become.

This new level of sophistication means that ASR is no longer used just for making a hands-free phone call in the car or finding a nearby restaurant. There are many jobs where typing is difficult or impossible, or where the entire job is based on spoken conversations. In these cases, speech recognition technology can transform the way people work. The F-35 was the first U.S. fighter aircraft with an ASR system able to "hear" a pilot's spoken commands, leaving the pilot's hands free to control the aircraft. A growing number of companies are now doing the same for financial markets. Banks are building ASR into their consumer banking solutions, stock brokers like E*Trade are building it into their mobile apps, and companies like GreenKey are bringing this technology to the capital markets, where voice is and always will be fundamental to the conduct of business.

A short history of ASR

The first ASR systems, developed in the 1950s by Bell Laboratories and IBM, could only understand digits. In the 1970s, the U.S. Department of Defense started funding research projects looking for sound patterns and developed systems with the vocabulary of a three-year-old. Through the 1980s and 1990s, ASR turned to prediction using statistical methods. As computers with faster processors arrived toward the end of the millennium, a company called Dragon launched the first consumer product (Dragon Dictate).

The last five years have seen a step-change driven by advances in deep learning and big data and the availability of faster processing power from cloud computing. We are now seeing an explosion of both voice search apps and personal assistants. Apple introduced its (initially not very) intelligent personal assistant Siri on the iPhone 4S in 2011. Google offered its “Voice Search” app that uses data from billions of search queries to better predict what you're probably saying. In 2014, Amazon launched Echo, a wireless speaker and voice command device that responds to the name Alexa and can be told to play music, make to-do lists and set alarms. Google has now announced Google Home, a Wi-Fi speaker with a built-in voice assistant to answer questions and control web-enabled devices in your home. 

According to a Northstar Research study, half of all adults and teenagers use voice search every day (Siri, Google Now or Cortana). Google Now has an error rate of 8% compared to about 25% a few years ago. Recently Google open sourced its TensorFlow machine learning system, which underpins its ASR. Microsoft followed suit by open-sourcing its Computational Networks Toolkit (CNTK) for the ASR behind its Cortana virtual assistant. This will spur the rapid development of an array of new ASR apps.

How does cutting-edge ASR work?

ASR systems are based on two key models:

  1. Acoustic model: represents the relationship between an audio signal and the phonemes (sounds) or other linguistic units that make up speech. The acoustic model is built by taking audio recordings of speech (split into small consecutive slices or “frames” of milliseconds of audio to be analyzed for their frequency content) and their text transcriptions. Software then creates statistical representations of the sounds that make up each word. The result is a probability distribution over all the phonemes in the model. Leading ASR systems use deep neural networks (DNNs) as the core technology to model the sounds of a language and assess which sound a user is producing at every instant in time. DNNs are capable of being “trained.” This is a form of machine learning based on representations of the brain’s neural response to specific stimuli, in this case millions of examples of recorded speech. Critically, ASR systems are able to process information blazingly fast and can work in noisy environments.
  2. Language model: a probability distribution over sequences of words that estimates the relative likelihood of different phrases. The language model provides context to distinguish between words and phrases that sound similar. For example, the phrase “It’s easy to wreck a nice beach” is pronounced almost the same as “It’s easy to recognize speech.” Given the context of the conversation, the language model enables this ambiguity from the acoustic model to be resolved. A toy example of how the two models combine to resolve exactly this ambiguity is sketched below the list.
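
The following sketch shows how a decoder can combine the two models: the acoustic scores for the two candidate transcriptions are nearly identical, but a simple word-bigram language model strongly prefers “recognize speech.” All numbers, the add-one smoothing and the bigram counts are invented for illustration; this is not how any production system is built or tuned.

    # A minimal, illustrative sketch of combining an acoustic-model score with a
    # language-model score to rank candidate transcriptions. All numbers are invented.
    import math

    # Hypothetical acoustic-model log-probabilities for two candidates that sound
    # almost identical, so acoustics alone cannot decide between them.
    ACOUSTIC_LOG_PROB = {
        "it's easy to wreck a nice beach": -42.1,
        "it's easy to recognize speech": -42.3,
    }

    # A toy bigram language model: counts of word pairs in (imaginary) training text.
    BIGRAM_COUNTS = {
        ("it's", "easy"): 500, ("easy", "to"): 900,
        ("to", "recognize"): 310, ("recognize", "speech"): 280,
        ("to", "wreck"): 12, ("wreck", "a"): 10,
        ("a", "nice"): 420, ("nice", "beach"): 35,
    }
    UNIGRAM_COUNTS = {"it's": 600, "easy": 950, "to": 5000, "recognize": 320,
                      "speech": 300, "wreck": 15, "a": 8000, "nice": 450, "beach": 40}
    VOCAB_SIZE = len(UNIGRAM_COUNTS)

    def language_model_log_prob(sentence):
        """Add-one smoothed bigram log-probability of a word sequence."""
        words = sentence.split()
        log_prob = 0.0
        for prev, cur in zip(words, words[1:]):
            count = BIGRAM_COUNTS.get((prev, cur), 0)
            log_prob += math.log((count + 1) / (UNIGRAM_COUNTS.get(prev, 0) + VOCAB_SIZE))
        return log_prob

    def combined_score(sentence, lm_weight=1.0):
        """Acoustic log-probability plus a weighted language-model log-probability."""
        return ACOUSTIC_LOG_PROB[sentence] + lm_weight * language_model_log_prob(sentence)

    for candidate in ACOUSTIC_LOG_PROB:
        print(f"{candidate!r}: {combined_score(candidate):.2f}")
    print("Chosen:", max(ACOUSTIC_LOG_PROB, key=combined_score))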

There are many challenges to applying this well in practice. To be accurate at speed, the ASR system needs to cope with messy data that includes different languages, strong accents, background noise, acoustical distortions and multiple people speaking. These challenges can be overcome by harnessing state-of-the-art technology, both cutting-edge software and the fast GPU processors to run it on, and combining it with large amounts of training data and sophisticated models. Good models account for accents and for the typical background noise of different environments. Physical proximity of the microphone to the mouth and high-fidelity audio capture also help.

With a speaker-dependent approach, the machine learns an individual’s accent and intonation, and the models are tailored and specific to each individual user. This requires only a few seconds of speech initially and improvement is a continuous process as the machine learns and adapts whenever the speaker uses the system. Being speaker-dependent means that accuracy will continue to improve over time.

Creating a highly accurate ASR system requires large amounts of data to train both the acoustic model and the language model. By using domain-specific models, trained on a large volume of data, it is possible to achieve accuracy greater than 95%.
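
Accuracy figures like this are usually derived from word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system’s output into the reference transcript, divided by the number of reference words. The sketch below computes WER with a standard word-level edit distance; the two transcripts are invented examples, not real GreenKey output.

    # A minimal sketch of word error rate (WER), the standard accuracy metric for ASR.
    # The reference and hypothesis transcripts below are invented for illustration.

    def word_error_rate(reference, hypothesis):
        """(substitutions + deletions + insertions) / reference length,
        computed with a word-level Levenshtein dynamic program."""
        ref, hyp = reference.split(), hypothesis.split()
        # dist[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i
        for j in range(len(hyp) + 1):
            dist[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,                 # deletion
                                 dist[i][j - 1] + 1,                 # insertion
                                 dist[i - 1][j - 1] + substitution)  # match / substitution
        return dist[len(ref)][len(hyp)] / len(ref)

    reference = "we pay three month libor versus fixed at two percent"
    hypothesis = "we pay three months libor versus fixed at two percent"
    # One substitution out of ten reference words: WER 10%, i.e. 90% word accuracy.
    print(f"WER: {word_error_rate(reference, hypothesis):.0%}")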

Voice in the financial markets

Nearly all the key interactions in the financial markets are done by voice: trades or inquiries related to transactions that are large or complicated or for an illiquid product; important advice and important client interactions. This is simply because voice is the predominant form of human communication, and it has qualities that are not matched by other mediums: immediacy, empathy, nuance, instant feedback. If you walk onto a trading floor of a large bank or broking house you will see hundreds of people on the telephone, talking, shouting and communicating. Despite the trend away from open outcry, the financial markets still very much run on voice. 


The opportunity to apply ASR to the financial markets is immense and will be truly transformational. Traders will become more efficient, vast amounts of data will be harnessed, fewer mistakes will be made, and there will be vastly more transparency and auditability. Regulators will have the same level of transparency and audit trails with voice interactions as with electronic ones, and this will allow the market participant to choose the form of communication that best serves their needs. The markets will become safer, cheaper and more efficient.

While Google and Apple are trying to transition tasks that are currently done by typing to voice, in the financial markets it is the opposite: key tasks are already done by voice and the typing has, until now, been a subsequent mandatory duplication of the task.

Applying ASR to the financial markets

At GreenKey, our approach has been to develop our models based on the leading open source ASR frameworks, adapted using thousands of hours of “trader speak” to make them highly domain specific. We then train the models further to each specific user by having them read text into the system. The system analyzes the person's specific voice and uses it to fine-tune the acoustic model. All spoken communications on any voice device, including trading turrets, desk phones and mobiles, can then be converted into usable data that can be leveraged to drive workflows or analyzed for compliance, regulatory, sales and trading purposes. At GreenKey, we have built the system for English and are now expanding it to other languages. 

To increase the accuracy even further for trading applications, we use a custom grammar approach that works like the FIX protocol for electronic communications. In ASR, the grammar describes all the likely terms the engine should expect and controls when to switch on and off various speech functions. When well-tuned, the grammar can deliver very robust results by better defining the target space and whether to acquire, ignore or expand upon certain inputs. 

In markets where traders typically use a standard syntax and sequence of words for their trades, we have established that grammar within the ASR platform to make the engine more efficient and reliable. GreenKey is developing global industry standards for voice communications and voice trading. 
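
As a hypothetical illustration of the idea (not GreenKey’s actual grammar), the sketch below constrains recognized text to a simple “side, quantity, instrument, at price” pattern and extracts structured fields from phrases that match, while ignoring everything that falls outside the grammar. The pattern, field names and example phrases are invented.

    # A minimal sketch of a grammar-constrained parse of "trader speak" into
    # structured trade fields. The syntax below is a hypothetical example only.
    import re
    from typing import Optional

    TRADE_GRAMMAR = re.compile(
        r"^(?P<side>buy|sell)\s+"
        r"(?P<quantity>\d+)\s+"
        r"(?P<instrument>[a-z0-9 ]+?)\s+"
        r"at\s+(?P<price>\d+(?:\.\d+)?)$"
    )

    def parse_trade(transcript: str) -> Optional[dict]:
        """Return structured trade fields if the transcript matches the grammar, else None."""
        match = TRADE_GRAMMAR.match(transcript.strip().lower())
        if not match:
            return None  # outside the grammar: ignore, or fall back to general transcription
        fields = match.groupdict()
        fields["quantity"] = int(fields["quantity"])
        fields["price"] = float(fields["price"])
        return fields

    print(parse_trade("Buy 50 December Eurodollar futures at 98.25"))
    # {'side': 'buy', 'quantity': 50, 'instrument': 'december eurodollar futures', 'price': 98.25}
    print(parse_trade("Let's grab lunch at noon"))  # None: not a trade phrase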

We use our GreenKey ASR platform for two distinct products, and we are only just scratching the surface of use cases for each of them: 

  1. General transcription: turning every spoken word into text (.wav files into .txt files) to create searchable, usable data. To achieve very high levels of accuracy, we do this with a small delay, which allows the language model to fully apply its probability determinations. Use cases include knowing which clients have inquired about what and then tracking those inquiries, running real-time surveillance and compliance alerts on all conversations, and applying to voice communications all the analytics that are already applied to electronic interactions.
  2. Voice commands (voice-driven workflows): extracting certain keywords from a conversation in real time to drive workflows, with a user-defined mapping and framework approach. This can also include using a keyword in a live stream of data to commence and cease the ASR. Use cases include automatically populating trade tickets, capturing quotes for pre-trade transparency requirements under MiFID and satisfying best execution requirements. A toy sketch of this kind of keyword-to-workflow mapping follows the list.
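
The sketch below is a hypothetical, simplified version of that keyword-driven pattern: a user-defined mapping from trigger words to actions, with start and stop words that commence and cease capture. The keywords, actions and function names are invented for illustration and are not GreenKey’s product API.

    # A minimal sketch of a user-defined keyword-to-workflow mapping driven by a
    # stream of recognized words. All keywords and actions are hypothetical.
    from typing import Callable, Dict, List

    def open_trade_ticket(words: List[str]) -> None:
        print("Opening trade ticket with context:", " ".join(words))

    def flag_for_compliance(words: List[str]) -> None:
        print("Flagging conversation for compliance review")

    # Hypothetical user-defined mapping from trigger keywords to workflow actions.
    KEYWORD_ACTIONS: Dict[str, Callable[[List[str]], None]] = {
        "ticket": open_trade_ticket,
        "guarantee": flag_for_compliance,
    }

    def process_stream(recognized_words: List[str], start_word: str = "capture",
                       stop_word: str = "stop") -> None:
        """Commence capture on start_word, cease on stop_word, and fire mapped
        actions for any trigger keyword heard while capture is active."""
        capturing = False
        buffer: List[str] = []
        for word in recognized_words:
            w = word.lower()
            if w == start_word:
                capturing, buffer = True, []
            elif w == stop_word:
                capturing = False
            elif capturing:
                buffer.append(w)
                if w in KEYWORD_ACTIONS:
                    KEYWORD_ACTIONS[w](buffer)

    process_stream("capture buy fifty at par ticket stop".split())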

Conclusion

When we talk about "Big Data," we often forget that the problem is as much about data capture as it is about data analysis. Digitizing voice is about enabling intelligent voice data capture to help market participants improve their analytics capabilities. ASR has come a very long way in the last five years and is ready to transform our lives. Having built a leading domain-specific engine for the financial markets at GreenKey, we are truly excited to be part of digitizing voice and changing the way the markets communicate.
