Amazon Polly: Delivering Health Care for People with Long-Term Conditions

Michael Wray
Feb 7, 2018

With an aging population that continues to grow, healthcare is being changed forever. Are we ready for it? Which cost-effective technologies can we use to meet the ever-increasing demands on healthcare-related services?

With the right technology, many healthcare needs can be met remotely, and the National Health Service (NHS) in the UK is already doing this. Although remote healthcare is far from widespread, innovative organizations are realizing that low-cost digital health solutions can deliver significant efficiencies at scale.

Despite being a dinosaur in the communications space, automated telephony can be a perfect communication channel to deploy services at scale because nearly everyone can use it, even if they do not have access to the internet or own a smartphone. And for many older people, the telephone is a piece of technology they are comfortable with and confident using.

In this post, we highlight how Inhealthcare has enabled NHS healthcare providers to use Amazon Polly for remote communications. We show how Amazon Polly can be used at design time with our call script design tools to design and simulate automated telephone calls. We also illustrate how clinical protocols are built into automated telephone call scripts, how telephone calls are placed, and how Amazon Polly generates synthesized speech that is streamed down the telephone line.

Inhealthcare provides a digital health platform that specializes in providing care in the UK outside hospital walls. The Inhealthcare platform connects to established healthcare software systems and enables clinical protocols and pathways to be modeled, created, tested, executed, and monitored. An important factor in delivering services remotely is choosing an appropriate communication method. While apps, wearables, and web access suit some people, many individuals struggle to use these advanced technologies. For them, simpler alternatives such as text messaging or automated telephony are a better solution. As a platform provider, we support all of these communication channels, but in this post we focus on how we use Amazon Polly with automated telephony.

IVR

IVR (interactive voice response) has been around for decades, which is why nearly everybody knows how to use it. Whether you experienced it as a reminder to set your watch with the help of the speaking clock, or as a nuisance call asking about a recent injury you never had, you have almost certainly experienced IVR. This familiarity matters when delivering healthcare on a national basis: the service must be simple and inclusive.

IVR enables two-way communication: the computer communicates with the person using a synthesized voice, and the person responds using dual-tone multi-frequency (DTMF) signals. These are the tones you hear when you press the buttons on a telephone keypad.
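To make the keypad side of IVR concrete: each DTMF key is the sum of one low "row" tone and one high "column" tone from a fixed grid. The following Python sketch maps a key to its standard frequency pair and generates the corresponding samples at the 8 kHz telephony rate. It is illustrative only; the function names are hypothetical, and in a real deployment decoding key presses is typically left to the telephony stack rather than application code.

```python
import math

# Standard DTMF frequency grid (Hz): a key press is one low (row) tone
# mixed with one high (column) tone.
LOW_FREQS = (697, 770, 852, 941)
HIGH_FREQS = (1209, 1336, 1477, 1633)
KEYPAD = (
    "123A",
    "456B",
    "789C",
    "*0#D",
)


def dtmf_frequencies(key: str) -> tuple:
    """Return the (low, high) frequency pair for a single keypad key."""
    if len(key) != 1:
        raise ValueError(f"expected a single key, got {key!r}")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key)
        if col != -1:
            return LOW_FREQS[row], HIGH_FREQS[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key: str, duration_s: float = 0.1, sample_rate: int = 8000) -> list:
    """Generate one key press as float samples in [-1.0, 1.0] (hypothetical helper)."""
    low, high = dtmf_frequencies(key)
    n = int(duration_s * sample_rate)
    # Each tone gets half the amplitude so their sum never exceeds 1.0.
    return [
        0.5 * math.sin(2 * math.pi * low * i / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * i / sample_rate)
        for i in range(n)
    ]


print(dtmf_frequencies("5"))   # (770, 1336)
print(len(dtmf_samples("#")))  # 800
```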

How it works

The Inhealthcare platform includes the digital pathway engine, which automatically manages and orchestrates remote communications. The integrated development environment (IDE) provides the tooling to design and build clinical pathways and protocols, which are published to the digital pathway engine. The call script designer, an element of the IDE, is used for constructing automated telephone calls.

At the appropriate time, and adhering to a clinical protocol that has been published to the digital pathway engine, a message is sent to the Voice Messaging System (VMS), a microservice responsible for managing telephone calls. A phone call can last anywhere from a few seconds to several minutes, depending on the complexity of its call script. The VMS is responsible for interpreting the call script, managing the state of each phone call, and reporting that state back to the digital pathway engine. As it progresses through the call script, the VMS queues up commands for the Telephony Interface Manager (TIM) to execute. The first command is to place a call. This is done using Asterisk, an open source PBX system that is configured to connect to a remote SIP trunk provider. SIP (Session Initiation Protocol) is a signaling protocol commonly used by telephony systems.
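The post does not describe the VMS internals, so the following is only a minimal Python sketch of the idea of per-call state plus a command queue consumed by the telephony layer. The `Command` and `CallSession` classes, the action names, and the state names are all hypothetical; the one detail taken from the text is that placing the call is the first command queued.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class Command:
    """One instruction for the telephony layer (hypothetical shape)."""
    action: str  # e.g. "PLACE_CALL", "PLAY", "COLLECT_DTMF"
    argument: str = ""


@dataclass
class CallSession:
    """Hypothetical VMS-side state for a single outbound call."""
    phone_number: str
    state: str = "PENDING"
    queue: Deque[Command] = field(default_factory=deque)

    def start(self) -> None:
        # The first command queued for the telephony layer is to place the call.
        self.queue.append(Command("PLACE_CALL", self.phone_number))
        self.state = "DIALLING"

    def on_answered(self, first_prompt: str) -> None:
        # Once the call is connected, speak the first prompt and wait for one key press.
        self.state = "IN_PROGRESS"
        self.queue.append(Command("PLAY", first_prompt))
        self.queue.append(Command("COLLECT_DTMF", "1"))

    def on_hung_up(self) -> None:
        # Nothing left to execute once the recipient hangs up.
        self.state = "COMPLETED"
        self.queue.clear()

    def next_command(self) -> Optional[Command]:
        """Hand the next queued command to the telephony interface, if any."""
        return self.queue.popleft() if self.queue else None

    def report(self) -> dict:
        """Summary that could be reported back to the pathway engine."""
        return {"number": self.phone_number, "state": self.state, "pending": len(self.queue)}


session = CallSession("+441632960000")  # fictitious number
session.start()
print(session.next_command())  # Command(action='PLACE_CALL', argument='+441632960000')
session.on_answered("Hello, this is an automated call from your care team.")
print(session.report())  # {'number': '+441632960000', 'state': 'IN_PROGRESS', 'pending': 2}
```

Keeping the queue on the VMS side means the telephony layer only ever executes one simple command at a time, while the VMS alone decides what happens next.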

After the call is established, the VMS steps through the call script. Information is delivered as synthesized speech retrieved from Amazon Polly, and the call recipient responds by pressing buttons on their telephone keypad (DTMF codes). For the call to feel like a realistic conversation, it is vital that Amazon Polly responds quickly. Delays and dead air cause frustration and make it more likely that the recipient hangs up.
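Stepping through a call script amounts to walking a small state machine in which each step speaks a prompt and a key press selects the next step. The following Python sketch assumes a hypothetical script format (a dict of steps with `prompt` and `branches`) and a simple retry rule for invalid keys; it is not Inhealthcare's actual call script format.

```python
from typing import Dict, List, Tuple

# Hypothetical call script: each step has a prompt and maps key presses to the next step.
CALL_SCRIPT: Dict[str, dict] = {
    "start": {
        "prompt": "Hello. This is your weekly check. Press 1 to continue, or 2 to be called back later.",
        "branches": {"1": "reading", "2": "callback"},
    },
    "reading": {
        "prompt": "Have you taken your reading today? Press 1 for yes, or 2 for no.",
        "branches": {"1": "thanks", "2": "reminder"},
    },
    "thanks": {"prompt": "Thank you. Goodbye.", "branches": {}},
    "reminder": {"prompt": "Please take your reading this evening. Goodbye.", "branches": {}},
    "callback": {"prompt": "We will call you back later. Goodbye.", "branches": {}},
}


def run_script(script: Dict[str, dict], key_presses: List[str], max_retries: int = 2) -> Tuple[List[str], str]:
    """Walk the script with a sequence of DTMF digits; return (prompts spoken, outcome)."""
    spoken: List[str] = []
    step = "start"
    presses = iter(key_presses)
    retries = 0
    while True:
        node = script[step]
        spoken.append(node["prompt"])  # in production this text would be synthesized and played
        if not node["branches"]:
            return spoken, step  # terminal step: the call can end here
        key = next(presses, None)
        if key is None:
            return spoken, "no_response"
        if key in node["branches"]:
            step = node["branches"][key]
            retries = 0
        else:
            # Invalid key: repeat the same prompt, up to max_retries times.
            retries += 1
            if retries > max_retries:
                return spoken, "too_many_invalid_keys"


prompts, outcome = run_script(CALL_SCRIPT, ["1", "2"])
print(outcome)       # reminder
print(len(prompts))  # 3
```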

Before adopting Amazon Polly, we used a locally hosted text-to-speech (TTS) engine. Initially we were concerned that Amazon Polly might not respond quickly enough, but we have found its latency to be very low. Another great advantage of Amazon Polly is its cost effectiveness: a locally hosted TTS engine can consume significant CPU and RAM, whereas with Amazon Polly this is no longer something we need to worry about. Amazon Polly has a simple pay-as-you-go pricing model based on usage, and a sensible caching strategy reduces costs even further. Using a simple algorithm, we split the text into sentences; if an exact sentence has already been synthesized, its audio is retrieved directly from a local cache. We currently see a cache hit rate of more than 80%.
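The caching code itself is not published, so here is a minimal Python sketch of the sentence-level approach described above. It assumes a naive punctuation-based sentence splitter, an in-memory dict standing in for the "local cache", and a caller-supplied `synthesize` function (for example, a hypothetical wrapper around Amazon Polly). The voice and output settings are folded into the cache key, on the assumption that the same sentence rendered with a different voice must not be reused.

```python
import hashlib
import re
from typing import Callable, Dict, List

# Naive sentence splitter: break after ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceCache:
    """Caches synthesized audio per sentence (in-memory stand-in for a local cache)."""

    def __init__(self, synthesize: Callable[[str], bytes], settings: str = "") -> None:
        # `synthesize` turns one sentence into audio bytes; `settings` (voice, sample
        # rate, SSML options) is part of the key so different settings never collide.
        self._synthesize = synthesize
        self._settings = settings
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, sentence: str) -> str:
        return hashlib.sha256(f"{self._settings}\x00{sentence}".encode("utf-8")).hexdigest()

    def audio_for(self, text: str) -> List[bytes]:
        """Return audio chunks for each sentence, synthesizing only unseen sentences."""
        chunks: List[bytes] = []
        for sentence in _SENTENCE_END.split(text.strip()):
            if not sentence:
                continue
            key = self._key(sentence)
            if key in self._store:
                self.hits += 1
            else:
                self.misses += 1
                self._store[key] = self._synthesize(sentence)
            chunks.append(self._store[key])
        return chunks

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because call scripts reuse the same greetings, instructions, and sign-offs across thousands of calls, even this simple exact-match scheme can achieve a high hit rate; personalized sentences (names, dates) are the ones that still go to the TTS service.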

Monitoring

Amazon Polly metrics are integrated with Amazon CloudWatch out of the box, so it is easy to configure monitors and alarms to track Amazon Polly performance. However, this only tells part of the story. We also run our own monitoring, based on the Coda Hale Metrics library, so we can track values such as full round-trip times and cache hits. These are currently reported to New Relic, but they could just as easily be sent to Amazon CloudWatch. As the following graph shows, we find that Amazon Polly typically has a latency of about 50 ms.
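Coda Hale Metrics is a Java library; as a language-neutral illustration of the kind of round-trip timing described above, the following Python sketch records durations and summarizes them with nearest-rank percentiles. The `LatencyTimer` class is hypothetical and does not reproduce that library's API; the snapshot values are what one would then forward to a backend such as New Relic or Amazon CloudWatch.

```python
import math
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class LatencyTimer:
    """Collects durations in milliseconds, loosely modelled on a metrics 'timer'."""

    def __init__(self) -> None:
        self.samples_ms: List[float] = []

    @contextmanager
    def measure(self) -> Iterator[None]:
        # Time the wrapped block (e.g. one full TTS round trip), even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples_ms.append((time.perf_counter() - start) * 1000.0)

    def record(self, ms: float) -> None:
        self.samples_ms.append(ms)

    @staticmethod
    def _percentile(sorted_ms: List[float], p: float) -> float:
        # Nearest-rank percentile: the smallest value covering p% of the samples.
        rank = max(1, math.ceil(p / 100.0 * len(sorted_ms)))
        return sorted_ms[rank - 1]

    def snapshot(self) -> Dict[str, float]:
        data = sorted(self.samples_ms)
        if not data:
            return {"count": 0}
        return {
            "count": len(data),
            "mean_ms": sum(data) / len(data),
            "p50_ms": self._percentile(data, 50),
            "p95_ms": self._percentile(data, 95),
            "max_ms": data[-1],
        }
```

Reporting a high percentile alongside the mean matters for telephony: occasional slow responses are what callers hear as dead air, and a mean can hide them.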

Speech Synthesis Markup Language (SSML)

The input to Amazon Polly can be raw text or SSML. We use SSML because it gives greater control over how the speech is synthesized. We don't currently use many of these control features, but we expect to use them more in the future and to integrate them into the call script design tool. For example, we could slow down the speech rate. We do use the x-loud prosody volume so that the speech is easy to hear, because many of our listeners are elderly. We also request pulse-code modulation (PCM) output at an 8 kHz sample rate, which is what telephony systems typically expect. Brian is our favored voice; early feedback suggested that listeners preferred it.
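The settings described above map onto a small amount of SSML and a handful of request parameters. The following Python sketch builds an SSML string with x-loud prosody (optionally with a slower rate) and the keyword arguments for a Polly `synthesize_speech` request using PCM at 8000 Hz and the Brian voice. The helper names are hypothetical; the parameter names and values are the ones Amazon Polly's API accepts. The duration helper relies on Polly's PCM output being 16-bit signed, mono samples.

```python
from typing import Optional
from xml.sax.saxutils import escape

# Amazon Polly's PCM output is 16-bit signed, mono, little-endian: 2 bytes per sample.
POLLY_PCM_BYTES_PER_SAMPLE = 2


def build_ssml(text: str, volume: str = "x-loud", rate: Optional[str] = None) -> str:
    """Wrap plain text in SSML prosody; `rate` could be e.g. "slow" or "90%"."""
    attrs = f'volume="{volume}"'
    if rate:
        attrs += f' rate="{rate}"'
    # Escape &, < and > so user-facing text cannot break the SSML markup.
    return f"<speak><prosody {attrs}>{escape(text)}</prosody></speak>"


def polly_request(text: str, voice_id: str = "Brian") -> dict:
    """Keyword arguments for a Polly synthesize_speech call (hypothetical helper)."""
    return {
        "Text": build_ssml(text),
        "TextType": "ssml",
        "OutputFormat": "pcm",
        "SampleRate": "8000",  # Polly expects the sample rate as a string
        "VoiceId": voice_id,
    }


def pcm_duration_seconds(num_bytes: int, sample_rate: int = 8000) -> float:
    """Playback length of a raw PCM buffer, useful for sizing timeouts on the call."""
    return num_bytes / (sample_rate * POLLY_PCM_BYTES_PER_SAMPLE)


print(build_ssml("Press 1 for yes."))
# <speak><prosody volume="x-loud">Press 1 for yes.</prosody></speak>
print(pcm_duration_seconds(16000))  # 1.0
```

With boto3 installed and AWS credentials configured, these arguments could be passed as `boto3.client("polly").synthesize_speech(**polly_request(text))`, and the raw PCM bytes read from the response's `AudioStream`.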

Conclusion

Despite the old-fashioned nature of the telephone, it offers an efficient, safe, and cost-effective way to enable home-care communication. Many individuals with long-term conditions have been using telephone-based digital health services for over three years, and they enjoy the freedom it gives them. We are committed to supporting remote communication for the UK's aging population across all channels, and in Amazon Polly's text-to-speech service we have found a low-latency, low-cost way to scale automated telephony.

Source: https://aws.amazon.com/blogs/machine-learning/using-amazon-polly-to-deliver-health-care-for-people-with-long-term-conditions