<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[jambonz news and blog posts]]></title><description><![CDATA[I build open-source software for the telecommunication/real-time comms space; most notably drachtio (SIP server) and jambonz (CPaaS).]]></description><link>https://blog.jambonz.org</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 21:54:12 GMT</lastBuildDate><atom:link href="https://blog.jambonz.org/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Why we're introducing a commercial license]]></title><description><![CDATA[Today we're introducing a commercial license option for jambonz. The open source version isn't going away—we'll continue to maintain it with security fixes and bug fixes—but over time new features will land first in the commercially licensed version....]]></description><link>https://blog.jambonz.org/why-were-introducing-a-commercial-license</link><guid isPermaLink="true">https://blog.jambonz.org/why-were-introducing-a-commercial-license</guid><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Tue, 20 Jan 2026 15:40:22 GMT</pubDate><content:encoded><![CDATA[<p>Today we're introducing a commercial license option for jambonz. The open source version isn't going away—we'll continue to maintain it with security fixes and bug fixes—but over time new features will land first in the commercially licensed version.</p>
<p><strong>If you're an existing support customer, nothing changes for you</strong>. You can continue on your current plan indefinitely with the same level of service. Most customers who choose to migrate to the commercial version at some point will be able to do so at no additional cost. We're also offering an OEM license for those who need source code access and redistribution rights.</p>
<p>Now let me explain why we're doing this.</p>
<h2 id="heading-why-this-change-now">Why this change now</h2>
<p>We're making this change for two reasons:</p>
<ul>
<li>To increase the pace at which we deliver meaningful innovation to our customers, and</li>
<li>To make the product more widely available to prospective customers.</li>
</ul>
<p>When companies adjust their open source strategy, they often cite the same reason: sustainability of the project. That's true for us too, but I want to be more specific about what that means.</p>
<p>Our open source strategy served us well in the early days. When jambonz was new and I was the only developer, "free" was the right price to attract customers willing to take a chance on unproven software. It got us off zero. It worked.</p>
<p>But over time, the challenges facing us have changed, what it means to compete in the VoiceAI space has changed—and the cracks in our business model have become harder to ignore.</p>
<h3 id="heading-the-paradoxes-weve-been-living-with">The Paradoxes We've Been Living With</h3>
<p>Here's something that's bothered me for years: to build a services-only business, we've had to make jambonz harder to deploy without our help. Our deployment scripts and know-how? We've deliberately withheld them from non-paying users, because support revenue is how we fund development.</p>
<p>This has never felt right to me. Many of our users have been frustrated by it too (maybe you were one of them). But when services are your only revenue, you have to create reasons for people to pay for services. The result is a worse product for everyone who isn't a customer yet.</p>
<p>Here's another paradox: the better jambonz works, the worse our business model works. Support contracts look less necessary when things run smoothly. We frequently have customers sign up to get running, then cancel after a month or two because everything is working fine. We're penalized for building reliable software.</p>
<h3 id="heading-the-math-doesnt-work-anymore">The Math Doesn't Work Anymore</h3>
<p>A services-only business is labor-intensive with thin margins. That limits what we can invest in the product. Meanwhile, VC-funded companies that we compete with have significantly more resources than our small team.</p>
<p>There's so much happening in this space right now. The product needs to evolve in multiple dimensions at once just to keep pace with the rate of innovation. If we stay on our current path, jambonz eventually becomes a project that can only afford security patches—no meaningful new development. That's not good for anyone.</p>
<h3 id="heading-being-honest-about-what-i-want">Being Honest About What I Want</h3>
<p>I apologize for making this a bit personal, but perhaps you'll grant me this brief aside since jambonz was, at the beginning, a personal passion project. I started jambonz because I wanted to challenge myself to build something large and ambitious. I'm in it for the innovation—for doing hard things that are genuinely fun to attempt. I want to build great software and I want to build a great company.</p>
<p>To do that, we need the resources to compete. Services revenue alone won't get us there. Software licensing revenue will help.</p>
<h2 id="heading-making-jambonz-easier-to-deploy">Making jambonz easier to deploy</h2>
<p>There's a silver lining here. Once we're not dependent on making self-hosting difficult... well, we can stop doing that! We'll be able to provide our devops tools and deployment scripts to everyone. We'll make it much easier to spin up jambonz on your own infrastructure.</p>
<p>To start with, for instance, you can <a target="_blank" href="https://github.com/jambonz-selfhosting/cloudformation">find CloudFormation scripts here</a> to deploy jambonz in your AWS account, along with <a target="_blank" href="https://docs.jambonz.org/self-hosting/aws">step-by-step instructions</a>. We'll be adding additional devops scripts and docs shortly in our public repos to deploy on all of the major hosting providers (Azure, GCP, Exoscale, OVHcloud, etc.) as well as Kubernetes and Docker Compose.</p>
<p><strong>We're also introducing free usage for pre-revenue companies and non-commercial use</strong>. If you're a student, a hobbyist, or a startup that hasn't found its footing yet, jambonz will be available to you at no cost.</p>
<h2 id="heading-for-current-customers">For Current Customers</h2>
<p>If you're on a support contract today, you can stay on it. We're not forcing anyone to migrate, and we're not raising prices on existing customers. The open source version will continue to receive security and bug fixes through 2026 and beyond.</p>
<p>For those who want access to new features as they land, we'll offer commercial licensing terms that respect your existing relationship with us.</p>
<h2 id="heading-what-does-the-license-look-like-and-how-does-it-work">What does the license look like and how does it work</h2>
<p>There are two different types of licenses:</p>
<ol>
<li>The standard license, which you can review <a target="_blank" href="https://www.jambonz.org/legal-terms-selfhosted">here</a></li>
<li>The OEM license, which you can review <a target="_blank" href="https://www.jambonz.org/legal-terms-oem">here</a></li>
</ol>
<p>Most of our existing customers will be well-served by the standard license; only those who want to redistribute our code as an embedded part of their product would need an OEM license.</p>
<p>Running the commercial version of jambonz under the standard license requires a software license key.  The license key is tied to a specific jambonz system or cluster, identified by the unique DNS domain name that you assign when you deploy jambonz.  Everyone can get a trial license to start with and, as mentioned above, non-production users can access extended trial licenses.</p>
<p>Licenses can be purchased for a specific capacity, measured by the maximum number of concurrent sessions that jambonz will handle; or, for larger capacities, you can opt for a flat monthly fee that enables unlimited capacity on a single cluster or system. Details on pricing can be found <a target="_blank" href="https://www.jambonz.org/pricing">here</a>.</p>
<p>Standard licenses are billed as a monthly subscription fee. They can be canceled or modified at any time and, unless modified, they auto-renew on a monthly billing cycle. Customers with special needs, such as running air-gapped systems, can contact us for accommodations.</p>
<h2 id="heading-some-housekeeping-items">Some housekeeping items</h2>
<ul>
<li>The commercial software versions will be named version 10.x and above.</li>
<li>The open source software versions (of which the current release is 0.9.5-10) will continue to be named 0.9.x.</li>
<li>Many of our GitHub repos used with version 10 will remain public, MIT-licensed, and open for PRs. These include our client SDKs, our Node-RED plugin, and various npm libraries.</li>
</ul>
<h2 id="heading-looking-forward">Looking Forward</h2>
<p>My goal has never changed: to build great software that delivers real value. Open source was the right tactic to get us started. Commercial licensing is the right tactic to let us keep growing—to fund the innovation that keeps jambonz competitive and to build the kind of company I've always wanted to build.</p>
<p>Thank you for being part of this journey. As always, I'm happy to hear your thoughts.  You can reach me at daveh@jambonz.org, or ping me on our community Slack channel if you are with us there.</p>
<p>—Dave</p>
]]></content:encoded></item><item><title><![CDATA[How to stream text from LLMs using jambonz]]></title><description><![CDATA[What is LLM Streaming?
In the world of Voice AI, providing a responsive, low-latency conversational experience is the holy grail. One technique that is used to minimize latency is end-to-end streaming of text and/or audio to the user. With jambonz yo...]]></description><link>https://blog.jambonz.org/how-to-stream-text-from-llms-using-jambonz</link><guid isPermaLink="true">https://blog.jambonz.org/how-to-stream-text-from-llms-using-jambonz</guid><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Mon, 06 Jan 2025 14:10:37 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-what-is-llm-streaming">What is LLM Streaming?</h1>
<p>In the world of Voice AI, providing a responsive, low-latency conversational experience is the holy grail. One technique that is used to minimize latency is end-to-end streaming of text and/or audio to the user. With jambonz you can build Voice AI applications that leverage LLM streaming and the basic architecture looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734798573072/f1cb6698-d4dc-4cd5-ae7b-1f63a99b6dfc.png" alt class="image--center mx-auto" /></p>
<p>In this article we will do a deep dive on the jambonz streaming architecture and show you how to build an example Voice AI streaming application.</p>
<h1 id="heading-jambonz-streaming-architecture">jambonz streaming architecture</h1>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Support for LLM and TTS streaming has been greatly improved and simplified in jambonz release 0.9.3, so this article assumes that you are running that release or later.</p>
<p>In order to take advantage of the streaming feature you also need to build your jambonz application using the websocket API. (The example code we will show in this article uses our Node.js SDK, which provides an easy way to implement streaming in your application.)</p>
<p>Finally, as of release 0.9.3 we support the following TTS vendors for streaming:</p>
<ul>
<li>Deepgram</li>
<li>ElevenLabs</li>
<li>Cartesia</li>
<li>(Rimelabs coming shortly)</li>
</ul>
<p>We are adding additional vendors all the time, so check back with us if you are looking for support from a different vendor.</p>
<h2 id="heading-jambonz-websocket-protocol-for-streaming">jambonz websocket protocol for streaming</h2>
<p>The full websocket api is described here, but in this article we will focus on the streaming commands.</p>
<p>A jambonz application initiates a streaming request during a call session by sending a <a target="_blank" href>tts:tokens</a> message to jambonz.  Typically, as your application receives streaming text tokens from an LLM it simply sends them immediately on to jambonz using a message like this:</p>
<pre><code class="lang-json">{
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"command"</span>,
      <span class="hljs-attr">"command"</span>: <span class="hljs-string">"tts:tokens"</span>,
      <span class="hljs-attr">"queueCommand"</span>: <span class="hljs-literal">false</span>,
      <span class="hljs-attr">"data"</span>: {
            <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
            <span class="hljs-attr">"tokens"</span>: <span class="hljs-string">"It was the best of times, it was the "</span>
      }
}
</code></pre>
<p>Let's go through this in detail.</p>
<p>The websocket API has always used the "command" type to send instructions to jambonz, so that part is not new. The specific command "tts:tokens" is new, and this is how your application sends a chunk of text to jambonz as part of a larger stream of text it is receiving.</p>
<p>There are two parameters, both required:</p>
<ul>
<li><strong>id</strong>: this can be either a number or a string, as you like.  It must be unique within this session and, as we will see shortly, it is used by jambonz to send you a confirmation that your tokens have been accepted.</li>
<li><strong>tokens</strong>: this is a string of text that appears within a larger text stream you are receiving from the LLM</li>
</ul>
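<p>To make that concrete, here is a minimal sketch of sending a tts:tokens command over the low-level websocket interface. This is not the jambonz SDK; it is just a hypothetical helper around an already-open websocket for an active jambonz session:</p>
<pre><code class="lang-js">// minimal sketch: `socket` is assumed to be an open websocket for an
// active jambonz session (e.g. created with the Node.js `ws` package)
const sendTtsTokens = (socket, id, tokens) =&gt; {
  socket.send(JSON.stringify({
    type: 'command',
    command: 'tts:tokens',
    queueCommand: false,
    data: {id, tokens}  // id must be unique within the session
  }));
};

sendTtsTokens(socket, 101, 'It was the best of times, it was the ');
</code></pre>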
<p>This is simple enough, but several questions may instantly come to mind:</p>
<ol>
<li>Are text tokens always accepted?  What if I send them before the jambonz application is ready to play them out?</li>
<li>Is there any rate limiting?  What if I overload the server?  In fact, what if a malicious client deliberately tries to overload the server?</li>
<li>Do I need to explicitly tell jambonz when to generate audio from the tokens while streaming?</li>
</ol>
<p>Let's answer those questions.</p>
<h4 id="heading-how-do-i-know-if-my-ttstokens-command-was-accepted">How do I know if my tts:tokens command was accepted?</h4>
<p>For each tts:tokens request you send to jambonz you will get an acknowledgement in the form of a tts:tokens-result:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"type"</span>: <span class="hljs-string">"tts:tokens-result"</span>,
     <span class="hljs-attr">"data"</span>: {
            <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
            <span class="hljs-attr">"status"</span>: <span class="hljs-string">"ok"</span>
     }
}
</code></pre>
<p>The message will have an "id" and a "status" property. The status property will be either "ok" or "failed"; if "failed", an additional "reason" property will be included to indicate why the request failed. The following reasons can occur:</p>
<ul>
<li>"missing tokens": no tokens were provided in the request</li>
<li>"full": the application has sent too many tokens and must temporarily buffer and resend later</li>
<li>"connection to  failed": an error was encountered attempting to connect to the TTS vendor</li>
</ul>
<p>Below, we will spend more time on how to handle failures where the reason is "full".</p>
<h4 id="heading-what-if-i-send-tokens-before-jambonz-is-ready-to-play-them">What if I send tokens before jambonz is ready to play them?</h4>
<p>You may wonder what we mean by "when jambonz is ready to play them".</p>
<p>This will be easier to explain when we look at a sample application later, but in general we can say that when an incoming call arrives on a SIP trunk, say, and jambonz begins handling it, it is possible for your application to respond with streaming text even before the audio connection has been fully established.</p>
<p>For this and other cases, jambonz will simply queue the text tokens and then process them as soon as the stream is open from jambonz to the TTS vendor.  So you can send tokens at any time and not worry particularly about whether the TTS connection is currently open or not.</p>
<h4 id="heading-do-i-need-to-tell-jambonz-when-to-generate-audio">Do I need to tell jambonz when to generate audio?</h4>
<p>Yes, you should send a tts:flush command periodically to jambonz to cause it to tell the TTS engine to generate and flush audio.  </p>
<pre><code class="lang-json">{
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"command"</span>,
      <span class="hljs-attr">"command"</span>: <span class="hljs-string">"tts:flush"</span>,
      <span class="hljs-attr">"queueCommand"</span>: <span class="hljs-literal">false</span>
}
</code></pre>
<p>Most LLMs have a streaming protocol that makes it pretty simple to know when to flush the audio; for instance, using the Anthropic API you would do so whenever you get a "message_stop" event:</p>
<pre><code class="lang-js">    <span class="hljs-keyword">for</span> <span class="hljs-keyword">await</span> (<span class="hljs-keyword">const</span> messageStreamEvent <span class="hljs-keyword">of</span> stream) {
      <span class="hljs-keyword">if</span> (messageStreamEvent.delta?.text) {
        <span class="hljs-keyword">const</span> tokens = messageStreamEvent.delta.text;
        session.sendTtsTokens(tokens);
      }
      <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (messageStreamEvent.type === <span class="hljs-string">'message_stop'</span>) {
        session.flushTtsTokens();
      }
    }
</code></pre>
<p>Note: we are using the @jambonz/node-client-ws sdk in the above example code to send the <code>tts:tokens</code> and <code>tts:flush</code> messages.</p>
<h4 id="heading-is-there-any-rate-limiting">Is there any rate limiting?</h4>
<p>Yes, jambonz will only buffer about 5,000 characters for you.  If you send more than that jambonz will return a tts:tokens-result message with a status of "failed" and a reason of "full".  It is your responsibility to stop sending until jambonz sends you an event indicating that the stream is open for sending again.</p>
<p>Note: if you use the @jambonz/node-client-ws sdk, this buffering is handled for you. You will not need to implement any buffering code in your application. If you write your code directly to the low-level websocket interface, though, you need to take this into consideration.</p>
<p>Now, to understand the event that jambonz will send you once it is ready to accept tokens again after a "full" condition, let's look at all of the tts-related events that jambonz sends your application.</p>
<h3 id="heading-tts-related-events-sent-by-jambonz">tts-related events sent by jambonz</h3>
<p>jambonz will send your application the following events over the websocket:</p>
<ul>
<li><strong>stream_open</strong>: this event is sent any time jambonz begins actively streaming text to the TTS vendor.  As we shall see when we look at the example application code, this has to do with how and where you use the <code>say</code> verb with the <code>streaming</code> property in your application.  As noted above, your application can send text tokens in advance of this event and, in the general case, there is no specific action required for your application to take when this event is generated.</li>
<li><strong>stream_closed</strong>: this event is sent when jambonz is not actively streaming text to the TTS vendor.  You can continue sending tokens to jambonz however, and they will be buffered and then sent when the next "stream_open" event occurs.</li>
<li><strong>stream_paused</strong>: this event is sent when jambonz has buffered the max amount of text that it will accept from your application.  You should refrain from sending any more text until you get a "stream_resumed" event.</li>
<li><strong>stream_resumed</strong>: this event is sent when the text buffer maintained by jambonz has drained to an extent that you can again begin sending text.</li>
<li><strong>user_interruption</strong>: this event is sent when the user has "barged in", i.e. started speaking during audio playout from the TTS engine.</li>
</ul>
<p>If you use the Node.js sdk you do not necessarily need to respond to any of these events. If you are writing to the low-level websocket api then minimally you should have a handler for the "stream_resumed" event, because you will need to handle throttling in your application.</p>
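<p>If you are in that situation, the throttling logic might look something like the sketch below. The buffering strategy and the exact shape of the incoming event payloads are assumptions here; treat this as an illustration of the flow control, not a drop-in implementation:</p>
<pre><code class="lang-js">// sketch of client-side flow control against the low-level websocket interface;
// `socket` is an open websocket for an active jambonz session
let paused = false;
const pending = [];  // tokens held back while jambonz reports "full"

const sendOrBuffer = (socket, tokens) =&gt; {
  if (paused) {
    pending.push(tokens);
    return;
  }
  socket.send(JSON.stringify({
    type: 'command',
    command: 'tts:tokens',
    queueCommand: false,
    data: {id: Date.now(), tokens}
  }));
};

socket.on('message', (raw) =&gt; {
  const msg = JSON.parse(raw);
  if (msg.type === 'tts:tokens-result' &amp;&amp; msg.data?.reason === 'full') {
    paused = true;  // stop sending until jambonz tells us the buffer has drained
  }
  // event name per the list above; the payload shape is an assumption
  else if (msg.type === 'tts:streaming-event' &amp;&amp; msg.data?.event_type === 'stream_resumed') {
    paused = false;
    while (pending.length &amp;&amp; !paused) sendOrBuffer(socket, pending.shift());
  }
});
</code></pre>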
<h2 id="heading-jambonz-verb-support-for-streaming">jambonz verb support for streaming</h2>
<p>If you have previously used jambonz, you are aware that text to speech is usually triggered via the <a target="_blank" href>say</a> verb. </p>
<pre><code class="lang-js">session
    .say({<span class="hljs-attr">text</span>: <span class="hljs-string">'Hi there, how can I help you today.'</span>})
</code></pre>
<p>There are two simple changes in release 0.9.3 of jambonz to support streaming.  The first is that the <code>say</code> verb can be used with a <code>streaming</code> property instead of <code>text</code>.</p>
<h3 id="heading-say-verb-with-streaming">say verb with streaming</h3>
<pre><code class="lang-js">session
    .say({<span class="hljs-attr">streaming</span>: <span class="hljs-literal">true</span>})
</code></pre>
<p>The "streaming" property is mutually exclusive with the "text" property; you must specify one or the other.</p>
<p>When a <code>say</code> verb with <code>streaming</code> is executed, jambonz connects to the streaming TTS engine, sends a <code>stream_open</code> event and begins streaming text to the TTS engine.  Any buffered text that was received from your application prior to this will be sent at this time.</p>
<p>If the <code>say</code> verb is nested within a gather, e.g.</p>
<pre><code class="lang-js">session
   .gather({
       say({<span class="hljs-attr">streaming</span>: <span class="hljs-literal">true</span>})
       <span class="hljs-attr">input</span>: [<span class="hljs-string">'speech'</span>]
</code></pre>
<p>Then, when the <code>gather</code> ends, the <code>say</code> verb is killed. At that point jambonz will send a <code>stream_closed</code> event to your application. Any text received from this point on will be buffered by jambonz and sent when the next say with streaming is executed.</p>
<p>This is useful, but another alternative is to use a "background say with streaming".</p>
<h3 id="heading-background-say-with-streaming">background say with streaming</h3>
<p>If you have used jambonz you might be familiar with the concept of a "background gather" where your application can essentially always be listening and returning you transcripts while a user speaks.  This can be useful when creating voice bot platforms.</p>
<p>Similarly, in 0.9.3 you can create a "background say" with streaming by using the <a target="_blank" href>config</a> verb.</p>
<pre><code class="lang-js">      session
        .config({
          <span class="hljs-attr">ttsStream</span>: {
            <span class="hljs-attr">enable</span>: <span class="hljs-literal">true</span>,
          },
          <span class="hljs-attr">bargeIn</span>: {
            <span class="hljs-attr">enable</span>: <span class="hljs-literal">true</span>,
            <span class="hljs-attr">sticky</span>: <span class="hljs-literal">true</span>,
            <span class="hljs-attr">minBargeinWordCount</span>: <span class="hljs-number">1</span>,
            <span class="hljs-attr">actionHook</span>: <span class="hljs-string">'/speech-detected'</span>,
            <span class="hljs-attr">input</span>: [<span class="hljs-string">'speech'</span>]
          }
        })
        .say({<span class="hljs-attr">text</span>: <span class="hljs-string">'Hi there, how can I help you today?'</span>})
        .send();
</code></pre>
<p>In the example above the <code>ttsStream</code> property is used to enable a <code>say</code> verb with streaming that operates in the background. Any text that your application streams to jambonz will be immediately played out.</p>
<p>If you want to disable streaming at some point in your application you simply issue another <code>config</code> verb:</p>
<pre><code class="lang-js">session
   .config({
      <span class="hljs-attr">ttsStream</span>: {
         <span class="hljs-attr">enable</span>: <span class="hljs-literal">false</span>
      }
  })
</code></pre>
<h1 id="heading-jambonz-sdk-support-for-streaming">jambonz sdk support for streaming</h1>
<p>If you've used the <a target="_blank" href="https://www.npmjs.com/package/@jambonz/node-client-ws">jambonz Node.js sdk for websocket api</a> you are used to using the <code>Session</code> class, as in this code snippet:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> service = <span class="hljs-function">(<span class="hljs-params">{logger, makeService}</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> svc = makeService({<span class="hljs-attr">path</span>: <span class="hljs-string">'/tts-streaming'</span>});

  svc.on(<span class="hljs-string">'session:new'</span>, <span class="hljs-function">(<span class="hljs-params">session</span>) =&gt;</span> {
    logger.debug({session}, <span class="hljs-string">`new incoming call: <span class="hljs-subst">${session.call_sid}</span>`</span>);
</code></pre>
<p>Several new methods and events have been added to the Session class to support streaming.</p>
<h2 id="heading-streaming-related-methods">Streaming-related methods</h2>
<ul>
<li><strong>sendTtsTokens(text)</strong>: sends text tokens to jambonz.  This method will handle flow control with jambonz if necessary.</li>
<li><strong>flushTtsTokens()</strong>: notifies jambonz that the audio provided should be flushed through the TTS engine.</li>
<li><strong>clearTtsTokens()</strong>: clears all buffered and pending audio; typically called to handle a user interruption.</li>
</ul>
<h2 id="heading-streaming-related-events">Streaming-related events</h2>
<ul>
<li><strong>tts:streaming-event</strong>: sent by jambonz to notify the application of a streaming-related event. As above, the event types that can be sent are <code>stream_open</code>, <code>stream_closed</code>, <code>stream_paused</code>, and <code>stream_resumed</code>.</li>
<li><strong>tts:user_interrupt</strong>: sent by jambonz to notify the application that the user has barged into the audio playback. The application should stop sending text tokens at this point and wait for the next utterance from the user.</li>
</ul>
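<p>Wiring these together with the methods above, a minimal interrupt handler might look like this (a sketch only; the full example application below shows the complete flow):</p>
<pre><code class="lang-js">session
  .on('tts:streaming-event', (evt) =&gt; {
    // one of: stream_open, stream_closed, stream_paused, stream_resumed
    console.log({evt}, 'received streaming event');
  })
  .on('tts:user_interrupt', () =&gt; {
    // the user barged in: discard anything queued for the TTS engine
    // and wait for the next utterance before sending more tokens
    session.clearTtsTokens();
  });
</code></pre>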
<h1 id="heading-example-application-using-streaming">Example application using streaming</h1>
<p>Let's look at a sample Voice AI application that uses Anthropic to provide a conversational interface with a user.</p>
<p>This sample application can be generated using <code>npx create-jambonz-ws-app</code>. If you have not used this before, it is a simple command-line utility that can scaffold up different types of jambonz applications using the websocket API:</p>
<pre><code class="lang-bash"> npx create-jambonz-ws-app
Usage: create-jambonz-ws-app [options] project-name

Options:
  -v, --version              display the current version
  -s, --scenario &lt;scenario&gt;  generates a sample websocket app <span class="hljs-keyword">for</span> jambonz (default: <span class="hljs-string">"hello-world"</span>)
  -h, --<span class="hljs-built_in">help</span>                 display <span class="hljs-built_in">help</span> <span class="hljs-keyword">for</span> <span class="hljs-built_in">command</span>


Scenarios available:
- hello-world: a simple app that responds to an incoming call using text-to-speech
- <span class="hljs-built_in">echo</span>: an collect-and-response app that echos <span class="hljs-built_in">caller</span> voice input
- openai-realtime: a conversational voice interface to the OpenAI Realtime API
- deepgram-voice-agent: a conversational voice interface to the Deepgram Voice Agent API
- llm-streaming: example of streaming text tokens from Anthropic LLM
- all: generate all of the above scenarios

Example:
  $ npx create-jambonz-ws-app --scenario <span class="hljs-string">"hello-world, echo"</span> my-app
</code></pre>
<p>Generating an LLM streaming application using Anthropic as the LLM is as simple as this:</p>
<pre><code class="lang-bash">$ npx create-jambonz-ws-app --scenario llm-streaming my-app

Creating a new jambonz websocket app <span class="hljs-keyword">in</span> /Users/dhorton/Downloads/my-app

Installing packages...
$
</code></pre>
<p>The main part of the application is just over 100 lines of code.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> Anthropic = <span class="hljs-built_in">require</span>(<span class="hljs-string">'@anthropic-ai/sdk'</span>);
<span class="hljs-keyword">const</span> assert = <span class="hljs-built_in">require</span>(<span class="hljs-string">'assert'</span>);
<span class="hljs-keyword">const</span> ANTHROPIC_MODEL = <span class="hljs-string">'claude-3-5-haiku-latest'</span>;
<span class="hljs-keyword">const</span> systemPrompt = <span class="hljs-string">`You are a helpful conversational AI voice bot.
Please keep your answers short and to the point; the user will follow up with more questions if needed.
Please reply with unadorned text that can be read aloud to the user using a TTS engine`</span>;

assert(process.env.ANTHROPIC_API_KEY, <span class="hljs-string">'ANTHROPIC_API_KEY is required'</span>);

<span class="hljs-keyword">const</span> service = <span class="hljs-function">(<span class="hljs-params">{logger, makeService}</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> svc = makeService({<span class="hljs-attr">path</span>: <span class="hljs-string">'/llm-streaming'</span>});

  svc.on(<span class="hljs-string">'session:new'</span>, <span class="hljs-function">(<span class="hljs-params">session</span>) =&gt;</span> {

    session.locals = {
      <span class="hljs-attr">logger</span>: logger.child({<span class="hljs-attr">call_sid</span>: session.call_sid}),
      <span class="hljs-attr">client</span>: <span class="hljs-keyword">new</span> Anthropic({ <span class="hljs-attr">system</span>: systemPrompt }),
      <span class="hljs-attr">messages</span>: [],
      <span class="hljs-attr">assistantResponse</span>: <span class="hljs-string">''</span>
    };
    logger.debug({session}, <span class="hljs-string">`new incoming call: <span class="hljs-subst">${session.call_sid}</span>`</span>);


    session
      .on(<span class="hljs-string">'/speech-detected'</span>, onSpeechDetected.bind(<span class="hljs-literal">null</span>, session))
      .on(<span class="hljs-string">'tts:streaming-event'</span>, onStreamingEvent.bind(<span class="hljs-literal">null</span>, session))
      .on(<span class="hljs-string">'tts:user_interrupt'</span>, onUserInterrupt.bind(<span class="hljs-literal">null</span>, session))
      .on(<span class="hljs-string">'close'</span>, onClose.bind(<span class="hljs-literal">null</span>, session))
      .on(<span class="hljs-string">'error'</span>, onError.bind(<span class="hljs-literal">null</span>, session));

    <span class="hljs-keyword">try</span> {
      session
        .config({
          <span class="hljs-attr">ttsStream</span>: {
            <span class="hljs-attr">enable</span>: <span class="hljs-literal">true</span>,
          },
          <span class="hljs-attr">bargeIn</span>: {
            <span class="hljs-attr">enable</span>: <span class="hljs-literal">true</span>,
            <span class="hljs-attr">sticky</span>: <span class="hljs-literal">true</span>,
            <span class="hljs-attr">minBargeinWordCount</span>: <span class="hljs-number">1</span>,
            <span class="hljs-attr">actionHook</span>: <span class="hljs-string">'/speech-detected'</span>,
            <span class="hljs-attr">input</span>: [<span class="hljs-string">'speech'</span>]
          }
        })
        .say({<span class="hljs-attr">text</span>: <span class="hljs-string">'Hi there, how can I help you today?'</span>})
        .send();
    } <span class="hljs-keyword">catch</span> (err) {
      session.locals.logger.info({err}, <span class="hljs-string">`Error responding to incoming call: <span class="hljs-subst">${session.call_sid}</span>`</span>);
      session.close();
    }
  });
};

<span class="hljs-keyword">const</span> onSpeechDetected = <span class="hljs-keyword">async</span>(session, event) =&gt; {
  <span class="hljs-keyword">const</span> {logger, client} = session.locals;
  <span class="hljs-keyword">const</span> {speech} = event;

  session.reply();

  <span class="hljs-keyword">if</span> (speech?.is_final) {
    <span class="hljs-keyword">const</span> {transcript} = speech.alternatives[<span class="hljs-number">0</span>];
    session.locals.messages.push({
      <span class="hljs-attr">role</span>: <span class="hljs-string">'user'</span>,
      <span class="hljs-attr">content</span>: transcript
    });
    session.locals.user_interrupt = <span class="hljs-literal">false</span>;

    logger.info({<span class="hljs-attr">messages</span>:session.locals.messages}, <span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> making request to Anthropic`</span>);

    <span class="hljs-keyword">const</span> stream = <span class="hljs-keyword">await</span> client.messages.create({
      <span class="hljs-attr">model</span>: ANTHROPIC_MODEL,
      <span class="hljs-attr">max_tokens</span>: <span class="hljs-number">1024</span>,
      <span class="hljs-attr">messages</span>: session.locals.messages,
      <span class="hljs-attr">stream</span>: <span class="hljs-literal">true</span>
    });

    <span class="hljs-keyword">for</span> <span class="hljs-keyword">await</span> (<span class="hljs-keyword">const</span> messageStreamEvent <span class="hljs-keyword">of</span> stream) {
      <span class="hljs-keyword">if</span> (session.locals.user_interrupt) {
        logger.info(<span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> user interrupted, closing stream`</span>);
        session.locals.messages.push({
          <span class="hljs-attr">role</span>: <span class="hljs-string">'assistant'</span>,
          <span class="hljs-attr">content</span>: <span class="hljs-string">`<span class="hljs-subst">${session.locals.assistantResponse}</span>...`</span>
        });
        session.locals.assistantResponse = <span class="hljs-string">''</span>;
        <span class="hljs-keyword">break</span>;
      }

      logger.info({messageStreamEvent}, <span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> received message stream event`</span>);

      <span class="hljs-keyword">if</span> (messageStreamEvent.delta?.text) {
        <span class="hljs-keyword">const</span> tokens = messageStreamEvent.delta.text;
        session.locals.assistantResponse += tokens;
        session.sendTtsTokens(tokens)
          .catch(<span class="hljs-function">(<span class="hljs-params">err</span>) =&gt;</span> logger.error({err}, <span class="hljs-string">'error sending TTS tokens'</span>));
      }
      <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (messageStreamEvent.type === <span class="hljs-string">'message_stop'</span>) {
        logger.info(<span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> flushing TTS tokens`</span>);
        session.flushTtsTokens();
        session.locals.messages.push({
          <span class="hljs-attr">role</span>: <span class="hljs-string">'assistant'</span>,
          <span class="hljs-attr">content</span>: session.locals.assistantResponse
        });
        session.locals.assistantResponse = <span class="hljs-string">''</span>;
      }
    }
    logger.info(<span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> completed processing stream`</span>);
  }
};

<span class="hljs-keyword">const</span> onUserInterrupt = <span class="hljs-function">(<span class="hljs-params">session</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> {logger} = session.locals;
  logger.info(<span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> received user interrupt, cancel any requests in progress to Anthropic`</span>);
  session.locals.user_interrupt = <span class="hljs-literal">true</span>;
};

<span class="hljs-keyword">const</span> onStreamingEvent = <span class="hljs-function">(<span class="hljs-params">session, event</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> {logger} = session.locals;
  logger.info({event}, <span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> received streaming event`</span>);
};

<span class="hljs-keyword">const</span> onClose = <span class="hljs-function">(<span class="hljs-params">session, code, reason</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> {logger} = session.locals;
  logger.debug({session, code, reason}, <span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> closed`</span>);
};

<span class="hljs-keyword">const</span> onError = <span class="hljs-function">(<span class="hljs-params">session, err</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> {logger} = session.locals;
  logger.info({err}, <span class="hljs-string">`session <span class="hljs-subst">${session.call_sid}</span> received error`</span>);
};

<span class="hljs-built_in">module</span>.exports = service;
</code></pre>
<p>You can see it's pretty simple: we use a "background gather" (config.bargeIn) to always be collecting the user's utterance, and a "background say" (config.ttsStream) to stream text from the LLM vendor via jambonz to the TTS engine.</p>
<p>We have event handlers to receive streaming events from jambonz, and we handle user interruptions by canceling the outstanding request to Anthropic.  </p>
<p>That's it! Feel free to modify it to your needs.</p>
]]></content:encoded></item><item><title><![CDATA[Using jambonz for Retell custom telephony]]></title><description><![CDATA[Overview
jambonz is an open source voice gateway platform that can integrate any telephony provider with Retell's VoiceAI platform. It has several advantages over Twilio, including:

more cost-effective: Twilio's per-minute rounding and surcharges fo...]]></description><link>https://blog.jambonz.org/using-jambonz-for-retell-custom-telephony</link><guid isPermaLink="true">https://blog.jambonz.org/using-jambonz-for-retell-custom-telephony</guid><category><![CDATA[Retell]]></category><category><![CDATA[#cpass #voicegateway #opensource #jambonz]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Wed, 30 Oct 2024 22:40:08 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-overview">Overview</h1>
<p><a target="_blank" href="https://jambonz.org">jambonz</a> is an open source voice gateway platform that can integrate any telephony provider with Retell's VoiceAI platform. It has several advantages over Twilio, including:</p>
<ul>
<li><p>more cost-effective: Twilio's per-minute rounding and surcharges for using features like their voice sdk and bidirectional streaming can be eliminated, and jambonz provides all the same features (and more).</p>
</li>
<li><p>you can bring your own carrier (jambonz has integrated with hundreds of SIP providers and PBXs).</p>
</li>
<li><p>run anywhere: jambonz can run in your cloud, on prem, or you can use our hosted service.</p>
</li>
</ul>
<p>In this blog post we will walk you through deploying jambonz as a custom telephony solution for Retell AI. <strong>You will need</strong>:</p>
<ol>
<li><p>A SIP trunking provider or PBX that you are currently using and want to connect to Retell (presumably, you already have this if you are interested in custom telephony integration).</p>
</li>
<li><p>An account on <a target="_blank" href="https://jambonz.cloud/register">jambonz.cloud</a> (you can sign up for a free 3-week trial) <strong>or</strong>, alternatively, your own hosted jambonz platform (you can deploy in your own AWS account using this <a target="_blank" href="https://aws.amazon.com/marketplace/pp/prodview-55wp45fowbovo">AWS marketplace offering</a>).</p>
</li>
<li><p>A server on which to run a jambonz application (you can use <a target="_blank" href="https://ngrok.com/">ngrok</a> for early testing from your laptop).</p>
</li>
</ol>
<p>In this example, we will assume you are deploying using jambonz.cloud but the instructions are the same for those running a self-hosted jambonz system.</p>
<h1 id="heading-high-level-overview">High-level overview</h1>
<h2 id="heading-how-retell-works">How Retell works</h2>
<p>Retell provides two methods of custom telephony integration, and jambonz supports both:</p>
<ul>
<li><p><a target="_blank" href="https://docs.retellai.com/make-calls/custom-telephony#method-1-elastic-sip-trunking-recommended">Method 1: Elastic SIP Trunking</a>. This is the method recommended by Retell, and we will focus mostly on this approach.</p>
</li>
<li><p><a target="_blank" href="https://docs.retellai.com/make-calls/custom-telephony#method-2-dial-to-sip-endpoint">Method 2: Dial to SIP Endpoint</a>. This method can alternatively be used, but one of the drawbacks of this method is that the built-in call transfer function that Retell provides will not work with it because Retell will not send a REFER when this method is used.</p>
</li>
</ul>
<blockquote>
<p>Note: you can still implement call transfer when using method 2 by creating your own user function that instructs jambonz to transfer the call.</p>
</blockquote>
<h2 id="heading-how-jambonz-works">How jambonz works</h2>
<p>jambonz lets you create sip trunks that can send or receive calls from any number of providers. When calls arrive at a jambonz platform, the first thing jambonz does is to figure out which account is responsible for handling that call (jambonz is a multi-tenant platform). One of the main ways it does that is by looking at the DID/called number and determining which account owns that DID and has a SIP trunk built to the source/sending SIP trunking provider or gateway.</p>
<p>Once the account is identified, jambonz connects to a webhook or websocket application that the account has configured in order to retrieve a set of instructions for the call. It is up to the account holder to provide such an application. In our case, I am going to give you a sample application that will connect the incoming call to your Retell agent, using either method 1 or method 2. You should feel free to modify this sample application to your needs and desires, but it handles quite a few things out of the box:</p>
<ul>
<li><p>inbound and outbound calling</p>
</li>
<li><p>Retell webhooks for inbound call and agent events</p>
</li>
<li><p>call transfer via Retell built-in call transfer function.</p>
</li>
</ul>
<p>OK, with that brief overview of how Retell and jambonz work individually, let's put peanut butter and chocolate together now and make something fabulous here!</p>
<h1 id="heading-step-by-step-instructions">Step by step instructions</h1>
<p>First, let's integrate using method 1: Elastic SIP Trunking</p>
<h2 id="heading-get-the-jambonz-application">Get the jambonz application</h2>
<p>You can find the example jambonz application <a target="_blank" href="https://github.com/jambonz/retell-sip-integration-example">here</a>. Clone it to a directory on your laptop and install it:</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/jambonz/retell-sip-integration-example.git
<span class="hljs-built_in">cd</span> retell-sip-integration-example
npm ci
</code></pre>
<h2 id="heading-on-jambonz-create-a-carriersip-trunk-to-send-calls-to-retell">On jambonz: create a Carrier/SIP trunk to send calls to Retell</h2>
<p>jambonz is a BYOC (Bring your own Carrier) platform. The term "Carrier" is used interchangeably with "SIP trunk" and we first want to build a sip trunk from jambonz to Retell.</p>
<p>To do so, log into your jambonz account and select Carrier from the lefthand menu, then click the plus sign to add a carrier.</p>
<p>Give the carrier the name 'Retell'. Check the box for E.164 syntax, uncheck outbound authentication, and then add one SIP gateway with the network address <code>5t4n6j0wnrl.sip.livekit.cloud</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730317517911/45dda0dd-a9c9-4197-91c0-75b459cd1c2f.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-on-jambonz-create-a-sip-credential-to-use-for-authentication-with-retell">On jambonz: create a SIP credential to use for authentication with Retell</h2>
<p>Click on "Clients" and add a sip client with a name and password.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730317655883/3853c3c3-26cf-4253-a48c-e0d01986399e.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-on-jambonz-add-a-carriersip-trunk-for-your-pstn-provider">On jambonz: add a Carrier/SIP trunk for your PSTN provider</h2>
<p>Create a second Carrier entry; this time you are creating a SIP trunk for your PSTN/DID provider. Add both inbound and outbound gateways for this SIP trunking provider. jambonz provides a lot of options for integrating SIP trunking providers, so choose the ones that are relevant to your provider.</p>
<p>You can even use Twilio as a trunking provider if you want, and when you create the Carrier notice there is a dropdown of predefined settings for some carriers, including Twilio.</p>
<h2 id="heading-on-jambonz-add-an-application">On jambonz: add an application</h2>
<p>In a bit you will be running the jambonz application that you cloned. Before you do that, we need to add an Application in jambonz. Click to add an application, give it a name, and set both the call webhook and call status webhook to <code>wss://&lt;yourdomain&gt;/retell</code>. (If you are using ngrok to test from your laptop then you will use the ngrok domain.)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730326415498/58003c95-65be-4ae4-8e2a-123cb1dcf96b.png" alt class="image--center mx-auto" /></p>
<p>For speech vendors you can leave the default setting, as your application initially will not need to use any TTS or STT services on jambonz. Save the application.</p>
<h2 id="heading-on-jambonz-add-your-phone-numbers">On jambonz: add your phone number(s)</h2>
<p>Now that you have added your sip trunking provider and an application, add the phone numbers that you have routed from that SIP trunking provider to jambonz. For each, route the calls arriving on that number to the application that you created. (If you have many phone numbers you can route all calls from that provider to the application on the Carrier settings page).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730326488791/60bc841c-5ea7-47f3-b7fd-ff978080e6fe.png" alt class="image--center mx-auto" /></p>
<p>Things are now set up on jambonz. We have:</p>
<ul>
<li><p>created a sip trunk so we can receive and send calls to your trunking provider,</p>
</li>
<li><p>created a sip credential (username, password) we can use to authenticate with Retell,</p>
</li>
<li><p>created another sip trunk so we can send calls to Retell,</p>
</li>
<li><p>added a jambonz application that provides the "glue" to forward these calls to Retell, and</p>
</li>
<li><p>created routing of calls arriving on phone numbers to this application</p>
</li>
</ul>
<p>Now let's head over to Retell and finish things off.</p>
<h2 id="heading-on-retell-add-phone-numbers">On Retell: add phone number(s)</h2>
<p>In the Retell Dashboard, select "Phone Numbers" and click the plus sign. In the dropdown select "Connect to your number via SIP trunking".</p>
<ul>
<li><p>Add the phone number in E.164 format (i.e. a leading + followed by the country code)</p>
</li>
<li><p>For termination URI enter a URI with the DNS of your sip realm in jambonz (you can find that under the Account tab in the jambonz portal), e.g. 'mydomain.sip.jambonz.cloud'</p>
</li>
<li><p>For sip trunk username and password enter the username and password for the SIP credential you created above on jambonz.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738096218572/8a33ab99-2f61-45ba-920c-767a1cd7829b.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h2 id="heading-on-retell-associate-the-number-to-an-agent">On Retell: associate the number to an agent</h2>
<p>Select an agent and then associate the phone number to the agent.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730384572321/4dccdbc5-ba89-4a72-9e5a-96e78b531bd1.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-on-retell-set-the-inbound-call-webhook">On Retell: set the inbound call webhook</h2>
<p>Select your Agent and then "webhook settings". Add both an "Inbound call webhook URL" and "Agent Level Webhook URL".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730318587428/627f35ea-7be3-40ce-b2be-f2ba9b14ea69.png" alt class="image--center mx-auto" /></p>
<p>For both, the host will be the DNS where your jambonz application is running, and the paths will be "/inbound-webhook" and "/agent-events" respectively.</p>
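<p>The example application implements both of those paths for you. Conceptually they are just small HTTP handlers, something like the simplified sketch below (the real logic lives in the repo you cloned, which is the reference):</p>
<pre><code class="lang-js">// simplified sketch of the two webhook endpoints Retell will call;
// see jambonz/retell-sip-integration-example for the actual implementation
const express = require('express');
const app = express();
app.use(express.json());

app.post('/inbound-webhook', (req, res) =&gt; {
  // Retell invokes this when an inbound call reaches the agent;
  // a 200 response lets the call proceed
  console.log('inbound call webhook from Retell:', req.body);
  res.sendStatus(200);
});

app.post('/agent-events', (req, res) =&gt; {
  // agent-level events (call lifecycle, analysis, etc.) arrive here
  console.log('agent event from Retell:', req.body);
  res.sendStatus(200);
});

app.listen(3000);
</code></pre>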
<h1 id="heading-running-the-application-method-1">Running the application - method 1</h1>
<p>Now you are ready to test the application. You must provide your Retell api key and the name of the Retell Carrier that you created on jambonz:</p>
<pre><code class="lang-bash">RETELL_API_KEY=xxxxxxxxxx RETELL_TRUNK_NAME=Retell node app.js
</code></pre>
<p>Place a call to one of your phone numbers and it should be connected to your Retell agent. You will see lots of logging of agent events in your console or log where the jambonz application is running.</p>
<h1 id="heading-running-the-application-method-2">Running the application - method 2</h1>
<p>If instead you wish to use method 2 (Dial to SIP Endpoint), you do not need to create a Carrier for Retell on jambonz. You still need to create the Carrier entry for your SIP provider, create the application, and add the phone numbers as before (you do not need to create a SIP credential).</p>
<p>No setup is needed on the Retell side either; you do not need to add a phone number since we won't be dialing a phone number. Instead, we will be calling the <a target="_blank" href="https://docs.retellai.com/api-references/register-call">Register Call</a> API.</p>
<p>To run using method 2, you need to specify your Retell api key and agent id.</p>
<pre><code class="lang-bash">RETELL_API_KEY=xxxxxxxxxxxxxx RETELL_AGENT_ID=agent_yyyyyyyyy node app.js
</code></pre>
<h1 id="heading-outbound-calls">Outbound calls</h1>
<p>To use this application for outbound calls, use the <a target="_blank" href="https://api.jambonz.org/#243a2edd-7999-41db-bd0d-08082bbab401">jambonz REST API</a> to create a new call. To do this you will need to know:</p>
<ul>
<li><p>your jambonz account_sid</p>
</li>
<li><p>your jambonz api key</p>
</li>
<li><p>the application_sid of this application (available once you Add Application in jambonz)</p>
</li>
<li><p>the base URL of the jambonz system you are using (https://api.jambonz.cloud for example on jambonz.cloud)</p>
</li>
<li><p>the name of the Carrier you created for your PSTN provider</p>
</li>
</ul>
<p>You can then format and send an HTTP POST to jambonz like this:</p>
<pre><code class="lang-bash">curl --location -g <span class="hljs-string">'https:/{{baseUrl}}/v1/Accounts/{{account_sid}}/Calls'</span> \
--header <span class="hljs-string">'Authorization: Bearer {{api_key}}'</span> \
--header <span class="hljs-string">'Content-Type: application/json'</span> \
--data <span class="hljs-string">'{
    "application_sid": "{{application_sid}}",
    "from": "15083728299",
    "to": {
        "type": "phone",
        "number": "15082084809",
        "trunk": "{{PSTN carrier name}}"
    }
}'</span>
</code></pre>
<p>Of course, substitute in your own from and to phone numbers. The example above assumes that you have created a BYOC trunk on jambonz that you will use to outdial the user.</p>
<p>The result of running the above curl command, then, will be to dial the user first and, when they answer, connect them to your Retell agent.</p>
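<p>If you prefer to place the outbound call from Node.js rather than curl, the equivalent request looks like this (Node 18+ provides a global fetch; fill in the placeholder values from your own account):</p>
<pre><code class="lang-js">// same request as the curl example above, from Node.js;
// all of these values come from your own jambonz account and setup
const baseUrl = 'https://api.jambonz.cloud';
const accountSid = 'your-account-sid';
const apiKey = 'your-api-key';
const applicationSid = 'your-application-sid';
const pstnCarrierName = 'your PSTN carrier name';

const createCall = async() =&gt; {
  const res = await fetch(`${baseUrl}/v1/Accounts/${accountSid}/Calls`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      application_sid: applicationSid,
      from: '15083728299',
      to: {type: 'phone', number: '15082084809', trunk: pstnCarrierName}
    })
  });
  console.log('create call result:', await res.json());
};

createCall();
</code></pre>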
<h1 id="heading-conclusion">Conclusion</h1>
<p>This blog post showed how you can easily deploy the jambonz voice gateway to provide custom telephony integration to Retell AI, saving costs and enabling interoperability with any SIP trunking/DID provider.</p>
<p>If you have any questions about jambonz, or would like to inquire about our support plans please email us at support@jambonz.org, or join our Slack channel by going to <a target="_blank" href="https://joinslack.jambonz.org">joinslack.jambonz.org</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Some initial thoughts on OpenAI's Realtime API]]></title><description><![CDATA[I spent the past weekend adding support for OpenAI's Realtime API to jambonz. My first impression was very positive: as others have reported,  the conversational flow was very much at the tempo of a normal conversation. 
I'd implemented support for i...]]></description><link>https://blog.jambonz.org/some-initial-thoughts-on-openais-realtime-api</link><guid isPermaLink="true">https://blog.jambonz.org/some-initial-thoughts-on-openais-realtime-api</guid><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Mon, 14 Oct 2024 11:40:51 GMT</pubDate><content:encoded><![CDATA[<p>I spent the past weekend adding support for <a target="_blank" href="https://openai.com/index/introducing-the-realtime-api/">OpenAI's Realtime API</a> to jambonz. My first impression was very positive: as others have reported,  the conversational flow was very much at the tempo of a normal conversation. </p>
<p>I'd implemented support for interrupting the agent, so that worked nicely and I was able to have amazingly lifelike conversations with the agent. I'd also implemented support for function calls, so I was able to supply tools to the agent and see them used properly to increase the value content delivered by the conversation.</p>
<p>Then I decided to do a little experiment. I decided to use jambonz to create an application that would use the OpenAI API for a Voice/AI conversation, <strong>while at the same time</strong> sending the user's audio input to Deepgram for simultaneous speech recognition processing.</p>
<p>The goal was to compare the transcripts returned by OpenAI (Whisper) with those returned by Deepgram for the same audio stream, delivered simultaneously to each service. And furthermore, to measure the relative latency of each in processing the same realtime audio stream.</p>
<p>I chose Deepgram as the comparison because in my opinion they are the premier ASR vendor in terms of accuracy and latency for English language recognition.</p>
<h1 id="heading-the-setup">The setup</h1>
<p>The setup consisted of a jambonz server running on an AWS EC2 instance, configured to receive calls from a Twilio Elastic SIP Trunk. The calls were routed to a jambonz application that simultaneously started a Voice/AI conversation with the OpenAI API while streaming the caller's audio to Deepgram for recognition. To make it a fair comparison, I did not provide any phrase hints to Deepgram or other special configuration. The OpenAI and Deepgram services were both accessed via their hosted endpoints, and my EC2 instance was in the us-east-2 region of AWS.</p>
<h1 id="heading-the-measurements">The measurements</h1>
<p>The OpenAI API provides a wealth of event data during the conversation by means of <a target="_blank" href="https://platform.openai.com/docs/guides/realtime/server-events">server events</a> that it sends back over the websocket connection. These include <code>conversation.item.input_audio_transcription.completed</code>, which is emitted when a final transcript is made from a user utterance. So I used this event to determine when OpenAI, and the underlying Whisper ASR, had completed speech recognition of the user's speech and what it determined the user had said.</p>
<p>Similarly, Deepgram also provides events over a websocket interface that include both partial and final transcripts. I used the arrival of final transcripts for the same user utterance to compare with OpenAI.</p>
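<p>For those curious about the mechanics, the comparison boils down to timestamping the final-transcript event from each service for the same utterance and taking the difference. Here is a simplified sketch of the idea (not the exact harness I used):</p>
<pre><code class="lang-js">// sketch: record when each vendor returns its final transcript for the
// current turn, then report the difference once both have arrived
let turn = 1;
const arrivals = {};  // turn -&gt; {openai?: number, deepgram?: number}

const markFinal = (vendor, transcript) =&gt; {
  const t = arrivals[turn] = arrivals[turn] || {};
  t[vendor] = Date.now();
  console.log(`turn ${turn} ${vendor} final transcript: "${transcript}"`);
  if (t.openai &amp;&amp; t.deepgram) {
    const diffSecs = (t.openai - t.deepgram) / 1000;
    console.log(`turn ${turn}: Deepgram faster by ${diffSecs} secs`);
    turn++;
  }
};

// called from the OpenAI websocket handler on a
// conversation.item.input_audio_transcription.completed server event:
//   markFinal('openai', evt.transcript);
// and from the Deepgram websocket handler when a final transcript arrives:
//   markFinal('deepgram', transcript);
</code></pre>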
<h1 id="heading-the-conversation">The conversation</h1>
<p>For this test, I had a conversation where I asked OpenAI a series of questions about Alexander the Great. You can listen to the conversation below; it went pretty well, although you'll hear the OpenAI assistant misunderstood one of my questions at the end and had to be nudged back on track. All in all though, a very positive conversation experience, especially since I was calling in from my mobile phone and not making any effort to speak more clearly or slower than usual.</p>
<p>You can listen to the conversation <a target="_blank" href="https://recordings.jambonz.cloud/alex-the-great-conversation-with-openai.mp3">here</a>.</p>
<p>However, when we look at the details of the transcripts that OpenAI produced and its latency compared to Deepgram, we find that OpenAI is both less accurate and slower. Let's go through the conversation turn by turn and look at what happens:</p>
<h2 id="heading-turn-by-turn-analysis">Turn by turn analysis</h2>
<h3 id="heading-turn-1">Turn 1</h3>
<ul>
<li>Assistant: <em>"How can I help you today?"</em></li>
<li>Me: <em>"Well, I'd like to ask some questions about Alexander the Great."</em></li>
<li>OpenAI transcript: <em>"Well, I'd like to ask some questions about Alexander the Great."</em></li>
<li>Deepgram transcript: <em>"well i'd like to ask some questions about alexander the great"</em></li>
<li>latency: Deepgram faster by <strong>0.965 secs</strong></li>
</ul>
<blockquote>
<p>So far, so good. Both OpenAI Whisper and Deepgram return accurate transcripts, with Deepgram being faster.</p>
</blockquote>
<h3 id="heading-turn-2">Turn 2</h3>
<ul>
<li>Assistant: <em>"Sure, I'm happy to help! What do you want to know about Alexander the Great?"</em></li>
<li>Me: <em>"Well, I understand he had a pretty brief reign, can you tell me about when that was?"</em></li>
<li>OpenAI: <em>"understand he had a pretty brief reign. Can you tell me a little bit about when that was?"</em></li>
<li>Deepgram: <em>"well i understand he had a pretty brief rein can you tell me a little bit about when that was with"</em></li>
<li>Latency: Deepgram faster by <strong>3.12  secs</strong></li>
</ul>
<blockquote>
<p>Again, both basically captured what I said in a form that was understandable to the LLM. However, OpenAI missed the first two words I spoke. This seems to be an intermittent problem with OpenAI, because I have observed it on multiple occasions. Deepgram was basically accurate, though it rendered "reign" as the homophone "rein" and appended a stray "with".<br />In this turn, Deepgram was much faster than OpenAI. However, if you listen to the audio you can hear that the slower latency did not impact the conversational flow in a noticeable way. More on that later.</p>
</blockquote>
<h3 id="heading-turn-3">Turn 3</h3>
<ul>
<li>Assistant: <em>"Certainly! Alexander the Great reigned from 336 BC until his death in 323 BC. Despite the brevity of his reign, he created one of the largest empires in history, stretching from Greece to Egypt and into northwest India. Quite the accomplishment for about 13 years of work, wouldn't you say?"</em></li>
<li>Me: <em>"How far into Asia did his empire extend and where did he stop?"</em></li>
<li>OpenAI: <em>"How far into Asia did his empire extend and where did he stop?"</em></li>
<li>Deepgram: <em>"how far into asia did his empire extend and and where did he stop"</em></li>
<li>Latency: Deepgram faster by <strong>1.64  secs</strong></li>
</ul>
<blockquote>
<p>Everything still going well and both recognizers achieved accurate transcripts.</p>
</blockquote>
<h3 id="heading-turn-4">Turn 4</h3>
<ul>
<li>Assistant: <em>"Alexander's empire extended as far as the Hyphasis River, known today as the Beas River, in what is now the Punjab region of modern-day India and Pakistan. He reached this point in 326 BC, but his troops, exhausted and longing to return home, mutinied, forcing him to turn back. So, he never actually lost a battle, but his campaign ended due to the morale and fatigue of his army."</em></li>
<li>Me: <em>"I thought he actually stopped at the Inus river. Am I wrong?"</em></li>
<li>OpenAI: <em>"I thought he actually stopped at the Inis River, am I wrong?"</em></li>
<li>Deepgram: <em>"i thought he actually stopped at the ines river am i wrong"</em></li>
<li>Latency: Deepgram faster by <strong>1.9  secs</strong></li>
</ul>
<blockquote>
<p>Both recognizers transcribe the utterance accurately.</p>
</blockquote>
<h3 id="heading-turn-5">Turn 5</h3>
<ul>
<li>Assistant: <em>"Ah, close but not quite! Alexander reached the Hyphasis River—modern-day Beas River— and intended to push further into the Indian subcontinent toward the Ganges River. However, his army, weary and anxious about facing larger armies and more war elephants, refused to march further east. This forced Alexander to turn back, marking the easternmost extent of his conquests, rather than at the Indus River."</em></li>
<li>Me: <em>"In those days, I understand many people thought of Alexander as a god or the son of a god.  But what about his generals? How did they regard him?"</em></li>
<li>OpenAI: <em>"In those days, I understand many people thought of Alexander as the son of a god.  But what about his generals? How did they regard him?"</em></li>
<li>Deepgram: <em>"those days understand many people thought of alexander as god or a son of a god but what about his generals how did they regard him"</em></li>
<li>Latency: Deepgram faster by <strong>1.63  secs</strong></li>
</ul>
<blockquote>
<p>Conversation still on track and user utterances being recognized more or less accurately by both recognizers, though OpenAI missed the reference to "a god".</p>
</blockquote>
<h3 id="heading-turn-6">Turn 6</h3>
<blockquote>
<p>Here things start to get a little more interesting.</p>
<ul>
<li>Assistant: <em>"Alexander's relationship with his generals was complex. While many respected and admired him for his brilliance and battlefield successes, there was also tension. Alexander's adoption of Persian customs and his claim to divine status, presenting himself as the son of Zeus-Ammon, created friction. Some generals felt alienated by his shift away from traditional Macedonian ways, leading to conspiracies and tension within his ranks. Nonetheless, his unparalleled success in expanding the empire often kept them loyal, despite their personal misgivings. His ability to maintain control over such a vast and diverse group was a testament to his leadership and charisma."</em></li>
<li>Me: <em>"On a modern day map where is Macedonia?  I understand Macedonia is where he's from, is that right?"</em></li>
<li>OpenAI: <em>"On a modern day map, where is Macedonia? I understand Macedonia's where I'm from, is that right?"</em></li>
<li>Deepgram: <em>"on a modern day map where is macedonia i understand macedonian is where he's from is that right"</em></li>
<li>Latency: Deepgram faster by <strong>1.36  secs</strong></li>
</ul>
<p>Here OpenAI makes an interesting mistake. I said "he's from" but it transcribed "I'm from", even though those two utterances don't sound alike. I don't think this is a one-off error either, as I spotted other instances where I used a pronoun like "you" and OpenAI returned a transcript with "I" instead. We'll note in the next turn, though, that OpenAI's LLM does not get knocked off track by this error - an example of the LLM "covering up" for mistakes in the Whisper ASR.</p>
</blockquote>
<h3 id="heading-turn-7">Turn 7</h3>
<ul>
<li>Assistant: <em>"That's right, Alexander the Great was from Macedon, an ancient kingdom located in the northern part of Greece. On a modern map, the region that was ancient Macedon is mostly in the northern part of Greece, ..(interrupted)."</em></li>
<li>Me: <em>"How did he defeat Greece..in the first part..in the first place?"</em></li>
<li>OpenAI: <em>"How did he eat reef?  in the first place."</em></li>
<li>Deepgram: <em>"how did he defeat greece in the first part, the first place"</em></li>
<li>Latency: Deepgram faster by <strong>0.53  secs</strong></li>
</ul>
<blockquote>
<p>Whoops. Now OpenAI Whisper has failed, recognizing "defeat" as "eat" and "Greece" as "reef". Meanwhile, Deepgram recognizes the statement correctly. This will briefly derail the conversation in the next turn.</p>
</blockquote>
<h3 id="heading-turn-8">Turn 8</h3>
<ul>
<li>Assistant: <em>"Alexander the Great died in 323 BC in the palace of Nebuchadnezzar II in Babylon, which is in modern-day Iraq. The exact cause of his death remains a mystery and has been the subject of much ..(interruption)."</em></li>
<li>Me: <em>"How did he defeat mainland Greece was the question."</em></li>
<li>OpenAI: <em>"How do you eat mainland Greece? was the question."</em></li>
<li>Deepgram: <em>"how did he defeat mainland greece was the question"</em></li>
<li>Latency: Deepgram faster by <strong>0.80  secs</strong></li>
</ul>
<blockquote>
<p>Given another chance, OpenAI still does not accurately understand what I said while Deepgram does.  But now something really interesting happens.  Even though the Whisper ASR returns an incorrect, meaningless (how would one eat "mainland Greece"?), and misleading transcription, the LLM again manages to use its context to power through to a response that is on target.</p>
</blockquote>
<h3 id="heading-turn-9">Turn 9</h3>
<ul>
<li>Assistant: <em>"My apologies for the confusion. Alexander the Great never actually invaded mainland Greece. He was from Macedon, which was a kingdom located to the north of the Greek city-states. After the assassination of his father, King Philip II, in 336 BC, Alexander became king and continued his father's efforts to unite the Greek city-states under Macedonian hegemony. Alexander's influence over Greece was solidified by his role as the leader of the Corinthian League, a coalition of Greek states. From there, he launched his campaign against the Persian Empire, which is where most of his famous conquests took place. So, rather than invading mainland Greece, Alexander integrated it into his empire before embarking on his legendary campaigns across Asia."</em></li>
</ul>
<h1 id="heading-conclusions">Conclusions</h1>
<h2 id="heading-deepgram-is-more-accurate">Deepgram is more accurate</h2>
<p>Deepgram fairly consistently returned more accurate transcripts. OpenAI returned two transcripts that had significant errors (I define "significant" as an error that could potentially derail the conversation) while Deepgram returned none.</p>
<h3 id="heading-but-it-doesnt-always-matter">BUT, it doesn't always matter</h3>
<p>The errors in the Whisper ASR were sometimes handled or "covered up", if you will, by the OpenAI LLM model.  Presumably, the previous context of the conversation up to that point is used to inform the model in a way that lets it answer accurately even within some range of incorrect transcriptions.</p>
<h2 id="heading-deepgram-is-faster">Deepgram is faster</h2>
<p>Deepgram was faster in returning final transcripts in every turn of the conversation, by significant amounts.</p>
<h2 id="heading-but-the-conversation-latency-was-still-extremely-low">BUT, the conversation latency was still extremely low</h2>
<p>Even though Whisper had greater latency, the conversation flowed at near-human tempo. Why is this? Why does OpenAI not pay a noticeable penalty for the slower latency? Most likely because recognized tokens are streamed into the LLM as they are produced: even though the full transcript took longer, the ability to stream tokens well before the final transcript is complete allows a rapid conversational response.</p>
<h2 id="heading-context-matters-and-what-this-says-about-asr-apis-in-general">Context matters, and what this says about ASR APIs in general</h2>
<p>We can see here that speech-to-speech APIs enjoy a big advantage over more traditional "voice =&gt; ASR =&gt; text =&gt; NLU/Intent/Dialog =&gt; text =&gt; TTS =&gt; voice" pipelines, and that advantage is <strong>context</strong>. We see the context of the conversation being used here to improve results and rub out imperfections in transcriptions.</p>
<p>In the more traditional speech-to-text APIs provided by speech vendors this opportunity for context is lost, for several reasons:</p>
<ol>
<li>When we send speech to the ASR for transcription, the ASR has no knowledge of what question or prompt we just served up to the user.  In other words, the user's speech is in response to some prompt, but the ASR has no knowledge of what that prompt was.  This is a shortcoming I have cited for some time - see my earlier blog post: <a target="_blank" href="https://blog.jambonz.org/speech-companies-are-failing-at-conversational-ai">Speech companies are failing at Conversational AI</a> - but so far I have been unsuccessful in getting speech companies to recognize the opportunity to expand their APIs to take advantage of this.</li>
<li>Often, in traditional ASR, each distinct utterance sent to the speech vendor is treated as a separate, standalone utterance.  The ASR does not even have the ability to comprehend that 10 different utterances from the same user are all part of the same conversation, and likely on the same topic.</li>
</ol>
<p>This makes the traditional voice =&gt; text =&gt; voice pipeline much more brittle and easier to derail than speech-to-speech.  But it does not need to be this way, and I do not conclude that speech-to-speech will necessarily dominate in areas like CX/AI.</p>
<p>Why is this?  First, the only significant improvement (and make no mistake, it is hugely significant) that speech-to-speech offers is that it approaches human-level conversational flow.  But to the extent that this is due to more effective use of context, as I've just described, context is something that can also be radically improved in the traditional voice =&gt; text =&gt; voice pipeline.</p>
<p>Second, to be usable in CX/AI environments by global brands, guardrails are necessary, and guardrails inevitably add latency.  Traditional NLU/Intent/Dialog based CX/AI platforms already have guardrails built in, and if speech-to-speech gives back a significant amount of its latency advantage to guardrails, all you are left with is a more expensive solution without that compelling differentiator.</p>
<p>Visit us at jambonz.org or email us at support@jambonz.org to learn more, or to find out about installing your own jambonz voice platform for Voice/AI applications.</p>
]]></content:encoded></item><item><title><![CDATA[Microsoft Teams Direct Routing with jambonz]]></title><description><![CDATA[This blog comes to you thanks to Peter Mijnster at SoundOfData and Antony Jukes at Callable.io, who were gracious enough to capture and relay their experience in setting up jambonz as a Direct Routing SBC for Microsoft Teams. Thanks Peter and Ant...]]></description><link>https://blog.jambonz.org/microsoft-teams-direct-routing-with-jambonz</link><guid isPermaLink="true">https://blog.jambonz.org/microsoft-teams-direct-routing-with-jambonz</guid><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Thu, 15 Aug 2024 13:15:24 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p>This blog comes to you thanks to <a target="_blank" href="https://www.linkedin.com/in/petermijnster/">Peter Mijnster</a> at <a target="_blank" href="https://www.soundofdata.com/">SoundOfData</a> and <a target="_blank" href="https://www.linkedin.com/in/antony-jukes-7b139782/">Antony Jukes</a> at <a target="_blank" href="https://www.callable.io/">Callable.io</a>, who were gracious enough to capture and relay their experience in setting up jambonz as a <a target="_blank" href="https://learn.microsoft.com/en-us/microsoftteams/pstn-connectivity#teams-phone-with-direct-routing">Direct Routing SBC for Microsoft Teams</a>. Thanks Peter and Antony!</p>
</blockquote>
<p>Microsoft Teams Direct Routing is a feature of Microsoft Teams that allows organizations to connect their existing telephony infrastructure to Microsoft Teams, enabling users to make and receive calls through the Microsoft Teams interface. Essentially, it allows you (and your users) to use Microsoft Teams as a unified communication platform for both internal and external communications, including traditional phone calls.</p>
<p>When setting up Microsoft Teams Direct Routing, you need a Session Border Controller (SBC) to manage the connection between your on-premises telephony infrastructure and the Microsoft Phone System. The SBC acts as an intermediary, handling the translation of signaling and media between the two systems, ensuring seamless communication.</p>
<p><a target="_blank" href="https://www.jambonz.org/">jambonz</a> is an open-source voice platform that is widely deployed by CX/AI, Voice/AI, and CPaaS vendors.  It includes SBC functionality and  integrates your existing voice infrastructure, such as SIP trunks, with Microsoft Teams; enabling features like inbound and outbound calling, voicemail, and emergency calling within the Microsoft Teams environment.</p>
<p>In simpler terms, Microsoft Teams Direct Routing with jambonz allows you to leverage your existing phone infrastructure while still enjoying the collaborative and communication features of Microsoft Teams. It's a way to bring together the best of both worlds: the familiarity and reliability of traditional telephony with the modern capabilities of a unified communication platform like Microsoft Teams.</p>
<h2 id="heading-the-jambonz-setup">The jambonz Setup</h2>
<p>To ensure a seamless demonstration, we've established a jambonz environment using a single EC2 instance (a "jambonz mini" system) hosted on Amazon Web Services (AWS) in the eu-central-1 region within a single dedicated VPC.</p>
<blockquote>
<p>Note: We’ve used our custom AWS CloudFormation templates which have been designed to meet our specific requirements and code guidelines. While these templates differ from the standard jambonz repository, the end-product remains identical.</p>
<p>Note: In most of the verbiage below, we have replaced our actual domain name with 'contoso' in order to avoid attracting traffic from web crawlers.  You'll still see the actual domain name in some of the images, but hopefully you get the idea.</p>
</blockquote>
<p>We’ve set up our jambonz server with the following domain name: eu-central-jbz-teams.contoso.app. This domain is only used for accessing the jambonz web portal and plays no further role in getting Microsoft Teams Direct Routing up and running. The reason we’ve included eu-central in the name is that we like to know in which region our servers are running, and this will help when we introduce Microsoft Teams Direct Routing with jambonz in multiple regions as a global proposition.</p>
<h2 id="heading-certificates-and-dns">Certificates and DNS</h2>
<p>As mentioned above, we will be using a different domain for the Microsoft Teams Direct Routing functionality. This is true both for jambonz and for Microsoft Teams (Microsoft 365). We will be using the following domain name for this demo, <code>teams.contoso.com</code>, with a matching wildcard certificate so that we can create subdomains for individual customers ("tenants" in Teams lingo).</p>
<p>The first thing we need to do is add the FQDN of our jambonz server to our DNS server as an A-record. In our case that will be eu-central-sbc001a.teams.contoso.com.</p>
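<p>Once the record is in place, it’s worth a quick sanity check that it resolves before moving on (substituting your own FQDN, of course):</p>
<pre><code class="lang-bash">dig +short eu-central-sbc001a.teams.contoso.com
# should print the public IP of your jambonz server
</code></pre>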
<p><strong>Example of a possible (future) multi-region high-availability approach and FQDNs for the first tenant, operating in Europe and North America:</strong></p>
<ul>
<li><p>eu-central-sbc001a.teams.contoso.com (1st jambonz SIP-SBC for calls in Europe)</p>
</li>
<li><p>eu-central-sbc001b.teams.contoso.com (2nd jambonz SIP-SBC for calls in Europe)</p>
</li>
<li><p>us-west-sbc001a.teams.contoso.com (1st jambonz SIP-SBC for calls in North-America)</p>
</li>
<li><p>us-west-sbc001b.teams.contoso.com (2nd jambonz SIP-SBC for calls in North-America)</p>
</li>
</ul>
<p><strong>Example of a possible (future) multi-region high-availability approach and FQDNs for the second tenant, operating in Europe, North America and China</strong>:</p>
<ul>
<li><p>eu-central-sbc002a.teams.contoso.com (1st jambonz SIP-SBC for calls in Europe)</p>
</li>
<li><p>eu-central-sbc002b.teams.contoso.com (2nd jambonz SIP-SBC for calls in Europe)</p>
</li>
<li><p>us-west-sbc002a.teams.contoso.com (1st jambonz SIP-SBC for calls in North-America)</p>
</li>
<li><p>us-west-sbc002b.teams.contoso.com (2nd jambonz SIP-SBC for calls in North-America)</p>
</li>
<li><p>ap-east-sbc002a.teams.contoso.com (1st jambonz SIP-SBC for calls in China)</p>
</li>
<li><p>ap-east-sbc002b.teams.contoso.com (2nd jambonz SIP-SBC for calls in China)</p>
</li>
</ul>
<blockquote>
<p>It’s important to remember that the FQDNs need to be unique. An FQDN can only be used once across all Microsoft 365 tenants. That’s why it’s important to define an FQDN strategy beforehand and stick to it, because changing it afterwards will impact the tenants (customers). So, this naming approach might work for you, or you could use customer names - whatever works for you.</p>
<p>It’s also worth mentioning that tenants (customers) add these FQDNs to their own Microsoft 365 tenant, which you do not have access to, nor any control over. We strongly advise against reusing these FQDNs: you can remove them from your own DNS servers, but not from the customer’s Microsoft 365 domain list, and sooner or later this will cause issues.</p>
</blockquote>
<h2 id="heading-configure-drachtio-server">Configure drachtio server</h2>
<p>Before we can communicate properly with the Microsoft Teams Direct Routing SIP proxy we need to configure the <a target="_blank" href="https://drachtio.org">drachtio server</a> component of jambonz to accept SIP traffic over TLS from MS Teams.</p>
<p>You will need to complete two tasks:</p>
<ol>
<li>configure the drachtio server to listen on port 5061 for sip over tls, and</li>
<li>configure drachtio to know where on the server the TLS certificates are installed.</li>
</ol>
<p>Since we are running jambonz on AWS using EC2 instances, we'll edit the drachtio.service systemd file to add an additional contact for sip over tls.  While you're at it, if you want to support webrtc clients sending sip over websockets, you can add that additional contact as well.  An example drachtio service file configured to listen for both sip over tls and wss looks like this:</p>
<pre><code>[Unit]
Description=drachtio
After=syslog.target network.target local-fs.target

[Service]
; service
Type=forking
ExecStartPre=/bin/sh -c 'systemctl set-environment LOCAL_IP=`curl -s http://169.254.169.254/latest/meta-data/local-ipv4`'
ExecStartPre=/bin/sh -c 'systemctl set-environment PUBLIC_IP=`curl -s http://169.254.169.254/latest/meta-data/public-ipv4`'
ExecStart=/usr/local/bin/drachtio --daemon \
--contact sip:${LOCAL_IP};transport=udp --external-ip ${PUBLIC_IP} \
--contact sips:${LOCAL_IP};transport=tls --external-ip ${PUBLIC_IP} \
--contact sips:${LOCAL_IP}:8443;transport=wss --external-ip ${PUBLIC_IP} \
--contact sip:${LOCAL_IP};transport=tcp --external-ip ${PUBLIC_IP} \
--address 0.0.0.0 --port 9022 --homer 172.20.10.193:9060 --homer-id 10
..
</code></pre><p>For the second task, telling drachtio where to find the TLS certificates, edit the /etc/drachtio.conf.xml file and enter the <code>key-file</code>, <code>cert-file</code>, and <code>chain-file</code> properties under the <code>tls</code> tag, as described <a target="_blank" href="https://drachtio.org/docs/drachtio-server#tls">here</a>.</p>
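<p>As a rough sketch, the relevant section of /etc/drachtio.conf.xml might look something like the following (the certificate paths here assume a Let’s Encrypt wildcard certificate for teams.contoso.com; adjust them to wherever your certificates are actually installed):</p>
<pre><code class="lang-xml">&lt;drachtio&gt;
  &lt;sip&gt;
    ...
    &lt;tls&gt;
      &lt;key-file&gt;/etc/letsencrypt/live/teams.contoso.com/privkey.pem&lt;/key-file&gt;
      &lt;cert-file&gt;/etc/letsencrypt/live/teams.contoso.com/cert.pem&lt;/cert-file&gt;
      &lt;chain-file&gt;/etc/letsencrypt/live/teams.contoso.com/chain.pem&lt;/chain-file&gt;
    &lt;/tls&gt;
  &lt;/sip&gt;
&lt;/drachtio&gt;
</code></pre>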
<p><em>Checking drachtio with openssl</em></p>
<p>Before proceeding further it's a good idea to test that basic TLS connectivity is working.</p>
<pre><code class="lang-bash">openssl s_client -connect &lt;URL or IP&gt;:&lt;port&gt;
openssl s_client -connect eu-central-sbc001a.teams.contoso.com:5061
</code></pre>
<h2 id="heading-configure-jambonz">Configure jambonz</h2>
<p>We are using the default out-of-the-box settings for jambonz. This means our “Settings” configuration page looks like this:</p>
<ul>
<li><p><strong>Domain Name</strong>: eu-central-jbz-teams.contoso.app</p>
</li>
<li><p><strong>Sip Domain Name</strong>: sip.eu-central-jbz-teams.contoso.app</p>
</li>
<li><p><strong>Monitoring Domain Name</strong>: grafana.eu-central-jbz-teams.contoso.app</p>
</li>
</ul>
<p>On the <em>Service Provider</em> tab of the <em>Settings</em> configuration page we’ve ticked the box to enable Microsoft Teams Direct Routing and entered the following SBC domain name:</p>
<ul>
<li>teams.contoso.com</li>
</ul>
<p>After ticking the box and saving, we now see the <em>MS Teams Tenants</em> menu item on the lefthand side of the jambonz portal, under the <em>BYO Services</em> section.</p>
<p>On the <em>MS Teams Tenants</em> configuration page we’ve added a Microsoft Teams Tenant with the following settings:</p>
<ul>
<li><p><strong>Domain Name</strong>: eu-central-sbc001a.teams.contoso.com</p>
</li>
<li><p><strong>Account</strong>: default</p>
</li>
<li><p><strong>Application</strong>: hello world</p>
</li>
</ul>
<blockquote>
<p>These settings will work for our demo because we only have one tenant. However, if we wanted to scale to multiple tenants we would need to start defining accounts and applications - one jambonz account for every tenant (customer).  Each jambonz customer can have its own MS Teams tenant entry.</p>
</blockquote>
<h2 id="heading-the-carrier-microsoft-tenant-setup">The Carrier Microsoft Tenant Setup</h2>
<p>Now we can finally head off to the Microsoft 365 Admin Center of the “carrier tenant”. In almost all cases this is you, the owner of the jambonz environment, where you will configure Microsoft Teams Direct Routing. As owner of the Microsoft 365 tenant we need to add the “base domain” to our list of domain names.</p>
<blockquote>
<p>We do not need to set up any services like Microsoft Exchange or Microsoft Teams. In this screenshot you will see we also have contoso.com added to the list of domains; this is not required.</p>
<p>Note: The “base domain” is not the same thing as your Microsoft 365 tenants' default domain.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723470713830/71b7526b-55ac-4881-a4ee-af9b3388293a.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-the-licensed-service-account">The Licensed Service Account</h3>
<p>Unfortunately, simply adding the domain to our carrier tenant won’t actually do anything useful. The domain needs to be “activated”. This basically comes down to creating a service account and assigning a license to that service account. Because we like things to be free, we’ve opted to use a <strong>Microsoft Teams Phone Resource Account</strong> license, which we acquired through the Microsoft 365 Admin Center.</p>
<p>The easiest way to correctly create the service account is directly from the Microsoft Teams Admin Center. It is important to note that the service account must be created in the “base domain”. In our case the UPN of the service account looks like this:</p>
<ul>
<li>sa-sbc@teams.contoso.com</li>
</ul>
<p>After successful creation of the service account we’ve assigned the license to the service account using the Microsoft 365 Admin Center.</p>
<blockquote>
<p>Keep in mind that acquiring and assigning licenses, and seeing their effects and outcomes in all of the Microsoft admin centers, may take considerable time. In rare cases this can take up to 24 hours.</p>
</blockquote>
<h3 id="heading-the-customer-microsoft-tenant-setup">The Customer Microsoft Tenant Setup</h3>
<p>At this point we can “instruct” the client to set up their Microsoft 365 tenant. To do this they must add the FQDN of our jambonz server, eu-central-sbc001a.teams.contoso.com, to their domain list. When they do so, we, the carrier, must add the required TXT-record to our DNS server to allow this domain to be added to the customer’s Microsoft 365 tenant.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723470745735/6ff84351-8c30-4459-92aa-aceebaf30a86.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723470776371/c3299f59-c68f-438a-920b-2a497650fdaf.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Once again, we do not need to set up any services like Microsoft Exchange or Microsoft Teams to make Microsoft Teams Direct Routing work. We can safely remove the TXT-record from our DNS server after the FQDN of the jambonz server has been added to the customer’s domain list.</p>
</blockquote>
<h3 id="heading-the-licensed-service-account-again">The Licensed Service Account (Again)</h3>
<p>The domain needs to be “activated” in the Microsoft 365 customer tenant as well. This is exactly the same as for the Microsoft 365 carrier tenant. It’s up to the client how they want to do that, but we recommend doing what we did with the Microsoft 365 carrier tenant. In our case the UPN of the service account looks like this:</p>
<ul>
<li>sa-sbc@eu-central-sbc001a.teams.contoso.com</li>
</ul>
<p>After successful creation of the service account we’ve assigned the license to the service account using the Microsoft 365 Admin Center.</p>
<blockquote>
<p>Keep in mind that acquiring and assigning licenses, and seeing their effects and outcomes in all of the admin centers, may take considerable time. In rare cases this can take up to 24 hours.</p>
</blockquote>
<h3 id="heading-configure-microsoft-teams">Configure Microsoft Teams</h3>
<p>Before we can make calls using the Microsoft Teams client we need to license a regular user account with an eligible Microsoft Teams license. We’ve opted to go with a <strong>Microsoft Teams Phone Standard</strong> license; for this demo we’re using a trial license. We’ve assigned the license to the regular user account we’re going to use for this demo from the Microsoft 365 admin center.</p>
<p>To speed things up we’ve run a few PowerShell commands to enable Microsoft Teams Enterprise Voice for the user account and assigned a phone number that we, the carrier, own to the user account.</p>
<blockquote>
<p>Before we continue we must first create a PSTN Usage Record. You can do this from the Direct Routing section of the Microsoft Teams admin center. For this demo we’ve named it “Jambonz”.</p>
</blockquote>
<pre><code class="lang-text">[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

#Install and Import Module
Install-Module MicrosoftTeams
Import-Module MicrosoftTeams

#Connect to Microsoft Teams
Connect-MicrosoftTeams

#Set Microsoft Teams Enterprise Voice with Voice Routes
New-CsOnlineVoiceRoute -Name "Jambonz Voice Route" -OnlinePstnGatewayList @{add="eu-central-sbc001a.teams.contoso.com"} -Priority 1 -OnlinePstnUsages "Jambonz
New-CSOnlineVoiceRoutingPolicy "Jambonz Routing Policy" -OnlinePstnUsages "Jambonz"
Set-CsPhoneNumberAssignment -Identity "peter.mijnster@onetribe.nl" -EnterpriseVoiceEnabled $true
Set-CsPhoneNumberAssignment -Identity "peter.mijnster@onetribe.nl" -PhoneNumber "+31101234567" -PhoneNumberType DirectRouting
Grant-CsVoiceRoutingPolicy -Identity "peter.mijnster@onetribe.nl" -PolicyName "Jambonz Routing Policy"
</code></pre>
<p>This installs the appropriate Microsoft PowerShell modules, imports them, and connects to Microsoft Teams. The next section creates a Microsoft Teams Voice Route called “Jambonz Voice Route” with the aforementioned FQDN and a very relaxed number pattern, and adds the “Jambonz” PSTN Usage Record to it.</p>
<p>The final section enables Microsoft Teams Enterprise Voice for the user Peter Mijnster (hi there!) together with a phone number.</p>
<h2 id="heading-were-done">We're Done!</h2>
<p>So, we’re pretty much done. The only thing left is testing that voice calls actually get handled by jambonz. We can use “pm2 logs” and “/var/log/drachtio/drachtio.log” on the jambonz server to verify connectivity and also to troubleshoot if things don’t work as expected.</p>
<h2 id="heading-whats-left">What’s Left?</h2>
<p>To make a call to an actual phone number, and not just the hello world application, we need to develop some custom applications. We’ve got two example Node-RED flows (exported as JSON) that show what might work for you.</p>
<ul>
<li><p><a target="_blank" href="https://gist.github.com/davehorton/94b64b8bafcfdf8779d256b292e644bf">inbound</a></p>
</li>
<li><p><a target="_blank" href="https://gist.github.com/davehorton/b0c48873cd2fb13a55d8d6bfcbf38d1d">outbound</a></p>
</li>
</ul>
<p>If you would like more information on jambonz please email us at support@jambonz.org or <a target="_blank" href="https://joinslack.jambonz.org">join our Slack channel</a>.</p>
<p>And finally, please visit our valued customers if you are looking for solutions:</p>
<ul>
<li><a target="_blank" href="https://www.soundofdata.com/">soundofdata</a> - Your partner in customer service connectivity. We transform the accessibility of your customer service. Anywhere in the world, through any channel. Automated where possible, personalized where necessary.</li>
<li><a target="_blank" href="https://www.callable.io/">Callable.io</a> - Disuptive wholesale CPaaS: Callable's mission is to give resellers the tools to supercharge their communications portfolio enabling increased sales and customer retention.</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Using jambonz for telephony integration with Retell AI]]></title><description><![CDATA[Note: If you prefer learning visually, watch the youtube video.

Lately, we've noticed a few Retell AI users have been finding their way to the jambonz open source voice gateway project in order to reduce their telephony costs by using our Bring your...]]></description><link>https://blog.jambonz.org/using-jambonz-for-telephony-integration-with-retell-ai</link><guid isPermaLink="true">https://blog.jambonz.org/using-jambonz-for-telephony-integration-with-retell-ai</guid><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Sun, 21 Jul 2024 20:04:33 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p>Note: If you prefer learning visually, watch the <a target="_blank" href="https://youtu.be/QQxt6eVUDHU">youtube video</a>.</p>
</blockquote>
<p>Lately, we've noticed a few <a target="_blank" href="https://www.retellai.com/">Retell AI</a> users have been finding their way to the <a target="_blank" href="https://jambonz.org">jambonz open source voice gateway</a> project in order to reduce their telephony costs by using our Bring your own carrier (BYOC) feature.</p>
<p>Specifically, these folks were searching for a way to free themselves from Twilio "lock-in". Twilio's price model, which features <a target="_blank" href="https://help.twilio.com/articles/223132307">costly per-minute rounding of calls</a> <strong>plus</strong> added Voice API costs <strong>plus</strong> added Streaming costs results in a high bundled cost that makes the Voice-AI business proposition simply untenable for much of the served market. Luckily, there is a solution - jambonz to the rescue!</p>
<p>So, to all you Retell AI users, welcome aboard! In this blog post we get you up and running on jambonz for free, and eliminate those nasty costs. Try us out! With jambonz, you can:</p>
<ul>
<li><p>connect any sip trunking provider (or PBX, or SBC) to jambonz, bypassing twilio and those associated added costs</p>
</li>
<li><p>use jambonz's bi-directional streaming feature to integrate directly with Retell AI</p>
</li>
<li><p>use phone numbers that you have from your carriers and provision them directly into jambonz</p>
</li>
<li><p>enjoy features like barge-in and call transfer</p>
</li>
<li><p>run it all on <a target="_blank" href="https://jambonz.cloud/">our cloud</a> or <a target="_blank" href="https://blog.jambonz.org/installing-jambonz-using-aws-marketplace">self-host</a> your own jambonz instance on AWS for enhanced privacy and control.</p>
</li>
</ul>
<p>Let's get started!</p>
<h2 id="heading-sign-up-for-free-jambonz-account">Sign up for free jambonz account</h2>
<p>The easiest way to get started is to create a free jambonz trial account on <a target="_blank" href="https://jambonz.cloud/register">jambonz.cloud</a>. Try it for free for 3 weeks and if you like it you can either continue on a paid plan, or use the AWS marketplace offering to deploy your own jambonz instance in your AWS account. (Contact us for details about deploying a clustered solution if you need to support large call volumes).</p>
<h2 id="heading-configure-your-carrier-and-phone-number">Configure your carrier and phone number</h2>
<p>Once you've created a jambonz account, you can add one or more Carriers / SIP trunking providers in the jambonz portal. In the example below I actually use an elastic SIP trunk from Twilio, but you can use any SIP provider or PBX - we integrate with anything that can send us VoIP traffic using SIP.</p>
<p>When you first log in to your new account you won't yet have any carriers defined.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721518653741/8d5c7ca2-3dcc-440f-ad2a-3f3d294c4fd3.png" alt class="image--center mx-auto" /></p>
<p>Click on "Add carrier" to add a new carrier. This screen will appear:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721518726272/92adc12a-a009-4045-b42d-c268ba9a1b21.png" alt class="image--center mx-auto" /></p>
<p>The main thing you need to configure when adding a Carrier is the IP addresses of their SIP gateways that will be sending us calls. In addition, you will add the IP addresses or DNS names of their SIP gateways that we can send outbound calls to. You can also see on the screen above that we show you our SIP signaling address, because you will need to configure that in the carrier / SBC / PBX that will be sending us the calls.</p>
<p>You'll also note we have a dropdown of preconfigured carriers. If you are using one of these carriers, just select it from the dropdown and the IP addresses will be pre-populated for you. Otherwise, fill them in as shown below. In my example, I am actually using a Twilio elastic SIP trunk as my carrier.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721519011467/6b706640-efca-4147-aea5-0712fb525835.png" alt class="image--center mx-auto" /></p>
<p>There are quite a few advanced options you can play with when configuring a carrier, if you need them, but for now we will stick with the basic case.</p>
<p>Once you have configured your carrier, get a phone number / DID from your carrier and then add it under Phone numbers in the jambonz portal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721519356880/7bb8f4d6-9167-407b-893b-4b2cbc530f32.png" alt class="image--center mx-auto" /></p>
<p>For now, leave the application dropdown at "Choose application" and hit Save. We'll create the application that integrates with Retell AI in the next step, then come back and update the phone number to point to it.</p>
<h2 id="heading-run-a-jambonz-application-that-connects-to-retell-ai">Run a jambonz application that connects to Retell AI</h2>
<p>You control jambonz by either webhooks or websockets. In this case, we are going to run a webhook application that provides instructions to jambonz, but the app will also accept websocket connections from jambonz carrying the streaming audio (in jambonz, this is done using the <a target="_blank" href="https://www.jambonz.org/docs/webhooks/listen/">listen verb</a>).</p>
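<p>For orientation, here is a rough sketch of the kind of JSON payload such a webhook application returns to start bi-directional streaming with the listen verb (the values here are illustrative, and the actual application we install below sets additional options; see the listen verb docs linked above):</p>
<pre><code class="lang-json">[
  {
    "verb": "listen",
    "url": "wss://your_ngrok_or_other_domain_where_this_app_is_running",
    "sampleRate": 8000,
    "bidirectionalAudio": {
      "enabled": true,
      "streaming": true,
      "sampleRate": 8000
    }
  }
]
</code></pre>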
<p>For this example, I will run the application locally on my laptop and use ngrok to provide a tunnel that allows jambonz to connect.</p>
<p>First, check out and install the application.</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/jambonz/retellai-audio-socket.git
<span class="hljs-built_in">cd</span> retellai-audio-socket
npm ci
</code></pre>
<p>Now edit the ecosystem.config.js file to add information about your jambonz account and your retell agent. Before editing, the file looks like this:</p>
<pre><code class="lang-js"><span class="hljs-built_in">module</span>.exports = {
  <span class="hljs-attr">apps</span> : [{
    <span class="hljs-attr">name</span>: <span class="hljs-string">'retellai-shim'</span>,
    <span class="hljs-attr">script</span>: <span class="hljs-string">'app.js'</span>,
    <span class="hljs-attr">instance_var</span>: <span class="hljs-string">'INSTANCE_ID'</span>,
    <span class="hljs-attr">exec_mode</span>: <span class="hljs-string">'fork'</span>,
    <span class="hljs-attr">instances</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">autorestart</span>: <span class="hljs-literal">true</span>,
    <span class="hljs-attr">watch</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">max_memory_restart</span>: <span class="hljs-string">'1G'</span>,
    <span class="hljs-attr">env</span>: {
      <span class="hljs-attr">NODE_ENV</span>: <span class="hljs-string">'production'</span>,
      <span class="hljs-attr">LOGLEVEL</span>: <span class="hljs-string">'info'</span>,
      <span class="hljs-attr">HTTP_PORT</span>: <span class="hljs-number">3000</span>,
      <span class="hljs-attr">JAMBONZ_ACCOUNT_SID</span>: <span class="hljs-string">'your_account_sid'</span>,
      <span class="hljs-attr">JAMBONZ_API_KEY</span>: <span class="hljs-string">'your_api_key'</span>,
      <span class="hljs-attr">JAMBONZ_REST_API_BASE_URL</span>: <span class="hljs-string">'https://jambonz.cloud/api/v1'</span>, 
      <span class="hljs-attr">RETELL_API_KEY</span>: <span class="hljs-string">'your_retell_api_key'</span>,
      <span class="hljs-attr">RETELL_AGENT_ID</span>: <span class="hljs-string">'your_retell_agent_id'</span>,
      <span class="hljs-attr">WS_URL</span>: <span class="hljs-string">'wss://your_ngrok_or_other_domain_where_this_app_is_running'</span>,
    }
  }]
};
</code></pre>
<p>You can find your jambonz account SID and API key by clicking Account in the lefthand menu of the portal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721520074455/6983fc61-2723-4cf2-863c-95e96f3303bc.png" alt class="image--center mx-auto" /></p>
<p>Scroll down to find the API keys:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721520093963/68ca3f33-cc1e-4d71-bab0-77b029c45a31.png" alt class="image--center mx-auto" /></p>
<p>Update the ecosystem file with these values. The <strong>JAMBONZ_REST_API_BASE_URL</strong> variable is already correctly populated with the url for jambonz.cloud, so you can leave it as is.</p>
<p>The <strong>WS_URL</strong> environment variable should be populated with the URL for the endpoint that will serve this application. Since I am going to be using ngrok and have an ngrok domain of <a target="_blank" href="http://jambonz-apps.drachtio.org">jambonz-apps.drachtio.org</a> I set it as:</p>
<pre><code class="lang-bash">WS_URL: <span class="hljs-string">'wss://jambonz-apps.drachtio.org'</span>,
</code></pre>
<p>You, of course, will replace this with your own domain or other public URL that is reachable from jambonz.cloud.</p>
<p>The remaining environment variables are fairly self-explanatory.</p>
<h2 id="heading-your-retell-ai-agent">Your Retell AI agent</h2>
<p>I am using a basic Retell single-notification agent ("Anna from Retell Towing"). The prompt is as follows:</p>
<pre><code class="lang-text">## Identity
You are Anna, a phone agent who is responsible for informing the other person that their vehicle has been towed by Retell Towing Company.

## Background for Retell Towing Company
Location: 213 2nd Street
Business hour: Monday to Friday, 9am-5pm.
Towing fee: $600.
Vehicle parking fee: $50 per day.

## Style Guardrails
Be conversational. Use everyday language to create a cozy and friendly vibe in the conversation. 
Be empathetic. The customer might be frustrated when hearing the news, so use empathetic words when delivering the news. If the customer is frustrated or upset, be sure to speak with a soothing tone and offer comforting words.

## Steps
1. Ask if this is a good time to talk.
  - if not, call function end_call to hang up and say you will call back later.
2. Ask if the user is the owner of the white Tesla with plate number 12345 parked on Main Street.
  - if not, call function end_call to hang up and say sorry for the confusion.
3. Inform user that their Tesla has been towed, please come and collect ASAP to avoid any additional parking fee. Tell user about your business hour and location.
4. Ask if the user has any questions, and if so, answer them until there are no questions.
  - If the user asks something you do not know, let them know you don't have the answer. Ask them if they have any other questions.
  - If the user does not have any questions, call function end_call to hang up.

Finally, if the user asks to speak to an agent at any time, or becomes very angry, transfer them to an agent using the call_transfer tool. The agent phone number is 16176354500 and you should use the SIP REFER method to transfer the call.
</code></pre>
<h3 id="heading-call-transfer">Call transfer</h3>
<p>I've added a simple custom function to do call transfer, instructing Retell AI to send an HTTP POST to the application with details like the agent phone number when the user requests transfer to an agent. (For those of you familiar with SIP, we support either the SIP REFER method for transferring a call or a SIP INVITE.)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721520951502/b5f7255f-2177-4efa-a349-51a9663b918e.png" alt class="image--center mx-auto" /></p>
<p>Note again that the url is my ngrok domain where I am running the app, with the path "/transfer-requested". The JSON schema for the parameters is as follows:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"type"</span>: <span class="hljs-string">"object"</span>,
  <span class="hljs-attr">"properties"</span>: {
    <span class="hljs-attr">"agent_number"</span>: {
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>,
      <span class="hljs-attr">"description"</span>: <span class="hljs-string">"The phone number of the agent"</span>
    },
    <span class="hljs-attr">"use_sip_refer"</span>: {
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"boolean"</span>,
      <span class="hljs-attr">"description"</span>: <span class="hljs-string">"if true use a SIP REFER to transfer the call; otherwise use a SIP INVITE"</span>
    },
    <span class="hljs-attr">"calling_number"</span>: {
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>,
      <span class="hljs-attr">"description"</span>: <span class="hljs-string">"The calling party number to use on the call transfer; do no supply unless explicitly directed to"</span>
    }
  },
  <span class="hljs-attr">"required"</span>: [
    <span class="hljs-string">"agent_number"</span>,
    <span class="hljs-string">"use_sip_refer"</span>
  ]
}
</code></pre>
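<p>To illustrate the receiving side, here is a hypothetical sketch of the /transfer-requested endpoint (the argument names mirror the schema above, but the exact payload framing and the transfer logic are assumptions; the real implementation lives in the repo we cloned earlier):</p>
<pre><code class="lang-js">const express = require('express');
const app = express();
app.use(express.json());

// Retell AI POSTs here when the agent invokes the call_transfer custom function
app.post('/transfer-requested', function(req, res) {
  // payload framing is an assumption; Retell supplies the function arguments
  const args = req.body.args || req.body;
  console.log(`transfer requested to ${args.agent_number}, sip refer: ${args.use_sip_refer}`);
  // ...here the app would update the live call via the jambonz REST API,
  // issuing either a SIP REFER or a SIP INVITE to the agent number...
  res.sendStatus(200);
});

app.listen(3000);
</code></pre>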
<p>At this point I can run my application locally and expose a public endpoint using ngrok.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># in one terminal</span>
ngrok http --domain jambonz-apps.drachtio.org 3000
</code></pre>
<pre><code class="lang-bash"><span class="hljs-comment"># in second terminal</span>
pm2 start ecosystem.config.js
</code></pre>
<p>That's it! Now I can call my phone number and be connected to my Retell AI agent.</p>
<p>Feel free to try it out. If you want to learn more about jambonz or ask questions of the community, join our Slack channel by going to <a target="_blank" href="https://joinslack.jambonz.org">joinslack.jambonz.org</a> or email us at support@jambonz.org.</p>
<p>Enjoy!</p>
]]></content:encoded></item><item><title><![CDATA[Text-to-speech latency: the jambonz leaderboard]]></title><description><![CDATA[The emergence of AI and Large Language Models (LLMs) onto the tech landscape promises to reshape everything: how we work, how we play, and how we engage with others. Of course - let's be honest: not much of that has happened yet. Someday we'll surely...]]></description><link>https://blog.jambonz.org/text-to-speech-latency-the-jambonz-leaderboard</link><guid isPermaLink="true">https://blog.jambonz.org/text-to-speech-latency-the-jambonz-leaderboard</guid><category><![CDATA[rimelabs]]></category><category><![CDATA[deepgram]]></category><category><![CDATA[whisper]]></category><category><![CDATA[Google]]></category><category><![CDATA[Microsoft]]></category><category><![CDATA[tts]]></category><category><![CDATA[elevenlabs]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Mon, 15 Apr 2024 14:36:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1713184593452/9693e146-c295-46de-8640-b85af67ab2c7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The emergence of AI and Large Language Models (LLMs) onto the tech landscape promises to reshape everything: how we work, how we play, and how we engage with others. Of course - let's be honest: not much of that has happened yet. Someday we'll surely experience the "sonic boom" moment when the actual rate of progress catches up to the hype, but sorry folks, we're not there yet.</p>
<p>Instead, the most notable impact to date has been the refocusing of huge amounts of private and public capital into any and all product categories thought to either benefit from or drive AI technologies. Those of us laboring to make our daily bread in the CX/AI space find ourselves the lucky beneficiaries of this effect. We get to play with new speech technologies developed by startup companies newly flush with VC cash and eager to brag about how many NVIDIA GPUs they bought over the weekend. For those of us working adjacent to them and out of the VC spotlight, it's like eating at the high school table with the rich kids who suddenly and inexplicably want to share their nicely packaged lunches with us.</p>
<p>I'll be honest, the sprouting-up of new text-to-speech (TTS) vendors that we've seen over the past year or so was not something I expected because, quite frankly, I thought your dad's Google TTS and Microsoft TTS were pretty damn fine, not to mention that the investment theme is lost on me when a market already has close to commodity-level pricing. Oh well, that just goes to show what I know.</p>
<p>In our upcoming jambonz 0.9.0 release we've added support for TTS services from a bunch of these sassy newcomers that want to challenge the giants, and we thought it would be a good time to put them to the test. What age-old story are we going to see here: the new upstarts disrupting the failing dinosaurs? Or will it be the well-heeled Daddy Warbucks incumbents quashing the neophytes? Let's find out!</p>
<h1 id="heading-introducing-your-contestants">Introducing your contestants!</h1>
<p>The <a target="_blank" href="https://jambonz.org">jambonz</a> open source voice gateway for CX/AI providers has been widely adopted by many CX/AI providers, including the leading vendors in the space. Our "bring your own everything" design enables customers to connect their preferred carriers and speech vendors and so we have always made it our mission to give our customers the broadest selection of speech vendors for both text-to-speech and speech-to-text. </p>
<p>As well, we strive to give customers detailed insight into the behavior of their service through an OpenTelemetry-based observability framework that reports data such as time-to-first-byte for TTS requests.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713182862647/e6ad82c3-4356-4dc6-bd21-65cf22e298bb.png" alt class="image--center mx-auto" /></p>
<p>In the upcoming 0.9.0 release we've added several new vendors for text-to-speech, and we've also made an effort to support streaming APIs where available to reduce the latency experienced by users, so it seemed like a good time to do some benchmark testing and establish a leaderboard. In our testing we compared:</p>
<ul>
<li><p>Deepgram</p>
</li>
<li><p>Elevenlabs</p>
</li>
<li><p>Google <sup>*</sup></p>
</li>
<li><p>Microsoft</p>
</li>
<li><p>PlayHT</p>
</li>
<li><p>Rime Labs</p>
</li>
<li><p>Whisper</p>
</li>
</ul>
<blockquote>
<p><sup>*</sup>With all other vendors we measured time-to-first-byte; however with Google we were forced to measure time-to-last-byte as we have not implemented a proper streaming API integration for them (yet).</p>
</blockquote>
<h1 id="heading-the-testbed">The testbed</h1>
<p>We ran the tests using a jambonz server running in the AWS us-east-1 region on a single EC2 t2.medium instance. We ran against the hosted SaaS service for each of the vendors. We tested two different scenarios, both common to conversational AI:</p>
<ul>
<li><p>a very short user prompt (e.g., "Hello and thank you for calling. How can I assist you today?"),</p>
</li>
<li><p>and a slightly longer prompt that a caller might commonly encounter (e.g., "It seems like you're having trouble logging into your account. For security reasons, please provide the email address associated with your account. Once verified, I will send a password reset link directly to your email. Alternatively, say 'help' for more assistance.")</p>
</li>
</ul>
<p>We tested 5 variations of short and long prompts on each TTS engine, using English language:</p>
<p><em>short prompts</em></p>
<ul>
<li><p>Hello and thank you for calling. How can I assist you today?</p>
</li>
<li><p>Please hold while I transfer you to a customer service representative.</p>
</li>
<li><p>I'm sorry, I didn't catch that. Could you please repeat your request?</p>
</li>
<li><p>Your current balance is $347.92. Would you like to make a payment now?</p>
</li>
<li><p>Thank you for your patience. A representative will be with you shortly.'</p>
</li>
</ul>
<p><em>longer prompts</em></p>
<ul>
<li><p>Thank you for calling our customer support line. To better assist you, please state the reason for your call, such as 'billing', 'technical support', or 'account information'. You can also say 'more options' to hear additional services.</p>
</li>
<li><p>You have indicated that you are calling about a billing issue. If you would like to proceed with a payment, please say 'Make a payment'. If you need details about your last transaction or have a billing dispute, please say 'Billing details'</p>
</li>
<li><p>It seems like you're having trouble logging into your account. For security reasons, please provide the email address associated with your account. Once verified, I will send a password reset link directly to your email. Alternatively, say 'help' for more assistance.</p>
</li>
<li><p>Our records show that your warranty is due to expire in 30 days. To extend your warranty for another year, please say 'Extend warranty'. If you would like to know the benefits of extending your warranty, please say 'Explain benefits'.</p>
</li>
<li><p>If you are calling to update your personal information, such as address or phone number, please clearly state the new information after the beep. For any changes to sensitive data, such as your password or payment methods, please ensure you have your security pin ready.</p>
</li>
</ul>
<p>In all cases (except Google, as described above) we measured the time from sending the request to the service to receiving the first byte of audio. We give more details on the configuration of each TTS service later in this blog post.</p>
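<p>For illustration, here is a minimal sketch of that time-to-first-byte measurement (the endpoint and payload are hypothetical placeholders, not any specific vendor's API; the actual tests were driven through jambonz and its telemetry):</p>
<pre><code class="lang-js">// measure ms from sending a TTS request to receiving the first chunk of audio
async function timeToFirstByte(text) {
  const start = process.hrtime.bigint();
  // placeholder endpoint and payload - not any specific vendor's API
  const res = await fetch('https://tts.example.com/v1/synthesize', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({text})
  });
  // res.body yields chunks as they arrive (Node 18+ web streams)
  for await (const chunk of res.body) {
    return Number(process.hrtime.bigint() - start) / 1e6;
  }
}

timeToFirstByte('Hello and thank you for calling. How can I assist you today?')
  .then(function(ms) { console.log(`time to first byte: ${ms.toFixed(0)} ms`); });
</code></pre>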
<h1 id="heading-results">Results</h1>
<p>Before we review the results, there is one additional subtlety to be aware of when measuring latency. Here we are measuring time to first byte, which is an important metric. However, all providers send a small amount of silence at the beginning of the generated audio, and we found that amount differs by provider. The experience of the user will be the time to first byte <strong>plus</strong> the duration of leading silence. In our experience, the vendors fell into two categories:</p>
<ul>
<li>those providing very short duration of leading silence; this includes Deepgram (~150 ms), Elevenlabs (~100ms), Microsoft (~150ms), and RimeLabs (~200ms); and</li>
<li>those providing longer duration of leading silence: Google (~600ms), Whisper (~670ms), and PlayHT (~637ms).</li>
</ul>
<p>Keep these in mind as we review the results.</p>
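<p>(Measuring that leading silence is straightforward, by the way. A sketch like the following, assuming raw 16-bit linear PCM and treating any sample below a small amplitude threshold as silence, is roughly what's involved.)</p>
<pre><code class="lang-js">// return the duration (ms) of leading silence in a buffer of 16-bit mono PCM
function leadingSilenceMs(buf, sampleRate, threshold) {
  for (let i = 0; i + 1 &lt; buf.length; i += 2) {
    if (Math.abs(buf.readInt16LE(i)) &gt; threshold) {
      return (i / 2 / sampleRate) * 1000;  // first sample above the threshold
    }
  }
  return (buf.length / 2 / sampleRate) * 1000;  // the entire buffer was silence
}

// e.g. leadingSilenceMs(fs.readFileSync('tts-output.raw'), 8000, 256)
</code></pre>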
<p>Without further ado, here are the results from the tests using the short audio requests.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713120196606/19869269-736a-47b3-bee6-70c6d1f348fb.png" alt class="image--center mx-auto" /></p>
<p>and here are the results from testing the longer audio segments.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713120217328/06ed49d5-17ff-4e44-a4f7-817d8e8e841d.png" alt class="image--center mx-auto" /></p>
<p>And here is the detailed data from the tests.</p>
<h4 id="heading-short-audio-tests-time-to-first-byte-ms">short audio tests - time to first byte (ms)</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Prompt</td><td>Deepgram</td><td>Elevenlabs</td><td>Google</td><td>Microsoft</td><td>PlayHT</td><td>RimeLabs</td><td>Whisper</td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td>324</td><td>481</td><td>183</td><td>127</td><td>121</td><td>326</td><td>405</td></tr>
<tr>
<td>2</td><td>348</td><td>451</td><td>173</td><td>345</td><td>61</td><td>248</td><td>649</td></tr>
<tr>
<td>3</td><td>355</td><td>613</td><td>185</td><td>293</td><td>75</td><td>250</td><td>601</td></tr>
<tr>
<td>4</td><td>324</td><td>645</td><td>274</td><td>427</td><td>59</td><td>201</td><td>472</td></tr>
<tr>
<td>5</td><td>355</td><td>471</td><td>192</td><td>318</td><td>50</td><td>187</td><td>470</td></tr>
<tr>
<td><strong>avg.</strong></td><td><strong>341</strong></td><td><strong>532</strong></td><td><strong>201</strong></td><td><strong>302</strong></td><td><strong>73</strong></td><td><strong>242</strong></td><td><strong>519</strong></td></tr>
</tbody>
</table>
</div><h4 id="heading-long-audio-tests-time-to-first-byte-ms">long audio tests - time to first byte (ms)</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Prompt</td><td>Deepgram</td><td>Elevenlabs</td><td>Google</td><td>Microsoft</td><td>PlayHT</td><td>RimeLabs</td><td>Whisper</td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td>460</td><td>771</td><td>450</td><td>293</td><td>67</td><td>642</td><td>600</td></tr>
<tr>
<td>2</td><td>363</td><td>739</td><td>420</td><td>345</td><td>50</td><td>320</td><td>772</td></tr>
<tr>
<td>3</td><td>357</td><td>833</td><td>465</td><td>356</td><td>168</td><td>306</td><td>581</td></tr>
<tr>
<td>4</td><td>435</td><td>1404</td><td>338</td><td>364</td><td>65</td><td>322</td><td>781</td></tr>
<tr>
<td>5</td><td>472</td><td>783</td><td>367</td><td>409</td><td>111</td><td>340</td><td>830</td></tr>
<tr>
<td><strong>avg.</strong></td><td><strong>417</strong></td><td><strong>906</strong></td><td><strong>408</strong></td><td><strong>353</strong></td><td><strong>92</strong></td><td><strong>386</strong></td><td><strong>712</strong></td></tr>
</tbody>
</table>
</div><h1 id="heading-our-findings">Our findings</h1>
<p>Wow! We were not expecting this.</p>
<ul>
<li><p><strong>PlayHT</strong> (avg 73ms short audio/92ms long audio) was the winner by a mile, with blazingly fast results. Sub-100ms times to first byte (what!!??) are quite astonishing, given that we are including network round-trip time in that measurement. We truly did not expect this, and count us impressed. PlayHT's voices are regarded as high-quality, natural-sounding voices, and they offer a voice cloning feature as well. We did experience some minor audio defects in our testing: PlayHT incorrectly pronounced "$347.92" as "three hundred forty-seven dollars and ninety-two two", though all the longer prompts played perfectly. Keep in mind, however, that this very short time-to-first-byte needs to be weighed against the fact that the audio itself contains a bit more leading silence than some of the other vendors.</p>
</li>
<li><p>Our next biggest surprise was <strong>Google</strong> (201ms/408ms). We were surprised on two fronts: first, Google carried an extra burden in that we were measuring time to last byte instead of first, because we have not yet implemented streaming support for Google in jambonz; and second, we are historically used to seeing times in the neighborhood of 800ms+ for Google to synthesize audio. (Keep in mind that Google does return a fair amount of leading silence, so head-to-head in overall perceived latency they would probably fall slightly behind Microsoft, for instance.) Something must have changed recently over at Google to deliver these impressive numbers.</p>
</li>
<li><p>One of the new entrants, <strong>RimeLabs</strong> (242ms/386ms), turned in some very fast times as well. And if you factor in that they return much less leading silence than PlayHT, they provide arguably the fastest user experience. RimeLabs also has an optional feature to reduce latency even further by skipping the text normalization phase; when we enabled this feature the numbers got even better (217ms/274ms), edging out everyone except PlayHT. However, they recommend enabling this only on text "where there are no digits, abbreviations, or tricky punctuation" and we found this to be true: when we enabled it, our text containing the account balance did not play correctly. Additionally, in our testing, we noted some slight but detectable pauses during longer sentences where they did not belong. And finally, it might be personal preference, but most of the voices seem to lack emotion, as if they are doing a task they are not interested in. If I were calling into a contact center I'd feel like I was talking to a bored gen-X'er who was counting the minutes until they could go off shift.</p>
</li>
<li><p>Reinforcing that the dinosaurs are not dead, <strong>Microsoft</strong> (302ms/353ms) came in with very fast times as well, competitive with the new entrants and raising the question: why change? I guess it turns out that having buckets of money to throw at GPUs is still an advantage. Both Google and Microsoft deliver those perfectly crafted AI voices that are so good that, counter-intuitively, you immediately know that it's AI you're talking to and not a real person.</p>
</li>
<li><p><strong>Deepgram</strong>'s (341ms/417ms) new Aura offering was not left behind, turning in some very fast times of its own. There may still be a few issues to work out, as we experienced unnatural pauses once or twice during longer sentences where there was no comma, semi-colon, or other indication there should be one. Additionally, a spurious soft 'A' was inserted when synthesizing a phrase enclosed in single quotes; e.g. "say 'billing' if you have a billing question" is heard as "say ah billing if you have..".</p>
</li>
<li><p><strong>Whisper</strong> (519ms/712ms) from OpenAI impressed as well. Its times were slightly longer than the rest of the field, but the quality was outstanding: the voices sounded great, the speech cadence was perfect, and the pronunciations were spot on.</p>
</li>
<li><p><strong>Elevenlabs</strong> (532ms/906ms) has become well-known for its natural sounding voices and is rapidly becoming the choice of many for that reason. Its times in our test were slightly slower than the rest of the field, but still quite fast overall. We experienced no defects in the generated audio. </p>
</li>
</ul>
<h1 id="heading-summary">Summary</h1>
<p>Our main takeaway is how fast <strong>all</strong> of these vendors are. A year ago, we would have been happy with sub-second results - now we are hungering for, and in some cases getting, ttfb times of less than 100 milliseconds. All of these vendors provide great products that are worth evaluating for those planning their CX/AI rollout. We're looking forward to the vendors polishing things like speech cadence and the minor imperfections that we encountered.</p>
<p>We should note that we are also happy to work directly with any of these vendors to collaborate on testing or on fine-tuning our integrations if necessary to improve performance and overall user experience.  We will update our leaderboard from time to time, and we are always adding support for new vendors so reach out to us if you provide a TTS service and would like to be included in future reports.</p>
<p>Also, feel free to create a free account on the <a target="_blank" href="https://jambonz.cloud">jambonz cloud</a> to try out jambonz!</p>
<h3 id="heading-appendix-notes-on-our-configuration">Appendix: Notes on our configuration</h3>
<p>A few notes on how we configured each speech service.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>vendor</td><td>model</td><td>voice</td></tr>
</thead>
<tbody>
<tr>
<td>Deepgram</td><td>Aura</td><td>Asteria</td></tr>
<tr>
<td>Elevenlabs</td><td>turbo-v2</td><td>Serena</td></tr>
<tr>
<td>Google</td><td></td><td>Wavenet-C</td></tr>
<tr>
<td>Microsoft</td><td></td><td>Ava (multi-lingual)</td></tr>
<tr>
<td>PlayHT</td><td>PlayHT2.0-Turbo</td><td>Jennifer</td></tr>
<tr>
<td>RimeLabs</td><td>Mist</td><td>Abby</td></tr>
<tr>
<td>Whisper</td><td>tts-1</td><td>Alloy</td></tr>
</tbody>
</table>
</div>]]></content:encoded></item><item><title><![CDATA[Improving the voicebot experience]]></title><description><![CDATA[Yes, we can talk to AI. Connecting a speech driven interface to AI is easy. But crafting a conversational experience that approaches the ease and pleasure of a conversation with another human is not.
One of the challenges is that speech recognition s...]]></description><link>https://blog.jambonz.org/improving-the-voicebot-experience</link><guid isPermaLink="true">https://blog.jambonz.org/improving-the-voicebot-experience</guid><category><![CDATA[Cx]]></category><category><![CDATA[AI]]></category><category><![CDATA[conversational-ai]]></category><category><![CDATA[speech to text]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Thu, 08 Feb 2024 17:29:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1707346499937/bb216986-859b-4cf0-b8cd-8913d3198d6e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yes, we can talk to AI. Connecting a speech driven interface to AI is easy. But crafting a conversational experience that approaches the ease and pleasure of a conversation with another human is not.</p>
<p>One of the challenges is that speech recognition systems are not that good at detecting the turn of a conversation. As humans, we're super good at this - we continually process all sorts of cues during a conversation to determine when our partner has finished speaking and we can jump in. For instance:</p>
<ul>
<li><p>s/he asked us a question so we know it's time for us to respond (and think about how easy it is for us to classify a question as rhetorical, in which case we understand it is <em>not</em> for us to respond to),</p>
</li>
<li><p>s/he starts by saying "it's a long story.." and we settle in to wait longer for our turn to speak, or</p>
</li>
<li><p>s/he pauses during her speech, but we know it is a sort of "thinking" or a "bridge" pause and we don't break in ("Yeeeah..(pause)...well, it's not that simple, my friend..")</p>
</li>
</ul>
<p>However, in a conversation with AI, speech recognition services will return a transcript any time the speaker pauses, regardless of whether this is a complete response. Quite often, conversations get derailed because the AI tried to process a partial statement from the speaker and returned a nonsensical response. If, on the other hand, we simply wait an extra-long time to be sure the user has finished speaking, we get a stilted conversation with lots of uncomfortable silence that is even worse.</p>
<h3 id="heading-using-ai-to-help-predict-the-turns-of-the-conversation">Using AI to help predict the turns of the conversation</h3>
<p>What if we used AI to predict the type of response a caller may make to a given statement or question from a voicebot? And what if we then used that prediction to tune the speech recognizer specifically for this turn of the conversation?</p>
<p>This is essentially a text classification operation, which is something that LLMs are really good at. Let's build a simple example using OpenAI's gpt-3.5-turbo model to test with.</p>
<h4 id="heading-streamlit-app">Streamlit app</h4>
<p>Below is a streamlit app that we can use to test out our idea. In this app we are using a few-shot prompt to configure the model to assess a statement from a voicebot and predict the type of response a caller might make. You'll need an OpenAI key to test with, and you can actually modify the prompt and examples to see how it impacts your results.</p>
<p>We ask the model to predict and classify the next response from the caller as one of four types:</p>
<ul>
<li><p>a single utterance</p>
</li>
<li><p>a single sentence</p>
</li>
<li><p>multiple sentences</p>
</li>
<li><p>identification data</p>
</li>
</ul>
<p>The categories are mostly self-explanatory, but the category of identification data needs some explanation. The purpose of this category is to predict when a caller is going to need to do something like give a credit card or customer number, spell their name or email address, etc. In these cases we need to make sure to give some extra time since people will often speak slowly, or pause while they refer to notes they are reading from.</p>
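<p>Under the hood this classification is a single chat completion request. Here is a minimal sketch using curl; the system prompt and few-shot examples are abbreviated versions of what the app below uses, so treat the exact wording as illustrative:</p>
<pre><code class="lang-bash">curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "messages": [
      {"role": "system", "content": "Classify the type of response a caller is likely to give to the following voicebot prompt. Answer with exactly one of: single utterance, single sentence, multiple sentences, identification data."},
      {"role": "user", "content": "Would you like to speak to an agent?"},
      {"role": "assistant", "content": "single utterance"},
      {"role": "user", "content": "Can you spell your last name for me, please?"},
      {"role": "assistant", "content": "identification data"},
      {"role": "user", "content": "Please tell us how we can help you today."}
    ]
  }'
</code></pre>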
<p>Read the instructions on the first page of the app below, review the prompts and examples and then click on the "Try it out" page and enter a statement or question and see what type of response the model predicts.  You can even change the prompt or the few-shot examples to see if you can improve the results!</p>
<iframe src="https://jambonz-prompt-classsification.streamlit.app//?embed=true" height="800" width="100%"></iframe>

<p>If you are unable to run the application yourself for some reason, here is a screenshot showing a sample query and response:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707354183774/a050664c-6ac3-488b-be1d-082a340ccb9d.png" alt class="image--center mx-auto" /></p>
<h4 id="heading-implementing-in-jambonz">Implementing in jambonz</h4>
<p>For our experiment we'll make a simple change to our jambonz app to consult OpenAI at each turn of the conversation to get a classification of the next expected response. We'll then turn that into a specific configuration command sent to the speech recognizer.</p>
<p>We're using <a target="_blank" href="https://deepgram.com">Deepgram</a> and specifically we'll be tuning their endpointing and utteranceEndMs properties <a target="_blank" href="https://developers.deepgram.com/docs/understanding-end-of-speech-detection#using-utteranceend">as described here</a>.</p>
<ul>
<li><p>If the classification is <code>single utterance</code>, we'll leave the settings at their default, because the default settings have extremely low latency (i.e. transcripts returned very quickly after extremely short pauses)</p>
</li>
<li><p>If the classification is <code>single sentence</code>, we'll set the properties to <code>{endpointing: 500}</code> to set the endpointing to 500 ms,</p>
</li>
<li><p>If the classification is <code>multiple sentences</code>, we'll set the properties to <code>{endpointing: 500, utteranceEndMs: 2000}</code>, which sets the endpointing as above and additionally requires 2 seconds of non-speech before returning a final transcript</p>
</li>
<li><p>If the classification is <code>identification data</code>, we'll also set the properties to the same values as above, since we also need to allow the caller plenty of time to finish entering their customer number, spell their name, or what have you.</p>
</li>
</ul>
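<p>For illustration, here is roughly what the gather verb might look like for the <code>multiple sentences</code> case. Treat this as a sketch only: the actionHook path is made up, and you should check the recognizer schema in the jambonz docs for your version:</p>
<pre><code class="lang-bash">{
  "verb": "gather",
  "input": ["speech"],
  "actionHook": "/handle-response",
  "recognizer": {
    "vendor": "deepgram",
    "deepgramOptions": {
      "endpointing": 500,
      "utteranceEndMs": 2000
    }
  }
}
</code></pre>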
<h4 id="heading-results">Results</h4>
<p>The resulting conversation after these changes is much more natural and, as a result, much more effective. Check out the video below to see the difference before and after the AI tuning approach is implemented.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/DqvmUhE7NaE?si=gmgZuutuED4aCLEm"></iframe>

<h2 id="heading-conclusion">Conclusion</h2>
<p>The advances in AI over the past year have been breathtaking. However, the quality of speech interactions between humans and AI still lags. Until problems like detecting turns of conversation are solved, the promise of AI in our everyday lives will be stunted. This is a hard problem that will take an array of solutions to solve comprehensively, and in this post we have only experimented with a relatively simple approach of using text-based classification to improve the prediction of dialog turns. We look forward to doing further work on this topic in the future.</p>
]]></content:encoded></item><item><title><![CDATA[Priority queues in jambonz]]></title><description><![CDATA[Overview
Being able to prioritize incoming calls based on its “importance” is an essential feature in many business scenarios. For example, the most common scenario in the financial sector is to skip the whole queue in case a client is calling to blo...]]></description><link>https://blog.jambonz.org/priority-queues-in-jambonz</link><guid isPermaLink="true">https://blog.jambonz.org/priority-queues-in-jambonz</guid><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Tue, 04 Jul 2023 20:58:41 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-overview">Overview</h1>
<p>Being able to prioritize incoming calls based on their importance is an essential feature in many business scenarios. For example, a common scenario in the financial sector is letting a client skip the entire queue when they are calling to block a lost or stolen credit card.</p>
<p>Until now there was no way to do this in jambonz. Starting with version 0.8.4 you can prioritize incoming calls by adding a priority attribute to the queued call. Calls with the highest priority are connected first, even when other calls are already waiting in the queue. This release also adds the option to pull any call out of the queue: all you need to know is the callSid of the queued call. Isn't this awesome?</p>
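<p>As a sketch, queuing a call with a priority might look like the following in your application's verb document. The queue name is a placeholder, and you should consult the enqueue verb documentation for the exact semantics and allowed range of the priority value:</p>
<pre><code class="lang-bash">{
  "verb": "enqueue",
  "name": "support",
  "priority": 50
}
</code></pre>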
<h1 id="heading-node-red-example">Node-RED example</h1>
<p>Let’s cover a simple example using Node-RED. Node-RED nodes for jambonz are available in the <a target="_blank" href="https://flows.nodered.org/node/@jambonz/node-red-contrib-jambonz">@jambonz/node-red-contrib-jambonz</a> package. Just follow <a target="_blank" href="https://nodered.org/docs/user-guide/runtime/adding-nodes">this guide</a> to install them.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688503550340/6227c2e5-a92c-4b8f-9cc8-994310aa814e.png" alt class="image--center mx-auto" /></p>
<p>The basics of creating a jambonz application are described in <a target="_blank" href="https://blog.jambonz.org/building-your-first-jambonz-app-using-nodejs">this blog post</a>. In this example we are focused on how to build the Node-RED flow.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688503725919/92c8196b-6ee7-4852-a591-c7d861f78503.png" alt class="image--center mx-auto" /></p>
<p>In this flow we greet the customer when they join the queue and fetch a priority from an external application based on the called number. If the number is allocated to a VIP, the call is placed into the queue with the highest priority. If the external server is not reachable, the catch node continues the flow and places the call in the queue with the default priority.</p>
<p>While the call is queued, we announce that it is still in the queue and that we are waiting for an available agent to handle it. If the call waits for a free agent longer than the allowed time, it is disconnected with a 'goodbye' announcement.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>The ability to assign priorities to calls in a queue is a useful feature for many use cases.  It is now supported in jambonz, but is completely optional.  As before, calls can be queued with no priority if desired.</p>
<p>Thanks to @AVoylenko for driving this feature!</p>
]]></content:encoded></item><item><title><![CDATA[How to monitor jambonz on AWS using Voipmonitor and Traffic Mirroring]]></title><description><![CDATA[jambonz comes with a great set of observability tools out of the box:

a grafana dashboard displaying key performance metrics,

opentelemetry application traces, which are visible in the jambonz portal, and

the ability to download sip traces in the ...]]></description><link>https://blog.jambonz.org/how-to-monitor-jambonz-on-aws-using-voipmonitor-and-traffic-mirroring</link><guid isPermaLink="true">https://blog.jambonz.org/how-to-monitor-jambonz-on-aws-using-voipmonitor-and-traffic-mirroring</guid><category><![CDATA[#cpass #voicegateway #opensource #jambonz]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Thu, 18 May 2023 14:38:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1684420664944/125d4af0-6e58-4599-9491-9ab457e18dbd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>jambonz comes with a great set of observability tools out of the box:</p>
<ul>
<li><p>a grafana dashboard displaying key performance metrics,</p>
</li>
<li><p>opentelemetry application traces, which are visible in the jambonz portal, and</p>
</li>
<li><p>the ability to download sip traces in the form of pcap files from the portal for recent calls.</p>
</li>
</ul>
<p>The gold standard in VoIP system monitoring, however, has always been <a target="_blank" href="https://www.voipmonitor.org/">Voipmonitor</a>. Voipmonitor provides a wealth of charts and detailed analytic tools for SIP and RTP that are powerful and yet accessible to tier-one support engineers. If you want to level up your jambonz support organization, implementing Voipmonitor is a great choice.</p>
<p>In this article we will show you how to deploy Voipmonitor on AWS using the Traffic Mirroring feature to mirror traffic from your jambonz SIP and RTP servers to a voipmonitor server.</p>
<blockquote>
<p>Note: This works both for EC2 as well as Kubernetes installs.</p>
</blockquote>
<p>Also, as a bonus, we will show you how to write the mirrored traffic to pcap files and upload them to S3 storage, separately from Voipmonitor. Having rolling raw pcap files like this can be useful for troubleshooting low level transport connection issues such as TLS connectivity with carriers.</p>
<p>Let's get started!</p>
<h2 id="heading-what-you-need">What you need</h2>
<h3 id="heading-nitro-based-ec2-instances">Nitro-based EC2 instances</h3>
<p>We'll be mirroring the traffic to the voipmonitor server using <a target="_blank" href="https://docs.aws.amazon.com/vpc/latest/mirroring/traffic-mirroring-getting-started.html">AWS Traffic Mirroring</a>, and that feature is only available to <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html#ec2-nitro-instances">nitro-based instances</a>, so make sure that your jambonz servers are running on a nitro-based instance as well as the new server that you'll be spinning up to run voipmonitor.</p>
<h3 id="heading-voipmonitor">Voipmonitor</h3>
<p>And of course you'll need to install Voipmonitor as well (we'll show you how). Voipmonitor is a commercial product, but don't be dissuaded from trying it -- they provide a 30-day free trial and the <a target="_blank" href="https://www.voipmonitor.org/whmcs/cart.php?gid=1">pricing is very reasonable</a> when you are ready to upgrade to a paid license.</p>
<h2 id="heading-installing-voipmonitor">Installing Voipmonitor</h2>
<p>The first step is to spin up a new instance in the AWS VPC where your jambonz system is running (Traffic Mirroring works within a VPC, so Voipmonitor must be in the same VPC unless you choose to do more complicated VPC peering). Again, this must be a nitro-based instance; in my deployment I chose to use a t3.small. You'll want a lot of disk space, at least 50G (for my deployment I used a 100G disk).</p>
<p>So go ahead and spin up a nitro-based instance on Debian 11, configured as above, in your jambonz VPC. Create a new security group for the instance that allows the following traffic in:</p>
<ul>
<li><p>ssh (22/tcp) from anywhere,</p>
</li>
<li><p>http (80/tcp) and https (443/tcp) from anywhere, and</p>
</li>
<li><p>VXLAN (4789/udp) from the VPC</p>
</li>
</ul>
<p>Once the instance is up and running, follow <a target="_blank" href="https://www.voipmonitor.org/doc/Debian_11">these instructions</a> to install the voipmonitor GUI and sniffer. The sniffer will read the mirrored traffic, which arrives on the network interface on 4789/udp. The default voipmonitor config (in /etc/voipmonitor.conf) listens for VXLAN traffic on this port, so no changes are needed to the config file.</p>
<p>After installing the GUI, connect your browser to the public IP of the voipmonitor server using http (not https). You will be guided through the final stages of the install and redirected to the voipmonitor corporate site to generate a free license for the demo period of 30 days.</p>
<p>As a final step, if you want to enable HTTPS for the voipmonitor GUI, <a target="_blank" href="https://www.voipmonitor.org/doc/Enable_SSL/TLS_%2B_self_signed_certificate_for_http_server">follow these instructions</a>. As an example, to use <a target="_blank" href="https://letsencrypt.org">letsencrypt</a> to generate your certificate you would first create a DNS A record for the server in your DNS provider, and then simply do this:</p>
<pre><code class="lang-bash">apt-get update
apt install snapd
snap install core
snap install --classic certbot
ln -s /snap/bin/certbot /usr/bin/certbot
certbot --apache
systemctl restart apache2
</code></pre>
<p>Once voipmonitor is up and running, we now need to mirror the traffic from the jambonz SIP and RTP servers to voipmonitor.</p>
<h2 id="heading-configuring-aws-traffic-mirroring">Configuring AWS Traffic Mirroring</h2>
<p>There are three steps to configuring traffic mirroring:</p>
<ol>
<li><p>Create a mirror target</p>
</li>
<li><p>Create two mirror filters: one for SIP and one for RTP</p>
</li>
<li><p>Create mirror sessions for each jambonz SIP and RTP server. A mirror session will direct traffic from one elastic network interface (ENI) to the mirror target, using the mirror filter to determine which traffic to mirror.</p>
</li>
</ol>
<h3 id="heading-create-a-mirror-target">Create a mirror target</h3>
<p>First, retrieve and copy the Interface ID for the network interface that is attached to voipmonitor instance that you created. You can find it in the network panel of the instance details as shown below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684377647168/0be45cc6-26ff-4ddb-86e3-42c9a3856a39.png" alt class="image--center mx-auto" /></p>
<p>Then go to Traffic mirroring / Mirror targets / Create mirror target. Leave Target type set to "Network Interface" and paste the Interface ID that you copied above. Add a Name tag and click Create. This configures the voipmonitor instance to receive the mirrored traffic.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684377827038/f45a1c0d-0ba1-481f-a629-e59a9a6f553c.png" alt class="image--center mx-auto" /></p>
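<p>If you prefer to script this, the same target can be created with the AWS CLI (the ENI id below is a placeholder for the one you copied):</p>
<pre><code class="lang-bash">aws ec2 create-traffic-mirror-target \
  --network-interface-id eni-0123456789abcdef0 \
  --description "voipmonitor mirror target"
</code></pre>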
<h3 id="heading-create-mirror-filters">Create mirror filters</h3>
<p>Go to Traffic mirroring / Mirror filters / Create mirror filter. First, let's create a filter for sip traffic.</p>
<p>Create inbound rules that accept the following:</p>
<ul>
<li><p>port 5060/udp from anywhere (this is for sip over udp),</p>
</li>
<li><p>5060/tcp, 5061/tcp, 8443/tcp from anywhere (sip over tcp, tls, and websockets), and</p>
</li>
<li><p>icmp (can be useful to troubleshoot destination unreachable issues).</p>
</li>
</ul>
<p>Create outbound rules that accept the following:</p>
<ul>
<li><p>5060/udp sent from the VPC to anywhere</p>
</li>
<li><p>5060/tcp, 5061/tcp, 8443/tcp from the VPC sent anywhere, and</p>
</li>
<li><p>icmp</p>
</li>
</ul>
<p>Save the mirror filter. As an example, my sip inbound rules look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684421674864/d8a2fe59-907c-41c6-9314-ac261084197e.png" alt class="image--center mx-auto" /></p>
<p>and my sip outbound rules look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684421696319/df6ec5ee-6970-41f9-a60d-992708033073.png" alt class="image--center mx-auto" /></p>
<p>Now create a second filter for rtp (a CLI equivalent is sketched after these lists). Inbound rules this time will be:</p>
<ul>
<li><p>dst port 40000-60000/udp</p>
</li>
<li><p>icmp</p>
</li>
</ul>
<p>and outbound rules will be:</p>
<ul>
<li><p>src port: 40000-60000/udp</p>
</li>
<li><p>icmp</p>
</li>
</ul>
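<p>As mentioned, the rtp filter can also be created from the AWS CLI. A sketch is below; the filter id is a placeholder for the id returned by the first command, and the outbound rule would repeat the second command with <code>--traffic-direction egress</code> and a source port range instead:</p>
<pre><code class="lang-bash"># create the filter itself
aws ec2 create-traffic-mirror-filter --description "jambonz rtp"
# add the inbound rule: udp (protocol 17), destination ports 40000-60000
aws ec2 create-traffic-mirror-filter-rule \
  --traffic-mirror-filter-id tmf-0123456789abcdef0 \
  --traffic-direction ingress --rule-number 100 --rule-action accept \
  --protocol 17 \
  --destination-port-range FromPort=40000,ToPort=60000 \
  --source-cidr-block 0.0.0.0/0 --destination-cidr-block 0.0.0.0/0
</code></pre>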
<h3 id="heading-create-mirror-sessions">Create mirror sessions</h3>
<p>Finally, create a mirror session for every jambonz SIP server and RTP server. If using Kubernetes, you will create a mirror session for every node in the SIP and RTP node pools. The source node in each case will be the jambonz server and the destination will be the mirror target that you created earlier.</p>
<p>As a first step, for each source node gather the interface ID of the ENI for each EC2 instance. Then create a mirror session for each node; for sip nodes use the sip mirror filter and for rtp nodes use the rtp mirror filter. In all cases connect to the single mirror target that you've created.</p>
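<p>This too can be scripted; you would run something like the following once per source ENI (all ids below are placeholders):</p>
<pre><code class="lang-bash">aws ec2 create-traffic-mirror-session \
  --network-interface-id eni-0aaaabbbbccccdddd \
  --traffic-mirror-target-id tmt-0123456789abcdef0 \
  --traffic-mirror-filter-id tmf-0123456789abcdef0 \
  --session-number 1
</code></pre>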
<p>Once you've done this, mirrored traffic should be flowing to voipmonitor and you should see calls in the voipmonitor GUI.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684379214380/043c506a-2a92-4ef6-b7f5-cbfc5e1cd869.png" alt class="image--center mx-auto" /></p>
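<p>If calls do not appear, a quick sanity check is to confirm that mirrored VXLAN traffic is actually reaching the voipmonitor server (the interface name may differ on your instance):</p>
<pre><code class="lang-bash">sudo tcpdump -ni ens5 'udp port 4789' -c 10
</code></pre>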
<h2 id="heading-bonus-section-upload-pcap-files-to-aws-s3">Bonus section: upload pcap files to AWS S3</h2>
<p>In addition to voipmonitor you may also want to upload pcaps of network traffic to AWS S3.</p>
<blockquote>
<p>Note: this section does not require the voipmonitor install, though it does require the traffic mirroring to be set up as above.</p>
</blockquote>
<h3 id="heading-create-an-s3-bucket">Create an S3 bucket</h3>
<p>Create a bucket in S3 to hold the pcap files.</p>
<h3 id="heading-install-the-aws-cli-on-the-mirror-target-server">Install the aws cli on the mirror target server</h3>
<pre><code class="lang-bash">sudo apt-get update
sudo apt-get upgrade
sudo apt-get install unzip
curl <span class="hljs-string">"https://d1vvhvl2y92vvt.cloudfront.net/awscli-exe-linux-x86_64.zip"</span> -o <span class="hljs-string">"awscliv2.zip"</span>
unzip awscliv2.zip
sudo ./aws/install
</code></pre>
<p>Configure the aws cli</p>
<pre><code class="lang-bash">aws configure
</code></pre>
<h3 id="heading-create-a-systemd-file-for-tcpdump">Create a systemd file for tcpdump</h3>
<p>We'll be using tcpdump to continually write the incoming mirrored traffic to pcap files on the server. To do that, create a file named /etc/systemd/system/tcpdump.service:</p>
<pre><code class="lang-bash">[Unit]
Description=Tcpdump

[Service]
ExecStart=/usr/bin/tcpdump -ni ens5 <span class="hljs-string">'port 4789'</span> -W 5 -C 300 -w /tmp/rolling.pcap
Restart=on-failure
User=root
LimitNOFILE=4096

[Install]
WantedBy=multi-user.target
</code></pre>
<p>This writes all the encapsulated traffic to 5 rotating pcap files, closing each file when it reaches 300 MB. Start the service:</p>
<pre><code class="lang-bash">sudo systemctl daemon-reload
sudo systemctl start tcpdump.service
</code></pre>
<h3 id="heading-monitor-pcap-files-and-upload-them-to-aws-s3">Monitor pcap files and upload them to AWS S3</h3>
<p>Next we need a job to monitor these files and upload them to your bucket. First, install support for inotify, which we will use to detect when a rolling pcap file has been closed.</p>
<pre><code class="lang-bash">sudo apt-get install inotify-tools
</code></pre>
<p>Create a file named /usr/local/bin/pcap-monitor.sh. Copy the contents below into the file:</p>
<pre><code>#!<span class="hljs-regexp">/bin/</span>bash

folder=<span class="hljs-string">"/tmp/"</span>
bucket=<span class="hljs-string">"&lt;your-bucket-name&gt;"</span>

inotifywait -m $folder -e close_write |
    <span class="hljs-keyword">while</span> read path action file; <span class="hljs-keyword">do</span>
        <span class="hljs-keyword">if</span> [[ <span class="hljs-string">"$file"</span> =~ .*pcap[<span class="hljs-number">0</span><span class="hljs-number">-9</span>]$ ]]; then
            echo <span class="hljs-string">"The file '$file' appeared in directory '$path' via '$action'"</span>
            # Format the current date and time
            current_date=$(date +%Y-%m-%d)
            current_time=$(date +%H-%M-%S)
            # Copy and compress the file
            cp ${path}${file} ${path}${file}.tmp &amp;&amp; gzip -q ${path}${file}.tmp
            # S3 path
            s3_path=<span class="hljs-string">"s3://$bucket/${current_date}/${current_time}.pcap.gz"</span>
            echo <span class="hljs-string">"Uploading to $s3_path"</span>
            # Upload to S3 and remove the compressed file
            aws s3 cp ${path}${file}.tmp.gz $s3_path --quiet &amp;&amp; rm ${path}${file}.tmp.gz
            echo <span class="hljs-string">"Upload to $s3_path completed"</span>
        fi
    done
</code></pre><p>This script watches the /tmp directory and every time a new pcap file is closed it will zip it and then upload the zip file to your S3 bucket, deleting the zip file afterwards. Make the file executable:</p>
<pre><code class="lang-bash">sudo chmod a+x /usr/<span class="hljs-built_in">local</span>/bin/pcap-monitor.sh
</code></pre>
<p>Next, we'll create a systemd daemon to run this.</p>
<p>Create a file named /etc/systemd/system/pcap-monitor.service and copy the contents below in:</p>
<pre><code class="lang-bash">[Unit]
Description=Monitor Script
After=network.target

[Service]
ExecStart=/bin/bash /usr/<span class="hljs-built_in">local</span>/bin/pcap-monitor.sh
WorkingDirectory=/tmp
StandardOutput=journal
StandardError=journal
SyslogIdentifier=pcap-monitor

[Install]
WantedBy=multi-user.target
</code></pre>
<p>Now start it:</p>
<pre><code class="lang-bash">sudo systemctl daemon-reload
sudo systemctl restart pcap-monitor
</code></pre>
<p>All set! At this point pcap files should be periodically uploaded to your bucket, into folders by date. You may want to configure the bucket to automatically delete pcap files after a certain number of days. Either way, you now have access to all network traffic from your production SBC SIP and RTP servers in order to troubleshoot any tricky SIP interop issues.</p>
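<p>For example, a lifecycle rule that expires objects after 14 days can be applied with the AWS CLI (the bucket name and retention period are placeholders):</p>
<pre><code class="lang-bash">aws s3api put-bucket-lifecycle-configuration \
  --bucket &lt;your-bucket-name&gt; \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-pcaps",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 14 }
    }]
  }'
</code></pre>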
<h3 id="heading-time-for-one-more-option">Time for one more option?</h3>
<p>If you like, with a few more steps you can strip the VXLAN headers from the pcaps, so that they will appear in wireshark exactly as they arrived over the wire. There is no problem with leaving the pcaps encapsulated in VXLAN headers, but they may be slightly less confusing to analyze in a tool like wireshark with the headers removed. If you want to do this, first you need to build a simple utility to strip the headers -- luckily, I have one for you on my github!</p>
<pre><code class="lang-bash">sudo apt-get install libpcap-dev build-essential git
git <span class="hljs-built_in">clone</span> https://github.com/drachtio/decap_vxlan.git
<span class="hljs-built_in">cd</span> decap_vxlan
make &amp;&amp; sudo make install
</code></pre>
<p>Now edit the /usr/local/bin/pcap-monitor.sh file to look like this:</p>
<pre><code>#!<span class="hljs-regexp">/bin/</span>bash

folder=<span class="hljs-string">"/tmp/"</span>
bucket=<span class="hljs-string">"&lt;your-bucket-name&gt;"</span>

inotifywait -m $folder -e close_write |
    <span class="hljs-keyword">while</span> read path action file; <span class="hljs-keyword">do</span>
        <span class="hljs-keyword">if</span> [[ <span class="hljs-string">"$file"</span> =~ .*pcap[<span class="hljs-number">0</span><span class="hljs-number">-9</span>]$ ]]; then
            echo <span class="hljs-string">"The file '$file' appeared in directory '$path' via '$action'"</span>
            # Format the current date and time
            current_date=$(date +%Y-%m-%d)
            current_time=$(date +%H-%M-%S)
            # Copy and compress the file
            cat ${path}${file} | decap_vxlan | gzip -q &gt; ${path}${file}.tmp.gz
            # S3 path
            s3_path=<span class="hljs-string">"s3://$bucket/${current_date}/${current_time}.pcap.gz"</span>
            echo <span class="hljs-string">"Uploading to $s3_path"</span>
            # Upload to S3 and remove the compressed file
            aws s3 cp ${path}${file}.tmp.gz $s3_path --quiet &amp;&amp; rm ${path}${file}.tmp.gz
            echo <span class="hljs-string">"Upload to $s3_path completed"</span>
        fi
    done
</code></pre><p>That's it!  Now you have rolling pcaps of your recent sip and rtp traffic available in a secured S3 bucket for your jambonz deployment.  Enjoy!</p>
]]></content:encoded></item><item><title><![CDATA[Speech companies are failing at conversational AI]]></title><description><![CDATA[It might seem that we're in a golden age of deploying speech technology into contact centers.  You'd be forgiven for thinking that, what with the large numbers of new companies in the space, most funded with planeloads of VC cash, exciting new applic...]]></description><link>https://blog.jambonz.org/speech-companies-are-failing-at-conversational-ai</link><guid isPermaLink="true">https://blog.jambonz.org/speech-companies-are-failing-at-conversational-ai</guid><category><![CDATA[speech to text]]></category><category><![CDATA[voicebot conversationalai]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Sat, 06 May 2023 15:29:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683385624269/427aa339-6ee4-47b1-8471-5c06c24aefe3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It might seem that we're in a golden age of deploying speech technology into contact centers.  You'd be forgiven for thinking that, what with the large numbers of new companies in the space, most funded with planeloads of VC cash, exciting new applications of AI to speech recognition, and the strategic imperative that enterprises see in finally making automated voice interactions...well, not suck.</p>
<p>The truth, however, is that speech technology providers are still failing at conversational AI for the simple reason behind most business failures: they aren't listening to and anticipating their customers' needs.</p>
<p>As the creator of <a target="_blank" href="https://jambonz.org">jambonz</a>, the open source voice gateway for conversational AI, I've spent the past three years working with most of the commercial speech vendors. Based on that experience, I'm suggesting three high-value (and blindingly obvious) features for conversational AI that speech vendors need to implement to improve the conversational AI experience.</p>
<p>But first, let's begin by enumerating the different requirements that conversational AI has from long-form speech-to-text transcription. There are some seemingly subtle yet very important distinctions:</p>
<ul>
<li><p>Every piece of audio from a caller we transcribe in a conversational AI use case is in response to a question or a prompt. There is always something that we <strong>just</strong> asked or said to the user that he or she is responding to <strong>now</strong>. In other words, conversational AI is highly contextual on a <em>short-term</em> (query-response) basis.</p>
</li>
<li><p>Conversational AI is about....(wait for it)...conversations. It's a two-way discussion, <strong>even if we're only transcribing one side of it</strong> (the caller). The conversation proceeds turn by turn. The deeper into the conversation we get, the more accurate and faster we ought to become at accurately transcribing what is being said. Conversational AI, therefore, is also highly contextual on a <em>medium-term</em> (conversation length) basis (and that context includes both sides of the conversation).</p>
</li>
<li><p>During a conversation, there are times when we don't want or need to transcribe a caller's speech. We're in essence not listening to them for certain periods. For instance, we may want to make them listen to a lengthy prompt in full (perhaps for legal purposes). We need to be listening/transcribing most of the time, but not all of the time.</p>
</li>
</ul>
<p>Those are simple properties of a conversational dialog that we can probably all agree on. So what? Well, from these properties we can draw the following critical features that we would require from the underlying speech recognition technology:</p>
<ul>
<li>fine-tuned control of endpointing,</li>
<li>an API interface that includes relevant prompt, and</li>
<li>suitable billing models.</li>
</ul>
<p>Here's what I mean:</p>
<h2 id="heading-fine-tuned-control-of-endpointing">Fine-tuned control of endpointing</h2>
<p>"endpointing" is a feature wherein the speech provider uses speech energy detection to determine the end of an utterance and then returns a transcript for that utterance.</p>
<p>In conversational AI, we prompt the user, and then we want to gather their response. This process is somewhat more an art than a science as you can imagine. We want to get the user's full thought (i.e. not have the recognizer return only the first half of a sentence). Yet we also want to minimize the latency in the conversation (i.e., not have the recognizer take so long to determine the user has finished speaking that the conversation becomes laden with periods of unnatural silence).</p>
<p>All the speech providers that we support in the open source <a target="_blank" href="https://jambonz.org">jambonz</a> conversational AI voice gateway support endpointing. But only one of them -- <a target="_blank" href="https://deepgram.com">Deepgram</a> -- exposes via its API the ability to control the endpointing behavior.</p>
<p>Controlling endpointing behavior would be (and in the case of Deepgram, <em>is</em>) a highly useful feature.  When I have a voicebot ask a user a question that implies a quick confirmation ("Would you like to speak to an agent?"), I'm expecting a yes/no answer.  As a result, I want the endpointing to be very quick (maybe 500 milliseconds). If I'm asking the customer a broader question ("Please tell us how can we help you today?"), however, I want endpointing to wait for maybe 2 seconds of silence to make sure I get everything they want to say, which may, in this case, be more than a single sentence.</p>
<p>All speech providers should expose via their API the ability to control endpointing behavior.</p>
<h2 id="heading-api-interface-that-includes-relevant-prompt">API interface that includes relevant prompt</h2>
<p>Today, most of us are enchanted by the power of ChatGPT, right? And like me, I bet you're really impressed with how generative language models can create such high quality responses in response to nothing but prompt text.</p>
<p>If so, you might also find it strange that while <strong>every time</strong> we connect to a speech recognizer during conversational AI we have <strong>just provided a prompt</strong> to the caller, the speech provider's API apparently has no interest in knowing what that prompt was.  Wouldn't that prompt help shape answers?  For that matter, wouldn't that prompt also help determine the most effective endpointing configuration to use for the current user response?</p>
<p>Today, speech providers allow for things like hints in their APIs -- an array of words or phrases that should be "boosted" in terms of making the recognizer more aware of them.  That's great, and we should have hints, but even more valuable than hints is the <strong>question I just asked the user which she is now responding to</strong>.  And guess what -- we have that exact question, in text form, because we probably just did text-to-speech to generate that question!</p>
<p>So please, speech vendors, augment your APIs to let me tell you the current prompt the user is responding to with their speech. Then, use that information to create more accurate responses.</p>
<p>Some examples may be helpful:</p>
<ul>
<li>I just prompted the user, "Could you please spell your last name?", so the recognizer should now expect some spoken letters (i.e. don't transcribe "T" as "tee").</li>
<li>I just prompted the user, "Could you tell me what is wrong with your medical equipment?", so the recognizer should automatically boost medical equipment words or phrases.</li>
<li>I just prompted the user, "Is this the best number to call you back on?", so the recognizer should be prepared to return quickly after "yes", "no," or other confirmatory/negatory phrases.</li>
</ul>
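<p>To make the ask concrete, a start-of-recognition message could carry the prompt alongside the usual hints. To be clear, this is entirely hypothetical -- it is the API I am asking for, not something any vendor offers today:</p>
<pre><code class="lang-bash"># hypothetical request body -- no vendor currently accepts a "prompt" field
{
  "language": "en-US",
  "hints": ["billing", "technical support"],
  "prompt": "Could you please spell your last name?"
}
</code></pre>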
<h2 id="heading-suitable-billing-model">Suitable billing model</h2>
<p>The billing model that most (all?) speech providers use is per-second billing for the time we are connected to the recognizer, sometimes rounded up to a threshold. We get charged this regardless of whether we actively want a transcript at any given point in the conversation. This leads to an implementation model where the voice gateway connects to the speech recognizer for each turn of the conversation, creates a transcript, and then drops the connection. In the next turn of the conversation, we will prompt the user, connect again to the recognizer, and get a new transcript.</p>
<p>Dropping and re-establishing the connection like this is done to save cost, but it isn't ideal for several reasons.  For one thing, there's a bit of overhead each time in establishing the connection, during which speech from the user might be lost (though in <a target="_blank" href="https://www.jambonz.org/">jambonz</a>, we queue incoming voice frames during connection to avoid this).  </p>
<p>More importantly, though, any chance for using longer term context to improve results is lost.  Consider again the ChatGPT experience: the longer the conversation that you have with it becomes, the better results you receive.  As the conversation proceeds, ChatGPT has more context to form its answers.</p>
<p>What speech providers should do is to provide an API that lets a <a target="_blank" href="https://www.jambonz.org/">voice gateway platform like jambonz</a> connect once, at the start of the call. Then at any time during the connection, they should allow the voice gateway to call an API to "pause" recognition. During the paused interval, they shouldn't charge me and should simply discard any voice packets that are sent over the connection. When I'm ready to gather a response from the user again, I should be able to call an API to "resume" recognition over this same connection. Billing again can start at this point.</p>
<p>Most importantly, speech providers should use the enhanced context gained from the ongoing conversation to give me more accurate and faster results the deeper into the conversation we go.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I hope I'll look back at this blog post in a year and laugh at how outdated it has become. I hope that I'll see an array of speech companies that really "get" conversational AI and have invested time and attention to properly support our use case. Unfortunately, that's not the situation today.</p>
<p>Conversational AI has some unique requirements for speech recognition, and today's speech providers are not meeting them.  The result is that conversational AI experiences are by and large not matching the industry hype.  Speech providers need to stop looking at transcription as a one-size-fits all solution and build the services that we in the conversational AI space need to create the experiences that will truly delight customers.</p>
<p>If you like what you've read, check out the <a target="_blank" href="https://blog.jambonz.org/">jambonz blog</a> or <a target="_blank" href="https://jambonz.us6.list-manage.com/subscribe?u=2ab4d55936b7267b749491c84&amp;id=d75214535e">subscribe to our newsletter</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Deploying jambonz on Google Cloud]]></title><description><![CDATA[Overview
This article describes how to build a jambonz image on GCP using packer and then deploy a VM using terraform. To follow along the example here, you will need the following:

a google cloud account

packer and terraform and git installed on y...]]></description><link>https://blog.jambonz.org/deploying-jambonz-on-google-cloud</link><guid isPermaLink="true">https://blog.jambonz.org/deploying-jambonz-on-google-cloud</guid><category><![CDATA[Communications Platform as a Service (CPaaS) Market ]]></category><category><![CDATA[voip]]></category><category><![CDATA[Voice]]></category><category><![CDATA[voice assistants]]></category><category><![CDATA[conversational-ai]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Tue, 18 Apr 2023 21:38:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683388413072/ef32bec1-ee41-438f-a83b-0016047d827c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-overview">Overview</h1>
<p>This article describes how to build a jambonz image on GCP using <a target="_blank" href="https://www.packer.io/">packer</a> and then deploy a VM using <a target="_blank" href="https://www.terraform.io/">terraform</a>. To follow along the example here, you will need the following:</p>
<ul>
<li><p>a <a target="_blank" href="https://console.cloud.google.com">google cloud account</a></p>
</li>
<li><p><a target="_blank" href="https://www.packer.io/">packer</a> and <a target="_blank" href="https://www.terraform.io/">terraform</a> and git installed on your laptop</p>
</li>
</ul>
<p>Let's get started!</p>
<h1 id="heading-create-a-gcp-project-and-service-account">Create a GCP project and service account</h1>
<p>In order to run packer and terraform locally on your laptop and create resources on Google Cloud Platform, we'll need to download some credentials. So log in to the <a target="_blank" href="https://console.cloud.google.com">GCP console</a>, create a project (or select an existing one) and then from the main menu select IAM &amp; Admin / Service Accounts. Click "Create Service Account", fill in the details and click "Create and Continue".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681056959154/cc9ffa22-3248-46dc-975e-e02ad3e30059.png" alt class="image--center mx-auto" /></p>
<p>Add the following roles to the service account and then click "Continue":</p>
<ul>
<li><p>Compute Admin</p>
</li>
<li><p>Compute Instance Admin (v1)</p>
</li>
<li><p>Compute Network Admin</p>
</li>
<li><p>Service Account Admin</p>
</li>
<li><p>Service Account User</p>
</li>
</ul>
<p>Now find the service account you just created in the list, select it and then click Add Key / Create Key and select JSON. Download the json key file to your laptop and save it to folder.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681057268984/b4c5ae25-ff4b-4776-9d43-e80199fd8424.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-check-out-jambonz-infrastructure-from-github">Check out jambonz-infrastructure from github</h1>
<p>In a terminal window on your laptop check out <a target="_blank" href="https://github.com/jambonz/jambonz-infrastructure">jambonz-infrastructure</a> and navigate to the packer folder for building the jambonz-mini.</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/jambonz/jambonz-infrastructure.git
<span class="hljs-built_in">cd</span> jambonz-infrastructure/packer/jambonz-mini/
</code></pre>
<p>Next, set the environment variable <code>GOOGLE_APPLICATION_CREDENTIALS</code> to point to the JSON service key file that you downloaded</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> GOOGLE_APPLICATION_CREDENTIALS=~/Downloads/drachtio-cpaas-4712713e4b95.json
</code></pre>
<h1 id="heading-building-a-image">Building a image</h1>
<p>Now you are ready to build a GCP image. First, review the settings in the file <code>gcp-template.json</code> and change:</p>
<ul>
<li><p><code>project_id</code> to your gcp project id,</p>
</li>
<li><p><code>image_zone</code> to the zone that you want to build the image in.<br />  You can also change the <code>disk_size</code> if you want to have a larger disk than the default 80G (it is recommended to have at least 80G to handle time series data like call detail records and logs).</p>
</li>
</ul>
<p>Once you have made any changes necessary, start the packer build. This will take 30-45 minutes to complete as it builds all of the supporting software projects as well as jambonz.</p>
<pre><code class="lang-bash"> packer build -color=<span class="hljs-literal">false</span> gcp-template.json
</code></pre>
<p>Once it completes, it will generate an image:</p>
<pre><code class="lang-bash">    googlecompute: Processing triggers <span class="hljs-keyword">for</span> man-db (2.9.4-2) ...
==&gt; googlecompute: Deleting instance...
    googlecompute: Instance has been deleted!
==&gt; googlecompute: Creating image...
==&gt; googlecompute: Deleting disk...
    googlecompute: Disk has been deleted!
Build <span class="hljs-string">'googlecompute'</span> finished after 40 minutes 57 seconds.
</code></pre>
<p>You will be able to see the image in the GCP console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681067910394/ac29ac92-cb2d-49b6-b6cb-83c25bf35d8a.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-deploying-a-vm-instance">Deploying a VM instance</h1>
<p>Now change into the terraform folder</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> ../../terraform/gcp/jambonz-mini/
</code></pre>
<p>You can now prepare to run terraform by running the following commands:</p>
<pre><code class="lang-bash">terraform init
terraform plan
</code></pre>
<p>This will prompt you to provide some variables, most importantly the image id that you just created. If you prefer, you can also provide these settings by creating a file named <code>deployment.tfvars</code> in the same folder, for example in my case:</p>
<pre><code class="lang-bash">image = <span class="hljs-string">"packer-1681065267"</span>
region = <span class="hljs-string">"us-central1"</span>
zone = <span class="hljs-string">"us-central1-a"</span>
project = <span class="hljs-string">"drachtio-cpaas"</span>
dns_name = <span class="hljs-string">"jambonz.me"</span>
instance_type = <span class="hljs-string">"e2-medium"</span>
</code></pre>
<p>The <code>dns_name</code> should be a DNS name in a domain controlled by you; it will be the URL at which you access the jambonz portal. The <code>instance_type</code> is the GCP instance type that the jambonz VM instance will run on. We recommend an instance type with a minimum of 2 vCPUs and 8GB of RAM (more of each if this server is going to handle production traffic).</p>
<p>and then run the command as:</p>
<pre><code class="lang-bash">terraform plan -var-file=<span class="hljs-string">"deployment.tfvars"</span>
</code></pre>
<p>If all looks good, then run it:</p>
<pre><code class="lang-bash">terraform apply
</code></pre>
<p>or, if you've created a variables file:</p>
<pre><code class="lang-bash">terraform apply -var-file=<span class="hljs-string">"deployment.tfvars"</span>
</code></pre>
<p>This should run successfully with output like the following:</p>
<pre><code class="lang-bash">random_string.uuid: Creating...
random_string.uuid: Creation complete after 0s [id=ntzsiu]
google_compute_address.jambonz_static_ip: Creating...
google_compute_firewall.jambonz_mini_firewall_rule: Creating...
google_compute_address.jambonz_static_ip: Still creating... [10s elapsed]
google_compute_firewall.jambonz_mini_firewall_rule: Still creating... [10s elapsed]
google_compute_address.jambonz_static_ip: Creation complete after 11s [id=projects/drachtio-cpaas/regions/us-central1/addresses/jambonz-static-ip-ntzsiu]
google_compute_instance.jambonz_mini: Creating...
google_compute_firewall.jambonz_mini_firewall_rule: Creation complete after 11s [id=projects/drachtio-cpaas/global/firewalls/jambonz-firewall-rule-ntzsiu]
google_compute_instance.jambonz_mini: Still creating... [10s elapsed]
google_compute_instance.jambonz_mini: Still creating... [20s elapsed]
google_compute_instance.jambonz_mini: Still creating... [30s elapsed]
google_compute_instance.jambonz_mini: Still creating... [40s elapsed]
google_compute_instance.jambonz_mini: Creation complete after 42s [id=projects/drachtio-cpaas/zones/us-central1<span class="hljs-_">-a</span>/instances/jambonz-mini-ntzsiu]

Apply complete! Resources: 4 added, 0 changed, 0 destroyed.
</code></pre>
<p>Now you should be able to log into the GCP console and see the running instance. Make note of the external (static) IP because in the final step you will add DNS A records pointing to this IP.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681088586507/39a8f812-bff3-4c06-b28d-1092bea858b2.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-create-dns-records-for-the-server">Create DNS records for the server</h1>
<p>Now create DNS records in your DNS provider for the name that you provided when creating the instance. For example, in my case I chose to use the name jambonz.me to refer to the VM instance, and I will create the following two DNS A records, each pointing to the static IP of the server:</p>
<ul>
<li><p>jambonz.me</p>
</li>
<li><p>grafana.jambonz.me</p>
</li>
</ul>
<p>In my case, I am using <a target="_blank" href="http://dnsmadeeasy.com">dnsmadeeasy</a> as my DNS provider so my DNS records look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681088913077/ef70e89a-7659-4205-bf78-57e4daa7a152.png" alt class="image--center mx-auto" /></p>
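<p>Once the records have propagated, a quick sanity check with dig (substituting your own names) should return the static IP for both records:</p>
<pre><code class="lang-bash"># each should print the static IP of the VM
dig +short jambonz.me A
dig +short grafana.jambonz.me A
</code></pre>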
<h1 id="heading-log-into-the-portal-and-configure-the-system">Log into the portal and configure the system</h1>
<p>At this point you can log into the jambonz portal at the DNS name that you chose with the username/password "admin/admin". You will be forced to change the password on your first login.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681089040242/f1fcdfbb-a1df-4374-98d5-67a6dc85a49e.png" alt class="image--center mx-auto" /></p>
<p>You'll want to configure your Carriers (sip trunks) and speech credentials, but at this point you have a fully functional jambonz system running on GCP. Enjoy!</p>
]]></content:encoded></item><item><title><![CDATA[Tutorial: adding support for a custom speech provider]]></title><description><![CDATA[jambonz supports many speech providers out of the box, but what if you want to use a speech provider that is not currently supported? That is where the jambonz custom speech API comes in.

The custom speech api requires jambonz 0.8.2 or above

I...]]></description><link>https://blog.jambonz.org/tutorial-adding-support-for-a-custom-speech-provider</link><guid isPermaLink="true">https://blog.jambonz.org/tutorial-adding-support-for-a-custom-speech-provider</guid><category><![CDATA[Voicebot]]></category><category><![CDATA[conversational-ai]]></category><category><![CDATA[ASR]]></category><category><![CDATA[cpaas]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Fri, 31 Mar 2023 01:03:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683388862843/98c5ea52-ee5e-40e0-9ad3-462ab97d3a62.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>jambonz supports many speech providers out of the box, but what if you want to use a speech provider for that is not currently supported? There is where the jambonz <a target="_blank" href="https://www.jambonz.org/docs/speech-api/overview/">custom speech API</a> comes in.</p>
<blockquote>
<p>The custom speech api requires <a target="_blank" href="https://www.jambonz.org/docs/release-notes/v0.8.2">jambonz 0.8.2</a> or above</p>
</blockquote>
<p>In this article, we will walk through an example of adding support for <a target="_blank" href="https://alphacephei.com/vosk/server">Vosk</a>, an open source speech recognition engine that you can run on your own infrastructure. Vosk is not natively supported by jambonz, but we shall see that we can easily add support for it using the custom speech API.</p>
<p>As described <a target="_blank" href="https://www.jambonz.org/docs/supporting-articles/custom-speech-stt/">in the api docs</a>, to add support for a speech recognition provider you need to build a websocket server that provides the integration with the speech provider. Your websocket server receives JSON messages and an audio stream from <a target="_blank" href="https://jambonz.org">jambonz</a> and returns transcripts.</p>
<p>The example code we'll be using to integrate Vosk STT can be found on github in the <a target="_blank" href="https://github.com/jambonz/custom-speech-example">custom-speech-example repo</a>. To run this example you need the following prerequisites:</p>
<ul>
<li><p>a jambonz system running 0.8.2 or above</p>
</li>
<li><p>a basic jambonz app that exercises speech recognition</p>
</li>
<li><p>a server on which you can run docker and the custom-speech-example Node.js app (these could also be two different servers).</p>
</li>
</ul>
<h2 id="heading-provisioning-a-custom-speech-provider">Provisioning a custom speech provider</h2>
<p>First, let's log into the portal and create a new speech service for Vosk.</p>
<p>To do so, select Speech and then click the + icon to add a new provider. Select "Custom" from the dropdown and give it a name -- we'll call it Vosk.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1680098524502/77d7f3e9-a4e0-430f-a361-6bb0a5a4b053.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>Check "Use for speech-to-text" and enter the ws(s) URL that is the endpoint of your websocket server. In my case as you can see, I'll be running that server on the same jambonz instance, and it will be listening on port 3088.</p>
</li>
<li><p>Also add an authentication token value that your websocket server can use to authenticate the connections from jambonz.</p>
</li>
</ul>
<p>Now you can click on Applications, select the jambonz application you are going to use for testing and specify to use Vosk as the STT provider.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1680099038469/927b5359-6e5d-443d-954d-1b7f787328f2.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Hint: A good app to test with is a simple echo voicebot that transcribes your voice and repeats it back to you using text-to-speech. You can quickly generate this app using the command "npx create-jambonz-ws-app -s echo my-echo-bot".</p>
</blockquote>
<h2 id="heading-running-vosk-server">Running Vosk server</h2>
<p>Now we need to run the Vosk server.  We are going to be using the gRPC API that Vosk supports to send audio and commands to the server.  The simplest way to run it is in Docker.  In my case I have the Vosk server running on a separate machine, as it is fairly demanding of system resources.</p>
<pre><code class="lang-bash">docker run --network host -ti alphacep/kaldi-grpc-en:latest
</code></pre>
<p>The Vosk server listens for incoming gRPC connections on port 5001 by default, so in our next step we will configure the custom-speech-example app accordingly.</p>
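<p>Before moving on, it's worth confirming that port 5001 is reachable from the host that will run the websocket server. Here's a quick check with netcat, using the same Vosk server address that appears in the configuration later in this article:</p>
<pre><code class="lang-bash"># -z: just scan for a listening daemon, -v: verbose output
nc -zv 54.167.7.129 5001
</code></pre>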
<h2 id="heading-running-the-websocket-server-serving-the-speech-api-endpoint">Running the websocket server serving the speech api endpoint</h2>
<p>As mentioned above, I have cloned the repo to the same server jambonz is running on, and I will configure it to listen on port 3088 for connections from jambonz.  </p>
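<p>If you're following along, getting the example code onto your server is just a matter of cloning the repo and installing its dependencies (this assumes git and a recent Node.js are already installed):</p>
<pre><code class="lang-bash">git clone https://github.com/jambonz/custom-speech-example.git
cd custom-speech-example
npm install
</code></pre>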
<p>It will also need to connect to Vosk on the remote server and port 5001, and it will authenticate connections from jambonz using the auth token I specified earlier when configuring the custom Vosk speech provider in the jambonz portal.
To handle all of this configuration I've decided to run the Node.js app using <a target="_blank" href="https://pm2.io/">pm2</a> and included the following configuration file (ecosystem.config.js):</p>
<pre><code class="lang-js"><span class="hljs-comment">/* eslint-disable max-len */</span>
<span class="hljs-built_in">module</span>.exports = {
  <span class="hljs-attr">apps</span> : [
    {
      <span class="hljs-attr">name</span>: <span class="hljs-string">'jambonz-custom-speech-vendors'</span>,
      <span class="hljs-attr">script</span>: <span class="hljs-string">'app.js'</span>,
      <span class="hljs-attr">instance_var</span>: <span class="hljs-string">'INSTANCE_ID'</span>,
      <span class="hljs-attr">exec_mode</span>: <span class="hljs-string">'fork'</span>,
      <span class="hljs-attr">instances</span>: <span class="hljs-number">1</span>,
      <span class="hljs-attr">autorestart</span>: <span class="hljs-literal">true</span>,
      <span class="hljs-attr">watch</span>: <span class="hljs-literal">false</span>,
      <span class="hljs-attr">max_memory_restart</span>: <span class="hljs-string">'1G'</span>,
      <span class="hljs-attr">env</span>: {
        <span class="hljs-attr">LOGLEVEL</span>: <span class="hljs-string">'debug'</span>,
        <span class="hljs-attr">HTTP_PORT</span>: <span class="hljs-number">3088</span>,
        <span class="hljs-attr">API_KEY</span>: <span class="hljs-string">'foobar'</span>,
        <span class="hljs-attr">VOSK_URL</span>: <span class="hljs-string">'54.167.7.129:5001'</span>
      }
    }
  ]
};
</code></pre>
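<p>With that configuration file in place, starting the websocket server under pm2 looks like this:</p>
<pre><code class="lang-bash">pm2 start ecosystem.config.js
pm2 log jambonz-custom-speech-vendors
</code></pre>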
<h2 id="heading-test-it-out">Test it out!</h2>
<p>Now the fun part -- let's route a phone number to our jambonz app and test it out!</p>
<p>When I make the call, Vosk transcribes my speech and it is played back to me.  Looking at the logs from custom-speech-example, I can see the transcripts being received from Vosk and sent back to jambonz:</p>
<pre><code class="lang-bash">$ pm2 <span class="hljs-built_in">log</span> jambonz-custom-speech-vendors
{<span class="hljs-string">"msg"</span>:<span class="hljs-string">"Example jambonz speech server listening at http://localhost:3088"</span>} {<span class="hljs-string">"url"</span>:<span class="hljs-string">"/transcribe/vosk"</span>,<span class="hljs-string">"headers"</span>:{<span class="hljs-string">"pragma"</span>:<span class="hljs-string">"no-cache"</span>,<span class="hljs-string">"cache-control"</span>:<span class="hljs-string">"no-cache"</span>,<span class="hljs-string">"host"</span>:<span class="hljs-string">"localhost"</span>,<span class="hljs-string">"origin"</span>:<span class="hljs-string">"http://localhost"</span>,<span class="hljs-string">"upgrade"</span>:<span class="hljs-string">"websocket"</span>,<span class="hljs-string">"connection"</span>:<span class="hljs-string">"Upgrade"</span>,<span class="hljs-string">"sec-websocket-key"</span>:<span class="hljs-string">"1CvOEywO14v0DoJGMUjo9Q=="</span>,<span class="hljs-string">"sec-websocket-version"</span>:<span class="hljs-string">"13"</span>,<span class="hljs-string">"authorization"</span>:<span class="hljs-string">"Bearer foobar"</span>},<span class="hljs-string">"msg"</span>:<span class="hljs-string">"received upgrade request"</span>}
{<span class="hljs-string">"msg"</span>:<span class="hljs-string">"upgraded to websocket, url: /transcribe/vosk"</span>} {<span class="hljs-string">"obj"</span>:{<span class="hljs-string">"type"</span>:<span class="hljs-string">"start"</span>,<span class="hljs-string">"language"</span>:<span class="hljs-string">"en-US"</span>,<span class="hljs-string">"format"</span>:<span class="hljs-string">"raw"</span>,<span class="hljs-string">"encoding"</span>:<span class="hljs-string">"LINEAR16"</span>,<span class="hljs-string">"interimResults"</span>:<span class="hljs-literal">true</span>,<span class="hljs-string">"sampleRateHz"</span>:8000,<span class="hljs-string">"options"</span>:{}},<span class="hljs-string">"msg"</span>:<span class="hljs-string">"received JSON message from jambonz"</span>}
{<span class="hljs-string">"data"</span>:{<span class="hljs-string">"chunksList"</span>:[{<span class="hljs-string">"alternativesList"</span>:[{<span class="hljs-string">"text"</span>:<span class="hljs-string">"this is a test using a custom speech provider"</span>,<span class="hljs-string">"confidence"</span>:1,<span class="hljs-string">"wordsList"</span>:[]}],<span class="hljs-string">"pb_final"</span>:<span class="hljs-literal">true</span>,<span class="hljs-string">"endOfUtterance"</span>:<span class="hljs-literal">false</span>}]},<span class="hljs-string">"msg"</span>:<span class="hljs-string">"received data from vosk"</span>}
{<span class="hljs-string">"data"</span>:{<span class="hljs-string">"chunksList"</span>:[{<span class="hljs-string">"alternativesList"</span>:[{<span class="hljs-string">"text"</span>:<span class="hljs-string">"this is a test using a custom speech provider"</span>,<span class="hljs-string">"confidence"</span>:1,<span class="hljs-string">"wordsList"</span>:[]}],<span class="hljs-string">"pb_final"</span>:<span class="hljs-literal">true</span>,<span class="hljs-string">"endOfUtterance"</span>:<span class="hljs-literal">false</span>}]},<span class="hljs-string">"msg"</span>:<span class="hljs-string">"sending transcription to jambonz"</span>}
{<span class="hljs-string">"obj"</span>:{<span class="hljs-string">"type"</span>:<span class="hljs-string">"stop"</span>},<span class="hljs-string">"msg"</span>:<span class="hljs-string">"received JSON message from jambonz"</span>}
</code></pre>
<p>The experience for the caller is no different than using one of the native jambonz speech providers.</p>
<h2 id="heading-summary">Summary</h2>
<p>While jambonz comes packed with native support for a large number of speech providers (Google, Microsoft, AWS, Nuance, Nvidia, IBM Watson, and Wellsaid at the time of writing), it's super easy to add your own speech providers using the <a target="_blank" href="https://www.jambonz.org/docs/speech-api/overview/">custom speech API</a>.  In this tutorial, we walked through adding support for the open source Vosk server.  In the example project that we shared, you will find other integrations as well, including <a target="_blank" href="https://www.assemblyai.com/">AssemblyAI</a> speech recognition, plus an example of implementing custom text-to-speech in addition to speech-to-text.</p>
<p>For more information about jambonz visit us at jambonz.org, join our <a target="_blank" href="https://joinslack.jambonz.org">community slack channel</a>, or email us at support@jambonz.org.</p>
]]></content:encoded></item><item><title><![CDATA[Installing jambonz using AWS Marketplace]]></title><description><![CDATA[This article will walk you through the process of deploying jambonz using the AWS Cloudformation offering. Additionally, it shows the post-install steps necessary to modify the jambonz portal to use https instead of plain http.

The AWS Marketplace o...]]></description><link>https://blog.jambonz.org/installing-jambonz-using-aws-marketplace</link><guid isPermaLink="true">https://blog.jambonz.org/installing-jambonz-using-aws-marketplace</guid><category><![CDATA[#cpass #voicegateway #opensource #jambonz]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Sun, 08 Jan 2023 21:26:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683389187345/8ef206af-427f-4b12-94a9-e4bd5266ca85.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This article will walk you through the process of deploying <a target="_blank" href="https://jambonz.org">jambonz</a> using the <a target="_blank" href="https://aws.amazon.com/marketplace/pp/prodview-55wp45fowbovo">AWS Cloudformation offering</a>. Additionally, it shows the post-install steps necessary to modify the jambonz portal to use https instead of plain http.</p>
<blockquote>
<p>The AWS Marketplace offering for a jambonz server is a flat fee of $18/month. This includes the ability to deploy an unlimited number of instances. Cost of the AWS instances themselves and associated infrastructure is separate.</p>
</blockquote>
<h1 id="heading-installation">Installation</h1>
<h2 id="heading-first-steps">First steps</h2>
<h3 id="heading-before-starting">Before starting</h3>
<p>Choose the AWS region you want to deploy into. If you have not previously generated an AWS keypair in that region, go and do so now. You will need it in the steps below.</p>
<h3 id="heading-main-page">Main page</h3>
<p>In your web browser, navigate to https://aws.amazon.com/marketplace/pp/prodview-55wp45fowbovo. This provides the basic details of the offering, including version numbers and pricing. Click on the button labeled "Continue to Subscribe".</p>
<h3 id="heading-terms-and-conditions-page">Terms and Conditions page</h3>
<p>Review the terms and conditions and click "Continue to Configuration".</p>
<h3 id="heading-configure-this-software-page">Configure this software page</h3>
<p>Leave the Fulfillment option and Software version at their default settings. Select your desired AWS region from the dropdown. Click "Continue to Launch".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673206543476/e49dacb3-ccc5-48ad-9af1-bf7cc0834142.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-launch-this-software-page">Launch this software page</h3>
<p>In the "Choose Action" dropdown select "Launch Cloudformation" and click "Launch</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673206673499/63df586e-10b8-4cf1-a3cf-2fce880f07c6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-create-stack">Create Stack</h3>
<p>Now we've moved to the AWS Cloudformation stacks page, where we are creating a new stack using the template provided. On the next several pages we will configure the AWS environment that jambonz will be running in. All of the AWS infrastructure that will be created will be in a new VPC created by this stack, so it will not interfere with anything else you already have running in this region.</p>
<p>On this page, leave everything at the default selections, and simply click "Next".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673206934296/aa1fa473-6a93-496c-a935-46dba9df42e9.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-specify-stack-details">Specify stack details</h3>
<p>Enter a name for the stack and then fill out the Parameters section:</p>
<ul>
<li><p><code>AllowedHttpCidr</code>: This is a network mask that can restrict what networks can access the jambonz portal. Typically, you may want to leave this open to the internet if you don't know where your admin users will be logging in from; if so, enter "0.0.0.0/0".</p>
</li>
<li><p><code>AllowedRtpCidr</code>: Ditto to the above, for the source of caller media streams (RTP). Again, in most cases you will want this to be "0.0.0.0/0".</p>
</li>
<li><p><code>AllowedSipCidr</code>: Ditto for restricting the sources of SIP signaling. As above, for most deployments this will be "0.0.0.0/0" but if you wanted to restrict it for instance to one of your internal SBCs you could enter the network mask for that.</p>
</li>
<li><p><code>AllowedSshCidr</code>: And finally, the same for where you want to allow ssh access into the server and, again, "0.0.0.0/0" means access from anywhere.</p>
</li>
<li><p><code>InstanceType</code>: Here you choose the instance type you want to run on. For production VoIP systems, AWS recommends the c5n instance class, and a good choice for jambonz would be the c5n.xlarge, if that instance type is available in your region. This would handle about 15-20 arriving calls per second and 300-400 concurrent calls in progress. If you need something smaller, e.g. for testing, feel free to deploy a t2 or t3.medium. If you need something for large production loads, speak to us about building a jambonz horizontally-scalable cluster -- we have a cloudformation script for that as well, but it is not currently available on the AWS marketplace so you need to contact us (support@jambonz.org) and we will set you up.</p>
</li>
<li><p><code>KeyName</code>: Select one of the AWS keypairs that you have previously generated in this region.</p>
</li>
<li><p><code>URLPortal</code>: If you intend for the jambonz portal to be accessible from a browser using a DNS name (versus its IP address), then enter that name here. In my example, I will be adding 'jambonz.net' as the DNS. <em>(Note: you don't need to provision the DNS record for that name just yet, you will do so after the stack completes and you know the elastic IP of the EC2 instance created.)</em></p>
</li>
<li><p><code>VpcCidr</code>: The stack is going to create an AWS VPC that will contain everything that gets created -- things like subnets, internet gateways, the EC2 instance itself, etc. Here put a CIDR for the VPC itself. As an example, I typically use "10.0.0.0/16".</p>
<p>  Finally, click "Next".</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673207713417/3e617789-8eb9-495d-b2db-3bb3de53d910.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-configure-stack-options">Configure stack options</h3>
<p>Under the "Tags" section, its a good idea to add a "Name" tag with a meaningful name to identify and group all of the stuff we are going to create. This is optional, however.</p>
<p>Leave the rest of the fields at their default and click "Next" at the bottom righthand side of the page.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673208181516/9653d8a0-5280-45ee-b4d7-b460dbe1c377.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-review-stack">Review stack</h3>
<p>Leave everything as presented and click "Submit" on the bottom of the page.</p>
<blockquote>
<p>Note: if prompted to confirm the creation of an IAM role, select the checkbox to ok that. As of release 0.7.8 and later, the stack automatically configures jambonz to send logs to Cloudwatch, which is very useful for troubleshooting. In order to do so, the stack must create an IAM role for this.</p>
</blockquote>
<h3 id="heading-done">Done!</h3>
<p>Now we wait a bit while AWS deploys our EC2 instance and the associated stuff in our new VPC. This should take roughly 2-3 minutes.</p>
<p>When complete, click on the Outputs panel. This will tell you the elastic IP of the server, along with the initial password to use to log into the portal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673208986783/de873a53-1ec1-44dc-8bb1-1a666cc01d1e.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-dns-configuration">DNS configuration</h3>
<p>If you entered a DNS name in the stack parameters above, then now is the time to create DNS records for the jambonz portals. (Skip this step if you left that parameter blank).</p>
<p>Copy the ServerIP from the Outputs panel and create DNS A records for all of the following, each pointing to that IP:</p>
<ul>
<li><p>&lt;dns&gt;</p>
</li>
<li><p>api.&lt;dns&gt;</p>
</li>
<li><p>grafana.&lt;dns&gt;</p>
</li>
<li><p>homer.&lt;dns&gt;</p>
</li>
<li><p>jaeger.&lt;dns&gt;</p>
</li>
</ul>
<p>For instance, I use <a target="_blank" href="http://dnsmadeeasy.com">dnsmadeeasy</a> as my DNS provider, so I go into their portal to set up these DNS records.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673209456219/2d085b3a-b8cc-41d4-b401-1bcf0abb72da.png" alt class="image--center mx-auto" /></p>
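<p>Once the records are in place, a quick way to confirm that all five names resolve to the ServerIP is a small shell loop (substituting your own domain):</p>
<pre><code class="lang-bash"># resolve each portal hostname; each should print the ServerIP
for sub in "" api. grafana. homer. jaeger.; do
  dig +short ${sub}jambonz.net A
done
</code></pre>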
<h3 id="heading-accessing-the-portal">Accessing the portal</h3>
<p>Now you should be able to log into the portal. In your browser URL window you will either enter the DNS name that you specified (http://jambonz.net, in my case) or just the Server IP if you did not specify a DNS name.</p>
<blockquote>
<p>Note: once the cloudformation stack is complete, the EC2 instance still requires a bit more time to initialize and come up the first time. If you try to log into the portal and get a 502 Bad Gateway just give it a minute or two and try again.</p>
</blockquote>
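<p>If you'd rather check from the command line, you can poll the portal with curl until the 502 turns into a 200 (use your DNS name or the ServerIP):</p>
<pre><code class="lang-bash"># expect 502 Bad Gateway while the instance is still initializing, then 200
curl -sI http://jambonz.net/ | head -1
</code></pre>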
<p>On the jambonz login screen use the username 'admin' and the first-time password that you see on the Output panel of the cloudformation window. You will be forced to set a new password. Once you set the new password you will be logged into the home page of the portal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673209815200/6f343e0e-988a-4b82-8d3d-0efd32738003.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673209840729/3a2e89f8-1f22-478e-9c87-e59cf85be2d0.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673209916841/39d30df0-648d-4b86-b5c2-36955307f801.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-additional-portals">Additional portals</h3>
<p>If you used a DNS name, the monitoring portal will be grafana.&lt;dns&gt; and you will log in with admin/admin. You will be forced to change the password on the initial login.</p>
<p>Navigate to Dashboards/Browse/jambonz/Jambonz Metrics to see the jambonz monitoring page. Right now, with no traffic, there won't be much activity, but this will be a useful page for monitoring your live system. It will tell you things like how many calls are on the system and the latency/response time for things like text-to-speech generation and application webhook responses.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673210851091/7e09f8b3-7876-42ad-8977-e3a8daf7bfa2.png" alt class="image--center mx-auto" /></p>
<p>SIP traces are sent to homer, which can be accessed at homer.&lt;dns&gt;. The initial login is admin/sipcapture.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673210822985/961a7c8c-5e44-4772-a702-2755e4fe127a.png" alt class="image--center mx-auto" /></p>
<p>Application traces are sent to jaeger, which can be accessed at jaeger.&lt;dns&gt;. There is currently no authentication associated with this page. Application traces can be very useful to see exactly what happened on a specific call. In the trace below, for example, we can see the following:</p>
<ul>
<li><p>retrieving account details from mysql took 15 milliseconds</p>
</li>
<li><p>the http webhook to retrieve the jambonz app took 308 milliseconds</p>
</li>
<li><p>synthesizing speech from text took 1.4 seconds (note: tts audio is cached by jambonz, so further requests for this text/voice/language/provider will be near-instantaneous)</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673210651171/6bd7c967-0244-444a-939e-f91d20008ba9.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-modifying-portals-to-use-https">Modifying portals to use HTTPS</h2>
<p>We've been using plain HTTP to access the portals so far. That's perfectly fine, but we can also change to use HTTPS if we like. To do so, we will need to install a TLS certificate on the server, and modify the NGINX and jambonz configs slightly. Let's walk through that.</p>
<h3 id="heading-check-nginx-config">Check nginx config</h3>
<p>In jambonz releases prior to 0.8, it may be necessary to fix something in the nginx config file before performing these steps. ssh into the server and as root user open <code>/etc/nginx/sites-available/default</code> in an editor. If the top lines look like this:</p>
<pre><code class="lang-bash">  server {
      server_name _;
      location /api/ {
</code></pre>
<p>change them to replace the underscore with your dns name (jambonz.net in my case)</p>
<pre><code class="lang-bash">server { 
    server_name jambonz.net; 
    location /api/ {
</code></pre>
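<p>After editing, validate the configuration and reload nginx so the change takes effect:</p>
<pre><code class="lang-bash"># test the configuration, then reload
nginx -t
systemctl reload nginx
</code></pre>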
<h3 id="heading-installing-a-tls-certificate">Installing a TLS certificate</h3>
<p>You can use any certificate provider you want, but in this case I am going to use <a target="_blank" href="https://letsencrypt.org/">letsencrypt</a> because they are free and easy to use.</p>
<p>Following the instructions on their page, I go <a target="_blank" href="https://certbot.eff.org/instructions?ws=nginx&amp;os=debianbuster">here</a> for the nginx-on-Debian instructions, which are pretty straightforward:</p>
<pre><code class="lang-bash">sudo -u root -s 
apt-get update 
apt install snapd 
snap install core 
snap install --classic certbot 
ln -s /snap/bin/certbot /usr/bin/certbot 
certbot --nginx
</code></pre>
<p>This will install a certificate and rewrite your nginx configuration file (the same one we looked at above) to support https. For instance, mine now looks like <a target="_blank" href="https://gist.github.com/davehorton/bd35eafe1e0ad6467e417630f020a533">this</a>.</p>
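<p>letsencrypt certificates are only valid for 90 days; the snap-installed certbot normally sets up automatic renewal for you, and you can verify that renewal will work with a dry run:</p>
<pre><code class="lang-bash">certbot renew --dry-run
</code></pre>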
<h3 id="heading-modifying-jambonz-config">Modifying jambonz config</h3>
<p>We're not done quite yet. Now, ssh into the jambonz server as the admin user and go to the <code>~/apps/jambonz-webapp</code> folder.</p>
<p>Modify the .env file in this folder. <strong>If you are running a 0.7.x version</strong> change this:</p>
<pre><code class="lang-bash">REACT_APP_API_BASE_URL=http://jambonz.net/api/v1
</code></pre>
<p>to this</p>
<pre><code class="lang-bash">REACT_APP_API_BASE_URL=https://jambonz.net/api/v1
</code></pre>
<p>(Of course, your dns name will be different, but the point is simply to change this to an https url).</p>
<p><strong>If you are running a 0.8 version or above</strong> change this:</p>
<pre><code class="lang-bash">VITE_API_BASE_URL=http://jambonz.net/api/v1
</code></pre>
<p>to this</p>
<pre><code class="lang-bash">VITE_API_BASE_URL=https://jambonz.net/api/v1
</code></pre>
<p>Then rebuild and restart:</p>
<pre><code class="lang-bash">npm run build
pm2 restart jambonz-webapp
</code></pre>
<p>All set! If you log out of the portal, refresh your browser and log back in you should now see your connection is over a secure https connection.</p>
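<p>If you want to confirm the certificate from the command line, this openssl one-liner (substituting your own DNS name) prints the subject and validity dates of the cert nginx is now serving:</p>
<pre><code class="lang-bash">echo | openssl s_client -connect jambonz.net:443 -servername jambonz.net 2&gt;/dev/null | openssl x509 -noout -subject -dates
</code></pre>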
<h1 id="heading-conclusion">Conclusion</h1>
<p>I hope this article has been useful in showing you how to quickly install and configure a jambonz system on AWS using the Marketplace offering. Please feel free to contact us by email at support@jambonz.org or by joining our community slack channel by going to https://joinslack.jambonz.org.</p>
]]></content:encoded></item><item><title><![CDATA[Supporting webrtc clients with jambonz]]></title><description><![CDATA[If you've done a standard install of jambonz using either the cloudformation template or the Kubernetes helm chart your jambonz system will be configured to receive SIP traffic from VoIP carriers and sip phones using UDP transport.
But what if you wa...]]></description><link>https://blog.jambonz.org/supporting-webrtc-clients-with-jambonz</link><guid isPermaLink="true">https://blog.jambonz.org/supporting-webrtc-clients-with-jambonz</guid><category><![CDATA[SIP]]></category><category><![CDATA[WebRTC]]></category><category><![CDATA[Communications Platform as a Service (CPaaS) Market ]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Thu, 05 Jan 2023 22:23:40 GMT</pubDate><content:encoded><![CDATA[<p>If you've done a standard install of jambonz using either the <a target="_blank" href="https://aws.amazon.com/marketplace/pp/prodview-55wp45fowbovo">cloudformation template</a> or the <a target="_blank" href="https://github.com/jambonz/helm-charts">Kubernetes helm chart</a> your jambonz system will be configured to receive SIP traffic from VoIP carriers and sip phones using UDP transport.</p>
<p>But what if you want to receive traffic from webrtc clients as well? No problem, this tutorial will show you the changes needed to make that happen.</p>
<blockquote>
<p>Note: the example that follows shows how to configure this on a jambonz system running on VMs, such as AWS EC2.</p>
</blockquote>
<h1 id="heading-overview">Overview</h1>
<p>Webrtc clients will be sending SIP over secure WebSockets (wss). Client-side libraries like <a target="_blank" href="https://jssip.net/">jsSip</a> (my favorite) and <a target="_blank" href="https://sipjs.com">SIP.js</a> are most often used to build the client webrtc application that runs in your browser. To support these types of clients, jambonz needs to support sip over wss for the SIP signaling, and <a target="_blank" href="https://www.rfc-editor.org/rfc/rfc3711">SRTP</a> for the encrypted media.</p>
<p>Note that insecure websocket connections (SIP over plain ws) are not allowed by the browser, so we are going to need to install a TLS certificate on the jambonz server. In the example below we'll configure our jambonz server to listen on port 8443/tcp for sip over wss traffic, and we'll create a TLS wildcard certificate for <code>*.sip.jambonz.me</code> since that is a domain that we own. (If you are following along, you should similarly choose a domain of your own that you control the DNS for).</p>
<h1 id="heading-generating-a-tls-certificate-for-sip-traffic">Generating a TLS certificate for SIP traffic</h1>
<p>We're going to use <a target="_blank" href="https://letsencrypt.org/">letsencrypt</a> to generate our certificate because it's free (and easy!).</p>
<p>After installing the certbot program on the jambonz debian server by <a target="_blank" href="https://certbot.eff.org/instructions?ws=other&amp;os=debianbuster">following their instructions</a>, we run the following command:</p>
<pre><code class="lang-bash">certbot certonly --manual --preferred-challenges=dns --email daveh@drachtio.org --server https://acme-v02.api.letsencrypt.org/directory --agree-tos -d *.sip.jambonz.me -d sip.jambonz.me
</code></pre>
<p>This gets us a TLS cert with the CN of <code>*.sip.jambonz.me</code> and Subject Alternative Names of both <code>sip.jambonz.me</code> and <code>*.sip.jambonz.me</code>. I like this because it gives me the option of assigning different jambonz accounts different SIP realm values to register against (e.g. one jambonz user joe can register phones under the realm <code>joe.sip.jambonz.me</code> while jane with a different jambonz account registers her devices under <code>jane.sip.jambonz.me</code>).</p>
<p>I'm using the DNS challenge method to verify I control those domains, so letsencrypt will prompt me to add some TXT records in my DNS provider. Once I've done that the TLS cert is generated to the server:</p>
<pre><code class="lang-bash">Successfully received certificate.
Certificate is saved at: /etc/letsencrypt/live/sip.jambonz.me/fullchain.pem
Key is saved at:         /etc/letsencrypt/live/sip.jambonz.me/privkey.pem
</code></pre>
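<p>You can inspect the issued certificate to confirm the CN and SANs described below (the paths are those reported by certbot above; the -ext flag assumes a reasonably recent openssl):</p>
<pre><code class="lang-bash">openssl x509 -in /etc/letsencrypt/live/sip.jambonz.me/fullchain.pem -noout -subject -ext subjectAltName
</code></pre>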
<h1 id="heading-configuring-drachtio">Configuring drachtio</h1>
<p>Now that we have our TLS certificate, we need to configure drachtio to use it. This is a simple matter of adding the tls info to the /etc/drachtio.conf.xml config file. When done, it will look like this:</p>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">sip</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">contacts</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">contacts</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">tls</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">key-file</span>&gt;</span>/etc/letsencrypt/live/sip.jambonz.me/privkey.pem<span class="hljs-tag">&lt;/<span class="hljs-name">key-file</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">cert-file</span>&gt;</span>/etc/letsencrypt/live/sip.jambonz.me/fullchain.pem<span class="hljs-tag">&lt;/<span class="hljs-name">cert-file</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">tls</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">udp-mtu</span>&gt;</span>4096<span class="hljs-tag">&lt;/<span class="hljs-name">udp-mtu</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">sip</span>&gt;</span>
</code></pre>
<p>Now, we need to tell drachtio to listen on port 8443 for sip traffic over wss. To do that we edit /etc/systemd/system/drachtio.service to add this new sip contact. When finished, that section of the file looks like this:</p>
<pre><code class="lang-bash">ExecStart=/usr/<span class="hljs-built_in">local</span>/bin/drachtio --daemon \
--contact sip:<span class="hljs-variable">${LOCAL_IP}</span>;transport=udp --external-ip <span class="hljs-variable">${PUBLIC_IP}</span> \
--contact sips:<span class="hljs-variable">${LOCAL_IP}</span>:8443;transport=wss --external-ip <span class="hljs-variable">${PUBLIC_IP}</span> \
--contact sip:<span class="hljs-variable">${LOCAL_IP}</span>;transport=tcp \
 --address 0.0.0.0 --port 9022 --homer 127.0.0.1:9060 --homer-id 10
</code></pre>
<p>We then restart drachtio..</p>
<pre><code class="lang-bash">systemctl daemon-reload
systemctl restart drachtio
</code></pre>
<p>We can verify that drachtio is now listening on tcp port 8443 by looking at the /var/log/drachtio.log file after the restart:</p>
<pre><code class="lang-bash">2023-01-05 20:53:30.794776 SipTransport::logTransports - there are : 3 transports
2023-01-05 20:53:30.794794 SipTransport::logTransports - tcp/10.0.188.191:5060 (sip:10.0.188.191;transport=tcp, external-ip: , local-net: 10.0.0.0/8)
2023-01-05 20:53:30.794802 SipTransport::logTransports - wss/10.0.188.191:8443 (sips:10.0.188.191:8443;transport=wss, external-ip: 35.176.86.236, local-net: )
2023-01-05 20:53:30.794812 SipTransport::logTransports - udp/10.0.188.191:5060 (sip:10.0.188.191;transport=udp, external-ip: 35.176.86.236, local-net: ), mtu size: 4096
</code></pre>
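<p>You can also confirm the listening socket directly from the shell:</p>
<pre><code class="lang-bash">ss -tln | grep 8443
</code></pre>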
<p>Of course, we also need to make sure network traffic is being allowed into 8443/tcp. On EC2, we do that by reviewing and, if necessary, editing the security group for the instance.</p>
<h1 id="heading-configuring-jambonz">Configuring jambonz</h1>
<p>We're not quite done yet. If we were to point a webrtc client at the server now and try to register, we would receive a 403 Forbidden back because we have not yet configured our authorization. We need to authenticate clients based on a sip username and password.</p>
<p>If you are not familiar with how jambonz handles SIP authentication, here is a video describing it.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/m0EvTFqTZXU">https://youtu.be/m0EvTFqTZXU</a></div>
<p> </p>
<p>Once you have created your webhook application for authentication, specify it in the account section in the jambonz portal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1672953176491/ec5c136e-d235-4a1a-ac65-d4335ceba307.png" alt class="image--center mx-auto" /></p>
<p>You will also need to specify the sip realm that you want devices owned by that jambonz account to use. In my example, I've chosen a subdomain named <code>daveh.sip.jambonz.me</code> to be used.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1672953227928/c6bb3bd4-fa61-4bb4-acf8-05b17173af4d.png" alt class="image--center mx-auto" /></p>
<p>Now you need to point your webrtc client at the jambonz server, and configure it with a sip username, password, and realm that match the information your webhook is using. In my webrtc client, which I've built using jssip, that client-side config looks like this:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">let</span> data = {
    <span class="hljs-attr">name</span>: <span class="hljs-string">'daveh'</span>,
    <span class="hljs-attr">sipUri</span>: <span class="hljs-string">'sip:daveh@sip.jambonz.me'</span>,
    <span class="hljs-attr">sipPassword</span>: <span class="hljs-string">'foobar'</span>,
    <span class="hljs-attr">wsUri</span>: <span class="hljs-string">'wss://sip.jambonz.me:8443'</span>,
    <span class="hljs-attr">pcConfig</span>: {
        <span class="hljs-attr">iceServers</span>: [
            {
                <span class="hljs-attr">urls</span>: [ <span class="hljs-string">'stun:stun.l.google.com:19302'</span> ]
            }
        ]
    },
    <span class="hljs-attr">initiallyMinimized</span>: <span class="hljs-literal">false</span>
};
</code></pre>
<p>Bingo! My webrtc client now registers successfully.</p>
<p>One last thing - let's make a test call. I'm going to route my incoming calls from sip and webrtc devices to the good old "hello world" application.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1672953454877/c35a265c-bab7-4952-abf9-c077539e60a1.png" alt class="image--center mx-auto" /></p>
<p>I dial any number from my webrtc client and....it works!</p>
<h1 id="heading-so-long-for-now">So long for now..</h1>
<p>Thanks again for trying out jambonz! Feel free to learn more at <a target="_blank" href="https://jambonz.org">jambonz.org</a>, or join our slack channel by going to <a target="_blank" href="https://joinslack.jambonz.org">joinslack.jambonz.org</a>, or email me at daveh@jambonz.org.</p>
]]></content:encoded></item><item><title><![CDATA[Building your first jambonz app using Node.js]]></title><description><![CDATA[jambonz is the open-source CPaaS for service providers and developers alike that is no more difficult to install than a webserver.  It is a BYOE (bring your own everything) platform, which means that you bring your own carrier trunks and speech APIs ...]]></description><link>https://blog.jambonz.org/building-your-first-jambonz-app-using-nodejs</link><guid isPermaLink="true">https://blog.jambonz.org/building-your-first-jambonz-app-using-nodejs</guid><category><![CDATA[Voice]]></category><category><![CDATA[messaging]]></category><category><![CDATA[APIs]]></category><category><![CDATA[PaaS]]></category><category><![CDATA[WebRTC]]></category><dc:creator><![CDATA[Dave Horton]]></dc:creator><pubDate>Thu, 23 Sep 2021 13:59:57 GMT</pubDate><content:encoded><![CDATA[<p><a target="_blank" href="https://jambonz.org">jambonz</a> is the open-source CPaaS for service providers and developers alike that is no more difficult to install than a webserver.  It is a BYOE (bring your own everything) platform, which means that you bring your own carrier trunks and speech APIs (Google and AWS cloud speech both supported).</p>
<blockquote>
<p>Haven't got a carrier, but want to try out jambonz?  No problem, visit <a target="_blank" href="https://www.telecomsxchange.com/jambonz">TelecomsXchange</a> to create a free account and gain access to their worldwide network of hundreds of voice and SMS carriers!</p>
</blockquote>
<p>Your options for deploying a jambonz service include:</p>
<ul>
<li>downloading and installing for free on your own infrastructure (jambonz is published under the <a target="_blank" href="https://opensource.org/licenses/MIT">MIT open source license</a>), </li>
<li>deploying in your AWS account using a <a target="_blank" href="https://aws.amazon.com/marketplace/pp/prodview-55wp45fowbovo">cloudformation script</a>, or</li>
<li>creating a free account on our <a target="_blank" href="https://jambonz.us/register">hosted platform</a>.</li>
</ul>
<blockquote>
<p>If you are just getting started and want to try out jambonz for the first time, we recommend <a target="_blank" href="https://jambonz.us/register">creating an account on the hosted platform</a>, since it is simple, free, and gets you up and running instantly.</p>
</blockquote>
<p>Node.js is the preferred application environment for creating and running jambonz applications.  And whipping up a jambonz app couldn't be easier using <a target="_blank" href="https://nodejs.dev/learn/the-npx-nodejs-package-runner">npx</a> and the <code>create-jambonz-app</code> utility.  Let's run it with the <code>-h</code> option to see what it can do for us:</p>
<pre><code class="lang-bash"> $ npx create-jambonz-app -h
Usage: create-jambonz-app [options] project-name

Options:
  -v, --version              display the current version
  -s, --scenario &lt;scenario&gt;  generates sample webhooks <span class="hljs-keyword">for</span> specified scenarios, default is dial and tts (default: <span class="hljs-string">"tts, dial"</span>)
  -h, --<span class="hljs-built_in">help</span>                 display <span class="hljs-built_in">help</span> <span class="hljs-keyword">for</span> <span class="hljs-built_in">command</span>


Scenarios available:
- tts: answer call and play greeting using tts,
- dial: use the dial verb to outdial through your carrier,
- record: record the audio stream generated by the listen verb,
- auth: authenticate sip devices, or
- all: generate all of the above scenarios

Example:
  $ npx create-jambonz-app my-app
</code></pre>
<p>You can see that it will scaffold out an <a target="_blank" href="https://expressjs.com/">express</a>-based jambonz  <a target="_blank" href="https://www.jambonz.org/docs/webhooks/overview/">webhook</a> application that implements one or more scenarios.</p>
<p>Let's dive right in and create a simple app that answers a call and plays a greeting using text-to-speech:</p>
<pre><code>npx <span class="hljs-keyword">create</span>-jambonz-app -s tts my-app

Creating a <span class="hljs-keyword">new</span> jambonz app <span class="hljs-keyword">in</span> /<span class="hljs-keyword">Users</span>/dhorton/tmp/my-app

Installing packages...
</code></pre><p>Done!  Now let's see what we have:</p>
<pre><code><span class="hljs-string">$</span> <span class="hljs-string">cd</span> <span class="hljs-string">my-app/</span>
<span class="hljs-string">$</span> <span class="hljs-string">ls</span> <span class="hljs-string">-lrt</span>
<span class="hljs-string">total</span> <span class="hljs-number">432</span>
<span class="hljs-string">-rw-r--r--</span>    <span class="hljs-number">1</span> <span class="hljs-string">dhorton</span>  <span class="hljs-string">staff</span>     <span class="hljs-number">567</span> <span class="hljs-string">Sep</span> <span class="hljs-number">23</span> <span class="hljs-number">07</span><span class="hljs-string">:40</span> <span class="hljs-string">README.md</span>
<span class="hljs-string">-rw-r--r--</span>    <span class="hljs-number">1</span> <span class="hljs-string">dhorton</span>  <span class="hljs-string">staff</span>    <span class="hljs-number">1616 </span><span class="hljs-string">Sep</span> <span class="hljs-number">23</span> <span class="hljs-number">07</span><span class="hljs-string">:40</span> <span class="hljs-string">app.js</span>
<span class="hljs-string">drwxr-xr-x</span>    <span class="hljs-number">2</span> <span class="hljs-string">dhorton</span>  <span class="hljs-string">staff</span>      <span class="hljs-number">64</span> <span class="hljs-string">Sep</span> <span class="hljs-number">23</span> <span class="hljs-number">07</span><span class="hljs-string">:40</span> <span class="hljs-string">data</span>
<span class="hljs-string">-rw-r--r--</span>    <span class="hljs-number">1</span> <span class="hljs-string">dhorton</span>  <span class="hljs-string">staff</span>     <span class="hljs-number">491</span> <span class="hljs-string">Sep</span> <span class="hljs-number">23</span> <span class="hljs-number">07</span><span class="hljs-string">:40</span> <span class="hljs-string">ecosystem.config.js</span>
<span class="hljs-string">drwxr-xr-x</span>    <span class="hljs-number">3</span> <span class="hljs-string">dhorton</span>  <span class="hljs-string">staff</span>      <span class="hljs-number">96</span> <span class="hljs-string">Sep</span> <span class="hljs-number">23</span> <span class="hljs-number">07</span><span class="hljs-string">:40</span> <span class="hljs-string">lib</span>
<span class="hljs-string">drwxr-xr-x</span>  <span class="hljs-number">239</span> <span class="hljs-string">dhorton</span>  <span class="hljs-string">staff</span>    <span class="hljs-number">7648 </span><span class="hljs-string">Sep</span> <span class="hljs-number">23</span> <span class="hljs-number">07</span><span class="hljs-string">:40</span> <span class="hljs-string">node_modules</span>
<span class="hljs-string">-rw-r--r--</span>    <span class="hljs-number">1</span> <span class="hljs-string">dhorton</span>  <span class="hljs-string">staff</span>  <span class="hljs-number">203525</span> <span class="hljs-string">Sep</span> <span class="hljs-number">23</span> <span class="hljs-number">07</span><span class="hljs-string">:40</span> <span class="hljs-string">package-lock.json</span>
<span class="hljs-string">-rw-r--r--</span>    <span class="hljs-number">1</span> <span class="hljs-string">dhorton</span>  <span class="hljs-string">staff</span>     <span class="hljs-number">489</span> <span class="hljs-string">Sep</span> <span class="hljs-number">23</span> <span class="hljs-number">07</span><span class="hljs-string">:40</span> <span class="hljs-string">package.json</span>
</code></pre><p>A <a target="_blank" href="https://pm2.keymetrics.io/docs/usage/quick-start/">pm2</a> config file is generated to hold some environment variables that you must define -- things such as your jambonz account sid, the URL of your jambonz API server, etc:</p>
<pre><code class="lang-bash">$ cat ecosystem.config.js
module.exports = {
  apps : [{
    name: <span class="hljs-string">'my-app'</span>,
    script: <span class="hljs-string">'app.js'</span>,
    instance_var: <span class="hljs-string">'INSTANCE_ID'</span>,
    exec_mode: <span class="hljs-string">'fork'</span>,
    instances: 1,
    autorestart: <span class="hljs-literal">true</span>,
    watch: <span class="hljs-literal">false</span>,
    max_memory_restart: <span class="hljs-string">'1G'</span>,
    env: {
      NODE_ENV: <span class="hljs-string">'production'</span>,
      LOGLEVEL: <span class="hljs-string">'info'</span>,
      HTTP_PORT: 3000,
      JAMBONZ_ACCOUNT_SID: <span class="hljs-string">''</span>,
      JAMBONZ_API_KEY: <span class="hljs-string">''</span>,
      JAMBONZ_REST_API_BASE_URL: <span class="hljs-string">''</span>,
      WEBHOOK_SECRET: <span class="hljs-string">''</span>,
      HTTP_PASSWORD: <span class="hljs-string">''</span>,
      HTTP_USERNAME: <span class="hljs-string">''</span>,
    }
  }]
};
</code></pre>
<blockquote>
<p>Note: use of pm2 is of course optional; you can alternatively specify your environment variables in a .env file or on the command line if you prefer.</p>
</blockquote>
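<p>For example, running the app without pm2, with the required variables supplied inline on the command line, might look like this (placeholder values shown):</p>
<pre><code class="lang-bash"># substitute your own account sid, api key, and base url
JAMBONZ_ACCOUNT_SID=your-account-sid \
JAMBONZ_API_KEY=your-api-key \
JAMBONZ_REST_API_BASE_URL=https://api.jambonz.us/v1 \
node app.js
</code></pre>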
<p>The final two environment variables above (<code>HTTP_PASSWORD</code> and <code>HTTP_USERNAME</code>) are optional - you only need to supply them if you are using http basic auth to secure your webhook endpoint.  The others are required, and you can find your account_sid, api_key, and webhook secret in the jambonz portal.  If you are using the <a target="_blank" href="https://jambonz.us">jambonz.us</a> hosted platform, the <code>JAMBONZ_REST_API_BASE_URL</code> variable should be <code>https://api.jambonz.us/v1</code>; otherwise set it to the appropriate URL for your own server. </p>
<p>Update the <code>ecosystem.config.js</code> file with these values where indicated.</p>
<p>Let's have a look at the code now.  The <code>app.js</code> file is boilerplate stuff that you probably won't have to change, but let's have a look at it to understand what it does.</p>
<pre><code class="lang-js">$ cat app.js
<span class="hljs-keyword">const</span> assert = <span class="hljs-built_in">require</span>(<span class="hljs-string">'assert'</span>);
assert.ok(process.env.JAMBONZ_ACCOUNT_SID, <span class="hljs-string">'You must define the JAMBONZ_ACCOUNT_SID env variable'</span>);
assert.ok(process.env.JAMBONZ_API_KEY, <span class="hljs-string">'You must define the JAMBONZ_API_KEY env variable'</span>);
assert.ok(process.env.JAMBONZ_REST_API_BASE_URL, <span class="hljs-string">'You must define the JAMBONZ_REST_API_BASE_URL env variable'</span>);

<span class="hljs-keyword">const</span> express = <span class="hljs-built_in">require</span>(<span class="hljs-string">'express'</span>);
<span class="hljs-keyword">const</span> app = express();
<span class="hljs-keyword">const</span> {WebhookResponse} = <span class="hljs-built_in">require</span>(<span class="hljs-string">'@jambonz/node-client'</span>);
<span class="hljs-keyword">const</span> basicAuth = <span class="hljs-built_in">require</span>(<span class="hljs-string">'express-basic-auth'</span>);
<span class="hljs-keyword">const</span> opts = <span class="hljs-built_in">Object</span>.assign({
  <span class="hljs-attr">timestamp</span>: <span class="hljs-function">() =&gt;</span> <span class="hljs-string">`, "time": "<span class="hljs-subst">${<span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span>().toISOString()}</span>"`</span>,
  <span class="hljs-attr">level</span>: process.env.LOGLEVEL || <span class="hljs-string">'info'</span>
});
<span class="hljs-keyword">const</span> logger = <span class="hljs-built_in">require</span>(<span class="hljs-string">'pino'</span>)(opts);
<span class="hljs-keyword">const</span> port = process.env.HTTP_PORT || <span class="hljs-number">3000</span>;
<span class="hljs-keyword">const</span> routes = <span class="hljs-built_in">require</span>(<span class="hljs-string">'./lib/routes'</span>);
app.locals = {
  ...app.locals,
  logger,
  <span class="hljs-attr">client</span>: <span class="hljs-built_in">require</span>(<span class="hljs-string">'@jambonz/node-client'</span>)(process.env.JAMBONZ_ACCOUNT_SID, process.env.JAMBONZ_API_KEY, {
    <span class="hljs-attr">baseUrl</span>: process.env.JAMBONZ_REST_API_BASE_URL
  })
};

<span class="hljs-keyword">if</span> (process.env.HTTP_USERNAME &amp;&amp; process.env.HTTP_PASSWORD) {
  <span class="hljs-keyword">const</span> users = {};
  users[process.env.HTTP_USERNAME] = process.env.HTTP_PASSWORD;
  app.use(basicAuth({users}));
}
app.use(express.urlencoded({ <span class="hljs-attr">extended</span>: <span class="hljs-literal">true</span> }));
app.use(express.json());
<span class="hljs-keyword">if</span> (process.env.WEBHOOK_SECRET) {
  app.use(WebhookResponse.verifyJambonzSignature(process.env.WEBHOOK_SECRET));
}
app.use(<span class="hljs-string">'/'</span>, routes);
app.use(<span class="hljs-function">(<span class="hljs-params">err, req, res, next</span>) =&gt;</span> {
  logger.error(err, <span class="hljs-string">'burped error'</span>);
  res.status(err.status || <span class="hljs-number">500</span>).json({<span class="hljs-attr">msg</span>: err.message});
});

<span class="hljs-keyword">const</span> server = app.listen(port, <span class="hljs-function">() =&gt;</span> {
  logger.info(<span class="hljs-string">`Example jambonz app listening at http://localhost:<span class="hljs-subst">${port}</span>`</span>);
});
</code></pre>
<p>Pretty straightforward stuff, right?  It's creating an express app, using middleware to validate the signature of incoming webhook requests to be sure they came from your account, enforcing http basic auth if you've configured it, and then routing requests to the http endpoints you are exposing.</p>
<p>Oh, and it's also including the <code>@jambonz/node-client</code> <a target="_blank" href="https://www.npmjs.com/package/@jambonz/node-client">npm package</a>.  This little beauty will make it easy to respond to webhooks and to use the <a target="_blank" href="https://api.jambonz.org">jambonz REST api</a>.</p>
<p>Let's look at the code generated for the http endpoints next.  In this case, we asked it to generate a simple app to use tts to play a greeting.  We'll actually need two endpoints: one to respond to the webhook request for an application, and one to handle call status events.  Both of those are generated under <code>lib/routes/endpoints</code> as you can see:</p>
<pre><code class="lang-bash">$ ls -lrt lib/routes/endpoints/
total 24
-rw-r--r--  1 dhorton  staff  218 Sep 23 07:40 call-status.js
-rw-r--r--  1 dhorton  staff  183 Sep 23 07:40 index.js
-rw-r--r--  1 dhorton  staff  803 Sep 23 07:40 tts-hello-world.js

MBP-daveh:my-app dhorton$ cat lib/routes/endpoints/index.js
const router = require(<span class="hljs-string">'express'</span>).Router();

router.use(<span class="hljs-string">'/call-status'</span>, require(<span class="hljs-string">'./call-status'</span>));
router.use(<span class="hljs-string">'/hello-world'</span>, require(<span class="hljs-string">'./tts-hello-world'</span>));

module.exports = router;
</code></pre>
<p>Finally, let's have a look at the code generated to respond to the application webhook!</p>
<pre><code class="lang-js">$ cat lib/routes/endpoints/tts-hello-world.js
<span class="hljs-keyword">const</span> router = <span class="hljs-built_in">require</span>(<span class="hljs-string">'express'</span>).Router();
<span class="hljs-keyword">const</span> WebhookResponse = <span class="hljs-built_in">require</span>(<span class="hljs-string">'@jambonz/node-client'</span>).WebhookResponse;
<span class="hljs-keyword">const</span> text = <span class="hljs-string">`&lt;speak&gt;
&lt;prosody volume="loud"&gt;Hi there,&lt;/prosody&gt; and welcome to jambones!
jambones is the &lt;sub alias="seapass"&gt;CPaaS&lt;/sub&gt; designed with the needs
of communication service providers in mind.
This is an example of simple text-to-speech, but there is so much more you can do.
Try us out!
&lt;/speak&gt;`</span>;

router.post(<span class="hljs-string">'/'</span>, <span class="hljs-function">(<span class="hljs-params">req, res</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> {logger} = req.app.locals;
  logger.debug({<span class="hljs-attr">payload</span>: req.body}, <span class="hljs-string">'POST /hello-world'</span>);
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> app = <span class="hljs-keyword">new</span> WebhookResponse();
    app
      .pause({<span class="hljs-attr">length</span>: <span class="hljs-number">1.5</span>})
      .say({text});
    res.status(<span class="hljs-number">200</span>).json(app);
  } <span class="hljs-keyword">catch</span> (err) {
    logger.error({err}, <span class="hljs-string">'Error'</span>);
    res.sendStatus(<span class="hljs-number">503</span>);
  }
});

<span class="hljs-built_in">module</span>.exports = router;
</code></pre>
<p>The key here is the WebhookResponse instance we create and then call chained methods on, each corresponding to a <a target="_blank" href="https://www.jambonz.org/docs/webhooks/overview/">jambonz webhook verb</a>:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> WebhookResponse = <span class="hljs-built_in">require</span>(<span class="hljs-string">'@jambonz/node-client'</span>).WebhookResponse;

// ... then, inside the route handler ...

const app = <span class="hljs-keyword">new</span> WebhookResponse();
app
  .pause({<span class="hljs-attr">length</span>: <span class="hljs-number">1.5</span>})
  .say({text});
</code></pre>
<p>When we've added all our verbs, we simply return it as a json payload in our http response:</p>
<pre><code class="lang-js">res.status(<span class="hljs-number">200</span>).json(app);
</code></pre>
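<p>Under the covers, a jambonz application is just a json array of verb objects, and the WebhookResponse serializes itself accordingly.  If you're curious what jambonz actually receives, you can print it (the text is elided here for brevity):</p>
<pre><code class="lang-js">console.log(JSON.stringify(app, null, 2));
// roughly:
// [
//   { "verb": "pause", "length": 1.5 },
//   { "verb": "say", "text": "&lt;speak&gt;...&lt;/speak&gt;" }
// ]
</code></pre>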
<p>Let's run this puppy!</p>
<p>When testing on my laptop, I like to use <a target="_blank" href="https://ngrok.io">ngrok</a> to provide an externally-reachable URL for the app running behind my firewall:</p>
<pre><code class="lang-bash">$ ngrok http -region=us -hostname=jambonz-apps.drachtio.org 3000
</code></pre>
<blockquote>
<p>Note: I am using a custom domain here, which requires a paid plan on ngrok.  If you're on the free plan, just run <code>ngrok http 3000</code> and use the randomly-assigned URL it gives you.</p>
</blockquote>
<p>Next, I just start up my app:</p>
<pre><code class="lang-bash">$ pm2 start ecosystem.config.js
[PM2][WARN] Applications my-app not running, starting...
[PM2] App [my-app] launched (1 instances)
┌─────┬───────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id  │ name      │ namespace   │ version │ mode    │ pid      │ uptime │ ↺    │ status    │ cpu      │ mem      │ user     │ watching │
├─────┼───────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 0   │ my-app    │ default     │ 0.0.1   │ fork    │ 25162    │ 0s     │ 0    │ online    │ 0%       │ 9.7mb    │ dhorton  │ disabled │
└─────┴───────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘
</code></pre>
<p>The pm2 log shows me that the app is listening on the configured port for webhooks:</p>
<pre><code class="lang-bash">$ pm2 <span class="hljs-built_in">log</span>

0|my-app   | {<span class="hljs-string">"level"</span>:30, <span class="hljs-string">"time"</span>: <span class="hljs-string">"2021-09-23T12:53:11.336Z"</span>,<span class="hljs-string">"pid"</span>:25162,
<span class="hljs-string">"hostname"</span>:<span class="hljs-string">"MBP-daveh.local"</span>,
<span class="hljs-string">"msg"</span>:<span class="hljs-string">"Example jambonz app listening at http://localhost:3000"</span>}
</code></pre>
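<p>Before pointing real calls at it, you can smoke-test the endpoint locally.  Here's a quick sketch, assuming the endpoint routes are mounted at the root and you're on Node 18+ (where <code>fetch</code> is global); the payload fields are just placeholders:</p>
<pre><code class="lang-js">// smoke-test.js: POST a dummy webhook payload to the local app
(async () =&gt; {
  const res = await fetch('http://localhost:3000/hello-world', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({call_sid: 'test', direction: 'inbound'}) // placeholder fields
  });
  console.log(res.status, await res.json()); // expect 200 and the array of verbs
})();
</code></pre>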
<p>Now, bounce over to your jambonz portal and create a new application with this webhook URL and path..</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1632401983179/uYoa5mn8O.png" alt="my-new-app.png" /></p>
<p>And assign a phone number to route to this app..</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1632402078576/yeVeVQ0_P.png" alt="assifgn-phone-number.png" /></p>
<p>Bam!  Done!  </p>
<p>Incoming calls on this phone number now route to this webhook, which returns a simple app that plays a text-to-speech greeting using my preferred speech vendor and voice.</p>
<p>This has been a basic (and hopefully painless!) introduction to building apps using jambonz and the Node.js SDK.  If you prefer to learn by watching, <a target="_blank" href="https://youtu.be/42jcqyvCstU">here is a video</a> covering much of the same material in a bit more detail, including how to use Twilio as your carrier.</p>
<p>To create your own account for testing, just head over to <a target="_blank" href="https://jambonz.us/register">jambonz.us</a>, or watch <a target="_blank" href="https://youtu.be/BVOwpxIKOso">this video</a> showing how to get started.</p>
<p>There's much more you can do -- as a next step, consider using the <code>--scenario</code> option to scaffold an app that performs a <a target="_blank" href="https://www.jambonz.org/docs/webhooks/dial/">dial</a> operation or records a call using the <a target="_blank" href="https://www.jambonz.org/docs/webhooks/listen/">listen</a> verb.</p>
<p>In this article, we explored creating webhook applications but did not use the <a target="_blank" href="https://api.jambonz.org">REST api</a>.  We will cover that in a later article, but for a sneak peek at an example application that uses both webhooks and the REST api, have a look at the attended transfer application <a target="_blank" href="https://github.com/jambonz/jambonz-node-example-app/blob/main/lib/routes/endpoints/dial/attended-transfer.js">here</a> (and video <a target="_blank" href="https://youtu.be/IwrPmJdQYXE">here</a>), where we respond to dtmf events during a call and use the REST api to perform <a target="_blank" href="https://api.jambonz.org/#9c80ca99-4036-4a47-8823-4609e3fd4788">Live Call Control</a>.</p>
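<p>As a quick taste of the shape of that code, updating a live call through the node-client's REST wrapper might look roughly like the sketch below.  Treat it as illustrative only: the client factory arguments and the <code>calls.update</code> call are assumptions based on the package docs, so check the linked example app for the real thing.</p>
<pre><code class="lang-js">const jambonz = require('@jambonz/node-client');

// ASSUMPTION: the factory takes your account sid, api key, and the REST api base url
const client = jambonz(process.env.JAMBONZ_ACCOUNT_SID, process.env.JAMBONZ_API_KEY, {
  baseUrl: 'https://jambonz.us/api/v1'
});

// ASSUMPTION: calls.update() redirects an in-progress call to a new application;
// callSid would come from an earlier webhook payload
async function transfer(callSid) {
  await client.calls.update(callSid, {
    call_hook: 'https://your-server.example.com/transfer' // hypothetical webhook URL
  });
}
</code></pre>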
<p>Any questions?  Feel free to email us at <a href="mailto:support@jambonz.org">support@jambonz.org</a> or <a target="_blank" href="https://joinslack.jambonz.org">join our slack channel</a> to ping us in real-time!</p>
]]></content:encoded></item></channel></rss>