Here's how simple it is to start your audio journey

Lately, there’s been some talk in my social media circles about the simplest or easiest approach to the audio and voice revolution for the uninitiated. Some publishers (and some advertisers too) are confused by the breadth of audio options and because of that, they’re not doing anything. 

To be fair, I can understand how someone might feel overwhelmed by the sheer number of possibilities. In my mind, I envision them thinking: 

  • Audiences really dig podcasts (audio’s blockbuster star) – we have to have one. 
  • Voice skills are popular too so that’s another thing to explore. 
  • But then – do we go with Alexa or Google Assistant, or both? 
  • What about a voice interface? Should we change our chatbot to a voicebot?
  • Spotify seems huge at the moment. How do we get a foothold there?

These are all perfectly reasonable questions and doubts. The reality is: audio is not that complicated. There are super simple solutions that the media world can adopt to kickstart its audio strategy, and everything begins with text-to-speech (TTS) technology.

Why is text-to-speech tech the first audio step?

Hey, that heading rhymes! Turning my serious mode on, there are four primary reasons why TTS is the natural first step:

  • Simplicity of integration
  • Low production cost
  • AI support
  • Distribution to ALL other audio platforms

Notice the audio player at the beginning of this post, the one you may be using to listen to this article in case you’re all about convenience? That’s the end result of the straightforward and short integration process. A small piece of code is embedded in the respective website with the text content (like a news publication or this blog, for instance) and is added automatically to every new article. Within an hour, everything is set up. 

It only takes a few seconds to load the converted audio, and with a few more clicks, it’s possible to fully customize the audio player so it aligns natively with the website’s overall look and feel. The player’s loading time is optimized for both latency and resource consumption so that its footprint is minimal. As a bonus, playback continues in the background if a visitor opens or switches to another tab.

Text-to-speech can significantly cut production costs compared to alternatives such as human narration, because the content already exists and there is no need for post-production add-ons such as sound effects and music. It’s just a matter of transforming the text to audio. The technology is also highly scalable, so you can do more with less.

To make sure TTS tech converts, or audiofies, only the relevant text on the page, AI solutions handle matching the extraction to the given website. This is usually part of the short onboarding process, where an AI special sauce (as is the case with Trinity Audio) makes sure 100% of the relevant text is read and none of the fluff.
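
For the curious, here is a rough idea of what "read the article, skip the fluff" means in practice. This is only a minimal sketch in Python, and it swaps the AI for a plain rule of thumb: keep text inside the `<article>` tag and ignore navigation, sidebars, and footers. The tag list and the `extract_article_text` helper are my own assumptions for illustration, not how Trinity Audio or any other vendor actually does it.

```python
from html.parser import HTMLParser

# Tags treated as "fluff" in this sketch (an assumption for illustration).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class ArticleTextExtractor(HTMLParser):
    """Collects text inside <article>, ignoring anything nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.in_article = 0  # nesting depth of <article>
        self.skip_depth = 0  # nesting depth of skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.in_article:
            self.in_article -= 1
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside <article> and outside every skipped tag.
        if text and self.in_article and not self.skip_depth:
            self.chunks.append(text)


def extract_article_text(html: str) -> str:
    """Hypothetical helper: return the readable article text, one chunk per line."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """
<html><body>
  <nav>Home | About | Contact</nav>
  <article>
    <h1>Audio is simple</h1>
    <p>Start with text-to-speech.</p>
    <aside>Subscribe to our newsletter!</aside>
  </article>
  <footer>Copyright notice</footer>
</body></html>
"""
article_text = extract_article_text(page)
print(article_text)
```

Real pages are messier than this sample, which is exactly why a per-site matching step during onboarding is useful.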

Using AI in this capacity is nothing new. For example, Apple has a built-in widget that reads text on mobile when the reader drags down the screen from top to bottom. There is also Google’s Read It feature via Google Assistant where the voice assistant reads articles and other text directly.

In both of these cases, the experience is smooth but fairly basic. Don’t get me wrong – it’s great they exist on websites that don’t provide a custom experience, especially for visually impaired and illiterate people. It’s just that they don’t provide a native listening experience that’s in line with the content, something custom TTS integration handles with ease.

That’s not all when it comes to AI. Regarding the listening itself, there are a variety of options to fine-tune the experience, such as setting different voices for different sections, different reading speeds for different parts, multi-language conversion, and so on. When the audio plays on a device with a screen, it’s possible to display the text while listening and highlight it in sync with the audio, which helps people with disabilities. In short, a lot can be done to create the perfect experience.
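
To make "different voices and speeds for different sections" a bit more concrete: many TTS engines accept SSML (Speech Synthesis Markup Language, a W3C standard), where `<voice>`, `<prosody rate="...">`, and `<break>` control exactly these things. Below is a minimal Python sketch that builds such markup. The voice names and the `build_ssml` helper are made up for illustration, and each engine supports its own subset of SSML and its own voice names, so treat this as a starting point rather than a recipe.

```python
from xml.sax.saxutils import escape, quoteattr


def section_to_ssml(text: str, voice: str, rate: str = "medium") -> str:
    """Wrap one section in SSML <voice> and <prosody> elements."""
    return (
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        f"</voice>"
    )


def build_ssml(sections, pause: str = "600ms") -> str:
    """Join sections into one <speak> document with a pause between them."""
    separator = f"<break time={quoteattr(pause)}/>"
    body = separator.join(section_to_ssml(**s) for s in sections)
    return f"<speak>{body}</speak>"


# Hypothetical voice names; real ones depend on the TTS engine you use.
sections = [
    {"text": "Why TTS & audio?", "voice": "headline-voice", "rate": "slow"},
    {"text": "Because it is simple.", "voice": "body-voice"},
]
ssml = build_ssml(sections)
print(ssml)
```

Note that the text is escaped before it goes into the markup, so characters like "&" in a headline do not break the document.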

Finally, TTS acts as a gateway to a broader distribution in the audio landscape, checking most of the dilemma boxes previously mentioned. Your freshly created audio content can be syndicated to every audio streaming platform. With a few clicks, you can basically have a podcast in itself distributed on Spotify, Apple Podcasts, Google Podcasts, iHeartMedia, and such. Here is an example of this blog’s podcast.
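
Under the hood, podcast distribution generally runs on an RSS feed: each episode is an `<item>` with an `<enclosure>` pointing at the audio file, and platforms pick up new episodes from that feed. Here is a minimal Python sketch of such a feed built with the standard library. The URLs, file sizes, and the `build_podcast_feed` helper are placeholders I made up, and individual platforms usually ask for extra tags (cover art, categories, and so on) that are left out here.

```python
import xml.etree.ElementTree as ET


def build_podcast_feed(title, link, description, episodes) -> str:
    """Build a bare-bones RSS 2.0 feed with one <item> per audio episode."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for ep in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ep["title"]
        ET.SubElement(item, "guid").text = ep["audio_url"]
        # RSS 2.0 enclosures carry the file URL, its size in bytes, and its MIME type.
        ET.SubElement(
            item,
            "enclosure",
            url=ep["audio_url"],
            length=str(ep["bytes"]),
            type="audio/mpeg",
        )
    return ET.tostring(rss, encoding="unicode")


# Placeholder data: example.com URLs and sizes are invented for this sketch.
feed = build_podcast_feed(
    title="Example Blog, Audio Edition",
    link="https://example.com/blog",
    description="TTS versions of our articles.",
    episodes=[
        {
            "title": "How simple it is to start your audio journey",
            "audio_url": "https://example.com/audio/audio-journey.mp3",
            "bytes": 4200000,
        }
    ],
)
print(feed)
```

Every new article that gets an audio version simply becomes one more `<item>` in the feed.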

But what about the cold, synthetic speech?

Will it surprise you to learn it’s anything but cold and tangibly synthetic?

Thanks to the advances in processing power and compression, training voice models is easier and more accessible. Powered by neural technology, we now have the most natural and human-like text-to-speech voices ever developed.

The advantage of neural text-to-speech (NTTS) is that it learns from training data, which results in smoother speech with no audibly stitched-together units of sound, and with rhythm and intonation that suit the intended use case (conversational or informational, for example). This is synthesized speech with seamless transitions and, for instance, more natural pauses when switching between paragraphs or even when moving from one character’s dialog to another’s.

Human ears have become not only tolerant of “mechanical” voices but comfortable with them. Our research has shown that 59% of people listened to the TTS-powered audio versions of news articles and blog posts from start to finish. This clearly suggests that the ability to consume content via audio is serving a market need and that voice technology has become very important in content consumption. Trust me – we are only scratching the surface.

Bottom line: Frankie says relax

With all I’ve written so far, I’ll repeat what I said in one of those social discussions: 

relax, take time to know the tech, and see if it makes sense for your audience.

Understand what you can do with it, how you can distribute, and how your audience is reacting. There is no need to start a podcast and pour your resources into something that’s likely not going to work. I say this as a podcast fan who welcomed the format with open arms: I love radio, and this was its natural and logical evolution. But podcast saturation is real, and it’s getting harder and harder to capture and retain attention, especially if that’s the way you burst into the audio scene.

AI-powered text-to-speech is the easiest and simplest solution to understand IF your audiences like to engage with audio content. You know your audiences best – give it a go and see their reaction. Test it out and if the sentiment is a positive one, slowly build on that foundation by gradually increasing the pace and investment.

So what do I get with audio content?

The thing is, digital audio is now everywhere, starting with an embedded version of itself on numerous websites. Thanks to this omnipresence, audio is used today both as a primary channel for content consumption and as a complementary medium to written and/or visual communication. It doesn’t matter if people are seeking information or entertainment. They want both, and then some.


The revolution that started with the increasing adoption of smart speakers is now continuing its ascent thanks to better digital connectivity all around. Cars and other vehicles are a good example: with each new version, they lean further into the concept of entertainment platforms.

Audio has effectively transformed how people consume content due to its personal and convenient nature. So adding it to your repertoire first and foremost means meeting the needs of a growing audience of listeners.

I could talk your ear off about the various opportunities that having a listening experience offers, from better user experience and making your content portable to new distribution options and a new monetization stream. Suffice it to say that audio is where audiences are these days, particularly readers, and they’ll stay there for a long time.

What’s the next step?

Let’s say you already are offering a listening experience. 

A while back, I formalized my vision of a multi-step audio strategy, and I firmly hold to that structure. 

These five steps can also be grouped into four distinct phases, just like Marvel movies:

  1. Make available by giving your audience the ability to listen to your content;
  2. Enhance by recommending more audio articles to listen to, improving the overall experience;
  3. Expand by providing options to consume your audio through additional channels such as mainstream audio platforms and smart speakers;
  4. Vocalize by letting your users discover, interact, and engage with your audio content using voice commands.

It’s a close-knit circle that maximizes the potential of your content in a world where people want to multitask and absorb on the go. The bar is set high, I’ll tell you that much.

Final thoughts

With more consumption of audio content than ever before, now is the time to focus and educate yourself. There are a bunch of good things that audio brings to every content strategy: portability, intimacy, immersion, and passive involvement on the listener’s part, to name a few. 

Do note that despite the technology doing most of the legwork, a good listening experience demands some work on your part. This mostly pertains to crafting audio-friendly content, as the underlying technology is not perfect, which means some content works better than others. This is a developing industry that is constantly looking for ways to make the most out of content while keeping it highly relevant and cost-friendly.

One thing is for certain: the audio-first and voice-first people have you covered.


Make sure you’re following me on Twitter for ongoing updates, tips, and industry takeaways!

Image credits:

https://giphy.com/gifs/king-ken-jeong-interested-eoN5fHRfV4sSI
https://giphy.com/gifs/reasons-good-directorial-2Pl8OTc2UydcQ