SDG 5: Gender Equality
2. Project Details
Company or Institution
"Sam", the Non-Binary Voice Assistant
General description of the AI solution
Sam is a non-binary text-to-speech (TTS) voice created by Accenture in collaboration with CereProc. It is the first comprehensive non-binary TTS voice solution.
Gender presentation in voice is a combination of pitch, intonation, and word choice. Non-binary speech combines male and female speech patterns, so there is no single non-binary voice. To create the “Sam” voice, the team therefore did not simply alter the pitch of the voice; they also shaped the speech patterns and intonation of the models by incorporating audio data from non-binary individuals.
Sam can be used as the voice of conversational assistants, in Augmentative and Alternative Communication (AAC) devices for speech-impaired users, and in many other settings. The aim of this project was not only to develop the voice itself, but also to develop and open-source a replicable process so that others could create and innovate on other non-binary TTS voices. Already, the open-sourced materials have been used by academic institutions for ongoing research and by other groups to create their own non-binary voice solutions.
A short demo of Sam can be heard at the end of this video: https://www.youtube.com/watch?v=mL1n5AEFLl4
Excellence and Scientific Quality: Please detail the improvements made by the nominee or the nominee's team, or by yourself if you're applying for the award, and why they have been a success.
The process of creating Sam involved a female identifying voice actor recording 6 hours of audio from an open-source script. This data was pre-processed to lower the pitch, bringing it to the range between the average male and female speaker. This data was then used to train the acoustic model for Sam.
A "non-binary" speaking style was then applied to the voice through a prosody model. This model was trained using publicly available transcription and audio data from a non-binary donor voice with permission. The pitch of their audio was brought to the same range, in this case by raising the pitch slightly. The prosody model was then used to "guide" the acoustic model at synthesis time.
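The two pitch adjustments described above (lowering the voice actor's recordings and slightly raising the donor's) can be illustrated with a small calculation. This is a minimal sketch, not the project's actual pre-processing: the function name, the average male and female pitch values, the speakers' median pitches, and the choice of an arithmetic midpoint as the target are all assumptions made for illustration.

```python
import math


def semitone_shift(source_hz: float, target_hz: float) -> float:
    """Return the shift in semitones that moves source_hz to target_hz."""
    if source_hz <= 0 or target_hz <= 0:
        raise ValueError("pitch values must be positive")
    # 12 semitones per octave; one octave is a doubling of frequency.
    return 12.0 * math.log2(target_hz / source_hz)


# Illustrative values only (assumptions, not data from the project).
MALE_AVG_HZ = 120.0
FEMALE_AVG_HZ = 210.0
TARGET_HZ = (MALE_AVG_HZ + FEMALE_AVG_HZ) / 2  # midpoint of the two averages

# Hypothetical median pitches for the two speakers.
voice_actor_shift = semitone_shift(200.0, TARGET_HZ)  # negative: lower the pitch
donor_shift = semitone_shift(155.0, TARGET_HZ)        # positive: raise the pitch
```

A negative result means the recording would be shifted down and a positive result means it would be shifted up, which matches the directions described for the voice actor and the donor voice.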
The final models are "adapted" from average models incorporating data from dozens of male and female speakers, balanced so that the medians of the pitch and vocal tract length distributions fall approximately at the average of the male and female ranges. To make the average models more robust for these purposes, the team had individuals rate male and female speakers on a spectrum of gender perception. They also sourced, transcribed, and added voice data from a number of non-binary, trans, and intersex individuals to the average models.
SSML markup was also used to make the voice sound more natural by shifting the emotion and intonation of the speech depending on context.
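The balancing criterion for the average models can be checked with a simple median comparison. The sketch below is only an illustration of that idea for pitch: the helper name, the speaker values, and the male and female averages are hypothetical, and the same check could in principle be applied to vocal tract length.

```python
from statistics import median


def median_offset_hz(speaker_f0_hz, male_avg_hz, female_avg_hz):
    """Return how far the pool's median pitch sits from the male/female midpoint."""
    target = (male_avg_hz + female_avg_hz) / 2
    return median(speaker_f0_hz) - target


# Hypothetical per-speaker median pitches in Hz (not project data).
pool = [110.0, 125.0, 160.0, 170.0, 205.0, 220.0]
offset = median_offset_hz(pool, 120.0, 210.0)

# Adding one more high-pitched speaker pulls the median above the target.
skewed_offset = median_offset_hz(pool + [230.0], 120.0, 210.0)
```

An offset near zero indicates a balanced pool; a clearly positive or negative offset suggests that speakers should be added or removed on one side of the range.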
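As an illustration of context-dependent markup, the sketch below builds a small SSML fragment. It uses only the standard W3C SSML elements `speak`, `prosody`, and `break`; the helper name and the chosen attribute values are assumptions, and any emotion-specific tags used in the project (which are typically engine-specific) are not shown.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def ssml_sentence(text, pitch="medium", rate="medium", pause_ms=0):
    """Wrap text in standard SSML prosody markup, with an optional trailing pause."""
    parts = [
        "<speak>",
        f"<prosody pitch={quoteattr(pitch)} rate={quoteattr(rate)}>",
        escape(text),  # escape &, <, > so the markup stays well-formed
        "</prosody>",
    ]
    if pause_ms > 0:
        parts.append(f'<break time="{int(pause_ms)}ms"/>')
    parts.append("</speak>")
    return "".join(parts)


# A calmer, slower rendering for a hypothetical reassuring message.
example = ssml_sentence("Your appointment is confirmed & saved.",
                        pitch="low", rate="slow", pause_ms=300)

# Parsing the result confirms that the fragment is well-formed XML.
root = ET.fromstring(example)
```

The resulting string would be passed to a TTS engine that accepts SSML input; which attribute values an engine honors varies by vendor.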
To date, Sam has been featured in an initial Accenture press release, an Accenture accessibility report, and several conferences. The creation process was documented and open-sourced under an Apache 2.0 license, along with all the components used. A high-quality version of Sam runs on the CereVoice SDK with a CereProc license. A Coqui model was trained using the same data and open sourced by the founders of Coqui (https://coqui.ai/).
Scaling of impact to SDGs: Please detail how many citizens/communities and/or researchers/businesses this has had or can have a positive impact on, including particular groups where applicable and to what extent.
This year, consumers interacted with 4.2 billion digital voice assistants around the world, and that number is expected to double by 2024 (Juniper Research). The millions of people who adopt voice assistants into their daily routines will consist of a diverse population including male, female, and non-binary individuals. But TTS voices available today do not reflect that diversity, and this is already proving problematic: as UNESCO pointed out in their 2019 report, I’d Blush if I Could, designing only female voice assistants encourages negative behavior, both with the assistants and with real people.
In the U.S., 12% of millennials identify as transgender or gender non-conforming and 56% of Gen Z’ers know someone who uses gender-neutral pronouns, such as they/them. Additionally, a general population study conducted by Accenture indicated that 21% of non-binary individuals would prefer a non-binary voice assistant, compared to only 1-2% of male and female identifying individuals. This underscores the need for technology that represents the non-binary population.
The team included the non-binary community in the design and development of Sam, conducting two surveys to gather feedback and ensure that community members felt comfortable with the voice being designed to represent them. This helped shape the sound of the voice significantly. The feedback from the community was very positive, with one non-binary and speech-impaired individual even stating: “…this is my favorite I have ever heard and would buy it in an instant if it were available … I love this voice.”
Sam, and the AI process developed to make it, is one step towards a future where a diversity of voices is available to the public, and more people can see themselves and their peers reflected in technology.
Scaling of AI solution: Please detail what proof of concept or implementations you can show now in terms of its efficacy, how the solution can be scaled to provide a global impact, and how realistic that scaling is.
In addition to creating the first deployable version of a non-binary TTS voice, this project encourages users to create their own non-binary voices, either by using the open-sourced audio files with different TTS engines or by adopting the same process but recording new audio from different voice actors. The team did so in the hope of reflecting the diversity of the community and encouraging innovation in this space. Anyone can request access to the GitHub repository at this site: https://sam-accenture-non-binary-voice.github.io/request-sam/.
The repository also contains detailed instructions on the process, along with Idlak models and recipes that use the audio files. Coqui has already used the same audio data to create its own open-source models: https://github.com/coqui-ai/tts
The Sam voice itself can also be deployed at scale. With more investment, the team could collect the industry standard of 30 hours of voice actor recordings, instead of the 6 already recorded. This would make the voice more robust and more natural-sounding. It could also involve recording industry- or domain-specific terms to optimize Sam for use in specific environments (e.g., recording medical terms for use as a voice assistant in a health app). The non-binary aspect of the voice could also be taken a step further. Future work could aim to transfer part of the non-binary voice quality to the acoustic model, or could use some voice actor data in the prosody model training to improve compatibility between the two speech production models.
With Sam already created, and the process and audio data open-sourced, companies face far fewer barriers in terms of resources and cost to adopt Sam for their own purposes or create their own non-binary voice.
Ethical aspect: Please detail the way the solution addresses any of the main ethical aspects, including trustworthiness, bias, gender issues, etc.
Sam, a non-binary TTS voice, was made to address the lack of diversity in the 4.2 billion digital voice assistants around the world. To help address this issue in a way that would be beneficial to and valued by the most marginalized gender communities, researchers at Accenture Labs worked closely with members of the non-binary community on the development of Sam’s voice. Accenture surveyed non-binary individuals and, with their permission, used their feedback and audio data to influence not only pitch, but also speech patterns and intonation. The result is a voice that combines aspects of male and female voices to better resonate with the community it was designed to represent.
Additionally, Accenture conducted a general population survey to better understand how voice assistant users across demographics perceive non-binary voices. The results of this study can help inform further iterations of Sam as well as other non-binary voices such that they will be more likely to be positively received and adopted by the public.