Using VOICEVOX CORE with Python (Text to Speech)

Info

This article is translated from Japanese to English.

https://404background.com/program/voicevox-core/

Introduction

In a previous article, I tried speech synthesis and speech recognition using Google Cloud Platform, but frequent use can become costly. Since I wanted to run everything for free, this time I tried using VOICEVOX.

It is described as “a free and moderately high-quality text-to-speech software,” but I personally think it's high-quality enough and an excellent piece of software. Implementing a speech synthesis engine from scratch would be a lot of work...

I ran into several errors while using it, so I’ve written down how I resolved them. Depending on future updates, the information may change, but I hope it will be useful.

▼VOICEVOX official page

https://voicevox.hiroshiba.jp/

▼VOICEVOX software terms of use

https://voicevox.hiroshiba.jp/term/

▼Previous related article

Trying out the Text-to-Speech API / Speech-to-Text API (Google Cloud Platform)


Overview of VOICEVOX

▼If you just want to try it out easily, download it from here. It comes as a desktop application, which appears to be built with Electron.
https://voicevox.hiroshiba.jp/

▼Here is the overall architecture of VOICEVOX. This time, we will use the core component.
https://github.com/VOICEVOX/voicevox/blob/main/docs/%E5%85%A8%E4%BD%93%E6%A7%8B%E6%88%90.md

▼GitHub repository for the core component
https://github.com/VOICEVOX/voicevox_core

Running with Python

Setting up the environment

▼Python environment setup instructions are written here.
https://github.com/VOICEVOX/voicevox_core/blob/main/example/python/README.md

This time, I am running it on Windows using VS Code, through a PowerShell terminal. My Python version is 3.10.11.

▼I am using version 0.14.5 of voicevox_core.
https://github.com/VOICEVOX/voicevox_core/releases/tag/0.14.5

First, download download.exe into the voicevox folder by following the commands in the README linked above, and run it. The required libraries will be downloaded.

Next, install the voicevox_core Python library with pip.
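
To confirm which release pip actually installed, a quick standard-library check like this should do (it reads the package metadata only, so it works even before the DLL issue described below is fixed):

from importlib.metadata import version

# Should print the release you installed, e.g. 0.14.5
print(version("voicevox_core"))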

▼The folder structure looks like this.

Running the program

I tried running the sample program from the voicevox_core repository, but I encountered errors and was unable to run it.

▼Here is the sample program I tried.
https://github.com/VOICEVOX/voicevox_core/tree/main/example/python

First, I got an error saying that a DLL could not be loaded, even though I had installed the 64-bit versions of both Python and VOICEVOX.

You can fix this by placing onnxruntime.dll, which is included in the voicevox_core folder you downloaded earlier, somewhere the program can find it.

I was able to avoid the error in two ways:

  • Run the program in the directory where onnxruntime.dll is located
  • Copy onnxruntime.dll into the voicevox_core folder installed via pip (see the sketch after this list for locating that folder)
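
For the second workaround you need to know where pip installed the package. Here is a minimal sketch (standard-library Python only, nothing VOICEVOX-specific) that prints that folder without importing voicevox_core, since that import is exactly what fails until the DLL problem is fixed:

from importlib.util import find_spec
from pathlib import Path

# Locate the installed voicevox_core package without executing it
spec = find_spec("voicevox_core")
print(Path(spec.origin).parent)  # the site-packages voicevox_core folder to copy the DLL into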

Next, I also encountered an error saying that voicevox_core.blocking does not exist. This may be resolved in a future update.

About voicevox_core.blocking

The Python binding in the current voicevox_core source does include a blocking.py.

▼Here it is.
https://github.com/VOICEVOX/voicevox_core/tree/main/crates/voicevox_core_python_api

I was using version 0.14.5, which was the latest release at the time, but version 0.15.0 looks likely to be released soon. When that happens, the released package will presumably include voicevox_core.blocking, and the official sample should run.

The following article has a program that does not use voicevox_core.blocking, and that version ran successfully for me.

▼Here is the article.

https://qiita.com/taka7n/items/1dc61e507274b93ee868

You can specify the speaker by ID, and the following program displays a list of available speakers.

# Print the metadata of the available speakers (names, styles, and style IDs)
from voicevox_core import METAS
from pprint import pprint

pprint(METAS)
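
pprint dumps the raw metadata objects. If you just want a compact name-to-ID table instead, a sketch like the one below should work; it assumes that each entry in METAS exposes name and styles fields (each style with its own name and id), as they appear in the 0.14 Python binding, so double-check against the pprint output above if anything differs.

from voicevox_core import METAS

# Print "speaker (style): id" for every available voice
for speaker in METAS:
    for style in speaker.styles:
        print(f"{speaker.name} ({style.name}): id={style.id}")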

Save the following program in the previously created voicevox_core folder and run it.

from pathlib import Path
import sys

from voicevox_core import VoicevoxCore

# Initialize the core with the bundled Open JTalk dictionary
core = VoicevoxCore(open_jtalk_dict_dir=Path("./open_jtalk_dic_utf_8-1.11"))
speaker_id = 2  # 2 = Shikoku Metan (normal style)

# The text to synthesize is taken from the first command-line argument
text = sys.argv[1]

# Load the voice model for this speaker if it is not loaded yet, then synthesize
if not core.is_model_loaded(speaker_id):
    core.load_model(speaker_id)
wave_bytes = core.tts(text, speaker_id)

# Write the result to <text>.wav in the current folder
with open("./" + text + ".wav", "wb") as f:
    f.write(wave_bytes)

Since the ID is 2, the voice used is the normal voice of “Shikoku Metan.”

Since the program converts its command-line argument into speech, run it with the text you want to synthesize as the first argument.

▼If you encounter an error here, check the path to the Open JTalk dictionary folder (open_jtalk_dic_utf_8-1.11).
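
Because the dictionary path in the program is relative, it is resolved against the directory you run the script from. A quick sanity check in plain Python:

from pathlib import Path

dict_dir = Path("./open_jtalk_dic_utf_8-1.11")
# Shows which folder will actually be used and whether it exists
print(dict_dir.resolve(), dict_dir.exists())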

A file named after the input text, with a .wav extension, will be created.

▼Here is an example of the generated audio.
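
Incidentally, core.tts() appears to bundle two steps: building an AudioQuery from the text and then synthesizing audio from that query. Calling the steps separately lets you tweak parameters before synthesis. The following is a rough sketch; the audio_query/synthesis method names and the speed_scale field are taken from my reading of the 0.14 Python binding, so verify them against the API docs for the version you have installed.

from pathlib import Path
from voicevox_core import VoicevoxCore

core = VoicevoxCore(open_jtalk_dict_dir=Path("./open_jtalk_dic_utf_8-1.11"))
speaker_id = 2  # Shikoku Metan (normal)

if not core.is_model_loaded(speaker_id):
    core.load_model(speaker_id)

# Step 1: build a query from the text, then adjust it
query = core.audio_query("こんにちは", speaker_id)
query.speed_scale = 1.2  # speak a little faster

# Step 2: synthesize a WAV from the (modified) query
wave_bytes = core.synthesis(query, speaker_id)
with open("fast.wav", "wb") as f:
    f.write(wave_bytes)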

Conclusion

Setting up the environment is always the hardest part, but it is now up and running. I will need to keep an eye on future updates.

Being able to run this from Python means it can be integrated into applications, called from Node-RED, or embedded in robots, which opens up many possibilities. The next thing I would like is speech recognition.
