File to Speech
Hello there! This post showcases a file-to-speech website I built
using my own custom engine, which uses WebGPU as its rendering
backend. One thing I miss when listening to books on Audible is seeing
the text itself: it anchors me in the book and helps me focus, while
still letting my mind wander a bit and imagine the scenes. My other half
(Heidy) and I relied heavily on the reader feature in MS Edge for this,
but it breaks or stops working for no apparent reason far too often, so
I decided to build my own.
If you are on Linux, or the webpage shows nothing for you, please go
here to check the implementation status on your particular platform and
how to work around it.
For TTS, I am using Piper TTS, and
the only available voices are the ones I could confidently say were
intended for public use.
Rust
WebGPU
JS
Piper TTS
Available Here!
The site only supports text, which means no images in the file will be displayed. The supported file types are:
PDF
this one works, although I would not recommend it: PDF does not include any semantic information about text blocks, so I have to guess where paragraphs start and end.
EPUB
this one works well unless the author tried some fancy HTML, in which case the `HTML to String` parser I am using will fail in interesting ways.
TXT
this one works best, as there is not much a plain text file can get
wrong. The system currently does not do any Markdown parsing, so keep
that in mind. To have the title, author, and chapters parsed properly,
format the file as follows:
```
#DOC_TYPE TXT
Example title
Example author
#CHAPTER
Chapter name
Chapter Text Here
#CHAPTER
Chapter name
Chapter Text Here
```
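To make the TXT layout above concrete, here is a minimal Rust sketch of a parser for it. It is not the site's actual parser: `Book`, `Chapter`, and `parse_txt` are hypothetical names, and the sketch assumes the header, title, and author each sit on their own line, the line right after each `#CHAPTER` is the chapter name, and every following line until the next `#CHAPTER` is body text.

```rust
// Hypothetical types for illustration; not the site's real data model.
#[derive(Debug, PartialEq)]
pub struct Chapter {
    pub name: String,
    pub text: String,
}

#[derive(Debug, PartialEq)]
pub struct Book {
    pub title: String,
    pub author: String,
    pub chapters: Vec<Chapter>,
}

/// Parses the `#DOC_TYPE TXT` layout: header, title line, author line,
/// then any number of `#CHAPTER` blocks (chapter name, then body lines).
pub fn parse_txt(input: &str) -> Result<Book, String> {
    // `lines()` also strips a trailing '\r', so CRLF files work too.
    let mut lines = input.lines();
    match lines.next().map(str::trim) {
        Some("#DOC_TYPE TXT") => {}
        other => return Err(format!("expected `#DOC_TYPE TXT`, found {:?}", other)),
    }
    let title = lines.next().ok_or("missing title")?.trim().to_string();
    let author = lines.next().ok_or("missing author")?.trim().to_string();

    let mut chapters: Vec<Chapter> = Vec::new();
    let mut awaiting_name = false;
    for line in lines {
        if line.trim() == "#CHAPTER" {
            // A marker opens a new chapter; its name is on the next line.
            chapters.push(Chapter { name: String::new(), text: String::new() });
            awaiting_name = true;
        } else if let Some(ch) = chapters.last_mut() {
            if awaiting_name {
                ch.name = line.trim().to_string();
                awaiting_name = false;
            } else {
                // Keep body lines separated by newlines, preserving blank lines.
                if !ch.text.is_empty() {
                    ch.text.push('\n');
                }
                ch.text.push_str(line);
            }
        }
        // Lines before the first `#CHAPTER` are ignored in this sketch.
    }
    Ok(Book { title, author, chapters })
}
```

Returning an error on a missing header, rather than guessing, keeps a mis-formatted file from silently turning its first lines into a bogus title and author.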
The UI interaction system follows the ideas explained in the
following videos:
Unite 2013 - Wrangling OnGUI
Immediate-Mode Graphical User Interfaces (cmuratori)
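Both talks describe immediate-mode UI, where widgets are plain function calls made every frame and the only state kept between frames is which widget is "hot" (under the cursor) and which is "active" (being pressed). The Rust sketch below illustrates that idea; `Ui`, `Input`, `Rect`, and `button` are hypothetical names, and this is not the engine's real UI code.

```rust
/// Hypothetical per-frame input snapshot.
#[derive(Clone, Copy, Default)]
pub struct Input {
    pub mouse_x: f32,
    pub mouse_y: f32,
    pub mouse_down: bool,
}

#[derive(Clone, Copy)]
pub struct Rect {
    pub x: f32,
    pub y: f32,
    pub w: f32,
    pub h: f32,
}

impl Rect {
    fn contains(&self, px: f32, py: f32) -> bool {
        px >= self.x && px < self.x + self.w && py >= self.y && py < self.y + self.h
    }
}

#[derive(Default)]
pub struct Ui {
    pub input: Input,
    prev_down: bool,         // mouse state last frame, to detect the press edge
    pub hot: Option<u64>,    // widget under the cursor this frame
    pub active: Option<u64>, // widget currently being pressed
}

impl Ui {
    /// Called once per frame with fresh input; widgets recompute `hot`.
    pub fn begin_frame(&mut self, input: Input) {
        self.prev_down = self.input.mouse_down;
        self.input = input;
        self.hot = None;
    }

    /// Immediate-mode button: no retained widget object, just an id and a rect.
    /// Returns true on the frame the mouse is released over the button
    /// that the press started on.
    pub fn button(&mut self, id: u64, rect: Rect) -> bool {
        let inside = rect.contains(self.input.mouse_x, self.input.mouse_y);
        if inside {
            self.hot = Some(id);
        }
        let pressed = self.input.mouse_down && !self.prev_down;
        let mut clicked = false;
        if self.active == Some(id) {
            if !self.input.mouse_down {
                // Release: only a click if the cursor is still over the button.
                clicked = inside;
                self.active = None;
            }
        } else if inside && pressed && self.active.is_none() {
            self.active = Some(id);
        }
        clicked
    }
}
```

Because the press must start on the button and the release must also land on it, dragging off a button before letting go cancels the click, which is the usual hot/active behavior.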
The text rendering uses signed distance fields (SDFs) generated at build
time and embedded into the application's wasm module. TTS is handled by
the ONNX runtime for the web through the vits-web library. The runtime
is started in a web worker so that it does not block the main thread.
This also let me easily destroy the entire web worker whenever the ONNX
runtime ran into an unrecoverable error. The module includes all the
released Piper TTS voices, but most of them do not have a CC0/public
domain license, so I removed them.
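As a rough illustration of the build-time SDF step, the Rust sketch below computes a brute-force signed distance field from a binary glyph bitmap: each pixel stores the distance to the nearest pixel of the opposite state, clamped to `spread` pixels and packed into one byte. The post does not say which generator the engine uses, so `generate_sdf` and its byte encoding are assumptions; real tools usually rely on faster exact distance transforms.

```rust
/// Brute-force signed distance field from a binary glyph bitmap (illustrative).
/// `inside[y * width + x]` is true where the glyph covers the pixel.
/// Output: one byte per pixel; values above 128 are inside the glyph,
/// values below 128 are outside, with distances clamped to `spread` pixels.
pub fn generate_sdf(inside: &[bool], width: usize, height: usize, spread: usize) -> Vec<u8> {
    assert_eq!(inside.len(), width * height);
    assert!(spread > 0, "spread must be positive");
    let s = spread as isize;
    let mut out = Vec::with_capacity(width * height);
    for y in 0..height {
        for x in 0..width {
            let here = inside[y * width + x];
            // Search a (2*spread+1)^2 window for the nearest opposite-state pixel.
            let mut best = spread as f32;
            for dy in -s..=s {
                for dx in -s..=s {
                    let nx = x as isize + dx;
                    let ny = y as isize + dy;
                    if nx < 0 || ny < 0 || nx >= width as isize || ny >= height as isize {
                        continue;
                    }
                    if inside[ny as usize * width + nx as usize] != here {
                        let d = ((dx * dx + dy * dy) as f32).sqrt();
                        best = best.min(d);
                    }
                }
            }
            // Positive inside, negative outside, normalized to [-1, 1].
            let signed = if here { best } else { -best };
            let normalized = signed / spread as f32;
            // Map [-1, 1] to bytes around 128.
            let byte = (128.0 + normalized * 127.0).round().clamp(0.0, 255.0) as u8;
            out.push(byte);
        }
    }
    out
}
```

At render time, a fragment shader can sample this texture and turn values near 128 into a smooth edge (for example with WGSL's `smoothstep`), which is what keeps SDF text sharp at any scale.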