File to Speech

Hello there! This post showcases a file-to-speech website I built with my own custom engine, which uses WebGPU as its rendering backend. One thing I miss when listening to books on Audible is seeing the text itself: it anchors me in the book and lets me focus more, let my mind wander a bit, and imagine the scenes. My other half (Heidy) and I relied heavily on the reader feature inside MS Edge for this, but it too often breaks or stops working for no apparent reason, so I decided to build my own.

If you are on Linux or the webpage shows nothing for you, please go here to check the implementation status for your particular platform and how to work around it.

For TTS I am using Piper TTS, and the only available voices are the ones I could confidently say were intended for public use.

Tech Used

Rust

WebGPU

Piper TTS

Available Here!

Supported file types

The site only supports text, which means no images in the file will be displayed. The file types supported are:

PDF

although I would not recommend it, as PDF does not include any semantic information about its text blocks, so I have to guess where paragraphs start and end.

EPUB

this one works well unless the author tried some fancy HTML, in which case the `HTML to String` parser I am using will fail in interesting ways.

TXT

this one works best, as there is not much you can do in plain text to break it. The system currently does not do any Markdown parsing, so keep that in mind. For the title, author, and chapters to be parsed properly, the file has to be formatted in the following way:

```
#DOC_TYPE TXT
Example title
Example author

#CHAPTER Chapter name
Chapter Text Here

#CHAPTER Chapter name
Chapter Text Here
```
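
To make the layout above concrete, here is a minimal sketch of how such a file could be split into a title, an author, and chapters. It is not the site's actual parser: the `Book` and `Chapter` types and the `parse_txt_book` function are hypothetical names, and the choices to ignore text before the first `#CHAPTER` marker and to reject files without the `#DOC_TYPE TXT` header are assumptions.

```rust
/// One chapter of a parsed TXT book (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct Chapter {
    name: String,
    text: String,
}

/// Title, author, and chapters extracted from the TXT layout.
#[derive(Debug, PartialEq)]
struct Book {
    title: String,
    author: String,
    chapters: Vec<Chapter>,
}

/// Parses the `#DOC_TYPE TXT` layout. Returns `None` when the header,
/// title, or author line is missing (an assumed policy, not the site's).
fn parse_txt_book(input: &str) -> Option<Book> {
    let mut lines = input.lines();

    // The first line must declare the document type.
    if lines.next()?.trim() != "#DOC_TYPE TXT" {
        return None;
    }
    // The next two lines are the title and the author, in that order.
    let title = lines.next()?.trim().to_string();
    let author = lines.next()?.trim().to_string();

    let mut chapters: Vec<Chapter> = Vec::new();
    for line in lines {
        if let Some(name) = line.strip_prefix("#CHAPTER") {
            // A marker line starts a new chapter; the rest is its name.
            chapters.push(Chapter {
                name: name.trim().to_string(),
                text: String::new(),
            });
        } else if let Some(current) = chapters.last_mut() {
            // Any other line belongs to the most recent chapter.
            current.text.push_str(line);
            current.text.push('\n');
        }
        // Lines before the first #CHAPTER marker are ignored.
    }

    // Drop the blank lines that separate chapters in the file.
    for chapter in &mut chapters {
        chapter.text = chapter.text.trim().to_string();
    }

    Some(Book { title, author, chapters })
}
```

Calling `parse_txt_book` on the example file above would yield a `Book` with two chapters, each holding the text between its marker and the next one.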

Misc

The UI interaction system was guided by what is explained in the following videos:
Unite 2013 - Wrangling OnGUI
Immediate-Mode Graphical User Interfaces (cmuratori)
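
For readers who have not seen those videos, the core idea of an immediate-mode UI can be sketched in a few lines: widgets are declared every frame, and only a tiny amount of state (which widget is hovered and which one has captured the press) survives between frames. The sketch below is a generic illustration, not the engine's code; `Ui`, `Rect`, `Pointer`, and `button` are hypothetical names, and it simplifies by not checking where the press originally started.

```rust
/// Pointer state sampled once per frame.
#[derive(Clone, Copy)]
struct Pointer {
    x: f32,
    y: f32,
    down: bool,
}

/// Axis-aligned rectangle occupied by a widget.
#[derive(Clone, Copy)]
struct Rect {
    x: f32,
    y: f32,
    w: f32,
    h: f32,
}

impl Rect {
    fn contains(&self, px: f32, py: f32) -> bool {
        px >= self.x && px < self.x + self.w && py >= self.y && py < self.y + self.h
    }
}

/// The only state kept between frames: the hovered widget (`hot`)
/// and the widget that captured the current press (`active`).
#[derive(Default)]
struct Ui {
    hot: Option<u32>,
    active: Option<u32>,
}

impl Ui {
    /// Declares a button for this frame and returns `true` on the
    /// frame the pointer is released over it after pressing it.
    fn button(&mut self, id: u32, rect: Rect, pointer: Pointer) -> bool {
        let inside = rect.contains(pointer.x, pointer.y);

        // Track hover state.
        if inside {
            self.hot = Some(id);
        } else if self.hot == Some(id) {
            self.hot = None;
        }

        let mut clicked = false;
        if self.active == Some(id) {
            // This button owns the press: a release ends it, and it
            // counts as a click only if the pointer is still inside.
            if !pointer.down {
                clicked = inside;
                self.active = None;
            }
        } else if inside && pointer.down && self.active.is_none() {
            // Capture the press so no other widget reacts to it.
            self.active = Some(id);
        }
        clicked
    }
}
```

The application would call `button` once per frame for every visible button and react immediately when it returns `true`, which is why no retained widget tree or callback registration is needed.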

The text rendering is done using SDFs generated at build time and embedded into the application's wasm module. The TTS is handled by the ONNX runtime on the web through the library vits-web. The runtime is started in a web worker to ensure it does not affect the main thread; this also allowed me to easily destroy the entire web worker when the ONNX runtime ran into an unrecoverable issue while running. The module originally had all the released Piper TTS voices, but most of them do not have a CC0/public domain license, so I removed them. The only available voices are the ones I could confidently say were intended for public use.
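
As a rough illustration of how SDF text rendering turns a distance sample into a pixel's coverage, the sketch below maps one sampled value to an alpha with a smoothstep around the glyph edge. This normally runs in a fragment shader; it is written here as plain Rust for readability and is not the engine's code. The function names, the convention that a sample of 0.5 marks the edge, and the `softness` parameter are all assumptions.

```rust
/// Hermite interpolation between two edges, as found in most shading languages.
fn smoothstep(edge0: f32, edge1: f32, x: f32) -> f32 {
    let t = ((x - edge0) / (edge1 - edge0)).clamp(0.0, 1.0);
    t * t * (3.0 - 2.0 * t)
}

/// Converts one SDF sample into glyph coverage in [0, 1].
///
/// `sample` is the distance value read from the SDF texture, normalized
/// to [0, 1], with 0.5 assumed to lie exactly on the glyph outline.
/// `softness` is the half-width of the antialiasing band and must be > 0.
fn sdf_alpha(sample: f32, softness: f32) -> f32 {
    smoothstep(0.5 - softness, 0.5 + softness, sample)
}
```

Samples well inside the glyph come out fully opaque, samples well outside come out transparent, and only the thin band around the outline is blended, which is what keeps SDF text sharp at different scales.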