I found a solution that should work for my use case. :)
My game is not reading out arbitrary sentences, but just a fixed list of words, so it made sense for me to use text-to-speech to pre-generate all the .ogg files I needed. It does add a bit to the size of the game, but saves me from having to link to external libraries or services, making it easier to port the game to any OS or device.
I wound up using Google Cloud Text-to-Speech from Python, doing basically what they do here and simply looping over my word list. You can find a list of available languages and voices here. I did this on Linux, setting up a virtualenv and installing the Google Cloud Python client library with pip (Google has docs on how to set this up; I don't remember all the steps I took).
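This is not the exact script, but a minimal sketch of such a loop. It assumes the `google-cloud-texttospeech` package is installed and credentials are configured. The function names, the file naming scheme, and the `en-US` default are made up for illustration. It requests LINEAR16 (WAV) output and skips words that already have a file, so an interrupted run can simply be started again.

```python
from pathlib import Path


def output_path(word: str, out_dir: Path, ext: str = ".wav") -> Path:
    """Map a word to its audio file path (hypothetical naming scheme)."""
    return out_dir / f"{word.lower().replace(' ', '_')}{ext}"


def synthesize_words(words, out_dir: Path, language_code: str = "en-US") -> None:
    """Generate one WAV file per word, skipping words that already exist."""
    # Imported here so output_path() can be used without the library installed.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    # Assumption: default voice for the language; pick a specific one if needed.
    voice = texttospeech.VoiceSelectionParams(language_code=language_code)
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16  # WAV output
    )

    out_dir.mkdir(parents=True, exist_ok=True)
    for word in words:
        target = output_path(word, out_dir)
        if target.exists():
            continue  # already generated in an earlier run
        response = client.synthesize_speech(
            input=texttospeech.SynthesisInput(text=word),
            voice=voice,
            audio_config=audio_config,
        )
        target.write_bytes(response.audio_content)
```

Calling something like `synthesize_words(["cat", "dog"], Path("audio"))` would then write `audio/cat.wav` and `audio/dog.wav`.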
I did get a random StatusCode.UNAVAILABLE error every now and then, so I had to resume the process manually from where it left off by removing the finished words from my list. It wasn't too bad, as I only had 5000 words to go through (a try/except with a long pause and a retry would probably work too).
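The try/except idea could look something like the untested sketch below. `call_with_retry` is a hypothetical helper, not part of any Google library, and the attempt count and pause length are arbitrary guesses.

```python
import time


def call_with_retry(func, retry_on=(Exception,), attempts=5,
                    pause_seconds=60.0, sleep=time.sleep):
    """Call func(); on a listed exception, pause and try again.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # give up after the final attempt
            sleep(pause_seconds)  # long pause before the next try
```

For the Google client, `retry_on` would most likely be `(google.api_core.exceptions.ServiceUnavailable,)`, which is the exception class that corresponds to StatusCode.UNAVAILABLE, and `func` would be a small lambda wrapping the request for one word.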
NOTE! Don't use the OGG_VORBIS format, as the OGG files generated by the text-to-speech engine don't work with Godot (see my bug report for details). You can always download them as WAV (LINEAR16) and then batch convert them to OGG afterwards, which works fine. I haven't tested MP3, so it might or might not work.
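Any batch converter should do for the WAV to OGG step. As one possible sketch, assuming `ffmpeg` with the libvorbis encoder is installed and on the PATH (the helper names and the quality value of 5 are arbitrary choices for illustration):

```python
import subprocess
from pathlib import Path


def ffmpeg_command(wav_path: Path, quality: int = 5) -> list[str]:
    """Build the ffmpeg call that converts one WAV file to OGG Vorbis."""
    ogg_path = wav_path.with_suffix(".ogg")
    return [
        "ffmpeg", "-y",            # overwrite existing output files
        "-i", str(wav_path),
        "-c:a", "libvorbis",       # Vorbis audio encoder
        "-q:a", str(quality),      # variable bitrate quality level
        str(ogg_path),
    ]


def convert_all(wav_dir: Path) -> None:
    """Convert every .wav file in wav_dir, writing .ogg files next to them."""
    for wav_path in sorted(wav_dir.glob("*.wav")):
        subprocess.run(ffmpeg_command(wav_path), check=True)
```

Running `convert_all(Path("audio"))` would leave the original WAV files in place, so they can be deleted once the OGG files are confirmed to play in Godot.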