My youngest daughter has an early school bus and my older daughter has a 7:38 bus. Every morning I get my youngest to her bus first, then take my oldest to her bus stop on the corner and walk to work, so my schedule is quite regular. I can walk to work different ways, but I always start by walking two blocks down Atlantic Avenue. That puts me there from about 7:38 to 7:42 every day.
Pretty much every morning, three things happen in those two blocks.
After about half a block, I hear little-kid running footsteps. A small girl passes me, sprinting. I hear more footsteps, and her slightly older brother passes me. She's always first, so I'm guessing he gives her a head start. The first time it happened I instinctively turned around to make sure a parent was watching, but now I know their mom is right behind. And they always wait at the corner before turning right.
Then a short boy walks past me in the opposite direction. I originally thought he was about 12, but he's actually heading to my daughter's bus stop and must be 15, because they're in the same grade. He's always late. The bus is usually a little late too, but occasionally it has already left. Since it comes toward him, though, he flags it down, and the driver is nice and stops for him. He plays hockey and is occasionally carrying a bag of hockey equipment as big as he is.
In the second block, a guy walks past me walking his short, stocky dog, a cute dog with a bandana around its neck. He must walk the dog at exactly the same time every day; we literally pass each other at around the same spot on the block every morning. It's one of those things where we've started doing the silent head-nod acknowledgment, even though we've never spoken and probably never will.
Getting computers to read text aloud and sound like people has long been a challenge. But a Google team seems to have hit parity with human speech (paper):
The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Sure, perfectly clear to me. But they follow with:
Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech
Comparable to professionally recorded speech: that I understand. Tacotron 2 is a single neural network trained from data alone. That seems to be the direction AI is going: neural networks that are easily trained.
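The quoted architecture has two stages: a feature-prediction network that maps character embeddings to a mel-scale spectrogram, and a vocoder (a modified WaveNet) that turns the spectrogram into a waveform. As a rough sketch of that data flow only, here is a toy version in Python where random matrices stand in for both trained networks. None of this is the actual Tacotron 2 code; the shapes and names are illustrative assumptions (Tacotron 2 does use 80 mel channels, but `EMBED_DIM` and `HOP` here are arbitrary):

```python
import numpy as np

# Toy sketch of the two-stage pipeline: (1) characters -> embeddings ->
# mel-scale spectrogram frames, (2) vocoder: spectrogram -> waveform.
# The "models" are random projections, purely to show the data flow.

N_MELS = 80      # mel-spectrogram channels (Tacotron 2 uses 80)
EMBED_DIM = 16   # toy character-embedding size (arbitrary)
HOP = 256        # toy audio samples generated per spectrogram frame

rng = np.random.default_rng(0)
char_embeddings = rng.normal(size=(128, EMBED_DIM))  # one vector per ASCII char
feature_net = rng.normal(size=(EMBED_DIM, N_MELS))   # stands in for the seq2seq net
vocoder_filters = rng.normal(size=(N_MELS, HOP))     # stands in for WaveNet

def text_to_mel(text: str) -> np.ndarray:
    """Stage 1: characters -> embeddings -> mel-spectrogram frames."""
    embedded = char_embeddings[[ord(c) % 128 for c in text]]  # (T, EMBED_DIM)
    return embedded @ feature_net                             # (T, N_MELS)

def mel_to_waveform(mel: np.ndarray) -> np.ndarray:
    """Stage 2 (vocoder): spectrogram frames -> time-domain samples."""
    return (mel @ vocoder_filters).ravel()                    # (T * HOP,)

mel = text_to_mel("hello")
audio = mel_to_waveform(mel)
print(mel.shape)    # (5, 80): one frame per character in this toy version
print(audio.shape)  # (1280,): 5 frames * 256 samples each
```

The real system replaces both random projections with deep networks trained end to end on text/audio pairs, and the spectrogram is the only interface between them, which is what lets the two stages be trained and improved somewhat independently.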