Hey, got my thesis due next week and I'm just picking up some random people on the Internet to evaluate it. It's about transforming a speech signal with no emotion into something with emotion, most likely just anger in this case.
What I need is just some people to tell me what they think about the speech and how much anger you think it contains. I'll give like 8 different samples, all with different modifications, and you'd rate it from 1-5. Then I'd take the best overall result and do some modifications to it to find which parameters are the most important, and you rate that one from 1-5.
There's a bunch of last-minute stuff I'd like to do but might not have time for, like converting to sadness, surprise, etc., so there could be more. Also, this is probably the first attempt of its kind in the world, and I cut a hell of a lot of corners, so speech quality is rather poor, IMO.
Let me know if you're interested, and I'll drop you a PM to avoid any spoilers in the thread contaminating results from others.
Oh, and BTW, I kinda need this done ASAP; next week is a bit too late.
Disclaimer: Any sarcasm in my posts will not be mentioned as that would ruin the purpose. It is assumed that the reader is intelligent enough to tell the difference between what is sarcasm and what is not.
Doesn't take much time at all, there's like 10 samples, each less than a second. Anyone who's not a troll should qualify and should be able to do it in less than 5 minutes
Technology's still a bit primitive; it's a prototype, the first attempt of its kind, so I can't really do anything scary with it yet. But theoretically, it should work in real time with some optimizations.
All done! Thanks a lot to everyone who participated!
Idea:
To synthesize emotions into speech. Started with only anger here because I got it working literally half a week before submission and only had enough time to clone one emotion. Why? Because synthesized speech sucks. You've all probably heard Stephen Hawking or one of the voices that come with Windows. The idea here is that adding some emotion would make it sound a lot better, more human and less robot.
Conclusion:
It worked. Sorta. Rated 2.5 out of 5 for anger, 2.4 out of 5 for quality. Whether that's bad or not depends on what you're doing with it. I'd compare it to, well, nice pixel art. You know what the picture's supposed to be, but it's not exactly photorealistic. It's got a few artifacts in the speech, but that buzz actually sounds good once you're used to it.
If you're pulling a prank on someone, it'd work very well on someone who didn't expect it (kinda like Photoshop). For speech synthesizers, it works great at making them sound less boring: just shift the pitch contour higher to give a happier sound, lower to give a sad one.
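The "shift the contour" trick is basically a multiplier on the voiced frames. A toy sketch with made-up Hz values, not the thesis code (the real system operates on PSOLA pitch marks, not a plain list):

```python
# Toy illustration of biasing a pitch contour toward "happy" or "sad".
# The contour is a list of per-frame F0 values in Hz (0 = unvoiced frame).
# The factors below are illustrative guesses, not the thesis's parameters.

def color_contour(f0_contour, factor):
    """Scale every voiced frame's pitch by `factor` (>1 happier, <1 sadder)."""
    return [f0 * factor if f0 > 0 else 0.0 for f0 in f0_contour]

neutral = [110.0, 112.0, 0.0, 108.0, 115.0]   # Hz, made-up values
happier = color_contour(neutral, 1.15)         # raise contour ~15%
sadder  = color_contour(neutral, 0.90)         # lower contour ~10%
```

Unvoiced frames (the zeros) are left alone, since there's no pitch there to move.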
Also, anger has these spikes in the pitch and energy contours, and that's about all there is to it. It's difficult to simulate just because the contours change more than the transformer can handle. Almost any other emotion has more subtle differences, so it should work much better for those.
It's also basically a functional pitch contour transformer, i.e. it can correct you if you're singing out of tune. It's sort of like Photoshop for voice in that sense. But it can't really fix your voice if you suck at singing, and if you sing out of key by around 50 Hz, it'd have a techno-ish effect. 50 Hz is still a huge range to miss by... you shouldn't be singing that badly.
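Out-of-tune correction amounts to snapping the detected pitch to the nearest note. A hypothetical sketch of just the snapping step, assuming equal temperament around A4 = 440 Hz (the thesis doesn't specify how it handles scales):

```python
import math

A4 = 440.0  # reference pitch, Hz

def snap_to_semitone(f0):
    """Snap a detected pitch (Hz) to the nearest equal-tempered semitone.

    Returns 0.0 for unvoiced frames (f0 <= 0).
    """
    if f0 <= 0:
        return 0.0
    n = round(12 * math.log2(f0 / A4))   # nearest semitone offset from A4
    return A4 * 2 ** (n / 12)
```

The corrected value would then be fed to the pitch transformer as the target contour.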
Compared to what other people have done, well... it's the most successful emotional transformation so far, unless someone's put some top-secret research into something better.
Implementation:
Anyway, while all these PhD students were taking huge piles of statistics and hidden Markov models, basically trying to invert whatever knowledge they got from emotion detection, I took a dumber game-designer approach: I tried to simply quantify emotions as a bunch of numbers.
So I split it down into three variables: energy contour, duration modification, and pitch contour. I had a bunch of theories around these. One was to imitate the target emotion exactly, which didn't work out so well, because the voice just doesn't go higher than a certain pitch.
The others were kinda meh; proven wrong. One of them turned out kinda true: people don't really notice a lot of the bad effects. I guess we're used to listening to horribly compressed music/videos/phone speech. It's fine to just mess it up.
Technical stuff:
Well, I'm not sure what to say about this. I'm not going to give 100% of the details until the thesis is officially published by the uni; the whole patent possibilities thing and all.
The stuff I can say is common knowledge. It uses a standard PSOLA (a pitch-modifying algorithm). It's just a basic pitch modifier in essence, with modifications to let it change duration as well, even though that was theoretically a stupid thing to do. I think everyone was skeptical about that, lol.
And uh... yeah. I don't think any of you really play around with this stuff, so a detailed technical explanation wouldn't help. But if you've got questions, ask.
Why it shouldn't work:
I did take a hell of a lot of shortcuts. If it were a mechanical thing, it'd probably be duct-taped all over the place. Surprisingly, it held together, and while I was asking my supervisor why it didn't work... it did. It worked so well that he asked me if the synthesized speech was the original. I'm still scratching my head over it working at all, but it does.
1. I never used any of the formulas or methods suggested by the technical papers. I stared at them for like 4 months, went all "screw this", and wrote some random code based on the pictures.
2. The PSOLA doesn't use interpolation. In English: the pitch contour moves in big goddamn chunks, and nobody noticed.
3. The pitch detector doesn't work reliably, and the system needs to know the current pitch before deciding what to change it to. It's sort of like a plane auto-flying and landing without vision, unsure how high it's flying.
4. The pitch correction method is stupid. If someone screamed across a range of 40 to 400 Hz, it would just assume an error and treat them as screaming at 90 Hz for that whole range. The "angry" speech shouldn't work at all; that's the first speech file, for those of you who heard it.
5. It mixes voiced, unvoiced, and silent speech, which is epically stupid: they're very different things (in design, not theory). I think some of you heard a big 'pop' in the middle of the second speech file. That seems to be the only noticeable one; theoretically, it should be 'popping' all over the place.
6. There are like 20 pages written on how to do duration modification properly. My system uses a "choose it at random" approach. Both work almost equally well, but my algorithm messes up epically when it stretches duration by more than 1.5x.
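For what it's worth, the interpolation missing from point 2 would just be linear interpolation between successive pitch marks instead of holding each value in a big chunk. A hypothetical sketch:

```python
def interpolate_contour(marks, values, step=1):
    """Linearly interpolate a stepwise pitch contour.

    marks  : sorted sample positions where pitch was measured
    values : pitch (Hz) at each mark, same length as marks
    Returns (positions, pitches) sampled every `step` samples between marks,
    instead of the chunky piecewise-constant contour.
    """
    xs, ys = [], []
    for x0, x1, y0, y1 in zip(marks, marks[1:], values, values[1:]):
        for x in range(x0, x1, step):
            t = (x - x0) / (x1 - x0)   # fraction of the way to the next mark
            xs.append(x)
            ys.append(y0 + t * (y1 - y0))
    xs.append(marks[-1])
    ys.append(values[-1])
    return xs, ys
```

That nobody noticed the chunks fits the earlier observation: listeners forgive a lot.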
Anyway, it raises some big questions about why these shortcuts worked at all, and it accidentally unlocked another branch of research into this stuff.
Interesting that more work hasn't been put into good speech synthesizers with emotion. Or at least "publicly". Does this have commercial uses or do you have any plans for practical uses?
There's been like a bit of "work", but no actual work. As in, a lot of people are summarizing what others did, but nobody's taken steps in any direction in particular. The only significant previous one is a master's thesis, and it's all about synthesizers, not transformation. I know Microsoft puts a huge pile of money into speech technology, and they're the top researchers in the world on it, but they keep all their stuff secret.
Commercial uses... not really. Like a bunch of research stuff, it's about trying to find another way to do something, and the point of any good research is to raise more questions. Emotion detection has a straight-off commercial use, namely handling angrier calls at call centers, but this is like two steps away from that.
Heh, I'm a little surprised how many academics are thinking of using this to pull pranks on people
Eh, no worries. I got the data right where I wanted it... saying that it worked, lol. More people would make it more accurate, but great that you offered to help anyway.