I posted about this upthread, but there are two things at work here.
First, there's suggestibility: if someone tells you you'll hear "X", your brain gears up to hear what it expects.
Second, there are two voices on the recording, one saying one name and one saying the other. The voices sit in different frequency ranges, but they play simultaneously.
Cut the high frequencies and you hear one name; cut the low frequencies and you hear the other.
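
If you want to see the filtering idea in action, here's a rough Python sketch. It isn't the actual recording: the two "voices" are just pure tones at 200 Hz and 3000 Hz, and the sample rate, the 500 Hz cutoff and all the function names are my own assumptions for illustration. The filter is a very gentle one-pole design, so it only turns the other tone down rather than removing it, but that's enough to show which one dominates after each cut.

```python
import math

SAMPLE_RATE = 8000                # Hz, assumed for this toy example
LOW_HZ, HIGH_HZ = 200.0, 3000.0   # stand-ins for the two "voices" (assumed values)
CUTOFF_HZ = 500.0                 # assumed cutoff between the two


def tone(freq, n, rate=SAMPLE_RATE):
    """n samples of a sine wave at freq Hz."""
    return [math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


def low_pass(samples, cutoff, rate=SAMPLE_RATE):
    """One-pole low-pass filter: turns down content above the cutoff."""
    rc = 1.0 / (2 * math.pi * cutoff)
    dt = 1.0 / rate
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for s in samples:
        prev += alpha * (s - prev)
        out.append(prev)
    return out


def high_pass(samples, cutoff, rate=SAMPLE_RATE):
    """Complement of the low-pass: keeps what the low-pass turns down."""
    low = low_pass(samples, cutoff, rate)
    return [s - l for s, l in zip(samples, low)]


def strength(samples, freq, rate=SAMPLE_RATE):
    """Approximate amplitude of freq in samples (correlate with sin and cos)."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq * i / rate) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq * i / rate) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n


n = SAMPLE_RATE  # one second of audio
mix = [a + b for a, b in zip(tone(LOW_HZ, n), tone(HIGH_HZ, n))]  # both at once

cut_high = low_pass(mix, CUTOFF_HZ)   # "cut the high frequencies"
cut_low = high_pass(mix, CUTOFF_HZ)   # "cut the low frequencies"

for label, signal in (("original", mix), ("highs cut", cut_high), ("lows cut", cut_low)):
    print(label, "low:", round(strength(signal, LOW_HZ), 2),
          "high:", round(strength(signal, HIGH_HZ), 2))
```

In the original mix both tones come out about equally strong. After cutting the highs the low tone dominates, and after cutting the lows the high tone does, which is the same thing that happens to the two voices on the recording. A real audio tool would use a much steeper filter than this one.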