I imagine you're not old enough to remember PC Speaker sounds... intelligible speech and more were done with ONE bit per sample.
All you really need is some method of indicating WHERE the speaker cone should be at any given time. With 2 bits per sample, as you showed, you get 4 positions. As long as you can MOVE the speaker cone ANY distance, you can create sound. Of course, having more positions (more bits per sample) gives you finer control over the position, and thus better-sounding output.
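To make the "positions" idea concrete, here is a minimal Python sketch. The names `quantize` and `sine_positions` are made up for illustration, and it assumes samples run from -1.0 (cone fully in) to 1.0 (cone fully out), which is a convention I picked, not something any particular sound format requires.

```python
import math

def quantize(sample, bits):
    """Map a sample in [-1.0, 1.0] to one of 2**bits integer cone positions."""
    levels = 2 ** bits
    # Scale [-1, 1] onto [0, levels - 1] and round to the nearest position.
    position = round((sample + 1.0) / 2.0 * (levels - 1))
    # Clamp in case the input strays slightly outside [-1, 1].
    return max(0, min(levels - 1, position))

def sine_positions(bits, samples_per_cycle=16):
    """One cycle of a sine wave expressed as cone positions (hypothetical helper)."""
    return [quantize(math.sin(2 * math.pi * n / samples_per_cycle), bits)
            for n in range(samples_per_cycle)]

# With 2 bits every sample lands on one of the positions 0..3;
# with 1 bit there are only two positions, 0 and 1.
two_bit = sine_positions(2)
one_bit = sine_positions(1)
```

Even the 1-bit version still moves the cone back and forth once per cycle, which is why it still makes a tone.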
Small movements (say, in this example, 1 -> 2) shift the speaker cone only a short distance, creating a quieter sound.
Big movements (3 -> 0) swing the cone a long way, making a louder sound.
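As a rough sketch of small versus big movements, the snippet below compares the swing of a 1 <-> 2 wiggle against a 3 <-> 0 wiggle. It assumes an idealized speaker whose cone displacement is directly proportional to the position number, which a real speaker only approximates; `swing` is just an illustrative helper name.

```python
import math

def swing(positions):
    """Peak-to-peak cone travel, measured in position steps."""
    return max(positions) - min(positions)

small = [1, 2] * 4   # cone alternates between positions 1 and 2
big = [3, 0] * 4     # cone alternates between positions 3 and 0

# Relative level of the big swing versus the small one, in decibels,
# under the idealized assumption that amplitude follows the swing.
ratio_db = 20 * math.log10(swing(big) / swing(small))
```

Here the big wiggle travels 3 steps against 1 step for the small one, so it is about 9.5 dB louder under that assumption.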
The speaker can always move from one end of its physical travel to the other, regardless of how many bits per sample you are using. More bits just give you more discrete positions to select along that travel, and therefore better sound quality.
Lower bits-per-sample generally gives a 'buzzy' kind of square-wave output, which the analog speaker (whose cone can't jump instantly) smooths into an altered 'squared-sine wave': a sine wave with kinda square rises and drops instead of the smooth rise and fall of a nice sine (sound) wave.
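Here is a minimal sketch of that 'squared-sine' effect. It reduces a sine wave to 1 bit per sample and then runs it through a simple one-pole low-pass filter. The filter is only my crude stand-in for the cone's inertia, and `alpha=0.3` is an arbitrary value chosen for illustration, not a measured property of any speaker.

```python
import math

def one_bit_sine(samples_per_cycle=32, cycles=2):
    """A sine wave reduced to 1 bit per sample: fully out (1.0) or fully in (-1.0)."""
    total = samples_per_cycle * cycles
    # Sample at n + 0.5 so no sample falls exactly on a zero crossing.
    return [1.0 if math.sin(2 * math.pi * (n + 0.5) / samples_per_cycle) >= 0 else -1.0
            for n in range(total)]

def smooth(signal, alpha=0.3):
    """One-pole low-pass filter; a crude model of cone inertia (assumption)."""
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)   # move part of the way toward the requested position
        out.append(y)
    return out

square = one_bit_sine()    # hard-edged square wave, the 'buzzy' signal
rounded = smooth(square)   # same wave with the edges rounded off
```

The `square` list only ever holds -1.0 or 1.0, while `rounded` ramps toward each extreme instead of jumping, which is roughly the shape described above.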