Episode 9: Disappointment

Let me go ahead and apologize for this episode.

You know how some days you code for 5 minutes and it makes a huge improvement in your app... and some days you debug for 8 hours and the only "progress" you can say you've made is that now you really understand how bad your app is? We're going to do a lot of work this episode only to discover just how inefficient our putpixel routine really is, and by the end you'll be convinced that we need a major shift in the way we handle graphics.

At the end of the last episode we saw that continually erasing and redrawing our player square is causing lots of flickering, and realized that we need to try to synchronize our drawing to the screen refresh rate. This is how a CRT monitor draws the screen:

The blue arrows mark the periods during which the scan gun is moving to the start of the next horizontal line, and the red arrow marks the time during which it is returning to the top of the screen. The horizontal retrace interval is too short to be useful, so we will have to do all of our screen updating during the vertical retrace.

Our trusty IBM PCjr Technical Reference tells us on page 2-73 about a video "status register" that sets bit 3 to 1 whenever vertical retrace is active and 0 otherwise, and a few pages back on 2-63 we find that we can read this status register at port 0x3da. So let's give it a try:

; Wait for port 0x3da bit 3 to go high, meaning that we are in
; the vertical retrace period and can safely update the framebuffer.
waitForRetrace:
  mov dx, 0x3da
.loop:         ; Wait for the vertical retrace bit to go high
  in al, dx
  and al, 0x8
  jz .loop
  ret

We can call this at the start of our game loop before we draw to the screen. It will cause us to sit and wait until vertical retrace is active before continuing. But what if we happen to be in the middle of vertical retrace when the procedure is called? It'll return immediately, and we'll think we have plenty of time for drawing when in fact our time could be almost up. What we have to do instead is wait for the bit to be 0 (meaning the screen is currently being drawn), then watch it and return immediately when it becomes 1:

; Wait for port 0x3da bit 3 to go high, meaning that we are in
; the vertical retrace period and can safely update the framebuffer.
waitForRetrace:
  mov dx, 0x3da
.loop:         ; Wait for the vertical retrace bit to go low
  in al, dx
  and al, 0x8
  jnz .loop
.loop2:
  in al, dx    ; Now wait for it to go high. This way
  and al, 0x8  ; we know we've caught vertical retrace
  jz .loop2    ; right at the beginning.
  ret

So let's see what we've got:

Whoops. Our rectangle is only being partially drawn. Even worse, as we move it around the screen we see that different parts of it are phasing in and out. Maybe our drawing is too slow and we're only completing part of it before the screen needs to redraw? Here's an easy way to test this theory: Before we start our drawing let's paint the upper-left pixel red, and after drawing is complete let's paint it yellow. If we see yellow there when we run the app we know we're completing our drawing in time, otherwise we're not. Put this just after waitForRetrace:

  mov ax, 0xb800
  mov es, ax
  mov al, 0x44
  xor di, di
  stosb

After drawing the player graphic, do the same thing again but with yellow (0xee). Side note: since each byte holds two pixels, this actually colors the first 2 pixels, but it's not a big deal. Now run it:
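For reference, the yellow marker would look something like the block below. It is the red snippet with the color byte changed. Reloading ES is a precaution, on the assumption that the drawing code may have changed it in the meantime.

```nasm
  ; Place this right after the player graphic has been drawn.
  mov ax, 0xb800   ; reload ES in case the drawing code changed it (assumption)
  mov es, ax
  mov al, 0xee     ; yellow in both nibbles, so both pixels of the byte
  xor di, di       ; offset 0 = upper-left corner of the screen
  stosb            ; write AL to ES:DI
```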

Yep, red only. The vertical retrace period is up before we finish our drawing. So how far off are we? One of the benefits of doing our development in DOSBox is that we can adjust the emulated CPU speed. While the app is running you can decrease cycles with CTRL+F11 and increase them with CTRL+F12, or you can set the cycles count directly in dosbox.conf. Let's increase it until we see yellow.

Wow. On my machine, it takes a speed of about 1760 cycles to finish our drawing routine consistently on time. If 315 cycles is a good approximation of the speed of the PCjr, we're eating up more than 5x as much time as we should be. What a harsh reality check.

We know what we need to do: Optimize! Where should we spend the most time optimizing? Well, we draw two 8x8 rectangles each time through the loop, and we make a putpixel call for each pixel of those rectangles, so we're looking at 8*8*2 = 128 putpixel calls.

One easy optimization trick is to replace multiplication and division by bit shifts wherever possible. For unsigned values, multiplying or dividing by a power of 2 is identical to bit-shifting left or right by that power. Furthermore, even if the multiplier is not a power of two, the multiplication can often be decomposed into a sum of shifts. For example:

320x = 256x + 64x = x*2^8 + x*2^6 = (x << 8) + (x << 6)
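On the 8088 a shift by more than one bit has to take its count from CL, so the decomposition above could be coded roughly as follows. This is only a sketch: the `mul320` label and the use of BX as a scratch register are illustrative and not taken from the episode's putpixel.

```nasm
; Hypothetical helper: AX = AX * 320. Clobbers BX and CL.
; The result fits in 16 bits as long as the input is at most 204.
mul320:
  mov cl, 6
  shl ax, cl      ; AX = x * 64
  mov bx, ax      ; BX = x * 64 (saved for the final add)
  shl ax, 1
  shl ax, 1       ; AX = x * 256
  add ax, bx      ; AX = x*256 + x*64 = x*320
  ret
```

Shifting by 6 first and then by 2 more reuses the intermediate x*64 instead of shifting the original value twice.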

Let's look at lines 10 - 13 of putpixel. We're dividing AX (the Y coordinate) by 4 to obtain the bank number (as remainder, 0-3) and the row number within the bank (as quotient). Shifting AX to the right by 2 is a quicker way of dividing by 4, and masking off all but the last 2 bits of AX (by ANDing it with 00000011) will give us the remainder:

  mov di, ax      ; Faster alternative to dividing AX by 4: shift
  shr di, 1       ; right twice for quotient, mask with 0b11 for
  shr di, 1       ; remainder.
  and ax, 0b11    ; AX = bank number (0-3), DI = row within bank

Next notice lines 15-18, where we multiply the bank number (now in AX) by the bank width to obtain the bank offset. 0x200 is 2^9, so let's just shift AX left by 9:

  mov cl, 9       ; Faster alternative to multiplying AX by the
  shl ax, cl      ; bank width (0x200): shift left by 9.

As a bonus, we no longer have to push and pop DX, saving us a couple more cycles.

Now try your performance test again, adjusting the CPU speed to the lowest which still gives you solid yellow. I get 1650-1655. Let's keep going.

Jumps are expensive on the 8088. Currently our putpixel is guaranteed to jump once on every access: jc .setLow might jump, but even if it doesn't there will be a jmp .finish later. Instead of jumping to .finish, how much time do we save by duplicating the .finish code at the end of both .setLow and .setHigh? Another 20 cycles on my machine, bringing the cycle count down to 1630-1635.
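To make the idea concrete, here is a stand-in for the tail of a putpixel-style routine with the finishing code duplicated in both branches. It is not the episode's actual code: the `writeNibble` label, the register assignments, and the contents of the "finish" step (a single store followed by `ret`) are all assumptions for illustration.

```nasm
; Hypothetical sketch, not the real putpixel.
; In:  ES:DI -> target byte, BL = color (0-15),
;      CF = 1 to write the low nibble, CF = 0 for the high nibble.
; Clobbers AL, BL, CL.
writeNibble:
  mov al, [es:di]   ; MOV leaves the flags alone, so CF survives
  jc .setLow
.setHigh:
  and al, 0x0f      ; keep the existing low nibble
  mov cl, 4
  shl bl, cl        ; move the color into the high nibble
  or al, bl
  mov [es:di], al   ; "finish" code, copy 1
  ret
.setLow:
  and al, 0xf0      ; keep the existing high nibble
  or al, bl
  mov [es:di], al   ; "finish" code, copy 2
  ret
```

Neither path needs a `jmp` to a shared tail any more; the cost is a few duplicated bytes of code.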

And now for the disappointment...

This is about as good as we can do. Really.

I played around with lots more optimizations including:

  • Decomposing the multiply-by-320 into an 8-bit left shift added to a 6-bit left-shift
  • Using different registers to prevent having to push and pop DX
  • Using stosb to write the pixel byte back into memory (which requires using DI instead of SI)
  • Reworking things so that ES doesn't need to be set every time

Unfortunately this saves us only a few more cycles... nowhere near the efficiency which would make putpixel a viable option for updating the screen during vertical retrace.

So, that's it. No heavy calculation during vertical retrace; our PCjr just doesn't have the speed for it. We'll have to do all our rendering to offscreen buffers, pre-rendering as much stuff as possible, and use the vertical retrace period only to update the screen by making a couple of quick, efficient memory copies. We'll play with this in the next episode.
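As a rough preview of what such a copy might look like, here is a minimal sketch built on `rep movsw`. The `copyToScreen` label and its register interface are hypothetical. It also ignores the interleaved bank layout, so it only covers a run of bytes that is contiguous in video memory.

```nasm
; Hypothetical helper, not from the episode's source.
; In:  DS:SI -> offscreen buffer, DI = destination offset in video memory,
;      CX = number of words to copy.
; Clobbers AX, CX, SI, DI, ES.
copyToScreen:
  mov ax, 0xb800
  mov es, ax        ; ES:DI -> destination in video memory
  cld               ; make SI and DI count upwards
  rep movsw         ; copy CX words from DS:SI to ES:DI
  ret
```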

As always, the (disappointing) full source for this episode is on GitHub.