May 2021 C64 Optimization Challenge

I’m participating in another size & speed coding challenge for the month of May! Basically just draw three characters in the system font with one character per pixel on-screen. There is a launcher program available that can show how many cycles one iteration takes which can then be multiplied with the size of the program for a score in “cyclebytes”!

I’ve also joined two previous optimization challenges, the Spiral and the Kaleidoscope!

These challenges are organized by the Shallan’s den of 6510 discord, if you’re interested why not jump in?

The rules

MAY – BIG CHARS CHALLENGE

  1. The aim is to write an efficient function to correctly display 3 letters or numbers from the default c64 uppercase ROM font but magnified 8 times and with a shadow.
  2. Each character must be made up of inverse spaces(char 160) in white corresponding to the exact pixels in the ROM font, with a drop shadow (down and to the right) made of inverted spaces(char 160) in black
  3. The first character must be displayed starting at x=0, y=9 ($0568), the second at x=10, y=9 ($0572) and the third at x=20, y=9 ($057c). (NOTE: most of the ROM font has space down both the left and right of each char, therefore the first visible pixels will be at x+1)
  4. The main program will call the method at $c000 with the char values to be displayed found in zeropage at $02,$03 and $04 in that order
  5. No verification routine is available to submissions, entries must be self validated.
  6. On stream on the day following competition closure, each submission will be loaded into a verification version of the main program. If the output passes it will receive a score, calculated from the average cycle count of 3 consecutive runs multiplied by the number of bytes used between $c000 and $cfff. Lowest score wins.
  7. Submissions must be made in the form of a prg containing just the code and data for $c000-$cfff, the main program should NOT be included.
  8. Closing date for entries is 23:59pm GMT on the 28th May 2021.
  9. All code, vars (exception for zero page) and tables must reside between $c000-$cfff. The entry point will be $c000.
  10. Screen ram will remain at the default of $0400 and must not be changed.
  11. All zeropage is free for use as BASIC and Kernal are banked out.
  12. The main program disables the screen so that bad lines are not present for the most accurate cycle timing. During testing you can disable this if needed.
  13. The main program uses two CIA timers to accurately time the function, do NOT disable stop or otherwise change these timers or the submission will not count.
  14. The function must be able to be called multiple times so any self modified code must set itself up.
An example of the result of the challenge

Planned Algorithm

The initial strategy is to read out each character index, calculate the pointer to the ROM Font Data, read out each row of the character as a byte and draw screen code 160 (inverted space) on screen for each bit that is set in that byte, set the color to white and draw the drop shadow part at an offset of 41 bytes.

For each index (0-2):

  • Read screen code from address $02 + index
  • Multiply by 8 to get character graphics offset from ROM font ($d000)
  • Add $d000 to address and store in self modified code (lda $dxxx)
  • Calculate destination address: $0568 + character index * 10 and copy to Zero Page pointer
  • Calculate shadow destination address = destination address + 40 + 1 and copy to Zero Page pointer
  • Calculate destination color address = destination address + $d800 – $0400 and copy to Zero Page pointer

For each row in the character (0-7):

  • Set rom bank register ($01) to $33
  • Read one byte from ROM font and copy to Zero Page shift register
  • Restore rom bank register ($01) to $35
  • increment low byte of rom font address
  • set Y register to 7, looping for each bit in the byte:
    • Shift register one bit right
    • if carry clear, skip setting an inverted space
      • set A to 1
      • store A to (zero page color),y
      • set A to 160
      • store A to (zero page destination),y
      • store A to (zero page shadow destination),y
    • increment all three zero page pointers by 40
    • decrement Y and loop until N flag set

Before starting to code

Just looking at the algorithm there are some opportunities!

Instead of looping until Y is negative we can just early out if the shift register becomes zero because there will not be any more bits to draw. Y still need to be decremented but that can be done at the start of the loop.

{
	lsr
	{
		bcc %
		; draw inverted space
	}
	bne !
}

Multiplying by 8 is of course just shifting two bytes left 3 times. Multiplying by 10 is not that hard but would almost certainly take more memory than a lookup-table of three values. Since the maximum offset is 20 (2 * 10) and the low byte of the address won’t wrap we only need to update the lower byte of the destination addresses for each character.

It is also known that the bottom row of each character is empty so only the first 7 rows need to be drawn. Conveniently the 7th screen row is less than 256-8 (6*40 = 240) so instead of incrementing the zero page pointers it is possible to just increment the address offset each row!

When the 8th row is incremented the register will overflow so it is not necessary to use a counter to exit the character drawing when the character is complete.

First Attempt

org $c000

zpScrn = $f0
zpScrnShade = $f2
zpCol = $f4
zpShift = $f6

	lda #$05
	sta zpScrn+1
	sta zpScrnShade+1
	lda #$d9
	sta zpCol+1

	ldx #2
	{
		lda 2,x ; get next character
		asl ; multiply by 8
		sta charAddr ; store in instruction address
		lda #$d0>>3 ; high byte of address / 8
		rol
		asl charAddr
		rol
		asl charAddr
		rol
		sta charAddr+1	; leaves C clear

		lda scrnLo,x ; read low byte of screen destination
		sta zpScrn ; store in screen pointer
		sta zpCol ; same low byte used for color pointer
		adc #41 ; calculate shadow destination
		sta zpScrnShade

		lda #8 ; this is the offset to the right of the character
		{
			pha ; save the offset
			tay ; copy offset to y for indirect addressing
			lda #$33 ; make char rom visible
			sta 1
charAddr = *+1
			lda $d000
			sta zpShift 
			inc charAddr
			lda #$35 ; make vic registers visible
			sta 1
			{
				dey ; step 1 character to the left
				lsr zpShift ; shift out right bit
				{
					bcc % ; skip this is bit was clear
					lda #160 ; set inverted space character
					sta (zpScrn),y
					sta (zpScrnShade),y
					lda #1 ; set white color
					sta (zpCol),y
					lda zpShift ; get Z flag from shift reg
				}
				bne ! ; loop until shift register is empty
			}
			pla ; get the destination offset
			clc
			adc #40 ; step the offset to the next line
			bcc ! ; loop until offset wraps
		}
		dex ; step to the next character to the left
		bpl ! ; repeat until all characters drawn
	}
	rts ; and done!

scrnLo:
	dc.b $68, $68+10, $68+20
First pass code

So how does this add up?

This is 95 bytes, pretty good cycle count so far. My first thought at this point was that this is pretty much it. Sure there are some cycles here or there and maybe a byte here and there, this challenge is pretty trivial. This intuition was not correct!

Looking at the image above the score would be around 425000 cyclebytes, or 0.425 MCB as we’ve been referring to it!

Tinkering with the obvious targets

Using zero page pointers requires setting up both the high and low bytes so maybe instead just use self modified code?

The zp pointer setup code is 23 bytes in total, while there are 3 writes using the zp pointers so 6 bytes. Using SMC would look something like this:

lda scrnLo,x
sta scrnTrgLo
sta colTrgLo
adc #41
sta scrnShadeTrgLo

Which adds up to 14 bytes, but the 3 writes are now 3 bytes each instead of two bytes, so the total saving is just 6 bytes but it also saves a bunch of cycles!

13 cycles for the initial setup is saved, 3 cycles is lost writing to absolute addresses instead of zero page, and 6 cycles is saved for each drawn “pixel”!

To estimate the number of pixels I’m just going with an estimated average of 3.6 pixels per row, or 3 * 7 * 3.6 ~= 76 pixels in total.

Another zero page use is the shift register which is read out from the font data, which costs 5 cycles. If this can be done with the accumulator instead it only costs 2 cycles. Doing this requires some stack pushing and popping but it is doable.

Finally setting characters to white is just changing a 0 to a 1 which can be handled by an inc instead of lda #1 / sta saving two bytes and almost the same cycle cost (7 vs 6+, in measurements this is an average of 2-3 cycles total). Here is this rewrite:

org $c000
	ldy #2
	{
		lda 2,y
		asl
		sta charAddr
		lda #$d0>>3
		rol
		asl charAddr
		rol
		asl charAddr
		rol
		sta charAddr+1	; leaves C clear

		lda scrnLo,y
		sta scrnTrgLo
		sta colTrgLo
		adc #41
		sta scrnShadeTrgLo

		lda #8
		{
			pha
			tax
			lda #$33	; make char rom visible
			sta 1
charAddr = *+1
			lda $d000
			pha
			inc charAddr
			lda #$37	; make vic registers visible
			sta 1
			pla
			{
				dex
				lsr
				{
					bcc %
					pha
					lda #160
scrnTrgLo = *+1
					sta $568,x
scrnShadeTrgLo = *+1
					sta $591,x
colTrgLo = *+1
					inc $d968,x
					pla
				}
				bne !
			}
			pla
			clc
			adc #40
			bcc !
		}
		dey
		bpl !
	}
	rts
scrnLo:
	dc.b $68, $68+10, $68+20
Another iteration

The code size is 89 bytes and checking an average of a number of runs the result is around 0.375 MCB.

Another round

So removing all zero page use was great, let’s try to use a zero page pointer just for reading out of ROM Font Data instead!

While rewriting reading font ROM, one thing to look at the launcher code is the list of all the possible characters:

possibleChars:
		.text "abcdefghijklmnopqrstuvwxyz0123456789"

The character indices in this set range from A ($01) to 9 ($39) so the top two bits of the character index is not used! This means when multiplied by 8 only one bit is affected in the high byte of the address! Calculating the ROM Font Data address can be simplified.

Another interesting fact requires looking at the actual ROM Font Data for the possble characters:

The allowed characters

Apart from row 8 in all characters there are no empty rows in these characters.

The advantage of this is that the shift byte -> draw if bit was set -> repeat until byte is empty can be divided into shift byte until first set bit -> draw -> repeat until byte is empty, or in code:

{
	asl ; shift bits left, carry set to previous top bit
	inx ; step one screen character
	bcc ! ; if carry is clear get another bit, do not continue to draw
	pha ; save accumulator in case there are any more bits
	< draw inverted space >
	pla ; restore the accumulator and make sure Z flag is set if 0
	bne ! ; resume drawing more bits unless done
}

The pha/pla in the loop adds up to 7 cycles happens approximately 72 times so using a register would only cost 4 cycles. By storing the character index while drawing a character the y register can be freed up, at a cost of 6 cycles per character, or 18 CB plus 4 bytes. Let’s check out this iteration:

org $c000
	ldx #2
	{
		lda 2,x ; get character index
		asl ; multiply by 8 to get offset into ROM Font Data
		asl
		asl
		sta 7 ; store the low byte of the offset in ZP
		lda #$d0>>1 ; shift in the 9th bit into the high byte address
		rol	; leaves C clear
		sta 8

		lda scrnLo,x ; get the screen & color address low byte
		sta scrnTrgLo
		sta colTrgLo
		adc #41
		sta scrnShadeTrgLo

		stx 6

		lda #7 ; a = offset into screen drawing
		{
			pha
			tax ; use x for screen address offset
			ldy #1	; make char rom visible
			sty 1
			dey ; y used for indirect addressing, must be 0
			lda (7),y ; get the next byte of the char
			inc 7 ; step to the next Font ROM address
			stx 1 ; restore IO registers, x low 4 bits is 7
			{
				inx
				asl
				bcc ! ; iterate until first bit
				tay ; always draw this bit
				lda #160 ; inverted space
scrnTrgLo = *+1
				sta $568,x ; store screen
scrnShadeTrgLo = *+1
				sta $591,x ; store screen shadow
colTrgLo = *+1
				inc $d968,x ; set color
				tya ; restore a and set Z flag based on value
				bne ! ; keep drawing until a = 0
			}
			pla	; carry always set from loop above
			adc #40-1 ; skip to the next line, C adds 1
			bcc ! ; iterate until 7th line
		}
		ldx 6 ; restore character index
		dex ; go to the previous
		bpl ! ; until < 0
	}
scrnLo:
	rts ; sneak trick: rts opcode is $60, using as screen offset!
	dc.b $60+10, $60+20

What we have so far

Code is at 79 bytes, already below 80! and on average about 3672 cycles so a result around 0.29 MCB, already below 0.3 MCB!

Needs more less..

Time to go for something out of the normal. Maybe there is a better way to select the Font ROM than lda/sta into 1. When the code is started the value in 1 is $35 and any combination of %001 – %011 in the lower three bits will reveal Font ROM. $35 is %00110101 so shifting that right will be %00011010.

So the plan is to lsr 1 / read the byte from Font ROM / rol 1 to restore.

There is still a pha/pla that is just holding the previous start value in x, but it can just be calculated by clearing out the lower three bits that was incremented. But it would also need to add 40 to get to the next row which would technically require the accumulator.

Except it can be done in one go with the sax illegal instruction! This instruction ands a and x, then subtracts the parameter from a and transfers the value back to x.

Let’s try these things out..

cpu 6502ill
org $c000
	; y = $ff
	ldx #2 ; iterate over 3 chars
	iny ; set y to 0
	{
		stx 6 ; save x
		lda 2,x ; get next character
		asl ; multiply by 8 to get font data address
		asl
		asl
		sta 7
		lda #$d0>>1 ; handle high byte of font data address
		rol	; leaves C clear
		sta 8

		lda scrnLo,x ; set up the screen/color destination
		sta scrnTrgLo
		sta colTrgLo
		adc #41 ; offset to shadow destination
		sta scrnShadeTrgLo


		ldx #7
		{
			lsr 1 ; make Font ROM data visible
			lda (7),y
			inc 7
			rol 1 ; restore IO & RAM
			{
				inx ; increment dest offset
				asl ; shift one bit left of font row
				bcc ! ; repeat until non-zero
colTrgLo = *+1
				inc $d968,x ; set color to white
				tay
				lda #160
scrnTrgLo = *+1
				sta $568,x ; set inverted space
scrnShadeTrgLo = *+1
				sta $591,x ; also on shadow
				tya
				bne ! ; repeat until all bits in byte out
			} ; C = 0, Y = 0
			lda #$f8 ; and X with this before subtracting
			sax #~(40-1+7-8) ; dc.b $cb, ~(40-1+7-8)
			bcc !
		}
		ldx 6 ; restore character index
		dex ; next one
		bpl ! ; repeat until < 0
	}
scrnLo:
	rts ; first character screen offset
	dc.b $60+10, $60+20 ; 2nd and 3rd character screen offset
	

Let’s add it up.. again..

74 bytes looks pretty good! And running some tests shows an average of 3580 cycles so the result is 0.265 MCB! This is a pretty large change from the previous 0.29 MCB so the question is how low we can go.

Lets keep going smaller!

The big thing that stands out now is the inc 7 to get the next byte from the Font ROM data when we’re reading y indirect anyway. So we could loop 0-7 with y but we might as well just copy the low byte of the Font ROM address to y and leave it at 0.

And instead of using zero page address 7 and 8 for the pointer I’m using $10 and $11 because $10 is already zero so it doesn’t need to be initialized.

This also means going back to using the stack to save the remaining bits to draw in a row but saving some bytes is surely worth it? Let’s try it out!

71 byte version in action
cpu 6502ill
org $c000
	ldx #2 ; iterate over 3 chars
	{
		stx 6 ; save x
		lda 2,x ; get next character
		asl ; multiply by 8 to get font data address
		asl
		asl
		tay ; was sta 7
		lda #$d0>>1 ; handle high byte of font data address
		rol	; leaves C clear
		sta $11	; $10 in ZP contains 0 so use this address instead

		lda scrnLo,x ; set up the screen/color destination
		sta scrnTrgLo
		sta colTrgLo
		adc #41 ; offset to shadow destination
		sta scrnShadeTrgLo

		ldx #7
		{
			lsr 1 ; make Font ROM data visible
			lda ($10),y
			rol 1 ; restore IO & RAM
			iny ; increment offset into Font data
			{
				inx ; increment dest offset
				asl ; shift one bit left of font row
				bcc ! ; repeat until non-zero
colTrgLo = *+1
				inc $d968,x ; set color to white
				pha
				lda #160
scrnTrgLo = *+1
				sta $568,x ; set inverted space
scrnShadeTrgLo = *+1
				sta $591,x ; also on shadow
				pla
				bne ! ; repeat until all bits in byte out
			} ; C = 0, Y = 0
			lda #$f8 ; and X with this before subtracting
			sax #~(40-1+7-8) ; dc.b $cb, ~(40-1+7-8)
			bcc !
		}
		ldx 6 ; restore character index
		dex ; next one
		bpl ! ; repeat until < 0
	}
scrnLo:
	rts ; first character screen offset
	dc.b $60+10, $60+20 ; 2nd and 3rd character screen offset

71 bytes! This must be the best so far. The average cycles seem to end up at 3663.5 which is a bit higher thouh.. Well, 0.26 MCB is still a little bit lower so the size saving pays off even if the difference is very small.

Out of ideas for space, not cycles

So this is as far as I’ve got with making the code smaller and now it is time to reason about how to make the total score lower. I also don’t really see any way to make it faster without adding something.

One useful metric for reasoning is that if I add one byte I need to at least shave off (3663/71) or about 52 cycles.

A bit off the left side please

One fact that haven’t been considered yet is that the leftmost pixel is always 0 in the possible set of characters. By pre-shifting that we can skip one inx and one bcc at the cost of an extra asl.

So for 3 characters with 7 lines each there is a saving 3+2-2 cycles, or 63 cycles which is greater than 52! Let’s give it a go! Here is the code for the 72 byte version:

cpu 6502ill
org $c000
ldx #2 ; iterate over 3 chars
{
	stx 6 ; save x
	lda 2,x ; get next character, valid bits are %00xxxxxx
	asl ; multiply by 8 to get font data address
	asl
	asl
	tay ; Y is now the low byte of the Font ROM data address
	lda #(>$d000)>>1 ; handle high byte of font data address
	rol ; shift in lowest bit of Font ROM address, leaves C clear
	sta $11	; $10 in ZP contains 0, this is the high byte

	lda scrnLo,x ; set up the screen/color destination
	sta scrnTrgLo ; update low byte of screen target
	sta colTrgLo ; same low byte used for color target
	adc #41 ; offset to shadow destination
	sta scrnShadeTrgLo ; update low byte of screen shadow target

	ldx #8 ; start at offset 8 so I can use rts opcode for first byte
	{
		lsr 1 ; make Font ROM data visible
		lda ($10),y ; read byte from Font ROM data
		rol 1 ; restore IO & RAM memory
		iny ; increment offset into Font data
		asl ; leftmost bit is always empty so skip that
		{
			inx ; increment dest offset
			asl ; shift one bit left of font row
			bcc ! ; repeat until non-zero
colTrgLo = *+1
			inc $d968,x ; set color to white
			pha ; save remaining font graphics bits
			lda #160 ; reverse space screen code
scrnTrgLo = *+1
			sta $568,x ; set inverted space
scrnShadeTrgLo = *+1
			sta $591,x ; also on shadow
			pla ; restore remaining font graphics bits
			bne ! ; repeat until all bits in byte drawn on screen
		} ; C = 0, Y = 0, A = 0, X = screen offset
		lda #$f8 ; and X with this before subtracting
		sax #~(40-1) ; sax = txa / and #$f8 / sbc #~(40-1) == adc #40-1 / tax
		bcc ! ; on the 8th row the carry will overflow
	}
	ldx 6 ; restore character index
	dex ; next character
	bpl ! ; repeat until character index < 0
}
scrnLo:
rts ; first character screen offset
dc.b $60+10, $60+20 ; 2nd and 3rd character screen offset

Now at 72 bytes and… 3570 cycles on average. That is about 0.257 MCB which is yet another tiny saving! But not even 0.01 MCB from the 74 byte version…

So for the 74 byte version 1 extra byte would need to save just 3580/74 or about 48 cycles, so the potential for savings is even greater. Let’s check it out..

Better or worse?

Here is the code for the 75 byte version

cpu 6502ill
org $c000
; y = $ff from the calling code
ldx #2 ; iterate over 3 chars
iny ; set y to 0
{
	stx 6 ; save x in ZP for next index
	lda 2,x ; get next character
	asl ; multiply by 8 to get font data address
	asl
	asl
	sta 7 ; low byte address of Font ROM data for this character
	lda #$d0>>1 ; handle high byte of font data address
	rol	; shift in lowest bit of Font ROM address, leaves C clear
	sta 8 ; high byte address of Font ROM data for this character

	lda scrnLo,x ; set up the screen/color destination
	sta scrnTrgLo ; update low byte of screen target
	sta colTrgLo ; same low byte used for color target
	adc #41 ; offset to shadow destination
	sta scrnShadeTrgLo ; update low byte of screen shadow target

	ldx #8 ; start at offset 8 so I can use rts opcode for first byte
	{ ; this scope draws one 7x7 character
		lsr 1 ; make Font ROM data visible
		lda (7),y ; read byte from Font ROM data, Y is always 0
		inc 7 ; increment address of Font ROM data
		rol 1 ; restore IO & RAM memory
		asl ; leftmost bit is always empty so skip that
		{ ; this scope draws one 7 bit row of font graphics
			inx ; increment dest offset
			asl ; shift one bit left of font row
			bcc ! ; repeat until non-zero
colTrgLo = *+1
			inc $d968,x ; set color to white
			tay ; save remaining font graphics bits
			lda #160 ; reverse space screen code
scrnTrgLo = *+1
			sta $568,x ; set inverted space
scrnShadeTrgLo = *+1
			sta $591,x ; also on shadow
			tya ; restore remaining font graphics bits
			bne ! ; repeat until all bits in byte drawn on screen
		} ; C = 0, Y = 0, A = 0, X = screen offset
		lda #$f8 ; and X with this before subtracting
		sax #~(40-1) ; sax = txa / and #$f8 / sbc #~(40-1) == adc #40-1 / tax
		bcc ! ; on the 8th row the carry will overflow
	}
	ldx 6 ; restore character index
	dex ; next one
	bpl ! ; repeat until character index < 0
}
scrnLo:
rts ; first character screen offset (rts is $60)
dc.b $60+10, $60+20 ; 2nd and 3rd character screen offset

New size would be 75 bytes and average cycle time is 3377 so the result is 0.253 MCB!

So 193 cycle difference from the 72 byte version? maybe it is a bit large, let’s check.

Counting cycles, adding for the 72 byte version and subtracting for the 75 byte version:

iny: -2, sty 7: -3 * 3, tay: +2 * 3, inc 7: -5 * 7 * 3, pha/pla: +7 * 3.6 * 7 * 3, tay/tya: -4 * 3.6 * 7 * 3, iny: +2 * 7 * 3

Total: -2-3*3+2*3-5*7*3+7*3.6*7*3-4*3.6*7*3+2*7*3 = 158.8, so in that case the difference would be 72*3580 – 75*3412 = 1170 cyclebytes. Still a win but very small.

This is assuming an average of 3.6 pixels per row in each character, if there is just 2.5 which would be one of the smallest characters the difference would be 89.5 in which case the 72 byte version would be 4000 cyclebytes cheaper…

So at this point it is pretty much a gamble. The score comes down almost entirely to which characters are used for the final competition, but even a very small improvement could make a difference.

Lots of performance profiling needed!

Time to mess with the competition launcher to get easier profiling data. I’m adding a function to convert the ASCII score to a hex value and one to add it to a running total, then at the start of Verify: just call those functions. Then just disable the wait for space and let it run for 65536 iterations.

Here’s the code, in x65 syntax so if it is useful for any future cyclebytes compo just edit it for your assembler!

AddScore:
{
	clc
	ldx #0
	ldy #3
	{
		lda ScoreHex,x
		adc ScoreSum,x
		sta ScoreSum,x
		inx
		dey
		bpl !
	}
	{
		inc ScoreCount
		bne %
		inc ScoreCount+1
		beq *
	}
	rts
}

ScoreToHex:
{
	lda #0
	ldx #3
	{
		sta ScoreHex,x
		dex
		bpl !
	}
;TimerAsciiLen
	ldy #0
	{
		tya
		pha
		jsr MulScore10
		pla
		tay
		lda TimerASCII,y
		{
			and #$f
			adc ScoreHex
			sta ScoreHex
			bcc %
			inc ScoreHex+1
			bne %
			inc ScoreHex+2
			bne %
			inc ScoreHex+3
		}
		iny
		cpy #TimerAsciiLen
		bcc !
	}
	rts
}

MulScore10:
{
	ldx #3
	{
		lda ScoreHex,x
		sta ScoreHexTemp,x
		dex
		bpl !
	}
	jsr MulScore2
	jsr MulScore2
	clc
	ldx #0
	ldy #3
	{
		lda ScoreHex,x
		adc ScoreHexTemp,x
		sta ScoreHex,x
		inx
		dey
		bpl !
	}
} // fallthrough to MulScore2
MulScore2:
{
	asl ScoreHex
	rol ScoreHex+1
	rol ScoreHex+2
	rol ScoreHex+3
	rts
}

ScoreSum = $1000
ScoreCount = $1004

ScoreHex:
	ds 4

ScoreHexTemp:
	ds 4

Now let’s see if my cycle calculations are correct. I’m just letting this run and then using a calculator to figure out the average cycle count.

The 72 byte version has an average of 3559.32 cycles and the 75 byte version 3400.85. This means the 72 byte version is 255064 cyclebytes on average and the 75 byte version is 252675, a difference of 1845 CB.

This also shows that my cycle calculation is pretty close, in theory 158.8 and in practice around 158.47! The PK6 test shows a similar difference in favor of the 75 byte version.

Finishing with a PK6 score of 252675 cyclebytes!

The competition results!

Here is the final scoreboard, so turns out I won this month! But Zsezse managed to win the third letter combination at the same byte size so this was extremely close.

Congrats to everyone that participated!

Comparing with other solutions

Here is an implementation that combines ideas with the solution from zsezse, saving a byte to save kilocyclebytes!

So the specific changes here are (first pass):

  • Accumulator already contains the first character on entry (new idea)
  • Zero page address $74, $75 already contains some of the bits needed for the Font ROM data address (Zsezse)
  • Address 5 contains a negative byte, but $2-$4 will all be positive so a cheaper completion check at the end. (Zsezse)

result: 74 bytes, 1DP 3164, OHY 3577, ECB 3320 = 248171.333 CycleBytes
(in a 64k run, average cycles is 3406.16 => 252056 CycleBytes average, 619 better than my submission)

second pass:

  • The x register is always $11 when entering so instead of setting it just subtract $11 from the read addresses! x will be $11, $12, $13 and when it is $15 the byte read from (2-$11)+x will be negative! (new idea)

result: 72 bytes, 1DP 3189, OHY 3639, ECB 3389 => 245208 CycleBytes
Geir Straume double checked and my numbers above were wrong, should be:
result: 72 bytes, 1DP 3187, OHY 3573, ECB 3316 => 241824 CycleBytes
(in a 64k run, average cycles is 3405.05 => 245164 CycleBytes average)

third pass:

  • Geir Straume added another trick, which is to use $e8/$e9 for the Font ROM data address. Since inx is $e8 this can be used as an inc $e8 the first step of the BitLoop! (removing the trick to use $d0 already in address $75)

Code from Geir in Kick Assembler friendly format.
result: 72 bytes, 1DP 3136, OHY 3522, ECB 3265 => 238152 CycleBytes
(in a 64k run, average cycles is 3355.85 => 241621 CycleBytes average, 11054 better than my submission!)

	* = $c000
	// a = first char
	// y = $ff from the calling code
	// x = $11 from the calling code
	iny // set y to 0
CharacterLoop:
	stx 6 // save x in ZP for next index
	asl // multiply by 8 to get font data address
	asl
	asl
	sta $e8// low byte address of Font ROM data for this character
	lda #$d0>>1 // handle high byte of font data address
	rol	// shift in lowest bit of Font ROM address, leaves C clear
	sta $e9	// high byte address of Font ROM data for this character

	lda scrnLo-$11,x // set up the screen/color destination
	sta scrnTrgLo // update low byte of screen target
	sta colTrgLo // same low byte used for color target
	adc #41 // offset to shadow destination
	sta scrnShadeTrgLo // update low byte of screen shadow target

	ldx #9 // start at offset 9 so I can use rts opcode for first byte
RowLoop:
	lsr 1 // make Font ROM data visible
	lda ($e8),y // read byte from Font ROM data, Y is always 0
	rol 1 // restore IO & RAM memory
	asl // leftmost bit is always empty so skip that
	.byte $e6 // 'inc $e8': using opcode for 'inx' as the zp address
BitLoop:
	inx // increment dest offset (skipped for the first loop iteration, due to the 'inc' above)
	asl // shift one bit left of font row
	bcc BitLoop // repeat until non-zero
	tay // save remaining font graphics bits
	inc colTrgLo:$d968,x // set color to white
	lda #160 // reverse space screen code
	sta scrnTrgLo:$568,x // set inverted space
	sta scrnShadeTrgLo:$591,x // also on shadow
	tya // restore remaining font graphics bits
	bne BitLoop // repeat until all bits in byte drawn on screen
	// C = 0, Y = 0, A = 0, X = screen offset
	lda #$f8 // and X with this before subtracting
	axs #~40 // axs = txa / and #$f8 / sbc #~40 == adc #40 / tax
	bcc RowLoop // on the 8th row the carry will overflow
	ldx 6 // restore character index
	inx // step to the next character
	lda.z 2-$11,x // address $05 contains a negative value
	bpl CharacterLoop // loop while there are still valid characters
scrnLo:
	rts // first character screen offset
	.byte $60+10, $60+20 // 2nd and 3rd character screen offset

Published by Space Moguls

I make Commodore 64 games and sometimes demos.

2 thoughts on “May 2021 C64 Optimization Challenge

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: