Shallan’s April-22 Optimization Challenge

I wasn’t going to participate in this one. I need to work on my C64 game right now (Island Adventure) AND I’m at the beginning of a new (solo) project at work where I need to ship x projects in y months, where x is an order of magnitude greater than y..

Anyway, I got sick for a couple of days and on the 27th of April I got a bit bored. I don’t need to submit anything but seems like a fun distraction for the rest of the day. BIG mistake! Fortunately the challenge ended on the 29th so I was just hooked for a couple of days or so….

The challenge is to convert 960 bytes from standard base64 encoding to 720 bytes of C64 screen codes. The encoded data is in the same place as the destination so you need to overwrite your input buffer as you progress. Since the input is larger than the output there is also a block that needs to be cleared on screen (lines 18 through 24) adding a little extra.

Results

Before diving in to details, here are the all-important results. I’m pretty happy with a 6th place especially since I didn’t spend much time on this one. I’ll compare my result with some of the winners at the end!

Rules

APRIL 2022 – BASE64 DECODE CHALLENGE

The aim is to write an efficient function to decode base64 strings and display them on screen
You are given a screen of mixed case text, this is (starting at the top left) a single base64 encoded string (https://en.wikipedia.org/wiki/Base64)
You must convert this string to binary data and display each byte sequentially from the top left, resulting in a decoded screen output consisting of 18 lines. None of the original encoded data must be visible, replace it with spaces.
The main program will call the method at $c000 with the character set switched to allow upper and lower case and the screen ready populated with the encoded data
No verification routine is available to submissions, entries must be self validated.
Base64
- In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits.
  Common to all binary-to-text encoding schemes, Base64 is designed to carry data stored in binary formats across chann…
On stream on the day following competition closure, each submission will be loaded and run with 3 different pieces of encoded data. If the output passes it will receive a score, calculated from the average cyclebyte count of the 3 consecutive runs. (cyclebytes = Number of cycles * number of bytes in the program) Lowest score wins.
Submissions must be made in the form of a prg containing just the code and data for $c000-$cfff, the main program should NOT be included.
Closing date for entries is 23:59pm BST on the 29th April 2022.
All code, vars and tables must reside between $c000-$cfff, with the excpetion that vars can be stored in zeropage ($04-$ff) The entry point will be $c000.
Screen ram will remain at the default of $0400 and must not be changed.
All zeropage from $04-$ff is free for use (vars only , no code) as BASIC and Kernal are banked out.
The main program disables the screen so that bad lines are not present for the most accurate cycle timing. During testing you can disable this if needed.
The main program uses two CIA timers to accurately time the function, do NOT disable stop or otherwise change these timers or the submission will not count.
The function must be able to be called multiple times so any self modified code must set itself up.

Initial Conditions

With the testing code provided for the challenge you’re presented with this screen of data (this is uppercase, the result is displayed in lowercase but there isn’t much useful information on this screen in lowercase either)

The basic algorithm

Essentially if you start from the top left (address $400) fetch one byte at a time, convert the 8 bit character to a 6 bit value and write 6 bits at a time. And stop when you reach line 24! Then you have to clear the screen area from line 18 to line 24 so there isn’t any visible garbage. Seems easy enough.

Converting from 8 bits to 6 is a little awkward though, you have the following ranges to keep track of:

A-Z ($41-$5a) => 00 – 25
~~a-z ($61-$7a) => 26 – 51~~*
a-z ($01-$1a) => 26 – 51
0-9 ($30-$39) => 52 – 61
+ ($2b) => 62
/ ($2f) => 63

*) So unfortunately I managed to get a borked version of the test program which was assembled with lowercase as ASCII, which matched the official format, instead of C64 screen codes so this entire post up to the end (May 2nd) is written with this assumption.

So you could attempt converting it on the fly, code size and speed is not that bad if you optimize it a little as long as you don’t try to unroll a loop or something silly like that. Here’s an example of how that could look:

		clc
		lda #$100 - $41
		adc [base64 value]
		bcs .alpha ; 52 values here
.not_alpha	adc #$41-$30+52 ; 0-9, +, /
		cmp #52 ; >52 => found value
		bcs .done ; 10 values here

		cmp #52 + $2d-$30 ; $2d is inbetween $2b and $2f
		lda #$3e + 7
		bne .sub

.alpha		cmp #26
		bcc .done
.sub		sbc #$61-$41-25-1
.done

So to estimate the cycle time, A-Z would cost 15 cycles, a-z 16 cycles, 0-9 18 cycles, + and / 27 cycles and needing 25 bytes. My initial attempt benefited from this version over a table based lookup.

So for most of my experiments table lookups won out because the lookup itself is just 2-3 bytes and 4 cycles even if I generated the table.

I made one table generator and couldn’t find much to optimize in it, trying different branch conditions to save a byte here and there. This is what I made:

const First = $2b ; '+', don't need any values below this
const zpTab = $2b ; nice and round start address

	ldy #0
	ldx #63

Table:
		txa
		; note: $3f will be wrong offset on 2nd run but it is ok because it was
		; previously written at the correct location
StoreAt:
		sta.z $2f - First - 63 + zpTab,x
			cmp Base64_Range_Count,y
			bne Continue
			lda Base64_Range_Offs,y
			sta StoreAt+1
			iny
Continue:
		dex
		bpl Table

...


; when table value equals this, change target pointer for next section of table
Base64_Range_Count:
		dc.b 63, 62, 52, 26
; counts & start values
Base64_Range_Offs:
		//dc.b $2f - First - 63 + zpTab ; Z, z, 9, =, \
		dc.b $2b - First - 62 + zpTab
		dc.b $30 - First - 52 + zpTab
		dc.b $61 - First - 26 + zpTab
		dc.b $41 - First - 0 + zpTab

This adds up to 1191 cycles to generate the table and 30 bytes so it i not insignificant but a fairly small part of the total.

You’re allowed to generate the table once and test if it was generated for the 2nd and 3rd runs but I didn’t find a case where that save more cyclebytes than it cost.

After the loop finishes the accumulator contains 0 and the y register contains 4. Pretty handy for initializing the source and destination zero page vectors!

The main body

So the majority of cycles and bytes are spent on the actual reading and writing of decoded bytes. Writing 6 bits at a time is a little awkward but by grouping 4 bytes in the output is 6*8 = 24 bits or 3 bytes. That means the 960 input bytes becomes 240 batches of 4 bytes.

Intuitively I want to read the 4 source bytes in a loop, decode each and store to zero page and then combine the 3 bytes. I have a zero page counter for the 240 iterations resulting in something like this:

Loop:
	tya ; "free" storing of the destination offset
	ldy #3
	{
		sta zpData,y ; store destination index (y) on 1st iteration of this loop
		{
			clc
			lda #$100 - $41
			adc (zpSrc),y
			bcs .alpha ; 52 values here
.not_alpha	
			adc #$41-$30+52 ; 0-9, +, /
			cmp #52 ; >52 => found value
			bcs % ; 10 values here

			cmp #52 + $2d-$30 ; $2d is inbetween $2b and $2f
			lda #$3e + 7
			bne .sub

.alpha			cmp #26
			bcc %
.sub			sbc #$61-$41-25-1
		}
		dey
		bpl !
	}
	; at this point the accelerator contains the first byte to process!

	; restore the destination offset
	ldy zpData+3
	{
		asl zpData ; shift 2nd byte up twice
		asl zpData
		asl zpData ; shift two bits from 2nd byte into first byte
		rol
		asl zpData
		rol
	}
	sta (zpDst),y
 ; store first byte in destination
	iny ; increment destination offset
	{
		bne % ; if dest offs lo wraps, increment dest offs hi
		inc zpDst+1
	}
	{
		lda zpData+1 ; get 3rd byte
		lsr
 ; shift two bottom bits of 3rd byte into fourth byte
		ror zpData+2
		lsr
		ror zpData+2
		ora zpData ; merge 4 bits of 2nd with 4 bits of 3rd
	}
	sta (zpDst),y ; store combined 2nd and 3rd byte
	iny
	{
		bne %
		inc zpDst+1
	}
	{
		lda zpData+2 ; fourth byte is ready
	}
	sta (zpDst),y
	iny
 ; the 3rd byte will not wrap so no need to check
;	{
;		bne %
;		inc zpDst+1
;	}
	dec zpCnt ; decrement the chunk counter
	beq Finito ; 2 cycles for a branch not taken so move the exit to the end

;	clc ; carry will always be clear here so safe to ignore
	lda #4 ; increment the source counter
	adc zpSrc
	sta zpSrc
	bcc Loop
	inc zpSrc+1
	bcs Loop
Finito:

I don’t have the exact cyclebyte count for this version, but I think it was a bit above 6 Mcb which also includes setting up the zero page vectors and count and then clearing the bottom part of the screen.

Hunting for easy savings

Where to start? Maybe using self modifying code of an absolute address for the source address would alleviate switching the y register?

Loop:

	; read 4 bytes
	ldx #3
	{
		{
			clc
			lda #$100 - $41
const Source = *+1
			adc $400,x
			bcs .alpha ; 52 values here
.not_alpha	

			adc #$41-$30+52 ; 0-9, +, /
			cmp #52 ; >52 => found value
			bcs % ; 10 values here

			cmp #52 + $2d-$30 ; $2d is inbetween $2b and $2f
			lda #$3e + 7
			bne .sub

.alpha			cmp #26
			bcc %
.sub			sbc #$61-$41-25-1
		}
		sta zpData,x
		dex
		bpl !
	}

; ... inner part of the loop is the same except no need to restore y

	adc Source
	sta Source
	bcc LoopInner
	inc Source+1
	bcs LoopInner
Finito:

Unfortuately the absolute addressing instead of zero page and extra bytes needed didn’t result in an improvement. But maybe x could be used for the chunk count instead of a zero page counter instead?

	ldx #240
; ...
	dex
	beq Finito

Saving both bytes (3) and cycles (3 + 240*3) is a win!

Diving into the table solution again

While initially the in-place decoding was a slightly cheaper solution counting cyclebytes the table lookup is so tiny that it should yield better results, maybe even allowing unrolling the 4 byte source decode loop!

	ldx byte
	lda.z table,x

The decode itself is just 4 cycles, reading and decoding one byte could be as low as four bytes with the right addressing mode.

To solve looking up an indirect indexed byte into the x register without a tax I can use lax (undocumented opcode). To check if the chunks are complete check if the destination low byte is the low byte of the end address.

I can also try to store the decoded bytes in reverse order to see if I can save any operations.

or to put it together:

Loop:
	lax (zpSrc),y ; read first source byte
	lda.z zpTab - First,x ; decode
        sta zpData ; temp storage first byte
        iny

	lax (zpSrc),y
	lda.z zpTab - First,x
        asl ; shift up two empty bits of second byte
        asl
        asl ; shift up two bits from second byte into first byte
        rol zpData
        asl
        rol zpData
        pha ; saving a couple of bytes using pha/pla instead of zp
        iny

	lax (zpSrc),y
	lda.z zpTab - First,x
        sta zpData+2 ; temp store 3rd byte
        iny

	lax (zpSrc),y
	lda.z zpTab - First,x

        asl ; shift up two bits from fourth source byte
        asl
        lsr zpData+2 ; shift down 2 bits from 3rd to 4th byte
        ror
        lsr zpData+2
        ror

        iny ; only need to check source addr wrap each 4 bytes!
            bne SkipIncSrc
            inc zpSrc+1
SkipIncSrc:
        sty zpY ; save source offset

        ldy #2 ; store destination bytes in reverse
        sta (zpDst),y ; a already has 3rd dest byte
        dey
        pla ; combine 2nd and 3rd source byte into 2nd dest byte
        ora zpData+2
        sta (zpDst),y
        dey
        lda zpData ; first byte is ready to copy
        sta (zpDst),y

            lda zpDst ; increment dest address
            adc #3
            sta zpDst
            bcc SkipIncDest
            inc zpDst+1
SkipIncDest:

        ldy zpY ; restore source offset
        cmp #<40*18
        bne Loop-1
    }

Combined with savings from table setup and clearing the remaining space on screen this solution is just below 5 Mcb already!

Folding instructions

One popular approach in these challenges is to hide instructions in the arguments of opcodes, so for instance the lines sta $a8, tay becomes ldy $a4 if you skip the first byte.

Here is the zero page setup code right before the main loop:

	; x = $ff, a = 0, y = 4
	sty zpDst+1
	sty zpSrc+1
	sta zpDst
	sta zpSrc
        tay
Loop:
        ldy zpY

So to skip the ldy zpY before the end of the loop end I can assign zpSrc to $a8 and zpY to $a4 to save 2 bytes from the loop:

	; x = $ff, a = 0, y = 4
	sty zpDst+1
	sty zpSrc+1
	sta zpDst
	sta zpSrc ; hidden ldy zpY (a4 a8)
Loop:
        tay ; only runs first time through loop
; ...
        cmp #<40*18
        bne Loop-1 ; hidden instruction to execute before "Loop"

And there is also the opportunity to be clever with zpDst since the accumulator contains the low byte of the destination I can defer writing that back to the zero page to the start of the loop by assigning it address $85 (sta zero page)

	; x = $ff, a = 0, y = 4
	sty zpDst+1
	sty zpSrc+1
	sta zpDst ; hidden sta zpDst (85 85)
	sta zpSrc ; hidden ldy zpY (a4 a8)
Loop:
        tay ; only runs first time through loop
;...

            lda zpDst
            adc #3 ; skip sta zpDst because it is done through the branch back to Loop-3
            bcc SkipIncDest
            inc zpDst+1
SkipIncDest:

        cmp #<40*18
        bne Loop-3 ; two hidden instructions to execute before "Loop"

So that’s my submission for the main loop of the challenge! But there is just one bit left I haven’t covered.

Clearing the garbage

The last little bit of the challenge is to clear up the lines that have input data but isn’t overwritten by decoded data which is line 18 through 24. This is another 240 bytes to overwrite with $20. Unfortunately I can’t rely on that the last character will be a space so I have to initialize both x and a to have this done reliably:

Finito:
	ldx #$f0
	lda #$20
Clear:
		dex
		sta $400 + 18*40,x
		bne Clear
    rts

Nothing to change about the loop but I can rearrange this to save one byte. I know a place on zero page that contains $10 ($100 – $f0) and I can assign x and a to this in one instruction, then shift a left to get $20. I just need to run the loop incrementally instead of decrementally:

Finito:
	lax zpTab + $41 - First + $10 ; read out $10 into a and x
	asl ; $10<<1 = $20
Clear:
		sta $400 + 18*40 - $10,x
		inx
		bne Clear
    rts

Submitted result

Here’s what I came up with by around 11 pm CET on April 29th!

org $c000
cpu 6502ill

const First = $2b ; '+', don't need any values below this
const zpTab = $2b ; nice and round start address
const zpY = $a8 ; opcode = tay
const zpSrc = $a4 ; opcode = ldy zp
const zpDst = $85 ; opcode = sta zp
const zpShift1 = 9
const zpShift2 = 10

	ldy #0
	ldx #63

Table:
		txa
		; note: $3f will be wrong offset on 2nd run but it is ok because it was
		; previously written at the correct location
StoreAt:
		sta.z $2f - First - 63 + zpTab,x
			cmp Base64_Range_Count,y
			bne Continue
			lda Base64_Range_Offs,y
			sta StoreAt+1
			iny
Continue:
		dex
		bpl Table

	; x = $ff, a = 0, y = 4
	sty zpDst+1
	sty zpSrc+1
	sta zpDst ; hidden sta zpDst (85 85)
	sta zpSrc ; hidden ldy zpY (a4 a8)
Loop:
        tay ; only runs first time through loop

		lax (zpSrc),y
		lda.z zpTab - First,x
        sta zpShift1
        iny

		lax (zpSrc),y
		lda.z zpTab - First,x
        asl
        asl
        asl
        rol zpShift1
        asl
        rol zpShift1
        pha
        iny

		lax (zpSrc),y
		lda.z zpTab - First,x
        sta zpShift2
        iny

		lax (zpSrc),y
		lda.z zpTab - First,x

        asl
        asl
        lsr zpShift2
        ror
        lsr zpShift2
        ror

        iny
            bne SkipIncSrc
            inc zpSrc+1
SkipIncSrc:
        sty zpY

        ldy #2
        sta (zpDst),y
        dey
        pla
        ora zpShift2
        sta (zpDst),y
        dey
        lda zpShift1
        sta (zpDst),y

            lda zpDst
            adc #3
            bcc SkipIncDest
            inc zpDst+1
SkipIncDest:

        cmp #<40*18
        bne Loop-3 ; two hidden instructions to execute before "Loop"

Finito:
	lax zpTab + $41 - First + $10 ; read out $10 into a and x
	asl ; $10<<1 = $20
Clear:
		sta $400 + 18*40 - $10,x
		inx
		bne Clear
    rts

; when table value equals this, change target pointer for next section of table
Base64_Range_Count:
		dc.b 63, 62, 52, 26
; counts & start values
Base64_Range_Offs:
		//dc.b $2f - First - 63 + zpTab ; Z, z, 9, =, \
		dc.b $2b - First - 62 + zpTab
		dc.b $30 - First - 52 + zpTab
		dc.b $61 - First - 26 + zpTab
		dc.b $41 - First - 0 + zpTab

Summary: code size 123 bytes, measured cycles: 39148 => 4815204 cyclebytes.

Now we just wait and see what I’ve missed! Plenty of time to not obsess about potential improvements at all until May 3rd.

Waking up April 30

Two solutions woke me up too early on Saturday:

Maybe I can save enough cycles on the clear loop at the end by using another sta abs,x so I only need half the amount of inx/bne?
Maybe if I create two tables for decoding data, so I add one with the original but shifted left twice to skip some asl?

Solution 1 would require 4 more bytes (lax trick doesn’t work, so ldx/lda plus the extra sta abs,x) and would save (240/2) * 5 cycles = 600 cycles. 129 * 38548 = 4972692 so not a saving at all!

Solution 2 requires quite a few more bytes to generate the table so that didn’t pan out either, I think I ended up at 37715 cycles but I don’t have that version to check the file size.

Waking up May 1st

Maybe I just didn’t go far enough with solution 2 from the day before? What if I had 2, 3, or 5 extra tables? I can only fit 3 tables on zero page so let’s focus on that.

So the target right now is getting rid of shifting on zero page (rol zp) which is 5 cycles compared with just 2 cycles for rol accumulator. That means if I have shift left by 4 and shift right by 4 I can cheaply get to shift by two or shift by 6. (Un)fortunately this turns out to need a lot more bytes and even costing more cycles on the first iteration. Adding a skip to generate the table saves just a little bit on the 2nd and 3rd runs so I abandon the attempt without fixing all the bugs.

I’m still ahead on my original submission so no regrets with that so far!

Checking out the discord channel I notice some participants are getting a little antsy, waiting 80+ hours to figure out what everyone else did is making everyone a little jumpy! With the spiral challenge I think I found the last couple of bytes just a few hours before the stream so even that was a little less stressful!

Anyway, trying again to come up with a way to skip building the table on the 2nd and 3rd run. Luckily since a and y are initialized by rebuilding the table and just initializing the registers in addition to modifying a byte in the table loop leads to the total number of cyclebytes to be larger than the submitted version (7 extra bytes to save ~1000 cycles).

May 2nd

Kind of surprised I haven’t found any improvements so far, feeling confident there is a significantly better method that I just can’t figure out even though more time has passed since the deadline than I spent on this challenge.

Some minor panic on an external test resulting in a corrupt screen! So turns out since I don’t assemble with Kick Assembler I just downloaded the ACME version that had a pre-assembled prg file that assembled the data as ASCII and not screen codes like Kick Assembler would have done. My solution so far has relied on the base64 data to be in ASCII and not screen codes. Chatting with Shallan to figure it out…..

… So no code was changed, the table indices had to be shifted and it works as intended with screen codes instead.

org $c000
cpu 6502ill

const First = 1 ; '+', don't need any values below this
const zpTab = 11 ; nice and round start address
const zpY = $a8 ; opcode = tay
const zpSrc = $a4 ; opcode = ldy zp
const zpDst = $85 ; opcode = sta zp
const zpShift1 = 9
const zpShift2 = 10

	ldy #0
	ldx #63

Table:
		txa
		; note: $3f will be wrong offset on 2nd run but it is ok because it was
		; previously written at the correct location
StoreAt:
		sta.z $2f - First - 63 + zpTab,x
			cmp Base64_Range_Count,y
			bne Continue
			lda Base64_Range_Offs,y
			sta StoreAt+1
			iny
Continue:
		dex
		bpl Table

	; x = $ff, a = 0, y = 4
	sty zpDst+1
	sty zpSrc+1
	sta zpDst ; hidden sta zpDst (85 85)
	sta zpSrc ; hidden ldy zpY (a4 a8)
Loop:
        tay ; only runs first time through loop

		lax (zpSrc),y
		lda.z zpTab - First,x
        sta zpShift1
        iny

		lax (zpSrc),y
		lda.z zpTab - First,x
        asl
        asl
        asl
        rol zpShift1
        asl
        rol zpShift1
        pha
        iny

		lax (zpSrc),y
		lda.z zpTab - First,x
        sta zpShift2
        iny

		lax (zpSrc),y
		lda.z zpTab - First,x

        asl
        asl
        lsr zpShift2
        ror
        lsr zpShift2
        ror

        iny
            bne SkipIncSrc
            inc zpSrc+1
SkipIncSrc:
        sty zpY

        ldy #2
        sta (zpDst),y
        dey
        pla
        ora zpShift2
        sta (zpDst),y
        dey
        lda zpShift1
        sta (zpDst),y

            lda zpDst
            adc #3
            bcc SkipIncDest
            inc zpDst+1
SkipIncDest:

        cmp #<40*18
        bne Loop-3 ; two hidden instructions to execute before "Loop"

Finito:
	lax zpTab + $41 - First + $10 ; read out $10 into a and x
	asl ; $10<<1 = $20
Clear:
		sta $400 + 18*40 - $10,x
		inx
		bne Clear
    rts

; when table value equals this, change target pointer for next section of table
Base64_Range_Count:
		dc.b 63, 62, 52, 26
; counts & start values
Base64_Range_Offs:
		//dc.b $2f - First - 63 + zpTab ; Z, z, 9, =, \
		dc.b $2b - First - 62 + zpTab
		dc.b $30 - First - 52 + zpTab
		dc.b $01 - First - 26 + zpTab
		dc.b $41 - First - 0 + zpTab

May the 4th, the day after

Time to check out what other challengers did that I didn’t do! Geir Straume always has interesting solutions and won so let’s check it out!

The sources are fairly long but I don’t think I can discuss it without including the full source first so here it is (gently edited for 8 space tabs):

/*
	An entry for Shallan's "Base64 Decode Challenge" (April 2022)

	Code by Geir Straume (Discord: GeirS#7311, Web: https://gstraume.itch.io/)
*/

// Tables for decoding (within $c000-$cfff or $04-$ff)
.var Decode0A	= End+1
.var Decode0B	= $c180
.var Decode1A	= $c200
.var Decode1B	= $c280
.var Decode2A	= $89
.var Decode2B	= $21	// Table located at $20+1 so that a zp address in the code may also be used as data (a space character)

// Pointers in zero page
.var s0			= $ae	// Pointer to source address for encoded data ($ae = Decode2A+$25, see initialization below)
.var Dest		= $b2	// Pointer to destination address for decoded data ($b2 = Decode2A+$29, see initialization below)

* = $c000

	ldy YValue:#$0b		// Value is $0b on first run, 0 on subsequent runs
	beq Main			// Branch if this is not the first run

	// Generate the initial decode table
	ldx #$3f
	stx Decode2B+$2f	// Value for char '/'
!:	dex
	dey
	stx Decode2B+$21,y	// Value for char '+' at Decode2B+$2b, and two values at +$25 and +$29 which become zero at Decode2A+$25 and +$29
	stx Decode2B+$30,y	// Values for digits
	bne !-
	ldy #$1c
SetupLoop:
	dex
	dey
	stx Decode2B-1,y	// Values for lowercase letters, and value $18 at Decode2B-1 which becomes $60 (opcode for 'rts') at Decode0A-1
	sty Decode2B+$29,x	// Values for uppercase letters
	bne SetupLoop
	// x=$18, y=0
	sty YValue		// Set y value to zero for subsequent runs

	// Generate the other decode tables based on the initial one
!:	lda Decode2B-$19,x
	lsr
	lsr
	sta Decode1B-$19,x
	lsr
	lsr
	sta Decode0B-$19,x
	lda Decode2B-$19,x
	asl
	asl
	sta Decode0A-$19,x
	asl
	asl
	sta Decode1A-$19,x
	asl
	asl
	sta Decode2A-$19,x
	inx
	bpl !-

Main:
	ldx #3			// Initial hi byte of destination address

OuterLoop:
	stx Dest+1		// Set hi byte of destination address
	inx
	stx s0+1		// Set hi byte of source addresses
	stx s1+1
	stx s2+1
	stx s3+1

InnerLoop:
	dec Dest		// Adjust lo byte of destination address
	lax (s0),y		// Get source byte #0
	lda Decode0A,x		// Decode for initial destination byte #0
	iny
	ldx s1:$0400,y		// Get source byte #1
	ora Decode0B,x		// Decode for and update destination byte #0
	sta (Dest),y		// Put destination byte #0 on screen
	lda Decode1A,x		// Decode for initial destination byte #1
	iny
	ldx s2:$0400,y		// Get source byte #2
	ora Decode1B,x		// Decode for and update destination byte #1
	sta (Dest),y		// Put destination byte #1 on screen
	lda Decode2A,x		// Decode for initial destination byte #2
	iny
	ldx s3:$0400,y		// Get source byte #3
	ora Decode2B,x		// Decode for and update destination byte #2
	sta (Dest),y		// Put destination byte #2 on screen
	iny					// Adjust lo byte
	bne InnerLoop		// Branch if no change of hi byte

	ldx s0+1		// Get hi byte of source address
	lda SetupLoop-4,x	// Use code as data: Values read are $ca, $88, $96, $20
	bmi OuterLoop		// Branch if not done yet

	// a=$20, x=7
!:	sta $06d0-7,x		// Clear the required part of the screen
	inx
	bne !-
End:
	// An 'rts' instruction is stored here when table 'Decode0A' is generated on first run

/*
NOTES:

 1. The code initializes all tables and variables it uses. It does not depend on any known default values in memory, in compliance with Shallan's statement here: https://discord.com/channels/594149844742832138/800697778463440896/961234686531751936

 2. The program decodes 64 bytes more than actually required. I don't consider this to be in violation of any of the challenge rules or the intent of the memory related rules. (I believe the intent is to prevent participants from taking advantage of known default values in memory.)

 3. The program clears screen area $06d0-$07cf inclusive. This should be within the rules, according to the following statement by Shallan: https://discord.com/channels/594149844742832138/800697778463440896/966288078660583464
*/

The comments in the code explain it very well so I don’t have to repeat that and go straight to what differs from my attempts!

Generating the table takes 6525 cycles, that is pretty massive compared to around 1200 for my own so skipping generating this is definitely worth 5 bytes! (sty abs and the beq to skip it). It is also 60 bytes of code, leaving just 70 bytes for the main code!

The lesson here is to not worry about setup cost if it saves even more in the main processing!

Moving on to the decoding loop I notice that the same register is used to index both source and destination, which is a huge benefit and possible by decrementing the destination pointer.

I’m a bit worried about using absolute addressing and self mod since it takes quite a few bytes to handle the self mod, but it should not be ignored as it can save registers needing to be reloaded or even produce smaller overall code!

Saving bytes on initializing a and x for the clear loop at the end is impressive, at one point I had something similar but had to remove it when I did other optimizations.

I’ll try to look through Scaslin’s version too, it sounds similar to mine but much faster!

SCaslin’s solution

SCaslin’s approach is closer to mine, generate the table without skipping to save bytes then use a single table with shifts. It has been through more optimizations though!

Here’s the source

// Base64 Coding Challenge Entry ( April 2022 )
// SCaslin 25th April 2022
//
// ##### cycles 28773 #####
// #####   bytes 114  #####
// #####  mCB 3280122 #####
//
// base64
// byte 1	 2	  3	   4
// src	00111111 00222222 00333333 00444444
// dst	11111122 22223333 33444444
//
// Notes
// no clc...first interation just clears the carry...saves 1 byte but only adds 21 cycles....overall mCB improvement.
// decode table is generated in zp using sta zp,x.... +/ case is brute force. loop has to dec so not to overwrite A-Z with 0-9 sta.
// src adrs is selfmod so we can use ldx abs,y. bytes are shifted together, code is compact and performs ok. using tables would be faster but bulky. 
// MAGIC HERE...Y is used as index into source and destination! Y is stepped by 4 each iteration but destination ptr(dptr) is decremented by 1 to give a step of 3.
// low byte of dptr is also used as loop count so starts at 240. 240 * 3 = 720 output bytes. Y steps 4 so only needs to check overflow once per loop
// selfmod code is reset at begining, same code is also used for increment so little byte cost. dptr is at $E8 so we have duel use, sta $E8 & inx. $E8 is inx opcode.
// illegal op ARR is used....same as ror, and #$C0 - saves a byte
// finally small loop to clear remaining screen bytes.
// bne sm: seems like it could be optimised out some how. maybe later....
// 
.label dptr = $E8
.label tmp2 = $6
.label tmp3 = $7
.label decodetab = $10

* = $c000
start:	ldx #25+1		// create decode table in zp, 0-25(inc) + extra interation to clear carry
!:	txa
	sta decodetab+$41, x	// A-Z
	adc #26
	sta decodetab+1, x	// a-z
	adc #26
	sta decodetab+$30, x	// 0-9
	dex
	bpl !-
	ldy #62
	sty decodetab+$2B	// +
	iny
	sty decodetab+$2F	// /
				// A=52 X=255 Y=63
	ldx #3
	ldy #16			// Y steps by 4
	lda #(256-16)

	.byte $85		// sta.zp dptr ($E8)
sm:	inx			// ($E8)
	stx dptr+1		// hi dptr
	stx sadr1+1		// self mods	
	stx sadr2+1
	stx sadr3+1
	stx sadr4+1

loop: 	ldx sadr1:$0401-16,y	// byte 2
	lda decodetab,x		// 00222222
	asl
	asl			// a = 22222200
	asl
	sta tmp2		// tmp2 = 22222000
	ldx sadr2:$0400-16,y	// byte 1
	lda decodetab,x		// a = 00111111
	rol			// a = 01111112
	asl tmp2		// tmp2 = 22220000
	rol			// a = 11111122
	sta (dptr),y		// dst 11111122

	ldx sadr4:$0402-16,y	// byte 3
	lda decodetab,x		// a = 00333333
	lsr			// a = 00033333
	ror tmp3		// tmp3 = 30000000
	lsr 			// a = 00003333
	ora tmp2		// a = 22223333
	iny
	sta (dptr),y		// dst+1 22223333

	ldx sadr3:$0403-1-16,y	// byte 4
	lda tmp3		// same as ror, and #$C0
	arr #$80		// now a = 33000000
	ora decodetab,x		// a = 33444444
	iny
	sta (dptr),y		// dst+2 33444444

	dec dptr		// dec dptr/count, step by 3
	beq clr
	iny
	iny
	bne loop
	ldx dptr+1
	bne sm

clr:	lda #$20		// clear remaining chars from screen
	ldy #6*40
!:	dey
	sta $0400+(18*40),y
	bne !-
	rts			// done

Generating the table without a compressed table looks like a much better approach, no absolute addressing needed, a lot fewer iterations in the loop since each section of the table is written to for each loop branch!

So another option to spending a large amount of cycles in the table generation is that you can also choose to make it really fast! Two opposite ways to approach it that are both better than what I did 🙂

Generating the table is 23 bytes compared to mine that is 22 plus 8 bytes data so that’s 7 bytes shorter already. The generation is just 632 cycles too which is about half what I have! Focusing on doing things entirely in zero page and immediate mode instructions seems like a good strategy.

Using the same index register for source and destination even if the stride differs works well here like it did for Geir, decrementing the destination is a small cost compared to keeping separate indices and having to temporarily store the result.

Again, using absolute addressing is good and interesting that there is no zero page indirect which GeirS used for one of the four bytes.

The one win my code has over this is that I only use three bytes to set up registers for clearing the empty space and SCaslin uses four.

If I had spent more time I think I would have switched over to 6 tables rather than refining the one table approach like SCaslin did.

Shallan’s April-22 Optimization Challenge

Results

Rules

Initial Conditions

The basic algorithm

The main body

Hunting for easy savings

Diving into the table solution again

Folding instructions

Clearing the garbage

Submitted result

Waking up April 30

Waking up May 1st

May 2nd

May the 4th, the day after

SCaslin’s solution

Published by Space Moguls

Leave a comment Cancel reply

Results

Rules

Initial Conditions

The basic algorithm

The main body

Hunting for easy savings

Diving into the table solution again

Folding instructions

Clearing the garbage

Submitted result

Waking up April 30

Waking up May 1st

May 2nd

May the 4th, the day after

SCaslin’s solution

Share this:

Related

Published by Space Moguls

Leave a comment Cancel reply