I wasn’t going to participate in this one. I need to work on my C64 game right now (Island Adventure) AND I’m at the beginning of a new (solo) project at work where I need to ship x projects in y months, where x is an order of magnitude greater than y..
Anyway, I got sick for a couple of days and on the 27th of April I got a bit bored. I don’t need to submit anything but seems like a fun distraction for the rest of the day. BIG mistake! Fortunately the challenge ended on the 29th so I was just hooked for a couple of days or so….

The challenge is to convert 960 bytes from standard base64 encoding to 720 bytes of C64 screen codes. The encoded data is in the same place as the destination so you need to overwrite your input buffer as you progress. Since the input is larger than the output there is also a block that needs to be cleared on screen (lines 18 through 24) adding a little extra.
Results
Before diving in to details, here are the all-important results. I’m pretty happy with a 6th place especially since I didn’t spend much time on this one. I’ll compare my result with some of the winners at the end!

Rules
APRIL 2022 – BASE64 DECODE CHALLENGE
- The aim is to write an efficient function to decode base64 strings and display them on screen
- You are given a screen of mixed case text, this is (starting at the top left) a single base64 encoded string (https://en.wikipedia.org/wiki/Base64)
- You must convert this string to binary data and display each byte sequentially from the top left, resulting in a decoded screen output consisting of 18 lines. None of the original encoded data must be visible, replace it with spaces.
- The main program will call the method at $c000 with the character set switched to allow upper and lower case and the screen ready populated with the encoded data
- No verification routine is available to submissions, entries must be self validated.
Base64- In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits.
Common to all binary-to-text encoding schemes, Base64 is designed to carry data stored in binary formats across chann…
- In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits.
- On stream on the day following competition closure, each submission will be loaded and run with 3 different pieces of encoded data. If the output passes it will receive a score, calculated from the average cyclebyte count of the 3 consecutive runs. (cyclebytes = Number of cycles * number of bytes in the program) Lowest score wins.
- Submissions must be made in the form of a prg containing just the code and data for $c000-$cfff, the main program should NOT be included.
- Closing date for entries is 23:59pm BST on the 29th April 2022.
- All code, vars and tables must reside between $c000-$cfff, with the excpetion that vars can be stored in zeropage ($04-$ff) The entry point will be $c000.
- Screen ram will remain at the default of $0400 and must not be changed.
- All zeropage from $04-$ff is free for use (vars only , no code) as BASIC and Kernal are banked out.
- The main program disables the screen so that bad lines are not present for the most accurate cycle timing. During testing you can disable this if needed.
- The main program uses two CIA timers to accurately time the function, do NOT disable stop or otherwise change these timers or the submission will not count.
- The function must be able to be called multiple times so any self modified code must set itself up.
Initial Conditions
With the testing code provided for the challenge you’re presented with this screen of data (this is uppercase, the result is displayed in lowercase but there isn’t much useful information on this screen in lowercase either)

The basic algorithm
Essentially if you start from the top left (address $400) fetch one byte at a time, convert the 8 bit character to a 6 bit value and write 6 bits at a time. And stop when you reach line 24! Then you have to clear the screen area from line 18 to line 24 so there isn’t any visible garbage. Seems easy enough.
Converting from 8 bits to 6 is a little awkward though, you have the following ranges to keep track of:
- A-Z ($41-$5a) => 00 – 25
a-z ($61-$7a) => 26 – 51*- a-z ($01-$1a) => 26 – 51
- 0-9 ($30-$39) => 52 – 61
- + ($2b) => 62
- / ($2f) => 63
*) So unfortunately I managed to get a borked version of the test program which was assembled with lowercase as ASCII, which matched the official format, instead of C64 screen codes so this entire post up to the end (May 2nd) is written with this assumption.
So you could attempt converting it on the fly, code size and speed is not that bad if you optimize it a little as long as you don’t try to unroll a loop or something silly like that. Here’s an example of how that could look:
clc
lda #$100 - $41
adc [base64 value]
bcs .alpha ; 52 values here
.not_alpha adc #$41-$30+52 ; 0-9, +, /
cmp #52 ; >52 => found value
bcs .done ; 10 values here
cmp #52 + $2d-$30 ; $2d is inbetween $2b and $2f
lda #$3e + 7
bne .sub
.alpha cmp #26
bcc .done
.sub sbc #$61-$41-25-1
.done
So to estimate the cycle time, A-Z would cost 15 cycles, a-z 16 cycles, 0-9 18 cycles, + and / 27 cycles and needing 25 bytes. My initial attempt benefited from this version over a table based lookup.
So for most of my experiments table lookups won out because the lookup itself is just 2-3 bytes and 4 cycles even if I generated the table.
I made one table generator and couldn’t find much to optimize in it, trying different branch conditions to save a byte here and there. This is what I made:
const First = $2b ; '+', don't need any values below this
const zpTab = $2b ; nice and round start address
ldy #0
ldx #63
Table:
txa
; note: $3f will be wrong offset on 2nd run but it is ok because it was
; previously written at the correct location
StoreAt:
sta.z $2f - First - 63 + zpTab,x
cmp Base64_Range_Count,y
bne Continue
lda Base64_Range_Offs,y
sta StoreAt+1
iny
Continue:
dex
bpl Table
...
; when table value equals this, change target pointer for next section of table
Base64_Range_Count:
dc.b 63, 62, 52, 26
; counts & start values
Base64_Range_Offs:
//dc.b $2f - First - 63 + zpTab ; Z, z, 9, =, \
dc.b $2b - First - 62 + zpTab
dc.b $30 - First - 52 + zpTab
dc.b $61 - First - 26 + zpTab
dc.b $41 - First - 0 + zpTab
This adds up to 1191 cycles to generate the table and 30 bytes so it i not insignificant but a fairly small part of the total.
You’re allowed to generate the table once and test if it was generated for the 2nd and 3rd runs but I didn’t find a case where that save more cyclebytes than it cost.
After the loop finishes the accumulator contains 0 and the y register contains 4. Pretty handy for initializing the source and destination zero page vectors!
The main body
So the majority of cycles and bytes are spent on the actual reading and writing of decoded bytes. Writing 6 bits at a time is a little awkward but by grouping 4 bytes in the output is 6*8 = 24 bits or 3 bytes. That means the 960 input bytes becomes 240 batches of 4 bytes.
Intuitively I want to read the 4 source bytes in a loop, decode each and store to zero page and then combine the 3 bytes. I have a zero page counter for the 240 iterations resulting in something like this:
Loop:
tya ; "free" storing of the destination offset
ldy #3
{
sta zpData,y ; store destination index (y) on 1st iteration of this loop
{
clc
lda #$100 - $41
adc (zpSrc),y
bcs .alpha ; 52 values here
.not_alpha
adc #$41-$30+52 ; 0-9, +, /
cmp #52 ; >52 => found value
bcs % ; 10 values here
cmp #52 + $2d-$30 ; $2d is inbetween $2b and $2f
lda #$3e + 7
bne .sub
.alpha cmp #26
bcc %
.sub sbc #$61-$41-25-1
}
dey
bpl !
}
; at this point the accelerator contains the first byte to process!
; restore the destination offset
ldy zpData+3
{
asl zpData ; shift 2nd byte up twice
asl zpData
asl zpData ; shift two bits from 2nd byte into first byte
rol
asl zpData
rol
}
sta (zpDst),y
; store first byte in destination
iny ; increment destination offset
{
bne % ; if dest offs lo wraps, increment dest offs hi
inc zpDst+1
}
{
lda zpData+1 ; get 3rd byte
lsr
; shift two bottom bits of 3rd byte into fourth byte
ror zpData+2
lsr
ror zpData+2
ora zpData ; merge 4 bits of 2nd with 4 bits of 3rd
}
sta (zpDst),y ; store combined 2nd and 3rd byte
iny
{
bne %
inc zpDst+1
}
{
lda zpData+2 ; fourth byte is ready
}
sta (zpDst),y
iny
; the 3rd byte will not wrap so no need to check
; {
; bne %
; inc zpDst+1
; }
dec zpCnt ; decrement the chunk counter
beq Finito ; 2 cycles for a branch not taken so move the exit to the end
; clc ; carry will always be clear here so safe to ignore
lda #4 ; increment the source counter
adc zpSrc
sta zpSrc
bcc Loop
inc zpSrc+1
bcs Loop
Finito:
I don’t have the exact cyclebyte count for this version, but I think it was a bit above 6 Mcb which also includes setting up the zero page vectors and count and then clearing the bottom part of the screen.
Hunting for easy savings
Where to start? Maybe using self modifying code of an absolute address for the source address would alleviate switching the y register?
Loop:
; read 4 bytes
ldx #3
{
{
clc
lda #$100 - $41
const Source = *+1
adc $400,x
bcs .alpha ; 52 values here
.not_alpha
adc #$41-$30+52 ; 0-9, +, /
cmp #52 ; >52 => found value
bcs % ; 10 values here
cmp #52 + $2d-$30 ; $2d is inbetween $2b and $2f
lda #$3e + 7
bne .sub
.alpha cmp #26
bcc %
.sub sbc #$61-$41-25-1
}
sta zpData,x
dex
bpl !
}
; ... inner part of the loop is the same except no need to restore y
adc Source
sta Source
bcc LoopInner
inc Source+1
bcs LoopInner
Finito:
Unfortuately the absolute addressing instead of zero page and extra bytes needed didn’t result in an improvement. But maybe x could be used for the chunk count instead of a zero page counter instead?
ldx #240
; ...
dex
beq Finito
Saving both bytes (3) and cycles (3 + 240*3) is a win!

Diving into the table solution again
While initially the in-place decoding was a slightly cheaper solution counting cyclebytes the table lookup is so tiny that it should yield better results, maybe even allowing unrolling the 4 byte source decode loop!
ldx byte
lda.z table,x
The decode itself is just 4 cycles, reading and decoding one byte could be as low as four bytes with the right addressing mode.
To solve looking up an indirect indexed byte into the x register without a tax I can use lax (undocumented opcode). To check if the chunks are complete check if the destination low byte is the low byte of the end address.
I can also try to store the decoded bytes in reverse order to see if I can save any operations.
or to put it together:
Loop:
lax (zpSrc),y ; read first source byte
lda.z zpTab - First,x ; decode
sta zpData ; temp storage first byte
iny
lax (zpSrc),y
lda.z zpTab - First,x
asl ; shift up two empty bits of second byte
asl
asl ; shift up two bits from second byte into first byte
rol zpData
asl
rol zpData
pha ; saving a couple of bytes using pha/pla instead of zp
iny
lax (zpSrc),y
lda.z zpTab - First,x
sta zpData+2 ; temp store 3rd byte
iny
lax (zpSrc),y
lda.z zpTab - First,x
asl ; shift up two bits from fourth source byte
asl
lsr zpData+2 ; shift down 2 bits from 3rd to 4th byte
ror
lsr zpData+2
ror
iny ; only need to check source addr wrap each 4 bytes!
bne SkipIncSrc
inc zpSrc+1
SkipIncSrc:
sty zpY ; save source offset
ldy #2 ; store destination bytes in reverse
sta (zpDst),y ; a already has 3rd dest byte
dey
pla ; combine 2nd and 3rd source byte into 2nd dest byte
ora zpData+2
sta (zpDst),y
dey
lda zpData ; first byte is ready to copy
sta (zpDst),y
lda zpDst ; increment dest address
adc #3
sta zpDst
bcc SkipIncDest
inc zpDst+1
SkipIncDest:
ldy zpY ; restore source offset
cmp #<40*18
bne Loop-1
}
Combined with savings from table setup and clearing the remaining space on screen this solution is just below 5 Mcb already!

Folding instructions
One popular approach in these challenges is to hide instructions in the arguments of opcodes, so for instance the lines sta $a8, tay becomes ldy $a4 if you skip the first byte.
Here is the zero page setup code right before the main loop:
; x = $ff, a = 0, y = 4
sty zpDst+1
sty zpSrc+1
sta zpDst
sta zpSrc
tay
Loop:
ldy zpY
So to skip the ldy zpY before the end of the loop end I can assign zpSrc to $a8 and zpY to $a4 to save 2 bytes from the loop:
; x = $ff, a = 0, y = 4
sty zpDst+1
sty zpSrc+1
sta zpDst
sta zpSrc ; hidden ldy zpY (a4 a8)
Loop:
tay ; only runs first time through loop
; ...
cmp #<40*18
bne Loop-1 ; hidden instruction to execute before "Loop"
And there is also the opportunity to be clever with zpDst since the accumulator contains the low byte of the destination I can defer writing that back to the zero page to the start of the loop by assigning it address $85 (sta zero page)
; x = $ff, a = 0, y = 4
sty zpDst+1
sty zpSrc+1
sta zpDst ; hidden sta zpDst (85 85)
sta zpSrc ; hidden ldy zpY (a4 a8)
Loop:
tay ; only runs first time through loop
;...
lda zpDst
adc #3 ; skip sta zpDst because it is done through the branch back to Loop-3
bcc SkipIncDest
inc zpDst+1
SkipIncDest:
cmp #<40*18
bne Loop-3 ; two hidden instructions to execute before "Loop"
So that’s my submission for the main loop of the challenge! But there is just one bit left I haven’t covered.

Clearing the garbage
The last little bit of the challenge is to clear up the lines that have input data but isn’t overwritten by decoded data which is line 18 through 24. This is another 240 bytes to overwrite with $20. Unfortunately I can’t rely on that the last character will be a space so I have to initialize both x and a to have this done reliably:
Finito:
ldx #$f0
lda #$20
Clear:
dex
sta $400 + 18*40,x
bne Clear
rts
Nothing to change about the loop but I can rearrange this to save one byte. I know a place on zero page that contains $10 ($100 – $f0) and I can assign x and a to this in one instruction, then shift a left to get $20. I just need to run the loop incrementally instead of decrementally:
Finito:
lax zpTab + $41 - First + $10 ; read out $10 into a and x
asl ; $10<<1 = $20
Clear:
sta $400 + 18*40 - $10,x
inx
bne Clear
rts

Submitted result
Here’s what I came up with by around 11 pm CET on April 29th!
org $c000
cpu 6502ill
const First = $2b ; '+', don't need any values below this
const zpTab = $2b ; nice and round start address
const zpY = $a8 ; opcode = tay
const zpSrc = $a4 ; opcode = ldy zp
const zpDst = $85 ; opcode = sta zp
const zpShift1 = 9
const zpShift2 = 10
ldy #0
ldx #63
Table:
txa
; note: $3f will be wrong offset on 2nd run but it is ok because it was
; previously written at the correct location
StoreAt:
sta.z $2f - First - 63 + zpTab,x
cmp Base64_Range_Count,y
bne Continue
lda Base64_Range_Offs,y
sta StoreAt+1
iny
Continue:
dex
bpl Table
; x = $ff, a = 0, y = 4
sty zpDst+1
sty zpSrc+1
sta zpDst ; hidden sta zpDst (85 85)
sta zpSrc ; hidden ldy zpY (a4 a8)
Loop:
tay ; only runs first time through loop
lax (zpSrc),y
lda.z zpTab - First,x
sta zpShift1
iny
lax (zpSrc),y
lda.z zpTab - First,x
asl
asl
asl
rol zpShift1
asl
rol zpShift1
pha
iny
lax (zpSrc),y
lda.z zpTab - First,x
sta zpShift2
iny
lax (zpSrc),y
lda.z zpTab - First,x
asl
asl
lsr zpShift2
ror
lsr zpShift2
ror
iny
bne SkipIncSrc
inc zpSrc+1
SkipIncSrc:
sty zpY
ldy #2
sta (zpDst),y
dey
pla
ora zpShift2
sta (zpDst),y
dey
lda zpShift1
sta (zpDst),y
lda zpDst
adc #3
bcc SkipIncDest
inc zpDst+1
SkipIncDest:
cmp #<40*18
bne Loop-3 ; two hidden instructions to execute before "Loop"
Finito:
lax zpTab + $41 - First + $10 ; read out $10 into a and x
asl ; $10<<1 = $20
Clear:
sta $400 + 18*40 - $10,x
inx
bne Clear
rts
; when table value equals this, change target pointer for next section of table
Base64_Range_Count:
dc.b 63, 62, 52, 26
; counts & start values
Base64_Range_Offs:
//dc.b $2f - First - 63 + zpTab ; Z, z, 9, =, \
dc.b $2b - First - 62 + zpTab
dc.b $30 - First - 52 + zpTab
dc.b $61 - First - 26 + zpTab
dc.b $41 - First - 0 + zpTab
Summary: code size 123 bytes, measured cycles: 39148 => 4815204 cyclebytes.
Now we just wait and see what I’ve missed! Plenty of time to not obsess about potential improvements at all until May 3rd.

Waking up April 30
Two solutions woke me up too early on Saturday:
- Maybe I can save enough cycles on the clear loop at the end by using another sta abs,x so I only need half the amount of inx/bne?
- Maybe if I create two tables for decoding data, so I add one with the original but shifted left twice to skip some asl?
Solution 1 would require 4 more bytes (lax trick doesn’t work, so ldx/lda plus the extra sta abs,x) and would save (240/2) * 5 cycles = 600 cycles. 129 * 38548 = 4972692 so not a saving at all!
Solution 2 requires quite a few more bytes to generate the table so that didn’t pan out either, I think I ended up at 37715 cycles but I don’t have that version to check the file size.

Waking up May 1st
Maybe I just didn’t go far enough with solution 2 from the day before? What if I had 2, 3, or 5 extra tables? I can only fit 3 tables on zero page so let’s focus on that.
So the target right now is getting rid of shifting on zero page (rol zp) which is 5 cycles compared with just 2 cycles for rol accumulator. That means if I have shift left by 4 and shift right by 4 I can cheaply get to shift by two or shift by 6. (Un)fortunately this turns out to need a lot more bytes and even costing more cycles on the first iteration. Adding a skip to generate the table saves just a little bit on the 2nd and 3rd runs so I abandon the attempt without fixing all the bugs.
I’m still ahead on my original submission so no regrets with that so far!
Checking out the discord channel I notice some participants are getting a little antsy, waiting 80+ hours to figure out what everyone else did is making everyone a little jumpy! With the spiral challenge I think I found the last couple of bytes just a few hours before the stream so even that was a little less stressful!
Anyway, trying again to come up with a way to skip building the table on the 2nd and 3rd run. Luckily since a and y are initialized by rebuilding the table and just initializing the registers in addition to modifying a byte in the table loop leads to the total number of cyclebytes to be larger than the submitted version (7 extra bytes to save ~1000 cycles).

May 2nd
Kind of surprised I haven’t found any improvements so far, feeling confident there is a significantly better method that I just can’t figure out even though more time has passed since the deadline than I spent on this challenge.
Some minor panic on an external test resulting in a corrupt screen! So turns out since I don’t assemble with Kick Assembler I just downloaded the ACME version that had a pre-assembled prg file that assembled the data as ASCII and not screen codes like Kick Assembler would have done. My solution so far has relied on the base64 data to be in ASCII and not screen codes. Chatting with Shallan to figure it out…..

… So no code was changed, the table indices had to be shifted and it works as intended with screen codes instead.
org $c000
cpu 6502ill
const First = 1 ; '+', don't need any values below this
const zpTab = 11 ; nice and round start address
const zpY = $a8 ; opcode = tay
const zpSrc = $a4 ; opcode = ldy zp
const zpDst = $85 ; opcode = sta zp
const zpShift1 = 9
const zpShift2 = 10
ldy #0
ldx #63
Table:
txa
; note: $3f will be wrong offset on 2nd run but it is ok because it was
; previously written at the correct location
StoreAt:
sta.z $2f - First - 63 + zpTab,x
cmp Base64_Range_Count,y
bne Continue
lda Base64_Range_Offs,y
sta StoreAt+1
iny
Continue:
dex
bpl Table
; x = $ff, a = 0, y = 4
sty zpDst+1
sty zpSrc+1
sta zpDst ; hidden sta zpDst (85 85)
sta zpSrc ; hidden ldy zpY (a4 a8)
Loop:
tay ; only runs first time through loop
lax (zpSrc),y
lda.z zpTab - First,x
sta zpShift1
iny
lax (zpSrc),y
lda.z zpTab - First,x
asl
asl
asl
rol zpShift1
asl
rol zpShift1
pha
iny
lax (zpSrc),y
lda.z zpTab - First,x
sta zpShift2
iny
lax (zpSrc),y
lda.z zpTab - First,x
asl
asl
lsr zpShift2
ror
lsr zpShift2
ror
iny
bne SkipIncSrc
inc zpSrc+1
SkipIncSrc:
sty zpY
ldy #2
sta (zpDst),y
dey
pla
ora zpShift2
sta (zpDst),y
dey
lda zpShift1
sta (zpDst),y
lda zpDst
adc #3
bcc SkipIncDest
inc zpDst+1
SkipIncDest:
cmp #<40*18
bne Loop-3 ; two hidden instructions to execute before "Loop"
Finito:
lax zpTab + $41 - First + $10 ; read out $10 into a and x
asl ; $10<<1 = $20
Clear:
sta $400 + 18*40 - $10,x
inx
bne Clear
rts
; when table value equals this, change target pointer for next section of table
Base64_Range_Count:
dc.b 63, 62, 52, 26
; counts & start values
Base64_Range_Offs:
//dc.b $2f - First - 63 + zpTab ; Z, z, 9, =, \
dc.b $2b - First - 62 + zpTab
dc.b $30 - First - 52 + zpTab
dc.b $01 - First - 26 + zpTab
dc.b $41 - First - 0 + zpTab
May the 4th, the day after
Time to check out what other challengers did that I didn’t do! Geir Straume always has interesting solutions and won so let’s check it out!
The sources are fairly long but I don’t think I can discuss it without including the full source first so here it is (gently edited for 8 space tabs):
/*
An entry for Shallan's "Base64 Decode Challenge" (April 2022)
Code by Geir Straume (Discord: GeirS#7311, Web: https://gstraume.itch.io/)
*/
// Tables for decoding (within $c000-$cfff or $04-$ff)
.var Decode0A = End+1
.var Decode0B = $c180
.var Decode1A = $c200
.var Decode1B = $c280
.var Decode2A = $89
.var Decode2B = $21 // Table located at $20+1 so that a zp address in the code may also be used as data (a space character)
// Pointers in zero page
.var s0 = $ae // Pointer to source address for encoded data ($ae = Decode2A+$25, see initialization below)
.var Dest = $b2 // Pointer to destination address for decoded data ($b2 = Decode2A+$29, see initialization below)
* = $c000
ldy YValue:#$0b // Value is $0b on first run, 0 on subsequent runs
beq Main // Branch if this is not the first run
// Generate the initial decode table
ldx #$3f
stx Decode2B+$2f // Value for char '/'
!: dex
dey
stx Decode2B+$21,y // Value for char '+' at Decode2B+$2b, and two values at +$25 and +$29 which become zero at Decode2A+$25 and +$29
stx Decode2B+$30,y // Values for digits
bne !-
ldy #$1c
SetupLoop:
dex
dey
stx Decode2B-1,y // Values for lowercase letters, and value $18 at Decode2B-1 which becomes $60 (opcode for 'rts') at Decode0A-1
sty Decode2B+$29,x // Values for uppercase letters
bne SetupLoop
// x=$18, y=0
sty YValue // Set y value to zero for subsequent runs
// Generate the other decode tables based on the initial one
!: lda Decode2B-$19,x
lsr
lsr
sta Decode1B-$19,x
lsr
lsr
sta Decode0B-$19,x
lda Decode2B-$19,x
asl
asl
sta Decode0A-$19,x
asl
asl
sta Decode1A-$19,x
asl
asl
sta Decode2A-$19,x
inx
bpl !-
Main:
ldx #3 // Initial hi byte of destination address
OuterLoop:
stx Dest+1 // Set hi byte of destination address
inx
stx s0+1 // Set hi byte of source addresses
stx s1+1
stx s2+1
stx s3+1
InnerLoop:
dec Dest // Adjust lo byte of destination address
lax (s0),y // Get source byte #0
lda Decode0A,x // Decode for initial destination byte #0
iny
ldx s1:$0400,y // Get source byte #1
ora Decode0B,x // Decode for and update destination byte #0
sta (Dest),y // Put destination byte #0 on screen
lda Decode1A,x // Decode for initial destination byte #1
iny
ldx s2:$0400,y // Get source byte #2
ora Decode1B,x // Decode for and update destination byte #1
sta (Dest),y // Put destination byte #1 on screen
lda Decode2A,x // Decode for initial destination byte #2
iny
ldx s3:$0400,y // Get source byte #3
ora Decode2B,x // Decode for and update destination byte #2
sta (Dest),y // Put destination byte #2 on screen
iny // Adjust lo byte
bne InnerLoop // Branch if no change of hi byte
ldx s0+1 // Get hi byte of source address
lda SetupLoop-4,x // Use code as data: Values read are $ca, $88, $96, $20
bmi OuterLoop // Branch if not done yet
// a=$20, x=7
!: sta $06d0-7,x // Clear the required part of the screen
inx
bne !-
End:
// An 'rts' instruction is stored here when table 'Decode0A' is generated on first run
/*
NOTES:
1. The code initializes all tables and variables it uses. It does not depend on any known default values in memory, in compliance with Shallan's statement here: https://discord.com/channels/594149844742832138/800697778463440896/961234686531751936
2. The program decodes 64 bytes more than actually required. I don't consider this to be in violation of any of the challenge rules or the intent of the memory related rules. (I believe the intent is to prevent participants from taking advantage of known default values in memory.)
3. The program clears screen area $06d0-$07cf inclusive. This should be within the rules, according to the following statement by Shallan: https://discord.com/channels/594149844742832138/800697778463440896/966288078660583464
*/
The comments in the code explain it very well so I don’t have to repeat that and go straight to what differs from my attempts!
Generating the table takes 6525 cycles, that is pretty massive compared to around 1200 for my own so skipping generating this is definitely worth 5 bytes! (sty abs and the beq to skip it). It is also 60 bytes of code, leaving just 70 bytes for the main code!
The lesson here is to not worry about setup cost if it saves even more in the main processing!
Moving on to the decoding loop I notice that the same register is used to index both source and destination, which is a huge benefit and possible by decrementing the destination pointer.
I’m a bit worried about using absolute addressing and self mod since it takes quite a few bytes to handle the self mod, but it should not be ignored as it can save registers needing to be reloaded or even produce smaller overall code!
Saving bytes on initializing a and x for the clear loop at the end is impressive, at one point I had something similar but had to remove it when I did other optimizations.
I’ll try to look through Scaslin’s version too, it sounds similar to mine but much faster!
SCaslin’s solution
SCaslin’s approach is closer to mine, generate the table without skipping to save bytes then use a single table with shifts. It has been through more optimizations though!
Here’s the source
// Base64 Coding Challenge Entry ( April 2022 )
// SCaslin 25th April 2022
//
// ##### cycles 28773 #####
// ##### bytes 114 #####
// ##### mCB 3280122 #####
//
// base64
// byte 1 2 3 4
// src 00111111 00222222 00333333 00444444
// dst 11111122 22223333 33444444
//
// Notes
// no clc...first interation just clears the carry...saves 1 byte but only adds 21 cycles....overall mCB improvement.
// decode table is generated in zp using sta zp,x.... +/ case is brute force. loop has to dec so not to overwrite A-Z with 0-9 sta.
// src adrs is selfmod so we can use ldx abs,y. bytes are shifted together, code is compact and performs ok. using tables would be faster but bulky.
// MAGIC HERE...Y is used as index into source and destination! Y is stepped by 4 each iteration but destination ptr(dptr) is decremented by 1 to give a step of 3.
// low byte of dptr is also used as loop count so starts at 240. 240 * 3 = 720 output bytes. Y steps 4 so only needs to check overflow once per loop
// selfmod code is reset at begining, same code is also used for increment so little byte cost. dptr is at $E8 so we have duel use, sta $E8 & inx. $E8 is inx opcode.
// illegal op ARR is used....same as ror, and #$C0 - saves a byte
// finally small loop to clear remaining screen bytes.
// bne sm: seems like it could be optimised out some how. maybe later....
//
.label dptr = $E8
.label tmp2 = $6
.label tmp3 = $7
.label decodetab = $10
* = $c000
start: ldx #25+1 // create decode table in zp, 0-25(inc) + extra interation to clear carry
!: txa
sta decodetab+$41, x // A-Z
adc #26
sta decodetab+1, x // a-z
adc #26
sta decodetab+$30, x // 0-9
dex
bpl !-
ldy #62
sty decodetab+$2B // +
iny
sty decodetab+$2F // /
// A=52 X=255 Y=63
ldx #3
ldy #16 // Y steps by 4
lda #(256-16)
.byte $85 // sta.zp dptr ($E8)
sm: inx // ($E8)
stx dptr+1 // hi dptr
stx sadr1+1 // self mods
stx sadr2+1
stx sadr3+1
stx sadr4+1
loop: ldx sadr1:$0401-16,y // byte 2
lda decodetab,x // 00222222
asl
asl // a = 22222200
asl
sta tmp2 // tmp2 = 22222000
ldx sadr2:$0400-16,y // byte 1
lda decodetab,x // a = 00111111
rol // a = 01111112
asl tmp2 // tmp2 = 22220000
rol // a = 11111122
sta (dptr),y // dst 11111122
ldx sadr4:$0402-16,y // byte 3
lda decodetab,x // a = 00333333
lsr // a = 00033333
ror tmp3 // tmp3 = 30000000
lsr // a = 00003333
ora tmp2 // a = 22223333
iny
sta (dptr),y // dst+1 22223333
ldx sadr3:$0403-1-16,y // byte 4
lda tmp3 // same as ror, and #$C0
arr #$80 // now a = 33000000
ora decodetab,x // a = 33444444
iny
sta (dptr),y // dst+2 33444444
dec dptr // dec dptr/count, step by 3
beq clr
iny
iny
bne loop
ldx dptr+1
bne sm
clr: lda #$20 // clear remaining chars from screen
ldy #6*40
!: dey
sta $0400+(18*40),y
bne !-
rts // done
Generating the table without a compressed table looks like a much better approach, no absolute addressing needed, a lot fewer iterations in the loop since each section of the table is written to for each loop branch!
So another option to spending a large amount of cycles in the table generation is that you can also choose to make it really fast! Two opposite ways to approach it that are both better than what I did 🙂
Generating the table is 23 bytes compared to mine that is 22 plus 8 bytes data so that’s 7 bytes shorter already. The generation is just 632 cycles too which is about half what I have! Focusing on doing things entirely in zero page and immediate mode instructions seems like a good strategy.
Using the same index register for source and destination even if the stride differs works well here like it did for Geir, decrementing the destination is a small cost compared to keeping separate indices and having to temporarily store the result.
Again, using absolute addressing is good and interesting that there is no zero page indirect which GeirS used for one of the four bytes.
The one win my code has over this is that I only use three bytes to set up registers for clearing the empty space and SCaslin uses four.
If I had spent more time I think I would have switched over to 6 tables rather than refining the one table approach like SCaslin did.