I spent some time looking into your logic bug, but I couldn't figure out exactly how it worked, since I got confused by the way you were shuffling values between registers and memory. I wondered how the code would look if it worked in the other direction: keeping values in memory and only loading them into registers when necessary. This is what I came up with (excluding the tile map and tile data, which are the same as in your code):

; Let $50+51 be our destination in VRAM
; since VRAM is outside the zero-page,
; we need two bytes
LDA #$00
STA $50
LDA #$02
STA $51

; Let $52 be our source in the tile-map
LDA #$20
STA $52

; The tile-map is always in the zero-page,
; but let's use a 16-bit address
; so we can use indirect addressing
LDA #$00
STA $53

draw_tile_loop:
; Each tile starts at the first pixel
LDY #$00

; Read the value pointed to by $0052
; into the accumulator.
; Without a "vanilla indirect" variant of LDA,
; we'll use "indirect, Y-indexed"
; since we just set Y to 0.
LDA ($52),Y

; If the tile-start is $FF...
CMP #$FF
; ...then we're at the end of the tile map
BEQ all_done

; Otherwise, let $54 be our source in the tile
STA $54
; Keep $55 as 00, so we can use indirect mode
LDA #$00
STA $55

draw_tile_data_loop:
; Copy the pixel from the tile data to VRAM
LDA ($54),Y
STA ($50),Y
; Prepare to copy the next pixel
INY
; Have we drawn all 8 pixels of this tile?
CPY #$08
; If not, go back and do the next one.
BNE draw_tile_data_loop

; Otherwise, we've done this tile,
; let's move to the next tile in the tile map
; We know the tile map is all in zero-page,
; so we can get away with INC.
INC $52

; Let's move to the next tile in VRAM too
; Since this is outside zero-page,
; we need to do a full 16-bit addition with carry
CLC
LDA $50
ADC #$08 ; 8 pixels in a tile!
STA $50
LDA $51
ADC #$00 ; propagate the carry
STA $51

JMP draw_tile_loop

all_done:
BRK

As a bonus, it demonstrates the "indirect, Y-indexed" mode! This version might be considered a little more wasteful, since it spends two zero-page bytes ($53 and $55) on values that will always be 0, and because it uses "LDA ($52),Y" with Y set to 0 to emulate the missing "indirect zero-page" mode. It works, though, and the inner pixel-copying loop ("draw_tile_data_loop") is quite tidy, I think!
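
If the spare $53 byte bothers you, here is a rough sketch of one way the tile-map read could be done without it. It is only an illustration, not something I've run against your tile data: it assumes the X register is free at that point, and that $52 still holds the zero-page address of the current tile-map entry, as in the listing above. It swaps "indirect, Y-indexed" for the "zero-page, X-indexed" mode, so the pointer's high byte is never consulted.

```
; Hypothetical replacement for the "LDA ($52),Y" tile-map read.
; Assumes $52 holds the zero-page address of the current
; tile-map entry, and that nothing else is using X here.
LDX $52      ; load the tile-map position into X
LDA $00,X    ; zero-page,X: read the byte at zero-page address $00 + X
CMP #$FF     ; same end-of-map check as before
BEQ all_done ; all_done is the label from the listing above
```

With this change the two instructions that clear $53 could go, at the price of tying up X. The $55 byte would still be needed, because "LDA ($54),Y" in the inner loop reads a full 16-bit pointer.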
