[Project Log] Python on the 6502/C64, 8080, 6800, 6809 and AVR

Oh! That explains it!

I walked through Commons yesterday morning and said hello. You did not respond so I thought you were asleep.

Now I know you were in guru meditation contemplating Python on the 6502. I’m glad you were not breaking the no sleeping on site rule!

I have an Arduino-ish class on the calendar and two under review. You are always welcome to come assist/harass.

While the guru continues to meditate over Python memory management, I have been playing with code optimization some more as well as bringing my ancient tools for the 6809 up to snuff.

Pascal code to add two variables looks like this on the 6502:

 			  00083	; 00022	    E := B + C;
 0291 18	      [2] 00084		clc
 0292 AD 0366	      [4] 00085		lda	B_
 0295 6D 0368	      [4] 00086		adc	C_
 0298 AA	      [2] 00087		tax
 0299 AD 0367	      [4] 00088		lda	B_+1
 029C 6D 0369	      [4] 00089		adc	C_+1
 029F 8E 036C	      [4] 00090		stx	E_
 02A2 8D 036D	      [4] 00091		sta	E_+1

and like this on the 6809:

 0000 FC 0100		      [6] 00001	         ldd    B_
 0003 F3 0102		      [7] 00002	         addd   C_
 0006 FD 0104		      [6] 00003	         std    E_

The 6502 takes 28 clock cycles and 20 bytes; the 6809 takes 19 cycles and 9 bytes, less than half the memory space. Granted, modern 6502 family processors can run at 14 MHz while the best 6809 managed 2 MHz.
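To put the clock-rate caveat in numbers, here is a small sketch that turns the cycle counts from the two listings into elapsed time at the clock rates just mentioned. The helper name is made up for illustration, and it assumes one machine cycle per clock tick at the quoted rate.

```python
# Cycle counts copied from the bracketed column of the two listings above.
CYCLES_6502 = [2, 4, 4, 2, 4, 4, 4, 4]   # clc, lda, adc, tax, lda, adc, stx, sta
CYCLES_6809 = [6, 7, 6]                  # ldd, addd, std


def microseconds(cycles, clock_mhz):
    """Elapsed time in microseconds for a cycle count at a clock rate in MHz."""
    return cycles / clock_mhz


total_6502 = sum(CYCLES_6502)   # 28
total_6809 = sum(CYCLES_6809)   # 19

print(total_6502, "cycles at 14 MHz:", microseconds(total_6502, 14), "us")
print(total_6809, "cycles at 2 MHz:", microseconds(total_6809, 2), "us")
```

So the 6809 wins on cycles and bytes, but a fast modern 6502 part would still finish the addition sooner.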

Almost… the Hitachi 63C09 variant could clock up to 3.5 MHz. It had a couple of new instructions too. It’s popular with the TRS-80 CoCo crowd because it is a pin-compatible upgrade.

A couple?

The 6309 is to the 6809 as the 65816 is to the 65C02. Many more registers and addressing modes and a supercharged “native” mode.

https://datassette.org/livros/tandy-trs-color/6309-book

One of the test programs I use while working on the 6809 simulator is a FORTH interpreter written for the 6800. The 6809 is not binary-compatible with the 6800, but 6800 assembly language source files can be assembled into 6809 binaries.

FORTH is based on manipulating 16-bit values on a stack. Since the 6809 was designed to handle such code efficiently, I decided to see how well it does.

The FORTH word I chose at random to study is “equal.” It compares the top two values on the stack and replaces them with a one if equal and zero otherwise.
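For reference while reading the listings, the behaviour of the word can be sketched in a few lines of Python. The function name and the use of a Python list as the data stack are just for illustration; they are not part of the FORTH system.

```python
def forth_equal(stack):
    """Replace the top two 16-bit cells with 1 if they are equal, else 0."""
    first = stack.pop() & 0xFFFF
    second = stack.pop() & 0xFFFF
    stack.append(1 if first == second else 0)
    return stack


stack = [7, 1234, 1234]
forth_equal(stack)
print(stack)   # [7, 1]
```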

The original 6800 implementation is:

 085D 5F	      [2] 02159	         clrb             ; Presume not equal
 			  02160
 085E 32	      [4] 02161	         pula             ; Get a number
 085F 97 0A	      [4] 02162	         staa   Tmp
 0861 32	      [4] 02163	         pula
 0862 97 0B	      [4] 02164	         staa   Tmp+1
 			  02165
 0864 32	      [4] 02166	         pula             ; Get the other
 0865 91 0A	      [3] 02167	         cmpa   Tmp
 0867 32	      [4] 02168	         pula
 0868 26 05 (086F)    [4] 02169	         bne    ___eq_NotEqual
 			  02170
 086A 91 0B	      [3] 02171	         cmpa   Tmp+1     ; Upper bytes equal, compare lower
 086C 26 01 (086F)    [4] 02172	         bne    ___eq_NotEqual
 			  02173
 086E 5C	      [2] 02174	         incb
 			  02175
 086F			  02176	___eq_NotEqual:
 086F 4F	      [2] 02177	         clra             ; Clear upper byte
 			  02178
 0870 37	      [4] 02179	         pshb             ; Push result
 0871 36	      [4] 02180	         psha
 			  02181
 0872 7E 017A	      [3] 02182	         jmp    Next

55 machine cycles through the equal path.

A much better 6809 implementation is:

 0000 35 06		      [7] 00001	         puls   D         ; Get first number
 				  00002
 0002 10A3 E4		      [7] 00003	         cmpd   ,S        ; Compare with second number
 0005 26 04 (000B)	      [3] 00004	         bne    ___eq_NotEqual
 				  00005
 0007 C6 01		      [2] 00006	         ldab   #1        ; Indicate equal
 				  00007
 0009 20 01 (000C)	      [3] 00008	         bra    ___eq_Finish
 				  00009
 000B				  00010	___eq_NotEqual
 000B 5F		      [2] 00011	         clrb
 				  00012
 000C				  00013	___eq_Finish
 000C 4F		      [2] 00014	         clra
 000D ED E4		      [5] 00015	         std    ,S        ; Store result
 				  00016
 000F 7E 0012		      [4] 00017	         jmp    Next      ; Pass control to inner interpreter

33 machine cycles through the equal path.

My FORTH system has never been adapted to the 6502 so I will have to do some speculating as to how the equal word might be implemented there.

A simple paraphrase of the 6800 code to the 6502 would be:

 0006 68	      [4] 00006		pla			; Get low byte of first number
 0007 85 04	      [3] 00007		sta	Tmp
 			  00008
 0009 68	      [4] 00009		pla			; Get high byte of first number
 000A 85 05	      [3] 00010		sta	Tmp+1
 			  00011
 000C 68	      [4] 00012		pla			; Compare low bytes
 000D C5 04	      [3] 00013		cmp	Tmp
 000F D0 0A (001B)  [2/3] 00014		bne	___eq_NotEqual
 			  00015
 0011 68	      [4] 00016		pla			; Compare high bytes
 0012 C5 05	      [3] 00017		cmp	Tmp+1
 0014 D0 06 (001C)  [2/3] 00018		bne	___eq_NotEqual2
 			  00019
 0016 A2 01	      [2] 00020		ldx	#1
 			  00021
 0018 4C 001E	      [3] 00022		jmp	___eq_Finish
 			  00023
 001B			  00024	___eq_NotEqual
 001B 68	      [4] 00025		pla			; Discard high byte of second number
 			  00026
 001C			  00027	___eq_NotEqual2
 001C A2 00	      [2] 00028		ldx	#0
 			  00029
 001E			  00030	___eq_Finish
 001E A9 00	      [2] 00031		lda	#0		; Push result
 0020 48	      [3] 00032		pha
 0021 8A	      [2] 00033		txa
 0022 48	      [3] 00034		pha
 			  00035
 0023 4C 0026	      [3] 00036		jmp	Next

50 machine cycles through the equal path.

The stack of the 6502 is limited to 256 bytes.

FORTH is stack-oriented. 16-bit words are manipulated on the stack.

I do not have enough experience to know whether 128 words are enough, but it seems limiting, especially considering that the small 6502 stack is shared with the return addresses for subroutine calls and interrupts.

An easy way to double the amount of stack space is to allocate two 256 byte blocks of memory, one for the low bytes of words and one for the high, keep pointers to them in the zero page and use the Y register as the stack pointer. This stack is separate from the processor’s machine stack and is needed anyway since FORTH actually mandates two stacks, a data stack and a return stack.

An added benefit of this approach is that both bytes of the top word on the stack are easily accessible without having to mess with the stack pointer.
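A rough Python model of the split stack, in case the layout is hard to picture. The class and attribute names are invented for this sketch, the two bytearrays stand in for the two 256-byte blocks, and there is no overflow or underflow checking.

```python
class SplitStack:
    """Model of a 16-bit data stack split into low-byte and high-byte blocks."""

    def __init__(self):
        self.lo = bytearray(256)   # block holding the low bytes
        self.hi = bytearray(256)   # block holding the high bytes
        self.y = 0                 # stands in for the Y register; 0 means empty

    def push(self, value):
        self.y = (self.y - 1) & 0xFF          # like DEY: the stack grows downward
        self.lo[self.y] = value & 0xFF
        self.hi[self.y] = (value >> 8) & 0xFF

    def top(self):
        # Both bytes of the top word sit at the same index in the two blocks.
        return self.lo[self.y] | (self.hi[self.y] << 8)

    def pop(self):
        value = self.top()
        self.y = (self.y + 1) & 0xFF          # like INY: drop the top word
        return value


def equal(stack):
    """Pop one word, then overwrite the new top word in place with 1 or 0."""
    first = stack.pop()
    result = 1 if first == stack.top() else 0
    stack.lo[stack.y] = result
    stack.hi[stack.y] = 0


s = SplitStack()
s.push(0x1234)
s.push(0x1234)
equal(s)
print(s.top())   # 1
```

One index register covers 256 words this way, where a single 256-byte block of interleaved bytes would hold only 128.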

Code for the equal word on the 6502 might look something like this:

 0004 B1 00	    [5/6] 00018		lda	(StkLo),Y	; Get low byte of first number
 0006 AA	      [2] 00019		tax
 			  00020
 0007 B1 02	    [5/6] 00021		lda	(StkHi),Y	; Get high byte of the first number
 			  00022
 0009 C8	      [2] 00023		iny			; Jettison the first number
 			  00024
 000A D1 02	    [5/6] 00025		cmp	(StkHi),Y	; Compare high bytes
 000C D0 0A (0018)  [2/3] 00026		bne	___eq_NotEqual
 			  00027
 000E 8A	      [2] 00028		txa			; Compare low bytes
 000F D1 00	    [5/6] 00029		cmp	(StkLo),Y
 0011 D0 05 (0018)  [2/3] 00030		bne	___eq_NotEqual
 			  00031
 0013 A9 01	      [2] 00032		lda	#1		; Indicate equal
 			  00033
 0015 4C 001A	      [3] 00034		jmp	___eq_Finish
 			  00035
 0018			  00036	___eq_NotEqual
 0018 A9 00	      [2] 00037		lda	#0		; Indicate not equal
 			  00038
 001A			  00039	___eq_Finish
 001A 91 00	      [6] 00040		sta	(StkLo),Y	; Store result
 001C A9 00	      [2] 00041		lda	#0
 001E 91 02	      [6] 00042		sta	(StkHi),Y
 			  00043
 0020 4C 0023	      [3] 00044		jmp	Next		; Pass control to inner interpreter

52 machine cycles through the equal path. Slightly slower, but worth doing if the larger stack is needed.

That made me curious about how much the original 6800 implementation could benefit from “indexing the stack”:

 0000 32	      [4] 00001	         pula             ; Get high byte of first number
 0001 33	      [4] 00002	         pulb             ; Get low byte of first number
 			  00003
 0002 30	      [4] 00004	         tsx              ; Prepare to index the stack
 			  00005
 0003 E1 01	      [5] 00006	         cmpb   1,X       ; Compare low bytes
 0005 26 08 (000F)    [4] 00007	         bne    ___eq_NotEqual
 			  00008
 0007 A1 00	      [5] 00009	         cmpa   ,X        ; Compare high bytes
 0009 26 04 (000F)    [4] 00010	         bne    ___eq_NotEqual
 			  00011
 000B 86 01	      [2] 00012	         ldaa   #1        ; Indicate equal
 			  00013
 000D 20 01 (0010)    [4] 00014	         bra    ___eq_Finish
 			  00015
 000F			  00016	___eq_NotEqual
 000F 4F	      [2] 00017	         clra             ; Indicate not equal
 			  00018
 0010			  00019	___eq_Finish
 0010 A7 01	      [6] 00020	         staa   1,X
 0012 6F 00	      [7] 00021	         clr    ,X
 			  00022
 0014 7E 0017	      [3] 00023	         jmp    Next      ; Pass control to inner interpreter

52 machine cycles through the equal path. Surprisingly identical to the 6502!

I am a few days away from wrapping up work on my debugger for the 68000 microprocessor.

After that, I’ll put some polish on the AVR tools. AVR is the controller on the Arduino.

Then decide whether or not to make the MSP430 tools usable; they currently do not work.

And then, back to Python. While I was out collecting flu strains, I lacked the concentration to think through the memory management issues so I read about memory management techniques in compiler books. Stay with reference counting or switch to something else?

I demonstrated what I have at the last retrocomputing meeting and got some good feedback. The current plan is to have something you can play with by the next one (late July?)

For those few of you in the compiler prerelease testing program, you may want to try this program:

twos = 1
n = 0
while n <= 250:
    print(n, twos)
    twos = twos + twos
    n = n + 1

Like fibo, it demonstrates the function and usefulness of variable-precision integers. Unlike fibo, it is a little more intuitive to watch run.
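To show why variable precision matters here, the sketch below does the same doubling on little-endian byte strings, one byte at a time with a carry, roughly the way an 8-bit CPU chains add-with-carry instructions. It is only an illustration: the function name is made up and this is not the representation the compiler actually uses.

```python
def add_bytes(a, b):
    """Add two little-endian unsigned integers held as byte strings."""
    length = max(len(a), len(b))
    a = a.ljust(length, b"\x00")    # pad the shorter operand with zero bytes
    b = b.ljust(length, b"\x00")
    result = bytearray()
    carry = 0                       # like CLC before the first ADC
    for x, y in zip(a, b):
        total = x + y + carry       # like ADC: add with carry in
        result.append(total & 0xFF)
        carry = total >> 8          # carry out feeds the next byte
    if carry:
        result.append(carry)        # grow by one byte instead of overflowing
    return bytes(result)


twos = b"\x01"
for n in range(250):
    twos = add_bytes(twos, twos)    # twos = twos + twos

print(len(twos), "bytes")           # 32 bytes to hold 2**250
```

A fixed 16-bit integer would have overflowed after sixteen doublings; here the value simply gets longer.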

If you are not in the testing program and want to be, PM me and I will add you to the list.

Due to the recent interest in the Python compiler, I have resumed work on it.

The 68000 toolset work is still incomplete: the single-line assembler in the debugger is mostly missing and none of the processor simulator is there. The AVR tools need to be refreshed and I have yet to decide whether the TI MSP430 stuff is worth the effort now. These processors are in play because they are potential Python targets.

Anyway, before I went off collecting flu strains, I was playing Little Dutch Boy plugging memory leaks in expression evaluation. These show up in many forms.

  • a + b + c
    First evaluate a + b, then evaluate (a + b) + c and finally garbage collect the storage for (a + b).

  • a * b + c * d
    Evaluate a * b, then evaluate c * d, and then (a * b) + (c * d) and finally garbage collect the storage for both (a * b) and (c * d).

  • if a + b < c:
    Evaluate a + b, then evaluate (a + b) < c and finally garbage collect the storage for (a + b).

  • if a + b:
    Evaluate a + b, then convert it to a bool and finally garbage collect the storage for (a + b).
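The common pattern in these cases can be sketched with a toy reference-counted object. Everything here (the Obj class, its live counter, the add and lt helpers) is hypothetical and only meant to show where the temporary has to be released; it is not the compiler's runtime.

```python
class Obj:
    """A heap object with a reference count (hypothetical, for illustration)."""
    live = 0    # number of objects currently allocated

    def __init__(self, value):
        self.value = value
        self.refs = 1
        Obj.live += 1

    def deref(self):
        self.refs -= 1
        if self.refs == 0:
            Obj.live -= 1           # storage goes back to the heap


def add(x, y):
    return Obj(x.value + y.value)   # result is a fresh object with one reference


def lt(x, y):
    return x.value < y.value        # a plain flag; no heap object in this sketch


a, b, c = Obj(1), Obj(2), Obj(3)

# e = a + b + c
tmp = add(a, b)       # temporary for (a + b)
e = add(tmp, c)
tmp.deref()           # without this the temporary leaks

# if a + b < c:
tmp = add(a, b)
flag = lt(tmp, c)
tmp.deref()           # the temporary dies even though no value is kept

print(Obj.live)       # 4: a, b, c and e
```

Dropping either deref() call leaves the count at 5, which is the kind of leak described above.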

Any others?

Is this a compiler that runs on the 6502 and generates 6502?

It is a cross-compiler running on a Windows or DOS PC generating 6502 code.

In that case, have you considered just converting pre-digested Python code (pyc files)?

http://effbot.org/pyfaq/how-do-i-create-a-pyc-file.htm

No, I was not aware of the compiled bytecode file when I started the project.

But from a little reading about it now at https://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html, the bytecode file format changes with the version of Python doing the compiling. That is a variable I do not need.
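The version dependence is easy to see for yourself. This sketch compiles a throwaway file with the standard py_compile module and compares the first four bytes of the resulting pyc with importlib.util.MAGIC_NUMBER, the marker of the interpreter running the script. The file names are arbitrary.

```python
import importlib.util
import os
import py_compile
import sys
import tempfile

# Every .pyc starts with a 4-byte magic number tied to the interpreter that wrote it.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "demo.py")
    with open(source, "w") as f:
        f.write("twos = 1 + 1\n")
    compiled = py_compile.compile(
        source, cfile=os.path.join(tmp, "demo.pyc"), doraise=True
    )
    with open(compiled, "rb") as f:
        header = f.read(4)

same = header == importlib.util.MAGIC_NUMBER
print(sys.version_info[:2], header.hex(), same)
```

Running it under two different Python releases should print two different hex values, which is the moving target mentioned above.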

Unfortunately, Python is plagued with this sort of thing. There are even two major incompatible branches, Python 2 and 3, and the language itself differs between them. You are going to have to figure out which one you want to support anyway.

That is why I decreed Python 3 only, did some preliminary lexer and parser work, then dove into implementing variable precision integers.

I believe the major change in pyc was between 2.5 and 2.6. I believe, but may be wrong, that all Python 3 versions share the same pyc format.

I had thought that multiply and floordiv were not implemented. Looking at the code, they are in there, so I think I simply do not remember having finished testing them.

A quick test showed they were working:

 			  00098	; 00001	print(3*'a')
 020E A9 01	      [2] 00099		lda	#1
 0210 A2 00	      [2] 00100		ldx	#0
 0212 20 135A	      [6] 00101		jsr	AllocPargsAndKargs
 0215 A0 03	      [2] 00102		ldy	#3
 0217 A2 BD	      [2] 00103		ldx	#S_00000&$FF
 0219 A9 29	      [2] 00104		lda	#S_00000>>8
 021B 20 13C0	      [6] 00105		jsr	StoreParg
 021E A2 EE	      [2] 00106		ldx	#_print&$FF
 0220 A9 29	      [2] 00107		lda	#_print>>8
 0222 20 1342	      [6] 00108		jsr	GetObject
 0225 20 1405	      [6] 00109		jsr	object.__call__
 0228 20 13E2	      [6] 00110		jsr	FreePargsAndKargs
 			  00111	; 00002	print(6//2)
 022B A2 B5	      [2] 00112		ldx	#I_00002&$FF
 022D A9 29	      [2] 00113		lda	#I_00002>>8
 022F 86 1A	      [3] 00114		stx	Ptr1
 0231 85 1B	      [3] 00115		sta	Ptr1+1
 0233 A2 AD	      [2] 00116		ldx	#I_00006&$FF
 0235 A9 29	      [2] 00117		lda	#I_00006>>8
 0237 86 18	      [3] 00118		stx	Ptr0
 0239 85 19	      [3] 00119		sta	Ptr0+1
 023B 20 1D1B	      [6] 00120		jsr	object.__floordiv__
 023E 48	      [3] 00121		pha
 023F 8A	      [2] 00122		txa
 0240 48	      [3] 00123		pha
 0241 A9 01	      [2] 00124		lda	#1
 0243 A2 00	      [2] 00125		ldx	#0
 0245 20 135A	      [6] 00126		jsr	AllocPargsAndKargs
 0248 A0 03	      [2] 00127		ldy	#3
 024A 68	      [4] 00128		pla
 024B AA	      [2] 00129		tax
 024C 68	      [4] 00130		pla
 024D 20 13C0	      [6] 00131		jsr	StoreParg
 0250 A2 EE	      [2] 00132		ldx	#_print&$FF
 0252 A9 29	      [2] 00133		lda	#_print>>8
 0254 20 1342	      [6] 00134		jsr	GetObject
 0257 20 1405	      [6] 00135		jsr	object.__call__
 025A A0 02	      [2] 00136		ldy	#2
 025C 20 13D0	      [6] 00137		jsr	DerefParg
 025F 20 13E2	      [6] 00138		jsr	FreePargsAndKargs
 			  00139	; 00003	print(2*3)
 0262 A9 01	      [2] 00140		lda	#1
 0264 A2 00	      [2] 00141		ldx	#0
 0266 20 135A	      [6] 00142		jsr	AllocPargsAndKargs
 0269 A0 03	      [2] 00143		ldy	#3
 026B A2 AD	      [2] 00144		ldx	#I_00006&$FF
 026D A9 29	      [2] 00145		lda	#I_00006>>8
 026F 20 13C0	      [6] 00146		jsr	StoreParg
 0272 A2 EE	      [2] 00147		ldx	#_print&$FF
 0274 A9 29	      [2] 00148		lda	#_print>>8
 0276 20 1342	      [6] 00149		jsr	GetObject
 0279 20 1405	      [6] 00150		jsr	object.__call__
 027C 20 13E2	      [6] 00151		jsr	FreePargsAndKargs

Not so fast!

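A look at the generated code revealed that some, but not all, of the expressions were converted to constants at compile time: 3*'a' and 2*3 were folded into the constants S_00000 and I_00006, while 6//2 still calls object.__floordiv__ at run time.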
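For anyone unfamiliar with the technique, here is a minimal sketch of constant folding built on Python's own ast module. It is not the compiler's code; the Folder class and the table of foldable operators are assumptions made for the example, and it needs Python 3.9 or later for ast.unparse.

```python
import ast
import operator

# Only these operators are folded in this sketch.
FOLDABLE = {ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}


class Folder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)    # fold the operands first
        op = FOLDABLE.get(type(node.op))
        if (op and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)):
            try:
                value = op(node.left.value, node.right.value)
            except Exception:
                return node         # e.g. division by zero: leave it for run time
            return ast.copy_location(ast.Constant(value), node)
        return node


def fold(source):
    tree = Folder().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


print(fold("print(3*'a')"))   # print('aaa')
print(fold("print(6//2)"))    # print(3)
print(fold("print(b//c)"))    # print(b // c)
```

Expressions built only from literals disappear at compile time, so a test has to use variables to reach the run-time routines.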

A further test, using variables to get around the constant folding, showed that integer multiply and floor divide are implemented but string multiply is not:

 			  00152	; 00004	a = 'a'
 			  00153
 027F AE 29D2	      [4] 00154		ldx	_a
 0282 AD 29D3	      [4] 00155		lda	_a+1
 0285 20 0CA9	      [6] 00156		jsr	DeRef
 			  00157
 0288 A2 B7	      [2] 00158		ldx	#S_00001&$FF
 028A A9 29	      [2] 00159		lda	#S_00001>>8
 028C 8E 29D2	      [4] 00160		stx	_a
 028F 8D 29D3	      [4] 00161		sta	_a+1
 			  00162	; 00005	b = 6
 			  00163
 0292 AE 29D7	      [4] 00164		ldx	_b
 0295 AD 29D8	      [4] 00165		lda	_b+1
 0298 20 0CA9	      [6] 00166		jsr	DeRef
 			  00167
 029B A2 AD	      [2] 00168		ldx	#I_00006&$FF
 029D A9 29	      [2] 00169		lda	#I_00006>>8
 029F 8E 29D7	      [4] 00170		stx	_b
 02A2 8D 29D8	      [4] 00171		sta	_b+1
 			  00172	; 00006	c = 2
 			  00173
 02A5 AE 29DC	      [4] 00174		ldx	_c
 02A8 AD 29DD	      [4] 00175		lda	_c+1
 02AB 20 0CA9	      [6] 00176		jsr	DeRef
 			  00177
 02AE A2 B5	      [2] 00178		ldx	#I_00002&$FF
 02B0 A9 29	      [2] 00179		lda	#I_00002>>8
 02B2 8E 29DC	      [4] 00180		stx	_c
 02B5 8D 29DD	      [4] 00181		sta	_c+1
 			  00182	; 00007	d = 3
 			  00183
 02B8 AE 29E1	      [4] 00184		ldx	_d
 02BB AD 29E2	      [4] 00185		lda	_d+1
 02BE 20 0CA9	      [6] 00186		jsr	DeRef
 			  00187
 02C1 A2 B3	      [2] 00188		ldx	#I_00003&$FF
 02C3 A9 29	      [2] 00189		lda	#I_00003>>8
 02C5 8E 29E1	      [4] 00190		stx	_d
 02C8 8D 29E2	      [4] 00191		sta	_d+1
 			  00192	; 00008	#print(3*a)
 			  00193	; 00009	print(b//c)
 02CB A2 DC	      [2] 00194		ldx	#_c&$FF
 02CD A9 29	      [2] 00195		lda	#_c>>8
 02CF 20 1342	      [6] 00196		jsr	GetObject
 02D2 86 1A	      [3] 00197		stx	Ptr1
 02D4 85 1B	      [3] 00198		sta	Ptr1+1
 02D6 A2 D7	      [2] 00199		ldx	#_b&$FF
 02D8 A9 29	      [2] 00200		lda	#_b>>8
 02DA 20 1342	      [6] 00201		jsr	GetObject
 02DD 86 18	      [3] 00202		stx	Ptr0
 02DF 85 19	      [3] 00203		sta	Ptr0+1
 02E1 20 1D1B	      [6] 00204		jsr	object.__floordiv__
 02E4 48	      [3] 00205		pha
 02E5 8A	      [2] 00206		txa
 02E6 48	      [3] 00207		pha
 02E7 A9 01	      [2] 00208		lda	#1
 02E9 A2 00	      [2] 00209		ldx	#0
 02EB 20 135A	      [6] 00210		jsr	AllocPargsAndKargs
 02EE A0 03	      [2] 00211		ldy	#3
 02F0 68	      [4] 00212		pla
 02F1 AA	      [2] 00213		tax
 02F2 68	      [4] 00214		pla
 02F3 20 13C0	      [6] 00215		jsr	StoreParg
 02F6 A2 EE	      [2] 00216		ldx	#_print&$FF
 02F8 A9 29	      [2] 00217		lda	#_print>>8
 02FA 20 1342	      [6] 00218		jsr	GetObject
 02FD 20 1405	      [6] 00219		jsr	object.__call__
 0300 A0 02	      [2] 00220		ldy	#2
 0302 20 13D0	      [6] 00221		jsr	DerefParg
 0305 20 13E2	      [6] 00222		jsr	FreePargsAndKargs
 			  00223	; 00010	print(c*d)
 0308 A2 E1	      [2] 00224		ldx	#_d&$FF
 030A A9 29	      [2] 00225		lda	#_d>>8
 030C 20 1342	      [6] 00226		jsr	GetObject
 030F 86 1A	      [3] 00227		stx	Ptr1
 0311 85 1B	      [3] 00228		sta	Ptr1+1
 0313 A2 DC	      [2] 00229		ldx	#_c&$FF
 0315 A9 29	      [2] 00230		lda	#_c>>8
 0317 20 1342	      [6] 00231		jsr	GetObject
 031A 86 18	      [3] 00232		stx	Ptr0
 031C 85 19	      [3] 00233		sta	Ptr0+1
 031E 20 1B24	      [6] 00234		jsr	object.__mul__
 0321 48	      [3] 00235		pha
 0322 8A	      [2] 00236		txa
 0323 48	      [3] 00237		pha
 0324 A9 01	      [2] 00238		lda	#1
 0326 A2 00	      [2] 00239		ldx	#0
 0328 20 135A	      [6] 00240		jsr	AllocPargsAndKargs
 032B A0 03	      [2] 00241		ldy	#3
 032D 68	      [4] 00242		pla
 032E AA	      [2] 00243		tax
 032F 68	      [4] 00244		pla
 0330 20 13C0	      [6] 00245		jsr	StoreParg
 0333 A2 EE	      [2] 00246		ldx	#_print&$FF
 0335 A9 29	      [2] 00247		lda	#_print>>8
 0337 20 1342	      [6] 00248		jsr	GetObject
 033A 20 1405	      [6] 00249		jsr	object.__call__
 033D A0 02	      [2] 00250		ldy	#2
 033F 20 13D0	      [6] 00251		jsr	DerefParg
 0342 20 13E2	      [6] 00252		jsr	FreePargsAndKargs

So the next step is not done, but it is much further along than I remembered. I think this was where I was when I discovered the memory leaks and took the detour to fix them.

Stan brought this to my attention. It is just too good not to share.

To copy a block on the 6809, could you use ldd ,U++? Or stack blast it.

You make a good point. For any transfer larger than a single byte, moving two bytes at a time is quicker, short of using DMA, and not every system can do that.

The original code:

 0006 DE 00		      [5] 00005	Copy1    ldu    Src
 0008 109E 02		      [6] 00006	         ldy    Dest
 000B 9E 04		      [5] 00007	         ldx    Count
 000D 27 08 (0017)	      [3] 00008	         beq    Done1
 				  00009
 000F A6 C0		      [6] 00010	Loop1    lda    ,U+
 0011 A7 A0		      [6] 00011	         sta    ,Y+
 0013 30 1F		      [5] 00012	         leax   -1,X
 0015 26 F8 (000F)	      [3] 00013	         bne    Loop1
 				  00014
 0017 39		      [5] 00015	Done1    rts

Time taken is 24 cycles plus 20 per byte.

A naive conversion to word transfer:

 0018 DE 00		      [5] 00017	Copy2    ldu    Src
 001A 109E 02		      [6] 00018	         ldy    Dest
 001D DC 04		      [5] 00019	         ldd    Count
 001F 27 12 (0033)	      [3] 00020	         beq    Done2
 				  00021
 0021 44		      [2] 00022	         lsra
 0022 56		      [2] 00023	         rorb
 0023 1F 01		      [6] 00024	         tfr    D,X
 0025 24 04 (002B)	      [3] 00025	         bcc    Loop2
 				  00026
 0027 A6 C0		      [6] 00027	         lda    ,U+
 0029 A7 A0		      [6] 00028	         sta    ,Y+
 				  00029
 002B EC C1		      [8] 00030	Loop2    ldd    ,U++
 002D ED A1		      [8] 00031	         std    ,Y++
 002F 30 1F		      [5] 00032	         leax   -1,X
 0031 26 F8 (002B)	      [3] 00033	         bne    Loop2
 				  00034
 0033 39		      [5] 00035	Done2    rts

Time taken is 37 cycles plus 12 if there is an odd byte plus 24 for each word.

Taking advantage of the fact that pulling a word from the U stack is a cycle faster:

 0034 DE 00		      [5] 00037	Copy3    ldu    Src
 0036 109E 02		      [6] 00038	         ldy    Dest
 0039 DC 04		      [5] 00039	         ldd    Count
 003B 27 12 (004F)	      [3] 00040	         beq    Done3
 				  00041
 003D 44		      [2] 00042	         lsra
 003E 56		      [2] 00043	         rorb
 003F 1F 01		      [6] 00044	         tfr    D,X
 0041 24 04 (0047)	      [3] 00045	         bcc    Loop3
 				  00046
 0043 A6 C0		      [6] 00047	         lda    ,U+
 0045 A7 A0		      [6] 00048	         sta    ,Y+
 				  00049
 0047 37 06		      [7] 00050	Loop3    pulu   D
 0049 ED A1		      [8] 00051	         std    ,Y++
 004B 30 1F		      [5] 00052	         leax   -1,X
 004D 26 F8 (0047)	      [3] 00053	         bne    Loop3
 				  00054
 004F 39		      [5] 00055	Done3    rts

Time taken is 37 cycles plus 12 if there is an odd byte plus 23 for each word.
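To see where the word versions start to pay off, the three timing formulas can be dropped into a short script. The function names are made up, and the model only evaluates the formulas quoted above plus the shared 24-cycle early exit for a zero count.

```python
def copy1_cycles(n):
    """Byte-at-a-time loop: 24 cycles of overhead plus 20 per byte."""
    return 24 + 20 * n


def word_copy_cycles(n, per_word):
    """Word loops: 37 cycles of overhead, 12 for an odd byte, per_word per word."""
    if n == 0:
        return 24       # beq skips straight to rts, same as the byte loop
    return 37 + 12 * (n % 2) + per_word * (n // 2)


for n in (2, 16, 256):
    print(n, copy1_cycles(n), word_copy_cycles(n, 24), word_copy_cycles(n, 23))
```

For 256 bytes that works out to 5144 cycles for the byte loop, 3109 for the ldd ,U++ version and 2981 for the pulu version.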

Also, I hear self-modifying code is common on the 6502, and on computers with limited speed and space in general.
I’m interested in learning more about it and how to use it to speed things up and save space.

http://www.lemon64.com/forum/viewtopic.php?t=47092&sid=a7e3278d5f633959dc4fb8051d8161e8