[Project Log] Python on the 6502/C64, 8080, 6800, 6809 and AVR

richmeyer · November 16, 2019, 1:06am

Overlapping General Purpose Registers in the 9900:

The Sixteen general purposes registers for the 9900 were in memory, not in the CPU. So the registers could be located anywhere in available memory. You could even overlap the register assignments. Such as; before a subroutine you were using registers R13, R14, and R15. During the start of the subroutine, you could define the new set of 16 general purpose registers with an offset of 8 addresses as an example. Then the old R13 register is now R6, old R14 is now R7, and old R15 is now R8. The subroutine now uses R6, R7, and R8 without having to move the contents of any registers. Then after the subroutine you revert back to R13, R14, and R15 with your results. Confusing, but if well documented with comments in your code it worked very well.

Back in the day when memory was expensive, these tricks were necessary due to memory cost. In Assembly language this trick saved a total of 6 Move instruction cycles and increased the speed of your program.

Bill · November 16, 2019, 1:13am

I have been reading about multiply and divide. Both use register pairs. If R15 is specified, you get a “free” seventeenth register in the word beyond the workspace…

Bill · November 17, 2019, 2:15am

OK, 9900 assembly language experts,

examining the Branch instruction more closely…

B @THERE

means the next instruction executed is at the location denoted by the label “THERE.”

And

B *R11

means the next instruction executed is at the location whose address is in the register R11.

So what does

B R11 ; without the '*'

do?

My initial thought was that was the same as the X (eXecute) instruction. But it is not.

The TI Editor/Assembler yields the following:

so the assembler does not think that B R11 is invalid.

urbite · November 17, 2019, 5:20pm

I believe it will branch to the memory location which is R11. In that case, the address of the instruction would be WP+22.

Bill · November 17, 2019, 5:39pm

Interesting…

I never saw that possibility.

So we can load up instructions into the workspace, then execute them while they are subject to change when registers are modified.

Bill · November 17, 2019, 6:40pm

The opening stanzas of Space Voyage 9900…

 Addr Code		  Line#	  Source Statement

 0000 02E0 014E		  00001	SPAVOY	lwpi	WRKSPC		; Load workspace
 0004 020F 014E		  00002			li		R15,STACK	; Load "stack pointer"
 0008 0201 022C		  00003			li		R1,TITLE
 000C 06A0 00A2		  00004			bl		@PSTRNG		; Output title
 0010 0201 0177		  00005			li		R1,STPSFL
 0014 04C0			  00006	SETUP	clr		R0			; Clear all temp storage
 0016 DC40			  00007			movb	R0,*R1+
 0018 0281 0218		  00008			ci		R1,MOVTBL
 001C 16FB (0014)	  00009			jne		SETUP
 001E 0201 023F		  00010			li		R1,SHTLNG	; Short or long version?
 0022 06A0 00A2		  00011			bl		@PSTRNG
 0026 05A0 016F		  00012	SEED0	inc		@RNDM		; Increment 0th byte of random number seed
 002A 06A0 00C4		  00013			bl		@STAT		;   until user types answer
 002E 13FB (0026)	  00014			jeq		SEED0
 0030 06A0 00EC		  00015			bl		@INCH
 0034 0280 5300		  00016			ci		R0,'S'*256
 0038 1305 (0044)	  00017			jeq		SHORT
 003A 0280 7300		  00018			ci		R0,'s'*256
 003E 1302 (0044)	  00019			jeq		SHORT
 0040 05A0 01AC		  00020			inc		@LENGTH		; If long set flag
 0044 B800 016F		  00021	SHORT	ab		R0,@RNDM	; Add to seed
 0048 0201 0426		  00022			li		R1,INTRO0	; Out password prompt
 004C 06A0 00A2		  00023			bl		@PSTRNG
 0050 0201 01C4		  00024			li		R1,PASWRD	; Get password
 0054 05A0 0171		  00025	SEED1	inc		@RNDM+2		; Increment 1st byte of random number seed
 0058 06A0 00C4		  00026			bl		@STAT		;   until user types answer
 005C 13FB (0054)	  00027			jeq		SEED1
 005E 06A0 00EC		  00028			bl		@INCH
 0062 DC40			  00029			movb	R0,*R1+		; Save it
 0064 B800 0171		  00030			ab		R0,@RNDM+2	; Add to seed
 0068 05A0 0173		  00031	SEED2	inc		@RNDM+4		; Increment 2nd byte of random number seed
 006C 06A0 00C4		  00032			bl		@STAT		;   until user types answer
 0070 13FB (0068)	  00033			jeq		SEED2
 0072 06A0 00EC		  00034			bl		@INCH
 0076 DC40			  00035			movb	R0,*R1+		; Save it
 0078 B800 0173		  00036			ab		R0,@RNDM+4	; Add to seed
 007C 05A0 0175		  00037	SEED3	inc		@RNDM+6		; Increment 3rd byte of random number seed
 0080 06A0 00C4		  00038			bl		@STAT		;   until user types answer
 0084 13FB (007C)	  00039			jeq		SEED3
 0086 06A0 00EC		  00040			bl		@INCH
 008A DC40			  00041			movb	R0,*R1+		; Save it
 008C B800 0175		  00042			ab		R0,@RNDM+6	; Add to seed
 0090 06A0 0102		  00043			bl		@RVERIFY	; Ensure random number seed is valid
 0094 0201 01C8		  00044			li		R1,QUDMAP	; Point to quadrant map
 0098 0203 0040		  00045			li		R3,64
 009C 06A0 00F0		  00046	SETUP0	bl		@RANDOM		; Setup number of Klingons

SV99

Notice the code peeking through the registers until the workspace pointer is loaded.

The last two instructions “push” the return address in R11 onto the “stack.”

A later

	mov	*R15+,R11

recovers the return address.

Bill · November 19, 2019, 5:28pm

Another question for 9900 assembly language experts…

The flags in the status register of the 9900 appear to be wildly different from other processors.

The L> flag supposedly indicates when an unsigned quantity is greater than zero.

The A> flag supposedly indicates when a signed quantity is greater than zero.

From page 3-29 of

http://www.bitsavers.org/pdf/ti/990/tx990/0943441-9701_990_Assembly_Language_Programming_Guide_Oct78.pdf

The JL (jump if lower) instruction branches when L> is clear and EQ is clear. That makes sense.

The JHE (jump if higher or equal) instruction branches when L> is set or EQ is set. That makes sense.

The JH (jump if higher) instruction branches when L> is set and EQ is clear. The JLE (jump if lower or equal) instruction branches when L> is clear or EQ is set. These two seem to imply that the L> flag is actually set for L>=

Anybody know about this one?

Bill · November 20, 2019, 12:06am

An interesting read…

Bill · November 24, 2019, 5:05pm

The pseudo-random number generator for the 9900 is finally working properly.

The algorithm is for 8-bit operations and implementation was tricky on the 9900 which does not provide a full set of 8-bit instructions. Stray bits kept sneaking in through the “unused” halves of 16-bit registers.

Bill · November 26, 2019, 2:06am

The documentation seems to imply that L> and EQ are mutually exclusive: that no value is both greater than 0 and equal to 0.

However, there are a couple of instructions, compare ones corresponding (COC) and compare zeros corresponding (CZC) which affect only the EQ flag. There may be a programming trick lurking in there somewhere…

Bill · November 27, 2019, 1:19am

Appendix A of this manual goes into excruciating detail what goes into setting the several conditional code flags.

http://bitsavers.informatik.uni-stuttgart.de/pdf/ti/990/945250-9701_990_Computer_Family_Systems_Handbook_3ed_May76.pdf

Appendix B does the same for calculating the cycle times for instructions. This is for the 990/4, a TI industrial microcomputer based on the 9900 CPU. It is not directly applicable to the 99/4A with its various cycle eaters.

So about how many instructions does a 99/4A execute in a second? Something on the order of 100,000 to 200,000 depending on the instructions?

Bill · November 28, 2019, 1:45am

I am somewhat surprised nobody commented about this.

The discussion has evolved to what is the quickest way to arrive at the answer. The current speed leader is the hybrid compiled/interpreted environment of FORTH.

The programming problem does several time-consuming things:

Multiply large numbers
Convert large numbers to an ASCII string
Find one string within another

These capabilities are all built into the Python language.

In the following solution, Python code serves a glue between various bits of code in the run-time library. The loop only executes 175 times.

Because needed features are built-in, it is fair game to write those in assembly language as part of the run-time library. Implementation in other languages requires some to much of it in compiled high-level code.

I believe that a Python compiler has a good chance to be the fastest implementation.

Kentamanos · November 28, 2019, 2:33am

I wonder if due to those 2 requirements if some sort of BCD library would work well here.

Back in my pre-teen BASIC days, I used to do a lot of “string math” stuff which is kinda similar. I used to run a program on my Osborne to calculate the digits of the square root of 3 by naively adding an increasing digit on the end until the first two digits were “03” (or higher early on), and then lowering that digit by one and going on to the next.

Bill · November 28, 2019, 6:06am

The problem is that a BCD library is not part of the core language definition, so I lose the blanket right to implement that in assembly language (a major advantage in a speed contest.) Python is a poor language to use for writing such a library as it will have to be very “low level.”

Even if it was, using BCD may do away with the expensive conversion to a decimal representation (a divide by 10 for each digit) but multiplying becomes much more difficult on a processor without a “decimal adjust” instruction.

Bill · November 29, 2019, 3:31am

The expense of converting a binary number is so bothersome that I did some more thinking and research.

An algorithm with a funny name, double dabble, shows promise to speed up the conversion…

I am also looking into an implementation of a BCD-based approach to the problem without using any multiplication or division, though I still want to see how well a straightforward implementation in Python compares to what the others have been doing.

Bill · November 29, 2019, 11:12pm

I got the code running. No division. Multiplication done by shifting and adding.

Number * 7 = Number * 1 + Number * 2 + Number * 4

It finishes almost instantaneously on my rather slow netbook. It was written in a simple manner to help get it working; there is opportunity to more than double its speed by combining loops.

Now to get one of my toy compilers to translate it into 9900 assembly language…

Bill · December 3, 2019, 4:11am

Three additions to the Python compiler are necessary to run the 7’s problem solution on a 6502:

Implement the syntax necessary to call a method for an object
Create a str function using code already in place for implementing print
Write the find method for strings

The first two are done and the third is halfway there.

I can already tell that converting an integer to an ASCII string representation is horribly slow; the time spent multiplying two numbers is very small in comparison. I will have to eventually replace the current repeated divide by ten approach to double dabble and see how much of an improvement that is. If it is not substantial, then obviously a BCD-based approach will be the fastest way to solve the 7’s problem.

Bill · December 4, 2019, 8:44pm

The 7’s problem now compiles to 6502 and ~~runs~~ crawls.

It consumes over 22 billion machine cycles, meaning a 1 MHz 6502 would grind on this for over six hours.

The only good news is that there is plenty of room for improvement from a double dabble implementation. Stay tuned…

Oh, and my emulator runs the program in just over 95 seconds; it cranks out a rather surprising 230 MHz speed for the virtual 6502. That is on a 1 GHz AMD C60 processor. Most of today’s machines should do much better.

Bill · December 7, 2019, 3:27am

A conversion for 16-bit integers has been implemented based on double dabble. Now to make another one for big numbers…

 05E2					  00966	WriteInteger
 05E2 86 49		      [3] 00967		stx	Dabble		; Put number in conversion buffer
 05E4 85 4A		      [3] 00968		sta	Dabble+1
 						  00969
 05E6 A9 00		      [2] 00970		lda	#0			; Zero the rest of the buffer
 05E8 85 4D		      [3] 00971		sta	Dabble+4
 05EA 85 4C		      [3] 00972		sta	Dabble+3
 05EC 85 4B		      [3] 00973		sta	Dabble+2
 						  00974
 05EE A9 10		      [2] 00975		lda	#16			; 16 bits in the number
 05F0 85 02		      [3] 00976		sta	Byt0
 						  00977
 05F2					  00978	WriteInteger1
 05F2 A0 03		      [2] 00979		ldy	#3			; Three bytes in converted BCD
 						  00980
 05F4					  00981	WriteInteger2
 05F4 B9 004A	    [4/5] 00982		lda	Dabble+1,Y	; Isolate lower nybble
 05F7 29 0F		      [2] 00983		and	#$F
 05F9 C9 05		      [2] 00984		cmp	#4+1		; If greater than 4, add 3
 05FB 90 0F (060C)  [2/4] 00985		bcc	WriteIntegerLoNotGT4
 						  00986
 05FD 18		      [2] 00987		clc
 05FE 69 03		      [2] 00988		adc	#3
 0600 85 03		      [3] 00989		sta	Byt1
 						  00990
 0602 B9 004A	    [4/5] 00991		lda	Dabble+1,Y	; Combine
 0605 29 F0		      [2] 00992		and	#$F0
 0607 05 03		      [3] 00993		ora	Byt1
 0609 99 004A	      [5] 00994		sta	Dabble+1,Y
 						  00995
 060C					  00996	WriteIntegerLoNotGT4
 060C B9 004A	    [4/5] 00997		lda	Dabble+1,Y	; Isolate upper nybble
 060F 29 F0		      [2] 00998		and	#$F0
 0611 C9 50		      [2] 00999		cmp	#$40+$10	; If greater than 4, add 3
 0613 90 0F (0624)  [2/3] 01000		bcc	WriteIntegerHiNotGT4
 						  01001
 0615 18		      [2] 01002		clc
 0616 69 30		      [2] 01003		adc	#$30
 0618 85 03		      [3] 01004		sta	Byt1
 						  01005
 061A B9 004A	    [4/5] 01006		lda	Dabble+1,Y	; Combine
 061D 29 0F		      [2] 01007		and	#$F
 061F 05 03		      [3] 01008		ora	Byt1
 0621 99 004A	      [5] 01009		sta	Dabble+1,Y
 						  01010
 0624					  01011	WriteIntegerHiNotGT4
 0624 88		      [2] 01012		dey				; More BCD digits to shift?
 0625 D0 CD (05F4)  [2/4] 01013		bne	WriteInteger2
 						  01014
 0627 06 49		      [5] 01015		asl	Dabble		; Shift the buffer left one bit
 0629 26 4A		      [5] 01016		rol	Dabble+1
 						  01017
 062B 26 4D		      [5] 01018		rol	Dabble+4
 062D 26 4C		      [5] 01019		rol	Dabble+3
 062F 26 4B		      [5] 01020		rol	Dabble+2
 						  01021
 0631 C6 02		      [5] 01022		dec	Byt0		; More bits to shift?
 0633 D0 BD (05F2)  [2/4] 01023		bne	WriteInteger1
 						  01024
 0635 A0 00		      [2] 01025		ldy	#0			; Index converted BCD
 0637 A2 00		      [2] 01026		ldx	#0			; And the output string
 						  01027
 0639					  01028	WriteInteger3
 0639 B9 004B	    [4/5] 01029		lda	Dabble+2,Y	; Isolate upper digit
 063C 29 F0		      [2] 01030		and	#$F0
 063E D0 04 (0644)  [2/3] 01031		bne	WriteIntegerEmitHi
 						  01032
 0640 E0 00		      [2] 01033		cpx	#0			; Leading zero?
 0642 F0 0A (064E)  [2/3] 01034		beq	WriteIntegerSkipHi
 						  01035
 0644					  01036	WriteIntegerEmitHi
 0644 4A		      [2] 01037		lsr	A			; Shift into lower nybble
 0645 4A		      [2] 01038		lsr	A
 0646 4A		      [2] 01039		lsr	A
 0647 4A		      [2] 01040		lsr	A
 						  01041
 0648 18		      [2] 01042		clc				; Convert to ASCII numeral
 0649 69 30		      [2] 01043		adc	#'0'
 064B E8		      [2] 01044		inx
 064C 95 51		      [4] 01045		sta	Out,X
 						  01046
 064E					  01047	WriteIntegerSkipHi
 064E B9 004B	    [4/5] 01048		lda	Dabble+2,Y	; Isolate lower digit
 0651 29 0F		      [2] 01049		and	#$F
 0653 D0 04 (0659)  [2/3] 01050		bne	WriteIntegerEmitLo
 						  01051
 0655 E0 00		      [2] 01052		cpx	#0			; Leading zero?
 0657 F0 06 (065F)  [2/3] 01053		beq	WriteIntegerSkipLo
 						  01054
 0659					  01055	WriteIntegerEmitLo
 0659 18		      [2] 01056		clc				; Convert to ASCII numeral
 065A 69 30		      [2] 01057		adc	#'0'
 065C E8		      [2] 01058		inx
 065D 95 51		      [4] 01059		sta	Out,X
 						  01060
 065F					  01061	WriteIntegerSkipLo
 065F C8		      [2] 01062		iny				; Address next pair of digits
 						  01063
 0660 C0 03		      [2] 01064		cpy	#3			; More digits?
 0662 90 D5 (0639)  [2/3] 01065		bcc	WriteInteger3
 						  01066
 0664 86 51		      [3] 01067		stx	Out			; Store length of result
 0666 8A		      [2] 01068		txa
 0667 D0 08 (0671)  [2/3] 01069		bne	WriteIntegerWrite	; Check for all 0's
 						  01070
 0669 A9 01		      [2] 01071		lda	#1			; Default to "0"
 066B 85 51		      [3] 01072		sta	Out
 066D A9 30		      [2] 01073		lda	#'0'
 066F 85 52		      [3] 01074		sta	Out+1
 						  01075
 0671					  01076	WriteIntegerWrite
 0671 A2 51		      [2] 01077		ldx	#Out&$FF	; Point to the result
 0673 A9 00		      [2] 01078		lda	#Out>>8
 						  01079
 						  01080	; Fall through to WriteString

Bill · December 8, 2019, 6:12pm

The above code uses separate buffers for the conversions to packed BCD, then to text.

When dealing with large numbers, it is beneficial to reduce memory usage wherever possible. By placing the packed BCD result from the double dabble into the end of the text result buffer, one buffer can be used for both…

Dabble2BCD0
Dabble2BCD1
Dabble2BCD2
Dabble2BCD3