If you love python, wait until you discover julia. It’s sort of like python with C-like speed where all of your data types are actually close to the metal rather than structs.
Thanks for the tip.
I have heard of it but have never looked into it.
My earlier thought was that type hinting may allow a compiler to generate better code for Python.
Have you made any progress on this?
I got as far as the label CONT2 when other things diverted my attention.
I have to admit that I got distracted on so many other things that this got lost in the shuffle. Will put it back on my near-term ‘pending’ pile
I recently resumed working on my “collection” of Space Voyage games.
I think I may finally have all five current versions (6800, 6502, 8080, AVR and 68K) behaving in the same way. The pseudo-random generator has been temporarily wired to start with the same seed so that they may be compared.
In addition to 9900 coding, I will need information on the best cross-development tools and emulators as well as three interface subroutines: write a character, read a character and check whether a character has been typed. I am currently using my own cross-assembler and emulator.
Following cross-dev tool links are from post 170 in this thread.
Excellent TI 99/4A technical info website by Dr. Thierry Nouspikel, who started his TI 99/4A work as a student. Very rich source of info.
http://www.unige.ch/medecine/nouspikel/ti99/titechpages.htm
Complete TMS9900/TI 99/4A cross development too suite (written in python)
https://endlos99.github.io/xdt99/
I would probably go with Classic99 for the emulator
http://www.harmlesslion.com/software/Classic99
Here’s a list of other emulators
http://99er.net/emul.shtml
Will have to get back with you on the functions to read/check for character. That’s the easy part if you use the existing console ROM API calls.
Writing a character will require some infrastructure, since the screen is tile-mapped. So scrolling the screen and other operations to make it look like a terminal are not natively part of the console API. There are frameworks out there to implement that, and may actually be available by GPL calls. But I never did much with GPL, so would have to do some research there.
A good starting reference on the available API calls is the TI Editor/Assembler manual. It also has detailed description of 9900 assembly language and a breakdown of each instruction.
http://99er.net/download2/index.php?act=view&id=88
I have been using this for instruction set reference instead of the Editor/Assembler manual:
The authoritative source is the TMS9900 Data Manual, which is attached. The instruction set description is much condensed over the 99/4A manuals.
The only thing that won’t be correct is the clock cycle count. 4 clock cycles need to be added to all memory accesses, except for accesses to the ROM (0x0000-0x1FFF) and the 256 byte scratchpad RAM (0x8300-0x83FF). This is due to wait states added during the conversion of a single 16-bit memory cycle to two 8-bit memory accesses. Why the conversion is done is part of the 99/4A history.
TMS9900_DataManual_ocr.pdf (1.2 MB)
I have been using Appendix A of this manual for instruction timing information:
Many moons ago, I bought all three flavors of Turbo Pascal version 3, CP/M-80, CP/M-86 and PC-DOS. Not only did the CP/M-80 version need a Z80 to run, but the programs it generated also required a Z80, leaving out the many systems based on the 8080 or 8085.
I took brief a look at trying to modify it to generate 8080 code, but that effort did not get very far as more of my time was being spent learning to program for the IBM PC.
Recent discussions about simulating an 8080 brought up the incompatibility between Turbo Pascal and the 8080.
I believe that modifying the compiler to directly generate 8080 code to be too difficult. Modifying it to run on an 8080 is out of the question. But much of the same result may be achieved by translating the compiled code to the 8080, then stitching it to a run-time library rewritten for the 8080.
I am happy to be able to report that the prototype for converting the Z80 code produced by the CP/M Turbo Pascal compiler to 8080 code is going well. So far, no Z80 instructions have been encountered. Unfortunately, the same cannot be said for the subroutine library; it makes extensive use of the index and alternate registers as well as instructions like subtract with carry of register pairs.
The original Pascal code:
procedure Greet;
begin
writeln('Hello world.')
end; { Greet }
begin
Greet
end.
The Z80 code generated by the compiler:
0100 C3E320 jp 20E3
Command? u20e3l23
20E3 310001 ld SP,0100
20E6 21C2EC ld HL,ECC2
20E9 0100FF ld BC,FF00
20EC CD6503 call 0365
20EF 212021 ld HL,2120
20F2 11D0EA ld DE,EAD0
20F5 0142ED ld BC,ED42
20F8 3E01 ld A,01
20FA CDD504 call 04D5
20FD C31A21 jp 211A
2100 CD9C14 call 149C
2103 CDBB17 call 17BB
Command? d2106ld
2100 0C 48-65 6C 6C 6F 20 77 6F 72 ' .Hello wor'
2110 6C 64 2E 'ld. '
Command? u2113
2113 CDCE17 call 17CE
2116 CD1C20 call 201C
2119 C9 ret
211A CD0021 call 2100
211D C3D520 jp 20D5
The 8080 code generated by the recompiler:
org 100h
jmp L_20E3
org 20E3h
L_20E3:
lxi SP,0100h
lxi H,0ECC2h
lxi B,0FF00h
call R_0365
lxi H,2120h
lxi D,0EAD0h
lxi B,0ED42h
mvi A,01h
call R_04D5
jmp L_211A
L_2100:
call R_149C
call R_17BB
db 12,'Hello world.'
call R_17CE
call R_201C
ret
L_211A:
call L_2100
jmp R_20D5
end 100h
It turns out that the line
20EF 211921 ld HL,2119
actually points HL just past the last instruction in the program, I am guessing to initialize the heap, so the generated code is now this:
org 100h
jmp L_20E3
org 20E3h
L_20E3:
lxi SP,0100h
lxi H,0ECC2h
lxi B,0FF00h
call R_0365
lxi H,L_End
lxi D,0EAD8h
lxi B,0ED42h
mvi A,01h
call R_04D5
call R_149C
call R_17BB
db 12,'Hello world.'
call R_17CE
call R_201C
jmp R_20D5
L_End:
end 100h
Last month, I revisited the 6800 version of the floating point code and rewrote the test driver to catch up with the AVR version.
The number display is much better plus the value is retained so that another operation may be performed upon it.
There is still some more work to be done. I still need to investigate implementing rounding, guard and sticky bits, then make a double precision version.
Can 8080 code be smaller and faster than Z80 code?
Much has been said about how the new instructions added to the Z80 are often too slow to be useful, that these are more commonly used to create compact code rather than faster execution.
I have a subroutine for which a version written for the 8080 is both smaller and quicker. It is used by CP/M Turbo Pascal programs to release allocated memory back to the heap. The 8080 version is not only faster; it is substantially faster.
I did not originally set out to write a faster implementation but had to reach repeatedly into my bag of tricks to avoid running out of registers while trying to do a literal translation. The result ended up surprisingly more efficient.
The original Z80 code:
07241 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
07242 ;
07243 ; Dispose
07244 ;
07245 ; Input:
07246 ; HL = size of block to free
07247 ; Top of stack = the address of variable containing block address
07248 ;
1D7B 07249 L_1D7B:
1D7B EB [4] 07250 ex DE,HL ; Move size to DE
07251
1D7C E1 [10] 07252 pop HL ; Get return address
1D7D E3 [19] 07253 ex (SP),HL ; Swap for address of variable
07254
1D7E 7E [7] 07255 ld A,(HL) ; Get address of the block to free
1D7F 23 [5] 07256 inc HL
1D80 66 [7] 07257 ld H,(HL)
1D81 6F [4] 07258 ld L,A
07259
1D82 13 [5] 07260 inc DE ; Round size up to next multiple of 4
1D83 13 [5] 07261 inc DE
1D84 13 [5] 07262 inc DE
1D85 7B [4] 07263 ld A,E
1D86 E6FC [7] 07264 and 0FCh
1D88 5F [4] 07265 ld E,A
07266
1D89 EB [4] 07267 ex DE,HL ; Save "quantized" size
1D8A 22 00F0 [16] 07268 ld (L_00F0),HL ; DE -> block to free
07269
1D8D 2A 00DE [16] 07270 ld HL,(L_00DE) ; Point to the first free block
1D90 E5 [11] 07271 push HL
1D91 DDE1 [14] 07272 pop IX
07273
1D93 B7 [4] 07274 or A ; Is block to free lower?
1D94 ED52 [15] 07275 sbc HL,DE
1D96 3052 (1DEA) [7/12] 07276 jr NC,L_1DEA ; Yes
07277
1D98 07278 L_1D98:
1D98 DD6E00 [19] 07279 ld L,(IX+0) ; Get next address
1D9B DD6601 [19] 07280 ld H,(IX+1)
1D9E E5 [11] 07281 push HL
07282
1D9F B7 [4] 07283 or A ; Is this block still lower?
1DA0 ED52 [15] 07284 sbc HL,DE
1DA2 3004 (1DA8) [7/12] 07285 jr NC,L_1DA8 ; No
07286
1DA4 DDE1 [14] 07287 pop IX ; Point to next free block
07288
1DA6 18F0 (1D98) [12] 07289 jr L_1D98 ; Keep looking
07290
1DA8 07291 L_1DA8:
1DA8 E1 [10] 07292 pop HL ; Get block just above block to free
07293
1DA9 D5 [11] 07294 push DE ; Point to block to be freed
1DAA FDE1 [14] 07295 pop IY
07296
1DAC ED4B 00F0 [20] 07297 ld BC,(L_00F0) ; Retrieve block size
07298
1DB0 FD7102 [19] 07299 ld (IY+2),C ; Store size
1DB3 FD7003 [19] 07300 ld (IY+3),B
1DB6 FD7500 [19] 07301 ld (IY+0),L ; Store next
1DB9 FD7401 [19] 07302 ld (IY+1),H
07303
1DBC DD7300 [19] 07304 ld (IX+0),E ; Point here from previous block
1DBF DD7201 [19] 07305 ld (IX+1),D
07306
1DC2 DDE5 [15] 07307 push IX ; Point to previous block
1DC4 E1 [10] 07308 pop HL
1DC5 DD4E02 [19] 07309 ld C,(IX+2) ; Get its size
1DC8 DD4603 [19] 07310 ld B,(IX+3)
1DCB CD 1E05 [17] 07311 call L_1E05 ; Try to coalesce
1DCE 2809 (1DD9) [7/12] 07312 jr Z,L_1DD9 ; Combined?
07313
1DD0 DD5E00 [19] 07314 ld E,(IX+0) ; Point to next block
1DD3 DD5601 [19] 07315 ld D,(IX+1)
1DD6 D5 [11] 07316 push DE
1DD7 DDE1 [14] 07317 pop IX
07318
1DD9 07319 L_1DD9:
1DD9 DDE5 [15] 07320 push IX ; Set up to try combining with next block
1DDB E1 [10] 07321 pop HL
1DDC DD4E02 [19] 07322 ld C,(IX+2) ; Load size
1DDF DD4603 [19] 07323 ld B,(IX+3)
1DE2 DD5E00 [19] 07324 ld E,(IX+0) ; Load next
1DE5 DD5601 [19] 07325 ld D,(IX+1)
1DE8 181B (1E05) [12] 07326 jr L_1E05
07327
1DEA 07328 L_1DEA:
1DEA 2A 00DE [16] 07329 ld HL,(L_00DE) ; Freed block is now first free block
1DED ED53 00DE [20] 07330 ld (L_00DE),DE
1DF1 D5 [11] 07331 push DE
1DF2 DDE1 [14] 07332 pop IX
1DF4 DD7500 [19] 07333 ld (IX+0),L ; Store next
1DF7 DD7401 [19] 07334 ld (IX+1),H
1DFA ED4B 00F0 [20] 07335 ld BC,(L_00F0) ; Store size
1DFE DD7102 [19] 07336 ld (IX+2),C
1E01 DD7003 [19] 07337 ld (IX+3),B
1E04 EB [4] 07338 ex DE,HL
07339
1E05 07340 L_1E05:
1E05 09 [11] 07341 add HL,BC ; Are the blocks adjacent?
1E06 B7 [4] 07342 or A
1E07 ED52 [15] 07343 sbc HL,DE
1E09 C0 [5/11] 07344 ret NZ ; No
07345
1E0A D5 [11] 07346 push DE
1E0B FDE1 [14] 07347 pop IY
07348
1E0D 2A 00C4 [16] 07349 ld HL,(L_00C4) ; Is upper block at the top of the heap?
1E10 B7 [4] 07350 or A
1E11 ED52 [15] 07351 sbc HL,DE
1E13 281B (1E30) [7/12] 07352 jr Z,L_1E30 ; Yes
07353
1E15 FD7E00 [19] 07354 ld A,(IY+0) ; Transfer next to lower block
1E18 DD7700 [19] 07355 ld (IX+0),A
1E1B FD7E01 [19] 07356 ld A,(IY+1)
1E1E DD7701 [19] 07357 ld (IX+1),A
1E21 FD6E02 [19] 07358 ld L,(IY+2) ; Add the sizes
1E24 FD6603 [19] 07359 ld H,(IY+3)
1E27 09 [11] 07360 add HL,BC
1E28 DD7502 [19] 07361 ld (IX+2),L
1E2B DD7403 [19] 07362 ld (IX+3),H
1E2E AF [4] 07363 xor A ; Clear Z flag
07364
1E2F C9 [10] 07365 ret
07366
1E30 07367 L_1E30:
1E30 DDE5 [15] 07368 push IX ; Set new top of heap
1E32 E1 [10] 07369 pop HL
1E33 22 00C4 [16] 07370 ld (L_00C4),HL
07371
1E36 0604 [7] 07372 ld B,4 ; Zero the header
07373
1E38 07374 L_1E38:
1E38 3600 [10] 07375 ld (HL),0
1E3A 23 [5] 07376 inc HL
1E3B 10FB (1E38) [8/13] 07377 djnz L_1E38
07378
1E3D C9 [10] 07379 ret
As rewritten for the 8080:
05810 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
05811 ;
05812 ; Dispose
05813 ;
05814 ; Input:
05815 ; HL = size of block to free
05816 ; Top of stack = the address of variable containing block address
05817 ;
14E9 05818 Rq1D7B:
14E9 EB [4] 05819 xchg ; Move size to DE
05820
14EA E1 [10] 05821 pop H ; Get return address
14EB E3 [18] 05822 xthl ; Swap for address of variable
05823
14EC 7E [7] 05824 mov A,M ; Get address of the block to free
14ED 23 [5] 05825 inx H
14EE 66 [7] 05826 mov H,M
14EF 6F [5] 05827 mov L,A
05828
14F0 13 [5] 05829 inx D ; Round size up to next multiple of 4
14F1 13 [5] 05830 inx D
14F2 13 [5] 05831 inx D
14F3 7B [5] 05832 mov A,E
14F4 E6 FC [7] 05833 ani 0FCh
14F6 5F [5] 05834 mov E,A
05835
14F7 EB [4] 05836 xchg ; Save "quantized" size
14F8 22 00F0 [16] 05837 shld Vq00F0 ; DE -> block to free
05838
14FB 2A 00DE [16] 05839 lhld Vq00DE ; Point to the first free block
05840
14FE 7D [5] 05841 mov A,L ; Is block to free lower?
14FF 93 [4] 05842 sub E
1500 7C [5] 05843 mov A,H
1501 9A [4] 05844 sbb D
1502 DA 151E [10] 05845 jc Lq1D98 ; No
05846
1505 2A 00F0 [16] 05847 lhld Vq00F0 ; Load size into BC
1508 44 [5] 05848 mov B,H
1509 4D [5] 05849 mov C,L
05850
150A 2A 00DE [16] 05851 lhld Vq00DE ; Freed block is now first free block
150D EB [4] 05852 xchg
150E 22 00DE [16] 05853 shld Vq00DE
05854
1511 73 [7] 05855 mov M,E ; Store next
1512 23 [5] 05856 inx H
1513 72 [7] 05857 mov M,D
05858
1514 23 [5] 05859 inx H ; Store size
1515 71 [7] 05860 mov M,C
1516 23 [5] 05861 inx H
1517 70 [7] 05862 mov M,B
1518 2B [5] 05863 dcx H
1519 2B [5] 05864 dcx H
151A 2B [5] 05865 dcx H
05866
151B C3 1554 [10] 05867 jmp Lq1E05 ; Try to coalesce with next block
05868
151E 05869 Lq1D98:
151E 44 [5] 05870 mov B,H ; Remember current block address
151F 4D [5] 05871 mov C,L
05872
1520 7E [7] 05873 mov A,M ; Get next address
1521 23 [5] 05874 inx H
1522 66 [7] 05875 mov H,M
1523 6F [5] 05876 mov L,A
05877
1524 93 [4] 05878 sub E ; Is this block still lower?
1525 7C [5] 05879 mov A,H
1526 9A [4] 05880 sbb D
1527 DA 151E [10] 05881 jc Lq1D98 ; Yes
05882
152A E5 [11] 05883 push H ; Save address of block above block to free
05884
152B 60 [5] 05885 mov H,B ; Point to block just below block to free
152C 69 [5] 05886 mov L,C
05887
152D 73 [7] 05888 mov M,E ; Point here from previous block
152E 23 [5] 05889 inx H
152F 72 [7] 05890 mov M,D
05891
1530 E1 [10] 05892 pop H ; Refresh address of block above block to free
05893
1531 EB [4] 05894 xchg ; Point to block to free
05895
1532 73 [7] 05896 mov M,E ; Store next
1533 23 [5] 05897 inx H
1534 72 [7] 05898 mov M,D
05899
1535 EB [4] 05900 xchg ; Retrieve block size
1536 2A 00F0 [16] 05901 lhld Vq00F0
1539 EB [4] 05902 xchg
05903
153A 23 [5] 05904 inx H ; Store size
153B 73 [7] 05905 mov M,E
153C 23 [5] 05906 inx H
153D 72 [7] 05907 mov M,D
05908
153E 60 [5] 05909 mov H,B ; Point to previous block
153F 69 [5] 05910 mov L,C
05911
1540 CD 154A [17] 05912 call Lq1DD9 ; Try to coalesce with block being freed
1543 CA 154A [10] 05913 jz Lq1DD9 ; Combined?
05914
1546 7E [7] 05915 mov A,M ; Point to next block
1547 23 [5] 05916 inx H
1548 66 [7] 05917 mov H,M
1549 6F [5] 05918 mov L,A
05919
154A 05920 Lq1DD9:
154A 23 [5] 05921 inx H ; Set up to try combining with the next block
154B 23 [5] 05922 inx H
154C 23 [5] 05923 inx H
154D 46 [7] 05924 mov B,M ; Load size
154E 2B [5] 05925 dcx H
154F 4E [7] 05926 mov C,M
1550 2B [5] 05927 dcx H ; Load next
1551 56 [7] 05928 mov D,M
1552 2B [5] 05929 dcx H
1553 5E [7] 05930 mov E,M
05931
1554 05932 Lq1E05:
1554 7D [5] 05933 mov A,L ; Are the blocks adjacent?
1555 81 [4] 05934 add C
1556 BB [4] 05935 cmp E
1557 C0 [5/11] 05936 rnz ; No - low bytes do not match
1558 7D [5] 05937 mov A,L
1559 81 [4] 05938 add C
155A 7C [5] 05939 mov A,H
155B 88 [4] 05940 adc B
155C BA [4] 05941 cmp D
155D C0 [5/11] 05942 rnz ; No - high bytes do not match
05943
155E E5 [11] 05944 push H ; Stash address of lower block
05945
155F 2A 00C4 [16] 05946 lhld Vq00C4 ; Is upper block at the top of the heap?
1562 7D [5] 05947 mov A,L
1563 BB [4] 05948 cmp E
1564 C2 156C [10] 05949 jnz Lq1E15 ; No
1567 7C [5] 05950 mov A,H
1568 BA [4] 05951 cmp D
1569 CA 1582 [10] 05952 jz Lq1E30 ; Yes
05953
156C 05954 Lq1E15:
156C E1 [10] 05955 pop H ; Recover address of lower block
05956
156D 1A [7] 05957 ldax D ; Transfer next to lower block
156E 77 [7] 05958 mov M,A
156F 23 [5] 05959 inx H
1570 13 [5] 05960 inx D
1571 1A [7] 05961 ldax D
1572 77 [7] 05962 mov M,A
1573 23 [5] 05963 inx H
1574 13 [5] 05964 inx D
05965
1575 1A [7] 05966 ldax D ; Add the sizes
1576 81 [4] 05967 add C
1577 77 [7] 05968 mov M,A
1578 13 [5] 05969 inx D
1579 23 [5] 05970 inx H
157A 1A [7] 05971 ldax D
157B 88 [4] 05972 adc B
157C 77 [7] 05973 mov M,A
05974
157D 2B [5] 05975 dcx H ; Point back to the base
157E 2B [5] 05976 dcx H
157F 2B [5] 05977 dcx H
05978
1580 AF [4] 05979 xra A ; Clear Z flag
05980
1581 C9 [10] 05981 ret
05982
1582 05983 Lq1E30:
1582 E1 [10] 05984 pop H ; Recover address of lower block
05985
1583 22 00C4 [16] 05986 shld Vq00C4 ; Set new top of heap
05987
1586 06 04 [7] 05988 mvi B,4 ; Zero the header
05989
1588 05990 Lq1E38:
1588 36 00 [10] 05991 mvi M,0
158A 23 [5] 05992 inx H
158B 05 [5] 05993 dcr B
158C C2 1588 [10] 05994 jnz Lq1E38
05995
158F C9 [10] 05996 ret
I have been putting some development time into FLEX again. The first priority is rearranging the several components of the system to arrive at a set of configurations which work on the largest number of dissimilar 6502 systems.
The second is changing the API slightly to try to reduce the reloading of registers. The initial prototype expected addresses to be passed in registers A:X because that was how I kept a sixteen-bit number while doing calculations. By changing the calling convention to Y:X, I am saving up to about half of the register reload instructions.
Now I have a couple of questions:
How useful is it to not corrupt memory used by a program if the operating system is rebooted? Some 680x programs provided a warm restart entry point at $103; this may be useful to try to save data after a crash before a total clean restart. Is it important at all to preserve the contents of the Utility Command Space under the same scenario?
I have always thought that the “current drive” concept in CP/M and DOS was a much more convenient way to specify the “working” drive than using the ASN (assign) utility in FLEX.
Now that I have a platform for experimentation, I have prototyped a current drive mode for FLEX. Instead of the usual “+++” command prompt, it uses “0++” when drive 0 is the current drive and operating in current drive mode. The “1.” command makes drive 1 the current drive.
The ASN utility will be modified to switch between the classic and current drive modes in addition to assigning drives in classic mode.
So is this a good idea or am I off in the weeds?
I just discovered a bug lurking in my processor simulators for many years.
The rotate and shift instructions of some processors like the 6800 and 6502 set the zero and sign flags based on the value of the result. My code for most of them had been propagating the values of the native x86 processor flags to those of the simulated CPU. The problem is that the shift and rotate instructions of the x86 does not affect the zero and sign flags - a small detail I had overlooked.
That this is not an often used feature is why it went this long.
Correction: The rotate instructions of the x86 do not affect the sign and zero flags. The shift instructions do. That explains how I managed to miss this.
I got big binary number to ASCII decimal conversion working with the Double Dabble algorithm.
Instead of over 22 billion machine cycles with the divide by ten for each digit method, the Sevens problem completes in a little over 103.2 million cycles. That is not unlike going from a bicycle to an airliner.
There is still some room for optimization. Namely the fact that numbers up to 100 bytes in size do not result in a string over 255 characters long, so calculations can be done using single-byte values. I’m thinking maybe up to a doubling of the speed.
number = 1
power=0
string=""
while string.find("777777") < 0:
number = number * 7
power = power + 1
string = str(number)
print('Seven to the power of', power, 'is', string)
The original divide by 10 for each digit method took over 22 billion cycles.
The first double-byte double dabble implementation took $626D28A 103207562 cycles.
The implementation of a single-byte double dabble mode for numbers 100 bytes or less took $59D709A 94204058 cycles. This was more than I was expecting.
After some more optimization, it dropped to $59D47D6 94193622 cycles.
What this is trying to say is that the decimal conversion is no longer a large majority of the time. To see how this broke down:
Omitting the string search took $5974C0C 93801484 cycles.
number = 1
power=0
string=""
while power < 176:
number = number * 7
power = power + 1
string = str(number)
Omitting the string search and the decimal conversion took $3F55E03 66412035 cycles.
number = 1
power=0
string=""
while power < 176:
number = number * 7
power = power + 1
What this means is that the multiplication code needs another look. I know it can be improved.
Some time ago, I posted implementations of binary to decimal conversion using the Double Dabble algorithm for several microprocessors:
Rudla posted code to convert a binary number to packed BCD:
http://forum.6502.org/viewtopic.php?f=2&t=6579
I recognized that his code bore a definite resemblance to Double Dabble, but it was clearly not the same. Studying his implementation allowed me to step back and see the forest from the trees of the details of the Double Dabble algorithm. Improvements from a straight interpretation of Double Dabble include:
- all of that testing of and adding to individual BCD digits is merely a way to perform a “decimal adjustment” when the hardware does not directly support it. Hardware supporting decimal adjustment can combine the shifting and fiddling by adding to itself (doubling) with the decimal arithmetic mode built into the 6502 or a DAA decimal adjust instruction following the addition on other processors.
- shifting only one byte of the number being converted at a time instead of shifting the entire number. The savings become much more significant for larger numbers.
- growing the result as digits are added instead of needlessly shifting an ocean of zeroes in a large pre-cleared buffer.
I improveed on Rudla’s implementation by starting with no bytes of result instead of one byte containing zero, saving up to seven shifts of the byte. I also did a minor tweak improving the innermost loop.
Using what I learned, I rewrote my conversion code from scratch, making it faster and simpler in the process.
My design is based on maximum reuse of memory and is intended to produce a printable character string instead of just a packed BCD number. Because the maximum number to be converted is eight bytes, the buffer is placed in the zero-page for maximum efficiency. The number to be converted is copied to the low end of the output buffer with the least-significant byte at the lowest address. The converted BCD digits grow downward from the end of the buffer and do not overlap the number until after its bytes are already consumed by the shifting. Because the largest number to be converted is 64 bits, it was reasonable to place the buffer in the zero page.
While I will not claim that these are the fastest routines to convert 32 and 64 bit integers to ASCII decimal strings, I am confident they are among the best.
Did you know that there is a 32000 character limit for a post? I know now…
The 6502 version:
. 00191 ; Input:
. 00192 ; The absolute value of the number to convert is stored
. 00193 ; starting at OutBuf+2 in little-endian form
. 00194 ; A = number of bytes in the number
. 00195 ; Dabble is the label of last byte of the buffer
. 00196 ; OutBuf+0 = 0 if number is negative, other if negative
. 00197 ;
. 00198 ; Output:
. 00199 ; A:X contains the address of the resultant string
. 00200 ;
. 00201 ; Uses:
. 00202 ; Byt0 = number of bytes to convert
. 00203 ; Byt1 = number of packed BCD bytes in result
. 00204 ; Byt2 = bits of number being shifted
.206E 00225 Format:
.206E 85 12 [3] 00226 sta Byt0 ; Remember number of bytes to convert
.2070 AA [2] 00227 tax ; Use it as index into the number
. 00228
.2071 A9 00 [2] 00229 lda #0 ; Start with no bytes of BCD digits
.2073 85 13 [3] 00230 sta Byt1
. 00231
.2075 F8 [2] 00232 sed ; Switch to decimal mode
. 00233
.2076 00234 SkipLeadingZero:
.2076 B5 19 [4] 00235 lda OutBuf+1,X ; Get next highest byte of the number
.2078 D0 07 (2081) [2/3] 00236 bne NotLeadingZero
. 00237
.207A CA [2] 00238 dex ; Skip a leading zero
. 00239
.207B D0 F9 (2076) [2/3] 00240 bne SkipLeadingZero ; Check the next one if more
. 00241
.207D F0 2F (20AE) [2/3] 00242 beq BCD2ASCII ; All bytes are zero
. 00243
.207F 00244 ByteLoop:
.207F B5 19 [4] 00245 lda OutBuf+1,X ; Get next highest byte of the number
. 00246
.2081 00247 NotLeadingZero:
.2081 85 14 [3] 00248 sta Byt2 ; Set it for shifting
. 00249
.2083 86 12 [3] 00250 stx Byt0 ; Remember the address of the last one shifted
. 00251
.2085 38 [2] 00252 sec ; Shift in extra bit for eight trips through loop
. 00253
.2086 B0 12 (209A) [2/3] 00254 bcs EnterBitLoop ; Always branches
. 00255
.2088 00256 BitLoop:
.2088 A2 2D [2] 00257 ldx #Dabble ; Point to the BCD digits
. 00258
.208A A4 13 [3] 00259 ldy Byt1 ; Load count of number of digit bytes
.208C F0 0A (2098) [2/3] 00260 beq SkipDigitLoop ; Skip loop if none
. 00261
.208E 00262 DigitLoop:
.208E B5 00 [4] 00263 lda 0,X ; Double pair of BCD digits with carry
.2090 75 00 [4] 00264 adc 0,X
.2092 95 00 [4] 00265 sta 0,X
. 00266
.2094 CA [2] 00267 dex ; Point to next pair of BCD digits
. 00268
.2095 00269 IntoDigitLoop:
.2095 88 [2] 00270 dey ; More BCD digits?
.2096 D0 F6 (208E) [2/3] 00271 bne DigitLoop ; Yes
. 00272
.2098 00273 SkipDigitLoop:
.2098 B0 0B (20A5) [2/3] 00274 bcs Extend ; Carry out of accumulated digits?
. 00275
.209A 00276 EnterBitLoop:
.209A 26 14 [5] 00277 rol Byt2 ; Transfer upper bit to carry flag
.209C D0 EA (2088) [2/3] 00278 bne BitLoop ; Repeat if more
. 00279
.209E A6 12 [3] 00280 ldx Byt0 ; Point back to the number to convert
. 00281
.20A0 CA [2] 00282 dex ; Any more bytes of the number to convert?
.20A1 D0 DC (207F) [2/3] 00283 bne ByteLoop ; Repeat if more
. 00284
.20A3 F0 09 (20AE) [2/3] 00285 beq BCD2ASCII ; Always branches to unpacking stage
. 00286
.20A5 00287 Extend:
.20A5 E6 13 [5] 00288 inc Byt1 ; Add a new byte to the result
.20A7 A9 01 [2] 00289 lda #1
.20A9 95 00 [4] 00290 sta 0,X
. 00291
.20AB 18 [2] 00292 clc
.20AC 90 EC (209A) [2/3] 00293 bcc EnterBitLoop ; Always branches
. 00294
.20AE 00295 BCD2ASCII:
.20AE D8 [2] 00296 cld ; Return to binary mode
. 00297
.20AF A5 13 [3] 00298 lda Byt1 ; Any converted digits?
.20B1 F0 42 (20F5) [2/3] 00299 beq NoDigits ; No
. 00300
.20B3 38 [2] 00301 sec
.20B4 A9 2E [2] 00302 lda #Dabble+1 ; Point to most significant BCD pair
.20B6 E5 13 [3] 00303 sbc Byt1
.20B8 A8 [2] 00304 tay
. 00305
.20B9 A2 1A [2] 00306 ldx #<OutBuf+2 ; Point to result buffer
. 00307
.20BB B9 0000 [4/5] 00308 lda 0,Y ; Get most significant BCD pair
. 00309
.20BE 29 F0 [2] 00310 and #$F0 ; Is the upper one zero?
.20C0 F0 0E (20D0) [2/3] 00311 beq UpperZero
.20C2 D0 03 (20C7) [2/3] 00312 bne ConvertBoth
. 00313
.20C4 00314 ASCIILoop:
.20C4 B9 0000 [4/5] 00315 lda 0,Y ; Get next most significant BCD pair
. 00316
.20C7 00317 ConvertBoth:
.20C7 4A [2] 00318 lsr A ; Convert upper one to ASCII
.20C8 4A [2] 00319 lsr A
.20C9 4A [2] 00320 lsr A
.20CA 4A [2] 00321 lsr A
.20CB 09 30 [2] 00322 ora #'0'
.20CD 95 00 [4] 00323 sta 0,X
.20CF E8 [2] 00324 inx
. 00325
.20D0 00326 UpperZero:
.20D0 B9 0000 [4/5] 00327 lda 0,Y
.20D3 C8 [2] 00328 iny
.20D4 29 0F [2] 00329 and #$F ; And then the lower one
.20D6 09 30 [2] 00330 ora #'0'
.20D8 95 00 [4] 00331 sta 0,X
.20DA E8 [2] 00332 inx
. 00333
.20DB C6 13 [5] 00334 dec Byt1 ; More digits?
.20DD D0 E5 (20C4) [2/3] 00335 bne ASCIILoop ; Yes
. 00336
.20DF 8A [2] 00337 txa ; Store string length
.20E0 38 [2] 00338 sec
.20E1 E9 1A [2] 00339 sbc #OutBuf+2
. 00340
.20E3 A6 18 [3] 00341 ldx OutBuf ; Is the number negative?
.20E5 F0 14 (20FB) [2/3] 00342 beq NumberNotNegative
. 00343
.20E7 18 [2] 00344 clc ; Increment and store length
.20E8 69 01 [2] 00345 adc #1
.20EA 85 18 [3] 00346 sta OutBuf
. 00347
.20EC A9 2D [2] 00348 lda #'-' ; Prepend minus sign
.20EE 85 19 [3] 00349 sta OutBuf+1
. 00350
.20F0 A9 00 [2] 00351 lda #>OutBuf
.20F2 A2 18 [2] 00352 ldx #<OutBuf
. 00353
.20F4 60 [6] 00354 rts
. 00355
.20F5 00356 NoDigits:
.20F5 A9 30 [2] 00357 lda #'0' ; Emit single '0'
.20F7 85 1A [3] 00358 sta OutBuf+2
. 00359
.20F9 A9 01 [2] 00360 lda #1 ; String length one
. 00361
.20FB 00362 NumberNotNegative:
.20FB 85 19 [3] 00363 sta OutBuf+1
. 00364
.20FD A9 00 [2] 00365 lda #>OutBuf+1
.20FF A2 19 [2] 00366 ldx #<OutBuf+1
. 00367
.2101 60 [6] 00368 rts
The scorecard:
00001 ; Converting $FFFFFFFF
00002 ; old $2BC9 11209 cycles for call FormatLongword
00003 ; new $0A72 2674 cycles
00004 ;
00005 ; Converting $00000001
00006 ; old $2A96 10902 cycles for call FormatLongword
00007 ; new $015B 347 cycles
00008 ;
00009 ; Converting $00000000
00010 ; old $2A98 10904 cycles for call FormatLongword
00011 ; new $0075 117 cycles
The addressing modes and the three byte-sized registers worked well for this implementation.
The 6800 version:
. 00486 * Input:
. 00487 * The absolute value of the number to convert is stored
. 00488 * starting at OutBuf+2 in little-endian form
. 00489 * A = number of bytes in the number
. 00490 * Dabble is the label of last byte of the buffer
. 00491 * OutBuf+0 = 0 if number is negative, other if negative
. 00492 *
. 00493 * Output:
. 00494 * X contains the address of the resultant string
. 00495 *
. 00496 * Uses:
. 00497 * Byt0 = number of bytes to convert
. 00498 * Byt1 = number of packed BCD bytes in result
. 00499 * Byt2 = bits of number being shifted
. 00500 * Ptr0
. 00501 * Ptr1
.01F5 00522 Format
.01F5 97 10 [4] 00523 staa Byt0 ; Remember number of bytes to convert
. 00524
.01F7 8B 17 [2] 00525 adda #<OutBuf+1 ; Point to most significant byte of number
.01F9 97 0B [4] 00526 staa Ptr0+1
.01FB 86 00 [2] 00527 ldaa #>OutBuf+1
.01FD 89 00 [2] 00528 adca #0
.01FF 97 0A [4] 00529 staa Ptr0
.0201 DE 0A [4] 00530 ldx Ptr0
. 00531
.0203 7F 0011 [6] 00532 clr Byt1 ; Start with no bytes of BCD digits
. 00533
.0206 00534 SkipLeadingZero
.0206 E6 00 [5] 00535 ldab ,X ; Get next highest byte of the number
.0208 26 0D (0217) [4] 00536 bne NotLeadingZero
. 00537
.020A 09 [4] 00538 dex ; Skip a leading zero
. 00539
.020B 7A 0010 [6] 00540 dec Byt0
.020E 26 F6 (0206) [4] 00541 bne SkipLeadingZero ; Check the next one if more
. 00542
.0210 20 32 (0244) [4] 00543 bra BCD2ASCII ; All bytes are zero
. 00544
.0212 00545 ByteLoop
.0212 DE 0A [4] 00546 ldx Ptr0 ; Point back to the number to convert
.0214 09 [4] 00547 dex ; Point to next highest byte
. 00548
.0215 E6 00 [5] 00549 ldab ,X ; Get next highest byte of the number
. 00550
.0217 00551 NotLeadingZero
.0217 D7 12 [4] 00552 stab Byt2 ; Set it for shifting
. 00553
.0219 DF 0A [5] 00554 stx Ptr0 ; Remember the address of the last one shifted
. 00555
.021B 0D [2] 00556 sec ; Shift in extra bit for eight trips through loop
. 00557
.021C 20 1C (023A) [4] 00558 bra EnterBitLoop
. 00559
.021E 00560 BitLoop
.021E CE 002B [3] 00561 ldx #Dabble ; Point to the BCD digits
. 00562
.0221 D6 11 [3] 00563 ldab Byt1 ; Load count of number of digit bytes
.0223 27 0B (0230) [4] 00564 beq SkipDigitLoop ; Skip loop if none
. 00565
.0225 00566 DigitLoop
.0225 A6 00 [5] 00567 ldaa ,X ; Double pair of BCD digits with carry
.0227 A9 00 [5] 00568 adca ,X
.0229 19 [2] 00569 daa
.022A A7 00 [6] 00570 staa ,X
. 00571
.022C 09 [4] 00572 dex ; Point to next pair of BCD digits
. 00573
.022D 00574 IntoDigitLoop
.022D 5A [2] 00575 decb ; More BCD digits?
.022E 26 F5 (0225) [4] 00576 bne DigitLoop ; Yes
. 00577
.0230 00578 SkipDigitLoop
.0230 24 08 (023A) [4] 00579 bcc EnterBitLoop ; Carry out of accumulated digits?
. 00580
.0232 7C 0011 [6] 00581 inc Byt1 ; Add a new byte to the result
.0235 86 01 [2] 00582 ldaa #1
.0237 A7 00 [6] 00583 staa ,X
. 00584
.0239 0C [2] 00585 clc
. 00586
.023A 00587 EnterBitLoop
.023A 79 0012 [6] 00588 rol Byt2 ; Transfer upper bit to carry flag
.023D 26 DF (021E) [4] 00589 bne BitLoop ; Repeat if more
. 00590
.023F 7A 0010 [6] 00591 dec Byt0 ; Any more bytes of the number to convert?
.0242 26 CE (0212) [4] 00592 bne ByteLoop ; Repeat if more
. 00593
.0244 00594 BCD2ASCII
.0244 96 11 [3] 00595 ldaa Byt1 ; Any converted digits?
.0246 27 57 (029F) [4] 00596 beq NoDigits ; No
. 00597
.0248 86 2C [2] 00598 ldaa #<Dabble+1 ; Point to most significant BCD pair
.024A 90 11 [3] 00599 suba Byt1
.024C 97 0B [4] 00600 staa Ptr0+1
.024E 86 00 [2] 00601 ldaa #>Dabble+1
.0250 82 00 [2] 00602 sbca #0
.0252 97 0A [4] 00603 staa Ptr0
. 00604
.0254 CE 0018 [3] 00605 ldx #OutBuf+2 ; Point to result buffer
.0257 DF 0C [5] 00606 stx Ptr1
. 00607
.0259 DE 0A [4] 00608 ldx Ptr0 ; Get most significant BCD pair
.025B A6 00 [5] 00609 ldaa ,X
.025D 08 [4] 00610 inx
.025E DF 0A [5] 00611 stx Ptr0
.0260 DE 0C [4] 00612 ldx Ptr1
. 00613
.0262 16 [2] 00614 tab ; Clone it
. 00615
.0263 85 F0 [2] 00616 bita #$F0 ; Is the upper one zero?
.0265 27 17 (027E) [4] 00617 beq ConvertLowerOnly
.0267 20 0C (0275) [4] 00618 bra ConvertBoth
. 00619
.0269 00620 ASCIILoop
.0269 DF 0C [5] 00621 stx Ptr1
.026B DE 0A [4] 00622 ldx Ptr0 ; Get next most significant BCD pair
.026D A6 00 [5] 00623 ldaa ,X
.026F 08 [4] 00624 inx
.0270 DF 0A [5] 00625 stx Ptr0
.0272 DE 0C [4] 00626 ldx Ptr1
. 00627
.0274 16 [2] 00628 tab ; Clone it
. 00629
.0275 00630 ConvertBoth
.0275 44 [2] 00631 lsra ; Convert upper one to ASCII
.0276 44 [2] 00632 lsra
.0277 44 [2] 00633 lsra
.0278 44 [2] 00634 lsra
.0279 8A 30 [2] 00635 oraa #'0'
.027B A7 00 [6] 00636 staa ,X
.027D 08 [4] 00637 inx
. 00638
.027E 00639 ConvertLowerOnly
.027E C4 0F [2] 00640 andb #$F ; And then the lower one
.0280 CA 30 [2] 00641 orab #'0'
.0282 E7 00 [6] 00642 stab ,X
.0284 08 [4] 00643 inx
. 00644
.0285 7A 0011 [6] 00645 dec Byt1 ; More BCD pairs?
.0288 26 DF (0269) [4] 00646 bne ASCIILoop
. 00647
.028A DF 0C [5] 00648 stx Ptr1 ; Determine string length
.028C 96 0D [3] 00649 ldaa Ptr1+1
.028E 80 18 [2] 00650 suba #<OutBuf+2
. 00651
.0290 D6 16 [3] 00652 ldab OutBuf ; Is the number negative?
.0292 27 11 (02A5) [4] 00653 beq NumberNotNegative
. 00654
.0294 4C [2] 00655 inca ; Increment and store length byte
.0295 97 16 [4] 00656 staa OutBuf
. 00657
.0297 C6 2D [2] 00658 ldab #'-' ; Prepend minus sign
.0299 D7 17 [4] 00659 stab OutBuf+1
. 00660
.029B CE 0016 [3] 00661 ldx #OutBuf ; Point to the result
. 00662
.029E 39 [5] 00663 rts
. 00664
.029F 00665 NoDigits
.029F 86 30 [2] 00666 ldaa #'0' ; Emit single '0'
.02A1 97 18 [4] 00667 staa OutBuf+2
. 00668
.02A3 86 01 [2] 00669 ldaa #1 ; String length one
. 00670
.02A5 00671 NumberNotNegative
.02A5 97 17 [4] 00672 staa OutBuf+1
. 00673
.02A7 CE 0017 [3] 00674 ldx #OutBuf+1
. 00675
.02AA 39 [5] 00676 rts
The scorecard:
00001 * Converting $FFFFFFFF
00002 * old $38F0 14576 cycles for call FormatLongword
00003 * new $0FED 4077 cycles
00004 *
00005 * Converting $00000001
00006 * old $37B3 14259 cycles for call FormatLongword
00007 * new $01F2 498 cycles
00008 *
00009 * Converting $00000000
00010 * old $37B0 14256 cycles for call FormatLongword
00011 * new $00B9 185 cycles
The 6800 did not fare so well with only one index register and relatively slow memory access instructions.
The added capabilities of the 6809 do not appear to be much better.
The 8080 version:
. 00127 ; Input:
. 00128 ; The absolute value of the number to convert is stored
. 00129 ; starting at OutBuf+2 in little-endian form
. 00130 ; A = number of bytes in the number
. 00131 ; B = 0 if number is positive, other if negative
. 00132 ; Dabble is the label of last byte of the buffer
. 00133 ;
. 00134 ; Output:
. 00135 ; HL contains the address of the resultant string
.0173 00156 Format:
.0173 4F [5] 00157 mov C,A ; Remember number of bytes to convert
.0174 5F [5] 00158 mov E,A
. 00159
.0175 78 [5] 00160 mov A,B ; Remember negativity
.0176 32 0333 [13] 00161 sta OutBuf
. 00162
.0179 AF [4] 00163 xra A ; Start with no bytes of BCD digits
.017A 47 [5] 00164 mov B,A
. 00165
.017B 21 0334 [10] 00166 lxi H,OutBuf+1 ; Point DE to most significant byte of number
.017E 57 [5] 00167 mov D,A
.017F 19 [10] 00168 dad D
.0180 EB [4] 00169 xchg
. 00170
.0181 00171 SkipLeadingZero:
.0181 1A [7] 00172 ldax D ; Get next highest byte of the number
.0182 B7 [4] 00173 ora A
.0183 C2 018F [10] 00174 jnz NotLeadingZero
. 00175
.0186 1B [5] 00176 dcx D ; Skip a leading zero
. 00177
.0187 0D [5] 00178 dcr C
.0188 C2 0181 [10] 00179 jnz SkipLeadingZero ; Check the next one if more
. 00180
.018B C3 01BC [10] 00181 jmp BIN2ASCII ; All bytes are zero
. 00182
.018E 00183 ByteLoop:
.018E 1A [7] 00184 ldax D ; Get next highest byte of the number
. 00185
.018F 00186 NotLeadingZero:
.018F 1B [5] 00187 dcx D ; Point to next lower byte
. 00188
.0190 D5 [11] 00189 push D ; Stash pointer to next byte
. 00190
.0191 5F [5] 00191 mov E,A ; Set it for shifting
. 00192
.0192 37 [4] 00193 stc ; Shift in extra bit for eight times thru loop
. 00194
.0193 C3 01AF [10] 00195 jmp EnterBitLoop
. 00196
.0196 00197 BitLoop:
.0196 21 0348 [10] 00198 lxi H,Dabble ; Point to the BCD digits
. 00199
.0199 04 [5] 00200 inr B ; Check count of number of digit bytes
.019A 05 [5] 00201 dcr B ; Test for zero but preserve carry flag
.019B CA 01A8 [10] 00202 jz SkipDigitLoop ; Skip loop if none
. 00203
.019E 50 [5] 00204 mov D,B
. 00205
.019F 00206 DigitLoop:
.019F 7E [7] 00207 mov A,M ; Double pair of BCD digits with carry
.01A0 8E [7] 00208 adc M
.01A1 27 [4] 00209 daa
.01A2 77 [7] 00210 mov M,A
. 00211
.01A3 2B [5] 00212 dcx H ; Point to next pair of BCD digits
. 00213
.01A4 00214 IntoDigitLoop:
.01A4 15 [5] 00215 dcr D ; More BCD digits?
.01A5 C2 019F [10] 00216 jnz DigitLoop ; Yes
. 00217
.01A8 00218 SkipDigitLoop:
.01A8 D2 01AF [10] 00219 jnc EnterBitLoop ; Carry out of accumulated digits?
. 00220
.01AB 04 [5] 00221 inr B ; Add new byte to the result
.01AC 36 01 [10] 00222 mvi M,1
. 00223
.01AE B7 [4] 00224 ora A ; Clear carry flag
. 00225
.01AF 00226 EnterBitLoop:
.01AF 7B [5] 00227 mov A,E
.01B0 17 [4] 00228 ral ; Transfer upper bit to carry flag
.01B1 5F [5] 00229 mov E,A
. 00230
.01B2 1C [5] 00231 inr E
.01B3 1D [5] 00232 dcr E
.01B4 C2 0196 [10] 00233 jnz BitLoop ; Repeat if more
. 00234
.01B7 D1 [10] 00235 pop D ; Recover pointer to next byte
. 00236
.01B8 0D [5] 00237 dcr C ; Any more bytes of the number to convert?
.01B9 C2 018E [10] 00238 jnz ByteLoop ; Repeat if more
. 00239
.01BC 00240 BIN2ASCII:
.01BC 21 0335 [10] 00241 lxi H,OutBuf+2 ; Point to result buffer
. 00242
.01BF 78 [5] 00243 mov A,B ; Any converted digits?
.01C0 B7 [4] 00244 ora A
.01C1 CA 01F4 [10] 00245 jz NoDigits ; No
. 00246
.01C4 11 0349 [10] 00247 lxi D,Dabble+1 ; Point to most significant BCD pair
.01C7 7B [5] 00248 mov A,E
.01C8 90 [4] 00249 sub B
.01C9 5F [5] 00250 mov E,A
.01CA 7A [5] 00251 mov A,D
.01CB DE 00 [7] 00252 sbi 0
.01CD 57 [5] 00253 mov D,A
. 00254
.01CE 1A [7] 00255 ldax D ; Get most significant BCD pair
. 00256
.01CF E6 F0 [7] 00257 ani 0F0h ; Is the upper one zero?
.01D1 CA 01DF [10] 00258 jz ConvertLowerOnly
. 00259
.01D4 00260 ASCIILoop:
.01D4 1A [7] 00261 ldax D ; Get next most significant BCD pair
.01D5 0F [4] 00262 rrc ; Convert upper one to ASCII
.01D6 0F [4] 00263 rrc
.01D7 0F [4] 00264 rrc
.01D8 0F [4] 00265 rrc
.01D9 E6 0F [7] 00266 ani 0Fh
.01DB F6 30 [7] 00267 ori '0'
.01DD 77 [7] 00268 mov M,A
.01DE 23 [5] 00269 inx H
. 00270
.01DF 00271 ConvertLowerOnly:
.01DF 1A [7] 00272 ldax D ; And then the lower one
.01E0 13 [5] 00273 inx D
.01E1 E6 0F [7] 00274 ani 0Fh
.01E3 F6 30 [7] 00275 ori '0'
.01E5 77 [7] 00276 mov M,A
.01E6 23 [5] 00277 inx H
. 00278
.01E7 05 [5] 00279 dcr B
.01E8 C2 01D4 [10] 00280 jnz ASCIILoop
. 00281
.01EB 7D [5] 00282 mov A,L ; Store string length
.01EC DE 35 [7] 00283 sbi OutBuf+2 and 0FFh
.01EE 32 0334 [13] 00284 sta OutBuf+1
. 00285
.01F1 C3 01F9 [10] 00286 jmp FinishFormat
. 00287
.01F4 00288 NoDigits:
.01F4 36 30 [10] 00289 mvi M,'0' ; Emit single '0'
. 00290
.01F6 2B [5] 00291 dcx H ; String length one
.01F7 36 01 [10] 00292 mvi M,1
. 00293
.01F9 00294 FinishFormat:
.01F9 21 0334 [10] 00295 lxi H,OutBuf+1 ; Point to the result
. 00296
.01FC 3A 0333 [13] 00297 lda OutBuf ; Was number negative?
.01FF B7 [4] 00298 ora A
.0200 CA 0209 [10] 00299 jz NumberNotNegative ; No
. 00300
.0203 7E [7] 00301 mov A,M ; Get and increment length byte
.0204 3C [5] 00302 inr A
. 00303
.0205 36 2D [10] 00304 mvi M,'-' ; Prepend minus sign
. 00305
.0207 2B [5] 00306 dcx H ; Point to the new result
. 00307
.0208 77 [7] 00308 mov M,A ; Store new length
. 00309
.0209 00310 NumberNotNegative:
.0209 C9 [10] 00311 ret
The scorecard:
00001 ; Converting $FFFFFFFF
00002 ; old $7F53 32595 cycles for call FormatLongword
00003 ; new $2359 9049 cycles
00004 ; new $1F18 7960 cycles
00005 ;
00006 ; Converting $00000001
00007 ; old $7CEE 31982 cycles for call FormatLongword
00008 ; new $055A 1370 cycles
00009 ; new $04C5 1221 cycles
00010 ;
00011 ; Converting $00000000
00012 ; old $7D05 32005 cycles for call FormatLongword
00013 ; new $01E0 480 cycles
00014 ; new $01A2 418 cycles
Two sets of new numbers are reported. One for the initial version and another for after some intense code golf managed to eliminate variables in memory.
I actually wrote the 8080 version first, then moved on to the 6800 and then the 6502. Only then did I invest the effort to optimize this one.
I came across some benchmarks to compare Millfork with other compilers:
The linked list one caught my attention. It creates a list of 3000 nodes then traverses it to add up the values.
My implementions are somewhat like the one provided in C except that I do not use his timing mechanism and the list creation loop counts down instead of up. My intent is not to compare with the compilers but between several different processors.
I am slowly relenting to the prospect that a loop ending with an unconditional branch back to the top often performs better by transforming it to enter in the middle. The result looks unstructured, but it is not very different.
The first one is for the 6800. I have found that it is easiest to adopt 6800 code to other processors instead of the other way around.
The dual accumulators and 16-bit index register with displacement is especially well suited for this challenge.
Though inlining the subroutines provided a substantial improvement, I am showing the code before that final transformation for clarity.
00001 * 378834 ($5C7D2) cycles
00002 * 300821 ($49715) cycles after inlining subroutines
00003
0000 005F 00004 Free fdb Heap
0002 0000 00005 Root fdb 0
0004 0000 00006 New fdb 0
0006 0000 0000 00007 Total fdb 0,0
000A 0BB7 00008 I fdb 2999
00009
00010 *
00011 * Allocate a node
00012 *
000C 00013 Alloc
000C D6 01 [3] 00014 ldab Free+1 ; Allocate a node
000E D7 05 [4] 00015 stab New+1 ; Return it in New
00016
0010 CB 04 [2] 00017 addb #4 ; Point to free memory
0012 D7 01 [4] 00018 stab Free+1
0014 D6 00 [3] 00019 ldab Free
0016 D7 04 [4] 00020 stab New
0018 C9 00 [2] 00021 adcb #0
001A D7 00 [4] 00022 stab Free
00023
001C 39 [5] 00024 rts
00025
00026 *
00027 * Prepend a node to the list
00028 *
001D 00029 Prepend
001D 8D ED (000C) [8] 00030 bsr Alloc ; Allocate a new node
00031
001F DE 04 [4] 00032 ldx New ; Point to the new node
00033
0021 D6 02 [3] 00034 ldab Root ; Point next of new node to Root
0023 E7 00 [6] 00035 stab ,X
0025 D6 03 [3] 00036 ldab Root+1
0027 E7 01 [6] 00037 stab 1,X
00038
0029 A7 03 [6] 00039 staa 3,X ; Store value in node
002B D6 0A [3] 00040 ldab I
002D E7 02 [6] 00041 stab 2,X
00042
002F DF 02 [5] 00043 stx Root ; This is now the first node
00044
0031 39 [5] 00045 rts
00046
00047 *
00048 * Traverse the list and sum the values
00049 *
0032 00050 Sum
0032 DE 02 [4] 00051 ldx Root
0034 4F [2] 00052 clra
0035 5F [2] 00053 clrb
00054
0036 00055 SumLoop
0036 EB 03 [5] 00056 addb 3,X ; Add value of node to sum
0038 A9 02 [5] 00057 adca 2,X
003A 24 08 (0044) [4] 00058 bcc NoCarry
00059
003C 7C 0007 [6] 00060 inc Total+1
00061
003F 26 03 (0044) [4] 00062 bne NoCarry
00063
0041 7C 0006 [6] 00064 inc Total
00065
0044 00066 NoCarry
0044 EE 00 [6] 00067 ldx ,X ; Point to the next node
0046 26 EE (0036) [4] 00068 bne SumLoop ; Until end of list
00069
0048 97 08 [4] 00070 staa Total+2
004A D7 09 [4] 00071 stab Total+3
00072
004C 39 [5] 00073 rts
00074
00075 *
00076 *
00077 *
004D 00078 Main
004D 96 0B [3] 00079 ldaa I+1 ; Get low byte of node value
00080
004F 20 01 (0052) [4] 00081 bra IntoLoop
00082
0051 00083 Loop
0051 4A [2] 00084 deca
00085
0052 00086 IntoLoop
0052 8D C9 (001D) [8] 00087 bsr Prepend
00088
0054 4D [2] 00089 tsta ; Low byte of count zero?
0055 26 FA (0051) [4] 00090 bne Loop ; No, no borrow
00091
0057 7A 000A [6] 00092 dec I ; Borrow from upper byte
005A 2A F5 (0051) [4] 00093 bpl Loop ; Done if it goes negative
00094
005C 8D D4 (0032) [8] 00095 bsr Sum ; Traverse the list and add them up
00096
005E 3F [12] 00097 swi
00098
005F 00099 Heap
00100
004D 00101 end Main
The 6502 had trouble keeping up with the others. Am I missing something?
00001 ; 519567 ($7ED8F) cycles
00002 ; 447558 ($6D446) after inlining subroutines
00003
0000 026F 00004 Free fdb Heap
0002 0000 00005 Root fdb 0
0004 0000 00006 New fdb 0
0006 0000 0000 00007 Total fdb 0,0
000A 0BB7 00008 I fdb 2999
00009
0200 00010 org $200
00011
00012 ;
00013 ; Allocate a node
00014 ;
0200 00015 Alloc:
0200 A5 00 [3] 00016 lda Free ; Allocate a node
0202 85 04 [3] 00017 sta New ; Return it in New
00018
0204 18 [2] 00019 clc ; Point to free memory
0205 69 04 [2] 00020 adc #4
0207 85 00 [3] 00021 sta Free
0209 A5 01 [3] 00022 lda Free+1
020B 85 05 [3] 00023 sta New+1
020D 69 00 [2] 00024 adc #0
020F 85 01 [3] 00025 sta Free+1
00026
0211 60 [6] 00027 rts
00028
00029 ;
00030 ; Prepend a node to the list
00031 ;
0212 00032 Prepend:
0212 20 0200 [6] 00033 jsr Alloc ; Allocate a new node
00034
0215 A0 00 [2] 00035 ldy #0 ; Point to the new node
00036
0217 A5 02 [3] 00037 lda Root ; Point next of new node to Root
0219 91 04 [6] 00038 sta (New),Y
021B C8 [2] 00039 iny
021C A5 03 [3] 00040 lda Root+1
021E 91 04 [6] 00041 sta (New),Y
0220 C8 [2] 00042 iny
00043
0221 A5 0A [3] 00044 lda I ; Store value in node
0223 91 04 [6] 00045 sta (New),Y
0225 C8 [2] 00046 iny
0226 A5 0B [3] 00047 lda I+1
0228 91 04 [6] 00048 sta (New),Y
00049
022A A5 04 [3] 00050 lda New ; This is now the first node
022C 85 02 [3] 00051 sta Root
022E A5 05 [3] 00052 lda New+1
0230 85 03 [3] 00053 sta Root+1
00054
0232 60 [6] 00055 rts
00056
00057 ;
00058 ; Traverse the list and sum the values
00059 ;
0233 00060 NoCarry:
0233 A0 00 [2] 00061 ldy #0 ; Point to the next node
0235 B1 02 [5/6] 00062 lda (Root),Y
0237 AA [2] 00063 tax
0238 C8 [2] 00064 iny
0239 B1 02 [5/6] 00065 lda (Root),Y
023B 85 03 [3] 00066 sta Root+1
023D 86 02 [3] 00067 stx Root
00068
023F 05 02 [3] 00069 ora Root
0241 F0 1A (025D) [2/3] 00070 beq Summed ; Until end of list
00071
0243 00072 Sum:
0243 18 [2] 00073 clc ; Add value of node to sum
0244 A0 02 [2] 00074 ldy #2
0246 A5 06 [3] 00075 lda Total
0248 71 02 [5/6] 00076 adc (Root),Y
024A 85 06 [3] 00077 sta Total
024C C8 [2] 00078 iny
024D A5 07 [3] 00079 lda Total+1
024F 71 02 [5/6] 00080 adc (Root),Y
0251 85 07 [3] 00081 sta Total+1
0253 90 DE (0233) [2/3] 00082 bcc NoCarry
00083
0255 E6 08 [5] 00084 inc Total+2
00085
0257 D0 DA (0233) [2/3] 00086 bne NoCarry
00087
0259 E6 09 [5] 00088 inc Total+3
00089
025B D0 D6 (0233) [2/3] 00090 bne NoCarry ; Always branches
00091
025D 00092 Summed:
025D 60 [6] 00093 rts
00094
00095 ;
00096 ;
00097 ;
025E 00098 NoBorrow:
025E C6 0A [5] 00099 dec I
00100
0260 00101 Main:
0260 20 0212 [6] 00102 jsr Prepend
00103
0263 A5 0A [3] 00104 lda I ; Low byte of count zero?
0265 D0 F7 (025E) [2/3] 00105 bne NoBorrow ; No, no borrow
00106
0267 C6 0B [5] 00107 dec I+1 ; Borrow from upper byte
0269 10 F3 (025E) [2/3] 00108 bpl NoBorrow ; Done if it goes negative
00109
026B 20 0243 [6] 00110 jsr Sum ; Traverse the list and add them up
00111
026E 00 [7] 00112 brk
00113
026F 00114 Heap:
00115
0260 00116 end Main
The 8080 was difficult but benefited from abundant registers. Remember that a memory cycle of the 8080 is three machine cycles; it is one on the others, so 8080 machines usually ran at a two to three times faster clock rate.
00001 ; 991228 (0F1FFCh) cycles
00002 ; 829201 (0CA711h) cycles after inlining subroutines
00003
0100 00004 org 100h
00005
00006 ;
00007 ; Allocate a node
00008 ;
0100 00009 Alloc:
0100 2A 0163 [16] 00010 lhld Free ; Allocate a node
00011
0103 11 0004 [10] 00012 lxi D,4 ; Point to free memory
0106 EB [4] 00013 xchg ; Keep New in DE
0107 19 [10] 00014 dad D
0108 22 0163 [16] 00015 shld Free
00016
010B EB [4] 00017 xchg ; Return New in HL
00018
010C C9 [10] 00019 ret
00020
00021 ;
00022 ; Prepend a node to the list
00023 ;
010D 00024 Prepend:
010D CD 0100 [17] 00025 call Alloc ; Allocate a new node
00026
0110 3A 0165 [13] 00027 lda Root ; Point next of new node to Root
0113 77 [7] 00028 mov M,A
0114 3A 0166 [13] 00029 lda Root+1
0117 22 0165 [16] 00030 shld Root ; This is the new first node
011A 23 [5] 00031 inx H
011B 77 [7] 00032 mov M,A ; Finish pointer to previous Root
011C 23 [5] 00033 inx H
00034
011D 71 [7] 00035 mov M,C ; Store value in node
011E 23 [5] 00036 inx H
011F 70 [7] 00037 mov M,B
00038
0120 C9 [10] 00039 ret
00040
00041 ;
00042 ; Traverse the list and sum the values
00043 ;
0121 00044 Sum:
0121 2A 0165 [16] 00045 lhld Root
0124 11 0000 [10] 00046 lxi D,0
00047
0127 00048 SumLoop:
0127 23 [5] 00049 inx H ; Add value of node to sum
0128 23 [5] 00050 inx H
0129 4E [7] 00051 mov C,M
012A 23 [5] 00052 inx H
012B 46 [7] 00053 mov B,M
012C EB [4] 00054 xchg
012D 09 [10] 00055 dad B
012E EB [4] 00056 xchg
012F D2 013B [10] 00057 jnc NoCarry
00058
0132 E5 [11] 00059 push H ; Carry to high word
0133 2A 016B [16] 00060 lhld Total+2
0136 23 [5] 00061 inx H
0137 22 016B [16] 00062 shld Total+2
013A E1 [10] 00063 pop H
00064
013B 00065 NoCarry:
013B 2B [5] 00066 dcx H ; Point to the next node
013C 2B [5] 00067 dcx H
013D 7E [7] 00068 mov A,M
013E 2B [5] 00069 dcx H
013F 6E [7] 00070 mov L,M
0140 67 [5] 00071 mov H,A
00072
0141 B5 [4] 00073 ora L ; Until end of list
0142 C2 0127 [10] 00074 jnz SumLoop
00075
0145 EB [4] 00076 xchg
0146 22 0169 [16] 00077 shld Total
00078
0149 C9 [10] 00079 ret
00080
00081 ;
00082 ;
00083 ;
014A 00084 Main:
014A 2A 016D [16] 00085 lhld I ; Get count
014D 4D [5] 00086 mov C,L
014E 44 [5] 00087 mov B,H
00088
014F C3 0153 [10] 00089 jmp IntoLoop
00090
0152 00091 NoBorrow:
0152 0D [5] 00092 dcr C
00093
0153 00094 IntoLoop:
0153 CD 010D [17] 00095 call Prepend
00096
0156 0D [5] 00097 dcr C ; Low byte of count zero?
0157 0C [5] 00098 inr C
0158 C2 0152 [10] 00099 jnz NoBorrow ; No, no borrow
00100
015B 05 [5] 00101 dcr B ; Borrow from upper byte
015C F2 0152 [10] 00102 jp NoBorrow ; Done if it goes negative
00103
015F CD 0121 [17] 00104 call Sum ; Traverse the list and add them up
00105
0162 76 [7] 00106 hlt
00107
0163 016F 00108 Free dw Heap
0165 0000 00109 Root dw 0
0167 0000 00110 New dw 0
0169 0000 0000 00111 Total dw 0,0
016D 0BB7 00112 I dw 2999
00113
016F 00114 Heap
00115
014A 00116 end Main
The 6809 ran away with this challenge. The additional registers and flexible addressing modes gave it a major advantage.
00001 * 363754 ($58CEA) cycles original 6800 source code
00002 * 243378 ($3B6B2) cycles
00003 * 171350 ($29D56) cycles after inlining subroutines
00004
0000 0044 00005 Free fdb Heap
0002 0000 00006 Root fdb 0
0004 0000 00007 New fdb 0
0006 0000 0000 00008 Total fdb 0,0
000A 0BB7 00009 I fdb 2999
00010
00011 *
00012 * Allocate a node
00013 *
000C 00014 Alloc
000C 30 C4 [4] 00015 leax ,U ; Allocate a node
00016
000E 33 44 [5] 00017 leau 4,U ; Point to free memory
00018
0010 39 [5] 00019 rts
00020
00021 *
00022 * Prepend a node to the list
00023 *
0011 00024 Prepend
0011 8D F9 (000C) [7] 00025 bsr Alloc ; Allocate a new node
00026
0013 DC 02 [5] 00027 ldd Root ; Point next of new node to Root
0015 ED 84 [5] 00028 std ,X
00029
0017 10AF 02 [7] 00030 sty 2,X ; Store value in node
00031
001A 9F 02 [5] 00032 stx Root ; This is now the first node
00033
001C 39 [5] 00034 rts
00035
00036 *
00037 * Traverse the list and sum the values
00038 *
001D 00039 Sum
001D 9E 02 [5] 00040 ldx Root
001F CC 0000 [3] 00041 ldd #0 ; Clear total
0022 1F 03 [6] 00042 tfr D,U
00043
0024 00044 SumLoop
0024 E3 02 [7] 00045 addd 2,X ; Add value of node to sum
0026 24 02 (002A) [3] 00046 bcc NoCarry
00047
0028 33 41 [5] 00048 leau 1,U
00049
002A 00050 NoCarry
002A AE 84 [5] 00051 ldx ,X ; Point to the next node
002C 26 F6 (0024) [3] 00052 bne SumLoop ; Until end of list
00053
002E DD 08 [5] 00054 std Total+2
0030 DF 06 [5] 00055 stu Total
00056
0032 39 [5] 00057 rts
00058
00059 *
00060 *
00061 *
0033 00062 Main
0033 109E 0A [6] 00063 ldy I ; Get node value
0036 CE 0044 [3] 00064 ldu #Heap ; Point to free memory
00065
0039 00066 Loop
0039 8D D6 (0011) [7] 00067 bsr Prepend
00068
003B 31 3F [5] 00069 leay -1,Y
00070
003D 26 FA (0039) [3] 00071 bne Loop
00072
003F 8D D0 (0011) [7] 00073 bsr Prepend ; And once for 0 just to make it fair
00074
0041 8D DA (001D) [7] 00075 bsr Sum ; Traverse the list and add them up
00076
0043 3F [19] 00077 swi
00078
0044 00079 Heap
00080
0033 00081 end Main