The code below takes 20 bytes. Yet, there’s a way to make it even smaller through interrupts. How?
A
MOV AH,9
MOV DX,108
INT 21
RET
DB 'HELLO WORLD$'
R CX
14
N MYHELLO.COM
W
CodePudding user response:
Print a shorter message, like db 'hi$' :P Or as Vitsoft suggests, take the string as an arg like the Unix echo command, so it doesn't take up space in your program.
Or depend on the values that some DOS versions happen to leave in registers when your program starts, if you don't care about portability and are willing to rely on behaviour that isn't a documented guarantee.
Almost certainly int 21h / ah=9 is the most compact way to print multiple bytes of text. You need to get AH=9 and DX=pointer somehow. Without relying on existing bytes in convenient places in registers or memory that some DOS version might happen to leave lying around, that takes a 2-byte mov ah,9 and a 3-byte mov dx, imm16.
You can set DX=0 with xor dx,dx, but even the very start of your file is at offset 100h in a .com program. (And that would mean letting the ASCII text execute as machine code without a jmp over it!)
call label / db "text" / label: pop dx would be 4 bytes total to get the pointer into DX.
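To make the byte counts above concrete, here is a minimal Python sketch that hand-encodes the question's program and the call/pop alternative. The variable names are made up for illustration; the opcode bytes are the standard 8086 encodings (B4 ib, BA iw, CD ib, C3, E8 rel16, 5A), and nothing here has been run under DOS.

```python
# Hand-encoded 8086 bytes for the 20-byte program in the question.
MESSAGE = b"HELLO WORLD$"   # '$' terminates the string for int 21h / ah=9
LOAD_OFFSET = 0x100         # a .COM image is loaded at offset 100h

code = bytes([
    0xB4, 0x09,             # mov ah, 9
    0xBA, 0x08, 0x01,       # mov dx, 0108h (imm16, little-endian)
    0xCD, 0x21,             # int 21h
    0xC3,                   # ret
])
image = code + MESSAGE
string_offset = LOAD_OFFSET + len(code)   # where the text lands in memory

# Alternative pointer setup: call over the text, then pop the return address.
call_pop = (
    bytes([0xE8, len(MESSAGE), 0x00])     # call rel16, skipping the text
    + MESSAGE
    + bytes([0x5A])                       # pop dx
)
pointer_cost = len(call_pop) - len(MESSAGE)

print(len(image), hex(string_offset), pointer_cost)
```

The 3-byte mov dx, imm16 beats the 4-byte call/pop pair, which is why the remaining savings have to come from register values that are already set up.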
Use uninitialized register values left by some known DOS version.
http://www.fysnet.net/yourhelp.htm linked from Tips for golfing in x86/x64 machine code on codegolf.SE found the startup register values across an array of DOS versions. This is not standardized, AFAIK, so it's just a happens-to-work. Later versions of FreeDOS became more and more similar to MS-DOS, because presumably some existing software was written to rely on it, on purpose, by accident, or because some people didn't know that "works on my machine" isn't the same thing as "guaranteed future proof and portable", but various other DOS versions differ. This is not something you should rely on for production use, only silly computer tricks like code golf or "demo scene" programs.
Most DOS versions happen to leave SI=0100h at program startup. So if we can put our string there without messing up the machine (or SI), we can mov dx, si (2 bytes) instead of mov dx, 108h or 107h (3 bytes). But lea dx, [si+8] is 3 bytes (opcode + modrm + disp8), so no saving unless we let the string execute.
Or even better, if there's something on the stack you could use for pop dx? Or popa, if you're extremely lucky also setting AX=09xx. But I don't know if any DOS versions happen to leave any known stuff on the stack, other than the "return" address which points at an int 20h instruction or something. Popping that would mean exiting manually with int 20h instead of ret, costing 1 more byte.
Actually, xchg ax, reg is only 1 byte, so if any register starts with 09xx, we can use that. MS-DOS 4.0 and later, FreeDOS 1.0, and IBM PC-DOS 4.0 and later, all start with BP=09xxh. So we can save a byte in AH init by using xchg ax, bp, separate from anything with DX. (Fun fact: This is where the 90h NOP encoding comes from: it's just a special case of xchg ax,ax, until x86-64 had to document it as an actual NOP because in 64-bit mode it doesn't zero-extend EAX into RAX like xchg eax,eax would).
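As a rough illustration of why xchg ax, reg is a single byte, the sketch below builds the encoding as 90h plus the register number. The helper name xchg_ax and the REG16 table are hypothetical, written only for this example.

```python
# 8086 16-bit register numbers as they appear in opcode encodings.
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def xchg_ax(reg):
    """One-byte encoding of xchg ax, reg: opcode 90h + register number."""
    return bytes([0x90 + REG16[reg]])

mov_ah_9 = bytes([0xB4, 0x09])            # mov ah, imm8 takes two bytes

print(xchg_ax("bp").hex(), xchg_ax("ax").hex())
print("bytes saved:", len(mov_ah_9) - len(xchg_ax("bp")))
```

xchg_ax("ax") comes out as 90h, the NOP encoding mentioned above.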
Letting the text execute as code
To save any more bytes, the only hope I see for making this shorter is using text that happens to decode as instructions that let execution come out the other side, without messing up SI, so you can put it in the path of execution. But at best you're saving 1 byte, unless the text also contains instructions that do anything useful.
But your message doesn't work for that. I checked how 3 capitalizations of it would disassemble, using nasm to make a flat binary and ndisasm -b16 to disassemble the result. (I used align 16 so I could find the boundary, and to give a sort of nop slide so if the last byte of the string was not the last byte of an instruction, it would consume some of that padding instead of changing decoding of the next string.) I don't have DOS or a debug.exe, so I'm using trailing-h syntax on hex numbers. In DOS Debug, all numbers are implicitly hex; that's why int 21 is the right number there. I also haven't tested these. I'm not that interested in obsolete 16-bit stuff, but the x86 machine code shenanigans are fun. Although true code-golf challenge questions are off-topic on Stack Overflow, this kind of single-language optimization question is a better fit here than on https://codegolf.stackexchange.com/
; just to look at disassembly, to see if there's any hope of letting them execute
DB 'HELLO WORLD$'
align 16
db 'Hello World$'
align 16
db 'hello world$'
;; DB 'HELLO WORLD$'
00000000 48 dec ax ; early-alphabet upper-case
00000001 45 inc bp ; is all single-byte inc/dec
00000002 4C dec sp ; the same opcodes x86-64 repurposed as REX
00000003 4C dec sp ;; Modified SP breaks RET
00000004 4F dec di
00000005 20574F and [bx+0x4f],dl ;; step on part of the PSP
00000008 52 push dx ; 'R' also modifies SP
00000009 4C dec sp ; 'L'
0000000A 44 inc sp ; 'D' cancel each other's effect on SP
0000000B 2490 and al,0x90
0000000D 90 nop
0000000E 90 nop
0000000F 90 nop
Nifty. So it doesn't actually do anything fatal to the machine (depending on where BX is pointing). With BX=0 from the DOS versions that give the BP and SI values we want, that would mask away some bits in [ds:4Fh], which is in the reserved part of the PSP (program segment prefix). This may be fine if nothing else ever looks there before we exit, or during the DOS exit call.
But note the and al, 0x90 ending: the string itself ended with 24h, aka '$', as the start of an instruction. That's the opcode for and al, imm8, so it consumes 1 byte of whatever's next as part of that instruction.
So you'd need a byte of padding after it before you could put the start of a useful instruction. That would kill the 1-byte size saving.
And it messes up SP, so we can't ret anymore. We'd need int 20h to exit, unless you can bail out with CC int3 or something. Not sure what DOS does on that exception.
;; 20 bytes, cancelling out saving from xchg ax,bp
DB 'HELLO WORLD$' ; executes as machine code without doing anything too bad
nop ; but this is needed. It's actually consumed as an immediate for 24h
mov dx, si
xchg ax, bp ; AH=09 on some DOS versions, in 1 byte instead of 2.
int 21h
int 20h ; larger than ret, making this a net loss.
Other capitalizations are a problem, involving 'l' as 6C insb IO instructions (https://www.felixcloutier.com/x86/ins:insb:insw:insd), and similarly 'o' as 6F outsw.
;; db 'Hello World$'
00000010 48 dec ax
00000011 656C gs insb ; big problem, IO could crash the machine
00000013 6C insb ; using port=DX, data from [DS:SI]
00000014 6F outsw
00000015 20576F and [bx+0x6f],dl
00000018 726C jc 0x86 ; conditional branch, but AND always clears CF so this will be not-taken
0000001A 642490 fs and al,0x90 ; FS prefix was new with 386
0000001D 90 nop ; the align 16 padding, including previous immediate
0000001E 90 nop
0000001F 90 nop
;; db 'hello world$'
00000020 68656C push word 0x6c65
00000023 6C insb
00000024 6F outsw
00000025 20776F and [bx+0x6f],dh
00000028 726C jc 0x96
0000002A 64 fs
0000002B 24 db 0x24
Again the 24h byte is left dangling, as the start of an instruction. If there'd been a nop after it, ndisasm would have decoded it as fs and al, 0x90 like in the previous block.
Looks like ello is a problem, with IO instructions.
We need the 2nd-last byte of the string to be something else, like the start of a 2-byte instruction, ideally something like 3C ib cmp al, imm8. That's ASCII <.
And we need it not to mess up SP. If it decrements it, we need to increment or pop into dummy registers, so it's once again pointing at the return address.
18 byte version, printing a modified string of same length
;; 18 bytes
DB 'HELLO_WOIY<$' ; executes as machine code, returning SP to original position without overwriting return address
mov dx, si ; mov dx,0100h MS-DOS (all versions), FreeDOS 1.0, many other DOSes
xchg ax, bp ; mov ah,9 MS-DOS 4.0 and later, and FreeDOS 1.0
int 21h
ret
Disassembles as
00000000 48 dec ax
00000001 45 inc bp ; affects BP, which we want to use later
00000002 4C dec sp
00000003 4C dec sp ; SP offset by -2
00000004 4F dec di
00000005 5F pop di ; restore SP
00000006 57 push di ; SP offset by -2
00000007 4F dec di
00000008 49 dec cx
00000009 59 pop cx ; restore SP
0000000A 3C24 cmp al,0x24 ; '<' consumes the '$' as an imm8
0000000C 89F2 mov dx,si ; instructions from the source, as written.
0000000E 95 xchg ax,bp
0000000F CD21 int 0x21
00000011 C3 ret
This does inc bp, modifying one of the registers we're relying on for an initial value. But unless the low byte was FF to start with, it won't wrap and change the 09 in the high half. On FreeDOS 1.0 specifically, the initial BP value is 091Eh. On MS-DOS versions from Win9x, it's 0912h. On DOS from Win-NT derived versions, it's 09xxh, which doesn't rule out 09FFh.
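The wraparound argument can be checked with a couple of lines of arithmetic. This is only a sketch: inc16 and high_byte are made-up helper names, and 09FFh is a hypothetical worst case, not a value reported for any DOS version.

```python
def inc16(value):
    """16-bit increment with wraparound, as inc bp does."""
    return (value + 1) & 0xFFFF

def high_byte(value):
    """The byte that ends up in AH after xchg ax, bp."""
    return value >> 8

# 091Eh and 0912h are the startup values quoted above; 09FFh is the bad case.
for initial_bp in (0x091E, 0x0912, 0x09FF):
    after = inc16(initial_bp)
    print(f"{initial_bp:04X}h -> {after:04X}h, AH = {high_byte(after):02X}h")
```

Only a low byte of FFh carries into the high byte, which would turn AH into 0Ah and select a different int 21h function.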
I had to mangle the string pretty seriously to balance the stack, with an even number of dec sp instructions and pops to balance that. The 1-byte 58+rw pop reg includes some of the late upper-case alphabet letters.
Also had to avoid add [si], sp or things like that, since the initial SI points at our string. (The initial BX typically doesn't.)
HELLO_WOLD< has one too many pushes, but the 'LD' part cancels out, dec sp / inc sp. In that order so it doesn't temporarily leave part of your return address below SP, where an interrupt or debugger could clobber it.
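A small Python sketch of the stack bookkeeping, assuming only the single-byte inc/dec/push/pop opcodes in the range 40h..5Fh need modelling. The function name sp_delta is invented for this example, and the trailing '<$' (cmp al, 24h) is left out because it doesn't touch SP.

```python
def sp_delta(text):
    """Net change in SP after executing `text` as 16-bit x86 code.

    Only bytes 40h..5Fh (inc/dec/push/pop r16) are modelled; any other
    byte, and pop sp, raises ValueError.
    """
    SP = 4                                  # register number in the low 3 bits
    delta = 0
    for byte in text.encode("ascii"):
        reg = byte & 7
        if 0x40 <= byte <= 0x47:            # inc r16
            if reg == SP:
                delta += 1
        elif 0x48 <= byte <= 0x4F:          # dec r16
            if reg == SP:
                delta -= 1
        elif 0x50 <= byte <= 0x57:          # push r16
            delta -= 2
        elif 0x58 <= byte <= 0x5F and reg != SP:   # pop r16 (not pop sp)
            delta += 2
        else:
            raise ValueError(f"byte {byte:02X}h is not modelled")
    return delta

# HELLO_WOIY is the byte sequence 48 45 4C 4C 4F 5F 57 4F 49 59 from the listing.
print(sp_delta("HELLO_WOIY"), sp_delta("HELLO_WOLD"))
```

A result of 0 means ret still finds the return address; a nonzero result means SP ends up somewhere else. The model tracks only the net offset, so it doesn't detect a push that overwrites the return address along the way.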
If you really wanted to get serious about coming up with a string that visually looked more like HELLO WORLD, you'd want to make a table of ASCII characters and their corresponding instructions. Many upper-case ASCII characters are opcodes for single-byte instructions, either inc/dec or push/pop.
You could use a good assembler like NASM with a %rep / %assign i i+1 / db i / %endrep block, and run it through a disassembler. Or write a program to output a binary file and disassemble that.
Or look at http://ref.x86asm.net/coder32.html and match it up with https://asciitable.com/
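As a starting point for such a table, this sketch covers just the 32 bytes from 40h to 5Fh, where each ASCII character is a one-byte inc, dec, push or pop of a 16-bit register. The names REG16, OPS and one_byte_table are illustrative; other characters need a real disassembler because they take operands.

```python
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
OPS = ["inc", "dec", "push", "pop"]         # opcode rows 40h, 48h, 50h, 58h

def one_byte_table():
    """Map each ASCII character in 40h..5Fh to its 16-bit-mode instruction."""
    table = {}
    for byte in range(0x40, 0x60):
        op = OPS[(byte - 0x40) >> 3]        # which row of 8 opcodes
        table[chr(byte)] = f"{op} {REG16[byte & 7]}"   # low 3 bits pick the register
    return table

for ch, insn in one_byte_table().items():
    print(f"{ord(ch):02X}  {ch!r:5} {insn}")
```

Looking up 'L', 'D', 'W' and '_' in the result reproduces the dec sp, inc sp, push di and pop di lines of the listings above.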
Can we use the text as machine code to at least exit the program? Unlikely; ret is c3, int 20h is CD 20, so neither of those opcodes will appear in ASCII text.
AFAIK, you can't tail-call a DOS routine to have it print and then exit without needing a ret or equivalent in your own code. Or if you can, it would be a 3-byte jmp rel16, or more likely a far jmp, which would take more bytes than 2+2 for mov ah, 9 / int 21h if we're talking about the jmp ptr16:16 form.
CodePudding user response:
You can shorten it to 8 bytes if you don't mind providing the text as its argument:
R:\>debug
-A
0DB2:0100 MOV AH,9
0DB2:0102 MOV DX,82
0DB2:0105 INT 21
0DB2:0107 RET
0DB2:0108
-R CX
CX 0000 :8
-N HELLO8.COM
-W
Writing 0008 bytes
-Q
R:\>HELLO8.COM HELLO WORLD$
HELLO WORLD
R:\>
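Why DX=82h works: DOS copies the command tail into the PSP, with a length byte at offset 80h and the text from 81h onward, normally starting with the space that followed the program name and ending with a carriage return. The sketch below models that layout in Python; command_tail and dos_print are hypothetical helpers, and the leading-space behaviour is assumed from the session above rather than tested on every DOS.

```python
def command_tail(args):
    """Model PSP bytes from 80h onward: length byte, tail text, CR terminator."""
    tail = (" " + args).encode("ascii")     # the space after the program name is kept
    return bytes([len(tail)]) + tail + b"\r"

def dos_print(psp_from_80h, dx):
    """What int 21h / ah=9 would print from offset dx: everything up to '$'."""
    start = dx - 0x80
    end = psp_from_80h.index(b"$", start)   # raises ValueError if no '$' was typed
    return psp_from_80h[start:end].decode("ascii")

psp = command_tail("HELLO WORLD$")
print(dos_print(psp, 0x82))
```

Offset 81h holds the space, so 82h is the first character of the argument. If the user forgets the '$', real DOS keeps printing whatever follows in memory until it finds one; the sketch raises an error instead.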
