Programs Under the Hood...Hello, World

Posted by: dargueta in Untagged  on

(Part 4)

Hello, and welcome to the latest part of my series, Programs Under the Hood. Today we’re going to relearn a bit of what I poorly taught last time, get more familiar with debug.exe, and finally write our first program in assembly language.
 I know I threw a lot at you last time, so if you don’t mind, I’ll do a quick rehashing of the important points.

 

  1. The generic Intel CPU has several sets of registers, or super-fast little sections of memory that have names. You can think of them as variables. They can be split up into the  following groups:
    1. General Registers. These are the registers you can use for whatever you want. They each are 32 bits (one dword) in length, and are named EAX, EBX, ECX, and EDX. Each 32-bit register can be divided into two 16-bit registers, an upper half and a lower half. For historic reasons, only the lower half of each is directly accessible to the programmer. They are AX, BX, CX, and DX, respectively. In turn, each of these 16-bit registers can be divided into two (accessible) 8-bit registers: AH and AL, BH and BL, CH and CL, and DH and DL. As one may guess, registers ending in an H are the high byte, and those ending in an L are the low byte.
    2. Segment Registers. Intel processors view memory as segments, or chunks, of 64KB that are further divisible into 65536 bytes each. (That’s 2^16, in case you were wondering.) The job of the 16-bit segment registers is to point to a specific segment in memory that the program is using. The registers are:
      1. CS – Points to the segment containing the currently executing code
      2. SS – Points to the segment containing the program stack
      3. DS – Points to the current data segment
      4. ES, FS, GS – Extra pointers to other segments the program might be using. These are typically data segments, but not necessarily.
    3. Pointer Registers. These 16-bit registers point to a specific location within a segment. They contain the offset of the desired address. They can be divided into two categories:
      1. Stack Pointers: SP points to the top of the program stack, and BP (the base pointer) conventionally marks the base of the current stack frame. Both of these registers point to a location within the segment specified by SS.
      2. Data Pointers: SI and DI (source index and destination index) are typically used when transferring data from one memory location to another, although they can be used for just about anything. By default they point to a location within the segment specified by DS; the string instructions are the exception, pairing DI with ES instead.
  2. Instructions make the processor do something. They may have anywhere from zero to two operands, although the number of operands is typically fixed for each instruction type. Examples include MOV (copies data), ADD (adds two numbers), and SUB (subtracts two numbers). The most recent Intel processor recognizes several hundred different instructions.
  3. Operands are the pieces of data that an instruction uses to do something. (You can think of them as arguments to a function.) There are three types of operands: registers, memory, and numeric constants. Some instructions are very picky about what kinds of operands they can work with; others can take just about anything. When writing an instruction out, remember that the format is instruction   destination, source.
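If the register halves in point 1 are hard to picture, here is a minimal Python sketch of the idea. It is only a model: `split_register` is a made-up helper name, and real registers are of course not Python integers. It shows how AX, AH, and AL are just different views of the same bits in EAX.

```python
# Toy model of one general register: a 32-bit value and the named views onto it.
# split_register is a hypothetical helper for illustration only.
def split_register(eax):
    """Return the AX, AH and AL views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF        # keep only 32 bits
    ax = eax & 0xFFFF        # lower 16 bits of EAX
    ah = (ax >> 8) & 0xFF    # high byte of AX
    al = ax & 0xFF           # low byte of AX
    return ax, ah, al

ax, ah, al = split_register(0x12344C00)
print(f"AX={ax:04X} AH={ah:02X} AL={al:02X}")  # AX=4C00 AH=4C AL=00
```

Notice that the upper half of EAX (0x1234 here) has no name of its own, which matches the point above about only the lower half being directly accessible.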
Before we write our first assembly language program, I’ll introduce you to the three most basic instructions you can’t do anything without. Seriously. They’re that important, so pay attention.

BASIC INSTRUCTIONS YOU REALLY NEED TO KNOW
  • MOV      Stands for “move”. Equivalent to the = operator in higher-level languages. Always takes two operands. Example: MOV AH,42H is equivalent to AH = 0x42;
  • INT        Stands for “call interrupt.” Always takes one operand, an 8-bit constant specifying which of the 256 interrupt vectors is to be used. An interrupt vector can be thought of as a library of functions; when calling an interrupt, a number identifying the desired function is passed in AH (sometimes in AX), and the arguments, if any, are passed in other registers. So if AH contains 9 and I call INT 21H, DOS (which owns interrupt 21H; the BIOS, or Basic I/O System, provides others) will call a function that’s basically a castrated version of printf.
  • JMP       Stands for “jump unconditionally”. Equivalent to the goto keyword in C/C++. Takes one operand, always one of 1) a numeric address, 2) a register containing an address, or 3) a memory location containing an address. Example: (1) JMP 0483H  (2) JMP EAX  (3) JMP NEAR [65F2H]. (Side note: the “NEAR” keyword is necessary to tell the assembler how long the jump is going to be, and therefore how many bytes the address at the specified memory location needs to be. This saves up to 6 bytes per jump instruction at assembly time.)
With these three instructions, you can do a lot. Most of the work in a program is done by MOV and INT; JMP and its cousins, the conditional jumps, (more on that later) are used to provide if-then-else flow control.

TOOLS OF THE TRADE – GETTING COMFY WITH DEBUG.EXE
I must warn you of a few things:
  1. debug.exe is so old it only recognizes 8086 instructions. There are no 32-bit registers, no weird instructions like CPUID or CMPXCHG16B. There’s a way to get around this, but I’ll tell you later once you’re more proficient.
  2. Any number you type into debug.exe is assumed to be in hexadecimal. You can’t input decimal, octal, or binary numbers, so don’t try. Do not use special notation such as the 0x prefix or h suffix. Debug will choke and vomit an error message at you. All numbers must begin with a numeric digit, so if you want 0xE5 or E5H then you write it 0E5 (case doesn’t matter).
  3. debug.exe can only assemble COM programs. The only way to make an EXE with debug is to construct the headers and relocation tables by hand, and that is way too much work, even for me, the re-inventor of the wheel and the quesadilla.
  4. debug.exe doesn’t recognize labels, identifiers, and precompiler directives. There can be no blank lines in the middle of a program, since a blank line is used to signal the end of the code. Comments must go on their own line. My sample code later on will violate these rules, but I will show you how to make your programs debug.exe-compatible so you can follow along. (I may even write a utility to do this and give it to you.)
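Since every number has to be typed in bare hexadecimal, a tiny converter can save some mental arithmetic. The following is a rough Python sketch, not part of debug.exe; `to_debug_hex` is a hypothetical helper that simply follows warning 2 above (no prefix, no suffix, a leading 0 when the first digit is a letter).

```python
# Hypothetical helper: format a number the way warning 2 above asks for.
def to_debug_hex(value):
    """Return bare hexadecimal with no 0x prefix or h suffix.

    A leading 0 is added when the first digit would otherwise be a letter.
    """
    if value < 0:
        raise ValueError("only non-negative values are handled in this sketch")
    digits = format(value, "X")
    if digits[0].isalpha():
        digits = "0" + digits
    return digits

print(to_debug_hex(229))    # 0E5
print(to_debug_hex(256))    # 100
print(int("0E5", 16))       # 229, going the other way
```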
Now that I’ve warned you, let’s begin! 

WRITING OUR FIRST PROGRAM - HELLO, WORLD!
Here’s a little exercise for you. Start up debug.exe from the command prompt (Start > Run, type command, then type debug at the prompt), and type F  0000  FFFF   0 then press Enter. This just clears the memory (fill from address 0000 to FFFF with 0). Normally we don’t need to do this, but I want to show you something later, so for this time only we do.
Now type A  0100 and press Enter. This is how we tell debug.exe that we want to start writing an assembly language program. You should get something like this:

49C8:0100   _

Don’t worry if the number before the colon is different than mine. It doesn’t matter. Now that we’re in assembler mode, you can type in your program, one instruction to a line, and keep going until you’re done. The first number in hexadecimal is the segment, the second is the offset. Your segment is likely different from mine because the address you see for the memory is allocated by debug. Since this is a COM program, we only need to concern ourselves with the offset because everything is in one segment. Convenient, no?
What we’re going to do now is write our first assembly-language program, the obligatory Hello World crapola. Type in the following exactly as shown (or just copy and paste):

jmp   0112
db    "Hello, World!",0d,0a,"$"
mov   ah,09
mov   dx,0102
int   21
mov   ax,4c00
int   21

Follow this by a blank line and press enter again. You should get the prompt again, a little white dash. Just for kicks, I want to show you what the assembled code looks like. Type D  0100 and press enter. You should get a dump of memory that looks somewhat like this:
[Screenshot: DEBUG session showing the program as entered and the resulting memory dump]

I colored the screenshot for you: red represents code, and blue represents data. (Ignore the little green circle for now.) You can see that our program is very small—only 30 bytes, including the data. The same program written in C comes out to a whopping 7168 bytes!
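In case the screenshot is hard to read, here is a Python sketch that rebuilds the same 30 bytes by hand. The opcode values are the usual 8086 encodings for these instructions and were not read off the screenshot, so treat them as an assumption and compare against your own dump.

```python
# Hand-assembled image of the program, as it should sit at offset 0100.
# Opcode bytes are assumed standard 8086 encodings; check them against your dump.
message = b"Hello, World!\r\n$"      # 13 characters + CR + LF + '$' = 16 bytes

program = (
    bytes([0xEB, 0x10])              # jmp 0112 (short jump over the 16 data bytes)
    + message                        # the data (blue in the screenshot)
    + bytes([0xB4, 0x09])            # mov ah,09
    + bytes([0xBA, 0x02, 0x01])      # mov dx,0102 (low byte stored first)
    + bytes([0xCD, 0x21])            # int 21
    + bytes([0xB8, 0x00, 0x4C])      # mov ax,4c00 (low byte stored first)
    + bytes([0xCD, 0x21])            # int 21
)

print(len(program))                              # 30
print(" ".join(f"{b:02X}" for b in program))
```

Only 14 of those bytes are actual instructions; the other 16 are the string.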
Let’s run our program and see if it works. Type G =100 at the prompt (for go to address 100H) and press enter. You should get “Hello, World!” followed by a blank line. (If you ran debug.exe straight from the Run menu, it’ll abruptly close. Don’t worry, that’s normal. Run it again from the command prompt instead.)
How does it work? Let’s take a line-by-line look.

jmp   0112
As I told you before, JMP makes the CPU jump to the specified location. Since we’re writing a COM program, all the code and data must be in the same segment. It’s far more convenient to know where your data is before you write the program, so most COM programs have the data at the beginning, with a single JMP statement to skip over the data. Looking at this statement, that means our code actually begins at address 0112H of the segment.

db    "Hello, World!",0d,0a,"$"
This comprises the data portion of our program: a single string. If you’re wondering what the 0d,0a business is, it’s just the ASCII code for CR+LF, equivalent to \n in C/C++ programs. 0d returns the cursor to the beginning of the line, and 0a bumps it down one line on the screen. The dollar sign signals the end of the string to the DOS PrintString function, which comes next.

mov   ah,09
mov   dx,0102
int   21
A lot of explaining is in order for these three lines. I’ll start by explaining a bit of the way interrupt calls work. You have 256 interrupt vectors (the maximum number you can differentiate with one-byte identifiers). Most of these interrupt vectors have subfunctions, each associated with an 8- or 16-bit number, depending on the interrupt. When calling an interrupt, the subfunction’s number is placed in AH or AX, depending on the size of the identifier. Each function does something different. For example, in interrupt 21h, subfunction 3ch creates a file, subfunction 48h allocates a block of memory, and so on. A lot of these subfunctions require arguments to be passed to them, just like functions in higher-level languages. Just like in other languages too, these subfunctions return data in the registers. They can, however, return multiple things at once, only restricted by the number of registers that they have available. Let’s take a look at the requirements for interrupt 21h, subfunction 09h:

INTERRUPT 21H SUBFUNCTION 09H
Description: Writes a string terminated by the dollar sign ($) to standard output, i.e. the console in most instances.
Arguments:
      AH = 09H (the function identifier)
      DS:DX points to the string terminated by ‘$’.
Returns: Nothing.
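To make the '$' terminator concrete, here is a small Python sketch of what a routine like this has to do with the address it is handed. It is a model, not DOS code: `read_dollar_string` is a hypothetical name, and a `bytearray` stands in for the program's segment.

```python
# Model of reading a '$'-terminated string out of a segment.
# read_dollar_string is a hypothetical helper, not a DOS API.
def read_dollar_string(memory, offset):
    """Collect bytes starting at offset, stopping just before the first '$'."""
    end = memory.index(b"$", offset)   # raises ValueError if there is no terminator
    return memory[offset:end].decode("ascii")

segment = bytearray(0x200)             # zero-filled stand-in for our segment
data = b"Hello, World!\r\n$"
segment[0x102:0x102 + len(data)] = data

text = read_dollar_string(bytes(segment), 0x102)
print(repr(text))                      # 'Hello, World!\r\n'
```

One consequence of this design: a string printed this way cannot contain a literal dollar sign, because the first one it meets ends the output.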

So…we put 09h in AH, as expected. Then we put 0102h in DX…where’d that come from? 0102h is the starting address of the string. If you look closely at the screenshot I provided, in the section where I’m entering the assembly language code, you’ll see the addresses on the left changing as I enter in instructions. Our string begins at offset 0102h, as I circled in green. You can also see in the next statement that our program begins at offset 0112h – that’s how I got the address for the JMP statement at the beginning of the program.
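If you don't have the screenshot handy, the same addresses fall out of simple bookkeeping. The Python sketch below walks through the program the way the assembler prompt does, adding each instruction's length to a running offset. The lengths are the standard 8086 sizes for these instructions (an assumption on my part; they are not stated above), and `list_offsets` is a made-up helper name.

```python
# Reproduce the offsets shown at the assembler prompt.
# Instruction lengths are assumed standard 8086 sizes.
START = 0x100   # COM programs begin at offset 0100

items = [
    ("jmp 0112", 2),
    ('db "Hello, World!",0d,0a,"$"', 16),
    ("mov ah,09", 2),
    ("mov dx,0102", 3),
    ("int 21", 2),
    ("mov ax,4c00", 3),
    ("int 21", 2),
]

def list_offsets(start, entries):
    """Return [(offset, text), ...] plus the first free offset after the program."""
    offsets = []
    address = start
    for text, size in entries:
        offsets.append((address, text))
        address += size
    return offsets, address

offsets, end = list_offsets(START, items)
for address, text in offsets:
    print(f"{address:04X}  {text}")
print(f"{end:04X}  (next free byte)")
```

The second line of output starts at 0102 (the string) and the third at 0112 (the first real instruction), which are exactly the two addresses used in the program.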
The next line of code calls the interrupt 21h. But wait! Don’t we need to set DS:DX to point to the string? Yep. That’s what we did. You load the segment into DS, and the offset into DX. Since our program is a COM program, we won’t need to worry about changing DS because everything is in the same segment. Larger programs don’t have that luxury. Let’s continue our analysis of the program:

mov   ax,4c00
int   21
Oh no…another interrupt call. What does this one do? A quick look at an interrupt table tells us that interrupt 21h, subfunction 4ch exits a program with an 8-bit exit code. In short, this works exactly like the exit() function in C. Let’s look at the requirements:

INTERRUPT 21H SUBFUNCTION 4CH
Description: Terminates a process (program), releasing resources (i.e. files, memory, etc.) that the process claimed back to the system.
Arguments:
      AH = 4CH (the function identifier)
      AL = return code
Returns: Nothing. Technically, it never returns because it kills the process it was called from.
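Putting the two subfunction descriptions side by side, the "library of functions" idea can be sketched as a dispatcher that looks at AH and picks a routine. This Python model is purely illustrative: `int21`, the `regs` dictionary, and the `output` list are all invented for the example and cover only the two subfunctions described here.

```python
# Toy dispatcher for the two interrupt 21h subfunctions described above.
# int21, regs and output are hypothetical names used only for this sketch.
def int21(regs, memory, output):
    """Dispatch on AH. Returns an exit code for 4Ch, otherwise None."""
    ah = (regs["AX"] >> 8) & 0xFF
    al = regs["AX"] & 0xFF
    if ah == 0x09:                        # write '$'-terminated string at DS:DX
        start = regs["DX"]
        end = memory.index(b"$", start)
        output.append(memory[start:end].decode("ascii"))
        return None
    if ah == 0x4C:                        # terminate with the return code in AL
        return al
    raise NotImplementedError(f"subfunction {ah:02X}h is not modelled")

memory = bytes(0x102) + b"Hello, World!\r\n$"
output = []

int21({"AX": 0x0900, "DX": 0x0102}, memory, output)      # mov ah,09 / mov dx,0102 / int 21
exit_code = int21({"AX": 0x4C00, "DX": 0x0102}, memory, output)  # mov ax,4c00 / int 21
print(output, exit_code)
```

Because the return code travels in AL, it can only be a value from 0 to 255.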

Wait a minute…I never set AH or AL. I just moved 4c00h into AX. Gotcha! AH and AL are the two halves of AX. Changing  AX changes both of them. If AH is the high byte of AX, and AL is the low byte, then after the MOV statement is executed AH should contain the top byte of 4c00h and AL should contain the low byte. That means that AH = 4ch and AL = 00h. I could have written the same thing as:

 
mov   ah,4c
mov   al,00
int   21

but I didn’t for two reasons: 1) it takes up one more byte of space (four bytes instead of three) because we’re executing two instructions in the place of one; 2) I wanted to show you how the whole AX/AH/AL thing works if you haven’t gotten it by now.
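To put a number on the size difference, here is a short Python comparison of the two encodings. The byte values are the standard 8086 encodings for these MOV forms, which is my assumption rather than something stated above, so check them against a memory dump if you want to be sure.

```python
# Compare the machine code for one 16-bit MOV against two 8-bit MOVs.
# Byte values are assumed standard 8086 encodings.
one_mov = bytes([0xB8, 0x00, 0x4C])          # mov ax,4c00
two_movs = bytes([0xB4, 0x4C,                # mov ah,4c
                  0xB0, 0x00])               # mov al,00

print(len(one_mov), len(two_movs))           # 3 4

# The 16-bit constant is stored low byte first, so AL's value comes before AH's.
al_value, ah_value = one_mov[1], one_mov[2]
print(f"AH={ah_value:02X} AL={al_value:02X}")  # AH=4C AL=00
```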

Well, that’s all for today. Next time we’ll explore some shortcomings of the DOS interrupts, learn about memory segmentation and addressing, and begin writing a few useful functions to be used in our disassembler. If you want to find out more about interrupts, just go to the previous links I gave you, or visit http://lrs.uni-passau.de/support/doc/interrupt-57/INT.HTM. (Warning: There are a lot of interrupts there specific to certain systems and extender programs that most don’t have. Only look at those whose description begins with DOS followed by a version number, e.g. DOS 1+. Everyone who has Windows or DOS has these.)