Programs Under the Hood...Part 8: Disassembling Stuff
Posted by: dargueta in Untagged on Aug 13, 2008
Welcome back to Part 8 of Programs Under the Hood. Today we're going to disassemble a BIOS interrupt to get a real-world example of what programs are structured like, and we'll see if we can convert some of it to C/C++ code. (I apologize for the large line breaks. For some reason, they just appeared.)
POPPING THE HOOD-DISASSEMBLING A BIOS ROUTINE
A few issues back I mentioned that the BIOS provides a lot of basic functionality in assembly-language programs. It's sort of like a primitive library, albeit one that's slow as hell because it's designed to be compatible with everything. Nowadays performance programs bypass it altogether and execute functions in device drivers and system libraries. However, since the BIOS came first, it's interesting to see what it's built like, and how it works. We're going to disassemble part of interrupt 10H, which was responsible for graphics functions such as plotting pixels and changing video modes.
FINDING THE ENTRY POINT
Every interrupt has a starting address, like a function. The addresses aren't guaranteed to be the same on every machine, so how do you find out where INT 10H is on your computer? By asking DOS, of course. By flipping through an interrupt table, I found that INT 21H (which is a DOS service rather than a BIOS one), subfunction 35H gives the entry point, or starting address, of an interrupt:
INTERRUPT 21H SUBFUNCTION 35H
Description: Returns the entry point of the specified interrupt.
Arguments:
AH = 35H (the function identifier)
AL = interrupt number
Returns: Interrupt starting address is in ES:BX.
We need to call this interrupt and examine the output. But how? I really don't feel like writing an entire program, most of which will be spent converting a binary number into human-readable text. This is where debug comes in handy. You can write a program and run it in real time, stepping through each instruction and examining the contents of registers at each step, if you like. That's perfect! We only need two or three lines of code instead of who knows how many.
Remember that debug is a dumb command-line utility. We need to tell it we want to start assembling a program by typing A 100. This means "Assemble beginning at address 100H". COM programs, by the way, always begin at address 100H of their segment. The first 256 bytes contain the command line and some other extra information stuck in by the loader.
So now that we're ready to begin writing our program (more like a stub since it's not a complete one), we need to do three things: 1) set AH to the number of the function we want to execute, 35H; 2) set AL to the interrupt whose address we want, 10H; 3) call interrupt 21H. We can do this with two instructions, plus a breakpoint to prevent debug from executing anything we don't want it to:
mov ax,3510
int 21
;hard-coded breakpoint
int 3
Leave a blank line, then press enter again to let debug know you're done. It should return to the prompt, a single dash. Now we need to execute our little stub program. At the prompt, type G =100 and press enter. Here's what my debug spat out:
AX=3510 BX=08A9 CX=0003 DX=0000 SP=FFEC BP=0000 SI=0090 DI=0000
DS=0000 ES=0210 SS=13EC CS=13EC IP=0105 NV UP EI PL NZ NA PE NC
13EC:0105 CC INT 3
The first two lines are a printout of the state of the registers and the CPU flags. The third line shows the next instruction that would be executed. What do you know...it's our breakpoint. If we hadn't put that there, debug would've kept right on going into undefined memory, and who knows what it could be executing, which is kinda dangerous.
But I digress. If you remember the specification for INT 21H, subfunction 35H, the address we're interested in is in ES:BX. Well, what does the output say? ES=0210, BX=08A9. So the entry point of interrupt 10H is 0210:08A9, at least on my computer. Let's go right now and see what's lurking there.
DISASSEMBLING WITH DEBUG
Now that we know where INT 10H starts, we can directly disassemble it. Type U 0210:08A9 (or whatever address your computer returned) and press enter. You should get the following (I slightly modified it to make it more easily readable):
0210:08A9 CMP BYTE PTR CS:[08A7],02
0210:08AF JNZ 08B6
0210:08B1 CALL 0806
0210:08B4 JB 0915
0210:08B6 CMP BYTE PTR CS:[08A7],01
0210:08BC JZ 091B
0210:08BE CMP AH,00
0210:08C1 JZ 08F3
0210:08C3 CMP AH,1C
0210:08C6 JA 08D3
0210:08C8 CMP AH,04
I know what you're saying: Whoop-dee-doo. I have no idea what this does. Neither did I, at first.
First of all, this is only the first 32 bytes of the interrupt or so, because that's all that debug.exe disassembles at once unless you give it a range. If you type U again and press enter, it will pick up where it left off and disassemble the next 32 bytes or so. Let's take an in-depth look at the code we have now:
0210:08A9 CMP BYTE PTR CS:[08A7],02
0210:08AF JNZ 08B6
The CMP instruction compares the byte at CS:08A7 with the value 2. JNZ is the same as JNE, which means jump if not equal, so the two instructions between here and offset 08B6 (the CALL and the JB) get skipped unless the byte at 0210:08A7 is equal to 2. We can easily rewrite this as an if statement in C/C++ code:
if(*((BYTE *)(0x00029A7)) == 2)
{
//execute the statements at 0210:08B1 through 0210:08B4
}
//continue executing at 0210:08B6
Notice that the C condition is the opposite of the jump condition: the jump skips the block, while the if enters it. Notice also that we have to hard-code the address, convert it to a pointer, then dereference it to get what we want. Let's try and make this a little more readable:
const BYTE *pbValue = (BYTE *)0x00029A7;
if(*pbValue == 2)
{
//execute the statements at 0210:08B1 through 0210:08B4
}
//continue executing at 0210:08B6
Isn't that easier to read? But wait...where'd the 0x00029A7 come from? Why didn't I just put 0210:08A7? Answer: You can't do that in C/C++. The language has no segment:offset syntax, and on a 32-bit flat-memory system pointers are linear addresses, meaning that they completely ignore segments and just use 32-bit offsets. (Old 16-bit DOS compilers offered far pointers as an extension, but that's another story.) Why? Let me explain:
If you recall from a previous issue of Programs Under the Hood, I mentioned that with the way the 8086's segmentation works you could only use 1 MB of memory. This is still the case when an Intel processor runs in real mode today, which it supports in order to maintain backwards compatibility with 16-bit applications. To get around this severe limitation, 32-bit programs nowadays use segments that all start at address 0 and rely on 32-bit offsets instead, allowing up to 4 GB of memory to be used. (If you want to use more memory, you need a 64-bit processor, which allows a theoretical limit of about 16.8 million terabytes!)
So how does one convert a 16-bit segment-offset address into a 32-bit linear address and vice-versa? Luckily, there's a simple formula for this:
LinAddr = segment*16 + offset
SegAddr = {offset = addr % 10H; segment = (addr - offset) / 10H};
Let's try converting the address we encountered in our program. Our segment is 0210H, our offset is 08A7H, so:
LinAddr = (segment)*16 + (offset)
LinAddr = (0210H)*10H + (08A7H)
LinAddr = 02100H + 08A7H
LinAddr = 00029A7H
We will be doing this with every address we encounter from now on. (I must warn you, though, that since most linear addresses have up to 4096 equivalent segment addresses, you probably won't get the same address out that you put in.)
Going back to the disassembly, we encounter the following instruction:
0210:08B1 CALL 0806
0210:08B4 JB 0915
Okay...we know what this does. It calls a function located at 0210:0806, right? So how do we figure out what that function does? It'd take too long to disassemble the whole thing here, so I'll just say that it copies a buffer in DS:SI into the screen text buffer at 0xB000:0000 if AX is a certain value. Now for the jump statement: jump if below to offset 0915. JB is the same instruction as JC (jump if carry), and assembly routines commonly report failure by setting the carry flag before they return, so my best guess is that this checks whether the call went well. I honestly don't know enough about the function to tell you for sure.
0210:08B6 CMP BYTE PTR CS:[08A7],01
0210:08BC JZ 091B
Again, we reference the mysterious byte at 0210:08A7. This time we check to see if it's equal to 1. If it is, we jump to offset 091B.
0210:08BE CMP AH,00
0210:08C1 JZ 08F3
0210:08C3 CMP AH,1C
0210:08C6 JA 08D3
0210:08C8 CMP AH,04
0210:08CB JZ 08ED
More mundane comparisons, as you can see. If you haven't noticed, we're comparing AH to different values, then jumping to different locations. Is this 1) a switch block, or 2) a series of if-else statements? Look closely at the second comparison. What is the jump instruction? JA, or jump if above. Switch statements cannot contain relative comparisons like case a < 5 and so on. Only if statements can, so I'm reading this as a series of if-else statements. (Fair warning: compilers often emit a range check just like this in front of a switch block's jump table, so a JA by itself doesn't prove it either way.) Moving on:
0210:08CD C4C4 LES AX,SP
0210:08CF 42 INC DX
0210:08D0 EB43 JMP 0915
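I'm sure you can figure out the INC and JMP, but what about this mysterious LES? What does it do? To be brief, LES loads ES and a general register with a 4-byte far pointer read from memory: the low word goes into the register and the high word goes into ES. Its source has to be a memory operand, though, so the register-to-register form debug printed here (the bytes C4 C4) isn't a valid encoding, and debug is most likely decoding something that isn't ordinary code. (If you're running debug in a Windows DOS box, I believe C4 C4 followed by one more byte is how the DOS emulation layer calls out to Windows, which would make the 42 its argument rather than an INC DX.) Here are some examples of the valid forms: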
;ES:DI loaded from DWORD at SS:[BP+04]
les di,[bp+04]
;DS:SI loaded from DWORD at DS:[BX]
lds si,[bx]
;FS:EAX loaded from 48-bit pointer at DS:[2*EBX+ECX+6].
;To use FS or GS, as well as this more complex memory
;addressing scheme, you need at least a 32-bit processor.
lfs eax,[2*ebx+ecx+6]
The rest of the function goes on for quite a while, so I think here's a good place to stop. Before I finish this, I'll leave you with some steps to disassembling programs by hand:
TIPS AND TRICKS FOR DISASSEMBLING STUFF
- Keep track of memory references. If a program reads from or writes to a memory address, write that down and see what's there. Sometimes you'll be surprised and find an actual string; most times it'll be some binary number.
- Write down all addresses of function calls as you come across them.
- Write down all addresses of jumps as you come across them, and make sure you note whether they are conditional or unconditional. Follow all unconditional forward jumps. Chances are, they jump over a data section, which will give you erroneous instructions if you try to disassemble it.
- Draw yourself a model of the stack and keep track of where everything is. Sometimes you can figure out the purpose of some of the variables just by seeing how they're used.
- Go back and disassemble all functions as if each were its own program (i.e. go back to the first tip and repeat everything). If you figure out what a function does, provide a name for it and write it down along with its address, arguments that may be passed to it, and return values.
- Go back and disassemble from the beginning of each address jumped to. For example, if you find an instruction that says JZ 0389, you would disassemble beginning at 0389.
- WATCH YOUR ADDRESSES. Being off by one byte changes everything. For example, observe the following:
Actual code:
jz near -0641H
and ax,es:[di]
Disassembled code if off by one byte:
test bh,bh
int 26H
and ax,ds:[di]
See the difference? With some other instructions, this could be really risky. You could overwrite your own variables, call interrupt functions you never meant to call, and basically put your computer at risk. Moral of the story: Watch your addresses.
That's all for now. Next time, I'll show you some functions we need to write to get this disassembler project off the ground. Oh, and by the way, because of time constraints, I'm writing it in C/C++.