Programs Under the Hood...Part 5
Posted by: dargueta in Untagged on Jul 05, 2008
Hello, and welcome back to Programs Under the Hood. Today we're going to start planning out our disassembler's memory, talk a bit about memory segmentation, and then start working on the actual program. For those of you who have no idea what I’m talking about, please see Programs Under the Hood…Introduction to start from the beginning.
All right... let’s get going on this disassembler project. I can only compile COM programs on my computer (don’t worry about why), so I’ll need to outline a few restrictions first. As I said earlier, a COM program can be at most 64 KB (65,536 bytes) in size, including the stack, data, and code. This is a problem for such a big project, so we need to plan carefully how much space we’re using for what.
Before I write any code, I want to enumerate the features I want in this program. That way, I know exactly what I need to code and how it needs to work.
MEMORY SEGMENTATION
One day in the 1970s, the folks at Intel decided that they needed a simple way for the processor to address memory. One guy hit on the idea of chopping memory up into an array of regular-sized chunks, then using two pointers: one to point to the chunk, the other to point to a specific byte within that chunk. The idea of segment-offset pointers was born, and the principle has been followed down to the Pentium 4 and beyond.
If you remember from the previous section, the generic Intel processor I described has four so-called pointer registers. Each of these points to a specific location within a segment. Let’s practice: What does 473D:9FE2 point to? Well, it points to byte 9FE2h within segment 473Dh. To convert that to a linear address, we multiply the segment by the number of bytes in a segment (65,536, which is the same as shifting it left by 16 bits) and add the offset. So 473D:9FE2 has the linear address 473D9FE2h, or 1,195,220,962 in decimal.* This 16-bit segment/offset system allows for a whopping 4,294,967,296 bytes of memory to be addressed: exactly 4 GB (2^32 bytes). We’re reaching that limit pretty soon…which is why, generally speaking, if you want more than 4 GB of RAM, you need a 64-bit computer.
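*Not true with the 8086. It has only 20 address lines, so it can address just 1 MB of RAM, which the Intel designers decided was more than enough for anybody to have in their computer. Instead of shifting the segment address by the full 16 bits, the 8086 shifts it by only 4 bits before adding the offset, so on an 8086 the address 473D:9FE2 really refers to linear address 513B2h. This results in 64 KB segments that start every 16 bytes and overlap one another, so the same physical byte in memory can have up to 4096 different segment:offset addresses. Later processors still work this way in real mode, which is what DOS runs in; protected mode, introduced with the 80286, replaces the shift with a lookup in a descriptor table.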
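To make the arithmetic concrete, here is a minimal C sketch of both calculations: the simplified model that shifts the segment by 16 bits, and the 4-bit shift the 8086 uses. The function names are my own and purely illustrative; they aren't part of the disassembler.

```c
#include <stdint.h>

/* Simplified model from the text: the segment is shifted left by a
   full 16 bits (multiplied by 65,536) before the offset is added. */
uint32_t linear_simplified(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 16) + offset;
}

/* 8086 real mode: the segment is shifted left by only 4 bits
   (multiplied by 16). The 8086 has 20 address lines, so the sum
   is masked to 20 bits and wraps around at 1 MB. */
uint32_t linear_8086(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFUL;
}
```

For 473D:9FE2, linear_simplified gives 473D9FE2h and linear_8086 gives 513B2h. Because 8086 segments overlap, 513B:0002 resolves to that same byte.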
Now that we know how memory segmentation works, let’s start working toward our first goal: printing things out to the screen. Well, we know that we’ll have to print out addresses, file paths, and variable messages. Sounds like a job for printf, don’t you think? Yeah…except there’s one problem. We don’t have a printf. The DOS function we worked with earlier (INT 21h, function 09h) only prints out a simple string, terminated by a dollar sign. We could put the string together by hand every time, and then call the DOS print-string function, but we won’t for two reasons: 1) the formatting will be nearly the same in most instances, so it’s better to have a single function do it; 2) if you’re doing more or less the same thing multiple times, repeating the code wastes valuable space. Why not just put it all in a function and then call it, cutting the duplicated code dramatically? Great! Let’s do it. But how... Why not make our own version of printf? It doesn’t have to be as fancy, just enough to cover what we need. Besides, it’ll be a great exercise. So what do we need our version of printf to do?
Well, now that we’ve got that figured out, how are we going to emulate it? Instead of making things confusing, let’s just basically copy printf and modify it a little bit. Taking a quick look at the specs for the printf function on Microsoft’s website (http://msdn.microsoft.com/en-us/library/hf4y5e3w(VS.80).aspx), we can see that we’re going to have to make a few modifications to the general printf format.
Let’s get started. Obviously, in order to implement printf we’re going to need a function that converts an integer into a string in any base. We’ll use two functions: SINTTOSTRING and UINTTOSTRING, for signed and unsigned integers, respectively. Because signed numbers are stored in two’s-complement format, SINTTOSTRING only needs to write a minus sign if the number is negative, convert the signed number into its unsigned magnitude, call UINTTOSTRING, and then return. So UINTTOSTRING is going to do the bulk of the work. I’ve written the C/C++ code for UINTTOSTRING first, and then we’re going to translate it into assembly language code. I’m doing this for two reasons: 1) to help plan out what we’re going to do better; 2) to show you how higher-level code is resolved into machine language.
unsigned __int16 UINTTOSTRING(char *buffer, unsigned __int64 number, unsigned __int8 base)
{
    //digit characters for every supported base (2 through 36)
    const char szNUMBERBUFFER[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    unsigned __int64 remainder = 0;
    unsigned __int64 temp = number;
    unsigned __int16 stringIndex = 0;
    unsigned __int16 powerIndex = 0;

    /* What we need to do is continually divide the number by the base,
       using the remainder as an index into szNUMBERBUFFER to find
       the string representation of that one digit. Since continual
       integer division will eventually result in number becoming 0,
       this is the condition we use to stop. */

    //guard: a base below 2 would divide by zero or loop forever, and a
    //base above 36 would index past the end of szNUMBERBUFFER.
    if(base < 2 || base > 36)
        return 0;

    //this is a fix: if the number is 0, the counting loop below never
    //runs, so we write the single digit ourselves.
    if(number == 0)
    {
        buffer[0] = '0';
        return 1;
    }

    //strip off leading zeroes by seeing how many digits are needed.
    while(temp > 0)
    {
        temp /= (unsigned __int64)base;
        ++stringIndex;
    }

    //compensate for string indexing beginning at 0, not 1.
    --stringIndex;

    //do the conversion
    do
    {
        //remainder is index into number string
        remainder = number % (unsigned __int64)base;
        //dump appropriate character representation of digit into buffer
        buffer[stringIndex] = szNUMBERBUFFER[remainder];
        //divide the number so we can get the next digit
        number /= (unsigned __int64)base;
        //point to next character in buffer. We're going in reverse so that
        //the number doesn't end up backwards. After the last digit this
        //wraps past 0, which is harmless because it is never used again.
        --stringIndex;
        //count the digit we just wrote
        ++powerIndex;
        //if the number is 0 we're done. Exit the loop.
    }while(number > 0);

    //return the number of characters in the string. Note that the buffer
    //is not terminated; the caller appends '$' or '\0' as needed.
    return powerIndex;
}
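The signed wrapper described earlier could look something like the sketch below. This is only my guess at a shape for SINTTOSTRING, written in standard C types rather than the __int64 family; uint_to_string is a compact stand-in for UINTTOSTRING so the example compiles on its own, and both names are hypothetical.

```c
#include <stdint.h>

/* Stand-in for UINTTOSTRING: writes the digits of number in the given
   base (2 to 36) into buffer and returns how many characters it wrote.
   The buffer is not terminated. */
uint16_t uint_to_string(char *buffer, uint64_t number, uint8_t base)
{
    const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char reversed[64];              /* 64 binary digits is the worst case */
    uint16_t count = 0;
    uint16_t i;

    if (base < 2 || base > 36)
        return 0;

    /* Peel off digits, least significant first. */
    do {
        reversed[count++] = digits[number % base];
        number /= base;
    } while (number > 0);

    /* Copy them out most significant first. */
    for (i = 0; i < count; ++i)
        buffer[i] = reversed[count - 1 - i];
    return count;
}

/* Sketch of the signed wrapper: emit '-' for negative values, then
   hand the magnitude to the unsigned routine. */
uint16_t sint_to_string(char *buffer, int64_t number, uint8_t base)
{
    uint64_t magnitude = (uint64_t)number;
    uint16_t length;

    if (number >= 0)
        return uint_to_string(buffer, magnitude, base);

    /* Two's-complement negation done in unsigned arithmetic, so the
       most negative value does not overflow. */
    magnitude = 0 - magnitude;
    length = uint_to_string(buffer + 1, magnitude, base);
    if (length == 0)
        return 0;                   /* invalid base */
    buffer[0] = '-';
    return (uint16_t)(length + 1);
}
```

Doing the negation on the unsigned value is a deliberate choice: negating the smallest 64-bit signed number directly would overflow, while the unsigned subtraction wraps to the correct magnitude.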
Here’s how I’m budgeting the 64 KB:
- At least 512 bytes are going to be needed to hold the file we’re disassembling, since EXE headers are at least that long. I’ll set aside 1024 bytes, since reading from RAM is much faster than requesting a file read.
- 128 bytes will be used as a string buffer for each of the input, output, and data dump file paths, since that’s the longest path DOS can safely handle. Accounting for the hard-coded temporary file paths of the other resources we’ll be using, this comes to a total of 1024 bytes.
- 2048 bytes will be used as a loading area for external functions (I’ll get to that in a second).
- About 8192 bytes for variables, i.e. file handles, buffers, pointers, etc. Most of this will be taken up by the instruction database. If it gets too large, we may have to move it into an external file that we’ll read from at runtime.
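Adding up the list above gives a quick sanity check on the budget. The sketch below is just that arithmetic in C; the macro and function names are made up for illustration, and the 256-byte PSP (the block DOS places at the start of a COM program’s segment) is my own addition to the list.

```c
#include <stdint.h>

#define COM_SEGMENT_BYTES   65536UL /* everything must fit in one segment */
#define PSP_BYTES             256UL /* assumption: DOS's PSP at offset 0 */
#define FILE_BUFFER_BYTES    1024UL /* file being disassembled */
#define PATH_BUFFER_BYTES    1024UL /* all file path strings together */
#define LOAD_AREA_BYTES      2048UL /* loading area for external functions */
#define VARIABLE_BYTES       8192UL /* handles, buffers, instruction database */

/* Total of the four items in the list. */
uint32_t reserved_bytes(void)
{
    return FILE_BUFFER_BYTES + PATH_BUFFER_BYTES
         + LOAD_AREA_BYTES + VARIABLE_BYTES;
}

/* What is left over for the code itself and the stack. */
uint32_t remaining_bytes(void)
{
    return COM_SEGMENT_BYTES - PSP_BYTES - reserved_bytes();
}
```

That comes to 12,288 bytes reserved, leaving 52,992 bytes for code and stack under these assumptions.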
And here are the features I want in the program:
- Print error and status messages to the screen or a trace file.
- Disassemble both EXE and COM programs of all versions (including PE and NE programs but not DLLs), doing the following:
  - Output disassembled code into a user-specified file
  - Skip data sections and output them to a separate file, both in ASCII and hexadecimal form.
Here’s what our version of printf needs to do:
- Convert binary integers into strings of any base, distinguishing between signed and unsigned integers
- Print strings inside other strings (e.g. in “the file %s could not be found” the %s would be replaced by a string passed in as an argument)
- Print characters inside strings (e.g. “blah %c blah…”)
- Convert pointers into strings (basically an extension of the int-to-string function)
- Convert Booleans into strings (0 is false, anything else is true)
- If necessary, it should deal with floating-point numbers too
Modifications to the general printf format:
- Prefixes:
  - # Prints a number base indicator before or after numbers. B is appended to binary numbers, 0 is prepended to octal numbers, 0x is prepended to hexadecimal numbers, and nothing is added to decimal numbers.
  - - Prints a negative sign in front of negative signed numbers, nothing in front of positive signed numbers. This is the default. Cannot be used with u prefix.
  - + Prints a negative sign in front of negative signed numbers, and a positive sign in front of positive signed numbers. Cannot be used with u prefix.
  - u Unsigned (used only with numbers, cannot be used with floating-point numbers)
- Size specifiers:
  - b Byte
  - w Word
  - d Dword
  - q Qword
- Base specifiers:
  - b Binary
  - o Octal
  - d Decimal
  - x Hexadecimal, lowercase
  - X Hexadecimal, uppercase
- There will be no wide-character support, so C and S format specifiers won’t be used.
- We don’t need g, G, a, A, n, or i.
- Because we have different-sized numbers, numeric specifiers except the floating-point ones must be followed by a size specifier. For example, a signed decimal word would be written %dw.
- f now indicates a 32-bit floating point number, and F indicates a 64-bit floating-point number.
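As a rough illustration of how these rules fit together, here is a sketch of a parser for the integer specifiers. It assumes the order is prefixes, then base, then size (the only example given is %dw, so the prefix ordering is my guess), and the NumberSpec struct and parse_number_spec function are hypothetical names, not the final design.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    int show_base;    /* '#': print a base indicator */
    int force_sign;   /* '+': print a sign on positive numbers too */
    int is_unsigned;  /* 'u': treat the value as unsigned */
    int base;         /* 2, 8, 10, or 16 */
    int uppercase;    /* 1 for 'X', 0 otherwise */
    int size_bytes;   /* 1 (byte), 2 (word), 4 (dword), 8 (qword) */
} NumberSpec;

/* Parses an integer specifier starting just after the '%'.
   Returns the number of characters consumed, or 0 if the text is not
   a valid integer specifier under the assumed ordering. */
size_t parse_number_spec(const char *text, NumberSpec *spec)
{
    const char *p = text;
    int saw_sign = 0;

    memset(spec, 0, sizeof *spec);

    /* Prefixes. Note that b and d mean different things as a base and
       as a size, so position is what tells them apart. */
    for (;; ++p) {
        if (*p == '#')      { spec->show_base = 1; }
        else if (*p == '+') { spec->force_sign = 1; saw_sign = 1; }
        else if (*p == '-') { spec->force_sign = 0; saw_sign = 1; }
        else if (*p == 'u') { spec->is_unsigned = 1; }
        else break;
    }
    if (spec->is_unsigned && saw_sign)
        return 0;           /* '+' and '-' cannot be combined with 'u' */

    switch (*p++) {         /* base specifier */
    case 'b': spec->base = 2;  break;
    case 'o': spec->base = 8;  break;
    case 'd': spec->base = 10; break;
    case 'x': spec->base = 16; break;
    case 'X': spec->base = 16; spec->uppercase = 1; break;
    default:  return 0;
    }

    switch (*p++) {         /* mandatory size specifier */
    case 'b': spec->size_bytes = 1; break;
    case 'w': spec->size_bytes = 2; break;
    case 'd': spec->size_bytes = 4; break;
    case 'q': spec->size_bytes = 8; break;
    default:  return 0;
    }
    return (size_t)(p - text);
}
```

Given "dw" (from %dw), this yields a signed decimal word and reports two characters consumed; "#uXq" yields an unsigned uppercase-hexadecimal qword with a base indicator.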
This next portion necessitates some explaining that’ll make this blog way too long, so I’ll leave off here for now. Thanks for reading, and I’ll see you soon.
Comments (3)

John said: This is starting to get complicated. Nonetheless, it seems like fun! You have enticed me to pickup a book on assembly.