Programs Under the Hood...Lock and Load (Part 2)
Posted by: dargueta


I said in my last post that I was going to assume you know assembly language from here on. I changed my mind. I'll walk all of you through this in a way that'll be as painless as possible for both the experienced and the n00bs. Before we begin writing this disassembler, I should outline a few things I plan on doing.

  1. I'm going to use debug.exe for nearly everything. It's very crappy, yes, but it's primitive enough so you can see how everything works at its most basic level.
  2. To avoid confusion between assembler syntaxes such as NASM, MASM, and the like, I'll outline my own standard, very straightforward and simple.
    1. Instructions have the format [label] mnemonic dest[,source]
    2. All numbers in code sections are in hexadecimal only.
    3. All labels must begin with a dollar sign. No other identifier can contain the dollar sign.
    4. Comments are single-line and begin with a semicolon.
    5. Variables are declared using the syntax var varname varsize varinitvalue. For example, to create a "$"-terminated string called szMyMsg (the kind the DOS print-string call expects), I'd write (on one line): var szMyMsg db "Hello World",0d,0a,"$"
    6. Functions:
      i. Arguments are passed in C-style order on the stack (i.e. pushed in reverse order) or in registers, depending on the type. The caller is responsible for saving registers and cleaning up the stack.
      ii. Return conventions: integers and near pointers in EAX, 64-bit integers in EDX:EAX, and far pointers in DX:EAX. Boolean return values are single bytes, 1=true and 0=false.
  3. I'll provide two code samples per example as necessary: one in C/C++ style pseudocode, the other in assembly language.
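Since a disassembler eventually has to produce lines in this format, it may help to see the format taken apart. Here's a rough Python sketch (not part of the original plan, which uses C-style pseudocode and assembly) that splits one instruction line into its pieces. The function name parse_line and the dictionary layout are my own invention for illustration, and it only handles instruction lines, not var declarations (a semicolon inside a quoted string would confuse it).

```python
def parse_line(line):
    """Split one line of the form '[label] mnemonic dest[,source] ; comment'.

    Hypothetical helper: returns None for blank or comment-only lines,
    otherwise a dict with 'label', 'mnemonic', and 'operands'.
    """
    code = line.split(";", 1)[0].strip()      # rule 4: ';' starts a comment
    if not code:
        return None
    tokens = code.split(None, 1)
    label = None
    if tokens[0].startswith("$"):             # rule 3: labels begin with '$'
        label = tokens[0]
        if len(tokens) == 1:                  # a label on a line by itself
            return {"label": label, "mnemonic": None, "operands": []}
        tokens = tokens[1].split(None, 1)
    mnemonic = tokens[0].lower()
    operands = []
    if len(tokens) > 1:
        operands = [op.strip() for op in tokens[1].split(",")]
    return {"label": label, "mnemonic": mnemonic, "operands": operands}


parsed = parse_line("$start mov ax,4c00 ; get ready to exit")
# rule 2: numbers are hexadecimal, so convert with base 16
value = int(parsed["operands"][1], 16)
```

Note that a label used as an operand (say, a jump target) is not mistaken for a line label, because only the first token on the line is checked for the dollar sign.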

 

 Any questions? Great! Let's get going.

 

TYPES OF PROGRAMS

Before we try to disassemble any program, we need to know what types and formats there are. Be sure to distinguish the two: a format is the way a program is stored in a file. A type describes how a program is loaded and executed in memory.

There are two formats that we need to concern ourselves with: EXE and COM. (I would show diagrams, but in my infinite wisdom I managed to make them far too tiny to read and am currently working on remaking them.)

  • COM programs are by far the easiest to deal with. This is the oldest format: just a file with bare code, no extra information or anything. Because of that lack of extra information, these programs are restricted in size to one segment of memory, or 64 KB. That means that all the code, data, and the program stack need to fit into that one segment. You can't make a COM program larger than 64 KB. Your computer will stop when it reaches the end of the segment and sit there like a paralyzed dog. It doesn't matter if you have just one more byte of code left. It won't execute. It won't even get loaded.
  • EXE programs, on the other hand, are far more common nowadays. They can span multiple segments with more than one segment per segment type (code, data, stack). They contain a lot of extra information. In fact, every EXE program must begin with a standard EXE header (linkers commonly pad it out to 512 bytes), and most contain extra information used by the loader when a program is being loaded into memory. We won't worry about this extra information for a while, though. For now we'll just play with COM programs.
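To make the difference concrete, here's a small Python sketch of how a tool might tell the two formats apart. It relies on two details the text above doesn't spell out: DOS EXE files start with the two signature bytes "MZ", and the header stores its own size at offset 8 as a count of 16-byte paragraphs. The function names and the 256-byte allowance (the block DOS reserves at the start of a COM program's segment) are my additions, so treat this as an illustration rather than a complete format checker.

```python
import struct

# Upper bound for a COM image: one 64 KB segment minus the 256 bytes
# DOS reserves at the start of the segment (assumption stated above).
COM_MAX_SIZE = 0x10000 - 0x100


def identify_format(image: bytes) -> str:
    """Guess whether a DOS program image is an EXE or a COM file."""
    if image[:2] == b"MZ":                 # EXE header signature
        return "EXE"
    if len(image) <= COM_MAX_SIZE:         # bare code that fits in one segment
        return "COM"
    return "unknown"


def exe_header_size(image: bytes) -> int:
    """Return the EXE header size in bytes.

    The 16-bit little-endian value at offset 8 counts 16-byte paragraphs.
    """
    (paragraphs,) = struct.unpack_from("<H", image, 8)
    return paragraphs * 16


# A made-up header: signature, six filler bytes, then a size of 32 paragraphs.
fake_exe = b"MZ" + bytes(6) + struct.pack("<H", 32) + bytes(502)
kind = identify_format(fake_exe)
size = exe_header_size(fake_exe)
```

A COM file has no signature at all, which is why the check for it is just "not an EXE and small enough to fit".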

There are also several types of programs.

  • Regular programs are loaded into memory, run, and then overwritten with a waiting program when they finish executing. These constitute the majority of programs nowadays.
  • DLL - Short for Dynamic-Link Library, this is a collection of functions that any program can use. A DLL remains in memory as long as it's needed by a program, and is then removed. Some important examples include kernel32.dll, user32.dll, and mscoree.dll.
  • TSR - Terminate and Stay Resident. These programs are run, but instead of being unloaded when they're done, they sit in memory until the computer turns off. TSR programs typically wait for some event to happen, "wake up," do their thing, and go back to "sleep". Much like a cat, actually. These are usually used to provide additional keyboard functionality or for antivirus software. If, for example, I were tired of digging through the Programs menu to find Microsoft Word, I could write a TSR program that would run Word whenever I pressed CTRL+W at the desktop. They're fun but tricky to write; I'll show you how later, if I find an opportunity.

 

WHAT'S A PROCESS LIKE YOU DOING IN RAM LIKE THIS?

Here's a little exercise. Start up Microsoft Word (or any other program that takes forever to load and crashes often). It isn't ready right away, now is it? We say it's loading. Loading where? Into RAM, of course. Since reading from and writing to RAM is at least a few orders of magnitude faster than doing the same operation on the fastest hard drives, operating systems load a program into RAM before executing it. What if there isn't enough space? That's a memory management problem, which I'll discuss in detail if and when I show you how to create your own full-fledged operating system. That'll be much later.

For now, let's take a look at the loading process in more detail:

  1. User makes request to run program.
  2. Operating system finds the file containing the program and opens it.
  3. The operating system loads the program in different ways depending on the type:
    1. COM Programs: Copy file straight into RAM and execute.
    2. EXE Programs:
      i. Process the header.
      ii. Allocate memory as specified in the header.
      iii. Load portions of the program file into different segments in memory as specified in the header.
      iv. Load required libraries.
      v. Find the entry point and begin execution.
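The COM branch of the steps above is simple enough to simulate. The Python sketch below fakes a 64 KB segment with a bytearray and copies a program image into it. One detail goes beyond the list: DOS puts COM code at offset 100h, after a 256-byte block it reserves at the start of the segment. The names load_com and load_program are hypothetical, and the EXE branch is deliberately left unimplemented since it needs the header processing described in the sub-steps.

```python
SEGMENT_SIZE = 0x10000    # one 64 KB segment
COM_LOAD_OFFSET = 0x100   # COM code starts after the first 256 bytes


def load_com(image: bytes) -> bytearray:
    """Copy a COM image into a fresh simulated segment and return it."""
    if len(image) > SEGMENT_SIZE - COM_LOAD_OFFSET:
        raise ValueError("image does not fit in one segment")
    segment = bytearray(SEGMENT_SIZE)      # all zeros, like empty RAM
    segment[COM_LOAD_OFFSET:COM_LOAD_OFFSET + len(image)] = image
    return segment


def load_program(image: bytes):
    """Step 3 of the list: pick a loading strategy based on the format.

    Returns (memory, entry_offset) for COM images.
    """
    if image[:2] == b"MZ":
        # EXE: header processing, allocation, and relocation are not sketched.
        raise NotImplementedError("EXE loading needs header processing")
    return load_com(image), COM_LOAD_OFFSET


# Four bytes of real 8086 code: mov ah,4c / int 21 (the DOS 'exit' call).
program = bytes([0xB4, 0x4C, 0xCD, 0x21])
memory, entry = load_program(program)
```

After this runs, memory[entry:entry + 4] holds the same four bytes as the file, which is all "copy file straight into RAM" really means.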

 

CONSTANTS AND VARIABLES ARE THE SAME THING? WHAT?

You know all those variables in the last C program you wrote? Ever wonder where they're put? Most of them (the global and static ones, anyway) are actually in the program file. Open up any EXE program with a hex editor (in COM programs it's trickier), and you'll see that there are entire sections of the file where it's all just null bytes. Those are your variable spaces. Since the file is copied into RAM before execution, those get copied too.

Even your string constants are there. Write a simple Hello World program, compile it, and then open it in Notepad or a hex editor. I can guarantee you 100% that there will be a string somewhere in the file that says "Hello World." It's not a variable, but how else is your program going to print out a string if it doesn't have a string to print?

Wait...if variables and constants are copied into the same memory...then can you modify constants? The answer is yes. It's bad practice, but you can do it. Under DOS, the only reason you can't do it in higher-level languages is that the compiler will refuse to let you, not the system. (Newer protected-mode systems can mark memory read-only, so there the system may object too.) Hell, you could even rewrite the code in your program right in memory. (These are called self-modifying programs and are very difficult to write.) The system can't tell the difference between a string of opcodes and a string of characters. You can theoretically execute data, but I don't recommend doing it because the result will be unpredictable. Windows has a feature called Data Execution Prevention for a reason.
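If you don't have a hex editor handy, the same experiment can be roughed out in Python. The sketch below scans a blob of bytes for runs of printable characters, the way you'd spot "Hello World" in a compiled file, and then overwrites part of that "constant" in a mutable copy of the image. The byte blob is made up for the demonstration (three code bytes, a string, then zeroed variable space), and find_strings is a hypothetical helper, not a standard function.

```python
import string

# Printable ASCII, minus the whitespace control characters, so that
# CR/LF and friends end a run instead of being counted as text.
PRINTABLE = set(string.printable.encode("ascii")) - set(b"\t\n\r\x0b\x0c")


def find_strings(image, min_len=4):
    """Yield (offset, text) for each run of printable bytes of min_len or more."""
    start = None
    for i, byte in enumerate(image):
        if byte in PRINTABLE:
            if start is None:
                start = i                  # a new run begins here
        else:
            if start is not None and i - start >= min_len:
                yield start, bytes(image[start:i]).decode("ascii")
            start = None
    if start is not None and len(image) - start >= min_len:
        yield start, bytes(image[start:]).decode("ascii")


# Made-up program image: code bytes, a string constant, zeroed variables.
image = bytearray(b"\xba\x0b\x01" + b"Hello World\r\n$" + bytes(8))

before = list(find_strings(image))
image[3:8] = b"HELLO"                      # "modify the constant" in memory
after = list(find_strings(image))
```

Nothing in the bytearray marks offset 3 as a constant rather than code or a variable. It's all just bytes, which is the whole point of this section.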

 

Anyway, that's it for now. Next time we'll begin mapping out our program, and begin writing our first bit of code. See you soon!

By the way...why won't this let me put any punctuation in the title? It's supposed to say Programs Under the Hood - Lock and Load, but it keeps replacing the dash with a space. Oh well...


Jordan
June 25, 2008

Excellent read dargueta, looking forward to your next blog!
