[Exploit development] 4- Understanding Binary Files
Intro
Hello everyone, I hope you’re all well. In this article, we’re going to talk about binary files. We’ll look at how they are built, how they are structured, what each part contains and why that information matters, and how we can read and understand it using specialized tools. Understanding the architecture of binary files is vital for reverse engineering, that is, for debugging and analyzing software to determine what it does so you can break it. It is also very important for developing custom shellcode, as we’ll see in the upcoming articles.
Executable binary file
An executable binary file (for example, a Portable Executable (PE) file on Windows or an ELF file on Linux) is a file whose content is in a binary format consisting of a sequence of bytes. It contains not only the machine instructions that the CPU executes directly, but also a lot of information that helps the OS load it correctly into memory, manage it, and execute it. Executable binary files include a wide range of file types, including executables ( .exe, .elf ), libraries ( .dll, .so ), control panel applications ( .cpl ), kernel modules ( .sys, .ko ), and many others.
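As a rough illustration of "a sequence of bytes with information for the OS", the sketch below guesses the container format from the first bytes of a file that is already in memory. The function name identify_format is hypothetical (it is not part of any library). It relies on two well-known signatures: ELF files start with the bytes 0x7f 'E' 'L' 'F', and PE files start with "MZ" and store, at offset 0x3c, the position of a "PE\0\0" signature.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the article): guess the executable format
 * from the leading bytes of a file that has been read into buf. */
const char *identify_format(const unsigned char *buf, size_t len)
{
    /* ELF: the file starts with 0x7f 'E' 'L' 'F'. */
    if (len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0)
        return "ELF";

    /* PE: starts with the DOS "MZ" stub; the little-endian 32-bit value
     * at offset 0x3c points to the "PE\0\0" signature. */
    if (len >= 0x40 && buf[0] == 'M' && buf[1] == 'Z') {
        size_t off = (size_t)buf[0x3c]
                   | (size_t)buf[0x3d] << 8
                   | (size_t)buf[0x3e] << 16
                   | (size_t)buf[0x3f] << 24;
        if (off <= len - 4 && memcmp(buf + off, "PE\0\0", 4) == 0)
            return "PE";
    }
    return "unknown";
}
```

A caller would typically read the first few hundred bytes of a file with fread and pass them to this function. Real tools check far more than the signature, so treat this only as a first sanity check.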
Building phases
An executable binary file goes through several phases to be built. Let’s write a simple program to show those phases and where our data goes in the final binary file.
#include <stdio.h>
#ifdef _WIN32
#define OS "Windows"
#elif __linux__
#define OS "Linux"
#elif __unix__
#define OS "Unix"
#elif __APPLE__
#define OS "Apple"
#else
#define OS "Unknown"
#endif
char g_cData[] = "This is a test program";
int main(void)
{
    puts( "Hello World from " OS );
    puts( g_cData );
    return 0;
}
Preprocessing phase
Preprocessing is the first phase of the build process. The preprocessor is conceptually a separate step that runs before the compiler proper, and what it does is essentially text substitution. All preprocessor directives begin with a hash symbol (#), such as #include, #define, and #ifdef. In the source code above, each directive and macro is replaced with its actual content, so if we build on Linux, our code will look like the following after this phase.
/*
THE ACTUAL CONTENT OF stdio.h ( WITH ITS DEPENDENCIES )
*/
char g_cData[] = "This is a test program";
int main(void)
{
    puts( "Hello World from " "Linux" );
    puts( g_cData );
    return 0;
}
You can use gcc -E to run only the preprocessor.
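The text substitution can be seen in isolation with a smaller sketch. Here the OS macro is simply assumed to be "Linux" (the article’s program selects it with #ifdef), and the function names greeting and greeting_length are made up for this example. After preprocessing, the return statement contains two adjacent string literals, which the compiler then joins into a single string.

```c
#include <stddef.h>
#include <string.h>

#define OS "Linux"   /* assumption for this sketch; normally chosen via #ifdef */

/* After preprocessing, the line below reads:
 *     return "Hello World from " "Linux";
 * The compiler proper then concatenates the adjacent literals. */
const char *greeting(void)
{
    return "Hello World from " OS;
}

/* Length of the final, concatenated string. */
size_t greeting_length(void)
{
    return strlen(greeting());
}
```

Running gcc -E on a file containing this code should show the macro already replaced, while the two literals are still separate at that point.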
Compiling phase
This phase runs in several steps that turn the preprocessed source into assembly code for the target architecture. You can use gcc -S to stop after this phase and inspect the generated assembly.
Assembling phase
This phase converts the compiler’s assembly output into machine code. Its output is an object file that contains not only the executable code but also information the linker needs in the next phase, such as symbol and relocation tables. You can use gcc -c to stop after assembling, without getting into the linking phase.
Linking phase
Linking is the process of collecting object modules and combining them into a single executable file that can be loaded by the OS loader and then executed. There are two types of linking: static linking, which embeds the library code in the final executable file, and dynamic linking, which resolves references to shared libraries when the program is loaded or while it runs.
Executable binary structure
The executable binary structure is not much different between Windows (PE format) and Linux (ELF format). We will cover the main parts that executable binary files have in common.
Executable binary Headers
The first part of an executable binary file consists of headers that contain data assisting the OS loader in loading the program into memory. They hold data such as the entry point (the initial instruction pointer value), the type of machine (CPU architecture), the number of sections, a description of each section, and a lot more. PE files, for example, also record the stack and heap sizes to reserve, and their legacy DOS header carries an initial stack pointer value.
You can read the header information using readelf on Linux as follows:
readelf --headers <elffile>
Or you can use dumpbin on Windows (inside the Visual Studio command prompt) as follows:
dumpbin /headers <pefile>
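To make the header fields more concrete, here is a minimal sketch that pulls three of them out of a 64-bit little-endian ELF header held in memory. The struct elf64_summary and both function names are hypothetical, and the byte offsets (18 for e_machine, 24 for e_entry, 60 for e_shnum) follow the ELF64 header layout. It handles neither 32-bit nor big-endian files.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical summary of a few ELF64 header fields. */
struct elf64_summary {
    uint16_t machine;  /* e_machine: CPU architecture, e.g. 62 for x86-64 */
    uint64_t entry;    /* e_entry: virtual address where execution starts */
    uint16_t shnum;    /* e_shnum: number of section headers */
};

/* Read an n-byte little-endian unsigned integer starting at p. */
static uint64_t read_le(const unsigned char *p, size_t n)
{
    uint64_t v = 0;
    while (n-- > 0)
        v = (v << 8) | p[n];
    return v;
}

/* Returns 0 on success, -1 if buf is not a little-endian ELF64 header. */
int parse_elf64_header(const unsigned char *buf, size_t len,
                       struct elf64_summary *out)
{
    if (len < 64 || memcmp(buf, "\x7f" "ELF", 4) != 0)
        return -1;
    if (buf[4] != 2 || buf[5] != 1)  /* EI_CLASS: 64-bit, EI_DATA: little-endian */
        return -1;

    out->machine = (uint16_t)read_le(buf + 18, 2);
    out->entry   = read_le(buf + 24, 8);
    out->shnum   = (uint16_t)read_le(buf + 60, 2);
    return 0;
}
```

The values it extracts should correspond to the "Machine", "Entry point address", and "Number of section headers" lines that readelf prints for the same file.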
Sections
Sections are the actual data containers of the executable file. There are many sections for different purposes; their number is specified in the headers, and their names differ slightly between operating systems. Where a name pair is separated by |, the first is the usual Windows (PE) name and the second its Linux (ELF) counterpart.
- .text (.code with some toolchains): Contains the executable code of the program.
- .data: Contains the initialized, writable data.
- .rdata|.rodata: Contains the read-only data, such as constants and string literals.
- .bss: Contains the uninitialized (zero-filled) global and static data; it typically occupies no space in the file itself.
- .tls|(.tbss,.tdata): Contains thread-local data; on Linux, .tdata holds the initialized part and .tbss the uninitialized part.
- .idata|(.dynstr&.dynsym): Contains the import tables.
- .edata|(.dynstr&.dynsym): Contains the export tables.
- .reloc|(.rel.*,.rela.*): Contains the image relocation information.
- .debug|.debug_XXX: Contains the debugging information.
Those are the most important sections. We can inspect them on Linux as follows:
# list the sections with their details
readelf -t <elffile>
# print the strings stored in a section
readelf -p <section-name|section-number> <elffile>
# hex-dump a section
readelf -x <section-name|section-number> <elffile>
# dump section
objcopy -O binary --only-section=<section-name> <elffile> <output>
Or as follows on Windows:
dumpbin /SECTION:<section-name> <pefile>
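To connect the section list with source code, the sketch below declares one object of each common kind and notes, in comments, where a typical toolchain places it. All names are made up for illustration, and the actual placement depends on the compiler, its flags, and the platform, so the comments describe the usual case rather than a guarantee.

```c
/* Initialized and writable: typically placed in .data */
int g_counter = 42;

/* Constant data: typically placed in .rodata (.rdata on Windows) */
const char g_banner[] = "read-only";

/* No initializer, so zero-filled at load time: typically placed in .bss */
int g_scratch[256];

/* The machine code of a function typically lands in .text */
int bump_counter(void)
{
    g_scratch[0] = ++g_counter;
    return g_scratch[0];
}
```

Compiling this with gcc -c and inspecting the object file with the commands above is one way to check where each symbol ended up on your own system.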
Conclusion
In this article, we took a basic look at executable binary files: we talked briefly about the process of building them and looked at their architecture and the information inside each part. In the following articles, we’ll talk about those parts in much more detail and learn how to deal with them programmatically.