0xNinjaCyclone Blog

Penetration tester and Red teamer


[Exploit development] 04- Understanding Binary File Formats and Internal Structures

Intro

Hello everyone, I hope you’re all well. In this article, we’re going to talk about binary files: how they are built, how they are structured, what information each part contains, why that information matters, and how we can read and understand it using specialized tools. Understanding the architecture of binary files is vital for reverse engineering, that is, debugging and analyzing software to determine what it does so you can break it. It is also very important for developing custom shellcode, as we’ll see in the upcoming articles.

Loader Perspective

When an executable is launched, the operating system does not execute the file directly from disk. Instead, the OS loader:

  1. Parses the executable headers.
  2. Validates format signatures (MZ/PE or ELF magic).
  3. Maps segments into virtual memory according to alignment rules.
  4. Applies relocations (if ASLR shifts the image base).
  5. Resolves imported symbols.
  6. Transfers execution to the defined entry point.
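
As a rough illustration of step 2, the signature check can be sketched in a few lines of C. This is a minimal sketch, not how any real loader is implemented: the names bin_format and identify_format are hypothetical, and a real loader validates far more than the first few bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef enum { FMT_UNKNOWN, FMT_ELF, FMT_MZ } bin_format;

/* Classify a buffer by its leading signature bytes only.
 * ELF files start with 0x7F 'E' 'L' 'F'; DOS/PE files start with 'M' 'Z'. */
bin_format identify_format(const unsigned char *buf, size_t len)
{
    static const unsigned char elf_magic[4] = { 0x7F, 'E', 'L', 'F' };

    if (len >= 4 && memcmp(buf, elf_magic, 4) == 0)
        return FMT_ELF;
    if (len >= 2 && buf[0] == 'M' && buf[1] == 'Z')
        return FMT_MZ;
    return FMT_UNKNOWN;
}
```

Passing the first bytes of a file to identify_format tells you which header layout to parse next. For PE files, the MZ stub is only the start: the loader then follows a field in it to find the actual PE signature.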

Understanding this process is essential because exploitation frequently targets:

  • Loader assumptions
  • Relocation mechanisms
  • Import resolution
  • Memory protection attributes
  • Entry point control flow

The binary format defines the contract between the executable and the operating system.

Executable binary file

An executable binary file is a file whose content is a sequence of raw bytes laid out in a format defined by the operating system, such as PE (Portable Executable) on Windows or ELF on Linux. It contains not only the machine instructions that the CPU executes directly, but also a lot of metadata that helps the OS load it into memory correctly and manage it while it runs. Executable binary files cover a wide range of file types, including executables ( .exe, .elf ), libraries ( .dll, .so ), control panel applications ( .cpl ), kernel modules ( .sys, .ko ), and many others.

Building phases

An executable binary file goes through several phases to be built. Let’s write a simple program to show those phases and where our data ends up in the final binary file.

#include <stdio.h>

#ifdef _WIN32 
    #define OS "Windows"
#elif __linux__
    #define OS "Linux"
#elif __unix__ 
    #define OS "Unix"
#elif __APPLE__
    #define OS "Apple"
#else
    #define OS "Unknown"
#endif

char g_cData[] = "This is a test program";

int main(void)
{
    puts( "Hello World from " OS );
    puts( g_cData );
    return 0;
}

Preprocessing phase

Preprocessing is the first phase of the build. Conceptually, the preprocessor is a separate step that runs before the compiler proper, and it performs plain text substitution. All preprocessor directives begin with a hash symbol (#), such as #include, #define, and #ifdef. In the source code above, each directive and macro is replaced with its actual content, so when the program is built on Linux, the code looks like the following after this phase.

/*
    THE ACTUAL CONTENT OF stdio.h ( WITH ITS DEPENDENCIES )
*/

char g_cData[] = "This is a test program";

int main(void)
{
    puts( "Hello World from " "Linux" );
    puts( g_cData );
    return 0;
}

You can use gcc -E to run only the preprocessor.

Although preprocessing is textual, it can significantly influence the final binary:

  • Conditional compilation may introduce or remove entire code paths.
  • Macro expansion can inline logic in unexpected ways.
  • Platform-specific directives change ABI compatibility.
  • Security flags (e.g., _FORTIFY_SOURCE) alter function implementations.

In security research, reviewing preprocessed output (gcc -E) can reveal hidden macro logic that affects control flow or memory usage.
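
To make the point about hidden macro logic concrete, here is a small sketch of a size calculation that silently changes once the macro is expanded. The macros PADDED_UNSAFE and PADDED_SAFE and both functions are hypothetical names invented for this example.

```c
#include <stddef.h>

/* Hypothetical macros: both are meant to add 8 bytes of padding. */
#define PADDED_UNSAFE(n) n + 8
#define PADDED_SAFE(n)   ((n) + 8)

/* Expands to: count + 8 * 2, i.e. count + 16 (operator precedence). */
size_t alloc_size_unsafe(size_t count)
{
    return PADDED_UNSAFE(count) * 2;
}

/* Expands to: ((count) + 8) * 2, which is what the author intended. */
size_t alloc_size_safe(size_t count)
{
    return PADDED_SAFE(count) * 2;
}
```

With count = 4, the unsafe version yields 20 while the safe one yields 24. If the smaller value sizes a buffer that later code treats as 24 bytes, the result is an overflow that is hard to spot in the source but obvious in the gcc -E output.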

Compiling phase

This phase consists of several steps that generate assembly code for the target architecture. You can use gcc -S to stop after the compiler. During compilation, the compiler performs:

  • Control Flow Graph (CFG) construction
  • Stack frame layout generation
  • Register allocation
  • Instruction scheduling
  • Optimization passes (dead code elimination, inlining, constant folding)

Compiler optimization levels (-O0, -O2, -O3) drastically change:

  • Stack variable ordering
  • Frame pointer usage
  • Gadget availability
  • Code alignment

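For exploit developers, this means the same source-level bug can behave very differently across builds: an overflow that reaches a saved return address at -O0 may hit a different variable, or nothing useful, at -O2.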

Assembling phase

This phase converts the compiler’s output into machine code. Its output is an object file that contains not only the executable code but also information for the linker, which runs in the next phase. You can use gcc -c to stop after assembling, without running the linker.

Object files contain relocation entries because symbol addresses are not yet final.

Relocation entries specify:

  • Location to patch
  • Type of relocation (absolute, relative, PC-relative)
  • Target symbol reference

The linker later resolves these relocations when combining object files. In dynamically linked binaries, additional relocations are processed at runtime by the dynamic loader.
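
The three fields listed above map naturally onto a small record, and applying a relocation is just arithmetic on them. The following is a minimal sketch under stated assumptions: reloc_entry, reloc_type, and apply_reloc are hypothetical and deliberately simplified, not the real ELF or PE structures, and only two relocation kinds are modeled (absolute: S + A, PC-relative: S + A - P).

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified relocation kinds for this sketch. */
typedef enum { REL_ABS64, REL_PC32 } reloc_type;

typedef struct {
    uint64_t   offset;  /* location to patch, relative to the section start */
    reloc_type type;    /* how the patched value is computed */
    uint64_t   symbol;  /* final address of the target symbol (S) */
    int64_t    addend;  /* constant added to the symbol address (A) */
} reloc_entry;

static int in_bounds(uint64_t offset, size_t size, size_t width)
{
    return offset <= size && size - offset >= width;
}

/* Patch `section`, assumed to be loaded at virtual address `base`.
 * Values are written in host byte order. Returns 0 on success, -1 on error. */
int apply_reloc(unsigned char *section, size_t size, uint64_t base,
                const reloc_entry *r)
{
    uint64_t place = base + r->offset;                 /* P */
    uint64_t value = r->symbol + (uint64_t)r->addend;  /* S + A */

    if (r->type == REL_ABS64 && in_bounds(r->offset, size, 8)) {
        memcpy(section + r->offset, &value, 8);
        return 0;
    }
    if (r->type == REL_PC32 && in_bounds(r->offset, size, 4)) {
        uint32_t rel = (uint32_t)(value - place);      /* S + A - P */
        memcpy(section + r->offset, &rel, 4);
        return 0;
    }
    return -1;
}
```

The PC-relative case shows why such code keeps working when the whole image moves: the patched value depends only on the distance between the patch site and the symbol, not on where the image is loaded.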

Understanding relocation entries is crucial for:

  • GOT overwrite techniques
  • Position-independent shellcode
  • Kernel exploit reliability

Linking phase

Linking is the process of collecting object modules and combining them into a single executable file that the OS loader can load and execute. There are two types of linking: static linking, which embeds the library code in the final executable file, and dynamic linking, in which symbols are resolved at load time or run time. Linkers also create various tables for the loader, containing important data that helps it load the program into memory at runtime.

In ELF binaries:

  • GOT (Global Offset Table) stores resolved symbol addresses.
  • PLT (Procedure Linkage Table) handles lazy binding.
  • .rela.plt contains relocation entries for external functions.

In PE binaries:

  • IAT (Import Address Table) stores imported function pointers.
  • ILT (Import Lookup Table) defines external dependencies.
  • Relocation Table contains entries for all base relocations in the image.

These structures are prime exploitation targets because overwriting a function pointer inside them can redirect execution flow.
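
The idea can be modeled with an ordinary array of function pointers. This is only an analogy sketched in C, with hypothetical names throughout (import_table, call_import, overwrite_slot); a real GOT or IAT is filled in by the loader rather than by the program itself.

```c
/* Signature shared by every entry in our toy "import table". */
typedef int (*handler_fn)(int);

/* The function the program expects to reach. */
static int real_impl(int x) { return x + 1; }

/* Stand-in for attacker-chosen code. */
static int hijacked_impl(int x) { (void)x; return -1; }

/* One-slot table playing the role of a GOT/IAT entry. */
static handler_fn import_table[1] = { real_impl };

/* Every call goes through the table, like a call through the PLT/IAT. */
int call_import(int x)
{
    return import_table[0](x);
}

/* Models a write primitive that lands on the table slot. */
void overwrite_slot(handler_fn fn)
{
    import_table[0] = fn;
}
```

The caller's code never changes: after overwrite_slot(hijacked_impl), the same call_import(41) that used to return 42 returns -1. Mitigations such as full RELRO on ELF counter this by making the table read-only once it has been resolved.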

Executable binary structure

The structure of executable binaries is not much different between Windows and Linux. We will cover the main parts of executable binary files.

Executable binary Headers

The first part of an executable binary file consists of headers that contain the data the OS loader needs to load the program into memory. This includes the initial stack pointer value, the initial instruction pointer value, the machine type (CPU architecture), the number of sections, the size of the stack/heap to reserve, descriptions of the sections, and much more.

You can read the header information using readelf on Linux as follows:

readelf --headers <elffile>

On Windows, you can use dumpbin (inside the Visual Studio developer prompt):

dumpbin /headers <pefile>
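
To show what these tools read, here is a minimal sketch that pulls three of the fields mentioned above (machine type, entry point, number of sections) out of a 64-bit little-endian ELF header. The field offsets follow the ELF64 header layout; the names elf64_summary, read_le, and parse_elf64_header are hypothetical, and the sketch ignores 32-bit and big-endian files entirely.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of a few ELF64 header fields. */
typedef struct {
    uint16_t machine;  /* e_machine: CPU architecture */
    uint64_t entry;    /* e_entry:   entry point virtual address */
    uint16_t shnum;    /* e_shnum:   number of section headers */
} elf64_summary;

/* Read an n-byte little-endian integer regardless of host byte order. */
static uint64_t read_le(const unsigned char *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = n; i > 0; i--)
        v = (v << 8) | p[i - 1];
    return v;
}

/* Returns 0 on success, -1 if buf is not a 64-bit little-endian ELF header. */
int parse_elf64_header(const unsigned char *buf, size_t len, elf64_summary *out)
{
    if (len < 64)                              /* ELF64 header is 64 bytes */
        return -1;
    if (memcmp(buf, "\x7f" "ELF", 4) != 0)     /* magic */
        return -1;
    if (buf[4] != 2 || buf[5] != 1)            /* ELFCLASS64, ELFDATA2LSB */
        return -1;

    out->machine = (uint16_t)read_le(buf + 18, 2);
    out->entry   = read_le(buf + 24, 8);
    out->shnum   = (uint16_t)read_le(buf + 60, 2);
    return 0;
}
```

Reading the first 64 bytes of a file into a buffer and passing it to parse_elf64_header should give the same machine, entry point, and section count that readelf --headers prints. Reading bytes explicitly instead of casting the buffer to a struct keeps the sketch independent of host endianness and alignment.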

Certain header fields directly impact security:

  • Image Base (affects ASLR behavior)
  • Section Alignment (affects memory boundaries)
  • Entry Point (initial control transfer)
  • Data Directories (Import Table, Relocation Table, TLS callbacks)

Misconfigurations or manual manipulation of these fields can:

  • Bypass security mitigations
  • Create malformed binaries for evasion
  • Trick static analysis tools

Sections

Sections are the actual data containers of the executable file. There are many sections for different purposes. Their number is specified in the headers, and their names may differ slightly between operating systems.

  • .text|.code: Contains the executable code of the program.
  • .data: Contains the initialized data.
  • .rdata|.rodata: Contains the read-only data.
  • .bss: Contains the uninitialized global data.
  • .tls|(.tbss,.tdata): Contains thread-local data (on ELF, .tdata holds the initialized part and .tbss the uninitialized part).
  • .idata|(.dynstr&.dynsym): Contains the import tables.
  • .edata|(.dynstr&.dynsym): Contains the export tables.
  • .reloc|.rela: Contains the image relocation information.
  • .debug|.debug_XXX : Contains the debugging information.

Those are the most important sections. On Linux, we can inspect them as follows:

readelf -t <elffile>
readelf -p <section-name|section-number> <elffile>
readelf -x <section-name|section-number> <elffile>

# dump section
objcopy -O binary --only-section=<section-name> <elffile> <output>

Or as follows on Windows:

dumpbin /SECTION:<section-name> <pefile>

Each section has associated memory permissions:

  • Read (R)
  • Write (W)
  • Execute (X)

Modern operating systems and toolchains aim for a W^X (Write XOR Execute) policy: a memory region should be writable or executable, but not both at the same time.

Examples:

  • .text → RX
  • .data → RW
  • .rdata → R
  • .bss → RW (zero-initialized at runtime)

Security mechanisms such as:

  • DEP/NX
  • ASLR
  • RELRO (ELF)
  • Control Flow Guard (Windows)

rely heavily on section configuration.

Incorrect section permissions, such as a region that is both writable and executable, can greatly increase the attack surface.
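
A check for such regions is easy to sketch. On ELF, the permissions the loader applies come from the program header flags, whose bit values are 0x1 (execute), 0x2 (write), and 0x4 (read). The names SEG_X, SEG_W, SEG_R, perms_to_string, and violates_wx below are hypothetical and chosen for this example only.

```c
#include <stdint.h>

/* Same bit values as the ELF program header flags PF_X, PF_W, PF_R. */
#define SEG_X 0x1u
#define SEG_W 0x2u
#define SEG_R 0x4u

/* Render flags as a three-character string such as "R-X". */
void perms_to_string(uint32_t flags, char out[4])
{
    out[0] = (flags & SEG_R) ? 'R' : '-';
    out[1] = (flags & SEG_W) ? 'W' : '-';
    out[2] = (flags & SEG_X) ? 'X' : '-';
    out[3] = '\0';
}

/* Nonzero when a region is both writable and executable. */
int violates_wx(uint32_t flags)
{
    return (flags & SEG_W) != 0 && (flags & SEG_X) != 0;
}
```

Running violates_wx over every loadable region of a binary is a quick way to spot places where injected data could be executed directly. PE files express the same idea through different bit values in each section's characteristics field.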

Conclusion

Understanding binary internals allows you to:

  • Predict memory layout deterministically.
  • Identify code reuse primitives.
  • Craft position-independent shellcode.
  • Manipulate import resolution.
  • Analyze loader behavior under ASLR.

Without executable format knowledge, exploit development becomes probabilistic. With it, exploitation becomes engineered.

In this article, we took a basic look at executable binary files, talked briefly about how they are built, and examined their architecture and the information inside each part. In the following articles, we’ll cover those parts in much more detail and learn how to work with them programmatically.