File formats reverse engineering

Latest revision as of 10:18, 20 September 2023

Reverse Engineering of File Formats

This page is intended for research on reverse engineering file formats and ongoing experiments.

Archives / Compressed Files

The "Definitive Guide To Exploring File Formats" (http://wiki.xentax.com/index.php/DGTEFF) is a good starting point for understanding how files are organized. It can serve as a basis for listing the fields likely to appear in the header of the file under study, which helps identify the header's elements and their roles.

Variable Typing

[Figure: Typage & compilation.png]

When studying files that store a program's data structures, keep in mind that the compiler translates types for a specific architecture. As a result, the same value can have several binary representations (for example, different byte orders or integer widths).
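As an illustration, the sketch below packs one value with Python's standard `struct` module under a few layouts. The value 1000 and the chosen layouts are assumptions made for the example, not properties of any particular format.

```python
import struct

# Hypothetical example value; any integer that fits in 16 bits would do.
value = 1000

# The same value under several common layouts ("<" little-endian, ">" big-endian).
encodings = {
    "uint16 little-endian": struct.pack("<H", value),
    "uint16 big-endian": struct.pack(">H", value),
    "uint32 little-endian": struct.pack("<I", value),
    "float32 little-endian": struct.pack("<f", float(value)),
}

for name, raw in encodings.items():
    # Show each representation as space-separated hex bytes.
    print(f"{name:22} {raw.hex(' ')}")
```

When searching a file for a known value, each of these byte strings is a separate candidate to look for.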

File Sizes

GCD and File Size

The idea here is to observe file sizes to identify characteristics of certain archives, such as header size and the possible use of fixed-size "containers" or chunks.

To do this, we collect the size in bytes of every file in the studied format and compute the greatest common divisor (GCD) of these sizes. If the GCD is large (well above 1), the files are probably built from fixed-size blocks, header included, whose size is the GCD or a divisor of it.
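A minimal sketch of this computation follows. The helper names `sizes_gcd` and `folder_sizes` are hypothetical, and the folder layout (one directory of sample files) is an assumption.

```python
from functools import reduce
from math import gcd
from pathlib import Path


def sizes_gcd(sizes):
    """Return the GCD of a non-empty list of file sizes (in bytes)."""
    return reduce(gcd, sizes)


def folder_sizes(folder, pattern="*"):
    """Collect the sizes of the regular files matching `pattern` in `folder`."""
    return [p.stat().st_size for p in Path(folder).glob(pattern) if p.is_file()]


# Example with made-up sizes: every size is a multiple of 2048 bytes.
print(sizes_gcd([4096, 10240, 6144]))
```

A GCD of 1 (or another very small number) means this test gives no evidence of fixed-size blocks.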

GCD and Variable Header Size

When the header size differs from the chunk size, we can try candidate header sizes in increasing order: for each candidate, we subtract it from every file size and compute the GCD of the remainders.

The file structure we are looking for would be as follows:

|---- HEADER ----|------------ CHUNK 1 ------------|------------ CHUNK 2 ------------|------------ CHUNK N ------------|

We set a threshold and report only the header size and GCD pairs whose GCD exceeds it.
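The search over header sizes can be sketched as below. The function name `header_candidates` and the parameters `max_header` and `threshold` are hypothetical, and the sample sizes are invented for the example.

```python
from functools import reduce
from math import gcd


def header_candidates(sizes, max_header, threshold):
    """Try each header size from 0 to max_header and keep those for which
    the GCD of (file size - header size) is above `threshold`."""
    results = []
    # A header cannot be larger than the smallest file.
    limit = min(min(sizes), max_header)
    for header in range(limit + 1):
        chunk_gcd = reduce(gcd, (size - header for size in sizes))
        if chunk_gcd > threshold:
            results.append((header, chunk_gcd))
    return results


# Made-up sizes: a 100-byte header followed by 3, 5 and 8 chunks of 512 bytes.
sizes = [100 + 512 * n for n in (3, 5, 8)]
print(header_candidates(sizes, max_header=128, threshold=256))
```

A low threshold also lets through header sizes that leave a smaller common divisor by coincidence, so the threshold is worth tuning per file set.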

Searching for Known Properties

Brute-Forcing a Fixed Offset

When we can gather a set of properties that clearly belong to a group of files, we can brute-force the location in the file where these properties are stored.

The first step is to collect this data into a CSV file, with each row associated with a file name. The data can, for example, be scraped from fan websites.

The second step is to write a program that generates several plausible encodings of each property value in the CSV. For every offset from 0 to the end of the smallest file in the set, we check whether each file holds the encoded value of its own property at that offset. If it does, the offset is a candidate location; if not, we move on to the next offset.

Because compilers translate types differently, we should test several representations: signed numbers in one's complement, two's complement, and sign-magnitude; fixed-point; floating-point with different sign, mantissa, and exponent sizes; and so on.
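The two steps above can be sketched as follows, assuming the CSV has already been loaded into a mapping from file name to integer value. The name `find_fixed_offsets` and the `ENCODINGS` table are hypothetical, and only a few unsigned integer encodings are tried here; signed, fixed-point, and floating-point forms would be added in the same way.

```python
import struct

# Hypothetical candidate encodings: name -> struct format string.
ENCODINGS = {
    "u8": "B",
    "u16le": "<H",
    "u16be": ">H",
    "u32le": "<I",
    "u32be": ">I",
}


def find_fixed_offsets(files, values):
    """files: name -> file content (bytes); values: name -> known integer.
    Return the (offset, encoding) pairs where every file holds its own value.
    Assumes at least one file name appears in both mappings."""
    names = [name for name in files if name in values]
    shortest = min(len(files[name]) for name in names)
    hits = []
    for enc_name, fmt in ENCODINGS.items():
        width = struct.calcsize(fmt)
        try:
            encoded = {name: struct.pack(fmt, values[name]) for name in names}
        except struct.error:
            continue  # at least one value does not fit this encoding
        # Scan every offset up to the end of the smallest file.
        for offset in range(shortest - width + 1):
            if all(files[n][offset:offset + width] == encoded[n] for n in names):
                hits.append((offset, enc_name))
    return hits


# Made-up files: a 4-byte prefix, then the value as a little-endian uint16.
files = {
    "a.bin": b"HDR\x00" + struct.pack("<H", 300) + b"\xaa\xbb",
    "b.bin": b"HDR\x00" + struct.pack("<H", 1200) + b"\xcc\xdd\xee",
}
values = {"a.bin": 300, "b.bin": 1200}
print(find_fixed_offsets(files, values))
```

With few files or small values, coincidental matches are likely, so a hit is more convincing when it holds across many files with distinct values.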

Sets and Masks

Here again, we have a dataset of properties related to the studied files. The goal is to group the files by property value and study what changes between the groups in order to identify the potential position of a property.

For a given property value "a":

A = Set(files WITH value "a" for the property)
B = Set(files WITHOUT value "a" for the property)
D = Mask file, the same size as the studied files; a byte set to FF (or a bit set to 1) marks a position that may hold the property

For each offset, if all files in A have the same byte at that offset, we set the corresponding byte in D to FF; otherwise, we set it to 00.
Then, if any file in B has the same byte at that offset as the files in A, we set the byte in D to 00.

We repeat the operation for every value of the property. For more flexibility, we can add an error tolerance, since the dataset may contain mistakes or special values that are encoded differently.
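For a single property value, the byte-level mask can be sketched as below. The function name `property_mask` is hypothetical, files are compared only up to the length of the shortest one (an assumption), and no error tolerance is included.

```python
def property_mask(group_a, group_b):
    """group_a: contents (bytes) of the files WITH the value;
    group_b: contents of the files WITHOUT it.
    Return mask D: 0xFF where the byte may encode the property, else 0x00."""
    everything = list(group_a) + list(group_b)
    length = min(len(content) for content in everything)
    mask = bytearray(length)  # starts as all 0x00
    for offset in range(length):
        a_bytes = {content[offset] for content in group_a}
        if len(a_bytes) != 1:
            continue  # files sharing the value disagree here
        (a_byte,) = a_bytes
        if any(content[offset] == a_byte for content in group_b):
            continue  # a file without the value has the same byte
        mask[offset] = 0xFF
    return bytes(mask)


# Made-up contents: only offset 1 separates the two groups.
group_a = [b"\x10\x01\x07\x55", b"\x10\x01\x07\x66"]
group_b = [b"\x10\x02\x07\x55", b"\x10\x03\x09\x55"]
print(property_mask(group_a, group_b).hex(" "))
```

The strict `any(...)` test is where a tolerance could be introduced, for instance by allowing a small number of mismatching files before clearing the byte.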

Working at the bit level complicates the comparison between A and B. With bytes, if the byte at position i is identical in A and B, we simply remove that byte from the mask. With bits, we need the exact width of the property's encoding: if the window we test is narrower than the real field, A and B may show identical values inside the window while a neighbouring bit that belongs to the property differs. A similar problem arises when the window is wider than the field.

This is why we build one mask per candidate bit width. With a wrong width, chance similarities or differences wipe out the mask, so we can test increasing widths and keep only the mask that produces a result.

Note also that windows overlap after a shift. For example, if we study bits 0 to 4 and set the mask to 1111, then study bits 1 to 5, we may obtain 10000, which would erase the previous result. The idea would be to force bits to 1 and never reset them once the A and B comparison has set them. Alternatively, we can build two masks, A <-> A and A <-> B, and combine them with a logical AND.
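A bit-level variant for one candidate width can be sketched as below. It follows the "force bits to 1" idea by accumulating windows with OR, so a later shifted window never clears bits set by an earlier one; the two-mask AND alternative is not shown. The name `bit_mask`, the big-endian bit numbering (bit 0 is the most significant bit of the first byte), and the sample data are assumptions.

```python
def bit_mask(group_a, group_b, width):
    """Return a mask (bytes) whose 1 bits mark every `width`-bit window that is
    identical across group_a and differs from that value in all of group_b."""
    everything = list(group_a) + list(group_b)
    length = min(len(content) for content in everything)
    total_bits = length * 8
    # Treat each (truncated) file as one big integer to slice bit windows.
    a_ints = [int.from_bytes(c[:length], "big") for c in group_a]
    b_ints = [int.from_bytes(c[:length], "big") for c in group_b]
    window = (1 << width) - 1
    mask = 0
    for start in range(total_bits - width + 1):
        shift = total_bits - start - width
        a_values = {(v >> shift) & window for v in a_ints}
        if len(a_values) != 1:
            continue  # the window is not constant inside A
        (a_value,) = a_values
        if any(((v >> shift) & window) == a_value for v in b_ints):
            continue  # a file in B shows the same window value
        mask |= window << shift  # OR: bits already set are never reset
    return mask.to_bytes(length, "big")


# Made-up one-byte files; run once per candidate width and compare the masks.
group_a = [bytes([0b10110000]), bytes([0b10110101])]
group_b = [bytes([0b01000000]), bytes([0b11000101])]
for width in (2, 4, 6):
    print(width, bit_mask(group_a, group_b, width).hex())
```

Because overlapping windows are merged, the marked region can be wider than the real field, so the masks for neighbouring widths are best compared side by side.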

Tools to Test

https://hexinator.com/

https://www.sweetscape.com/010editor/

Test entropy analysis with binwalk (see the first entry in Doku).

Doku

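Reverse Engineering of binary File Formats: https://archive.fosdem.org/2021/schedule/event/reverse_engineering/attachments/slides/4518/export/events/attachments/reverse_engineering/slides/4518/Reverse_Engineering_of_binary_File_Formats.pdf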

http://www.iwriteiam.nl/Ha_HTCABFF.html (the tools mentioned in its links still need to be extracted)

https://github.com/tylerha97/awesome-reversing (a comprehensive list to go through to determine what relates to binary reverse engineering; also check whether other "awesome .." lists exist)

https://beginners.re/RE4B-FR.pdf (check the detailed table of contents and find sections related to binary reverse engineering)

https://en.wikibooks.org/wiki/Reverse_Engineering/File_Formats