
Reverse Engineering Is Not Hard with LLM-Powered Tools

With the advent of LLM-powered tools, the intricate task of reverse engineering compiled software is becoming more manageable, allowing newcomers and seasoned professionals to navigate the process more easily.

First Published 9th April 2024


4 min read  |  Reflare Research Team

Navigating the complexities of compiled software can often feel like a daunting task, especially for those new to the field. This process involves understanding software from its final form, which presents various challenges.

However, recent advancements in artificial intelligence, particularly Large Language Models (LLMs), are transforming this landscape. This article delves into these developments, discussing how LLMs simplify software reverse engineering and make it more accessible for beginners through the lens of three pioneering tools: Sidekick, ReverserAI, and LLM4Decompile.

Binary Ninja Sidekick

Binary Ninja, a popular alternative to IDA Pro, recently launched a significant update introducing a new plugin named Sidekick. This AI-powered extension is designed to enhance the user experience by providing a suite of tools for analysing and understanding binary programs more effectively. The plugin offers both free and premium features, depending on user needs.

For users looking for no-cost solutions, Sidekick provides a solid foundation with features like quick search and navigation, user-defined indexes, and a code insight map. These tools allow users to navigate through binaries easily, identify points of interest, and understand the relationships between different parts of the code.
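
To make this concrete, the short sketch below uses Binary Ninja's standard Python scripting API (the binaryninja module, usable headlessly with a commercial licence) rather than Sidekick itself, and simply flags functions that call a hand-picked set of interesting imports. The binary path and the import list are illustrative assumptions, and API details may vary between Binary Ninja versions.

    # Sketch only: standard Binary Ninja Python API, not Sidekick itself.
    # The sample path and the "interesting" call targets are placeholders.
    import binaryninja

    INTERESTING = {"memcpy", "strcpy", "recv", "system", "CreateProcessW"}

    bv = binaryninja.load("/path/to/sample.bin")   # runs analysis on load
    for func in bv.functions:
        hits = {callee.name for callee in func.callees} & INTERESTING
        if hits:
            # A crude "point of interest": this function reaches a risky API.
            print(f"{func.start:#x} {func.name}: calls {', '.join(sorted(hits))}")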

Additionally, the structure recovery feature enhances code clarity by allowing users to recover structure definitions manually. The interactive assistance and documentation view further enriches the analysis process by enabling users to keep a local log of notes and create reports associated with functions.

The premium side of Sidekick takes the functionality to the next level with advanced features that leverage multiple machine learning models. The paid service enhances quick search and navigation with natural language processing, allowing for the automatic creation of indexer scripts. Structure recovery becomes automated, minimising or removing the need to do it manually, and variable, function, and structure naming suggestions are provided to clarify their purposes.

If you want to learn more about Sidekick, a security researcher from Invoke RE recently ran a livestream on YouTube about automated malware analysis with LLMs and demonstrated some of its features against real-world malware. If you are only interested in the Sidekick part of the stream, watch from the 44th minute onwards.

ReverserAI

ReverserAI, developed by Tim Blazytko, is the latest entry in the evolving space of reverse engineering assistants, aiming to enhance the reverse engineering process through the integration of locally hosted large language models (LLMs). The tool seeks to address some of the challenges faced by reverse engineers by providing automated assistance directly on consumer hardware, a feature that distinguishes it from peers that rely primarily on cloud-based AI, such as Sidekick.

The essence of ReverserAI lies in its commitment to enhancing user efficiency and data privacy. By operating offline, it ensures that sensitive data remains within the confines of the user's hardware, mitigating privacy and security concerns associated with cloud-based operations. This approach is particularly appealing to those working on confidential or sensitive projects, offering them the benefits of AI-assisted reverse engineering without the risks of data exposure.

One of the core features of ReverserAI, which is also offered in Sidekick's premium service, is its ability to automatically suggest semantically meaningful function names based on decompiler output. This feature is significant as it aids in the comprehension and documentation of reverse-engineered code, a task that can often be time-consuming and complex.
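
ReverserAI's own pipeline is more elaborate, but the underlying idea of sending decompiler output to a locally hosted model and asking for a name can be sketched with the llama-cpp-python bindings. The model path, prompt wording, and decompiled snippet below are placeholders, not ReverserAI's actual code.

    # Minimal sketch of offline function-name suggestion with a local model.
    # Not ReverserAI's implementation; model path and prompt are placeholders.
    from llama_cpp import Llama

    llm = Llama(model_path="models/local-model.gguf", n_ctx=4096, verbose=False)

    decompiled = """
    int64_t sub_401200(char* buf, int len) {
        for (int i = 0; i < len; i++) buf[i] ^= 0x5A;
        return len;
    }
    """

    prompt = ("Suggest one short, descriptive C-style function name for the code "
              "below. Answer with the name only.\n" + decompiled + "\nName:")

    out = llm(prompt, max_tokens=16, stop=["\n"])
    print(out["choices"][0]["text"].strip())   # e.g. a name like xor_buffer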

The tool is initially available as a plugin for Binary Ninja, but its design anticipates future extensions to other prominent reverse engineering platforms, such as IDA and Ghidra.

While the project is promising, the author acknowledges its limitations: chiefly, its reliance on local LLMs means it does not yet match the performance and capabilities of its cloud-based counterparts. This is partly due to the substantial computing resources required to run such models effectively on consumer-grade hardware.

However, the project is presented as a stepping stone toward exploring the feasibility and potential of local LLMs in reverse engineering tasks, opening avenues for future research and development.

LLM4Decompile

The LLM4Decompile project, developed by researchers from the Southern University of Science and Technology and The Hong Kong Polytechnic University, introduces the first open-source Large Language Model (LLM) specifically designed for decompilation tasks.

This effort addresses a gap in the reverse engineering field by leveraging the capabilities of large language models to convert compiled machine code back into a more understandable form of source code, a process essential for analysing software when its original source is unavailable.

Decompilation has long been challenged by the loss of information such as variable names and program structure during the compilation process. Traditional tools, while effective in certain scenarios, often struggle to produce code that balances readability with accuracy. The LLM4Decompile project aims to mitigate these issues by pre-training models ranging from 1B to 33B parameters on 4 billion tokens of C source code and corresponding assembly code, thus providing a robust foundation for further development in decompilation technology.
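
As a rough illustration of how such training pairs can be produced, the sketch below compiles a C file and disassembles the resulting object file to pair assembly with its source. Real pipelines compile at several optimisation levels and normalise the output; the file names and flags here are illustrative only.

    # Simplified sketch of building one (assembly, source) training pair.
    # File names and flags are placeholders; real pipelines also vary -O levels.
    import subprocess

    def asm_for(source_path: str, opt: str = "-O0") -> str:
        obj = source_path.replace(".c", ".o")
        subprocess.run(["gcc", "-c", opt, source_path, "-o", obj], check=True)
        dump = subprocess.run(["objdump", "-d", obj], check=True,
                              capture_output=True, text=True)
        return dump.stdout

    pair = {"input": asm_for("sample.c"), "target": open("sample.c").read()}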

A key contribution of this project is the introduction of the Decompile-Eval dataset, the first of its kind to evaluate decompilation based on re-compilability and re-executability. This approach focuses on whether the decompiled code can be successfully recompiled and if it functions as intended, highlighting the importance of understanding program semantics in the decompilation process.
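
Conceptually the two checks are simple: does the decompiled C compile at all, and, once linked against an assertion-based test harness, does the recompiled program still behave as the original did? The sketch below illustrates that logic for a single sample; the actual Decompile-Eval harness is more involved, and the file names here are placeholders.

    # Rough sketch of the two Decompile-Eval checks for one sample.
    # File names are placeholders; the real benchmark wraps each sample
    # in its own assertion-based test program.
    import subprocess

    def recompilable(c_file: str) -> bool:
        # Re-compilability: does the decompiled C at least compile?
        return subprocess.run(["gcc", "-c", c_file, "-o", "sample.o"],
                              capture_output=True).returncode == 0

    def reexecutable(c_file: str) -> bool:
        # Re-executability: linked with a test harness, do its assertions pass?
        build = subprocess.run(["gcc", c_file, "harness.c", "-o", "recompiled"],
                               capture_output=True)
        if build.returncode != 0:
            return False
        return subprocess.run(["./recompiled"], capture_output=True).returncode == 0

    if recompilable("decompiled_sample.c") and reexecutable("decompiled_sample.c"):
        print("sample passes both checks")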

The project's findings indicate that LLM4Decompile outperforms existing models like GPT-4 in decompiling assembly code, with up to 21% of the code being accurately decompiled, a significant improvement in terms of understanding code structure and semantics. Furthermore, the research identifies the limitations of traditional evaluation metrics such as BLEU and Edit Similarity when applied to programming languages, suggesting the need for more appropriate methods to assess decompilation outcomes.

Training methodologies adopted by LLM4Decompile, particularly the sequence-to-sequence (S2S) prediction approach, distinguish it from other models. This method focuses the model on generating accurate output source code without incorporating the assembly input into the loss calculation, thereby improving its effectiveness in learning decompilation patterns.
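
In practice, this kind of sequence-to-sequence objective is often implemented on a causal language model by masking the input (assembly) tokens out of the loss, so that only the target source-code tokens contribute to the cross-entropy. The generic PyTorch/transformers sketch below shows one common way to do this; it is not the authors' training code, and the checkpoint name is a placeholder.

    # Generic sketch of masking prompt (assembly) tokens out of a causal-LM loss.
    # Not the LLM4Decompile training code; the checkpoint name is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("some/code-llm")           # placeholder
    model = AutoModelForCausalLM.from_pretrained("some/code-llm")  # placeholder

    asm_ids = tok("push rbp ; mov rbp, rsp ; ...", return_tensors="pt").input_ids
    src_ids = tok("int add(int a, int b) { return a + b; }",
                  return_tensors="pt").input_ids

    input_ids = torch.cat([asm_ids, src_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : asm_ids.shape[1]] = -100   # -100 tokens are ignored by the loss

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()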

Sidekick, ReverserAI, and LLM4Decompile represent significant advancements in the use of large language models for reverse engineering. They not only promise to improve efficiency and understanding in binary analysis, but also signal the next evolution of reverse engineering tooling, making the field more accessible to all users as they continue to mature.
