
Exploring Data Science and Machine Learning the Cybersecurity Way

Focusing on the dynamic interplay between artificial intelligence and cybersecurity, we cast an eye over some of the more valuable projects in the areas of malware detection and software vulnerability analysis.

First Published 12th December 2023


"If there was a problem, Yo, I'll solve it."

5 min read  |  Reflare Research Team

(Security) class is in session

Previously, we published a research piece about how advancements in AI technologies, such as Large Language Models (LLMs), are forever changing the future of cybersecurity. In that article, we also shared some websites and books that could help you acquire the knowledge you need to venture into the world of data science, machine learning, and artificial intelligence.

However, once you have learned the basics, where do you go from there to practise your skills and reinforce your knowledge? Kaggle, the machine learning and data science community platform, has a lot of helpful information and interesting datasets, but most are not cybersecurity-related.

In this article, we suggest some valuable and practical project ideas about malware and software vulnerabilities. Also, we've included links to several relevant datasets that will help you bring these projects to life.

Problems looking for solutions

If you're a researcher or student in machine learning with an interest in cybersecurity, exploring malware datasets offers an array of fascinating project opportunities. These datasets (you will find links to some of them later in this article) are rich with information and can be used to create impactful and innovative projects. Here are some cool projects you can do with malware datasets:

Detecting Different Malware Families: This project could involve developing a classification system that identifies and categorises malware into specific families based on their characteristics and behaviours. Such a system would be invaluable in understanding the nature and lineage of malware threats.

Identifying Packers and File Formats: Delve into the intricate world of malware concealment by creating a tool that detects different packers and file formats used by malware. Malware authors often use packers to obfuscate their code, and being able to identify these can be crucial in analysing and mitigating malware threats.
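One common starting point for a packer detector is byte entropy: packed or encrypted sections tend to look close to random, while ordinary code and data do not. The sketch below is a minimal illustration of that heuristic only, not a complete detector. The function names and the 7.0 bits-per-byte threshold are our own assumptions for illustration, and random bytes merely stand in for a packed section.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, 8.0 for uniform data."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_packed(data: bytes, threshold: float = 7.0) -> bool:
    """Heuristic flag only; the 7.0 threshold is an illustrative assumption."""
    return shannon_entropy(data) > threshold


if __name__ == "__main__":
    plain_like = b"push ebp; mov ebp, esp; ret; " * 200  # repetitive, low entropy
    packed_like = os.urandom(64 * 1024)  # random bytes stand in for packed data
    print("plain-like flagged:", looks_packed(plain_like))
    print("packed-like flagged:", looks_packed(packed_like))
```

In a real project you would likely compute this per section of the file rather than over the whole file, and feed it to a classifier alongside other features instead of relying on a fixed cut-off.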

Malware Visualisation: A project that focuses on visualising malware data can be both insightful and educational. This could involve creating graphical representations of malware code, execution patterns, or propagation methods. Visualisation aids in understanding complex malware data and can be a powerful tool for research and education.
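A simple way to get started with visualisation is to treat a file's raw bytes as pixel intensities and lay them out in fixed-width rows. The sketch below does this with plain Python lists and an ASCII preview, so it runs without any imaging library. The row width, the zero padding, and the shading characters are arbitrary choices on our part.

```python
from typing import List


def bytes_to_grid(data: bytes, width: int = 16) -> List[List[int]]:
    """Reshape raw bytes into rows of `width` intensities (0-255), zero-padding the last row."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))  # pad a short final row
        rows.append(row)
    return rows


def render_ascii(grid: List[List[int]]) -> str:
    """Map each intensity to one of ten shading characters, darkest last."""
    shades = " .:-=+*#%@"
    return "\n".join(
        "".join(shades[value * len(shades) // 256] for value in row)
        for row in grid
    )


if __name__ == "__main__":
    sample = bytes(range(0, 256, 2))  # toy data standing in for a file's contents
    print(render_ascii(bytes_to_grid(sample, width=32)))
```

The same grid can be handed to an image library or to a CNN as a single-channel image once you move beyond the preview stage.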

Experiment with Different Detection Algorithms: Employ algorithms like Decision Trees, Support Vector Machines (SVM), or Neural Networks to classify and predict malware samples.

Unsupervised Learning Approaches: Explore clustering techniques like K-Means or Hierarchical Clustering to uncover hidden patterns and groupings in malware data without pre-labelled classes.
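To make the clustering idea concrete, here is a small pure-Python K-Means sketch. The feature vectors are invented for illustration (think of them as an entropy value and an import count per sample), and the fixed seed and iteration count are assumptions chosen to keep the example reproducible. A library implementation would be the sensible choice for real datasets.

```python
import random
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Sequence[float]], k: int,
            iterations: int = 20, seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Plain K-Means: pick k points as initial centroids, then alternate assign/update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical (entropy, import count) features for six samples.
    features = [(7.8, 3), (7.9, 5), (7.7, 4), (5.1, 120), (4.9, 130), (5.3, 110)]
    labels, centroids = k_means(features, k=2)
    print(labels)
```

Note that the two features here live on very different scales, so the import count dominates the distance. Normalising features before clustering is usually worth doing.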

Reinforcement Learning: Experiment with reinforcement learning to develop models that adapt and improve their malware detection strategies based on interactive feedback.

Deep Learning: Utilise deep learning techniques, such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), for more complex tasks like detecting sophisticated or previously unseen malware variants.

These projects will help you to enhance your understanding of both machine learning and malware detection engineering. And who knows - you might contribute significantly to the field by discovering and offering new insights and tools to combat malware threats. Nice for your CV, promotion, ego, etc.

However, if malware-related projects are not your cup of tea and you're more inclined towards vulnerability research, the following project ideas might be of particular interest to you:

Automated Code Review with LLMs: Implement a system where GPT or similar LLMs assist in code reviews by flagging sections of code that may contain vulnerabilities. This system can provide explanations or suggestions for improvement, making it a valuable tool for developers.
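The plumbing around such a reviewer can be sketched without committing to any particular model or vendor. In the example below, `ask_model` is a hypothetical callable that takes a prompt string and returns the model's reply. The prompt wording, the chunk size, and the "NONE" convention are all our own assumptions, and `fake_model` is a stand-in so the sketch runs offline.

```python
from typing import Callable, List, Tuple


def chunk_lines(source: str, size: int = 40) -> List[str]:
    """Split source code into chunks of at most `size` lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def build_prompt(chunk: str) -> str:
    """Illustrative prompt wording; tune this for whichever model you use."""
    return (
        "You are reviewing code for security vulnerabilities.\n"
        "Describe any suspicious lines and why, or reply NONE.\n\n" + chunk
    )


def review(source: str, ask_model: Callable[[str], str],
           size: int = 40) -> List[Tuple[int, str]]:
    """Return (chunk index, model comment) for every chunk the model flags."""
    findings = []
    for index, chunk in enumerate(chunk_lines(source, size)):
        answer = ask_model(build_prompt(chunk)).strip()
        if answer.upper() != "NONE":
            findings.append((index, answer))
    return findings


def fake_model(prompt: str) -> str:
    """Offline stand-in for a real LLM call (hypothetical behaviour)."""
    return "strcpy without a bounds check" if "strcpy(" in prompt else "NONE"


if __name__ == "__main__":
    code = "int main() {\n  char buf[8];\n  strcpy(buf, input);\n}\n"
    print(review(code, fake_model))
```

Swapping `fake_model` for a function that calls a real model is the only change needed to try this against actual code, although chunking on function boundaries rather than line counts would probably give the model better context.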

Predicting Vulnerability Fixes Using GPT: Develop a project where GPT suggests potential fixes for identified vulnerabilities. By training the model on historical data of how vulnerabilities were addressed, it can learn to propose effective and secure solutions for new vulnerabilities.

Datasets to play with

Having cool research project ideas means nothing if you do not have the datasets to actually work on the projects. For that reason, the following are some of the datasets that we know are publicly available for anyone to download. Be mindful though: some of these datasets are terabytes in size and might take days or even weeks to download, depending on your internet speed and bandwidth (alternatively, you can buy an external hard disk from vx-underground containing all their malware samples to date).

SOREL Dataset: This dataset contains nearly 20 million files, including around 10 million "disarmed" malware samples and an extensive collection of pre-extracted features and metadata. The dataset is unique for its comprehensive coverage, including high-quality labels derived from multiple sources, information about vendor detections of the malware samples at the time of collection, and additional tags related to each malware sample. Link: https://github.com/sophos/SOREL-20M

VX-Underground Dataset: The vx-underground malware dataset is a vast and diverse collection of malicious software samples, encompassing a wide range of malware types including viruses, trojans, worms, and ransomware. The collection is continually updated to include the latest threats, ensuring its relevance for current research endeavours in the field of malware analysis and cybersecurity. The people behind vx-underground also make those samples available on hard disks, which can be purchased on their website by those who are unable to download them directly. Link: https://vx-underground.org/

Windows Malware API Call Dataset: This dataset consists of a collection of API call sequences extracted from various malware executed in a Cuckoo Sandbox environment. This dataset includes a total of 7107 Windows PE malware samples, categorised into eight primary malware families: Trojans, Backdoors, Downloaders, Worms, Spyware, Adware, Droppers, and Viruses. Each malware sample in the dataset is represented by an ordered sequence of API calls made during its execution on a Windows operating system. The dataset also incorporates malware family classifications obtained from VirusTotal, using the hash values of each malware sample. This detailed categorisation and the dataset's focus on API call patterns make it a valuable resource for research in malware classification and cybersecurity. Link: https://github.com/ocatak/malware_api_class
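Ordered API call sequences lend themselves to n-gram features, where short runs of consecutive calls are counted and used as inputs to a classifier. The sketch below assumes each sample can be read as a whitespace-separated line of call names. We have not described the repository's exact file layout here, so treat that format, and the example trace, as assumptions.

```python
from collections import Counter
from typing import List


def parse_trace(line: str) -> List[str]:
    """Split one whitespace-separated trace into call names (assumed format)."""
    return line.split()


def api_ngrams(calls: List[str], n: int = 2) -> Counter:
    """Count every run of n consecutive API calls in order of execution."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


if __name__ == "__main__":
    # Hypothetical trace, not taken from the dataset.
    trace = parse_trace(
        "LoadLibraryA GetProcAddress VirtualAlloc LoadLibraryA GetProcAddress"
    )
    for gram, count in api_ngrams(trace, n=2).most_common():
        print(count, " -> ".join(gram))
```

The resulting counts can be turned into fixed-length vectors by choosing a vocabulary of the most frequent n-grams across the training set.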

Big-Vul Dataset: This dataset is a C/C++ code vulnerability dataset that was collected from open-source GitHub projects and the Common Vulnerabilities and Exposures (CVE) database. It includes descriptive information of vulnerabilities, such as CVE IDs, severity scores, and summaries, as well as vulnerability-related code changes extracted from the code repositories. The Big-Vul dataset contains a total of 3754 code vulnerabilities across 91 different vulnerability types, extracted from 348 GitHub projects. Link: https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset

SECBENCH Dataset: A comprehensive database of real security vulnerabilities, specifically designed to enhance software security testing research. Compiled from GitHub projects, it includes 682 vulnerabilities across 16 distinct patterns, such as cross-site scripting and SQL injection. The dataset, derived from 248 projects and nearly 2 million commits, is structured to include the non-vulnerable code (Vfix), the vulnerable code (Vvul), and the differences between them (Vdiff), highlighting the changes needed to fix the vulnerabilities. It focuses on popular programming languages such as JavaScript, Java, Python, Ruby, and PHP. Link: https://github.com/TQRG/secbench
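To illustrate the relationship between the three pieces, the sketch below derives a diff from a vulnerable and a fixed version of a snippet using Python's standard `difflib` module. The two code strings are invented for illustration and are not taken from the dataset, and this says nothing about how the dataset itself stores or generates its diffs.

```python
import difflib

# Invented example of a SQL injection and its fix (not from the dataset).
VVUL = '''query = "SELECT * FROM users WHERE name = '" + name + "'"
cursor.execute(query)
'''

VFIX = '''query = "SELECT * FROM users WHERE name = ?"
cursor.execute(query, (name,))
'''


def make_vdiff(vvul: str, vfix: str) -> str:
    """Return a unified diff from the vulnerable version to the fixed version."""
    return "".join(
        difflib.unified_diff(
            vvul.splitlines(keepends=True),
            vfix.splitlines(keepends=True),
            fromfile="Vvul",
            tofile="Vfix",
        )
    )


if __name__ == "__main__":
    print(make_vdiff(VVUL, VFIX))
```

Pairs like this are what make such a dataset useful for training: the removed lines are examples of the vulnerable pattern, and the added lines show one accepted way to fix it.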

SAP JAVA Vulnerabilities Dataset: This dataset, created by SAP Security Research, maps 624 publicly disclosed vulnerabilities across 205 distinct Java projects to 1282 commits that address these vulnerabilities. Notably, it includes vulnerabilities not available in the National Vulnerability Database, obtained from project-specific advisories. The dataset is unique for its manual curation, ensuring high-quality data and covering vulnerabilities with practical industrial relevance. It also comes with supporting scripts for automatic data retrieval and augmentation, making it a valuable resource for researchers in software security and vulnerability management. The release of this dataset and its supporting tools aims to facilitate collaborative efforts among open-source communities, academia, and the industry in maintaining and extending this valuable resource. Link: https://github.com/SAP/project-kb/tree/master/MSR2019

DEVIGN Dataset: This is a meticulously curated collection of data for identifying vulnerabilities in software code using advanced machine learning techniques. It comprises manually labelled datasets constructed from four large, diverse open-source C language projects, ensuring a wide range of real-world code complexities and varieties. The dataset includes 48687 commits, focusing on security-related changes, and is divided into vulnerability-fixing commits (23355) and non-vulnerability-fixing commits (25332). The dataset's structure allows for detailed analysis of code vulnerabilities. Link: https://sites.google.com/view/devign
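The class split quoted above also gives you a quick sanity check for any model trained on this data: a classifier that always predicts the larger class already gets a little over half the labels right. The short sketch below works that out from the two counts in the description; the function name is our own.

```python
def majority_baseline(class_counts: dict) -> float:
    """Accuracy of always predicting the most common class."""
    return max(class_counts.values()) / sum(class_counts.values())


if __name__ == "__main__":
    counts = {"vulnerability-fixing": 23355, "non-vulnerability-fixing": 25332}
    print(sum(counts.values()))  # 48687, matching the stated total
    print(f"{majority_baseline(counts):.1%}")  # roughly 52%
```

Any reported accuracy should be read against that baseline rather than against 50%.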
