This WIPBPWFI (Work In Progress But Probably Won't Finish It) project is the implementation of a cool idea that I had over a weekend.
This work tries to classify PE malware sent as attachments in phishing emails. Given that many phishy attachments are PE files with misleading icons, such as the MS Word document one, I compare the phishy icon with a ground truth set to infer whether the icon correctly represents an executable or not.
Using the ResNet18 model, the tool builds a vector representation of icons from legit PEs. Then, given a PE, the tool extracts its icon through the icon_extractor and measures the cosine similarity between it and all the legitimate icons. The result is the label of the most similar legitimate icon. The label is 1 if the icon is typically used in PE files, 0 otherwise. Therefore, if a PE sample is classified with a 0, it's likely to be malware.
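A minimal sketch of the nearest-icon lookup described above, assuming the embeddings are already computed. In the project they would come from ResNet18; here they are toy 3-dimensional lists so the example runs with the standard library only. The names `cosine_similarity`, `classify_icon` and `ground_truth` are made up for illustration and are not taken from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Treat an all-zero vector as "not similar to anything".
        return 0.0
    return dot / (norm_a * norm_b)


def classify_icon(query_vec, ground_truth):
    """Return (label, similarity) of the most similar ground-truth icon.

    ground_truth is a list of (vector, label) pairs, where label is 1 if
    the icon is typically used in PE files and 0 otherwise.
    """
    if not ground_truth:
        raise ValueError("ground truth set is empty")
    best_label, best_sim = None, -math.inf
    for vec, label in ground_truth:
        sim = cosine_similarity(query_vec, vec)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Toy stand-ins for icon embeddings (assumption: real ones are much longer).
ground_truth = [
    ([1.0, 0.0, 0.0], 1),  # e.g. a generic executable icon
    ([0.0, 1.0, 0.0], 0),  # e.g. a document icon
]

label, sim = classify_icon([0.1, 0.9, 0.0], ground_truth)
print(label, round(sim, 3))
```

A label of 0 for an icon extracted from a PE file is the suspicious case: the icon looks like something that usually isn't an executable.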
The ground truth set of icons was extracted from this dataset and is provided in the icon_extractor/benign_icons.zip archive.
The feature vector generator part in the Malware_Similar_Icons.ipynb notebook is based on the work Recommending similar images using PyTorch.
How else can you spend your rainy weekend?
A dataset made of apps is quite useless due to the assumptions of this project.
- Integrate the icon_extractor into the main Python script
- Classify the ground truth manually, adding the label 1 if it represents a PE icon, 0 otherwise
- Adapt from Colab Notebook to Python script
- Enrich legit icon dataset with icons of Word documents etc.
- From icon label, infer if binary really reflects the icon
- Try a lower input dim (e.g. 32x32) -> inputDim = (224,224)
- If input dim has been lowered, test if known similarity values have significantly changed (i.e. something is not working as expected)
- Support GPU
- Implement malware icon comparison with those in the legit dataset (CORE)
- Check if smaller icons still match with bigger ones (i.e. up/down-scaling works)