Mining Malware – Part II

Well, I’ve spent about a week off and on working on this project, and have some limited analysis to report. I’ve developed the Python code that runs through all of the @VXShare zip files and pulls out the strings. Don’t laugh: the way I’m going about this is very extensible and preserves the files as ZIPs, so that I can eventually share them back to the community. Additionally, it makes getting new sets into the system as simple as running a command, and it reduces my storage requirements.
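To make the approach concrete, here is a minimal sketch of pulling strings out of archive members without ever unpacking them to disk. It is not the project's actual code: the function names, the four-character minimum (borrowed from the usual default of the Unix `strings` tool), and the optional `password` parameter are all assumptions for illustration.

```python
import re
import zipfile

# Runs of 4+ printable ASCII bytes, similar in spirit to the Unix `strings` tool.
# The minimum length of 4 is an assumption, not a value taken from the post.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data):
    """Return every printable ASCII run found in a bytes object."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


def strings_from_zip(zip_path, password=None):
    """Yield (member_name, strings) for each file in the archive.

    Members are read into memory, so nothing is written to disk.
    `password` (bytes) is only needed for protected archives (assumed optional).
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info, pwd=password)
            yield info.filename, extract_strings(data)
```

Because each member is read straight from the archive, the ZIP stays intact and adding a new set amounts to pointing the function at another file.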

To test out the code, I pulled down @VXShare’s APT1 dataset for a trial run and loaded 220 of its 293 files into a temporary ‘database’ (I had an issue with the size, and have to recalibrate some things). I’m deliberately trying to look at the data without a lot of intelligence (a rather easy task, as my code analysis skills are dated). This means that I’m relating the data without any knowledge of the underlying operation, just looking for patterns and connections. This is the essence of data mining, but it’s probably going to cause lots of security pros to go ‘well duh, of COURSE it’s like that’.

So. Fair warning given.

I used a simple count, i.e. a histogram, of the strings found. Basically, I lumped the whole strings() dataset together and counted how often each string was repeated. This brought up a few interesting things. First, the APT files regularly referenced ADVAPI32.dll, KERNEL32.DLL, MSVCRT.DLL, WININET.dll, USER32.dll, and a few others, with counts reaching the high hundreds. That is a ‘duh’, since these are Windows API libraries. As a control, I ran the strings program against an electric power relay program I had from previous work. While the relay program referenced some of the same DLLs, it was missing others, including MSVCRT.dll and CRYPT32.dll. I don’t know if this means anything, other than that the APT1 samples may have had higher confidentiality requirements than the relay program, since CRYPT32.dll provides Windows cryptographic functions.
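The counting step can be sketched with `collections.Counter`. This is an illustration under assumptions rather than the code behind the numbers above: the helper names are made up, the sample inputs are invented, and DLL names are upper-cased before comparison because the samples mix cases.

```python
from collections import Counter


def string_histogram(strings_per_file):
    """Lump the strings of every file together and count each occurrence."""
    counts = Counter()
    for strings in strings_per_file:
        counts.update(strings)
    return counts


def dll_names(counts):
    """Return the set of DLL names seen, upper-cased so casing does not matter."""
    return {s.upper() for s in counts if s.upper().endswith(".DLL")}


# Invented sample data standing in for the malware set and the control program.
apt = string_histogram([["KERNEL32.dll", "MSVCRT.dll"], ["KERNEL32.dll", "CRYPT32.dll"]])
relay = string_histogram([["KERNEL32.dll", "USER32.dll"]])

# DLLs referenced by the malware set but absent from the control.
only_in_apt = dll_names(apt) - dll_names(relay)
```

A set difference like `only_in_apt` is the kind of comparison that surfaces DLLs present in one dataset and missing from the other.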

Second, I noticed a lot of Open Source or readily available code. I started wondering about this, but after putting myself in an attacker mindset I came to a conclusion: Open Source and commonly available code is much less traceable than commercially available code. By this, I mean that it can be downloaded and inserted anonymously, with little that traces back to whoever used it. With commercial code, a good forensics person could potentially trace it back to a buyer. If this tendency toward open code is also present in attempts to hack automation sites, it would be a good idea to identify open source code for DNP3, Modbus, and other protocol stacks on the Internet and include it in the searches.
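One simple way to fold such a watchlist into the searches is a case-insensitive substring match against the extracted strings. The sketch below is only an assumption about how that could look; the function name is hypothetical, and the keywords in the usage lines are placeholders rather than a vetted list of protocol stack identifiers.

```python
def match_watchlist(strings, keywords):
    """Return {keyword: [matching strings]} using case-insensitive substring search."""
    hits = {}
    for s in strings:
        lowered = s.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                hits.setdefault(keyword, []).append(s)
    return hits


# Invented example strings; the keywords are placeholders for a real watchlist.
sample = ["modbus read holding registers", "GetProcAddress", "DNP3 outstation"]
found = match_watchlist(sample, ["modbus", "dnp3"])
```

Substring matching is deliberately loose, so it will produce false positives; the point is to flag files worth a closer look.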

Third, I took a look at strings that match an IP address format. Most were external IP addresses, ones that exist on the Internet. However, there was a single IP that was in reserved space. This is interesting, because most automation systems use the reserved IP ranges (or a public range assigned to their corporation that they replicate everywhere. You know who you are). So, I will be adding the internal IP addresses I’ve seen deployed in automation, including multicast, to my searches as well.
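A rough sketch of this check, using Python's standard `ipaddress` module to sort candidates into public, private, multicast, and other reserved space. The regular expression, function name, and labels are assumptions for illustration, and the sample strings are invented.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits; range checking is left to ipaddress.
IPV4_LIKE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def classify_ips(strings):
    """Map each valid IPv4 address found in the strings to a coarse label."""
    result = {}
    for s in strings:
        for candidate in IPV4_LIKE.findall(s):
            try:
                addr = ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # looks like an IP but is not one, e.g. 999.1.1.1
            if addr.is_multicast:
                label = "multicast"
            elif addr.is_loopback:
                label = "loopback"
            elif addr.is_private:
                label = "private"
            elif addr.is_global:
                label = "public"
            else:
                label = "other reserved"
            result[candidate] = label
    return result


# Invented example strings.
labels = classify_ips(["connect 8.8.8.8:53", "host=192.168.1.10", "224.0.0.1", "ver 999.1.1.1"])
```

Note that dotted version numbers such as `1.2.3.4` also match the pattern, so hits still need a human look before they are treated as real addresses.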

All in all, nothing game-changing in this post, as it’s basically static code analysis. But I’m hoping that automated bulk analysis of a large dataset will show patterns, and allow quick, simple, and repeatable re-evaluation of malware.
