Analysis of Android malware families using available source code.
android-malware-source-code-analysis
Since the emergence of malware in the 1970s, these malicious programs have steadily increased in number and sophistication. The increasing profits generated by the use of malware have led to a growing demand, turning malware into a commodity of the underground economy. In this thesis, we analyze the evolution of Android malware from 2012 to date from a software engineering perspective. We analyze the source code of 97 samples from 83 unique families and obtain measures of their size, code quality, and estimates of the development costs (effort, time, and number of people). Our results suggest a linear yearly increase in aspects such as the number of malware samples and their size, as well as a rapid increase in development cost. In terms of complexity and maintainability, we observe low scores compared to malware on other operating systems. Overall, our results are not conclusive enough to support claims that Android malware is growing in complexity or that its production is progressively becoming an industry. This could be because the Android malware industry is still young, as the operating system itself was launched just over ten years ago, or because the computer industry has changed little since the release of Android compared to the progress made in previous decades.
Every day, the mobile landscape grows in size. Recently, Google announced that Android had surpassed three billion users, securing the largest mobile market share for another year. The number of mobile users increases every year, and malware follows that trend. But malware is not only growing in the mobile landscape. As AV-TEST's Malware Statistics clearly show, 2021 saw an extreme increase in newly discovered malware, exceeding 150,000,000 new samples that year. Android is not far behind, as 3,000,000 new samples were discovered for this operating system.
A 2021 report by Malwarebytes confirmed that malware as a business is a growing trend, taking up more real estate in the cyber-threat landscape and making malware development more profitable. On Android, most malware developers fund their operations by generating ad revenue, while others deploy ransomware or large botnets for profit. The report also notes that stalkerware and spyware-type applications experienced a detection increase of 1,677% in 2021, which is consistent with the types of malware present in our dataset. Some malware developers even used the COVID-19 pandemic as a cover to deploy malware and infect unsuspecting victims; two samples in our dataset are ransomware disguised behind this facade.
As the number and profitability of Android malware increase, so do the sophistication and impact of attacks. In this thesis, we present a study of the evolution of Android malware from a software engineering perspective. Our analysis is based on a dataset collected by the authors over several months and composed of the source code of 97 Android malware samples ranging from 2012 to 2022. Our dataset includes, among others, RATs, Trojan-Bankers, Keyloggers, Ransomware, Spyware, and Lockers. To our knowledge, this is the largest Android malware source code dataset presented in the literature. We perform several analyses on this dataset. First, we review the most prevalent malware types and the most common permissions and capabilities used by the collected samples, as well as their antivirus detection rate. We also measure the evolution of malware development as a function of size, cost, and quality.
Size dimensions are measured with several metrics, mainly the number of source files, the number of source lines of code (SLOCs), the number of functions, and the number of different programming languages used. Development cost is calculated with three estimates: effort, development time, and team size. Finally, code quality measures are computed using the cyclomatic complexity, the maintainability index, and the density of comments present in the code.
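The text above does not say which cost model produces the three estimates. As a minimal sketch, assuming the basic COCOMO model in its "organic" mode (a common choice for SLOC-based estimates; its use here, the function name `basic_cocomo`, and the example SLOC count are all assumptions, not taken from the thesis), effort, development time, and team size can be derived from a SLOC count as follows:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    effort_person_months: float  # total effort
    time_months: float           # calendar development time
    team_size: float             # average number of people


def basic_cocomo(sloc: int, a: float = 2.4, b: float = 1.05,
                 c: float = 2.5, d: float = 0.38) -> CostEstimate:
    """Basic COCOMO estimate; the defaults are the 'organic' mode coefficients.

    Assumption: the thesis may use a different model or different coefficients.
    """
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    kloc = sloc / 1000.0
    effort = a * kloc ** b   # person-months
    time = c * effort ** d   # months
    return CostEstimate(effort, time, effort / time)


# Hypothetical sample with 10,000 source lines of code.
estimate = basic_cocomo(10_000)
print(f"effort: {estimate.effort_person_months:.1f} person-months")
print(f"time:   {estimate.time_months:.1f} months")
print(f"team:   {estimate.team_size:.1f} people")
```

Because all three values depend only on SLOC in this model, any growth in code size translates directly into growth in the estimated cost.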
We then compare our results with those of two similar earlier works that performed the same measurements, but without specializing in a single platform as we do with Android. This thesis builds on those works; we wanted to know whether their results also apply to Android malware.
To our knowledge, our work is the first to analyze the code evolution of Android malware from this perspective. We also believe that our dataset of Android malware source code is the largest analyzed in the literature.
The main findings of our work include:
- In the Android malware landscape there is a strong tendency towards spying malware, such as Spyware, RATs, Trojan-Spy, Keyloggers, etc.
- The number of malware samples increases at a rapid rate every year.
- Antivirus detection rates are severely skewed towards Lockers, Trojan-Bankers, Ransomware, Rootkits, Keyloggers, and RATs, as the rest of the malware types hardly raised a single detection.
- There is a high annual increase in the number of source code files, SLOCs, functions and programming languages used.
- There is a big difference in the number of files, SLOCs, functions and programming languages between malware types, as Backdoors, Trojan-Bankers and RATs often outnumber the other types in some of these categories.
- Android malware samples have high estimated effort, development time, and team size, all of which increase steadily every year.
- Android malware samples have low complexity, maintainability index, and comment ratio values, which slowly decrease every year.
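The quality measures named in the last finding can be sketched as follows. This is a hedged illustration, not the thesis's tooling: it assumes McCabe's graph formula for cyclomatic complexity, the classic three-term maintainability index (without the optional comment term), and a comment ratio defined as comment lines per source line. The source does not state which variants were actually used, and all function names are hypothetical.

```python
import math


def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """McCabe's measure for a control-flow graph: M = E - N + 2P."""
    return edges - nodes + 2 * components


def maintainability_index(halstead_volume: float, complexity: float,
                          sloc: int) -> float:
    """Classic maintainability index (assumed variant, no comment term).

    MI = 171 - 5.2 * ln(V) - 0.23 * G - 16.2 * ln(SLOC)
    Higher values indicate code that is easier to maintain.
    """
    if halstead_volume <= 0 or sloc <= 0:
        raise ValueError("halstead_volume and sloc must be positive")
    return (171.0
            - 5.2 * math.log(halstead_volume)
            - 0.23 * complexity
            - 16.2 * math.log(sloc))


def comment_ratio(comment_lines: int, sloc: int) -> float:
    """Comment lines per source line of code (assumed definition)."""
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    return comment_lines / sloc
```

Under these definitions, larger and more complex code lowers the maintainability index, so a sample can score low on maintainability even when its cyclomatic complexity is modest.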
In this thesis, we have conducted a study on the evolution of Android malware source code over its entire lifetime, which for now spans a period of 10 years. We have collected and analyzed 97 samples, which is the largest dataset of Android malware source code to our knowledge. We have quantified the code size and estimated its cost and quality using well-known software metrics. The results extracted from this work indicate an increase in size and cost, but a small decrease in complexity and maintainability. Therefore, we conclude that the results are not conclusive enough to support the claim of a developing malware production industry.
The resulting research paper is available in this repository at docs/Android_Malware_Source_Code_Analysis.pdf.
All malware samples used for this analysis are stored in this repo.
# Important project components
.
├── android-os-malware-samples.csv - # Contains all the info obtained from the samples
├── docs
│   ├── Android_Malware_Source_Code_Analysis.pdf - # Research Paper
│   └── Android_Malware_Source_Code_Analysis_slides.pdf - # Presentation slides
└── latex
    ├── Android_Malware_Source_Code_Analysis.tex - # LaTeX doc used to write the paper
    ├── data - # Folder containing data used to generate the graphs
    └── IEEEtran.cls - # IEEE template class document