Botnet detection at scale: lessons learned from clustering billions of web attacks into botnets


A common problem in the cybersecurity industry is how to detect and track botnets when there are billions of daily attacks. A botnet is a group of internet-connected devices that perform repetitive tasks, such as Distributed Denial of Service (DDoS) attacks. In many cases, these are consumer devices infected with malware that is controlled by an external entity, often without the owner's knowledge. Detecting botnets helps improve website security and find ways to mitigate their impact.

With billions of daily attacks and millions of daily attacking IPs, botnet detection is first of all a scale problem, so choosing the most relevant data is the most challenging task:
- Which IPs to cluster? The most relevant IPs must be found before clustering.
- Which IP pairs to calculate a distance for?
- Which features to calculate and send to the distance function?
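
As an illustration of the last two questions, here is a minimal sketch of a distance function over one possible feature. It assumes each IP is described by the set of sites it attacked and uses Jaccard distance; both the feature and the metric are assumptions chosen for the example, not necessarily the ones used in the talk, and the IPs and site names are made up.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b|: 0.0 for identical sets, 1.0 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical per-IP feature: the set of sites each IP attacked.
targets_by_ip = {
    "10.0.0.1": {"site-a", "site-b", "site-c"},
    "10.0.0.2": {"site-a", "site-b", "site-c"},
    "10.0.0.3": {"site-x"},
}

if __name__ == "__main__":
    d_same = jaccard_distance(targets_by_ip["10.0.0.1"], targets_by_ip["10.0.0.2"])
    d_diff = jaccard_distance(targets_by_ip["10.0.0.1"], targets_by_ip["10.0.0.3"])
    print(d_same, d_diff)
```

IPs that hit the same targets end up close together, while IPs with no targets in common are at the maximum distance. This is why it pays to skip pairs that cannot be close before computing any distances.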

To solve the scale problem, we moved as much of the work as possible to the query engine. We used a SQL-based query engine to find the most relevant IPs and the most relevant IP pairs. Using a query engine saved data movement, time, and cost: it returned only the most relevant data and made the entire process lighter than pure Python code. This allowed us to run many experiments and reach better results quickly. Having the query engine filter the most relevant IP pairs results in a smaller distance matrix and lower memory consumption for the clustering algorithm.
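
The idea of letting the query engine pick the IP pairs can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the production query engine, the `attacks(ip, target)` table and the `min_shared_targets` threshold are hypothetical, and "relevant pair" is taken to mean two IPs that attacked at least that many common targets.

```python
import sqlite3


def candidate_ip_pairs(rows, min_shared_targets=2):
    """Return (ip_a, ip_b, shared_targets) for IP pairs worth a distance calculation.

    rows is an iterable of (ip, target) attack records. The self-join and the
    HAVING filter run inside the SQL engine, so only surviving pairs come back.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE attacks (ip TEXT, target TEXT)")
        conn.executemany("INSERT INTO attacks VALUES (?, ?)", rows)
        query = """
            WITH ip_targets AS (
                SELECT DISTINCT ip, target FROM attacks
            )
            SELECT a.ip, b.ip, COUNT(*) AS shared
            FROM ip_targets AS a
            JOIN ip_targets AS b
              ON a.target = b.target
             AND a.ip < b.ip          -- each unordered pair once, no self-pairs
            GROUP BY a.ip, b.ip
            HAVING COUNT(*) >= ?
            ORDER BY a.ip, b.ip
        """
        return conn.execute(query, (min_shared_targets,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("10.0.0.1", "site-a"), ("10.0.0.1", "site-b"),
        ("10.0.0.2", "site-a"), ("10.0.0.2", "site-b"),
        ("10.0.0.3", "site-a"),
        ("10.0.0.1", "site-a"),  # repeated attack, deduplicated by DISTINCT
    ]
    for pair in candidate_ip_pairs(sample):
        print(pair)
```

Only the returned pairs need an entry in the distance matrix, so the matrix can be kept sparse instead of holding a value for every possible IP pair.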

We will explain our botnet detection flow from scratch. The flow includes data extraction, feature selection, clustering, validation, and fine-tuning. We will also explain our method for measuring the results of unsupervised learning problems using a query engine:
- How we scored botnets and experiments, which allowed us to fine-tune our algorithm.
- How we visualized botnet activity, such as attacks, attack targets, and tools used over time. The visualization helped us learn about botnet activity and about the quality of our clustering.
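
Because clustering is unsupervised, some numeric score is needed to compare one experiment with another. The abstract does not describe the actual scoring, so the following is a purely hypothetical sketch: each cluster is scored by the mean pairwise Jaccard similarity of its members' target sets, and an experiment by the size-weighted mean of its cluster scores. The function names and the plain-Python implementation are assumptions made for illustration; the talk describes doing the measurement with a query engine.

```python
from itertools import combinations


def cohesion(members, targets_by_ip):
    """Mean pairwise Jaccard similarity of target sets among cluster members."""
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        return 0.0  # a single-IP cluster shows no evidence of coordination
    total = 0.0
    for a, b in pairs:
        ta, tb = targets_by_ip[a], targets_by_ip[b]
        union = ta | tb
        total += len(ta & tb) / len(union) if union else 0.0
    return total / len(pairs)


def experiment_score(clusters, targets_by_ip):
    """Size-weighted mean cohesion over all clusters produced by one experiment."""
    total_ips = sum(len(c) for c in clusters)
    if total_ips == 0:
        return 0.0
    weighted = sum(len(c) * cohesion(c, targets_by_ip) for c in clusters)
    return weighted / total_ips


if __name__ == "__main__":
    # Hypothetical data: which sites each IP attacked.
    targets = {
        "ip1": {"site-a", "site-b"},
        "ip2": {"site-a", "site-b"},
        "ip3": {"site-a"},
        "ip4": {"site-x"},
    }
    print(experiment_score([{"ip1", "ip2"}, {"ip3", "ip4"}], targets))
```

A score like this makes it possible to rank parameter settings against each other, although it rewards only one notion of a "good" botnet and would need validation against known cases.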

The problem belongs to the cybersecurity domain, and it is relevant to other domains as well.


Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist in the Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure. Ori also holds an AWS Data Analytics certification. In the Threat Research group, Ori is responsible for the data infrastructure and is involved in analytics, machine learning, and innovation projects.

Open Data Science




