Using Machine Learning to Detect Malicious URLs

This is a write-up of an experiment employing a machine learning model to identify malicious URLs. The author provides a link to the code for you to try yourself.



By Faizan Ahmad, CEO Fsecurify.

With the growth of machine learning over the past few years, many tasks are now handled by machine learning algorithms. Comparatively little of that work has been applied to security, though, so I thought I would present some at Fsecurify.

A few days ago, I wondered whether a machine learning algorithm could tell a malicious URL from a non-malicious one. There has been some research on the topic, so I decided to give it a go and implement something from scratch. So let's start.

Gathering Data

The first task was gathering data. I did some surfing and found some websites that publish lists of malicious links. I set up a little crawler and collected a lot of malicious links from those sites. The next task was finding clean URLs. Fortunately, I did not have to crawl any, because a data set was already available. Don’t worry that I am not naming the data sources here. You’ll get the data at the end of this post.

In total, I gathered around 400,000 URLs, of which around 80,000 (about 20%) were malicious and the rest were clean. There we have it, our data set. Let's move on.

Analysis

We’ll be using Logistic Regression since it is fast. The first part was tokenizing the URLs. I wrote my own tokenizer function for this, since URLs are not like ordinary document text: their parts are separated by characters such as slashes, dots, and dashes rather than by spaces.
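A tokenizer along these lines can be sketched as follows. This is a minimal illustration, not the exact function from the repository: the name `tokenize_url`, the set of delimiters, and the list of ignored tokens are all assumptions.

```python
import re

# Tokens assumed to carry little signal because they appear in most URLs.
IGNORED_TOKENS = {"com", "www", "http", "https"}

def tokenize_url(url):
    """Split a URL into lowercase tokens on common URL delimiters."""
    # Split on any run of '/', '-', '.', '?', '=', '&', '_' or ':'.
    tokens = re.split(r"[/\-.?=&_:]+", url.lower())
    # Drop empty strings and the ignored tokens.
    return [t for t in tokens if t and t not in IGNORED_TOKENS]

print(tokenize_url("www.radsport-voggel.de/wp-admin/includes/log.exe"))
# ['radsport', 'voggel', 'de', 'wp', 'admin', 'includes', 'log', 'exe']
```

Splitting on delimiters keeps file extensions such as "exe" as separate tokens, which lets the model weigh them on their own.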


The next step is to load the data and store it into a list.
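As a rough sketch of this step, assume the data is a CSV file with a header row and two columns named `url` and `label`. That layout is an assumption for illustration, and the actual file in the repository may differ. The example reads from an in-memory string so it runs without any file on disk.

```python
import csv
import io

def load_urls(handle):
    """Read URLs and labels from a CSV handle with 'url' and 'label' columns."""
    urls, labels = [], []
    for row in csv.DictReader(handle):
        urls.append(row["url"])
        labels.append(row["label"])
    return urls, labels

# Two illustrative rows standing in for the real data file (assumed format).
sample = io.StringIO(
    "url,label\n"
    "wikipedia.com,good\n"
    "ahrenhei.without-transfer.ru/nethost.exe,bad\n"
)
urls, labels = load_urls(sample)
print(urls, labels)
```

With a real file, the same function could be called as `load_urls(open(path, newline=""))`, where `path` is wherever you saved the data.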


Now that we have the data in our list, we have to vectorize our URLs. I used tf-idf scores instead of plain bag-of-words counts, since some words in URLs are more important than others, e.g. ‘virus’, ‘.exe’, ‘.dat’. Let's convert the URLs into vector form.
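One way to do this is with scikit-learn's `TfidfVectorizer`, passing in a custom tokenizer. The sketch below assumes scikit-learn is installed and uses a hypothetical `tokenize_url` helper; the four URLs are only there to make the example runnable.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize_url(url):
    """Hypothetical tokenizer: split on URL delimiters, drop very common tokens."""
    tokens = re.split(r"[/\-.?=&_:]+", url.lower())
    return [t for t in tokens if t and t not in {"com", "www", "http", "https"}]

urls = [
    "wikipedia.com",
    "google.com/search=faizanahmad",
    "ahrenhei.without-transfer.ru/nethost.exe",
    "www.itidea.it/centroesteticosothys/img/_notes/gum.exe",
]

# token_pattern=None tells scikit-learn that only the custom tokenizer is used.
vectorizer = TfidfVectorizer(tokenizer=tokenize_url, token_pattern=None)
X = vectorizer.fit_transform(urls)  # sparse matrix: one row per URL

vocab = vectorizer.vocabulary_      # token -> column index
print(X.shape)
print(X[2, vocab["nethost"]], X[2, vocab["exe"]])
```

Within the third URL, "nethost" gets a larger weight than "exe" because it occurs in fewer URLs of this tiny sample. That is the tf-idf effect: tokens that are rare across the collection count for more than tokens that show up everywhere.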


We have the vectors. Let's now split them into training and test sets and go right ahead with fitting logistic regression on the training set.
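The split-and-fit step might look like the sketch below, again assuming scikit-learn. The URLs are made-up toy examples on the reserved example.com domain, and the 80/20 split, `random_state`, and default `LogisticRegression` settings are assumptions, so the accuracy printed here says nothing about the real data set.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def tokenize_url(url):
    """Hypothetical tokenizer: split on URL delimiters, drop very common tokens."""
    tokens = re.split(r"[/\-.?=&_:]+", url.lower())
    return [t for t in tokens if t and t not in {"com", "www", "http", "https"}]

# Made-up toy data, for illustration only.
good_paths = ["about", "docs/index.html", "blog/post-1",
              "contact", "search=cats", "news/today"]
bad_paths = ["wp-admin/includes/log.exe", "getpassword.php", "img/_notes/gum.exe",
             "nethost.exe", "login/verify.php", "update/setup.exe"]
urls = [f"example.com/{p}" for p in good_paths + bad_paths]
labels = ["good"] * len(good_paths) + ["bad"] * len(bad_paths)

# Vectorize first, then split, following the order described in the text.
X = TfidfVectorizer(tokenizer=tokenize_url, token_pattern=None).fit_transform(urls)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy is measured only on URLs the model did not see during fitting.
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

A stricter variant would fit the vectorizer on the training URLs only, so that no information from the test set reaches the tf-idf weights.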


That’s it. See, it's that simple yet so effective. We get an accuracy of 98% on the test data. For context, about 80% of the data set is clean, so a model that always answered "clean" would already score about 80%; 98% is well above that baseline. Want to test some links to see if the model gives good predictions? Sure. Let's do it.
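Scoring new links means running them through the same tokenizer and the same fitted tf-idf vocabulary before calling the classifier. One way to keep those steps together is a scikit-learn pipeline, sketched below. The training URLs are made-up toy examples and `tokenize_url` is a hypothetical helper, so the printed labels only demonstrate the mechanics.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def tokenize_url(url):
    """Hypothetical tokenizer: split on URL delimiters, drop very common tokens."""
    tokens = re.split(r"[/\-.?=&_:]+", url.lower())
    return [t for t in tokens if t and t not in {"com", "www", "http", "https"}]

# Made-up toy training data, for illustration only.
train_urls = [
    "example.com/about", "example.com/docs/index.html",
    "example.com/blog/post-1", "example.com/contact",
    "example.com/wp-admin/includes/log.exe", "example.com/getpassword.php",
    "example.com/img/_notes/gum.exe", "example.com/nethost.exe",
]
train_labels = ["good"] * 4 + ["bad"] * 4

# The pipeline applies the fitted vectorizer to new URLs before classifying.
pipeline = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize_url, token_pattern=None),
    LogisticRegression(),
)
pipeline.fit(train_urls, train_labels)

new_urls = ["example.org/blog/about", "example.org/files/update.exe"]
predictions = [str(p) for p in pipeline.predict(new_urls)]
for url, label in zip(new_urls, predictions):
    print(url, "->", label)
```

Tokens that never appeared during training are simply ignored at prediction time, which is one reason a large and varied training set matters for this approach.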


The results come out amazingly well:

  • wikipedia.com (Good Url)
  • google.com/search=faizanahmad (Good Url)
  • pakistanifacebookforever.com/getpassword.php/ (Bad Url)
  • www.radsport-voggel.de/wp-admin/includes/log.exe (Bad Url)
  • ahrenhei.without-transfer.ru/nethost.exe (Bad Url)
  • www.itidea.it/centroesteticosothys/img/_notes/gum.exe (Bad Url)

This is what a human would have predicted. No?

The data and code are available on GitHub.

Bio: Faizan Ahmad is a Fulbright computer science undergrad and CEO of Fsecurify.
