Quickest ways to remove spam from Google Analytics

Just like you, I absolutely hate spam showing up inside Google Analytics. It makes it harder to focus on the data that’s most important to you (and it’s just plain annoying).

Google Trends shows big spikes in people searching for ‘google analytics spam’ in June 2015 and December 2016. Here’s the overall trend:

Google Trends

What is referral spam?

Referral spam is fake traffic (created by bots and spiders) that shows up in your referrals reports. These sessions appear to come from fake websites and can skew the data in your reports.

What’s the solution?

How can I prevent spam showing up inside Google Analytics?

I’m going to take you through my 15 minute guide to dealing with referral spam inside Google Analytics. So if you’re seeing something like this inside Google Analytics…

google-analytics-spam-001.gif

Then you’re in the right place!

And check out my tutorial on cleaning up your referral and data spam. It covers the steps we're about to walk through...

Now it's time to jump in...

15 Minute Guide to Google Analytics Spam

Step 1: Only include traffic to your domains

Start by writing down all the places where you expect your Google Analytics tracking code to be found. This will of course include your website’s domain, but might include additional domains too.

For example, at Loves Data I’m currently using the following domains:

  • www.lovesdata.com – this is our main website
  • learn.lovesdata.com – this is where our online training platform lives
  • conference.lovesdata.com – this is a microsite for our annual summit

In a moment we’re going to configure a filter to clean up our spam data, so make a note of your root domain. In my example, I’ll just write down lovesdata.com as this will cover both of the sub-domains (learn.lovesdata.com and conference.lovesdata.com) and my main website (www.lovesdata.com).

I also have my Google Analytics tracking code on the following third party websites:

  • lovesdata.leadpages.net – we use Leadpages to create special landing pages for our advertising campaigns and my tracking code is installed on these pages, so I want to keep this data available within Google Analytics
  • youtube.com – my tracking code is installed on my YouTube Channel, so I’ll add this to my list
  • eventbrite.com – we use Eventbrite for selling event tickets and since I’ve installed the tracking code on my events I also want to list this domain
  • teachable.com – finally we use Teachable to deliver our online training, again I’ve installed Google Analytics so I want to list this domain too

Now you should have a list of all the places where you expect (and actually want) your Google Analytics tracking code to be installed and collecting data.

The next step is to turn this list into a Regular Expression. For my domains, this will be:

lovesdata\.com|leadpages\.net|youtube\.com|eventbrite\.com|teachable\.com

If you’re not familiar with Regular Expressions, that’s totally fine. All you need to do is type in each domain, then put a backslash (\) before any full stops and separate each domain with a pipe (|). When you’re done there shouldn’t be any spaces between (or in) the domains.
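If you’d rather not type the expression by hand, the steps above can be sketched in a few lines of Python. This is only an illustration: the function name build_hostname_pattern is made up for this example and isn’t part of Google Analytics.

```python
def build_hostname_pattern(domains):
    """Join domains into one pattern: escape full stops, separate with pipes."""
    # Trim stray spaces and skip empty entries so the result has no spaces.
    cleaned = [d.strip().lower() for d in domains if d.strip()]
    return "|".join(d.replace(".", "\\.") for d in cleaned)


# Domains from the example above; swap in your own list.
DOMAINS = [
    "lovesdata.com",
    "leadpages.net",
    "youtube.com",
    "eventbrite.com",
    "teachable.com",
]

PATTERN = build_hostname_pattern(DOMAINS)
print(PATTERN)
```

The printed pattern can then be pasted into the filter we’re about to create.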

Now it’s time to make sure only traffic from these domains is included in your Google Analytics reports. To do this you’ll need to navigate to ‘Admin’, then ‘Filters’ within your reporting view. Then click ‘Add Filter’, select ‘Custom’ as the filter type and select ‘Include’. Choose ‘Hostname’ as the filter field and enter your Regular Expression as the filter pattern.

Here’s what the filter looks like for my domains:

Now click ‘Save’. This filter will help to ensure that data is only reported from the websites you’ve specified. It’s going to help remove the majority of the spam coming into Google Analytics.

(You also have the option of including traffic that views your website using Google Translate, Google Cache and Google Web Light by adding googleusercontent.com and googleweblight.com to your filter.)
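Before relying on an include filter like the one above, it can help to check the expression against a few hostnames. Here’s a rough sketch in Python. It assumes the filter matches anywhere in the hostname and ignores case, which is how I understand the default filter behaviour; the sample hostnames (including spam-host.example) are made up for illustration.

```python
import re

# The include pattern from Step 1.
INCLUDE_PATTERN = r"lovesdata\.com|leadpages\.net|youtube\.com|eventbrite\.com|teachable\.com"


def is_kept(hostname, pattern=INCLUDE_PATTERN):
    """Return True if the hostname would pass the include filter."""
    # Assumption: partial, case-insensitive matching, so re.search + IGNORECASE.
    return re.search(pattern, hostname, re.IGNORECASE) is not None


# Hypothetical hostnames: three real ones from the list, two that should be dropped.
SAMPLE_HOSTNAMES = [
    "www.lovesdata.com",
    "learn.lovesdata.com",
    "lovesdata.leadpages.net",
    "(not set)",
    "spam-host.example",
]

for hostname in SAMPLE_HOSTNAMES:
    verdict = "keep" if is_kept(hostname) else "drop"
    print(f"{hostname:30} {verdict}")
```

Because the match is partial, a hostname that merely contains one of your domains (for example lovesdata.com.spam-host.example) would also be kept, so it’s worth keeping an eye on the Hostname report after the filter is live.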

Step 2: Review your inbound traffic sources

Some Google Analytics spam comes from crawlers that have identified your domain name, which means that the filter we created in step 1 won’t remove all of your fake traffic.

You can take different approaches to dealing with this type of spam.

The first option is to exclude all the currently known spam sources using custom filters. It’s going to take some time to do this, but I encourage you to do it. Here’s a handy list of spam domains that you can exclude from your reports. You’ll need to create multiple filters (which we’ll look at in a moment).

Open source list of spam domains

You also have the option of reviewing your Acquisition reports to proactively identify traffic sources that are spam. I like this approach because it puts you more in touch with your data and ensures it’s clean. The downside is that the spam will already be in your reports, so you’re cleaning it up after the fact.

To do this, navigate into the ‘Source/Medium’ report under ‘Acquisition’ and look for unusual traffic sources. In some cases you might need to dig a little deeper to check that they are spam. Here’s some spam showing up in the report:

To remove this spam you’ll need to create custom filters to exclude these traffic sources. Here’s an example of the Regular Expression used in the filter:

1\-99seo\.com|3\-letter-domains\.net|9i543\.com

Here’s what the filter looks like inside Google Analytics:

Now if you’re going to be excluding more than a handful of spam sources you’ll need to configure multiple filters. This is because each filter pattern has a character limit (maximum 255).
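Splitting a long list of spam domains by hand is tedious, so here’s a minimal sketch of how it could be automated in Python. It assumes the 255 character limit mentioned above; the function name split_into_patterns and the placeholder domains are invented for this example.

```python
MAX_PATTERN_LENGTH = 255  # character limit mentioned above


def split_into_patterns(domains, max_length=MAX_PATTERN_LENGTH):
    """Group escaped domains into pipe-separated patterns within max_length."""
    patterns = []
    current = ""
    for domain in domains:
        escaped = domain.strip().lower().replace(".", "\\.")
        if not escaped:
            continue
        if len(escaped) > max_length:
            raise ValueError(f"Domain too long for one filter: {domain}")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) <= max_length:
            current = candidate
        else:
            # The current pattern is full: store it and start a new one.
            patterns.append(current)
            current = escaped
    if current:
        patterns.append(current)
    return patterns


# Placeholder names standing in for a long spam list.
spam_domains = [f"spam-source-{n}.example" for n in range(1, 41)]
for number, pattern in enumerate(split_into_patterns(spam_domains), start=1):
    print(f"Filter {number}: {len(pattern)} characters")
```

Each pattern in the result goes into its own exclude filter.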

Step 3: Remove fake language codes

Spam can also show up in the Language report (which is super annoying). Here’s an example:

google-analytics-spam-008.gif

You can configure an ‘exclude filter’ to remove these fake languages from showing up. To do this, you’ll need to create a custom filter that excludes languages that match the following Regular Expression (via Mike Sullivan):

.{13,}|\.

This excludes any language that contains 13 or more characters (most of the defaults contain around 5 characters, for example en-us). It also excludes any languages that contain a full stop (the defaults don’t contain these).
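To see how the expression behaves, here’s a small Python check. It’s only a sketch: the sample values are made up, and Python’s regular expression engine is standing in for the one Google Analytics uses.

```python
import re

# The language pattern from above: 13 or more characters, or any full stop.
LANGUAGE_SPAM_PATTERN = re.compile(r".{13,}|\.")


def is_spam_language(value):
    """Return True if the language value would be excluded by the filter."""
    return LANGUAGE_SPAM_PATTERN.search(value) is not None


# Hypothetical values: three normal language codes and two spammy ones.
samples = [
    "en-us",
    "pt-br",
    "zh-cn",
    "example.com is the best!",
    "Vote for something here",
]

for value in samples:
    verdict = "exclude" if is_spam_language(value) else "keep"
    print(f"{value!r}: {verdict}")
```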

Here’s what your filter should look like:

Step 4: Enable bot filtering

The final step is to enable bot filtering for your reporting view. This configuration option means that Google Analytics will exclude known bot and spider traffic from your reports. You can enable this by navigating to ‘Admin’, then selecting ‘View Settings’ and ensuring that the ‘Bot Filtering’ option is checked. Here’s what you should see:

This setting certainly helps by removing some fake traffic, but you’ll still need to complete steps 1, 2 and 3 for the cleanest set of data possible.

Conclusion

Whether it’s fighting spam in analytics or email, we know Google is working on a more permanent solution. Until then, my 15 minute guide will help you collect cleaner data in your reports.

Here are some additional resources that might be of interest: