Using .htaccess to Prevent Web Scraping


Web scraping, also known as content scraping, data scraping, web harvesting, or web data extraction, is a way of extracting data from websites, typically with a program that sends a number of HTTP requests while emulating human behaviour, receives the responses, and extracts the required data from them. Modern GUI-based web scrapers like Kimono enable you to perform this task without any programming knowledge.

If you face the problem of others scraping content from one of your websites, there are many ways of detecting the scrapers; Google Webmaster Tools and Feedburner are two such tools.

In this article, we will discuss a few ways to make the lives of these scrapers difficult, using .htaccess files in Apache.

An .htaccess (hypertext access) file is a plain-text configuration file for web servers that overrides the global server settings for the directory in which it is placed. It can also be used creatively to hinder web scraping.

Before we discuss the specific methods, let me clear up one small fact: If something is publicly available, it can be scraped. The steps that we discuss here can only make things more difficult, not impossible. However, what would you do if someone is smart enough to bypass all your filters? We have a solution for that too.

Getting Started with .htaccess

Because supporting .htaccess files requires Apache to check for and read them on every request, this feature is generally turned off by default, and the process for enabling it differs between Ubuntu, OS X, and Windows. Apache will interpret your .htaccess files only after you enable them; otherwise, they are simply ignored.

Next, in most of our use cases we will be using Apache's RewriteEngine, which is part of the mod_rewrite module. If necessary, you can check out a detailed guide on how to set up mod_rewrite for Apache or a general guide on .htaccess.

Once you have completed these, you are ready to proceed with the solutions discussed here on dealing with content scrapers. If you haven’t completed either of these steps successfully, Apache will ignore your .htaccess files or raise an error when you restart it after making changes.
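As a rough illustration (assuming a typical layout where your site lives under /var/www/html; adjust the path to match your setup), enabling .htaccess support comes down to allowing overrides for that directory in the main server configuration, not in the .htaccess file itself:

# In the main Apache configuration (e.g. inside your <VirtualHost> block), assuming
# your document root is /var/www/html
<Directory /var/www/html>
    # Let .htaccess files in this directory tree override server settings
    AllowOverride All
</Directory>

With that in place and mod_rewrite enabled, the directives in the rest of this article can live in an .htaccess file at your document root.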

Prevent Hotlinking

When someone scrapes your content, all of your inline HTML comes along unchanged. This means that the links to the images that were part of your content (and are most probably hosted on your domain) stay the same. If the scraper publishes the content on a different website, the images still load from the original source. This is called hotlinking. Hotlinking costs you bandwidth, because every time someone opens the scraper's site, your images are downloaded from your server.

You can prevent hotlinking by adding the following lines to your .htaccess file.

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$

# domains that can link to your content (images here)
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?mysite\.com [NC]

# show no image when hotlinked
RewriteRule \.(jpg|png|gif)$ - [NC,F,L]

# Or show an alternate image (excluding the alternate image itself, so this rule doesn't loop)
# RewriteCond %{REQUEST_URI} !forbidden_image\.jpg$
# RewriteRule \.(jpg|png|gif)$ http://mysite.com/forbidden_image.jpg [NC,R,L]

Some notes about the code:

  • Switching on RewriteEngine gives us the ability to redirect the user’s request.
  • RewriteCond specifies which requests should be redirected. %{HTTP_REFERER} is the variable that contains the URL of the page from which the request was made. The first condition, !^$, ensures the rule applies only when a referer is actually present, so requests with a blank referer are left alone.
  • Then we match it against our own domain, mysite.com. We add (www\.)? so that requests from both mysite.com and www.mysite.com are allowed. Similarly, http(s)? covers both http and https.
  • Next, we check whether a jpg, png, or gif file was requested, and either show a 403 error or redirect the request to an alternate image.
  • NC makes the match case-insensitive, F returns a 403 Forbidden error, R redirects the request, and L stops processing further rewrite rules.
  • Note that you should apply only one of the rules above (either the 403 error or the alternate image), because once a rule with the L flag matches, Apache does not apply any further rules. In the code example above, the alternate image method is commented out.

How Can Web Scrapers Bypass This?

One way for a web scraper to bypass this hurdle is to download each image as it is encountered in the HTML, store it on its own system, and rewrite the image links in the saved content so they point to those local copies.

Allow or Block Requests From Specific IP Addresses

If you manage to determine where the web scraper's requests originate (the giveaway is usually an unnaturally high number of requests from the same IP address), you can block requests from that IP address.

Order Deny,Allow
Deny from xxx.xxx.xxx.xxx
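
As an aside, the Order and Deny directives above come from Apache 2.2; on Apache 2.4 they only work through the mod_access_compat module. A roughly equivalent rule in the newer Require syntax (assuming Apache 2.4 or later) looks like this:

# Apache 2.4+ equivalent: allow everyone except one address
<RequireAll>
    Require all granted
    Require not ip xxx.xxx.xxx.xxx
</RequireAll>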

In the code above (and in other examples in this article) you would replace xxx.xxx.xxx.xxx with the IP address you want to block. If you are really paranoid about security, you could deny requests from all IP addresses and selectively allow from a whitelist of IP addresses:

Order Deny,Allow
Deny from all
# IP address whitelist
Allow from xx.xxx.xx.xx
Allow from xx.xxx.xx.xx

One use case for this technique (not related to web scraping) is blocking access to WordPress's wp-admin directory. In that case, you would allow requests from your own IP address only, greatly reducing the chances of someone hacking your site via wp-admin.

How Can Web Scrapers Bypass This?

If a web scraper has access to proxies, it can distribute its requests across a list of IP addresses so that no single address shows abnormal activity.

To explain: let's say someone is scraping your site from the IP address 1.1.1.1, so you block 1.1.1.1 using .htaccess. If the scraper has access to a proxy server at 2.2.2.2, it can route its requests through 2.2.2.2, and to your server the requests appear to come from 2.2.2.2. In spite of blocking 1.1.1.1, the scraper is still able to access the resource.

Thus, if the scraper has access to thousands of such proxies, it can become practically undetectable by sending only a small number of requests through each one.

Redirect Requests From an IP Address

Not only can you block any IP address, you can also redirect its requests to a different page:

RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteRule .* http://mysite.com [R,L]

If you redirect those requests to a single static page, chances are the scraper will figure this out. However, you can go one step further and do something a bit more innovative. For that, you need to understand how your content is scraped.

Web scraping is a systematic procedure: it involves studying URL patterns and sending requests to every possible page on a website. If you are a WordPress user, for instance, the URL pattern is http://mysite.com/?p=[page_no], where the scraper increments page_no from 1 up to some large number.

What you could do is create a page especially for redirection that redirects the request to one out of a number of predefined pages:

RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteRule .* http://mysite.com/redirection_page [R,L]

In the code above, “redirection_page” is a page whose job is to perform one of a number of predefined onward redirects. Because a running web scraping program keeps getting bounced to different pages, it becomes difficult for it to detect that you have identified it.
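
One caveat, offered as a sketch rather than a definitive rule: the RewriteRule above also matches the scraper's request for the decoy page itself, which would just redirect it straight back to the same URL. Assuming the decoy page lives at /redirection_page, a small guard condition keeps it reachable:

RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
# don't rewrite requests for the decoy page itself (assumed to live at /redirection_page)
RewriteCond %{REQUEST_URI} !^/redirection_page
RewriteRule .* http://mysite.com/redirection_page [R,L]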

Alternatively, “redirection_page” can redirect to a third page, “redirection_page_1”, which in turn redirects back to “redirection_page”. This creates a redirect loop, and the request bounces between the two pages indefinitely.

How Can Web Scrapers Bypass This?

A web scraper can check whether its requests are being redirected: a redirect returns a 301 or 302 HTTP status code, whereas a normal response returns the usual 200.

Matt Cutts to the Rescue

Matt Cutts is the head of the web spam team at Google. Part of his job is to be on the constant lookout for scraper sites. If he doesn't like your website, he can make it vanish from Google's search results. The recent Panda and Penguin updates to Google's search algorithm have affected a huge number of sites, including many scraper sites.

A webmaster can report scraper sites to Google using this form, providing the source of the content. If you produce original content, you will certainly be on the radar of web scrapers; but if they re-publish your content, Google can recognize the copies and push them out of its search results.

Frequently Asked Questions (FAQs) on Preventing Web Scraping

What is Web Scraping and Why Should I Be Concerned About It?

Web scraping is a method used to extract large amounts of data from websites. While it can be done manually, it is usually automated for efficiency. The data can be used for various purposes, such as data analysis, data mining, and machine learning. However, it can also be used maliciously to steal sensitive information, overload servers, or manipulate data. Therefore, it’s crucial to protect your website from potential web scraping.

How Can I Identify if My Website is Being Scraped?

There are several signs that your website may be getting scraped, including a sudden spike in traffic, an increase in failed login attempts, or a slowdown in website performance. You can also check your server logs for suspicious activity, such as repeated requests from the same IP address or unusual user-agent strings.

How Does .htaccess File Help in Preventing Web Scraping?

The .htaccess file is a configuration file used by Apache-based web servers. You can use it to restrict access to your website by IP address or user-agent string, block specific HTTP methods, or redirect traffic. This can help prevent web scrapers from accessing your site or limit their ability to extract data.
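
As a minimal sketch of the user-agent idea (the tool names below are purely illustrative; substitute whatever shows up in your own logs):

RewriteEngine on
# Refuse requests whose user-agent matches common scraping tools (illustrative list)
RewriteCond %{HTTP_USER_AGENT} (curl|wget|python-requests|scrapy) [NC]
RewriteRule .* - [F,L]

Bear in mind that the user-agent header is trivial to spoof, so this only deters the least sophisticated scrapers.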

Can I Use CAPTCHA to Prevent Web Scraping?

Yes, CAPTCHA can be an effective tool in preventing automated web scraping. It requires users to complete a task that is easy for humans but difficult for bots, such as identifying objects in an image or solving a simple math problem. However, it can also be a barrier to legitimate users and should be used judiciously.

What is the Role of Robots.txt in Preventing Web Scraping?

The robots.txt file is a simple text file that webmasters use to instruct web robots (typically search engine robots) how to crawl pages on their website. While it can’t prevent determined scrapers, it can discourage well-behaved bots from accessing certain parts of your site.

How Can I Use JavaScript to Prevent Web Scraping?

JavaScript can be used to obfuscate your website’s HTML, making it harder for scrapers to extract data. You can also use it to dynamically load content, which can confuse scrapers that are not equipped to handle JavaScript.

Can I Use Rate Limiting to Prevent Web Scraping?

Yes, rate limiting is a technique that limits the number of requests a user can make to your website within a certain time period. This can help prevent scrapers from overloading your server and slow down their data extraction efforts.
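
Note that Apache alone can't count requests per client from an .htaccess file; request-rate limiting is usually handled by a module such as mod_evasive (configured at server level) or by a proxy or CDN sitting in front of Apache. What you can do from .htaccess, assuming Apache 2.4+ with mod_ratelimit loaded, is throttle response bandwidth, which at least slows a scraper down:

<IfModule mod_ratelimit.c>
    # Limit responses from this directory to roughly 400 KiB/s per connection
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 400
</IfModule>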

What is the Impact of Web Scraping on SEO?

Web scraping can have a negative impact on SEO if your content is copied and published elsewhere, leading to duplicate content issues. It can also slow down your site, leading to a poor user experience and lower search engine rankings.

Can I Take Legal Action Against Web Scrapers?

Yes, in many jurisdictions, web scraping can be considered illegal, especially if it involves the extraction of copyrighted content or personal data. However, legal action should be considered a last resort, as it can be costly and time-consuming.

What are Some Best Practices for Preventing Web Scraping?

Some best practices include regularly monitoring your website for signs of scraping, using a combination of prevention techniques (such as .htaccess rules, CAPTCHA, and rate limiting), keeping your website’s software up-to-date, and educating yourself about the latest scraping techniques and countermeasures.
