Crawlers Limiter Middleware
Status
In production.
Abstract
This little middleware has been written to prevent system overload caused by aggressive crawlers on huge django websites. In conjunction with memcached cache, this system is really fast and could be extended for other purposes. Regular hosts like bots are configured in a white-list.
Installation
- get PimenTech libcommonDjango :
svn checkout http://svn.pimentech.org/pimentech/libcommonDjango
- install it with "make install"
Activation and configuration
In your settings.py :
- Activate memcached cache backend. Remember that each django hit is logged and other cache systems would be too costly.
- ::
- CACHE_BACKEND = 'memcached://127.0.0.1:11211/'
- the best is to put the middleware in first position of MIDDLEWARE_CLASSES
MIDDLEWARE_CLASSES = (
'django_pimentech.middleware.crawler_limiter.CrawlerLimiterMiddleware',
'django.middleware.common.CommonMiddleware',
...other middlewares...
)
- configure the folowing parameters :
import re
CRAWL_WHITE_LIST = re.compile("127\.0\.0\.1|192\.168\.1\.*|66\.249\.65\.*|66\.249\.66\.*|74\.6\.8\.*")
CRAWL_CACHE_DURATION = 2
CRAWL_SITE_COUNT = 50
CRAWL_CACHE_BANNED_DURATION = 60 * 5
- CRAWL_WHITE_LIST : these ips are not concerned with the crawler limiter (66.249.65.*, 66.249.66.* : Google ; 74.6.8.* : Yahoo)
- CRAWL_CACHE_DURATION : duration in seconds of cache per ip. For each django connection, the client ip is stored. If the client hits the site before the expiration, the ipcounter is incremented, with a new expiration of CRAWL_CACHE_DURATION. Otherwise the ip key expires from the cache with its counter.
- CRAWL_SITE_COUNT : if the ip counter reach this value, a mail is sent to the site admins, a 403 forbidden http response is returned, and the ip counter is incremented with the CRAWL_CACHE_BANNED_DURATION.
Source
# -*- coding: utf-8 -*-
from django.http import HttpResponse
from django.conf import settings
from django.core.mail import mail_admins
from django.core.cache import cache
import socket
class HttpResponseBandwidthLimitExeeded(HttpResponse):
status_code = 509
class CrawlerLimiterMiddleware(object):
"""
Forbids access to violent crawlers
see http://garage.pimentech.net/libcommonDjango_django_pimentech_middleware_crawler_limiter/
for configuration instructions.
"""
def process_request(self, request):
ip = request.META['REMOTE_ADDR']
if settings.CRAWL_WHITE_LIST.match(ip):
return
try:
n = cache.get(ip)
except ValueError:
n = None
mail_admins("Memcached error for host %s" % ip)
if n is None:
cache.set(ip, 1, settings.CRAWL_CACHE_DURATION)
elif n >= settings.CRAWL_SITE_COUNT:
cache.set(ip, n+1, settings.CRAWL_CACHE_BANNED_DURATION)
if n == settings.CRAWL_SITE_COUNT:
try:
host = socket.gethostbyaddr(ip)
except:
host = 'Unknown'
else:
host = host[0]
mail_admins(' Bad crawler detected',
'%s hits limit reached for %s (host %s).\n'
'"Bandwidth limit exceeded" (509) response will be returned until it stops for %s seconds.' \
% (n, ip, host, settings.CRAWL_CACHE_BANNED_DURATION))
return HttpResponseBandwidthLimitExeeded('<h1>Make it slowlier please</h1>')
else:
cache.set(ip, n+1, settings.CRAWL_CACHE_DURATION + n/10)
Note
We would be please to receive your feedbacks, corrections and improvement suggestions. If you have Denial Of Service attacks problems, you could also try mod_evasive apache module.

PDF version