Crawlers Limiter Middleware

Status

In production.

Abstract

This small middleware was written to prevent system overload caused by aggressive crawlers on large Django websites. In conjunction with a memcached cache, the system is very fast and can be extended for other purposes. Legitimate hosts, such as search-engine bots, are configured in a white-list.

Installation

  • get PimenTech libcommonDjango:
svn checkout http://svn.pimentech.org/pimentech/libcommonDjango
  • install it with "make install"

Activation and configuration

In your settings.py:

  • Activate the memcached cache backend. Remember that each Django hit is recorded in the cache, so other cache backends would be too costly.
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'
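On Django 1.3 and later, the single CACHE_BACKEND string was replaced by the CACHES dictionary; the equivalent memcached configuration would look like this (a sketch for newer versions, not taken from the original instructions):

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}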
  • it is best to put the middleware in first position in MIDDLEWARE_CLASSES:
MIDDLEWARE_CLASSES = (
    'django_pimentech.middleware.crawler_limiter.CrawlerLimiterMiddleware',
    'django.middleware.common.CommonMiddleware',

    # ...other middlewares...
)
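On Django 1.10 and later the setting is named MIDDLEWARE and middleware classes use a different API; an old-style class such as this one can usually be wrapped with MiddlewareMixin (a hypothetical adaptation, not part of the library):

from django.utils.deprecation import MiddlewareMixin
from django_pimentech.middleware.crawler_limiter import CrawlerLimiterMiddleware

# New-style wrapper: MiddlewareMixin calls process_request() for us.
class NewStyleCrawlerLimiterMiddleware(MiddlewareMixin, CrawlerLimiterMiddleware):
    pass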
  • configure the following parameters:
import re

CRAWL_WHITE_LIST = re.compile(r"127\.0\.0\.1|192\.168\.1\.\d+|66\.249\.65\.\d+|66\.249\.66\.\d+|74\.6\.8\.\d+")
CRAWL_CACHE_DURATION = 2
CRAWL_SITE_COUNT = 50
CRAWL_CACHE_BANNED_DURATION = 60 * 5
  • CRAWL_WHITE_LIST: these IPs are exempt from the crawler limiter (66.249.65.*, 66.249.66.*: Google; 74.6.8.*: Yahoo).
  • CRAWL_CACHE_DURATION: duration in seconds of the cache entry per IP. For each Django connection, the client IP is stored. If the client hits the site again before the expiration, the IP counter is incremented and the expiration is reset to CRAWL_CACHE_DURATION. Otherwise the IP key expires from the cache along with its counter.
  • CRAWL_SITE_COUNT: if the IP counter reaches this value, a mail is sent to the site admins, a 429 Too Many Requests response is returned, and the counter keeps being incremented, now with CRAWL_CACHE_BANNED_DURATION as the expiration.
  • CRAWL_CACHE_BANNED_DURATION: duration in seconds during which a banned IP stays banned after its last hit.
  • The source below also honours an optional CRAWL_WHITE_PATH_LIST regex: requests whose path matches it are exempt as well.
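To make the counter arithmetic concrete, here is a small, self-contained simulation (hypothetical code, not part of the library) in which a plain dictionary stands in for memcached:

import time

CRAWL_CACHE_DURATION = 2
CRAWL_SITE_COUNT = 50
CRAWL_CACHE_BANNED_DURATION = 60 * 5

_cache = {}  # ip -> (counter, expiration timestamp)

def hit(ip, now=None):
    """Record one hit; return True if the request would be rejected."""
    now = time.time() if now is None else now
    entry = _cache.get(ip)
    if entry is not None and entry[1] <= now:
        entry = None  # the key has expired, as it would in memcached
    if entry is None:
        _cache[ip] = (1, now + CRAWL_CACHE_DURATION)
        return False
    n = entry[0]
    if n >= CRAWL_SITE_COUNT:
        # Banned: every further hit extends the ban.
        _cache[ip] = (n + 1, now + CRAWL_CACHE_BANNED_DURATION)
        return True
    # Below the limit: increment, with a slightly stretched expiration.
    _cache[ip] = (n + 1, now + CRAWL_CACHE_DURATION + n // 10)
    return False

# 60 hits in the same second: the 51st and following ones are rejected.
rejected = [hit("10.0.0.1", now=1000.0) for _ in range(60)]
print(rejected.index(True))  # prints 50, i.e. the 51st hit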

Source

# -*- coding: utf-8 -*-
from django.http import HttpResponse
from django.conf import settings
from django.core.mail import mail_admins
from django.core.cache import cache
from django.core.urlresolvers import resolve, Resolver404
import re
import socket


class HttpResponseTooManyRequests(HttpResponse):
    status_code = 429

class HttpResponseForbidden(HttpResponse):
    status_code = 403


class CrawlerLimiterMiddleware(object):
    """
    Forbids access to aggressive crawlers.
    see http://garage.pimentech.net/libcommonDjango_django_pimentech_middleware_crawler_limiter/
    for configuration instructions.
    """
    GOOD_IP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

    def process_request(self, request):
        # Determine the client IP: take the first plausible address in
        # X-Forwarded-For, and fall back to REMOTE_ADDR.
        ip = None
        HTTP_X_FORWARDED_FOR = request.META.get('HTTP_X_FORWARDED_FOR')
        REMOTE_ADDR = request.META.get('REMOTE_ADDR')
        PATH = request.path

        if HTTP_X_FORWARDED_FOR:
            for ip in HTTP_X_FORWARDED_FOR.split(','):
                ip = ip.strip()
                if ip:
                    break
        if not ip or not self.GOOD_IP.match(ip):
            ip = REMOTE_ADDR

        try:
            # White-listed IPs and paths bypass the limiter entirely.
            if settings.CRAWL_WHITE_LIST.match(ip):
                return

            if settings.CRAWL_WHITE_PATH_LIST.match(PATH):
                return
        except AttributeError:
            # CRAWL_WHITE_PATH_LIST (or CRAWL_WHITE_LIST) is not defined.
            pass

        # Clients that send no User-Agent header are refused outright.
        if not request.META.get('HTTP_USER_AGENT'):
            return HttpResponseForbidden('<h1>Forbidden</h1>')

        try:
            resolve(request.path)
        except Resolver404:
            # Bad URL (404), will be rejected later: do not count this hit.
            return

        try:
            n = cache.get(ip)
        except ValueError:
            n = None
            mail_admins('Memcached error', 'Memcached error for host %s' % ip)
        if n is None:
            # First hit within the window: start a fresh counter.
            cache.set(ip, 1, settings.CRAWL_CACHE_DURATION)
        elif n >= settings.CRAWL_SITE_COUNT:
            # Banned: each further hit extends the ban.
            cache.set(ip, n+1, settings.CRAWL_CACHE_BANNED_DURATION)
            if n == settings.CRAWL_SITE_COUNT:
                # Notify the admins only once, on the first rejected hit.
                try:
                    host = socket.gethostbyaddr(ip)
                except socket.error:
                    host = 'Unknown'
                else:
                    host = host[0]
                mail_admins('Bad crawler detected',
                            '%s hits limit reached for %s (host %s).\n'
                            '"Too Many Requests" (429) response will be returned '
                            'until it stops hitting the site for %s seconds.\n'
                            'HTTP_X_FORWARDED_FOR : %s\n'
                            'REMOTE_ADDR : %s\n'
                            % (n, ip, host,
                               settings.CRAWL_CACHE_BANNED_DURATION,
                               HTTP_X_FORWARDED_FOR, REMOTE_ADDR
                               ))
            return HttpResponseTooManyRequests('<h1>Do it slower please</h1>')
        else:
            # Increment the counter; the expiration stretches slightly as n
            # grows (n/10 is integer division under Python 2).
            cache.set(ip, n+1, settings.CRAWL_CACHE_DURATION + n/10)
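
To exercise the middleware by hand, something like the following can be run from a Django shell (a sketch: it assumes the settings above are active, memcached is running, '/' resolves in your URLconf, and 10.11.12.13 is not white-listed):

from django.test import RequestFactory
from django_pimentech.middleware.crawler_limiter import CrawlerLimiterMiddleware

factory = RequestFactory()
middleware = CrawlerLimiterMiddleware()

for i in range(60):
    # Forge a GET request from a fixed IP with a User-Agent header.
    request = factory.get('/', REMOTE_ADDR='10.11.12.13',
                          HTTP_USER_AGENT='test-crawler/1.0')
    response = middleware.process_request(request)
    if response is not None:
        print(response.status_code)  # 429 from the 51st hit onwards
        break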

Note

We would be pleased to receive your feedback, corrections and improvement suggestions. If you have problems with Denial of Service attacks, you could also try the mod_evasive Apache module.
