Load Control in Connotea

Connotea has some internal load controls that are designed to stop it from being overburdened with simultaneous queries. This process is called throttling.

The intention is that user requests take priority, followed by Web API and other non-browser requests, followed by search engine crawlers. The basic rules that Connotea uses are as follows:

  • If the request comes from a user's browser, always try to serve it immediately, no matter what the load.
    • Unless this user has requested more than 10 pages in the previous 15 seconds.
  • If the request comes from a known web crawler (like Google or Yahoo!), only serve the request if
    • this crawler hasn't accessed Connotea in the previous 30 seconds, and
    • no web crawler has accessed Connotea in the previous 2 seconds, and
    • Connotea is running fewer than 15 other queries at the same time.
  • If the request is from some other agent, like an RSS reader, or via the Web API, serve the request unless
    • Connotea is running 15 or more other queries at the same time, or
    • this user has issued more than 10 requests in the previous 15 seconds.
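
The rules above can be sketched as a single decision function. This is only an illustration in Python, not Connotea's actual code: the function name, parameter names, and the three request categories passed as strings are assumptions made for the example, while the thresholds (10 requests, 15 queries) are taken from the list above.

```python
MAX_RECENT_REQUESTS = 10   # per client, in the previous 15 seconds
MAX_RUNNING_QUERIES = 15   # concurrent queries before non-browser requests are refused


def should_serve(kind, recent_requests, running_queries,
                 crawler_hit_30s=False, any_crawler_hit_2s=False):
    """Return True if the request would be served, False if it would get a 503.

    kind               -- "browser", "crawler", or "other" (hypothetical labels)
    recent_requests    -- earlier requests by this client in the previous 15 seconds
    running_queries    -- other queries Connotea is running right now
    crawler_hit_30s    -- this crawler accessed the site in the previous 30 seconds
    any_crawler_hit_2s -- any crawler accessed the site in the previous 2 seconds
    """
    if kind == "browser":
        # Browsers are served regardless of load, unless they are rapid-firing.
        return recent_requests <= MAX_RECENT_REQUESTS
    if kind == "crawler":
        # Crawlers must pass all three conditions.
        return (not crawler_hit_30s
                and not any_crawler_hit_2s
                and running_queries < MAX_RUNNING_QUERIES)
    # RSS readers, Web API clients, and other agents.
    return (running_queries < MAX_RUNNING_QUERIES
            and recent_requests <= MAX_RECENT_REQUESTS)
```

For example, should_serve("browser", 3, 40) is True because load is ignored for browsers, while should_serve("other", 3, 40) is False.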

Technically, if Connotea determines that a request shouldn't be served for any reason, it refuses to run the query and returns an HTTP 503 response code. In a browser you would see a message telling you that Connotea is under a high load.
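
A script or Web API client that receives a 503 can simply wait and try again. The following Python sketch shows one way to do that. Everything in it is an assumption made for illustration: the function name, the retry count, and the doubling delays are not prescribed by Connotea, and fetch stands for whatever zero-argument callable your client uses to make the request and return a (status, body) pair.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, initial_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a status other than 503 or attempts run out.

    fetch -- hypothetical zero-argument callable returning (status_code, body)
    sleep -- injectable so the function can be exercised without real waiting
    """
    delay = initial_delay
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            # Throttled: wait, then double the delay before the next try.
            sleep(delay)
            delay *= 2
    # Still throttled after the last attempt; hand the 503 back to the caller.
    return status, body
```

Because the 30-second crawler rule and the 15-second request window are time based, waiting a little longer between attempts is more likely to succeed than retrying immediately.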

The details are slightly more complex than this, so if you need to know exactly what logic is followed, have a look at the pseudo-code below, which outlines the throttling process.

handle_request:
 return 503 if service_paused
 return 503 if bot_throttle
 return 503 if dynamic_throttle
 return really_handle_request

bot_throttle:
 if is_user_agent_a_known_crawler
  return 503 if rapid_fire_by_last_hit(30, ip+agent)
  return 503 if rapid_fire_by_last_hit(2, 'all')
  return 503 if load_defer
 else if is_user_agent_an_assumed_bot
  return 503 if load_defer
 return 0

dynamic_throttle:
 return 503 if rapid_fire_by_hit_stack(15, 10, ip+agent)
 return 0
 
rapid_fire_by_last_hit:
 check there have been no hits in given time interval from given list of agents
 
rapid_fire_by_hit_stack:
 check for given number of hits in given time interval for given user agent and IP address
 
load_defer:
 return 0 if load <= LOAD_MAX
 return 503 if sleeping_count > SLEEPING_MAX
 sleep(delay)
 return 0 if load <= LOAD_MAX
 return 503
 
is_user_agent_a_known_crawler:
 check current user agent against known list of web crawlers
 
is_user_agent_an_assumed_bot:
 check current user agent does not appear in known list of browsers
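
The two rapid_fire checks are only described in words above, so here is a minimal Python sketch of how they could work. It is not Connotea's implementation. The class name, the in-memory storage, the explicit now argument, and the choice to record refused hits as well as served ones are all assumptions. The argument order (interval first, then hit count) is inferred from the call rapid_fire_by_hit_stack(15, 10, ip+agent) and the "10 requests in 15 seconds" rule.

```python
from collections import defaultdict, deque


class HitTracker:
    """Hypothetical in-memory store of request times, keyed by e.g. ip+agent."""

    def __init__(self):
        self._last_hit = {}                 # key -> time of most recent hit
        self._stacks = defaultdict(deque)   # key -> recent hit times, oldest first

    def rapid_fire_by_last_hit(self, interval, key, now):
        """True if key was last seen less than `interval` seconds before `now`."""
        last = self._last_hit.get(key)
        self._last_hit[key] = now           # assumption: refused hits count too
        return last is not None and now - last < interval

    def rapid_fire_by_hit_stack(self, interval, max_hits, key, now):
        """True if key made more than `max_hits` earlier hits within `interval`."""
        stack = self._stacks[key]
        # Drop hits that have fallen out of the time window.
        while stack and now - stack[0] >= interval:
            stack.popleft()
        previous = len(stack)
        stack.append(now)
        return previous > max_hits
```

With this sketch, tracker.rapid_fire_by_last_hit(2, 'all', now) corresponds to the "no web crawler in the previous 2 seconds" rule, and tracker.rapid_fire_by_hit_stack(15, 10, key, now) to the per-client limit.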

Version 1 (Current) | Last updated: Mon Aug 07 2006 12:35 UTC by User:ben (Created page and detailed the Connotea load control process)