Load Control in Connotea
Connotea has some internal load controls that are designed to stop it from being overburdened with simultaneous queries. This process is called throttling.
The intention is that user requests take priority, followed by Web API and other non-browser requests, followed by search engine crawlers. The basic rules that Connotea uses are as follows:
- If the request comes from a user's browser, always try to serve it immediately, whatever the load, unless this user has requested more than 10 pages in the previous 15 seconds.
- If the request comes from a known web crawler (like Google or Yahoo!), only serve the request if
  - this crawler hasn't accessed Connotea in the previous 30 seconds, and
  - no web crawler has accessed Connotea in the previous 2 seconds, and
  - Connotea is running fewer than 15 other queries at the same time.
- If the request is from some other agent, like an RSS reader, or via the Web API, serve the request unless
  - Connotea is running 15 or more other queries at the same time, or
  - this user has issued more than 10 requests in the previous 15 seconds.
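The rules above can be read as a single decision function. The following is a minimal Python sketch, not Connotea's actual code: the `Request` class, its field names, and `should_serve` are hypothetical, and the treatment of boundary cases (for example, a crawler returning after exactly 30 seconds) is an assumption, since the rules do not spell it out.

```python
from dataclasses import dataclass

MAX_HITS = 10          # per-user request limit from the rules above
HIT_WINDOW = 15        # seconds over which user_hits_in_window is counted
CRAWLER_SELF_GAP = 30  # seconds required since this crawler's last hit
CRAWLER_ANY_GAP = 2    # seconds required since any crawler's last hit
MAX_QUERIES = 15       # concurrent-query threshold


@dataclass
class Request:
    """Hypothetical summary of what the throttle knows about a request."""
    kind: str                     # "browser", "crawler", or "other"
    user_hits_in_window: int      # requests by this user in the last HIT_WINDOW seconds
    secs_since_this_crawler: float = float("inf")
    secs_since_any_crawler: float = float("inf")


def should_serve(req: Request, running_queries: int) -> bool:
    """Return True to serve the request, False to refuse it."""
    if req.kind == "crawler":
        # All three crawler conditions must hold.
        return (req.secs_since_this_crawler >= CRAWLER_SELF_GAP
                and req.secs_since_any_crawler >= CRAWLER_ANY_GAP
                and running_queries < MAX_QUERIES)
    if req.user_hits_in_window > MAX_HITS:
        # Applies to browsers and to other agents alike.
        return False
    if req.kind == "browser":
        # Browsers are served whatever the load.
        return True
    # RSS readers, Web API clients and similar agents also depend on load.
    return running_queries < MAX_QUERIES
```

For example, `should_serve(Request("other", 3), 15)` refuses the request because 15 queries are already running, while the same request from a browser would be served.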
Technically, if Connotea determines that a request shouldn't be served for any reason, it refuses to run the query and returns an HTTP 503 response code. In a browser you would see a message telling you that Connotea is under a high load.
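A script or Web API client can treat a 503 as a signal to wait and try again later. The sketch below uses only the Python standard library. The function name `fetch_with_backoff`, the number of attempts, and the 30-second starting delay are illustrative assumptions and not values documented by Connotea.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_tries=4, first_delay=30.0):
    """Fetch url, pausing and retrying whenever the server answers 503.

    max_tries and first_delay are arbitrary choices for this sketch.
    opener and sleep are parameters so the function can be tested offline.
    """
    delay = first_delay
    for attempt in range(max_tries):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only a 503 is worth retrying; give up after the last attempt.
            if err.code != 503 or attempt == max_tries - 1:
                raise
            sleep(delay)
            delay *= 2  # wait longer before each further attempt
```

Doubling the delay keeps a refused client from adding to the load that caused the refusal in the first place.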
The details are slightly more complex than the rules listed above, so if you need to know exactly what logic is followed, have a look at the pseudo-code below, which outlines the throttling process.
handle_request:
    return 503 if service_paused
    return 503 if bot_throttle
    return 503 if dynamic_throttle
    return really_handle_request

bot_throttle:
    if is_user_agent_a_known_crawler
        return 503 if rapid_fire_by_last_hit(30, ip+agent)
        return 503 if rapid_fire_by_last_hit(2, 'all')
        return 503 if load_defer
    else if is_user_agent_an_assumed_bot
        return 503 if load_defer
    return 0

dynamic_throttle:
    return 503 if rapid_fire_by_hit_stack(15, 10, ip+agent)
    return 0

rapid_fire_by_last_hit:
    check whether there has been a hit in the given time interval from the given agent or list of agents

rapid_fire_by_hit_stack:
    check for the given number of hits in the given time interval for the given user agent and IP address

load_defer:
    return 0 if load <= LOAD_MAX
    return 503 if sleeping_count > SLEEPING_MAX
    sleep(delay)
    return 0 if load <= LOAD_MAX
    return 503

is_user_agent_a_known_crawler:
    check the current user agent against the known list of web crawlers

is_user_agent_an_assumed_bot:
    check that the current user agent does not appear in the known list of browsers
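The helper routines in the pseudo-code are described only in words. The Python sketch below shows one way they could work, under several assumptions: hits are kept in memory in a hypothetical `HitTracker` class, the current request counts towards the hit total, refused requests are still recorded, and the values of `LOAD_MAX`, `SLEEPING_MAX` and `delay` (not given above) are passed in as parameters.

```python
import time
from collections import defaultdict, deque


class HitTracker:
    """In-memory stand-in for whatever store the real service uses (an assumption)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_hit = {}                # key -> time of the most recent hit
        self.stacks = defaultdict(deque)  # key -> times of recent hits

    def rapid_fire_by_last_hit(self, interval, key):
        """True if key was last seen less than interval seconds ago."""
        now = self.clock()
        previous = self.last_hit.get(key)
        self.last_hit[key] = now  # record this hit, even if it is refused
        return previous is not None and now - previous < interval

    def rapid_fire_by_hit_stack(self, interval, max_hits, key):
        """True if key has made more than max_hits hits within interval seconds."""
        now = self.clock()
        stack = self.stacks[key]
        stack.append(now)
        # Discard hits that have fallen out of the time window.
        while stack and now - stack[0] > interval:
            stack.popleft()
        return len(stack) > max_hits


def load_defer(get_load, sleeping_count, load_max, sleeping_max, delay,
               sleep=time.sleep):
    """True if the request should be refused because of load.

    Mirrors the pseudo-code: serve at once under light load, refuse at once
    if too many requests are already waiting, otherwise wait and re-check.
    """
    if get_load() <= load_max:
        return False
    if sleeping_count > sleeping_max:
        return True
    sleep(delay)
    return get_load() > load_max
```

With these helpers, `dynamic_throttle` corresponds to `tracker.rapid_fire_by_hit_stack(15, 10, ip + agent)`: the eleventh hit from the same IP address and user agent inside 15 seconds returns true and would lead to a 503.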