code
The source code for Connotea is available for you to use and modify under the terms of the GNU GPL.
- Name
- Copyright and License
- Naming
- About the Code
- Download
- Upgrading
- New Features from 1.7.1 to 1.8
- New Features from 1.5.0 to 1.7.1
- Upgrading from 1.7.1 to 1.8
- Upgrading from 1.5.0 to 1.7.1
- Upgrading from Versions Prior to 1.5.0 to 1.7.1
- Acquiring Source for Specialized Programming
- Creating a Citation Module
- Creating an Import Module
- Creating a Proxy Module
- Adding a Static Web Page
- Adding a Dynamic Web Page
- Speaking to the Web Api from Your Application
- Minimum Requirements
- Setup
- Mysql
- Wiki::Toolkit
- Apache
- Memcached
- Configuration
- Customization
- Logrotate
- Support
- Webcite
- Acknowledgments
- More Information
Name
Connotea Code
Copyright and License
© Copyright 2005-2007 Nature Publishing Group.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
Some portions regarding RDF are originally from RDF::Core, derived from works Copyright © 2001 Ginger Alliance Ltd., and carry their own copyright and GPL notices.
Naming
You will see the names Connotea, Bibliotech and Connotea Code used. To eliminate any confusion, we'll clarify the meaning of those names here.
Connotea is the name of the online reference management service created and run by Nature Publishing Group (NPG). Bibliotech was the initial project name used at NPG while the service was being developed, and hence this name is used for some class and variable names in the code. The release of the underlying technology for Connotea is known as Connotea Code. The purpose of this page and the SourceForge project is to make the code that runs this site publicly available for review and re-use.
Therefore, it makes sense to refer to Connotea the service, or to the Connotea Code. However, Connotea is a trademark of Nature Publishing Group, so if you use the code to create your own bookmarking service, we ask that you don't brand it as Connotea. We also ask that you include the following footer on your site:
This site is powered by <a href="http://sf.net/projects/connotea">Connotea Code</a>, the open source software behind <a href="http://www.connotea.org/">Connotea</a>.
The Connotea logo, the site guide and related documentation, other image files and stylesheets are copyrighted by NPG and are not released under the GPL.
About the Code
Connotea Code runs a social bookmarking web site for users to save and share links, which can have citation data automatically retrieved from authoritative sources.
Connotea Code is written in Perl, and uses MySQL as the data store. It runs as a mod_perl handler in Apache2, and uses templates for page presentation.
Download
Download the tarball from the connotea SourceForge project area at http://sf.net/projects/connotea.
The current stable release is version 1.8.
Upgrading
NEW FEATURES FROM 1.7.1 TO 1.8
- Web API in regular use.
- Template Toolkit based templates in regular use.
- More optimized SQL queries for common requests.
- Greater use of transactions in MySQL.
- Greater flexibility for citation source modules.
- New citation source modules, plus improvements to existing modules.
- Blog component to create news page from external blog.
- Wiki component to create custom wiki.
- Admin component with user search.
- Integration with Bibutils library for BibTeX and MODS output.
- Antispam system with captcha and quarantine responses.
- Click tracker for all posts.
- Alpha-version proxy module system to handle known proxied post URLs.
- Alpha-version stand-alone citation server capability.
- Additional tools such as command-line post by API, user recovery, and test suite launch.
- Automated deployment scripts, now supporting Darcs instead of CVS.
- Updated code to support newer versions of CPAN modules.
- More test suite scripts.
NEW FEATURES FROM 1.5.0 TO 1.7.1
- Many bugs fixed.
- Alpha-version Web API.
- Alpha-version Template Toolkit based output framework.
- Full text searching feature.
- Better cache control and throttling.
- Better bookmarklets.
- Better URI validation.
- Better XML encoding for fringe cases.
- Better character set decoding of downloaded documents for citations.
- Exception email notification.
- More support for two instances on same server.
- More support for split web/database servers for one instance.
- More comprehensive User Agent support for citation modules.
- Method to switch from one citation module to another.
- Optimized SQL for counting totals and some other operations.
- Added methods for profiling code and dumping SQL statements.
- Loosen some grammar restrictions, e.g. ok to name a tag "tag".
- Tighten some grammar restrictions, e.g. num & start must be numeric.
- Better RIS import based on real-world file examples.
- New citation modules:
- Several optional administrator utilities:
- retro: script to update citation data retroactively.
- bibwatch: load monitoring utility.
- bibpreempt: preempting and testing utility.
- resendreg: utility to resend registration details.
- deluser: utility to delete users.
- memcache_wrapper: init.d script to keep memcached running.
- Several developer testing utilities:
UPGRADING FROM 1.7.1 TO 1.8
See sql/schema_alter.sql for commands to patch your database. Other elements of the upgrade should be optional; that is, you can turn them on later.
UPGRADING FROM 1.5.0 TO 1.7.1
The biggest difference between 1.5.0 and 1.7.1 is that 1.7.1 uses two databases at once.
In order to support fulltext matching, a new feature, we use a MyISAM database in MySQL with FULLTEXT keys (see http://dev.mysql.com/doc/refman/4.1/en/fulltext-search.html).
However, InnoDB is still faster for JOINs and offers referential integrity, so as a compromise we run two databases and keep them synchronized with MySQL replication (see http://dev.mysql.com/doc/refman/4.1/en/replication.html).
If you are upgrading from Connotea Code 1.5.0, please see the section below on database setup for the secondary search database. To upgrade, you will need to:
mysqldump -c -t -u bibliotech -p bibliotech > /tmp/dump
echo 'source sql/wipe.sql' | mysql -u bibliotech -p bibliotech
echo 'source /tmp/dump' | mysql -u bibliotech -p bibliotech
Except for the addition of a MyISAM database, there are no intradatabase schema changes between 1.5.0 and 1.7.1.
UPGRADING FROM VERSIONS PRIOR TO 1.5.0 TO 1.7.1
To upgrade from versions prior to 1.5.0, please edit sql/schema_alter.sql to contain only the statements necessary to alter the database schema from your version to the current schema. There are no schema changes between 1.5.0 and 1.7.1.
schema_alter.sql echo 'source schema_alter.sql' | mysql -u root -p bibliotech
Then follow the directions above for upgrading from 1.5.0.
Acquiring Source for Specialized Programming
CREATING A CITATION MODULE
Connotea's ability to import bibliographic information from third-party websites is enabled by a series of plug-ins.
If you downloaded this source code with the intent of creating a citation module, see the comments and code in the file Bibliotech/CitationSource.pm which will explain the base class from which your citation source module should be derived.
In previous releases testing your citation module required a full instance of Connotea Code. In this release, a script named test_util/citation_source_test.pl provides a way to test your module's return values without an instance. Your module file should be placed in the Bibliotech/CitationSource directory to be recognized by this script.
You may also test by creating a fully installed instance, which gives the added benefit of letting you test via a web browser and ensure that citation data is saved properly in MySQL.
If you create a new citation plug-in, please consider releasing it back to the Connotea community.
CREATING AN IMPORT MODULE
Connotea's ability to import a batch of links or references depends on a series of plug-ins.
If you downloaded this source code with the intent of creating an import module, see the comments and code in the file Bibliotech/Import.pm which will explain the base class from which your import module should be derived.
In previous releases, testing your import module required a full instance of Connotea Code. In this release, a script named test_util/import provides a way to test your module's return values without an instance. Your module file should be placed in the Bibliotech/Import directory to be recognized by this script.
You may also test by creating a fully installed instance, which gives the added benefit of letting you test via a web browser and ensure that imported data is saved properly in MySQL.
If you create a new import plug-in, please consider releasing it back to the Connotea community.
CREATING A PROXY MODULE
Connotea's ability to provide proxy translation for specific types of URIs depends on a series of plug-ins.
If you downloaded this source code with the intent of creating a proxy module, see the comments and code in the file Bibliotech/Proxy.pm, which explain the base class from which your proxy module should be derived.
You may test by creating a fully installed instance.
If you create a new proxy plug-in, please consider releasing it back to the Connotea community.
ADDING A STATIC WEB PAGE
Any Connotea Code instance that contains the Inc component has the
ability to deliver static pages through the template system. A URL
path that is not recognized by Bibliotech::Parser will be tested as
a filename under the document root with an extension of .inc
appended. The contents of this file should be XHTML. If found, the
contents will be served within inc.tt or default.tt according to
the rules of the template system.
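The lookup described above can be sketched as follows. This is an illustrative Python model rather than the project's Perl code; the function name find_static_page and the refusal of paths that escape the document root are assumptions added for the example.

```python
from pathlib import Path
from typing import Optional


def find_static_page(docroot: Path, url_path: str) -> Optional[Path]:
    """Return the .inc file that would back url_path, or None if absent."""
    # Strip slashes so "/about/" and "about" map to the same file.
    relative = url_path.strip("/")
    if not relative:
        return None
    # Append the .inc extension and look under the document root.
    candidate = (docroot / (relative + ".inc")).resolve()
    # Assumption for this sketch: refuse paths that escape the docroot.
    if docroot.resolve() not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None
```

Under this model, a request for /about is served from about.inc in the document root if that file exists, and the XHTML it contains is then wrapped by inc.tt or default.tt.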
ADDING A DYNAMIC WEB PAGE
Creating a new component for your Connotea Code instance that serves dynamic web content requires at least the following:
In Bibliotech/Parser.pm, find the grammar definition and add a subrule to the page production that designates the URL path that will activate your component. Keep in mind that a path name that is a shortened version of another path name will always eclipse the longer one if it appears first, so the longer name must be listed before the shorter one (e.g. "urilabel" must come before "uri", or "uri" would always match for either).
In Bibliotech/Page/Standard.pm add a package based on
Bibliotech::Page like the others defined in that file. The name
should be Bibliotech::Page::x where x is your path name with a
single capital letter at the beginning even if it is more than one
word (e.g. Bibliotech::Page::Reportspam for a path of
/reportspam). Include a main_component() method that returns a
string of the last part of the class name of the main component, a
Bibliotech::Component-derived class (e.g. 'ReportSpam' for
Bibliotech::Component::ReportSpam).
In the Bibliotech/Component directory create a module based on
Bibliotech::Component. Use the others that appear in that directory
as examples and refer directly to the source code in
Bibliotech/Component.pm, particularly the comments, for
descriptions of expected methods and their expected return values. For
an HTML component be sure to include last_updated_basis() and
html_content(). In particular, html_content() should return a
Bibliotech::Page::HTML_Content object; that class is defined in
Bibliotech/Page.pm.
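Two of the rules above are easy to get wrong: the ordering of path names in the grammar and the capitalization of the page class name. The Python sketch below imitates both. It is not Parse::RecDescent and not part of Connotea Code; first_match and page_class_name are hypothetical helpers that only model first-match-wins behaviour and the naming convention described in the text.

```python
def first_match(alternatives, path):
    """Try alternatives in order and return the first that is a prefix of path.

    This models ordered choice: once an alternative matches, later ones
    are never tried, even if they would match more of the input.
    """
    for name in alternatives:
        if path.startswith(name):
            return name
    return None


def page_class_name(url_path):
    """Map a URL path such as '/reportspam' to its page class name.

    Only the first letter is capitalized, even for multi-word paths.
    """
    return "Bibliotech::Page::" + url_path.strip("/").capitalize()


# Shorter name first: "uri" eclipses "urilabel".
print(first_match(["uri", "urilabel"], "urilabel"))
# Longer name first: both paths resolve as intended.
print(first_match(["urilabel", "uri"], "urilabel"))
print(page_class_name("/reportspam"))
```

The first call returns "uri" for the input "urilabel", which is the eclipsing problem the grammar ordering is meant to avoid.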
SPEAKING TO THE WEB API FROM YOUR APPLICATION
The Connotea Web API allows communication with an instance, either the Connotea web site at http://www.connotea.org/ or your own private instance, using a predefined set of commands to access structured data and accomplish normal user actions in a programmatic manner.
Your software may be written in any language you choose - the basic requirements are the ability to create and parse XML and communicate using the HTTP protocol. The ability to interpret the XML as RDF and use object orientation to model the objects serialized as RDF may prove helpful. Libraries and sample code are available.
See http://www.connotea.org/wiki/WebAPI for Web API documentation.
Minimum Requirements
This code requires, or has been best tested on:
- Linux/UNIX operating system (tested on Red Hat Enterprise
Linux 4 - see http://www.redhat.com/)
- Perl 5.8.0 (see http://www.perl.org/)
- Perl CPAN modules as identified on the list below
(see http://cpan.perl.org/)
- Apache 2.0.40 (see http://httpd.apache.org/)
- MySQL 5.0.17 (see http://www.mysql.com/)
- Memcached 1.1.12 (see http://www.danga.com/memcached/)
CPAN
You will need to have the following modules installed from CPAN.
On all Perl systems you can type:
LANG=C cpan
...or...
LANG=C perl -MCPAN -e shell
...to get a CPAN shell prompt, and then type:
cpan> install XXX::YYY
...or...
cpan> force install XXX::YYY
...to install a module.
The LANG=C portion of the command line above is highly
recommended as many modern Linux distributions set your default
LANG to a locale-based setting and this often interferes with Perl
module compilations. When it does, the error messages will be very
misleading and never mention the LANG variable.
Before you embark on what will probably be a long install-fest, it is also recommended that you type:
cpan> install Bundle::CPAN
...inside the CPAN shell and then restart it. This will ensure that you are using the latest version of the CPAN code. Some things will go more smoothly.
When asked whether to follow dependencies, answer yes. When asked about optional utilities and scripts that can be installed to /bin or /usr/bin, answer however you like, as none are necessary for this code.
You do not necessarily need the latest version of every module, although in one or two cases you do. In general, if your Perl is at least 5.8.0, just install the version that a non-force install will give you at the CPAN prompt. If you are lower than 5.8.0, upgrade your base Perl installation first.
The list:
On Red Hat and some other distros, the following are provided in vendor packages, and you're better off using those.
...but install these from CPAN so you get new versions:
- IO::String (for Bio::Biblio::IO, better to preinstall)
- XML::Writer (for Bio::Biblio::IO, better to preinstall)
- XML::Twig (for Bio::Biblio::IO, better to preinstall)
- SOAP::Lite (for Bio::Biblio::IO, better to preinstall)
- Pod::Man (for DateTime, better to preinstall)
- Bio::Biblio::IO (usually has to be forced unfortunately, there are many tests and some fail)
- Cache::Memcached
- CGI
- Class::DBI
- Config::Scoped
- Data::Dumper (not just for debugging, actually used in production)
- Date::Parse
- DateTime (you may need to force installation of DateTime::Set if your timezone is not UTC)
- DateTime::Format::ISO8601
- DateTime::Format::MySQL
- DateTime::Incomplete
- Digest::MD5
- Encode (you may need to force installation of Encode if some non-English tests fail)
- Fcntl
- File::Temp
- File::Touch
- FindBin
- HTML::Entities
- HTML::Sanitizer (you may need to force installation of HTML::Sanitizer due to some year-old bugs already filed on CPAN)
- HTTP::OAI
- IO::File
- JSON
- LWP::UserAgent
- List::MoreUtils
- List::Util (you may need to force installation of List::Util unless you have a very new version of Perl)
- Net::Daemon::Log (you may need to force installation of Net::Daemon::Log for failing a fork test - not used by us)
- Netscape::Bookmarks
- Parse::RecDescent
- RDF::Core
- SQL::Abstract
- Set::Array
- Storable
- Template
- Test::Exception
- Time::HR
- URI
- URI::Escape
- URI::Heuristic
- URI::OpenURL
- URI::QueryParam
- Want
- Wiki::Toolkit
- Wiki::Toolkit::Plugin::Diff
- XML::Element
- XML::Feed
- XML::LibXML
- XML::RSS
- YAML (you may need to force installation of Test::Simple which is a dependency of YAML, for an unknown reason)
- Apache::Emulator (not required for core web service)
- Text::BibTeX (not required for core web service)
Setup
MYSQL
Two databases for user posts need to be created. See sql/schema.sql for the database schema which needs to be created in MySQL. The first database will be created using InnoDB tables to enforce foreign keys and constraints and for table joining speed. A second database then should be created with a _search suffix using MyISAM tables that have FULLTEXT indexes which are queried when searching for words. (FULLTEXT indexes are not available for InnoDB yet.)
The second schema is generated from the first by running:
cd sql
perl mkschema_search < schema.sql > schema_search.sql
MySQL replication can be used to make the MyISAM database a slave of the InnoDB database, even on the same machine. This is a suggested configuration for /etc/my.cnf that will do just that:
[mysqld]
# local replication of bibliotech to bibliotech_search:
server-id=1
log-bin=mysql-bin
binlog-do-db=bibliotech
replicate-same-server-id=1
replicate-rewrite-db=bibliotech->bibliotech_search
replicate-do-db=bibliotech_search
master-host=localhost
master-user=search_repl
master-password=pass
# change stopwords in support of bibliotech freematch feature:
#ft_stopword_file=/etc/mysql_stopwords.txt
ft_min_word_len=2
ft_max_word_len=255
# allow packing of queries
group_concat_max_len=8192
Change the master-password line! Also change the database names if you
are not using bibliotech.
You will probably find the MySQL stopwords to be too restrictive in practice. The list can be viewed at http://dev.mysql.com/doc/mysql/en/fulltext-stopwords.html. We recommend that you pare down this list to a more suitable one, and use the ft_stopword_file keyword to tell MySQL to use your list instead.
In any case, if you want the search feature to behave predictably, you must specify an external text file stopword list to MySQL. The search handler will query MySQL to find out the stopword list file being used, and read it as well, so it can anticipate MySQL reporting no matches for words that otherwise should match.
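To make the reason for reading the stopword file concrete, here is a minimal Python sketch of how a search handler could predict which query words MySQL will ignore. It is an assumption-laden model, not the project's Perl search handler: load_stopwords and split_search_terms are hypothetical names, the tokenizing regular expression is a simplification, and min_len mirrors the ft_min_word_len=2 setting suggested above.

```python
import re

WORD = re.compile(r"[\w']+")


def load_stopwords(text):
    """Parse the contents of a stopword file into a set of lowercase words."""
    return set(WORD.findall(text.lower()))


def split_search_terms(query, stopwords, min_len=2):
    """Separate query words into (searchable, ignored) lists.

    A word is treated as ignored when it is shorter than min_len or
    appears in the stopword list, since a FULLTEXT index would not
    contain it.
    """
    searchable, ignored = [], []
    for word in WORD.findall(query.lower()):
        if len(word) < min_len or word in stopwords:
            ignored.append(word)
        else:
            searchable.append(word)
    return searchable, ignored
```

With a stopword list containing "the" and "of", the query "The origin of species" splits into the searchable words origin and species, and the handler knows in advance not to expect matches for the other two.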
You'll need to execute a grant statement similar to this one:
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO search_repl@'localhost.localdomain' IDENTIFIED BY 'pass';
Two notes on the replication grant statement:
- MySQL seems to consider localhost.localdomain different from localhost, and while the shorter version normally works, for replication it seems that the longer one is needed. If you have problems, try both.
- You must have the updated privilege table structure. If you
have had MySQL installed since the 3.x series, your mysql.user table
lacks the privilege fields mentioned above; check your docs about a
script called 'mysql_fix_privilege_tables'. On many systems this will
be a shell script in /usr/bin that you can run as root with a
--password=xxx parameter (to specify the MySQL root user password, not the Unix root user password).
The MySQL username used by the Perl handler must have access to both databases (username and password as in /etc/bibliotech.conf):
GRANT SELECT, INSERT, UPDATE, DELETE ON bibliotech.* TO user@localhost IDENTIFIED BY 'secret';
GRANT SELECT, INSERT, UPDATE, DELETE ON bibliotech_search.* TO user@localhost IDENTIFIED BY 'secret';
WIKI::TOOLKIT
You also need to set up Wiki::Toolkit so that a wiki is available. This is required. You should create a blank database, grant a user rights to it, and run the provided setup script.
CREATE DATABASE conwiki;
GRANT ALL ON conwiki.* TO conwiki@localhost IDENTIFIED BY 'secret';
/usr/bin/wiki-toolkit-setupdb --type mysql \
--name conwiki \
--user conwiki \
--pass secret \
--host localhost
Remember to populate the COMPONENT WIKI block of your configuration
file with the wiki database details.
APACHE
Everything under the site/default subdirectory should be placed or linked into an Apache-accessible location, and a location handler should be added to httpd.conf (or elsewhere in the Apache configuration) such as the following one.
Update the values to match your IP, domain, and file paths:
<VirtualHost 1.2.3.4:80>
ServerName www.yourdomain.com
ServerAlias yourdomain.com
ServerAdmin you@yourdomain.com
DocumentRoot /var/www/perl/connotea_code/site/default
PerlOptions +Parent
PerlSwitches -I/var/www/perl/connotea_code
PerlModule Bibliotech::Apache
PerlModule Bibliotech::AuthCookie
<Location />
SetHandler perl-script
PerlHandler Bibliotech::Apache
PerlAuthenHandler Bibliotech::AuthCookie::authen_handler
AuthName Bibliotech
AuthType basic
require valid-user
#ErrorDocument 503 /paused.html
#ErrorDocument 503 /readonly.html
ErrorDocument 503 /unavailable.html
</Location>
</VirtualHost>
The 503 lines allow a custom page to be displayed when your site is under heavy load (unavailable.html) or when you deliberately pause service (paused.html) or make it read-only (readonly.html); you must edit your Apache configuration and switch which line is commented for the latter two modes.
MEMCACHED
Memcached is required, and the code is written to assume that a memcache is running. Database timestamps, cached HTML, and uploaded files are all stored temporarily in this cache.
CONFIGURATION
See config for a configuration that should be copied to
/etc/bibliotech.conf and edited to suit your needs. Particularly,
be sure to change *_SECRET and *_PASSWORD variables.
Default configuration:
# Example Configuration For A Connotea Code Instance
#
# Should be installed to /etc/bibliotech.conf
# Commented lines represent default values
GENERAL {
# Public service name and contact details
# - if your base URL will be www.yourdomain.com/somename, SITE_NAME
# should be 'somename', although the capitalization may differ as
# long as the directory containing the HTML files is in lower case
# - if your base URL will be a plain domain name, you have more
# freedom to set SITE_NAME to whatever you like
SITE_NAME = 'My Bookmarking Service'
SITE_EMAIL = 'admin@example.org'
# optionally set document root and home page hyperlink, or they can
# be detected
#DOCROOT = '/var/www/perl/connotea_code/site/default'
#LOCATION = 'http://www.mydomain.com/'
# set prepath if location has a component after the domain,
# e.g. /bibliotech
#PREPATH = ''
# send emails to a system administrator when an unhandled Perl
# exception is thrown; defaults to undefined, which skips the sending
# of an email
#EXCEPTION_ERROR_REPORTS_TO = 'admin@example.org'
#EXCEPTION_ERROR_REPORTS_TO = [ 'admin1@example.org', 'admin2@example.org' ]
# database connection details
# connection string to the main InnoDB database
DBI_CONNECT = 'dbi:mysql:bibliotech'
DBI_USERNAME = 'user'
DBI_PASSWORD = 'secret'
# just the database name of the replicated MyISAM FULLTEXT-enabled database
DBI_SEARCH = 'bibliotech_search'
# memcached server address
#MEMCACHED_SERVERS = [ '127.0.0.1:11211' ]
# directory in which your templates will reside
TEMPLATE_ROOT = 'defaulttemplate'
# where does Apache create its pid file?
#PID_FILE = '/var/run/httpd.pid'
# which system binary should we use for mail (should be, or emulate,
# sendmail)
SENDMAIL = '/usr/lib/sendmail'
# Change these!
# *************
# These are secret strings used as part of the data when creating
# MD5 hashes so we can provide those hashes to public users and then
# verify them later
USER_COOKIE_SECRET = 'secretsecret'
USER_VERIFYCODE_SECRET = 'veryverysecret'
FORGOTTEN_PASSWORD_SECRET = 'datagone';
# set to true on RHEL 3 or when you get an error starting up about
# Apache::compat
#MOD_PERL_WORKAROUND = false
# set to true when debugging a problem with the site not coming up
# (causes many HTTP error code pages to be replaced with status 200
# text/plain with explanation)
#EXPLAIN_HTTP_CODES = false
# Send 304 codes when we can
#CLIENT_SIDE_HTTP_CACHE = true
# send a cache control header to tell all clients and intermediate
# caches not to hold this data (if you uncomment this setting you do
# not need the next one as it will have no effect)
#NO_CACHE_HEADER = false
# send a cache control header to set an expiration time, in seconds
# from now
#CACHE_AGE_HEADER = 3
# what is the time zone setting on MySQL? ('local' means local time
# on database host)
#TIME_ZONE_ON_DB_HOST = 'local'
# what time zone should the site display? (e.g. 'UTC',
# 'America/New_York', 'Europe/London', etc.)
#TIME_ZONE_PROVIDED = 'Europe/London'
# control how many bookmarks are considered for "linked" lists
#LINKED_RECENT_INTERVAL = '24 HOUR'
# set to true if you don't want to bother checking a new user's
# email address and instead just log them straight in
#SKIP_EMAIL_VERIFICATION = false
# which citation source modules are active
CITATION_MODULES = [ Self Pubmed NPG Hubmed Dlib Amazon Highwire
DOI PMC Blackwell Wiley ePrints ]
# which import modules are active
IMPORT_MODULES = [ FirefoxBookmarks RIS ]
# which proxy translation modules are active
PROXY_MODULES = [ Ads ]
# install bibutils and then provide the path to support MODS,
# BibTeX, etc.: http://www.scripps.edu/~cdputnam/software/bibutils/
#BIBUTILS_PATH = /usr/local/bin/bibutils
# disallow users or groups starting with these words
RESERVED_PREFIXES = [ 'connotea' 'bibliotech' ]
# define global CSS stylesheet(s)
#GLOBAL_CSS_FILE = 'global.css'
# also supports multiple...
#GLOBAL_CSS_FILE = [ 'global.css' 'global_dev.css' ]
# optionally define separate CSS filename for the home page, will
# replace GLOBAL_CSS_FILE option there
#HOME_CSS_FILE = 'home.css'
# limit uploaded RIS files to a certain number of entries
#IMPORT_MAX_COUNT = 1000
# pause web services
#SERVICE_PAUSED = true
# takes IP addresses - no wildcards or ranges allowed, must be
# explicit addresses
#SERVICE_NEVER_PAUSED_FOR = [ '192.168.1.10', '192.168.1.11' ]
# "early" means before last_updated is computed and HTTP HEAD and
# If-Modified-Since/304 transactions are handled; useful if you are
# pausing to fix a bug in these areas
#SERVICE_PAUSED_EARLY = false
# make web services read-only
#SERVICE_READ_ONLY = true
# takes IP addresses - no wildcards or ranges allowed, must be
# explicit addresses
#SERVICE_NEVER_READ_ONLY_FOR = [ '192.168.1.10', '192.168.1.11' ]
# let visitors with a foreign/blank Referer see slightly old data if
# we are not current in the cache
#FRESH_VISITOR_LAZY_UPDATE = 180
# override HTML <title> for certain pages
TITLE_OVERRIDE = { home = '\uConnotea - social bookmarking' }
# files to parse with template system
HANDLE_STATIC_FILES = [ 'remote.js' ]
# create a log file
#LOG_FILE = '/var/log/bibliotech.log'
# two major kinds of throttling
#BOT_THROTTLE = false
#DYNAMIC_THROTTLE = false
# indicate known bot User-Agent strings
#THROTTLE_FOR = [ ]
# avoid throttling likely human User-Agent strings
#ANTI_THROTTLE_FOR = ['^Mozilla/[\d\.]+ .*(Gecko|KHTML|MSIE)',
# '^Opera/[\d\.]+\b',
# '^amaya/[\d\.]+\b',
# '^Democracy/[\d\.]+\b',
# '^Dillo/[\d\.]+\b',
# '^iCab/[\d\.]+\b',
# '^IBrowse/[\d\.]+\b',
# '^ICE Browser/[\d\.]+\b',
# '^(Lynx/[\d\.]+|Links)\b',
# 'NetPositive',
# '^Emacs-',
# 'WWW::Connotea',
# ]
# defer requests when load is higher than this number
#LOAD_MAX = 25
# when load is high (>LOAD_MAX) we first "defer" a query, which
# means we sleep, and then we check the load again; if it's still
# too high, we send a 503 with a Retry-After which we call a "wait";
# each of the LOAD_DEFER_* and LOAD_WAIT_* sets of variables are
# four variables that decide the interval based on a formula that
# uses the current load number:
# interval = min(max(load*multiplier+adjustment,minimum),maximum)
#LOAD_DEFER_MUL = 1
#LOAD_DEFER_ADJ = 0
#LOAD_DEFER_MIN = 0
#LOAD_DEFER_MAX = 30
#LOAD_WAIT_MUL = 1
#LOAD_WAIT_ADJ = 0
#LOAD_WAIT_MIN = 0
#LOAD_WAIT_MAX = 30
# throttle rapid fire hosts
#DYNAMIC_THROTTLE_TIME = 15
#DYNAMIC_THROTTLE_HITS = 10
# protect Web API User-Agent string
#DYNAMIC_THROTTLE_NEVER_FOR = [ 'WWW::Connotea' ]
# time for a single host rapid fire
#BOT_LONE_THROTTLE_TIME = 30
# time for all hosts rapid fire
#BOT_ALL_THROTTLE_TIME = 2
# when a request is throttled it can be "slept" for some time,
# unless there are too many already sleeping
#SLEEPING_MAX = 10
# some components compute site-wide lists using "recent" data; these
# ask "how recent?" (specify in labeled units of HOUR or DAY)
#ACTIVE_TAGS_WINDOW = '30 DAY'
#ACTIVE_USERS_WINDOW = '30 DAY'
#TAG_CLOUD_WINDOW = '60 DAY'
#POPULAR_WINDOW = '60 DAY'
# details governing the definition of popular tags
#POPULAR_TAGS_WINDOW = '7 DAY'
#POPULAR_TAGS_LAG = '10 MINUTE'
#POPULAR_TAGS_IGNORE = [ uploaded ]
#POPULAR_TAGS_POST_MIN = 5
#POPULAR_TAGS_USER_MIN = 5
#POPULAR_TAGS_BOOKMARK_MIN = 5
# how many words can be entered in the freematch search box for a
# single query?
#MAX_FREEMATCH_TERMS = 12
# limit how many posts can be exported or imported at once to avoid
# lengthy calculations
#EXPORT_MAX_COUNT = 1000
#IMPORT_MAX_COUNT = 1000
}
# Some citation modules need credentials or other minor configuration:
CITATION AMAZON {
AWSID = ''
}
CITATION DOI {
CR_USER = ''
CR_PASSWORD = ''
}
CITATION HIGHWIRE {
SCI_USER = ''
SCI_PASSWORD = ''
}
CITATION {
#DNS_LISTS = [ sbl.spamhaus.org multi.surbl.org ]
#WHITE_LIST = []
#SCORE = 2
}
# Component modules:
COMPONENT BLOG {
#FEED_URL = 'file:///tmp/blog.xml'
}
COMPONENT WIKI {
#DBI_CONNECT = 'dbi:mysql:conwiki'
#DBI_USERNAME = 'conwiki'
#DBI_PASSWORD = 'secret'
#ADMIN_USERS = [ 'admin' ]
#LOCK_TIME = '10 MINUTE'
#HOME_NODE = 'System:Home'
}
COMPONENT KILLSPAMMER {
#ADMIN_USERS = []
}
COMPONENT ADMIN {
#ADMIN_USERS = []
}
COMPONENT ADMINSTATS {
#ADMIN_USERS = []
}
# Other:
# specify file paths for the captchas generated
CAPTCHA {
#DATA_FOLDER = '/tmp/captcha/data'
# next line actually defaults to DOCROOT + /captcha
#OUTPUT_FOLDER = '/var/www/perl/connotea_code/site/default/captcha'
#OUTPUT_LOCATION = '/captcha/'
}
ANTISPAM {
# filenames to comma-separated-values log files that can be generated
# make sure apache has write access
#SCORE_LOG = ''
#CAPTCHA_LOG = ''
# Generic "bad phrases", see below for where it's used
#BAD_PHRASE_LIST = []
# URI_BAD_PHRASE_LIST defaults to BAD_PHRASE_LIST if not specified
#URI_BAD_PHRASE_LIST = []
#URI_BAD_PHRASE_SCORE = 1
#USERNAME_ENDS_IN_DIGIT_SCORE = 1
#USERNAME_DIGIT_MIDDLE_SCORE = 1
# TAG_BAD_PHRASE_LIST defaults to BAD_PHRASE_LIST if not specified
#TAG_BAD_PHRASE_LIST = []
#TAG_BAD_PHRASE_SCORE = 1
#TAG_REALLY_BAD_PHRASE_LIST = []
#TAG_REALLY_BAD_PHRASE_SCORE = 3
#TAGS_TOO_MANY_MAX = 7
#TAGS_TOO_MANY_SCORE = 1
#EMAIL_GENERIC_SERVICE_LIST = []
#EMAIL_GENERIC_SERVICE_SCORE = 1
#TAGS_TWO_ALLITERATIVE_SCORE = 1
#LIBRARY_EMPTY_SCORE = 1
#LIBRARY_TAGS_TOO_MANY_MAX = 50
#LIBRARY_TAGS_TOO_MANY_SCORE = 1
#LIBRARY_RECENT_ACTIVE_MAX = 5
#LIBRARY_RECENT_ACTIVE_WINDOW = '24 HOUR'
#LIBRARY_RECENT_ACTIVE_SCORE = 1
#LIBRARY_HAS_HOST_MAX = 3
#LIBRARY_HAS_HOST_WHITE_LIST = []
#LIBRARY_HAS_HOST_SCORE = 1
# DESCRIPTION_BAD_PHRASE_LIST defaults to BAD_PHRASE_LIST if not specified
#DESCRIPTION_BAD_PHRASE_LIST = []
#DESCRIPTION_BAD_PHRASE_SCORE = 1
#COMMENT_TAGS_SCORE = 1
#URI_BAD_TLD_LIST = []
#URI_BAD_TLD_SCORE = 1
# TITLE_BAD_PHRASE_LIST defaults to BAD_PHRASE_LIST if not specified
#TITLE_BAD_PHRASE_LIST = []
#TITLE_BAD_PHRASE_SCORE = 1
#URI_BAD_HOST_LIST = []
#URI_BAD_HOST_SCORE = 1
#TAG_POPULAR_SCORE = 1
#STRANGE_TAG_COMBO_LIST = []
#STRANGE_TAG_COMBO_SCORE = 1
#DESCRIPTION_HAS_TITLE_SCORE = 1
#TOO_MANY_COMMAS_MAX = 3
#TOO_MANY_COMMAS_SCORE = 1
#REPEATED_WORDS_SCORE = 1
#REPEATED_WORDS_URI_BONUS = 1
#PREFILLED_ADD_FORM_SCORE = 1
#AUTHORITATIVE_CITATION_SCORE = -1
#USERNAME_CONSONANTS_SCORE = 1
#TITLE_SITEMAP_SCORE = 2
# >SCORE_MAX means spam
#SCORE_MAX = 4
# test tag is 'i_am_spam'
#I_AM_SPAM_SCORE = 10
# these users are not subjected to spam checks
#TRUSTED_USER_LIST = []
}
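The option names above suggest an additive scheme: each check that fires contributes its configured score, and a total greater than SCORE_MAX marks the post as spam. The following Python sketch illustrates that arithmetic only. It is not Connotea's implementation (which is Perl); the function names are hypothetical, only three of the checks are modelled, and reading TAGS_TOO_MANY_MAX as "more than this many tags on one post" is an assumption.

```python
# Illustrative sketch only; values mirror the commented defaults above.
CONFIG = {
    "TAGS_TOO_MANY_MAX": 7,
    "TAGS_TOO_MANY_SCORE": 1,
    "I_AM_SPAM_SCORE": 10,
    "AUTHORITATIVE_CITATION_SCORE": -1,
    "SCORE_MAX": 4,
}


def spam_score(tags, has_authoritative_citation, config=CONFIG):
    """Sum the scores of the (hypothetical) checks that a post triggers."""
    score = 0
    # Assumed meaning: more than TAGS_TOO_MANY_MAX tags on one post.
    if len(tags) > config["TAGS_TOO_MANY_MAX"]:
        score += config["TAGS_TOO_MANY_SCORE"]
    # The documented test tag exceeds the threshold on its own.
    if "i_am_spam" in tags:
        score += config["I_AM_SPAM_SCORE"]
    # A negative score lets a positive signal offset other checks.
    if has_authoritative_citation:
        score += config["AUTHORITATIVE_CITATION_SCORE"]
    return score


def is_spam(score, config=CONFIG):
    """A total strictly greater than SCORE_MAX means spam."""
    return score > config["SCORE_MAX"]
```

With the defaults shown, a post tagged i_am_spam scores 10 and is flagged, while a post scoring exactly 4 is not.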
CUSTOMIZATION
The look and feel of your Connotea Code installation can be modified by creating a new stylesheet and new templates. The template system is Template Toolkit, documented at http://www.template-toolkit.org/. We refer to this system as TT for short.
TEMPLATE LOCATION
Templates are located by default in site/default. This is controlled by options in the configuration. It is recommended that templates have a .tt extension.
TEMPLATE SELECTION
The template used to service a particular request is determined by the page requested and the available template filenames.
Individual templates can be defined for individual pages; for example, to override the template for the add form, create a template called add.tt.
For general bookmark listing queries (e.g. /tag/tagname), templates
beginning with recent can be used. recent.tt will be used for
queries with no user or tag parameters. recent_user.tt,
recent_tag.tt and recent_user_tag.tt can be created to specify
the behaviour if there is a user query, a tag query, or both,
respectively.
Unless overridden by a specific template, default.tt is used.
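The selection rules above can be sketched as a small lookup. This Python fragment is an illustration under stated assumptions, not the Perl logic that Connotea uses: the function names are hypothetical, the page name recent for listing queries is inferred from the template names, and the text does not say whether a missing recent_user.tt falls back to recent.tt, so this sketch goes straight to default.tt.

```python
def candidate_templates(page, has_user=False, has_tag=False):
    """Return template filenames to try, most specific first."""
    if page == "recent":
        # Listing queries: the suffix reflects which filters are present.
        suffix = ("_user" if has_user else "") + ("_tag" if has_tag else "")
        names = ["recent" + suffix + ".tt"]
    else:
        # Other pages: a template named after the page, e.g. add.tt.
        names = [page + ".tt"]
    # default.tt is used unless a more specific template exists.
    names.append("default.tt")
    return names


def select_template(page, available, has_user=False, has_tag=False):
    """Pick the first candidate that is present in `available`."""
    for name in candidate_templates(page, has_user, has_tag):
        if name in available:
            return name
    return None
```

For example, with only default.tt and add.tt installed, a request for the add form resolves to add.tt and every other page resolves to default.tt.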
TEMPLATE EXAMPLES
Templates should not contain the full HTML for the page you want to
construct, but only that which should appear between the <body>
and </body> tags.
This is an example default.tt:
[% prepare_component_begin() %]
[% prepare_component('main',undef,'main,verbose') %]
[% prepare_component_end() %]
<html>
<head>
<title>[% main_title %]</title>
[% rss_link %]
[% component_javascript_block_if_needed %]
</head>
<body[% component_javascript_onload_if_needed %]>
[% component_html('main',undef,'main,verbose') %]
</body>
</html>
The syntax is that of Template Toolkit (TT), documented at http://www.template-toolkit.org/.
Several functions are provided to TT by the calling instance.
prepare_component_begin() is called before anything else and prepares
some internal data structures. prepare_component() is then called with
three arguments. The first is the base name of the component
(e.g. Blah corresponds to Bibliotech::Component::Blah) or the special
word main, which looks up the main component of the page as described
in Bibliotech/Parser.pm and Bibliotech/Page/Standard.pm; this allows
some templates to be reused. The second argument is a comma-separated
list of parts, a mechanism used by some components such as ListOfTags.
The third argument is a comma-separated list of options to the
component; the universal ones are main and verbose, which should be
set to true for the main component (the format is key=value, but
omitting the value is the same as =1, which is true). Later in the
template, a call to component_html() with the same arguments inserts
the HTML. If you specified multiple part names in the prepare call,
make one component_html() call for each part.
A more realistic example for default.tt would make calls to a normalprep.tt and a normal.tt wrapper:
[% prepare_component_begin() %]
[% INCLUDE normalprep.tt %]
[% prepare_component('main',undef,'main,verbose') %]
[% prepare_component_end() %]
[% WRAPPER normal.tt %]
[% component_html('main',undef,'main,verbose') %]
[% END %]
In this case, normal.tt would contain a basic look for the web site that can be used by many other templates.
A set of working templates is furnished with this distribution.
TEMPLATES AND COMPONENTS
A Connotea web page is a series of components that are combined
together. Each component contributes HTML, either as one block or as
several separately-placed parts calculated at once, and sometimes also
Javascript to be placed in a <script> block in the <head> or in the
onload attribute of the <body> tag. The components are controlled by
the template selected to represent the HTTP query. Each running
component can access the current command, the posts that result from
the SQL engine processing the query of the command, and a variety of
support services.
Several functions are provided to TT by the calling instance. Some functions are general, and are available in all templates, even small snippet templates used by components (conventionally named with a comp prefix). Some functions are page-level utilities that largely control component insertion.
TEMPLATE PAGE LEVEL FUNCTIONS
prepare_component_begin()
Declare the beginning of prepare_component() calls.
prepare_component(module, parts, options)
Prepare a component.
- module
The base name of the desired component (e.g. Blah corresponds to
Bibliotech::Component::Blah) or the special word main which does
a lookup for the main component of a page described in
Bibliotech/Parser.pm and Bibliotech/Page/Standard.pm, which
allows some templates to be reused.
- parts
For most components, use undef for one block of HTML output. For
components that can return multiple parts of HTML, this option is a
comma-separated list of part names to prepare in one calculation for
efficiency. Components with parts: ListOfTags, ListOfUsers, and
ListOfGangs.
- options
A comma-separated list of options to the component, of which the
universal ones are main and verbose which should be set to true
for the main component (the format is key=value but omitting the value
is the same as =1 which is true).
prepare_component_end()
Declare the end of prepare_component() calls.
component_html(module, part, options)
The arguments are the same as prepare_component(), except that the
part should be at most one part, not more than one.
component_javascript_onload()
Insert the Javascript addressed at the onload handler.
component_javascript_onload_if_needed
Insert the Javascript addressed at the onload handler, but wrap it in
a space followed by the actual attribute itself, as in,
onload="blah", or if there is no Javascript, insert nothing.
component_javascript_block()
Insert the Javascript addressed at the head of the HTML document.
component_javascript_block_if_needed()
Insert the Javascript addressed at the head of the HTML document,
but wrap it in a <script> block, or if there is no Javascript,
insert nothing.
main_title
The HTML document title recommended by the main component, or failing
that, a default constructed from the site name and page name.
main_heading
The HTML document heading (H1) recommended by the main component.
main_description
The description recommended by the main component; used in RSS, etc.
css_link
Insert a <link> representing the CSS files dictated by
HOME_CSS_FILE for the home page or GLOBAL_CSS_FILE otherwise
(configuration options).
rss_link
Insert a <link> representing the RSS format output for the
currently viewed page.
TEMPLATE GENERAL FUNCTIONS
location
<a href="[% location %]news">
sitename
siteemail
user
[% IF user %][% user.username %][% ELSE %]Visitor[% END %]
is_browser_safari, is_browser_firefox, is_browser_ie, is_browser_other
browser_redirect(url)
... location is prepended.
is_virgin
canonical_uri
canonical_location
... location.
object_location
... location and setting format to HTML.
no_num(url)
encode_xml_utf8(str)
encode_xhtml_utf8(str)
now
A Bibliotech::DateTime object, e.g.:
[% now.label %] [% now.ymd %] [% now.ymdhm %] [% now.iso8601 %] [% now.iso8601_utc %]
time
join(joinstr, ...)
speech_join(jointype, ...)
... jointype is and or or. This function will combine with commas,
spaces, and the jointype operator (if at least three items), e.g.:
speech_join('and', 'bob') -> 'bob'
speech_join('and', 'bob', 'alice') -> 'bob and alice'
speech_join('and', 'bob', 'alice', 'tom') -> 'bob, alice, and tom'
plural(amount, singular, plural, no_space)
plural(6, 'second', 'seconds') -> '6 seconds'
plural(1, 'second', 'seconds') -> '1 second'
commas(num)
commas(5000000) -> '5,000,000'
divide(a, b, places, multiplier)
... places (default 1 if omitted), and multiplied by multiplier
(default 1 if omitted), e.g.:
divide(10, 0) -> 0
divide(10, 2) -> 5
divide(10, 4, 2) -> 2.50
divide(10, 4, 2, 100) -> 250
percent(a, b, places)
Same as divide but multiplier is 100 and a percent sign is
appended.
percent(1, 2) -> 50.0%
percent(4, 100) -> 4.0%
bookmarklets
bookmarklet(page, popup)
... page should be add, addcomment, or comments. Argument popup
should be direct or popup.
bookmarklet_js(page, popup)
Same as bookmarklet but only insert the Javascript.
user_in_own_library
... /user with their username.
user_in_another_library
... /user with a username other than their own.
click_counter_onclick(url, new_window)
... new_window is true.
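The documented examples for speech_join, plural, and commas pin down their behaviour fairly precisely. As an illustration only, the following Python sketch reproduces those examples. It is not the Perl code that backs the template functions, and the no_space argument to plural is left out because its effect is not described here.

```python
def speech_join(jointype, *items):
    """Join items as in the documented examples (comma before the operator for 3+)."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} {jointype} {items[1]}"
    return ", ".join(items[:-1]) + f", {jointype} {items[-1]}"


def plural(amount, singular, plural_form):
    """Prefix the amount and use the singular form only for exactly 1."""
    return f"{amount} {singular if amount == 1 else plural_form}"


def commas(num):
    """Insert thousands separators into an integer."""
    return f"{num:,}"
```

Calling speech_join('and', 'bob', 'alice', 'tom') with this sketch yields the same string as the documented example, which makes it a convenient reference when checking template output.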
TEMPLATE COMPONENT SNIPPET FUNCTIONS
sticky(parameter)
Used primarily inside the value attributes of HTML input tags,
this function allows the form to remember values between refreshes, so
if a message must be displayed to the user causing the form to be
redisplayed, the user's form responses are not lost.
has_validation_error
True if the form is redisplayed with an active error.
has_validation_error_for(field)
True if the form is redisplayed with an active error concerning the
field specified. Note that some errors are not tied to a field.
validation_error_field
If the form is redisplayed with an active error, the field name that
gave rise to the error. Note that some errors are not tied to a field.
validation_error
If the form is redisplayed with an active error, the error message.
LOGROTATE
A log file will be created at the place specified in LOG_FILE in
the configuration. This file can grow quickly so you may like to
configure logrotate to deal with it on a weekly basis. This is the
contents of a suggested /etc/logrotate.d/bibliotech file:
/var/log/bibliotech.log {
create 644 apache root
notifempty
weekly
rotate 5
compress
postrotate
/bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
endscript
}
SUPPORT
To subscribe to the Connotea Code development mailing list, go to https://lists.sourceforge.net/lists/listinfo/connotea-code-devel.
This list is for the discussion of the core code and citation and import plug-ins. It is intended for use by people who are installing their own instance of Connotea Code, or who are reviewing the code to see how Connotea handles their data, or who would like to help enable the importing of bibliographic information from more sources.
There is a separate list, connotea-discuss, for discussion of Connotea itself - i.e. for discussion of http://www.connotea.org/. That list is the appropriate place for questions about how to use the site, or requests or suggestions for new features.
WEBCITE
Connotea Code has a module called WebCite, which can be employed separately as a simple web service that provides citation information using the Connotea Code citation modules.
A WebCite instance requires the full codebase to be present, as well as the CPAN module dependencies, and a compatible version of Apache, just as you might set up for Connotea Code proper. However, MySQL, memcached, and Wiki::Toolkit are not required.
You should create a configuration file as directed for Connotea Code, but only the WEBCITE section and sections for your active citation modules are required.
WebCite provides caching not with memcached but via the filesystem, so the results survive Apache restarts.
WebCite can be turned on inside a normal Connotea Code deployment, but will perform its duties separately from the main codebase.
The main page for WebCite simply presents a form with two fields, uri and fmt.
A submit button is provided for human users but the value is ignored.
The form can be called by programs by performing an HTTP POST to the
installed location with uri and fmt parameters.
Programs should evaluate the HTTP status code first.
In the case of a 404 code, a brief message such as No citation
may be displayed, but programs should not expect this value, and
should rely exclusively on the HTTP status code to determine this
condition.
In the case of a 200 code, the Content-Type header should be
appropriate for the format requested, and the data payload should
present the citation data in the format requested.
A 500 code will occur if an exception is thrown retrieving the
citation data, and the data payload will be plain text giving the
error message.
All transactions are separate. There is no concept of sessions.
There are no authentication checks in the WebCite code, although the system administrator is free to add restrictions in the Apache configuration.
Subsequent requests within 90 days for the same URL will return data cached from the original request. Data is cached in an internal structured form, so the same cache entry can produce all output formats.
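A client for the protocol described above might look like the following Python sketch. It is a minimal illustration, not part of Connotea Code: the function and exception names are hypothetical, and the endpoint URL and the fmt value used in the usage note are assumptions, since the accepted format names are not listed here. It posts the uri and fmt parameters and branches on the HTTP status code alone.

```python
import urllib.error
import urllib.parse
import urllib.request


class CitationError(Exception):
    """Raised when the service reports a failure (e.g. HTTP 500)."""


def fetch_citation(endpoint, uri, fmt, opener=urllib.request.urlopen):
    """POST uri and fmt to a WebCite endpoint.

    Returns (content_type, body) on 200, None on 404, and raises
    CitationError on any other error status.
    """
    data = urllib.parse.urlencode({"uri": uri, "fmt": fmt}).encode("ascii")
    request = urllib.request.Request(endpoint, data=data, method="POST")
    try:
        with opener(request) as response:
            return response.headers.get("Content-Type"), response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Rely on the status code alone; ignore any message in the body.
            return None
        # For a 500 the payload is plain text giving the error message.
        message = err.read().decode("utf-8", errors="replace")
        raise CitationError(f"HTTP {err.code}: {message}") from err
```

A call such as fetch_citation('http://localhost/bibliotech', 'http://example.org/article', 'ris') would then return None when no citation is available. The opener parameter exists only so the function can be exercised without a running server.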
The Apache configuration block for WebCite is as follows:
PerlSwitches -I/var/www/perl/...
<Location /bibliotech>
SetHandler perl-script
PerlHandler Bibliotech::WebCite
</Location>
The PerlSwitches line should point to the directory that contains the Bibliotech directory.
This may or may not appear in the same Apache configuration as Connotea Code proper.
A suggested configuration for /etc/bibliotech.conf is as follows:
WEBCITE {
CACHE_ENABLED = true
CACHE_PATH = '/var/cache/webcite'
CACHE_TIMEOUT = 7776000
LOG_ENABLED = true
LOG_FILE = '/var/log/webcite.log'
}
Again, this may be a file exclusively for WebCite or a file with intermixed configuration for Connotea Code proper.
You should create the cache directory and log file and give the Apache user write access before starting Apache.
Acknowledgments
The look, structure, documentation and source code of http://www.connotea.org/ are the collective work of Martin Flack, Ben Lund, Timo Hannay, Joanna Scott, Stefania Bojano, Grant Farrelly, Euan Adie, and Ian Mulvany. The vast majority of the programming was done by Martin Flack of NeoReality, Inc., http://www.neoreality.com/.
The materials available from http://sf.net/projects/connotea are released under the GNU General Public License; reuse of all other materials requires the express written permission of Nature Publishing Group.
More Information
Please visit this URL for more information: http://www.connotea.org/code
If you have questions, email us or try our mailing lists: http://www.connotea.org/contact