Ben Good pointed out that Connotea’s import mechanism for RIS files generated from EndNote was mangling tags, especially tags such as MeSH terms.
We have now fixed this, but the fix is somewhat non-trivial.
The main reason for the problem is that we were using the same tag comprehension code that is used in the “Add To Connotea” pop-up. This assumes that “multiple words in quotes” are one tag, and that lots, of, comma, separated, words are individual tags. In addition, since you access tags in Connotea through the URL, we have to throw away forward slashes, since a tag with a forward slash in it is going to confuse the URL resolver in Connotea.
Now, when it comes to PubMed records, it looks as if all of these rules were specifically chosen to break the way PubMed records describe tags :/
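To make the clash concrete, here is a rough Python sketch of the pop-up rules as described above. It is not Connotea's actual code: the function name `popup_style_tags` is made up for illustration, and it assumes that commas are the only separator outside of double quotes.

```python
import re

def popup_style_tags(text):
    """Hypothetical sketch of the pop-up rules described in the post."""
    # Forward slashes are thrown away because tags end up in URLs.
    text = text.replace("/", "")
    tags = []
    # A double-quoted phrase is one tag; anything else is split on commas.
    for quoted, bare in re.findall(r'"([^"]*)"|([^,"]+)', text):
        tag = (quoted or bare).strip()
        if tag:
            tags.append(tag)
    return tags

# A MeSH term with a comma is split in two under these rules.
print(popup_style_tags("Models, Genetic"))  # ['Models', 'Genetic']
```

Under the same rules, a keyword with slash-separated qualifiers loses its slashes, so the descriptor and its qualifiers run together into one long tag.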
Ben describes the problem very well:
In Pubmed records, and in the Endnote records, /’s are used to separate descriptors such as “Transcription Factors” from qualifiers such as “antagonists & inhibitors” and “metabolism”. For example, you might see a keyword listed as “Transcription Factors/antagonists & inhibitors/metabolism”. When imported, Connotea strips the slashes from the tag and thus adds the tag “Transcription Factorsantagonists & inhibitors metabolism” to the post.
So now we deal correctly with these tags, yay!
MeSH terms sometimes contain commas like “Models, Genetic”. When imported, these compound terms get split into multiple separate tags (Models and Genetic).
That’s because our comma separation parsing used to take precedence over our parsing of collective terms, but we have fixed that now.
In addition, it appears that quite a few people have managed to import the “Research Support” aspect of PubMed records as well. This is why you see more than a thousand bookmarks with the rather misleading tag “Non-U.S. Gov’t”, often also tagged with the seemingly contradictory “U.S. Gov’t”. (This happens when the research in the paper had both U.S. and non-U.S. funding.)
We decided to leave this alone, as solving this problem requires understanding what the tags mean, and the context in which they appear. OK, so you can’t win every time. I guess we are just going to have to wait for the semantic web!
p.s. You will often see a star (*) prepended to tags imported in this manner, such as “*Genes”. This indicates that the starred term is a major topic (as opposed to a minor topic) in the manuscript according to MEDLINE indexing.
OK, so now we strip leading stars from tag names, so that an imported ‘*gene’ becomes the tag ‘gene’ and can connect to all of the other items that have been tagged with ‘gene’ by users.
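Star stripping is about the simplest of these fixes. A minimal Python sketch, assuming the star is a literal asterisk character at the start of the keyword (the function name `strip_major_topic_star` is hypothetical, not something from the Connotea code base):

```python
def strip_major_topic_star(tag):
    """Drop leading '*' major-topic markers from an imported keyword."""
    # lstrip removes only leading asterisks; stars inside the tag are kept.
    return tag.lstrip("*").strip()

print(strip_major_topic_star("*Genes"))  # Genes
```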
In a way, everything above is pretty straightforward. Now things get a little interesting.
Martin, our developer, points out the following behaviour:
“One of the annoying parts of import is that if the keywords are separated by newlines but a tag with commas was collapsed into two, it would likely merge with other tags on the first or second term and then be tedious for the user to pick out later no matter what the UI.
In the RIS importer I’ve added a heuristic test which allows splitting on commas, except where it sort of looks like newlines are being used to demarcate the tags.
Here are some examples:
(1)
KW - aaa, bbb, ccc, ddd, eee, fff, ggg, hhh,
iii, jjj, kkk, lll, mmm, nnn, ooo, ppp
(2)
KW - aaa
bbb
ccc
ddd
Transcription, Genetic
eee
fff
ggg
hhh
iii
jjj
kkk
lll
mmm
nnn
ooo
ppp
(3)
KW - aaa, bbb
ccc, ddd
eee, fff
(Describing this in a blog post is a bit hard: you have to ignore the extra lines between text lines, as the blog parser in this system treats whitespace in a funny way, but I hope you see what we mean.)
So in (1), sixteen tags are evident. In (2), the same tags appear, and I’ve added a comma-containing tag to show how they appear in dang.txt from our friend Ben. Clearly in (2) the newline is supposed to be the separator, not the comma. However, if you eat commas as part of tag names then (1) will fail.
The heuristic I came up with is that if there are at least three lines and no line runs longer than 60 characters then it should be treated as newline-separated and include the commas in the tag names. Otherwise, separate on commas as well.
This makes (3) not do the right thing, so it’s up to you if you think this will help or hurt. (3) IMHO is not likely to be computer generated… a computer would either write one per KW line (avoiding all this), split on newlines, or fill up lines to ~80 chars split on commas and add in newlines to keep going. All of which work with my test.”
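Martin's heuristic can be sketched roughly like this in Python. This is an illustration, not the importer's real code. The name `split_kw_value` and its parameters are hypothetical, the input is assumed to be the keyword text with the "KW - " prefix already removed, and "no longer than 60 characters" is read as a length of at most 60 after trimming whitespace.

```python
def split_kw_value(value, max_line_len=60, min_lines=3):
    """Split a multi-line RIS keyword value into tags (illustrative sketch)."""
    lines = [ln.strip() for ln in value.splitlines() if ln.strip()]
    # Heuristic from the post: at least three lines, none of them long,
    # suggests that newlines (not commas) demarcate the tags.
    newline_delimited = (
        len(lines) >= min_lines
        and all(len(ln) <= max_line_len for ln in lines)
    )
    if newline_delimited:
        return lines  # commas stay inside tag names
    # Otherwise separate on commas as well as newlines.
    tags = []
    for ln in lines:
        tags.extend(part.strip() for part in ln.split(",") if part.strip())
    return tags

example_1 = ("aaa, bbb, ccc, ddd, eee, fff, ggg, hhh,\n"
             "iii, jjj, kkk, lll, mmm, nnn, ooo, ppp")
example_2 = ("aaa\nbbb\nccc\nddd\nTranscription, Genetic\neee\nfff\nggg\n"
             "hhh\niii\njjj\nkkk\nlll\nmmm\nnnn\nooo\nppp")
example_3 = "aaa, bbb\nccc, ddd\neee, fff"

print(len(split_kw_value(example_1)))  # 16
print(len(split_kw_value(example_2)))  # 17
print(split_kw_value(example_3))       # ['aaa, bbb', 'ccc, ddd', 'eee, fff']
```

Example (1) has only two lines, so it is split on commas into sixteen tags. Example (2) is treated as newline-separated and keeps “Transcription, Genetic” whole. Example (3) shows the known weakness: three short lines mean the commas are kept inside the tag names.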
And so that’s how we have left it for now.
Connotea Blog