
An editable database tracking freely accessible mathematics literature

January 3, 2014

Posted by Scott Morrison in papers, publishing, Uncategorized, websites.

(This post continues a discussion started by Tim Gowers on google+. [1] [2])

(For the impatient, go visit http://tqft.net/mlp, or for the really impatient http://tqft.net/mlp/wiki/Adv._Math./232_(2013).)

It would be nice to know how much of the mathematical literature is freely accessible. Here by ‘freely accessible’ I mean “there is a URL which, in any browser anywhere in the world, resolves to the contents of the article”. (And my intention throughout is that this article is legitimately hosted, either on the arxiv, on an institutional repository, or on an author’s webpage, but I don’t care how the article is actually licensed.) I think it’s going to be okay to not worry too much about discrepancies between the published version and a freely accessible version — we’re all grown ups and understand that these things happen. Perhaps a short comment field, containing for example “minor differences from the published version” could be provided when necessary.

This post outlines an idea to achieve this, via a human editable database containing the tables of contents of journals, and links, where available, to a freely accessible copy of the articles.

It’s important to realize that the goal is *not* to laboriously create a bad search engine. Google Scholar already does a very good job of identifying freely accessible copies of particular mathematics articles. The goal is to be able to definitively answer questions such as “which journals are primarily, or even entirely, freely accessible?”, to track progress towards making the mathematical literature more accessible, and finally to draw attention to, and focus enthusiasm for, such progress.

I think it’s essential, although this is not obvious, that at first the database is primarily created “by hand”. Certainly there is scope for computer programs to help a lot! (For example, by populating tables of contents, or querying google scholar or other sources to find freely accessible versions.) Nevertheless curation at the per-article level will certainly be necessary, and so whichever route one takes it must be possible for humans to edit the database. I think that starting off with the goal of primarily human contributions achieves two purposes: one, it provides an immediate means to recruit and organize interested participants, and two, it allows much more flexibility in the design and organization of the collected data — hopefully many eyes will reveal bad decisions early, while they’re easy to fix.

That said, we had better remember that eventually computers may be very helpful, and avoid design decisions that make computer interaction with the database difficult.

What should this database look like? I’m imagining a website containing a list of journals (at first perhaps just one), and for each journal a list of issues, and for each issue a table of contents.

The table of contents might be very simple, having as few as four columns: the title, the authors, the link to the publisher’s webpage, and a freely accessible link, if known. All these lists and table of contents entries must be editable by a user — if, for example, no freely accessible link is known, this fact should be displayed along with a prominent link or button which allows a reader to contribute one.
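To make this concrete, here is a minimal sketch (in Python; the field names and placeholder values are purely illustrative, not a fixed schema) of what a single entry might hold:

```python
# One table-of-contents entry, sketched as a plain record.
# Field names and values are illustrative only.
entry = {
    "title": "An example article title",
    "authors": ["A. Author", "B. Author"],
    "published_url": "https://publisher.example/article/123",  # publisher's page
    "free_url": None,   # None meaning "no freely accessible copy known yet"
    "comment": "",      # e.g. "minor differences from the published version"
}
```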

At this point I think it’s time to consider what software might drive this website. One option is to build something specifically tailored to the purpose. Another is to use an essentially off-the-shelf wiki, for example tiddlywiki, as Tim Gowers used when analyzing an issue of Discrete Math.

Custom software is of course great, but it takes programming experience and resources. (That said, perhaps not much — I’m confident I could make something usable myself, and I know people who could do it in a more reasonable timespan!) I want to essentially ignore this possibility, and instead use mediawiki (the wiki software driving wikipedia) to build a very simple database that is readable and editable by both humans and computers. If you’re impatient, jump to http://tqft.net/mlp and start editing! I’ve previously used it to develop the Knot Atlas at http://katlas.org/ with Dror Bar-Natan (and subsequently many wiki editors). There we solved a very similar set of problems, achieving human readable and editable pages, with “under the hood” a very simple database maintained directly in the wiki.


Comments

1. Charles Rezk - January 3, 2014

Neat.

So I’ve tried editing, at http://tqft.net/mlp/wiki/Algebr._Geom._Topol./13_%282013%29,_no._2 . Was successful, though the output doesn’t look right (it appears to be doubled).

2. Scott Morrison - January 3, 2014

Hi Charles, thanks for trying it out!

I’d been expecting that the FreeURL and PublishedURL textboxes would only ever receive a plain URL, but you’ve put in some extra information as well. I changed my templates so it displays correctly in any case — if you go back to http://tqft.net/mlp/wiki/Algebr._Geom._Topol./13_%282013%29,_no._2 you’ll see your full contributions show up properly.

This does raise an interesting problem. Having more information (e.g. your annotation “identical to the published version”) is really good. On the other hand, I have many applications in mind where it’s important that the database we build really is computer readable, and so there’s value to having uniform rules like “this field must contain a URL, and only a URL”.

3. porton - January 3, 2014

Converting from MediaWiki to a normalized SQL DB would be a headache.

It should be a normalized database from the very beginning. (This means that the title, author, URL, etc. should be separate fields from the start, not just text in MediaWiki.) There should be separate input controls for each field, rather than MediaWiki.
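For concreteness, a rough sketch of the kind of normalized schema being suggested (in Python with sqlite3; the table and column names are hypothetical, not part of any existing system):

```python
import sqlite3

# A rough sketch of a normalized schema; names are hypothetical.
conn = sqlite3.connect("mlp.sqlite")
conn.executescript("""
CREATE TABLE IF NOT EXISTS journal (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS article (
    id            INTEGER PRIMARY KEY,
    journal_id    INTEGER REFERENCES journal(id),
    issue         TEXT,
    title         TEXT NOT NULL,
    authors       TEXT,  -- could be normalized further into its own table
    published_url TEXT,
    free_url      TEXT,  -- NULL means no freely accessible copy known
    comment       TEXT
);
""")
conn.commit()
```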

I am a PHP+MySQL programmer who has recently lost a job (because the project I worked on was closed). Hire me. I would quickly and easily do this project using the Yii framework.

4. Scott Morrison - January 3, 2014

Hi Porton,

it’s not as bad as it might look. All the interesting data, e.g. user-added URLs, are on their own pages, with no other content, e.g. http://tqft.net/mlp/wiki/Data:MR3031512/FreeURL. It’s possible for anyone to quickly grab the content, without any mediawiki markup, via a URL like http://tqft.net/mlp/index.php?title=Data:MR3031512/FreeURL&action=raw. These pages are then transcluded into the user-viewable pages.

I’ll soon write a little script that grabs all these pages and produces an easy-to-parse (CSV? XML? JSON?) database of all the content.
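For illustration, here is a minimal sketch of such an export script in Python (the list of identifiers, the choice of field pages, and the CSV output are all assumptions on my part; only the action=raw URL pattern comes from the example above):

```python
import csv
import urllib.error
import urllib.request

# Sketch: fetch each Data: page in raw form and collect the results into a CSV.
# The identifier list is a placeholder; in practice it would be enumerated from
# the wiki, and "PublishedURL" is assumed to exist alongside "FreeURL".
BASE = "http://tqft.net/mlp/index.php?title=Data:{ident}/{field}&action=raw"
identifiers = ["MR3031512"]
fields = ["FreeURL", "PublishedURL"]

with open("mlp_export.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id"] + fields)
    for ident in identifiers:
        row = [ident]
        for field in fields:
            url = BASE.format(ident=ident, field=field)
            try:
                with urllib.request.urlopen(url) as response:
                    row.append(response.read().decode("utf-8").strip())
            except urllib.error.URLError:
                row.append("")  # page missing or network problem; leave blank
        writer.writerow(row)
```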

I think there are really good arguments for using wiki software for this project. (As examples, I added OpenID login in <5 minutes, and we already have revision control baked in.) Further, there's no money available for programmers, anyway!

5. Kevin Walker - January 3, 2014

@Scott #2: Isn’t it very easy for a program to parse a blob of text and extract only the URLs? In other words

“http://arxiv.org/blah — differs from published version”

and

“yadda yadda http://asdf.edu/~whoever/article.pdf yadda yadda”

are not significantly harder to parse than

“http://just.a/plain/url” .

I suppose that if you also wanted to extract helpful remarks like “differs from published version”, then encouraging a uniform format might be helpful.

Or maybe provide some checkboxes on the page where one pastes in the URL, something like:

Check all that apply:
( ) Essentially the same as published version
( ) Differs from published version
( ) On author’s web page

6. Scott Morrison - January 3, 2014

@Kevin, you’re right, it’s easy enough, and probably a good trade-off. No rules for the humans, we get the detailed information which is useful, and the computer has to do slightly more work.
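For what it’s worth, a minimal sketch of that extra work in Python (the function name, the regex, and the trimming rules are just one way to do it):

```python
import re

# Pull the first URL out of a free-text field, keeping the surrounding
# annotation as a comment.  Purely illustrative.
URL_PATTERN = re.compile(r"https?://\S+")

def split_url_and_comment(text):
    """Return (url, comment) from a field that may mix a URL with remarks."""
    match = URL_PATTERN.search(text)
    if match is None:
        return None, text.strip()
    url = match.group(0).rstrip(".,;)")  # trim trailing punctuation
    comment = (text[:match.start()] + text[match.end():]).strip(" -")
    return url, comment

# Examples adapted from Kevin's comment:
print(split_url_and_comment("http://arxiv.org/blah - differs from published version"))
print(split_url_and_comment("yadda yadda http://asdf.edu/~whoever/article.pdf yadda yadda"))
print(split_url_and_comment("http://just.a/plain/url"))
```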

The checkboxes I don’t think I can do in the present (mediawiki-based) implementation. If this takes off, making a v2 with custom software is probably a good idea. For now I’m just going to see where we can go with this very simple implementation.

7. Yemon Choi - January 4, 2014

That deletion log of spambots looks somewhat tedious…

8. Noah Snyder - January 4, 2014

It’s really interesting to have some data on how frequent arxiv posting is. The rate of arxiving is actually significantly higher than I would have thought it was.

9. Scott Morrison - January 4, 2014

@Yemon, right after deleting, I installed some automatic anti-spam measures, which stopped the ongoing attack. Let’s not get depressed yet.

10. Yemon Choi - January 4, 2014

@Noah: in due course, I’d be interested to find out whether the arxiving rate is higher for GAFA than for JFA, and if so, why that might be

11. Dmitri Pavlov - January 5, 2014

I noticed that the wiki automatically picks up arXiv papers if they contain a DOI. Perhaps one can also set it up to automatically add arXiv papers if both titles and authors match exactly?

Scott Morrison - January 5, 2014

Hi Dmitri,

we’re working on this! Matching authors and titles turns out to be pretty hard; sometimes very minor title differences don’t matter at all, and sometimes they matter a lot. This is definitely the next stage of automation, however.
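To make the difficulty concrete, here is a naive sketch of title matching in Python (the normalization, the threshold, and the example titles are all illustrative); the hard part is exactly that no single threshold cleanly separates harmless differences from meaningful ones:

```python
import difflib
import re

def normalize(title):
    """Lowercase, drop punctuation, collapse whitespace."""
    stripped = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", stripped).strip()

def titles_probably_match(a, b, threshold=0.9):
    # The 0.9 threshold is an arbitrary illustrative choice.
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# A very minor difference should still match...
print(titles_probably_match("On the classification of subfactors",
                            "On the Classification of Subfactors."))
# ...while a genuinely different title should not.
print(titles_probably_match("On the classification of subfactors",
                            "A new invariant of 3-manifolds"))
```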

12. Noah Snyder - January 5, 2014

Maybe there should be an extra entry for comments. Currently when people add comments it’s breaking automation (especially making things green when they should be red). Another option would be to make anything that doesn’t have a link red, instead of requiring the magic words “none available”.

