What is Bookworm arXiv?
Bookworm arXiv demonstrates a new way of interacting with over 700,000 scientific articles published on arXiv since 1992. The Harvard Cultural Observatory previously collaborated with Google Books on the Google Ngram Viewer. By using openly circulated preprints, Bookworm lets you create interesting queries and comparisons across corpora, with direct access to the original texts.
What texts does this use?
This site builds on the amazing work of arXiv.org. ArXiv provides open access to over 700,000 e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. When you build a corpus, you can see exactly how many texts you are searching in the construction box.
How does it work?
You can compare different words in the same collection of articles, the same word across multiple collections, or a combination of the two. Enter your search word in the text box and click on 'Edit' to change the article collection. Bookworm will color-code your searches as you enter them. Hitting 'Return' will create a new graph using your options.
The graph displays results: click on a point to see the articles that best match your search terms for that series and month. You can read the article at arXiv by clicking 'Read'.
- Term(s) - Words to query. If you type more than one (separated by commas) the viewer will add the results for all your words together. If you want to compare two words, just add another field. We currently have data for one- and two-word phrases; we're thinking about ways to add more.
- Archive - These are the main categories of arXiv articles, as specified at arxiv.org.
- Subject class - The arXiv subject classes are much more specific than the archive, but apply to fewer articles.
- Email domain - ArXiv metadata provides the email address of the submitting author. Here, you can also search by top-level email domain, e.g. 'edu' or 'uk'.
- Author email institution - You can also search by the second-level email domain, which usually gives the author's institution, e.g. 'harvard.edu' or (for academic institutions in some countries) 'u-tokyo.ac.jp'.
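As an illustration of the two email fields above, here is a minimal Python sketch of how a submitter address could be split into a top-level domain and an institution domain. This is not the site's actual code: the function name `email_domains` and the `GENERIC_SECOND_LEVEL` list are hypothetical, and the three-label rule for addresses like 'u-tokyo.ac.jp' is only a rough heuristic.

```python
# Hypothetical list of generic second-level labels used under country-code
# domains (an assumption for this sketch, not taken from Bookworm itself).
GENERIC_SECOND_LEVEL = {"ac", "edu", "co", "or", "go"}


def email_domains(address):
    """Return (top_level_domain, institution_domain) for an email address."""
    # Keep only the host part after the last '@' and normalize its case.
    host = address.rsplit("@", 1)[-1].strip().lower()
    labels = host.split(".")
    top_level = labels[-1]

    # Heuristic: for two-letter country domains with a generic second-level
    # label (e.g. 'ac.jp'), the institution name is the third label from the end.
    needs_three = (
        len(labels) >= 3
        and len(top_level) == 2
        and labels[-2] in GENERIC_SECOND_LEVEL
    )
    keep = 3 if needs_three else 2
    institution = ".".join(labels[-keep:])
    return top_level, institution


if __name__ == "__main__":
    print(email_domains("someone@physics.harvard.edu"))
    print(email_domains("someone@phys.s.u-tokyo.ac.jp"))
```

A production version would more likely consult a maintained list of public suffixes than a short hand-written set.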
Advanced Options
- Quantity - By default, we show the number of times your search terms appear per one million words published that month. Selecting 'Raw Counts' will give you the actual number of times your search terms appear: this can be interesting to use with a word like 'of' if you just want to get a sense of how many words are in the full sample, or a particular subset.
- Capitalization - You can choose whether or not to search case-sensitively. (If not, the program will also return a few other very similar forms: a search for 'the' will also include the French word 'thé', for example.)
- Smoothing - To reduce noise in the series, smoothing enables you to take a moving-window average across several months of articles. The moving average is more heavily weighted to the center of the window.
- Submit - To see the results of setting the advanced options, click 'Search'.
- Export - The 'Export' function allows you to access the data underlying your final chart in CSV format.
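The 'Quantity' and 'Smoothing' options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the site's implementation: the function names `per_million` and `smooth` are hypothetical, and the triangular weights are just one way to weight a moving average more heavily toward the center of the window (the page does not say which weights Bookworm uses).

```python
def per_million(term_counts, total_words):
    """Convert monthly raw counts into occurrences per one million words."""
    return [
        1_000_000 * count / total if total else 0.0
        for count, total in zip(term_counts, total_words)
    ]


def smooth(series, half_width):
    """Center-weighted moving average over 2 * half_width + 1 months.

    Assumption: triangular weights, so the center month gets weight
    half_width + 1 and the weight falls by 1 for each month further away.
    Near the ends of the series the window is simply truncated.
    """
    n = len(series)
    smoothed = []
    for i in range(n):
        weighted_sum = 0.0
        weight_total = 0.0
        for j in range(max(0, i - half_width), min(n, i + half_width + 1)):
            weight = half_width + 1 - abs(i - j)
            weighted_sum += weight * series[j]
            weight_total += weight
        smoothed.append(weighted_sum / weight_total)
    return smoothed


if __name__ == "__main__":
    # Made-up monthly counts for one search term and for all words published.
    counts = [12, 30, 18, 24]
    totals = [4_000_000, 5_000_000, 6_000_000, 6_000_000]
    print(smooth(per_million(counts, totals), half_width=1))
```

With `half_width=0` the series is returned unchanged, which corresponds to no smoothing.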
Disclaimers and Acknowledgements
- Beta - This is a proof-of-concept beta to show how new interfaces can unlock trends in scientific articles. There are still OCR misreadings, missing metadata fields, and all sorts of exciting problems. We've solved most of the really easy ones: think of all the rest of them as invitations to learn more about the condition of our digital resources.
- Thanks - Professor Paul Ginsparg provided the complete arXiv and metadata along with valuable feedback. Hosting and support have been provided by the Open Cloud Consortium's Open Science Data Cloud (opencloudconsortium.org). Martin Camacho, Benjamin Schmidt, and Neva Cherniavsky designed and authored the front-end web site and the back-end database. The directors of the Harvard Cultural Observatory and drivers of the Bookworm arXiv project are Jean-Baptiste Michel and Erez Lieberman Aiden.