Programmatic access to Wikipedia
Wikipedia is often regarded as today's best-known source of collaborative intelligence. As such, it is also an excellent subject for research in the domain of "distributed cognition." In this tutorial we will learn how to access the data behind Wikipedia programmatically through its Web API.
Web application programming interfaces (APIs) are the standard means of communication in the Web 2.0 environment. There are several variants, but the most basic one works as follows:
- The client (for example, the web browser) requests data by sending an HTTP request to the server, optionally passing parameters in the query string.
- The server returns the result in a well-defined format, usually XML or JSON.
The description of the API methods should somehow be made available to the client.
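The request side of this exchange can be sketched in a few lines. The snippet below is only an illustration, written against the Python 3 standard library; the endpoint `http://example.org/api` and the parameter names `action=lookup` and `format=json` are hypothetical and do not belong to any real service. It shows parameters being packed into a query string and then recovered from it, which is roughly what the server does on its side.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_request_url(endpoint, params):
    """Append the parameters to the endpoint as a URL-encoded query string."""
    return endpoint + "?" + urlencode(params)


def read_query_params(url):
    """Recover the parameters from a request URL, much as a server would."""
    query = urlsplit(url).query
    # parse_qs returns a list of values per key; keep the single value sent.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical endpoint and parameters, for illustration only.
url = build_request_url(
    "http://example.org/api", {"action": "lookup", "format": "json"}
)
print(url)
print(read_query_params(url))
```

Nothing is sent over the network here; the point is only that the whole request fits into a single URL.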
MediaWiki, the wiki engine behind Wikipedia and many other collaborative projects, exposes a public Web API whose entry point varies but in general looks like this: http://SITE/.../api.php.
For English Wikipedia, it is http://en.wikipedia.org/w/api.php, while for the Polymath project it is http://michaelnielsen.org/polymath1/api.php. Pointing your browser to this address will give you the complete documentation of the API. To get a feel for how it works, let's consider some examples.
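To fetch a random page, call http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0. The parameters are:
- action=query: this is what you will use most of the time, unless you want to edit data (when developing bots, for example)
- list=random: instructs the API to choose a random page
- rnnamespace=0: instructs the API to select a page in the namespace with id=0. More on that later.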
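The three parameters above can be assembled into a request as in the following minimal sketch. It uses the Python 3 standard library, and the helper names `build_call` and `fetch` as well as the User-Agent string are our own assumptions, not part of the MediaWiki API. Sending a User-Agent header is a precaution, since public wikis may reject anonymous scripts.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Entry point for English Wikipedia, as given in the text.
API_URL = "http://en.wikipedia.org/w/api.php"

# The parameters of the random-page call described above.
RANDOM_PAGE = {"action": "query", "list": "random", "rnnamespace": 0}


def build_call(params, endpoint=API_URL):
    """Return the full request URL for the given API parameters."""
    return endpoint + "?" + urlencode(params)


def fetch(url, user_agent="wiki-tutorial-example/0.1"):
    """Send the request and return the response body as text.

    Requires network access; the User-Agent value is a made-up placeholder.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


url = build_call(RANDOM_PAGE)
print(url)
# To actually send the request, uncomment the next line:
# print(fetch(url))
```

Pasting the printed URL into a browser should have the same effect as calling `fetch`.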
The return value from the above call is a data structure describing the selected page: its id, its namespace, and its title.
The format of the data can be controlled by passing the "format" parameter to each API call. Since in this example we didn't supply this parameter, the default value "xmlfm" was used, which means "XML pretty-printed in HTML." You would mostly use this for debugging. In real applications, you will probably specify "xml" or "json."
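To make the difference between the two machine-readable formats concrete, the sketch below extracts page titles from a random-page result in both "xml" and "json". The two sample payloads are hand-written approximations of the general response shape, with a made-up id and title; they are not captured output, and a live response may carry additional fields.

```python
import json
import xml.etree.ElementTree as ET

# Hand-written samples of the general response shape (id and title are made up).
SAMPLE_XML = (
    '<api><query><random>'
    '<page id="12345" ns="0" title="Example page" />'
    '</random></query></api>'
)
SAMPLE_JSON = (
    '{"query": {"random": [{"id": 12345, "ns": 0, "title": "Example page"}]}}'
)


def titles_from_xml(text):
    """Collect the title attribute of every <page> element."""
    root = ET.fromstring(text)
    return [page.get("title") for page in root.iter("page")]


def titles_from_json(text):
    """Collect the title of every entry in the query/random list."""
    data = json.loads(text)
    return [page["title"] for page in data["query"]["random"]]


print(titles_from_xml(SAMPLE_XML))
print(titles_from_json(SAMPLE_JSON))
```

Both functions yield the same list, so the choice between "xml" and "json" mostly comes down to which parser is more convenient in your application.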
Each resource in MediaWiki belongs to a single namespace, for example:
- id=0: normal pages (i.e., content pages)
- id=1: talk pages
- id=14: category pages
To see a complete list of namespaces, issue the following call: http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces
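A namespace listing can be requested and decoded roughly as follows. This is a sketch under assumptions: it asks for site information with `meta=siteinfo` and `siprop=namespaces`, adds `format=json` for easier parsing, and assumes the JSON layout in which each namespace's local name is stored under the key `"*"`. The sample payload is a hand-written subset covering only the three namespaces mentioned above, and the `"(main)"` label for the unnamed namespace 0 is our own choice.

```python
import json
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"

NAMESPACE_PARAMS = {
    "action": "query",
    "meta": "siteinfo",
    "siprop": "namespaces",
    "format": "json",
}

# Hand-written subset of a namespaces response; a real one has more entries
# and more fields per entry.
SAMPLE = """{"query": {"namespaces": {
  "0": {"id": 0, "*": ""},
  "1": {"id": 1, "*": "Talk"},
  "14": {"id": 14, "*": "Category"}
}}}"""


def namespace_names(text):
    """Map each namespace id to its name; the main namespace has no name."""
    namespaces = json.loads(text)["query"]["namespaces"]
    return {entry["id"]: entry["*"] or "(main)" for entry in namespaces.values()}


namespaces_url = API_URL + "?" + urlencode(NAMESPACE_PARAMS)
print(namespaces_url)
print(namespace_names(SAMPLE))
```

The resulting dictionary can then be used to choose a value for parameters such as rnnamespace.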
Prerequisites for the Python examples in this tutorial:
- Python 2.5+
- python-wikitools package, which is available at http://code.google.com/p/python-wikitools/.
©2010 ACM 1528-4972/10/1200 $10.00