An overview of R libraries to query Wikidata
R is a free programming language for statistical computing. This post gives an overview of the R libraries that let you query Wikidata and fetch data from it.
Libraries
WikidataR 1.4.0
Note: WikidataR has been forked and is actively maintained; the fork is now available as version 2.1.3. That version is not covered here yet, though it is worth noting that it bundles several packages, including WikidataQueryServiceR, described in the next section.
WikidataR is the only R library that targets the Wikidata part of the MediaWiki API. While it is basic (not all features of the API are covered), it works well and has neat documentation. You can get a specific item or property with `get_item` and `get_property`, get random items or properties with `get_random_item` and `get_random_property` (with an optional parameter to fetch several elements at once), and find items and properties by their labels and aliases using `find_item` and `find_property`. WikidataR otherwise fully supports the Wikibase data model, but it is out of date: lexicographical data is not implemented yet.
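As a rough illustration of the functions above, here is a minimal sketch that looks up the label of an item. The helper name `item_label`, its `fetch` and `language` parameters, and the assumption that `get_item` returns a list whose first element carries a `labels` field are mine rather than taken from the WikidataR documentation, so check them against the version you have installed.

```r
# Return the label of a Wikidata item in the given language, or NA if missing.
# `fetch` defaults to WikidataR::get_item; it is exposed as a parameter
# (hypothetical design choice) so the helper can be tried with a stand-in
# function when the package or the network is not available.
item_label <- function(id, fetch = WikidataR::get_item, language = "en") {
  items <- fetch(id)
  # Assumed structure: a list of entities, each with labels$<lang>$value.
  label <- items[[1]]$labels[[language]]$value
  if (is.null(label)) NA_character_ else label
}
```

With WikidataR installed and network access, `item_label("Q42")` should return the English label of that item, and `find_item("Douglas Adams")` would be the way to discover the identifier in the first place.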
WikidataQueryServiceR 1.0.0
WikidataQueryServiceR is an R library that targets the Wikidata Query Service (WDQS), the official SPARQL endpoint of Wikidata. It provides a simple function, `query_wikidata`, that returns the results of a query.
Since version 1.0.0, this library also provides a function named `get_example` to get queries from the examples page. It scrapes the examples so that you can then run them with `query_wikidata`.
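The two functions can be chained, as in the sketch below. The wrapper name `run_example` and its `get_query` and `run_query` parameters are hypothetical; I am also assuming that `get_example` takes the title of an example as its first argument and returns the query text, which you should confirm in the package manual.

```r
# Fetch a named query from the WDQS examples page, then run it.
# The two package functions are passed in as defaults (an assumption made
# for this sketch) so the wrapper can be exercised with stand-ins.
run_example <- function(name,
                        get_query = WikidataQueryServiceR::get_example,
                        run_query = WikidataQueryServiceR::query_wikidata) {
  query <- get_query(name)  # SPARQL text of the example
  run_query(query)          # whatever the query function returns for it
}
```

A call would look like `run_example("Cats")`, where "Cats" is assumed to be the title of an example currently listed on the examples page.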
SPARQL package 1.16
The SPARQL package is a generic R library that allows you to query any SPARQL endpoint. Its advantage over WikidataQueryServiceR is that you can use it to query several SPARQL endpoints, not only Wikidata's. Surprisingly, this library is much slower than WikidataQueryServiceR, even when your query returns only a few thousand results.
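If you want to check the speed difference on your own queries, a small timing helper is enough. This is only a sketch: `time_query` and its `times` parameter are names I made up, and the measurements will also depend on network conditions and on the load of the endpoint.

```r
# Run a zero-argument function several times and return the median
# elapsed wall-clock time in seconds.
time_query <- function(run_query, times = 3) {
  elapsed <- vapply(
    seq_len(times),
    function(i) system.time(run_query())[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

# Hypothetical usage (needs both packages and network access):
# query <- 'SELECT ?item WHERE { ?item wdt:P31 wd:Q7889 } LIMIT 1000'
# time_query(function() WikidataQueryServiceR::query_wikidata(query))
# time_query(function() SPARQL::SPARQL(
#   "https://query.wikidata.org/sparql", query,
#   curl_args = list(useragent = "User Agent Example")))
```

Keeping `times` small avoids sending many identical requests to the endpoint.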
Miscellaneous
This post only covers general-purpose libraries (I may have missed some!). More specific libraries that allow you to retrieve data from Wikidata exist, like wikitaxa for taxonomic data, or webchem for chemical data.
Examples
WikidataR 1.4.0
As the documentation of WikidataR is short and covers nearly everything, I’ll let you read it!
WikidataQueryServiceR 1.0.0
Install the library and load it:
> install.packages("WikidataQueryServiceR")
> library(WikidataQueryServiceR)
Query all video games with a publication date, keeping only the earliest date per video game (a video game can have several publication dates, depending on the platform or the geographical area of publishing):
> r <- query_wikidata('
SELECT ?item ?itemLabel (MIN(?_date) AS ?date) (MIN(?_year) AS ?year) {
  ?item wdt:P31 wd:Q7889 ;
        wdt:P577 ?_date .
  BIND(YEAR(?_date) AS ?_year) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?item ?itemLabel
HAVING (?year > 1)
')
25817 rows were returned by WDQS
Display the first ones:
> head(r)
                                  item                      itemLabel                 date year
1 http://www.wikidata.org/entity/Q2374               Civilization III 2001-10-30T00:00:00Z 2001
2 http://www.wikidata.org/entity/Q2377                Civilization IV 2005-10-25T00:00:00Z 2005
3 http://www.wikidata.org/entity/Q2385                 Civilization V 2010-09-21T00:00:00Z 2010
4 http://www.wikidata.org/entity/Q2387    Commandos 2: Men of Courage 2001-09-20T00:00:00Z 2001
5 http://www.wikidata.org/entity/Q2440 Freedom Force vs the 3rd Reich 2005-03-08T00:00:00Z 2005
6 http://www.wikidata.org/entity/Q2450    Heroes of Might and Magic V 2006-05-16T00:00:00Z 2006
Display the number of games published each year:
> barplot(table(r$year), col = "dodgerblue3", xlab = "year", ylab = "count")
SPARQL package 1.16
Install the library and load it:
> install.packages("SPARQL")
> library(SPARQL)
We use the same query as in the previous example and, as this library can query any SPARQL endpoint, we have to give it the URL of the Wikidata endpoint:
> r <- SPARQL('https://query.wikidata.org/sparql', '
SELECT ?item ?itemLabel (MIN(?_date) AS ?date) (MIN(?_year) AS ?year) {
  ?item wdt:P31 wd:Q7889 ;
        wdt:P577 ?_date .
  BIND(YEAR(?_date) AS ?_year) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?item ?itemLabel
HAVING (?year > 1)
', curl_args = list(useragent = 'User Agent Example'))
Display the first results (note that they are in `r$results`):
> head(r$results)
                                    item                           itemLabel       date year
1 <http://www.wikidata.org/entity/Q2374>               "Civilization III"@en 1004396400 2001
2 <http://www.wikidata.org/entity/Q2377>                "Civilization IV"@en 1130191200 2005
3 <http://www.wikidata.org/entity/Q2385>                 "Civilization V"@en 1285020000 2010
4 <http://www.wikidata.org/entity/Q2387>    "Commandos 2: Men of Courage"@en 1000936800 2001
5 <http://www.wikidata.org/entity/Q2440> "Freedom Force vs the 3rd Reich"@en 1110236400 2005
6 <http://www.wikidata.org/entity/Q2450>    "Heroes of Might and Magic V"@en 1147730400 2006
You may notice that, by default:
- URLs are surrounded by angle brackets;
- labels contain language codes;
- dates are parsed as timestamps.
Still, you can display the same graph:
> barplot(table(r$results$year), col = "dodgerblue3", xlab = "year", ylab = "count")
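The three default behaviours noted above can also be undone after the fact. The sketch below is one possible way to do it, not a feature of the SPARQL package: `clean_sparql_results` is a hypothetical helper, it assumes the column names `item`, `itemLabel` and `date` from the query above, and it assumes the timestamps are seconds since 1970-01-01. The time zone in which they should be read is left as a parameter because I have not verified how the package computes them.

```r
# Normalise a data frame shaped like r$results from the example above.
clean_sparql_results <- function(df, tz = "UTC") {
  # <http://...> becomes http://...
  df$item <- gsub("^<|>$", "", df$item)
  # "label"@en becomes label
  df$itemLabel <- sub('^"(.*)"@[A-Za-z-]+$', "\\1", df$itemLabel)
  # seconds since the epoch become POSIXct date-times
  df$date <- as.POSIXct(df$date, origin = "1970-01-01", tz = tz)
  df
}
```

Calling `clean_sparql_results(r$results)` would then give columns closer to the ones returned by WikidataQueryServiceR, with dates as date-time objects rather than strings.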
Summary
| | WikidataR | WikidataQueryServiceR | SPARQL package |
|---|---|---|---|
| CRAN | WikidataR | WikidataQueryServiceR | SPARQL |
| Repository | github.com/Ironholds/WikidataR | github.com/bearloga/WikidataQueryServiceR | github.com/cran/SPARQL |
| Version | 1.4.0 | 1.0.0 | 1.16 |
| Release date | 2017-09-22 | 2020-06-17 | 2013-10-25 |
| Target | MediaWiki API | Wikidata Query Service | any SPARQL endpoint |
| Features | | | |
| Pros | | | |
| Cons | | | |
(*) Both WikidataQueryServiceR and SPARQL package have options to change format behavior, not covered here.
So, what library should you use? It depends on your needs:
- if you only need to get a few specific items from Wikidata, use WikidataR;
- if you need to do heavier work, like complex searches that return many results, use WikidataQueryServiceR;
- if you need to query several SPARQL endpoints, use the SPARQL package.
Update February 2022: this post has been updated to reflect new releases of WikidataR and WikidataQueryServiceR.
R logo by The R Foundation, CC BY-SA 4.0.