Exploring Wikidata properties by the similarity of their use
A few weeks ago, I released Wikidata Related Properties, a tool to explore Wikidata properties and find the ones used together.
Features overview
The first idea for finding which properties are most used with a specific Wikidata property is to look at the cardinality of intersection, i.e. the number of items that use both properties. The issue with this method is that it mainly returns general properties. For instance, when you look at the properties closest to archives at (P485) sorted by the cardinality of intersection, you get a bunch of general properties about humans (sex or gender, occupation, given name, …).
Another idea is to use the Jaccard index, which is the cardinality of the intersection of two sets divided by the cardinality of their union. It lets you find properties that are mainly used together rather than on differing sets of items. With the same example of archives at (P485), we can see that the closest properties sorted by the Jaccard index are quite different, with mostly external IDs from authorities.
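As a minimal sketch of these two measures, assume each property is represented by the set of ids of the items that use it. The item ids and the function name `jaccard` are made up for illustration and are not taken from the tool.

```python
def jaccard(a: set, b: set) -> float:
    """Cardinality of the intersection divided by the cardinality of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical item sets, not real Wikidata data.
items_with_x = {"Q1", "Q2", "Q3", "Q4"}
items_with_y = {"Q3", "Q4", "Q5"}

intersection_size = len(items_with_x & items_with_y)  # Q3 and Q4 -> 2
index = jaccard(items_with_x, items_with_y)           # 2 shared / 5 distinct -> 0.4
```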
In a nutshell:
- the sort by cardinality of intersection allows you to find general properties;
- the sort by Jaccard index allows you to find more domain-specific properties.
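The difference between the two sorts can be illustrated with a toy example. All counts below are invented (they are not real Wikidata statistics), and the names `cardinality`, `intersection` and `jaccard_from_counts` are hypothetical.

```python
# cardinality[p]: number of items that use property p (invented numbers).
cardinality = {"target": 100, "general": 10_000, "specific": 120}
# intersection[p]: number of items that use both p and the target property.
intersection = {"general": 90, "specific": 60}


def jaccard_from_counts(i: int, c_a: int, c_b: int) -> float:
    """Jaccard index from the intersection size and the two cardinalities."""
    return i / (c_a + c_b - i)


# Sort by cardinality of intersection: the widely used property comes first.
by_intersection = sorted(intersection, key=intersection.get, reverse=True)

# Sort by Jaccard index: the property used mostly on the same items comes first.
by_jaccard = sorted(
    intersection,
    key=lambda p: jaccard_from_counts(
        intersection[p], cardinality["target"], cardinality[p]
    ),
    reverse=True,
)
```

Here the general property shares more items with the target (90 against 60), but it is also used on thousands of unrelated items, so its Jaccard index is far lower than that of the specific property.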
The tool shows the closest properties by both methods. Each property is displayed with its English label and its P number, and is also linked to its page on Wikidata. Properties can be filtered by type, for example to gather statistics about external IDs only. The data can be downloaded from the main page of the tool.
Limits
At the moment, statistics are limited to:
- properties used as main properties of statements (not as qualifiers or in references);
- the main (Q) and property (P) namespaces; lexicographical data is not included, as lexemes are excluded from the Wikidata JSON dumps for an unknown reason (T195419, T220883).
Other methods to detect similarity should be available. For instance, the fact that P4285 is (or should be) a subset of P269 is not clearly visible at the moment.
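To illustrate why a subset relation is hard to see with the Jaccard index alone, here is a small sketch with invented item sets. The `containment` function is only one possible complementary measure (the share of one set that is included in the other); it is an assumption made for this example, not something the tool computes.

```python
def jaccard(a: set, b: set) -> float:
    """Cardinality of the intersection divided by the cardinality of the union."""
    return len(a & b) / len(a | b)


def containment(a: set, b: set) -> float:
    """Share of the items in `a` that are also in `b` (1.0 means `a` is a subset of `b`)."""
    return len(a & b) / len(a)


# Invented item sets: `small` is entirely included in `large`.
small = {f"Q{n}" for n in range(10)}    # 10 items
large = {f"Q{n}" for n in range(1000)}  # 1000 items

j = jaccard(small, large)      # 10 shared / 1000 distinct -> 0.01
c = containment(small, large)  # 10 shared / 10 in `small` -> 1.0
```

The Jaccard index is low because the two sets differ greatly in size, even though one is fully contained in the other.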
Note: the idea to use the Jaccard index comes from Goran S. Milovanović (T214897).
Technical overview
The tool relies on the weekly Wikidata JSON dump, which is read in a single pass with the Wikidata Toolkit to compute the cardinality of each property and the cardinality of intersection of each pair of properties. The data is then imported into a MySQL database to compute the Jaccard index and to easily display the data with PHP.
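The tool itself reads the dump with the Wikidata Toolkit (in Java). Purely as an illustration of what a single pass over the dump involves, here is a Python sketch. It assumes the usual layout of the JSON dump (one JSON array, with one entity per line) and a `claims` object keyed by property id in which each statement carries a `rank`. The function names and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity per line, skipping the enclosing array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def used_properties(entity: dict) -> set:
    """Properties with at least one statement of normal or preferred rank."""
    return {
        pid
        for pid, statements in entity.get("claims", {}).items()
        if any(s.get("rank") in ("normal", "preferred") for s in statements)
    }


# Tiny hand-written sample in the same layout (not real dump content).
sample = [
    "[",
    '{"id": "Q1", "claims": {"P31": [{"rank": "normal"}], "P485": [{"rank": "deprecated"}]}},',
    '{"id": "Q2", "claims": {"P31": [{"rank": "preferred"}], "P269": [{"rank": "normal"}]}}',
    "]",
]
per_item = {e["id"]: used_properties(e) for e in iter_entities(sample)}
```

With a real dump, the lines would come from the compressed file instead of a list, read lazily so that the whole dump never has to fit in memory.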
Here is a description of the algorithm and its main variables used to generate the statistics:
- p_s is the list of all Wikidata properties; each element is a pair (p, c) with p the id of the property and c the cardinality of the property (i.e. the number of distinct Wikidata items that use it).
- q_s is the list of all pairs of Wikidata properties; each element is a 4-tuple (pa, pb, i, j) with pa and pb the ids of the properties, i the cardinality of intersection (i.e. the number of distinct Wikidata items that use both properties), and j the Jaccard index (i.e. the number of distinct Wikidata items that use both properties divided by the number of distinct Wikidata items that use at least one of the properties).
- u_s is the list of properties used in a Wikidata item.
set p_s to an empty set of pairs;
set q_s to an empty set of 4-tuples;
for each item in the Wikidata JSON dump:
    set u_s to an empty set of singletons;
    for each statement with normal or preferred rank in the item:
        set p the main property used in the statement;
        if p not in u_s:
            add p to u_s;
    for each property pa in u_s:
        if (pa, _) not in p_s:
            add (pa, 0) to p_s;
        set (pa, n) to (pa, n + 1) in p_s;
        for each property pb in u_s:
            if pa < pb:
                if (pa, pb, _, _) not in q_s:
                    add (pa, pb, 0, _) to q_s;
                set (pa, pb, n, _) to (pa, pb, n + 1, _) in q_s;
for each tuple (pa, pb, i, _) in q_s:
    get (p, c_a) from p_s where p = pa;
    get (p, c_b) from p_s where p = pb;
    set (pa, pb, i, _) to (pa, pb, i, i / (c_a + c_b - i)) in q_s;
Lines 1 to 17 of the pseudocode are implemented by the class PropertiesProcessor.
Lines 18 to 21 of the pseudocode are implemented by the SQL query at line 31 of the import script.
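For readers who prefer runnable code, the pseudocode can be sketched in Python as follows. This is not the tool's actual Java class or SQL query. It assumes that each item has already been reduced to its set of used properties (u_s), and the function name `related_properties` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def related_properties(items):
    """items: iterable of sets of property ids, one set (u_s) per item."""
    p_s = Counter()  # property id -> cardinality
    q_s = Counter()  # (pa, pb) with pa < pb -> cardinality of intersection
    for u_s in items:
        p_s.update(u_s)
        # Sorting first guarantees pa < pb, so each pair is counted once per item.
        q_s.update(combinations(sorted(u_s), 2))
    # Second step: Jaccard index of each pair, i / (c_a + c_b - i).
    pairs = {
        (pa, pb): (i, i / (p_s[pa] + p_s[pb] - i))
        for (pa, pb), i in q_s.items()
    }
    return p_s, pairs


# Three invented items, each described by the properties it uses.
items = [{"P1", "P2"}, {"P1", "P2", "P3"}, {"P1"}]
p_s, pairs = related_properties(items)
```

In this sample, P1 and P2 are used together on 2 items out of the 3 that use at least one of them, so their Jaccard index is 2/3.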