Early Results on the Lifespan of Wikipedia Project
January, 0001
These are some quick notes about an ongoing project.
This project uses survival analysis to study the lifespan of information on Wikipedia and Wikidata, with the goal of understanding 1) how lasting contributions are made in collaborative projects, and 2) the 'half-life' of knowledge.
I first downloaded all revisions of every page on English Wikipedia, as well as all revisions of Wikidata. I then wrote a Rust program that steps through the data dumps and, for each page, finds every sentence that ever appeared, along with the first and last revisions in which it appeared. This has to be done in a streaming fashion because the total uncompressed size of the Wikipedia and Wikidata data is about 75 TB. As each dump file is decompressed, the XML it contains is parsed, the Wikitext inside that XML is parsed, the plaintext is extracted and tokenized into sentences, and records of which sentences appeared in which revisions are stored in an efficient data structure. An analogous procedure is used for Wikidata, using claims (essentially entity-property-value triples) instead of sentences.
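The bookkeeping step described above can be sketched in a few lines. This is a minimal illustration in Python, not the Rust program itself: the names `Lifespan`, `split_sentences`, and `track_lifespans` are hypothetical, the regex sentence splitter is a naive stand-in for a real tokenizer, and decompression, XML parsing, and Wikitext parsing are assumed to have already produced one plaintext string per revision.

```python
import re
from dataclasses import dataclass


@dataclass
class Lifespan:
    first_rev: int  # index of the first revision containing the sentence
    last_rev: int   # index of the last revision containing the sentence


def split_sentences(plaintext):
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", plaintext.strip())
    return [p for p in parts if p]


def track_lifespans(revisions):
    """Consume one page's plaintext revisions in chronological order.

    Only the first/last revision index per sentence is kept, so each
    revision can be discarded as soon as it has been processed.
    """
    spans = {}
    for rev_index, text in enumerate(revisions):
        for sentence in set(split_sentences(text)):
            span = spans.get(sentence)
            if span is None:
                spans[sentence] = Lifespan(rev_index, rev_index)
            else:
                span.last_rev = rev_index
    return spans
```

Because `revisions` can be any iterable, a generator that yields revisions as they are read from a dump would work here without holding a whole page history in memory. Note that a sentence that is removed and later restored is treated as one continuous lifespan under this scheme.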
These data can be used to create a layered plot that shows the 'stratigraphy' of Wikipedia:

Each colored layer represents a year, and the thickness of each layer represents the number of sentences written in that year that are still on Wikipedia at a given later date. The top profile of the plot shows the total number of sentences on Wikipedia over time, and a vertical slice shows the breakdown, by year written, of the sentences present at that date. This essentially answers the question: when was Wikipedia written?
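The layer thicknesses can be computed directly from the first and last appearance of each sentence. The sketch below is an illustration under assumptions: the function name `stratigraphy` is hypothetical, time is bucketed into whole years, and each record is a `(first_year, last_year)` pair with `last_year` set to `None` for sentences still present at the end of the data.

```python
from collections import Counter


def stratigraphy(records, snapshot_years):
    """Count surviving sentences per birth year at each snapshot year.

    records: iterable of (first_year, last_year); last_year is None when
    the sentence is still present at the end of the data.
    Returns {snapshot_year: Counter({birth_year: number_alive})}.
    """
    layers = {year: Counter() for year in snapshot_years}
    for first_year, last_year in records:
        for year in snapshot_years:
            alive = first_year <= year and (last_year is None or year <= last_year)
            if alive:
                layers[year][first_year] += 1
    return layers
```

Summing one snapshot's counter gives the top profile of the plot at that date, and the individual counts give the thickness of each colored layer.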
Here is a similar plot for Wikidata:

This follows a notably different pattern, with claims lasting much longer than sentences do on Wikipedia, and with a major event in 2017 in which a large fraction of existing claims were deleted and replaced, just as the total number of claims rose steeply.
Survival Analysis
Using the Kaplan-Meier Estimator, these are the survival functions of sentences on Wikipedia and claims on Wikidata:
As was observed in the plots above, claims on Wikidata seem to survive much longer than sentences on Wikipedia.
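For readers unfamiliar with the estimator, here is a small self-contained sketch of the standard Kaplan-Meier product-limit formula. It is not the code used to produce the results above; the function names are hypothetical, and reading 'half-life' as the median survival time is my assumption. A sentence or claim that is still present at the end of the data is treated as censored (`observed` is `False`).

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival function.

    durations: lifetime of each item (any consistent time unit).
    observed:  True if the item was actually removed at that time,
               False if it was still present when the data ended.
    Returns a list of (t, S(t)) at each distinct removal time.
    """
    deaths = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # removals and censorings leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def half_life(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # survival never reaches 0.5 within the observed window
```

For example, `kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])` steps down to roughly 0.8, 0.6, and 0.3 at times 1, 2, and 3, so `half_life` returns 3 for that curve.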
This approach is especially powerful when looking at particular groups of pages or users. For example, this plot shows the survival function of sentences written by users with different access levels:
This can be used to assess which user groups made the most lasting contributions. In this case, it seems to show that bots have been some of the most effective editors, but that unconfirmed users struggled to produce lasting contributions.
A similar approach can be used to study whether different kinds of pages lead to more or less successful collaborative environments. For example, perhaps a page on a topic in mathematics would see many long-lasting contributions, unlike a page about a contentious political topic. (There could be an interesting interaction with this paper.)
Hazard Functions and Different Kinds of Edits
Another interesting phenomenon can be observed by looking at the hazard functions of information on Wikipedia and Wikidata:
These have very different shapes, hinting that they could be produced by fundamentally different mechanisms. I think this makes sense because there are really two kinds of edits on Wikipedia / Wikidata: quality edits, and update edits.
Quality edits improve the quality of a page without necessarily changing its factual content. This includes fixing punctuation, improving phrasing, or elaborating on a sentence without removing its original meaning.
Update edits remove information that is now considered factually incorrect and replace it with new information. This is sometimes due to a change in the state of the world (for example, the population of France is not the same now as it was 10 years ago), and sometimes due to a change in what the Wikipedia / Wikidata community considers correct.
In my experiments sampling random revisions from Wikipedia and Wikidata, I have found that Wikipedia is dominated by quality revisions, while Wikidata is dominated by update revisions. Wikidata has largely abstracted away the features that quality edits would improve (like phrasing or formatting), making such edits rare. This is extremely useful, because it makes it possible to disentangle the two effects and study each on its own: the quality edits of Wikipedia make it possible to study online collaboration, while the update edits of Wikidata make it possible to study the half-life of knowledge.
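The hazard functions compared in this section can be estimated from the same lifetime records used for the survival curves. The sketch below shows one common discrete-time approach, which may differ from how the curves above were produced: the hazard at time t is the number of removals at t divided by the number of items still at risk, and the running sum is the Nelson-Aalen cumulative hazard. The function names are hypothetical.

```python
from collections import Counter


def discrete_hazard(durations, observed):
    """Estimate h(t) = removals at t / items at risk just before t.

    durations: lifetime of each item.
    observed:  True if the item was removed, False if censored.
    Returns a list of (t, h(t)) at each distinct exit time.
    """
    deaths = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)
    at_risk = len(durations)
    hazard = []
    for t in sorted(exits):
        hazard.append((t, deaths.get(t, 0) / at_risk))
        at_risk -= exits[t]
    return hazard


def cumulative_hazard(hazard):
    """Nelson-Aalen estimate: running sum of the discrete hazards."""
    total = 0.0
    result = []
    for t, h in hazard:
        total += h
        result.append((t, total))
    return result
```

A hazard that is high early and then falls off would be consistent with new text being revised quickly or not at all, while a flatter hazard would be closer to the constant-rate decay that the term 'half-life' suggests. That reading is an interpretation, not a result.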