
New monthly dataset shows where people fall into Wikipedia rabbit holes

Photo by Taxiarchos228, Free Art License 1.3.

Have you ever looked up a Wikipedia article about your favorite TV show, only to end up hours later reading about some obscure episode in medieval history? First, know that you’re not the only person who’s done this. Roughly one out of three Wikipedia readers look up a topic because of a mention in the media, and often get lost following whatever link their curiosity takes them to.

Aggregate data on how readers browse Wikipedia content can provide priceless insights into the structure of free knowledge and how different topics relate to each other. It can help identify gaps in content coverage (do readers stop browsing when they can’t find what they are looking for?) and help determine whether the link structure of the largest online encyclopedia is optimally designed to support a learner’s needs.

Perhaps the most obvious use of this data is to find where Wikipedia gets its traffic from. Not only can clickstream data be used to confirm that most traffic to Wikipedia comes via search engines, it can also be analyzed to find out, at any given time, which topics were popular on social media and resulted in a large number of clicks to Wikipedia articles.
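As a rough illustration of that kind of analysis, the sketch below sums clicks per external entry point. It assumes each row of a clickstream file has four tab-separated fields (`prev`, `curr`, `type`, `n`) and that external sources carry labels such as `other-search`. The post itself does not spell out this layout, and the sample rows and counts are invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the assumed (prev, curr, type, n) layout.
SAMPLE_TSV = (
    "other-search\tNet_neutrality\texternal\t900\n"
    "other-external\tNet_neutrality\texternal\t150\n"
    "other-search\tMiddle_Ages\texternal\t400\n"
    "Internet\tNet_neutrality\tlink\t50\n"
)


def external_traffic_by_source(tsv_file):
    """Sum clicks per external entry point, skipping article-to-article rows."""
    totals = Counter()
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for prev, curr, link_type, n in reader:
        if link_type == "external":
            totals[prev] += int(n)
    return totals


traffic = external_traffic_by_source(io.StringIO(SAMPLE_TSV))
for source, clicks in traffic.most_common():
    print(source, clicks)
```

Swapping the in-memory sample for an open file handle on a real dump would give the same breakdown at full scale, provided the layout assumption holds.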

In 2015, we released a first snapshot of this data, aggregated from nearly 7 million page requests. A step-by-step introduction to this dataset, with several examples of analysis it can be used for, is in a blog post by Ellery Wulczyn, one of the authors of the original dataset.

Since this data was first made available, it has been reused in a growing body of scholarly research. Researchers have studied how Wikipedia content policies affect and bias reader navigation patterns (Lamprecht et al, 2015); how clickstream data can shed light on the topical distribution of a reading session (Rodi et al, 2017); how the links readers follow are shaped by article structure and link position (Dimitrov et al, 2016; Lamprecht et al, 2017); how to leverage this data to generate related article recommendations (Schwarzer et al, 2016); and how the overall link structure can be improved to better serve readers’ needs (Paranjape et al, 2016).

Due to growing interest in this data, the Wikimedia Analytics team has worked towards the release of a regular series of clickstream data dumps, produced at monthly intervals, for 5 of the largest Wikipedia language editions (English, Russian, German, Spanish, and Japanese). This data is available monthly, starting from November 2017.

A quick look at the November 2017 data for English Wikipedia tells us it contains nearly 26 million distinct links between over 4.4 million nodes (articles), for a total of more than 6.7 billion clicks. The distribution of distinct links by type (see Ellery’s blog post for more details) is as follows:

    • 60% of links (15.6M) are internal and account for 1.2 billion clicks (18%).
    • 37% of links (9.6M) are from external entry points (like a Google search results page) to an article and account for 5.5 billion clicks (about 82%).
    • 3% of links (773k) have type “other”, meaning they reference internal articles but the link to the destination page was not present in the source article at the time of computation. They account for 46 million clicks (under 1%).
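The click shares can be checked with a few lines of arithmetic. This is a minimal sketch using only the rounded figures quoted in the list, so the results are approximate; the variable and function names are illustrative.

```python
# Rounded click counts quoted above for English Wikipedia, November 2017.
clicks_by_type = {
    "internal": 1.2e9,
    "external": 5.5e9,
    "other": 46e6,
}

total_clicks = sum(clicks_by_type.values())


def click_share(link_type):
    """Percentage of all clicks that travelled over links of this type."""
    return 100 * clicks_by_type[link_type] / total_clicks


for link_type in clicks_by_type:
    print(f"{link_type}: {click_share(link_type):.1f}% of clicks")
```

The contrast is the interesting part: internal links make up about 60% of distinct links but carry only around 18% of clicks, while external entry points carry most of the traffic.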

If we build a graph where nodes are articles and edges are clicks between articles, it is interesting to observe that the global graph is almost entirely connected: only 157 nodes are not connected to the main cluster. This means that between almost any two nodes on the graph (article or external entry point), a path exists. When we look at the subgraph of internal links only, the number of disconnected components grows dramatically to almost 1.9 million “forests”, with a main cluster of 2.5M nodes. The difference arises because external links have very few source nodes, each connected to many article nodes. Removing external links allows us to focus on navigation between articles.
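The effect of removing external entry points can be reproduced on a toy graph. The sketch below is not the method used for the figures above; it is a plain union-find that ignores edge direction, run on a handful of invented edges, with the assumption that external sources are labelled with an `other-` prefix.

```python
from collections import Counter


def component_sizes(edges):
    """Return connected-component sizes, largest first, ignoring edge direction."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edges:
        root_a, root_b = find(source), find(target)
        if root_a != root_b:
            parent[root_a] = root_b

    roots = Counter(find(node) for node in list(parent))
    return sorted(roots.values(), reverse=True)


# Invented edges: one external source fanning out to otherwise unrelated articles.
all_edges = [
    ("other-search", "Net_neutrality"),
    ("other-search", "FIFA_World_Cup"),
    ("other-search", "Middle_Ages"),
    ("Net_neutrality", "Internet_service_provider"),
    ("FIFA_World_Cup", "Association_football"),
]
internal_edges = [(s, t) for s, t in all_edges if not s.startswith("other-")]

print(component_sizes(all_edges))       # one component holding every node
print(component_sizes(internal_edges))  # splits once the shared source is gone
```

A single search-engine node is enough to tie unrelated topics together, which is why the full graph looks connected. Note that an article reached only from external sources (here `Middle_Ages`) drops out of the internal edge list entirely, so counting it as an isolated node would require tracking the node set separately.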

In this context, a large number of disconnected forests lends itself to many interpretations. Suppose Wikipedia readers came to the site to read articles about either sports or politics, with neither group interested in the other category: we would expect two “forests”, with few edges crossing from the “politics” forest to the “sports” one. The existence of 1.9 million forests could shed light on related areas of interest among readers, on articles that have lower link density, and on topics with a relatively small volume of traffic, which makes them appear as isolated nodes.

Using the igraph library together with ggraph, we can obtain a list of articles linked from net neutrality, treat that neighborhood of articles as a network, and then visualize how those are connected by the number of clicks and neighbors. Data visualization by Mikhail Popov/Wikimedia Foundation, CC BY-SA 4.0.
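The neighborhood step described in the caption can be sketched without any graph library. This is not the igraph/ggraph code behind the visualization, only a plain Python stand-in for the same idea: collect the articles reached from a seed article, then keep the click edges among that set. The rows and click counts below are invented.

```python
def neighborhood_edges(rows, seed):
    """Keep click edges among the seed article and the articles reached from it."""
    neighbors = {curr for prev, curr, n in rows if prev == seed}
    keep = neighbors | {seed}
    return [(prev, curr, n) for prev, curr, n in rows if prev in keep and curr in keep]


# Hypothetical (prev, curr, clicks) rows for internal links.
rows = [
    ("Net_neutrality", "Internet_service_provider", 120),
    ("Net_neutrality", "Federal_Communications_Commission", 300),
    ("Federal_Communications_Commission", "Internet_service_provider", 40),
    ("Middle_Ages", "Feudalism", 75),
]

subgraph = neighborhood_edges(rows, "Net_neutrality")
for prev, curr, n in subgraph:
    print(f"{prev} -> {curr}: {n} clicks")
```

The resulting edge list, with click counts as weights, is the kind of input a plotting library would need to draw the network.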

If you’re interested in studying Wikipedia reader behavior and in using this dataset in your research, we encourage you to cite it via its DOI (doi.org/10.6084/m9.figshare.1305770) and to peruse its documentation. You may also be interested in additional datasets that Wikimedia Analytics publishes (such as article pageview data) or in navigation vectors learned from a corpus of Wikipedia readers’ browsing sessions.

Joseph Allemandou, Senior Software Engineer, Analytics
Mikhail Popov, Data Analyst, Reading Product
Dario Taraborelli, Director, Head of Research

Did you enjoy that net neutrality data visualization? Find out how you can do it.
