The most important responsibility of the Wikimedia Foundation’s Technology department is to “keep the servers running”: to operate the computers that provide Wikipedia and other Wikimedia sites.

But running the servers is only a fraction of the work our eighty-person team does. The Technology department also provides a variety of other essential services and platforms to the rest of the Wikimedia Foundation and to the public. In this post, we’ll introduce you to all of the programs that make up the Technology department, and highlight some of our work from the past year.

The 18 members of the Technical Operations team maintain the servers and run the wikis. Over the last year, the team delivered an uptime of 99.97% (according to independent monitoring) across all wikis. The team also practiced switching the wikis back and forth between data centers, so that the sites stay available even if an entire data center fails. This year, they performed the second-ever switchover from the primary to the secondary data center and back, and more than doubled the speed of the switchover (more information).

The ten-person Fundraising Technology team is responsible for the security, stability, and development of the Wikimedia Foundation’s online donation systems. Millions of relatively small donations (average of about $15 per transaction) make up the majority of the Wikimedia Foundation’s operating budget every year.  The team maintains integration with 6 major payment processors, and several smaller ones, enabling online fundraising campaigns in approximately 30 countries each year. The team also maintains donor databases and other tools supporting fundraising.

You may have noticed that saving edits on Wikimedia got faster last year. For this, credit the Performance team. Last year, they tackled technical debt and focused on the most central piece of our code’s infrastructure, MediaWiki Core, looking for the highest-value improvements to make the biggest performance impact. The four-person team was responsible for 27% of all contributions to MediaWiki Core last year (source). Their biggest success was reducing the time to save an edit by 15% at the median and by 25% at the 99th percentile (the 1% slowest edit saves). This is a performance improvement felt directly by all editors of our wikis.

The eight people on the Release Engineering team (RelEng) maintain the complicated clusters of code and servers needed to deploy new versions of MediaWiki and supporting services to the servers and to monitor the results. Last year they consolidated to a single deployment tool, which we expect to permanently reduce the cost of Wikimedia website maintenance. A creeping increase in maintenance costs is a major Achilles’ heel (“a weakness in spite of overall strength, which can lead to downfall”) of complex websites, so any improvement there is a major victory.

It’s hard to know if you are improving something if you can’t measure the improvement, and you can’t measure improvements to something you aren’t measuring in the first place. For example, the English Wikipedia community is experimenting with a different model for creating articles (ACTRIAL), and will need reliable data to know what the result of the experiment actually is. The seven-person Analytics Engineering team builds and supports measurement tools that support this and many other uses, while working within the Privacy Policy and the values of the movement that constrain what data can be collected. The team is working on new initiatives to process data in real time that, for example, enable the fundraising team to get same-day turnaround on questions about fundraising effectiveness. One of the main projects this year is Wikistats 2. Wikistats has been the canonical source of statistics for the Wikimedia movement since its inception. Wikistats 2 has been redesigned for architectural simplicity, faster data processing, and a more dynamic and interactive user experience. The alpha for the UI and new APIs was launched in December 2019. Although the tool and APIs are geared to the community, anyone can use the Wikistats UI and APIs to access information about Wikipedia.

To illustrate our recent post on Wikipedia clickstream ‘rabbit holes,’ the Wikimedia Foundation’s Mikhail Popov used R code to create a net neutrality clickstream. An overview of core graph theory terms is provided, along with brief introductions to R packages igraph, ggplot2, and ggraph. While some familiarity with R is necessary, he includes a list of free resources for learning it at the bottom.



Diagram by Mikhail Popov/Wikimedia Foundation, CC BY-SA 4.0.

To go together with the announcement post for the monthly Wikimedia Clickstream releases, we wanted to create a graphic that would showcase the network aspect of the dataset:

In this post, you’ll learn a little bit about graph theory, how to import the clickstream data into the programming language and statistical software “R”, how to convert it into a graph, and how to visualize it to create the figure above. Some familiarity with R is required to follow along, but a list of free resources is available at the bottom for those who are interested in learning it.

Setting up

First, you’ll need R installed on your system. It is available on the Comprehensive R Archive Network (CRAN) for Linux, macOS, and Windows. Then you will need to install some additional packages which are available on CRAN:

install.packages(c("igraph", "ggraph"))

If trying to attach the ggraph package with library(ggraph) gives you an error, you need to install the development version of ggplot2:

# install.packages("devtools")
devtools::install_github("tidyverse/ggplot2")

Quick vocabulary lesson

A graph is a structure containing a set of objects, called vertices (also called nodes), where pairs of vertices are connected to each other via edges (also called links). Those edges represent relationships; they may be directed or undirected (mutual), and they may have weights. Two vertices that are endpoints of the same edge are adjacent to each other, and the set of vertices adjacent to a vertex V is called a neighborhood of V.

For example, if we have a group of people and we know how much each person trusts someone else, we can represent this network as a directed graph where each vertex is a person and each weighted, directed edge between vertices A and B is how much person A trusts person B (if at all!). Each person who A trusts or who trusts A collectively form a neighborhood of people adjacent to A.
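As a concrete sketch with the igraph package (installed during setup above), the trust network could be built like this; the people and trust values here are invented purely for illustration:

```r
library(igraph)

# A hypothetical trust network; each row says how much one person
# trusts another (names and weights are made up).
trust <- data.frame(
  from   = c("Alice", "Alice", "Bob", "Carol"),
  to     = c("Bob", "Carol", "Alice", "Alice"),
  weight = c(0.9, 0.4, 0.7, 0.2)
)

# Each row becomes a weighted, directed edge from one person to another.
g <- graph_from_data_frame(trust, directed = TRUE)

# The neighborhood of Alice: everyone she trusts or who trusts her.
neighbors(g, "Alice", mode = "all")
```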

There are additional terms to describe different properties of graphs (such as when a path of edges exists between any two vertices), but they are outside the scope of this post. If you are interested in learning more, please refer to the articles on graph theory and graph theory terms.

Reading data

First, we have to grab the dataset from Wikimedia Downloads and uncompress it before reading it into R. Once it’s ready, we can use the built-in utility for importing it. In this case we are working with the English Wikipedia clickstream from November 2017:

enwiki <- read.delim(
  "clickstream-enwiki-2017-11.tsv", # the dump file, after uncompressing
  sep = "\t",
  col.names = c("source", "target", "type", "occurrences")
)
Due to the size of the dataset, this will take a while. To speed up this step (especially if something goes wrong and you have to restart your session), we recommend using the readr package. Unfortunately we were unable to read in the original file with readr, so we had to use the built-in utility to read the data first, write it out using readr::write_tsv (which performed all the necessary character escaping), and then read it with readr in subsequent sessions. If you have gzip, you can compress the newly created version and read in the compressed version directly:

enwiki <- readr::read_tsv("clickstream-enwiki-2017-11-v2.tsv.gz")
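Spelled out, that one-time conversion looks roughly like this; the file names follow the November 2017 English Wikipedia dump, and gzip is assumed to be available on your system:

```r
# Slow, one-time read with the built-in utility:
enwiki <- read.delim(
  "clickstream-enwiki-2017-11.tsv",
  sep = "\t",
  col.names = c("source", "target", "type", "occurrences")
)

# Write it back out with readr, which performs the necessary escaping:
readr::write_tsv(enwiki, "clickstream-enwiki-2017-11-v2.tsv")

# Compress the new copy, then use the fast readr path in later sessions:
system("gzip clickstream-enwiki-2017-11-v2.tsv")
enwiki <- readr::read_tsv("clickstream-enwiki-2017-11-v2.tsv.gz")
```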

Working with graphs

Now that we have the data inside R, we have to convert it to a special object that we can then do all sorts of cool, graph-y things with. We accomplish this with the igraph network analysis library (which has R, Python, and C/C++ interfaces).

Although the data includes counts of users coming in from search engines and other Wikimedia projects, we restrict the graph to just clicks between articles and specify that the connections described by our data have a direction:


g <- graph_from_data_frame(subset(enwiki, type == "link"), directed = TRUE)

We use the make_ego_graph function to construct a sub-graph that is a neighborhood of articles adjacent to our article of choice. In igraph, we can use the functions V and E to get/set attributes of vertices and edges, respectively, so we also create additional attributes for the vertices (such as the number of neighbors via degree):

sg <- make_ego_graph(g, order = 1, mode = "out", V(g)["Net_neutrality"])[[1]]

# Number of neighbors:
V(sg)$edges <- degree(sg)

# Labels where the spaces aren't underscores:
V(sg)$label <- gsub("_", " ", V(sg)$name, fixed = TRUE)

Because of the high number of articles in this neighborhood, to make a cleaner diagram we want to omit articles that have fewer than two neighbors — that is, a sub-sub-graph that only has vertices which have at least two edges:

ssg <- induced_subgraph(sg, V(sg)[edges > 1])


In the data visualization package ggplot2 (and the “grammar of graphics” framework), you make a composition from layers that have geoms (geometric objects) whose aesthetics are mapped to data, potentially via scales such as size, shape, alpha levels (also known as opacity/transparency), and color. The extension ggraph works on graph objects made with igraph and allows us to use a familiar language and framework for working with network data.

In order to draw a graph, the vertices and edges have to be arranged into a layout via an algorithm. After trying various algorithms, we settled on the Davidson-Harel layout to visualize our sub-sub-graph:

set.seed(42) # for reproducibility
ggraph(ssg, layout = "dh") +
  geom_edge_diagonal(aes(alpha = log10(occurrences))) +
  scale_edge_alpha_continuous("Clicks", labels = function(x) { ceiling(10 ^ x) }) +
  geom_node_label(aes(label = label, size = edges)) +
  scale_size_continuous(guide = FALSE) +
  theme_graph() +
  theme(legend.position = "bottom")

We map the opacity of the edge geoms to the count of clicks to show volume of traffic between the articles (using a log10 transformation to adjust for positive skew). We also map the size of the label geoms to the number of neighbors, so the names of articles with many adjacent articles show up bigger.

Note that due to the way many graph layout algorithms work — where a random initial layout is generated and then iterated on to optimize some function — if you want reproducible results you need to specify a seed for the (pseudo)random number generator.

If you would like to learn how to work with ggplot2’s geoms, aesthetic mappings, and scales, UCLA’s Institute for Digital Research and Education has a thorough introduction. I also recommend the data visualization chapter from R for Data Science.

Parting words

I hope this was helpful in getting you started with Wikimedia Clickstream data in R, and I look forward to seeing what the community creates with these monthly releases! Please let us know if you are interested in technical walkthrough posts like this.

Learning R

If you are interested in learning R, here are some free online resources:

Beginner’s guide to R by Sharon Machlis
DataCamp’s Introduction to R
Code School’s Try R
edX’s Introduction to R for Data Science
swirl: Learn R, in R
R for Data Science by Garrett Grolemund and Hadley Wickham
RStudio webinars.

The Wikimedia Foundation’s Analytics team is releasing a monthly clickstream dataset. The dataset represents—in aggregate—how readers reach a Wikipedia article and navigate to the next. Previously published as a static release, this dataset is now available as a series of monthly data dumps for English, Russian, German, Spanish, and Japanese Wikipedias.

Photo by Taxiarchos228, Free Art License 1.3.

Have you ever looked up a Wikipedia article about your favorite TV show just to end up hours later reading about some obscure episode in medieval history? First, know that you’re not the only person who’s done this. Roughly one out of three Wikipedia readers look up a topic because of a mention in the media, and often get lost following whatever link their curiosity takes them to.

Aggregate data on how readers browse Wikipedia content can provide priceless insights into the structure of free knowledge and how different topics relate to each other. It can help identify gaps in content coverage (do readers stop browsing when they can’t find what they are looking for?) and help determine if the link structure of the largest online encyclopedia is optimally designed to support a learner’s needs.

Perhaps the most obvious usage of this data is to find where Wikipedia gets its traffic from. Not only can clickstream data be used to confirm that most traffic to Wikipedia comes via search engines, it can also be analyzed to find out—at any given time—which topics were popular on social media that resulted in a large number of clicks to Wikipedia articles.

In 2015, we released a first snapshot of this data, aggregated from nearly 7 million page requests. A step-by-step introduction to this dataset, with several examples of analysis it can be used for, is in a blog post by Ellery Wulczyn, one of the authors of the original dataset.

Since this data was first made available, it has been reused in a growing body of scholarly research. Researchers have studied how Wikipedia content policies affect and bias reader navigation patterns (Lamprecht et al, 2015); how clickstream data can shed light on the topical distribution of a reading session (Rodi et al, 2017); how the links readers follow are shaped by article structure and link position (Dimitrov et al, 2016; Lamprecht et al, 2017); how to leverage this data to generate related article recommendations (Schwarzer et al, 2016); and how the overall link structure can be improved to better serve readers’ needs (Paranjape et al, 2016).

Due to growing interest in this data, the Wikimedia Analytics team has worked towards the release of a regular series of clickstream data dumps, produced at monthly intervals, for 5 of the largest Wikipedia language editions (English, Russian, German, Spanish, and Japanese). This data is available monthly, starting from November 2017.

A quick look into the November 2017 data for English Wikipedia tells us it contains nearly 26 million distinct links, between over 4.4 million nodes (articles), for a total of more than 6.7 billion clicks. The distribution of distinct links by type (see Ellery’s blog post for more details) is as follows:

    • 60% of links (15.6M) are internal and account for 1.2 billion clicks (18%).
    • 37% of links (9.6M) are from external entry-points (like a Google search results page) to an article and account for 5.5 billion clicks.
    • 3% of links (773k) have type “other”, meaning they reference internal articles but the link to the destination page was not present in the source article at the time of computation. They account for 46 million clicks.
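These percentages can be recomputed from the dump with base R alone; a minimal sketch, using a toy data frame standing in for the real clickstream table (same columns as in the tutorial above):

```r
# Toy stand-in for the full clickstream data frame:
enwiki <- data.frame(
  type        = c("link", "link", "external", "external", "other"),
  occurrences = c(120, 80, 700, 60, 40)
)

# Distinct links and total clicks, broken down by link type:
links_by_type  <- table(enwiki$type)
clicks_by_type <- tapply(enwiki$occurrences, enwiki$type, sum)

round(100 * links_by_type / sum(links_by_type))    # % of distinct links
round(100 * clicks_by_type / sum(clicks_by_type))  # % of total clicks
```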

If we build a graph where nodes are articles and edges are clicks between articles, it is interesting to observe that the global graph is strongly connected apart from 157 nodes that are not attached to the main cluster. This means that a path exists between almost any two nodes on the graph (article or external entry point). When looking at the subgraph of internal links only, the number of disconnected components grows dramatically to almost 1.9 million forests, with a main cluster of 2.5M nodes. This difference is due to external links connecting very few source nodes to many article nodes; removing external links allows us to focus on navigation within articles.
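With igraph (used in the tutorial above), counting those clusters is straightforward; a sketch on a toy graph, since the real computation runs over the full dump:

```r
library(igraph)

# Toy graph: one connected pair of articles plus a separate pair,
# standing in for the full article-to-article click graph.
g <- graph_from_data_frame(
  data.frame(
    source = c("A", "B", "C"),
    target = c("B", "A", "D")
  ),
  directed = TRUE
)

# "Weak" components ignore edge direction, which matches counting
# disconnected clusters of articles:
comps <- components(g, mode = "weak")
comps$no         # number of disconnected components
max(comps$csize) # size of the largest cluster
```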

In this context, a large number of disconnected forests lends itself to many interpretations. If we assume that Wikipedia readers come to the site to read articles about just sports or politics, and that no reader is interested in both categories, we would expect two “forests”, with few edges crossing from the “politics” forest to the “sports” one. The existence of 1.9 million forests could shed light on related areas of interest among readers, on articles that have lower link density, and on topics that have a relatively small volume of traffic, making them appear as small, isolated forests.

When I was five or six years old, my mom’s boyfriend took us ice fishing. He drove his Jeep to the edge of one of Minnesota’s ten thousand lakes and then he kept going, down the boat ramp and out onto the glittering expanse of white. He stopped next to a small ice shanty, and then he built a fire. Right on the ice.

That’s what I remember, at least. But, now, thirty-four years later, these memories seem suspect. Did we really drive on the lake? Did he really build a fire on the ice?

I’ve been thinking about the past a lot lately. My therapist and I have begun the arduous process of unpacking my childhood to see if we can find the source of my adult neuroses. I worry this will be an impossible task. My memories of that time seem slippery. I can never quite get a solid enough grasp on them to wring out meaning. And even the ones that seem so vivid and real feel fake, like cheap knockoffs.

To Drink Or Not To Drink?

This week, a headline literally* gave me whiplash. The loss of 1,600 points on the Dow? No, don’t be silly. Another government shutdown? No, not that one either. I mean the big news. Backpackers no longer have to filter their water. Because there’s nothing in the water that can hurt them!

Wow, right? Like many outdoor enthusiasts, I’ve always seen a water filter as a crucial part of my packing regimen. I’ve used ceramic filters, paper filters, those odd filters attached to the bottles, tablets, drops, UV light, and good old fashioned boiling.

I’ll never forget the feeling of staring at a snowmelt stream while coming off Clyde Minaret in the Sierras, long after dark, totally lost and dry as desert salt. I just sat there, staring, holding my broken Steripen and wondering if it was worth the risk.

In fact, if I had to decide between a filter and sleeping pad, I’m pretty sure I’d be waking up with a sore back in the morning.


Amazon Alexa fanfic



Inspired by true events


Me: Alexa. Good morning.

Alexa: Good morning! On this day in 1961, NASA sent a chimpanzee named Ham into space, flying 155 miles up in the Mercury capsule.

But these scientists weren’t just aping around. This mission was designed to tell them about –

Me: Alexa stop. Alexa, did you just say “aping around”?

Alexa: Yes I did.

Me: Alexa. Do you mean “monkeying around”?

Alexa: No. I said “aping around.”

Me: Alexa. But the joke is “monkeying around.”

Alexa: “Aping around” is an acceptable alternative.

Me:   Alexa no it’s not! Literally no one uses the word “ape” in that context. They say “monkeying around”. Or maybe “horsing around”. I guess you could “ape” someone —

Alexa: From a legal perspective, “monkeying around” and “aping around” are identical.

Me: …

Me: Alexa did somebody sue amazon dot com?


Shifting Baselines in the Outback

Scotia preserve. Emma Marris

Daniel Pauly, a fisheries scientist, coined the term “shifting baselines” in 1995 to describe how depleted fish populations came to be considered “normal” by generations that had never experienced the teeming abundance their grandparents had known.

The concept is now a fundamental one in conservation. As ecosystems change and as human memory dims, former states are forgotten and newer, altered states come to be considered the baseline against which change should be measured and to which restoration should aim. This can mean that, for example, one generation insists that a park “should be” a dense forest because that is how it appeared in their youth—thanks to the fact that elephants had been driven locally extinct. (Elephants browse so ferociously and even knock over full-grown trees, keeping landscapes in savannah-mode.)

Now a new paper looks at shifting baselines in the Australian Outback, where ants have long been thought to be the primary way seeds move around the landscape. Turns out that the role of small, adorable mammals in seed moving may have been overlooked because these creatures have been hit so hard by introduced predators, including cats and foxes.


Fukushima’s legacy in the Arctic Ocean

When the official photographer’s helicopter hovered above the Arctic Ocean for the bank note photo shoot, the Canadian Coast Guard ship Amundsen carried Jay Cullen’s oceanographic research equipment prominently on its deck. The icebreaker was to feature on the red Canadian fifty-dollar bill, and Cullen saw his chance at immortality. Unfortunately, when the mint released the note, the artist had airbrushed out Cullen’s crates like so much clutter.

Cullen has been going up to the Arctic for ten years now, taking measurements of chemical tracers in ocean waters to track the changes in currents. But over the last three years, Cullen’s techniques have been put to use for another task: tracking the radioactive material released by Fukushima.


The Last Word

January 29 – February 2, 2018

Craig begins the week:  he believes tarot cards? well, he believes chaos theory, he believes systems can organize themselves when smaller parts interact, he’s risking serious woo here, but sure, why not?

Rose’s dog is well-behaved, trustworthy, doesn’t even bark.  Rose’s dog was not always this way.  Once Rose’s dog was a puppy who drove her to the point of sitting on the kitchen floor and crying.

Jennifer has chronic pain and got herself off opioids, using an iffy plant/chemical/drug called Kratom.  She likes it, finds lots of arguments against it, and that would be ok, if they’d just do rigorous tests of the stuff.

Sadly, Ursula Le Guin has died.  Michelle remembers her voice and her connection with another great lady of the Pacific Northwest, Mount St. Helens.  Le Guin could see it out her kitchen window.

Craig ends the week too, this time out in the desert near an old uranium mine, picking up a little olivella shell, whose travels he traces back through the Southwest to its birthplace in the Sea of Cortez.

Ed. note:  Is spring ever going to come? No?


Shell Walkers

I was snooping around an old uranium mill the other day in southern Utah, taking advantage of an unusually warm January day in the desert to explore washes, ridges, and places where I could hunt for artifacts. You’ll find here glass bottles, metal tags, and pieces of machinery. It was a field mill, looked like 1950s by the decay. No bigger than a one-bedroom house, it had been reduced to some crackled concrete walls and durable trash, glass, plastic, metal. Bolts, broken tea cups, bottle caps. It had been built near a steep gully above a dry wash, and its ruins were crumbling into sandy, ashen soil.

In this dark soil, instead of prospector artifacts, I began finding sherds of pre-Columbian pottery, some painted with lines of black paint on white clay. This was Pueblo ancestry, between 800 and 1,000 years ago, shattered pieces of jar necks and bowls from a cliff- and pithouse-dwelling people who still grow corn in the desert mesas and riversides of northern Arizona and northern New Mexico.

The mill had been built on a prehistoric kiln site, ground still discolored from the number of fires that happened here, ceramic pieces broken and left around as temper and trash. Sticking out of the soil near the base of a brushy sage was the end of a sea shell. I pulled from the ground the neat little capsule of an olivella shell. I hadn’t seen one of these in years. Last time I remember was a cave in southwest Arizona, a hundred miles from an ocean. These kinds of shells were transported across the Southwest, and went on to Texas and Oklahoma. They were moved by foot, carried in satchels, baskets, and woven cotton bags, some made on looms and given intricate colors and patterns.

The olivella in my hand was a type that would come from either southern California or high in the crotch of the Sea of Cortez between Baja and mainland Mexico.

Redux: The Lady and Le Guin


I’ve been thinking a lot about Ursula Le Guin since her death on January 22. Here in the Pacific Northwest, she was not only a beloved author but a beloved public figure, active in the Portland community until the very end of her long life. I’ll miss hearing her voice, and I’ll miss her sharp wisdom about worlds real and imagined. Here’s a post I wrote in the summer of 2015 about Le Guin’s history with another great woman of the Northwest—Mount St. Helens.

Late last month, I got to camp with a group of ecologists at the base of Mt. St. Helens, in southwestern Washington state. Some of the scientists had been studying the mountain since shortly after it erupted on May 18, 1980, and they were full of stories about the changes they’d seen over the past thirty-five years. They told me that someone else had been watching the mountain just as long as they had, and that she still watched it every morning. Her name was Ursula Le Guin.

Ursula Le Guin? I said. The Ursula Le Guin?



