Research

The Trouble with Crowd Sourced Data

Written by Reflare Research Team | Aug 31, 2021 2:18:00 PM

A Mapbox GL JS v2.0.2 user maliciously renamed New York City. Within seconds of the attack, eBay, Snapchat, Foursquare, CitiBike and Zillow automatically fell in line, and NYC was gone.

First Published 31st August 2018

The wisdom (and stupidity) of crowds. Baaaahstards!


During the evening of the 30th of August 2018, U.S. time, several users of popular mobile apps like Snapchat, CitiBike and Zillow noticed that New York City had been renamed to “Jewtropolis” on the applications’ map screens.

The affected companies quickly pointed out that the maps themselves were provided by the third-party mapping company Mapbox, which in turn quickly identified a malicious edit as the source of the renaming. According to a statement released by Mapbox in the aftermath of the incident, the offending edit was undone within an hour of reaching the servers.

The statement outlines that a malicious individual had made several edits to data sources used by the company in an attempt to get them published. All of them were caught by automated AI review systems and submitted for human review. While the majority were correctly rejected, the edit renaming New York City was, likely through human error, pushed into the live dataset.

Activists, Ambiguity, Trolls and Sheeple

There are four central challenges that crowdsourced data faces: activists, ambiguity, trolls and sheeple. In this case, we are using “activists” to mean anyone following an agenda to create social or political change. Activists, be they individuals, grassroots groups, corporations or state-sponsored actors, often publish information that is biased, abridged or outright untrue, which decreases the overall quality of the dataset.

Ambiguity refers to information that may be interpreted differently depending on the observer. While issues of ambiguity often overlap with activists’ agendas (e.g. the submitted location of a disputed border or the labelling of a certain act as a crime), they can be as banal as differences in the use of terms (e.g. “free” as in liberty versus “free” as in beer).

While activists usually try to create narratives, trolls focus on destructive outcomes. The anonymous nature of the internet, combined with the human tendency to dehumanize those we don’t directly interact with, has made trolling an issue almost since the internet’s inception. And then there are the sheeple: those who see what is happening and pile on due to social proof, “fun”, or sheer curiosity about what happens next.

Even though the four challenges discussed above are commonly intertwined with one another, it is usually possible to establish the main motive for a low-quality edit to a crowdsourced dataset. In the case of this week’s malicious Mapbox edit, trolling seems to have been the main motivation since there is neither a debate over the name of New York City nor a reasonable expectation that the edited name would be believed by the general population.

Human Review

Oversight and pruning by human moderators have been tools to curb the influence of activists, ambiguity and trolls since the days of bulletin boards and early online forums. In a way, any public forum, seen in its abstract entirety, is a crowdsourced text dataset. Modern crowdsourcing efforts therefore usually employ the same tools.

Unfortunately, humans are fallible. From implicit and explicit biases to misreading data to simply pressing the wrong button, content moderation by humans will always have a failure rate. It is also relatively expensive.
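To make the failure-rate point concrete, a back-of-the-envelope calculation shows how even a highly accurate human moderator lets errors through at scale. The accuracy figure and edit volume below are invented for illustration, not numbers from the Mapbox incident:

```python
# Illustrative only: both numbers below are assumptions, not figures
# from Mapbox or any real moderation team.
moderator_accuracy = 0.999            # one mistake per 1,000 decisions
malicious_edits_reviewed = 5_000      # hypothetical volume of flagged edits

# Probability that every single malicious edit is handled correctly...
p_all_correct = moderator_accuracy ** malicious_edits_reviewed
# ...and therefore that at least one slips through into the live dataset.
p_at_least_one_slip = 1 - p_all_correct

print(f"P(at least one bad edit published) = {p_at_least_one_slip:.2%}")
```

Under these assumed numbers, a moderator who is right 99.9% of the time still lets at least one malicious edit through with better than 99% probability, which is why "the human will catch it" is not a complete defence.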

To counter these two issues, more and more companies are implementing AI systems to perform an initial review of data before it is sent to human moderators. Mapbox was using just such a system, which flagged the edit and sent it to a human moderator for a final decision. Unfortunately, that moderator in turn seems to have made an error that pushed the edit into the dataset.
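The two-stage flow described above can be sketched in a few lines. This is a hypothetical illustration of the general pattern, not Mapbox's actual system; the classifier, data structures and function names are all invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Edit:
    place: str
    old_name: str
    new_name: str

def ai_flags_edit(edit: Edit) -> bool:
    """Stage 1: stand-in for an automated classifier.
    For simplicity, this toy version flags any rename as suspicious."""
    return edit.new_name != edit.old_name

def human_review(edit: Edit, approve: bool) -> bool:
    """Stage 2: the human moderator's decision. A single wrong
    'approve' here is enough to publish the edit."""
    return approve

def process(edit: Edit, live_dataset: dict, human_decision: bool) -> None:
    if not ai_flags_edit(edit) or human_review(edit, approve=human_decision):
        # Either the AI saw nothing wrong, or a human approved the edit:
        # it goes live. Both stages must reject it to keep it out.
        live_dataset[edit.place] = edit.new_name

live = {"New York City": "New York City"}
bad_edit = Edit("New York City", "New York City", "Jewtropolis")

# The AI correctly flags the rename, but the human erroneously approves it.
process(bad_edit, live, human_decision=True)
print(live["New York City"])   # the malicious name is now live
```

The design point the sketch makes is that the AI stage only filters what reaches the human; it never overrides the human's final decision, so the pipeline's overall error rate is bounded below by the human's.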

Since AIs are far from the stage at which they can reliably identify activism, ambiguity and trolling, human moderation will continue to be the tool of choice for the foreseeable future. And since humans are fallible, it is only a matter of time before a similar attack against a different target succeeds.