The Trouble with Crowdsourced Data
During the evening hours of August 30th, 2018 (U.S. time), several users of popular mobile apps such as Snapchat, CitiBike, and Zillow noticed that the city of New York had been renamed to “Jewtropolis” on the applications’ map screens.
The affected companies quickly pointed out that the maps themselves were provided by the third-party mapping company Mapbox, which in turn identified a malicious edit as the source of the renaming. According to a statement Mapbox released in the aftermath of the incident, the offending edit was reverted within an hour of making it to the servers.
The statement outlines that a malicious individual had made several edits to data sources used by the company in an attempt to get them published. All of them were caught by automated AI review systems and submitted for human review. While the majority were correctly rejected, the edit made to New York City was - likely through human error - pushed into the live dataset.
Activists, Ambiguity, and Trolls
There are three central challenges that crowdsourced data faces: activists, ambiguity, and trolls. In this case, we are using “activists” to mean anyone following an agenda to create social or political change. Activists - be they individuals, grassroots movements, corporations, or state-sponsored groups - therefore often publish information that is biased, abridged, or outright untrue, which decreases the overall quality of the dataset.
Ambiguity refers to information that may be interpreted differently depending on the observer. While issues of ambiguity often overlap with activists’ agendas (e.g. the submitted location of a disputed border or the labeling of a certain act as a crime), they can be as banal as two uses of the same term (e.g. “free” as in liberty versus “free” as in beer).
While activists usually try to create narratives, trolls focus on destructive outcomes. The anonymous nature of the internet, combined with the human tendency to dehumanize those we don’t directly interact with, has made trolling an issue almost since the internet’s inception.
Even though the three challenges discussed above are commonly intertwined, it is usually possible to establish a main motive behind a low-quality edit to a crowdsourced dataset. In the case of this week’s malicious Mapbox edit, trolling seems to have been the main motivation: there is neither a debate over the name of New York City nor a reasonable expectation that the edited name would be believed by the general population.
Oversight and pruning by human moderators has been a tool to curb the influence of activists, ambiguity, and trolls since the days of bulletin board systems and early online forums. In a way, any public forum - seen in its abstract entirety - is a crowdsourced text dataset. Modern crowdsourcing efforts therefore usually employ the same tools.
Unfortunately, humans are fallible. From implicit and explicit biases to misreading data to simply pressing the wrong button, content moderation by humans will always have a failure rate. It is also relatively expensive.
To counter these two issues, more and more companies are implementing AI systems that perform an initial review of data before it is sent to human moderators. Mapbox was using just such a system, which flagged the edit and sent it to a human moderator for a final decision. Unfortunately, that human moderator in turn seems to have made an error that pushed the edit into the dataset.
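To make the shape of such a two-stage pipeline concrete, here is a minimal sketch in Python. Everything in it - the Edit record, the model_score stand-in, the 0.1 threshold - is a hypothetical illustration; Mapbox has not published the details of its system, and a production classifier would be a trained model rather than a keyword check.

```python
from dataclasses import dataclass
from enum import Enum, auto
from queue import Queue

class Verdict(Enum):
    PUBLISH = auto()
    ESCALATE = auto()

@dataclass
class Edit:
    feature_id: str  # e.g. the map feature for New York City
    old_value: str
    new_value: str

def model_score(edit: Edit) -> float:
    """Stand-in for the automated reviewer; returns a risk score in [0, 1].
    A real system would use a trained classifier, not a keyword check."""
    blocklist = ("slur", "offensive")  # placeholder terms
    return 1.0 if any(word in edit.new_value.lower() for word in blocklist) else 0.0

def automated_review(edit: Edit, threshold: float = 0.1) -> Verdict:
    """Stage 1: auto-publish only edits the model considers clearly benign;
    everything else is escalated to a human for the final decision."""
    return Verdict.PUBLISH if model_score(edit) < threshold else Verdict.ESCALATE

def submit(edit: Edit, live_dataset: list, review_queue: Queue) -> None:
    """Stage 2 hand-off: publish directly or queue for human review."""
    if automated_review(edit) is Verdict.PUBLISH:
        live_dataset.append(edit)   # goes straight into the live dataset
    else:
        review_queue.put(edit)      # a human moderator makes the final call -
                                    # the step where the Mapbox edit slipped through
```

Note where this design concentrates risk: the automated stage can only publish edits it considers clearly benign, so anything suspicious lands in the human queue - and every failure of the kind described above therefore happens at the human stage.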
Since AI systems are far from the stage at which they can reliably identify activism, ambiguity, and trolling, human moderation will remain the tool of choice for the foreseeable future. And since humans are fallible, it is only a matter of time before a similar attack against a different target succeeds.