Making Censorship Data More User Friendly
Published in Technology. Tags: censorship, measurements.
Internet censorship poses increasingly significant dangers to open access to the Internet, with governments, ISPs, and other actors monitoring and tampering with user traffic. As Internet censorship becomes more pervasive, there is a heightened need for high-quality and easy-to-interpret network measurement data that can help journalists, policymakers, researchers, and advocacy groups characterize censorship mechanisms and ensure accountability on the part of censors.
Over the past decade, the censorship measurement community has risen to this challenge to build longitudinal, global observatories of Internet censorship, such as OONI and Censored Planet, which produce high-quality measurement data with excellent coverage over time and space.
Problem solved? Not completely.
Collecting Measurement Data is Only Part of the Process of Characterizing Censorship
Because the censorship ecosystem is inherently opaque, evasive, and diverse, analyzing large-scale measurement data presents numerous challenges: removing false positives, incorporating external metadata, and exploring aggregated results.
Ad-hoc analysis practices adopted so far do not scale to large amounts of measurement data and may lead to incorrect conclusions, which may have far-reaching implications in a politically sensitive area. Based on our experience running a large-scale censorship observatory at the University of Michigan, we identify key challenges that prevent researchers, including experts, from accurately characterizing censorship phenomena.
Accounting For Measurement Methodology Behavior and Limitations
It is important to consider the relationship between measurements on different Internet protocols and how they affect each other. For example, Figure 1 shows two OONI measurements conducted around the same time in Myanmar: one shows DNS tampering for www.facebook.com, and the other shows TCP/IP blocking. An analysis that considers only the outcome of each measurement might conclude that the type of blocking changed between measurements. However, further inspection of the data shows that the TCP/IP blocking measurement used a public DNS resolver (belonging to Google) and thus bypassed the DNS tampering. Therefore, the censorship characterization process must account for how each measurement was conducted.
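As a rough sketch of this check (the record fields and resolver list here are hypothetical, not the actual OONI or Censored Planet schema), one can refuse to compare blocking types across measurements that resolved the domain through different kinds of resolvers:

```python
# Minimal sketch: before concluding that the blocking type changed between
# two measurements, check whether they used the same kind of DNS resolver.
# The record format and resolver list are illustrative assumptions.
PUBLIC_RESOLVERS = {"8.8.8.8", "8.8.4.4", "1.1.1.1"}  # well-known public DNS

def comparable(m1: dict, m2: dict) -> bool:
    """Two measurements are directly comparable only if both used a local
    (in-country) resolver or both used a public one."""
    used_public_1 = m1["resolver_ip"] in PUBLIC_RESOLVERS
    used_public_2 = m2["resolver_ip"] in PUBLIC_RESOLVERS
    return used_public_1 == used_public_2

m_dns   = {"domain": "www.facebook.com", "resolver_ip": "203.0.113.53", "outcome": "dns_tampering"}
m_tcpip = {"domain": "www.facebook.com", "resolver_ip": "8.8.8.8",      "outcome": "tcpip_blocking"}
print(comparable(m_dns, m_tcpip))  # False: the second bypassed local DNS
```

In the Myanmar example above, this guard would flag that the apparent shift from DNS tampering to TCP/IP blocking is explained by the resolver choice, not by a change in the censor's behavior.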
Obtaining Accurate Metadata to Characterize Measurements
Most previous studies rely on country geolocation data to summarize censorship practices by country, but this can be erroneous for two reasons:
- Geolocation databases are known to have inaccuracies.
- Censorship is frequently implemented at the ISP or organizational level, which requires additional metadata.
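To illustrate why network-level metadata matters, here is a hedged sketch that aggregates blocking by ASN rather than by country. The prefix table, ASN names, and record fields are placeholders standing in for a real routing dataset, not the pipeline's actual sources:

```python
# Illustrative sketch: attribute blocking to networks (ASNs), not countries.
# The prefix-to-ASN table below is a stub for a real routing/geolocation feed.
import ipaddress
from collections import Counter

ASN_TABLE = {  # hypothetical prefixes and ASNs for illustration only
    "203.0.113.0/24":  ("AS64500", "ExampleTelecom"),
    "198.51.100.0/24": ("AS64501", "ExampleMobile"),
}

def lookup_asn(ip: str):
    addr = ipaddress.ip_address(ip)
    for prefix, asn in ASN_TABLE.items():
        if addr in ipaddress.ip_network(prefix):
            return asn
    return ("AS0", "unknown")

measurements = [
    {"vantage_ip": "203.0.113.7",  "blocked": True},
    {"vantage_ip": "203.0.113.9",  "blocked": True},
    {"vantage_ip": "198.51.100.5", "blocked": False},
]
per_asn = Counter(lookup_asn(m["vantage_ip"]) for m in measurements if m["blocked"])
print(per_asn)  # blocking concentrated in one ISP, not the whole country
```

A country-level summary of the same records would report partial blocking nationwide, hiding the fact that a single ISP accounts for all of it.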
Unexpected Network Behavior That Could be Confused With Censorship
A major challenge is accounting for CDN configurations that can cause network behavior and localization effects that are hard to quantify. For example, Cloudflare and GoDaddy may block Internet measurements because of DDoS concerns or low IP reputation and inject an ‘Access Denied’ page (Figure 2), which can easily be misconstrued as censorship.
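One common defense is to match response pages against known fingerprints before labeling a failure as censorship. The sketch below uses invented placeholder patterns and labels, not the real fingerprint set any observatory uses:

```python
import re

# Illustrative sketch: classify a response body against known fingerprints so
# CDN anti-bot pages are not misread as censorship. Patterns are placeholders.
FINGERPRINTS = [
    ("cdn_server_denied", re.compile(r"Access Denied", re.I)),
    ("censor_blockpage",  re.compile(r"blocked by order of", re.I)),
]

def classify(body: str) -> str:
    for label, pattern in FINGERPRINTS:
        if pattern.search(body):
            return label
    return "no_known_fingerprint"

print(classify("<html><h1>Access Denied</h1>You have been blocked.</html>"))
```

Here the ‘Access Denied’ page from Figure 2 would be labeled as a CDN-side denial rather than counted as a censorship event.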
Other unexpected network behavior can arise from events such as geoblocking and Internet shutdowns, all of which can affect censorship observations.
Censorship Data Analysis Pipeline
Collaborating with Google Jigsaw, we built an open-source censorship data analysis pipeline tailored for Censored Planet that systematically resolves many of the challenges we identified. The pipeline parses measurement data and enriches it with metadata from a variety of sources. It then compares measurement responses to known fingerprints that act as censorship signals. Finally, errors observed during network measurements are mapped to human-readable outcomes so the data can be easily explored, which we make public via the Censored Planet Dashboard.
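The final stage, mapping raw errors to readable outcomes, might look like the following sketch. The error names and outcome taxonomy are illustrative assumptions, not the pipeline's actual categories:

```python
# Hypothetical sketch of the outcome-mapping stage: translate low-level
# network errors into coarse, human-readable outcome strings for exploration.
# All error names and outcome labels here are illustrative, not the real taxonomy.
ERROR_OUTCOMES = {
    "connection_reset":   "tcp/rst_injected",
    "connection_timeout": "tcp/timeout",
    "dns_nxdomain":       "dns/nxdomain",
}

def to_outcome(error, matched_fingerprint):
    # A fingerprint match takes priority: the page content is more specific
    # than the transport-level error.
    if matched_fingerprint:
        return f"http/blockpage:{matched_fingerprint}"
    if error is None:
        return "expected/match"
    return ERROR_OUTCOMES.get(error, f"unknown/{error}")

print(to_outcome("connection_reset", None))
```

Collapsing many raw error strings into a small outcome vocabulary is what makes the dashboard's aggregated views tractable to browse.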
The design of the data analysis pipeline has three key features:
- It completely separates the measurement collection and analysis process, facilitating iterative improvements of the analysis process in the future.
- It is highly efficient in dealing with large-scale measurement data, processing Censored Planet’s 60 billion measurements in less than 24 hours.
- It is modular, allowing for partial additions of analysis features and processing of smaller datasets.
The processed data from the Censored Planet pipeline allows users to easily identify and explore censorship events and phenomena. For example, Figure 4 shows the Psiphon website being blocked in Belarus around the elections on 9 August 2020. As evident from the figure, the Psiphon website was not blocked before the elections but faced different types of blocking over time after that period.
We hope our detailed breakdown of challenges motivates researchers to follow best practices and use our data analysis pipeline to provide a more accurate and impactful characterization of pervasive Internet censorship.
Contributors: Armin Huremagic (Censored Planet), Sarah Laplante and Vinicius Fortuna (Jigsaw), and Roya Ensafi (Censored Planet, University of Michigan).
Ram Sundara Raman is a Ph.D. Candidate at the University of Michigan whose research focuses on measuring large-scale network interference and censorship. The views expressed by the authors of this blog are their own and do not necessarily reflect the views of the Internet Society.