Before analyses can take place, structured, machine-readable data and unstructured, free-form information need to be “modeled” and “enriched”.
Modeling is the process of structuring raw data and information to facilitate further processing. In the case of unstructured, free-form information, this also entails making the data machine-readable.
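As a minimal illustration of modeling, the sketch below turns a free-form status line into a machine-readable record. The input format, field names, and pattern are invented for illustration; they are not drawn from any actual data source used by the system.

```python
import re

# A hypothetical free-form status line, as it might appear in a feed.
LINE = "2014-06-01 14:03 UTC -- provider ExampleNet reports outage in region X"

# Named groups give each extracted value a machine-readable field name.
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}) UTC"
    r" -- provider (?P<provider>\S+) reports (?P<event>\w+)"
    r" in region (?P<region>\S+)"
)

match = PATTERN.match(LINE)
record = match.groupdict() if match else None
print(record)
# {'date': '2014-06-01', 'time': '14:03', 'provider': 'ExampleNet',
#  'event': 'outage', 'region': 'X'}
```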
The next step is to enrich the structured data by combining it with information from other sources. Enrichment might involve the calibration of timestamps, the translation of network addresses to geographical coordinates, or simply the homogenization of existing values and descriptors to a common standard, enabling meaningful comparisons between different data sets and entities.
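A minimal enrichment sketch along those lines, assuming Python and a stand-in lookup table in place of a real IP-to-location database: it calibrates a timestamp to UTC and attaches (placeholder) coordinates to a network address.

```python
from datetime import datetime, timezone

# Hypothetical lookup table standing in for a real GeoIP database;
# the prefix and coordinates are placeholders, not real data.
GEO_TABLE = {
    "192.0.2.0/24": (51.48, 0.00),  # TEST-NET-1, illustrative coordinates
}

def normalize_timestamp(raw: str) -> str:
    """Calibrate a timestamp to UTC in ISO-8601 form."""
    dt = datetime.fromisoformat(raw)
    if dt.tzinfo is None:
        # Assume naive timestamps are already UTC; a real pipeline would
        # carry each source's timezone as metadata.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

def enrich(record: dict) -> dict:
    """Attach geographic coordinates and a normalized timestamp."""
    enriched = dict(record)
    enriched["timestamp"] = normalize_timestamp(record["timestamp"])
    # Proper longest-prefix matching is elided; a single /24 stands in here.
    prefix = ".".join(record["ip"].split(".")[:3]) + ".0/24"
    enriched["coords"] = GEO_TABLE.get(prefix)
    return enriched

print(enrich({"ip": "192.0.2.17", "timestamp": "2014-06-01T12:00:00+02:00"}))
```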
What to analyze?
The type of analysis that can be performed depends heavily on the nature and quantity of the available data. A large quantity of highly structured data with well-known characteristics, such as network performance metrics, can be analyzed using domain-agnostic statistical modeling and numerical analysis. Other types of data require a more elaborate, domain-specific framework within which automated analyses can be performed. The development of these frameworks is an ongoing area of focus.
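As one example of a domain-agnostic technique, the sketch below screens a series of round-trip-time measurements for outliers using the median absolute deviation (a robust alternative to z-scores); the numbers and the threshold are illustrative, not real measurements or a tuned parameter.

```python
import statistics

def flag_anomalies(samples, threshold=5.0):
    """Flag measurements far from the median, scaled by the median
    absolute deviation (MAD): a simple, domain-agnostic outlier screen."""
    med = statistics.median(samples)
    mad = statistics.median(abs(x - med) for x in samples)
    # A constant series (mad == 0) yields no flags.
    return [(i, x) for i, x in enumerate(samples)
            if mad and abs(x - med) / mad > threshold]

# Round-trip times in milliseconds; the spike at index 6 is flagged.
rtts = [24.1, 25.3, 23.8, 24.9, 25.0, 24.4, 310.0, 24.7]
print(flag_anomalies(rtts))  # [(6, 310.0)]
```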
Regardless of what type of data is under consideration, any analysis aims to help answer three basic questions: a) what has happened where? b) what is happening there now? c) can any statement be made as to how ‘a’ evolved into ‘b’?
How to analyze?
In the context of the system we are building, “analysis” refers specifically to automated analysis, with the intent of facilitating socio-political analyses by third parties. Chokepoint aims to provide a tool for making sense of a large amount of disparate data and information by collecting and contextualizing it across different disciplines.
We recognize that automated analyses rarely allow for definitive statements about causality or intent, and we wish to avoid any such suggestion in the presentation of our results. (This is especially important given that the visualizations we develop, e.g. the presentation of results on a map, while making the results easier to navigate and understand, also introduce inherent bias and distortion.)
Causality v. Correlation
While causality may be tricky or impossible to ascertain through automated analysis, such analysis does allow for the detection of trends and correlations within large data sets. These trends and correlations, especially when presented in (near) real time, are extremely valuable leads for research into the underlying causal relationships (if any).
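For illustration, a Pearson correlation over two invented series; a high coefficient like the one printed here is exactly the kind of lead described above, not a claim of causation.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative series only: packet loss (%) and outage reports per hour.
packet_loss = [0.1, 0.2, 0.1, 4.8, 5.1, 0.3, 0.2]
outage_reports = [2, 3, 1, 41, 44, 4, 2]
print(f"r = {pearson(packet_loss, outage_reports):.3f}")  # strongly correlated
```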
Historical analysis and “RQF” analysis
The integration of a large variety of data sources poses two problems. First, different sources publish different amounts of data at different intervals: sources that provide information about network performance are generally high-volume and near real-time, while sources of legal and jurisdictional information move at a comparatively much slower rate. Second, there is the problem of information overload: to be practical, the system must be able to distinguish information based on urgency and allow for the rapid integration and contextualization of urgent information (we call this “Really Quite Fast” or RQF analysis).
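One way such urgency-based triage could be sketched is with a priority queue that surfaces the most urgent items first; the urgency scores and example items below are hypothetical.

```python
import heapq
import itertools

# Triage queue: items tagged with an urgency score are popped
# most-urgent-first, so "Really Quite Fast" items are contextualized
# before the slow-moving bulk. Scores here are invented for illustration.
_counter = itertools.count()

def push(queue, urgency, item):
    # Negate urgency because heapq is a min-heap; the counter breaks
    # ties in arrival order.
    heapq.heappush(queue, (-urgency, next(_counter), item))

def pop(queue):
    _, _, item = heapq.heappop(queue)
    return item

queue = []
push(queue, 1, "annual jurisdictional report update")
push(queue, 9, "sudden nationwide routing withdrawal")
push(queue, 5, "regional latency increase")

while queue:
    print(pop(queue))
# -> routing withdrawal, latency increase, report update
```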
To address these issues, the system is arranged in a layered fashion, where each data set is first analyzed in isolation, then combined with other data sets for contextual analysis, and finally linked to historical trends and data sets for historical analysis. This results in a cumulative process that allows the system to scale over time in both breadth and depth.
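A sketch of that layering under stated assumptions: the function names and intermediate shapes are invented, not the system’s actual interfaces, but the flow (isolation, then context, then history) follows the description above.

```python
def analyze_isolated(dataset):
    """Per-source analysis: summaries, outlier flags, and the like."""
    return {"source": dataset["source"], "summary": sorted(dataset["values"])}

def analyze_contextual(isolated_results):
    """Cross-source analysis over the per-source results."""
    return {"sources": [r["source"] for r in isolated_results]}

def analyze_historical(contextual_result, history):
    """Link the current contextual picture to accumulated prior results."""
    history.append(contextual_result)
    return {"runs_so_far": len(history), "latest": contextual_result}

history = []
datasets = [
    {"source": "latency-probes", "values": [24.1, 310.0, 25.3]},
    {"source": "legal-register", "values": [1, 2]},
]
isolated = [analyze_isolated(d) for d in datasets]
contextual = analyze_contextual(isolated)
print(analyze_historical(contextual, history))
```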
The description continues in the “Report” section.
An overview: