InfoTracker: Detecting Meaningful Text Overlaps

Customer Intelligence Advanced Research Projects Activity (IARPA)
Users Analysts, educators, and authors
Need The National Intelligence Community (NIC) faces major information management challenges, particularly in reconciling the often conflicting goals of international and interagency collaboration with the protection of classified information. Both the NIC and the Department of Defense require new tools that improve their ability to share information safely across security and organizational boundaries.
Solution
  • InfoTracker provides two primary capabilities that assist with document sanitization. First, it maintains a “redaction” memory that captures editing decisions and exploits this knowledge to support future sanitization efforts. Second, it uses novel context-understanding techniques to reduce the false alarm rates associated with typical ‘dirty word’ checks. Consider a scenario in which InfoTracker is applied to sanitizing documents related to the 2002 missile strike on a car carrying terrorists in Yemen. An analyst might be told to remove all specific mention of US capabilities and operations in Yemen, including any text that might be used to infer the nature of the attack or potential US-Yemen collaboration in the strike.
  • The user may begin with a simple ‘dirty words’ list that InfoTracker uses to identify portions of the document requiring attention. The analyst can then highlight regions to be redacted from within the document’s native application (Microsoft Word in this case). Once the redactions are committed, by the user or by a quality control officer, the redacted text is added to the system’s knowledge base and can be used to annotate future documents, providing awareness of previous redaction decisions. In most applications, InfoTracker can provide effective guidance after only a couple of documents have been processed. As shown in the figure below, InfoTracker highlights meaningful text fragments that occurred in previous redactions, rather than highlighting every occurrence (including chance occurrences) of common terms and phrases. Similarly, InfoTracker can exploit the redaction memory to understand the contexts in which ‘dirty words’ appear and so filter out benign instances (such as the final occurrence of ‘CIA’ in this figure). Finally, InfoTracker can show users the details of past redaction decisions through a simple point-and-click operation.
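The two ideas described above, a redaction memory and context-aware filtering of ‘dirty words’, can be illustrated with a minimal Python sketch. It is not InfoTracker's actual algorithm or its C++ API, neither of which is described here. It assumes a simple word n-gram match, and every name in it (RedactionMemory, commit, matches, flag_dirty_words) is hypothetical.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class RedactionMemory:
    """Hypothetical redaction memory built on word n-grams.

    Assumption: a fragment counts as 'meaningful' if it shares an n-gram
    with previously redacted text. A real system would likely use a more
    sophisticated notion of overlap and context.
    """

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.seen: Set[Tuple[str, ...]] = set()

    def commit(self, redacted_text: str) -> None:
        """Record every n-gram of a committed redaction."""
        tokens = tokenize(redacted_text)
        for i in range(len(tokens) - self.n + 1):
            self.seen.add(tuple(tokens[i:i + self.n]))

    def matches(self, document: str) -> List[str]:
        """Return the document n-grams that occurred in past redactions."""
        tokens = tokenize(document)
        hits = []
        for i in range(len(tokens) - self.n + 1):
            gram = tuple(tokens[i:i + self.n])
            if gram in self.seen:
                hits.append(" ".join(gram))
        return hits

    def flag_dirty_words(self, document: str, dirty_words: Set[str]) -> List[int]:
        """Return token indices of dirty words whose surrounding n-gram
        context was seen in a past redaction; other hits are treated as benign."""
        tokens = tokenize(document)
        flagged = []
        for i, token in enumerate(tokens):
            if token not in dirty_words:
                continue
            # Examine every n-gram window that contains position i.
            first = max(0, i - self.n + 1)
            last = min(i, len(tokens) - self.n)
            for start in range(first, last + 1):
                if tuple(tokens[start:start + self.n]) in self.seen:
                    flagged.append(i)
                    break
        return flagged


memory = RedactionMemory(n=3)
memory.commit("the CIA operated a Predator drone over Yemen")

document = ("Reports say the CIA operated a drone. He later studied at "
            "the CIA, the Culinary Institute of America.")
print(memory.matches(document))
print(memory.flag_dirty_words(document, {"cia"}))
```

In this toy run only the first occurrence of ‘CIA’ is flagged, because only its surrounding words overlap with the committed redaction. The second occurrence appears in a context the memory has never seen and is treated as benign. The n-gram length is an arbitrary choice for the sketch: shorter n-grams catch more reuse but also more chance matches.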
Status The InfoTracker prototype has been developed for Windows XP and tested extensively. InfoTracker’s built-in viewer includes document converters that cover PDF, HTML, and Microsoft Office. Users may also use InfoTracker through a Microsoft Word toolbar. We continue to extend the capabilities of InfoTracker and are actively seeking beta users. Please contact TJ Goan for additional information.
Related Applications While originally developed to support more accurate and timely redaction of sensitive documents, the InfoTracker C++ API has been applied to many other tasks, including plagiarism detection, difference detection, cross-boundary solutions, automated information retrieval, and multi-word feature selection for text mining. See the InfoTracker Product Page.