InfoTracker

Overview

InfoTracker rapidly compares a document against a potentially large collection of writings to identify non-chance overlaps in the text. In addition, the data structure employed by InfoTracker provides a powerful and efficient means of identifying and ranking multi-word key terms in a text corpus.
The InfoTracker C++ API has been applied in many applications, including plagiarism detection, difference detection, classified document redaction support, automated information retrieval, and multi-word feature selection for text mining. If you are interested in evaluating the InfoTracker API or a specialized application, or in purchasing source code, please contact us.
Plagiarism Detection
Competing plagiarism detection tools employ a variety of techniques to detect document content that may have been reproduced without permission or attribution. Some of these systems rely on poorly scaling pairwise text comparison methods. Others build indices of text samples (e.g., eight-word sequences) in order to detect co-derived documents. While these techniques can perform well when content is reused in large chunks, they perform poorly when plagiarism is more subtle.

Data structures such as suffix trees, which allow documents to be comprehensively compared with a stored corpus, are recognized as the gold standard for accuracy and speed of comparison. These approaches, however, have largely been dismissed for very large scale applications because they were previously limited to in-memory operation or involved unreasonable index update costs. InfoTracker breaks new ground in two ways:

  • It indexes the text corpus using a data structure based on the String B-Tree, an attractive cross between traditional external-memory indices (such as those used in search engines) and main-memory string-matching data structures such as suffix trees and suffix arrays.
  • InfoTracker’s indexing scheme goes beyond previous full-text analysis approaches by allowing corpora to be contrasted efficiently. In particular, our approach uses a large collection of common word sequences to screen out false positives caused by common phrases or corporate boilerplate (see the sketch after this list). This allows InfoTracker to accurately detect plagiarism based on shorter sequences of overlapping words than competing techniques, and therefore to detect reuse even when an author employs synonym replacement and other simple rewriting techniques.
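To make the screening idea concrete, the short C++ sketch below contrasts a query document with an indexed one and suppresses overlaps that appear in a list of known-common word sequences. It is an illustration only, with an in-memory hash set standing in for InfoTracker’s String B-Tree index; the sample text, the four-word sequence length, and the common-phrase list are all invented for the example.

    // Hypothetical sketch of the screening step only; InfoTracker's real index is
    // a String B-Tree, but a small in-memory hash set is enough to show the idea.
    #include <cctype>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Split text into lowercase word tokens.
    static std::vector<std::string> tokenize(const std::string& text) {
        std::vector<std::string> words;
        std::istringstream in(text);
        for (std::string w; in >> w;) {
            for (char& c : w) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            words.push_back(w);
        }
        return words;
    }

    // Build every n-word sequence ("shingle") from a token stream.
    static std::vector<std::string> shingles(const std::vector<std::string>& words, std::size_t n) {
        std::vector<std::string> out;
        for (std::size_t i = 0; i + n <= words.size(); ++i) {
            std::string s = words[i];
            for (std::size_t j = 1; j < n; ++j) s += " " + words[i + j];
            out.push_back(s);
        }
        return out;
    }

    int main() {
        // One document already in the corpus, and a new document being checked.
        const std::string indexed =
            "in witness whereof the parties have executed this agreement "
            "the quarterly penalty rate rises to eleven percent";
        const std::string query =
            "in witness whereof the parties have executed this lease and "
            "the quarterly penalty rate rises to eleven percent";

        // Background knowledge: sequences so common they carry no evidence of reuse.
        const std::unordered_set<std::string> common = {
            "in witness whereof the", "witness whereof the parties",
            "whereof the parties have", "the parties have executed",
            "parties have executed this"};

        const std::size_t n = 4;  // shorter than the 8-word chunks used by other tools
        std::unordered_set<std::string> corpusShingles;
        for (const auto& s : shingles(tokenize(indexed), n)) corpusShingles.insert(s);

        // Report overlaps that survive the common-phrase screen.
        for (const auto& s : shingles(tokenize(query), n))
            if (corpusShingles.count(s) && !common.count(s))
                std::cout << "suspicious overlap: \"" << s << "\"\n";
        return 0;
    }

Only the overlapping sequences about the penalty rate survive the screen; the boilerplate opening of the contract is recognized as common language and ignored.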
In addition, there are distinct advantages to using a client-based plagiarism detection tool over one of the many Web-based services. The most critical of these is the protection of intellectual property. It is widely recognized (see www.EssayScam.org) that an increasing number of Web sites offering plagiarism detection services are simply fronts used to unscrupulously collect essays that are then resold through paper mills. Other services, such as TurnItIn, have been criticized for archiving and profiting from submitted works without the knowledge or permission of the authors. A desktop, client-based solution such as that provided by InfoTracker avoids these concerns.
Difference Finding
Identical passages often appear in User Agreements, Terms of Use, and other legal documents. InfoTracker can be used to compare a new agreement a user is asked to accept against those they have previously consented to, highlighting all meaningful differences. For example, many Web sites employ very similar Privacy Statements, so one might expect that a user who once agreed to a particular clause will be willing to agree to it again. Given this, it is useful to simply highlight the areas of difference or similarity so that users can make informed decisions more rapidly. In the accompanying image, the black text represents differences between the current document and previously reviewed Privacy Statements.
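The following C++ sketch illustrates the comparison in miniature (it is not the InfoTracker API): each word of a proposed agreement is marked as familiar if it falls inside a word sequence that also appears in a previously accepted agreement, and the remaining words are bracketed as differences. The sample agreements and the three-word sequence length are invented for the example.

    // Hypothetical sketch: mark which words of a new agreement fall inside word
    // sequences already present in a previously accepted agreement; anything
    // left unmarked is treated as a meaningful difference.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    static std::vector<std::string> tokenize(const std::string& text) {
        std::vector<std::string> words;
        std::istringstream in(text);
        for (std::string w; in >> w;) words.push_back(w);
        return words;
    }

    static std::string join(const std::vector<std::string>& w, std::size_t i, std::size_t n) {
        std::string s = w[i];
        for (std::size_t j = 1; j < n; ++j) s += " " + w[i + j];
        return s;
    }

    int main() {
        const std::string accepted = "we may share your information with trusted partners who assist us";
        const std::string proposed = "we may share your information with any third party who pays us";
        const std::size_t n = 3;  // minimum run of words treated as "already reviewed"

        std::vector<std::string> prev = tokenize(accepted);
        std::vector<std::string> cur  = tokenize(proposed);

        std::unordered_set<std::string> prevSeqs;
        for (std::size_t i = 0; i + n <= prev.size(); ++i) prevSeqs.insert(join(prev, i, n));

        // A word is "familiar" if it sits inside any n-word sequence also found
        // in the previously accepted text.
        std::vector<bool> familiar(cur.size(), false);
        for (std::size_t i = 0; i + n <= cur.size(); ++i)
            if (prevSeqs.count(join(cur, i, n)))
                for (std::size_t j = 0; j < n; ++j) familiar[i + j] = true;

        // Print the proposed agreement, bracketing the words that differ
        // (the analogue of the black text in the image).
        for (std::size_t i = 0; i < cur.size(); ++i)
            std::cout << (familiar[i] ? cur[i] : "[" + cur[i] + "]") << ' ';
        std::cout << '\n';
        return 0;
    }

Run on the sample text, everything up to “with” is recognized as previously reviewed, while the substituted clause about third parties is bracketed for the user’s attention.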


Key Term Identification
It is often valuable to identify a set of terms that characterize or summarize a collection of documents. For instance, one might wish to establish useful metadata tags or a topic hierarchy by analyzing a corpus of text. Similarly, key terms form the critical foundation of the features used in a variety of text mining and machine learning tasks. InfoTracker presents a unique opportunity to identify and rank multi-word terms that are particularly important to a corpus of text. These multi-word key terms can hold substantially more discriminative power than individual words because they capture the context vital to their meaning. Further, unlike more naïve approaches, InfoTracker is not limited to identifying phrases of a fixed length (bi-grams or any specific n-gram) or phrases that fit a defined grammar. Rather, it identifies sequences of words based on their ability to characterize a corpus, regardless of length.
Consider, for example, the top terms extracted from a small collection of product and service reviews. The ranking of the terms is based on the relative frequency of occurrence of each term within and outside these reviews. This key term identification strategy can be used to summarize a document collection or as a preprocessing step for text mining.
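The C++ sketch below shows one simple way to realize such a ranking; it is not InfoTracker’s suffix-structure implementation, and the review snippets, background documents, and scoring formula are invented for illustration. Candidate terms of one to three words are counted in the reviews and in a background corpus, and each term is scored by its frequency inside the reviews relative to its frequency outside them.

    // Illustrative sketch only: score candidate multi-word terms by how much
    // more frequent they are inside the review collection than in a background corpus.
    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    static std::vector<std::string> tokenize(const std::string& text) {
        std::vector<std::string> w;
        std::istringstream in(text);
        for (std::string t; in >> t;) w.push_back(t);
        return w;
    }

    // Count every 1- to maxLen-word sequence in a collection of documents.
    static std::map<std::string, int> countTerms(const std::vector<std::string>& docs, std::size_t maxLen) {
        std::map<std::string, int> counts;
        for (const auto& doc : docs) {
            std::vector<std::string> w = tokenize(doc);
            for (std::size_t i = 0; i < w.size(); ++i) {
                std::string term;
                for (std::size_t n = 0; n < maxLen && i + n < w.size(); ++n) {
                    term = (n == 0) ? w[i] : term + " " + w[i + n];
                    ++counts[term];
                }
            }
        }
        return counts;
    }

    int main() {
        std::vector<std::string> reviews = {
            "the battery life is excellent and the battery life beats the old model",
            "poor battery life and slow customer service ruined it for me"};
        std::vector<std::string> background = {
            "the meeting is on friday and the report is due the same day",
            "service on the route was slow because of the weather"};

        std::map<std::string, int> fg = countTerms(reviews, 3);
        std::map<std::string, int> bg = countTerms(background, 3);

        // Relative-frequency score: common-in-reviews but rare-in-background wins.
        std::vector<std::pair<double, std::string>> ranked;
        for (const auto& [term, c] : fg) {
            if (c < 2) continue;  // ignore one-off sequences
            ranked.push_back({static_cast<double>(c) / (1.0 + bg[term]), term});
        }
        std::sort(ranked.rbegin(), ranked.rend());

        for (std::size_t i = 0; i < ranked.size() && i < 5; ++i)
            std::cout << ranked[i].second << "  (score " << ranked[i].first << ")\n";
        return 0;
    }

On this tiny sample the multi-word term “battery life” rises to the top alongside its component words, while common function words are pushed down by their background frequency.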
Document Redaction Support
InfoTracker provides two primary capabilities that can assist document sanitization. First, it maintains a “redaction” memory that captures editing decisions and exploits this knowledge to support future sanitization efforts. Second, it utilizes novel context understanding techniques to reduce the false alarm rates associated with typical ‘dirty word’ checks. Consider the scenario where InfoTracker is being applied to the problem of sanitizing documents related to the 2002 missile strike on a car carrying terrorists in Yemen. An analyst might be told to remove all specific mention of US capabilities and operations in Yemen, including any text that might be used to infer the nature of the attack or potential US-Yemen collaboration in the strike.
The user may begin with a simple ‘dirty words’ list that InfoTracker uses to identify portions of the document requiring attention. The analyst can then highlight regions to be redacted from within the document’s native application (Microsoft Word in this case). Once the redactions are committed, by the user or a quality control officer, this text is added to the system’s knowledge base and can be used to annotate future documents, providing awareness of previous redaction decisions. In most applications, very effective guidance can be provided after only a couple of documents have been processed. As shown in the accompanying figure, InfoTracker highlights meaningful text fragments that occurred in previous redactions (rather than highlighting all, potentially chance, occurrences of common terms and phrases). Similarly, InfoTracker can exploit the redaction memory to understand the contexts in which ‘dirty words’ appear and filter out benign instances (such as the final occurrence of ‘CIA’ in that figure). Finally, InfoTracker can provide users with details regarding past redaction decisions through a simple point-and-click operation.
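The C++ sketch below caricatures the redaction memory; the class and function names, dirty words, and sample sentences are hypothetical and are not the InfoTracker API. Committed redactions are remembered as word sequences and reported when they reappear, while a dirty-word hit is suppressed when its surrounding context was previously reviewed and left in place.

    // Toy sketch of a "redaction memory" (hypothetical names throughout).
    #include <algorithm>
    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    static std::vector<std::string> tokenize(const std::string& text) {
        std::vector<std::string> w;
        std::istringstream in(text);
        for (std::string t; in >> t;) w.push_back(t);
        return w;
    }

    // Three-word context window centred on word i (clipped at the edges).
    static std::string window(const std::vector<std::string>& w, std::size_t i) {
        std::size_t lo = (i > 0) ? i - 1 : 0;
        std::size_t hi = std::min(i + 1, w.size() - 1);
        std::string s;
        for (std::size_t j = lo; j <= hi; ++j) s += (j == lo ? "" : " ") + w[j];
        return s;
    }

    struct RedactionMemory {
        std::set<std::string> redactedPhrases;   // text removed in past documents
        std::set<std::string> clearedContexts;   // dirty-word contexts judged benign
        void commitRedaction(const std::string& phrase) { redactedPhrases.insert(phrase); }
        void commitCleared(const std::string& context) { clearedContexts.insert(context); }
    };

    int main() {
        const std::set<std::string> dirtyWords = {"predator", "cia"};

        RedactionMemory memory;
        memory.commitRedaction("launched from a predator drone");  // past redaction decision
        memory.commitCleared("the cia factbook");                  // context judged benign

        std::vector<std::string> doc = tokenize(
            "analysts cited the cia factbook while the strike was launched from a predator drone");

        // Flag dirty words unless their context was previously cleared.
        for (std::size_t i = 0; i < doc.size(); ++i)
            if (dirtyWords.count(doc[i]) && !memory.clearedContexts.count(window(doc, i)))
                std::cout << "review dirty word at position " << i << ": " << doc[i] << '\n';

        // Flag any remembered redacted phrase that reappears verbatim.
        std::string text;
        for (std::size_t i = 0; i < doc.size(); ++i) text += (i ? " " : "") + doc[i];
        for (const auto& phrase : memory.redactedPhrases)
            if (text.find(phrase) != std::string::npos)
                std::cout << "previously redacted: \"" << phrase << "\"\n";
        return 0;
    }

In the sample, the word about the drone strike is flagged for review and the remembered redacted phrase is reported, while the reference to the factbook is recognized as a context that was previously cleared.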

Summary

InfoTracker’s text indexing scheme allows it to efficiently compare all sequences of words (regardless of length) within a document with those stored in a potentially very large document index. By keying on the unique characteristics of shared content (e.g., unusual word sequences or misspellings), InfoTracker can reliably detect meaningful overlaps with the stored text. Some of InfoTracker’s more important attributes and capabilities include:

  • Utilizing a large body of background knowledge, InfoTracker is able to recognize when text overlaps are due to common language use or institutional boilerplate, thus reducing false alarms
  • InfoTracker can reliably conduct text matching in the face of distortions caused by Optical Character Recognition (OCR) and speech recognition software
  • The implementation of InfoTracker is highly scalable
    • The size of the constructed document index is roughly linear in the size of the raw text corpus
    • Text may be incrementally added to the index at a cost (measured in I/O) that is linear in the size of the text input
    • The algorithm can use as little or as much RAM as desired with the primary structure being disk resident
    • Text comparison processes can be run in parallel across multiple processors (see the sketch after this list)
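As a minimal sketch of that last point (hypothetical types and names, not the InfoTracker API), a read-only index can be shared by several worker tasks, each comparing its own slice of the incoming documents:

    // Hypothetical sketch: parallel comparison against a shared, read-only index.
    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <string>
    #include <vector>

    // Stand-in for the disk-resident index; read-only during comparison,
    // so it can be shared by all workers without locking.
    struct TextIndex {
        std::vector<std::string> phrases{"quarterly penalty rate", "witness whereof"};
        // True if any indexed phrase occurs in the document text.
        bool matches(const std::string& doc) const {
            for (const auto& p : phrases)
                if (doc.find(p) != std::string::npos) return true;
            return false;
        }
    };

    // Count how many documents in [begin, end) hit the index.
    static int compareRange(const TextIndex& index, const std::vector<std::string>& docs,
                            std::size_t begin, std::size_t end) {
        int hits = 0;
        for (std::size_t i = begin; i < end; ++i)
            if (index.matches(docs[i])) ++hits;
        return hits;
    }

    int main() {
        TextIndex index;
        std::vector<std::string> docs = {
            "the quarterly penalty rate rises", "nothing in common here",
            "in witness whereof the parties", "plain narrative text"};

        // Split the work across two asynchronous tasks (one per processor, say).
        std::size_t mid = docs.size() / 2;
        auto left  = std::async(std::launch::async, compareRange,
                                std::cref(index), std::cref(docs), std::size_t{0}, mid);
        auto right = std::async(std::launch::async, compareRange,
                                std::cref(index), std::cref(docs), mid, docs.size());

        std::cout << "documents with index hits: " << left.get() + right.get() << '\n';
        return 0;
    }

Because the index is only read during comparison, the workers need no locking; the same pattern extends to one task per processor core.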

Status

The InfoTracker prototype has been developed for Windows XP and heavily tested. InfoTracker’s built-in viewer includes document converters that cover PDF, HTML, and Microsoft Office. Users may also utilize InfoTracker through a Microsoft Word toolbar. We continue to extend the capabilities of InfoTracker and are actively seeking Beta users. Please contact TJ Goan at goan@stottlerhenke.com for additional information.
This work was supported by the ARDA Advanced Information Assurance program under contract NBCHC030077.

Publications

Goan, T., Fujioka, E., Kaneshiro, R., and Gasch, L., 2006. “Identifying Information Provenance in Support of Intelligence Analysis, Sharing, and Protection,” IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006, Proceedings, Lecture Notes in Computer Science 3975, Springer, 2006.
Goan, T. and Broadhead, M., 2004. “Detecting the Misappropriation of Sensitive Information through Bottleneck Monitoring,” Workshop on Secure Knowledge Management (SKM 2004).

Related Work

Bernstein, Y. and Zobel, J., 2004. “A Scalable System for Identifying Co-Derivative Documents,” SPIRE 2004: The Eleventh Symposium on String Processing and Information Retrieval, Padova, Italy, pp. 55-67.
Falkenberg, H.C. and Grimsmo, N., 2004. “Introduction to String Searching and Comparison of Suffix Structures and Inverted Files,” Technical Report, Norwegian University of Science and Technology.
Ferragina, P. and Grossi, R., 1999. “The String B-Tree: A New Data Structure for String Search in External Memory and Its Applications,” Journal of the ACM 46 (1999), 236-280.
Gusfield, D., 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
Metzler, D., Bernstein, Y., Croft, W.B., Moffat, A., and Zobel, J., 2005. “Similarity Measures for Tracking Information Flow,” ACM 14th Conference on Information and Knowledge Management.