document corpora – vialab | Dr. Christopher Collins

Hierarchical Matrix for Visual Analysis of Cross-Linguistic Features

Contributors:

Mariana Shimabukuro, Jessica Zipf, Mennatallah El-Assady, and Christopher Collins

This paper presents a visualization technique for cross-linguistic error analysis in large learner corpora. H-Matrix combines a matrix, which is commonly used by linguists to investigate cross-linguistic patterns, with a tree diagram to aggregate and interactively re-weight the importance of matrix rows to create custom investigative views. Our technique helps experts perform data operations, such as feature aggregation, filtering, ordering, and language comparison, interactively and without having to reprocess the data. H-Matrix dynamically links the high-level multi-language overview to the extracted textual examples and to a reading view where linguists can see the detected features in context and confirm or generate hypotheses.
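To make the idea of aggregating and re-weighting matrix rows through a tree more concrete, here is a minimal sketch. It is not taken from the H-Matrix source code: the `FeatureNode` class, the `aggregate` function, the choice of a weighted mean as the aggregation rule, and the numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeatureNode:
    """Hypothetical tree node; a leaf holds one matrix row (one value per language)."""
    name: str
    weight: float = 1.0                      # user-adjustable importance
    values: Optional[List[float]] = None     # set on leaves only
    children: List["FeatureNode"] = field(default_factory=list)


def aggregate(node: FeatureNode) -> List[float]:
    """Return a node's row: leaf values, or the weighted mean of the child rows."""
    if not node.children:
        if node.values is None:
            raise ValueError(f"leaf {node.name!r} has no values")
        return list(node.values)
    rows = [(child.weight, aggregate(child)) for child in node.children]
    total_weight = sum(weight for weight, _ in rows)
    width = len(rows[0][1])
    if total_weight == 0:
        return [0.0] * width
    return [
        sum(weight * row[i] for weight, row in rows) / total_weight
        for i in range(width)
    ]


# Made-up rates for two languages, purely for illustration.
articles = FeatureNode("article errors", values=[0.25, 0.5])
prepositions = FeatureNode("preposition errors", values=[0.75, 0.0])
grammar = FeatureNode("grammar", children=[articles, prepositions])

equal_view = aggregate(grammar)      # both children weighted equally
prepositions.weight = 3.0            # re-weight without touching the leaf data
reweighted_view = aggregate(grammar)
```

Because the leaf rows are never modified, changing a weight only requires re-running the aggregation, which is one way an interactive view could avoid reprocessing the underlying corpus.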

The source code for H-Matrix can be found on our GitHub.

Publications

Acknowledgements

The authors wish to thank the reviewers, our colleagues, and domain experts. This work was supported in part by NSERC Canada Research Chairs and a grant from SFB-TRR 161. This research has also been made possible by the Ontario Research Fund, funding research excellence.

Guided Topic Model Refinement using Word-Embedding Projections

Contributors:

Mennatallah El-Assady, Rebecca Kehlbeck, Christopher Collins, Daniel Keim, and Oliver Deussen

We present a framework that allows users to incorporate the semantics of their domain knowledge for topic model refinement while remaining model-agnostic. Our approach enables users to (1) understand the semantic space of the model, (2) identify regions of potential conflicts and problems, and (3) readjust the semantic relation of concepts based on their understanding, directly influencing the topic modelling. These tasks are supported by an interactive visual analytics workspace that uses word-embedding projections to define concept regions which can then be refined. The user-refined concepts are independent of a particular document collection and can be transferred to related corpora. All user interactions within the concept space directly affect the semantic relations of the underlying vector space model, which, in turn, change the topic modelling. In addition to direct manipulation, our system guides the users’ decision-making process through recommended interactions that point out potential improvements. This targeted refinement aims at minimizing the feedback required for an efficient human-in-the-loop process. We confirm the improvements achieved through our approach in two user studies that show topic model quality improvements through our visual knowledge externalization and learning process.
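As a rough illustration of how an interaction in the concept space could change the semantic relations of a vector space model, the sketch below pulls one word vector toward the centroid of a concept region. The update rule (linear interpolation), the function names, and the two-dimensional vectors are assumptions for illustration only; the abstract does not specify how the system actually adjusts the embedding.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pull_toward(vector: List[float], target: List[float], strength: float) -> List[float]:
    """Move `vector` toward `target`; strength 0 keeps it, strength 1 replaces it."""
    return [(1.0 - strength) * v + strength * t for v, t in zip(vector, target)]


# Made-up 2-D embeddings; real word embeddings have hundreds of dimensions.
embeddings = {
    "court": [1.0, 0.0],
    "judge": [0.6, 0.8],
    "appeal": [0.0, 1.0],
}

legal_concept = centroid([embeddings["court"], embeddings["judge"]])
before = cosine(embeddings["appeal"], legal_concept)

# Hypothetical user action: assign "appeal" to the legal concept region.
embeddings["appeal"] = pull_toward(embeddings["appeal"], legal_concept, strength=0.5)
after = cosine(embeddings["appeal"], legal_concept)
```

After the update, `after` is larger than `before`, so any topic model that relies on similarities in this space would treat "appeal" as closer to the concept.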

Publications

Parallel Tag Clouds

Contributors:

Christopher Collins, Fernanda B. Viégas, and Martin Wattenberg

Do court cases differ from place to place? What kind of picture do we get by looking at a country’s collection of law cases? We introduce Parallel Tag Clouds: a new way to visualize differences amongst facets of very large metadata-rich text corpora. We have pointed Parallel Tag Clouds at a collection of over 600,000 US Circuit Court decisions spanning a period of 50 years and have discovered regional as well as linguistic differences between courts. The visualization technique combines graphical elements from parallel coordinates and traditional tag clouds to provide rich overviews of a document collection while acting as an entry point for the exploration of individual texts. We augment basic parallel tag clouds with a details-in-context display and an option to visualize changes over a second facet of the data, such as time. We also address text mining challenges such as selecting the best words to visualize, and how to do so in reasonable time periods to maintain interactivity.
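One way to approach the problem of selecting the best words for each facet is to score how strongly a word is over-represented in one facet compared with the rest of the corpus. The sketch below uses the log-likelihood ratio statistic (G²), a common choice for this kind of comparison; that choice is an assumption here, since the abstract above does not name a scoring method, and the function names and toy tokens are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def g2(k_facet: int, n_facet: int, k_rest: int, n_rest: int) -> float:
    """G^2 for a word seen k_facet times in n_facet tokens vs. k_rest times in n_rest tokens."""
    total = n_facet + n_rest
    k_total = k_facet + k_rest
    observed = [k_facet, n_facet - k_facet, k_rest, n_rest - k_rest]
    expected = [
        n_facet * k_total / total,
        n_facet * (total - k_total) / total,
        n_rest * k_total / total,
        n_rest * (total - k_total) / total,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def distinctive_words(
    tokens_by_facet: Dict[str, List[str]], facet: str, top_n: int = 10
) -> List[Tuple[str, float]]:
    """Rank words that are over-represented in `facet` relative to all other facets."""
    facet_counts = Counter(tokens_by_facet[facet])
    rest_counts: Counter = Counter()
    for name, tokens in tokens_by_facet.items():
        if name != facet:
            rest_counts.update(tokens)
    n_facet = sum(facet_counts.values())
    n_rest = sum(rest_counts.values())
    scored = []
    for word, k_facet in facet_counts.items():
        k_rest = rest_counts[word]
        # Compare relative frequencies by cross-multiplying to avoid dividing by zero.
        if k_facet * n_rest > k_rest * n_facet:
            scored.append((word, g2(k_facet, n_facet, k_rest, n_rest)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up tokens for two hypothetical facets (for example, two courts).
toy_corpus = {
    "court_a": ["patent"] * 5 + ["court"] * 5,
    "court_b": ["maritime"] * 5 + ["court"] * 5,
}
top_for_a = distinctive_words(toy_corpus, "court_a", top_n=5)
```

In this toy corpus "court" is equally frequent in both facets, so it is filtered out, and only "patent" is returned for `court_a`. Counting each facet once and caching the totals would be a natural first step toward keeping such scoring fast enough for interactive use.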

This research received the VAST Test of Time Award at the IEEE VIS conference in 2019.

[Download high-resolution mp4]

Slides from the presentation at IEEE VAST 2009

Publications

DocuBurst: Visualizing Document Content using Language Structure

Contributors:

Christopher Collins, Gerald Penn, Sheelagh Carpendale, Brittany Kondo, and Bradley Chicoine

DocuBurst is the first visualization of document content that takes advantage of the human-created structure in lexical databases. We use an accepted design paradigm to generate visualizations that improve the usability and utility of WordNet as the backbone for document content visualization. A radial, space-filling layout of hyponymy (IS-A relation) is presented with interactive techniques of zoom, filter, and details-on-demand for the task of document visualization. The techniques can be generalized to multiple documents.
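The core of a radial, space-filling layout can be sketched in a few lines: each node occupies a ring determined by its depth, and its angular extent is divided among its children in proportion to their subtree counts. The code below is an illustrative sketch, not the prefuse implementation; the dictionary-based tree, the function names, and the word counts are assumptions, and the tree is a hand-written toy IS-A fragment rather than real WordNet data.

```python
import math
from typing import Dict, Optional, Tuple

Wedge = Tuple[int, float, float]  # (ring depth, start angle, end angle) in radians


def subtree_count(node: dict) -> int:
    """Occurrences of this node's word plus those of all its descendants."""
    return node.get("count", 0) + sum(
        subtree_count(child) for child in node.get("children", [])
    )


def radial_layout(
    node: dict,
    start: float = 0.0,
    extent: float = 2 * math.pi,
    depth: int = 0,
    out: Optional[Dict[str, Wedge]] = None,
) -> Dict[str, Wedge]:
    """Assign each node a wedge; children split the parent's extent by subtree count.

    Node names are used as dictionary keys, so they are assumed to be unique.
    """
    if out is None:
        out = {}
    out[node["name"]] = (depth, start, start + extent)
    children = node.get("children", [])
    child_total = sum(subtree_count(child) for child in children)
    angle = start
    for child in children:
        share = extent * subtree_count(child) / child_total if child_total else 0.0
        radial_layout(child, angle, share, depth + 1, out)
        angle += share
    return out


# Toy hyponymy fragment with made-up word counts from a hypothetical document.
toy_tree = {
    "name": "animal",
    "children": [
        {"name": "dog", "count": 3},
        {"name": "cat", "count": 1},
    ],
}
wedges = radial_layout(toy_tree)
```

Here the root spans the full circle, "dog" receives three quarters of it, and "cat" the remaining quarter. A design choice in this sketch is that children always fill their parent's whole extent, even when the parent word has occurrences of its own.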

Check out the live demo here.

Media Coverage

DocuBurst featured in Marti Hearst’s wonderful book, Search User Interfaces
DocuBurst featured in the Toronto Star!
DocuBurst on ‘information aesthetics’ blog
Interview with Margaux Watt of CBC Radio One Manitoba’s “Up To Speed”, 21 Feb 2008
A feature story on DocuBurst aired on FairChild TV “Media Focus” (cable 36 in Toronto), Friday, March 14, 2008!

Resources

Software

The code for displaying and interacting with radial, space-filling trees in prefuse is open source and available for download. It is distributed as a zip file that can be imported into Eclipse. It depends on the prefuse information visualization toolkit and, unfortunately, is minimally documented at this time:

Radial Space Filling Trees in prefuse [.zip] (requires separate prefuse download) or
Mavenized code, including pom, courtesy Brian O’Neill or
Executable Jar with prefuse embedded [.jar]