text analysis – vialab | Dr. Christopher Collins

Academia is Tied in Knots

Contributors:

Tommaso Elli, Adam Bradley, Christopher Collins, Uta Hinrichs, Zachary Hills, and Karen Kelsky

As researchers and members of the academic community, we felt that sexual harassment too often goes under-reported, so we created a data visualization project that uses visualization as a communicative medium to give the issue visibility within academia.

The data you are about to see comes from an anonymous online survey aimed at collecting personal experiences. The survey was issued in late 2017 and collected more than 2000 testimonies. This data is highly personal and sensitive. We spent significant effort identifying suitable ways to handle and represent it, so that the visualization shows the scale of the dataset while still honouring the individual experiences.

Explore the visualization at tiedinknots.io

Publications

Acknowledgements

This work was supported by NSERC, the Canada Research Chairs, and DensityDesign.

Textension: Digitally Augmenting Document Spaces in Analog Texts

Contributors:

Adam James Bradley, Christopher Collins, Victor Sawal, and Sheelagh Carpendale

In this paper, we present a framework that allows people who work with analog texts to leverage the affordances of digital technology, such as data visualization, computational linguistics, and search, using any web-based mobile device with a camera. After taking a picture of a particular page or set of pages from a text or uploading an existing image, our prototype system builds an interactive digital object that automatically inserts visualizations and interactive elements into the document. Leveraging the findings of previous studies, our framework augments the reading of analog texts with digital tools, making it possible to work with texts in both a digital and analog environment.

Check out our online demo.

Publications

Acknowledgements

This work was supported by NSERC Canada Research Chairs, The Canada Foundation for Innovation – Cyberinfrastructure Fund, and the Province of Ontario – Ontario Research Fund.

Hierarchical Matrix for Visual Analysis of Cross-Linguistic Features

Contributors:

Mariana Shimabukuro, Jessica Zipf, Mennatallah El-Assady, and Christopher Collins

This paper presents a visualization technique for cross-linguistic error analysis in large learner corpora. H-Matrix combines a matrix, which is commonly used by linguists to investigate cross-linguistic patterns, with a tree diagram to aggregate and interactively re-weight the importance of matrix rows to create custom investigative views. Our technique can help experts perform data operations, such as feature aggregation, filtering, ordering, and language comparison, interactively and without having to reprocess the data. H-Matrix dynamically links the high-level multi-language overview to the extracted textual examples and to a reading view where linguists can see the detected features in context and confirm or generate hypotheses.
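The idea of aggregating and re-weighting matrix rows through a tree can be illustrated with a minimal sketch. The feature names, language columns, hierarchy, and the weighted-average rule below are all assumptions made for illustration; they are not taken from the H-Matrix implementation.

```python
from typing import Dict, List

# Hypothetical toy data: rows are linguistic features, columns are languages.
LANGUAGES = ["German", "Spanish", "Japanese"]
FEATURE_ROWS: Dict[str, List[float]] = {
    "article_omission": [0.10, 0.05, 0.60],
    "article_misuse": [0.20, 0.15, 0.30],
    "verb_tense": [0.25, 0.40, 0.20],
}
# Hypothetical hierarchy: parent category -> child feature rows.
HIERARCHY: Dict[str, List[str]] = {
    "articles": ["article_omission", "article_misuse"],
    "verbs": ["verb_tense"],
}


def aggregate(parent: str, weights: Dict[str, float]) -> List[float]:
    """Weighted average of the child rows of `parent` (default weight 1.0)."""
    children = HIERARCHY[parent]
    total = sum(weights.get(child, 1.0) for child in children)
    if total == 0:
        return [0.0] * len(LANGUAGES)
    return [
        sum(weights.get(child, 1.0) * FEATURE_ROWS[child][col] for child in children) / total
        for col in range(len(LANGUAGES))
    ]


# Equal weights, then a view that emphasises article omission.
equal_view = aggregate("articles", {})
custom_view = aggregate("articles", {"article_omission": 3.0, "article_misuse": 1.0})
```

Because the aggregation runs over rows that are already computed, changing a weight only recomputes the parent row, which matches the goal of exploring views without reprocessing the data.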

The source code for H-Matrix can be found on our GitHub.

Publications

Acknowledgements

The authors wish to thank the reviewers, our colleagues, and domain experts. This work was supported in part by NSERC Canada Research Chairs and a grant from SFB-TRR 161. This research has also been made possible by the Ontario Research Fund, funding research excellence.

Metatation: Annotation for Interaction to Bridge Close and Distant Reading

Contributors:

Hrim Mehta, Adam Bradley, Mark Hancock, and Christopher Collins

In the domain of literary criticism, many critics practice close reading, annotating by hand while performing a detailed analysis of a single text. This process often draws on external resources to aid analysis. In this article, we present a study and subsequent tool design focused on leveraging a critic’s annotations as implicit interactions for initiating context-specific computational support that automatically searches external resources. We observed 14 poetry critics performing a close reading, revealing a set of cognitive practices supported through free-form annotation that have not previously been discussed in this context. We used guidelines derived from our study to design a tool, Metatation, which uses a pen-and-paper system with a peripheral display to treat reader annotations as underspecified interactions that augment close reading. By turning paper-based annotations into implicit queries, Metatation provides relevant supplemental information in a just-in-time manner and acts as a bridge between close and distant reading.

Publications

NEREx: Named-Entity Relationship Exploration in Conversations

Contributors:

Mennatallah El-Assady, Rita Sevastjanova, Bela Gipp, Daniel Keim, and Christopher Collins

We present NEREx, an interactive visual analytics approach for the exploratory analysis of verbatim conversational transcripts. By revealing different perspectives on multi-party conversations, NEREx gives an entry point for the analysis through high-level overviews and provides mechanisms to form and verify hypotheses through linked detail-views. Using a tailored named-entity extraction, we abstract important entities into ten categories and extract their relations with a distance-restricted entity-relationship model. This model complies with the often ungrammatical structure of verbatim transcripts, relating two entities if they are present in the same sentence within a small distance window. Our tool enables the exploratory analysis of multi-party conversations using several linked views that reveal thematic and temporal structures in the text. In addition to distant-reading, we integrated close-reading views for a text-level investigation process. Beyond the exploratory and temporal analysis of conversations, NEREx helps users generate and validate hypotheses and perform comparative analyses of multiple conversations. We demonstrate the applicability of our approach on real-world data from the 2016 U.S. Presidential Debates through a qualitative study with three domain experts from political science.
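The distance-restricted relation rule described above (two entities are related if they occur in the same sentence within a small window) can be sketched as follows. The function name, the tuple layout, the category labels, and the default window of five tokens are assumptions for illustration, not details of NEREx itself.

```python
from itertools import combinations
from typing import List, Tuple

Entity = Tuple[int, str, str]  # (token index in sentence, category, surface text)


def related_pairs(sentence_entities: List[Entity], max_distance: int = 5):
    """Pair up entities of one sentence whose token distance is within the window."""
    pairs = []
    # Sorting orders entities by token index, so j >= i for every combination.
    for (i, cat_a, text_a), (j, cat_b, text_b) in combinations(sorted(sentence_entities), 2):
        if j - i <= max_distance:
            pairs.append(((cat_a, text_a), (cat_b, text_b)))
    return pairs


# Hypothetical entities extracted from a single transcript sentence.
entities = [(0, "PERSON", "Clinton"), (3, "ORG", "Senate"), (12, "LOC", "Ohio")]
pairs = related_pairs(entities, max_distance=5)
```

The rule needs no syntactic parse, which is why it tolerates the ungrammatical sentences common in verbatim transcripts.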

For a demo, please visit: http://visargue.inf.uni-konstanz.de/

Publications

ConToVi: Multi-Party Conversation Exploration using Topic-Space Views

Contributors:

Mennatallah El-Assady, Valentin Gold, Carmela Acevedo, Christopher Collins, and Daniel Keim

We introduce a novel visual analytics approach to analyze speaker behaviour patterns in multi-party conversations. We propose Topic-Space Views to track the movement of speakers across the thematic landscape of a conversation. Our tool is designed to assist political science scholars in exploring the dynamics of a conversation over time to generate and prove hypotheses about speaker interactions and behaviour patterns. Moreover, we introduce a glyph-based representation for each speaker turn based on linguistic and statistical cues to abstract relevant text features. We present animated views for exploring the general behaviour and interactions of speakers over time and interactive steady visualizations for the detailed analysis of a selection of speakers. Using a visual sedimentation metaphor we enable the analysts to track subtle changes in the flow of a conversation over time while keeping an overview of all past speaker turns. We evaluate our approach on real-world datasets and the results have been insightful to our domain experts.

For access to the tool, please take a look at the presentation slides or contact us via e-mail.

Presentation Slides (PDF)

Publications

Lexichrome: Lexical Discovery with Word-Color Associations

Contributors:

Chris K. Kim, Christopher Collins, Uta Hinrichs, and Saif M. Mohammad

Based on word-colour associations from a comprehensive, crowdsourced lexicon, we present Lexichrome: a web application that explores the popular perception of relationships between English words and eleven basic colour terms using interactive visualization. Lexichrome provides three complementary visualizations: “Palette” presents the diversity of word-colour associations across the colour palette; “Words” reveals the colour associations of individual words using a dictionary-like interface; “Roget’s Thesaurus” uncovers colour association patterns in different semantic categories found in the thesaurus. Finally, our text editor allows users to compose their own texts and examine the resultant chromatic fingerprints throughout the process. We studied the utility of Lexichrome in a two-part qualitative user study with nine participants from various writing-intensive professions. We find that the presence of word-colour associations promotes awareness surrounding word choice, editorial decision, and audience reception, and introduces a variety of use cases, features, and future opportunities applicable to creative writing, corporate communication, and journalism.
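As a rough illustration of a chromatic fingerprint, the sketch below counts the colour terms associated with the words of a text. The five-word lexicon is invented for the example and stands in for the crowdsourced lexicon; the function name and tokenization are also assumptions.

```python
import re
from collections import Counter

# Toy stand-in for a word-colour association lexicon (entries are invented).
TOY_LEXICON = {
    "envy": "green",
    "anger": "red",
    "sky": "blue",
    "night": "black",
    "snow": "white",
}


def chromatic_fingerprint(text: str) -> Counter:
    """Count how often each colour term is evoked by the words of `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(TOY_LEXICON[word] for word in words if word in TOY_LEXICON)


fingerprint = chromatic_fingerprint("Anger and envy under the night sky; more anger.")
```

Recomputing the counter after each edit would give a writer a running view of how word choices shift the colour profile of a draft.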

Lexichrome is available for public access at http://lexichrome.com.

Publications

Acknowledgements

Thanks to Jason Boyd and Laurie Petrou. This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).

SentimentState: Exploring Sentiment Analysis on Twitter

Contributors:

Taurean Scantlebury and Christopher Collins

Twitter feeds are a potential source of useful information regarding the state of mind of persons who are the subject of legal or medical assessment. These may include persons suspected of committing crimes or patients who arrive at a hospital for a mental health emergency, such as an attempted suicide. Messages called “tweets” can expose the state of mind of a Twitter user. Analysts are challenged with creating reports of the online presence of users quickly and efficiently. We present a web-based visualization tool called SentimentState that performs sentiment analysis on tweets from a user’s Twitter account.

SentimentState analyses tweets based on ten emotions (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust) and creates an interactive timeline graph of the emotional state of the user. It uses a collection of 24,200 word-sense pairs with emotion associations, obtained from the National Research Council of Canada (NRC). We anticipate that this interactive visualization can have applications throughout, and even beyond, legal and medical assessments, and will provide analysts with timely and relevant information regarding the mood state of clients, patients and other persons under assessment.
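A lexicon-based emotion timeline of this kind can be sketched in a few lines. The three-word lexicon, the tweet data, and the function name are invented for illustration; the sketch also matches plain words rather than word senses, which is a simplification of the lexicon described above.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Toy stand-in for an emotion lexicon: word -> associated emotions (invented entries).
TOY_EMOTION_LEXICON = {
    "happy": {"joy", "positive"},
    "afraid": {"fear", "negative"},
    "furious": {"anger", "negative"},
}


def emotion_timeline(tweets: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Map each date to a count of emotion hits over that day's tweets."""
    timeline = defaultdict(Counter)
    for day, text in tweets:
        for word in re.findall(r"[a-z']+", text.lower()):
            for emotion in TOY_EMOTION_LEXICON.get(word, ()):
                timeline[day][emotion] += 1
    return dict(timeline)


tweets = [
    ("2015-01-01", "So happy today"),
    ("2015-01-01", "happy but afraid"),
    ("2015-01-02", "Furious!"),
]
timeline = emotion_timeline(tweets)
```

Plotting one line per emotion over the sorted dates would yield a timeline graph of the kind the tool presents interactively.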

Check out our Online Demo and our GitHub Repository for source code related to this project.

Publications

Acknowledgements

Thanks to Saif Mohammad for providing the NRC Emotion Lexicon for this project.

Investigating the Semantic Patterns of Passwords

Contributors:

Rafael Veras Guimaraes, Julie Thorpe, and Christopher Collins

Summary

What is the meaning within a password? And how does the meaning in your password relate to security risks? In our research into the ‘secret language of passwords,’ we have investigated numerical and textual patterns from a semantic (meaning) point of view. Where prior research investigated letter and number sequences to expose vulnerable passwords, such as “password123,” our research has delved into the composition of seemingly complex passwords such as “ilovedan1201” or “may101982” and revealed common patterns. In these cases, the patterns are <I><love><male-name><number> and <month in letters><day in numbers><year after 1980 in numbers>. Once learned, such patterns can be used to generate password guesses, such as “IloveMike203” and “July022001”.
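To make the idea of a learned pattern concrete, the sketch below expands one pattern into candidate strings by filling each category slot from a small list. The slot fillers and the function name are invented for the example; this is only an illustration of pattern expansion, not the password guessing system described on this page.

```python
from itertools import product
from typing import Dict, List

# One pattern from the text: literal tokens plus category slots.
PATTERN = ["I", "love", "<male-name>", "<number>"]
# Hypothetical fillers for each category slot.
FILLERS: Dict[str, List[str]] = {
    "<male-name>": ["Mike", "Dan"],
    "<number>": ["203", "1201"],
}


def expand(pattern: List[str], fillers: Dict[str, List[str]]) -> List[str]:
    """Return every string obtained by filling each slot with each of its fillers."""
    slots = [fillers.get(token, [token]) for token in pattern]
    return ["".join(parts) for parts in product(*slots)]


candidates = expand(PATTERN, FILLERS)
```

The number of candidates is the product of the filler list sizes, so a compact pattern can stand for a very large set of passwords.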

Using linguistic analysis and interactive visualization techniques, we have investigated the patterns of date-like numbers in passwords, and the meaning and relationships between types of words in passwords. The resulting analysis guided our creation of a password guessing system (not available to the public!) which on several measures is better than any prior published result. The exposed vulnerabilities are motivating our ongoing work into new ways to help people create semantically secure passwords. This research contributed to a major story in the New York Times Magazine on the Secret Life of Passwords.

Our research started with the many large password leaks that were made publicly available on the Internet, in particular the 32 million passwords from the RockYou website, exposed in 2009.

Our published research was conducted in two phases:

Date and Numbers

We started by exploring date patterns, because 24% of the RockYou passwords contain a numeric sequence of at least 4 digits. We wondered whether these sequences are dates and, if so, whether they show any temporal patterns. Our analyses found that 6% of the RockYou passwords (almost 2 million accounts!) contain numbers that match a date. To facilitate exploration of the patterns in the choice of dates, we created an interface that shows how often each day, month, year or decade (back to the year 1900) is referred to, as well as the corresponding passwords. We did not count passwords with numbers that are more likely to be keyboard patterns than dates, such as “111111”. Exploring the data through this interface, we confirmed some predictable patterns, such as the preference for dates that have repeated days and months (e.g., 08/08/1989), but also uncovered hidden ones, such as a consistent preference for the first two days of months, holidays, and a few notorious dates (e.g., the Titanic accident). For a detailed report on this work please read our paper or try our exploratory interface.
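A simplified version of the date check can be sketched as follows. It only handles 8-digit runs read as month-day-year or day-month-year, skips runs made of a single repeated digit, and uses a 1900 to 2009 year range; all of these choices, and the function name, are assumptions for illustration rather than the rules used in the published analysis.

```python
import re
from datetime import date
from typing import List


def find_dates(password: str, min_year: int = 1900, max_year: int = 2009) -> List[date]:
    """Return calendar dates that 8-digit runs in `password` could encode."""
    found: List[date] = []
    for run in re.findall(r"\d{4,}", password):
        if len(set(run)) == 1:
            continue  # e.g. "111111": more likely a keyboard pattern than a date
        if len(run) != 8:
            continue  # this sketch only interprets 8-digit runs
        year = int(run[4:])
        if not min_year <= year <= max_year:
            continue
        first, second = int(run[:2]), int(run[2:4])
        # Try month-day-year, then day-month-year.
        for month, day in ((first, second), (second, first)):
            try:
                candidate = date(year, month, day)
            except ValueError:
                continue  # not a valid calendar date under this reading
            if candidate not in found:
                found.append(candidate)
    return found


repeated = find_dates("anna08081989")
christmas = find_dates("x25121999")
```

Counting the returned dates by day, month, year, or decade over a whole password list gives the frequencies that an exploratory interface could then display.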

Words and Building a Password Grammar

In the second part of this research, we turned our attention to semantic patterns in the choice of words. Employing natural language processing techniques, we broke each password into words and classified the words according to their syntactic (grammar) function and semantic (meaning) content. The result is a rich model representing the syntactic and semantic patterns of a collection of passwords. With this model, we can rank the semantic categories to find that “love” is the most prevalent verb in passwords, “honey” is the most used food-related word, and “monkey” is the most popular animal, for example. Contrary to reported psychology research, we found that many categories related to sexuality and profanity are among the top 100. Our work also brought insight into the relations between concepts; for example, our model shows that a male name is four times more likely to follow the string “ilove” than a female name. Our paper, published in the NDSS Symposium 2014, discusses the security implications of our work. In summary, we show that the security provided by passwords is overestimated by methods that do not account for semantic patterns.
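The step of breaking a password into words and labelling them can be illustrated with a greedy longest-match segmenter over a tiny dictionary. The dictionary entries, category labels, and function name are invented for the example, and greedy matching is a stand-in for the natural language processing techniques mentioned above.

```python
import re
from typing import List, Tuple

# Toy dictionary: word -> category label (entries and labels are invented).
TOY_DICTIONARY = {
    "i": "pronoun",
    "love": "verb",
    "dan": "male-name",
    "monkey": "animal",
    "honey": "food",
}


def segment(password: str) -> List[Tuple[str, str]]:
    """Split a password into (chunk, category) pairs using greedy longest match."""
    text = password.lower()
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(text):
        digits = re.match(r"\d+", text[pos:])
        if digits:
            tokens.append((digits.group(), "number"))
            pos += len(digits.group())
            continue
        # Try the longest dictionary word starting at `pos` first.
        for end in range(len(text), pos, -1):
            if text[pos:end] in TOY_DICTIONARY:
                tokens.append((text[pos:end], TOY_DICTIONARY[text[pos:end]]))
                pos = end
                break
        else:
            tokens.append((text[pos], "unknown"))
            pos += 1
    return tokens


tokens = segment("ilovedan1201")
pattern = [category for _, category in tokens]
```

Counting how often each category sequence occurs across a password collection is one simple way to approximate the kind of ranking described above.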

Online Demos

Try the dates visualization yourself!

Try the words visualization yourself!

Software

Semantic-Guesser

Media Coverage

Our research has also been featured in additional media, including:

UOIT researchers crack down on password security in wake of Heartbleed (Julie Thorpe speaks to durhamregion.com)
Change your password: A lesson from Russian website hackers (Christopher Collins speaks to durhamregion.com)
Rogers TV Durham Now, September 2014 (all team members describe our work to Neil McArtney; includes the work of Julie Thorpe, Amirali Salehi-Abari (U Toronto), and Brent MacRae on GeoPass)
Follert, Jillian. “From the Enigma Machine to Online Passwords: UOIT Looks at Keeping Secret Information Secret”, Metroland DurhamRegion.com, March 13, 2018
Walker, Anna-Kaiser. “Protect Yourself Against Identity Theft”, Reader’s Digest, March 1, 2018
Spencer, Susan. “A World Beyond Passwords”, CBS Sunday Morning, February 19, 2017 (International television)
Lynch, Laura. “Passwords” CBC Radio One: The Current, February 13, 2017 (National and online radio interview)
Urbina, Ian. “The Secret Life of Passwords” New York Times Magazine, November 19, 2014 (International magazine and web)

We have also been featured on the UOIT homepage, including an article entitled “Heartbleed update: UOIT researchers analyze why consumers use weak passwords”.

Publications

Acknowledgements

Thanks to undergraduate alumni Jeffrey Hickson and Swapan Lobana who worked as research assistants on this project, and to the funding agencies who supported this work.

DocuBurst: Visualizing Document Content using Language Structure

Contributors:

Christopher Collins, Gerald Penn, Sheelagh Carpendale, Brittany Kondo, and Bradley Chicoine

DocuBurst is the first visualization of document content that takes advantage of the human-created structure in lexical databases. We use an accepted design paradigm to generate visualizations that improve the usability and utility of WordNet as the backbone for document content visualization. A radial, space-filling layout of hyponymy (IS-A relation) is presented with interactive techniques of zoom, filter, and details-on-demand for the task of document visualization. The techniques can be generalized to multiple documents.
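A radial, space-filling layout over an IS-A hierarchy can be sketched by giving each node an angular wedge proportional to the word occurrences in its subtree. The tiny tree, the counts, and the rule that children share the parent's full wedge are assumptions for illustration; they are not taken from the DocuBurst code.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical IS-A hierarchy (parent -> hyponyms) and per-node word counts.
TREE: Dict[str, List[str]] = {"animal": ["dog", "cat"], "dog": [], "cat": []}
COUNTS: Dict[str, int] = {"animal": 1, "dog": 2, "cat": 1}

Wedge = Tuple[int, float, float]  # (depth ring, start angle, end angle) in degrees


def subtree_count(node: str) -> int:
    """Occurrences of the node's own word plus all of its descendants."""
    return COUNTS.get(node, 0) + sum(subtree_count(child) for child in TREE.get(node, []))


def layout(node: str, start: float = 0.0, extent: float = 360.0, depth: int = 0,
           out: Optional[Dict[str, Wedge]] = None) -> Dict[str, Wedge]:
    """Assign each node a wedge; children split the parent's wedge by subtree count."""
    if out is None:
        out = {}
    out[node] = (depth, start, start + extent)
    children = TREE.get(node, [])
    child_total = sum(subtree_count(child) for child in children)
    angle = start
    for child in children:
        share = extent * subtree_count(child) / child_total if child_total else 0.0
        layout(child, angle, share, depth + 1, out)
        angle += share
    return out


wedges = layout("animal")
```

Zooming into a subtree corresponds to calling `layout` on that node with the full 360 degrees, and filtering corresponds to dropping nodes whose subtree count is zero before laying out.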

Check out the live demo here.

Media Coverage

DocuBurst featured in Marti Hearst’s wonderful book, Search User Interfaces
DocuBurst featured in the Toronto Star!
DocuBurst on ‘information aesthetics’ blog
Interview with Margaux Watt of CBC Radio One Manitoba’s “Up To Speed”, 21 Feb, 2008
A feature story on DocuBurst aired on FairChild TV “Media Focus” (cable 36 in Toronto), Friday, March 14, 2008!

Resources

Software

The code for displaying and interacting with radial, space-filling trees in prefuse is open source and is available for download. The code is distributed as a zip file and can be imported into Eclipse. It is dependent on the prefuse information visualization toolkit and, unfortunately, is minimally documented at this time:

Radial Space Filling Trees in prefuse [.zip] (requires separate prefuse download) or
Mavenized code, including pom, courtesy Brian O’Neill or
Executable Jar with prefuse embedded [.jar]