Abstract
To make decisions about the long-term preservation of and access to large digital collections, digital curators use information such as the collections’ digital object types, their contents and preservation risks, and how they are organized. To date, the process of analyzing a collection, from data gathering to exploratory analysis and final conclusions, has largely been conducted using linear review and pen and paper methods. To help curators analyze large-scale digital collections, we developed an interactive visual analytics application. We have put methods in place to summarize large and diverse information about the collection and to present it as integrated views. Multiple views can be linked or unlinked on demand to enable curators to identify trends and particularities at different levels of detail and to compare and contrast views. We describe two analysis workflows to illustrate how the application can be used to triage digital collections, facilitate collection management decision making, and provide access. After conducting a focus group study with domain specialists, we introduced features to address their concerns and needs.
Introduction
Digital curation involves the ongoing processes of managing, preserving, and making accessible varied kinds of digital collections. 1 Digital curators are librarians, archivists, researchers, and IT professionals who provide the aforementioned services to academic institutions, the government, industries, and the public at large. Digital collections can be born digital or be digitized; they are aggregations of digital objects used to record, study, observe, and measure cultural, social, physical, and biological phenomena. Web archives, digitized newspapers, literary texts and books, census data, architectural and engineering drawings, maps, and satellite image data as well as combinations of different types of digital objects are only some examples of digital collections.
Among other functions, digital curators evaluate collections to learn their technical characteristics, their contents, and how those are organized. Then, they analyze the information to identify the actions needed to preserve the collections and to make them accessible to the public. The cognitive activities involved in the analysis are nonlinear sense-making processes during which curators make different observations and pursue alternative thinking paths. However, influenced by a long tradition of physical documents, the evaluation of collections is traditionally undertaken in a linear fashion and using pen and paper methods. 2 In the context of large-scale and heterogeneous collections, the different layers of information cannot be easily comprehended if presented linearly and sequentially, and there is a risk of getting buried in details or lost in generalities. Furthermore, during the decision-making process, curators need to integrate and weigh different kinds of information to arrive at conclusions. 3,4 Overwhelmed by the number and diversity of objects to analyze, the community has repeatedly expressed the need to find new, efficient, and automated methods for digital collection analysis. 5 To address these concerns, and with guidance from digital curators, our team developed a visual analytics application to facilitate, organize, and enhance the cognitive processes involved in collection analysis.
The application uses metadata—automatically extracted from the collection—to represent a collection as a treemap 6 and provides analysis functionalities based on metadata aggregations, categorization, filtering, correlation, and data mining. Users may interact with collection views showing digital object types and sizes, tag clouds of directory labels and file names, and organizational patterns at different levels of abstraction. These views are useful to learn the collection’s scope and composition, to make inferences and to learn about its contents, and to make collection management decisions.
This article is an extended version of the work presented in Xu et al. 7 It reports the results of a focus group study with six curators to understand their attitudes toward using the visual analytics application and to obtain feedback about existing and new functionalities. It also includes new features: a data aggregator and a data selector (introduced in sections “Visualizing additional relationships” and “Data navigation”) and a network graph to track analysis workflows (section “Data navigation”).
The contributions of this work include the following:
the creation of a visual analytics tool to interact with large and heterogeneous digital collections for digital curation purposes;
the implementation of novel visualization techniques to render multiple collection characteristics;
the results of a focus group study conducted to gather feedback about the data curators’ attitudes in relation to the visual analytics application.
To the best of our knowledge, this is a first-of-its-kind visual analytics tool for digital collection analysis. Relevant collection characteristics are integrated visually to enable understanding of the multiple information layers that curators have to consider about the composition, contents, and organization of large and heterogeneous digital collections. The application includes improvements to the treemap visualization through pixel-based rendering and glyph-based techniques, designed to show multiple collection attributes. The focus group results revealed the users’ concerns in relation to the collection’s representation. Users also reported difficulty transitioning from abstract to detailed representations of information. We used these comments as feedback to incorporate changes, and we believe that they are relevant to the design of many kinds of visual applications addressing large data.
The rest of the article is organized as follows. Section “Design considerations” describes requirements and design considerations. In section “Related work,” we present related work. Section “Visualization implementation” details the techniques that we developed for visual analysis. Section “Application examples” presents two use cases showing analytical workflows using the application. The focus group session and corresponding data analysis with interpretation are presented in section “Focus group study.” We conclude in section “Conclusions and discussion” with a discussion of the challenges and limitations of the current implementation.
Design considerations
We designed a visual analytics application to conduct triage in a broad range of digital collections. During analysis, curators integrate discrete pieces of information about a collection into a comprehensive assessment. In this visualization, collections’ attributes such as structure, digital object types, contents, and organizational patterns are represented in an integrated fashion and can be analyzed interactively. The interactive features focus on enabling curators to incorporate their experience in the analyses and to follow their stream of thought to facilitate inference making. Challenges included designing methods to summarize large amounts of information and representing the different layers of information meaningfully to improve comprehension and to facilitate decision making. The tool does not address specific domain science curation functions such as performing statistical calculations in reference to particular phenomena recorded in the data. Researchers and the general public can also use the visualization for information access and discovery purposes. Two usage scenarios (see section “Application examples”) exemplify a typical collection analysis workflow and a case of a user searching for images to include in an article.
In this project, we followed an iterative feedback design methodology. As a first step, we discussed problems related to evaluating large collections with digital curators at the National Archives and gathered initial requirements. The digital curator on our team described typical workflows for triaging and accessing collections and the domain-specific concepts that curators consider when analyzing collections. She also provided guidance and feedback throughout the design and development processes. Throughout the development stages, we used different feedback mechanisms, and as a result, the visualization went through several iterations. After the initial designs to visualize structure and digital object classes were implemented, we invited 14 library and archives professionals to accomplish a set of simple tasks and to provide structured feedback. The details of this study, which involved usability testing and comparisons with collection views using Windows Explorer software provided by Microsoft Windows (http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/app_win_explorer.mspx), are available in a technical report. 8 To develop the preservation risk and the selector and aggregator features, we worked for several sessions with two experts in digital preservation. We also incorporated feedback from reviewers of the articles in which we presented the visualization. A focus group session conducted during the last stage of development provided insights and feedback that we took into consideration to make further improvements.
To structure the design, we formalized the requirements and the domain-specific concepts as information needs (see section “Requirements analysis and information needs”) and derived a four-component information model (see section “Visualization model for collection analysis”) to analyze digital collections using corresponding visual representations. Each of the sections that present the visualization features describes the criteria behind the design choice.
Requirements analysis and information needs
We identified the following information needs for digital collection analysis:
Provenance
This refers to the author, department, and/or function from which a collection originated as well as the methods and technologies used to transform raw data into processed collections.
Structure and organization
How digital objects are arranged in a directory hierarchy is understood as the collection’s structure. Structure is important to identify the original order in which digital objects are organized and to understand the relationships between objects. A collection may be formed by different groups of related objects, which may be organized thematically, by author, by date, sequentially, by file naming convention, or have no logical organization. Within one collection, multiple organizational criteria may coexist, and some software requires that the files be organized in specific configurations in order to render them correctly (e.g. geographic information system (GIS)). Additionally, many collections are organized to map the functions or departments of the organization that created them, which points to the collections’ provenance. Maintaining a collection’s structure and order helps preserve provenance and facilitates access.
Content
Descriptive information may be recorded in directory labels and file names. Labels may contain subject terms, proper names, time periods, and provenance information. Curators use this information to create catalogs and to make these collections available to the public.
Statistics
Information about the size and number of files of the different groups of records in a collection is used to make storage and long-term preservation decisions. For access purposes, these statistics indicate the scope of records that a user will have to search through in order to find specific information.
Technical characterization
Knowing the types of file formats in the collection and their location in the structure is needed for access purposes (e.g. to establish where the videos or images are in the collection), to identify simple and complex digital objects formed by one or more file formats, and to plan for the long-term preservation of the collection.
Context
This refers to the reasons why a collection is created and how it is used. Collections’ provenance, structure, contents, and technical characterization are considered by digital curators in relation to contextual information to provide users with accurate understanding of the collection’s functions. Contextual information can be obtained through external sources, or it may be inferred from the contents and characteristics of the collection.
Visualization model for collection analysis
For purposes of building a visualization to address the information needs laid out above, the attributes of a digital collection can be generalized as metadata about each file and about the relationships among them. We characterize digital collections as a four-component information model.
Hierarchical organization
Structural information is presented, so that users can distinguish the distribution of digital objects and their relationships. Although we expect most large collections to have a hierarchical organization, the visualization can also be used for nonhierarchical collections.
Numerical metadata
This refers to any numerical value that is associated with a collection component such as size and number of files forming digital objects and groups of digital objects. In our application, we present values as aggregations, distributions, data mining scores, preservation risk scores, and other statistical calculations.
Additional relationships among digital objects
These refer to relationships other than those presented by the collection’s hierarchical structure. These relationships may result from analysis methods such as clustering or classification.
Descriptive metadata
This refers to any type of text description of digital objects in the form of naming conventions, descriptive tags, or the labels of the directories that contain them.
Throughout the design, we made decisions about how to represent these components in relation to how the user would interact with them. As a first step, curators use the visualization to explore a collection’s patterns. Based on preliminary observations, they can pursue more targeted analysis workflows or focus on specific groups of digital objects. Curators can follow different analysis paths by selecting different visualization interactions, and the application helps them organize and keep track of those paths.
Scalable architecture for data handling
All metadata and the analysis results are stored in a Relational Database Management System (RDBMS) through a one-time processing step and retrieved on the fly. For efficiency in rendering large data, we adopted the Model–View–Controller architecture (MVC) in which Model (data) is divided physically into two locations: the client machine and the database server. 9 During a visualization session, only the data needed for viewing are downloaded from the remote database to a local repository. The data are maintained locally and used on demand until the user closes the visualization session. In this project, View refers to different types of visualizations that are also rendered on demand. Each View has its own controller, and a top-level controller manages all the available and the rendered visualizations. This two-tiered control model integrates different visualization libraries (Prefuse, 10 OpenCloud, 11 and JFreeChart 12 ) into a unified framework and provides flexibility for future expansion.
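The split between a database-backed Model and a per-session local repository can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `node` table, the `CollectionModel` class, and the `demo_database` helper are hypothetical names, and an in-memory SQLite database stands in for the remote RDBMS.

```python
import sqlite3


class CollectionModel:
    """Model split between a database and a per-session local cache (hypothetical)."""

    def __init__(self, connection):
        self._db = connection   # stands in for the remote RDBMS
        self._cache = {}        # local repository, kept until the session closes
        self.db_hits = 0        # counts round trips, for illustration only

    def children(self, directory):
        """Return (name, file_count) rows for one directory, fetching them once."""
        if directory not in self._cache:
            rows = self._db.execute(
                "SELECT name, file_count FROM node WHERE parent = ? ORDER BY name",
                (directory,),
            ).fetchall()
            self._cache[directory] = rows
            self.db_hits += 1
        return self._cache[directory]

    def close(self):
        """End the session and discard the locally held data."""
        self._cache.clear()


def demo_database():
    """Build a tiny in-memory database with an assumed, simplified schema."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE node (parent TEXT, name TEXT, file_count INTEGER)")
    con.executemany(
        "INSERT INTO node VALUES (?, ?, ?)",
        [("root", "maps", 120), ("root", "reports", 45), ("maps", "texas", 80)],
    )
    return con
```

A view controller would call `children("root")` when a directory is first displayed; repeated requests for the same directory are then served from the local cache without another query.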
Related work
Although there is abundant research on the use of information visualization for text analysis, information hierarchies, and multidimensional data, to the best of our knowledge, ours is the first work that integrates all these components into a framework for the purposes of managing and providing access to large digital collections. Below we review work in these areas in relation to our selection of techniques.
Hierarchical information visualization
Common techniques to visualize hierarchical information may be divided into two models: node-link-based visual representation and space-filling-oriented representation. The node-link-based visual representation is good for exploration purposes in which branches of the tree can be hidden. 13 However, for the purposes of representing large hierarchical collections, the rendering of links requires additional space. This model also presents challenges to determine connectivity and to compare between nodes.
The space-filling-oriented representation is more compact, as it does not render links between nodes. Instead, the hierarchical relationship is indicated by the arrangement of the nodes. There are two types of layout: tiled and nested. In the tiled layout, nodes are drawn next to each other without overlap. In Sunburst Tree, 14 the root node is placed in the center of the display, and the children are drawn as arc blocks surrounding the parent node. The tiled layout uses a lot of screen space for rendering internal nodes. A variation of this visualization is the icicle tree, 15 which places the root on one side and arranges internal nodes toward the opposite one. While this representation may be more intuitive in some applications, such as when representing disk space, 16 nodes at different levels of the hierarchy may be hard to compare.
Treemaps use nested layouts, which make more efficient use of the screen space. 6 The root is represented by a rectangle, and all the children are placed inside of the root node, which can make the hierarchical structure hard to recognize. In this project, we use a squarified treemap algorithm 17 and multiple boundary lines with increased spacing in between to better illustrate the hierarchical structure.
Treemaps are used effectively in information visualization applications such as threaded discussion forums 18 and Google news stories. 19 Treemaps are also scalable for large sets of digital objects, for example, in the display of search results. 20 Specifically, related to the visualization developed in our project, a number of commercial applications use space-filling representations to display disk usage statistics. Along with the treemap showing all the information of the directory hierarchy, WinDirStat includes an explorer panel that presents aggregated information and a panel that shows file format composition. 21 The size of each square is determined by the size of the corresponding directory on disk, and the color of the square shows the type of file. To emphasize the directory structure, WinDirStat uses a cushion technique. 22 DaisyDisk, an application with a similar purpose available for Mac OS, uses a Sunburst layout to display a disk’s hierarchical structure. 16 Although our visualization has some overlap with the tools mentioned above, its focus is to support curatorial analysis, specifically of large-scale collections with multiple components and attributes.
Text visualization
Some text visualization applications focus on visualizing text mining results and showing relationships among terms, as well as text patterns in document collections. 23,24 Other techniques such as tag cloud 25 and WordTree 26 focus on providing visual summaries of the content and relations of a text corpus. Fewer research tools focus on supporting visual investigative analysis of text corpora. Jigsaw is an application designed to help analysts discover potential terrorist threats. The application offers multiple data views such as scatterplots and network graphs. 27 Entities are extracted from texts and visualized in relation to other information dimensions such as time and social networks in multiple coordinated views. Lei et al. 28 integrate a tag cloud with stacked area graphs to help users understand text corpora through facets. In their approach, a stacked area graph is used to show distributions of different categories of documents over time. The tag cloud is added directly into the area to give users a quick glance of the documents’ content.
Our use of text visualization also supports investigative analysis. At different levels of the collection hierarchy, digital records and groups of records are associated with their corresponding descriptive terms as tag clouds extracted from directory labels and file names. The text visualization feature provides users with a summary of the contents of the selected directory and enables them to make inferences about the contents of digital objects included. We also introduced a text visualization in which image tags are classified, and the numbers corresponding to each class are represented as a pixel-based rendering in relation to the collection’s structure and provenance information (see section “Usage scenario”). Different from image visualization projects, 29 here, we visualize content descriptions as classes of tags from an image collection. This form of text visualization allows users to identify classes of images within one directory and to compare classes of images across directories.
Visualization of multiple attributes
Glyph and pixel-oriented techniques are two well-known methods for visualizing multidimensional data. In the glyph-based visualization, each attribute of a digital object is mapped to a property such as size, color, length, and orientation of a graphical object. Glyph-based rendering enables an overall visual comparison between two multidimensional objects. This technique has been used to assist in the analysis of network traffic 30 and web search results 31 but is often considered nonscalable for large-volume collections. In turn, pixel-oriented visualization is effective to explore large sets of multidimensional data. The basic idea behind it is to map each attribute value of a digital object to a color pixel. Different attributes are then displayed in different subwindows with informative arrangements to help users identify patterns among attributes and digital objects. 32 In this project, to facilitate the comparison of numerical attribute values and their distribution as well as to show additional relationships among digital objects, we use both horizontal and nested spiral color pixel-filling techniques within the treemap.
Visualization implementation
Our application contains a variety of visual analytics tools within one framework. It allows flexibility to pursue analysis workflows that can be traced back and revisited. Throughout the analysis, relevant collection attributes, such as size, file types, organizational criteria, and hierarchical structure, are visually integrated to improve comprehension about a collection. The visualization provides decision-making support through the possibility to discover trends, identify patterns, and compare and contrast to prioritize a collection’s preservation and access activities.
Collection’s hierarchical structure visualization
Given the need to maintain the original order and structure of the collection, we use treemaps to represent collections. The collection is organized as nodes and edges in a hierarchical structure. The entire visualization space is assigned to the root node, which is the first-level directory. Child nodes (subdirectories) are rendered as nested rectangles within their parent rectangles. This allows one to observe the hierarchical dependencies between subdirectories and thus to understand the relationships between digital objects. To visually differentiate between subdirectories, we render colored borderlines for each of the levels and add a small amount of border spacing between them to increase the visualization’s readability. The space that each node occupies on the visualization is based on the number of files within. Figure 1 shows an example of how we render structure.

Figure 1. Structural representation of a collection as a treemap and border spacing on the same root node.
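The nesting and border spacing described above can be sketched in a few lines of Python. This is a minimal illustration and not the application's code: it uses a simple slice-and-dice split rather than a squarified algorithm, and the `Node` class, the `layout` function, and the `pad` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    files: int = 0                       # files stored directly in this directory
    children: list = field(default_factory=list)

    def total(self):
        """Number of files in this directory and everything below it."""
        return self.files + sum(c.total() for c in self.children)


def layout(node, x, y, w, h, pad=2.0, depth=0, out=None):
    """Assign each directory a rectangle nested inside its parent's rectangle.

    Slice-and-dice split (an assumption for brevity, not the squarified
    algorithm). Files stored directly in a directory that also has
    subdirectories get no separate area in this sketch.
    """
    if out is None:
        out = {}
    out[node.name] = (x, y, w, h)
    kids = [c for c in node.children if c.total() > 0]
    if not kids:
        return out
    # Shrink the drawable area so a border gap stays visible around the children.
    ix, iy = x + pad, y + pad
    iw, ih = max(w - 2 * pad, 0.0), max(h - 2 * pad, 0.0)
    total = sum(c.total() for c in kids)
    offset = 0.0
    for child in kids:
        share = child.total() / total    # area is proportional to the file count
        if depth % 2 == 0:               # split along the x axis on even levels
            layout(child, ix + offset * iw, iy, share * iw, ih, pad, depth + 1, out)
        else:                            # split along the y axis on odd levels
            layout(child, ix, iy + offset * ih, iw, share * ih, pad, depth + 1, out)
        offset += share
    return out
```

Each rectangle is returned as `(x, y, width, height)`. Because the padding is applied at every level, deeper directories sit inside progressively wider margins, which is what makes the hierarchy readable.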
To navigate a collection’s structure, users can render one or more nested directories at a time by choosing in the control panel the depth/level of the tree that they want to observe. To better understand relationships between groups of digital objects, they can zoom in to see the border spacing in more detail. For ease of orientation in the visualization space, we use a squarified treemap algorithm. 17 As the user explores the different directory levels, the treemap layout remains fixed, so that the position of high-level nodes remains stable.
Numerical metadata visualization
Numerical attributes such as the number, size, and types of files are considered by curators to plan a collection’s storage, preservation, and access. To summarize large amounts of file format information, we classify the existing file formats in a collection into classes, each representing a type of digital object. For example, jpg, png, and tiff files belong to the image class, while rtf, doc, and docx are considered part of the word processing class. The amount and diversity of file formats are discovered automatically for each collection, and their classification is based on the needs of data curators to understand the multiple types of data objects present in the collection. In our test bed collection that includes ~300 different file formats, we identified 20 file classes. To show file class distribution, we use pixel-based rendering. Colors were selected based on discussions between members of the application development team to achieve maximum contrast between them.
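The grouping of file formats into classes can be pictured as a lookup table. The sketch below is illustrative only: it keys on file extensions, which is an assumption (format identification in practice may rely on dedicated tools), and the `FORMAT_CLASSES` table holds only the six formats named in the text rather than the full classification.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical, partial table: only the formats named in the text above.
FORMAT_CLASSES = {
    "jpg": "image", "png": "image", "tiff": "image",
    "rtf": "word processing", "doc": "word processing", "docx": "word processing",
}


def file_class(path):
    """Return the class of a file based on its extension, or "unknown"."""
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    return FORMAT_CLASSES.get(ext, "unknown")


def class_counts(paths):
    """Count how many files fall into each class."""
    return Counter(file_class(p) for p in paths)
```

For example, `class_counts(["a.jpg", "b.docx", "c.xyz"])` counts one image, one word processing file, and one file of unknown class.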
Representing such a large number of classes with different colors is challenging. On the one hand, it helps users to note the diversity present in the collection. On the other hand, it can make individual classes hard to identify. Our goal is that users first understand the overall diversity of the collection and later use that information to further explore the individual directories within which distributions of attributes may be less diverse.
During analysis, curators need to know how many file classes exist in a collection and which one is the most relevant. Figure 2 shows an example of multiple-attribute rendering in the treemap. In this example, we extract structural (file path) and technical metadata (file format identification) from the collection, and we store that information in the RDBMS. In response to user queries, statistics are aggregated at each level of the directory structure. To enable observation of all the file classes available in the nodes at the same time, we implemented horizontal pixel-based rendering within nodes. We show the fraction of the various classes in a node by appropriately coloring that fraction of the node’s pixels. We decided to show it this way, so that distributions of classes of files within a directory can be observed in relation to the size and organizational structure of the directory. To enable users to make inferences about the technical characterization of the collection, we render all the file classes available per directory at one time. In this way, a user may be able to infer the presence of complex digital objects (e.g. web archives containing both html files and images or web archives containing downloadable data files). In turn, the spacing between directory levels helps to easily distinguish neighbor folders showing similar patterns and thus to identify groups of similar digital objects (see Figure 2).

Figure 2. Color pixel rendering of file classes in a directory including various subdirectories.
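Aggregating statistics at each level of the directory structure amounts to a roll-up over file paths. The following sketch shows one way to compute it in memory; the `aggregate_by_directory` function and the `(file_path, file_class)` record shape are assumptions made for illustration, and in the application the equivalent aggregation would be answered from the RDBMS.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def aggregate_by_directory(records):
    """Roll up file-class counts to every ancestor directory.

    records: iterable of (file_path, file_class) pairs with relative paths.
    Returns a dict mapping each directory path to a Counter of classes.
    """
    totals = defaultdict(Counter)
    for path, cls in records:
        for ancestor in PurePosixPath(path).parents:
            if str(ancestor) != ".":     # skip the implicit current directory
                totals[str(ancestor)][cls] += 1
    return dict(totals)
```

Each file contributes to its own directory and to every directory above it, so a high-level node summarizes everything nested inside it.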
To render a color for each class, we start from the top left corner of the node and fill in the exact proportion of pixels. The process continues for all the classes. To highlight issues of data quality, files with unidentified formats are rendered at the end of each node in black and files with more than one possible identified format are shown in gray. In addition, to allow curators to identify preservation risk as an indication of the obsolescence of digital objects, file formats are categorized according to five risk levels or unknown risk using the Stanford Digital Repository Format Scoring Matrix. 33 (Preservation risk is an indication of the sustainability and quality of a file format for long-term preservation based on criteria devised at the Library of Congress; see Sustainability of Digital Formats at www.digitalpreservation.gov/formats/index.shtml.) The scores are rendered in colors using the same pixel-based rendering approach described for the file classes.
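The fill procedure can be sketched as follows. This is a simplified illustration, not the application's renderer: it works on a grid of class labels instead of screen pixels, it rounds with a largest-remainder rule (an assumption, since the rounding method is not stated), and the names `allocate_pixels` and `fill_node` are hypothetical.

```python
def allocate_pixels(counts, n_pixels):
    """Split n_pixels among classes in proportion to their file counts.

    Uses largest-remainder rounding (an assumption) so the parts always
    sum to n_pixels when there is at least one file.
    """
    total = sum(counts.values())
    if total == 0:
        return {k: 0 for k in counts}
    exact = {k: v * n_pixels / total for k, v in counts.items()}
    alloc = {k: int(e) for k, e in exact.items()}
    leftover = n_pixels - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def fill_node(counts, width, height, last="unknown"):
    """Return rows of class labels, filled left to right from the top left.

    The class named by `last` (unidentified formats) is placed at the end.
    """
    order = [k for k in counts if k != last] + [k for k in counts if k == last]
    alloc = allocate_pixels({k: counts[k] for k in order}, width * height)
    flat = [k for k in order for _ in range(alloc[k])]
    return [flat[r * width:(r + 1) * width] for r in range(height)]
```

A renderer would then map each label to its color, for instance black for the unknown class, and paint the rows into the node's rectangle.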
Figure 2 shows a directory including various subdirectories of different sizes in which existing file classes are rendered. Blue represents images, green web, red pdf, and black groups all the unknown file formats that could not be classified. Across the directories, the presence of similar color patterns indicates that this is a web archive including a majority of images and in some cases pdf files. As the user mouses over the directories, a pie chart in the right panel shows the percentage of each file class in a given directory. The label of the high-level directory—as provenance information—is available in the middle of the view in contrasting white letters, and the individual tokens from directory names are shown in the tag cloud also in the panel to the right.
During our iterative design process, a curator observed that “the ability to assess varied characteristics and to compare selected attributes across a vast collection is a breakthrough.” 34 In addition, another specialist commented on how the visualization is effective to evaluate technical characterization information in large collections for which there is no prior description.
Data aggregator
The data aggregator constitutes an improvement over the first iteration of the visualization. A useful task in collection analysis is to observe how the value of a single attribute (e.g. size and number of files corresponding to a file class) varies across different directories. A common solution (which we adopted during an initial design stage) is to map the range of those values into a continuous color map. In our project, this approach rendered a narrow color spectrum because the ranges of numerical values of the different collection attributes, such as file classes, are very large and their distribution is uneven. In addition, the value distribution between attributes varies significantly. As a result, users observing file class distributions concluded that there was little variation across the collection. To address this issue, we implemented a data aggregator that allows users to specify the number of color groups, from 1 to 10, in which they would like to see a given attribute distribution. Once the number is chosen, the data are divided so that each group contains a similar number of items. Each group is then mapped to a color for rendering. In Figure 3, the number of pdf files for each directory is shown in two different data aggregation settings. In Figure 3(a), the user selected two groups, while in Figure 3(b), eight groups were used. Dividing the data into groups of similar size ensures that the number of visual items for each color is similar, which provides a better overall contrast. Users may specify the number of groups dynamically, based on their analysis needs.

Figure 3. Data aggregator to show file class distributions within directories.
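Dividing values into groups of similar size is commonly called equal-frequency binning. A minimal Python sketch follows; the function name `equal_frequency_groups` is hypothetical, and the handling of tied values (they may fall into adjacent groups) is a simplification made here, not a documented behavior of the application.

```python
def equal_frequency_groups(values, n_groups):
    """Assign each key a group index in 0..n_groups-1 with near-equal group sizes.

    values: dict mapping a directory to a numeric attribute, e.g. its pdf count.
    Keys are ranked by value, then the ranking is cut into n_groups slices.
    Tied values may land in adjacent groups in this simplified sketch.
    """
    if not 1 <= n_groups <= 10:
        raise ValueError("n_groups must be between 1 and 10")
    ranked = sorted(values, key=values.get)
    return {key: rank * n_groups // len(ranked) for rank, key in enumerate(ranked)}
```

With counts such as 1, 2, 3, 5, 8, and 1000, a continuous color map would place five of the six directories at nearly the same color, whereas two equal-frequency groups split them three and three.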
Visualizing additional relationships
Identifying the way in which digital objects are organized is one of the activities that curators undertake for purposes of providing access to collections. Among other criteria, collections may be organized by subject, by type of digital object, or by geographical location, and their files may be ordered by a regular naming convention or sequentially by number. In turn, a digital collection may apply similar criteria, or more than one kind of criterion, at different levels of the hierarchical structure and across directories (e.g. by date and by geographical location). We implemented a data mining system that uses the terms in the directory labels and the names of the files as metadata to predict organizational patterns. To show the different organizational criteria in the treemap, we implemented a nested square rendering method. Figure 4 shows the basic architecture of our data mining system, which uses WordNet 35 to infer word senses, string alignment to find spatial, sequential, temporal, and naming organization of files, and clustering to form groups of files that have the same organizational criteria.

Architecture of the data mining system to predict the organization of collections.
We use Gotoh’s affine gap cost alignment algorithm 36 to find similarities between the words in the directory labels and the file names. We first tokenize the file and directory names into separate words. Then, we search through the hypernym trees of these words using WordNet. A hypernym tree contains the hierarchical relations between a word and the more general words whose meaning includes it. For example, we can use a hypernym tree to find whether a word’s meaning is spatial or temporal. Spatial words such as Texas have a “location” hypernym, and temporal words like March have “calendar” in their hypernym trees. All spatial and temporal words are replaced with a string of special characters to ensure their alignment when using a sequence alignment algorithm. Numbers and words with similar naming conventions are also aligned (see Figure 5). Using the sequence alignment algorithm, we find sequential, naming, spatial, and temporal patterns in the file names and calculate a similarity score for each of the patterns. We run the algorithm on the entire collection (>100 high-level directories). For every directory, we also derive a feature vector of the form <Sequential, Naming, Spatial, Temporal>. Then, we cluster the directories into four groups, labeled “sequential,” “naming,” “spatial,” and “temporal,” based on the similarity values of each dimension in the feature vector. Finally, these groups are visually represented.

(a) Sequential, (b) naming, (c) spatial, and (d) temporal pattern alignment examples.
In Figure 5(a)–(d), we see sequential, naming, spatial, and temporal patterns found by the sequence alignment algorithm, respectively. The spatial words are replaced by “$$$$$,” which allows matching between the spatial words “Alabama” and “Iowa,” and temporal words such as “April” and “February” are replaced by “#####,” so that they get aligned by the algorithm.
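The masking and alignment steps can be sketched as follows. This is an illustration under stated assumptions, not the system's code: the two small word sets stand in for the WordNet hypernym lookup, the scoring parameters are arbitrary, and the function returns only the global alignment score under an affine gap cost in the style of Gotoh's algorithm, without the alignment itself.

```python
import re

# Hypothetical stand-ins for the WordNet hypernym lookup ("location", "calendar").
SPATIAL = {"alabama", "iowa", "texas"}
TEMPORAL = {"april", "february", "march"}

NEG = float("-inf")


def mask(name: str) -> str:
    """Replace spatial words with '$$$$$' and temporal words with '#####'."""
    def swap(match: re.Match) -> str:
        word = match.group(0).lower()
        if word in SPATIAL:
            return "$$$$$"
        if word in TEMPORAL:
            return "#####"
        return match.group(0)
    return re.sub(r"[A-Za-z]+", swap, name)


def gotoh_score(a: str, b: str, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score with affine gaps (parameter values are assumptions).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    M: last column aligns a[i-1] with b[j-1]; X: gap in b; Y: gap in a.
    """
    n, m = len(a), len(b)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


raw_score = gotoh_score("Alabama_April.pdf", "Iowa_February.pdf")
masked_score = gotoh_score(mask("Alabama_April.pdf"), mask("Iowa_February.pdf"))
```

After masking, both names become the same string, so they align perfectly, whereas the raw names share only a few characters. This is the effect that the special-character replacement is meant to achieve.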
The clustering results are shown as a rectangular glyph, with the outermost rectangle representing the most shared pattern in a particular directory and the innermost rectangle representing the least shared pattern. The width of each rectangle in the glyph shows the similarity score (a higher score corresponds to a wider rectangle). Figure 6 shows an example of the visual representation of the data mining results of a collection of websites with ~38,000 files. We used the color green to denote naming patterns, orange for sequential patterns, and dark red for spatial patterns. In this particular example, a temporal pattern was not found. It can be observed that most directories are rendered with three nested squares in order of green, orange, and dark red, from outermost to innermost. Such a pattern (e.g. A in Figure 6) indicates that all three types of organizational criteria exist in that directory in order of relevance. Other directories show no sequential (B in Figure 6) or spatial arrangements (C in Figure 6). In contrast to the views representing only the collection’s structure (see Figure 1), there is no spacing between the nested squares, which renders the colors within the squares more distinguishable. When asked to give feedback about this representation, a curator explained that through the glyphs, it was possible to clearly identify the relevance of certain organizational criteria in one directory, as well as organizational patterns across many directories.

Data mining results shown as nested squares. Three exemplar patterns are labeled with A, B, and C. Green, orange, and dark red colors denote naming patterns, sequential patterns, and spatial patterns, respectively.
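The construction of a nested glyph from a directory's similarity scores can be sketched as below. The text states only that the most shared pattern is outermost and that a higher score gives a wider rectangle; scaling every width relative to the highest score, the cell size, and the function name are assumptions made for illustration. The color for temporal patterns follows the later usage scenario.

```python
from typing import Dict, List, Tuple

# Colors as described for the organizational patterns.
COLORS = {
    "naming": "green",
    "sequential": "orange",
    "spatial": "darkred",
    "temporal": "brown",
}


def nested_glyph(scores: Dict[str, float], cell: float = 100.0) -> List[Tuple[str, str, float]]:
    """Return (pattern, color, width) layers ordered from outermost to innermost.

    Patterns with a zero score are omitted. Widths are scaled so that the
    highest scoring pattern fills the treemap cell (an assumption).
    """
    layers = sorted(((s, p) for p, s in scores.items() if s > 0), reverse=True)
    if not layers:
        return []
    top = layers[0][0]
    return [(pattern, COLORS[pattern], cell * score / top) for score, pattern in layers]


# Hypothetical feature vector for a directory like "A" in Figure 6.
glyph = nested_glyph({"naming": 1.0, "sequential": 0.5, "spatial": 0.25, "temporal": 0.0})
```

Drawing the layers in the returned order, each centered in the cell, produces green, orange, and dark red squares from the outside in, and a directory without a sequential or spatial score simply yields fewer layers.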
Textual context overview
To enable overview browsing of the terms in the labels of files and directories, we use OpenCloud, a tag cloud visualization library. 11 As a user navigates through a particular directory, the file and directory names are inserted into the cloud. The library extracts tags from the directory and file names and assigns a weight to each tag based on its frequency of occurrence. Tag clouds are generated showing more frequent words in bigger font sizes. Tags corresponding to directory labels are shown in red, and those corresponding to file names are shown in blue. The color distinction helps differentiate descriptive terms located at the directory level from those at the file level. This information is useful to study the directory contents.
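The weighting described above can be sketched independently of OpenCloud. The code below does not use or reproduce that library's API; the tokenization rule, the linear font scaling, and the point sizes are assumptions chosen for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokens(names: Iterable[str]) -> List[str]:
    """Split names on non-alphanumeric characters and lowercase the pieces."""
    return [t.lower() for name in names for t in re.split(r"[^A-Za-z0-9]+", name) if t]


def tag_weights(dir_names, file_names, min_pt: float = 10, max_pt: float = 36
                ) -> List[Tuple[str, str, int, float]]:
    """Return (tag, color, count, font size) for directory and file tags.

    Directory tags are red and file tags are blue, as in the tag cloud view.
    Font size grows linearly with the tag's frequency (an assumption).
    """
    counts = {"red": Counter(tokens(dir_names)), "blue": Counter(tokens(file_names))}
    top = max((n for c in counts.values() for n in c.values()), default=1)
    return [
        (tag, color, n, min_pt + (max_pt - min_pt) * n / top)
        for color, counter in counts.items()
        for tag, n in counter.items()
    ]


# Hypothetical directory content.
cloud = tag_weights(
    dir_names=["canyonlands"],
    file_names=["canyonlands_2005.jpg", "canyonlands_2006.jpg"],
)
```

Note that file extensions such as "jpg" become tags under this simple rule; a real deployment would probably filter them out.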
The tag cloud is also used to verify the data mining results. In Figure 7, it is possible to see the correspondence between the tag cloud and the data mining results for a directory. In this example, sequential (orange) and naming alignments (green) exist in the same proportion. Here, “canyonlands” and “canyon de chelley” are indicators for naming alignments; 2005, 2006, and 2007 are indicators for sequential alignments. Tags like “bigbend,” “bryce,” and “map” indicate possible spatial (dark red) alignments.

This figure shows the (b) tag cloud rendering for (a) the contents of a given directory.
Data navigation
The application’s interactivity is fundamental to provide a rich analysis experience. Beyond basic zoom and pan functions, the interactive features implemented are as follows:
Navigation with dynamic changes of root
The root node of a view can be changed by entering the node’s ID number in the textbox available in the menu bar (Figure 2) or by double clicking on any node of an existing treemap instance. This feature provides a convenient way to create multiple views for investigation and for comparing and contrasting those views. For example, users can double click on several directories of a treemap instance (each becomes the root node of a new view) or may navigate down a particular directory. To help users track and retrace their investigative workflows, we introduced a dynamic network graph. This feature is similar to the navigation view in the visualization framework proposed in the study by Shrinivasan and Van Wijk. 37 We added this feature in response to curators’ feedback noting that they could easily get lost after opening more than three views (see section “Usage scenario”).
Figure 8 demonstrates the visualization of a user’s investigative flow with eight opened views. Each node represents a treemap visualization instance and is labeled with the view number and the root node ID. When a new view is created, a new node is automatically added as a child of the node from which the new view is generated. In Figure 8, the Treemap (TM)1:2 represents the first visualization instance whose root ID is 2. Three instances (TM2, TM3, and TM6) were generated from TM1. The number indicates the order in which each instance was generated. In this case, TM6 was created after TM4 and TM5 was created from TM3.

Example of a network visualization showing an investigation workflow.
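The bookkeeping behind such a workflow graph can be sketched with a small tree structure. The class and method names are hypothetical, and apart from TM1:2 the root IDs and the parent of TM4 are made up for the example, since the text does not state them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ViewNode:
    view_no: int                       # order in which the view was created
    root_id: int                       # ID of the directory used as treemap root
    parent: Optional["ViewNode"] = field(default=None, repr=False, compare=False)
    children: List["ViewNode"] = field(default_factory=list, repr=False, compare=False)

    @property
    def label(self) -> str:
        return f"TM{self.view_no}:{self.root_id}"


class WorkflowGraph:
    """Records which treemap view each new view was generated from."""

    def __init__(self) -> None:
        self.views: Dict[int, ViewNode] = {}

    def open_view(self, root_id: int, from_view: Optional[int] = None) -> ViewNode:
        parent = self.views[from_view] if from_view is not None else None
        node = ViewNode(len(self.views) + 1, root_id, parent)
        if parent is not None:
            parent.children.append(node)
        self.views[node.view_no] = node
        return node

    def path_to(self, view_no: int) -> List[str]:
        """Labels from the first view down to the requested one."""
        node, path = self.views[view_no], []
        while node is not None:
            path.append(node.label)
            node = node.parent
        return path[::-1]


graph = WorkflowGraph()
graph.open_view(root_id=2)                # TM1:2, as in Figure 8
graph.open_view(root_id=17, from_view=1)  # TM2 (root ID made up)
graph.open_view(root_id=23, from_view=1)  # TM3 (root ID made up)
graph.open_view(root_id=40, from_view=2)  # TM4 (parent and root ID assumed)
graph.open_view(root_id=51, from_view=3)  # TM5, created from TM3
graph.open_view(root_id=8, from_view=1)   # TM6, created from TM1
```

Calling `graph.path_to(5)` retraces how the fifth view was reached, which is the kind of question the network view is meant to answer once many views are open.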
Abstraction with dynamic changes of rendered levels
Users may select the number of collection directory levels to render at a given time. This is an important requirement to support information abstraction and synthesis. In our implementation, a high-level abstraction significantly reduces the amount of data to render, which improves performance for large-scale data. When users need to see distributions at deeper directory levels, our interface enables them to navigate the data vertically (see examples in Figures 2 and 6).
Detail-on-demand
Hovering over a node with the mouse updates the corresponding pie chart and the tag cloud with the node’s specific information. For instance in Figure 2, the pie chart presents statistics corresponding to the selected node, and in Figure 7, the tag cloud shows the frequency of words in the directory corresponding to the selected node. To explore directories in detail, any node in a treemap can be double clicked and it expands to a new treemap with the selected node as the root. The view is presented as a tab that can be resized and moved around in the screen space. The latter enables users to organize views into different analysis workflows in their screen space (see Figure 11).
Data selection and filtering
Data selection with search box
We provide users with a search function to find terms of interest in the directory labels. When the term is found, the corresponding directory or directories are highlighted in a transparent dark pink. This feature is useful to find digital objects organized under a given label or term within and across directories. Commenting on it, a curator explained that it is useful for investigators looking for related data across directories.
Data selector
A common scenario during analysis is to identify directories that meet certain evaluation criteria specified by the curator. Therefore, there is a need to support queries-on-demand through the visualization. 38 The data selector feature is designed to meet this need by allowing users to add any number of selected collection attributes. Once an attribute is added, its minimum and maximum values appear in two textboxes. Users may edit the values to specify other minimum and/or maximum values. Only the directories containing values that satisfy the specified criteria will be highlighted in the view. An example of how this feature works is shown in Figure 9. The image in Figure 9(a) shows all the directories containing a minimum of one word processor file class highlighted in green. Next, we added the criterion “directories with at least one pdf file class.” Results from Figure 9(b) show that in comparison to the first image, there is only one directory that contains word processor files without any pdf files. After adding a third criterion, “directory with at least one spreadsheet file,” only six directories remain highlighted as shown in Figure 9(c).

Uses of the data selector feature.
The data selector interface translates the user-specified criteria into a single joint structured query language (SQL) query and queries the relational database dynamically to determine the directories to highlight. There is no limit on the number of criteria that can be added, and previously added criteria can be removed. This feature was developed with the goal of providing tools to guide the exploration from general to more specific, to identify areas of interest, and to study the collection with more precision. As explained by the curator who suggested this feature, it filters out information, allowing users to compare and contrast for purposes of making decisions and establishing priorities.
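The translation of criteria into one joint query can be sketched with an in-memory SQLite database. The table layout (`dir_stats` with one row per directory and attribute), the attribute names, and the counts are hypothetical; the application's actual schema is not described in the text. The sketch combines one parameterized subquery per criterion with INTERSECT, which is one of several ways to express a conjunction.

```python
import sqlite3
from typing import List, Tuple

Criterion = Tuple[str, float, float]  # (attribute, minimum, maximum)


def build_query(criteria: List[Criterion]):
    """Combine all criteria into a single SQL statement plus its parameters."""
    if not criteria:
        return "SELECT DISTINCT dir_id FROM dir_stats", []
    parts, params = [], []
    for attribute, lo, hi in criteria:
        parts.append(
            "SELECT dir_id FROM dir_stats "
            "WHERE attribute = ? AND value BETWEEN ? AND ?"
        )
        params += [attribute, lo, hi]
    return " INTERSECT ".join(parts), params


def select_dirs(conn: sqlite3.Connection, criteria: List[Criterion]) -> List[int]:
    """Return the IDs of the directories to highlight."""
    sql, params = build_query(criteria)
    return sorted(row[0] for row in conn.execute(sql, params))


# Hypothetical per-directory file class counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dir_stats (dir_id INTEGER, attribute TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO dir_stats VALUES (?, ?, ?)",
    [
        (1, "word_processor", 5), (1, "pdf", 2), (1, "spreadsheet", 1),
        (2, "word_processor", 3), (2, "pdf", 0),
        (3, "word_processor", 1), (3, "pdf", 9),
        (4, "pdf", 4),
    ],
)

BIG = 10**9  # stand-in for "no upper bound"
step_a = select_dirs(conn, [("word_processor", 1, BIG)])
step_b = select_dirs(conn, [("word_processor", 1, BIG), ("pdf", 1, BIG)])
step_c = select_dirs(conn, step_criteria := [
    ("word_processor", 1, BIG), ("pdf", 1, BIG), ("spreadsheet", 1, BIG),
])
```

Each added criterion narrows the highlighted set, mirroring the progression from Figure 9(a) to Figure 9(c); removing a criterion only requires rebuilding the query from the remaining list.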
Linking of multiple views
To allow flexibility in the analysis workflows during which users may open multiple treemap views, we devised two linking capabilities: duplication and selection. Users can choose to create a new visualization from a selected node or to generate a duplicate view from the existing visualization. Duplication helps when viewing different attributes (e.g. one view shows data mining results and the other file classes). In turn, selection is related to detail-on-demand, which allows users to choose a section of the visualization and to show its details in another view. Once a new visualization is created, the general visualization control panel is dynamically linked to the view currently selected, and the user can control each view individually. This feature allows users to quickly change information highlighted in two views of the same set of data individually for comparison.
Additionally, users can select data through the multiple view matrix panel that is automatically generated when there is more than one visualization instance opened (Figure 10). In this panel, the checkboxes between any two visualization instances are shown, and users can choose to link or unlink multiple visualization instances. Once the checkbox is marked, the data used for visualization on that row will be automatically applied to the visualization indicated on that column. To allow one visualization instance to control or be controlled from multiple other visualization instances, users can check any combination in the matrix. The user can also stop the linkage among views anytime through the interactive panel.

Matrix panel of all the visualization instances to coordinate multiple views.
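The behavior of the matrix panel can be sketched as a set of directed links between views. The class below is a hypothetical model: it assumes that a checked cell propagates the displayed attribute from the row view to the column view, and that propagation is direct only (it does not chain through further links), which the text does not specify.

```python
from typing import Dict, Set, Tuple


class LinkMatrix:
    """Directed links between visualization instances (row controls column)."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}          # view -> attribute being shown
        self.links: Set[Tuple[str, str]] = set()

    def add_view(self, view: str, attribute: str) -> None:
        self.state[view] = attribute

    def link(self, row: str, col: str) -> None:
        """Check the box at (row, col): updates of row are applied to col."""
        if row != col:
            self.links.add((row, col))

    def unlink(self, row: str, col: str) -> None:
        self.links.discard((row, col))

    def update(self, view: str, attribute: str) -> None:
        """Change what a view shows and push the change to linked views."""
        self.state[view] = attribute
        for row, col in self.links:
            if row == view:
                self.state[col] = attribute


matrix = LinkMatrix()
for name in ("TM1", "TM2", "TM3"):
    matrix.add_view(name, "file class")

matrix.link("TM1", "TM2")
matrix.update("TM1", "preservation risk")  # TM2 follows, TM3 does not
after_link = dict(matrix.state)

matrix.unlink("TM1", "TM2")
matrix.update("TM1", "organization")       # TM2 keeps its previous attribute
after_unlink = dict(matrix.state)
```

Because links are directed pairs, any combination of boxes can be checked, so one view may control several others or be controlled by several others.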
The use and coordination of multiple views enhance support for multiuser collaborations. Figure 11 demonstrates a scenario in which two users interact with different visualization instances independently and collaboratively. At a certain point in the exploration, the user to the left updates a view instance from the user to the right through the interactive matrix view for purposes of discussion. To summarize the interactive visualization experience, during a feedback session a domain specialist contrasted its versatility to the fixed way in which information about a collection is traditionally presented to users.

Users working independently and collaboratively using the matrix panel.
Application examples
Test bed collection
In our project, we use a test bed collection developed by the National Archives and Records Administration (NARA, http://www.archives.gov). It includes publicly available data provided by Federal Agencies or harvested from their websites. Each record group corresponds to all the records of a small Federal Agency or some of the records of a larger Federal Agency and is represented as a node that includes child nodes. In turn, each record group may have different types of digital objects bearing different arrangements and a variety of file formats. The sample from the test bed collection that we used contained 1,031,118 files in 200 different formats. The collection includes everything from financial documents and press releases to GIS data, computer-aided design (CAD) drawings, websites, and three-dimensional (3D) images, among others. Some of the record groups have up to 12 levels of hierarchical nesting. In the scenarios, we use the terms records and record groups to illustrate collection analysis and information discovery workflows in the context of archives.
Usage scenario
Rose, an archivist working at the State Archives, has received a backlog collection with record groups from different government agencies. Her goal for the day is to use the visualization to conduct analysis and to make decisions about storage allocations and processing needs with the goal of making the collections promptly accessible to the public. She first opens a general view and immediately finds out which agencies sent more materials by looking at the size of the corresponding nodes. Mousing over each one, she obtains precise statistics for each agency in the form of pie charts, which she will report to the storage allocation team. Next, she focuses on understanding general patterns in the collection. For this, she selects the file class view (Figure 12) and, by looking at the color and size patterns across directories, she quickly identifies that web (green), pdf (red), and image (blue) classes are predominant. Seeing the same color combination in most directories, she infers that there is a high presence of web pages in the collection. With this information, she submits a ticket to the Advanced Interfaces engineers who will provide access to them through the archives portal. She also observes that some directories with a majority of pdf files also contain spreadsheets (dark green) and/or text files (yellow). Based on her experience, she concludes that those may be different versions of the same accounting document, which is often saved in both a comma delimited and a pdf format. She decides to use the text format for long-term preservation because of its interoperability. To her distress, she confirms that in almost all the directories, there are significant numbers of files without file format identification information. This means that the file format identification software in use needs to be updated. She also observes that four agencies sent a majority of compressed files, which will have to be unpacked for identification. 
She creates a request for the Data Analysis team to update the file format identification tool, to unpack and identify the compressed files, and to estimate a timeline to provide access to these record groups.

View showing file class distributions for all the record groups. A zoom out region shows the spacing between neighboring squares to help users distinguish individual groups of records.
After observing general patterns, Rose is interested in obtaining more precise information about the distribution of different file classes in the collection. She creates a new treemap view and uses the aggregator feature to examine the distribution of image, pdf, and web file classes. After conducting observations for each class (see pdf file class example in Figure 3), she learns their predominance across the collection.
During her general observation, she learned that web pages containing image and pdf classes are ubiquitous in the collection. Next, she wants to identify which record groups containing these classes are at higher preservation risk (see Figure 13). She creates a new treemap view, and using the selector feature, she combines the three classes of files with two other criteria: number of unknown file formats and risk level 2, which is a risk level associated with image file formats. The view shows in bright green the record groups that she can prioritize for preservation action or she can continue the evaluation by including other criteria such as risk associated with pdf files.

View showcasing results of the selection of five criteria of interest to the analyst.
She now decides to further examine the largest record groups to study what other factors may influence a preservation decision. By double clicking on the biggest directory (highlighted in Figure 13), a new treemap is generated from the last view. To examine this particular record group, she uses the preservation risk view (Figure 14), which shows the relationship between types of file formats and preservation risk. She notes that while the directory contains a big proportion of risk 0 and risk 1 files, there are many file formats for which there is no format or risk information (in black) as well as a significant number of compressed files that are flagged as risk 5. The detailed observation of a particular record group confirms that the general observations made about the entire collection were accurate and indicates that it would be wise to prioritize the largest record group for preservation actions.

(a) Preservation risk view and (b) file class view with detailed view of correspondence between lowest risk level and file class.
Next, she is curious to learn which file classes correspond to risk 0. She goes back to the first treemap view and zooms in to observe file classes in detail. Using as contextual reference the Digital Repository Format Scoring Matrix, she infers that risk 0 (light pink) corresponds to the text file class (yellow). 31
Rose then proceeds to study the collection’s organizational criteria. For this, she opens another view to render the data mining attributes. The first organizational view at the first directory level shows that the predominant organization corresponds to a combination of naming (light green), temporal (brown), and spatial (dark red) alignments (Figure 15). Knowing that most groups have more than one organizational layer, Rose creates duplicate visualizations and updates the directory level at each new instance to navigate the hierarchical structure. While she has many views opened, she can keep track of her analysis workflow using the network view. As she goes deeper in the directory hierarchy (Figure 16), she notices that temporal alignment starts fading out and that sequential emerges as smaller directories are rendered. She concludes that temporal and spatial terms are used at the higher directory levels to describe the records. In turn, sequential alignment becomes prominent in deeper directories containing files whose naming conventions are formed by subsequent numbering systems. After concluding the analysis, she reports that a majority of the collections have some degree of organization and that their themes can be inferred at the higher directory levels. This will guide the Cataloguing and Description team in identifying adequate access points for the collection using the tag cloud function. All along, Rose has traced the analysis workflow through the network visualization and has been able to incorporate her knowledge as well as contextual information in the analysis.

Visualization of inferred organizational criteria at the highest directory level.

General view of inferred organizational criteria at the nested directory levels.
Visual analysis of tags in image collections
Our system was adapted to provide users with a tool to observe the organizational hierarchy of image collections in association with their tag information. Represented as distributions in the treemap, tags are associated with the corresponding directory label containing provenance information. To visualize the distribution of classes of tags and to identify their relevance, we use pixel-based rendering techniques. In this case, we automatically associate images with tags found in the corresponding image descriptions in the html pages. First, we parse the html files and extract the image descriptions. To generate a set of initial tags, we apply stop word removal and a frequency count. High-frequency words are chosen to be the tags and are manually clustered into eight categories, so that, for example, forest and tree map to one class. We also include a class for images that do not correspond to any of the eight tag categories. We perform regular expression matching between the tag descriptions, the html content associated with every image, and the image id/filename in order to obtain the set of tags that correspond to that image. At this point, every image has a set of tags, and every directory has a set of images. To do the pixel-based rendering, we find the particular tags present in a directory and render them with different colors. Given that one image may have more than one tag (e.g. images with rivers and valleys), the colors in the representation reflect the diversity of the image collection. The application supports a richer understanding of a large collection’s content and allows content to be compared across record groups. The method can be used to associate descriptive tags with any kind of data.
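The tagging steps can be sketched on a handful of already extracted image descriptions. The stop word list, the word-to-category table, the sample descriptions, and the plural-tolerant pattern are all assumptions made for illustration; html parsing is omitted, and the hand-built table stands in for the manual clustering into categories.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Hypothetical stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "at", "with", "near", "on"}

# Hand-built mapping from frequent words to tag categories (assumption).
CATEGORY = {"forest": "tree", "tree": "tree", "river": "river", "valley": "valley"}


def candidate_tags(descriptions: Dict[str, str], top_n: int) -> List[str]:
    """Stop word removal followed by a frequency count over all descriptions."""
    counts = Counter(
        word
        for text in descriptions.values()
        for word in re.findall(r"[a-z]+", text.lower())
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(top_n)]


def tags_for_image(description: str) -> Set[str]:
    """Match category words (optionally plural) against one description."""
    found = {
        category
        for word, category in CATEGORY.items()
        if re.search(rf"\b{re.escape(word)}s?\b", description, re.IGNORECASE)
    }
    return found or {"untagged"}


# Hypothetical image descriptions extracted from html pages.
descriptions = {
    "img1.jpg": "A river in the valley",
    "img2.jpg": "Forest of pine trees near the river",
    "img3.jpg": "Visitor center",
}

most_frequent = candidate_tags(descriptions, top_n=1)
image_tags = {name: tags_for_image(text) for name, text in descriptions.items()}
```

Counting the resulting tag sets per directory would give the per-directory distribution that the pixel-based rendering colors.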
Figure 17 shows the distribution of the tags “rock,” “park,” “road,” “valley,” “cliff,” “tree,” “snow,” and “dam,” each shown in a different color across six directories. Black indicates that there are no images in a directory or that there is no corresponding tag. Hence, it becomes very easy to identify folders with similar types of image content.

Pixel rendering of image tags distribution.
To represent the correlation between two chosen tags, we use a nested-square-based rendering technique. In Figure 18, a solid square indicates that all the images contain only one of the selected tags. When the square has two colors, the outermost rectangle represents the dominant tag and the innermost rectangle represents the proportion of images with both the dominant tag and the other tag. For example, if “tree” is represented by green and “snow” is represented by yellow, then in the square in the upper left corner, tree images are dominant and the proportion of images containing both trees and snow is shown.

Nested square rendering of image tag correlation.
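The values behind one correlation glyph can be sketched as follows. The text does not state the denominator of the inner proportion; this sketch assumes it is the number of images carrying the dominant tag, and the function name and sample data are hypothetical.

```python
from typing import Dict, List, Optional, Set


def tag_correlation(image_tags: List[Set[str]], tag_a: str, tag_b: str) -> Optional[Dict]:
    """Compute the dominant tag and the share of its images that carry both tags.

    Returns None when neither tag occurs in the directory.
    """
    n_a = sum(tag_a in tags for tags in image_tags)
    n_b = sum(tag_b in tags for tags in image_tags)
    both = sum(tag_a in tags and tag_b in tags for tags in image_tags)
    if n_a == 0 and n_b == 0:
        return None
    dominant, n_dominant = (tag_a, n_a) if n_a >= n_b else (tag_b, n_b)
    return {
        "dominant": dominant,                # color of the outer square
        "inner_ratio": both / n_dominant,    # relative size of the inner square
    }


# Hypothetical tag sets for the images of one directory.
directory = [{"tree"}, {"tree", "snow"}, {"tree"}, {"snow", "tree"}]
result = tag_correlation(directory, "tree", "snow")
```

An `inner_ratio` of zero corresponds to a solid square, since no image in the directory combines the two tags.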
The two pixel-based rendering representations, as bars and squared glyphs, complement each other. Representing information as bars of different sizes and colors makes it easier to visually detect patterns among a large number of variables. In turn, the visualization of tag correlations as squared glyphs allows a detailed comprehension of content and the relationships between variables. Feedback obtained from archivists indicates that the flexibility to conduct general observations as well as more detailed analyses is in tune with archival practices and that a change in visual representation (from bars to glyphs) within one analysis workflow helps the cognitive process.
Usage scenario
Jane, a freelance journalist, is writing an article about the diversity of landscapes in the US National Parks. She decides to find images to illustrate her article in the National Parks website archives, which provide a comprehensive image survey of the flora and fauna of the parks and corresponding descriptions. She soon realizes that it is not only time consuming to inspect each web page but also that she will need to organize thousands of images to reflect the diversity across different parks by sorting one image after another.
She decides to use the visual analysis tool offered by the National Archives to find what she needs. Using the search function in combination with the file type view and the statistics panel, she rapidly identifies a record group containing web pages with ~26,000 images corresponding to National Parks. In turn, hovering over the directories, she can read their labels in the tag cloud and see that they are organized by park name. These steps have reduced her work significantly, and now she can focus on finding the right images.
Using the tags selector in the visualization interface, she can find images related to tags as well as the correlation between two tags. She first specifies the terms (e.g. “mountain,” “river,” “forest,” “tree,” “water,” and “parks”) that correspond to her inquiry. The system automatically associates these terms to images that have them as a tag in their html description and presents a view showing the distribution of the various tags across directories (Figure 19).

Visualization showing image tags of different National Parks.
In Figure 19, different squares represent images and html pages of different National Park websites organized by name of the park. Each color block represents the number of images associated with a particular tag, and the images that are not associated with any tag because they lack a corresponding description are shown in gray. This view allows her to quickly learn which parks have images containing one or more of the landscape attributes she is interested in, as well as to get a sense of the number of images that are not classified for lack of a description. She now wants to select one image containing two landscape features present in National Parks. She notices that valleys and rivers are prominent tags in the collection and finds their correlation (Figure 20). In the view with the results, she notes two directories (marked in red in the image) containing bigger proportions of images with both rivers and valleys and decides to select images for her article from those directories. The visualization tool narrowed her search from 26,000 to 30 images, which is the combined total of images with the relevant tags contained in the two selected directories.

Correlation between river and valley tags shown as nested squares.
Focus group study
In the case of visual analytics applications where the goal is to help users make inferences and arrive at decisions about a phenomenon during an interactive process, typical usability studies will not capture the complexity of the experience. 39,40 To better understand how a visualization helps the investigative process, researchers suggest that user studies should include different heuristics and combine quantitative and qualitative methods. 41 Yet, there is no agreed upon standard to evaluate the visualization experience. 42 When the visualization was nearly complete, we decided to conduct a focus group session to understand the domain experts’ attitudes toward it. Focus groups are a type of qualitative interview aimed at discussing products, applications, or issues in depth among smaller groups of interested parties with the goal of understanding their feelings toward the object of the discussion. 43 During the sessions, a free flow of opinions is encouraged, and it is likely that opinions will shift and new ideas will be developed as a result of the exchange between the participants. It is precisely this kind of dynamic that allows relevant themes to emerge and that is, from a research perspective, important to capture. The focus group method is not suitable to test the usability of applications nor does it take the place of a user experience study. In the context of our project, we understood that a focus group session would provide rich insights about the interests, needs, and concerns of prospective users of the visual analytics application. For example, we wanted to learn whether they had significant problems understanding the representation and the interactive features, whether they would consider using the tool, what role it would play in their daily work, and how much of a shift in practices, from pen and paper to interactive analysis, this would be.
Protocol and data analysis rationale
Six curators with varied expertise were recruited to participate in a 2-h focus group session: a digital archivist, an IT policy maker, two preservation librarians, a metadata librarian, and a digital assets management librarian from the University of Texas campus. The session’s protocol was planned to promote understanding of the application and to enable free exchange of ideas while directing the discussions to target points. We divided the session into four segments of 25 min each and included a break in the middle. The first three segments focused on the main visualization features: (1) representation of structure and technical attributes, (2) representation of content and organizational patterns, and (3) interactivity and collaborative capabilities. The fourth segment was an open discussion about the totality of the experience that included more demonstrations of features of the visualization. Using a 41.9″ × 24.5″ tiled display system to interact with the visualization, a facilitator explained the features in a show and tell fashion, using the case studies presented in this article as examples of interactions with the collections based only on the metadata gathered by the application. Participants did not interact with the visualization to carry out specific tasks, but following the demonstrations and at any point in the discussions, they could ask questions about the application’s mechanics. During each demonstration, the facilitator asked the participants whether they wanted to see additional features of the visualization. Once the discussions were opened, the facilitator initiated the conversation by asking questions about each feature of the visualization to ascertain
(a) the participants’ understanding of the collection’s representation and abstractions as shown in the interactive views;
(b) whether the interactive views facilitate making inferences and decisions.
The audio of the session was recorded. To analyze the data, we wrote a session report using Krueger’s analysis framework as guidance. 44 A focus group report is a synthesis of the session that allows one to identify the relevant themes that emerge during the discussions. To create the report, we listened to the audio three times and took notes on a three-column form. We recorded emerging themes in one column and included the session time stamp and the participant’s comments in the adjacent columns. Later, similar themes were consolidated and reorganized according to their relevance to questions (a) and (b) above. Based on this new order, we formulated our interpretations.
Session report
Structure and technical attributes representation
Upon viewing the entire collection represented as a treemap with the file classes’ distributions, the participants determined that web, pdf, and image file classes are the prominent types in the collection. They also concluded that web pages are the predominant type of digital object in the collection. Many commented that learning about the technical characteristics of the collection allows them to anticipate the types of software needed to access the data. They also said that the possibility to detect the presence of unknown file formats is helpful in deciding what preservation actions to prioritize. At the same time, they noted difficulties distinguishing between some of the colors representing the 20 file classes.
Participants also indicated that the information about provenance (labels showing the name of the agency for each record group shown on screen) was not sufficiently integrated with the structure and file classes distribution view. Currently, the name of the directory is shown on the control panel as the user mouses over one directory at a time. Participants agreed that looking at all the provenance information at the same time would better inform the comparisons across directories and allow them to better establish priorities for preservation. There was a short discussion about the possibility of showing the labels in each treemap square as an alternative, but one of the participants noted that the presence of all the titles would clutter the general collection view.
A number of comparisons with traditional methods of interacting with digital collections emerged during the discussions. As we demonstrated how to navigate the collection’s hierarchy, the digital assets management librarian explained that she was used to the traditional file explorer views of collections. Shortly after making that comment, she indicated that, in contrast with the file explorer view, the visualization allowed her to appreciate the distributions of file classes. In turn, the archivist commented that traditional linear review methods would not scale for the purposes of evaluating such a large collection.
Significant time was spent answering questions about what type of metadata was gathered and how it was stored and categorized in the RDBMS. All the participants made suggestions for improving or adding features to the file class visualization in relation to their particular expertise. For example, one of the preservation librarians said that she would like to visualize files in relation to their last modified dates and to archived dates for preservation planning purposes, and the metadata librarian indicated that it would be important to detect duplicate records. In turn, the IT policy maker reiterated the importance of viewing file classes in context with provenance for purposes of planning file migration in relation to potential Freedom of Information Act (FOIA) requests. She also commented that the treemap representation could be used to investigate the organizational structure of an agency and to study the sizes and types of records that it produces.
Content and organizational patterns
After explaining the rationale behind the alignment-based clustering for viewing collections’ organizational patterns and the methods for visualizing image tags, the discussion turned back to issues of collection representation and completeness. One of the participants noted that her first impression when looking at these views was that everything was perfectly organized. It was after our explanation that she realized that only four types of organizational patterns are represented and that not all the images had a corresponding descriptive tag. Participants suggested that unknown patterns and untagged images should be represented in a manner similar to the way that unknown file classes are shown in black. At the same time, they worried that having many more categories/colors to visualize might lead to visual clutter.
Interactivity
As the session progressed, the participants sorted out some of the issues that they had raised in previous segments. For example, to mitigate issues of visual clutter and still render more details about the collection, a participant suggested that hierarchical navigation could be built in relation to different levels of categorization. By clicking on a given color/class, a user could navigate to subcategories showing more detailed information related to that class of information. They also noted the importance of keeping track of the navigation between views to aid the thought process while conducting assessment. It was also mentioned that the visualization would be useful to communicate and collaborate with other colleagues. Three participants mentioned that they had data that they would like to analyze using the visualization.
Participants remembered visualization features and made suggestions for better integrating them. At the beginning of the discussion, we showed the correspondence between the file classes and the information included in the control bar such as the tag clouds and the charts showing numerical metadata. As the facilitator interacted with different views, participants made remarks about how these information pointers could better complement other views. For example, when we showed the tag image viewer, the metadata librarian wondered whether the tag cloud could provide more detail about the content of the images.
Final discussion
While all the participants commented positively about the possibility to observe patterns across directories and to compare and contrast to make decisions, some wanted to learn about the collection with more precision. One participant indicated that she was not sure whether the visualization was giving her “more than surface information.” When we explained that the visualization was intended for exploratory purposes to highlight characteristics that a user could further investigate, she expressed the need to go from broad overviews to details seamlessly in order to arrive at better conclusions. For this participant, inferences did not substitute for facts, and thus, learning about the types of digital objects present in the collection did not replace the need to study the file formats that compose those objects. She mentioned that her conclusion could be related to her lack of expertise working with the application. A different perspective was brought up by another participant who explained that a curator may never have a complete understanding of a very large collection. To her, the advantage of the visualization was that it provided overview information about the collection while minimizing the need to examine item by item. She explained that looking at how the visualization represents sizes and heterogeneous attributes helped her grasp the notion of large amounts of information.
Interpretation
The interpretation of the focus group session provided rich insights on fundamental aspects of the visualization design. We identified three central themes: (1) balancing between abstract and detailed information about the collection, (2) visual clutter, and (3) accuracy of the representation. In the sections below, we discuss the themes and their relationships to the analysis goals.
Whether the participants understood the collection’s representation and abstractions as shown in the interactive views
The discussions revealed that the participants understood the representations and abstractions to the point of suggesting other metadata elements to visualize, other interactive features, and new uses for the application. Moreover, they identified the importance of tracking the different views involved during an analysis workflow as a way to improve the cognitive process. The major concern for some of the participants was balancing the general notions that they were obtaining about the collection through the abstract views, with the need for learning more details that emerged as they were exposed to the visualization. Issues about visual clutter point to the complexity of presenting and processing the many layers of information considered by digital curators to evaluate a collection. And yet, at the same time that the participants commented on the difficulties of visually processing the colors and words, they reiterated their need for more detailed information.
Whether the interactive views facilitate making inferences and decisions
After observing the views showing general patterns, the participants asked for more details about what they were observing. We learned that for some participants, making inferences and decisions about the collection is tied to the accuracy and completeness of the collection’s representation. This was obvious when participants pointed out that what counts as unknown information was not highlighted consistently across the different visualization features. In addition, the participants were always looking for numerical or textual references to contrast and validate their observations and to make better decisions.
The tensions described here could be related to the perception created by a long tradition of linear review practices, that is, in order to obtain a precise knowledge of the collection, all the digital objects need to be observed. It is also possible that it will take time and adjustment for curators to use visual patterns as a way of learning about collections, which cognitively involves filtering some information to focus only on the most relevant.
We consider that our decision to summarize the information as much as it was meaningfully possible for triage purposes is adequate and needed. In fact, it was by observing how the visualization represented different sizes of directories and their contents that some participants started to grapple with the notion of large data and indicated that traditional methods of collection evaluation would not scale. However, our design did not provide sufficient functionality for users to observe low-level data or to guide the process from abstraction to detail. In addition, we were not consistent in carrying the “unknown” category throughout all the visualization features, which created doubts about the accuracy of the representation.
Improvement as a result of the curators’ feedback
Acknowledging the domain experts’ needs, we introduced three major changes in the application. A network graph automatically tracks the different views involved in an analysis and decision-making workflow (detailed in section “Data navigation”), an aggregator feature enables users to specify particular attributes they want to analyze (detailed in section “Numerical metadata visualization”), and a selector feature provides users with the ability to combine and filter the collection attributes to match their interests and to do so at their own pace (detailed in section “Data selection and filtering”). We also added the capability to render unknown information in the use case presented in section “Usage scenario.” These features complement the general pattern views for the purpose of improving the cognitive process.
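To make the view-tracking idea concrete, the following minimal Python sketch records each visited view as a node and each transition between views as a directed edge. It is only an illustration under our own assumptions: the class name `ViewTracker`, its methods, and the view names are hypothetical and do not reproduce the application's implementation.

```python
from collections import defaultdict


class ViewTracker:
    """Hypothetical tracker: views are nodes, transitions are directed edges."""

    def __init__(self):
        self.edges = defaultdict(int)  # (from_view, to_view) -> number of transitions
        self.path = []                 # views in the order they were visited

    def visit(self, view):
        """Record a move from the most recent view to `view`."""
        if self.path:
            self.edges[(self.path[-1], view)] += 1
        self.path.append(view)

    def neighbors(self, view):
        """Return the views reached directly from `view`, sorted by name."""
        return sorted({dst for (src, dst) in self.edges if src == view})


# Illustrative session: the view names are assumptions, not application identifiers.
tracker = ViewTracker()
for v in ["treemap", "tag cloud", "treemap", "image tags"]:
    tracker.visit(v)
print(tracker.neighbors("treemap"))
```

Keeping both the ordered path and the edge counts would let a curator replay a session step by step or inspect which views were most often consulted together.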
Conclusions and discussion
This work introduces new ways to understand, discover, interpret, and interact with the wealth of digital collection information for curation purposes. To conduct digital collection analysis, we designed an interactive visualization that acts as a bridge between the curator and the collection. The application enables users to identify patterns that, given the diversity and size of the data, would be extremely difficult, if not impossible, to recognize using linear review methods. Throughout the design, domain specialists gave the team iterative feedback about the usefulness of the analysis methods and the clarity of the visual representations.
We conducted a focus group study with six professional digital curators. The focus group format allowed participants to think out loud and to change their opinions as the session progressed. While this kind of study does not provide all the answers needed to evaluate a visual analytics application, it allowed us to understand what issues are relevant to the users. The focus group results will also serve as a key reference to develop a formal user experience study. An important finding is that going from general abstraction to details has a profound effect on the user’s capacity to study a collection. We also identified the need to help users to make the transition from their traditional analytics process to the visual analytics process.
Much of our work focused on summarizing large amounts of information as meaningful visual representations. To address the accuracy of the analyses, including the completeness and quality of the data, we (1) provided more than one data representation, so that digital curators can validate results; (2) presented information about content and organizational criteria in context with provenance and original order to enable making informed inferences; and (3) highlighted what is not known or in doubt about the collections.
Significant challenges still remain to present large amounts of information about digital collections. While the treemap visualization is useful to navigate the structure of a collection, for larger and nested collections, the analysis may become cumbersome. In that regard, we are working on alternative visualizations to show attributes independently from the collection’s hierarchical structure. We also need to provide more precise descriptions for the contents than those derived from word frequencies in tag clouds. Currently, we are investigating the combination of graphics and natural language processing (NLP) methods to generate high-level descriptions based on aggregated label and content information. In addition, we learned that curators would like to record interpretations as annotations to the visualization to enhance the cognitive process. They would also like to save analysis workflows to share with other colleagues.
We learned that presenting a large number of attributes and their corresponding colors makes it difficult for users to identify individual pieces of information. At the same time, as users learn about the collection using the visualization, they demand to know more details. We are changing our strategy to provide users with more control to design their collection analysis sessions. For example, to allow gradual transitions from abstract to detail and vice versa, we are modifying the visualization so users can further narrow the quantity of collection attributes that they will explore at a given time. Users will be able to recategorize file classes to highlight file types that are relevant to their collections; group file classes according to functions, domains, and/or broader categories that make sense to the particular collection (e.g. administration, publications, engineering, and archaeology); and choose color mapping schemas accordingly. This new feature will also enable users to choose existing color mapping schemas, such as those recommended by ColorBrewer. 45 In addition, we are developing new curation functions, such as the identification and location of duplicate, corrupted, and empty files and the visualization of digital objects in relation to dates for purposes of time-related analysis.
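The planned regrouping of file classes can be sketched as follows. This is a minimal Python illustration under our own assumptions rather than the application's code: the file class names, the group assignments, the hex color values, and the function `regroup` are all hypothetical. Classes that the user has not assigned to a group fall into an "unknown" group rendered in black, mirroring the way unknown file classes are shown in the file class view.

```python
from collections import Counter

# Hypothetical user-defined grouping of file classes into broader categories.
GROUPS = {
    "web": "publications",
    "pdf": "publications",
    "image": "publications",
    "spreadsheet": "administration",
    "cad": "engineering",
}

# Illustrative color mapping; unassigned classes are shown in black.
COLORS = {
    "publications": "#1b9e77",
    "administration": "#d95f02",
    "engineering": "#7570b3",
    "unknown": "#000000",
}


def regroup(class_counts):
    """Aggregate per-class file counts into the user's broader groups."""
    totals = Counter()
    for file_class, count in class_counts.items():
        totals[GROUPS.get(file_class, "unknown")] += count
    return dict(totals)


# Made-up counts for illustration only.
counts = {"web": 500, "pdf": 120, "cad": 30, "xyz": 7}
grouped = regroup(counts)
for group, total in sorted(grouped.items()):
    print(group, total, COLORS[group])
```

Because the grouping and the color mapping are plain lookup tables, a curator could swap in a different categorization or palette without touching the aggregation step.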
Footnotes
Funding
This work was supported through a National Archives and Records Administration (NARA) supplement to the National Science Foundation (NSF) Cooperative Agreement TERAGRID: Resource Partners (grant number OCI-0504077).
