Abstract
The institutional repository is a major means of providing open access to academic output and is changing academic communication. As institutional repositories spread, the library and academic communities have conducted research to advance their management policy and technology. This study undertook a co-word analysis of author keywords in articles from the SCOPUS database from 1997 to 2012 and found eight clusters that represent the intellectual structure of institutional repository research: ‘Metadata’, ‘Open Access’, ‘Institutional Repository’, ‘Digital Library’, ‘dSpace’, ‘Copyright’, ‘Preservation’ and ‘Semantic Web’. To understand these intellectual structures, the study clustered the words with a hierarchical clustering technique applied to a similarity matrix of Pearson’s correlation coefficients computed from the co-occurrence matrix. To visualize them, it carried out a multidimensional scaling analysis using the PROXSCAL algorithm.
1. Introduction
The institutional repository, which is a major means of self-archiving, can change the academic communication flow led by commercial publishers as well as increase the citation rate by quickly distributing research results. The institutional repository is spreading with support encompassing academia, governmental organizations and the library community, and foundational principles formed from the BOAI (Budapest Open Access Initiative) declaration in 2002 to the IFLA (International Federation of Library Associations) announcement in 2011. Institutional repositories have been installed and now operate at over 2200 institutions as of early 2013. This number is rapidly increasing and is centred around Europe, North America and Asia. Under these circumstances, their implementation and operating experience have been shared among the library community and various research domains to advance the technology and improve interoperability.
To identify open access research trends in a particular area, the method of dividing an analysis object into subcategories and summing the number of papers included in the corresponding category has been widely used. Since this method mostly either compares paper counts in detail in an area or examines the dynamism of the paper count in a time series, it cannot explain the distribution of various subject concepts existing in the domain or the relation between subjects. By using co-word analysis, the domain of knowledge can be quantitatively found and the connection between domains can be brought to light. In other words, the phenomenon can be analysed as a key concept is identified and grouped.
This study examined institutional repository-related research areas and trends through co-word analysis. The intellectual structure of institutional repository research was also examined by forming clusters with a clustering technique and by mapping the correlations with multidimensional scaling. It is anticipated that this work will help researchers in this field set research directions and subjects, and will support the development of policies for vitalizing the institutional repository in the future.
2. Theoretical background
2.1. Institutional repository concept and proliferation state
Some commercial publishers, which dominate the market through the publication of high-priced academic journals, have erected a barrier to information access. The open access movement started in order to return leadership to researchers, on the judgment that the current system harms academic communication. There are two ways to realize open access: one is the publication of open access journals and the other is self-archiving. In self-archiving, articles from commercially distributed journals are archived in various repositories after agreement with the publisher. From a research perspective, the institutional repository, which is a major means of self-archiving, can preserve an institution’s publications indefinitely and increase the citation rate by quickly distributing research results. From an institutional perspective, it can reduce the time-consuming work involved in maintaining publications and change the academic communication flow led by commercial publishers.
Open access has been supported by academia, governmental organizations and the library community since the BOAI declaration in 2002 and more recently the IFLA announcement in 2011. As a result, the number of open access journals has increased to 8622 as of February 2013 and the number of papers published has reached 969,098. The registration of a total of 2257 institutional repositories – including the USA with 396 (17.5%), the UK with 210 (9.3%), Germany with 165 (7.3%) and Japan with 138 (6.1%) – to OpenDOAR (The Directory of Open Access Repositories, http://www.opendoar.org/) as of February 2013 shows steady global growth.
2.2. Institutional repository-related research area
The institutional repository aims to support open access based on academic communication innovation and promotes the long-term preservation and outside proliferation of knowledge production by institutions. Accordingly, this study examines various detailed subjects in this field, such as open access policy, copyright, digital library, interoperability, record preservation, and so on.
The prospecting of major detailed research domains in this field before starting the co-word analysis on research domains of the institutional repository was arranged as follows. The first research domain relates to open access-based academic communication. Specifically, it examines the efforts by academia and the library community to change academic communication composition in response to commercial publishers, and includes the following important research domains [1]: the development of open access policy for research results supported by government or university funds [2]; open access journal publication [3]; and evaluating impact [4]. The second domain is related to the role of universities and libraries, which are the main operating bodies, and the political efforts of universities to support the self-archiving of researchers [5], and the role of university libraries and librarians in installing and operating the institutional repository is discussed [6]. The third domain is a subject related to the institutional repository system. The implementation of open source software [7, 8], the connection with other systems [9], electronic publications [10] and the implementation of a shared repository system in the library network [11] are addressed. The fourth domain is interoperability with metadata. Issues of the methodology to be mutually operated as content is stored in the institutional repository, that is, the standardization of the OAI protocol [12] and metadata, quality improvement [13] and authority control [14] can be included here. The fifth domain is a digital preservation problem [15]. The storage method and metadata for the long-term preservation of content and the role of the archivist [16] are discussed. 
In addition to the long-term preservation of learning objects of universities [17], and the long-term preservation of research data related to e-science [18, 19], subjects related to the citation index development of content stored in the institutional repository [20] have started to receive attention from researchers.
Bailey [21] focuses on classifying institutional repository-related research and inventing the classification system separately. Having collected academic papers, books, technical reports and proceedings related to the institutional repository since 2000, he is operating The Institutional Repository Bibliography (IRB), which is an article-indexing system. The article index data are constructed as the institutional repository-related research domains are classified into 12 fields, and the volume of bibliography at each item is the largest in ‘the operation case of university/research institution’ and ‘the national unit main project’. Additionally, research has been performed in the domains of ‘digital preservation,’ ‘library issue,’ ‘metadata,’ ‘policy’ and ‘software’.
In classifying the institutional repository research domain into particular categories and summing the number of papers included in the corresponding categories, IRB promotes the understanding of this field. However, the distribution of various detailed subject concepts existing in a particular domain and correlations of the subject concepts are not explained. Thus, it is anticipated that the co-word analysis to be performed in this study will be supplementary in explaining those parts.
2.3. Overview of co-word analysis technique
Co-word analysis is generally a method of extracting words from the articles of corresponding subject fields, calculating the co-occurrence frequency of each word pair and obtaining correlations between words, for example, using various indexes and mapping subdomains. That is, if two keywords simultaneously appear in the same paper, the two subjects mentioned in the paper are correlated with each other. When measuring the intensity of correlation between the words, the research patterns and trends of corresponding fields can be examined. Thus, if using this analysis method, the structure of the particular subject field can be analysed without a data classification system.
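The counting step described above can be sketched as follows. This is a minimal illustration, not the TI program used in this study; the function name `cooccurrence_counts` and the three-paper toy corpus are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(papers):
    """Count how often each unordered keyword pair appears in the same paper."""
    counts = Counter()
    for keywords in papers:
        # Deduplicate within a paper and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy corpus: each inner list holds the author keywords of one paper.
papers = [
    ["institutionalRepository", "openAccess", "dSpace"],
    ["institutionalRepository", "openAccess"],
    ["metadata", "oaipmh", "institutionalRepository"],
]
counts = cooccurrence_counts(papers)
```

Because `Counter` returns zero for unseen keys, a pair that never shares a paper simply has a count of 0, which corresponds to an empty cell of the co-occurrence matrix.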
Whereas citation-based bibliometric analysis relies on the citations contained in scientific papers [22], co-word analysis provides a visual representation that simplifies the subject field by using the co-occurrence frequencies of items presented in the literature, such as keywords or classification codes [23].
Meanwhile, researchers have used co-word analysis to examine subject domains in various fields and chronological changes. Ding et al. [22] made a domain map using multidimensional scaling based on a similarity index by extracting keywords from research objects in information retrieval fields. Kwon [24] analysed the research field of enterprise architecture, which manages management information, using co-word analysis. Peters and Van Raan [25] analysed the chemical engineering field and Kim et al. [26] analysed the digital finance field.
3. Research method
Generally in co-word analysis, words are extracted from the literature in the corresponding subject field, the co-occurrence frequency of each word pair is calculated and the correlation between words is then obtained using various indices. Next, the subdomains can be understood by mapping the correlations with multidimensional scaling (MDS). Although groups of words also emerge when MDS is performed directly without clustering, a more easily understood domain map results when the clusters are marked on the map.
Therefore, this study has performed the clustering technique; additionally a second analysis for calculation of the correlation between subdomains has been performed in order to examine the interaction between subgroups in more detail. With these similarity indexes between clusters, MDS was performed once again. A more detailed research method is explained below.
3.1. Keyword collection
Co-word analysis generally extracts analysis object words from titles, abstracts, keywords, etc. This study collected the author keyword constructed in the SCOPUS database. The keyword assigned by the author and that assigned by the indexer co-exist in SCOPUS, and the subjective viewpoint of the index expert can be reflected in the non-controlled subject term assigned by the indexer. Thus, distorted results, the so-called ‘indexer effect’ [27], can be caused, and therefore this study performed analysis only on the keyword assigned by the author.
The more detailed keyword collection process is as follows. First, the author keywords were extracted from 204 papers retrieved with the query ‘Institutional and repository’ from the data held in the SCOPUS database in December 2012. Second, compound words were made into single words by removing spaces, and 564 keywords were extracted through a filtering process of synonyms, broad terms and narrow terms. Third, frequency analysis was performed on the filtered words and the 32 keywords that occurred four or more times were extracted. In all, 55 keywords occurred three or more times, but the 22 keywords that occurred exactly three times were not judged important for drawing the research result, whereas the nine keywords that occurred four times were considered valid and meaningful. Therefore this study selected the 32 keywords that occurred four or more times. Fourth, the co-word analysis was performed on those extracted keywords. The TI program for DOS (http://www.leydesdorff.net/software/ti/), developed by Leydesdorff for co-word analysis, was used to calculate the co-occurrence frequency.
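The second and third steps can be sketched in a few lines. The synonym table, the helper names `normalize` and `frequent_keywords`, and the rule of counting a keyword once per paper are assumptions made for illustration; the actual filtering of synonyms, broad terms and narrow terms was a manual judgment that this sketch does not reproduce.

```python
from collections import Counter

# Hypothetical synonym table mapping variant tokens to one preferred form.
SYNONYMS = {
    "institutionalRepositories": "institutionalRepository",
    "oaiPmh": "oaipmh",
    "dspace": "dSpace",
}


def normalize(keyword):
    """Join a compound keyword into one token, e.g. 'open access' -> 'openAccess'."""
    words = keyword.replace("-", " ").split()
    if not words:
        return ""
    token = words[0].lower() + "".join(w[:1].upper() + w[1:].lower() for w in words[1:])
    return SYNONYMS.get(token, token)


def frequent_keywords(papers, min_count=4):
    """Keep normalized keywords that occur in at least min_count papers."""
    freq = Counter()
    for keywords in papers:
        # Assumption: a keyword is counted once per paper.
        freq.update({normalize(k) for k in keywords if k.strip()})
    return {k: n for k, n in freq.items() if n >= min_count}


# Hypothetical input: four papers share two keywords, three papers share one.
sample = [["Open Access", "institutional repositories"]] * 4 + [["OAI-PMH"]] * 3
selected = frequent_keywords(sample)
```

With the threshold of four, the keyword that appears in only three of the sample papers is dropped, which mirrors the cut-off described above.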
3.2. Two-mode co-occurrence matrix and similarity index calculation
The co-occurrence matrix shows how often two words co-occur in the same study. The value of the cell for two words is the number of documents in which both words appear. A higher co-occurrence frequency means a closer relationship between the two words. A similarity index is used to measure the similarity between words because, by normalizing the range of the co-occurrence frequencies, it evens out the difference between words with high and low appearance frequencies [28].
In co-word analysis, the cosine, Jaccard and Pearson’s correlation coefficients are mainly used as similarity coefficients. This study measured similarity with Pearson’s correlation coefficient, which was also used by Ding et al. [22] and Lee et al. [28], and calculated it with the SPSS statistical package.
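As an illustration of the similarity step, Pearson’s correlation coefficient can be computed between the co-occurrence profiles (matrix rows) of two keywords. The sketch below is plain Python rather than SPSS; the profile values are invented, and returning 0.0 for a constant profile is a convention chosen here, not something stated in this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / sqrt(var_x * var_y)


def similarity_matrix(profiles):
    """Pairwise Pearson correlations between keyword co-occurrence profiles."""
    keys = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in keys} for a in keys}


# Hypothetical co-occurrence rows (not values from Table 2).
profiles = {
    "metadata": [0, 5, 1, 2],
    "oaipmh": [5, 0, 1, 2],
    "openAccess": [1, 1, 0, 9],
}
sim = similarity_matrix(profiles)
```

The coefficient compares the shape of two profiles rather than their absolute size, which is why it evens out the difference between frequent and rare keywords.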
3.3. Cluster
When the volume of data to be processed grows in co-word analysis, a clustering technique is usually applied because direct analysis becomes difficult [25]. Although groups of words also emerge when multidimensional scaling is performed directly without clustering, a more easily understood domain map results when the clusters are marked on the map [28]. The most frequently used clustering technique in co-word analysis is hierarchical clustering with Ward’s method, which builds clusters by minimizing the increase in the squared error that results when two clusters are merged.
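The merging criterion of Ward’s method can be made concrete with a small sketch. For two clusters with sizes n_a and n_b and centroids c_a and c_b, the increase in squared error on merging is n_a n_b / (n_a + n_b) times the squared Euclidean distance between the centroids. The code below is a naive illustration with invented two-dimensional points, not the SPSS procedure used in this study.

```python
def ward_cost(a, b):
    """Increase in within-cluster squared error if clusters a and b are merged.

    Each cluster is given as a (size, centroid) pair.
    """
    (size_a, centroid_a), (size_b, centroid_b) = a, b
    squared_distance = sum((p - q) ** 2 for p, q in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * squared_distance


def ward_clustering(points, n_clusters):
    """Greedily merge the cheapest pair of clusters until n_clusters remain."""
    # Each cluster is stored as (member labels, size, centroid).
    clusters = [([label], 1, list(vec)) for label, vec in points.items()]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i][1:], clusters[j][1:])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        members_i, size_i, cen_i = clusters[i]
        members_j, size_j, cen_j = clusters[j]
        size = size_i + size_j
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [(size_i * p + size_j * q) / size for p, q in zip(cen_i, cen_j)]
        merged = (members_i + members_j, size, centroid)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c[0]) for c in clusters)


# Hypothetical points standing in for keyword profiles.
points = {"a": [0, 0], "b": [0, 1], "c": [10, 10], "d": [10, 11]}
groups = ward_clustering(points, 2)
```

Recording the cost of each merge in order would give the heights needed to draw a dendrogram; that bookkeeping is omitted here for brevity.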
Meanwhile, to obtain the similarity or dissimilarity between clusters, this study remeasured the similarity between the word lists included in the clusters. As in the analysis by Lee [29], the co-occurrence frequencies of the keywords included in each cluster were summed, and these sums were used as input data for calculating the similarity between groups with Pearson’s correlation coefficient.
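One way to read this aggregation step is sketched below: keyword-level co-occurrence counts are summed into a group-by-group matrix, whose rows can then be correlated. The function name `group_cooccurrence`, the cluster labels and all counts are assumptions made for illustration, and the exact layout of the two-mode matrix used in this study may differ.

```python
def group_cooccurrence(pair_counts, clusters):
    """Sum keyword-level co-occurrence counts into a group-by-group matrix."""
    group_of = {kw: g for g, kws in clusters.items() for kw in kws}
    groups = sorted(clusters)
    matrix = {g: {h: 0 for h in groups} for g in groups}
    for (kw_a, kw_b), count in pair_counts.items():
        g, h = group_of.get(kw_a), group_of.get(kw_b)
        if g is None or h is None:
            continue  # skip keywords that were excluded from the clustering
        matrix[g][h] += count
        if g != h:
            matrix[h][g] += count  # keep the matrix symmetric
    return matrix


# Hypothetical counts and cluster membership (not data from Tables 5 and 6).
pair_counts = {
    ("metadata", "oaipmh"): 3,
    ("metadata", "openAccess"): 2,
    ("oaipmh", "openAccess"): 1,
    ("archiving", "openAccess"): 5,
}
clusters = {"G1": ["metadata", "oaipmh"], "G2": ["openAccess"]}
group_matrix = group_cooccurrence(pair_counts, clusters)
```

Pairs that involve a keyword outside every cluster, such as the single-keyword nodes dropped from the clustering, do not contribute to any cell.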
3.4. Mapping
Multidimensional scaling, which is commonly used in co-word analysis, structures the relations between complex objects and visualizes them in a multidimensional space [30]. Entities located near each other on the map have higher relative similarity, whereas entities located far from each other have lower relative similarity [28]. To position the keywords on the map from their Pearson’s correlation coefficients, this study calculated the Euclidean distance and visualized it in two-dimensional space by applying the PROXSCAL algorithm [31]. To visualize and examine the relations between subject subdomains, the similarity coefficient between clusters was also calculated and a second map was made from it with the same method. Groups with a correlation coefficient >0.5 were connected with a line to display the relationship between groups, and the cluster size was drawn in proportion to the share of the total keyword frequency accounted for by the keywords in the cluster.
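The distance calculation and the notion of map fit can be illustrated as follows. The sketch does not implement PROXSCAL; it only computes Euclidean distances between profiles and a Kruskal-style stress value for a given configuration, without the optimal rescaling that statistical packages may apply. Which stress variant SPSS reported for this study is not stated, so the formula, the function names and the example coordinates are all assumptions.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise Euclidean distances between labelled vectors."""
    keys = list(vectors)
    return {a: {b: euclidean(vectors[a], vectors[b]) for b in keys} for a in keys}


def kruskal_stress(dissimilarities, coordinates):
    """sqrt(sum (d_ij - delta_ij)^2 / sum d_ij^2) over pairs, d = map distance."""
    keys = list(coordinates)
    numerator = denominator = 0.0
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            map_distance = euclidean(coordinates[a], coordinates[b])
            numerator += (map_distance - dissimilarities[a][b]) ** 2
            denominator += map_distance ** 2
    if denominator == 0:
        raise ValueError("all points coincide; stress is undefined")
    return sqrt(numerator / denominator)


# Hypothetical profiles; a perfect map reproduces their distances exactly.
profiles = {"a": [0, 0], "b": [3, 4], "c": [6, 8]}
target = distance_matrix(profiles)
perfect_fit = kruskal_stress(target, profiles)
distorted_fit = kruskal_stress(target, {"a": [0, 0], "b": [1, 0], "c": [2, 0]})
```

A stress of zero means the map distances match the input dissimilarities exactly; larger values indicate a poorer two-dimensional representation.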
4. Analysis results
4.1. Frequency analysis
Two-hundred and four papers that the author assigned keywords to among the academic papers searched with ‘Institutional and Repository’ in the SCOPUS database in December 2012 were extracted. Filtering the 204 papers with the extracted author keywords, 564 analysis data were sorted.
Before the frequency analysis, the distribution of papers in each year was examined first. As shown in Figure 1, although the institutional repository-related papers started to be published in 2000, the activation time of the research seemed to be about 2005. It faltered more recently but showed a steady increase as shown by four papers in 2005, 27 papers in 2008 and 42 papers in 2010. Examining the papers published in 1997, the first on the graph was related to the medical data hospital repository [32] in the medical science field. Thus, it can be presumed that the first paper related to an institutional repository performed in the actual library and information science field is research on the crisis of academic information and the institutional repository of universities [33] published in 2001. It can also be presumed that related research has been steadily performed after the dSpace installation at Massachusetts Institute of Technology [34] in 2003. On the other hand, the results of the analysis of the frequency on data filtered to 564 are shown in (Table 1).

Figure 1. Time series graph.
Table 1. Keywords appearing more than four times.
‘institutionalRepository’ (195), ‘openAccess’ (46), ‘scholarlyCommunication’ (22), ‘metadata’ (22) and ‘digitalLibraries’ (21) were the highest-frequency keywords. The terms ‘academicLibraries’ (10), ‘higherEducation’ (5), ‘dSpace’ (14), ‘copyright’ (5), ‘preservation’ (5), ‘semanticWeb’ (5) and ‘ontologies’ (4) also appeared among the high-frequency keywords. The frequency analysis results can be summarized as follows.
First, ‘openAccess’ and ‘scholarlyCommunication’, reflecting academic communication innovation, were high-frequency keywords in addition to ‘institutionalRepository’. Second, ‘digitalLibraries’, ‘metadata’ and ‘oaipmh’, which relate to the components of the digital library service and the protocol for data interoperability, were high-frequency keywords. Third, ‘dSpace’, a globally and commonly used open source repository software, appeared most frequently among software names and ‘India’ appeared most frequently among nations. Fourth, ‘learningObject’, reflecting the trend towards long-term preservation of educational material, and ‘researchData’ and ‘digitalAssets’, reflecting the curation of research data, also appeared prominently.
4.2. Correlation matrix
The results of the co-occurrence matrix calculated through TI software and Pearson’s correlation analysis performed for measuring similarity are as shown in Tables 2 and 3. Also, the word pairs having a correlation coefficient of >0.7 from the correlation analysis results are shown in Table 4.
Table 2. Part of the two-mode matrix of co-occurring words.
Table 3. Part of the similarity matrix using correlation coefficients.
Table 4. Word pairs showing high correlation coefficients.
A high correlation coefficient means a high co-occurrence frequency of words. In other words, they can be interpreted as research concepts having high correlation with this field. The more detailed analysis through the results in (Table 4) is as follows.
First, the correlation coefficient between ‘metadata’ and ‘oaipmh’ was the highest at 0.852. This can be understood as ‘oaipmh’, the harvesting protocol applied to the institutional repository, and ‘metadata’, the object of interoperability, having the highest correlation in this field. Second, the correlation coefficient between ‘institutionalRepository’ and ‘openAccess’ was also considerably high at 0.851, which signifies that there were various studies recognizing the institutional repository as a key tool for the realization of open access. Third, the correlation coefficient between ‘institutionalRepository’ and ‘dSpace’ was 0.736, which was also high, and this means that research on the realization and application of an institutional repository through dSpace software was frequently performed. Fourth, ‘openAccess’ and ‘subjectInstitutionalRepository’ (0.758), ‘digitalLibraries’ and ‘searchEngine’ (0.726), ‘interoperability’ and ‘catalog’ (0.767), and ‘copyright’ and ‘policy’ (0.743) were studied with high correlations.
4.3. Clustering
Hierarchical cluster analysis was performed on the correlation analysis results obtained above. Clustering with Ward’s method after standardizing with the Z score produced the dendrogram shown in Figure 2.

Figure 2. Dendrogram.
The dendrogram can be divided into three clusters. BG1, the first cluster, forms a large group of 20 keywords from ‘metadata’ to ‘india’. BG2 includes eight keywords from ‘copyright’ to ‘archiving’ and BG3 includes four keywords from ‘semanticWeb’ to ‘self’. Examining the internal attributes of each group, the first is a large cluster related to ‘institutionalRepository’ and ‘openAccess’, the second is a medium cluster related to ‘copyright’ and ‘preservation’, and the third is a small cluster including ‘semanticWeb’, ‘self’, etc. ‘Self’ can be interpreted as short for ‘self-archiving’, one of the feasible means of achieving open access. However, because a division into three groups made it hard to understand the relations among the internal attributes of BG1, which held a high share, the keywords were divided again into 11 clusters on the basis of the dendrogram. Among the 11 clusters, some contained only one keyword, namely ‘archiving’, ‘knowledgeManagement’ and ‘self’ (self-archiving). Since such single-keyword nodes can hardly be considered independent clusters, this study excluded them from clustering. Therefore eight clusters were created, as shown in Table 5.
Table 5. Eleven clusters and representative keyword.
The representative keyword of each group is the keyword with the highest frequency among those included in the group. G1 was ‘metadata’, G2 was ‘openAccess’, G3 was ‘institutionalRepository’, G4 was ‘digitalLibrary’, G5 was ‘dSpace’, G6 was ‘copyright’, G7 was ‘preservation’ and G8 was ‘semanticWeb’. The group share is the sum of the occurrence frequencies of the keywords in a cluster as a proportion of the whole. G3 (institutionalRepository) was the biggest at 35.7%, followed by G2 (openAccess) with 23.7%, while G1 (8.8%), G4 (10.1%) and G5 (11.2%) showed similar shares. Table 6 shows the two-mode co-occurrence matrix, obtained by summing the co-occurrence frequencies between the keywords in each group, that was written to examine the correlation between groups.
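The group share is simple arithmetic, sketched below. The sketch assumes that ‘the whole’ means the total frequency of the clustered keywords; the keyword names, frequencies and group labels are invented placeholders, not the values behind Table 5.

```python
def group_shares(frequencies, clusters):
    """Percentage of the total keyword frequency accounted for by each cluster."""
    # Assumption: the total covers only keywords that belong to some cluster.
    total = sum(frequencies[kw] for kws in clusters.values() for kw in kws)
    return {
        group: 100 * sum(frequencies[kw] for kw in kws) / total
        for group, kws in clusters.items()
    }


# Hypothetical frequencies and cluster membership.
frequencies = {"kwA": 6, "kwB": 2, "kwC": 2}
clusters = {"GroupX": ["kwA"], "GroupY": ["kwB", "kwC"]}
shares = group_shares(frequencies, clusters)
```

The same values can serve as point sizes when the clusters are drawn on a map.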
Table 6. Two-mode matrix of co-occurrence between groups.
The results of the Pearson’s correlation analysis performed on the two-mode co-occurrence matrix above to examine the correlation between groups are shown in Table 7. In Table 7, G3 and G2, representing ‘institutionalRepository’ and ‘openAccess’, show the highest correlation coefficient of 0.734, and G5, representing ‘dSpace’, also shows a high coefficient of 0.714 with G3. In summary, it can be presumed that the institutional repository has a high correlation with the domains related to ‘openAccess’ and ‘dSpace’. The other clusters have not shown particular correlations.
Table 7. Results of the correlation analysis between groups.
4.4. Mapping
Figure 3 shows the result of standardizing the Pearson’s correlation coefficient matrix with a Z score, calculating the Euclidean distance and visualizing it in two-dimensional space by applying the PROXSCAL algorithm. The MDS analysis gave a stress index of 0.18, which is acceptable considering the amount of analysis data. When the 32 keywords on the map in Figure 3 are divided into three clusters, BG1, including ‘institutionalRepository’ and ‘openAccess’, is located at the centre and BG2, including ‘copyright’ and ‘preservation’, is located at the bottom left. On the right side is BG3, including ‘semanticWeb’ and ‘ontologies’.

Figure 3. MDS map based on keyword.
The MDS map written in groups clustered to eight in 4.3 (Table 7) is shown in Figure 4. The point size reflects the share of clusters calculated in Table 7. As a result of multidimensional scaling analysis with a correlation coefficient between groups, S stress is 0.07 lower than in Figure 3 and the location of groups on the map can be explained as follows.

Figure 4. MDS map based on cluster.
First, G3 (institutionalRepository) is a little below the centre and G2 (openAccess) and G5 (dSpace) are close to it. G4 (digitalLibrary) is on the left above G3 on the map and the other groups are scattered across the map. When the groups with coefficients >0.5 in Table 7 are connected with lines, G2 (openAccess) and G3 (institutionalRepository) are linked, as are G3 (institutionalRepository) and G5 (dSpace).
Analysing the clusters overall, they can be summarized as follows. First, G2 (openAccess) and G5 (dSpace) are correlated with G3 (institutionalRepository) at the centre. In other words, ‘institutionalRepository’ is frequently discussed as a realization strategy of ‘openAccess’ along with ‘dSpace’, which is a construction and operation tool. Second, although the shares of G1 (metadata) and G4 (digitalLibrary) are not small, neither shows a high correlation with any other group. Third, since ‘semanticWeb’ and ‘preservation’ are located at the edge without correlation with other clusters, they can be interpreted as independent research domains.
5. Summary and discussion
Research results on the institutional repository have been brought to the attention of many librarians and researchers. That is why the institutional repository domain already has its own categorized article index system, like the Institutional Repository Bibliography (http://digital-scholarship.org/irb/). In that system, many articles and dissertations have been accumulated into certain categories and added to the number of articles that fall into the relevant category. Thus, it has taken the approach of simply comparing the number of particular subareas, but it has failed to explain the distribution of various subject concepts within the research area and the relation between subjects. Therefore, this study moves beyond simply thematic classification and carries out co-word analysis to place special focus on the relation between subjects and the intellectual structure by clustering them into several groups. By going through the process in this way, this research has led to a clear explanation of the intellectual structure of the institutional repository. The findings are as follows.
Institutional repository-related research started in 2000 when the dSpace was installed at Massachusetts Institute of Technology, but was activated from 2005. Progress has slowed recently but has steadily increased from four articles in 2005 to 42 in 2010.
The main research subjects that have been frequently handled in this domain are the institutional repository as a significant means of open access, issues about scholarly communication, and metadata as an information-retrieval tool that is highly correlated with the standard protocol OAI-PMH.
Divided in detail, the research domains of this field fall into the eight subgroups of ‘metadata’, ‘openAccess’, ‘institutionalRepository’, ‘digitalLibrary’, ‘dSpace’, ‘copyright’, ‘preservation’ and ‘semanticWeb’. ‘OpenAccess’, ‘digitalLibrary’ and ‘dSpace’ are studied with a strong correlation with ‘institutionalRepository’, located at the centre, but the subjects in ‘metadata’ do not have high shares and are located in the surrounding areas. Interoperability of metadata among institutional repositories and quality improvement are considered to be very important issues in this area. Since ‘preservation’, ‘copyright’ and ‘semanticWeb’ are also located in the surrounding areas, they can be interpreted as independent research domains.
The ‘institutionalRepository’ is studied as a tool for the realization of ‘openAccess’ and a strategy for scholarly communication innovation. The ‘institutionalRepository’ cluster has the highest share, which means that diverse cases of implementation and management of the institutional repository have been researched in this domain. As in Bailey’s [21] analysis, case studies about university and academic institutional repositories, projects like ARROW and DRIVER, and regional shared repository experiences have been handled. The cluster also shows a high coefficient with the ‘openAccess’ cluster, which means that the institutional repository is regarded as an important means of realizing open access.
There are various institutional repository software programs, such as dSpace, ePrint, e-repository and EARMAS. In this study, only dSpace proved to be a high-occurrence keyword among institutional repository software, so it can be explained that ‘dSpace’ is frequently handled as representative software for the institutional repository and its management cases have been discussed widely. However, the dSpace cluster only has a correlation with the ‘institutionalRepository’ cluster; it can be presumed that dSpace is only discussed as an extension of the repository implementation in terms of software.
‘DigitalLibrary’ also shows a high proportion and it is not correlated to any other cluster. Digital library is an information concept in library and information science, but in this domain, it is not so much related to interlocking with an institutional repository.
It is interpreted that ‘learningObject’, ‘e-science’ and ‘digitalCuration’ are rising as new domains, which the institutional repository must develop, and do not form their own domains yet. Subjects related to the ‘semanticWeb’ or ‘ontology’ are still treated as insignificant.
6. Conclusion
The institutional repository, which is a major means of open access, is an innovation strategy of academic communication. As institutional repositories spread, with 2257 of them as of early 2013, their implementation and operating experience have been shared, and various research for advancing their policy and technology has been performed in the library and academic communities. The purpose of this study was to identify the intellectual structure of the research domain in this area quantitatively. This study discloses the key subject concepts of the institutional repository research areas and their relations through co-word analysis of the SCOPUS database. As a result of the co-word analysis, the institutional repository research area could be categorized into eight subgroups, while ‘institutional repository’ was actively studied in close relation with the subject concepts ‘open access’ and ‘dSpace’.
The institutional repository is an important tool to realize open access and at the same time is a scholarly communication innovation strategy, which can preserve the intellectual assets of the institution for a long time and rapidly distribute research results. Thus this study, which quantitatively defines the research domains in this field through co-word analysis, is anticipated not only to serve as basic material for developing various supporting policies for vitalization of the institutional repository in the future but also to help researchers in this field set research directions and subjects.
The limitation of this study is that it has focused on just the journals listed on SCOPUS. Even though SCOPUS is the largest index database in the world, many important non-English journals are excluded. Therefore a follow-up study is needed to examine the intellectual structure including research outputs from the non-English-speaking world. In addition, a further study using time series analysis would be meaningful for understanding how the research topics in this field have changed.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
