Abstract
Collaboration is essential for some types of research, and some agencies include collaboration among the requirements for funding research projects. This makes it important to analyse collaborative research ties. Traditional methods to indicate the extent of collaboration between organizations use co-authorship data in citation databases. Publication data from these databases are not publicly available and can be expensive to access and so hyperlink data has been proposed as an alternative. This paper investigates whether using machine learning methods to filter page types can improve the extent to which hyperlink data can be used to indicate the extent of collaboration between universities. Structured information about research projects extracted from UK and EU funding agency websites, co-authored publications and academic links between universities were analysed to identify if there is any association between the number of hyperlinks connecting two universities, with and without machine learning filtering, and the number of publications they co-authored. An increased correlation was found between the number of inlinks to a university’s website and the extent to which it collaborates with other universities when machine learning techniques were used to filter out apparently irrelevant inlinks.
1. Introduction
Collaboration is often beneficial to research. In the worst case it does not positively influence research productivity and in the best case it increases it [1]. However, the main goal of research collaboration is the creation of new scientific knowledge [2] and collaborative academic papers have been shown to have more impact than single authored papers [3, 4]. Studies have shown collaboration to occur because of a number of reasons, some of which are access to expertise, equipment or funds [1]. Collaboration has also been shown to improve research productivity in terms of the number of publications produced [2, 5–7].
Studying collaboration is important for exploring the relationships between organizations which can aid in identifying important or influential actors and the role different organizations play in a particular research field. Collaboration studies can also be used to explore knowledge based innovation systems [8]. Knowledge-based innovation systems are ‘systems where efficient interactions between actors enable greater innovation’ [8, 9].
The standard methodology for studies of collaboration in academia involves the analysis of co-authorship data obtained from publication databases like the Web of Science and SCOPUS or from a sample of a few core journals of a particular field [1, 2, 10–12]. As collaboration is not always visible through co-authorship [13], and a few core journals from a particular field may not show the collaboration patterns of the whole discipline [1], Beaver [1] advises that results be qualified according to the data sources.
Although data from publication databases are arguably the most reliable source that can be used to indicate the extent of collaboration between organizations, they are not publicly available, can be expensive to access and it is time-consuming to process the author affiliation fields to extract institutional information. This can make it difficult or expensive for researchers to carry out collaboration studies. If an alternative data source that can indicate the extent of collaboration between organizations is available, even if it is not as reliable as data from publication databases, it could be used for pilot studies to determine if it is worth investing time and money for a full-scale study. The goal of this paper is to determine the extent to which freely available web based data can be used to measure collaborative activities between universities.
Hyperlinks in websites have shown early promise to be used as a data source for collaboration studies because they have some similarities with citations in academic articles. However, links are unreliable because they are created for a variety of reasons that can be difficult to identify. Stuart et al. [14] suggest that automatic methods for link classification should be identified for fully harnessing the potential of hyperlinks for collaborative studies.
A number of studies [15–18] have attempted to identify the reasons why hyperlinks in academic institutions are created, but link classification can be time consuming and is largely subjective. The reasons for link creation can be automatically inferred from the relationship between the two web pages that a hyperlink connects [19]. In their study, the authors automatically identified web page types using supervised learning techniques to make classification feasible for large scale studies, and then the reasons for linking between different page types were examined. The majority of links connecting two pages related to staff in a university were created because of past, present or future collaboration between them [19]. Therefore, restricting link data to only these links may produce a reasonable indicator of collaborative activities.
This study investigates whether a machine learning approach to classifying hyperlinks can be used to better study collaborative relationships between universities by automatically identifying those links in university websites that are probably created for collaborative reasons and finding the extent to which these links correlate with collaboration in terms of co-authored publications and collaboration in terms of co-participation in research projects.
2. Literature review
2.1. Collaboration
Research collaboration is defined as ‘the coming together of researchers to achieve the common goal of producing new scientific knowledge’ [2]. The results of successful research are usually published as scientific papers, thus if multiple researchers work together to produce new knowledge, it is likely that they will be co-authors in the published scientific papers. So even though collaboration is not always equivalent to co-authorship [2, 20], multiple-authored papers are widely used to indicate the extent of collaboration between individuals or organizations.
There is a consensus about the importance of scientific collaboration for researchers. Some government funding agencies encourage research collaboration by adding collaboration to part of the funding requirements [5, 21]; collaboration associates with research productivity in terms of the number of publications [2, 5–7]. Even though inter-organizational collaboration is encouraged, the majority of research projects have only one participating organization. In 2012, approximately 60% of EPSRC-funded projects awarded to UK universities in this study had only one participating organization. This is in line with other results that the majority of collaboration is within a single organization [22].
Gift authorship, where authors who may not have contributed to the work are included, and ghost authors, where authors who made significant contributions are not included in the published scientific paper [13], are among several drawbacks to using co-authorship as an indicator for collaboration [2]. However, Katz and Martin [2] also cited their invariance, ability to be verified, relative inexpensiveness, practicality and that the results are statistically more significant than those from case studies as among the advantages of using this method.
Since institutional collaboration decreases exponentially as the geographic distance separating collaborative partners increases [12], collaboration is influenced by geography. Geographic patterns have also been identified in university website inter-linking [23], even though links in general are different from citations. That links and citations show similar patterns makes it interesting to know if certain link types can be used to estimate co-authorship.
Researchers have analysed the collaboration networks of organizations that participated in EU-funded projects using statistical and social network analysis techniques [24–26]. Results from these studies have given an overview of the main properties of the collaboration network of different research fields and the roles organizations play in the knowledge innovation system. For example, the core of the network in technical fields has a high proportion of large companies because organizations in these sectors are interested in projects with high profit potential, while the core of the network in health-related fields is made up of universities and research centres because of the social importance of health [24].
A number of studies have investigated collaborative relationships between individuals, departments and organizations through co-authored publications. Based on co-authorship relations, European researchers collaborate more with global researchers rather than exclusively European researchers [27], although collaboration within a single institution still produces the majority of research outputs [22]. Most collaboration studies use co-authorship relations and acknowledgements or sub-authorship [13] from publication databases to investigate collaborative relations. In the current paper, the extent to which hyperlink data can be used to estimate the extent of collaborative activities between universities is investigated. This is important not only for people who cannot afford bibliographic databases but also because webometrics indicators may be used to show collaborations that may not be reflected in traditional bibliometrics [8].
2.2. Link analysis
A micro-analysis of links that focuses on the meaning of individual links [8] in a university’s website to identify those that represent collaboration is time-consuming because of the size of university websites and because each link has to be manually examined. Micro-analysis of hyperlinks is necessary if links in university websites are to be effectively used for collaboration studies and this is only feasible on a large scale if links are analysed automatically.
In what is perhaps the published literature most related to this study, Stuart et al. [14] investigated whether web links from university websites reflect collaborative relationships between the two organizations that the link connects. Their results suggest that direct linking cannot be confidently used to infer relationships between the two organizations that the link connects, but a significant proportion of outlinks from UK university websites reflect collaborative activities, so web links have the potential to be used as an indicator for collaboration if methods for filtering out irrelevant hyperlinks are identified.
It is possible to automatically filter out some irrelevant hyperlinks. Kenekayoro et al. [19] used decision tree induction, a supervised learning algorithm, to automatically classify web pages in UK university websites. Library, information and career service pages, which the authors called support pages in their classification scheme, showed high automatic classification accuracy, with a precision of 78%. Precision is the likelihood that the classifier will correctly classify a web page type in the same category as a human coder. This page type contributes up to 35% to the total web pages in a UK university’s website [19] and is unlikely to contain links that suggest collaborative activities [14]. Hence using supervised learning to automatically exclude such pages should help when using web links as a collaboration indicator.
Links in a university’s website are different from citations in scientific papers and there have been several studies that aimed to understand what links between higher education institutions represent. Bar-Ilan [17, 18] describes a framework for studying links in academic websites based on the link, source page and target page from different aspects like link context, link tone and several other properties. Up to 90% of links in university websites are created for scholarly reasons [28], and a majority of links between web pages related to staff in UK universities are created as a result of previous or future collaborations [19].
Support vector machines (SVMs) can perform either as well or significantly better than other competing methods in most machine learning contexts [29], so SVM is the machine learning algorithm used in the current study.
3. Research questions
The aim of this research is to determine if hyperlinks are better indicators of the extent of collaboration between universities when only academic links are used compared with when all links are used. This fills a gap in previous research because, although hyperlinks have previously been used as indicators of academic collaboration [14], this has not been done in conjunction with automatic or manual web page classification. Academic links in this study are:
links between two universities’ staff web pages; or
links between university websites excluding those from library and other web pages created to provide service to university staff or students.
This aim is achieved by answering the research questions:
(1) Can the extent of collaboration between two universities be better estimated with hyperlinks if only those links between university staff web pages are used rather than all links?
(2) Can the extent to which a university collaborates with other UK universities be better estimated by the total number of academic in-links rather than the total in-links to the university’s website?
4. Methods
This research investigates if automatically identifying and restricting hyperlinks in university websites to only those that may be created for collaborative reasons can help produce better collaboration indicators. To achieve this, web pages related to university staff and web pages that provide services to university staff or students were automatically identified using machine learning methods and then statistical correlation tests were used to determine if:
(1) the correlation between the number of links connecting two university websites and extent to which the two universities collaborate together is higher when only links connecting staff web pages are used than when all links connecting the two universities’ websites are used;
(2) the correlation between the number of in-links to a university’s website and the extent to which the university collaborates with all other universities is higher when only academic in-links are used rather than all in-links.
The extent to which two universities collaborate together is estimated by the number of publications that the universities co-authored and the number of research projects that they co-participated in.
The extent to which a university collaborates with other universities is determined by the total number of research projects the university partook in with other universities.
4.1. Publication data
The number of co-authored publications between universities was extracted from publication data retrieved from the Web of Science. The Centre for Science and Technology Studies (CWTS) in Leiden University provided the processed co-authorship information for the 36 UK universities that appeared in their CWTS 2013 Leiden Ranking. Publication data is restricted to these 36 universities. These 36 universities had a total of 323,763 publications between 2008 and 2011.
The Leiden Ranking is based on publication data retrieved from the Thomson Reuters Web of Science bibliographic database. It ranks world universities based on their scientific impact, measured through citations, and the extent to which they collaborate, measured through co-authorships. The data collection methodology used in the Leiden Ranking assigns publications to universities based on the institutional affiliation that authors indicate in their publication. The Leiden Ranking data collection has two stages: the first stage assigns a publication to a university when the university’s address or a variant of it is explicitly mentioned, and the second stage assigns publications of hospitals affiliated to a university to that university.
4.2. Project data
The UK research council’s website (http://gtr.rcuk.ac.uk/) [30] contains information about funded research projects from all UK research councils. A program was written to extract research project information for the 104 UK universities in Appendix A from the UK research council’s website [30] on 20 May 2013.
To cover additional UK research funding, data from CORDIS (Community Research and Development Information Service), a major source for EU research funding, was added to the data from UK research councils. Information about EU-funded research projects that started after 1 January 2006 was extracted from the CORDIS website. Records with incomplete data were excluded, so that 7415 EU-funded projects and 30,091 UK research council funded projects were used for the analysis. These projects do not cover all UK universities’ research funding between 2006 and 2013, but probably cover a large share of it. In 2010/2011 the UK BIS Research Councils, the Royal Society and the British Academy were responsible for 35% of the UK higher education sector research income and European Union government bodies were responsible for 10% of the UK higher education sector research income [31].
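The two project-based collaboration measures used later (co-participating projects per pair of universities, and project collaborations per university) can be derived from a list of project participants. The following is a minimal sketch, assuming each project record has already been reduced to a list of participating UK universities; the function name, the `projects` mapping and the sample data are hypothetical and are not taken from the extraction program used in this study.

```python
from collections import Counter
from itertools import combinations


def count_project_ties(projects):
    """Count co-participating projects per pair and collaborations per university.

    `projects` maps a project identifier to the universities taking part in it.
    """
    pair_counts = Counter()           # (university_a, university_b) -> shared projects
    collaboration_counts = Counter()  # university -> projects shared with any other
    for participants in projects.values():
        universities = sorted(set(participants))
        if len(universities) < 2:
            continue  # a single-organization project creates no tie
        for pair in combinations(universities, 2):
            pair_counts[pair] += 1
        for university in universities:
            collaboration_counts[university] += 1
    return pair_counts, collaboration_counts


# Hypothetical example data.
projects = {
    "P1": ["Leeds", "York"],
    "P2": ["Leeds", "York", "Hull"],
    "P3": ["Hull"],
}
pair_counts, collaboration_counts = count_project_ties(projects)
```

Pairs are stored in alphabetical order so that each pair of universities has a single key regardless of the order in which participants are listed.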
4.3. Hyperlink data
A custom web crawler was used to extract the links from one UK university to another. The crawler did not visit all webpages; it only covered the links that can be reached by iteratively following links from the university’s homepage, similar to SocSciBot [15]. To speed up crawling while still visiting a majority of the webpages, new webpages were not added to the list of webpages to visit (frontier) after the crawler had visited 2000 consecutive pages without finding a link to another UK university (ac.uk).
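The stopping rule described above can be illustrated with a small breadth-first crawl. This is only a sketch under stated assumptions, not the crawler used in the study: `fetch_links` is a hypothetical callable that returns the URLs found on a page, a site is approximated by the last three labels of the host name, and the frontier is assumed to reopen if a queued page later yields a link to another university (the text does not say whether the freeze is permanent).

```python
from collections import deque
from urllib.parse import urlsplit


def site_of(url):
    """Return the last three host labels, e.g. 'leeds.ac.uk' (a simplification)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-3:])


def crawl(homepage, fetch_links, patience=2000):
    """Collect (source page, target URL) links from one site to other ac.uk sites."""
    home_site = site_of(homepage)
    frontier = deque([homepage])
    seen = {homepage}
    university_links = []
    dry_pages = 0  # consecutive pages without a link to another university

    while frontier:
        page = frontier.popleft()
        links = list(fetch_links(page))
        external = [u for u in links
                    if site_of(u).endswith(".ac.uk") and site_of(u) != home_site]
        university_links.extend((page, u) for u in external)
        dry_pages = 0 if external else dry_pages + 1
        if dry_pages >= patience:
            continue  # frontier frozen: finish queued pages, add no new ones
        for url in links:
            if site_of(url) == home_site and url not in seen:
                seen.add(url)
                frontier.append(url)
    return university_links


# Hypothetical in-memory site used instead of real HTTP requests.
toy_site = {
    "http://www.a.ac.uk/": ["http://www.a.ac.uk/p1", "http://www.b.ac.uk/"],
    "http://www.a.ac.uk/p1": ["http://www.a.ac.uk/p2"],
    "http://www.a.ac.uk/p2": ["http://www.a.ac.uk/p3"],
    "http://www.a.ac.uk/p3": ["http://www.c.ac.uk/x"],
}
found = crawl("http://www.a.ac.uk/", lambda page: toy_site.get(page, []), patience=2)
```

With `patience=2` the toy crawl stops expanding before it reaches the page that links to the third site, which shows the trade-off between crawl speed and coverage that the threshold controls.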
4.4. Automatic web page classification
A total of 2500 web pages were randomly selected and manually classified by the first author into staff pages and support pages as two separate facets. These manually classified pages formed the training set that was used to create the model for automatic web page classification. The techniques described in Kenekayoro et al. [19] were used to create the classification model.
Staff-related web page facet:
Staff pages – related to staff in university. Examples include homepages, staff profiles, list of publications, CVs.
Other pages – all other pages.
Academic pages facet:
Support pages – library web pages that contain repositories of learning resources for staff/student and documents/information for enhancing teaching/learning skills.
Academic pages – all other pages.
The web pages in the training set were pre-processed and used to create the classification model. The following steps were used to transform the web pages into vectors for machine learning. The machine learning raw data was the web page URLs and web page titles split into ‘words’ at punctuation characters ‘[space]\r\t\n.,;:\′\’()?!-><#$\\%&*+/@^_=[]{}|‘’. All terms were converted to lower case and stemmed with the Porter Stemming Algorithm.
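The pre-processing step can be sketched as follows. The delimiter set is an approximation of the punctuation list given above, and the function name `page_terms` is hypothetical. Stemming is left as an optional callable because the Porter stemmer is not part of the Python standard library; a Porter implementation such as the one in NLTK could be passed in as `stem`.

```python
import re

# Delimiters approximating the punctuation list given in the text.
DELIMITERS = " \r\t\n.,;:'\"()?!-><#$\\%&*+/@^_=[]{}|‘’"
SPLITTER = re.compile("[" + re.escape(DELIMITERS) + "]+")


def page_terms(url, title, stem=None):
    """Split a page's URL and title into lower-cased terms, optionally stemmed."""
    terms = [t.lower() for t in SPLITTER.split(url + " " + title) if t]
    return [stem(t) for t in terms] if stem else terms


# Hypothetical staff profile page.
terms = page_terms("http://www.example.ac.uk/staff/j_smith",
                   "Dr Jane Smith - Staff Profile")
```

For this hypothetical page the resulting terms include `staff` from both the URL and the title, the kind of feature a classifier could use to recognize staff-related pages.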
WEKA [32] implements support vector machines in its sequential minimal optimization algorithm. The default settings of the classifier along with the top 250 features were used to create the model that automatically classified web pages. Tweaking the classifier settings or the number of features may improve the accuracy of the classifier, but this was not necessary because the default settings produced up to 94.5% accuracy, determined by 10-fold cross-validation on a human-classified dataset, which was good enough for the purpose of this study.
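Once pages carry labels from the two facets, the link subsets compared in this study can be derived by filtering on the predicted type of the source and target pages. The sketch below is illustrative only: `is_staff` and `is_support` stand in for the classifier's predictions, and treating a staff target link as any link whose target is a staff page is an assumption, since the term is not defined explicitly in the text.

```python
def link_subsets(links, is_staff, is_support):
    """Group (source, target) links by the predicted types of their end pages."""
    subsets = {"all": [], "staff_target": [], "inter_staff": [], "academic": []}
    for source, target in links:
        link = (source, target)
        subsets["all"].append(link)
        if is_staff(target):
            subsets["staff_target"].append(link)  # assumed: target is a staff page
            if is_staff(source):
                subsets["inter_staff"].append(link)  # both ends are staff pages
        if not is_support(source):
            subsets["academic"].append(link)  # excludes links from support pages
    return subsets


# Hypothetical classifier output, represented as sets of page URLs.
staff_pages = {"http://a.ac.uk/staff/x", "http://b.ac.uk/staff/y"}
support_pages = {"http://a.ac.uk/library"}
links = [
    ("http://a.ac.uk/staff/x", "http://b.ac.uk/staff/y"),
    ("http://a.ac.uk/library", "http://b.ac.uk/staff/y"),
    ("http://a.ac.uk/news", "http://b.ac.uk/"),
]
subsets = link_subsets(links, staff_pages.__contains__, support_pages.__contains__)
```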
4.5. Normalization
The size of a university is a factor that can influence the total number of inlinks and outlinks from a university’s website. Academic staff size and research quality of universities are factors that influence the number of outlinks created in a university’s website and the likelihood that it will be the target of inlinks [33]. University size can also influence its total number of collaborations and therefore both collaboration and hyperlink data should be size-normalized before conducting any correlation test.
The number of publications, the number of research projects, the number of inlinks and the number of project collaborations were divided by the total number of academic staff in 2008 for each university. The number of a university’s project collaborations is the number of research projects the university partook in with other UK universities.
The number of links connecting two universities, the number of co-authored publications and the number of co-participating projects between two universities were divided by the product of the number of academic staff in the two universities in 2008 for normalization.
The Higher Education Statistics Agency (www.hesa.ac.uk) was used to provide data about the number of academic staff in each UK university.
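The two size normalizations amount to simple divisions, sketched below. The function names and the staff and count figures are hypothetical and serve only to show the arithmetic.

```python
def per_academic(count, staff):
    """Normalize a university-level count by its number of academic staff."""
    return count / staff


def per_staff_pair(count, staff_a, staff_b):
    """Normalize a pairwise count by the product of the two staff sizes."""
    return count / (staff_a * staff_b)


# Hypothetical figures: 500 inlinks to a university with 1000 academic staff,
# and 12 co-authored publications between universities of 1000 and 2000 staff.
inlinks_per_academic = per_academic(500, 1000)
normalized_coauthorship = per_staff_pair(12, 1000, 2000)
```

Dividing pairwise counts by the product of staff sizes reflects the number of possible staff-to-staff pairings between two universities, so a large pair is not favoured simply because it has more potential collaborators.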
The geographic distance separating two universities was determined by the straight line distance between the addresses of the two universities. Longitude and latitude geographic coordinates of the addresses of universities were extracted from Google Maps. The spherical law of cosines was used to compute the geographic distance separating two universities. This formula is widely used to compute the distance between two geographic coordinates [34]:
$$d = R \arccos\left(\sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_2 - \lambda_1)\right)$$
where $\varphi_1, \varphi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two universities (in radians), and R is the radius of the earth: 6371 km.
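A minimal sketch of the distance calculation follows, assuming coordinates are given in decimal degrees. The function name is hypothetical, and the clamping step is an added safeguard against floating-point rounding rather than part of the formula.

```python
import math

EARTH_RADIUS_KM = 6371.0


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines (degrees in, km out)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lambda = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lambda))
    # Rounding can push the value slightly outside [-1, 1] for nearby points.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_KM * math.acos(cos_angle)


# A quarter of the equator: expected to be R * pi / 2.
quarter_equator = distance_km(0.0, 0.0, 0.0, 90.0)
```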
5. Results
To identify the extent of association between hyperlinks and publication data from Thomson Reuters, the data was compared using Spearman correlations. Tables 1 and 2 show the correlations between hyperlink data, research project data and publication data of UK universities.
Table 1. Pairs of universities: Spearman correlations between the links between two universities’ websites (NL), the staff target links (NSTL), inter-staff links (NISL), the number of co-participating projects (NCPP), co-authored publications (NCAP) and the geographic distance separating two UK universities in the 2013 CWTS Leiden Ranking (all normalized except distance).
Correlation is significant at the 0.01 level (two-tailed).
Table 2. Individual universities: Spearman correlations between the total number of inlinks (IPA), academic inlinks (AIPA), publications (PPA), project collaborations per academic (PCPA) and research projects (RPPA) for UK universities (all per academic).
Correlation is significant at the 0.01 level (two-tailed).
Correlation is significant at the 0.05 level (two-tailed).
Statistical correlations between links and collaboration indicators may be a result of secondary factors that influence both links and co-authored publications. Webometrics results can be only partially validated through correlation tests [35], which is why a random selection of links should be also classified to give context to the results of webometrics studies [36]. However, if links are previously selected to include only those links that may be created for collaborative reasons before the correlation tests, it can increase the confidence in the validity of correlation tests.
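For readers who wish to reproduce this kind of comparison, Spearman's coefficient is the Pearson correlation of the ranks of the two variables. The sketch below is a plain Python illustration with tied values given their average rank; it is not the statistical software used in this study, it assumes neither input is constant, and it does not compute significance levels.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical normalized link and co-authorship counts for four pairs.
rho = spearman([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.0, 50.0])
```

Because only ranks are used, the coefficient is insensitive to the skewed distributions typical of link counts, which is one reason rank correlation suits this kind of data.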
6. Discussion and conclusions
The goal of this paper was to identify the extent to which hyperlink data can be improved to best indicate the extent of collaboration between universities. Hyperlinks connecting UK university websites were analysed to find any associations between hyperlinks and co-authored publications from databases like Thomson Reuters or SCOPUS, which have previously been used to study collaboration between organizations.
The first research question investigated if the extent of collaboration between two universities is better estimated by links between staff-related web pages rather than links between all web pages. When university staff-related web pages were automatically identified using a supervised learning technique with up to 94% accuracy, the correlation between the number of links between university staff-related web pages and the number of co-authored publications between the two universities was lower than the correlation between all links between the two university websites and co-authored publications. Even though the majority of links between staff-related web pages may have been created for collaborative reasons, inter-staff links do not associate better with other indicators of collaboration compared to all links. This may be because, currently, there are not enough staff pages in university websites to give reliable results. Also, as a result of the changing web, some academics may have their online presence on social networking sites like LinkedIn, Facebook, Twitter and Academia.edu instead of personal web pages and online CVs. Moreover, links between staff pages are not the only links in an institution’s website that suggest collaborative activities. Previous studies show that links that appear at shallow depths and links that are clickable logos are particularly likely to indicate collaboration relationships between the two organizations that the link connects, but these attributes are not exploited in the classification scheme used in this research. Including links from other webpages that contain links created for collaboration reasons may therefore improve the results. There is a marginal improvement in the results when links are restricted to staff target web pages instead of inter-staff links or raw links.
The second research question assessed if the total inlink count to a university website will better estimate the extent to which a university collaborates if only academic links are used. Academic links are links from university websites, excluding links from library and other service pages. The correlations show a significant improvement in the association between the number of academic inlinks to a university and the number of research projects that the university partakes in with other universities compared with using all inlinks. This second result also suggests that supervised learning methods to automatically classify hyperlinks can be used to improve the results of webometric studies.
Appendix A: List of UK universities that are used in this study
| Anglia Ruskin University | Open University | University of Hull |
| Aberystwyth University | Oxford Brookes University | University of Kent |
| Aston University | Plymouth University | University of Leeds |
| Bangor University | Queen Margaret University Edinburgh | University of Leicester |
| Bath Spa University | Queen Mary University London | University of Lincoln |
| Birkbeck, University of London | Queen’s University Belfast | University of Liverpool |
| Birmingham City University | Robert Gordon University | University of Manchester |
| Bournemouth University | Royal Holloway University | University of Northampton |
| Bristol University | School of Oriental and African Studies | University of Northumbria |
| Brunel University | Sheffield Hallam University | University of Nottingham |
| Canterbury Christ Church University | Swansea University | University of Oxford |
| City University London | Teesside University | University of Portsmouth |
| Coventry University | University College London | University of Reading |
| De Montfort University | University for the Creative Arts | University of Salford |
| Durham University | University of Aberdeen | University of Sheffield |
| Edinburgh Napier University | University of Abertay | University of Southampton |
| Glasgow Caledonian University | University of Bath | University of St Andrews |
| Goldsmiths, University of London | University of Bedfordshire | University of Stirling |
| Harper Adams University | University of Birmingham | University of Strathclyde |
| Heriot–Watt University | University of Bradford | University of Sunderland |
| Imperial College London | University of Brighton | University of Surrey |
| King’s College London | University of Buckingham | University of Sussex |
| Kingston University | University of Cambridge | University of the Arts London |
| Lancaster University | University of Derby | University of the West of England |
| Leeds Metropolitan University | University of Dundee | University of Ulster |
| Liverpool Hope University | University of East Anglia | University of Wales, Lampeter |
| Liverpool John Moores University | University of East London | University of Wales, Newport |
| London Metropolitan University | University of Edinburgh | University of Warwick |
| London Sch of Economics and Political Science | University of Exeter | University of West London |
| London Sch of Hygiene and Trop Med | University of Glamorgan | University of West of Scotland |
| London South Bank University | University of Glasgow | University of Westminster |
| Loughborough University | University of Gloucestershire | University of Wolverhampton |
| Manchester Metropolitan University | University of Greenwich | University of Worcester |
| Newcastle University | University of Hertfordshire | University of York |
| Nottingham Trent University | University of Huddersfield | |
Acknowledgements
The authors thank Leiden University, especially Ludo Waltman, for providing the co-authored publication data for UK universities.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
