Abstract
Ranking universities by their web-based activities plays a pivotal role in promoting scientific advancement, since it motivates open access to scientific results. In this study, a new ranking system based on website quality factors and traffic evaluation is proposed. Since top-ranked universities are usually treated as models by lower-ranked ones, the study focused on top-ranked universities. The proposed ranking was compared with the well-known Webometrics ranking system. Website traffic and quality assessment measures were acquired for the websites of top-ranked world universities, and the correlation between these indices and the Webometrics ranking was evaluated. For ranking purposes, the measures were combined into a weighted sum, using an optimal weight vector obtained by a genetic algorithm framework. The results showed that the website total traffic size was correlated with Webometrics rank (
Keywords
Introduction
The World Wide Web has become the most important tool for information exchange and communication between researchers, scientific centers and universities all over the world (Park & Thelwall, 2006). In this regard, the quality and quantity of the information that a center presents on the Web has been the center of attention during the past decade. So far, several ranking systems have been proposed in order to motivate the publication of scientific content on the Web and in this way support scientific development (Wimaladharma & Herath, 2016).
International webometric ranking systems have been suggested since the mid-1990s, focusing on the quantity and quality of website content (Almind & Ingwersen, 1997). Criteria such as the size of the website, collaboration with other websites and the extent of rich files such as documents, slides and multimedia were considered for ranking purposes. According to Björneborn and Ingwersen, webometrics is the study of the quantitative aspects of the construction and use of information resources on the Web, based on informetric approaches (Björneborn & Ingwersen, 2004). Webometrics measures aspects related to websites, webpages and hyperlinks using search engines (Michael Thelwall, 2009). A webometric index plays a pivotal role in information and organization management (Longqing & Qingfeng, 2011). Such an index aims to give a clear picture of the scientific and educational status of an organization and helps less-developed organizations to resolve their weaknesses and improve the quality and quantity of their information (Taheri et al., 2015).
The most popular webometric ranking system is Webometrics, developed by the Cybermetrics Lab, a group of the Spanish National Research Council, which takes into account the quantity and quality of Web content. Four measures, Visibility/Impact, Size/Presence, Openness/Richness and Excellence/Scholar (refer to Table 1 for a detailed description), are considered for ranking thousands of academic centers around the world according to their Web-based and scientific activities. So far, several studies have used the Webometrics criteria to compare the scientific status of independent institutions. Examples include a comparison of the webometric status of world universities (Nissom & Kulathuramaiyer, 2012), a comparison of the website impact factor of Arab-world universities (Elgohary, 2008), a study of the Webometrics rank of open access repositories (Aguillo et al., 2010), a comparison of top-ranked Iranian medical universities with top-ranked world universities (Farashi et al., 2020) and a study of the webometric status of Iranian hospitals (Shadpour et al., 2013).
Despite the advantages of Webometrics ranking, it has several drawbacks (Mike Thelwall, 2008). The methodology for ranking and the period of reports are two controversial issues for Webometrics. This ranking system releases two reports each year, therefore continuous tracking of the quality and quantity of Web-based activities is difficult based on Webometrics. On the other hand, the criteria used by Webometrics are all quantitative measures (i.e. the number of backlinks, number of webpages, number of citations from top authors and number of high-quality papers), while it seems that the quality of the website is a missing concept for university evaluation by Webometrics. Therefore, the Webometrics ranking can exhibit only a part of the research and educational activities of a university (Mike Thelwall, 2008). In this regard, proposing new terms and systems for the assessment of Web-based activities of universities has been the subject of some studies.
Jati and Dominic proposed a ranking system that used an entropy measure for dynamically weighting the visibility, size, openness and excellence values (Jati & Dominic, 2017). The method provided the same result as Webometrics, which uses a fixed weight for each measure. Kats and Rokach presented an innovative concept named Wikiometrics that used metrics extracted from Wikipedia for ranking purposes and obtained results comparable with standard ranking systems (Singh et al., 2013). Patidar and Vishwakarma proposed a new ranking system for considering different kinds of research organizations in the same package using a clustered domain ranking strategy (Patidar & Vishwakarma, 2014).
Here, it is hypothesized that the traffic indicators of a website and the criteria for evaluating the quality of website design and content might be suitable measures for webometric assessment. In this regard, the relationship between Webometrics ranking and website traffic indices needs to be investigated. The results may reveal whether website-related information can predict the Webometrics rank, or help to suggest more complete webometric metrics for a more accurate ranking.
So far, several studies have used traffic measures to evaluate Web-based activities. Guskov et al. investigated the relationship between website traffic (number of users) and Webometrics ranking for 10 scientific organizations. The results showed no significant relationship between Webometrics ranking and website traffic (Guskov et al., 2015). However, Vaughan and Yang calculated the website traffic of several universities and business centers in the USA and China and showed that there was a significant correlation between website traffic and the performance of such centers (Vaughan & Yang, 2013). Sarkar et al. investigated the websites of tourism industry companies in India, considering measures such as traffic size, webpage loading speed, number of webpages and so on (Sarkar et al., 2018). Jun et al. studied the capability of website traffic for forecasting by analogy. They showed that website traffic could be used as a potential tool for forecasting purposes (Jun et al., 2017).
The present study tested the hypothesis that Webometrics ranking is correlated with website traffic and quality assessment measures. Furthermore, it investigated whether such measures could be used to build a new ranking system, either as an independent ranking system or for anticipating future rankings.
Material and methods
Sample population
To choose the sample population for this study, the website traffic size of several Webometrics top-ranked universities was obtained from the “similarweb” website, and an exponential curve was then fitted to the traffic values. Universities were added to the sample population until the fitted exponential curve crossed a threshold equal to 10% of the maximum value. In this way 30 universities were selected for further analysis. This guarantees that the sample population includes the most successful universities from a webometric perspective, which are considered as the reference for other universities.
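The selection rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the traffic values are assumed to be listed in the order in which universities are considered, the curve is assumed to have the form y = a·exp(b·x) and to be fitted by least squares on the logarithm of the values, and the names `fit_exponential` and `cutoff_index` are hypothetical. The text does not state which fitting procedure was actually used.

```python
import math


def fit_exponential(values):
    """Fit y = a * exp(b * x) for x = 0, 1, 2, ... by least squares on log(y).

    Assumption: all values are strictly positive.
    """
    n = len(values)
    xs = list(range(n))
    logs = [math.log(v) for v in values]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b


def cutoff_index(values, fraction=0.10):
    """Number of leading samples kept before the fitted curve
    falls below fraction * max(values)."""
    a, b = fit_exponential(values)
    if b >= 0:
        # The fitted curve does not decay, so no crossing exists.
        return len(values)
    threshold = fraction * max(values)
    x_cross = math.log(threshold / a) / b
    return min(len(values), max(0, math.ceil(x_cross)))


# Synthetic, purely illustrative traffic values (not data from the study).
traffic = [100.0 * math.exp(-0.1 * i) for i in range(40)]
kept = cutoff_index(traffic)
```

With real traffic data the fitted curve is only an approximation, so the cut-off depends on both the ordering of the universities and the fitting method chosen.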
These 30 top-ranked world universities comprised 25 universities from the USA, three from England and two from Canada. Lower-ranked universities usually follow top-ranked universities in scientometric methodologies and website design strategies. For this reason, we focused on top-ranked world universities instead of a broader range of universities. Information on the total website traffic size, traffic sources, average visit duration, average pages per visit and bounce rate (the last three were considered for the quality assessment of the website design and content) was acquired from the “similarweb” website (
Description of terms. The last five terms were averaged for three successive months
The block diagram for the procedures performed in this study was depicted in Fig. 1.
Feature set for ranking
Webometrics ranking and website assessment measures acquired for 30 top-ranked world universities
Block diagram of the proposed ranking system.
The features considered in this study were divided into two categories: features from website traffic assessment and features from evaluating the quality of design and contents. The first category included features such as total traffic size and the contribution of different sources to attracting visitors to the website, while the second category included measures such as average visit duration, average pages per visit, bounce rate and the number of broken links. It is worth noting that the total website traffic size was evaluated as the number of visits during a specified period of time and should not be interpreted as the volume of information entering the website (the common definition in information technology). The way in which the total traffic size is generated is another important issue that reflects the Web-based activity of a university. The sources that make up the total traffic size of a website are direct access to the website through web browsers (Direct access), access by referrals, access through search engines, access via social networks and access through links embedded in emails. For universities whose staff, researchers and students are well trained in using web facilities, the contribution of direct access to the website is high. On the other hand, SEO strategies help universities to increase the contribution of search engines to the total traffic size of their website. More activity by persons affiliated with the university on the Web (on other websites or social networks) can also increase the website’s traffic size. Altogether, the contribution of each traffic source to total traffic size may indicate how universities use Web facilities to improve their presence on the Web.
Besides the website’s traffic measures, there are other factors which may motivate a visitor to use the information published on the website or which contribute to the web-based activities of a university. These factors are directly or indirectly related to the quality of website design and to the quality and quantity of the information published on the website. Finding optimal website quality assessment tools is difficult, since universally accepted ones are not available. Even though there are different criteria for evaluating website quality, here we selected a limited number (average visit duration, average pages per visit, bounce rate and the number of broken links) according to the tools available for assessing these measures.
For each university, a ten-element feature vector was created for further analysis, containing total traffic size, the contribution of each of the five sources to total traffic size, average visit duration, average pages per visit, bounce rate and the number of broken links.
In order to investigate the relationship between website traffic indices and Webometrics ranking, Spearman’s rank correlation coefficient was used. The Spearman’s rank correlation coefficient (
In which,
Furthermore, the adjusted coefficient of determination (
In which,
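As a point of reference, the two statistics named above can be computed from their standard textbook definitions. The sketch below is an assumption-laden illustration rather than the authors' exact formulation: Spearman's coefficient is computed as the Pearson correlation of the ranks (with average ranks for ties), the adjusted coefficient of determination uses the common form 1 − (1 − R²)(n − 1)/(n − p − 1), and the names `n` (number of samples) and `p` (number of predictors) are chosen here for illustration.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p predictors (requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

For data without ties, `spearman` gives the same value as the rank-difference form 1 − 6Σd²/(n(n² − 1)).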
Information obtained from the website’s traffic measures (i.e. total traffic size and the contribution of search engines, direct access, social networks, referrals and emails to total traffic size) and quality measures (i.e. average pages per visit, average visit duration, bounce rate and number of broken links) was used as the feature vector (consisting of 10 elements) to check whether these factors could predict the Webometrics ranking of top-ranked universities. To determine the weight of each factor in the total ranking score, a genetic algorithm strategy was used as follows.
1. The sample population (feature vectors) was divided into train and test sets by a K-fold cross-validation strategy (K
2. The initial weight vector (initial chromosome) was specified by the correlation between each factor and the total Webometrics rank of the universities in the study population. In order to account for statistical significance, the initial weight for each factor was multiplied by 1 minus the p-value of the correlation between that factor and the total Webometrics rank. The weight vector was multiplied element-wise by the feature vector of each university and the products were summed to obtain a score for that university. Such a score was calculated for all universities in the population, and the scores were sorted in descending order. Ideally, the sorted order would match the Webometrics rank, i.e. the first university in the Webometrics ranking would have the highest score, the second university would have the second-highest score, and so on. However, due to the differences between the Webometrics measures and the features considered in this study, identical ranks were difficult to achieve, so another approach was used for comparing the proposed method with the Webometrics ranking. All pairs of universities were considered, and for each pair it was checked whether the university with the better Webometrics rank also had the better rank under the proposed method. A similarity measure was calculated by dividing the number of positive cases by the number of all possible pairs.
3. Using the genetic algorithm operators, i.e. crossover and mutation, and selecting winner parents with a higher probability (weight vectors with higher similarity measures), new weight vectors were generated, and the procedure explained in step 2 was repeated until the average change in fitness value was lower than a predefined threshold.
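The scoring and pairwise comparison described above can be sketched as follows. The function names are hypothetical, the numbers in the usage example are made up for illustration, and two details are assumptions not stated in the text: the sign of each correlation is kept as is in the initial weight, and a tie in scores is counted as a disagreement.

```python
from itertools import combinations


def initial_weights(correlations, p_values):
    """Initial chromosome: each correlation scaled by (1 - p-value)."""
    return [r * (1 - p) for r, p in zip(correlations, p_values)]


def scores(weights, features):
    """Weighted sum of the feature vector of each university."""
    return [sum(w * f for w, f in zip(weights, row)) for row in features]


def pairwise_similarity(score_list, reference_ranks):
    """Fraction of pairs ordered the same way by both rankings.

    A higher score means a better position in the proposed ranking;
    a lower reference rank means a better position in the reference ranking.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(score_list)), 2):
        total += 1
        if score_list[i] == score_list[j]:
            continue  # assumption: a tie counts as a disagreement
        reference_prefers_i = reference_ranks[i] < reference_ranks[j]
        proposed_prefers_i = score_list[i] > score_list[j]
        if reference_prefers_i == proposed_prefers_i:
            agree += 1
    return agree / total


# Illustrative values only: five universities with reference ranks 1..5.
example_scores = [0.8, 0.9, 0.7, 0.5, 0.6]
example_ranks = [1, 2, 3, 4, 5]
similarity = pairwise_similarity(example_scores, example_ranks)
```

In this made-up example two of the ten pairs are swapped (the first and second universities, and the fourth and fifth), so eight of ten pairs agree even though the two orderings are not identical.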
Finally, the 10 weight vectors with the highest similarity measure were chosen as the winner chromosomes and their average was taken as the final weight vector. The averaging was performed to reduce the effect of outlier weight values. The performance of the method was evaluated on the test set. In other words, once the proposed ranking system was trained, the feature matrix of the test set (one row per test sample) was applied and the ranking procedure was performed among the test samples. The obtained rank was compared with the Webometrics rank of the test samples in a pair-wise manner to obtain a similarity value. The above-mentioned procedure was repeated 10 times and the average similarity value was reported.
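A minimal sketch of the training loop described above is given below. It is not the authors' implementation: the population size, single-point crossover, Gaussian mutation, elitism, the stopping rule based on the mean fitness over a window of generations, and all parameter names and default values are assumptions introduced for illustration. Only the fitness definition (pairwise agreement with the reference ranking), fitness-proportional parent selection and the averaging of the 10 best weight vectors follow the text.

```python
import random
from itertools import combinations


def fitness(weights, features, ref_ranks):
    """Pairwise agreement between weighted scores and reference ranks."""
    s = [sum(w * f for w, f in zip(weights, row)) for row in features]
    pairs = list(combinations(range(len(s)), 2))
    agree = sum(
        1 for i, j in pairs
        if s[i] != s[j] and (ref_ranks[i] < ref_ranks[j]) == (s[i] > s[j])
    )
    return agree / len(pairs)


def rank_population(population, features, ref_ranks):
    """Return (fitness, weights) tuples sorted from best to worst."""
    scored = [(fitness(w, features, ref_ranks), w) for w in population]
    return sorted(scored, key=lambda t: t[0], reverse=True)


def evolve(features, ref_ranks, initial, pop_size=40, generations=200,
           mutation_scale=0.1, tol=1e-4, patience=10, top_k=10, seed=0):
    rng = random.Random(seed)
    n = len(initial)
    # Start from the initial chromosome plus randomly perturbed copies.
    population = [list(initial)] + [
        [w + rng.gauss(0, mutation_scale) for w in initial]
        for _ in range(pop_size - 1)
    ]
    history = []
    for _ in range(generations):
        scored = rank_population(population, features, ref_ranks)
        history.append(sum(f for f, _ in scored) / len(scored))
        # Assumed stopping rule: mean fitness changed by less than tol
        # over the last `patience` generations.
        if len(history) > patience and abs(history[-1] - history[-1 - patience]) < tol:
            break
        parents = [w for _, w in scored]
        selection_weights = [f + 1e-9 for f, _ in scored]
        children = [list(w) for _, w in scored[:2]]  # keep the two best
        while len(children) < pop_size:
            a, b = rng.choices(parents, weights=selection_weights, k=2)
            cut = rng.randrange(1, n) if n > 1 else 0
            child = a[:cut] + b[cut:]                  # single-point crossover
            child[rng.randrange(n)] += rng.gauss(0, mutation_scale)  # mutation
            children.append(child)
        population = children
    best = [w for _, w in rank_population(population, features, ref_ranks)[:top_k]]
    # Final weight vector: element-wise mean of the top_k chromosomes.
    return [sum(column) / len(best) for column in zip(*best)]
```

Because the search is stochastic, different seeds or training folds can return quite different weight vectors with similar fitness, so the individual weights should be read with caution.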
Correlation analysis between website traffic size and Webometrics ranks for (A) size, (B) visibility, (C) openness and (D) excellence. Each point showed the Webometrics rank against the website total traffic size. For thirty top-ranked world universities, the total traffic size was reported in terms of million visitors (M visitor) obtained from “similarweb” website. The solid line showed the fitted line to the data points by regression analysis. Spearman’s rank correlation coefficient (
Correlation between website total traffic size and overall Webometrics rank. For 30 top-ranked world universities, the traffic was reported in terms of million visitors (M visitor) obtained from “similarweb” website. The solid line showed the fitted line to the data points according to the regression analysis. The Spearman’s rank correlation coefficient (
Figure 2A–D showed the relationship between website total traffic size and its Webometrics rank for size, visibility, openness and excellence measures, respectively. Based on the obtained results, there was a statistically significant (
The average visit duration and average pages per visit indicate the traffic quality of a website (Subhani, 2014). The longer the visit, the higher the probability that the visitor takes an action. The action could be purchasing a product or service or subscribing to the website, which are in line with the purposes of the website owner (Prasetio et al., 2016). In the present study, besides the average visit duration and average pages per visit, their ratio was also considered as a criterion for website quality assessment. The correlation between these measures and the size, visibility, openness and excellence ranks was calculated using Spearman’s rank correlation coefficient. The results are reported in Table 3.
Correlation between average visit duration, average pages per visit and their ratio with Webometrics ranks and bounce rate
Since the results of Figs 2 and 3 indicated a statistically significant correlation between Webometrics ranking and website traffic size, the traffic sources responsible for this correlation were investigated. For this purpose, the correlation between the overall Webometrics rank and the contribution of each traffic source (i.e. direct, search engines, referrals, social networks and emails) to total traffic size was calculated using Spearman’s rank correlation, and the results are reported in Table 4. The result of Table 4 implied that the statistically significant effect (
Correlation between overall Webometrics rank and the contribution of different traffic sources in total traffic size of the website
In order to investigate whether the feature set introduced in this work could predict the Webometrics ranking of top-ranked world universities, a new ranking system was proposed based on these features. According to the methodology section, a genetic algorithm-based ranking system was trained by train samples (24 feature vectors out of 30 feature vectors) and evaluated using test samples in a K-fold (K
Evaluation of the proposed method for predicting the next Webometrics ranking. The Webometrics ranking (January 2020 edition) and traffic information acquired for 5 top-ranked world universities were reported in this table. The system was trained according to the Webometrics report for January 2019
In order to check the ability of the proposed method for Webometrics rank prediction, the website traffic and quality assessment measures of five top-ranked world universities were extracted based on the January 2020 edition of Webometrics report (i.e. a newer and different edition from the one that had been used as train sample for this study). The values were shown in Table 5. It should be noted that in this analysis the proposed ranking framework was trained according to the last ranking reports and then was used for predicting the next Webometrics ranking.
Table 5 indicated that even though the output ranks of the proposed method were not identical to those of Webometrics, in a pairwise comparison 8 out of 10 pairs were correctly ranked. For example, while Stanford University had a higher rank than MIT, California Berkeley and Washington in Webometrics, it was assigned a higher rank than only California Berkeley and Washington by the proposed ranking system, i.e. in two out of three possible pairs the ranking position of Stanford University agreed between the two ranking systems (Webometrics and the proposed one). Repeating this procedure for five randomly selected sets (
This study focused on top-ranked world universities; however, it was also examined whether the proposed ranking system could produce reliable outcomes for middle-ranked or low-ranked universities when it was trained on the information of top-ranked universities. For this purpose from the January 2020 report of Webometrics, three sets of samples (
Figure 2A indicated that by increasing website pages (lower size rank) the total traffic size (or total visits) was larger. Furthermore, websites with higher number of backlinks (lower visibility rank) were more connected to other websites and in this way shared more information contents which increased total traffic size (Fig. 2B). However, the analysis (Fig. 2C and D) showed that there was no significant correlation between total website traffic size and openness/excellence ranks. Since only limited numbers of published papers by affiliated persons of the university are specified as 10% most cited papers, which describes the excellence ranking, it is reasonable to suppose that excellence has limited effect on website traffic size.
The weighted summation of impact, presence, openness and excellence scores is used by Webometrics to rank universities. The correlation analysis presented by Fig. 3 showed that there was a statistically significant correlation between overall Webometrics rank and total traffic size of a website (
The results of Table 3 showed no statistically significant correlation between the quality of website content and Webometrics ranking, while longer average visit duration and larger average pages per visit were correlated with bounce rate. Since bounce rate indicates the percentage of website visitors who leave the website after viewing the first page, a lower bounce rate indicates higher quality of the webpage content, which convinces visitors to stay longer on the website and to view more webpages. The results of Table 3 emphasized that the Webometrics ranking system did not consider the quality of website pages (Mike Thelwall, 2008).
The website total traffic size for 30 top-ranked world universities showed that direct access contributed for 40.41% (min
When the correlation between traffic size and ranks was calculated (Fig. 2), the negative correlation indicated that top-ranked universities with a better rank (lower rank number) had a higher traffic size. It was also reasonable that, for a high-quality website, higher average pages per visit and average visit duration were accompanied by a substantially lower bounce rate (i.e. the large negative correlation reported in Table 3). Furthermore, Table 4 indicated that when the Webometrics rank of a university was better, more users accessed its website content by searching (i.e. the negative correlation between search and total rank in Table 4). This was perhaps due to the advanced SEO strategies adopted by top-ranked universities, which facilitated access to their websites through search engines, so that the majority of users preferred to reach website content through search engines instead of direct access. For lower-ranked universities, in contrast, direct access probably contributed more, which produced the positive correlation reported in Table 4.
Among the features discussed here, the contribution of direct access and search engines in total traffic size and also the total traffic size of the website showed a statistically significant correlation with Webometrics total rank (see Fig. 3 and Table 4,
Even though the proposed method could predict the correct position of universities in Webometric ranking (with a prediction accuracy of up to 69%), the result for forecasting the future rank of middle and low-ranked universities by the proposed method (while the ranking system was trained by the information of top-ranked universities) indicated that the trained system was not able to predict the ranking of universities with low or middle rank in Webometrics with satisfactory accuracy. However, training the system with related information of the corresponding universities (i.e. low rank or middle-ranked universities) might increase the performance of the system.
It is worth noting that, due to the random nature of the genetic algorithm, the weight vector obtained for each repeat (distinct set of train samples) was completely different from those of the other repeats. For this reason, it was not possible to assign a distinct and unique weight to each feature to show its importance in the ranking.
How to use the proposed ranking system?
In this study, a new strategy was proposed for ranking universities according to the website traffic and quality measures. The simulation results showed that the prediction accuracy was about 69% when Webometrics ranking was used as the standard. Since the required information for the proposed ranking system is obtained in a quicker and easier way compared with the required information for Webometrics, it is worth using it for continuously tracking the webometric status of a university when compared with some other targeted universities. Upon the release of the Webometrics report, our proposed system can be retrained by website traffic information and quality measures. The trained system can be used for predicting the possible position of a university among some targeted universities by applying the updated information of website traffic and quality measures. This procedure can be performed for example monthly, therefore, a good estimate about the future Webometrics rank of the university can be obtained. The output of the proposed ranking system provides an evaluation of university Web-based activities toward improving Webometrics ranking. Even though the focus of this study was on universities and their webometric ranking, the proposed method can simply be used as a ranking prediction tool for other types of websites, provided that it is trained by suitable information.
A note on study sample size
Sample size has a substantial effect on machine learning procedures. A small sample size may lead to a biased performance estimate and low statistical power. However, this study was restricted to top-ranked universities according to their website traffic measure, which limited the sample population to only 30 samples. Regarding our regression analysis (Figs 2 and 3), although the sample size is quite small, it lies within the range reported by van der Ploeg et al., who found that the sample size for a linear regression model should be at least 20 to 50 samples for each candidate feature (van der Ploeg et al., 2014). Furthermore, in the current study, a regression/prediction model related to several predictors was proposed. The accuracy of the prediction method relies on sample size and the squared multiple correlation coefficient (
Study limitation and future studies
In this study, a complementary webometric strategy was proposed which might enrich the evaluation of the web-based activities of different types of organizations. However, the proposed method was only tested here for the webometric evaluation of top-ranked universities. Testing the methodology on other types of organizations in future studies might be useful. Furthermore, it would be useful to test the generality of the method by including lower-ranked universities. Also, in this study the traffic size was measured by the number of visitors; other definitions of traffic, such as the number of transmitted and received bytes, could be considered for website usage and the results compared with this study. Another shortcoming of the current study is that the included universities were from English-speaking countries (USA, Canada and England) with nearly identical cultures. This might affect the results, since studies have shown that cultural characteristics might influence website design and structure (Fletcher, 2006) and website usage (Singh et al., 2013). Including samples from different cultures might yield more generalizable results. Finally, in this study a limited number of website quality assessment measures were used. Including further quality-specific assessment measures such as ease of use, reliability, availability of the needed information, download speed and so on could enhance the interpretation.
Conclusion
In order to promote competition between scientific centers and to improve the web-based publication of scientific information, different types of webometric ranking systems have been developed during the last decades. The best known is Webometrics. Despite its benefits, it has many drawbacks, including ignoring factors such as the quality of website design and content updating. These drawbacks limit the potential of the Webometrics ranking for an accurate ranking. In the current study, which focused on the webometric evaluation of top-ranked world universities, the relationship of website traffic and quality measures with Webometrics ranking was investigated. Furthermore, a complementary ranking system based on features extracted from website traffic information and some quality measures was proposed that incorporated the genetic algorithm in the training step. The results showed that there is a correlation between Webometrics ranking and website traffic (
Funding
This work was supported by deputy of research and technology, (Grant No. 980321217).
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Not applicable.
Footnotes
Conflict of interest
The authors declare that they have no conflict of interest.
