Abstract
News feeds generate a colossal amount of data in which important information is buried in detail. State-of-the-art methods are still in their infancy when it comes to providing a generic, publicly available solution that skims the important information from news published by various sources and supports search with specific keywords in different languages. This paper focuses on designing a tool to extract semantic details from news articles published through various internet sources in various languages. The semantic information is stored in a DBMS for ease of organizing and retrieving the data. Further, a querying facility that searches entire articles by keyword or by date is proposed to present the crisp content. News articles in English and two Indian languages, Hindi and Malayalam, are considered for experimentation. The proposed strategy consists of two main components: generative model creation and a query engine. The generative model extracts important entities and keywords, along with their relevance to the article and to other similar articles, using Latent Dirichlet Allocation (LDA) and Named Entity Recognition (NER). The query engine facilitates on-the-fly retrieval of semantic content from the database based on a user keyword. The search engine, together with database indexing, reduces database access time and thereby the time needed to retrieve information. Experimental results show that the proposed method is effective in terms of quality of information and time consumed for information retrieval.
Keywords
Introduction
News analytics deals with measuring various qualitative and quantitative attributes of unstructured textual news articles. These attributes include sentiment, relevance, and novelty. Expressing news articles as numbers and metadata makes it possible to manipulate everyday information mathematically and statistically. News analytics is generally obtained through automated text analysis applied to digital texts, using elements of Natural Language Processing (NLP) and Machine Learning (ML) such as latent semantic analysis, support vector machines, and "bag of words", among other techniques. Text analytics consists of phases such as NER, text similarity, and Latent Dirichlet Allocation (LDA) [1, 2].
The quality of news published by several online news agencies remains questionable: the amount of crisp content in an article is very low compared to its size, and it is time consuming for a user to extract the context of the article. Some news publishers may present biased information about a specific person or entity, and relying solely on such information may mislead the user. Different online news publishers may make contradictory statements about the same news. In a few scenarios the content of a news article may not match its headline. These are some of the reasons why news analytics has become a necessity in the news industry.
Various challenges exist in news analytics. Information gathering is one of the most prominent, since a huge volume of data is produced every minute. Gathering appropriate information from different sources is itself a challenging task, and it becomes even harder if the articles are in different languages. Web scraping from various news sites is another challenge, since each site follows a different strategy for updating its daily news. The problems related to accumulation, representation, and storage of such information in an organised manner for semantic and quick access are still not solved in their entirety [2, 3]. Though various open-source libraries and tools are available for text analytics tasks such as predicting stock trends, they are not yet used in production and remain at an experimental level [4]. Google News Lab analyses audience trends and media analytics to help news publishers publish better-quality news articles [5]. Thus, a humongous amount of information is available on various news sites. Nevertheless, there is a dearth of generic solutions for semantically retrieving information from multilingual news articles based on specific keywords. Machine learning in combination with NLP helps data analysts turn unstructured text into usable data and gain insights from it. Text data requires a special ML approach because of its high dimensionality, a consequence of hundreds of thousands of words and phrases.
This paper proposes a tool that lets a user explore news from varied sources through augmented search, using specific keywords across different sites and different languages.
The contributions of the paper are:
Design of a tool to extract semantic details from news articles published through various internet sources in various languages.
Accumulation, representation, and storage of semantic information within a DBMS to facilitate date-based as well as keyword-based querying.
Integration of Latent Dirichlet Allocation (LDA) with Named Entity Recognition (NER) for semantic extraction of information from a variety of sources.
Related Work
News analytics involves automated text analysis using NLP and ML techniques for semantic analysis and classification [6–8]. A thorough analysis of the qualitative dimensions of event types in scientific research has previously been neglected by scholars, although repeated longitudinal analysis and comparison of organizational behaviors has been carried out [9]. Within business, news analytics informs event-driven trading strategies and business intelligence applications [10]. Search mechanisms for retrieving semantic information from documents are available, based on database indexing, neural networks, and matrix decomposition strategies [11–13].
A similar work is presented in [14], which focuses on presenting condensed news data to the user. There, the data is extracted from RSS feeds and is readily available in XML format. In our proposed model, by contrast, the data is extracted directly from news websites, and the approach is extended to non-English articles. Our work implements various information retrieval methods to extract data, which is not done in the referred work. Further, the proposed work retrieves data for a given keyword and sorts the results by relevance to that keyword, whereas the referred work retrieves news articles only for a given category. Currently available systems mostly process English news articles [15, 16]. Functionality for sentiment analysis [17, 18] is available, but support for non-English articles in languages such as Chinese, Spanish, German, and Indian languages is yet to be developed [17, 19]. Thus, semantic retrieval from Hindi and Malayalam articles is not directly available.
This paper provides enhancements to the work carried out in [20]. The proposed tool extends its functionality to news articles from non-English websites (in this case Hindi and Malayalam). The non-English articles are first translated to English and then processed for storage and retrieval in the later stages. Information retrieval is executed as a set of multiple parallel processes (8 in this case) instead of a single process, thereby reducing the time required to extract the necessary information. Text-based clustering is applied to obtain a common set of keywords for the articles of a particular date. The search results are sorted in order of the importance of the search word in the article. The user interface has also been enhanced to filter news articles by news website and source language.
Computational research aimed at automatically identifying named entities in text forms a vast and heterogeneous pool of strategies, methods, and representations [19, 22]. A multimodal approach to quantifying the entity coherence between image and text in real-world news is adopted in [22]. A system for finding important facts in relevant articles, with graph-based visualization, is presented in [23]. One of the first research papers in the field was presented by Lisa F. Rau and describes a system to “extract and recognize [company] names” [24]. It relies on heuristics and handcrafted rules. The feature space for Named Entity Recognition comprises descriptors, or characteristic attributes of words, designed for algorithmic consumption. A feature vector representation is an abstraction over text in which each word is typically represented by one or more Boolean, numeric, and nominal values. For example, a hypothetical NER system may represent each word of a text with 3 attributes: 1) a Boolean attribute with the value true if the word is capitalized and false otherwise; 2) a numeric attribute corresponding to the length, in characters, of the word; 3) a nominal attribute corresponding to the lowercase version of the word. However, such features do not reflect the semantics of a document in terms of its topics, or how each word contributes to a topic of the document. Hence it is necessary to model the topics of each news article to capture its meaning in a succinct form.
Topic modeling is an unsupervised learning technique that aims at discovering the abstract topics in a set of documents. One of the most powerful techniques based on Dirichlet distributions is Latent Dirichlet Allocation (LDA) [1, 20], first introduced by Blei, Ng and Jordan in 2003. Researchers have published many articles on topic modelling and applied it in fields such as software engineering, political science, medicine, and linguistics [2, 26]. LDA is a generative probabilistic model in which documents are represented as random mixtures over latent topics, and each topic is characterized by a probability distribution over words. The words with the highest probabilities in each topic usually give a good idea of what the topic is. The word distributions of the topics share a common Dirichlet prior. Given a corpus D consisting of M documents, with document d having N_d words, LDA models D based on the generative process shown in Algorithm 1.
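The three-attribute representation described above can be sketched in a few lines of Python. This is only an illustration of the hypothetical feature vector, not a component of the proposed tool; the names `WordFeatures`, `word_features`, and `featurize` are assumptions, and whitespace splitting stands in for a real tokenizer.

```python
from typing import List, NamedTuple


class WordFeatures(NamedTuple):
    """Hypothetical three-attribute feature vector for one word."""
    is_capitalized: bool  # Boolean attribute: does the word start with an uppercase letter?
    length: int           # numeric attribute: length of the word in characters
    lowercase: str        # nominal attribute: lowercase version of the word


def word_features(word: str) -> WordFeatures:
    # word[:1] avoids an IndexError on an empty string.
    return WordFeatures(word[:1].isupper(), len(word), word.lower())


def featurize(text: str) -> List[WordFeatures]:
    # Whitespace tokenization is a simplifying assumption for this sketch.
    return [word_features(token) for token in text.split()]
```

Calling `featurize("New Delhi hosts talks")` would yield one `WordFeatures` tuple per word, with the first two words marked as capitalized.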
Algorithm 1 The LDA algorithm
1: Choose a multinomial distribution φ(t) for each topic t ∈ {1, ..., T} from a Dirichlet distribution with parameter β
2: Choose a multinomial distribution θ(d) for each document d ∈ {1, ..., M} from a Dirichlet distribution with parameter α
3: for each word w_i ∈ d, i ∈ {1, 2, ..., N_d} do
4:   Select a topic z_i from θ(d)
5:   Select a word w_i from φ(z_i)
6: end for
In Algorithm 1, the words in the documents are the only observed variables, while φ and θ are latent variables and α and β are hyperparameters. Thus, the primary objective of this work is to identify latent semantic topics in news articles in different languages, in order to provide stakeholders with summarized, semantic information based on a date or keyword search. The content is organized by news scraping, and information is extracted from the scraped articles to be stored in an RDBMS. Other details, such as the URL of the article or the URLs of similar articles, are also shown so that the complete article can be viewed. In this paper, the terms scraping and crawling are used interchangeably.
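The generative process of Algorithm 1 can be illustrated with a minimal sketch that samples a synthetic corpus. It is not the inference procedure used by the tool (which fits LDA to observed articles); it only shows the direction in which the model assumes documents are produced. The function names, the default hyperparameter values, and the use of symmetric Dirichlet priors are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, num_topics, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Sample documents following the generative process of Algorithm 1."""
    rng = random.Random(seed)
    # Step 1: one word distribution phi(t) per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for n_d in doc_lengths:
        # Step 2: one topic mixture theta(d) per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(alpha, num_topics, rng)
        document = []
        for _ in range(n_d):
            # Step 4: select a topic z_i from theta(d).
            z = rng.choices(range(num_topics), weights=theta)[0]
            # Step 5: select a word w_i from phi(z_i).
            document.append(rng.choices(vocab, weights=phi[z])[0])
        corpus.append(document)
    return corpus
```

Fitting LDA reverses this process: given only the words, it estimates the φ and θ that most plausibly generated them.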
Proposed Model
A tool has been developed that facilitates storage, management, and search of multilingual news articles in English, Hindi, and Malayalam, with efficient retrieval of semantic information from them. This is an extension of the work in [20], where only English newspapers were considered for data retrieval. The current system adds crawlers for Hindi and Malayalam news articles to extract relevant information along with news URLs and article dates. Generic scraping code built on the Scrapy and BeautifulSoup packages in Python mitigates the need for separate data-extraction code for each news website. The crawled details are stored in the database. Periodically, the article content is extracted from the URLs stored in the database and preprocessed so that it can later be used for retrieving information. The details extracted by the tool are the entities (location, person, organisation) mentioned in the news, the news summary, the keywords, and other similar news articles. This information is stored in the database and displayed to the user based on the user query. The tool retrieves content from the database based on a user keyword with decreased search time. Automated crawling makes it easy to maintain different news sources. A search engine is built that helps users find the data they are looking for. The enhancements in the tool are keyword extraction for non-English feeds, clustering of documents to explore the keywords common to the articles, and an LDA-based search mechanism that sorts search results in order of relevance. The user interface is also augmented with provision for search in non-English languages. Parallel processing has been used to improve the retrieval time of documents based on keyword and date.
The schema layout used for database storage is shown in Figure 1. The text in grey indicates the name of each table, and the items listed are the attributes that describe the content stored in it. The underlined attributes form the primary key, which uniquely identifies each record. The primary key of the news table identifies each news article, and the primary key of news_source identifies the website of an article present in news. The arrows indicate foreign key constraints, which serve as references back to the original news article.

Schema Diagram of the tables present in the Database, along with the attributes, to store various details of the news. The underlined attributes are the primary key and arrowhead to the primary key represent foreign keys

The sequence of modules implemented in the tool. The dashed lines indicate the independence of modules of one another
The various operations in the tool are depicted in Figure 2. The dashed lines in Figure 2 denote the modules’ independence from one another. This means that the modules can be easily extended without any complications by the person maintaining the tool and faults can be diagnosed quickly. The tool must be executed in the order shown in the diagram. The code is written in Python, and the libraries used are open source.
The DBMS used is PostgreSQL 10.19, which stores the information shown in the schema diagram (Figure 1). Once the text is stored in the tables, the psycopg2 module serves as the interface between PostgreSQL and the Python code that performs analytics on the text. The outputs of analytics such as LDA and clustering are stored back in tables such as news_lda, news_cluster, and news_keywords. Users can query the database with keywords, and the relevant information is retrieved from the concerned tables. The advantages of the proposed solution are priority-based search and filters that restrict articles by language and news site. Since the primary focus is to pull all data related to news articles for the required keywords, a structured format is needed to arrange the data for efficient retrieval of relevant articles. PostgreSQL was therefore chosen over NoSQL, as it offers a large number of extensions, good indexing mechanisms, and good scalability with inherent concurrency control. However, this approach is not suitable if only analytics over multiple entities is required, and it does not work if sentences rather than a keyword are given for information retrieval. The tables news_source and news_domain are populated prior to information retrieval by scraping the required news articles.
Table 1 shows the news websites considered in this work. Other languages can easily be added.
English and non-English news sources that are crawled in the proposed system
To obtain the URLs of news articles from the websites listed in Table 1, web crawlers are deployed. The details extracted from each news source are the news headline, the URL of the news article, the date of the article, and the source language. These are stored in the news table, whose attributes include news_url, date_of_article, and language. The URL of the main page of a news website is stored in the news_source table as source_url.
Given a news website, the first step is to extract the content from its various subsections using a crawler. The crawler starts from the set of URLs defined in the news_domain table under the column domain_url. These URLs link to the subsections of a news website, such as sports, politics, and science. The crawler then obtains the details from the set of tags defined in the structure of the document. One crawler cannot be used for all news websites, as the tags vary from site to site. The crawled URLs are stored in the table news_domain with details such as date and headlines. Once the URLs are stored, the structure of the page is used to extract the headlines and dates. The enclosing tags differ across news articles, so the way in which the content is read must be defined separately for each type of document. Among the various frameworks for web crawling in Python, Scrapy and BeautifulSoup have been used as crawlers to extract the article content from the news sources.
Scrapy is used for extracting news articles from Times of India and Hindustan Times. BeautifulSoup is another Python library, used for parsing HTML and XML documents. The idea behind incorporating this library into the system is that it eliminates the need to write a specific web crawler for each individual website. BeautifulSoup starts from the URL of a given news website in the news_source table and uses regular expressions and traversal methods such as DFS and BFS to traverse the different parts of the website. This ability makes BeautifulSoup a common web-crawler implementation for a variety of websites. Another advantage of BeautifulSoup is that, depending on the implementation, the user can control the number of articles retrieved. The parsed news URLs, dates, and headings are then inserted into the database. The BeautifulSoup library is used for crawling news articles from The Hindu, Mathrubhumi, and Dainik Jagran.
Algorithm 2 Algorithm for retrieving text, summary and keywords from English news articles
1: Download the article for the URL using the newspaper library
2: if unable to download then
3:   Delete entry from table news for the given URL
4:   Return
5: end if
6: Assign text ← text from the article using newspaper library function
7: Remove line breaks from text
8: Remove non-encodable characters from text
9: Assign summary ← summary from the article using newspaper library function
10: Remove line breaks from summary
11: Remove non-encodable characters from summary
12: Update table news with text, summary for given URL
13: Assign keywords ← keywords from the article using newspaper library function
14: for each keyword in keywords do
15:   Insert ID, keyword into table news_keywords
16: end for
Once the URLs are crawled, the articles are parsed to be stored in the database. Newspaper3k is a python library, which is used for processing news articles based on a URL. URLs without textual articles are omitted from storage. The line breaks from the article are omitted to save storage space. The characters that cannot be encoded are also removed. Algorithm 2 demonstrates how text, summary and keywords for English documents are retrieved and stored in the database.
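The two cleanup steps of Algorithm 2 (removing line breaks and removing non-encodable characters) can be sketched as a small helper. The source does not state which target encoding the tool uses, so the `encoding` parameter and its ASCII default are assumptions, as is the function name `clean_text`.

```python
def clean_text(raw: str, encoding: str = "ascii") -> str:
    """Flatten line breaks and drop characters the target encoding cannot represent.

    The ASCII default is an assumption for illustration only.
    """
    # split() with no argument breaks on any whitespace run, including line
    # breaks, so joining with a single space removes them.
    flattened = " ".join(raw.split())
    # errors="ignore" silently drops every character that cannot be encoded.
    encodable = flattened.encode(encoding, errors="ignore").decode(encoding)
    return encodable.strip()
```

With a stricter or looser storage encoding, only the `encoding` argument would need to change; for translated Hindi or Malayalam text the helper would be applied after translation, as in Algorithms 3 and 4.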
Algorithm 3 Algorithm for retrieving text, summary and keywords from Hindi news articles
1: Download the article for the URL using the newspaper library
2: if unable to download then
3:   Delete entry from table news for the given URL and Return
4: end if
5: Assign text ← text from the article using newspaper library function
6: Translate text to English using googletrans
7: if translation fails then
8:   Return
9: end if
10: Remove line breaks from text
11: Remove non-encodable characters from text
12: Assign summary ← summary from the article using newspaper library function
13: Translate summary to English using googletrans
14: if translation fails then
15:   Return
16: end if
17: Remove line breaks from summary
18: Remove non-encodable characters from summary
19: Update table news with text, summary for given URL
20: Assign keywords ← keywords from the article using newspaper library function
21: for each keyword in keywords do
22:   Translate keyword to English using googletrans
23:   if translation fails then
24:     Return
25:   end if
26:   Insert ID, keyword into table news_keywords
27: end for
The library support can also be extended for Hindi and other language news articles using a translation functionality. A python library called googletrans provides support for translation. The details of Hindi articles(text, summary, keywords) are translated before being inserted into the tables and source language is also stored for reference. Algorithm 3 shows how the text, summary and keywords are stored in case of Hindi news articles.
Algorithm 4 Algorithm for retrieving text, summary and keywords from Malayalam news articles
1: Assign final_text ← ""
2: Obtain set of all <p> tags inside <div class="articleBody"> using Beautiful Soup library
3: if no <p> tags are obtained then
4:   Delete entry from news table for the given URL
5:   Return
6: end if
7: for each of the <p> tags obtained do
8:   Append content inside the <p> tag to final_text
9: end for
10: Translate final_text to English using googletrans
11: if translation fails then
12:   Return
13: end if
14: Assign summary ← output obtained from summary function in gensim library
15: if the summary function returns an error then
16:   Delete entry from news table for the given URL
17:   Return
18: end if
19: Update table news with text, summary for given URL
20: Assign keywords ← output obtained from keyword function in gensim library
21: if the keyword function returns an error then
22:   Delete entry from news table for the given URL
23:   Return
24: end if
25: for each keyword in keywords do
26:   Insert ID, keyword into table news_keywords
27: end for
On the Malayalam news website Mathrubhumi, the text of each news article is stored as a set of <p> tags inside the <div class="articleBody"> tag. The enclosed text is obtained and translated using googletrans. The translated text is fed to the gensim library, which provides functions for obtaining a summary and keywords from the given data. If the text is too short, the library methods return an error. This implies that the given URL does not point to relevant data (for example, it may be a caption for a picture or video gallery), and the entry can be deleted from the news table. Algorithm 4 summarises how the details are extracted and inserted into the database.
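The paragraph-collection step (steps 1 to 9 of Algorithm 4) can be sketched as follows. The tool itself uses BeautifulSoup; to keep the example dependency-free, this sketch swaps in the standard-library `html.parser` module, so it is a stand-in rather than the authors' implementation. The class and function names are hypothetical, and translation and summarization are left out.

```python
from html.parser import HTMLParser


class ArticleBodyParser(HTMLParser):
    """Collect the text of <p> tags nested inside <div class="articleBody">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # > 0 while inside the articleBody div
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested divs so the matching closing tag can be found.
            if self.div_depth or "articleBody" in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def extract_article_text(html):
    """Return the concatenated paragraph text, or None if nothing was found."""
    parser = ArticleBodyParser()
    parser.feed(html)
    parser.close()
    final_text = " ".join(p.strip() for p in parser.paragraphs if p.strip())
    return final_text or None
```

A `None` result corresponds to the case in Algorithm 4 where the entry is deleted from the news table because the page holds no article text.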
The summary, obtained in the given step is only to be shown in the web application and is not used in further stages of information gathering. The function for finding keywords, returns a keyword set and each keyword is added to the table news_keywords in the database, along with the pertaining news identity. The keywords will be used later in keyword based searching.
Text Analytics
After the summary and keywords are extracted and stored in the database, text analytics performs semantic analysis of the articles to support an efficient search engine. This stage incorporates Named Entity Recognition, text similarity, keyword/topic modelling using LDA (Latent Dirichlet Allocation), and clustering.
LDA is a generative probabilistic model of a corpus [25]. It represents documents as random mixtures over latent topics, where a topic is characterized by a distribution over words. LDA represents topics by word probabilities. The words with the highest probabilities in each topic usually give a good idea of what the topic is about, and these words can act as keywords for the given article. The working of LDA for an article is shown in Figure 3. From the sum of word scores obtained as the output of LDA, the keywords A, B, and C are obtained. Word A has the highest score of 0.56, followed by B with 0.28 and C with the lowest score of 0.08. Each word, along with its score, is inserted into the news_lda table and is later used in keyword-based searching.

Example of how important keywords along with the score are retrieved from the obtained sum of word scores, as a result of performing LDA. In this case A, B, C are the most important keywords, with A having the highest probability and C having the lowest.
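One plausible way to obtain such a ranked keyword list is sketched below. The source does not spell out how the "sum of word scores" is formed, so this sketch assumes that each word's probability in a topic is weighted by the article's proportion of that topic and then summed over topics. The function name, the data layout, and all numbers in the usage example are hypothetical.

```python
from collections import defaultdict


def top_keywords(topic_weights, topic_words, k=3):
    """Rank words by their summed, topic-weighted scores.

    topic_weights: {topic_id: proportion of the article assigned to the topic}
    topic_words:   {topic_id: {word: probability of the word in the topic}}
    Returns the k highest-scoring (word, score) pairs, best first.
    """
    scores = defaultdict(float)
    for topic, weight in topic_weights.items():
        for word, prob in topic_words[topic].items():
            scores[word] += weight * prob
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

For an article split evenly between two topics, `top_keywords({0: 0.5, 1: 0.5}, {0: {"A": 0.8, "B": 0.2}, 1: {"A": 0.4, "B": 0.3, "C": 0.3}})` ranks A first, then B, then C; each (word, score) pair would become one row of news_lda.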
Named entity recognition generates tokens from a given article, and each token identified as a Person, a Location, or an Organization is added to the database. To retrieve these details, an NER tagger from the Stanford NER package is used. There is a caveat: if 'New Delhi' is to be recognized as a location, the tokenizing step splits it into two separate Location tokens, 'New' and 'Delhi'. To overcome this issue, the current token is compared with the previously scanned token. If their tags differ, the previous token is added to the database; otherwise the current token is appended to the previous token and the two are stored as a single entity. To avoid storing duplicate entities, a dictionary of scanned entities is maintained, and the current entity is added to the database only if it is not already in the dictionary.
Given a set of articles, the aim of text clustering is to partition the articles into multiple groups based on common keywords. Here, each document is represented as a vector in which the value of each dimension represents the importance of a word in that document. The value is the product of the term frequency and the inverse document frequency, known as the TF-IDF weight. Term frequency is the number of occurrences of the word in the document, whereas inverse document frequency measures how much information the word provides. The articles are partitioned into 6 clusters to obtain the keywords for the following categories: Sports, Politics, Entertainment, Lifestyle, Science and Technology, and Miscellaneous. If the search engine is given only a date range and no specific keywords, the clustering output helps to find relevant articles for the given dates based on the prominent keywords in each cluster.
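The merging and de-duplication logic described above can be sketched as follows. The input is assumed to be a list of (word, tag) pairs such as an NER tagger produces, with "O" marking non-entity tokens; the tag names, the function name, and the example names are assumptions, and a set plays the role of the dictionary of scanned entities.

```python
def merge_entities(tagged_tokens, keep=("PERSON", "LOCATION", "ORGANIZATION")):
    """Join consecutive tokens that share a tag and drop duplicate entities."""
    entities, seen = [], set()
    pending_words, pending_tag = [], None
    # A sentinel pair at the end flushes the last pending entity.
    for word, tag in list(tagged_tokens) + [("", None)]:
        if tag == pending_tag and tag is not None:
            # Same tag as the previous token: extend the current entity.
            pending_words.append(word)
            continue
        # Tag changed: store the previous entity if it is of a kept type.
        if pending_tag in keep:
            entity = (" ".join(pending_words), pending_tag)
            if entity not in seen:
                seen.add(entity)
                entities.append(entity)
        pending_words, pending_tag = [word], tag
    return entities
```

One limitation of this rule, shared with the description above, is that two distinct entities of the same type that appear back to back are merged into one.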
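The TF-IDF representation can be illustrated with a short sketch. It uses raw counts for term frequency and log(N / df) for inverse document frequency, which is one common variant; the source does not say which variant the tool uses, so this choice is an assumption, as are the function name and the whitespace tokenization. The clustering step itself is not shown.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: tf-idf weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency as a raw count
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

A term that occurs in every document receives a weight of zero, which is why words shared by all articles do not influence the clusters.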
Search Mechanism
The data stored in the tables are retrieved and presented to the user in an organised manner by appropriate indexing on the tables. The user interface in the tool has various input options to filter down the results such as specific websites, source language, and date of publication. There are two primary ways of querying/searching the articles:
Searching without a keyword
Searching using a keyword
Contents displayed when searched with a specific keyword
If explicit keywords are not input by the user, then the details of all the articles in a specified date range, the relation between the articles and the summary of the news articles based on clustering are displayed. Table 3 shows the details displayed to the user, when the user queries for the data, without specifying a keyword. It can be noted that the cluster groups and the pertaining details, are displayed separately, while the other details are displayed in a tabular format.
Contents displayed when searched without keywords
Searching for all the articles in a given date range can result in a large output if the prominent keywords are too many, and it can be cumbersome to search for the required news in a given context. Hence, the tool provides functionality to query the news articles based on a user specific keyword. The focus is on the keyword and the relevance of keyword in individual articles, rather than a set of articles and the relation between the articles. Here, the results obtained from the table news_lda, news_keywords are aggregated and displayed to the user, sorted by order of relevance of the article to the given keyword. The relevance of each article is set to High, Medium, and Low based on the score of the keyword in each of them as obtained by applying LDA. Thus semantic retrieval of articles is facilitated.
First, the news details pertaining to the news ids present in the news_lda table are retrieved. For instance, two documents A and B, on applying LDA, can be represented as:
In this case, when keyword c is queried, as document A has higher value of c compared to that in document B, document A will be displayed first, followed by document B. In case of querying for keyword f, as document B has higher score of f, document B will be displayed first followed by document A. The output obtained from querying the news_lda table, will have the highest relevance to the keyword, such that the keyword can be used when assigning a headline for the given article. The relevance value assigned to the results so obtained is set to High.
The next set of document details retrieved corresponds to the news ids found in the news_keywords table but not present in the news_lda table. These keywords are important in the given news article but are less relevant than the results obtained from the news_lda table. The relevance value of such articles is set to Medium. The final set of results comes from the news table, where the keyword is only mentioned in the article and has no significant relevance to it. The relevance value is set to Low in such cases. Algorithm 5 shows the order in which the search is carried out, and Algorithm 6 specifies the details that are displayed for each news article. Documents with relevance value High are displayed first, followed by those with Medium relevance, and then those with Low relevance.
Algorithm 5 Algorithm for search mechanism
1: Assign result_set ← []
2: Retrieve set of news_id in news_lda table, for which the value in word attribute is same as the keyword and sorted in descending order by value score
3: for each news_id in the retrieved set do
4:   Assign to_add_entry ← get_news_details(news_id, "High")
5:   Add to_add_entry to result_set
6: end for
7: Retrieve set of news_id in news_keywords table, for which the value in word_entry attribute is same as the keyword and news_id not present in news_lda table
8: for each news_id in the retrieved set do
9:   Assign to_add_entry ← get_news_details(news_id, "Medium")
10:   Add to_add_entry to result_set
11: end for
12: Retrieve set of news_id in news table, for which the value in news_text contains the keyword and news_id not present in news_lda table and news_id not present in news_keywords
13: for each news_id in the retrieved set do
14:   Assign to_add_entry ← get_news_details(news_id, "Low")
15:   Add to_add_entry to result_set
16: end for
17: for each item in result_set do
18:   Display item (as a tuple)
19: end for
Algorithm 6 Algorithm to get details for a given news id. This is also called from Algorithm 5 as get_news_details(news_id, relevance_score)
1: Retrieve summary, date_of_article, news_url, language from news table and using source_id present in news retrieve source_name from news_source table and add to to_add_entry
2: Assign relevance score to the given input value of relevance score and add to to_add_entry
3: Retrieve entity, entity_type using news_id from news_details table and add to to_add_entry
4: Retrieve word_entry using news_id from news_keywords table and add to to_add_entry
5: Add to_add_entry to result_set
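The three-tier ordering of Algorithm 5 can be sketched over in-memory stand-ins for the tables. The data layout (lists of tuples and a dictionary) and the function name `search` are assumptions for illustration; the tool itself queries PostgreSQL. The sketch also interprets "not present in news_lda" as "not already returned at a higher relevance tier for this keyword", which is an assumption about the intended meaning.

```python
def search(keyword, news_lda, news_keywords, news):
    """Return (news_id, relevance) pairs ordered High, then Medium, then Low.

    news_lda:      list of (news_id, word, score) rows
    news_keywords: list of (news_id, word_entry) rows
    news:          dict mapping news_id to news_text
    """
    result_set, seen = [], set()

    def add(news_ids, relevance):
        # Each article appears once, at the highest tier in which it is found.
        for news_id in news_ids:
            if news_id not in seen:
                seen.add(news_id)
                result_set.append((news_id, relevance))

    # High: keyword is an LDA topic word; highest score first.
    lda_hits = sorted((row for row in news_lda if row[1] == keyword),
                      key=lambda row: row[2], reverse=True)
    add((row[0] for row in lda_hits), "High")
    # Medium: keyword was extracted as a keyword of the article.
    add((nid for nid, word in news_keywords if word == keyword), "Medium")
    # Low: keyword is merely mentioned in the article text.
    add((nid for nid, text in news.items() if keyword in text), "Low")
    return result_set
```

In the full tool, each pair would then be expanded by get_news_details (Algorithm 6) into the summary, date, URL, entities, and keywords shown to the user.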
Experimentation and Results
SetUp
The tool is deployed and tested on a commodity machine with an 8th-generation i7 processor clocked at 1.8 GHz and 8 GB of RAM. Due to the large size of the expected output, the tool is tested for a single day. The crawler is implemented for the following news media websites: The Hindu, Times of India, Hindustan Times, Mathrubhumi, and Dainik Jagran. The main Python libraries used are gensim, BeautifulSoup, and newspaper3k, which implement stages such as topic modelling, web scraping, and summarization. The experiments examine single-thread processing versus multiprocessing, a comparison of the gensim and newspaper3k libraries, and search based on the topic modelling algorithm LDA.
Single processing versus Multiprocessing
Table 4 depicts the time taken for each step of news analytics when implemented and executed sequentially as a single thread. B-tree indexes are created on date_of_article of the news table, to reduce the time taken to retrieve articles by date range, and on news_id of the news_lda, news_keywords, news_details, and news_cluster tables, to reduce the time taken to extract the details for a given news_id. As PostgreSQL is used, any deadlock that occurs is handled by the PostgreSQL engine itself. The processes are independent of each other and do not overlap. The order of insertion of news articles is also unimportant, as PostgreSQL supports simultaneous inserts through its built-in concurrency control mechanism.
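The indexing scheme described above can be sketched as follows. This is only an illustration: it uses the standard-library sqlite3 module, whose indexes are also B-trees, as a stand-in for PostgreSQL (where CREATE INDEX builds a B-tree by default). The index names and the helper functions create_indexes and news_ids_between are hypothetical; the table and column names are taken from the text.

```python
import sqlite3

# Column to index for each table, as described in the text.
INDEXED_COLUMNS = {
    "news": "date_of_article",
    "news_lda": "news_id",
    "news_keywords": "news_id",
    "news_details": "news_id",
    "news_cluster": "news_id",
}


def create_indexes(conn):
    """Create one index per table; names idx_<table>_<column> are made up."""
    for table, column in INDEXED_COLUMNS.items():
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS idx_{table}_{column} "
            f"ON {table} ({column})"
        )
    conn.commit()


def news_ids_between(conn, start_date, end_date):
    """Date-range lookup that the date_of_article index is meant to serve.

    Dates are assumed to be stored as ISO strings (YYYY-MM-DD), which
    sort in chronological order.
    """
    rows = conn.execute(
        "SELECT news_id FROM news "
        "WHERE date_of_article BETWEEN ? AND ? ORDER BY news_id",
        (start_date, end_date),
    )
    return [news_id for (news_id,) in rows]
```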
Processing time of various steps for 1500 articles. The readings were taken when the steps were executed as a single thread
To reduce the execution time, a multiprocessing approach is adopted in which several processes execute simultaneously. Table 5 depicts the time taken to extract and store article text from different media URLs. As the number of processes doubles, the time taken halves. As this experiment is run on an 8-core system, at most 8 processes can run at a time.
Time taken (in seconds) for multiprocessing of given number of input URLs for varied number of processes
Each stage is therefore run as a set of 8 concurrent processes, so that 8 records are processed at a time per stage. Concurrent inserts into the various tables are handled by the PostgreSQL DBMS.
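Running one stage as a pool of 8 processes can be sketched with Python's standard multiprocessing module. This is a minimal sketch, not the tool's code: run_stage and extract_article are hypothetical names, the worker is a placeholder that does no network or database access, and the example URLs are invented.

```python
from multiprocessing import Pool


def run_stage(worker, items, n_processes=8):
    """Apply worker to every item using a pool of n_processes processes.

    Results are returned in the same order as the input items.
    """
    with Pool(processes=n_processes) as pool:
        return pool.map(worker, items)


def extract_article(url):
    """Placeholder worker (assumption).

    A real worker would download the page, extract the article text,
    and insert it into the database over its own connection.
    """
    return url, len(url)


if __name__ == "__main__":
    # Hypothetical input URLs, for illustration only.
    urls = [f"https://example.com/news/{i}" for i in range(16)]
    for url, size in run_stage(extract_article, urls, n_processes=8):
        print(url, size)
```

The worker must be a top-level function so that it can be sent to the worker processes, and each process should open its own database connection rather than share one.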
Comparison of gensim and newspaper3k libraries based on the character length of the text in the original article and summary of the articles
Average keyword count comparison for a given number of articles. The more keywords are extracted, the more information can be obtained from the article
A working model of the LDA-based search is implemented. The search is conducted for various sets of keywords (including a search without keywords, that is, a search for all the articles) over a given set of articles. The search algorithm is executed three times and the average retrieval time is considered. Tables 8, 9, 10, and 11 depict the execution results on 5449, 9998, 15597, and 20396 articles respectively. The results show that the search time for a given keyword depends not on the number of articles retrieved, but on the total size of the article set against which the search is executed.
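The measurement procedure (three runs, averaged) can be sketched as follows. This is a minimal sketch: average_retrieval_ms is a hypothetical helper, and search stands for any callable that runs the keyword search, which is an assumption about how the search is invoked.

```python
import time


def average_retrieval_ms(search, keyword, runs=3):
    """Run search(keyword) several times and return the mean time in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search(keyword)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)
```

Passing None or an empty keyword to the search callable would correspond to the "search for all the articles" case, depending on how the search function treats it.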
The overall average time is computed for each article set and plotted against the number of articles in the set. Table 12 lists the number of articles against the average retrieval time. Figure 4 shows a linear dependency between the number of articles and the search time for an arbitrary keyword, as indicated by the blue line. Even when the number of articles is 0, accessing the database and processing the input take a small, negligible amount of time.
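A linear relation of this kind can be checked with an ordinary least-squares fit, where the slope is the added time per article and the intercept is the fixed cost at 0 articles. The sketch below is illustrative only: fit_line is a hypothetical helper, and the times in the demo are synthetic placeholders generated from a formula, not the measurements reported in Table 12.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Article-set sizes from the text.
    n_articles = [5449, 9998, 15597, 20396]
    # Synthetic placeholder times (NOT the paper's measurements).
    times_ms = [0.01 * n + 2.0 for n in n_articles]
    slope, intercept = fit_line(n_articles, times_ms)
    print(f"slope = {slope:.4f} ms/article, intercept = {intercept:.2f} ms")
```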
Conclusion
This paper has proposed a tool that facilitates analysis based on unsupervised learning, summarization, and semantic information retrieval of news articles from various sources. The tool supports news articles not only in English but also in the Indian languages Hindi and Malayalam. It comprises web crawling, keyword extraction based on LDA, clustering of the articles, and storage and management of the articles in an organized form within a DBMS with proper indexing structures. Experiments on selected real news websites in English, Hindi, and Malayalam show the feasibility of the proposed system. It has been observed that, given the keywords as input, the retrieval time of the relevant articles is linear in the total size of the articles stored. If keywords are not explicitly mentioned, clustering of the articles is used to identify the apt set of articles from each cluster based on the keywords common to them. Thus, the proposed tool integrates the crawling of articles in different languages, the compilation and analysis of news articles from different sources in different languages, and a search engine on a single platform, obviating the need for separate APIs or tools for different functionalities.
In future, this tool can be equipped with crawlers for foreign languages. The search engine can be extended to allow querying the database for multiple sentences and phrases, and not just a set of keywords. Currently, only unsupervised learning is supported. Supervised learning for news analytics can also be included to better understand the performance and accuracy of the system.

Graph of average retrieval time against number of articles. The blue line indicates a linear relation between the number of articles and the retrieval time when searching for a given keyword
Keyword based retrieval of articles. Retrieval time (in milliseconds) against set of 5449 articles
Keyword based retrieval of articles. Retrieval time (in milliseconds) against set of 9998 articles
Keyword retrieval time (in milliseconds) against set of 15597 articles
Keyword retrieval time (in milliseconds) against set of 20396 articles
Average retrieval time (in milliseconds) against given set of article sizes
