Abstract
Modern systems such as the Internet of Things, cloud computing, and sensor networks generate huge data archives. Extracting knowledge from these archives requires modified approaches to algorithm design. The field of study in which such huge data are analyzed is called big data analytics; it helps to optimize performance at reduced cost and to retrieve information efficiently. Traditional data analytics needs to be modified to suit big data because it may not be able to manage such large amounts of data. The real question is how to design data mining algorithms suitable for big data analysis. This paper first discusses data analytics at an introductory level and then gives insights into the analysis process for big data. Big data analytics is at the current research edge of the knowledge extraction field. This paper highlights the challenges and problems associated with big data analysis and provides insights into several of the techniques and methods used.
Introduction
In today’s information technology field, most of the data generated are digital and are exchanged over the Internet because of the fast growth of IT and its related technologies. Lyman and Varian [55] estimated that more than 92% of the new data generated in 2002 were already digital. However, creating data is usually easier than extracting insights from the gathered data. Analysis of large data is therefore not a recent problem; it has existed for many years. Although today’s processing elements and resources are far more powerful than those of the 1950s, large data sets still strain the systems that analyze them.
Several efficient methods such as density-based approaches, sampling, distributed computing, grid computing-based approaches, incremental learning, and data condensation [86] have been proposed in response to the problem of analyzing large data. These methods have improved the performance of large data analytics. Principal component analysis (PCA) [25] is one method for reducing the volume of input data, and it accelerates data analytics in the big data field. Another method is sampling, which is used to reduce the computation required for data clustering [44] and thereby shortens the computation time of data analytics [43,91]. Fisher et al. [32] pointed out that big data is a very large collection of data that cannot be processed centrally. Therefore, the data mining and analytics approaches traditionally used to extract insights from data may not be directly applicable.
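The two reduction ideas mentioned above, sampling and principal component analysis, can be illustrated with a minimal sketch. It uses only the Python standard library; the function names, the power-iteration shortcut for finding the first principal component, and the toy data are assumptions made for illustration and are not taken from the cited works.

```python
import random

def sample_rows(rows, fraction, seed=0):
    """Uniform random sample without replacement (reduces data volume)."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

def first_principal_component(rows, iters=100):
    """Return (column means, unit vector along the direction of largest variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):  # power iteration on the covariance matrix
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def project(rows, means, v):
    """Reduce every row to a single coordinate along v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

# Hypothetical toy data: 1,000 two-dimensional points on the line y = 2x.
data = [(float(t), 2.0 * t) for t in range(1000)]
subset = sample_rows(data, fraction=0.1)        # 1,000 rows -> 100 rows
means, v = first_principal_component(subset)    # 2 attributes -> 1 direction
reduced = project(subset, means, v)
```

Sampling cuts the number of rows, while the projection cuts the number of attributes; in practice a numerical library would replace the hand-written power iteration.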
The special characteristics of big data [49] are its volume, velocity, and variety, popularly known as the 3Vs and presented by Laney [50]. In the 3Vs, volume refers to the large data size, velocity to rapid data creation, and variety to data of multiple types from different sources. Because the typical 3Vs describe big data insufficiently, further studies have added Vs such as vocabulary, veracity, variability, validity, vagueness, and value to the definition of big data [15,79]. Despite the large size of big data, not all data are useful for analysis and for making important decisions. Therefore, academicians and researchers from the information processing industry are interested in sharing their findings in big data-related studies.
This paper focuses on the opportunities and challenges in big data and on the available analytics approaches. It starts with traditional data analysis and the presentation of its results. Open research issues in big data are stated for the IoT and cloud computing domains. The algorithms related to big data analysis are discussed in brief. The paper concludes with a summary of outcomes.
The remainder of the paper is arranged as follows: Section 2 details data analytics. Section 3 covers big data analytics, and Section 4 its challenges. Section 5 details the open research issues of big data analytics, and Section 6 its techniques. Section 7 concludes the paper.
Data analytics
Knowledge discovery, or data mining, is the procedure of extracting implicit, previously unknown, and valuable information from stored data. The study in [31] describes the knowledge discovery in databases (KDD) process using operations such as selection, preprocessing, transformation, and interpretation or evaluation. A system that performs data analytics can be built as shown in Fig. 1; it comprises data input, analysis, and output production.

Fig. 1. Knowledge discovery process in databases.
The input part includes basic operations such as collection, selection, preprocessing, and transformation, as indicated in Fig. 1. The data are collected from the Internet, social networks, and various web applications [13]. The selection operator picks the kind of data required for analysis, based on the relevant information gathered. The gathered data must be integrated into the target data because the analysis is performed on the resulting database. Preprocessing turns the input data into useful data; it comprises steps such as detecting, cleaning, and filtering unwanted and insignificant data. The selection and preprocessing operators do not make the resulting data directly usable; therefore, a transformation operation is required. Transformation converts data from different formats into a format suitable for data mining. It reduces the complexity and the scale of the data to a certain extent so that the data are useful for analysis [19,83]; it includes dimensionality reduction, sampling, and coding [34]. Overall, the preprocessing stage of data analysis [34] includes extraction, cleaning, integration, transformation, and reduction, carried out on raw, unprocessed data to extract useful data. These operators clean the raw data; otherwise, duplicates, inconsistencies, noise, and outliers would easily degrade the performance of the analysis. In summary, a systematic approach to the input operators reduces the complexity and improves the accuracy of the data analytics results.
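The input operators described above can be sketched as a small pipeline. The record fields (`user`, `age`, `income`), the plausibility rule for ages, and the choice of min-max scaling as the transformation are all hypothetical choices made for this illustration.

```python
def min_max(values):
    """Transformation: scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def preprocess(records):
    """Clean, filter, and transform raw records into numeric feature rows."""
    seen, cleaned = set(), []
    for rec in records:
        # Cleaning: drop incomplete records.
        if rec.get("age") is None or rec.get("income") is None:
            continue
        # Filtering: discard implausible values (a crude outlier rule).
        if not 0 <= rec["age"] <= 120:
            continue
        # Cleaning: drop exact duplicates.
        key = (rec["user"], rec["age"], rec["income"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    if not cleaned:
        return []
    ages = min_max([r["age"] for r in cleaned])
    incomes = min_max([r["income"] for r in cleaned])
    return list(zip(ages, incomes))

raw = [
    {"user": "a", "age": 30, "income": 40000},
    {"user": "a", "age": 30, "income": 40000},    # duplicate copy
    {"user": "b", "age": None, "income": 52000},  # incomplete
    {"user": "c", "age": 500, "income": 61000},   # implausible outlier
    {"user": "d", "age": 50, "income": 80000},
]
rows = preprocess(raw)
```

Only two of the five raw records survive cleaning, and both are rescaled so that a distance-based mining algorithm does not weight income more heavily than age.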
Data analysis
The term data mining is used to describe the data analysis step displayed in Fig. 1. It finds hidden information, patterns, or rules in the raw data and turns them into information or knowledge. Data analysis is not limited to data mining approaches [34]; other technologies such as statistics and machine learning have also been used for data analysis for many years.
Statistical analysis methods were used in the early stages to describe and understand the current situation so that further improvements could be made, for example in public opinion polling or entertainment program ratings.
Like statistical analysis, data mining methods were applied to extract the required information from gathered data. As data mining problems were formulated, various domain-specific approaches were introduced in later development stages. The Apriori algorithm [3], designed for association rule problems, is one example from this phase.
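As an illustration of the level-wise idea behind Apriori, the following minimal sketch finds frequent itemsets in a handful of transactions. It is a simplified teaching version, not the implementation from [3]; the function name, the use of an absolute support count, and the shopping-basket data are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
```

With a minimum support of three transactions, four single items and four pairs are frequent, while the only surviving triple candidate (bread, milk, diapers) is rejected because it occurs in just two baskets.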
The main drawback of data mining approaches is their high computation cost, even though most of their definitions are simple. Therefore, to improve the response time of data mining approaches, other techniques such as machine learning [85], artificial intelligence [5,14], metaheuristics [1], and distributed computing [17] were applied. These techniques were used individually or integrated with traditional data mining techniques to offer better solutions to data mining problems. One example is the clustering algorithm proposed by Krishna and Murty [46], which combines a genetic algorithm with k-means and performed better than k-means alone.
Most data mining approaches include input, output, initialization, data scan, rule construction, and rule update operators [78], as indicated in Algorithm 1.

Algorithm 1. Data mining technique
In Algorithm 1, D represents the raw data, d the data from the scan operator, r the rules, o the predefined size, and v the candidate rules.
The three operators scan, construct, and update are applied repeatedly until the termination condition is met. The scan operator is considered optional because when it is employed depends on the design of the data mining algorithm. Algorithm 1 describes most data mining algorithms: its operators can be applied to representative algorithms such as association rules, clustering, sequential patterns, and classification for extracting hidden information from raw data. This means that the performance of data analysis can be enhanced by modifying these operators.
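The scan, construct, and update loop described above can be sketched as a generic driver. The driver below is an assumption about how the operators fit together, based only on the description in the text; the instance plugged into it is a one-dimensional k-means, in which o is read as the number of clusters and r as the list of centroids. All function names are hypothetical.

```python
def mine(D, r, scan, construct, update, o, max_iters=100):
    """Repeat scan -> construct -> update until the rules r stop changing."""
    for _ in range(max_iters):
        d = scan(D)             # data seen in this pass
        v = construct(d, r, o)  # candidate rules
        new_r = update(v, r)    # refined rules
        if new_r == r:          # termination condition: no change
            break
        r = new_r
    return r

# A one-dimensional k-means expressed with the three operators.
def scan(D):
    return D  # full pass; a sampling scan could return a subset instead

def construct(d, r, o):
    clusters = [[] for _ in range(o)]
    for x in d:
        nearest = min(range(o), key=lambda i: abs(x - r[i]))
        clusters[nearest].append(x)
    return clusters

def update(v, r):
    # New centroid = mean of the cluster; keep the old one if the cluster is empty.
    return [sum(c) / len(c) if c else r[i] for i, c in enumerate(v)]

D = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids = mine(D, r=[0.0, 5.0], scan=scan, construct=construct,
                 update=update, o=2)
```

Swapping in a different `construct` or `update` (for example, candidate generation and support counting) would turn the same driver into an association rule miner, which is why tuning these operators is the natural place to improve performance.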
The two essential operators needed for the output are evaluation and interpretation. The evaluation operators measure the results of the data analysis. Data mining algorithms also use an evaluation operator to assess their results [46].
Solutions to data mining problems that group input data into clusters always consider cohesion and coupling as two major goals. Cohesion means keeping the distance between each datum and the centroid (mean) of the cluster to which it belongs as small as possible. In contrast, coupling means keeping the distance between data that belong to different clusters as large as possible.
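The sum of squared errors (SSE) is used to measure the cohesion of data mining results. In its standard form it is defined as
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2, \qquad c_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij},$$
where $k$ is the number of clusters, $n_i$ is the number of data in the $i$-th cluster, $x_{ij}$ is the $j$-th datum in the $i$-th cluster, and $c_i$ is the mean of the $i$-th cluster.
The Euclidean distance is one of the most familiar distance measures for data mining problems and is defined as
$$\lVert p - q \rVert = \Bigl( \sum_{l=1}^{m} (p_l - q_l)^2 \Bigr)^{1/2},$$
where $p$ and $q$ are two data points with $m$ dimensions each.
Another popular measurement is accuracy (ACC), which is represented as
$$\mathrm{ACC} = \frac{\text{number of cases correctly classified}}{\text{total number of test cases}}.$$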
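The three measures named in this passage can be computed in a few lines. The sketch below assumes the standard definitions (squared Euclidean distance to the cluster mean for SSE, fraction of matching labels for accuracy); the function names and sample values are illustrative only.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(euclidean(x, c) ** 2 for x in points)
    return total

def accuracy(predicted, actual):
    """Fraction of test cases whose predicted label matches the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
cohesion = sse(clusters)
acc = accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"])
```

A lower SSE indicates tighter, more cohesive clusters; accuracy applies only when true labels are available.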
The classification results are evaluated using precision (p), recall (r), and the F-measure. These measurements indicate how many data that do not belong to a particular group, say group A, are wrongly assigned to group A, and how many data that belong to group A are not assigned to it.
Table 1 is a confusion matrix that covers every condition of classification outcomes [54].
Table 1. Confusion matrix [14]
                     Classified positive    Classified negative
Actual positive      TP                     FN
Actual negative      FP                     TN
The numbers of correctly classified positive and negative examples are given by TP and TN, respectively; the numbers of incorrectly classified negative and positive examples are given by FP and FN, respectively. The meanings of precision, recall, and F-measure are easily described using the confusion matrix.
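The precision is represented by
$$p = \frac{TP}{TP + FP},$$
and the recall by
$$r = \frac{TP}{TP + FN}.$$
Once the precision and recall have been computed, the F-measure $F$ is estimated as
$$F = \frac{2pr}{p + r}.$$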
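A short sketch shows how the confusion-matrix counts lead to precision, recall, and the F-measure. It assumes the standard definitions of these measures; the function names, the label "A" for the positive group, and the sample predictions are hypothetical.

```python
def confusion_counts(predicted, actual, positive):
    """Count TP, FP, FN, TN with respect to one positive class."""
    tp = fp = fn = tn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1   # correctly assigned to the group
        elif p == positive:
            fp += 1   # wrongly assigned to the group
        elif a == positive:
            fn += 1   # belongs to the group but was missed
        else:
            tn += 1   # correctly kept out of the group
    return tp, fp, fn, tn

def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure); 0.0 where a ratio is undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

predicted = ["A", "A", "A", "B", "B", "B", "A", "B"]
actual    = ["A", "A", "B", "B", "B", "A", "A", "B"]
tp, fp, fn, tn = confusion_counts(predicted, actual, positive="A")
p, r, f = precision_recall_f(tp, fp, fn)
```

Here one item is wrongly placed in group A and one member of group A is missed, so precision, recall, and the F-measure all equal 0.75.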
Computation cost and response time are two additional common measurements for evaluating data mining results. Before two data mining algorithms are incorporated into a particular domain, they can be compared by how fast each obtains its final mining results when both find similar results.
The data to be considered today are huge and composed of various data types [67]. Data analysis approaches may have to change for big data because the data are heterogeneous, high-dimensional, unstructured, incomplete, erroneous, massive, and noisy [56,80]. It may seem that collecting more data provides more useful information, but this is not always true, because a large collection of data contains more ambiguity and abnormality. For example, one account may be operated by several users, which degrades the mining results [16]. In addition, analytics in the big data field raises several new issues, such as security, fault tolerance, privacy, storage, and data quality [41].
The big data created by many applications have three basic characteristics, namely volume, velocity, and variety. Therefore, big data analytics is re-examined from the volume, velocity, and variety perspectives [36,57].
From the volume perspective, the flood of input data must be considered because it may stall the analytics process. In wireless sensor network data analysis, for example, the bottleneck shifts from the sensors, where it lies in traditional data analytics, to the processing, communication, and storage of the sensed data [11]. Bottlenecks arise everywhere when the huge amount of data gathered by the sensors is uploaded to the upper-layer system. From the velocity perspective, big data analytics differs from the traditional approach because real-time streaming delivers a huge amount of data in a short period, and the device or system may not be able to handle this input. This is similar to a network flow analysis system, in which one cannot analyze everything that can be gathered. From the variety perspective, incomplete data and differing data types pose another problem for the input operators.
Big data input
Handling a huge amount of data is not a research issue new to big data; it was already encountered in earlier tasks such as weather forecasting, gene analysis, and network flow analysis [3,51,86]. Similar problems exist in big data analytics; therefore, data preprocessing is used to improve the ability of algorithms, platforms, and computers to handle the input. Traditional preprocessing methods such as feature selection, sampling, and compression can also be applied to big data [29,45,64]. However, because even advanced computer technology usually cannot process the input data as a whole on a single machine, several researchers study how to reduce the complexity of the input data. One suggested solution is to use domain-specific knowledge in the design of the preprocessing operator. An example is mobile web log analysis [89], in which a B-tree and divide-and-conquer are used to filter out unrelated log information.
As later studies noted [33], preprocessing massive logs, sensor data, or marketing data incurs very high computation costs. This motivated Dawelbeit and McCrindle to use a different method to partition the input data: bin packing on cloud systems. The cloud system preprocesses the raw data and then provides data in a uniform format to the algorithms or systems for further analysis.
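To make the idea of bin-packing-based partitioning concrete, the sketch below uses the classic first-fit-decreasing heuristic to group input files into partitions of bounded size. This is a generic heuristic chosen for illustration and is not claimed to be the scheme used by Dawelbeit and McCrindle; the file sizes and the 1,000 MB capacity are hypothetical.

```python
def first_fit_decreasing(sizes, capacity):
    """Pack item sizes into as few bins of the given capacity as the heuristic finds."""
    bins = []  # each entry: [remaining capacity, [sizes placed in this bin]]
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"item of size {size} exceeds bin capacity {capacity}")
        for b in bins:
            if b[0] >= size:        # first bin with enough room
                b[0] -= size
                b[1].append(size)
                break
        else:                       # no existing bin fits: open a new one
            bins.append([capacity - size, [size]])
    return [items for _, items in bins]

# Hypothetical input files (sizes in MB) to be spread over workers
# that can each preprocess at most 1,000 MB.
file_sizes_mb = [700, 300, 500, 200, 400, 100, 600]
partitions = first_fit_decreasing(file_sizes_mb, capacity=1000)
```

Each resulting partition can then be handed to one cloud worker, so that the preprocessing load is roughly balanced and no worker receives more than it can hold.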
Data reduction methods are needed to reduce the computational cost of data analytics significantly. Sampling and compression are the methods used to reduce the input data size. In addition, how many data instances to select for analysis remains an open research question because it affects the performance of sampling and compression on the input data [21,68].
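When the input arrives as a stream whose length is not known in advance, a fixed-size uniform sample can still be drawn in one pass. The sketch below uses reservoir sampling (Algorithm R), a standard technique named here as an illustration rather than one prescribed by the cited studies; the function name and parameters are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randint(0, i)      # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item    # replace with probability k / (i + 1)
    return reservoir

# The stream is consumed once and never held in memory as a whole.
sample = reservoir_sample(iter(range(10_000)), k=5, seed=1)
```

The sample size k is the number of instances selected for analysis, which is exactly the parameter whose choice the paragraph above identifies as an open question.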
Result of big data analysis
Presenting the results is one of the important tasks of the output element of big data analytics. The representation matters because the user must be able to understand the results easily; otherwise, they are useless. The visualization of analytics results is classified into four kinds: alerting, dashboards, exploration, and reporting [90]. A recent research problem in big data analytics is user interface design for cloud systems. This issue has a significant impact on big data analytics in two respects: how to explain the needed knowledge to users in simple ways, and how to make data analytics systems easier to use based on users’ opinions [12,77]. The results of big data analysis are depicted in Table 2.
Table 2. Analysis of big data based on clustering algorithms and classifiers
Big data analytics offers researchers a number of opportunities, with the benefit of mining significant knowledge from data. Many researchers have also observed that challenges come along with these opportunities. Handling them depends on factors such as information security, the methods used to perform computations, and computational complexity. Therefore, it is essential to know these factors when performing big data analytics. For example, many statistical methods used on normal-sized data may not be effective on large data. Similarly, several computational techniques used on small data cannot be applied directly to big data analytics. Here, four categories of challenges are discussed: storage and analysis, processing and knowledge discovery, data visualization, and information security.
Analysis and storage
Data size is growing exponentially because of current devices and technologies such as mobile devices and sensing technologies. Storing such large data is costly, and ultimately there may not be enough storage space for further analytics. Thus the first challenge in big data analytics concerns storage and input-output speed. Storage affects representation and the knowledge discovery process because different analytics tasks need easy access and prompt responses. The basic challenge is data size: present techniques may not be able to respond in sufficient time. Modifying existing algorithms so that they can be employed on such large data is always a daunting task. Developing new machine learning algorithms for analyzing big data has recently become one of the major challenges. Arranging huge datasets into clusters to help big data analysis is of prime concern and should be dealt with effectively [39]. Hadoop and MapReduce, developed particularly for large data processing tasks, make it possible to collect huge quantities of structured and semi-structured data. The key challenge is how to arrange these data effectively so that better knowledge discovery becomes possible. A standard practice in many organizations is to apply data mining algorithms after transforming unstructured and semi-structured data into a structured form. Das and Kumar [23] developed a structure for data analysis and also described data analysis for public tweets [22].
Processing and knowledge discovery
One of the prime concerns in big data studies is discovering and representing useful insights. The knowledge discovery process spans several domains, such as information retrieval, authentication, archiving, and preservation. Several tools, namely formal concept analysis [84], principal component analysis [40], fuzzy set [88], rough set [61], near set [62], etc., are used in knowledge discovery and representation. Data warehouses and data marts are the two most popular approaches for managing large data sets. Data taken from operational systems are accumulated in the data warehouse, and the data mart provides a facility for analysis based on the data from the data warehouse. The major concern here is the uncertainty and inconsistency of large databases, which add computational complexity. It is difficult to devise a comprehensive computational model that is generally appropriate for big data. This suggests using domain-specific knowledge for analyzing big data because the particular complexities can be handled in this way. Research and surveys in this direction use machine learning techniques with lower memory requirements and focus on reducing computational cost and complexity [4,18,71,74]. The greatest challenge in developing techniques and technologies for big data analysis is to minimize computational complexity, uncertainty, and inconsistency.
Data visualization
The visualization aspect of the analysis uses tools such as graphs, charts, and maps to represent information and data. The main objective is to present the data more adequately using graph theory techniques. Typical tools may not be the most suitable because of the large size of the generated data. Some companies use Tableau for big data visualization to make large data sets manageable. Visualization tools help transform large, complex data into intuitive pictures and have therefore become a flexible choice for analysis. Company staff use such tools to visualize current trends, customer feedback, and sentiment analysis and to suggest improvements to process implementations. Big data analytics poses several challenges to the processing capabilities of hardware and software systems. Consequently, many developments have taken place in cloud computing, parallel computing, and distributed computing to assist the data analytics process. Visualization and scalability have also evolved over time to facilitate analytics. Various issues can be addressed by bringing more mathematical models into computer science.
Security of information
A major concern in big data analysis is the potential security hazard [92] arising from correlating the huge amounts of data used for analysis and pattern matching. Organizations must therefore apply strategies to protect sensitive data, which makes information protection the most significant issue in big data analytics. Organizations use authentication, authorization, and encryption techniques to enhance data security. Security measures in big data applications must account for aspects such as network scale, device variety, and real-time security monitoring [37,58]. Big data security work focuses on developing a multi-level policy model for system security. Although many researchers have studied big data security, significant improvement is still required [37].
Big data analytics and open research issues
Current research in industry and academia has focused on data science and big data analytics. The aim of data science is to extract knowledge from data through extensive research on big data. Data science and big data are applied in several areas, including uncertainty modeling, uncertain data analysis, statistical learning, machine learning, signal processing, pattern recognition, data warehousing, and information science [63,65]. This section discusses open research concerns in big data analytics. The three broad research issues are the Internet of Things, cloud computing, and nature-inspired computing. However, the issues are not limited to these categories; for example, Husing Kuo et al. [31] analyzed big data problems associated with health care applications [2].
Big data analytics using IoT
Business processes, global interrelations, and numerous aspects of personal life have been restructured by the widespread use of the Internet by different categories of people. The Internet of Things lets machines control innumerable autonomous gadgets. Internet usage is thus increasing day by day because appliances are becoming Internet users alongside people with web browsers. Researchers have recently been motivated to take part in developments toward IoT-related opportunities. Because of the widespread use of IoT, the future of information and communication technologies will affect society and the economy.
Improvements in mobile devices, cloud computing, ubiquitous communication technologies, and data analytics drive the growth of IoT-related techniques. Moreover, IoT presents challenges that are compounded by the large amounts of data it generates.
Data management and knowledge discovery in large-scale automation applications improve when combined with technologies such as big data processing and computational intelligence. Chang et al. [26,59] have done much research in this direction and suggested how the technology can be applied for process improvement. This points to the development of infrastructure frameworks for IoT data analysis.
Researchers can propose tools to extract insights from the constant data flow produced by IoT devices, for example by using machine learning techniques to obtain meaningful information from IoT-generated data. One of the challenging issues in big data analytics is understanding the data generated by IoT devices and extracting insights from them.
A knowledge exploration system facilitates knowledge through several segments. There are four segments: knowledge acquisition, knowledge base, knowledge dissemination, and knowledge application. In the knowledge acquisition phase, various computational intelligence techniques are used to discover knowledge. The discovered knowledge is stored in the knowledge base, on which expert systems are built for further activities. Knowledge dissemination extracts significant information from the knowledge base. Finally, the discovered knowledge is put to use in various applications; this knowledge application is the final segment of the knowledge exploration process. Many issues and challenges open up research scope in knowledge exploration, but they are beyond the limited scope of this survey. Figure 2 shows knowledge discovery for IoT big data, and Fig. 3 depicts the knowledge exploration system.

Fig. 2. Knowledge discovery for IoT big data.

Fig. 3. System for knowledge exploration using IoT.
Big data analytics using cloud computing
High-performance computing infrastructure has become affordable through virtually accessible computing platforms built on various virtualization technologies. Virtualization software hides the computing infrastructure and lets it be used like a real computer. It provides the flexibility to specify the number of processing elements, the disk space, and the operating system. Cloud computing is the term used to describe the accessibility of these virtual computers. Developments in big data technologies and cloud computing make scalable, on-demand access to data and resources possible. Huge amounts of data and resources are made available on demand through the virtualization techniques used to configure computing resources. Many researchers have discussed the open issues and challenges of big data analytics and cloud computing in data management, data variety, data storage, data processing, and resource management [6,35,73]. Cloud computing thus helps in developing a variety of applications by providing infrastructure and tools. Many researchers have also noted that the cloud computing platform supports the development of applications that analyze data. Therefore, data scientists should be able to use the tools and techniques provided by cloud environments for knowledge exploration and further analytics. The major issues of big data analysis in the cloud environment relate to privacy because hosting data on public servers may introduce security and privacy risks. Addressing the issues discussed here will help big data analytics on cloud computing develop to a high level.
Big data analytics using nature-inspired computing
Computing techniques influenced by nature are known as nature-inspired computing. These techniques are developed to deal with difficult real-world problems. Bio-inspired techniques are based on biological systems that are self-organized and operate without any central control authority. Bio-inspired computing approaches have been developed for storing, processing, and retrieving information by computation in biological molecules such as DNA and proteins. An important feature of such computing is the integration of biologically derived materials to perform computational operations and obtain intelligent performance. These systems are appropriate because an enormous amount of data is produced and collected from different types of sources across the web. Data scientists need a lot of intelligence during analysis to categorize the huge amount of data into types such as image, text, and video. The optimization applications of bio-inspired computing help the knowledge discovery process on large datasets. One basic advantage of bio-inspired computing is its simplicity in solving service provision problems [81]. Cheng et al. [69] have discussed some applications of bio-inspired computing toward this end. According to these discussions, bio-inspired computing models help with smarter connections, handling ambiguities, and unavoidable data loss. It can thus be expected that bio-inspired computing will assist big data analytics to a large extent.
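As one concrete example of a nature-inspired optimizer, the sketch below implements a basic particle swarm optimization (PSO), in which a population of candidate solutions moves through the search space without any central controller. PSO is chosen here purely for illustration and is not singled out by the works cited above; the parameter values are commonly used defaults, and the sphere function is a hypothetical stand-in for a real objective such as a clustering error.

```python
import random

def pso(objective, dim, bounds, particles=20, iters=200, seed=0,
        w=0.7298, c1=1.49618, c2=1.49618):
    """Minimize objective over dim dimensions; return (best position, best value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]               # each particle's own best
    best_val = [objective(p) for p in pos]
    g = min(range(particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # best seen by the whole swarm
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best, value = pso(sphere, dim=2, bounds=(-10.0, 10.0))
```

Each particle follows only simple local rules, yet the swarm as a whole converges toward the minimum, which mirrors the self-organized behavior described in the paragraph above.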
Big data analysis techniques
Mining approaches are traditionally used to get insights from collections of data. They are also adopted for extracting information from large collections of data.
Mining algorithm categories for the specific problem
Fan and Bifet pointed out that the term big data mining first appeared in 1998 [24,30,82]. Big data mining is one of the main research issues in big data and refers to finding insights in a large set of data. It has also been observed that traditional data mining techniques can be used effectively for big data analytics, judged by computational cost, storage requirements, and the accuracy of the final result. This section briefly discusses the search and analysis algorithms used in the domains of big data analytics.
Clustering algorithms
Traditional clustering algorithms are limited in big data analysis because they depend on the data being arranged in the same format and loaded on the same machine. The uniqueness of big data has brought fresh challenges for data clustering. However, several solutions have been proposed for the problems of large size and high dimensionality [20,28,86]. One of the significant problems in big data clustering is reducing the complexity of the data. Shirkhorshidi et al. [70] divided big data clustering into two categories: single-machine clustering and multiple-machine clustering. Sampling and dimensionality reduction are adopted for single-machine clustering, whereas multiple-machine clustering is implemented with MapReduce and parallel approaches [72,76]. This indicates that traditional reduction approaches are still usable in the big data domain because they lower memory usage and complexity. Precisely, sampling reduces the amount of data provided to the analytics process, whereas dimensionality reduction downsizes the whole dataset by removing irrelevant attributes from it.
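The multiple-machine style of clustering can be illustrated by simulating one MapReduce-like k-means loop in a single process. Each partition stands for the data held by one machine; the mapper returns only per-centroid partial sums, and the reducer combines them into new centroids. The function names and toy partitions are assumptions, and a real deployment would run the mappers on a framework such as Hadoop rather than in a list comprehension.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

def map_partition(points, centroids):
    """Mapper: per-partition partial sums and counts, keyed by nearest centroid."""
    partial = {}
    for p in points:
        i = nearest(p, centroids)
        s, n = partial.get(i, ((0.0,) * len(p), 0))
        partial[i] = (tuple(a + b for a, b in zip(s, p)), n + 1)
    return partial

def reduce_partials(partials, centroids):
    """Reducer: merge partial sums from all partitions into new centroids."""
    totals = {}
    for partial in partials:
        for i, (s, n) in partial.items():
            ts, tn = totals.get(i, ((0.0,) * len(s), 0))
            totals[i] = (tuple(a + b for a, b in zip(ts, s)), tn + n)
    # A centroid that attracted no points keeps its previous position.
    return [tuple(x / totals[i][1] for x in totals[i][0]) if i in totals else c
            for i, c in enumerate(centroids)]

# Two partitions, as if stored on two different machines.
partitions = [
    [(1.0, 1.0), (9.0, 9.0)],
    [(1.0, 3.0), (11.0, 9.0)],
]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    partials = [map_partition(part, centroids) for part in partitions]
    centroids = reduce_partials(partials, centroids)
```

Only the small partial sums cross machine boundaries, never the raw points, which is what lets this formulation scale beyond the memory of a single machine.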
Classification algorithms
As with clustering, a number of classification algorithms from traditional data analysis have been adapted for big data mining. Either existing algorithms were modified to work in a parallel environment, or new parallel classification algorithms were developed.
The classification algorithm described in [8,75] gathers input data from distributed data sources and passes them to a heterogeneous set of learners for processing. Tekin et al. [9,75] proposed a classification technique named “classify or send for classification” (CoS). In this distributed scheme, each learner handles an input in one of two ways: it either classifies the input itself or forwards it to another learner, which does the labeling. In this way, the learners exchange information with each other. Because this cooperation improves the accuracy of big data classification, the approach is called a cooperative learning solution. Rebentrost et al. [66] introduced a quantum support vector machine for big data classification and reported reduced memory use and computing cost. They argue that the computing cost of their approach is O(log NM), where N is the dimension of the feature space and M is the number of training vectors. Quantum computing-based algorithms of this kind can therefore be expected to have a bright future once the hardware matures.
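The classify-or-send idea can be sketched as follows. This is only a simplified illustration and does not reproduce the decision rule of [9,75]: the Learner class, the one-dimensional threshold classifiers, and the fixed confidence margin are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Learner:
    """Hypothetical learner: a 1-D threshold classifier with a confidence margin."""
    name: str
    threshold: float
    margin: float

    def confident(self, x):
        # The learner trusts its own label only far enough from its threshold.
        return abs(x - self.threshold) >= self.margin

    def label(self, x):
        return int(x >= self.threshold)


def classify_or_send(x, local, peers):
    """Classify x locally when confident; otherwise send it to a peer.

    Returns the label and the name of the learner that produced it.
    """
    if local.confident(x):
        return local.label(x), local.name
    for peer in peers:
        if peer.confident(x):
            return peer.label(x), peer.name
    # No learner is confident: fall back to the local prediction.
    return local.label(x), local.name


a = Learner("A", threshold=0.5, margin=0.2)
b = Learner("B", threshold=0.45, margin=0.02)

for x in (0.9, 0.55, 0.46):
    print(x, classify_or_send(x, a, [b]))
```

In this sketch the input 0.9 is labeled by learner A itself, 0.55 is forwarded to learner B, and 0.46 falls back to A because neither learner is confident.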
Frequent pattern mining algorithms
Frequent pattern mining, which covers association rule mining and sequential pattern mining, has drawn the attention of several researchers working on large-scale data sets. Much of this work combines traditional pattern mining algorithms with cloud computing and parallel computing technologies. The MapReduce framework has been adopted for frequent pattern mining because of its improved performance, and its behavior has been studied thoroughly in the big data setting [52,53]. The studies in [38,47,87] suggest that applying MapReduce-based frequent pattern mining to cloud computing will become popular in the future. When compared with traditional frequent pattern mining, the MapReduce-based model for big data analysis has been found to perform far better.
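How frequent itemset counting maps onto the MapReduce model can be sketched in a few lines. The example below runs in a single process and only imitates the map, shuffle, and reduce phases; it is not the algorithm of any cited work, and the function names, the toy transactions, and the minimum support of 2 are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, max_size=2):
    """Map: emit (itemset, 1) for every itemset of up to max_size items."""
    items = sorted(set(transaction))
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            yield itemset, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key (the itemset)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, min_support):
    """Reduce: sum the counts per itemset and keep the frequent ones."""
    counts = {itemset: sum(values) for itemset, values in groups.items()}
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical transaction database.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

# In a real deployment each mapper would process its own partition of the data.
pairs = [pair for t in transactions for pair in map_phase(t)]
frequent = reduce_phase(shuffle(pairs), min_support=2)

for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

Because each transaction is mapped independently and each itemset is reduced independently, both phases can be spread over many machines, which is what makes the model attractive for large transaction databases.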
Big data mining using machine learning
Several studies have discussed the potential of machine learning and its use in data analytics [60,85]. Typically, machine learning algorithms are employed as the search algorithm for the required solution. Because of that, they can be applied to many different mining and analysis problems, unlike algorithms designed to perform the data mining step for one specific problem. A data analysis problem can be solved with machine learning algorithms if it can be formulated as an optimization problem, because many machine learning algorithms can find approximate solutions to optimization problems. For example, the genetic algorithm can be applied to the frequent pattern mining problem [42] and the clustering problem [46]. Furthermore, machine learning approaches can address various mining problems in the data analysis step of KDD and have the potential to improve other parts of the KDD process, such as feature reduction for the input operators [27,51].
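To make the optimization view concrete, the following sketch treats clustering as the problem of minimizing the sum of squared errors (SSE) and searches for centroids with a small genetic algorithm. It is a minimal illustration and not the method of [46]: the one-dimensional data, the population size, the number of generations, the one-point crossover, and the Gaussian mutation are all assumptions.

```python
import random


def sse(centroids, points):
    """Objective: sum of squared distances from each point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


def genetic_clustering(points, k, pop_size=20, generations=50, seed=0):
    """Search for k centroids that minimize the SSE; returns (best, history)."""
    rng = random.Random(seed)
    lo, hi = min(points), max(points)
    # Each chromosome is a list of k candidate centroids.
    population = [[rng.uniform(lo, hi) for _ in range(k)] for _ in range(pop_size)]
    history = []  # best SSE at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda c: sse(c, points))
        history.append(sse(population[0], points))
        # Selection: the fitter half survives unchanged (elitism).
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # One-point crossover between the two parents.
            cut = rng.randint(1, k - 1) if k > 1 else 0
            child = a[:cut] + b[cut:]
            # Mutation: perturb one centroid with Gaussian noise.
            i = rng.randrange(k)
            child[i] += rng.gauss(0, 0.1 * (hi - lo))
            children.append(child)
        population = parents + children
    best = min(population, key=lambda c: sse(c, points))
    return best, history


# Hypothetical 1-D data with two visible groups.
points = [0.0, 0.2, 0.4, 9.8, 10.0, 10.2]
best, history = genetic_clustering(points, k=2)

print("centroids:", sorted(best))
print("SSE at first generation:", history[0], "final SSE:", sse(best, points))
```

Because the fitter half of the population is carried over unchanged, the best SSE never gets worse from one generation to the next. The same search loop could be reused for another mining problem by swapping in a different chromosome encoding and objective function.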
Conclusion
This paper surveyed studies of data analysis, from traditional data analytics to modern big data analysis. The studies were examined in terms of the KDD process, whose three parts, input, analysis, and output, were discussed. The paper gives a short overview of data mining and big data analysis techniques such as data clustering, classification, and frequent pattern mining, and describes the research concerns and challenges of big data analysis. Big data analysis draws on different approaches, including statistical analysis, data mining, cloud computing, and machine learning. In the near future, researchers are expected to study many such approaches and to propose solutions to the most pressing problems of big data analytics. In addition, the variety of devices used in recent distributed systems will contribute to gathering large data sets, from which more insights will be extracted with more advanced algorithms and techniques.
