The Scope of Data Mining
Data mining is the process of discovering patterns in large datasets in order to extract usable information from raw data. It is an efficient way to analyze and categorize hidden patterns in data from the perspective of different applications. Data mining also involves supporting steps for processing the extracted data, such as data cleaning, data integration, data transformation, pattern evaluation, and data presentation. Once these steps are complete, the extracted information is used in fraud detection, data analysis, and similar tasks.
The Scope of Data Mining
- Data mining can optimize a huge dataset within a short time.
- It presents the data in logical order from different perspectives.
- It uses tree-shaped structures to make the hierarchy of the data understandable.
- It is used to derive a generic classification of various sets of data items.
Data mining functionalities fall into two primary categories: descriptive and predictive.
Descriptive
Descriptive functions, such as clustering, identify groups of items that share similar characteristics.
Predictive
Predictive functions, such as classification, predict class attributes on the basis of models and rules.
Designing and developing applications of data mining algorithms requires powerful tools. Different data mining tools are used to build application programs for software and hardware platforms. Raw data from the digital and physical world can be collected from different sources through various digital tools, including:
- R language
- RapidMiner (formerly YALE)
- WEKA
- Python-based Orange and NLTK
- KNIME
- Sisense
- DataMart
- Oracle Data Mining
- Apache Mahout
- SSDT (SQL Server Data Tools)
- Rattle
- IBM Cognos
- Teradata
- Dundas BI
Many techniques are used to mine data from different platforms and applications, and a number of them are used to work with datasets in various environments.
Data mining techniques are an important factor in projects designed to explore data, and they have to be chosen based on the type of design and development.
The most commonly used techniques in data mining are:
- Statistics
- Classification
- Association
- Outlier detection
- Clustering
- Regression
- Prediction
- Cluster analysis
- Anomaly detection
- Intrusion detection
- Decision trees
- Neural networks
The main benefit of the data mining process is that it discovers records of information and summarizes them in a simpler format for others to use. Data mining plays a vital role in collecting, processing, storing, and analyzing data in order to extract information from raw data across various platforms. It is used to create accurate models for databases, it helps to identify patterns in data and discover all sorts of information, and it improves the efficiency of the decision-making process.
Several major data mining techniques have been developed and are used in data mining projects, including association, decision trees, classification, sequential patterns, prediction, and clustering, which are among the most popular. Such techniques are tied to technological devices and are also known as leading-edge or state-of-the-art technology.
Descriptive
The technology refers to the point at which there is a gap in knowledge.
Bleeding edge technology
It is a technology that carries a high risk of being unreliable, so adopting it involves a degree of risk. An example is electronic mail (email).
Lack of concurrence
This leads to rapid change, which is in the nature of such technology: competing ways of creating new things exist.
Lack of testing
The technology is unreliable or simply untested.
It can also be a successful technology, used to establish a comparative advantage. Much bleeding-edge computer software is open source software.
Another kind of cutting-edge technology is state-of-the-art technology, a term sometimes used interchangeably with cutting edge. It denotes the highest level of general development that a scientific field has achieved at a particular time.
NFC technology
Near field communication (NFC) is the technology used in the Google billboard. It is built into digital billboards in order to engage and encourage customers.
Geofencing
Geofencing gives marketers a bargaining area: a way to host and mingle with consumers by delivering real-time content in a specific location.
Facebook hangers
It is one of the social networking applications. It avoids hacking of messages while they are being transmitted.
Data mining is mostly used in the following domains:
- Risk management and corporate analysis.
- Fraud detection.
- Market Analysis and Management.
Risk management and corporate analysis
Data mining is used in the following areas of the corporate sector:
- Competition
- Asset Evaluation and Finance planning
- Resource planning
Fraud detection
In telephone fraud, data mining helps to identify fraudulent calls from the duration of the call, the destination of the call, and the time of day.
Market Analysis and Management
Data mining is used in the following fields of market:
- Target marketing
- Customer profiling
- Providing summary information
- Cross market analysis
- Determining customer purchasing pattern
- Identifying customer requirements
Weka Analysis:
Weka plays an important role in data mining. Data mining is the technique used to extract information from large datasets, and it is used in many real-world applications such as fraud detection, production control, market analysis, and customer retention. Weka was developed at the University of Waikato in New Zealand. It is a collection of machine learning algorithms for performing multiple data mining tasks, and these algorithms can be applied directly to a dataset. Weka supports the following techniques:
- Classification
- Association rules
- Preprocessing
- Clustering
- Regression
Visualization tools are also included in Weka. It is open source software released under the GNU General Public License (GPL).
Decision Tree Algorithm:
Predictive
It is one of the best techniques in data mining and gives practical results for a given dataset. A decision tree has a root node at the top of the tree; the nodes below the root are child nodes, and the connections between nodes form branches. Every internal node tests an attribute, every branch represents an outcome of that test, and every leaf node holds a class label.
Algorithm:
- Data partition D, which is a set of training tuples and their associated class labels.
- attribute_list, which specifies the candidate attributes to test at the nodes.
- A splitting criterion (attribute_selection_method), used to split the data partition D and grow the tree.
- The final output is a decision tree.
Method:
Create a node N;
- If the tuples in D are all of the same class C, then return N as a leaf node labeled with class C;
- If attribute_list is empty, then return N as a leaf node labeled with the majority class in D;
- Apply attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
- Label node N with splitting_criterion;
- For each outcome j of splitting_criterion, let Dj be the set of tuples in D that satisfy outcome j:
  - If Dj is empty, then attach a leaf labeled with the majority class in D to node N;
  - Else attach the node returned by generate_decision_tree(Dj, attribute_list) to node N;
- End for;
- Return N;
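The recursive method above can be sketched in Python. This is a minimal illustration and not the article's or Weka's implementation. It assumes categorical attributes and uses information gain as the attribute selection method, since the article does not name one. It also assumes that an attribute is removed from the list once it has been used, and it takes a hypothetical `domains` argument that lists the possible values of each attribute. The weather-style toy data is invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def majority_class(rows):
    """Most common class label among (features, label) pairs."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def select_attribute(rows, attributes):
    """Assumed selection method: the attribute with the highest information gain."""
    base = entropy([label for _, label in rows])

    def gain(attribute):
        partitions = {}
        for features, label in rows:
            partitions.setdefault(features[attribute], []).append(label)
        remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
        return base - remainder

    return max(attributes, key=gain)


def generate_decision_tree(rows, attributes, domains):
    """Return a class label (leaf) or {'attribute': name, 'branches': {value: subtree}}."""
    labels = {label for _, label in rows}
    if len(labels) == 1:                 # all tuples in D belong to one class
        return labels.pop()
    if not attributes:                   # attribute list is empty
        return majority_class(rows)
    best = select_attribute(rows, attributes)
    remaining = [a for a in attributes if a != best]   # assumption: use each attribute once
    node = {"attribute": best, "branches": {}}
    for value in domains[best]:          # one branch per outcome j
        subset = [(features, label) for features, label in rows if features[best] == value]
        if not subset:                   # Dj is empty: leaf with the majority class of D
            node["branches"][value] = majority_class(rows)
        else:
            node["branches"][value] = generate_decision_tree(subset, remaining, domains)
    return node


def classify(tree, features):
    """Follow branches from the root until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][features[tree["attribute"]]]
    return tree


# Invented toy data: (features, class label)
TRAIN = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
]
DOMAINS = {"outlook": ["sunny", "rainy", "overcast"], "windy": ["no", "yes"]}

tree = generate_decision_tree(TRAIN, ["outlook", "windy"], DOMAINS)
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

The value "overcast" never occurs in the training data, so its branch exercises the "Dj is empty" case and becomes a leaf carrying the majority class of the parent partition.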
K-Nearest Neighbor
K-nearest neighbor (k-NN) is a non-parametric method used for classification and regression. The output depends on whether it is used for classification or regression:
- In k-NN classification, the object is classified by a majority vote of its neighbors, and the output is a class membership.
- In k-NN regression, the output is a property value for the object, computed as the average of the values of its k nearest neighbors.
Algorithm:
- A given case is classified by a vote of its neighbors and assigned to the class most common among its k nearest neighbors; if k = 1, the case is simply assigned to the class of its nearest neighbor.
- Cases with similar values lie next to their nearest neighbors in the same class.
- The distance between cases is calculated so that the neighbors are ranked in the correct order before voting.
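The voting and averaging steps can be illustrated with a short Python sketch. It is a simplified example under stated assumptions: the article does not name a distance function, so Euclidean distance is assumed here, ties in the vote go to whichever class was encountered first, and the sample points are invented.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def k_nearest(train, query, k):
    """Return the k (point, target) pairs closest to query."""
    return sorted(train, key=lambda pair: dist(pair[0], query))[:k]


def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    votes = Counter(target for _, target in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)


# Invented sample data
labelled = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"),
]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labelled, (1.1, 1.0), k=3))
print(knn_classify(labelled, (5.1, 5.0), k=1))
print(knn_regress(valued, (2.0,), k=3))
```

With k = 1 the classifier reduces to copying the class of the single nearest neighbour, which matches the special case described in the algorithm.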
Naive Bayes classifier:
The naive Bayes classifier is widely used in machine learning and produces its result by applying Bayes' theorem. It is a probabilistic classifier and is highly scalable. It classifies the given data by predicting, for each case, how probable each class is, and it returns the most probable class as the result. In machine learning systems the naive Bayes technique therefore plays a vital role in obtaining reliable predictions from datasets.
All such systems use the Bayesian equation to obtain the most probable result. Applying Bayes' theorem involves the following steps:
- Convert the dataset into a frequency table.
- Create a likelihood table by computing the probabilities.
- Finally, use the naive Bayes equation to calculate the probability of every class for each case. The class with the highest probability is the outcome of the prediction.
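The three steps can be sketched in Python for categorical attributes. This is an illustrative example and not the article's or Weka's implementation. The add-one (Laplace) smoothing controlled by `alpha` is an assumption that the article does not mention; it keeps a value that was never seen with a class from forcing the probability to zero. The toy data is invented.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows):
    """Step 1: build frequency tables from (features, label) pairs."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)   # (attribute, label) -> Counter of values
    values_seen = defaultdict(set)        # attribute -> distinct values observed
    for features, label in rows:
        for attribute, value in features.items():
            value_counts[(attribute, label)][value] += 1
            values_seen[attribute].add(value)
    return class_counts, value_counts, values_seen


def posteriors(model, features, alpha=1.0):
    """Steps 2 and 3: turn counts into likelihoods and apply Bayes' theorem."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total                          # prior P(class)
        for attribute, value in features.items():
            count = value_counts[(attribute, label)][value]
            # likelihood P(value | class) with add-alpha smoothing (an assumption)
            score *= (count + alpha) / (n_label + alpha * len(values_seen[attribute]))
        scores[label] = score
    evidence = sum(scores.values())                      # normalising constant
    return {label: score / evidence for label, score in scores.items()}


def predict(model, features):
    """Return the class with the highest posterior probability."""
    probs = posteriors(model, features)
    return max(probs, key=probs.get)


# Invented toy data: (features, class label)
ROWS = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]

model = train_naive_bayes(ROWS)
print(posteriors(model, {"outlook": "sunny", "windy": "no"}))
print(predict(model, {"outlook": "rainy", "windy": "yes"}))
```

The "naive" part is the multiplication inside the loop: the likelihoods of the individual attributes are multiplied as if the attributes were independent given the class.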
Performance of the Three Algorithms and Their Types:
- Decision Tree.
- K-nearest neighbour Algorithm.
- Naive Bayes Algorithm.
Decision Tree Algorithm Performance:
The decision tree is an important algorithm in data mining and the easiest one to compare with other algorithms. It is a supervised learning algorithm that is easy to understand and use. The important steps of the decision tree are the following:
- Training on the data.
- Predicting with the model.
- Combining the results in the form of a tree structure.
The decision tree is a classification algorithm. Its main aim is to classify the data with the smallest possible tree structure. Our project uses the soybean.arff dataset. The analysis of this dataset in the Weka tool is described below:
The Weka tool is widely used to analyze different types of classification algorithms. It provides a process for choosing the type of algorithm, after which we predict the class of the data.
After selecting the Explorer, choose the process and select the dataset from the Weka tool resources. The dataset contains a number of instances and labels.
The Weka tool ships with a large number of datasets. The above figure uses the soybean.arff dataset, which is imported in order to analyze the data with the classification algorithms.
The above figure shows the overall classification of the soybean.arff dataset, including the number of instances and the number of labels present. The Weka tool briefly reports the number of instances and the number of attributes, and it offers a large number of classification algorithms for the analysis.
The above figure uses the decision tree algorithm, which is the reference point for the other algorithms. J48, Weka's implementation of the C4.5 algorithm, is the decision tree algorithm used here.
The above figure shows the classification in detail, including the number of instances and the number of attributes. The J48 algorithm splits the data easily and is easy to understand. In the Weka tool, the test options control how the dataset is used for evaluation; here cross-validation is used to generate new training and test partitions from the dataset.
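The idea behind the cross-validation test option can be sketched in Python. This is a generic illustration and not Weka's code: the fold assignment is a plain round-robin split without stratification, and `train_majority` and `predict_majority` are hypothetical stand-ins for a real classifier, chosen only to keep the example self-contained. The data is invented.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds (round-robin); requires k <= n_rows."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def cross_validate(rows, k, train_fn, predict_fn):
    """Mean accuracy over k folds; each fold is held out once as the test set."""
    accuracies = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [row for i, row in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, features) == label for features, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


def train_majority(rows):
    """Hypothetical baseline 'classifier': remember the most common class."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def predict_majority(model, features):
    """Always predict the remembered class, ignoring the features."""
    return model


# Invented data: 20 rows, every fourth row is class "b", the rest are class "a"
DATA = [((i,), "a" if i % 4 else "b") for i in range(20)]

score = cross_validate(DATA, 5, train_majority, predict_majority)
print(score)
```

Any pair of functions with the same shape, for example a decision tree trainer and its classify function, could be passed in place of the baseline, which makes it possible to compare several algorithms on identical folds.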
Naive Bayes is one of the classification algorithms. It is based on Bayes' theorem, a theorem of probability, and is used to predict models and class labels.
The KNN algorithm works from past data with known, correct output values and uses them to predict the output for unknown data.
Conclusion:
For the applied dataset, among naive Bayes, k-nearest neighbor, and the decision tree, the decision tree provided the expected and best result. In the result analysis, the best case was found using the tree structure and the visualization curve. The decision tree is also easy to understand and simple to use.