Definition and Comparison of Text Mining and Data Mining
Text mining is the process of analyzing text to extract information that is useful for a specific purpose (Han, et al., 2011). It answers specific research questions by filtering large bodies of text and extracting the information one needs. Text mining is not just a search tool: it identifies and maps patterns and trends and provides detailed information (Hastie, et al., 2009). The process distills actionable insights from text by reducing information, aiding navigation, and drawing out important features. Text mining entails defining a problem statement, moving the data from an unorganized state to an organized state, and finally reaching an insight.
Data mining uses algorithms to analyze data and extract useful information (Masys & R, 2001). For a long time, business analysts have used data mining to extract specific knowledge from structured data sets. Business data is evolving, and almost all of it is now unstructured, in the form of text; this has led to the need for text mining. The differences between text mining and data mining include:
- Nature of Data
While data mining involves the analysis of homogeneous, universal figures, text mining deals with heterogeneous documents, for example, emails and social media posts (Jenssen & Astrid, 2001). Data mining analyzes structured data, while text mining analyzes unstructured data.
- Deployment Time
In data mining, solutions are easy to implement once the algorithm is defined. Data mining focuses on activities that depend on structured data, for example, accounting, supply chain, and purchasing.
The heterogeneous nature of data in text mining increases complexity, and thus the deployment period (Witten Ian H & Frank Eibe, 2011). The data has to go through several analysis stages, and a company must build a taxonomy before starting a text mining project, which can take considerable time.
- Technology Perception
Data mining has been used for decades and is considered easier and more robust. Text mining has long been thought of as complex and domain- and language-specific, and in the past it was not highly valued. The increased use of social media and companies’ concern for their reputation have since created a need for sentiment analysis.
Technologies used in Text Mining
The technologies used in text mining include keyword-based technologies, statistics-based technologies, and linguistics-based technologies. In keyword-based technologies, the input is a selection of keywords in the text, filtered as a series of character strings. In statistics-based technologies, the systems are based on machine learning (Hastie, et al., 2009). Linguistics-based technologies apply language analysis, such as parsing, to the text itself.
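A minimal sketch of the keyword-based approach described above, assuming simple whitespace tokenization (the function name and documents are hypothetical, not from a specific tool):

```python
# Illustrative keyword-based text mining: documents are filtered by
# matching a chosen set of keywords against each text's tokens.
def filter_by_keywords(documents, keywords):
    """Return the documents that contain at least one of the keywords."""
    keywords = {k.lower() for k in keywords}
    matches = []
    for doc in documents:
        # Crude tokenization: split on whitespace and strip punctuation.
        tokens = {t.strip(".,!?").lower() for t in doc.split()}
        if tokens & keywords:  # any keyword present in this document?
            matches.append(doc)
    return matches

docs = [
    "Customer complaint about late delivery.",
    "Quarterly revenue report for review.",
    "Another complaint regarding billing errors.",
]
print(filter_by_keywords(docs, ["complaint"]))  # both complaint documents
```

Real keyword-based systems add stemming, stop-word removal, and weighting, but the filtering idea is the same.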
Applications of Text Mining
Some of the text mining applications include:
- Knowledge Management Software
Organizations use text mining to manage unstructured information by identifying relationships and connections in text.
- Customer Intelligence Software
Companies are able to understand the needs and opinions of customers from text, for example, through social media (Lovell Michael C, 1983). This is useful in improving customer support.
- Entity Extraction Software
Entity extraction filters the most important information from text. Other applications use entity extraction results to boost business intelligence in an organization.
Question 2
Why artificial intelligence is important in supporting businesses to build smart systems:
Artificial intelligence tries to solve problems more easily than human beings do (S. B. Kotsiantis, 2007). AI processes and analyzes big data faster and more efficiently, and it has better decision-making capability than traditional software. Artificial intelligence improves customer management systems by making them self-updating and auto-correcting, reducing the human effort needed to keep them accurate; non-AI software such as Zoho still depends on humans to remain accurate.
Artificial intelligence makes business applications such as e-commerce platforms personalized for consumers (Ford Martin & Colvin Geoff, 2018). It is able to predict what customers are likely to purchase based on their opinions and profiles.
AI gives recommendations on pricing that will return a favorable profit margin. ‘Price Tips’ is an AI business application that performs this function. Uber’s route-based pricing predicts how much customers are ready to pay depending on their destination and the time of day they are riding.
In addition to this, business applications developed using artificial intelligence can prevent fraud on business platforms. Such systems identify an irregularity and act upon it.
How artificial intelligence helps to transform companies:
AI has the benefit of reducing processing time. It makes business processes much faster and allows the use of much bigger data sets than those of competitors (M. Kuhn & K. Johnson, 2013). AI eliminates repetitive work and allows you to focus on important things like strategy and team dynamics. Walmart uses HANA, an AI business system, to process transaction records across its more than 11,000 stores, making operations faster.
AI-supported business systems enable a business to make predictions and correlations. Amazon uses AI to create its recommendation engine. Siemens uses AI to predict downtimes by extracting data from hard drives; this saves them up to $512 million per hour and increases the uptime of machinery. Otto predicts what customers will buy a week before they order. Their AI system predicts with about 96% accuracy the purchases that customers will make in the next 30 days.
AI is useful in identifying variances in production. For example, HANA can monitor data to detect a slowdown in production. HANA stores replicated data in RAM, so data is available in real time, which facilitates faster solution implementation.
Businesses are able to make faster and better decisions using AI-based business applications (M. Kuhn & K. Johnson, 2013). DOMO is an AI system that assists companies in decision making by gathering specific information. It extracts data from applications such as Facebook and Shopify to get information on customers, sales, or product inventory. Such information is useful in generating reports and identifying real-time trends. Aptus, an AI application in sales enablement, gives companies suggestions on how to improve sales. The software can predict products that may appeal to customers by combining big data and machine learning.
AI helps to manage resources in a company. MindSphere is an AI system used by Siemens to monitor machinery and keep track of their performance. This helps to control use of equipment and reduce downtime.
Limitations of AI:
AI is costly. It is expensive to create smart technologies because of their complex structure. AI systems are also expensive to maintain.
AI may cause loss of control over decision-making and strategy in business (Ford Martin & Colvin Geoff, 2018). Many AI-based applications are able to make decisions and give suggestions to businesses.
Development of AI systems might lead to loss of jobs (Ford Martin & Colvin Geoff, 2018). In the future, jobs may become more difficult to get due to the increased automation of business activities.
AI systems such as autonomous weapons that can potentially cause damage raise security concerns (Quinlan J. R., 1993).
How the Business Performance is Illustrated:
The diagram represents a project portfolio dashboard showing the projects that a company has worked on. The visual objects on the dashboard are: a Gantt chart, a simple bar graph, a composite bar graph, and a pie chart. The Gantt chart outlines the duration the company took to complete each project; the start and end dates of each project's execution are indicated.
The bar graph specifies the number of days each project took to complete. Project C had the longest execution time, over 250 days; Project H took the least time, under 50 days. The composite bar graph indicates the specific amount of money allocated to each project.
The pie chart represents the resource allocation for each project. A viewer can use this information to account for the resources that each project used during its lifetime. This also provides a basis of comparison, to understand which projects used more resources, for future planning.
Critique of the dashboard design:
Introducing unnecessary variety of presentation techniques with the aim of diversifying the data makes viewing more difficult (Stephen Few, 2006). The designer should have used two simple bar graphs instead of both a composite bar graph and a simple bar graph. Using the same presentation technique allows the viewer to interpret data with a consistent perceptual strategy, saving time and energy.
Poorly designed display mechanisms (Stephen Few, 2006). The composite bar graph that represents the allocation of finances to the different projects is not clear. A viewer cannot distinguish the various projects or the specific amount of money each one used during its execution.
Use of inappropriate display media (Eckerson, 2010). The designer has used a pie chart where a better option exists, such as a table of numbers with clearly visible labels. Some sections of the pie chart are almost identical in size, which makes the data harder to interpret.
Misuse of color (Eckerson, 2010). A designer needs to choose colors with the differing perceptions and meanings of colors in mind. The many bright colors on the dashboard create a distraction, making it harder for the viewer to see important data and make comparisons. There is no need for a different color for each bar when the axis is already labelled; in the Gantt chart, the vertical axis clearly names each project, so the colors add no meaning. Colors should be cool and neutral, used only when there is a need to highlight important information.
Arranging data poorly (Jenssen & Astrid, 2001). For comparison purposes, designers need to arrange data clearly and efficiently. The pie chart is inefficient for comparison; viewers find it difficult to interpret the data because the sections are so similar in size.
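As an illustration of this last point, even a plain sorted text display makes near-equal allocations easier to compare than a pie chart. The project figures below are hypothetical, chosen only to show the technique:

```python
# Hypothetical resource allocation per project (arbitrary units).
allocation = {"Project A": 120, "Project B": 115, "Project C": 260,
              "Project D": 95, "Project H": 40}

# Sort descending so the ranking is visible at a glance, unlike in a
# pie chart where similar-sized slices are hard to compare.
for name, value in sorted(allocation.items(),
                          key=lambda kv: kv[1], reverse=True):
    bar = "#" * (value // 10)  # one '#' per 10 units of resource
    print(f"{name:<10} {bar} {value}")
```

Sorting by value is the key design choice: the viewer's eye follows a single monotone ranking instead of hunting for the larger of two similar slices.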
Question 3
ANALYSIS OF BANK DATA USING WEKA ENVIRONMENT (WAIKATO ENVIRONMENT FOR KNOWLEDGE ANALYSIS)
Different systems have different methods of analyzing data collected from the environment. Machine learning has come to dominate the analysis of such datasets (Witten Ian H & Frank Eibe, 2011). It learns trends and patterns in the structured data fed into the relevant machine learning technology, and it is used in both text mining and data mining (an interdisciplinary subfield of computer science) to extract important information from what, at a glance, is a mass of meaningless data. This work involves the analysis, classification, and presentation of the bank data provided.
The analysis type used in this exercise is the J48 decision tree in the Weka environment. J48 is a Java implementation of the C4.5 algorithm, the immediate successor of ID3. The Weka environment allows the analyzed data to be visualized as a decision tree, which is easy to understand and follow through to the last attribute after the best splits.
A decision tree can be defined as a recursive partition of the instance space whose nodes form a rooted tree (Coussement Kristof, et al., 2008). The tree is directed, and its root has no incoming edges; every other node has exactly one incoming edge. Nodes with outgoing edges are referred to as decision nodes. Each internal node divides the instance space into two or more smaller subspaces, depending on the discrete function of the attribute values used at that node.
The J48 algorithm is quite basic, though it may look a little complicated to beginners because of the Java implementation. It works with simple, easily understood principles, in a set sequence of steps. It first analyzes the dataset fed into it according to the instructions it is configured with, aiming for the highest possible accuracy on the data under analysis (Fayyad, et al., 1996). The algorithm then selects the attribute and value that provide the best split, along which to continue the analysis and extract further information. It creates child nodes from the best split, arrived at by evaluating entropy and information gain (Fayyad, et al., 1996). After this, the algorithm attempts to divide the child nodes further until a stopping criterion is reached: the decision tree becoming too big, the entropy and information gain evaluation yielding the same result (meaning the data is of the same class), or the amount of information produced being too small to be significant (Han, et al., 2011).
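The steps just described can be sketched in Python for categorical attributes. This is a simplified illustration of the recursive split-and-stop procedure, not Weka's actual J48, which also handles numeric thresholds, missing values, and pruning:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction from partitioning rows by attr's values."""
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs, min_gain=1e-9):
    # Stopping criteria: the node is pure, attributes are exhausted,
    # or no attribute yields a meaningful information gain.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    if info_gain(rows, labels, best) <= min_gain:
        return Counter(labels).most_common(1)[0][0]
    # Recurse: one child subtree per observed value of the best attribute.
    tree = {}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[value] = build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx],
                                 [a for a in attrs if a != best], min_gain)
    return {best: tree}
```

Leaves hold the majority class of the instances that reach them, which is exactly what the YES/NO leaf labels in the Weka output report.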
When building a J48 decision tree, it is difficult to decide on the best attribute split so as to derive further relations under the dataset's initial attributes (Quinlan J. R., 1993). Getting the best split is important, as it makes the tree more accurate and gives a conclusive result for the data mining exercise on the dataset provided. The solution to this problem is entropy and information gain evaluation. An attribute is first evaluated for the amount of information it can provide; entropy evaluation then measures the uncertainty that remains given that attribute. The attribute with the lowest entropy score, and therefore the highest information gain, is picked as the initial attribute, and all other attributes are evaluated under it using the same entropy and information gain evaluation, until the stopping criterion already described is reached.
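As a worked example of this evaluation, the following sketch computes information gain for two candidate attributes on a toy dataset; the attribute names and records are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical mini-dataset: does a customer respond YES or NO?
data = [
    {"married": "YES", "mortgage": "NO",  "class": "NO"},
    {"married": "YES", "mortgage": "NO",  "class": "NO"},
    {"married": "NO",  "mortgage": "YES", "class": "YES"},
    {"married": "NO",  "mortgage": "NO",  "class": "YES"},
]
labels = [r["class"] for r in data]

def information_gain(attr):
    """Parent entropy minus the weighted entropy after splitting on attr."""
    n = len(data)
    remainder = sum(
        len(sub) / n * entropy(sub)
        for value in {r[attr] for r in data}
        for sub in [[r["class"] for r in data if r[attr] == value]]
    )
    return entropy(labels) - remainder

# 'married' separates the classes perfectly, so it leaves zero entropy
# and has the higher gain; it would be chosen for the split.
print(information_gain("married"))   # 1.0
print(information_gain("mortgage"))  # ≈ 0.311
```

The comparison shows the rule in the text concretely: lowest remaining entropy and highest information gain are two views of the same ranking.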
WEKA Software Analysis Report
An expression like NO 34.0/12.0 means the relevant leaf classified instances as NO correctly 34 times and wrongly 12 times. YES 124.0 means the leaf correctly classified 124 instances as YES and was not wrong in any instance. In this report, a fraction such as 24.0/2.0 under a YES leaf is referred to as a YES over NO classification fraction. 'Instance' means the same as 'individual'.
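As a quick check of this notation, a small hypothetical helper can split such a leaf label into its parts (the function is an illustration, not part of Weka):

```python
# Parse Weka J48 leaf labels such as "NO 34.0/12.0" or "YES 124.0".
# The first number counts instances correctly classified at the leaf;
# the optional second number counts those misclassified.
def parse_leaf(label):
    klass, counts = label.split()
    correct, _, wrong = counts.partition("/")
    return klass, float(correct), float(wrong) if wrong else 0.0

print(parse_leaf("NO 34.0/12.0"))  # ('NO', 34.0, 12.0)
print(parse_leaf("YES 124.0"))     # ('YES', 124.0, 0.0)
```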
A relatively accurate decision tree is built by the J48 algorithm based on information gain and entropy. After analysis, the bank data provided gave 'Number of Children' as the first attribute.
The initial attribute with the least entropy score after evaluation is Children. This is split into two: whether the number of children is greater than 1, or less than or equal to 1. Instances with at most one child are analyzed again on the Children attribute, which gives the best split: whether an individual has no children (children less than or equal to 0) or has children (children greater than 0). Individuals with children are evaluated under the Income attribute. Instances with income of 15,538.8 or less have a YES over NO classification fraction of 24.0/4.0; instances with income greater than 15,538.8 score a YES over NO classification fraction of 111.0/5.0.
For individuals with no children, the Married attribute is analyzed. It is split further into two: instances that are married (YES) and instances that are not (NO). For instances that are not married, the Mortgage attribute is considered next. This splits into those with a mortgage (YES) and those without (NO). Instances with no mortgage (Mortgage=NO) have a NO over YES classification fraction of 48.0/3.0. For instances with a mortgage (Mortgage=YES), save_act is the attribute analyzed, splitting into instances with savings (save_act=YES) and without (save_act=NO). Individuals with savings have a NO over YES classification fraction of 23.0/0.0 (meaning the classification was correct all 23 times). Instances without savings have a YES over NO classification fraction of 12.0/0.0 (again, correct in all 12 instances).
For instances of married individuals (Married=YES), the attribute save_act, having the least entropy score, is analyzed next. This splits into instances with savings (save_act=YES) and without (save_act=NO). Instances with savings score a NO over YES classification fraction of 119.0/12.0. Those without savings are further analyzed under the Mortgage attribute, splitting into those with a mortgage (mortgage=YES) and those without (mortgage=NO). Instances with a mortgage score a YES over NO classification fraction of 25.0/3.0; individuals with no mortgage have a NO over YES classification fraction of 36.0/5.0.
Instances with more than one child are analyzed on the Income attribute. This splits into two: instances with an income of 30,404.3 or less and instances with income greater than that amount. Instances with income of 30,404.3 or less score a NO over YES classification fraction of 124.0/12.0. Individuals earning more than 30,404.3 are then further analyzed under the Children attribute.
This splits into individuals with two or fewer children and individuals with more than two children. Instances with two or fewer children score a YES over NO classification fraction of 51.0/5.0. Individuals with more than two children are further evaluated on the Income attribute, splitting into instances with income of 39,745.3 or less and instances earning more. The former score a NO over YES classification fraction of 15.0/2.0, whereas the latter score a YES over NO classification fraction of 12.0/4.0.
Instances with no children at all or only one child receive an income of more than 15,538.8. Individuals that have no children and are not yet married keep savings and have mortgages; the ones that are married have neither mortgages nor savings. Instances with more than one child have an income above 30,404.3. Instances with more than two children earn more than 39,745.3, while those with fewer children are more likely to earn less than 39,745.3.
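The rules traced above can be collected into a single classification function, useful for checking the walkthrough's consistency. The thresholds are the ones reported in the text; the function itself is an illustration, not Weka output:

```python
# The J48 splits described in the walkthrough, encoded as plain Python.
# Returns the majority class ("YES"/"NO") of the leaf a record reaches.
def classify(children, income, married, mortgage, save_act):
    if children <= 1:
        if children > 0:          # exactly one child
            return "YES"          # both income branches predict YES
        # No children: split on marital status.
        if married == "NO":
            if mortgage == "NO":
                return "NO"                         # leaf 48.0/3.0
            return "NO" if save_act == "YES" else "YES"
        # Married: split on savings account, then mortgage.
        if save_act == "YES":
            return "NO"                             # leaf 119.0/12.0
        return "YES" if mortgage == "YES" else "NO"
    # More than one child: income splits dominate.
    if income <= 30404.3:
        return "NO"                                 # leaf 124.0/12.0
    if children <= 2:
        return "YES"                                # leaf 51.0/5.0
    return "NO" if income <= 39745.3 else "YES"

print(classify(children=0, income=20000, married="NO",
               mortgage="NO", save_act="NO"))  # NO
```

Encoding the tree this way makes each narrated branch testable: any record can be traced to the leaf the text assigns it.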
References
Coussement Kristof & Van den Poel Dirk, 2008. Improving customer complaint management by automatic email classification using linguistic style features as predictors. Decision Support Systems, pp. 870-882.
Eckerson, W. W., 2010. Performance Dashboards: Measuring, Monitoring and Managing your Business. s.l.:Wiley.
Fayyad Usama, Piatetsky-Shapiro Gregory & Smyth Padhraic, 1996. From Data Mining to Knowledge Discovery in Databases. s.l.:s.n.
Ford Martin & Colvin Geoff, 2018. Will robots create more jobs than they destroy?. s.l.:s.n.
Han Jiawei, Kamber Micheline & Pei Jian, 2011. Data Mining: Concepts and Techniques. s.l.:s.n.
Hastie Trevor, Tibshirani Robert & Friedman Jerome, 2009. The Elements of Statistical Learning. s.l.:Springer.
Jenssen, T.-K. & Astrid, L., 2001. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, pp. 21-28.
Lovell Michael C, 1983. Data mining: The review of economics and statistics. pp. 1-12.
M. Kuhn & K. Johnson, 2013. Applied Predictive Modeling. s.l.:Springer.
Masys & R, D., 2001. Linking microarray data to the literature. Nature Genetics, pp. 9-10.
Mena Jesus, 2011. Machine Learning Forensics for Law Enforcement, Security and Intelligence. Boca Raton: CRC Press (Taylor and Francis Group).
Pang Bo & Lee Lillian, 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 79-86.
Quinlan J. R., 1993. C4.5: Programs for Machine Learning. s.l.:Morgan Kaufmann Publishers.
Quinlan, J. R., 1996. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, pp. 77-90.
Ramiro H Galvez & Agustin Gravano, 2017. Assessing the usefulness of online message board mining in automatic stock prediction systems. Journal of Computational Sciences, pp. 1877-7503.
S. B. Kotsiantis, 2007. Supervised Machine Learning: A Review of Classification Techniques. Informatica, pp. 249-268.
Few Stephen, 2006. Information Dashboard Design: The Effective Visual Communication of Data. s.l.:O'Reilly.
Witten Ian H & Frank Eibe, 2011. Data Mining: Practical Machine Learning Tools and Techniques. Waikato: Elsevier.