1. What is Text Mining and how does it differ from Data Mining?
Text mining is the process of deriving useful, high-quality information from a collection of texts (Hastie, Trevor, & Friedman Jerome, 2009). Such high-quality information is obtained by identifying trends and patterns through means such as statistical pattern learning, which is discussed later in this section. Text mining mainly involves structuring the selected input text (parsing it and inserting it into a related database), deriving trends from the structured data, and then evaluating and interpreting the results.
High quality in text mining refers to a combination of relevance, novelty and interestingness. Tasks in text mining include document summarization, pattern recognition, information extraction, sentiment analysis, information retrieval, word frequency distribution and so on. Data mining, on the other hand, is the process of obtaining trends and patterns from large sets of data. It aims to extract all relevant information from an identified dataset and transform it into forms that are more understandable for future reference and use.
Viewed from a general perspective, data mining and text mining can be treated as one field, with text mining regarded as a specialized form of data mining (Mena Jesus, 2011). More precisely, data mining involves extracting relationships from structured data, e.g., tables with rows and columns. Data mining advanced considerably after the introduction of machine learning algorithms such as ID3 and its successor C4.5, which is widely known as J48 in WEKA and related software. Text mining, on the other hand, involves extracting insight from large collections of unstructured text.
In text mining, retrieval of text from specific identified sources ranges from a preliminary step to the final steps of compiling the resulting high-quality information.
The information is identified in a database, web address or file system, and a corpus (a collection of text material) is gathered for further analysis. It is arranged systematically using tools like Word or Excel, saved in the required format, and then analyzed using available machine learning software or by hand.
Almost all text analytic systems use text mining technologies built around natural language processing, such as linguistic analysis, part-of-speech tagging and forms of syntactic parsing. This differs from data mining, which guides the machine learning software by assigning the attributes to be used for comparison and for drawing a conclusion from the data provided. Data mining also mostly involves structured numerical data, while text mining works with texts or words drawn from a large collection, a journal, a newspaper or any other publication.
Technologies used in Text Mining
Another important technology used in text mining is named entity recognition (NER), also called entity chunking, entity extraction or entity identification. It covers items such as email addresses and telephone numbers, which are usually discerned through pattern matching or other forms of pattern recognition. NER is a subtask of information extraction or text mining that aims to locate and then classify named entities into predefined categories, for instance expressions of time, monetary values, locations, percentages and names of persons.
NER systems take an unannotated block of text, e.g.,
Dan bought 400 shares of EBM Corp. in 2018
And coming up with an annotated text block highlighting names of the entities. That is,
[Dan]Person bought 400 shares of [EBM Corp.]Organization in [2018]Time.
In the above example, the name of a person consisting of one token, a company/organization name of two tokens and an expression of time have been detected and classified. Some state-of-the-art NER systems have produced performance approaching human performance: the best system entered in MUC-7 scored an F-measure of 93.39%, while human annotators scored 96.95% and 97.60%.
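A minimal sketch of how the example above could be reproduced with a pretrained NER model. spaCy and its en_core_web_sm pipeline are assumptions (the text does not name a toolkit), and the exact labels depend on the model version.

```python
# A minimal NER sketch using spaCy (an assumed toolkit, not one prescribed by this report).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a statistical NER model

doc = nlp("Dan bought 400 shares of EBM Corp. in 2018")

# Print each detected entity with its predicted label (PERSON, ORG, DATE, ...)
for ent in doc.ents:
    print(ent.text, ent.label_)
# Approximate expected output (depends on the model version):
# Dan PERSON
# EBM Corp. ORG
# 2018 DATE
```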
Entity disambiguation is another technology used in text mining. It involves determining the sense of a term in a specific context, using clues from the circumstances that form the setting of an event, and is an important step in automatic term recognition. Very often a term will have different meanings in different contexts. For example, the word mercury is a chemical element when chemistry is the context and a planet when the solar system is the context. Only by examining the context can the intended meaning of the text be accurately determined, and ambiguity of this kind can cause problems in text mining when splitting attributes. Entity disambiguation identifies and classifies words correctly according to the intended context.
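Disambiguation of this kind is often approached as word sense disambiguation. The sketch below applies NLTK's simplified Lesk algorithm to the mercury example; NLTK, its WordNet corpus and the example sentences are assumptions, and simplified Lesk is only a baseline that may not always pick the intended sense.

```python
# A minimal word-sense disambiguation sketch with NLTK's simplified Lesk algorithm.
# Requires: pip install nltk, then nltk.download('wordnet') and nltk.download('punkt').
from nltk import word_tokenize
from nltk.wsd import lesk

chemistry_ctx = word_tokenize("Mercury is a toxic chemical element used in old thermometers")
astronomy_ctx = word_tokenize("Mercury is the smallest planet in the solar system")

# lesk() picks the WordNet sense whose definition overlaps most with the context words
for ctx in (chemistry_ctx, astronomy_ctx):
    sense = lesk(ctx, "mercury")
    print(sense, "-", sense.definition() if sense else "no sense found")
```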
Co-referencing, another important technology in text mining, refers to identifying terms in a text that refer to the same noun phrase or the same object. It comes after entity disambiguation. Algorithms built to resolve co-references normally look first for the nearest preceding mention that is compatible with the referring expression. For example, “he” may attach to preceding expressions such as ‘the man’ or ‘Jake’ but not ‘Grace’. Such algorithms have so far achieved accuracies well below 80%, which is considered acceptable (Crystal D, 1997).
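A toy illustration of the nearest-compatible-antecedent heuristic described above. The gender lexicon and mentions are invented for the sketch; real coreference resolvers use much richer features.

```python
# Toy sketch of the "nearest compatible preceding mention" heuristic; not a real coreference system.
PRONOUN_GENDER = {"he": "male", "she": "female"}
MENTION_GENDER = {"Jake": "male", "the man": "male", "Grace": "female"}

def resolve(pronoun, preceding_mentions):
    """Return the nearest preceding mention whose gender matches the pronoun."""
    wanted = PRONOUN_GENDER[pronoun.lower()]
    for mention in reversed(preceding_mentions):   # nearest mention considered first
        if MENTION_GENDER.get(mention) == wanted:
            return mention
    return None

# "Grace met Jake. He smiled." -> "he" should resolve to "Jake", not "Grace"
print(resolve("He", ["Grace", "Jake"]))   # Jake
```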
Applications of Text Mining
Sentiment analysis, also known as opinion mining or emotion AI, is another technology in text mining. It involves discerning subjective material and extracting various levels and forms of attitudinal information: emotion, mood, opinion and sentiment. It uses text analysis, natural language processing, biometrics and computational linguistics to systematically identify, extract, quantify and study subjective information and affective states. It aims to determine the attitude of a writer, speaker or subject with respect to a specific topic or contextual polarity, or their emotional reaction to an event, a document or an interaction. Such attitudes may be evaluations or judgements, affective states (the emotional state of the author or speaker at the time of writing), or the intended emotional communication, that is, the emotional effect on the reader intended by the author or interlocutor.
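A minimal sentiment-scoring sketch using NLTK's VADER lexicon. The toolkit and the example sentences are assumptions (the text does not prescribe a tool); the compound score summarizes overall polarity.

```python
# A minimal sentiment analysis sketch with NLTK's VADER lexicon (an assumed toolkit).
# Requires: pip install nltk, then nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

reviews = [
    "The new dashboard is fantastic and very easy to use.",
    "Support was slow and the product kept crashing.",
]

# polarity_scores() returns neg/neu/pos proportions and a compound score in [-1, 1]
for text in reviews:
    scores = sia.polarity_scores(text)
    print(scores["compound"], text)
```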
Quantitative text analysis (QTA) is an automated and systematic method for processing large collections of text. It makes it possible, for example, to extract policy positions from speeches and election manifestos, or emotion from newspaper articles. All methods used in QTA can be reduced to three basic steps: selecting the texts to examine, deciding the unit of analysis, and creating the document feature matrix.
- Selecting text to examine: We first define a corpus which is a selection of texts organized in such a way that makes them suitable in QTA. Texts selected depend on the questions that need answering. It is necessary that the texts are relevant to the question. To make the process easy, the origin of the text is first examined.
- Deciding the unit of analysis: the feature used as the unit of analysis may be single words (unigrams), multi-word sequences (bigrams, trigrams, n-grams in general), word lemmas and stems (equivalence classes of words, e.g., runs, ran, run), or word counts within sentences, paragraphs and documents.
- Creating the document feature matrix: we put what we have created into a usable format. Each document is represented as a vector of word counts. We take a document and turn it into a set of word counts, do the same for all selected texts, and then stack the documents in the feature matrix. The words are now the rows, the documents the columns, and a given element of the matrix is the frequency of a word in a document. Words to be excluded are then identified and documented by the researcher according to his or her specifications (a minimal sketch follows this list).
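A minimal sketch of these three steps using scikit-learn's CountVectorizer; the toolkit and the three-sentence corpus are assumptions. Note that scikit-learn stores documents as rows and words as columns, i.e., the transpose of the arrangement described above, which makes no difference to the analysis.

```python
# A minimal document-feature matrix sketch with scikit-learn (an assumed toolkit).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [                                  # steps 1-2: the selected texts (the corpus)
    "The party will cut taxes and raise spending",
    "Taxes will rise to fund public spending",
    "The election was covered by every newspaper",
]

# step 3: unigram counts, with common English stop words excluded by the researcher
vectorizer = CountVectorizer(stop_words="english")
dfm = vectorizer.fit_transform(corpus)      # sparse document-feature matrix

print(vectorizer.get_feature_names_out())   # the word features (columns here)
print(dfm.toarray())                        # one row of word counts per document
```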
Text mining has a large range of applications, from government needs to research and business. Application categories include records management, e-discovery, national security, enterprise management (competitive intelligence), listening platforms (sentiment analysis tools), automated ad placement, publishing and more.
2. Why artificial intelligence is important for supporting business to build smart systems
Unlike human beings, artificial systems are not subject to the bias and prejudice that often affect decision making. Artificial systems work with facts, compiling what happened in the past, relating it to what is happening in the present and drawing up plans for making things better in the future.
Artificial systems are accurate and precise because they work with predefined algorithms for analysis. The performance of a business using AI can be expected to improve markedly compared with one that still relies on human effort for analysis and decision making.
Comparing artificial intelligence predictions of business patterns with human predictions and analysis of the same data can help a business reach better conclusions about the best course of action to take for better performance.
How artificial intelligence helps to transform companies.
HANA, a machine learning platform developed by SAP, analyzes transaction data and historical information for trends and irregularities (Pang Bo & Lee Lillian, 2002). Decision making is faster when it is used in conjunction with other supporting applications. The cloud-based software checks the data it has been provided with for anomalies after analyzing trends, then sends an alert when an anomaly is found. It also suggests possible solutions. It therefore saves the company that incorporates it into its operations money that would otherwise have been lost if the anomaly had not been identified and fixed early. The slightest anomaly in production or in sales at exit points, which could easily be missed by the human eye, is identified by this software. Any AI will make data-driven decisions, which in most cases are more informed and reliable than those made by human beings.
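To illustrate the general idea of trend-based anomaly alerting, and emphatically not SAP HANA's actual algorithm, here is a small sketch that flags any day whose sales total deviates from the mean by more than two standard deviations; the figures are invented.

```python
# Generic sketch of trend-based anomaly alerting (illustrative only; not SAP HANA's method).
import statistics

daily_sales = [105, 98, 102, 110, 95, 101, 99, 240, 103, 97]   # hypothetical point-of-sale totals

mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)

# Flag any day more than two standard deviations away from the overall mean
for day, amount in enumerate(daily_sales, start=1):
    if abs(amount - mean) > 2 * stdev:
        print(f"ALERT: day {day} total {amount} is anomalous (mean {mean:.1f}, sd {stdev:.1f})")
```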
AI in Business Intelligence Applications and its impact
Walmart is one of the big companies that use this software. It feeds the software transaction data from all of its more than 11,000 points of sale in real time for monitoring. Predetermined target figures are fed into the software so that real-time information can be monitored against them. Walmart therefore operates faster, controlling its back-office costs by consolidating resources and the processes needed to handle the important work. All sales points can be monitored centrally through HANA.
An inventory study sponsored by SAP made an astonishing revelation: almost all of the companies that had incorporated its cloud-based software, HANA, expected to make a return on investment in excess of 580%.
Domo, another cloud-based machine learning platform, works by collecting data from third-party applications and platforms, e.g., Square, Salesforce, Facebook and Twitter, and then uses the data to provide business insights and context for business intelligence. Any related comments on the business's products are collected and analyzed using its built-in text mining techniques. The AI monitors the performance of the business at production and sales points, spots emerging trends and patterns in real time, analyzes them and generates reports with suggestions on measures to take to improve production and sales. It can also give alerts in case of poor performance of a product. Just like HANA, predetermined data is fed into it, real-time data is compared against it, and appropriate reports are generated. The company can then make proper adjustments based on the reports and suggestions made by the software, for example whether to increase, decrease or stop supply to a certain sales point.
Apptus, unlike the other AIs, does not have predetermined data fed into it for comparison. It learns customers' search patterns by itself and makes it easy for them to purchase products by making related suggestions based on their searches. Since it keeps a record of the patterns and preferences it has learned, it can generate reports on ways the company can improve its products to match customers' preferences, based on the choices they make when purchasing.
Avanade, a joint venture between Accenture and Microsoft, is another AI-driven offering; it uses Cortana Intelligence to learn customers' search patterns and make suggestions based on their searches. It has been further improved to recognize speech, making it more appealing to customers. Using the same principle of predictive analytics, it has enabled companies incorporating it into their sales to make remarkable returns on investment, the highest reported so far being 246%.
Examples of AI in Business Intelligence Applications
General Electric developed its own AI, Predix. It is used with other powerful industrial applications to share information and to process historical performance data from various pieces of equipment. It then uses this data to predict a wide range of operational outcomes: it calculates how long a piece of equipment can stay in operation before it fails and reports the intervals at which maintenance should be performed. This goes a long way towards saving the company losses it would have incurred if such information were not at hand and production had to stop because of the failure of production-critical equipment.
Siemens has also built a cloud-based AI called MindSphere, which works in much the same way as GE's Predix. It monitors fleets of equipment or machines for maintenance and service needs using machine tool and drive train analytics. Unlike Predix, which is manufacturer-specific, it works with machines from all manufacturers.
The threat of artificial intelligence and the limitations of smart systems
Artificial intelligence will most likely take over much of the world of engineering, as current research is centered on making AIs that can think and work like human beings. This means they could design their own hardware, design hardware for other AIs and make improvements to their own hardware and software. However, such breakthroughs, if not checked, could be the beginning of the extinction of the human race.
Any software can be hacked, and thus it can be manipulated to give results that could cripple a business.
If the consumer information and behavioral patterns they collect land in the wrong hands, the data could be sold to other organizations for the purpose of sending consumers nuisance advertisements.
AIs learn from the data fed into them, and there is no other way knowledge gets integrated into them, unlike in human learning. Therefore, any inconsistencies, errors or inaccuracies in the data fed in are reflected in the results.
3. Bank data analysis with the Waikato Environment for Knowledge Analysis (WEKA) software
Machine learning involves learning trends and patterns from a structured dataset fed into an AI. It is mostly used in data mining (an interdisciplinary subfield of computer science) to extract important information from what was previously a meaningless mass of data.
My work mainly involves the analysis and classification of the bank data provided.
The type of analysis I have used for this data is the J48 decision tree, as it is commonly known, which is a Java implementation of the C4.5 algorithm. C4.5 was developed as an improvement on ID3. The WEKA software gives an appealing visual representation of the decision tree that can easily be followed through to the last split of attributes.
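J48 itself runs inside WEKA; as a rough Python analogue, the sketch below trains an entropy-based decision tree with scikit-learn. The file name bank.csv, the id and pep column names, and the use of min_samples_leaf to mirror WEKA's minimum-objects setting are assumptions, not details taken from the report.

```python
# An analogous entropy-based decision tree in Python; this is not WEKA's J48 itself.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("bank.csv")                       # assumed: 600 rows, 12 attributes
X = data.drop(columns=["id", "pep"])                 # "pep" assumed to be the class attribute
y = data["pep"]

# Encode YES/NO and other categorical attributes as integers; numeric columns stay as-is
cat_cols = X.select_dtypes(include="object").columns
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])

# criterion="entropy" splits on information gain, as C4.5/J48 does;
# min_samples_leaf=10 roughly mirrors WEKA's "minimum number of objects" setting of 10
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=10, random_state=0)

# 10-fold cross validation, mirroring the evaluation used later in this report
scores = cross_val_score(tree, X, y, cv=10)
print("mean accuracy:", scores.mean())
```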
WEKA Classification
A decision tree is basically a recursive partition of the instance space, consisting of nodes that form a rooted tree. This means it is a directed tree with a root node that has no incoming edges; every other node has exactly one incoming edge. A node with outgoing edges is known as an internal node, while the remaining nodes are called leaves, terminal nodes or decision nodes. Each internal node splits the instance space into two or more smaller sub-spaces, depending on the particular discrete function of the input attribute values used for the split.
The algorithm is basic, and the J48 algorithm, though it looks complicated because of the Java code, is very simple and works on basic principles. It begins by analyzing the whole dataset according to the settings it is run with, so as to give the maximum possible accuracy for the analysis. It then selects the attribute, and the values along its dimension, that give the best split for further analysis and data extraction. Child nodes are created based on the best split after the entropy has been evaluated. The algorithm then recurses on the data at every child node generated, using that child's data, until a stopping criterion is reached. Drilling down for more information stops when the tree has become too big, when the amount of data reaching a node is too small to be significant, or when all the instances reaching a node belong to the same class and no more comparisons can be made.
The major challenge that arises when building a J48 decision tree is deciding on the best attribute to split, so that the other related attributes can be evaluated beneath it. The best split is necessary for a more conclusive mining of the dataset. To solve this problem, information gain is used: each attribute is evaluated for the amount of information it can give. This is done through entropy evaluation, entropy being a measure of the uncertainty associated with each attribute. The attribute with the lowest resulting entropy, and therefore the highest information gain, is picked and used as the initial attribute under which the other attributes will be evaluated.
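A small sketch of the entropy and information-gain calculation described above, applied to an invented set of class labels split by a hypothetical attribute.

```python
# Entropy and information gain on illustrative class labels (figures are made up).
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """Entropy of the parent node minus the weighted entropy of the child partitions."""
    total = len(parent_labels)
    weighted = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted

parent = ["YES"] * 6 + ["NO"] * 6
# A hypothetical attribute that splits the 12 instances into two child nodes
split = [["YES"] * 5 + ["NO"], ["YES"] + ["NO"] * 5]
print(information_gain(parent, split))   # higher gain = better split candidate
```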
The J48 algorithm then builds a fairly accurate decision tree by evaluating attributes based on their entropies until the stopping point is reached. The root attribute in the bank data provided is children (the number of children an individual has).
Instructions for WEKA Classification
An expression like YES 10.0/4.0 on a leaf means that 10 instances reached that leaf and were classified YES, of which 4 were misclassified. In this assignment I will refer to such figures as a YES to NO classification ratio of 10.0/4.0 to reduce the bulk of words.
Figure 1 Attributes
The Preprocess panel shows that the data comprises 12 attributes for all 600 individuals.
Figure 2 j48 decision tree based on training set.
Figure 3 Training set decision tree
The accuracy of the classification when testing on the training set is well over 90%. Though more accurate, the resulting decision tree is bulky and somewhat difficult to follow. The split made on the attributes is the best that can be done for the data, but we want a less bulky tree that is easy to follow at a glance. 10-fold cross validation is therefore used, with the minimum number of objects per leaf set to 10. The resulting analysis is shown below.
Figure 4 10-fold cross validation test with attributes
Figure 5 Accuracy and Confusion Matrix
Figure 6 Decision tree visualization
After the entropy and information gain evaluation, the initial attribute with the highest information gain is children (whether the number of children is less than or equal to 1, or greater than 1). For individuals with a number of children less than or equal to 1, the attribute analyzed next is still the number of children, split into two: whether the number of children is less than or equal to 0, or greater than 0.
For individuals with a number of children greater than 0, the income attribute is analyzed. Those with income less than or equal to 15,538.8 have a NO to YES classification ratio of 24.0/4.0, while those with income greater than 15,538.8 have a YES to NO classification ratio of 111.0/5.0. Married is the attribute considered next for individuals with 0 children. This is split into those who are married (married=YES) and those not married (married=NO). For those not married, the attribute considered next is mortgage, which is further split into those with a mortgage (mortgage=YES) and those without (mortgage=NO). Those with no mortgage have a NO to YES classification ratio of 48.0/3.0. For those with a mortgage, save_act is the attribute analyzed next, split into those with savings (save_act=YES) and those without (save_act=NO). Those with savings have a NO classification of 23.0 (23 instances reached that leaf with none misclassified), while those without savings have a YES classification of 12.0 (12 instances, none misclassified).
Analysis Report on WEKA Classification
For those that are married (married=YES), save_act is analyzed as the least entropic attribute. Those with savings (save_act=YES) have a NO to YES classification ratio of 119.0/12.0, while those without savings are analyzed for mortgage (whether or not they have mortgage). The ones with mortgage (mortgage=YES) have a YES to NO classification ratio of 25.0/3.0. Those with NO mortgage have a NO to YES classification ratio of 36.0/5.0.
Individuals with a number of children greater than 1 are first analyzed under the income attribute. Income is split into two: individuals with income less than or equal to 30,404.3, and those with income above that amount. Those at or below that amount have a NO to YES ratio of 124.0/12.0. Those with income above 30,404.3 are further analyzed under the children attribute: those with two or fewer children and those with more than two. Individuals with two or fewer children have a YES to NO classification ratio of 51.0/5.0. Those with more than two children are further analyzed under the income attribute: individuals with an income of 39,745.3 or less have a NO to YES classification ratio of 15.0/2.0, while those with a higher income have a YES to NO classification ratio of 12.0/4.0.
Conclusion
Individuals with only one child or no child at all mostly have an income of more than 15,538.8. Those with no children who are not married have mortgages and savings. Those that are married do not have savings and are less likely to have mortgages.
Individuals with more than one child mostly have an income of at most 30,404.3. Among those with more than two children, it is mainly those with an income above 39,745.3 that are classified YES.
4. KPI business dashboard
A dashboard, also called a progress report, is a visual representation of important information about an organization that highlights its performance at a glance. The contents are arranged and consolidated on a single screen so that the information can be accessed and monitored easily. It therefore provides views of key performance indicators relevant to a specific objective or process in a business.
Almost all organizations that build dashboards, for whatever reason, publish them on the web. This way they can be accessed easily and updated more readily than physical dashboards. A manufacturing dashboard, for example, could show the productivity of the business through figures such as the amount of product produced in a day, a week or a year. In the same way, a human resources dashboard could show figures related to workforce composition and retention, staff recruitment and so on.
The business dashboard shown above shows the budgeted cost of production and the real cost of production incurred for all of the items, 1 through to 10. It also shows the projected and actual revenues gained from sale of the products. From the way the bar graphs are designed, it is very easy to tell how a particular product/item is performing in the market. One doesn’t need to look at the exact values indicated on the top part of the graph. The designer understood the data well before the actual design, and as such, most of the important data has been clearly brought out.
The dashboard also shows a total of projected cost of production and the actual cost incurred. This makes it easy for analysis as one does not have to calculate each projected and actual cost of each item to get the totals.
A line graph has also been provided to show the net and gross profit margins through the years of production, 2007 to 2016. The debt to equity ratio over the same period can be easily tracked through the line graphs drawn above the bars for comparison. This ratio is seen to improve in favor of equity in the last year, 2016.
The dashboard provided is for analysis purposes. There is just enough context in terms of evaluators of performance, extensive history and rich comparisons. The financial year’s performance including all of the other years’ performance is well captured in the given dashboard. All of the relevant information that can be used to do business evaluation is given in the dashboard. It is easy to understand and design and can be used by savvy technical users such as data analysts or researchers. All of the information represented is colored adequately, from the budgets to the actual revenues, totals for both budgets and revenues, the profit margins coloring for both the net and gross profits made and the equity alongside debt bar graphs.
However, the dashboard is not entirely perfect as there are some important aspects that have been ignored during the design. For comparison purposes, it would have been better if percentage deviations were given to easily provide the difference between various attributes on the dashboard. For example, percentage deviation in goals and actual costs of production and revenues instead of indicating the exact values as they are. There is also too much precision on the data on the dashboard. For instance, instead of using valid approximations like $8.4M, the data is represented as $8,410,963. Such exactness slows a viewer down when going through the data for comparison with other business performance dashboards.
The data represented on the dashboard is also arranged poorly (S. B. Kotsiantis, 2007). The net and gross profit line graphs should have come before the debt to equity ratio. There is a flow of information that a dashboard should follow, based on the desired viewing sequence and importance. The main objective of a dashboard is to give information to a viewer, who should be able to make sense of it immediately and gain a better understanding of the performance of the business in terms of returns on investment. If a viewer has to scroll up and down to get to some important information, the intended value of the dashboard diminishes. The extra data points that have been included, which force viewers to scroll down an extra page, blur the whole picture and will without doubt confuse the viewer.
Data on profits has been inadequately highlighted. When a dashboard is glanced at, the eyes should be immediately drawn to the most important information, which means it must be displayed prominently on the screen (Stephen Few, 2006). The profits the organization makes should be among the most strongly colored and attention-grabbing parts of the dashboard.
The dashboard also extends over more than a page, which is a critical mistake (Witten Ian H & Frank Eibe, 2011). More understanding is derived from a dashboard when all the information can be seen on a single page. A viewer may not remember the details of attributes seen on previous pages of a dashboard, so making comparisons becomes difficult.
References
Coussement, Kristof, & Van den Poel, Dirk. (2008). Improving customer complaint management by automatic email classification using linguistic style features as predictors. Decision Support Systems, 44(4), 870-882.
Crystal, David. (1997). A Dictionary of Linguistics and Phonetics. Blackwell Publishing.
Eckerson, Wayne W. (2010). Performance Dashboards: Measuring, Monitoring and Managing Your Business. Wiley.
Fayyad, Usama, Piatetsky-Shapiro, Gregory, & Smyth, Padhraic. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37-54.
Few, Stephen. (2006). Information Dashboard Design: The Effective Visual Communication of Data. O'Reilly.
Ford, Martin, & Colvin, Geoff. (2018). Will robots create more jobs than they destroy?
Galvez, Ramiro H., & Gravano, Agustin. (2017). Assessing the usefulness of online message board mining in automatic stock prediction systems. Journal of Computational Science.
Han, Jiawei, Kamber, Micheline, & Pei, Jian. (2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann.
Hastie, Trevor, Tibshirani, Robert, & Friedman, Jerome. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
Jenssen, T.-K., & Laegreid, A. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28, 21-28.
Kotsiantis, S. B. (2007). Supervised machine learning: A review of classification techniques. Informatica, 31, 249-268.
Kuhn, Max, & Johnson, Kjell. (2013). Applied Predictive Modeling. Springer.
Lovell, Michael C. (1983). Data mining. The Review of Economics and Statistics, 65(1), 1-12.
Masys, D. R. (2001). Linking microarray data to the literature. Nature Genetics, 28, 9-10.
Mena, Jesus. (2011). Machine Learning Forensics for Law Enforcement, Security and Intelligence. Boca Raton: CRC Press (Taylor and Francis Group).
Pang, Bo, Lee, Lillian, & Vaithyanathan, Shivakumar. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, 79-86.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77-90.
Witten, Ian H., & Frank, Eibe. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.