How to use the Jaro score for matching Martha and Marhta
> a <- 'Martha'
> b <- 'Marhta'
> A <- 6   # length of a
> B <- 6   # length of b
> m <- 6   # number of shared symbols
> t <- 1   # number of necessary transpositions
> l <- 3   # number of matching symbols at the beginning (Mar)
> # Jaro similarity
> d <- function(A, B, m, t) {
+   (1/3) * (m/A + m/B + (m - t)/m)
+ }
> # Jaro-Winkler similarity: the Jaro score boosted by the common prefix
> jw <- function(A, B, m, t, l, p) {
+   dj <- d(A, B, m, t)
+   dj + l * p * (1 - dj)
+ }
> round(jw(A, B, m, t, l, 0.1), 3)
[1] 0.961
> # Cross-check, assuming the stringdist package is installed
> round(stringdist::stringsim(a, b, method = 'jw', p = 0.1), 3)
[1] 0.961
The Jaro score for MARTHA and MARHTA is based on the characters the two strings share, including those that match out of order:
cs1 | M | A | R | T | H | A
matches | 0 | 1 | 2 | 4 | 3 | 5
cs2 | M | A | R | H | T | A
Jaro Proximity (MARTHA, MARHTA) = 1/3 * (6/6 + 6/6 + (6 – 1)/6) = 0.944
With a prefix scaling factor of 0.1, the three matching initial characters (MAR) give a Jaro-Winkler proximity of:
JaroWinkler Proximity (MARTHA, MARHTA) = 0.944 + 0.1 * 3 * (1.0 – 0.944) = 0.961
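The hand calculation above takes the number of shared symbols and transpositions as given. A minimal base R sketch that derives them from the two strings might look as follows. The function names `jaro` and `jaro_winkler` are illustrative, and the matching window and the four-character prefix cap follow the commonly used definition of the measure rather than anything stated in this text.

```r
# Jaro similarity computed directly from two strings (base R only).
jaro <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  A <- length(a)
  B <- length(b)
  # Assumption: two empty strings count as identical, one empty string as no match.
  if (A == 0 || B == 0) return(as.numeric(A == B))
  # Characters count as shared only if they are at most `window` positions apart.
  window <- max(floor(max(A, B) / 2) - 1, 0)
  a_hit <- rep(FALSE, A)
  b_hit <- rep(FALSE, B)
  for (i in seq_len(A)) {
    lo <- max(1, i - window)
    hi <- min(B, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!b_hit[j] && a[i] == b[j]) {
        a_hit[i] <- TRUE
        b_hit[j] <- TRUE
        break
      }
    }
  }
  m <- sum(a_hit)
  if (m == 0) return(0)
  # Shared characters in a different order; each swap shows up twice, so halve it.
  t <- sum(a[a_hit] != b[b_hit]) / 2
  (1 / 3) * (m / A + m / B + (m - t) / m)
}

# Jaro-Winkler: boost the Jaro score by the common prefix (at most 4 characters).
jaro_winkler <- function(s1, s2, p = 0.1) {
  dj <- jaro(s1, s2)
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  l <- 0
  for (k in seq_len(min(length(a), length(b), 4))) {
    if (a[k] != b[k]) break
    l <- l + 1
  }
  dj + l * p * (1 - dj)
}

print(round(jaro("MARTHA", "MARHTA"), 3))          # 0.944
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```

For MARTHA and MARHTA the sketch finds six shared symbols and one transposition, which reproduces the two values worked out by hand.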
The same calculation for Hello and Hallo:
> a <- 'Hello'
> b <- 'Hallo'
> A <- 5   # length of a
> B <- 5   # length of b
> m <- 4   # number of shared symbols (H, l, l, o)
> t <- 0   # number of necessary transpositions
> l <- 1   # number of matching symbols at the beginning (H)
> # Jaro similarity
> d <- function(A, B, m, t) {
+   (1/3) * (m/A + m/B + (m - t)/m)
+ }
> # Jaro-Winkler similarity: the Jaro score boosted by the common prefix
> jw <- function(A, B, m, t, l, p) {
+   dj <- d(A, B, m, t)
+   dj + l * p * (1 - dj)
+ }
> round(jw(A, B, m, t, l, 0.1), 3)
[1] 0.88
> # Cross-check, assuming the stringdist package is installed
> round(stringdist::stringsim(a, b, method = 'jw', p = 0.1), 3)
[1] 0.88
Phonetics concerns the sounds of vowels and consonants that are produced when letters are pronounced. Algorithms can help in understanding the arrangement of such letters step by step. Phonetic Matching Algorithms (PMA) are a novel approach in which phonetics is used to check the arrangement of letters in words or sentences, thereby shifting away from rule-based algorithms, which had strong language dependence and risked a lack of valid matches.
Two problems of PMA compared with String Matching Algorithms are:
- It cannot account for phonetic changes caused by the placement of vowels between consonants.
- While it can provide valid matches, it fails to capture the inflections and accents that String Matching Algorithms can handle.
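To make the idea of phonetic matching concrete, here is a minimal base R sketch of Soundex, one well-known phonetic algorithm. It is chosen only as an illustration and is not named in the text above; the function name `soundex` is hypothetical. It also shows the first problem listed: vowels are dropped from the key, so differences in vowel placement are invisible to the match.

```r
# American Soundex: first letter plus three digits encoding the following consonants.
soundex <- function(word) {
  chars <- strsplit(toupper(gsub("[^A-Za-z]", "", word)), "")[[1]]
  if (length(chars) == 0) return("")
  code_of <- function(ch) {
    if (ch %in% c("B", "F", "P", "V")) return("1")
    if (ch %in% c("C", "G", "J", "K", "Q", "S", "X", "Z")) return("2")
    if (ch %in% c("D", "T")) return("3")
    if (ch == "L") return("4")
    if (ch %in% c("M", "N")) return("5")
    if (ch == "R") return("6")
    ""  # vowels, Y, H and W carry no code
  }
  key <- chars[1]
  prev <- code_of(chars[1])
  for (ch in chars[-1]) {
    # H and W are skipped and do not separate two letters with the same code.
    if (ch %in% c("H", "W")) next
    cur <- code_of(ch)
    # Append a digit only when it differs from the previous letter's code.
    if (cur != "" && cur != prev) key <- paste0(key, cur)
    prev <- cur  # a vowel resets prev, so a repeated code after a vowel is kept
  }
  # Pad with zeros and truncate to four characters.
  substr(paste0(key, "000"), 1, 4)
}

print(soundex("Smith"))   # "S530"
print(soundex("Smyth"))   # "S530"
print(soundex("Robert"))  # "R163"
print(soundex("Rupert"))  # "R163"
```

Smith and Smyth receive the same key, which is the kind of valid match a phonetic algorithm is meant to find. Robert and Rupert also collide, which shows how coarse the key is. The letter groups are tuned to English pronunciation, so the sketch remains language dependent.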
Standardization means rescaling the data to have a mean of zero and a standard deviation of one. A standardized variable is called a z-score, also known as a standard score. Standardization is important because model stability and the precision of parameter estimates suffer in multivariate analysis when variables on different scales are used. For example, in boundary detection, a variable that ranges between 0 and 100 will outweigh a variable that ranges between 0 and 1. Using variables without standardization gives variables with larger ranges greater importance in the analysis. Transforming the data to comparable scales prevents this problem (Sanders et al., 2015). An example of such a process is clinical research data, where standardization has been shown to change the pattern of clinical research. The process is carried out through intensive quality work on the data. Data quality management improves data integration and reusability. It also facilitates data exchange with partners and increases the use of software tools. This leads to improvements in team communication and team management and facilitates regulatory reviews and audits. Meredith Nahm, an associate director for clinical research informatics at the Duke Translational Medicine Institute, also emphasized data sharing and the reuse of data for purposes other than those intended by the people who collected it, which underlines the need for standardized data.
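The rescaling described above can be sketched in a few lines of base R. The two vectors are made-up illustrative values, one on a 0 to 100 scale and one on a 0 to 1 scale, and `z_score` is a hypothetical helper name.

```r
# Two variables on very different scales (illustrative values, not from the source).
range_0_100 <- c(12, 35, 50, 68, 90)
range_0_1   <- c(0.10, 0.40, 0.55, 0.70, 0.95)

# z-score: subtract the mean, then divide by the (sample) standard deviation.
z_score <- function(x) (x - mean(x)) / sd(x)

z1 <- z_score(range_0_100)
z2 <- z_score(range_0_1)

# After standardization both variables have mean 0 and standard deviation 1,
# so neither outweighs the other because of its original range.
print(round(c(mean(z1), sd(z1)), 3))
print(round(c(mean(z2), sd(z2)), 3))

# Base R's scale() applies the same centering and scaling to each matrix column.
scaled <- scale(cbind(range_0_100, range_0_1))
print(round(scaled, 3))
```

Once both variables are expressed as z-scores, a one-unit difference means one standard deviation in either variable, which is what makes them comparable in a multivariate analysis.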
I disagree with the statement. Poor-quality data will affect an advanced database like that of the NASA Space Shuttle Programme as acutely as it will affect a business accounting database. NASA is a highly digitized organization that relies on computer simulations and many forms of data to make calculations and plan space-related activities. Any inaccuracy in the input data would produce inaccurate results, and because NASA deals with multiple large data sets, even a small inaccuracy would compound, resulting in a large deviation from accurate results.
Problems with Phonetic Matching Algorithms
6.
Data redundancy is the repetition or superfluity of data in the database. It can happen for various reasons. The problem with redundancy is that it can corrupt otherwise accurate data. Since the repeated data is also kept in the storage system, redundancy increases the size of the data. Due to repetition, it can also cause data inconsistency, thereby reducing the quality of the data (Horn, 2016). A real-life example of such a problem is duplicate data, which is one of the most common forms of error and can arise in many ways. For example, if an employee's data is inserted into a department's records and the data of the other employees is then automatically repeated, the same employees can be counted as multiple records.
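The duplicate-record problem can be illustrated with a small base R sketch. The `employees` table and its columns are invented for illustration only.

```r
# Hypothetical employee records in which two rows were entered twice.
employees <- data.frame(
  id   = c(101, 102, 103, 102, 101),
  name = c("Asha", "Ben", "Chen", "Ben", "Asha"),
  dept = c("HR", "IT", "IT", "IT", "HR"),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row exactly.
is_dup <- duplicated(employees)
n_redundant <- sum(is_dup)

# Keep only the first occurrence of each record.
deduplicated <- employees[!is_dup, ]

print(n_redundant)         # 2
print(nrow(deduplicated))  # 3
```

Counting rows before deduplication would report five employees instead of three, which is how redundancy inflates storage and distorts record counts.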
I disagree with the above statement. Data may have been declared accurate once, but that is no guarantee that later data will reach the same level of accuracy and meet the same standards. For this reason, there should be a proper data monitoring panel that supervises the data and makes sure, through routine checks, that it continues to attain the required accuracy and meet the required standard.
8.
Sharing information along a proper line, with a specified purpose, across departments will improve information quality. When information is shared among different departments, each department effectively audits it, which helps to correct or improve the data. However, the departments with which the data is shared must be relevant to the data process, or else the quality checks will be wasted. There should therefore be a defined line along which data is shared between departments.
I do not agree with this statement. While poor data accuracy will adversely affect decisions, a suspicion about the accuracy of data can cause problems as well. Decisions taken under suspicion lack conviction, and decisions in a company should be taken without much doubt. Any suspicion adds risk to the decision-making, so a risk management process must be attached to the decision, which in turn modifies the decision-making process. There will also be actions to audit the data and remove the suspicion. All these factors have a major impact on the decision-making process, and modifying decisions in this way adds cost.
Data Auditing | Data Monitoring
Data auditing is the process of assessing the quality of data and its fitness for its prescribed purpose. | Data monitoring is the practice of routinely supervising and checking data so that it maintains the required quality and meets the required standards.
Data auditing is quality assurance. | Data monitoring is quality control (Bautista Gomez & Cappello, 2014).
Various agencies and associations, such as the Joint Information Systems Committee (JISC), promote data audit protocols in different fields (Akhtar & Iqbal, 2014). | Data monitoring is generally based on the standards that are prescribed and maintained by the company.
Data auditing generally occurs at the end, once the final data has been formed. | Data monitoring is routine supervision of the data.
References
Akhtar, S., & Iqbal, J. (2014). An empirical analysis of pre and post merger or acquisition impact on financial performance: a case study of Pakistan telecommunication limited. European Journal of Accounting Auditing and Finance Research, 3(1), 69-80.
Bautista Gomez, L., & Cappello, F. (2014, February). Detecting silent data corruption through data dynamic monitoring for scientific applications. In ACM SIGPLAN Notices (Vol. 49, No. 8, pp. 381-382). ACM.
Sanders, A., Childs, M., Traub, E., & Jones, J. (2015). An analysis of long term data consistency and a proposal to standardize flower survey methods for the EISI pollinator project.
Horn, R. L. (2016). U.S. Patent No. 9,268,657. Washington, DC: U.S. Patent and Trademark Office.