Research Background
Project Title: Fault Prediction In Object-Oriented Systems Using Machine Learning Techniques
Object-oriented (OO) systems follow a software model in which the data used by a program is represented as discrete objects. These objects are modelled on both the key users of the system and the user-system interactions [1]. The key attributes held by each object are either data or information about the executable code. This model therefore offers a structured approach to handling the system database. Languages such as Java and C++ are used to develop software products based on the OO model, which has become popular among software developers because of the ease with which it handles data and system interactions [2]. However, system quality is becoming increasingly important because software products are now deployed in critical application areas. Fault prediction for OO systems is therefore an important area for IT professionals to focus on. The project proposed here aims to develop skills in fault prediction for OO systems. This proposal covers a detailed background study, the research question, aim and objectives, key academic challenges, work plan, and key resources required for the project.
Object-oriented (OO) fault metrics
According to Yuan et al. [3], the demand for high-quality, efficient software increases over time as software products grow in complexity and features. Experts use various object-oriented metrics to measure and predict product quality. Some of the most important are cohesion, inheritance and coupling, as these factors strongly affect quality. However, predicting quality has always been a complex task for developers and testers. The key quality factors that need to be assessed and predicted while developing a software product are reliability, productivity, effort and maintainability [4]. Together, these factors define how efficiently a product behaves in operation, and predicting them before launch is very important for developers. The approach most favoured by programmers is early-phase prediction of how fault-prone a product will be. Different researchers and experts have therefore each focused on a different subset of metrics to analyse and predict fault proneness.
Figure 1: Object oriented metrics
(Source: Singh, Bhatia and Singhrova 2018)
Machine learning is a set of algorithms and programming models that combine mathematics, statistics, computer science and data science. The core principle of these models is to enable devices to learn by themselves, a capability associated with Artificial Intelligence (AI), with the help of hidden patterns in datasets [5]. The key focus of these algorithms is to predict the behaviour of a data pattern, which can serve purposes such as economic and financial prediction, disease identification, drug discovery, and intrusion and cyber security analytics. Such algorithms can find patterns in datasets through various approaches, including classification, clustering and regression [6].
Research Question
According to Singh, Bhatia and Singhrova [7], machine learning models can play a major role in fault prediction because of their ability to find hidden patterns in a dataset. As already noted in this background study, experts and developers consider various metrics and datasets [11]. In their wide-ranging literature study, Singh, Bhatia and Singhrova [7] survey the common tools and techniques used by existing fault prediction models. A key task is choosing the machine learning algorithm on which the fault detection model is built. Some of the key models used for this purpose are Logistic Regression (a regression-based classifier), Random Forest (an ensemble classifier), and neural networks [7] [8] [9]. The programming languages most commonly targeted by fault prediction studies are C++ and Java. The most popular metrics used for fault prediction are the CK, Briand, conceptual cohesion, McCabe, and Halstead metrics [7] [10]. An appropriate dataset plays a vital role in training the model. Some of the most important datasets for training OO fault prediction models are the Metrics Data Program (MDP) developed by NASA, the PROMISE repository and the Qualitas Corpus [7].
Figure 2: Object oriented fault prediction tools
(Source: Singh, Bhatia and Singhrova 2018)
Therefore, for a future professional, taking on the task of fault prediction for OO systems with the help of machine learning is a strategic challenge that helps in aligning with contemporary industry practices. To achieve this, a comparative analysis of the most commonly used techniques, such as Random Forest, Logistic Regression and Neural Network, can be carried out. The key factor here, as understood from the background research, would be to identify an appropriate dataset, a programming language and an OO metric with which to proceed.
The key research question to be addressed in the project would hence be:
- How efficient are the common models (Random Forest, Logistic Regression and Neural Network) of OO system fault prediction?
As understood from the summary of the background study and the research question developed, the focus of the proposed project is to conduct a comparative study of three commonly used fault prediction ML models for OO systems. This would first require finding an appropriate dataset for fault prediction training and splitting it into training and validation (or testing) datasets. Next, three prediction models are to be developed, one each using Random Forest, Logistic Regression and Neural Network, by training them on the same training set. Finally, efficiency measures such as accuracy, precision and recall will be evaluated for the three models.
- To identify an appropriate dataset for training the OO system fault detection models
- To split the dataset into training and validation datasets
- To train three prediction models based on Random Forest, Logistic Regression and Neural Network separately
- To validate the three models and compare their efficiency measures (such as accuracy, precision and recall)
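The dataset-splitting objective can be illustrated with a minimal sketch, written in Python purely for illustration. It assumes a stratified split, which keeps the proportion of faulty and non-faulty classes the same in both subsets; the proposal itself does not prescribe a splitting strategy. The function name `stratified_split`, the 70/30 ratio, the seed and the toy rows (two metric-like columns and a fault flag) are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_index, test_fraction=0.3, seed=42):
    """Split rows into (train, validation) while preserving the class ratio.

    rows          -- list of tuples, one per class/module of the OO system
    label_index   -- position of the binary fault flag (0 or 1) in each tuple
    test_fraction -- share of each class sent to the validation set (assumed 30%)
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the rows by their fault label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_index]].append(row)

    train, validation = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        n_val = round(len(group) * test_fraction)
        validation.extend(group[:n_val])
        train.extend(group[n_val:])

    # Shuffle again so the classes are interleaved within each subset.
    rng.shuffle(train)
    rng.shuffle(validation)
    return train, validation


# Hypothetical toy data: (metric_a, metric_b, faulty). 20 of 100 rows are faulty.
rows = [(i, i % 7, 1 if i % 5 == 0 else 0) for i in range(100)]
train, validation = stratified_split(rows, label_index=2)
print(len(train), len(validation))
```

Using the same `train` list for all three models keeps the comparison fair, since every model sees identical training records.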
Aim
The key deliverables of the project work to be developed are:
- Three trained prediction models for fault detection in OO systems
- A comparison of the performance of these three models based on accuracy, precision and recall scores
The overall methodology follows from the key deliverables identified in this proposal. It is a machine learning methodology in which prediction models must be developed. The models considered here are based on a classification approach. Classification is a supervised machine learning approach in which the training dataset has a class variable, or attribute, that assigns each record to a class or category [12]. In the current project, the class variable will be the identifier of system faults, a binary class (0 and 1, or negative and positive). Each model would be fitted to these class labels during training and would then predict the system fault instances in the validation dataset.
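To make the idea of a binary fault classifier concrete, the following is a minimal sketch of one of the three techniques, Logistic Regression, trained by batch gradient descent in plain Python. It is an illustration under stated assumptions rather than the project's implementation: the function names, learning rate, epoch count and the six toy records (two metric values scaled to the range 0 to 1) are hypothetical, and a real study would more likely rely on an established machine learning library.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Return 1 (fault-prone) when the predicted probability reaches the threshold."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
            for xi in X]


# Hypothetical scaled metric values; label 1 marks a fault-prone class.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y_train = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(X_train, y_train)
print(predict([[0.15, 0.15], [0.85, 0.8]], w, b))
```

The 0/1 output of `predict` corresponds to the negative and positive fault classes described above, which is the form the evaluation measures expect.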
There are three key measures of comparative analysis here (as discussed already) which are:
- Accuracy – the accuracy of a machine learning model is the ratio of correctly predicted fault flags (both positive and negative) to the total number of instances in the dataset; it indicates how reliably the system predicts system faults overall
- Precision – precision is the extent to which the cases predicted as faults are actual faults; it is measured as the ratio of true positive cases to the total positive cases predicted (i.e., true positive + false positive)
- Recall – recall is the extent to which the system detects the actual cases of system faults; it is measured as the ratio of true positive cases to the actual positive cases (i.e., true positive + false negative)
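The three measures above can be computed directly from the counts of true and false positives and negatives. The sketch below is a minimal Python illustration; the function names and the example label lists are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something the proposal specifies.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives and false negatives."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall for binary fault predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,  # assumed 0.0 if no positives predicted
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,     # assumed 0.0 if no actual positives
    }


# Hypothetical validation labels (1 = faulty) and one model's predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
scores = evaluate(y_true, y_pred)
print(scores)
```

Calling `evaluate` once per model on the same validation labels would give the side-by-side figures needed for the comparative analysis.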
Certain challenges must be tackled while planning and executing the project, such as:
- The quality and relevance of the chosen dataset are major constraints in this project, because both the efficiency of the models to be developed and the validation process depend on the dataset itself
- The volume of the dataset is also a major factor in the model development process. A larger training dataset generally yields a more accurate trained model. On the other hand, a large dataset requires more computing power and can therefore hamper the feasibility of the project
- Finally, the computing power of the available resources will also affect the feasibility of the project
Figure 3: Project schedule
(Source: Created by Author)
Software resources
Programming language – Java
Tool – JCMT
Fault measurement metric – CK metric
Hardware resources
Processor – Intel Core i5 (7th generation) or above (64-bit)
RAM – 8 GB (minimum)
Disc space – 50 GB
Access
Access to an appropriate dataset for developing the prediction model.
References
[1] H. Hourani, H. Wasmi and T. Alrawashdeh, A code complexity model of object oriented programming (OOP). In 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT) (pp. 560-564). IEEE. April 2019.
[2] M. Mahoney, Object-Oriented Programming. 2021.
[3] X. Yuan, B. Cai, Y. Ma, J. Zhang, K. Mulenga, Y. Liu and G. Chen, Reliability evaluation methodology of complex systems based on dynamic object-oriented Bayesian networks. IEEE Access, 6, pp.11289-11300. 2018.
[4] R. Padilla, S.L. Netto and E.A. Da Silva, A survey on performance metrics for object-detection algorithms. In 2020 International Conference on Systems, Signals and Image Processing (IWSSIP) (pp. 237-242). IEEE. July 2020.
[5] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto and L. Zdeborová, Machine learning and the physical sciences. Reviews of Modern Physics, 91(4), p.045002. 2019.
[6] C. Janiesch, P. Zschech and K. Heinrich, Machine learning and deep learning. Electronic Markets, 31(3), pp.685-695. 2021.
[7] A. Singh, R. Bhatia and A. Singhrova, Taxonomy of machine learning algorithms in software fault prediction using object oriented metrics. Procedia Computer Science, 132, pp.993-1001. 2018.
[8] P. Zhang, S. Shu and M. Zhou, An online fault detection model and strategies based on SVM-grid in clouds. IEEE/CAA Journal of Automatica Sinica, 5(2), pp.445-456. 2018.
[9] Z. Lin, Y.F. Wu, S.V. Peri, W. Sun, G. Singh, F. Deng, J. Jiang and S. Ahn, Space: Unsupervised object-oriented scene representation via spatial attention and decomposition. arXiv preprint arXiv:2001.02407. 2020.
[10] S. Jha, R. Kumar, M. Abdel-Basset, I. Priyadarshini, R. Sharma and H.V. Long, Deep learning approach for software maintainability metrics prediction. IEEE Access, 7, pp.61840-61855. 2019.
[11] Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li and A.K. Nandi, Applications of machine learning to machine fault diagnosis: A review and roadmap. Mechanical Systems and Signal Processing, 138, p.106587. 2020.
[12] A. Churcher, R. Ullah, J. Ahmad, S. Ur Rehman, F. Masood, M. Gogate, F. Alqahtani, B. Nour and W.J. Buchanan, An experimental analysis of attack classification using machine learning in IoT networks. Sensors, 21(2), p.446. 2021.