Answer
The proposed research topic, based on the ‘Student Performance Data’, is:
‘The major contributing factors behind the final grade of students studying Mathematics and Portuguese language’.
The data were collected from an online source, the UCI Machine Learning Repository (University of California, Irvine). The data set is therefore secondary data for the researcher. The repository provides raw data that are regarded as reliable and authentic.
The data were gathered by Paulo Cortez (University of Minho, Portugal) and are now freely available in the UCI repository.
The data describe student achievement at two Portuguese schools. The attributes cover student demographics, social factors, school-related features and student grades. The data set was compiled from school reports and questionnaires.
The data set includes both quantitative and qualitative variables. Many variables are nominal (e.g. sex, address and guardian), and some are binary (e.g. schoolsup, nursery, internet and romantic). Some variables are quantitative, such as ‘age’, ‘G1’, ‘G2’ and ‘G3’. A few variables are coded on ordinal scales: ‘studytime’ takes four levels, while ‘freetime’, ‘Dalc’, ‘Walc’ and ‘goout’ are measured on five-point Likert-type scales.
The student performance data set is multivariate. Its attributes are listed as integer type. The data set has 649 instances and 33 attributes. Importantly, this ‘Social’ data set contains no missing values. The attributes, apart from the three grades G1, G2 and G3, are:
- ‘School’: Name of the school of the student.
- ‘Sex’: Sex of the student.
- ‘Age’: Age of the student.
- ‘Address’: Home address type of the student.
- ‘Famsize’: Family size of the students.
- ‘Pstatus’: Cohabitation status of the parents.
- ‘Medu’: Education of mother.
- ‘Fedu’: Father’s education.
- ‘Mjob’: Job of mother.
- ‘Fjob’: Job of father.
- ‘Reason’: Reason to choose the school.
- ‘Guardian’: Guardian of the student.
- ‘Travel-time’: Travelling time from home to school.
- ‘Study-time’: Weekly study time.
- ‘Failures’: Number of past class failures.
- ‘Schoolsup’: Whether extra educational support is available or not.
- ‘Famsup’: Family educational support is available or not.
- ‘Paid’: Extra paid classes within the course of the subject or not.
- ‘Activities’: Whether the student takes part in extra-curricular activities or not.
- ‘Nursery’: Whether the student attended nursery school or not.
- ‘Higher’: Whether the student wants to pursue higher education or not.
- ‘Internet’: Whether internet access is available or not.
- ‘Romantic’: Whether student is in romantic relationship or not.
- ‘Famrel’: Quality of family relationships.
- ‘Freetime’: Free time after school.
- ‘Goout’: Going out with friends.
- ‘Dalc’: Workday alcohol consumption.
- ‘Walc’: Weekend alcohol consumption.
- ‘Health’: Current health status.
- ‘Absences’: Number of school absences.
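The description above (number of instances, number of attributes, no missing values) can be checked programmatically. The following is a minimal sketch using only the Python standard library. It assumes the repository files are semicolon-delimited text; the helper names `load_rows` and `missing_counts` and the two sample rows are hypothetical placeholders for illustration, not real records from the data set.

```python
import csv
import io

# Illustrative excerpt in a semicolon-delimited layout (an assumption about
# the file format). The rows are made-up placeholders, not real records;
# one G2 cell is left empty on purpose to show the missing-value check.
SAMPLE = """school;sex;age;studytime;G1;G2;G3
GP;F;18;2;0;11;11
MS;M;17;1;12;;13
"""

def load_rows(text):
    """Parse semicolon-delimited text into a list of dicts, one per student."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))

def missing_counts(rows):
    """Count empty cells for each attribute."""
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts

rows = load_rows(SAMPLE)
print(len(rows), "instances,", len(rows[0]), "attributes")
print(missing_counts(rows))
```

Applied to the full file instead of `SAMPLE`, the same two checks would be expected to report 649 instances, 33 attributes and zero missing cells, if the description above holds.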
First, the variables and factors are compared in this analysis. The differences between the Mathematics and Portuguese language data sets are then investigated, including differences in the predictors and in how well they predict the outcome, as well as the variability of the models fitted to the two data sets. The target attribute is G3. G1, G2 and G3 denote the first period grade, the second period grade and the final grade, respectively. A strong association of G3 with G1 and G2 is investigated, because G3 is much more difficult to predict without G1 and G2.
The data set provides information on performance in two distinct subjects: Mathematics (Mat) and Portuguese language (Por). The data sets lend themselves mainly to regression and classification analysis. The classification could be binary or five-level. The regression analysis could be multiple linear regression or logistic regression, and it estimates the strength of the main and interaction effects of the predictor variables. These effects could be linear or non-linear.
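The binary and five-level classification targets mentioned above can both be derived from G3. The sketch below is illustrative only: the pass mark of 10 and the five grade bands (16 and above, 14 to 15, 12 to 13, 10 to 11, below 10) are assumptions about a plausible cut-off scheme on the 0 to 20 scale, not thresholds stated in this report, and the function names are hypothetical.

```python
def binary_label(g3, pass_mark=10):
    """Binary target: 'pass' if the final grade reaches the assumed pass mark."""
    return "pass" if g3 >= pass_mark else "fail"

# Assumed five-level bands as (lower bound, label), highest band first.
FIVE_LEVEL_BANDS = [(16, "I"), (14, "II"), (12, "III"), (10, "IV"), (0, "V")]

def five_level_label(g3):
    """Five-level target: the first band whose lower bound the grade reaches."""
    for lower, label in FIVE_LEVEL_BANDS:
        if g3 >= lower:
            return label
    raise ValueError("grade must be non-negative")

for grade in (8, 11, 15, 19):
    print(grade, binary_label(grade), five_level_label(grade))
```

Keeping the bands in a single table makes it easy to change the cut-offs if a different grading convention is preferred.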
Required Software
‘Classification’ is used to predict a label and ‘Regression’ is used to predict a quantity. Predictive modelling is about learning a mapping function from inputs to outputs, and classification and regression are its two main forms. When a case can be assigned more than one label, the task is a multi-label classification problem. A classification model can take either discrete or real-valued input variables, and a problem with two classes is called binary classification. Classification accuracy is an essential measure for a classification predictive model (Vijiyarani & Sudha, 2013). A classification algorithm might predict a continuous value; however, that value takes the form of a probability for a class label. Both binary and multiclass classification are possible with this data set. A regression model can also have real-valued or discrete input variables, and it requires the prediction of a quantity. A regression algorithm might predict a discrete value, in the form of an integer quantity.
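The difference between predicting a label and predicting a quantity also shows in how the two kinds of model are scored. As a minimal sketch, classification accuracy is the share of correctly predicted labels. For regression, the root mean squared error (RMSE) is used here as a common error measure; choosing RMSE is an assumption of this sketch, since the report does not name a regression metric. The function names are hypothetical.

```python
import math

def accuracy(true_labels, predicted_labels):
    """Share of cases whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

def rmse(true_values, predicted_values):
    """Root mean squared error between true and predicted quantities."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))

# Placeholder predictions for illustration only.
print(accuracy(["pass", "fail", "pass", "pass"], ["pass", "pass", "pass", "fail"]))
print(rmse([10, 14], [12, 12]))
```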
When dealing with a machine learning problem, a classification or regression model analyses the target variable (Y) with respect to the input predictor variables (X). Both kinds of model predict the value of the dependent attribute from the predictive factors (Breiman, 2017). The only difference is that the dependent attribute is numerical for regression and categorical for classification. The target variable therefore determines which type of decision tree (regression tree or classification tree) is needed. Nominal target variables call for a classification tree; ordinal and numerical target variables are used with a regression tree (Naik & Samant, 2016).
The machine learning software ‘RapidMiner’ will be used to analyse the data sets and variables. Regression and classification models can be built and run easily with this software (Goyal, 2014).
Modelling the data and fitting regression or classification trees would make the research report informative. The analytical, model-based report would be helpful for non-profit organisations and other researchers. Ideas for more advanced models could also originate from the research report.
To carry out the analysis, a descriptive analysis of the variables involved in the study was conducted first. Of the 33 variables in the data set, several are categorical and several are numerical. The main aim of the task is to build a prediction model, and regression analysis has been used to develop it. Before that, a descriptive summary of each variable was produced. The categorical variables are summarised with bar graphs, illustrated in the following figures, and the numerical variables are summarised in Table 1.
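The two kinds of descriptive summary described above can be sketched in a few lines: minimum, maximum and average for a numerical variable, and category frequencies (the counts behind a bar graph) for a categorical variable. This is a minimal illustration with the Python standard library, not the RapidMiner workflow used for the report; the function names and the sample values are hypothetical placeholders.

```python
from collections import Counter
from statistics import mean

def summarise_numeric(values):
    """Minimum, maximum and average of a numerical variable."""
    return {"min": min(values), "max": max(values), "mean": round(mean(values), 2)}

def summarise_categorical(values):
    """Frequency of each category, i.e. the bar heights of a bar graph."""
    return dict(Counter(values))

# Placeholder values for illustration only, not taken from the data set.
ages = [15, 16, 17, 18]
sexes = ["F", "M", "F", "F"]

print(summarise_numeric(ages))
print(summarise_categorical(sexes))
```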
Opportunity of the Analysis
Table 1: Summary of the Numerical Variables
| Variables | Minimum | Maximum | Average |
|---|---|---|---|
| Age | 15 | 22 | 16.74 |
| Mother’s Education (Medu) | 0 | 4 | 2.5 |
| Father’s Education (Fedu) | 0 | 4 | 2.3 |
| Travel Time | 1 | 4 | 1.57 |
| Study Time | 1 | 4 | 1.93 |
| Failures | 0 | 3 | 0.22 |
| Quality of family relationships (famrel) | 1 | 5 | 3.93 |
| Free time after school (freetime) | 1 | 5 | 3.18 |
| Going out with friends (goout) | 1 | 5 | 3.19 |
| Workday alcohol consumption (Dalc) | 1 | 5 | 1.5 |
| Weekend alcohol consumption (Walc) | 1 | 5 | 2.28 |
| Health | 1 | 5 | 3.56 |
| Absences | 0 | 32 | 3.66 |
| G1 | 0 | 19 | 11.399 |
| G2 | 0 | 19 | 11.57 |
| G3 | 0 | 19 | 11.91 |
All the numerical variables are now considered in order to evaluate their impact on the grades of the students. The grades are available for two subjects, Mathematics and Portuguese. First, the impact of all the variables on the first period and second period grades is assessed; then the impact of the first period and second period grades on the final grade is evaluated, separately for each subject.
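The second modelling step, regressing the final grade on the first and second period grades, can be illustrated with ordinary least squares. The sketch below solves the normal equations in plain Python; it is not the RapidMiner procedure used for the report, and the helper names and data points are hypothetical. The placeholder G3 values are generated from the assumed rule G3 = 1 + 0.2·G1 + 0.8·G2 purely so that the fit can be checked, and do not reflect the fitted models of this study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(predictors, target):
    """Return [intercept, b1, b2, ...] minimising the squared error."""
    design = [[1.0] + list(row) for row in predictors]  # add intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefficients, row):
    """Predicted value for one row of predictor values."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))

# Placeholder (G1, G2) pairs; G3 follows the assumed rule stated above.
g1_g2 = [(10, 11), (12, 12), (8, 9), (15, 14), (11, 13)]
g3 = [1 + 0.2 * a + 0.8 * b for a, b in g1_g2]

coef = fit_ols(g1_g2, g3)
print([round(c, 3) for c in coef])
print(round(predict(coef, (13, 12)), 3))
```

The same `fit_ols` call works for the first step as well, with the significant numerical variables as predictors and G1 or G2 as the target.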
For the prediction of the first period grades in Mathematics, the numerical variables found to be significant are Mother’s Education, Study Time, Failures and Workday alcohol consumption. The impact of all the other numerical variables is insignificant. The prediction model is given by the following regression equation:
For the prediction of the second period grades in Mathematics, the numerical variables found to be significant are Mother’s Education, Study Time and Failures. Workday alcohol consumption is not significant in this model. The impact of all the other numerical variables is insignificant. The prediction model is given by the following regression equation:
Further, both the first period and the second period grades are significant in predicting the final grades in Mathematics. The prediction equation is given as follows:
For the prediction of the first period grades in Portuguese, the numerical variables found to be significant are Study Time, Failures and Going out with friends (goout). The impact of all the other numerical variables is insignificant. The prediction model is given by the following regression equation:
For the prediction of the second period grades in Portuguese, the numerical variables found to be significant are Mother’s Education, Travel Time, Going out with friends (goout) and Failures. The impact of all the other numerical variables is insignificant. The prediction model is given by the following regression equation:
Further, both the first period and the second period grades are significant in predicting the final grades in Portuguese. The prediction equation is given as follows:
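The fitted coefficients of the six models are not reproduced in this text. The structure implied by the significant predictors named above can nevertheless be written in general form. In the sketch below, the intercepts, the error terms and the coefficient symbols are assumptions; the coefficients are left unspecified and are estimated separately for each equation.

```latex
\begin{align*}
G1_{\mathrm{Mat}} &= \beta_0 + \beta_1\,\mathrm{Medu} + \beta_2\,\mathrm{studytime} + \beta_3\,\mathrm{failures} + \beta_4\,\mathrm{Dalc} + \varepsilon \\
G2_{\mathrm{Mat}} &= \beta_0 + \beta_1\,\mathrm{Medu} + \beta_2\,\mathrm{studytime} + \beta_3\,\mathrm{failures} + \varepsilon \\
G3_{\mathrm{Mat}} &= \beta_0 + \beta_1\,G1_{\mathrm{Mat}} + \beta_2\,G2_{\mathrm{Mat}} + \varepsilon \\
G1_{\mathrm{Por}} &= \beta_0 + \beta_1\,\mathrm{studytime} + \beta_2\,\mathrm{failures} + \beta_3\,\mathrm{goout} + \varepsilon \\
G2_{\mathrm{Por}} &= \beta_0 + \beta_1\,\mathrm{Medu} + \beta_2\,\mathrm{traveltime} + \beta_3\,\mathrm{goout} + \beta_4\,\mathrm{failures} + \varepsilon \\
G3_{\mathrm{Por}} &= \beta_0 + \beta_1\,G1_{\mathrm{Por}} + \beta_2\,G2_{\mathrm{Por}} + \varepsilon
\end{align*}
```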
References:
Breiman, L. (2017). Classification and regression trees. Routledge.
Goyal, V. K. (2014). A Comparative Study of Classification Methods in Data Mining using RapidMiner Studio. International Journal of Innovative Research in Science & Engineering (IJIRSE).
Naik, A., & Samant, L. (2016). Correlation review of classification algorithm using data mining tool: WEKA, Rapidminer, Tanagra, Orange and Knime. Procedia Computer Science, 85, 662-668.
Vijiyarani, S., & Sudha, S. (2013). An efficient classification tree technique for heart disease prediction. In International Conference on Research Trends in Computer Technologies (ICRTCT-2013), Proceedings published in International Journal of Computer Applications (IJCA) (0975–8887) (Vol. 201).