Analyzing the Datasets
CRISP-DM is an acronym for Cross Industry Standard Process for Data Mining, a framework commonly used in the data science field for data analysis. This paper demonstrates how the framework can be adapted to any data analysis task, using the Airbnb Amsterdam listing data as a case study. We follow the CRISP-DM model, working in Excel, to analyze the data. The datasets described below are analyzed to answer the following questions about Airbnb listings in Amsterdam: How do listing prices change across locations and times? What are the main drivers of the listing price? Which listing features influence customer satisfaction the most?
Three datasets from Airbnb Amsterdam are analyzed: listings.csv, containing details on listing features such as location, amenities, and reviews; calendar.csv, containing the availability calendar of each listing over one year; and reviews.csv, containing reviews posted by customers after their stay. Throughout the rest of the paper, we apply the CRISP-DM method to these datasets to answer the questions presented above.
The first step is business understanding of the problem. Depending on an organization's data science sophistication, this step can play out in two ways. It usually entails working sessions with business groups to better understand the challenges we are attempting to tackle with data; for this project, that is not an option. In other situations, the data must be analyzed first to determine which questions about business operations can be addressed. That is the situation here: the questions above were formulated after first examining the datasets to see how they are structured and what information they contain.
The next step is data understanding. In this step we begin to understand the data, the meaning of its columns, and how the values are generated. Linking each column to its real-world meaning gives us a much better grasp of the data.
From the raw datasets, we generate summary reports. Such an overview provides the number of rows and columns, the data type of each column, the count of unique values and missing values in each column, and statistics such as the mean, median, and standard deviation for numerical columns. A descriptive statistical report was created for the calendar data; in it, the listing id is a unique identifier for each Airbnb listing. In this way, both a qualitative and a quantitative perspective of the data can be formed.
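Although the paper performs the analysis in Excel, the same overview report can be sketched in Python with pandas. The sample rows below are hypothetical stand-ins for calendar.csv (the real file's columns are listing_id, date, available, and price):

```python
import pandas as pd

# Hypothetical sample rows standing in for calendar.csv.
calendar = pd.DataFrame({
    "listing_id": [2818, 2818, 20168, 20168],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "available": ["t", "f", "t", "t"],
    "price": ["$59.00", None, "$80.00", "$80.00"],
})

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Overview report: data type, unique count, and missing count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "unique": df.nunique(),
        "missing": df.isna().sum(),
    })

report = summarize(calendar)
print(report)
```

A report like this immediately reveals, for example, that price was read in as a string and contains missing values.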
The third step is data preparation. Data preparation encompasses a broad array of data manipulations and conversions that address inconsistencies such as null values and outliers, transform columns into the correct format, and fill cells with new information.
The summary report described above helps detect data inconsistencies such as incorrect data types and missing values. Price, for example, has been interpreted as a string because of the embedded $ sign, and it contains 665,853 missing values, as shown in the image above.
These are the most common data preparation steps applied to any dataset: change column data types, eliminate duplicate rows, remove columns with constant values, impute missing values, and encode categorical features. For the calendar data, the price is converted to a numerical value and 365 duplicate rows are dropped; for the listing data, we examine the columns and drop those with close to 100% missing values.
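These steps can be sketched in pandas as follows. This is a minimal illustration on hypothetical rows, not the paper's Excel workflow; the license column is an invented example of a column that is almost entirely missing:

```python
import pandas as pd

# Hypothetical rows standing in for calendar.csv.
calendar = pd.DataFrame({
    "listing_id": [2818, 2818, 2818],
    "price": ["$59.00", "$59.00", "$1,250.00"],
    "license": [None, None, None],   # example of a ~100%-missing column
})

# 1. Strip "$" and thousands separators, then cast price to float.
calendar["price"] = (calendar["price"]
                     .str.replace(r"[$,]", "", regex=True)
                     .astype(float))

# 2. Eliminate duplicate rows.
calendar = calendar.drop_duplicates()

# 3. Drop columns with close to 100% missing values.
mostly_missing = calendar.columns[calendar.isna().mean() > 0.9]
calendar = calendar.drop(columns=mostly_missing)
```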
Data Preparation
Many attributes in this dataset contain null values, including some important ones such as review_scores_communication, review_scores_accuracy, and bathrooms, to mention a few. If more than 90% of a column's values are missing, the column is dropped rather than imputed and is not used for analysis. Missing values in numeric columns are imputed with the column median, while missing values in categorical columns are imputed with the text "missing" and encoded later. Because most machine learning algorithms can only work with numeric data, all categorical and object data fields must first be converted to numeric values. Once the data has been prepared using the techniques outlined above, we can investigate the distribution of individual features and the relationships between features. This type of analysis is known as exploratory data analysis. As we uncover new ways to describe the data, this stage frequently leads to further data preparation steps.
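The imputation and encoding strategy just described can be sketched as follows; the rows are hypothetical, and one-hot (dummy) encoding stands in for whichever numeric encoding is chosen:

```python
import pandas as pd

# Hypothetical listing rows with missing values.
listings = pd.DataFrame({
    "bathrooms": [1.0, None, 2.0, 1.5],
    "room_type": ["Entire home/apt", None, "Private room", "Private room"],
})

# Numeric columns: impute with the column median.
listings["bathrooms"] = listings["bathrooms"].fillna(listings["bathrooms"].median())

# Categorical columns: impute with a literal "missing" label,
# then encode numerically (here via one-hot / dummy variables).
listings["room_type"] = listings["room_type"].fillna("missing")
encoded = pd.get_dummies(listings, columns=["room_type"])
```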
If we believe that adding features based on domain expertise will improve the quality of the insights and the model, and make them easier for the business to consume, we may do so; this step is called feature engineering. Before applying a supervised or unsupervised learning algorithm to the dataset, in addition to creating new features, techniques such as scaling or log transformation are performed. These steps are typically required for linear models and distance-based clustering approaches, where they aid algorithm convergence. Data preprocessing, data exploration, and feature engineering are frequently revisited throughout a project, depending on its requirements. Among the findings from the Airbnb data exploration: the price distribution is positively skewed, with most prices falling between $50 and $500 and a few extraordinarily high prices above $1,000, as illustrated in the graph below.
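For a skewed variable like price, the log transformation and scaling mentioned above can be sketched as follows (the price values are hypothetical, chosen to mimic a long right tail):

```python
import numpy as np

# Hypothetical nightly prices with a long right tail, like the Amsterdam data.
prices = np.array([50.0, 80.0, 120.0, 200.0, 450.0, 1500.0])

# log1p compresses the tail, which helps linear and distance-based methods.
log_prices = np.log1p(prices)

# Standard scaling (zero mean, unit variance) after the transform.
scaled = (log_prices - log_prices.mean()) / log_prices.std()
```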
In Amsterdam, entire homes or apartments account for over 60% of Airbnb listings, followed by private rooms, while shared rooms account for about 2%; this information is shown below.
Only 11 percent of Airbnb hosts are designated as Superhosts. To become a Superhost, a host must have a response rate of 90 percent or better, a 4.8 overall rating, a cancellation rate of 1% or lower, and at least ten completed trips or three completed reservations totaling at least 100 nights. This is visualized below.
Price levels differ considerably across the three accommodation types. Entire homes and apartments are usually the most expensive, with a median price of roughly $200, although they span a broad price range. Private rooms follow at around $90, and shared rooms at around $50. The cancellation policy for higher-priced listings is stricter. This is likely because the cost incurred by hosts in the case of a cancellation is greater for more expensive listings, and the stricter refund policy attracts consumers who are certain of their travel plans. Superhosts, as expected, have a higher mean review score than regular hosts.
The fourth step in the CRISP-DM model is modeling and evaluating the data (Huber et al., 2019). Depending on the problem we are solving, the solution may be structured as a machine learning model. The quantity of interest is treated as the target or dependent variable, while all other information is typically treated as predictor or independent variables. The model output can be used in prototype solutions in various ways, for example by using model predictions to support business processes such as sales forecasts and recommendation systems. One of the most important characteristics of a machine learning model is its ability to generalize to unseen data with high accuracy; without this, the model is useless. Model validation is the process of determining whether a predictive model has this quality.
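A minimal sketch of model validation on held-out data, using simulated bedroom counts and prices rather than the real listings (numbers and the linear relationship are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: price roughly linear in bedroom count plus noise.
bedrooms = rng.integers(1, 5, size=100).astype(float)
price = 60 * bedrooms + 40 + rng.normal(0, 10, size=100)

# Hold out 20% of the rows: the model never sees them during fitting.
split = 80
slope, intercept = np.polyfit(bedrooms[:split], price[:split], deg=1)

# Validate on the held-out rows with mean absolute error.
pred = slope * bedrooms[split:] + intercept
mae = np.abs(pred - price[split:]).mean()
```

A low error on the held-out rows (relative to the price scale) is evidence that the model generalizes rather than memorizes.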
Listing prices change across time and location in the following ways. Looking at Amsterdam's neighborhoods, one side of the city appears to be the most cost-effective, with median prices ranging from $50 to $100; these areas lie outside the city center, in the suburbs. Other neighborhoods, on the other hand, have median prices in the $300-$400 range; these areas are close to the city center and major tourist attractions. Prices also differ across the seasons, as shown.
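The seasonal view can be sketched by grouping the calendar's nightly prices by month; the dates and prices below are hypothetical placeholders for the parsed calendar data:

```python
import pandas as pd

# Hypothetical calendar rows with a parsed date and numeric price.
cal = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-15", "2021-01-20",
                            "2021-07-15", "2021-07-20"]),
    "price": [90.0, 110.0, 160.0, 180.0],
})

# Median nightly price per calendar month reveals the seasonal pattern.
monthly = cal.groupby(cal["date"].dt.month)["price"].median()
```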
The general pattern indicates that the holiday and winter season, which runs from November to March, is not a good time for business. This is likely because people prefer to spend time with their families and stay at home during the holidays. Knowing the factors that influence a listing's price can help hosts add new facilities and services, allowing them to charge a higher fee than comparable listings.
For this section, I created a machine learning model that predicts listing prices based on listing characteristics such as location, number of bedrooms, room type, and property type.
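A sketch of such a model, fitting a least-squares regression on simulated, already-encoded features (the feature names, effect sizes, and data are assumptions for illustration, not the paper's actual model):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated, already-encoded listing features.
bedrooms = rng.integers(1, 5, size=n).astype(float)
is_entire_home = rng.integers(0, 2, size=n).astype(float)
dist_to_center_km = rng.uniform(0, 10, size=n)

# Simulated price: bedrooms and room type raise it, distance lowers it.
price = (50 * bedrooms + 80 * is_entire_home - 8 * dist_to_center_km
         + 60 + rng.normal(0, 15, size=n))

# Least-squares fit; each coefficient estimates a feature's effect on price.
X = np.column_stack([bedrooms, is_entire_home, dist_to_center_km, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
```

Inspecting the fitted coefficients is one simple way to see which features drive the price and in which direction.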
The most important features are then evaluated and visualized to better understand their correlation with the price. New hosts, in turn, can use the model's predictions to estimate how much they can charge given their location and the amenities offered at their listing. Understanding what keeps guests happy can help hosts enhance the customer experience, which has a ripple effect across the ecosystem: customers are pleased and leave positive reviews, hosts receive more bookings, and Airbnb gains additional business and happier customers.
The last stage is to communicate the findings and deploy the model. This is where all of the previous analyses and models come into play: the model developed may be put into production, or the insights presented may be used to make business decisions. These are the essential steps of the CRISP-DM framework. Any new dataset can be difficult to analyze, but following the stages of this model helps organize the process.
The development of CRISP-DM reduces the need for professionals to rely on outdated data mining methods. The CRISP-DM model has become a useful methodology for analyzing big data because it provides clear guidelines for what to do at each stage. It can also be combined with other methodologies, such as SEMMA, to produce the desired output. From the data analyzed, many factors drive the differences in listing prices, including but not limited to location and season.
References
Page, G., Angelov, S., Ivanov, P. and Zlatev, V. Striking the balance: Teaching data mining with the right mixture of depth and breadth.
Huber, S., Wiemer, H., Schneider, D. and Ihlenfeldt, S., 2019. DMME: Data mining methodology for engineering applications – a holistic extension to the CRISP-DM model. Procedia CIRP, 79, pp.403-408.