Multivariate Linear Regression
This week is all about one-hot encoding and multiple regression.
Background materials
- Robust One-Hot Encoding in Python
- Feature Engineering and Selection ebook
- One-hot encoding multicollinearity and the dummy variable trap
- Emulating R Regression Plots in Python
- Statsmodels Regression Plot
- Building and evaluating models
- Test/Train Splits and Cross-validation
- Interpreting residual plots (Stattrek and Statwing)
Assignment
We will predict employee salaries from different employee characteristics (or features).
Import the data salary.csv into a Jupyter Notebook. A description of the variables is given in Salary metadata.csv. You will need the packages matplotlib/seaborn, pandas and statsmodels.
Steps and questions
- Perform some exploratory data analysis (EDA) by creating appropriate plots (e.g. scatterplots and histograms) to visualise and investigate relationships between the independent variables and the target/dependent variable (salary).
- Create a descriptive statistics table to further characterise and describe the population under investigation.
- Which variables seem like good predictors of salary?
- Do any of the variables need to be transformed to be able to use them in a linear regression model?
- Perform some basic feature engineering by one-hot encoding the variable Field into three dummy variables, using HR as the reference category.
- You can use pandas’ `get_dummies()` function for this (refer to “Background materials 1-3”).
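A minimal sketch of the one-hot encoding step with `get_dummies()`. Note that `drop_first=True` would drop the alphabetically first category rather than HR, so the HR column is dropped explicitly to make it the reference category and avoid the dummy variable trap:

```python
import pandas as pd

# Hypothetical sample of the data; values are made up for illustration
df = pd.DataFrame({"Field": ["HR", "IT", "Finance", "IT", "Sales", "HR"],
                   "salary": [40000, 55000, 60000, 52000, 48000, 42000]})

# One-hot encode Field, then drop Field_HR so HR becomes the reference category
dummies = pd.get_dummies(df["Field"], prefix="Field")
df = pd.concat([df.drop(columns="Field"),
                dummies.drop(columns="Field_HR")], axis=1)

print(sorted(c for c in df.columns if c.startswith("Field_")))
```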
- Perform correlation and statistical significance analysis to validate the relationship of salary to each of the potential predictor variables:
- Calculate the Pearson correlation coefficient and plot the corresponding correlation matrix
- Calculate the p-values related to the Pearson correlation coefficients
- Address any problems that may adversely affect the multiple regression (e.g. multicollinearity)
- Conduct some basic feature selection by aggregating the results from the EDA, correlation matrix and p-values. Justify your feature selection decisions.
- Split your data into a training and test set.
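The split might look like the following, using scikit-learn’s `train_test_split` (scikit-learn is not among the listed packages, so this is an assumption; a random pandas sample would also work):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(1)
df = pd.DataFrame({"experience": rng.uniform(0, 20, 100)})
df["salary"] = 30000 + 2000 * df["experience"] + rng.normal(0, 5000, 100)

X = df[["experience"]]
y = df["salary"]

# Hold out 20% of rows as a test set; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
```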
- Train model:
- Fit a multiple linear regression model using a training dataset with corresponding features selected above
- Use the model fitted on the independent variables selected above to predict salary for the training dataset.
- Interpret the standardised coefficients given in the statsmodels output.
- What are the most important features when predicting employee salary?
- Test model:
- Run your model on the test set.
- Evaluate model
- Calculate and explain the significance of the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE) and R-squared values for your model
- Calculate the standardised residuals (`resid`) and standardised predicted values (`fittedvalues`).
- Plot the residuals versus the predicted values using seaborn’s `residplot`, with the predicted values as the x parameter and the actual values as y; specify `lowess=True`.
- Are there any problems with the regression?
- Benchmark with cross-validation model
- Perform cross-validation using the training dataset, then test and evaluate the cross-validated model with the test data
- Compare the performance of the cross-validated model (less prone to over-fitting) against your original model to determine whether the developed model has overfitted
- Does it seem like you have a reasonably good model?
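The cross-validation benchmark might be sketched with scikit-learn’s `cross_val_score` (again an assumption, since scikit-learn is not among the listed packages), on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(4)
X = rng.uniform(0, 20, (200, 1))
y = 30000 + 2000 * X[:, 0] + rng.normal(0, 4000, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 5-fold cross-validated R-squared on the training data only
cv_scores = cross_val_score(LinearRegression(), X_train, y_train,
                            cv=5, scoring="r2")
print("mean CV R^2:", cv_scores.mean())

# Held-out test R-squared of a model fit on all training data
test_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
print("test R^2:", test_r2)
```

If the test R-squared is far below the cross-validated training R-squared, that gap is evidence of overfitting; scores of similar size suggest the model generalises reasonably well.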
References
Data is made up and inspired by Cohen, Cohen, West & Aiken. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd Edition.