Electric Zombies Presents
Salary Predictions
Tell Me More

About


What's this about?

This project is a predictive algorithm that estimates salary from the attributes a person provides about themselves. Given those attributes, the model predicts whether your income falls at or below 50k or above 50k.

What is our goal?

We've all been at crossroads when it comes to what we want to pursue; however, our A.I. can provide users with a clear path. Our goal is to guide users to their future career based on what they wish to accomplish with their education.

What is our Dataset about?

The data set is a classification on salary that determines whether a person makes less than or equal to 50K or greater than 50K based on their personal attributes.

Classification or Regression?

Classification refers to predicting a discrete label, such as a binary off-or-on outcome. Regression refers to predicting data that cannot be classified as binary, usually a continuous numerical quantity. Our problem is classification because this AI categorizes each person as earning either over 50k or at most 50k.

What is our MVP?

A prediction site that provides users with information regarding their salary. The data set shows the correlation between education, occupation, and salary, and the model predicts whether someone’s salary will be more or less than 50k based on the given information.

Who and Why?

We are the Electric Zombies, a group of six students attending an A.I. summer camp. One of our main projects in this camp is creating an A.I. model based on data sets. We decided to commit to a salary predictor as it can be helpful to those who may need a clear outcome towards their desired income based on their personal attributes such as occupation and education level.

Data Cleaning

Data cleaning is the process of preparing data for future analysis by changing or removing data that is incomplete, incorrect, irrelevant, duplicated, or improperly formatted. Data cleansing ensures quality of the data set and provides better accuracy in predictions. If data is not cleaned thoroughly enough, the accuracy of a model may be negatively impacted.

Removing Unwanted Columns

Because real-world data typically contains a lot of noise, and only some columns carry useful information, it is recommended that the uninformative columns be removed before performing any data analysis. For our data set, we removed columns that contained sources of income other than the main occupation in order to keep the features focused on what we wanted to predict.
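As a sketch, dropping such columns with pandas (the column names here are illustrative stand-ins, not necessarily the exact ones in our data set):

```python
import pandas as pd

# Toy frame standing in for the census data; values are made up.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "education": ["Bachelors", "HS-grad", "Masters"],
    "capital-gain": [2174, 0, 0],   # income source other than occupation
    "capital-loss": [0, 0, 0],      # income source other than occupation
    "salary": ["<=50K", "<=50K", ">50K"],
})

# Drop the columns describing income sources besides the main occupation.
df = df.drop(columns=["capital-gain", "capital-loss"])
print(df.columns.tolist())  # ['age', 'education', 'salary']
```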

Null Values

Real-world data almost always comes with gaps; these are called null values. They can be treated in many ways; however, one of the most widely used methods (and the one we used ourselves) is to impute the null values with the mean, median, mode, or nearby values.
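A minimal imputation sketch with pandas, assuming a numeric age column and a categorical workclass column (both names illustrative): numeric gaps are filled with the column median, categorical gaps with the mode.

```python
import pandas as pd

# Toy data with deliberate gaps (None = null value)
df = pd.DataFrame({
    "age": [39.0, None, 38.0, 53.0],
    "workclass": ["Private", "Private", None, "Private"],
})

# Impute: median for the numeric column, mode for the categorical one
df["age"] = df["age"].fillna(df["age"].median())
df["workclass"] = df["workclass"].fillna(df["workclass"].mode()[0])

print(df.isna().sum().sum())  # 0 nulls remain
```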

Encoding Data

Since real-world data contains both numerical and categorical columns, the categorical columns must be encoded before being fed to a machine learning model. Machine learning algorithms can only understand numeric data, so these categorical values are encoded as zeros and ones, making them recognizable to machines.
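For example, one-hot encoding with pandas turns a categorical column into 0/1 indicator columns (the education column here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Bachelors"]})

# One-hot encode: each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["education"])
print(encoded.columns.tolist())
# ['education_Bachelors', 'education_HS-grad']
```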

Exploratory Data Analysis

Exploratory data analysis is important for any business since it allows data scientists to analyze the data before making any assumptions. It ensures that the results produced are valid and applicable to business outcomes and goals. An EDA is a thorough examination meant to uncover the underlying structure of a data set and is important for a company because it exposes trends, patterns, and relationships that are not readily apparent.

Gender Distribution

Distribution of gender in >50K and <50K

Gender Pay Gap
Is there gender bias in pay?
Race
Does Race have any impact on salary earned?
Age Distribution
Is age one of the factors in the salary earned?
Higher Education
Does higher education translate into higher income?
Correlation Heat Map
Any correlation between features and salary?
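Questions like these can be sketched in code. A toy example with pandas, using made-up rows in place of the real data set: a cross-tabulation for the gender distribution, and a correlation (the basis of a heat map) between age and the salary label.

```python
import pandas as pd

# Toy stand-in for the census data; values are invented for illustration
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [39, 28, 50, 45],
    "salary": [">50K", "<=50K", ">50K", "<=50K"],
})

# Gender distribution within each salary class (bar-chart material)
print(pd.crosstab(df["salary"], df["sex"]))

# Encode the label numerically to compute a correlation (heat-map material)
df["over_50k"] = (df["salary"] == ">50K").astype(int)
print(df[["age", "over_50k"]].corr())
```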

Machine Learning Pipeline

To solve any machine learning problem, we need to follow certain steps, from acquiring the data to building models. Taken as a whole, these steps are called a pipeline.
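A minimal pipeline sketch with scikit-learn, offered as an illustration (the column names, toy data, and model choice are assumptions, not necessarily what we ran): encoding and modeling are chained into one object so the same steps apply at training and prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data; 1 means ">50K"
X = pd.DataFrame({
    "age": [25, 38, 47, 52, 31, 60],
    "education": ["HS-grad", "Bachelors", "Masters",
                  "Bachelors", "HS-grad", "Masters"],
})
y = [0, 0, 1, 1, 0, 1]

# One pipeline: encode the categorical column, then fit the classifier
pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(), ["education"])],
        remainder="passthrough")),   # numeric columns pass through untouched
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X.head(1)))
```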

Models


Logistic Regression


Logistic regression is a method used to predict a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent variable by analyzing the relationship between one or more existing variables. An "S"-shaped logistic function squeezes predictions between the two extreme values (0 and 1), and the height of the curve indicates the likelihood that a given observation belongs to the positive class.
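The "S"-shaped curve is the logistic (sigmoid) function; a tiny sketch:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5 -- the midpoint of the S curve
print(sigmoid(4))   # close to 1: strong evidence for the positive class
print(sigmoid(-4))  # close to 0: strong evidence for the negative class
```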


Confusion Matrix for Logistic Regression:




  • Accuracy : 82%
  • Precision : 0.87
  • Recall : 0.90
  • F1-Score : 0.88

Random Forests


Random Forests are an ensemble of decision trees that perform classification or regression. Each tree asks a series of true-or-false questions about the features and makes a class prediction, and the class with the most votes becomes the model's final prediction. By combining many largely uncorrelated trees that give individual outputs, the Random Forest produces a committee prediction that is more accurate than that of any individual tree.
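A minimal sketch with scikit-learn's RandomForestClassifier on made-up [age, education-num] rows (illustrative, not our actual training data): each of the trees votes, and `predict` returns the majority class.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy rows: [age, years of education]; label 1 means ">50K"
X = [[25, 9], [38, 13], [47, 14], [52, 13], [31, 9], [60, 14]]
y = [0, 0, 1, 1, 0, 1]

# 100 decision trees; the majority vote is the final prediction
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[45, 14]]))
```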


Confusion Matrix for Random Forests:




  • Accuracy : 82%
  • Precision : 0.87
  • Recall : 0.90
  • F1-Score : 0.88

Neural Network


A neural network is a method in artificial intelligence that teaches computers to process data in a way inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes, or neurons, in a layered structure resembling the brain. Day-to-day applications of neural networks include weather forecasting, credit scoring, and fraud detection systems.
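As an illustration, a small feed-forward network via scikit-learn's MLPClassifier on the same kind of made-up rows (a sketch, not our actual architecture):

```python
from sklearn.neural_network import MLPClassifier

# Toy rows: [age, years of education]; label 1 means ">50K"
X = [[25, 9], [38, 13], [47, 14], [52, 13], [31, 9], [60, 14]]
y = [0, 0, 1, 1, 0, 1]

# One hidden layer of 8 neurons between the input and output layers
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict([[45, 14]]))
```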


Confusion Matrix for Neural Networks:




  • Accuracy : 84%
  • Precision : 0.87
  • Recall : 0.92
  • F1-Score : 0.90

Classification Metrics



Confusion Matrix

A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table of the combinations of predicted and actual values.

              Predicted: 0            Predicted: 1
Actual: 0     True Negative (TN)      False Positive (FP)
Actual: 1     False Negative (FN)     True Positive (TP)
  • True Positives (TP): The number of positive instances correctly classified as positive. E.g., predicting an email as spam when it actually is spam.
  • False Positives (FP): The number of negative instances incorrectly classified as positive. E.g., predicting an email is spam when it actually is not spam.
  • True Negatives (TN): The number of negative instances correctly classified as negative. E.g., predicting an email is not spam when it actually is not spam.
  • False Negatives (FN): The number of positive instances incorrectly classified as negative. E.g., predicting an email is not spam when it actually is spam.
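The four counts can be read straight off scikit-learn's confusion_matrix; a small example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model's predictions

# confusion_matrix returns [[TN, FP], [FN, TP]]; ravel flattens it
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```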


Accuracy

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Accuracy is a good measure when the classes are balanced, i.e., when both classes occur in roughly equal numbers. For imbalanced classes, other metrics such as precision and recall give a better picture of how the model performs on new data.

Accuracy = \(\frac {TP+TN}{TP+TN+FP+FN}\)



Recall / Sensitivity

Recall explains how many of the actual positive cases we were able to predict correctly with our model. It is a useful metric in cases where False Negative is of higher concern than False Positive. It is important in medical cases where it does not matter whether we raise a false alarm but the actual positive cases should not go undetected!

Recall = \(\frac {TP}{TP+FN}\)


Precision

Precision explains how many of the cases predicted positive actually turned out to be positive. Precision is useful when False Positives are a higher concern than False Negatives, for example in music or video recommendation systems, e-commerce websites, etc., where wrong results could lead to customer churn and be harmful to the business.

Precision = \(\frac {TP}{TP+FP}\)


F1 Score

The F1-score (also sometimes called the F-measure) is a single performance metric that takes both precision and recall into account. It is calculated as the harmonic mean of the two metrics. Only when both precision and recall are high will the F1-score be high.

F1 Score = \(\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}\)
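Worked arithmetic for all four metrics, starting from hypothetical confusion-matrix counts (the numbers are invented for illustration):

```python
# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 90, 75, 10, 25

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 165 / 200 = 0.825
precision = tp / (tp + fp)                          #  90 / 100 = 0.9
recall = tp / (tp + fn)                             #  90 / 115 ≈ 0.783
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.837

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.825 0.9 0.783 0.837
```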



Conclusion


After analyzing our data, we found that of the classification models we tried, the neural network worked best for predicting salary. Our logistic regression and random forest models both had an accuracy of around 82%, while our neural network reached around 84%, making it the best of the three. From the analysis we can conclude that attributes such as education level, age, and race are related to the income a person receives.


Team

Aka the Electric Zombies


Aaron Chang

Experimenting until I find what works


Adam Ellington

The best way to predict the future is to create it


Amelia Lipcsei

Winner of the worst wifi award


Joshua Broyer

Model Analyst || Backend Developer || Musician


Karen Gerges

I did my best and that’s all that matters


Shellene Redhorse

Team Member


Vishnu Nelapati

Instructor

You Will Never Walk Alone

Copyright © AI Camp 2022