Data Science Jobs Data Cleaning
A first look at the raw data shows that it needs a fair amount of work before it can be used. Several columns, such as the ‘index’ column, are not needed.
Other columns need to be cleaned up, such as the ‘Salary Estimate’ column, which mixes numeric and string data.
The first thing I did was drop three columns that I judged irrelevant to the data set: ‘index’, ‘Founded’, and ‘Revenue’.
After this I cleaned the ‘Salary Estimate’ column by removing the extra text, such as the ‘K’ suffixes and ‘glassdoor est’.
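A minimal sketch of that salary cleanup is below. It assumes the raw values look like ‘$137K-$171K (Glassdoor est.)’, which is a guess at the format, and the function name clean_salary is only illustrative.

```python
import re

def clean_salary(raw):
    """Strip '$', 'K', and any parenthesised note such as '(Glassdoor est.)'."""
    # Drop anything in parentheses, e.g. "(Glassdoor est.)".
    text = re.sub(r"\(.*?\)", "", raw)
    # Remove dollar signs, the thousands marker 'K', and whitespace.
    text = re.sub(r"[$Kk\s]", "", text)
    return text

# Assumed example value; the real column may be formatted differently.
print(clean_salary("$137K-$171K (Glassdoor est.)"))  # 137-171
```

With pandas, a function like this could be passed to Series.apply on the ‘Salary Estimate’ column to build the new column.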
I then dropped two more columns: the ‘Competitors’ column, and the original ‘Salary Estimate’ column, whose cleaned values I had stored in a new column, ‘Estimated Salary’.
I also realized that missing values were recorded as -1 throughout the data set, so I replaced every -1 with null.
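As a rough illustration of the -1 replacement, here is a plain-Python sketch. The helper name to_missing and the sample row are hypothetical, and it assumes the placeholder can appear either as a number or as the string ‘-1’.

```python
def to_missing(value):
    """Return None for the -1 placeholder (number or string); otherwise return the value unchanged."""
    if value in (-1, "-1"):
        return None
    return value

# Hypothetical row used only for illustration.
row = {"Rating": -1, "Size": "-1", "Sector": "Finance"}
cleaned = {column: to_missing(value) for column, value in row.items()}
print(cleaned)  # {'Rating': None, 'Size': None, 'Sector': 'Finance'}
```

On a whole pandas DataFrame, DataFrame.replace can do the same substitution in one call.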
Next I wrote two functions: one pulls the city from the ‘Location’ column and the other pulls the state. I assigned the results to two new columns titled ‘City’ and ‘State’.
I then dropped the ‘Location’ column.
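The two location functions might look something like the sketch below. It assumes values of the form ‘City, ST’, and the names get_city and get_state are illustrative rather than the ones actually used.

```python
def get_city(location):
    """Return the text before the first comma, e.g. 'New York' from 'New York, NY'."""
    return location.split(",")[0].strip()

def get_state(location):
    """Return the text after the last comma, or None when there is no comma."""
    if "," not in location:
        return None
    return location.rsplit(",", 1)[1].strip()

print(get_city("New York, NY"))   # New York
print(get_state("New York, NY"))  # NY
```

Values without a comma, such as ‘Remote’, come back whole as the city and with no state, so they are worth checking by hand.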
Then I cleaned up the ‘Company Name’ column. Each value held the name of the company followed by its rating, and I wanted only the name. To do this I found the index of the newline character ‘\n’ that came before the rating and returned the string up to that index.
I followed that by dropping the original ‘Company Name’ column.
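A small sketch of the company name step is below. It assumes the separator before the rating is a newline character, and the function name strip_rating and the sample value are made up for illustration.

```python
def strip_rating(company):
    """Return the company name without the trailing newline-and-rating suffix."""
    idx = company.find("\n")
    # str.find returns -1 when there is no newline, so keep the whole string then.
    return company if idx == -1 else company[:idx]

# Hypothetical value: a name, a newline, then a rating.
print(strip_rating("Acme Corp\n3.5"))  # Acme Corp
```

Checking for -1 matters here: slicing with company[:-1] would silently cut the last letter off any name that has no rating attached.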
Finally I wrote a function that searches the ‘Job Description’ column and returns 1 if it finds a given keyword and 0 if it does not. I ran it for every skill I believed could be needed for a data science job, such as ‘sql’, ‘python’, ‘machine learning’, and a few more, and added each result as its own column at the end.
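The keyword check can be sketched roughly as follows. The names has_keyword, skill_flags, and SKILLS are illustrative, the skill list only includes the three skills named above, and the case-insensitive matching is an assumption.

```python
def has_keyword(description, keyword):
    """Return 1 if keyword appears in description (ignoring case), else 0."""
    return int(keyword.lower() in description.lower())

# Only the skills named in the text; the full list is longer.
SKILLS = ["sql", "python", "machine learning"]

def skill_flags(description):
    """Build one 0/1 flag per skill for a single job description."""
    return {skill: has_keyword(description, skill) for skill in SKILLS}

print(skill_flags("Experience with Python and SQL required."))
# {'sql': 1, 'python': 1, 'machine learning': 0}
```

This is a plain substring match, so ‘sql’ would also match inside ‘NoSQL’. Very short keywords would need a word-boundary check to avoid false positives.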