(Alternative option) 2. Clean Up Your Dataset with Open Refine

d) Currency Conversion in the "Import Jobs" Sheet

  1. Split the “Job Compensation” column into a “Min. Job Compensation” and a “Max. Job Compensation” column. Use the formula =IF(F2<>"", SPLIT(SUBSTITUTE(F2,"k","000")," - ",false),"") (This also converts $1K to $1000)

  2. Get the symbol of the currency the compensation is expressed in. Use the formula =LEFT(Cell_with_compensation,1)

  3. Create another sheet called “Exchange Rates”. Fill it in with three columns: “Currency Symbol”, “Currency Name”, “Exchange Rate to Dollars”. Fill this information in with each of the currencies you encountered in the job compensation column. (Fig. 3)

  4. Go back to the "Import Jobs" sheet. Convert all min and max job compensations to dollars, using the VLOOKUP formula to find the exchange rates. Formulas used in this case:

    • Column K: =IF(F2<>"", SPLIT(SUBSTITUTE(F2,"k","000")," - ",false),"")

    • Column M: =LEFT(K2,1)

    • Column N: =IF(M2<>"", VLOOKUP(M2,'Exchange Rates'!A:C,3),"")

    • Column O: =IF($N2<>"",Substitute(K2,$M2,"")*$N2,"")

    • Column P: =IF($N2<>"",Substitute(L2,$M2,"")*$N2,"")

  5. Set the format of  the cells in these columns to Currency: US $

  6. Recalculate a “Job Compensation in $” column by joining the min and max compensations with a “ - “.

  7. Calculate a “Median Compensation in $" column with the MEDIAN function

a) Clean up your "Jobs" Dataset in OpenRefine

Go to http://127.0.0.1:3333

  1. Download your “Jobs” sheet as a CSV and import it into OpenRefine. (Select UTF-8 as encoding)

  2. Clean the “Job Compensation” Column:

    • Edit Column -> Add Column Based on This Column ->Name the column "Equity"; Expression: value.split(" · ")[-1] 

    • Edit Cells -> Transform -> Expression: value.split(" · ")[0]

    • Facet -> Text Facet. Select all Cells with no Currency Symbol. Edit Cells -> Transform -> Expression: leave empty

  3. Clean the “Tags” Column

    • Edit Column -> Add Column Based on This Column -> Name column “Type of Contract”; Expression: value.split(‘ · ‘)[0]

    • Edit Column -> Add Column Based on This Column -> Name Column “Remote?”; Expression value.split(‘ · ‘)[1]

    • Edit Cells -> Transform -> Expression: replace(value,"Full Time · ","")

    • Edit Cells -> Transform -> Expression: replace(value,"Internship · ","")

    • Edit Cells -> Transform -> Expression: replace(value,"Cofounder · ","")

    • Edit Cells -> Transform -> Expression: replace(value,"Contract · ","")

    • Edit Cells -> Transform -> Expression: replace(value,"Remote OK · ","")

    • Edit Cells -> Transform -> Expression: replace(value,value.split(' · ')[0],"")

    • Edit Cells -> Split multi valued cells by “ · “

    • Edit Cells -> Split multi valued cells by “, “

    • Edit Cells -> Common Transforms -> To Titlecase

    • Facet -> Text Facet -> Cluster

    • Edit Cells -> Join Multiple Value Cells by “, “

  4. Clean “Remote?” Column

    • Facet -> Text Facet. Select the cell with “Remote OK”. Click “Invert”. Edit Cells -> Transform -> Expression: leave empty

    • Facet -> Text Facet. Edit “Remote OK” by replacing it with Yes. Edit "Blank Cells" by replacing with No.

  5. Add a column with a link to the hiring company for each job offer

    • Column “Job Offer Link” -> Add Column Based on This Column -> Name Column “Hiring Company Link”; Expression: value.split('/jobs/')[0] 

  6. Export your project as a CSV and import it into your Google Sheet as the "Import Jobs" sheet.

Fig. 3


b) Clean up your "Hiring Companies" Dataset in OpenRefine

  1. Download your “Hiring Companies” sheet as a CSV and import it into OpenRefine. (Select UTF-8 as encoding)

  2. Clean ”Profile” column

    • Edit Cells -> Transform -> Expression: value.split('/jobs')[0]

  3. Clean ”Tagline

    • Edit Cells -> Common Transforms -> Trim Leading and Trailing Whitespaces

  4. Clean ”Image

    • Edit Cells -> Transform -> Expression: replace(value,"-thumb_jpg.jpg","-medium_jpg.jpg")

  5. Clean ”Location

    • Edit Cells -> Common Transforms -> Trim Leading and Trailing Whitespaces

    • Edit Cells -> Split Multi-Valued Cells by “, “

    • Facet -> Text Facet. Use Facets to bulk clean up. Show the “Cluster” feature. But essentially this is manual work. Try to achieve a “City, (US State), County” format for location. Then use this to add a “US State”, a “Country" and a "Region" column

    • Edit Cells -> Joint Multi-Valued Cells by “|“

  6. Export your project as a CSV and import it into your Google Sheet as the "Import Companies" sheet.

Fig. 1


Fig. 2


c) Link the Two Sheets by a Common Column.

  1. Link the two sheets through a key column: In the "Import Jobs spreadsheet, create a column called “Hiring Company”

  2. Use the ‘FILTER’’ function to lookup the company name in the Company spreadsheet, using the “Hiring Company Link” column as the common value. (Fig. 1)

  3. Create a unique ID column by combining the Job Title with the Hiring Company (Fig. 2)