2. Clean Up Your Dataset with Google Sheets



Tools: SilkOpen RefineChrome BrowserChrome Scraper Extension, Google Sheets

Useful LinksData Source, Spreadsheet with Scraped and Cleaned DatasetResulting Silk


A) Clean up the "Scrape Job" sheet:

  1. Split up Column "Job Compensation Scrape":

    • Enter the first empty column (column E) on the 2nd row of the worksheet "Scrape Jobs". Enter the formula =SPLIT(C2," · ",false) and copy this formula down to all rows with data.

    • Copy the 2 resulting columns and Paste Special > Paste Values Only. Name column E "Job Compensation"; name column F "Equity".

  2. To create an "Equity" column: 

    • First add a filter to all the columns of the worksheet. Select the filter of column "Equity" and select all blank cells. You'll see that this filtered view shows that column "Job Compensation" contains some equity values. Copy this selection and paste them in column "Equity". After this select these selected cells in column "Job Compensation" and hit delete. Empty the filter on column "Equity".

  3. Split up column "Tags Scrape":

    • Enter the first empty column (column G) on the 2n row. Enter the formula =SPLIT(D2," · ",false) and copy the formula down to all rows with data.

    • Copy the resulting columns and Paste Special > Paste Values Only. You'll see a column with Contract info ("Full Time, Internship, etc) > name this column "Type of Contract". 

    • Name the column next to it "Remote?" and the one next to that "Tags".

  4. Let's create a "Location" column:

    • Add a column between "Remote?" and "Tags" and name this one "Location". Filter column "Remote?" on only results with "Remote OK". Note that this selection shows location info in the column "Tags". Copy these values and paste them in column "Location". Select the results in column "Tags" and hit delete. Now filter column "Remote?" on all results except "Remote OK". Copy the location info you'll see in this column into the column "Location" and afterwards delete the selection.

  5. Clean up the "Remote" column:

    • Filter column "Remote?" on results with "Remote OK" only and rename these values to "Yes". Now select all blank results in this column and rename these "Unspecified".

  6. Clean up the "Location" column:

    • Filter column "Location" for any values that are not actual locations (for instance "remote" or "anywhere"). Empty these cells.

  7. Create a "Tags" column:

    • Add a column between "Location" and "Tags". Enter on the 2nd row of this column the formula =JOIN(", ",K2:X2)  and copy this down to all rows with data. Copy these results and Paste Special > Paste Values Only in the column "Tags" (overwriting the results in that column is fine). This column with results will contain multiple trailing comma's: cleaning this is optional since these will be automatically removed during importing your data into Silk. To remove all trailing comma's:

      • Select column "Tags" and hit Command+F (on Macbook). Click "More Options" and enter in the Find bar " ," (without the quotation marks) and hit "Replace All".

      • Enter this formula in an empty column =IF(LEFT(J2,2)=", ",RIGHT(J2,LEN(J2)-2),J2) , assuming J is the column "Tags". Copy the resulting columns and Paste Special > Paste Values Only

  8. Create a "Hiring Company Link" column:

    • Find the first empty column and enter in the 2nd row the formula  =SPLIT(B2,"/jobs",false) and copy this down. Copy the first column of results of this formula (something like "https://angel.co/silk") and paste this in a new column next to column "Job Offer Link". Name the new column "Hiring Company Link".

  9. Copy all the edited results into a new sheet, called "Import Jobs".


B) Clean up the "Scrape Companies" sheet:

  1. Select column "Profile" and hit Command+F, click More Options. Enter in the Find bar "/jobs" (without the quotation marks) and hit "Replace All".

  2. Select the 2nd cell of an empty column and enter formula =TRIM(E2) , assuming column E is the column named "Location". Copy the formula down to all rows. Copy and Paste Special > Paste Values Only the results into the column "Location". 

  3. Select the column "Image" and hit Command+F, click More Options. Enter in the Find bar "-thumb_jpg.jpg" and in the Replace bar "-medium_jpg.jpg" and hit "Replace All".

  4. In this step we will try to clean up the Location column as much as possible, mostly with manual editing. First, Select the 2nd cell of an empty column and enter formula =SPLIT(E2,", ", FALSE) , assuming column E is the column named "Location". Copy and PCopy and Paste Special > Paste Values Only the results into the column you're working in. Remove any values that are not actual locations (like "remote" and "anywhere"). Go through all other values and try to add a country to the value if its a city: so if the value is "Amsterdam" rename the cell "Amsterdam, Netherlands". If the location is in the United States you can add the State name if you want (its more work, but recommended for better results and filtering options of your data!), so "Los Angeles" would become "Los Angeles, California, United States".

  5. Add columns "City", "Country", "Region" and optionally "US State" to the sheet and fill these with the split up results from the column "Location" you have just edited. We will go into more detail during the workshop.

  6. Copy all the edited results into a new sheet, called "Import Companies".


c) Link the Two Sheets by a Common Column.

  1. Link the two sheets through a key column: In the "Import Jobs" spreadsheet, create a column called “Hiring Company”

  2. Use the ‘FILTER’’ function to lookup the company name in the Company spreadsheet, using the “Hiring Company Link” column as the common value. (Fig. 1)

  3. Create a unique ID column by combining the Job Title with the Hiring Company (Fig. 2)

Fig. 1


Fig. 2


d) Currency Conversion in the "Import Jobs" Sheet

  1. Split the “Job Compensation” column into a “Min. Job Compensation” and a “Max. Job Compensation” column. Use the formula =IF(F2<>"", SPLIT(SUBSTITUTE(F2,"k","000")," - ",false),"") (This also converts $1K to $1000)

  2. Get the symbol of the currency the compensation is expressed in. Use the formula=LEFT(K2,1) assuming column K contains the Job Compensation values (change the formula accordingly if not).

  3. Create another sheet called “Exchange Rates”. Fill it in with three columns: “Currency Symbol”, “Currency Name”, “Exchange Rate to Dollars”. Fill the column "Currency Symbol" with the different symbols you have collected under step 2. Name the symbols appropriately in the column "Currency Name" and enter the following formula's in the column "Exchange Rate to US Dollars":

    • =GOOGLEFINANCE("EURUSD") for the row with the currency Euro. This formula gives a 'live' exchange rate of the currencies you have selected, in this case: "EURUSD" means the rate of 1 Euro converted in US dollars. Apply the appropriate formula for each currency symbol. For reference: Euro = EUR, British Pounds = GBP, Japanes Yen = JPY, Indian Rupee = INR.

  4. Go back to the "Import Jobs" sheet. Convert all min and max job compensations to dollars, using the VLOOKUP formula to find the exchange rates. Formulas used in this case:

    • Column K: =IF(F2<>"", SPLIT(SUBSTITUTE(F2,"k","000")," - ",false),"")

    • Column M: =LEFT(K2,1)

    • Column N: =IF(M2<>"", VLOOKUP(M2,'Exchange Rates'!A:C,3),"")

    • Column O: =IF($N2<>"",Substitute(K2,$M2,"")*$N2,"")

    • Column P: =IF($N2<>"",Substitute(L2,$M2,"")*$N2,"")

  5. Set the format of  the cells in these columns to Currency: US $

  6. Recalculate a “Job Compensation in $” column by joining the min and max compensations with a “ - “. Use the following formula =IF(F2<>"",JOIN(" - ",O2:P2),"")

  7. Calculate a “Median Compensation in $" column with the MEDIAN function. Use the following formula IF(F2<>"",MEDIAN(O2:P2),"").


Next: Analyze and Visualize Your Dataset >>