2 years ago

#48338


Wasian

Unable to remove whitespace from CSV data in Python (web scrape results)

I'm cleaning a DataFrame loaded from a CSV of LinkedIn job postings that I scraped. I first used str.strip().replace(' ', '') to remove leading/trailing whitespace as well as the spaces in between. This worked for most rows in my "Job Description" column, but some rows still had massive runs of leftover whitespace, something like this:

8. This range is provided by EVONA. Your actual pay will be based on your skills and experience —talk with your recruiter to learn more.                   
9.                                               ThisrangeisprovidedbyEPITEC.Youractualpaywillbebasedonyou

10. This range is provided by EPITEC. Your actual pay will be based on your skills and experience —talk with your recruiter to learn more.
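A detail that may matter here: `.str.strip()` returns a Series, so the `.replace(' ', '')` chained after it is `Series.replace`, which compares whole cell values, not the substring method `Series.str.replace`. The short sketch below shows the difference on made-up sample strings (not rows from the scraped CSV).

```python
import pandas as pd

# Hypothetical sample values, not taken from the scraped CSV.
s = pd.Series(["  pay range  ", "a b"])

# Series.replace matches entire cell values, so a cell is only changed
# if it is exactly ' '. Spaces inside longer strings are left alone.
whole_value = s.str.strip().replace(' ', '')

# Series.str.replace works on substrings inside each cell.
substring = s.str.strip().str.replace(' ', '', regex=False)

print(whole_value.tolist())  # ['pay range', 'a b']
print(substring.tolist())    # ['payrange', 'ab']
```

If that is what happened, the first cleaning step only stripped the ends of each string and never touched the interior spaces.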

Next I tried a regular expression that strips all whitespace, .str.replace(r'\s+', '', regex=True), to see if it would at least remove the leftover runs, but it only removed the whitespace in the rows that were already working properly. The result looked something like the following:

7.HarderMechanicalContractorshasanopeningforamechanicaldesignertosupportourclientsandconstructiontea
8...........................................................................................................................Actualpaymaybedifferent—t **the periods represent whitespace**
9.Description:*SolidworksreadblueprintsanddrawingsfromcustomerproducesketchsheetsAS-1100,CNCAerospacepartsSkills:*Solidworks,mechanicalengineering,readblueprints,pipeline,DesignAdditionalSkills&Qualifications:*Solidwork
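One possibility worth ruling out (an assumption, the post does not confirm it) is that the padding is not stored in the data at all. When pandas prints a Series it right-aligns string values to a common width, so a short string next to long ones is displayed with a long run of leading spaces. A minimal check, using a made-up stand-in for the cleaned column, is to look at `len()` and `repr()` of the raw values instead of the printed Series:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned 'Job Description' column.
cleaned = pd.Series(["Actualpaymaybedifferent", "Harder" * 20])

# repr() shows the raw string, including any invisible characters,
# and len() shows whether the padding is really part of the cell.
for value in cleaned.head(10):
    print(len(value), repr(value[:80]))

# Count the cells that still contain a character matched by \s.
remaining = cleaned.str.contains(r'\s', regex=True).sum()
print(remaining)
```

If `remaining` is 0 and the lengths look reasonable, the gaps seen in the printout come from display alignment and not from the strings themselves.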

I'll attach my current script below. Any suggestions are greatly appreciated, and thank you in advance!

import pandas as pd

path_to_file = '/Users/Desk/Desktop/Desk/mechanical_engineer_LinkedIn_test_scrape.csv'

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 400)
pd.set_option('max_colwidth', 300)

data = pd.read_csv(path_to_file)

data['Seniority'] = data['Seniority'].str.strip()
data['Seniority'] = data['Seniority'].replace('Full-time', 'N/A')
data['Job Description'] = data['Job Description'].str.strip().replace(' ', '')
data['Job Description'] = data['Job Description'].str.replace(r'\s+', '', regex=True)
#print(data.iloc[0:500, 6])
#print(data.head(100))
print(data['Job Description'].head(50))
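For comparison, here is a minimal sketch of a cleaning step that keeps the text readable by collapsing whitespace runs to a single space instead of deleting every space. The helper name `clean_text` and the list of zero-width characters are assumptions for illustration. Python's `\s` already matches non-breaking spaces, but it does not match zero-width characters such as U+200B, so those are removed separately in case the scrape contains them.

```python
import re

import pandas as pd

# Zero-width characters that \s does not match. These are assumed
# candidates; nothing in the post confirms they occur in the data.
INVISIBLE = re.compile('[\u200b\u200c\u200d\u2060\ufeff]')
WHITESPACE_RUN = re.compile(r'\s+')


def clean_text(column: pd.Series) -> pd.Series:
    """Drop zero-width characters, collapse whitespace runs, trim the ends."""
    return (
        column
        .str.replace(INVISIBLE, '', regex=True)
        .str.replace(WHITESPACE_RUN, ' ', regex=True)
        .str.strip()
    )


# Hypothetical sample value, not taken from the scraped CSV.
sample = pd.Series(["  This range\u200b is\n\n provided\xa0 by EPITEC.  "])
result = clean_text(sample).tolist()
print(result)  # ['This range is provided by EPITEC.']
```

With the script above, this would be applied as `data['Job Description'] = clean_text(data['Job Description'])`. Passing `''` instead of `' '` as the replacement for `WHITESPACE_RUN` would reproduce the remove-everything behavior from the original attempt.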

python

regex

pandas

whitespace

0 Answers
