2 years ago
#48338
Wasian
Unable to remove whitespace from a CSV column in Python (web scrape results)
I'm cleaning a dataframe loaded from a CSV of a web scrape I did on LinkedIn job postings. I first used str.strip().replace(' ', '') to remove leading/trailing whitespace as well as the spaces in between. This worked for most rows in my "Job Description" column, but some rows still had massive runs of leftover whitespace. Something like this:
8. This range is provided by EVONA. Your actual pay will be based on your skills and experience —talk with your recruiter to learn more.
9. ThisrangeisprovidedbyEPITEC.Youractualpaywillbebasedonyou
10. This range is provided by EPITEC. Your actual pay will be based on your skills and experience —talk with your recruiter to learn more.
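For reference, here is a minimal sketch of how the two replace calls differ, using made-up sample strings rather than my scraped data. As far as I understand it, Series.replace matches whole cell values, while Series.str.replace works on substrings inside each cell, so chaining .replace(' ', '') after .str.strip() may not touch the spaces inside a description at all.

```python
import pandas as pd

# Hypothetical sample values, not taken from the scraped CSV.
s = pd.Series(["  a b  ", " ", "c d"])

# Series.replace compares against the whole cell value,
# so spaces inside "a b" and "c d" are left alone.
whole_value = s.str.strip().replace(' ', '')

# Series.str.replace substitutes inside every string.
substring = s.str.strip().str.replace(' ', '', regex=False)

print(whole_value.tolist())
print(substring.tolist())
```
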
Next I tried a regular expression, .str.replace(r'\s+', '', regex=True), to see if it would at least remove all whitespace, but it only removed the whitespace in the rows that were already working properly. The result looked something like the following:
7.HarderMechanicalContractorshasanopeningforamechanicaldesignertosupportourclientsandconstructiontea
8...........................................................................................................................Actualpaymaybedifferent—t **the periods represent whitespace**
9.Description:*SolidworksreadblueprintsanddrawingsfromcustomerproducesketchsheetsAS-1100,CNCAerospacepartsSkills:*Solidworks,mechanicalengineering,readblueprints,pipeline,DesignAdditionalSkills&Qualifications:*Solidwork
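One thing that might explain row 8 is that the gaps are not made of characters that \s matches. Below is a small sketch for checking which non-ASCII characters a cell contains. The sample string and the describe_non_ascii helper are hypothetical, made up for illustration; I have not confirmed that my data contains these particular characters.

```python
import re
import unicodedata

# Hypothetical cell text containing zero-width spaces (U+200B),
# a braille blank (U+2800), and a no-break space (U+00A0).
sample = "Actual\u200b\u200bpay\u2800may\xa0differ"

def describe_non_ascii(text):
    """List each distinct non-ASCII character with its code point and Unicode name."""
    return sorted(
        {(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
         for ch in text if not ch.isascii()}
    )

found = describe_non_ascii(sample)
print(found)

# \s removes the no-break space, but the zero-width space and braille blank
# are not classified as whitespace, so they survive the substitution.
stripped = re.sub(r"\s+", "", sample)
print(repr(stripped))
```

Running the helper on one of the stubborn rows (for example data['Job Description'].iloc[8]) should show whether something like this is going on.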
I'll attach my current script below. Any suggestions are greatly appreciated, and thank you in advance!
import pandas as pd
path_to_file = '/Users/Desk/Desktop/Desk/mechanical_engineer_LinkedIn_test_scrape.csv'
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 400)
pd.set_option('max_colwidth', 300)
data = pd.read_csv(path_to_file)
data['Seniority'] = data['Seniority'].str.strip()
data['Seniority'] = data['Seniority'].replace('Full-time', 'N/A')
data['Job Description'] = data['Job Description'].str.strip().replace(' ', '')
data['Job Description'] = data['Job Description'].str.replace(r'\s+', '', regex=True)
#print(data.iloc[0:500, 6])
#print(data.head(100))
print(data['Job Description'].head(50))
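For what it's worth, here is a rough sketch of the cleanup I think I actually want: collapsing runs of whitespace to a single space instead of deleting every space. It assumes the leftover gaps include invisible characters that \s does not match. The INVISIBLE list is my own guess, not something confirmed from the data, and the sample series is made up.

```python
import pandas as pd

# Hypothetical sample column; the second value stands in for a missing description.
desc = pd.Series(["  This range   is provided\u200b\u200b by\xa0EPITEC.  ", None])

# Assumed set of invisible characters that \s does not match
# (zero-width space/non-joiner/joiner, word joiner, byte order mark).
INVISIBLE = "\u200b\u200c\u200d\u2060\ufeff"

cleaned = (
    desc.str.replace(f"[{INVISIBLE}]", "", regex=True)  # drop invisible characters
        .str.replace(r"\s+", " ", regex=True)           # collapse whitespace runs to one space
        .str.strip()                                    # trim leading/trailing space
)

print(cleaned.iloc[0])
```

Missing values pass through unchanged, so rows without a description stay empty rather than raising an error.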
python
regex
pandas
whitespace
0 Answers