Removing html tags in pandas
Solution 1
Use re.sub with a pattern that matches any tag and replace each match with an empty string:
re.sub('<[^<]+?>', '', text)
The linked answer has more details.
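A minimal sketch of how that call behaves on plain strings, using two of the sample values from the question below. The variable names texts and cleaned are only illustrative:

```python
import re

# Two of the sample strings from the question.
texts = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
]

# Replace every "<...>" run with an empty string, one string at a time.
cleaned = [re.sub('<[^<]+?>', '', text) for text in texts]
print(cleaned)
```

Only the tags are removed; whitespace that sat inside the markup (such as the space before "property") is kept.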
Solution 2
The Pandas way is to use Series.str.replace:
df['overview_copy'] = df['overview_copy'].str.replace(r'<[^<>]*>', '', regex=True)
Details:
- < - a < char
- [^<>]* - zero or more chars other than < and >, as many as possible
- > - a > char.
See the regex demo.
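To illustrate the details above, here is a small sketch comparing the negated character class with a greedy dot pattern. The greedy pattern <.*> is not from the answer; it is included only as a contrast, and the variable names are illustrative:

```python
import re

sample = '<ul><li> property ;</li></ul>'

# Negated class: each match stops at the first '>' after its '<',
# so every tag is matched separately.
per_tag = re.findall(r'<[^<>]*>', sample)

# Greedy dot: one match runs from the first '<' to the last '>',
# swallowing the text between the tags as well.
greedy = re.findall(r'<.*>', sample)

print(per_tag)
print(greedy)
print(repr(re.sub(r'<[^<>]*>', '', sample)))
```

With the negated class the text between the tags survives the substitution, which is why it is the safer choice here.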
Pandas output:
>>> df['overview_copy']
1 Environments subject.
2 property ;markets and exchange;
3
Name: overview_copy, dtype: object
>>>
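For reference, a self-contained sketch that rebuilds the DataFrame from the question and applies the replacement, so the output above can be reproduced. The way the frame is constructed here is an assumption chosen for brevity, not the asker's exact code:

```python
import pandas as pd

code = [1, 2, 3]
overview = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]
df = pd.DataFrame({'overview': overview}, index=code)

# regex=True is passed explicitly because the default is not the same
# in every pandas version.
df['overview_copy'] = df['overview'].str.replace(r'<[^<>]*>', '', regex=True)
print(df['overview_copy'])
```

The third row ends up as an empty string because that value consists of a single tag and nothing else.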
Author by
Hamideh
Updated on June 08, 2022
Comments
-
Hamideh almost 2 years
I am using pandas library on Python 3.5.1. How can I remove html tags from field values? Here are my input and output:
My code returned an error:
import pandas as pd

code = [1, 2, 3]
overview = ['<p>Environments subject.</p>',
            '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
            '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">']
            # '<p class="SSPBodyText" style="padding: 0cm; text-align: justify;">The subject.</p>']
df = pd.DataFrame(overview, code)
df.columns = ['overview']
df['overview_copy'] = df['overview']
# print(df)
tags_list = ['<p>', '</p>', '<p*>', '<ul>', '</ul>', '<li>', '</li>', '<br>',
             '<strong>', '</strong>', '<span*>', '</span>', '<a href*>', '</a>',
             '<em>', '</em>']
for tag in tags_list:
    # df['overview_copy'] = df['overview_copy'].str.replace(tag, '')
    df['overview_copy'].replace(to_replace=tag, value='', regex=True, inplace=True)
print(df)