BeautifulSoup: get css classes from html

python html css beautifulsoup

10,145

Solution 1

BeautifulSoup itself doesn't parse CSS style declarations at all, but you can extract such sections then parse them with a dedicated CSS parser.

Depending on your needs, there are several CSS parsers available for python; I'd pick cssutils (requires python 2.5 or up (including python 3)), it is the most complete in it's support, and supports inline styles too.

Other options are css-py and tinycss.

To grab and parse such all style sections (example with cssutils):

import cssutils
sheets = []
for styletag in tree.findAll('style', type='text/css')
    if not styletag.string: # probably an external sheet
        continue
    sheets.append(cssutils.parseStyle(styletag.string))

With cssutil you can then combine these, resolve imports, and even have it fetch external stylesheets.

Solution 2

A BeautifulSoup & cssutils combo will do the trick nicely:

    from bs4 import BeautifulSoup as BSoup
    import cssutils
    selectors = {}
    with open(htmlfile) as webpage:
        html = webpage.read()
        soup = BSoup(html, 'html.parser')
    for styles in soup.select('style'):
        css = cssutils.parseString(styles.encode_contents())
        for rule in css:
            if rule.type == rule.STYLE_RULE:
                style = rule.selectorText
                selectors[style] = {}
                for item in rule.style:
                    propertyname = item.name
                    value = item.value
                    selectors[style][propertyname] = value

BeautifulSoup parses all "style" tags in the html (head & body), .encode_contents() converts the BeautifulSoup objects into a byte format that cssutils can read, and then cssutils parses the individual CSS styles all the way down to the property/value level via rule.selectorText & rule.style.

Note: The "rule.STYLE_RULE" filters out only styles. The cssutils documentation details options for filtering media rules, comments and imports.

It'd be cleaner if you broke this down into functions, but you get the gist...

10,145

Author by

root

Getting started with python. `

Updated on June 18, 2022

Comments

root almost 2 years

Is there a way to get CSS classes from an HTML file using BeautifulSoup? Example snippet:

<style type="text/css">

 p.c3 {text-align: justify}

 p.c2 {text-align: left}

 p.c1 {text-align: center}

</style>

Perfect output would be:

cssdict = {
    'p.c3': {'text-align': 'justify'},
    'p.c2': {'text-align': 'left'},
    'p.c1': {'text-align': 'center'}
}

although something like this would do:

L = [
    ('p.c3', {'text-align': 'justify'}),  
    ('p.c2', {'text-align': 'left'}),    
    ('p.c1', {'text-align': 'center'})
]

Recents

Why Is PNG file with Drop Shadow in Flutter Web App Grainy?

How to troubleshoot crashes detected by Google Play Store for Flutter app

Cupertino DateTime picker interfering with scroll behaviour

Why does awk -F work for most letters, but not for the letter "t"?

Flutter change focus color and icon color but not works

How to print and connect to printer using flutter desktop via usb?

Critical issues have been reported with the following SDK versions: com.google.android.gms:play-services-safetynet:17.0.0

Flutter Dart - get localized country name from country code

navigatorState is null when using pushNamed Navigation onGenerateRoutes of GetMaterialPage

Android Sdk manager not found- Flutter doctor error

Flutter Laravel Push Notification without using any third party like(firebase,onesignal..etc)

How to change the color of ElevatedButton when entering text in TextField

How do I convert base64 to an html file?

HTML parsing with BeautifulSoup 4 and Python

Find index of tag with certain text in beautifulsoup/python

AttributeError: 'NavigableString' object has no attribute 'find_all' (NameError)

How to scrape google maps using python

name error 'html' not defined with beautifulsoup4

How to define findAll for html nested tags using beautifulsoup

BeautifulSoup extracting data from multiple tables

Python scraping pdf from URL

Layout management in Plotly Dash app: how to position html div?