When reporting an error in this documentation, please mention which
translation you're reading.
Quick Start
===========
Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::
html_doc = """The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""
Running the "three sisters" document through Beautiful Soup gives us a
:py:class:`BeautifulSoup` object, which represents the document as a nested
data structure::

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc, 'html.parser')

 print(soup.prettify())
 # <html>
 #  <head>
 #   <title>
 #    The Dormouse's story
 #   </title>
 #  </head>
 #  <body>
 #   <p class="title">
 #    <b>
 #     The Dormouse's story
 #    </b>
 #   </p>
 #   <p class="story">
 #    Once upon a time there were three little sisters; and their names were
 #    <a class="sister" href="http://example.com/elsie" id="link1">
 #     Elsie
 #    </a>
 #    ,
 #    <a class="sister" href="http://example.com/lacie" id="link2">
 #     Lacie
 #    </a>
 #    and
 #    <a class="sister" href="http://example.com/tillie" id="link3">
 #     Tillie
 #    </a>
 #    ; and they lived at the bottom of a well.
 #   </p>
 #   <p class="story">
 #    ...
 #   </p>
 #  </body>
 # </html>

Here are some simple ways to navigate that data structure::

 soup.title
 # <title>The Dormouse's story</title>

 soup.title.name
 # 'title'

 soup.title.string
 # 'The Dormouse's story'

 soup.title.parent.name
 # 'head'

 soup.p
 # <p class="title"><b>The Dormouse's story</b></p>

 soup.p['class']
 # ['title']

 soup.a
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 soup.find_all('a')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find(id="link3")
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's ``<a>`` tags::

 for link in soup.find_all('a'):
     print(link.get('href'))
 # http://example.com/elsie
 # http://example.com/lacie
 # http://example.com/tillie

Another common task is extracting all the text from a page::

 print(soup.get_text())
 # The Dormouse's story
 #
 # The Dormouse's story
 #
 # Once upon a time there were three little sisters; and their names were
 # Elsie,
 # Lacie and
 # Tillie;
 # and they lived at the bottom of a well.
 #
 # ...

Does this look like what you need? If so, read on.
Installing Beautiful Soup
=========================
If you're using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:
:kbd:`$ apt-get install python3-bs4`
Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system packager, you can install it with ``easy_install`` or
``pip``. The package name is ``beautifulsoup4``. Make sure you use the
right version of ``pip`` or ``easy_install`` for your Python version
(these may be named ``pip3`` and ``easy_install3`` respectively).
:kbd:`$ easy_install beautifulsoup4`
:kbd:`$ pip install beautifulsoup4`
(The ``BeautifulSoup`` package is `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)
If you don't have ``easy_install`` or ``pip`` installed, you can
`download the Beautiful Soup 4 source tarball
<https://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and
install it with ``setup.py``.
:kbd:`$ python setup.py install`
If all else fails, the license for Beautiful Soup allows you to
package the entire library with your application. You can download the
tarball, copy its ``bs4`` directory into your application's codebase,
and use Beautiful Soup without installing it at all.
I use Python 3.10 to develop Beautiful Soup, but it should work with
other recent versions.
.. _parser-installation:
Installing a parser
-------------------
Beautiful Soup supports the HTML parser included in Python's standard
library, but it also supports a number of third-party Python parsers.
One is the `lxml parser <http://lxml.de/>`_. Depending on your setup,
you might install lxml with one of these commands:
:kbd:`$ apt-get install python3-lxml`
:kbd:`$ easy_install lxml`
:kbd:`$ pip install lxml`
Another alternative is the pure-Python `html5lib parser
<https://github.com/html5lib/html5lib-python>`_, which parses HTML the way a
web browser does. Depending on your setup, you might install html5lib
with one of these commands:
:kbd:`$ apt-get install python3-html5lib`
:kbd:`$ easy_install html5lib`
:kbd:`$ pip install html5lib`
This table summarizes the advantages and disadvantages of each parser library:
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Parser | Typical usage | Advantages | Disadvantages |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Python's html.parser | ``BeautifulSoup(markup, "html.parser")`` | * Batteries included | * Not as fast as lxml, |
| | | * Decent speed | less lenient than |
| | | * Lenient (As of Python 3.2) | html5lib. |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's HTML parser | ``BeautifulSoup(markup, "lxml")`` | * Very fast | * External C dependency |
| | | * Lenient | |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's XML parser | ``BeautifulSoup(markup, "lxml-xml")`` | * Very fast | * External C dependency |
| | ``BeautifulSoup(markup, "xml")`` | * The only currently supported | |
| | | XML parser | |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| html5lib | ``BeautifulSoup(markup, "html5lib")`` | * Extremely lenient | * Very slow |
| | | * Parses pages the same way a | * External Python |
| | | web browser does | dependency |
| | | * Creates valid HTML5 | |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
If you can, I recommend you install and use lxml for speed.
Note that if a document is invalid, different parsers will generate
different Beautiful Soup trees for it. See `Differences
between parsers`_ for details.
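To see this concretely, here's a minimal sketch that runs one piece of invalid markup through each supported parser (lxml and html5lib are optional installs, so the loop simply skips any that are missing):

```python
from bs4 import BeautifulSoup

# "<a></p>" is invalid markup: an <a> tag "closed" by a stray </p>.
# Each parser repairs it in its own way; skip parsers that aren't installed.
for parser in ["html.parser", "lxml", "html5lib"]:
    try:
        print(parser, "->", BeautifulSoup("<a></p>", parser))
    except Exception:
        print(parser, "-> not installed")

# html.parser simply drops the stray </p>, giving "<a></a>";
# lxml and html5lib also add <html>/<body> scaffolding around it.
```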
Making the soup
===============
To parse a document, pass it into the :py:class:`BeautifulSoup`
constructor. You can pass in a string or an open filehandle::

 from bs4 import BeautifulSoup

 with open("index.html") as fp:
     soup = BeautifulSoup(fp, 'html.parser')

 soup = BeautifulSoup("<html>a web page</html>", 'html.parser')

First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::
print(BeautifulSoup("Sacré bleu!", "html.parser"))
# Sacré bleu!
Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
use an XML parser. (See `Parsing XML`_.)
Kinds of objects
================
Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
`kinds` of objects: :py:class:`Tag`, :py:class:`NavigableString`, :py:class:`BeautifulSoup`,
and :py:class:`Comment`.
.. py:class:: Tag
A :py:class:`Tag` object corresponds to an XML or HTML tag in the original document.
::

 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
 tag = soup.b
 type(tag)
 # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.
.. py:attribute:: name
Every tag has a name::

 tag.name
 # 'b'

If you change a tag's name, the change will be reflected in any
markup generated by Beautiful Soup down the line::

 tag.name = "blockquote"
 tag
 # <blockquote class="boldest">Extremely bold</blockquote>

.. py:attribute:: attrs
An HTML or XML tag may have any number of attributes. The tag
``<b id="boldest">`` has an attribute "id" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::

 tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
 tag['id']
 # 'boldest'

You can access the dictionary of attributes directly as ``.attrs``::

 tag.attrs
 # {'id': 'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

 tag['id'] = 'verybold'
 tag['another-attribute'] = 1
 tag
 # <b another-attribute="1" id="verybold">bold</b>

 del tag['id']
 del tag['another-attribute']
 tag
 # <b>bold</b>

 tag['id']
 # KeyError: 'id'

 tag.get('id')
 # None

.. _multivalue:
Multi-valued attributes
-----------------------
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. By default, Beautiful Soup parses the value(s)
of a multi-valued attribute into a list::

 css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
 css_soup.p['class']
 # ['body']

 css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
 css_soup.p['class']
 # ['body', 'strikeout']

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

 id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
 id_soup.p['id']
 # 'my id'

When you turn a tag back into a string, multiple attribute values are
consolidated::

 rel_soup = BeautifulSoup('<p>Back to the <a rel="index first">homepage</a></p>', 'html.parser')
 rel_soup.a['rel']
 # ['index', 'first']
 rel_soup.a['rel'] = ['index', 'contents']
 print(rel_soup.p)
 # <p>Back to the <a rel="index contents">homepage</a></p>

You can force all attributes to be parsed as strings by passing
``multi_valued_attributes=None`` as a keyword argument into the
:py:class:`BeautifulSoup` constructor::

 no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
 no_list_soup.p['class']
 # 'body strikeout'

You can use ``get_attribute_list`` to get a value that's always a
list, whether or not it's a multi-valued attribute::

 id_soup.p.get_attribute_list('id')
 # ["my id"]

If you parse a document as XML, there are no multi-valued attributes::

 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
 xml_soup.p['class']
 # 'body strikeout'

Again, you can configure this using the ``multi_valued_attributes`` argument::

 class_is_multi = {'*': 'class'}
 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
 xml_soup.p['class']
 # ['body', 'strikeout']

You probably won't need to do this, but if you do, use the defaults as
a guide. They implement the rules described in the HTML specification::

 from bs4.builder import builder_registry
 builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

.. py:class:: NavigableString
-----------------------------
A string corresponds to a bit of text within a tag. Beautiful Soup
uses the :py:class:`NavigableString` class to contain these bits of text::

 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
 tag = soup.b
 tag.string
 # 'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>

A :py:class:`NavigableString` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_. You can convert a
:py:class:`NavigableString` to a Unicode string with ``str``::

 unicode_string = str(tag.string)
 unicode_string
 # 'Extremely bold'
 type(unicode_string)
 # <class 'str'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with()`::
tag.string.replace_with("No longer bold")
tag
# No longer bold
:py:class:`NavigableString` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.
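A short sketch of that asymmetry, reusing the "Extremely bold" example from above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
string = soup.b.string

# Navigation upward works: a string knows the tag that contains it.
print(string.parent.name)
# b

# Navigation downward doesn't: a string has no children to contain,
# so asking for .contents raises AttributeError.
try:
    string.contents
except AttributeError as e:
    print("AttributeError:", e)
```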
If you want to use a :py:class:`NavigableString` outside of Beautiful Soup,
you should call ``str()`` on it to turn it into a normal Python
Unicode string. If you don't, your string will carry around a
reference to the entire Beautiful Soup parse tree, even when you're
done using Beautiful Soup. This is a big waste of memory.
.. py:class:: BeautifulSoup
---------------------------
The :py:class:`BeautifulSoup` object represents the parsed document as a
whole. For most purposes, you can treat it as a :py:class:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.
You can also pass a :py:class:`BeautifulSoup` object into one of the methods
defined in `Modifying the tree`_, just as you would a :py:class:`Tag`. This
lets you do things like combine two parsed documents::
doc = BeautifulSoup("INSERT FOOTER HEREHere's the footer", "xml")
doc.find(text="INSERT FOOTER HERE").replace_with(footer)
# 'INSERT FOOTER HERE'
print(doc)
#
#
Since the :py:class:`BeautifulSoup` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

 soup.name
 # '[document]'

Special strings
---------------
:py:class:`Tag`, :py:class:`NavigableString`, and
:py:class:`BeautifulSoup` cover almost everything you'll see in an
HTML or XML file, but there are a few leftover bits. The main one
you'll probably encounter is the :py:class:`Comment`.
.. py:class:: Comment
::
markup = ""
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)
#
The :py:class:`Comment` object is just a special type of :py:class:`NavigableString`::

 comment
 # 'Hey, buddy. Want to buy a used parser'

But when it appears as part of an HTML document, a :py:class:`Comment` is
displayed with special formatting::

 print(soup.b.prettify())
 # <b>
 #  <!--Hey, buddy. Want to buy a used parser?-->
 # </b>

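Because a :py:class:`Comment` is just a :py:class:`NavigableString` subclass, an ``isinstance`` check is enough to find (and, say, strip) every comment in a document. A minimal sketch, using made-up markup:

```python
from bs4 import BeautifulSoup, Comment

markup = "<p>Visible text<!--invisible note--></p><!--another note-->"
soup = BeautifulSoup(markup, "html.parser")

# find_all(string=...) matches NavigableStrings; the lambda keeps
# only the ones that are actually Comment objects.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()  # detach the comment from the tree

print(soup)
# <p>Visible text</p>
```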
For HTML documents
^^^^^^^^^^^^^^^^^^
Beautiful Soup defines a few :py:class:`NavigableString` subclasses to
contain strings found inside specific HTML tags. This makes it easier
to pick out the main body of the page, by ignoring strings that
probably represent programming directives found within the
page. `(These classes are new in Beautiful Soup 4.9.0, and the
html5lib parser doesn't use them.)`
.. py:class:: Stylesheet
A :py:class:`NavigableString` subclass that represents embedded CSS
stylesheets; that is, any strings found inside a ``<style>`` tag.