python - How to extract tags from html content using beautifulsoup - TagMerge

How to extract tags from html content using beautifulsoup

Asked 1 years ago
0
3 answers

How to achieve?

You can use .contents or, as in the following solution, .stripped_strings, which is a generator:

list(b.stripped_strings)

#Output --> ['Category', 'Clothing', 'Sub-category', 'this is Sub-category', 'product', 'This is the actual product']
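To make the difference between the two attributes concrete, here is a minimal sketch on a small hypothetical fragment (the fragment and variable names are illustrative, and html.parser is used only so the snippet needs no extra install). .contents holds the direct children of a tag, including child tags and whitespace-only strings, while .stripped_strings walks every text node under the tag and yields only the non-empty, stripped ones.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, parsed with the built-in html.parser
fragment = "<div><strong>Category</strong> <br> Clothing</div>"
div = BeautifulSoup(fragment, "html.parser").div

# .contents: direct children only, child tags and whitespace strings included
children = div.contents

# .stripped_strings: every text node under the tag, stripped, empty ones skipped
texts = list(div.stripped_strings)

print(children)
print(texts)
```

Because .stripped_strings is a generator, it has to be wrapped in list() before it can be sliced or indexed.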

To convert this list into a dict you can use:

dict(zip(s[::2], s[1::2]))

Example:

from bs4 import BeautifulSoup

html = '''
<div class="row d-3">
    <div class="col-16 col-sm-8">
        <strong>Category</strong> <br>
        Clothing</div>
    <div class="col-16 col-sm-8">
        <strong>Sub-category</strong> <br>
         this is Sub-category
        </div>
    <div class="col-16 col-sm-8">
        <strong>product</strong> <br>
        This is the actual product </div>
</div>'''

soup = BeautifulSoup(html, "lxml")

for b in soup.find_all("div", class_="row d-3"):
    s = list(b.stripped_strings)
    print(dict(zip(s[::2], s[1::2])))

Output:

{'Category': 'Clothing', 'Sub-category': 'this is Sub-category', 'product': 'This is the actual product'}
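Pairing strings by position only works while every label has exactly one value string. As a hedged alternative, the sketch below keys each entry on the strong tag of its own column, so a label with no value does not shift the later pairs. The "Size" column is a hypothetical addition to the markup, and the code assumes the strong label always comes first in its column.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same layout as above, but one label has no value
html = """
<div class="row d-3">
    <div class="col-16 col-sm-8"><strong>Category</strong> <br> Clothing</div>
    <div class="col-16 col-sm-8"><strong>Size</strong> <br></div>
    <div class="col-16 col-sm-8"><strong>product</strong> <br> This is the actual product </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

result = {}
for col in soup.select("div.row.d-3 > div"):
    label = col.find("strong")
    if label is None:
        continue  # skip columns that carry no label at all
    parts = list(col.stripped_strings)
    # Assumption: the label is the first string; anything after it is the value
    result[label.get_text(strip=True)] = " ".join(parts[1:])

print(result)
```

A column without a value ends up with an empty string instead of borrowing the next label as its value.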

Source: link

0

A webpage is just a text file in HTML format. And HTML-formatted text is ultimately just text. So, let's write our own HTML from scratch, without worrying yet about "the Web":
htmltxt = "<p>Hello World</p>"
This is the standard import statement for using Beautiful Soup:
from bs4 import BeautifulSoup
So, let's parse some HTML:
from bs4 import BeautifulSoup
htmltxt = "<p>Hello World</p>"
soup = BeautifulSoup(htmltxt, 'lxml')
What is soup? As always, use the built-in type() function to inspect an unknown object:
type(soup)
# bs4.BeautifulSoup
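The same kind of inspection works one level down. As a small sketch (variable names are illustrative, and html.parser is used so nothing beyond bs4 is required), the elements inside the soup are Tag objects and the text inside a tag is a NavigableString, which is a subclass of str:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")

tag = soup.p            # first <p> element in the document
text_node = tag.string  # the single text node inside it

print(type(soup).__name__)
print(isinstance(tag, Tag))
print(isinstance(text_node, NavigableString))
print(isinstance(text_node, str))
```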
The BeautifulSoup object has a text attribute that returns the plain text of an HTML string sans the tags. Given our simple soup of <p>Hello World</p>, the text attribute returns:
soup.text
# 'Hello World'
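With more than one element, the text attribute simply concatenates the text nodes, which can glue words together. A minimal sketch, using a hypothetical two-paragraph string, of how get_text() with a separator and strip=True keeps the pieces apart:

```python
from bs4 import BeautifulSoup

# Hypothetical input with two paragraphs and a nested tag
soup = BeautifulSoup("<p>Hello <b>World</b></p><p>Bye</p>", "html.parser")

# .text is equivalent to get_text() with no arguments
plain = soup.text

# Join the text nodes with a space and strip each one first
spaced = soup.get_text(" ", strip=True)

print(repr(plain))
print(repr(spaced))
```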

Source: link

0

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
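The same lookups can be narrowed with filters. The sketch below uses a shortened, hypothetical variant of the document in which the third link has no href, to show filtering by class, by presence of an attribute, and by CSS selector, and why link.get('href') is the safer accessor:

```python
from bs4 import BeautifulSoup

# Hypothetical variant: the third anchor deliberately has no href
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Filter by CSS class (class_ avoids clashing with the Python keyword)
sisters = soup.find_all("a", class_="sister")

# Only anchors that actually carry an href attribute
with_href = soup.find_all("a", href=True)

# CSS selector lookup of a single element by id
lacie = soup.select_one("a#link2")

# .get() returns None for a missing attribute; tag["href"] would raise KeyError
hrefs = [a.get("href") for a in sisters]

print(hrefs)
```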
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
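If the goal is the tags themselves rather than their text, find_all(True) matches every tag in the document. A minimal sketch on a short hypothetical document, collecting the tag names in document order and counting them:

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical document, kept short for illustration
html_doc = (
    "<html><head><title>T</title></head>"
    "<body><p>One <a href='#'>x</a></p><p>Two</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# find_all(True) returns every Tag, in document order
tag_names = [tag.name for tag in soup.find_all(True)]

# Count how often each tag name occurs
counts = Counter(tag_names)

print(tag_names)
print(counts["p"])
```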

Source: link
