Remove embedded image data from HTML with BeautifulSoup

Time:01-28

I would like to use BS4 (BeautifulSoup 4) to remove embedded images from HTML to save space, but leave the tag itself in place. For example, remove the base64 data but keep <img src="data:image/jpeg;base64,<DELETED>">

I can do this to remove everything including the tag:

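# find_all is the current name; findAll is the older alias
tags = soup.find_all('img')
for match in tags:
  match.decompose()  # removes the whole <img> tag from the tree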

This removes everything, but I would like to keep the tag and drop only the actual binary data. Is that possible?

CodePudding user response:

Python3


from bs4 import BeautifulSoup

markup = """
<div>
    <p>Take the red pill</p>
    <img src="data:image/png;base64, iVBORw0KGgoAAAANSUhEUgAAAAUA
    AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
    9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Follow the white rabbit" />
</div>
"""

soup = BeautifulSoup(markup, 'html.parser')
tag = soup.img
tag['src'] = "data:image/jpeg;base64,"

print(tag)

Outputs

<img alt="Follow the white rabbit" src="data:image/jpeg;base64,"/>
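The snippet above handles a single tag and hard-codes the replacement header. A minimal sketch of one way to generalize it follows, assuming the goal is to blank the payload of every embedded image while keeping each image's own header (for example "data:image/png;base64,"). The function name strip_embedded_images and the sample HTML are illustrative only and not part of the original answer.

```python
from bs4 import BeautifulSoup


def strip_embedded_images(html):
    """Blank the payload of every data-URI <img>, keeping the header."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if src.startswith("data:") and "," in src:
            # Keep everything up to the first comma, e.g.
            # "data:image/png;base64", and drop the encoded bytes.
            header, _, _payload = src.partition(",")
            img["src"] = header + ","
    return str(soup)


# Hypothetical sample input: one embedded image, one external image.
html = (
    '<p>text</p>'
    '<img src="data:image/png;base64,iVBORw0KGgo=" alt="a">'
    '<img src="https://example.com/logo.png" alt="b">'
)
cleaned = strip_embedded_images(html)
print(cleaned)
```

Images that point at an ordinary URL are left untouched, since only src values starting with "data:" are rewritten.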

CodePudding user response:

Here is how I managed to do it. Easy, really.

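# tags is the list of <img> tags collected in the question
for match in tags:
  match['src'] = 'deleted'  # overwrite the embedded data, keep the tag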
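
Note that this loop overwrites the src of every img tag, including images that link to a normal file. If only the embedded ones should change, one possible variation is to filter on the attribute value with a regular expression, which find_all accepts as an attribute filter. The sample HTML here is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical input: one embedded image and one linked image.
html = (
    '<img src="data:image/jpeg;base64,/9j/4AAQSkZJRg==">'
    '<img src="photo.jpg">'
)
soup = BeautifulSoup(html, "html.parser")

# Select only <img> tags whose src starts with "data:".
embedded = soup.find_all("img", src=re.compile(r"^data:"))
for match in embedded:
    match["src"] = "deleted"

srcs = [img["src"] for img in soup.find_all("img")]
print(srcs)  # ['deleted', 'photo.jpg']
```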