BASH deleting HTML tags from the text file

Time:01-31

I need to filter out all HTML tags from a text file (a tag could be any sequence between <...>).

I came up with this command: cat my_file | sed 's/<[^>]*>//', but it only deletes the first tag on each line. How do I delete all the tags? Is the problem with the regular expression?

CodePudding user response:

From the sed manual:

The s command can be followed by zero or more of the following flags:

    g

Apply the replacement to all matches to the regexp, not just the first.

So the regular expression is fine; add the g flag:

 cat my_file | sed 's/<[^>]*>//g'
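
To see the difference the g flag makes, here is a small sketch run against a made-up sample line (the variable names and the sample text are illustrative, not from the question):

```shell
# Hypothetical sample line containing several tags
line='<p>Hello <b>world</b>!</p>'

# Without g: only the first match on the line is removed
first_only=$(printf '%s\n' "$line" | sed 's/<[^>]*>//')

# With g: every match on the line is removed
all_tags=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf '%s\n' "$first_only"   # Hello <b>world</b>!</p>
printf '%s\n' "$all_tags"     # Hello world!
```

Note that sed can also read the file directly, so the cat is optional: sed 's/<[^>]*>//g' my_file. Since sed works line by line, a tag that is split across two lines would not be matched by this expression.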

CodePudding user response:

If your intent is to remove all tags and keep only the text between them, use html2text (http://www.mbayer.de/html2text/) or pup 'text{}' (https://github.com/ericchiang/pup). There are other tools too, such as xidel and xmlstarlet.
