BASH: deleting HTML tags from a text file

I need to strip all HTML tags from a text file (a tag can be any sequence between < and >).

I came up with this command: cat my_file | sed 's/<[^>]*>//', but it only deletes the first tag on each line. How do I delete all the tags? Is the problem with the regular expression?

CodePudding user response：

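The regular expression is fine; by default the s command replaces only the first match on each line. From the sed manual: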

The s command can be followed by zero or more of the following flags:
    g
Apply the replacement to all matches to the regexp, not just the first.

 cat my_file | sed 's/<[^>]*>//g'
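
To see the difference the g flag makes, here is a minimal sketch that runs both forms of the command on one sample line. The function name strip_tags and the sample input are made up for illustration; only the sed expression comes from the answer above.

```shell
# Hypothetical helper: reads text on stdin and removes every <...> sequence.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Without g, only the first tag on the line is removed.
printf '%s\n' '<p>Hello <b>world</b></p>' | sed 's/<[^>]*>//'
# prints: Hello <b>world</b></p>

# With g, every tag on the line is removed.
printf '%s\n' '<p>Hello <b>world</b></p>' | strip_tags
# prints: Hello world
```

The helper can read the file directly, as in strip_tags < my_file, so the cat is not needed. Note that sed processes input one line at a time, so a tag whose < and > fall on different lines is not matched by this expression.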

CodePudding user response：

If your intent is to remove all tags and keep only the text between them, use html2text (http://www.mbayer.de/html2text/) or pup 'text{}' (https://github.com/ericchiang/pup). Other tools such as xidel and xmlstarlet can do this too.