I need to filter out all HTML tags from a text file (a tag could be any sequence between <...>).
I came up with this command: cat my_file | sed 's/<[^>]*>//', but it only deletes the first tag on each line. How do I delete all the tags? Is the problem with the regular expression?
CodePudding user response:
From the sed manual:
The s command can be followed by zero or more of the following flags:
g: Apply the replacement to all matches to the regexp, not just the first.
So
cat my_file | sed 's/<[^>]*>//g'
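To see what the g flag changes, here is a minimal sketch that runs both forms of the command on one line. The sample line is made up for illustration and is not from the original file:

```shell
# Hypothetical sample line containing several tags (illustration only).
line='<p>Hello <b>world</b></p>'

# Without g: only the first match on the line is removed.
first_only=$(printf '%s\n' "$line" | sed 's/<[^>]*>//')

# With g: every match on the line is removed.
all_tags=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf '%s\n' "$first_only"
printf '%s\n' "$all_tags"
```

The regular expression itself was fine; only the flag was missing. Note that sed processes its input one line at a time, so this expression will not match a tag that is split across two lines.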
CodePudding user response:
If your intent is to remove all tags and keep only the text between them, use html2text (http://www.mbayer.de/html2text/) or pup 'text{}' (https://github.com/ericchiang/pup). There are other tools too, such as xidel and xmlstarlet.
