I need to read all the words from a file into a variable, storing each word only once. The comparison must not be case sensitive, so "Hello", "hello", "hElLo" and "HELLO" count as the same word. If a word contains an apostrophe, like "it's", the "'s" must be ignored and only "it" counted as a word.
To do that I used the following command:
#Stores the words of the file without duplicates
WORDS=`grep -o -E '\w+' $1 | sort -u -f`
The first two criteria are met but this method counts words like "it's" as two separate words "it" and "s".
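The split can be reproduced on a single word. This is a minimal sketch, assuming the pattern in the command above is `\w+` and that `grep` supports `\w` (GNU grep does):

```shell
# The apostrophe is not a word character, so \w+ matches on each side of it
# and grep -o prints the two matches on separate lines: "it" and "s".
printf '%s\n' "it's" | grep -o -E '\w+'
```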
CodePudding user response:
Maybe something like this:
WORDS=$(grep -o -E "(\w|')+" words.txt | sed -e "s/'.*\$//" | sort -u -f)
UPDATE
Explanations:
- var=$(...command...): executes the command and stores its standard output in the variable var. This is the newer and better form of `...command...`.
- grep -o -E "(\w|')+" words.txt: reads the file words.txt and applies the grep filter. The filter prints only the matched tokens (-o) of the extended (-E) regular expression (\w|')+. This expression matches word characters (\w, a synonym of [_[:alnum:]]; alnum stands for alphanumeric characters, such as [0-9a-zA-Z] in English, extended to many other characters in other languages) or (|) a single quote ('), one or more times (+). See man grep.
- The standard output of grep becomes the standard input of the next command, sed, through the pipe (|).
- sed -e "s/'.*\$//": executes (-e) the expression s/'.*\$//. This sed expression substitutes (s/) '.*\$ (a single quote followed by zero or more characters up to the end of the line) with the empty string (between the last two slashes, //). See man sed.
- The standard output of sed becomes the standard input of the next command, sort, through the pipe (|).
- sort -u -f: sorts the output of sed, removes duplicates (-u: unique) and ignores the difference between upper and lower case (-f). See man sort.
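The steps above can be tried end to end on a small sample. This is only a sketch: the function name unique_words and the temporary sample file are made up for illustration, and two steps are additions to the answer's pipeline. The tr call lowercases every word so the result does not depend on which spelling sort keeps, and the extra sed expression drops tokens that become empty because they started with an apostrophe.

```shell
# Hypothetical helper (not part of the original answer): print the unique,
# lowercased words of the file given as $1, one per line.
unique_words() {
    grep -o -E "(\w|')+" "$1" |              # tokens of word characters and apostrophes
        sed -e "s/'.*\$//" -e '/^$/d' |      # cut at the first apostrophe, drop empty tokens
        tr '[:upper:]' '[:lower:]' |         # addition: normalize case before de-duplicating
        sort -u                              # sort and keep one copy of each word
}

# Sample input, written to a temporary file for the demonstration.
sample=$(mktemp)
printf '%s\n' "Hello hello hElLo HELLO" "It's a test, isn't it?" > "$sample"

WORDS=$(unique_words "$sample")
rm -f "$sample"

printf '%s\n' "$WORDS"
```

With this input the words come out as a, hello, isn, it and test. Note that "isn't" is reduced to "isn", which follows the stated rule literally: everything from the apostrophe onward is ignored.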
