I'm currently learning C and trying to write a function that tokenizes a paragraph/string delimited by spaces and returns an array with all the tokens. I'm stuck because I can't figure out why some tokens carry symbols that are not in the original string. Can someone help me figure out what's wrong with my code? Also, I don't want to add additional libraries to the code or use functions like strtok().
char **tokenizeParagraph(char *paragraph) {
    char *ptr = paragraph;
    char words[MAX_WORDS][MAX_WORDLENGTH];
    int wordIndex = 0;
    int wordLen = 0;
    while (*ptr) {
        wordLen = 0;
        while (*ptr && *ptr != ' ') {
            wordLen++;
            ptr++;
        }
        if (wordLen > 0) {
            strncpy(words[wordIndex], paragraph, wordLen);
            printf("%s\n", words[wordIndex]);
            wordIndex++;
        }
        ptr++;
        paragraph = ptr;
    }
    return words;
}
Here's a demo result:
tokenizeParagraph("I'm currently learning C and trying to write a function to tokenize a paragraph/string delimited by spaces and return an array with all the tokens.");
Much appreciated!
CodePudding user response:
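You are creating the array words as a local variable on the stack and returning a pointer to it. A local array only exists until the function returns; after that its memory can be reused by later function calls, so reading through the returned pointer is undefined behavior. Most compilers also warn about the return statement, because a pointer to a char[MAX_WORDLENGTH] array is not the same type as char **. To fix this, replace this declaration:
char words[MAX_WORDS][MAX_WORDLENGTH];
with this:
char (*words)[MAX_WORDLENGTH] = calloc(MAX_WORDS, sizeof *words);
and change the function's return type to match (a pointer to arrays of MAX_WORDLENGTH chars, not char **). This allocates the memory on the heap instead of the stack, so it stays valid until the caller passes it to free(). calloc also zero-fills the block, so every word shorter than MAX_WORDLENGTH ends up NUL-terminated. You need to include stdlib.h for calloc and free.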
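As a rough sketch of the heap-allocated version, the function below keeps one fixed-size slot per word. The name tokenizeFixed, the Word typedef, the count parameter and the two limit values are assumptions made for illustration; the original post does not show them.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed limits: the original post does not show the real values. */
#define MAX_WORDS 64
#define MAX_WORDLENGTH 32

/* One fixed-size slot per word. */
typedef char Word[MAX_WORDLENGTH];

/* Hypothetical variant of tokenizeParagraph: returns a calloc'ed block of
 * MAX_WORDS slots and stores the number of words found in *count.
 * Words longer than MAX_WORDLENGTH - 1 characters are truncated. */
Word *tokenizeFixed(const char *paragraph, int *count) {
    const size_t maxLen = MAX_WORDLENGTH - 1;
    const char *ptr = paragraph;
    Word *words = calloc(MAX_WORDS, sizeof *words);

    *count = 0;
    if (words == NULL) {
        return NULL;
    }
    while (*ptr && *count < MAX_WORDS) {
        while (*ptr == ' ') {          /* skip separators */
            ptr++;
        }
        const char *start = ptr;
        while (*ptr && *ptr != ' ') {  /* scan one word */
            ptr++;
        }
        size_t len = (size_t)(ptr - start);
        if (len > 0) {
            if (len > maxLen) {
                len = maxLen;
            }
            /* calloc zeroed the slot, so the byte after the copy is '\0'. */
            memcpy(words[*count], start, len);
            (*count)++;
        }
    }
    return words;
}
```

Because all slots live in one block, the caller releases everything with a single free(words).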
CodePudding user response:
What @Finxx suggested works. If word lengths vary widely, though, you can save memory by allocating each word separately with exactly the size it needs:
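#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char **tokenizeParagraph(char *paragraph) {
    char *ptr = paragraph;
    /* calloc leaves unused slots NULL, so the caller can free all of them */
    char **words = calloc(MAX_WORDS, sizeof(char *));
    int wordIndex = 0;
    int wordLen;
    if (words == NULL) {
        return NULL;
    }
    while (*ptr && wordIndex < MAX_WORDS) {
        wordLen = 0;
        /* skip leading spaces */
        while (*ptr && *ptr == ' ') {
            ptr++;
        }
        paragraph = ptr; /* start of the current word */
        while (*ptr && *ptr != ' ') {
            wordLen++;
            ptr++;
        }
        if (wordLen > 0) {
            /* one extra byte for the terminating NUL */
            words[wordIndex] = malloc(wordLen + 1);
            if (words[wordIndex] == NULL) {
                break;
            }
            strncpy(words[wordIndex], paragraph, wordLen);
            words[wordIndex][wordLen] = '\0';
            printf("%s\n", words[wordIndex]);
            wordIndex++;
        }
    }
    return words;
}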
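Also, note that strncpy does not add a terminating NUL character when the source holds at least as many characters as the requested count. That is always the case here, because each word is copied out of a longer string with a count of exactly wordLen. The missing terminator is the most likely reason for the random characters in your output, which is why the code above sets words[wordIndex][wordLen] = '\0' explicitly.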
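To make the strncpy behavior concrete, here is a minimal sketch of a copy helper that always terminates its result. The name copyWord and its parameters are hypothetical and only serve the illustration.

```c
#include <string.h>

/* Hypothetical helper: copies exactly len characters from src into dst and
 * terminates the result. dst must have room for len + 1 bytes. */
void copyWord(char *dst, const char *src, size_t len) {
    /* strncpy writes no '\0' when src has len or more characters,
     * so whatever was in dst[len] before would otherwise be printed too. */
    strncpy(dst, src, len);
    dst[len] = '\0';
}
```

For example, copying 5 characters of "hello world" into a buffer previously filled with '#' yields "hello" with the helper, while a bare strncpy would leave the '#' characters directly after the copied text.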
Also, don't forget to free the allocated memory in the calling function:
int main(void) {
    ...
    char **words = tokenizeParagraph(para);
    ...
    /* unused slots must be NULL for this loop to be safe; free(NULL) does nothing */
    for (int i = 0; i < MAX_WORDS; i++) {
        free(words[i]);
    }
    free(words);
    ...
    return 0;
}
