I need to parse UDP packets that may be invalid or contain errors. To display the content of a packet, I would like to convert its bytes to a string and replace every invalid character with a dot (.).
How can I do it? This is my code:
func main() {
    a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
    s := string(a)
    s = strings.Replace(s, string(0xFFFD), ".", 0)
    fmt.Println("s: ", s) // I would like to display "a..b."
    for _, r := range s {
        fmt.Println("r: ", r)
    }
    rs := []rune(s)
    fmt.Println("rs: ", rs)
}
Answer:
The root problem with your approach is that the result of type-converting a []byte to a string does not contain any U+FFFD characters: the conversion only copies the bytes from the source to the destination, verbatim.
Just like byte slices, strings in Go are not required to contain UTF-8-encoded text; they can contain any data, including opaque binary data which has nothing to do with text.
But some operations on strings—namely type-converting them to []rune and iterating over them using range—do interpret strings as UTF-8-encoded text.
That is precisely where you got tripped up: your range debugging loop interpreted the string as UTF-8, and each time it failed to decode a properly encoded code point, range yielded the replacement character, U+FFFD.
To reiterate, the string obtained by the type conversion does not contain the characters you wanted your strings.Replace call to replace. Note also that passing 0 as the last argument tells strings.Replace to perform no replacements at all; pass -1, or use strings.ReplaceAll, to replace every occurrence.
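To make this concrete, here is a minimal sketch using the byte slice from the question. The helper name countReplacementRunes is made up for illustration. It shows that the converted string holds the original five bytes and no encoded U+FFFD, while a range loop over it still produces U+FFFD three times.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// countReplacementRunes (illustrative name) reports how many U+FFFD runes
// a range loop yields while decoding s as UTF-8.
func countReplacementRunes(s string) int {
	n := 0
	for _, r := range s {
		if r == utf8.RuneError { // utf8.RuneError is U+FFFD
			n++
		}
	}
	return n
}

func main() {
	a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
	s := string(a)

	fmt.Println(len(s))                        // 5: the bytes were copied verbatim
	fmt.Println(utf8.ValidString(s))           // false: s is not valid UTF-8
	fmt.Println(strings.Contains(s, "\uFFFD")) // false: no encoded U+FFFD in s
	fmt.Println(countReplacementRunes(s))      // 3: one per undecodable byte
}
```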
As to how to actually make a valid UTF-8-encoded string out of your data, you might employ a two-step process:
- Type-convert your byte slice to a string, as you already do.
- Interpret the string as UTF-8 by any of the means above and, as you iterate, replace each U+FFFD that appears during decoding.
Something like this:
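// replaceInvalid returns b as a string in which every U+FFFD produced
// during decoding is replaced by '.'. (The function name is illustrative.)
func replaceInvalid(b []byte) string {
    var sb strings.Builder
    for _, c := range string(b) {
        // range yields U+FFFD for each byte it cannot decode as UTF-8.
        if c == '\uFFFD' {
            sb.WriteByte('.')
        } else {
            sb.WriteRune(c)
        }
    }
    return sb.String()
}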
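
One caveat with the loop above: a packet that contains a genuine, correctly encoded U+FFFD (bytes EF BF BD) also gets a dot, because range cannot tell the two cases apart. If that matters, a possible variant is to decode manually with utf8.DecodeRune, which reports a width of 1 only when decoding actually failed. This is a sketch under that assumption, and the name dotInvalid is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// dotInvalid (illustrative name) returns b as valid UTF-8 text in which
// every byte that cannot be decoded is replaced by '.'.
func dotInvalid(b []byte) string {
	var sb strings.Builder
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			// Decoding failed; exactly one byte is consumed.
			sb.WriteByte('.')
		} else {
			// A valid code point, including a genuine U+FFFD (size 3).
			sb.WriteRune(r)
		}
		b = b[size:]
	}
	return sb.String()
}

func main() {
	fmt.Println(dotInvalid([]byte{'a', 0xff, 0xaf, 'b', 0xbf})) // a..b.
	fmt.Println(dotInvalid([]byte("ok \uFFFD")))                // the genuine U+FFFD is kept
}
```

The standard library also offers strings.ToValidUTF8 (Go 1.13 and later), but it replaces each run of invalid bytes with a single replacement string, so the sample input would become "a.b." rather than "a..b.".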
