bytes to string conversion with invalid characters

Time:01-11

I need to parse UDP packets which can be invalid or contain some errors. I would like to replace invalid characters with "." after a bytes-to-string conversion, in order to display the content of the packets.

How can I do it? This is my code:

package main

import (
   "fmt"
   "strings"
)

func main() {
   a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
   s := string(a)
   s = strings.Replace(s, string(0xFFFD), ".", 0)

   fmt.Println("s: ", s) // I would like to display "a..b."
   for _, r := range s {
      fmt.Println("r: ", r)
   }
   rs := []rune(s)
   fmt.Println("rs: ", rs)
}

CodePudding user response:

The root problem with your approach is that the result of type-converting []byte to string does not contain any U+FFFD characters: this type conversion only copies bytes from the source to the destination, verbatim.
Just like byte slices, strings in Go are not obliged to contain UTF-8-encoded text; they can contain any data, including opaque binary data which has nothing to do with text.
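
A minimal sketch to check this, using the byte slice from the question. The helper name describe is made up for illustration; it only reports what the converted string holds.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// describe is a hypothetical helper: it converts b to a string and reports
// the string's length in bytes, whether it is valid UTF-8, and whether it
// contains the three-byte encoding of U+FFFD.
func describe(b []byte) (n int, valid, hasFFFD bool) {
	s := string(b) // copies the bytes verbatim; nothing is decoded here
	return len(s), utf8.ValidString(s), strings.Contains(s, "\uFFFD")
}

func main() {
	n, valid, hasFFFD := describe([]byte{'a', 0xff, 0xaf, 'b', 0xbf})
	fmt.Println(n, valid, hasFFFD) // 5 false false
}
```

The string keeps all five original bytes, is not valid UTF-8, and holds no encoded U+FFFD, so a substring search for that character finds nothing.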

But some operations on strings, namely type-converting them to []rune and iterating over them using range, do interpret strings as UTF-8-encoded text. That is precisely where you got tripped up: your range debugging loop attempted to interpret the string, and each time an attempt at decoding a properly encoded code point failed, range yielded a replacement character, U+FFFD.
To reiterate, the string obtained by the type conversion does not contain the characters you wanted to replace with strings.Replace. (Separately, passing 0 as the last argument tells strings.Replace to make no replacements at all; a negative value such as -1 means no limit.)
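
A small sketch of that decoding behaviour, again assuming the input from the question. decodeAll is an illustrative name, not a standard function.

```go
package main

import "fmt"

// decodeAll (hypothetical helper) collects the runes that a range loop
// yields for s, which is the same decoding the debugging loop performed.
func decodeAll(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := string([]byte{'a', 0xff, 0xaf, 'b', 0xbf})
	fmt.Println(decodeAll(s)) // [97 65533 65533 98 65533]
	fmt.Println([]rune(s))    // [97 65533 65533 98 65533]: the conversion decodes too
	fmt.Println([]byte(s))    // [97 255 175 98 191]: the stored bytes are unchanged
}
```

The value 65533 is U+FFFD in decimal. It shows up only in the decoded runes, never in the bytes of the string itself.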

As to how to actually make a valid UTF-8-encoded string out of your data, you might employ a two-step process:

  1. Type-convert your byte slice to a string, as you already do.
  2. Use any means of interpreting a string as UTF-8, replacing each U+FFFD that dynamically appears during this process, as you're iterating.

Something like this:

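// Requires: import "strings"

// replaceInvalid returns b as a string in which every byte that cannot be
// decoded as UTF-8 is replaced by '.'.
func replaceInvalid(b []byte) string {
  var sb strings.Builder
  for _, c := range string(b) {
    if c == '\uFFFD' { // range yields U+FFFD for each undecodable byte
      sb.WriteByte('.')
    } else {
      sb.WriteRune(c)
    }
  }
  return sb.String()
}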
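
One caveat with comparing against U+FFFD: a packet that legitimately contains a properly encoded U+FFFD would have it replaced too. If that matters, a possible variant (a sketch, not the only way to do it; the name sanitize is hypothetical) decodes with utf8.DecodeRune, which reports a width of 1 for an invalid byte but 3 for a genuine U+FFFD.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize (hypothetical name) replaces each undecodable byte with '.',
// but keeps a U+FFFD that was properly encoded in the input.
func sanitize(b []byte) string {
	var sb strings.Builder
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			sb.WriteByte('.') // invalid byte, not a real U+FFFD
		} else {
			sb.WriteRune(r)
		}
		b = b[size:] // advance past the bytes just consumed
	}
	return sb.String()
}

func main() {
	fmt.Println(sanitize([]byte{'a', 0xff, 0xaf, 'b', 0xbf})) // a..b.
	fmt.Println(sanitize([]byte("ok \uFFFD")))                // the encoded U+FFFD is kept
}
```

Both approaches emit one '.' per invalid byte, which matches the "a..b." output asked for in the question.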

See also: Remove invalid UTF-8 characters from a string
