Java and .NET/PowerShell producing different UTF-8 bytes

Time:01-07

I am getting grey hair over this. I need to convert strings in PowerShell to UTF-8. My reference code is in Java (and works as intended with the bigger application), so I need to reproduce what it does.

In Java, I do:

    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();

    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }
    
    public static void main(String[] args) throws Exception {
        System.out.println(bytesToHex("aöß".getBytes("UTF8")));
    }

which outputs 61C3B6C39F.

In PowerShell, I do

Write-Output $(([System.Text.UTF8Encoding]::New($false, $true).getBytes("aöß") | ForEach-Object ToString X2) -join '')

which outputs 61C383C2B6C383C5B8

Why are they different? How can I make the PowerShell encoding match the Java one?

I would be very grateful for any insights!

Best eDude

EDIT: Ok, now I am more confused. When running the above command in the PowerShell 5.1 console, it works as expected. When putting it into a script file and executing that, it does not.

EDIT 2: More info: if the script file is saved in UTF-8 encoding, the error appears. If it is saved in another encoding (e.g. Notepad++'s ANSI), it works. Why is the encoding of the script file changing the behavior of the script itself? How can I prevent this and make sure to get consistent results?

CodePudding user response:

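Try converting your script file to UTF-8-BOM encoding in Notepad++ and running it. Windows PowerShell 5.1's default encoding is Western European (Windows) (windows-1252), so when your script file has no BOM, it is read as windows-1252 rather than UTF-8. Each byte of a multi-byte UTF-8 sequence is then taken as a separate character (ö, stored as C3 B6, becomes Ã¶), and GetBytes encodes those characters again, which gives the longer output C383C2B6 instead of C3B6.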
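To see that this misreading accounts for the exact bytes in the question, here is a minimal Java sketch that simulates it. The class name MojibakeDemo and its helper methods are made up for illustration and are not part of the original code. It assumes the JDK provides the windows-1252 charset, which standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class MojibakeDemo {

    // Upper-case hex dump, same output format as bytesToHex in the question.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulates a BOM-less UTF-8 script being read as windows-1252:
    // decode the UTF-8 bytes with the wrong charset, then encode the
    // resulting (garbled) string as UTF-8 again.
    static String misreadAndReencode(String text) {
        byte[] onDisk = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(onDisk, Charset.forName("windows-1252"));
        return toHex(misread.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // a, o-umlaut, sharp s; escaped so the source file encoding does not matter
        String text = "a\u00F6\u00DF";
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(misreadAndReencode(text));
    }
}
```

Running main prints 61C3B6C39F (the Java reference output) followed by 61C383C2B6C383C5B8 (the output of the script saved as UTF-8 without a BOM).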

PowerShell 7 reads script files without a BOM as UTF-8, so the problem shouldn't occur there.

You can check the default encoding for the different PowerShell versions like this:

PS> [System.Text.Encoding]::Default

You can also build the string from character codes, which avoids the issue in files without a BOM because the script then contains only ASCII characters:

$str = [char]0x0061 + [char]0x00F6 + [char]0x00DF

Write-Output $(([System.Text.Encoding]::UTF8.GetBytes($str) | ForEach-Object ToString X2) -join '')