I am getting grey hair over this. I need to convert strings to UTF-8 in PowerShell. My reference code is in Java (and works as intended with the bigger application), so I need to reproduce what it does.
In Java, I do:
private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();

// Converts each byte into two upper-case hex digits.
public static String bytesToHex(byte[] bytes) {
    char[] hexChars = new char[bytes.length * 2];
    for (int j = 0; j < bytes.length; j++) {
        int v = bytes[j] & 0xFF;                    // treat the byte as unsigned
        hexChars[j * 2] = HEX_ARRAY[v >>> 4];       // high nibble
        hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];  // low nibble
    }
    return new String(hexChars);
}

public static void main(String[] args) throws Exception {
    System.out.println(bytesToHex("aöß".getBytes("UTF8")));
}
which outputs 61C3B6C39F.
In PowerShell, I do
Write-Output $(([System.Text.UTF8Encoding]::New($false, $true).getBytes("aöß") | ForEach-Object ToString X2) -join '')
which outputs 61C383C2B6C383C5B8
Why are they different? How can I make the PowerShell encoding match the Java one?
I would be very grateful for any insights!
Best eDude
EDIT: Ok, now I am more confused. When running the above command in the PowerShell 5.1 console, it works as expected. When putting it into a script file and executing that, it does not.
EDIT 2: More info: if the script file is saved in UTF-8 encoding, the error appears. If it is saved in another encoding (e.g. Notepad++'s ANSI), it works. Why is the encoding of the script file changing the behavior of the script itself? How can I prevent this and make sure to get consistent results?
CodePudding user response:
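Try converting your script file to UTF-8-BOM encoding in Notepad++ and running it again. Windows PowerShell 5.1 reads a script file that has no BOM using the system's ANSI code page, typically Western European (windows-1252), not as UTF-8. Each byte of a multi-byte UTF-8 sequence is therefore decoded as a separate character: ö (bytes C3 B6) becomes Ã¶ and ß (bytes C3 9F) becomes ÃŸ. Encoding those four characters as UTF-8 again yields C383C2B6 and C383C5B8, which is exactly the longer output you see.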
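The effect can be reproduced on the Java side. The following is a minimal sketch, not part of the original code: the class name MojibakeDemo and the helper readAsWindows1252 are made up for illustration, and it assumes the JDK provides the windows-1252 charset (standard JDK builds do). It encodes the string as UTF-8, decodes those bytes as windows-1252 the way a BOM-less script is read, and encodes the result as UTF-8 again. The string literal uses Unicode escapes so the demo does not depend on the encoding of its own source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();

    // Same hex helper as in the question.
    static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }

    // Hypothetical helper: simulates a reader that wrongly decodes
    // UTF-8 bytes as windows-1252 (one character per byte).
    static String readAsWindows1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String original = "a\u00F6\u00DF"; // "aöß" written with escapes
        String garbled = readAsWindows1252(original);

        // Correct encoding: prints 61C3B6C39F
        System.out.println(bytesToHex(original.getBytes(StandardCharsets.UTF_8)));
        // Mis-decoded, then re-encoded: prints 61C383C2B6C383C5B8
        System.out.println(bytesToHex(garbled.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The second line matches the output of the script saved as UTF-8 without a BOM, which suggests the string was already garbled when the script was parsed, before GetBytes ever ran.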
PowerShell 7 reads script files without a BOM as UTF-8 by default, so it shouldn't be a problem there.
You can check the default encoding in each PowerShell version like this:
PS> [System.Text.Encoding]::Default
You can also build the string from character codes, so the script contains only ASCII and gives the same result in files without a BOM:
$str = [char]0x0061 + [char]0x00F6 + [char]0x00DF
Write-Output $(([System.Text.Encoding]::UTF8.GetBytes($str) | ForEach-Object ToString X2) -join '')
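To find out whether a given script file actually carries a UTF-8 BOM, you can look at its first three bytes (EF BB BF). Here is a small sketch in Java, the language of the reference code; the class name BomCheck and both method names are invented for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomCheck {
    // True if the data begins with the UTF-8 byte order mark EF BB BF.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Reads the whole file, which is fine for small script files.
    static boolean hasUtf8Bom(Path file) throws IOException {
        return startsWithUtf8Bom(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomCheck <script-file>");
            return;
        }
        System.out.println(hasUtf8Bom(Paths.get(args[0]))
                ? "UTF-8 BOM present"
                : "no UTF-8 BOM");
    }
}
```

A file reported as "no UTF-8 BOM" that contains non-ASCII characters is the kind that Windows PowerShell 5.1 decodes with the ANSI code page.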
