PowerShell’s Object Pipeline Corrupts Piped Binary Data
January 29, 2010 17 Comments
Yesterday I used curl to download a huge database backup from a remote server. Curl is UNIX-y. By default, it streams its output to stdout, and you then redirect that stream to a pipe or file like this:
$ curl sftp://server.somehwere.com/somepath/file > file
The above is essentially what I did from inside of a PowerShell session. After a couple of hours, I had my huge download and discovered that the database backup was corrupt. Then I realized that the file I ended up with was a little over 2x the size of the original file.
What happened?
Long story short: this is a consequence of the object pipeline in PowerShell. You should never pipe raw binary data in PowerShell because it will be corrupted.
The Gory Details
You don’t have to work with a giant file. A small binary file will also be corrupted. I took a deeper look at this using a small PNG image.
PS> curl sftp://ftp.noserver.priv/img.png > img1.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  19986      0  0:00:02  0:00:02 --:--:-- 74950
(FYI. Curl prints the progress of the download to stderr, so you see something on the console even though the stdout is redirected to file.)
This is essentially what I did with my big download, and it yields a file that is more than 2x the size of the original. My theory at this point was that since String objects in .NET are always Unicode, the bytes were being doubled as a consequence of an implicit conversion to UTF-16.
Using the > operator in PowerShell is the same thing as piping to the Out-File cmdlet. Out-File has some encoding options. The interesting one is OEM:
"OEM" uses the current original equipment manufacturer code page identifier for the operating system.
That is essentially writing raw bytes.
PS> curl sftp://ftp.noserver.priv/img.png | out-file -encoding oem img2.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  26304      0  0:00:01  0:00:01 --:--:-- 74950
I was clearly on to something because this almost works: the file is just slightly larger than the original.
Just to prove that my build of curl isn’t broken, I also used the -o (--output) option.
PS> curl sftp://ftp.noserver.priv/img.png -o img3.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  25839      0  0:00:01  0:00:01 --:--:-- 76796
Here’s the result. You can see by the file sizes and md5 hashes that img1.png and img2.png are corrupt but img3.png is the same as the original img.png.
PS> ls img*.png | select name, length | ft -AutoSize

Name     Length
----     ------
img.png   46769
img1.png  94168
img2.png  47083
img3.png  46769

PS> md5 img*.png
MD5 (img.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf
MD5 (img1.png) = eb5a1421bcc4e3bea1063610b26e60f9
MD5 (img2.png) = 03b9b691f86404e9538a9c9c668c50ed
MD5 (img3.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf
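The same size-and-hash comparison can be scripted. What follows is a minimal Python sketch rather than anything from the original session; the function name `fingerprint` and the chunk size are illustrative choices. It reads the file in binary mode and in chunks, so it also works for a multi-gigabyte backup.

```python
import hashlib

def fingerprint(path, chunk_size=1 << 20):
    """Return (size in bytes, MD5 hex digest) for the file at `path`."""
    digest = hashlib.md5()
    size = 0
    # "rb" matters: the file must be read as bytes, never as text.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()
```

Two downloads are byte-for-byte identical only if both values match, for example `fingerprint("img.png") == fingerprint("img3.png")`.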
Hrm. What’s going on here?
Let’s look at a diff of img.png and img1.png, which was the result of using the > operator to redirect the stdout of curl to file.
The big thing to see here is that there are a lot of extra bytes. Crucially, the data has been re-encoded as two-byte UTF-16 characters, and FF FE has been added as the first two bytes. FF FE is the byte order mark (U+FEFF) as it appears in little-endian UTF-16. That confirms my theory that internally PowerShell converted the data to Unicode.
I can also reproduce the same behavior by using the Get-Content (aliased as cat) cmdlet to redirect binary data to a file.
PS> cat img1.png > img4.png
PS> ls img1.png, img4.png | select name, length | ft -AutoSize

Name     Length
----     ------
img1.png  94168
img4.png  94168
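To make the doubling and the two leading bytes concrete, here is a small Python sketch of the encoding step alone (no line splitting yet). It is a model, not PowerShell itself, and it assumes code page 437 as a stand-in for whatever OEM code page the console uses to turn incoming bytes into characters.

```python
import codecs

# The byte order mark is the code point U+FEFF; little-endian UTF-16 stores it as FF FE.
bom = "\ufeff".encode("utf-16-le")
print(bom.hex(" "))  # ff fe

payload = bytes(range(256))     # every possible byte value, once
text = payload.decode("cp437")  # assumption: cp437 stands in for the console code page
encoded = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Each byte became one character, and each character became two bytes.
print(len(payload), len(encoded))  # 256 514
```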
So what is going on inside that pipeline?
PS> (cat .\img.png).GetType() | select name

Name
----
Object[]

PS> cat .\img.png | %{ $_.GetType() } | group name | select count, name | ft -AutoSize

Count Name
----- ----
  315 String
The file is being converted into an Object array of 315 elements. Each element of the array contains a String object. Since the internal data type of String is Unicode, sometimes referred to loosely as “double-byte” characters, the total size of that data is roughly doubled.
Using the OEM text encoder converts the data back, but not quite. What is going wrong? Time to look at a diff of img.png and img2.png, which was produced with the OEM text encoder.
What you see here is that a lot of 0x0D bytes have been inserted in front of all of the 0x0A bytes.
PS> (ls .\img2.png ).Length - (ls .\img.png ).Length
314
There are actually 314 of these 0x0D bytes added. What the heck is 0x0D? It is Carriage Return (CR). 0x0A is Line Feed (LF). In a Windows text file each line is marked with the sequence CRLF. 314 is exactly the number of CRLF sequences you need to turn a 315 element array of strings into a text file with Windows line endings.
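The reported file sizes are consistent with one another, which is a useful sanity check. The sketch below is plain arithmetic on the numbers from the listing above. It rests on an assumption, not a confirmed fact: that img1.png is the same CR-padded text as img2.png, written as UTF-16 (two bytes per character) with a two-byte byte order mark in front.

```python
original = 46769  # img.png
img1 = 94168      # written with >
img2 = 47083      # written with out-file -encoding oem

added_cr = img2 - original  # extra 0x0D bytes in the OEM output
bom_bytes = 2               # FF FE at the start of the UTF-16 file

# Assumption: the > output is the OEM output with every byte doubled, plus the BOM.
predicted_img1 = 2 * img2 + bom_bytes

print(added_cr)        # 314
print(predicted_img1)  # 94168
```

The prediction matches the observed size of img1.png exactly, so the "more than 2x" growth is fully accounted for by the UTF-16 doubling, the inserted carriage returns, and the byte order mark.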
Here’s what is happening. PowerShell is making some assumptions:
- Anything streaming in as raw bytes is assumed to be text.
- The text is converted into an array by splitting on bytes that would indicate an end of line in a text file.
- The text is reconstituted by out-file using the standard Windows end of line characters.
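The three steps above can be modeled in a few lines of Python. This is a simplified sketch of the behavior described here, not PowerShell's actual implementation: the function name `simulate_redirect` is made up, and cp437 is assumed as a stand-in for the OEM code page.

```python
import codecs
import re

def simulate_redirect(data: bytes, unicode_output: bool = True) -> bytes:
    """Model `native.exe > file` (UTF-16) or `| out-file -encoding oem`."""
    # Step 1: the raw bytes are assumed to be text (assumption: cp437).
    text = data.decode("cp437")
    # Step 2: split wherever a text file could end a line: CRLF, CR, or LF.
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing terminator does not yield an extra empty line
    # Step 3: write every element back out followed by a Windows line ending.
    rebuilt = "".join(line + "\r\n" for line in lines)
    if unicode_output:
        return codecs.BOM_UTF16_LE + rebuilt.encode("utf-16-le")
    return rebuilt.encode("cp437")

# The eight-byte PNG signature followed by a few arbitrary bytes.
sample = b"\x89PNG\r\n\x1a\n\x00\x01\x0a\x02\xff"
oem = simulate_redirect(sample, unicode_output=False)
utf16 = simulate_redirect(sample)
print(len(sample), len(oem), len(utf16))  # 13 17 36
```

Even on this 13-byte sample, the OEM output is longer than the input and no longer identical to it, and the UTF-16 output is twice the OEM output plus two bytes.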
While this will work just fine with any kind of text, it is virtually guaranteed to corrupt any binary data. With the default text encoding you get a doubling of the original bytes and a bunch of new 0x0D bytes, too. The corruption fundamentally happens when the data is split into a string array. Using a byte-preserving encoder at the end of the pipeline doesn’t put the data back correctly because it always puts CRLF at the end of every array element. Unfortunately, since there is more than one possible end-of-line sequence, this is as good as anything. Using a Windows-to-Unix conversion will not fix the file. There is no way to put Humpty Dumpty back together again.
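The reason no after-the-fact conversion can repair the file is that the split-and-rejoin step is many-to-one. The Python sketch below models that step (again assuming cp437 as a stand-in for the OEM code page; `split_and_rejoin` is a hypothetical helper name) and feeds it four different inputs.

```python
import re

def split_and_rejoin(data: bytes) -> bytes:
    """Split on any line ending, then write CRLF after every element."""
    text = data.decode("cp437")  # assumption: stand-in for the OEM code page
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing terminator does not yield an extra empty line
    return "".join(line + "\r\n" for line in lines).encode("cp437")

# Four distinct byte sequences...
inputs = [b"A\nB", b"A\rB", b"A\r\nB", b"A\nB\n"]
outputs = {split_and_rejoin(item) for item in inputs}

# ...all collapse to the same output, so the original cannot be recovered.
print(outputs)  # {b'A\r\nB\r\n'}
```

Once the output is all you have, there is no way to tell which of the inputs produced it, which is exactly the information a binary file needs preserved.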
To Sum Up, Just Don’t Do It
The moral is that it is never safe to pipe raw binary data in PowerShell. Pipes in PowerShell are for objects and text that can safely be automagically converted to a string array. You need to be cognizant of this and use Stream objects to manipulate binary files.
When using curl with PowerShell, never, never redirect to file with >. Always use the -o or --output <file> switch. If you need to stream the output of curl to another utility (say gpg) then you need to sub-shell into cmd for the binary streaming or use temporary files.
Coorl :-), there is one PowerShell to native executable interaction that I don’t need to debug.
Thanks for posting these results.
I am currently working on a PowerShell script that interfaces with a REST API and needs to do GET, POST, and DELETE requests. I am trying to decide whether to use the .NET Framework libraries or curl. Both have their pros and cons.
Klaus
Thanks for taking the time to write this post, I experienced the same issue!
You can solve this problem by using “out-file -encoding ascii” instead of the > operator.
I don’t think that will work because the pipeline splits the data into an array automatically on unix or windows line endings.
StreamWriter works well for this.
$sw = New-Object System.IO.StreamWriter("dir.txt")
$sw.Write("a")
$sw.Close()
This results in a 1 byte file representing the character “a”. No trailing new lines, no encoding marks (FFFE etc) or encoding issues. Good for binary data.
You can’t really pipe the output of a non-posh binary to a streamwriter object, though.
Turns out you can… in about 15 lines: http://stackoverflow.com/questions/24708859/output-binary-data-on-powershell-pipeline/24745250#24745250
I was trying to avoid creating a temporary file and this worked for me!
Yeah. That’s not a pipe.
And it is not even a real curl, since MS decided to put their own way of downloading files and alias that to the curl command – disregarding the real code and year of effort that went into creating curl in the first place. https://daniel.haxx.se/blog/2016/08/19/removing-the-powershell-curl-alias/
This was PowerShell 1.0. There was no Invoke-WebRequest. I was talking about piping the real curl in PowerShell; it doesn’t work unless the stream is really text.
Scary!
Note: Unicode is not an encoding, it’s a standard for representing text, defining both code points (e.g. U+20021 is “𠀡”) and rules for combining, collating and transforming text (e.g. “ß” uppercasing to “SS” and sorting at the same place if the text is German).
UTF-8, UTF-16LE, UTF-32 are encodings – they are ways of mapping Unicode code points to bytes, using code units of various widths. So U+20021 becomes 0xf0 0xa0 0x80 0xa1 in UTF-8, and 0x40 0xd8 0x21 0xdc in UTF-16LE.
So bytes are never Unicode, they are just bytes. You can interpret bytes as being encoded in a certain way and decode them that way, and presumably PowerShell is treating that image as being e.g. UTF-8 (since it doesn’t start with a byte-order-mark \xff \xfe?) and then encoding it into UTF-16LE, which seems to be the popular choice on Windows. You could test that hypothesis by storing e.g. “å” in a file, saving it as UTF-8 (\xc3 \xa5), then piping it into a new file and checking that the bytes end up as \xff \xfe \xe5 \x00 (possibly followed by a newline, \x0d \x00 \x0a \x00).
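These mappings are easy to check. A short Python sketch, given only as an illustration of the encodings named above:

```python
ch = "\U00020021"  # the code point U+20021

# One code point, three different byte sequences depending on the encoding.
print(ch.encode("utf-8").hex(" "))      # f0 a0 80 a1
print(ch.encode("utf-16-le").hex(" "))  # 40 d8 21 dc
print(ch.encode("utf-32-le").hex(" "))  # 21 00 02 00
```

U+20021 lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair (D840 DC21), which is why the UTF-16LE form is four bytes long.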
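The expected bytes for that experiment can be worked out ahead of time. The Python sketch below computes what the output file would start with under two different guesses about how the incoming bytes are decoded; both guesses (UTF-8 and code page 437) are assumptions for illustration, not statements about what PowerShell does on a given machine.

```python
import codecs

source = "å".encode("utf-8")
print(source.hex(" "))  # c3 a5

def reencode(data: bytes, assumed_input_encoding: str) -> bytes:
    """Decode with the assumed encoding, then write UTF-16LE with a BOM."""
    text = data.decode(assumed_input_encoding)
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Hypothesis from the comment: the bytes are read as UTF-8.
as_utf8 = reencode(source, "utf-8")
print(as_utf8.hex(" "))  # ff fe e5 00

# Alternative: the bytes are read with an OEM code page (cp437 as a stand-in).
as_oem = reencode(source, "cp437")
print(len(as_utf8), len(as_oem))  # 4 6
```

If the file produced by the real pipeline starts with ff fe e5 00, the UTF-8 hypothesis holds; if it contains two characters after the byte order mark, the input was decoded one byte at a time with some single-byte code page.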
Thanks for this post. Has this changed? I asked a question at https://stackoverflow.com/questions/47552334/never-never-powershell-pipelines-with-non-net-programs
Thanks for the posting 🙂
Actually, piping works fine, just keep it as a byte array and you’ll be fine. Also, never use the OEM encoding, always use BYTE. Here’s an example that will pipe a 256 byte array from a remote PSSession, 10 times, back to the caller, and save it into a file.
$s = New-PSSession SomeRemoteComputerYouHaveAccessTo

# Note: -Encoding Byte exists in Windows PowerShell (5.1 and earlier); PowerShell 6+ uses -AsByteStream instead.
function Save-Bytes([string] $path, [byte[]] $data) {
    # Begin runs once, before any pipeline input arrives.
    Begin { Set-Content $path $data -Encoding byte; Write-Host "BEGIN"; }
    # Process runs once per pipeline item; each item here is a whole byte[] chunk.
    Process { Add-Content $path $_ -Encoding byte; Write-Host $_.GetType(); }
}

Invoke-Command $s {
    # Create and fill a 256-element byte[] array with the values 0-255.
    $b = [byte[]]::new(256);
    for($i=0; $i -lt $b.Length; $i++) { $b[$i] = $i; }
    # Write the byte[] array out 10 times (use -NoEnumerate or it will enumerate and send each byte separately, which will work but is a terrible idea).
    for($i=0; $i -lt 10; $i++) { Write-Output $b -NoEnumerate; }
} | Save-Bytes(".\x");
Conversion to UTF-16 also happens when the source file is text (UTF-8). I wrote a PS script to update the TNSNAMES.ORA file on a number of Oracle database servers. The servers weren’t restarted until a few months later, and it took ages to find the issue, because in Get-Content and in other apps like Notepad the file appeared correct.