PowerShell’s Object Pipeline Corrupts Piped Binary Data

Yesterday I used curl to download a huge database backup from a remote server. Curl is UNIX-ey. By default, it streams its output to sdout and you then redirect that stream to a pipe or file like this:

$ curl sftp://server.somehwere.com/somepath/file > file

The above is essentially what I did from inside of a PowerShell session. After a couple of hours, I had my huge download and discovered that the database backup was corrupt. Then I realized that the file I ended up with was a little over 2x the size of the original file.

What happened?

Long story, short. This is a consequence of the object pipeline in PowerShell and you should never pipe raw binary data in PowerShell because it will be corrupted.

The Gory Details

You don’t have to work with a giant file. A small binary file will also be corrupted. I took a deeper look at this using a small PNG image.

PS> curl sftp://ftp.noserver.priv/img.png > img1.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  19986      0  0:00:02  0:00:02 --:--:— 74950

(FYI. Curl prints the progress of the download to stderr, so you see something on the console even though the stdout is redirected to file.)

This is essentially what I did with my big download and yields a file that is more than 2x the size of the original. My theory at this point was that since String objects in .Net are always Unicode, the bytes were being doubled as a consequence of an implicit conversion to UTF-16.

Using the > operator in PowerShell is the same thing piping to the Out-File cmdlet. Out-File has some encoding options. The interesting one is OEM:

"OEM" uses the current original equipment manufacturer code page identifier for the operating system.

That is essentially writing raw bytes.

PS> curl sftp://ftp.noserver.priv/img.png | out-file -encoding oem img2.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  26304      0  0:00:01  0:00:01 --:--:-- 74950

I was clearly on to something because this almost works. The file is just slightly larger than the original. It almost worked.

Just to prove that my build of curl isn’t broken, I also used the –o (–output) option.

PS> curl sftp://ftp.noserver.priv/img.png -o img3.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  25839      0  0:00:01  0:00:01 --:--:-- 76796

Here’s the result. You can see by the file sizes and md5 hashes that img1.png and img2.png are corrupt but img3.png is the same as the original img.png.

PS> ls img*.png | select name, length | ft -AutoSize

Name     Length
----     ------
img.png   46769
img1.png  94168
img2.png  47083
img3.png  46769


PS> md5 img*.png
MD5 (img.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf
MD5 (img1.png) = eb5a1421bcc4e3bea1063610b26e60f9
MD5 (img2.png) = 03b9b691f86404e9538a9c9c668c50ed
MD5 (img3.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf

Hrm. What’s going on here?

Let’s look at a diff of img.png and img1.png, which was the result of using the > operator to redirect the stdout of curl to file.

diff-img-img1

The big thing to see here is that there are a lot of extra bytes. Crucially, the bytes are Unicode glyphs and FFFE have been added as the first two bytes. 0xFFEE is the byte order mark for a little-endian UTF-16. That confirms my theory that internally PowerShell converted the data to Unicode.

I can also create the same behavior by the Get-Content (aliased as cat) cmdlet to redirect binary data to a file.

PS> cat img1.png > img4.png
PS> ls img1.png, img4.png | select name, length | ft -AutoSize

Name     Length
----     ------
img1.png  94168
img4.png  94168

So what is going on inside that pipeline?

PS> (cat .\img.png).GetType() | select name

Name
----
Object[]


PS> cat .\img.png | %{ $_.GetType() } | group name | select count, name | ft -AutoSize

Count Name
----- ----
  315 String

The file is being converted into an Object array of 315 elements. Each element of the array contains a String object. Since the internal data type of String is Unicode, sometimes referred to loosely as “double-byte” characters, the total size of that data is roughly doubled.

Using the OEM text encoder converts the data back but not quite. What is going wrong? Time to look at a diff of img.png and img2.png, which was the OEM text encoder.

 diff-img-img2

What you see here is a lot of 0x0D bytes have been inserted in front of all of the 0x0A bytes.

PS> (ls .\img2.png ).Length - (ls .\img.png ).Length
314

There are actually 314 of these 0x0D bytes added. What the heck is 0x0D? It is Carriage Return (CR). 0x0A is Line Feed (LF). In a Windows text file each line is marked with the sequence CRLF. 314 is exactly the number of CRLF sequences you need to turn a 315 element array of strings into a text file with Windows line endings.

Here’s what is happengin. PowerShell is making some assumtions:

  1. Anything streaming in as raw bytes is assumed to be text
  2. The text is converted into an array by splitting on bytes that would indicate an end of line in a text file.
  3. The text is reconstituted by out-file using the standard Windows end of line characters.

While this will work just fine with any kind of text, it is virtually guaranteed to corrupt any binary data. With the default text encoding you get a doubling of the original bytes and a bunch of new 0x0D bytes, too. The corruption fundamentally happens when the data is split into a string array. Using a binary encoder at the end of the pipeline doesn’t put the data back correctly because it always puts CRLF at the end of every array element. Unfortunately since there is more than one possible end of line sequence, this is as good as anything. Using a Windows to Unix conversion will not fix the file. There is no way to put humpty dumpty back together again.

To Sum Up, Just Don’t Do It

The moral is that it is never safe to pipe raw binary data in PowerShell. Pipes in PowerShell are for objects and text that can safely be automagically converted to a string array. You need to be cognizant of this and use Stream objects to manipulate binary files.

When using curl with PowerShell, never, never redirect to file with >. Always use the –o or –out <file>switch. If you need to stream the output of curl to another utility (say gpg) then you need to sub-shell into cmd for the binary streaming or use temporary files.

About these ads

5 Responses to PowerShell’s Object Pipeline Corrupts Piped Binary Data

  1. Pingback: Lightweight shelving of work-in-progress, with Mercurial « Nathan Evans' Nemesis of the Moment

  2. Coorl :-), there is one PowerShell to native executable interaction that I don’t need to debug.
    Thanks for posting these results.
    I am currently working on a PowerShell script that interfaces with a REST API and needs to do GET, POST, and DELETE requests. I am trying to decide whether to use the .NET Framework libraries or curl. Both have their pros and cons.

    Klaus

  3. test says:

    Thanks for taking the time to write this post, I experienced the same issue!

  4. Erik says:

    You can solve this problem by using “out-file -encoding ascii” instead of the > operator.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 84 other followers

%d bloggers like this: