PowerShell’s Object Pipeline Corrupts Piped Binary Data

January 29, 2010

Yesterday I used curl to download a huge database backup from a remote server. Curl is UNIX-ey. By default, it streams its output to stdout, and you then redirect that stream to a pipe or file like this:

$ curl sftp://server.somehwere.com/somepath/file > file

The above is essentially what I did from inside of a PowerShell session. After a couple of hours, I had my huge download and discovered that the database backup was corrupt. Then I realized that the file I ended up with was a little over 2x the size of the original file.

What happened?

Long story short: this is a consequence of the object pipeline in PowerShell. You should never pipe raw binary data in PowerShell, because it will be corrupted.

The Gory Details

You don’t have to work with a giant file. A small binary file will also be corrupted. I took a deeper look at this using a small PNG image.

PS> curl sftp://ftp.noserver.priv/img.png > img1.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  19986      0  0:00:02  0:00:02 --:--:-- 74950

(FYI. Curl prints the progress of the download to stderr, so you see something on the console even though the stdout is redirected to file.)

This is essentially what I did with my big download, and it yields a file that is more than 2x the size of the original. My theory at this point was that since String objects in .NET are always Unicode, the bytes were being doubled as a consequence of an implicit conversion to UTF-16.

Using the > operator in PowerShell is the same thing as piping to the Out-File cmdlet. Out-File has some encoding options. The interesting one is OEM:

"OEM" uses the current original equipment manufacturer code page identifier for the operating system.

That is essentially writing raw bytes.

PS> curl sftp://ftp.noserver.priv/img.png | out-file -encoding oem img2.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  26304      0  0:00:01  0:00:01 --:--:-- 74950

I was clearly on to something, because this almost works: the file is only slightly larger than the original.

Just to prove that my build of curl isn’t broken, I also used the -o (--output) option.

PS> curl sftp://ftp.noserver.priv/img.png -o img3.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  25839      0  0:00:01  0:00:01 --:--:-- 76796

Here’s the result. You can see by the file sizes and md5 hashes that img1.png and img2.png are corrupt but img3.png is the same as the original img.png.

PS> ls img*.png | select name, length | ft -AutoSize

Name     Length
----     ------
img.png   46769
img1.png  94168
img2.png  47083
img3.png  46769


PS> md5 img*.png
MD5 (img.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf
MD5 (img1.png) = eb5a1421bcc4e3bea1063610b26e60f9
MD5 (img2.png) = 03b9b691f86404e9538a9c9c668c50ed
MD5 (img3.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf

Hrm. What’s going on here?

Let’s look at a diff of img.png and img1.png, which was the result of using the > operator to redirect the stdout of curl to file.

The big thing to see here is that there are a lot of extra bytes. Crucially, every character is now a two-byte UTF-16 code unit, and the bytes FF FE have been added at the start of the file. FF FE is the byte order mark for little-endian UTF-16. That confirms my theory that PowerShell internally converted the data to Unicode.
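The encoding step on its own can be sketched outside PowerShell. The following Python snippet is only an illustration, not PowerShell's actual implementation: it assumes the incoming bytes are decoded one byte per character (cp437 is a stand-in for whatever code page the console uses) and then written as UTF-16LE with a byte order mark. It ignores the line splitting discussed later.

```python
import codecs


def redirect_as_utf16(text: str) -> bytes:
    """Encode text the way a UTF-16LE writer with a byte order mark would."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


# Sample input: the 8-byte PNG signature.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Assumption: each byte becomes exactly one character (cp437 as a stand-in code page).
as_text = png_signature.decode("cp437")
written = redirect_as_utf16(as_text)

print(written[:2].hex())                 # fffe
print(len(png_signature), len(written))  # 8 18
```

Eight input bytes become 2 bytes of byte order mark plus 16 bytes of UTF-16 code units, which is the "a little over 2x" growth seen in img1.png.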

I can also reproduce the same behavior by using the Get-Content cmdlet (aliased as cat) to redirect binary data to a file.

PS> cat img1.png > img4.png
PS> ls img1.png, img4.png | select name, length | ft -AutoSize

Name     Length
----     ------
img1.png  94168
img4.png  94168

So what is going on inside that pipeline?

PS> (cat .\img.png).GetType() | select name

Name
----
Object[]


PS> cat .\img.png | %{ $_.GetType() } | group name | select count, name | ft -AutoSize

Count Name
----- ----
  315 String

The file is being converted into an Object array of 315 elements. Each element of the array contains a String object. Since .NET strings are stored internally as UTF-16, sometimes referred to loosely as “double-byte” characters, the total size of that data is roughly doubled.

Using the OEM text encoder converts the data back, but not quite. What is going wrong? Time to look at a diff of img.png and img2.png, which was written with the OEM text encoder.

What you see here is a lot of 0x0D bytes have been inserted in front of all of the 0x0A bytes.

PS> (ls .\img2.png ).Length - (ls .\img.png ).Length
314

There are actually 314 of these 0x0D bytes added. What the heck is 0x0D? It is Carriage Return (CR). 0x0A is Line Feed (LF). In a Windows text file each line is marked with the sequence CRLF. 314 is exactly the number of CRLF sequences you need to turn a 315 element array of strings into a text file with Windows line endings.
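The file sizes listed earlier fit together arithmetically. The short Python check below uses only the lengths from the listing above; the interpretation (one byte per character in the OEM file, two bytes per character plus a 2-byte byte order mark in the UTF-16 file) is an assumption that the numbers happen to support.

```python
# Lengths taken from the directory listing in the post.
original = 46769  # img.png
oem = 47083       # img2.png, written with Out-File -Encoding OEM
utf16 = 94168     # img1.png, written with the > operator

bom_length = 2    # FF FE

extra_cr = oem - original
print(extra_cr)                        # 314
print(bom_length + 2 * oem == utf16)   # True
```

In other words, img1.png looks like the same text as img2.png, inserted 0x0D bytes included, written with two bytes per character and a byte order mark in front.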

Here’s what is happening. PowerShell is making some assumptions:

Anything streaming in as raw bytes is assumed to be text.
The text is converted into an array by splitting on bytes that would indicate an end of line in a text file.
The text is reconstituted by out-file using the standard Windows end of line characters.

While this works just fine with any kind of text, it is virtually guaranteed to corrupt binary data. With the default text encoding you get a doubling of the original bytes and a bunch of new 0x0D bytes, too. The corruption fundamentally happens when the data is split into a string array. Switching to the OEM encoder at the end of the pipeline doesn’t put the data back correctly, because Out-File always writes CRLF after every array element. Since the original data can contain more than one end-of-line sequence (a bare LF as well as CRLF) and the split discards which one was there, no single choice of line ending can restore it. Running a Windows-to-Unix line-ending conversion will not fix the file either. There is no way to put Humpty Dumpty back together again.
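Why the split cannot be undone can be shown with a toy model in Python. This is a sketch under stated assumptions, not PowerShell's real code: `simulate_pipeline` is a hypothetical helper, cp437 stands in for the OEM code page, and the model splits only on CRLF and bare LF.

```python
import re

OEM = "cp437"  # assumption: stand-in for the console's OEM code page


def simulate_pipeline(data: bytes) -> bytes:
    """Toy model: decode bytes as text, split into lines, rejoin with CRLF."""
    text = data.decode(OEM)
    # Split on CRLF or a bare LF. The terminator itself is thrown away,
    # which is the lossy step.
    lines = re.split(r"\r\n|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing terminator does not start a new line
    # Writer side: every array element is followed by CRLF.
    return "".join(line + "\r\n" for line in lines).encode(OEM)


with_lf = b"\x01\n\x02"      # two bytes separated by LF
with_crlf = b"\x01\r\n\x02"  # the same two bytes separated by CRLF

print(simulate_pipeline(with_lf))                                  # b'\x01\r\n\x02\r\n'
print(simulate_pipeline(with_lf) == simulate_pipeline(with_crlf))  # True
```

Two different inputs produce identical output, so nothing that runs afterwards can tell which input it started from.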

To Sum Up, Just Don’t Do It

The moral is that it is never safe to pipe raw binary data in PowerShell. Pipes in PowerShell are for objects and text that can safely be automagically converted to a string array. You need to be cognizant of this and use Stream objects to manipulate binary files.

When using curl with PowerShell, never, never redirect to a file with >. Always use the -o or --output <file> switch. If you need to stream the output of curl to another utility (say gpg), then you need to sub-shell into cmd for the binary streaming or use temporary files.
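For contrast, here is what a byte-preserving stream looks like when it is handled outside PowerShell. This Python sketch is not a PowerShell fix. The producer command is a hypothetical stand-in for curl, and `stream_to_file` is an illustrative helper. The point is only that the data is read and written as raw chunks and never decoded into text.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical producer standing in for curl: writes every byte value,
# followed by an LF and a CRLF, to its stdout.
PAYLOAD = bytes(range(256)) + b"\n\r\n"
PRODUCER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(bytes(range(256)) + b'\\n\\r\\n')",
]


def stream_to_file(command: list[str], destination: Path, chunk_size: int = 65536) -> int:
    """Copy a child process's stdout to a file as raw bytes; return the byte count."""
    total = 0
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc, open(destination, "wb") as out:
        # Read fixed-size binary chunks; nothing is decoded or split into lines.
        while chunk := proc.stdout.read(chunk_size):
            out.write(chunk)
            total += len(chunk)
    if proc.returncode != 0:
        raise RuntimeError(f"producer exited with {proc.returncode}")
    return total


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "copy.bin"
    written = stream_to_file(PRODUCER, target)
    print(written, target.read_bytes() == PAYLOAD)  # 259 True
```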


17 Responses to PowerShell’s Object Pipeline Corrupts Piped Binary Data

tellingmachine says:

June 17, 2011 at 12:13 am

Coorl :-), there is one PowerShell to native executable interaction that I don’t need to debug.
Thanks for posting these results.
I am currently working on a PowerShell script that interfaces with a REST API and needs to do GET, POST, and DELETE requests. I am trying to decide whether to use the .NET Framework libraries or curl. Both have their pros and cons.

Klaus

test says:

June 20, 2011 at 10:05 pm

Thanks for taking the time to write this post, I experienced the same issue!

Erik says:

February 14, 2014 at 4:50 am

You can solve this problem by using “out-file -encoding ascii” instead of the > operator.

- Brian Reiter says:
  
  February 14, 2014 at 5:01 am
  
  I don’t think that will work because the pipeline splits the data into an array automatically on unix or windows line endings.
  
ben says:

August 8, 2014 at 1:51 am

StreamWriter works well for this.

$sw = New-Object System.IO.StreamWriter("dir.txt")
$sw.Write("a")
$sw.Close()

This results in a 1 byte file representing the character “a”. No trailing new lines, no encoding marks (FFFE etc) or encoding issues. Good for binary data.

- Brian Reiter says:
  
  August 8, 2014 at 4:28 am
  
  You can’t really pipe the output of a non-posh binary to a streamwriter object, though.
  
  - nathou says:
    
    December 28, 2016 at 1:34 pm
    
    Turns out you can… in about 15 lines : http://stackoverflow.com/questions/24708859/output-binary-data-on-powershell-pipeline/24745250#24745250
    I was trying to avoid creating a temporary file and this worked for me !
  - Brian Reiter says:
    
    January 4, 2017 at 6:44 pm
    
    Yeah. That’s not a pipe.
kesor says:

November 19, 2016 at 10:20 am

And it is not even a real curl, since MS decided to put in their own way of downloading files and alias that to the curl command – disregarding the real code and years of effort that went into creating curl in the first place. https://daniel.haxx.se/blog/2016/08/19/removing-the-powershell-curl-alias/

- Brian Reiter says:
  
  November 19, 2016 at 1:20 pm
  
  This was PowerShell 1.0. There was no Invoke-WebRequest. I was talking about piping the real curl in PowerShell; it doesn’t work unless the stream is really text.
  
k says:

December 2, 2016 at 9:27 am

Scary!

Note: Unicode is not an encoding, it’s a standard for representing text, defining both code points (e.g. U+20021 is “𠀡”) and rules for combining, collating and transforming text (e.g. “ß” uppercasing to “SS” and sorting at the same place if the text is German).

UTF-8, UTF-16LE, UTF-32 are encodings – they are ways of mapping Unicode code points to bytes of various widths. So U+20021 becomes 0xf0 0xa0 0x80 0xa1 in UTF-8, and 0x40 0xd8 0x21 0xdc in UTF-16LE.

So bytes are never Unicode, they are just bytes. You can interpret bytes as being encoded in a certain way and decode them that way, and presumably PowerShell is treating that image as being e.g. UTF-8 (since it doesn’t start with a byte-order-mark \xff \xfe?) and then encoding it into UTF-16LE, which seems to be the popular choice on Windows. You could test that hypothesis by storing e.g. “å” into a file, saving it as UTF-8 (\xc3 \xa5), then piping it into a new file and checking that the bytes end up as \xff \xfe \xe5 \x00 (possibly followed by a newline, \x0d \x00 \x0a \x00).

artkp says:

November 29, 2017 at 12:09 pm

Thanks for this post. Has this changed? I asked the question at https://stackoverflow.com/questions/47552334/never-never-powershell-pipelines-with-non-net-programs

User says:

April 15, 2018 at 3:42 pm

Thanks for the posting 🙂

Brian Coverstone says:

December 5, 2018 at 12:30 am

Actually, piping works fine, just keep it as a byte array and you’ll be fine. Also, never use the OEM encoding, always use BYTE. Here’s an example that will pipe a 256 byte array from a remote PSSession, 10 times, back to the caller, and save it into a file.

$s = New-PSSession SomeRemoteComputerYouHaveAccessTo

function Save-Bytes([string] $path, [byte[]] $data) {
    Begin { Set-Content $path $data -Encoding byte; Write-Host "BEGIN"; }
    Process { Add-Content $path $_ -Encoding byte; Write-Host $_.GetType(); }
}

Invoke-Command $s {
    # create and fill a 256 byte[] array with values 0-255
    $b = [byte[]]::new(256);
    for($i=0; $i -lt $b.Length; $i++) { $b[$i] = $i; }

    # write the byte[] array out 10 times (use -NoEnumerate or it will enumerate and send each byte separately, which will work but is a terrible idea)
    for($i=0; $i -lt 10; $i++) { Write-Output $b -NoEnumerate; }
} | Save-Bytes(".\x");

Mark Heath (@silicontrip) says:

June 15, 2022 at 3:10 am

Conversion to UTF-16 also happens when the source file is text (UTF-8). I wrote a PS script to update the TNSNAMES.ORA file on a number of Oracle db servers. The servers weren’t restarted until a few months later, and it took ages to find the issue, because with Get-Content and other apps like Notepad the file appeared correct.
