Yesterday I used curl to download a huge database backup from a remote server. Curl is UNIX-ey: by default, it streams its output to stdout, and you then redirect that stream to a pipe or file like this:
$ curl sftp://server.somewhere.com/somepath/file > file
The above is essentially what I did from inside a PowerShell session. After a couple of hours, I had my huge download and discovered that the database backup was corrupt. Then I realized that the file I ended up with was a little over 2x the size of the original file.
What happened?
Long story short: this is a consequence of PowerShell's object pipeline, and you should never pipe raw binary data in PowerShell, because it will be corrupted.
The Gory Details
You don’t have to work with a giant file. A small binary file will also be corrupted. I took a deeper look at this using a small PNG image.
PS> curl sftp://ftp.noserver.priv/img.png > img1.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  19986      0  0:00:02  0:00:02 --:--:-- 74950
(FYI: curl prints the progress of the download to stderr, so you see something on the console even though stdout is redirected to a file.)
This is essentially what I did with my big download, and it yields a file that is more than 2x the size of the original. My theory at this point was that since String objects in .NET are always Unicode (UTF-16), the bytes were being doubled by an implicit conversion to UTF-16.
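That theory is easy to sanity-check outside PowerShell. A minimal Python sketch (my own illustration, not PowerShell's actual code path): interpret raw bytes as text, then re-encode them as UTF-16, and every byte doubles, with a two-byte BOM added up front.

```python
# 100 bytes of arbitrary binary data, interpreted as text.
raw = bytes(range(32, 127)) + b"extra"   # 95 + 5 = 100 bytes
text = raw.decode("latin-1")             # one byte -> one character
utf16 = text.encode("utf-16")            # BOM + two bytes per character
print(len(raw), len(utf16))              # 100 202
```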
Using the > operator in PowerShell is the same as piping to the Out-File cmdlet. Out-File has several encoding options. The interesting one is OEM:
"OEM" uses the current original equipment manufacturer code page identifier for the operating system.
That is essentially writing raw bytes.
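Why does an OEM code page come close to writing raw bytes? A single-byte code page maps each of the 256 possible byte values to exactly one character, so decode-then-encode round-trips losslessly. A quick Python sketch using cp437, a common OEM code page (which code page a given Windows system uses is locale-dependent):

```python
data = bytes(range(256))             # every possible byte value
text = data.decode("cp437")          # one byte -> one character
assert text.encode("cp437") == data  # and back again, losslessly
```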
PS> curl sftp://ftp.noserver.priv/img.png | out-file -encoding oem img2.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  26304      0  0:00:01  0:00:01 --:--:-- 74950
I was clearly on to something, because this almost works: the file is only slightly larger than the original.
Just to prove that my build of curl isn't broken, I also used the -o (--output) option.
PS> curl sftp://ftp.noserver.priv/img.png -o img3.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  25839      0  0:00:01  0:00:01 --:--:-- 76796
Here’s the result. You can see by the file sizes and md5 hashes that img1.png and img2.png are corrupt but img3.png is the same as the original img.png.
PS> ls img*.png | select name, length | ft -AutoSize
Name Length
---- ------
img.png 46769
img1.png 94168
img2.png 47083
img3.png 46769
PS> md5 img*.png
MD5 (img.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf
MD5 (img1.png) = eb5a1421bcc4e3bea1063610b26e60f9
MD5 (img2.png) = 03b9b691f86404e9538a9c9c668c50ed
MD5 (img3.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf
Hrm. What’s going on here?
Let’s look at a diff of img.png and img1.png, which was the result of using the > operator to redirect the stdout of curl to file.
The big thing to see here is that there are a lot of extra bytes. Crucially, the byte pair FF FE has been added at the front of the file. FF FE is the byte order mark of little-endian UTF-16 (the code point U+FEFF, serialized low byte first). That confirms my theory that PowerShell internally converted the data to Unicode.
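A quick check on those leading bytes, using Python's codecs module:

```python
import codecs

# The BOM is the code point U+FEFF; serialized little-endian it becomes FF FE.
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
# Every character after it then occupies two bytes, low byte first.
assert "A".encode("utf-16-le") == b"A\x00"
```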
I can also create the same behavior by using the Get-Content (aliased as cat) cmdlet to redirect binary data to a file.
PS> cat img1.png > img4.png
PS> ls img1.png, img4.png | select name, length | ft -AutoSize
Name Length
---- ------
img1.png 94168
img4.png 94168
So what is going on inside that pipeline?
PS> (cat .\img.png).GetType() | select name
Name
----
Object[]
PS> cat .\img.png | %{ $_.GetType() } | group name | select count, name | ft -AutoSize
Count Name
----- ----
315 String
The file is being converted into an Object array of 315 elements, and each element of the array is a String object. Since .NET strings are internally UTF-16, sometimes referred to loosely as “double-byte” characters, the total size of that data roughly doubles.
Using the OEM text encoder almost converts the data back, but not quite. What is going wrong? Time to look at a diff of img.png and img2.png, the file produced with the OEM text encoder.
What you see here is that a 0x0D byte has been inserted in front of every 0x0A byte.
PS> (ls .\img2.png ).Length - (ls .\img.png ).Length
314
There are actually 314 of these 0x0D bytes added. What the heck is 0x0D? It is Carriage Return (CR); 0x0A is Line Feed (LF). In a Windows text file, each line ends with the sequence CRLF. And 314 is exactly the number of CRLF sequences you need to turn a 315-element array of strings into a text file with Windows line endings.
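The arithmetic is exactly what you'd expect from splitting and rejoining: n separators produce n + 1 pieces, and each CRLF joiner is one byte longer than the LF it replaces. A small Python sketch:

```python
# A blob of 315 one-byte chunks separated by 314 LF bytes.
data = b"\n".join([b"x"] * 315)
parts = data.split(b"\n")        # the pipeline's string array
rebuilt = b"\r\n".join(parts)    # what Out-File effectively writes
print(len(parts))                # 315
print(len(rebuilt) - len(data))  # 314 extra CR bytes
```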
Here’s what is happening. PowerShell is making some assumptions:
- Anything streaming in as raw bytes is assumed to be text.
- The text is converted into an array by splitting on bytes that would indicate an end of line in a text file.
- The text is reconstituted by Out-File using the standard Windows end-of-line sequence.
While this will work just fine with any kind of text, it is virtually guaranteed to corrupt any binary data. With the default text encoding you get a doubling of the original bytes, plus a bunch of new 0x0D bytes. The corruption fundamentally happens when the data is split into a string array: the original end-of-line bytes are thrown away during the split. Using a binary-safe encoder at the end of the pipeline doesn't put the data back correctly because it writes CRLF after every array element, and since more than one end-of-line sequence (LF, CR, or CRLF) could have produced any given split, no choice of joiner is guaranteed to be right. A Windows-to-Unix line-ending conversion will not fix the file either. There is no way to put Humpty Dumpty back together again.
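The lossiness is easy to demonstrate. Here is a Python sketch of the failure mode (a simplification, not PowerShell's exact splitting rules): once the split discards the original line-ending bytes, two different inputs collapse to the same output.

```python
def corrupt(data: bytes) -> bytes:
    # Split on any end-of-line sequence (CR, LF, or CRLF), discarding it...
    parts = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n").split(b"\n")
    # ...then rejoin every element with the Windows CRLF sequence.
    return b"\r\n".join(parts)

a = b"PNG\nDATA"    # this original used a bare LF
b = b"PNG\r\nDATA"  # this one used CRLF
print(corrupt(a) == corrupt(b))  # True: the difference is gone for good
```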
To Sum Up, Just Don’t Do It
The moral is that it is never safe to pipe raw binary data in PowerShell. Pipes in PowerShell are for objects, and for text that can safely be automagically converted to a string array. You need to be cognizant of this and use Stream objects to manipulate binary files.
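The same principle holds in any language: open files in binary mode and move bytes through streams without ever decoding them. As a language-neutral sketch (Python here; the scratch files are my own, not from the experiment above):

```python
import os
import shutil
import tempfile

# Binary payload that deliberately includes CR, LF, and FF FE byte pairs.
payload = bytes(range(256)) * 4

fd, src = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

# Copy through raw byte streams: no decoding, no line splitting.
dst = src + ".copy"
with open(src, "rb") as fin, open(dst, "wb") as fout:
    shutil.copyfileobj(fin, fout)

with open(dst, "rb") as f:
    assert f.read() == payload   # byte-for-byte identical

os.unlink(src)
os.unlink(dst)
```

In PowerShell, the equivalent is reaching for .NET types such as System.IO.FileStream instead of the object pipeline.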
When using curl with PowerShell, never, never redirect to file with >. Always use the -o (--output <file>) switch. If you need to stream the output of curl to another utility (say, gpg), then you need to sub-shell into cmd for the binary streaming or use temporary files.
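For example, a sketch of the cmd sub-shell approach (the server name, file names, and gpg options here are placeholders, not a tested recipe):

```powershell
# cmd's pipe carries raw bytes, so the binary stream never enters
# PowerShell's object pipeline. -s suppresses curl's progress meter.
PS> cmd /c "curl -s sftp://ftp.noserver.priv/backup.gpg | gpg --decrypt --output backup.db"
```

The temporary-file alternative is simply curl -o to a scratch file, then feeding that file to the next tool.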