PowerShell’s Object Pipeline Corrupts Piped Binary Data

Yesterday I used curl to download a huge database backup from a remote server. Curl is UNIX-y. By default, it streams its output to stdout, and you then redirect that stream to a pipe or file like this:

$ curl sftp://server.somehwere.com/somepath/file > file

The above is essentially what I did from inside of a PowerShell session. After a couple of hours, I had my huge download and discovered that the database backup was corrupt. Then I realized that the file I ended up with was a little over 2x the size of the original file.

What happened?

Long story short: this is a consequence of the object pipeline in PowerShell. You should never pipe raw binary data in PowerShell because it will be corrupted.

The Gory Details

You don’t have to work with a giant file. A small binary file will also be corrupted. I took a deeper look at this using a small PNG image.

PS> curl sftp://ftp.noserver.priv/img.png > img1.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  19986      0  0:00:02  0:00:02 --:--:-- 74950

(FYI. Curl prints the progress of the download to stderr, so you see something on the console even though the stdout is redirected to file.)

This is essentially what I did with my big download and yields a file that is more than 2x the size of the original. My theory at this point was that since String objects in .NET are always Unicode, the bytes were being doubled as a consequence of an implicit conversion to UTF-16.

Using the > operator in PowerShell is the same thing as piping to the Out-File cmdlet. Out-File has some encoding options. The interesting one is OEM:

"OEM" uses the current original equipment manufacturer code page identifier for the operating system.

That is essentially writing raw bytes.

PS> curl sftp://ftp.noserver.priv/img.png | out-file -encoding oem img2.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  26304      0  0:00:01  0:00:01 --:--:-- 74950

I was clearly on to something because this almost works. The file is only slightly larger than the original.

Just to prove that my build of curl isn’t broken, I also used the -o (--output) option.

PS> curl sftp://ftp.noserver.priv/img.png -o img3.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  25839      0  0:00:01  0:00:01 --:--:-- 76796

Here’s the result. You can see by the file sizes and md5 hashes that img1.png and img2.png are corrupt but img3.png is the same as the original img.png.

PS> ls img*.png | select name, length | ft -AutoSize

Name     Length
----     ------
img.png   46769
img1.png  94168
img2.png  47083
img3.png  46769


PS> md5 img*.png
MD5 (img.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf
MD5 (img1.png) = eb5a1421bcc4e3bea1063610b26e60f9
MD5 (img2.png) = 03b9b691f86404e9538a9c9c668c50ed
MD5 (img3.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf

Hrm. What’s going on here?

Let’s look at a diff of img.png and img1.png, which was the result of using the > operator to redirect the stdout of curl to file.

diff-img-img1

The big thing to see here is that there are a lot of extra bytes. Crucially, the bytes have been turned into two-byte Unicode (UTF-16) characters, and FF FE has been added as the first two bytes. FF FE is the byte order mark for little-endian UTF-16. That confirms my theory that internally PowerShell converted the data to Unicode.

I can also create the same behavior by using the Get-Content (aliased as cat) cmdlet to redirect binary data to a file.

PS> cat img1.png > img4.png
PS> ls img1.png, img4.png | select name, length | ft -AutoSize

Name     Length
----     ------
img1.png  94168
img4.png  94168

So what is going on inside that pipeline?

PS> (cat .\img.png).GetType() | select name

Name
----
Object[]


PS> cat .\img.png | %{ $_.GetType() } | group name | select count, name | ft -AutoSize

Count Name
----- ----
  315 String

The file is being converted into an Object array of 315 elements. Each element of the array contains a String object. Since the internal data type of String is Unicode, sometimes referred to loosely as “double-byte” characters, the total size of that data is roughly doubled.
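The doubling can be reproduced outside PowerShell. The Node.js sketch below is only a model of the effect, not of PowerShell's internals. It assumes every input byte becomes exactly one character (the latin1 mapping is my stand-in for whatever decoding PowerShell really applies), and then writes those characters as little-endian UTF-16 with a byte order mark. The four sample bytes are the start of a PNG signature; the function name is made up for illustration.

```javascript
// Model only: assumes one character per input byte (latin1), which is a
// simplification of the decoding PowerShell actually performs.
function toUtf16WithBom(bytes) {
  const text = Buffer.from(bytes).toString("latin1");
  const bom = Buffer.from([0xff, 0xfe]); // byte order mark, little-endian UTF-16
  return Buffer.concat([bom, Buffer.from(text, "utf16le")]);
}

const sample = Buffer.from([0x89, 0x50, 0x4e, 0x47]); // first bytes of a PNG
const encoded = toUtf16WithBom(sample);

// Prints the input size, the output size, and the output bytes in hex.
console.log(sample.length, encoded.length, encoded.toString("hex"));
```

Four bytes go in and ten come out: two for the byte order mark plus two for each original byte.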

Using the OEM text encoder converts the data back but not quite. What is going wrong? Time to look at a diff of img.png and img2.png, which was written with the OEM text encoder.

 diff-img-img2

What you see here is that a lot of 0x0D bytes have been inserted in front of the 0x0A bytes.

PS> (ls .\img2.png ).Length - (ls .\img.png ).Length
314

There are actually 314 of these 0x0D bytes added. What the heck is 0x0D? It is Carriage Return (CR). 0x0A is Line Feed (LF). In a Windows text file each line is marked with the sequence CRLF. 314 is exactly the number of CRLF sequences you need to turn a 315 element array of strings into a text file with Windows line endings.

Here’s what is happening. PowerShell is making some assumptions:

  1. Anything streaming in as raw bytes is assumed to be text
  2. The text is converted into an array by splitting on bytes that would indicate an end of line in a text file.
  3. The text is reconstituted by out-file using the standard Windows end of line characters.

While this will work just fine with any kind of text, it is virtually guaranteed to corrupt any binary data. With the default text encoding you get a doubling of the original bytes and a bunch of new 0x0D bytes, too. The corruption fundamentally happens when the data is split into a string array. Using a binary encoder at the end of the pipeline doesn’t put the data back correctly because it always puts CRLF at the end of every array element. Since more than one byte sequence counts as an end of line on the way in, nothing records which one was originally there, so CRLF is as good a guess as any. Using a Windows to Unix conversion will not fix the file. There is no way to put Humpty Dumpty back together again.
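The split-and-rejoin step can be modeled in a few lines of Node.js. This is a sketch under my own assumptions, not PowerShell's implementation: bytes are mapped one-to-one to characters with latin1, any of CRLF, LF or CR is treated as a line break, and the lines are rejoined with CRLF. The function name and the sample bytes are made up for illustration.

```javascript
// Simplified model of "split into lines, write back with Windows line endings".
function roundTripAsText(bytes) {
  // latin1 keeps a one-to-one byte/character mapping, so only the
  // line handling below can change the data.
  const text = Buffer.from(bytes).toString("latin1");
  const lines = text.split(/\r\n|\n|\r/); // the original line break is forgotten here
  return Buffer.from(lines.join("\r\n"), "latin1");
}

// Sample "binary" data containing a lone LF, a CRLF pair and a lone CR.
const original = Buffer.from([0x89, 0x50, 0x0a, 0x1a, 0x0d, 0x0a, 0x00, 0x0d, 0x01]);
const mangled = roundTripAsText(original);

console.log(original.length, mangled.length);
console.log(mangled.toString("hex"));
```

The nine sample bytes come back as eleven, and all three kinds of line break come back as 0D 0A, so nothing downstream can tell which was which. The numbers in this post fit the same model: 2 x (46769 + 314) + 2 = 94168, the size of img1.png.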

To Sum Up, Just Don’t Do It

The moral is that it is never safe to pipe raw binary data in PowerShell. Pipes in PowerShell are for objects and text that can safely be automagically converted to a string array. You need to be cognizant of this and use Stream objects to manipulate binary files.

When using curl with PowerShell, never, never redirect to file with >. Always use the -o or --output <file> switch. If you need to stream the output of curl to another utility (say gpg) then you need to sub-shell into cmd for the binary streaming or use temporary files.

WONTFIX: select(2) in SUA 5.2 ignores timeout

With Windows Server 2003 R2, Microsoft incorporated Services for UNIX as a set of operating system components. The POSIX subsystem, Interix, is called the Subsystem for UNIX Applications (SUA) in Windows Server 2003 R2 and later.

Interix is the internal name of the Windows Posix Subsystem (PSXSS) that is based on OpenBSD and operates as an independent sister subsystem with the Windows subsystem (aka CSRSS or Client/Server Runtime Subsystem).

With the first version of SUA, aka Interix 5.2, Microsoft added a bunch of new UNIX APIs. Unfortunately they broke some things that worked in the previous edition, which was called Interix 3.5 (aka Services for UNIX 3.5).

For example, select(2) is broken in SUA 5.2. It completely ignores the timeouts provided as arguments and returns immediately.

From the POSIX specification:

If the timeout parameter is not a null pointer, it specifies a maximum interval to wait for the selection to complete. If the specified time interval expires without any requested operation becoming ready, the function shall return. If the timeout parameter is a null pointer, then the call to pselect() or select() shall block indefinitely until at least one descriptor meets the specified criteria. To effect a poll, the timeout parameter should not be a null pointer, and should point to a zero-valued timespec structure.

Here is a little test program. What should happen is that select() should block for 10 seconds every time through the loop.

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
	struct timeval time, pause;
	int i;

	printf("Testing select(2). Each pass through the loop should pause 10 seconds.\n\n");

	for( i=0; i<10; i++ )
	{
		/* Reset the timeout on every pass; POSIX allows select() to modify it. */
		pause.tv_sec  = 10;
		pause.tv_usec = 0;

		/* Insert a 10 second pause on every loop to test select(). */
		gettimeofday(&time, 0);
		printf("Current time: %ld\n", (long) time.tv_sec);
		time.tv_sec += 10L;
		printf("Add 10 seconds: %ld... And pause by calling select(2).\n", (long) time.tv_sec);
		(void) select(0, 0, 0, 0, &pause);
	}
	return 0;
}

What actually happens is select() returns immediately.

% uname -svrm
Interix 5.2 SP-9.0.3790.4125 EM64T
% gcc selecttest.c -o selecttest 
% ./selecttest 
Testing select(2). Each pass through the loop should pause 10 seconds. 

Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2).
% 

The test run above should have taken 100 seconds but it actually completes in less than 1 second. This is a problem because many UNIX applications will use select() as a timing mechanism. Some will use select() as a timer even if they aren’t doing IO.

There is good news and bad news.

The bad news is that MSFT told me that they won’t fix this issue. Their official guidance is to use sleep(2) and usleep(2) to control timeouts in Interix 5.2.

The good news is that select(2) works properly on Interix 6.0 with Windows Vista and Windows Server 2008.

Fix Adobe’s broken PDF preview in x64 Windows

Adobe’s PDF previewer control doesn’t work correctly in x64 Windows because the Adobe Acrobat installer has had a bug since June 2007. To be clear, the PDF preview handler itself works fine on x64 Windows, but the Adobe Acrobat Reader setup program doesn’t install it correctly.

This affects previews in the Explorer shell and Outlook 2007/2010.

broken-pdf-preview

Fortunately, Leo Davidson has solved the problem and developed a little utility to fix the Adobe PDF preview control on x64 Windows.

fixed-pdf-preview

Wow. This installer defect has been there for what, 2 1/2 years? Seriously Adobe, this is just sad. Please fix it and while you are at it, send a nice thank you to Leo for solving the issue.

Outlook 2007 in Curmudgeon Mode

I am fed up with HTML email. I’m not sure if I just got one too many horrible emails with fuchsia comic sans text on a mauve background or white text on black background which becomes black text on black background after reply or forward. Or maybe it was the series of HTML email security disasters in Outlook 2002 from Office XP that could sploit you by just previewing a message. Probably both.

Regardless, I started reading all of my Outlook email as plain text in early 2002. I rarely want to look at the messages in HTML. I just want everything to be plain text by default. That probably makes me a tech curmudgeon but life is so much better this way.

  • Phishing looks much more fishy in plain text because the evil URLs are exposed.
  • I get to choose the most readable font and color—Consolas 10.5 Black—instead of the sender—Comic Sans MS 12 Pink. (Seriously, I have known several people who love to send email in big pink Comic Sans.)
  • The Internet Explorer (mshtml) HTML rendering engine is not invoked unless I specifically request the email to be displayed as HTML.
  • Rendering plain text defeats web beacons and exposes tracking URLs that marketing people hide in HTML email to track your behavior.

HTML email is really most useful for marketing campaigns, hackers and phishers. The simplest way to opt out of the target pool is to opt out of HTML email.

Since Outlook 2002 SP1, Outlook has had the capability to strip HTML from incoming messages and display them as plain text. It started out as a registry tweak when the feature was rolled out with SP1 for Office XP, but it is now a full-fledged option.

Tools > Trust Center > Email Security

Check the “Read all standard mail in plain text” option.

outlook-trust-center

Outlook has a dubious feature whereby it attempts to remove “extra” line breaks from plain text messages by default.

And I also don’t want Outlook to reformat my plain text because it messes up code, other deliberate formatting and PGP signed messages.

Tools > Options > Email Options (button)

Uncheck “Remove extra line breaks in plain text messages”

dont-reformat-plain-text

Bruce Schneier: U.S. enables Chinese hacking of Google

Notable cryptographer and security expert Bruce Schneier has a new essay up at CNN.

In order to comply with government search warrants on user data, Google created a backdoor access system into Gmail accounts. This feature is what the Chinese hackers exploited to gain access.

This problem isn’t going away. Every year brings more Internet censorship and control, not just in countries like China and Iran but in the U.S., the U.K., Canada and other free countries, egged on by both law enforcement trying to catch terrorists, child pornographers and other criminals and by media companies trying to stop file sharers.

The problem is that such control makes us all less safe. Whether the eavesdroppers are the good guys or the bad guys, these systems put us all at greater risk. Communications systems that have no inherent eavesdropping capabilities are more secure than systems with those capabilities built in. And it’s bad civic hygiene to build technologies that could someday be used to facilitate a police state.

Read the entire article at CNN.com. This essay is a follow-up to a previous Schneier essay, “Technology Shouldn’t Give Big Brother a Head Start”.

 

Schneier is the inventor of the Blowfish and Twofish block cipher algorithms as well as the Solitaire cipher used in Neal Stephenson’s Cryptonomicon. Twofish was a finalist in NIST’s competition to choose the Advanced Encryption Standard (AES) but ultimately lost to Rijndael.

Eke out more sound on Win7 laptops

This falls in the category of why isn’t it the default?

Laptops generally have really tiny speakers and it can be tough to hear them sometimes. Sometimes the root problem is a poorly configured driver from the OEM, such as I had with my MacBook Pro 15” 2nd gen unibody. (The updated crystal audio driver in Boot Camp 3.1 or the slightly older one from Boot Camp 2.2 fixes this.) Even so, the audio has to be cranked all the way up to watch video through the laptop speakers.

It’s not just an Apple problem, either. My HP nw8440 was even more anemic in the volume department.

Windows 7 has a buried optimization that helps a lot: “Loudness Equalization”.  The feature description states that it “uses an understanding of human hearing to reduce perceived volume differences” which implies that it really isn’t making the speakers louder but doing something to make them seem louder. I don’t know how it works its magic but it really works.

Control Panel > Sound > Playback

Double-click the device for your built-in speakers

On the enhancements tab, select “Loudness Equalization” and nothing else.

speakers-enhancement

The improvement is marked. It makes such a difference that I wonder why this feature isn’t turned on by default for built-in speakers.

Barack’s people are tracking clicks

Emails sent by Barack Obama’s people often have URLs in them.

obama-haiti-html

That’s fine but Mr. Obama’s people use a phishing technique where the link displayed is not the real link. My mail reader converts the emails to plain text by default, so it is obvious.

obama-haiti-txt

The text “http://my.barackobama.com/Haiti” is actually linked to some obscure URL at my.barackobama.com. This URL probably encodes information about the page to display as well as my identity. It is almost certainly there so that the people running my.barackobama.com can track my behavior if I were to click this link.

This is nothing new. Mr. Obama’s people have been doing things this way since the campaign and it is a common technique for tracking the behavior of people in email marketing campaigns. It has always bugged me, though, that Barack Obama does this.

Boot Camp 3.1, from Apple this time

Boot Camp now officially supports Windows 7. As I expected, it is largely a repackaging of drivers previously released for Boot Camp 2.2. Here’s the flyby of what is in the update (I looked at the x64 version).

  • NVidia  display driver 8.16.11.8861, 01/05/2010
  • Binary.aapltp_Bin
  • Binary.AppleBTBroadcom_Bin
  • Binary.AppleBTE_Bin
  • Binary.AppleBT_Bin
  • Binary.AppleDisplay_Bin
  • Binary.AppleiSight_Bin
  • Binary.AppleODD_Bin
  • Binary.asix_ethernet_Bin
  • Binary.AtherosWin7_Bin
  • Binary.Atheros_Bin
  • Binary.Ati_GraphicsWin7_Bin
  • Binary.Ati_Graphics_Bin
  • Binary.BroadcomEthernet_Bin
  • Binary.BroadcomWireless_Bin
  • Binary.Cirrus_Audio_Bin 6.6001.1.21 (all new!)
  • Binary.crystal_beach_Bin
  • Binary.intel_ethernet_Bin
  • Binary.IRFilter_Bin
  • Binary.Keyboard_Bin 3.0.0.0 (same version number as was in Boot Camp 3.0 but I can definitely dim the keyboard more, now)
  • Binary.marvell_ethernet_Bin
  • Binary.MultiTouchMouse_Bin
  • Binary.MultiTP_Bin 3.0.0.0 (same as in 2.2)
  • Binary.null_driver_Bin
  • Binary.Realtek_Bin
  • Binary.Sigmatel_Bin

Available from Apple in x86 and x64 flavors.

Google Developer Dashboard: “An error occurred: please try again later.”

When I use the Google Developer Dashboard to upload an update to my extension, it will often fail. The dashboard says, “An error occurred: please try again later.”

Trying again later doesn’t help. For me, the fix is to log out and log back in again.

My First Chrome Extension, part 2

I was fixated on how to get Chrome to pass a feed URL to my registered feed reader, Outlook 2010. After I figured out how to pass the message, it was nearly a one-liner to modify an existing extension. How many things could be wrong with one line of code that seems to be working?

url = url.replace( "%f", feedUrl.replace( "http:", "feed:" ) );

There are three bugs that I see in here.

1: URLs can legitimately include %f

From RFC 1738:

In addition, octets may be encoded by a character triplet consisting of the character "%" followed by the two hexadecimal digits (from "0123456789ABCDEF") which forming the hexadecimal value of the octet. (The characters "abcdef" may also be used in hexadecimal encodings.)

In practice this would affect URLs that include these characters:

  • ð -> %F0
  • ñ -> %F1
  • ò -> %F2
  • ó -> %F3
  • ô -> %F4
  • õ -> %F5
  • ö -> %F6
  • ÷ -> %F7
  • ø -> %F8
  • ù -> %F9
  • ú -> %FA
  • û -> %FB
  • ü -> %FC
  • ý -> %FD
  • þ -> %FE
  • ÿ -> %FF

The simple solution is to avoid using a macro character sequence that includes valid hexadecimal values. Sticking with a single ASCII character, it could be anything from ‘g’ through ‘z’ except ‘s’, which is already being used.
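A small sketch of the collision, using a made-up reader template (the host, the path and both template strings are hypothetical). The path contains the percent-encoded octet %fc, which RFC 1738 allows in lowercase, so the "%f" macro matches there before it reaches the real placeholder:

```javascript
// Hypothetical reader templates; host and path are made up for illustration.
const feedUrl = "feed://site.example/rss";

// "%fc" is a legal percent-encoded octet, and it begins with the "%f" macro.
const withF = "http://reader.example/m%fcnchen/add?url=%f";
const withG = "http://reader.example/m%fcnchen/add?url=%g";

// A string pattern replaces the first match only, which here is inside "%fc".
const broken = withF.replace("%f", feedUrl);
// "%g" can never be part of a percent-encoded octet, so only the macro matches.
const intact = withG.replace("%g", feedUrl);

console.log(broken);
console.log(intact);
```

With "%f" the feed URL lands in the middle of the path and the real placeholder is left unexpanded; with "%g" the template expands as intended.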

2: A feed URL could have “http:” anywhere

The scheme for a URI is always at the beginning, but it is common for one URI to encode another one inside it. That is exactly how you pass a feed URL to Google Reader.

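Imagine a feed URL like “http://feedify.example/q=http://mysite.somewhere/page”. In order to pass this to the feed scheme handler I need to replace the http: scheme with feed: and leave the embedded http: alone. With a string pattern, replace() rewrites only the first match, so this particular URL happens to survive, but the match is not tied to the start of the string. If the outer scheme is anything other than a lowercase “http:” (for example “https:” or “HTTP:”), the first match is the embedded URL and my original code would break it.

The solution is to use a regular expression so that I only replace the http: scheme rather than the first “http:” that happens to appear anywhere in the string. In Regular Expression syntax, the ^ character anchors the pattern to the beginning of the string.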

^http:

In JavaScript, the shorthand for a Regular Expression object is paired / characters:

/^http:/

3: URI schemes are not case sensitive

The replace() method of the JavaScript String object is case sensitive. My code would not work on a URL like “HTTP://mysite.example/stuff”.

Fortunately, the Regular Expression object in JavaScript has a case-insensitive match mode. You set this with the ‘i’ option:

/^http:/i

3 Bugs in 1 Line: Fixed

url = url.replace( "%g", feedUrl.replace( /^http:/i, "feed:" ) );
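To check the fixed line against the cases discussed in this post, here is a minimal sketch. The helper names toFeedScheme and expandTemplate and the reader template are hypothetical; only the two replace() calls come from the line above.

```javascript
// Hypothetical helpers wrapping the two replace() calls from the fixed line.
function toFeedScheme(feedUrl) {
  // ^ anchors the match to the start of the string; i ignores case.
  return feedUrl.replace(/^http:/i, "feed:");
}

function expandTemplate(template, feedUrl) {
  return template.replace("%g", toFeedScheme(feedUrl));
}

const samples = [
  "http://mysite.example/stuff",
  "HTTP://mysite.example/stuff",
  "http://feedify.example/q=http://mysite.somewhere/page",
  "https://feedify.example/q=http://mysite.somewhere/page",
];

// Only a leading http: scheme is rewritten; embedded URLs are left alone.
for (const sample of samples) {
  console.log(toFeedScheme(sample));
}

// Made-up template, expanded the same way as the fixed line.
console.log(expandTemplate("reader://subscribe?url=%g", samples[0]));
```

The first three samples come out with a feed: scheme and their embedded URLs untouched. The https: sample is returned unchanged because the pattern only matches a leading http: scheme.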