PowerShell’s Object Pipeline Corrupts Piped Binary Data

Yesterday I used curl to download a huge database backup from a remote server. Curl is UNIX-ey. By default, it streams its output to stdout and you then redirect that stream to a pipe or file like this:

$ curl sftp://server.somehwere.com/somepath/file > file

The above is essentially what I did from inside of a PowerShell session. After a couple of hours, I had my huge download and discovered that the database backup was corrupt. Then I realized that the file I ended up with was a little over 2x the size of the original file.

What happened?

Long story short: this is a consequence of the object pipeline in PowerShell, and you should never pipe raw binary data in PowerShell because it will be corrupted.

The Gory Details

You don’t have to work with a giant file. A small binary file will also be corrupted. I took a deeper look at this using a small PNG image.

PS> curl sftp://ftp.noserver.priv/img.png > img1.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  19986      0  0:00:02  0:00:02 --:--:-- 74950

(FYI. Curl prints the progress of the download to stderr, so you see something on the console even though the stdout is redirected to file.)

This is essentially what I did with my big download and yields a file that is more than 2x the size of the original. My theory at this point was that since String objects in .Net are always Unicode, the bytes were being doubled as a consequence of an implicit conversion to UTF-16.

Using the > operator in PowerShell is the same thing as piping to the Out-File cmdlet. Out-File has some encoding options. The interesting one is OEM:

"OEM" uses the current original equipment manufacturer code page identifier for the operating system.

That is essentially writing raw bytes.

PS> curl sftp://ftp.noserver.priv/img.png | out-file -encoding oem img2.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  26304      0  0:00:01  0:00:01 --:--:-- 74950

I was clearly on to something because this almost works. The file is just slightly larger than the original.

Just to prove that my build of curl isn’t broken, I also used the -o (--output) option.

PS> curl sftp://ftp.noserver.priv/img.png -o img3.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 46769  100 46769    0     0  25839      0  0:00:01  0:00:01 --:--:-- 76796

Here’s the result. You can see by the file sizes and md5 hashes that img1.png and img2.png are corrupt but img3.png is the same as the original img.png.

PS> ls img*.png | select name, length | ft -AutoSize

Name     Length
----     ------
img.png   46769
img1.png  94168
img2.png  47083
img3.png  46769


PS> md5 img*.png
MD5 (img.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf
MD5 (img1.png) = eb5a1421bcc4e3bea1063610b26e60f9
MD5 (img2.png) = 03b9b691f86404e9538a9c9c668c50ed
MD5 (img3.png) = 21d5d61e15a0e86c61f5ab1910d1c0bf

Hrm. What’s going on here?

Let’s look at a diff of img.png and img1.png, which was the result of using the > operator to redirect the stdout of curl to file.

diff-img-img1

The big thing to see here is that there are a lot of extra bytes. Crucially, the original bytes have become two-byte UTF-16 characters, and FF FE has been added as the first two bytes. FF FE is the byte order mark for little-endian UTF-16. That confirms my theory that internally PowerShell converted the data to Unicode.

I can also reproduce the same behavior by using the Get-Content (aliased as cat) cmdlet to redirect binary data to a file.

PS> cat img1.png > img4.png
PS> ls img1.png, img4.png | select name, length | ft -AutoSize

Name     Length
----     ------
img1.png  94168
img4.png  94168

So what is going on inside that pipeline?

PS> (cat .\img.png).GetType() | select name

Name
----
Object[]


PS> cat .\img.png | %{ $_.GetType() } | group name | select count, name | ft -AutoSize

Count Name
----- ----
  315 String

The file is being converted into an Object array of 315 elements. Each element of the array contains a String object. Since the internal data type of String is Unicode, sometimes referred to loosely as “double-byte” characters, the total size of that data is roughly doubled.

Using the OEM text encoder converts the data back, but not quite. What is going wrong? Time to look at a diff of img.png and img2.png, which was produced with the OEM text encoder.

 diff-img-img2

What you see here is that a lot of 0x0D bytes have been inserted in front of all of the 0x0A bytes.

PS> (ls .\img2.png ).Length - (ls .\img.png ).Length
314

There are actually 314 of these 0x0D bytes added. What the heck is 0x0D? It is Carriage Return (CR). 0x0A is Line Feed (LF). In a Windows text file each line is marked with the sequence CRLF. 314 is exactly the number of CRLF sequences you need to turn a 315 element array of strings into a text file with Windows line endings.
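As a sanity check, the file sizes in the listing above fit this explanation. Here is a small JavaScript sketch of the arithmetic. It assumes every original byte ended up as exactly one character, which the numbers are consistent with but do not prove.

```javascript
// Numbers taken from the directory listing and the pipeline inspection above.
const originalSize = 46769;       // img.png
const lineCount = 315;            // String objects seen in the pipeline
const addedCRs = lineCount - 1;   // one CR per line break between elements

// OEM output: one byte per character, plus the inserted CR bytes.
const oemSize = originalSize + addedCRs;

// Default output: a 2-byte byte order mark, then two bytes per character.
const utf16Size = 2 + 2 * oemSize;

console.log(oemSize);   // 47083, the size of img2.png
console.log(utf16Size); // 94168, the size of img1.png
```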

Here’s what is happening. PowerShell is making some assumptions:

  1. Anything streaming in as raw bytes is assumed to be text.
  2. The text is converted into an array by splitting on bytes that would indicate an end of line in a text file.
  3. The text is reconstituted by out-file using the standard Windows end of line characters.

While this will work just fine with any kind of text, it is virtually guaranteed to corrupt any binary data. With the default text encoding you get a doubling of the original bytes and a bunch of new 0x0D bytes, too. The corruption fundamentally happens when the data is split into a string array. Using a single-byte encoder such as OEM at the end of the pipeline doesn’t put the data back correctly because Out-File always writes CRLF between array elements. The split threw away which end of line sequence was originally there, and since there is more than one possible sequence, CRLF is as good a guess as any. For the same reason, a Windows to Unix line ending conversion will not fix the file. There is no way to put Humpty Dumpty back together again.

To Sum Up, Just Don’t Do It

The moral is that it is never safe to pipe raw binary data in PowerShell. Pipes in PowerShell are for objects and text that can safely be automagically converted to a string array. You need to be cognizant of this and use Stream objects to manipulate binary files.

When using curl with PowerShell, never, never redirect to file with >. Always use the -o or --output <file> switch. If you need to stream the output of curl to another utility (say gpg) then you need to sub-shell into cmd for the binary streaming or use temporary files.

WONTFIX: select(2) in SUA 5.2 ignores timeout

With Windows Server 2003 R2, Microsoft incorporated Services for UNIX as a set of operating system components. The POSIX subsystem, Interix, is called the Subsystem for UNIX Applications (SUA) in Windows Server 2003 R2 and later.

Interix is the internal name of the Windows POSIX Subsystem (PSXSS), which is based on OpenBSD and operates as an independent sister subsystem alongside the Windows subsystem (aka CSRSS or Client/Server Runtime Subsystem).

With the first version of SUA, aka Interix 5.2, Microsoft added a bunch of new UNIX APIs. Unfortunately they broke some things that were working in the previous edition, which was called Interix 3.5 (aka Services for UNIX 3.5).

For example, select(2) is broken in SUA 5.2. It completely ignores the timeouts provided as arguments and returns immediately.

From the POSIX specification:

If the timeout parameter is not a null pointer, it specifies a maximum interval to wait for the selection to complete. If the specified time interval expires without any requested operation becoming ready, the function shall return. If the timeout parameter is a null pointer, then the call to pselect() or select() shall block indefinitely until at least one descriptor meets the specified criteria. To effect a poll, the timeout parameter should not be a null pointer, and should point to a zero-valued timespec structure.

Here is a little test program. What should happen is that select() should block for 10 seconds every time through the loop.

#include <stdio.h>
#include <sys/time.h>

int main()
{
	printf("Testing select(2). Each pass through the loop should pause 10 seconds.\n\n");

	struct timeval time, pause;
	int i;

	for( i=0; i<10; i++ )
	{
		//insert a 10 second pause on every loop to test select().
		//set the timeout on every pass because some systems modify it in place.
		pause.tv_sec  = 10;
		pause.tv_usec = 0;
		gettimeofday(&time, 0);
		printf("Current time: %ld\n", (long) time.tv_sec);
		time.tv_sec += 10L;
		printf("Add 10 seconds: %ld... And pause by calling select(2).\n", (long) time.tv_sec);
		(void) select(0, 0, 0, 0, &pause);
	}
	return 0;
}

What actually happens is select() returns immediately.

% uname -svrm
Interix 5.2 SP-9.0.3790.4125 EM64T
% gcc selecttest.c -o selecttest 
% ./selecttest 
Testing select(2). Each pass through the loop should pause 10 seconds. 

Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2). 
Current time: 1142434664 
Add 10 seconds: 1142434674... And pause by calling select(2).
% 

The test run above should have taken 100 seconds but it actually completes in less than 1 second. This is a problem because many UNIX applications will use select() as a timing mechanism. Some will use select() as a timer even if they aren’t doing IO.

There is good news and bad news.

The bad news is that MSFT told me that they won’t fix this issue. Their official guidance is to use sleep(2) and usleep(2) to control timeouts in Interix 5.2.

The good news is that select(2) works properly on Interix 6.0 with Windows Vista and Windows Server 2008.

My First Chrome Extension, part 2

I was fixated on how to get Chrome to pass a feed URL to my registered feed reader, Outlook 2010. After I figured out how to pass the message, it was nearly a one-liner to modify an existing extension. How many things could be wrong with one line of code that seems to be working?

url = url.replace( "%f", feedUrl.replace( "http:", "feed:" ) );

There are three bugs that I see in here.

1: URLs can legitimately include %f

From RFC 1738:

In addition, octets may be encoded by a character triplet consisting of the character "%" followed by the two hexadecimal digits (from "0123456789ABCDEF") which forming the hexadecimal value of the octet. (The characters "abcdef" may also be used in hexadecimal encodings.)

In practice this would affect URLs that include these characters:

  • ð -> %F0
  • ñ -> %F1
  • ò -> %F2
  • ó -> %F3
  • ô -> %F4
  • õ -> %F5
  • ö -> %F6
  • ÷ -> %F7
  • ø -> %F8
  • ù -> %F9
  • ú -> %FA
  • û -> %FB
  • ü -> %FC
  • ý -> %FD
  • þ -> %FE
  • ÿ -> %FF

The simple solution is to avoid using a macro character sequence that could also be read as hexadecimal digits. Sticking with a single ASCII character, it could be anything from ‘g’ through ‘z’ except ‘s’, which is already being used.
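Here is a minimal sketch of how the collision plays out. The template URLs and the reader.example host are hypothetical, made up for illustration; only the replace() call mirrors my one-liner.

```javascript
// Hypothetical handler template whose path contains a percent-encoded "ñ" (%f1).
const template = "http://reader.example/se%f1or/add?url=%f";
const feedUrl = "http://mysite.example/feed";

// replace() with a string pattern hits the first "%f" it finds,
// which here is the start of the escape sequence, not the macro.
const broken = template.replace("%f", feedUrl);
console.log(broken);
// http://reader.example/sehttp://mysite.example/feed1or/add?url=%f

// With "%g" as the macro, the escape sequence is left alone.
const safeTemplate = "http://reader.example/se%f1or/add?url=%g";
const fixed = safeTemplate.replace("%g", feedUrl);
console.log(fixed);
// http://reader.example/se%f1or/add?url=http://mysite.example/feed
```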

2: A feed URL could have “http:” anywhere

The scheme for a URI is always at the beginning, but it is common for one URI to encode another one inside it. That is exactly how you pass a feed URL to Google Reader.

Imagine a feed URL like “http://feedify.example/q=http://mysite.somewhere/page”. In order to pass this to the feed scheme handler I need to replace the http: scheme with feed: and nothing else. With a string pattern, replace() rewrites the first “http:” it finds, wherever that is. Here that happens to be the scheme, but if the outer URL used https: or already used feed:, my original code would rewrite the embedded http: instead, which would break the URL.

The solution is to use a regular expression so that I only replace the http: scheme rather than the first occurrence of the pattern “http:” wherever it appears. In Regular Expression syntax, the ^ character means that the pattern has to match at the beginning of the string.

^http:

In JavaScript, the shorthand for a Regular Expression object is paired / characters:

/^http:/

3: URI schemes are not case sensitive

The replace() method of the JavaScript String object is case sensitive. My code would not work on a URL like “HTTP://mysite.example/stuff”.

Fortunately, the Regular Expression object in JavaScript has a case-insensitive match mode. You set this with the ‘i’ option:

/^http:/i

3 Bugs in 1 Line: Fixed

url = url.replace( "%g", feedUrl.replace( /^http:/i, "feed:" ) );
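A quick way to convince yourself that the anchored, case-insensitive pattern behaves as intended is to run it over a few URLs. This is only a sketch: toFeedUrl is a hypothetical helper name wrapping the inner replace() call, and the https: URL is an extra case I made up to show the pattern leaving an embedded http: alone.

```javascript
// Hypothetical helper wrapping the inner replace() from the fixed one-liner.
function toFeedUrl(feedUrl) {
  return feedUrl.replace(/^http:/i, "feed:");
}

// Only the leading scheme is rewritten; the embedded URL is untouched.
const nested = toFeedUrl("http://feedify.example/q=http://mysite.somewhere/page");
console.log(nested); // feed://feedify.example/q=http://mysite.somewhere/page

// An upper-case scheme is matched as well.
const upper = toFeedUrl("HTTP://mysite.example/stuff");
console.log(upper); // feed://mysite.example/stuff

// A URL that does not start with http: is left alone, whereas a plain
// string pattern would rewrite the embedded http: in the query.
const httpsUrl = "https://feedify.example/q=http://mysite.somewhere/page";
const anchored = toFeedUrl(httpsUrl);
const naive = httpsUrl.replace("http:", "feed:");
console.log(anchored); // https://feedify.example/q=http://mysite.somewhere/page
console.log(naive);    // https://feedify.example/q=feed://mysite.somewhere/page
```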

Multiple Versions of IE with the Visual Studio Built-In Web Server: The Solution

The Problem

Last time I discussed the unfortunate crippling of the Visual Studio built-in web server, webdev.webserver, so that it can only process requests that originate from localhost, and the side-effect that this creates a big impedance barrier to testing multiple versions of Internet Explorer with your web apps. I promised a solution to the dilemma.

The key is to run your different versions of IE in virtualization software and use a personal proxy server to forward their requests. If the proxy is running on your host OS and the browsers in the client VMs use the host OS proxy then, from the perspective of webdev.webserver in your host OS, all of the requests will appear to originate from localhost and it will serve them.

There are a few gotchas.

Step 1: loopback adapter

The loopback adapter is a virtual network interface device. It provides a way for us to create a shared network between the virtual machines and the host machines without altering the configuration of any real network interfaces.

Install the loopback adapter via Device Manager (devmgmt.msc) by right-clicking on the root “computer” node and selecting “Add legacy hardware”. This should bring up the Add Hardware wizard. Choose the manual, advanced install. Next you should see a list of common hardware types. Select “Network adapters”:

add-hardware_common-hardware-types

Select Microsoft as the manufacturer and “Microsoft Loopback Adapter” as the network adapter.

add-hardware_select-network-adapter

Finish out the wizard and it will create a new network device which will appear in your “Network Connections” control panel (ncpa.cpl). It will probably be called something like “Local Area Connection (2)”. I like to rename this to something more descriptive like “loopback” or “Internal Connection”.

netowork-connections-arg

Now you can manually assign a static IP address to this connection. Choose something from one of the ranges defined by the IETF as private: 10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16.

For my example, I’m going to subnet the 10 network and use 10.237.0.1 mask 255.255.0.0 with no default gateway or DNS servers defined.
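If you want to double-check that an address you picked really falls inside one of those private ranges, the test is simple arithmetic. Here is a small JavaScript sketch; the helper names are made up for illustration and it only handles well-formed dotted-quad IPv4 addresses.

```javascript
// Convert a dotted-quad IPv4 address to a plain number.
function ipToNumber(ip) {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True if ip lies inside base/prefixBits (base is assumed to be the network address).
function inRange(ip, base, prefixBits) {
  const start = ipToNumber(base);
  const size = 2 ** (32 - prefixBits);
  const n = ipToNumber(ip);
  return n >= start && n < start + size;
}

// The three private ranges listed above.
const privateRanges = [["10.0.0.0", 8], ["172.16.0.0", 12], ["192.168.0.0", 16]];

function isPrivate(ip) {
  return privateRanges.some(([base, bits]) => inRange(ip, base, bits));
}

console.log(isPrivate("10.237.0.1")); // true
console.log(isPrivate("172.32.0.1")); // false, just past the end of 172.16.0.0/12
```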

At this point, if you are running Windows Vista or 7, you may notice a small problem. The “Internal Connection” device says it is on the “Unidentified Network”. That means Windows thinks you are connected to a “Public Network”, so Windows Firewall will block Windows File and Printer sharing.

network-sharing-arg

In order to calm Windows down, we need to make a registry change to mark our loopback adapter as an endpoint device. This indicates to Windows that it is not a true network device that connects to an external network. In my opinion, this should be the default setting for the loopback driver, but it isn’t. In order to make this change we need to create the *NdisDeviceType value as a DWORD of 1 in the key for our loopback adapter. (See MSDN documentation.)

Network adapters are configured under the following registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}

The default value on this key is “Network Adapters”. There will be several four-digit number sub-keys (such as 0016) depending on how many network interfaces are installed on your machine. One of these will have a DriverDesc value of “Microsoft Loopback Adapter”.

Once you have found the key for the loopback adapter, add a DWORD value to it called *NdisDeviceType and set it to 1. Note: a common mistake is to leave off the asterisk, which must be included as part of the value name.

loopback-registry-fix

Once you have added this value, you must bounce the driver by disabling and enabling your loopback device or reboot for the change to take effect.

netowork-connections-fixed

network-sharing-fixed

The last loopback adapter-related activity is to tell Windows Firewall not to monitor the loopback interface.

In Windows 7, you do this through the advanced settings link of the Firewall control panel applet. Right-click on the root “Windows Firewall with Advanced Security” node and choose properties. You can then set the “Protected Network Connections” for the Domain Profile, Private Profile and Public Profile.

win7-firewall-binding

In Windows Vista, it is a little easier to get to the dialog. In the Windows Firewall control panel applet, click “Change Settings”.

winvista-firewall-binding

Step 2: Install a Personal Proxy

Pretty much any lightweight personal proxy server will do for this. I like privoxy which I also use for general ad-blocking across all of my browsers. You can download the latest stable release from privoxy.org.

The privoxy installer is straightforward. Just run it.

Privoxy is essentially a unix-style daemon. It is configured through unixy text files. We need to edit Config.txt located in the Privoxy install directory to tell privoxy to listen on the IP address we bound to our loopback adapter. Privoxy will install into C:\Program Files\Privoxy on x86 Windows or C:\Program Files (x86)\Privoxy on x64 Windows.

Look for the listen-address in Config.txt. Set it to the IP address you bound to your loopback adapter and also set the TCP port number to listen on. Privoxy is normally configured to listen on 8118. In my setup, the listen-address is 10.237.0.1:8118.

privoxy-listen-address

Finally, we want to configure Privoxy to run as a service so that it will just be there all the time without having to start up its rudimentary GUI. This is how to do it in PowerShell:

PS> ./privoxy --install
PS> set-service privoxy -StartupType Automatic
PS> start-service privoxy

Step 3: Virtual Machines

The hard part is behind us. The rest is pretty easy. We just need to set up virtual machines using NAT (shared) networking.

vpc-shared-networking

You can get pre-configured virtual machines from Microsoft. These are set up for Virtual PC 2007 but can be run under Sun VirtualBox by just uninstalling the Virtual PC extensions and installing the VirtualBox drivers. On Windows Virtual PC with Windows 7, the XP image is a pain to deal with because Microsoft removed files to compress the image, including USB drivers that are useless to Virtual PC 2007 but essential to Windows Virtual PC. You have to get the drivers back onto the VHD in order to install the Windows Virtual PC extensions. Windows XP Mode is probably a better bet, or just install Windows XP from scratch.

A couple of useful items are the IE7 Blocker and the IE8 blocker. These will prevent Automatic Updates from upgrading your browser and defeating the whole purpose of this exercise.

Inside of the client OS (probably Windows XP) you just set the proxy settings in Internet Options so that HTTP is proxied to 10.237.0.1 port 8118 (or whatever you configured). Uncheck the bypass proxy for local addresses option.

xp-vm-proxy xp-vm-proxy-detail

With IE6 in the VM, you can now just go to the same localhost URLs that you use when you launch browsers on your host OS. The one remaining gotcha is that IE7 and IE8 are hard-coded to bypass a proxy server for localhost or 127.0.0.1, no matter what you configure. The simplest workaround is to put a trailing period after the host name but before the port number so that http://localhost:8080 becomes http://localhost.:8080.
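If you have a pile of localhost URLs to convert, the rewrite is mechanical. Here is a small JavaScript sketch; addTrailingDot is a hypothetical helper name, and it only handles the plain localhost host name, not 127.0.0.1.

```javascript
// Insert a trailing period after "localhost" so the browser no longer
// treats the host as a proxy-bypass special case.
function addTrailingDot(url) {
  // Match the scheme plus "localhost" only when followed by ":", "/" or end of string.
  return url.replace(/^(https?:\/\/localhost)(?=[:\/]|$)/i, "$1.");
}

console.log(addTrailingDot("http://localhost:8080"));       // http://localhost.:8080
console.log(addTrailingDot("http://localhost/app"));        // http://localhost./app
console.log(addTrailingDot("http://localhost.:8080"));      // unchanged
console.log(addTrailingDot("http://mysite.example:8080"));  // unchanged
```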

Here’s the whole shebang working with integration features enabled on Windows Virtual PC on Windows 7 x64. This is IE8 on Windows 7 (host OS), IE7 on Windows Vista and IE6 on Windows XP. All three are pointed at the same web project started from Visual Studio 2008. If you look very closely, you just might be able to see the extra period in the URL on IE7.

ie6-7-8

Multiple Versions of IE with the Visual Studio Built-In Web Server

 

For years, setting up a web project to run locally on your development machine with Visual Studio (and before that Visual InterDev) required a ton of prerequisites. You had to configure IIS and FrontPage Extensions. You had to have permissions set correctly in order to publish and debug. The setup did not play very well with source control systems and was generally a big nightmare time suck.

ie6-7-8-small

In the Java world things were much better much sooner, particularly if all you wanted was a servlet container to run your simple JSP site or to host your Spring POJO application. Back around 2002 or 2003 Netbeans 3 would magically publish your code into Apache Tomcat and let you debug it. You could do something similar with Eclipse and other Java IDEs of the day.

With Visual Studio 2005, Microsoft adopted something very similar to the Java IDE with Tomcat model. Starting with Visual Studio 2005, by default, you get magical publishing to a lightweight web server called webdev.webserver.exe. Webdev.webserver is based on the Cassini sample web server and shares a quirk of Cassini: it only accepts requests from localhost.

Microsoft says that this is for security reasons. They wanted to bundle a web server with .NET 2.0 to make it easier to get started programming but on the other hand they were licking their wounds from the constant successful attacks on Windows XP that had only started to abate with the rollout of SP2. So, Cassini cum webdev.webserver is hardcoded to refuse connections unless they originate from your own computer.

On the face of it that doesn’t seem like much of a problem but here’s the thing. Microsoft has 3 supported web browsers—Internet Explorer 6, Internet Explorer 7 and Internet Explorer 8—and you can only run one of them on a given Windows installation. Furthermore, you cannot run IE6 on Windows Vista and you cannot run IE6 or IE7 on Windows 7. Microsoft’s solution to help out developers is to pass out free copies of Virtual PC and to provide free virtual machine images of Windows running various browser configurations.

Gotcha. You can’t use these virtual machines with Visual Studio’s webdev.webserver. You have to publish to IIS. Ugh!

As the screenshot above may have given away, there is a solution. The key is that all webdev.webserver cares about is that requests originate from localhost. I’ll post all the gory details next time.
