Embedding Arbitrary Language Glyphs in PDF with ItextSharp

One of my clients has an application which generates a PDF using ITextSharp. The document largely contains English text in the Latin character set but a portion of the PDF is supposed to contain contact information in a foreign language. In the first version of the software, the requirement was to support Latin, Cyrillic, Georgian and Armenian character sets.

We quickly discovered during testing that the Adobe Type 1 fonts embedded in itextsharp.dll only support Latin characters. Code points from the Cyrillic, Georgian and Armenian character sets showed up as white space in the document. Fortunately, iTextSharp supports TrueType font embedding with the correct incantation which enabled us to use Sylfaen to provide the necessary glyphs.

string sylfaenpath = Environment.GetEnvironmentVariable( "SystemRoot" ) + "\\fonts\\sylfaen.ttf";
BaseFont sylfaen = BaseFont.CreateFont( sylfaenpath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED );
Font easternEuroTextFont = new Font( sylfaen, 9f, Font.NORMAL );

With the second version of the application, we needed to support a bunch of new character sets in addition to the ones we previously supported including Hebrew, Arabic, Devangari, Sinhala, Lao, Thai and more South Asian and Southeast Asian scripts.

One option is to pick supporting fonts for each character set but we elected to pick something universal, which is Arial Unicode which includes glyphs for every code point defined in Unicode 2.1. Arial Unicode is from the Afra Monotype foundry and is bundled with Office 2007 and later but can be purchased separately if Office isn’t installed. (The main side effect of this font choice is moving from a serif font to a sans serif one.)

Universal Glyph Support Can Still Yield Gibberish

The remaining wrinkle is that Hebrew and Arabic are right-to-left languages which means that the characters in a Hebrew or Arabic string are supposed to be rendered from right-to-left instead of left-to-right. Just rendering an Arabic string with Arial Unicode in iTextSharp will yield reflected output which is gibberish.

Here is some reference Arabic text rendered in Arial. It says “al-ingliziya”, the Arabic word for English.

al-ingliziyah

Here’s what you get by default using Arial Unicode in iTextSharp.

default-broken-al-ingliziyah

Clearly, this is different from the reference rendering. Arabic is complicated because of the way the ligatures work so that the shape of a letter is heavily influenced by the letters next to it but basically, it’s backwards. What we have is now not nothing but Hebrew and Arabic are gibberish instead. We need to alter the rendering for Hebrew and Arabic and make them right-to-left.

A simple algorithm is to detect the presence of Hebrew or Arabic code points in a string and turn on right-to-left rendering. Regular Expressions define \p{Hebrew} and \p{Arabic} character classes which would be useful but unfortunately those aren’t supported in System.Text.RegularExpressions at this point. We need to roll our own.

const string regex_match_arabic_hebrew = @"[\u0600-\u06FF,\u0590-\u05FF]+";
if( Regex.IsMatch( text, regex_match_arabic_hebrew, RegexOptions.IgnoreCase ) 
    //arabic or hebrew characters exist, fix rendering

There’s no obvious RTL option for a text element in iTextSharp, so I tried reversing the strings, which is a slight improvement but it’s still broken. What we have is brain-dead rendering. The ligatures are not connecting the letters correctly.

string-reverse-broken-al-ingliziyah

On closer examination, there is RTL support in iTextSharp. It is exposed through object graph elements that implement IPdfRunDirection. (This is one of the places where it really shows that iTextSharp is a Java port. The use of static integer constants rather than enums is very Java 1.4. Enums are much more discoverable and the correct usage is more intuitively obvious.)

element.RunDirection = PdfWriter.RUN_DIRECTION_RTL;

Now the output from iTextSharp looks like the reference rendering.

correct-rtl-al-ingliziyah

Transliterate to Java and the same concepts apply to iText.

Example Snippet Code in C#

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;

//... assume a class that does stuff exists

pubilc byte[] CreatePdfStreamPdfWithRandomLanguageSupport( IEnumerable<string> textList )
{
	//C# does not support \p{Arabic} and \p{Hebrew} character classes. We have to roll our own.
	//We are assuming any string that contains an Arabic or Hebrew character is meant to be RTL.
	//Better would be to break strings into word tokens and test each word.
	const string regex_match_arabic_hebrew = @"[\u0600-\u06FF,\u0590-\u05FF]+";
	const string arialunicodepath          = Environment.GetEnvironmentVariable( "SystemRoot" ) + "\\fonts\\ARIALUNI.TTF";

	Document document = new Document( PageSize.LETTER );
	using(MemoryStream stream = new MemoryStream())
	{
		PdfWriter writer = PdfWriter.GetInstance( document, stream );
		try
		{
			//bunch of document setup here.
			document.Open();
			//arbitrarily, creating a 5 columnt table.
			PdfPTable table = new PdfPTable( 5 );
			
			//embed a Unicode font with broad glyph support for any code point we might need.
			//only the glyphs for code points actually used will be embedded in the document
			BaseFont nationalBase;
			if( File.Exists( arialunicaodepath ) 
				BaseFont.CreateFont( arialunicodepath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED ); 
			else
				throw new FileNotFoundException( "Could not find \"Arial Unicode MS\" font installed on this system." );

			Font nationalTextFont = new Font( nationalBase, 9f, Font.NORMAL );

			foreach( string text in textList )
			{
				//PdfPCell implements IPdfRunDirection
				PdPCell cell = new PdfPCell();
				//Arabic and Hebrew strings need to be reversed for right-to-left rendering
				//which is done by setting IPdfRunDirection.RunDirection. Otherwise, your RTL language text
				//comes out as backwards gibberish.
				if( text != null && Regex.IsMatch( text, regex_match_arabic_hebrew, RegexOptions.IgnoreCase ) )
			   		cell.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
			    //apply unicode font
			    Phrase phrase = new Phrase( text, nationalTextFont );
				cell.Add( phrase );
				table.AddCell( cell );
			}
			document.add( table );
		}
		finally
	    {
	        document.Close();
	        writer.Close();
	    }
	    return stream.GetBuffer();
	}
}

FIX: VirtualBox Host-Only Network Adapter Creates a Virtual “Public Network” Connection That Causes Windows to Disable Services

vbox-host-only-net-net-n-sharing

vbox-host-only-net-connectletVirtualBox creates a “VirtualBox Host-Only Network” device which is essentially a loopback adapter for creating network connections between virtual machines and between the host and virutal machines. Unfortunately, it shows up to Windows as an unidentified public network. Connecting to a “public network” ratchets up your firewall and disables network discovery and SMB/CIFS network shares. This is kind of a big side effect of installing some VM software.Fortunately, Windows does have a way to mark a network device as virtual by creating a registry value.

The type of the device. The default value is zero, which indicates a standard networking device that connects to a network. Set *NdisDeviceType to NDIS_DEVICE_TYPE_ENDPOINT (1) if this device is an endpoint device and is not a true network interface that connects to a network. For example, you must specify NDIS_DEVICE_TYPE_ENDPOINT for devices such as smart phones that use a networking infrastructure to communicate to the local computer system but do not provide connectivity to an external network.

This powerhsell script will find the “VirtualBox Host-Only Ethernet Adapter device entry in the registry and adds the ‘*NdisDeviceType’ value of 1. After reboot, the “VirtualBox Host-Only Ethernet Adapter” will no longer be monitored by the Network and Sharing center.

# tell windows that VirtualBox Host-Only Network Adapter
# is not a true network interface that connects to a network
# see http://msdn.microsoft.com/en-us/library/ff557037(VS.85).aspx
pushd
echo 'Marking VirtualBox Host-Only Network Adapter as a virtual device.'
cd 'HKLM:\system\CurrentControlSet\control\class\{4D36E972-E325-11CE-BFC1-08002BE10318}'
ls ???? | where { ($_ | get-itemproperty -name driverdesc).driverdesc `
-eq 'VirtualBox Host-Only Ethernet Adapter' } |`
new-itemproperty -name '*NdisDeviceType' -PropertyType dword -value 1
echo 'After you reboot the VirtualBox Host-Only Network unidentified public network should be gone.'
popd


ASP.NET MVC 2.0 Undocumented Model String Property Binding Breaking Change

The default behavior of Model binding in ASP.NET MVC 1.0 was to initialize strings to string.Empty. However, MVC 2.0 defaults to initializing strings to null. Unfortunately, this is not listed as one of the breaking changes from MVC 1.0.

This can be a big problem if you have a substantial site built in MVC 1.0 and you import it into Visual Studio 2010 and use MVC 2.0. You may find that you are suddenly throwing NullReferenceException during model binding on code that is working in production.

Locating the Issue

ASP.NET MVC source code is available under the Ms-PL open source license which means that it is possible to diff the two versions of MVC and figure out what changed without having to guess or disassemble anything.

DefaultModelBinder-diff

protected virtual object GetPropertyValue(ControllerContext controllerContext, ModelBindingContext bindingContext, PropertyDescriptor propertyDescriptor, IModelBinder propertyBinder) {
    object value = propertyBinder.BindModel(controllerContext, bindingContext);

    if (bindingContext.ModelMetadata.ConvertEmptyStringToNull && Object.Equals(value, String.Empty)) {
        return null;
    }

    return value;
}

MVC 2.0 adds some indirection to the binding behavior of the DefaultModelBinder class via a new ModelMetadata class which governs some of the behavior of DefaultModelBinder. Of particular interest here is that there is a property named ConvertEmptyStringToNull which, when true, causes an empty string to be returned as null instead.

It also turns out that when ModelMetaData is constructed, the value of ConvertEmptyStringToNull is set to true and there is no code in MVC which changes that default value.

ModelMetaData.ConvertEmptyString-default

Restoring MVC 1.0 Model String Property Binding Behavior

I think the quickest solution is going to be to set that ConvertEmptyStringToNull property to false by providing a creating a new, custom default implementation of IModelBinder by inheriting from DefaultModelBinder and changing the ConvertEmptyStringToNull value to false on the internal ModelMetaData object.

public sealed class EmptyStringModelBinder : DefaultModelBinder 
{
    public override object BindModel(ControllerContext controllerContext, ModelBindingContext bindingContext)
    {
        bindingContext.ModelMetadata.ConvertEmptyStringToNull = false;
        return base.BindModel(controllerContext, bindingContext);
    }
}

Then the Application_Start() method of Global.asax.cs, we set the DefaultBinder property of ModelBinderDictionary to use our custom type instead of leaving it as the default. This is exposed through the Binders property on the static ModelBinders class so that we get our IModelBinder as the default instead of DefaultModelBinder.

protected void Application_Start()
{
	ModelBinders.Binders.DefaultBinder = new EmptyStringModelBinder();
	RegisterRoutes( RouteTable.Routes );
}

VirtualBox Seemless Mode Glitch with Ubuntu

I recently installed Ubuntu 10.04 Lucid Lynx in VirtualBox. I applied all updates and installed the VM additions. When I tried swtiching into seemless mode, I ran into a problem where all of the window chrome disappeared. Well, the title bars were gone as well as a little bit of the status bar that includes the rounded corners. The title bars were still there but still, I was unable to move or resize windows in seamless mode, which is not ideal.

vbox-seemless-bug

Seemless Mode Only Works Correctly with all Effects Disabled

After fiddling for a while, I realized that Ubuntu had turned on some window effects after rebooting with the VM additions installed. Disabling all window effects solves the problem.

ubuntu-disable-effects

Seamless mode is cool but it isn’t multi-monitor aware. I can’t drag guest OS windows off of the monitor where seamless mode was initiated. If I define the VM as having 2 monitors in VirtualBox, Ubuntu doesn’t see the second one. Also, the screens size is set to the resolution of my smaller monitor in seamless mode which causes the Ubuntu desktop toolbars to float in the middle of my screen. Basically, the entire guest OS desktop has to be on one of my monitors but I can move everything with {right-CTRL+L} (to exit seamless mode), move guest OS window to desired monitor, {right-CTRL+L} (to enter seamless mode on the new monitor). This solution is workable, but hopefully some day VirtualBox seamless mode will automagically present multiple monitors to the guest OS and let me drag windows around arbitrarily.

VirtualBox Trigger Windows 7 Bug Causing it to Believe the File System is Broken

Booting for 4 hours and another hour to go

IMG_20100913_173926

VirtualBox 3.2.8 (and possibly other versions) running on Windows 7 or Windows Server 2008 R2 as the host operating sytsem triggers a bug that causes Windows 7 or Windows Server 2008 R2 to believe that NTFS is corrupt. Chkdsk then runs on every boot. In fact chkdsk is chasing a ghost. There is no corruption but on my system chkdsk takes over 5 hours to run to completion.

I don’t believe that VirtualBox is doing anything hinky. It is just unlucky to trigger a regression in NTFS.

This is a known regression in Windows 7 in the NTFS file system.  It occurs when doing a superceding [sic] rename over a file that has an atomic oplock on it (atomic oplocks are a new feature in Windows 7).  The indexer uses atomic oplocks which is why it helped when you disabled the indexer.  Explorer also uses atomic oplocks which is why you are still seeing the issue.  When this occurs STATUS_FILE_CORRUPT is incorrectly returned and the volume is marked "dirty" which is a signal to the system that chkdsk needs to be run.  No actual corruption has occured [sic].

Neal Christiansen
NTFS Development Lead

Fortunately, a hotfix already exists. Applying KB982927 appears to fix the issue for me.

Achieve Radically Better Firewire 800 Performance on Windows 7/2008/Vista x64

Microsoft is clearly ambivalent about Firewire (IEEE 1394). Windows 7 introduced a new IEEE 1394 stack but the performance remains abysmal compared to a Mac.

I have a Drobo 2nd generation device. My choice is to use slow USB 2 or a fast Firewire 800 interface. Unfortunately, large file copies over Firewire on Windows 7 or Windows Server 2008 R2 are slower than USB2. Worse, they can lock up the target device so that nothing else can access it until the file copy succeeds. I don’t have an explanation for why native Firewire support remains so lame but fortunately there is a free-of-charge solution to turbo-charge Windows Firewire performance to Mac OS X levels.

If you have a Firewire interface on your Windows 7 computer, do not hack around with legacy drivers and do not turn off the Windows write-cache buffer. Instead, download the latest ubiCore Firewire drivers for your version of Windows and install them. (Requires disconnecting all your Firewire devices and a reboot.)

Enjoy the painless 800 Mbps file copies.

Highly recommended.

Copy and Paste with Clipboard from PowerShell

Many times I want to use PowerShell to manipulate some text that I have in something else like email or a text editor or some such or conversely, I do something at the command line and I want to paste the output into a document of some kind.

I’m not sure why there is no cmdlet for outputting to clipboard.

PS> get-command out-*

CommandType     Name             Definition                                         
-----------     ----             ----------                                         
Cmdlet          Out-Default      Out-Default [-InputObject <PSObject>] [-Verbose]...
Cmdlet          Out-File         Out-File [-FilePath] <String> [[-Encoding] <Stri...
Cmdlet          Out-GridView     Out-GridView [-InputObject <PSObject>] [-Title <...
Cmdlet          Out-Host         Out-Host [-Paging] [-InputObject <PSObject>] [-V...
Cmdlet          Out-Null         Out-Null [-InputObject <PSObject>] [-Verbose] [-...
Cmdlet          Out-Printer      Out-Printer [[-Name] <String>] [-InputObject <PS...
Cmdlet          Out-String       Out-String [-Stream] [-Width <Int32>] [-InputObj...

Fortunately, though Windows ships with a native clip.exe for piping text to clipboard. You can alias it to out-clipboard or just pipe to “clip”.

new-alias  Out-Clipboard $env:SystemRoot\system32\clip.exe

Now you can just pipe to Out-Clipboard or clip just like the other Out-* cmdlets.

The larger issue is that PowerShell doesn’t expose a way to paste text from the clipboard. However, the WinForms API of the .NET Framework has a GetText() static method on the System.Windows.Forms.Clipboard class. There is a catch, though:

The Clipboard class can only be used in threads set to single thread apartment (STA) mode. To use this class, ensure that your Main method is marked with the STAThreadAttribute attribute.

This is unfortunate because PowerShell runs in a multithreaded apartment by default which means that the workaround is to spin up a new PowerShell process which is slow.

function Get-ClipboardText()
{
	$command =
	{
	    add-type -an system.windows.forms
	    [System.Windows.Forms.Clipboard]::GetText()
	}
	powershell -sta -noprofile -command $command
}

A slightly less obvious approach is to paste the text from the clipboard into a non-visual instance of the System.Windows.Forms.Textbox control and then emit the text of the control to the console. This works fine in a standard PowerShell process and is substantially faster than spinning up a new process.

function Get-ClipboardText()
{
	Add-Type -AssemblyName System.Windows.Forms
	$tb = New-Object System.Windows.Forms.TextBox
	$tb.Multiline = $true
	$tb.Paste()
	$tb.Text
}

The TextBox method is about 500 times faster than the sub-process method. On my system, it takes 0.501 seconds to do spin up the sub-process and emit the text from the clipboard versus 0.001 seconds to use the TextBox control in the current process.

I also alias this function to “paste”. The commands I actually type and remember are “clip” and “paste”.

When using paste, I generally am setting a variable or inserting my pasted text into a pipeline within parens.

CAVEAT: Whenever you capture a value without any {CR}{LF}, the clipboard will add it. You may need to trim the newline to get the behavior you expect when pasting into a pipeline.

Here’s some one-liners to illustrate. I have dig from BIND in my path and I have Select-String aliased to grep. Notice how when I paste the IP address I captured to the clipboard from the one-liner gets pasted back with an extra blank line.

PS> ((dig google.com) | grep '^google\.com') -match '(\d+\.\d+\.\d+\.\d+)$' | out-null; $matches[0]
72.14.235.104
PS> ((dig google.com) | grep '^google\.com') -match '(\d+\.\d+\.\d+\.\d+)$' | out-null; $matches[0] | clip
PS> paste
72.14.235.104

PS> ping (paste)
Ping request could not find host 72.14.235.104
. Please check the name and try again.
PS> ping (paste).trim()

Pinging 72.14.235.104 with 32 bytes of data:
Reply from 72.14.235.104: bytes=32 time=192ms TTL=51
Reply from 72.14.235.104: bytes=32 time=208ms TTL=51
Reply from 72.14.235.104: bytes=32 time=212ms TTL=51
Reply from 72.14.235.104: bytes=32 time=210ms TTL=51

Ping statistics for 72.14.235.104:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 192ms, Maximum = 212ms, Average = 205ms
%d bloggers like this: