Embedding Arbitrary Language Glyphs in PDF with ItextSharp

One of my clients has an application which generates a PDF using ITextSharp. The document largely contains English text in the Latin character set but a portion of the PDF is supposed to contain contact information in a foreign language. In the first version of the software, the requirement was to support Latin, Cyrillic, Georgian and Armenian character sets.

We quickly discovered during testing that the Adobe Type 1 fonts embedded in itextsharp.dll only support Latin characters. Code points from the Cyrillic, Georgian and Armenian character sets showed up as white space in the document. Fortunately, iTextSharp supports TrueType font embedding with the correct incantation which enabled us to use Sylfaen to provide the necessary glyphs.

string sylfaenpath = Environment.GetEnvironmentVariable( "SystemRoot" ) + "\\fonts\\sylfaen.ttf";
BaseFont sylfaen = BaseFont.CreateFont( sylfaenpath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED );
Font easternEuroTextFont = new Font( sylfaen, 9f, Font.NORMAL );

With the second version of the application, we needed to support a bunch of new character sets in addition to the ones we previously supported including Hebrew, Arabic, Devangari, Sinhala, Lao, Thai and more South Asian and Southeast Asian scripts.

One option is to pick supporting fonts for each character set but we elected to pick something universal, which is Arial Unicode which includes glyphs for every code point defined in Unicode 2.1. Arial Unicode is from the Afra Monotype foundry and is bundled with Office 2007 and later but can be purchased separately if Office isn’t installed. (The main side effect of this font choice is moving from a serif font to a sans serif one.)

Universal Glyph Support Can Still Yield Gibberish

The remaining wrinkle is that Hebrew and Arabic are right-to-left languages which means that the characters in a Hebrew or Arabic string are supposed to be rendered from right-to-left instead of left-to-right. Just rendering an Arabic string with Arial Unicode in iTextSharp will yield reflected output which is gibberish.

Here is some reference Arabic text rendered in Arial. It says “al-ingliziya”, the Arabic word for English.

al-ingliziyah

Here’s what you get by default using Arial Unicode in iTextSharp.

default-broken-al-ingliziyah

Clearly, this is different from the reference rendering. Arabic is complicated because of the way the ligatures work so that the shape of a letter is heavily influenced by the letters next to it but basically, it’s backwards. What we have is now not nothing but Hebrew and Arabic are gibberish instead. We need to alter the rendering for Hebrew and Arabic and make them right-to-left.

A simple algorithm is to detect the presence of Hebrew or Arabic code points in a string and turn on right-to-left rendering. Regular Expressions define \p{Hebrew} and \p{Arabic} character classes which would be useful but unfortunately those aren’t supported in System.Text.RegularExpressions at this point. We need to roll our own.

const string regex_match_arabic_hebrew = @"[\u0600-\u06FF,\u0590-\u05FF]+";
if( Regex.IsMatch( text, regex_match_arabic_hebrew, RegexOptions.IgnoreCase ) 
    //arabic or hebrew characters exist, fix rendering

There’s no obvious RTL option for a text element in iTextSharp, so I tried reversing the strings, which is a slight improvement but it’s still broken. What we have is brain-dead rendering. The ligatures are not connecting the letters correctly.

string-reverse-broken-al-ingliziyah

On closer examination, there is RTL support in iTextSharp. It is exposed through object graph elements that implement IPdfRunDirection. (This is one of the places where it really shows that iTextSharp is a Java port. The use of static integer constants rather than enums is very Java 1.4. Enums are much more discoverable and the correct usage is more intuitively obvious.)

element.RunDirection = PdfWriter.RUN_DIRECTION_RTL;

Now the output from iTextSharp looks like the reference rendering.

correct-rtl-al-ingliziyah

Transliterate to Java and the same concepts apply to iText.

Example Snippet Code in C#

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;

//... assume a class that does stuff exists

pubilc byte[] CreatePdfStreamPdfWithRandomLanguageSupport( IEnumerable<string> textList )
{
	//C# does not support \p{Arabic} and \p{Hebrew} character classes. We have to roll our own.
	//We are assuming any string that contains an Arabic or Hebrew character is meant to be RTL.
	//Better would be to break strings into word tokens and test each word.
	const string regex_match_arabic_hebrew = @"[\u0600-\u06FF,\u0590-\u05FF]+";
	const string arialunicodepath          = Environment.GetEnvironmentVariable( "SystemRoot" ) + "\\fonts\\ARIALUNI.TTF";

	Document document = new Document( PageSize.LETTER );
	using(MemoryStream stream = new MemoryStream())
	{
		PdfWriter writer = PdfWriter.GetInstance( document, stream );
		try
		{
			//bunch of document setup here.
			document.Open();
			//arbitrarily, creating a 5 columnt table.
			PdfPTable table = new PdfPTable( 5 );
			
			//embed a Unicode font with broad glyph support for any code point we might need.
			//only the glyphs for code points actually used will be embedded in the document
			BaseFont nationalBase;
			if( File.Exists( arialunicaodepath ) 
				BaseFont.CreateFont( arialunicodepath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED ); 
			else
				throw new FileNotFoundException( "Could not find \"Arial Unicode MS\" font installed on this system." );

			Font nationalTextFont = new Font( nationalBase, 9f, Font.NORMAL );

			foreach( string text in textList )
			{
				//PdfPCell implements IPdfRunDirection
				PdPCell cell = new PdfPCell();
				//Arabic and Hebrew strings need to be reversed for right-to-left rendering
				//which is done by setting IPdfRunDirection.RunDirection. Otherwise, your RTL language text
				//comes out as backwards gibberish.
				if( text != null && Regex.IsMatch( text, regex_match_arabic_hebrew, RegexOptions.IgnoreCase ) )
			   		cell.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
			    //apply unicode font
			    Phrase phrase = new Phrase( text, nationalTextFont );
				cell.Add( phrase );
				table.AddCell( cell );
			}
			document.add( table );
		}
		finally
	    {
	        document.Close();
	        writer.Close();
	    }
	    return stream.GetBuffer();
	}
}

Mitigate Adobe Reader Vulnerabilities with Google Chrome PDF Viewer

Adobe Reader has a growing list of exploits and a current unpatched vulnerability. The crux of the problem is that PDF documents are not simply documents. PDFs can contain arbitrary code in the form of Javascript or Flash as Adobe Reader embeds a full Javascript runtime and a private Flash runtime environment. Javascript can be disabled via the options but the Flash engine cannot be disabled via the GUI.

An interesting development is that Google Chrome 6.x includes its own PDF rendering plugin. This plugin converts PDF to HTML5 and renders it with the webkit engine. It is very fast and a fundamentally different approach from Adobe Reader. Commodity attacks on Adobe Reader should not be effective on Chrome.

The Chrome Beta channel includes the Chrome PDF viewer plug-in but it is disabled by default. Go to the about:plugins page to enable it. You can also disable the Adobe Reader plug-in while you are at it.

chrome-pdf