Embedding Arbitrary Language Glyphs in PDF with iTextSharp

September 29, 2010 9 Comments

One of my clients has an application which generates a PDF using iTextSharp. The document largely contains English text in the Latin character set, but a portion of the PDF is supposed to contain contact information in a foreign language. In the first version of the software, the requirement was to support the Latin, Cyrillic, Georgian and Armenian character sets.

We quickly discovered during testing that the Adobe Type 1 fonts embedded in itextsharp.dll only support Latin characters. Code points from the Cyrillic, Georgian and Armenian character sets showed up as white space in the document. Fortunately, iTextSharp supports TrueType font embedding with the correct incantation, which enabled us to use Sylfaen to provide the necessary glyphs.

string sylfaenpath = Environment.GetEnvironmentVariable( "SystemRoot" ) + "\\fonts\\sylfaen.ttf";
BaseFont sylfaen = BaseFont.CreateFont( sylfaenpath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED );
Font easternEuroTextFont = new Font( sylfaen, 9f, Font.NORMAL );
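
The path above is built by concatenating the SystemRoot environment variable with a hard-coded folder name. As an alternative, here is a minimal sketch that asks .NET for the fonts directory and only falls back to SystemRoot if that lookup comes back empty. It assumes .NET 4.0 or later, where Environment.SpecialFolder.Fonts is available. FontLocator and ResolveFontPath are hypothetical names for illustration and are not part of iTextSharp.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of iTextSharp): builds the full path to a font file.
public static class FontLocator
{
    public static string ResolveFontPath( string fontFileName )
    {
        // Environment.SpecialFolder.Fonts exists from .NET 4.0 onward (assumption about the target runtime).
        string fontsDir = Environment.GetFolderPath( Environment.SpecialFolder.Fonts );

        // Fall back to the SystemRoot approach if the lookup returns an empty string.
        if( string.IsNullOrEmpty( fontsDir ) )
        {
            string systemRoot = Environment.GetEnvironmentVariable( "SystemRoot" ) ?? string.Empty;
            fontsDir = Path.Combine( systemRoot, "fonts" );
        }

        return Path.Combine( fontsDir, fontFileName );
    }
}
```

The result can be passed straight to BaseFont.CreateFont in place of sylfaenpath, for example BaseFont.CreateFont( FontLocator.ResolveFontPath( "sylfaen.ttf" ), BaseFont.IDENTITY_H, BaseFont.EMBEDDED ).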

With the second version of the application, we needed to support a bunch of new character sets in addition to the ones we previously supported, including Hebrew, Arabic, Devanagari, Sinhala, Lao, Thai and more South Asian and Southeast Asian scripts.

One option is to pick a supporting font for each character set, but we elected to pick something universal: Arial Unicode, which includes glyphs for every code point defined in Unicode 2.1. Arial Unicode is from the Agfa Monotype foundry and is bundled with Office 2007 and later, but it can be purchased separately if Office isn’t installed. (The main side effect of this font choice is moving from a serif font to a sans serif one.)

Universal Glyph Support Can Still Yield Gibberish

The remaining wrinkle is that Hebrew and Arabic are right-to-left languages which means that the characters in a Hebrew or Arabic string are supposed to be rendered from right-to-left instead of left-to-right. Just rendering an Arabic string with Arial Unicode in iTextSharp will yield reflected output which is gibberish.

Here is some reference Arabic text rendered in Arial. It says “al-ingliziya”, the Arabic word for English.

Here’s what you get by default using Arial Unicode in iTextSharp.

Clearly, this is different from the reference rendering. Arabic is complicated because of the way the ligatures work: the shape of a letter is heavily influenced by the letters next to it. But basically, it’s backwards. We no longer get blank space, but Hebrew and Arabic come out as gibberish instead. We need to alter the rendering for Hebrew and Arabic and make them right-to-left.

A simple algorithm is to detect the presence of Hebrew or Arabic code points in a string and turn on right-to-left rendering. Unicode-aware regular expression engines define \p{Hebrew} and \p{Arabic} script classes, which would be useful, but unfortunately those aren’t supported in System.Text.RegularExpressions at this point. (It does support named Unicode blocks such as \p{IsHebrew} and \p{IsArabic}.) Here we roll our own with explicit code point ranges.

//no comma between the ranges: inside [] a comma would match a literal comma
const string regex_match_arabic_hebrew = @"[\u0600-\u06FF\u0590-\u05FF]+";
if( Regex.IsMatch( text, regex_match_arabic_hebrew ) )
{
    //arabic or hebrew characters exist, fix rendering
}
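
To make the detection step concrete, here is a small self-contained sketch of it in two forms: one with explicit code point ranges and one with .NET's named Unicode blocks, \p{IsHebrew} and \p{IsArabic}. The RtlDetector class and its method names are made up for illustration. Note that both forms only cover the basic Hebrew (U+0590 to U+05FF) and Arabic (U+0600 to U+06FF) blocks. Other right-to-left ranges, such as the Arabic presentation forms, are deliberately left out to match the scope of this post.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration: detects Hebrew or Arabic code points in a string.
public static class RtlDetector
{
    // Explicit ranges: Hebrew (U+0590-U+05FF) and Arabic (U+0600-U+06FF).
    // There is no comma inside the brackets; a comma there would be matched literally.
    private static readonly Regex ExplicitRanges = new Regex( @"[\u0590-\u05FF\u0600-\u06FF]" );

    // The same two blocks written with .NET named-block syntax.
    private static readonly Regex NamedBlocks = new Regex( @"[\p{IsHebrew}\p{IsArabic}]" );

    public static bool ContainsRtl( string text )
    {
        return !string.IsNullOrEmpty( text ) && ExplicitRanges.IsMatch( text );
    }

    public static bool ContainsRtlNamedBlocks( string text )
    {
        return !string.IsNullOrEmpty( text ) && NamedBlocks.IsMatch( text );
    }
}
```

Either method can stand in for the Regex.IsMatch call when deciding whether to switch an element to right-to-left rendering. RegexOptions.IgnoreCase is not needed because the pattern contains no cased letters.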

There’s no obvious RTL option for a text element in iTextSharp, so I tried reversing the strings, which is a slight improvement but it’s still broken. What we have is brain-dead rendering. The ligatures are not connecting the letters correctly.

On closer examination, there is RTL support in iTextSharp. It is exposed through object graph elements that implement IPdfRunDirection. (This is one of the places where it really shows that iTextSharp is a Java port. The use of static integer constants rather than enums is very Java 1.4. Enums are much more discoverable and the correct usage is more intuitively obvious.)

element.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
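
As a sketch of how the integer constants could be hidden behind something more discoverable, the following wraps the run direction choice in a small enum. TextDirection and DirectionalCells are hypothetical names, not part of iTextSharp. The only library members used are the PdfPCell( Phrase ) constructor, the PdfPCell.RunDirection property, and the PdfWriter.RUN_DIRECTION_LTR and PdfWriter.RUN_DIRECTION_RTL constants. It assumes an iTextSharp 5.x reference.

```csharp
using iTextSharp.text;
using iTextSharp.text.pdf;

// Hypothetical enum for illustration; iTextSharp itself exposes integer constants.
public enum TextDirection
{
    LeftToRight,
    RightToLeft
}

// Hypothetical factory: creates a table cell with an explicit run direction.
public static class DirectionalCells
{
    public static PdfPCell Create( string text, Font font, TextDirection direction )
    {
        PdfPCell cell = new PdfPCell( new Phrase( text, font ) );

        // Map the enum onto the integer constants that iTextSharp expects.
        cell.RunDirection = direction == TextDirection.RightToLeft
            ? PdfWriter.RUN_DIRECTION_RTL
            : PdfWriter.RUN_DIRECTION_LTR;

        return cell;
    }
}
```

A call such as table.AddCell( DirectionalCells.Create( text, font, TextDirection.RightToLeft ) ) then reads more clearly at the call site than a bare integer constant.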

Now the output from iTextSharp looks like the reference rendering.

Transliterate to Java and the same concepts apply to iText.

Example Snippet Code in C#

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;

//... assume a class that does stuff exists

public byte[] CreatePdfStreamPdfWithRandomLanguageSupport( IEnumerable<string> textList )
{
	//C# does not support \p{Arabic} and \p{Hebrew} character classes. We have to roll our own.
	//We are assuming any string that contains an Arabic or Hebrew character is meant to be RTL.
	//Better would be to break strings into word tokens and test each word.
	//No comma between the ranges: inside [] a comma would match a literal comma.
	const string regex_match_arabic_hebrew = @"[\u0600-\u06FF\u0590-\u05FF]+";
	//not const: the path is computed at run time
	string arialunicodepath = Environment.GetEnvironmentVariable( "SystemRoot" ) + "\\fonts\\ARIALUNI.TTF";

	Document document = new Document( PageSize.LETTER );
	using( MemoryStream stream = new MemoryStream() )
	{
		PdfWriter writer = PdfWriter.GetInstance( document, stream );
		try
		{
			//bunch of document setup here.
			document.Open();
			//arbitrarily, creating a 5 column table.
			PdfPTable table = new PdfPTable( 5 );

			//embed a Unicode font with broad glyph support for any code point we might need.
			//only the glyphs for code points actually used will be embedded in the document
			BaseFont nationalBase;
			if( File.Exists( arialunicodepath ) )
				nationalBase = BaseFont.CreateFont( arialunicodepath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED );
			else
				throw new FileNotFoundException( "Could not find \"Arial Unicode MS\" font installed on this system." );

			Font nationalTextFont = new Font( nationalBase, 9f, Font.NORMAL );

			foreach( string text in textList )
			{
				PdfPCell cell = new PdfPCell();
				//Arabic and Hebrew strings need right-to-left rendering, which is done by
				//setting RunDirection on the cell. Otherwise, your RTL language text
				//comes out as backwards gibberish.
				if( text != null && Regex.IsMatch( text, regex_match_arabic_hebrew ) )
					cell.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
				//apply unicode font
				Phrase phrase = new Phrase( text, nationalTextFont );
				cell.AddElement( phrase );
				table.AddCell( cell );
			}
			//pad the last row with empty cells; an incomplete row is not rendered
			table.CompleteRow();
			document.Add( table );
		}
		finally
		{
			document.Close();
			writer.Close();
		}
		//ToArray returns only the bytes written; GetBuffer can include unused trailing capacity
		return stream.ToArray();
	}
}
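
The comment in the snippet above suggests breaking strings into word tokens and testing each word. Here is one minimal way that could look, using only the .NET base class library. WordDirection, Classify and IsMostlyRtl are hypothetical names, and the "more than half of the words" rule is an arbitrary threshold chosen for illustration, not something iTextSharp prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration: classifies whitespace-separated words as RTL or not.
public static class WordDirection
{
    // Hebrew (U+0590-U+05FF) and Arabic (U+0600-U+06FF) blocks only.
    private static readonly Regex RtlChars = new Regex( @"[\u0590-\u05FF\u0600-\u06FF]" );

    // Returns each word paired with a flag saying whether it contains a Hebrew or Arabic character.
    public static List<KeyValuePair<string, bool>> Classify( string text )
    {
        var result = new List<KeyValuePair<string, bool>>();
        if( string.IsNullOrEmpty( text ) )
            return result;

        // Passing a null separator array splits on any whitespace.
        string[] words = text.Split( (char[])null, StringSplitOptions.RemoveEmptyEntries );
        foreach( string word in words )
            result.Add( new KeyValuePair<string, bool>( word, RtlChars.IsMatch( word ) ) );

        return result;
    }

    // True when more than half of the words contain RTL characters (illustrative threshold).
    public static bool IsMostlyRtl( string text )
    {
        List<KeyValuePair<string, bool>> words = Classify( text );
        int rtlCount = 0;
        foreach( KeyValuePair<string, bool> word in words )
            if( word.Value )
                rtlCount++;

        return rtlCount * 2 > words.Count;
    }
}
```

With a helper like this, an English sentence that merely quotes one Hebrew or Arabic word would stay left-to-right, while a mostly Arabic string would still be switched to RUN_DIRECTION_RTL.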

9 Responses to Embedding Arbitrary Language Glyphs in PDF with iTextSharp

ibrahem moahmed says:

December 9, 2010 at 12:43 am

This is good, but I cannot read the text from the file. Is there any way to select the encoding or font type?

Brian Reiter says:

December 14, 2010 at 12:32 pm

@ibrahim moahmed: In iText you use the Font type to define the font used by an element in the PDF document. You have to select a font that contains the glyphs you want to render. There is no automatic substitution.

ash says:

August 18, 2011 at 12:12 pm

Hello,
I am using the following code to format HTML text that I get from an editor:
System.Collections.Generic.List<IElement> htmlArray = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(new StringReader(strtText), null);

It returns an IElement list. How can I convert that to an IEnumerable<string> textList?
Please help.

- Brian Reiter says:
  
  August 18, 2011 at 2:06 pm
  
  I don’t really know, partly because I don’t know what a textList is. Have you tried iterating the list and calling ToString() on each element? The comment on IElement.ToString() is “Gets the content of the text element.”.
  
  - ash says:
    
    August 21, 2011 at 7:21 am
    
    http://www.codeproject.com/Questions/155098/Create-pdf-from-persian-html-file-by-ITextSharp?display=Print
    I have this same issue 😦 but it's not answered there.
ash says:

August 21, 2011 at 7:16 am

System.Collections.Generic.List<IElement> htmlArray2 = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(new StringReader(strDetails.ToString()), null);
//add the collection to the document
PdfPTable tab = new PdfPTable(1);
tab.SpacingBefore = 0;
tab.SpacingAfter = 0;
tab.HorizontalAlignment = 2;
tab.RunDirection = PdfWriter.RUN_DIRECTION_RTL;

foreach (IElement element in htmlArray2)
{
Phrase myp = element as Phrase;
//Paragraph myp = element as Paragraph;//also not wrking

if (myp != null)
{
//tbl.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
BaseColor clr = new BaseColor(System.Drawing.Color.Blue);
iTextSharp.text.Font fontozel = new iTextSharp.text.Font(bffont, 10, iTextSharp.text.Font.NORMAL, clr);
myp.Font = fontozel;
//PdfPCell cells = new PdfPCell(tbl);
PdfPCell cells = new PdfPCell(new Phrase(myp.Content,fontozel));
cells.Border = 0;
cells.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
//cells.AddElement(element);(this creates empty pdf)
tab.AddCell(cells);

}
document.Add(tab);
}

This is my code. I am reading the contents from an HtmlEditor. Everything works fine, but whatever comes inside tags is missing. This is the HTML:

هاتشو! كلما شعرنا بالزكام يقترب منّا، نسارع باللجوء إلى مصدر من مصادر الفيتامين C. إما كوب من عصير البرتقال، الليموناضة أو المكمّلات الغذائية المكوّنة منه، والتي تُعتبر العلاج السحري للزكام.هل الفيتامين C يحمي أو يشفي من الزكام؟على مر السنين، أعتبر الفيتامين C كحامٍ أو شافٍ من الزكام. هذه الإدّعاءات لم تُثبت حقيقتها علمياً، ولكن من الثابت والمتعارف عليه أنّ تناول الكميات المناسبة من الفيتامين C يلعب دوراً هاماً في محاربة العدوى. فترة زكام أقصرأظهرت الدراسات الأخيرة أن الفيتامين C لا يمكنه إبعاد الزكام عن معظم الناس. ولكنه يخفف من مدة المرض عند تناوله في أوّل ظهور لعلامات المرض.يُجنّب الزكام عند الرياضيين والأشخاص كثيري النشاطمن ناحية أخرى، يُعدّ الفيتامين C حماية للأشخاص كثيري النشاط في الطقس البارد جداً – معظمهم عدّائي الماراثون – إذ أنه يخفف من خطر الإصابة بالمرض بنسبة النصف!ما هي فوائد الفيتامين C لجسم الإنسان؟نحتاج للفيتامين C لأنه: يكوّن الكولاجين، الأنسجة التي تساعد في تماسك الخلايا مع بعضها في أجسامنا. ضروري لصحّة العظام، الأسنان، اللثّة والأوعية الدمويّة في أجسامنا. يساعد أجسامنا على إمتصاص والكالسيوم يساعد على إلتئام الجروح يساهم في وظائف الدماغلا تستطيع أجسامنا إنتاج الفيتامين C ، لذا علينا تأمينه من خلال نظامنا الغذائي.ما هي الكميّة التي نحتاجها؟تحتاج النساء إلى 60 مغ من الفيتامين C يومياً. أما المدخنّين فيحتاجون إلى 35 مغ إضافياً عن غير المدخنّين، لأنّ المدخنين أكثر عرضة للأكسدة من سموم السجائر، ولديهم مستوى أقل من الفيتامين C في الدم.أين يمكننا أن نجد الفيتامين C؟معظم الناس تعتقد أنّ الحمضيّات هي المصدر الأفضل والوحيد للفيتامين C ، ولكن هنالك الكثير من الفاكهة والخضار الأخرى التي تحتوي عليه بكثرة. 
عصائر الفاكهة الطبيعية هي من المصادر اللذيذة والعمليّة للفيتامين C أيضاً.الطعام الفيتامين Cغوافا، وسط (1)165 مغبابايا، وسط (2/1 فنجان)95 مغفلفل أحمر (2/1 فنجان)95 مغبرتقال، وسط (1) 60 مغبروكوللي، مسلوق (2/1 كوب) 60 مغفلفل أخضر (2/1 فنجان)45 مغفراولة (2/1 فنجان)45 مغغرايب فروت (2/1 حبّة)40 مغبطاطس، مشوي مع القشرة (1) 25 مغملفوف، نيء (2/1 فنجان)25 مغبندورة، وسط، نيئة 25 مغسبانغ، نيئة (1 فنجان)15 مغعندما تشعرين باقتراب الزكام، بل إجعليه جزءاً من نظام غذائك الصحّي اليومي C لا تلجأي فقط للمأكولات الغنيّة بالفيتامين.

Do you have any idea?
Please help.

- kashifnizam says:
  
  January 31, 2012 at 1:52 pm
  
  Hi Ash,
  Were you able to fix your problem? I’m also using the following
  
  System.Collections.Generic.List<IElement> htmlArray2 = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(new StringReader(strDetails.ToString()), null);
  
  But unable to get the pdf the way you did. Can you please help me out.
  Thanks,
  kashif
  
  - Talha Ashfaque Khan says:
    
    September 17, 2012 at 7:21 am
    
    any update on arabic/persian html-pdf issue?
Salman Hasrat says:

April 3, 2012 at 12:35 pm

Cheers man, your blog helped me a lot! Thanks a billion!
