Wednesday, 20 February 2013

Some Operations on Arabic Characters

I confess that I am not a Unicode expert and I don't know the related terminology, but I want to note my non-expert experience in case it helps someone. I will talk about
  • Importing Arabic data into MSSQL database
  • The non-joining Arabic characters problem
  • Examining Arabic characters by their decimal and hexadecimal values
Recently, I copied some Arabic text from a PDF document to the clipboard, but the order of the characters in each word came out reversed. For example هجو became ﻮﺠه. Converting the PDF to other formats such as DOC didn't help in keeping the correct order. Simply put, what you see is not what you get. This is one of the things that I don't like about PDFs. I was just planning to keep the Arabic words in a database, and luckily I found a time-saving function in SQL Server 2008: the REVERSE() function. Before explaining this easy function, I will explain how I imported the Arabic word list.

Getting Raw Text File - I converted the PDF to DOC with one of the free services on the internet. Then I saved the DOC file as an HTML document using Microsoft Word's save options. Now I am free to make any changes to this HTML with a text editor: there are no hidden spaces, no hidden margins, no hidden tables, no hidden styles, no hidden anything. Since Notepad can't open such a big file, I used Notepad++. I recommend Notepad++ because it is also good for cleaning up the garbage styles that Microsoft Word adds while saving the HTML file. This text editor has a nice Find & Replace feature and supports regex. The tricks in this article may be helpful when using regex replace in Notepad++.

Getting Tab Delimited File - After cleaning the unnecessary code blocks out of the document, I got a two-column text file. Each line in the text file has an Arabic word and its meaning, separated from each other by a tab character. You can call this a tabbed or tab-delimited text file. You could also generate such a file by copying a table from Microsoft Word or from a web browser and pasting it into a text editor such as Notepad++. One good thing about tab-delimited text files is that they can be easily imported by SQL Server, Microsoft Excel, and similar software.
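Before importing, it may be worth checking that every line really has two columns. The following is only a minimal sketch in JavaScript, not part of my original workflow; the function name parseTabDelimited and the sample data are made up for illustration.

```javascript
// Hypothetical helper: split tab-delimited text into { arabic, meaning } rows.
// Lines without exactly two non-empty columns are collected in `rejected`
// so they can be fixed by hand in the text editor.
function parseTabDelimited(text) {
  const rows = [];
  const rejected = [];
  // Accept both LF and CRLF line endings.
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines
    const columns = line.split("\t");
    if (columns.length === 2 && columns[0] !== "" && columns[1] !== "") {
      rows.push({ arabic: columns[0], meaning: columns[1] });
    } else {
      rejected.push({ lineNumber: index + 1, line });
    }
  });
  return { rows, rejected };
}

// Made-up sample: one good line and one line that lost its tab.
const sampleText = "\u0647\u062C\u0648\tsample meaning\nline without a tab\n";
const parsed = parseTabDelimited(sampleText);
console.log(parsed.rows.length, "good rows,", parsed.rejected.length, "rejected");
```

The rejected list carries the 1-based line number, so a broken line can be found quickly in the editor.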

Importing Arabic Text to SQL Server - If you have Microsoft SQL Server Management Studio installed, open its import and export wizard. As the data source, select flat file source and locate the text file, which has an .html extension in this case. Since we have Arabic characters in our text file, "65001 (UTF-8)" should be selected as the code page value. If it is not selected by default, open your text file in Notepad++ and choose the "Encode in UTF-8" option. Check whether any characters get spoiled; if not, then save the file and reopen the import wizard.

Now, back in the import/export wizard, select the new line character {LF} as the row delimiter and the tab character {t} as the column delimiter. By default, all columns are of data type "string [DT_STR]" with a width of 50 characters. If one of your columns contains text longer than 8000 characters, then choose "text stream [DT_TEXT]" for that column. Once you have a successful preview, click next until you see the Edit Mappings button. Since we have Arabic data, we should edit the column mappings. Change the type of a column from varchar to nvarchar, or from text to ntext, if it contains Arabic text. The ordinary varchar and text types don't support Unicode characters. If you continue without making any change here, your characters may get spoiled. Click next and finish. If you get any truncation errors, increase the character limit of your columns on the next try, and don't forget to delete the unsuccessful table first.
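To guess a column width before running the wizard, the longest value in each column can be measured first. This is a rough sketch under my own assumptions: the helper name maxColumnWidths is hypothetical, and it counts UTF-16 code units, which I believe is how nvarchar lengths are counted, so treat the numbers as a starting point rather than an exact limit.

```javascript
// Hypothetical helper: longest value per column of a tab-delimited text,
// measured in UTF-16 code units (JavaScript's string length).
function maxColumnWidths(text) {
  const widths = [];
  for (const line of text.split(/\r?\n/)) {
    if (line === "") continue; // ignore empty lines
    line.split("\t").forEach((value, i) => {
      widths[i] = Math.max(widths[i] || 0, value.length);
    });
  }
  return widths;
}

// Made-up sample with two columns.
console.log(maxColumnWidths("ab\tcde\nabcd\tc\n")); // [ 4, 3 ]
```

If a reported width is above the default of 50, that column is a likely source of truncation errors.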

Reversing Arabic Text - After successfully importing the data, select the top 1000 rows from your table. The Arabic text looks reversed throughout the entire column. We should add the REVERSE() function to correct it, as in the following query:
SELECT REVERSE([arabic]), [meaning] FROM [dictionaries].[dbo].[imported]
Besides reversing the Arabic text, we really need an ID column in our new table. I replicated the imported table with SQL Server's "script table as CREATE to" feature and added an auto-incremented ID column to the new table. Then I transferred the data from the old table to the new table with this query:
INSERT INTO reversed (arabic, meaning)
SELECT REVERSE(arabic), meaning FROM imported ORDER BY arabic
I should also note here that if you want to filter by a Unicode string, you should add N before the string literal in the query. See the example below:
SELECT * FROM imported WHERE arabic = N'هجو'
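The reversal itself can also be tried outside the database. Here is a small JavaScript sketch of the same idea; it is not what REVERSE() does internally, just an illustration, and the function name reverseCharacters is made up. The sample string is written with \u escapes so that the character order in the source code is unambiguous.

```javascript
// Hypothetical helper: reverse a string character by character.
// Array.from splits by code point, so surrogate pairs are not broken apart.
function reverseCharacters(text) {
  return Array.from(text).reverse().join("");
}

// Characters in the wrong (PDF) order: WAW final form, JEEM medial form, HEH.
const copiedFromPdf = "\uFEEE\uFEA0\u0647";
const corrected = reverseCharacters(copiedFromPdf);

// Print the code points of the corrected string in hex.
console.log(Array.from(corrected).map((c) => c.codePointAt(0).toString(16)));
```

After the reversal the characters are in reading order again, but the presentation-form characters are still there.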
Joining Separated Arabic Characters - Now the order of the Arabic characters is correct, but there is one problem. Some characters look separated. هجو looks like هﺠﻮ . There is no space or zero-width character in the word, but some characters are not joined to the others. I had a hard time solving this. Then I learned about the Arabic Unicode blocks from this Wikipedia page. If you type any ordinary Arabic characters, they all join each other. They are shaped automatically by the software you type in. On the other hand, if you enter a character from Arabic Presentation Forms-A or Arabic Presentation Forms-B directly, the software leaves it as it is. In order to see the difference between simple Arabic characters and their presentation forms, you can copy and paste هجو and هﺠﻮ into this page's search box. It will show the difference between the two HEH characters.

Arabic characters have hex values between 0x0600 and 0x06FF, and presentation-form characters have hex values between 0xFB50 and 0xFDFF (Forms-A) and between 0xFE70 and 0xFEFF (Forms-B). You can convert hexadecimal to decimal by writing console.log(0xFEFF); in your Firebug console. You can also try these JavaScript commands to see the difference between the Unicode blocks:
//A character from Arabic Presentation Forms-B:
"ﻻ".charCodeAt(0); //65275

//A simple Arabic character:
"ا".charCodeAt(0); //1575

//Another simple Arabic character:
"ل".charCodeAt(0); //1604

//Two simple characters joined each other as a glyph:
console.log("ل" + "ا");‎ // ‬لا (Don't copy this comment.)

//First character is a simple Arabic character:
"لا".charCodeAt(0); //1604

//Second character is also a simple Arabic character:
"لا".charCodeAt(1); //1575
The following commands may also be helpful:
//Get character from its hex value.
String.fromCharCode(0xfe80);

//Get character from its Unicode escape sequence.
console.log("\uFE80");
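The range checks above can be wrapped into a small helper that reports, for every character of a word, its decimal value, its hex value and the block it falls in. This is a minimal sketch; the names arabicBlockOf and describeCharacters are made up for this example, and only the three ranges mentioned above are recognized.

```javascript
// Hypothetical helper: name the Unicode block of a single character,
// using only the three ranges discussed in this post.
function arabicBlockOf(char) {
  const code = char.codePointAt(0);
  if (code >= 0x0600 && code <= 0x06ff) return "Arabic";
  if (code >= 0xfb50 && code <= 0xfdff) return "Arabic Presentation Forms-A";
  if (code >= 0xfe70 && code <= 0xfeff) return "Arabic Presentation Forms-B";
  return "other";
}

// Hypothetical helper: list decimal value, hex value and block per character.
function describeCharacters(text) {
  return Array.from(text).map((char) => {
    const code = char.codePointAt(0);
    return {
      char,
      decimal: code,
      hex: "0x" + code.toString(16).toUpperCase().padStart(4, "0"),
      block: arabicBlockOf(char),
    };
  });
}

// HEH (simple), JEEM medial form, WAW final form.
console.log(describeCharacters("\u0647\uFEA0\uFEEE"));
```

Running it on a word that looks separated shows at a glance which characters are presentation forms.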
Converting from Arabic Presentation Forms to Arabic - What I need to do is convert every presentation character to a simple Arabic character. On this topic, I could find only Accorpa's Arabic Converter From and To Arabic Presentation Forms B, which was developed for iOS applications. I converted part of it to C# code and added support for some more characters according to my database. You can examine my C# application, which converts some characters to simple Arabic and raises an alert on the others, so that you can visit the Wikipedia page titled "Arabic script in Unicode", find the simple character corresponding to a presentation character, and add it to the C# application. I know the C# code is not clean or smart, but it does the job. By the way, Windows' scientific calculator is very handy for converting from and to hex, and don't forget İsa Sarı's simple Arabic keyboard.

You can download Arabic Character Converter from here: https://www.box.com/s/7lqe07i75eocmpt8fn41 It is an incomplete C# project. Someone may finish it by mapping all the necessary presentation characters to simple Arabic characters. Also note that this project works with my database, so you should adapt it if you want to use it and add any other presentation characters that you need.
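As a different route from the hand-written mapping in the C# project, Unicode compatibility normalization (NFKC) can do a similar conversion, because the presentation forms carry compatibility mappings to the simple characters. The JavaScript sketch below is not the method used in my converter; the function name toBaseArabic is made up, and I have only checked it on a few characters, so verify the result on your own data.

```javascript
// Hypothetical helper: replace Arabic presentation forms with simple Arabic
// characters by applying NFKC normalization to those characters only.
// Assumption: the built-in String.prototype.normalize mapping is acceptable
// for your data (some forms expand to more than one character).
function toBaseArabic(text) {
  return Array.from(text)
    .map((char) => {
      const code = char.codePointAt(0);
      const isPresentationForm =
        (code >= 0xfb50 && code <= 0xfdff) || (code >= 0xfe70 && code <= 0xfeff);
      // Leave every other character untouched.
      return isPresentationForm ? char.normalize("NFKC") : char;
    })
    .join("");
}

// HEH (simple), JEEM medial form, WAW final form.
const separated = "\u0647\uFEA0\uFEEE";
const joined = toBaseArabic(separated);
console.log(Array.from(joined).map((c) => c.codePointAt(0).toString(16)));
```

Only characters inside the two presentation-form ranges are normalized, so Latin text, digits and already simple Arabic characters pass through unchanged.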
