Wednesday, September 12, 2012

Certain characters get garbled in Java

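In Java, there is a problem where certain characters that are not platform-dependent get garbled. The seven characters "〜‖—−¢£¬" are the ones affected. Garbling of genuinely platform-dependent characters is only to be expected, so I will not cover it here.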

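The problem occurs when both of the following conditions hold:
  • An SJIS-family character encoding is in use
  • Input and output use different character encodings
Conversely, if input and output use the same character encoding, there is no problem at all.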

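In a web system, input is typically accepted from a screen, stored in a DB, and displayed again on search, so there should be few cases where different encodings are deliberately specified for input and output. In practice, therefore, I suspect the problem mostly arises when the implementer "has not specified anything", "does not know what is specified", or "does not know how to find out". The problem can also occur when several systems that use different character encodings are connected together.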

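The seven characters "〜‖—−¢£¬" fall within the range defined by JIS X 0208. They are ordinary characters that can be used in JIS, SJIS, MS932, CP943, and EUC-JP alike. They are not platform-dependent characters. Nevertheless, they get garbled.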

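Java uses Unicode internally, so every character is first stored as Unicode data. This requires mapping information that says which Unicode character each incoming character is stored as. The cause of the problem is that this mapping differs slightly from one character encoding to another.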


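Consider the cent sign "¢" as a concrete example. "¢" is a character defined in JIS X 0208. In other words, it belongs to the Japanese character set, and it is treated much like the full-width "A", "0", and "¥".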

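Unicode defines a counterpart of "¢" in two places: U+00A2 and U+FFE0. U+00A2 sits close to the area that defines the half-width ASCII letters, digits, and symbols, while U+FFE0 sits in the area that defines full-width alphanumerics and symbols and half-width kana. Personally I think U+FFE0 is the more appropriate place to store the SJIS "¢", but in any case the root of the problem is that Unicode offered more than one candidate for storing a single character. Because each vendor then defined this mapping independently, some characters ended up being stored at different Unicode positions depending on the character encoding.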

SJIS assigns "¢" to U+00A2, while MS932 assigns "¢" to U+FFE0. Conversely, SJIS treats U+FFE0 as a character it cannot convert. As a result, a "¢" read in as MS932 is garbled into "?" when written out as SJIS. MS932, on the other hand, can convert U+00A2 to "¢". Presumably MS932 was designed flexibly with compatibility in mind.
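The asymmetry described above can be checked from the encoder side. The following is a minimal sketch, assuming the JDK charset names "Shift_JIS" (for SJIS) and "windows-31j" (for MS932); the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

class CentSignMapping {
    /** Returns true if the named charset has a mapping for the given character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    /** Encodes text like String.getBytes does: unmappable characters become the replacement byte. */
    static String encodeToHex(String charsetName, String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        char cent = '\u00A2';          // CENT SIGN
        char fullwidthCent = '\uFFE0'; // FULLWIDTH CENT SIGN
        for (String name : new String[] {"Shift_JIS", "windows-31j"}) {
            System.out.println(name + " can encode U+00A2: " + canEncode(name, cent));
            System.out.println(name + " can encode U+FFE0: " + canEncode(name, fullwidthCent));
            System.out.println(name + " bytes for U+FFE0: "
                    + encodeToHex(name, String.valueOf(fullwidthCent)));
        }
    }
}
```

A result of 3F for Shift_JIS is the "?" replacement byte, which is the garbling the text describes.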

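This is the mechanism by which the seven characters "〜‖—−¢£¬" get garbled under certain conditions. That said, all you need to do is specify the same character encoding for input and output, so once you know that, the fix is easy in practice.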

        〜      ‖      —      −      ¢      £      ¬
SJIS    U+301C  U+2016  U+2014  U+2212  U+00A2  U+00A3  U+00AC
CP943   U+301C  U+2016  U+2014  U+2212  U+FFE0  U+FFE1  U+FFE2
MS932   U+FF5E  U+2225  U+2015  U+FF0D  U+FFE0  U+FFE1  U+FFE2
The table above shows where in Unicode each character is stored when it is read with each character encoding.
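A table like this can be regenerated by decoding the two-byte codes of the seven characters with each charset. This is a rough sketch, not the author's original test program. The byte values are the standard Shift_JIS codes for these characters, and "x-IBM943" is assumed to be the JDK name for CP943 (it is skipped if the runtime lacks it).

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class MappingTable {
    // Two-byte codes in the order: wave dash, double vertical line, dash,
    // minus sign, cent sign, pound sign, not sign.
    static final int[] SJIS_CODES = {0x8160, 0x8161, 0x815C, 0x817C, 0x8191, 0x8192, 0x81CA};

    /** Decodes each code with the given charset and formats the resulting code points. */
    static List<String> codePoints(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        List<String> result = new ArrayList<>();
        for (int code : SJIS_CODES) {
            byte[] bytes = {(byte) (code >> 8), (byte) code};
            String decoded = new String(bytes, cs);
            result.add(String.format("U+%04X", decoded.codePointAt(0)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Shift_JIS", "x-IBM943", "windows-31j"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + " " + String.join(" ", codePoints(name)));
            }
        }
    }
}
```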


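              〜  ‖  —  −  ¢  £  ¬
SJIS→CP943    〜  ‖  —  −  ×  ×  ×
CP943→SJIS    〜  ‖  —  −  ×  ×  ×
SJIS→MS932    ×  ×  ×  ×  ¢  £  ¬
MS932→SJIS    ×  ×  ×  ×  ×  ×  ×
CP943→MS932   ×  ×  ×  ×  ¢  £  ¬
MS932→CP943   ×  ×  ×  ×  ¢  £  ¬
These are the results of actually running the conversions in Java. "×" marks a character that was garbled. Basically, you can assume that a character gets garbled whenever the two encodings map it to different Unicode positions. As an exception, MS932 appears to convert U+00A2, U+00A3, and U+00AC back correctly as well.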
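One way to reproduce a row of this table is to decode each character with the input charset, encode it with the output charset, and compare the bytes. This is only a sketch under the same assumptions about JDK charset names ("Shift_JIS", "windows-31j", and "x-IBM943" for CP943). It prints "o" for a character that survives and "x" for one that does not, and results may vary with the Java version.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class RoundTripCheck {
    // Two-byte codes in the order: wave dash, double vertical line, dash,
    // minus sign, cent sign, pound sign, not sign.
    static final int[] SJIS_CODES = {0x8160, 0x8161, 0x815C, 0x817C, 0x8191, 0x8192, 0x81CA};

    /** Returns one mark per character: 'o' if the bytes survive, 'x' if they change. */
    static String row(String inputCharset, String outputCharset) {
        Charset in = Charset.forName(inputCharset);
        Charset out = Charset.forName(outputCharset);
        StringBuilder marks = new StringBuilder();
        for (int code : SJIS_CODES) {
            byte[] original = {(byte) (code >> 8), (byte) code};
            byte[] converted = new String(original, in).getBytes(out);
            marks.append(Arrays.equals(original, converted) ? 'o' : 'x');
        }
        return marks.toString();
    }

    public static void main(String[] args) {
        String[] names = {"Shift_JIS", "x-IBM943", "windows-31j"};
        for (String from : names) {
            for (String to : names) {
                if (Charset.isSupported(from) && Charset.isSupported(to)) {
                    System.out.println(from + " -> " + to + ": " + row(from, to));
                }
            }
        }
    }
}
```

Rows where the input and output charset are the same should come out as all "o", which is the fix the text recommends.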

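Differences between MS932, Cp943C, and SJIS in Java

I have summarized the points where the MS932, Cp943C, and SJIS conversions in Java* differ, along with the points that require care.

* Version examined: Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_02-b06)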

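■ Overview

Both MS932 and Cp943C can handle the Windows-31J character set.

The main differences are that some characters are converted to different Unicode code points, and that, for characters defined in both the NEC special characters and the IBM extended characters, the two converters choose different code points when converting from Unicode to MS932/Cp943C.

For some JIS X 0208 characters, MS932/Cp943C map to Unicode differently from SJIS, so some characters cannot be converted to EUC_JP or ISO2022JP. This requires care.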

■ Unicode code points that differ between MS932 and Cp943C

When converting to Unicode, MS932 and Cp943C differ as shown in Table 1.

Table 1: Unicode code points that differ between MS932 and Cp943C
Table 1

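Table 2 below lists the characters whose Unicode code point is the same but which are converted to different code points when converting to MS932/Cp943C.

Table 2: Code points whose Unicode → MS932/Cp943C conversion target differs
Table 2
* MS932 converts these to the NEC special characters, while Cp943C converts them to the IBM extended characters.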

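■ Code points converted to characters added in JIS X 0208-1983

Characters that were defined among the NEC special characters, the NEC-selected IBM extended characters, or the IBM extended characters, and that were later added to row 2 of JIS X 0208-1983, are converted to the code points in row 2 of JIS X 0208.
Table 3 lists them.

Table 3: Code points converted to JIS characters
Table 3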

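■ Differences between MS932/Cp943C and SJIS

In their mapping to Unicode, the MS932 and Cp943C converters differ from the SJIS converter at the code points shown in red in Table 4.

The EUC_JP and ISO2022JP converters use the same Unicode code points as the SJIS converter, so care is required when converting to and from MS932/Cp943C.

Table 4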

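■ Characters with duplicate encodings

In both MS932 and Cp943C, characters with duplicate encodings are converted to Unicode many-to-one.
Except for the characters in Table 2, the conversion from Unicode to MS932/Cp943C is identical for MS932 and Cp943C.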
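The many-to-one behaviour can be observed by decoding several codes for the same character and then encoding the result back. The sketch below uses the not sign as an assumed example: 0x81CA (JIS X 0208 row 2), 0xEEF9 (NEC-selected IBM extended), and 0xFA54 (IBM extended) are taken from the commonly published Windows-31J code table, not from the tables referenced in this article. "windows-31j" is assumed to be the JDK name for MS932.

```java
import java.nio.charset.Charset;

class DuplicateEncodingDemo {
    static final Charset MS932 = Charset.forName("windows-31j");

    /** Decodes a two-byte code and returns the resulting Unicode code point. */
    static int decode(int code) {
        byte[] bytes = {(byte) (code >> 8), (byte) code};
        return new String(bytes, MS932).codePointAt(0);
    }

    /** Encodes a code point and returns the resulting bytes as a hex string. */
    static String encode(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(MS932);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Assumed example: three Windows-31J codes for the not sign.
        for (int code : new int[] {0x81CA, 0xEEF9, 0xFA54}) {
            int codePoint = decode(code);
            System.out.printf("%04X -> U+%04X -> %s%n", code, codePoint, encode(codePoint));
        }
    }
}
```

All three codes should decode to the same code point, and encoding that code point back can only pick one of them, so the round trip is lossy for the other two.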

For the MS932 conversion, see the following page.

Duplicate-encoded characters in Windows-31J and Unicode (content on this site)
http://www2d.biglobe.ne.jp/~msyk/charcode/cp932/uni2sjis.html

■ User-defined characters

In MS932/Cp943C, rows 95 to 114 (F040 to F9FC) are the user-defined character area.

Both converters map this area to U+E000 through U+E757 in the Unicode Private Use Area.
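The two ranges are the same size: ten lead bytes (F0 to F9) times 188 valid trail bytes gives 1880 codes, and U+E000 through U+E757 is also 1880 code points. The sketch below assumes the mapping is linear in code order, which the text above does not state explicitly (it only gives the endpoints). The class and method names are illustrative.

```java
class UserDefinedArea {
    /** Maps a two-byte code in F040..F9FC to a Private Use Area code point, assuming a linear mapping. */
    static int toPrivateUse(int lead, int trail) {
        if (lead < 0xF0 || lead > 0xF9) {
            throw new IllegalArgumentException("lead byte outside F0..F9");
        }
        if (trail < 0x40 || trail > 0xFC || trail == 0x7F) {
            throw new IllegalArgumentException("invalid trail byte");
        }
        // Trail bytes run 40..7E and 80..FC; 7F is skipped, leaving 188 cells per lead byte.
        int cell = trail - 0x40 - (trail > 0x7F ? 1 : 0);
        return 0xE000 + (lead - 0xF0) * 188 + cell;
    }

    public static void main(String[] args) {
        System.out.printf("F040 -> U+%04X%n", toPrivateUse(0xF0, 0x40));
        System.out.printf("F9FC -> U+%04X%n", toPrivateUse(0xF9, 0xFC));
    }
}
```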

■ References

WebSphere V3.5 announcement workshop materials
> Part 3, "Programming Model (JSP, Servlet)" > Character encodings
http://www-6.ibm.com/jp/software/websphere/developer/wsv35wslib/pdf/was35_psj5_1.pdf

Windows-31J information (content on this site)
http://www2d.biglobe.ne.jp/~msyk/charcode/cp932/index.html

Character encoding bulletin board, past log
http://www2d.biglobe.ne.jp/~msyk/cgi-bin/charcode/bbs.cgi?c=gr&n=55

* If you find any mistakes or anything unclear, please let me know. Thank you.

Sunday, September 9, 2012

Unlocking T-Mobile 4G Hotspot (ZTE MF61): A case study

 

So, I have one of these MiFi clones from T-Mobile and want to unlock it to use on AT&T (I know that AT&T 4G/3G isn't supported, but I thought maybe I could fix that later). The first thing I tried was contacting T-Mobile, as they are usually very liberal with unlock codes. However, this time T-Mobile (or, as they claim, the manufacturer) wasn't so generous, so I decided to take it upon myself to do it. I will write down the entire procedure here as a case study on how to "reverse engineer" a new device. In no way do I consider myself an expert, though, so feel free to bash me in the comments on what I did wrong. Also, I have decided against releasing any binaries or patches because phone unlocking is a grey area (although it is legal here), but if you read along you should be able to repeat what I did, even though I will also try to generalize.

Getting information

The hardest part of any hack is the figuring-out-how-to-start phase. That's always tricky. But… let the games begin.

-Wheatley, Portal 2

So before we can do anything, we need to know what to do. The best place to begin is the updater. A quick look at the extracted files shows that the files being flashed have names such as "amss.mbn" and "dsp1.mbn". A quick scan with a hex editor shows that the files are unencrypted and unsigned. That's good news because it means we can change the code. A quick Google search tells us that these are firmware files for Qualcomm basebands. Now we need to find more information on this Qualcomm chip. You may try some more Google-fu, but I took another path and took the device apart (not recommended if it's any more complicated). In this case, I found that we are dealing with a Qualcomm MDM8200A. Google that and you'll find more information, such as that there are two DSP processors for the modem and one "apps" ARM processor (presumably for T-Mobile's custom firmware, which is what you see as the web interface). We want to unlock the device, so I assume the work is done in the DSP processor. That's the first problem. QDSP6 (I found this name through more Google skills) is not a supported processor in IDA Pro, my go-to tool, so we need another way to disassemble it.

Disassembly

Some more Googling (I'm sure you can see a pattern on how this works now) turns up the answer: QDSP6 is actually called "Hexagon" by Qualcomm, and they kindly provide an ABI and a programmer's guide. I guessed from the documents that there is a toolchain, but no more information is provided about it. More searching led me to believe that the in-house toolchain is proprietary, but luckily there is an open source implementation being worked on. Having the toolchain means that we can use "objdump", the 2nd most popular disassembly tool [Citation Needed]. So, it's just a matter of sending dsp1.mbn and dsp2.mbn into objdump -d? Nope. It seems that our friends at ZTE either purposely or automatically (as part of the linker) stripped the "section headers" of the ELF file. I did a quick read of the ELF specification and found that the "section headers" are not required for the program to run, but provide information for linking and such. What we did have was the "program headers", which are sort of a stripped-down version of the section headers. (Program headers only tell you 1) where each "section" is located in the file and where to load it in memory, 2) whether it is program or data, and 3) whether it is readable or writable, while section headers give more information, such as the name of each section and more about the purpose of the program/data section.) What I then did was write my own section headers with a hex editor, using the program headers as a guide and making up the names and other information (because they are not used in the actual disassembly anyway). Then I pasted my headers into the file, changed some offsets, and objdump -d surrendered the assembly code. 180MB worth of it.
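To see what the program headers actually contain, they can be read straight out of the file. The sketch below assumes a 32-bit little-endian ELF file and uses the field offsets from the generic ELF specification. It is not the tool used in this post, and the class and field names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

class ElfProgramHeaders {
    /** The program header fields discussed in the text. */
    static final class Segment {
        final long type, fileOffset, vaddr, fileSize, memSize, flags;

        Segment(long type, long fileOffset, long vaddr, long fileSize, long memSize, long flags) {
            this.type = type;
            this.fileOffset = fileOffset;
            this.vaddr = vaddr;
            this.fileSize = fileSize;
            this.memSize = memSize;
            this.flags = flags;
        }

        boolean executable() { return (flags & 1) != 0; } // PF_X
        boolean writable()   { return (flags & 2) != 0; } // PF_W
        boolean readable()   { return (flags & 4) != 0; } // PF_R
    }

    private static long u32(ByteBuffer buf, int offset) {
        return buf.getInt(offset) & 0xFFFFFFFFL;
    }

    /** Parses all program headers of a 32-bit little-endian ELF image. */
    static List<Segment> parse(byte[] elf) {
        ByteBuffer buf = ByteBuffer.wrap(elf).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != 0x464C457F) { // bytes 7F 'E' 'L' 'F'
            throw new IllegalArgumentException("not an ELF file");
        }
        if (elf[4] != 1 || elf[5] != 1) {
            throw new IllegalArgumentException("only 32-bit little-endian ELF is handled");
        }
        int phoff = (int) u32(buf, 28);            // e_phoff
        int phentsize = buf.getShort(42) & 0xFFFF; // e_phentsize
        int phnum = buf.getShort(44) & 0xFFFF;     // e_phnum
        List<Segment> segments = new ArrayList<>();
        for (int i = 0; i < phnum; i++) {
            int base = phoff + i * phentsize;
            segments.add(new Segment(
                    u32(buf, base),        // p_type
                    u32(buf, base + 4),    // p_offset
                    u32(buf, base + 8),    // p_vaddr
                    u32(buf, base + 16),   // p_filesz
                    u32(buf, base + 20),   // p_memsz
                    u32(buf, base + 24))); // p_flags
        }
        return segments;
    }
}
```

Each parsed entry gives the file offset, load address, size, and permission flags that a hand-written section header would need to mirror.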

Assembly

So we have 180MB worth of code written in a language that could very well be Greek. Luckily, as I mentioned earlier, Qualcomm released a document detailing the QDSP assembly language and how it's used. Most likely, you would be dealing with a more "popular" processor like ARM or x86 and would have access to more resources. However, for QDSP6/Hexagon, we have two PDF documents, and that is basically the Bible we need to memorize. I then spent a couple of hours learning this new assembly language (assembly isn't that hard once you embrace it) and figured out the basics needed to reverse engineer (that is: jumps, stores/loads, and arithmetic). Now another problem arises. We have literally 3 million lines of assembly code with no function names, no symbols, and no "sections". How do we find the goal (the function that checks the NCK key and unlocks the device accordingly) without spending the next two years decoding this mess? Here, we need to make some assumptions. First, we know (through Google) that the AT modem command for inputting the NCK key is AT+ZNCK="keyhere" for ZTE devices. So, let's look for "ZNCK" in dsp1.mbn and dsp2.mbn with the hex editor. (If you are not as lucky and don't know what the AT command is, I would put money on the command containing the word NCK, so just search for that.) In dsp2.mbn, we find a couple of results. One of the results is in a group of other AT commands. Each command is next to a 4-byte hex value and a bunch of zero padding. I would guess that it is a jump table and the hex values are the memory locations of the functions to jump to. Doing a quick memory-to-file-offset conversion (from our ELF program header), we locate the offset in our disassembly dump and find that it starts with an "allocframe" instruction. That means we are at the beginning of a function, so our assumptions must be right. Now we can get to the crux of the problem, which is figuring out how the key check works.
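The two mechanical steps in this section, searching the firmware image for a command string and converting a memory address to a file offset, can be sketched as follows. The segment values in main are hypothetical placeholders, not the real values from dsp2.mbn, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class AddressTools {
    /** Returns the index of the first occurrence of an ASCII string at or after 'from', or -1. */
    static int indexOf(byte[] data, String ascii, int from) {
        byte[] needle = ascii.getBytes(StandardCharsets.US_ASCII);
        for (int i = Math.max(from, 0); i + needle.length <= data.length; i++) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Converts a memory address to a file offset using one program header's
     * p_vaddr, p_offset, and p_filesz. Returns -1 if the address is outside the segment.
     */
    static long toFileOffset(long address, long vaddr, long fileOffset, long fileSize) {
        if (address < vaddr || address >= vaddr + fileSize) {
            return -1;
        }
        return address - vaddr + fileOffset;
    }

    public static void main(String[] args) {
        byte[] sample = "AT+ZNCK".getBytes(StandardCharsets.US_ASCII);
        System.out.println("ZNCK found at index " + indexOf(sample, "ZNCK", 0));
        // Hypothetical segment: loaded at 0xf00000, stored at file offset 0x1000, 0x100000 bytes long.
        System.out.printf("file offset: 0x%X%n", toFileOffset(0xf84c54L, 0xf00000L, 0x1000L, 0x100000L));
    }
}
```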

Mapping out the functions

We now know where the function of interest starts, but we don't know where it ends. It's easy to find out, though: look for a jump to lr (on this processor, that is an instruction that jumps to r31). We start at the beginning of the function and copy all the instructions until we see a non-conditional jump. We paste them into another text file (for easier reference). Then we go to the next location in the disassembly (where it would have jumped to), copy the instructions until we see another non-conditional jump, and paste them into the second text file. Keep doing this until you see a jump to r31. We now have most of the function. Notice I kept saying "non-conditional" jumps. That's because, at first, we just need the code that ALWAYS runs, to filter out stuff we don't need. Now we should get the other branches so we have more information. To do this, follow each jump or function call in the same way as we did for the main function. I would also recommend writing labels like "branch1" and "func1" for each jump so you can easily spot two jumps to the same location and such. I would also recommend only doing this up to three "levels" deep (three function calls or three jumps) because it could get real messy real quick, and we will need more information so we can filter out unneeded code, as I will detail in the next section.

Finding data references

Right now, we are almost completely blind. All we know is what code is run. We don't know the names of functions or what they do, and it would take forever to "map" every function and every function that each function calls (and so on). So we need to obtain some information. The best approach is to see what data the code is using. For this processor (and likely many others), a "global pointer" is used to refer to some constant data. So, look for references to "gp" in the disassembly. Searching from the very beginning of the program, we find that the global pointer is set to 0x3500000, and according to the ELF headers, that is a section of the dsp2.mbn file at some file offset. In the section we care about, look for references to "gp" and use the offsets you find to locate the data they refer to. I would recommend adding some comments about them in the code so we don't forget about them. Now, the global pointer isn't everything; we can also have regular hard-coded pointers to constant areas of memory. Look for registers being set to large numbers. These are likely parameters to function calls that are too big to be just numerical data and are more likely pointers. Use the ELF header to translate the memory locations to file offsets. In this case (for this processor), some values may be split into rS.h and rS.l; these are memory locations that are too "large" to be set in the register at once. Just convert rS.h into a 16-bit integer and rS.l into a 16-bit integer (both might require zero padding in front), then combine them into one 32-bit integer where rS.h's value is in front of rS.l's value. For example, we have: r1.h = #384; r1.l = #4624. That makes r1 == 0x1801210. You should also make some comments in the code about the data that is being used.

Now, predict standard library calls. This may be the hardest step because it involves guessing, and an incorrect guess can throw off later guesses. You don't have much information to go by, but you know 1) the values of some of the data being passed into function calls, 2) that library calls will usually be near the start of the program, or at least very far away from the current function, and 3) some of the more "common" functions and the types of parameters they accept. Point 2 is harder to use if the function you are trying to map is already near the beginning of the program. The function I'm mapping is at 0xf84c54, and most function calls are close to it, so when I see a function call to 0xb02760, I know that it might be a library call. You don't need to figure out all of the library calls, just enough to get an idea of what the code is doing so you don't try to map out these functions (trying to map out strcpy, for example, will get messy real quick). For example, I see one function call taking in a data pointer from a "gp" offset, a string that contains "%s: %d", and some more data. I will assume it is calling fprintf(). I see another function being called many times throughout the code, and it always accepts two pointers (the second of which may be a constant) and a number. I will assume it is calling memcpy().
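The rS.h / rS.l combination described above is just a shift and an OR. A minimal sketch, with made-up class and method names:

```java
class RegisterHalves {
    /** Combines the 16-bit high and low halves of a register into one 32-bit value. */
    static int combine(int high, int low) {
        return ((high & 0xFFFF) << 16) | (low & 0xFFFF);
    }

    public static void main(String[] args) {
        // r1.h = #384; r1.l = #4624 from the example in the text.
        System.out.printf("r1 = 0x%X%n", combine(384, 4624));
    }
}
```

Here 384 is 0x180 and 4624 is 0x1210, so the combined value is 0x01801210, matching the example.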

Translating

This may be the most boring part. You should have enough information now to try to write higher-level code that does what the assembly code says. I would recommend doing this because it is much easier to see the logic this way. I used C and started by doing a "literal" transcription, using names like "r0-r31" as variables and using goto. Then go back and try to simplify each section. In the process, I found that the unlock key is checked through a sort of hash function. It takes the user input, passes it through a huge algorithm of and/or/add/sub operations more than 1000 lines long, and compares the result to a hard-coded value in the NV RAM (the device's storage area). Here, I chose not to go through and re-code this algorithm, for two reasons. First, it would be of little use, as the key check doesn't use a known value like the IMEI and relies on a hard-coded value in the NV RAM that you need to extract (which a regular user might have trouble doing). Second, after decoding it, we would have to run the algorithm backwards to find the key from the "known value" in the NV RAM (and it could be impossible to work backwards). So I took the easy way out and made a 4-byte patch so that the program compares the known value to itself instead of to the hash generated from the input, and flashed it to the device. Then I entered a random key, and the device was unlocked.

Now, remember at the beginning I said the code was unsigned? Because of that I could easily have reflashed the firmware with my "custom" code. However, if your device has some way of preventing modified code from running, you may have no choice but to decode the algorithm.

http://yifan.lu/

https://github.com/yifanlu