The deceitful case of the Windows command prompt


Recently, I provided some test data to a colleague who'd written a plugin for RegRipper to parse the searches from Shareaza. Because I'm mean, I made sure that the test data contained some unusual characters, specifically the upside-down question mark (¿), the upside-down exclamation mark (¡), and the Latin small ligature oe (œ).

So in the Registry, the searches looked like this:



but when running my colleague's plugin the output looked like this:



As you can see, the unusual characters in Search.01 and Search.05 are not displayed correctly.

Three pieces of information that it's useful to know:

  1. The Windows Registry type REG_SZ (which is how the searches are stored) holds either an ANSI or a Unicode string, depending on which API function is used.
  2. perl uses utf8 (or is it UTF-8?) by default (and RegRipper uses this default).
  3. The Windows command prompt uses codepage 850 (at least where I live, and with my font; maybe yours is different).

So, perl reads a UCS-2LE string from the Registry, converts it to utf8, and then sends it to the Windows command prompt, which tries displaying it as codepage 850. I'll repeat the problem: "Windows command prompt... tries displaying it as codepage 850."
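To make that chain concrete, here is a rough Python sketch of the same three steps (not the RegRipper code itself). The sample string is made up for illustration, and the codepage is assumed to be 850 as on my machine; yours may differ.

```python
# Hypothetical search term using the three test characters.
search = "¿œ¡"

# Step 1: how the string sits in the Registry (UTF-16 little-endian,
# which is identical to UCS-2LE for these characters).
registry_bytes = search.encode("utf-16-le")

# Step 2: what the script does: decode it and emit UTF-8.
utf8_bytes = registry_bytes.decode("utf-16-le").encode("utf-8")

# Step 3: what the command prompt does: interpret each byte as codepage 850.
shown = utf8_bytes.decode("cp850")

print(utf8_bytes)   # six bytes: two per character
print(shown)        # six unrelated glyphs instead of three characters
```

Nothing is lost along the way: the six bytes are still valid UTF-8, and re-encoding `shown` as cp850 then decoding it as UTF-8 gives the original string back. Only the interpretation at the display step is wrong.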

I spent a number of hours messing with codepages and encodings, changing the encoding perl was outputting with, changing the codepage command prompt was displaying with, and actually very nearly got it to work:
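One likely reason that converting to the console's codepage can only ever get part of the way: codepage 850 has slots for ¿ and ¡, but none for œ. A small Python sketch of that limitation, assuming codepage 850 and the same made-up sample string:

```python
# Hypothetical search term using the three test characters.
text = "¿œ¡"

# Convert to codepage 850, substituting '?' for anything it cannot represent.
as_cp850 = text.encode("cp850", errors="replace")

# What a cp850 console would then show.
round_trip = as_cp850.decode("cp850")

print(round_trip)  # ¿?¡
```

So however the output encoding is tweaked, a character that the target codepage lacks has to be replaced or dropped.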



However, in a moment of face-palming clarity I appreciated the significance of 'displaying'. Command prompt was receiving correctly encoded strings; it just didn't know how to display them. Any Windows GUI application would! So, redirecting the output of RegRipper to a file:



and then opening the txt file solves the problem:


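The same effect can be imitated in a few lines of Python. This is only a sketch of the idea: the file name `output.txt` and the sample string are placeholders, and reading the file back as UTF-8 stands in for what a Unicode-aware editor does when it opens the redirected output.

```python
import os
import tempfile

# Hypothetical search term using the three test characters.
text = "¿œ¡"
utf8_bytes = text.encode("utf-8")  # what the script writes to stdout

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.txt")  # placeholder file name

    # Redirection stores the raw bytes untouched; no codepage is involved.
    with open(path, "wb") as fh:
        fh.write(utf8_bytes)

    # A reader that decodes the file as UTF-8 sees the original text.
    with open(path, "r", encoding="utf-8") as fh:
        recovered = fh.read()

print(recovered == text)  # True
```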

So the moral of this story: never trust your eyes.

Tools used in this post, but not mentioned:

