I decided to gather up some of my posts on extended characters in source code, Unicode, and console I/O. I’ve posted this information before, but not all in one thread. With a little cleanup, this might make a nice FAQ entry one day.

Extended characters in your source files
First, we need to know exactly what we have in memory after compiling code with extended character literals. In the C++ standards, there are 3 types of “character sets” to consider (quoting C++03):

1) The Basic Source Character Set

Quote Originally Posted by ISO/IEC 14882:2003(E)
Character Sets 2.2.1
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:

Code:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

2) Physical Source File Characters

Quote Originally Posted by ISO/IEC 14882:2003(E)
Phases of translation 2.1.1.1
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set… Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. …

3) The Execution Character Set

Quote Originally Posted by ISO/IEC 14882:2003(E)
Phases of translation 2.1.1.5
Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (2.13.2, 2.13.4).

Character Sets 2.2.3
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character … The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.

This hasn’t really changed in C++11. The new string literals in C++11 do give us control of the in-memory encoding of strings, but there still has to be a translation from the “source file character set” to the “execution character set”, which is implementation-defined. A better way to say this is that there is a conversion from one encoding to another.
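
As a rough illustration of that control, the sketch below inspects the code units of C++11 u8, u and U literals. The helper names code_units and unit_value are made up for this example, and a C++11 (or later) compiler is assumed.

```cpp
#include <cstddef>
#include <type_traits>

// Number of code units in a string literal, not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// Numeric value of code unit i, converted through the unsigned version of
// CharT so that a signed char does not sign-extend.
template <typename CharT, std::size_t N>
constexpr unsigned long unit_value(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned long>(
        static_cast<typename std::make_unsigned<CharT>::type>(s[i]));
}
```

For example, code_units(u8"\u00e7") is 2 (the UTF-8 bytes 0xC3 0xA7), while code_units(u"\u00e7") and code_units(U"\u00e7") are both 1 with the value 0xE7. None of this depends on the source file encoding, because the character is spelled as a universal-character-name.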

How GCC and MSVC map “physical source file characters”
Next I’d like to talk about GCC’s and MSVC’s implementation-defined behavior for C++03 string literals. For MSVC, I’m referring to VS2008 or higher. The first thing to look at is how they interpret the “physical source character set”. If the source file has a Unicode encoding with BOM, then the source file encoding is known. If the compiler doesn’t know what the source file encoding is, then:
GCC assumes the file is encoded as specified by the POSIX locale, or if it can’t get this information it assumes UTF8. Outside of a POSIX emulation layer like MSYS or Cygwin, MinGW seems to always assume UTF8 (based on experimentation). This can be overridden with the command line parameter -finput-charset.
MSVC assumes the file is ACP encoded, or in other words, encoded using the codepage which GetACP() returns. This is the ANSI codepage associated with the system-locale in Windows.

Converting to the “execution character sets”
Now that we know how source characters are interpreted, we can look at their conversion to the “execution character sets”. There are 2 execution character sets: narrow and wide. Narrow strings are stored using the char type, and wide strings are stored using the wchar_t type. Here is how each compiler performs the conversion for:
Narrow literals/strings:
GCC defaults to UTF8, unless overridden via -fexec-charset.
MSVC always encodes using the ACP. So if you use a narrow literal that the ACP doesn’t support, you’ll just get a warning and the compiler will change your character into a ‘?’.
Wide literals/strings:
GCC supports both a 2 byte wchar_t (like in Windows) and 4 byte wchar_t (like on most *nix’s). For 2 byte wchar_t systems, the default is UTF16. For 4 byte wchar_t systems, the default is UTF32. Both will use the system’s native byte-order. This can be overridden with -fwide-exec-charset (and -fshort-wchar for forcing a 2 byte wchar_t).
MSVC uses UTF16LE since Windows always uses a 2 byte wchar_t and is always little endian. MSVC also supports a “#pragma setlocale”, which is useful if the source file is codepage encoded and contains extended characters within wide-string literals. For example, consider this statement: const wchar_t w = L'ç'; That character is encoded as 0x87 in some codepages, and 0xE7 in other codepages. Remember that MSVC assumes the file is ACP encoded (if there is no BOM) which may be the wrong assumption. By using #pragma setlocale(".852"), MSVC will know that the 0x87 byte in the source file is really the Unicode character U+00E7 and generates the proper wchar_t value.
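
One way to check what a given compiler actually stored for a literal is to dump its code units as hex. The following is a minimal sketch; hex_units is a hypothetical helper written for this post, and it assumes C++11 for std::make_unsigned and std::snprintf.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <type_traits>

// Formats every code unit of s as lowercase hex, separated by spaces.
// Works for std::string (char) and std::wstring (wchar_t) alike.
template <typename CharT>
std::string hex_units(const std::basic_string<CharT>& s) {
    typedef typename std::make_unsigned<CharT>::type Unsigned;
    std::string out;
    char buf[2 * sizeof(unsigned long) + 1];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through the unsigned type so bytes >= 0x80 do not sign-extend.
        unsigned long value = static_cast<unsigned long>(static_cast<Unsigned>(s[i]));
        std::snprintf(buf, sizeof buf, "%lx", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_units(std::string("\u00e7")) should give "c3 a7" when the narrow execution character set is UTF8 (the GCC default) and "e7" when it is CP1252. hex_units(std::wstring(L"\u00e7")) should give "e7" for both UTF16 and UTF32.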

So now we know what’s in memory for our narrow and wide source strings. Here’s what you should take away from this knowledge:
– As soon as you put extended character literals in your source code, you are in implementation-defined territory.
– If you must put extended character literals in your source code: 1) save the source file with a Unicode encoding, preferably with a BOM. 2) If using MSVC, extended character literals should always be wide. 3) Hope that no one ever mangles your source code by saving it incorrectly.
– For the best compatibility with editors and compilers, use “universal character names” to represent extended characters in wide literals only. For example, use L"\u00e7" instead of L"ç".

Some code!
So now that we have something meaningful in memory for our string literals, chances are you’ll want to use it with standard I/O facilities. This is where C/C++ locales become important. Consider the following [invalid] code:

Code:
#include <stdio.h>
#include <wchar.h>
int main()
{
    fputws(L"\u00e7\n", stdout);
    return 0;
}//main

Even though the recommendations above have been followed, this code still doesn’t work. This is because C/C++ programs start up with the “C” locale in effect by default, and the “C” locale does not support conversions of any characters outside the “basic character set”. This code is much more likely to succeed:

Code:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, "");
    fputws(L"\u00e7\n", stdout);
    return 0;
}//main

The call to setlocale() says “I want to use the user’s default narrow string encoding”. This encoding is based on the POSIX locale for POSIX environments. In Windows, this encoding is the ACP, which is based on the system-locale. However, the success of this code is dependent on two things: 1) The narrow encoding must support the wide character being converted. 2) The font/GUI must support the rendering of that character. In Windows, #2 is often solved by setting cmd.exe’s font to Lucida Console.
Here’s the corresponding C++ sample:

Code:
#include <iostream>
#include <locale>
using namespace std;
int main()
{
    wcout.imbue(locale(""));
    wcout << L"\u00e7" << endl;
    return 0;
}//main

Sadly, there are bugs in both MSVC and my Linux VM that prevent this from working properly (even though the prior C sample works fine). The bug will be fixed in VC11. My Linux VM is Mint 11, eglibc 2.13, libstdc++ 20110331, gcc 4.5.2-8ubuntu4 – I’m not sure if it’s a known issue or not. The workaround for both is to call C’s setlocale() instead.
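
Condition 1 from earlier (the narrow encoding must support the wide character being converted) can be probed directly with wcrtomb(). This is only a sketch; convertible_in_current_locale is a name I made up, and what it returns for a given character depends on the C library and the locale in effect.

```cpp
#include <climits>
#include <clocale>
#include <cwchar>

// Returns true if the current C locale can convert wc to a multibyte
// (narrow) sequence, false if wcrtomb() reports an encoding error.
bool convertible_in_current_locale(wchar_t wc) {
    std::mbstate_t state = std::mbstate_t();  // start in the initial shift state
    char buf[MB_LEN_MAX];                     // large enough for any single character
    return std::wcrtomb(buf, wc, &state) != static_cast<std::size_t>(-1);
}
```

Calling it with L'\u00e7' before and after setlocale(LC_ALL, "") shows whether the locale switch is what makes the conversion possible. With glibc I would expect failure in the “C” locale and success in a UTF8 locale, but other C libraries may behave differently.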

Yet another conversion for Windows console I/O
Console I/O on Windows is further complicated by the existence of a “console codepage”, which is distinct and separate from the standard-locale’s narrow encoding. The above samples under Windows will perform the following conversions:
– L"\u00e7" (as UTF16LE) is first converted to a multi-byte (char) string using the locale’s narrow encoding, the ACP. The ACP is 1252 for me, so the result is "\xe7".
– "\xe7" (as CP1252) is then converted to the Windows console codepage. For me that’s CP437 by default, so the result of the conversion is "\x87".
At this point, the "\x87" either goes through WriteFile() or WriteConsoleA(). The OS will recognize that the handle is the stdout handle and will use the console codepage to interpret the bytes. Then cmd.exe just needs to be using a font that supports that character.
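
The two conversion steps can be mimicked with a toy lookup table. This is purely an illustration: the real conversions are done by the CRT and the OS using full codepage tables, and the names Row, kRows, wide_to_cp1252 and cp1252_to_cp437 are invented here. Only three characters are listed, and CP1252 and CP437 are assumed as the ACP and console codepage.

```cpp
#include <cstddef>

// One row per character: Unicode code point and its byte in each codepage.
struct Row {
    unsigned code_point;
    unsigned char cp1252;
    unsigned char cp437;
};

const Row kRows[] = {
    {0x00E7, 0xE7, 0x87},  // c with cedilla
    {0x00E9, 0xE9, 0x82},  // e with acute
    {0x00FC, 0xFC, 0x81},  // u with diaeresis
};
const std::size_t kRowCount = sizeof kRows / sizeof kRows[0];

// Step 1: wide character -> ACP byte (CP1252 in this sketch).
// ASCII passes through; anything not in the toy table becomes '?'.
unsigned char wide_to_cp1252(unsigned code_point) {
    if (code_point < 0x80) return static_cast<unsigned char>(code_point);
    for (std::size_t i = 0; i < kRowCount; ++i)
        if (kRows[i].code_point == code_point) return kRows[i].cp1252;
    return '?';
}

// Step 2: ACP byte -> console codepage byte (CP437 in this sketch).
unsigned char cp1252_to_cp437(unsigned char byte) {
    if (byte < 0x80) return byte;
    for (std::size_t i = 0; i < kRowCount; ++i)
        if (kRows[i].cp1252 == byte) return kRows[i].cp437;
    return '?';
}
```

Here cp1252_to_cp437(wide_to_cp1252(0x00E7)) yields 0x87, matching the chain described above.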

This extra conversion under Windows can be avoided by setting the console codepage to be equal to the ACP:

Code:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <windows.h>
int main()
{
    SetConsoleOutputCP(GetACP());
    SetConsoleCP(GetACP());
    setlocale(LC_ALL, "");
    fputws(L"\u00e7\n", stdout);
    return 0;
}//main

You can also change the console codepage directly on the command line via the “chcp” command.

Direct Unicode I/O on the console
Ideally there wouldn’t be any conversions involving a non-Unicode encoding. On *nix this can be done if the locale is UTF8, making the compiler’s narrow “execution character set” UTF8 – which is the common case. Any conversions would then be wide to narrow, or UTF32/16 to UTF8. The nice thing here is that the conversion is lossless.
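
To make the wide-to-UTF8 step concrete, here is a hand-written sketch of encoding a single code point as UTF8. In a real program the locale’s conversion does this for you; encode_utf8 is a hypothetical helper and not part of any standard library.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8. Returns an empty string for
// values that are not valid scalar values (surrogates, or above U+10FFFF).
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // lone surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00E7 this produces the two bytes 0xC3 0xA7. Every valid code point has exactly one UTF8 encoding, which is why the conversion loses nothing.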

On Windows, the only way to achieve direct Unicode output is via WriteConsoleW(). The MS CRT (2008 and newer) provides a way to use C/C++ I/O facilities for direct Unicode output:

Code:
#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <cwchar>
#include <iostream>
using namespace std;
int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    fputws(L"\u00e7\n", stdout);
    wcout << L"\u00e7" << endl;
    return 0;
}//main

This will send the UTF16LE string directly to WriteConsoleW(), unless output is redirected to a file, in which case the UTF16LE string is written as a stream of bytes via WriteFile().

Questions, comments, corrections, omissions welcome.
