Wednesday, August 29, 2007

The return value of g_get_charset()

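I wanted to try out some of the C functions in GLib, so I wrote the following code:
#include <glib.h>
#include <glib/gprintf.h>

int main(void)
{
    gchar *utf8_str = g_locale_to_utf8("日", -1, NULL, NULL, NULL);
    if (utf8_str != NULL)
        g_printf("the result is %s\n", utf8_str);
    else
        g_printf("conversion failed\n");
    g_free(utf8_str);
    return 0;
}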

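The source file is encoded in GB18030 and the system locale is zh_CN.GB18030, yet the program prints "conversion failed", which means g_locale_to_utf8() could not convert the string. Retrieving the error through a GError gives "Invalid byte sequence".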

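However, after replacing utf8_str = g_locale_to_utf8("日", -1, NULL, NULL, NULL); with utf8_str = g_convert("日", -1, "UTF-8", "GB2312", NULL, NULL, NULL); the conversion succeeds and the output is correct. I could not make sense of this. Later, when I tested g_get_charset(), I ran into the same odd behavior: it reported the system character set as ANSI_X3.4-1968, although my system encoding is clearly GB18030. Eventually I found a post describing the same problem on a gnome.org newsgroup. It reads as follows: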

Folks,
So, I've been trying to use a locale other than POSIX C in working with GTK+/GLib 2. Unfortunately I've discovered that gutf8 doesn't believe in any user charset other than ANSI_X3.4-1968. The simple test program:

#include <glib.h>

gint main(gint argc, gchar *argv[])
{
    gchar *res;
    const gchar *charset;
    GError *error = NULL;

    g_get_charset(&charset);
    g_print("Charset is %s\n", charset);
    if (argc > 1)
    {
        res = g_locale_to_utf8(argv[1], -1, NULL, NULL, &error);
        if (res)
            g_print("%s", res);
        else if (error)
            g_print("%s\n", error->message);
    }
    return(0);
}

*always* prints "Charset is ANSI_X3.4-1968". If there are non-ASCII characters in the input, it obviously errors. Looking through the g_get_charset() code (g_utf8_get_charset_internal() to _g_locale_get_charset()), it appears that _g_locale_get_charset() expects to read $(libdir)/charset.alias. However, this file is never installed, and in fact config.charset creates an almost empty file. The config.charset text seems to imply that Linux does not need the charset.alias file, because glibc gets it right, but since _g_locale_get_charset() never asks glibc, the answer is never found. Instead, _g_locale_get_charset() fails to open charset.alias and returns the default fallback, which is ASCII. This is a big problem for many reasons.

g_locale_{to,from}_utf8() rely on g_get_charset() to return a valid answer. This is pretty essential to convenient character set work. In my case, I had set my locale to en_US.UTF-8 in an attempt to run an all-UTF-8 environment. Imagine my surprise when GTK+/GLib 2 refused to let me. I note that the first place g_utf8_get_charset_internal() looks is the environment variable "CHARSET". I've never seen this variable set in any environment I've worked with, but if this is supposed to be the "correct" solution, we should have the gdm maintainer set this value along with LANG upon login.

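Finally, I found a fairly detailed explanation of g_get_charset() and related functions on someone's blog, and it all made sense. The character set that g_get_charset() returns is whatever is in effect after setlocale (LC_ALL, "") has been called. A C program starts in the "C" locale, and my program never called setlocale(), so g_get_charset() could not see the system encoding and returned its default, ANSI_X3.4-1968. Both g_locale_to_utf8() and g_locale_from_utf8() act on the result of g_get_charset(), which explains the problem above. Adding these two lines to the program above
#include <locale.h>
setlocale (LC_ALL, "");
is enough to get the correct output.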

A more detailed explanation of g_get_charset() and related functions follows:

Encoding Internal in Glib

1. Introduction

GLib uses UTF-8 as its internal encoding, so all GTK+/GNOME applications use UTF-8 to represent text, and all text you get from widgets is UTF-8. To work with a legacy encoding, you need to do some conversion.

Before converting between two encodings, you first need to know which encoding the text is in. The encoding can of course be guessed, but unfortunately there is no perfect way to determine the encoding of a piece of text programmatically. Many applications therefore present a list of encodings and let the user decide.

Filename handling is especially hard, because there is no indication whatsoever what character encoding a filename is in (it might have been created when the user was using a different locale, so filename encoding is basically unreliable and broken).

GLib cannot detect the filename encoding either, so it lets the user configure it through the environment variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES. By default, GLib assumes that filenames on disk are in UTF-8; through these environment variables, the user can instruct GLib to use a particular encoding for filenames rather than UTF-8.

2. Get Encoding Using Glib Function

2.1 g_get_charset

g_get_charset() obtains the character set of the current locale from the C runtime. If your application calls setlocale (LC_ALL, ""), g_get_charset() returns the encoding of the user's locale; if it calls setlocale (LC_ALL, "zh_CN.GB18030"), the C runtime encoding becomes GB18030 and g_get_charset() returns GB18030.

2.2 g_get_filename_charsets

g_get_filename_charsets() determines the preferred character sets ("encodings" may be the more accurate term) used for filenames. GLib treats the first character set in the list as the filename encoding; the subsequent ones are used when trying to generate a displayable representation of a filename, see g_filename_display_name().

On Unix, the character sets are determined by consulting the environment variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES. On Windows, the character set used in the GLib API is always UTF-8 and said environment variables have no effect.

G_FILENAME_ENCODING may be set to a comma-separated list of character set names. The special token "@locale" is taken to mean the character set for the current locale. If G_FILENAME_ENCODING is not set, but G_BROKEN_FILENAMES is, the character set of the current locale is taken as the filename encoding. If neither environment variable is set, UTF-8 is taken as the filename encoding, but the character set of the current locale is also put in the list of encodings.

* G_FILENAME_ENCODING set: filename encoding is the first encoding in list.
* G_BROKEN_FILENAMES set: filename encoding is locale encoding.
* Both unset:
- filename encoding is utf8,
- and put the current locale charset into encoding list.
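The precedence rules above can be modelled in a few lines of plain C. This is only a simplified model of the rules as stated here, not GLib's actual implementation; the function pick_filename_encoding and its parameters are invented for illustration, and an empty G_FILENAME_ENCODING is assumed to count as unset.

```c
#include <stdio.h>
#include <string.h>

/* Simplified model (not GLib code) of how the filename encoding is chosen.
 * filename_encoding: value of G_FILENAME_ENCODING, or NULL if unset.
 * broken_filenames_set: non-zero if G_BROKEN_FILENAMES is set.
 * locale_charset: charset of the current locale, e.g. "GB18030".
 * The chosen encoding is written to out (at most out_size bytes). */
static void
pick_filename_encoding(const char *filename_encoding,
                       int broken_filenames_set,
                       const char *locale_charset,
                       char *out, size_t out_size)
{
    if (filename_encoding != NULL && filename_encoding[0] != '\0') {
        /* Take the first item of the comma-separated list. */
        size_t len = strcspn(filename_encoding, ",");

        if (len == strlen("@locale") &&
            strncmp(filename_encoding, "@locale", len) == 0)
            snprintf(out, out_size, "%s", locale_charset);
        else
            snprintf(out, out_size, "%.*s", (int) len, filename_encoding);
    } else if (broken_filenames_set) {
        snprintf(out, out_size, "%s", locale_charset);
    } else {
        snprintf(out, out_size, "UTF-8");
    }
}
```

For example, with G_FILENAME_ENCODING="GBK,@locale" the model picks GBK regardless of G_BROKEN_FILENAMES, and with both variables unset it picks UTF-8.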

Notes:

* GLib caches the filename character sets internally, so changing G_FILENAME_ENCODING or G_BROKEN_FILENAMES while the program is running has no effect. Do not use putenv() to change their values.
* On Unix, regardless of the locale character set or G_FILENAME_ENCODING value, the actual file names present on a system might be in any random encoding or just gibberish.

3. Conversion between C Runtime encoding and UTF-8

3.1 g_locale_to_utf8

The string parameter of g_locale_to_utf8() is a string in the encoding of the application's current locale (the C runtime locale). On Windows this means the system codepage.

The function converts a string in the encoding used for strings by the C runtime in the current locale into a UTF-8 string. That encoding is usually the same as the one used by the operating system, because most applications set their locale with setlocale (LC_ALL, "").

E.g.: if the system locale uses GB18030 but your application calls setlocale (LC_ALL, "zh_TW.BIG5"), then the C runtime encoding is BIG5 while the OS encoding is GB18030.

If the current C runtime encoding is already UTF-8, the string is simply duplicated.

3.2 g_locale_from_utf8

Converts a string from UTF-8 to the encoding used for strings by the C runtime (usually the same as that used by the operating system) in the current locale.

4. Conversion between Glib filename encoding and UTF-8

4.1 g_filename_to_utf8

Converts a string which is in the encoding used by GLib for filenames into a UTF-8 string. The filename encoding is the first encoding in the list returned by g_get_filename_charsets().

4.2 g_filename_from_utf8

Converts a string from UTF-8 to the encoding used for filenames. The filename encoding is the first encoding in the list returned by g_get_filename_charsets().

4.3 g_filename_from_uri

Converts an escaped ASCII-encoded URI to a local filename in the encoding used for filenames.

4.4 g_filename_to_uri

Converts an absolute filename to an escaped ASCII-encoded URI.

5. Display Name

5.1 g_filename_display_name

Converts a filename into a valid UTF-8 string. The conversion is not necessarily reversible, so you should keep the original around and use the return value of this function only for display purposes. Unlike g_filename_to_utf8(), the result is guaranteed to be non-NULL even if the filename actually isn't in the GLib file name encoding (a name for display purposes is always returned).

If you know the whole pathname of the file you should use g_filename_display_basename(), since that allows location-based translation of filenames.

Parameters:

1. filename: a pathname hopefully in the GLib file name encoding

2. Returns : a newly allocated string containing a rendition of the filename in valid UTF-8

5.2 g_filename_display_basename

Returns the display basename for the particular filename, guaranteed to be valid UTF-8. The display name might not be identical to the filename, for instance there might be problems converting it to UTF-8, and some files can be translated in the display.

You must pass the whole absolute pathname to this function so that translation of well known locations can be done.

This function is preferred over g_filename_display_name() if you know the whole path, as it allows translation.

Parameters:

* filename: an absolute pathname in the GLib file name encoding(hopefully)
* Returns : a newly allocated string containing a rendition of the basename of the filename in valid UTF-8

Notes:

* If the GLib filename encoding (as returned by g_get_filename_charsets()) is UTF-8 but the filename is not valid UTF-8 according to g_utf8_validate(), the subsequent encodings in GLib's list are tried in turn.
* If the filename is in none of the expected encodings, make_valid_utf8(filename) produces a valid UTF-8 display name of the form "valid_utf8_string(invalid encoding)".

6. Lower Level Function

GLib uses the native iconv routines for encoding conversion, or libiconv if there is no native iconv implementation.

g_iconv_open (to_codeset, from_codeset) also tries codeset aliases, which makes conversion more robust. (More on g_charset_get_aliases() later.)

7. Example

A simple example is available at http://blogs.sun.com/roller/resources/yydzero/main.c
