[Scummvm-devel] Translations of the About text in ScummVM

Sun May 8 14:45:19 CEST 2011

Am 06.05.2011 um 16:03 schrieb Max Horn:
> 
[...]

> 
> One thing that needs to be resolved in there is the use of non-ASCII chars in msgids (e.g. for "Touché"). It would be no problem to generate an ASCII-only msgid, but we really want the non-ASCII version in the source, too. As far as I could tell, we have no macro in common/translation.h that can be used to specify a distinct msgid and msgstr. Or is there?

After having thought about this a little more, I realized that my last comment above is nonsense, at least in our current setup: The msgid is used both to look up strings and as the default message string when translation is disabled or when no translation.dat file can be found.

So specifying both a msgid *and* a msgstr would only make sense if it was possible to insert both in the code, using the msgstr as fallback if looking up the msgid failed. Think of a "TransMan.getTranslation" method which takes an additional parameter as fallback, kind of like
   TransMan.getTranslation(str [, context], defaultStrTranslation);
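
To make the idea concrete, here is a minimal sketch of such a lookup. The class name SketchTranslationManager, its std::map storage and the method signature are assumptions made up for illustration; this is not what common/translation.h currently provides.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the translation manager; not the real TransMan.
class SketchTranslationManager {
public:
	// Register a translation for an (ASCII-only) msgid.
	void addTranslation(const std::string &msgid, const std::string &msgstr) {
		_messages[msgid] = msgstr;
	}

	// Look up msgid. If no translation is loaded for it, return the
	// caller-supplied fallback instead of the msgid itself.
	std::string getTranslation(const std::string &msgid,
	                           const std::string &defaultStr) const {
		std::map<std::string, std::string>::const_iterator it = _messages.find(msgid);
		return it != _messages.end() ? it->second : defaultStr;
	}

private:
	std::map<std::string, std::string> _messages;
};
```

With no translation loaded, getTranslation("Touche", "Touch\xE9") would return the Latin-1 fallback; once a translation for "Touche" is registered, that one wins.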

While something like that could be added in, it seems somewhat hackish... In addition, the underlying problem runs deeper than I thought at first, too: For the about dialog, we hard-code (in devtools/credits.pl) the assumption that the GUI uses a Latin-1 encoding for its font. In translations, this might not be the case at all.  But in the credits, we make use of non-ASCII characters for several names, headlines etc. 

So, we need to handle that. In fact, I think we need to handle that even if we choose not to translate the about dialog, unless we want to use a separate font for the about dialog with latin1 encoding. Indeed, if I look at the about dialog in the Ukrainian version, it has "Torbjirn Andersson" as one of our contributors, and we have a "Touchщ" engine ;).
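
The garbled names are what one gets when Latin-1 bytes are displayed through a Cyrillic code page. A small sketch of the effect, assuming the font in question uses ISO-8859-5 (the helper names are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Latin-1 (ISO-8859-1) maps every byte to the code point of the same value.
uint32_t latin1ToUnicode(unsigned char byte) {
	return byte;
}

// ISO-8859-5 maps most bytes in 0xA1..0xFF into the Cyrillic block at a
// fixed offset of 0x360. The exceptions are 0xAD (soft hyphen),
// 0xF0 (numero sign) and 0xFD (section sign).
uint32_t iso8859_5ToUnicode(unsigned char byte) {
	if (byte < 0xA1 || byte == 0xAD)
		return byte;
	if (byte == 0xF0)
		return 0x2116;
	if (byte == 0xFD)
		return 0x00A7;
	return byte + 0x360;
}
```

Byte 0xE9 is é (U+00E9) in Latin-1 but щ (U+0449) in ISO-8859-5, and 0xF6 (ö) turns into the Cyrillic і (U+0456), which matches what shows up in the dialog.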

Several possibilities come to mind to address this. Feedback appreciated!

1) Put only ASCII into credits.h; that is, in credits.pl, replace the html_entities_to_cpp function by html_entities_to_ascii. Then, leave it to translators to re-add accents, umlauts etc. as appropriate (and as possible in their encoding).
  So we would have msgid "Touche" in the binary, and each translation maps this to either "Touché" or "Touche".

2) Keep using latin1 encoding for the msgids (as it is in my branch right now).
   So we would have msgid "Touché" in the binary.

3) Use HTML encoding for the msgids. So we would have "Touch&eacute;" in the binary.
   Couple this with a lookup table that converts this into latin1 (or even the active encoding).

4) Use a "dual-string" approach as described above:
   Put msgid "Touche" into the binary, but also in parallel "Touché", for use as fallback.

Approach 1 looks like the easiest by far to implement in all regards, but has the drawback that without translation.dat, or with translation support disabled, we lose all diacritics. That's not nice, but then again not very problematic either. Another drawback: Our translators all must know and remember to translate "Touche" to "Touché" (if their encoding allows it, that is). This is problematic because there are quite a few names that involve diacritics; so to make this workable, it would be preferable to have a way to automatically populate all latin1-based translations with the correct "translations" for all names.
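
Such automatic population could happen at generation time. A rough sketch, assuming a tiny illustrative entity table (the real list in devtools/credits.pl is larger) and two hypothetical helpers, expandEntities and makeLatin1PoEntry, that derive the ASCII msgid and the Latin-1 msgstr from one entity-encoded name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Illustrative subset only: entity, ASCII replacement, Latin-1 byte.
struct EntityMapping {
	const char *entity;
	const char *ascii;
	const char *latin1;
};

static const EntityMapping kEntities[] = {
	{ "&eacute;", "e", "\xE9" },
	{ "&ouml;",   "o", "\xF6" },
	{ "&aring;",  "a", "\xE5" }
};

// Replace every known entity in text by its ASCII or Latin-1 form.
std::string expandEntities(const std::string &text, bool useLatin1) {
	std::string out;
	size_t i = 0;
	while (i < text.size()) {
		bool matched = false;
		for (size_t k = 0; k < sizeof(kEntities) / sizeof(kEntities[0]); ++k) {
			const size_t len = std::strlen(kEntities[k].entity);
			if (text.compare(i, len, kEntities[k].entity) == 0) {
				out += useLatin1 ? kEntities[k].latin1 : kEntities[k].ascii;
				i += len;
				matched = true;
				break;
			}
		}
		if (!matched)
			out += text[i++];
	}
	return out;
}

// Build a PO entry with an ASCII msgid and a Latin-1 msgstr.
// Escaping of quotes and backslashes is deliberately left out here.
std::string makeLatin1PoEntry(const std::string &entityText) {
	return "msgid \"" + expandEntities(entityText, false) + "\"\n" +
	       "msgstr \"" + expandEntities(entityText, true) + "\"\n";
}
```

Run over all names in the credits, this would give every latin1-based .po file the accented forms without the translators having to remember them.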

Approach 2 is still quite easy to implement, and avoids the drawback of 1; but it has the drawback that handling the .po files might be problematic: How would the latin1 msgid "Touché" be represented in the ISO-8859-5 encoded po/ru_RU.po? As "Touchщ"? This strikes me as potentially confusing to our translators. And I am not even sure whether it really would work this way.

But if it does, this seems like an acceptable compromise.

Approach 3 tries to avoid the issues in approaches 1 and 2. The idea is this: If translations are enabled and working, just look up the msgid as normal. Otherwise, we can just map the HTML entities to latin1 data on the fly (using the same method we employ in credits.pl). This increases the code size only minimally. The msgids are still not completely nice for translators, but hopefully this is a minor issue.
Implementing this right is a little bit more work than 1 + 2, but still easy.
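
A sketch of how the runtime side of this could look, under a few assumptions: the translations are modelled as a plain std::map (empty when translation.dat is missing), the entity table is a three-entry illustrative subset, and the names decodeEntitiesToLatin1 and lookupOrDecode are made up for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative subset of the entity table; values are Latin-1 bytes.
static std::map<std::string, char> makeEntityTable() {
	std::map<std::string, char> table;
	table["eacute"] = '\xE9';
	table["ouml"] = '\xF6';
	table["aring"] = '\xE5';
	return table;
}

// Decode "&name;" sequences to Latin-1; unknown entities are left untouched.
std::string decodeEntitiesToLatin1(const std::string &msgid) {
	static const std::map<std::string, char> table = makeEntityTable();
	std::string out;
	std::string::size_type pos = 0;
	while (pos < msgid.size()) {
		const std::string::size_type amp = msgid.find('&', pos);
		if (amp == std::string::npos) {
			out.append(msgid, pos, std::string::npos);
			break;
		}
		out.append(msgid, pos, amp - pos);
		const std::string::size_type semi = msgid.find(';', amp);
		std::map<std::string, char>::const_iterator it = table.end();
		if (semi != std::string::npos)
			it = table.find(msgid.substr(amp + 1, semi - amp - 1));
		if (it != table.end()) {
			out += it->second;
			pos = semi + 1;
		} else {
			out += '&';
			pos = amp + 1;
		}
	}
	return out;
}

// Use the translation if one is loaded, else decode the msgid on the fly.
std::string lookupOrDecode(const std::map<std::string, std::string> &translations,
                           const std::string &msgid) {
	std::map<std::string, std::string>::const_iterator it = translations.find(msgid);
	if (it != translations.end())
		return it->second;
	return decodeEntitiesToLatin1(msgid);
}
```

With an empty map, the msgid "Touch&eacute;" comes back as Latin-1 "Touché"; with a translation loaded, the translator's string is returned unchanged, in whatever encoding that translation uses.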

Approach 4 would store two variants of strings that contain diacritics. That means some more data is put into the binary, but not much. With this approach, there are no problems if translations are off / unavailable. However, it retains a drawback of approach 1: It is not clearly visible for translators that they should translate "Touche" to "Touché"; this is different in Approach 2 + 3.

All in all, right now I tend towards approach 3, but maybe somebody out there has a better idea or spots a problem with this that I missed.

Bye,
Max