[Scummvm-devel] Re : Improvements to blitting code.

Bertrand Augereau bertrand_augereau at yahoo.fr
Sat Aug 16 13:59:57 CEST 2008

Hi Carlo, hi everybody,

I started a very long mail with lots of comments about all this but it crashed, so I'll make it much shorter and much less informative this time :)

On x86, the MMX version will be a sure win (branchless, low latency, 8 pixels at a time)! Your emulated CMOV, on the other hand, really depends on the branch misprediction rate of the original code. This should be measured with an appropriate tool such as VTune or CodeAnalyst. Anyway, MMX is supported on every PC on earth, so I say go for it! (And don't forget the EMMS ;) )

On the ARM, don't forget that the pipeline is nowhere near as deep as on an x86, and the conditional instructions are therefore less of a problem. So Robin's implementation (asmDrawStripToScreen), which has a special case for 4 transparent pixels, might still be a win, but I'm sure he will love to experiment with your technique :)

Cheers,
Bertrand

--- On Sat 16.8.08, carlo.bramix <carlo.bramix at libero.it> wrote:
From: carlo.bramix <carlo.bramix at libero.it>
Subject: [Scummvm-devel] Improvements to blitting code.
To: "scummvm-devel" <scummvm-devel at lists.sourceforge.net>
Date: Saturday 16 August 2008, 12:02

I read in engines/scumm/gfx.cpp that there is a piece of code with the
comment "TODO: Optimize this code".
That code seems to blit a texture/sprite with transparency.
If an ARM CPU is detected at compile time, it calls the
asmDrawStripToScreen() function instead of using the C code.
I believe that piece of code can be optimized greatly with a very simple
method.
With this new method, you can process 4 pixels at once.

The current code does the following for each pixel:

byte tmp = *text++;
if (tmp == CHARSET_MASK_TRANSPARENCY)
    tmp = *src;
*dst++ = tmp;
src++;

With my method, you have pointers to 32-bit units ('text32' is the same as
'text' but for 32-bit units; the same goes for 'dst32' and 'src32'),
and CHARSET_MASK_TRANSPARENCY_32 is a 32-bit word where each 8 bits hold
CHARSET_MASK_TRANSPARENCY.
This is the code:

unsigned long int temp = *text32++;
unsigned long int mask = temp ^ CHARSET_MASK_TRANSPARENCY_32;

mask = (((mask & 0x7f7f7f7f) + 0x7f7f7f7f) | mask) & 0x80808080;
mask = ((mask >> 7) + 0x7f7f7f7f) ^ 0x80808080;

*dst32++ = ((temp ^ *src32++) & mask) ^ temp;

After reading 4 pixels of the texture, it calculates a mask for each 8-bit
pixel.
If a pixel's color is equal to CHARSET_MASK_TRANSPARENCY, then its mask byte
is "ff", otherwise it is "00".
You can see it as a simulation of the PCMPEQB instruction of MMX.
Finally, it combines source and texture with the mask to produce the
destination.
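
For illustration, here is a minimal sketch in portable C of the steps just described (mask computation, then combine), with a scalar loop for the leftover pixels. The function names are made up for this example, the value 0xFD for CHARSET_MASK_TRANSPARENCY is an assumption (the real constant lives in the ScummVM sources), and memcpy is used for the 32-bit loads and stores only to stay clear of alignment problems. It is not the code from gfx.cpp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed value; the real constant is defined in the ScummVM sources. */
#define CHARSET_MASK_TRANSPARENCY 0xFD
/* The same byte replicated into all four lanes of a 32-bit word. */
#define CHARSET_MASK_TRANSPARENCY_32 (0x01010101u * CHARSET_MASK_TRANSPARENCY)

/* Returns 0xFF in every byte lane where 'word' holds the transparency
 * color and 0x00 elsewhere (a software stand-in for PCMPEQB). */
static uint32_t transparency_mask32(uint32_t word)
{
    uint32_t mask = word ^ CHARSET_MASK_TRANSPARENCY_32; /* equal lanes -> 0x00 */
    /* Bit 7 of each lane ends up set iff that lane is non-zero. */
    mask = (((mask & 0x7f7f7f7fu) + 0x7f7f7f7fu) | mask) & 0x80808080u;
    /* Turns 0x80 into 0x00 and 0x00 into 0xFF; no carry crosses a lane. */
    return ((mask >> 7) + 0x7f7f7f7fu) ^ 0x80808080u;
}

/* Writes 'count' pixels to dst: transparent text pixels are replaced by
 * the pixel from src, all other text pixels are copied unchanged. */
static void blit_text_over_src(const uint8_t *text, const uint8_t *src,
                               uint8_t *dst, size_t count)
{
    size_t i = 0;
    assert(text != NULL && src != NULL && dst != NULL);

    for (; i + 4 <= count; i += 4) {
        uint32_t t, s, mask, out;
        memcpy(&t, text + i, 4);
        memcpy(&s, src + i, 4);
        mask = transparency_mask32(t);
        out = ((t ^ s) & mask) ^ t;   /* src where mask is ff, text elsewhere */
        memcpy(dst + i, &out, 4);
    }
    for (; i < count; i++)            /* scalar tail for the last 0..3 pixels */
        dst[i] = (text[i] == CHARSET_MASK_TRANSPARENCY) ? src[i] : text[i];
}
```

Since every step works on each byte lane independently, the result does not depend on the byte order of the machine.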
This line:

*dst32++ = ((temp ^ *src32++) & mask) ^ temp;

is identical to:

*dst32++ = (*src32++ & mask) | (temp & ~mask);

but it is usually compiled better.
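
The equivalence can be seen bit by bit: where a mask bit is 1, (temp ^ src) ^ temp leaves src, and where it is 0 the result is simply temp. The XOR form needs three operations instead of four, which is presumably why it tends to compile better on CPUs without an and-not instruction. Below is a small sketch in C that checks both forms against each other; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* XOR, AND, XOR: three operations. */
static uint32_t select_xor(uint32_t src, uint32_t text, uint32_t mask)
{
    return ((text ^ src) & mask) ^ text;
}

/* AND, NOT, AND, OR: four operations (three where AND-NOT is one
 * instruction, such as BIC on ARM or PANDN in MMX). */
static uint32_t select_and_or(uint32_t src, uint32_t text, uint32_t mask)
{
    return (src & mask) | (text & ~mask);
}

/* Compares both forms for one sample pair over 256 different masks. */
static void check_select_forms(void)
{
    uint32_t m;
    for (m = 0; m < 256; m++) {
        uint32_t mask = m * 0x01010101u;
        assert(select_xor(0x11223344u, 0xAABBCCDDu, mask) ==
               select_and_or(0x11223344u, 0xAABBCCDDu, mask));
    }
}
```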
I also tried to see what the compiled code could look like by writing the ASM
myself:

For x86:

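mov eax, [edi]        // temp, from text
add edi, 4
mov ecx, eax          // keep temp for the final combine
xor eax, CHARSET_MASK_TRANSPARENCY_32
mov edx, eax
and eax, 0x7f7f7f7f
add eax, 0x7f7f7f7f
or  eax, edx
and eax, 0x80808080
shr eax, 7
add eax, 0x7f7f7f7f
xor eax, 0x80808080   // eax = mask
mov edx, [esi]        // from src
add esi, 4
xor edx, ecx
and eax, edx
xor eax, ecx          // eax = the 4 output pixels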

For ARM:

r6 = CHARSET_MASK_TRANSPARENCY_32
r7 = 0x7f7f7f7f

ldr r2, [r0], #4 // from text
ldr r3, [r1], #4 // from src
eor r4, r2, r6
and r5, r4, r7
add r5, r5, r7
orr r5, r5, r4
bic r5, r5, r7
mvn r4, r7
add r5, r7, r5, LSR #7
eor r5, r5, r4
bic r2, r2, r5
and r3, r3, r5
orr r2, r2, r3

As you can see, the algorithm needs only a short, branchless sequence for 4
pixels, so it should be more efficient than the "LOAD-SHIFT-COMPARE" method in
engines/scumm/gfxARM.s.
Perhaps it would be interesting to make a solution like this one:

#if defined ARM_USE_GFX_ASM
#   define DrawStripToScreen    asmDrawStripToScreen
#elif defined INDIRECT_STRIP_TO_SCREEN
    typedef void (*DrawStripToScreen_t)(int, int, byte const*, byte const*,
byte*, int, int, int);
    static DrawStripToScreen_t DrawStripToScreen;
#else
    static inline void DrawStripToScreen(int, int, byte const*, byte const*,
byte*, int, int, int);
#endif

If 'INDIRECT_STRIP_TO_SCREEN' is defined, you can use different
blitting functions depending on the machine ScummVM is running on.
For example, if the current system supports the MMX instruction set, it can be
done with simpler code:


// mm3 = CHARSET_MASK_TRANSPARENCY replicated in all 8 bytes
movq    mm0, [edi] // tmp
movq    mm1, [esi] // src
add     edi, 8
add     esi, 8
movq    mm2, mm0
pcmpeqb mm0, mm3
pand    mm1, mm0
pandn   mm0, mm2
por     mm0, mm1
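
To make the idea of choosing a blitter at run time a bit more concrete, here is a rough sketch in C. It pairs a plain per-pixel routine with an MMX routine written with the <mmintrin.h> intrinsics, and picks one through a function pointer. All names (BlitFn, blit_generic, blit_mmx, select_blit) are invented for this example, the value 0xFD for CHARSET_MASK_TRANSPARENCY is an assumption, and the cpu_has_mmx flag stands in for whatever CPU detection would really be used. The MMX path is compiled only when the compiler defines __MMX__, and it ends with _mm_empty(), which issues EMMS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

#define CHARSET_MASK_TRANSPARENCY 0xFD  /* assumed value */

/* Hypothetical signature, simpler than the real DrawStripToScreen one. */
typedef void (*BlitFn)(const uint8_t *text, const uint8_t *src,
                       uint8_t *dst, size_t count);

/* Portable per-pixel version; also handles leftover pixels. */
static void blit_generic(const uint8_t *text, const uint8_t *src,
                         uint8_t *dst, size_t count)
{
    size_t i;
    assert(text != NULL && src != NULL && dst != NULL);
    for (i = 0; i < count; i++)
        dst[i] = (text[i] == CHARSET_MASK_TRANSPARENCY) ? src[i] : text[i];
}

#if defined(__MMX__)
/* 8 pixels per iteration, following the movq/pcmpeqb/pand/pandn/por idea. */
static void blit_mmx(const uint8_t *text, const uint8_t *src,
                     uint8_t *dst, size_t count)
{
    const __m64 transp = _mm_set1_pi8((char)CHARSET_MASK_TRANSPARENCY);
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m64 t, s, mask, out;
        memcpy(&t, text + i, 8);
        memcpy(&s, src + i, 8);
        mask = _mm_cmpeq_pi8(t, transp);               /* pcmpeqb    */
        out = _mm_or_si64(_mm_and_si64(s, mask),        /* pand       */
                          _mm_andnot_si64(mask, t));    /* pandn, por */
        memcpy(dst + i, &out, 8);
    }
    _mm_empty();  /* EMMS: release the MMX state before any x87 code runs */
    blit_generic(text + i, src + i, dst + i, count - i);
}
#endif

/* Returns the blitter to use; cpu_has_mmx is a placeholder for real
 * CPU feature detection done once at startup. */
static BlitFn select_blit(int cpu_has_mmx)
{
#if defined(__MMX__)
    if (cpu_has_mmx)
        return blit_mmx;
#else
    (void)cpu_has_mmx;
#endif
    return blit_generic;
}
```

The caller would store the returned pointer once and then call through it for every strip, which is the role the DrawStripToScreen_t variable plays in the snippet further up.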

Perhaps it is also possible to optimize other parts of ScummVM in a similar
way.

BTW: at the top of engines/scumm/gfx.cpp you have written USE_ARM_GFX_ASM,
while later on you wrote ARM_USE_GFX_ASM.
Which is the right one?

I hope this is helpful to you. Greetings from Italy.


Carlo Bramini.
