[Scummvm-devel] Improvements to blitting code.

Sat Aug 16 12:02:36 CEST 2008

Hello,
I read in engines/scumm/gfx.cpp that there is a piece of code marked with the comment "TODO: Optimize this code".
That code seems to blit a texture/sprite with transparency.
If an ARM CPU is detected at compile time, it calls the asmDrawStripToScreen() function instead of the C code.
I believe that piece of code can be optimized greatly with a very simple algorithm.
With this new method, you can process 4 pixels at once.

The current code does the following for each pixel:

byte tmp = *text++;
if (tmp == CHARSET_MASK_TRANSPARENCY)
    tmp = *src;
*dst++ = tmp;
src++;

With my method, you use pointers to 32-bit words ('text32' is the same as 'text' but addresses 32-bit units; likewise 'dst32' and 'src32'), and CHARSET_MASK_TRANSPARENCY_32 is a 32-bit word in which each 8-bit field is equal to CHARSET_MASK_TRANSPARENCY.
This is the code:

unsigned long int temp = *text32++;
unsigned long int mask = temp ^ CHARSET_MASK_TRANSPARENCY_32;

mask = (((mask & 0x7f7f7f7f) + 0x7f7f7f7f) | mask) & 0x80808080;
mask = ((mask >> 7) + 0x7f7f7f7f) ^ 0x80808080;

*dst32++ = ((temp ^ *src32++) & mask) ^ temp;

After reading 4 pixels of the texture, it calculates a mask for each 8-bit pixel.
If a pixel is equal to CHARSET_MASK_TRANSPARENCY, then its mask byte is "ff", otherwise it is "00".
If you want, you can see it as a simulation of the PCMPEQB instruction of MMX.
Finally, it combines source, destination and texture with the mask.
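
The mask step can be tried in isolation. What follows is only a minimal sketch: the helper name byteEqualMask is hypothetical and not part of ScummVM, and it assumes a 32-bit unsigned type from <cstdint>. It compares two words byte by byte and returns "ff" in each matching byte, in the spirit of PCMPEQB.

```cpp
#include <cstdint>

// Hypothetical helper (not a ScummVM function): returns 0xff in every byte
// lane where 'a' and 'b' hold the same byte, and 0x00 in every other lane.
static inline uint32_t byteEqualMask(uint32_t a, uint32_t b) {
    uint32_t x = a ^ b;  // a lane is 0x00 exactly where the two bytes match

    // Set bit 7 of each lane that is non-zero. Clearing bit 7 before the add
    // keeps every lane sum below 0x100, so no carry leaks into the next lane.
    x = (((x & 0x7f7f7f7fu) + 0x7f7f7f7fu) | x) & 0x80808080u;

    // Non-zero lane: 0x01 + 0x7f = 0x80, then xor 0x80 gives 0x00.
    // Zero lane:     0x00 + 0x7f = 0x7f, then xor 0x80 gives 0xff.
    return ((x >> 7) + 0x7f7f7f7fu) ^ 0x80808080u;
}
```

For example, comparing 0x11FD22FD against a word filled with 0xFD yields 0x00FF00FF: only the two lanes that hold 0xFD are selected.
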
This line:

*dst32++ = ((temp ^ *src32++) & mask) ^ temp;

is identical to:

*dst32++ = (*src32++ & mask) | (temp & ~mask);

but it usually compiles to better code, because it needs one operation less.
I also checked what the compiled code could look like by writing the ASM myself:
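
To check that the word-wide blend gives the same pixels as the byte-wise code, a small comparison can be sketched as below. This is an illustration under assumptions, not the engine code: the function names are hypothetical, the transparency value 0xFD is assumed for the example, the pixel count is assumed to be a multiple of 4, and memcpy is used for the 32-bit loads and stores so that alignment does not matter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed value, for illustration only; the real constant is defined by the engine.
static const uint8_t kTransparent = 0xFD;
// The same byte replicated into all four lanes of a 32-bit word.
static const uint32_t kTransparent32 = 0x01010101u * kTransparent;

// Reference: one pixel per iteration, same logic as the current C code.
static void blitBytewise(uint8_t *dst, const uint8_t *text,
                         const uint8_t *src, size_t count) {
    for (size_t i = 0; i < count; ++i)
        dst[i] = (text[i] == kTransparent) ? src[i] : text[i];
}

// Four pixels per iteration using the mask trick.
static void blitFourAtOnce(uint8_t *dst, const uint8_t *text,
                           const uint8_t *src, size_t count) {
    assert(count % 4 == 0);  // this sketch does not handle a tail
    for (size_t i = 0; i < count; i += 4) {
        uint32_t temp, back;
        std::memcpy(&temp, text + i, 4);
        std::memcpy(&back, src + i, 4);

        // 0xff in each lane where the text pixel is transparent, else 0x00.
        uint32_t mask = temp ^ kTransparent32;
        mask = (((mask & 0x7f7f7f7fu) + 0x7f7f7f7fu) | mask) & 0x80808080u;
        mask = ((mask >> 7) + 0x7f7f7f7fu) ^ 0x80808080u;

        // Select 'back' where mask is 0xff and 'temp' where it is 0x00.
        uint32_t out = ((temp ^ back) & mask) ^ temp;
        // Same result as the longer form with two ANDs and one OR.
        assert(out == ((back & mask) | (temp & ~mask)));

        std::memcpy(dst + i, &out, 4);
    }
}
```

Since every operation works lane by lane, the result does not depend on the byte order of the machine.
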

For x86:

mov eax, [edi]
add edi, 4
mov ecx, eax
xor eax, CHARSET_MASK_TRANSPARENCY_32
mov edx, eax
and eax, 0x7f7f7f7f
add eax, 0x7f7f7f7f
or  eax, edx
and eax, 0x80808080
shr eax, 7
add eax, 0x7f7f7f7f
xor eax, 0x80808080
mov edx, [esi]
add esi, 4
xor edx, ecx
and eax, edx
xor eax, ecx

For ARM:

r7 = 0x7f7f7f7f
r6 = CHARSET_MASK_TRANSPARENCY_32

ldr r2, [r0], #4 // from text
ldr r3, [r1], #4 // from src
eor r4, r2, r6
and r5, r4, r7
add r5, r5, r7
orr r5, r5, r4
bic r5, r5, r7
mvn r4, r7
add r5, r7, r5, LSR #7
eor r5, r5, r4
bic r2, r2, r5
and r3, r3, r5
orr r2, r2, r3

As you can see, the algorithm handles four pixels with about a dozen instructions and no branches, so it should be more efficient than the "LOAD-SHIFT-COMPARE" method in engines/scumm/gfxARM.s.
Perhaps it would be interesting to adopt a solution like this one:

#if defined ARM_USE_GFX_ASM 
#   define DrawStripToScreen    asmDrawStripToScreen
#elif defined INDIRECT_STRIP_TO_SCREEN
    typedef void (*DrawStripToScreen_t)(int, int, byte const*, byte const*, byte*, int, int, int);
    static DrawStripToScreen_t DrawStripToScreen;
#else
    static inline void DrawStripToScreen(int, int, byte const*, byte const*, byte*, int, int, int);
#endif

If 'INDIRECT_STRIP_TO_SCREEN' is defined, you can select different blitting functions at run time, depending on the machine the program is running on.
For example, if the current system supports the MMX instruction set, the blit can be done with simpler code:

mm3 = CHARSET_MASK_TRANSPARENCY_64

movq    mm0, [edi] // tmp
movq    mm1, [esi] // src
add     edi, 8
add     esi, 8
movq    mm2, mm0
pcmpeqb mm0, mm3
pand    mm1, mm0
pandn   mm0, mm2
por     mm0, mm1
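
The run-time selection through a function pointer could look roughly like the sketch below. Everything beyond the shape of the typedef is an assumption made for the example: the parameter names, the meaning given to the three pitches, the transparency value 0xFD, the function names and the initDrawStrip() hook are all hypothetical. A real build would set the pointer after detecting the CPU features.

```cpp
#include <cstdint>
#include <cstring>

typedef unsigned char byte;
static const byte kTransparent = 0xFD;  // assumed value, for illustration only

// Same parameter shape as the typedef in the message. The names and the
// meaning of the three pitches (row strides in bytes) are assumptions.
typedef void (*DrawStripToScreen_t)(int height, int width, const byte *text,
                                    const byte *src, byte *dst,
                                    int textPitch, int srcPitch, int dstPitch);

// Portable fallback: one pixel at a time.
static void drawStripBytewise(int height, int width, const byte *text,
                              const byte *src, byte *dst,
                              int textPitch, int srcPitch, int dstPitch) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x)
            dst[x] = (text[x] == kTransparent) ? src[x] : text[x];
        text += textPitch; src += srcPitch; dst += dstPitch;
    }
}

// Alternative: four pixels at a time, with a byte-wise tail for odd widths.
static void drawStripWordwise(int height, int width, const byte *text,
                              const byte *src, byte *dst,
                              int textPitch, int srcPitch, int dstPitch) {
    const uint32_t key = 0x01010101u * kTransparent;
    for (int y = 0; y < height; ++y) {
        int x = 0;
        for (; x + 4 <= width; x += 4) {
            uint32_t t, s;
            std::memcpy(&t, text + x, 4);
            std::memcpy(&s, src + x, 4);
            uint32_t m = t ^ key;
            m = (((m & 0x7f7f7f7fu) + 0x7f7f7f7fu) | m) & 0x80808080u;
            m = ((m >> 7) + 0x7f7f7f7fu) ^ 0x80808080u;
            const uint32_t out = ((t ^ s) & m) ^ t;
            std::memcpy(dst + x, &out, 4);
        }
        for (; x < width; ++x)
            dst[x] = (text[x] == kTransparent) ? src[x] : text[x];
        text += textPitch; src += srcPitch; dst += dstPitch;
    }
}

// The pointer all callers go through; it starts with the safe fallback.
static DrawStripToScreen_t DrawStripToScreen = drawStripBytewise;

// Hypothetical init hook: the flag stands in for a real CPU feature check.
static void initDrawStrip(bool preferWordwise) {
    DrawStripToScreen = preferWordwise ? drawStripWordwise : drawStripBytewise;
}
```

Callers keep writing DrawStripToScreen(...) as before, so an MMX or ARM routine can be plugged in later without touching the call sites. The price is one indirect call per strip.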

Perhaps it is also possible to optimize other parts of ScummVM in a similar manner.
BTW: at the top of engines/scumm/gfx.cpp the macro is written as USE_ARM_GFX_ASM, while further down it is written as ARM_USE_GFX_ASM.
Which is the right one?

I hope this is helpful to you. Greetings from Italy.

Sincerely,

Carlo Bramini.