[Scummvm-devel] Re : Improvements to blitting code.

Sat Aug 16 13:59:57 CEST 2008

Hi Carlo, hi everybody,

I started a very long mail with lots of comments about all this, but it crashed, so I'll make it much shorter and much less informative this time :)

On x86, the MMX version will be a sure win (branchless, low latency, 8 pixels at a time)! Your emulated CMOV, on the other hand, really depends on the branch misprediction rate of the original code. This should be measured with an appropriate tool such as VTune or CodeAnalyst. Anyway, MMX is supported on every PC on earth, so I say go for it! (And don't forget the EMMS ;) )

On ARM, don't forget that the pipeline is nowhere near as deep as on x86, so the conditional instructions are less of a problem. Robin's implementation (asmDrawStripToScreen), which has a special case for 4 transparent pixels, might therefore still be a win, but I'm sure he will love to experiment with your technique :)

Cheers,
Bertrand

--- On Sat 16.8.08, carlo.bramix <carlo.bramix at libero.it> wrote:
From: carlo.bramix <carlo.bramix at libero.it>
Subject: [Scummvm-devel] Improvements to blitting code.
To: "scummvm-devel" <scummvm-devel at lists.sourceforge.net>
Date: Saturday 16 August 2008, 12:02

Hello,
I read in engines/scumm/gfx.cpp that there is a piece of code with the
comment "TODO: Optimize this code".
That code seems to blit a texture/sprite with transparency.
If an ARM CPU is detected at compile time, it calls the
asmDrawStripToScreen() function instead of using the C code.
I believe that piece of code can be optimized greatly with a very simple
algorithm.
With this new method, you can process 4 pixels at once.

The current code does the following:

byte tmp = *text++;
if (tmp == CHARSET_MASK_TRANSPARENCY)
    tmp = *src;
*dst++ = tmp;
src++;

With my method, you use pointers to 32-bit words ('text32' is the same as
'text' but in 32-bit units; likewise 'dst32' and 'src32'),
and CHARSET_MASK_TRANSPARENCY_32 is a 32-bit word in which each 8-bit
field is equal to CHARSET_MASK_TRANSPARENCY.
This is the code:

unsigned long int temp = *text32++;
unsigned long int mask = temp ^ CHARSET_MASK_TRANSPARENCY_32;

mask = (((mask & 0x7f7f7f7f) + 0x7f7f7f7f) | mask) & 0x80808080;
mask = ((mask >> 7) + 0x7f7f7f7f) ^ 0x80808080;

*dst32++ = ((temp ^ *src32++) & mask) ^ temp;

After reading 4 pixels of the texture, it calculates a mask for each 8-bit
pixel.
If a pixel's color is equal to CHARSET_MASK_TRANSPARENCY, then its mask byte is
"ff", otherwise it is "00".
If you want, you can see it as a simulation of the PCMPEQB instruction of MMX.
Finally, it combines the source and the texture through the mask and writes
the result to the destination.
This line:

*dst32++ = ((temp ^ *src32++) & mask) ^ temp;

is identical to:

*dst32++ = (*src32++ & mask) | (temp & ~mask);

but it is usually compiled better.
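
The mask computation and blend described above can be sketched in portable C as follows. This is only an illustration, not the ScummVM code: the names `replicate_byte`, `byte_eq_mask` and `blit_with_transparency` are hypothetical, the value 0xFD for the transparent color is an assumption made for the example, and the pixel count is assumed to be a multiple of 4. The sketch uses memcpy for the 32-bit loads and stores so that it does not rely on pointer alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed value, for illustration only. */
#define TRANSPARENT_BYTE 0xFDu

/* Copy one byte into all four lanes of a 32-bit word. */
static uint32_t replicate_byte(uint8_t b) {
    return (uint32_t)b * 0x01010101u;
}

/* Per-byte equality: 0xFF in each lane where a and b match, 0x00 elsewhere. */
static uint32_t byte_eq_mask(uint32_t a, uint32_t b) {
    uint32_t m = a ^ b;                        /* lane is 0 only if equal    */
    m = (((m & 0x7f7f7f7fu) + 0x7f7f7f7fu) | m)
        & 0x80808080u;                         /* 0x80 if lane was non-zero  */
    return ((m >> 7) + 0x7f7f7f7fu)
           ^ 0x80808080u;                      /* 0xFF if lane was zero      */
}

/* Where text is transparent take the src pixel, otherwise keep text. */
static void blit_with_transparency(uint8_t *dst, const uint8_t *text,
                                   const uint8_t *src, size_t count) {
    const uint32_t key = replicate_byte(TRANSPARENT_BYTE);
    assert(count % 4 == 0);        /* sketch handles whole words only */
    for (size_t i = 0; i < count; i += 4) {
        uint32_t t, s, mask, out;
        memcpy(&t, text + i, 4);   /* alignment-safe 32-bit load */
        memcpy(&s, src + i, 4);
        mask = byte_eq_mask(t, key);
        out = ((t ^ s) & mask) ^ t; /* same as (s & mask) | (t & ~mask) */
        memcpy(dst + i, &out, 4);
    }
}
```

Because every step works lane by lane and no carry crosses a byte boundary, the result does not depend on the byte order of the machine. A real version would still need a byte-wise tail loop for widths that are not a multiple of 4.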
I also wrote the ASM by hand to see what the compiled code could look like:

For x86:

mov eax, [edi]
add edi, 4
mov ecx, eax
xor eax, CHARSET_MASK_TRANSPARENCY_32
mov edx, eax
and eax, 0x7f7f7f7f
add eax, 0x7f7f7f7f
or  eax, edx
and eax, 0x80808080
shr eax, 7
add eax, 0x7f7f7f7f
xor eax, 0x80808080
mov edx, [esi]
add esi, 4
xor edx, ecx
and eax, edx
xor eax, ecx

For ARM:

r7 = 0x7f7f7f7f
r6 = CHARSET_MASK_TRANSPARENCY_32

ldr r2, [r0], #4 // from text
ldr r3, [r1], #4 // from src
eor r4, r2, r6
and r5, r4, r7
add r5, r5, r7
orr r5, r5, r4
bic r5, r5, r7
mvn r4, r7
add r5, r7, r5, lsr #7
eor r5, r5, r4
bic r2, r2, r5
and r3, r3, r5
orr r2, r2, r3

As you can see, the algorithm is much more efficient and faster than the
"LOAD-SHIFT-COMPARE" method in engines/scumm/gfxARM.s.
Perhaps it would be interesting to adopt a solution like this one:

#if defined ARM_USE_GFX_ASM
#   define DrawStripToScreen    asmDrawStripToScreen
#elif defined INDIRECT_STRIP_TO_SCREEN
    typedef void (*DrawStripToScreen_t)(int, int, byte const*, byte const*,
                                        byte*t, int, int, int);
    static DrawStripToScreen_t DrawStripToScreen;
#else
    static inline void DrawStripToScreen(int, int, byte const*, byte const*,
                                         byte*t, int, int, int);
#endif

If 'INDIRECT_STRIP_TO_SCREEN' is defined, you can use different
blitting functions depending on the PC the program is running on.
For example, if the current system supports the MMX instruction set, it can be
done with simpler code:

mm3 = CHARSET_MASK_TRANSPARENCY_64

movq    mm0, [edi] // tmp
movq    mm1, [esi] // src
add     edi, 8
add     esi, 8
movq    mm2, mm0
pcmpeqb mm0, mm3
pand    mm1, mm0
pandn   mm0, mm2
por     mm0, mm1
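
The run-time selection that INDIRECT_STRIP_TO_SCREEN would allow can be sketched in C with a function pointer chosen once at start-up. Everything here is hypothetical: the names `blit_fn`, `blit_bytewise`, `blit_fast_placeholder`, `draw_strip` and `init_blitter` are invented for the example, the signature is simplified compared with the 8-argument one above, the value 0xFD is an assumed transparent color, and the "fast" routine is only a stand-in that forwards to the portable loop. A real MMX routine would also have to execute EMMS before returning, and the capability flag would come from CPUID or a platform API.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed value, for illustration only. */
#define TRANSPARENT_BYTE 0xFDu

/* Simplified, hypothetical signature for a strip blitter. */
typedef void (*blit_fn)(uint8_t *dst, const uint8_t *text,
                        const uint8_t *src, size_t count);

/* Portable fallback: one pixel at a time, like the current C loop. */
static void blit_bytewise(uint8_t *dst, const uint8_t *text,
                          const uint8_t *src, size_t count) {
    for (size_t i = 0; i < count; i++)
        dst[i] = (text[i] == TRANSPARENT_BYTE) ? src[i] : text[i];
}

/* Stand-in for an optimized routine; here it just forwards. */
static void blit_fast_placeholder(uint8_t *dst, const uint8_t *text,
                                  const uint8_t *src, size_t count) {
    blit_bytewise(dst, text, src, count);
}

/* The pointer every caller goes through; defaults to the safe version. */
static blit_fn draw_strip = blit_bytewise;

/* Pick the implementation once, based on a capability flag. */
static void init_blitter(int has_fast_path) {
    draw_strip = has_fast_path ? blit_fast_placeholder : blit_bytewise;
}
```

Callers would invoke `draw_strip(dst, text, src, count)` without knowing which routine is behind it. The indirect call costs a little per strip, so this only pays off if the selected routine is clearly faster than the inlined C version.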

Perhaps it is also possible to optimize other parts of ScummVM in a similar
manner.
BTW: at the top of engines/scumm/gfx.cpp you have written USE_ARM_GFX_ASM, while
further down you wrote ARM_USE_GFX_ASM.
Which is the right one?

I hope this is helpful to you. Greetings from Italy.

Sincerely,

Carlo Bramini.
