[Scummvm-devel] Predictive input code

Fri Apr 4 14:45:45 CEST 2008

Hi folks,

I wanted to bring your attention to the predictive input code in AGI
again.  Specifically, the function  "bool AgiEngine::matchWord(void)" in 
file engines/agi/predictive.cpp.

At one point, I had rewritten that code to use a trie for lookup. This was
scractched and replaced, though, because it caused super-slow loadup times
for some backends which had an inferior malloc implementation.

So it was replaced by code which does essentially a binary search on the
dictionary data.

But this algo is IMO far from being optimal. It is fast and easy to setup.
But unless there is an exact match, it doesn't find the "best" match. How
to define "best" clearly is not trivial, but for example a nice measure
would be to use the letter which is *most probable* based on the
dictionary. E.g. when a "9" is pressed, it could mean any of "wxyz", but
looking at the dictionary, "w" is the most probable one. As it is, the
algo will return the letter for which it finds the first partial match. So
if that happens to be "xaver", it will return an "x". In particular, if
the dictionary changes, the result can very, because the cut points of the
binary search will shift. Not so nice.

The goal to get "good non-exact matches" was the very reason why I
originally wrote this code using a trie. And I think it would be nice to
return to using a trie as the central data structure, as we used to do.
But as I said, we changed it because the implementation needed lots of
mallocs on startup, which was too slow on some systems.

My hope is that we can keep the trie and overcome the startup limitations.

One way to would be to use a trie-specific custom allocator. I.e. not a
general malloc() replacement just a special allocator for trie code, which
is quite simple to implement. It could take advantage from the fact that
we know roughly how many elements we have to store in the trie, and so
ideally we could get away with less than a dozen mallocs, in particular if
it's OK to temporarily allocate a bit too much memory (i.e. the strategy
would be to allocate enough memory for a number of nodes which definitely
is greater or equal to the actual required number of nodes, then in the
end shrink the memory to the actually needed amount, via realloc). It
would even be possible to compute the exact amount of required nodes, with
some effort, although that would make the code more complicated

Or we switch to a generic custom alloc system (there are various libs for
that, and we also have common/memorypool.h now). This might be interesting
in any case, but is somewhat out of scope for this particular thing

Finall we could try a better data structure for the trie (like double
array storage), but I am not sure whether those are suitable for us. In
particular, they tend to waste space, and have slower insert times, which
might make them too slow during startup once more.

So, I would like to try to revert the code back to a trie based algorithm
(based on my old code in there, with some refinements, which should cut
down the mallocs). Before I do this, I want to know whether there are
other plans for the predicitve code (is *anybody* doing anything with
this?). And whether the porters whose ports suffered from the code would
be willing to test patches by me for speed regressions before I put them
into SVN, in a timely fashion (I don't want my patch to bitrot for 6
months, but I also don't want to upset porters by just committing it).

Finally, it would be nice if we could turn this into a generic facility
(maybe located in common/predicitve), so that both AGI and the DS port
(which currently implements its own predictive code, which is
"hack-hooked" into the AGI code) could share this code, resulting in a
consistence user experience across all platforms.

Cheers,
Max