[Scummvm-cvs-logs] SF.net SVN: scummvm:[54919] tools/branches/gsoc2010-decompiler/decompiler/ doc

pidgeot at users.sourceforge.net pidgeot at users.sourceforge.net
Wed Dec 15 03:29:02 CET 2010


Revision: 54919
          http://scummvm.svn.sourceforge.net/scummvm/?rev=54919&view=rev
Author:   pidgeot
Date:     2010-12-15 02:29:01 +0000 (Wed, 15 Dec 2010)

Log Message:
-----------
DECOMPILER: Update LaTeX documentation

Modified Paths:
--------------
    tools/branches/gsoc2010-decompiler/decompiler/doc/cfg.tex
    tools/branches/gsoc2010-decompiler/decompiler/doc/codegen.tex
    tools/branches/gsoc2010-decompiler/decompiler/doc/disassembler.tex
    tools/branches/gsoc2010-decompiler/decompiler/doc/engine.tex
    tools/branches/gsoc2010-decompiler/decompiler/doc/overview.tex
    tools/branches/gsoc2010-decompiler/decompiler/doc/preamble.tex
    tools/branches/gsoc2010-decompiler/decompiler/doc/todo.tex

Modified: tools/branches/gsoc2010-decompiler/decompiler/doc/cfg.tex
===================================================================
--- tools/branches/gsoc2010-decompiler/decompiler/doc/cfg.tex	2010-12-15 01:45:47 UTC (rev 54918)
+++ tools/branches/gsoc2010-decompiler/decompiler/doc/cfg.tex	2010-12-15 02:29:01 UTC (rev 54919)
@@ -10,7 +10,7 @@
 \item Perform analysis on vertices
 \end{itemize}
 
-Calls to in-script functions are not represented with edges in the graph. This is done to keep functions separate from one another, so if your engine uses a jump as part of calling functions, you need to make sure you have given that particular jump the type kCallInstType.
+Calls to in-script functions are not represented with edges in the graph. This is done to keep functions separate from one another, so if your engine uses a jump as part of calling functions, you need to make sure you represent the jump using a \code{CallInstruction} instead of a \code{JumpInstruction}.
 
 The first step is handled in the constructor, while the next three steps are handled by the \code{createGroups()} method. The last step is handled by the \code{analyze} method.
 

Modified: tools/branches/gsoc2010-decompiler/decompiler/doc/codegen.tex
===================================================================
--- tools/branches/gsoc2010-decompiler/decompiler/doc/codegen.tex	2010-12-15 01:45:47 UTC (rev 54918)
+++ tools/branches/gsoc2010-decompiler/decompiler/doc/codegen.tex	2010-12-15 02:29:01 UTC (rev 54919)
@@ -19,50 +19,56 @@
 \emph{Note:} You must currently include a \code{\{} at the end of your signature.
 
 \subsection{Group processing}
-During processing of a group, the instructions in the group are processed one at a time. Certain kinds of instructions can be handled by generic code, while others must be handled by engine-specific code in the \code{processInst} method of your subclass.
+During processing of a group, the instructions in the group are processed one at a time. This is done by calling \code{processInst} on each instruction. This method should emulate the effect of the instruction, and, if the instruction corresponds to a statement, call \code{addOutputLine} on the code generator to add this statement as a line of code.
 
-If you need to access information about the group currently being processed, use the member variable \code{\_curGroup}.
+If you need to access information about the group currently being processed, use the member variable \code{\_curGroup} on the code generator.
 
 \subsection{The stack and stack entries}
-When generating the code, a stack is used to represent the state of the system. When data is pushed on the stack, a stack entry describing how that data was created is added; when data is popped, a stack entry describing the popped data is removed.
+\label{sec:stackvalues}
+When generating the code, a stack is used to represent the state of the system. When data is pushed on the stack, a \code{Value} describing how that data was created is added; when data is popped, a \code{Value} describing the popped data is removed.
 
-To manipulate the stack, use the \code{push} and \code{pop} methods to push or pop stack entries. Unlike the STL stack, \code{pop} returns the value being popped from the stack, so you do not have to first get the top element and then pop it afterwards, but you can still call the \code{peek} method if you just want to look at the topmost element without removing it. Additionally, it has an \code{empty} method to check if the stack is empty.
+To manipulate the stack, use the \code{push} and \code{pop} methods to push or pop values. Unlike the STL stack, \code{pop} returns the value being popped from the stack, so you do not have to first get the top element and then pop it afterwards, but you can still call the \code{peek} method if you just want to look at the topmost element without removing it. Additionally, it has an \code{empty} method to check if the stack is empty.
 
 Some engines require you to look further down the stack than just the topmost element. You can use the \code{peekPos} method to retrieve an element at an arbitrary position in the stack. This method takes an integer containing the number of stack entries to skip, i.e. passing the value 0 will give you the topmost element, while passing the value 2 will give you the third value on the stack.
 
 \emph{Note:} \code{peekPos} accesses the underlying STL container (\code{std::deque}) using the \code{at} function, which will throw an exception if the stack does not contain enough elements.
 
-When working with entries, you should use the \code{EntryPtr} type. This wraps the entry in a \code{boost::intrusive\_ptr} to free the associated memory when it is no longer referenced.
+When working with values, you should use the \code{ValuePtr} type. This wraps the value in a \code{boost::intrusive\_ptr} to free the associated memory when it is no longer referenced.
 
-Some stack entries contain references to an arbitrary number of stack entries. This is handled using an STL \code{deque}, typedef'ed as \code{EntryList}.
+Some value types contain references to an arbitrary number of values. This can be handled using an STL \code{deque}, typedef'ed as \code{ValueList}.
 
-Stack entries can be categorized into 9 different types:
+The following value types are predefined:
 
-\paragraph{Integers (IntEntry)}
+\paragraph{Integers (IntValue)}
 Integers can use up to 32-bits, and be signed or unsigned. When creating an integer, you must specify its value and whether or not it is signed. This also contains additional methods to extract the value and signedness of the value, which may be of use in some situations.
 
-\paragraph{Variables (VarEntry)}
+\paragraph{Addresses (AddressValue/RelAddressValue)}
+Addresses are implemented as a specialization of IntValue. When output to a string, hexadecimal notation is used instead of decimal, and for relative addresses, the sign is prefixed and the absolute offset value is used instead (i.e., -1 is shown as \code{-0x1} instead of \code{0xFFFFFFFF}).
+
+Absolute addresses cannot be retrieved using \code{getSigned}, while relative addresses will return the offset with this method. To get the absolute address associated with either of these values, call \code{getUnsigned}.
+
+\paragraph{Variables (VarValue)}
 Variables are stored as a simple string. Subclasses of \code{CodeGenerator} must implement their own logic to determine a suitable variable name when given a reference.
 
-\paragraph{Binary operations (BinaryOpEntry)}
-Binary operations stores the two stack entries used as operands, and a string containing the operator. Parenthesis are automatically added around all binary operations to preserve the proper evaluation order.
+\paragraph{Binary operations (BinaryOpValue)}
+Binary operations store the two values used as operands, and a string containing the operator. Parentheses are added if the operator precedence requires it.
 
-\paragraph{Unary operations (UnaryOpEntry)}
+\paragraph{Unary operations (UnaryOpValue)}
 Just like binary operations, except only a single operand is stored. The operator can be placed before (prefix) or after (postfix) the operand.
 
-\paragraph{Duplicated entries (DupEntry)}
+\paragraph{Duplicated entries (DupValue)}
 Stores an index to distinguish between multiple duplicated entries. This index is automatically assigned and determined when calling the \code{dup} function to duplicate a stack entry.
 
-\paragraph{Array entries (ArrayEntry)}
-Array entries are stored as a simple string containing the name of the array, and an EntryList of stack entries used as the indices, with the first element in the EntryList being output as the first index.
+\paragraph{Array entries (ArrayValue)}
+Array entries are stored as a simple string containing the name of the array, and a ValueList of values used as the indices, with the first element in the ValueList being output as the first index.
 
-\paragraph{Strings (StringEntry)}
-A string is stored as... well, a string. You have to supply your own quotes if necessary.
+\paragraph{Strings (StringValue)}
+A string is stored as... well, a string. The default implementation automatically surrounds the string with quotes.
 
-\paragraph{Lists (ListEntry)}
-A list is stored using an EntryList to contain the stack entries in the list. Elements are output left-to-right, such that the first element in the EntryList will be output as the first element in the list.
+\paragraph{Negated values (NegatedValue)}
+\code{NegatedValue} represents an expression of the form \code{!value}, i.e. boolean negation. This type is used to ensure that \code{value->negate()->negate()} does not actually perform double negation, but simply uses the original value.
 
-\paragraph{Function calls (CallEntry)}
+\paragraph{Function calls (CallValue)}
 Function calls have the same underlying storage types as an array entry, but the output is formatted like a function call instead of an array access.
 
 Each entry type knows how to output itself to an \code{std::ostream} supplied as a parameter to the \code{print} function, and the common base class \code{StackEntry} also overloads the \code{<<} operator so any stack entry can be streamed directly to an output stream using that function.
@@ -70,53 +76,54 @@
 \subsection{Outputting code}
 When processing certain kinds of instructions, you will probably want to create a line of code as part of the output. To do that, call \code{addOutputLine} with a string containing the code you wish to output as an argument. This will then be associated with the group being processed.
 
-If your line of code deals with control flow, you will probably want to do something about the indentation. You can supply two extra boolean arguments to \code{addOutputLine} to state that the indentation should be decreased before outputting this line, and/or that the indentation should be increased for lines output after this line. If you leave out these arguments, no extra indentation is added.
+If your line of code deals with specialized control flow, you will probably want to do something about the indentation. You can supply two extra boolean arguments to \code{addOutputLine} to state that the indentation should be decreased before outputting this line, and/or that the indentation should be increased for lines output after this line. If you leave out these arguments, no extra indentation is added.
 
+Note: By default, if, while and do-while statements detected in the control flow graph are automatically output after processing the conditional jump. To fill in the condition, the topmost stack value is used.
+
 Note: This indent handling is currently considered a temporary solution until there is time to implement something better. It may be replaced with a different form of indentation handling at a later time.
 
-You will usually need to output assignments at some point. For that, you can use the \code{writeAssignment} method to generate an assignment statement. \code{writeAssignment} takes two parameters, the first being the stack entry representing the left-hand side of the assignment operator, and the second being the stack entry representing the right-hand side of the operator.
+You will usually need to output assignments at some point. For that, you can use the \code{writeAssignment} method to generate an assignment statement. \code{writeAssignment} takes two parameters, the first being the Value representing the left-hand side of the assignment operator, and the second being the Value representing the right-hand side of the operator.
 
 \subsection{Default instruction handling and instruction metadata}
-When disassembling, you can store metadata for a given instruction to be used during code generation.
-
-Default handling exists for a number of instruction types, described below. To get the default handling, simply call the base class implementation of \code{processInst} from your code.
-\paragraph{kDupInstType}
+Default handling exists for a number of instruction types, described below. See also Section~\vref{sec:instructions}, which discusses instructions in more detail.
+\paragraph{kDupInst}
 The topmost stack entry is popped, and two duplicated copies are pushed to the stack. If the entry being duplicated was not already a duplicate, an assignment will be output to assign the original stack entry to a special dup variable, to show that the original entry is not being recalculated.
 
-\paragraph{kUnaryOpPreInstType/kUnaryOpPostInstType}
+\paragraph{kUnaryOpPreInst/kUnaryOpPostInst}
 The topmost stack entry is popped, and a \code{UnaryOpEntry} is created and pushed to the stack, using the codegen metadata as the operator, and the previously popped entry as the operand. The exact type determines whether the operator is pre- or postfixed to the operand.
 
-\paragraph{kBinaryOpInstType}
+\paragraph{kBinaryOpInst}
 The two topmost stack entries are popped, and a BinaryOpEntry is created and pushed to the stack, using the codegen metadata as the operator and the previously popped entries as the operands. The order of the operands is determined by the value of the field \code{\_binOrder}, as described in Section~\vref{sec:argOrder}.
 
-\paragraph{kCondJumpInstType and kCondJumpRelInstType}
-The information on the stack is then read, and an if, while or do-while condition is output using the topmost stack entry. In general, you will want to call the base class method after handling one of these opcodes.
-
-\paragraph{kJumpInstType and kJumpRelInstType}
-If the current group has been detected as a break or a continue, a break or continue statement is output. Otherwise, the jump is analyzed and output unless it is a jump back to the condition of a while-loop that ends there, or it is determined that the jump is unnecessary due to an else block following immediately after.
-
-\paragraph{kReturnInstType}
+\paragraph{kReturnInst}
 This simply adds a line \code{return;} to the output.
 
-\emph{Note:} The default handling does not currently allow specifying a return value as part of the statement, as in \code{return 0;}. You will have to handle that yourself.
+\emph{Note:} The default handling does not currently allow specifying a return value as part of the statement, as in \code{return 0;}. You will have to handle that yourself using a subclass.
 
-\paragraph{kSpecialCallInstType}
+\paragraph{kKernelCallInst}
 The metadata is treated similar to parameter specifications in \code{SimpleDisassembler} (see Section~\vref{sec:simpledisasm}). If the specification string starts with the character \code{r}, this signifies that the call returns a value, and processing starts at the next character.
 For each character in the metadata string, \code{processSpecialMetadata} is called with the instruction being processed, and the current metadata character to be handled. The default implementation only understands the character \code{p}, which pops an argument from the stack and adds it to the argument list.
 Once the metadata string has been processed fully, then an entry representing the function call is pushed to the stack if the call returns a value. Otherwise, the call is added to the output.
 
 You can override the \code{processSpecialMetadata} method to add your own specification characters, just like you would override \code{readParameter} in \code{SimpleDisassembler}. Use the \code{addArg} method to add arguments.
 
-Due to the conflict with the specification of a return value, it is recommended that you do not adopt \code{r} as a metadata character.
+Due to the conflict with the specification of a return value, it is recommended that you do not adopt \code{r} as a metadata character unless you provide your own \code{processInst} implementation for this purpose.
 
+In addition, certain instruction types trigger additional generic behavior:
+\paragraph{Conditional jumps}
+After processing the conditional jump, an if, while or do-while condition is output using the topmost stack entry as the condition. The condition is automatically negated for if and while conditions, so you only need to consider what the instruction itself does. If the jump is taken when the checked condition is false (e.g. a \code{jumpFalse} instruction), you must remember to negate the value representing the condition.
+
+\paragraph{Unconditional jumps}
+After processing the instruction, if the current group has been detected as a break or a continue, a break or continue statement is output. Otherwise, the jump is analyzed and output unless it is a jump back to the condition of a while-loop that ends there, or it is determined that the jump is unnecessary due to an else block following immediately after. A default, empty implementation of \code{processInst} is provided for this type.
+
 \paragraph{Other types}
-No default handling exists for types other than those mentioned above, so you must handle them yourself in the \code{processInst} method of your subclass. This includes types like \code{kLoadInstType} and \code{kStoreInstType}.
+No default handling exists for types other than those mentioned above, so you must handle them yourself by creating new Instruction subclasses.
 
-Note that this also includes \code{kCallInstType}. Although many engines might want to handle this in a manner similar to \code{kSpecialCallInstType} opcodes, this is left to the engine-specific code so they can fully make sense of the metadata they choose to add to the function.
+Note that this also includes \code{kCallInst}. Although many engines might want to handle this in a manner similar to \code{kKernelCallInst} opcodes, this is left to the engine-specific code so they can fully make sense of the metadata they choose to add to the function.
 
 \subsection{Order of arguments}
 \label{sec:argOrder}
-The generic handling of binary operators (kBinaryOpInstType) and magic functions (kSpecialCallInstType) can be configured to display their arguments using FIFO or LIFO - respectively, the first and the last entry to be pushed onto the stack is used as the first (leftmost) argument. This is set as part of the constructor for the \code{CodeGenerator} class, using the enumeration values \code{kFIFOArgOrder} and \code{kLIFOArgOrder}.
+The generic handling of binary operators (kBinaryOpInst) and kernel functions (kKernelCallInst) can be configured to display their arguments using FIFO or LIFO - respectively, the first and the last entry to be pushed onto the stack is used as the first (leftmost) argument. This is set as part of the constructor for the \code{CodeGenerator} class, using the enumeration values \code{kFIFOArgOrder} and \code{kLIFOArgOrder}.
 
 To provide an example, consider the following sequence of instructions:
 

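Illustration for the "Order of arguments" paragraph in codegen.tex above: a minimal C++ sketch of how FIFO and LIFO ordering could decide which popped entry becomes the left operand of a binary operation. Only kFIFOArgOrder and kLIFOArgOrder are taken from the documentation. The enum name ArgOrder, the helper applyBinaryOp, the use of plain strings in place of ValuePtr, and the choice of the deque front as the top of the stack are all assumptions made for this example; the actual code generator may work differently.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Enumeration values named in the documentation; the enum name is assumed.
enum ArgOrder { kFIFOArgOrder, kLIFOArgOrder };

// Hypothetical helper: pops two operands and pushes the combined expression.
// Assumes the front of the deque is the top of the stack.
std::string applyBinaryOp(std::deque<std::string> &stack, const std::string &op, ArgOrder order) {
	assert(!op.empty());

	// at() throws std::out_of_range if fewer than two entries are present.
	const std::string pushedLast = stack.at(0);
	const std::string pushedFirst = stack.at(1);
	stack.pop_front();
	stack.pop_front();

	// FIFO: the first entry pushed is the left operand; LIFO: the last one is.
	const std::string expr = (order == kFIFOArgOrder)
		? pushedFirst + " " + op + " " + pushedLast
		: pushedLast + " " + op + " " + pushedFirst;
	stack.push_front(expr);
	return expr;
}
```

With x pushed before y, this sketch yields "x - y" for kFIFOArgOrder and "y - x" for kLIFOArgOrder.
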
Modified: tools/branches/gsoc2010-decompiler/decompiler/doc/disassembler.tex
===================================================================
--- tools/branches/gsoc2010-decompiler/decompiler/doc/disassembler.tex	2010-12-15 01:45:47 UTC (rev 54918)
+++ tools/branches/gsoc2010-decompiler/decompiler/doc/disassembler.tex	2010-12-15 02:29:01 UTC (rev 54919)
@@ -3,18 +3,32 @@
 The purpose of the disassembler is to read instructions from a script file and convert them to a common, machine-readable form for further analysis.
 
 \subsection{Instructions}
-Instructions are represented using the \code{Instruction} struct.
+\label{sec:instructions}
+Instructions are represented using a type hierarchy, with the \code{Instruction} struct as the base type.
 
 \begin{C++}
 \begin{lstlisting}
-struct Instruction {
+struct Instruction : public RefCounted {
 	uint32 _opcode;
 	uint32 _address;
 	int16 _stackChange;
 	std::string _name;
-	InstType _type;
-	std::vector<Parameter> _params;
+	std::vector<ValuePtr> _params;
 	std::string _codeGenData;
+
+	friend std::ostream &operator<<(std::ostream &output, const Instruction *inst);
+	virtual std::ostream &print(std::ostream &output) const;
+	virtual bool isJump() const;
+	virtual bool isCondJump() const;
+	virtual bool isUncondJump() const;
+	virtual bool isStackOp() const;
+	virtual bool isFuncCall() const;
+	virtual bool isReturn() const;
+	virtual bool isKernelCall() const;
+	virtual bool isLoad() const;
+	virtual bool isStore() const;
+	virtual uint32 getDestAddress() const;
+	virtual void processInst(ValueStack &stack, Engine *engine, CodeGenerator *codeGen) = 0;
 };
 \end{lstlisting}
 \end{C++}
@@ -25,34 +39,69 @@
 \item \code{\_address} stores the absolute memory address where this instruction would be loaded into memory.
 \item \code{\_stackChange} stores the net change of executing this instruction - for example, if the instruction pushes a byte on to the stack, this should be set to 1. This is used to determine when each statement ends. The count can be in any unit you wish - bytes, words, bits - as long as the same unit is used for all instructions. This means that if your stack only works with 16-bit elements, pushing an 8-bit value and pushing a 16-bit value should have the same net effect on the stack.
 \item \code{\_name} contains the name of the instruction. This is mainly for use during code generation.
-\item \code{\_type} represent the type of instruction. See Section~\vref{sec:insttype} for details.
-\item \code{\_params} contains the parameters given to the instruction - for example, if you have the instruction \code{PUSH 1}, there would be one parameter, with the value of 1. See Section~\vref{sec:parameter} for details on the Parameter type.
+\item \code{\_params} contains the parameters given to the instruction - for example, if you have the instruction \code{PUSH 1}, there would be one parameter, with the value of 1. See Section~\vref{sec:parameter} for details on the Value type.
 \item \code{\_codeGenData} stores metadata to be used during code generation. For details, see Section~\vref{sec:codegen}.
 \end{itemize}
 
-If some instructions do not have a fixed effect on the stack--that is, the instruction name alone does not determine the effect on the stack--set the field to some easily recognizable value when doing the disassembly. You can then determine the correct value in a post-processing step after the code flow analysis.
+If some instructions do not have a fixed effect on the stack--that is, the instruction name alone does not determine the effect on the stack--set the field to some easily recognizable value when doing the disassembly. You will, however, have to determine the exact stack effect after disassembling the script, as the code flow analysis depends on this information being accurate.
 
 \subsection{Instruction types}
 \label{sec:insttype}
-The instruction type is a generalization of the different kinds of instructions.
+As mentioned previously, the different instructions are represented using a type hierarchy. This allows you to independently specify how each kind of instruction should be handled, while abstracting away the engine-specific information to allow for generic analysis.
 
-This is particularly important during code flow analysis; since this part is engine-independent, the analysis must have some way of distinguishing the different types of instructions. Additionally, this information can be used during code generation to generalize the recognition of constructs--for example, the code generated for addition and the code generated for multiplication will generally be identical, with the exception of that single arithmetic instruction doing the work.
+This is particularly important during code flow analysis; since this part is completely engine-independent, the analysis must have some way of distinguishing the different types of instructions. For that purpose, a number of \code{is*} methods are defined which specify whether the instruction satisfies some specific purpose.
 
-Most of the types are self-explanatory, with the possible exception of \code{kSpecialCallInstType}. \code{kSpecialCallInstType} should be used for all "magic functions"--opcodes that perform some function specific to the engine, like playing a sound, drawing a graphic, or saving the game.
+Each of the predefined instruction types has a class associated with it to make it simpler to add functionality to a specific type of instruction, as specified in Table~\vref{tbl:insttypes}.
 
-During code generation, some instruction types have a pre-defined handling, while others must be handled on your own. It is possible to override the default handling for any or all opcodes if you wish. For details, see Section~\vref{sec:codegen}.
+\begin{table}
+\centering
+\begin{tabular}{|m{3.2cm}|m{5cm}|p{3.2cm}|}
+\hline
+\textbf{Type} & \textbf{Base class} & \textbf{Purpose} \\
+\hline
+\code{kBinaryOpInst} & \code{BinaryOpInstruction} & Binary operations (+, *, ==, etc.) \\\hline
+\code{kBoolNegateInst} & \code{BoolNegateInstruction} & Boolean negation \\\hline
+\code{kCallInst} & \code{CallInstruction} & Script function call \\\hline
+\code{kCondJumpInst} & \code{CondJumpInstruction} & Conditional jumps \\\hline
+\code{kDupInst} & \code{DupInstruction} & Duplicate stack entry \\\hline
+\code{kJumpInst} & \code{UncondJumpInstruction} & Unconditional jumps \\\hline
+\code{kKernelCallInst} & \code{KernelCallInstruction} & Kernel function call \\\hline
+\code{kLoadInst} & \code{LoadInstruction} & Load from memory \\\hline
+\code{kReturnInst} & \code{ReturnInstruction} & Function return \\\hline
+\code{kStackInst} & \code{StackInstruction} & Stack allocation or deallocation \\\hline
+\code{kStoreInst} & \code{StoreInstruction} & Store to memory \\\hline
+\code{kUnaryOpPreInst} & \code{UnaryOpPrefixInstruction} & Unary operation, prefixed operator \\\hline
+\code{kUnaryOpPostInst} & \code{UnaryOpPostfixInstruction} & Unary operation, postfixed operator \\\hline
+\end{tabular}
+\caption{Predefined instruction types}
+\label{tbl:insttypes}
+\end{table}
 
-\subsection{Parameters}
+Where deemed appropriate, some of the base classes contain a default implementation of \code{processInst}. You can create a new subclass for each of these types and override this method to change their functionality.
+
+\code{getDestAddress} is implemented on jump instructions to allow the generic code to find the target of a jump. You must create subclasses for your jump instructions which override this method.
+
+Most of the types are self-explanatory, with the possible exception of \code{kKernelCallInst}. \code{kKernelCallInst} should be used for "magic functions"--opcodes that perform some function specific to the engine, like playing a sound, drawing a graphic, or saving the game.
+
+When disassembling, you will need to create an instance of the correct instruction type for each of your instructions. For this purpose, \code{Disassembler} defines a factory \code{\_instFactory} where you can register your classes with an integer key. To do this, call \code{\_instFactory.addEntry<Type>(key)} in the constructor for your disassembler, where \code{Type} is the name of the type to register, and \code{key} is an integer key. To create an instance of the appropriate type, simply call \code{\_instFactory.create} with your key.
+
+By default, only some of the instruction types are registered automatically; for other types, you will have to register your own classes since it is not possible to fully define them in a generic fashion. The pre-registered types are \code{kBinaryOpInst}, \code{kBoolNegateInst}, \code{kDupInst}, \code{kKernelCallInst}, \code{kReturnInst}, \code{kUnaryOpPreInst} and \code{kUnaryOpPostInst}, but you can substitute any or all of these by registering your own type with the same key.
+
+For some engines, you will need to go beyond the existing types to implement very special behavior. In those cases, simply define a \code{const int} and give it a value to be able to use this as a new key. Use the value \code{kFirstCustomInst} for your first custom instruction type, and continue from there for new keys, as this will prevent conflict with future instruction types.
+
+When at all possible, you should inherit from one of the more specific types, rather than inheriting directly from \code{Instruction}.
+
+\subsection{Parameters and values}
 \label{sec:parameter}
-Parameters are stored using a tagged union - one field (\code{\_type}) tells you which data type is being stored, and another field (\code{\_value}) stores the actual value.
+Instruction parameters are stored using a hierarchy of \code{Value} types. Several types are predefined in \code{value.h}, and you can declare new types if you need to (e.g. a list of values).
 
-Three convenience methods are provided to extract the value, \code{getSigned}, \code{getUnsigned} and \code{getString}. Please note: if an incorrect method is called, an exception is thrown.
+\code{Value} types are also used during code generation, so you can reuse your parameter values directly.
 
-Although there are only 3 get methods, there are 7 different parameter types. This additional distinction is intended for you to use as you see fit, in case it is useful as metadata somewhere in your engine-specific code.
+All \code{Value} types must define a \code{print} function which prints themselves to a \code{std::ostream}. This is used not only for code generation, but also for disassembly and control flow output.
 
-If you need to store different types than those already allowed, add the new type to the list of type parameters for the \code{\_value} field and add another enumeration value to \code{ParamType}. You should make the new type \emph{output streamable}--that is, allow it to be used like \code{std::cout << value}. This allows the value to be output directly to an output stream regardless of its type.
+For direct values, you should also override the \code{dup} function to create a copy of your class. The default implementation is tailored for values that represent expressions, and will therefore output an assignment to show that the result of an expression is being duplicated.
 
-Note: When storing 8 or 16-bit unsigned values in the \code{\_value} field, cast them to an \code{uint32} when doing the assignment, or you will not be able to extract the value using \code{getUnsigned}. This is a limitation caused by the automatic type conversion algorithm used by C++.
+For more details, see Section~\vref{sec:stackvalues}, where Values are discussed with respect to code generation, and a list of predefined value types is given.
 
 \subsection{The Disassembler class}
 All disassemblers must inherit, directly or indirectly, from the \code{Disassembler} class. This is an abstract class providing an interface for disassemblers.
@@ -64,6 +113,7 @@
 	Common::File _f;
 	InstVec &_insts;
 	uint32 _addressBase;
+	ObjectFactory<int, Instruction> _instFactory;
 
 	virtual void doDisassemble() throw(std::exception) = 0;
 	virtual void doDumpDisassembly(std::ostream &output);
@@ -85,11 +135,13 @@
 
 \code{\_addressBase} is provided as a convenience if your engine does not consider the first instruction to be located at address 0. Assign the expected base address to this field, and make sure that the addresses you assign to the instructions are relative to this base address. This is mainly useful if your engine supports jumps or other references to absolute addresses in the script; if only relative addresses are used, the base address will not be relevant.
 
+\code{\_instFactory} is the factory used to create the appropriate Instruction subclasses. For details, see Section~\vref{sec:insttype}.
+
 \code{doDisassemble} is the method used to perform the actual disassembly, so this method must be implemented by all disassemblers.
 
-\code{disassemble} simply calls the \code{doDisassemble} method to perform the disassembly. The result is cached, so if this method is called twice, \code{doDisassemble} will only be called the first time.
+\code{disassemble} simply calls the \code{doDisassemble} method to perform the disassembly. The result is cached, so if this method is called twice, it won't perform disassembly again.
 
-Finally, \code{dumpDisassembly} is used to output the instructions in a human-readable format to a file or stdout, performing a disassembly first if required, and then calls \code{doDumpDisassembly} to perform the actual output. A default implementation is provided for \code{doDumpDisassembly}, but you can override it if the standard output format is not suitable for your particular engine.
+Finally, \code{dumpDisassembly} outputs the instructions in a human-readable format to a file or stdout. It performs a disassembly first if required, and then calls \code{doDumpDisassembly} to perform the actual output. \code{doDumpDisassembly} simply outputs each instruction in turn, using the printing function associated with each instruction. If you want to customize the way instructions are output, you should ideally create new Instruction subclasses and override their printing function, as the same format is used when dumping a code flow graph. If you just want to prepend or append some additional information to the dump, you can override \code{doDumpDisassembly} instead.
 
 \subsection{The SimpleDisassembler class}
 \label{sec:simpledisasm}
@@ -129,13 +181,13 @@
 \end{lstlisting}
 \end{C++}
 
-To define an opcode, use the \code{OPCODE} macro. This macro takes 5 parameters: the opcode value, the name of the instruction, the type of instruction, the net effect on the stack, and a string describing the parameters that are part of the instruction. We will start by implementing the \code{POP} and \code{POP2} opcodes:
+To define an opcode, use the \code{OPCODE} macro. This macro takes 5 parameters: the opcode value, the name of the instruction, the key associated with the specific type of instruction, the net effect on the stack, and a string describing the parameters that are part of the instruction. We will start by implementing the \code{POP} and \code{POP2} opcodes:
 
 \begin{C++}
 \begin{lstlisting}
 START_OPCODES;
-	OPCODE(0x01, "POP", kStackInstType, -1, "");
-	OPCODE(0x03, "POP2", kStackInstType, -2, "");
+	OPCODE(0x01, "POP", kStackInst, -1, "");
+	OPCODE(0x03, "POP2", kStackInst, -2, "");
 END_OPCODES;
 \end{lstlisting}
 \end{C++}
@@ -165,7 +217,7 @@
 \label{tbl:paramtypes}
 \end{table}
 
-To help you remember these meanings, little-endian values are encoded using lower case ("small letters", i.e. little), while big-endian values are encoded using upper case ("big" letters). The exception here is a single byte, since endianness has no effect for individual bytes. Here, the mnemonic is that an unsigned byte ("B") has a larger maximum value. For the other letters, "s" was used because it is the first letter in "short", which is usually a 16-bit signed value in C. Similarly, "i" is short for "int". "w" and "d" come from the terms "word" and "dword", which are terms for 16-bit and 32-bit unsigned types on an x86 platform.
+To help you remember these meanings, little-endian values are encoded using lower case ("small letters", i.e. little), while big-endian values are encoded using upper case ("big" letters). The exception here is a single byte, since endianness has no effect for individual bytes. Here, the mnemonic is that an unsigned byte ("B") has a larger maximum value. For the other letters, "s" was used because it is the first letter in "short", which is usually a 16-bit signed value in C. Similarly, "i" is short for "int". "w" and "d" come from the terms "word" and "dword", which are terms for 16-bit and 32-bit unsigned types on the x86 platform.
 
 Note that strings are not supported by default. To add reading of a string type, you can override the \code{readParameter} function to add your own types:
 
@@ -200,11 +252,11 @@
 \begin{C++}
 \begin{lstlisting}
 START_OPCODES;
-	OPCODE(0x00, "PUSH", kStackInstType, 1, "B");
-	OPCODE(0x01, "POP", kStackInstType, -1, "");
-	OPCODE(0x02, "PUSH", kStackInstType, 1, "w");
-	OPCODE(0x03, "POP2", kStackInstType, -2, "");
-	OPCODE(0x80, "PRINT", kSpecialCallInstType, 0, "c");
+	OPCODE(0x00, "PUSH", kStackInst, 1, "B");
+	OPCODE(0x01, "POP", kStackInst, -1, "");
+	OPCODE(0x02, "PUSH", kStackInst, 1, "w");
+	OPCODE(0x03, "POP2", kStackInst, -2, "");
+	OPCODE(0x80, "PRINT", kKernelCallInst, 0, "c");
 END_OPCODES;
 \end{lstlisting}
 \end{C++}
@@ -219,13 +271,13 @@
 \begin{C++}
 \begin{lstlisting}
 START_OPCODES;
-	OPCODE(0x00, "PUSH", kStackInstType, 1, "B");
-	OPCODE(0x01, "POP", kStackInstType, -1, "");
-	OPCODE(0x02, "PUSH", kStackInstType, 1, "w");
-	OPCODE(0x03, "POP2", kStackInstType, -2, "");
-	OPCODE(0x80, "PRINT", kSpecialCallInstType, 0, "c");
+	OPCODE(0x00, "PUSH", kStackInst, 1, "B");
+	OPCODE(0x01, "POP", kStackInst, -1, "");
+	OPCODE(0x02, "PUSH", kStackInst, 1, "w");
+	OPCODE(0x03, "POP2", kStackInst, -2, "");
+	OPCODE(0x80, "PRINT", kKernelCallInst, 0, "c");
 	START_SUBOPCODE(0xFF);
-		OPCODE(0x00, "HALT", kSpecialCallInstType, 0, "");
+		OPCODE(0x00, "HALT", kKernelCallInst, 0, "");
 	END_SUBOPCODE;
 END_OPCODES;
 \end{lstlisting}
@@ -245,7 +297,7 @@
 \begin{C++}
 \begin{lstlisting}
 START_OPCODES;
-	OPCODE_MD(0x14, "add", kBinaryOpInstType, -1, "", "+");
+	OPCODE_MD(0x14, "add", kBinaryOpInst, -1, "", "+");
 END_OPCODES;
 \end{lstlisting}
 \end{C++}
@@ -267,4 +319,4 @@
 
 \code{OPCODE\_BASE} automatically keeps track of the current opcode value. You can access \code{full\_opcode} to get the current full opcode. Alternatively, you can use the \code{OPCODE\_BODY} macro to use the standard behavior for opcodes, and then follow that with the additional code you want. The \code{OPCODE\_BODY} macro takes the same arguments as the \code{OPCODE\_MD} macro.
 
-For your convenience, a few additional macros are available: \code{ADD\_INST}, which adds an empty instruction to the vector, and \code{LAST\_INST} which retrieves the last instruction in the vector. Additionally, you can use \code{INC\_ADDR} as a shorthand for incrementing the address variable by 1, but note that you should \emph{not} increment the address for the opcode itself - this is handled by the other macros.
+For your convenience, a few additional macros are available: \code{ADD\_INST}, which adds an empty instruction of the provided type to the vector, and \code{LAST\_INST}, which retrieves the last instruction in the vector. Additionally, you can use \code{INC\_ADDR} as a shorthand for incrementing the address variable by 1, but note that you should \emph{not} increment the address for the opcode itself; this is handled by the other macros.

Modified: tools/branches/gsoc2010-decompiler/decompiler/doc/engine.tex
===================================================================
--- tools/branches/gsoc2010-decompiler/decompiler/doc/engine.tex	2010-12-15 01:45:47 UTC (rev 54918)
+++ tools/branches/gsoc2010-decompiler/decompiler/doc/engine.tex	2010-12-15 02:29:01 UTC (rev 54919)
@@ -14,7 +14,6 @@
 \begin{itemize}
 \item \code{getDisassembler}, which takes a reference to the instruction vector to use for storage and creates a disassembler object and returns it. For more on disassemblers, see Section~\vref{sec:disassembler}.
 \item \code{getCodeGenerator}, which takes a reference to the \code{std::ostream} to output the code to and creates a code generator object and returns it. For more on code generators, see Section~\vref{sec:codegen}.
-\item \code{getDestAddress}, which takes a const iterator to a jump instruction as a parameter and returns the address the instruction will jump to if the jump is taken. Unless you do differently in your engine-specific code, this function will only receive jumps as input, so if you can take a shortcut based on that, you are allowed to do that.
 \end{itemize}
 
 Additional methods you can override are:
@@ -24,17 +23,16 @@
 \item \code{postCFG}, which is a post-processing step called after control flow analysis. If you override \code{detectMoreFuncs} to return true, you must also override this function to process any newly found functions. A default implementation which does nothing is already provided in case you do not need to do any post-processing.
 \end{itemize}
 
+Additionally, if your engine is not stack-based, you may not wish to see the stack effect when reviewing the disassembly or code flow graph. You can disable this by calling \code{setOutputStackEffect(false)} from e.g. your Engine constructor. The method is defined in instruction.h, which you will have to include.
+
 It is important to realize that you do not necessarily need to implement a completely new code generator and disassembler for every engine; for variations on the same engine, you can reuse the existing classes and simply send in any extra information required. In particular, code generators are likely to be reusable without change for different versions of the same engine - e.g., the Kyra2 code generator will likely work for all Kyra games.
 
-\subsection{Game information}
-For some engines, it may not be enough to know the engine; some instructions may differ in behavior between different games or variants of a game, for example between talkie or non-talkie versions, or between versions for different platforms.
+For this purpose, the user may optionally specify an \emph{engine variant}, which is a string that will be passed to your Engine. If you make use of this feature, you should also override \code{getVariants} to specify which variants your engine supports. This list will be displayed to the user if they specify an engine while using the \code{-h} option.
 
-The \code{Engine} class contains a field \code{\_isTalkie} which is set to true if the user passed in the \code{-t} switch on the command line. You can check this flag in your engine-specific code if necessary.
+Note that the variant sent into your engine is not validated against your list of supported variants. This keeps the variant logic flexible, and allows you to implement your own fallback logic for unknown variant strings.
 
-In the interest of user friendliness, if the necessary data exists directly in the script file itself, you should use that instead of requiring additional switches to be passed.
+If you can auto-detect the variant from the script file, you should prefer this approach over asking the user to specify the variant.
 
-Note that, at the time of writing, there is no field containing platform information; this must be handled by implementing another engine which passes in relevant information to engine-specific classes.
-
 \subsection{Functions}
 Some engines allow multiple functions in a single script file. Each function must be analyzed separately, but in order to do that, it is of course necessary to know where the functions start and end, and when it is time to actually generate some code, you will want to know a bit about the function as well.
 

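Reviewer's illustration (not part of the commit): the engine variant handling described in engine.tex could look roughly like this. It is a sketch only; the \code{getVariants} signature, the \code{\_variant} field, the variant names and \code{MyEngine} are assumptions made for the example and are not taken from the diff.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily simplified Engine base class: only the variant
// handling described above is modelled here.
class Engine {
public:
	virtual ~Engine() {}

	// Lists the variants shown to the user in the help output.
	// The default implementation reports no variants.
	virtual void getVariants(std::vector<std::string> & /*variants*/) const {}

	std::string _variant; // assumed name for the stored variant string
};

class MyEngine : public Engine {
public:
	virtual void getVariants(std::vector<std::string> &variants) const {
		variants.push_back("floppy");
		variants.push_back("talkie");
	}

	// Fallback logic: the variant string is not validated anywhere else,
	// so anything that is not "talkie" is treated as the floppy variant.
	bool isTalkie() const {
		return _variant == "talkie";
	}
};
```

Because the variant is not checked against the list, an unknown string such as "unknown" simply falls through to the engine's own default, which is the flexibility the documentation mentions.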
Modified: tools/branches/gsoc2010-decompiler/decompiler/doc/overview.tex
===================================================================
--- tools/branches/gsoc2010-decompiler/decompiler/doc/overview.tex	2010-12-15 01:45:47 UTC (rev 54918)
+++ tools/branches/gsoc2010-decompiler/decompiler/doc/overview.tex	2010-12-15 02:29:01 UTC (rev 54919)
@@ -1,6 +1,3 @@
-\section{Note!}
-The decompiler is currently undergoing a pretty major redesign, and accordingly, the documentation may be outdated. Take note of this when reading this document.
-
 \section{Overview}
 The decompilation process consists of a few different steps:
 

Modified: tools/branches/gsoc2010-decompiler/decompiler/doc/preamble.tex
===================================================================
--- tools/branches/gsoc2010-decompiler/decompiler/doc/preamble.tex	2010-12-15 01:45:47 UTC (rev 54918)
+++ tools/branches/gsoc2010-decompiler/decompiler/doc/preamble.tex	2010-12-15 02:29:01 UTC (rev 54919)
@@ -42,6 +42,7 @@
 \usepackage{hyperref}
 %\labelformat{equation}{(#1)}	% Correct equation references !!DOESN'T WORK ATM!!
 \usepackage{verbatim}
+\usepackage{array}
 
 % Listings, for writing code
 \usepackage{listings}
@@ -117,3 +118,6 @@
 %\setlength{\parindent}{0mm}
 
 \newcommand{\code}[1]{\texttt{#1}}
+
+\numberwithin{figure}{section}
+\numberwithin{table}{section}

Modified: tools/branches/gsoc2010-decompiler/decompiler/doc/todo.tex
===================================================================
--- tools/branches/gsoc2010-decompiler/decompiler/doc/todo.tex	2010-12-15 01:45:47 UTC (rev 54918)
+++ tools/branches/gsoc2010-decompiler/decompiler/doc/todo.tex	2010-12-15 02:29:01 UTC (rev 54919)
@@ -14,26 +14,14 @@
 
 As far as I have been able to tell, this optimization really is not used in Kyra, so this will have to be deferred until we have an engine which needs it.
 
-\subsection{Engine-specific arguments}
-It would be good to replace the current --is-talkie flag and make this engine-specific so it is clear what is supported where. Simliar things may be needed for other engines, e.g. a platform switch for Kyra1.
-
-One way to do this is to parse arguments in two passes; one for the generic arguments and one for arguments specific to the specified engine. The engine will then have to pass some information to us so we can use Boost.ProgramOptions to do the actual parsing.
-
-For help texts, we can require that if an engine is specified when using -h, we output the help text for that specific engine. This will avoid output from engines other than the one the user is looking for.
-
 \subsection{Re-enable short-circuit detection}
 Currently the short-circuit detection is disabled because it requires some extra handling in code generation which is not there yet (you have to analyze each jump more closely).
 
-\subsection{Reduce number of parentheses}
-Right now, the decompiler is completely paranoid about parentheses, inserting them pretty much anywhere it can get away with it. This should be improved, so we reduce the number of them; e.g. by checking the type of operands and only adding parentheses if the operand type can cause problems.
+\subsection{Refactor CFG design}
+The CFG analysis, while certainly functional, is not entirely pretty right now.
 
-Similarly, we should probably look into negated expressions to let the decompiler automatically rework what would be \code{!(a == b)} to \code{a != b}. This would also help decrease the number of parentheses.
+It would be a good idea to go over this and see if it can be made better somehow, e.g. by incorporating more of the syntax as nodes in the graph. This might also make it easier to get short-circuiting working correctly.
 
-\subsection{Refactor design}
-While a decompiler is almost by nature going to be somewhat complex, the current design might be a little too complex in a few places.
-
-When time permits, it would be a good idea to take another look at the design and see if and where improvements can be made.
-
 \subsection{Refactor disassemblers to accept a SeekableReadStream}
 It would be desirable if disassemblers accepted a \code{Common::SeekableReadStream} instead of a \code{Common::File}, for easier integration with other tools and possibly ScummVM itself.
 
@@ -43,3 +31,8 @@
 
 \subsection{SCUMM: Rewrite jump 0 at end of script to infinite loop}
 Several SCUMM scripts end with a jump 0, making them infinite loops. It would be nice if this could be expressed accordingly, but this does not appear to be a trivial task; some jump 0s in a script could be expressed as a continue, others cannot.
+
+\subsection{Proper getCondition method on Value}
+For now, it is assumed that conditional jumps leave their condition on the stack, so that it can be retrieved by the generic code generation logic. For non-stack-based engines, it would be nicer if they could just give us a condition to use in an if/while/do-while, instead of being required to place the value on top of the stack.
+
+A very simple way to do this would be to define a getCondition on Value that takes the same parameters as processInst, with a default implementation that just calls processInst and pops and returns the top value from the stack. This means no changes would be required to existing engines, and new engines can override this method as they see fit.
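
Reviewer's illustration (not part of the commit): the proposed default \code{getCondition} can be sketched as follows. Everything here is an assumption made for the example: values are plain strings, \code{processInst} takes only the stack, and both methods sit on a hypothetical \code{CondJump} class so that the default can call \code{processInst} directly. Where the method should actually live is left open by the TODO entry.

```cpp
#include <stack>
#include <string>

typedef std::stack<std::string> ValueStack; // stand-in: values are plain strings here

class CondJump {
public:
	virtual ~CondJump() {}

	// Stack-based behaviour: leaves the jump condition on top of the stack.
	virtual void processInst(ValueStack &stack) = 0;

	// Default: run processInst, then pop and return the condition it left behind.
	virtual std::string getCondition(ValueStack &stack) {
		processInst(stack);
		std::string cond = stack.top();
		stack.pop();
		return cond;
	}
};

// Existing stack-based engine: unchanged, only processInst is implemented.
class StackCondJump : public CondJump {
public:
	virtual void processInst(ValueStack &stack) {
		std::string rhs = stack.top();
		stack.pop();
		std::string lhs = stack.top();
		stack.pop();
		stack.push(lhs + " == " + rhs);
	}
};

// Non-stack-based engine: overrides getCondition and never touches the stack.
class RegisterCondJump : public CondJump {
public:
	virtual void processInst(ValueStack &) {}
	virtual std::string getCondition(ValueStack &) { return "acc != 0"; }
};
```

The generic code generator would then call \code{getCondition} in both cases; the stack-based class keeps working through the default, while the register-based class supplies its condition directly.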

