The stroke recognition works using a fairly simple model: the software always knows which stroke it is expecting next. As soon as you put the pen down and start drawing, the OS sends a stream of coordinates to the software, which uses the following logic to decide whether you have got the stroke "right": (i) if the current position of the pen is more than N pixels away from the closest point of the stroke, then the stroke has failed (in other words, the pen must always stay in the blue zone in the diagram); (ii) if the pen is within M pixels of any of the stroke's pivotal points (corners), then that pivotal point is marked as "visited" (in other words, the pen must visit every red zone); (iii) when the pen is lifted, if the stroke has not already failed (by leaving the blue zone) and every pivotal (red) point has been visited, then the stroke has succeeded.
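The three rules above can be sketched in Java roughly as follows. This is only an illustration, not the project's actual code: the class and method names are hypothetical, and the stroke is assumed to be available as a dense list of sample points (blue zone) plus a list of corner points (red zones).

```java
import java.awt.geom.Point2D;
import java.util.List;

/** Hypothetical sketch of the stroke matcher described above. */
public class StrokeMatcher {
    private final List<Point2D> strokePoints;  // dense samples along the expected stroke
    private final List<Point2D> pivotalPoints; // corners, i.e. the red zones
    private final boolean[] visited;           // which red zones the pen has entered
    private final double n;                    // blue-zone radius around the stroke
    private final double m;                    // red-zone radius around each corner
    private boolean failed = false;

    public StrokeMatcher(List<Point2D> strokePoints, List<Point2D> pivotalPoints,
                         double n, double m) {
        this.strokePoints = strokePoints;
        this.pivotalPoints = pivotalPoints;
        this.visited = new boolean[pivotalPoints.size()];
        this.n = n;
        this.m = m;
    }

    /** Called for each pen coordinate as the OS delivers it. */
    public void penMoved(Point2D pen) {
        // (i) fail if the pen strays more than N pixels from the stroke
        double closest = Double.MAX_VALUE;
        for (Point2D p : strokePoints) {
            closest = Math.min(closest, p.distance(pen));
        }
        if (closest > n) {
            failed = true;
        }
        // (ii) mark any pivotal point within M pixels as visited
        for (int i = 0; i < pivotalPoints.size(); i++) {
            if (pivotalPoints.get(i).distance(pen) <= m) {
                visited[i] = true;
            }
        }
    }

    /** (iii) on pen-up: success iff the pen stayed in the blue zone
     *  and entered every red zone. */
    public boolean penLifted() {
        if (failed) {
            return false;
        }
        for (boolean v : visited) {
            if (!v) {
                return false;
            }
        }
        return true;
    }
}
```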
Note that it would be better if the direction of the stroke were also taken into account; this could be implemented simply by requiring the red zones to be visited in the correct order.
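One hedged sketch of that ordering check, again with hypothetical names: step (ii) changes so that a red zone only counts as visited once all of its predecessors have been, which forces the pen to travel along the stroke in the intended direction.

```java
import java.awt.geom.Point2D;
import java.util.List;

public class OrderedVisit {
    /** Variant of step (ii): a pivotal point is only marked visited
     *  once every earlier pivotal point has already been visited. */
    public static void markVisited(List<Point2D> pivots, boolean[] visited,
                                   Point2D pen, double m) {
        for (int i = 0; i < pivots.size(); i++) {
            boolean inOrder = (i == 0) || visited[i - 1];
            if (inOrder && pivots.get(i).distance(pen) <= m) {
                visited[i] = true;
            }
        }
    }
}
```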
The "mandarin1" dataset currently has 320 words, and members of the project are gradually adding more. We are roughly following Elizabeth Scurfield's Teach Yourself Chinese, but it doesn't really matter which course you are following: the first few hundred characters a beginner needs are essentially the same in any course. See below if you'd like to join us, or to make your own datasets. You could, for example, make a dataset which corresponds better to your textbook, or which uses Cantonese pronunciations for the characters.
The characters are stored as a set of line segments and Bezier curves, making up strokes. As we all learn Chinese, we add to the datasets using some simple Java-based software, which involves manually tracing over large characters rendered with a TrueType font. We are slowly working towards a goal of the 1000 most-used words/characters. Ultimately, I think it would be excellent to have several different datasets. If any students or speakers of Chinese-character-based languages would like to build datasets (which would basically involve providing definitions and pronunciations for characters that have already been traced, since luckily there is much overlap), then by all means please get involved.
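To make the storage model concrete, here is a minimal Java sketch of how such a dataset entry might look. The class and field names are my own invention, not the project's actual format; it just shows a character as an ordered list of strokes, each stroke being a run of line segments and Bezier curves (here AWT's Line2D and CubicCurve2D).

```java
import java.awt.Shape;
import java.awt.geom.CubicCurve2D;
import java.awt.geom.Line2D;
import java.util.List;

/** Hypothetical model of one dataset entry: a character, its pronunciation
 *  and definition, and its strokes as ordered runs of segments and curves. */
public class CharacterData {
    public final String glyph;          // the character itself, e.g. "一"
    public final String pronunciation;  // e.g. Pinyin "yī"
    public final String definition;     // e.g. "one"
    public final List<List<Shape>> strokes; // each stroke: segments/curves in drawing order

    public CharacterData(String glyph, String pronunciation, String definition,
                         List<List<Shape>> strokes) {
        this.glyph = glyph;
        this.pronunciation = pronunciation;
        this.definition = definition;
        this.strokes = strokes;
    }

    public int strokeCount() {
        return strokes.size();
    }
}
```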