Categorizer setup for loading rules and neural networks

If the user wanted to customizing rules or neural networks for either the Color Particle Analyzer or for the letter matcher it demanded some programming until ShapeLogic 1.6. Now you can do most of this by loading a categorizer file.

Here is a directory with example files: http://code.google.com/p/shapelogic/source/browse/trunk/src/test/resources/data/neuralnetwork/

Defining digit categorizer with a setup file

You can change the CapitalLetterMatch / Stream Vectorizer from a letter matcher to digit matcher by loading this file: shapelogic/src/test/resources/data/neuralnetwork/polygon_digit_recognizer_with_rules_print.txt:

Here are the definitions of the digit 0 and 1 and a list of the geometric features you to get written out:

========== COMMENT
This categorizer setup file is doing the same as the old digit matcher, 
DigitStreamVectorizer_.

========== def 0
holeCount == 1
tJunctionCount == 0
endPointCount == 0
multiLineCount == 1
curveArchCount > 0
hardPointCount == 0
softPointCount > 0

========== def 1
holeCount == 0
tJunctionLeftCount == 0
tJunctionRightCount == 0
endpointBottomPointCount == 1
horizontalLineCount == 0
verticalLineCount == 1
endPointCount == 2
multiLineCount == 0
softPointCount == 0
aspectRatio < 0.4

========== PRINTS
Category
Points
lineCount
holeCount

========== 

The biggest problem is to find the right names for the geometric feature.

Currently it only understand 3 relations >, < and ==.

Defining digit categorizer in DigitStreamVectorizer_

There is a simple one to one mapping of rule definition above and below:

public void makeDigitStream() {

        rule("0", HOLE_COUNT, "==", 1.);
        rule("0", T_JUNCTION_POINT_COUNT, "==", 0.);
        rule("0", END_POINT_COUNT, "==", 0.);
        rule("0", MULTI_LINE_COUNT, "==", 1.);
        rule("0", CURVE_ARCH_COUNT, ">", 0.);
        rule("0", HARD_CORNER_COUNT, "==", 0.);
        rule("0", SOFT_POINT_COUNT, ">", 0.);

        rule("1", HOLE_COUNT, "==", 0.);
        rule("1", T_JUNCTION_LEFT_POINT_COUNT, "==", 0.);
        rule("1", T_JUNCTION_RIGHT_POINT_COUNT, "==", 0.);
        rule("1", END_POINT_BOTTOM_POINT_COUNT, "==", 1.);
        rule("1", HORIZONTAL_LINE_COUNT, "==", 0.);
        rule("1", VERTICAL_LINE_COUNT, "==", 1.);
        rule("1", END_POINT_COUNT, "==", 2.);
        rule("1", MULTI_LINE_COUNT, "==", 0.);
        rule("1", SOFT_POINT_COUNT, "==", 0.);
        rule("1", ASPECT_RATIO, "<", 0.4);
                

EBNF grammar for categorizer setup file

BLOCK_START = "==========" ;

comment-section = "COMMENT" ,  {string} , BLOCK_START ; 

feature-section = "FEATURES" ,  {string} , BLOCK_START ; 

print-section = "PRINTS" , {string} , BLOCK_START ; 

result-section = "RESULTS" , {string} , BLOCK_START ; 

weights-definition = "WEIGHTS" , {number} , BLOCK_START

relation = "<" | ">" | "==" ;

rule-definition = "def" , string , { string , relation , number } , BLOCK_START ;

categorizer-definition = BLOCK_START , 
    {comment-section | feature-section | print-section
    | result-section | weights-definition | rule-definition } ;
    

It is hard to cause syntax errors in this grammar. But a sequence of characters starting with more === will not be considered a string.

Function of different sections

Here is a short description of what the different sections of the categorizer setup files is doing:

Comment section

This is just a comment.

Feature section

This is the the feature data streams that are needed as input for the neural network.

They can take 2 different forms:

  • Name of the Stream
  • Headline if it has been defined for the stream

The streams often have camel case names e.g. aspectRatio starting with lower case.

The headline names are often short names and are often starting with upper case.

This was chosen on purpose to be able to distinguish them.

Result section

A neural network produces a list of number from the last layer.

If there is a match only one of these numbers should be over 0.5, and that is considered the result of the neural network.

The result section assigns names to each of the outputs of this final list. E.g for a digit match there will be 10 output number and they will have names 0, 1, 2, 4, 5, 6, 7, 8 and 9.

Weights definition

Here is an example of a neural network that explains how the weights are organized:

/** Weights found using the Joone Neural Networks.  */
public final static double[][] WEIGHTS_FOR_XOR = {
    {
        //Output 0        , 1                , 2
        2.7388686085992333, 5.505721328606976, 4.235258932026585, //bias
                                                                     //Input
        -6.598582463774703, -3.678198637390036, -2.9604962169635076, // 0
        -6.59030690954159, -3.7790406961228347, -2.845930422442215   // 1
    },
    {   //Output 0
        -5.27100082610628,  //Bias for hidden second layer
                            //Input
        -10.45330943056037, // 0
        6.582922049952558,  // 1
        4.7139611039662945  // 2
    }
};

Rule definition

Here is an example of the rules defining the digit 0:

========== def 0
holeCount == 1
tJunctionCount == 0
endPointCount == 0
multiLineCount == 1
curveArchCount > 0
hardPointCount == 0
softPointCount > 0

holeCount is the name of the data feature stream of hole counts. So the first line will filter away all polygons that do not have a hole count equal to 0.

Finding names of streams

A stream-dictionary of what the different streams do and is called is being worked on now.

Finding names of streams takes a little digging. Here are a few places to start digging:

Look at the files where rules are defined. A good idea is to download the sources and open it using a modern IDE like Eclipse, IntelliJ, JBuilder, NetBeans. Then it is easier to maneuver to see where things are defined.

Not that different streams are defined for the particle analyzer group and letter matcher group.

Definition of basic streams

src/main/java/org/shapelogic/logic/CommonLogicExpressions.java

In this directory src/main/java/org/shapelogic/streamlogic there is several interesting file:

  • LoadLetterStreams.java
  • LoadParticleStreams.java
  • LoadPolygonStreams.java
  • StreamNames.java