
StringClassifier

StringClassifier is a library for classifying an unknown text against a set of known texts. The classifier uses the Levenshtein distance algorithm to determine which of the known texts most closely matches the unknown text. The Levenshtein distance is normalized into a “confidence percentage” between 0.0 and 1.0, where 1.0 indicates an exact match and 0.0 indicates a complete mismatch.
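To make the normalization concrete, here is a minimal sketch of turning a Levenshtein distance into a confidence value. The helper names (`levenshtein`, `confidence`, `min3`) are hypothetical, and dividing the distance by the length of the longer text is an assumption made for illustration; the library's own formula may differ.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the minimum number of single-rune insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	// prev is the previous row of the edit-distance table, curr the current one.
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// confidence maps the distance onto [0.0, 1.0].
// Assumption: the distance is divided by the length of the longer text.
func confidence(known, unknown string) float64 {
	longest := len([]rune(known))
	if n := len([]rune(unknown)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1.0 // two empty texts are treated as identical
	}
	return 1.0 - float64(levenshtein(known, unknown))/float64(longest)
}

func main() {
	fmt.Printf("%.2f\n", confidence("abc", "abc"))        // prints 1.00
	fmt.Printf("%.2f\n", confidence("abc", "xyz"))        // prints 0.00
	fmt.Printf("%.2f\n", confidence("kitten", "sitting")) // prints 0.57
}
```

Under this assumption, three edits across seven characters give a confidence of about 0.57, while identical texts score 1.0.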

Types of matching

There are two kinds of matching algorithms the string classifier can perform:

  1. Nearest matching, and
  2. Multiple matching.

Normalization

To get the best match, normalizing functions can be applied to the texts. For example, flattening whitespace removes many inconsequential formatting differences that would otherwise lower the matching confidence percentage.

sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower)

The normalizing functions are run on all the known texts that are added to the classifier. They're also run on the unknown text before classification.
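Because `strings.ToLower` is passed to `New` in the example above, a normalizing function appears to have the shape `func(string) string`. The sketch below shows what such functions might do when applied in sequence. Both `flattenWhitespace` and `applyAll` are hypothetical stand-ins written for this illustration, not the library's `FlattenWhitespace` or its internal pipeline.

```go
package main

import (
	"fmt"
	"strings"
)

// flattenWhitespace is a hypothetical normalizer: it collapses every run of
// whitespace into a single space and trims the ends. The library's own
// FlattenWhitespace may behave differently.
func flattenWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// applyAll runs each normalizer over s in the order given.
func applyAll(s string, fns ...func(string) string) string {
	for _, fn := range fns {
		s = fn(s)
	}
	return s
}

func main() {
	known := applyAll("Permission is hereby granted", flattenWhitespace, strings.ToLower)
	unknown := applyAll("PERMISSION  is\n\thereby granted ", flattenWhitespace, strings.ToLower)
	// After normalization the two texts differ only if their words differ.
	fmt.Println(known == unknown) // prints true
}
```

Running the same functions over both the known and the unknown text is what makes the comparison fair: formatting differences are removed from both sides before any distance is computed.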

Nearest matching

A nearest match returns the name of the known text that most closely matches the full unknown text. This is most useful when the unknown text doesn't have extraneous text around it.

Example:

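// IdentifyText logs the known text that most closely matches unknown.
// name is a label for the unknown text, used only in the log message.
func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) {
  m := sc.NearestMatch(unknown)
  log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence)
}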

Multiple matching

Multiple matching identifies all of the known texts that may exist in the unknown text. It can also detect a known text embedded in an unknown text, even if there's extraneous text around the embedded known text. As with nearest matching, a confidence percentage is given for each match.

Example:

log.Printf("The text %q contains:", name)
for _, m := range sc.MultipleMatch(unknown, false) {
  log.Printf("  %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset)
}
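One naive way to picture multiple matching is a sliding window: compare each known text against every same-length window of the unknown text and report the best-scoring offset. The sketch below illustrates that idea only. It is not the library's algorithm, and the `match` type, `findAll`, and the threshold parameter are all hypothetical names introduced here. Offsets are counted in runes.

```go
package main

import (
	"fmt"
	"sort"
)

// match is a hypothetical result type for this sketch.
type match struct {
	Name       string
	Confidence float64
	Offset     int // rune offset of the best window in the unknown text
}

// levenshtein returns the edit distance between two rune slices.
func levenshtein(a, b []rune) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // substitution, or free if the runes are equal
			if a[i-1] != b[j-1] {
				best++
			}
			if d := prev[j] + 1; d < best { // deletion
				best = d
			}
			if d := curr[j-1] + 1; d < best { // insertion
				best = d
			}
			curr[j] = best
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// findAll slides each known text over unknown and keeps the best window
// per known text when its confidence reaches threshold.
func findAll(known map[string]string, unknown string, threshold float64) []match {
	u := []rune(unknown)
	var out []match
	for name, text := range known {
		k := []rune(text)
		if len(k) == 0 || len(k) > len(u) {
			continue
		}
		best := match{Name: name, Confidence: -1}
		for off := 0; off+len(k) <= len(u); off++ {
			d := levenshtein(k, u[off:off+len(k)])
			if c := 1.0 - float64(d)/float64(len(k)); c > best.Confidence {
				best.Confidence, best.Offset = c, off
			}
		}
		if best.Confidence >= threshold {
			out = append(out, best)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Slice(out, func(i, j int) bool { return out[i].Offset < out[j].Offset })
	return out
}

func main() {
	known := map[string]string{
		"grant":  "permission is hereby granted",
		"notice": "licensed under the apache license",
	}
	unknown := "header text. permission is hereby granted. footer"
	for _, m := range findAll(known, unknown, 0.9) {
		fmt.Printf("%q (conf: %.2f, offset: %d)\n", m.Name, m.Confidence, m.Offset)
	}
}
```

This brute-force approach recomputes a full edit distance for every window, so it is far too slow for large texts; it is meant only to show why a known text can still be found when other text surrounds it.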

Disclaimer

This is not an official Google product (experimental or otherwise); it is just code that happens to be owned by Google.