| # StringClassifier |
| |
| StringClassifier is a library to classify an unknown text against a set of known |
| texts. The classifier uses the [Levenshtein Distance] algorithm to determine |
| which of the known texts most closely matches the unknown text. The Levenshtein |
| Distance is normalized into a "confidence percentage" between 0.0 and 1.0, where |
| 1.0 indicates an exact match and 0.0 indicates a complete mismatch. |
| |
| [Levenshtein Distance]: https://en.wikipedia.org/wiki/Levenshtein_distance |
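
As a rough illustration of how an edit distance can be turned into a confidence value, here is a minimal, self-contained sketch. The helper names `levenshtein` and `confidence` are hypothetical, and the formula `1 - distance / max(len(a), len(b))` is an assumed normalization chosen for illustration; the library's actual normalization may differ.

```go
package main

import "fmt"

// levenshtein returns the edit distance between a and b, counted in runes.
// It keeps only two rows of the dynamic-programming table at a time.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// confidence maps the distance into [0.0, 1.0] using an ASSUMED formula:
// 1 - distance / length of the longer text. Two empty texts score 1.0.
func confidence(a, b string) float64 {
	la, lb := len([]rune(a)), len([]rune(b))
	longest := la
	if lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1.0
	}
	return 1.0 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Println(confidence("kitten", "kitten"))  // identical texts
	fmt.Println(confidence("kitten", "sitting")) // three edits over seven runes
	fmt.Println(confidence("abc", "xyz"))        // every rune differs
}
```

Under this assumed formula, identical texts score 1.0 and texts that share no aligned characters score 0.0, which matches the range described above.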
| |
| ## Types of matching |
| |
| There are two kinds of matching algorithms the string classifier can perform: |
| |
| 1. [Nearest matching](#nearest), and |
| 2. [Multiple matching](#multiple). |
| |
| ### Normalization |
| |
| To get the best match, normalizing functions can be applied to the texts. For |
| example, flattening whitespace removes many inconsequential formatting |
| differences that would otherwise lower the matching confidence percentage. |
| |
| ```go |
| sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower) |
| ``` |
| |
| The normalizing functions are run on every known text added to the classifier |
| and on the unknown text before classification. |
| |
| ### Nearest matching {#nearest} |
| |
| A nearest match returns the name of the known text that most closely matches the |
| full unknown text. This is most useful when the unknown text doesn't have |
| extraneous text around it. |
| |
| Example: |
| |
| ```go |
| func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) { |
| 	m := sc.NearestMatch(unknown) |
| 	log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence) |
| } |
| ``` |
| |
| ### Multiple matching {#multiple} |
| |
| Multiple matching identifies all of the known texts that may exist in the |
| unknown text. It can detect a known text inside an unknown text even when |
| extraneous text surrounds it. As with nearest matching, a confidence |
| percentage is given for each match. |
| |
| Example: |
| |
| ```go |
| log.Printf("The text %q contains:", name) |
| for _, m := range sc.MultipleMatch(unknown, false) { |
| 	log.Printf(" %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset) |
| } |
| ``` |
| |
| ## Disclaimer |
| |
| This is not an official Google product (experimental or otherwise); it is just |
| code that happens to be owned by Google. |