Fix the induced phrases search to not trigger on modified URLs.

URLs are stored as a single token by the classifier, so a changed URL can
introduce only 1 error during matching. This is good for approximate matching,
since a lengthy URL that is changed would otherwise introduce additional
errors.

However, this doesn't work well with the induced phrases check, since a modified
URL is stored as an insert/delete pair that might contain the triggering
word (e.g. "apache"). As a result, a license whose Apache URL differed from the
one in our pristine copies was rejected because it "introduced" the word
"apache".

This fixes the logic to not trigger an induced phrase condition if the insert
is paired with a delete that also contains the induced phrase, since this means
it did exist in the document after all. Diagnostic logging for the diffing
phase now includes output to help triage these conditions.

This proved very useful in identifying older Apache licenses that were
incorrectly rejected and that sometimes barely matched non-applicable licenses
instead.
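
The pairing rule described in this message can be illustrated with a minimal Go sketch. It is not the code from the changed file: the `diff` and `diffOp` types, the `inducesPhrase` function, and the assumption that a paired delete immediately precedes its insert are all hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// diffOp and diff are hypothetical stand-ins for the classifier's diff
// representation; the real types may differ.
type diffOp int

const (
	opEqual diffOp = iota
	opDelete
	opInsert
)

type diff struct {
	Op   diffOp
	Text string
}

// inducesPhrase reports whether some insert introduces phrase. An insert is
// ignored when the diff right before it is a delete that also contains the
// phrase, because the phrase then already existed in the known license text.
// Assumption: a paired delete directly precedes its insert.
func inducesPhrase(diffs []diff, phrase string) bool {
	for i, d := range diffs {
		if d.Op != opInsert || !strings.Contains(d.Text, phrase) {
			continue
		}
		if i > 0 && diffs[i-1].Op == opDelete && strings.Contains(diffs[i-1].Text, phrase) {
			continue // modified token (e.g. a changed URL), not a new phrase
		}
		return true
	}
	return false
}

func main() {
	// A URL token that changed: stored as a delete followed by an insert.
	modifiedURL := []diff{
		{Op: opEqual, Text: "you may obtain a copy of the license at"},
		{Op: opDelete, Text: "http://www.apache.org/licenses/license-2.0"},
		{Op: opInsert, Text: "https://www.apache.org/licenses/license-2.0.txt"},
	}
	// The word appears only in inserted text.
	newWord := []diff{
		{Op: opEqual, Text: "licensed under the"},
		{Op: opInsert, Text: "apache"},
	}
	fmt.Println(inducesPhrase(modifiedURL, "apache")) // false
	fmt.Println(inducesPhrase(newWord, "apache"))     // true
}
```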

PiperOrigin-RevId: 407908906
1 file changed
tree: b87cbb1550a112353e59d493ef1c373c038a41db
  1. commentparser/
  2. internal/
  3. licenses/
  4. serializer/
  5. stringclassifier/
  6. tools/
  7. v2/
  8. .travis.yml
  9. CHANGELOG
  10. classifier.go
  11. classifier_test.go
  12. CONTRIBUTING.md
  13. file_system_resources.go
  14. forbidden.go
  15. go.mod
  16. go.sum
  17. LICENSE
  18. license_type.go
  19. README.md
README.md

License Classifier


Introduction

The license classifier is a library and set of tools that can analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. These files could be, e.g., LICENSE files containing one or more licenses, or source code files with the license text in a comment.

A “confidence level” is associated with each result, indicating how close the match was. A confidence level of 1.0 indicates an exact match, while a confidence level of 0.0 indicates that no license matched the text.
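
As a rough illustration of how a caller might act on confidence levels, the sketch below keeps only results at or above a cutoff. The `match` type, the `filterByConfidence` function, and the 0.8 cutoff are assumptions made for this example; they are not the library's API or a recommended threshold.

```go
package main

import "fmt"

// match is a hypothetical result type used only in this sketch.
type match struct {
	Name       string
	Confidence float64 // 1.0 is an exact match, 0.0 means nothing matched
}

// filterByConfidence returns the matches whose confidence is at least
// threshold, preserving their original order.
func filterByConfidence(matches []match, threshold float64) []match {
	var kept []match
	for _, m := range matches {
		if m.Confidence >= threshold {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	matches := []match{
		{Name: "MIT", Confidence: 1.0},
		{Name: "BSD-3-Clause", Confidence: 0.42},
	}
	// 0.8 is an arbitrary cutoff chosen for illustration.
	for _, m := range filterByConfidence(matches, 0.8) {
		fmt.Printf("%s (confidence: %v)\n", m.Name, m.Confidence)
	}
}
```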

Adding a new license

Adding a new license is straightforward:

  1. Create a file in licenses/.

    • The filename should be the name of the license or its abbreviation. If the license is an Open Source license, use the appropriate identifier specified at https://spdx.org/licenses/.
    • If the license is the “header” version of the license, append the suffix “.header” to it. See licenses/README.md for more details.
  2. Add the license name to the list in license_type.go.

  3. Regenerate the licenses.db file by running the license serializer:

    $ license_serializer -output licenseclassifier/licenses
    
  4. Create and run appropriate tests to verify that the license is indeed present.
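
The “.header” naming convention from step 1 can be sketched in Go as follows. The `splitLicenseName` helper is hypothetical and not part of the library, and it assumes the name is passed without any file extension; see licenses/README.md for the authoritative naming rules.

```go
package main

import (
	"fmt"
	"strings"
)

// headerSuffix marks the "header" version of a license, per the steps above.
const headerSuffix = ".header"

// splitLicenseName is a hypothetical helper. It returns the base license
// identifier and whether the name refers to the header version.
// Assumption: name carries no file extension.
func splitLicenseName(name string) (base string, isHeader bool) {
	if strings.HasSuffix(name, headerSuffix) {
		return strings.TrimSuffix(name, headerSuffix), true
	}
	return name, false
}

func main() {
	for _, name := range []string{"Apache-2.0", "Apache-2.0.header"} {
		base, isHeader := splitLicenseName(name)
		fmt.Printf("%s: base=%s header=%t\n", name, base, isHeader)
	}
}
```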

Tools

Identify license

identify_license is a command-line tool that can identify the license(s) within a file.

$ identify_license LICENSE
LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794)
LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829)
LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
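
If the output above needs to be consumed by another program, a small parser along these lines may help. This is a sketch that assumes every result line has exactly the shape shown in the sample; the tool may print other forms that this pattern does not handle. The `result` type, `lineRE` pattern, and `parseLine` function are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// result holds the fields of one output line. Hypothetical type for this sketch.
type result struct {
	File       string
	License    string
	Confidence float64
	Offset     int
	Extent     int
}

// lineRE matches lines shaped like the sample output:
//   FILE: NAME (confidence: C, offset: O, extent: E)
var lineRE = regexp.MustCompile(`^(.+): (\S+) \(confidence: ([0-9.]+), offset: (\d+), extent: (\d+)\)$`)

// parseLine converts one output line into a result, or returns an error if
// the line does not have the expected shape.
func parseLine(line string) (result, error) {
	m := lineRE.FindStringSubmatch(line)
	if m == nil {
		return result{}, fmt.Errorf("unrecognized line: %q", line)
	}
	conf, err := strconv.ParseFloat(m[3], 64)
	if err != nil {
		return result{}, err
	}
	offset, err := strconv.Atoi(m[4])
	if err != nil {
		return result{}, err
	}
	extent, err := strconv.Atoi(m[5])
	if err != nil {
		return result{}, err
	}
	return result{File: m[1], License: m[2], Confidence: conf, Offset: offset, Extent: extent}, nil
}

func main() {
	r, err := parseLine("LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s %s %.2f %d %d\n", r.File, r.License, r.Confidence, r.Offset, r.Extent)
}
```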

License serializer

The license_serializer tool regenerates the licenses.db archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.

$ license_serializer -output licenseclassifier/licenses

This is not an official Google product (experimental or otherwise); it is just code that happens to be owned by Google.