commit | 944bfc450d2ff7a35690262337efc3a3154cf94f | [log] [tgz] |
---|---|---|
author | Bill Neubauer <wcn@google.com> | Thu Aug 05 10:54:58 2021 -0700 |
committer | Bill Neubauer <wcn@google.com> | Wed Mar 16 15:31:27 2022 -0700 |
tree | 921dc8abe87173ed269cc5c0f56bb7fe11fdb270 | |
parent | ef8812216d51a3a43a3e9409169f8ba06a512766 [diff] |
Change the licenseclassifier return type to a summary structure rather than a raw list of matches. This allows us to return additional metadata about the original input to the classifier. One datum currently included in this summary is the number of lines of input presented to the classifier, which is necessary to identify ranges of text that did not contain a license. This change consolidates the usages of the third party classifier API around a common interface for the compliance team and provides a common stub, eliminating many ad-hoc stub implementations. This CL was tested via global presubmit to ensure usages outside of the compliance team were not negatively impacted. *** Change 388303070 PiperOrigin-RevId: 388973411
The license classifier is a library and set of tools that can analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. These files could be, e.g., LICENSE
files with a single or multiple licenses in it, or source code files with the license text in a comment.
A “confidence level” is associated with each result indicating how close the match was. A confidence level of 1.0
indicates an exact match, while a confidence level of 0.0
indicates that no license was able to match the text.
Adding a new license is straight-forward:
Create a file in licenses/
.
.header
” to it. See licenses/README.md
for more details.Add the license name to the list in license_type.go
.
Regenerate the licenses.db
file by running the license serializer:
$ license_serializer -output licenseclassifier/licenses
Create and run appropriate tests to verify that the license is indeed present.
identify_license
is a command line tool that can identify the license(s) within a file.
$ identify_license LICENSE LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794) LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829) LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
The license_serializer
tool regenerates the licenses.db
archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.
$ license_serializer -output licenseclassifier/licenses
This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.