commit | 990cf094ec6e7b1a2d9f32f0bce874b3d4b85297 | [log] [tgz] |
---|---|---|
author | Bill Neubauer <wcn@google.com> | Mon Feb 14 18:43:28 2022 -0800 |
committer | Bill Neubauer <wcn@google.com> | Wed Mar 16 15:38:47 2022 -0700 |
tree | eb2e9e9e09de0b9db8adbfbd9f94d97b2664c1dc | |
parent | f20db3b6f339e35246aac1036d696aec48edd6fa [diff] |
Modify the classfier to use the new asset layout for input files. assets was created in cl/427532245 creating a directory structure to more easily manage content extensibly. The structure of a filename is now Category/Name_of_content/name_of_variant.txt Accordingly, all licenses are in License, headers are in Header, and new content types can be made just by adding the directory in assets. All variants in a content directory are a match on the content, which gets its name from the directory. Future work will involve placing pristine license copies (called pristine.txt) in each directory, so we have a clear reference copy that can be made available for other purposes. This solves an issue where modified copies of licenses were getting pulled into manifests for cloud projects. The changes to the code replace the filename parsing scheme with the new directory-based scheme. This also fixes a problem with LicenseRef-iccjpeg in that the original license file didn't end with .txt and wasn't actually picked up by the classifer. expected.txt was modified to pick up these new hits. PiperOrigin-RevId: 428660046
The license classifier is a library and set of tools that can analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. These files could be, e.g., LICENSE
files with a single or multiple licenses in it, or source code files with the license text in a comment.
A “confidence level” is associated with each result indicating how close the match was. A confidence level of 1.0
indicates an exact match, while a confidence level of 0.0
indicates that no license was able to match the text.
Adding a new license is straight-forward:
Create a file in licenses/
.
.header
” to it. See licenses/README.md
for more details.Add the license name to the list in license_type.go
.
Regenerate the licenses.db
file by running the license serializer:
$ license_serializer -output licenseclassifier/licenses
Create and run appropriate tests to verify that the license is indeed present.
identify_license
is a command line tool that can identify the license(s) within a file.
$ identify_license LICENSE LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794) LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829) LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
The license_serializer
tool regenerates the licenses.db
archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.
$ license_serializer -output licenseclassifier/licenses
This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.