| commit | b2b19b3c53333625136af32ef9ebee41f73dc1cd | [log] [tgz] |
|---|---|---|
| author | Bill Neubauer <wcn@google.com> | Wed Jan 12 10:15:37 2022 -0800 |
| committer | Bill Neubauer <wcn@google.com> | Wed Mar 16 15:37:22 2022 -0700 |
| tree | e4dcc1edf9f2b01cc1ca911a6d3e5a793bb928b4 | |
| parent | 4a94a4b2dc7ec0119164ed69649bc96bc2b1025d [diff] |
Fixes handling of newline characters so that Normalize preserves the newline characters of the original input. Includes fixes for OOB array conditions detected by fuzzing and the crash logs from the service. The code doing the tokenization of the newlines had some minor bugs that resulted in spurious newlines being introduced into the token stream. This wasn't a problem before since they were only used inside the tokenizer to detect header constructs and de-hyphenate words, and were always removed from the token stream passed to calling functions. This meant that the token stream Normalize was reassembling had no newlines so it just produced a big line of text. Fixing Normalize to preserve the original newlines required fixing these glitches. Added more test cases to cover the different newline-related scenarios and new scenario files based on fuzzer findings. *** PiperOrigin-RevId: 421331357
The license classifier is a library and set of tools that can analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. These files could be, e.g., LICENSE files with a single or multiple licenses in it, or source code files with the license text in a comment.
A “confidence level” is associated with each result indicating how close the match was. A confidence level of 1.0 indicates an exact match, while a confidence level of 0.0 indicates that no license was able to match the text.
Adding a new license is straight-forward:
Create a file in licenses/.
.header” to it. See licenses/README.md for more details.Add the license name to the list in license_type.go.
Regenerate the licenses.db file by running the license serializer:
$ license_serializer -output licenseclassifier/licenses
Create and run appropriate tests to verify that the license is indeed present.
identify_license is a command line tool that can identify the license(s) within a file.
$ identify_license LICENSE LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794) LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829) LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
The license_serializer tool regenerates the licenses.db archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.
$ license_serializer -output licenseclassifier/licenses
This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.