fix: Apply regexes to the actual line content (separated by newlines) instead of the joinedLines. (#119)
Now users can craft regexes that would match just the first line of the
group or block (or any arbitrary line they desire) which is basically
impossible to do right now.
I made the decision to opt all by_regex regexes into the dotall flag
`(?s)` that allows `.` to match `\n`. This is to minimize migration
pain, and to remove the need for users to set that flag themselves
because I imagine users are just about always going to want that flag.
1. Today, by_regex is comparing against joinedLines. Many internal users
have already crafted regexes that accidentally span lines. Enabling
dotall lets most of those regexes work as is after this PR
2. Either users are sorting individual lines of content, in which they
don't care about the dotall flag, or they're sorting groups of lines in
which case they will _probably_ prefer dotall so they don't need to
explicitly match newlines with `\s` or `\n`.
3. If a user _doesn't_ want this behavior, they can explicitly disable
dotall in their regex with `(?-s)`
The primary alternative I considered to this would be to make this
(technically) new behavior opt-in instead of forcing it on all users. I
played around with the idea of re-using the regex's mutiline flag for
that. If the regex set multiline at the top level, it would run against
newline joined content. Otherwise it would continue to run against the
previous joinedLines content. I decided against this route for two main
reasons
1. It felt kind of hacky to borrow the multiline flag like that for this
purpose. It has other implications for how the regex is processed, and
conceivably a user would want to process newline joined content without
those implications
2. I think the newline joined content will be more intuitive to users
long-term. One of my internal CLs raised some questions about how
by_regex was working because the reviewer thought it would run against
the raw content.
- Arguably, there's still some weirdness here since we're trimming
leading whitespace from lines. I'm hoping that doesn't confuse users too
much
----
There are >1000 uses of by_regex internally. From the internal
regression test, without `(?s)` we got
```
devtools/keep_sorted/regression_test.go:99: 2 unordered at head, ordered differently with pending and live keep-sorted
devtools/keep_sorted/regression_test.go:99: 27 ordered at head, unordered with pending keep-sorted
```
with `(?s)` we cut that in half:
```
devtools/keep_sorted/regression_test.go:99: 14 ordered at head, unordered with pending keep-sorted
```
All 14 of those changed files are a subset of the original 29.
While auditing those 14, there were 4 classes of error:
- [2] User error: The user wrote a bad regex that was working by
accident
- [3] Lines joined w/o a space: The user wrote a regex that implicitly
relied on the fact that we sometimes joined lines without a space
between them (thanks [Hyrum](https://www.hyrumslaw.com/)).
- These errors can be fixed by inserting a `\s*` at the appropriate spot
- [3] Lines joined w/ a literal space: The user wrote a regex that
implicitly relied on the fact that we sometimes joined lines with a
literal space instead of a newline
- These errors can be fixed by replacing a literal space with the
whitespace character class: `\s`
- [6] Users were trying to sort a list of things and some of those
elements were a single line while some of those elements were split
across multiple lines. This led me to realize that we should be applying
the `joinLines` transform _after_ the regex has matched instead of
omitting it entirely. That tweak resolved these diffs.diff --git a/README.md b/README.md
index a8ef68c..fc62a8e 100644
--- a/README.md
+++ b/README.md
@@ -568,6 +568,11 @@
expressions] that will be applied to the group, and then sorting
will take place on just the results of the regular expressions.
+> [!NOTE]
+> By default, keep-sorted adds the `s` flag to all regular expressions that
+> allows `.` to also match `\n`. If you do not want that behavior, add `(?-s)`
+> to the start of your regular expression.
+
> [!TIP]
> Regular expressions often need special characters. See [Syntax](#syntax) below
> for how to include special characters in the `by_regex` option.
diff --git a/keepsorted/line_group.go b/keepsorted/line_group.go
index a637de8..3afb005 100644
--- a/keepsorted/line_group.go
+++ b/keepsorted/line_group.go
@@ -119,7 +119,7 @@
// Returns another boolean indicating whether the group should be ending
// after that line if so.
shouldAddToRegexDelimitedGroup := func(l string) (addToGroup bool, finishGroupAfter bool) {
- if metadata.opts.GroupStartRegex != nil {
+ if metadata.opts.GroupStartRegex != nil {
// For GroupStartRegex, all non-regex-matching lines should be
// part of the group including prior lines.
return !matchesAnyRegex(l, metadata.opts.GroupStartRegex), false
@@ -349,7 +349,7 @@
cb.braceCounts = make(map[string]int)
}
- // TODO(jfalgout): Does this need to handle runes more correctly?
+ // TODO: jfaer - Does this need to handle runes more correctly?
for i := 0; i < len(s); {
if cb.expectedQuote == "" {
// We do not appear to be inside a string literal.
@@ -419,8 +419,25 @@
}
func (lg *lineGroup) regexTokens() []regexToken {
- // TODO: jfaer - Should we match regexes on the original content?
- regexMatches := lg.opts.matchRegexes(lg.internalJoinedLines())
+ var regexMatches []regexMatch
+ if len(lg.opts.ByRegex) == 0 {
+ // We apply other options on top of what the regex extracts, but if the user
+ // didn't set by_regex, we should fall back to the behavior they would've
+ // expected before we started supporting by_regex.
+ // Namely: We would apply the other options on top of the
+ // internalJoinedLines() instead of the raw content.
+ regexMatches = []regexMatch{{lg.internalJoinedLines()}}
+ } else {
+ regexMatches = lg.opts.matchRegexes(lg.regexJoinedLines())
+ // We still want to apply the joinLines transform so that we get
+ // "reasonable human" comparisons if the regex matches more than one line.
+ for _, match := range regexMatches {
+ for i, s := range match {
+ match[i] = joinLines(strings.Split(s, "\n"))
+ }
+ }
+ }
+
ret := make([]regexToken, len(regexMatches))
if lg.access.regexTokens == nil {
lg.access.regexTokens = make([]regexTokenAccessRecorder, len(regexMatches))
@@ -453,10 +470,21 @@
return ret
}
-// internalJoinedLines calculates the same thing as joinedLines, except it
-// doesn't record that it was used in the accessRecorder.
+// internalJoinedLines attempts to concatenate all of this lineGroup's content
+// the same way a reasonable human might.
+//
+// If the previous line ends with a "word" character and the current line starts
+// with a "word" character, the two lines will be separated by a space.
+// Otherwise, the lines are concatenated without any separation.
+//
+// Unlike joinedLines, this method does not record that it was used in the
+// accessRecorder.
func (lg *lineGroup) internalJoinedLines() string {
- if len(lg.lines) == 0 {
+ return joinLines(lg.lines)
+}
+
+func joinLines(lines []string) string {
+ if len(lines) == 0 {
return ""
}
@@ -464,7 +492,7 @@
startsWithWordChar := regexp.MustCompile(`^\w`)
var s strings.Builder
var last string
- for _, l := range lg.lines {
+ for _, l := range lines {
l := strings.TrimLeftFunc(l, unicode.IsSpace)
if len(last) > 0 && len(l) > 0 && endsWithWordChar.MatchString(last) && startsWithWordChar.MatchString(l) {
s.WriteString(" ")
@@ -475,6 +503,25 @@
return s.String()
}
+// regexJoinedLines concatenates all of this lineGroup's content in a way that's
+// friendlier to regexes than internalJoinedLines.
+//
+// Primarily, this method still strips leading whitespace but it uses a real
+// newline character to separate lines instead of the "word" character logic of
+// internalJoinedLines.
+func (lg *lineGroup) regexJoinedLines() string {
+ if len(lg.lines) == 0 {
+ return ""
+ }
+ lines := make([]string, len(lg.lines))
+ for i, l := range lg.lines {
+ lines[i] = strings.TrimLeftFunc(l, unicode.IsSpace)
+ }
+ return strings.Join(lines, "\n")
+}
+
+// joinedLines returns internalJoinedLines and records that it was called in the
+// accessRecorder.
func (lg *lineGroup) joinedLines() string {
lg.access.joinedLines = true
return lg.internalJoinedLines()
diff --git a/keepsorted/options_parser.go b/keepsorted/options_parser.go
index 56b136c..7848ab7 100644
--- a/keepsorted/options_parser.go
+++ b/keepsorted/options_parser.go
@@ -128,7 +128,7 @@
func (ar *ByRegexOption) UnmarshalYAML(node *yaml.Node) error {
switch node.Tag {
case "!!str":
- pat, err := regexp.Compile(node.Value)
+ pat, err := compileByRegex(node.Value)
if err != nil {
return err
}
@@ -144,7 +144,7 @@
return fmt.Errorf("by_regex map item must have exactly one key-value pair, but got %d", len(m))
}
for pattern, template := range m {
- pat, err := regexp.Compile(pattern)
+ pat, err := compileByRegex(pattern)
if err != nil {
return fmt.Errorf("invalid regex pattern %q: %w", pattern, err)
}
@@ -198,7 +198,7 @@
func (p *parser) popByRegexOption() ([]ByRegexOption, error) {
return popListValue(p, func(s string) (ByRegexOption, error) {
- pat, err := regexp.Compile(s)
+ pat, err := compileByRegex(s)
return ByRegexOption{Pattern: pat}, err
})
}
@@ -328,3 +328,14 @@
}
return ch, ok
}
+
+func compileByRegex(re string) (*regexp.Regexp, error) {
+ if !strings.HasPrefix(re, "(?s)") && !strings.HasPrefix(re, "(?-s)") {
+ // The initial version of by_regex ran on top of lineGroup.joinedLines. This
+ // meant that users wrote regexes that didn't need to handle newlines.
+ // To minimize disruption, we automatically set dotall flag so that their
+ // regexes might still work after we start using real newlines.
+ re = "(?s)" + re
+ }
+ return regexp.Compile(re)
+}
diff --git a/keepsorted/options_parser_test.go b/keepsorted/options_parser_test.go
index d6c8286..b23e18d 100644
--- a/keepsorted/options_parser_test.go
+++ b/keepsorted/options_parser_test.go
@@ -236,7 +236,7 @@
name: "Regex",
input: ".*",
- want: []ByRegexOption{{regexp.MustCompile(".*"), nil}},
+ want: []ByRegexOption{{regexp.MustCompile("(?s).*"), nil}},
},
{
name: "MultipleRegex",
@@ -244,9 +244,9 @@
input: `[.*, abcd, '(?:efgh)ijkl']`,
allowYAMLList: true,
want: []ByRegexOption{
- {regexp.MustCompile(".*"), nil},
- {regexp.MustCompile("abcd"), nil},
- {regexp.MustCompile("(?:efgh)ijkl"), nil},
+ {regexp.MustCompile("(?s).*"), nil},
+ {regexp.MustCompile("(?s)abcd"), nil},
+ {regexp.MustCompile("(?s)(?:efgh)ijkl"), nil},
},
},
{
@@ -255,10 +255,10 @@
input: `[.*, Mon: 0, '\b(\d{2})/(\d{2})/(\d{4})\b': '${3}-${1}-${2}', "0: 1": 2]`,
allowYAMLList: true,
want: []ByRegexOption{
- {regexp.MustCompile(".*"), nil},
- {regexp.MustCompile("Mon"), &([]string{"0"})[0]},
- {regexp.MustCompile(`\b(\d{2})/(\d{2})/(\d{4})\b`), &([]string{"${3}-${1}-${2}"})[0]},
- {regexp.MustCompile(`0: 1`), &([]string{"2"})[0]},
+ {regexp.MustCompile("(?s).*"), nil},
+ {regexp.MustCompile("(?s)Mon"), &([]string{"0"})[0]},
+ {regexp.MustCompile(`(?s)\b(\d{2})/(\d{2})/(\d{4})\b`), &([]string{"${3}-${1}-${2}"})[0]},
+ {regexp.MustCompile(`(?s)0: 1`), &([]string{"2"})[0]},
},
},
{
diff --git a/keepsorted/options_test.go b/keepsorted/options_test.go
index bed93a6..f7e2a2d 100644
--- a/keepsorted/options_test.go
+++ b/keepsorted/options_test.go
@@ -218,7 +218,8 @@
want: blockOptions{
AllowYAMLLists: true,
ByRegex: []ByRegexOption{
- {regexp.MustCompile("(?:abcd)"), nil}, {regexp.MustCompile("efg.*"), nil},
+ {regexp.MustCompile("(?s)(?:abcd)"), nil},
+ {regexp.MustCompile("(?s)efg.*"), nil},
},
},
},
@@ -230,9 +231,9 @@
want: blockOptions{
AllowYAMLLists: true,
ByRegex: []ByRegexOption{
- {Pattern: regexp.MustCompile(`.*`)},
+ {Pattern: regexp.MustCompile(`(?s).*`)},
{
- Pattern: regexp.MustCompile(`\b(\d{2})/(\d{2})/(\d{4})\b`),
+ Pattern: regexp.MustCompile(`(?s)\b(\d{2})/(\d{2})/(\d{4})\b`),
Template: &[]string{"${3}-${1}-${2}"}[0],
},
},
@@ -246,9 +247,9 @@
want: blockOptions{
AllowYAMLLists: true,
ByRegex: []ByRegexOption{
- {Pattern: regexp.MustCompile(`foo, bar`)},
+ {Pattern: regexp.MustCompile(`(?s)foo, bar`)},
{
- Pattern: regexp.MustCompile(`\b(\d{2})/(\d{2})/(\d{4})\b`),
+ Pattern: regexp.MustCompile(`(?s)\b(\d{2})/(\d{2})/(\d{4})\b`),
Template: &[]string{"${3}-${1}-${2}"}[0],
},
},