| # String Processing For Swift 4 |
| |
| * Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift) |
| |
| The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus |
| far, with just this short blurb in the |
| [list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html): |
| |
| > **String re-evaluation**: String is one of the most important fundamental |
| > types in the language. The standard library leads have numerous ideas of how |
| > to improve the programming model for it, without jeopardizing the goals of |
| > providing a unicode-correct-by-default model. Our goal is to be better at |
| > string processing than Perl! |
| |
| For Swift 4 and beyond we want to improve three dimensions of text processing: |
| |
| 1. Ergonomics |
| 2. Correctness |
| 3. Performance |
| |
| This document is meant to both provide a sense of the long-term vision |
| (including undecided issues and possible approaches), and to define the scope of |
| work that could be done in the Swift 4 timeframe. |
| |
| ## General Principles |
| |
| ### Ergonomics |
| |
| It's worth noting that ergonomics and correctness are mutually-reinforcing. An |
| API that is easy to use--but incorrectly--cannot be considered an ergonomic |
| success. Conversely, an API that's simply hard to use is also hard to use |
| correctly. Achieving optimal performance without compromising ergonomics or |
| correctness is a greater challenge. |
| |
| Consistency with the Swift language and idioms is also important for |
| ergonomics. There are several places both in the standard library and in the |
| Foundation additions to `String` where patterns and practices found elsewhere |
| could be applied to improve usability and familiarity. |
| |
| ### API Surface Area |
| |
| Primary data types such as `String` should have APIs that are easily understood |
| given a signature and a one-line summary. Today, `String` fails that test. As |
| you can see, the Standard Library and Foundation both contribute significantly to |
| its overall complexity. |
| |
| **Method Arity** | **Standard Library** | **Foundation** |
| ---|:---:|:---: |
| 0: `f()` | 5 | 7 |
| 1: `f(:)` | 19 | 48 |
| 2: `f(::)` | 13 | 19 |
| 3: `f(:::)` | 5 | 11 |
| 4: `f(::::)` | 1 | 7 |
| 5: `f(:::::)` | - | 2 |
| 6: `f(::::::)` | - | 1 |
| |
| **API Kind** | **Standard Library** | **Foundation** |
| ---|:---:|:---: |
| `init` | 41 | 18 |
| `func` | 42 | 55 |
| `subscript` | 9 | 0 |
| `var` | 26 | 14 |
| |
| **Total: 205 APIs** |
| |
| By contrast, `Int` has 80 APIs, none with more than two parameters. <sup |
| id="a0">[0](#f0)</sup> String processing is complex enough; users shouldn't have |
| to press through physical API sprawl just to get started. |
| |
| Many of the choices detailed below contribute to solving this problem, |
| including: |
| |
| * [Restoring `Collection` conformance](#L-code-string--code--should-be-a--code-collection--code--of--code-character--code-s-again) and |
| dropping the `.characters` view. |
| * Providing a more |
| general, [composable slicing syntax](#unification-of-slicing-operations). |
| * [Altering `Comparable`](#syntax) so that parameterized |
| (e.g. case-insensitive) comparison fits smoothly into the basic syntax. |
| * [Clearly separating](#internationalization-and-the-default-behavior-of--code-string--code-) language-dependent |
| operations on text produced by and for humans from language-independent |
| operations on text produced by and for machine processing. |
| * Relocating APIs that fall outside the domain of basic string processing and |
| discouraging the proliferation of ad-hoc extensions. |
| |
| |
| ### Batteries Included |
| |
| While `String` is available to all programs out-of-the-box, crucial APIs for |
| basic string processing tasks are still inaccessible until `Foundation` is |
| imported. While it makes sense that `Foundation` is needed for domain-specific |
| jobs such as |
| [linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger), |
| one should not need to import anything to, for example, do case-insensitive |
| comparison. |
| |
| ### Unicode Compliance and Platform Support |
| |
| The Unicode standard provides a crucial objective reference point for what |
| constitutes correct behavior in an extremely complex domain, so |
| Unicode-correctness is, and will remain, a fundamental design principle behind |
| Swift's `String`. That said, the Unicode standard is an evolving document, so |
| this objective reference-point is not fixed. <sup id="a1">[1](#f1)</sup> While |
| many of the most important operations--e.g. string hashing, equality, and |
| non-localized comparison--[will be stable](#collation-semantics), the semantics |
| of others, such as grapheme breaking and localized comparison and case |
| conversion, are expected to change as platforms are updated, so programs should |
| be written so their correctness does not depend on precise stability of these |
| semantics across OS versions or platforms. Although it may be possible to |
| imagine static and/or dynamic analysis tools that will help users find such |
| errors, the only sure way to deal with this fact of life is to educate users. |
| |
| ## Design Points |
| |
| ### Internationalization |
| |
| There is strong evidence that developers cannot determine how to use |
| internationalization APIs correctly. Although documentation could and should be |
| improved, the sheer size, complexity, and diversity of these APIs is a major |
| contributor to the problem, causing novices to tune out, and more experienced |
| programmers to make avoidable mistakes. |
| |
| The first step in improving this situation is to regularize all localized |
| operations as invocations of normal string operations with extra |
| parameters. Among other things, this means: |
| |
| 1. Doing away with `localizedXXX` methods. |
| 2. Providing a terse way to name the current locale as a parameter. |
| 3. Automatically [adjusting defaults](#operations-with-options) for options such |
| as case sensitivity based on whether the operation is localized. |
| 4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see |
| guidance in the |
| [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)). |
| |
| Along with appropriate documentation updates, these changes will make localized |
| operations more teachable, comprehensible, and approachable, thereby lowering a |
| barrier that currently leads some developers to ignore localization issues |
| altogether. |
| |
| #### The Default Behavior of `String` |
| |
| Although this isn't well-known, the most accessible forms of many operations on |
| Swift `String` (and `NSString`) are really only appropriate for text that is |
| intended to be processed for, and consumed by, machines. The semantics of the |
| operations with the simplest spellings are always non-localized and |
| language-agnostic. |
| |
| Two major factors play into this design choice: |
| |
| 1. Machine processing of text is important, so we should have first-class, |
| accessible functions appropriate to that use case. |
| |
| 2. The most general localized operations require a locale parameter not required |
| by their un-localized counterparts. This naturally skews complexity towards |
| localized operations. |
| |
| Reaffirming that `String`'s simplest APIs have |
| language-independent/machine-processed semantics has the benefit of clarifying |
| the proper default behavior of operations such as comparison, and allows us to |
| make [significant optimizations](#collation-semantics) that were previously |
| thought to conflict with Unicode. |
| |
| #### Future Directions |
| |
| One of the most common internationalization errors is the unintentional |
| presentation to users of text that has not been localized, but regularizing APIs |
| and improving documentation can go only so far in preventing this error. |
| Combined with the fact that `String` operations are non-localized by default, |
| the environment for processing human-readable text may still be somewhat |
| error-prone in Swift 4. |
| |
| For an audience of mostly non-experts, it is especially important that naïve |
| code is very likely to be correct if it compiles, and that more sophisticated |
| issues can be revealed progressively. For this reason, we intend to |
| specifically and separately target localization and internationalization |
| problems in the Swift 5 timeframe. |
| |
| ### Operations With Options |
| |
| Three categories of common string operations often need to be tuned in |
| various dimensions: |
| |
| **Operation**|**Applicable Options** |
| ---|--- |
| sort ordering | locale, case/diacritic/width-insensitivity |
| case conversion | locale |
| pattern matching | locale, case/diacritic/width-insensitivity |
| |
| The defaults for case-, diacritic-, and width-insensitivity are sometimes different for |
| localized operations than for non-localized operations, so for example a |
| localized search should be case-insensitive by default, and a non-localized search |
| should be case-sensitive by default. We propose a standard "language" of |
| defaulted parameters to be used for these purposes, with usage roughly like this: |
| |
| ```swift |
| x.compared(to: y, case: .sensitive, in: swissGerman) |
| |
| x.lowercased(in: .currentLocale) |
| |
| x.allMatches(somePattern, case: .insensitive, diacritic: .insensitive) |
| ``` |
| |
| This usage might be supported by code like this: |
| |
| ```swift |
| enum StringSensitivity { |
|   case sensitive |
|   case insensitive |
| } |
| |
| extension Locale { |
|   static var currentLocale: Locale { ... } |
| } |
| |
| extension Unicode { |
|   // An example of the option language in declaration context, |
|   // with nil defaults indicating unspecified, so defaults can be |
|   // driven by the presence/absence of a specific Locale |
|   func frobnicated( |
|     case caseSensitivity: StringSensitivity? = nil, |
|     diacritic diacriticSensitivity: StringSensitivity? = nil, |
|     width widthSensitivity: StringSensitivity? = nil, |
|     in locale: Locale? = nil |
|   ) -> Self { ... } |
| } |
| ``` |
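To make the "nil means unspecified" idea concrete, here is a minimal sketch of how a nil case option might be resolved from the presence or absence of a locale. `SketchLocale` and `resolvedCaseSensitivity` are hypothetical names invented for this illustration, not proposed API, and only the case option is shown.

```swift
// Hypothetical stand-ins for illustration; none of these are real
// standard library or Foundation declarations.
enum StringSensitivity {
    case sensitive
    case insensitive
}

struct SketchLocale {
    let identifier: String
}

/// Resolves an unspecified (nil) case option: a localized operation defaults
/// to case-insensitive, a non-localized one to case-sensitive. An explicit
/// option always wins.
func resolvedCaseSensitivity(
    case caseSensitivity: StringSensitivity? = nil,
    in locale: SketchLocale? = nil
) -> StringSensitivity {
    if let explicit = caseSensitivity { return explicit }
    return locale == nil ? .sensitive : .insensitive
}

let swissGerman = SketchLocale(identifier: "de_CH")
let nonLocalizedDefault = resolvedCaseSensitivity()
let localizedDefault = resolvedCaseSensitivity(in: swissGerman)
let explicitOverride = resolvedCaseSensitivity(case: .sensitive, in: swissGerman)
```

The same resolution step would presumably apply to the diacritic and width options.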
| |
| ### Comparing and Hashing Strings |
| |
| #### Collation Semantics |
| |
| What Unicode says about collation--which is used in `<`, `==`, and hashing--turns |
| out to be quite interesting, once you pick it apart. The full Unicode Collation |
| Algorithm (UCA) works like this: |
| |
| 1. Fully normalize both strings. |
| 2. Convert each string to a sequence of numeric triples to form a collation key. |
| 3. "Flatten" the key by concatenating the sequence of first elements to the |
| sequence of second elements to the sequence of third elements. |
| 4. Lexicographically compare the flattened keys. |
| |
| While step 1 can usually |
| be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and |
| incrementally, step 2 uses a collation table that maps matching *sequences* of |
| Unicode scalars in the normalized string to *sequences* of triples, which get |
| accumulated into a collation key. Predictably, this is where the real costs |
| lie. |
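Steps 3 and 4 can be sketched as follows, following the simplified description above rather than the full UCA specification. The `CollationElement` alias and the weights are made up for illustration; real weights come from the Unicode collation table, and the table lookup in step 2 is not shown.

```swift
// Hypothetical collation elements: (primary, secondary, tertiary) weights.
// The numeric values used below are invented for illustration only.
typealias CollationElement = (primary: UInt16, secondary: UInt16, tertiary: UInt16)

/// Step 3: all first elements, then all second elements, then all third.
func flattenedKey(_ elements: [CollationElement]) -> [UInt16] {
    let primaries = elements.map { $0.primary }
    let secondaries = elements.map { $0.secondary }
    let tertiaries = elements.map { $0.tertiary }
    return primaries + secondaries + tertiaries
}

/// Step 4: lexicographic comparison of the flattened keys.
func collatesBefore(_ lhs: [CollationElement], _ rhs: [CollationElement]) -> Bool {
    return flattenedKey(lhs).lexicographicallyPrecedes(flattenedKey(rhs))
}

// Same primary weights, differing only at the secondary level, as an
// unaccented and an accented spelling of a word might.
let plainKey: [CollationElement] = [(10, 1, 1), (20, 1, 1)]
let accentedKey: [CollationElement] = [(10, 1, 1), (20, 2, 1)]

// A primary difference dominates any secondary or tertiary difference.
let lowPrimary: [CollationElement] = [(10, 9, 9)]
let highPrimary: [CollationElement] = [(11, 1, 1)]
```

Because every primary weight precedes every secondary weight in the flattened key, a difference at a lower level only matters when all higher-level weights tie.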
| |
| *However*, there are some bright spots to this story. First, as it turns out, |
| string sorting (localized or not) should be done down to what's called |
| the |
| ["identical" level](http://unicode.org/reports/tr10/#Multi_Level_Comparison), |
| which adds a step 3a: append the string's normalized form to the flattened |
| collation key. At first blush this just adds work, but consider what it does |
| for equality: two strings that normalize the same, naturally, will collate the |
| same. But also, *strings that normalize differently will always collate |
| differently*. In other words, for equality, it is sufficient to compare the |
| strings' normalized forms and see if they are the same. We can therefore |
| entirely skip the expensive part of collation for equality comparison. |
| |
| Next, naturally, anything that applies to equality also applies to hashing: it |
| is sufficient to hash the string's normalized form, bypassing collation keys. |
| This should provide significant speedups over the current implementation. |
| Perhaps more importantly, since comparison down to the "identical" level applies |
| even to localized strings, it means that hashing and equality can be implemented |
| exactly the same way for localized and non-localized text, and hash tables with |
| localized keys will remain valid across current-locale changes. |
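The consequence for equality and hashing can be observed directly in current Swift toolchains, where `String` equality already respects canonical equivalence. The sketch below only demonstrates the observable behavior; it says nothing about how the comparison is implemented internally.

```swift
// "é" as one precomposed scalar, and as "e" followed by U+0301.
let precomposed = "\u{E9}"
let decomposed = "e\u{301}"

// The underlying scalar sequences differ...
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

// ...but the strings are canonically equivalent, so they compare equal and
// must therefore hash alike.
let equalStrings = precomposed == decomposed
let equalHashes = precomposed.hashValue == decomposed.hashValue

// A Set keyed on String treats both spellings as the same element.
let spellings = Set([precomposed, decomposed])
```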
| |
| Finally, once it is agreed that the *default* role for `String` is to handle |
| machine-generated and machine-readable text, the default ordering of `String`s |
| need no longer use the UCA at all. It is sufficient to order them in any way |
| that's consistent with equality, so `String` ordering can simply be a |
| lexicographical comparison of normalized forms, <sup id="a4">[4](#f4)</sup> |
| (which is equivalent to lexicographically comparing the sequences of grapheme |
| clusters), again bypassing step 2 and offering another speedup. |
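As a rough sketch of such an ordering, assuming Foundation's `precomposedStringWithCanonicalMapping` as the normalization step (the choice of normal form here is ours, not something this document fixes), the comparison reduces to a lexicographical comparison of scalars. `precedesByNormalizedScalars` is a hypothetical helper name.

```swift
import Foundation

/// Orders two strings by lexicographically comparing the Unicode scalars of
/// their normalized forms. No collation table is consulted.
func precedesByNormalizedScalars(_ lhs: String, _ rhs: String) -> Bool {
    let l = lhs.precomposedStringWithCanonicalMapping.unicodeScalars
    let r = rhs.precomposedStringWithCanonicalMapping.unicodeScalars
    return l.lexicographicallyPrecedes(r)
}

// Consistent with equality: equivalent strings precede in neither direction.
let eitherWay = precedesByNormalizedScalars("e\u{301}", "\u{E9}")
    || precedesByNormalizedScalars("\u{E9}", "e\u{301}")

// Not a linguistic ordering: "Z" (U+005A) sorts before "a" (U+0061).
let upperFirst = precedesByNormalizedScalars("Z", "a")
```

The result is stable and cheap, which suits machine-processed text; anything presented to users in sorted order would still go through localized comparison.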
| |
| This leaves us executing the full UCA *only* for localized sorting, and ICU's |
| implementation has apparently been very well optimized. |
| |
| Following this scheme everywhere would also allow us to make sorting behavior |
| consistent across platforms. Currently, we sort `String` according to the UCA, |
| except that--*only on Apple platforms*--pairs of ASCII characters are ordered by |
| Unicode scalar value. |
| |
| #### Syntax |
| |
| Because the current `Comparable` protocol expresses all comparisons with binary |
| operators, string comparisons--which may require |
| additional [options](#operations-with-options)--do not fit smoothly into the |
| existing syntax. At the same time, we'd like to solve other problems with |
| comparison, as outlined |
| in |
| [this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) |
| (implemented by changes at the head |
| of |
| [this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)). |
| We should adopt a modification of that proposal that uses a method rather than |
| the `<=>` operator: |
| |
| ```swift |
| enum SortOrder { case before, same, after } |
| |
| protocol Comparable : Equatable { |
|   func compared(to: Self) -> SortOrder |
|   ... |
| } |
| ``` |
| |
| This change will give us a syntactic platform on which to implement methods with |
| additional, defaulted arguments, thereby unifying and regularizing comparison |
| across the library. |
| |
| ```swift |
| extension String { |
|   func compared(to: Self) -> SortOrder { |
|     ... |
|   } |
| } |
| ``` |
| |
| **Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible |
| that the standard library simply adopts Foundation's `ComparisonResult` as is, |
| but we believe the community should at least consider alternate naming before |
| that happens. There will be an opportunity to discuss the choices in detail |
| when the modified |
| [Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes |
| up for review. |
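For illustration, here is a small self-contained sketch of how the comparison operators could be derived from a single `compared(to:)` requirement. `MethodComparable` and `Version` are hypothetical names chosen to avoid clashing with the real `Comparable`; this is not the proposal's actual implementation.

```swift
enum SortOrder { case before, same, after }

// Hypothetical protocol standing in for the revised `Comparable`.
protocol MethodComparable: Equatable {
    func compared(to other: Self) -> SortOrder
}

extension MethodComparable {
    // The operators fall out of the single method requirement.
    static func < (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .before
    }
    static func > (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .after
    }
}

struct Version: MethodComparable {
    let major: Int
    let minor: Int

    func compared(to other: Version) -> SortOrder {
        if major != other.major { return major < other.major ? .before : .after }
        if minor != other.minor { return minor < other.minor ? .before : .after }
        return .same
    }
}

let older = Version(major: 3, minor: 1)
let newer = Version(major: 4, minor: 0)
```

A method, unlike an operator, leaves room for the additional defaulted parameters that string comparison needs.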
| |
| ### `String` should be a `Collection` of `Character`s Again |
| |
| In Swift 2.0, `String`'s `Collection` conformance was dropped, because we |
| convinced ourselves that its semantics differed from those of `Collection` too |
| significantly. |
| |
| It was always well understood that if strings were treated as sequences of |
| `UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`, |
| and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was |
| a collection of `Character` (extended grapheme clusters). During 2.0 |
| development, though, we realized that correct string concatenation could |
| occasionally merge distinct grapheme clusters at the start and end of combined |
| strings. |
| |
| This quirk aside, every aspect of strings-as-collections-of-graphemes appears to |
| comport perfectly with Unicode. We think the concatenation problem is tolerable, |
| because the cases where it occurs all represent partially-formed constructs. The |
| largest class--isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE |
| ACCENT)--are explicitly called out in the Unicode standard as |
| "[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)" or |
| "[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)". The other |
| cases--such as a string ending in a zero-width joiner or half of a regional |
| indicator--appear to be equally transient and unlikely outside of a text editor. |
| |
| Admitting these cases encourages exploration of grapheme composition and is |
| consistent with what appears to be an overall Unicode philosophy that "no |
| special provisions are made to get marginally better behavior for... cases that |
| never occur in practice." <sup id="a2">[2](#f2)</sup> Furthermore, it seems |
| unlikely to disturb the semantics of any plausible algorithms. We can handle |
| these cases by documenting them, explicitly stating that the elements of a |
| `String` are an emergent property based on Unicode rules. |
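The concatenation quirk is easy to reproduce. The sketch below uses `String.count`, which assumes a toolchain where `String` is a `Collection` of `Character` (Swift 4 or later).

```swift
let base = "e"
let accent = "\u{301}"   // an isolated combining mark: a "degenerate" grapheme

// Each piece counts as one Character on its own...
let separateCount = base.count + accent.count

// ...but the concatenation forms a single grapheme cluster.
let combined = base + accent
let combinedCount = combined.count
```

So `count` is not additive under concatenation, but only when one operand starts or ends with a partially-formed construct.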
| |
| The benefits of restoring `Collection` conformance are substantial: |
| |
| * Collection-like operations encourage experimentation with strings to |
| investigate and understand their behavior. This is useful for teaching new |
| programmers, but also good for experienced programmers who want to |
| understand more about strings and Unicode. |
| |
| * Extended grapheme clusters form a natural element boundary for Unicode |
| strings. For example, searching and matching operations will always produce |
| results that line up on grapheme cluster boundaries. |
| |
| * Character-by-character processing is a legitimate thing to do in many real |
| use-cases, including parsing, pattern matching, and language-specific |
| transformations such as transliteration. |
| |
| * `Collection` conformance makes a wide variety of powerful operations |
| available that are appropriate to `String`'s default role as the vehicle for |
| machine processed text. |
| |
| The methods `String` would inherit from `Collection`, where they are similar to |
| higher-level string algorithms, have the right semantics. For example, |
| grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of |
| `flatMap` with case-conversion, produce the same results one would expect |
| from whole-string ordering comparison, equality comparison, and |
| case-conversion, respectively. `reverse` operates correctly on graphemes, |
| keeping diacritics moored to their base characters and leaving emoji intact. |
| Other methods such as `index(of:)` and `contains` make obvious sense. A few |
| `Collection` methods, like `min` and `max`, may not be particularly useful |
| on `String`, but we don't consider that to be a problem worth solving, in |
| the same way that we wouldn't try to suppress `min` and `max` on a |
| `Set([UInt8])` that was used to store IP addresses. |
| |
| * Many of the higher-level operations that we want to provide for `String`s, |
| such as parsing and pattern matching, should apply to any `Collection`, and |
| many of the benefits we want for `Collection`s, such |
| as [unified slicing](#unification-of-slicing-operations), should accrue |
| equally to `String`. Making `String` part of the same protocol hierarchy |
| allows us to write these operations once and not worry about keeping the |
| benefits in sync. |
| |
| * Slicing strings into substrings is a crucial part of the vocabulary of |
| string processing, and all other sliceable things are `Collection`s. |
| Because of its collection-like behavior, users naturally think of `String` |
| in collection terms, but run into frustrating limitations where it fails to |
| conform and are left to wonder where all the differences lie. Many simply |
| "correct" this limitation by declaring a trivial conformance: |
| |
| ```swift |
| extension String : BidirectionalCollection {} |
| ``` |
| |
| Even if we removed indexing-by-element from `String`, users could still do |
| this: |
| |
| ```swift |
| extension String : BidirectionalCollection { |
|   subscript(i: Index) -> Character { return characters[i] } |
| } |
| ``` |
| |
| It would be much better to legitimize the conformance to `Collection` and |
| simply document the oddity of any concatenation corner-cases, than to deny |
| users the benefits on the grounds that a few cases are confusing. |
| |
| Note that the fact that `String` is a collection of graphemes does *not* mean |
| that string operations will necessarily have to do grapheme boundary |
| recognition. See [this section](#the--code-unicode--code--protocol) for details. |
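A few of the inherited operations discussed above can be tried directly on a toolchain where `String` conforms to `Collection` (Swift 4 or later). The sketch contrasts grapheme-wise reversal with scalar-wise reversal, which is the kind of "nonsense result" that motivated the element choice.

```swift
let word = "cafe\u{301}"            // "café" spelled with a combining accent
let precomposedWord = "caf\u{E9}"   // the same word with a precomposed "é"
let f: Character = "f"

// Grapheme-wise operations behave like their whole-string counterparts.
let graphemeCount = word.count
let sameElements = word.elementsEqual(precomposedWord)
let hasF = word.contains(f)

// Reversing by Character keeps the accent moored to its base letter.
let reversedWord = String(word.reversed())

// Reversing by scalar detaches the accent from the "e".
let scalarReversed = String(String.UnicodeScalarView(word.unicodeScalars.reversed()))
```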
| |
| ### `Character` and `CharacterSet` |
| |
| `Character`, which represents a |
| Unicode |
| [extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), |
| is a bit of a black box, requiring conversion to `String` in order to |
| do any introspection, including interoperation with ASCII. To fix this, we should: |
| |
| - Add a `unicodeScalars` view much like `String`'s, so that the sub-structure |
| of grapheme clusters is discoverable. |
| - Add a failable `init` from sequences of scalars (returning nil for sequences |
| that contain 0 or 2+ graphemes). |
| - (Lower priority) expose some operations, such as `func uppercase() -> |
| String`, `var isASCII: Bool`, and, to the extent they can be sensibly |
| generalized, queries of Unicode properties that should also be exposed on |
| `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`. |
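As a sketch of the first two items: current toolchains (Swift 5 and later) expose `Character.unicodeScalars`, and the failable initializer can be approximated with a free function. `makeCharacter(from:)` is a hypothetical helper written for this illustration, not an existing or proposed API.

```swift
let accented: Character = "e\u{301}"
// The sub-structure of the grapheme cluster is visible through its scalars.
let scalarCount = accented.unicodeScalars.count

/// Hypothetical stand-in for the proposed failable initializer: returns nil
/// unless the scalars form exactly one grapheme cluster.
func makeCharacter<S: Sequence>(from scalars: S) -> Character?
    where S.Element == Unicode.Scalar
{
    var view = String.UnicodeScalarView()
    view.append(contentsOf: scalars)
    let text = String(view)
    return text.count == 1 ? text.first : nil
}

let one = makeCharacter(from: "e\u{301}".unicodeScalars)   // one grapheme
let two = makeCharacter(from: "ab".unicodeScalars)         // two graphemes
let none = makeCharacter(from: "".unicodeScalars)          // zero graphemes
```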
| |
| Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar` |
| type. This means it is usable on `String`, but only by going through the Unicode |
| scalar view. To deal with this clash in the short term, `CharacterSet` should be |
| renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to |
| introduce a `CharacterSet` that provides similar functionality for extended |
| grapheme clusters. <sup id="a5">[5](#f5)</sup> |
| |
| ### Unification of Slicing Operations |
| |
| Creating substrings is a basic part of string processing, but the slicing |
| operations that we have in Swift are inconsistent in both their spelling and |
| their naming: |
| |
| * Slices with two explicit endpoints are done with subscript, and support |
| in-place mutation: |
| |
| ```swift |
| s[i..<j].mutate() |
| ``` |
| |
| * Slicing from an index to the end, or from the start to an index, is done |
| with a method and does not support in-place mutation: |
| |
| ```swift |
| s.prefix(upTo: i).readOnly() |
| ``` |
| |
| Prefix and suffix operations should be migrated to be subscripting operations |
| with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as |
| in |
| [this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md). |
| With generic subscripting in the language, that will allow us to collapse a wide |
| variety of methods and subscript overloads into a single implementation, and |
| give users an easy-to-use and composable way to describe subranges. |
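A short sketch of the one-sided range spelling, assuming a toolchain that provides it (Swift 4 or later) along with `firstIndex(of:)`. `splitKeyValue` is a hypothetical helper used only to frame the example.

```swift
/// Splits "key=value" at the first "=" using one-sided range subscripts.
func splitKeyValue(_ line: String) -> (key: Substring, value: Substring)? {
    guard let separator = line.firstIndex(of: "=") else { return nil }
    let key = line[..<separator]                        // was prefix(upTo:)
    let value = line[line.index(after: separator)...]   // was suffix(from:)
    return (key, value)
}

let parts = splitKeyValue("name=Swift")
let missing = splitKeyValue("no separator here")
```

Both halves use the same subscript syntax as a two-sided slice, so there is only one spelling to learn.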
| |
| Further extending this embedded domain-specific language (EDSL) to integrate use-cases like `s.prefix(maxLength: 5)` |
| is an ongoing research project that can be considered part of the potential |
| long-term vision of text (and collection) processing. |
| |
| ### Substrings |
| |
| When implementing substring slicing, languages are faced with three options: |
| |
| 1. Make the substrings the same type as string, and share storage. |
| 2. Make the substrings the same type as string, and copy storage when making the substring. |
| 3. Make substrings a different type, with a storage copy on conversion to string. |
| |
| We think number 3 is the best choice. A walk-through of the tradeoffs follows. |
| |
| #### Same type, shared storage |
| |
| In Swift 3.0, slicing a `String` produces a new `String` that is a view into a |
| subrange of the original `String`'s storage. This is why `String` is 3 words in |
| size (the start, length and buffer owner), unlike the similar `Array` type |
| which is only one. |
| |
| This is a simple model with big efficiency gains when chopping up strings into |
| multiple smaller strings. But it does mean that a stored substring keeps the |
| entire original string buffer alive even after it would normally have been |
| released. |
| |
| This arrangement has proven to be problematic in other programming languages, |
| because applications sometimes extract small strings from large ones and keep |
| those small strings long-term. That is considered a memory leak and was enough |
| of a problem in Java that they changed from substrings sharing storage to |
| making a copy in 1.7. |
| |
| #### Same type, copied storage |
| |
| Copying of substrings is also the choice made in C#, and in the default |
| `NSString` implementation. This approach avoids the memory leak issue, but has |
| obvious performance overhead in performing the copies. |
| |
| This in turn encourages trafficking in string/range pairs instead of in |
| substrings, for performance reasons, leading to API challenges. For example: |
| |
| ```swift |
| foo.compare(bar, range: start..<end) |
| ``` |
| |
| Here, it is not clear whether `range` applies to `foo` or `bar`. This |
| relationship is better expressed in Swift as a slicing operation: |
| |
| ```swift |
| foo[start..<end].compare(bar) |
| ``` |
| |
| Not only does this clarify to which string the range applies, it also brings |
| this sub-range capability to any API that operates on `String` "for free". So |
| these other combinations also work equally well: |
| |
| ```swift |
| // apply range on argument rather than target |
| foo.compare(bar[start..<end]) |
| // apply range on both |
| foo[start..<end].compare(bar[start1..<end1]) |
| // compare two strings ignoring first character |
| foo.dropFirst().compare(bar.dropFirst()) |
| ``` |
| |
| In all three cases, an explicit range argument need not appear on the `compare` |
| method itself. The implementation of `compare` does not need to know anything |
| about ranges. Methods need only take range arguments when that is an |
| integral part of their purpose (for example, setting the start and end of a |
| user's current selection in a text box). |
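The same point can be sketched with a range-unaware algorithm written once and applied to whole strings and slices alike. We use `StringProtocol` from later Swift releases as the generic constraint here; `vowelCount(in:)` is a hypothetical function for illustration.

```swift
/// A range-unaware algorithm, written once for any string-like type.
func vowelCount<S: StringProtocol>(in text: S) -> Int {
    let vowels: Set<Character> = ["a", "e", "i", "o", "u"]
    return text.reduce(0) { $0 + (vowels.contains($1) ? 1 : 0) }
}

let phrase = "string processing"
let start = phrase.index(phrase.startIndex, offsetBy: 7)

let whole = vowelCount(in: phrase)                       // the entire string
let tail = vowelCount(in: phrase[start...])              // "processing"
let withoutFirst = vowelCount(in: phrase.dropFirst())    // "tring processing"
```

The function signature never mentions a range; the caller chooses the portion of the string by slicing.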
| |
| #### Different type, shared storage |
| |
| The desire to share underlying storage while preventing accidental memory leaks |
| occurs with slices of `Array`. For this reason we have an `ArraySlice` type. |
| The inconvenience of a separate type is mitigated by most operations used on |
| `Array` from the standard library being generic over `Sequence` or `Collection`. |
| |
| We should apply the same approach for `String` by introducing a distinct |
| `SubSequence` type, `Substring`. Similar advice given for `ArraySlice` would apply to `Substring`: |
| |
| > Important: Long-term storage of `Substring` instances is discouraged. A |
| > substring holds a reference to the entire storage of a larger string, not |
| > just to the portion it presents, even after the original string's lifetime |
| > ends. Long-term storage of a `Substring` may therefore prolong the lifetime |
| > of large strings that are no longer otherwise accessible, which can appear |
| > to be memory leakage. |
| |
| When assigning a `Substring` to a longer-lived variable (usually a stored |
| property) explicitly of type `String`, a type conversion will be performed, and |
| at this point the substring buffer is copied and the original string's storage |
| can be released. |
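A minimal sketch of that hand-off, with the conversion spelled as an explicit `String(...)` initializer call so that it does not depend on any implicit conversion being available. `Header` and `parseHeaderName` are hypothetical names.

```swift
struct Header {
    let name: String    // long-lived storage is a String, not a Substring
}

func parseHeaderName(from line: String) -> Header? {
    guard let colon = line.firstIndex(of: ":") else { return nil }
    let slice = line[..<colon]          // Substring viewing line's storage
    return Header(name: String(slice))  // copy; line's storage can be released
}

let header = parseHeaderName(from: "Content-Type: text/plain")
let malformed = parseHeaderName(from: "no colon")
```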
| |
| A `String` that was not its own `Substring` could be one word--a single tagged |
| pointer--without requiring additional allocations. `Substring`s would be a view |
| onto a `String`, so would be 3 words--a pointer to the owner, a pointer to the start, and a |
| length. The small string optimization for `Substring` would take advantage of |
| the larger size, probably with a less compressed encoding for speed. |
| |
| The downside of having two types is the inconvenience of sometimes having a |
| `Substring` when you need a `String`, and vice-versa. It is likely this would |
| be a significantly bigger problem than with `Array` and `ArraySlice`, as |
| slicing of `String` is such a common operation. It is especially relevant to |
| existing code that assumes `String` is the currency type -- that is, the default |
| string type used for everyday exchange between APIs. To ease the pain of type |
| mismatches, `Substring` should be a subtype of `String` in the same way that |
| `Int` is a subtype of `Optional<Int>`. This would give users an implicit |
| conversion from `Substring` to `String`, as well as the usual implicit |
| conversions such as `[Substring]` to `[String]` that other subtype |
| relationships receive. |
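The existing `Int`-to-`Optional<Int>` relationship that serves as the model here looks like this in practice. The sketch shows only the existing behavior for `Int`; the `Substring`-to-`String` analogue is the proposal, not something demonstrated below.

```swift
func describe(_ value: Int?) -> String {
    guard let value = value else { return "none" }
    return "some(\(value))"
}

let plain: Int = 42
let described = describe(plain)        // Int converts implicitly to Int?

let numbers: [Int] = [1, 2, 3]
let optionals: [Int?] = numbers        // and [Int] converts to [Int?]
```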
| |
| In most cases, type inference combined with the subtype relationship should |
| make the type difference a non-issue and users will not care which type they |
| are using. For flexibility and optimizability, most operations from the |
| standard library will traffic in generic models of |
| [`Unicode`](#the--code-unicode--code--protocol). |
| |
| ##### Guidance for API Designers |
| |
| In this model, **if a user is unsure about which type to use, `String` is always |
| a reasonable default**. A `Substring` passed where `String` is expected will be |
| implicitly copied. When compared to the "same type, copied storage" model, we |
| have effectively deferred the cost of copying from the point where a substring |
| is created until it must be converted to `String` for use with an API. |
| |
| A user who needs to optimize away copies altogether should use this guideline: |
| if for performance reasons you are tempted to add a `Range` argument to your |
| method as well as a `String` to avoid unnecessary copies, you should instead |
| use `Substring`. |
| |
| ##### The "Empty Subscript" |
| |
| To make it easy to call such an optimized API when you only have a `String` (or |
| to call any API that takes a `Collection`'s `SubSequence` when all you have is |
| the `Collection`), we propose the following "empty subscript" operation, |
| |
| ```swift |
| extension Collection { |
|   subscript() -> SubSequence { |
|     return self[startIndex..<endIndex] |
|   } |
| } |
| ``` |
| |
| which allows the following usage: |
| |
| ```swift |
| funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring |
| ``` |
| |
| The `[]` syntax can be offered as a fixit when needed, similar to `&` for an |
| `inout` argument. While it doesn't help a user to convert `[String]` to |
| `[Substring]`, the need for such conversions is extremely rare and can be done with |
| a simple `map` (which could also be offered by a fixit): |
| |
| ```swift |
| takesAnArrayOfSubstring(arrayOfString.map { $0[] }) |
| ``` |
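On toolchains that provide an unbounded-range subscript (`s[...]`, available in Swift 4 and later), the same whole-collection slice can be written without declaring the extension above. The sketch below assumes that spelling; `characterCount(of:)` is a hypothetical `Substring`-taking function.

```swift
// A function that is "just looking" and therefore takes a Substring.
func characterCount(of text: Substring) -> Int {
    return text.count
}

let name = "Ada Lovelace"
let count = characterCount(of: name[...])        // whole string as a Substring

let names = ["Ada", "Grace"]
let slices: [Substring] = names.map { $0[...] }  // [String] to [Substring]
```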
| |
| #### Other Options Considered |
| |
| As we have seen, all three options above have downsides, but it's possible |
| these downsides could be eliminated/mitigated by the compiler. We are proposing |
| one such mitigation--implicit conversion--as part of the "different type, |
| shared storage" option, to help avoid the cognitive load on developers of |
| having to deal with a separate `Substring` type. |
| |
| To avoid the memory leak issues of a "same type, shared storage" substring |
| option, we considered whether the compiler could perform an implicit copy of |
| the underlying storage when it detects the string is being "stored" for long |
| term usage, say when it is assigned to a stored property. The trouble with this |
| approach is that it is very difficult for the compiler to distinguish long-term |
| storage from short-term storage in the case of abstractions that rely on |
| stored properties. For example, should the storing of a substring inside an |
| `Optional` be considered long-term? Or the storing of multiple substrings |
| inside an array? The latter would not work well in the case of a |
| `components(separatedBy:)` implementation that intended to return an array of |
| substrings. It would also be difficult to distinguish intentional medium-term |
| storage of substrings, say by a lexer. There does not appear to be an effective |
| consistent rule that could be applied in the general case for detecting when a |
| substring is truly being stored long-term. |
| |
| To avoid the cost of copying substrings under "same type, copied storage", the |
| optimizer could be enhanced to reduce the impact of some of those copies. |
| For example, this code could be optimized to pull the invariant substring out |
| of the loop: |
| |
| ```swift |
| for _ in 0..<lots { |
|   someFunc(takingString: bigString[bigRange]) |
| } |
| ``` |
| |
| It's worth noting that a similar optimization is needed to avoid an equivalent |
| problem with implicit conversion in the "different type, shared storage" case: |
| |
| ```swift |
| let substring = bigString[bigRange] |
| for _ in 0..<lots { someFunc(takingString: substring) } |
| ``` |
| |
| However, in the case of "same type, copied storage" there are many use cases |
| that cannot be optimized as easily. Consider the following simple definition of |
| a recursive `contains` algorithm, which when substring slicing is linear makes |
| the overall algorithm quadratic: |
| |
| ```swift |
| extension String { |
|   func containsChar(_ x: Character) -> Bool { |
|     return !isEmpty && (first == x || dropFirst().containsChar(x)) |
|   } |
| } |
| ``` |
| |
| It is unrealistic to expect the optimizer to eliminate this problem, so users |
| who want efficient code must remember to restructure it to avoid string |
| slicing: |
| |
| ```swift |
| extension String { |
|   // add optional argument tracking progress through the string |
|   func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool { |
|     let idx = idx ?? startIndex |
|     return idx != endIndex |
|       && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx))) |
|   } |
| } |
| ``` |
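| |
| For comparison, under "different type, shared storage" the slicing formulation |
| needs no such rewrite. The following is a minimal sketch, assuming `Substring` |
| behaves as proposed here (slices share storage, so `dropFirst()` does not copy). |
| The recursion is kept for illustration only and still consumes stack in |
| proportion to the length of the text: |
| |
| ```swift |
| extension Substring { |
|   // dropFirst() returns another Substring over the same storage, so each |
|   // step avoids copying the remainder of the text. |
|   func containsChar(_ x: Character) -> Bool { |
|     return !isEmpty && (first == x || dropFirst().containsChar(x)) |
|   } |
| } |
| |
| let text = "hello, world" |
| let found = text[...].containsChar("w")     // search the whole string as a slice |
| let missing = text[...].containsChar("z") |
| ``` |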
| |
| #### Substrings, Ranges and Objective-C Interop |
| |
| The pattern of passing a string/range pair is common in several Objective-C |
| APIs, and is made especially awkward in Swift by the non-interchangeability of |
| `Range<String.Index>` and `NSRange`. |
| |
| ```swift |
| s1.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2)) |
| ``` |
| |
| In general, however, the Swift idiom for operating on a sub-range of a |
| `Collection` is to *slice* the collection and operate on that: |
| |
| ```swift |
| s1.find(s2[j..<s2.endIndex]) |
| ``` |
| |
| Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported |
| without the `NSRange` argument. The Objective-C importer should be changed to |
| give these APIs special treatment so that when a `Substring` is passed, instead |
| of being converted to a `String`, the full `NSString` and range are passed to |
| the Objective-C method, thereby avoiding a copy. |
| |
| As a result, you would never need to pass an `NSRange` to these APIs, which |
| solves the impedance problem by eliminating the argument, resulting in more |
| idiomatic Swift code while retaining the performance benefit. To help users |
| manually handle any cases that remain, Foundation should be augmented to allow |
| the following syntax for converting to and from `NSRange`: |
| |
| ```swift |
| let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j] |
| let iToJ = Range(nsr, in: s) // Equivalent to i..<j |
| ``` |
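| |
| Conversions with this spelling later became available in Foundation's Swift |
| overlay. The sketch below assumes that overlay and uses a string whose first |
| character occupies two UTF-16 code units, so that `Character` offsets and |
| `NSRange` locations visibly differ: |
| |
| ```swift |
| import Foundation |
| |
| let s = "\u{1F44B} hello"                  // waving hand, space, "hello" |
| let i = s.index(s.startIndex, offsetBy: 2) // Character offset 2: start of "hello" |
| let j = s.endIndex |
| |
| let nsr = NSRange(i..<j, in: s)            // measured in UTF-16 code units |
| let location = nsr.location                // differs from the Character offset |
| let roundTrip = Range(nsr, in: s)          // Range<String.Index>?, nil if invalid |
| ``` |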
| |
| ### The `Unicode` protocol |
| |
| With `Substring` and `String` being distinct types and sharing almost all |
| interface and semantics, and with the highest-performance string processing |
| requiring knowledge of encoding and layout that the currency types can't |
| provide, it becomes important to capture the common "string API" in a protocol. |
| Since Unicode conformance is a key feature of string processing in Swift, we |
| call that protocol `Unicode`: |
| |
| **Note:** The following assumes several features that are planned but not yet implemented in |
| Swift, and should be considered a sketch rather than a final design. |
| |
| ```swift |
| protocol Unicode |
|   : Comparable, BidirectionalCollection where Element == Character { |
| |
|   associatedtype Encoding : UnicodeEncoding |
|   var encoding: Encoding { get } |
| |
|   associatedtype CodeUnits |
|     : RandomAccessCollection where Element == Encoding.CodeUnit |
|   var codeUnits: CodeUnits { get } |
| |
|   associatedtype UnicodeScalars |
|     : BidirectionalCollection where Element == UnicodeScalar |
|   var unicodeScalars: UnicodeScalars { get } |
| |
|   associatedtype ExtendedASCII |
|     : BidirectionalCollection where Element == UInt32 |
|   var extendedASCII: ExtendedASCII { get } |
| } |
| |
| extension Unicode { |
|   // ... define high-level non-mutating string operations, e.g. search ... |
| |
|   func compared<Other: Unicode>( |
|     to rhs: Other, |
|     case caseSensitivity: StringSensitivity? = nil, |
|     diacritic diacriticSensitivity: StringSensitivity? = nil, |
|     width widthSensitivity: StringSensitivity? = nil, |
|     in locale: Locale? = nil |
|   ) -> SortOrder { ... } |
| } |
| |
| extension Unicode : RangeReplaceableCollection where CodeUnits : |
|   RangeReplaceableCollection { |
|   // Satisfy protocol requirement |
|   mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C) |
|     where C.Element == Element |
| |
|   // ... define high-level mutating string operations, e.g. replace ... |
| } |
| |
| ``` |
| |
| The goal is that `Unicode` exposes the underlying encoding and code units in |
| such a way that for types with a known representation (e.g. a high-performance |
| `UTF8String`) that information can be known at compile-time and can be used to |
| generate a single path, while still allowing types like `String` that admit |
| multiple representations to use runtime queries and branches to fast path |
| specializations. |
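| |
| The standard library that eventually shipped captures a similar idea in |
| `StringProtocol`, to which both `String` and `Substring` conform. As an |
| illustration only (the `leadingDigits(of:)` function is hypothetical, and |
| `StringProtocol` is not the `Unicode` protocol sketched above), an algorithm |
| written once against such a protocol serves both currency types: |
| |
| ```swift |
| // Written once against the common protocol; usable with String and Substring. |
| func leadingDigits<S: StringProtocol>(of text: S) -> S.SubSequence { |
|   return text.prefix(while: { $0.isASCII && $0.isNumber }) |
| } |
| |
| let fromString = leadingDigits(of: "123abc") |
| let fromSubstring = leadingDigits(of: "width=42px".dropFirst(6)) |
| ``` |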
| |
| **Note:** `Unicode` would make a fantastic namespace for much of |
| what's in this proposal if we could get the ability to nest types and |
| protocols in protocols. |
| |
| |
| ### Scanning, Matching, and Tokenization |
| |
| #### Low-Level Textual Analysis |
| |
| We should provide convenient APIs for processing strings by character. For example, |
| it should be easy to cleanly express, "if this string starts with `"f"`, process |
| the rest of the string as follows..." Swift is well-suited to expressing this |
| common pattern beautifully, but we need to add the APIs. Here are two examples |
| of the sort of code that might be possible given such APIs: |
| |
| ```swift |
| if let firstLetter = input.dropPrefix(alphabeticCharacter) { |
|   somethingWith(input) // process the rest of input |
| } |
| |
| if let (number, restOfInput) = input.parsingPrefix(Int.self) { |
|   ... |
| } |
| ``` |
| |
| The specific spelling and functionality of APIs like this are TBD. The larger |
| point is to make sure matching-and-consuming jobs are well-supported. |
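| |
| As a rough sketch of what a matching-and-consuming operation could look like |
| (the `parsingIntPrefix()` name and signature are hypothetical, not part of this |
| proposal), the following helper splits a leading run of ASCII digits from the |
| rest of the input. Because the remainder is a `Substring`, no text is copied: |
| |
| ```swift |
| extension Substring { |
|   // Hypothetical helper: parses a leading run of ASCII digits as an Int and |
|   // returns it together with the unconsumed remainder, or nil if the |
|   // receiver does not start with a digit (or the value overflows Int). |
|   func parsingIntPrefix() -> (Int, Substring)? { |
|     let digits = prefix(while: { $0.isASCII && $0.isNumber }) |
|     guard let value = Int(digits) else { return nil } |
|     return (value, self[digits.endIndex...]) |
|   } |
| } |
| |
| let input = "42 apples" |
| if let (number, restOfInput) = input[...].parsingIntPrefix() { |
|   print(number, restOfInput) |
| } |
| ``` |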
| |
| #### Unified Pattern Matcher Protocol |
| |
| Many of the current methods that do matching are overloaded to do the same |
| logical operations in different ways, with the following axes: |
| |
| - Logical Operation: `find`, `split`, `replace`, match at start. |
| - Kind of pattern: `CharacterSet`, `String`, a regex, a closure. |
| - Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of |
| the method name, and sometimes an argument. |
| - Whole string or subrange. |
| |
| We should represent these aspects as orthogonal, composable components, |
| abstracting pattern matchers into a protocol like |
| [this one](https://github.com/apple/swift/blob/main/test/Prototypes/PatternMatching.swift#L33), |
| that can allow us to define logical operations once, without introducing |
| overloads, and massively reducing API surface area. |
| |
| For example, using the strawman prefix `%` syntax to turn string literals into |
| patterns, the following pairs would all invoke the same generic methods: |
| |
| ```swift |
| if let found = s.firstMatch(%"searchString") { ... } |
| if let found = s.firstMatch(someRegex) { ... } |
| |
| for m in s.allMatches((%"searchString"), case: .insensitive) { ... } |
| for m in s.allMatches(someRegex) { ... } |
| |
| let items = s.split(separatedBy: ", ") |
| let tokens = s.split(separatedBy: CharacterSet.whitespaces) |
| ``` |
| |
| Note that, because Swift requires the indices of a slice to match the indices of |
| the range from which it was sliced, operations like `firstMatch` can return a |
| `Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in |
| the string being searched, if needed, can easily be recovered as the |
| `startIndex` and `endIndex` of the `Substring`. |
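| |
| To illustrate how one generic operation can serve several kinds of pattern, |
| here is a deliberately small sketch. The `TextPattern` protocol and the |
| `Literal` and `OneOrMore` types are hypothetical names invented for this |
| example, and they are much simpler than the prototype linked above. |
| `firstMatch` is written once and returns a `Substring?` whose bounds are valid |
| indices into the searched string: |
| |
| ```swift |
| // Hypothetical protocol: a pattern reports where a match that begins at the |
| // start of `text` ends, or nil if there is no match at that position. |
| protocol TextPattern { |
|   func matchEnd(atStartOf text: Substring) -> Substring.Index? |
| } |
| |
| // A literal string used as a pattern. |
| struct Literal: TextPattern { |
|   let value: String |
|   func matchEnd(atStartOf text: Substring) -> Substring.Index? { |
|     guard text.starts(with: value) else { return nil } |
|     return text.index(text.startIndex, offsetBy: value.count) |
|   } |
| } |
| |
| // A run of one or more characters satisfying a predicate. |
| struct OneOrMore: TextPattern { |
|   let predicate: (Character) -> Bool |
|   func matchEnd(atStartOf text: Substring) -> Substring.Index? { |
|     let run = text.prefix(while: predicate) |
|     return run.isEmpty ? nil : run.endIndex |
|   } |
| } |
| |
| extension String { |
|   // Defined once for every kind of pattern; tries each start position in turn. |
|   func firstMatch<P: TextPattern>(_ pattern: P) -> Substring? { |
|     var start = startIndex |
|     while true { |
|       if let end = pattern.matchEnd(atStartOf: self[start...]) { |
|         return self[start..<end] |
|       } |
|       if start == endIndex { return nil } |
|       formIndex(after: &start) |
|     } |
|   } |
| } |
| |
| let s = "order 66 shipped" |
| let word = s.firstMatch(Literal(value: "ship")) |
| let number = s.firstMatch(OneOrMore(predicate: { $0.isASCII && $0.isNumber })) |
| // Recover the position of the match from the slice itself. |
| let offset = number.map { s.distance(from: s.startIndex, to: $0.startIndex) } |
| ``` |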
| |
| Note also that matching operations are useful for collections in general, and |
| would fall out of this proposal: |
| |
| ```swift |
| // replace subsequences of contiguous NaNs with zero |
| forces.replace(oneOrMore([Float.nan]), [0.0]) |
| ``` |
| |
| #### Regular Expressions |
| |
| Addressing regular expressions is out of scope for this proposal. |
| That said, it is important to note that the pattern matching protocol mentioned |
| above provides a suitable foundation for regular expressions, and types such as |
| `NSRegularExpression` can easily be retrofitted to conform to it. In the |
| future, support for regular expression literals in the compiler could allow for |
| compile-time syntax checking and optimization. |
| |
| ### String Indices |
| |
| `String` currently has four views--`characters`, `unicodeScalars`, `utf8`, and |
| `utf16`--each with its own opaque index type. The APIs used to translate indices |
| between views add needless complexity, and the opacity of indices makes them |
| difficult to serialize. |
| |
| The index translation problem has two aspects: |
| |
| 1. `String` views cannot consume one another's indices without a cumbersome |
| conversion step. An index into a `String`'s `characters` must be translated |
| before it can be used as a position in its `unicodeScalars`. Although these |
| translations are rarely needed, they add conceptual and API complexity. |
| 2. Many APIs in the core libraries and other frameworks still expose `String` |
| positions as `Int`s and regions as `NSRange`s, which can only reference a |
| `utf16` view and interoperate poorly with `String` itself. |
| |
| #### Index Interchange Among Views |
| |
| String's need for flexible backing storage and reasonably-efficient indexing |
| (i.e. without dynamically allocating and reference-counting the indices |
| themselves) means indices need an efficient underlying storage type. Although |
| we do not wish to expose `String`'s indices *as* integers, `Int` offsets into |
| underlying code unit storage make a good underlying storage type, provided |
| `String`'s underlying storage supports random-access. We think random-access |
| *code-unit storage* is a reasonable requirement to impose on all `String` |
| instances. |
| |
| Making these `Int` code unit offsets conveniently accessible and constructible |
| solves the serialization problem: |
| |
| ```swift |
| clipboard.write(s.endIndex.codeUnitOffset) |
| let offset = clipboard.read(Int.self) |
| let i = String.Index(codeUnitOffset: offset) |
| ``` |
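| |
| The `codeUnitOffset` spelling above is a sketch. Swift 5 later shipped a |
| related pair of APIs, `utf16Offset(in:)` and `String.Index(utf16Offset:in:)`, |
| which can be used to round-trip a position through an `Int`. In this minimal |
| example the plain `Int` stands in for whatever would be written to a clipboard |
| or a file: |
| |
| ```swift |
| let s = "caf\u{E9} au lait" |
| let i = s.firstIndex(of: " ")!                 // position of the first space |
| |
| let offset = i.utf16Offset(in: s)              // an Int that is easy to serialize |
| let restored = String.Index(utf16Offset: offset, in: s) |
| let tail = s[restored...]                      // the text from that position on |
| ``` |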
| |
| Index interchange between `String` and its `unicodeScalars`, `codeUnits`, |
| and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely |
| seamless by having them share an index type (semantics of indexing a `String` |
| between grapheme cluster boundaries are TBD--it can either trap or be forgiving). |
| Having a common index allows easy traversal into the interior of graphemes, |
| something that is often needed, without making it likely that someone will do it |
| by accident. |
| |
| - `String.index(after:)` should advance to the next grapheme, even when the |
| index points partway through a grapheme. |
| |
| - `String.index(before:)` should move to the start of the grapheme before |
| the current position. |
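| |
| In the `String` that shipped with Swift 4 and later, the views do share a |
| single `String.Index` type. The following sketch relies on that shipped |
| behavior rather than on anything specified in this document. It shows a |
| `Character`-aligned index being used directly in other views, and |
| `index(after:)` stepping over a whole grapheme: |
| |
| ```swift |
| let s = "e\u{301}clair"            // "é" spelled as "e" + U+0301, then "clair" |
| let i = s.firstIndex(of: "c")!     // a Character-aligned position |
| |
| let scalar = s.unicodeScalars[i]   // the same index works in the scalar view |
| let byte = s.utf8[i]               // ...and in the UTF-8 view |
| |
| // Advancing by Character skips the combining mark; advancing by scalar does not. |
| let nextCharacter = s.index(after: s.startIndex) |
| let nextScalar = s.unicodeScalars.index(after: s.unicodeScalars.startIndex) |
| let differ = nextScalar < nextCharacter |
| ``` |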
| |
| Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not |
| crucial, as the specifics of encoding should not be a concern for most use |
| cases, and would impose needless costs on the indices of other views. That |
| said, we can make translation much more straightforward by exposing simple |
| bidirectional converting `init`s on both index types: |
| |
| ```swift |
| let u8Position = String.UTF8.Index(someStringIndex) |
| let originalPosition = String.Index(u8Position) |
| ``` |
| |
| #### Index Interchange with Cocoa |
| |
| We intend to address `NSRange`s that denote substrings in Cocoa APIs as |
| described [earlier in this document](#substrings-ranges-and-objective-c-interop). |
| That leaves the interchange of bare indices with Cocoa APIs trafficking in |
| `Int`. Hopefully such APIs will be rare, but when needed, the following |
| extension, which would be useful for all `Collection`s, can help: |
| |
| ```swift |
| extension Collection { |
|   func index(offset: IndexDistance) -> Index { |
|     return index(startIndex, offsetBy: offset) |
|   } |
|   func offset(of i: Index) -> IndexDistance { |
|     return distance(from: startIndex, to: i) |
|   } |
| } |
| ``` |
| |
| Then integers can easily be translated into offsets into a `String`'s `utf16` |
| view for consumption by Cocoa: |
| |
| ```swift |
| let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i)) |
| let swiftIndex = s.utf16.index(offset: cocoaIndex) |
| ``` |
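| |
| A runnable version of this round trip, as a sketch: `IndexDistance` is spelled |
| `Int` here, which is what it resolves to in current Swift, and the sample |
| string starts with a character that occupies two UTF-16 code units so that the |
| Cocoa-style offset differs from the `Character` offset: |
| |
| ```swift |
| extension Collection { |
|   // Same helpers as above, with the distance type spelled as Int. |
|   func index(offset: Int) -> Index { |
|     return index(startIndex, offsetBy: offset) |
|   } |
|   func offset(of i: Index) -> Int { |
|     return distance(from: startIndex, to: i) |
|   } |
| } |
| |
| let s = "\u{1F44B} hi"                              // waving hand, space, "hi" |
| let i = s.firstIndex(of: "h")!                      // Character offset 2 |
| let cocoaIndex = s.utf16.offset(of: i)              // UTF-16 offset for Cocoa |
| let swiftIndex = s.utf16.index(offset: cocoaIndex)  // back to a String.Index |
| ``` |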
| |
| ### Formatting |
| |
| A full treatment of formatting is out of scope of this proposal, but |
| we believe it's crucial for completing the text processing picture. This |
| section details some of the existing issues and thinking that may guide future |
| development. |
| |
| #### Printf-Style Formatting |
| |
| `String(format:)` is designed on the `printf` model: it takes a format string with |
| textual placeholders for substitution, and an arbitrary list of other arguments. |
| The syntax and meaning of these placeholders has a long history in |
| C, but for anyone who doesn't use them regularly they are cryptic and complex, |
| as the `printf (3)` man page attests. |
| |
| Aside from complexity, this style of API has two major problems: First, the |
| spelling of these placeholders must match up to the types of the arguments, in |
| the right order, or the behavior is undefined. Some limited support for |
| compile-time checking of this correspondence could be implemented, but only for |
| the cases where the format string is a literal. Second, there's no reasonable |
| way to extend the formatting vocabulary to cover the needs of new types: you are |
| stuck with what's in the box. |
| |
| #### Foundation Formatters |
| |
| The formatters supplied by Foundation are highly capable and versatile, offering |
| both formatting and parsing services. When used for formatting, though, the |
| design pattern demands more from users than it should: |
| |
| * Matching the type of data being formatted to a formatter type |
| * Creating an instance of that type |
| * Setting stateful options (`currency`, `dateStyle`) on the instance. Note: the |
| need for this step prevents the instance from being used and discarded in |
| the same expression where it is created. |
| * Overall, introduction of needless verbosity into source |
| |
| These may seem like small issues, but the experience of Apple localization |
| experts is that the total drag of these factors on programmers is such that they |
| tend to reach for `String(format:)` instead. |
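| |
| For illustration, formatting a single currency value with Foundation's |
| `NumberFormatter` involves each of the steps listed above. The locale is pinned |
| here only to keep the sketch deterministic: |
| |
| ```swift |
| import Foundation |
| |
| let formatter = NumberFormatter()                // choose a formatter type, create it |
| formatter.numberStyle = .currency                // set stateful options... |
| formatter.locale = Locale(identifier: "en_US")   // ...one statement at a time |
| let price = formatter.string(from: 1234.5)       // only now format the value |
| ``` |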
| |
| #### String Interpolation |
| |
| Swift string interpolation provides a user-friendly alternative to printf's |
| domain-specific language (just write ordinary Swift code!) and its type safety |
| problems (put the data right where it belongs!) but the following issues prevent |
| it from being useful for localized formatting (among other jobs): |
| |
| * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict |
| types used in string interpolation. |
| * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't |
| distinguish (fragments of) the base string from the string substitutions. |
| |
| In the long run, we should improve Swift string interpolation to the point where |
| it can participate in almost any formatting job. Mostly this centers around |
| fixing the interpolation protocols to address the two issues above, and |
| supporting localization. |
| |
| To be able to use formatting effectively inside interpolations, it needs to be |
| both lightweight (because it all happens in-situ) and discoverable. One |
| approach would be to standardize on `format` methods, e.g.: |
| |
| ```swift |
| "Column 1: \(n.format(radix:16, width:8)) *** \(message)" |
| |
| "Something with leading zeroes: \(x.format(fill: zero, width:8))" |
| ``` |
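| |
| A minimal sketch of such a method for integers follows. Everything about it is |
| hypothetical: the parameter names merely mirror the examples above, and a real |
| design would also have to address sign handling, localization, and non-integer |
| types: |
| |
| ```swift |
| extension BinaryInteger { |
|   // Hypothetical: renders the value in `radix`, left-padded with `fill` to at |
|   // least `width` characters. Negative values are padded before the sign. |
|   func format(radix: Int = 10, fill: Character = " ", width: Int = 0) -> String { |
|     let digits = String(self, radix: radix) |
|     let padding = String(repeating: fill, count: Swift.max(0, width - digits.count)) |
|     return padding + digits |
|   } |
| } |
| |
| let n = 255 |
| let x = 7 |
| let column = "Column 1: \(n.format(radix: 16, width: 8)) ***" |
| let padded = "Something with leading zeroes: \(x.format(fill: "0", width: 8))" |
| ``` |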
| |
| ### C String Interop |
| |
| Our support for interoperation with nul-terminated C strings is scattered and |
| incoherent, with six ways to transform a C string into a `String` and four ways |
| to do the inverse. These APIs should be replaced with the following: |
| |
| ```swift |
| extension String { |
|   /// Constructs a `String` having the same contents as `nulTerminatedUTF8`. |
|   /// |
|   /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded |
|   ///   bytes ending just before the first zero byte (NUL character). |
|   init(cString nulTerminatedUTF8: UnsafePointer<CChar>) |
| |
|   /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`. |
|   /// |
|   /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in |
|   ///   the given `encoding`, ending just before the first zero code unit. |
|   /// - Parameter encoding: describes the encoding in which the code units |
|   ///   should be interpreted. |
|   init<Encoding: UnicodeEncoding>( |
|     cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>, |
|     encoding: Encoding) |
| |
|   /// Invokes the given closure on the contents of the string, represented as a |
|   /// pointer to a null-terminated sequence of UTF-8 code units. |
|   func withCString<Result>( |
|     _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result |
| } |
| ``` |
| |
| In both of the construction APIs, any invalid encoding sequence detected will |
| have its longest valid prefix replaced by U+FFFD, the Unicode replacement |
| character, per Unicode specification. This covers the common case. The |
| replacement is done *physically* in the underlying storage and the validity of |
| the result is recorded in the `String`'s `encoding` such that future accesses |
| need not be slowed down by possible error repair separately. |
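| |
| The repair behavior can be observed with the `String(cString:)` and |
| `withCString` APIs in the shipped standard library, which have the same general |
| shape as those proposed here. A small sketch, using an input that ends in a |
| truncated two-byte UTF-8 sequence: |
| |
| ```swift |
| // "Caf" followed by 0xC3, the first byte of a two-byte sequence, then NUL. |
| let bytes: [UInt8] = [0x43, 0x61, 0x66, 0xC3, 0x00] |
| |
| // The ill-formed sequence is replaced by U+FFFD during construction. |
| let repaired = bytes.withUnsafeBufferPointer { buffer in |
|   String(cString: buffer.baseAddress!) |
| } |
| let wasRepaired = repaired == "Caf\u{FFFD}" |
| |
| // Going the other way: count the UTF-8 code units before the terminating NUL. |
| let length = repaired.withCString { (p: UnsafePointer<CChar>) -> Int in |
|   var n = 0 |
|   while p[n] != 0 { n += 1 } |
|   return n |
| } |
| ``` |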
| |
| Construction that is aborted when encoding errors are detected can be |
| accomplished using APIs on the `encoding`. String types that retain their |
| physical encoding even in the presence of errors and are repaired on-the-fly can |
| be built as different models of the `Unicode` protocol. |
| |
| ### Unicode 9 Conformance |
| |
| Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes |
| the process of properly identifying `Character` boundaries. We need to update |
| `String` to account for this change. |
| |
| ### High-Performance String Processing |
| |
| Many strings are short enough to store in 64 bits, many can be stored using only |
| 8 bits per Unicode scalar, others are best encoded in UTF-16, and some come to |
| us already in some other encoding, such as UTF-8, that would be costly to |
| translate. Supporting these formats while maintaining usability for |
| general-purpose APIs demands that a single `String` type can be backed by many |
| different representations. |
| |
| That said, the highest performance code always requires static knowledge of the |
| data structures on which it operates, and for this code, dynamic selection of |
| representation comes at too high a cost. Heavy-duty text processing demands a |
| way to opt out of dynamism and directly use known encodings. Having this |
| ability can also make it easy to cleanly specialize code that handles dynamic |
| cases for maximal efficiency on the most common representations. |
| |
| To address this need, we can build models of the `Unicode` protocol that encode |
| representation information into the type, such as `NFCNormalizedUTF16String`. |
| |
| ### Parsing ASCII Structure |
| |
| Although many machine-readable formats support the inclusion of arbitrary |
| Unicode text, it is also common that their fundamental structure lies entirely |
| within the ASCII subset (JSON, YAML, many XML formats). These formats are often |
| processed most efficiently by recognizing ASCII structural elements as ASCII, |
| and capturing the arbitrary sections between them in more-general strings. The |
| current String API offers no way to efficiently recognize ASCII and skip past |
| everything else without the overhead of full decoding into Unicode scalars. |
| |
| For these purposes, strings should supply an `extendedASCII` view that is a |
| collection of `UInt32`, where values less than `0x80` represent the |
| corresponding ASCII character, and other values represent data that is specific |
| to the underlying encoding of the string. |
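| |
| Until such a view exists, a similar effect can be approximated for UTF-8 input, |
| because every UTF-8 code unit below `0x80` is an ASCII character and never part |
| of a multi-byte sequence. The sketch below is an illustration only: |
| `structuralOffsets(in:)` is a hypothetical helper, and it makes no attempt to |
| skip structural characters that occur inside JSON string literals: |
| |
| ```swift |
| // Returns the UTF-8 offsets of JSON structural characters without decoding |
| // any non-ASCII text into Unicode scalars. |
| func structuralOffsets(in json: String) -> [Int] { |
|   let structural: Set<UInt8> = [ |
|     UInt8(ascii: "{"), UInt8(ascii: "}"), UInt8(ascii: "["), |
|     UInt8(ascii: "]"), UInt8(ascii: ":"), UInt8(ascii: ","), |
|   ] |
|   var offsets: [Int] = [] |
|   for (offset, byte) in json.utf8.enumerated() where structural.contains(byte) { |
|     offsets.append(offset) |
|   } |
|   return offsets |
| } |
| |
| // The value contains U+00E9, which occupies two bytes in UTF-8. |
| let offsets = structuralOffsets(in: "{\"k\":\"\u{E9}\"}") |
| ``` |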
| |
| ## Language Support |
| |
| This proposal depends on two new features in the Swift language: |
| |
| 1. **Generic subscripts**, to |
| enable [unified slicing syntax](#unification-of-slicing-operations). |
| |
| 2. **A [subtype relationship](#different-type-shared-storage)** between |
| `Substring` and `String`, enabling framework APIs to traffic solely in |
| `String` while still making it possible to avoid copies by handling |
| `Substring`s where necessary. |
| |
| Additionally, **the ability to nest types and protocols inside |
| protocols** could significantly shrink the footprint of this proposal |
| on the top-level Swift namespace. |
| |
| |
| ## Open Questions |
| |
| ### Must `String` be limited to storing UTF-16 subset encodings? |
| |
| - The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in |
| question here; this is about what encodings must be storable, without |
| transcoding, in the common currency type called "`String`". |
| - ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not. |
| - If we have a way to get at a `String`'s code units, we need a concrete type in |
| which to express them in the API of `String`, which is a concrete type. |
| - If String needs to be able to represent UTF-32, presumably the code units need |
| to be `UInt32`. |
| - Not supporting UTF-32-encoded text seems like one reasonable design choice. |
| - Maybe we can allow UTF-8 storage in `String` and expose its code units as |
| `UInt16`, just as we would for Latin-1. |
| - Supporting only UTF-16-subset encodings would imply that `String` indices can |
| be serialized without recording the `String`'s underlying encoding. |
| |
| ### Do we need a type-erasable base protocol for UnicodeEncoding? |
| |
| UnicodeEncoding has an associated type, but it may be important to be able to |
| traffic in completely dynamic encoding values, e.g. for "tell me the most |
| efficient encoding for this string." |
| |
| ### Should there be a string "facade?" |
| |
| One possible design alternative makes `Unicode` a vehicle for expressing |
| the storage and encoding of code units, but does not attempt to give it an API |
| appropriate for `String`. Instead, string APIs would be provided by a generic |
| wrapper around an instance of `Unicode`: |
| |
| ```swift |
| struct StringFacade<U: Unicode> : BidirectionalCollection { |
| |
|   // ...APIs for high-level string processing here... |
| |
|   var unicode: U // access to lower-level unicode details |
| } |
| |
| typealias String = StringFacade<StringStorage> |
| typealias Substring = StringFacade<StringStorage.SubSequence> |
| ``` |
| |
| This design would allow us to de-emphasize lower-level `String` APIs such as |
| access to the specific encoding, by putting them behind a `.unicode` property. |
| A similar effect in a facade-less design would require a new top-level |
| `StringProtocol` playing the role of the facade with an `associatedtype |
| Storage : Unicode`. |
| |
| An interesting variation on this design is possible if defaulted generic |
| parameters are introduced to the language: |
| |
| ```swift |
| struct String<U: Unicode = StringStorage> |
|   : BidirectionalCollection { |
| |
|   // ...APIs for high-level string processing here... |
| |
|   var unicode: U // access to lower-level unicode details |
| } |
| |
| typealias Substring = String<StringStorage.SubSequence> |
| ``` |
| |
| One advantage of such a design is that naïve users will always extend "the right |
| type" (`String`) without thinking, and the new APIs will show up on `Substring`, |
| `MyUTF8String`, etc. That said, it also has downsides that should not be |
| overlooked, not least of which is the confusability of the meaning of the word |
| "string." Is it referring to the generic or the concrete type? |
| |
| ### `TextOutputStream` and `TextOutputStreamable` |
| |
| `TextOutputStreamable` is intended to provide a vehicle for |
| efficiently transporting formatted representations to an output stream |
| without forcing the allocation of storage. Its use of `String`, a |
| type with multiple representations, as the lowest-level unit of |
| communication conflicts with this goal. It might be sufficient to |
| change `TextOutputStream` and `TextOutputStreamable` to traffic in an |
| associated type conforming to `Unicode`, but that is not yet clear. |
| This area will require some design work. |
| |
| ### `description` and `debugDescription` |
| |
| * Should these be creating localized or non-localized representations? |
| * Is returning a `String` efficient enough? |
| * Is `debugDescription` pulling the weight of the API surface area it adds? |
| |
| ### `StaticString` |
| |
| `StaticString` was added as a byproduct of standard library development and kept |
| around because it seemed useful, but it was never truly *designed* for client |
| programmers. We need to decide what happens with it. Presumably *something* |
| should fill its role, and that should conform to `Unicode`. |
| |
| ## Footnotes |
| |
| <b id="f0">0</b> The integers rewrite currently underway is expected to |
| substantially reduce the scope of `Int`'s API by using more |
| generics. [↩](#a0) |
| |
| <b id="f1">1</b> In practice, these semantics will usually be tied to the |
| version of the installed [ICU](http://icu-project.org) library, which |
| programmatically encodes the most complex rules of the Unicode Standard and its |
| de-facto extension, CLDR.[↩](#a1) |
| |
| <b id="f2">2</b> |
| See |
| [http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note |
| that inserting Unicode scalar values to prevent merging of grapheme clusters would |
| also constitute a kind of misbehavior (one of the clusters at the boundary would |
| not be found in the result), so would be relatively costly to implement, with |
| little benefit. [↩](#a2) |
| |
| <b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by |
| the Unicode standard for this purpose. In fact there's |
| a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf) |
| dedicated to it. In particular, §5.17 says: |
| |
| > When comparing text that is visible to end users, a correct linguistic sort |
| > should be used, as described in _Section 5.16, Sorting and |
| > Searching_. However, in many circumstances the only requirement is for a |
| > fast, well-defined ordering. In such cases, a binary ordering can be used. |
| |
| [↩](#a4) |
| |
| |
| <b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto |
| properties in a table that's indexed by unicode scalar value. This table is |
| part of the Unicode standard. Some of these queries (e.g., "is this an |
| uppercase character?") may have fairly obvious generalizations to grapheme |
| clusters, but exactly how to do it is a research topic and *ideally* we'd either |
| establish the existing practice that the Unicode committee would standardize, or |
| the Unicode committee would do the research and we'd implement their |
| result.[↩](#a5) |
| |
| <!-- Local Variables: --> |
| <!-- eval: (buffer-face-mode 1) --> |
| <!-- auto-fill-inhibit-regexp: ".*|.*" --> |
| <!-- End: --> |