{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Overview","text":"<p>jellyfish is a library for approximate &amp; phonetic matching of strings.</p> <p>Source: https://github.com/jamesturk/jellyfish</p> <p>Documentation: https://jamesturk.github.io/jellyfish/</p> <p>Issues: https://github.com/jamesturk/jellyfish/issues</p> <p> </p>"},{"location":"#included-algorithms","title":"Included Algorithms","text":"<p>String comparison:</p> <ul> <li>Levenshtein Distance</li> <li>Damerau-Levenshtein Distance</li> <li>Jaro Distance</li> <li>Jaro-Winkler Distance</li> <li>Match Rating Approach Comparison</li> <li>Hamming Distance</li> </ul> <p>Phonetic encoding:</p> <ul> <li>American Soundex</li> <li>Metaphone</li> <li>NYSIIS (New York State Identification and Intelligence System)</li> <li>Match Rating Codex</li> </ul>"},{"location":"#implementations","title":"Implementations","text":"<p>Each algorithm has Rust and Python implementations.</p> <p>The Rust implementations are used by default. 
The Python implementations are a remnant of an early version of the library and will probably be removed in 1.0.</p> <p>To explicitly use a specific implementation, refer to the appropriate module:</p> <pre><code>import jellyfish._jellyfish as pyjellyfish\nimport jellyfish.rustyfish as rustyfish\n</code></pre> <p>If you've already imported jellyfish and are not sure which implementation you are using, you can check by querying <code>jellyfish.library</code>.</p> <pre><code>if jellyfish.library == 'Python':\n    pass  # Python implementation\nelif jellyfish.library == 'Rust':\n    pass  # Rust implementation\n</code></pre>"},{"location":"#example-usage","title":"Example Usage","text":"<pre><code>&gt;&gt;&gt; import jellyfish\n&gt;&gt;&gt; jellyfish.levenshtein_distance('jellyfish', 'smellyfish')\n2\n&gt;&gt;&gt; jellyfish.jaro_similarity('jellyfish', 'smellyfish')\n0.89629629629629637\n&gt;&gt;&gt; jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')\n1\n\n&gt;&gt;&gt; jellyfish.metaphone('Jellyfish')\n'JLFX'\n&gt;&gt;&gt; jellyfish.soundex('Jellyfish')\n'J412'\n&gt;&gt;&gt; jellyfish.nysiis('Jellyfish')\n'JALYF'\n&gt;&gt;&gt; jellyfish.match_rating_codex('Jellyfish')\n'JLLFSH'\n</code></pre>"},{"location":"changelog/","title":"Changelog","text":""},{"location":"changelog/#103-17-november-2023","title":"1.0.3 - 17 November 2023","text":"<ul> <li><code>match_rating_codex</code> now raises a <code>ValueError</code> when passed non-alpha characters (#200)</li> <li>adds prebuilt wheels for Python 3.12</li> </ul>"},{"location":"changelog/#101-18-september-2023","title":"1.0.1 - 18 September 2023","text":"<ul> <li>fully remove deprecated names</li> <li>add armv7 linux builds</li> <li>fully drop Python 3.7 support</li> </ul>"},{"location":"changelog/#100-21-june-2023","title":"1.0.0 - 21 June 2023","text":"<ul> <li>bump to 1.0 (no notable changes from 0.11.2)</li> </ul>"},{"location":"changelog/#0112-2-april-2023","title":"0.11.2 - 2 April 2023","text":"<ul> <li>fix to Rust 
build process to build more wheels, thanks @MartinoMensio!</li> <li>switch to using <code>ahash</code> for Damerau-Levenshtein for speed gains</li> </ul>"},{"location":"changelog/#0111-30-march-2023","title":"0.11.1 - 30 March 2023","text":"<ul> <li>fix missing testdata in packages</li> </ul>"},{"location":"changelog/#0110-27-march-2023","title":"0.11.0 - 27 March 2023","text":"<ul> <li>switched to using Rust implementation for all algorithms</li> </ul>"},{"location":"changelog/#0100-25-march-2023","title":"0.10.0 - 25 March 2023","text":"<ul> <li>removed rarely-used <code>porter_stem</code> function, better implementations exist</li> </ul>"},{"location":"changelog/#090-7-january-2021","title":"0.9.0 - 7 January 2021","text":"<ul> <li>updated documentation available at https://jamesturk.github.io/jellyfish/</li> <li>support for Python 3.10+</li> <li>handle spaces correctly in MRA algorithm</li> </ul>"},{"location":"changelog/#089-26-october-2021","title":"0.8.9 - 26 October 2021","text":"<ul> <li>fix buffer overflow in NYSIIS</li> <li>remove unnecessary/undocumented special casing of digits in Jaro-Winkler</li> </ul>"},{"location":"changelog/#088-17-august-2021","title":"0.8.8 - 17 August 2021","text":"<ul> <li>release to fix Linux wheel issue</li> </ul>"},{"location":"changelog/#087-16-august-2021","title":"0.8.7 - 16 August 2021","text":"<ul> <li>safer allocations from CJellyfish</li> <li>include aarch64 wheels</li> </ul>"},{"location":"changelog/#084-4-august-2021","title":"0.8.4 - 4 August 2021","text":"<ul> <li>fix for jaro winkler (cjellyfish#8)</li> </ul>"},{"location":"changelog/#083-11-march-2021","title":"0.8.3 - 11 March 2021","text":"<ul> <li>build changes</li> <li>include OSX and Windows wheels</li> </ul>"},{"location":"changelog/#082-21-may-2020","title":"0.8.2 - 21 May 2020","text":"<ul> <li>fix jaro_winkler/jaro_winkler_similarity mix-up</li> <li>deprecate jaro_distance in favor of jaro_similarity; backwards compatible shim left in place, will be 
removed in 1.0</li> <li>(note: 0.8.1 was a broken release without proper C libraries)</li> </ul>"},{"location":"changelog/#080-21-may-2020","title":"0.8.0 - 21 May 2020","text":"<ul> <li>rename jaro_winkler to jaro_winkler_similarity to match other functions; backwards compatible shim added, but will be removed in 1.0</li> <li>fix soundex bug with W/H cases, #83</li> <li>fix metaphone bug with WH prefix, #108</li> <li>fix C match rating codex bug with duplicate letters, #121</li> <li>fix metaphone bug with leading vowels and 'kn' pair, #123</li> <li>fix Python jaro_winkler bug #124</li> <li>fix Python 3.9 deprecation warning</li> <li>add manylinux wheels</li> </ul>"},{"location":"changelog/#072-5-june-2019","title":"0.7.2 - 5 June 2019","text":"<ul> <li>fix CJellyfish damerau_levenshtein w/ unicode, thanks to immerrr</li> <li>fix final H in NYSIIS</li> <li>fix issue w/ trailing W in metaphone</li> </ul>"},{"location":"changelog/#071-10-january-2019","title":"0.7.1 - 10 January 2019","text":"<ul> <li>restrict install to Python &gt;= 3.4</li> </ul>"},{"location":"changelog/#070-10-january-2019","title":"0.7.0 - 10 January 2019","text":"<ul> <li>drop Python 2 compatibility &amp; legacy code</li> <li>add bugfix for NYSIIS for words starting with PF</li> </ul>"},{"location":"changelog/#061-april-16-2018","title":"0.6.1 - April 16 2018","text":"<ul> <li>fixed wheel release issue</li> </ul>"},{"location":"changelog/#060-april-7-2018","title":"0.6.0 - April 7 2018","text":"<ul> <li>fix quite a few bugs &amp; differences between C/Py implementations</li> <li>add wagner-fischer testdata</li> <li>uppercase soundex result</li> <li>better error handling in nysiis, soundex, and jaro</li> </ul>"},{"location":"changelog/#056-june-23-2016","title":"0.5.6 - June 23 2016","text":"<ul> <li>bugfix for metaphone &amp; soundex raising unexpected TypeErrors on Windows (#54)</li> </ul>"},{"location":"changelog/#055-june-21-2016","title":"0.5.5 - June 21 2016","text":"<ul> <li>bugfix for 
metaphone WH case</li> </ul>"},{"location":"changelog/#054-may-13-2016","title":"0.5.4 - May 13 2016","text":"<ul> <li>bugfix for C version of damerau_levenshtein thanks to Tyler Sellon</li> </ul>"},{"location":"changelog/#053-march-15-2016","title":"0.5.3 - March 15 2016","text":"<ul> <li>style/packaging changes</li> </ul>"},{"location":"changelog/#052-february-3-2016","title":"0.5.2 - February 3 2016","text":"<ul> <li>testing fixes for Python 3.5</li> <li>bugfix for Metaphone w/ silent H thanks to Jeremy Carbaugh</li> </ul>"},{"location":"changelog/#051-july-12-2015","title":"0.5.1 - July 12 2015","text":"<ul> <li>bugfixes for NYSIIS</li> <li>bugfixes for metaphone</li> <li>bugfix for C version of jaro_winkler</li> </ul>"},{"location":"changelog/#050-april-23-2015","title":"0.5.0 - April 23 2015","text":"<ul> <li>consistent unicode behavior, all functions take unicode and reject bytes on Py2 and 3, C and Python</li> <li>parametrize tests</li> <li>Windows compiler support</li> </ul>"},{"location":"changelog/#040-march-27-2015","title":"0.4.0 - March 27 2015","text":"<ul> <li>tons of new tests</li> <li>documentation</li> <li>split out cjellyfish</li> <li>test all w/ unicode and plenty of fixes to accommodate</li> <li>100% test coverage</li> </ul>"},{"location":"changelog/#034-february-4-2015","title":"0.3.4 - February 4 2015","text":"<ul> <li>fix segfaults and memory leaks via Danrich Parrol</li> </ul>"},{"location":"changelog/#033-november-20-2014","title":"0.3.3 - November 20 2014","text":"<ul> <li>fix bugs in damerau and NYSIIS</li> </ul>"},{"location":"changelog/#032-august-11-2014","title":"0.3.2 - August 11 2014","text":"<ul> <li>fix for jaro-winkler from David McKean</li> <li>more packaging fixes</li> </ul>"},{"location":"changelog/#031-july-16-2014","title":"0.3.1 - July 16 2014","text":"<ul> <li>packaging fix for C/Python alternative</li> </ul>"},{"location":"changelog/#030-july-15-2014","title":"0.3.0 - July 15 2014","text":"<ul> <li>python alternatives 
where C isn't available</li> </ul>"},{"location":"changelog/#022-march-14-2014","title":"0.2.2 - March 14 2014","text":"<ul> <li>testing fixes</li> <li>assorted bugfixes in NYSIIS</li> </ul>"},{"location":"changelog/#020-january-26-2012","title":"0.2.0 - January 26 2012","text":"<ul> <li>incorporate some speed changes from Peter Scott</li> <li>segfault bugfixes.</li> </ul>"},{"location":"changelog/#012-september-16-2010","title":"0.1.2 - September 16 2010","text":"<ul> <li>initial working release</li> </ul>"},{"location":"functions/","title":"Functions","text":"<p>Jellyfish provides a variety of functions for string comparison and phonetic encoding.</p>"},{"location":"functions/#string-comparison","title":"String Comparison","text":"<p>These functions all measure the difference (edit distance) or similarity between two strings.</p>"},{"location":"functions/#levenshtein-distance","title":"Levenshtein Distance","text":"<pre><code>def levenshtein_distance(s1: str, s2: str)\n</code></pre> <p>Compute the Levenshtein distance between s1 and s2.</p> <p>Levenshtein distance is the minimum number of insertions, deletions, and substitutions required to change one word into another.</p> <p>For example: <code>levenshtein_distance('berne', 'born') == 2</code>, representing the transformation of the first e to o and the deletion of the second e.</p> <p>See the Levenshtein distance article at Wikipedia for more details.</p>"},{"location":"functions/#damerau-levenshtein-distance","title":"Damerau-Levenshtein Distance","text":"<pre><code>def damerau_levenshtein_distance(s1: str, s2: str)\n</code></pre> <p>Compute the Damerau-Levenshtein distance between s1 and s2.</p> <p>A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions (such as ifsh for fish) as a single edit.</p> <p>For example, <code>levenshtein_distance('fish', 'ifsh') == 2</code>, as it would require a deletion and an insertion, whereas <code>damerau_levenshtein_distance('fish', 'ifsh') == 
1</code>, as this counts as a single transposition.</p> <p>See the Damerau-Levenshtein distance article at Wikipedia for more details.</p>"},{"location":"functions/#hamming-distance","title":"Hamming Distance","text":"<pre><code>def hamming_distance(s1: str, s2: str)\n</code></pre> <p>Compute the Hamming distance between s1 and s2.</p> <p>Hamming distance is the number of positions at which the characters of two strings differ.</p> <p>Typically Hamming distance is undefined when strings are of different length, but this implementation considers extra characters as differing. For example <code>hamming_distance('abc', 'abcd') == 1</code>.</p> <p>See the Hamming distance article at Wikipedia for more details.</p>"},{"location":"functions/#jaro-similarity","title":"Jaro Similarity","text":"<pre><code>def jaro_similarity(s1: str, s2: str)\n</code></pre> <p>Compute the Jaro similarity between s1 and s2.</p> <p>Jaro similarity is a string similarity measure that gives a floating point response in [0,1] where 0 represents two completely dissimilar strings and 1 represents identical strings.</p> <p>Warning</p> <p>Prior to 0.8.1 this function was named jaro_distance. That name was removed in 1.0.</p>"},{"location":"functions/#jaro-winkler-similarity","title":"Jaro-Winkler Similarity","text":"<pre><code>def jaro_winkler_similarity(s1: str, s2: str)\n</code></pre> <p>Compute the Jaro-Winkler similarity between s1 and s2.</p> <p>Jaro-Winkler is a modification of Jaro similarity. Like Jaro, it gives a floating point response in [0,1] where 0 represents two completely dissimilar strings and 1 represents identical strings.</p> <p>Warning</p> <p>Prior to 0.8.1 this function was named jaro_winkler. That name is still available, but is no longer recommended. 
It will be replaced in 1.0 with a correct version.</p> <p>See the Jaro-Winkler distance article at Wikipedia for more details.</p>"},{"location":"functions/#match-rating-approach-comparison","title":"Match Rating Approach (comparison)","text":"<pre><code>def match_rating_comparison(s1, s2)\n</code></pre> <p>Compare s1 and s2 using the match rating approach algorithm. Returns <code>True</code> if the strings are considered equivalent or <code>False</code> if not. Can also return <code>None</code> if s1 and s2 are not comparable (length differs by more than 3).</p> <p>The Match Rating Approach is an algorithm for determining whether or not two names are pronounced similarly. Strings are first encoded using <code>match_rating_codex</code>, then compared according to the MRA algorithm.</p> <p>See the Match Rating Approach article at Wikipedia for more details.</p>"},{"location":"functions/#phonetic-encoding","title":"Phonetic Encoding","text":"<p>These algorithms convert a string to a normalized phonetic encoding, converting a word to a representation of its pronunciation. Each takes a single string and returns a coded representation.</p>"},{"location":"functions/#american-soundex","title":"American Soundex","text":"<pre><code>def soundex(s: str)\n</code></pre> <p>Calculate the American Soundex of the string s.</p> <p>Soundex is an algorithm to convert a word (typically a name) to a four-character code in the form 'A123' where 'A' is the first letter of the name and the digits represent similar sounds.</p> <p>For example <code>soundex('Ann') == soundex('Anne') == 'A500'</code> and <code>soundex('Rupert') == soundex('Robert') == 'R163'</code>.</p> <p>See the Soundex article at Wikipedia for more details.</p>"},{"location":"functions/#metaphone","title":"Metaphone","text":"<pre><code>def metaphone(s: str)\n</code></pre> <p>Calculate the metaphone code for the string s.</p> <p>The metaphone algorithm was designed as an improvement on Soundex. 
It transforms a word into a string consisting of '0BFHJKLMNPRSTWXY' where '0' is pronounced 'th' and 'X' is a '[sc]h' sound.</p> <p>For example <code>metaphone('Klumpz') == metaphone('Clumps') == 'KLMPS'</code>.</p> <p>See the Metaphone article at Wikipedia for more details.</p>"},{"location":"functions/#nysiis","title":"NYSIIS","text":"<pre><code>def nysiis(s: str)\n</code></pre> <p>Calculate the NYSIIS code for the string s.</p> <p>The NYSIIS algorithm was developed by the New York State Identification and Intelligence System. It transforms a word into a phonetic code. Like soundex and metaphone, it is primarily intended for use on names (as they would be pronounced in English).</p> <p>For example <code>nysiis('John') == nysiis('Jan') == 'JAN'</code>.</p> <p>See the NYSIIS article at Wikipedia for more details.</p>"},{"location":"functions/#match-rating-approach-codex","title":"Match Rating Approach (codex)","text":"<pre><code>def match_rating_codex(s: str)\n</code></pre> <p>Calculate the match rating approach value (also called PNI) for the string s.</p> <p>The Match Rating Approach is an algorithm for determining whether or not two names are pronounced similarly. It consists of an encoding function (similar to soundex or nysiis), which is implemented here, and <code>match_rating_comparison</code>, which does the actual comparison.</p> <p>See the Match Rating Approach article at Wikipedia for more details.</p>"}]}