Commit f5540d2

fix(suggestions): Replace wrong Jaro-Winkler
The implementation of Jaro-Winkler similarity in the dguo/strsim-rs crate is wrong, causing strings with a common prefix of length >= 10 to all be considered perfect matches.

Using Jaro instead, from the same crate, fixes this issue.

The benefit of favoring long prefixes exists for matching common names, but not for typo detection. Hence the use of Jaro instead of Jaro-Winkler is acceptable.

The confidence threshold is adjusted so that `bar` is still suggested for `baz`. Since Jaro is always <= Jaro-Winkler, such an adjustment is expected.

This is acceptable. While exact suggestions may change, the net change will be positive. Suggestions are purely decorative, so this should not be considered a breaking change.

Fixes #4660
Also see rapidfuzz/strsim-rs#53
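The threshold change can be checked by hand for the `bar`/`baz` pair. The worked arithmetic below uses the textbook definitions of Jaro and Jaro-Winkler. The scaling factor $p = 0.1$ and the prefix cap $\ell \le 4$ are the commonly cited defaults and are assumed here; the commit itself does not state them. For two strings of length 3 the match window is $\lfloor 3/2 \rfloor - 1 = 0$, so only same-position characters match, giving $m = 2$ matches and $t = 0$ transpositions.

```latex
\begin{align*}
\mathrm{sim}_j &= \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)
               = \frac{1}{3}\left(\frac{2}{3} + \frac{2}{3} + \frac{2}{2}\right)
               = \frac{7}{9} \approx 0.778 \\
\mathrm{sim}_w &= \mathrm{sim}_j + \ell\, p\, (1 - \mathrm{sim}_j)
               = \frac{7}{9} + 2 \cdot 0.1 \cdot \frac{2}{9}
               = \frac{37}{45} \approx 0.822
\end{align*}
```

Under these assumptions the pair clears the old 0.8 cutoff only with the Winkler prefix boost ($\ell = 2$ for the shared prefix `ba`), while the plain Jaro score falls between 0.7 and 0.8, which is consistent with lowering the threshold to 0.7.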
1 parent 90c042e commit f5540d2

src/parser/features/suggestions.rs

Lines changed: 17 additions & 6 deletions
@@ -4,10 +4,9 @@ use std::cmp::Ordering;
 // Internal
 use crate::builder::Command;
 
-/// Produces multiple strings from a given list of possible values which are similar
-/// to the passed in value `v` within a certain confidence by least confidence.
-/// Thus in a list of possible values like ["foo", "bar"], the value "fop" will yield
-/// `Some("foo")`, whereas "blark" would yield `None`.
+/// Find strings from an iterable of `possible_values` similar to a given value `v`
+/// Returns a Vec of all possible values that exceed a similarity threshold
+/// sorted by ascending similarity, most similar comes last
 #[cfg(feature = "suggestions")]
 pub(crate) fn did_you_mean<T, I>(v: &str, possible_values: I) -> Vec<String>
 where
@@ -16,8 +15,11 @@ where
 {
     let mut candidates: Vec<(f64, String)> = possible_values
         .into_iter()
-        .map(|pv| (strsim::jaro_winkler(v, pv.as_ref()), pv.as_ref().to_owned()))
-        .filter(|(confidence, _)| *confidence > 0.8)
+        // GH #4660: using `jaro` because `jaro_winkler` implementation in `strsim-rs` is wrong
+        // causing strings with common prefix >=10 to be considered perfectly similar
+        .map(|pv| (strsim::jaro(v, pv.as_ref()), pv.as_ref().to_owned()))
+        // Confidence of 0.7 so that bar -> baz is suggested
+        .filter(|(confidence, _)| *confidence > 0.7)
         .collect();
     candidates.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap_or(Ordering::Equal));
     candidates.into_iter().map(|(_, pv)| pv).collect()
@@ -112,6 +114,15 @@ mod test {
         );
     }
 
+    #[test]
+    fn best_fit_long_common_prefix_issue_4660() {
+        let p_vals = ["alignmentScore", "alignmentStart"];
+        assert_eq!(
+            did_you_mean("alignmentScorr", p_vals.iter()),
+            vec!["alignmentStart", "alignmentScore"]
+        );
+    }
+
     #[test]
     fn flag_missing_letter() {
         let p_vals = ["test", "possible", "values"];
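
To see why the new test distinguishes the two candidates, here is a minimal, dependency-free sketch of textbook Jaro similarity, together with a Winkler prefix boost whose prefix cap is a parameter. None of this is the `strsim` source. `jaro`, `winkler`, `common_prefix_len`, and `rank` are hypothetical names written for illustration, and the uncapped-prefix variant is only an assumption about how the reported symptom could arise, not a claim about the actual bug in the crate.

```rust
/// Textbook Jaro similarity over Unicode scalar values (illustrative, not `strsim`).
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Two characters can match only if their positions differ by at most `window`.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut a_matched = vec![false; a.len()];
    let mut b_matched = vec![false; b.len()];
    let mut matches = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_matched[j] && a[i] == b[j] {
                a_matched[i] = true;
                b_matched[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }
    // Walk both match sequences in order and count positions that disagree.
    let mut out_of_order = 0usize;
    let mut j = 0usize;
    for i in 0..a.len() {
        if a_matched[i] {
            while !b_matched[j] {
                j += 1;
            }
            if a[i] != b[j] {
                out_of_order += 1;
            }
            j += 1;
        }
    }
    let m = matches as f64;
    let t = (out_of_order / 2) as f64; // transpositions are counted in pairs
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

/// Number of leading characters the two strings share.
fn common_prefix_len(a: &str, b: &str) -> usize {
    a.chars().zip(b.chars()).take_while(|(x, y)| x == y).count()
}

/// Winkler prefix boost with scaling factor 0.1. The textbook cap is `max_prefix = 4`;
/// passing `usize::MAX` models a hypothetical uncapped variant (an assumption).
fn winkler(a: &str, b: &str, max_prefix: usize) -> f64 {
    let sim = jaro(a, b);
    let l = common_prefix_len(a, b).min(max_prefix) as f64;
    (sim + 0.1 * l * (1.0 - sim)).min(1.0)
}

/// Mirrors the shape of `did_you_mean`: keep candidates whose Jaro score exceeds
/// `threshold`, sorted by ascending similarity so the best match comes last.
fn rank(v: &str, candidates: &[&str], threshold: f64) -> Vec<String> {
    let mut scored: Vec<(f64, String)> = candidates
        .iter()
        .map(|c| (jaro(v, c), c.to_string()))
        .filter(|(score, _)| *score > threshold)
        .collect();
    scored.sort_by(|x, y| x.0.partial_cmp(&y.0).unwrap_or(std::cmp::Ordering::Equal));
    scored.into_iter().map(|(_, c)| c).collect()
}
```

With plain Jaro, `alignmentScorr` scores higher against `alignmentScore` (13 of 14 characters match) than against `alignmentStart` (11 of 14), so `rank("alignmentScorr", &["alignmentScore", "alignmentStart"], 0.7)` orders them as the new test expects. With the uncapped variant, both candidates share a prefix of at least 10 characters with the input and both scores saturate at 1.0, so the ordering carries no information.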
