memchr/arch/all/
rabinkarp.rs

1/*!
2An implementation of the [Rabin-Karp substring search algorithm][rabinkarp].
3
4Rabin-Karp works by creating a hash of the needle provided and then computing
5a rolling hash for each needle sized window in the haystack. When the rolling
6hash matches the hash of the needle, a byte-wise comparison is done to check
7if a match exists. The worst case time complexity of Rabin-Karp is `O(m *
8n)` where `m ~ len(needle)` and `n ~ len(haystack)`. Its worst case space
9complexity is constant.
10
11The main utility of Rabin-Karp is that the searcher can be constructed very
12quickly with very little memory. This makes it especially useful when searching
13for small needles in small haystacks, as it might finish its search before a
14beefier algorithm (like Two-Way) even starts.
15
16[rabinkarp]: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
17*/
18
19/*
20(This was the comment I wrote for this module originally when it was not
21exposed. The comment still looks useful, but it's a bit in the weeds, so it's
22not public itself.)
23
24This module implements the classical Rabin-Karp substring search algorithm,
25with no extra frills. While its use would seem to break our time complexity
26guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only
27ever use RK on a constant subset of haystacks. The main point here is that
28RK has good latency properties for small needles/haystacks. It's very quick
29to compute a needle hash and zip through the haystack when compared to
30initializing Two-Way, for example. And this is especially useful for cases
31where the haystack is just too short for vector instructions to do much good.
32
33The hashing function used here is the same one recommended by ESMAJ.
34
35Another choice instead of Rabin-Karp would be Shift-Or. But its latency
36isn't quite as good since its preprocessing time is a bit more expensive
37(both in practice and in theory). However, perhaps Shift-Or has a place
38somewhere else for short patterns. I think the main problem is that it
39requires space proportional to the alphabet and the needle. If we, for
40example, supported needles up to length 16, then the total table size would be
41len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's
42probably bad to put that on the stack. So ideally, we'd throw it on the heap,
43but we'd really like to write as much code without using alloc/std as possible.
44But maybe it's worth the special casing. It's a TODO to benchmark.
45
46Wikipedia has a decent explanation, if a bit heavy on the theory:
47https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
48
49But ESMAJ provides something a bit more concrete:
50http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
51
52Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases:
53https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs
54*/
55
56use crate::ext::Pointer;
57
58/// A forward substring searcher using the Rabin-Karp algorithm.
59///
60/// Note that, as a lower level API, a `Finder` does not have access to the
61/// needle it was constructed with. For this reason, executing a search
62/// with a `Finder` requires passing both the needle and the haystack,
63/// where the needle is exactly equivalent to the one given to the `Finder`
64/// at construction time. This design was chosen so that callers can have
65/// more precise control over where and how many times a needle is stored.
66/// For example, in cases where Rabin-Karp is just one of several possible
67/// substring search algorithms.
68#[derive(Clone, Debug)]
69pub struct Finder {
70    /// The actual hash.
71    hash: Hash,
72    /// The factor needed to multiply a byte by in order to subtract it from
73    /// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation),
74    /// where n is the length of the needle. This is how we "remove" a byte
75    /// from the hash once the hash window rolls past it.
76    hash_2pow: u32,
77}
78
79impl Finder {
80    /// Create a new Rabin-Karp forward searcher for the given `needle`.
81    ///
82    /// The needle may be empty. The empty needle matches at every byte offset.
83    ///
84    /// Note that callers must pass the same needle to all search calls using
85    /// this `Finder`.
86    #[inline]
87    pub fn new(needle: &[u8]) -> Finder {
88        let mut s = Finder { hash: Hash::new(), hash_2pow: 1 };
89        let first_byte = match needle.get(0) {
90            None => return s,
91            Some(&first_byte) => first_byte,
92        };
93        s.hash.add(first_byte);
94        for b in needle.iter().copied().skip(1) {
95            s.hash.add(b);
96            s.hash_2pow = s.hash_2pow.wrapping_shl(1);
97        }
98        s
99    }
100
101    /// Return the first occurrence of the `needle` in the `haystack`
102    /// given. If no such occurrence exists, then `None` is returned.
103    ///
104    /// The `needle` provided must match the needle given to this finder at
105    /// construction time.
106    ///
107    /// The maximum value this can return is `haystack.len()`, which can only
108    /// occur when the needle and haystack both have length zero. Otherwise,
109    /// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
110    #[inline]
111    pub fn find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
112        unsafe {
113            let hstart = haystack.as_ptr();
114            let hend = hstart.add(haystack.len());
115            let nstart = needle.as_ptr();
116            let nend = nstart.add(needle.len());
117            let found = self.find_raw(hstart, hend, nstart, nend)?;
118            Some(found.distance(hstart))
119        }
120    }
121
122    /// Like `find`, but accepts and returns raw pointers.
123    ///
124    /// When a match is found, the pointer returned is guaranteed to be
125    /// `>= start` and `<= end`. The pointer returned is only ever equivalent
126    /// to `end` when both the needle and haystack are empty. (That is, the
127    /// empty string matches the empty string.)
128    ///
129    /// This routine is useful if you're already using raw pointers and would
130    /// like to avoid converting back to a slice before executing a search.
131    ///
132    /// # Safety
133    ///
134    /// Note that `start` and `end` below refer to both pairs of pointers given
135    /// to this routine. That is, the conditions apply to both `hstart`/`hend`
136    /// and `nstart`/`nend`.
137    ///
138    /// * Both `start` and `end` must be valid for reads.
139    /// * Both `start` and `end` must point to an initialized value.
140    /// * Both `start` and `end` must point to the same allocated object and
141    /// must either be in bounds or at most one byte past the end of the
142    /// allocated object.
143    /// * Both `start` and `end` must be _derived from_ a pointer to the same
144    /// object.
145    /// * The distance between `start` and `end` must not overflow `isize`.
146    /// * The distance being in bounds must not rely on "wrapping around" the
147    /// address space.
148    /// * It must be the case that `start <= end`.
149    #[inline]
150    pub unsafe fn find_raw(
151        &self,
152        hstart: *const u8,
153        hend: *const u8,
154        nstart: *const u8,
155        nend: *const u8,
156    ) -> Option<*const u8> {
157        let hlen = hend.distance(hstart);
158        let nlen = nend.distance(nstart);
159        if nlen > hlen {
160            return None;
161        }
162        let mut cur = hstart;
163        let end = hend.sub(nlen);
164        let mut hash = Hash::forward(cur, cur.add(nlen));
165        loop {
166            if self.hash == hash && is_equal_raw(cur, nstart, nlen) {
167                return Some(cur);
168            }
169            if cur >= end {
170                return None;
171            }
172            hash.roll(self, cur.read(), cur.add(nlen).read());
173            cur = cur.add(1);
174        }
175    }
176}
177
178/// A reverse substring searcher using the Rabin-Karp algorithm.
179#[derive(Clone, Debug)]
180pub struct FinderRev(Finder);
181
182impl FinderRev {
183    /// Create a new Rabin-Karp reverse searcher for the given `needle`.
184    #[inline]
185    pub fn new(needle: &[u8]) -> FinderRev {
186        let mut s = FinderRev(Finder { hash: Hash::new(), hash_2pow: 1 });
187        let last_byte = match needle.last() {
188            None => return s,
189            Some(&last_byte) => last_byte,
190        };
191        s.0.hash.add(last_byte);
192        for b in needle.iter().rev().copied().skip(1) {
193            s.0.hash.add(b);
194            s.0.hash_2pow = s.0.hash_2pow.wrapping_shl(1);
195        }
196        s
197    }
198
199    /// Return the last occurrence of the `needle` in the `haystack`
200    /// given. If no such occurrence exists, then `None` is returned.
201    ///
202    /// The `needle` provided must match the needle given to this finder at
203    /// construction time.
204    ///
205    /// The maximum value this can return is `haystack.len()`, which can only
206    /// occur when the needle and haystack both have length zero. Otherwise,
207    /// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
208    #[inline]
209    pub fn rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
210        unsafe {
211            let hstart = haystack.as_ptr();
212            let hend = hstart.add(haystack.len());
213            let nstart = needle.as_ptr();
214            let nend = nstart.add(needle.len());
215            let found = self.rfind_raw(hstart, hend, nstart, nend)?;
216            Some(found.distance(hstart))
217        }
218    }
219
220    /// Like `rfind`, but accepts and returns raw pointers.
221    ///
222    /// When a match is found, the pointer returned is guaranteed to be
223    /// `>= start` and `<= end`. The pointer returned is only ever equivalent
224    /// to `end` when both the needle and haystack are empty. (That is, the
225    /// empty string matches the empty string.)
226    ///
227    /// This routine is useful if you're already using raw pointers and would
228    /// like to avoid converting back to a slice before executing a search.
229    ///
230    /// # Safety
231    ///
232    /// Note that `start` and `end` below refer to both pairs of pointers given
233    /// to this routine. That is, the conditions apply to both `hstart`/`hend`
234    /// and `nstart`/`nend`.
235    ///
236    /// * Both `start` and `end` must be valid for reads.
237    /// * Both `start` and `end` must point to an initialized value.
238    /// * Both `start` and `end` must point to the same allocated object and
239    /// must either be in bounds or at most one byte past the end of the
240    /// allocated object.
241    /// * Both `start` and `end` must be _derived from_ a pointer to the same
242    /// object.
243    /// * The distance between `start` and `end` must not overflow `isize`.
244    /// * The distance being in bounds must not rely on "wrapping around" the
245    /// address space.
246    /// * It must be the case that `start <= end`.
247    #[inline]
248    pub unsafe fn rfind_raw(
249        &self,
250        hstart: *const u8,
251        hend: *const u8,
252        nstart: *const u8,
253        nend: *const u8,
254    ) -> Option<*const u8> {
255        let hlen = hend.distance(hstart);
256        let nlen = nend.distance(nstart);
257        if nlen > hlen {
258            return None;
259        }
260        let mut cur = hend.sub(nlen);
261        let start = hstart;
262        let mut hash = Hash::reverse(cur, cur.add(nlen));
263        loop {
264            if self.0.hash == hash && is_equal_raw(cur, nstart, nlen) {
265                return Some(cur);
266            }
267            if cur <= start {
268                return None;
269            }
270            cur = cur.sub(1);
271            hash.roll(&self.0, cur.add(nlen).read(), cur.read());
272        }
273    }
274}
275
276/// Whether RK is believed to be very fast for the given needle/haystack.
277#[inline]
278pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool {
279    haystack.len() < 16
280}
281
282/// A Rabin-Karp hash. This might represent the hash of a needle, or the hash
283/// of a rolling window in the haystack.
284#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
285struct Hash(u32);
286
287impl Hash {
288    /// Create a new hash that represents the empty string.
289    #[inline(always)]
290    fn new() -> Hash {
291        Hash(0)
292    }
293
294    /// Create a new hash from the bytes given for use in forward searches.
295    ///
296    /// # Safety
297    ///
298    /// The given pointers must be valid to read from within their range.
299    #[inline(always)]
300    unsafe fn forward(mut start: *const u8, end: *const u8) -> Hash {
301        let mut hash = Hash::new();
302        while start < end {
303            hash.add(start.read());
304            start = start.add(1);
305        }
306        hash
307    }
308
309    /// Create a new hash from the bytes given for use in reverse searches.
310    ///
311    /// # Safety
312    ///
313    /// The given pointers must be valid to read from within their range.
314    #[inline(always)]
315    unsafe fn reverse(start: *const u8, mut end: *const u8) -> Hash {
316        let mut hash = Hash::new();
317        while start < end {
318            end = end.sub(1);
319            hash.add(end.read());
320        }
321        hash
322    }
323
324    /// Add 'new' and remove 'old' from this hash. The given needle hash should
325    /// correspond to the hash computed for the needle being searched for.
326    ///
327    /// This is meant to be used when the rolling window of the haystack is
328    /// advanced.
329    #[inline(always)]
330    fn roll(&mut self, finder: &Finder, old: u8, new: u8) {
331        self.del(finder, old);
332        self.add(new);
333    }
334
335    /// Add a byte to this hash.
336    #[inline(always)]
337    fn add(&mut self, byte: u8) {
338        self.0 = self.0.wrapping_shl(1).wrapping_add(u32::from(byte));
339    }
340
341    /// Remove a byte from this hash. The given needle hash should correspond
342    /// to the hash computed for the needle being searched for.
343    #[inline(always)]
344    fn del(&mut self, finder: &Finder, byte: u8) {
345        let factor = finder.hash_2pow;
346        self.0 = self.0.wrapping_sub(u32::from(byte).wrapping_mul(factor));
347    }
348}
349
350/// Returns true when `x[i] == y[i]` for all `0 <= i < n`.
351///
352/// We forcefully don't inline this to hint at the compiler that it is unlikely
353/// to be called. This causes the inner rabinkarp loop above to be a bit
354/// tighter and leads to some performance improvement. See the
355/// memmem/krate/prebuilt/sliceslice-words/words benchmark.
356///
357/// # Safety
358///
359/// Same as `crate::arch::all::is_equal_raw`.
360#[cold]
361#[inline(never)]
362unsafe fn is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool {
363    crate::arch::all::is_equal_raw(x, y, n)
364}
365
366#[cfg(test)]
367mod tests {
368    use super::*;
369
370    define_substring_forward_quickcheck!(|h, n| Some(
371        Finder::new(n).find(h, n)
372    ));
373    define_substring_reverse_quickcheck!(|h, n| Some(
374        FinderRev::new(n).rfind(h, n)
375    ));
376
377    #[test]
378    fn forward() {
379        crate::tests::substring::Runner::new()
380            .fwd(|h, n| Some(Finder::new(n).find(h, n)))
381            .run();
382    }
383
384    #[test]
385    fn reverse() {
386        crate::tests::substring::Runner::new()
387            .rev(|h, n| Some(FinderRev::new(n).rfind(h, n)))
388            .run();
389    }
390}