regex_automata/util/determinize/
state.rs

1/*!
2This module defines a DFA state representation and builders for constructing
3DFA states.
4
5This representation is specifically for use in implementations of NFA-to-DFA
6conversion via powerset construction. (Also called "determinization" in this
7crate.)
8
9The term "DFA state" is somewhat overloaded in this crate. In some cases, it
10refers to the set of transitions over an alphabet for a particular state. In
11other cases, it refers to a set of NFA states. The former is really about the
12final representation of a state in a DFA's transition table, where as the
13latter---what this module is focused on---is closer to an intermediate form
14that is used to help eventually build the transition table.
15
16This module exports four types. All four types represent the same idea: an
17ordered set of NFA states. This ordered set represents the epsilon closure of a
18particular NFA state, where the "epsilon closure" is the set of NFA states that
19can be transitioned to without consuming any input. i.e., Follow all of the NFA
20state's epsilon transitions. In addition, this implementation of DFA states
21cares about two other things: the ordered set of pattern IDs corresponding
22to the patterns that match if the state is a match state, and the set of
23look-behind assertions that were true when the state was created.
24
25The first, `State`, is a frozen representation of a state that cannot be
26modified. It may be cheaply cloned without copying the state itself and can be
27accessed safely from multiple threads simultaneously. This type is useful for
28when one knows that the DFA state being constructed is distinct from any other
29previously constructed states. Namely, powerset construction, in practice,
30requires one to keep a cache of previously created DFA states. Otherwise,
31the number of DFA states created in memory balloons to an impractically
32large number. For this reason, equivalent states should endeavor to have an
33equivalent byte-level representation. (In general, "equivalency" here means,
34"equivalent assertions, pattern IDs and NFA state IDs." We do not require that
35full DFA minimization be implemented here. This form of equivalency is only
36surface deep and is more-or-less a practical necessity.)
37
38The other three types represent different phases in the construction of a
39DFA state. Internally, these three types (and `State`) all use the same
40byte-oriented representation. That means one can use any of the builder types
41to check whether the state it represents already exists or not. If it does,
42then there is no need to freeze it into a `State` (which requires an alloc and
43a copy). Here are the three types described succinctly:
44
45* `StateBuilderEmpty` represents a state with no pattern IDs, no assertions
46and no NFA states. Creating a `StateBuilderEmpty` performs no allocs. A
47`StateBuilderEmpty` can only be used to query its underlying memory capacity,
48or to convert into a builder for recording pattern IDs and/or assertions.
49
50* `StateBuilderMatches` represents a state with zero or more pattern IDs, zero
51or more satisfied assertions and zero NFA state IDs. A `StateBuilderMatches`
52can only be used for adding pattern IDs and recording assertions.
53
54* `StateBuilderNFA` represents a state with zero or more pattern IDs, zero or
55more satisfied assertions and zero or more NFA state IDs. A `StateBuilderNFA`
56can only be used for adding NFA state IDs and recording some assertions.
57
58The expected flow here is to use the above builders to construct a candidate
59DFA state to check if it already exists. If it does, then there's no need to
60freeze it into a `State`. If it doesn't exist, then `StateBuilderNFA::to_state`
61can be called to freeze the builder into an immutable `State`. In either
62case, `clear` should be called on the builder to turn it back into a
63`StateBuilderEmpty` that reuses the underlying memory.
64
65The main purpose for splitting the builder into these distinct types is to
66make it impossible to do things like adding a pattern ID after adding an NFA
67state ID. Namely, this makes it simpler to use a space-and-time efficient
68binary representation for the state. (The format is documented on the `Repr`
69type below.) If we just used one type for everything, it would be possible for
70callers to use an incorrect interleaving of calls and thus result in a corrupt
71representation. I chose to use more type machinery to make this impossible to
72do because 1) determinization is itself pretty complex and it wouldn't be too
73hard to foul this up and 2) there isn't too much machinery involved and it's
74well contained.
75
76As an optimization, sometimes states won't have certain things set. For
77example, if the underlying NFA has no word boundary assertions, then there is
78no reason to set a state's look-behind assertion as to whether it was generated
79from a word byte or not. Similarly, if a state has no NFA states corresponding
80to look-around assertions, then there is no reason to set `look_have` to a
81non-empty set. Finally, callers usually omit unconditional epsilon transitions
82when adding NFA state IDs since they aren't discriminatory.
83
84Finally, the binary representation used by these states is, thankfully, not
85serialized anywhere. So any kind of change can be made with reckless abandon,
86as long as everything in this module agrees.
87*/
88
89use core::mem;
90
91use alloc::{sync::Arc, vec::Vec};
92
93use crate::util::{
94    int::{I32, U32},
95    look::LookSet,
96    primitives::{PatternID, StateID},
97    wire::{self, Endian},
98};
99
100/// A DFA state that, at its core, is represented by an ordered set of NFA
101/// states.
102///
103/// This type is intended to be used only in NFA-to-DFA conversion via powerset
104/// construction.
105///
106/// It may be cheaply cloned and accessed safely from multiple threads
107/// simultaneously.
108#[derive(Clone, Eq, Hash, PartialEq, PartialOrd, Ord)]
109pub(crate) struct State(Arc<[u8]>);
110
111/// This Borrow impl permits us to lookup any state in a map by its byte
112/// representation. This is particularly convenient when one has a StateBuilder
113/// and we want to see if a correspondingly equivalent state already exists. If
114/// one does exist, then we can reuse the allocation required by StateBuilder
115/// without having to convert it into a State first.
116impl core::borrow::Borrow<[u8]> for State {
117    fn borrow(&self) -> &[u8] {
118        &*self.0
119    }
120}
121
122impl core::fmt::Debug for State {
123    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
124        f.debug_tuple("State").field(&self.repr()).finish()
125    }
126}
127
128/// For docs on these routines, see the internal Repr and ReprVec types below.
129impl State {
130    pub(crate) fn dead() -> State {
131        StateBuilderEmpty::new().into_matches().into_nfa().to_state()
132    }
133
134    pub(crate) fn is_match(&self) -> bool {
135        self.repr().is_match()
136    }
137
138    pub(crate) fn is_from_word(&self) -> bool {
139        self.repr().is_from_word()
140    }
141
142    pub(crate) fn is_half_crlf(&self) -> bool {
143        self.repr().is_half_crlf()
144    }
145
146    pub(crate) fn look_have(&self) -> LookSet {
147        self.repr().look_have()
148    }
149
150    pub(crate) fn look_need(&self) -> LookSet {
151        self.repr().look_need()
152    }
153
154    pub(crate) fn match_len(&self) -> usize {
155        self.repr().match_len()
156    }
157
158    pub(crate) fn match_pattern(&self, index: usize) -> PatternID {
159        self.repr().match_pattern(index)
160    }
161
162    pub(crate) fn match_pattern_ids(&self) -> Option<Vec<PatternID>> {
163        self.repr().match_pattern_ids()
164    }
165
166    #[cfg(all(test, not(miri)))]
167    pub(crate) fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F) {
168        self.repr().iter_match_pattern_ids(f)
169    }
170
171    pub(crate) fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F) {
172        self.repr().iter_nfa_state_ids(f)
173    }
174
175    pub(crate) fn memory_usage(&self) -> usize {
176        self.0.len()
177    }
178
179    fn repr(&self) -> Repr<'_> {
180        Repr(&*self.0)
181    }
182}
183
184/// A state builder that represents an empty state.
185///
186/// This is a useful "initial condition" for state construction. It has no
187/// NFA state IDs, no assertions set and no pattern IDs. No allocations are
188/// made when new() is called. Its main use is for being converted into a
189/// builder that can capture assertions and pattern IDs.
190#[derive(Clone, Debug)]
191pub(crate) struct StateBuilderEmpty(Vec<u8>);
192
193/// For docs on these routines, see the internal Repr and ReprVec types below.
194impl StateBuilderEmpty {
195    pub(crate) fn new() -> StateBuilderEmpty {
196        StateBuilderEmpty(alloc::vec![])
197    }
198
199    pub(crate) fn into_matches(mut self) -> StateBuilderMatches {
200        self.0.extend_from_slice(&[0, 0, 0, 0, 0, 0, 0, 0, 0]);
201        StateBuilderMatches(self.0)
202    }
203
204    fn clear(&mut self) {
205        self.0.clear();
206    }
207
208    pub(crate) fn capacity(&self) -> usize {
209        self.0.capacity()
210    }
211}
212
213/// A state builder that collects assertions and pattern IDs.
214///
215/// When collecting pattern IDs is finished, this can be converted into a
216/// builder that collects NFA state IDs.
217#[derive(Clone)]
218pub(crate) struct StateBuilderMatches(Vec<u8>);
219
220impl core::fmt::Debug for StateBuilderMatches {
221    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
222        f.debug_tuple("StateBuilderMatches").field(&self.repr()).finish()
223    }
224}
225
226/// For docs on these routines, see the internal Repr and ReprVec types below.
227impl StateBuilderMatches {
228    pub(crate) fn into_nfa(mut self) -> StateBuilderNFA {
229        self.repr_vec().close_match_pattern_ids();
230        StateBuilderNFA { repr: self.0, prev_nfa_state_id: StateID::ZERO }
231    }
232
233    pub(crate) fn set_is_from_word(&mut self) {
234        self.repr_vec().set_is_from_word()
235    }
236
237    pub(crate) fn set_is_half_crlf(&mut self) {
238        self.repr_vec().set_is_half_crlf()
239    }
240
241    pub(crate) fn look_have(&self) -> LookSet {
242        LookSet::read_repr(&self.0[1..])
243    }
244
245    pub(crate) fn set_look_have(
246        &mut self,
247        set: impl FnMut(LookSet) -> LookSet,
248    ) {
249        self.repr_vec().set_look_have(set)
250    }
251
252    pub(crate) fn add_match_pattern_id(&mut self, pid: PatternID) {
253        self.repr_vec().add_match_pattern_id(pid)
254    }
255
256    fn repr(&self) -> Repr<'_> {
257        Repr(&self.0)
258    }
259
260    fn repr_vec(&mut self) -> ReprVec<'_> {
261        ReprVec(&mut self.0)
262    }
263}
264
265/// A state builder that collects some assertions and NFA state IDs.
266///
267/// When collecting NFA state IDs is finished, this can be used to build a
268/// `State` if necessary.
269///
270/// When dont with building a state (regardless of whether it got kept or not),
271/// it's usually a good idea to call `clear` to get an empty builder back so
272/// that it can be reused to build the next state.
273#[derive(Clone)]
274pub(crate) struct StateBuilderNFA {
275    repr: Vec<u8>,
276    prev_nfa_state_id: StateID,
277}
278
279impl core::fmt::Debug for StateBuilderNFA {
280    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
281        f.debug_tuple("StateBuilderNFA").field(&self.repr()).finish()
282    }
283}
284
285/// For docs on these routines, see the internal Repr and ReprVec types below.
286impl StateBuilderNFA {
287    pub(crate) fn to_state(&self) -> State {
288        State(Arc::from(&*self.repr))
289    }
290
291    pub(crate) fn clear(self) -> StateBuilderEmpty {
292        let mut builder = StateBuilderEmpty(self.repr);
293        builder.clear();
294        builder
295    }
296
297    pub(crate) fn look_need(&self) -> LookSet {
298        self.repr().look_need()
299    }
300
301    pub(crate) fn set_look_have(
302        &mut self,
303        set: impl FnMut(LookSet) -> LookSet,
304    ) {
305        self.repr_vec().set_look_have(set)
306    }
307
308    pub(crate) fn set_look_need(
309        &mut self,
310        set: impl FnMut(LookSet) -> LookSet,
311    ) {
312        self.repr_vec().set_look_need(set)
313    }
314
315    pub(crate) fn add_nfa_state_id(&mut self, sid: StateID) {
316        ReprVec(&mut self.repr)
317            .add_nfa_state_id(&mut self.prev_nfa_state_id, sid)
318    }
319
320    pub(crate) fn as_bytes(&self) -> &[u8] {
321        &self.repr
322    }
323
324    fn repr(&self) -> Repr<'_> {
325        Repr(&self.repr)
326    }
327
328    fn repr_vec(&mut self) -> ReprVec<'_> {
329        ReprVec(&mut self.repr)
330    }
331}
332
333/// Repr is a read-only view into the representation of a DFA state.
334///
335/// Primarily, a Repr is how we achieve DRY: we implement decoding the format
336/// in one place, and then use a Repr to implement the various methods on the
337/// public state types.
338///
339/// The format is as follows:
340///
341/// The first three bytes correspond to bitsets.
342///
343/// Byte 0 is a bitset corresponding to miscellaneous flags associated with the
344/// state. Bit 0 is set to 1 if the state is a match state. Bit 1 is set to 1
345/// if the state has pattern IDs explicitly written to it. (This is a flag that
346/// is not meant to be set by determinization, but rather, is used as part of
347/// an internal space-saving optimization.) Bit 2 is set to 1 if the state was
348/// generated by a transition over a "word" byte. (Callers may not always set
349/// this. For example, if the NFA has no word boundary assertion, then needing
350/// to track whether a state came from a word byte or not is superfluous and
351/// wasteful.) Bit 3 is set to 1 if the state was generated by a transition
352/// from a `\r` (forward search) or a `\n` (reverse search) when CRLF mode is
353/// enabled.
354///
355/// Bytes 1..5 correspond to the look-behind assertions that were satisfied
356/// by the transition that created this state. (Look-ahead assertions are not
357/// tracked as part of states. Instead, these are applied by re-computing the
358/// epsilon closure of a state when computing the transition function. See
359/// `next` in the parent module.)
360///
361/// Bytes 5..9 correspond to the set of look-around assertions (including both
362/// look-behind and look-ahead) that appear somewhere in this state's set of
363/// NFA state IDs. This is used to determine whether this state's epsilon
364/// closure should be re-computed when computing the transition function.
365/// Namely, look-around assertions are "just" conditional epsilon transitions,
366/// so if there are new assertions available when computing the transition
367/// function, we should only re-compute the epsilon closure if those new
368/// assertions are relevant to this particular state.
369///
370/// Bytes 9..13 correspond to a 32-bit native-endian encoded integer
371/// corresponding to the number of patterns encoded in this state. If the state
372/// is not a match state (byte 0 bit 0 is 0) or if it's only pattern ID is
373/// PatternID::ZERO, then no integer is encoded at this position. Instead, byte
374/// offset 3 is the position at which the first NFA state ID is encoded.
375///
376/// For a match state with at least one non-ZERO pattern ID, the next bytes
377/// correspond to a sequence of 32-bit native endian encoded integers that
378/// represent each pattern ID, in order, that this match state represents.
379///
380/// After the pattern IDs (if any), NFA state IDs are delta encoded as
381/// varints.[1] The first NFA state ID is encoded as itself, and each
382/// subsequent NFA state ID is encoded as the difference between itself and the
383/// previous NFA state ID.
384///
385/// [1] - https://developers.google.com/protocol-buffers/docs/encoding#varints
386struct Repr<'a>(&'a [u8]);
387
388impl<'a> Repr<'a> {
389    /// Returns true if and only if this is a match state.
390    ///
391    /// If callers have added pattern IDs to this state, then callers MUST set
392    /// this state as a match state explicitly. However, as a special case,
393    /// states that are marked as match states but with no pattern IDs, then
394    /// the state is treated as if it had a single pattern ID equivalent to
395    /// PatternID::ZERO.
396    fn is_match(&self) -> bool {
397        self.0[0] & (1 << 0) > 0
398    }
399
400    /// Returns true if and only if this state has had at least one pattern
401    /// ID added to it.
402    ///
403    /// This is an internal-only flag that permits the representation to save
404    /// space in the common case of an NFA with one pattern in it. In that
405    /// case, a match state can only ever have exactly one pattern ID:
406    /// PatternID::ZERO. So there's no need to represent it.
407    fn has_pattern_ids(&self) -> bool {
408        self.0[0] & (1 << 1) > 0
409    }
410
411    /// Returns true if and only if this state is marked as having been created
412    /// from a transition over a word byte. This is useful for checking whether
413    /// a word boundary assertion is true or not, which requires look-behind
414    /// (whether the current state came from a word byte or not) and look-ahead
415    /// (whether the transition byte is a word byte or not).
416    ///
417    /// Since states with this set are distinct from states that don't have
418    /// this set (even if they are otherwise equivalent), callers should not
419    /// set this assertion unless the underlying NFA has at least one word
420    /// boundary assertion somewhere. Otherwise, a superfluous number of states
421    /// may be created.
422    fn is_from_word(&self) -> bool {
423        self.0[0] & (1 << 2) > 0
424    }
425
426    /// Returns true if and only if this state is marked as being inside of a
427    /// CRLF terminator. In the forward direction, this means the state was
428    /// created after seeing a `\r`. In the reverse direction, this means the
429    /// state was created after seeing a `\n`.
430    fn is_half_crlf(&self) -> bool {
431        self.0[0] & (1 << 3) > 0
432    }
433
434    /// The set of look-behind assertions that were true in the transition that
435    /// created this state.
436    ///
437    /// Generally, this should be empty if 'look_need' is empty, since there is
438    /// no reason to track which look-behind assertions are true if the state
439    /// has no conditional epsilon transitions.
440    ///
441    /// Satisfied look-ahead assertions are not tracked in states. Instead,
442    /// these are re-computed on demand via epsilon closure when computing the
443    /// transition function.
444    fn look_have(&self) -> LookSet {
445        LookSet::read_repr(&self.0[1..])
446    }
447
448    /// The set of look-around (both behind and ahead) assertions that appear
449    /// at least once in this state's set of NFA states.
450    ///
451    /// This is used to determine whether the epsilon closure needs to be
452    /// re-computed when computing the transition function. Namely, if the
453    /// state has no conditional epsilon transitions, then there is no need
454    /// to re-compute the epsilon closure.
455    fn look_need(&self) -> LookSet {
456        LookSet::read_repr(&self.0[5..])
457    }
458
459    /// Returns the total number of match pattern IDs in this state.
460    ///
461    /// If this state is not a match state, then this always returns 0.
462    fn match_len(&self) -> usize {
463        if !self.is_match() {
464            return 0;
465        } else if !self.has_pattern_ids() {
466            1
467        } else {
468            self.encoded_pattern_len()
469        }
470    }
471
472    /// Returns the pattern ID for this match state at the given index.
473    ///
474    /// If the given index is greater than or equal to `match_len()` for this
475    /// state, then this could panic or return incorrect results.
476    fn match_pattern(&self, index: usize) -> PatternID {
477        if !self.has_pattern_ids() {
478            PatternID::ZERO
479        } else {
480            let offset = 13 + index * PatternID::SIZE;
481            // This is OK since we only ever serialize valid PatternIDs to
482            // states.
483            wire::read_pattern_id_unchecked(&self.0[offset..]).0
484        }
485    }
486
487    /// Returns a copy of all match pattern IDs in this state. If this state
488    /// is not a match state, then this returns None.
489    fn match_pattern_ids(&self) -> Option<Vec<PatternID>> {
490        if !self.is_match() {
491            return None;
492        }
493        let mut pids = alloc::vec![];
494        self.iter_match_pattern_ids(|pid| pids.push(pid));
495        Some(pids)
496    }
497
498    /// Calls the given function on every pattern ID in this state.
499    fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, mut f: F) {
500        if !self.is_match() {
501            return;
502        }
503        // As an optimization for a very common case, when this is a match
504        // state for an NFA with only one pattern, we don't actually write the
505        // pattern ID to the state representation. Instead, we know it must
506        // be there since it is the only possible choice.
507        if !self.has_pattern_ids() {
508            f(PatternID::ZERO);
509            return;
510        }
511        let mut pids = &self.0[13..self.pattern_offset_end()];
512        while !pids.is_empty() {
513            let pid = wire::read_u32(pids);
514            pids = &pids[PatternID::SIZE..];
515            // This is OK since we only ever serialize valid PatternIDs to
516            // states. And since pattern IDs can never exceed a usize, the
517            // unwrap is OK.
518            f(PatternID::new_unchecked(usize::try_from(pid).unwrap()));
519        }
520    }
521
522    /// Calls the given function on every NFA state ID in this state.
523    fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, mut f: F) {
524        let mut sids = &self.0[self.pattern_offset_end()..];
525        let mut prev = 0i32;
526        while !sids.is_empty() {
527            let (delta, nr) = read_vari32(sids);
528            sids = &sids[nr..];
529            let sid = prev + delta;
530            prev = sid;
531            // This is OK since we only ever serialize valid StateIDs to
532            // states. And since state IDs can never exceed an isize, they must
533            // always be able to fit into a usize, and thus cast is OK.
534            f(StateID::new_unchecked(sid.as_usize()))
535        }
536    }
537
538    /// Returns the offset into this state's representation where the pattern
539    /// IDs end and the NFA state IDs begin.
540    fn pattern_offset_end(&self) -> usize {
541        let encoded = self.encoded_pattern_len();
542        if encoded == 0 {
543            return 9;
544        }
545        // This arithmetic is OK since we were able to address this many bytes
546        // when writing to the state, thus, it must fit into a usize.
547        encoded.checked_mul(4).unwrap().checked_add(13).unwrap()
548    }
549
550    /// Returns the total number of *encoded* pattern IDs in this state.
551    ///
552    /// This may return 0 even when this is a match state, since the pattern
553    /// ID `PatternID::ZERO` is not encoded when it's the only pattern ID in
554    /// the match state (the overwhelming common case).
555    fn encoded_pattern_len(&self) -> usize {
556        if !self.has_pattern_ids() {
557            return 0;
558        }
559        // This unwrap is OK since the total number of patterns is always
560        // guaranteed to fit into a usize.
561        usize::try_from(wire::read_u32(&self.0[9..13])).unwrap()
562    }
563}
564
565impl<'a> core::fmt::Debug for Repr<'a> {
566    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
567        let mut nfa_ids = alloc::vec![];
568        self.iter_nfa_state_ids(|sid| nfa_ids.push(sid));
569        f.debug_struct("Repr")
570            .field("is_match", &self.is_match())
571            .field("is_from_word", &self.is_from_word())
572            .field("is_half_crlf", &self.is_half_crlf())
573            .field("look_have", &self.look_have())
574            .field("look_need", &self.look_need())
575            .field("match_pattern_ids", &self.match_pattern_ids())
576            .field("nfa_state_ids", &nfa_ids)
577            .finish()
578    }
579}
580
581/// ReprVec is a write-only view into the representation of a DFA state.
582///
583/// See Repr for more details on the purpose of this type and also the format.
584///
585/// Note that not all possible combinations of methods may be called. This is
586/// precisely what the various StateBuilder types encapsulate: they only
587/// permit valid combinations via Rust's linear typing.
588struct ReprVec<'a>(&'a mut Vec<u8>);
589
590impl<'a> ReprVec<'a> {
591    /// Set this state as a match state.
592    ///
593    /// This should not be exposed explicitly outside of this module. It is
594    /// set automatically when a pattern ID is added.
595    fn set_is_match(&mut self) {
596        self.0[0] |= 1 << 0;
597    }
598
599    /// Set that this state has pattern IDs explicitly written to it.
600    ///
601    /// This should not be exposed explicitly outside of this module. This is
602    /// used internally as a space saving optimization. Namely, if the state
603    /// is a match state but does not have any pattern IDs written to it,
604    /// then it is automatically inferred to have a pattern ID of ZERO.
605    fn set_has_pattern_ids(&mut self) {
606        self.0[0] |= 1 << 1;
607    }
608
609    /// Set this state as being built from a transition over a word byte.
610    ///
611    /// Setting this is only necessary when one needs to deal with word
612    /// boundary assertions. Therefore, if the underlying NFA has no word
613    /// boundary assertions, callers should not set this.
614    fn set_is_from_word(&mut self) {
615        self.0[0] |= 1 << 2;
616    }
617
618    /// Set this state as having seen half of a CRLF terminator.
619    ///
620    /// In the forward direction, this should be set when a `\r` has been seen.
621    /// In the reverse direction, this should be set when a `\n` has been seen.
622    fn set_is_half_crlf(&mut self) {
623        self.0[0] |= 1 << 3;
624    }
625
626    /// The set of look-behind assertions that were true in the transition that
627    /// created this state.
628    fn look_have(&self) -> LookSet {
629        self.repr().look_have()
630    }
631
632    /// The set of look-around (both behind and ahead) assertions that appear
633    /// at least once in this state's set of NFA states.
634    fn look_need(&self) -> LookSet {
635        self.repr().look_need()
636    }
637
638    /// Mutate the set of look-behind assertions that were true in the
639    /// transition that created this state.
640    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641        set(self.look_have()).write_repr(&mut self.0[1..]);
642    }
643
644    /// Mutate the set of look-around (both behind and ahead) assertions that
645    /// appear at least once in this state's set of NFA states.
646    fn set_look_need(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
647        set(self.look_need()).write_repr(&mut self.0[5..]);
648    }
649
650    /// Add a pattern ID to this state. All match states must have at least
651    /// one pattern ID associated with it.
652    ///
653    /// Callers must never add duplicative pattern IDs.
654    ///
655    /// The order in which patterns are added must correspond to the order
656    /// in which patterns are reported as matches.
657    fn add_match_pattern_id(&mut self, pid: PatternID) {
658        // As a (somewhat small) space saving optimization, in the case where
659        // a matching state has exactly one pattern ID, PatternID::ZERO, we do
660        // not write either the pattern ID or the number of patterns encoded.
661        // Instead, all we do is set the 'is_match' bit on this state. Overall,
662        // this saves 8 bytes per match state for the overwhelming majority of
663        // match states.
664        //
665        // In order to know whether pattern IDs need to be explicitly read or
666        // not, we use another internal-only bit, 'has_pattern_ids', to
667        // indicate whether they have been explicitly written or not.
668        if !self.repr().has_pattern_ids() {
669            if pid == PatternID::ZERO {
670                self.set_is_match();
671                return;
672            }
673            // Make room for 'close_match_pattern_ids' to write the total
674            // number of pattern IDs written.
675            self.0.extend(core::iter::repeat(0).take(PatternID::SIZE));
676            self.set_has_pattern_ids();
677            // If this was already a match state, then the only way that's
678            // possible when the state doesn't have pattern IDs is if
679            // PatternID::ZERO was added by the caller previously. In this
680            // case, we are now adding a non-ZERO pattern ID after it, in
681            // which case, we want to make sure to represent ZERO explicitly
682            // now.
683            if self.repr().is_match() {
684                write_u32(self.0, 0)
685            } else {
686                // Otherwise, just make sure the 'is_match' bit is set.
687                self.set_is_match();
688            }
689        }
690        write_u32(self.0, pid.as_u32());
691    }
692
693    /// Indicate that no more pattern IDs will be added to this state.
694    ///
695    /// Once this is called, callers must not call it or 'add_match_pattern_id'
696    /// again.
697    ///
698    /// This should not be exposed explicitly outside of this module. It
699    /// should be called only when converting a StateBuilderMatches into a
700    /// StateBuilderNFA.
701    fn close_match_pattern_ids(&mut self) {
702        // If we never wrote any pattern IDs, then there's nothing to do here.
703        if !self.repr().has_pattern_ids() {
704            return;
705        }
706        let patsize = PatternID::SIZE;
707        let pattern_bytes = self.0.len() - 13;
708        // Every pattern ID uses 4 bytes, so number of bytes should be
709        // divisible by 4.
710        assert_eq!(pattern_bytes % patsize, 0);
711        // This unwrap is OK since we are guaranteed that the maximum number
712        // of possible patterns fits into a u32.
713        let count32 = u32::try_from(pattern_bytes / patsize).unwrap();
714        wire::NE::write_u32(count32, &mut self.0[9..13]);
715    }
716
717    /// Add an NFA state ID to this state. The order in which NFA states are
718    /// added matters. It is the caller's responsibility to ensure that
719    /// duplicate NFA state IDs are not added.
720    fn add_nfa_state_id(&mut self, prev: &mut StateID, sid: StateID) {
721        let delta = sid.as_i32() - prev.as_i32();
722        write_vari32(self.0, delta);
723        *prev = sid;
724    }
725
726    /// Return a read-only view of this state's representation.
727    fn repr(&self) -> Repr<'_> {
728        Repr(self.0.as_slice())
729    }
730}
731
732/// Write a signed 32-bit integer using zig-zag encoding.
733///
734/// https://developers.google.com/protocol-buffers/docs/encoding#varints
735fn write_vari32(data: &mut Vec<u8>, n: i32) {
736    let mut un = n.to_bits() << 1;
737    if n < 0 {
738        un = !un;
739    }
740    write_varu32(data, un)
741}
742
743/// Read a signed 32-bit integer using zig-zag encoding. Also, return the
744/// number of bytes read.
745///
746/// https://developers.google.com/protocol-buffers/docs/encoding#varints
747fn read_vari32(data: &[u8]) -> (i32, usize) {
748    let (un, i) = read_varu32(data);
749    let mut n = i32::from_bits(un >> 1);
750    if un & 1 != 0 {
751        n = !n;
752    }
753    (n, i)
754}
755
756/// Write an unsigned 32-bit integer as a varint. In essence, `n` is written
757/// as a sequence of bytes where all bytes except for the last one have the
758/// most significant bit set. The least significant 7 bits correspond to the
759/// actual bits of `n`. So in the worst case, a varint uses 5 bytes, but in
760/// very common cases, it uses fewer than 4.
761///
762/// https://developers.google.com/protocol-buffers/docs/encoding#varints
763fn write_varu32(data: &mut Vec<u8>, mut n: u32) {
764    while n >= 0b1000_0000 {
765        data.push(n.low_u8() | 0b1000_0000);
766        n >>= 7;
767    }
768    data.push(n.low_u8());
769}
770
771/// Read an unsigned 32-bit varint. Also, return the number of bytes read.
772///
773/// https://developers.google.com/protocol-buffers/docs/encoding#varints
774fn read_varu32(data: &[u8]) -> (u32, usize) {
775    // N.B. We can assume correctness here since we know that all varuints are
776    // written with write_varu32. Hence, the 'as' uses and unchecked arithmetic
777    // is all okay.
778    let mut n: u32 = 0;
779    let mut shift: u32 = 0;
780    for (i, &b) in data.iter().enumerate() {
781        if b < 0b1000_0000 {
782            return (n | (u32::from(b) << shift), i + 1);
783        }
784        n |= (u32::from(b) & 0b0111_1111) << shift;
785        shift += 7;
786    }
787    (0, 0)
788}
789
790/// Push a native-endian encoded `n` on to `dst`.
791fn write_u32(dst: &mut Vec<u8>, n: u32) {
792    use crate::util::wire::NE;
793
794    let start = dst.len();
795    dst.extend(core::iter::repeat(0).take(mem::size_of::<u32>()));
796    NE::write_u32(n, &mut dst[start..]);
797}
798
799#[cfg(test)]
800mod tests {
801    use alloc::vec;
802
803    use quickcheck::quickcheck;
804
805    use super::*;
806
807    #[cfg(not(miri))]
808    quickcheck! {
809        fn prop_state_read_write_nfa_state_ids(sids: Vec<StateID>) -> bool {
810            // Builders states do not permit duplicate IDs.
811            let sids = dedup_state_ids(sids);
812
813            let mut b = StateBuilderEmpty::new().into_matches().into_nfa();
814            for &sid in &sids {
815                b.add_nfa_state_id(sid);
816            }
817            let s = b.to_state();
818            let mut got = vec![];
819            s.iter_nfa_state_ids(|sid| got.push(sid));
820            got == sids
821        }
822
823        fn prop_state_read_write_pattern_ids(pids: Vec<PatternID>) -> bool {
824            // Builders states do not permit duplicate IDs.
825            let pids = dedup_pattern_ids(pids);
826
827            let mut b = StateBuilderEmpty::new().into_matches();
828            for &pid in &pids {
829                b.add_match_pattern_id(pid);
830            }
831            let s = b.into_nfa().to_state();
832            let mut got = vec![];
833            s.iter_match_pattern_ids(|pid| got.push(pid));
834            got == pids
835        }
836
837        fn prop_state_read_write_nfa_state_and_pattern_ids(
838            sids: Vec<StateID>,
839            pids: Vec<PatternID>
840        ) -> bool {
841            // Builders states do not permit duplicate IDs.
842            let sids = dedup_state_ids(sids);
843            let pids = dedup_pattern_ids(pids);
844
845            let mut b = StateBuilderEmpty::new().into_matches();
846            for &pid in &pids {
847                b.add_match_pattern_id(pid);
848            }
849
850            let mut b = b.into_nfa();
851            for &sid in &sids {
852                b.add_nfa_state_id(sid);
853            }
854
855            let s = b.to_state();
856            let mut got_pids = vec![];
857            s.iter_match_pattern_ids(|pid| got_pids.push(pid));
858            let mut got_sids = vec![];
859            s.iter_nfa_state_ids(|sid| got_sids.push(sid));
860            got_pids == pids && got_sids == sids
861        }
862    }
863
864    quickcheck! {
865        fn prop_read_write_varu32(n: u32) -> bool {
866            let mut buf = vec![];
867            write_varu32(&mut buf, n);
868            let (got, nread) = read_varu32(&buf);
869            nread == buf.len() && got == n
870        }
871
872        fn prop_read_write_vari32(n: i32) -> bool {
873            let mut buf = vec![];
874            write_vari32(&mut buf, n);
875            let (got, nread) = read_vari32(&buf);
876            nread == buf.len() && got == n
877        }
878    }
879
880    #[cfg(not(miri))]
881    fn dedup_state_ids(sids: Vec<StateID>) -> Vec<StateID> {
882        let mut set = alloc::collections::BTreeSet::new();
883        let mut deduped = vec![];
884        for sid in sids {
885            if set.contains(&sid) {
886                continue;
887            }
888            set.insert(sid);
889            deduped.push(sid);
890        }
891        deduped
892    }
893
894    #[cfg(not(miri))]
895    fn dedup_pattern_ids(pids: Vec<PatternID>) -> Vec<PatternID> {
896        let mut set = alloc::collections::BTreeSet::new();
897        let mut deduped = vec![];
898        for pid in pids {
899            if set.contains(&pid) {
900                continue;
901            }
902            set.insert(pid);
903            deduped.push(pid);
904        }
905        deduped
906    }
907}
regex_automata/util/determinize/state.rs

regex_automata/util/determinize/
state.rs