opaque-lattice/papers_txt/An-efficient-string-solver-for-string-constraints-wi_2025_Journal-of-Systems.txt

                                                              Journal of Systems Architecture 160 (2025) 103340


                                                                  Contents lists available at ScienceDirect


                                                         Journal of Systems Architecture
                                                         journal homepage: www.elsevier.com/locate/sysarc


An efficient string solver for string constraints with regex-counting and
string-length
Denghang Hu ∗, Zhilin Wu
Key Laboratory of System Software and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, University of Chinese
Academy of Sciences, Beijing, China


ARTICLE               INFO                               ABSTRACT

Keywords:                                                Regular expressions (regex for short) and string-length function are widely used in string-manipulating
String constraints                                       programs. Counting is a frequently used feature in regexes that counts the number of matchings of sub-
String solver                                            patterns. The state-of-the-art string solvers are incapable of solving string constraints with regex-counting and
Regular expression
                                                         string-length efficiently, especially when the counting and length bounds are large. In this work, we propose
String length
                                                         an automata-theoretic approach for solving such class of string constraints. The main idea is to symbolically
Counting operator
                                                         model the counting operators by registers in automata instead of unfolding them explicitly, thus alleviating the
                                                         state explosion problem. Moreover, the string-length function is modeled by a register as well. As a result, the
                                                         satisfiability of string constraints with regex-counting and string-length is reduced to the satisfiability of linear
                                                         integer arithmetic, which the off-the-shelf SMT solvers can then solve. To improve the performance further,
                                                         we also propose various optimization techniques. We implement the algorithms and validate our approach on
                                                         49,843 benchmark instances. The experimental results show that our approach can solve more instances than
                                                         the state-of-the-art solvers, at a comparable or faster speed, especially when the counting and length bounds are
                                                         large or when the counting operators are nested with some other counting operators or complement operators.


1. Introduction                                                                               proposed to solve such string constraints efficiently (see e.g. [8,17]).
                                                                                                  Counting (aka repetition) is a convenient feature in regexes that
    The string data type plays a crucial role in modern programming                           counts the number of matchings of sub-patterns. For instance, 𝑎{2,4}
languages such as JavaScript, Python, Java, and PHP. String manipula-                         specifies that 𝑎 occurs at least twice and at most four times, and 𝑎{2,∞}
tions are error-prone and could even give rise to severe security vulner-                     specifies that 𝑎 occurs at least twice. Note that although the Kleene
abilities (e.g., cross-site scripting, aka XSS). One powerful method for
                                                                                              star and the Kleene plus operator are special cases of counting, i.e., 𝑒∗
identifying such bugs is symbolic execution. It analyzes the feasibility
                                                                                              is equivalent to 𝑒{0,∞} and 𝑒+ is equivalent to 𝑒{1,∞} , in the rest of this
of execution paths in a program by encoding the paths as logical
formulas and utilizing the constraint solvers (e.g. Z3 [1]) to solve the                      paper, for clarity, we use counting to denote expressions of the form
satisfiability of the formulas. Symbolic execution of string manipulating                     𝑒{𝑚,𝑛} and 𝑒{𝑚,∞} , but not 𝑒∗ or 𝑒+ . Counting is a frequently used feature
programs has motivated the highly active research area of string con-                         of regexes. According to the statistics from [13], counting occurs in
straint solving, resulting in the development of numerous string solvers                      more than 20% of 1204 Python projects. Therefore, an efficient analysis
in the last decade, e.g., Z3seq [1], CVC4/5 [2,3], Z3str/2/3/4 [4–                            of string manipulating programs requires efficient solvers for string
7], Z3str3RE [8], Z3-Trau [9,10], OSTRICH [11], Slent [12], among                             constraints containing regexes with counting and string-length function
many others. Regular expressions (regex for short) and the string-length                      at least.
function are widely used in string-manipulating programs. According                               The aforementioned state-of-the-art string constraint solvers still
to the statistics from [13–15], regexes are used in about 30–40% of
                                                                                              suffer from such string constraints, especially when the counting and
Java, JavaScript, and Python software projects. Moreover, string-length
                                                                                              length bounds are large. For instance, none of the string solvers CVC5,
occupies 78% of the occurrences of string operations in 18 Javascript
applications, according to the statistics from [16]. As a result, most                        Z3seq, Z3-Trau, Z3str3, Z3str3RE, and OSTRICH is capable of solving
of the aforementioned string constraint solvers support both regexes                          the following constraint within 120 s,
and string-length function. Moreover, specialized algorithms have been                        𝑥 ∈ (𝛴 ⧵ 𝑎){1,60} (𝛴 ⧵ 𝑏){1,60} (𝛴 ⧵ 𝑐){0,60} ∧ 𝑥 ∈ 𝛴 ∗ 𝑐 + ∧ |𝑥| > 120.    (1)


  ∗ Corresponding author.
     E-mail address: hudh@ios.ac.cn (D. Hu).

https://doi.org/10.1016/j.sysarc.2025.103340
Received 2 June 2024; Received in revised form 7 January 2025; Accepted 7 January 2025
Available online 17 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
D. Hu and Z. Wu                                                                                                     Journal of Systems Architecture 160 (2025) 103340


Intuitively, the constraint in (1) specifies that 𝑥 is a concatenation of         and back references. Real-world regexes have been addressed in sym-
three strings 𝑥1 , 𝑥2 , 𝑥3 where 𝑎 (resp. 𝑏, 𝑐) does not occur in 𝑥1 (resp.       bolic execution of Javascript programs [21] and string constraint
𝑥2 , 𝑥3 ), moreover, 𝑥 ends with a nonempty sequence of 𝑐’s, and the              solving [22]. Nevertheless, the counting operators are still unfolded
length of 𝑥 is greater than 120. It is easy to observe that this constraint       explicitly in [22]. The Trau tool focuses on string constraints involving
is unsatisfiable since on the one hand, |𝑥| > 120 and the counting upper          flat regular languages and string-length function and solves them by
bound 60 in both (𝛴 ⧵ 𝑎){1,60} and (𝛴 ⧵ 𝑏){1,60} imply that 𝑥 must end            computing LIA formulas that define the Parikh images of flat regular
with some character from 𝛴 ⧵ 𝑐, that is, a character different from 𝑐,            languages [10]. The Slent tool solves the string constraints involving
and on the other hand, 𝑥 ∈ 𝛴 ∗ 𝑐 + requires that 𝑥 has to end with 𝑐.             string-length function by encoding them into so-called length-encoded
     A typical way for string constraint solvers to deal with regular             automata, then utilizing symbolic model checkers to solve their non-
expressions with counting is to unfold them into those without counting           emptiness problem [12]. However, neither Trau nor Slent supports
using the concatenation operator. For instance, 𝑎{1,4} is unfolded into           counting operators explicitly, in other words, counting operators in
𝑎(𝜀 + 𝑎 + 𝑎𝑎 + 𝑎𝑎𝑎) and 𝑎{2,∞} is unfolded into 𝑎𝑎𝑎∗ . Since the unfolding        regexes should be unfolded before solved by them. Cost registers in
incurs an exponential blow-up on the sizes of constraints (assuming that          CEFAs are different from registers in (symbolic) register automata [23,
the counting in string constraints are encoded in binary), the unfolding          24]: In register automata, registers are used to store input values and
penalizes the performance of the solvers, especially when the length              can only be compared for equality/inequality, while in CEFAs, cost
bounds are also available.                                                        registers are used to store integer-valued costs and can be updated by
     Contribution. In this work, we focus on the class of string con-             adding/subtracting integer constants and constrained by the accepting
straints with regular membership and string-length function, where the            conditions which are LIA formulas. Therefore, cost registers in CEFAs
counting operators may occur (called RECL for brevity). We make the               are more like counters in counter automata/machines [25], that is,
following contributions.                                                          CEFAs can be obtained from counter machines by removing transition
                                                                                  guards and adding accepting conditions. CEFAs can also be considered
    • We propose an automata-theoretical approach for solving RECL                as a variation of Parikh automata [26] in which the numbers in the
      constraints. Our main idea is to represent the counting operators           alphabet can be integers, instead of natural numbers. Counting-set
      by cost registers in cost-enriched finite automata (CEFA, see Sec-          automata were proposed in [27,28] to quickly match a subclass of
      tion 5 for the definition), instead of unfolding them explicitly. The       regexes with counting. Moreover, a variant of nondeterministic counter
      string-length function is modeled by cost registers as well. The            automata, called bit vector automata, was proposed recently in [29]
      satisfiability of RECL constraints is reduced to the non-emptiness          to enable fast matching of a more expressive class of regexes with
      problem of CEFA w.r.t. a linear integer arithmetic (LIA) formula.           counting. Nevertheless, the non-emptiness problem of these automata
      According to the results from [18], an LIA formula can be com-              was not considered, and it is unclear whether these automata models
      puted to represent the potential values of registers in CEFA. Thus,         can be used for solving string constraints with regex-counting and
      the non-emptiness of CEFA w.r.t. LIA formulas can be reduced to             string-length.
      the satisfiability of LIA formulas, which is then tackled by off-the-       Structure. The rest of this paper is structured as follows: Section 2 in-
      shelf SMT solvers. Moreover, we propose techniques to reduce the            troduces the preliminaries. Section 3 gives an overview of the approach
      sizes (i.e., the number of states and transitions) of CEFA in order         in this paper. Section 4 presents the syntax and semantics of RECL.
      to achieve better performance.                                              Section 5 defines CEFA. Section 6 introduces the algorithm to solve
    • For complex RECL constraints where the counting operators may               RECL constraints. Section 7 presents some heuristics for improving the
      be nested with the other counting operators or complement oper-             performance of solving RECL constraints where the counting operators
      ators, we propose optimization techniques to improve the perfor-            are nested with some other counting operators or complement opera-
      mance. Furthermore, for the satisfiable RECL constraints where              tors. Section 9 shows the experiment results. Section 10 concludes this
      the solutions of the LIA formulas can be obtained, but the models           paper.
      of the original RECL constraint cannot be generated, we pro-                    A preliminary version of this paper has appeared in the proceedings
      pose optimization techniques to accelerate the model generation             of SETTA 2023 [30]. This paper extends the SETTA paper in the
      process.                                                                    following three aspects.
    • We implement the algorithm on top of OSTRICH, resulting in a
                                                                                      • To improve the performance of complex RECL constraints where
      string solver1 called OSTRICHRECL . Finally, we utilize a collec-
                                                                                        the counting operators are nested with some other counting op-
      tion of benchmark suites2 comprising 49,843 problem instances
                                                                                        erators or complement operators, various optimizations are pro-
      in total to evaluate the performance of OSTRICHRECL . The ex-
                                                                                        posed.
      periment results show that OSTRICHRECL solves the RECL con-
                                                                                      • To deal with the issue that the satisfiability of LIA formulas can be
      straints more efficiently than the state-of-the-art string solvers
                                                                                        solved, but models of the original RECL constraints cannot be gen-
      (see Tables 1–5 for details). For instance, OSTRICHRECL solves
                                                                                        erated, we propose techniques to optimize the model generation
      49,638 out of 49,843 instances, while the best state-of-the-art
                                                                                        process.
      solver solve only 47,721 instances. We also do experiments to jus-
                                                                                      • The benchmarks are extended to cover the situations where the
      tify some of our technical choices and evaluate the performance
                                                                                        counting operators are nested with some other counting opera-
      of the optimizations                                                              tors or complement operators, and the experiments are extended
Related work. We discuss more related work on regexes with counting,                    accordingly.
string-length function, and automata with registers/counters. Deter-
minism of regexes with counting was investigated in [19,20]. Real-                2. Preliminaries
world regexes in programming languages include features beyond
classical regexes, e.g., the greedy/lazy Kleene star, capturing groups,               We write N and Z for the sets of natural and integer numbers. For
                                                                                  𝑛 ∈ N with 𝑛 ≥ 1, [𝑛] denotes {1, … , 𝑛}; for 𝑚, 𝑛 ∈ N with 𝑚 ≤ 𝑛,
                                                                                  [𝑚, 𝑛] denotes {𝑖 ∈ N ∣ 𝑚 ≤ 𝑖 ≤ 𝑛}. Throughout the paper, 𝛴 is a finite
  1
     The solver is open source and available at https://github.com/               alphabet, ranged over 𝑎, 𝑏, ….
SimpleXiaohu/ostrich/tree/jsa2024                                                     For a function 𝑓 from 𝑋 to 𝑌 and 𝑋 ′ ⊆ 𝑋, we use 𝗉𝗋𝗃𝑋 ′ (𝑓 ) to
  2
     The benchmarks are available at https://github.com/SimpleXiaohu/             denote the restriction (aka projection) of 𝑓 to 𝑋 ′ , that is, 𝗉𝗋𝗃𝑋 ′ (𝑓 ) is
ostrich/tree/jsa2024/zaligvinder/models/jsa2024.                                  the function from 𝑋 ′ to 𝑌 , and 𝗉𝗋𝗃𝑋 ′ (𝑓 )(𝑥′ ) = 𝑓 (𝑥′ ) for each 𝑥′ ∈ 𝑋 ′ .

                                                                              2
D. Hu and Z. Wu                                                                                                                          Journal of Systems Architecture 160 (2025) 103340

                                          Table 1
                                          Overall performance evaluation.
                                                              CVC5      Z3str3RE             Z3str3      Z3seq     Ostrich    OSTRICHRECL
                                                   sat           28,282             28,449   23,322      28,212    25,975        29,084
                                                  unsat          16,809             19,316   12,742      18718     20,529        20,554
                                                unknown            8                 134     7021          203       162            14
                                                 timeout          4744              1944     6758         2710      3177           191
                                                   error           0                  44      44            56        0             0
                                               solved            45,091             47,721   36,020      46,874    46,504        49,638
                                            average time
                                                                  6.43              1.99      8.48        4.04       6.75         2.88
                                                 (s)


                                           Table 2
                                           Performance evaluation on the NestC benchmark suite.
                                                              CVC5     Z3str3RE    Z3str3    Z3seq                Ostrich    OSTRICHRECL
                                                    sat               508            166      196        451        0            635
                                                   unsat               0              4        0          67       238           238
                                                  unknown              0              35       31        105        2             6
                                                   timeout            492            795      773        377       760           121
                                                     error             0              0        0          0         0             0
                                                   solved             508            170      196        518       238           873
                                                average time
                                                                  31.57             20.09    50.06      26.13     46.11         19.49
                                                     (s)


    Table 3                                                                                         𝐿2 , denoted by 𝐿1 ⋅ 𝐿2 , is defined as {𝑢 ⋅ 𝑣 ∣ 𝑢 ∈ 𝐿1 , 𝑣 ∈ 𝐿2 }. The union
    Evaluation of the technical choices and optimizations in Section 6.                             (resp. intersection) of 𝐿1 and 𝐿2 , denoted by 𝐿1 ∪ 𝐿2 (resp. 𝐿1 ∩ 𝐿2 ), is
                          OSTRICHRECL
                                   NUXMV
                                             OSTRICHRECL
                                                       −ASR
                                                               OSTRICHRECL
                                                                                                    defined as {𝑢 ∣ 𝑢 ∈ 𝐿1 or 𝑢 ∈ 𝐿2 } (resp. {𝑢 ∣ 𝑢 ∈ 𝐿1 and 𝑢 ∈ 𝐿2 }). The
            sat               26,607               27,295               29,084
                                                                                                    complement of 𝐿1 , denoted by 𝐿1 , is defined as {𝑢 ∣ 𝑢 ∈ 𝛴 ∗ , 𝑢 ∉ 𝐿1 }.
           unsat              20,499               20,513               20,554
                                                                                                    The difference of 𝐿1 and 𝐿2 , denoted by 𝐿1 ⧵ 𝐿2 , is defined as 𝐿1 ∩ 𝐿2 .
         unknown                  45                  68                     14
                                                                                                    For a language 𝐿 and 𝑛 ∈ N, we define 𝐿𝑛 inductively as follows:
          timeout                2,693              1,968                   191                                                                                         ⋃
            error                  0                  0                      0                      𝐿0 = {𝜀}, 𝐿𝑛+1 = 𝐿 ⋅ 𝐿𝑛 for every 𝑛 ∈ N. Finally, define 𝐿∗ = 𝑛∈N 𝐿𝑛 .
           solved             47,106               47,808               49,638                      Finite automata. A (nondeterministic) finite automaton (NFA) is a tuple
      average time (s)         6.86                 4.76                 2.88
                                                                                                     = (𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 ), where 𝑄 is a finite set of states, 𝛴 is a finite
                                                                                                    alphabet, 𝛿 ⊆ 𝑄 × 𝛴 × 𝑄 is the transition relation, 𝐼 ⊆ 𝑄 is the set
    Table 4                                                                                         of initial states, and 𝐹 ⊆ 𝑄 is the set of final states. For readability, we
    Evaluation of the optimizations in Section 7.                                                                                               𝑎                    𝑎
                                                                                                    write a transition (𝑞 , 𝑎, 𝑞 ′ ) ∈ 𝛿 as 𝑞 ←←→
                                                                                                                                                ← 𝑞 ′ (or simply 𝑞 ←←→
                                                                                                                                                                     ← 𝑞 ′ ). The size
                          OSTRICHRECL
                                   −NEST
                                             OSTRICHRECL
                                                    −COMP
                                                                      OSTRICHRECL                                                               𝛿
            sat                  553                296                     635
                                                                                                    of an NFA , denoted by ||, is defined as the number of transitions
           unsat                 238                238                     238                     of . A run of  on a string 𝑤 = 𝑎1 ⋯ 𝑎𝑛 is a sequence of transitions
                                                                                                          𝑎1              𝑎𝑛
         unknown                  27                 8                       6                      𝑞0 ←←←←→               ← 𝑞𝑛 such that 𝑞0 ∈ 𝐼. The run is accepting if 𝑞𝑛 ∈ 𝐹 . A
                                                                                                           ← 𝑞1 ⋯ 𝑞𝑛−1 ←←←←→
          timeout                182                458                     121                     string 𝑤 is accepted by an NFA  if there is an accepting run of  on
            error                  0                 0                       0                      𝑤. In particular, the empty string 𝜀 is accepted by  if 𝐼 ∩ 𝐹 ≠ ∅. The
           solved              791                   534                     873                    language of , denoted by (), is the set of strings accepted by . An
      average time (s)        19.95                 37.49                   19.49                   NFA  is said to be deterministic if 𝐼 is a singleton and, for every 𝑞 ∈ 𝑄
                                                                                                    and 𝑎 ∈ 𝛴, there is at most one state 𝑞 ′ ∈ 𝑄 such that (𝑞 , 𝑎, 𝑞 ′ ) ∈ 𝛿. We
            Table 5                                                                                 use DFA to denote deterministic finite automata.
            Evaluation of the optimizations for the model generation                                      It is well-known that finite automata capture regular languages.
            in Section 8.                                                                           Moreover, the class of regular languages is closed under union, inter-
                                  OSTRICHRECL
                                           −MODEL
                                                     OSTRICHRECL
                                                                                                    section, concatenation, Kleene star, complement, and language differ-
                     sat                 28,735                29,084
                                                                                                    ence [31].
                    unsat                20,554                20,554
                                                                                                          Let 𝑤 ∈ 𝛴 ∗ . The Parikh image of 𝑤, denoted by 𝗉𝖺𝗋𝗂𝗄𝗁(𝑤), is defined
                   unknown                13                     14
                    timeout               541                   191
                                                                                                    as the function 𝜂 ∶ 𝛴 → N such that 𝜂(𝑎) = |𝑤|𝑎 for each 𝑎 ∈ 𝛴.
                      error                0                     0                                  The Parikh image of an NFA , denoted by 𝗉𝖺𝗋𝗂𝗄𝗁(), is defined as
                                                                                                    {𝗉𝖺𝗋𝗂𝗄𝗁(𝑤) ∣ 𝑤 ∈ ()}.
                   solved                49,289                49,638
              average time (s)            3.57                  2.88                                Linear integer arithmetic and Parikh images. We use standard existential
                                                                                                    linear integer arithmetic (LIA) formulas, which typically range over
                                                                                                    𝜓 , 𝜑, 𝛷, 𝛼. For a set X of variables, we use 𝜓∕𝜑∕𝛷∕𝛼(X) to denote the
                                                                                                    set of existential LIA formulas whose free variables are from X. For
Strings and languages. A string over 𝛴 is a (possibly empty) sequence                               example, we use 𝜑(⃗x) with ⃗x = (x1 , … , x𝑘 ) to denote an LIA formula 𝜑
of elements from 𝛴, denoted by 𝑢, 𝑣, 𝑤, …. An empty string is denoted                               whose free variables are from {x1 , … , x𝑘 }. For an LIA formula 𝜑(⃗x), we
by 𝜀. We use 𝛴𝜀 to denote 𝛴 ∪ {𝜀}. We write 𝛴 ∗ (resp., 𝛴 + ) for the                               use 𝜑[⃗𝑡∕⃗x] to denote the formula obtained by replacing (simultaneously)
set of all (resp. nonempty) strings over 𝛴. For a string 𝑢, we use |𝑢| to                           x𝑖 with 𝑡𝑖 for every 𝑖 ∈ [𝑘] where ⃗x = (x1 , … , x𝑘 ) and ⃗𝑡 = (𝑡1 , … , 𝑡𝑘 ) are
denote the number of letters in 𝑢. In particular, |𝜀| = 0. Moreover, for                            tuples of integer terms.
𝑎 ∈ 𝛴, let |𝑢|𝑎 denote the number of occurrences of 𝑎 in 𝑢. Assume that                                 At last, we recall the result about Parikh images of NFA. For each
𝑢 = 𝑎1 ⋯ 𝑎𝑛 is nonempty and 1 ≤ 𝑖 < 𝑗 ≤ 𝑛. We let 𝑢[𝑖] denote 𝑎𝑖 and                                𝑎 ∈ 𝛴, let z𝑎 be an integer variable. Let Z𝛴 denote the set of integer
𝑢[𝑖, 𝑗] for the substring 𝑎𝑖 ⋯ 𝑎𝑗 . Let 𝑢, 𝑣 be two strings. We use 𝑢 ⋅ 𝑣 to                        variables z𝑎 for 𝑎 ∈ 𝛴. Let  be an NFA over the alphabet 𝛴. Then
denote the concatenation of 𝑢 and 𝑣. A language 𝐿 over 𝛴 is a subset of                             we say that an LIA formula 𝜓(Z𝛴 ) defines the Parikh image of , if
strings. Let 𝐿1 , 𝐿2 be two languages. Then the concatenation of 𝐿1 and                             {𝜂 ∶ 𝛴 → N ∣ 𝜓[(𝜂(𝑎)∕z𝑎 )𝑎∈𝛴 ] is true} = 𝗉𝖺𝗋𝗂𝗄𝗁().

                                                                                              3
D. Hu and Z. Wu                                                                                                                                  Journal of Systems Architecture 160 (2025) 103340


                                                               Fig. 1. CEFA for (𝛴 ⧵ 𝑎){1,60} (𝛴 ⧵ 𝑏){1,60} (𝛴 ⧵ 𝑐){0,60} , 𝛴 ∗ 𝑐 + , and |𝑥|.


                                                                     Fig. 2. 1 ∩ 2 ∩ 3 : Intersection of 1 , 2 , and 3 .


Theorem 1 ([32]). Given an NFA , an existential LIA formula 𝜓 (Z𝛴 )                                     the LIA formula 𝜑 ≡ 𝑟4 > 120, that is, whether there exist 𝑤 ∈ 𝛴 ∗ and
can be computed in linear time that defines the Parikh image of .                                        (𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 ) ∈ Z4 such that 𝑤 is accepted by , so that the resulting
                                                                                                          registers values (𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 ) satisfy both 1 ≤ 𝑣1 ≤ 60 ∧ 1 ≤ 𝑣2 ≤ 60 ∧ 0 ≤
3. Overview                                                                                               𝑣3 ≤ 60 and 𝜑. It is not hard to observe that the non-emptiness of
                                                                                                           with respect to 𝜑 is independent of the characters of . Therefore,
      In this section, we utilize the string constraint in Eq. (1) to illustrate                          the characters in  can be ignored, resulting into an NFA  over the
the approach in our work.                                                                                 alphabet C, where C is the set of vectors from Z4 occurring in the
      At first, we construct a CEFA for the regular expression (𝛴 ⧵                                       transitions of  (see Fig. 3(a)). Then the original problem is reduced
𝑎){1,60} (𝛴 ⧵ 𝑏){1,60} (𝛴 ⧵ 𝑐){0,60} . Three registers are introduced, say                                to the problem of deciding whether there exists a string 𝑤′ ∈ C∗ that
𝑟1 , 𝑟2 , 𝑟3 , to represent the three counting operators; the nondeterministic                            is accepted by  and its Parikh image (i.e., numbers of occurrences of
finite automaton (NFA) for (𝛴 ⧵ 𝑎)∗ (𝛴 ⧵ 𝑏)∗ (𝛴 ⧵ 𝑐)∗ is constructed;                                     characters), say 𝜂𝑤′ ∶ C → N, satisfies 1 ≤ 𝑣′1 ≤ 60 ∧ 1 ≤ 𝑣′2 ≤ 60 ∧ 0 ≤
                                                                                                                                                                ∑
the updates of registers are added to the transitions of the NFA; the                                     𝑣′3 ≤ 60 ∧ 𝑣′4 > 120, where (𝑣′1 , 𝑣′2 , 𝑣′3 , 𝑣′4 ) = 𝑣∈C
                                                                                                                                                                  ⃗       ⃗ 𝑣⃗ for each 𝑣⃗ ∈ C.
                                                                                                                                                                     𝜂𝑤′ (𝑣)
                                                                                                                               ′ ′  ′   ′
                                                                                                          Intuitively, (𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 ) is a weighted sum of vectors 𝑣⃗ ∈ C, where the
counting bounds are specified by the accepting condition 1 ≤ 𝑟1 ≤
60 ∧ 1 ≤ 𝑟2 ≤ 60 ∧ 0 ≤ 𝑟3 ≤ 60, resulting in a CEFA 1 illustrated                                        weight is the number of occurrences of 𝑣⃗ in 𝑤′ . (See Section 6.2 for
in Fig. 1(a). 𝑟1 + + means that we increment the value of 𝑟1 by one                                       more detailed arguments.)
after running the transition. A string 𝑤 is accepted by 1 if, when                                            Let C = {𝑣⃗1 , … , 𝑣⃗𝑚 }. From the results in [32,33], an existential LIA
reading the characters in 𝑤, 1 applies the transitions to update the                                     formula 𝜓 (x1 , … , x𝑚 ) can be computed to define the Parikh image of
state and the values of registers, reaching a final state 𝑞 in the end,                                   strings that are accepted by , where x1 , … , x𝑚 are the integer variables
and the resulting values of the three registers, say 𝑣1 , 𝑣2 , 𝑣3 , satisfy the                           to denote the number of occurrences of 𝑣⃗1 , … , 𝑣⃗𝑚 . Therefore, the satis-
accepting condition. In addition, we construct other two CEFAs 2 for                                     fiability of the string constraint in (1) is reduced to the satisfiability of
𝛴 ∗ 𝑐 + (see Fig. 1(b)) and 3 for string length function (see Fig. 1(c)).                                the following existential LIA formula,
                                                                                                                                 ⋀             ∑
In 3 , a register 𝑟4 is used to denote the length of strings and the                                      𝜓 (x1 , … , x𝑚 ) ∧ 1≤𝑗≤4 𝑟𝑗 = 1≤𝑘≤𝑚 x𝑘 𝑣𝑘,𝑗 ∧
accepting condition is 𝚝𝚛𝚞𝚎 (See Section 6.1 for more details about the                                                                                                                     (2)
                                                                                                           1 ≤ 𝑟1 ≤ 60 ∧ 1 ≤ 𝑟2 ≤ 60 ∧ 0 ≤ 𝑟3 ≤ 60 ∧ 𝑟4 > 120.
construction of CEFA.) Note that we represent the counting operators
symbolically by registers instead of unfolding them explicitly.                                           which can be solved by the off-the-shelf SMT solvers.
      Next, we construct 1 ∩ 2 ∩ 3 , that is, the intersection (aka                                        Nevertheless, when the original regexes are complicated (e.g. con-
product) of 1 , 2 , and 3 , as illustrated in Fig. 2(a), where the states                              tain occurrences of negation or intersection operators), the sizes of
that cannot reach the final states are removed. For technical conve-                                      the NFA  can still be big and the sizes of the LIA formulas defining
nience, we also think of the updates of registers in transitions as vectors                               the Parikh image of  are also big. Since the satisfiability of LIA
(𝑢1 , 𝑢2 , 𝑢3 , 𝑢4 ), where 𝑢𝑖 ∈ Z is the update on the register 𝑟𝑖 for each 𝑖 ∈                          formulas is an NP-complete problem [34], big sizes of LIA formulas
[4]. For instance, the transitions corresponding to the self-loop around                                  would be a bottleneck of the performance. To tackle this issue, we
(𝑝0 , 𝑞0 , 𝑞0′ ) are thought as ((𝑝0 , 𝑞0 , 𝑞0′ ), 𝑎′ , (𝑝0 , 𝑞0 , 𝑞0′ ), (1, 0, 0, 1)) with 𝑎′ ∈         propose techniques to reduce the sizes of the NFA .
𝛴 ⧵ {𝑎}, since 𝑟1 and 𝑟4 are incremented by one in these transitions.                                         Specifically, to reduce the sizes of , we determinize , and apply
After considering the updates of registers as vectors, the CEFA is like                                   the minimization algorithm to the resulting deterministic finite au-
Fig. 2(b).                                                                                                tomaton (DFA), resulting in a DFA , as illustrated in Fig. 3(b). Note
      Finally, the satisfiability of the original string constraint is reduced                            that  contains only three states 𝑞0′′ , 𝑞1′′ , 𝑞2′′ and six transitions, while 
to the non-emptiness of the CEFA  ≡ 1 ∩ 2 ∩ 3 with respect to                                         contains four states and eight transitions. Furthermore, if  contains

                                                                                                     4
D. Hu and Z. Wu                                                                                                             Journal of Systems Architecture 160 (2025) 103340


                                                              Fig. 3. Reduced automaton  and .


                                                                    Fig. 4. Syntax of RECL.


⃗
0-labeled transitions, then we can take these transitions as 𝜖-transitions          𝜃 ⊧ 𝜑, if one of the following holds: (1) 𝜑 ≡ 𝑥 ∈ 𝑒 and 𝜃(𝑥) ∈ (𝑒), (2)
and potentially reduce the sizes of automata further.                               𝜑 ≡ 𝑡1 𝑜 𝑡2 and 𝜃(𝑡1 ) 𝑜 𝜃(𝑡2 ), (3) 𝜑 ≡ 𝜑1 ∧ 𝜑2 and 𝜃 ⊧ 𝜑1 and 𝜃 ⊧ 𝜑2 .
    We implement all the aforementioned techniques on the top of OS-                A RECL formula 𝜑 is satisfiable if there is a mapping 𝜃 such as 𝜃 ⊧ 𝜑.
TRICH, resulting in a solver OSTRICHRECL . It turns out that                        The satisfiability problem for RECL (which is abbreviated as 𝖲𝖠𝖳𝖱𝖤𝖢𝖫 )
OSTRICHRECL is able to solve the string constraint in (1) within one                is deciding whether a given RECL formula 𝜑 is satisfiable.
second, while the state-of-the-art string solvers are incapable of solving
it within 120 s.                                                                    5. Cost-enriched finite automata

4. String constraints with regex-counting and string-length                             In this section, we define cost-enriched finite state automata (CEFA),
                                                                                    which was introduced in [18] and will be used to solve the satisfiability
    In the sequel, we define the string constraints with regex-counting             problem of RECL later on. Intuitively, CEFA adds write-only cost regis-
and string-length functions, i.e., REgex-Counting Logic (abbreviated as             ters to finite state automata. ‘‘write-only’’ means that the cost registers
RECL). The syntax of RECL is defined by the rules in Fig. 4, where 𝑥 is a           can only be written/updated but cannot be read, i.e., they cannot be
string variable, x is an integer variable, 𝑎 is a character from an alphabet        used to guard the transitions.
𝛴, and 𝑚, 𝑛 are integer constants. A RECL formula 𝜑 is a conjunction
of atomic formulas of the form 𝑥 ∈ 𝑒 or 𝑡1 𝑜 𝑡2 , where 𝑒 is a regular              Definition 1 (Cost-Enriched Finite Automaton). A cost-enriched finite
expression, 𝑡1 and 𝑡2 are integer terms, and 𝑜 ∈ {=, ≠, ≤, ≥, <, >}. Atomic         automaton  is a tuple (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼) where
formulas of the form 𝑥 ∈ 𝑒 are called regular membership constraints,
                                                                                        • 𝑅 = {𝑟1 , … , 𝑟𝑘 } is a finite set of registers,
and atomic formulas of the form 𝑡1 𝑜 𝑡2 are called length constraints.
A regular expression 𝑒 is built from ∅, the empty string 𝜖, and the                     • 𝑄, 𝛴 , 𝐼 , 𝐹 are as in the definition of NFA,
character 𝑎 by using concatenation ⋅, union +, Kleene star ∗ , intersection             • 𝛿 ⊆ 𝑄 × 𝛴 × 𝑄 × Z𝑅 is a transition relation, where Z𝑅 denotes the
∩, complement ̄, difference ⧵, counting {𝑚,𝑛} or {𝑚,∞} . An integer term                  updates on the values of registers.
is built from constants 𝑛, variables x, and string lengths |𝑥| by operators             • 𝛼 ∈ 𝛷(𝑅) is an LIA formula specifying an accepting condition.
+ and −.
                                                                                          For convenience, we use 𝑅 to denote the set of registers of .
    Moreover, for 𝑆 ⊆ 𝛴 with 𝑆 = {𝑎1 , … , 𝑎𝑘 }, we use 𝑆 as an
                                                                                    We assume a linear order on 𝑅 and write 𝑅 as a vector (𝑟1 , … , 𝑟𝑘 ).
abbreviation of 𝑎1 + ⋯ + 𝑎𝑘 .
                                                                                    Accordingly, we write an element of Z𝑅 as a vector (𝑣1 , … , 𝑣𝑘 ), where
    For each regular expression 𝑒, the language defined by 𝑒, denoted
                                                                                    𝑣𝑖 is the update of 𝑟𝑖 for each 𝑖 ∈ [𝑘]. We also write a transition
by (𝑒), is defined recursively. For instance, (∅) = ∅, (𝜀) = {𝜀},                                           𝑎
                                                                                    (𝑞 , 𝑎, 𝑞 ′ , 𝑣)           ← 𝑞′ .
                                                                                                  ⃗ ∈ 𝛿 as 𝑞 ←←→
(𝑎) = {𝑎}, and (𝑒1 ⋅ 𝑒2 ) = (𝑒1 ) ⋅ (𝑒2 ), (𝑒1 + 𝑒2 ) = (𝑒1 ) ∪ (𝑒2 ),                                𝑣⃗
and so on. It is well-known that regular expressions define the same                      The semantics of CEFA is defined as follows. Let  = (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 ,
class of languages as finite state automata, that is, the class of regular          𝐹 , 𝛼) be a CEFA. A run of  on a string 𝑤 = 𝑎1 ⋯ 𝑎𝑛 is a sequence
                                                                                          𝑎1              𝑎𝑛                                 𝑎𝑖
languages [31].                                                                     𝑞0 ←←←←→               ← 𝑞𝑛 such that 𝑞0 ∈ 𝐼 and 𝑞𝑖−1 ←←←←→
                                                                                           ← 𝑞1 ⋯ 𝑞𝑛−1 ←←←←→                                  ← 𝑞𝑖 for each 𝑖 ∈ [𝑛].
                                                                                       𝑣⃗1             𝑣⃗𝑛                                        𝑣⃗𝑖
    Let 𝜑 be a RECL formula and 𝖲𝖵𝖺𝗋𝗌(𝜑) (resp. 𝖨𝖵𝖺𝗋𝗌(𝜑)) denote the set                       𝑎  1               𝑎𝑛
of string (resp. integer) variables occurring in 𝜑. Then the semantics of           A run 𝑞0 ←←←←→← 𝑞1 ⋯ 𝑞𝑛−1 ←←←←→← 𝑞𝑛 is accepting if 𝑞𝑛 ∈ 𝐹 and 𝛼(𝑣⃗′ ∕𝑅) is true,
                                                                                                𝑣⃗1              𝑣⃗𝑛
𝜑 is defined with respect to a mapping 𝜃 ∶ 𝖲𝖵𝖺𝗋𝗌(𝜑) → 𝛴 ∗ ⊎𝖨𝖵𝖺𝗋𝗌(𝜑) → Z                             ∑
                                                                                    where 𝑣⃗′ = 𝑗∈[𝑛] 𝑣⃗𝑗 . The string 𝑤 is 𝑎𝑐 𝑐 𝑒𝑝𝑡𝑒𝑑 by the CEFA if and only
(where ⊎ denotes the disjoint union). Note that the mapping 𝜃 can                                                                             ∑
                                                                                    if there is an accepting run of 𝑤. The vector 𝑣⃗′ = 𝑗∈[𝑛] 𝑣⃗𝑗 is called the
naturally extend to the set of integer terms. For instance, 𝜃(|𝑥|) = |𝜃(𝑥)|,                                          𝑎1                𝑎𝑛
𝜃(𝑡1 + 𝑡2 ) = 𝜃(𝑡1 ) + 𝜃(𝑡2 ). A mapping 𝜃 is said to satisfy 𝜑, denoted by         cost of an accepting run 𝑞0 ←←←←→               ← 𝑞𝑛 . Note that the values of
                                                                                                                    ← 𝑞1 ⋯ 𝑞𝑛−1 ←←←←→
                                                                                                                      𝑣⃗1               𝑣⃗𝑛


                                                                                5
D. Hu and Z. Wu                                                                                                                                   Journal of Systems Architecture 160 (2025) 103340

                                                                      ∑
all registers are initiated to zero and updated to 𝑗∈[𝑛] 𝑣⃗𝑗 after all the
                                                                                                    From the construction,
transitions in the run are executed. We use 𝑣⃗′ ∈ (𝑤) to denote the fact
that there is an accepting run of  on 𝑤 whose cost is 𝑣⃗′ . We define                              (1 ∩ 2 ) = {(𝑤; 𝑣)             ⃗ ∈ (1 ) and (𝑤; 𝗉𝗋𝗃𝑅2 (𝑣))
                                                                                                                       ⃗ ∣ (𝑤; 𝗉𝗋𝗃𝑅1 (𝑣))                       ⃗ ∈ (2 )}.
the semantics of a CEFA , denoted by (), as {(𝑤; 𝑣⃗′ ) ∣ 𝑣⃗′ ∈ (𝑤)}.                            Intuitively, 1 ∩ 2 accepts the words that are accepted by both 1
In particular, if 𝐼 ∩ 𝐹 ≠ ∅ and 𝛼[0∕𝑅]          ⃗    is true, then (𝜀; 0)⃗ ∈ ().
                                                                                                    and 2 and outputs the costs of 1 and 2 .
Moreover, we define the output of a CEFA , denoted by (), as                                          The concatenation of 1 and 2 , denoted by 1 ⋅ 2 , is defined
{𝑣⃗′ ∣ ∃𝑤. 𝑣⃗′ ∈ (𝑤)}.                                                                             similarly as that of NFA, that is, a tuple (𝑅′ , 𝑄′ , 𝛴 , 𝛿 ′ , 𝐼 ′ , 𝐹 ′ , 𝛼 ′ ), where
     We want to remark that the definition of CEFA above is slightly                                𝑅′ = 𝑅1 ∪ 𝑅2 , 𝑄′ = 𝑄1 ∪ 𝑄2 , 𝐼 ′ = 𝐼1 , 𝛼 ′ = 𝛼1 ∧ 𝛼2 , 𝛿 ′ =
different from that of [18], where CEFA did not include accepting                                                          ⃗ ∣ (𝑞1 , 𝑎, 𝑞 ′ , 𝑣⃗1 ) ∈ 𝛿1 } ∪ {(𝑞2 , 𝑎, 𝑞 ′ , (0,
                                                                                                    {(𝑞1 , 𝑎, 𝑞1′ , (𝑣⃗1 , 0))                                                ⃗ 𝑣⃗2 )) ∣ (𝑞2 , 𝑎, 𝑞 ′ , 𝑣⃗2 ) ∈
                                                                                                                                          1                              2                         2
conditions. Moreover, the accepting conditions 𝛼 in CEFA are defined                                𝛿2 } ∪ {(𝑞1 , 𝑎, 𝑞2 , (0,⃗ 𝑣⃗2 )) ∣ 𝑞1 ∈ 𝐹1 , ∃𝑞 ′ ∈ 𝐼2 . (𝑞 ′ , 𝑎, 𝑞2 , 𝑣⃗2 ) ∈ 𝛿2 }, moreover, if
in a global fashion because the accepting condition does not distinguish                            𝐼2 ∩𝐹2 ≠ ∅, then 𝐹 ′ = 𝐹1 ∪𝐹2 , otherwise, 𝐹 ′ = 𝐹2 . From the construction,
final states. This technical choice is made so that the determinization
and minimization of NFA can be utilized to reduce the size of CEFA in                               (1 ⋅2 ) = {(𝑤1 𝑤2 ; 𝑣)               ⃗ ∈ (1 ) and (𝑤2 ; 𝗉𝗋𝗃𝑅2 (𝑣))
                                                                                                                           ⃗ ∣ (𝑤1 ; 𝗉𝗋𝗃𝑅1 (𝑣))                         ⃗ ∈ (2 )}.
the next section.
     In the sequel, we define three CEFA operations: union, intersection,                              Furthermore, the union, intersection, and concatenation operations
and concatenation. The following section will use these three oper-                                 can be extended naturally to multiple CEFA, that is, 1 ∪ ⋯ ∪ 𝑛 ,
ations to solve RECL constraints. Note that the union, intersection,                                1 ∩ ⋯ ∩ 𝑛 , 1 ⋅ ⋯ ⋅ 𝑛 . For instance, 1 ∪ 2 ∪ 3 = (1 ∪ 2 ) ∪ 3 ,
and concatenation operations are defined in a slightly more involved                                1 ∩ 2 ∩ 3 = (1 ∩ 2 ) ∩ 3 , and 1 ⋅ 2 ⋅ 3 = (1 ⋅ 2 ) ⋅ 3 .
manner than register automata [23] and counter automata [19], as a
result of the (additional) accepting conditions.                                                    6. Solving RECL constraints
     Let 1 = (𝑅1 , 𝑄1 , 𝛴 , 𝛿1 , 𝐼1 , 𝐹1 , 𝛼1 ) and 2 = (𝑅2 , 𝑄2 , 𝛴 , 𝛿2 , 𝐼2 , 𝐹2 , 𝛼2 )
be two CEFA that share the alphabet. Moreover, suppose that 𝑅1 ∩𝑅2 =                                    The goal of this section is to show how to solve RECL constraints by
∅ and 𝑄1 ∩ 𝑄2 = ∅.                                                                                  utilizing CEFA. At first, we reduce the satisfiability of RECL constraints
     The union of 1 and 2 is denoted by 1 ∪ 2 . Two fresh auxiliary                             to a decision problem defined in the sequel. Then we propose a decision
registers say 𝑟′1 , 𝑟′2 ∉ 𝑅1 ∪ 𝑅2 , are introduced so that the accepting                            procedure for this problem.
condition knows whether a run is from 1 (or 2 ). Note that the
fresh auxiliary registers are introduced to distinguish whether 1 or                               Definition 2 (𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠)). Let 𝑥1 , … , 𝑥𝑛 be string variables, 𝛬𝑥1 , … ,
2 accepts and weave the two accepting conditions of 1 and 2 into                                 𝛬𝑥𝑛 be nonempty sets of CEFA over the alphabet 𝛴 with 𝛬𝑥𝑖 =
one accepting condition. Specifically, 1 ∪ 2 = (𝑅′ , 𝑄′ , 𝛴 , 𝛿 ′ , 𝐼 ′ , 𝐹 ′ , 𝛼 ′ )             {𝑖,1 , … , 𝑖,𝑙𝑖 } for every 𝑖 ∈ [𝑛] where the sets of registers 𝑅1,1 , … ,
where                                                                                               𝑅1,𝑙 , … , 𝑅𝑛,1 , … , 𝑅𝑛,𝑙 are mutually disjoint, moreover, let 𝜑 be
                                                                                                          1                        𝑛                              ⋃
                                                                                                    an LIA formula whose free variables are from 𝑖∈[𝑛],𝑗∈[𝑙𝑖 ] 𝑅𝑖,𝑗 . Then
    • 𝑅′ = 𝑅1 ∪ 𝑅2 ∪ {𝑟′1 , 𝑟′2 }, 𝑄′ = 𝑄1 ∪ 𝑄2 ∪ {𝑞0′ } with 𝑞0′ ∉ 𝑄1 ∪ 𝑄2 ,                       the CEFA in 𝛬𝑥1 , … , 𝛬𝑥𝑛 are said to be nonempty w.r.t. 𝜑 if there
      𝐼 ′ = {𝑞0′ },                                                                                 are assignments 𝜃 ∶ {𝑥1 , … , 𝑥𝑛 } → 𝛴 ∗ and vectors 𝑣⃗𝑖,𝑗 such that
    • 𝛿 ′ is the union of four transitions sets:                                                    (𝜃(𝑥𝑖 ); 𝑣⃗𝑖,𝑗 ) ∈ (𝑖,𝑗 ) and 𝜑[(𝑣⃗𝑖,𝑗 ∕𝑅𝑖,𝑗 )] is true, for every 𝑖 ∈ [𝑛], 𝑗 ∈ [𝑙𝑖 ].
                                     ⃗ 1, 0)) ∣ ∃𝑞1 ∈ 𝐼1 . (𝑞1 , 𝑎, 𝑞 ′ , 𝑣⃗1 ) ∈ 𝛿1 }
           – {(𝑞0′ , 𝑎, 𝑞1′ , (𝑣⃗1 , 0,                              1
                 ′             ⃗
           – {(𝑞 , 𝑎, 𝑞 , (0, 𝑣⃗2 , 0, 1)) ∣ ∃𝑞2 ∈ 𝐼2 . (𝑞2 , 𝑎, 𝑞 ′ , 𝑣⃗2 ) ∈ 𝛿2 }
                          ′                                                                         Proposition 1 ([18]). 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) is PSPACE-complete.
                  0     2                                             2
                                    ⃗ 0, 0)) ∣ 𝑞1 ∉ 𝐼1 . (𝑞1 , 𝑎, 𝑞 ′ , 𝑣⃗1 ) ∈ 𝛿1 }
           – {(𝑞1 , 𝑎, 𝑞1′ , (𝑣⃗1 , 0,                                                                  Note that the decision procedure in [18] was only used to prove the
                                                                   1
                              ⃗ 𝑣⃗2 , 0, 0)) ∣ 𝑞2 ∉ 𝐼2 . (𝑞2 , 𝑎, 𝑞 ′ , 𝑣⃗2 ) ∈ 𝛿2 }
           – {(𝑞2 , 𝑎, 𝑞2′ , (0,                                                                    upper bound in Proposition 1 and not implemented as a matter of fact.
                                                                   2
                                                                                                    Instead, the symbolic model checker nuXmv [35] was used to solve
                   ⃗ 1, 0) is a vector that updates 𝑅1 by 𝑣⃗1 , updates 𝑅2 by
      where (𝑣⃗1 , 0,                                                                               𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠). We do not rely on nuXmv in this work and shall propose
      ⃗ and updates 𝑟′ , 𝑟′ by 1,0 respectively. Similarly for (0,
      0,                                                            ⃗ 𝑣⃗2 , 0, 1),                  a new algorithm for solving 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) in Section 6.2.
                          1 2
      and so on.
    • 𝐹 ′ and 𝛼 ′ are defined as follows,                                                           6.1. From 𝖲𝖠𝖳𝖱𝖤𝖢𝖫 to 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠)

                    ⃗ belongs to (1 ) or (2 ), i.e., one of the two
           – if (𝜖; 0)                                                                                  Let 𝜑 be a RECL constraint and 𝑥1 , … , 𝑥𝑛 be an enumeration of the
             automata accepts the empty string 𝜖, then 𝐹 ′ = 𝐹1 ∪𝐹2 ∪ {𝑞0′ }                        string variables occurring in 𝜑. Moreover, let 𝜑 ≡ 𝜑1 ∧ 𝜑2 such that
             and 𝛼 ′ = (𝑟′1 = 0 ∧ 𝑟′2 = 0) ∨ (𝑟′1 = 1 ∧ 𝛼1 ) ∨ (𝑟′2 = 1 ∧ 𝛼2 ),                     𝜑1 is a conjunction of regular membership constraints of 𝜑, and 𝜑2 is a
           – otherwise, 𝐹 ′ = 𝐹1 ∪ 𝐹2 and 𝛼 ′ = (𝑟′1 = 1 ∧ 𝛼1 ) ∨ (𝑟′2 = 1 ∧ 𝛼2 ).                  conjunction of length constraints of 𝜑. We shall reduce the satisfiability
                                                                                                    of 𝜑 to an instance of 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠).
                                                                                                        At first, we show how to construct a CEFA from a regex where
From the construction, we know that                                                                 counting operators may occur. Our approach largely follows Thomp-
              ⎧       | (𝑤; 𝗉𝗋𝗃 (𝑣))                                                     ⎫
                      |        𝑅1 ⃗ ∈ (1 ) and 𝗉𝗋𝗃𝑟′1 (𝑣) ⃗ = 1, or                               son’s construction [36] but we avoid using the 𝜖-transitions. Similar
              ⎪       |                                                                  ⎪
                   ⃗ || (𝑤; 𝗉𝗋𝗃𝑅2 (𝑣))
(1 ∪ 2 ) = ⎨(𝑤; 𝑣)              ⃗ ∈ (2 ) and 𝗉𝗋𝗃𝑟′ (𝑣) ⃗ = 1, or                    ⎬.         ideas of representing counting operators as counters in automata have
              ⎪       |                                   2
                                                                                         ⎪          appeared in e.g. [27].
              ⎩       | (𝑤; ⃗
                            𝑣) =  (𝜖 , ⃗
                                       0) if    or   accepts 𝜖                         ⎭
                      |                        1     2
                                                                                                        Let us start with register-representable regexes defined in the sequel.
Intuitively, 1 ∪ 2 accepts the words that are accepted by one of the                                  Let us fix an alphabet 𝛴.
1 and 2 and outputs the costs of the corresponding automaton. Note                                    Let 𝑒 be a regex over 𝛴. Then an occurrence of counting operators in
that an integer vector 𝑣⃗1 in 1 , is copied at most twice in 1 ∪ 2 ,                             𝑒, say (𝑒′ ){𝑚,𝑛} (or (𝑒′ ){𝑚,∞} ), is said to be register-representable if (𝑒′ ){𝑚,𝑛}
                ⃗ 0, 0) and (𝑣⃗1 , 0,
namely, (𝑣⃗1 , 0,                  ⃗ 1, 0). Similarly for the integer vectors in                    (or (𝑒′ ){𝑚,∞} ) is not in the scope of a Kleene star, another counting
2 .                                                                                                operator, complement, or language difference in 𝑒. We say that 𝑒 is
    The intersection of 1 and 2 , denoted by 1 ∩2 = (𝑅′ , 𝑄′ , 𝛴 , 𝛿 ′ , 𝐼 ′ ,                  register-representable if and only if all occurrences of counting operators
𝐹 , 𝛼 ′ ), is defined in the sequel.
  ′                                                                                                 in 𝑒 are register-representable. For instance, 𝑎{2,6} ∩ 𝑎{4,∞} is register-
                                                                                                    representable, while 𝑎{2,6} and (𝑎{2,6} ){4,∞} are not since 𝑎{2,6} is in the
    • 𝑅′ = 𝑅1 ∪ 𝑅2 , 𝑄′ = 𝑄1 × 𝑄2 , 𝐼 ′ = 𝐼1 × 𝐼2 , 𝐹 ′ = 𝐹1 × 𝐹2 , 𝛼 ′ = 𝛼1 ∧ 𝛼2 ,                 scope of complement and the counter operator {4, ∞} respectively.
    • 𝛿 ′ comprises the tuples ((𝑞1 , 𝑞2 ), 𝑎, (𝑞1′ , 𝑞2′ ), (𝑣⃗1 , 𝑣⃗2 )) such that (𝑞1 , 𝑎,           Let 𝑒 be a register-representable regex over 𝛴. By the following
      𝑞1′ , 𝑣⃗1 ) ∈ 𝛿1 and (𝑞2 , 𝑎, 𝑞2′ , 𝑣⃗2 ) ∈ 𝛿2 .                                              procedure, we will construct a CEFA out of 𝑒, denoted by 𝑒 .


                                                                                                6
D. Hu and Z. Wu                                                                                                                                  Journal of Systems Architecture 160 (2025) 103340


    1. For each sub-expression (𝑒′ ){𝑚,𝑛} with 𝑚 ≤ 𝑛 (resp. (𝑒′ ){𝑚,∞} )                              Observation 2. For each CEFA  = (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼),
       of 𝑒, we construct a CEFA (𝑒′ ){𝑚,𝑛} (resp. (𝑒′ ){𝑚,∞} ). Let 𝑒′ =                                 ⎧           |                       ⎡∑                       ⎫
       (𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 ). Then (𝑒′ ){𝑚,𝑛} = ((𝑟′ ), 𝑄′ , 𝛴 , 𝛿 ′′ , 𝐼 ′ , 𝐹 ′ , 𝛼 ′ ), where              ⎪∑          |
                                                                                                                         |                                    / ⎤         ⎪
                                                                                                      () = ⎨      ⃗ 𝑣⃗ | 𝜂 ∈ 𝗉𝖺𝗋𝗂𝗄𝗁( ) and 𝛼 ⎢
                                                                                                                  𝜂(𝑣)                                    ⃗ 𝑣⃗ 𝑅⎥ is true ⎬ .
                                                                                                                                                       𝜂(𝑣)
       𝑟′ is a new register, 𝑄′ = 𝑄 ∪ {𝑞0 } with 𝑞0 ∉ 𝑄, 𝐼 ′ = {𝑞0 },                                                    |                       ⎢                ⎥
                                                                                                             ⎪𝑣∈C
                                                                                                               ⃗         |                       ⎣𝑣∈C
                                                                                                                                                   ⃗             ⎦       ⎪
       𝐹 ′ = 𝐹 ∪ {𝑞0 }, and                                                                                  ⎩          |                                                ⎭
          𝛿 ′′   =    {(𝑞 , 𝑎, 𝑞 ′ , (0)) ∣ (𝑞 , 𝑎, 𝑞 ′ ) ∈ 𝛿} ∪
                                                                                                         For 𝑖 ∈ [𝑛], let 𝛼𝑖 be the accepting condition of 𝑖 . Then from
                      {(𝑞0 , 𝑎, 𝑞 ′ , (1)) ∣ ∃𝑞0′ ∈ 𝐼 . (𝑞0′ , 𝑎, 𝑞 ′ ) ∈ 𝛿} ∪
                                                                                                      Observation 2, we know that the following two conditions are equiva-
                      {(𝑞 , 𝑎, 𝑞 ′ , (1)) ∣ 𝑞 ∈ 𝐹 , ∃𝑞0′ ∈ 𝐼 . (𝑞0′ , 𝑎, 𝑞 ′ ) ∈ 𝛿},
                                                                                                      lent,
       moreover, 𝛼 ′ = 𝑚 ≤ 𝑟′ ≤ 𝑛 if 𝐼 ∩ 𝐹 = ∅, otherwise 𝛼 ′ = 𝑟′ ≤ 𝑛.
       (Intuitively, if 𝜀 is accepted by 𝑒′ , then the value of 𝑟′ can be                                 • there are outputs 𝑣⃗1 ∈ (1 ), … , 𝑣⃗𝑛 ∈ (𝑛 ) such that
       less than 𝑚.) Moreover, (𝑒′ ){𝑚,∞} is constructed by adapting 𝛼 ′                                    𝜑[𝑣⃗1 ∕𝑅1 , … , 𝑣⃗𝑛 ∕𝑅𝑛 ] is true,
       in (𝑒′ ){𝑚,𝑛} as follows: 𝛼 ′ = 𝑚 ≤ 𝑟′ if 𝐼 ∩ 𝐹 = ∅ and 𝛼 ′ = 𝚝𝚛𝚞𝚎                                 • there are 𝜂1 ∈ 𝗉𝖺𝗋𝗂𝗄𝗁(1 ), … , 𝜂𝑛 ∈ 𝗉𝖺𝗋𝗂𝗄𝗁(𝑛 ) such that
       otherwise.
                                                                                                           ⋀ ⎡ ∑                  /    ⎤     ⎡ ∑            /           ∑           /    ⎤
    2. For each sub-expression 𝑒′ of 𝑒 such that 𝑒′ contains occurrences                                        𝛼𝑖 ⎢          ⃗ 𝑣⃗ 𝑅𝑖 ⎥ ∧ 𝜑 ⎢
                                                                                                                          𝜂𝑖 (𝑣)                    𝜂1 (𝑣)
                                                                                                                                                        ⃗ 𝑣⃗ 𝑅1 , … ,          ⃗ 𝑣⃗ 𝑅𝑛 ⎥
                                                                                                                                                                            𝜂𝑛 (𝑣)
       of counting operators but 𝑒′ itself is not of the form (𝑒′1 ){𝑚,𝑛} or                                       ⎢                   ⎥     ⎢                                           ⎥
                                                                                                          𝑖∈[𝑛]    ⎣𝑣∈C
                                                                                                                     ⃗ 𝑖              ⎦     ⎣𝑣∈C
                                                                                                                                               ⃗ 
                                                                                                                                                  1
                                                                                                                                                                       𝑣∈C
                                                                                                                                                                       ⃗ 𝑛              ⎦
       (𝑒′1 ){𝑚,∞} , from the assumption that 𝑒 is register-representable,
       we know that 𝑒′ is of the form 𝑒′1 ⋅ 𝑒′2 , 𝑒′1 + 𝑒′2 , 𝑒′1 ∩ 𝑒′2 , or (𝑒′1 ). For                     is true.
       𝑒′ = (𝑒′1 ), we have 𝑒′ = 𝑒′ . For 𝑒′ = 𝑒′1 ⋅ 𝑒′2 , 𝑒′ = 𝑒′1 + 𝑒′2 ,                              Therefore, to solve the non-emptiness of CEFA in 𝛬′𝑥 , … , 𝛬′𝑥 w.r.t.
                                         1                                                                                                                        1       𝑛
       or 𝑒′ = 𝑒′1 ∩ 𝑒′2 , suppose that CEFA 𝑒′ and 𝑒′ have been                                    𝜑, it is sufficient to compute the existential LIA formulas 𝜓 (ZC ), …,
                                                       1               2                                                                                             1      1
       constructed.                                                                                   𝜓 (ZC ) to represent the Parikh images of 1 , ⋯, 𝑛 respectively,
                                                                                                           𝑛      𝑛
    3. For each maximal sub-expression 𝑒′ of 𝑒 such that 𝑒′ contains                                  where ZC = {z𝑖,𝑣⃗ ∣ 𝑣⃗ ∈ C𝑖 } for 𝑖 ∈ [𝑛], and solve the satisfiability of
                                                                                                                     𝑖
       no occurrences of counting operators, an NFA 𝑒′ can be con-                                   the following existential LIA formula
       structed by structural induction on the syntax of 𝑒′ . Then we
                                                                                                        ⋀⎛                     ⎡ ∑            /    ⎤⎞
       have 𝑒′ = 𝑒′ ⋅ 𝑒′ , 𝑒′ = 𝑒′ ∪ 𝑒′ , or 𝑒′ = 𝑒′ ∩ 𝑒′ .                                         ⎜𝜓 (ZC ) ∧ 𝛼𝑖 ⎢         z𝑖,𝑣⃗ 𝑣⃗ 𝑅𝑖 ⎥⎟ ∧
                           1      2               1        2                   1       2
                                                                                                             ⎜    𝑖    𝑖     ⎢                   ⎥⎟
                                                                                                       𝑖∈[𝑛] ⎝                 ⎣𝑣∈C
                                                                                                                                 ⃗ 𝑖              ⎦⎠
     For non-register-representable regexes, we first convert them into
register-                                                                                                ⎡ ∑             /           ∑           /    ⎤
representable regexes by unfolding all the non-register-representable                                   𝜑⎢       z1,𝑣⃗ 𝑣⃗ 𝑅1 , … ,      z𝑛,𝑣⃗ 𝑣⃗ 𝑅𝑛 ⎥ .
                                                                                                         ⎢                                            ⎥
occurrences of counting operators. After that, we utilize the aforemen-                                  ⎣𝑣∈C
                                                                                                           ⃗ 
                                                                                                               1
                                                                                                                                    𝑣∈C
                                                                                                                                    ⃗ 𝑛              ⎦
tioned procedure to construct CEFA. For instance, (𝑎{2,6} ⋅ 𝑏∗ ){2,∞} is                              Intuitively, the integer variables z𝑖,𝑣⃗ represent the number of occur-
transformed into (𝑎𝑎(𝜀 + 𝑎 + 𝑎𝑎 + 𝑎𝑎𝑎 + 𝑎𝑎𝑎𝑎) ⋅ 𝑏∗ ){2,∞} .                                           rences of 𝑣⃗ in the strings accepted by 𝑖 .
     For each 𝑖 ∈ [𝑛], let 𝑥𝑖 ∈ 𝑒𝑖,1 , … , 𝑥𝑖 ∈ 𝑒𝑖,𝑙𝑖 be an enumeration of                                 Because the sizes of the LIA formulas 𝜓 (ZC ), … , 𝜓 (ZC ) are
                                                                                                                                                                              1     1         𝑛    𝑛
the regular membership constraints for 𝑥𝑖 in 𝜑1 . Then we can construct                               proportional to the sizes (more precisely, the alphabet size, the number
CEFA 𝑖,𝑗 from 𝑒𝑖,𝑗 for each 𝑖 ∈ [𝑛] and 𝑗 ∈ [𝑙𝑖 ]. Moreover, we construct                            of states and transitions) of NFA 1 , … , 𝑛 , and the satisfiability of
another CEFA 𝑖,0 for each 𝑖 ∈ [𝑛] to model the length of 𝑥𝑖 . Specifi-                               existential LIA formulas is NP-complete, it is vital to reduce the sizes
cally, 𝑖,0 is constructed as ((𝑟𝑖,0 ), {𝑞𝑖,0 }, 𝛴 , 𝛿𝑖,0 , {𝑞𝑖,0 }, {𝑞𝑖,0 }, 𝚝𝚛𝚞𝚎) where             of these NFAs to improve the performance.
𝑟𝑖,0 is a fresh register and 𝛿𝑖,0 = {(𝑞𝑖,0 , 𝑎, 𝑞𝑖,0 , (1)) ∣ 𝑎 ∈ 𝛴}. Let 𝛬𝑥𝑖 =                                    ∑                          ∑
                                                                                                           Since 𝑣∈C(⃗       𝑖)
                                                                                                                                     ⃗ 𝑣⃗ = 𝑣∈C(
                                                                                                                                  𝜂(𝑣)                            ⃗ 𝜂(𝑣) ⃗ 𝑣⃗ for each 𝑖 ∈ [𝑛] and 𝜂 ∈
{𝑖,0 , 𝑖,1 , … , 𝑖,𝑙𝑖 } for each 𝑖 ∈ [𝑛], and 𝜑′2 ≡ 𝜑2 [𝑟1,0 ∕|𝑥1 |, … , 𝑟𝑛,0 ∕|𝑥𝑛 |].                                                          ⃗         𝑖 )⧵{0}
                                                                                                      𝗉𝖺𝗋𝗂𝗄𝗁(𝑖 ), it turns out that the 0-labeled            ⃗               transitions in 𝑖 do not
Then the satisfiability of 𝜑 is reduced to the non-emptiness of CEFAs in                                                                             ∑
𝛬𝑥1 , … , 𝛬𝑥𝑛 w.r.t. 𝜑′2 .                                                                            contribute to the final output 𝑣∈C(                 ⃗       𝑖 ) 𝜂(𝑣)
                                                                                                                                                                          ⃗ 𝑣.
                                                                                                                                                                             ⃗   Therefore, we can apply
                                                                                                      the following size-reduction technique for 𝑖 ’s.
6.2. Solving 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠)                                                                                  Step 2 (Reducing automata sizes). For each 𝑖 ∈ [𝑛], we view the
                                                                                                      transitions (𝑞 , 0,   ⃗ 𝑞 ′ ) in  as 𝜀-transitions (𝑞 , 𝜀, 𝑞 ′ ), and remove the 𝜀-
                                                                                                                                           𝑖
     In this section, we present a procedure to solve the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠)                                transitions from 𝑖 . Then we determinize and minimize the resulting
problem: Suppose that 𝑥1 , … , 𝑥𝑛 are mutually distinct string variables,                             NFA. □
𝛬𝑥1 , … , 𝛬𝑥𝑛 are nonempty sets of CEFA over the same alphabet 𝛴 where                                     For 𝑖 ∈ [𝑛], let 𝑖 denote the DFA obtained from 𝑖 by executing
𝛬𝑥𝑖 = {𝑖,1 , …, 𝑖,𝑙𝑖 } for every 𝑖 ∈ [𝑛]. Moreover, the sets of registers                           Step 2 and C𝑖 ∶= C𝑖 ⧵ {0}.                ⃗ From the construction, we know that
𝑅1,1 , … , 𝑅𝑛,𝑙 are mutually disjoint, and 𝜑 is a LIA formula whose                                 𝗉𝖺𝗋𝗂𝗄𝗁(𝑖 ) = 𝗉𝗋𝗃C (𝗉𝖺𝗋𝗂𝗄𝗁(𝑖 )) for each 𝑖 ∈ [𝑛]. Therefore, we compute
                 𝑛        ⋃                                                                                                   𝑖
free variables are from 𝑖∈[𝑛],𝑗∈[𝑙𝑖 ] 𝑅𝑖,𝑗 . The procedure comprises three                           LIA formulas from 𝑖 ’s, instead of 𝑖 ’s, to represent the Parikh images.
steps.                                                                                                     Step 3 (Computing an LIA formula). For each 𝑖 ∈ [𝑛], we compute an
     Step 1 (Computing intersection automata). For each 𝑖 ∈ [𝑛], compute                              existential LIA formula 𝜓𝑖 (ZC ) from 𝑖 to represent 𝗉𝖺𝗋𝗂𝗄𝗁(𝑖 ). Then
                                                                                                                                                      𝑖
a CEFA 𝑖 = 𝑖,1 ∩ ⋯ ∩ 𝑖,𝑙𝑖 , and let 𝛬′𝑥 ∶= {𝑖 }. □                                                the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem is reduced to the satisfiability of the following
                                             𝑖
     After Step 1, the non-emptiness of CEFAs in 𝛬𝑥1 , … , 𝛬𝑥𝑛 w.r.t. 𝜑 is                            LIA formula, □
reduced to the non-emptiness of CEFAs in 𝛬′𝑥 , … , 𝛬′𝑥 w.r.t. 𝜑.
                                                   1       𝑛                                            ⋀⎛                         ⎡ ∑             /         ⎤⎞
     In the following steps, we reduce the non-emptiness of CEFAs in                                         ⎜𝜓 (ZC ) ∧ 𝛼𝑖 ⎢              z𝑖,𝑣⃗ 𝑣⃗ 𝑅𝑖 ⎥⎟ ∧
𝛬′𝑥 , …, 𝛬′𝑥 w.r.t. 𝜑 to the satisfiability of an LIA formula. The reduction                                 ⎜  𝑖      𝑖          ⎢                         ⎥⎟
   1        𝑛
                                                                                                       𝑖∈[𝑛] ⎝                     ⎣𝑣∈C
                                                                                                                                     ⃗ 𝑖                    ⎦⎠
relies on the following two observations.                                                                                                                                                            (3)
                                                                                                            ⎡ ∑             /              ∑                 /        ⎤
                                                                                                         𝜑  ⎢       z1,𝑣⃗ 𝑣⃗ 𝑅1 , … ,              z𝑛,𝑣⃗ 𝑣⃗ 𝑅𝑛 .    ⎥
Observation 1. CEFA in 𝛬′𝑥 = {1 }, … , 𝛬′𝑥 = {𝑛 } are nonempty w.r.t.                                     ⎢                                                         ⎥
                               1                 𝑛                                                          ⎣𝑣∈C
                                                                                                              ⃗ 
                                                                                                                  1
                                                                                                                                          𝑣∈C
                                                                                                                                          ⃗ 𝑛                        ⎦
𝜑 iff there are outputs 𝑣⃗1 ∈ (1 ), … , 𝑣⃗𝑛 ∈ (𝑛 ) such that 𝜑[𝑣⃗1 ∕𝑅1 , …,
𝑣⃗𝑛 ∕𝑅𝑛 ] is true.                                                                                       Step 4 (Solving the LIA formula). We utilize some off-the-shelf SMT
   Let  = (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼) be a CEFA and C = {𝑣⃗ ∣ ∃𝑞 , 𝑎, 𝑞 ′ . (𝑞 , 𝑎, 𝑞 ′ ,             solver (e.g. Z3) to solve the satisfiability of the LIA formula in Eq. (3).
⃗ ∈ 𝛿}. Moreover, let  = (𝑄, C , 𝛿 ′ , 𝐼 , 𝐹 ) be an NFA over the alpha-
𝑣)                                                                                                    If the solver reports ‘‘sat’’ and returns a solution m, then we go to Step
bet C that is obtained from  by dropping the accepting condition                                    5. Otherwise, the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem has no solution. □
and ignoring the characters, that is, 𝛿 ′ comprises tuples (𝑞 , 𝑣,    ⃗ 𝑞 ′ ) such                        Step 5 (Constructing a solution). Let m be a solution of the LIA formula
that (𝑞 , 𝑎, 𝑞 ′ , 𝑣)
                   ⃗ ∈ 𝛿 for 𝑎 ∈ 𝛴.                                                                   in Eq. (3). We want to construct from m and the CEFAs 𝑖 for 𝑖 ∈ [𝑛],

                                                                                                  7
D. Hu and Z. Wu                                                                                                                    Journal of Systems Architecture 160 (2025) 103340


a solution m′ of the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem. The construction of m′ is                            RECL constraint: (i) unfold the counting operators in 𝑒1 , (ii)
obtained by constructing m′ (𝑥𝑖 ) for each 𝑖 ∈ [𝑛]. Let 𝑖 ∈ [𝑛]. We                             unfold the counting operator {𝑚,𝑛} . Let 𝗎𝖿 𝗅𝖽1 (𝑒) and 𝗎𝖿 𝗅𝖽2 (𝑒) denote
describe how to construct m′ (𝑥𝑖 ). The free variables ZC correspond                           the resulting regular expressions. Note that in 𝗎𝖿 𝗅𝖽1 (𝑒), all the
                                                              𝑖
to the number of occurrences of characters in C𝑖 , i.e. the alphabet of                        counting operators in 𝑒1 (if there is any) disappear, while in
the NFA 𝑖 . Moreover, in 𝑖 , the characters are just the integer vectors                      𝗎𝖿 𝗅𝖽2 (𝑒), all the counting operators in 𝑒1 are preserved and copied
and the characters in the original CEFAs 𝑖 used to construct 𝑖 are                            for 𝑛 times.
ignored. The construction of m′ (𝑥𝑖 ) is actually to solve the reachability                   • In the sequel, we estimate the sizes (the number of transitions)
problem in a finite transition system where                                                     of the NFAs 𝗎𝖿 𝗅𝖽 (𝑒) and 𝗎𝖿 𝗅𝖽 (𝑒) , denoted by 𝑠𝑧1 and 𝑠𝑧2 . If
                                                                                                                        1              2
                                                                                                𝑠𝑧1 ≤ 𝑠𝑧2 , then we choose to unfold the counting operators in
    • the nodes are of the form (𝑞 , (n𝑖,𝑣⃗ )𝑣∈C
                                             ⃗ 𝑖 ), where n𝑖,𝑣⃗ denotes the                    𝑒1 . Otherwise, we choose to unfold the counting operator {𝑚,𝑛} .
      number of the remaining occurrences of the integer vector 𝑣⃗ in                         • To estimate the sizes of 𝗎𝖿 𝗅𝖽 (𝑒) and 𝗎𝖿 𝗅𝖽 (𝑒) , we first esti-
                                                                                                                                      1                2
      𝑖 ,                                                                                      mate the sizes of the CEFAs 𝗎𝖿 𝗅𝖽1 (𝑒) and 𝗎𝖿 𝗅𝖽2 (𝑒) as two pairs
    • the transitions are of the form                                                           (𝗌𝗇𝗎𝗆1 (𝑒), 𝗏𝗇𝗎𝗆1 (𝑒)) and (𝗌𝗇𝗎𝗆2 (𝑒), 𝗏𝗇𝗎𝗆2 (𝑒)), where 𝗌𝗇𝗎𝗆1 (𝑒) and
      ((𝑞 , (n𝑖,𝑣⃗ )𝑣∈C             ′
                                       ⃗), (𝑞 ′ , (n′𝑖,𝑣⃗ )𝑣∈C                                  𝗏𝗇𝗎𝗆1 (𝑒) are the estimates of the number of states and integer vec-
                    ⃗  ), (𝑞 , 𝑎, 𝑞 , 𝑢                   ⃗  ))
                        𝑖                                    𝑖                                  tors in 𝗎𝖿 𝗅𝖽1 (𝑒) separately, similarly for (𝗌𝗇𝗎𝗆2 (𝑒), 𝗏𝗇𝗎𝗆2 (𝑒)). Since
      such that (𝑞 , 𝑎, 𝑞 ′ , 𝑢⃗) is a transition in 𝑖 , n′ = n𝑖,𝑢⃗ − 1, moreover,             the NFAs 𝗎𝖿 𝗅𝖽 (𝑒) and 𝗎𝖿 𝗅𝖽 (𝑒) are constructed from 𝗎𝖿 𝗅𝖽1 (𝑒)
                                                           𝑖,𝑢⃗                                                      1              2
      n′ = n𝑖,𝑣⃗ for every 𝑣⃗ ≠ 𝑢⃗.                                                             and 𝗎𝖿 𝗅𝖽2 (𝑒) by dropping the accepting conditions and ignore
       𝑖,𝑣⃗
                                                                                                characters, 𝗌𝗇𝗎𝗆1 (𝑒) and 𝗏𝗇𝗎𝗆1 (𝑒) are the estimates of the number
Moreover, the initial and final state of the reachability problem are set                       of states and characters of 𝗎𝖿 𝗅𝖽 (𝑒) , similarly for 𝗎𝖿 𝗅𝖽 (𝑒) . Then
                                                                                                                                       1                         2
to be (𝑞0 , (m(z𝑖,𝑣⃗ ))𝑣∈C
                       ⃗ 𝑖 ) and (𝑞𝑓 , (0, … , 0)) for some initial state 𝑞0 and               we estimate the sizes (number of transitions) of the NFAs 𝗎𝖿 𝗅𝖽 (𝑒)
                                                                                                                                                                                1
final state 𝑞𝑓 in 𝑖 . If a path from the initial state to the final state is                   and 𝗎𝖿 𝗅𝖽 (𝑒) as 𝑠𝑧1 = (𝗌𝗇𝗎𝗆1 (𝑒))2 ⋅ 𝗏𝗇𝗎𝗆1 (𝑒) and 𝑠𝑧2 = (𝗌𝗇𝗎𝗆2 (𝑒))2 ⋅
found, then the path gives us a desired string m′ (𝑥𝑖 ). □                                                 2
                                                                                                𝗏𝗇𝗎𝗆2 (𝑒) respectively.

7. Optimizations for non-register-representable RECL constraints                              It remains to show how to compute (𝗌𝗇𝗎𝗆1 (𝑒), 𝗏𝗇𝗎𝗆1 (𝑒)) and
                                                                                          (𝗌𝗇𝗎𝗆2 (𝑒), 𝗏𝗇𝗎𝗆2 (𝑒)) from 𝑒 = (𝑒1 ){𝑚,𝑛} . Let 𝑒 = (𝑒1 ){𝑚,𝑛} be a reg-
    In Section 6, to deal with the non-register-representable RECL con-                   ular expression where 𝑒1 is a register-representable regular expres-
straints, we unfold all the non-register-representable occurrences of                     sion. Note that there is only one register in 𝗎𝖿 𝗅𝖽1 (𝑒) . As a result,
counting operators. Nevertheless, the naive unfolding incurs an expo-                     we set 𝗏𝗇𝗎𝗆1 (𝑒) = 2. Moreover, in the sequel, we inductively com-
nential blow-up over the formula sizes, especially when the counting                      pute 𝗌𝗇𝗎𝗆′1 (𝑒1 ) and (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) from 𝑒1 , then let 𝗌𝗇𝗎𝗆1 (𝑒) =
bounds are large. Recall that an occurrence of counting operators is said                 𝗌𝗇𝗎𝗆′1 (𝑒1 ) and (𝗌𝗇𝗎𝗆2 (𝑒), 𝗏𝗇𝗎𝗆2 (𝑒)) = (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )).
to be non-register-representable if it is in the scope of some Kleene star,
another counting operator, complement, or language difference. Since                          • If 𝑒1 = ∅ or 𝑒1 = 𝜖, then 𝗌𝗇𝗎𝗆′1 (𝑒1 ) = 0 and
Kleene star can be seen as a special case of the counting operators, and                          (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) = (0, 0).
language difference 𝑒1 ⧵ 𝑒2 can be rewritten as 𝑒1 ∩ 𝑒2 , without loss of
                                                                                                (Note that here the constant 0, instead of 1, is chosen, so that
generality, we can assume that a non-register representable occurrence
                                                                                                later on, when we consider 𝑒1 = 𝑒2 ∩ 𝑒3 , 𝗌𝗇𝗎𝗆′2 (𝑒2 ) × 𝗌𝗇𝗎𝗆′2 (𝑒3 ) and
of a counting operator is in the scope of another counting operator or
                                                                                                𝗏𝗇𝗎𝗆′2 (𝑒2 ) ×𝗏𝗇𝗎𝗆′2 (𝑒3 ) will become 0 since 𝜖 ∩𝑒3 = 𝜖 and ∅ ∩𝑒3 = ∅.)
some complement operator. In the sequel, to alleviate the formula-size
blow-up problem resulted from the naive unfolding, we present various
                                                                                              • If 𝑒1 = 𝑎, then 𝗌𝗇𝗎𝗆′1 (𝑒1 ) = 2 and
optimizations, for the nesting of counting operators, and the nesting of
counting operators and a complement operator respectively.                                        (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) = (2𝑛, 1).

7.1. Optimizations for the nesting of counting operators
                                                                                              • If 𝑒1 = 𝑒2 ⋅ 𝑒3 , then 𝗌𝗇𝗎𝗆′1 (𝑒1 ) = 𝗌𝗇𝗎𝗆′1 (𝑒2 ) + 𝗌𝗇𝗎𝗆′1 (𝑒3 ) and
                                                                                                  (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) =
   In the sequel, we propose an optimization for the nesting of count-
ing operators. The optimization is designed for the special case that                                (𝑛(𝗌𝗇𝗎𝗆′2 (𝑒2 ) + 𝗌𝗇𝗎𝗆′2 (𝑒3 )), 𝑛(𝗏𝗇𝗎𝗆′2 (𝑒2 ) + 𝗏𝗇𝗎𝗆′2 (𝑒3 ))).
the nesting depth of counting operators is one. For a RECL constraint                           (Note that because 𝑒1 is concatenated for 𝑛 times in 𝗎𝖿 𝗅𝖽2 (𝑒), we
whose nesting depth of counting operators is greater than two, we can                           define 𝗌𝗇𝗎𝗆′2 (𝑒1 ) as 𝑛(𝗌𝗇𝗎𝗆′2 (𝑒2 ) + 𝗌𝗇𝗎𝗆′2 (𝑒3 )). Moreover, since the
apply the optimization for the inner-most nesting of counting operators                         integer vectors in 1 ⋅ 2 are of the form (𝑣⃗1 , 0)   ⃗ and (0,
                                                                                                                                                              ⃗ 𝑣⃗2 ) where
and decrement the nesting depth, then apply the optimization again to                           𝑣⃗1 and 𝑣⃗2 are integer vectors in 1 and 2 respectively, we define
the resulting constraint, and continue this process, until obtaining a                          𝗏𝗇𝗎𝗆′2 (𝑒1 ) as 𝑛(𝗏𝗇𝗎𝗆′2 (𝑒2 ) + 𝗏𝗇𝗎𝗆′2 (𝑒3 ))).
register-representable RECL constraint.                                                       • If 𝑒1 = 𝑒2 + 𝑒3 , then 𝗌𝗇𝗎𝗆′1 (𝑒1 ) = 𝗌𝗇𝗎𝗆′1 (𝑒2 ) + 𝗌𝗇𝗎𝗆′1 (𝑒3 ) and
   The main idea of the optimization is illustrated by the constraint
                                                                                                  (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) =
𝑥 ∈ (𝑎{1,1000} ){1,2} . If we apply the naive unfolding in Section 6, we
                                                                                                     (𝑛(𝗌𝗇𝗎𝗆′2 (𝑒2 ) + 𝗌𝗇𝗎𝗆′2 (𝑒3 ) + 1), 𝑛(2𝗏𝗇𝗎𝗆′2 (𝑒2 ) + 2𝗏𝗇𝗎𝗆′2 (𝑒3 ))).
will unfold 𝑎{1,1000} and obtain the constraint 𝑥 ∈ (𝑎(𝜀 + 𝑎 + 𝑎𝑎 +
⋯ + 𝑎 ⋯ 𝑎 )){1,2} . It is easy to observe that a smarter way is to unfold                       (Recall that each integer vector in 1 resp. 2 is copied at most
     ⏟⏟⏟                                                                                        twice in the construction of 1 ∪ 2 .)
         999
{1,2} , instead of {1,1000} . If we do this, then we get a constraint 𝑥 ∈                     • If 𝑒1 = 𝑒2 ∩ 𝑒3 , then 𝗌𝗇𝗎𝗆′1 (𝑒1 ) = 𝗌𝗇𝗎𝗆′1 (𝑒2 ) × 𝗌𝗇𝗎𝗆′1 (𝑒3 ) and
(𝑎{1,1000} )(𝜀 + 𝑎{1,1000} ), whose size is much smaller, compared to 𝑥 ∈                         (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) =
(𝑎(𝜀+𝑎+𝑎𝑎+⋯+ 𝑎 ⋯ 𝑎 )){1,2} . (Recall that we assume a binary encoding                                (𝑛(𝗌𝗇𝗎𝗆′2 (𝑒2 ) × 𝗌𝗇𝗎𝗆′2 (𝑒3 )), 𝑛(𝗏𝗇𝗎𝗆′2 (𝑒2 ) × 𝗏𝗇𝗎𝗆′2 (𝑒3 ))).
                    ⏟⏟⏟
                            999                                                                 (Recall that the integer vectors in 1 ∩ 2 are of the form
of the integer constants in counting operators.)
                                                                                                (𝑣⃗1 , 𝑣⃗2 ), where 𝑣⃗1 and 𝑣⃗2 are the integer vectors in 1 and 2
    Let us describe the optimization more precisely.
                                                                                                respectively.)
                                                                                                                                      ′
    • For a regular expression 𝑒 = (𝑒1 ){𝑚,𝑛} where 𝑒1 is a register-                         • If 𝑒1 = 𝑒2 , then 𝗌𝗇𝗎𝗆′1 (𝑒1 ) = 2𝗌𝗇𝗎𝗆1 (𝑒2 ) and
                                                                                                                                              ′
      representable regular expression, we can unfold the counting                                (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) = (𝑛 ⋅ 2𝗌𝗇𝗎𝗆2 (𝑒2 ) , 1).
      operators in the following ways to obtain a register-representable

                                                                                      8
D. Hu and Z. Wu                                                                                                                               Journal of Systems Architecture 160 (2025) 103340


      (In this case, since 𝑒1 is register-representable, we know that 𝑒2
                                                                                                Intuitively, 𝑞𝑠 is a sink state that deals with the situations where there
      contains no counting operators. As a result, we first construct the
                                                                                                are no runs, and 𝑟′ is a register newly introduced so that the different
      NFA for 𝑒2 , then determinize and complement it to construct the
                                                                                                cases of the non-accepting runs of 𝑒1 can be captured by a single
      automaton for 𝑒1 .)
                                                                                                accepting condition 𝛼 ′ .
    • If 𝑒1 = (𝑒2 )∗ , then 𝗌𝗇𝗎𝗆′1 (𝑒1 ) = 𝗌𝗇𝗎𝗆′1 (𝑒2 ) and
                                                                                                    Note that  is a strict over-approximation of 𝑒 = 𝑒1 since 𝑒1 may be
       (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) = (𝑛𝗌𝗇𝗎𝗆′2 (𝑒2 ), 1).                                       nondeterministic. For instance, if there are two runs of 𝑒1 on a string
                                                                                                𝑤, say one run that ends at a non-final state and another run that ends
      (In this case, since 𝑒1 is register-representable, we know that 𝑒2
                                                                                                at a final state and satisfies accepting condition 𝛼. Then 𝑤 ∈ (𝑒1 ),
      contains no counting operators.)
                     ′ ′                  ′                                                     thus 𝑤 ∉ (𝑒1 ). Nevertheless, according to the construction, 𝑤 is
    • If 𝑒1 = (𝑒2 ){𝑚 ,𝑛 } or 𝑒1 = (𝑒2 ){𝑛 ,∞} , then 𝗌𝗇𝗎𝗆′1 (𝑒1 ) = 𝑛′ 𝗌𝗇𝗎𝗆′1 (𝑒2 )
                                                                                                accepted by , witnessed by the run of 𝑒1 that ends at a non-final
      and
                                                                                                state.
       (𝗌𝗇𝗎𝗆′2 (𝑒1 ), 𝗏𝗇𝗎𝗆′2 (𝑒1 )) = (𝑛𝗌𝗇𝗎𝗆′2 (𝑒2 ), 𝑛).
                                                                                                Combination of under- and over-approximation. We combine the under-
      (In this case, since 𝑒1 is register-representable, we know that 𝑒2                        and over-approximation as well as the decision procedure into the
      contains no counting operators.)                                                          following procedure.

                                                                                                    1. We first apply the under-approximation.
7.2. Optimizations for the nesting of counting operators and a complement
                                                                                                    2. If the under-approximation returns ‘‘sat’’, then we know that
operator
                                                                                                       the original constraint is satisfiable. Otherwise, the under-appro-
     As mentioned in Section 6, if some counting operators are in the                                  ximation returns ‘‘unsat’’ and we apply the over-approximation.
scope of a complement operator, it is necessary to unfold these counting                            3. If the over-approximation returns ‘‘unsat’’, then we know that
operators before the complement operation. The exponential blow-                                       the original constraint is unsatisfiable. Otherwise, the over-
up in the unfolding combined with the blow-up in the complement                                        approximation returns ‘‘sat’’ and we resort to the decision pro-
operation will produce a huge number of states. To alleviate this                                      cedure (without approximations) in Section 6 to solve the RECL
problem, we propose two approximations of the complement operation,                                    constraint.
i.e., an under-approximation and an over-approximation, and combine
them into a procedure that achieves a nice balance between efficiency                           8. Optimizations for model generation
and precision.
     Let 𝑒 = 𝑒1 be a regular expression, where 𝑒1 is a register-                                    Recall that in Step 5 of Section 6, we construct a solution m′ from
representable regular expression.                                                               the model m of the LIA formula in Eq. (3) by solving the reachability
                                                                                                problems in a finite transition system. It is easy to see that the transition
Under-approximation. One way of doing the complement operation is
                                                                                                system contains exponentially many nodes in general and a naive
to transform 𝑒1 into a regular expression that does not contain counting
                                                                                                search over the transition system is not efficient. In the sequel, we
operators. We compute an under-approximation of 𝑒 by replacing each
                                                                                                describe two optimizations so that the reachability problem can be
occurrence of the counting operators in 𝑒1 , say (𝑒2 ){𝑚,𝑛} or (𝑒{𝑛,∞}
                                                                 2
                                                                       ), by
                                                                                                solved more efficiently.
𝑒∗2 . Let us use 𝑒′1 to denote the resulting regular expression. Note that
                                                                                                    To ease the presentation, let us assume that 𝑛, i.e. the number of
𝑒′1 does not contain counting operators. Moreover, it is easy to observe
                                                                                                string variables, to be 1.
that (𝑒1 ) ⊆ (𝑒′1 ). As a result, we have (𝑒′1 ) ⊆ (𝑒1 ). That is, 𝑒′1 is an                    At first, we refine the states of the transition system by recording
under-approximation of 𝑒1 .                                                                     the number of transitions, instead of just the number of occurrences of
Over-approximation. For 𝑒 = 𝑒1 , we utilize the CEFA constructed from                           the integer vectors.
𝑒1 , say 𝑒1 , to construct a CEFA  of 𝑒 so that () is an over-                              Refining the states of the transition system by recording the number of
approximation of (𝑒). Let 𝑒1 = (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼). The main idea of                    transitions. Let 𝜃 denote the formula in Eq. (3). We shall construct the
the construction is explained as follows: If a string 𝑤 is not accepted                         formula 𝜃 ′ so that any model of 𝜃 ′ necessarily encodes the number of
by 𝑒1 , then either there are no runs of 𝑒1 on 𝑤, or every run of 𝑒1                         occurrences of transitions within 1 .
on 𝑤 stop at a non-final state or enter a final state but do not satisfy                                            ⋀
                                                                                                 𝜃 ′ ≡ 𝜓1 (Z1 ) ∧    z1,𝑣⃗ = m(z1,𝑣⃗ ) ∧
the accepting condition 𝛼. We shall construct a CEFA  that accepts
                                                                                                                          𝑣∈C
                                                                                                                          ⃗ 
when one of these situations occurs. Then (𝑒1 ) ⊆ (), that is,  is                                 ⋀
                                                                                                                               1
                                                                                                                                                 ∑
an over-approximation of 𝑒.                                                                                     z1,𝑣⃗ =                                                         z1,(𝑞 ,𝑎,𝑞′ ,𝑣)
                                                                                                                                                                                             ⃗,
     Without loss of generality, we can assume that there are no transi-                              𝑣∈C
                                                                                                      ⃗ 
                                                                                                            1             (𝑞 ,𝑎,𝑞 ′ ) s.t. (𝑞 ,𝑎,𝑞 ′ ,𝑣)
                                                                                                                                                      ⃗ is a transition in 1
tions out of 𝐹 in 𝑒1 . If this is not the case, then a new state 𝑞𝑓 can                                                                        ′ ⃗ is a transition in  } is the set of
                                                                                                where Z1 = {z1,(𝑞 ,𝑎,𝑞′ ,𝑣)     ⃗ ∣ (𝑞 , 𝑎, 𝑞 , 𝑣)                     1
be introduced to transform 𝑒1 into a CEFA satisfying this constraint,                                                                                                                    ′ ⃗
                                                                                                variables z1,(𝑞 ,𝑎,𝑞′ ,𝑣)⃗ denoting the number of occurrences of (𝑞 , 𝑎, 𝑞 , 𝑣).
where 𝑞𝑓 is the unique accepting state. Then we define  = (𝑅∪ {𝑟′ }, 𝑄∪                                                                     ′
                                                                                                Note that compared to 𝜃, 𝜃 uses the LIA formula 𝜓1 , which is
{𝑞𝑠 }, 𝛴 , 𝛿 ′ , 𝐼 , 𝑄 ∪ {𝑞𝑠 }, 𝛼 ′ ), where
                                                                                                constructed from 1 by ignoring its accepting condition and tak-
    • 𝑟′ is a new register not in 𝑅,                                                            ing it as an NFA where the alphabet is seen as the set of transi-
                                                                                                tions. Furthermore, 𝜃 ′ relates the values of the variables z1,𝑣⃗ and
    • 𝑞𝑠 is a new state not in 𝑄,
    • 𝛿 ′ comprises the following transitions,                                                                ⃗ , while simultaneously constraining the values of z1,𝑣⃗ to be
                                                                                                z1,(𝑞 ,𝑎,𝑞 ′ ,𝑣)
                                                                                                m(z1,𝑣⃗ ). We then submit 𝜃 ′ to SMT solver, which returns a model
          – (𝑞 , 𝑎, 𝑞 ′ , (𝑣,
                           ⃗ 0)) such that (𝑞 , 𝑎, 𝑞 ′ , 𝑣)
                                                         ⃗ ∈ 𝛿 and 𝑞 ′ ∉ 𝐹 ,                    of 𝜃 ′ , in particular, the values of the variables z1,(𝑞,𝑎,𝑞 ′ ,𝑣)        ⃗ . Equipped
                           ⃗ 0)) such that there do not exist 𝑎-labeled transi-
          – (𝑞 , 𝑎, 𝑞𝑠 , (0,                                                                    with the model, we refine the transition system by considering the
            tions out of 𝑞, that is, for every 𝑞 ′ and 𝑣,     ⃗ we have (𝑞 , 𝑎, 𝑞 ′ , 𝑣)
                                                                                      ⃗ ∉       nodes (𝑞 , (n1,(𝑞,𝑎,𝑞 ′ ,𝑣)             ⃗ is a transition in 1 ), where n1,(𝑞 ,𝑎,𝑞 ′ ,𝑣)
                                                                                                                          ⃗ )(𝑞 ,𝑎,𝑞 ′ ,𝑣)                                             ⃗ de-
            𝛿,                                                                                  notes the number of remaining occurrences of (𝑞 , 𝑎, 𝑞 ′ , 𝑣).            ⃗ Note that
          – (𝑞 , 𝑎, 𝑞 ′ , (𝑣,
                           ⃗ 1)) such that (𝑞 , 𝑎, 𝑞 ′ , 𝑣)
                                                         ⃗ ∈ 𝛿 and 𝑞 ′ ∈ 𝐹 ,                    this refinement can avoid searching useless transitions. For instance,
                                                                                                if n1,(𝑞,𝑎,𝑞 ′ ,𝑣)                              ′ ⃗ then we can avoid searching the
                            ⃗
          – (𝑞𝑠 , 𝑎, 𝑞𝑠 , (0, 0)) such that 𝑎 ∈ 𝛴,                                                                ⃗ = 0 for some (𝑞 , 𝑎, 𝑞 , 𝑣),
                                                                                                transition (𝑞 , 𝑎, 𝑞 ′ , 𝑣)
                                                                                                                          ⃗ out of 𝑞. This stands in contrast to the original
    • 𝛼 ′ is defined as 𝑟′ = 1 → ¬𝛼.                                                            transition system where n1,𝑣⃗ is used, it may be the case that we have

                                                                                            9
D. Hu and Z. Wu                                                                                                                             Journal of Systems Architecture 160 (2025) 103340


n1,𝑣⃗ > 0 (since there may be multiple transitions associated with 𝑣)
                                                                   ⃗ and                                 Based on this template, we generate 500 instances.
the transition (𝑞 , 𝑎, 𝑞 ′ , 𝑣)
                             ⃗ should still be searched.                                                    For the benchmarks that contain the nestings of counting operators
     Moreover, we propose a method to reduce the number of transitions                                   and a complement operator, we utilize the following template
in the transition system further.
                                                                                                         𝑥 ∈ 𝑒𝑐 𝑜𝑚𝑝1 ∧ 𝑥 ∈ 𝑒𝑐 𝑜𝑚𝑝2 ∧ 𝑥 ∈ 𝑒𝑠𝑎𝑛𝑖 ∧ |𝑥| > 50,
Pruning the transitions in the transition system. Let the current state of                                                                                   {𝑚 ,𝑛 } {𝑚 ,𝑛 } {𝑚 ,𝑛 } {𝑚 ,𝑛 }
the transition system be (𝑝, (n1,(𝑞 ,𝑎,𝑞′ ,𝑣)                                                            where both 𝑒𝑐 𝑜𝑚𝑝1 and 𝑒𝑐 𝑜𝑚𝑝2 are of the form 𝑒1 1 1 𝑒2 2 2 𝑒3 3 3 𝑒4 4 4 ,
                                                             ⃗ is a transition in 1 ). Let
                                               ⃗ )(𝑞 ,𝑎,𝑞 ′ ,𝑣)
𝑝 denote the set of transitions that are reachable from the state 𝑝 in                                  such that for each 𝑖 ∈ [4], 20 ≤ 𝑚𝑖 ≤ 𝑛𝑖 ≤ 1000 and 𝑒𝑖 contains
1 . If {(𝑞 , 𝑎, 𝑞 ′ , 𝑣)                                                                                no counting operators. Note that two regular expressions 𝑒𝑐 𝑜𝑚𝑝1 and
                                        ⃗ > 0} ⊈ 𝑝 , in other words, there exists
                       ⃗ ∣ n1,(𝑞 ,𝑎,𝑞′ ,𝑣)
some transitions (𝑞 , 𝑎, 𝑞 ′ , 𝑣)  ⃗ not reachable from the state 𝑝 in 1 while                          𝑒𝑐 𝑜𝑚𝑝2 are introduced to have a better balance between the numbers of
                                                                                                         satisfiable and unsatisfiable instances. Based on this template, we also
             ⃗ > 0 is established in the current state (𝑝, (n1,(𝑞 ,𝑎,𝑞 ′ ,𝑣)
n1,(𝑞,𝑎,𝑞 ′ ,𝑣)                                                                  ⃗ )(𝑞 ,𝑎,𝑞 ′ ,𝑣)
                                                                                               ⃗
 is a transition in 1 ), then we know that it is impossible to reach the                                generate 500 instances.
final state (𝑞𝑓 , (0, … , 0)) from the current state so that we can remove                                    In total, the NestC benchmark suite contains 1000 instances.
all the transitions out of the current state in the transition system.                                        Note that the two templates are chosen so that on the one hand,
                                                                                                         the nestings of counting operators or nestings of counting operators
                                                                                                         and complement operators are available, and on the other hand, some
9. Experiments
                                                                                                         additional regular and length constraints are introduced, so that the
                                                                                                         constraints are not trivially satisfied, in particular, the constraints are
    We implemented the procedures and optimizations in Sections 6–8
                                                                                                         not satisfied by very short strings.
on top of OSTRICH, resulting in a string solver called OSTRICHRECL . In
this section, we evaluate the performance of OSTRICHRECL on three                                        Distribution of problem instances w.r.t. counting bounds. The distribution
benchmark suites, that is, RegCoL, AutomatArk, and NestC. In the                                         of problem instances w.r.t. the counting bounds in the three suites is
sequel, we first describe the three benchmark suites as well as the                                      shown in Fig. 5, where the 𝑥-axis represents the counting bound and
experiment setup. Then we present the experiment results. We evaluate                                    the 𝑦-axis represents the number of problem instances containing the
the performance and correctness of our solver against the state-of-the-                                  counting bound. In our statistics, while most problem instances contain
art string constraint solvers, including CVC5 [3], Z3seq [1], Z3str3 [6],                                only small bounds, there are still around 3000 (about 6%) of them using
Z3str3RE [8], and OSTRICH [11]. We also compare OSTRICHRECL with                                         large counting bounds (i.e., greater than or equal to 50).
several variants of itself, to evaluate the effectiveness of the technical
                                                                                                         Experiment setup. All experiments are conducted on CentOS Stream
choices and optimizations in Sections 6–8.
                                                                                                         release 8 with 4 Intel(R) Xeon(R) Platinum 8269CY 3.10 GHz CPU cores
                                                                                                         and 190 GB memory. We use the zaligvinder framework [42] to execute
9.1. Benchmark suites and the experiment setup                                                           the experiments, with a timeout of 60 s per instance.

   Our experiments utilize three benchmark suites, namely, RegCoL,                                       9.2. Performance evaluation against the other solvers
AutomatArk, and NestC. There are 49,843 instances in total. All bench-
mark instances are in the SMTLIB2 format. In the sequel, we give more                                        We evaluate the performance of OSTRICHRECL against the state-
details of these three benchmark suites.                                                                 of-the-art string constraint solvers, including CVC5 [3], Z3seq [1],
RegCoL benchmark suite. There are 40,628 RECL instances in the Reg-                                      Z3str3 [6], Z3str3RE [8], and OSTRICH [11], on RegCoL, AutomatArk
CoL suite. These instances are generated by extracting regexes with                                      and NestC benchmark suites. The experiment results can be found in
counting operators from the open source regex library [27,37] and                                        Table 1. Note that we take the results of CVC5 as the ground truth,3
manually constructing a RECL constraint 𝑥 ∈ 𝑒 ∧ 𝑥 ∈ 𝑒𝑠𝑎𝑛𝑖 ∧ |𝑥| >                                        and the results different from the ground truth are classified as error.
10 for each regex 𝑒, where 𝑒𝑠𝑎𝑛𝑖 ≡ 𝛴 ∗ (⟨+⟩ +′ +′′ + &)𝛴 ∗ is a regular                                  We can see that OSTRICHRECL solves almost all instances, except 182
expression that sanitizes all occurrence of special characters <, >, ′ , ′′ ,                            of them, that is, it solves 49,638 instances correctly. The number is
or &. The expression 𝑒𝑠𝑎𝑛𝑖 is introduced in view of the fact that these                                  4,547/1,917/13,618/2,764/3,134 more than the number of instances
characters are usually sanitized in Web browsers to alleviate the XSS                                    solved by CVC5/Z3str3RE/Z3str3/Z3seq/OSTRICH respectively. More-
attacks [16,38].                                                                                         over, OSTRICHRECL is the second fastest solver, whose average time on
                                                                                                         each instance is close to the fastest solver Z3str3RE (2.88 s versus 1.99
AutomatArk benchmark suite. This benchmark suite is adapted from the                                     s).
AutomatArk suite [39] by picking out the string constraints containing                                       Furthermore, in order to see how OSTRICHRECL performs on the
counting operators. We also add the length constraint |𝑥| > 10 for                                       RECL constraints where the counting and length bounds are large or
each string variable 𝑥. There are 8215 instances in the AutomatArk                                       when the counting operators are nested with some other counting op-
suite. Note that the original AutomatArk benchmark suite [39] in-                                        erators or with complement operators, we also compare OSTRICHRECL
cludes 19,979 instances, which are conjunctions of regular membership                                    with the other solvers solely on the NestC benchmark suite. The ex-
queries generated out of regular expressions in [40].                                                    periment results can be found in Table 2. From the results, we can see
NestC benchmark suite. We generate this benchmark suite by utilizing                                     that OSTRICHRECL outperforms the other solvers in both the number
the string fuzzing tool stringfuzz [41]. Note that stringfuzz is slightly                                of solved instances and the average time. Specifically, OSTRICHRECL
extended by adding the counting operators. For the benchmarks that                                       solves 873 out of 1000 instances, while the best state-of-the-art solver
contain the nestings of counting operators, we use the following tem-                                    (Z3seq) solves only 518 instances. Moreover, while the average time
plate                                                                                                    of OSTRICHRECL is only slightly better than Z3str3RE (19.49 s versus
                                                                                                         20.09 s), OSTRICHRECL solves much more instances than it (873 versus
𝑥 ∈ 𝑒𝑛𝑒𝑠𝑡 ∧ 𝑥 ∈ 𝑒𝑠𝑎𝑛𝑖 ∧ |𝑥| > 50                                                                         170). The results show the great advantage of OSTRICHRECL against the
where 𝑒𝑛𝑒𝑠𝑡 is of the form                                                                               other solvers on the NestC benchmark suite.
  {𝑚 ,𝑛 } {𝑚 ,𝑛 } {𝑚 ,𝑛 } {𝑚 ,𝑛 }
(𝑒1 1 1 𝑒2 2 2 𝑒3 3 3 𝑒4 4 4 ){𝑚5 ,𝑛5 } ,
                                                                                                            3
                                                                                                              Initially, we used the majority vote of the results of the solvers as the
such that 0 ≤ 𝑚𝑖 ≤ 𝑛𝑖 ≤ 20 for each 𝑖 ∈ [5] and 𝑒𝑖 contains no counting                                  ground truth. Nevertheless, on some problem instances, all the results of the
operators for each 𝑖 ∈ [4], moreover,                                                                    three solvers in the Z3 family are wrong (after manual inspection), thus failing
                                                                                                         this approach on these instances.
𝑒𝑠𝑎𝑛𝑖 ≡ 𝛴 ∗ (⟨+⟩ +′ +′′ + &)𝛴 ∗ .

                                                                                                    10
D. Hu and Z. Wu                                                                                                      Journal of Systems Architecture 160 (2025) 103340


                                                 Fig. 5. Distribution of problem instances w.r.t. counting bounds.


9.3. Evaluation of the technical choices and optimizations in Sections 6 -8          RECL constraints, the optimization for the nesting of counting operators
                                                                                     and a complement operator plays a greater role since it allows solving
   At first, we do experiments to evaluate the effectiveness of the                  around 1/3 more instances.
automata size-reduction techniques (i.e., Step 2) in Section 6. Then
we justify our technical choices of reducing the satisfiability prob-                Evaluation of the optimizations for the model generation in Section 8.
lem to that of LIA formulas and resort to the SMT solvers to solve                   We compare the performance of OSTRICHRECL with OSTRICHRECL       −MODEL
                                                                                                                                                             ,
them. Specifically, we compare OSTRICHRECL with a variant where                      where OSTRICHRECL −MODEL
                                                                                                                 is obtained from OSTRICHRECL by removing
the nuXmv model checker like [18] is used to solve the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠)                  the optimizations for the model generation.
problem. We also evaluate the effectiveness of the optimizations in                      The experiment results can be found in Table 5. From the results, we
Section 7, i.e., the optimizations for the nesting of counting operators as          can see that compared to OSTRICHRECL , OSTRICHRECL    −MODEL
                                                                                                                                                   solves 349
well as the nesting of counting operators and a complement operator.                 fewer instances and is 23% slower on average (3.57 s versus 2.88 s).
Finally, we evaluate the effectiveness of the optimizations for the model            As a result, the optimization for the model generation does improve
generation in Section 8.                                                             the performance, especially when the sizes of the CEFAs are huge. Note
                                                                                     that the optimization for model generation only affects the satisfiable
Evaluation of the technical choices and optimizations in Section 6. We               instances, that is, those instances where the generated LIA formulas can
compare OSTRICHRECL with OSTRICHRECL     NUXMV , which is a variant of               be solved, but the generation of models for these RECL constraints is
OSTRICHRECL where the nuXmv model checker is used to solve the                       unsuccessful.
𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem. Moreover, we also compare OSTRICHRECL with
OSTRICHRECL
          −ASR
               , which is obtained from OSTRICHRECL by removing the
automata size-reduction technique (i.e. Step 2 in Section 6.2).                      10. Conclusion
    The experiment results can be found in Table 3. From the results,
we can see that OSTRICHRECL solves 2532 more instances and is 2.38                       This work proposed an efficient automata-theoretical approach for
times faster than OSTRICHRECLNUXMV . Therefore, the technical choice in              solving string constraints with regex-counting and string-length. The
Section 6 where LIA formulas are computed and then solved by the SMT                 approach is based on encoding counting operators in regular expres-
solvers is reasonable, since OSTRICHRECL with this technical choice                  sions by cost registers symbolically instead of unfolding them explicitly.
is more efficient than OSTRICHRECL
                                 NUXMV , where nuXmv is used to solve                Moreover, this work proposed various optimization techniques to im-
the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem. Moreover, OSTRICHRECL solves 1083 more                     prove the performance further. Finally, we used three benchmark suites
instances and is 1.65 times faster than OSTRICHRECL
                                                 −ASR
                                                      . From the results,            comprising 49,843 instances in total to evaluate the performance of our
we can see that the automata-size reduction technique indeed plays an                approach. The experimental results show that our approach can solve
important role for the performance improvement.                                      more instances than the best state-of-the-art solver, at a comparable
Evaluation of the optimizations in Section 7. We also compare the perfor-            or faster speed, especially when the counting and length bounds are
mance of OSTRICHRECL with its variants OSTRICHRECL                     and           large or when counting operators are nested with some other counting
                                                               −NEST
OSTRICHRECL      ,  where OSTRICH   RECL
                                           is obtained from OSTRICH   RECL           operators or complement operators.
           −COMP                    −NEST
by removing the optimization for the nesting of counting operators,
                RECL
and OSTRICH−COMP is obtained by removing the optimization for                        CRediT authorship contribution statement
the nesting of counting operators and a complement operator. Since
these optimizations are not targeted for register-representable RECL                    Denghang Hu: Writing – original draft, Methodology. Zhilin Wu:
constraints and there are very few non-register-representable RECL con-              Writing – review & editing.
straints in the RegCoL and AutomatArk benchmark suites, we restrict
our attention to the NestC benchmark suite here.
    The experiment results can be found in Table 4. From the results, we             Declaration of competing interest
can see that on 1000 problem instances, OSTRICHRECL solves 82/339
more instances and is 1.02/1.92 times faster than OSTRICHRECL       −NEST
                                                                          /              The authors declare that they have no known competing finan-
OSTRICHRECL−COMP
                   respectively. As a result, while both  optimizations do           cial interests or personal relationships that could have appeared to
play a role in the performance improvement in solving these complex                  influence the work reported in this paper.

                                                                                11
D. Hu and Z. Wu                                                                                                                      Journal of Systems Architecture 160 (2025) 103340


References                                                                                     [19] W. Gelade, M. Gyssens, W. Martens, Regular expressions with counting: Weak
                                                                                                    versus strong determinism, SIAM J. Comput. 41 (1) (2012) 160–190, http:
                                                                                                    //dx.doi.org/10.1137/100814196.
 [1] L.M. de Moura, N. Bjørner, Z3: An efficient SMT solver, in: Tools and Algorithms
                                                                                               [20] H. Chen, P. Lu, Checking determinism of regular expressions with counting,
     for the Construction and Analysis of Systems, 14th International Conference,
                                                                                                    Inform. and Comput. 241 (2015) 302–320, http://dx.doi.org/10.1016/j.ic.2014.
     TACAS 2008, Held As Part of the Joint European Conferences on Theory and
                                                                                                    12.001.
     Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008.
                                                                                               [21] B. Loring, D. Mitchell, J. Kinder, Sound regular expression semantics for dynamic
     Proceedings, 2008, pp. 337–340, http://dx.doi.org/10.1007/978-3-540-78800-
                                                                                                    symbolic execution of JavaScript, in: Proceedings of the 40th ACM SIGPLAN
     3_24.
                                                                                                    Conference on Programming Language Design and Implementation, PLDI 2019,
 [2] T. Liang, A. Reynolds, C. Tinelli, C. Barrett, M. Deters, A DPLL(T) theory solver
                                                                                                    Phoenix, AZ, USA, June 22-26, 2019, ACM, 2019, pp. 425–438, http://dx.doi.
     for a theory of strings and regular expressions, in: CAV, 2014, pp. 646–662,
                                                                                                    org/10.1145/3314221.3314645.
     http://dx.doi.org/10.1007/978-3-319-08867-9_43.
                                                                                               [22] T. Chen, A. Flores-Lamas, M. Hague, Z. Han, D. Hu, S. Kan, A.W. Lin, P.
 [3] H. Barbosa, C. Barrett, M. Brain, G. Kremer, H. Lachnitt, M. Mann, A. Mohamed,                 Rümmer, Z. Wu, Solving string constraints with regex-dependent functions
     M. Mohamed, A. Niemetz, A. Nötzli, A. Ozdemir, M. Preiner, A. Reynolds, Y.                     through transducers with priorities and variables, Proc. ACM Program. Lang.
     Sheng, C. Tinelli, Y. Zohar, CVC5: A versatile and industrial-strength SMT solver,             6 (POPL) (2022) 1–31, http://dx.doi.org/10.1145/3498707.
     in: D. Fisman, G. Rosu (Eds.), Tools and Algorithms for the Construction and              [23] M. Kaminski, N. Francez, Finite-memory automata, in: Proceedings [1990] 31st
     Analysis of Systems, Springer International Publishing, Cham, 2022, pp. 415–442.               Annual Symposium on Foundations of Computer Science, Vol. 2, 1990, pp.
 [4] Y. Zheng, X. Zhang, V. Ganesh, Z3-str: a Z3-based string solver for web                        683–688, http://dx.doi.org/10.1109/FSCS.1990.89590.
     application analysis, in: ESEC/SIGSOFT FSE, 2013, pp. 114–124, http://dx.doi.             [24] L. D’Antoni, T. Ferreira, M. Sammartino, A. Silva, Symbolic register automata, in:
     org/10.1145/2491411.2491456.                                                                   I. Dillig, S. Tasiran (Eds.), Computer Aided Verification, Springer International
 [5] Y. Zheng, V. Ganesh, S. Subramanian, O. Tripp, J. Dolby, X. Zhang, Effective                   Publishing, Cham, 2019, pp. 3–21.
     search-space pruning for solvers of string equations, regular expressions and             [25] M.L. Minsky, Computation: Finite and Infinite Machines, in: Prentice-Hall Series
     length constraints, in: Computer Aided Verification - 27th International Confer-               in Automatic Computation, Prentice-Hall, 1967.
     ence, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I,            [26] F. Klaedtke, H. Rueß, Monadic second-order logics with cardinalities, in: J.C.M.
     Springer, 2015, pp. 235–254, http://dx.doi.org/10.1007/978-3-319-21690-4_14.                   Baeten, J.K. Lenstra, J. Parrow, G.J. Woeginger (Eds.), Automata, Languages and
 [6] M. Berzish, V. Ganesh, Y. Zheng, Z3str3: A string solver with theory-aware                     Programming, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, pp. 681–696.
     heuristics, in: 2017 Formal Methods in Computer Aided Design, FMCAD 2017,                 [27] L. Turoňová, L. Holík, O. Lengál, O. Saarikivi, M. Veanes, T. Vojnar, Regex
     Vienna, Austria, October 2-6, 2017, 2017, pp. 55–59, http://dx.doi.org/10.                     matching with counting-set automata, Proc. ACM Program. Lang. 4 (OOPSLA)
     23919/FMCAD.2017.8102241.                                                                      (2020) http://dx.doi.org/10.1145/3428286.
 [7] M. Berzish, Z3str4: A Solver for Theories over Strings (Ph.D. thesis), University         [28] L. Holík, J. Síc, L. Turonová, T. Vojnar, Fast matching of regular patterns with
     of Waterloo, Ontario, Canada, 2021, URL https://hdl.handle.net/10012/17102.                    synchronizing counting, in: O. Kupferman, P. Sobocinski (Eds.), FoSSaCS 2023,
                                                                                                    in: Lecture Notes in Computer Science, vol. 13992, Springer, 2023, pp. 392–412.
 [8] M. Berzish, J.D. Day, V. Ganesh, M. Kulczynski, F. Manea, F. Mora, D. Nowotka,
                                                                                               [29] A. Le Glaunec, L. Kong, K. Mamouras, Regular expression matching using bit
     Towards more efficient methods for solving regular-expression heavy string
                                                                                                    vector automata, Proc. ACM Program. Lang. 7 (OOPSLA1) (2023) http://dx.doi.
     constraints, Theoret. Comput. Sci. 943 (2023) 50–72.
                                                                                                    org/10.1145/3586044.
 [9] Diep Bui and contributors, Z3-trau, 2019, https://github.com/diepbp/z3-trau.
                                                                                               [30] D. Hu, Z. Wu, String constraints with regex-counting and string-length solved
[10] P.A. Abdulla, M.F. Atig, Y.-F. Chen, B.P. Diep, J. Dolby, P. Janků, H.-H. Lin, L.              more efficiently, in: H. Hermanns, J. Sun, L. Bu (Eds.), Dependable Software
     Holík, W.-C. Wu, Efficient handling of string-number conversion, in: Proceedings               Engineering. Theories, Tools, and Applications, Springer Nature Singapore,
     of the 41st ACM SIGPLAN Conference on Programming Language Design and                          Singapore, 2024, pp. 1–20.
     Implementation, in: PLDI 2020, Association for Computing Machinery, New York,             [31] J.E. Hopcroft, J.D. Ullman, Introduction to Automata Theory, Languages and
     NY, USA, 2020, pp. 943–957, http://dx.doi.org/10.1145/3385412.3386034.                         Computation, Addison-Wesley, 1979.
[11] T. Chen, M. Hague, A.W. Lin, P. Rümmer, Z. Wu, Decision procedures for path               [32] H. Seidl, T. Schwentick, A. Muscholl, P. Habermehl, Counting in trees for free, in:
     feasibility of string-manipulating programs with complex operations, PACMPL 3                  Automata, Languages and Programming: 31st International Colloquium, ICALP
     (POPL) (2019) http://dx.doi.org/10.1145/3290362.                                               2004, Turku, Finland, July 12-16, 2004. Proceedings, 2004, pp. 1136–1149,
[12] H.-E. Wang, S.-Y. Chen, F. Yu, J.-H.R. Jiang, A symbolic model checking                        http://dx.doi.org/10.1007/978-3-540-27836-8_94.
     approach to the analysis of string and length constraints, in: Proceedings of the         [33] K.N. Verma, H. Seidl, T. Schwentick, On the complexity of equational Horn
     33rd ACM/IEEE International Conference on Automated Software Engineering,                      clauses, in: CADE, 2005, pp. 337–352.
     in: ASE 2018, ACM, 2018, pp. 623–633, http://dx.doi.org/10.1145/3238147.                  [34] C. Haase, A survival guide to Presburger arithmetic, ACM SIGLOG News 5 (3)
     3238189.                                                                                       (2018) 67–82, http://dx.doi.org/10.1145/3242953.3242964.
[13] C. Chapman, K.T. Stolee, Exploring regular expression usage and context in                [35] R. Cavada, A. Cimatti, M. Dorigatti, A. Griggio, A. Mariotti, A. Micheli, S.
     Python, in: A. Zeller, A. Roychoudhury (Eds.), Proceedings of the 25th Interna-                Mover, M. Roveri, S. Tonetta, The nuXmv symbolic model checker, in: Computer
     tional Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken,                    Aided Verification - 26th International Conference, CAV 2014, Held As Part
     Germany, July 18-20, 2016, ACM, 2016, pp. 282–293, http://dx.doi.org/10.                       of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18-22, 2014.
     1145/2931037.2931073.                                                                          Proceedings, 2014, pp. 334–342.
[14] J.C. Davis, C.A. Coghlan, F. Servant, D. Lee, The impact of regular expression            [36] K. Thompson, Programming techniques: Regular expression search algorithm,
     denial of service (ReDoS) in practice: An empirical study at the ecosystem                     Commun. ACM 11 (6) (1968) 419–422, http://dx.doi.org/10.1145/363347.
     scale, in: Proceedings of the 2018 26th ACM Joint Meeting on European                          363387.
     Software Engineering Conference and Symposium on the Foundations of Software              [37] J.C. Davis, L.G. Michael IV, C.A. Coghlan, F. Servant, D. Lee, Why aren’t
     Engineering, in: ESEC/FSE 2018, Association for Computing Machinery, New                       regular expressions a Lingua Franca? An empirical study on the re-use and
     York, NY, USA, 2018, pp. 246–256.                                                              portability of regular expressions, in: Proceedings of the 2019 27th ACM Joint
                                                                                                    Meeting on European Software Engineering Conference and Symposium on
[15] P. Wang, K.T. Stolee, How well are regular expressions tested in the wild? in:
                                                                                                    the Foundations of Software Engineering, in: ESEC/FSE 2019, Association for
     G.T. Leavens, A. Garcia, C.S. Pasareanu (Eds.), Proceedings of the 2018 ACM
                                                                                                    Computing Machinery, New York, NY, USA, 2019, pp. 443–454, http://dx.doi.
     Joint Meeting on European Software Engineering Conference and Symposium on
                                                                                                    org/10.1145/3338906.3338909.
     the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena
                                                                                               [38] T. Chen, Y. Chen, M. Hague, A.W. Lin, Z. Wu, What is decidable about string
     Vista, FL, USA, November 04-09, 2018, ACM, 2018, pp. 668–678.
                                                                                                    constraints with the ReplaceAll function, PACMPL 2 (POPL) (2018) 3:1–3:29,
[16] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, D. Song, A symbolic
                                                                                                    http://dx.doi.org/10.1145/3158091.
     execution framework for JavaScript, in: 2010 IEEE Symposium on Security and               [39] M. Berzish, M. Kulczynski, F. Mora, F. Manea, J.D. Day, D. Nowotka, V. Ganesh,
     Privacy, 2010, pp. 513–528, http://dx.doi.org/10.1109/SP.2010.38.                              An SMT solver for regular expressions and linear arithmetic over string length, in:
[17] T. Liang, N. Tsiskaridze, A. Reynolds, C. Tinelli, C.W. Barrett, A decision                    A. Silva, K.R.M. Leino (Eds.), Computer Aided Verification, Springer International
     procedure for regular membership and length constraints over unbounded                         Publishing, Cham, 2021, pp. 289–312.
     strings, in: C. Lutz, S. Ranise (Eds.), Frontiers of Combining Systems - 10th             [40] L. D’Antoni, Automatark: Automata benchmark, 2018, URL https://github.com/
     International Symposium, FroCoS 2015, Wroclaw, Poland, September 21-24,                        lorisdanto/automatark.
     2015. Proceedings, in: Lecture Notes in Computer Science, vol. 9322, Springer,            [41] D. Blotsky, F. Mora, M. Berzish, Y. Zheng, I. Kabir, V. Ganesh, StringFuzz: A
     2015, pp. 135–150.                                                                             fuzzer for string solvers, in: H. Chockler, G. Weissenbacher (Eds.), Computer
[18] T. Chen, M. Hague, J. He, D. Hu, A.W. Lin, P. Rümmer, Z. Wu, A decision                        Aided Verification, Springer International Publishing, Cham, 2018, pp. 45–51.
     procedure for path feasibility of string manipulating programs with integer data          [42] M. Kulczynski, F. Manea, D. Nowotka, D.B. Poulsen, ZaligVinder: A generic test
     type, in: D.V. Hung, O. Sokolsky (Eds.), Automated Technology for Verification                 framework for string solvers, J. Software: Evol. Process. 35 (4) (2023) e2400,
     and Analysis - 18th International Symposium, ATVA 2020, Hanoi, Vietnam,                        http://dx.doi.org/10.1002/smr.2400, arXiv:https://onlinelibrary.wiley.com/doi/
     October 19-23, 2020, Proceedings, in: Lecture Notes in Computer Science, 12302,                pdf/10.1002/smr.2400.
     Springer, 2020, pp. 325–342, http://dx.doi.org/10.1007/978-3-030-59152-6_18.


                                                                                          12