Files
opaque-lattice/papers_txt/An-efficient-string-solver-for-string-constraints-wi_2025_Journal-of-Systems.txt
2026-01-06 12:49:26 -07:00

1092 lines
145 KiB
Plaintext
Raw Permalink Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
Journal of Systems Architecture 160 (2025) 103340
Contents lists available at ScienceDirect
Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
An efficient string solver for string constraints with regex-counting and
string-length
Denghang Hu , Zhilin Wu
Key Laboratory of System Software and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, University of Chinese
Academy of Sciences, Beijing, China
ARTICLE INFO ABSTRACT
Keywords: Regular expressions (regex for short) and string-length function are widely used in string-manipulating
String constraints programs. Counting is a frequently used feature in regexes that counts the number of matchings of sub-
String solver patterns. The state-of-the-art string solvers are incapable of solving string constraints with regex-counting and
Regular expression
string-length efficiently, especially when the counting and length bounds are large. In this work, we propose
String length
an automata-theoretic approach for solving such class of string constraints. The main idea is to symbolically
Counting operator
model the counting operators by registers in automata instead of unfolding them explicitly, thus alleviating the
state explosion problem. Moreover, the string-length function is modeled by a register as well. As a result, the
satisfiability of string constraints with regex-counting and string-length is reduced to the satisfiability of linear
integer arithmetic, which the off-the-shelf SMT solvers can then solve. To improve the performance further,
we also propose various optimization techniques. We implement the algorithms and validate our approach on
49,843 benchmark instances. The experimental results show that our approach can solve more instances than
the state-of-the-art solvers, at a comparable or faster speed, especially when the counting and length bounds are
large or when the counting operators are nested with some other counting operators or complement operators.
1. Introduction proposed to solve such string constraints efficiently (see e.g. [8,17]).
Counting (aka repetition) is a convenient feature in regexes that
The string data type plays a crucial role in modern programming counts the number of matchings of sub-patterns. For instance, 𝑎{2,4}
languages such as JavaScript, Python, Java, and PHP. String manipula- specifies that 𝑎 occurs at least twice and at most four times, and 𝑎{2,∞}
tions are error-prone and could even give rise to severe security vulner- specifies that 𝑎 occurs at least twice. Note that although the Kleene
abilities (e.g., cross-site scripting, aka XSS). One powerful method for
star and the Kleene plus operator are special cases of counting, i.e., 𝑒
identifying such bugs is symbolic execution. It analyzes the feasibility
is equivalent to 𝑒{0,∞} and 𝑒+ is equivalent to 𝑒{1,∞} , in the rest of this
of execution paths in a program by encoding the paths as logical
formulas and utilizing the constraint solvers (e.g. Z3 [1]) to solve the paper, for clarity, we use counting to denote expressions of the form
satisfiability of the formulas. Symbolic execution of string manipulating 𝑒{𝑚,𝑛} and 𝑒{𝑚,∞} , but not 𝑒 or 𝑒+ . Counting is a frequently used feature
programs has motivated the highly active research area of string con- of regexes. According to the statistics from [13], counting occurs in
straint solving, resulting in the development of numerous string solvers more than 20% of 1204 Python projects. Therefore, an efficient analysis
in the last decade, e.g., Z3seq [1], CVC4/5 [2,3], Z3str/2/3/4 [4 of string manipulating programs requires efficient solvers for string
7], Z3str3RE [8], Z3-Trau [9,10], OSTRICH [11], Slent [12], among constraints containing regexes with counting and string-length function
many others. Regular expressions (regex for short) and the string-length at least.
function are widely used in string-manipulating programs. According The aforementioned state-of-the-art string constraint solvers still
to the statistics from [1315], regexes are used in about 3040% of
suffer from such string constraints, especially when the counting and
Java, JavaScript, and Python software projects. Moreover, string-length
length bounds are large. For instance, none of the string solvers CVC5,
occupies 78% of the occurrences of string operations in 18 Javascript
applications, according to the statistics from [16]. As a result, most Z3seq, Z3-Trau, Z3str3, Z3str3RE, and OSTRICH is capable of solving
of the aforementioned string constraint solvers support both regexes the following constraint within 120 s,
and string-length function. Moreover, specialized algorithms have been 𝑥 ∈ (𝛴 𝑎){1,60} (𝛴 𝑏){1,60} (𝛴 𝑐){0,60} ∧ 𝑥 ∈ 𝛴 𝑐 + ∧ |𝑥| > 120. (1)
Corresponding author.
E-mail address: hudh@ios.ac.cn (D. Hu).
https://doi.org/10.1016/j.sysarc.2025.103340
Received 2 June 2024; Received in revised form 7 January 2025; Accepted 7 January 2025
Available online 17 January 2025
1383-7621/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
Intuitively, the constraint in (1) specifies that 𝑥 is a concatenation of and back references. Real-world regexes have been addressed in sym-
three strings 𝑥1 , 𝑥2 , 𝑥3 where 𝑎 (resp. 𝑏, 𝑐) does not occur in 𝑥1 (resp. bolic execution of Javascript programs [21] and string constraint
𝑥2 , 𝑥3 ), moreover, 𝑥 ends with a nonempty sequence of 𝑐s, and the solving [22]. Nevertheless, the counting operators are still unfolded
length of 𝑥 is greater than 120. It is easy to observe that this constraint explicitly in [22]. The Trau tool focuses on string constraints involving
is unsatisfiable since on the one hand, |𝑥| > 120 and the counting upper flat regular languages and string-length function and solves them by
bound 60 in both (𝛴 𝑎){1,60} and (𝛴 𝑏){1,60} imply that 𝑥 must end computing LIA formulas that define the Parikh images of flat regular
with some character from 𝛴 𝑐, that is, a character different from 𝑐, languages [10]. The Slent tool solves the string constraints involving
and on the other hand, 𝑥 ∈ 𝛴 𝑐 + requires that 𝑥 has to end with 𝑐. string-length function by encoding them into so-called length-encoded
A typical way for string constraint solvers to deal with regular automata, then utilizing symbolic model checkers to solve their non-
expressions with counting is to unfold them into those without counting emptiness problem [12]. However, neither Trau nor Slent supports
using the concatenation operator. For instance, 𝑎{1,4} is unfolded into counting operators explicitly, in other words, counting operators in
𝑎(𝜀 + 𝑎 + 𝑎𝑎 + 𝑎𝑎𝑎) and 𝑎{2,∞} is unfolded into 𝑎𝑎𝑎 . Since the unfolding regexes should be unfolded before solved by them. Cost registers in
incurs an exponential blow-up on the sizes of constraints (assuming that CEFAs are different from registers in (symbolic) register automata [23,
the counting in string constraints are encoded in binary), the unfolding 24]: In register automata, registers are used to store input values and
penalizes the performance of the solvers, especially when the length can only be compared for equality/inequality, while in CEFAs, cost
bounds are also available. registers are used to store integer-valued costs and can be updated by
Contribution. In this work, we focus on the class of string con- adding/subtracting integer constants and constrained by the accepting
straints with regular membership and string-length function, where the conditions which are LIA formulas. Therefore, cost registers in CEFAs
counting operators may occur (called RECL for brevity). We make the are more like counters in counter automata/machines [25], that is,
following contributions. CEFAs can be obtained from counter machines by removing transition
guards and adding accepting conditions. CEFAs can also be considered
• We propose an automata-theoretical approach for solving RECL as a variation of Parikh automata [26] in which the numbers in the
constraints. Our main idea is to represent the counting operators alphabet can be integers, instead of natural numbers. Counting-set
by cost registers in cost-enriched finite automata (CEFA, see Sec- automata were proposed in [27,28] to quickly match a subclass of
tion 5 for the definition), instead of unfolding them explicitly. The regexes with counting. Moreover, a variant of nondeterministic counter
string-length function is modeled by cost registers as well. The automata, called bit vector automata, was proposed recently in [29]
satisfiability of RECL constraints is reduced to the non-emptiness to enable fast matching of a more expressive class of regexes with
problem of CEFA w.r.t. a linear integer arithmetic (LIA) formula. counting. Nevertheless, the non-emptiness problem of these automata
According to the results from [18], an LIA formula can be com- was not considered, and it is unclear whether these automata models
puted to represent the potential values of registers in CEFA. Thus, can be used for solving string constraints with regex-counting and
the non-emptiness of CEFA w.r.t. LIA formulas can be reduced to string-length.
the satisfiability of LIA formulas, which is then tackled by off-the- Structure. The rest of this paper is structured as follows: Section 2 in-
shelf SMT solvers. Moreover, we propose techniques to reduce the troduces the preliminaries. Section 3 gives an overview of the approach
sizes (i.e., the number of states and transitions) of CEFA in order in this paper. Section 4 presents the syntax and semantics of RECL.
to achieve better performance. Section 5 defines CEFA. Section 6 introduces the algorithm to solve
• For complex RECL constraints where the counting operators may RECL constraints. Section 7 presents some heuristics for improving the
be nested with the other counting operators or complement oper- performance of solving RECL constraints where the counting operators
ators, we propose optimization techniques to improve the perfor- are nested with some other counting operators or complement opera-
mance. Furthermore, for the satisfiable RECL constraints where tors. Section 9 shows the experiment results. Section 10 concludes this
the solutions of the LIA formulas can be obtained, but the models paper.
of the original RECL constraint cannot be generated, we pro- A preliminary version of this paper has appeared in the proceedings
pose optimization techniques to accelerate the model generation of SETTA 2023 [30]. This paper extends the SETTA paper in the
process. following three aspects.
• We implement the algorithm on top of OSTRICH, resulting in a
• To improve the performance of complex RECL constraints where
string solver1 called OSTRICHRECL . Finally, we utilize a collec-
the counting operators are nested with some other counting op-
tion of benchmark suites2 comprising 49,843 problem instances
erators or complement operators, various optimizations are pro-
in total to evaluate the performance of OSTRICHRECL . The ex-
posed.
periment results show that OSTRICHRECL solves the RECL con-
• To deal with the issue that the satisfiability of LIA formulas can be
straints more efficiently than the state-of-the-art string solvers
solved, but models of the original RECL constraints cannot be gen-
(see Tables 15 for details). For instance, OSTRICHRECL solves
erated, we propose techniques to optimize the model generation
49,638 out of 49,843 instances, while the best state-of-the-art
process.
solver solve only 47,721 instances. We also do experiments to jus-
• The benchmarks are extended to cover the situations where the
tify some of our technical choices and evaluate the performance
counting operators are nested with some other counting opera-
of the optimizations tors or complement operators, and the experiments are extended
Related work. We discuss more related work on regexes with counting, accordingly.
string-length function, and automata with registers/counters. Deter-
minism of regexes with counting was investigated in [19,20]. Real- 2. Preliminaries
world regexes in programming languages include features beyond
classical regexes, e.g., the greedy/lazy Kleene star, capturing groups, We write N and Z for the sets of natural and integer numbers. For
𝑛 ∈ N with 𝑛 ≥ 1, [𝑛] denotes {1, … , 𝑛}; for 𝑚, 𝑛 ∈ N with 𝑚 ≤ 𝑛,
[𝑚, 𝑛] denotes {𝑖 ∈ N 𝑚 ≤ 𝑖𝑛}. Throughout the paper, 𝛴 is a finite
1
The solver is open source and available at https://github.com/ alphabet, ranged over 𝑎, 𝑏, ….
SimpleXiaohu/ostrich/tree/jsa2024 For a function 𝑓 from 𝑋 to 𝑌 and 𝑋 𝑋, we use 𝗉𝗋𝗃𝑋 (𝑓 ) to
2
The benchmarks are available at https://github.com/SimpleXiaohu/ denote the restriction (aka projection) of 𝑓 to 𝑋 , that is, 𝗉𝗋𝗃𝑋 (𝑓 ) is
ostrich/tree/jsa2024/zaligvinder/models/jsa2024. the function from 𝑋 to 𝑌 , and 𝗉𝗋𝗃𝑋 (𝑓 )(𝑥 ) = 𝑓 (𝑥 ) for each 𝑥𝑋 .
2
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
Table 1
Overall performance evaluation.
CVC5 Z3str3RE Z3str3 Z3seq Ostrich OSTRICHRECL
sat 28,282 28,449 23,322 28,212 25,975 29,084
unsat 16,809 19,316 12,742 18718 20,529 20,554
unknown 8 134 7021 203 162 14
timeout 4744 1944 6758 2710 3177 191
error 0 44 44 56 0 0
solved 45,091 47,721 36,020 46,874 46,504 49,638
average time
6.43 1.99 8.48 4.04 6.75 2.88
(s)
Table 2
Performance evaluation on the NestC benchmark suite.
CVC5 Z3str3RE Z3str3 Z3seq Ostrich OSTRICHRECL
sat 508 166 196 451 0 635
unsat 0 4 0 67 238 238
unknown 0 35 31 105 2 6
timeout 492 795 773 377 760 121
error 0 0 0 0 0 0
solved 508 170 196 518 238 873
average time
31.57 20.09 50.06 26.13 46.11 19.49
(s)
Table 3 𝐿2 , denoted by 𝐿1 ⋅ 𝐿2 , is defined as {𝑢𝑣 𝑢𝐿1 , 𝑣𝐿2 }. The union
Evaluation of the technical choices and optimizations in Section 6. (resp. intersection) of 𝐿1 and 𝐿2 , denoted by 𝐿1 𝐿2 (resp. 𝐿1 ∩ 𝐿2 ), is
OSTRICHRECL
NUXMV
OSTRICHRECL
ASR
OSTRICHRECL
defined as {𝑢 𝑢𝐿1 or 𝑢𝐿2 } (resp. {𝑢 𝑢𝐿1 and 𝑢𝐿2 }). The
sat 26,607 27,295 29,084
complement of 𝐿1 , denoted by 𝐿1 , is defined as {𝑢 𝑢 ∈ 𝛴 , 𝑢𝐿1 }.
unsat 20,499 20,513 20,554
The difference of 𝐿1 and 𝐿2 , denoted by 𝐿1 𝐿2 , is defined as 𝐿1 ∩ 𝐿2 .
unknown 45 68 14
For a language 𝐿 and 𝑛 ∈ N, we define 𝐿𝑛 inductively as follows:
timeout 2,693 1,968 191
error 0 0 0 𝐿0 = {𝜀}, 𝐿𝑛+1 = 𝐿𝐿𝑛 for every 𝑛 ∈ N. Finally, define 𝐿 = 𝑛∈N 𝐿𝑛 .
solved 47,106 47,808 49,638 Finite automata. A (nondeterministic) finite automaton (NFA) is a tuple
average time (s) 6.86 4.76 2.88
 = (𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 ), where 𝑄 is a finite set of states, 𝛴 is a finite
alphabet, 𝛿 ⊆ 𝑄 × 𝛴 × 𝑄 is the transition relation, 𝐼𝑄 is the set
Table 4 of initial states, and 𝐹𝑄 is the set of final states. For readability, we
Evaluation of the optimizations in Section 7. 𝑎 𝑎
write a transition (𝑞 , 𝑎, 𝑞 ) ∈ 𝛿 as 𝑞 ←←→
𝑞 (or simply 𝑞 ←←→
𝑞 ). The size
OSTRICHRECL
NEST
OSTRICHRECL
COMP
OSTRICHRECL 𝛿
sat 553 296 635
of an NFA , denoted by ||, is defined as the number of transitions
unsat 238 238 238 of . A run of  on a string 𝑤 = 𝑎1 ⋯ 𝑎𝑛 is a sequence of transitions
𝑎1 𝑎𝑛
unknown 27 8 6 𝑞0 ←←←←→ ← 𝑞𝑛 such that 𝑞0 ∈ 𝐼. The run is accepting if 𝑞𝑛𝐹 . A
𝑞1 ⋯ 𝑞𝑛1 ←←←←→
timeout 182 458 121 string 𝑤 is accepted by an NFA  if there is an accepting run of  on
error 0 0 0 𝑤. In particular, the empty string 𝜀 is accepted by  if 𝐼𝐹 ≠ ∅. The
solved 791 534 873 language of , denoted by (), is the set of strings accepted by . An
average time (s) 19.95 37.49 19.49 NFA  is said to be deterministic if 𝐼 is a singleton and, for every 𝑞𝑄
and 𝑎 ∈ 𝛴, there is at most one state 𝑞 𝑄 such that (𝑞 , 𝑎, 𝑞 ) ∈ 𝛿. We
Table 5 use DFA to denote deterministic finite automata.
Evaluation of the optimizations for the model generation It is well-known that finite automata capture regular languages.
in Section 8. Moreover, the class of regular languages is closed under union, inter-
OSTRICHRECL
MODEL
OSTRICHRECL
section, concatenation, Kleene star, complement, and language differ-
sat 28,735 29,084
ence [31].
unsat 20,554 20,554
Let 𝑤 ∈ 𝛴 . The Parikh image of 𝑤, denoted by 𝗉𝖺𝗋𝗂𝗄𝗁(𝑤), is defined
unknown 13 14
timeout 541 191
as the function 𝜂 𝛴 → N such that 𝜂(𝑎) = |𝑤|𝑎 for each 𝑎 ∈ 𝛴.
error 0 0 The Parikh image of an NFA , denoted by 𝗉𝖺𝗋𝗂𝗄𝗁(), is defined as
{𝗉𝖺𝗋𝗂𝗄𝗁(𝑤) 𝑤 ∈ ()}.
solved 49,289 49,638
average time (s) 3.57 2.88 Linear integer arithmetic and Parikh images. We use standard existential
linear integer arithmetic (LIA) formulas, which typically range over
𝜓 , 𝜑, 𝛷, 𝛼. For a set X of variables, we use 𝜓∕𝜑∕𝛷∕𝛼(X) to denote the
set of existential LIA formulas whose free variables are from X. For
Strings and languages. A string over 𝛴 is a (possibly empty) sequence example, we use 𝜑(⃗x) with ⃗x = (x1 , … , x𝑘 ) to denote an LIA formula 𝜑
of elements from 𝛴, denoted by 𝑢, 𝑣, 𝑤, …. An empty string is denoted whose free variables are from {x1 , … , x𝑘 }. For an LIA formula 𝜑(⃗x), we
by 𝜀. We use 𝛴𝜀 to denote 𝛴 {𝜀}. We write 𝛴 (resp., 𝛴 + ) for the use 𝜑[⃗𝑡⃗x] to denote the formula obtained by replacing (simultaneously)
set of all (resp. nonempty) strings over 𝛴. For a string 𝑢, we use |𝑢| to x𝑖 with 𝑡𝑖 for every 𝑖 ∈ [𝑘] where ⃗x = (x1 , … , x𝑘 ) and ⃗𝑡 = (𝑡1 , … , 𝑡𝑘 ) are
denote the number of letters in 𝑢. In particular, |𝜀| = 0. Moreover, for tuples of integer terms.
𝑎 ∈ 𝛴, let |𝑢|𝑎 denote the number of occurrences of 𝑎 in 𝑢. Assume that At last, we recall the result about Parikh images of NFA. For each
𝑢 = 𝑎1 ⋯ 𝑎𝑛 is nonempty and 1 ≤ 𝑖 < 𝑗𝑛. We let 𝑢[𝑖] denote 𝑎𝑖 and 𝑎 ∈ 𝛴, let z𝑎 be an integer variable. Let Z𝛴 denote the set of integer
𝑢[𝑖, 𝑗] for the substring 𝑎𝑖𝑎𝑗 . Let 𝑢, 𝑣 be two strings. We use 𝑢𝑣 to variables z𝑎 for 𝑎 ∈ 𝛴. Let  be an NFA over the alphabet 𝛴. Then
denote the concatenation of 𝑢 and 𝑣. A language 𝐿 over 𝛴 is a subset of we say that an LIA formula 𝜓(Z𝛴 ) defines the Parikh image of , if
strings. Let 𝐿1 , 𝐿2 be two languages. Then the concatenation of 𝐿1 and {𝜂 𝛴 → N 𝜓[(𝜂(𝑎)z𝑎 )𝑎∈𝛴 ] is true} = 𝗉𝖺𝗋𝗂𝗄𝗁().
3
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
Fig. 1. CEFA for (𝛴 𝑎){1,60} (𝛴 𝑏){1,60} (𝛴 𝑐){0,60} , 𝛴 𝑐 + , and |𝑥|.
Fig. 2. 1 ∩ 2 ∩ 3 : Intersection of 1 , 2 , and 3 .
Theorem 1 ([32]). Given an NFA , an existential LIA formula 𝜓 (Z𝛴 ) the LIA formula 𝜑 ≡ 𝑟4 > 120, that is, whether there exist 𝑤 ∈ 𝛴 and
can be computed in linear time that defines the Parikh image of . (𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 ) ∈ Z4 such that 𝑤 is accepted by , so that the resulting
registers values (𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 ) satisfy both 1 ≤ 𝑣1 ≤ 60 ∧ 1 ≤ 𝑣2 ≤ 60 ∧ 0 ≤
3. Overview 𝑣3 ≤ 60 and 𝜑. It is not hard to observe that the non-emptiness of
 with respect to 𝜑 is independent of the characters of . Therefore,
In this section, we utilize the string constraint in Eq. (1) to illustrate the characters in  can be ignored, resulting into an NFA  over the
the approach in our work. alphabet C, where C is the set of vectors from Z4 occurring in the
At first, we construct a CEFA for the regular expression (𝛴 transitions of  (see Fig. 3(a)). Then the original problem is reduced
𝑎){1,60} (𝛴 𝑏){1,60} (𝛴 𝑐){0,60} . Three registers are introduced, say to the problem of deciding whether there exists a string 𝑤 ∈ C that
𝑟1 , 𝑟2 , 𝑟3 , to represent the three counting operators; the nondeterministic is accepted by  and its Parikh image (i.e., numbers of occurrences of
finite automaton (NFA) for (𝛴 𝑎) (𝛴 𝑏) (𝛴 𝑐) is constructed; characters), say 𝜂𝑤′ C → N, satisfies 1 ≤ 𝑣1 ≤ 60 ∧ 1 ≤ 𝑣2 ≤ 60 ∧ 0 ≤
the updates of registers are added to the transitions of the NFA; the 𝑣3 ≤ 60 ∧ 𝑣4 > 120, where (𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 ) = 𝑣∈C
⃗ ⃗ 𝑣⃗ for each 𝑣⃗ ∈ C.
𝜂𝑤′ (𝑣)
Intuitively, (𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 ) is a weighted sum of vectors 𝑣⃗ ∈ C, where the
counting bounds are specified by the accepting condition 1 ≤ 𝑟1 ≤
60 ∧ 1 ≤ 𝑟2 ≤ 60 ∧ 0 ≤ 𝑟3 ≤ 60, resulting in a CEFA 1 illustrated weight is the number of occurrences of 𝑣⃗ in 𝑤 . (See Section 6.2 for
in Fig. 1(a). 𝑟1 + + means that we increment the value of 𝑟1 by one more detailed arguments.)
after running the transition. A string 𝑤 is accepted by 1 if, when Let C = {𝑣⃗1 , … , 𝑣⃗𝑚 }. From the results in [32,33], an existential LIA
reading the characters in 𝑤, 1 applies the transitions to update the formula 𝜓 (x1 , … , x𝑚 ) can be computed to define the Parikh image of
state and the values of registers, reaching a final state 𝑞 in the end, strings that are accepted by , where x1 , … , x𝑚 are the integer variables
and the resulting values of the three registers, say 𝑣1 , 𝑣2 , 𝑣3 , satisfy the to denote the number of occurrences of 𝑣⃗1 , … , 𝑣⃗𝑚 . Therefore, the satis-
accepting condition. In addition, we construct other two CEFAs 2 for fiability of the string constraint in (1) is reduced to the satisfiability of
𝛴 𝑐 + (see Fig. 1(b)) and 3 for string length function (see Fig. 1(c)). the following existential LIA formula,
⋀ ∑
In 3 , a register 𝑟4 is used to denote the length of strings and the 𝜓 (x1 , … , x𝑚 ) ∧ 1≤𝑗≤4 𝑟𝑗 = 1≤𝑘≤𝑚 x𝑘 𝑣𝑘,𝑗
accepting condition is 𝚝𝚛𝚞𝚎 (See Section 6.1 for more details about the (2)
1 ≤ 𝑟1 ≤ 60 ∧ 1 ≤ 𝑟2 ≤ 60 ∧ 0 ≤ 𝑟3 ≤ 60 ∧ 𝑟4 > 120.
construction of CEFA.) Note that we represent the counting operators
symbolically by registers instead of unfolding them explicitly. which can be solved by the off-the-shelf SMT solvers.
Next, we construct 1 ∩ 2 ∩ 3 , that is, the intersection (aka Nevertheless, when the original regexes are complicated (e.g. con-
product) of 1 , 2 , and 3 , as illustrated in Fig. 2(a), where the states tain occurrences of negation or intersection operators), the sizes of
that cannot reach the final states are removed. For technical conve- the NFA  can still be big and the sizes of the LIA formulas defining
nience, we also think of the updates of registers in transitions as vectors the Parikh image of  are also big. Since the satisfiability of LIA
(𝑢1 , 𝑢2 , 𝑢3 , 𝑢4 ), where 𝑢𝑖 ∈ Z is the update on the register 𝑟𝑖 for each 𝑖 ∈ formulas is an NP-complete problem [34], big sizes of LIA formulas
[4]. For instance, the transitions corresponding to the self-loop around would be a bottleneck of the performance. To tackle this issue, we
(𝑝0 , 𝑞0 , 𝑞0 ) are thought as ((𝑝0 , 𝑞0 , 𝑞0 ), 𝑎 , (𝑝0 , 𝑞0 , 𝑞0 ), (1, 0, 0, 1)) with 𝑎 ∈ propose techniques to reduce the sizes of the NFA .
𝛴 {𝑎}, since 𝑟1 and 𝑟4 are incremented by one in these transitions. Specifically, to reduce the sizes of , we determinize , and apply
After considering the updates of registers as vectors, the CEFA is like the minimization algorithm to the resulting deterministic finite au-
Fig. 2(b). tomaton (DFA), resulting in a DFA , as illustrated in Fig. 3(b). Note
Finally, the satisfiability of the original string constraint is reduced that  contains only three states 𝑞0 , 𝑞1 , 𝑞2 and six transitions, while 
to the non-emptiness of the CEFA  ≡ 1 ∩ 2 ∩ 3 with respect to contains four states and eight transitions. Furthermore, if  contains
4
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
Fig. 3. Reduced automaton  and .
Fig. 4. Syntax of RECL.
0-labeled transitions, then we can take these transitions as 𝜖-transitions 𝜃 ⊧ 𝜑, if one of the following holds: (1) 𝜑 ≡ 𝑥𝑒 and 𝜃(𝑥) ∈ (𝑒), (2)
and potentially reduce the sizes of automata further. 𝜑 ≡ 𝑡1 𝑜 𝑡2 and 𝜃(𝑡1 ) 𝑜 𝜃(𝑡2 ), (3) 𝜑 ≡ 𝜑1 ∧ 𝜑2 and 𝜃 ⊧ 𝜑1 and 𝜃 ⊧ 𝜑2 .
We implement all the aforementioned techniques on the top of OS- A RECL formula 𝜑 is satisfiable if there is a mapping 𝜃 such as 𝜃 ⊧ 𝜑.
TRICH, resulting in a solver OSTRICHRECL . It turns out that The satisfiability problem for RECL (which is abbreviated as 𝖲𝖠𝖳𝖱𝖤𝖢𝖫 )
OSTRICHRECL is able to solve the string constraint in (1) within one is deciding whether a given RECL formula 𝜑 is satisfiable.
second, while the state-of-the-art string solvers are incapable of solving
it within 120 s. 5. Cost-enriched finite automata
4. String constraints with regex-counting and string-length In this section, we define cost-enriched finite state automata (CEFA),
which was introduced in [18] and will be used to solve the satisfiability
In the sequel, we define the string constraints with regex-counting problem of RECL later on. Intuitively, CEFA adds write-only cost regis-
and string-length functions, i.e., REgex-Counting Logic (abbreviated as ters to finite state automata. write-only means that the cost registers
RECL). The syntax of RECL is defined by the rules in Fig. 4, where 𝑥 is a can only be written/updated but cannot be read, i.e., they cannot be
string variable, x is an integer variable, 𝑎 is a character from an alphabet used to guard the transitions.
𝛴, and 𝑚, 𝑛 are integer constants. A RECL formula 𝜑 is a conjunction
of atomic formulas of the form 𝑥𝑒 or 𝑡1 𝑜 𝑡2 , where 𝑒 is a regular Definition 1 (Cost-Enriched Finite Automaton). A cost-enriched finite
expression, 𝑡1 and 𝑡2 are integer terms, and 𝑜 ∈ {=, ≠, ≤, ≥, <, >}. Atomic automaton  is a tuple (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼) where
formulas of the form 𝑥𝑒 are called regular membership constraints,
𝑅 = {𝑟1 , … , 𝑟𝑘 } is a finite set of registers,
and atomic formulas of the form 𝑡1 𝑜 𝑡2 are called length constraints.
A regular expression 𝑒 is built from ∅, the empty string 𝜖, and the • 𝑄, 𝛴 , 𝐼 , 𝐹 are as in the definition of NFA,
character 𝑎 by using concatenation ⋅, union +, Kleene star , intersection • 𝛿 ⊆ 𝑄 × 𝛴 × 𝑄 × Z𝑅 is a transition relation, where Z𝑅 denotes the
∩, complement ̄, difference , counting {𝑚,𝑛} or {𝑚,∞} . An integer term updates on the values of registers.
is built from constants 𝑛, variables x, and string lengths |𝑥| by operators • 𝛼 ∈ 𝛷(𝑅) is an LIA formula specifying an accepting condition.
+ and .
For convenience, we use 𝑅 to denote the set of registers of .
Moreover, for 𝑆 ⊆ 𝛴 with 𝑆 = {𝑎1 , … , 𝑎𝑘 }, we use 𝑆 as an
We assume a linear order on 𝑅 and write 𝑅 as a vector (𝑟1 , … , 𝑟𝑘 ).
abbreviation of 𝑎1 + ⋯ + 𝑎𝑘 .
Accordingly, we write an element of Z𝑅 as a vector (𝑣1 , … , 𝑣𝑘 ), where
For each regular expression 𝑒, the language defined by 𝑒, denoted
𝑣𝑖 is the update of 𝑟𝑖 for each 𝑖 ∈ [𝑘]. We also write a transition
by (𝑒), is defined recursively. For instance, (∅) = ∅, (𝜀) = {𝜀}, 𝑎
(𝑞 , 𝑎, 𝑞 , 𝑣) ← 𝑞 .
⃗ ∈ 𝛿 as 𝑞 ←←→
(𝑎) = {𝑎}, and (𝑒1 ⋅ 𝑒2 ) = (𝑒1 ) ⋅ (𝑒2 ), (𝑒1 + 𝑒2 ) = (𝑒1 ) (𝑒2 ), 𝑣⃗
and so on. It is well-known that regular expressions define the same The semantics of CEFA is defined as follows. Let  = (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 ,
class of languages as finite state automata, that is, the class of regular 𝐹 , 𝛼) be a CEFA. A run of  on a string 𝑤 = 𝑎1 ⋯ 𝑎𝑛 is a sequence
𝑎1 𝑎𝑛 𝑎𝑖
languages [31]. 𝑞0 ←←←←→ ← 𝑞𝑛 such that 𝑞0 ∈ 𝐼 and 𝑞𝑖1 ←←←←→
𝑞1 ⋯ 𝑞𝑛1 ←←←←→ ← 𝑞𝑖 for each 𝑖 ∈ [𝑛].
𝑣⃗1 𝑣⃗𝑛 𝑣⃗𝑖
Let 𝜑 be a RECL formula and 𝖲𝖵𝖺𝗋𝗌(𝜑) (resp. 𝖨𝖵𝖺𝗋𝗌(𝜑)) denote the set 𝑎 1 𝑎𝑛
of string (resp. integer) variables occurring in 𝜑. Then the semantics of A run 𝑞0 ←←←←→← 𝑞1 ⋯ 𝑞𝑛1 ←←←←→← 𝑞𝑛 is accepting if 𝑞𝑛𝐹 and 𝛼(𝑣⃗′ 𝑅) is true,
𝑣⃗1 𝑣⃗𝑛
𝜑 is defined with respect to a mapping 𝜃 𝖲𝖵𝖺𝗋𝗌(𝜑) → 𝛴 ⊎𝖨𝖵𝖺𝗋𝗌(𝜑) → Z ∑
where 𝑣⃗′ = 𝑗∈[𝑛] 𝑣⃗𝑗 . The string 𝑤 is 𝑎𝑐 𝑐 𝑒𝑝𝑡𝑒𝑑 by the CEFA if and only
(where ⊎ denotes the disjoint union). Note that the mapping 𝜃 can ∑
if there is an accepting run of 𝑤. The vector 𝑣⃗′ = 𝑗∈[𝑛] 𝑣⃗𝑗 is called the
naturally extend to the set of integer terms. For instance, 𝜃(|𝑥|) = |𝜃(𝑥)|, 𝑎1 𝑎𝑛
𝜃(𝑡1 + 𝑡2 ) = 𝜃(𝑡1 ) + 𝜃(𝑡2 ). A mapping 𝜃 is said to satisfy 𝜑, denoted by cost of an accepting run 𝑞0 ←←←←→ ← 𝑞𝑛 . Note that the values of
𝑞1 ⋯ 𝑞𝑛1 ←←←←→
𝑣⃗1 𝑣⃗𝑛
5
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
all registers are initiated to zero and updated to 𝑗∈[𝑛] 𝑣⃗𝑗 after all the
From the construction,
transitions in the run are executed. We use 𝑣⃗′ ∈ (𝑤) to denote the fact
that there is an accepting run of  on 𝑤 whose cost is 𝑣⃗′ . We define (1 ∩ 2 ) = {(𝑤; 𝑣) ⃗ ∈ (1 ) and (𝑤; 𝗉𝗋𝗃𝑅2 (𝑣))
(𝑤; 𝗉𝗋𝗃𝑅1 (𝑣)) ⃗ ∈ (2 )}.
the semantics of a CEFA , denoted by (), as {(𝑤; 𝑣⃗′ ) 𝑣⃗′ ∈ (𝑤)}. Intuitively, 1 ∩ 2 accepts the words that are accepted by both 1
In particular, if 𝐼𝐹 ≠ ∅ and 𝛼[0𝑅] ⃗ is true, then (𝜀; 0)⃗ ∈ ().
and 2 and outputs the costs of 1 and 2 .
Moreover, we define the output of a CEFA , denoted by (), as The concatenation of 1 and 2 , denoted by 1 ⋅ 2 , is defined
{𝑣⃗′ ∃𝑤. 𝑣⃗′ ∈ (𝑤)}. similarly as that of NFA, that is, a tuple (𝑅 , 𝑄 , 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼 ), where
We want to remark that the definition of CEFA above is slightly 𝑅 = 𝑅1 𝑅2 , 𝑄 = 𝑄1 𝑄2 , 𝐼 = 𝐼1 , 𝛼 = 𝛼1 ∧ 𝛼2 , 𝛿 =
different from that of [18], where CEFA did not include accepting ⃗ (𝑞1 , 𝑎, 𝑞 , 𝑣⃗1 ) ∈ 𝛿1 } {(𝑞2 , 𝑎, 𝑞 , (0,
{(𝑞1 , 𝑎, 𝑞1 , (𝑣⃗1 , 0)) ⃗ 𝑣⃗2 )) (𝑞2 , 𝑎, 𝑞 , 𝑣⃗2 ) ∈
1 2 2
conditions. Moreover, the accepting conditions 𝛼 in CEFA are defined 𝛿2 } {(𝑞1 , 𝑎, 𝑞2 , (0,⃗ 𝑣⃗2 )) 𝑞1 ∈ 𝐹1 , ∃𝑞 𝐼2 . (𝑞 , 𝑎, 𝑞2 , 𝑣⃗2 ) ∈ 𝛿2 }, moreover, if
in a global fashion because the accepting condition does not distinguish 𝐼2 ∩𝐹2 ≠ ∅, then 𝐹 = 𝐹1 𝐹2 , otherwise, 𝐹 = 𝐹2 . From the construction,
final states. This technical choice is made so that the determinization
and minimization of NFA can be utilized to reduce the size of CEFA in (1 ⋅2 ) = {(𝑤1 𝑤2 ; 𝑣) ⃗ ∈ (1 ) and (𝑤2 ; 𝗉𝗋𝗃𝑅2 (𝑣))
(𝑤1 ; 𝗉𝗋𝗃𝑅1 (𝑣)) ⃗ ∈ (2 )}.
the next section.
In the sequel, we define three CEFA operations: union, intersection, Furthermore, the union, intersection, and concatenation operations
and concatenation. The following section will use these three oper- can be extended naturally to multiple CEFA, that is, 1 𝑛 ,
ations to solve RECL constraints. Note that the union, intersection, 1 ∩ ⋯ ∩ 𝑛 , 1 ⋅ ⋯ ⋅ 𝑛 . For instance, 1 2 3 = (1 2 ) 3 ,
and concatenation operations are defined in a slightly more involved 1 ∩ 2 ∩ 3 = (1 ∩ 2 ) ∩ 3 , and 1 ⋅ 2 ⋅ 3 = (1 ⋅ 2 ) ⋅ 3 .
manner than register automata [23] and counter automata [19], as a
result of the (additional) accepting conditions. 6. Solving RECL constraints
Let 1 = (𝑅1 , 𝑄1 , 𝛴 , 𝛿1 , 𝐼1 , 𝐹1 , 𝛼1 ) and 2 = (𝑅2 , 𝑄2 , 𝛴 , 𝛿2 , 𝐼2 , 𝐹2 , 𝛼2 )
be two CEFA that share the alphabet. Moreover, suppose that 𝑅1 ∩𝑅2 = The goal of this section is to show how to solve RECL constraints by
∅ and 𝑄1 ∩ 𝑄2 = ∅. utilizing CEFA. At first, we reduce the satisfiability of RECL constraints
The union of 1 and 2 is denoted by 1 2 . Two fresh auxiliary to a decision problem defined in the sequel. Then we propose a decision
registers say 𝑟1 , 𝑟2 ∉ 𝑅1 𝑅2 , are introduced so that the accepting procedure for this problem.
condition knows whether a run is from 1 (or 2 ). Note that the
fresh auxiliary registers are introduced to distinguish whether 1 or Definition 2 (𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠)). Let 𝑥1 , … , 𝑥𝑛 be string variables, 𝛬𝑥1 , … ,
2 accepts and weave the two accepting conditions of 1 and 2 into 𝛬𝑥𝑛 be nonempty sets of CEFA over the alphabet 𝛴 with 𝛬𝑥𝑖 =
one accepting condition. Specifically, 1 2 = (𝑅 , 𝑄 , 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼 ) {𝑖,1 , … , 𝑖,𝑙𝑖 } for every 𝑖 ∈ [𝑛] where the sets of registers 𝑅1,1 , … ,
where 𝑅1,𝑙 , … , 𝑅𝑛,1 , … , 𝑅𝑛,𝑙 are mutually disjoint, moreover, let 𝜑 be
1 𝑛
an LIA formula whose free variables are from 𝑖∈[𝑛],𝑗∈[𝑙𝑖 ] 𝑅𝑖,𝑗 . Then
𝑅 = 𝑅1 𝑅2 {𝑟1 , 𝑟2 }, 𝑄 = 𝑄1 𝑄2 {𝑞0 } with 𝑞0𝑄1 𝑄2 , the CEFA in 𝛬𝑥1 , … , 𝛬𝑥𝑛 are said to be nonempty w.r.t. 𝜑 if there
𝐼 = {𝑞0 }, are assignments 𝜃 {𝑥1 , … , 𝑥𝑛 } → 𝛴 and vectors 𝑣⃗𝑖,𝑗 such that
• 𝛿 is the union of four transitions sets: (𝜃(𝑥𝑖 ); 𝑣⃗𝑖,𝑗 ) ∈ (𝑖,𝑗 ) and 𝜑[(𝑣⃗𝑖,𝑗 ∕𝑅𝑖,𝑗 )] is true, for every 𝑖 ∈ [𝑛], 𝑗 ∈ [𝑙𝑖 ].
⃗ 1, 0)) 𝑞1 ∈ 𝐼1 . (𝑞1 , 𝑎, 𝑞 , 𝑣⃗1 ) ∈ 𝛿1 }
{(𝑞0 , 𝑎, 𝑞1 , (𝑣⃗1 , 0, 1
{(𝑞 , 𝑎, 𝑞 , (0, 𝑣⃗2 , 0, 1)) 𝑞2 ∈ 𝐼2 . (𝑞2 , 𝑎, 𝑞 , 𝑣⃗2 ) ∈ 𝛿2 }
Proposition 1 ([18]). 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) is PSPACE-complete.
0 2 2
⃗ 0, 0)) 𝑞1 ∉ 𝐼1 . (𝑞1 , 𝑎, 𝑞 , 𝑣⃗1 ) ∈ 𝛿1 }
{(𝑞1 , 𝑎, 𝑞1 , (𝑣⃗1 , 0, Note that the decision procedure in [18] was only used to prove the
1
𝑣⃗2 , 0, 0)) 𝑞2 ∉ 𝐼2 . (𝑞2 , 𝑎, 𝑞 , 𝑣⃗2 ) ∈ 𝛿2 }
{(𝑞2 , 𝑎, 𝑞2 , (0, upper bound in Proposition 1 and not implemented as a matter of fact.
2
Instead, the symbolic model checker nuXmv [35] was used to solve
⃗ 1, 0) is a vector that updates 𝑅1 by 𝑣⃗1 , updates 𝑅2 by
where (𝑣⃗1 , 0, 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠). We do not rely on nuXmv in this work and shall propose
⃗ and updates 𝑟 , 𝑟 by 1,0 respectively. Similarly for (0,
0, ⃗ 𝑣⃗2 , 0, 1), a new algorithm for solving 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) in Section 6.2.
1 2
and so on.
𝐹 and 𝛼 are defined as follows, 6.1. From 𝖲𝖠𝖳𝖱𝖤𝖢𝖫 to 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠)
⃗ belongs to (1 ) or (2 ), i.e., one of the two
if (𝜖; 0) Let 𝜑 be a RECL constraint and 𝑥1 , … , 𝑥𝑛 be an enumeration of the
automata accepts the empty string 𝜖, then 𝐹 = 𝐹1 𝐹2 {𝑞0 } string variables occurring in 𝜑. Moreover, let 𝜑 ≡ 𝜑1 ∧ 𝜑2 such that
and 𝛼 = (𝑟1 = 0 ∧ 𝑟2 = 0) (𝑟1 = 1 ∧ 𝛼1 ) (𝑟2 = 1 ∧ 𝛼2 ), 𝜑1 is a conjunction of regular membership constraints of 𝜑, and 𝜑2 is a
otherwise, 𝐹 = 𝐹1 𝐹2 and 𝛼 = (𝑟1 = 1 ∧ 𝛼1 ) (𝑟2 = 1 ∧ 𝛼2 ). conjunction of length constraints of 𝜑. We shall reduce the satisfiability
of 𝜑 to an instance of 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠).
At first, we show how to construct a CEFA from a regex where
From the construction, we know that counting operators may occur. Our approach largely follows Thomp-
⎧ | (𝑤; 𝗉𝗋𝗃 (𝑣)) ⎫
| 𝑅1 ⃗ ∈ (1 ) and 𝗉𝗋𝗃𝑟1 (𝑣) ⃗ = 1, or sons construction [36] but we avoid using the 𝜖-transitions. Similar
⎪ | ⎪
⃗ || (𝑤; 𝗉𝗋𝗃𝑅2 (𝑣))
(1 2 ) = ⎨(𝑤; 𝑣) ⃗ ∈ (2 ) and 𝗉𝗋𝗃𝑟 (𝑣) ⃗ = 1, or ⎬. ideas of representing counting operators as counters in automata have
⎪ | 2
⎪ appeared in e.g. [27].
⎩ | (𝑤; ⃗
𝑣) = (𝜖 , ⃗
0) if  or  accepts 𝜖 ⎭
| 1 2
Let us start with register-representable regexes defined in the sequel.
Intuitively, 1 2 accepts the words that are accepted by one of the Let us fix an alphabet 𝛴.
1 and 2 and outputs the costs of the corresponding automaton. Note Let 𝑒 be a regex over 𝛴. Then an occurrence of counting operators in
that an integer vector 𝑣⃗1 in 1 , is copied at most twice in 1 2 , 𝑒, say (𝑒 ){𝑚,𝑛} (or (𝑒 ){𝑚,∞} ), is said to be register-representable if (𝑒 ){𝑚,𝑛}
⃗ 0, 0) and (𝑣⃗1 , 0,
namely, (𝑣⃗1 , 0, ⃗ 1, 0). Similarly for the integer vectors in (or (𝑒 ){𝑚,∞} ) is not in the scope of a Kleene star, another counting
2 . operator, complement, or language difference in 𝑒. We say that 𝑒 is
The intersection of 1 and 2 , denoted by 1 ∩2 = (𝑅 , 𝑄 , 𝛴 , 𝛿 , 𝐼 , register-representable if and only if all occurrences of counting operators
𝐹 , 𝛼 ), is defined in the sequel.
in 𝑒 are register-representable. For instance, 𝑎{2,6} ∩ 𝑎{4,∞} is register-
representable, while 𝑎{2,6} and (𝑎{2,6} ){4,∞} are not since 𝑎{2,6} is in the
𝑅 = 𝑅1 𝑅2 , 𝑄 = 𝑄1 × 𝑄2 , 𝐼 = 𝐼1 × 𝐼2 , 𝐹 = 𝐹1 × 𝐹2 , 𝛼 = 𝛼1 ∧ 𝛼2 , scope of complement and the counter operator {4, ∞} respectively.
• 𝛿 comprises the tuples ((𝑞1 , 𝑞2 ), 𝑎, (𝑞1 , 𝑞2 ), (𝑣⃗1 , 𝑣⃗2 )) such that (𝑞1 , 𝑎, Let 𝑒 be a register-representable regex over 𝛴. By the following
𝑞1 , 𝑣⃗1 ) ∈ 𝛿1 and (𝑞2 , 𝑎, 𝑞2 , 𝑣⃗2 ) ∈ 𝛿2 . procedure, we will construct a CEFA out of 𝑒, denoted by 𝑒 .
6
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
1. For each sub-expression (𝑒 ){𝑚,𝑛} with 𝑚 ≤ 𝑛 (resp. (𝑒 ){𝑚,∞} ) Observation 2. For each CEFA  = (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼),
of 𝑒, we construct a CEFA (𝑒 ){𝑚,𝑛} (resp. (𝑒 ){𝑚,∞} ). Let 𝑒′ = ⎧ | ⎡∑ ⎫
(𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 ). Then (𝑒 ){𝑚,𝑛} = ((𝑟 ), 𝑄 , 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼 ), where ⎪∑ |
| / ⎤ ⎪
() = ⎨ ⃗ 𝑣⃗ | 𝜂 ∈ 𝗉𝖺𝗋𝗂𝗄𝗁( ) and 𝛼
𝜂(𝑣) ⃗ 𝑣⃗ 𝑅⎥ is true ⎬ .
𝜂(𝑣)
𝑟 is a new register, 𝑄 = 𝑄 {𝑞0 } with 𝑞0 ∉ 𝑄, 𝐼 = {𝑞0 }, | ⎢ ⎥
𝑣∈C
⃗ | ⎣𝑣∈C
⃗  ⎦ ⎪
𝐹 = 𝐹 {𝑞0 }, and ⎩  | ⎭
𝛿 = {(𝑞 , 𝑎, 𝑞 , (0)) (𝑞 , 𝑎, 𝑞 ) ∈ 𝛿}
For 𝑖 ∈ [𝑛], let 𝛼𝑖 be the accepting condition of 𝑖 . Then from
{(𝑞0 , 𝑎, 𝑞 , (1)) 𝑞0𝐼 . (𝑞0 , 𝑎, 𝑞 ) ∈ 𝛿}
Observation 2, we know that the following two conditions are equiva-
{(𝑞 , 𝑎, 𝑞 , (1)) 𝑞𝐹 , ∃𝑞0𝐼 . (𝑞0 , 𝑎, 𝑞 ) ∈ 𝛿},
lent,
moreover, 𝛼 = 𝑚 ≤ 𝑟𝑛 if 𝐼𝐹 = ∅, otherwise 𝛼 = 𝑟𝑛.
(Intuitively, if 𝜀 is accepted by 𝑒′ , then the value of 𝑟 can be • there are outputs 𝑣⃗1 ∈ (1 ), … , 𝑣⃗𝑛 ∈ (𝑛 ) such that
less than 𝑚.) Moreover, (𝑒 ){𝑚,∞} is constructed by adapting 𝛼 𝜑[𝑣⃗1 𝑅1 , … , 𝑣⃗𝑛 ∕𝑅𝑛 ] is true,
in (𝑒 ){𝑚,𝑛} as follows: 𝛼 = 𝑚 ≤ 𝑟 if 𝐼𝐹 = ∅ and 𝛼 = 𝚝𝚛𝚞𝚎 • there are 𝜂1 ∈ 𝗉𝖺𝗋𝗂𝗄𝗁(1 ), … , 𝜂𝑛 ∈ 𝗉𝖺𝗋𝗂𝗄𝗁(𝑛 ) such that
otherwise.
⋀ ⎡ ∑ / ⎤ ⎡ ∑ / ∑ / ⎤
2. For each sub-expression 𝑒 of 𝑒 such that 𝑒 contains occurrences 𝛼𝑖 ⎢ ⃗ 𝑣⃗ 𝑅𝑖 ⎥ ∧ 𝜑 ⎢
𝜂𝑖 (𝑣) 𝜂1 (𝑣)
⃗ 𝑣⃗ 𝑅1 , … , ⃗ 𝑣⃗ 𝑅𝑛 ⎥
𝜂𝑛 (𝑣)
of counting operators but 𝑒 itself is not of the form (𝑒1 ){𝑚,𝑛} or ⎢ ⎥ ⎢ ⎥
𝑖∈[𝑛] ⎣𝑣∈C
⃗ 𝑖 ⎦ ⎣𝑣∈C
⃗ 
1
𝑣∈C
⃗ 𝑛 ⎦
(𝑒1 ){𝑚,∞} , from the assumption that 𝑒 is register-representable,
we know that 𝑒 is of the form 𝑒1 ⋅ 𝑒2 , 𝑒1 + 𝑒2 , 𝑒1 ∩ 𝑒2 , or (𝑒1 ). For is true.
𝑒 = (𝑒1 ), we have 𝑒′ = 𝑒′ . For 𝑒 = 𝑒1 ⋅ 𝑒2 , 𝑒 = 𝑒1 + 𝑒2 , Therefore, to solve the non-emptiness of CEFA in 𝛬′𝑥 , … , 𝛬′𝑥 w.r.t.
1 1 𝑛
or 𝑒 = 𝑒1 ∩ 𝑒2 , suppose that CEFA 𝑒′ and 𝑒′ have been 𝜑, it is sufficient to compute the existential LIA formulas 𝜓 (ZC ), …,
1 2 1 1
constructed. 𝜓 (ZC ) to represent the Parikh images of 1 , ⋯, 𝑛 respectively,
𝑛 𝑛
3. For each maximal sub-expression 𝑒 of 𝑒 such that 𝑒 contains where ZC = {z𝑖,𝑣⃗ 𝑣⃗ ∈ C𝑖 } for 𝑖 ∈ [𝑛], and solve the satisfiability of
𝑖
no occurrences of counting operators, an NFA 𝑒′ can be con- the following existential LIA formula
structed by structural induction on the syntax of 𝑒 . Then we
⋀⎛ ⎡ ∑ / ⎤⎞
have 𝑒′ = 𝑒′ ⋅ 𝑒′ , 𝑒′ = 𝑒′ 𝑒′ , or 𝑒′ = 𝑒′ ∩ 𝑒′ . ⎜𝜓 (ZC ) ∧ 𝛼𝑖 ⎢ z𝑖,𝑣⃗ 𝑣⃗ 𝑅𝑖 ⎥⎟ ∧
1 2 1 2 1 2
⎜  𝑖𝑖 ⎢ ⎥⎟
𝑖∈[𝑛] ⎝ ⎣𝑣∈C
⃗ 𝑖 ⎦⎠
For non-register-representable regexes, we first convert them into
register- ⎡ ∑ / ∑ / ⎤
representable regexes by unfolding all the non-register-representable 𝜑⎢ z1,𝑣⃗ 𝑣⃗ 𝑅1 , … , z𝑛,𝑣⃗ 𝑣⃗ 𝑅𝑛 ⎥ .
⎢ ⎥
occurrences of counting operators. After that, we utilize the aforemen- ⎣𝑣∈C
⃗ 
1
𝑣∈C
⃗ 𝑛 ⎦
tioned procedure to construct CEFA. For instance, (𝑎{2,6} ⋅ 𝑏 ){2,∞} is Intuitively, the integer variables z𝑖,𝑣⃗ represent the number of occur-
transformed into (𝑎𝑎(𝜀 + 𝑎 + 𝑎𝑎 + 𝑎𝑎𝑎 + 𝑎𝑎𝑎𝑎) ⋅ 𝑏 ){2,∞} . rences of 𝑣⃗ in the strings accepted by 𝑖 .
For each 𝑖 ∈ [𝑛], let 𝑥𝑖𝑒𝑖,1 , … , 𝑥𝑖𝑒𝑖,𝑙𝑖 be an enumeration of Because the sizes of the LIA formulas 𝜓 (ZC ), … , 𝜓 (ZC ) are
1 1 𝑛 𝑛
the regular membership constraints for 𝑥𝑖 in 𝜑1 . Then we can construct proportional to the sizes (more precisely, the alphabet size, the number
CEFA 𝑖,𝑗 from 𝑒𝑖,𝑗 for each 𝑖 ∈ [𝑛] and 𝑗 ∈ [𝑙𝑖 ]. Moreover, we construct of states and transitions) of NFA 1 , … , 𝑛 , and the satisfiability of
another CEFA 𝑖,0 for each 𝑖 ∈ [𝑛] to model the length of 𝑥𝑖 . Specifi- existential LIA formulas is NP-complete, it is vital to reduce the sizes
cally, 𝑖,0 is constructed as ((𝑟𝑖,0 ), {𝑞𝑖,0 }, 𝛴 , 𝛿𝑖,0 , {𝑞𝑖,0 }, {𝑞𝑖,0 }, 𝚝𝚛𝚞𝚎) where of these NFAs to improve the performance.
𝑟𝑖,0 is a fresh register and 𝛿𝑖,0 = {(𝑞𝑖,0 , 𝑎, 𝑞𝑖,0 , (1)) 𝑎 ∈ 𝛴}. Let 𝛬𝑥𝑖 = ∑ ∑
Since 𝑣∈C(⃗ 𝑖)
⃗ 𝑣⃗ = 𝑣∈C(
𝜂(𝑣) ⃗ 𝜂(𝑣) ⃗ 𝑣⃗ for each 𝑖 ∈ [𝑛] and 𝜂 ∈
{𝑖,0 , 𝑖,1 , … , 𝑖,𝑙𝑖 } for each 𝑖 ∈ [𝑛], and 𝜑2 ≡ 𝜑2 [𝑟1,0 |𝑥1 |, … , 𝑟𝑛,0 |𝑥𝑛 |]. ⃗ 𝑖 ){0}
𝗉𝖺𝗋𝗂𝗄𝗁(𝑖 ), it turns out that the 0-labeled ⃗ transitions in 𝑖 do not
Then the satisfiability of 𝜑 is reduced to the non-emptiness of CEFAs in ∑
𝛬𝑥1 , … , 𝛬𝑥𝑛 w.r.t. 𝜑2 . contribute to the final output 𝑣∈C( ⃗ 𝑖 ) 𝜂(𝑣)
𝑣.
⃗ Therefore, we can apply
the following size-reduction technique for 𝑖 s.
6.2. Solving 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) Step 2 (Reducing automata sizes). For each 𝑖 ∈ [𝑛], we view the
transitions (𝑞 , 0, ⃗ 𝑞 ) in  as 𝜀-transitions (𝑞 , 𝜀, 𝑞 ), and remove the 𝜀-
𝑖
In this section, we present a procedure to solve the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) transitions from 𝑖 . Then we determinize and minimize the resulting
problem: Suppose that 𝑥1 , … , 𝑥𝑛 are mutually distinct string variables, NFA. □
𝛬𝑥1 , … , 𝛬𝑥𝑛 are nonempty sets of CEFA over the same alphabet 𝛴 where For 𝑖 ∈ [𝑛], let 𝑖 denote the DFA obtained from 𝑖 by executing
𝛬𝑥𝑖 = {𝑖,1 , …, 𝑖,𝑙𝑖 } for every 𝑖 ∈ [𝑛]. Moreover, the sets of registers Step 2 and C𝑖 = C𝑖 {0}. ⃗ From the construction, we know that
𝑅1,1 , … , 𝑅𝑛,𝑙 are mutually disjoint, and 𝜑 is a LIA formula whose 𝗉𝖺𝗋𝗂𝗄𝗁(𝑖 ) = 𝗉𝗋𝗃C (𝗉𝖺𝗋𝗂𝗄𝗁(𝑖 )) for each 𝑖 ∈ [𝑛]. Therefore, we compute
𝑛 𝑖
free variables are from 𝑖∈[𝑛],𝑗∈[𝑙𝑖 ] 𝑅𝑖,𝑗 . The procedure comprises three LIA formulas from 𝑖 s, instead of 𝑖 s, to represent the Parikh images.
steps. Step 3 (Computing an LIA formula). For each 𝑖 ∈ [𝑛], we compute an
Step 1 (Computing intersection automata). For each 𝑖 ∈ [𝑛], compute existential LIA formula 𝜓𝑖 (ZC ) from 𝑖 to represent 𝗉𝖺𝗋𝗂𝗄𝗁(𝑖 ). Then
𝑖
a CEFA 𝑖 = 𝑖,1 ∩ ⋯ ∩ 𝑖,𝑙𝑖 , and let 𝛬′𝑥 = {𝑖 }. □ the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem is reduced to the satisfiability of the following
𝑖
After Step 1, the non-emptiness of CEFAs in 𝛬𝑥1 , … , 𝛬𝑥𝑛 w.r.t. 𝜑 is LIA formula, □
reduced to the non-emptiness of CEFAs in 𝛬′𝑥 , … , 𝛬′𝑥 w.r.t. 𝜑.
1 𝑛 ⋀⎛ ⎡ ∑ / ⎤⎞
In the following steps, we reduce the non-emptiness of CEFAs in ⎜𝜓 (ZC ) ∧ 𝛼𝑖 ⎢ z𝑖,𝑣⃗ 𝑣⃗ 𝑅𝑖 ⎥⎟ ∧
𝛬′𝑥 , …, 𝛬′𝑥 w.r.t. 𝜑 to the satisfiability of an LIA formula. The reduction ⎜ 𝑖𝑖 ⎢ ⎥⎟
1 𝑛
𝑖∈[𝑛] ⎝ ⎣𝑣∈C
⃗ 𝑖 ⎦⎠
relies on the following two observations. (3)
⎡ ∑ / ∑ / ⎤
𝜑 ⎢ z1,𝑣⃗ 𝑣⃗ 𝑅1 , … , z𝑛,𝑣⃗ 𝑣⃗ 𝑅𝑛 . ⎥
Observation 1. CEFA in 𝛬′𝑥 = {1 }, … , 𝛬′𝑥 = {𝑛 } are nonempty w.r.t. ⎢ ⎥
1 𝑛𝑣∈C
⃗ 
1
𝑣∈C
⃗ 𝑛 ⎦
𝜑 iff there are outputs 𝑣⃗1 ∈ (1 ), … , 𝑣⃗𝑛 ∈ (𝑛 ) such that 𝜑[𝑣⃗1 𝑅1 , …,
𝑣⃗𝑛 ∕𝑅𝑛 ] is true. Step 4 (Solving the LIA formula). We utilize some off-the-shelf SMT
Let  = (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼) be a CEFA and C = {𝑣⃗ ∃𝑞 , 𝑎, 𝑞 . (𝑞 , 𝑎, 𝑞 , solver (e.g. Z3) to solve the satisfiability of the LIA formula in Eq. (3).
⃗ ∈ 𝛿}. Moreover, let  = (𝑄, C , 𝛿 , 𝐼 , 𝐹 ) be an NFA over the alpha-
𝑣) If the solver reports sat and returns a solution m, then we go to Step
bet C that is obtained from  by dropping the accepting condition 5. Otherwise, the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem has no solution. □
and ignoring the characters, that is, 𝛿 comprises tuples (𝑞 , 𝑣, ⃗ 𝑞 ) such Step 5 (Constructing a solution). Let m be a solution of the LIA formula
that (𝑞 , 𝑎, 𝑞 , 𝑣)
⃗ ∈ 𝛿 for 𝑎 ∈ 𝛴. in Eq. (3). We want to construct from m and the CEFAs 𝑖 for 𝑖 ∈ [𝑛],
7
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
a solution m of the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem. The construction of m is RECL constraint: (i) unfold the counting operators in 𝑒1 , (ii)
obtained by constructing m (𝑥𝑖 ) for each 𝑖 ∈ [𝑛]. Let 𝑖 ∈ [𝑛]. We unfold the counting operator {𝑚,𝑛} . Let 𝗎𝖿 𝗅𝖽1 (𝑒) and 𝗎𝖿 𝗅𝖽2 (𝑒) denote
describe how to construct m (𝑥𝑖 ). The free variables ZC correspond the resulting regular expressions. Note that in 𝗎𝖿 𝗅𝖽1 (𝑒), all the
𝑖
to the number of occurrences of characters in C𝑖 , i.e. the alphabet of counting operators in 𝑒1 (if there is any) disappear, while in
the NFA 𝑖 . Moreover, in 𝑖 , the characters are just the integer vectors 𝗎𝖿 𝗅𝖽2 (𝑒), all the counting operators in 𝑒1 are preserved and copied
and the characters in the original CEFAs 𝑖 used to construct 𝑖 are for 𝑛 times.
ignored. The construction of m (𝑥𝑖 ) is actually to solve the reachability • In the sequel, we estimate the sizes (the number of transitions)
problem in a finite transition system where of the NFAs 𝗎𝖿 𝗅𝖽 (𝑒) and 𝗎𝖿 𝗅𝖽 (𝑒) , denoted by 𝑠𝑧1 and 𝑠𝑧2 . If
1 2
𝑠𝑧1 ≤ 𝑠𝑧2 , then we choose to unfold the counting operators in
• the nodes are of the form (𝑞 , (n𝑖,𝑣⃗ )𝑣∈C
⃗ 𝑖 ), where n𝑖,𝑣⃗ denotes the 𝑒1 . Otherwise, we choose to unfold the counting operator {𝑚,𝑛} .
number of the remaining occurrences of the integer vector 𝑣⃗ in • To estimate the sizes of 𝗎𝖿 𝗅𝖽 (𝑒) and 𝗎𝖿 𝗅𝖽 (𝑒) , we first esti-
1 2
𝑖 , mate the sizes of the CEFAs 𝗎𝖿 𝗅𝖽1 (𝑒) and 𝗎𝖿 𝗅𝖽2 (𝑒) as two pairs
• the transitions are of the form (𝗌𝗇𝗎𝗆1 (𝑒), 𝗏𝗇𝗎𝗆1 (𝑒)) and (𝗌𝗇𝗎𝗆2 (𝑒), 𝗏𝗇𝗎𝗆2 (𝑒)), where 𝗌𝗇𝗎𝗆1 (𝑒) and
((𝑞 , (n𝑖,𝑣⃗ )𝑣∈C
⃗), (𝑞 , (n𝑖,𝑣⃗ )𝑣∈C 𝗏𝗇𝗎𝗆1 (𝑒) are the estimates of the number of states and integer vec-
⃗  ), (𝑞 , 𝑎, 𝑞 , 𝑢 ⃗  ))
𝑖 𝑖 tors in 𝗎𝖿 𝗅𝖽1 (𝑒) separately, similarly for (𝗌𝗇𝗎𝗆2 (𝑒), 𝗏𝗇𝗎𝗆2 (𝑒)). Since
such that (𝑞 , 𝑎, 𝑞 , 𝑢⃗) is a transition in 𝑖 , n = n𝑖,𝑢⃗ 1, moreover, the NFAs 𝗎𝖿 𝗅𝖽 (𝑒) and 𝗎𝖿 𝗅𝖽 (𝑒) are constructed from 𝗎𝖿 𝗅𝖽1 (𝑒)
𝑖,𝑢⃗ 1 2
n = n𝑖,𝑣⃗ for every 𝑣⃗ ≠ 𝑢⃗. and 𝗎𝖿 𝗅𝖽2 (𝑒) by dropping the accepting conditions and ignore
𝑖,𝑣⃗
characters, 𝗌𝗇𝗎𝗆1 (𝑒) and 𝗏𝗇𝗎𝗆1 (𝑒) are the estimates of the number
Moreover, the initial and final state of the reachability problem are set of states and characters of 𝗎𝖿 𝗅𝖽 (𝑒) , similarly for 𝗎𝖿 𝗅𝖽 (𝑒) . Then
1 2
to be (𝑞0 , (m(z𝑖,𝑣⃗ ))𝑣∈C
⃗ 𝑖 ) and (𝑞𝑓 , (0, … , 0)) for some initial state 𝑞0 and we estimate the sizes (number of transitions) of the NFAs 𝗎𝖿 𝗅𝖽 (𝑒)
1
final state 𝑞𝑓 in 𝑖 . If a path from the initial state to the final state is and 𝗎𝖿 𝗅𝖽 (𝑒) as 𝑠𝑧1 = (𝗌𝗇𝗎𝗆1 (𝑒))2 ⋅ 𝗏𝗇𝗎𝗆1 (𝑒) and 𝑠𝑧2 = (𝗌𝗇𝗎𝗆2 (𝑒))2 ⋅
found, then the path gives us a desired string m (𝑥𝑖 ). □ 2
𝗏𝗇𝗎𝗆2 (𝑒) respectively.
7. Optimizations for non-register-representable RECL constraints It remains to show how to compute (𝗌𝗇𝗎𝗆1 (𝑒), 𝗏𝗇𝗎𝗆1 (𝑒)) and
(𝗌𝗇𝗎𝗆2 (𝑒), 𝗏𝗇𝗎𝗆2 (𝑒)) from 𝑒 = (𝑒1 ){𝑚,𝑛} . Let 𝑒 = (𝑒1 ){𝑚,𝑛} be a reg-
In Section 6, to deal with the non-register-representable RECL con- ular expression where 𝑒1 is a register-representable regular expres-
straints, we unfold all the non-register-representable occurrences of sion. Note that there is only one register in 𝗎𝖿 𝗅𝖽1 (𝑒) . As a result,
counting operators. Nevertheless, the naive unfolding incurs an expo- we set 𝗏𝗇𝗎𝗆1 (𝑒) = 2. Moreover, in the sequel, we inductively com-
nential blow-up over the formula sizes, especially when the counting pute 𝗌𝗇𝗎𝗆1 (𝑒1 ) and (𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) from 𝑒1 , then let 𝗌𝗇𝗎𝗆1 (𝑒) =
bounds are large. Recall that an occurrence of counting operators is said 𝗌𝗇𝗎𝗆1 (𝑒1 ) and (𝗌𝗇𝗎𝗆2 (𝑒), 𝗏𝗇𝗎𝗆2 (𝑒)) = (𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )).
to be non-register-representable if it is in the scope of some Kleene star,
another counting operator, complement, or language difference. Since • If 𝑒1 = ∅ or 𝑒1 = 𝜖, then 𝗌𝗇𝗎𝗆1 (𝑒1 ) = 0 and
Kleene star can be seen as a special case of the counting operators, and (𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) = (0, 0).
language difference 𝑒1 𝑒2 can be rewritten as 𝑒1 ∩ 𝑒2 , without loss of
(Note that here the constant 0, instead of 1, is chosen, so that
generality, we can assume that a non-register representable occurrence
later on, when we consider 𝑒1 = 𝑒2 ∩ 𝑒3 , 𝗌𝗇𝗎𝗆2 (𝑒2 ) × 𝗌𝗇𝗎𝗆2 (𝑒3 ) and
of a counting operator is in the scope of another counting operator or
𝗏𝗇𝗎𝗆2 (𝑒2 ) ×𝗏𝗇𝗎𝗆2 (𝑒3 ) will become 0 since 𝜖 ∩𝑒3 = 𝜖 and ∅ ∩𝑒3 = ∅.)
some complement operator. In the sequel, to alleviate the formula-size
blow-up problem resulted from the naive unfolding, we present various
• If 𝑒1 = 𝑎, then 𝗌𝗇𝗎𝗆1 (𝑒1 ) = 2 and
optimizations, for the nesting of counting operators, and the nesting of
counting operators and a complement operator respectively. (𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) = (2𝑛, 1).
7.1. Optimizations for the nesting of counting operators
• If 𝑒1 = 𝑒2 ⋅ 𝑒3 , then 𝗌𝗇𝗎𝗆1 (𝑒1 ) = 𝗌𝗇𝗎𝗆1 (𝑒2 ) + 𝗌𝗇𝗎𝗆1 (𝑒3 ) and
(𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) =
In the sequel, we propose an optimization for the nesting of count-
ing operators. The optimization is designed for the special case that (𝑛(𝗌𝗇𝗎𝗆2 (𝑒2 ) + 𝗌𝗇𝗎𝗆2 (𝑒3 )), 𝑛(𝗏𝗇𝗎𝗆2 (𝑒2 ) + 𝗏𝗇𝗎𝗆2 (𝑒3 ))).
the nesting depth of counting operators is one. For a RECL constraint (Note that because 𝑒1 is concatenated for 𝑛 times in 𝗎𝖿 𝗅𝖽2 (𝑒), we
whose nesting depth of counting operators is greater than two, we can define 𝗌𝗇𝗎𝗆2 (𝑒1 ) as 𝑛(𝗌𝗇𝗎𝗆2 (𝑒2 ) + 𝗌𝗇𝗎𝗆2 (𝑒3 )). Moreover, since the
apply the optimization for the inner-most nesting of counting operators integer vectors in 1 ⋅ 2 are of the form (𝑣⃗1 , 0) ⃗ and (0,
𝑣⃗2 ) where
and decrement the nesting depth, then apply the optimization again to 𝑣⃗1 and 𝑣⃗2 are integer vectors in 1 and 2 respectively, we define
the resulting constraint, and continue this process, until obtaining a 𝗏𝗇𝗎𝗆2 (𝑒1 ) as 𝑛(𝗏𝗇𝗎𝗆2 (𝑒2 ) + 𝗏𝗇𝗎𝗆2 (𝑒3 ))).
register-representable RECL constraint. • If 𝑒1 = 𝑒2 + 𝑒3 , then 𝗌𝗇𝗎𝗆1 (𝑒1 ) = 𝗌𝗇𝗎𝗆1 (𝑒2 ) + 𝗌𝗇𝗎𝗆1 (𝑒3 ) and
The main idea of the optimization is illustrated by the constraint
(𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) =
𝑥 ∈ (𝑎{1,1000} ){1,2} . If we apply the naive unfolding in Section 6, we
(𝑛(𝗌𝗇𝗎𝗆2 (𝑒2 ) + 𝗌𝗇𝗎𝗆2 (𝑒3 ) + 1), 𝑛(2𝗏𝗇𝗎𝗆2 (𝑒2 ) + 2𝗏𝗇𝗎𝗆2 (𝑒3 ))).
will unfold 𝑎{1,1000} and obtain the constraint 𝑥 ∈ (𝑎(𝜀 + 𝑎 + 𝑎𝑎 +
⋯ + 𝑎𝑎 )){1,2} . It is easy to observe that a smarter way is to unfold (Recall that each integer vector in 1 resp. 2 is copied at most
⏟⏟⏟ twice in the construction of 1 2 .)
999
{1,2} , instead of {1,1000} . If we do this, then we get a constraint 𝑥 ∈ • If 𝑒1 = 𝑒2 ∩ 𝑒3 , then 𝗌𝗇𝗎𝗆1 (𝑒1 ) = 𝗌𝗇𝗎𝗆1 (𝑒2 ) × 𝗌𝗇𝗎𝗆1 (𝑒3 ) and
(𝑎{1,1000} )(𝜀 + 𝑎{1,1000} ), whose size is much smaller, compared to 𝑥 ∈ (𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) =
(𝑎(𝜀+𝑎+𝑎𝑎+⋯+ 𝑎𝑎 )){1,2} . (Recall that we assume a binary encoding (𝑛(𝗌𝗇𝗎𝗆2 (𝑒2 ) × 𝗌𝗇𝗎𝗆2 (𝑒3 )), 𝑛(𝗏𝗇𝗎𝗆2 (𝑒2 ) × 𝗏𝗇𝗎𝗆2 (𝑒3 ))).
⏟⏟⏟
999 (Recall that the integer vectors in 1 ∩ 2 are of the form
of the integer constants in counting operators.)
(𝑣⃗1 , 𝑣⃗2 ), where 𝑣⃗1 and 𝑣⃗2 are the integer vectors in 1 and 2
Let us describe the optimization more precisely.
respectively.)
• For a regular expression 𝑒 = (𝑒1 ){𝑚,𝑛} where 𝑒1 is a register- • If 𝑒1 = 𝑒2 , then 𝗌𝗇𝗎𝗆1 (𝑒1 ) = 2𝗌𝗇𝗎𝗆1 (𝑒2 ) and
representable regular expression, we can unfold the counting (𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) = (𝑛 ⋅ 2𝗌𝗇𝗎𝗆2 (𝑒2 ) , 1).
operators in the following ways to obtain a register-representable
8
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
(In this case, since 𝑒1 is register-representable, we know that 𝑒2
Intuitively, 𝑞𝑠 is a sink state that deals with the situations where there
contains no counting operators. As a result, we first construct the
are no runs, and 𝑟 is a register newly introduced so that the different
NFA for 𝑒2 , then determinize and complement it to construct the
cases of the non-accepting runs of 𝑒1 can be captured by a single
automaton for 𝑒1 .)
accepting condition 𝛼 .
• If 𝑒1 = (𝑒2 ) , then 𝗌𝗇𝗎𝗆1 (𝑒1 ) = 𝗌𝗇𝗎𝗆1 (𝑒2 ) and
Note that  is a strict over-approximation of 𝑒 = 𝑒1 since 𝑒1 may be
(𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) = (𝑛𝗌𝗇𝗎𝗆2 (𝑒2 ), 1). nondeterministic. For instance, if there are two runs of 𝑒1 on a string
𝑤, say one run that ends at a non-final state and another run that ends
(In this case, since 𝑒1 is register-representable, we know that 𝑒2
at a final state and satisfies accepting condition 𝛼. Then 𝑤 ∈ (𝑒1 ),
contains no counting operators.)
thus 𝑤 ∉ (𝑒1 ). Nevertheless, according to the construction, 𝑤 is
• If 𝑒1 = (𝑒2 ){𝑚 ,𝑛 } or 𝑒1 = (𝑒2 ){𝑛 ,∞} , then 𝗌𝗇𝗎𝗆1 (𝑒1 ) = 𝑛 𝗌𝗇𝗎𝗆1 (𝑒2 )
accepted by , witnessed by the run of 𝑒1 that ends at a non-final
and
state.
(𝗌𝗇𝗎𝗆2 (𝑒1 ), 𝗏𝗇𝗎𝗆2 (𝑒1 )) = (𝑛𝗌𝗇𝗎𝗆2 (𝑒2 ), 𝑛).
Combination of under- and over-approximation. We combine the under-
(In this case, since 𝑒1 is register-representable, we know that 𝑒2 and over-approximation as well as the decision procedure into the
contains no counting operators.) following procedure.
1. We first apply the under-approximation.
7.2. Optimizations for the nesting of counting operators and a complement
2. If the under-approximation returns sat, then we know that
operator
the original constraint is satisfiable. Otherwise, the under-appro-
As mentioned in Section 6, if some counting operators are in the ximation returns unsat and we apply the over-approximation.
scope of a complement operator, it is necessary to unfold these counting 3. If the over-approximation returns unsat, then we know that
operators before the complement operation. The exponential blow- the original constraint is unsatisfiable. Otherwise, the over-
up in the unfolding combined with the blow-up in the complement approximation returns sat and we resort to the decision pro-
operation will produce a huge number of states. To alleviate this cedure (without approximations) in Section 6 to solve the RECL
problem, we propose two approximations of the complement operation, constraint.
i.e., an under-approximation and an over-approximation, and combine
them into a procedure that achieves a nice balance between efficiency 8. Optimizations for model generation
and precision.
Let 𝑒 = 𝑒1 be a regular expression, where 𝑒1 is a register- Recall that in Step 5 of Section 6, we construct a solution m from
representable regular expression. the model m of the LIA formula in Eq. (3) by solving the reachability
problems in a finite transition system. It is easy to see that the transition
Under-approximation. One way of doing the complement operation is
system contains exponentially many nodes in general and a naive
to transform 𝑒1 into a regular expression that does not contain counting
search over the transition system is not efficient. In the sequel, we
operators. We compute an under-approximation of 𝑒 by replacing each
describe two optimizations so that the reachability problem can be
occurrence of the counting operators in 𝑒1 , say (𝑒2 ){𝑚,𝑛} or (𝑒{𝑛,∞}
2
), by
solved more efficiently.
𝑒2 . Let us use 𝑒1 to denote the resulting regular expression. Note that
To ease the presentation, let us assume that 𝑛, i.e. the number of
𝑒1 does not contain counting operators. Moreover, it is easy to observe
string variables, to be 1.
that (𝑒1 ) ⊆ (𝑒1 ). As a result, we have (𝑒1 ) ⊆ (𝑒1 ). That is, 𝑒1 is an At first, we refine the states of the transition system by recording
under-approximation of 𝑒1 . the number of transitions, instead of just the number of occurrences of
Over-approximation. For 𝑒 = 𝑒1 , we utilize the CEFA constructed from the integer vectors.
𝑒1 , say 𝑒1 , to construct a CEFA  of 𝑒 so that () is an over- Refining the states of the transition system by recording the number of
approximation of (𝑒). Let 𝑒1 = (𝑅, 𝑄, 𝛴 , 𝛿 , 𝐼 , 𝐹 , 𝛼). The main idea of transitions. Let 𝜃 denote the formula in Eq. (3). We shall construct the
the construction is explained as follows: If a string 𝑤 is not accepted formula 𝜃 so that any model of 𝜃 necessarily encodes the number of
by 𝑒1 , then either there are no runs of 𝑒1 on 𝑤, or every run of 𝑒1 occurrences of transitions within 1 .
on 𝑤 stop at a non-final state or enter a final state but do not satisfy ⋀
𝜃 ≡ 𝜓1 (Z1 ) ∧ z1,𝑣⃗ = m(z1,𝑣⃗ ) ∧
the accepting condition 𝛼. We shall construct a CEFA  that accepts
𝑣∈C
⃗ 
when one of these situations occurs. Then (𝑒1 ) ⊆ (), that is,  is ⋀
1
an over-approximation of 𝑒. z1,𝑣⃗ = z1,(𝑞 ,𝑎,𝑞 ,𝑣)
⃗,
Without loss of generality, we can assume that there are no transi- 𝑣∈C
⃗ 
1 (𝑞 ,𝑎,𝑞 ) s.t. (𝑞 ,𝑎,𝑞 ,𝑣)
⃗ is a transition in 1
tions out of 𝐹 in 𝑒1 . If this is not the case, then a new state 𝑞𝑓 can ⃗ is a transition in  } is the set of
where Z1 = {z1,(𝑞 ,𝑎,𝑞 ,𝑣) ⃗ (𝑞 , 𝑎, 𝑞 , 𝑣) 1
be introduced to transform 𝑒1 into a CEFA satisfying this constraint,
variables z1,(𝑞 ,𝑎,𝑞 ,𝑣)⃗ denoting the number of occurrences of (𝑞 , 𝑎, 𝑞 , 𝑣).
where 𝑞𝑓 is the unique accepting state. Then we define  = (𝑅 {𝑟 }, 𝑄
Note that compared to 𝜃, 𝜃 uses the LIA formula 𝜓1 , which is
{𝑞𝑠 }, 𝛴 , 𝛿 , 𝐼 , 𝑄 {𝑞𝑠 }, 𝛼 ), where
constructed from 1 by ignoring its accepting condition and tak-
𝑟 is a new register not in 𝑅, ing it as an NFA where the alphabet is seen as the set of transi-
tions. Furthermore, 𝜃 relates the values of the variables z1,𝑣⃗ and
𝑞𝑠 is a new state not in 𝑄,
• 𝛿 comprises the following transitions, ⃗ , while simultaneously constraining the values of z1,𝑣⃗ to be
z1,(𝑞 ,𝑎,𝑞 ,𝑣)
m(z1,𝑣⃗ ). We then submit 𝜃 to SMT solver, which returns a model
(𝑞 , 𝑎, 𝑞 , (𝑣,
⃗ 0)) such that (𝑞 , 𝑎, 𝑞 , 𝑣)
⃗ ∈ 𝛿 and 𝑞 𝐹 , of 𝜃 , in particular, the values of the variables z1,(𝑞,𝑎,𝑞 ,𝑣) ⃗ . Equipped
⃗ 0)) such that there do not exist 𝑎-labeled transi-
(𝑞 , 𝑎, 𝑞𝑠 , (0, with the model, we refine the transition system by considering the
tions out of 𝑞, that is, for every 𝑞 and 𝑣, ⃗ we have (𝑞 , 𝑎, 𝑞 , 𝑣)
⃗ ∉ nodes (𝑞 , (n1,(𝑞,𝑎,𝑞 ,𝑣) ⃗ is a transition in 1 ), where n1,(𝑞 ,𝑎,𝑞 ,𝑣)
⃗ )(𝑞 ,𝑎,𝑞 ,𝑣) ⃗ de-
𝛿, notes the number of remaining occurrences of (𝑞 , 𝑎, 𝑞 , 𝑣). ⃗ Note that
(𝑞 , 𝑎, 𝑞 , (𝑣,
⃗ 1)) such that (𝑞 , 𝑎, 𝑞 , 𝑣)
⃗ ∈ 𝛿 and 𝑞 𝐹 , this refinement can avoid searching useless transitions. For instance,
if n1,(𝑞,𝑎,𝑞 ,𝑣) ⃗ then we can avoid searching the
(𝑞𝑠 , 𝑎, 𝑞𝑠 , (0, 0)) such that 𝑎 ∈ 𝛴, ⃗ = 0 for some (𝑞 , 𝑎, 𝑞 , 𝑣),
transition (𝑞 , 𝑎, 𝑞 , 𝑣)
⃗ out of 𝑞. This stands in contrast to the original
𝛼 is defined as 𝑟 = 1 → ¬𝛼. transition system where n1,𝑣⃗ is used, it may be the case that we have
9
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
n1,𝑣⃗ > 0 (since there may be multiple transitions associated with 𝑣)
⃗ and Based on this template, we generate 500 instances.
the transition (𝑞 , 𝑎, 𝑞 , 𝑣)
⃗ should still be searched. For the benchmarks that contain the nestings of counting operators
Moreover, we propose a method to reduce the number of transitions and a complement operator, we utilize the following template
in the transition system further.
𝑥𝑒𝑐 𝑜𝑚𝑝1 ∧ 𝑥𝑒𝑐 𝑜𝑚𝑝2 ∧ 𝑥𝑒𝑠𝑎𝑛𝑖 ∧ |𝑥| > 50,
Pruning the transitions in the transition system. Let the current state of {𝑚 ,𝑛 } {𝑚 ,𝑛 } {𝑚 ,𝑛 } {𝑚 ,𝑛 }
the transition system be (𝑝, (n1,(𝑞 ,𝑎,𝑞 ,𝑣) where both 𝑒𝑐 𝑜𝑚𝑝1 and 𝑒𝑐 𝑜𝑚𝑝2 are of the form 𝑒1 1 1 𝑒2 2 2 𝑒3 3 3 𝑒4 4 4 ,
⃗ is a transition in 1 ). Let
⃗ )(𝑞 ,𝑎,𝑞 ,𝑣)
𝑝 denote the set of transitions that are reachable from the state 𝑝 in such that for each 𝑖 ∈ [4], 20 ≤ 𝑚𝑖 ≤ 𝑛𝑖 ≤ 1000 and 𝑒𝑖 contains
1 . If {(𝑞 , 𝑎, 𝑞 , 𝑣) no counting operators. Note that two regular expressions 𝑒𝑐 𝑜𝑚𝑝1 and
⃗ > 0} ⊈ 𝑝 , in other words, there exists
n1,(𝑞 ,𝑎,𝑞 ,𝑣)
some transitions (𝑞 , 𝑎, 𝑞 , 𝑣) ⃗ not reachable from the state 𝑝 in 1 while 𝑒𝑐 𝑜𝑚𝑝2 are introduced to have a better balance between the numbers of
satisfiable and unsatisfiable instances. Based on this template, we also
⃗ > 0 is established in the current state (𝑝, (n1,(𝑞 ,𝑎,𝑞 ,𝑣)
n1,(𝑞,𝑎,𝑞 ,𝑣) ⃗ )(𝑞 ,𝑎,𝑞 ,𝑣)
is a transition in 1 ), then we know that it is impossible to reach the generate 500 instances.
final state (𝑞𝑓 , (0, … , 0)) from the current state so that we can remove In total, the NestC benchmark suite contains 1000 instances.
all the transitions out of the current state in the transition system. Note that the two templates are chosen so that on the one hand,
the nestings of counting operators or nestings of counting operators
and complement operators are available, and on the other hand, some
9. Experiments
additional regular and length constraints are introduced, so that the
constraints are not trivially satisfied, in particular, the constraints are
We implemented the procedures and optimizations in Sections 68
not satisfied by very short strings.
on top of OSTRICH, resulting in a string solver called OSTRICHRECL . In
this section, we evaluate the performance of OSTRICHRECL on three Distribution of problem instances w.r.t. counting bounds. The distribution
benchmark suites, that is, RegCoL, AutomatArk, and NestC. In the of problem instances w.r.t. the counting bounds in the three suites is
sequel, we first describe the three benchmark suites as well as the shown in Fig. 5, where the 𝑥-axis represents the counting bound and
experiment setup. Then we present the experiment results. We evaluate the 𝑦-axis represents the number of problem instances containing the
the performance and correctness of our solver against the state-of-the- counting bound. In our statistics, while most problem instances contain
art string constraint solvers, including CVC5 [3], Z3seq [1], Z3str3 [6], only small bounds, there are still around 3000 (about 6%) of them using
Z3str3RE [8], and OSTRICH [11]. We also compare OSTRICHRECL with large counting bounds (i.e., greater than or equal to 50).
several variants of itself, to evaluate the effectiveness of the technical
Experiment setup. All experiments are conducted on CentOS Stream
choices and optimizations in Sections 68.
release 8 with 4 Intel(R) Xeon(R) Platinum 8269CY 3.10 GHz CPU cores
and 190 GB memory. We use the zaligvinder framework [42] to execute
9.1. Benchmark suites and the experiment setup the experiments, with a timeout of 60 s per instance.
Our experiments utilize three benchmark suites, namely, RegCoL, 9.2. Performance evaluation against the other solvers
AutomatArk, and NestC. There are 49,843 instances in total. All bench-
mark instances are in the SMTLIB2 format. In the sequel, we give more We evaluate the performance of OSTRICHRECL against the state-
details of these three benchmark suites. of-the-art string constraint solvers, including CVC5 [3], Z3seq [1],
RegCoL benchmark suite. There are 40,628 RECL instances in the Reg- Z3str3 [6], Z3str3RE [8], and OSTRICH [11], on RegCoL, AutomatArk
CoL suite. These instances are generated by extracting regexes with and NestC benchmark suites. The experiment results can be found in
counting operators from the open source regex library [27,37] and Table 1. Note that we take the results of CVC5 as the ground truth,3
manually constructing a RECL constraint 𝑥𝑒𝑥𝑒𝑠𝑎𝑛𝑖 ∧ |𝑥| > and the results different from the ground truth are classified as error.
10 for each regex 𝑒, where 𝑒𝑠𝑎𝑛𝑖 ≡ 𝛴 (⟨+⟩ + + + &)𝛴 is a regular We can see that OSTRICHRECL solves almost all instances, except 182
expression that sanitizes all occurrence of special characters <, >, , , of them, that is, it solves 49,638 instances correctly. The number is
or &. The expression 𝑒𝑠𝑎𝑛𝑖 is introduced in view of the fact that these 4,547/1,917/13,618/2,764/3,134 more than the number of instances
characters are usually sanitized in Web browsers to alleviate the XSS solved by CVC5/Z3str3RE/Z3str3/Z3seq/OSTRICH respectively. More-
attacks [16,38]. over, OSTRICHRECL is the second fastest solver, whose average time on
each instance is close to the fastest solver Z3str3RE (2.88 s versus 1.99
AutomatArk benchmark suite. This benchmark suite is adapted from the s).
AutomatArk suite [39] by picking out the string constraints containing Furthermore, in order to see how OSTRICHRECL performs on the
counting operators. We also add the length constraint |𝑥| > 10 for RECL constraints where the counting and length bounds are large or
each string variable 𝑥. There are 8215 instances in the AutomatArk when the counting operators are nested with some other counting op-
suite. Note that the original AutomatArk benchmark suite [39] in- erators or with complement operators, we also compare OSTRICHRECL
cludes 19,979 instances, which are conjunctions of regular membership with the other solvers solely on the NestC benchmark suite. The ex-
queries generated out of regular expressions in [40]. periment results can be found in Table 2. From the results, we can see
NestC benchmark suite. We generate this benchmark suite by utilizing that OSTRICHRECL outperforms the other solvers in both the number
the string fuzzing tool stringfuzz [41]. Note that stringfuzz is slightly of solved instances and the average time. Specifically, OSTRICHRECL
extended by adding the counting operators. For the benchmarks that solves 873 out of 1000 instances, while the best state-of-the-art solver
contain the nestings of counting operators, we use the following tem- (Z3seq) solves only 518 instances. Moreover, while the average time
plate of OSTRICHRECL is only slightly better than Z3str3RE (19.49 s versus
20.09 s), OSTRICHRECL solves much more instances than it (873 versus
𝑥𝑒𝑛𝑒𝑠𝑡𝑥𝑒𝑠𝑎𝑛𝑖 ∧ |𝑥| > 50 170). The results show the great advantage of OSTRICHRECL against the
where 𝑒𝑛𝑒𝑠𝑡 is of the form other solvers on the NestC benchmark suite.
{𝑚 ,𝑛 } {𝑚 ,𝑛 } {𝑚 ,𝑛 } {𝑚 ,𝑛 }
(𝑒1 1 1 𝑒2 2 2 𝑒3 3 3 𝑒4 4 4 ){𝑚5 ,𝑛5 } ,
3
Initially, we used the majority vote of the results of the solvers as the
such that 0 ≤ 𝑚𝑖 ≤ 𝑛𝑖 ≤ 20 for each 𝑖 ∈ [5] and 𝑒𝑖 contains no counting ground truth. Nevertheless, on some problem instances, all the results of the
operators for each 𝑖 ∈ [4], moreover, three solvers in the Z3 family are wrong (after manual inspection), thus failing
this approach on these instances.
𝑒𝑠𝑎𝑛𝑖 ≡ 𝛴 (⟨+⟩ + + + &)𝛴 .
10
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
Fig. 5. Distribution of problem instances w.r.t. counting bounds.
9.3. Evaluation of the technical choices and optimizations in Sections 6 -8 RECL constraints, the optimization for the nesting of counting operators
and a complement operator plays a greater role since it allows solving
At first, we do experiments to evaluate the effectiveness of the around 1/3 more instances.
automata size-reduction techniques (i.e., Step 2) in Section 6. Then
we justify our technical choices of reducing the satisfiability prob- Evaluation of the optimizations for the model generation in Section 8.
lem to that of LIA formulas and resort to the SMT solvers to solve We compare the performance of OSTRICHRECL with OSTRICHRECL MODEL
,
them. Specifically, we compare OSTRICHRECL with a variant where where OSTRICHRECL MODEL
is obtained from OSTRICHRECL by removing
the nuXmv model checker like [18] is used to solve the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) the optimizations for the model generation.
problem. We also evaluate the effectiveness of the optimizations in The experiment results can be found in Table 5. From the results, we
Section 7, i.e., the optimizations for the nesting of counting operators as can see that compared to OSTRICHRECL , OSTRICHRECL MODEL
solves 349
well as the nesting of counting operators and a complement operator. fewer instances and is 23% slower on average (3.57 s versus 2.88 s).
Finally, we evaluate the effectiveness of the optimizations for the model As a result, the optimization for the model generation does improve
generation in Section 8. the performance, especially when the sizes of the CEFAs are huge. Note
that the optimization for model generation only affects the satisfiable
Evaluation of the technical choices and optimizations in Section 6. We instances, that is, those instances where the generated LIA formulas can
compare OSTRICHRECL with OSTRICHRECL NUXMV , which is a variant of be solved, but the generation of models for these RECL constraints is
OSTRICHRECL where the nuXmv model checker is used to solve the unsuccessful.
𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem. Moreover, we also compare OSTRICHRECL with
OSTRICHRECL
ASR
, which is obtained from OSTRICHRECL by removing the
automata size-reduction technique (i.e. Step 2 in Section 6.2). 10. Conclusion
The experiment results can be found in Table 3. From the results,
we can see that OSTRICHRECL solves 2532 more instances and is 2.38 This work proposed an efficient automata-theoretical approach for
times faster than OSTRICHRECLNUXMV . Therefore, the technical choice in solving string constraints with regex-counting and string-length. The
Section 6 where LIA formulas are computed and then solved by the SMT approach is based on encoding counting operators in regular expres-
solvers is reasonable, since OSTRICHRECL with this technical choice sions by cost registers symbolically instead of unfolding them explicitly.
is more efficient than OSTRICHRECL
NUXMV , where nuXmv is used to solve Moreover, this work proposed various optimization techniques to im-
the 𝖭𝖤𝖫𝖨𝖠 (𝖢𝖤𝖥𝖠) problem. Moreover, OSTRICHRECL solves 1083 more prove the performance further. Finally, we used three benchmark suites
instances and is 1.65 times faster than OSTRICHRECL
ASR
. From the results, comprising 49,843 instances in total to evaluate the performance of our
we can see that the automata-size reduction technique indeed plays an approach. The experimental results show that our approach can solve
important role for the performance improvement. more instances than the best state-of-the-art solver, at a comparable
Evaluation of the optimizations in Section 7. We also compare the perfor- or faster speed, especially when the counting and length bounds are
mance of OSTRICHRECL with its variants OSTRICHRECL and large or when counting operators are nested with some other counting
NEST
OSTRICHRECL , where OSTRICH RECL
is obtained from OSTRICH RECL operators or complement operators.
COMP NEST
by removing the optimization for the nesting of counting operators,
RECL
and OSTRICHCOMP is obtained by removing the optimization for CRediT authorship contribution statement
the nesting of counting operators and a complement operator. Since
these optimizations are not targeted for register-representable RECL Denghang Hu: Writing original draft, Methodology. Zhilin Wu:
constraints and there are very few non-register-representable RECL con- Writing review & editing.
straints in the RegCoL and AutomatArk benchmark suites, we restrict
our attention to the NestC benchmark suite here.
The experiment results can be found in Table 4. From the results, we Declaration of competing interest
can see that on 1000 problem instances, OSTRICHRECL solves 82/339
more instances and is 1.02/1.92 times faster than OSTRICHRECL NEST
/ The authors declare that they have no known competing finan-
OSTRICHRECLCOMP
respectively. As a result, while both optimizations do cial interests or personal relationships that could have appeared to
play a role in the performance improvement in solving these complex influence the work reported in this paper.
11
D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340
References [19] W. Gelade, M. Gyssens, W. Martens, Regular expressions with counting: Weak
versus strong determinism, SIAM J. Comput. 41 (1) (2012) 160190, http:
//dx.doi.org/10.1137/100814196.
[1] L.M. de Moura, N. Bjørner, Z3: An efficient SMT solver, in: Tools and Algorithms
[20] H. Chen, P. Lu, Checking determinism of regular expressions with counting,
for the Construction and Analysis of Systems, 14th International Conference,
Inform. and Comput. 241 (2015) 302320, http://dx.doi.org/10.1016/j.ic.2014.
TACAS 2008, Held As Part of the Joint European Conferences on Theory and
12.001.
Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008.
[21] B. Loring, D. Mitchell, J. Kinder, Sound regular expression semantics for dynamic
Proceedings, 2008, pp. 337340, http://dx.doi.org/10.1007/978-3-540-78800-
symbolic execution of JavaScript, in: Proceedings of the 40th ACM SIGPLAN
3_24.
Conference on Programming Language Design and Implementation, PLDI 2019,
[2] T. Liang, A. Reynolds, C. Tinelli, C. Barrett, M. Deters, A DPLL(T) theory solver
Phoenix, AZ, USA, June 22-26, 2019, ACM, 2019, pp. 425438, http://dx.doi.
for a theory of strings and regular expressions, in: CAV, 2014, pp. 646662,
org/10.1145/3314221.3314645.
http://dx.doi.org/10.1007/978-3-319-08867-9_43.
[22] T. Chen, A. Flores-Lamas, M. Hague, Z. Han, D. Hu, S. Kan, A.W. Lin, P.
[3] H. Barbosa, C. Barrett, M. Brain, G. Kremer, H. Lachnitt, M. Mann, A. Mohamed, Rümmer, Z. Wu, Solving string constraints with regex-dependent functions
M. Mohamed, A. Niemetz, A. Nötzli, A. Ozdemir, M. Preiner, A. Reynolds, Y. through transducers with priorities and variables, Proc. ACM Program. Lang.
Sheng, C. Tinelli, Y. Zohar, CVC5: A versatile and industrial-strength SMT solver, 6 (POPL) (2022) 131, http://dx.doi.org/10.1145/3498707.
in: D. Fisman, G. Rosu (Eds.), Tools and Algorithms for the Construction and [23] M. Kaminski, N. Francez, Finite-memory automata, in: Proceedings [1990] 31st
Analysis of Systems, Springer International Publishing, Cham, 2022, pp. 415442. Annual Symposium on Foundations of Computer Science, Vol. 2, 1990, pp.
[4] Y. Zheng, X. Zhang, V. Ganesh, Z3-str: a Z3-based string solver for web 683688, http://dx.doi.org/10.1109/FSCS.1990.89590.
application analysis, in: ESEC/SIGSOFT FSE, 2013, pp. 114124, http://dx.doi. [24] L. DAntoni, T. Ferreira, M. Sammartino, A. Silva, Symbolic register automata, in:
org/10.1145/2491411.2491456. I. Dillig, S. Tasiran (Eds.), Computer Aided Verification, Springer International
[5] Y. Zheng, V. Ganesh, S. Subramanian, O. Tripp, J. Dolby, X. Zhang, Effective Publishing, Cham, 2019, pp. 321.
search-space pruning for solvers of string equations, regular expressions and [25] M.L. Minsky, Computation: Finite and Infinite Machines, in: Prentice-Hall Series
length constraints, in: Computer Aided Verification - 27th International Confer- in Automatic Computation, Prentice-Hall, 1967.
ence, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I, [26] F. Klaedtke, H. Rueß, Monadic second-order logics with cardinalities, in: J.C.M.
Springer, 2015, pp. 235254, http://dx.doi.org/10.1007/978-3-319-21690-4_14. Baeten, J.K. Lenstra, J. Parrow, G.J. Woeginger (Eds.), Automata, Languages and
[6] M. Berzish, V. Ganesh, Y. Zheng, Z3str3: A string solver with theory-aware Programming, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, pp. 681696.
heuristics, in: 2017 Formal Methods in Computer Aided Design, FMCAD 2017, [27] L. Turoňová, L. Holík, O. Lengál, O. Saarikivi, M. Veanes, T. Vojnar, Regex
Vienna, Austria, October 2-6, 2017, 2017, pp. 5559, http://dx.doi.org/10. matching with counting-set automata, Proc. ACM Program. Lang. 4 (OOPSLA)
23919/FMCAD.2017.8102241. (2020) http://dx.doi.org/10.1145/3428286.
[7] M. Berzish, Z3str4: A Solver for Theories over Strings (Ph.D. thesis), University [28] L. Holík, J. Síc, L. Turonová, T. Vojnar, Fast matching of regular patterns with
of Waterloo, Ontario, Canada, 2021, URL https://hdl.handle.net/10012/17102. synchronizing counting, in: O. Kupferman, P. Sobocinski (Eds.), FoSSaCS 2023,
in: Lecture Notes in Computer Science, vol. 13992, Springer, 2023, pp. 392412.
[8] M. Berzish, J.D. Day, V. Ganesh, M. Kulczynski, F. Manea, F. Mora, D. Nowotka,
[29] A. Le Glaunec, L. Kong, K. Mamouras, Regular expression matching using bit
Towards more efficient methods for solving regular-expression heavy string
vector automata, Proc. ACM Program. Lang. 7 (OOPSLA1) (2023) http://dx.doi.
constraints, Theoret. Comput. Sci. 943 (2023) 5072.
org/10.1145/3586044.
[9] Diep Bui and contributors, Z3-trau, 2019, https://github.com/diepbp/z3-trau.
[30] D. Hu, Z. Wu, String constraints with regex-counting and string-length solved
[10] P.A. Abdulla, M.F. Atig, Y.-F. Chen, B.P. Diep, J. Dolby, P. Janků, H.-H. Lin, L. more efficiently, in: H. Hermanns, J. Sun, L. Bu (Eds.), Dependable Software
Holík, W.-C. Wu, Efficient handling of string-number conversion, in: Proceedings Engineering. Theories, Tools, and Applications, Springer Nature Singapore,
of the 41st ACM SIGPLAN Conference on Programming Language Design and Singapore, 2024, pp. 120.
Implementation, in: PLDI 2020, Association for Computing Machinery, New York, [31] J.E. Hopcroft, J.D. Ullman, Introduction to Automata Theory, Languages and
NY, USA, 2020, pp. 943957, http://dx.doi.org/10.1145/3385412.3386034. Computation, Addison-Wesley, 1979.
[11] T. Chen, M. Hague, A.W. Lin, P. Rümmer, Z. Wu, Decision procedures for path [32] H. Seidl, T. Schwentick, A. Muscholl, P. Habermehl, Counting in trees for free, in:
feasibility of string-manipulating programs with complex operations, PACMPL 3 Automata, Languages and Programming: 31st International Colloquium, ICALP
(POPL) (2019) http://dx.doi.org/10.1145/3290362. 2004, Turku, Finland, July 12-16, 2004. Proceedings, 2004, pp. 11361149,
[12] H.-E. Wang, S.-Y. Chen, F. Yu, J.-H.R. Jiang, A symbolic model checking http://dx.doi.org/10.1007/978-3-540-27836-8_94.
approach to the analysis of string and length constraints, in: Proceedings of the [33] K.N. Verma, H. Seidl, T. Schwentick, On the complexity of equational Horn
33rd ACM/IEEE International Conference on Automated Software Engineering, clauses, in: CADE, 2005, pp. 337352.
in: ASE 2018, ACM, 2018, pp. 623633, http://dx.doi.org/10.1145/3238147. [34] C. Haase, A survival guide to Presburger arithmetic, ACM SIGLOG News 5 (3)
3238189. (2018) 6782, http://dx.doi.org/10.1145/3242953.3242964.
[13] C. Chapman, K.T. Stolee, Exploring regular expression usage and context in [35] R. Cavada, A. Cimatti, M. Dorigatti, A. Griggio, A. Mariotti, A. Micheli, S.
Python, in: A. Zeller, A. Roychoudhury (Eds.), Proceedings of the 25th Interna- Mover, M. Roveri, S. Tonetta, The nuXmv symbolic model checker, in: Computer
tional Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken, Aided Verification - 26th International Conference, CAV 2014, Held As Part
Germany, July 18-20, 2016, ACM, 2016, pp. 282293, http://dx.doi.org/10. of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18-22, 2014.
1145/2931037.2931073. Proceedings, 2014, pp. 334342.
[14] J.C. Davis, C.A. Coghlan, F. Servant, D. Lee, The impact of regular expression [36] K. Thompson, Programming techniques: Regular expression search algorithm,
denial of service (ReDoS) in practice: An empirical study at the ecosystem Commun. ACM 11 (6) (1968) 419422, http://dx.doi.org/10.1145/363347.
scale, in: Proceedings of the 2018 26th ACM Joint Meeting on European 363387.
Software Engineering Conference and Symposium on the Foundations of Software [37] J.C. Davis, L.G. Michael IV, C.A. Coghlan, F. Servant, D. Lee, Why arent
Engineering, in: ESEC/FSE 2018, Association for Computing Machinery, New regular expressions a Lingua Franca? An empirical study on the re-use and
York, NY, USA, 2018, pp. 246256. portability of regular expressions, in: Proceedings of the 2019 27th ACM Joint
Meeting on European Software Engineering Conference and Symposium on
[15] P. Wang, K.T. Stolee, How well are regular expressions tested in the wild? in:
the Foundations of Software Engineering, in: ESEC/FSE 2019, Association for
G.T. Leavens, A. Garcia, C.S. Pasareanu (Eds.), Proceedings of the 2018 ACM
Computing Machinery, New York, NY, USA, 2019, pp. 443454, http://dx.doi.
Joint Meeting on European Software Engineering Conference and Symposium on
org/10.1145/3338906.3338909.
the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena
[38] T. Chen, Y. Chen, M. Hague, A.W. Lin, Z. Wu, What is decidable about string
Vista, FL, USA, November 04-09, 2018, ACM, 2018, pp. 668678.
constraints with the ReplaceAll function, PACMPL 2 (POPL) (2018) 3:13:29,
[16] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, D. Song, A symbolic
http://dx.doi.org/10.1145/3158091.
execution framework for JavaScript, in: 2010 IEEE Symposium on Security and [39] M. Berzish, M. Kulczynski, F. Mora, F. Manea, J.D. Day, D. Nowotka, V. Ganesh,
Privacy, 2010, pp. 513528, http://dx.doi.org/10.1109/SP.2010.38. An SMT solver for regular expressions and linear arithmetic over string length, in:
[17] T. Liang, N. Tsiskaridze, A. Reynolds, C. Tinelli, C.W. Barrett, A decision A. Silva, K.R.M. Leino (Eds.), Computer Aided Verification, Springer International
procedure for regular membership and length constraints over unbounded Publishing, Cham, 2021, pp. 289312.
strings, in: C. Lutz, S. Ranise (Eds.), Frontiers of Combining Systems - 10th [40] L. DAntoni, Automatark: Automata benchmark, 2018, URL https://github.com/
International Symposium, FroCoS 2015, Wroclaw, Poland, September 21-24, lorisdanto/automatark.
2015. Proceedings, in: Lecture Notes in Computer Science, vol. 9322, Springer, [41] D. Blotsky, F. Mora, M. Berzish, Y. Zheng, I. Kabir, V. Ganesh, StringFuzz: A
2015, pp. 135150. fuzzer for string solvers, in: H. Chockler, G. Weissenbacher (Eds.), Computer
[18] T. Chen, M. Hague, J. He, D. Hu, A.W. Lin, P. Rümmer, Z. Wu, A decision Aided Verification, Springer International Publishing, Cham, 2018, pp. 4551.
procedure for path feasibility of string manipulating programs with integer data [42] M. Kulczynski, F. Manea, D. Nowotka, D.B. Poulsen, ZaligVinder: A generic test
type, in: D.V. Hung, O. Sokolsky (Eds.), Automated Technology for Verification framework for string solvers, J. Software: Evol. Process. 35 (4) (2023) e2400,
and Analysis - 18th International Symposium, ATVA 2020, Hanoi, Vietnam, http://dx.doi.org/10.1002/smr.2400, arXiv:https://onlinelibrary.wiley.com/doi/
October 19-23, 2020, Proceedings, in: Lecture Notes in Computer Science, 12302, pdf/10.1002/smr.2400.
Springer, 2020, pp. 325342, http://dx.doi.org/10.1007/978-3-030-59152-6_18.
12