Journal of Systems Architecture 160 (2025) 103340 Contents lists available at ScienceDirect Journal of Systems Architecture journal homepage: www.elsevier.com/locate/sysarc An efficient string solver for string constraints with regex-counting and string-length Denghang Hu โˆ—, Zhilin Wu Key Laboratory of System Software and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China ARTICLE INFO ABSTRACT Keywords: Regular expressions (regex for short) and string-length function are widely used in string-manipulating String constraints programs. Counting is a frequently used feature in regexes that counts the number of matchings of sub- String solver patterns. The state-of-the-art string solvers are incapable of solving string constraints with regex-counting and Regular expression string-length efficiently, especially when the counting and length bounds are large. In this work, we propose String length an automata-theoretic approach for solving such class of string constraints. The main idea is to symbolically Counting operator model the counting operators by registers in automata instead of unfolding them explicitly, thus alleviating the state explosion problem. Moreover, the string-length function is modeled by a register as well. As a result, the satisfiability of string constraints with regex-counting and string-length is reduced to the satisfiability of linear integer arithmetic, which the off-the-shelf SMT solvers can then solve. To improve the performance further, we also propose various optimization techniques. We implement the algorithms and validate our approach on 49,843 benchmark instances. The experimental results show that our approach can solve more instances than the state-of-the-art solvers, at a comparable or faster speed, especially when the counting and length bounds are large or when the counting operators are nested with some other counting operators or complement operators. 1. Introduction proposed to solve such string constraints efficiently (see e.g. [8,17]). Counting (aka repetition) is a convenient feature in regexes that The string data type plays a crucial role in modern programming counts the number of matchings of sub-patterns. For instance, ๐‘Ž{2,4} languages such as JavaScript, Python, Java, and PHP. String manipula- specifies that ๐‘Ž occurs at least twice and at most four times, and ๐‘Ž{2,โˆž} tions are error-prone and could even give rise to severe security vulner- specifies that ๐‘Ž occurs at least twice. Note that although the Kleene abilities (e.g., cross-site scripting, aka XSS). One powerful method for star and the Kleene plus operator are special cases of counting, i.e., ๐‘’โˆ— identifying such bugs is symbolic execution. It analyzes the feasibility is equivalent to ๐‘’{0,โˆž} and ๐‘’+ is equivalent to ๐‘’{1,โˆž} , in the rest of this of execution paths in a program by encoding the paths as logical formulas and utilizing the constraint solvers (e.g. Z3 [1]) to solve the paper, for clarity, we use counting to denote expressions of the form satisfiability of the formulas. Symbolic execution of string manipulating ๐‘’{๐‘š,๐‘›} and ๐‘’{๐‘š,โˆž} , but not ๐‘’โˆ— or ๐‘’+ . Counting is a frequently used feature programs has motivated the highly active research area of string con- of regexes. According to the statistics from [13], counting occurs in straint solving, resulting in the development of numerous string solvers more than 20% of 1204 Python projects. Therefore, an efficient analysis in the last decade, e.g., Z3seq [1], CVC4/5 [2,3], Z3str/2/3/4 [4โ€“ of string manipulating programs requires efficient solvers for string 7], Z3str3RE [8], Z3-Trau [9,10], OSTRICH [11], Slent [12], among constraints containing regexes with counting and string-length function many others. Regular expressions (regex for short) and the string-length at least. function are widely used in string-manipulating programs. According The aforementioned state-of-the-art string constraint solvers still to the statistics from [13โ€“15], regexes are used in about 30โ€“40% of suffer from such string constraints, especially when the counting and Java, JavaScript, and Python software projects. Moreover, string-length length bounds are large. For instance, none of the string solvers CVC5, occupies 78% of the occurrences of string operations in 18 Javascript applications, according to the statistics from [16]. As a result, most Z3seq, Z3-Trau, Z3str3, Z3str3RE, and OSTRICH is capable of solving of the aforementioned string constraint solvers support both regexes the following constraint within 120 s, and string-length function. Moreover, specialized algorithms have been ๐‘ฅ โˆˆ (๐›ด โงต ๐‘Ž){1,60} (๐›ด โงต ๐‘){1,60} (๐›ด โงต ๐‘){0,60} โˆง ๐‘ฅ โˆˆ ๐›ด โˆ— ๐‘ + โˆง |๐‘ฅ| > 120. (1) โˆ— Corresponding author. E-mail address: hudh@ios.ac.cn (D. Hu). https://doi.org/10.1016/j.sysarc.2025.103340 Received 2 June 2024; Received in revised form 7 January 2025; Accepted 7 January 2025 Available online 17 January 2025 1383-7621/ยฉ 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies. D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 Intuitively, the constraint in (1) specifies that ๐‘ฅ is a concatenation of and back references. Real-world regexes have been addressed in sym- three strings ๐‘ฅ1 , ๐‘ฅ2 , ๐‘ฅ3 where ๐‘Ž (resp. ๐‘, ๐‘) does not occur in ๐‘ฅ1 (resp. bolic execution of Javascript programs [21] and string constraint ๐‘ฅ2 , ๐‘ฅ3 ), moreover, ๐‘ฅ ends with a nonempty sequence of ๐‘โ€™s, and the solving [22]. Nevertheless, the counting operators are still unfolded length of ๐‘ฅ is greater than 120. It is easy to observe that this constraint explicitly in [22]. The Trau tool focuses on string constraints involving is unsatisfiable since on the one hand, |๐‘ฅ| > 120 and the counting upper flat regular languages and string-length function and solves them by bound 60 in both (๐›ด โงต ๐‘Ž){1,60} and (๐›ด โงต ๐‘){1,60} imply that ๐‘ฅ must end computing LIA formulas that define the Parikh images of flat regular with some character from ๐›ด โงต ๐‘, that is, a character different from ๐‘, languages [10]. The Slent tool solves the string constraints involving and on the other hand, ๐‘ฅ โˆˆ ๐›ด โˆ— ๐‘ + requires that ๐‘ฅ has to end with ๐‘. string-length function by encoding them into so-called length-encoded A typical way for string constraint solvers to deal with regular automata, then utilizing symbolic model checkers to solve their non- expressions with counting is to unfold them into those without counting emptiness problem [12]. However, neither Trau nor Slent supports using the concatenation operator. For instance, ๐‘Ž{1,4} is unfolded into counting operators explicitly, in other words, counting operators in ๐‘Ž(๐œ€ + ๐‘Ž + ๐‘Ž๐‘Ž + ๐‘Ž๐‘Ž๐‘Ž) and ๐‘Ž{2,โˆž} is unfolded into ๐‘Ž๐‘Ž๐‘Žโˆ— . Since the unfolding regexes should be unfolded before solved by them. Cost registers in incurs an exponential blow-up on the sizes of constraints (assuming that CEFAs are different from registers in (symbolic) register automata [23, the counting in string constraints are encoded in binary), the unfolding 24]: In register automata, registers are used to store input values and penalizes the performance of the solvers, especially when the length can only be compared for equality/inequality, while in CEFAs, cost bounds are also available. registers are used to store integer-valued costs and can be updated by Contribution. In this work, we focus on the class of string con- adding/subtracting integer constants and constrained by the accepting straints with regular membership and string-length function, where the conditions which are LIA formulas. Therefore, cost registers in CEFAs counting operators may occur (called RECL for brevity). We make the are more like counters in counter automata/machines [25], that is, following contributions. CEFAs can be obtained from counter machines by removing transition guards and adding accepting conditions. CEFAs can also be considered โ€ข We propose an automata-theoretical approach for solving RECL as a variation of Parikh automata [26] in which the numbers in the constraints. Our main idea is to represent the counting operators alphabet can be integers, instead of natural numbers. Counting-set by cost registers in cost-enriched finite automata (CEFA, see Sec- automata were proposed in [27,28] to quickly match a subclass of tion 5 for the definition), instead of unfolding them explicitly. The regexes with counting. Moreover, a variant of nondeterministic counter string-length function is modeled by cost registers as well. The automata, called bit vector automata, was proposed recently in [29] satisfiability of RECL constraints is reduced to the non-emptiness to enable fast matching of a more expressive class of regexes with problem of CEFA w.r.t. a linear integer arithmetic (LIA) formula. counting. Nevertheless, the non-emptiness problem of these automata According to the results from [18], an LIA formula can be com- was not considered, and it is unclear whether these automata models puted to represent the potential values of registers in CEFA. Thus, can be used for solving string constraints with regex-counting and the non-emptiness of CEFA w.r.t. LIA formulas can be reduced to string-length. the satisfiability of LIA formulas, which is then tackled by off-the- Structure. The rest of this paper is structured as follows: Section 2 in- shelf SMT solvers. Moreover, we propose techniques to reduce the troduces the preliminaries. Section 3 gives an overview of the approach sizes (i.e., the number of states and transitions) of CEFA in order in this paper. Section 4 presents the syntax and semantics of RECL. to achieve better performance. Section 5 defines CEFA. Section 6 introduces the algorithm to solve โ€ข For complex RECL constraints where the counting operators may RECL constraints. Section 7 presents some heuristics for improving the be nested with the other counting operators or complement oper- performance of solving RECL constraints where the counting operators ators, we propose optimization techniques to improve the perfor- are nested with some other counting operators or complement opera- mance. Furthermore, for the satisfiable RECL constraints where tors. Section 9 shows the experiment results. Section 10 concludes this the solutions of the LIA formulas can be obtained, but the models paper. of the original RECL constraint cannot be generated, we pro- A preliminary version of this paper has appeared in the proceedings pose optimization techniques to accelerate the model generation of SETTA 2023 [30]. This paper extends the SETTA paper in the process. following three aspects. โ€ข We implement the algorithm on top of OSTRICH, resulting in a โ€ข To improve the performance of complex RECL constraints where string solver1 called OSTRICHRECL . Finally, we utilize a collec- the counting operators are nested with some other counting op- tion of benchmark suites2 comprising 49,843 problem instances erators or complement operators, various optimizations are pro- in total to evaluate the performance of OSTRICHRECL . The ex- posed. periment results show that OSTRICHRECL solves the RECL con- โ€ข To deal with the issue that the satisfiability of LIA formulas can be straints more efficiently than the state-of-the-art string solvers solved, but models of the original RECL constraints cannot be gen- (see Tables 1โ€“5 for details). For instance, OSTRICHRECL solves erated, we propose techniques to optimize the model generation 49,638 out of 49,843 instances, while the best state-of-the-art process. solver solve only 47,721 instances. We also do experiments to jus- โ€ข The benchmarks are extended to cover the situations where the tify some of our technical choices and evaluate the performance counting operators are nested with some other counting opera- of the optimizations tors or complement operators, and the experiments are extended Related work. We discuss more related work on regexes with counting, accordingly. string-length function, and automata with registers/counters. Deter- minism of regexes with counting was investigated in [19,20]. Real- 2. Preliminaries world regexes in programming languages include features beyond classical regexes, e.g., the greedy/lazy Kleene star, capturing groups, We write N and Z for the sets of natural and integer numbers. For ๐‘› โˆˆ N with ๐‘› โ‰ฅ 1, [๐‘›] denotes {1, โ€ฆ , ๐‘›}; for ๐‘š, ๐‘› โˆˆ N with ๐‘š โ‰ค ๐‘›, [๐‘š, ๐‘›] denotes {๐‘– โˆˆ N โˆฃ ๐‘š โ‰ค ๐‘– โ‰ค ๐‘›}. Throughout the paper, ๐›ด is a finite 1 The solver is open source and available at https://github.com/ alphabet, ranged over ๐‘Ž, ๐‘, โ€ฆ. SimpleXiaohu/ostrich/tree/jsa2024 For a function ๐‘“ from ๐‘‹ to ๐‘Œ and ๐‘‹ โ€ฒ โІ ๐‘‹, we use ๐—‰๐—‹๐—ƒ๐‘‹ โ€ฒ (๐‘“ ) to 2 The benchmarks are available at https://github.com/SimpleXiaohu/ denote the restriction (aka projection) of ๐‘“ to ๐‘‹ โ€ฒ , that is, ๐—‰๐—‹๐—ƒ๐‘‹ โ€ฒ (๐‘“ ) is ostrich/tree/jsa2024/zaligvinder/models/jsa2024. the function from ๐‘‹ โ€ฒ to ๐‘Œ , and ๐—‰๐—‹๐—ƒ๐‘‹ โ€ฒ (๐‘“ )(๐‘ฅโ€ฒ ) = ๐‘“ (๐‘ฅโ€ฒ ) for each ๐‘ฅโ€ฒ โˆˆ ๐‘‹ โ€ฒ . 2 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 Table 1 Overall performance evaluation. CVC5 Z3str3RE Z3str3 Z3seq Ostrich OSTRICHRECL sat 28,282 28,449 23,322 28,212 25,975 29,084 unsat 16,809 19,316 12,742 18718 20,529 20,554 unknown 8 134 7021 203 162 14 timeout 4744 1944 6758 2710 3177 191 error 0 44 44 56 0 0 solved 45,091 47,721 36,020 46,874 46,504 49,638 average time 6.43 1.99 8.48 4.04 6.75 2.88 (s) Table 2 Performance evaluation on the NestC benchmark suite. CVC5 Z3str3RE Z3str3 Z3seq Ostrich OSTRICHRECL sat 508 166 196 451 0 635 unsat 0 4 0 67 238 238 unknown 0 35 31 105 2 6 timeout 492 795 773 377 760 121 error 0 0 0 0 0 0 solved 508 170 196 518 238 873 average time 31.57 20.09 50.06 26.13 46.11 19.49 (s) Table 3 ๐ฟ2 , denoted by ๐ฟ1 โ‹… ๐ฟ2 , is defined as {๐‘ข โ‹… ๐‘ฃ โˆฃ ๐‘ข โˆˆ ๐ฟ1 , ๐‘ฃ โˆˆ ๐ฟ2 }. The union Evaluation of the technical choices and optimizations in Section 6. (resp. intersection) of ๐ฟ1 and ๐ฟ2 , denoted by ๐ฟ1 โˆช ๐ฟ2 (resp. ๐ฟ1 โˆฉ ๐ฟ2 ), is OSTRICHRECL NUXMV OSTRICHRECL โˆ’ASR OSTRICHRECL defined as {๐‘ข โˆฃ ๐‘ข โˆˆ ๐ฟ1 or ๐‘ข โˆˆ ๐ฟ2 } (resp. {๐‘ข โˆฃ ๐‘ข โˆˆ ๐ฟ1 and ๐‘ข โˆˆ ๐ฟ2 }). The sat 26,607 27,295 29,084 complement of ๐ฟ1 , denoted by ๐ฟ1 , is defined as {๐‘ข โˆฃ ๐‘ข โˆˆ ๐›ด โˆ— , ๐‘ข โˆ‰ ๐ฟ1 }. unsat 20,499 20,513 20,554 The difference of ๐ฟ1 and ๐ฟ2 , denoted by ๐ฟ1 โงต ๐ฟ2 , is defined as ๐ฟ1 โˆฉ ๐ฟ2 . unknown 45 68 14 For a language ๐ฟ and ๐‘› โˆˆ N, we define ๐ฟ๐‘› inductively as follows: timeout 2,693 1,968 191 โ‹ƒ error 0 0 0 ๐ฟ0 = {๐œ€}, ๐ฟ๐‘›+1 = ๐ฟ โ‹… ๐ฟ๐‘› for every ๐‘› โˆˆ N. Finally, define ๐ฟโˆ— = ๐‘›โˆˆN ๐ฟ๐‘› . solved 47,106 47,808 49,638 Finite automata. A (nondeterministic) finite automaton (NFA) is a tuple average time (s) 6.86 4.76 2.88 ๎ˆญ = (๐‘„, ๐›ด , ๐›ฟ , ๐ผ , ๐น ), where ๐‘„ is a finite set of states, ๐›ด is a finite alphabet, ๐›ฟ โІ ๐‘„ ร— ๐›ด ร— ๐‘„ is the transition relation, ๐ผ โІ ๐‘„ is the set Table 4 of initial states, and ๐น โІ ๐‘„ is the set of final states. For readability, we Evaluation of the optimizations in Section 7. ๐‘Ž ๐‘Ž write a transition (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ ) โˆˆ ๐›ฟ as ๐‘ž โ†โ†โ†’ โ† ๐‘ž โ€ฒ (or simply ๐‘ž โ†โ†โ†’ โ† ๐‘ž โ€ฒ ). The size OSTRICHRECL โˆ’NEST OSTRICHRECL โˆ’COMP OSTRICHRECL ๐›ฟ sat 553 296 635 of an NFA ๎ˆญ, denoted by |๎ˆญ|, is defined as the number of transitions unsat 238 238 238 of ๎ˆญ. A run of ๎ˆญ on a string ๐‘ค = ๐‘Ž1 โ‹ฏ ๐‘Ž๐‘› is a sequence of transitions ๐‘Ž1 ๐‘Ž๐‘› unknown 27 8 6 ๐‘ž0 โ†โ†โ†โ†โ†’ โ† ๐‘ž๐‘› such that ๐‘ž0 โˆˆ ๐ผ. The run is accepting if ๐‘ž๐‘› โˆˆ ๐น . A โ† ๐‘ž1 โ‹ฏ ๐‘ž๐‘›โˆ’1 โ†โ†โ†โ†โ†’ timeout 182 458 121 string ๐‘ค is accepted by an NFA ๎ˆญ if there is an accepting run of ๎ˆญ on error 0 0 0 ๐‘ค. In particular, the empty string ๐œ€ is accepted by ๎ˆญ if ๐ผ โˆฉ ๐น โ‰  โˆ…. The solved 791 534 873 language of ๎ˆญ, denoted by ๎ˆธ(๎ˆญ), is the set of strings accepted by ๎ˆญ. An average time (s) 19.95 37.49 19.49 NFA ๎ˆญ is said to be deterministic if ๐ผ is a singleton and, for every ๐‘ž โˆˆ ๐‘„ and ๐‘Ž โˆˆ ๐›ด, there is at most one state ๐‘ž โ€ฒ โˆˆ ๐‘„ such that (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ ) โˆˆ ๐›ฟ. We Table 5 use DFA to denote deterministic finite automata. Evaluation of the optimizations for the model generation It is well-known that finite automata capture regular languages. in Section 8. Moreover, the class of regular languages is closed under union, inter- OSTRICHRECL โˆ’MODEL OSTRICHRECL section, concatenation, Kleene star, complement, and language differ- sat 28,735 29,084 ence [31]. unsat 20,554 20,554 Let ๐‘ค โˆˆ ๐›ด โˆ— . The Parikh image of ๐‘ค, denoted by ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๐‘ค), is defined unknown 13 14 timeout 541 191 as the function ๐œ‚ โˆถ ๐›ด โ†’ N such that ๐œ‚(๐‘Ž) = |๐‘ค|๐‘Ž for each ๐‘Ž โˆˆ ๐›ด. error 0 0 The Parikh image of an NFA ๎ˆญ, denoted by ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎ˆญ), is defined as {๐—‰๐–บ๐—‹๐—‚๐—„๐—(๐‘ค) โˆฃ ๐‘ค โˆˆ ๎ˆธ(๎ˆญ)}. solved 49,289 49,638 average time (s) 3.57 2.88 Linear integer arithmetic and Parikh images. We use standard existential linear integer arithmetic (LIA) formulas, which typically range over ๐œ“ , ๐œ‘, ๐›ท, ๐›ผ. For a set X of variables, we use ๐œ“โˆ•๐œ‘โˆ•๐›ทโˆ•๐›ผ(X) to denote the set of existential LIA formulas whose free variables are from X. For Strings and languages. A string over ๐›ด is a (possibly empty) sequence example, we use ๐œ‘(โƒ—x) with โƒ—x = (x1 , โ€ฆ , x๐‘˜ ) to denote an LIA formula ๐œ‘ of elements from ๐›ด, denoted by ๐‘ข, ๐‘ฃ, ๐‘ค, โ€ฆ. An empty string is denoted whose free variables are from {x1 , โ€ฆ , x๐‘˜ }. For an LIA formula ๐œ‘(โƒ—x), we by ๐œ€. We use ๐›ด๐œ€ to denote ๐›ด โˆช {๐œ€}. We write ๐›ด โˆ— (resp., ๐›ด + ) for the use ๐œ‘[โƒ—๐‘กโˆ•โƒ—x] to denote the formula obtained by replacing (simultaneously) set of all (resp. nonempty) strings over ๐›ด. For a string ๐‘ข, we use |๐‘ข| to x๐‘– with ๐‘ก๐‘– for every ๐‘– โˆˆ [๐‘˜] where โƒ—x = (x1 , โ€ฆ , x๐‘˜ ) and โƒ—๐‘ก = (๐‘ก1 , โ€ฆ , ๐‘ก๐‘˜ ) are denote the number of letters in ๐‘ข. In particular, |๐œ€| = 0. Moreover, for tuples of integer terms. ๐‘Ž โˆˆ ๐›ด, let |๐‘ข|๐‘Ž denote the number of occurrences of ๐‘Ž in ๐‘ข. Assume that At last, we recall the result about Parikh images of NFA. For each ๐‘ข = ๐‘Ž1 โ‹ฏ ๐‘Ž๐‘› is nonempty and 1 โ‰ค ๐‘– < ๐‘— โ‰ค ๐‘›. We let ๐‘ข[๐‘–] denote ๐‘Ž๐‘– and ๐‘Ž โˆˆ ๐›ด, let z๐‘Ž be an integer variable. Let Z๐›ด denote the set of integer ๐‘ข[๐‘–, ๐‘—] for the substring ๐‘Ž๐‘– โ‹ฏ ๐‘Ž๐‘— . Let ๐‘ข, ๐‘ฃ be two strings. We use ๐‘ข โ‹… ๐‘ฃ to variables z๐‘Ž for ๐‘Ž โˆˆ ๐›ด. Let ๎ˆญ be an NFA over the alphabet ๐›ด. Then denote the concatenation of ๐‘ข and ๐‘ฃ. A language ๐ฟ over ๐›ด is a subset of we say that an LIA formula ๐œ“(Z๐›ด ) defines the Parikh image of ๎ˆญ, if strings. Let ๐ฟ1 , ๐ฟ2 be two languages. Then the concatenation of ๐ฟ1 and {๐œ‚ โˆถ ๐›ด โ†’ N โˆฃ ๐œ“[(๐œ‚(๐‘Ž)โˆ•z๐‘Ž )๐‘Žโˆˆ๐›ด ] is true} = ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎ˆญ). 3 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 Fig. 1. CEFA for (๐›ด โงต ๐‘Ž){1,60} (๐›ด โงต ๐‘){1,60} (๐›ด โงต ๐‘){0,60} , ๐›ด โˆ— ๐‘ + , and |๐‘ฅ|. Fig. 2. ๎ˆญ1 โˆฉ ๎ˆญ2 โˆฉ ๎ˆญ3 : Intersection of ๎ˆญ1 , ๎ˆญ2 , and ๎ˆญ3 . Theorem 1 ([32]). Given an NFA ๎ˆญ, an existential LIA formula ๐œ“๎ˆญ (Z๐›ด ) the LIA formula ๐œ‘ โ‰ก ๐‘Ÿ4 > 120, that is, whether there exist ๐‘ค โˆˆ ๐›ด โˆ— and can be computed in linear time that defines the Parikh image of ๎ˆญ. (๐‘ฃ1 , ๐‘ฃ2 , ๐‘ฃ3 , ๐‘ฃ4 ) โˆˆ Z4 such that ๐‘ค is accepted by ๎ˆญ, so that the resulting registers values (๐‘ฃ1 , ๐‘ฃ2 , ๐‘ฃ3 , ๐‘ฃ4 ) satisfy both 1 โ‰ค ๐‘ฃ1 โ‰ค 60 โˆง 1 โ‰ค ๐‘ฃ2 โ‰ค 60 โˆง 0 โ‰ค 3. Overview ๐‘ฃ3 โ‰ค 60 and ๐œ‘. It is not hard to observe that the non-emptiness of ๎ˆญ with respect to ๐œ‘ is independent of the characters of ๎ˆญ. Therefore, In this section, we utilize the string constraint in Eq. (1) to illustrate the characters in ๎ˆญ can be ignored, resulting into an NFA ๎ˆฎ over the the approach in our work. alphabet C, where C is the set of vectors from Z4 occurring in the At first, we construct a CEFA for the regular expression (๐›ด โงต transitions of ๎ˆญ (see Fig. 3(a)). Then the original problem is reduced ๐‘Ž){1,60} (๐›ด โงต ๐‘){1,60} (๐›ด โงต ๐‘){0,60} . Three registers are introduced, say to the problem of deciding whether there exists a string ๐‘คโ€ฒ โˆˆ Cโˆ— that ๐‘Ÿ1 , ๐‘Ÿ2 , ๐‘Ÿ3 , to represent the three counting operators; the nondeterministic is accepted by ๎ˆฎ and its Parikh image (i.e., numbers of occurrences of finite automaton (NFA) for (๐›ด โงต ๐‘Ž)โˆ— (๐›ด โงต ๐‘)โˆ— (๐›ด โงต ๐‘)โˆ— is constructed; characters), say ๐œ‚๐‘คโ€ฒ โˆถ C โ†’ N, satisfies 1 โ‰ค ๐‘ฃโ€ฒ1 โ‰ค 60 โˆง 1 โ‰ค ๐‘ฃโ€ฒ2 โ‰ค 60 โˆง 0 โ‰ค โˆ‘ the updates of registers are added to the transitions of the NFA; the ๐‘ฃโ€ฒ3 โ‰ค 60 โˆง ๐‘ฃโ€ฒ4 > 120, where (๐‘ฃโ€ฒ1 , ๐‘ฃโ€ฒ2 , ๐‘ฃโ€ฒ3 , ๐‘ฃโ€ฒ4 ) = ๐‘ฃโˆˆC โƒ— โƒ— ๐‘ฃโƒ— for each ๐‘ฃโƒ— โˆˆ C. ๐œ‚๐‘คโ€ฒ (๐‘ฃ) โ€ฒ โ€ฒ โ€ฒ โ€ฒ Intuitively, (๐‘ฃ1 , ๐‘ฃ2 , ๐‘ฃ3 , ๐‘ฃ4 ) is a weighted sum of vectors ๐‘ฃโƒ— โˆˆ C, where the counting bounds are specified by the accepting condition 1 โ‰ค ๐‘Ÿ1 โ‰ค 60 โˆง 1 โ‰ค ๐‘Ÿ2 โ‰ค 60 โˆง 0 โ‰ค ๐‘Ÿ3 โ‰ค 60, resulting in a CEFA ๎ˆญ1 illustrated weight is the number of occurrences of ๐‘ฃโƒ— in ๐‘คโ€ฒ . (See Section 6.2 for in Fig. 1(a). ๐‘Ÿ1 + + means that we increment the value of ๐‘Ÿ1 by one more detailed arguments.) after running the transition. A string ๐‘ค is accepted by ๎ˆญ1 if, when Let C = {๐‘ฃโƒ—1 , โ€ฆ , ๐‘ฃโƒ—๐‘š }. From the results in [32,33], an existential LIA reading the characters in ๐‘ค, ๎ˆญ1 applies the transitions to update the formula ๐œ“๎ˆฎ (x1 , โ€ฆ , x๐‘š ) can be computed to define the Parikh image of state and the values of registers, reaching a final state ๐‘ž in the end, strings that are accepted by ๎ˆฎ, where x1 , โ€ฆ , x๐‘š are the integer variables and the resulting values of the three registers, say ๐‘ฃ1 , ๐‘ฃ2 , ๐‘ฃ3 , satisfy the to denote the number of occurrences of ๐‘ฃโƒ—1 , โ€ฆ , ๐‘ฃโƒ—๐‘š . Therefore, the satis- accepting condition. In addition, we construct other two CEFAs ๎ˆญ2 for fiability of the string constraint in (1) is reduced to the satisfiability of ๐›ด โˆ— ๐‘ + (see Fig. 1(b)) and ๎ˆญ3 for string length function (see Fig. 1(c)). the following existential LIA formula, โ‹€ โˆ‘ In ๎ˆญ3 , a register ๐‘Ÿ4 is used to denote the length of strings and the ๐œ“๎ˆฎ (x1 , โ€ฆ , x๐‘š ) โˆง 1โ‰ค๐‘—โ‰ค4 ๐‘Ÿ๐‘— = 1โ‰ค๐‘˜โ‰ค๐‘š x๐‘˜ ๐‘ฃ๐‘˜,๐‘— โˆง accepting condition is ๐š๐š›๐šž๐šŽ (See Section 6.1 for more details about the (2) 1 โ‰ค ๐‘Ÿ1 โ‰ค 60 โˆง 1 โ‰ค ๐‘Ÿ2 โ‰ค 60 โˆง 0 โ‰ค ๐‘Ÿ3 โ‰ค 60 โˆง ๐‘Ÿ4 > 120. construction of CEFA.) Note that we represent the counting operators symbolically by registers instead of unfolding them explicitly. which can be solved by the off-the-shelf SMT solvers. Next, we construct ๎ˆญ1 โˆฉ ๎ˆญ2 โˆฉ ๎ˆญ3 , that is, the intersection (aka Nevertheless, when the original regexes are complicated (e.g. con- product) of ๎ˆญ1 , ๎ˆญ2 , and ๎ˆญ3 , as illustrated in Fig. 2(a), where the states tain occurrences of negation or intersection operators), the sizes of that cannot reach the final states are removed. For technical conve- the NFA ๎ˆฎ can still be big and the sizes of the LIA formulas defining nience, we also think of the updates of registers in transitions as vectors the Parikh image of ๎ˆฎ are also big. Since the satisfiability of LIA (๐‘ข1 , ๐‘ข2 , ๐‘ข3 , ๐‘ข4 ), where ๐‘ข๐‘– โˆˆ Z is the update on the register ๐‘Ÿ๐‘– for each ๐‘– โˆˆ formulas is an NP-complete problem [34], big sizes of LIA formulas [4]. For instance, the transitions corresponding to the self-loop around would be a bottleneck of the performance. To tackle this issue, we (๐‘0 , ๐‘ž0 , ๐‘ž0โ€ฒ ) are thought as ((๐‘0 , ๐‘ž0 , ๐‘ž0โ€ฒ ), ๐‘Žโ€ฒ , (๐‘0 , ๐‘ž0 , ๐‘ž0โ€ฒ ), (1, 0, 0, 1)) with ๐‘Žโ€ฒ โˆˆ propose techniques to reduce the sizes of the NFA ๎ˆฎ. ๐›ด โงต {๐‘Ž}, since ๐‘Ÿ1 and ๐‘Ÿ4 are incremented by one in these transitions. Specifically, to reduce the sizes of ๎ˆฎ, we determinize ๎ˆฎ, and apply After considering the updates of registers as vectors, the CEFA is like the minimization algorithm to the resulting deterministic finite au- Fig. 2(b). tomaton (DFA), resulting in a DFA ๎ˆฏ, as illustrated in Fig. 3(b). Note Finally, the satisfiability of the original string constraint is reduced that ๎ˆฏ contains only three states ๐‘ž0โ€ฒโ€ฒ , ๐‘ž1โ€ฒโ€ฒ , ๐‘ž2โ€ฒโ€ฒ and six transitions, while ๎ˆฎ to the non-emptiness of the CEFA ๎ˆญ โ‰ก ๎ˆญ1 โˆฉ ๎ˆญ2 โˆฉ ๎ˆญ3 with respect to contains four states and eight transitions. Furthermore, if ๎ˆฎ contains 4 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 Fig. 3. Reduced automaton ๎ˆฎ and ๎ˆฏ. Fig. 4. Syntax of RECL. โƒ— 0-labeled transitions, then we can take these transitions as ๐œ–-transitions ๐œƒ โŠง ๐œ‘, if one of the following holds: (1) ๐œ‘ โ‰ก ๐‘ฅ โˆˆ ๐‘’ and ๐œƒ(๐‘ฅ) โˆˆ ๎ˆธ(๐‘’), (2) and potentially reduce the sizes of automata further. ๐œ‘ โ‰ก ๐‘ก1 ๐‘œ ๐‘ก2 and ๐œƒ(๐‘ก1 ) ๐‘œ ๐œƒ(๐‘ก2 ), (3) ๐œ‘ โ‰ก ๐œ‘1 โˆง ๐œ‘2 and ๐œƒ โŠง ๐œ‘1 and ๐œƒ โŠง ๐œ‘2 . We implement all the aforementioned techniques on the top of OS- A RECL formula ๐œ‘ is satisfiable if there is a mapping ๐œƒ such as ๐œƒ โŠง ๐œ‘. TRICH, resulting in a solver OSTRICHRECL . It turns out that The satisfiability problem for RECL (which is abbreviated as ๐–ฒ๐– ๐–ณ๐–ฑ๐–ค๐–ข๐–ซ ) OSTRICHRECL is able to solve the string constraint in (1) within one is deciding whether a given RECL formula ๐œ‘ is satisfiable. second, while the state-of-the-art string solvers are incapable of solving it within 120 s. 5. Cost-enriched finite automata 4. String constraints with regex-counting and string-length In this section, we define cost-enriched finite state automata (CEFA), which was introduced in [18] and will be used to solve the satisfiability In the sequel, we define the string constraints with regex-counting problem of RECL later on. Intuitively, CEFA adds write-only cost regis- and string-length functions, i.e., REgex-Counting Logic (abbreviated as ters to finite state automata. โ€˜โ€˜write-onlyโ€™โ€™ means that the cost registers RECL). The syntax of RECL is defined by the rules in Fig. 4, where ๐‘ฅ is a can only be written/updated but cannot be read, i.e., they cannot be string variable, x is an integer variable, ๐‘Ž is a character from an alphabet used to guard the transitions. ๐›ด, and ๐‘š, ๐‘› are integer constants. A RECL formula ๐œ‘ is a conjunction of atomic formulas of the form ๐‘ฅ โˆˆ ๐‘’ or ๐‘ก1 ๐‘œ ๐‘ก2 , where ๐‘’ is a regular Definition 1 (Cost-Enriched Finite Automaton). A cost-enriched finite expression, ๐‘ก1 and ๐‘ก2 are integer terms, and ๐‘œ โˆˆ {=, โ‰ , โ‰ค, โ‰ฅ, <, >}. Atomic automaton ๎ˆญ is a tuple (๐‘…, ๐‘„, ๐›ด , ๐›ฟ , ๐ผ , ๐น , ๐›ผ) where formulas of the form ๐‘ฅ โˆˆ ๐‘’ are called regular membership constraints, โ€ข ๐‘… = {๐‘Ÿ1 , โ€ฆ , ๐‘Ÿ๐‘˜ } is a finite set of registers, and atomic formulas of the form ๐‘ก1 ๐‘œ ๐‘ก2 are called length constraints. A regular expression ๐‘’ is built from โˆ…, the empty string ๐œ–, and the โ€ข ๐‘„, ๐›ด , ๐ผ , ๐น are as in the definition of NFA, character ๐‘Ž by using concatenation โ‹…, union +, Kleene star โˆ— , intersection โ€ข ๐›ฟ โІ ๐‘„ ร— ๐›ด ร— ๐‘„ ร— Z๐‘… is a transition relation, where Z๐‘… denotes the โˆฉ, complement ฬ„, difference โงต, counting {๐‘š,๐‘›} or {๐‘š,โˆž} . An integer term updates on the values of registers. is built from constants ๐‘›, variables x, and string lengths |๐‘ฅ| by operators โ€ข ๐›ผ โˆˆ ๐›ท(๐‘…) is an LIA formula specifying an accepting condition. + and โˆ’. For convenience, we use ๐‘…๎ˆญ to denote the set of registers of ๎ˆญ. Moreover, for ๐‘† โІ ๐›ด with ๐‘† = {๐‘Ž1 , โ€ฆ , ๐‘Ž๐‘˜ }, we use ๐‘† as an We assume a linear order on ๐‘… and write ๐‘… as a vector (๐‘Ÿ1 , โ€ฆ , ๐‘Ÿ๐‘˜ ). abbreviation of ๐‘Ž1 + โ‹ฏ + ๐‘Ž๐‘˜ . Accordingly, we write an element of Z๐‘… as a vector (๐‘ฃ1 , โ€ฆ , ๐‘ฃ๐‘˜ ), where For each regular expression ๐‘’, the language defined by ๐‘’, denoted ๐‘ฃ๐‘– is the update of ๐‘Ÿ๐‘– for each ๐‘– โˆˆ [๐‘˜]. We also write a transition by ๎ˆธ(๐‘’), is defined recursively. For instance, ๎ˆธ(โˆ…) = โˆ…, ๎ˆธ(๐œ€) = {๐œ€}, ๐‘Ž (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) โ† ๐‘žโ€ฒ . โƒ— โˆˆ ๐›ฟ as ๐‘ž โ†โ†โ†’ ๎ˆธ(๐‘Ž) = {๐‘Ž}, and ๎ˆธ(๐‘’1 โ‹… ๐‘’2 ) = ๎ˆธ(๐‘’1 ) โ‹… ๎ˆธ(๐‘’2 ), ๎ˆธ(๐‘’1 + ๐‘’2 ) = ๎ˆธ(๐‘’1 ) โˆช ๎ˆธ(๐‘’2 ), ๐‘ฃโƒ— and so on. It is well-known that regular expressions define the same The semantics of CEFA is defined as follows. Let ๎ˆญ = (๐‘…, ๐‘„, ๐›ด , ๐›ฟ , ๐ผ , class of languages as finite state automata, that is, the class of regular ๐น , ๐›ผ) be a CEFA. A run of ๎ˆญ on a string ๐‘ค = ๐‘Ž1 โ‹ฏ ๐‘Ž๐‘› is a sequence ๐‘Ž1 ๐‘Ž๐‘› ๐‘Ž๐‘– languages [31]. ๐‘ž0 โ†โ†โ†โ†โ†’ โ† ๐‘ž๐‘› such that ๐‘ž0 โˆˆ ๐ผ and ๐‘ž๐‘–โˆ’1 โ†โ†โ†โ†โ†’ โ† ๐‘ž1 โ‹ฏ ๐‘ž๐‘›โˆ’1 โ†โ†โ†โ†โ†’ โ† ๐‘ž๐‘– for each ๐‘– โˆˆ [๐‘›]. ๐‘ฃโƒ—1 ๐‘ฃโƒ—๐‘› ๐‘ฃโƒ—๐‘– Let ๐œ‘ be a RECL formula and ๐–ฒ๐–ต๐–บ๐—‹๐—Œ(๐œ‘) (resp. ๐–จ๐–ต๐–บ๐—‹๐—Œ(๐œ‘)) denote the set ๐‘Ž 1 ๐‘Ž๐‘› of string (resp. integer) variables occurring in ๐œ‘. Then the semantics of A run ๐‘ž0 โ†โ†โ†โ†โ†’โ† ๐‘ž1 โ‹ฏ ๐‘ž๐‘›โˆ’1 โ†โ†โ†โ†โ†’โ† ๐‘ž๐‘› is accepting if ๐‘ž๐‘› โˆˆ ๐น and ๐›ผ(๐‘ฃโƒ—โ€ฒ โˆ•๐‘…) is true, ๐‘ฃโƒ—1 ๐‘ฃโƒ—๐‘› ๐œ‘ is defined with respect to a mapping ๐œƒ โˆถ ๐–ฒ๐–ต๐–บ๐—‹๐—Œ(๐œ‘) โ†’ ๐›ด โˆ— โŠŽ๐–จ๐–ต๐–บ๐—‹๐—Œ(๐œ‘) โ†’ Z โˆ‘ where ๐‘ฃโƒ—โ€ฒ = ๐‘—โˆˆ[๐‘›] ๐‘ฃโƒ—๐‘— . The string ๐‘ค is ๐‘Ž๐‘ ๐‘ ๐‘’๐‘๐‘ก๐‘’๐‘‘ by the CEFA if and only (where โŠŽ denotes the disjoint union). Note that the mapping ๐œƒ can โˆ‘ if there is an accepting run of ๐‘ค. The vector ๐‘ฃโƒ—โ€ฒ = ๐‘—โˆˆ[๐‘›] ๐‘ฃโƒ—๐‘— is called the naturally extend to the set of integer terms. For instance, ๐œƒ(|๐‘ฅ|) = |๐œƒ(๐‘ฅ)|, ๐‘Ž1 ๐‘Ž๐‘› ๐œƒ(๐‘ก1 + ๐‘ก2 ) = ๐œƒ(๐‘ก1 ) + ๐œƒ(๐‘ก2 ). A mapping ๐œƒ is said to satisfy ๐œ‘, denoted by cost of an accepting run ๐‘ž0 โ†โ†โ†โ†โ†’ โ† ๐‘ž๐‘› . Note that the values of โ† ๐‘ž1 โ‹ฏ ๐‘ž๐‘›โˆ’1 โ†โ†โ†โ†โ†’ ๐‘ฃโƒ—1 ๐‘ฃโƒ—๐‘› 5 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 โˆ‘ all registers are initiated to zero and updated to ๐‘—โˆˆ[๐‘›] ๐‘ฃโƒ—๐‘— after all the From the construction, transitions in the run are executed. We use ๐‘ฃโƒ—โ€ฒ โˆˆ ๎ˆญ(๐‘ค) to denote the fact that there is an accepting run of ๎ˆญ on ๐‘ค whose cost is ๐‘ฃโƒ—โ€ฒ . We define ๎ˆธ(๎ˆญ1 โˆฉ ๎ˆญ2 ) = {(๐‘ค; ๐‘ฃ) โƒ— โˆˆ ๎ˆธ(๎ˆญ1 ) and (๐‘ค; ๐—‰๐—‹๐—ƒ๐‘…2 (๐‘ฃ)) โƒ— โˆฃ (๐‘ค; ๐—‰๐—‹๐—ƒ๐‘…1 (๐‘ฃ)) โƒ— โˆˆ ๎ˆธ(๎ˆญ2 )}. the semantics of a CEFA ๎ˆญ, denoted by ๎ˆธ(๎ˆญ), as {(๐‘ค; ๐‘ฃโƒ—โ€ฒ ) โˆฃ ๐‘ฃโƒ—โ€ฒ โˆˆ ๎ˆญ(๐‘ค)}. Intuitively, ๎ˆญ1 โˆฉ ๎ˆญ2 accepts the words that are accepted by both ๎ˆญ1 In particular, if ๐ผ โˆฉ ๐น โ‰  โˆ… and ๐›ผ[0โˆ•๐‘…] โƒ— is true, then (๐œ€; 0)โƒ— โˆˆ ๎ˆธ(๎ˆญ). and ๎ˆญ2 and outputs the costs of ๎ˆญ1 and ๎ˆญ2 . Moreover, we define the output of a CEFA ๎ˆญ, denoted by ๎ˆป(๎ˆญ), as The concatenation of ๎ˆญ1 and ๎ˆญ2 , denoted by ๎ˆญ1 โ‹… ๎ˆญ2 , is defined {๐‘ฃโƒ—โ€ฒ โˆฃ โˆƒ๐‘ค. ๐‘ฃโƒ—โ€ฒ โˆˆ ๎ˆญ(๐‘ค)}. similarly as that of NFA, that is, a tuple (๐‘…โ€ฒ , ๐‘„โ€ฒ , ๐›ด , ๐›ฟ โ€ฒ , ๐ผ โ€ฒ , ๐น โ€ฒ , ๐›ผ โ€ฒ ), where We want to remark that the definition of CEFA above is slightly ๐‘…โ€ฒ = ๐‘…1 โˆช ๐‘…2 , ๐‘„โ€ฒ = ๐‘„1 โˆช ๐‘„2 , ๐ผ โ€ฒ = ๐ผ1 , ๐›ผ โ€ฒ = ๐›ผ1 โˆง ๐›ผ2 , ๐›ฟ โ€ฒ = different from that of [18], where CEFA did not include accepting โƒ— โˆฃ (๐‘ž1 , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃโƒ—1 ) โˆˆ ๐›ฟ1 } โˆช {(๐‘ž2 , ๐‘Ž, ๐‘ž โ€ฒ , (0, {(๐‘ž1 , ๐‘Ž, ๐‘ž1โ€ฒ , (๐‘ฃโƒ—1 , 0)) โƒ— ๐‘ฃโƒ—2 )) โˆฃ (๐‘ž2 , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃโƒ—2 ) โˆˆ 1 2 2 conditions. Moreover, the accepting conditions ๐›ผ in CEFA are defined ๐›ฟ2 } โˆช {(๐‘ž1 , ๐‘Ž, ๐‘ž2 , (0,โƒ— ๐‘ฃโƒ—2 )) โˆฃ ๐‘ž1 โˆˆ ๐น1 , โˆƒ๐‘ž โ€ฒ โˆˆ ๐ผ2 . (๐‘ž โ€ฒ , ๐‘Ž, ๐‘ž2 , ๐‘ฃโƒ—2 ) โˆˆ ๐›ฟ2 }, moreover, if in a global fashion because the accepting condition does not distinguish ๐ผ2 โˆฉ๐น2 โ‰  โˆ…, then ๐น โ€ฒ = ๐น1 โˆช๐น2 , otherwise, ๐น โ€ฒ = ๐น2 . From the construction, final states. This technical choice is made so that the determinization and minimization of NFA can be utilized to reduce the size of CEFA in ๎ˆธ(๎ˆญ1 โ‹…๎ˆญ2 ) = {(๐‘ค1 ๐‘ค2 ; ๐‘ฃ) โƒ— โˆˆ ๎ˆธ(๎ˆญ1 ) and (๐‘ค2 ; ๐—‰๐—‹๐—ƒ๐‘…2 (๐‘ฃ)) โƒ— โˆฃ (๐‘ค1 ; ๐—‰๐—‹๐—ƒ๐‘…1 (๐‘ฃ)) โƒ— โˆˆ ๎ˆธ(๎ˆญ2 )}. the next section. In the sequel, we define three CEFA operations: union, intersection, Furthermore, the union, intersection, and concatenation operations and concatenation. The following section will use these three oper- can be extended naturally to multiple CEFA, that is, ๎ˆญ1 โˆช โ‹ฏ โˆช ๎ˆญ๐‘› , ations to solve RECL constraints. Note that the union, intersection, ๎ˆญ1 โˆฉ โ‹ฏ โˆฉ ๎ˆญ๐‘› , ๎ˆญ1 โ‹… โ‹ฏ โ‹… ๎ˆญ๐‘› . For instance, ๎ˆญ1 โˆช ๎ˆญ2 โˆช ๎ˆญ3 = (๎ˆญ1 โˆช ๎ˆญ2 ) โˆช ๎ˆญ3 , and concatenation operations are defined in a slightly more involved ๎ˆญ1 โˆฉ ๎ˆญ2 โˆฉ ๎ˆญ3 = (๎ˆญ1 โˆฉ ๎ˆญ2 ) โˆฉ ๎ˆญ3 , and ๎ˆญ1 โ‹… ๎ˆญ2 โ‹… ๎ˆญ3 = (๎ˆญ1 โ‹… ๎ˆญ2 ) โ‹… ๎ˆญ3 . manner than register automata [23] and counter automata [19], as a result of the (additional) accepting conditions. 6. Solving RECL constraints Let ๎ˆญ1 = (๐‘…1 , ๐‘„1 , ๐›ด , ๐›ฟ1 , ๐ผ1 , ๐น1 , ๐›ผ1 ) and ๎ˆญ2 = (๐‘…2 , ๐‘„2 , ๐›ด , ๐›ฟ2 , ๐ผ2 , ๐น2 , ๐›ผ2 ) be two CEFA that share the alphabet. Moreover, suppose that ๐‘…1 โˆฉ๐‘…2 = The goal of this section is to show how to solve RECL constraints by โˆ… and ๐‘„1 โˆฉ ๐‘„2 = โˆ…. utilizing CEFA. At first, we reduce the satisfiability of RECL constraints The union of ๎ˆญ1 and ๎ˆญ2 is denoted by ๎ˆญ1 โˆช ๎ˆญ2 . Two fresh auxiliary to a decision problem defined in the sequel. Then we propose a decision registers say ๐‘Ÿโ€ฒ1 , ๐‘Ÿโ€ฒ2 โˆ‰ ๐‘…1 โˆช ๐‘…2 , are introduced so that the accepting procedure for this problem. condition knows whether a run is from ๎ˆญ1 (or ๎ˆญ2 ). Note that the fresh auxiliary registers are introduced to distinguish whether ๎ˆญ1 or Definition 2 (๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– )). Let ๐‘ฅ1 , โ€ฆ , ๐‘ฅ๐‘› be string variables, ๐›ฌ๐‘ฅ1 , โ€ฆ , ๎ˆญ2 accepts and weave the two accepting conditions of ๎ˆญ1 and ๎ˆญ2 into ๐›ฌ๐‘ฅ๐‘› be nonempty sets of CEFA over the alphabet ๐›ด with ๐›ฌ๐‘ฅ๐‘– = one accepting condition. Specifically, ๎ˆญ1 โˆช ๎ˆญ2 = (๐‘…โ€ฒ , ๐‘„โ€ฒ , ๐›ด , ๐›ฟ โ€ฒ , ๐ผ โ€ฒ , ๐น โ€ฒ , ๐›ผ โ€ฒ ) {๎ˆญ๐‘–,1 , โ€ฆ , ๎ˆญ๐‘–,๐‘™๐‘– } for every ๐‘– โˆˆ [๐‘›] where the sets of registers ๐‘…๎ˆญ1,1 , โ€ฆ , where ๐‘…๎ˆญ1,๐‘™ , โ€ฆ , ๐‘…๎ˆญ๐‘›,1 , โ€ฆ , ๐‘…๎ˆญ๐‘›,๐‘™ are mutually disjoint, moreover, let ๐œ‘ be 1 ๐‘› โ‹ƒ an LIA formula whose free variables are from ๐‘–โˆˆ[๐‘›],๐‘—โˆˆ[๐‘™๐‘– ] ๐‘…๎ˆญ๐‘–,๐‘— . Then โ€ข ๐‘…โ€ฒ = ๐‘…1 โˆช ๐‘…2 โˆช {๐‘Ÿโ€ฒ1 , ๐‘Ÿโ€ฒ2 }, ๐‘„โ€ฒ = ๐‘„1 โˆช ๐‘„2 โˆช {๐‘ž0โ€ฒ } with ๐‘ž0โ€ฒ โˆ‰ ๐‘„1 โˆช ๐‘„2 , the CEFA in ๐›ฌ๐‘ฅ1 , โ€ฆ , ๐›ฌ๐‘ฅ๐‘› are said to be nonempty w.r.t. ๐œ‘ if there ๐ผ โ€ฒ = {๐‘ž0โ€ฒ }, are assignments ๐œƒ โˆถ {๐‘ฅ1 , โ€ฆ , ๐‘ฅ๐‘› } โ†’ ๐›ด โˆ— and vectors ๐‘ฃโƒ—๐‘–,๐‘— such that โ€ข ๐›ฟ โ€ฒ is the union of four transitions sets: (๐œƒ(๐‘ฅ๐‘– ); ๐‘ฃโƒ—๐‘–,๐‘— ) โˆˆ ๎ˆธ(๎ˆญ๐‘–,๐‘— ) and ๐œ‘[(๐‘ฃโƒ—๐‘–,๐‘— โˆ•๐‘…๎ˆญ๐‘–,๐‘— )] is true, for every ๐‘– โˆˆ [๐‘›], ๐‘— โˆˆ [๐‘™๐‘– ]. โƒ— 1, 0)) โˆฃ โˆƒ๐‘ž1 โˆˆ ๐ผ1 . (๐‘ž1 , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃโƒ—1 ) โˆˆ ๐›ฟ1 } โ€“ {(๐‘ž0โ€ฒ , ๐‘Ž, ๐‘ž1โ€ฒ , (๐‘ฃโƒ—1 , 0, 1 โ€ฒ โƒ— โ€“ {(๐‘ž , ๐‘Ž, ๐‘ž , (0, ๐‘ฃโƒ—2 , 0, 1)) โˆฃ โˆƒ๐‘ž2 โˆˆ ๐ผ2 . (๐‘ž2 , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃโƒ—2 ) โˆˆ ๐›ฟ2 } โ€ฒ Proposition 1 ([18]). ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) is PSPACE-complete. 0 2 2 โƒ— 0, 0)) โˆฃ ๐‘ž1 โˆ‰ ๐ผ1 . (๐‘ž1 , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃโƒ—1 ) โˆˆ ๐›ฟ1 } โ€“ {(๐‘ž1 , ๐‘Ž, ๐‘ž1โ€ฒ , (๐‘ฃโƒ—1 , 0, Note that the decision procedure in [18] was only used to prove the 1 โƒ— ๐‘ฃโƒ—2 , 0, 0)) โˆฃ ๐‘ž2 โˆ‰ ๐ผ2 . (๐‘ž2 , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃโƒ—2 ) โˆˆ ๐›ฟ2 } โ€“ {(๐‘ž2 , ๐‘Ž, ๐‘ž2โ€ฒ , (0, upper bound in Proposition 1 and not implemented as a matter of fact. 2 Instead, the symbolic model checker nuXmv [35] was used to solve โƒ— 1, 0) is a vector that updates ๐‘…1 by ๐‘ฃโƒ—1 , updates ๐‘…2 by where (๐‘ฃโƒ—1 , 0, ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ). We do not rely on nuXmv in this work and shall propose โƒ— and updates ๐‘Ÿโ€ฒ , ๐‘Ÿโ€ฒ by 1,0 respectively. Similarly for (0, 0, โƒ— ๐‘ฃโƒ—2 , 0, 1), a new algorithm for solving ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) in Section 6.2. 1 2 and so on. โ€ข ๐น โ€ฒ and ๐›ผ โ€ฒ are defined as follows, 6.1. From ๐–ฒ๐– ๐–ณ๐–ฑ๐–ค๐–ข๐–ซ to ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) โƒ— belongs to ๎ˆธ(๎ˆญ1 ) or ๎ˆธ(๎ˆญ2 ), i.e., one of the two โ€“ if (๐œ–; 0) Let ๐œ‘ be a RECL constraint and ๐‘ฅ1 , โ€ฆ , ๐‘ฅ๐‘› be an enumeration of the automata accepts the empty string ๐œ–, then ๐น โ€ฒ = ๐น1 โˆช๐น2 โˆช {๐‘ž0โ€ฒ } string variables occurring in ๐œ‘. Moreover, let ๐œ‘ โ‰ก ๐œ‘1 โˆง ๐œ‘2 such that and ๐›ผ โ€ฒ = (๐‘Ÿโ€ฒ1 = 0 โˆง ๐‘Ÿโ€ฒ2 = 0) โˆจ (๐‘Ÿโ€ฒ1 = 1 โˆง ๐›ผ1 ) โˆจ (๐‘Ÿโ€ฒ2 = 1 โˆง ๐›ผ2 ), ๐œ‘1 is a conjunction of regular membership constraints of ๐œ‘, and ๐œ‘2 is a โ€“ otherwise, ๐น โ€ฒ = ๐น1 โˆช ๐น2 and ๐›ผ โ€ฒ = (๐‘Ÿโ€ฒ1 = 1 โˆง ๐›ผ1 ) โˆจ (๐‘Ÿโ€ฒ2 = 1 โˆง ๐›ผ2 ). conjunction of length constraints of ๐œ‘. We shall reduce the satisfiability of ๐œ‘ to an instance of ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ). At first, we show how to construct a CEFA from a regex where From the construction, we know that counting operators may occur. Our approach largely follows Thomp- โŽง | (๐‘ค; ๐—‰๐—‹๐—ƒ (๐‘ฃ)) โŽซ | ๐‘…1 โƒ— โˆˆ ๎ˆธ(๎ˆญ1 ) and ๐—‰๐—‹๐—ƒ๐‘Ÿโ€ฒ1 (๐‘ฃ) โƒ— = 1, or sonโ€™s construction [36] but we avoid using the ๐œ–-transitions. Similar โŽช | โŽช โƒ— || (๐‘ค; ๐—‰๐—‹๐—ƒ๐‘…2 (๐‘ฃ)) ๎ˆธ(๎ˆญ1 โˆช ๎ˆญ2 ) = โŽจ(๐‘ค; ๐‘ฃ) โƒ— โˆˆ ๎ˆธ(๎ˆญ2 ) and ๐—‰๐—‹๐—ƒ๐‘Ÿโ€ฒ (๐‘ฃ) โƒ— = 1, or โŽฌ. ideas of representing counting operators as counters in automata have โŽช | 2 โŽช appeared in e.g. [27]. โŽฉ | (๐‘ค; โƒ— ๐‘ฃ) = (๐œ– , โƒ— 0) if ๎ˆญ or ๎ˆญ accepts ๐œ– โŽญ | 1 2 Let us start with register-representable regexes defined in the sequel. Intuitively, ๎ˆญ1 โˆช ๎ˆญ2 accepts the words that are accepted by one of the Let us fix an alphabet ๐›ด. ๎ˆญ1 and ๎ˆญ2 and outputs the costs of the corresponding automaton. Note Let ๐‘’ be a regex over ๐›ด. Then an occurrence of counting operators in that an integer vector ๐‘ฃโƒ—1 in ๎ˆญ1 , is copied at most twice in ๎ˆญ1 โˆช ๎ˆญ2 , ๐‘’, say (๐‘’โ€ฒ ){๐‘š,๐‘›} (or (๐‘’โ€ฒ ){๐‘š,โˆž} ), is said to be register-representable if (๐‘’โ€ฒ ){๐‘š,๐‘›} โƒ— 0, 0) and (๐‘ฃโƒ—1 , 0, namely, (๐‘ฃโƒ—1 , 0, โƒ— 1, 0). Similarly for the integer vectors in (or (๐‘’โ€ฒ ){๐‘š,โˆž} ) is not in the scope of a Kleene star, another counting ๎ˆญ2 . operator, complement, or language difference in ๐‘’. We say that ๐‘’ is The intersection of ๎ˆญ1 and ๎ˆญ2 , denoted by ๎ˆญ1 โˆฉ๎ˆญ2 = (๐‘…โ€ฒ , ๐‘„โ€ฒ , ๐›ด , ๐›ฟ โ€ฒ , ๐ผ โ€ฒ , register-representable if and only if all occurrences of counting operators ๐น , ๐›ผ โ€ฒ ), is defined in the sequel. โ€ฒ in ๐‘’ are register-representable. For instance, ๐‘Ž{2,6} โˆฉ ๐‘Ž{4,โˆž} is register- representable, while ๐‘Ž{2,6} and (๐‘Ž{2,6} ){4,โˆž} are not since ๐‘Ž{2,6} is in the โ€ข ๐‘…โ€ฒ = ๐‘…1 โˆช ๐‘…2 , ๐‘„โ€ฒ = ๐‘„1 ร— ๐‘„2 , ๐ผ โ€ฒ = ๐ผ1 ร— ๐ผ2 , ๐น โ€ฒ = ๐น1 ร— ๐น2 , ๐›ผ โ€ฒ = ๐›ผ1 โˆง ๐›ผ2 , scope of complement and the counter operator {4, โˆž} respectively. โ€ข ๐›ฟ โ€ฒ comprises the tuples ((๐‘ž1 , ๐‘ž2 ), ๐‘Ž, (๐‘ž1โ€ฒ , ๐‘ž2โ€ฒ ), (๐‘ฃโƒ—1 , ๐‘ฃโƒ—2 )) such that (๐‘ž1 , ๐‘Ž, Let ๐‘’ be a register-representable regex over ๐›ด. By the following ๐‘ž1โ€ฒ , ๐‘ฃโƒ—1 ) โˆˆ ๐›ฟ1 and (๐‘ž2 , ๐‘Ž, ๐‘ž2โ€ฒ , ๐‘ฃโƒ—2 ) โˆˆ ๐›ฟ2 . procedure, we will construct a CEFA out of ๐‘’, denoted by ๎ˆญ๐‘’ . 6 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 1. For each sub-expression (๐‘’โ€ฒ ){๐‘š,๐‘›} with ๐‘š โ‰ค ๐‘› (resp. (๐‘’โ€ฒ ){๐‘š,โˆž} ) Observation 2. For each CEFA ๎ˆญ = (๐‘…, ๐‘„, ๐›ด , ๐›ฟ , ๐ผ , ๐น , ๐›ผ), of ๐‘’, we construct a CEFA ๎ˆญ(๐‘’โ€ฒ ){๐‘š,๐‘›} (resp. ๎ˆญ(๐‘’โ€ฒ ){๐‘š,โˆž} ). Let ๎ˆญ๐‘’โ€ฒ = โŽง | โŽกโˆ‘ โŽซ (๐‘„, ๐›ด , ๐›ฟ , ๐ผ , ๐น ). Then ๎ˆญ(๐‘’โ€ฒ ){๐‘š,๐‘›} = ((๐‘Ÿโ€ฒ ), ๐‘„โ€ฒ , ๐›ด , ๐›ฟ โ€ฒโ€ฒ , ๐ผ โ€ฒ , ๐น โ€ฒ , ๐›ผ โ€ฒ ), where โŽชโˆ‘ | | / โŽค โŽช ๎ˆป(๎ˆญ) = โŽจ โƒ— ๐‘ฃโƒ— | ๐œ‚ โˆˆ ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎‰๎ˆญ ) and ๐›ผ โŽข ๐œ‚(๐‘ฃ) โƒ— ๐‘ฃโƒ— ๐‘…โŽฅ is true โŽฌ . ๐œ‚(๐‘ฃ) ๐‘Ÿโ€ฒ is a new register, ๐‘„โ€ฒ = ๐‘„ โˆช {๐‘ž0 } with ๐‘ž0 โˆ‰ ๐‘„, ๐ผ โ€ฒ = {๐‘ž0 }, | โŽข โŽฅ โŽช๐‘ฃโˆˆC โƒ— | โŽฃ๐‘ฃโˆˆC โƒ— ๎ˆญ โŽฆ โŽช ๐น โ€ฒ = ๐น โˆช {๐‘ž0 }, and โŽฉ ๎ˆญ | โŽญ ๐›ฟ โ€ฒโ€ฒ = {(๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , (0)) โˆฃ (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ ) โˆˆ ๐›ฟ} โˆช For ๐‘– โˆˆ [๐‘›], let ๐›ผ๐‘– be the accepting condition of ๎ˆฎ๐‘– . Then from {(๐‘ž0 , ๐‘Ž, ๐‘ž โ€ฒ , (1)) โˆฃ โˆƒ๐‘ž0โ€ฒ โˆˆ ๐ผ . (๐‘ž0โ€ฒ , ๐‘Ž, ๐‘ž โ€ฒ ) โˆˆ ๐›ฟ} โˆช Observation 2, we know that the following two conditions are equiva- {(๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , (1)) โˆฃ ๐‘ž โˆˆ ๐น , โˆƒ๐‘ž0โ€ฒ โˆˆ ๐ผ . (๐‘ž0โ€ฒ , ๐‘Ž, ๐‘ž โ€ฒ ) โˆˆ ๐›ฟ}, lent, moreover, ๐›ผ โ€ฒ = ๐‘š โ‰ค ๐‘Ÿโ€ฒ โ‰ค ๐‘› if ๐ผ โˆฉ ๐น = โˆ…, otherwise ๐›ผ โ€ฒ = ๐‘Ÿโ€ฒ โ‰ค ๐‘›. (Intuitively, if ๐œ€ is accepted by ๎ˆญ๐‘’โ€ฒ , then the value of ๐‘Ÿโ€ฒ can be โ€ข there are outputs ๐‘ฃโƒ—1 โˆˆ ๎ˆป(๎ˆฎ1 ), โ€ฆ , ๐‘ฃโƒ—๐‘› โˆˆ ๎ˆป(๎ˆฎ๐‘› ) such that less than ๐‘š.) Moreover, ๎ˆญ(๐‘’โ€ฒ ){๐‘š,โˆž} is constructed by adapting ๐›ผ โ€ฒ ๐œ‘[๐‘ฃโƒ—1 โˆ•๐‘…๎ˆฎ1 , โ€ฆ , ๐‘ฃโƒ—๐‘› โˆ•๐‘…๎ˆฎ๐‘› ] is true, in ๎ˆญ(๐‘’โ€ฒ ){๐‘š,๐‘›} as follows: ๐›ผ โ€ฒ = ๐‘š โ‰ค ๐‘Ÿโ€ฒ if ๐ผ โˆฉ ๐น = โˆ… and ๐›ผ โ€ฒ = ๐š๐š›๐šž๐šŽ โ€ข there are ๐œ‚1 โˆˆ ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎‰๎ˆฎ1 ), โ€ฆ , ๐œ‚๐‘› โˆˆ ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎‰๎ˆฎ๐‘› ) such that otherwise. โ‹€ โŽก โˆ‘ / โŽค โŽก โˆ‘ / โˆ‘ / โŽค 2. For each sub-expression ๐‘’โ€ฒ of ๐‘’ such that ๐‘’โ€ฒ contains occurrences ๐›ผ๐‘– โŽข โƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ๐‘– โŽฅ โˆง ๐œ‘ โŽข ๐œ‚๐‘– (๐‘ฃ) ๐œ‚1 (๐‘ฃ) โƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ1 , โ€ฆ , โƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ๐‘› โŽฅ ๐œ‚๐‘› (๐‘ฃ) of counting operators but ๐‘’โ€ฒ itself is not of the form (๐‘’โ€ฒ1 ){๐‘š,๐‘›} or โŽข โŽฅ โŽข โŽฅ ๐‘–โˆˆ[๐‘›] โŽฃ๐‘ฃโˆˆC โƒ— ๎ˆฎ๐‘– โŽฆ โŽฃ๐‘ฃโˆˆC โƒ— ๎ˆฎ 1 ๐‘ฃโˆˆC โƒ— ๎ˆฎ๐‘› โŽฆ (๐‘’โ€ฒ1 ){๐‘š,โˆž} , from the assumption that ๐‘’ is register-representable, we know that ๐‘’โ€ฒ is of the form ๐‘’โ€ฒ1 โ‹… ๐‘’โ€ฒ2 , ๐‘’โ€ฒ1 + ๐‘’โ€ฒ2 , ๐‘’โ€ฒ1 โˆฉ ๐‘’โ€ฒ2 , or (๐‘’โ€ฒ1 ). For is true. ๐‘’โ€ฒ = (๐‘’โ€ฒ1 ), we have ๎ˆญ๐‘’โ€ฒ = ๎ˆญ๐‘’โ€ฒ . For ๐‘’โ€ฒ = ๐‘’โ€ฒ1 โ‹… ๐‘’โ€ฒ2 , ๐‘’โ€ฒ = ๐‘’โ€ฒ1 + ๐‘’โ€ฒ2 , Therefore, to solve the non-emptiness of CEFA in ๐›ฌโ€ฒ๐‘ฅ , โ€ฆ , ๐›ฌโ€ฒ๐‘ฅ w.r.t. 1 1 ๐‘› or ๐‘’โ€ฒ = ๐‘’โ€ฒ1 โˆฉ ๐‘’โ€ฒ2 , suppose that CEFA ๎ˆญ๐‘’โ€ฒ and ๎ˆญ๐‘’โ€ฒ have been ๐œ‘, it is sufficient to compute the existential LIA formulas ๐œ“๎‰๎ˆฎ (ZC๎ˆฎ ), โ€ฆ, 1 2 1 1 constructed. ๐œ“๎‰๎ˆฎ (ZC๎ˆฎ ) to represent the Parikh images of ๎‰๎ˆฎ1 , โ‹ฏ, ๎‰๎ˆฎ๐‘› respectively, ๐‘› ๐‘› 3. For each maximal sub-expression ๐‘’โ€ฒ of ๐‘’ such that ๐‘’โ€ฒ contains where ZC๎ˆฎ = {z๐‘–,๐‘ฃโƒ— โˆฃ ๐‘ฃโƒ— โˆˆ C๎ˆฎ๐‘– } for ๐‘– โˆˆ [๐‘›], and solve the satisfiability of ๐‘– no occurrences of counting operators, an NFA ๎ˆญ๐‘’โ€ฒ can be con- the following existential LIA formula structed by structural induction on the syntax of ๐‘’โ€ฒ . Then we โ‹€โŽ› โŽก โˆ‘ / โŽคโŽž have ๎ˆญ๐‘’โ€ฒ = ๎ˆญ๐‘’โ€ฒ โ‹… ๎ˆญ๐‘’โ€ฒ , ๎ˆญ๐‘’โ€ฒ = ๎ˆญ๐‘’โ€ฒ โˆช ๎ˆญ๐‘’โ€ฒ , or ๎ˆญ๐‘’โ€ฒ = ๎ˆญ๐‘’โ€ฒ โˆฉ ๎ˆญ๐‘’โ€ฒ . โŽœ๐œ“๎‰ (ZC ) โˆง ๐›ผ๐‘– โŽข z๐‘–,๐‘ฃโƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ๐‘– โŽฅโŽŸ โˆง 1 2 1 2 1 2 โŽœ ๎ˆฎ ๐‘– ๎ˆฎ ๐‘– โŽข โŽฅโŽŸ ๐‘–โˆˆ[๐‘›] โŽ โŽฃ๐‘ฃโˆˆC โƒ— ๎ˆฎ๐‘– โŽฆโŽ  For non-register-representable regexes, we first convert them into register- โŽก โˆ‘ / โˆ‘ / โŽค representable regexes by unfolding all the non-register-representable ๐œ‘โŽข z1,๐‘ฃโƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ1 , โ€ฆ , z๐‘›,๐‘ฃโƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ๐‘› โŽฅ . โŽข โŽฅ occurrences of counting operators. After that, we utilize the aforemen- โŽฃ๐‘ฃโˆˆC โƒ— ๎ˆฎ 1 ๐‘ฃโˆˆC โƒ— ๎ˆฎ๐‘› โŽฆ tioned procedure to construct CEFA. For instance, (๐‘Ž{2,6} โ‹… ๐‘โˆ— ){2,โˆž} is Intuitively, the integer variables z๐‘–,๐‘ฃโƒ— represent the number of occur- transformed into (๐‘Ž๐‘Ž(๐œ€ + ๐‘Ž + ๐‘Ž๐‘Ž + ๐‘Ž๐‘Ž๐‘Ž + ๐‘Ž๐‘Ž๐‘Ž๐‘Ž) โ‹… ๐‘โˆ— ){2,โˆž} . rences of ๐‘ฃโƒ— in the strings accepted by ๎‰๎ˆฎ๐‘– . For each ๐‘– โˆˆ [๐‘›], let ๐‘ฅ๐‘– โˆˆ ๐‘’๐‘–,1 , โ€ฆ , ๐‘ฅ๐‘– โˆˆ ๐‘’๐‘–,๐‘™๐‘– be an enumeration of Because the sizes of the LIA formulas ๐œ“๎‰๎ˆฎ (ZC๎ˆฎ ), โ€ฆ , ๐œ“๎‰๎ˆฎ (ZC๎ˆฎ ) are 1 1 ๐‘› ๐‘› the regular membership constraints for ๐‘ฅ๐‘– in ๐œ‘1 . Then we can construct proportional to the sizes (more precisely, the alphabet size, the number CEFA ๎ˆญ๐‘–,๐‘— from ๐‘’๐‘–,๐‘— for each ๐‘– โˆˆ [๐‘›] and ๐‘— โˆˆ [๐‘™๐‘– ]. Moreover, we construct of states and transitions) of NFA ๎‰๎ˆฎ1 , โ€ฆ , ๎‰๎ˆฎ๐‘› , and the satisfiability of another CEFA ๎ˆญ๐‘–,0 for each ๐‘– โˆˆ [๐‘›] to model the length of ๐‘ฅ๐‘– . Specifi- existential LIA formulas is NP-complete, it is vital to reduce the sizes cally, ๎ˆญ๐‘–,0 is constructed as ((๐‘Ÿ๐‘–,0 ), {๐‘ž๐‘–,0 }, ๐›ด , ๐›ฟ๐‘–,0 , {๐‘ž๐‘–,0 }, {๐‘ž๐‘–,0 }, ๐š๐š›๐šž๐šŽ) where of these NFAs to improve the performance. ๐‘Ÿ๐‘–,0 is a fresh register and ๐›ฟ๐‘–,0 = {(๐‘ž๐‘–,0 , ๐‘Ž, ๐‘ž๐‘–,0 , (1)) โˆฃ ๐‘Ž โˆˆ ๐›ด}. Let ๐›ฌ๐‘ฅ๐‘– = โˆ‘ โˆ‘ Since ๐‘ฃโˆˆC(๎ˆฎโƒ— ๐‘–) โƒ— ๐‘ฃโƒ— = ๐‘ฃโˆˆC(๎ˆฎ ๐œ‚(๐‘ฃ) โƒ— ๐œ‚(๐‘ฃ) โƒ— ๐‘ฃโƒ— for each ๐‘– โˆˆ [๐‘›] and ๐œ‚ โˆˆ {๎ˆญ๐‘–,0 , ๎ˆญ๐‘–,1 , โ€ฆ , ๎ˆญ๐‘–,๐‘™๐‘– } for each ๐‘– โˆˆ [๐‘›], and ๐œ‘โ€ฒ2 โ‰ก ๐œ‘2 [๐‘Ÿ1,0 โˆ•|๐‘ฅ1 |, โ€ฆ , ๐‘Ÿ๐‘›,0 โˆ•|๐‘ฅ๐‘› |]. โƒ— ๐‘– )โงต{0} ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎‰๎ˆฎ๐‘– ), it turns out that the 0-labeled โƒ— transitions in ๎‰๎ˆฎ๐‘– do not Then the satisfiability of ๐œ‘ is reduced to the non-emptiness of CEFAs in โˆ‘ ๐›ฌ๐‘ฅ1 , โ€ฆ , ๐›ฌ๐‘ฅ๐‘› w.r.t. ๐œ‘โ€ฒ2 . contribute to the final output ๐‘ฃโˆˆC(๎ˆฎ โƒ— ๐‘– ) ๐œ‚(๐‘ฃ) โƒ— ๐‘ฃ. โƒ— Therefore, we can apply the following size-reduction technique for ๎‰๎ˆฎ๐‘– โ€™s. 6.2. Solving ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) Step 2 (Reducing automata sizes). For each ๐‘– โˆˆ [๐‘›], we view the transitions (๐‘ž , 0, โƒ— ๐‘ž โ€ฒ ) in ๎‰๎ˆฎ as ๐œ€-transitions (๐‘ž , ๐œ€, ๐‘ž โ€ฒ ), and remove the ๐œ€- ๐‘– In this section, we present a procedure to solve the ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) transitions from ๎‰๎ˆฎ๐‘– . Then we determinize and minimize the resulting problem: Suppose that ๐‘ฅ1 , โ€ฆ , ๐‘ฅ๐‘› are mutually distinct string variables, NFA. โ–ก ๐›ฌ๐‘ฅ1 , โ€ฆ , ๐›ฌ๐‘ฅ๐‘› are nonempty sets of CEFA over the same alphabet ๐›ด where For ๐‘– โˆˆ [๐‘›], let ๎ˆฏ๐‘– denote the DFA obtained from ๎‰๎ˆฎ๐‘– by executing ๐›ฌ๐‘ฅ๐‘– = {๎ˆญ๐‘–,1 , โ€ฆ, ๎ˆญ๐‘–,๐‘™๐‘– } for every ๐‘– โˆˆ [๐‘›]. Moreover, the sets of registers Step 2 and C๎ˆฏ๐‘– โˆถ= C๎ˆฎ๐‘– โงต {0}. โƒ— From the construction, we know that ๐‘…๎ˆญ1,1 , โ€ฆ , ๐‘…๎ˆญ๐‘›,๐‘™ are mutually disjoint, and ๐œ‘ is a LIA formula whose ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎ˆฏ๐‘– ) = ๐—‰๐—‹๐—ƒC๎ˆฏ (๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎‰๎ˆฎ๐‘– )) for each ๐‘– โˆˆ [๐‘›]. Therefore, we compute ๐‘› โ‹ƒ ๐‘– free variables are from ๐‘–โˆˆ[๐‘›],๐‘—โˆˆ[๐‘™๐‘– ] ๐‘…๎ˆญ๐‘–,๐‘— . The procedure comprises three LIA formulas from ๎ˆฏ๐‘– โ€™s, instead of ๎‰๎ˆฎ๐‘– โ€™s, to represent the Parikh images. steps. Step 3 (Computing an LIA formula). For each ๐‘– โˆˆ [๐‘›], we compute an Step 1 (Computing intersection automata). For each ๐‘– โˆˆ [๐‘›], compute existential LIA formula ๐œ“๎ˆฏ๐‘– (ZC๎ˆฏ ) from ๎ˆฏ๐‘– to represent ๐—‰๐–บ๐—‹๐—‚๐—„๐—(๎ˆฏ๐‘– ). Then ๐‘– a CEFA ๎ˆฎ๐‘– = ๎ˆญ๐‘–,1 โˆฉ โ‹ฏ โˆฉ ๎ˆญ๐‘–,๐‘™๐‘– , and let ๐›ฌโ€ฒ๐‘ฅ โˆถ= {๎ˆฎ๐‘– }. โ–ก the ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) problem is reduced to the satisfiability of the following ๐‘– After Step 1, the non-emptiness of CEFAs in ๐›ฌ๐‘ฅ1 , โ€ฆ , ๐›ฌ๐‘ฅ๐‘› w.r.t. ๐œ‘ is LIA formula, โ–ก reduced to the non-emptiness of CEFAs in ๐›ฌโ€ฒ๐‘ฅ , โ€ฆ , ๐›ฌโ€ฒ๐‘ฅ w.r.t. ๐œ‘. 1 ๐‘› โ‹€โŽ› โŽก โˆ‘ / โŽคโŽž In the following steps, we reduce the non-emptiness of CEFAs in โŽœ๐œ“๎ˆฏ (ZC ) โˆง ๐›ผ๐‘– โŽข z๐‘–,๐‘ฃโƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ๐‘– โŽฅโŽŸ โˆง ๐›ฌโ€ฒ๐‘ฅ , โ€ฆ, ๐›ฌโ€ฒ๐‘ฅ w.r.t. ๐œ‘ to the satisfiability of an LIA formula. The reduction โŽœ ๐‘– ๎ˆฏ ๐‘– โŽข โŽฅโŽŸ 1 ๐‘› ๐‘–โˆˆ[๐‘›] โŽ โŽฃ๐‘ฃโˆˆC โƒ— ๎ˆฏ๐‘– โŽฆโŽ  relies on the following two observations. (3) โŽก โˆ‘ / โˆ‘ / โŽค ๐œ‘ โŽข z1,๐‘ฃโƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ1 , โ€ฆ , z๐‘›,๐‘ฃโƒ— ๐‘ฃโƒ— ๐‘…๎ˆฎ๐‘› . โŽฅ Observation 1. CEFA in ๐›ฌโ€ฒ๐‘ฅ = {๎ˆฎ1 }, โ€ฆ , ๐›ฌโ€ฒ๐‘ฅ = {๎ˆฎ๐‘› } are nonempty w.r.t. โŽข โŽฅ 1 ๐‘› โŽฃ๐‘ฃโˆˆC โƒ— ๎ˆฏ 1 ๐‘ฃโˆˆC โƒ— ๎ˆฏ๐‘› โŽฆ ๐œ‘ iff there are outputs ๐‘ฃโƒ—1 โˆˆ ๎ˆป(๎ˆฎ1 ), โ€ฆ , ๐‘ฃโƒ—๐‘› โˆˆ ๎ˆป(๎ˆฎ๐‘› ) such that ๐œ‘[๐‘ฃโƒ—1 โˆ•๐‘…๎ˆฎ1 , โ€ฆ, ๐‘ฃโƒ—๐‘› โˆ•๐‘…๎ˆฎ๐‘› ] is true. Step 4 (Solving the LIA formula). We utilize some off-the-shelf SMT Let ๎ˆญ = (๐‘…, ๐‘„, ๐›ด , ๐›ฟ , ๐ผ , ๐น , ๐›ผ) be a CEFA and C๎ˆญ = {๐‘ฃโƒ— โˆฃ โˆƒ๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ . (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , solver (e.g. Z3) to solve the satisfiability of the LIA formula in Eq. (3). โƒ— โˆˆ ๐›ฟ}. Moreover, let ๎‰๎ˆญ = (๐‘„, C๎ˆญ , ๐›ฟ โ€ฒ , ๐ผ , ๐น ) be an NFA over the alpha- ๐‘ฃ) If the solver reports โ€˜โ€˜satโ€™โ€™ and returns a solution m, then we go to Step bet C๎ˆญ that is obtained from ๎ˆญ by dropping the accepting condition 5. Otherwise, the ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) problem has no solution. โ–ก and ignoring the characters, that is, ๐›ฟ โ€ฒ comprises tuples (๐‘ž , ๐‘ฃ, โƒ— ๐‘ž โ€ฒ ) such Step 5 (Constructing a solution). Let m be a solution of the LIA formula that (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) โƒ— โˆˆ ๐›ฟ for ๐‘Ž โˆˆ ๐›ด. in Eq. (3). We want to construct from m and the CEFAs ๎ˆฎ๐‘– for ๐‘– โˆˆ [๐‘›], 7 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 a solution mโ€ฒ of the ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) problem. The construction of mโ€ฒ is RECL constraint: (i) unfold the counting operators in ๐‘’1 , (ii) obtained by constructing mโ€ฒ (๐‘ฅ๐‘– ) for each ๐‘– โˆˆ [๐‘›]. Let ๐‘– โˆˆ [๐‘›]. We unfold the counting operator {๐‘š,๐‘›} . Let ๐—Ž๐–ฟ ๐—…๐–ฝ1 (๐‘’) and ๐—Ž๐–ฟ ๐—…๐–ฝ2 (๐‘’) denote describe how to construct mโ€ฒ (๐‘ฅ๐‘– ). The free variables ZC๎ˆฏ correspond the resulting regular expressions. Note that in ๐—Ž๐–ฟ ๐—…๐–ฝ1 (๐‘’), all the ๐‘– to the number of occurrences of characters in C๎ˆฏ๐‘– , i.e. the alphabet of counting operators in ๐‘’1 (if there is any) disappear, while in the NFA ๎ˆฏ๐‘– . Moreover, in ๎ˆฏ๐‘– , the characters are just the integer vectors ๐—Ž๐–ฟ ๐—…๐–ฝ2 (๐‘’), all the counting operators in ๐‘’1 are preserved and copied and the characters in the original CEFAs ๎ˆฎ๐‘– used to construct ๎ˆฏ๐‘– are for ๐‘› times. ignored. The construction of mโ€ฒ (๐‘ฅ๐‘– ) is actually to solve the reachability โ€ข In the sequel, we estimate the sizes (the number of transitions) problem in a finite transition system where of the NFAs ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) and ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) , denoted by ๐‘ ๐‘ง1 and ๐‘ ๐‘ง2 . If 1 2 ๐‘ ๐‘ง1 โ‰ค ๐‘ ๐‘ง2 , then we choose to unfold the counting operators in โ€ข the nodes are of the form (๐‘ž , (n๐‘–,๐‘ฃโƒ— )๐‘ฃโˆˆC โƒ— ๎ˆฏ๐‘– ), where n๐‘–,๐‘ฃโƒ— denotes the ๐‘’1 . Otherwise, we choose to unfold the counting operator {๐‘š,๐‘›} . number of the remaining occurrences of the integer vector ๐‘ฃโƒ— in โ€ข To estimate the sizes of ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) and ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) , we first esti- 1 2 ๎ˆฎ๐‘– , mate the sizes of the CEFAs ๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ1 (๐‘’) and ๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ2 (๐‘’) as two pairs โ€ข the transitions are of the form (๐—Œ๐—‡๐—Ž๐—†1 (๐‘’), ๐—๐—‡๐—Ž๐—†1 (๐‘’)) and (๐—Œ๐—‡๐—Ž๐—†2 (๐‘’), ๐—๐—‡๐—Ž๐—†2 (๐‘’)), where ๐—Œ๐—‡๐—Ž๐—†1 (๐‘’) and ((๐‘ž , (n๐‘–,๐‘ฃโƒ— )๐‘ฃโˆˆC โ€ฒ โƒ—), (๐‘ž โ€ฒ , (nโ€ฒ๐‘–,๐‘ฃโƒ— )๐‘ฃโˆˆC ๐—๐—‡๐—Ž๐—†1 (๐‘’) are the estimates of the number of states and integer vec- โƒ— ๎ˆฏ ), (๐‘ž , ๐‘Ž, ๐‘ž , ๐‘ข โƒ— ๎ˆฏ )) ๐‘– ๐‘– tors in ๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ1 (๐‘’) separately, similarly for (๐—Œ๐—‡๐—Ž๐—†2 (๐‘’), ๐—๐—‡๐—Ž๐—†2 (๐‘’)). Since such that (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ขโƒ—) is a transition in ๎ˆฎ๐‘– , nโ€ฒ = n๐‘–,๐‘ขโƒ— โˆ’ 1, moreover, the NFAs ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) and ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) are constructed from ๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ1 (๐‘’) ๐‘–,๐‘ขโƒ— 1 2 nโ€ฒ = n๐‘–,๐‘ฃโƒ— for every ๐‘ฃโƒ— โ‰  ๐‘ขโƒ—. and ๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ2 (๐‘’) by dropping the accepting conditions and ignore ๐‘–,๐‘ฃโƒ— characters, ๐—Œ๐—‡๐—Ž๐—†1 (๐‘’) and ๐—๐—‡๐—Ž๐—†1 (๐‘’) are the estimates of the number Moreover, the initial and final state of the reachability problem are set of states and characters of ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) , similarly for ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) . Then 1 2 to be (๐‘ž0 , (m(z๐‘–,๐‘ฃโƒ— ))๐‘ฃโˆˆC โƒ— ๎ˆฏ๐‘– ) and (๐‘ž๐‘“ , (0, โ€ฆ , 0)) for some initial state ๐‘ž0 and we estimate the sizes (number of transitions) of the NFAs ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) 1 final state ๐‘ž๐‘“ in ๎ˆฎ๐‘– . If a path from the initial state to the final state is and ๎‰๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ (๐‘’) as ๐‘ ๐‘ง1 = (๐—Œ๐—‡๐—Ž๐—†1 (๐‘’))2 โ‹… ๐—๐—‡๐—Ž๐—†1 (๐‘’) and ๐‘ ๐‘ง2 = (๐—Œ๐—‡๐—Ž๐—†2 (๐‘’))2 โ‹… found, then the path gives us a desired string mโ€ฒ (๐‘ฅ๐‘– ). โ–ก 2 ๐—๐—‡๐—Ž๐—†2 (๐‘’) respectively. 7. Optimizations for non-register-representable RECL constraints It remains to show how to compute (๐—Œ๐—‡๐—Ž๐—†1 (๐‘’), ๐—๐—‡๐—Ž๐—†1 (๐‘’)) and (๐—Œ๐—‡๐—Ž๐—†2 (๐‘’), ๐—๐—‡๐—Ž๐—†2 (๐‘’)) from ๐‘’ = (๐‘’1 ){๐‘š,๐‘›} . Let ๐‘’ = (๐‘’1 ){๐‘š,๐‘›} be a reg- In Section 6, to deal with the non-register-representable RECL con- ular expression where ๐‘’1 is a register-representable regular expres- straints, we unfold all the non-register-representable occurrences of sion. Note that there is only one register in ๎ˆญ๐—Ž๐–ฟ ๐—…๐–ฝ1 (๐‘’) . As a result, counting operators. Nevertheless, the naive unfolding incurs an expo- we set ๐—๐—‡๐—Ž๐—†1 (๐‘’) = 2. Moreover, in the sequel, we inductively com- nential blow-up over the formula sizes, especially when the counting pute ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) and (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) from ๐‘’1 , then let ๐—Œ๐—‡๐—Ž๐—†1 (๐‘’) = bounds are large. Recall that an occurrence of counting operators is said ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) and (๐—Œ๐—‡๐—Ž๐—†2 (๐‘’), ๐—๐—‡๐—Ž๐—†2 (๐‘’)) = (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )). to be non-register-representable if it is in the scope of some Kleene star, another counting operator, complement, or language difference. Since โ€ข If ๐‘’1 = โˆ… or ๐‘’1 = ๐œ–, then ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) = 0 and Kleene star can be seen as a special case of the counting operators, and (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) = (0, 0). language difference ๐‘’1 โงต ๐‘’2 can be rewritten as ๐‘’1 โˆฉ ๐‘’2 , without loss of (Note that here the constant 0, instead of 1, is chosen, so that generality, we can assume that a non-register representable occurrence later on, when we consider ๐‘’1 = ๐‘’2 โˆฉ ๐‘’3 , ๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) ร— ๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 ) and of a counting operator is in the scope of another counting operator or ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) ร—๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 ) will become 0 since ๐œ– โˆฉ๐‘’3 = ๐œ– and โˆ… โˆฉ๐‘’3 = โˆ….) some complement operator. In the sequel, to alleviate the formula-size blow-up problem resulted from the naive unfolding, we present various โ€ข If ๐‘’1 = ๐‘Ž, then ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) = 2 and optimizations, for the nesting of counting operators, and the nesting of counting operators and a complement operator respectively. (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) = (2๐‘›, 1). 7.1. Optimizations for the nesting of counting operators โ€ข If ๐‘’1 = ๐‘’2 โ‹… ๐‘’3 , then ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) = ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’2 ) + ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’3 ) and (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) = In the sequel, we propose an optimization for the nesting of count- ing operators. The optimization is designed for the special case that (๐‘›(๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) + ๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 )), ๐‘›(๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) + ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 ))). the nesting depth of counting operators is one. For a RECL constraint (Note that because ๐‘’1 is concatenated for ๐‘› times in ๐—Ž๐–ฟ ๐—…๐–ฝ2 (๐‘’), we whose nesting depth of counting operators is greater than two, we can define ๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ) as ๐‘›(๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) + ๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 )). Moreover, since the apply the optimization for the inner-most nesting of counting operators integer vectors in ๎ˆญ1 โ‹… ๎ˆญ2 are of the form (๐‘ฃโƒ—1 , 0) โƒ— and (0, โƒ— ๐‘ฃโƒ—2 ) where and decrement the nesting depth, then apply the optimization again to ๐‘ฃโƒ—1 and ๐‘ฃโƒ—2 are integer vectors in ๎ˆญ1 and ๎ˆญ2 respectively, we define the resulting constraint, and continue this process, until obtaining a ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ) as ๐‘›(๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) + ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 ))). register-representable RECL constraint. โ€ข If ๐‘’1 = ๐‘’2 + ๐‘’3 , then ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) = ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’2 ) + ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’3 ) and The main idea of the optimization is illustrated by the constraint (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) = ๐‘ฅ โˆˆ (๐‘Ž{1,1000} ){1,2} . If we apply the naive unfolding in Section 6, we (๐‘›(๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) + ๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 ) + 1), ๐‘›(2๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) + 2๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 ))). will unfold ๐‘Ž{1,1000} and obtain the constraint ๐‘ฅ โˆˆ (๐‘Ž(๐œ€ + ๐‘Ž + ๐‘Ž๐‘Ž + โ‹ฏ + ๐‘Ž โ‹ฏ ๐‘Ž )){1,2} . It is easy to observe that a smarter way is to unfold (Recall that each integer vector in ๎ˆญ1 resp. ๎ˆญ2 is copied at most โŸโŸโŸ twice in the construction of ๎ˆญ1 โˆช ๎ˆญ2 .) 999 {1,2} , instead of {1,1000} . If we do this, then we get a constraint ๐‘ฅ โˆˆ โ€ข If ๐‘’1 = ๐‘’2 โˆฉ ๐‘’3 , then ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) = ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’2 ) ร— ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’3 ) and (๐‘Ž{1,1000} )(๐œ€ + ๐‘Ž{1,1000} ), whose size is much smaller, compared to ๐‘ฅ โˆˆ (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) = (๐‘Ž(๐œ€+๐‘Ž+๐‘Ž๐‘Ž+โ‹ฏ+ ๐‘Ž โ‹ฏ ๐‘Ž )){1,2} . (Recall that we assume a binary encoding (๐‘›(๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) ร— ๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 )), ๐‘›(๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ) ร— ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’3 ))). โŸโŸโŸ 999 (Recall that the integer vectors in ๎ˆญ1 โˆฉ ๎ˆญ2 are of the form of the integer constants in counting operators.) (๐‘ฃโƒ—1 , ๐‘ฃโƒ—2 ), where ๐‘ฃโƒ—1 and ๐‘ฃโƒ—2 are the integer vectors in ๎ˆญ1 and ๎ˆญ2 Let us describe the optimization more precisely. respectively.) โ€ฒ โ€ข For a regular expression ๐‘’ = (๐‘’1 ){๐‘š,๐‘›} where ๐‘’1 is a register- โ€ข If ๐‘’1 = ๐‘’2 , then ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) = 2๐—Œ๐—‡๐—Ž๐—†1 (๐‘’2 ) and โ€ฒ representable regular expression, we can unfold the counting (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) = (๐‘› โ‹… 2๐—Œ๐—‡๐—Ž๐—†2 (๐‘’2 ) , 1). operators in the following ways to obtain a register-representable 8 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 (In this case, since ๐‘’1 is register-representable, we know that ๐‘’2 Intuitively, ๐‘ž๐‘  is a sink state that deals with the situations where there contains no counting operators. As a result, we first construct the are no runs, and ๐‘Ÿโ€ฒ is a register newly introduced so that the different NFA for ๐‘’2 , then determinize and complement it to construct the cases of the non-accepting runs of ๎ˆญ๐‘’1 can be captured by a single automaton for ๐‘’1 .) accepting condition ๐›ผ โ€ฒ . โ€ข If ๐‘’1 = (๐‘’2 )โˆ— , then ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) = ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’2 ) and Note that ๎ˆฎ is a strict over-approximation of ๐‘’ = ๐‘’1 since ๎ˆญ๐‘’1 may be (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) = (๐‘›๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ), 1). nondeterministic. For instance, if there are two runs of ๎ˆญ๐‘’1 on a string ๐‘ค, say one run that ends at a non-final state and another run that ends (In this case, since ๐‘’1 is register-representable, we know that ๐‘’2 at a final state and satisfies accepting condition ๐›ผ. Then ๐‘ค โˆˆ ๎ˆธ(๎ˆญ๐‘’1 ), contains no counting operators.) โ€ฒ โ€ฒ โ€ฒ thus ๐‘ค โˆ‰ ๎ˆธ(๎ˆญ๐‘’1 ). Nevertheless, according to the construction, ๐‘ค is โ€ข If ๐‘’1 = (๐‘’2 ){๐‘š ,๐‘› } or ๐‘’1 = (๐‘’2 ){๐‘› ,โˆž} , then ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’1 ) = ๐‘›โ€ฒ ๐—Œ๐—‡๐—Ž๐—†โ€ฒ1 (๐‘’2 ) accepted by ๎ˆฎ, witnessed by the run of ๎ˆญ๐‘’1 that ends at a non-final and state. (๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 ), ๐—๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’1 )) = (๐‘›๐—Œ๐—‡๐—Ž๐—†โ€ฒ2 (๐‘’2 ), ๐‘›). Combination of under- and over-approximation. We combine the under- (In this case, since ๐‘’1 is register-representable, we know that ๐‘’2 and over-approximation as well as the decision procedure into the contains no counting operators.) following procedure. 1. We first apply the under-approximation. 7.2. Optimizations for the nesting of counting operators and a complement 2. If the under-approximation returns โ€˜โ€˜satโ€™โ€™, then we know that operator the original constraint is satisfiable. Otherwise, the under-appro- As mentioned in Section 6, if some counting operators are in the ximation returns โ€˜โ€˜unsatโ€™โ€™ and we apply the over-approximation. scope of a complement operator, it is necessary to unfold these counting 3. If the over-approximation returns โ€˜โ€˜unsatโ€™โ€™, then we know that operators before the complement operation. The exponential blow- the original constraint is unsatisfiable. Otherwise, the over- up in the unfolding combined with the blow-up in the complement approximation returns โ€˜โ€˜satโ€™โ€™ and we resort to the decision pro- operation will produce a huge number of states. To alleviate this cedure (without approximations) in Section 6 to solve the RECL problem, we propose two approximations of the complement operation, constraint. i.e., an under-approximation and an over-approximation, and combine them into a procedure that achieves a nice balance between efficiency 8. Optimizations for model generation and precision. Let ๐‘’ = ๐‘’1 be a regular expression, where ๐‘’1 is a register- Recall that in Step 5 of Section 6, we construct a solution mโ€ฒ from representable regular expression. the model m of the LIA formula in Eq. (3) by solving the reachability problems in a finite transition system. It is easy to see that the transition Under-approximation. One way of doing the complement operation is system contains exponentially many nodes in general and a naive to transform ๐‘’1 into a regular expression that does not contain counting search over the transition system is not efficient. In the sequel, we operators. We compute an under-approximation of ๐‘’ by replacing each describe two optimizations so that the reachability problem can be occurrence of the counting operators in ๐‘’1 , say (๐‘’2 ){๐‘š,๐‘›} or (๐‘’{๐‘›,โˆž} 2 ), by solved more efficiently. ๐‘’โˆ—2 . Let us use ๐‘’โ€ฒ1 to denote the resulting regular expression. Note that To ease the presentation, let us assume that ๐‘›, i.e. the number of ๐‘’โ€ฒ1 does not contain counting operators. Moreover, it is easy to observe string variables, to be 1. that ๎ˆธ(๐‘’1 ) โІ ๎ˆธ(๐‘’โ€ฒ1 ). As a result, we have ๎ˆธ(๐‘’โ€ฒ1 ) โІ ๎ˆธ(๐‘’1 ). That is, ๐‘’โ€ฒ1 is an At first, we refine the states of the transition system by recording under-approximation of ๐‘’1 . the number of transitions, instead of just the number of occurrences of Over-approximation. For ๐‘’ = ๐‘’1 , we utilize the CEFA constructed from the integer vectors. ๐‘’1 , say ๎ˆญ๐‘’1 , to construct a CEFA ๎ˆฎ of ๐‘’ so that ๎ˆธ(๎ˆฎ) is an over- Refining the states of the transition system by recording the number of approximation of ๎ˆธ(๐‘’). Let ๎ˆญ๐‘’1 = (๐‘…, ๐‘„, ๐›ด , ๐›ฟ , ๐ผ , ๐น , ๐›ผ). The main idea of transitions. Let ๐œƒ denote the formula in Eq. (3). We shall construct the the construction is explained as follows: If a string ๐‘ค is not accepted formula ๐œƒ โ€ฒ so that any model of ๐œƒ โ€ฒ necessarily encodes the number of by ๎ˆญ๐‘’1 , then either there are no runs of ๎ˆญ๐‘’1 on ๐‘ค, or every run of ๎ˆญ๐‘’1 occurrences of transitions within ๎ˆฎ1 . on ๐‘ค stop at a non-final state or enter a final state but do not satisfy โ‹€ ๐œƒ โ€ฒ โ‰ก ๐œ“๎ˆฎ1 (Z๎ˆฎ1 ) โˆง z1,๐‘ฃโƒ— = m(z1,๐‘ฃโƒ— ) โˆง the accepting condition ๐›ผ. We shall construct a CEFA ๎ˆฎ that accepts ๐‘ฃโˆˆC โƒ— ๎ˆฏ when one of these situations occurs. Then ๎ˆธ(๎ˆญ๐‘’1 ) โІ ๎ˆธ(๎ˆฎ), that is, ๎ˆฎ is โ‹€ 1 โˆ‘ an over-approximation of ๐‘’. z1,๐‘ฃโƒ— = z1,(๐‘ž ,๐‘Ž,๐‘žโ€ฒ ,๐‘ฃ) โƒ—, Without loss of generality, we can assume that there are no transi- ๐‘ฃโˆˆC โƒ— ๎ˆฏ 1 (๐‘ž ,๐‘Ž,๐‘ž โ€ฒ ) s.t. (๐‘ž ,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) โƒ— is a transition in ๎ˆฎ1 tions out of ๐น in ๎ˆญ๐‘’1 . If this is not the case, then a new state ๐‘ž๐‘“ can โ€ฒ โƒ— is a transition in ๎ˆฎ } is the set of where Z๎ˆฎ1 = {z1,(๐‘ž ,๐‘Ž,๐‘žโ€ฒ ,๐‘ฃ) โƒ— โˆฃ (๐‘ž , ๐‘Ž, ๐‘ž , ๐‘ฃ) 1 be introduced to transform ๎ˆญ๐‘’1 into a CEFA satisfying this constraint, โ€ฒ โƒ— variables z1,(๐‘ž ,๐‘Ž,๐‘žโ€ฒ ,๐‘ฃ)โƒ— denoting the number of occurrences of (๐‘ž , ๐‘Ž, ๐‘ž , ๐‘ฃ). where ๐‘ž๐‘“ is the unique accepting state. Then we define ๎ˆฎ = (๐‘…โˆช {๐‘Ÿโ€ฒ }, ๐‘„โˆช โ€ฒ Note that compared to ๐œƒ, ๐œƒ uses the LIA formula ๐œ“๎ˆฎ1 , which is {๐‘ž๐‘  }, ๐›ด , ๐›ฟ โ€ฒ , ๐ผ , ๐‘„ โˆช {๐‘ž๐‘  }, ๐›ผ โ€ฒ ), where constructed from ๎ˆฎ1 by ignoring its accepting condition and tak- โ€ข ๐‘Ÿโ€ฒ is a new register not in ๐‘…, ing it as an NFA where the alphabet is seen as the set of transi- tions. Furthermore, ๐œƒ โ€ฒ relates the values of the variables z1,๐‘ฃโƒ— and โ€ข ๐‘ž๐‘  is a new state not in ๐‘„, โ€ข ๐›ฟ โ€ฒ comprises the following transitions, โƒ— , while simultaneously constraining the values of z1,๐‘ฃโƒ— to be z1,(๐‘ž ,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) m(z1,๐‘ฃโƒ— ). We then submit ๐œƒ โ€ฒ to SMT solver, which returns a model โ€“ (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , (๐‘ฃ, โƒ— 0)) such that (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) โƒ— โˆˆ ๐›ฟ and ๐‘ž โ€ฒ โˆ‰ ๐น , of ๐œƒ โ€ฒ , in particular, the values of the variables z1,(๐‘ž,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) โƒ— . Equipped โƒ— 0)) such that there do not exist ๐‘Ž-labeled transi- โ€“ (๐‘ž , ๐‘Ž, ๐‘ž๐‘  , (0, with the model, we refine the transition system by considering the tions out of ๐‘ž, that is, for every ๐‘ž โ€ฒ and ๐‘ฃ, โƒ— we have (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) โƒ— โˆ‰ nodes (๐‘ž , (n1,(๐‘ž,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) โƒ— is a transition in ๎ˆฎ1 ), where n1,(๐‘ž ,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) โƒ— )(๐‘ž ,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) โƒ— de- ๐›ฟ, notes the number of remaining occurrences of (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ). โƒ— Note that โ€“ (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , (๐‘ฃ, โƒ— 1)) such that (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) โƒ— โˆˆ ๐›ฟ and ๐‘ž โ€ฒ โˆˆ ๐น , this refinement can avoid searching useless transitions. For instance, if n1,(๐‘ž,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) โ€ฒ โƒ— then we can avoid searching the โƒ— โ€“ (๐‘ž๐‘  , ๐‘Ž, ๐‘ž๐‘  , (0, 0)) such that ๐‘Ž โˆˆ ๐›ด, โƒ— = 0 for some (๐‘ž , ๐‘Ž, ๐‘ž , ๐‘ฃ), transition (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) โƒ— out of ๐‘ž. This stands in contrast to the original โ€ข ๐›ผ โ€ฒ is defined as ๐‘Ÿโ€ฒ = 1 โ†’ ยฌ๐›ผ. transition system where n1,๐‘ฃโƒ— is used, it may be the case that we have 9 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 n1,๐‘ฃโƒ— > 0 (since there may be multiple transitions associated with ๐‘ฃ) โƒ— and Based on this template, we generate 500 instances. the transition (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) โƒ— should still be searched. For the benchmarks that contain the nestings of counting operators Moreover, we propose a method to reduce the number of transitions and a complement operator, we utilize the following template in the transition system further. ๐‘ฅ โˆˆ ๐‘’๐‘ ๐‘œ๐‘š๐‘1 โˆง ๐‘ฅ โˆˆ ๐‘’๐‘ ๐‘œ๐‘š๐‘2 โˆง ๐‘ฅ โˆˆ ๐‘’๐‘ ๐‘Ž๐‘›๐‘– โˆง |๐‘ฅ| > 50, Pruning the transitions in the transition system. Let the current state of {๐‘š ,๐‘› } {๐‘š ,๐‘› } {๐‘š ,๐‘› } {๐‘š ,๐‘› } the transition system be (๐‘, (n1,(๐‘ž ,๐‘Ž,๐‘žโ€ฒ ,๐‘ฃ) where both ๐‘’๐‘ ๐‘œ๐‘š๐‘1 and ๐‘’๐‘ ๐‘œ๐‘š๐‘2 are of the form ๐‘’1 1 1 ๐‘’2 2 2 ๐‘’3 3 3 ๐‘’4 4 4 , โƒ— is a transition in ๎ˆฎ1 ). Let โƒ— )(๐‘ž ,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) ๎‰€๐‘ denote the set of transitions that are reachable from the state ๐‘ in such that for each ๐‘– โˆˆ [4], 20 โ‰ค ๐‘š๐‘– โ‰ค ๐‘›๐‘– โ‰ค 1000 and ๐‘’๐‘– contains ๎ˆฎ1 . If {(๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) no counting operators. Note that two regular expressions ๐‘’๐‘ ๐‘œ๐‘š๐‘1 and โƒ— > 0} โŠˆ ๎‰€๐‘ , in other words, there exists โƒ— โˆฃ n1,(๐‘ž ,๐‘Ž,๐‘žโ€ฒ ,๐‘ฃ) some transitions (๐‘ž , ๐‘Ž, ๐‘ž โ€ฒ , ๐‘ฃ) โƒ— not reachable from the state ๐‘ in ๎ˆฎ1 while ๐‘’๐‘ ๐‘œ๐‘š๐‘2 are introduced to have a better balance between the numbers of satisfiable and unsatisfiable instances. Based on this template, we also โƒ— > 0 is established in the current state (๐‘, (n1,(๐‘ž ,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) n1,(๐‘ž,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) โƒ— )(๐‘ž ,๐‘Ž,๐‘ž โ€ฒ ,๐‘ฃ) โƒ— is a transition in ๎ˆฎ1 ), then we know that it is impossible to reach the generate 500 instances. final state (๐‘ž๐‘“ , (0, โ€ฆ , 0)) from the current state so that we can remove In total, the NestC benchmark suite contains 1000 instances. all the transitions out of the current state in the transition system. Note that the two templates are chosen so that on the one hand, the nestings of counting operators or nestings of counting operators and complement operators are available, and on the other hand, some 9. Experiments additional regular and length constraints are introduced, so that the constraints are not trivially satisfied, in particular, the constraints are We implemented the procedures and optimizations in Sections 6โ€“8 not satisfied by very short strings. on top of OSTRICH, resulting in a string solver called OSTRICHRECL . In this section, we evaluate the performance of OSTRICHRECL on three Distribution of problem instances w.r.t. counting bounds. The distribution benchmark suites, that is, RegCoL, AutomatArk, and NestC. In the of problem instances w.r.t. the counting bounds in the three suites is sequel, we first describe the three benchmark suites as well as the shown in Fig. 5, where the ๐‘ฅ-axis represents the counting bound and experiment setup. Then we present the experiment results. We evaluate the ๐‘ฆ-axis represents the number of problem instances containing the the performance and correctness of our solver against the state-of-the- counting bound. In our statistics, while most problem instances contain art string constraint solvers, including CVC5 [3], Z3seq [1], Z3str3 [6], only small bounds, there are still around 3000 (about 6%) of them using Z3str3RE [8], and OSTRICH [11]. We also compare OSTRICHRECL with large counting bounds (i.e., greater than or equal to 50). several variants of itself, to evaluate the effectiveness of the technical Experiment setup. All experiments are conducted on CentOS Stream choices and optimizations in Sections 6โ€“8. release 8 with 4 Intel(R) Xeon(R) Platinum 8269CY 3.10 GHz CPU cores and 190 GB memory. We use the zaligvinder framework [42] to execute 9.1. Benchmark suites and the experiment setup the experiments, with a timeout of 60 s per instance. Our experiments utilize three benchmark suites, namely, RegCoL, 9.2. Performance evaluation against the other solvers AutomatArk, and NestC. There are 49,843 instances in total. All bench- mark instances are in the SMTLIB2 format. In the sequel, we give more We evaluate the performance of OSTRICHRECL against the state- details of these three benchmark suites. of-the-art string constraint solvers, including CVC5 [3], Z3seq [1], RegCoL benchmark suite. There are 40,628 RECL instances in the Reg- Z3str3 [6], Z3str3RE [8], and OSTRICH [11], on RegCoL, AutomatArk CoL suite. These instances are generated by extracting regexes with and NestC benchmark suites. The experiment results can be found in counting operators from the open source regex library [27,37] and Table 1. Note that we take the results of CVC5 as the ground truth,3 manually constructing a RECL constraint ๐‘ฅ โˆˆ ๐‘’ โˆง ๐‘ฅ โˆˆ ๐‘’๐‘ ๐‘Ž๐‘›๐‘– โˆง |๐‘ฅ| > and the results different from the ground truth are classified as error. 10 for each regex ๐‘’, where ๐‘’๐‘ ๐‘Ž๐‘›๐‘– โ‰ก ๐›ด โˆ— (โŸจ+โŸฉ +โ€ฒ +โ€ฒโ€ฒ + &)๐›ด โˆ— is a regular We can see that OSTRICHRECL solves almost all instances, except 182 expression that sanitizes all occurrence of special characters <, >, โ€ฒ , โ€ฒโ€ฒ , of them, that is, it solves 49,638 instances correctly. The number is or &. The expression ๐‘’๐‘ ๐‘Ž๐‘›๐‘– is introduced in view of the fact that these 4,547/1,917/13,618/2,764/3,134 more than the number of instances characters are usually sanitized in Web browsers to alleviate the XSS solved by CVC5/Z3str3RE/Z3str3/Z3seq/OSTRICH respectively. More- attacks [16,38]. over, OSTRICHRECL is the second fastest solver, whose average time on each instance is close to the fastest solver Z3str3RE (2.88 s versus 1.99 AutomatArk benchmark suite. This benchmark suite is adapted from the s). AutomatArk suite [39] by picking out the string constraints containing Furthermore, in order to see how OSTRICHRECL performs on the counting operators. We also add the length constraint |๐‘ฅ| > 10 for RECL constraints where the counting and length bounds are large or each string variable ๐‘ฅ. There are 8215 instances in the AutomatArk when the counting operators are nested with some other counting op- suite. Note that the original AutomatArk benchmark suite [39] in- erators or with complement operators, we also compare OSTRICHRECL cludes 19,979 instances, which are conjunctions of regular membership with the other solvers solely on the NestC benchmark suite. The ex- queries generated out of regular expressions in [40]. periment results can be found in Table 2. From the results, we can see NestC benchmark suite. We generate this benchmark suite by utilizing that OSTRICHRECL outperforms the other solvers in both the number the string fuzzing tool stringfuzz [41]. Note that stringfuzz is slightly of solved instances and the average time. Specifically, OSTRICHRECL extended by adding the counting operators. For the benchmarks that solves 873 out of 1000 instances, while the best state-of-the-art solver contain the nestings of counting operators, we use the following tem- (Z3seq) solves only 518 instances. Moreover, while the average time plate of OSTRICHRECL is only slightly better than Z3str3RE (19.49 s versus 20.09 s), OSTRICHRECL solves much more instances than it (873 versus ๐‘ฅ โˆˆ ๐‘’๐‘›๐‘’๐‘ ๐‘ก โˆง ๐‘ฅ โˆˆ ๐‘’๐‘ ๐‘Ž๐‘›๐‘– โˆง |๐‘ฅ| > 50 170). The results show the great advantage of OSTRICHRECL against the where ๐‘’๐‘›๐‘’๐‘ ๐‘ก is of the form other solvers on the NestC benchmark suite. {๐‘š ,๐‘› } {๐‘š ,๐‘› } {๐‘š ,๐‘› } {๐‘š ,๐‘› } (๐‘’1 1 1 ๐‘’2 2 2 ๐‘’3 3 3 ๐‘’4 4 4 ){๐‘š5 ,๐‘›5 } , 3 Initially, we used the majority vote of the results of the solvers as the such that 0 โ‰ค ๐‘š๐‘– โ‰ค ๐‘›๐‘– โ‰ค 20 for each ๐‘– โˆˆ [5] and ๐‘’๐‘– contains no counting ground truth. Nevertheless, on some problem instances, all the results of the operators for each ๐‘– โˆˆ [4], moreover, three solvers in the Z3 family are wrong (after manual inspection), thus failing this approach on these instances. ๐‘’๐‘ ๐‘Ž๐‘›๐‘– โ‰ก ๐›ด โˆ— (โŸจ+โŸฉ +โ€ฒ +โ€ฒโ€ฒ + &)๐›ด โˆ— . 10 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 Fig. 5. Distribution of problem instances w.r.t. counting bounds. 9.3. Evaluation of the technical choices and optimizations in Sections 6 -8 RECL constraints, the optimization for the nesting of counting operators and a complement operator plays a greater role since it allows solving At first, we do experiments to evaluate the effectiveness of the around 1/3 more instances. automata size-reduction techniques (i.e., Step 2) in Section 6. Then we justify our technical choices of reducing the satisfiability prob- Evaluation of the optimizations for the model generation in Section 8. lem to that of LIA formulas and resort to the SMT solvers to solve We compare the performance of OSTRICHRECL with OSTRICHRECL โˆ’MODEL , them. Specifically, we compare OSTRICHRECL with a variant where where OSTRICHRECL โˆ’MODEL is obtained from OSTRICHRECL by removing the nuXmv model checker like [18] is used to solve the ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) the optimizations for the model generation. problem. We also evaluate the effectiveness of the optimizations in The experiment results can be found in Table 5. From the results, we Section 7, i.e., the optimizations for the nesting of counting operators as can see that compared to OSTRICHRECL , OSTRICHRECL โˆ’MODEL solves 349 well as the nesting of counting operators and a complement operator. fewer instances and is 23% slower on average (3.57 s versus 2.88 s). Finally, we evaluate the effectiveness of the optimizations for the model As a result, the optimization for the model generation does improve generation in Section 8. the performance, especially when the sizes of the CEFAs are huge. Note that the optimization for model generation only affects the satisfiable Evaluation of the technical choices and optimizations in Section 6. We instances, that is, those instances where the generated LIA formulas can compare OSTRICHRECL with OSTRICHRECL NUXMV , which is a variant of be solved, but the generation of models for these RECL constraints is OSTRICHRECL where the nuXmv model checker is used to solve the unsuccessful. ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) problem. Moreover, we also compare OSTRICHRECL with OSTRICHRECL โˆ’ASR , which is obtained from OSTRICHRECL by removing the automata size-reduction technique (i.e. Step 2 in Section 6.2). 10. Conclusion The experiment results can be found in Table 3. From the results, we can see that OSTRICHRECL solves 2532 more instances and is 2.38 This work proposed an efficient automata-theoretical approach for times faster than OSTRICHRECLNUXMV . Therefore, the technical choice in solving string constraints with regex-counting and string-length. The Section 6 where LIA formulas are computed and then solved by the SMT approach is based on encoding counting operators in regular expres- solvers is reasonable, since OSTRICHRECL with this technical choice sions by cost registers symbolically instead of unfolding them explicitly. is more efficient than OSTRICHRECL NUXMV , where nuXmv is used to solve Moreover, this work proposed various optimization techniques to im- the ๐–ญ๐–ค๐–ซ๐–จ๐–  (๐–ข๐–ค๐–ฅ๐– ) problem. Moreover, OSTRICHRECL solves 1083 more prove the performance further. Finally, we used three benchmark suites instances and is 1.65 times faster than OSTRICHRECL โˆ’ASR . From the results, comprising 49,843 instances in total to evaluate the performance of our we can see that the automata-size reduction technique indeed plays an approach. The experimental results show that our approach can solve important role for the performance improvement. more instances than the best state-of-the-art solver, at a comparable Evaluation of the optimizations in Section 7. We also compare the perfor- or faster speed, especially when the counting and length bounds are mance of OSTRICHRECL with its variants OSTRICHRECL and large or when counting operators are nested with some other counting โˆ’NEST OSTRICHRECL , where OSTRICH RECL is obtained from OSTRICH RECL operators or complement operators. โˆ’COMP โˆ’NEST by removing the optimization for the nesting of counting operators, RECL and OSTRICHโˆ’COMP is obtained by removing the optimization for CRediT authorship contribution statement the nesting of counting operators and a complement operator. Since these optimizations are not targeted for register-representable RECL Denghang Hu: Writing โ€“ original draft, Methodology. Zhilin Wu: constraints and there are very few non-register-representable RECL con- Writing โ€“ review & editing. straints in the RegCoL and AutomatArk benchmark suites, we restrict our attention to the NestC benchmark suite here. The experiment results can be found in Table 4. From the results, we Declaration of competing interest can see that on 1000 problem instances, OSTRICHRECL solves 82/339 more instances and is 1.02/1.92 times faster than OSTRICHRECL โˆ’NEST / The authors declare that they have no known competing finan- OSTRICHRECLโˆ’COMP respectively. As a result, while both optimizations do cial interests or personal relationships that could have appeared to play a role in the performance improvement in solving these complex influence the work reported in this paper. 11 D. Hu and Z. Wu Journal of Systems Architecture 160 (2025) 103340 References [19] W. Gelade, M. Gyssens, W. Martens, Regular expressions with counting: Weak versus strong determinism, SIAM J. Comput. 41 (1) (2012) 160โ€“190, http: //dx.doi.org/10.1137/100814196. [1] L.M. de Moura, N. Bjรธrner, Z3: An efficient SMT solver, in: Tools and Algorithms [20] H. Chen, P. Lu, Checking determinism of regular expressions with counting, for the Construction and Analysis of Systems, 14th International Conference, Inform. and Comput. 241 (2015) 302โ€“320, http://dx.doi.org/10.1016/j.ic.2014. TACAS 2008, Held As Part of the Joint European Conferences on Theory and 12.001. Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008. [21] B. Loring, D. Mitchell, J. Kinder, Sound regular expression semantics for dynamic Proceedings, 2008, pp. 337โ€“340, http://dx.doi.org/10.1007/978-3-540-78800- symbolic execution of JavaScript, in: Proceedings of the 40th ACM SIGPLAN 3_24. Conference on Programming Language Design and Implementation, PLDI 2019, [2] T. Liang, A. Reynolds, C. Tinelli, C. Barrett, M. Deters, A DPLL(T) theory solver Phoenix, AZ, USA, June 22-26, 2019, ACM, 2019, pp. 425โ€“438, http://dx.doi. for a theory of strings and regular expressions, in: CAV, 2014, pp. 646โ€“662, org/10.1145/3314221.3314645. http://dx.doi.org/10.1007/978-3-319-08867-9_43. [22] T. Chen, A. Flores-Lamas, M. Hague, Z. Han, D. Hu, S. Kan, A.W. Lin, P. [3] H. Barbosa, C. Barrett, M. Brain, G. Kremer, H. Lachnitt, M. Mann, A. Mohamed, Rรผmmer, Z. Wu, Solving string constraints with regex-dependent functions M. Mohamed, A. Niemetz, A. Nรถtzli, A. Ozdemir, M. Preiner, A. Reynolds, Y. through transducers with priorities and variables, Proc. ACM Program. Lang. Sheng, C. Tinelli, Y. Zohar, CVC5: A versatile and industrial-strength SMT solver, 6 (POPL) (2022) 1โ€“31, http://dx.doi.org/10.1145/3498707. in: D. Fisman, G. Rosu (Eds.), Tools and Algorithms for the Construction and [23] M. Kaminski, N. Francez, Finite-memory automata, in: Proceedings [1990] 31st Analysis of Systems, Springer International Publishing, Cham, 2022, pp. 415โ€“442. Annual Symposium on Foundations of Computer Science, Vol. 2, 1990, pp. [4] Y. Zheng, X. Zhang, V. Ganesh, Z3-str: a Z3-based string solver for web 683โ€“688, http://dx.doi.org/10.1109/FSCS.1990.89590. application analysis, in: ESEC/SIGSOFT FSE, 2013, pp. 114โ€“124, http://dx.doi. [24] L. Dโ€™Antoni, T. Ferreira, M. Sammartino, A. Silva, Symbolic register automata, in: org/10.1145/2491411.2491456. I. Dillig, S. Tasiran (Eds.), Computer Aided Verification, Springer International [5] Y. Zheng, V. Ganesh, S. Subramanian, O. Tripp, J. Dolby, X. Zhang, Effective Publishing, Cham, 2019, pp. 3โ€“21. search-space pruning for solvers of string equations, regular expressions and [25] M.L. Minsky, Computation: Finite and Infinite Machines, in: Prentice-Hall Series length constraints, in: Computer Aided Verification - 27th International Confer- in Automatic Computation, Prentice-Hall, 1967. ence, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I, [26] F. Klaedtke, H. RueรŸ, Monadic second-order logics with cardinalities, in: J.C.M. Springer, 2015, pp. 235โ€“254, http://dx.doi.org/10.1007/978-3-319-21690-4_14. Baeten, J.K. Lenstra, J. Parrow, G.J. Woeginger (Eds.), Automata, Languages and [6] M. Berzish, V. Ganesh, Y. Zheng, Z3str3: A string solver with theory-aware Programming, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, pp. 681โ€“696. heuristics, in: 2017 Formal Methods in Computer Aided Design, FMCAD 2017, [27] L. Turoลˆovรก, L. Holรญk, O. Lengรกl, O. Saarikivi, M. Veanes, T. Vojnar, Regex Vienna, Austria, October 2-6, 2017, 2017, pp. 55โ€“59, http://dx.doi.org/10. matching with counting-set automata, Proc. ACM Program. Lang. 4 (OOPSLA) 23919/FMCAD.2017.8102241. (2020) http://dx.doi.org/10.1145/3428286. [7] M. Berzish, Z3str4: A Solver for Theories over Strings (Ph.D. thesis), University [28] L. Holรญk, J. Sรญc, L. Turonovรก, T. Vojnar, Fast matching of regular patterns with of Waterloo, Ontario, Canada, 2021, URL https://hdl.handle.net/10012/17102. synchronizing counting, in: O. Kupferman, P. Sobocinski (Eds.), FoSSaCS 2023, in: Lecture Notes in Computer Science, vol. 13992, Springer, 2023, pp. 392โ€“412. [8] M. Berzish, J.D. Day, V. Ganesh, M. Kulczynski, F. Manea, F. Mora, D. Nowotka, [29] A. Le Glaunec, L. Kong, K. Mamouras, Regular expression matching using bit Towards more efficient methods for solving regular-expression heavy string vector automata, Proc. ACM Program. Lang. 7 (OOPSLA1) (2023) http://dx.doi. constraints, Theoret. Comput. Sci. 943 (2023) 50โ€“72. org/10.1145/3586044. [9] Diep Bui and contributors, Z3-trau, 2019, https://github.com/diepbp/z3-trau. [30] D. Hu, Z. Wu, String constraints with regex-counting and string-length solved [10] P.A. Abdulla, M.F. Atig, Y.-F. Chen, B.P. Diep, J. Dolby, P. Jankลฏ, H.-H. Lin, L. more efficiently, in: H. Hermanns, J. Sun, L. Bu (Eds.), Dependable Software Holรญk, W.-C. Wu, Efficient handling of string-number conversion, in: Proceedings Engineering. Theories, Tools, and Applications, Springer Nature Singapore, of the 41st ACM SIGPLAN Conference on Programming Language Design and Singapore, 2024, pp. 1โ€“20. Implementation, in: PLDI 2020, Association for Computing Machinery, New York, [31] J.E. Hopcroft, J.D. Ullman, Introduction to Automata Theory, Languages and NY, USA, 2020, pp. 943โ€“957, http://dx.doi.org/10.1145/3385412.3386034. Computation, Addison-Wesley, 1979. [11] T. Chen, M. Hague, A.W. Lin, P. Rรผmmer, Z. Wu, Decision procedures for path [32] H. Seidl, T. Schwentick, A. Muscholl, P. Habermehl, Counting in trees for free, in: feasibility of string-manipulating programs with complex operations, PACMPL 3 Automata, Languages and Programming: 31st International Colloquium, ICALP (POPL) (2019) http://dx.doi.org/10.1145/3290362. 2004, Turku, Finland, July 12-16, 2004. Proceedings, 2004, pp. 1136โ€“1149, [12] H.-E. Wang, S.-Y. Chen, F. Yu, J.-H.R. Jiang, A symbolic model checking http://dx.doi.org/10.1007/978-3-540-27836-8_94. approach to the analysis of string and length constraints, in: Proceedings of the [33] K.N. Verma, H. Seidl, T. Schwentick, On the complexity of equational Horn 33rd ACM/IEEE International Conference on Automated Software Engineering, clauses, in: CADE, 2005, pp. 337โ€“352. in: ASE 2018, ACM, 2018, pp. 623โ€“633, http://dx.doi.org/10.1145/3238147. [34] C. Haase, A survival guide to Presburger arithmetic, ACM SIGLOG News 5 (3) 3238189. (2018) 67โ€“82, http://dx.doi.org/10.1145/3242953.3242964. [13] C. Chapman, K.T. Stolee, Exploring regular expression usage and context in [35] R. Cavada, A. Cimatti, M. Dorigatti, A. Griggio, A. Mariotti, A. Micheli, S. Python, in: A. Zeller, A. Roychoudhury (Eds.), Proceedings of the 25th Interna- Mover, M. Roveri, S. Tonetta, The nuXmv symbolic model checker, in: Computer tional Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrรผcken, Aided Verification - 26th International Conference, CAV 2014, Held As Part Germany, July 18-20, 2016, ACM, 2016, pp. 282โ€“293, http://dx.doi.org/10. of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18-22, 2014. 1145/2931037.2931073. Proceedings, 2014, pp. 334โ€“342. [14] J.C. Davis, C.A. Coghlan, F. Servant, D. Lee, The impact of regular expression [36] K. Thompson, Programming techniques: Regular expression search algorithm, denial of service (ReDoS) in practice: An empirical study at the ecosystem Commun. ACM 11 (6) (1968) 419โ€“422, http://dx.doi.org/10.1145/363347. scale, in: Proceedings of the 2018 26th ACM Joint Meeting on European 363387. Software Engineering Conference and Symposium on the Foundations of Software [37] J.C. Davis, L.G. Michael IV, C.A. Coghlan, F. Servant, D. Lee, Why arenโ€™t Engineering, in: ESEC/FSE 2018, Association for Computing Machinery, New regular expressions a Lingua Franca? An empirical study on the re-use and York, NY, USA, 2018, pp. 246โ€“256. portability of regular expressions, in: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on [15] P. Wang, K.T. Stolee, How well are regular expressions tested in the wild? in: the Foundations of Software Engineering, in: ESEC/FSE 2019, Association for G.T. Leavens, A. Garcia, C.S. Pasareanu (Eds.), Proceedings of the 2018 ACM Computing Machinery, New York, NY, USA, 2019, pp. 443โ€“454, http://dx.doi. Joint Meeting on European Software Engineering Conference and Symposium on org/10.1145/3338906.3338909. the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena [38] T. Chen, Y. Chen, M. Hague, A.W. Lin, Z. Wu, What is decidable about string Vista, FL, USA, November 04-09, 2018, ACM, 2018, pp. 668โ€“678. constraints with the ReplaceAll function, PACMPL 2 (POPL) (2018) 3:1โ€“3:29, [16] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, D. Song, A symbolic http://dx.doi.org/10.1145/3158091. execution framework for JavaScript, in: 2010 IEEE Symposium on Security and [39] M. Berzish, M. Kulczynski, F. Mora, F. Manea, J.D. Day, D. Nowotka, V. Ganesh, Privacy, 2010, pp. 513โ€“528, http://dx.doi.org/10.1109/SP.2010.38. An SMT solver for regular expressions and linear arithmetic over string length, in: [17] T. Liang, N. Tsiskaridze, A. Reynolds, C. Tinelli, C.W. Barrett, A decision A. Silva, K.R.M. Leino (Eds.), Computer Aided Verification, Springer International procedure for regular membership and length constraints over unbounded Publishing, Cham, 2021, pp. 289โ€“312. strings, in: C. Lutz, S. Ranise (Eds.), Frontiers of Combining Systems - 10th [40] L. Dโ€™Antoni, Automatark: Automata benchmark, 2018, URL https://github.com/ International Symposium, FroCoS 2015, Wroclaw, Poland, September 21-24, lorisdanto/automatark. 2015. Proceedings, in: Lecture Notes in Computer Science, vol. 9322, Springer, [41] D. Blotsky, F. Mora, M. Berzish, Y. Zheng, I. Kabir, V. Ganesh, StringFuzz: A 2015, pp. 135โ€“150. fuzzer for string solvers, in: H. Chockler, G. Weissenbacher (Eds.), Computer [18] T. Chen, M. Hague, J. He, D. Hu, A.W. Lin, P. Rรผmmer, Z. Wu, A decision Aided Verification, Springer International Publishing, Cham, 2018, pp. 45โ€“51. procedure for path feasibility of string manipulating programs with integer data [42] M. Kulczynski, F. Manea, D. Nowotka, D.B. Poulsen, ZaligVinder: A generic test type, in: D.V. Hung, O. Sokolsky (Eds.), Automated Technology for Verification framework for string solvers, J. Software: Evol. Process. 35 (4) (2023) e2400, and Analysis - 18th International Symposium, ATVA 2020, Hanoi, Vietnam, http://dx.doi.org/10.1002/smr.2400, arXiv:https://onlinelibrary.wiley.com/doi/ October 19-23, 2020, Proceedings, in: Lecture Notes in Computer Science, 12302, pdf/10.1002/smr.2400. Springer, 2020, pp. 325โ€“342, http://dx.doi.org/10.1007/978-3-030-59152-6_18. 12