Reduced-Size Integer Linear Programming Models for String Selection Problems: Application to the Farthest String Problem

Abstract

We present integer programming models for some variants of the farthest string problem. The number of variables and constraints is substantially smaller than in the integer linear programming models known in the literature. Moreover, the solution of the linear programming relaxation contains only a small proportion of noninteger values, which considerably simplifies the rounding process. Numerical tests have shown excellent results, especially when a small set of long sequences is given.

1. Introduction

String selection and comparison problems have numerous applications, principally in computational biology, but also in coding theory, data compression, and quantitative linguistics. For instance, genomic and proteomic data can be modeled as sequences (strings) over the alphabets of nucleotides or amino acids; see, for example, Blazewicz et al. (2005), Boucher (2010), and Pappalardo et al. (2013). The formalization of problems like motif recognition and similar tasks leads to diverse combinatorial optimization problems like the closest (sub)string problem, the farthest (sub)string problem, the close to most strings problem, the far from most strings problem (FFMSP), and the distinguishing string selection problem; see, for example, Soleimani-damaneh (2001), Lanctot et al. (2003), Meneses et al. (2005), Festa (2007), Zörnig (2011), and Ferone et al. (2013).

In the present article we study some variants of the farthest string problem (FSP) that are generally NP-hard. The solution of these problems is very difficult, in particular in most biological applications where the sequences are very long; see Blazewicz et al. (2005) and Zörnig (2011, p. 3). FSPs are frequently modeled as (zero–one) integer linear programming (ILP) problems; see, for example, Lanctot et al. (2003), Meneses et al. (2005), and Festa and Pardalos (2012). Our main objective is to generalize the size reduction approach of Zörnig (2011), which has so far never been addressed in the literature, and apply it to the FSP. We show that at least for a small set of sequences of arbitrary length, several variants of the FSP can be solved (exactly or with a very small error in the optimal value), by merely solving the linear programming (LP) relaxation of the ILP problem and subsequent rounding of noninteger solution values. After introducing the necessary concepts, we model the FSP as an integer linear programming problem, considering two cases of feasible sets (sec. 3.1 and sec. 3.2). The number of variables and constraints is substantially less than that in the ILPs presented so far in the literature. In sections 3.3 and 3.4 we consider two further variants of the FSP. In one of them, the objective is to minimize the total sum of distances; the other is the FFMSP. By means of various test examples, we demonstrate that our proposed models can be easily solved. Some concluding remarks are given in section 4.

2. Notations and Basic Concepts

Extending some earlier ideas (Zörnig, 2011), we provide the theoretical basis for reduced-size ILP models for string selection problems.

Consider an alphabet $\Omega = \{1, \ldots, \omega\}$ with $\omega \in \mathbb{N}$, whose elements are called characters. By Ω^m we denote the set of all sequences of length m over Ω.

For any two sequences $s, t \in \Omega^m$, the Hamming distance d(s, t) between s and t is defined as the number of positions in which s and t differ.

Many string selection problems are of the following general form:

Let $\Sigma = \{s^1, \ldots, s^n\}$ be a set of n sequences with $s^i = (s_1^i, \ldots, s_m^i) \in \Omega^m$ for $i = 1, \ldots, n$. The problem is given by the string matrix

$$S = \left( \begin{matrix} s_1^1 & \cdots & s_m^1 \\ \vdots & & \vdots \\ s_1^n & \cdots & s_m^n \end{matrix} \right) \tag{2.1}$$

whose rows consist of the sequences in Σ. Let V_j denote the set of elements appearing in the jth column of S. Furthermore, let v_j:=|V_j| be the cardinality of V_j and $V := \bigcup\nolimits_{j=1}^m V_j$ the set of all characters appearing in the matrix S. The goal is to determine a string $t = (t_1, \ldots, t_m)$ that maximizes or minimizes a certain objective expressed in terms of Hamming distances between strings. The string t is usually constructed by choosing the element t_j from the set V_j; that is, the feasible solution set is $X = V_1 \times \ldots \times V_m$. In particular, in the FSP the function $f(t) := \min_{i = 1, \ldots, n} d(s^i, t)$ is maximized over X.
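As a baseline for the definitions above, the FSP objective can be evaluated by exhaustive enumeration of X. The following is only a minimal sketch: the function names and the toy instance `S_toy` are hypothetical and not taken from the article, and the search is exponential in m, so it is usable only for very short sequences.

```python
from itertools import product


def hamming(s, t):
    """Number of positions in which s and t differ."""
    return sum(a != b for a, b in zip(s, t))


def farthest_string_bruteforce(S):
    """Maximize f(t) = min_i d(s^i, t) over X = V_1 x ... x V_m."""
    # V_j: characters occurring in column j of the string matrix.
    V = [sorted(set(column)) for column in zip(*S)]
    best_t, best_f = None, -1
    for t in product(*V):
        f = min(hamming(s, t) for s in S)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Hypothetical toy instance (n = 3, m = 3), not taken from the article.
S_toy = [(1, 1, 1), (1, 2, 2), (2, 1, 2)]
t_best, f_best = farthest_string_bruteforce(S_toy)
print(t_best, f_best)  # (2, 2, 1) 2
```

The enumeration visits all v_1 · … · v_m candidates, which is exactly the growth that a compact ILP formulation is meant to avoid.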

We study some characteristics of the columns of Equation (2.1), which are n-dimensional vectors.

Definition 2.1

(i) A column of Equation (2.1) is called complete if it contains all characters of the set V. Otherwise, the column is called incomplete.

(ii) The induced set partition of a vector $a = (a_1, \ldots, a_n)$ over the alphabet Ω is the partition of the set $\{1, \ldots, n\}$ whose parts correspond to the indices of identical values of a.

(iii) Two vectors $a = (a_1, \ldots, a_n)$ and $b = (b_1, \ldots, b_n)$ over Ω are called isomorphic if they induce the same set partition.

Clearly, the above vectors a and b are isomorphic if and only if $a_i = a_j \Leftrightarrow b_i = b_j$ holds for all $i, j \in \{1, \ldots, n\}$.

Example 2.1

Consider the vectors a=(1, 1, 2, 3, 2, 3, 1, 2, 2, 4), b=(2, 2, 4, 3, 4, 3, 2, 4, 4, 1), and c=(2, 3, 2, 3, 3, 1, 4, 4, 1, 2) of length n=10 over the alphabet $\Omega = \{1, \ldots, 4\}$. The components a₁, a₂, and a₇ of a are identical (and all other a_i are different from these). Thus, we obtain the part {1, 2, 7} of the partition. In the same way we obtain the parts {3, 5, 8, 9}, {4, 6}, and {10}. Hence, the vector a induces the partition $\{1, \ldots, 10\} = \{1, 2, 7\} \cup \{3, 5, 8, 9\} \cup \{4, 6\} \cup \{10\}$. The same partition is obtained for the vector b, but the vector c induces another partition: $\{1, \ldots, 10\} = \{6, 9\} \cup \{1, 3, 10\} \cup \{2, 4, 5\} \cup \{7, 8\}$. Thus, a and b are isomorphic, while a and c are not.
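The induced partition and the isomorphism test of Definition 2.1 can be sketched in a few lines. The helper names `induced_partition` and `isomorphic` are illustrative choices, not notation from the article; indices are 1-based as in the text.

```python
def induced_partition(a):
    """Return the set partition of {1, ..., n} induced by the vector a.

    Each part collects the (1-based) positions carrying the same character.
    """
    parts = {}
    for index, value in enumerate(a, start=1):
        parts.setdefault(value, set()).add(index)
    # A partition is an unordered collection of parts.
    return {frozenset(p) for p in parts.values()}


def isomorphic(a, b):
    """Two vectors are isomorphic if they induce the same set partition."""
    return len(a) == len(b) and induced_partition(a) == induced_partition(b)


a = (1, 1, 2, 3, 2, 3, 1, 2, 2, 4)
b = (2, 2, 4, 3, 4, 3, 2, 4, 4, 1)
c = (2, 3, 2, 3, 3, 1, 4, 4, 1, 2)

print(isomorphic(a, b))  # True
print(isomorphic(a, c))  # False
```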

We now define a unique representative for any isomorphism class of vectors.

Definition 2.2

Let $\{1, \ldots, n\} = C_1 \cup \ldots \cup C_r$ be a partition into r parts, where the C_i are labeled such that $\min(C_1) < \min(C_2) < \ldots < \min(C_r)$. Then, the representative vector of the partition is defined as the vector having the number i in the positions corresponding to the part $C_i$ $(i = 1, \ldots, r)$.

Example 2.2

Consider again the partition $\{1, \ldots, 10\} = \{6, 9\} \cup \{1, 3, 10\} \cup \{2, 4, 5\} \cup \{7, 8\}$. The minima of the parts are 6, 1, 2, and 7. Thus, the parts with minima 1, 2, 6, and 7 are denoted by C₁, C₂, C₃, and C₄, respectively, resulting in C₁={1, 3, 10}, C₂={2, 4, 5}, C₃={6, 9}, and C₄={7, 8}. The representative vector of the partition is therefore (1, 2, 1, 2, 2, 3, 4, 4, 3, 1).
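For a vector, the representative vector can be computed directly by relabeling characters in order of first occurrence, since the first occurrence of a character is the minimum of its part. This is a minimal sketch; the function name `representative_vector` is an assumption made for illustration.

```python
def representative_vector(a):
    """Relabel the characters of a by order of first occurrence.

    The part containing position 1 gets label 1, the part with the next
    smallest minimum gets label 2, and so on (Definition 2.2).
    """
    labels = {}
    rep = []
    for value in a:
        if value not in labels:
            labels[value] = len(labels) + 1
        rep.append(labels[value])
    return tuple(rep)


# Vector c of Example 2.1, which induces the partition of Example 2.2.
c = (2, 3, 2, 3, 3, 1, 4, 4, 1, 2)
print(representative_vector(c))  # (1, 2, 1, 2, 2, 3, 4, 4, 3, 1)
```

Two vectors are isomorphic exactly when this function returns the same tuple for both, which gives a convenient dictionary key for grouping columns.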

We are now able to define a normalized form of a string problem.

Definition 2.3

The representative vector of a vector $a = (a_1, \ldots, a_n)$ over Ω is defined as the representative vector of the induced partition of a.

Consider the string matrix S in Equation (2.1). The normalized matrix of S, denoted by T, is the matrix obtained from S by replacing each column with its representative vector.

The numbers m_i of identical copies of representative vectors are called multiplicities. The matrix R, whose columns are the different representative vectors, is called the representative matrix.

Example 2.3

Consider the following matrix S with n=3, m=10, and ω=4:

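$$S: \quad \begin{array}{l|cccccccccc} \text{Position} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline s^1 & 1 & 4 & 2 & 3 & 4 & 2 & 4 & 2 & 4 & 3 \\ s^2 & 3 & 3 & 1 & 3 & 2 & 4 & 1 & 2 & 1 & 3 \\ s^3 & 4 & 3 & 2 & 1 & 3 & 2 & 1 & 3 & 3 & 1 \end{array}$$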

The corresponding normalized string matrix is

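$$T: \quad \begin{array}{l|cccccccccc} \text{Position} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline t^1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ t^2 & 2 & 2 & 2 & 1 & 2 & 2 & 2 & 1 & 2 & 1 \\ t^3 & 3 & 2 & 1 & 2 & 3 & 1 & 2 & 2 & 3 & 2 \end{array}$$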

By ordering the columns of T, we obtain

$$\tilde{T}: \quad \begin{array}{l|ccc|cc|cc|ccc} \text{Position} & 4 & 8 & 10 & 3 & 6 & 2 & 7 & 1 & 5 & 9 \\ \hline t^1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ t^2 & 1 & 1 & 1 & 2 & 2 & 2 & 2 & 2 & 2 & 2 \\ t^3 & 2 & 2 & 2 & 1 & 1 & 2 & 2 & 3 & 3 & 3 \\ \hline \text{Column group} & 1 & 1 & 1 & 2 & 2 & 3 & 3 & 4 & 4 & 4 \end{array}$$

Thus, the representative matrix is

$$R = \left( \begin{matrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 2 & 1 & 2 & 3 \end{matrix} \right)$$

with multiplicities $(m_1, \ldots, m_4) = (3, 2, 2, 3)$. In principle, another representative vector corresponding to the trivial partition into one part exists, where all components are identical to 1. However, this vector occurs only when at least one column vector of the string matrix (2.1) consists of identical components. But such columns can be eliminated, since determining the corresponding elements of the searched sequence is trivial.
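The construction of Example 2.3 can be reproduced mechanically. The sketch below is one possible implementation under stated assumptions: the function names are hypothetical, the representative vectors are ordered lexicographically (which happens to match the column groups of the example), and constant columns are skipped when forming the representative matrix, as suggested in the text.

```python
from collections import Counter


def representative_vector(column):
    """Relabel characters by order of first occurrence."""
    labels = {}
    return tuple(labels.setdefault(ch, len(labels) + 1) for ch in column)


def normalize(S):
    """Return the rows of the normalized matrix T and its list of columns."""
    reps = [representative_vector(col) for col in zip(*S)]
    T = [tuple(row) for row in zip(*reps)]
    return T, reps


def representative_matrix(reps):
    """Distinct non-constant representative vectors and their multiplicities."""
    counts = Counter(r for r in reps if max(r) > 1)
    ordered = sorted(counts)  # lexicographic order (an illustrative choice)
    return ordered, [counts[r] for r in ordered]


S = [
    (1, 4, 2, 3, 4, 2, 4, 2, 4, 3),
    (3, 3, 1, 3, 2, 4, 1, 2, 1, 3),
    (4, 3, 2, 1, 3, 2, 1, 3, 3, 1),
]
T, reps = normalize(S)
R_columns, multiplicities = representative_matrix(reps)
print(R_columns)       # [(1, 1, 2), (1, 2, 1), (1, 2, 2), (1, 2, 3)]
print(multiplicities)  # [3, 2, 2, 3]
```

Note that `R_columns` lists the columns of R, so the matrix shown in the example is its transpose.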

To any feasible solution $(s_1, \ldots, s_m)$ of the problem S, we assign biuniquely a sequence $(t_1, \ldots, t_m)$ of T as follows: If s_j is identical to the ith element of column j of S, we set t_j equal to the ith element of the jth column of the normalized matrix T. It can be easily verified that $(t_1, \ldots, t_m)$ is well-defined: if s_j occurs in several rows of column j, the corresponding elements of column j of T coincide. Formally, a biunique mapping from $X = V_1 \times \ldots \times V_m$ to the feasible solution set $W = \{1, \ldots, v_1\} \times \{1, \ldots, v_2\} \times \ldots \times \{1, \ldots, v_m\}$ is defined. For instance, the feasible solution (3, 3, 2, 3, 2, 4, 1, 2, 4, 3) of X in Example 2.3 corresponds to the feasible solution (2, 2, 1, 1, 2, 2, 2, 1, 1, 1) of W. For example, the first element 3 is the second element of column 1 of S and is therefore mapped to the second element of column 1 of T, which is 2.

It is a crucial fact that this mapping preserves the Hamming distance; that is, the Hamming distance between two elements of X equals the Hamming distance between the corresponding elements of W. Thus, in modeling string problems one can generally work with the feasible solution space W instead of X. This may reduce the size of the ILP model significantly, since all vectors of an isomorphism class can be considered simultaneously in the formulation of the model (see sec. 3.1). In the remainder of this article, the modeling will be based on a normalized matrix T (except for sec. 3.3).
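The mapping from X to W and its distance-preserving property can be checked numerically on Example 2.3. This is a small sketch only; `to_normalized` is a hypothetical helper name, and the check covers the distances between the example solution and the rows of S, not a general proof.

```python
def hamming(s, t):
    """Number of positions in which s and t differ."""
    return sum(a != b for a, b in zip(s, t))


def representative_vector(column):
    """Relabel characters by order of first occurrence."""
    labels = {}
    return tuple(labels.setdefault(ch, len(labels) + 1) for ch in column)


def to_normalized(S, x):
    """Map x in X = V_1 x ... x V_m to the corresponding element of W."""
    w = []
    for column, x_j in zip(zip(*S), x):
        rep = representative_vector(column)
        i = column.index(x_j)  # any row holding x_j yields the same label
        w.append(rep[i])
    return tuple(w)


S = [
    (1, 4, 2, 3, 4, 2, 4, 2, 4, 3),
    (3, 3, 1, 3, 2, 4, 1, 2, 1, 3),
    (4, 3, 2, 1, 3, 2, 1, 3, 3, 1),
]
x = (3, 3, 2, 3, 2, 4, 1, 2, 4, 3)
w = to_normalized(S, x)
print(w)  # (2, 2, 1, 1, 2, 2, 2, 1, 1, 1)

# The rows of S are mapped to the rows of T, and distances are unchanged.
T = [to_normalized(S, s) for s in S]
print([hamming(s, x) for s in S] == [hamming(t, w) for t in T])  # True
```

`column.index` raises `ValueError` when x_j does not occur in column j, which corresponds to x lying outside the feasible set X.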

Finally, we use the notations $\lceil x \rceil$ and $\lfloor x \rfloor$ for the smallest integer greater than or equal to x and the largest integer smaller than or equal to x, respectively.

3. Solving Some Variants of the FSP

In contrast to the closest string problem (CSP), it is not immediately obvious how the feasible solution set should be defined for the FSP. In the former case, one seeks a sequence that is as close as possible to all sequences of a set Σ. It is clear that the element x_j of a solution candidate $x = (x_1, \ldots, x_m)$ should be selected from the set V_j; that is, the feasible solution set is

$$X = V_1 \times \ldots \times V_m. \tag{3.1}$$

The situation is different in the case of the FSP. Since one seeks a sequence as far as possible from all strings in Σ, it is in general not clear from which character sets the elements of a solution sequence should be chosen. This depends on the respective application.

In the following, we study two cases of feasible solution sets X, considered in the literature. In the first case, X is defined as in Equation (3.1); see, for example, Festa and Pardalos (2012, sec. 1.2). In the second case, the extended feasible solution set

$$X = V^m \tag{3.2}$$

is considered [see, e.g., condition (7) in the model of Meneses et al. (2005, sec. 4.2)].

3.1. Restricted feasible solution set

In analogy to the CSP model (5) in Zörnig (2011, p. 6), the FSP with feasible solution set (3.1) can be modeled by the integer linear programming problem:

$$\begin{aligned} & \max \ d \\ & \text{s.t.} \\ & m - \sum_{j=1}^{k} y_{r_{i,j},\, j} \ge d \quad \text{for } i = 1, \ldots, n, \\ & \sum_{i=1}^{v_j} y_{i,j} = m_j \quad \text{for } j = 1, \ldots, k, \\ & d,\ y_{i,j} \ \text{nonnegative integers}. \end{aligned} \tag{3.3}$$

The variable $y_{i,j}$ represents the frequency of the character i in the positions of the solution sequence $t = (t_1, \ldots, t_m)$ that correspond to the jth representative vector. The parameters k, m_j, and v_j denote the number of representative vectors, the multiplicity of the jth representative vector, and the number of different characters in the jth representative vector, respectively. The length of the sequences is $m = m_1 + \ldots + m_k$.

The left side of the ith inequality represents the Hamming distance between the row $t^i$ of T and the sequence encoded by the $y_{i,j}$. (Note that the first index $r_{i,j}$ of the variables in the inequalities denotes the ith element of the jth representative vector.) The equations express the fact that the frequencies of characters $1, 2, \ldots, v_j$ in the jth isomorphism class sum up to m_j.

The practical utility of the above model becomes clear from Proposition 3.1 below, which is based on the following rounding rule.

Definition 3.1

Let $\bar{y}_{i,j}$ denote the optimal solution values of the LP-relaxation of Equation (3.3) and $f_{i,j} := \bar{y}_{i,j} - \lfloor \bar{y}_{i,j} \rfloor$ the fractional parts ($0 \le f_{i,j} < 1$ for $i = 1, \ldots, v_j$, $j = 1, \ldots, k$). We define $F_j := \sum\nolimits_{i=1}^{v_j} f_{i,j}$, which is an integer from the set $\{0, \ldots, v_j - 1\}$, since the equations of (3.3) are satisfied. A standard rounding rule for problem (3.3) is defined as follows. Let $j \in \{1, \ldots, k\}$ be any fixed integer. Among the numbers $\bar{y}_{i,j}$ $(i = 1, \ldots, v_j)$, select the F_j values with the largest fractional parts (which need not be uniquely determined) and round them up. Round down the remaining values $\bar{y}_{i,j}$.

Obviously, this rounding procedure results in a feasible solution of the ILP problem (3.3), which will be called the standard rounding solution.
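The standard rounding rule can be sketched for a single representative vector j as follows. The function name and the use of exact `Fraction` arithmetic are illustrative assumptions; ties between equal fractional parts are broken by position, which is one of the admissible choices.

```python
from fractions import Fraction
from math import floor


def standard_rounding(y_bar_j):
    """Round the LP values (y_bar[1][j], ..., y_bar[v_j][j]) of one column j.

    The values are assumed to sum to the integer multiplicity m_j, so the
    fractional parts sum to an integer F_j.
    """
    floors = [floor(y) for y in y_bar_j]
    fracs = [y - f for y, f in zip(y_bar_j, floors)]
    F_j = round(sum(fracs))
    # Indices sorted by decreasing fractional part (stable for ties).
    order = sorted(range(len(fracs)), key=lambda i: fracs[i], reverse=True)
    round_up = set(order[:F_j])
    return [floors[i] + (1 if i in round_up else 0) for i in range(len(fracs))]


# Hypothetical LP values for a column with v_j = 3 and m_j = 3.
y_bar = [Fraction(4, 3), Fraction(4, 3), Fraction(1, 3)]
print(standard_rounding(y_bar))  # [2, 1, 0]
```

Because exactly F_j values are rounded up and the rest are rounded down, the rounded values again sum to m_j, so the equations of the model remain satisfied.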

Proposition 3.1

Let d be the optimal value of the LP-relaxation of problem (3.3). Then, the objective value $\bar{d}$ of a standard rounding solution satisfies $\bar{d} \ge d - k$.

Proof. When each value $\bar{y}_{i,j}$ is replaced by its rounded value, each summand in the sums of the inequalities in problem (3.3) increases by at most 1. Since each sum has k summands, the left side of each inequality decreases by at most k.

The result implies that for a small number k of representative vectors, the standard rounding solution is a very good approximate solution for the FSP when d is large in comparison with k (see Table 1).

Table 1. Standard Rounding Solution for the FSP Model (3.3)

No.   m         N    d_relax       d_srs      Max. error

n = 6, ω = 2, k = 31
1     11,263    5    7534.833      7534       0
2     10,965    2    7310.5        7310       0
3     12,368    5    8076.667      8076       0
4     14,046    5    9228.833      9228       0
5     9290      4    6048          6047       1
6     10,858    4    7151.167      7151       0
7     10,306    5    6818.167      6818       0
8     13,434    4    8721.833      8721       0
9     15,943    5    10,337.67     10,337     0
10    18,821    5    12,193.17     12,192     1

n = 8, ω = 2, k = 127
11    737       3    473.5         473        0
12    1463      7    938.125       937        1
13    1122      7    758.6         757        1
14    2924      5    1875.875      1875       0
15    5903      5    3782.875      3782       0
16    11,861    6    7597.375      7596       1
17    23,929    6    15,335.12     15,334     1
18    47,465    5    30,424.88     30,424     0
19    36,907    2    24,858        24,857     1
20    35,923    3    24,290.5      24,290     0

n = 6, ω = 5, k = 15
21    2794      8    2328.333      2328       0
22    3305      10   2754.167      2754       0
23    4394      9    3515.5        3514       1
24    5487      12   4398.333      4398       0
25    6034      11   5068.833      5067       1
26    7856      10   6284.667      6282       2
27    8698      6    7393.333      7393       0
28    9482      8    7775.167      7773       2
29    10,670    10   8962          8961       1
30    11,355    8    9084.833      9084       0

n = 8, ω = 7, k = 28
31    5530      13   4838.75       4838       0
32    10,420    14   8336.625      8335       1
33    19,673    10   17,705.875    17,705     0
34    38,912    13   33,075.375    33,074     1
35    78,452    9    67,468.75     67,468     0
36    150,230   8    130,700.13    130,698    2
37    230,722   10   189,192.25    189,189    3
38    295,015   12   236,012.38    236,012    0
39    450,723   8    396,636.88    396,633    3
40    950,643   13   751,007.75    751,006    1

Example 3.1

For the problem in Example 2.3, we have k=4, v₁=v₂=v₃=2, and v₄=3. The ILP model (3.3) takes the form

$$\begin{aligned}
& \max \ d \\
& \text{s.t.} \\
& 10 - y_{1,1} - y_{1,2} - y_{1,3} - y_{1,4} \ge d, \\
& 10 - y_{1,1} - y_{2,2} - y_{2,3} - y_{2,4} \ge d, \\
& 10 - y_{2,1} - y_{1,2} - y_{2,3} - y_{3,4} \ge d, \\
& \qquad y_{1,1} + y_{2,1} = m_1, \\
& \qquad y_{1,2} + y_{2,2} = m_2, \\
& \qquad y_{1,3} + y_{2,3} = m_3, \\
& \qquad y_{1,4} + y_{2,4} + y_{3,4} = m_4,
\end{aligned} \tag{3.4}$$

$d, y_{i,j}$ nonnegative integers,

and the multiplicities are $(m_1, \ldots, m_4) = (3, 2, 2, 3)$. The solution of the LP-relaxation is

$$\begin{aligned}
y_{1,1} &= 0, & y_{1,2} &= 0, & y_{1,3} &= 2, & y_{1,4} &= 1.\bar{3}, \\
y_{2,1} &= 3, & y_{2,2} &= 2, & y_{2,3} &= 0, & y_{2,4} &= 1.\bar{3}, \\
 & & & & & & y_{3,4} &= 0.\bar{3}
\end{aligned} \tag{3.5}$$

with optimal value $d = 6.\bar{6}$. One of the standard rounding solutions is therefore given by $y_{1,4}=1$, $y_{2,4}=1$, and $y_{3,4}=1$ [where the other values in system (3.5) remain unchanged]. The corresponding objective value is $6 = \lfloor 6.\bar{6} \rfloor$. Since the optimal value of the integer problem (3.4) is an integer that cannot exceed the optimal value of the LP-relaxation, an optimal solution of problem (3.4) has been found.
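The numbers in this example can be checked with a few lines of exact arithmetic. The following is only a sketch: the helper names `objective` and `column_sums_ok` and the dictionary layout of the variables are assumptions made for illustration, and the matrix `R` is read off the three inequalities of problem (3.4). The LP solution (3.5) and the rounded solution are entered by hand, not computed by a solver.

```python
from fractions import Fraction
from math import floor

# Rows: the three sequences; columns: the k = 4 column groups.
# R[i][j] is the first index of the y-variable that sequence i uses in group j,
# as read off the three inequalities of problem (3.4).
R = [
    [1, 1, 1, 1],
    [1, 2, 2, 2],
    [2, 1, 2, 3],
]
MULT = [3, 2, 2, 3]  # multiplicities m_1, ..., m_4


def objective(y, R, mult):
    """Largest d permitted by the inequalities: min_i (m - sum_j y[r_ij, j])."""
    m = sum(mult)
    return min(
        m - sum(y[(r, j)] for j, r in enumerate(row, start=1))
        for row in R
    )


def column_sums_ok(y, mult):
    """Check the equality constraints sum_i y[i, j] = m_j for every group j."""
    return all(
        sum(v for (_, col), v in y.items() if col == j) == m_j
        for j, m_j in enumerate(mult, start=1)
    )


third = Fraction(1, 3)
# LP solution (3.5), keyed by (i, j).
y_lp = {
    (1, 1): 0, (2, 1): 3,
    (1, 2): 0, (2, 2): 2,
    (1, 3): 2, (2, 3): 0,
    (1, 4): 1 + third, (2, 4): 1 + third, (3, 4): third,
}
# Standard rounding solution quoted in the example: only group 4 changes.
y_srs = {**y_lp, (1, 4): 1, (2, 4): 1, (3, 4): 1}

d_lp = objective(y_lp, R, MULT)
d_srs = objective(y_srs, R, MULT)
print(d_lp, d_srs, floor(d_lp))  # 20/3 6 6
```

Both solutions satisfy the equality constraints, the rounded value 6 equals the floor of the LP value 20/3, and it also respects the bound $\bar{d} \ge d - k$ with k = 4.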

The optimal solution corresponding to the string matrix $\overline{T}$ in Example 2.3 is therefore $\tilde{t} = (2, 2, 2 \mid 2, 2 \mid 1, 1 \mid 1, 2, 3)$, corresponding to the solution t=(1, 1, 2, 2, 2, 2, 1, 2, 3, 2) of T and to the solution s=(1, 4, 1, 1, 2, 4, 4, 3, 3, 1) of S.

Since problem (3.4) is a very special case, we also consider the following somewhat more complex example.

Example 3.2

Consider the problem (3.3) with 6 binary sequences (n=6, ω=2). The representative matrix is

$$R = \begin{pmatrix} r_{1,1} & \cdots & r_{1,31} \\ \vdots & & \vdots \\ r_{6,1} & \cdots & r_{6,31} \end{pmatrix} \tag{3.6}$$

where

$$\begin{pmatrix} r_{1,1} & \cdots & r_{1,6} \\ \vdots & & \vdots \\ r_{6,1} & \cdots & r_{6,6} \end{pmatrix} = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 \\
2 & 2 & 1 & 1 & 1 & 1 \\
2 & 1 & 2 & 1 & 1 & 1 \\
2 & 1 & 1 & 2 & 1 & 1 \\
2 & 1 & 1 & 1 & 2 & 1 \\
2 & 1 & 1 & 1 & 1 & 2
\end{pmatrix}$$

corresponds to partitions into two parts of size 1 and 5, respectively;

$$\begin{pmatrix} r_{1,7} & \cdots & r_{1,21} \\ \vdots & & \vdots \\ r_{6,7} & \cdots & r_{6,21} \end{pmatrix} = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 2 & 2 & 2 & 2 & 2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 & 1 & 1 \\
2 & 1 & 2 & 2 & 2 & 2 & 1 & 1 & 1 & 2 & 2 & 2 & 1 & 1 & 1 \\
2 & 2 & 1 & 2 & 2 & 1 & 2 & 1 & 1 & 2 & 1 & 1 & 2 & 2 & 1 \\
2 & 2 & 2 & 1 & 2 & 1 & 1 & 2 & 1 & 1 & 2 & 1 & 2 & 1 & 2 \\
2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 & 2 & 1 & 1 & 2 & 1 & 2 & 2
\end{pmatrix}$$

corresponds to partitions into two parts of size 2 and 4, respectively; and

$$\begin{pmatrix} r_{1,22} & \cdots & r_{1,31} \\ \vdots & & \vdots \\ r_{6,22} & \cdots & r_{6,31} \end{pmatrix} = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 & 2 & 2 \\
1 & 2 & 2 & 2 & 1 & 1 & 1 & 2 & 2 & 2 \\
2 & 1 & 2 & 2 & 1 & 2 & 2 & 1 & 1 & 2 \\
2 & 2 & 1 & 2 & 2 & 1 & 2 & 1 & 2 & 1 \\
2 & 2 & 2 & 1 & 2 & 2 & 1 & 2 & 1 & 1
\end{pmatrix}$$

corresponds to partitions into two parts of size 3. Problem (3.3) now has the form

$$\begin{aligned}
& \max \ d \\
& \text{s.t.} \\
& y_{1,1} + y_{1,2} + y_{1,3} + \ldots + y_{1,31} + d \le m, \\
& y_{2,1} + y_{2,2} + y_{1,3} + \ldots + y_{2,31} + d \le m, \\
& y_{2,1} + y_{1,2} + y_{2,3} + \ldots + y_{2,31} + d \le m, \\
& y_{2,1} + y_{1,2} + y_{1,3} + \ldots + y_{2,31} + d \le m, \\
& y_{2,1} + y_{1,2} + y_{1,3} + \ldots + y_{1,31} + d \le m, \\
& y_{2,1} + y_{1,2} + y_{1,3} + \ldots + y_{1,31} + d \le m, \\
& y_{1,j} + y_{2,j} = m_j \quad \text{for } j = 1, \ldots, 31,
\end{aligned} \tag{3.7}$$

$d, y_{i,j}$ nonnegative integers,

where $m = m_1 + \ldots + m_{31}$, k=S(6, 2)=31 (S(n, ω) denotes the Stirling number of the second kind), and, in the ith inequality, the first index of the variable belonging to column j is the element $r_{i,j}$ of the matrix (3.6). In the present binary case, one can reduce the number of variables further by substituting $m_j - y_{1,j}$ for $y_{2,j}$ and setting $y_j := y_{1,j}$. One can verify that this yields the problem

$$\begin{aligned}
& \max \ d \\
& \text{s.t.} \\
& \quad Q \begin{pmatrix} y_1 \\ \vdots \\ y_{31} \end{pmatrix} \le \begin{pmatrix} M_1 - d \\ \vdots \\ M_6 - d \end{pmatrix}
\end{aligned} \tag{3.8}$$

$y_j \le m_j$ for $j = 1, \ldots, 31$,

$d, y_j$ nonnegative integers,

where the 6 × 31 matrix $Q = (q_{i,j})$ has entries 1 and −1 such that $q_{i,j} = 1$ if $r_{i,j} = 1$ and $q_{i,j} = -1$ if $r_{i,j} = 2$ ($i = 1, \ldots, 6$, $j = 1, \ldots, 31$). The constants $M_i$ are defined by $M_i = \sum\nolimits_{j: q_{i,j} = 1} m_j$ for $i = 1, \ldots, 6$.
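The substitution that leads from the two-index form (3.7) to the reduced form (3.8) can be checked numerically. The sketch below is illustrative only: the function names are assumptions, and the tiny instance (3 binary sequences with the S(3, 2)=3 representative vectors and made-up multiplicities) is hypothetical and not taken from the article. It compares, row by row, the upper bound on d given by $m - \sum_j y_{r_{i,j},j}$ with the bound $M_i - (Qy)_i$.

```python
import random


def to_reduced_form(R, mult):
    """Build Q (entries +1/-1) and M_i = sum of m_j over columns with r_ij = 1."""
    Q = [[1 if r == 1 else -1 for r in row] for row in R]
    M = [sum(m_j for r, m_j in zip(row, mult) if r == 1) for row in R]
    return Q, M


def bounds_two_index(R, mult, y1):
    """Upper bounds on d from rows of type (3.7), using y[2, j] = m_j - y[1, j]."""
    m = sum(mult)
    return [
        m - sum(y if r == 1 else m_j - y for r, m_j, y in zip(row, mult, y1))
        for row in R
    ]


def bounds_reduced(Q, M, y):
    """Upper bounds on d from rows of type (3.8): M_i - (Q y)_i."""
    return [
        M_i - sum(q * y_j for q, y_j in zip(q_row, y))
        for q_row, M_i in zip(Q, M)
    ]


# Hypothetical instance: columns are the representative vectors
# (1,2,2), (1,1,2), (1,2,1); the multiplicities are made up.
R = [[1, 1, 1], [2, 1, 2], [2, 2, 1]]
MULT = [4, 5, 3]
Q, M = to_reduced_form(R, MULT)

rng = random.Random(0)
for _ in range(100):
    y = [rng.randint(0, m_j) for m_j in MULT]
    assert bounds_two_index(R, MULT, y) == bounds_reduced(Q, M, y)
print(Q, M)
```

The two bounds agree for every feasible y because $m - \sum_{j: r_{i,j}=1} y_j - \sum_{j: r_{i,j}=2} (m_j - y_j) = M_i - \sum_j q_{i,j} y_j$.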

Consider, for example, the randomly generated parameter values $(m_1, \ldots, m_{31})$=(633, 291, 813, 91, 889, 501, 375, 265, 218, 599, 326, 259, 128, 572, 515, 295, 359, 195, 459, 315, 215, 651, 143, 141, 378, 194, 604, 215, 230, 148, 246), summing up to m=11,263. The LP-relaxation has the solution (y₁₀, y₁₄, y₂₂, y₂₄, y₂₅, y₂₇, y₂₉, y₃₁)=(41.16667, 14.33333, 290.1667, 141, 250.1667, 604, 79.66667, 246) (where only nonzero values are listed) with optimal value 7534.833. The corresponding standard rounding solution (y₁₀, y₁₄, y₂₂, y₂₄, y₂₅, y₂₇, y₂₉, y₃₁)=(41, 14, 290, 141, 250, 604, 80, 246) has the objective value 7534 and is therefore optimal for problem (3.8).

Note that a nontrivial FSP with 6 “long” sequences has been solved without needing any technique to ensure the integrality requirements.

The model (3.3) has $1 + \sum\nolimits_{j=1}^k v_j$ variables and k + n linear constraints, independent of the length m of the sequences. Therefore, its size is generally much smaller than that of the conventional models of Festa and Pardalos (2012, sec. 1.2) and Meneses et al. (2005, sec. 4.2), where the size increases linearly with the length of the sequences. In particular, for an FSP with 3 sequences (see problem (3.4)), we get k=4, v₁=v₂=v₃=2, v₄=3, and the model has 10 variables and 7 linear constraints, whatever the length of the sequences may be. For example, the standard rounding solution is the exact solution of problem (3.4) for $m_1 = \ldots = m_4 = 500$, corresponding to the sequence length m=2000. By contrast, the models just mentioned would require several thousand constraints and binary variables to solve an FSP with 3 sequences of that length.

If some possible representative vectors do not occur in the normalized problem, the model size can be reduced even further. For example, if the third column group in problem (3.4) is empty, it follows that m₃=0. Thus, $y_{1,3} = y_{2,3} = 0$; that is, one can eliminate these variables and the third equation from problem (3.4).

We now determine the standard rounding solution for test examples with randomly generated multiplicities. It turns out that the corresponding objective value is always very close to the optimal value of the FSP. In fact, the maximal possible error observed in the 40 test examples of Table 1 is 3. The standard rounding solution can be easily determined, since in every instance only a small fraction of the optimal values of the LP-relaxation is noninteger.

For each test instance the following information is provided:

n: number of sequences

ω: alphabet size

k: number of representative vectors

m: sequence length

N: number of noninteger solution values

d_relax: optimal value of the LP-relaxation

d_srs: objective value of the standard rounding solution

Max. error: $\lfloor d_{\rm relax} \rfloor - d_{\rm srs}$

From the examples in Table 1 it becomes clear that the FSP for a sequence length up to about 1 million can be easily solved with model (3.3) for small values of n and ω. Moreover, the observed objective value of the standard rounding solution is usually much better than the lower bound given by Proposition 3.1. This is an encouraging result, since finding an approximate solution for string selection problems is considered a very difficult task; see Boucher et al. (2012, p. 1).

In 22 of 40 cases the standard rounding solution is in fact the exact solution of the FSP. In the remaining cases the maximum possible error is 3. In these cases the verification of optimality by means of a Branch-and-Bound method may require a high computational effort. In particular, the integer solver of LINGO could not solve the ILP for example no. 10 of Table 1 in about 15 million iterations. However, an absolute error of only 3 is surely negligible for practical applications with very long sequences.
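The counts quoted here follow directly from the columns d_relax and d_srs of Table 1. As a quick check (a sketch only; the variable names are illustrative and the values are copied from Table 1 as printed), the maximum error $\lfloor d_{\rm relax} \rfloor - d_{\rm srs}$ can be recomputed for all 40 instances. An error of 0 certifies optimality, because the optimal value of the integer problem lies between d_srs and $\lfloor d_{\rm relax} \rfloor$.

```python
import math

# (d_relax, d_srs) for instances 1-40, copied from Table 1.
TABLE_1 = [
    (7534.833, 7534), (7310.5, 7310), (8076.667, 8076), (9228.833, 9228),
    (6048, 6047), (7151.167, 7151), (6818.167, 6818), (8721.833, 8721),
    (10337.67, 10337), (12193.17, 12192),
    (473.5, 473), (938.125, 937), (758.6, 757), (1875.875, 1875),
    (3782.875, 3782), (7597.375, 7596), (15335.12, 15334), (30424.88, 30424),
    (24858, 24857), (24290.5, 24290),
    (2328.333, 2328), (2754.167, 2754), (3515.5, 3514), (4398.333, 4398),
    (5068.833, 5067), (6284.667, 6282), (7393.333, 7393), (7775.167, 7773),
    (8962, 8961), (9084.833, 9084),
    (4838.75, 4838), (8336.625, 8335), (17705.875, 17705), (33075.375, 33074),
    (67468.75, 67468), (130700.13, 130698), (189192.25, 189189),
    (236012.38, 236012), (396636.88, 396633), (751007.75, 751006),
]
# Number k of representative vectors for each block of ten instances.
K_VALUES = [31] * 10 + [127] * 10 + [15] * 10 + [28] * 10

errors = [math.floor(d_relax) - d_srs for d_relax, d_srs in TABLE_1]
exact_cases = sum(1 for e in errors if e == 0)
worst_error = max(errors)

# Lower bound of Proposition 3.1: d_srs >= d_relax - k.
bound_holds = all(
    d_srs >= d_relax - k for (d_relax, d_srs), k in zip(TABLE_1, K_VALUES)
)
print(exact_cases, worst_error, bound_holds)  # 22 3 True
```

In every instance the observed gap is far smaller than the guaranteed bound k of Proposition 3.1.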

3.2. Extended feasible solution set

We now assume that the feasible solution set is given by Equation (3.2); that is, in determining the farthest string, any position can be occupied by any element of the set V. We assume that V=Ω; that is, all characters appear at least once in the string matrix (2.1). Otherwise, there exists an $\omega_0 \in \Omega \setminus V$ such that $(\omega_0, \ldots, \omega_0) \in \Omega^m$ is an optimal solution of the FSP. Furthermore, we assume that the first r columns of the matrix (2.1) are complete, while the others are incomplete. In the construction of an optimal sequence $t = (t_1, \ldots, t_m)$, one can set

$$t_i = \omega_i \quad \text{for } i = r + 1, \ldots, m \tag{3.9}$$

where $\omega_i \in V \setminus V_i$ is a character not occurring in the ith column of matrix (2.1). Thus, the incomplete columns can be excluded from further consideration. The problem simplifies considerably when the proportion of incomplete columns is large (see sec. 4). In particular, any FSP with ω > n is trivial, since all columns of matrix (2.1) are incomplete in this case and an optimal solution is given by Equation (3.9). This observation might be interesting for problems with sequences over a large alphabet, which may degrade the performance of a string selection algorithm considerably; see Kuksa and Pavlovic (2009).
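The treatment of incomplete columns in Equation (3.9) amounts to a single pass over the columns. The following sketch assumes that sequences are given as lists of characters and that the alphabet is passed explicitly; the function name `fill_incomplete_columns` and the small instance with ω=3 > n=2 are hypothetical.

```python
def fill_incomplete_columns(S, alphabet):
    """Fix t_j for every incomplete column j of the string matrix S.

    Returns (t, complete), where t[j] is a character absent from column j
    (or None if column j is complete) and complete lists the indices of the
    complete columns that remain to be handled by the ILP model.
    """
    m = len(S[0])
    t = [None] * m
    complete = []
    for j in range(m):
        present = {row[j] for row in S}
        missing = sorted(set(alphabet) - present)
        if missing:
            t[j] = missing[0]   # any absent character works; take the smallest
        else:
            complete.append(j)
    return t, complete


# Hypothetical instance with omega = 3 > n = 2: every column is incomplete.
S = [
    [1, 2, 3, 1],
    [2, 2, 1, 3],
]
t, complete = fill_incomplete_columns(S, alphabet=[1, 2, 3])
distances = [sum(a != b for a, b in zip(row, t)) for row in S]
print(t, complete, distances)  # [3, 1, 2, 2] [] [4, 4]
```

Since no column is complete, the resulting sequence differs from every given sequence in all m=4 positions, which illustrates why the case ω > n is trivial.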

We now solve the FSP for ω ≤ n. Let S be the string matrix of an FSP after deletion of incomplete columns. Then, there exist k=S(n, ω) representative vectors, corresponding to the partitions of the set $\{1, \ldots, n\}$ into ω components. We now obtain the model

$$\begin{aligned}
& \max \ d \\
& \text{s.t.} \\
& m - \sum_{j=1}^k y_{r_{i,j},j} \ge d \quad \text{for } i = 1, \ldots, n, \\
& \quad \sum_{i=1}^{\omega} y_{i,j} = m_j \quad \text{for } j = 1, \ldots, k,
\end{aligned} \tag{3.10}$$

$d, y_{i,j}$ nonnegative integers,

which is a special case of problem (3.3). Now the numbers of variables and linear constraints are given by 1 + kω and n + k, respectively. These numbers are calculated in Table 2 for some values of n and ω.

Table 2.
Size of the Model (3.10)

| n | ω | k | Variables | Linear constraints |
|---|---|---|---|---|
| 4 | 3 | 6 | 19 | 10 |
| 5 | 3 | 25 | 76 | 30 |
| 6 | 4 | 65 | 261 | 71 |
| 6 | 3 | 90 | 271 | 96 |
| 7 | 6 | 21 | 127 | 28 |
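The entries of Table 2 can be recomputed from the standard recurrence for the Stirling numbers of the second kind, $S(n, k) = k\,S(n-1, k) + S(n-1, k-1)$. The sketch below uses illustrative function names (`stirling2`, `model_size`) and returns the triple (k, number of variables 1 + kω, number of linear constraints n + k).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: partitions of n items into k blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def model_size(n, omega):
    """Return (k, variables, linear constraints) of model (3.10)."""
    k = stirling2(n, omega)
    return k, 1 + k * omega, n + k


for n, omega in [(4, 3), (5, 3), (6, 4), (6, 3), (7, 6)]:
    print(n, omega, *model_size(n, omega))
```

The same function gives S(6, 2)=31 and S(8, 2)=127, the values of k used for the binary instances in Table 1.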

The case ω=n is of particular interest, since then the standard rounding solution is always optimal. Assume that incomplete columns have been removed from the matrix (2.1); that is, all remaining columns are permutations of the column vector $(1, \ldots, n)$. Thus, all these columns are isomorphic and one obtains the unique representative vector $(1, \ldots, n)$. Problem (3.10) now takes the form

$$\begin{aligned}
& \max \ d \\
& \text{s.t.} \\
& m - y_1 \ge d, \\
& \qquad \vdots \\
& m - y_n \ge d, \\
& y_1 + \ldots + y_n = m,
\end{aligned} \tag{3.11}$$

$d, y_i$ nonnegative integers,

where the variables $y_{i,1}$ have been renamed as $y_i$.

Proposition 3.2

An optimal solution of problem (3.11) is given by

$$y_1 = \ldots = y_n = \frac{m}{n} \tag{3.12}$$

if m/n is an integer. Otherwise, the standard rounding solution is optimal.

Proof. Evidently, the LP-relaxation of problem (3.11) has the optimal solution (3.12) with corresponding optimal value $d = m - \frac{m}{n}$. If $\frac{m}{n}$ is not an integer, the standard rounding solution has the objective value $m - \lceil \frac{m}{n} \rceil = \lfloor m - \frac{m}{n} \rfloor$, which is the largest integer not exceeding the optimal value of the LP-relaxation. ■

Example 3.3

Let

$$S = \begin{pmatrix}
1 & 3 & 1 & 3 & 1 & 2 & 1 & 3 & 4 \\
2 & 4 & 3 & 4 & 2 & 1 & 2 & 2 & 3 \\
3 & 1 & 2 & 2 & 3 & 4 & 4 & 1 & 2 \\
4 & 2 & 4 & 1 & 4 & 3 & 3 & 4 & 1
\end{pmatrix} \tag{3.13}$$

be the string matrix of an FSP (after deletion of incomplete columns) with n=4 and m=9. The normalized problem has the matrix

$$T = \begin{pmatrix}
1 & \cdots & 1 \\
2 & \cdots & 2 \\
3 & \cdots & 3 \\
4 & \cdots & 4
\end{pmatrix} \tag{3.14}$$

with 9 identical columns. The LP solution of problem (3.11) is $y_1 = \ldots = y_4 = 2.25$, yielding the standard rounding solution y₁=y₂=y₃=2 and y₄=3. Thus, an optimal solution $t = (t_1, \ldots, t_9)$ of the FSP given by Equation (3.14) contains the numbers 1, 2, 3, and 4 with frequencies 2, 2, 2, and 3, respectively. In particular, t=(1, 1, 2, 2, 3, 3, 4, 4, 4) is an optimal solution that corresponds to the solution (1, 3, 3, 4, 3, 4, 3, 4, 1) of matrix (3.13).
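The construction in this example can be written down directly. The sketch below assumes that every column of the input matrix is a permutation of the n distinct characters (the case ω=n after removal of incomplete columns); the function names are illustrative. It spreads the m positions as evenly as possible over the n rows, as in the standard rounding solution of Proposition 3.2, and maps the normalized solution back to the original matrix by copying, in column j, the character found in the chosen row.

```python
def balanced_frequencies(m, n):
    """Integers y_1..y_n summing to m, each equal to floor(m/n) or ceil(m/n)."""
    q, r = divmod(m, n)
    return [q] * (n - r) + [q + 1] * r


def farthest_string_complete(S):
    """Solve the FSP when every column of S is a permutation of n characters."""
    n, m = len(S), len(S[0])
    y = balanced_frequencies(m, n)
    # Normalized solution: row index i is used in y[i] consecutive columns.
    rows = [i for i, y_i in enumerate(y) for _ in range(y_i)]
    # In column j the normalized character i + 1 corresponds to S[i][j].
    t = [S[i][j] for j, i in enumerate(rows)]
    d = min(sum(a != b for a, b in zip(s, t)) for s in S)
    return y, t, d


# String matrix (3.13).
S = [
    [1, 3, 1, 3, 1, 2, 1, 3, 4],
    [2, 4, 3, 4, 2, 1, 2, 2, 3],
    [3, 1, 2, 2, 3, 4, 4, 1, 2],
    [4, 2, 4, 1, 4, 3, 3, 4, 1],
]
y, t, d = farthest_string_complete(S)
print(y, t, d)  # [2, 2, 2, 3] [1, 3, 3, 4, 3, 4, 3, 4, 1] 6
```

The minimal distance 6 equals $m - \lceil m/n \rceil = 9 - 3$, in agreement with Proposition 3.2.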

3.3. Maximizing the sum of distances

We now consider another variant of the FSP, where the objective is to maximize the sum of distances instead of the minimal distance. This yields the problem

$$\max \sum_{i=1}^n d(s^i, t) \tag{3.15}$$

where t is chosen from X given by Equation (3.1) or (3.2). This problem, studied in Cheng et al. (2004, sec. 3), is incomparably simpler than the maximin problem considered so far. Because of its simplicity, no normalization is necessary. The problem (3.15) can be decomposed into subproblems as follows. For given sequences $s^i = (s_1^i, \ldots, s_m^i)$ and $t = (t_1, \ldots, t_m)$, we define $d_i^{(j)} = 0$ if $s_j^i = t_j$ and $d_i^{(j)} = 1$ otherwise. The Hamming distance between $s^i$ and t can then be written as

$$d(s^i, t) = \sum_{j=1}^m d_i^{(j)} \tag{3.16}$$

and the sum (3.15) takes the form

$$\max \sum_{i=1}^n \sum_{j=1}^m d_i^{(j)} = \max \sum_{j=1}^m \sum_{i=1}^n d_i^{(j)}. \tag{3.17}$$

Now the expressions $\sum\nolimits_{i=1}^n d_i^{(j)}$ can be maximized independently of each other, and the overall solution sequence of Equation (3.17) is composed of these individual solutions. Clearly,

$$\sum_{i=1}^n d_i^{(j)} = n - f(t_j), \tag{3.18}$$

where $f(t_j)$ denotes the frequency of occurrence of the character $t_j$ in the jth column of the string matrix S. Thus, Equation (3.18) is maximized by setting $t_j$ equal to one of the rarest elements in the jth column of S. In particular, if this column is incomplete, one sets $t_j$ equal to a character that does not occur in this column, corresponding to $f(t_j) = 0$.

It is now clear that the FSP (3.15) can be solved easily in linear time in the size of the string matrix.

Example 3.4

Let
\begin{align}S = \left( \begin{matrix}1 & 1 & 2 & 1 & 3 & 2 \\ 2 & 4 & 3 & 1 & 2 & 4 \\ 3 & 2 & 1 & 3 & 3 & 4 \\ 3 & 1 & 2 & 2 & 4 & 4 \\ 3 & 4 & 4 & 3 & 4 & 3\end{matrix} \right) \tag{3.19}\end{align}

be the string matrix of an FSP with ω=4. If the feasible solution set is defined by Equation (3.1), an optimal solution of the FSP is given by (1, 2, 1, 2, 2, 3). In case of the definition (3.2), the sequence (4, 3, 4, 4, 1, 1) represents an optimal solution.
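The column-wise decomposition (3.17)–(3.18) can be sketched in a few lines of Python. The sketch is only an illustration: the function names `max_sum_farthest` and `total_distance` are hypothetical, and it assumes that definition (3.1) restricts t_j to characters occurring in the jth column, whereas (3.2) admits every character of Ω = {1, …, ω}, as the two solutions of Example 3.4 suggest. Ties between equally rare characters are broken by taking the smallest one, so the returned sequence may differ from the solutions stated in Example 3.4 while attaining the same objective value.

```python
from collections import Counter


def total_distance(S, t):
    """Sum of the Hamming distances d(s^i, t) over all rows s^i of S."""
    return sum(sum(a != b for a, b in zip(row, t)) for row in S)


def max_sum_farthest(S, omega, restrict_to_column):
    """Maximize the sum of distances column by column, following (3.18).

    restrict_to_column=True : t_j must occur in column j (assumed meaning of (3.1)).
    restrict_to_column=False: t_j may be any character of {1, ..., omega} (assumed (3.2)).
    """
    n = len(S)
    t, value = [], 0
    for column in zip(*S):
        freq = Counter(column)  # freq[c] is 0 for characters absent from the column
        candidates = sorted(freq) if restrict_to_column else range(1, omega + 1)
        rarest = min(candidates, key=lambda c: freq[c])  # smallest character among ties
        t.append(rarest)
        value += n - freq[rarest]  # contribution n - f(t_j) from (3.18)
    return tuple(t), value


# String matrix (3.19) of Example 3.4, with omega = 4.
S = [
    (1, 1, 2, 1, 3, 2),
    (2, 4, 3, 1, 2, 4),
    (3, 2, 1, 3, 3, 4),
    (3, 1, 2, 2, 4, 4),
    (3, 4, 4, 3, 4, 3),
]

t_restricted, value_restricted = max_sum_farthest(S, 4, restrict_to_column=True)
t_free, value_free = max_sum_farthest(S, 4, restrict_to_column=False)
print(t_restricted, value_restricted)
print(t_free, value_free)
```

For the matrix (3.19) the sketch yields the objective values 24 and 29, which are also the values attained by the sequences (1, 2, 1, 2, 2, 3) and (4, 3, 4, 4, 1, 1) of Example 3.4.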

3.4. Far from most strings problem

Finally, we consider a further variant of the FSP, the FFMSP:

Given a set $\Sigma = \{s^1 , \ldots , s^n \}$ of n sequences with $s^i = ( s^i_1 , \ldots , s^i_m ) \in \Omega^m$ for $i = 1 , \ldots , n$ and a threshold d₀ > 0, find a sequence t from a feasible set X that maximizes the number of strings sⁱ in Σ such that $d ( s^i , t ) \geq d_0$.

The set X can be defined by Equation (3.1) or (3.2), but we will restrict ourselves to the second case. By modifying problem (3.10) we obtain the following model for the FFMSP. For $i = 1 , \ldots , n$ we introduce the binary variable z_i, which equals 1 if the string sⁱ belongs to the “active” strings, that is, the strings that are considered in the distance maximization [otherwise z_i=0; see, e.g., Meneses et al. (2005, p. 4) for a similar reasoning]. We get
\begin{align}& \max \ z_1 + \ldots + z_n \\& {\rm s.t.} \\& m - z_i \sum_{j = 1}^k y_{r_{i , j} , j} \geq d_0 \quad {\rm for} \ i = 1 , \ldots , n , \tag{3.20} \\& \sum_{i = 1}^{\omega} y_{i , j} = m_j \quad {\rm for} \ j = 1 , \ldots , k , \\& y_{i , j} \ \hbox{nonnegative integers} , \quad z_i \in \{0 , 1 \} .\end{align}

The optimal value of problem (3.20) is the maximum number of strings having distance at least d₀ from t.
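To make the objective of the FFMSP concrete, the following sketch counts the strings lying at distance at least d₀ from a candidate t and, for very small instances only, finds an optimum by enumerating Ω^m. The exhaustive search is not the method proposed in this article (it is exponential in m); it merely provides a reference value against which a solution of model (3.20) could be checked. The function names are illustrative assumptions, and the matrix (3.19) of Example 3.4 is reused as test data.

```python
from itertools import product


def hamming(s, t):
    """Hamming distance between two sequences of equal length."""
    return sum(a != b for a, b in zip(s, t))


def count_far(strings, t, d0):
    """Number of strings s^i with d(s^i, t) >= d0 (the FFMSP objective)."""
    return sum(hamming(s, t) >= d0 for s in strings)


def ffmsp_brute_force(strings, omega, d0):
    """Enumerate all t in {1, ..., omega}^m; feasible for tiny m only."""
    m = len(strings[0])
    best_t, best_count = None, -1
    for t in product(range(1, omega + 1), repeat=m):
        count = count_far(strings, t, d0)
        if count > best_count:
            best_t, best_count = t, count
    return best_t, best_count


# Matrix (3.19) of Example 3.4 (n = 5, m = 6, omega = 4).
S = [
    (1, 1, 2, 1, 3, 2),
    (2, 4, 3, 1, 2, 4),
    (3, 2, 1, 3, 3, 4),
    (3, 1, 2, 2, 4, 4),
    (3, 4, 4, 3, 4, 3),
]

best_t, best_count = ffmsp_brute_force(S, omega=4, d0=6)
print(best_t, best_count)
```

With d₀ = 6 at most four of the five strings can be far from t, because the third column of (3.19) contains every character of Ω, so any t agrees with at least one string in that position.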

Example 3.5

Consider the case n=4 and ω=3. The model (3.20) takes the form

max z₁ + z₂ + z₃ + z₄

s.t.
\begin{align}& m - z_1 ( y_{1 , 1} + y_{1 , 2} + y_{1 , 3} + y_{1 , 4} + y_{1 , 5} + y_{1 , 6} ) \geq d_0 , \\& m - z_2 ( y_{1 , 1} + y_{2 , 2} + y_{2 , 3} + y_{2 , 4} + y_{2 , 5} + y_{2 , 6} ) \geq d_0 , \\& m - z_3 ( y_{2 , 1} + y_{1 , 2} + y_{3 , 3} + y_{2 , 4} + y_{3 , 5} + y_{3 , 6} ) \geq d_0 , \tag{3.21} \\& m - z_4 ( y_{3 , 1} + y_{3 , 2} + y_{1 , 3} + y_{3 , 4} + y_{2 , 5} + y_{3 , 6} ) \geq d_0 , \\& y_{1 , j} + y_{2 , j} + y_{3 , j} = m_j \quad {\rm for} \ j = 1 , \ldots , 6 , \\& y_{i , j} \ \hbox{nonnegative integers} , \\& z_i \in \{0 , 1 \} \quad {\rm for} \ i = 1 , \ldots , 4.\end{align}

To solve the nonlinear integer programming problem (3.20), we proceeded in two phases. In the first phase, we solve the relaxation obtained by omitting the integrality conditions. In the second phase, we solve a modification of problem (3.20) in which the condition $z_i \in \{0 , 1 \}$ is replaced by z_i=0 if z_i was smaller than 1 in the first-phase solution, and by z_i=1 otherwise ($i = 1 , \ldots , n$).

Example 3.6

Assume that the multiplicities in problem (3.21) are $( m_1 , \ldots , m_6 )$ = (796, 974, 397, 998, 789, 999), implying m=4953. For the arbitrarily chosen threshold d₀=4000, the relaxation of problem (3.21) has the solution
\begin{align}& ( z_1 , z_2 , z_3 , z_4 ) = ( 0.46 , 1 , 1 , 1 ) , \tag{3.22} \\& \left( \begin{matrix}y_{1 , 1} & \cdots & y_{1 , 6} \\ \vdots & & \vdots \\ y_{3 , 1} & \cdots & y_{3 , 6}\end{matrix} \right) = \left( \begin{matrix}0 & 0 & 0 & 692.86 & 575.85 & 825.29 \\ 739.85 & 382.29 & 397 & 0 & 0 & 173.71 \\ 56.15 & 591.71 & 0 & 305.14 & 213.15 & 0\end{matrix} \right)\end{align}

with corresponding optimal value 3.46. We modify problem (3.21) by substituting the conditions $z_i \in \{0 , 1 \}$ for $i = 1 , \ldots , 4$ by z₁=0, z₂=1, z₃=1, and z₄=1. The resulting linear program has the solution
\begin{align}& z_1 = 0 , z_2 = 1 , z_3 = 1 , z_4 = 1 , \tag{3.23} \\& \left( \begin{matrix}y_{1 , 1} & \cdots & y_{1 , 6} \\ \vdots & & \vdots \\ y_{3 , 1} & \cdots & y_{3 , 6}\end{matrix} \right) = \left( \begin{matrix}240 & 0 & 0 & 998 & 789 & 999 \\ 556 & 21 & 0 & 0 & 0 & 0 \\ 0 & 953 & 397 & 0 & 0 & 0\end{matrix} \right)\end{align}

with objective value 3. We emphasize that the integer values of the y_i,j occurred “automatically” when solving the linear problem (no measures were taken to ensure integrality). The solution (3.23) also solves the integer problem (3.21): since the first-phase solution was optimal for the relaxation, its value 3.46 is an upper bound for the integer problem, whose integer-valued objective therefore cannot exceed 3.
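The second phase of Example 3.6 can be retraced with a short script. The sketch below does not solve any optimization problem; it only applies the rounding rule for the z_i to the first-phase values (3.22) and checks that the second-phase solution (3.23) satisfies the constraints of (3.21). The index vectors in `R` are read off the four distance constraints of (3.21); the names `round_z` and `is_feasible` and the tolerance `tol` are assumptions introduced for illustration.

```python
# Row indices r_{i,j} appearing in the distance constraints of (3.21).
R = [
    (1, 1, 1, 1, 1, 1),
    (1, 2, 2, 2, 2, 2),
    (2, 1, 3, 2, 3, 3),
    (3, 3, 1, 3, 2, 3),
]
M = (796, 974, 397, 998, 789, 999)  # multiplicities m_1, ..., m_6
D0 = 4000                           # threshold d_0


def round_z(z_relaxed, tol=1e-9):
    """Second-phase rule: z_i = 1 only if the relaxed value is (numerically) 1."""
    return [1 if z >= 1 - tol else 0 for z in z_relaxed]


def is_feasible(z, y, r=R, multiplicities=M, d0=D0):
    """Check the column-sum and distance constraints of (3.21)."""
    m = sum(multiplicities)
    k = len(multiplicities)
    for j in range(k):
        if sum(row[j] for row in y) != multiplicities[j]:
            return False
    for i, z_i in enumerate(z):
        # Number of positions in which string i would coincide with t.
        matches = sum(y[r[i][j] - 1][j] for j in range(k))
        if m - z_i * matches < d0:
            return False
    return True


z = round_z([0.46, 1, 1, 1])  # first-phase values from (3.22)
Y = [                         # second-phase solution from (3.23)
    [240, 0, 0, 998, 789, 999],
    [556, 21, 0, 0, 0, 0],
    [0, 953, 397, 0, 0, 0],
]
print(z, is_feasible(z, Y), sum(z))
```

The same matrix Y is not feasible with z₁ = 1, since the first string would then coincide with t in 3026 positions, leaving a distance of only 1927.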

In Table 3 the above simple heuristic has been applied to 30 test problems with different numbers n, m, and ω. The multiplicities m_j in problem (3.20) have been chosen at random and the threshold d₀ has been determined arbitrarily. Since the relaxation of problem (3.20) is not a convex optimization problem, the solution may be only a local optimum, depending on the starting values and the solution procedure applied by the software used. In our test problems, solved with the software LINGO, global optimality was achieved in 90% of the instances (27 of 30). In all test runs of Table 3 the second-phase solution has exclusively integer values. As this table shows, the objective value of the second phase was always $\lfloor z \rfloor$, where z denotes the objective value of the first phase. Thus, when the first phase terminated with a global optimum, the second phase provided a global optimal solution for the FFMSP (3.20). This is an encouraging result, since the FFMSP is considered one of the most difficult sequence consensus problems; see, for example, Ferone et al. (2013, p. 2).

Table 3.
Test Problems for the FFMSP Model (3.20)

Problem dimensions	No.	d₀	m	First-phase optimal value	Second-phase optimal value	Global optimality
n=4, ω=3	1	3000	3716	3.4566	3	Yes
	2	4000	4465	2.6663	2	Yes
	3	4000	4953	3.4551	3	Yes
	4	5300	5415	2.0838	2	Yes
	5	5500	5831	2.2559	2	Yes
	6	5000	6362	3.5984	3	Yes
	7	6900	6916	2.0082	2	Yes
	8	5500	7747	4	4	Yes
	9	8000	8488	2.2567	2	Yes
	10	8000	9093	2.7218	2	Yes
n=6, ω=4	11	600	627	5.4909	5	Yes
	12	1200	1293	5.1123	5	Yes
	13	1250	1293	4.2294	4	Yes
	14	4000	4113	4.1795	4	Yes
	15	8000	8059	4.0382	4	No
	16	15,000	15,735	4.5307	4	Yes
	17	30,000	30,936	4.2239	4	Yes
	18	30,935	30,936	4.0002	4	Yes
	19	55,000	60,963	5.1899	5	Yes
	20	60,000	60,963	4.0889	4	Yes
n=8, ω=7	21	6200	6411	7.0428	7	Yes
	22	6300	6411	6.4807	6	No
	23	8000	9142	7.9948	7	Yes
	24	13,000	13,437	7.0421	7	Yes
	25	13,300	13,437	6.1057	6	Yes
	26	19,000	20,227	7.1054	7	Yes
	27	20,000	20,227	6.1203	6	No
	28	37,500	38,827	7.0449	7	Yes
	29	9000	9567	7.1013	7	Yes
	30	11,000	11,845	7.1425	7	Yes

4. Probabilistic Considerations and Concluding Remarks

The above elementary solution techniques work very well for a small number n of sequences. It is of course necessary to perform further tests for FSPs with a larger set of sequences. The number of possible representative vectors increases with n (see, e.g., Table 2); however, not all these vectors need to occur in a normalized problem (see sec. 3.1). It would be interesting to study the number k of representative vectors and their multiplicities occurring in practical applications. Assuming that the characters are randomly chosen from the alphabet, one can estimate k by means of simulations. It is intuitively clear that columns with a small number of different characters are rare in the random case. In particular, one can determine the probability that a column is incomplete.
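A simulation of this kind might look as follows. The sketch assumes that normalization relabels the characters of each column in order of first occurrence, so that every normalized column starts with 1; this is an assumption about sec. 3.1 rather than a restatement of it. The function names, the fixed seed, and the chosen dimensions are likewise illustrative. For a random n × m matrix the code counts the distinct normalized columns together with their multiplicities, optionally skipping incomplete columns.

```python
import random
from collections import Counter


def normalize_column(column):
    """Relabel characters by order of first occurrence: (4, 3, 3) -> (1, 2, 2)."""
    labels = {}
    return tuple(labels.setdefault(c, len(labels) + 1) for c in column)


def representative_vectors(S, omega=None):
    """Multiplicity of each distinct normalized column of S.

    If omega is given, incomplete columns (fewer than omega distinct
    characters) are skipped; this filter is an illustrative assumption.
    """
    counts = Counter()
    for column in zip(*S):
        normalized = normalize_column(column)
        # max(normalized) equals the number of distinct characters in the column.
        if omega is None or max(normalized) == omega:
            counts[normalized] += 1
    return counts


def simulate_k(n, m, omega, complete_only=False, trials=20, seed=0):
    """Average number of distinct normalized columns over random n x m matrices."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        S = [[rng.randint(1, omega) for _ in range(m)] for _ in range(n)]
        total += len(representative_vectors(S, omega if complete_only else None))
    return total / trials


print(simulate_k(n=6, m=1000, omega=4))
print(simulate_k(n=6, m=1000, omega=4, complete_only=True))
```

The multiplicities returned by `representative_vectors` play the role of the parameters m_j, so the same routine could be applied to real sequence data to see how many representative vectors actually occur.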

Proposition 4.1

Assume that the n × m matrix (2.1) is randomly constructed such that every character of $\Omega = \{1 , \ldots , \omega \}$ appears with equal probability. Then, a specific column of S is incomplete with probability
\begin{align}P ( n , \omega ) = \sum_{i = 1}^{\omega - 1} ( - 1 )^{i - 1} {\omega \choose i} \left( 1 - \frac{i}{\omega} \right)^n.\end{align}

Proof. Let A_i denote the event that a given column does not contain the character i. The intersection of any i of these events has probability $( 1 - i / \omega )^n$, which is 0 for i = ω. Then, from the inclusion–exclusion principle it follows that
\begin{align} P ( n , \omega ) & = P ( A_1 \cup \ldots \cup A_{\omega} ) = \sum_{i = 1}^{\omega} P ( A_i ) - \sum_{1 \leq i < j \leq \omega} P ( A_i \cap A_j ) + \ldots + ( - 1 )^{\omega - 1} P ( A_1 \cap \ldots \cap A_{\omega} ) \\ & = \omega \left( \frac{\omega - 1}{\omega} \right)^n - {\omega \choose 2} \left( \frac{\omega - 2}{\omega} \right)^n + \ldots + ( - 1 )^{\omega} {\omega \choose \omega - 1} \left( \frac{1}{\omega} \right)^n , \end{align}

implying the statement. ■

Consider, for example, an application with n=40 protein sequences composed of ω=20 characters corresponding to amino acids. Assuming that all characters appear with the same probability, a column of the string matrix (2.1) is then incomplete with probability P(40, 20) ≈ 0.964; that is, more than 96% of all columns are expected to be incomplete and hence can be exempted from further consideration (see sec. 3.2). This fact simplifies the solution considerably.
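The probability of Proposition 4.1 is easy to evaluate numerically. The following sketch implements the formula directly and, as a plausibility check, compares it with a Monte Carlo estimate over uniformly random columns; the function names, the sample size, and the seed are assumptions made for illustration.

```python
import random
from math import comb


def prob_incomplete(n, omega):
    """P(n, omega) of Proposition 4.1, evaluated by inclusion-exclusion."""
    return sum(
        (-1) ** (i - 1) * comb(omega, i) * (1 - i / omega) ** n
        for i in range(1, omega)
    )


def estimate_incomplete(n, omega, samples=20000, seed=0):
    """Fraction of random columns of length n that miss at least one character."""
    rng = random.Random(seed)
    incomplete = 0
    for _ in range(samples):
        distinct = {rng.randint(1, omega) for _ in range(n)}
        incomplete += len(distinct) < omega
    return incomplete / samples


print(prob_incomplete(40, 20))
print(estimate_incomplete(40, 20))
```

For n=40 and ω=20 the formula evaluates to approximately 0.964, in agreement with the value quoted above, and the simulated fraction should lie close to it.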

It is also necessary to carry out further test runs with model (3.20) for the FFMSP. In particular, it must be checked under which circumstances the solution values are integers in the respective two phases.

Some theoretical questions also arise regarding the main model (3.3). Among others, it should be investigated how parameter changes influence the solution. In particular, one can observe the following “linearity condition”: if Y=(y_i,j) solves the LP-relaxation of problem (3.3) for the parameters $m_1 , \ldots , m_k$, then cY is the “relaxed” solution for the parameters $cm_1 , \ldots , cm_k$. Moreover, it would be interesting to derive a better lower bound for the objective value of the standard rounding solution than that provided in Proposition 3.1. Such a bound is closely related to the number N of noninteger values in the relaxed solution. In this context it is interesting to observe that the coefficients of the variables y_i,j in the model (3.3) are ± 1. It might be worthwhile to examine whether the low number of noninteger values can be explained by a generalized unimodularity concept; see, for example, Kotnyek (2002).

Position	1	2	3	4	5	6	7	8	9	10
S	s ¹	1	4	2	3	4	2	4	2	4	3
	s ²	3	3	1	3	2	4	1	2	1	3
	s ³	4	3	2	1	3	2	1	3	3	1

Position	1	2	3	4	5	6	7	8	9	10
T	t ¹	1	1	1	1	1	1	1	1	1	1
	t ²	2	2	2	1	2	2	2	1	2	1
	t ³	3	2	1	2	3	1	2	2	3	2

Position	4	8	10	3	6	2	7	1	5	9
$\tilde{T}$	t ¹	1	1	1	1	1	1	1	1	1	1
	t ²	1	1	1	2	2	2	2	2	2	2
	t ³	2	2	2	1	1	2	2	3	3	3
Column group		1			2		3		4

	No.	m	N	d_relax	d_srs	Max. error
n=6, ω=2, k=31	1	11,263	5	7534.833	7534	0
	2	10,965	2	7310.5	7310	0
	3	12,368	5	8076.667	8076	0
	4	14,046	5	9228.833	9228	0
	5	9290	4	6048	6047	1
	6	10,858	4	7151.167	7151	0
	7	10,306	5	6818.167	6818	0
	8	13,434	4	8721.833	8721	0
	9	15,943	5	10,337.67	10,337	0
	10	18,821	5	12,193.17	12,192	1
n=8, ω=2, k=127	11	737	3	473.5	473	0
	12	1463	7	938.125	937	1
	13	1122	7	758.6	757	1
	14	2924	5	1875.875	1875	0
	15	5903	5	3782.875	3782	0
	16	11,861	6	7597.375	7596	1
	17	23,929	6	15,335.12	15,334	1
	18	47,465	5	30,424.88	30,424	0
	19	36,907	2	24,858	24,857	1
	20	35,923	3	24,290.5	24,290	0
n=6, ω=5, k=15	21	2794	8	2328.333	2328	0
	22	3305	10	2754.167	2754	0
	23	4394	9	3515.5	3514	1
	24	5487	12	4398.333	4398	0
	25	6034	11	5068.833	5067	1
	26	7856	10	6284.667	6282	2
	27	8698	6	7393.333	7393	0
	28	9482	8	7775.167	7773	2
	29	10,670	10	8962	8961	1
	30	11,355	8	9084.833	9084	0
n=8, ω=7, k=28	31	5530	13	4838.75	4838	0
	32	10,420	14	8336.625	8335	1
	33	19,673	10	17,705.875	17,705	0
	34	38,912	13	33,075.375	33,074	1
	35	78,452	9	67,468.75	67,468	0
	36	150,230	8	130,700.13	130,698	2
	37	230,722	10	189,192.25	189,189	3
	38	295,015	12	236,012.38	236,012	0
	39	450,723	8	396,636.88	396,633	3
	40	950,643	13	751,007.75	751,006	1

n	ω	k	Variables	Linear constr.
4	3	6	19	10
5	3	25	76	30
6	4	65	261	71
6	3	90	271	96
7	6	21	127	28

Footnotes

Author Disclosure Statement

No competing financial interests exist.

References

Blazewicz, Formanowicz, and Kasprzak. 2005. Selected combinatorial problems of computational biology. Eur. J. Oper. Res., 161, 585–597.

Boucher, C.A. 2010. Combinatorial and probabilistic approaches to motif recognition [PhD dissertation]. University of Waterloo, Waterloo, Ontario, Canada.

Boucher, C.A., Landau, G.M., Levy, and Pritchard. 2012. On approximating string selection problems with outliers. http://arxiv.org/pdf/1202.2820.pdf

Cheng, C.H., Huang, C.C., Hu, S.Y., and Chao, K.M. 2004. Efficient algorithms for some variants of the farthest string problem. In Proceedings of Workshop on Combinatorial Mathematics and Computation Theory, pp. 266–272.

Ferone, Festa, and Resende, M.G.C. 2013. Hybrid metaheuristics for the far from most string problem. In Blesa, M.J., et al., eds. Proceedings of Hybrid Metaheuristics: Lecture Notes in Computer Science, pp. 174–188.

Festa. 2007. On some optimization problems in molecular biology. Math. Biosci., 207, 219–234.

Festa, and Pardalos, P.M. 2012. Efficient solution for the far from most string problem. Ann. Oper. Res., 196, 663–682.

Kotnyek. 2002. A generalization of totally unimodular and network matrices [PhD dissertation]. London School of Economics.

Kuksa, P.P., and Pavlovic. 2009. Efficient discovery of common patterns in sequences over large alphabets. DIMACS Technical Report 2009. www.dimacs.rutgers.edu/TechnicalReports/TechReports/2009/2009-15.pdf

Lanctot, Li, Ma, et al. 2003. Distinguishing string selection problems. Inf. Comput., 185, 41–55.

Meneses, C.N., Pardalos, P.M., Resende, M.G.C., and Vazacopoulos. 2005. Modeling and solving string selection problems. In Proceedings of the 2005 International Symposium on Mathematical and Computational Biology, Biomat 2005, Rio de Janeiro.

Pappalardo, Pardalos, P.M., and Stracquadanio. 2013. Optimization Approaches for Solving String Selection Problems. Springer, New York, NY.

Soleimani-damaneh. 2011. On some multiobjective optimization problems arising in biology. Int. J. Comput. Math., 88, 1103–1119.

Zörnig. 2011. Improved optimization modelling for the closest string and related problems. Appl. Math. Model., 35, 5609–5617.