An evolutionary logistic regression method to identify confused drug names

Abstract

Confused drug names are a common cause of medication errors and are related to look-alike and sound-alike drug names. To identify confused drug name pairs, individual similarity measures between drug names are used. In the state of the art, logistic regression with the standard learning algorithm has been used to combine individual similarity measures. However, only three similarity measures have been combined, and the results of previous research do not outperform any individual measure with statistical significance. In addition, the set of potentially confused drug name pairs is highly unbalanced, which is a hard problem for supervised machine learning models. In this paper, an improved combined logistic regression measure based on 21 individual measures is presented with the standard learning algorithm. We also present an evolutionary learning method for a combined logistic regression measure that can learn from an unbalanced dataset. According to experiments with a gold standard dataset, our proposed combined measures outperform previous research with statistical significance in identifying pairs of confused drug names. In addition, the rankings of individual and combined similarity measures are presented.

Keywords

Look-alike sound-alike drug names, patient safety, logistic regression, genetic algorithm, imbalanced dataset.

1. Introduction

Medication names that sound and look similar to others are related to medication errors. Drug name mix-ups are a risk to patient safety that causes at least one death per day and harms approximately 1.5 million people per year in the United States. According to the World Health Organization (WHO), the estimated annual cost of medication errors is around $42 billion USD, with an additional annual cost of $3.5 billion USD for patients who suffer harm [46].

Look-Alike/Sound-Alike (LASA) drug names are the most common cause of medication errors worldwide. LASA names lead pharmacists, nurses, patients, doctors, and other health care workers to unintentionally interchange medicine brand names at different stages of the administration process, which can result in patient injury or death [1, 46]. Name confusion occurs as the result of weak medication systems and factors related to human error. These undesired medication errors are potentially preventable events [42].

Medication errors are voluntarily and anonymously reported to national reporting programs by health practitioners and consumers. The National Medication Errors Reporting Program (MERP) in the United States is an internationally recognized program [11]. The MERP and similar programs are used to determine the causes of medication errors in order to build stronger medication systems [6, 17].

Regulatory authorities in the United States, Canada, and other countries are improving their reporting systems to identify the factors and potential causes of LASA confusion. Strategies to decrease potential LASA medication errors include educating the healthcare community and working with other regulatory organizations, professionals, manufacturers, and patients to improve safe medication practices.

About 25 percent of the errors reported to the Institute for Safe Medication Practices (ISMP) correspond to the LASA confusion problem [7]. According to the U.S. Pharmacopeia (USP) [20], 192,477 medication errors were reported in 2002. Fortunately, in 67,707 cases it was possible to intercept the medication before it was administered to the patient. In 91,446 cases, the medication was administered but did not harm the patient. However, in 2,600 cases an intervention was required to keep the patient alive, and in 20 cases the patient died. The rest of the errors are classified as potential, since they were detected before the prescription was written.

The ISMP publishes an updated List of Confused Drug Names [28]. Healthcare practitioners use the ISMP list to know which medications need special attention and safeguards in order to reduce LASA errors and patient harm.

The LASA problem is growing continuously [7, 21]. Since the first LASA list was published in 1973, LASA lists have been updated frequently with new pairs [41]. In 1995, USP published the Quality Review 49 with 200 confusable pairs [21]; the list grew to 850 pairs in the Quality Review 76 in 2001 [43], to 1,950 pairs in the Quality Review 79 in 2004 [44], and to 3,170 pairs in 2006 [21]. In 2017, the ISMP reported that 36 drug names had been added because they were involved in 17,133 medication errors. A drug name may also participate in several LASA pairs [21]; for example, Serentil participates in four pairs with other drug names [43].

Multiple factors related to human error are involved in the LASA confusion problem [21, 30]. For example, although the LASA pair Avandia and Coumadin does not have a similar spelling, it is classified as a visual perception error when it is prescribed in poor handwriting. Confusion occurs when the capital first letter “A” looks like “C” and the last letters “ia” look like “in” [10, 30]. Another visual perception error takes place when the LASA pair Hydroxyzine and Hydralazine is communicated in typewritten form, because the names have a similar spelling, that is, they share identical prefixes, suffixes, and lengths [10]. In contrast, the LASA pair Xanax and Zantac does not have a similar spelling, but it is classified as an auditory perception error because the names sound similar [21]. Sometimes in this kind of auditory error, the pair also shares some features related to similar spelling.

Short-term memory errors are related to memory lapses, for example, when the pharmacist, after reading the name Hydroxyzine, writes Hydralazine [21, 30].

A motor control error occurs when an incorrect medicine is selected from a computerized drop-down list [21, 30].

While the type of error is not easy to identify, the similarity at its root can be detected or avoided. As part of the strategy to reduce the risk of registering new confused drug names, the U.S. Food and Drug Administration (FDA) needs to identify potentially confused drug names a priori [5].

In the approval process for a new drug, the FDA encourages the implementation of computerized methods and algorithms to evaluate the similarity of the proposed name to existing names [10]. First, the FDA recommends that the industry follow a series of best practices in the selection of a proprietary name for new medications [12]. In the review process, the proposed name is evaluated with the Phonetic and Orthographic Computer Analysis (POCA) software, which compares the name against different drug databases [10, 13]. Once the drug name is reviewed, the FDA may reject it if it is too similar to previously registered drug names [12].

A preventive action is taken when a confusable drug name is responsible for serious medication errors: in these cases, the proprietary name has been changed in order to avoid confusion with other drug names [5].

Recent research proposes standardizing the review process and implementing automated processes based on machine learning techniques to address this challenge [15, 42].

In this paper, we demonstrate how a logistic regression model outperforms a model based on a single similarity measure when it is appropriately trained to identify confusing names.

The problem of identifying LASA pairs is defined as follows. Definition 1. Let a confused drug name X with n letters be considered a sequence of elements, that is, X = ⟨x₁, x₂, …, x_n⟩, where |X| denotes the length (size) of X.

Given a set D = {d₁, d₂, …, d_m} of potentially confused drug names, the set of look-alike and sound-alike pairs is defined as: $T = \{(d_{i}, d_{j}) \mid ((d_{i}, d_{j}) \in λ \Rightarrow (d_{j}, d_{i}) \in λ) \land i \neq j\}$ (1)

where λ is the LASA list, and T ⊆ D × D. The problem of identifying confused drug names from D consists in retrieving a subset H of the pairs that belong to T, defined as:

$H = \{(d_{i}, d_{j}) \mid ((d_{i}, d_{j}) \in T \lor (d_{j}, d_{i}) \in T) \land i \neq j\}$ (2)

According to related works [4, 29–32], identifying a pair of confusable drug names requires not only individual measures that capture particular look-alike (orthographic) and sound-alike (phonetic) patterns between the names, but also a combination of their results, because the cause of the confusion (orthographic or phonetic) cannot be known a priori. In both cases, the measures are classified as distances (the closer the value is to zero, the more related the names are) or similarities (the greater the value, the more related the names are). Normally, similarity measures are normalized so that values from different measures share a common scale.

Lambert [18] presents a wide compilation of 22 measures for the LASA problem [30–32], in which the classical string-matching and distance-based measures (and some variants of them) are used. After evaluating the individual measures, Lambert concludes that Trigram-2B is the best orthographic similarity measure, Normalized Edit Distance (NED) is the best orthographic distance measure, and Editex is the best phonetic distance measure. However, only a subset of selected measures is used in a Logistic Regression Model (LRM) to combine their strengths.

The individual measures found in related work on this problem are presented below.

1.1. Orthographic distance measures

Given the drug names X and Y as sequences of size n and m, respectively, the edit distance (also called Levenshtein distance) is the minimum cost of the editing operations (insertion, deletion, and substitution) needed to convert the sequence X into Y [33, 47]. Some applications [9, 47] consider that a substitution requires two operations (an insertion and a deletion); in that case, the cost of a substitution is the sum of the costs of an insertion and a deletion. In this paper, all editing operations have a cost of 1; that is, the cost of substituting the letter x_i by the letter y_j, denoted cs(x_i, y_j), is 1 when the letters are different and 0 otherwise. The edit distance between X and Y is then given by edit(n, m), computed by the following recurrence:

$edit (i, j) = \begin{cases} \max (i, j) & i = 0 \text{ or } j = 0 \\ edit (i - 1, j - 1) & x_{i} = y_{j} \\ \min \left\{ \begin{matrix} edit (i - 1, j) + 1 \\ edit (i, j - 1) + 1 \\ edit (i - 1, j - 1) + cs (x_{i}, y_{j}) \end{matrix} \right. & x_{i} \neq y_{j} \end{cases}$ (3)

For example, the edit distance between Zantac and Xanax is 3 because the minimum transformation involves two substitutions (Z → X and c → x) and one deletion (letter t).

A Normalized Edit Distance (NED) is computed by dividing the edit distance by the length of the longer sequence [4, 32]. For the above example, the NED is 3/6 = 0.5.
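The recurrence in Equation (3) and the normalization above can be sketched in a few lines of Python. This is a minimal illustration with unit costs, not the implementation used in the experiments; the function names `edit_distance` and `ned` are chosen here for illustration, and names are compared in lower case by assumption.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost Levenshtein distance following the recurrence of Equation (3)."""
    n, m = len(x), len(y)
    # table[i][j] holds edit(i, j) for the prefixes x[:i] and y[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = i  # base case j = 0: max(i, 0)
    for j in range(m + 1):
        table[0][j] = j  # base case i = 0: max(0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cs = 0 if x[i - 1] == y[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,       # deletion
                table[i][j - 1] + 1,       # insertion
                table[i - 1][j - 1] + cs,  # match or substitution
            )
    return table[n][m]


def ned(x: str, y: str) -> float:
    """Edit distance divided by the length of the longer name."""
    longest = max(len(x), len(y))
    return edit_distance(x.lower(), y.lower()) / longest if longest else 0.0


print(edit_distance("zantac", "xanax"))  # 3, as in the worked example
print(ned("Zantac", "Xanax"))            # 0.5
```

When the current letters match, the substitution branch contributes a cost of zero, so taking the minimum over the three branches gives the same value as the separate matching case of the recurrence.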

A Tapered Edit Distance (TED) is used for finding similar names in databases based on the pronunciation of the names. TED gives more relevance to differences at the beginning of the names than to those at the end. To do so, the penalty for a substitution or deletion is at its maximum at the beginning of the sequences and decreases to its minimum at the end [3, 47]. Normalized Tapered Edit Distance (NTED) is the normalized version of TED.

Skeleton is an algorithm that converts a sequence of letters into a key. This key consists of the first letter of the sequence, followed by its remaining unique consonants (in order of appearance), followed by its remaining unique vowels (in order of appearance) [36, 37]. For example, the drug names Zantac and Contac have the Skeleton keys Zntca and Cntoa, respectively. The distance named SkeletonKey is the edit distance between the corresponding keys of the drug names [31–36].
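The key construction described above can be sketched as follows. The function name `skeleton_key` is illustrative, and two details are assumptions of this sketch rather than statements from the cited works: letters are compared case-insensitively, and "y" is treated as a consonant.

```python
VOWELS = set("aeiou")  # assumption: "y" counts as a consonant


def skeleton_key(name: str) -> str:
    """First letter, then remaining unique consonants, then remaining unique vowels."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    seen = {first}
    consonants, vowels = [], []
    for ch in letters[1:]:
        if ch in seen:  # keep only the first occurrence of each letter
            continue
        seen.add(ch)
        (vowels if ch in VOWELS else consonants).append(ch)
    return (first + "".join(consonants) + "".join(vowels)).capitalize()


print(skeleton_key("Zantac"))  # Zntca
print(skeleton_key("Contac"))  # Cntoa
```

The SkeletonKey distance is then the edit distance between the two keys, computed with any Levenshtein routine.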

Analogous to the Skeleton key algorithm, the Omission key builds a key from a name for spelling correction, using the inverse order of the consonants most frequently omitted in spelling errors. The consonants are ranked as RSTNLCHDPGMFBYWVZXQKJ, where R is the most frequently omitted consonant and J the least frequently omitted. The omission key of a sequence consists of its unique consonants in the inverse of the above order, followed by its unique vowels in order of appearance [36, 37]. For example, the drug names Zantac and Contac have the omission keys Zcnta and Cntoa, respectively. The distance called OmissionKey uses the edit distance between the corresponding keys of the drug names [31–36].
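A minimal sketch of the omission key follows, using the consonant ranking quoted in the text. The function name `omission_key` and the case-insensitive comparison are assumptions made for illustration.

```python
# Consonant ranking from the text, reversed so that the least frequently
# omitted consonant (J) comes first in the key.
OMISSION_ORDER = "RSTNLCHDPGMFBYWVZXQKJ"[::-1].lower()
VOWELS = "aeiou"


def omission_key(name: str) -> str:
    """Unique consonants in inverse omission order, then unique vowels as they appear."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    present = set(letters)
    consonants = [c for c in OMISSION_ORDER if c in present]
    vowels = []
    for ch in letters:
        if ch in VOWELS and ch not in vowels:
            vowels.append(ch)
    return ("".join(consonants) + "".join(vowels)).capitalize()


print(omission_key("Zantac"))  # Zcnta
print(omission_key("Contac"))  # Cntoa
```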

1.2. Orthographic similarity measures

N-gram similarity represents a sequence as the set of all its contiguous subsequences (grams) of size N [36]. For example, if |X| = m and N = 2 (bigrams), then X′ = {x₁x₂, x₂x₃, …, x_{m−1}x_m} [4, 31]. Given the sequences X and Y, the N-gram similarity is defined as the Dice similarity [2] between the sets X′ and Y′: $Dice (X^{'}, Y^{'}) = \frac{2 | X^{'} \cap Y^{'} |}{| X^{'} | + | Y^{'} |}$ (4)

Considering the drug names X = Zantac and Y = Contac in a bigram representation, X′ = {Za, an, nt, ta, ac} and Y′ = {Co, on, nt, ta, ac}. The two sets share three bigrams (nt, ta, ac), so Dice(X′, Y′) = 6/10. Lambert [29, 32] uses bigram and trigram similarity.

N-gram similarity has a weakness for the LASA problem: it is well known that the prefixes and suffixes of drug names are involved in their confusion [26, 40], yet the initial and final letters appear in fewer n-grams than the inner letters. To increase the sensitivity of the N-gram similarity to initial and final letters, variations are introduced. Lambert [31] proposes adding spaces (or a letter not included in the names) (B)efore and (A)fter both drug names so that the initial or final letters appear in one or more additional n-grams. For example, the Trigram-2B measure uses the trigram similarity after adding two spaces before both drug names. Following the example, the drug name Zantac is represented as:

$X = {-}{-}\text{Zantac}, \quad X^{'} = \{{-}{-}\text{Z},\ {-}\text{Za},\ \text{Zan},\ \text{ant},\ \text{nta},\ \text{tac}\}$ (5)

Lambert uses the variants of Bigram (1B, 1B1A and 1A) and Trigram (1B, 1A, 1B1A, 2B, 2A, 2B2A, 1B2A and 2B1A).
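The N-gram similarity and its padded variants can be sketched with one parameterized function. The names `ngrams`, `dice`, and `ngram_similarity`, the `before`/`after` parameters, the use of "-" as padding symbol, and the lower-casing of names are assumptions made for illustration.

```python
def ngrams(name: str, n: int, before: int = 0, after: int = 0, pad: str = "-") -> set:
    """Set of contiguous n-grams, with optional padding before and after the name."""
    padded = pad * before + name.lower() + pad * after
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a: set, b: set) -> float:
    """Dice coefficient: twice the shared grams over the total number of grams."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def ngram_similarity(x: str, y: str, n: int, before: int = 0, after: int = 0) -> float:
    return dice(ngrams(x, n, before, after), ngrams(y, n, before, after))


# Plain bigram similarity of the worked example: 2 * 3 / (5 + 5)
print(ngram_similarity("Zantac", "Contac", n=2))
# Trigram-2B: trigrams after adding two padding symbols before each name
print(sorted(ngrams("Zantac", n=3, before=2)))
print(ngram_similarity("Zantac", "Contac", n=3, before=2))
```

With this parameterization, Bigram-1B1A corresponds to `n=2, before=1, after=1` and Trigram-2B2A to `n=3, before=2, after=2`.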

One disadvantage of the Dice similarity (and therefore of the N-gram similarity) is that it is applied to sets, which loses the order of the elements. The Normalized Longest Common Subsequence (NLCS) similarity preserves the order of the common matching letters. Given the sequences X and Y of size n and m, respectively, the NLCS similarity is defined as the ratio of the length of the longest common subsequence of X and Y to the length of the longer sequence, NLCS = lcs(n, m)/max(|X|, |Y|), where lcs(n, m) can be calculated by the recurrence in Equation (6) [24–27, 32]. For example, with the drug names Zantac and Contac the length of the longest common subsequence is |ntac| = 4, therefore NLCS = 4/6. The NLCS is used by [4, 32]. $lcs (i, j) = \begin{cases} 0 & i = 0 \text{ or } j = 0 \\ lcs (i - 1, j - 1) + 1 & x_{i} = y_{j} \\ \max (lcs (i, j - 1), lcs (i - 1, j)) & x_{i} \neq y_{j} \end{cases}$ (6)

On the one hand, N-gram similarity handles small subsequences, but it loses the order of the matching positions. On the other hand, NLCS similarity preserves the order of the matching positions, but only for single letters, and it does not give special relevance to the initial positions.
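The longest-common-subsequence recurrence translates directly into a dynamic-programming table. The sketch below is illustrative only; the function names `lcs_length` and `nlcs` and the lower-casing of names are assumptions.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    n, m = len(x), len(y)
    # table[i][j] holds lcs(i, j); row 0 and column 0 are the zero base cases
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i][j - 1], table[i - 1][j])
    return table[n][m]


def nlcs(x: str, y: str) -> float:
    """LCS length divided by the length of the longer name."""
    longest = max(len(x), len(y))
    return lcs_length(x.lower(), y.lower()) / longest if longest else 0.0


print(lcs_length("zantac", "contac"))  # 4, the subsequence "ntac"
print(nlcs("Zantac", "Contac"))        # 4/6
```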

1.3. Phonetic distance measures

Soundex is an indexing algorithm developed by Russell [9] that groups letters with similar sounds in order to obtain a coded representation of a name. Since the Soundex algorithm is not a measure, Lambert [31] and Kondrak [26] use it to code a pair of drug names and then obtain the distance between the names as the edit distance between the codes. In this paper, the Soundex distance follows the description of Lambert [31] and Kondrak [26]. Beginning with the second letter of a drug name, each letter is replaced by a numeric code, then all zeros are suppressed (sup) and the resulting sequence is truncated to four symbols [26, 31–35]. Given a pair of drug names as sequences X and Y of size n and m, respectively, Soundex distance is defined in Equation (7).

$\begin{matrix} Soundex (X, Y) = Edit (Code (x_{1..n})_{1..4}, Code (y_{1..m})_{1..4}) \\ Code (α_{i}) = \begin{cases} α_{1}, & i = 1 \\ sup, & α_{i > 1} \in \{a, e, h, i, o, u, w, y\} \\ 1, & α_{i > 1} \in \{b, f, p, v\} \\ 2, & α_{i > 1} \in \{c, g, j, k, q, s, x, z\} \\ 3, & α_{i > 1} \in \{d, t\} \\ 4, & α_{i > 1} \in \{l\} \\ 5, & α_{i > 1} \in \{m, n\} \\ 6, & α_{i > 1} \in \{r\} \end{cases} \end{matrix}$ (7)

For instance, the drug names Zantac and Xanax are coded as Z532 and X520, respectively (the shorter code X52 is padded with a zero), and these codes have an edit distance of 3.
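The coding step and the resulting distance can be sketched as below. The function names are illustrative. Padding short codes with zeros is an assumption inferred from the code X520 in the example, and, following the description in the text, the sketch does not collapse adjacent repeated codes as some Soundex variants do.

```python
CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex_code(name: str, length: int = 4) -> str:
    """Keep the first letter, code the rest, suppress uncoded letters, cut to `length`."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    for ch in letters[1:]:
        code += CODES.get(ch, "")  # a, e, h, i, o, u, w, y are suppressed
    return code[:length].ljust(length, "0")  # assumption: pad short codes with zeros


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, keeping only one previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def soundex_distance(x: str, y: str) -> int:
    return edit_distance(soundex_code(x), soundex_code(y))


print(soundex_code("Zantac"), soundex_code("Xanax"))  # Z532 X520
print(soundex_distance("Zantac", "Xanax"))            # 3
```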

Phonix is similar to Soundex: it is an indexing algorithm that uses 160 groups of letters for coding a drug name [14, 47]. Since Phonix is not a measure, the edit distance between a pair of coded drug names is computed to obtain the Phonix distance. Given a pair of drug names as sequences X and Y of size n and m, respectively, Phonix distance is defined as:

$Phonix (X, Y) = Edit (α, β)$ (8)

where α = PhonixCode(X)_{1..8} and β = PhonixCode(Y)_{1..8}.

Given a name Z, PhonixCode applies the following steps [36]:

1. Replace groups of orthographic letters by letters representing certain phonetic groups.

2. Replace the first letter by V if it is a vowel or the consonant Y.

3. Split the ending-sound from the name (roughly the part after the last vowel or Y).

4. Remove all vowels, the consonants H, W, and Y, and all duplicated consecutive letters.

5. Code the prefix of the name by replacing the letter α_i according to the CodePrefix function, see Equation (9). The maximum length of a Phonix code is restricted to 8 characters.

$CodePrefix (α_{i}) = \begin{cases} α_{1}, & i = 1 \\ 1, & α_{9 > i > 1} \in \{b, p\} \\ 2, & α_{9 > i > 1} \in \{c, g, j, k, q\} \\ 3, & α_{9 > i > 1} \in \{d, t\} \\ 4, & α_{9 > i > 1} \in \{l\} \\ 5, & α_{9 > i > 1} \in \{m, n\} \\ 6, & α_{9 > i > 1} \in \{r\} \\ 7, & α_{9 > i > 1} \in \{f, v\} \\ 8, & α_{9 > i > 1} \in \{s, x, z\} \\ sup, & \text{in other case} \end{cases}$ (9)

6. Code the ending-sound by replacing every letter with its numerical value as defined by the CodePrefix function above. The maximum length of the Phonix code of an ending-sound is restricted to 8 characters.

Editex distance computes the edit distance between the drug names using the phonetic groups of their letters [47]. In this case, the cost of an editing operation depends on the groups of the letters being compared (see Table 1). If two letters are equal, the substitution cost is zero. If they are different but belong to the same group, the substitution cost is one. All other editing operations have a cost of two [26]. For example, the drug names Zantac and Xanax have an Editex distance of 5: the substitution Z → X costs one because both letters are in the same group, the substitution c → x costs two because the letters are in distinct groups, and the deletion of t costs two.

Table 1

Groups of letters for Editex

Code	Letters
0	a, e, i, o, u, y
1	b, p
2	c, k, q
3	d, t
4	l, r
5	m, n
6	g, j
7	f, p, v
8	s, x, z
9	c, s, z
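The cost scheme described above, together with the groups of Table 1, can be sketched as a weighted edit distance. This follows the simplified description in the text (every insertion and deletion costs two); the original Editex formulation treats some deletions and the letters h and w specially, which this sketch does not model. The function names are illustrative.

```python
# Letter groups from Table 1; a letter may belong to more than one group.
GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz"]


def substitution_cost(a: str, b: str) -> int:
    """0 for equal letters, 1 for letters sharing a group, 2 otherwise."""
    if a == b:
        return 0
    return 1 if any(a in g and b in g for g in GROUPS) else 2


def editex(x: str, y: str, indel: int = 2) -> int:
    """Edit distance with group-dependent substitution costs."""
    x, y = x.lower(), y.lower()
    prev = [j * indel for j in range(len(y) + 1)]
    for i, cx in enumerate(x, 1):
        cur = [i * indel]
        for j, cy in enumerate(y, 1):
            cur.append(min(
                prev[j] + indel,                          # deletion
                cur[j - 1] + indel,                       # insertion
                prev[j - 1] + substitution_cost(cx, cy),  # match or substitution
            ))
        prev = cur
    return prev[-1]


print(editex("Zantac", "Xanax"))  # 5 = 1 (Z -> X) + 2 (delete t) + 2 (c -> x)
```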

1.4. Combined measures

Since it is not possible to know the cause of the confusion (orthographic or phonological) a priori, several orthographic and phonological measures are combined [4, 31]. A combined measure takes advantage of the strengths of the individual measures as an ensemble to give the final similarity result between two drug names.

First, Lambert [31] evaluated 22 measures for the LASA problem. Then the three best measures, Trigram-2B, NED, and Editex, were combined in a Logistic Regression Model (LRM-3) to merge their strengths into a new measure with better results. Logistic Regression (LR) is an algorithm for binary classification. Like other machine learning algorithms, LR uses a standard learning algorithm to fit a model predictor [23]. In LRM-3, the parameters to fit are θ = {θ_0, θ_1, …, θ_n} for the hypothesis 0 ⩽ h_θ(x) ⩽ 1, which classifies a pair (X, Y) of drug names as confusable according to:

$h_{θ} (x) = g (y (x))$ (10)

where $y (x) = θ^{T} x = θ_{0} + θ_{1} x_{1} + θ_{2} x_{2} + θ_{3} x_{3}$; x₁ = Editex(X, Y), x₂ = NED(X, Y), and x₃ = Trigram2B(X, Y) are the results obtained from each individual measure; and $g (y (x)) = 1 / (1 + e^{- y (x)})$ is the sigmoid function [23].
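The hypothesis of Equation (10) is a sigmoid applied to a weighted sum of the three measures. The sketch below only illustrates the formula; the parameter values and feature values are hypothetical and are not the fitted values of LRM-3.

```python
import math


def sigmoid(z: float) -> float:
    """g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def lrm3(features, theta) -> float:
    """h_theta(x) for features (x1, x2, x3) = (Editex, NED, Trigram-2B).

    theta = (theta0, theta1, theta2, theta3), with theta0 as the intercept.
    """
    y = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
    return sigmoid(y)


# Hypothetical parameters: the two distances lower the score, the similarity raises it.
theta = (-1.0, -0.5, -2.0, 4.0)
# Hypothetical feature values for one pair of drug names.
features = (5.0, 0.5, 0.33)
print(lrm3(features, theta))  # a value strictly between 0 and 1
```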

2. Proposed method

In this paper, a logistic regression model is adjusted by an evolutionary approach to increase the accuracy of identifying confusable drug names. The hypothesis is that an evolutionary algorithm guided by the F-measure can adjust the parameters of the logistic regression model better than the traditional learning algorithm. If the similarity ranking between a set of drug names can be determined with better accuracy, then the problem of identifying confused drug names is reduced to the subset with the highest similarity scores. In other words, we treat this problem as an optimization problem using an evolutionary approach.

2.1. Optimization problem

Given a set M = {m₁, m₂, …, m_n} of normalized individual measures (between 0 and 1) and a set D = {d₁, d₂, …, d_m} of confused drug names, the problem of combining all the individual measures of M consists in finding the weight associated with each measure such that Equation (11) maximizes the f-measure of the set H retrieved by querying all pairs in D × D for the pairs in T, subject to the constraint $\sum_{i = 1}^{n} w_{i} = 1$. Therefore, the proposed Optimized Logistic Regression Model (OLRM) is defined as:

$OLRM\text{-}n (d_{i}, d_{j}) = \frac{1}{1 + e^{- \sum_{k = 1}^{n} w_{k} m_{k} (d_{i}, d_{j})}}$ (11)

2.2. Proposed genetic algorithm

A Genetic Algorithm (GA) is an Evolutionary Algorithm (EA) inspired by the natural selection mechanism proposed by Darwin, and it has proved to be an alternative for global optimization problems in large search spaces [18–22].

In the first step, the GA proposes a population of random solutions (initial population step) that are evaluated according to the objective function to optimize (fitness function step). In this sense, a solution to a problem is not absolute; that is, there is a set of possible solutions in which some are better than others. Considering mostly the best solutions (parent selection step), the GA proposes a new population by mixing (crossover step) some parts of a canonical codification (chromosome encoding step) of these good solutions in order to obtain better solutions (evolution principle). Eventually, mixing parts of the canonical codification can produce repeated solutions. Therefore, the GA applies a small variation (mutation step) to the canonical codification in the new population in order to explore new solutions. The new population is evaluated again, and the process is repeated until a satisfactory solution is reached or until some arbitrary stop criterion is met (stop condition) [34].

2.3. Proposed genetic operators

Chromosome Encoding. The weights W = {w₁, w₂, …, w_n} associated with the measures are represented by a binary chromosome with a precision of five decimals. Each weight has a value between 0 and 1.

Initial Population. All chromosomes in the initial population (P_o) are created randomly.

Fitness Function. The key step of a GA is the fitness function, which evaluates the aptitude of each chromosome. The objective of the FDA is to retrieve the LASA names closest to a proposed drug name; hence, the information-retrieval f-measure is used. Given a LASA pair (d_i, d_j) ∈ T, the f-measure for the query d_i evaluates the set of drug names retrieved in ranking 1 (the drug names most similar to the query d_i); if d_j does not appear in that set, the drug names retrieved in the next ranking are added, and so on until d_j appears. In this way, the f-measure evaluates the ability to find a relevant drug name from a query. The f-measure is the harmonic mean of recall R and precision P. Precision P is defined as the number of correctly retrieved units (LASA pairs) divided by the number of retrieved units; D, T and H are defined in Definition 1. $P (D) = \frac{| T \cap H |}{| H |}$ (12)

Recall R is defined as the number of correctly retrieved units divided by the number of relevant units. In this sense, precision measures the fraction of retrieved units that are relevant, while recall measures the fraction of relevant units that are retrieved. $R (D) = \frac{| T \cap H |}{| T |}$ (13)

The f-measure for the queries of all different drug names (set D) is defined as: $F\text{-}measure (D) = \frac{2 R P}{R + P}$ (14)

The f-measure can be obtained at every ranking. In fact, it is desired to improve the f-measure in the top four rankings. Therefore, the fitness function computes a macro-averaging f-measure for the queries of all different drug names (set D) based on the sum of the first four rankings. $fitness (D) = \sum_{r = 1}^{4} F\text{-}measure (D, r)$ (15)

In other words, the fitness function gives more relevance to the combination of weights in W that, after retrieving the queries of all different drug names with the combined OLRM-n measure, produces the best sum of the first four f-measure evaluations.
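The fitness computation can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of the paper: "ranking r" is approximated by the r highest-scoring candidates of each query, the macro-average is taken over the query names, and the data structures (`scores`, `lasa`) and function names are hypothetical.

```python
def f_measure(retrieved: set, relevant: set) -> float:
    """Harmonic mean of precision |T∩H|/|H| and recall |T∩H|/|T| for one query."""
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * recall * precision / (recall + precision)


def macro_f_at(scores: dict, lasa: dict, r: int) -> float:
    """Average over all queries of the f-measure of the top-r retrieved names."""
    total = 0.0
    for query, relevant in lasa.items():
        ranked = sorted(scores[query], key=scores[query].get, reverse=True)
        total += f_measure(set(ranked[:r]), relevant)
    return total / len(lasa)


def fitness(scores: dict, lasa: dict, max_rank: int = 4) -> float:
    """Sum of the macro-averaged f-measure over the first `max_rank` rankings."""
    return sum(macro_f_at(scores, lasa, r) for r in range(1, max_rank + 1))


# Toy data: scores[q][c] is the combined similarity of candidate c to query q,
# lasa[q] is the set of names known to be confused with q.
scores = {
    "zantac": {"xanax": 0.9, "contac": 0.8, "avandia": 0.1},
    "xanax": {"zantac": 0.9, "contac": 0.3, "avandia": 0.2},
}
lasa = {"zantac": {"xanax", "contac"}, "xanax": {"zantac"}}
print(fitness(scores, lasa))
```

In a GA run, `scores` would be recomputed for each chromosome from the weighted combination of the individual measures before calling `fitness`.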

Parent Selection. Once each chromosome has an associated fitness value, stronger chromosomes have a higher probability of being selected as parents (natural selection mechanism). Natural selection establishes that two good solutions (chromosomes) can produce better solutions; nevertheless, in some cases the offspring can be worse. In this step, the classical tournament selection is employed, where the strongest chromosome of a small random subsample is selected as a parent. The smaller the subsample, the greater the chance of selecting weaker chromosomes.

Crossover. An n-point crossover is used for mixing the genetic information of the parents. In this case, n random points between the genes of the parent chromosomes are selected, and two offspring chromosomes are created by swapping everything between the selected points.

Mutation. According to the evolution scheme, mutation happens rarely in nature, with a low probability of 0.1%. However, it is one of the fundamental mechanisms that sustain evolution. Since the chromosome has a binary codification, the classical inverse mutation operator is used.

Elite selection. It helps to keep the best solution of the previous generation.
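The operators above can be assembled into a compact GA loop. The sketch below is illustrative and rests on several assumptions that the text does not fix: 17 bits per weight (enough to resolve steps smaller than 10⁻⁵ in [0, 1]), normalization of the decoded weights so that they sum to 1, bit-flip as the interpretation of the inverse mutation operator, one-point crossover as the n = 1 case, and hypothetical values for population size and number of generations. The toy objective only stands in for the f-measure-based fitness.

```python
import random

BITS = 17  # assumption: 2**17 > 10**5, so each weight has about five decimals


def decode(chrom, n):
    """Map a bit list to n weights in [0, 1], normalized to sum to 1."""
    raw = []
    for k in range(n):
        bits = chrom[k * BITS:(k + 1) * BITS]
        raw.append(int("".join(map(str, bits)), 2) / (2 ** BITS - 1))
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else [1.0 / n] * n


def tournament(pop, fits, k, rng):
    """Return the fittest chromosome among k randomly drawn competitors."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: fits[i])]


def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(chrom, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chrom]


def run_ga(fitness, n, pop_size=20, generations=30, rate=0.001, k=2, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n * BITS)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fits = [fitness(decode(c, n)) for c in pop]
        history.append(max(fits))
        elite = pop[max(range(pop_size), key=lambda i: fits[i])]
        children = [elite[:]]  # elitism: the best chromosome survives unchanged
        while len(children) < pop_size:
            p1 = tournament(pop, fits, k, rng)
            p2 = tournament(pop, fits, k, rng)
            c1, c2 = one_point_crossover(p1, p2, rng)
            children += [mutate(c1, rate, rng), mutate(c2, rate, rng)]
        pop = children[:pop_size]
    fits = [fitness(decode(c, n)) for c in pop]
    best = max(range(pop_size), key=lambda i: fits[i])
    history.append(fits[best])
    return decode(pop[best], n), history


# Toy objective (hypothetical): prefer weights close to a known target vector.
target = [0.5, 0.3, 0.2]
toy_fitness = lambda w: -sum((a - b) ** 2 for a, b in zip(w, target))
weights, history = run_ga(toy_fitness, n=3)
print(weights, history[-1])
```

Because the elite chromosome is copied unchanged into each new population, the best fitness recorded in `history` never decreases from one generation to the next.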

3. Results and discussion

In the first section, the individual similarity measures are compared in order to establish a baseline result for identifying LASA pairs. In the second section, with the objective of showing how our proposed similarity measure increases accuracy when more individual measures are added, the LRM-3 method is compared to the proposed LRM-21 method [26]. That section also includes a statistical significance test to compare the proposed method with previous research.

In the third section, the proposed optimized logistic regression similarity measures OLRM-3 and OLRM-21 are evaluated and compared to the measure proposed by Lambert (LRM-3). In addition, the ranking of individual measures and the ranking of combined measures (OLRM-21) are compared.

For the experiments, the USP-858 dataset is used, which is the list of confused drug names reported by the USP [43]. This list contains 858 pairs of confused LASA names, with 630 unique drug names. One drug name can be involved in more than one confusion pair. With this list it is possible to generate 396,900 pairs of drug names, but only 0.3% of them are LASA pairs. This represents an unbalanced distribution between LASA pairs and non-LASA pairs of approximately 1:460.

Likewise, the macro-average f-measure described in section 2.3 is used to compare all approaches. It should be noted that previous works have used different evaluation measures. Since the LASA problem is treated as an information retrieval problem, the well-known macro-average f-measure is used in all the experiments.

Ideally, a system would retrieve exactly the 858 LASA pairs in the first ranking position. However, this is impossible because a drug name can be involved in up to four LASA pairs. Therefore, the fitness function is configured to improve the first four ranking positions.

3.1. Evaluation of individual similarity measures

All the orthographic and phonetic similarity measures previously described are individually evaluated with the USP-858 collection. The evaluation of each individual measure is computed using the macro-averaging f-measure accumulated over the first four rankings.

Figure 1 shows that the similarity measures NTED (orthographic distance measure), Trigram-2B2A (orthographic similarity measure), and Editex (phonetic distance measure) are the best ones to identify LASA pairs. It is worth noting, that the best similarity measures reported by Lambert, NED and Trigram-2B, fall into the ranking position and Phonix, Soundex and NLCS are the worst ones.

Fig.1

Ranking obtained for each individual measure according to the macro-averaging f-measure. Orthographic and phonetic measures are shown in light gray and dark gray, respectively.

3.2. Evaluation of combined similarity measures

First, the similarity measure proposed by Lambert, based on logistic regression over three individual measures (LRM-3), and our proposed similarity measure, based on logistic regression over 21 individual measures (LRM-21), are compared using the standard learning algorithm. For the LRM-3 measure, the original individual measures proposed by Lambert (Editex, NED, and Trigram-2B) are used. For the LRM-21 measure, all the normalized individual measures presented above are used.

As an initial evaluation, the LRM-3 and LRM-21 measures are evaluated with ten-fold cross-validation using the F-measure evaluation used by Lambert [31]. In this case, the LRM-3 measure obtains 98.67 and the LRM-21 measure obtains 98.64. These high results agree with the results of Lambert [31]. However, to retrieve the relevant LASA pairs in the top positions of a ranking, we propose to use a macro-averaging F-measure evaluation.

Table 2 shows the macro-averaging F-measure evaluations of the LRM-3 and LRM-21 measures. Our proposed LRM-21 measure outperforms the LRM-3 measure in all ranking positions, both for the F-measure and for the accumulated macro-averaging F-measure.

Table 2
Macro-averaging f-measure evaluation obtained after the learning process over the training set.

Ranking	LRM-3 F-Meas.	LRM-3 ΣF-Meas.	LRM-21 F-Meas.	LRM-21 ΣF-Meas.
1	0.511192	0.511192	0.547041	0.547041
2	0.449191	0.960383	0.465676	1.012717
3	0.386266	1.346649	0.403605	1.416322
4	0.335319	1.681968	0.353058	1.76938
5	0.296816	1.978784	0.314903	2.084283
6	0.265614	2.244398	0.28319	2.367473
7	0.238906	2.483304	0.256329	2.623802
8	0.216782	2.700086	0.234056	2.857858
9	0.198077	2.898163	0.213918	3.071776
10	0.181176	3.079339	0.197864	3.26964

LRM-21 outperforms LRM-3 with statistical significance at the 95% confidence level in the learning step over the training set. The fitness function of LRM-21 reaches 1.7847, compared to 1.6812 for LRM-3. Even though we are interested in improving only the first four positions, the LRM-21 measure is better in all ranking positions.

3.3. Evaluation of optimized logistic regression measures

In this section, we evaluate our proposed genetic-algorithm-optimized logistic regression measures, OLRM-3 and OLRM-21. The same individual similarity measures used for LRM-3 and LRM-21 are used for OLRM-3 and OLRM-21, respectively.

It should be mentioned that the same tuning of the operators of the proposed GA is used in all the following experiments. In the chromosome, each weight value (see section 2.1) has a precision of five decimals. The initial population consists of chromosomes of the form W = {w₁, w₂, …, w_n}. As explained above, the fitness function evaluates the first four positions of the ranking obtained with the F-measure. Only two competitors (k = 2) are selected by the tournament selection operator. The crossover operator is one-point crossover. A mutation probability of 0.17% is used. Also, an elite strategy preserves only the best chromosome for the next generation.
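The operator settings described above can be sketched as a small genetic algorithm. This is a minimal sketch under stated assumptions, not the authors' implementation: the population size, the number of generations, the gene range [-1, 1], the per-gene mutation scheme, and the placeholder fitness `toy_fitness` are all hypothetical. In the paper the fitness is the sum of the top four macro-averaging F-measure values.

```python
import random

PRECISION = 5            # decimals per gene, as stated in the text
TOURNAMENT_K = 2         # competitors per tournament
MUTATION_RATE = 0.0017   # 0.17%, applied per gene here (assumption)

def random_gene(rng, low=-1.0, high=1.0):
    # The gene range [low, high] is an assumption.
    return round(rng.uniform(low, high), PRECISION)

def tournament(population, fitnesses, rng):
    """Pick TOURNAMENT_K random chromosomes and return the fittest one."""
    contenders = rng.sample(range(len(population)), TOURNAMENT_K)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at a single random cut point."""
    if len(a) < 2:
        return a[:], b[:]
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom, rng):
    """Replace each gene by a fresh random value with probability MUTATION_RATE."""
    return [random_gene(rng) if rng.random() < MUTATION_RATE else g for g in chrom]

def evolve(fitness, n_genes, pop_size=50, generations=100, seed=0):
    rng = random.Random(seed)
    population = [[random_gene(rng) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        best_index = max(range(pop_size), key=lambda i: fitnesses[i])
        next_pop = [population[best_index][:]]  # elitism: keep only the best chromosome
        while len(next_pop) < pop_size:
            p1 = tournament(population, fitnesses, rng)
            p2 = tournament(population, fitnesses, rng)
            for child in one_point_crossover(p1, p2, rng):
                if len(next_pop) < pop_size:
                    next_pop.append(mutate(child, rng))
        population = next_pop
    return max(population, key=fitness)

def toy_fitness(weights):
    # Placeholder objective: closeness to an arbitrary target vector.
    target = [0.5, -0.25, 0.75]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

best = evolve(toy_fitness, n_genes=3, pop_size=20, generations=30, seed=1)
print(best, round(toy_fitness(best), 5))
```

Because the elite chromosome is copied unchanged into each new generation, the best fitness in the population can never decrease from one generation to the next.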

Fig. 2 shows the average weights found by the GA for OLRM-3. These weights show that there is no direct relation to the ranking position of the individual measures (see Fig. 1).

Fig. 2

Ranking obtained for the measure OLRM-3 based on the weights. The orthographic measures are shown in light gray and the phonetic measures in dark gray.

Fig. 3 shows the average weights found by the GA for OLRM-21. In this case too, the weights found show that there is no trivial relation to the ranking position of the individual measures (Fig. 1). It is worth noting that the phonetic measures hold upper positions in OLRM-21. For example, the phonetic measure Soundex, which is one of the worst measures when evaluated individually, now holds the seventh position.

Fig. 3

Ranking obtained for the measure OLRM-21 based on the weights. The orthographic measures are shown in light gray and the phonetic measures in dark gray.

Fig. 4 shows how the learning of our proposed OLRM-3 measure evolves through the generations of the GA. Already in the first generation, the learning of OLRM-3 outperforms the standard learning of LRM-3. Analogously, the learning of the proposed OLRM-21 measure outperforms the standard learning of our proposed LRM-21 measure. In this case, with 21 features, OLRM-21 reaches a better training result than OLRM-3.

Fig. 4

Training evolution of the macro-averaging F-measure obtained by OLRM-3 and OLRM-21 against the LRM-3 and LRM-21 results. The values of the training lines (for the OLRM-3 and OLRM-21 measures) correspond to the average of the top four positions of the macro-averaging F-measure of the training sets.

Fig. 4 also shows the correlation, over the evolution of the GA, between the results on the training dataset and on the test dataset. It is clear that the test results are related to the training results. On the test dataset, our proposed OLRM-3 outperforms LRM-3, but OLRM-21 could not outperform LRM-21.

The measures proposed in this work (LRM-21, OLRM-3, and OLRM-21) outperform LRM-3, previously presented by Lambert, in both the learning and the test processes. According to the Wilcoxon signed-rank test, the improvements of the proposed measures are statistically significant at the 95% confidence level (see Table 3).

Table 3

Comparison of our proposed evolutionary learning measures with the previous standard learning measures (* denotes the individual measure considered as baseline, and ** denotes the method proposed by Lambert).

Rank	Train	ΣF-Meas.	Test	ΣF-Meas.	p-val
1	OLRM-21	1.80165	LRM-21	1.76326	0.005
2	LRM-21	1.76938	OLRM-21	1.75739	0.005
3	OLRM-3	1.73365	OLRM-3	1.68182	0.005
4	LRM-3	1.68196	LRM-3	1.67528	**
5	NTED	1.62178	NTED	1.62178	*
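For readers who want to reproduce this kind of significance check, the sketch below implements an exact two-sided Wilcoxon signed-rank test in plain Python and applies it to LRM-3 versus LRM-21. Pairing the ten ranking positions of Table 2 is an assumption made only for illustration, since the text does not state which paired samples were used. The function name `wilcoxon_signed_rank` is likewise hypothetical. Exact enumeration of sign patterns is practical only for small samples.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    """
    diffs = [b - a for a, b in zip(x, y) if b != a]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)
    # Exact null distribution: every sign pattern of the ranks is equally likely.
    count = sum(1 for signs in product((0, 1), repeat=n)
                if sum(r for r, s in zip(ranks, signs) if s) <= stat)
    p_value = min(1.0, 2 * count / 2 ** n)
    return stat, p_value

# Per-position F-measure values copied from Table 2 (pairing is an assumption).
lrm3 = [0.511192, 0.449191, 0.386266, 0.335319, 0.296816,
        0.265614, 0.238906, 0.216782, 0.198077, 0.181176]
lrm21 = [0.547041, 0.465676, 0.403605, 0.353058, 0.314903,
         0.283190, 0.256329, 0.234056, 0.213918, 0.197864]
stat, p_value = wilcoxon_signed_rank(lrm3, lrm21)
print(stat, p_value, p_value < 0.05)
```

Statistical packages often use a normal approximation instead of exact enumeration, so the p-value they report can differ from the exact one computed here.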

4. Conclusion

LASA is a preventable health hazard that is still growing despite more than two decades of research. In this paper, an evolutionary learning method for a logistic regression model is proposed to improve the training process despite the unbalanced dataset of potential LASA pairs. For this, a genetic algorithm with a fitness function based on the sum of the top four macro-averaging F-measure values is proposed. Specifically, optimizing the sum of the top four macro-averaging F-measure values favors accuracy in the top ranking positions.

According to the experimentation, our 21-combined measure based on a standard-learning logistic regression method (LRM-21) outperforms the 3-combined measure based on a standard-learning logistic regression method (LRM-3) on the train and test datasets. Also, our proposed 3-combined measure based on an evolutionarily adjusted logistic regression model (OLRM-3) outperforms LRM-3 on the train and test datasets. The same behavior is preserved with our proposed 21-combined measure based on an evolutionarily adjusted logistic regression method (OLRM-21), which outperforms LRM-3 on the train and test datasets, and LRM-21 on the train dataset. However, our proposed LRM-21 obtains the best result on the test dataset. According to the results of the evolution, the training results of the optimized models and the test results are related.

The ranking of the weights found for the proposed measures does not show a direct relation to the ranking of the individual similarity measures. Thus, the novel method to train a logistic regression model is an option for outperforming the standard learning algorithms of machine learning models. In contrast to the traditional learning algorithm, the proposed combined measure shows that the greater the number of individual measures, the better the results.

In future work, more individual measures must be tested to improve accuracy on the LASA problem. It will also be interesting to select a different machine learning model and apply an evolutionary process to tune its internal parameters.

Acknowledgments

Work done under partial support of Mexican Government CONACyT. We also thank UAEMex for their assistance.

References

1. ASHP guidelines on preventing medication errors in hospitals, American Journal of Health-System Pharmacy 50 (1993), 305–314.

2. G.W. Adamson and Boreham, The use of an association measure based on character structure to identify semantically related pairs of words and document titles, Information Storage and Retrieval 10 (1974), 253–260.

3. Aneja, A.R. Patki and Kumbhalwar, Approximate proper name matching, 2007.

4. L.-C. Chen, C.-H. Chen, H.-M. Chen and V.S. Tseng, Hybrid data mining approaches for prevention of drug dispensing errors, Journal of Intelligent Information Systems 36 (2011), 305–327.

5. M.R. Cohen, G.D. Domizio and R.E. Lee, The role of drug names in medication errors, Medication errors. Washington, DC: The American Pharmacists Association (2007), 87–110.

6. Craigle, MedWatch: The FDA safety information and adverse event reporting program, Journal of the Medical Library Association 95 (2007), 224–225.

7. de Andrade-Azevedo, Azevedo-Anacleto and Borges-Rosa, Nomes de medicamentos com grafia ou som semelhantes: Como evitar erros, Bol ISMP-Brasil 3 (2014).

8. B.K. Dixon, Similar drug names a growing cause of errors, Internal Medicine News 41 (2008), 51–51.

9. A.K. Elmagarmid, P.G. Ipeirotis and V.S. Verykios, Duplicate record detection: A survey, IEEE Transactions on Knowledge and Data Engineering 19 (2007).

10. FDA, PDUFA Pilot Project - Proprietary Name Concept Paper, 2008.

11. FDA, FDA and ISMP Work to Prevent Medication Errors, 2012.

12. FDA, Guidance for industry. Contents of a complete submission for the evaluation of proprietary names, 2014.

13. FDA, Phonetic and Orthographic Computer Analysis (POCA) program, 2017.

14. Gadd, PHONIX: The algorithm, Program 24 (1990), 363–366.

15. B.H. Garcia, Elenjord, Bjornstad, K.H. Halvorsen, Hortemo and Madsen, Safety and efficiency of a new generic package labelling: A before and after study in a simulated setting, BMJ Quality & Safety 26 (2017), 817–823.

16. J.A. Gershman and A.D. Fass, Medication safety and pharmacovigilance resources for the ambulatory care setting: Enhancing patient safety, Hospital Pharmacy 49 (2014), 363–368.

17. K.A. Getz, Stergiopoulos and K.I. Kaitin, Evaluating the completeness and accuracy of MedWatch data, American Journal of Therapeutics 21 (2014), 442–446.

18. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison Wesley, Reading, Massachusetts, 1989.

19. Gupta, A.P. Srivastava and S. Awasthi, Fast and Effective Searches of Personal Names in an International Environment, International Journal of Innovative Research in Engineering and Management 1 (2014).

20. Hicks, D.D. Cousins and R.L. Williams, Summary of information submitted to MEDMARX in the year 2002: The quest for quality, US Pharmacopeia, 2003.

21. R.W. Hicks, S.C. Becker and D.D. Cousins, MEDMARX data report. A report on the relationship of drug names and medication errors in response to the Institute of Medicine's call for action, in Center for the Advancement of Patient Safety, US Pharmacopeia, Rockville, MD, 2008.

22. J.H. Holland, Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence, MIT Press, 1992.

23. D.W. Hosmer Jr, Lemeshow and R.X. Sturdivant, Applied logistic regression, John Wiley & Sons, 2013.

24. Kondrak, N-gram similarity and distance, in: String processing and information retrieval, Springer, 2005, pp. 115–126.

25. Kondrak and Dorr, Identification of confusable drug names: A new approach and evaluation methodology, in: Proceedings of the 20th International Conference on Computational Linguistics, Association for Computational Linguistics, Geneva, Switzerland, 2004, p. 952.

26. Kondrak and Dorr, Automatic identification of confusable drug names, Artificial Intelligence in Medicine 36 (2006), 29–42.

27. Kondrak and B.J. Dorr, A similarity-based approach and evaluation methodology for reduction of drug name confusion, in Alberta Univ Edmonton, 2003.

28. Kovacic and Chambers, Look-alike, sound-alike drugs in oncology, Journal of Oncology Pharmacy Practice 17 (2011), 104–118.

29. B.L. Lambert, Predicting look-alike and sound-alike medication errors, American Journal of Health-System Pharmacy 54 (1997), 1161–1171.

30. B.L. Lambert, K.-Y. Chang and S.-J. Lin, Effect of orthographic and phonological similarity on false recognition of drug names, Social Science and Medicine 52 (2001), 1843–1857.

31. B.L. Lambert, S.-J. Lin, K.-Y. Chang and S.K. Gandhi, Similarity as a risk factor in drug-name confusion errors: The look-alike (orthographic) and sound-alike (phonetic) model, Medical Care 37 (1999), 1214–1225.

32. B.L. Lambert, Yu and Thirumalai, A system for multiattribute drug product comparison, Journal of Medical Systems 28 (2004), 31–56.

33. V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet Physics Doklady, 1966, pp. 707–710.

34. Mitchell, An introduction to genetic algorithms, Cambridge, Massachusetts London, England, Fifth printing, 1999.

35. Nagata, Kimura and Tsuchiya, Similarity index for sound-alikeness of drug names with pitch accents, Procedia Computer Science 35 (2014), 1519–1528.

36. Pfeifer, Poersch, Fuhr and L. Vi, Searching Proper Names in Databases, in: HIM, Citeseer, 1995, pp. 259–275.

37. J.J. Pollock and A. Zamora, Automatic spelling correction in scientific and scholarly text, Communications of the ACM 27 (1984), 358–368.

38. Rahman and Parvin, Medication errors associated with look-alike/sound-alike drugs: A brief review, Journal of Enam Medical College 5 (2015), 110–117.

39. S.R. Schroeder, M.M. Salomon, W.L. Galanter, G.D. Schiff, A.J. Vaida, M.J. Gaunt, Bryson, Rash, Falck and B.L. Lambert, Cognitive tests predict real-world errors: The relationship between drug name confusion rates in laboratory-based memory and perception tests and corresponding error rates in large pharmacy chains, BMJ Quality & Safety 26 (2016), 395–407.

40. M.B. Shah, Merchant, I.Z. Chan and Taylor, Characteristics that may help in the identification of potentially confusing proprietary drug names, Therapeutic Innovation & Regulatory Science (2016), 2168479016667161.

41. Teplitsky, Hazards of sound-alike look-alike drug names, California Medicine 119 (1973), 62.

42. P.L. Trbovich and Hyland, Responding to the challenge of look-alike, sound-alike drug names, BMJ Quality & Safety 26 (2017), 357–359.

43. USP, USP Quality Review (76). US Pharmacopeia, 2001.

44. USP, USP Quality Review (79). US Pharmacopeia, 2004.

45. R.A. Wagner and M.J. Fischer, The string-to-string correction problem, J ACM 21 (1974), 168–173.

46. WHO, Medication Without Harm Global Patient Safety Challenge on Medication Safety, 2017.

47. Zobel and Dart, Phonetic string matching: Lessons from information retrieval, in: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 1996, pp. 166–172.