Abstract
This study introduces a computerized model for evaluating the corporate performance of companies traded on the main world stock markets. The main contribution of this study is the use of “Soft Regression”, a soft computing modeling tool based on fuzzy logic, in financial statement analysis. Specifically, the tool is used to identify the most important financial ratios explaining the performance (as reflected by Operating Income Margin) of publicly traded companies belonging to the manufacturing industries (SIC codes 2000–3999). We used data extracted from the XBRL database for the years 2012 to 2016.
The main results and conclusions of the study are as follows. The study identified the relevant financial ratios for the manufacturing industry. It also revealed the relative importance of the various categories of financial ratios. A detailed comparison of the results for 2012 and 2016 indicated a high degree of consistency and stability over time. Not all financial ratios are equally relevant for all industries. Proxy variables belonging to the same category of financial ratios are interchangeable in our model: whichever ratios from a given category are used, the results are very similar for both 2012 and 2016. All the resulting indicators imply that the model is highly reliable and robust.
The main contribution of this study is to present a soft computing modeling tool based on fuzzy logic which is intuitive, stable and not based on restrictive assumptions.
Introduction
The main objective of this study is to use “Soft Regression”, a Soft Computing tool based on Fuzzy Logic, to identify indicators associated with the successful performance of corporations in terms of corporate earnings. One of the main endeavors of financial analysts is to evaluate corporate earnings, and financial statements are an important input in this process. Financial statement analysis is used to provide stock recommendations to investors and is also used to generate benchmarks [1]. While financial ratios play an important role in this process, the question is which financial ratios, among the hundreds that can be computed, should be analyzed [2].
The aim of this study is to build a model using various financial ratios as explanatory variables, and to identify the ratios that are most closely associated with the companies’ earnings. There are many financial ratios, all of them well known and widely accepted by professionals involved in analyzing financial reports. Our study attempts to find out which ratios are significant in explaining the behavior of the dependent variable (companies’ profitability) and, even more importantly, what the relative importance of these ratios is among themselves. Identifying the most important financial ratios associated with earnings is an important contribution to the design of investment strategies.
An important factor that makes it very difficult to build a model of financial ratios with conventional modeling tools is the substantial mathematical correlation among various financial ratios. Multicollinearity does not allow all the relevant financial ratios to be incorporated into the same equation, thus undermining the reliability of the results due to model misspecification. In addition, the exclusion of some explanatory variables due to multicollinearity makes the computation of the relative importance of the whole set of variables incorrect, because the excluded variables are implicitly assigned a weight of zero, even when they are important variables that could become significant under a different model specification.
Therefore, in this study we utilize “Soft Regression” (SR), which is a Soft Computing modeling tool based on Fuzzy Information Processing. SR does not require independence of explanatory variables, and thus multicollinearity does not affect the reliability of the modeling results. In other words, SR allows explanatory variables to be incorporated into the same equation even if they are mathematically correlated. In addition, SR generates a reliable computation of the relative importance of the explanatory variables among themselves [3]. More details regarding SR are presented below.
Financial ratios are computed from data reported in the eXtensible Business Reporting Language (XBRL). Since 2011, the Securities and Exchange Commission (SEC) has mandated the XBRL format for the reporting of financial data by all publicly traded companies. XBRL facilitates information gathering and processing, since the data is easily downloaded from the internet and translated into Excel format, which should benefit users of financial reporting information. We use annual data between 2012 and 2016 in order to demonstrate the stability and consistency of the results, thus pointing to the reliability and robustness of the model.
Literature survey
Evaluating firm performance using financial ratios is the traditional tool for decision makers, including investors and researchers. Financial ratios express the relationship between total amounts observed in the financial statements allowing comparisons to be made across companies of different industries and different sizes, and within a company across time. The main issue raised over time is which of these ratios, among the hundreds that can be computed, should be analyzed to obtain the necessary information for the required decision.
Because not all ratios are informative and provide high discrimination power, it is necessary to filter out unrepresentative variables from a given data set through feature selection techniques [4]. Many well-known feature selection/extraction techniques have been used as a first step for bankruptcy prediction; the more traditional methods are the correlation matrix [5], the t-test, factor analysis, and stepwise logistic regression [6]. Logistic regression has also been a key method in feature selection for research focused on the usefulness of accounting ratios in predicting earnings movement, which can consequently be used as the basis for a profitable investment strategy [7–12].
Feature selection/extraction has also been found to enhance the performance of AI methods. Principal Component Analysis (PCA) has been found to increase the performance of models using financial ratios in bankruptcy predictions [13] and the performance of bankruptcy prediction models [14].
A comparison of five well-known feature selection methods in bankruptcy prediction was carried out in [13]. The paper compared the t-test, correlation matrix, stepwise regression, PCA and factor analysis; multi-layer perceptron neural networks were used as the prediction model. The results showed that the t-test feature selection method outperformed the other methods.
In feature selection, we choose a subset of features from a set of features. In feature extraction, we create new features from the set (in PCA, we might take 7 principal components (PCs) out of 60 features; these 7 PCs are built from all the features). That is, in feature selection we choose among the original features, whereas in feature extraction we create new features based on the original features.
Financial time series data were found to be characterized by noise, chaos and a high degree of uncertainty, and to contain strong nonlinearity and outliers [15, 16].
Soft Regression (SR) is an Artificial Intelligence (Soft Computing) modeling tool based on Fuzzy and Heuristic Information Processing. It has been evolving since the 1990s (for more details see [17]). A comparison of SR to the Multivariate Regression method appears in [18]. The computation of the relative importance of explanatory variables (RELIMP) by SR versus traditional regression methods is presented in [19]. A detailed explanation of RELIMP (based on SR) and an evaluation of its reliability are presented in [3].
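The distinction can be illustrated with a minimal sketch on synthetic data. The function names and the toy numbers are hypothetical, and the extraction step uses plain power iteration on the covariance matrix as a simple stand-in for a full PCA; it is not the procedure used in the cited studies.

```python
import math

def select_features(rows, keep):
    """Feature selection: keep a subset of the ORIGINAL columns unchanged."""
    return [[row[j] for j in keep] for row in rows]

def first_principal_component(rows, iterations=200):
    """Feature extraction: build one NEW feature as a weighted mix of all columns.

    Power iteration on the sample covariance matrix approximates the
    first principal component direction.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    w = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    scores = [sum(r[j] * w[j] for j in range(m)) for r in centered]
    return w, scores

# Synthetic toy data: 5 companies x 3 ratios (not taken from the study).
rows = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 8.1, 0.5],
    [5.0, 9.8, 0.4],
]
selected = select_features(rows, keep=[0, 2])       # two original columns
weights, scores = first_principal_component(rows)   # one new, combined column
```

The selected columns keep their original meaning, whereas every entry of `scores` depends on all three input ratios through `weights`.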
Extensive literature addressing the precision and reliability of XBRL data is presented in detail below.
Extensible business reporting language
XBRL (eXtensible Business Reporting Language) is a freely available and global standard designed for exchanging business information. XBRL allows the expression of semantic meaning commonly required in business reporting. One use of XBRL is to define and exchange financial information, such as financial statements.
The U.S. Securities and Exchange Commission (SEC) has created the XBRL U.S. GAAP Financial Reporting Taxonomy. This taxonomy is a collection of accounting data concepts and rules that enables companies to present their financial reports electronically. The SEC’s deployment was launched in 2008 in phases, and all public U.S. GAAP companies were required to file their financial reports using the XBRL reporting technology starting from June 15, 2011.
XBRL has several advantages over COMPUSTAT, which has been a popular source of financial information for both academics and practitioners. One advantage is that XBRL data is freely available, while COMPUSTAT is costly. XBRL filings also have a time advantage: it takes an average of 14 weekdays from the time a company files with the SEC for that data to appear in COMPUSTAT [20, 21], while XBRL data is published concurrently with the related PDF versions and is immediately available. In addition, the reliability of COMPUSTAT has been questioned: prior studies have shown that COMPUSTAT data may differ from the original corporate financial data [22–24] and from data found in other accounting databases [25, 26].
The model
Dependent Variable:
The dependent variable is A20-Operating Income Margin (operating income divided by total revenues)
Explanatory Variables:
Financial ratios have played an important part in evaluating the financial condition of companies [2]. Different ratios and a variety of financial ratio classification systems have been suggested [27]. In this paper we follow one of the most common classifications, as presented in numerous textbooks [28].
The ratios are commonly classified as follows:
Liquidity refers to the ability to pay for short term liabilities, current as well as liabilities which mature within the next year. The payment is expected to be in terms of present liquid assets as well as assets which are expected to become liquid within the next year. Efficiency, measured as Cash conversion cycle, refers to the ability to sell inventory, collect payment from customers and pay suppliers. Efficiency classification has very similar features to the liquidity classification, since for most companies the ability to pay their liabilities within the next year will depend on their ability to collect cash from customers. It is therefore common in many ratio classification schemes, to lump these two classifications together.
The investment decision is usually based on two important factors, risk and return. Of the classifications presented above, the first two represent the company’s risk level, that is, its ability to pay its debts, fund its operations and survive in the short and the long run. The last two represent return to the investor: profitability represents the potential for return, while the market ratios represent the actual return.
It should be noted that the Price/Earnings (P/E) ratio, which is classified as a market ratio and represents the price the investor is willing to pay for one unit of earnings, is a special case in terms of its relationship with future earnings. Traditional capital markets theory assumes that the market is efficient in the sense that useful information, such as earnings information, influences the adjustment of share price [29, 30]. In other words, earnings changes can be used as an explanatory variable for the market price. However, the relationship has also been shown to run in the opposite direction: current price changes may be used as an explanatory variable of future earnings [31, 32].
The results of the analysis presented in our study demonstrate that the variables found significant represent all four categories of financial ratios discussed above:
The company’s liquidity and efficiency are represented by the sales to total cash ratio. This is an inclusive ratio which represents the company’s ability to generate cash (and not just accounts receivable) from its current sales and to remain liquid. The ability to generate cash is pertinent to the company’s ability to pay off its current debts (efficiency).
The company’s solvency is represented by the Interest Coverage Ratio and by Cash Flow from Operations to Total Debt. The first ratio measures the proportion of operating income that is used to cover interest payments; since these payments are usually made on a long-term basis, they are often treated as an ongoing expense. This ratio is also used to indicate the company’s capitalization efficiency, that is, the impact of the company’s choices in raising capital. The second ratio indicates how long it will take the company to pay off all of its debt if it devotes all of its cash flow from operations to debt repayment; it provides a snapshot of the overall financial health of the company.
It is reasonable that the classification that is most prominent and has the most significant variables is profitability. Profitability ratios represent the relative measures of the earnings (profits) the company created, and therefore have the closest association with the earnings themselves.
The market ratios represent the relationship between the company’s actual profits and the investor returns (gains from an increase in the price of the shares or from the distribution of dividends). There is a representation of both the gains from shares (P/E ratio) and the gains from dividends (Payment of dividends as a % of operating cash flow).
Proxy variables
The four types of financial ratios presented above are represented by measurable quantitative proxy variables as presented below. Appendix 1 shows all the accounting descriptors examined in the first phase of analysis. From these descriptors, the proxy variables for each “financial ratios” category were selected as follows:
A52-Sales to total cash
A54-Sales to total working capital
A47-Times Interest Earned
A73-Cash From Operations (CFO) to Total Debt
A35-ROA (Return on Assets)
A50-Pre-taxes income over Sales
A51-Net Profit Margin
A57-Research & Development Expense to Sales
A59-Operating Income to Total assets
A70-EBITDA Margin Ratio: EBITDA (Earnings Before Interest, Taxes, Depreciation and Amortization) to Total Revenue
A21-P/E Ratio
A75-Payment of Dividends as % of OCF (Operating Cash Flow)
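To make the variable definitions concrete, the following sketch computes the dependent variable and a few of the proxy ratios from raw statement balances. The field names, the sample figures, and the choice of operating income as the numerator of Times Interest Earned are assumptions for illustration; the ratio formulas other than Operating Income Margin follow common textbook definitions rather than definitions stated in this paper.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

def compute_ratios(f):
    """Compute selected ratios from a dict of statement balances (hypothetical keys)."""
    return {
        # A20: operating income divided by total revenues (as defined in the text)
        "operating_income_margin": safe_div(f["operating_income"], f["total_revenue"]),
        # A52: sales to total cash
        "sales_to_total_cash": safe_div(f["total_revenue"], f["cash"]),
        # A47: times interest earned (assumed: operating income / interest expense)
        "times_interest_earned": safe_div(f["operating_income"], f["interest_expense"]),
        # A73: cash from operations to total debt
        "cfo_to_total_debt": safe_div(f["cfo"], f["total_debt"]),
        # A35: return on assets (assumed: net income / total assets)
        "roa": safe_div(f["net_income"], f["total_assets"]),
        # A51: net profit margin (assumed: net income / total revenue)
        "net_profit_margin": safe_div(f["net_income"], f["total_revenue"]),
    }

# Synthetic balances for one hypothetical company (not real data).
sample = {
    "total_revenue": 1000.0, "operating_income": 150.0, "net_income": 90.0,
    "cash": 200.0, "interest_expense": 30.0, "cfo": 120.0,
    "total_debt": 400.0, "total_assets": 1200.0,
}
ratios = compute_ratios(sample)
```

Returning `None` for undefined ratios (for example, zero interest expense) keeps missing values explicit instead of raising an error in the middle of a batch.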
Data
Using the NASDAQ company list (http://www.nasdaq.com/screening/company-list.aspx), all 6,670 companies (tickers) listed on the three major US stock exchanges (AMEX, NASDAQ, and NYSE) were identified.
The annual financial data was obtained using XBRL Analyst (created by FinDynamics), an Excel plugin that allows users to access a company’s XBRL-tagged data from its XBRL SEC filings via the XBRL US database. This software not only provides easy access to the data and facilitates its analysis, but also allows the calculation of missing balances. For example, a balance for total liabilities that is not available in the original XBRL filing is calculated by XBRL Analyst from the reported data. The data obtained were annual filings from 2012 to 2016 (5 years).
The process of selecting a subset of relevant features to be used in the model construction was also used to create the financial ratios. Of the 6,670 tickers originally identified using the NASDAQ company list, 2,561 were removed. The reasons for removal were: no data were reported in XBRL format, the ticker was for a non-common stock, the company had its IPO between 2012 and 2016, or the company had more than one ticker (the same CIK).
The final sample included 4,109 companies (61.6% of all tickers listed) that were publicly traded in Q3/2017. For the purpose of this study it was decided to examine only one industry, the manufacturing industry (SIC codes 2000–3999), which is the largest industry in the sample (1,597 tickers, 38.9% of the total sample), of which 1,585 reported operating income.
Sixty variables (based on [7]) were extracted from the XBRL filing database (Appendix 1). It should be noted that some of the variables had to be calculated from the original filing, whereas other variables were already calculated as part of the XBRL Analyst tool. We ended up with 622 companies having positive operating income for 5 consecutive years (2012–2016), 246 companies with negative operating income for 5 consecutive years, and 398 companies that had both positive and negative operating income over the 5 years.
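The sample-size figures above can be checked with a few lines of arithmetic. The helper `is_manufacturing` is a hypothetical illustration of the SIC filter; only the counts come from the text.

```python
listed_tickers = 6670      # tickers on AMEX, NASDAQ and NYSE
removed_tickers = 2561     # removed for the reasons listed above
final_sample = listed_tickers - removed_tickers

share_of_listed = 100.0 * final_sample / listed_tickers

manufacturing_tickers = 1597
share_manufacturing = 100.0 * manufacturing_tickers / final_sample

def is_manufacturing(sic_code):
    """True when the SIC code falls in the manufacturing range 2000-3999."""
    return 2000 <= sic_code <= 3999

print(final_sample)                      # 4109
print(round(share_of_listed, 1))         # 61.6
print(round(share_manufacturing, 1))     # 38.9
```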
Method
The above description of the explanatory variables points to the possibility of a mathematical correlation among some of them. This means that it becomes impossible to include all of them together in the model when utilizing traditional modeling tools such as multivariate regression (MVR) (see [3]). Due to multicollinearity, some of the explanatory variables become insignificant not because they are insufficiently related to the dependent variable, but because of technical limitations of MVR. We avoid this problem by utilizing the SR modeling tool, in which explanatory variables are not required to be independent of each other.
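The severity of this problem in conventional regression can be gauged with the variance inflation factor (VIF). The sketch below is not part of the study: it uses the standard identity that, with two regressors, VIF = 1 / (1 - r^2), where r is their Pearson correlation, and the data series are synthetic.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_regressors(x1, x2):
    """VIF of either regressor in a two-regressor model: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Synthetic series: ratio_b is almost a multiple of ratio_a, ratio_c is not.
ratio_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ratio_b = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
ratio_c = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]

vif_high = vif_two_regressors(ratio_a, ratio_b)   # nearly collinear pair
vif_low = vif_two_regressors(ratio_a, ratio_c)    # moderately correlated pair
```

A very large VIF means that the coefficient variance of the regressor is inflated by that factor, which is why such variables often appear insignificant in MVR.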
Soft Regression
SR is a modeling tool based on soft computing concepts such as Fuzzy Logic [33]. The technical details of the SR method are described in [3, 19].
We briefly describe several important characteristics of SR that differ from those of traditional MVR and thus justify its use in this study. These characteristics are as follows.
Soft regression does not require precise model specification. This regression tool is based on Fuzzy Logic, which was designed in the first place to handle information under severe conditions of uncertainty and imprecision [33]. The idea is to give up on the possibility of building a precise model and to be satisfied with the opportunity to work with whatever data are available. We generate a partial, less precise model that can still be very reliable in the general direction of its conclusions because it avoids the problem of misspecification bias. This can be summarized as follows: it is preferable to have imprecise but broadly correct results (SR) rather than precise results (containing a small statistical error) that are incorrect (due to misspecification bias in MVR). Of course, modeling projects in which some potentially important variables are excluded because multicollinearity makes them insignificant (the MVR method) are misspecified by definition.
Explanatory variables are not required to be independent of each other. In fields such as Economics and Finance, the variables are usually intangible concepts that are often highly correlated among themselves mathematically, even though logically each could represent a separate and (at least to some extent) independent concept. When using MVR, correlation among explanatory variables causes some important explanatory variables to appear insignificant and therefore to be removed from the model, thus leading to model misspecification. Hence, this feature of SR (not requiring independence of explanatory variables and thus not removing variables due to multicollinearity) constitutes a major advantage over MVR.
The relative importance of the explanatory variables among themselves is not affected by adding or removing variables. When a partial model is constructed, the significance of the explanatory variables and their relative importance among themselves are not affected by adding variables to the model or removing variables from it. This is in contrast to the behavior of MVR, where the addition or removal of an explanatory variable can drastically change the significance and even the coefficient sign of other explanatory variables in the model. This characteristic of SR adds an important element of stability to research and decision making.
The method requires normalized data. We introduce heuristically determined maximum and minimum thresholds (for the maximum and minimum values during the normalization of the data; see the explanation below). This helps to handle distortions due to outlying values with a user-based logical approach (in contrast to the strictly mathematical methods used in sophisticated traditional techniques such as Robust Regression).
In SR there is a dependent variable and $m$ numerical vectors ($m$ columns) of explanatory variables. Let $Y = (y_1, y_2, \ldots, y_n)$ be the $n$-dimensional vector of the dependent variable to be explained, and let
Normalizing data: the conversion of numerical vectors into fuzzy sets requires their projection into equivalent vectors of the corresponding grades of membership (between zero and one, where 1 represents full membership, and 0 represents no membership at all in the set), based on predefined membership function which is expected logically to reflect the membership of each element in the fuzzy set. It is the critical requirement of this method that the membership function must be in line with human logic and common sense. This is the reason why the normalizing process is described in great detail below.
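As a minimal sketch of such a normalization, assume a piecewise-linear membership function between a heuristically chosen lower and upper threshold. This shape is an assumption made for illustration; the membership function actually used in the study is the one defined in [17] and may differ. The function name and thresholds are hypothetical.

```python
def membership(value, lower, upper):
    """Grade of membership in [0, 1] under an ASSUMED piecewise-linear shape.

    Values at or below `lower` get 0 (no membership), values at or above
    `upper` get 1 (full membership), and values in between are interpolated.
    Clipping at the thresholds limits the influence of outliers.
    """
    if upper <= lower:
        raise ValueError("upper threshold must exceed lower threshold")
    if value <= lower:
        return 0.0
    if value >= upper:
        return 1.0
    return (value - lower) / (upper - lower)

# Hypothetical thresholds for an operating income margin: 0% and 40%.
margins = [-0.50, 0.00, 0.10, 0.20, 0.40, 0.90]
grades = [membership(m, lower=0.0, upper=0.4) for m in margins]
```

Because extreme observations such as -0.50 and 0.90 are mapped to exactly 0 and 1, a single outlier cannot stretch the scale for the remaining companies.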
Based on [17], we define the membership function as follows: Let’s define
For all other elements (between
Normalizing the data - the implementation:
All the companies in our database were divided into three groups. The group of “Winners” contains companies that were continuously profitable, that is, reported a positive net income on an annual basis for every year from 2012 to 2016 inclusive. The group of “Losers” contains companies that reported a negative net income on an annual basis for every year from 2012 to 2016 inclusive. The “Middle Group” contains all the remaining companies.
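The three-way grouping can be sketched as a small classifier. The function name and the sample income series are hypothetical; the rule itself follows the definitions of the three groups given above.

```python
def classify_company(annual_income):
    """Classify a company from its yearly net income figures (2012 to 2016)."""
    if not annual_income:
        raise ValueError("at least one year of data is required")
    if all(v > 0 for v in annual_income):
        return "Winner"    # positive income in every year
    if all(v < 0 for v in annual_income):
        return "Loser"     # negative income in every year
    return "Middle"        # mixed signs, or a zero in any year

# Synthetic five-year income series for three hypothetical companies.
groups = {
    "AAA": classify_company([12.0, 15.0, 9.0, 20.0, 18.0]),
    "BBB": classify_company([-3.0, -1.0, -7.0, -2.0, -4.0]),
    "CCC": classify_company([5.0, -2.0, 3.0, 4.0, -1.0]),
}
```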
Justification: As was stated above, the process must be in line with human logic and common sense and modelers should be capable of defending their decisions. For example, for
Similar, but inverse reasoning applies to
Every variable where
Computing Similarity ($S_{Y,X_j}$):
We compute the similarity between the dependent variable and every explanatory variable $v_j$ ($j = 1, \ldots, m$) in the following way: we define distance for direct relation between variables:
If
The similarity or closeness (denoted by $S_{Y,X_j}$) of each explanatory variable $X_j$ to $Y$ is then computed as:
The measure of similarity indicates the degree to which an explanatory variable behaves in a pattern similar (directly or inversely) to that of the dependent variable. Therefore, the measure of similarity $S_{Y,X_j}$ parallels the traditional statistical measures of significance (t-tests or sig.). However, in addition to a significant relation ($S_{Y,X_j} \geq 0.8$), there is an option of partial significance ($0.7 < S_{Y,X_j} < 0.8$): the closer $S_{Y,X_j}$ is to 0.7, the closer the variable is to insignificance. This gradual transition from full significance to full insignificance adds an element of stability to the modeling process when soft regression is used.
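The significance bands described above can be sketched in code. The thresholds 0.8 and 0.7 come from the text. The similarity computation itself is an assumption made for illustration, namely one minus the mean absolute distance between membership grades, with the explanatory grades flipped to 1 - x for an inverse relation; the exact formulas used in SR are given in [3, 19] and may differ. All function names are hypothetical.

```python
def similarity(y, x, inverse=False):
    """ASSUMED similarity: 1 minus the mean absolute distance of membership grades.

    For an inverse relation the explanatory grades are flipped to 1 - x.
    """
    if len(y) != len(x) or not y:
        raise ValueError("y and x must be non-empty and of equal length")
    distances = [abs(yi - ((1.0 - xi) if inverse else xi)) for yi, xi in zip(y, x)]
    return 1.0 - sum(distances) / len(distances)

def significance(s):
    """Map a similarity score to the bands stated in the text."""
    if s >= 0.8:
        return "significant"
    if s > 0.7:
        return "partially significant"
    return "insignificant"

# Synthetic membership grades (values in [0, 1]) for four observations.
y = [0.9, 0.8, 0.1, 0.2]
x_direct = [0.8, 0.9, 0.2, 0.1]
x_inverse = [1.0 - v for v in x_direct]

s_direct = similarity(y, x_direct)
s_inverse = similarity(y, x_inverse, inverse=True)
label = significance(s_direct)
```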
Computing combined similarity (
Once similarity measures are computed for all the explanatory variables, the next step is to calculate collective contribution of all the explanatory variables combined in explaining the behavior of dependent variable. For every observation, we select the element from one (or more) of the explanatory variables, that is the most similar (has the shortest distance) to the dependent variable, thus creating the vector of minimum distances:
A combined similarity of all the explanatory variables to the dependent variable is
Computing relative importance (
The way to compute relative importance of the explanatory variables is to find out how much each of them contributes to the vector of minimum distances (7) (that was used to compute
Therefore, relative importance in the SR (in contrast to traditional regression methods) is not affected by correlation with other explanatory variables, and is determined solely by the contribution of a given explanatory variable to explaining the behavior of the dependent variable.
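One possible reading of the minimum-distance construction is sketched below. It rests on assumptions not confirmed by the text: distances are absolute differences of membership grades, combined similarity is one minus the mean of the minimum distances, and Relimp is the share of observations in which a variable supplies the minimum distance, with ties split equally. The exact principles are given in [3]; the function and variable names are hypothetical.

```python
def combined_similarity_and_relimp(y, columns, tol=1e-12):
    """Return (vector of minimum distances, combined similarity, Relimp per variable).

    `columns` maps a variable name to its list of membership grades.
    """
    names = list(columns)
    n = len(y)
    dist = {name: [abs(yi - xi) for yi, xi in zip(y, columns[name])] for name in names}
    min_dist = []
    credit = {name: 0.0 for name in names}
    for i in range(n):
        d_min = min(dist[name][i] for name in names)
        min_dist.append(d_min)
        # Every variable that attains the minimum shares the credit equally.
        winners = [name for name in names if dist[name][i] - d_min <= tol]
        for name in winners:
            credit[name] += 1.0 / len(winners)
    combined = 1.0 - sum(min_dist) / n
    relimp = {name: credit[name] / n for name in names}
    return min_dist, combined, relimp

# Synthetic membership grades for four observations and two variables.
y = [0.9, 0.5, 0.1, 0.7]
columns = {
    "A": [0.8, 0.1, 0.1, 0.6],
    "B": [0.2, 0.5, 0.6, 0.2],
}
min_dist, combined, relimp = combined_similarity_and_relimp(y, columns)
```

Under this reading, a variable earns importance only where it is the closest match to the dependent variable, so its correlation with other explanatory variables does not enter the calculation.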
We can calculate the relative weight or relative importance (denoted by Relimp) of each explanatory variable in explaining the behavior of the dependent variable based on the following principles (for more details see Yosef et al., 2015):
This study involved a very large number of regression runs, covering all the possible combinations of variables for the years 2012 and 2016, because each of the four financial ratio categories was represented by more than one proxy variable, while each proxy variable covers an important aspect within its category and cannot be ignored. In addition, it should be emphasized that such a large number of regression runs would be required for each year under study. A major challenge in summarizing the results was to present them in a concise form while still exposing all the main and most interesting outcomes.
Table 1 presents the measures of similarity ($S_{Y,X_j}$) of all the proxy variables used in this study for all the years under study. The most important conclusion from this Table is that all the included variables are important to some degree on a consistent basis. For all the years covered in this study there was not a single case of an insignificant $S_{Y,X_j}$, since all the values came out greater than 0.7. In addition, it is easy to observe (by comparing the values located in the same row) that the similarity measures of each proxy variable differ little over the years. This consistency and stability are indicative of a solid and stable model: the variables characterized by a relatively high measure of $S_{Y,X_j}$ are consistently high over the years, and the variables characterized by partial significance are partially significant in all the years.
Similarity over the years
Note: all the explanatory variables are directly related to the dependent variable.
Because several proxy variables were used from each of the four financial ratio categories, a very large number of regression runs was required to try all the possible combinations of the proxy variables from all four categories. As explained in the theoretical section, the measure of similarity for a given explanatory variable remains the same no matter which other variables are included in a given regression run. Therefore, the task of presenting the $S_{Y,X_j}$ measures in Table 1 was simple. However, this is not the case for the measures of relative importance (Relimp). As explained above, for any given explanatory variable, its Relimp will be different in every regression run based on a different set of explanatory variables. Thus, summarizing Relimps for the explanatory variables is more difficult and challenging. In Table 2 we present, just as an example, arbitrarily selected results of seven different regression runs (each of the seven columns of the Table represents a separate regression run). The Table presents results for the year 2012 only. The first four rows display the variables included in the various regression runs. The measures of $S_{Y,X_j}$ (rows 5–8) are, as expected, the same measures as in Table 1 for the year 2012. The most interesting part of the Table are the measures of Relimp and of
Example of selected regression runs for 2012
The last row in Table 2 displays
Table 3 compares the results for 2012 with the results for 2016. It differs from Table 2 in the following aspects:
Comparison of 2012 to 2016
Table 2 presents the results of a selected sample of regression runs for 2012 only. Its purpose is to demonstrate the stability and consistency of the results when different proxy variables are selected from the four groups of variables, as explained above.
Table 3 compares the regression results of 2012 and 2016. The comparison demonstrates the consistency of the model results over time. Table 3 is based on a very large number of regression runs (including all the possible combinations of explanatory variables). In order to present the results of so many regression runs in the most comprehensible and concise form, we utilize ranges of values that contain all the results of the various regression runs. This makes the comparison between the results for 2012 and for 2016 much simpler and more convenient.
In addition to Table 3, we present Graphs 1 and 2, which display the same results visually. Graph 1 displays the comparison between 2012 and 2016 in terms of the $S_{Y,X_j}$ measurements, while Graph 2 displays the comparison in terms of the Relimp measurements. Both graphs are based on the mid-points of the ranges appearing in Table 3.
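The enumeration of regression runs and the range and mid-point summary can be sketched as follows. The assignment of proxy variables to categories is inferred from the order of the proxy variable list and is an assumption, as is the reading that each run uses one proxy per category; the Relimp values are synthetic placeholders, not results from the study.

```python
from itertools import product

# ASSUMED grouping of proxy variables by category (inferred, not stated explicitly).
categories = {
    "liquidity_efficiency": ["A52", "A54"],
    "solvency": ["A47", "A73"],
    "profitability": ["A35", "A50", "A51", "A57", "A59", "A70"],
    "market": ["A21", "A75"],
}

# One regression run per combination of one proxy variable from each category.
runs = list(product(*categories.values()))

def summarize(values):
    """Collapse the results of many runs into a range and its mid-point."""
    lo, hi = min(values), max(values)
    return {"min": lo, "max": hi, "midpoint": (lo + hi) / 2.0}

# Synthetic Relimp values of one variable across three hypothetical runs.
synthetic_relimp = [0.40, 0.46, 0.43]
summary = summarize(synthetic_relimp)
```

Reporting only the range and its mid-point keeps a table readable when the number of runs is large, at the cost of hiding how values are distributed inside the range.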

Graph 1. Similarity.
Graph 2. Relimp.
The consistency of the measures of $S_{Y,X_j}$ and of Relimp is clearly visible in Table 3 when comparing the ranges of these values for 2012 versus 2016. It is even easier to observe this consistency visually in Graphs 1 and 2. The consistency and stability of the model over time are important indicators of its reliability.
In this study we presented “Soft Regression”, a computerized Soft Computing modeling tool based on Fuzzy Logic, and used it to model the earnings (Operating Income Margin) of companies in the manufacturing industries (SIC codes 2000–3999). We used several categories of financial ratios as explanatory variables and included several financial ratios from each category as possible proxy variables to represent their category.
The main conclusions are:
All the categories of financial ratios included in this study have been validated. All the proxy variables selected from the four main categories of financial ratios came out either fully significant or partially significant; no variable came out insignificant.
The “Profitability” category came out as the most important (having the highest values of Relimp), followed by “Solvency” and then by the much weaker (but still significant) category “Market Ratios”. The last and much weaker category was “Liquidity and Efficiency”, characterized by only partially significant $S_{Y,X_j}$ measures.
Comparing the results of 2012 to those of 2016 leads to the conclusion that the model is stable and consistent over the years. A similar conclusion can be reached by comparing the $S_{Y,X_j}$ results for all five years, 2012–2016.
Incorporating different explanatory variables from the various categories of financial ratios led to similar and consistent results, thus implying a high degree of robustness of the model.
Very high scores of
A combination of stability, consistency, robustness and strong explanatory power are all important indicators of a model’s reliability.
The main contribution of this study is to demonstrate the effectiveness of a soft computing modeling tool based on fuzzy logic. The resulting model is robust, consistent, stable, and thus very reliable. It validates the relevant financial ratios and determines their relative importance, which is critical information for the success of financial investments.
The logical follow-up for future research is to incorporate the method presented here into a decision support system for financial investments. This will require the integration of several additional soft computing/fuzzy logic technologies.
Appendix 1: All accounting descriptors examined in the first phase of analysis
Accounting Descriptors
1. Account Receivable Turnover
2. Current Ratio
3. Quick Ratio
4. Inventory Turnover
5. Total Debt To Equity
6. ROA
7. ROE
8. Gross Profit Margin
9. Days sales in Accounting Recv.
10. Inventory to total assets
11. Depreciation over Plant
12. Long-Term Debt/Equity
13. Equity/Fixed assets
14. Times Interest Earned
15. Sales/Total Assets
16. Pre-taxes income/Sales
17. Net Profit Margin
18. Sales to total cash
19. Sales to total Inventory
20. Sales to total working capital
21. Sales to Fixed assets
22. Working capital to total assets
23. Operating Income to Total assets
24. EBITDA Margin Ratio
25. Cash From Operations (CFO) to Total Debt
26. Payment Of Dividends as % of OCF
27. Net Income over OCF
28. ΔDepreciation (& Amortization), IS
29. ΔInventory
30. ΔResearch & Development Expense
31. ΔTotal Assets
32. ΔTotal Long-Term Debt
33. ΔTotal Revenue
34. ΔCurrent Ratio
35. ΔQuick Ratio
36. ΔInventory Turnover
37. ΔDividends per share
38. ΔTotal Debt To Equity
39. ΔROE
40. ΔGross Profit Margin
41. ΔWorking capital
42. ΔDays sales in Accounting Recv.
43. ΔInventory to total assets
44. ΔDepreciation over Plant
45. ΔCapital Expenditures/total assets
46. ΔLong-Term Debt/Equity
47. ΔEquity/Fixed assets
48. ΔTimes Interest Earned
49. ΔSales/Total Assets
50. ΔPre-taxes income/Sales
51. ΔNet Profit Margin
52. ΔSales to total Inventory
53. ΔSales to total working capital
54. ΔResearch & Development Expense to Sales
55. ΔWorking capital to total assets
56. ΔOperating Income to Total assets
57. ΔEBITDA Margin Ratio
58. ΔCapital Expenditures/total assets
59. ΔTotal Depreciation
60. ΔTotal Debt
