### Distribution of variables

In the context of ordinary least squares (OLS) regression analysis, the key variable is the regressand; in this case, ‘index’. For hypothesis testing to be valid, the error term in the model must be independently, identically normally distributed. With the regressors assumed to be constants from sample to sample, ‘index’ should also, therefore, be normally distributed. Two points might be made at this stage. First, there is no assumption regarding the distribution of the independent variables in the analysis, as should be clear from the frequent use of dummy variables. Second, the normality assumption is only required for the purposes of hypothesis testing. The mathematical estimation of the OLS coefficients can be undertaken whether this is true or not.

Figures 1 and 2 present, respectively, the histogram and detrended Q-Q plot for the ‘index’ variable, with both suggesting quite strong deviations from normality. However, this visual impression does not find a good deal of support from the statistical measures that are typically used to evaluate departures from normality. Thus, as shown in Table 1, the median of ‘index’ is very close to its mean, as is the 5% trimmed mean (the average of the data when the highest and lowest 5% of data values are excluded). Both of these results are consistent with a normal distribution. Furthermore, while both the skewness and kurtosis statistics are negative – the former suggesting a longer tail at the lower end of the distribution, with data values bunched towards the upper end, and the latter a rather flat distribution – neither is large in relation to its standard error.
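For readers without access to SPSS, the summary measures referred to above could be approximated along the following lines. This is a minimal Python sketch rather than the procedure actually used here: the function names are hypothetical, the trimmed mean drops whole cases only (SPSS weights the boundary cases, so small differences are to be expected), and the skewness and kurtosis formulas are the usual bias-adjusted sample versions with their large-sample standard errors.

```python
import math
import statistics


def trimmed_mean(values, proportion=0.05):
    """Mean after dropping the lowest and highest `proportion` of cases.

    Simplification (assumption): only whole cases are dropped, i.e.
    floor(proportion * n) at each end of the ordered data.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    return statistics.fmean(kept)


def skew_kurtosis(values):
    """Bias-adjusted skewness and excess kurtosis with standard errors."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # n - 1 denominator
    z = [(v - mean) / sd for v in values]
    skew = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, se_skew, kurt, se_kurt


if __name__ == "__main__":
    # Hypothetical values standing in for the 'index' variable.
    hypothetical_index = [0.48, 0.49, 0.50, 0.50, 0.51, 0.51, 0.52, 0.52, 0.53, 0.53]
    skew, se_skew, kurt, se_kurt = skew_kurtosis(hypothetical_index)
    print("mean:", statistics.fmean(hypothetical_index))
    print("median:", statistics.median(hypothetical_index))
    print("5% trimmed mean:", trimmed_mean(hypothetical_index))
    print("skewness / SE:", skew / se_skew)
    print("kurtosis / SE:", kurt / se_kurt)
```

A ratio of either statistic to its standard error lying well inside ±2 would be read, as in the text, as giving no strong evidence against normality.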

Somewhat more formally, Table 2 reports the Kolmogorov-Smirnov test, with the Lilliefors correction applied because the mean and variance of the reference normal distribution are estimated from the sample, which compares the cumulative distribution function of ‘index’ with that of a comparable normal variable. As this statistic is not significantly different from zero, the assumption of normality cannot be rejected. Likewise, the Shapiro-Wilk test, which is essentially a measure of the correlation between the ‘index’ scores and their corresponding normal scores, does not depart from unity at any conventional significance level: a result that, once again, suggests a normal distribution. Nevertheless, it is hard to accept that a variable with a gap in its distribution and possessing multiple peaks is distributed normally. It was therefore decided to investigate the distribution of the natural logarithm of ‘index’, as suggested in the project brief.
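The two test statistics described above can be illustrated with a short Python sketch. It is offered under clear assumptions: the function names are hypothetical; the first function returns only the Kolmogorov-Smirnov distance with estimated mean and standard deviation (the quantity to which the Lilliefors critical values apply), not a p-value; and the second computes a simple correlation between the ordered data and their normal scores, which is a stand-in in the spirit of the Shapiro-Wilk statistic rather than the Shapiro-Wilk W itself. Significance levels would still need to come from published tables or a statistical package.

```python
import math
from statistics import NormalDist, fmean, stdev


def lilliefors_d(values):
    """KS distance between the empirical CDF and a normal CDF whose mean
    and standard deviation are estimated from the same sample."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), stdev(ordered))
    d = 0.0
    for i, v in enumerate(ordered, start=1):
        f = fitted.cdf(v)
        # Largest gap just after and just before each step of the ECDF.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def normal_score_correlation(values):
    """Pearson correlation between ordered data and their normal scores.

    Normal scores use Blom's plotting positions (an assumption; other
    conventions exist). Values close to 1 are consistent with normality.
    """
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx, ms = fmean(ordered), fmean(scores)
    sxy = sum((x - mx) * (s - ms) for x, s in zip(ordered, scores))
    sxx = sum((x - mx) ** 2 for x in ordered)
    sss = sum((s - ms) ** 2 for s in scores)
    return sxy / math.sqrt(sxx * sss)


if __name__ == "__main__":
    # Hypothetical data with a gap and two peaks, echoing the concern in the text.
    gappy = [1, 1, 1, 2, 2, 10, 30, 31, 31, 32]
    print("D statistic:", lilliefors_d(gappy))
    print("normal-score correlation:", normal_score_correlation(gappy))
```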

As is to be expected with a monotonic transformation of the original data, the switch to the logarithm of ‘index’ produced no significant alterations to the conclusions reached above regarding the distribution of that variable. As such, the results of the SPSS exploration of the properties of ‘lnindex’ are reproduced as Appendix A without further comment.

As noted above, there is no requirement that the exogenous variables in an OLS regression conform to any particular distribution. Nevertheless, Appendix B presents data summaries and normality tests for all of the continuous variables contained in the current data set, including those discussed above, along with summaries of the qualitative measures contained within it. It might be noted that, as ‘profit margin’ and ‘ROCE’ contain negative values, they could not be logged without dropping the affected cases from the sample. Likewise, logs cannot be taken for the two qualitative variables that take zero values. Therefore, it could not, at this stage, be considered appropriate to undertake manipulations of the potential regressors in the model to be constructed below.

### Relationships between variables

The scatter plots option in SPSS was used to explore the existence of any simple relationships between ‘index’ and the other continuous variables in the data set (excluding ‘actual’ and ‘max’) that might help to inform the choice of regression model specification. The results are reported in Appendix C. In general, this simple exercise failed to uncover any strong relationships, although it might be possible to argue that ‘age’ and ‘ROCE’ are weakly negatively related to ‘index’. Rather than presenting scatter plots for all remaining variables that are potential candidates for inclusion as regressors (a total of 21 additional diagrams), attention turns to the correlation matrix of all continuous variables. This is not only more compact but will also highlight any simple linear relationships between regressors that would cause OLS estimation to fail, to a greater or lesser extent, as a result of multicollinearity.

The correlation matrix provided as Table 3 confirms the existence of only very weak simple linear associations of ‘index’ with the other continuous variables in the data set, although it must be recalled that the model to be estimated below will be a multivariate construct and the matrix could be masking more complex relationships. However, it does provide a preliminary warning that OLS will almost certainly not be able to handle the joint inclusion of ‘sales’ and ‘assets’ in the model. Furthermore, it may also be sensitive to the joint inclusion of either of those variables alongside ‘capital’. This will be examined below.
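The screening role played by the correlation matrix can be sketched as follows. This is an illustrative Python fragment only: the function names, the toy data and the cut-off of 0.8 are all assumptions introduced for the example (the 0.8 figure is a common rule of thumb, not a threshold taken from this analysis).

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Simple Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `columns` maps variable names to lists of observations.
    """
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


if __name__ == "__main__":
    # Hypothetical figures: 'sales' is built to track 'assets' closely.
    assets = [10, 12, 15, 18, 22, 25, 30, 34, 40, 45]
    noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.2]
    toy = {
        "sales": [1.5 * a + e for a, e in zip(assets, noise)],
        "assets": assets,
        "age": [5, 12, 30, 8, 45, 22, 3, 40, 17, 28],
    }
    print(flag_collinear_pairs(toy))
```

As the text cautions, pairwise correlations of this kind can only flag simple linear associations between two variables at a time.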

### Regression models

Notwithstanding the earlier precautionary remarks about collinearity between potential regressors, the base model examined here is of the form

In this formulation, the variables are taken directly from the data file, with two exceptions. First, ‘Manuf’ is a dummy variable taking the value 1 if the firm is in the manufacturing sector and 0 otherwise. Second, ‘Other’ is another constructed dummy variable taking the value 1 if the firm is classified as in the ‘other’ sector and 0 otherwise. This leaves the conglomerate firms as the base reference group. It might also be noted that ε is the residual error term.

On the face of it, the catch-all model might appear to work reasonably well. Thus, as reported in Table 4, it has a corrected R² of 0.382, which is quite reasonable for a cross-section regression. However, this is achieved with only two significant variables in the model. The first of these, ‘listing’, attracts the positive sign that one would expect, given the regulatory regimes that tend to prevail on the stock exchanges of developed countries. The second is the negative coefficient for the variable ‘Other’. As this is a dummy shift term reflecting the disclosure practices of such firms relative to those of conglomerate companies, it again might have been anticipated. However, the findings should be treated with caution because, as predicted above, the model is plagued by multicollinearity. This is revealed by the size of the variance inflation factors (VIFs) – the reciprocal of (1 – R²) in an auxiliary regression of one regressor on all of the others – for the ‘sales’ and ‘assets’ variables in the model. As these are essentially measuring the same thing in statistical terms, one needs to be eliminated. As the VIF is highest for ‘sales’, the model was re-estimated with it excluded.
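The definition of the VIF given above translates directly into a calculation. The following Python sketch, which uses only the standard library, regresses each regressor on all of the others (with an intercept) and reports 1 / (1 – R²). The function names and the toy data are hypothetical, and solving the normal equations by Gaussian elimination is a convenience for a small example, not a numerically robust choice for real data.

```python
def ols_r_squared(y, rows):
    """R-squared from an OLS regression of y on the given rows plus an intercept."""
    n = len(y)
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Normal equations: (X'X) b = X'y
    a = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for j in range(col, k):
                a[r][j] -= factor * a[col][j]
            c[r] -= factor * c[col]
    # Back substitution
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(a[i][j] * b[j] for j in range(i + 1, k))) / a[i][i]
    fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in design]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def variance_inflation_factors(columns):
    """VIF for each regressor: 1 / (1 - R^2) from its auxiliary regression."""
    vifs = {}
    for name in columns:
        others = [columns[o] for o in columns if o != name]
        r2 = ols_r_squared(columns[name], list(zip(*others)))
        vifs[name] = 1.0 / (1.0 - r2)
    return vifs


if __name__ == "__main__":
    # Hypothetical regressors: 'sales' is nearly a multiple of 'assets'.
    assets = [10, 12, 15, 18, 22, 25, 30, 34, 40, 45]
    noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.2]
    toy = {
        "sales": [1.5 * a + e for a, e in zip(assets, noise)],
        "assets": assets,
        "age": [5, 12, 30, 8, 45, 22, 3, 40, 17, 28],
    }
    for name, vif in variance_inflation_factors(toy).items():
        print(name, round(vif, 2))
```

In the toy data the two near-duplicate regressors attract very large VIFs while the unrelated one stays close to one, which is the pattern described in the text for ‘sales’ and ‘assets’.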

As shown in Table 5, the explanatory power of the amended specification remains essentially unchanged when compared with the full model. Furthermore, there are no signs of gross instability in the qualitative impacts of the remaining included regressors: only ‘audit’ changed sign, but it was wholly insignificant in both formulations. However, it was still the case that only ‘listing’ and ‘Other’ achieved statistical significance. Recalling the high correlation coefficient between ‘assets’ and ‘capital’ – the two remaining regressors with the highest VIFs – it was decided to re-estimate the model again, excluding the latter of these variables.

The results of the re-specification are given in Table 6. While the explanatory power of the model fell, the deterioration was only marginal. Furthermore, ‘assets’ joined the list of significant variables, although its negative coefficient is contrary to what might intuitively be expected. Once again, there were no serious signs of instability in the estimates and all VIFs lay below two. However, cross-sectional estimations can suffer from heteroscedasticity; that is, from a non-constant variance of the disturbance term across observations. This possibility was checked by a cross-plot of the model’s standardised predicted values against its standardised residuals. The outcome, which is presented in Figure 3, exhibited a fairly random distribution and leads to the tentative conclusion that variance misspecification is not a serious concern. Rather than attempting to examine further possible variants of the model through a fishing exercise to check this conclusion more thoroughly, recourse was made to the variable selection routines available in the SPSS regression facility.
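The quantities plotted in such a cross-plot can be sketched in Python as follows. Several assumptions apply: the function names are hypothetical; both series are standardised by their own sample mean and standard deviation, whereas SPSS scales the residuals by the standard error of the estimate; and the numerical summary at the end (the correlation between the predicted values and the absolute residuals) is a crude substitute for inspecting the plot by eye, not the check that was actually performed here.

```python
import math
from statistics import fmean, stdev


def standardise(values):
    """Centre on the sample mean and scale by the sample standard deviation."""
    m, s = fmean(values), stdev(values)
    return [(v - m) / s for v in values]


def zpred_zresid(observed, fitted):
    """Standardised predicted values and standardised residuals."""
    residuals = [y - f for y, f in zip(observed, fitted)]
    return standardise(fitted), standardise(residuals)


def spread_trend(zpred, zresid):
    """Correlation between predicted values and absolute residuals.

    Values near zero suggest a roughly constant spread; values near +1 or -1
    suggest a fan shape, i.e. possible heteroscedasticity.
    """
    spread = [abs(r) for r in zresid]
    mp, ms = fmean(zpred), fmean(spread)
    spp = sum((p - mp) ** 2 for p in zpred)
    sss = sum((s - ms) ** 2 for s in spread)
    sps = sum((p - mp) * (s - ms) for p, s in zip(zpred, spread))
    return sps / math.sqrt(spp * sss)


if __name__ == "__main__":
    # Hypothetical fitted values with residuals that fan out as predictions rise.
    fitted = [float(v) for v in range(1, 11)]
    errors = [0.1, -0.1, 0.3, -0.3, 0.5, -0.5, 0.7, -0.7, 0.9, -0.9]
    observed = [f + e for f, e in zip(fitted, errors)]
    zpred, zresid = zpred_zresid(observed, fitted)
    print("spread trend:", spread_trend(zpred, zresid))
```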

Reassuringly, the stepwise, backward and forward options all generated the same result: a model incorporating ‘capital’, ‘listing’, ‘Other’ and a constant. As such, only the estimates from the stepwise exercise are reproduced as Table 7. It should be noted that the coefficient estimates for the retained variables are very similar to those reported above for the fuller model, which is a reassuring indication of stability. The major change to highlight is that ‘capital’ rather than ‘assets’ was finally retained. Taken at face value, the findings suggest that stock exchange listing increases disclosure, while market capitalisation tends to decrease it. In addition, firms that are neither in manufacturing nor conglomerates tend to have inferior disclosure practices.

### Potential problems and their resolution

There are a number of possible caveats to the foregoing exercise. The first is quite simply that it was undertaken without a justified theoretical underpinning and it is therefore difficult to consider questions of misspecification in other than a purely statistical sense. Second, some of the regressors, in particular perhaps ‘sales’, ‘profit margin’, ‘ROCE’ and ‘current’, are potentially volatile over time and might better be replaced with their averages over some backward-looking horizon, say of five or ten years. Third, there is no real justification for assuming that a linear specification is the appropriate one to adopt. Fourth, there is an evident lack of variability in the dependent variable ‘index’, which has a very low coefficient of variation (standard deviation divided by the mean) of 2.5 per cent. One possible way to overcome this problem might be to increase the sample size, whether by adding further Greek companies or international comparators. Finally, while the apparent errors are generally small, ‘index’ is not measured as simply ‘actual’/‘max’ and this may be a cause of at least some of the problems encountered.
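Two of the checks mentioned in this section, the coefficient of variation and the comparison of ‘index’ with the ratio ‘actual’/‘max’, are simple enough to sketch in Python. The function names, the tolerance and the example figures are hypothetical, and the sketch assumes that ‘index’ is recorded as a proportion on the same scale as the ratio.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)


def ratio_discrepancies(index, actual, maximum, tolerance=0.005):
    """Cases where the recorded index differs from actual / max.

    Returns (case number, recorded index, recomputed ratio) for every case
    whose absolute difference exceeds `tolerance` (an assumed cut-off).
    """
    return [
        (i, idx, act / mx)
        for i, (idx, act, mx) in enumerate(zip(index, actual, maximum))
        if abs(idx - act / mx) > tolerance
    ]


if __name__ == "__main__":
    # Hypothetical records; the third case has an index that does not
    # match its recomputed ratio.
    index = [0.50, 0.52, 0.49]
    actual = [50, 52, 50]
    maximum = [100, 100, 100]
    print("CV:", coefficient_of_variation(index))
    print("discrepancies:", ratio_discrepancies(index, actual, maximum))
```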