
Firm Growth Prediction

This project analyzes fast-growing firms in a European country, using data collected, maintained, and cleaned by Bisnode that covers 19,036 firms. The aim is to predict the probability that a firm grows fast and to classify firms into prospective fast-growth and non-fast-growth groups. Fast growth can be defined in many ways; in this project it is defined as annual revenue growth above 50%. I use company data from 2014 to predict the probability of fast growth in 2015 and classify firms accordingly.

Data

The bisnode-firms data is a panel dataset containing information about firms in a European country. In this project, I use the cleaned dataset maintained by Bisnode. As an initial step, the data were filtered to observations between 2010 and 2015. The dataset originally contained 19,036 firms, 287,829 observations, and 48 variables, where each observation corresponds to one firm in a specific year. It includes variables that could be useful predictors of firm growth, such as financial data, data on management, and region.

Data Preparation

1. The dataset was limited to the 2010-2015 panel.
2. Variables with many missing values, such as COGS, finished_prod, net_dom_sales, net_exp_sales, and wages, were dropped.

Label Engineering

Growth rate was not provided as a variable in the dataset, but the related financial data were, which makes it straightforward to measure. For this paper, a firm is considered fast growing if its revenues increased by more than 50% in the following year. Fast growth, as a binary variable, is the target (y) variable; all other variables are treated as potential predictors and screened by different methods to select the ones likely to be predictive. Fast growth is therefore defined as 1 if the company grows fast within one year, and 0 otherwise. The steps below, together with the feature-engineering cleaning, are sketched in code at the end of this section.

1. Impute the sales variable with 1 if the value is below 0.
2. Create variables for sales in million euros and log-transformed sales in million euros.
3. Create the variable growth_rate, the annual sales growth rate.
4. Drop observations with a negative or infinite growth rate, since they do not fit our scope.
5. Create a binary variable, fast_growth, indicating over 50% annual growth in revenues.
6. Create d1_sales_mil_log, the first difference of the natural log of sales in million euros, and age.

Sample Design

The sample was limited to the cross-section of firms in 2014. Observations with sales below the 5th percentile or above the 95th percentile, that is, firms with very low or very high sales, were excluded. As a result, the cleaned dataset consists of 47 variables and 5,737 observations.

Feature Engineering

To gain some insight into the data before building the models, I inspect the functional forms of the variables. Obvious errors, such as negative current assets or current liabilities, were imputed with 0, and a binary variable was created to flag the error. A new variable for total assets was created. Unreasonable age values were imputed with a minimum of 25 and a maximum of 75 years. Moreover, financial variables, for example annual profit and loss and income before tax, were standardized by sales and winsorized, and variables flagging errors or extreme values were also created.
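The label-engineering and feature-engineering steps above can be made concrete with a short R sketch. This is a minimal illustration assuming the cleaned Bisnode panel is loaded as a data frame named bisnode with columns comp_id, year, sales, founded_year, and profit_loss_year; the column names, the winsorization bounds, and the use of dplyr are assumptions, not the exact code behind the reported figures.

library(dplyr)

# Sketch of the data-preparation and label-engineering steps
# (column names and bounds are assumed, not taken from the original code).
firms <- bisnode %>%
  filter(year >= 2010, year <= 2015) %>%                  # keep the 2010-2015 panel
  mutate(sales = ifelse(sales < 0, 1, sales),             # impute negative sales with 1
         sales_mil = sales / 1e6,                         # sales in million EUR
         sales_mil_log = log(sales_mil),                  # log sales in million EUR
         sales_mil_log_sq = sales_mil_log^2) %>%          # quadratic term for non-linearity
  arrange(comp_id, year) %>%
  group_by(comp_id) %>%
  mutate(growth_rate = lead(sales) / sales - 1,           # revenue growth into the following year
         d1_sales_mil_log = sales_mil_log - lag(sales_mil_log)) %>%  # first difference of log sales
  ungroup() %>%
  filter(is.finite(growth_rate), growth_rate >= 0) %>%    # drop negative / infinite growth rates
  mutate(fast_growth = as.integer(growth_rate > 0.5),     # label: > 50% annual revenue growth
         age = year - founded_year)

# Financial ratios standardized by sales, winsorized, and flagged
# (the [-1, 1] bounds are illustrative).
firms <- firms %>%
  mutate(profit_loss_year_pl   = profit_loss_year / sales,
         flag_profit_loss_high = as.integer(abs(profit_loss_year_pl) > 1),
         profit_loss_year_pl   = pmin(pmax(profit_loss_year_pl, -1), 1))

# 2014 cross-section, trimming the 5% sales tails
cross_2014 <- firms %>%
  filter(year == 2014) %>%
  filter(sales >= quantile(sales, 0.05, na.rm = TRUE),
         sales <= quantile(sales, 0.95, na.rm = TRUE))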
Quadratic terms were added to some financial variables to capture non-linearity. The final dataset, after cleaning and screening for potential predictors, consists of 5,453 observations and 110 variables. As a robustness check, the dataset is split into an 80% work set and a 20% holdout set.

Exploratory Data Analysis

The target variable, fast growth, is a binary variable. As a first step, I examine the potential predictor variables.

Figure 1: Probability distribution of predictor variables (panels: probability of fast growth across standardized annual profit/loss, across standardized income before tax, and across log sales in million euros).

We can observe that the probability of fast growth tends to decrease as sales decrease. Steep drops could be due to a low number of observations in a specific interval. The same pattern applies to the distribution of the probability across income before tax.

Modeling

To begin building the models, the variables must be defined. Predictors were grouped into four main categories: firm characteristics, quality variables, financial variables, and HR variables, plus a separate group of interactions. I consider four logit models of increasing complexity for probability prediction, along with a random forest benchmark; a code sketch of the cross-validated estimation follows at the end of this section.

1. Model 1 (X1) includes log sales_mil, squared log sales_mil, d1_sales_mil_log_mod, profit_loss_year_pl, fixed_assets_bs, curr_liab_bs, curr_liab_bs_flag_high, curr_liab_bs_flag_error, age, and foreign_management.
2. Model 2 (X2) includes log sales_mil, squared log sales_mil, and the firm, engvar, and d1 variable groups.
3. Model 3 (X3) includes log sales_mil, squared log sales_mil, and all other variables, but no interactions.
4. The logit LASSO model includes most of the predictors as well as the set of potential interactions.
5. The random forest includes sales in millions, the log first difference of sales in millions, and the firm, quality, financial, and HR variables; no interactions and no modified features.

Cross-Validation Prediction

I prepared four logit models of increasing complexity. The best-performing model is selected using cross-validation, and its predictions are evaluated on the holdout set. The work sample consists of 4,362 observations, of which 3,856 are fast-growth firms, and the holdout sample has 1,089 observations, of which 963 are fast-growth firms. Table 2 shows the number of variables, R-squared, BIC, and cross-validated training-set and test-set RMSE for the eight regressions. The table reports two statistics for the entire work set: R-squared and BIC. First, all regressions were estimated using all observations in the work set. Then the models were estimated using 5-fold cross-validation: for each fold, the regression was estimated on the training set and used for prediction both on that training set and on the corresponding test set. The training RMSE and test RMSE were then calculated as the square root of the average MSE over the five training sets and the five test sets, respectively. R-squared improves as more variables are added; in our case the most complex model explains 37% of the variation in prices. BIC should decrease for more complex models, but after a certain point it increases; in our case, however, the differences are relatively small.
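To illustrate the cross-validation workflow, here is a minimal sketch of estimating one of the logit specifications and the logit LASSO with 5-fold cross-validation using caret and glmnet. The seed, the shortened X1 variable list, and the lambda grid are illustrative assumptions and are not meant to reproduce the numbers in Table 1.

library(caret)
library(glmnet)

set.seed(2024)  # illustrative seed

# caret needs a factor outcome with valid level names to compute class probabilities
cross_2014$fast_growth_f <- factor(ifelse(cross_2014$fast_growth == 1,
                                          "fast_growth", "no_fast_growth"),
                                   levels = c("no_fast_growth", "fast_growth"))

# 80% work set / 20% holdout set
work_idx <- createDataPartition(cross_2014$fast_growth_f, p = 0.8, list = FALSE)
work     <- cross_2014[work_idx, ]
holdout  <- cross_2014[-work_idx, ]

train_control <- trainControl(method = "cv", number = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary,
                              savePredictions = "final")

# A shortened version of the X1 predictor list, for illustration
X1 <- c("sales_mil_log", "sales_mil_log_sq", "d1_sales_mil_log",
        "profit_loss_year_pl", "age")

logit_x1 <- train(
  as.formula(paste("fast_growth_f ~", paste(X1, collapse = " + "))),
  data = work, method = "glm", family = binomial(link = "logit"),
  trControl = train_control, metric = "ROC")

# Logit LASSO over the same predictors (the lambda grid is illustrative)
logit_lasso <- train(
  as.formula(paste("fast_growth_f ~", paste(X1, collapse = " + "))),
  data = work, method = "glmnet", family = "binomial",
  tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(-4, -1, length.out = 20)),
  trControl = train_control, metric = "ROC")

logit_x1$results      # cross-validated AUC (ROC), sensitivity, specificity
logit_lasso$bestTune  # selected penalty

# Cross-validated RMSE of the predicted probabilities, as reported in Table 1
cv_pred <- logit_x1$pred
sqrt(mean((cv_pred$fast_growth - as.integer(cv_pred$obs == "fast_growth"))^2))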
According to BIC, the best model in our case is model 6, as the more complex models carry a risk of overfitting the data. The training-set RMSE improves as the model becomes more complex. The test-set RMSE improves until model 7, after which it becomes significantly worse. Model 7 has the lowest test RMSE, 45.95. This model includes all the variables except for the interactions in X3. Model 7 is significantly more complex than model 6, which was deemed best by BIC. Model 6 contained the interactions of property type and the additional interactions, while model 7 included amenities as well. The RMSE suggests that the typical size of the prediction error in the test set is 45.95 euros for model 7 and 46.07 euros for model 6. From a statistical point of view this difference might be interesting, but from a purely business point of view it could be deemed insignificant. If BIC and cross-validation conflict, the cross-validation result should be chosen, as it is not based on auxiliary assumptions.

Table 1 summarizes the cross-validated logit results: the simplest specification, X1, has the lowest CV RMSE (0.314) and the highest CV AUC (0.701), while the more complex specifications and the LASSO perform slightly worse.

Table 1: Logit Summary

Model    Number of predictors    CV RMSE    CV AUC
X1                          4      0.314     0.701
X2                         39      0.316     0.698
X3                         79      0.318     0.689
LASSO                       1      0.322     0.677

Figure 2: Training and test RMSE for the models (the plot shows the true positive rate (sensitivity) against the false positive rate (1 − specificity) across classification thresholds between 0.2 and 0.6).
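Continuing the sketch above, the selected model's discrimination can be checked on the holdout set with an ROC curve and AUC, and a classification threshold can then be applied. The pROC package, the logit_x1 object from the previous sketch, and the 0.3 cutoff are illustrative assumptions rather than the exact procedure behind Figure 2.

library(pROC)

# Predicted probabilities of fast growth on the holdout set
holdout_prob <- predict(logit_x1, newdata = holdout, type = "prob")[, "fast_growth"]

# ROC curve and AUC on the holdout set (this is the kind of curve shown in Figure 2)
roc_holdout <- roc(response = holdout$fast_growth_f, predictor = holdout_prob,
                   levels = c("no_fast_growth", "fast_growth"), direction = "<")
auc(roc_holdout)
plot(roc_holdout)

# Classify with an illustrative 0.3 threshold and tabulate the confusion matrix
holdout$pred_class <- ifelse(holdout_prob > 0.3, "fast_growth", "no_fast_growth")
table(predicted = holdout$pred_class, actual = holdout$fast_growth_f)

# Holdout RMSE of the predicted probabilities
sqrt(mean((holdout_prob - as.integer(holdout$fast_growth_f == "fast_growth"))^2))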