1. What is the definition of default in the data provided by the data vendor?
The first important question is whether the external data vendor's definition of default matches your institution's internal definition. The EBA guidelines specify how the default event must be defined, and banks are required to follow that definition. Data vendors, however, are not bound by the EBA guidelines, so it is crucial to verify that the internal and external default definitions are aligned. If they differ, it must be possible to mitigate the differences with statistical methods, for example by comparing the default rates of the two datasets and adjusting them appropriately. Alternatively, you could use the external dataset only for risk differentiation and base risk quantification on the internal default rates. An appropriate Margin of Conservatism may be needed in this case.
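One simple way to adjust for differing default definitions is to scale the external default rates to the internal level. The sketch below assumes that both default rates are measured on comparable populations over the same period. The function names are hypothetical, and proportional scaling is only one of several possible adjustments.

```python
def default_rate(defaults: int, obligors: int) -> float:
    """Observed default rate: defaulted obligors divided by all obligors."""
    if obligors <= 0:
        raise ValueError("obligors must be positive")
    return defaults / obligors


def scaling_factor(internal_rate: float, external_rate: float) -> float:
    """Ratio that maps the external default rate level onto the internal one."""
    if external_rate <= 0:
        raise ValueError("external_rate must be positive")
    return internal_rate / external_rate


def adjust_rate(external_rate: float, factor: float) -> float:
    """Scale an external default rate, capped at 100%."""
    return min(1.0, external_rate * factor)
```

For instance, with 30 defaults among 2,000 internal obligors and 500 defaults among 25,000 external obligors, the factor is 0.75, so an external default rate of 4% is scaled to 3%.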
2. Does the level of granularity of the external data correspond to the segments’ definitions in your portfolio?
To use an external dataset for modelling a specific segment of a loan portfolio, you need to prove that the dataset is representative of that segment. There are several aspects of representativeness to consider. One of them is the ‘Scope of application’: the part of the portfolio that is clearly defined by a consistent set of rules and for which the statistical model will be built.
The precondition for investigating the ‘Scope of application’ is that the internal criteria for identifying a segment can also be found in the external dataset (for example: industry, geography, type of loan). If that is not the case, and the characteristics used to identify internal portfolio segments are not available in the external data sample (usually because the internal segmentation is more granular), you can try aggregating the segments until a matching granularity is reached.
3. Does the external data coverage contain the Scope of application of your portfolio?
Once you have determined that the segment definitions for your portfolio can be applied to the external data, you need to establish whether enough data are available for those segments. For example, if one of your segments is construction companies based in the Netherlands, you should check whether data with those characteristics are available in the vendor's dataset.
This should be the case for all segments, unless you can prove that the segments with missing data are well represented by the segments for which data are available. For example, consider an institution that sells loans to clients in the Netherlands and in France, while the external dataset only covers Dutch loans. If it can be proven that the risk profile of the French loans is similar to that of the Dutch loans, you can proceed with using the Dutch data to model both parts of the portfolio (potentially adding an appropriate Margin of Conservatism).
4. Are you able to draw a representative data sample for your portfolio from the external dataset?
Even if data within the Scope of application of your segments can be obtained from the external dataset, this might still not be enough to prove the homogeneity of the internal and external data samples. To further substantiate your choices, you can perform statistical tests (assuming that the quality of the explanatory variables in both datasets is sufficient to obtain statistically significant results), for example by calculating a Population Stability Index that compares the two datasets. Even if the tests fail initially (which they might, since the initial external dataset will most probably be larger than the internal dataset), you can create a representative data sample through, for example, stratified sampling.
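The comparison and the resampling step can be sketched as follows. This is a minimal illustration for a single categorical variable, not a complete representativeness test. The field name `size`, the `floor` used for empty categories, and the function names are assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def shares(values):
    """Share of each category in a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def psi(internal_values, external_values, floor=1e-6):
    """Population Stability Index: sum over categories of (a - b) * ln(a / b)."""
    p_int = shares(internal_values)
    p_ext = shares(external_values)
    total = 0.0
    for cat in set(p_int) | set(p_ext):
        # Floor the shares so that empty categories do not produce log(0).
        a = max(p_int.get(cat, 0.0), floor)
        b = max(p_ext.get(cat, 0.0), floor)
        total += (a - b) * math.log(a / b)
    return total


def stratified_sample(external_records, key, target_shares, sample_size, seed=0):
    """Draw from the external records so that `key` follows `target_shares`."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for rec in external_records:
        by_stratum[rec[key]].append(rec)
    sample = []
    for stratum, share in target_shares.items():
        # Rounding means the final size can differ slightly from sample_size.
        n = round(share * sample_size)
        pool = by_stratum.get(stratum, [])
        if n > len(pool):
            raise ValueError(f"not enough external records in stratum {stratum!r}")
        sample.extend(rng.sample(pool, n))
    return sample


# Hypothetical data: the internal portfolio is split evenly by company size,
# while the external dataset is dominated by large companies.
internal = ["small"] * 50 + ["large"] * 50
external = [{"size": "small"}] * 200 + [{"size": "large"}] * 800

psi_before = psi(internal, [rec["size"] for rec in external])
sample = stratified_sample(external, "size", shares(internal), sample_size=100)
psi_after = psi(internal, [rec["size"] for rec in sample])
```

In practice the index would be computed for each explanatory variable, with continuous variables binned first, and the acceptance thresholds are a modelling choice.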
5. Is the number of defaults in the external dataset enough for modelling each of the segments?
In contrast to what is beneficial for business, in Credit Risk more defaults (not fewer) improve the chances of modelling success. The number of defaults in a portfolio (or a segment) is a crucial factor in determining whether a statistical Credit Risk model can be built. Smaller institutions and new lines of business often struggle with a low number of defaults (or none at all). Using external data can be a tremendous help for an institution that has not registered enough internal defaults to build a statistically sound model.
The number of defaults considered necessary for modelling can differ substantially between institutions. Preferably, one should have a minimum of 150 defaults per portfolio over a period of 7-10 years. Such a number is often not available for corporate loans, and models have been built with as few as 10-15 defaults per segment. When deciding whether to purchase data from an external vendor, you should set a feasible lower bound specific to your case and investigate the data availability against that bound.
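The coverage check from question 3 and the lower bound on defaults can be examined together with a simple count per segment. The sketch below assumes that segments are identified by industry and country and that each external record carries a boolean `defaulted` flag. These field names and the function names are hypothetical.

```python
from collections import defaultdict


def segment_summary(records, min_defaults):
    """Count obligors and defaults per (industry, country) segment."""
    stats = defaultdict(lambda: {"obligors": 0, "defaults": 0})
    for rec in records:
        seg = (rec["industry"], rec["country"])
        stats[seg]["obligors"] += 1
        stats[seg]["defaults"] += int(rec["defaulted"])
    # Flag whether each segment reaches the chosen lower bound on defaults.
    return {
        seg: {**s, "enough_defaults": s["defaults"] >= min_defaults}
        for seg, s in stats.items()
    }


def missing_segments(internal_segments, summary):
    """Internal segments that do not appear in the external data at all."""
    return [seg for seg in internal_segments if seg not in summary]
```

The value of `min_defaults` is the institution-specific lower bound discussed above. Segments returned by `missing_segments` would need either aggregation or a demonstration that covered segments represent them well.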
6. What is the availability and quality of the variables that could be used for modelling?
Once you have established that the Scope of application of the internal dataset is covered by the external dataset, that the definition of default is aligned with the internal definition (or that the differences can be mitigated), and that the number of defaults within each segment is sufficient for modelling, you should investigate the quality of the data. Key areas to look at are:
Does the dataset cover a good mixture of good and bad years so that accurate through-the-cycle estimates can be obtained?
- Timeliness of default data
How timely are the defaults recorded, and can the origination of the event be accurately traced? Do the timestamps in the external default data match the internal data?
- Timeliness of risk characteristics
How timely are the data recorded by the data vendor? What is the lag between, for example, publication of the yearly figures by a company and those figures appearing in the external database?
How good is the quality of the data for your segments? Are the typical variables that are used in Credit Risk modelling (such as financial ratios) available or possible to calculate using the external data? What is the percentage of missing values, especially for the defaulted companies 1 year prior to default?
- Non-financial data availability
Are non-financial risk characteristics, such as company age and management quality, available in the external data? Together with more traditional financial variables (such as Balance Sheet and Profit and Loss information), these variables are known to be successful in predicting corporate defaults.
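Two of the checks listed above lend themselves to a quick calculation: the share of missing values for defaulted companies one year prior to default, and the default rate per year as a first look at the mix of good and bad years. The sketch below assumes records stored as dictionaries in which a missing value is `None`. The field names `year` and `defaulted` and the function names are hypothetical.

```python
from collections import Counter


def missing_share(records, variables):
    """Share of records in which each variable is missing (None or absent)."""
    if not records:
        raise ValueError("records must not be empty")
    return {
        var: sum(1 for rec in records if rec.get(var) is None) / len(records)
        for var in variables
    }


def default_rate_by_year(records):
    """Observed default rate for each year, in chronological order."""
    totals, defaults = Counter(), Counter()
    for rec in records:
        totals[rec["year"]] += 1
        defaults[rec["year"]] += int(rec["defaulted"])
    return {year: defaults[year] / totals[year] for year in sorted(totals)}
```

To apply `missing_share` as described, pass only the snapshots of defaulted companies taken one year before their default date. Selecting those snapshots depends on how the vendor stores observation dates.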