I know your number.

A Primer on Statistical Intuition

Decision-making these days relies heavily on data to find solutions and strategies. We as human beings are now understood mathematically. Indeed, when logic is applied to the data, knowing what to measure and how to measure it makes a complicated world much less so. But while numbers have the ability to question common sense, statistics can also be used to manipulate and confuse.

This post will discuss key concepts within the framework of data exploration, moving from descriptive analysis through correlation to predictive analysis.

To begin, LC is a peer-to-peer lending firm based in the United States. Loan contracts are issued based on a borrower’s credit worthiness, determined by features such as credit score, credit history, desired loan amount and debt-to-income ratio.

The current dataset holds approximately 2 million entries across 145 features that describe the nature of loan contracts issued by LC between 2007 and 2019.

Due to its large volume, a preliminary approach requires an organisation of our data. When dealing in numbers, descriptive analysis helps us quantify data systematically. To that end, we hang up charts and graphs in the gallery to identify patterns and representative portions that form the foundation for observation and intuition.

To observe a representative portrait, first filter unrepresentative portions.

We can observe missing values in the data to determine how they might affect the overall performance of the set.

Rows and columns with missing entries are likely to be “biased”. For example, an individual earning a higher income may be less willing to disclose this information, and so may either give a dishonest answer or not answer at all.
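As a minimal sketch of how missing values might be inspected, the snippet below computes the share of missing entries per column on a toy table. The column names (`loan_amnt`, `annual_inc`, `int_rate`) and the 40% threshold are assumptions made for illustration, not details taken from the analysis above.

```python
import pandas as pd

# Toy stand-in for the loan data; column names are illustrative assumptions.
loans = pd.DataFrame({
    "loan_amnt": [10000, 5000, 12000, 8000],
    "annual_inc": [55000, None, None, 72000],
    "int_rate": [11.5, 13.2, None, 9.8],
})

# Share of missing entries per column, largest first.
missing_share = loans.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns above an arbitrary threshold are candidates for a closer look.
threshold = 0.4
flagged = missing_share[missing_share > threshold].index.tolist()
print(flagged)
```

Looking at which columns are flagged, rather than only how many, is what lets us ask whether the gaps are random or tied to a particular kind of borrower.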

This gives us repeated cross-sectional data. Simply put, each loan contract can be seen as a different sample drawn from the population over time, with the same variables recorded for each sample.

We can identify general changes in column values across all loan contracts over time, but not the change in column value of an individual contract over time.

Mean values and standard deviation are employed as aggregate measures. Loan-specific variables are quantified to observe patterns in the data over time. Loan amounts and interest rates are the focal point of this section.
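A rough sketch of these aggregate measures, assuming the contracts sit in a pandas table with one row per loan. The column names (`issue_year`, `loan_amnt`, `funded_amnt`, `int_rate`) and the figures are made up for illustration.

```python
import pandas as pd

# Toy data: one row per loan contract (values are invented).
loans = pd.DataFrame({
    "issue_year": [2008, 2008, 2009, 2009, 2009],
    "loan_amnt": [10000, 6000, 12000, 8000, 10000],
    "funded_amnt": [7000, 5000, 12000, 8000, 9000],
    "int_rate": [12.0, 14.0, 11.0, 13.0, 12.0],
})

# Mean and (sample) standard deviation of each variable per issue year.
yearly = (
    loans.groupby("issue_year")[["loan_amnt", "funded_amnt", "int_rate"]]
    .agg(["mean", "std"])
)
print(yearly)
```

Because the data is repeated cross-sectional, each row of `yearly` summarises a different set of contracts; it describes how the population of loans shifts, not how any single loan evolves.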

This work of art presents large variations in the mean values of the three loan amounts between 2007 and 2010, with amount funded being significantly lower.

It would be fair to attribute this behaviour to the subprime mortgage crisis that occurred in the same period.

Data after 2009 presents a convergence between loan amount required and loan amount funded.

The snowy, jagged mountain ridge of mean interest rate levels presents a head-and-shoulders pattern: an upper bound of approximately 13 percent prior to 2011, a peak at 15 percent in 2013, and then a decrease from 2013 to 2016.

Like an abstract outline of the rooftops of a tiny town, the standard deviation of interest rates has been increasing over time.

The number of loan contracts issued over time gives the impression of a quarter pipe at a pro skate park. We observe an exponential increase in contracts issued from 2008 to 2015, followed by a flatter climb before picking up again between 2017 and 2019. This dataset is skewed towards the later part of the timeframe.

At present, the data shows a large number of active loans from 2014 onwards.

From 2007 to 2015, we can say for certain that the number of fully paid loans has increased. However, we cannot extend this inference to data from 2014 onwards due to the sizable proportion of active loans.

Therefore, instead of looking into the distribution of loans after 2014, we will focus on the completed loans prior to 2014.

A ‘good’ loan outcome is a contract with loan amounts fully recovered (“Fully Paid”); anything else is considered a ‘bad’ outcome. With over 70% of completed contracts fully paid, the data on loan outcomes is imbalanced.

To work with these outcomes, a dummy variable was assigned to each completed contract: (1) for a ‘good’ outcome and (0) otherwise.
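The filtering and dummy coding described above could look something like this. It is only a sketch: the column names and the `"Current"` and `"Charged Off"` status labels are assumptions, and only `"Fully Paid"` is taken from the text.

```python
import pandas as pd

# Toy data; "Current" as the label for an active loan is an assumption.
loans = pd.DataFrame({
    "issue_year": [2010, 2011, 2012, 2013, 2013, 2016],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Current", "Current"],
})

# Keep contracts issued before 2014 that are no longer active.
is_early = loans["issue_year"] < 2014
is_completed = loans["loan_status"] != "Current"
completed = loans[is_early & is_completed].copy()

# Dummy variable: 1 for a 'good' outcome, 0 otherwise.
completed["good_outcome"] = (completed["loan_status"] == "Fully Paid").astype(int)

# Share of each outcome, which makes the imbalance visible.
balance = completed["good_outcome"].value_counts(normalize=True)
print(balance)
```

Note that the dummy variable only encodes the outcome. The imbalance itself still has to be handled later, for example through stratified sampling.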

Specific columns were correlated against loan outcomes, and evaluated with their assigned interest rates.

NOTE: Intuitively, we take interest rates to be higher in ‘otherwise’ outcomes versus ‘good’ outcomes.

Some men upload pictures of their pets on Tinder, while others post topless selfies to show confidence in their looks. What sort of features would signal to you that he might be a cheater? His chiselled abs that speak of vanity, or his adorable pug that already made you swipe right?

Here, we correlate specific columns (variables) to observe stronger determinants of loan outcomes and interest rate levels based on observations from our gallery of fine art.

We understand that loan grades are scored based on an internal pre-assessment of the borrower. Borrowers with a higher loan grade held a larger proportion of ‘good’ outcomes. This implies consistency in the loan grades that LC issues to its borrowers. Interest rates are also observed to be lower in contracts made out to borrowers with a higher grade.

Here, a decoupling of interest rate levels is observed. Borrowers who did not own a home and who belonged to the “middle-income” group had more ‘good’ outcomes, yet were charged higher interest rates. Home ownership seems to be a key variable in determining interest rate levels for a loan contract, independent of the probability of loan outcome.

Where ‘Education’ was cited as the purpose of the loan, interest rates and loan outcomes came apart again. ‘Education’ had the fewest ‘good’ outcomes but was assigned significantly lower interest rates.
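One way to reproduce a grade comparison like this is to group completed contracts by grade and compare the share of ‘good’ outcomes with the mean interest rate. The snippet is a sketch on invented numbers; `grade`, `good_outcome` and `int_rate` are assumed column names.

```python
import pandas as pd

# Toy completed contracts (values are invented for illustration).
completed = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "C", "C"],
    "good_outcome": [1, 1, 1, 0, 0, 0],
    "int_rate": [6.0, 8.0, 11.0, 13.0, 16.0, 18.0],
})

# Share of 'good' outcomes and mean interest rate within each grade.
by_grade = completed.groupby("grade").agg(
    good_share=("good_outcome", "mean"),
    mean_int_rate=("int_rate", "mean"),
)
print(by_grade)

# Pearson correlation between the 0/1 outcome and the interest rate
# (for a binary variable this is the point-biserial correlation).
corr = completed["good_outcome"].corr(completed["int_rate"])
print(corr)
```

A negative `corr` matches the intuition in the note above: ‘good’ outcomes tend to sit with lower interest rates. The same grouping can be repeated with home ownership or loan purpose in place of `grade`.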

The function of the upcoming predictive model is to determine if a loan request is likely to have a ‘good’ or ‘bad’ outcome prior to issuing the loan contract. Because of the imbalance between ‘good’ and ‘bad’ outcomes, a Random Forest Classifier (RFC) was evaluated with five-fold stratified cross-validation to test the predictive performance of the dataset.


Stratification is the process of arranging the dataset so that each fold is a good representative of the whole, keeping roughly the same proportion of ‘good’ and ‘bad’ outcomes in every fold. The training set is split into five such folds; in each round, the Random Forest Classifier is trained on four folds and tested on the remaining one, so that every fold serves as the test set once. Within each round, the classifier builds a set of decision trees on random subsets of the training data and aggregates their votes to determine the predicted outcome of each sample tested.
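The procedure can be sketched with scikit-learn as below. This runs on synthetic data with roughly a 70/30 class split, not on the LC dataset, and the hyperparameters (100 trees, fixed seeds) and the use of accuracy as the score are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the completed contracts.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.3, 0.7], random_state=0,
)

# Five stratified folds: each fold keeps roughly the overall class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Train on four folds, score on the held-out fold, repeat five times.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean())

# Share of the positive class in each held-out fold.
fold_ratios = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print(fold_ratios)
```

Comparing `fold_ratios` with `y.mean()` shows what stratification buys us: no fold ends up with an unusually small share of the minority outcome.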

The predictive model identified 10 variables that seem to be stronger determinants in capturing loan outcomes. Four of these variables are recovery-related.

An aggregate score of 99.76% seems amazing. This might seem shocking to some, but I might have uncovered the secret to predicting ‘good’ and ‘bad’ loan outcomes.

No. Numbers mean nothing unless we ascribe logic and intuition to them.

The aim of this prediction was to identify a loan outcome before a loan contract is issued. Recovery-related variables, however, can only be observed once a loan has run its course, so they reveal the outcome rather than predict it. Thus, the features within our dataset might not be strong metrics for predicting loan outcomes ex ante.

Borrower assessment features should hold higher associations, as these features are measured before contracts are issued. As such, predictive analysis using the current dataset may not help us derive any meaningful conclusions about the loan outcomes of different borrower types.
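In practice, this would mean restricting the model to features known at the time of assessment. The sketch below drops a hand-picked list of post-issuance columns before any model is fitted. All column names, and the choice of which ones count as post-issuance, are assumptions for illustration.

```python
import pandas as pd

# Toy feature table mixing ex ante and post-issuance columns.
features = pd.DataFrame({
    "fico_score": [690, 720, 655],
    "dti": [18.2, 9.5, 27.1],
    "loan_amnt": [10000, 5000, 12000],
    "recoveries": [0.0, 0.0, 1450.0],
    "collection_recovery_fee": [0.0, 0.0, 210.0],
})

# Columns only observable after the contract has run its course (assumed list).
post_issuance = ["recoveries", "collection_recovery_fee"]

# Keep only what a lender could know before issuing the loan.
ex_ante = features.drop(columns=post_issuance)
print(list(ex_ante.columns))
```

A model trained on `ex_ante` will almost certainly score lower than one that sees the recovery columns, but its score says something about prediction rather than hindsight.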

This post exists as a primer on the interpretation of data and predictive analysis, and much still remains to be discussed. With regard to the subject matter, further investigations should seek to overcome the technical barriers mentioned in working out a “good fit” of data, with the aim of predicting loan outcomes at a preliminary level of assessment.

Individual loan contracts have assigned “loan grades”. It appears that LC already has in place some form of screening prior to the approval of loan requests. It is likely that the dataset comprises only borrowers who have already passed the initial “screening”, which introduces selection bias into the dataset.

Therefore, a broader dataset should include borrowers who were rejected at the initial level of assessment in order to identify stronger determinants (variables) of loan outcomes. This should provide more grounds for observing the importance of certain features in predicting a loan outcome prior to issuing a loan contract.

To understand our intuitions better, one can investigate the behaviours of peer-to-peer lending firms. It might be a good option to understand how lending firms generally operated during the subprime mortgage crisis to further uncover the relationship it may hold with the patterns uncovered in descriptive analysis.

Furthermore, columns and rows with several missing entries were simply removed for a more generalized approach to the data. A more effective step would be to impute missing values through regression or to fill them with aggregate measures. That way, these rows and columns would not have to be removed and could be included in the dataset for further analysis.
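To make the difference concrete, here is a small sketch comparing row removal with the simplest aggregate imputation, filling each gap with the column median. The column names and values are invented, and regression-based imputation is not shown.

```python
import pandas as pd

# Toy data with gaps (values are invented for illustration).
loans = pd.DataFrame({
    "annual_inc": [40000.0, None, 60000.0, 80000.0],
    "dti": [10.0, 20.0, None, 30.0],
})

# Approach taken so far: drop every row with a missing entry.
dropped = loans.dropna()

# Alternative: fill each gap with its column's median.
imputed = loans.fillna(loans.median(numeric_only=True))

print(len(dropped), len(imputed))
print(imputed)
```

Dropping keeps only half of the toy rows, while imputation keeps all of them, at the cost of pulling the filled values towards the middle of each distribution.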
