Computer Science

Peer-to-peer Lending: Optimizing the Classification and Matching of Digital Karmic Personas


Sanya Sharma

William Jones College Preparatory

Abstract

In 2020, student debt in the U.S. amounted to $1.56 trillion, a growing national concern [1]. To help alleviate this crisis, I assisted in the development of Coinsequence, a curated peer-to-peer lending platform in which do-good lenders invest directly in students’ digital profiles. Lenders can filter students by criteria ranging from a keyword like “tennis” to free-flowing text like “volunteers at a community shelter to help stray animals get sterilized”. To personalize the lending process efficiently and determine ideal lender-student matches, several machine learning (ML) techniques are employed: tokenization, stopword removal, lemmatization, and stemming make the text uniform; Latent Dirichlet Allocation (LDA) sorts students into categories; and a Gradient Boosted Decision Tree then matches students’ digital personas to prospective lenders, using the LDA-produced clusters as the conditional tests in the decision trees. To remove redundant features from the dataset, Principal Component Analysis (PCA) and feature selection are proposed. Together, these techniques optimize student-to-lender matching, giving college students an accessible way to receive micro-funding for their education.

Introduction

Description of Coinsequence

Americans today are more burdened by student debt than ever. In the U.S. alone, over 3.3 million students attend college every year, and among 2018 graduates, 69% took out student loans and graduated with an average debt of $29,800 [1].

While high schoolers worry about college tuition, tens of millions of donors and investors support the cause of debt-free education. Coinsequence, a peer-to-peer financial aid platform, aims to alleviate student debt burdens by providing merit-based aid from charitable investors to loan-seeking students. Students create their digital persona by posting their actions and achievements on Coinsequence, organized by time usage as defined by the American Time Use Survey (ATUS) [2]. Figure 1 models student Sophia Johnson’s sample persona.

image1.png

Figure 1. Sample student karmic profile

This method lets students market themselves in a way that feels more human than a sterile resume or a lengthy essay that only briefly touches on the student’s personality. Figure 2 lists the time-usage categories defined by the platform, which follow the ATUS classifications and under which students categorize their actions. Each category carries a different weight, ranging from 0.25 to 1.25, which is used when awarding karma points (KP).

image3.png

Figure 2. Example list of karmic persona categories

Activity Posts

Posting, sharing, validating, and earning karma points is a highly engaging, social activity. Students can post time-usage activities, also known as activity posts, at any time. Figure 3 presents the screen that displays the options for an activity post; Figure 4 models a sample post with validation, privacy, and visibility options for the user to customize.

image6.png

Figure 3. Screen displaying Activity Post options

image7.png

Figure 4. Example of a text-based Activity Post

Activity posts are categorized under 467 classifications (also called features) as defined in ATUS and are used to build each student’s karmic digital persona, or karmic profile. A karmic profile is analogous to a college application, except that students must report their comprehensive time usage. Coinsequence builds digital personas from a student’s activity posts and validations, which include social engagement around posts and API-based scores such as the SAT and the ACT.

Investors and donors can view students’ digital personas through a variety of filters and assign different weights to time usages.

Problem statement for personalizing the peer-to-peer lending platform

Coinsequence’s most essential feature is its matching algorithm, which matches investors with an appropriate set of students based on text searches over attributes such as test scores, sports, extracurricular activities, and karma points. Identifying the most efficient ML model makes it possible to display the most suitable karmic personas to prospective investors.

Since a student’s karmic persona could consist of thousands of attributes, designing such an algorithm could be extremely complicated. Moreover, investors have the option of using simple keywords or sophisticated free-flowing texts to find suitable karmic personas.

The problem statement, then, is to obtain the best matching profiles based on as many search phrases as possible, which is similar to retrieving information through linked data and semantic web technology. Hence, various topic-modelling tools will be evaluated for information retrieval.

Data from karmic personas are discrete and voluminous. In particular, there are numerous activity categories, such as school work and domestic chores. Not all attributes are weighted equally; attribute weights are determined by data from previous matching results. Coinsequence’s dataset of activity posts contains a large number of features, and irrelevant or repetitive ones may act as noise while the ML algorithm trains, hurting the model’s speed and accuracy. Therefore, feature selection (which keeps the features most relevant to an investor’s searches) and dimensionality reduction (which reduces the number of irrelevant and redundant dimensions) are evaluated to optimize the features fed into the matching ML algorithm. Choosing the best matching algorithm and feature selection model improves efficiency and accuracy, a step crucial to developing a successful application.

Methods

2.1 Description of Topic Modeling

The topic model is the primary tool needed to determine the matches. In ML and natural language processing, a topic model is a frequently used text-mining tool that uncovers hidden semantic structures in a body of text [3]. In the case of Coinsequence, the topic model uncovers the themes present in karmic profiles and investor searches. Examples of topic models include explicit semantic analysis, Latent Dirichlet Allocation (LDA), and non-negative matrix factorization [4]. In this work, LDA groups search phrases into topics, and the Gradient Boosted Decision Tree, a separate classifier, then classifies karmic personas.

2.2 Applying Latent Dirichlet Allocation to categorize searches in Coinsequence

LDA is used to categorize text into particular topics. Because investors may be unfamiliar with the platform’s terminology, LDA and complementary methods from the Natural Language Toolkit (NLTK) are used to support query suggestions and intelligent query understanding [5]. To match an investor with a student optimally, the algorithm must find karmic personas that fall under pre-categorized themes rather than extract literal keywords.

Since the text in investors’ searches and borrowers’ karmic personas is raw, it must be preprocessed. First, the text undergoes tokenization, in which the content is split into sentences and the sentences into words. To facilitate this process, the text is uniformly lowercased and all punctuation is removed. Then stopwords, that is, conjunctions and other words that carry negligible information (“and”, “those”, etc.), are removed. Next, the tokenized text undergoes lemmatization, which maps each word to its dictionary form (for example, converting past- and future-tense verbs to the present tense), to achieve consistency and avoid the repetition of keywords. This is followed by stemming, which reduces the lemmatized words to their root forms. For example, “academically” would be stemmed toward “academic” and “dancing” toward “dance”; rule-based stemmers often return truncated roots such as “danc”.
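The four preprocessing steps can be sketched in plain Python as follows. The stopword set, the lemma dictionary, and the suffix-stripping stemmer are tiny hypothetical stand-ins written for this example; a real pipeline would more likely use NLTK's tokenizer, stopword corpus, `WordNetLemmatizer`, and `PorterStemmer`.

```python
import re

# Hypothetical, deliberately tiny stand-ins for a full stopword list and lemmatizer.
STOPWORDS = {"i", "the", "to", "be", "a", "and", "she", "should", "or", "of", "in"}
LEMMAS = {"playing": "play", "played": "play", "dancing": "dance", "skills": "skill"}


def tokenize(text):
    """Lowercase the text, drop punctuation, and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(word):
    """Rough suffix stripping; only a stand-in for a real stemming algorithm."""
    for suffix in ("ally", "ing", "ly"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, remove stopwords, lemmatize, then stem."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [LEMMAS.get(t, t) for t in tokens]
    return [toy_stem(t) for t in tokens]


print(preprocess("She should be playing tennis, and dancing!"))
```

Because the lemma dictionary maps “dancing” to “dance” before stemming runs, the stemmer leaves it untouched; applied directly, `toy_stem("dancing")` would return the truncated root “danc”.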

2.3 Introduction to Decision Trees

Decision trees, or Classification and Regression Trees (CART), are machine learning models in which a series of if-else statements is arranged in a flowchart-like tree. These models can produce categorical (e.g., positive or negative) or continuous (e.g., price or quantity) outcomes from minimally processed input; the two kinds are called classification trees and regression trees, respectively [6].

2.4 Applying Gradient Boosted Decision Tree to Coinsequence

In Coinsequence, decision tree classification can be used to match karmic personas and personalize the peer-to-peer lending process [7]. Based on the LDA-produced keywords, the conditions (or tests) can be determined.

Initially, the decision tree is likely to be a weak learner: when the maximum depth is small, it contains few conditions. To overcome this hurdle, boosting, which combines several weak learners into a strong one, can be applied. Boosting is a form of sequential error correction, meaning each predictor (a decision tree, in this case) attempts to correct its predecessors’ errors. Specifically, each new decision tree is trained with the residual, the difference between the actual label and the prediction of the ensemble so far, as its target (Y) values. Fitting successive trees to these residuals steadily reduces the loss, improving the accuracy of the combined model.
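The residual-fitting idea can be illustrated with a small hand-written example. This is a sketch of gradient boosting with squared-error loss, using one-split trees (stumps) as the weak learners; the SAT scores, the labels, the learning rate, and the number of trees are hypothetical values chosen for the illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave data on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit each new stump to the residuals left by the ensemble built so far."""
    base = sum(ys) / len(ys)  # start from the mean label
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Hypothetical data: SAT score -> 1 if an investor funded the student, else 0.
sat = [1100, 1200, 1300, 1400, 1500, 1550]
funded = [0, 0, 0, 1, 1, 1]
model = boost(sat, funded)
print(round(model(1150), 3), round(model(1450), 3))
```

A single stump can only move its predictions part of the way toward the labels at each step; the residuals shrink with every added stump, which is the sequential error correction described above.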

2.5 Description of Dimensionality Reduction

Dimensionality reduction is the process of decreasing the number of dimensions (or columns) of a dataset in order to reduce complexity. As a rule of thumb, a dataset with 10 or more dimensions is considered high-dimensional. Dimensionality reduction is also used to remove redundant columns that have little variance, which can include standardized test scores such as the SAT and AP exams. However, the methods considered here apply only to numeric data.

2.6 Principal Component Analysis (PCA)

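Widely used for dimensionality reduction on continuous data, PCA rotates the data onto a new set of orthogonal axes ordered by decreasing variance and projects it onto the leading ones. These axes, each a linear combination of the original features, are the principal components; the first is the direction of maximum variance [8].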

In the case of Coinsequence, combining multiple karmic profiles can produce a dataset with many dimensions. To identify dimensions that can be dropped, pair plots between variables in the students’ karmic profiles, such as GPA and school attendance, will be created.
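A minimal PCA sketch with NumPy is shown below. The five student rows (GPA, SAT score, age) are invented for the example, and standardizing the columns before the eigendecomposition is one common choice rather than a step stated in this paper.

```python
import numpy as np

# Hypothetical profiles: columns are GPA, SAT score, and age.
X = np.array([
    [3.2, 1250, 17],
    [3.9, 1530, 17],
    [3.5, 1380, 17],
    [2.8, 1100, 17],
    [3.7, 1460, 17],
], dtype=float)

# Center each column; scale only the columns that vary, to avoid dividing by zero.
centered = X - X.mean(axis=0)
std = X.std(axis=0)
scaled = centered / np.where(std > 0, std, 1.0)

# The eigenvectors of the covariance matrix are the principal components.
cov = np.cov(scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort by decreasing variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()  # share of variance per component
projected = scaled @ eigenvectors[:, :1]     # keep only the first component
print(explained.round(4))
```

In this toy data every student has the same age, so that column contributes no variance, while GPA and SAT score are strongly correlated and collapse almost entirely onto the first principal component.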

2.7 Description of Filter Methods

Filter methods select features in a step similar to preprocessing, since the features are reduced before the learning phase. They help remove irrelevant, redundant, constant, duplicated, and correlated features in a matter of seconds [9]. They do not train the ML algorithm and hence offer a simple approach. Several statistical measures are available, such as mutual information and the chi-squared score; this paper uses the chi-squared score because it is well suited to categorical variables like those of Coinsequence’s activity posts. Since the chi-squared test measures whether two variables are independent, the goal is to choose the features on which the target classification depends most strongly. The “select K-best” method will be used to keep the K features with the highest chi-squared scores. Because the inputs do not contain multicollinearity and the ranking must be compute-efficient without overfitting the data (which would lead to poor recommendations), the filter method is the best choice for selecting the main features of the activity posts. The ranking favors the features that are most informative about the target classification and therefore offer the most insight into a student’s unique qualities.
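To make the chi-squared score concrete, the sketch below computes the textbook Pearson statistic, the sum over contingency-table cells of (observed − expected)² / expected, for a categorical feature against a target, and then keeps the two highest-scoring features. The binary indicators and the `matched` target are hypothetical, and library implementations may compute the score somewhat differently.

```python
def chi_squared(feature, target):
    """Pearson chi-squared statistic for two categorical lists of equal length."""
    n = len(feature)
    stat = 0.0
    for f in sorted(set(feature)):
        for t in sorted(set(target)):
            observed = sum(1 for a, b in zip(feature, target) if a == f and b == t)
            # Expected count if the feature and the target were independent.
            expected = feature.count(f) * target.count(t) / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical data: 1 means the student posted in that category.
matched = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical target label
features = {
    "club activities": [1, 1, 1, 1, 0, 0, 0, 0],
    "music and performance activities": [1, 1, 1, 0, 1, 0, 0, 0],
    "student government activities": [1, 0, 1, 0, 1, 0, 1, 0],
}

scores = {name: chi_squared(values, matched) for name, values in features.items()}
top_k = sorted(scores, key=scores.get, reverse=True)[:2]  # "select K-best" with K = 2
print(scores, top_k)
```

A score of zero means the feature looks independent of the target in this sample, so it is the first candidate to be dropped.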

2.8 Description of Wrapper Methods

Wrapper methods apply the ML algorithm to subsets of features and use the resulting performance as the evaluation criterion. Since they evaluate many combinations of features to find the best-performing set, they are computationally expensive. A wrapper method usually follows four steps: searching for a subset of features, training the ML algorithm on this subset, evaluating the performance of the resulting model, and repeating until the best-performing subset is found [10]. For Coinsequence, sequential forward feature selection (SFS) is used since it is more practical than exhaustive methods. SFS is an iterative method: each feature is first evaluated individually and the best performer is selected; that feature is then combined with each remaining feature, and the best-performing pair is chosen, and so on. To avoid getting locked into an early choice, sequential floating is used, which can both add and remove features at each iteration in order to optimize the set. For this, SFS executes “backward steps” provided that the objective function increases [9]: once the best feature is added, the algorithm checks whether deleting the worst chosen feature improves the objective function. If so, that feature is deleted, and the process repeats until a specified criterion is reached.
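The forward and floating (backward) steps can be sketched generically as below. The `score` function stands for whatever the wrapper evaluates, typically the cross-validated accuracy of the ML model on a feature subset; here it is replaced by a hypothetical additive score so the example runs on its own.

```python
def floating_forward_selection(features, score, k):
    """Greedy forward selection with a conditional backward step after each addition."""
    selected = []
    best_by_size = {}  # best score seen so far for each subset size
    while len(selected) < k:
        # Forward step: add the feature that yields the best-scoring subset.
        candidates = [f for f in features if f not in selected]
        best = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [best]
        size = len(selected)
        best_by_size[size] = max(best_by_size.get(size, float("-inf")), score(selected))
        # Backward steps: drop the least useful feature only if the smaller subset
        # beats the best subset of that size found so far.
        while len(selected) > 2:
            worst = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != worst]
            if score(reduced) > best_by_size.get(len(reduced), float("-inf")):
                selected = reduced
                best_by_size[len(reduced)] = score(reduced)
            else:
                break
    return selected


# Hypothetical usefulness of each sub-feature, with a small penalty per feature.
value = {"honor_society": 3.0, "math_club": 2.5, "key_club": 0.5, "pep_club": 0.2}


def score(subset):
    return sum(value[f] for f in subset) - 0.1 * len(subset)


print(floating_forward_selection(list(value), score, k=2))
```

With an additive score like this one the backward step never fires; it matters when features interact, so that a feature chosen early becomes redundant once later ones are added.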

2.9 Hybrid Method

In classifying users’ activity posts, efficiency and accuracy are equally important. Efficiency is the speed at which the classification algorithm determines the category; accuracy is how closely that category agrees with the one identified manually. While filter methods are fast and robust, wrapper methods offer higher accuracy. One activity post can contain more than 2,000 features, as seen in the American Time Use Survey Activity Lexicon. The lexicon used to classify activity posts is divided into two levels: the first level has 467 main features, and the second level has 1,174 sub-features. This structure allows developers to take advantage of the filter method’s efficiency to generate a ranked list of the main features. The filter method removes unwanted main features, and the wrapper method then selects an optimal subset of sub-features. The filter method provides very fast results, while the wrapper method provides a detailed, validated selection of sub-features. Since using wrapper methods exclusively could result in overfitting, this combination of filter and wrapper methods is well suited to the task.

To apply the filter method to the main features, a list of category (or feature) names must first be defined. Then a pandas DataFrame listing extracurricular activities is created from the activity posts of various users. This DataFrame is divided into x and y values, to which the chi-squared test is applied to extract the K best features. Next, the statistical test is initialized using the “SelectKBest” class in sklearn’s feature selection module. The score function is set to chi-squared (“chi2”), and k, the number of highest-scoring features to keep, is set to 2. Lastly, the initialized selector is fit to the dataset and used to transform it, producing the category list and the k best features.
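The filter step described above could look roughly like the following. The post counts and the `matched` target column are hypothetical; `SelectKBest` and `chi2` are the scikit-learn names the text refers to.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical counts of activity posts per main feature, one row per student.
posts = pd.DataFrame({
    "club activities": [9, 8, 7, 1, 0, 2],
    "music and performance activities": [5, 6, 4, 2, 3, 1],
    "student government activities": [2, 3, 2, 3, 2, 2],
    "matched": [1, 1, 1, 0, 0, 0],  # hypothetical target label
})

# Split into x (features) and y (target); chi2 requires non-negative feature values.
X = posts.drop(columns="matched")
y = posts["matched"]

# Keep the two features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=2)
X_best = selector.fit_transform(X, y)

kept = list(X.columns[selector.get_support()])
print(kept)
```

`get_support()` returns a boolean mask over the original columns, which is a convenient way to recover the names of the selected features after the transform.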

After the main features have been selected, the SFS wrapper method is applied to the sub-features. The first few steps are identical to the filter method process defined above, but the list now holds sub-features, such as [“key club activities,” “language club activities,” “National Honor Society activities,” etc.] for the extracurricular club activities. After importing the required classes (SequentialFeatureSelector and LogisticRegression) and creating the x and y values, an SFS is initialized with logistic regression as the estimator, k set to 5, and the boolean forward set to true to enable forward selection [11]. Finally, the five best sub-features are selected and can be fed into the ML algorithm.
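A sketch of the wrapper step follows, using scikit-learn's `SequentialFeatureSelector`, where the number of features is set with `n_features_to_select` and forward selection with `direction="forward"`. The `k` value and `forward` boolean mentioned in the text appear to correspond to the mlxtend class of the same name, so this is an assumed equivalent rather than the exact call used. The data are randomly generated and the sub-feature names are shortened for the example.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sub_features = [
    "key club", "national honor society", "language club", "pep club",
    "math club", "american field service", "chess club", "debate club",
    "bipoc clubs", "science club",
]

# Hypothetical data: post counts per sub-feature for 60 students.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, len(sub_features))).astype(float)
# Hypothetical label that depends on two of the sub-features.
y = (X[:, 1] + X[:, 4] > 4).astype(int)

# Forward selection of five sub-features with logistic regression as the estimator.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",
    cv=3,
)
sfs.fit(X, y)

selected = [name for name, keep in zip(sub_features, sfs.get_support()) if keep]
print(selected)
```

Each candidate subset is scored by cross-validation, which is why wrapper methods cost far more than the filter step: this small example already trains over a hundred models.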

Results

After preprocessing—performing the four steps listed in the LDA methods section—the following will result.

Investor’s search prior to preprocessing:

I want the student to be playing using a racquet, specifically tennis, and she should be able to take part in two fine arts like dancing or some other artistic skills.

Investor’s search after preprocessing:

[Racquet, tennis, dance, arts]

Table 1 shows the results from applying LDA to the preprocessed searches.

| Cluster title | Values of cluster |
| --- | --- |
| Possible racquet sports | {tennis, squash, badminton, pickleball, table tennis, tennis polo, speedball, beach tennis} |
| Possible computer science majors | {computer engineering, data science, machine learning, computational biology, coding} |
| Possible performing arts | {choir, theatre, drama, broadway, ballet, jazz, opera, gymnastics, impromptu concert} |
| Possible self-care activities | {meditative podcast, SoulCycle, ASMR, hiking, boba tea, mental health therapy} |

Table 1. Outputs after applying LDA to processed investors’ searches for student karmic personas.

An example of the criteria specified in the investor’s search is shown below.

Has X number of Karma Points

Plans to pursue Computer Science

Scored 1530/1600 on the SAT

Plays on the varsity tennis team

0.3X come from sport activities

Figure 5 shows a sample decision tree based on the above example of an investor’s search.

image5.png

Figure 5. Example of a decision tree for Coinsequence

As observed in Figure 5, each node contains various tests, which are represented by blue rectangles. If the condition is true, the arrow leads to the left node; if the condition is false, the arrow leads to the right node. The leaves classifying the karmic persona’s outcome are represented by a green “YES” or a red “NO.” The sets in curly brackets refer to categorical lists generated by the LDA model.
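Read as code, a tree of this kind is a chain of if-else tests. The sketch below is a hypothetical, hand-written tree built from the example criteria listed above; it does not reproduce the structure of Figure 5. The persona fields are assumed names, and treating the SAT score and karma points as minimum thresholds is an assumption about how the criteria would be applied.

```python
# Clusters in the style of the LDA output (values taken from the clusters listed earlier).
RACQUET_SPORTS = {"tennis", "squash", "badminton", "pickleball", "table tennis"}
CS_MAJORS = {"computer engineering", "data science", "machine learning",
             "computational biology", "coding"}


def matches_investor(persona, min_karma_points):
    """Hypothetical hand-written tree: each `if` is one test node, each return a leaf."""
    if persona["karma_points"] < min_karma_points:
        return "NO"
    if persona["intended_major"] not in CS_MAJORS:
        return "NO"
    if persona["sat"] < 1530:
        return "NO"
    if not set(persona["sports"]) & RACQUET_SPORTS:
        return "NO"
    # At least 30% of the karma points must come from sport activities.
    if persona["sport_karma_points"] >= 0.3 * persona["karma_points"]:
        return "YES"
    return "NO"


# Hypothetical persona used only to exercise the tree.
sophia = {
    "karma_points": 1000,
    "sport_karma_points": 350,
    "intended_major": "data science",
    "sat": 1540,
    "sports": ["tennis"],
}
print(matches_investor(sophia, min_karma_points=800))
```

Testing membership in a cluster, rather than equality with a literal keyword, is what lets a search for “racquet” match a student who posts about squash or badminton.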

Using the age, GPA, and SAT score of students as the variables, the pair plots in Figure 6 are created.

image6.png

Figure 6. Plots showing dimensionality reduction for a Coinsequence dataset.

Similarly, more attributes will be used to generate other trends. These can be easily incorporated by importing the right datasets.

The hybrid of filter and wrapper feature selection is demonstrated on a portion of the features in activity posts. Three extracurricular main features and 10 sub-features of extracurricular club activities are listed in Table 2 below.

| Level | Feature |
| --- | --- |
| Main feature | Extracurricular music and performance activities |
| Main feature | Extracurricular student government activities |
| Main feature | Extracurricular club activities |
| Sub-feature (club activities) | Key Club activities, including meetings |
| Sub-feature (club activities) | National Honor Society activities |
| Sub-feature (club activities) | Language club activities like French club |
| Sub-feature (club activities) | Pep club activities |
| Sub-feature (club activities) | Math club activities like tutoring |
| Sub-feature (club activities) | American Field Service activities |
| Sub-feature (club activities) | Chess club activities like competitions |
| Sub-feature (club activities) | Debate club competition like MUN |
| Sub-feature (club activities) | BIPOC clubs like activism |
| Sub-feature (club activities) | Science club activities like science fair |

Table 2. Extracurricular club activities table showing main features and sub-features. Table extracted from the American Time Use Survey Activity Lexicon.

The filter method ranks the main features using the chi-squared test: “club activities” shows the highest dependence based on the data, whereas “student government activities” shows significant independence. The order is thus [“club activities,” “music and performance activities,” “student government activities”]. Since k is equal to 2, the best two features are kept, and the student government activities feature is removed.

Once the main features are classified, the next step is to precisely determine the sub-features.

The wrapper method removes redundant features; since k is equal to 5, the five best sub-features are kept, which came out to be [“national honor society activities,” “math club activities,” “BIPOC clubs,” “science club activities,” “language club activities”]. The other five sub-features are removed and not fed into the ML algorithm.

Discussion and Limitations

As technology advances at an exponential rate, its impact widens astronomically. Social media platforms built on sophisticated artificial intelligence and data-mining algorithms are influencing the social behavior of billions of people, affecting the fundamental fabric of democracy. The potential for bias in AI is omnipresent, from movie recommendations to customer service bots to self-driving cars. For instance, many recruitment tools reinforce stereotypes by surfacing privileged demographics disproportionately, leading to further suppression of underrepresented groups. The unconscious bias of a handful of like-minded engineers thus becomes embedded in algorithms.

Given the prevalence of bias and exclusion, it is important to ensure that assessments are fair. Hence, future work on this research can focus on responsible AI, creating physically, emotionally, and cognitively inclusive technological solutions. Chatbots and auto-correct algorithms, for example, have exposed gaps in which language technologies exclude the voices and needs of underrepresented groups. The research potential therefore includes a focus on ethics, transparency, and inclusion. This includes, but is not limited to, inserting checks and balances to ensure that the matches being made do not suppress underrepresented groups and are representative of the population. Additionally, the algorithm should not favor a particular feature, demographic, or characteristic. Overall, the research hopes to bring accountability to automated decision-making and to use personalized recommendations to support work on mitigating discrimination in algorithmic employment screening.

Moreover, this research hopes to be part of the future of engineering, sparking designs that reflect the researcher’s ideals of diversity and ethics.

Conclusion

It can be inferred that preprocessing the search text produces uniform text that is easier to process. This noise removal helps LDA perform better, leading to more accurate recommendations based on search inputs. Additionally, LDA uncovered useful themes in the data, which would effectively drive the matching process. Applying LDA to curated data from karmic personas and investor searches revealed substantial, valuable clusters with accurate topics.

It can be inferred that the gradient boosted decision tree is a good fit because it readily accepts the investors’ searches and LDA-produced clusters as inputs. The algorithm is also easy to comprehend and visualize, which is crucial for the transparency and accessibility Coinsequence’s matching algorithm aims to achieve.

Furthermore, using gradient boosting, the decision trees can be trained and improved in order to achieve a higher level of accuracy in the matching process.

As seen in the pair plots in Figure 6, the age variable has no variance and thus adds no meaning in this particular scenario, in which an investor is analyzing profiles based on academic skill; age can thus be dropped from the dataset. On the other hand, SAT score and GPA show significant variance, so these features are retained. Dimensionality reduction is shown to be effective in reducing redundant features and creating a concise, clear dataset.

The filter methods provide very efficient results at the preliminary stage, selecting three out of 467 features for a sample of 10 random activity posts. The wrapper methods then provide the best (most accurate) sub-feature selection for all 10 of the activity posts, matching what a human analyst would have selected.

Overall, the four methods researched were successful in the classification and matching of Coinsequence’s digital karmic personas, making the goal “debt free education for all” much more achievable. With a trusted peer-to-peer network of students and lenders, Coinsequence is envisioned to be a platform that alleviates the strain of student debt for graduates to come.

References

  1. “Student Loan Debt Statistics [2021]: Average Total Debt.” EducationData. February 28, 2021. https://educationdata.org/student-loan-debt-statistics.
  2. “ATUS News Releases.” U.S. Bureau of Labor Statistics. https://www.bls.gov/tus/.
  3. Li, Susan. “Topic Modeling and Latent Dirichlet Allocation (LDA) in Python.” Medium. June 01, 2018. https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24.
  4. Xu, Joyce. “Topic Modeling with LSA, PSLA, LDA & Lda2Vec.” Medium. December 20, 2018. https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05.
  5. Dwivedi, Priya. “NLP: Extracting the Main Topics from Your Dataset Using LDA in Minutes.” Medium. March 27, 2019. https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925.
  6. “Decision Trees for Decision Making.” Harvard Business Review. August 01, 2014. https://hbr.org/1964/07/decision-trees-for-decision-making.
  7. “The AI Behind LinkedIn Recruiter Search and Recommendation Systems.” LinkedIn Engineering. https://engineering.linkedin.com/blog/2019/04/ai-behind-linkedin-recruiter-search-and-recommendation-systems.
  8. Raj, Judy T. “Dimensionality Reduction for Machine Learning.” Medium. March 14, 2019. https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e.
  9. Charfaoui, Younes. “Hands-on with Feature Selection Techniques: Hybrid Methods.” Medium. July 11, 2020. https://heartbeat.fritz.ai/hands-on-with-feature-selection-techniques-hybrid-methods-b93b1b06d3a5.
  10. Brownlee, Jason. “How to Choose a Feature Selection Method For Machine Learning.” Machine Learning Mastery. August 20, 2020. https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/.
  11. Luhaniwal, Vikashraj. “Feature Selection Using Wrapper Methods in Python.” Medium. October 24, 2020. https://towardsdatascience.com/feature-selection-using-wrapper-methods-in-python-f0d352b346f.

About the author

Sanya Sharma is intrigued by computer science and its extensive applications. She runs clubs and develops apps to alleviate the gender gap in technology and the outstanding student debt. When not coding, she can be found reading autobiographies, dancing, putting together 5,000-piece puzzles, and watching Grey’s Anatomy. Cooking and creating escape rooms are her newest endeavors.
