## Introduction

Over the past few months, the Data Science team here at HackerRank has been analyzing our own internal hiring data to see what we could learn from it. For example, which factors are the biggest predictors of success when evaluating talent using skill assessments? What else can we learn about candidates from just their coding challenge data?

In this article, we’ll talk about one of the key challenges we had with our dataset, and how we solved it by building a custom package for PySpark.

## What is Stratified Sampling?

When building a supervised learning model to solve a binary classification problem, at the very least, you need a dataset with both positive and negative outcomes. But sometimes the way in which the dataset is sampled to create training and test sets can alter the resulting model. For most cases where outcomes are balanced, random sampling using the entire dataset will suffice. However, if a dataset has a class imbalance, a purely random split can give the training and test sets a different class ratio each time, so the models produce inconsistent results from one training run to the next.

An example of this would be the college admissions data of a prestigious university. Let’s say that at a given university, only around 10% of applicants are admitted, and the rest are rejected. In this case, the outcome set is extremely skewed towards the rejects. If we build a machine-learning model with this dataset, how can we ensure that the training data is a good representation of the real world?

Enter: **Stratified sampling**, a sampling method used to ensure that the sampled dataset proportionally contains the same representation as the original dataset.

Continuing our example above, let’s say we are trying to take a 30% sample of the dataset. To apply stratified sampling to our dataset, we would do the following:

- Split the applicant pool into admitted and rejected
- Sample 30% of the admitted and rejected applicants separately
- Combine the admitted and rejected applicant samples

And voila! We have our stratified sample. Simple, right? Because we sampled the admitted and rejected separately, our resulting combined sample has 10% admitted applicants and 90% rejected applicants.
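The three steps above can be sketched in plain Python. This is only an illustration: the `stratified_sample` helper and the synthetic applicant records are made up for this example, and the sketch samples an exact count from each class.

```python
import random


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample the same fraction from every class, then combine the results."""
    rng = random.Random(seed)

    # Step 1: split the rows into one group per class.
    strata = {}
    for row in rows:
        strata.setdefault(label_of(row), []).append(row)

    # Steps 2 and 3: sample each group separately and combine.
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical applicant pool: 1,000 applicants, 10% admitted.
applicants = [{"id": i, "admitted": i < 100} for i in range(1000)]

sample = stratified_sample(applicants, lambda a: a["admitted"], 0.3)
admitted_share = sum(a["admitted"] for a in sample) / len(sample)
print(len(sample), admitted_share)
```

Because each class is sampled on its own, the admitted share of the sample matches the admitted share of the full pool.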

The advantage of stratified sampling is that it reduces selection bias. Stratifying the entire dataset before sampling ensures an accurate representation of the data in question.

Note that each data point must be exclusively in one class or the other for this method to be useful. Using our example above, if we introduce more classifications such as eye color, religious views, or race, then the stratification now becomes overly complicated, and stratified sampling loses its effectiveness.

## HackerRank’s Problem

We looked at the dataset of our internal hiring for software engineers. Much like college applicants, people who apply for a job either receive an offer or are rejected.

At the start of our analysis, we received mixed results every time we ran our models. This caused us to wonder why the model would treat the test sets differently every time we created them. Looking into the created test sets, we noticed that the ratios of rejected and offered candidates changed each time. In some extreme cases, the whole set only contained rejected candidates!
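To see how this can happen, here is a small plain-Python sketch that repeatedly draws a simple random test set from an imbalanced label list and records the share of offers in each draw. The data and the `offer_rate_in_random_test_set` helper are hypothetical and are not taken from our actual dataset.

```python
import random


def offer_rate_in_random_test_set(labels, test_fraction, rng):
    """Draw a simple random test set and return the share of offers in it."""
    test_size = round(len(labels) * test_fraction)
    test_labels = rng.sample(labels, test_size)
    return sum(test_labels) / test_size


# Hypothetical data: 200 candidates, 10 of whom (5%) received an offer.
labels = [1] * 10 + [0] * 190
rng = random.Random(0)

# Draw 20 independent 10% test sets and compare their offer rates.
rates = [offer_rate_in_random_test_set(labels, 0.1, rng) for _ in range(20)]
print("lowest offer rate:", min(rates))
print("highest offer rate:", max(rates))
```

With so few positive examples, the offer rate of a random test set can swing noticeably between draws, and a test set with no offers at all is possible.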

You ask, “would **stratified sampling** work?” This entire article has been a spoiler, but yes, it would. The Python implementation is fairly straightforward as well. Given a PySpark DataFrame with a label column, we can do the following to get a 30% `sample_set`:

```
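# split the dataset by label
offer = dataset[dataset['label'] == 1]
reject = dataset[dataset['label'] == 0]
# randomSplit takes a list of weights and returns a list of DataFrames;
# keep the first (roughly 30%) split of each class
offer_sample = offer.randomSplit([0.3, 0.7])[0]
reject_sample = reject.randomSplit([0.3, 0.7])[0]
# combine the two class samples
sample_set = offer_sample.unionAll(reject_sample)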
```

When we first decided to use PySpark, we found that it supports k-fold cross-validation but not stratified cross-validation. So, to take it one step further, we wanted to introduce stratified sampling to the PySpark `CrossValidator` class.

The one difference in implementation is that the k-folds must be the same size. We implemented it as part of a new `StratifiedCrossValidator` class that extends the PySpark `CrossValidator` class. With `n_folds` being the number of folds we cross-validate with, the code is:

```
from functools import reduce

# creating the folds: split each class into n_folds roughly equal parts
split_ratio = 1.0 / n_folds
offer_folds = offer.randomSplit([split_ratio] * n_folds)
reject_folds = reject.randomSplit([split_ratio] * n_folds)
# fold i pairs the i-th offer split with the i-th reject split
sample_folds = [offer_folds[i].unionAll(reject_folds[i]) for i in range(n_folds)]
...
# combining the folds: fold i is the test set, the others form the training set
for i in range(n_folds):
    train_folds = [x for j, x in enumerate(sample_folds) if j != i]
    train_set = reduce(lambda x, y: x.unionAll(y), train_folds)
    test_set = sample_folds[i]
```
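The same fold logic can be tried without a Spark cluster. The following is a minimal plain-Python sketch, not the package's implementation: the `stratified_folds` helper is hypothetical, and it deals each class out round-robin instead of calling `randomSplit`, so the folds come out exactly equal in size when the class counts divide evenly.

```python
import random


def stratified_folds(labels, n_folds, seed=0):
    """Return n_folds lists of row indices with near-equal class ratios."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Hypothetical labels: 10 offers and 90 rejections.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, n_folds=5)

for i, test_idx in enumerate(folds):
    # Fold i is the test set; every other fold goes into the training set.
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    offers_in_test = sum(labels[idx] for idx in test_idx)
    print(i, len(train_idx), len(test_idx), offers_in_test)
```

Each test fold here holds the same number of offers, which is the property that plain k-fold splitting does not guarantee on imbalanced data.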

## The Open Source Project

Because we wanted to provide a solution for the PySpark community, we open sourced our package here. It can be installed via `pip install spark-stratifier`. Note that this will also install PySpark and NumPy if you do not already have them installed.

If you are interested in contributing, please message us or join the discussion on GitHub. Lastly, if you enjoyed this post, please follow us and stay tuned for more!
