When you submit code on hackerrank.com, it is processed on a code execution platform that we call codechecker. Over the past two years we have invested a lot of time into profiling and improving key metrics, which has helped cut down on users’ wait time.
Overview
The steps involved in processing a code submission are sequential. Here is a high-level flowchart:
Figure 1: Submission processing flowchart
Problem
As the test-case count increases, a user will have to wait longer to see the result.
Solution 1: Spawn multiple processes that execute batches of test-cases in parallel on the same worker. Drawback: this could lead to CPU starvation for one or more test-cases if a single test-case monopolizes the CPU cores due to unoptimized code.
Solution 2: Since the codechecker fleet consists of identical server instances, we can break the request into multiple windows and distribute them over idle codechecker instances. That way we always ensure 100% of a server’s resources are available for test-case execution and there is zero resource starvation.
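As a rough illustration of the windowing idea in Solution 2, here is a minimal sketch of splitting a submission’s test-cases into roughly equal windows, one per idle instance. The function name, its signature, and the even-split strategy are our assumptions, not the actual codechecker implementation.

```python
def split_into_windows(test_cases, num_idle_instances):
    """Split test-cases into roughly equal contiguous windows, one per
    idle codechecker instance (illustrative sketch, not production code)."""
    base, extra = divmod(len(test_cases), num_idle_instances)
    windows, start = [], 0
    for i in range(num_idle_instances):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        if size == 0:
            break
        windows.append(test_cases[start:start + size])
        start += size
    return windows

# e.g. 10 test-cases over 3 idle instances -> window sizes 4, 3, 3
print([len(w) for w in split_into_windows(list(range(10)), 3)])
```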
When to parallelize?
It’s not always optimal to execute a submission in parallel. The processing overhead of splitting the test-cases and merging the responses can be slower than executing the test-cases in sequence. Here is a control function that decides whether parallelizing is worthwhile.
Figure 2: Parallelizing control function
If we plot this equation as a function of n (vs. S_optimal), it looks something like this:
Figure 3: Equation Plot
What is β? It is essentially a tuning factor for the split function: we can use it to control the upper limit on the number of splits.
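The exact control function is given in Figure 2, so the sketch below is only one plausible shape for it: the per-test-case time estimate t_est, the per-split overhead cost, and the linear cost model are all our assumptions. Only the idea of comparing the parallel cost against the sequential cost, with β capping the split count, comes from the discussion above.

```python
import math

def optimal_splits(n, t_est, overhead, beta):
    """Illustrative control function (not the exact Figure 2 equation):
    pick the split count s that minimizes estimated wall-clock time,
    where each extra split adds a fixed split/merge overhead and beta
    caps the number of splits. Returns 1 when sequential execution wins."""
    best_s, best_cost = 1, n * t_est  # cost of running all n sequentially
    for s in range(2, beta + 1):
        cost = math.ceil(n / s) * t_est + s * overhead
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s

# e.g. 100 test-cases at ~0.5s each, ~1s split/merge overhead, beta = 8
print(optimal_splits(100, 0.5, 1.0, 8))  # -> 6 splits under this toy model
```

For small n, the overhead term dominates and the function returns 1 (run sequentially), which matches the shape described around Figure 3.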
Merging distributed transient response data
We initially experimented with AWS Lambda, using an S3 trigger, to merge the distributed transient partial execution results into the final result. Lambda was ideal for this kind of data-manipulation task with no external service/code dependency.
Advantages:
- Super easy to deploy
- Ultra low running cost
- No overhead on existing codechecker workers
Disadvantages:
- Difficult to debug.
- Very painful logging/monitoring.
- An S3 call is a time-consuming operation: a GET call takes 50–100 ms on average to complete.
- S3’s read-after-write consistency caveat: if you make a HEAD or GET request to a key (to check whether the object exists) before creating the object, Amazon S3 provides only eventual consistency for read-after-write.
Due to S3’s consistency caveat, we noticed a small fraction (0.0004%) of GET calls failing. This could have been handled via waitUntilObjectExists() or a retry mechanism, but that would have added more delay to processing the submission.
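For reference, the retry path we decided against might look something like this boto3 sketch; the bucket/key layout and retry parameters are placeholders, and each retry attempt is exactly the extra delay we wanted to avoid.

```python
import time
import boto3
import botocore.exceptions

s3 = boto3.client("s3")

def get_partial_result(bucket, key, retries=3, backoff=0.05):
    """Fetch a transient partial result, retrying when eventual
    consistency makes a freshly written key momentarily invisible.
    (Sketch only; bucket/key naming is illustrative.)"""
    for attempt in range(retries):
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except botocore.exceptions.ClientError as err:
            if err.response["Error"]["Code"] != "NoSuchKey":
                raise  # a real failure, not a consistency hiccup
            time.sleep(backoff * (2 ** attempt))  # every retry adds user-visible delay
    raise TimeoutError(f"partial result {key} never became visible")
```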
Because of this failure rate, we decided to drop Lambda and instead store the transient partial execution results, along with control counters, in a cache store, which in turn improved the fetch-and-merge time.
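The post doesn’t name the cache store, so take Redis as an assumption; a minimal sketch of the counter-based merge could look like the following, where each worker stores its window’s partial result and atomically increments a control counter, and the last worker to finish performs the merge. All key names and the merge format are illustrative.

```python
import redis

r = redis.Redis()  # the actual cache store/configuration is not specified in the post

def report_window_result(submission_id, window_id, result, total_windows):
    """Called by a worker when its window of test-cases finishes.
    Returns the merged result if this was the last window, else None."""
    r.hset(f"results:{submission_id}", str(window_id), result)
    done = r.incr(f"counter:{submission_id}")  # atomic control counter
    if done == total_windows:  # last window in: merge all partial results
        partials = r.hgetall(f"results:{submission_id}")
        return [partials[k] for k in sorted(partials, key=int)]  # restore window order
    return None
```

Because INCR is atomic, exactly one worker observes the counter reaching total_windows, so the merge runs once without any extra locking.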
Visible improvements
We saw an immediate drop in users’ wait time for submissions with a higher number of test-cases. Despite an increase in the number of submissions over the following days, there was a drastic drop in submission turn_around_time.
Figure 4: No. of submissions vs. (99th percentile) wait time
Live status update of parallel execution
GIF 1: Live status on parallel execution
Intern Project
At HackerRank we have a tradition of handing high-impact projects to capable interns. Parallel execution of test-cases was taken up by one of our summer interns in 2017. Here’s what he had to say about his experience:
“It was fun and a really great time coding this stuff. We fixed the logging issue by implementing our own logging mechanism for AWS Lambda, which eventually made it easier to debug. After two weeks I was all set to submit the code for review and testing. After the code quality test, Arpith (Lead Engineer) and I worked closely to cover all possible cases to test the code.
It was great working with Hari (CTO, Co-Founder) and Arpith (Lead Engineer) and delivering code that has such a great impact. I loved working at HackerRank for the great learning experience and all the fun parts of my internship.”
Arpith works as part of the platform team managing codechecker infrastructure.