Casper

Run your Java applications on Big Data Frameworks

Many parallel data processing frameworks have been proposed in recent years (e.g., Spark, Hadoop, GraphLab). However, using such frameworks requires rewriting existing code into the domain-specific languages they support. This rewriting is tedious and error-prone, and even developers who are willing to rewrite must still choose which framework to target, as each delivers a different amount of performance improvement.

Casper is a compiler that automatically retargets sequential Java code to execute on Apache Spark. Given a sequential code fragment, Casper uses verified lifting to infer a high-level summary of what the code computes. The inferred summary is then compiled to executable Spark code. The entire process is fully automated, and our translated benchmarks currently run on average 6.1x faster than their sequential implementations.
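To give some intuition for what a summary is (the example below is ours for illustration, not Casper's actual intermediate representation): a summary is a declarative description of what a loop computes, and it is this declarative form that makes translation to parallel operators possible. For a loop that sums a list, the loop and a summary-like equivalent look like:

  import java.util.List;

  public class SummaryIntuition {
      // Imperative loop: what the programmer originally wrote.
      static int sumLoop(List<Integer> data) {
          int total = 0;
          for (int x : data) {
              total += x;
          }
          return total;
      }

      // Declarative equivalent: the same computation expressed as a reduction,
      // which maps directly onto a parallel operator such as Spark's reduce.
      static int sumSummary(List<Integer> data) {
          return data.stream().reduce(0, Integer::sum);
      }
  }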

Source

Casper is open-source software developed on GitHub. Please feel free to submit comments, issues, and pull requests there!

News

We have just released the Casper source code on GitHub. We are actively working to roll out more features and bug fixes. Please subscribe to our mailing list to stay up to date with the latest news regarding Casper.

We have recently finished developing a prototype implementation of Casper. Our paper describing this effort will appear at SYNT 2016!

Publications

[1]  Leveraging Parallel Data Processing Frameworks with Verified Lifting
      Maaz Bin Safeer Ahmad and Alvin Cheung
      SYNT 2016 – Presentation Slides
      Best Student Paper Award

Example: Word Count

The input to Casper is unannotated sequential Java code. The following code computes a word count by iterating sequentially over a list of words:

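A minimal sketch of such a sequential word count (the class and method names here are illustrative, not taken from the benchmark suite):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class WordCount {
      // Count how often each word occurs by visiting the list one element at a time.
      public static Map<String, Integer> countWords(List<String> words) {
          Map<String, Integer> counts = new HashMap<>();
          for (String word : words) {
              Integer count = counts.get(word);
              counts.put(word, count == null ? 1 : count + 1);
          }
          return counts;
      }
  }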

We can use Casper to retarget the Java program to Apache Spark automatically:

  $ ./bin/run.sh WordCount.java WordCountTranslated.java

which translates WordCount.java and writes the generated code to WordCountTranslated.java:

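Casper's actual output may differ in its details; the following is a sketch of the translated program using the standard Spark RDD word-count pattern (names are again illustrative):

  import java.util.List;
  import java.util.Map;
  import org.apache.spark.api.java.JavaSparkContext;
  import scala.Tuple2;

  public class WordCountTranslated {
      // The sequential loop becomes a pipeline of parallel RDD operations.
      public static Map<String, Integer> countWords(JavaSparkContext sc,
                                                    List<String> words) {
          return sc.parallelize(words)
                   .mapToPair(word -> new Tuple2<>(word, 1))
                   .reduceByKey((a, b) -> a + b)
                   .collectAsMap();
      }
  }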

The sequential loop in the original program has been replaced by Spark RDD operations that can execute in parallel. This leads to better performance and scaling as the input size grows:

The above graph was generated by executing both solutions on a 10-node cluster of AWS m3.xlarge instances. The input data files were stored on HDFS.

People

If you have any questions, want to be kept up to date, or just want to say hi, email us!