Run your Java applications on Big Data Frameworks

Many parallel data processing frameworks have been proposed recently (e.g., Spark, Hadoop, GraphLab). However, using such frameworks requires rewriting existing code in the domain-specific languages they support. This rewriting process is tedious and error-prone, and even developers who are willing to rewrite must still choose which framework to target, as each delivers a different amount of performance improvement.

Casper is a compiler that automatically retargets sequential Java code to be executed on Apache Spark. Given a sequential code fragment, Casper uses verified lifting to infer a high-level summary of the code. The inferred summary is then compiled for execution on Spark. The entire process is fully automated, and our translated benchmarks currently execute on average 13.1x (and up to 29x) faster than the sequential implementations.


Casper is open-source software, and is developed on GitHub. Please feel free to submit comments, issues, and pull requests there!


New: We will be demoing Casper at this year's SIGMOD conference!

We have released the Casper source code on GitHub. We are actively working to roll out more features and bug fixes. Please subscribe to our mailing list to stay up to date with the latest news regarding Casper.

We have recently finished developing a prototype implementation of Casper. Our paper describing this effort will appear at SYNT 2016!


[1]  Optimizing Data-Intensive Applications Automatically By Leveraging Parallel Data Processing Frameworks

      Maaz Bin Safeer Ahmad and Alvin Cheung
      SIGMOD 2017 Demo (To appear)

[2]  Leveraging Parallel Data Processing Frameworks with Verified Lifting
      Maaz Bin Safeer Ahmad and Alvin Cheung
      SYNT 2016 – Presentation Slides
      Best Student Paper Award


If you have any questions, want to be kept up to date, or just want to say hi, email us!

Example: Word Count

The input to Casper is unannotated sequential Java code. The following code computes a word count by sequentially iterating over a list of words:
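The original listing does not survive in this text; as a minimal sketch (class and method names here are illustrative, not taken from the Casper benchmarks), a sequential word count of this shape might look like:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCount {
    // Sequentially counts the occurrences of each word in the list.
    public static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            // Increment the count for w, starting from 1 if unseen.
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java", "spark");
        System.out.println(countWords(words));
    }
}
```

The key point for Casper is the loop: a single pass that folds each element into an accumulator, which is exactly the pattern verified lifting can summarize and retarget.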

We can use Casper to re-target the Java program to Apache Spark automatically:

  $ ./bin/

which translates and writes the generated code to

The sequential loop in the original program has now been replaced by Spark RDD operations, which can be executed in parallel. This leads to better performance and scaling as the data size grows:
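To illustrate the shape of such a translation (this is a hand-written sketch using Spark's Java API, not Casper's verbatim output; the input and output paths are placeholders):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Runs locally here; on a cluster the master URL would differ.
        JavaSparkContext sc = new JavaSparkContext("local[*]", "wordcount");

        // Split each input line into words.
        JavaRDD<String> words = sc.textFile("input.txt")
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // Map each word to (word, 1), then sum the counts per key in parallel.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("output");
        sc.close();
    }
}
```

The `mapToPair`/`reduceByKey` pair plays the role of the sequential loop's accumulation: `reduceByKey` combines partial counts on each worker before shuffling, which is what allows the computation to scale with the number of nodes.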

The above graph was generated by executing both solutions on a 10-node cluster of AWS m3.xlarge instances. The input data files were stored on HDFS.