

A Python wrapper for Cascading


PyCascading is a Python wrapper for Cascading. You can control the full data processing workflow from Python.

  • Pipelines are built with Python operators
  • User-defined functions are written in Python
  • Arbitrary contexts can be passed to user-defined functions
  • Interim results in pipes can be cached for faster replay
  • Runs on Jython 2.5.2, making it easy to integrate Java and Python libraries
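The "arbitrary contexts" feature above can be pictured in plain Python terms: a context object is bound to the user-defined function when the pipeline is built, and is handed back on every call. Below is a minimal sketch of that idea in plain CPython; the decorator name and the dict-based tuples are made up for illustration and are not PyCascading's actual API.

```python
def with_context(context):
    """Hypothetical decorator factory: binds a fixed context object to a UDF."""
    def decorate(udf):
        def wrapper(tuple_):
            # Hand the bound context back to the UDF on every call
            return udf(tuple_, context)
        return wrapper
    return decorate

# A stopword set passed as the context to a word-splitting UDF
@with_context({'stopwords': {'the', 'a'}})
def split_words(tuple_, context):
    for word in tuple_['line'].split():
        if word not in context['stopwords']:
            yield [word]

words = [w for [w] in split_words({'line': 'the quick brown fox'})]
# words == ['quick', 'brown', 'fox']
```

The point is only the shape of the mechanism: the pipeline author supplies the context once, and the framework threads it through every invocation of the UDF.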


There can't be a MapReduce tutorial without counting words. Here it is:

def main():

    # 'input' and 'output' are the source and sink pipes, set up elsewhere
    @udf_map
    def split_words(tuple):
        for word in tuple.get('line').split():
            yield [word]

    input | split_words | group_by('word', native.count()) | output

Above, the user-defined function that reshapes the stream is annotated with a PyCascading decorator, and the workflow is created by chaining operations together with the '|' operator.
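The '|' chaining style can be modeled in plain Python with operator overloading: each stage implements __or__ to link itself to the next. The toy sketch below is only a conceptual illustration of that idea, not PyCascading's implementation; it runs the word count from above eagerly in CPython rather than as a Cascading flow.

```python
from collections import Counter

class Stage:
    """Toy pipeline stage: wraps a function from an iterable of rows to an iterable of rows."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, next_stage):
        # Chain: feed this stage's output into the next stage's function
        return Stage(lambda rows: next_stage.fn(self.fn(rows)))

    def run(self, rows):
        return self.fn(rows)

def split_words(rows):
    # Emit one row per word, like the UDF in the example above
    for row in rows:
        for word in row['line'].split():
            yield {'word': word}

def count_by_word(rows):
    # Stand-in for group_by('word', native.count())
    counts = Counter(row['word'] for row in rows)
    for word, n in sorted(counts.items()):
        yield {'word': word, 'count': n}

pipeline = Stage(split_words) | Stage(count_by_word)
result = list(pipeline.run([{'line': 'to be or not to be'}]))
# result == [{'word': 'be', 'count': 2}, {'word': 'not', 'count': 1},
#            {'word': 'or', 'count': 1}, {'word': 'to', 'count': 2}]
```

In the real framework the chained pipeline is not executed immediately; it describes a Cascading flow that is later submitted to Hadoop.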

More examples for the different use cases can be found in the examples folder. See also the docstrings in the sources for complete documentation of the arguments.

To try the examples, first build the Java sources as described below in the Building section. Then, change to the 'examples' folder, and issue either


for a simulated Hadoop local run, or

../ -m -s hadoop_server

to deploy automatically on a Hadoop server. hadoop_server is the SSH address of an account where the master jar and script will be scp'd to. Note that the '-m' option needs to be used only once, at the beginning: it copies the master jar to the server, subsequent deploys reuse this master jar, and only the Python script itself is copied over the network.


PyCascading may be used in one of two modes: local Hadoop mode, or remote Hadoop deployment. Please note that you need to specify the locations of the dependencies in the java/ file.

In local mode, the script is executed in Hadoop's local mode. All files reside on the local file system, and creating a bundled deployment jar is not necessary.

To run in this mode, use the script, passing the PyCascading script as the first parameter. Any additional command-line parameters are passed on to the script.

In Hadoop mode, we assume that Hadoop runs on a remote SSH server (or on localhost). First, a master jar is built and copied to the server. This jar contains all the PyCascading classes and other dependencies (but not Hadoop) needed to run a job, and it may get rather large if external jars are included. For this reason it is copied to the Hadoop deployment server only once; whenever a new PyCascading script is run by the user, only the Python script is copied to the remote server and bundled there for submission to Hadoop. The first few variables in the script specify the Hadoop server and the folders where the deployment files should be placed.

Use the script to deploy a PyCascading script to the remote Hadoop server.


Requirements for building:

  • Cascading 1.2.* or 2.0.0
  • Jython 2.5.2+
  • Hadoop 0.20.2+, preferably matching the version of the Hadoop runtime
  • A Java compiler
  • Ant

Requirements for running:

  • Hadoop installed and set up on the target server
  • SSH access to the remote server
  • If testing scripts locally, a working JVM callable as java

PyCascading consists of Java and Python sources. Python sources need no compiling, but the Java part needs to be built with Ant. For this, change to the 'java' folder, and invoke ant. This should build the sources and create a master jar for job submission.

The locations of the Jython, Cascading, and Hadoop folders on the file system are specified in the java/ file. You need to correctly specify these before compiling the source.

Also check the deployment script and the locations defined at the beginning of that file for where the jar files should be placed on the Hadoop server.


Have a bug or feature request? Please create an issue here on GitHub!

Mailing list

Currently we are using the cascading-user mailing list for discussions. If you have any questions, please ask there.


Gabor Szabo



Copyright 2011 Twitter, Inc.

Licensed under the Apache License, Version 2.0
