

A domain-general, Bayesian method for analyzing high-dimensional data tables




.. image:: https://travis-ci.org/probcomp/crosscat.svg?branch=master
   :target: https://travis-ci.org/probcomp/crosscat

CrossCat is a domain-general, Bayesian method for analyzing high-dimensional data tables. CrossCat estimates the full joint distribution over the variables in the table from the data, via approximate inference in a hierarchical, nonparametric Bayesian model, and provides efficient samplers for every conditional distribution. CrossCat combines strengths of nonparametric mixture modeling and Bayesian network structure learning: it can model any joint distribution given enough data by positing latent variables, but also discovers independencies between the observable variables.

A range of exploratory analysis and predictive modeling tasks can be addressed via CrossCat, including detecting predictive relationships between variables, finding multiple overlapping clusterings, imputing missing values, and simultaneously selecting features and classifying rows. Research on CrossCat has shown that it is suitable for analysis of real-world tables of up to 10 million cells, including hospital cost and quality measures, voting records, handwritten digits, and state-level unemployment time series.
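The structure that lets the model both cluster rows and discover independencies between columns can be illustrated with a small sketch. It is based on the general description of cross-categorization (columns are partitioned into views, and rows are partitioned separately within each view), not on this repository's code. The function names, the Chinese restaurant process concentration parameters ``alpha_cols`` and ``alpha_rows``, and the use of Python's ``random`` module are all assumptions made for illustration.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Sample a partition of n_items from a Chinese restaurant process."""
    assignments = []
    counts = []  # counts[k] = number of items already placed in block k
    for _ in range(n_items):
        # An existing block k has weight counts[k]; a new block has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new block
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_cross_categorization(n_rows, n_cols, alpha_cols=1.0,
                                alpha_rows=1.0, seed=0):
    """Sample a column-to-view partition and one row partition per view.

    Illustrative only: this draws from a prior over structures and does
    not perform inference on any data.
    """
    rng = random.Random(seed)
    column_views = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(column_views) + 1
    # Each view clusters the rows independently of the other views.
    row_clusters = [crp_partition(n_rows, alpha_rows, rng)
                    for _ in range(n_views)]
    return column_views, row_clusters


if __name__ == "__main__":
    views, clusters = sample_cross_categorization(n_rows=6, n_cols=4)
    print("column -> view:", views)
    for v, rows in enumerate(clusters):
        print("view", v, "row -> cluster:", rows)
```

In this sketch, columns assigned to different views are modeled as independent of each other, while columns in the same view share a single clustering of the rows. This is one way to read the "multiple overlapping clusterings" mentioned above.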


Local (Ubuntu)

You can install CrossCat using pip (no need to clone from git)::

    $ pip install crosscat

If you'd like to install from source, CrossCat can be successfully installed locally on bare Ubuntu server 14.04 systems with::

    $ sudo apt-get install build-essential cython python
    $ sudo apt-get install python-setuptools python-numpy
    $ git clone https://github.com/probcomp/crosscat.git

    $ cd crosscat
    $ python setup.py build
    $ python setup.py install  # or python setup.py develop

CrossCat can also be installed in a local Python virtual environment::

    $ cd crosscat
    $ virtualenv --system-site-packages /path/to/venv
    $ . /path/to/venv/bin/activate
    $ python setup.py build
    $ python setup.py install  # or python setup.py develop

A similar process has been found to work on OS X.
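After either install route, a quick check that the package can be imported might look like the sketch below. The helper name ``is_importable`` is hypothetical, and the check only confirms that ``import crosscat`` succeeds in the active environment. It says nothing about whether the compiled backend works correctly. It uses only ``importlib.import_module``, so it should run on both Python 2.7 and Python 3.

```python
import importlib


def is_importable(package_name):
    """Return True if package_name can be imported in this environment."""
    try:
        importlib.import_module(package_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "crosscat" is the package name used by the pip command above.
    print("crosscat importable: %s" % is_importable("crosscat"))
```

If this prints ``False`` inside a virtual environment, a likely first thing to check is that the environment was activated before running ``setup.py``.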


To run the automatic tests::

    $ ./check.sh


Note: The VM is only meant to provide an out-of-the-box usable system setup. Its resources are limited and large jobs will fail due to memory errors. To run larger jobs, increase the VM resources or install directly to your system.

`Python Client`_

.. _Python Client: https://docs.google.com/file/d/0B_CtKGJ4pH2TdmNRZkhmamg5aVU/edit?usp=drive_web

`C++ backend`_

.. _C++ backend: https://docs.google.com/file/d/0B_CtKGJ4pH2TeVo0Zk5IT3V6S0E/edit?usp=drive_web


dha_example.py (github_) is a basic example of analysis using CrossCat. For a first test, run the following from the directory above the top-level crosscat directory::

    $ python crosscat/examples/dha_example.py crosscat/www/data/dha.csv --num_chains 2 --num_transitions 2

.. _github: https://github.com/probcomp/crosscat/blob/master/examples/dha_example.py

Note: the default argument values take a considerable amount of time to run and are best suited to a cluster.
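To try a few small settings before committing to a long run, the command above can be assembled programmatically. The sketch below is an assumption-laden convenience wrapper, not part of CrossCat: the function names are hypothetical, and the only flags it uses are ``--num_chains`` and ``--num_transitions`` as shown in the command above.

```python
import subprocess


def build_dha_command(num_chains=2, num_transitions=2,
                      script="crosscat/examples/dha_example.py",
                      data="crosscat/www/data/dha.csv"):
    """Build the argument list for the example run shown above."""
    return ["python", script, data,
            "--num_chains", str(num_chains),
            "--num_transitions", str(num_transitions)]


def run_dha_example(**kwargs):
    """Run the example; raises CalledProcessError on a non-zero exit."""
    # Must be called from the directory above the top-level crosscat dir,
    # because the default script and data paths are relative.
    subprocess.check_call(build_dha_command(**kwargs))
```

For instance, ``run_dha_example(num_chains=2, num_transitions=5)`` would start a slightly longer run than the first test while staying far below the default argument values.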


`Apache License, Version 2.0`_

.. _Apache License, Version 2.0: https://github.com/probcomp/crosscat/blob/master/LICENSE
