
paperless

Scan, index, and archive all of your paper documents



Statistics on paperless

Number of watchers on Github 4135
Number of open issues 47
Average time to close an issue 9 days
Main language Python
Average time to merge a PR 3 days
Open pull requests 14+
Closed pull requests 9+
Last commit over 1 year ago
Repo Created almost 4 years ago
Repo Last Updated over 1 year ago
Size 5.2 MB
Organization / Author danielquinn
Latest Release 1.3.0
Contributors 1

Paperless


Index and archive all of your scanned paper documents

I hate paper. Environmental issues aside, it's a tech person's nightmare:

  • There's no search feature
  • It takes up physical space
  • Backups mean more paper

In the past few months I've been bitten more than a few times by the problem of not having the right document around. Sometimes I recycled a document I needed (who keeps water bills for two years?) and other times I just lost it... because paper. I wrote this to make my life easier.

How it Works

Paperless does not control your scanner; it only helps you deal with what your scanner produces.

  1. Buy a document scanner that can write to a place on your network. If you need some inspiration, have a look at the scanner recommendations page.
  2. Set it up to scan to FTP or something similar. It should be able to push scanned images to a server without you having to do anything. Of course, if your scanner doesn't know how to upload the file somewhere automatically, you can always do that manually. Paperless doesn't care how the documents get into its local consumption directory.
  3. Have the target server run the Paperless consumption script to OCR the file and index it into a local database.
  4. Use the web frontend to sift through the database and find what you want.
  5. Download the PDF you need/want via the web interface and do whatever you like with it. You can even print it and send it as if it's the original. In most cases, no one will care or notice.
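Steps 3 and 4 can be pictured with a minimal sketch. This is not Paperless's actual consumer: the `consume` and `search` functions, the `documents` table, and the pluggable `ocr` callable are all hypothetical names chosen for illustration, and the real project stores its index through Django rather than raw SQL.

```python
import sqlite3
from pathlib import Path
from typing import Callable, List


def consume(inbox: Path, db: sqlite3.Connection, ocr: Callable[[Path], str]) -> int:
    """Index every file found in `inbox` and return how many were processed."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents (name TEXT PRIMARY KEY, content TEXT)"
    )
    count = 0
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue
        text = ocr(path)  # step 3: turn the scan into searchable text
        db.execute(
            "INSERT OR REPLACE INTO documents (name, content) VALUES (?, ?)",
            (path.name, text),
        )
        count += 1
    db.commit()
    return count


def search(db: sqlite3.Connection, term: str) -> List[str]:
    """Step 4: return the names of documents whose text contains `term`."""
    rows = db.execute(
        "SELECT name FROM documents WHERE content LIKE ? ORDER BY name",
        (f"%{term}%",),
    )
    return [name for (name,) in rows]
```

Passing the OCR step in as a function keeps the sketch runnable without Tesseract installed; any callable that maps a file path to text will do.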

Here's what you get:

The before and after

Documentation

It's all available on ReadTheDocs.

Requirements

This is really just a simple, shiny, user-friendly wrapper around some very powerful tools.

  • ImageMagick converts the images between colour and greyscale.
  • Tesseract does the character recognition.
  • Unpaper despeckles and deskews the scanned image.
  • GNU Privacy Guard is used as the encryption backend.
  • Python 3 is the language of the project.
    • Pillow loads the image data as a Python object to be used with PyOCR.
    • PyOCR is a slick programmatic wrapper around Tesseract.
    • Django is the framework this project is written against.
    • Python-GNUPG decrypts the PDFs on-the-fly to allow you to download unencrypted files, leaving the encrypted ones on-disk.
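Since the external tools above must be installed separately, a quick check along the following lines can confirm they are on the `PATH`. This is a sketch, not part of Paperless: `missing_tools` is a hypothetical helper, and the executable names are assumptions about a typical install (ImageMagick may be `magick` rather than `convert`, and GnuPG may be `gpg2`).

```python
import shutil

# Assumed executable names for ImageMagick, Tesseract, Unpaper and GnuPG;
# adjust these to match your system.
REQUIRED_TOOLS = ("convert", "tesseract", "unpaper", "gpg")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```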

Stability

This project has been around since 2015 and lots of people use it. However, it's still under active development (just look at the git commit history), so don't expect it to be 100% stable. To be on the safe side, back up the sqlite3 database, the media directory and your configuration file.
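A backup of those three pieces might look like the sketch below. The function name and every path passed to it are hypothetical; where your database, media directory and configuration file live depends on your installation. It relies on `sqlite3.Connection.backup` (Python 3.7+) and `shutil.copytree(..., dirs_exist_ok=True)` (Python 3.8+).

```python
import shutil
import sqlite3
from pathlib import Path


def backup_paperless(db_file: Path, media_dir: Path, conf_file: Path, dest: Path) -> None:
    """Copy the database, media directory and configuration file into `dest`."""
    dest.mkdir(parents=True, exist_ok=True)

    # Use SQLite's online backup API instead of copying the file, so a
    # database that is open elsewhere is still copied consistently.
    source = sqlite3.connect(str(db_file))
    target = sqlite3.connect(str(dest / db_file.name))
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()

    # The media directory and configuration file are plain file copies.
    shutil.copytree(media_dir, dest / media_dir.name, dirs_exist_ok=True)
    shutil.copy2(conf_file, dest / conf_file.name)
```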

Affiliated Projects

Paperless has been around a while now, and people are starting to build stuff on top of it. If you're one of those people, we can add your project to this list:

Similar Projects

There's another project out there called Mayan EDMS that has a surprising amount of technical overlap with Paperless. Also based on Django and using a consumer model with Tesseract and Unpaper, Mayan EDMS is much more featureful and comes with a slick UI as well, but it is still written in Python 2. It may be that Paperless consumes fewer resources, but to be honest, this is just a guess as I haven't tested this myself. One thing's for certain though: Paperless is a way better name.

Important Note

Document scanners are typically used to scan sensitive documents. Things like your social insurance number, tax records, invoices, etc. While Paperless encrypts the original files via the consumption script, the OCR'd text is not encrypted and is therefore stored in the clear (it needs to be searchable, so if someone has ideas on how to do that on encrypted data, I'm all ears). This means that Paperless should never be run on an untrusted host. Instead, I recommend that if you do want to use it, run it locally on a server in your own home.

Donations

As with all Free software, the power is less in the finances and more in the collective efforts. I really appreciate every pull request and bug report offered up by Paperless's users, so please keep that stuff coming. If, however, you're not one for coding/design/documentation and would like to contribute financially, I won't say no ;-)

The thing is, I'm doing ok for money, so I would instead ask you to donate to the United Nations High Commissioner for Refugees. They're doing important work and they need the money a lot more than I do.

paperless open issues
  • about 3 years Files not transferred to PAPERLESS_MEDIADIR directory
  • about 3 years Find the first date in the OCRed text
  • about 3 years Problem with images-only pages
  • about 3 years Dockerfile not working
  • about 3 years RFE: Auto re-tag when adding tags
  • about 3 years Running Paperless on Synology Diskstation 212+
  • about 3 years webserver cpu load in idle
  • over 3 years Re-Tagger for Correspondents!?
  • over 3 years Search is too rigid: allow accent-insensitive search queries
  • over 3 years Thumbnail creation breaks if CONVERT_BINARY doesn't accept -alpha
  • over 3 years Can't access original file
  • over 3 years Add a test for required binaries at start-time
  • over 3 years It's not an issue, just a question in mind
  • over 3 years Integration into owncloud
  • over 3 years Integrate SANE into Web frontend
  • over 3 years Add automatic HTTPS
  • over 3 years Better text layout on OCR scraped information
  • over 3 years A Better Front-End
  • over 3 years Document Categories
  • over 3 years Make this a proper django package
  • over 3 years Continuous Integration for Docker & Vagrant builds
  • over 3 years Awesome awesome project!
  • over 3 years Hybridise PDFs with combined OCR'd text
paperless open pull requests
  • Add Dockerfile for application and documentation
  • Docker
  • Refactor file info extraction
  • django-filter 0.12.0 no longer available
  • Updating Django admin event log on document_consumption_finished
  • little changes to reflect as much as possible
  • Add manager command to re-tag documents without correspondent
  • adapted Dockerfile for alpine image
  • WIP: Allow encryption to be disabled
  • WiP: Extends the regex to find dates in documents as reported by @isaacsando
  • Handle Django migrations at startup
  • Disable login
  • WIP: Configuration cli argument for document_consumer
  • WIP : New imported documents list
paperless latest release notes
1.3.0
  • You can now run Paperless without a login, though you'll still have to create at least one user. This is thanks to a pull-request from @matthewmoto: #295. Note that logins are still required by default, and that you need to disable them by setting PAPERLESS_DISABLE_LOGIN="true" in your environment or in /etc/paperless.conf.
  • Fix for #303 where sketchily-formatted documents could cause the consumer to break and insert half-records into the database breaking all sorts of things. We now capture the return codes of both convert and unpaper and fail-out nicely.
  • Fix for additional date types thanks to input from @isaacsando and code from @BastianPoe (#301).
  • Fix for running migrations in the Docker container (#299). Thanks to @TeraHz for the fix (#300) and to @pitkley for the review.
  • Fix for Docker cases where the issuing user is not UID 1000. This was a collaborative fix between @ChromoX and @pitkley in #311 and #312 to fix #306.
  • Patch the historical migrations to support MySQL's um, interesting way of handling indexes (#308). Thanks to @skuzzle for reporting the problem and helping me find where to fix it.
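The return-code handling described in the fix for #303 can be illustrated with a small sketch. This is not the code Paperless ships: `run_tool` and `ConsumerError` are hypothetical names, and the point is only that a failing external command should raise before any record is written.

```python
import subprocess
import sys


class ConsumerError(Exception):
    """Raised when an external tool fails, before anything is stored."""


def run_tool(argv):
    """Run an external command and return its stdout, or raise on failure."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        # Stop here so the caller never inserts a half-finished record.
        raise ConsumerError(
            "{} exited with status {}".format(argv[0], result.returncode)
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a real convert or unpaper call: a command that fails.
    try:
        run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
    except ConsumerError as error:
        print("Skipping document:", error)
```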
1.2.0
  • New Docker image, now based on Alpine, thanks to the efforts of @addadi and @Pit. This new image is dramatically smaller than the Debian-based one, and it also has a new home on Docker Hub. A proper thank-you to @Pit for hosting the image on his Docker account all this time, but after some discussion, we decided the image needed a more official-looking home.
  • @BastianPoe has added the long-awaited feature to automatically skip the OCR step when the PDF already contains text. This can be overridden by setting PAPERLESS_OCR_ALWAYS=YES either in your paperless.conf or in the environment. Note that this also means that Paperless now requires libpoppler-cpp-dev to be installed. Important: You'll need to run pip install -r requirements.txt after the usual git pull to properly update.
  • @BastianPoe has also contributed a monumental amount of work (#291) to solving #158: setting the document creation date based on finding a date in the document text.
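To give a feel for the date detection mentioned above, here is a minimal sketch that returns the first valid date in a block of OCR'd text. It is not the regex contributed in #291: the pattern, the `first_date` name, and the day-first interpretation of numeric dates are all assumptions made for this example.

```python
import re
from datetime import date

# Assumed format: day, month and four-digit year separated by ".", "/" or "-".
DATE_PATTERN = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def first_date(text):
    """Return the first valid day-first date in `text`, or None if there is none."""
    for match in DATE_PATTERN.finditer(text):
        day, month, year = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # Looks like a date but is not one (for example 99/99/2017).
            continue
    return None
```

Skipping matches that fail validation matters for OCR output, where misread digits often produce strings that look like dates but are not.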
1.1.0
  • Fix for #283, a redirect bug which broke interactions with paperless-desktop. Thanks to @chris-aeviator for reporting it.
  • Addition of an optional new financial year filter, courtesy of @ddddavidmartin (#256)
  • Fixed a typo in how thumbnails were named in exports (#285), courtesy of @pzl