portia

Visual scraping for Scrapy

Statistics on portia

Number of watchers on GitHub: 5891
Number of open issues: 75
Average time to close an issue: 2 days
Main language: JavaScript
Average time to merge a PR: 1 day
Open pull requests: 63+
Closed pull requests: 16+
Last commit: over 1 year ago
Repo created: over 5 years ago
Repo last updated: over 1 year ago
Size: 24.8 MB
Organization / Author: scrapinghub
Contributors: 27

Portia

Portia is a tool for scraping websites visually, with no programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will learn from these annotations how to scrape data from similar pages.

Try it out

To try Portia for free without installing anything, sign up for an account at Scrapinghub and use our hosted version.

Running Portia

The easiest way to run Portia is with Docker:

docker run -v ~/portia_projects:/app/data/projects:rw -p 9001:9001 scrapinghub/portia

For more detailed instructions, and alternatives to using Docker, see the Installation docs.
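
If you want to change the host directory or the host port, the Docker command above can be generated rather than typed by hand. The following is a minimal sketch, not part of Portia itself: the function name `build_portia_command` and its parameters are hypothetical, and it assumes the container path `/app/data/projects` and container port `9001` exactly as they appear in the command above. It only prints the command so you can review it before running it.

```python
import os
import shlex


def build_portia_command(projects_dir="~/portia_projects", host_port=9001):
    """Return the `docker run` argument list matching the command shown above.

    `projects_dir` and `host_port` are illustrative parameters (assumptions);
    the container-side path and port are copied from the README command.
    """
    # Expand "~" so the volume mount receives an absolute host path.
    host_dir = os.path.abspath(os.path.expanduser(projects_dir))
    return [
        "docker", "run",
        "-v", f"{host_dir}:/app/data/projects:rw",
        "-p", f"{host_port}:9001",
        "scrapinghub/portia",
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be reviewed first.
    print(shlex.join(build_portia_command()))
```

Changing only the host side of `-p` (for example `host_port=8080`) leaves the container listening on 9001 while exposing Portia on a different local port.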

Documentation

Documentation can be found here. Source files can be found in the docs directory.

portia open issues
  • over 2 years Python 3?
  • over 2 years how to extract the url you are scraping?
  • over 2 years SSH error using Vagrant process
  • over 2 years what's the structural thought of portia's server-side?
  • almost 3 years How to use perform login of portia?
  • almost 3 years Incomplete instructions?
  • almost 3 years http://slybot.readthedocs.org not found
  • almost 3 years error unexpected error
  • almost 3 years how to use proxy for slyd
  • almost 3 years Switching to old UI
  • almost 3 years UI for creating a project in nui-develop
  • almost 3 years how does portia deal with high concurrency?
  • almost 3 years Unable to start Portia UI from the nui-develop branch
  • almost 3 years Can portia export scrapy python code
  • almost 3 years Portia (Vagrant) can`t open page behind a proxy
  • almost 3 years Debian VPS Install
  • almost 3 years I get error text centos6.5 python2.7.9 scrapy1.1
  • almost 3 years exceptions.AttributeError: 'FerryServerProtocol' object has no attribute 'autoPingPendingCall'
  • almost 3 years Portia can't perform if adding javascript to manipulate html
  • about 3 years "docker run -i -t --rm -v <PROJECT_FOLDER>/data:/app/slyd/data:rw -p 9001:9001 --name portia portia"
  • about 3 years Splash integration
  • about 3 years `portiacrawl` and Javascript
  • about 3 years Docker build failed on Ubuntu 14.04
  • about 3 years Error in Slybot tests
  • about 3 years Sharing projects between several portia instances
  • about 3 years UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0
  • about 3 years Start slyd issue - twistd -n slyd
  • about 3 years Portia has no support for Basic Auth
  • about 3 years Portiacrawl command issue
  • over 3 years Ember.js version in Portia
portia open pull requests
  • Update installation.rst
  • Docker instructions
  • added video and image to readme
  • login before fetch
  • Mining listing data
  • Fix JShint errors
  • Visual selection of followed links
  • vagrant compatible
  • [DO NOT MERGE] support for unicode spider names
  • Recommend Ubuntu Trusty
  • Clean deferred handling in api response
  • permissions are wrong when starting slyd and nginx from vagrant
  • Corrected related bug and added multiple containers on the same tag
  • PORTIA-517 Add modernizr
  • Add django style storage backend
  • More fixes for switching schemas in a sample
  • Fix HTML entities escaping for resources
  • [WIP] Add download code feature
  • [WIP] PORTIA-493: ORM for slybot data
  • Dev2
  • Add optinonal page_clustering to slybot
  • PORTIA-539 [WIP] Start URLs for Slybot
  • [WIP] Save and apply cookies for spider
  • Add samples to train page clustering on subsequest crawls
  • Load JS or raw HTML depending on JS enabled status for sample
  • PORTIA-529: Generated Start URLs
  • [WIP] Repeated fields support UI
  • [WIP] Improve repeated items for slybot
  • [WIP] PORTIA-494: JSON API endpoints for ORM
  • PORTIA-415: Queue model updates
  • PORTIA-472: Explicit selector recomputation on user input
  • [WIP] Schedule a spider to run from portia.
  • Update portiaui build
  • PORTIA-551 JobQ integration with Portia
  • Portia-599 Start Urls
  • Indentation within Portia's left panel
  • Normalize Start Urls with ORM
  • [WIP] Explicitly handle CSS and XPath annotations in Slybot
  • Update migration SQL to avoid unnecessary additional copies
  • needed in order to have internet inside vm (from archlinux)
  • PORTIA-633: Extraction Loading Spinner
  • Reload websocket spider with each message that extracts data
  • Handle nested items
  • [WIP] Add python 3 support
  • YCB-487: Generating scrapely JSON file from Portia
  • PORTIA-689: Explicit generated urls
  • Compile sample using rendered or original html depending on js status
  • Move html to new file in portia server to reduce sample file size
  • PORTIA-703: Unexpected double click error
  • PORTIA-701 PORTIA-702 Browser back button loading
  • Update extract endpoint to allow multiple results per request.
  • Improve how spiders are loaded from the filesystem
  • Do not replace named html entities in urls
  • Add input box to create project
  • PORTIA-660: Copy + Download
  • [WIP] Port portia to python3.
  • Add support for Portia to use newer ember-data
  • [WIP] Update to Ember 2.10 and Glimmer 2
  • Add `DropMetaPipeline` to remove meta fields from output
  • Slybot use latest libraries
  • Add git fs storage
  • fix the default spider loader class.
  • Use master as git default user
portia questions on Stack Overflow
  • Portia interface is not working
  • Error while running portia on docker in ubuntu
  • Storing portia extraced item in MySQL database
  • Error while trying to store data obtained from portia in mongoDB
  • Portia installation getting errors when trying to start slyd
  • Portia Spider logs showing ['Partial'] during crawling
  • How to annotate same text for different fields in Portia?
  • How to run Scrapy/Portia on Azure Web App
  • How to get the `keywords` from html use `portia`
  • How do I get the least articles of a website use portia
  • How fields are storing in list in Portia crawl?
  • extract Meta tags from website using portia (scrapy)
  • How to get URL before the page is being scraped in Portia?
  • How Portia get the URL by default while deploying spider?
  • unable to deploy portia project using scrapyd-deploy due to 'No module found ..'
  • How to get URL from Crawled instead of Scraped from in Portia spider deployment?
  • Unable to load .doc link in portia?
  • ImportError: No module named urllib.request when installing portia
  • How to add cookies in Portia
  • Is it possible to support JS by Portia by using splash?
  • Unable to deploy Portia project to scrapyd due to not found /tmp file
  • Spider middleware in Portia not called
  • portia No Such Resource File not found
  • Unable to Deploy the portia spider in centos7 using scrapyd deploy
  • Is portia simple a scrapy with graphic?
  • Attempting to install Portia on OSX or Ubuntu
  • How to install portia, a python application from Github (Mac)
  • how to scrawl similar elements by portia(python)
  • How to add default field names in Portia scrapy drop down list?
  • How to use regex in Portia visual scrapy?