Open source projects by scrapinghub

scrapinghub/portia

Visual scraping for Scrapy

☕JavaScript   ★5891 stars   ⚠75 open issues   ⚭27 contributors   ☯over 4 years old  

scrapinghub/splash

Lightweight, scriptable browser as a service with an HTTP API

☕Python   ★1734 stars   ⚠189 open issues   ⚭22 contributors   ☯over 5 years old  

scrapinghub/extruct

Extract embedded metadata from HTML markup

☕HTML   ★205 stars   ⚠12 open issues   ⚭3 contributors   ☯about 3 years old  

scrapinghub/scrapyrt

Scrapy realtime

☕Python   ★336 stars   ⚠17 open issues   ⚭4 contributors   ☯almost 4 years old  

scrapinghub/frontera

A scalable frontier for web crawlers

☕Python   ★632 stars   ⚠78 open issues   ⚭16 contributors   ☯about 4 years old  

scrapinghub/scrapy-splash

Scrapy+Splash for JavaScript integration

☕Python   ★513 stars   ⚠15 open issues   ⚭5 contributors   ☯over 5 years old  

scrapinghub/dateparser

Python parser for human-readable dates

☕Python   ★845 stars   ⚠77 open issues   ⚭23 contributors   ☯about 4 years old  

scrapinghub/testspiders

Useful test spiders for Scrapy

☕Python   ★132 stars   ⚠2 open issues   ⚭5 contributors   ☯over 7 years old  

scrapinghub/skinfer

Skinfer is a tool for inferring and merging JSON schemas

☕Python   ★68 stars   ⚠7 open issues   ⚭2 contributors   ☯almost 4 years old  

scrapinghub/aduana

Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even when making big crawls (one billion pages).

☕C   ★43 stars   ⚠10 open issues   ⚭4 contributors   ☯over 3 years old  

scrapinghub/webstruct

NER toolkit for HTML data

☕HTML   ★157 stars   ⚠16 open issues   ⚭6 contributors   ☯over 5 years old  

scrapinghub/adblockparser

Python parser for Adblock Plus filters

☕Python   ★95 stars   ⚠4 open issues   ⚭2 contributors   ☯almost 5 years old  

scrapinghub/python-simhash

An efficient simhash implementation for Python

☕C   ★42 stars   ⚠1 open issue   ⚭2 contributors   ☯over 4 years old

scrapinghub/webpager

Paginating the web

☕C   ★29 stars   ⚠1 open issue   ⚭2 contributors   ☯over 5 years old

scrapinghub/pydepta

A Python implementation of DEPTA

☕C   ★61 stars   ⚠3 open issues   ⚭3 contributors   ☯over 5 years old  

scrapinghub/mdr

A Python library to detect and extract listing data from HTML pages.

☕C   ★63 stars   ⚠6 open issues   ⚭2 contributors   ☯over 4 years old  

scrapinghub/aile

Automatic Item List Extraction

☕HTML   ★55 stars   ⚠6 open issues   ⚭1 contributor   ☯over 3 years old

scrapinghub/shub

Scrapinghub Command Line Client

☕Python   ★61 stars   ⚠33 open issues   ⚭24 contributors   ☯over 4 years old