
x-ray

The next web scraper. See through the <html> noise.



Statistics on x-ray

Number of watchers on GitHub 4238
Number of open issues 144
Average time to close an issue 7 days
Main language JavaScript
Average time to merge a PR 18 days
Open pull requests 32+
Closed pull requests 7+
Last commit over 1 year ago
Repo Created over 4 years ago
Repo Last Updated over 1 year ago
Size 1.27 MB
Organization / Author matthewmueller
Latest Release 2.1.0
Contributors 29


var Xray = require('x-ray');
var x = Xray();

x('https://blog.ycombinator.com/', '.post', [{
  title: 'h1 a',
  link: '.article-title@href'
}])
  .paginate('.nav-previous a@href')
  .limit(3)
  .write('results.json')

Installation

npm install x-ray


Features

  • Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.

  • Composable: The API is entirely composable, giving you great flexibility in how you scrape each page.

  • Pagination support: Paginate through websites, scraping each page. X-ray also supports a request delay and a pagination limit. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped.

  • Crawler support: Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.

  • Responsible: X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.

  • Pluggable drivers: Swap in different scrapers depending on your needs. Currently supports HTTP and PhantomJS drivers. In the future, I'd like to see a Tor driver for requesting pages through the Tor network.
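
To make the breadth-first flow mentioned under "Crawler support" concrete, here is a minimal sketch that does not use x-ray at all. The site object and the crawlOrder function are hypothetical stand-ins: site plays the role of a website, and limit mimics a pagination limit. It only shows the visiting order a breadth-first crawl produces, not how x-ray implements it.

```javascript
// In-memory stand-in for a website: each page lists the pages it links to.
// (Hypothetical data, used only to show the visiting order.)
var site = {
  '/': ['/a', '/b'],
  '/a': ['/a1'],
  '/b': ['/b1'],
  '/a1': [],
  '/b1': []
};

// Visit pages breadth-first, starting at `start`, stopping after `limit` pages.
function crawlOrder(start, limit) {
  var queue = [start];
  var visited = new Set([start]);
  var order = [];
  while (queue.length > 0 && order.length < limit) {
    var page = queue.shift(); // first in, first out gives breadth-first order
    order.push(page);
    (site[page] || []).forEach(function (link) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push(link);
      }
    });
  }
  return order;
}

console.log(crawlOrder('/', 10)); // [ '/', '/a', '/b', '/a1', '/b1' ]
console.log(crawlOrder('/', 3));  // [ '/', '/a', '/b' ]
```

Both pages linked from the start page are visited before any page one level deeper, which is what makes the order predictable.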

Selector API

xray(url, selector)(fn)

Scrape the url for the following selector, returning an object in the callback fn. The selector takes an enhanced jQuery-like string that is also able to select on attributes. The syntax for selecting on attributes is selector@attribute. If you do not supply an attribute, the default is selecting the innerText.

Here are a few examples:

  • Scrape a single tag
xray('http://google.com', 'title')(function(err, title) {
  console.log(title) // Google
})
  • Scrape a single class
xray('http://reddit.com', '.content')(fn)
  • Scrape an attribute
xray('http://techcrunch.com', 'img.logo@src')(fn)
  • Scrape innerHTML
xray('http://news.ycombinator.com', 'body@html')(fn)
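
The selector@attribute syntax used in these examples can be pictured with a small sketch. parseSelector is a hypothetical helper, not x-ray's actual parser. It assumes the last @ in the string separates the CSS selector from the attribute name, and it uses null to mean "no attribute given, use the element's text".

```javascript
// Hypothetical helper, not part of x-ray: split "selector@attribute"
// into its two parts. Assumes the last "@" is the separator.
function parseSelector(input) {
  var at = input.lastIndexOf('@');
  if (at === -1) {
    // No attribute given: null stands for the default (the element's text).
    return { selector: input.trim(), attribute: null };
  }
  return {
    selector: input.slice(0, at).trim(),
    attribute: input.slice(at + 1).trim()
  };
}

console.log(parseSelector('title'));        // { selector: 'title', attribute: null }
console.log(parseSelector('img.logo@src')); // { selector: 'img.logo', attribute: 'src' }
console.log(parseSelector('body@html'));    // { selector: 'body', attribute: 'html' }
```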

xray(url, scope, selector)

You can also supply a scope to each selector. In jQuery, this would look something like this: $(scope).find(selector).

xray(html, scope, selector)

Instead of a url, you can also supply raw HTML and all the same semantics apply.

var html = "<body><h2>Pear</h2></body>";
x(html, 'body', 'h2')(function(err, header) {
  header // => Pear
})

API

xray.driver(driver)

Specify a driver to make requests through. Available drivers include:

  • request - A simple driver built around request. Use this to set headers, cookies or http methods.
  • phantom - A high-level browser automation library. Use this to render pages, when elements need to be interacted with, or when elements are created dynamically using JavaScript (e.g. Ajax calls).

xray.stream()

Returns a Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:

var app = require('express')();
var x = require('x-ray')();

app.get('/', function(req, res) {
  var stream = x('http://google.com', 'title').stream();
  stream.pipe(res);
})

xray.write([path])

Stream the results to a path.

If no path is provided, then the behavior is the same as .stream().

xray.then(cb)

Constructs a Promise object and invokes its then function with a callback cb. Be sure to invoke then() as the last step of xray method chaining, since the other methods are not promisified.

x('https://dribbble.com', 'li.group', [{
  title: '.dribbble-img strong',
  image: '.dribbble-img [data-src]@data-src',
}])
  .paginate('.next_page@href')
  .limit(3)
  .then(function (res) {
    console.log(res[0]) // prints first result
  })
  .catch(function (err) {
    console.log(err) // handle error in promise
  })

xray.paginate(selector)

Select a url from a selector and visit that page.

xray.limit(n)

Limit the amount of pagination to n requests.

xray.abort(validator)

Abort pagination if validator function returns true. The validator function receives two arguments:

  • result: The scrape result object for the current page.
  • nextUrl: The URL of the next page to scrape.
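
As an illustration, a validator might look like the sketch below. shouldAbort and the seen set are hypothetical names, and the sketch assumes the result for each page is an array of items (as with a [{...}] selector). It stops pagination when a page yields no items or when the next URL has already been offered once.

```javascript
// Hypothetical validator for xray.abort(): stop paginating when the
// current page produced no items or the next URL was already seen.
var seen = new Set();

function shouldAbort(result, nextUrl) {
  var isEmpty = Array.isArray(result) && result.length === 0;
  var isRepeat = seen.has(nextUrl);
  seen.add(nextUrl);
  return isEmpty || isRepeat;
}

// Called directly here to show the return values:
console.log(shouldAbort([{ title: 'a' }], 'http://example.com/page/2')); // false
console.log(shouldAbort([{ title: 'b' }], 'http://example.com/page/2')); // true (repeated URL)
console.log(shouldAbort([], 'http://example.com/page/3'));               // true (empty page)
```

In a scrape, the function would be passed as .abort(shouldAbort) in the same chain as .paginate(selector).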

xray.delay(from, [to])

Delay the next request between from and to milliseconds. If only from is specified, delay exactly from milliseconds.
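
The two forms can be sketched as follows. pickDelay is a hypothetical helper, not x-ray's code, and it assumes the delay between from and to is chosen uniformly at random, which the description above does not state.

```javascript
// Hypothetical helper, not x-ray's code: choose a delay in milliseconds.
// With only `from`, the delay is exactly `from`. With both, it is a
// random integer between `from` and `to` inclusive (an assumption).
function pickDelay(from, to) {
  if (to === undefined) return from;
  return from + Math.floor(Math.random() * (to - from + 1));
}

console.log(pickDelay(500));       // 500
console.log(pickDelay(500, 1000)); // some value from 500 to 1000
```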

xray.concurrency(n)

Set the request concurrency to n. Defaults to Infinity.

xray.throttle(n, ms)

Throttle the requests to n requests per ms milliseconds.
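
One way to picture "n requests per ms milliseconds" is the scheduling sketch below. makeThrottle is hypothetical and is not how x-ray implements throttling. It takes the current time as an argument so the behaviour is easy to follow, and it returns how long each request would have to wait.

```javascript
// Hypothetical sketch, not x-ray's implementation: allow at most `n`
// request starts in any window of `ms` milliseconds.
function makeThrottle(n, ms) {
  var starts = []; // scheduled start times of earlier requests
  return function schedule(now) {
    var start = now;
    if (starts.length >= n) {
      // The request n places back must have started at least `ms` earlier.
      start = Math.max(now, starts[starts.length - n] + ms);
    }
    starts.push(start);
    return start - now; // how long this request should wait
  };
}

var schedule = makeThrottle(2, 1000); // 2 requests per 1000 ms
console.log(schedule(0));   // 0
console.log(schedule(0));   // 0
console.log(schedule(0));   // 1000
console.log(schedule(100)); // 900
```

The third and fourth requests are pushed into the next 1000 ms window because the first window already holds two.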

xray.timeout(ms)

Specify a timeout of ms milliseconds for each request.

Collections

X-ray also has support for selecting collections of tags. While x('ul', 'li') will only select the first list item in an unordered list, x('ul', ['li']) will select all of them.

Additionally, X-ray supports collections of collections allowing you to smartly select all list items in all lists with a command like this: x(['ul'], ['li']).

Composition

X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:

Crawling to another site

var Xray = require('x-ray');
var x = Xray();

x('http://google.com', {
  main: 'title',
  image: x('#gbar a@href', 'title'), // follow link to google images
})(function(err, obj) {
/*
  {
    main: 'Google',
    image: 'Google Images'
  }
*/
})

Scoping a selection

var Xray = require('x-ray');
var x = Xray();

x('http://mat.io', {
  title: 'title',
  items: x('.item', [{
    title: '.item-content h2',
    description: '.item-content section'
  }])
})(function(err, obj) {
/*
  {
    title: 'mat.io',
    items: [
      {
        title: 'The 100 Best Children\'s Books of All Time',
        description: 'Relive your childhood with TIME\'s list...'
      }
    ]
  }
*/
})

Filters

Filters can be specified when creating a new Xray instance. To apply filters to a value, append them to the selector using |.

var Xray = require('x-ray');
var x = Xray({
  filters: {
    trim: function (value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function (value) {
      return typeof value === 'string' ? value.split('').reverse().join('') : value
    },
    slice: function (value, start , end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
});

x('http://mat.io', {
  title: 'title | trim | reverse | slice:0,2' // assuming the page title is 'mat.io'
})(function(err, obj) {
/*
  {
    title: 'oi'
  }
*/
})
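
To show how a filter chain transforms a value step by step, here is a sketch that runs without x-ray. applyFilters is a hypothetical function and is not x-ray's parser. It assumes filters are separated by |, that arguments follow a colon, and that arguments are comma-separated numbers. The filter functions are the same three as above.

```javascript
// Same filter functions as in the Xray({ filters: ... }) example.
var filters = {
  trim: function (value) {
    return typeof value === 'string' ? value.trim() : value;
  },
  reverse: function (value) {
    return typeof value === 'string' ? value.split('').reverse().join('') : value;
  },
  slice: function (value, start, end) {
    return typeof value === 'string' ? value.slice(start, end) : value;
  }
};

// Hypothetical sketch: apply the "| name:arg1,arg2" part of a selector
// string to a value that has already been extracted from the page.
function applyFilters(spec, value) {
  // Everything before the first "|" is the selector; the rest are filters.
  var parts = spec.split('|').map(function (s) { return s.trim(); });
  return parts.slice(1).reduce(function (current, part) {
    var pieces = part.split(':');
    var name = pieces[0].trim();
    // Arguments are assumed to be comma-separated numbers.
    var args = pieces[1] ? pieces[1].split(',').map(Number) : [];
    return filters[name].apply(null, [current].concat(args));
  }, value);
}

// '  Hello ' -> 'Hello' -> 'olleH' -> 'oll'
console.log(applyFilters('title | trim | reverse | slice:0,3', '  Hello ')); // oll
```

Filters run left to right, each receiving the previous filter's output as its first argument.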

Examples

In the Wild

  • Levered Returns: Uses x-ray to pull together financial data from various unstructured sources around the web.

Resources

  • Video: https://egghead.io/lessons/node-js-intro-to-web-scraping-with-node-and-x-ray

Backers

Support us with a monthly donation and help us continue our activities. [Become a backer]

Sponsors

Become a sponsor and get your logo on our website and on our README on GitHub with a link to your site. [Become a sponsor]

License

MIT

x-ray latest release notes
2.1.0 A New Hope Edition

Hello, my name is @Kikobeats and I'm a new maintainer of the x-ray project. Nice to meet you!

I'm happy to say that this version has a lot of changes intended to make it easier to contribute to the project.

General

  • Added Travis and Coveralls integration to make sure that PRs don't break the code.
  • Refactored a LOT. Extracted messy logic to hide complexity and keep the API clean to use.
  • Moved phantom tests into x-ray-phantom
  • Moved CLI into a separate project (still in progress).
  • Prioritized issues using labels.

API

  • Added a .stream method that behaves the same as .write without parameters (more semantic).

What's next

  • Better stream support.
  • Better interaction with request object (like setup headers/cookies).