grabber

command module

Version: v0.0.0-...-c2a4849 | Published: Jun 30, 2014 | License: BSD-2-Clause | Imports: 19 | Imported by: 0

Repository

github.com/drbig/grabber

README

grabber

Grabber is a concurrent declarative web scraper and downloader.

Features:

Simple tree-like JSON configuration
XPath and Regexp extractors
Parallel parsing and extraction
Parallel download
Ability to bail out early (e.g. for updating)
Fails fast on config errors, tolerates web errors
Follow, every, and single extraction modes
Multiple XPaths or Regexps per stage
Multi-grouped regexps with a separator (e.g. extract to CSV)
It's rather fast

Run grabber -h to see command-line options.

See the examples/ directory and consult the code to learn the format of the config files.

Note that for tumblr.json you'll need to replace all occurrences of {{name}} with a proper account (subdomain) name, and all occurrences of {{paging}} with the text (as matched by XPath's text() function) of whatever your target blog uses for 'next page' (or its semantic equivalent). The format is already template-friendly, so you can easily write a script that generates per-blog config files.
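The placeholder substitution described above could be scripted roughly as follows. This is only a minimal sketch: the function name fillTemplate, the blog name and the paging text are hypothetical, and the template string is a stand-in rather than the real contents of tumblr.json. In practice you would read the example file from disk and write the result out as your per-blog config.

```go
package main

import (
	"fmt"
	"strings"
)

// fillTemplate replaces every {{name}} and {{paging}} placeholder in tmpl.
// The helper name is made up for this example; it is not part of grabber.
func fillTemplate(tmpl, name, paging string) string {
	r := strings.NewReplacer(
		"{{name}}", name,
		"{{paging}}", paging,
	)
	return r.Replace(tmpl)
}

func main() {
	// Stand-in text only; the real tumblr.json has its own structure.
	tmpl := "http://{{name}}.tumblr.com/ next-page link text: {{paging}}"

	// "someblog" and "Older" are assumed values for illustration.
	fmt.Println(fillTemplate(tmpl, "someblog", "Older"))
}
```

strings.NewReplacer handles both placeholders in a single pass, so adding further placeholders later only means adding more pairs to the argument list.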

The examples provided are certainly not exhaustive.

Advice:

Remember you can build your config iteratively by using the log command, so that you make sure the current level works as it should before going further.

When downloading:

For the first run, set bail to 0 and use the options -quiet -stdout. You may also wish to pipe the output of the run to tee log. Then inspect the output or log file for any errors. If it looks OK, set bail to something reasonable, e.g. if you have 10 assets per page, set it to 20.

Todo / Bugs

Needs testing 'in the wild'
Better documentation
Ability to use Content-Disposition
Full config parsing and error checking during load
Test suite

Copyright

Absolutely no warranty. See LICENSE.txt for details.
