creep

Version: v0.0.0-...-8f3739f | Published: Jan 23, 2014 | License: GPL-2.0 | Imports: 10 | Imported by: 0

Repository

github.com/rickys/creep

README ¶

Package creep crawls the web. It is used by the main program, crawl.

To install:
$ go get github.com/RickyS/Crawl
$ go get github.com/RickyS/Creep

You'll need both packages; they depend on each other. The main program is crawl. The working package is creep. Note the capital letters in the names passed to 'go get'.

The easiest introduction might be to run:
$ go test
This runs for 9 seconds on my system.

Package creep implements a web crawler. It reads web pages and follows links to the rest of the web, recursively, ad infinitum, within the limits provided. We use the term creep to avoid name clashes with other software called 'walk' and 'crawl'. I'm thinking of changing it to 'stroll'.

The goroutines in crawl.go listen on a request channel and scan the web page named in each message they receive. Each link to another web page found during the scan is enqueued onto the same request channel. Eventually, this or another goroutine will read that request and process it.
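The request-channel pattern described above can be sketched roughly as follows. This is not the code in crawl.go: the fakeWeb map stands in for real HTTP fetches, and the crawl function, its parameters, and its shutdown by closing the channel are all assumptions made for illustration (the package itself exports an ExitCommandUrl sentinel for telling goroutines to exit).

```go
package main

import (
	"fmt"
	"sync"
)

// RequestUrl mirrors the request type exported by package creep.
type RequestUrl struct {
	Url string
}

// fakeWeb is hypothetical data standing in for fetched pages:
// each key is a page, each value the links found on that page.
var fakeWeb = map[string][]string{
	"a": {"b", "c"},
	"b": {"c", "d"},
	"c": {"a"},
	"d": {},
}

// crawl starts `workers` goroutines that all read from one request channel.
// Each worker "scans" a page and enqueues every link not seen before,
// until maxUrls distinct pages have been recorded.
func crawl(start string, workers, maxUrls int) map[string]bool {
	if maxUrls < 1 {
		return map[string]bool{}
	}
	// At most maxUrls requests are ever enqueued, so with this buffer
	// a worker never blocks while sending to the channel it also reads.
	requests := make(chan RequestUrl, maxUrls)

	var (
		mu      sync.Mutex
		seen    = map[string]bool{start: true}
		pending sync.WaitGroup // requests enqueued but not yet processed
		done    sync.WaitGroup // running workers
	)

	pending.Add(1)
	requests <- RequestUrl{Url: start}

	for i := 0; i < workers; i++ {
		done.Add(1)
		go func() {
			defer done.Done()
			for req := range requests {
				for _, link := range fakeWeb[req.Url] {
					mu.Lock()
					if !seen[link] && len(seen) < maxUrls {
						seen[link] = true
						pending.Add(1)
						requests <- RequestUrl{Url: link}
					}
					mu.Unlock()
				}
				pending.Done()
			}
		}()
	}

	pending.Wait()  // every enqueued request has been processed
	close(requests) // lets the workers leave their range loops
	done.Wait()
	return seen
}

func main() {
	visited := crawl("a", 3, 10)
	fmt.Println(len(visited)) // prints 4
}
```

The buffered channel matters in this sketch: if workers both read from and write to an unbuffered channel, they can all block on a send at once and deadlock.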

The code in samedomain.go uses the package "github.com/joeguo/tldextract" to obtain the database that helps determine whether two different URLs belong to the same domain. This turns out to be less simple than it might seem.
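To see why the same-domain question needs a database, consider a naive check that compares only the last two labels of each host name. The sketch below uses only the standard library and is not what samedomain.go does; naiveDomain and naiveSameDomain are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// naiveDomain returns the last two dot-separated labels of the URL's host.
// This is deliberately simplistic: it knows nothing about suffixes
// such as "co.uk" that themselves contain a dot.
func naiveDomain(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	labels := strings.Split(strings.ToLower(u.Hostname()), ".")
	if len(labels) > 2 {
		labels = labels[len(labels)-2:]
	}
	return strings.Join(labels, "."), nil
}

// naiveSameDomain reports whether both URLs share the same naive domain.
func naiveSameDomain(a, b string) bool {
	da, errA := naiveDomain(a)
	db, errB := naiveDomain(b)
	return errA == nil && errB == nil && da != "" && da == db
}

func main() {
	// Correct: both hosts belong to example.com.
	fmt.Println(naiveSameDomain("http://www.example.com/a", "http://blog.example.com/b")) // prints true

	// Wrong: foo.co.uk and bar.co.uk are unrelated sites,
	// but the naive check reduces both to "co.uk".
	fmt.Println(naiveSameDomain("http://foo.co.uk/", "http://bar.co.uk/")) // prints true
}
```

The second result is a false positive, which is the kind of case a suffix database is meant to resolve.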

To prevent infinite regress, the program limits operation to the list of domains in the JSON file.

Parameters in the JSON file adjust these limitations. TBD.
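As a rough illustration of what such a JSON file might look like, the sketch below decodes a sample document into the JobData and JobDataArray types documented further down. The key names are an assumption: they simply follow the exported struct field names, which is how encoding/json matches keys when no struct tags are present. The sample values and the parseJobData helper are hypothetical and are not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// JobData and JobDataArray are copied from the package documentation.
type JobData struct {
	Testname      string
	Maxurls       int
	MaxGoRoutines int
	Gomaxprocs    int
	ExpectFail    bool
	JustOneDomain bool
	Urls          []string
}

type JobDataArray struct {
	Tests []JobData
}

// sampleJob is a made-up job file; the real file's contents may differ.
const sampleJob = `{
  "Tests": [
    {
      "Testname": "small",
      "Maxurls": 50,
      "MaxGoRoutines": 4,
      "Gomaxprocs": 2,
      "ExpectFail": false,
      "JustOneDomain": true,
      "Urls": ["http://example.com/"]
    }
  ]
}`

// parseJobData is a hypothetical helper that decodes job JSON from a string.
func parseJobData(text string) (*JobDataArray, error) {
	var jobs JobDataArray
	if err := json.Unmarshal([]byte(text), &jobs); err != nil {
		return nil, err
	}
	return &jobs, nil
}

func main() {
	jobs, err := parseJobData(sampleJob)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, job := range jobs.Tests {
		fmt.Println(job.Testname, job.Maxurls, job.Urls)
	}
}
```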

Documentation ¶

Overview ¶

Package creep implements a web crawler. It reads web pages and follows links to the rest of the web, recursively, ad infinitum, within the limits provided. We use the term creep to avoid name clashes with other software called 'walk' and 'crawl'. I'm thinking of changing it to 'stroll'.

Index ¶

Constants
func CreepWebSites(urls []string, maxPermittedUrls int, maxGoRo int, justOneDomain bool) <-chan *ResponseFromWeb
type JobData
type JobDataArray
- func LoadJobData(filename string) *JobDataArray
type RequestUrl
type ResponseFromWeb

Constants ¶

const ExitCommandUrl string = "ExitExitExitExit" // Fake Url that tells goroutine to exit.

Variables ¶

This section is empty.

Functions ¶

func CreepWebSites ¶

func CreepWebSites(urls []string, maxPermittedUrls int, maxGoRo int, justOneDomain bool) <-chan *ResponseFromWeb

Main external entry point for package creep. Do not run more than one call at a time, but you can give a single call a slice of URLs to process.

Types ¶

type JobData ¶

type JobData struct {
	Testname      string
	Maxurls       int
	MaxGoRoutines int
	Gomaxprocs    int
	ExpectFail    bool
	JustOneDomain bool
	Urls          []string
}

type JobDataArray ¶

type JobDataArray struct {
	Tests []JobData
}

var JobDescription JobDataArray

func LoadJobData ¶

func LoadJobData(filename string) *JobDataArray

type RequestUrl ¶

type RequestUrl struct {
	Url string
}

Element type for the channel of URL requests. At one time I thought each request would be more than a string.

type ResponseFromWeb ¶

type ResponseFromWeb struct {
	Url          string         // Original url
	HttpResponse *http.Response // Response from http.Get()
	Err          error          // Error from Get()
	ElapsedTime  time.Duration  // Time duration of Get()
}

Holds the result of the Get for a URL: the URL itself, the HTTP response, any error, and the elapsed time.
