Documentation ¶
Index ¶
- Constants
- Variables
- func FindRepositoriesByString(s string) (urls []string, err error)
- func GetBaseDir() string
- func MoveCompressFile(src, dst string, compressionType CompressionType, level int) (err error)
- func MustGlob(pattern string) []string
- func PrependSchema(s string) string
- func RandomEndpoint() string
- func Render(opts *RenderOpts) error
- func UserHomeDir() string
- type About
- type Client
- type CompressionType
- type Config
- type CopyHook
- type Description
- type DirLaster
- type Doer
- type GetRecord
- type HTTPError
- type Harvest
- type Harvester
- type Header
- type Identify
- type Interval
- type Laster
- type ListIdentifiers
- type ListMetadataFormats
- type ListRecords
- type ListSets
- type Metadata
- type MetadataFormat
- type MultiError
- type OAIError
- type RateLimitedReader
- type Record
- type RenderOpts
- type Repository
- type Request
- type RequestNode
- type Response
- type ResumptionToken
- type Set
- type Values
Constants ¶
const (
	// DefaultTimeout on requests.
	DefaultTimeout = 10 * time.Minute
	// DefaultMaxRetries is the default number of retries on a single request.
	DefaultMaxRetries = 8
)
const Day = 24 * time.Hour
Day has 24 hours.
const Version = "0.4.26"
Version of tools.
Variables ¶
var (
	// StdClient is the standard lib http client.
	StdClient = &Client{Doer: http.DefaultClient}
	// DefaultClient is the more resilient client, that will retry and timeout.
	DefaultClient = &Client{Doer: CreateDoer(DefaultTimeout, DefaultMaxRetries)}
	// DefaultUserAgent to identify crawler, some endpoints do not like the Go
	// default (https://golang.org/src/net/http/request.go#L462), e.g.
	// https://calhoun.nps.edu/oai/request.
	DefaultUserAgent = fmt.Sprintf("metha/%s", Version)
	// ControlCharReplacer helps to deal with broken XML: http://eprints.vu.edu.au/perl/oai2.
	// Add more weird things to be cleaned before XML parsing here. Another faulty:
	// http://digitalcommons.gardner-webb.edu/do/oai/?from=2016-02-29&metadataPrefix=oai_dc&until=2016-03-31&verb=ListRecords.
	// Replace control chars outside XML char range.
	ControlCharReplacer = strings.NewReplacer(
		"\u0000", "", "\u0001", "", "\u0002", "", "\u0003", "",
		"\u0004", "", "\u0005", "", "\u0006", "", "\u0007", "",
		"\u0008", "", "\u0009", "", "\u000B", "", "\u000C", "",
		"\u000E", "", "\u000F", "", "\u0010", "", "\u0011", "",
		"\u0012", "", "\u0013", "", "\u0014", "", "\u0015", "",
		"\u0016", "", "\u0017", "", "\u0018", "", "\u0019", "",
		"\u001A", "", "\u001B", "", "\u001C", "", "\u001D", "",
		"\u001E", "", "\u001F", "", "\uFFFD", "", "\uFFFE", "",
	)
)
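The comment on ControlCharReplacer describes cleaning a response before XML parsing. A minimal sketch of that idea follows, using only the standard library. The `replacer` variable holds just a few of the pairs listed above, and `title` and `cleanAndDecode` are hypothetical names for illustration, not part of the package.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// replacer is a shortened, illustrative stand-in for ControlCharReplacer.
var replacer = strings.NewReplacer("\u0000", "", "\u0001", "", "\u0008", "", "\u001F", "")

// title is a hypothetical decoding target; it accepts any root element.
type title struct {
	Text string `xml:",chardata"`
}

// cleanAndDecode drops the listed control characters, then decodes the XML.
func cleanAndDecode(raw string) (string, error) {
	var t title
	if err := xml.Unmarshal([]byte(replacer.Replace(raw)), &t); err != nil {
		return "", err
	}
	return t.Text, nil
}

func main() {
	// The backspace character (U+0008) is not a legal XML character.
	raw := "<title>Open\u0008 Access</title>"
	text, err := cleanAndDecode(raw)
	fmt.Println(text, err)
}
```

In the package itself, the CleanBeforeDecode switch on Config and Request presumably enables this kind of cleaning; that link is an assumption based on the field name.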
var (
	// BaseDir is where all data is stored.
	BaseDir = filepath.Join(UserHomeDir(), ".cache", "metha")
	// ErrAlreadySynced signals completion.
	ErrAlreadySynced = errors.New("already synced")
	// ErrInvalidEarliestDate for unparsable earliest date.
	ErrInvalidEarliestDate = errors.New("invalid earliest date")
)
var (
	ErrInvalidVerb      = errors.New("invalid OAI verb")
	ErrMissingVerb      = errors.New("missing verb")
	ErrCannotGenerateID = errors.New("cannot generate ID")
	ErrMissingURL       = errors.New("missing URL")
	ErrParameterMissing = errors.New("missing required parameter")
)
var EndpointList string
var Endpoints = splitNonEmpty(EndpointList, "\n")
Endpoints from https://git.io/fxvs0.
Functions ¶
func FindRepositoriesByString ¶ added in v0.1.29
func FindRepositoriesByString(s string) (urls []string, err error)
FindRepositoriesByString returns a list of already harvested base URLs given a fragment of the base URL.
func GetBaseDir ¶ added in v0.1.43
func GetBaseDir() string
GetBaseDir returns the base directory for the cache.
func MoveCompressFile ¶ added in v0.1.25
func MoveCompressFile(src, dst string, compressionType CompressionType, level int) (err error)
MoveCompressFile with compression type support
func PrependSchema ¶
func PrependSchema(s string) string
PrependSchema prepends http, if it is missing.
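As a rough illustration of what prepending a schema could look like, here is a small sketch. The function `prependSchema` is a hypothetical re-implementation; the exact prefix check used by the package is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// prependSchema is a hypothetical sketch: it leaves URLs that already carry
// an http or https schema untouched and prefixes everything else with http://.
func prependSchema(s string) string {
	if strings.HasPrefix(s, "http://") || strings.HasPrefix(s, "https://") {
		return s
	}
	return "http://" + s
}

func main() {
	fmt.Println(prependSchema("example.org/oai"))
	fmt.Println(prependSchema("https://example.org/oai"))
}
```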
func RandomEndpoint ¶ added in v0.1.27
func RandomEndpoint() string
RandomEndpoint returns a random endpoint url.
func Render ¶ added in v0.2.16
func Render(opts *RenderOpts) error
Types ¶
type About ¶
type About struct {
Body []byte `xml:",innerxml" json:"body,omitempty"`
}
About has additional record information.
type Client ¶
type Client struct {
Doer Doer
// contains filtered or unexported fields
}
Client can execute requests.
func CreateClient ¶
CreateClient creates a client with timeout and retry properties.
func CreateClientWithRateLimit ¶ added in v0.4.14
CreateClientWithRateLimit creates a client with timeout, retry properties, and rate limiting.
func (*Client) Do ¶
Do executes a single OAIRequest. ResumptionToken handling must happen in the caller. Only Identify and GetRecord requests will return a complete response.
func (*Client) GetRateLimit ¶ added in v0.4.14
GetRateLimit returns the current rate limit setting.
func (*Client) SetRateLimit ¶ added in v0.4.14
SetRateLimit sets the download rate limit in bytes per second. Set to 0 to disable rate limiting.
type CompressionType ¶ added in v0.4.2
type CompressionType int
const (
	CompZstd CompressionType = iota
	CompGzip
)
func DetectCompression ¶ added in v0.4.2
func DetectCompression(filename string, firstBytes []byte) CompressionType
DetectCompression detects the compression type from the file extension or the file content.
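Detection by extension or content can be sketched as below. The magic bytes are the standard gzip (1F 8B) and zstd frame (28 B5 2F FD) signatures. The file extensions checked, the order of the checks, and the extra boolean return value are assumptions, and all names here are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

type compressionType int

const (
	compZstd compressionType = iota
	compGzip
)

// detect is a hypothetical sketch: it looks at the extension first and falls
// back to the leading magic bytes. The bool reports whether anything matched.
func detect(filename string, firstBytes []byte) (compressionType, bool) {
	switch {
	case strings.HasSuffix(filename, ".zst"):
		return compZstd, true
	case strings.HasSuffix(filename, ".gz"):
		return compGzip, true
	case bytes.HasPrefix(firstBytes, []byte{0x28, 0xB5, 0x2F, 0xFD}):
		return compZstd, true
	case bytes.HasPrefix(firstBytes, []byte{0x1F, 0x8B}):
		return compGzip, true
	}
	return 0, false
}

func main() {
	ct, ok := detect("records.xml.gz", nil)
	fmt.Println(ct == compGzip, ok)
	ct, ok = detect("records.bin", []byte{0x28, 0xB5, 0x2F, 0xFD, 0x00})
	fmt.Println(ct == compZstd, ok)
}
```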
type Config ¶ added in v0.3.23
type Config struct {
BaseURL string
Format string
Set string
From string
Until string
MaxRequests int
DisableSelectiveHarvesting bool
CleanBeforeDecode bool
IgnoreHTTPErrors bool
MaxEmptyResponses int
SuppressFormatParameter bool
HourlyInterval bool
DailyInterval bool
ExtraHeaders http.Header
KeepTemporaryFiles bool
IgnoreUnexpectedEOF bool
Delay time.Duration
MaxRetries int // Maximum number of retry attempts
RetryDelay time.Duration // Delay between retries
RetryBackoff float64 // Multiplier for delay between retries (e.g., 2.0 for exponential backoff)
CompressionType CompressionType
CompressionLevel int // -5 to 22 for zstd
NoCompression bool
}
type CopyHook ¶ added in v0.1.38
CopyHook is a Logrus hook that copies messages to a writer.
func NewCopyHook ¶ added in v0.1.38
NewCopyHook initializes a copy hook. By default, it copies Warn, Error, Fatal and Panic level messages. Override these by passing in other logrus.Level values.
type Description ¶
type Description struct {
Body []byte `xml:",innerxml"`
}
Description holds information about a set.
func (Description) GoString ¶
func (desc Description) GoString() string
GoString is a formatter for Description content.
type DirLaster ¶
DirLaster extracts the maximum value from the files of a directory. The values are extracted per file via TransformFunc, which gets a filename and returns a token. The tokens are sorted and the lexicographically largest element is returned.
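The transform, sort and pick-largest steps can be sketched as follows. The function `last`, the filename layout and the transform are hypothetical; only the overall procedure is taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// last applies transform to every filename, sorts the resulting tokens and
// returns the lexicographically largest one, or "" if there are none.
func last(filenames []string, transform func(string) string) string {
	tokens := make([]string, 0, len(filenames))
	for _, fn := range filenames {
		if tok := transform(fn); tok != "" {
			tokens = append(tokens, tok)
		}
	}
	if len(tokens) == 0 {
		return ""
	}
	sort.Strings(tokens)
	return tokens[len(tokens)-1]
}

func main() {
	// Assumed layout: the filename starts with a 2006-01-02 date.
	datePrefix := func(fn string) string {
		if len(fn) < 10 {
			return ""
		}
		return fn[:10]
	}
	files := []string{"2016-03-31-00000001.xml.gz", "2016-01-31-00000001.xml.gz", "tmp"}
	fmt.Println(last(files, datePrefix))
}
```

Lexicographic order matches chronological order here only because the tokens are zero-padded ISO dates.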
type GetRecord ¶
type GetRecord struct {
Record Record `xml:"record,omitempty" json:"record,omitempty"`
}
GetRecord returns a single record.
type Harvest ¶
type Harvest struct {
Config *Config
Client *Client
// XXX: Lazy via sync.Once?
Identify *Identify
Started time.Time
// Protects the rare case, where we are in the process of renaming
// harvested files and get a termination signal at the same time.
sync.Mutex
}
Harvest contains parameters for mass-download. MaxRequests and CleanBeforeDecode are switches to handle broken token implementations and funny chars in responses. Some repos do not support selective harvesting (e.g. zvdd.org/oai2). Set "DisableSelectiveHarvesting" to try to grab metadata from these repositories. From and Until must always be given with 2006-01-02 layout. TODO(miku): make zero type work (lazily run identify).
func NewHarvest ¶
NewHarvest creates a new harvest. A network connection will be used for an initial Identify request.
type Header ¶
type Header struct {
Status string `xml:"status,attr" json:"status,omitempty"`
Identifier string `xml:"identifier,omitempty" json:"identifier,omitempty"`
DateStamp string `xml:"datestamp,omitempty" json:"datestamp,omitempty"`
SetSpec []string `xml:"setSpec,omitempty" json:"setSpec,omitempty"`
}
A Header is part of other requests.
type Identify ¶
type Identify struct {
RepositoryName string `xml:"repositoryName,omitempty" json:"repositoryName,omitempty"`
BaseURL string `xml:"baseURL,omitempty" json:"baseURL,omitempty"`
ProtocolVersion string `xml:"protocolVersion,omitempty" json:"protocolVersion,omitempty"`
AdminEmail []string `xml:"adminEmail,omitempty" json:"adminEmail,omitempty"`
EarliestDatestamp string `xml:"earliestDatestamp,omitempty" json:"earliestDatestamp,omitempty"`
DeletedRecord string `xml:"deletedRecord,omitempty" json:"deletedRecord,omitempty"`
Granularity string `xml:"granularity,omitempty" json:"granularity,omitempty"`
Description []Description `xml:"description,omitempty" json:"description,omitempty"`
}
Identify reports information about a repository.
type Interval ¶
Interval represents a span of time.
func (Interval) DailyIntervals ¶ added in v0.1.14
DailyIntervals segments a given interval into daily intervals.
func (Interval) HourlyIntervals ¶ added in v0.2.5
HourlyIntervals segments a given interval into hourly intervals.
func (Interval) MonthlyIntervals ¶
MonthlyIntervals segments a given interval into monthly intervals.
type ListIdentifiers ¶
type ListIdentifiers struct {
Headers []Header `xml:"header,omitempty" json:"header,omitempty"`
ResumptionToken ResumptionToken `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}
ListIdentifiers lists headers only.
type ListMetadataFormats ¶
type ListMetadataFormats struct {
MetadataFormat []MetadataFormat `xml:"metadataFormat,omitempty" json:"metadataFormat,omitempty"`
}
ListMetadataFormats lists supported metadata formats.
type ListRecords ¶
type ListRecords struct {
Records []Record `xml:"record" json:"record"`
ResumptionToken ResumptionToken `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}
ListRecords lists records.
type ListSets ¶
type ListSets struct {
Set []Set `xml:"set,omitempty" json:"set,omitempty"`
ResumptionToken ResumptionToken `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}
ListSets lists available sets.
type Metadata ¶
type Metadata struct {
Body []byte `xml:",innerxml"`
}
Metadata contains the actual metadata, conforming to varying schemas.
func (Metadata) MarshalJSON ¶
MarshalJSON marshals the metadata body.
type MetadataFormat ¶
type MetadataFormat struct {
MetadataPrefix string `xml:"metadataPrefix,omitempty" json:"metadataPrefix,omitempty"`
Schema string `xml:"schema,omitempty" json:"schema,omitempty"`
MetadataNamespace string `xml:"metadataNamespace,omitempty" json:"metadataNamespace,omitempty"`
}
MetadataFormat holds information about a format.
type MultiError ¶
type MultiError struct {
Errors []error
}
MultiError collects a number of errors.
func (*MultiError) Error ¶
func (e *MultiError) Error() string
Error formats all error strings into a single string.
type OAIError ¶
type OAIError struct {
Code string `xml:"code,attr" json:"code,omitempty"`
Message string `xml:",chardata" json:"message,omitempty"`
}
OAIError is an OAI protocol error.
type RateLimitedReader ¶ added in v0.4.14
type RateLimitedReader struct {
// contains filtered or unexported fields
}
RateLimitedReader wraps an io.Reader with rate limiting
func NewRateLimitedReader ¶ added in v0.4.14
func NewRateLimitedReader(r io.Reader, ctx context.Context) *RateLimitedReader
NewRateLimitedReader creates a new rate limited reader
func (*RateLimitedReader) Close ¶ added in v0.4.14
func (s *RateLimitedReader) Close() error
Close closes the underlying reader if it implements io.Closer
func (*RateLimitedReader) Read ¶ added in v0.4.14
func (s *RateLimitedReader) Read(p []byte) (int, error)
Read reads bytes into p with rate limiting.
func (*RateLimitedReader) SetRateLimit ¶ added in v0.4.14
func (s *RateLimitedReader) SetRateLimit(bytesPerSec float64)
SetRateLimit sets rate limit (bytes/sec) to the reader.
type Record ¶
type Record struct {
XMLName xml.Name
Header Header `xml:"header,omitempty" json:"header,omitempty"`
Metadata Metadata `xml:"metadata,omitempty" json:"metadata,omitempty"`
About About `xml:"about,omitempty" json:"about,omitempty"`
}
Record represents a single record.
type RenderOpts ¶ added in v0.2.16
type RenderOpts struct {
Writer io.Writer
Harvest Harvest
Root string
From string
Until string
UseJson bool
}
RenderOpts controls output by the metha-cat command.
type Repository ¶
type Repository struct {
BaseURL string
}
Repository represents an OAI endpoint.
func (Repository) CompleteListSize ¶ added in v0.4.17
func (r Repository) CompleteListSize() (int, error)
func (Repository) Formats ¶
func (r Repository) Formats() ([]MetadataFormat, error)
Formats returns a list of metadata formats.
type Request ¶
type Request struct {
BaseURL string
Verb string
Identifier string
MetadataPrefix string
From string
Until string
Set string
ResumptionToken string
CleanBeforeDecode bool
SuppressFormatParameter bool
ExtraHeaders http.Header
}
A Request can express any OAI request. Not all combinations of values will yield valid requests.
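To illustrate how such fields map to an OAI-PMH request URL, here is a sketch that builds a query string from the non-empty fields. The `request` type is a trimmed-down, hypothetical stand-in and `buildURL` is not part of the package; the parameter names are the standard OAI-PMH ones.

```go
package main

import (
	"fmt"
	"net/url"
)

// request is a trimmed-down, hypothetical stand-in for the Request type.
type request struct {
	BaseURL, Verb, MetadataPrefix, From, Until, Set string
}

// buildURL appends every non-empty field as a query parameter.
func buildURL(r request) string {
	v := url.Values{}
	add := func(key, val string) {
		if val != "" {
			v.Set(key, val)
		}
	}
	add("verb", r.Verb)
	add("metadataPrefix", r.MetadataPrefix)
	add("from", r.From)
	add("until", r.Until)
	add("set", r.Set)
	return r.BaseURL + "?" + v.Encode() // Encode sorts parameters by key
}

func main() {
	fmt.Println(buildURL(request{
		BaseURL:        "http://example.org/oai",
		Verb:           "ListRecords",
		MetadataPrefix: "oai_dc",
		From:           "2016-02-29",
		Until:          "2016-03-31",
	}))
}
```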
type RequestNode ¶
type RequestNode struct {
Verb string `xml:"verb,attr" json:"verb,omitempty"`
Set string `xml:"set,attr" json:"set,omitempty"`
MetadataPrefix string `xml:"metadataPrefix,attr" json:"metadataPrefix,omitempty"`
}
RequestNode carries the request information into the response.
type Response ¶
type Response struct {
ResponseDate string `xml:"responseDate,omitempty" json:"responseDate,omitempty"`
Request RequestNode `xml:"request,omitempty" json:"request,omitempty"`
Error OAIError `xml:"error,omitempty" json:"error,omitempty"`
GetRecord GetRecord `xml:"GetRecord,omitempty" json:"GetRecord,omitempty"`
Identify Identify `xml:"Identify,omitempty" json:"Identify,omitempty"`
ListIdentifiers ListIdentifiers `xml:"ListIdentifiers,omitempty" json:"ListIdentifiers,omitempty"`
ListMetadataFormats ListMetadataFormats `xml:"ListMetadataFormats,omitempty" json:"ListMetadataFormats,omitempty"`
ListRecords ListRecords `xml:"ListRecords,omitempty" json:"ListRecords,omitempty"`
ListSets ListSets `xml:"ListSets,omitempty" json:"ListSets,omitempty"`
}
Response is the envelope. It can hold any OAI response kind.
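The envelope idea can be tried out with encoding/xml alone. The sketch below declares reduced, hypothetical copies of the structs with the same XML tags and decodes a made-up Identify response; the sample document and the example.org URL are illustrative only.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Reduced, hypothetical copies of the documented types.
type oaiError struct {
	Code    string `xml:"code,attr"`
	Message string `xml:",chardata"`
}

type identify struct {
	RepositoryName string `xml:"repositoryName"`
	BaseURL        string `xml:"baseURL"`
}

type response struct {
	ResponseDate string   `xml:"responseDate"`
	Error        oaiError `xml:"error"`
	Identify     identify `xml:"Identify"`
}

const sample = `<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2016-02-29T12:00:00Z</responseDate>
  <request verb="Identify">http://example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>http://example.org/oai</baseURL>
  </Identify>
</OAI-PMH>`

// decode parses one OAI-PMH document into the envelope.
func decode(doc string) (response, error) {
	var resp response
	err := xml.Unmarshal([]byte(doc), &resp)
	return resp, err
}

func main() {
	resp, err := decode(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// Sections that are absent from the document stay at their zero value.
	fmt.Println(resp.Identify.RepositoryName, resp.Error.Code == "")
}
```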
func (*Response) CompleteListSize ¶ added in v0.1.38
CompleteListSize returns the value of completeListSize, if it exists.
func (*Response) Cursor ¶ added in v0.1.38
Cursor returns the value of cursor, if it exists.
func (*Response) GetResumptionToken ¶
GetResumptionToken returns the resumption token or an empty string if it does not have a token. In addition, return an empty string, if cursor and complete list size are defined and are equal (doaj, refs #14865).
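The rule described for GetResumptionToken can be sketched as a small function. The `resumptionToken` type and `effectiveToken` are hypothetical, and comparing the two attributes as strings is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// resumptionToken is a hypothetical stand-in with the relevant fields only.
type resumptionToken struct {
	Text             string
	CompleteListSize string
	Cursor           string
}

// effectiveToken returns the token text, or "" when cursor and complete list
// size are both present and equal, which is treated as the end of the list.
func effectiveToken(t resumptionToken) string {
	if t.Cursor != "" && t.CompleteListSize != "" && t.Cursor == t.CompleteListSize {
		return ""
	}
	return strings.TrimSpace(t.Text)
}

func main() {
	fmt.Printf("%q\n", effectiveToken(resumptionToken{Text: "abc", Cursor: "0", CompleteListSize: "100"}))
	fmt.Printf("%q\n", effectiveToken(resumptionToken{Text: "abc", Cursor: "100", CompleteListSize: "100"}))
}
```

A caller would keep requesting pages while the returned token is non-empty.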
func (*Response) HasResumptionToken ¶
HasResumptionToken determines if the response has a ResumptionToken.
type ResumptionToken ¶ added in v0.1.38
type ResumptionToken struct {
Text string `xml:",chardata"` // eyJhIjogWyIyMDE5LTAyLTIxV...
CompleteListSize string `xml:"completeListSize,attr"`
Cursor string `xml:"cursor,attr"`
ExpirationDate string `xml:"expirationDate,attr"`
}
ResumptionToken is a resumption token with optional extra information.
type Set ¶
type Set struct {
SetSpec string `xml:"setSpec,omitempty" json:"setSpec,omitempty"`
SetName string `xml:"setName,omitempty" json:"setName,omitempty"`
SetDescription Description `xml:"setDescription,omitempty" json:"setDescription,omitempty"`
}
A Set has a spec, name and description.
type Values ¶
Values enhances the builtin url.Values.
func (Values) EncodeVerbatim ¶
EncodeVerbatim is like Encode(), but does not escape the keys and values.
Directories ¶
| Path | Synopsis |
|---|---|
| cmd | |
| metha-cat (command) | |
| metha-files (command) | |
| metha-fortune (command) | |
| metha-id (command) | |
| metha-ls (command) | |
| metha-pack (command) | metha-pack iterates over all harvested files and will compact them per endpoint into a single file. |
| metha-sync (command) | |
| extra | |
| _largecrawl (command) | genjson extracts info from a stream of OAI DC XML records, e.g. |
| etd (command) | |
| migratezstd041 (command) | |
| pkpindex (command) | Small util to get journal info from https://index.pkp.sfu.ca currently including 1264043 records indexed from 4960 publications. |
| xflag | Package xflag adds an additional flag type Array for repeated string flags. |