Documentation
¶
Overview ¶
Package unflat provides PDF to Markdown conversion optimized for AI training datasets and search indexing.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Document ¶
type Document struct {
Title string
Author string
Subject string
Keywords []string
Creator string
Producer string
CreationDate time.Time
ModificationDate time.Time
PageCount int
PageWidth float64
PageHeight float64
Language string
Content string
}
Document represents a converted PDF document with its Markdown content.
type NewlineCollapseResult ¶
NewlineCollapseResult holds the result of newline collapsing.
type NewlinePolicy ¶
type NewlinePolicy int
NewlinePolicy determines when to preserve or collapse newlines.
const ( NewlinePolicyPreserve NewlinePolicy = iota // Keep all newlines NewlinePolicyNormal // Collapse some newlines contextually NewlinePolicyAggressive // Collapse most newlines )
type Options ¶
type Options struct {
// NewlineCollapseAggressiveness controls how aggressively line breaks are
// removed. Valid values: 0 (none), 1 (conservative), 2 (moderate), 3 (aggressive).
// Default: 2 (moderate).
NewlineCollapseAggressiveness int
// MinHeadingFontSize is the minimum font size to be considered a heading.
// Default: 14.0
MinHeadingFontSize float64
// PreserveOriginalSpacing semantics when true, attempts to preserve
// original spacing information from the PDF. Default: false.
PreserveOriginalSpacing bool
// ColumnLayoutAware when true, detects multi-column layouts and attempts
// to reconstruct logical reading order. Default: true.
ColumnLayoutAware bool
// HeadingLevelMappings allows custom mapping from detected patterns to
// heading levels.
HeadingLevelMappings map[string]int
// IncludeMetadata when true, includes a YAML metadata block at the beginning
// of generated Markdown files. Default: true.
IncludeMetadata bool
// HeadingPrefixPatterns are font name patterns that indicate a heading.
HeadingPrefixPatterns []string
// Debug when true, enables debug output. Default: false.
Debug bool
// EnableMarkdownFormatting when true, applies Markdown formatting (headings with
// # prefix, bold with **, etc). When false, outputs plain text while still
// writing to .md file extension. Default: true.
EnableMarkdownFormatting bool
}
Options configures the PDF to Markdown conversion behavior.
func DefaultOptions ¶
func DefaultOptions() *Options
DefaultOptions returns Options with sensible defaults.
Click to show internal directories.
Click to hide internal directories.