unflat

Version: v0.0.0-...-11a844b
Published: May 8, 2026 License: MIT Imports: 7 Imported by: 0

README

Unflat

A Go package for PDF-to-Markdown conversion optimized for AI training datasets and search indexing.

Installation

go get github.com/n01nex/unflat

Usage

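package main

import (
    "fmt"
    "os"

    "github.com/n01nex/unflat"
)

func main() {
    pdfPath := "/path/to/document.pdf"
    outputPath := "/path/to/output.md"

    // Start from the defaults and override only what is needed.
    // IncludeMetadata and ColumnLayoutAware already default to true;
    // they are set here only to make the choice explicit.
    opts := unflat.DefaultOptions()
    opts.IncludeMetadata = true
    opts.ColumnLayoutAware = true
    opts.NewlineCollapseAggressiveness = 2 // moderate (the default)

    // Convert reads the PDF and returns a Document holding the Markdown
    // in doc.Content along with the extracted metadata fields.
    doc, err := unflat.Convert(pdfPath, opts)
    if err != nil {
        fmt.Fprintf(os.Stderr, "Error: %v\n", err)
        os.Exit(1)
    }

    // Write the Markdown to disk.
    if err := os.WriteFile(outputPath, []byte(doc.Content), 0644); err != nil {
        fmt.Fprintf(os.Stderr, "Error writing: %v\n", err)
        os.Exit(1)
    }

    fmt.Printf("Converted: %s -> %s\n", pdfPath, outputPath)
}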

Configuration Options

The Options struct exposes the following fields:

Option | Type | Default | Description
NewlineCollapseAggressiveness | int | 2 | Controls line break removal. Values: 0 (preserve all), 1 (conservative), 2 (moderate), 3 (aggressive)
MinHeadingFontSize | float64 | 14.0 | Minimum font size to be considered a heading
PreserveOriginalSpacing | bool | false | Attempt to preserve original PDF spacing
ColumnLayoutAware | bool | true | Detect multi-column layouts and reconstruct reading order
HeadingLevelMappings | map[string]int | {} | Custom mapping from patterns to heading levels
IncludeMetadata | bool | true | Include YAML metadata block at the start of Markdown
HeadingPrefixPatterns | []string | [Bold, Title, Heavy, ...] | Font name patterns indicating headings
EnableMarkdownFormatting | bool | true | Apply Markdown formatting (headings with #, bold with **, etc). When false, outputs plain text.
Debug | bool | false | Enable debug output.
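
To make the interplay of MinHeadingFontSize and HeadingPrefixPatterns concrete, here is a minimal, self-contained sketch of one way such a check could work. The headingOptions type and the looksLikeHeading helper are hypothetical names invented for this illustration, and the "size threshold OR font-name match" rule is an assumption, not the package's actual logic.

```go
package main

import (
	"fmt"
	"strings"
)

// headingOptions mirrors two documented Options fields.
// The type itself is hypothetical and not part of unflat.
type headingOptions struct {
	MinHeadingFontSize    float64
	HeadingPrefixPatterns []string
}

// looksLikeHeading is a hypothetical helper. It assumes a text span counts
// as a heading when its font size reaches the threshold, or when its font
// name contains one of the configured patterns.
func looksLikeHeading(fontName string, fontSize float64, opts headingOptions) bool {
	if fontSize >= opts.MinHeadingFontSize {
		return true
	}
	for _, p := range opts.HeadingPrefixPatterns {
		if strings.Contains(fontName, p) {
			return true
		}
	}
	return false
}

func main() {
	opts := headingOptions{
		MinHeadingFontSize:    14.0,
		HeadingPrefixPatterns: []string{"Bold", "Title", "Heavy"},
	}
	fmt.Println(looksLikeHeading("Helvetica", 18, opts))      // large font
	fmt.Println(looksLikeHeading("Helvetica-Bold", 11, opts)) // bold font name
	fmt.Println(looksLikeHeading("Helvetica", 11, opts))      // plain body text
}
```

Under this assumed rule, lowering MinHeadingFontSize or adding patterns makes detection more permissive, which may suit documents whose headings differ from body text only by weight.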

Features

  • Intelligent text extraction - Extracts text from PDFs while preserving layout information
  • Heading detection - Uses multiple heuristics (font size, weight, positioning, ALL CAPS patterns)
  • Smart newline collapse - Intelligently determines when to preserve vs collapse line breaks
  • PDF character fixing - Handles ligatures (fi, fl, ffi, ffl), PDF-encoded numbers, and encoding issues
  • Multi-column support - Detects and reconstructs logical reading order for multi-column layouts
  • Metadata preservation - Extracts and includes document metadata (title, author, dates, etc.)
  • Markdown formatting - Proper handling of lists, code blocks, and special characters
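
As an illustration of the ligature handling mentioned in the list above, the following sketch expands the four Unicode ligature code points into plain letters. The fixLigatures helper is a hypothetical name, and this is only one plausible approach; the package's own implementation is not shown in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// ligatures maps Unicode presentation-form ligatures to their plain-letter
// expansions. The three-letter forms are listed first for readability; the
// code points are distinct, so order does not affect the result.
var ligatures = strings.NewReplacer(
	"\ufb03", "ffi", // LATIN SMALL LIGATURE FFI
	"\ufb04", "ffl", // LATIN SMALL LIGATURE FFL
	"\ufb01", "fi", // LATIN SMALL LIGATURE FI
	"\ufb02", "fl", // LATIN SMALL LIGATURE FL
)

// fixLigatures is a hypothetical helper that replaces ligature glyphs so the
// extracted text is searchable with ordinary ASCII queries.
func fixLigatures(s string) string {
	return ligatures.Replace(s)
}

func main() {
	fmt.Println(fixLigatures("e\ufb03cient work\ufb02ow for the \ufb01le"))
}
```

Expanding ligatures matters for search indexing because a query for "file" would otherwise miss text stored with the single-glyph form.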

Document Metadata

When IncludeMetadata is true, the generated Markdown files include a YAML block with:

  • title - Document title
  • author - Author information (if available)
  • subject - Subject (if available)
  • keywords - Keywords (if available)
  • creation_date - Document creation date
  • page_count - Number of pages
  • page_width / page_height - Page dimensions
  • language - Detected language
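
For orientation, here is a rough sketch of how a YAML block with the keys listed above could be assembled. The meta type and frontMatter function are hypothetical, and the exact layout (the --- delimiters, string quoting, date format, and omission of empty fields) is an assumption; the real output of the package may differ.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// meta is a hypothetical struct carrying the metadata keys listed above.
type meta struct {
	Title, Author, Subject string
	Keywords               []string
	CreationDate           time.Time
	PageCount              int
	PageWidth, PageHeight  float64
	Language               string
}

// frontMatter renders m as a YAML block. Optional fields are skipped when
// empty. Go's %q quoting is used as an approximation of YAML double-quoted
// strings, which is adequate for ordinary titles and names.
func frontMatter(m meta) string {
	var b strings.Builder
	b.WriteString("---\n")
	fmt.Fprintf(&b, "title: %q\n", m.Title)
	if m.Author != "" {
		fmt.Fprintf(&b, "author: %q\n", m.Author)
	}
	if m.Subject != "" {
		fmt.Fprintf(&b, "subject: %q\n", m.Subject)
	}
	if len(m.Keywords) > 0 {
		b.WriteString("keywords:\n")
		for _, k := range m.Keywords {
			fmt.Fprintf(&b, "  - %q\n", k)
		}
	}
	if !m.CreationDate.IsZero() {
		fmt.Fprintf(&b, "creation_date: %s\n", m.CreationDate.Format(time.RFC3339))
	}
	fmt.Fprintf(&b, "page_count: %d\n", m.PageCount)
	fmt.Fprintf(&b, "page_width: %.1f\n", m.PageWidth)
	fmt.Fprintf(&b, "page_height: %.1f\n", m.PageHeight)
	fmt.Fprintf(&b, "language: %q\n", m.Language)
	b.WriteString("---\n")
	return b.String()
}

func main() {
	fmt.Print(frontMatter(meta{
		Title:        "Quarterly Report",
		Author:       "A. Writer",
		Keywords:     []string{"finance", "q3"},
		CreationDate: time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC),
		PageCount:    12,
		PageWidth:    612,
		PageHeight:   792,
		Language:     "en",
	}))
}
```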

License

MIT License - see LICENSE file for details.

Documentation

Overview

Package unflat provides PDF to Markdown conversion optimized for AI training datasets and search indexing.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Document

type Document struct {
	Title            string
	Author           string
	Subject          string
	Keywords         []string
	Creator          string
	Producer         string
	CreationDate     time.Time
	ModificationDate time.Time
	PageCount        int
	PageWidth        float64
	PageHeight       float64
	Language         string
	Content          string
}

Document represents a converted PDF document with its Markdown content.

func Convert

func Convert(pdfPath string, opts *Options) (*Document, error)

Convert converts a PDF file to Markdown content.

type NewlineCollapseResult

type NewlineCollapseResult struct {
	Text       string
	BreakType  string
	WasChanged bool
}

NewlineCollapseResult holds the result of newline collapsing.
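
The documentation does not describe how a result of this shape is produced, so the following is only a sketch of one plausible newline-collapse rule. The collapseResult type copies the three field names above; the joinLines helper, the rules it applies, and the BreakType values ("hyphenation", "hard", "soft") are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// collapseResult mirrors the fields of NewlineCollapseResult.
// It is a local stand-in, not the package's type.
type collapseResult struct {
	Text       string
	BreakType  string // assumed values: "hyphenation", "hard", "soft"
	WasChanged bool
}

// endsSentence reports whether s is blank or ends in sentence-final punctuation.
func endsSentence(s string) bool {
	s = strings.TrimRight(s, " \t")
	return s == "" || strings.ContainsAny(s[len(s)-1:], ".!?:")
}

// joinLines is a hypothetical helper that decides how two adjacent lines
// should be joined:
//   - a trailing hyphen followed by a lowercase letter is treated as a
//     word split across lines, so the hyphen and newline are removed;
//   - a line ending a sentence (or a blank line) keeps its newline;
//   - anything else is treated as a soft wrap and joined with a space.
func joinLines(prev, next string) collapseResult {
	first, _ := utf8.DecodeRuneInString(next)
	switch {
	case strings.HasSuffix(prev, "-") && unicode.IsLower(first):
		return collapseResult{Text: strings.TrimSuffix(prev, "-") + next, BreakType: "hyphenation", WasChanged: true}
	case next == "" || endsSentence(prev):
		return collapseResult{Text: prev + "\n" + next, BreakType: "hard", WasChanged: false}
	default:
		return collapseResult{Text: prev + " " + next, BreakType: "soft", WasChanged: true}
	}
}

func main() {
	fmt.Printf("%+v\n", joinLines("PDF conver-", "sion is lossy"))
	fmt.Printf("%+v\n", joinLines("End of sentence.", "Next paragraph"))
	fmt.Printf("%+v\n", joinLines("a wrapped", "line of text"))
}
```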

type NewlinePolicy

type NewlinePolicy int

NewlinePolicy determines when to preserve or collapse newlines.

const (
	NewlinePolicyPreserve   NewlinePolicy = iota // Keep all newlines
	NewlinePolicyNormal                          // Collapse some newlines contextually
	NewlinePolicyAggressive                      // Collapse most newlines
)

type Options

type Options struct {
	// NewlineCollapseAggressiveness controls how aggressively line breaks are
	// removed. Valid values: 0 (none), 1 (conservative), 2 (moderate), 3 (aggressive).
	// Default: 2 (moderate).
	NewlineCollapseAggressiveness int

	// MinHeadingFontSize is the minimum font size to be considered a heading.
	// Default: 14.0
	MinHeadingFontSize float64

	// PreserveOriginalSpacing, when true, attempts to preserve
	// original spacing information from the PDF. Default: false.
	PreserveOriginalSpacing bool

	// ColumnLayoutAware when true, detects multi-column layouts and attempts
	// to reconstruct logical reading order. Default: true.
	ColumnLayoutAware bool

	// HeadingLevelMappings allows custom mapping from detected patterns to
	// heading levels.
	HeadingLevelMappings map[string]int

	// IncludeMetadata when true, includes a YAML metadata block at the beginning
	// of generated Markdown files. Default: true.
	IncludeMetadata bool

	// HeadingPrefixPatterns are font name patterns that indicate a heading.
	HeadingPrefixPatterns []string

	// Debug when true, enables debug output. Default: false.
	Debug bool

	// EnableMarkdownFormatting when true, applies Markdown formatting (headings with
	// # prefix, bold with **, etc). When false, outputs plain text while still
	// writing to .md file extension. Default: true.
	EnableMarkdownFormatting bool
}

Options configures the PDF to Markdown conversion behavior.

func DefaultOptions

func DefaultOptions() *Options

DefaultOptions returns Options with sensible defaults.
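
The defaults documented for each field can be summarized in code. The sketch below uses a local options type and a defaultOptions function, both hypothetical stand-ins that only restate the documented defaults; they are not the package's source. The documented HeadingPrefixPatterns default is truncated ("Bold, Title, Heavy, ..."), so only the three named patterns appear here.

```go
package main

import "fmt"

// options is a local copy of the documented Options fields.
type options struct {
	NewlineCollapseAggressiveness int
	MinHeadingFontSize            float64
	PreserveOriginalSpacing       bool
	ColumnLayoutAware             bool
	HeadingLevelMappings          map[string]int
	IncludeMetadata               bool
	HeadingPrefixPatterns         []string
	Debug                         bool
	EnableMarkdownFormatting      bool
}

// defaultOptions restates the documented defaults. Returning a pointer lets
// callers adjust individual fields before passing the value on.
func defaultOptions() *options {
	return &options{
		NewlineCollapseAggressiveness: 2, // moderate
		MinHeadingFontSize:            14.0,
		PreserveOriginalSpacing:       false,
		ColumnLayoutAware:             true,
		HeadingLevelMappings:          map[string]int{},
		IncludeMetadata:               true,
		// The documented list is elided after these three entries.
		HeadingPrefixPatterns:    []string{"Bold", "Title", "Heavy"},
		Debug:                    false,
		EnableMarkdownFormatting: true,
	}
}

func main() {
	// Example override: plain text without a metadata block.
	opts := defaultOptions()
	opts.EnableMarkdownFormatting = false
	opts.IncludeMetadata = false
	fmt.Printf("%+v\n", *opts)
}
```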

type ParagraphStats

type ParagraphStats struct {
	Lines        []string
	Spans        []string
	X            float64
	Width        float64
	FontSize     float64
	Column       int
	PageNum      int
	IsHeading    bool
	HeadingLevel int
}

ParagraphStats holds statistics about paragraphs.
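
The Column and PageNum fields suggest how reading order might be reconstructed for multi-column pages. The sketch below is an assumption about that use, not the package's algorithm: a local paragraph type carries a subset of the fields above, and the hypothetical readingOrder function sorts by page, then by column, keeping the original top-to-bottom order within each column.

```go
package main

import (
	"fmt"
	"sort"
)

// paragraph carries a subset of the ParagraphStats fields.
// It is a local stand-in, not the package's type.
type paragraph struct {
	Lines   []string
	Column  int
	PageNum int
}

// readingOrder returns a copy of ps sorted by page and then by column.
// The sort is stable, so paragraphs within one column keep the order in
// which they were extracted (assumed here to be top to bottom).
func readingOrder(ps []paragraph) []paragraph {
	out := make([]paragraph, len(ps))
	copy(out, ps)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].PageNum != out[j].PageNum {
			return out[i].PageNum < out[j].PageNum
		}
		return out[i].Column < out[j].Column
	})
	return out
}

func main() {
	// Row-wise extraction order on a two-column page: left, right, left, right.
	extracted := []paragraph{
		{Lines: []string{"A"}, Column: 0, PageNum: 1},
		{Lines: []string{"B"}, Column: 1, PageNum: 1},
		{Lines: []string{"C"}, Column: 0, PageNum: 1},
		{Lines: []string{"D"}, Column: 1, PageNum: 1},
	}
	for _, p := range readingOrder(extracted) {
		fmt.Print(p.Lines[0], " ")
	}
	fmt.Println()
}
```

With the sample input, the left column (A, C) is read in full before the right column (B, D), which is the order a human reader would follow.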

Directories

Path Synopsis
cmd
test command
