go-camelot

command module
v0.0.0-...-b993aa4
Published: Feb 15, 2020 License: AGPL-3.0 Imports: 2 Imported by: 0

README

A clean-room implementation of PDF table detection, inspired by camelot. It starts from the raw page image, which is produced by the corresponding go-pardocs and go-dundocs commands. The tool is invoked once per page through a simple os/exec call and passes the OCR-ed text back so that the calling function can stay in sync.

Assumption: only target tables whose cells are clearly separated by ruled lines, so that finding the table reduces to detecting those lines.

Ideas

Use edge detection (e.g. Canny) + line detection (e.g. Hough lines), both available via gocv, to create the row/column slices

Use the fonts (extracted via pdfcpu - https://pdfcpu.io/extract/extract_fonts.html) + character sets to recognize the text in each sliced image - https://github.com/Th1nkK1D/gocr

Alternative: run OCR directly on the row/column slices:

- https://github.com/otiai10/gosseract
- https://github.com/otiai10/ocrserver

See one implementation: https://github.com/hybridgroup/gocv/tree/master/cmd/find-lines

Techniques

Libraries available for use

Documentation

There is no documentation for this package.

Directories

Path Synopsis
internal
