gh0ffice

package module

v1.0.0 (latest) Published: Sep 24, 2024 License: Apache-2.0 Imports: 18 Imported by: 2

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/WhityGhost/gh0ffice

README

📄 Gh0ffice (Office/PDF File Parser)

This Go-based project provides a robust parser for various office document formats, including DOCX/DOC, PPTX/PPT, XLSX/XLS, and PDF. The parser extracts both content and metadata from these file types, allowing easy access to structured document data for further processing or analysis.

🛠 Features

Metadata Extraction: Captures essential metadata such as title, author, keywords, and modification dates.
Content Parsing: Supports extraction of text content from multiple file formats.
Extensible Architecture: Easily add support for new file formats by implementing additional reader functions.

📂 Supported Formats

DOCX: Extracts text content from Word documents.
PPTX: Extracts text content from PowerPoint presentations.
XLSX: Extracts data from Excel spreadsheets.
DOC: Extracts text content from legacy Word documents.
PPT: Extracts text content from legacy PowerPoint presentations.
XLS: Extracts data from legacy Excel spreadsheets.
PDF: Extracts text content from PDF files (note that some complex PDFs may not be fully supported).

📖 Installation

To use this project, ensure you have Go installed on your system. Clone this repository and run the following command to install the dependencies:

go mod tidy

🚀 Usage

Basic Usage

Call InspectDocument to extract a document's content and metadata. Per the documented signature, it takes two string arguments (pathname and target_abpath) and returns a *Document:

// targetAbsPath is passed as target_abpath; its exact meaning is not described in this README.
doc, err := gh0ffice.InspectDocument("path/to/your/file.docx", targetAbsPath)
if err != nil {
    log.Fatalf("Error reading document: %v", err)
}
fmt.Printf("Title: %s\n", doc.Title)
fmt.Printf("Content: %s\n", doc.Content)

Debugging

Call SetDebug (added in v0.5.1) with true to enable more verbose logging during parsing. The package declares DEBUG as a variable that defaults to false, not as a constant:

gh0ffice.SetDebug(true)

⚠️ Limitations

The PDF parsing may fail on certain complex or malformed documents.
Only straightforward text extraction is performed; formatting and images are not considered.
Compatibility tested primarily on major office file formats.

📝 License

This project is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.

📬 Contributing

Contributions are welcome! Please feel free to create issues or submit pull requests for new features or bug fixes.

👥 Author

This project is maintained by the team and community of YT-Gh0st. Contributions and engagements are always welcome!

For any questions or suggestions, feel free to reach out. Happy parsing! 😊

Documentation

Index

Constants
Variables
func SetDebug(dbg bool)
type DocReader
type Document
- func InspectDocument(pathname string, target_abpath string) (*Document, error)

Constants

const ISO string = "2006-01-02T15:04:05"

Variables

var DEBUG bool = false

var PARA_RE = regexp.MustCompile(`(</[a-z]:p>)+`)

var TAG_RE = regexp.MustCompile(`(<[^>]*>)+`)
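
These two patterns suggest a simple way to turn Office XML into plain text: PARA_RE matches runs of closing paragraph tags such as </w:p> or </a:p>, and TAG_RE matches runs of any tags. The sketch below is one plausible use, not the package's confirmed implementation; xmlToText is a hypothetical helper and the regexes are mirrored locally.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Local copies of the package's exported patterns.
var PARA_RE = regexp.MustCompile(`(</[a-z]:p>)+`)
var TAG_RE = regexp.MustCompile(`(<[^>]*>)+`)

// xmlToText is a hypothetical helper: it turns each run of closing
// paragraph tags into a newline, then strips every remaining tag.
// XML entities such as &amp; are left undecoded.
func xmlToText(xml string) string {
	s := PARA_RE.ReplaceAllString(xml, "\n")
	s = TAG_RE.ReplaceAllString(s, "")
	return strings.TrimSpace(s)
}

func main() {
	xml := `<w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>World</w:t></w:r></w:p>`
	fmt.Printf("%q\n", xmlToText(xml)) // prints "Hello\nWorld"
}
```

Replacing paragraph ends before stripping tags matters: once TAG_RE has run, the paragraph boundaries are gone and all text would be joined on one line.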

Functions

func SetDebug (added in v0.5.1)

func SetDebug(dbg bool)

Types

type DocReader

type DocReader func(string) (string, error)

type Document

type Document struct {
	RePath         string    `json:"path"`
	Filename       string    `json:"filename"`
	Title          string    `json:"title"`
	Subject        string    `json:"subject"`
	Creator        string    `json:"creator"`
	Keywords       string    `json:"keywords"`
	Description    string    `json:"description"`
	Lastmodifiedby string    `json:"lastModifiedBy"`
	Revision       string    `json:"revision"`
	Category       string    `json:"category"`
	Content        string    `json:"content"`
	Modifytime     time.Time `json:"modified"`
	Createtime     time.Time `json:"created"`
	Accesstime     time.Time `json:"accessed"`
	Size           int       `json:"size"`
	// contains filtered or unexported fields
}

func InspectDocument (added in v0.4.1)

func InspectDocument(pathname string, target_abpath string) (*Document, error)

InspectDocument builds a Document that holds the file's content, its metadata, and its file information.

Source Files

gh0ffice.go

Directories

Path	Synopsis
lib
ioadapters
metagoffice
pdf	Package pdf implements reading of PDF files.
pdf/pdfpasswd	command
xls	Package xls parses Microsoft Excel 97-2004 files (".xls" suffix, not ".xlsx").
