gh0ffice

package module
v1.0.0 Latest
Published: Sep 24, 2024 License: Apache-2.0 Imports: 18 Imported by: 2

README

📄 Gh0ffice (Office/PDF File Parser)

This Go-based project provides a robust parser for various office document formats, including DOCX/DOC, PPTX/PPT, XLSX/XLS, and PDF. The parser extracts both content and metadata from these file types, allowing easy access to structured document data for further processing or analysis.

🛠 Features

  • Metadata Extraction: Captures essential metadata such as title, author, keywords, and modification dates.
  • Content Parsing: Supports extraction of text content from multiple file formats.
  • Extensible Architecture: Easily add support for new file formats by implementing additional reader functions.

📂 Supported Formats

  • DOCX: Extracts text content from Word documents.
  • PPTX: Extracts text content from PowerPoint presentations.
  • XLSX: Extracts data from Excel spreadsheets.
  • DOC: Extracts text content from Legacy Word documents.
  • PPT: Extracts text content from Legacy PowerPoint presentations.
  • XLS: Extracts data from Legacy Excel spreadsheets.
  • PDF: Extracts text content from PDF files (note that some complex PDFs may not be fully supported).

📖 Installation

To use this project, ensure you have Go installed on your system. Clone this repository and run the following command to install the dependencies:

go mod tidy

🚀 Usage

Basic Usage

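You can inspect a document and extract its content and metadata by calling the InspectDocument function with the file path, as follows. Note that the signature listed under Documentation takes a second argument, target_abpath, which this example omits: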

doc, err := gh0ffice.InspectDocument("path/to/your/file.docx")
if err != nil {
    log.Fatalf("Error reading document: %s", err)
}
fmt.Printf("Title: %s\n", doc.Title)
fmt.Printf("Content: %s\n", doc.Content)

Debugging

Set the exported DEBUG variable to true to enable more verbose logging during parsing (the package also exports SetDebug(dbg bool), added in v0.5.1):

gh0ffice.DEBUG = true

⚠️ Limitations

  • The PDF parsing may fail on certain complex or malformed documents.
  • Only straightforward text extraction is performed; formatting and images are not considered.
  • Compatibility has been tested primarily on the major office file formats listed above.

📝 License

This project is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.

📬 Contributing

Contributions are welcome! Please feel free to create issues or submit pull requests for new features or bug fixes.

👥 Author

This project is maintained by the team and community of YT-Gh0st. Contributions and engagements are always welcome!


For any questions or suggestions, feel free to reach out. Happy parsing! 😊

Documentation

Constants

const ISO string = "2006-01-02T15:04:05"
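
The ISO constant is a Go reference-time layout. The sketch below shows how a layout of this shape behaves with time.Parse and Format. The helper parseISO is hypothetical, and where gh0ffice actually applies the constant is not documented here.

```go
package main

import (
	"fmt"
	"time"
)

// ISO mirrors the package constant shown above.
const ISO string = "2006-01-02T15:04:05"

// parseISO is a hypothetical helper that parses a timestamp using the layout.
// With no zone in the layout, time.Parse returns the time in UTC.
func parseISO(s string) (time.Time, error) {
	return time.Parse(ISO, s)
}

func main() {
	ts, err := parseISO("2024-09-24T10:30:00")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(ts.Format(ISO))

	// The layout has no zone component, so a trailing "Z" is rejected.
	_, err = parseISO("2024-09-24T10:30:00Z")
	fmt.Println("trailing Z rejected:", err != nil)
}
```

Because the layout carries no time zone, inputs with a zone designator would need to be trimmed or parsed with a different layout such as time.RFC3339.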

Variables

var DEBUG bool = false
var PARA_RE = regexp.MustCompile(`(</[a-z]:p>)+`)
var TAG_RE = regexp.MustCompile(`(<[^>]*>)+`)
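
To show what these two patterns match, here is a minimal sketch that turns a fragment of document XML into plain text: runs of closing paragraph tags become line breaks, then all remaining tags are stripped. The function xmlToText is hypothetical, and the assumption that gh0ffice combines the patterns this way is not confirmed by the documentation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// These mirror the package variables shown above.
var PARA_RE = regexp.MustCompile(`(</[a-z]:p>)+`)
var TAG_RE = regexp.MustCompile(`(<[^>]*>)+`)

// xmlToText is a hypothetical helper: it replaces closing paragraph tags
// such as </w:p> or </a:p> with newlines, removes every other tag, and
// trims surrounding whitespace.
func xmlToText(xml string) string {
	withBreaks := PARA_RE.ReplaceAllString(xml, "\n")
	plain := TAG_RE.ReplaceAllString(withBreaks, "")
	return strings.TrimSpace(plain)
}

func main() {
	in := `<w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>World</w:t></w:r></w:p>`
	fmt.Println(xmlToText(in))
}
```

Note that PARA_RE only matches a single-letter namespace prefix, and that this approach does not decode XML entities such as &amp;amp;.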

Functions

func SetDebug added in v0.5.1

func SetDebug(dbg bool)

Types

type DocReader

type DocReader func(string) (string, error)

type Document

type Document struct {
	RePath         string    `json:"path"`
	Filename       string    `json:"filename"`
	Title          string    `json:"title"`
	Subject        string    `json:"subject"`
	Creator        string    `json:"creator"`
	Keywords       string    `json:"keywords"`
	Description    string    `json:"description"`
	Lastmodifiedby string    `json:"lastModifiedBy"`
	Revision       string    `json:"revision"`
	Category       string    `json:"category"`
	Content        string    `json:"content"`
	Modifytime     time.Time `json:"modified"`
	Createtime     time.Time `json:"created"`
	Accesstime     time.Time `json:"accessed"`
	Size           int       `json:"size"`
	// contains filtered or unexported fields
}

func InspectDocument added in v0.4.1

func InspectDocument(pathname string, target_abpath string) (*Document, error)

InspectDocument builds a Document struct containing the file's content, metadata, and file information.

Directories

Path             Synopsis
lib
pdf              Package pdf implements reading of PDF files.
pdf/pdfpasswd    (command)
xls              Package xls parses Microsoft Excel 97-2004 files (".xls" suffix, not ".xlsx").
