Lightweight PDF text extractor with a C++ core (PoDoFo 1.x), a Node.js native addon, and an Electron GUI.
- `pdfx`: core library that extracts text from PDFs via PoDoFo 1.x.
- `pdfx_cli` wraps the core for command-line extraction (`txt` or `json` output).
- `pdfx.node` exposes the extractor to Node via N-API (`node-addon-api`).

cpp/ # C++ core + CLI (CMake)
  include/PdfTextExtractor.hpp
  src/PdfTextExtractor.cpp
  src/cli_main.cpp
  CMakeLists.txt
cmake/FindOrFetchPoDoFo.cmake # Finds system PoDoFo or vendors from source
node/
  addon/ # Node native addon (cmake-js)
    binding.cpp
    CMakeLists.txt
    package.json
    test-addon.sh
  gui/ # Electron app (main + preload + renderer)
    main.js
    preload.cjs
    renderer/index.html
    package-lock.json
tests/
  some.pdf # Minimal one-page PDF
  some_pdf.py # Script that generates a minimal PDF
LICENSE
README.md # (this file supersedes the minimal stub)
PoDoFo 1.x:
- A system install is tried first (`find_package(PoDoFo CONFIG QUIET)`).

For the Node addon:
- CMake (the addon build is driven by `cmake-js`).
- `npm` to install dev dependencies listed in `node/addon/package.json`.

For the GUI:
- `npm`.
- `electron` (already present in `node/gui/package-lock.json`, see Run: Electron GUI).

# from repo root
cmake -S cpp -B build/cpp -DCMAKE_BUILD_TYPE=Release
cmake --build build/cpp --target pdfx pdfx_cli -j
Artifacts:
build/cpp/libpdfx.*
build/cpp/pdfx_cli[.exe]
Install (optional):
cmake --install build/cpp --prefix /your/prefix
Usage (from `cpp/src/cli_main.cpp`):
Usage: pdfx_cli -i input.pdf [-o out.txt] [--pages 1-3,5] [--format txt|json]
Examples:
# Extract all pages to stdout as text
build/cpp/pdfx_cli -i tests/some.pdf
# Extract page 1 and 3..5 to a file
build/cpp/pdfx_cli -i tests/some.pdf --pages 1,3-5 -o extracted.txt
# JSON output
build/cpp/pdfx_cli -i tests/some.pdf --format json
Verification steps:
- Run the CLI on `tests/some.pdf` and confirm that text is printed.
- With `--format json`, output has a `pages` array; each item contains `"index"` and `"text"`.
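The JSON check above can be automated with a few lines of Node. The sketch below assumes that `pages` is a top-level key of a JSON object, that `index` is an integer, and that `text` is a string; the README only states that the array and the two fields exist. `readPagesJson` and the sample string are hypothetical and not part of the project.

```javascript
// Hypothetical helper, not part of pdfx.
// Assumed shape: { "pages": [ { "index": <integer>, "text": <string> }, ... ] }
function readPagesJson(jsonText) {
  const doc = JSON.parse(jsonText);
  if (doc === null || typeof doc !== 'object' || !Array.isArray(doc.pages)) {
    throw new Error('expected a top-level "pages" array');
  }
  return doc.pages.map((page, position) => {
    const valid =
      page !== null &&
      typeof page === 'object' &&
      Number.isInteger(page.index) &&
      typeof page.text === 'string';
    if (!valid) {
      throw new Error(`pages[${position}] needs an integer "index" and a string "text"`);
    }
    return { index: page.index, text: page.text };
  });
}

// Stand-in for captured CLI output; the values are made up for illustration.
const sample = '{"pages":[{"index":0,"text":"Hello"}]}';
console.log(readPagesJson(sample).length); // 1
```

To check real output, the string could come from running `build/cpp/pdfx_cli -i tests/some.pdf --format json` and capturing stdout, for example with `child_process.execFileSync` and `{ encoding: 'utf8' }`.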
# from repo root
cd node/addon
npm i
npm run build # uses cmake-js; produces build/Release/pdfx.node
# (optional) print the full path to the built artifact
npm run print:artifact
Test the addon binary:
# quick export inspection
./test-addon.sh --build
# => prints exported methods: [ 'extractAll', 'extractPages' ]
// replace with the actual path the build printed for pdfx.node
const addon = require('./node/addon/build/Release/pdfx.node');
(async () => {
  const pages = addon.extractAll('tests/some.pdf');
  console.log('page count:', pages.length);
  console.log('page1:', pages[0]);
  const some = addon.extractPages('tests/some.pdf', [0]); // zero-based indices
  console.log('only page 1:', some[0]);
})();
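The two properties this README expects from the addon (an array of strings, one per page, and only the selected pages in order) can also be checked by a script. The following is a minimal sketch: `checkAddon` and `fakeAddon` are hypothetical names, and the stand-in object exists only so the example runs without the native build. It assumes that repeated calls return identical text for the same page.

```javascript
// Hypothetical check; works with any object exposing extractAll/extractPages.
function checkAddon(addon, pdfPath) {
  const all = addon.extractAll(pdfPath);
  if (!Array.isArray(all) || !all.every((t) => typeof t === 'string')) {
    throw new Error('extractAll() should return an array of strings');
  }
  // Ask for the first and last page (ascending, zero-based, de-duplicated).
  const wanted = all.length === 0 ? [] : [...new Set([0, all.length - 1])];
  const some = addon.extractPages(pdfPath, wanted);
  const matches =
    Array.isArray(some) &&
    some.length === wanted.length &&
    wanted.every((pageIndex, k) => some[k] === all[pageIndex]);
  if (!matches) {
    throw new Error('extractPages() should return exactly the selected pages, in order');
  }
  return { pageCount: all.length, checked: wanted };
}

// Stand-in with the same method names as the addon.
const fakePages = ['page one', 'page two', 'page three'];
const fakeAddon = {
  extractAll: () => [...fakePages],
  extractPages: (_path, indices) => indices.map((i) => fakePages[i]),
};
console.log(checkAddon(fakeAddon, 'tests/some.pdf').pageCount); // 3
```

Against a real build, the stand-in would be replaced by the object returned from `require` on the built `pdfx.node`.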
Verification steps:
- `extractAll()` returns an array of strings (one per page).
- `extractPages(path, [0,2])` returns only the selected pages in order.

The GUI consists of:
- `node/gui/main.js` (Electron main, ESM)
- `node/gui/preload.cjs` (context-isolated preload, CJS)
- `node/gui/renderer/index.html`

It expects the native addon at one of:
- `node/addon/build/Release/pdfx.node` (repo dev build), or
- `<app resources>/native/pdfx.node` (if you package later)

Minimal run (dev)
Note: `node/gui` currently lacks a `package.json`. Create one as shown below, then install and run Electron.
{
  "name": "pdfx-gui",
  "version": "0.1.0",
  "type": "module",
  "main": "main.js",
  "private": true,
  "scripts": {
    "start": "electron ."
  },
  "devDependencies": {
    "electron": "^31.3.0"
  }
}
Then:
# build the native addon first so the GUI can load it
(cd node/addon && npm i && npm run build)
# run the GUI
cd node/gui
npm i
npm run start
Keyboard shortcuts (from the renderer UI):
Ctrl/⌘ + O
Ctrl/⌘ + E
Ctrl/⌘ + S
GUI features visible in `renderer/index.html`:
- Page ranges in one-based notation (e.g. `1-3, 6`) → zero-based indices internally.
- Extracted text can be saved as `.txt`.
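The range-to-index conversion can be illustrated with a short function. This is a sketch of one possible parser, not the code in `renderer/index.html`: the name `parsePageRanges`, the error types, and the decision to keep duplicates are all assumptions.

```javascript
// Hypothetical helper: converts a one-based page spec such as "1-3, 6"
// into the zero-based indices that extractPages() expects.
function parsePageRanges(spec) {
  const indices = [];
  for (const part of spec.split(',')) {
    const token = part.trim();
    if (token === '') continue; // tolerate stray commas and spaces
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(token);
    if (!match) throw new TypeError(`Invalid page token: "${token}"`);
    const first = Number(match[1]);
    const last = match[2] === undefined ? first : Number(match[2]);
    if (first < 1 || last < first) {
      throw new RangeError(`Invalid page range: "${token}"`);
    }
    for (let page = first; page <= last; page += 1) {
      indices.push(page - 1); // one-based page number -> zero-based index
    }
  }
  return indices;
}

console.log(parsePageRanges('1-3, 6')); // [ 0, 1, 2, 5 ]
```

Overlapping input such as `1-3, 2` yields repeated indices here; a caller that needs unique pages could pass the result through a `Set`.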
`PdfTextExtractor`
Header: cpp/include/PdfTextExtractor.hpp
struct ExtractOptions {
    bool preserve_layout = false; // currently unused by implementation
};
class PdfTextExtractor {
public:
    std::vector<std::string> extractAll(const std::string& pdfPath,
                                        const ExtractOptions& opts = {});
    std::vector<std::string> extractPages(const std::string& pdfPath,
                                          const std::vector<int>& pageIndices,
                                          const ExtractOptions& opts = {});
private:
    std::string extractOnePage(PoDoFo::PdfMemDocument& doc, int pageIndex,
                               const ExtractOptions& opts);
    static std::string toUtf8(const std::string& s);
};
Implementation notes (from `PdfTextExtractor.cpp`):
- Documents are loaded with `PoDoFo::PdfMemDocument`.
- `PdfPage::ExtractTextTo(std::vector<PdfTextEntry>&)` is used.
- Page text is built by joining the `e.Text` entries with `'\n'`.
- `toUtf8` is a pass-through (no re-encoding).

File: `node/addon/binding.cpp`
// const addon = require('./build/Release/pdfx.node')
addon.extractAll(path: string) // -> string[] per page
addon.extractPages(path: string, pages: number[]) // -> string[] for requested pages
- Throws `TypeError` on invalid arguments.
- Native failures are reported as `Error("pdfx native error: ...")`.
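Callers may want to tell these two failure modes apart. The sketch below is one way to do so, assuming the message of a native failure starts with the literal prefix `pdfx native error:` as quoted above. The helper names and the category labels are invented for illustration.

```javascript
// Hypothetical helpers; the category strings are not defined by pdfx.
function classifyAddonError(err) {
  if (err instanceof TypeError) return 'invalid-arguments';
  if (err instanceof Error && err.message.startsWith('pdfx native error:')) {
    return 'native-failure';
  }
  return 'unknown';
}

// Wraps extractAll() so callers receive a result object instead of an exception.
function safeExtractAll(addon, pdfPath) {
  try {
    return { ok: true, pages: addon.extractAll(pdfPath) };
  } catch (err) {
    return {
      ok: false,
      kind: classifyAddonError(err),
      message: String(err && err.message),
    };
  }
}
```

A GUI could show `invalid-arguments` as a programming error and `native-failure` as a problem with the PDF itself, though that split is a design choice rather than something the addon prescribes.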
`window.pdfx`

File: `node/gui/preload.cjs`
window.pdfx = {
  selectPdf(): Promise<string|null>, // showOpenDialog; returns path or null
  extractAll(filePath: string): Promise<string[]>, // IPC -> addon
  extractPages(filePath: string, pages: number[]): Promise<string[]>,
  saveText(defaultName: string, text: string): Promise<string|null>, // showSaveDialog
  onError(cb: (msg: string) => void): () => void // subscribe to startup load errors
}
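A renderer-side flow built on this API might look like the sketch below. It takes the `pdfx` object as a parameter so it can be exercised with a stand-in outside Electron. The function name, the blank-line page separator, and the way the default file name is derived are assumptions, as is the expectation that `pages` holds zero-based indices like the addon's `extractPages`.

```javascript
// Hypothetical flow: pick a PDF, extract text, offer to save it.
// `pdfx` is the object exposed as window.pdfx; `pageIndices` is optional.
async function extractAndSave(pdfx, pageIndices) {
  const filePath = await pdfx.selectPdf();
  if (filePath === null) return null; // the open dialog was cancelled

  const pages =
    Array.isArray(pageIndices) && pageIndices.length > 0
      ? await pdfx.extractPages(filePath, pageIndices)
      : await pdfx.extractAll(filePath);

  // Assumed convention: separate pages with a blank line.
  const text = pages.join('\n\n');

  // Derive "name.txt" from ".../name.pdf" (handles / and \ separators).
  const baseName = filePath.replace(/^.*[\\/]/, '').replace(/\.pdf$/i, '');
  return pdfx.saveText(`${baseName}.txt`, text);
}
```

In the renderer this would be called as `extractAndSave(window.pdfx)` or with an index array; it resolves to whatever `saveText` returns, or `null` if no file was chosen.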
The main process (`node/gui/main.js`) resolves the addon from:
- `../addon/build/Release/pdfx.node` (dev), or
- `<resources>/native/pdfx.node` (packaged).

If loading fails, the renderer receives an error via `pdfx:onError`.
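The lookup order described above can be sketched as a first-match search over candidate paths. This is an illustration and not the code in `main.js`: `resolveAddonPath` and `loadAddon` are hypothetical, and the file-existence check and the loader are injected so the logic runs without Electron or a built binary.

```javascript
// Hypothetical helpers illustrating "try dev path, then packaged path".
function resolveAddonPath(candidates, exists) {
  for (const candidate of candidates) {
    if (exists(candidate)) return candidate; // first existing path wins
  }
  return null;
}

// Returns { addon, error } instead of throwing, so a caller can forward
// the error text to the renderer.
function loadAddon(candidates, exists, load) {
  const found = resolveAddonPath(candidates, exists);
  if (found === null) {
    return { addon: null, error: `pdfx.node not found; tried: ${candidates.join(', ')}` };
  }
  try {
    return { addon: load(found), error: null };
  } catch (err) {
    return { addon: null, error: String(err && err.message ? err.message : err) };
  }
}
```

In an ESM main process, `exists` could be `fs.existsSync` and `load` a `require` function obtained from `createRequire(import.meta.url)` in `node:module`, which is one common way to load a `.node` binary from ESM code.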
- `tests/some.pdf` (one page; Latin-1 text operators, quotes, `TJ` array, etc.).

Generate a minimal PDF:
python tests/some_pdf.py
# writes ./some.pdf in the current working directory
Handled by `cmake/FindOrFetchPoDoFo.cmake`:
- Try the system package first (`find_package(PoDoFo CONFIG QUIET)`).
- If not found, vendor from source:
  - Repository: `https://github.com/podofo/podofo.git`
  - Ref: `PDFX_PODOFO_TAG` (default: `"1.0.0"`)

Options:
- `-DPDFX_VENDOR_DEPS=ON`: force vendored build.
- `-DPDFX_PODOFO_TAG=<tag-or-branch>`: pick a different PoDoFo ref when vendoring.

The module sets:
- `PDFX_PoDoFo_TARGET`: target to link against.
- `PDFX_PoDoFo_INCLUDE_DIRS`: include directories passed to the `pdfx` target.

MIT: see `LICENSE`.