Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.7.0] - 2026-03-03
Added
- Add close_unclosed_boundaries parameter to TfSettings (default: True). When enabled, a pre-processing pass groups all mutually-intersecting h-edges and v-edges into connected components. For each component, if the h-edges' x-span extends beyond the x-positions of the v-edges in that component (or vice versa for the y direction), a virtual closing edge is synthesised at the extension endpoint. The full intersection-detection and cell-detection pipeline is then re-run with the enhanced edge set. intersection_x_tolerance / intersection_y_tolerance are used as thresholds. The feature is skipped entirely when either strategy is "text" to avoid false positives from text-derived edges. This fixes some known bugs in other table extraction libraries (pdfplumber#631, pdfplumber#1296). (#24)
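The closing pass described in this entry can be pictured with a small standalone sketch. It is not the library's implementation: the HEdge and VEdge types, the closing_v_edges helper, and the default tolerance of 3.0 are all hypothetical, and only the x direction is handled (the y direction would be symmetric).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HEdge:  # hypothetical horizontal edge: spans x1..x2 at height y
    x1: float
    x2: float
    y: float


@dataclass(frozen=True)
class VEdge:  # hypothetical vertical edge: spans y1..y2 at position x
    x: float
    y1: float
    y2: float


def intersects(h, v, tol):
    """True when h and v cross or touch within the tolerance."""
    return h.x1 - tol <= v.x <= h.x2 + tol and v.y1 - tol <= h.y <= v.y2 + tol


def components(h_edges, v_edges, tol):
    """Group mutually intersecting edges into connected components (union-find)."""
    edges = list(h_edges) + list(v_edges)
    parent = list(range(len(edges)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, h in enumerate(h_edges):
        for j, v in enumerate(v_edges):
            if intersects(h, v, tol):
                parent[find(i)] = find(len(h_edges) + j)

    groups = {}
    for i, edge in enumerate(edges):
        groups.setdefault(find(i), []).append(edge)
    return list(groups.values())


def closing_v_edges(h_edges, v_edges, tol=3.0):
    """Synthesise vertical closing edges where h-edges overshoot the v-edges."""
    new_edges = []
    for comp in components(h_edges, v_edges, tol):
        hs = [e for e in comp if isinstance(e, HEdge)]
        vs = [e for e in comp if isinstance(e, VEdge)]
        if not hs or not vs:
            continue
        left, right = min(h.x1 for h in hs), max(h.x2 for h in hs)
        top, bottom = min(h.y for h in hs), max(h.y for h in hs)
        xs = [v.x for v in vs]
        if left < min(xs) - tol:  # h-edges extend past the leftmost v-edge
            new_edges.append(VEdge(left, top, bottom))
        if right > max(xs) + tol:  # h-edges extend past the rightmost v-edge
            new_edges.append(VEdge(right, top, bottom))
    return new_edges


# A table whose right border was never drawn: rules run to x=100,
# but the v-edges stop at x=50.
h_edges = [HEdge(0.0, 100.0, y) for y in (0.0, 20.0, 40.0)]
v_edges = [VEdge(0.0, 0.0, 40.0), VEdge(50.0, 0.0, 40.0)]
print(closing_v_edges(h_edges, v_edges))
```

In this sketch the synthesised edge would simply be appended to the vertical edge set before cell detection runs again.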
Changed
- Breaking: close_unclosed_boundaries is now True by default, which may change table detection results for PDFs whose table outer boundaries are only partially drawn. To restore the previous behavior, set close_unclosed_boundaries=False in TfSettings. (#24)
[0.6.0] - 2026-03-01
Added
- Add exclude_background_colored_edges parameter to TfSettings (default: True) to replace exclude_white_edges (#20). Uses a spatial visibility algorithm: for each edge, the fill colors of the rectangles directly adjacent on both sides (within snap_tolerance) are inspected. An edge is excluded only when it is indistinguishable from its surroundings on all effective sides (missing sides default to the standard white PDF page background). This correctly handles pages with mixed-background tables and non-white backgrounds. (See tests/data/diff-bg-color-and-line-color.pdf) (#22)
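The visibility rule in this entry reduces to a small predicate once the adjacent fills are known. The sketch below assumes that lookup has already happened (finding the rectangles within snap_tolerance is not shown), and the function name and signature are hypothetical rather than part of the library.

```python
from typing import Optional, Tuple

Color = Tuple[int, int, int]
WHITE: Color = (255, 255, 255)  # default PDF page background


def is_background_colored(
    edge_color: Color,
    side_a_fill: Optional[Color],
    side_b_fill: Optional[Color],
    page_background: Color = WHITE,
) -> bool:
    """True when the edge cannot be told apart from the fill on either side.

    A side with no adjacent rectangle (None) falls back to the page background.
    """
    effective = [
        fill if fill is not None else page_background
        for fill in (side_a_fill, side_b_fill)
    ]
    return all(edge_color == fill for fill in effective)


GRAY: Color = (200, 200, 200)

# White edge on a bare white page: invisible, so it would be excluded.
print(is_background_colored(WHITE, None, None))
# White edge separating two gray cells: visible, so it would be kept.
print(is_background_colored(WHITE, GRAY, GRAY))
# Gray edge with gray on one side and bare page on the other: visible.
print(is_background_colored(GRAY, GRAY, None))
```

The second case is the one a plain "drop white edges" filter gets wrong, which is the situation the mixed-background test file mentioned in the entry exercises.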
Fixed
- Cell text spacing: A space is now inserted between two words in a table cell only when the gap between their bounding boxes (in reading direction) exceeds the word-extraction tolerance (x_tolerance for horizontal text, y_tolerance for vertical). This preserves visible gaps in languages like English (e.g. "Table 1" and "Abcd") while avoiding unwanted spaces in languages that do not use spaces between words (e.g. Chinese). (#23)
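As a rough illustration of the spacing rule for horizontal text, the sketch below joins words left to right and inserts a space only when the gap exceeds the tolerance. The Word type, the join_words helper, and the default of 3.0 are assumptions made for the example, not the library's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:  # hypothetical word box: text plus its left and right x bounds
    text: str
    x0: float
    x1: float


def join_words(words: List[Word], x_tolerance: float = 3.0) -> str:
    """Join words in reading order, adding a space only across visible gaps."""
    if not words:
        return ""
    ordered = sorted(words, key=lambda w: w.x0)
    parts = [ordered[0].text]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x0 - prev.x1
        if gap > x_tolerance:  # gap is wide enough to count as a space
            parts.append(" ")
        parts.append(cur.text)
    return "".join(parts)


# English words separated by a visible gap keep their space.
print(join_words([Word("Table", 0.0, 30.0), Word("1", 36.0, 42.0)]))
# Adjacent CJK glyph runs with a sub-tolerance gap are joined directly.
print(join_words([Word("表", 0.0, 10.0), Word("格", 10.5, 20.5)]))
```

Vertical text would follow the same pattern with y coordinates and y_tolerance.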
Changed
- Breaking: exclude_white_edges has been removed and replaced by exclude_background_colored_edges. The new parameter is True by default and subsumes the old behavior on white-background pages. Rename any existing exclude_white_edges=False to exclude_background_colored_edges=False. (#22)
[0.5.0] - 2026-02-25 [YANKED]
Reason for yanking: exclude_white_edges was not designed comprehensively enough: it failed to handle non-white and mixed-background pages correctly. It has been superseded by the more robust exclude_background_colored_edges introduced in 0.6.0.
Added
- Add is_stroked attribute to Rect and Line: indicates whether the PDF path is stroked (from path.is_stroked())
- Add fill_mode attribute to Rect and Line: fill rule from pdfium-render PdfPathFillMode (NONE, WINDING, or EVEN_ODD); expose FillMode enum to Python with same three values
- Replace Line.color with Line.stroke_color and Line.fill_color (aligned with Rect)
- Add Table.to_list() returning list[list[TableCellValue]]: each cell has text (or None when merged), merged_left, and merged_top so merge direction (left vs above) is explicit (#19)
- Add TableCellValue class with attributes text, merged_left, and merged_top for use with to_list() (#19)
- Add get_intersections_from_edges(h_edges, v_edges, ...) function: given horizontal and vertical edges (as returned by get_edges), returns a mapping from every (x, y) intersection point to the edges that pass through it; accepts the same tolerance kwargs as get_edges
- Add Document.save_to_bytes() method to serialize the PDF to an in-memory byte buffer, always without encryption; if the original was password-protected the returned bytes can be opened without a password
- Add page.doc back-reference: every Page object now carries a reference to the Document it belongs to
- Add Page.page_idx property: zero-based index of the page within its document
- Add Page.rotation_degrees property: clockwise rotation of the page in degrees
- Add Page.clear_cache() method as the canonical name for clearing cached objects
- Add tablers.debug module with PageImage class for visualizing detected tables and edges on a rendered page image; requires the optional debug extra (pip install tablers[debug]) (#18)
- Add exclude_white_edges parameter to TfSettings to control filtering of white edges (RGB = 255, 255, 255) during table extraction (#20)
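To make the "intersection point to edges" mapping in the list above concrete, here is a minimal standalone sketch. The SketchEdge type, the sketch_intersections function, the shape of the returned dictionary, and the tolerance defaults are all assumptions for illustration; the real get_intersections_from_edges may represent its result differently.

```python
from collections import defaultdict
from typing import NamedTuple


class SketchEdge(NamedTuple):  # hypothetical stand-in for an edge object
    orientation: str  # "h" or "v"
    x1: float
    y1: float
    x2: float
    y2: float


def sketch_intersections(h_edges, v_edges, x_tol=1.0, y_tol=1.0):
    """Map each (x, y) crossing to the horizontal and vertical edges through it."""
    points = defaultdict(lambda: {"h": [], "v": []})
    for v in v_edges:
        for h in h_edges:
            crosses_x = h.x1 - x_tol <= v.x1 <= h.x2 + x_tol
            crosses_y = v.y1 - y_tol <= h.y1 <= v.y2 + y_tol
            if crosses_x and crosses_y:
                entry = points[(v.x1, h.y1)]
                if h not in entry["h"]:
                    entry["h"].append(h)
                if v not in entry["v"]:
                    entry["v"].append(v)
    return dict(points)


# A single 10 x 5 rectangle drawn as four edges yields four corner points.
top = SketchEdge("h", 0, 0, 10, 0)
bottom = SketchEdge("h", 0, 5, 10, 5)
left = SketchEdge("v", 0, 0, 0, 5)
right = SketchEdge("v", 10, 0, 10, 5)
corners = sketch_intersections([top, bottom], [left, right])
print(sorted(corners))
```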
Changed
- Breaking: White edges (RGB = 255, 255, 255) are now excluded by default during table extraction. This may change the behavior of existing code that relied on white edges being included. To restore the previous behavior, set exclude_white_edges=False in TfSettings. (#20)
- Breaking: Line.color has been removed. Use Line.stroke_color and Line.fill_color instead (aligned with Rect).
- Page is now a Python-level wrapper that holds a doc back-reference; the Rust-side type is Pyo3Page
Deprecated
- find_tables_from_cells parameter pdf_page has been renamed to page for argument-naming consistency; passing pdf_page still works but emits a DeprecationWarning and will be removed in a future release
- Page.clear() is now an alias for Page.clear_cache(); prefer clear_cache() going forward
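Both deprecations follow a common Python pattern: keep the old name working, warn, and forward to the new one. The sketch below shows that pattern with hypothetical stand-ins (find_tables_sketch and PageSketch); it does not reproduce the library's actual signatures.

```python
import warnings


def find_tables_sketch(cells, page=None, pdf_page=None):
    """Hypothetical function illustrating a renamed keyword argument."""
    if pdf_page is not None:
        warnings.warn(
            "'pdf_page' is deprecated; use 'page' instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        if page is None:
            page = pdf_page
    return page  # returned only so the forwarding is observable


class PageSketch:
    """Hypothetical page object with a cache and an aliased clear method."""

    def __init__(self):
        self.cache = {"objects": [1, 2, 3]}

    def clear_cache(self):
        self.cache.clear()

    clear = clear_cache  # old name kept as an alias


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    find_tables_sketch([], pdf_page="page-1")
    print([w.category.__name__ for w in caught])
```

Callers migrating ahead of the removal only need to switch the keyword to page and call clear_cache() directly.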
[0.4.2] - 2026-02-11
Fixed
- Fix narrow closepath polylines not being regarded as strict lines (#13, #15)
- Fix nested XObject transformation matrices not being applied correctly
- Using Document as a Python context manager now returns the correct type, which improves type hinting
[0.4.1] - 2026-02-05
Changed
- Make this package usable on Linux with glibc >= 2.28 (previously glibc >= 2.34 was required)
[0.4.0] - 2026-01-31
Added
- Add clip parameter to find_tables and find_all_cells_bboxes for table detection in specific regions (#10)
Fixed
- Fix edge extension for mixed text/non-text strategies to extract tables correctly (#9)
[0.3.0] - 2026-01-13
Added
- Add Python Edge constructor for programmatic edge creation with orientation, x1, y1, x2, y2, width, and color parameters
- Add explicit strategy for table detection, allowing the use of explicitly provided edges (#7)
- Add explicit_h_edges and explicit_v_edges settings to TfSettings for providing explicit edges
- Allow page parameter to be None in find_tables, find_all_cells_bboxes and get_edges when both strategies are explicit (and extract_text is False for find_tables)
- Add plumber_edge_to_tablers_edge function for converting pdfplumber edges to tablers edges
- Add documentation and doc workflow with Material-for-MkDocs (#6)
Changed
- Change Edge invalid orientation error from Rust panic to Python ValueError
- Change get_edges function signature and API
[0.2.0] - 2026-01-05
Added
- Add CSV export for tables (to_csv) (#5)
- Add Markdown export for tables (to_markdown)
- Add HTML export for tables (to_html)
- Add min_rows and min_columns settings for table filtering (default: None, no filter)
- Add include_single_cell setting to configure whether to include tables with only one cell (default: false)
- Add need_strip option to table extraction functions for whitespace and line feed handling (default: true)
- Add rows and columns properties for Python bindings
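For readers unfamiliar with what such exporters produce, the sketch below renders a table held as a list of rows into CSV and Markdown using only the standard library. The helper names are hypothetical, and the exact output of the library's to_csv and to_markdown may differ in quoting and layout.

```python
import csv
import io
from typing import List, Optional

Rows = List[List[Optional[str]]]


def rows_to_csv(rows: Rows) -> str:
    """Serialize rows as CSV; empty cells (None) become empty fields."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(["" if cell is None else cell for cell in row] for row in rows)
    return buf.getvalue()


def rows_to_markdown(rows: Rows) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    def fmt(row):
        return "| " + " | ".join("" if c is None else c for c in row) + " |"

    header, *body = rows
    lines = [fmt(header), fmt(["---"] * len(header))]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


table = [["Item", "Qty"], ["Bolt, M4", "12"], ["Nut", None]]
print(rows_to_csv(table))
print(rows_to_markdown(table))
```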
Fixed
- Fix handling of multiple MoveTo commands in one path segment
- Improve rectangle detection with better path segment type handling
[0.1.1] - 2025-12-30
Fixed
- Fix the Linux wheel not containing libpdfium.so (fixed by renaming it to libpdfium.so.1)
[0.1.0] - 2025-12-30
Added
- Add NonNegative validations for settings
- Add context manager support to Document class for Python
- Add table finding and text extraction settings with new API functions
- Add comprehensive README with features and usage examples
- Add comprehensive docstrings to Python modules and Rust code
- Add tests
- Add CI workflow
- Add pre-commit hooks
Changed
- Update TfSettings default strategies from Lines to LinesStrict
- Replace horizontal_ltr and vertical_ttb with text_read_in_clockwise to handle text with rotation_degrees 90 and 270 simultaneously
- Support PDFs with page_count > 65535 by updating pdfium-render
- Use global pdfium runtime
Fixed
- Fix cargo clippy errors and update lint scripts
- Replace macOS pdfium dylib with arm64 version
[0.0.0] - 2025-12-25
Added
- lines / lines_strict / text strategies for extracting tables in a PDF page