
API Reference#

This page provides detailed documentation for all public classes and functions in Tablers.

Functions#

find_tables#

Find all tables in a PDF page or from explicit edges.

def find_tables(
    page: Page | None = None,
    extract_text: bool = True,
    clip: BBox | None = None,
    tf_settings: TfSettings | None = None,
    **kwargs: Unpack[TfSettingItems]
) -> list[Table]

Parameters:

Parameter Type Default Description
page Optional[Page] None The PDF page to analyze. Can be None only if both strategies are "explicit" and extract_text is False
extract_text bool True Whether to extract text content from table cells
clip Optional[BBox] None Optional clip region (x1, y1, x2, y2). If provided, only edges within this region are used for table detection
tf_settings Optional[TfSettings] None TableFinder settings object. If not provided, default settings are used
**kwargs Unpack[TfSettingItems] - Additional keyword arguments passed to TfSettings

Returns: list[Table] - A list of Table objects found in the page.

Raises:

  • ValueError - If page is None and extract_text is True.
  • ValueError - If page is None and either strategy is not "explicit".

Example:

from tablers import Document, find_tables

with Document("example.pdf") as doc:
    page = doc.get_page(0)
    tables = find_tables(page, extract_text=True)
    for table in tables:
        print(f"Table with {len(table.cells)} cells at {table.bbox}")

Example with clip region:

from tablers import Document, find_tables

with Document("example.pdf") as doc:
    page = doc.get_page(0)
    # Only extract tables from a specific region
    clip = (100.0, 100.0, 400.0, 300.0)  # (x1, y1, x2, y2)
    tables = find_tables(page, extract_text=True, clip=clip)

Example with explicit edges (no page required):

from tablers import Edge, TfSettings, find_tables

h_edges = [Edge("h", 0.0, 0.0, 100.0, 0.0), Edge("h", 0.0, 100.0, 100.0, 100.0)]
v_edges = [Edge("v", 0.0, 0.0, 0.0, 100.0), Edge("v", 100.0, 0.0, 100.0, 100.0)]

settings = TfSettings(
    horizontal_strategy="explicit",
    vertical_strategy="explicit",
    explicit_h_edges=h_edges,
    explicit_v_edges=v_edges,
)

tables = find_tables(page=None, extract_text=False, tf_settings=settings)

Clip coordinates on rotated pages

When a page is marked as rotated by 90° or 270°, page.width and page.height describe the upright orientation (the page as you would normally view it). The coordinates of all objects (lines, text, etc.) within the PDF, however, are defined in the unrotated coordinate system, in which the two dimensions are swapped: the unrotated width equals page.height and the unrotated height equals page.width.

Therefore, clip values must also be specified using the unrotated coordinate system. Failing to account for this may result in incorrect table extraction.
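As a rough illustration of the dimension swap described above, the sketch below recovers the unrotated page size from the upright values and checks that a clip box fits inside it. The helpers `unrotated_size` and `clip_fits_page` are hypothetical and not part of Tablers. The sketch assumes the rotation is a multiple of 90° and that the unrotated page area starts at (0, 0).

```python
def unrotated_size(width: float, height: float, rotation_degrees: float) -> tuple[float, float]:
    """Return (width, height) of the page in the unrotated coordinate system.

    `width` and `height` are the upright values, as reported by page.width
    and page.height. For 90 and 270 degree rotations the two are swapped.
    """
    if int(rotation_degrees) % 180 == 90:
        return (height, width)
    return (width, height)


def clip_fits_page(clip, width, height, rotation_degrees) -> bool:
    """Check that clip=(x1, y1, x2, y2) lies inside the unrotated page area.

    Assumption: the unrotated page area spans (0, 0) to (page_w, page_h).
    """
    page_w, page_h = unrotated_size(width, height, rotation_degrees)
    x1, y1, x2, y2 = clip
    return 0 <= x1 <= x2 <= page_w and 0 <= y1 <= y2 <= page_h


# A page that looks 792 x 612 when viewed upright but is stored with a
# 90 degree rotation is 612 x 792 in the system that clip must use.
print(unrotated_size(792.0, 612.0, 90))
print(clip_fits_page((0.0, 0.0, 600.0, 780.0), 792.0, 612.0, 90))
```

A check like this only catches clip boxes that fall outside the unrotated page; it does not convert coordinates from the upright view.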


find_all_cells_bboxes#

Find all table cell bounding boxes in a PDF page or from explicit edges.

def find_all_cells_bboxes(
    page: Page | None = None,
    clip: BBox | None = None,
    tf_settings: TfSettings | None = None,
    **kwargs: Unpack[TfSettingItems]
) -> list[tuple[float, float, float, float]]

Parameters:

Parameter Type Description
page Optional[Page] The PDF page to analyze. Can be None only if both strategies are "explicit"
clip Optional[BBox] Optional clip region (x1, y1, x2, y2). If provided, only edges within this region are used for cell detection
tf_settings Optional[TfSettings] TableFinder settings object
**kwargs Unpack[TfSettingItems] Additional keyword arguments passed to TfSettings

Returns: list[BBox] - A list of bounding boxes (x1, y1, x2, y2) for each detected cell.

Raises: RuntimeError - If page is None and either strategy is not "explicit".

Example:

from tablers import Document, find_all_cells_bboxes

with Document("example.pdf") as doc:
    page = doc.get_page(0)
    cells = find_all_cells_bboxes(page)
    print(f"Found {len(cells)} cells")

Example with clip region:

from tablers import Document, find_all_cells_bboxes

with Document("example.pdf") as doc:
    page = doc.get_page(0)
    # Only detect cells within a specific region
    clip = (100.0, 100.0, 400.0, 300.0)
    cells = find_all_cells_bboxes(page, clip=clip)

Example with explicit edges (no page required):

from tablers import Edge, TfSettings, find_all_cells_bboxes

h_edges = [Edge("h", 0.0, 0.0, 100.0, 0.0), Edge("h", 0.0, 100.0, 100.0, 100.0)]
v_edges = [Edge("v", 0.0, 0.0, 0.0, 100.0), Edge("v", 100.0, 0.0, 100.0, 100.0)]

settings = TfSettings(
    horizontal_strategy="explicit",
    vertical_strategy="explicit",
    explicit_h_edges=h_edges,
    explicit_v_edges=v_edges,
)

cells = find_all_cells_bboxes(None, tf_settings=settings)

Clip coordinates on rotated pages

See the warning in find_tables about using clip with rotated pages.


find_tables_from_cells#

Construct tables from a list of cell bounding boxes.

def find_tables_from_cells(
    cells: list[tuple[float, float, float, float]],
    extract_text: bool,
    page: Page | None = None,
    tf_settings: TfSettings | None = None,
    **kwargs: Unpack[TfSettingItems]
) -> list[Table]

Parameters:

Parameter Type Description
cells list[BBox] A list of cell bounding boxes to group into tables
extract_text bool Whether to extract text content from cells
page Optional[Page] The PDF page (required if extract_text is True)
tf_settings Optional[TfSettings] Table finder settings
**kwargs Unpack[TfSettingItems] Additional keyword arguments for settings

Returns: list[Table] - A list of Table objects constructed from the cells.

Raises: RuntimeError - If extract_text is True but page is not provided.

Deprecated parameter pdf_page

The parameter was renamed from pdf_page to page. Passing pdf_page as a keyword argument still works but emits a DeprecationWarning and will be removed in a future release. Update your call sites:

# Old (deprecated)
find_tables_from_cells(cells, extract_text=True, pdf_page=page)

# New
find_tables_from_cells(cells, extract_text=True, page=page)
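To make the idea of grouping cell bounding boxes into tables more concrete, here is a minimal sketch that treats two cells as part of the same table when they share at least one corner point, then collects the connected groups. It is an illustration only and not necessarily the algorithm Tablers uses. `group_cells` and `corners` are hypothetical helpers, and corners are compared exactly, without any tolerance.

```python
from collections import defaultdict

BBox = tuple[float, float, float, float]


def corners(cell: BBox) -> list[tuple[float, float]]:
    """Return the four corner points of a (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = cell
    return [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]


def group_cells(cells: list[BBox]) -> list[list[BBox]]:
    """Group cells into connected components linked by shared corners."""
    by_corner = defaultdict(list)
    for idx, cell in enumerate(cells):
        for pt in corners(cell):
            by_corner[pt].append(idx)

    seen: set[int] = set()
    groups = []
    for start in range(len(cells)):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:  # depth-first walk over cells that share a corner
            idx = stack.pop()
            group.append(cells[idx])
            for pt in corners(cells[idx]):
                for other in by_corner[pt]:
                    if other not in seen:
                        seen.add(other)
                        stack.append(other)
        groups.append(sorted(group))
    return groups


cells = [
    (0.0, 0.0, 50.0, 20.0), (50.0, 0.0, 100.0, 20.0),  # two adjacent cells
    (300.0, 300.0, 350.0, 320.0),                      # an isolated cell
]
tables = group_cells(cells)
print(len(tables))
```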

get_edges#

Extract edges (lines and rectangle borders) from a PDF page or from explicit edges.

def get_edges(
    page: Page | None = None,
    tf_settings: TfSettings | None = None,
    **kwargs: Unpack[TfSettingItems]
) -> dict[str, list[Edge]]

Parameters:

Parameter Type Description
page Optional[Page] The PDF page to extract edges from. Can be None only if both strategies are "explicit"
tf_settings Optional[TfSettings] TableFinder settings object
**kwargs Unpack[TfSettingItems] Additional keyword arguments passed to TfSettings

Returns: dict - A dictionary with keys "h" (horizontal edges) and "v" (vertical edges).

Raises: RuntimeError - If page is None and either strategy is not "explicit".


get_intersections_from_edges#

Compute intersection points from a set of horizontal and vertical edges.

def get_intersections_from_edges(
    h_edges: list[Edge],
    v_edges: list[Edge],
    tf_settings: TfSettings | None = None,
    **kwargs: Unpack[TfSettingItems]
) -> dict[tuple[float, float], dict[str, list[Edge]]]

Parameters:

Parameter Type Description
h_edges list[Edge] Horizontal edges (e.g. edges["h"] from get_edges)
v_edges list[Edge] Vertical edges (e.g. edges["v"] from get_edges)
tf_settings Optional[TfSettings] TableFinder settings object; controls intersection tolerances
**kwargs Unpack[TfSettingItems] Additional keyword arguments passed to TfSettings

Returns: dict - A mapping from (x, y) intersection coordinates to a dict with keys "h" and "v", each containing the list of edges that pass through that point.

Tip

This function is designed to consume the output of get_edges directly:

edges = get_edges(page)
intersections = get_intersections_from_edges(edges["h"], edges["v"])

See Inspecting Intersections for more details.
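The shape of the returned mapping can be illustrated with a small pure-Python sketch. It uses a stand-in `SimpleEdge` tuple instead of the real Edge class, and the `x_tol` and `y_tol` parameters are hypothetical names for the intersection tolerances. Treat it as an approximation of the idea, not as the library's implementation.

```python
from typing import NamedTuple


class SimpleEdge(NamedTuple):
    """Stand-in for tablers.Edge with the same coordinate attributes."""
    orientation: str
    x1: float
    y1: float
    x2: float
    y2: float


def intersections(h_edges, v_edges, x_tol=0.0, y_tol=0.0):
    """Map each (x, y) crossing to the edges that pass through it."""
    result = {}
    for v in v_edges:
        for h in h_edges:
            crosses = (
                h.x1 - x_tol <= v.x1 <= h.x2 + x_tol
                and v.y1 - y_tol <= h.y1 <= v.y2 + y_tol
            )
            if crosses:
                entry = result.setdefault((v.x1, h.y1), {"h": [], "v": []})
                if h not in entry["h"]:
                    entry["h"].append(h)
                if v not in entry["v"]:
                    entry["v"].append(v)
    return result


h_edges = [SimpleEdge("h", 0.0, 0.0, 100.0, 0.0), SimpleEdge("h", 0.0, 100.0, 100.0, 100.0)]
v_edges = [SimpleEdge("v", 0.0, 0.0, 0.0, 100.0), SimpleEdge("v", 100.0, 0.0, 100.0, 100.0)]
points = intersections(h_edges, v_edges)
print(sorted(points))
```

For the four edges of a 100 x 100 square, the keys are the four corner points, and each value lists one horizontal and one vertical edge.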


plumber_edge_to_tablers_edge#

Convert a pdfplumber edge dictionary to a Tablers Edge object.

from tablers.edges import plumber_edge_to_tablers_edge

def plumber_edge_to_tablers_edge(
    plumber_edge: dict[str, Any],
    page_rotation: float,
    page_height: float,
    page_width: float,
) -> Edge

Parameters:

Parameter Type Description
plumber_edge dict[str, Any] A pdfplumber edge dictionary containing orientation, x0, y0, x1, y1, linewidth, and stroking_color
page_rotation float The rotation of the page in degrees
page_height float The height of the page
page_width float The width of the page

Returns: Edge - A Tablers Edge object.

Tip

This function can serve as a reference for writing conversion functions for other PDF libraries. See Using Edges from Other Libraries for more details.


Classes#

Document#

Represents an opened PDF document.

class Document:
    def __init__(
        self,
        path: Path | str | None = None,
        bytes: bytes | None = None,
        password: str | None = None
    )

Parameters:

Parameter Type Description
path Union[Path, str, None] File path to the PDF document
bytes Optional[bytes] PDF content as bytes
password Optional[str] Password for encrypted PDFs

Note

Provide either path or bytes. At least one of them is required; if both are provided, bytes is ignored and only path is used.

Properties:

Property Type Description
page_count int Total number of pages in the document

Methods:

Method Returns Description
get_page(page_num) Page Retrieve a specific page by index (0-based)
pages() Iterator[Page] Get a lazy iterator over all pages
save_to_bytes() bytes Serialize the document to bytes without encryption (see warning below)
close() None Close the document and release resources
is_closed() bool Check if the document has been closed

save_to_bytes() strips encryption

If the original document was password-protected, save_to_bytes() returns a byte buffer that can be opened without a password. This is intentional, so use it only when stripping the encryption is appropriate for your use case. The method also allocates a full in-memory copy of the PDF on every call; cache the result if you need it more than once.

Context Manager:

with Document("example.pdf") as doc:
    for page in doc.pages():
        print(page.width, page.height)

Page#

Represents a single page in a PDF document.

Attributes:

Attribute Type Description
width float The width of the page in points
height float The height of the page in points
page_idx int Zero-based index of this page within its document
rotation_degrees float Clockwise rotation of the page in degrees
objects Optional[Objects] Extracted objects, or None if not yet extracted
doc Document The Document instance this page belongs to

Methods:

Method Returns Description
is_valid() bool Check if the page reference is still valid
extract_objects() None Extract all objects from the page
clear_cache() None Clear cached objects to free memory
clear() None Alias for clear_cache(); deprecated, prefer clear_cache()

Table#

Represents a table extracted from a PDF page.

Attributes:

Attribute Type Description
bbox tuple[float, float, float, float] Bounding box (x1, y1, x2, y2)
cells list[TableCell] All cells in the table
rows list[CellGroup] All rows in the table
columns list[CellGroup] All columns in the table
page_index int Index of the page containing this table
text_extracted bool Whether text has been extracted

Methods:

Method Returns Description
to_csv() str Convert to CSV format
to_markdown() str Convert to Markdown table format
to_html() str Convert to HTML table format
to_list() list[list[TableCellValue]] Convert to list of rows; each cell has text, merged_left, and merged_top (see TableCellValue)

Warning

Export methods raise ValueError if text has not been extracted.


TableCellValue#

One grid slot returned by Table.to_list(). Carries the cell text (when present) and merge direction so you can tell whether a merged slot continues from the left or from above.

Attributes:

Attribute Type Description
text Optional[str] Cell text; None when this slot is merged (continuation of another cell)
merged_left bool True if this slot is merged with the cell to the left (same row)
merged_top bool True if this slot is merged with the cell above (same column)

For a slot with content, text is set and both merged_left and merged_top are False. For a merged slot, text is None and at least one of merged_left or merged_top is True. When a cell spans both right and below, the bottom-right slot can have both merged_left and merged_top true.

Repr: repr(cell) returns a string "(text, merged_left, merged_top)": text is shown as None or as a double-quoted string (internal quotes and backslashes are escaped); the two booleans are shown as True or False. Example: ("abc", False, False) or (None, True, False).
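One common use of the merge flags is to copy the text of a merged cell into every slot it covers. The sketch below shows one way to do that with a stand-in `CellValue` dataclass that has the same three attributes as TableCellValue. `CellValue` and `fill_merged` are hypothetical names, and the rows would in practice come from Table.to_list().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CellValue:
    """Stand-in with the same three attributes as TableCellValue."""
    text: Optional[str]
    merged_left: bool = False
    merged_top: bool = False


def fill_merged(rows):
    """Return a plain list of lists of strings with merged slots filled in."""
    filled = []
    for r, row in enumerate(rows):
        out = []
        for c, cell in enumerate(row):
            if cell.text is not None:
                out.append(cell.text)
            elif cell.merged_left and c > 0:
                out.append(out[c - 1])        # continue from the left neighbour
            elif cell.merged_top and r > 0:
                out.append(filled[r - 1][c])  # continue from the slot above
            else:
                out.append("")
        filled.append(out)
    return filled


rows = [
    [CellValue("Header"), CellValue(None, merged_left=True)],
    [CellValue("a"), CellValue("b")],
    [CellValue(None, merged_top=True), CellValue("c")],
]
print(fill_merged(rows))
```

When both flags are set, the sketch prefers the left neighbour, which has already been filled from above, so the result is the same either way.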


TableCell#

Represents a single cell in a table.

Attributes:

Attribute Type Description
bbox tuple[float, float, float, float] Bounding box (x1, y1, x2, y2)
text str Text content of the cell

How cell text is built

Cell text is produced by grouping characters into words (using WordsExtractSettings) and then joining those words. A space is inserted between two consecutive words only when the gap between their bounding boxes (in reading direction) exceeds the word-extraction tolerance (x_tolerance for horizontal text, y_tolerance for vertical). As a result, visible gaps in the PDF (e.g. between "Table 1" and "Abcd") are reflected as spaces, while languages that do not use spaces between words (e.g. Chinese) do not get extra spaces.
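The joining rule for horizontal text can be sketched in a few lines. The `Word` tuple, the `join_words` helper, and the default tolerance of 3.0 are assumptions made for this illustration; they are not the library's types or defaults.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical stand-in for an extracted word on one horizontal line."""
    text: str
    x1: float  # left edge
    x2: float  # right edge


def join_words(words, x_tolerance=3.0):
    """Join words left to right, adding a space only across gaps wider than x_tolerance."""
    parts = []
    prev = None
    for word in sorted(words, key=lambda w: w.x1):
        if prev is not None and word.x1 - prev.x2 > x_tolerance:
            parts.append(" ")
        parts.append(word.text)
        prev = word
    return "".join(parts)


# Gaps of 5 and 22 points exceed the tolerance, so spaces are inserted.
print(join_words([Word("Table", 0, 28), Word("1", 33, 38), Word("Abcd", 60, 85)]))
# A gap of 0.5 points stays below the tolerance, so no space is inserted.
print(join_words([Word("表", 0, 10), Word("格", 10.5, 20.5)]))
```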


CellGroup#

Represents a group of table cells arranged in a row or column.

Attributes:

Attribute Type Description
cells list[Optional[TableCell]] Cells in this group, with None for empty positions
bbox tuple[float, float, float, float] Bounding box of the entire group

Objects#

Container for all extracted objects from a PDF page.

Attributes:

Attribute Type Description
rects list[Rect] All rectangles found in the page
lines list[Line] All line segments found in the page
chars list[Char] All text characters found in the page

Rect#

Represents a rectangle extracted from a PDF page.

Attributes:

Attribute Type Description
bbox tuple[float, float, float, float] Bounding box
fill_color tuple[int, int, int, int] Fill color (RGBA)
stroke_color tuple[int, int, int, int] Stroke color (RGBA)
stroke_width float Stroke width
is_stroked bool Whether the path is stroked
fill_mode FillMode Fill rule (NONE, WINDING, or EVEN_ODD)

FillMode#

PDF path fill rule: none (not filled), winding (nonzero), or even-odd.

Values: FillMode.NONE, FillMode.WINDING, FillMode.EVEN_ODD (mirrors pdfium-render PdfPathFillMode)


Line#

Represents a line segment extracted from a PDF page.

Attributes:

Attribute Type Description
line_type Literal["straight", "polyline", "curve"] Type of line
points list[tuple[float, float]] Points defining the line path
stroke_color tuple[int, int, int, int] Stroke color (RGBA)
fill_color tuple[int, int, int, int] Fill color (RGBA)
width float Line width
is_stroked bool Whether the line is stroked
fill_mode FillMode Fill rule (NONE, WINDING, or EVEN_ODD)

Char#

Represents a text character extracted from a PDF page.

Attributes:

Attribute Type Description
unicode_char Optional[str] Unicode character
bbox tuple[float, float, float, float] Bounding box
rotation_degrees float Clockwise rotation in degrees
upright bool Whether the character is upright

Edge#

Represents a line edge extracted from a PDF page or created programmatically.

class Edge:
    def __init__(
        self,
        orientation: Literal["h", "v"],
        x1: float,
        y1: float,
        x2: float,
        y2: float,
        width: float = 1.0,
        color: Color = (0, 0, 0, 255),
    ) -> None

Constructor Parameters:

Parameter Type Default Description
orientation Literal["h", "v"] - "h" for horizontal, "v" for vertical
x1 float - Left x-coordinate
y1 float - Top y-coordinate
x2 float - Right x-coordinate
y2 float - Bottom y-coordinate
width float 1.0 Stroke width
color Color (0, 0, 0, 255) Stroke color (RGBA)

Raises: ValueError - If orientation is not "h" or "v".

Example:

from tablers import Edge

# Create a horizontal edge
h_edge = Edge("h", 0.0, 50.0, 100.0, 50.0)

# Create a vertical edge with custom width and color
v_edge = Edge("v", 50.0, 0.0, 50.0, 100.0, width=2.0, color=(255, 0, 0, 255))

Attributes:

Attribute Type Description
orientation Literal["h", "v"] "h" for horizontal, "v" for vertical
x1 float Left x-coordinate
y1 float Top y-coordinate
x2 float Right x-coordinate
y2 float Bottom y-coordinate
width float Stroke width
color tuple[int, int, int, int] Stroke color (RGBA)
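When using the "explicit" strategies, the edges of a regular ruled grid can be generated instead of written by hand. The sketch below builds argument tuples in the positional order of the Edge constructor (orientation, x1, y1, x2, y2). `grid_edge_args` is a hypothetical helper, and the sketch deliberately avoids importing Tablers so that it runs on its own.

```python
def grid_edge_args(xs, ys):
    """Return (h_args, v_args): argument tuples for the edges of a ruled grid.

    xs and ys are the x and y positions of the grid lines. Each tuple follows
    the positional order (orientation, x1, y1, x2, y2).
    """
    xs, ys = sorted(xs), sorted(ys)
    h_args = [("h", xs[0], y, xs[-1], y) for y in ys]
    v_args = [("v", x, ys[0], x, ys[-1]) for x in xs]
    return h_args, v_args


h_args, v_args = grid_edge_args(xs=[0.0, 50.0, 100.0], ys=[0.0, 100.0])
print(h_args)
print(v_args)
# With Tablers installed, each tuple could be expanded as Edge(*args).
```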

Type Aliases#

Alias Definition Description
Point tuple[float, float] A 2D point (x, y)
BBox tuple[float, float, float, float] Bounding box (x1, y1, x2, y2)
Color tuple[int, int, int, int] RGBA color (0-255 each)

Debug Module (tablers.debug)#

Optional dependency

The debug module requires the debug extra. Install it with:

pip install tablers[debug]

PageImage#

Renders a PDF page to a PIL image and provides drawing primitives for annotating detected tables, edges, and intersection points.

from tablers.debug import PageImage

class PageImage:
    def __init__(
        self,
        page: Page,
        original: PIL.Image.Image | None = None,
        resolution: int | float = 72,
        antialias: bool = False,
    )

Parameters:

Parameter Type Default Description
page Page - The page to render
original Optional[PIL.Image.Image] None Pre-rendered image. If None, the page is rendered at the given resolution
resolution Union[int, float] 72 Rendering resolution in DPI
antialias bool False Enable anti-aliasing during rendering

Raises: RuntimeError - If original is None and the document has already been closed.

Password-protected PDFs

PageImage rendering supports only documents without a password. For password-protected PDFs, use Document.save_to_bytes() to obtain a decrypted copy, then open it with Document(bytes=...) and pass the resulting page to PageImage.

Attributes:

Attribute Type Description
original PIL.Image.Image The unmodified rendered page image
annotated PIL.Image.Image The working copy with all annotations applied
scale float Ratio of image pixels to page points (image_width / page_width)
bbox BBox Page coordinate space: (0, 0, page.width, page.height)
resolution Union[int, float] The DPI used for rendering
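The relationship between resolution and scale can be worked through with plain arithmetic. Since a PDF point is 1/72 inch, a page rendered at a given DPI should have a scale close to resolution / 72. The helpers `expected_scale` and `to_pixels` below are hypothetical, and the sketch assumes that the rendered image and the page share the same origin and axis directions.

```python
POINTS_PER_INCH = 72.0  # PDF user-space unit: 1 point = 1/72 inch


def expected_scale(resolution: float) -> float:
    """Pixels per point when a page is rendered at `resolution` DPI."""
    return resolution / POINTS_PER_INCH


def to_pixels(bbox, scale):
    """Convert a page-space bbox (x1, y1, x2, y2) to integer pixel coordinates."""
    return tuple(round(v * scale) for v in bbox)


scale = expected_scale(144)
print(scale)                                            # 2.0
print(to_pixels((100.0, 100.0, 400.0, 300.0), scale))   # (200, 200, 800, 600)
```

At the default resolution of 72, the scale is 1.0 and page coordinates map directly to pixels.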

Methods:

Method Returns Description
reset() PageImage Discard all annotations and restore annotated to original
copy() PageImage Return a new PageImage sharing the same original but with an independent annotated copy
save(dest, format, quantize, colors, bits, **kwargs) None Save the annotated image to a file path or BytesIO
show() None Display the annotated image (calls PIL.Image.show)
_repr_png_() bytes Return PNG bytes for Jupyter notebook inline display

Drawing methods (all return self for chaining):

Method Description
draw_line(points, stroke, stroke_width) Draw a polyline. Accepts a tuple or list of two (x, y) points
draw_lines(list_of_lines, stroke, stroke_width) Draw multiple lines
draw_vline(location, stroke, stroke_width) Draw a vertical line spanning the full page height at x = location
draw_vlines(locations, stroke, stroke_width) Draw multiple vertical lines
draw_hline(location, stroke, stroke_width) Draw a horizontal line spanning the full page width at y = location
draw_hlines(locations, stroke, stroke_width) Draw multiple horizontal lines
draw_rect(bbox, fill, stroke, stroke_width) Draw a filled rectangle. Accepts a 4-tuple bbox (x1, y1, x2, y2)
draw_rects(list_of_rects, fill, stroke, stroke_width) Draw multiple rectangles
draw_circle(center, radius, fill, stroke) Draw a circle. Accepts a (cx, cy) center tuple
draw_circles(list_of_circles, radius, fill, stroke) Draw multiple circles
debug_table(table, fill, stroke, stroke_width) Draw a filled rectangle over every cell in a Table
debug_tablefinder(tf_settings, **kwargs) Draw all detected tables (cell outlines) and detected edges

Color arguments (fill, stroke in the methods above): accept either an RGBA tuple (r, g, b, a) or a string. String colors are resolved via PIL's ImageColor.getrgb. For the list of supported string formats, see the ImageColor reference. Alpha is set to 255 (opaque) for string colors; for transparency use an RGBA tuple.

Default color constants (importable from tablers.debug):

Constant Value Description
DEFAULT_FILL (0, 0, 255, 50) Semi-transparent blue fill
DEFAULT_STROKE (255, 0, 0, 200) Near-opaque red stroke
DEFAULT_STROKE_WIDTH 1 Stroke width in pixels
DEFAULT_RESOLUTION 72 Default rendering DPI

Example — visualize table detection in Jupyter:

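from IPython.display import display

from tablers import Document
from tablers.debug import PageImage

with Document("example.pdf") as doc:
    page = doc.get_page(0)
    img = PageImage(page, resolution=150)

    # Draw tables, edges, and intersection points in one call
    img.debug_tablefinder()

    # Display inline. A bare `img` expression is only auto-displayed when it is
    # the last top-level statement of a cell, so call display() inside the block.
    display(img)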

Example — annotate and save:

from tablers import Document, find_tables
from tablers.debug import PageImage

with Document("example.pdf") as doc:
    page = doc.get_page(0)
    tables = find_tables(page, extract_text=False)

    img = PageImage(page)
    for table in tables:
        img.debug_table(table)
    img.save("annotated.png", quantize=False)

Example — method chaining:

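from tablers import Document
from tablers.debug import PageImage

with Document("example.pdf") as doc:
    page = doc.get_page(0)
    # Every drawing method returns the PageImage, so calls can be chained
    img = (
        PageImage(page)
        .draw_hline(200.0)
        .draw_vline(300.0)
        .debug_tablefinder()
    )
    img.save("debug.png", quantize=False)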