pdfscraper.layout.utils

Module Contents

Classes

VerticalOrientation

Direction of a Y-axis. Bottom→Top or Top→Bottom.

HorizontalOrientation

Direction of an X-axis. Left→Right or Right→Left.

Orientation

Directions of X and Y axes.

PageOrientation

Directions of X/Y axes together with page dimensions.

Bbox

A rectangular bounding box.

Rectangular

An object with a rectangular bounding box.

Backend

Enumeration of the supported PDF backends.

Color

Functions

create_bbox_backend(backend, coords, page_orientation)

Creates a bbox taking into account axis direction from a given page.

get_bbox(block)

get_rightmost(block)

get_leftmost(block)

get_topmost(block)

get_bottommost(block)

group_objs(words[, gap, decimals, axis])

Group words into vertically adjacent lines.

get_center_group(group)

Get the midpoint of a group of words.

get_center(obj)

Get the midpoint of a word.

flatten(items)

Yield items from any nested iterable.

groupby_consec(df, col)

Attributes

DEFAULT_BACKEND_PAGE_ORIENTATIONS

class pdfscraper.layout.utils.VerticalOrientation[source]

Direction of a Y-axis. Bottom→Top or Top→Bottom.

bottom_is_zero :bool[source]

class pdfscraper.layout.utils.HorizontalOrientation[source]

Direction of an X-axis. Left→Right or Right→Left.

left_is_zero :bool[source]

class pdfscraper.layout.utils.Orientation[source]

Directions of X and Y axes.

vertical_orientation :VerticalOrientation[source]
horizontal_orientation :HorizontalOrientation[source]
classmethod create(left_is_zero=True, bottom_is_zero=False)[source]

class pdfscraper.layout.utils.PageOrientation[source]

Directions of X/Y axes together with page dimensions.

property left_is_zero[source]
property bottom_is_zero[source]
orientation :Orientation[source]
page_height :float[source]
page_width :float[source]
classmethod create(left_is_zero=True, bottom_is_zero=False, page_width=None, page_height=None)[source]

class pdfscraper.layout.utils.Bbox[source]

Bases: NamedTuple

A rectangular bounding box.

property height: float[source]
Return type

float

property width: float[source]
Return type

float

x0 :float[source]
y0 :float[source]
x1 :float[source]
y1 :float[source]
__str__()[source]

Return str(self).

Return type

str

__eq__(other, decimals=1, n=4)[source]

Return self==value.

Return type

bool

move(delta=(0, 0, 0, 0))[source]
__add__(other)[source]
Parameters

other (Bbox) –

isclose(other, tolerance)[source]

Check if two bboxes are close to each other.

Parameters
  • other (Bbox) –

  • tolerance (float) –
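The description of isclose does not say how closeness is measured. A minimal sketch of one plausible reading, assuming that two boxes are close when every pair of matching coordinates differs by at most tolerance; SketchBbox and boxes_are_close are hypothetical names used only for illustration, not part of pdfscraper:

```python
import math
from typing import NamedTuple


class SketchBbox(NamedTuple):
    """Hypothetical stand-in for Bbox, used only in this example."""
    x0: float
    y0: float
    x1: float
    y1: float


def boxes_are_close(a: SketchBbox, b: SketchBbox, tolerance: float) -> bool:
    # Assumption: "close" means each pair of matching coordinates differs
    # by no more than `tolerance` (an absolute, not relative, difference).
    return all(
        math.isclose(p, q, rel_tol=0.0, abs_tol=tolerance)
        for p, q in zip(a, b)
    )


first = SketchBbox(10.0, 20.0, 110.0, 50.0)
second = SketchBbox(10.4, 19.8, 110.3, 50.1)
close_loose = boxes_are_close(first, second, tolerance=0.5)
close_strict = boxes_are_close(first, second, tolerance=0.1)
```

With these values the loose check passes and the strict one fails, because x0 differs by about 0.4.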

isinside(other)[source]

Check if this bbox is inside another bbox.

Parameters

other (Bbox) –

Return type

bool
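A containment check of this kind can be sketched as four coordinate comparisons. The version below assumes that touching edges still count as inside; SketchBbox and is_inside are hypothetical names, and the real method may treat edges differently:

```python
from typing import NamedTuple


class SketchBbox(NamedTuple):
    """Hypothetical stand-in for Bbox, used only in this example."""
    x0: float
    y0: float
    x1: float
    y1: float


def is_inside(inner: SketchBbox, outer: SketchBbox) -> bool:
    # Assumption: comparisons are inclusive, so a box that shares an edge
    # with the outer box is still considered inside it.
    return (
        outer.x0 <= inner.x0
        and outer.y0 <= inner.y0
        and inner.x1 <= outer.x1
        and inner.y1 <= outer.y1
    )


page = SketchBbox(0, 0, 600, 800)
word = SketchBbox(10, 20, 110, 50)
word_in_page = is_inside(word, page)
page_in_word = is_inside(page, word)
```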

classmethod from_coords(coords, invert_y=False, invert_x=False, page_height=None, page_width=None)[source]
Return type

Bbox

class pdfscraper.layout.utils.Rectangular[source]

An object with a rectangular bounding box.

property width[source]
property height[source]
property x0[source]
property x1[source]
property y0[source]
property y1[source]
bbox :Bbox[source]
move(delta)[source]

class pdfscraper.layout.utils.Backend[source]

Bases: enum.Enum

Enumeration of the supported PDF backends.

PDFMINER = 'pdfminer'[source]
PYMUPDF = 'pymupdf'[source]
pdfscraper.layout.utils.DEFAULT_BACKEND_PAGE_ORIENTATIONS :Dict[Literal[Backend, Backend], Orientation][source]

pdfscraper.layout.utils.create_bbox_backend(backend, coords, page_orientation)[source]

Creates a bbox taking into account axis direction from a given page.

Parameters
  • backend (Backend) – backend type

  • coords – 4-item sequence of x0, y0, x1, y1 coordinates

  • page_orientation (PageOrientation) – page size together with X/Y axes directions.

Returns

a bounding box

Return type

Bbox
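The idea of taking axis direction into account can be illustrated with a small sketch. It assumes that pdfminer reports y-coordinates measured from the bottom of the page and PyMuPDF from the top, and that a mismatch with the page's convention is resolved by mirroring against the page height. SketchBackend, SketchPage, NATIVE_BOTTOM_IS_ZERO and to_page_coords are hypothetical names; the actual function may work differently:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class SketchBackend(Enum):
    """Hypothetical stand-in for Backend."""
    PDFMINER = "pdfminer"
    PYMUPDF = "pymupdf"


@dataclass(frozen=True)
class SketchPage:
    """Hypothetical stand-in for PageOrientation (y-axis only)."""
    bottom_is_zero: bool
    page_height: float


# Assumed native conventions: pdfminer measures y from the bottom edge,
# PyMuPDF from the top edge.
NATIVE_BOTTOM_IS_ZERO = {
    SketchBackend.PDFMINER: True,
    SketchBackend.PYMUPDF: False,
}


def to_page_coords(
    backend: SketchBackend, coords: Sequence[float], page: SketchPage
) -> Tuple[float, float, float, float]:
    x0, y0, x1, y1 = coords
    # Flip only when the backend and the page disagree on the y-axis direction.
    if NATIVE_BOTTOM_IS_ZERO[backend] != page.bottom_is_zero:
        # Mirror both edges and swap them so that y0 <= y1 still holds.
        y0, y1 = page.page_height - y1, page.page_height - y0
    return (x0, y0, x1, y1)


page = SketchPage(bottom_is_zero=False, page_height=800)
from_pdfminer = to_page_coords(SketchBackend.PDFMINER, (10, 20, 110, 50), page)
from_pymupdf = to_page_coords(SketchBackend.PYMUPDF, (10, 750, 110, 780), page)
```

Under these assumptions both calls describe the same region, so the two results are identical.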

class pdfscraper.layout.utils.Color[source]
r :float[source]
g :float[source]
b :float[source]
__eq__(other, decimals=1)[source]

Return self==value.

pdfscraper.layout.utils.get_bbox(block)[source]
Return type

Tuple[float, float, float, float]

pdfscraper.layout.utils.get_rightmost(block)[source]
Return type

float

pdfscraper.layout.utils.get_leftmost(block)[source]
Return type

float

pdfscraper.layout.utils.get_topmost(block)[source]
Return type

float

pdfscraper.layout.utils.get_bottommost(block)[source]
Return type

float

pdfscraper.layout.utils.group_objs(words, gap=5, decimals=1, axis='y')[source]

Group words into vertically adjacent lines.

First, create a dictionary with rounded y-coordinates as keys, and lists of words as values. Then merge together lists whose coordinate delta is <= gap.

Parameters
  • words (List) – list of Words

  • gap (float) – vertical delta between lines to be merged.

  • decimals (int) – rounding precision.

  • axis (str) – horizontal (x) or vertical (y) grouping

Returns

vertically grouped lines; within each line, words are sorted horizontally.

Return type

List[List]
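The two steps described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: SketchWord and group_into_lines are hypothetical names, only the y-axis case is shown, and a bucket is merged when its key lies within gap of the previous key.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class SketchWord(NamedTuple):
    """Hypothetical word with a text and the position of its corner."""
    text: str
    x0: float
    y0: float


def group_into_lines(
    words: List[SketchWord], gap: float = 5, decimals: int = 1
) -> List[List[SketchWord]]:
    # Step 1: bucket words by their rounded y-coordinate.
    buckets: Dict[float, List[SketchWord]] = defaultdict(list)
    for word in words:
        buckets[round(word.y0, decimals)].append(word)

    # Step 2: walk the keys in ascending order and merge a bucket into the
    # current line when it lies within `gap` of the previous key.
    lines: List[List[SketchWord]] = []
    previous_key = None
    for key in sorted(buckets):
        if previous_key is not None and key - previous_key <= gap:
            lines[-1].extend(buckets[key])
        else:
            lines.append(list(buckets[key]))
        previous_key = key

    # Sort every line from left to right.
    return [sorted(line, key=lambda w: w.x0) for line in lines]


words = [
    SketchWord("world", 60, 100.2),
    SketchWord("Hello", 10, 100.0),
    SketchWord("Second", 10, 120.0),
]
lines = group_into_lines(words)
texts = [[w.text for w in line] for line in lines]
```

Here "Hello" and "world" differ by 0.2 in y and end up on one line, while "Second" is about 20 units away and starts a new one.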

pdfscraper.layout.utils.get_center_group(group)[source]

Get the midpoint of a group of words.

Parameters

group (List) –

Return type

float

pdfscraper.layout.utils.get_center(obj)[source]

Get the midpoint of a word.

Return type

float

pdfscraper.layout.utils.flatten(items)[source]

Yield items from any nested iterable.
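A common way to write such a generator is the recursive recipe below. It is a sketch, not the library's code: flatten_sketch is a hypothetical name, and treating strings and bytes as atomic values is an assumption.

```python
from collections.abc import Iterable


def flatten_sketch(items):
    for item in items:
        # Strings and bytes are iterable too, but are treated here as
        # single values (an assumption of this sketch).
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten_sketch(item)
        else:
            yield item


flat = list(flatten_sketch([1, [2, [3, "ab"]], (4, 5)]))
```

Because the function is a generator, items are produced lazily; wrapping the call in list() collects them.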

pdfscraper.layout.utils.groupby_consec(df, col)[source]