pdfscraper.layout.text

Module Contents

Classes

TextLine

A horizontal line of text.

SortedTextlines

Word

A text string representing one word. It's generated from a line of text by splitting on a space.

Span

An object with a rectangular bounding box.

Line

Block

Functions

get_span_bbox(span)

Calculate bounding box for a span.

line2str(line)

class pdfscraper.layout.text.TextLine(words)

Bases: pdfscraper.layout.utils.Rectangular

A horizontal line of text.

property text

__getitem__(key)

__bool__()

__str__()

Return str(self).

__repr__()

Return repr(self).

__contains__(text)

class pdfscraper.layout.text.SortedTextlines(textlines, words, origin=None)

Parameters

textlines (List[TextLine])

select(condition, retain_empty_lines=False)

Find content matching condition.

Parameters

condition (Callable)

Return type

SortedTextlines
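
To illustrate how a condition callable could drive a selection like the one above, here is a minimal sketch. It does not use pdfscraper: SimpleWord and select_lines are hypothetical stand-ins, and both the per-word application of the condition and the meaning given to retain_empty_lines are assumptions rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimpleWord:
    """Hypothetical stand-in for a word: only its text is modelled."""
    text: str


def select_lines(
    lines: List[List[SimpleWord]],
    condition: Callable[[SimpleWord], bool],
    retain_empty_lines: bool = False,
) -> List[List[SimpleWord]]:
    """Keep only the words that satisfy `condition` (assumed per-word semantics)."""
    selected = []
    for line in lines:
        kept = [word for word in line if condition(word)]
        # Assumption: lines left without any matching word are dropped
        # unless the caller asks to retain them.
        if kept or retain_empty_lines:
            selected.append(kept)
    return selected


lines = [
    [SimpleWord("Invoice"), SimpleWord("2021")],
    [SimpleWord("Total"), SimpleWord("due")],
]
numeric = select_lines(lines, lambda word: word.text.isdigit())
numeric_with_empty = select_lines(
    lines, lambda word: word.text.isdigit(), retain_empty_lines=True
)
```

In this sketch the result is a plain list; the documented method returns a SortedTextlines instead.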

resort()

__repr__()

Return repr(self).

Return type

str

class pdfscraper.layout.text.Word(bbox, text='', font='', size='', color=None, normalize_text=False)

Bases: pdfscraper.layout.utils.Rectangular

A text string representing one word. It’s generated from a line of text by splitting on a space.

__hash__()

Return hash(self).

Return type

int

__repr__()

Return repr(self).

Return type

str

__eq__(other)

Return self==value.

Return type

bool

__str__()

Return str(self).

Return type

str

class pdfscraper.layout.text.Span(bbox, words=None)

Bases: pdfscraper.layout.utils.Rectangular

An object with a rectangular bounding box.

property text

__slots__ = ['words', 'bbox']

__repr__()

Return repr(self).

classmethod from_pymupdf(span, page_orientation)

Return type

Span

classmethod from_pdfminer(span, page_orientation)

Convert a list of pdfminer characters into a Span. The characters are split on spaces into Words.

Parameters

span – list of characters

Return type

Span
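
The splitting step described for from_pdfminer can be pictured with the sketch below. It is not the library's implementation: the Char tuple layout and the split_chars_into_words helper are hypothetical, and real pdfminer characters are objects rather than tuples.

```python
from typing import List, Tuple

# Hypothetical character record: (text, (x0, y0, x1, y1)).
Char = Tuple[str, Tuple[float, float, float, float]]


def split_chars_into_words(chars: List[Char]) -> List[List[Char]]:
    """Group consecutive non-space characters into words."""
    words: List[List[Char]] = []
    current: List[Char] = []
    for char in chars:
        text, _bbox = char
        if text.isspace():
            # A space closes the current word, if there is one.
            if current:
                words.append(current)
                current = []
        else:
            current.append(char)
    # Flush the last word when the input does not end with a space.
    if current:
        words.append(current)
    return words


chars = [
    ("H", (0, 0, 5, 10)),
    ("i", (5, 0, 8, 10)),
    (" ", (8, 0, 11, 10)),
    ("y", (11, 0, 16, 10)),
    ("o", (16, 0, 21, 10)),
    ("u", (21, 0, 26, 10)),
]
words = split_chars_into_words(chars)
word_texts = ["".join(text for text, _ in word) for word in words]
```

Keeping each character's box alongside its text means a bounding box can still be derived for every resulting word.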

class pdfscraper.layout.text.Line(bbox, spans)

Parameters

bbox (pdfscraper.layout.utils.Bbox)

property text

__slots__ = ['bbox', 'spans']

__repr__()

Return repr(self).

class pdfscraper.layout.text.Block(bbox, lines)

Parameters

bbox (pdfscraper.layout.utils.Bbox)

__slots__ = ['bbox', 'lines']

__repr__()

Return repr(self).

pdfscraper.layout.text.get_span_bbox(span)

Calculate bounding box for a span.

Parameters

span (List)

Return type

pdfscraper.layout.utils.Bbox
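
One common way to compute a bounding box for a group of items is to take the union of their individual boxes. The sketch below shows that idea under stated assumptions: Box is a hypothetical stand-in for pdfscraper.layout.utils.Bbox, the (x0, y0, x1, y1) field layout is assumed, and union_bbox is not claimed to match what get_span_bbox does internally.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for pdfscraper.layout.utils.Bbox."""
    x0: float
    y0: float
    x1: float
    y1: float


def union_bbox(boxes: List[Box]) -> Box:
    """Return the smallest box enclosing every box in `boxes`."""
    if not boxes:
        raise ValueError("cannot compute a bounding box for an empty span")
    return Box(
        x0=min(box.x0 for box in boxes),
        y0=min(box.y0 for box in boxes),
        x1=max(box.x1 for box in boxes),
        y1=max(box.y1 for box in boxes),
    )


# Two adjacent boxes on roughly the same baseline.
span_box = union_bbox([Box(10, 20, 30, 32), Box(32, 21, 55, 33)])
```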

pdfscraper.layout.text.line2str(line)

Parameters

line (List[Word])

Return type

str
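
The page gives no description for line2str beyond its signature. A plausible reading, shown here purely as an assumption, is that it joins the text of each word with a space. SimpleWord and join_words are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleWord:
    """Hypothetical stand-in for a word: only its text is modelled."""
    text: str


def join_words(line: List[SimpleWord]) -> str:
    """Assumed behaviour: join the text of each word with a single space."""
    return " ".join(word.text for word in line)


sentence = join_words(
    [SimpleWord("Page"), SimpleWord("1"), SimpleWord("of"), SimpleWord("3")]
)
```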