`pdfscraper.layout.text`

Module Contents

Classes

`TextLine`	A horizontal line of text.
`SortedTextlines`
`Word`	A text string representing one word. It's generated from a line of text by splitting on a space.
`Span`	An object with a rectangular bounding box.
`Line`
`Block`

Functions

`get_span_bbox`(span)	Calculate bounding box for a span.
`line2str`(line)

class pdfscraper.layout.text.TextLine(words)[source]

Bases: pdfscraper.layout.utils.Rectangular

A horizontal line of text.

property text[source]

__getitem__(key)[source]

__bool__()[source]

__str__()[source]: Return str(self).

__repr__()[source]: Return repr(self).

__contains__(text)[source]

class pdfscraper.layout.text.SortedTextlines(textlines, words, origin=None)[source]

Parameters: textlines (List[TextLine]) –

select(condition, retain_empty_lines=False)[source]

Find content matching condition.

Parameters: condition (Callable) –
Return type: SortedTextlines
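
As an illustration of the kind of filtering `select` describes, here is a minimal sketch that does not use the library. It assumes the condition is applied to each word and that `retain_empty_lines` keeps lines left with no matching words; neither behavior is confirmed by this page. `select_lines` and the plain string lists are hypothetical stand-ins.

```python
from typing import Callable, List


def select_lines(
    lines: List[List[str]],
    condition: Callable[[str], bool],
    retain_empty_lines: bool = False,
) -> List[List[str]]:
    """Hypothetical helper: keep only the words that satisfy `condition`."""
    selected = []
    for line in lines:
        kept = [word for word in line if condition(word)]
        # Assumption: lines with no matches are dropped unless asked otherwise.
        if kept or retain_empty_lines:
            selected.append(kept)
    return selected


lines = [["Total", "42"], ["Name", "Ada"], ["7", "9"]]
digits_only = select_lines(lines, str.isdigit)
with_empty = select_lines(lines, str.isdigit, retain_empty_lines=True)
```

Returning a new collection rather than mutating the input mirrors the documented return type, which is another `SortedTextlines`.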

resort()[source]

__repr__()[source]

Return repr(self).

Return type: str

class pdfscraper.layout.text.Word(bbox, text='', font='', size='', color=None, normalize_text=False)[source]

Bases: pdfscraper.layout.utils.Rectangular

A text string representing one word. It’s generated from a line of text by splitting on a space.

Parameters

bbox (pdfscraper.layout.utils.Bbox) –
text (str) –
font (str) –
size (str) –
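
The splitting described above can be sketched without the library. `SimpleWord` and `split_line` are hypothetical names, and the sketch assumes every word inherits the font and size of its line, which this page does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SimpleWord:
    """Hypothetical stand-in for Word; not the library class."""
    text: str
    font: str = ""
    size: str = ""


def split_line(text: str, font: str = "", size: str = "") -> List[SimpleWord]:
    # Split on single spaces and drop the empty pieces left by repeated spaces.
    return [SimpleWord(piece, font, size) for piece in text.split(" ") if piece]


words = split_line("Invoice  total: 42", font="Helvetica", size="10")
texts = [word.text for word in words]
```

The frozen dataclass makes the stand-in hashable and comparable, loosely matching the `__hash__` and `__eq__` methods listed for `Word`.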

__hash__()[source]

Return hash(self).

Return type: int

__repr__()[source]

Return repr(self).

Return type: str

__eq__(other)[source]

Return self==value.

Return type: bool

__str__()[source]

Return str(self).

Return type: str

class pdfscraper.layout.text.Span(bbox, words=None)[source]

Bases: pdfscraper.layout.utils.Rectangular

An object with a rectangular bounding box.

Parameters

bbox (pdfscraper.layout.utils.Bbox) –
words (List[Word]) –

property text[source]

__slots__ = ['words', 'bbox'][source]

__repr__()[source]: Return repr(self).

classmethod from_pymupdf(span, page_orientation)[source]

Parameters

span (dict) –
page_orientation (pdfscraper.layout.utils.PageOrientation) –

Return type

Span

classmethod from_pdfminer(span, page_orientation)[source]

Convert a list of pdfminer characters into a Span.

The list of characters is split on spaces into Words.

Parameters

span (List[pdfminer.layout.LTChar]) –
page_orientation (pdfscraper.layout.utils.PageOrientation) –

Return type

Span
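
A minimal sketch of grouping characters into words on spaces, independent of the library. `FakeChar` is a hypothetical stand-in for `pdfminer.layout.LTChar` that exposes only `get_text()`, and `group_chars` is an illustrative helper, not the method's actual implementation.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class FakeChar:
    """Hypothetical stand-in for pdfminer.layout.LTChar."""
    char: str

    def get_text(self) -> str:
        return self.char


def group_chars(chars: List[FakeChar]) -> List[List[FakeChar]]:
    # Consecutive characters are grouped by whether they are a space;
    # the runs of spaces are discarded, the other runs become words.
    runs = groupby(chars, key=lambda c: c.get_text() == " ")
    return [list(run) for is_space, run in runs if not is_space]


chars = [FakeChar(c) for c in "net  total 42"]
words = ["".join(c.get_text() for c in run) for run in group_chars(chars)]
```

Keeping the character objects in each group, rather than only their text, leaves the per-character geometry available for computing each word's bounding box.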

class pdfscraper.layout.text.Line(bbox, spans)[source]

Parameters: bbox (pdfscraper.layout.utils.Bbox) –

property text[source]

__slots__ = ['bbox', 'spans'][source]

__repr__()[source]: Return repr(self).

class pdfscraper.layout.text.Block(bbox, lines)[source]

Parameters: bbox (pdfscraper.layout.utils.Bbox) –

__slots__ = ['bbox', 'lines'][source]

__repr__()[source]: Return repr(self).

pdfscraper.layout.text.get_span_bbox(span)[source]

Calculate bounding box for a span.

Parameters: span (List) –
Return type: pdfscraper.layout.utils.Bbox
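
One common way to compute a bounding box for a group of items is to take the union of their individual boxes. The sketch below assumes boxes are `(x0, y0, x1, y1)` tuples; `Box` and `union_bbox` are hypothetical names, and the library's `Bbox` type may be laid out differently.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    """Hypothetical box type; not pdfscraper.layout.utils.Bbox."""
    x0: float
    y0: float
    x1: float
    y1: float


def union_bbox(boxes: Iterable[Box]) -> Box:
    boxes = list(boxes)
    if not boxes:
        raise ValueError("cannot compute a bounding box for an empty span")
    # The union is the smallest box that covers every input box.
    return Box(
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


span_box = union_bbox([Box(0, 0, 5, 10), Box(5, -2, 12, 8)])
```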

pdfscraper.layout.text.line2str(line)[source]

Parameters: line (List[Word]) –
Return type: str
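
The page gives no description for `line2str`. Given the signature, a plausible reading is that it joins the text of each word with single spaces; the sketch below shows that reading only and is an assumption. `SimpleWord` and `join_words` are hypothetical names.

```python
from typing import List, NamedTuple


class SimpleWord(NamedTuple):
    """Hypothetical stand-in for Word, carrying only its text."""
    text: str


def join_words(line: List[SimpleWord]) -> str:
    # Assumption: words are separated by exactly one space.
    return " ".join(word.text for word in line)


line = [SimpleWord("Total"), SimpleWord("due:"), SimpleWord("42")]
sentence = join_words(line)
```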
