pdfscraper.layout.text
Module Contents
Classes
TextLine | A horizontal line of text.
Word | A text string representing one word. It is generated from a line of text by splitting on a space.
Span | An object with a rectangular bounding box.
Functions
Calculate bounding box for a span.
- class pdfscraper.layout.text.TextLine(words)[source]
Bases: pdfscraper.layout.utils.Rectangular
A horizontal line of text.
- class pdfscraper.layout.text.SortedTextlines(textlines, words, origin=None)[source]
- Parameters
textlines (List[TextLine]) –
- class pdfscraper.layout.text.Word(bbox, text='', font='', size='', color=None, normalize_text=False)[source]
Bases: pdfscraper.layout.utils.Rectangular
A text string representing one word. It’s generated from a line of text by splitting on a space.
- Parameters
bbox (pdfscraper.layout.utils.Bbox) –
text (str) –
font (str) –
size (str) –
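To illustrate what "splitting on a space" can look like, here is a minimal sketch. It does not use pdfscraper itself: SketchWord and split_line are hypothetical stand-ins, the bounding box is left out, and dropping the empty pieces produced by runs of spaces is an assumption, since the description above does not say how Word handles them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchWord:
    """Hypothetical stand-in for Word; the bbox argument is omitted for brevity."""
    text: str
    font: str = ""
    size: str = ""


def split_line(text: str, font: str = "", size: str = "") -> List[SketchWord]:
    """Split on single spaces and drop the empty pieces left by runs of spaces."""
    # str.split(" ") keeps an empty string for each extra space, so filter them out.
    return [SketchWord(piece, font, size) for piece in text.split(" ") if piece]


# Every word inherits the font and size of the line it came from.
line_words = split_line("Total  amount due", font="Helvetica", size="9.0")
texts = [w.text for w in line_words]
```

Each resulting word carries the same font and size as its source line, which matches the per-word font and size parameters listed above.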
- class pdfscraper.layout.text.Span(bbox, words=None)[source]
Bases: pdfscraper.layout.utils.Rectangular
An object with a rectangular bounding box.
- Parameters
bbox (pdfscraper.layout.utils.Bbox) –
words (List[Word]) –
- classmethod from_pymupdf(span, page_orientation)[source]
- Parameters
span (dict) –
page_orientation (pdfscraper.layout.utils.PageOrientation) –
- Return type
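For orientation, the span dict here is presumably one of the span entries PyMuPDF produces from page.get_text("dict"), which carry keys such as "bbox", "text", "font", "size" and "color". The sketch below only shows how those keys line up with the Word parameters. The helper span_fields and the sample values are hypothetical, and how from_pymupdf actually consumes the dict is not documented on this page.

```python
from typing import Any, Dict

# Sample span in the shape PyMuPDF returns from page.get_text("dict");
# the values are made up for illustration.
sample_span: Dict[str, Any] = {
    "size": 11.0,
    "flags": 0,
    "font": "Helvetica",
    "color": 0x1A2B3C,  # sRGB packed into one integer
    "origin": (72.0, 100.0),
    "bbox": (72.0, 90.0, 130.0, 103.0),
    "text": "Invoice total",
}


def span_fields(span: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the fields that correspond to Word's constructor arguments."""
    return {
        "bbox": tuple(span["bbox"]),
        "text": span["text"],
        "font": span["font"],
        "size": str(span["size"]),  # Word documents size as str
        "color": f"#{span['color']:06x}",  # hex form, chosen here for readability
    }


fields = span_fields(sample_span)
```

Converting size to a string and color to a hex string are choices made for this sketch, not behavior confirmed by the library.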
- classmethod from_pdfminer(span, page_orientation)[source]
Convert a list of pdfminer characters into a Span. The characters are split on spaces into Words.
- Parameters
span (List[pdfminer.layout.LTChar]) – list of characters
page_orientation (pdfscraper.layout.utils.PageOrientation) –
- Return type
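The character-to-word grouping described for from_pdfminer can be sketched as follows. This is not the library's implementation: Char is a hypothetical stand-in for pdfminer.layout.LTChar, boxes are plain (x0, y0, x1, y1) tuples instead of pdfscraper.layout.utils.Bbox, and taking each word's box as the union of its character boxes is an assumption.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


class Char(NamedTuple):
    """Hypothetical stand-in for pdfminer.layout.LTChar."""
    text: str
    bbox: Box


def union(boxes: List[Box]) -> Box:
    """Smallest box enclosing all the given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def chars_to_words(chars: List[Char]) -> List[Tuple[str, Box]]:
    """Group characters into (text, bbox) words, breaking on spaces."""
    groups: List[List[Char]] = []
    current: List[Char] = []
    for ch in chars:
        if ch.text == " ":
            # A space closes the word in progress, if there is one.
            if current:
                groups.append(current)
                current = []
        else:
            current.append(ch)
    if current:
        groups.append(current)
    return [("".join(c.text for c in g), union([c.bbox for c in g])) for g in groups]


chars = [
    Char("H", (0.0, 0.0, 5.0, 10.0)),
    Char("i", (5.0, 0.0, 7.0, 10.0)),
    Char(" ", (7.0, 0.0, 10.0, 10.0)),
    Char("y", (10.0, -2.0, 15.0, 8.0)),
    Char("o", (15.0, 0.0, 20.0, 8.0)),
]
words = chars_to_words(chars)
```

The space character contributes to neither word, so each word's box covers only its own glyphs.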
- class pdfscraper.layout.text.Line(bbox, spans)[source]
- Parameters
bbox (pdfscraper.layout.utils.Bbox) –
- class pdfscraper.layout.text.Block(bbox, lines)[source]
- Parameters
bbox (pdfscraper.layout.utils.Bbox) –