For scanned documents, text recognition (OCR) is usually performed so that the content can be found via the free text query. For non-scanned documents, a PDF file is (almost) always created for viewing the content, and the text it contains is read out of that PDF file for the free text query, i.e. no text recognition is used. This is also the case if directly searchable PDF files are available. The ArchivistaBox uses the 'pdftotext' tool for this. The text is usually formatted as closely as possible to its representation in the PDF file. In cases where, when publishing, the content cannot be found in the PDF file according to the visual reading flow, it may be necessary to extract the text using other (additional) options. These can be specified in this field. The following pdftotext options are available:
-f <int> : first page to convert
-l <int> : last page to convert
-r <fp> : resolution, in DPI (default is 72)
-x <int> : x-coordinate of the crop area top left corner
-y <int> : y-coordinate of the crop area top left corner
-W <int> : width of crop area in pixels (default is 0)
-H <int> : height of crop area in pixels (default is 0)
-layout : maintain original physical layout
-fixed <fp> : assume fixed-pitch (or tabular) text
-raw : keep strings in content stream order
-nodiag : discard diagonal text
-htmlmeta : generate a simple HTML file, including the meta information
-tsv : generate a simple TSV file, including the meta information for bounding boxes
-enc <string> : output text encoding name
-listenc : list available encodings
-eol <string> : output end-of-line convention (unix, dos, or mac)
-nopgbrk : don't insert page breaks between pages
-bbox : output bounding box for each word and page size to html. Sets -htmlmeta
-bbox-layout : like -bbox but with extra layout bounding box data. Sets -htmlmeta
-cropbox : use the crop box rather than media box
-colspacing <fp> : how much spacing we allow after a word before considering adjacent text to be a new column, as a fraction of the font size (default is 0.7, old releases had a 0.3 default)
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
The -raw option is particularly worth mentioning here. It keeps the text in content stream order and can be used, for example, to prepare column-based PDF files so that the text flows along the columns.
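
To check what a given set of options does before entering it in this field, pdftotext can be run by hand against a sample PDF file. The following is a minimal sketch in Python, not the ArchivistaBox's own code: how the ArchivistaBox passes the field to pdftotext internally is not described here, and the helper names `build_pdftotext_cmd` and `extract_text` as well as the file name `sample.pdf` are hypothetical. It assumes only the standard pdftotext call form (options, then the PDF file, then the text file, with `-` meaning standard output).

```python
import shlex
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, extra_options=""):
    """Return the argument list for one pdftotext call.

    extra_options is a plain string such as "-raw" or "-f 1 -l 3",
    i.e. the kind of value that would be entered in the field.
    The trailing "-" asks pdftotext to write the text to standard output.
    """
    return ["pdftotext", *shlex.split(extra_options), pdf_path, "-"]


def extract_text(pdf_path, extra_options=""):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, extra_options),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext reports a failure
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical sample file: compare the default output with -raw.
    for options in ("", "-raw"):
        print(f"--- options: {options!r} ---")
        print(extract_text("sample.pdf", options)[:500])
```

Comparing the two outputs for a multi-column document shows fairly quickly whether -raw alone gives the desired reading order or whether further options (for example -f and -l to restrict the page range) are needed.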