{"id":1694,"date":"2015-12-01T13:00:42","date_gmt":"2015-12-01T12:00:42","guid":{"rendered":"http:\/\/archivista.ch\/cms\/?page_id=1694"},"modified":"2018-01-18T20:00:10","modified_gmt":"2018-01-18T19:00:10","slug":"ocr-speed-pdf","status":"publish","type":"page","link":"https:\/\/archivista.ch\/cms\/en\/news\/2008-2019\/year-2015\/ocr-speed-pdf\/","title":{"rendered":"OCR, speed &#038; PDF"},"content":{"rendered":"<h1>ArchivistaBox 2015\/X: text recognition, searchable PDF files and factor 2x optimisation<\/h1>\n<p><em><strong>Egg, 7th October 2015:<\/strong>\u00a0Version 2015\/X brings innovations that are capable of cutting the processing time for many tasks by at least 50%. The innovations allow processing in ArchivistaDMS to be spread\u00a0across all the CPU cores, as required. This results in both higher processing speeds\u00a0for the reading in of new documents and a significant increase in the text recognition rate (OCR). PDF documents can now be created directly in an external\u00a0Windows folder, with it being possible to use\u00a0the ArchivistaBox for the fully-automatic creation of searchable PDF files. Self-supporting archives can also now be created by ARM-based ArchivistaBox systems &#8211; the ISO file required for this has a size of 80 megabytes.<\/em><\/p>\n<p><em><img decoding=\"async\" style=\"width: 600px; height: 450px;\" src=\"https:\/\/archivista.ch\/cms\/wp-content\/uploads\/image\/gipfel.jpg\" alt=\"\" \/><\/em><\/p>\n<h2>Processing with as many CPU cores as\u00a0required<\/h2>\n<p>For some years now, \u00a0computers have increasingly been manufactured with multiple processors (CPUs). This can be extremely beneficial\u00a0for programs and applications, but only if they have already been optimised in this regard. Until now, this has only been the case with the\u00a0ArchivistaBox in respect of text recognition and on those occasions where there are many documents queueing to be processed. 
With the <strong>release of\u00a0Version 2015\/X, documents can now be processed in parallel by the available processors. With eight processors, for example, a 200-page document can be processed in just one-eighth of the time previously required (provided, of course, that all eight processors are in use simultaneously).<\/strong> This sounds trivial, but it is not: if all the computing time is allocated to a single task, bottlenecks can occur elsewhere.<\/p>\n<p>Admittedly, the operating system monitors the running applications so that no\u00a0single job is allocated all the system&#8217;s resources; instead, the available capacity is shared out. Even so, it is not a good idea to start too many programs at exactly the same time. For example, if 1000 text recognition jobs are started simultaneously, all the documents will be processed at once, but specific jobs can no longer be prioritised to finish first. In the\u00a0worst\u00a0case, if\u00a0there is too little memory (RAM) for the 1000 jobs, some of the documents will &#8216;hang&#8217; during processing.<\/p>\n<p><img decoding=\"async\" style=\"margin: 5px 10px; float: right; width: 293px; height: 519px;\" src=\"https:\/\/archivista.ch\/cms\/wp-content\/uploads\/image\/wegweiser.jpg\" alt=\"\" \/> The basic rule is: the number of simultaneously running programs or applications should equal the number of available CPU cores. This ensures that the jobs with the highest priority are executed. This is precisely where <strong>Version 2015\/X is strong. Depending on usage, the CPU cores can be allocated individually for each customer. <\/strong>Example A: many users are accessing the archive at the same time, but the volume of new documents to be recorded is fairly low. 
Solution A: only 1 or 2 CPU cores\u00a0need to be\u00a0reserved for processing purposes. Example B: a large number of\u00a0scanned documents require conversion into searchable PDF files as quickly as possible. Solution B: all CPU cores are released for processing purposes. This reduces archive access speed\u00a0somewhat, but the documents are processed more quickly (by a factor x).<\/p>\n<p>To conclude, a couple of measurements from actual situations: importing the German and English versions of the ArchivistaBox handbook (PDF files with 205 \/ 204 pages respectively), including text recognition using Tesseract, can now be accomplished in about 7 minutes and 30 seconds by the ArchivistaBox Matterhorn. This is a throughput of roughly one page per second, or some 80,000 pages per day. If documents that are already in digital format are to be processed, the <strong>409 pages can be processed in around 50 seconds &#8211; a throughput of over 700,000 pages per day.<\/strong><\/p>\n<p>By comparison, the &#8220;old&#8221; code needed around 19 minutes for the job with text recognition, and around 1 minute 50 seconds to import the handbook without it. With the &#8220;new&#8221; code, therefore, speed-up factors of between 2.2 and 2.5 can be achieved. The use of faster CPUs would, of course, allow these performance levels to be raised even further by factors of between four and six. And using a cluster would make it possible to scale up performance levels by almost any factor desired. However, the bottom line is that <strong>with the current code, the relevant hardware need only be capable of working half as fast in order to deliver the same level of performance.<\/strong><\/p>\n<h2>Creating searchable PDF documents<\/h2>\n<p>As already stated above, the current range of ArchivistaBox systems\u00a0is capable of achieving very good results, particularly in respect of text recognition. 
ArchivistaBoxes are not only suitable for use as document management systems (DMS); they can also be deployed as &#8220;flow heaters&#8221;, that is, as pass-through converters\u00a0for the creation of searchable PDF files. Previously,\u00a0the files\u00a0that were created had to be further processed using a script or the API (Application Programming Interface). Now, the searchable PDF files created can be copied directly to another network drive at any convenient time.<\/p>\n<p>The settings required for this can be made in\u00a0<strong>WebAdmin, under &#8220;OCR Definitions&#8221;<\/strong> and <strong>&#8220;Text Recognition Options OCR&#8221;:<\/strong><\/p>\n<p><img decoding=\"async\" style=\"width: 395px; height: 231px;\" src=\"https:\/\/archivista.ch\/cms\/wp-content\/uploads\/image\/ocrpdf.png\" alt=\"\" \/><\/p>\n<p>Once the option has been activated, the generated PDF files are saved directly into the sharing path previously specified in WebAdmin as soon as text recognition is complete. If no network drive is available, the generated PDF files are saved in the TEMP folder of the Archivista shared area.<\/p>\n<h2>Self-supporting archives for all<\/h2>\n<p>Self-supporting archives can now be created with all ArchivistaBoxes (formerly only with the Intel\/AMD models). Both the\u00a0ARM-based and the Intel\/AMD-based models now provide a &#8220;small&#8221; 80 megabyte ISO file, which allows self-supporting archives\u00a0to be created. In the ARM-based model, the ISO file (archivista_cd1.iso) has to be put into the ftp\/smb TEMP folder. In the Intel\/AMD boxes, the file can be uploaded using the ArchivistaVM &#8220;Home&#8221; button into the folder:\u00a0\/var\/lib\/vz\/template\/iso. The file can be found in the &#8220;download&#8221; folder, under the name: &#8216;selfrun.zip&#8217;. 
The password for unzipping remains the same for all OS files.<\/p>\n<p>Options can also be specified in\u00a0WebAdmin to optimise the space the archive requires. The image files can be compressed (a greater degree of compression is available for JPEG images) and the source and searchable PDF files can be excluded from export. This allows\u00a0Archivista archives of up to 50 gigabytes in size\u00a0(millions of pages) to be written to an ISO file and conveniently started up on an Intel\/AMD computer. The self-supporting archives run exclusively in the main memory (RAM); for trouble-free operation, the computer needs RAM equal to the size of the ISO file plus at least 600 megabytes.<\/p>\n<h2>Ready for production use<\/h2>\n<p>The <strong>ArchivistaBox 2015\/X is ready for immediate deployment<\/strong>, and customers with\u00a0valid maintenance contracts can request a current version at any time\u00a0by email or telephone. Updates can be conveniently imported using\u00a0WebConfig. 
Thanks to the <strong>new ARM-based ArchivistaBox range, which includes <a href=\"https:\/\/archivista.ch\/cms\/en\/news\/eight-core\/\">eight-core<\/a> as standard, a solution now exists\u00a0(for\u00a0<a href=\"https:\/\/archivista.ch\/cms\/en\/news\/year-2014\/dolder-amp-2014iii\/\">ArchivistaBox Dolder,<\/a> for example)\u00a0to create several tens of thousands of pages of searchable PDF files per day,<\/strong> as often as is required, at a starting price of just 360 Swiss francs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>ArchivistaBox 2015\/X: text recognition, searchable PDF files and factor 2x optimisation Egg, 7th October 2015:\u00a0Version 2015\/X brings innovations that are capable of cutting the processing time for many tasks by at least 50%. The innovations allow processing in ArchivistaDMS to be spread\u00a0across all the CPU cores, as required. 
This results in both higher processing speeds\u00a0for [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":1721,"menu_order":249,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"footnotes":""},"class_list":["post-1694","page","type-page","status-publish","hentry"],"acf":[],"_links":{"self":[{"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/pages\/1694","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/comments?post=1694"}],"version-history":[{"count":5,"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/pages\/1694\/revisions"}],"predecessor-version":[{"id":5228,"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/pages\/1694\/revisions\/5228"}],"up":[{"embeddable":true,"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/pages\/1721"}],"wp:attachment":[{"href":"https:\/\/archivista.ch\/cms\/wp-json\/wp\/v2\/media?parent=1694"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}