Andreas Ch.
M.Sc.Eng. Andreas Ch. Kazmierczak is the founder and developer of Print2CAD Software. He has been developing software since 1982. He gained his knowledge of software development during his university studies, followed by postgraduate training. Andreas attended the Technical University in Aachen, Germany, where he earned his Master's degree in Engineering.
1. Introduction
1.1 What is Print2CAD?
1.2 What is a PDF?
1.3 What is DWG?
1.4 What is DXF?
1.5 System Requirements
- System Requirements
- System Requirements Hardware
1.6 License Agreement
§ 1 Waiver of Responsibility
§ 2 You Agree to the Following Terms and Restrictions
§ 3 Copyrights
2. Installation
3. Conversion of Different PDF Formats
3.1 Vector Based Data Made from CAD Systems
3.2 Vector Based, Through a Plotter Interface Exported PDF File
3.
5. Main Menu
5.1 File Selection
5.2 Output Files
5.3 Target Directory for Converted Files
5.4 Conversion of Directories
5.5 Version of the Target File
5.6 Wizard
5.7 Activation of the Program (depending on Purchasing Method)
5.8 Load and Save Program Settings
6. Optimization - Pages and Coordinates
6.1 Selection of PDF Pages
6.2 Scaling of Coordinates
6.3 Rotation of Coordinates
6.4 Transformation of Coordinates
6.5 Purge Bright Elements on Bright Background
7.
9.3 Improvement of the Vectorization Process
9.3.1 Recognition of Horizontal and Vertical Lines
9.3.2 Recognition of Inclined Lines
9.3.3 Circle and Arc Recognition
9.4 Improving Pixel Images Before the Vectorization
9.4.1 Filling Small Holes in the Pixel Traces
9.4.2 Filling All Holes in the Pixel Traces
9.4.3 Thinning Out the Pixel Lines
9.4.4 Making the Pixel Traces Thicker
9.4.5 Removal of Free Pixels
9.4.6 Closing Slightly Opened Pixel Traces
9.
12. Vectorization Expert Settings
12.1 Smoothing of Polylines
12.2 Enforce Smoothing Motion
12.3 Max gap jump in pixels
12.4 Tolerance in Pixels
12.5 Conjugation Tolerance in Pixels
12.6 Minimum Pixel Length
12.7 Arc Tolerance in Pixels
12.8 Angle Sensitivity in Pixels
13. Configuration
13.1 Choosing The Program Language
13.2 Unit Of The Converted DWG Or DXF File
13.3 Keeping Settings After Ending The Program
13.4 Using Prefix “Print2CAD-” For Converted Files
13.
17.4 Color Type
17.4.1 Grayscale Color Space
17.4.2 RGB Color Space
17.4.3 RGBA Color Space
17.4.4 CMYK Color Space
17.5 OCR Definition
17.6 Conversion of selected PDF Pages
18. Analysis of a PDF File
19. DWG, DXF to PDF Conversion
19.1 PDF Header
19.2 Embedding Fonts
19.3 Geometry Optimization
19.4 TTF Fonts as Geometry
19.5 Zoom To Extensions
19.6 Generate PDF Layer
19.7 Output Disabled Layers
19.8 Convert Model Space
19.
21. OCR-Mode - Text, Line Type and Coordinates Recognition
22. OCR Text Recognition
22.1 General
22.2 Procedure
22.2.1 Breakdown Detection
22.2.2 Adjusting the Outlined Areas
22.2.3 Recognition of Pattern
22.2.3.1 Correcting Errors at the Pixel Level
22.2.3.2 Pattern Matching Mapping
22.2.3.3 Error Correction on Plane of Projection
22.2.3.4 Error Correction on Word Level
22.2.4 Manual Correction of the Recognized Texts
22.2.
Print2CAD OCR 2013 Print2CAD OCR 2013- 9
1. Introduction 1.1 What is Print2CAD? Print2CAD is an application that converts PDF files into DWG or DXF files that can be imported into and edited in any CAD system. Print2CAD also converts PDF into raster formats (TIFF, JPEG, etc.) and converts DWG or DXF files into PDFs. Print2CAD is a standalone program that works independently of any CAD system. In other words, you do not need a CAD program to use Print2CAD.
The original imaging model of PDF was, like PostScript's, opaque: each object drawn on the page completely replaced anything previously marked in the same location. In PDF 1.4 the imaging model was extended to allow transparency. When transparency is used, new objects interact with previously marked objects to produce blending effects. The addition of transparency to PDF was done by means of new extensions that were designed to be ignored in products written to the PDF 1.3 and earlier specifications.
1.3 What is DWG? DWG (“drawing”) is a file format used for storing two- and three-dimensional design data and metadata. It is the native binary format of AutoCAD and other Autodesk products. Almost all CAD systems are able to import DWG files. DWG is the native and proprietary file format for AutoCAD® and a trademark of Autodesk, Inc. The .bak (drawing backup), .dws (drawing standards), .dwt (drawing template) and .sv$ (temporary automatic save) files are also DWG files.
AutoCAD DXF (Drawing Interchange Format, or Drawing Exchange Format) is a CAD data file format developed by Autodesk for enabling data interoperability between AutoCAD and other programs. DXF was originally introduced in December 1982 as part of AutoCAD 1.0 and was intended to provide an exact representation of the data in the AutoCAD native file format, DWG. Versions of AutoCAD from Release 10 (October 1988) and up support both ASCII and binary forms of DXF. Earlier versions support only ASCII.
1.5 System Requirements
Input:
PDF, all versions (raster and vector)
TIFF, JPEG, GIF, PNG
HPGL, HPGL-2
DWF (2D)
Output:
DWG, all versions (as RealDWG™, fully compatible with AutoCAD, AutoCAD LT and all other CAD systems)
DXF, all versions (Version 12, 2000-2013 compatible) for all CAD systems
TIFF, JPEG, PNG, GIF, BMP, RAW
PDF
Patents Pending
German Patents Pending: 10 2006 015 957.8, 10 2007 003 485.9 and 10 2007 046 116.
Developers:
Kazmierczak Software GmbH, Sandbühlstr. 12, D-70794 Filderstadt, Germany
BackToCAD Technologies, LLC, 400 Galleria Pkwy, Suite 1500, Atlanta, GA 30339, USA
Internet: www.dxf.de, www.backtocad.com
DWG is the native and proprietary file format for AutoCAD® and a trademark of Autodesk.
§ 2 You Agree to the Following Terms and Restrictions 1. The transfer module of Print2CAD™ software may be installed and used on one computer only. It may not be installed on multiple computers used by different people simultaneously. 2. End Licensees agree not to alter, reverse engineer or disassemble the Software Application.
In no event shall Licensee or its suppliers be liable in any way for indirect, special or consequential damages of any nature, including without limitation, lost business profits, or liability or injury to third persons, whether foreseeable or not, regardless of whether Licensee or its suppliers have been advised of the possibility of such damages. 6. You are not entitled to loan, rent, nor to use it as the basis for software programs of your own. 7.
§ 3 Copyrights Copyright © 2006-2013 Kazmierczak® Software GmbH, Germany . All rights reserved. Contains Autodesk® RealDWG by Autodesk, Inc. Copyright© 1998-2012 Autodesk, Inc. All rights reserved. PVGOUTLIB: Copyright (c) Soft Tolls GmbH. All rights reserved. IMAGE POWER JPEG-2000: Copyright (c) 2001-2003 Michael David Adams. All rights reserved. See jasper_license.txt OpenSSL: Copyright (C) 1995-1998 Eric Young (eay@cryptsoft.com). All rights reserved. See openssl_license.txt.
The installation is valid for all 32-bit and 64-bit versions of Windows 7, XP, and Vista. The description below assumes that the installation CD-ROM drive is D:\ and the target hard disk is C:\. For other drives, the installation should be carried out similarly.
a. Download the installation program and burn it onto a CD.
b. Restart your machine, then insert the CD-ROM in the drive.
c. Navigate to your CD-ROM drive (e.g. with Explorer).
d. Start the installation by double clicking on installation program .
3. Conversion of Different PDF Formats Invented by Adobe Systems in 1993, the Portable Document Format (PDF) is a data format for documents which can be used on many different platforms. In the last few years the PDF format has had unrivaled success; it is used not only for text documents but also for blueprints from CAD software. The groundbreaking idea behind this success is the scalability of the document. The scalability of PDF is possible because PDF is vector based and not pixel based.
Figure: A true PDF file with native elements. Figure: A PDF file with no native elements. It contains only a raster picture.
3.1 Vector Based Data Made from CAD Systems Vector based PDF files are the real PDF data format. The native PDF entities such as polylines, texts, native hatches are used. This kind of PDF file is created directly from a CAD application without using a plotter interface. In other words, it is exported into PDF, not “Printed To...” PDF. This kind of PDF is excellent for converting the data into DXF and DWG. The coordinates are exact enough to be used for the purpose of CAD.
This type of PDF is exported from a CAD program using a plotter interface. This type of PDF has only lines and hatches, often with a resolution of 75 dpi. Whereas in the original CAD drawing coordinates can be placed in any location, plotters and printers use DPI or Dots Per Inch. Thus, there is a limit to the locations where a coordinate can be placed. When a DWG is “Printed To...” PDF, the coordinates are snapped to the closest “dot” in the set “Dots Per Inch.
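The loss of accuracy caused by a plotter interface can be illustrated with a minimal Python sketch. It is not part of Print2CAD; the function name `snap_to_dpi_grid` and the sample coordinates are assumptions made for illustration only.

```python
def snap_to_dpi_grid(x_mm, y_mm, dpi=75):
    """Snap a coordinate given in millimetres to the nearest dot of a dpi grid."""
    dot_mm = 25.4 / dpi  # size of one dot in millimetres
    return (round(x_mm / dot_mm) * dot_mm, round(y_mm / dot_mm) * dot_mm)


original = (10.0, 20.0)  # hypothetical CAD coordinate in mm
snapped = snap_to_dpi_grid(*original, dpi=75)
# Largest deviation introduced by the snapping along either axis
error = max(abs(a - b) for a, b in zip(original, snapped))
print(snapped, error)
```

At 75 dpi one dot is about 0.34 mm wide, so a snapped coordinate can deviate from the original by up to half a dot along each axis.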
3.3 Raster-Based PDF Files A raster-based PDF is one containing only pixels. This type of PDF data does not include any native PDF elements like lines, hatches or text. The quality of the conversion is thus based on the resolution of the scan. These raster pictures have to be vectorized during a conversion to DWG or DXF. This kind of PDF is poorly suited for converting the data into DXF and DWG. The coordinates are of very poor quality and are not accurate enough to be used for the purpose of CAD.
This is a combination of vector and raster formats, with all the pros and cons in one. The hybrid PDF is a real PDF file that contains lines, texts and hatches. This data also contains raster pictures. In this case, you have to decide how to handle the PDF raster pictures. Print2CAD offers many options for vectorizing raster pictures. This kind of PDF is well suited for converting the data into DXF and DWG. The native PDF data will convert properly.
4. Conversion with the Help of our Wizard Print2CAD 2013 has a “Wizard” that is unique throughout the world. The idea behind the Wizard is that the user views the original drawing (in the form of PDF, HPGL, DWG, TIFF or JPEG) with the aid of a built-in viewer, assesses the quality and the contents of the drawing with the aid of his own human understanding, and then answers the questions posed by the program regarding the contents and quality of the input file.
As a result of the evaluation provided by the user, based on his ability to make judgments and his intellect, the program is in a position to make a number of optimal settings for the converted file. The program creates between 2 and 8 (maximum) sets of settings to use when converting the file. After the conversion the user looks at the drawings with a DWG/DXF viewer and chooses the drawing with the best quality.
4.1 Selection of the conversion method and the target formats In the first step of our Wizard, questions regarding the conversion method and the DWG/DXF target format should be answered. Our Wizard supports conversions from PDF into DWG/DXF, from HPGL2 into DWG/DXF, from DWF to DWG/DXF and from TIFF/JPEG/PNG/GIF into DWG/DXF. In the case of other formats you are directed to the main screen of the program. Legend: 1. Video clip introduction via Kazmierczak® Online University 2.
The program supports the following format versions:
PDF: all Adobe-compatible 2D versions (as of June 2012)
HPGL, HP-GL/2 and HP-RTL
DWF: all 2D versions (as of June 2012)
TIFF, JPEG, GIF and PNG: all versions (as of June 2012)
DWG: versions 14 to 2013-compatible, as RealDWG from Autodesk
DXF: versions 14 to 2013-compatible, as RealDWG from Autodesk
With PDF formats it is essential to make a distinction between PDFs with native PDF elements such as paths, cross-hatching, text, etc.
4.2 Selection of the files and the target directory The Wizard converts any desired number of files in one run. The converted files are saved in the target directory which is selected prior to conversion in the area below the file display area. If no target directory was selected the converted files are saved in the same directory as that of the source files. A separate file is created for each page of multi-page PDF files. Legend: 1.
The converted files have by default the prefix “Print2CAD-”. This prefix can be disabled in the configuration, but in that case any existing files that have the same name and same extension will be overwritten without any warning. Multiple converted files and the associated settings are saved in the target directory. The settings file will be designated by the extension .p4c. Please note the following points 1.
4.3 Details on the scale, colors, and layers In the third step our Wizard requests details on the scale, color and layers of the drawing. Depending on the answers to these questions, suitable settings are generated for the conversion. Legend: 1. Video clip introduction via Kazmierczak® Online University 2. Notes on the conversion step 3. View of selected File 4. Details on the color of the drawing 5. Details on the scale of the drawing 6.
Unfortunately our converter cannot rely on the color details given in the input files, especially in the case of PDF files. Very often black/white files are marked as “full color” files, which gives poor results in conversion. Here the user must examine the drawing and decide which color palette should be used. Press the “View selected file” button and view the file in the internal viewer. Then decide whether the drawing is primarily black and white, grey-scaled or colored.
4.3.3 Details on the layer structure Press the “View selected file” button and view the file in our internal viewer. You can see a listing of the layers on the left-hand side of the viewer. Check whether this layer structure is actually used by switching the individual layers on and off. Then decide whether to use the existing layer structure or instead to create a new structure on the basis of the colors and types of elements.
In the fourth step our Wizard requests details on the contents and quality of the drawing. Depending on the answers to these questions, suitable settings are generated for the conversion. Legend: 1. Video clip introduction via Kazmierczak® Online University 2. Notes on the conversion step 3. Settings regarding the quality of the paths (lines, arcs and circles) 4. Settings regarding the quality of the cross-hatching 5.
During the analysis of the drawing the paths (lines, arcs and circles), the cross-hatching and the raster images are saved in various PDF files. The native text elements are ignored when doing this. Press the “View selected file” button and view the files. Then answer the questions on the contents and quality. It is important to decide which property predominates. Answer the question “only horizontal text” if there really is only horizontal text without other symbols or elements (lines, circles, etc.).
In the fifth step our Wizard generates various alternative settings for the conversions on the basis of the details on the quality, contents and scale of the drawing. Start the conversion with all the suggested settings and choose the best setting on the basis of the conversion quality. The settings are saved in the target directory and have the ending “.p4c”. If you wish to deactivate a particular setting, just click on the corresponding checkbox. Legend: 1.
4.6 Handling the results of the conversion In the last step, the Wizard converts all the selected files using the selected settings. The files are saved in the target directory and are given the ending “-settings— [number].dwg” or “-settings—[number].dxf” as appropriate. The settings for the relevant conversion are saved under the name “-settings—[number].p4c”. The settings can be reused in the main screen. The converted DWG or DXF files can be viewed using the built-in DWG/DXF viewer.
The saved settings (.p4c files) can be reused in the main screen. To do this, start the program in the main menu and load the settings with the “Load settings” function. 4.
5. Main Menu
5.1 File Selection Print2CAD can convert multiple files in one run. All selected files will remain in their original condition after the conversion. You can choose a target directory for the converted files. If no target directory is selected, the output files will be saved in the same directory as the source files. Multi-page PDF files will convert into separate files for each page of the original file.
5.2 Output Files The following output files can be created when converting:
CAD files: DXF or DWG
Raster files: BMP, JPG, PNG, TIFF, RAW, GIF
PostScript files: EPS
PDF files: PDF
5.3 Target Directory for Converted Files A target directory for the converted files should be specified. If a target directory is not selected, the output files are created in the same directory as the source files. The converted files will have the prefix “Print2CAD-.
With the help of the Wizard, the optimum settings for certain file types can be selected. To aid the selection of settings, the user selects the file type that best matches the source file; selecting it loads the optimum settings. 5.
5.7 Activation of the Program (depending on Purchasing Method) Important! The activation method depends upon how and where you purchased the program. Please follow the activation instructions you receive from our purchasing system. If you purchased the USB hardlock version of the program, then you do not have to activate it. Once you click the “Activation” button while the USB hardlock is plugged in, the notice “OK! USB Code Meter found.” will appear.
When ending the application, the program saves the current program settings automatically. The same options will then be loaded when starting the program again. This option can be turned off in the Configuration Tab of the Main Menu. You can save and load all program settings by clicking the button “Save Settings” (a file with extension “p4c” will be created) and “Load Settings.” Figure: Save program settings as P4C-File 5.
6. Optimization - Pages and Coordinates
For PDF files with multiple pages, the user can select which pages to convert or simply convert all pages in one run. The selected pages should be separated by semicolons (e.g. 1; 4; 12; 34). To convert several pages in a row (from - to) you can use a hyphen (e.g. 12-18). Example: 1; 4; 8-10; 12 Print2CAD will convert pages 1, 4, 8, 9, 10 and 12. For PDF to raster conversion, the settings shown in the above interface will have no effect.
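The page selection syntax described above can be sketched in a few lines of Python. This is only an illustration of the notation, not Print2CAD's own parser; the function name `parse_page_selection` is an assumption.

```python
def parse_page_selection(selection):
    """Expand a selection such as '1; 4; 8-10; 12' into a list of page numbers."""
    pages = []
    for part in selection.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        if "-" in part:
            # A hyphen denotes an inclusive range "from-to"
            first, last = (int(p) for p in part.split("-", 1))
            pages.extend(range(first, last + 1))
        else:
            pages.append(int(part))
    return pages


print(parse_page_selection("1; 4; 8-10; 12"))  # [1, 4, 8, 9, 10, 12]
```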
6.2 Scaling of Coordinates Coordinates in PDF files usually have a resolution of 72 dpi, which means that one inch (25.4 mm) is divided into 72 units. Such a PDF file has an accuracy of 25.4/72 = 0.35 mm or 1/72 inch. A 1200 dpi resolution would give an accuracy of 25.4/1200 = 0.02 mm or 1/1200 inch. These types of high-resolution PDF files are rare. Unfortunately, the scale of a PDF drawing can only be retrieved from the header of a construction plan.
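The relation between resolution and accuracy can be written as a short Python sketch. The function names and the `drawing_scale` parameter are assumptions for illustration; the drawing scale itself has to be read from the plan by the user, as noted above.

```python
MM_PER_INCH = 25.4


def grid_step_mm(dpi):
    """Smallest distance (in mm) that can be represented at the given resolution."""
    return MM_PER_INCH / dpi


def pdf_units_to_mm(value, dpi=72.0, drawing_scale=1.0):
    """Convert a PDF coordinate to real-world millimetres.

    drawing_scale=50 stands for a plan drawn at 1:50 (hypothetical parameter).
    """
    return value * grid_step_mm(dpi) * drawing_scale


print(round(grid_step_mm(72), 2))  # 0.35
```

For example, 72 PDF units at 72 dpi correspond to one inch on paper; on a 1:50 plan this distance represents 50 inches in reality.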
The user can specify the rotation angle for the converted files. When converting from PDF to DWG or DXF, the arrangement of the coordinates of the PDF paths is used, and any display rotation of the PDF representation is ignored. For this reason, the converted data may be displayed at a different rotation angle in the DWG or DXF than in a PDF reader. Figure: Rotation of Coordinates 6.
6.4 Transformation of Coordinates Factors selected by the user are added to every coordinate. 6.5 Purge Bright Elements on Bright Background PDF files often include invisible white elements placed on a white background. It is possible to delete these elements during the conversion to decrease file size and conversion time. The brightness limit (from 1 to 255) above which elements are deleted can be set by the user.
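A brightness filter of this kind might look like the following Python sketch. How Print2CAD measures brightness is not documented here; the sketch assumes the mean of the R, G and B values, and the element structure (a dictionary with a "color" entry) is purely hypothetical.

```python
def brightness(rgb):
    """Brightness of a color as the mean of R, G and B (assumption)."""
    r, g, b = rgb
    return (r + g + b) / 3


def purge_bright_elements(elements, limit=250):
    """Keep only elements whose color is darker than the brightness limit."""
    return [e for e in elements if brightness(e["color"]) < limit]


elements = [
    {"type": "line", "color": (0, 0, 0)},         # black, kept
    {"type": "line", "color": (255, 255, 255)},   # white on white, purged
    {"type": "hatch", "color": (252, 253, 251)},  # nearly white, purged
]
print(purge_bright_elements(elements))
```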
1. Recognition of the Layer Structure
2. Purge Short Distance Polyline Vertexes (Data Reduction)
3. Delete Short Lines (Data Reduction)
4. Generate Circles and Arcs
5. Color Palette of the DWG or DXF Files
6. Assign Line Weight to Entities
7. Hatch Conversion
7.
7.1 Recognition of the Layer Structure Print2CAD offers various possibilities to assign a layer structure to the resulting DWG or DXF file. 7.1.1 Assign the PDF Layer Structure to DWG or DXF (if available) When creating a PDF file, a PDF layer structure can be assigned. Unfortunately, this feature is rarely used. The layer structure in PDF has a tree-like structure. In contrast, the layer structure in DWG or DXF is flat.
Print2CAD allows the user to assign a uniform color to all converted elements. 7.3 Color Palette of the DWG or DXF Files Print2CAD allows the user to assign the RGB values of the colors used in a PDF into the resulting DWG or DXF elements. Important! The color Black in a PDF is converted into the color White in RGB. The color White in a PDF is converted into the color White in RGB.
7.4 Assign Line Weight to Entities Print2CAD allows the user to assign the PDF line weight to all DWG or DXF elements. Unlike DWG elements, which must use one of a fixed set of line weights, PDF elements can have any user-defined line weight.
Important! Line weight is only available in DWG or DXF version 2004 and higher. If necessary, change the target version to 2004 or higher. Important! Line weights can also be defined as a hatch in PDF files. We gave this line weight the name “Line Weight Fake”. To test your PDF drawing, please follow the steps below: Turn off the line weight view in the Adobe Reader under “View Line Weights”. If the lines do not appear with line weight 0.
The converted DWG will only feature one line weight, which in this case would be 2.11mm, which is an ISO standard line weight. Our suggestion for this would be to use the scaling factor for line weights. By using this factor, the user can enlarge or reduce the line weights in the PDF. Set the factor to 0.10, as shown in our example above, and the resulting line weights will be 0.18, 0.25 and 0.35mm after the conversion.
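The effect of the scaling factor can be sketched in Python as follows. The list of fixed line weights is the set commonly offered by DWG-based CAD systems and is stated here as an assumption, as are the sample PDF widths and the nearest-value rule; Print2CAD's internal mapping may differ.

```python
# Fixed line weights in mm commonly available in DWG files (assumption)
STANDARD_LINEWEIGHTS_MM = [
    0.00, 0.05, 0.09, 0.13, 0.15, 0.18, 0.20, 0.25, 0.30, 0.35, 0.40, 0.50,
    0.53, 0.60, 0.70, 0.80, 0.90, 1.00, 1.06, 1.20, 1.40, 1.58, 2.00, 2.11,
]


def to_cad_lineweight(pdf_width_mm, factor=1.0):
    """Scale a PDF line width and map it to the nearest fixed CAD line weight."""
    scaled = pdf_width_mm * factor
    return min(STANDARD_LINEWEIGHTS_MM, key=lambda w: abs(w - scaled))


pdf_widths = [1.8, 2.5, 3.5]  # hypothetical line widths found in a PDF, in mm
print([to_cad_lineweight(w, factor=0.10) for w in pdf_widths])  # [0.18, 0.25, 0.35]
```

Without a factor, any width above the largest fixed value ends up at 2.11 mm, which is why distinct PDF widths can collapse into a single line weight.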
7.5.1 Delete all Hatches When this feature is selected, the hatches are not converted, and only the boundaries of the hatches are represented as polylines. If the PDF file was created using an HPGL interface, the hatching boundaries may have loops. These loops are interpreted differently in DWG or DXF than in HPGL. They are possibly left empty. In such cases, delete all hatches and only output the hatching boundaries. 7.5.
7.6 Generate Circles and Arcs Many PDF files contain circles and arcs that have been converted into polylines. These polylines tend to be imprecise making it difficult to detect them as a circle or arc. Although the recognition of a polyline as a circle or arc appears to be easy when one looks at a PDF (a person can immediately recognize the circle or arc), software has to work a little harder to do this.
Important! If a conversion generates arcs upside down, the radius deviation R in % was set too high. Figure: Internal Arc Parameters in Print2CAD Generating arcs from polylines is again a very difficult task, but not impossible. The radius of arcs is subject to an internal limitation, because otherwise straight lines could be converted into arcs with a large radius. The angle alpha of the generated arcs is limited to at least 20 degrees.
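The idea of a radius deviation limit and a minimum arc angle can be illustrated with the following Python sketch. It is not Print2CAD's algorithm: it simply fits a circle through the first, middle and last vertex of a polyline and accepts the polyline as an arc only if every vertex stays within a percentage of the radius and the swept angle is at least 20 degrees. All names and default values are assumptions.

```python
import math


def circle_through(p1, p2, p3):
    """Return (cx, cy, r) of the circle through three points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)


def polyline_as_arc(points, max_radius_dev_pct=5.0, min_angle_deg=20.0):
    """Return (cx, cy, r) if the polyline can be replaced by an arc, else None."""
    if len(points) < 3:
        return None
    fit = circle_through(points[0], points[len(points) // 2], points[-1])
    if fit is None:
        return None
    cx, cy, r = fit
    # Every vertex must lie close to the fitted circle
    for x, y in points:
        if abs(math.hypot(x - cx, y - cy) - r) / r * 100 > max_radius_dev_pct:
            return None
    # Sum the signed angle steps between consecutive vertices
    angles = [math.atan2(y - cy, x - cx) for x, y in points]
    sweep = sum((b - a + math.pi) % (2 * math.pi) - math.pi
                for a, b in zip(angles, angles[1:]))
    if abs(math.degrees(sweep)) < min_angle_deg:
        return None  # too flat: treat as a straight line instead
    return cx, cy, r


quarter = [(10 * math.cos(math.radians(a)), 10 * math.sin(math.radians(a)))
           for a in range(0, 91, 10)]
print(polyline_as_arc(quarter))
```

The minimum angle plays the role described above: a nearly straight polyline would otherwise be accepted as an arc with a very large radius.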
7.7 Purge Short Distance Polyline Vertexes PDF files can contain paths with many points (vertexes). Even a large amount of path data can easily be processed in PDF, because paths do not have many parameters and properties. After the conversion to DWG or DXF, every path segment becomes a full CAD line or polyline. These single CAD elements may include many additional parameters and properties. Therefore it is advantageous to purge the polyline points during the conversion.
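A simple form of this data reduction is sketched below in Python. It drops every vertex that lies closer than a minimum distance to the last vertex that was kept, while always preserving the start and end points. The function name and the distance parameter are assumptions, not Print2CAD's actual implementation.

```python
import math


def purge_close_vertices(points, min_dist):
    """Remove polyline vertices closer than min_dist to the previously kept vertex."""
    if not points:
        return []
    kept = [points[0]]
    for p in points[1:-1]:
        if math.dist(kept[-1], p) >= min_dist:
            kept.append(p)
    if len(points) > 1:
        kept.append(points[-1])  # the end point is always preserved
    return kept


polyline = [(0, 0), (0.1, 0), (0.2, 0), (1, 0), (1.05, 0), (2, 0)]
print(purge_close_vertices(polyline, min_dist=0.5))  # [(0, 0), (1, 0), (2, 0)]
```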
Some PDF files contain paths with many small lines. This happens mostly when dotted lines have been generated in PDF files as single lines. These single lines can easily be handled in PDF, as the paths do not have many parameters or properties. After the conversion to DWG or DXF, every line will be a full CAD line and may feature many additional parameters and properties. This data may strain the capacities of a CAD system and RAM. The minimum allowed line length can be set with the parameter “d.
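Deleting short lines amounts to a length filter, as in this minimal Python sketch. The representation of a line as a pair of end points and the function name are assumptions for illustration.

```python
import math


def delete_short_lines(lines, d):
    """Keep only lines whose length is at least the minimum length d."""
    return [(a, b) for a, b in lines if math.dist(a, b) >= d]


lines = [((0, 0), (10, 0)), ((10, 0), (10.2, 0)), ((11, 0), (11, 5))]
print(delete_short_lines(lines, d=1.0))
```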
8. Conversion of native PDF texts
The text in PDF files can be placed as strings or individual characters. How can you find out if your PDF file contains real text? The best method is to analyse the PDF file with the analysis function of Print2CAD and see if there are any text entities indicated. Another method is to open the PDF file in a PDF Reader and zoom the text to maximum view. If the letters still have smooth edges (displaying an arc, not a polyline), your PDF file most likely features real text.
When vectorizing (OCR function not active), a polyline gets drawn along the middle of a pixel trace or the outline of a pixel area and iteratively smoothed. Then the polyline gets recognized as a circle, ellipse or spline. Figure: Vectorization procedure (generation of an ellipse) Using the OCR procedure, a pixel image gets recognized or discarded as a symbol based on its shape.
Figure: No real PDF text (raster picture) Figure: No real PDF text (text as polylines) Figure: No real PDF text (text as hatches)
Another problem is created by PDF fonts as they are usually embedded in the PDF file. In DWG or DXF, the fonts have to be taken from the system. Since the fonts are embedded in PDF, the characters are no longer coded, for example per the ASCII table. PDF files often use Identity-H fonts with no rule regarding character encoding.
In PDF files, text is usually defined as separate characters or groups of characters with their own insertion points. With the help of special internal methods, Print2CAD merges characters into strings and places these strings as text in the DWG or DXF drawing. Figure: PDF characters and character groups (so called Text Runs) and CAD text Print2CAD does not reconstruct text that was fragmented into lines, arcs or hatches. Such “text” is converted faithfully back into lines or hatches in the CAD drawing.
8.3 Sort Text Onto Separate Layer When this function is activated, all native text is sorted onto a predetermined layer. If there is no real text, but only polylines, hatches or raster images, the letters will not be recognized as text. 8.4 Scale Factors for Blank Space Width Text in PDF files is often placed as single letters. In this case the spaces are not available.
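How a blank space factor can be used to rebuild words from single letters is sketched below in Python. The sketch assumes that all letters lie on one baseline and are given as (x position, width, letter); a space is inserted whenever the gap to the previous letter exceeds the factor times the letter width. The names and the default factor are assumptions, not Print2CAD's internal method.

```python
def merge_characters(chars, space_factor=0.5):
    """Merge single letters on one baseline into a string, inserting spaces at wide gaps.

    chars: list of (x, width, letter) tuples in drawing units.
    """
    text = ""
    prev_end = None
    for x, width, letter in sorted(chars):
        if prev_end is not None and x - prev_end > space_factor * width:
            text += " "  # gap is wide enough to count as a blank
        text += letter
        prev_end = x + width
    return text


letters = [(0.0, 1.0, "A"), (1.1, 1.0, "B"), (3.0, 1.0, "C")]
print(merge_characters(letters))  # AB C
```

A larger factor makes the merging more tolerant of wide letter spacing, while a smaller factor inserts blanks more readily.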
When this option is enabled, all text styles are assigned the same selected SHX or TTF font. Figure: Fonts in PDF and in DWG or DXF 8.
9. Vectorization of Raster Pictures
The program can convert scanned engineering and site plans with the help of OCR or vectorization to DWG or DXF format. It is important to know that vectorization and OCR (Optical Character Recognition) are two completely different procedures that convert raster data into other formats. Vectorization calculates the middle of a pixel trace at the edge of the pixel area or polygon and iteratively smooths it.
Because of this difference in operation, the two procedures, OCR and vectorization, cannot be combined. This is why the OCR mode is accessed through a separate tool in which the user tells the program what the text is and where it is located. The result of the vectorization depends entirely on the quality of the original raster file. If the file is of poor quality, the resulting vectorization will likewise be poor.
9.2.1 Find the Center of the Pixel Traces With the help of polylines, raster images can be vectorized along the center of pixel traces. After setting the polylines, the recognition of circles or splines follows. This method is suitable for most construction plans. Figure: Vectorization of a circle Figure: PDF file suitable for a vectorization along the center of the pixel traces 9.
The setting “Find the Center of the Pixel Traces” provides poor results for filled areas in raster images. In this case, a line is drawn through the center of the pixel area as shown below.
With this option, the pixel areas are converted into pixel outlines as a first step. Usually a maximum of 3 pixels remain as the pixel outline. The thickness of the outline can be specified under “Expert Settings.” After doing so, lines get drawn along the center of the pixel traces and smoothed automatically.
9.2.3 Find the Outlines of the Pixel Areas When selecting this option, the outlines of the pixel traces and areas will be converted along the boundary line and smoothed automatically.
9.3 Improvement of the Vectorization Process 9.3.1 Recognition of Horizontal and Vertical Lines The horizontal and vertical lines are recognized. Figure: Raster image with horizontal and vertical pixel traces 9.3.2 Recognition of Inclined Lines Recognition of n*45 degree inclined pixel traces as n*45 degree inclined lines.
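The recognition of n*45 degree lines can be pictured as an angle snapping step, sketched here in Python. A vectorized line whose direction lies within a tolerance of a multiple of 45 degrees is rotated about its start point onto that exact direction; horizontal and vertical lines are the cases where n is even. The function name and the tolerance value are assumptions for illustration.

```python
import math


def snap_angle_to_45(p1, p2, tolerance_deg=3.0):
    """Rotate p2 about p1 so the line runs at the nearest n*45 degree angle.

    Lines that deviate by more than tolerance_deg are returned unchanged.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx))
    snapped = round(angle / 45.0) * 45.0
    if length == 0 or abs(angle - snapped) > tolerance_deg:
        return p1, p2
    rad = math.radians(snapped)
    return p1, (p1[0] + length * math.cos(rad), p1[1] + length * math.sin(rad))


print(snap_angle_to_45((0, 0), (10, 0.2)))  # almost horizontal, becomes horizontal
print(snap_angle_to_45((0, 0), (10, 4)))    # about 22 degrees, left unchanged
```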
9.3.3 Circle and Arc Recognition Detection of circles and arcs occurs when a closed polyline fits the parameters required of an arc or circle. 9.4 Improving Pixel Images Before the Vectorization PDFs with an embedded raster picture have to be improved before executing the vectorization process. This step is often necessary to ensure the quality of the converted file.
9.4.1 Filling Small Holes in the Pixel Traces This setting fills the small holes in the pixel traces. 9.4.2 Filling All Holes in the Pixel Traces This setting allows all holes in the pixel traces to be filled. 9.4.3 Thinning Out the Pixel Lines This setting removes one layer of pixels from the outer edge of the pixel trace. 9.4.4 Making the Pixel Traces Thicker This setting will add a layer of pixels to the outer edge of the pixel trace.
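Removing or adding one layer of pixels corresponds to what image processing calls erosion and dilation. The Python sketch below shows both on a small binary image (1 = pixel set, 0 = background) using the four direct neighbours. It is a simplified illustration under these assumptions; a plain erosion like this one erases traces that are only one pixel wide, which a production thinning routine would avoid.

```python
def neighbours4(img, r, c):
    """Values of the four direct neighbours; positions outside the image count as 0."""
    h, w = len(img), len(img[0])
    return [img[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))]


def thicken(img):
    """Add one layer of pixels around every pixel trace (dilation)."""
    return [[1 if img[r][c] or any(neighbours4(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]


def thin(img):
    """Remove the outer layer of pixels from every pixel trace (erosion)."""
    return [[1 if img[r][c] and all(neighbours4(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]


dot = [[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]]
for row in thicken(dot):
    print(row)
```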
9.4.5 Removal of Free Pixels Often older scanned drawings will have free pixels. Selecting this feature will ensure the removal of these unwanted pixels, for a cleaner converted file.
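One way to picture the removal of free pixels is the following Python sketch, which deletes every set pixel that has no set pixel among its eight neighbours. The definition of a "free pixel" as a completely isolated pixel is an assumption made for this illustration.

```python
def remove_free_pixels(img):
    """Delete set pixels that have no set pixel among their eight neighbours."""
    h, w = len(img), len(img[0])

    def has_neighbour(r, c):
        return any(img[rr][cc]
                   for rr in range(max(0, r - 1), min(h, r + 2))
                   for cc in range(max(0, c - 1), min(w, c + 2))
                   if (rr, cc) != (r, c))

    return [[1 if img[r][c] and has_neighbour(r, c) else 0 for c in range(w)]
            for r in range(h)]


scan = [[1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 1]]
print(remove_free_pixels(scan))
```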
Often older scanned drawings will contain open pixel traces. This setting will prompt the program to close these open spaces. Under “For Experts Only” the user can select the distance between pixels. Figure: Slightly broken lines in a raster image Figure: Removal of slightly broken lines in a raster image 9.4.
9.5 Color Palette of the Vectorization 9.5.1 Black and White Vectorization Vectorization is executed in black and white and does not support any bright pixel colors (e.g. cyan). 9.5.2 Color Vectorization The vectorization is executed in seven primary colors (index colors). First the pixel image is saved to seven files per the elementary colors, these files then get vectorized one after the other and afterwards assembled into one common DWG or DXF file. 9.
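The separation of a color image into seven color layers can be sketched in Python as follows. The palette below (red, yellow, green, cyan, blue, magenta and black for index 7) and the nearest-color rule are assumptions for illustration; the exact colors and matching used by Print2CAD are not specified here.

```python
# Hypothetical palette: index color number -> RGB value
INDEX_COLORS = {
    1: (255, 0, 0),    # red
    2: (255, 255, 0),  # yellow
    3: (0, 255, 0),    # green
    4: (0, 255, 255),  # cyan
    5: (0, 0, 255),    # blue
    6: (255, 0, 255),  # magenta
    7: (0, 0, 0),      # black (drawn as index color 7)
}


def nearest_index_color(rgb):
    """Return the index color whose RGB value is closest to the given pixel color."""
    return min(INDEX_COLORS,
               key=lambda i: sum((a - b) ** 2 for a, b in zip(rgb, INDEX_COLORS[i])))


def split_into_color_layers(pixels):
    """Group foreground pixels {(x, y): (r, g, b)} by their nearest index color."""
    layers = {i: [] for i in INDEX_COLORS}
    for pos, rgb in pixels.items():
        layers[nearest_index_color(rgb)].append(pos)
    return layers


pixels = {(0, 0): (250, 10, 5), (1, 0): (20, 20, 30), (2, 0): (10, 240, 250)}
print(split_into_color_layers(pixels))
```

Each layer can then be vectorized on its own and the results merged into one drawing, as described above.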
A scanned drawing can cause lines to appear wavy after the vectorization. When the setting “Activating Smoothing Iterations of Lines” is selected, the wavy lines are smoothed iteratively. Figure: Vectorizing and smoothing of a pixel trace 9.
10. Converting into Editable Raster Pictures Use this menu if your PDF contains pasted photos or if you want to edit raster images in DWG or DXF.
A pixel image PDF file that does not have lines or areas cannot be vectorized. In such cases the pixel image should either be extracted to the hard disk or converted as horizontal lines or solids. Figure: PDF with inserted photo 10.
10.2 Converting Raster Images as Horizontal Lines In this setting pixels are combined to form horizontal lines. This option is ideal for large pixel files with colorful images (e.g. photos or logos).
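The idea of combining pixels into horizontal lines can be sketched as a run-length pass over each pixel row. The Python example below is a minimal illustration under stated assumptions: the image is a list of rows of color values, `0` stands for the background, and the function name is hypothetical. It does not reproduce Print2CAD's actual conversion.

```python
def rows_to_horizontal_lines(img, background=0):
    """Return (row, start_col, end_col, color) for each run of equal, non-background pixels."""
    lines = []
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            color = row[x]
            start = x
            # Extend the run while the color stays the same.
            while x < len(row) and row[x] == color:
                x += 1
            if color != background:
                lines.append((y, start, x - 1, color))
    return lines


image = [
    [0, 3, 3, 3, 0],
    [5, 5, 0, 0, 7],
]
print(rows_to_horizontal_lines(image))
```

Each tuple describes one horizontal line: three pixels of color 3 in row 0 become a single line from column 1 to column 3, so long runs of the same color need far fewer entities than one entity per pixel.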
In this setting all pixel images are vectorized and embedded into the drawing as “Solid” entities (filled squares). All solids with nearly the same color are connected to create one solid. This option is ideal for colorful pictures. Figure: Example of a pixel image conversion into the DWG entity “Solid” 10.
11. Thresholds for Black and White, Raster Extracting In this menu you will find options that apply to all vectorization methods.
This function allows pixel images to be extracted to the hard disk. Print2CAD takes these raster images and embeds them, or refers to them, as “Image” elements in a DWG or DXF file. 11.
11.2 Threshold for Colors Black and White The user may need to define which brightness value counts as white and which counts as black, so that only bright pixels of raster images are assigned the color white. When vectorizing in black and white, all pixels that fall within the user-defined threshold for black are assigned the color black. All remaining pixels are assigned the color white.
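A threshold of this kind can be illustrated with a few lines of Python. The sketch assumes brightness values from 0 (black) to 255 (white), a simple average of R, G and B as the brightness measure, and a hypothetical default threshold of 128; none of these values are taken from Print2CAD.

```python
def brightness(r, g, b):
    """Simple average brightness (an assumption); a pixel is gray when r == g == b."""
    return (r + g + b) // 3


def to_black_and_white(gray_rows, black_threshold=128):
    """Assign black (0) to values at or below the threshold and white (255) to the rest."""
    return [[0 if value <= black_threshold else 255 for value in row] for row in gray_rows]


row = [[0, 100, 128, 129, 255]]
print(to_black_and_white(row))
```

Lowering `black_threshold` turns more mid-gray pixels white, which can suppress faint background noise in a scan; raising it keeps faint lines at the cost of more noise.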
Figure: Diagram of the conversion of light gray or white pixels to the color white, and of dark gray or black pixels to the color black, using thresholds. A color is gray when R = G = B.
12. Vectorization Expert Settings This menu is used when you want to refine the vectorization of raster images. A good understanding of vectorization methods is necessary here. To restore the optimum settings, press the “Reset to optimum” button.
The polylines must be smoothed out after vectorization. The less smooth the line prior to vectorization, the less smooth the final converted line; depending on the quality of the original file, it may be wavy or jagged. 12.2 Enforce Smoothing Motion This setting causes the polylines created during vectorization to be smoothed, that is, rendered less wavy or jagged. It is not a fix-all, but it improves the conversion. 12.
12.4 Tolerance in Pixels This function allows you to control the tolerance in pixels to determine the center of the pixel traces. 12.5 Conjugation Tolerance in Pixels This function allows you to control the tolerance in pixels to determine the center of arc-like pixel traces. 12.6 Minimum Pixel Length All pixels with a length less than or equal to the specified number get purged. 12.
13. Configuration
1. Language selection (available in multiple languages)
2. Unit selection for the resulting DWG or DXF
3. Last settings will be saved upon exiting the program.
13.1 Choosing The Program Language The Print2CAD software is available in English, Spanish, Italian, French, and German. 13.2 Unit Of The Converted DWG Or DXF File This option allows the user to select the unit of measurement for the converted DWG or DXF file. The standard unit of PDF files is mm. 13.3 Keeping Settings After Ending The Program When exiting the program, the last settings are saved and automatically loaded when the program is restarted. 13.
With the help of the wizard, the optimal settings for a specific file can be chosen. For this purpose, the user selects the image that best matches the file to be converted. Not all PDF files include native PDF elements such as lines and circles. Many PDF files consist only of inserted raster files. Again, the quality of the image determines the quality of the converted file. 14.
15. Batch Run with Command Line Print2CAD can be started and controlled from the command line. The program path must be written in quotation marks, because the path may contain space characters.
Syntax of the command line
-a: “Path and name of the settings file with the extension .p4c”
-b: “Path and name of the file selected for conversion”
-c: “Output path of the converted file”
Example for Print2CAD
“c:\Programs\Print2CAD 2013\KAZMprint2cad32.exe” -a:“f:\test.p4c” -b:“f:\test.
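For batch runs over many files, the command line can be assembled by a small script. The Python sketch below only builds the command string from the three options described in this section; the function name, the input file `f:\drawings\plan.pdf`, and the output path `f:\converted` are hypothetical examples, and the script does not start the program itself.

```python
def build_print2cad_command(exe_path, settings_file, input_file, output_path):
    """Assemble a command line from the -a:, -b: and -c: options, quoting every path."""
    return (
        f'"{exe_path}" '
        f'-a:"{settings_file}" '
        f'-b:"{input_file}" '
        f'-c:"{output_path}"'
    )


command = build_print2cad_command(
    r"c:\Programs\Print2CAD 2013\KAZMprint2cad32.exe",
    r"f:\test.p4c",
    r"f:\drawings\plan.pdf",  # hypothetical input file
    r"f:\converted",          # hypothetical output path
)
print(command)
```

On Windows, a string built this way could be passed to Python's `subprocess.run(command)` to launch the conversion; whether the program accepts the call depends on the installed version and the settings file.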
16.1 Print Permission If the creator of the PDF has not granted permission to extract content, the conversion process requires the file to be “printed to...” DWG. This is checked and performed internally, behind the scenes. However, the file then passes through a 300 DPI plot interface: any element or coordinate that does not lie exactly on one of the 300 dots per inch is moved to the closest dot, which reduces the accuracy of the final drawing.
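The size of this accuracy loss can be estimated with a short calculation. The Python sketch below snaps a coordinate to the nearest dot of a 300 DPI grid; it assumes coordinates in millimeters and is only an illustration of the rounding effect, not the program's internal routine.

```python
MM_PER_INCH = 25.4


def snap_to_plot_grid(value_mm, dpi=300):
    """Move a coordinate (in mm) to the nearest dot of a plot grid with the given resolution."""
    step = MM_PER_INCH / dpi  # distance between two neighboring dots in mm
    return round(value_mm / step) * step


step_mm = MM_PER_INCH / 300          # about 0.0847 mm between dots
original = 12.345                    # hypothetical coordinate in mm
snapped = snap_to_plot_grid(original)
error = abs(snapped - original)      # never more than half a step, about 0.042 mm
print(step_mm, snapped, error)
```

At 300 DPI the worst-case shift per coordinate is therefore roughly 0.04 mm on the sheet; on a drawing plotted at a small scale, this corresponds to a much larger distance in real-world units.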
17.
17.1. Raster Target Format 17.1.1 TIFF “Tagged Image File Format (abbreviated TIFF) is a file format for storing images, popular among Apple Macintosh owners, graphic artists, the publishing industry, and both amateur and professional photographers in general. As of 2009, it is under the control of Adobe Systems.
17.1.2 JPEG “In computing, JPEG (pronounced /ˈdʒeɪpɛɡ/, jay-peg) is a commonly used method of lossy compression for photographic images. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality. JPEG compression is used in a number of image file formats.
Uncompressed bitmap files (such as BMP) are typically larger than compressed (with any of various methods) image file formats for the same image. For example, the 1058×1058 Wikipedia logo, which occupies about 271 KB in the lossless PNG format, takes about 3358 KB as a 24-bit BMP file. Uncompressed formats are generally unsuitable for transferring images on the Internet or other slow or capacity-limited media. (...)” Source: Wikipedia, subject “BMP” License Agreement: http://creativecommons.
technique was patented in 1985. Controversy over the licensing agreement between the patent holder, Unisys, and CompuServe in 1994 spurred the development of the Portable Network Graphics (PNG) standard; since then all the relevant patents have expired. (...)” Source: Wikipedia, “GIF” License Agreement: http://creativecommons.org/licenses/by-sa/3.0/ 17.1.6 RAW “A camera raw image file contains minimally processed data from the image sensor of either a digital camera, image scanner, or motion picture film scanner.
17.2. Raster Image Color Depth “Color depth or bit depth, is a computer graphics term describing the number of bits used to represent the color of a single pixel in a bitmap image or video frame buffer. This concept is also known as bits per pixel (bpp), particularly when specified along with the number of bits used. Higher color depth gives a broader range of distinct colors.
17.3 Raster Image Compression “Image compression is the application of data compression on digital images. In effect, the objective is to reduce redundancy of the image data in order to be able to store or transmit data in an efficient form: A chart showing the relative quality of various jpg settings and also compares saving a file as a jpg normally and using a “save for web” technique. Image compression can be lossy or lossless.
“Group 3 and 4 faxes are digital formats, and take advantage of digital compression methods to greatly reduce transmission times. Group 3 faxes conform to the ITU-T Recommendations T.30 and T.4. Group 3 faxes take between six and fifteen seconds to transmit a single page (not including the initial time for the fax machines to handshake and synchronize). Group 4 faxes conform to the ITU-T Recommendations T.563, T.503, T.521, T.6, T.62, T.70, T.72, T.411 to T.417.
17.4 Color Type 17.4.1 Grayscale Color Space “In photography and computing, a grayscale or grayscale digital image is an image in which the value of each pixel is a single sample, that is, it carries only intensity information. Images of this sort, also known as black-and-white, are composed exclusively of shades of gray, varying from black at the weakest intensity to white at the strongest.
“An RGB color space is any additive color space based on the RGB color model. A particular RGB color space is defined by the three chromaticities of the red, green, and blue additive primaries, and can produce any chromaticity that is the triangle defined by those primary colors. The complete specification of an RGB color space also requires a white point chromaticity and a gamma correction curve. RGB is an acronym for Red, Green, Blue.
17.4.4 CMYK Color Space “The CMYK color model (process color, four color) is a subtractive color model, used in color printing, and is also used to describe the printing process itself. CMYK refers to the four inks used in some color printing: cyan, magenta, yellow, and key black. Though it varies by print house, press operator, press manufacturer and press run, ink is typically applied in the order of the abbreviation.
For multi-page PDF documents, you can specify that only certain pages are converted to a raster file. The page numbers must be separated by commas (e.g. 1, 4, 12, 34). Page ranges are indicated with a hyphen (e.g. 12-18). Example: 1, 4, 8-10, 12 outputs pages 1, 4, 8, 9, 10 and 12. 17.
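The page selection syntax can be expanded with a few lines of code. The following Python sketch shows how such a selection string could be interpreted; the function name is hypothetical and the parsing rules are an assumption based on the description above.

```python
def parse_page_selection(selection):
    """Expand a selection such as "1, 4, 8-10, 12" into a list of page numbers."""
    pages = []
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty entries such as a trailing comma
        if "-" in part:
            first, last = (int(number) for number in part.split("-", 1))
            pages.extend(range(first, last + 1))
        else:
            pages.append(int(part))
    return pages


print(parse_page_selection("1, 4, 8-10, 12"))  # [1, 4, 8, 9, 10, 12]
```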
18. Analysis of a PDF File
19. DWG, DXF to PDF Conversion With the help of Print2CAD, DWG or DXF files can be converted into PDF files. DWG or DXF files can be converted directly into PDF with high quality PDF elements like text, circles, curves, lines with line types, and layers. Raster images are only displayed if they are inserted in DWG as a BMP. JPEG and TIFF files are not supported. 19.1 PDF Header The user can enter a description of the PDF file, which will later be displayed in the document properties. 19.
1. PDF embed fonts
2. Geometry optimization enabled
3. SHX or TTF fonts print as geometry
4. Zoom to the drawing limits
5. PDF label
6. Treatment of the layout and the model range
7. Select Paper Size
8. Adopt layers in the PDF
9. Text widths scale
10. Line width set to 0.
19.4 TTF Fonts as Geometry Text in TTF fonts (Windows fonts) is recreated as simple geometry (lines, polylines). 19.5 Zoom To Extensions If this option is inactive, the last displayed section of the drawing is output as a PDF. When this option is active, the drawing is zoomed to extensions before being converted into PDF. 19.6 Generate PDF Layer The layer assignment of the DWG or DXF is adopted in the PDF file. This option is available only for PDF version 1.4 and higher. 19.
All layouts and the model space of the DWG or DXF will be output as a PDF with separate pages. Figure: Example - a model space of DWG or DXF file 19.12 Scale Text Width By activating this function, all text widths are scaled by the given factor. Figure: Example - a layout of DWG or DXF file 19.
19.13 Set Line Width to 0.0 With this function, all line widths are set to the value 0.0. 19.14 PDF Version This option selects the PDF version of the output file. 19.15 Paper Format Set the dimensions of the PDF sheets here. Figure: Print margins of a PDF file 19.16 Font Directory The Print2CAD software includes original AutoCAD 2013 fonts. These fonts are saved in the program directory ...\Print2CAD 2013\Fonts.
With the aid of Print2CAD, text heights can be normalized in the converted DWG or DXF file. This is done by specifying the height ranges of text and assigning common text heights to these ranges. Existing text heights can be determined with the integrated viewer DeepView and its “Analysis” function or by clicking the button “View” and thus opening the “Properties.
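The principle of assigning common text heights to height ranges can be illustrated as follows. This Python sketch uses hypothetical ranges and values; it shows the idea of the normalization only and makes no claim about how Print2CAD stores or applies its ranges.

```python
def normalize_text_height(height, height_ranges):
    """Return the common height of the first range containing `height`, else the height unchanged."""
    for low, high, common in height_ranges:
        if low <= height <= high:
            return common
    return height


# Hypothetical ranges: (lowest height, highest height, common height to assign)
ranges = [(2.0, 2.9, 2.5), (3.0, 4.0, 3.5)]
heights = [2.4, 2.6, 3.1, 3.9, 7.0]
normalized = [normalize_text_height(h, ranges) for h in heights]
print(normalized)
```

Heights that fall outside every range, such as 7.0 in this example, are left as they are.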
21. OCR-Mode - Text, Line Type and Coordinates Recognition Print2CAD can recognize text split into multiple lines, polylines, hatches, and raster pixels using OCR methods. Print2CAD’s OCR techniques allow the detection of dashed and dotted lines. They also allow the accurate calibration of the coordinates of the converted drawing. OCR is an abbreviation for “Optical Character Recognition.” OCR also means symbol and pattern recognition.
Print2CAD can recognize text split into multiple lines, polylines, hatches, and raster pixels using the OCR method.
1. Deactivation of text recognition
2. Activate manual text recognition
3. Indicate text areas
4. List of text areas
5. Preselection of text
6. Linewidth of texts fragmented in lines
7. Language of the text recognition
22.
22.1 General “Optical character recognition, abbreviation is OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website.
The initial point is an image file (raster graphic) that will be generated by Print2CAD from the file being converted. This image file has the name “file name_KazOcr.tif.” It is automatically generated in the directory of the drawing being converted. When using the OCR method with text separated in lines, a clear line weight is needed. This OCR line weight can be specified in the OCR text recognition interface. It is usually set to 0.4mm.
22.2.1 Breakdown Detection The image file is divided into relevant fields (text, captions). This allocation is performed by the user using a special editor. The division is necessary because a computer program cannot, with sufficient certainty, pick the text out of a drawing file that also contains lines, circles, and arcs and separate it from the rest of the drawing.
22.2.2 Adjusting the Outlined Areas The outlined text areas need to be cleaned of stray lines. This is based on the observation that most such lines overlap the outlined text areas and are connected to other lines in the same area. When adjusting the text areas, all pixel traces that also cut the borders of the text areas are deleted. This process happens automatically during the conversion. The graphics below illustrate this procedure: 22.2.3 Recognition of Pattern 22.2.3.
The pixel patterns of the text area are compared with patterns in the database, and rough digitals are then created. 22.2.3.3 Error Correction on Plane of Projection The rough digitals are compared with dictionaries (Intelligent Character Recognition, ICR) and evaluated for their probable correctness by linguistic and statistical means.
22.2.4 Manual Correction of the Recognized Texts Print2CAD offers a special method that allows the user to manually correct text areas that were recognized incorrectly. The computer may not be able to distinguish between “B” and “8”, or between the number “0” and the letter “O”. Therefore, as a last step of the text recognition, an interaction between the user and the software is required.
22.2.
With the help of Print2CAD you can recognize fragmented lines. Print2CAD recognizes the line types “dashed” and “dash dot” at any angle. 23.1 Basics One of the problems of a line type conversion from PDF, HPGL, TIFF, and/or JPEG to DWG or DXF is that lines are everywhere. It is possible to define a line with a line type in PDF or HPGL files, but this is rarely done, because the line types appear inaccurate when zoomed. In these cases a static copy of the line is created using small single lines.
23.2 Methods and Parameter 23.2.1 Activation
A line is recognized as a line with a line type only if the number of dashes is greater than 4.
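The rule about the number of dashes can be combined with simple geometric checks to decide whether a group of short segments belongs to one dashed line. The Python sketch below is a simplified illustration, not Print2CAD's recognition method: the function name, the `max_gap` and `tolerance` parameters, and the assumption that the dashes are already ordered along the line are all hypothetical.

```python
import math


def is_dashed_line(dashes, max_gap, tolerance=0.01, min_dashes=5):
    """Decide whether short segments, ordered along a line, form one dashed line.

    dashes: list of ((x1, y1), (x2, y2)) segments.
    """
    if len(dashes) < min_dashes:  # more than 4 dashes are required
        return False
    (x0, y0), (x1, y1) = dashes[0][0], dashes[-1][1]
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        return False
    # Every endpoint must lie close to the straight line from the first start to the last end.
    for start, end in dashes:
        for px, py in (start, end):
            distance = abs((x1 - x0) * (py - y0) - (y1 - y0) * (px - x0)) / length
            if distance > tolerance:
                return False
    # The gap between consecutive dashes must not exceed max_gap.
    for (_, prev_end), (next_start, _) in zip(dashes, dashes[1:]):
        gap = math.hypot(next_start[0] - prev_end[0], next_start[1] - prev_end[1])
        if gap > max_gap:
            return False
    return True


# Six dashes on a line inclined at 45 degrees, each followed by a short gap.
inclined = [((i * 3, i * 3), (i * 3 + 2, i * 3 + 2)) for i in range(6)]
print(is_dashed_line(inclined, max_gap=2.0))      # six collinear dashes
print(is_dashed_line(inclined[:4], max_gap=2.0))  # only four dashes
```

Because the check works with distances to the overall line rather than with horizontal or vertical runs, it applies to dashed lines at any angle.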
23.2.2 Parameter for Detecting Line Types Print2CAD requires the user to identify the line types with inclination by using the internal editor. This fairly quick process will reduce the error rate significantly. To familiarize yourself with the detailed procedure, please watch the training videos on our website. The beginning and end of the highlight do not need to match the line in the drawing exactly.
Figure: A PDF file with dashed and dash-dot line types
24. Calibration of Coordinates Print2CAD’s OCR techniques also allow the accurate calibration of the coordinates of the converted drawing. 24.1 Basics of the Calibration Problem The coordinates of PDF, HPGL, DWF, TIFF, and JPEG files often have a significant accuracy error. This error is mostly created by exporting entities in a raster of 72 dpi (dots per inch). Print2CAD features settings to allow the user to calibrate the coordinates of the converted drawing.
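To give a feeling for the numbers: a raster of 72 dpi has a dot spacing of 25.4 / 72, or about 0.35 mm, so coordinates can be off by a fraction of a millimeter on the sheet. One common way to correct such errors is a linear calibration per axis from reference points whose true coordinates are known. The Python sketch below illustrates that general idea with hypothetical reference values and function names; it is not a description of Print2CAD's calibration routine.

```python
def axis_calibration(measured_a, measured_b, true_a, true_b):
    """Return (scale, offset) so that true = scale * measured + offset on one axis."""
    scale = (true_b - true_a) / (measured_b - measured_a)
    offset = true_a - scale * measured_a
    return scale, offset


def calibrate(point, x_cal, y_cal):
    """Apply per-axis (scale, offset) pairs to an (x, y) point."""
    (scale_x, offset_x), (scale_y, offset_y) = x_cal, y_cal
    return (scale_x * point[0] + offset_x, scale_y * point[1] + offset_y)


# Hypothetical reference points: converted coordinates and their known true values.
x_cal = axis_calibration(10.2, 110.7, 10.0, 110.0)
y_cal = axis_calibration(5.1, 55.6, 5.0, 55.0)
corrected = calibrate((10.2, 5.1), x_cal, y_cal)  # maps back to about (10.0, 5.0)
print(corrected)
```

With two reference points per axis, both a constant shift and a uniform scale error are corrected; every other point of the drawing is then transformed with the same scale and offset.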
Figure: Calibration of coordinates with the help of a Y, X calibration point Figure: Calibration of coordinates using the Y coordinate
24.2 Activation of the Coordinate Calibration