Data360 Analyze

  • 1.  How to import pdf documents

    Posted 02-21-2024 23:53

    Hi Folks

    I was hoping that someone could advise me whether, and how, it is possible to import a PDF document within D360.



    ------------------------------
    andrew darnell
    Knowledge Community Shared Account
    ------------------------------


  • 2.  RE: How to import pdf documents

    Employee
    Posted 02-28-2024 05:09

    Importing multi-structured data from PDF files is inherently a tricky and error-prone process. A PDF file may also consist of images rather than contain any structured data, and Data360 Analyze does not include any Optical Character Recognition (OCR) capabilities.

    One option would be to use the Transform node with a third party Python module (e.g. PyPDF2) that can extract text information from a PDF file.

    Here is a basic example of how the PyPDF2 module can be leveraged to convert the textual data in a PDF document into a text file.

    The attached zip contains the data flow and requires Data360 Analyze v.3.12.1 or higher.

    The Python scripts used in the Transform node are also included in the archive, together with the input PDF document used by the example data flow.

    The data flow assumes the PyPDF2 package has been imported into the 'lib/jython2' directory within the Analyze system's 'site' directory (and is hence on the default search path used when the system attempts to locate the module to be imported).

    In practice the PyPDF2 module does extract the structured data from the sample PDF file. However, the original structure is not maintained and there appear to be some aberrations in the field names that were identified by the module. Writing the extracted data to a text file at least enables you to examine the data before attempting to import it and convert it to structured data in a second step.
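
    As a rough illustration of the approach described above (this is not the script from the attached zip), the extraction step might look something like the following. It assumes the older PyPDF2 1.x interface (PdfFileReader, numPages, getPage, extractText), which later PyPDF2 releases renamed or removed, so check it against the version you have installed. The function names pages_to_text and pdf_to_text_file are made up for this sketch.

    ```python
    import io


    def pages_to_text(reader):
        """Join the text of every page of a PyPDF2 1.x-style reader object."""
        chunks = []
        for page_number in range(reader.numPages):
            page = reader.getPage(page_number)
            chunks.append(page.extractText())
        # Unicode literal so the result is text on both Python 2 and 3.
        return u"\n".join(chunks)


    def pdf_to_text_file(pdf_path, txt_path):
        """Write the text extracted from pdf_path to txt_path (hypothetical helper)."""
        # Assumption: PyPDF2 1.x is on the module search path.
        # Imported here so a missing package only fails when a PDF is converted.
        from PyPDF2 import PdfFileReader

        with open(pdf_path, "rb") as pdf_file:
            text = pages_to_text(PdfFileReader(pdf_file))
        with io.open(txt_path, "w", encoding="utf-8") as txt_file:
            txt_file.write(text)
    ```

    The text comes out page by page with no layout information, which is consistent with the loss of structure noted above.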

    As stated at the start of this post, attempting to convert PDF files into structured data is not straightforward and your mileage may vary.

    Note: Precisely do not provide support for third party Python packages.



    ------------------------------
    Adrian Williams
    Precisely Software Inc.
    ------------------------------




  • 3.  RE: How to import pdf documents

    Posted 02-28-2024 18:14

    Hi Adrian

    Thank you for supplying the graph. However, I will need some further context on how to run it.

    I've uploaded the PDF file to my server, but I am not sure what I need to do to get the first node to look for the PDF.

    Looking forward to your reply



    ------------------------------
    andrew darnell
    Knowledge Community Shared Account
    ------------------------------



  • 4.  RE: How to import pdf documents

    Employee
    Posted 03-01-2024 14:20

    The example graph is only a PoC that is intended to show how you might use the PyPDF2 module to access a PDF file.

    In the script in the Transform node, the filename is hardcoded - see the line that defines the 'tgtFilePath' variable.

    The upstream Create Data node is only being used in the example to trigger the execution of the Transform node.

    With the example code the pertinent Python statements are being executed in the ConfigureFields script - which is executed before any input records are processed. You could create your own custom code that obtained the target file from a data flow or run property, or from a literal value you have specified in a custom property defined on the node. The converted file has the same name as the target file but with a .txt suffix.

    You would need to modify the script to accept the target file value as an input field, for example by creating a custom Python function to process the PDF file conversion.
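
    A minimal sketch of that idea in plain Python is shown below: the target path is passed in as an argument instead of being hardcoded, and the output name is derived from it. The names txt_path_for and convert_targets are hypothetical, and 'convert' stands for whatever callable performs the actual PDF-to-text conversion (for example, one built on PyPDF2). The sketch assumes the .txt suffix replaces the .pdf extension; adjust it if the example script appends the suffix instead. How the path is read from an input field or node property in the Transform node is not shown here.

    ```python
    import os


    def txt_path_for(pdf_path):
        """Return pdf_path with its extension swapped for .txt (assumed naming)."""
        base, _ext = os.path.splitext(pdf_path)
        return base + ".txt"


    def convert_targets(pdf_paths, convert):
        """Call convert(pdf_path, txt_path) for each path; return the text paths."""
        written = []
        for pdf_path in pdf_paths:
            txt_path = txt_path_for(pdf_path)
            convert(pdf_path, txt_path)
            written.append(txt_path)
        return written
    ```

    With this shape, each input record could supply one value for pdf_paths, so the same conversion code serves any number of files.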



    ------------------------------
    Adrian Williams
    Precisely Software Inc.
    ------------------------------



  • 5.  RE: How to import pdf documents

    Employee
    Posted yesterday

    Please note the following, per the original reply that provided the example data flow:

    "The data flow assumes the PyPDF2 package has been imported into the 'lib/jython2' directory within the Analyze system's 'site' directory (and is hence on the default search path used when the system attempts to locate the module to be imported)."

    If it has not been imported, you will get an error similar to the following: "ImportError: No module named PyPDF2".

    Attached is an example data flow that can help to download and install the PyPDF2 module. The example assumes there is internet access from the Analyze server for the download of the Python package.
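
    If you want to check this before running the conversion, a small test along the following lines may help. It is only a sketch: module_available and search_path_entries are made-up helper names, and the 'jython2' fragment used in the usage note is taken from the directory name quoted above.

    ```python
    import sys


    def module_available(name):
        """Return True if the named module can be imported, otherwise False."""
        try:
            __import__(name)
        except ImportError:
            return False
        return True


    def search_path_entries(fragment):
        """Return the module search path entries that contain fragment."""
        return [entry for entry in sys.path if fragment in entry]
    ```

    For example, module_available("PyPDF2") returning False would match the ImportError quoted above, and search_path_entries("jython2") shows whether the expected directory is on the search path at all.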



    ------------------------------
    Adrian Williams
    Precisely Software Inc.
    ------------------------------