Data360 Analyze

  • 1.  Parquet Input/Output

    Posted 12-03-2020 18:10

    Hi

    Wondering if anyone has a solution they are willing to share for reading and writing Parquet formatted files with Analyze.

    https://parquet.apache.org/documentation/latest/

    There are several Python modules available that could potentially be leveraged:

    parquet-python (pure python, reader only) :  https://github.com/jcrobak/parquet-python

    fastparquet (Python 3 only) : https://github.com/dask/fastparquet

    PyArrow (Python 3 only) : https://arrow.apache.org/docs/python/

The priority use-case is to publish a partitioned Parquet dataset (which may be multiple physical files) to cloud-based storage (S3 or Azure) for efficient ingestion into Hadoop.

I'd also like to know whether Infogix has any roadmap plans to include Input and Output connector nodes for Parquet formatted datasets.

    Thanks in advance,

    Mario Ermacora



  • 2.  RE: Parquet Input/Output

    Employee
    Posted 01-04-2021 15:25

Mario - I have used the Apache Parquet libraries with Analyze in the past, with variations for all the use cases above. Attached is a reader that may work for you. I've included the 5 additional publicly available jar files you will need. Please note, these are not official nodes, but ones I have created and used in the past.

The Python modules look interesting and are probably a much simpler implementation. There are also a few options to read directly from cloud/HDFS data sources across other file types.


    Attached files

    parquetJars.zip
    Parquet Reader-Writer - 4 Jan 2021.lna




  • 3.  RE: Parquet Input/Output

    Posted 07-24-2024 19:49
    Edited by Shannon Collins 07-24-2024 19:51

    Hi,

Just seeing if there have been any advancements on this topic of reading/writing Parquet files in Data360 Analyze. Has anyone created a more concise solution for creating and reading Parquet files in Data360 Analyze, one that doesn't require additional jar files to be installed on the server, that they're happy to share?

I have a business requirement to publish a partitioned Parquet dataset (multiple physical files) to cloud-based ADLS storage.

    Thanks 

    Regards,

    Shannon



  • 4.  RE: Parquet Input/Output

    Employee
    Posted 07-25-2024 04:51

    There are currently no product nodes that generate Parquet format files.

However, the Azure Datalake Storage Put node can be used to upload files to ADLS Gen2:

    https://help.precisely.com/r/Data360-Analyze/3.14/en-US/Data360-Analyze-Server-Help/Node-help/Output-Connectors/Azure-Datalake-Storage-Put



    ------------------------------
    Adrian Williams
    Precisely Software Inc.
    ------------------------------