Data360 Analyze

  • 1.  Parquet Input/Output

    Posted 12-03-2020 18:10

    Hi

    Wondering if anyone has a solution they are willing to share for reading and writing Parquet formatted files with Analyze.

    https://parquet.apache.org/documentation/latest/

    There are several Python modules available that could potentially be leveraged:

    parquet-python (pure Python, reader only) : https://github.com/jcrobak/parquet-python

    fastparquet (Python 3 only) : https://github.com/dask/fastparquet

    PyArrow (Python 3 only) : https://arrow.apache.org/docs/python/

    The priority use case is to publish a partitioned Parquet dataset (which may consist of multiple physical files) to cloud-based storage (S3 or Azure) for efficient ingestion into Hadoop.
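
    To make the use case concrete, here is a minimal sketch of writing a partitioned dataset with PyArrow. It assumes PyArrow 7.0 or later (for Table.from_pylist) and has not been tried from within Analyze. The sample rows, the column names, and the sales_dataset path are hypothetical; the helper only illustrates the Hive-style col=value directory layout that write_to_dataset produces, one directory per partition value.

```python
from pathlib import PurePosixPath


def hive_partition_dir(root, row, partition_cols):
    """Return the Hive-style directory (root/col=value/...) a row lands in."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return str(PurePosixPath(root, *parts))


def write_partitioned(rows, root, partition_cols):
    """Write rows (a list of dicts) as a partitioned Parquet dataset.

    Requires PyArrow (pip install pyarrow). It is imported here so the
    path helper above stays usable without it.
    """
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pylist(rows)
    # One sub-directory per distinct combination of partition values,
    # each holding one or more physical Parquet files.
    pq.write_to_dataset(table, root_path=root, partition_cols=partition_cols)


# Hypothetical sample data; column names are illustrative only.
rows = [
    {"region": "EU", "year": 2020, "amount": 10.5},
    {"region": "US", "year": 2020, "amount": 7.25},
]

# Example call (writes to the local directory "sales_dataset"):
# write_partitioned(rows, "sales_dataset", ["region", "year"])
```

    For cloud targets, write_to_dataset also accepts a filesystem argument, so passing a pyarrow.fs.S3FileSystem instance there should in principle write straight to S3. Credentials and bucket setup are left out of this sketch.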

    I’d also like to know whether Infogix has any roadmap plans to include Input and Output connector nodes for Parquet formatted datasets.

    Thanks in advance,

    Mario Ermacora



  • 2.  RE: Parquet Input/Output

    Employee
    Posted 01-04-2021 15:25

    Mario - I have used the Apache Parquet libraries with Analyze in the past, with variations for all the use cases above. Attached is a reader that may work for you. I've included the 5 additional publicly available jar files you will need. Please note that these are not official nodes, just ones I have created and used in the past.

    The Python modules look interesting and are probably a much simpler implementation. There are also a few options to read directly from cloud/HDFS data sources across other file types.

     

    Attached files

    parquetJars.zip
    Parquet Reader-Writer - 4 Jan 2021.lna