Data360 Analyze

 View Only
  • 1.  Join node optimisation

    Posted 07-04-2021 01:55

    Hi, 

    When a D3S join node is used, I can see that the join keys are sorted before being matched. Is there any other optimization done inside the Java class of the "New Join" node?

    Do you think that a custom Python module using the a Pandas library join or merge function would be quicker or slower for large datasets?

    Thanks

    Scott



  • 2.  RE: Join node optimisation

    Employee
    Posted 07-07-2021 04:26

    I can't really comment on the details of any optimizations as I don't know the internal working of the implementation of the Join node. However, the sort functionality is optional on the node and, if your data is already pre-sorted, you can disable the sorting on one or both input data sets. Pre-sorting can decrease the overall processing time for a data flow if the same data set is to be joined in multiple nodes.

    Reference using Pandas:

    I do not believe we have run any performance tests that benchmarked Pandas against the Join node. However the new Join node in Analyze is 3-4x faster than the original Join node used in the Lavastorm LAE product - so if you have imported a legacy .brg file into Analyze you may want to benchmark the original Join nodes in the data flow versus using the new Join nodes as you may get a similar performance improvement.

    The technology stack used by Pandas is different to the Join node (C++ vs. Java) so there will be some variance in performance and this may be dependent on the platform you are using too (Windows vs. Linux). For a 'large' data set you may have differences due to the use of in-memory dataframes in Pandas vs. the use of disk-based spill-over used by the Join node. 

    If you use Pandas you will need to work out how to get both data sets into the environment of the node where you will be using the Pandas module. You may need to also modify the data type of fields to optimize the memory usage of Pandas. 

    A basic prerequisite to using the Pandas merge functions will be getting Pandas installed in the first place. It is not a trivial task to do this if you are using a Linux platform and you will probably need to get assistance from your IT team to install the Pandas package and it's prerequisite packages. The Pandas package can only be used by the Python node and is incompatible with the Transform node.

    Another issue with using Pandas is the resulting logic is not going to be as transparent as when using the Join node. This may also limit the long-term maintainability of the data flow as specialist technical skills will be required to understand the operation of the custom logic.