Data360 Analyze

  • 1.  Lookup node & performance

    Posted 09-05-2023 02:13

    What would be the consequences of frequently running Lookup nodes in Data360 Analyze on data with millions of rows and a large number of columns?

    In the documentation it states that "The Lookup node is recommended for use with a small data set on the right input as the entire right data set is loaded into memory"

    My question is: can this cause sluggish performance for Data360 Analyze and the machine it's installed on?

    As a note, I have over 100 GB of physical memory installed.



    ------------------------------
    Henrik B
    E.ON Sverige
    ------------------------------


  • 2.  RE: Lookup node & performance

    Employee
    Posted 09-05-2023 07:07
    Edited by Adrian Williams 09-06-2023 09:11

    The documentation for the Lookup node states:

    "The Lookup node works in a similar way to the Join and Merge nodes and is an optimization of the inner and left joins. The Lookup node is recommended for use with a small data set on the right input as the entire right data set is loaded into memory. An advantage of using the Lookup node is that the input data sets do not need to be sorted prior to using the node. If you want to join two large data sets, you should use the Merge or Join node."

    The choice of which correlation node to use depends on the specifics of your scenario. The Lookup node is intended for situations where the number of records on the Lookup (Right) input pin is relatively small compared with the number of records on the Data (Left) input pin. All of the records on the Lookup pin are loaded into memory. Even if the machine has sufficient memory available, you still need to ensure that the node has been given enough Java Heap Space to hold every record in the Lookup data set. You may need to specify the value of the JvmMaxHeapSize property (which is hidden by default) if you encounter Java Heap Space errors when running the node.
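
    As a rough way to reason about whether a Lookup input will fit in the heap, here is a back-of-envelope sketch in Python. It is not how the node itself accounts for memory: the function name, the per-field byte count, and the overhead multiplier are all assumptions made for illustration, and the real footprint would need to be measured.

```python
def estimate_lookup_bytes(rows, fields, avg_bytes_per_field, overhead_factor):
    """Rough size of an in-memory lookup table, in bytes.

    overhead_factor is a guessed multiplier for object and index overhead;
    it is an assumption to be calibrated, not a documented figure.
    """
    raw = rows * fields * avg_bytes_per_field
    return raw * overhead_factor


# Hypothetical inputs: 5 million lookup rows, 50 fields, about 20 bytes per
# field, and an assumed 2x overhead.
estimate = estimate_lookup_bytes(5_000_000, 50, 20, 2.0)
print(f"{estimate / 2**30:.1f} GiB")
```

    If an estimate like this comes out close to, or above, the heap given to the node, that is a hint to either raise JvmMaxHeapSize or switch to the Merge or Join node.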

    Because the Lookup node is an optimization of the inner and left joins used by the Join and Merge nodes, it will usually perform better than those nodes in situations where it can be used. All Right records are held in memory, so it does not have to repeatedly buffer a portion of the Right data set as it processes the Left (Data) records. It also does not have to pre-sort the data sets, which further improves its performance.
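
    To illustrate the idea (not the node's actual implementation), here is a minimal Python sketch of an in-memory lookup join. The function and parameter names are hypothetical. The whole right input is loaded into a dictionary first, and the left input is then streamed through in whatever order it arrives, which is why no sorting is needed.

```python
from collections import defaultdict


def lookup_join(left_rows, right_rows, key, keep_unmatched=False):
    """Join dict records on `key`, holding the whole right input in memory.

    keep_unmatched=False behaves like an inner join; True behaves like a
    left join, filling the right-hand fields with None.
    """
    index = defaultdict(list)
    right_fields = set()
    for row in right_rows:              # entire right input is loaded up front
        index[row[key]].append(row)
        right_fields.update(row)
    right_fields.discard(key)

    for left in left_rows:              # left input is streamed, unsorted
        matches = index.get(left[key])
        if matches:
            for right in matches:
                yield {**left, **right}
        elif keep_unmatched:
            yield {**left, **{f: None for f in right_fields}}


left = [{"id": 3, "v": "c"}, {"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
right = [{"id": 1, "name": "x"}, {"id": 3, "name": "z"}]
inner = list(lookup_join(left, right, "id"))
outer = list(lookup_join(left, right, "id", keep_unmatched=True))
```

    The memory cost is the dictionary built from the right input, so it grows with the number of right records and with the number of fields each record carries.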

    As with all correlation nodes, having a narrower set of fields in the data sets being joined may improve the overall efficiency (since the data for unused fields does not have to be marshalled into and out of the node during processing). If the expected fraction of records that will match is low (i.e. you are effectively filtering out the majority of input records because they do not match) and the data set is wide, it may be better to proceed in three steps: uniquely identify the records in the data set and remove the unnecessary fields prior to the join operation, perform the join, and then re-join the discarded fields back into the result set using the unique id.
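
    The narrow-then-rejoin approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the `_uid` field are hypothetical, the lookup keys are assumed to be unique, and the discarded fields are kept in a dictionary here, whereas in a real data flow they would sit in a separate data set.

```python
def narrow_join_rejoin(wide_rows, lookup_rows, key):
    """Join on `key` using only narrow records, then restore the wide fields."""
    # Step 1: tag each record with a unique id and set the other fields aside.
    discarded = {}
    narrow = []
    for uid, row in enumerate(wide_rows):
        discarded[uid] = {f: v for f, v in row.items() if f != key}
        narrow.append({"_uid": uid, key: row[key]})

    # Step 2: perform the join on the narrow records only.
    lookup = {r[key]: r for r in lookup_rows}   # assumes unique lookup keys
    matched = [{**n, **lookup[n[key]]} for n in narrow if n[key] in lookup]

    # Step 3: re-join the discarded fields, only for records that matched.
    result = []
    for m in matched:
        uid = m.pop("_uid")
        result.append({**m, **discarded[uid]})
    return result


wide = [
    {"id": 1, "a": "x", "b": "y"},
    {"id": 2, "a": "p", "b": "q"},
    {"id": 3, "a": "m", "b": "n"},
]
lookup = [{"id": 2, "label": "two"}]
joined = narrow_join_rejoin(wide, lookup, "id")
```

    The benefit comes from step 2 handling only the id and the join key; the wide fields are touched again only for the small fraction of records that survive the join.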



    ------------------------------
    Adrian Williams
    Precisely Software Inc.
    ------------------------------



  • 3.  RE: Lookup node & performance

    Employee
    Posted 09-06-2023 08:52

    My experience has been that the Lookup node aborts if the right input is too big. About 10 years ago I would run into trouble when approaching a million records on the right input, but half a million records was OK.

    I did not analyze that with respect to the amount of memory available for that node. Now, with much larger memory sizes, the limit before the abort may be larger. You might increase the memory on the node as Adrian said. Of course, don't increase the memory on the node so much that it causes issues with the operating system itself.



    ------------------------------
    Ernest Jones
    Precisely Software Inc.
    PEARL RIVER NY
    ------------------------------