Data360 Analyze

  • 1.  Run dependencies

    Posted 02-06-2022 08:48

    Hi Infogix, In my graph I have a series of nodes that are connected to each other, and each node takes a long time to run. When I run the graph (all nodes at once) the later nodes start running and effectively wait for the earlier nodes to complete.

    My question is whether that is the most performant approach, or whether I should add run dependencies so that the later nodes don't waste CPU waiting on the long-running earlier nodes?

    thanks

    Scott



  • 2.  RE: Run dependencies

    Employee
    Posted 02-08-2022 07:13

    If the 'later' nodes do not depend on an output of, or successful completion of, the 'earlier' nodes then the later nodes can be run independently in parallel with the earlier nodes. 

    However, whether the later nodes are wasting CPU time depends on the number of nodes already running in parallel in your data flow and what else is running on the Analyze server at that time. The nodes in your data flow will 'compete' with nodes in other users' data flows for the server's resources.

    By default the maximum number of nodes that can run in parallel in a data flow is set to four. This limit can be increased by the system administrator by setting a configuration property per the Help documentation on setting the thread limits. The thread limit is per data flow - if four data flows were running in parallel on the system then a maximum of 16 nodes would be running at any one time.
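The per-data-flow limit behaves like a counting semaphore with four permits. The sketch below is purely conceptual (Analyze's Controller is not public and these names are illustrative); it just shows that with a cap of four, ten ready nodes never have more than four running at once.

```python
import threading
import time

# Conceptual sketch only - not Analyze's actual implementation.
MAX_NODES_PER_FLOW = 4  # the documented default thread limit per data flow

def run_flow(node_count):
    permits = threading.Semaphore(MAX_NODES_PER_FLOW)
    lock = threading.Lock()
    running, peak = [0], [0]

    def run_node():
        with permits:                 # at most four nodes in flight at once
            with lock:
                running[0] += 1
                peak[0] = max(peak[0], running[0])
            time.sleep(0.01)          # stand-in for the node's real work
            with lock:
                running[0] -= 1

    threads = [threading.Thread(target=run_node) for _ in range(node_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak[0]

# Ten nodes are ready, but no more than four ever run concurrently.
print(run_flow(10))
```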

    For optimal performance of the server you should aim for a maximum of ~4 threads per CPU core. Above this figure the kernel's scheduler will waste a lot of time context switching between threads, leading to reduced overall performance.
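As a quick sanity check of that guideline (the server sizes below are purely illustrative, not from a real system):

```python
# Rough capacity check for the ~4-threads-per-core guideline.
cores = 8                                  # hypothetical Analyze server
threads_per_core = 4                       # guideline from above
thread_budget = cores * threads_per_core   # 32 concurrent node threads

flow_thread_limit = 4                      # default per-data-flow limit
max_parallel_flows = thread_budget // flow_thread_limit
print(max_parallel_flows)                  # 8 flows before oversubscribing
```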

    You should aim to schedule jobs to smooth out the overall load on the server.

    Using the newer nodes (Transform, Split, Sort, etc.) that are written in Java will also improve performance - especially when you have many nodes that each deal with a moderate amount of data. This is because they execute 'in-Container', which considerably reduces the start-up time of the node compared with running it in its own process.
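You can see the same effect in any language: spawning a fresh process per unit of work costs far more than calling the same work in-process. This is a generic demonstration of that start-up overhead, not Analyze's node runtime.

```python
import subprocess
import sys
import time

# Process start-up dominates when each 'node' does only a little work.
def tiny_work():
    return sum(range(1000))

t0 = time.perf_counter()
tiny_work()                          # in-process call: no start-up cost
in_process = time.perf_counter() - t0

t0 = time.perf_counter()
subprocess.run([sys.executable, "-c", "print(sum(range(1000)))"],
               capture_output=True)  # fresh interpreter per 'node'
own_process = time.perf_counter() - t0

print(own_process > in_process)
```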



  • 3.  RE: Run dependencies

    Posted 02-08-2022 07:49

    Thanks Adrian. Just to clarify a couple of points:

    1) Regarding later nodes competing with earlier nodes within the same data flow (where they depend on the previous node completing so can't be run in parallel), is it more efficient to use a run dependency clock between nodes, or is it effectively the same thing 'under the hood' as connecting them in series? Is it using a Java event listener?

    2) And just to double-check your statement that "By default the maximum number of nodes that can run in parallel in a data flow is set to four" - this is per scheduled instance of a data flow? i.e. not a maximum of 4 nodes running in parallel across all instances of the same data flow graph that could be running at the same time?

    much appreciated,

    Scott



  • 4.  RE: Run dependencies

    Employee
    Posted 02-08-2022 10:38

    There is a 'Controller' function within Analyze that controls the execution of nodes in the data flows, so there is a lot more going on under the covers. If there is a dependency between nodes then the simplest, and probably most efficient, method is to have a connection from an output pin of the last 'earlier' node to the first node of the 'later' nodes. This uses explicit sequencing based on the successful execution of the earlier node.

    Using a run dependency introduces extra overhead to the system - especially when you have a run dependency defined on the output clock pin of a Composite node that is connected to the input clock pin of another Composite node. The impact of this Composite-to-Composite clock overhead has been mitigated in recent versions of the product, but it is a bit more efficient to use a wired connection where possible to sequence the two Composites. You can then have one or more node-to-node run dependencies within the later Composite to sequence individual chains of nodes within it (if required).
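In generic terms, a wired connection amounts to "start the downstream node only once the upstream node has completed successfully". A minimal sketch of that pattern (again conceptual - not Analyze's Controller, and the names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Explicit sequencing: the 'later' node is submitted only after the
# 'earlier' node has finished, mirroring a pin-to-pin wired connection.
def node(name, results):
    results.append(name)
    return name

def wired_connection():
    results = []
    with ThreadPoolExecutor() as pool:
        earlier = pool.submit(node, "earlier", results)
        earlier.result()                       # wait for successful completion
        pool.submit(node, "later", results).result()
    return results

print(wired_connection())  # ['earlier', 'later']
```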

    If you have multiple schedules that are running the same data flow then the thread limit will apply separately to each running instance of the data flow.