If the 'later' nodes do not depend on an output of, or on the successful completion of, the 'earlier' nodes, then the later nodes can run independently, in parallel with the earlier nodes.
However, whether the later nodes are wasting CPU time depends on how many nodes are already running in parallel in your data flow and on what else is running on the Analyze server at that time. The nodes in your data flow will 'compete' with the nodes in other users' data flows for the server's resources.
By default, the maximum number of nodes that can run in parallel in a data flow is four. The system administrator can increase this limit by setting a configuration property, as described in the Help documentation on setting the thread limits. The thread limit is per data flow - if four data flows were running in parallel on the system, then a maximum of 4 x 4 = 16 nodes would be running at any one time.
For optimal performance of the server, aim for a maximum of roughly four threads per CPU core. Above this figure the kernel's scheduler will spend a significant amount of time context switching between the threads, leading to reduced overall performance.
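As a rough illustration of that sizing rule, here is a small plain-Java sketch (the class and variable names are illustrative, not part of the product; the numbers are just the default per-flow limit and the ~4 threads per core guideline mentioned above) that estimates how many data flows could run concurrently before the server starts to be oversubscribed:

```java
// Rough sizing sketch, not a product API: estimates how many data flows can
// run concurrently before exceeding the ~4 threads per CPU core guideline,
// assuming each flow runs the default maximum of 4 nodes in parallel.
public class ConcurrencyEstimate {

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int threadsPerCore = 4;   // rule-of-thumb ceiling from the guideline above
        int nodesPerFlow   = 4;   // default per-data-flow thread limit

        int threadBudget = cores * threadsPerCore;
        int maxFlows     = threadBudget / nodesPerFlow;

        System.out.printf("Cores: %d, thread budget: %d, "
                + "concurrent data flows before oversubscription: ~%d%n",
                cores, threadBudget, maxFlows);
    }
}
```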
You should aim to schedule jobs to smooth out the overall load on the server.
Using the newer nodes (Transform, Split, Sort, etc.) that are written in Java will also improve performance - especially when you have many nodes that are each dealing with a moderate amount of data. This is because they execute 'in-Container', which considerably reduces the start-up time of a node compared with running it in its own process.
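To illustrate why that start-up cost matters, below is a generic JVM sketch (not the Analyze engine's actual node-execution mechanism - the class and method names are purely illustrative). It compares a small unit of work done in-process with the cost of launching a separate JVM, which is a stand-in for the per-process start-up a node pays when it does not run in-Container; for nodes handling modest amounts of data, that start-up cost dominates:

```java
import java.util.concurrent.TimeUnit;

// Generic JVM illustration, not the product's implementation: running many
// small units of work in-process avoids paying process start-up cost for
// each one, which is where most of the time goes when the work itself is small.
public class StartupOverheadDemo {

    // Stand-in for a node doing a moderate amount of work in-process.
    static long inProcessWork() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // In-process: just a method call, no extra start-up cost.
        long t0 = System.nanoTime();
        inProcessWork();
        long inProcessMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);

        // Separate process: pay for process creation and JVM start-up before
        // any real work can begin ("java -version" used as a stand-in).
        long t1 = System.nanoTime();
        new ProcessBuilder("java", "-version")
                .redirectErrorStream(true)
                .start()
                .waitFor();
        long separateMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t1);

        System.out.printf("In-process call: ~%d ms, separate JVM launch: ~%d ms%n",
                inProcessMs, separateMs);
    }
}
```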