I don't know your performance requirements, nor the performance characteristics of the Analyze nodes involved, but I can think of other ways to split up the data that would probably run more slowly.
You could add outputs to a Transform node and apply the Python modulus operator to the value of node.execcount and the number of outputs; the remainder determines which output pin each record is sent to.
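The round-robin idea above can be sketched in plain Python. This is not the Transform node API itself; the record index here stands in for node.execcount, and NUM_OUTPUTS is an assumed pin count:

```python
# Round-robin partitioning sketch.
# Assumptions: a running record counter (standing in for node.execcount)
# and a fixed number of output pins (NUM_OUTPUTS).
NUM_OUTPUTS = 4

def pick_output(record_index, num_outputs=NUM_OUTPUTS):
    """Return the output pin index for this record via modulus."""
    return record_index % num_outputs

# Distribute 10 sample records across the outputs.
buckets = {}
for i in range(10):
    buckets.setdefault(pick_output(i), []).append(i)
```

Because consecutive records land on consecutive pins, each subset ends up close to the same size without any hashing cost.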
If instead you want to limit the number of records processed per run, you could use the Looping node: have it extract a chunk of, say, 100,000 records for upload, then the next 100,000 on the following iteration, and so on.
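The chunked-upload approach can be sketched generically. This is an illustration only, not the Looping node itself; the chunks helper and upload stand-in are hypothetical names, and the 100,000 chunk size matches the example above:

```python
# Chunked iteration sketch: process a large record set in fixed-size
# slices, one chunk per loop iteration (as the Looping node would).
CHUNK_SIZE = 100_000

def chunks(records, size=CHUNK_SIZE):
    """Yield successive fixed-size slices of a record list."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

# Usage: upload one chunk at a time instead of the whole dataset.
# for batch in chunks(all_records):
#     upload(batch)  # hypothetical upload step, e.g. Publish to DQ+
```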
------------------------------
Ernest Jones
Precisely Software Inc.
PEARL RIVER NY
------------------------------
Original Message:
Sent: 09-21-2023 05:33
From: Gail Sinclair
Subject: Efficient method of splitting datasets
We have some large datasets that we want to send to DQ+. We get a memory error with the Publish to DQ+ node due to the size, so we plan to split the data into many smaller datasets and load those.
I've used the Hash node to do this, which seems to work quite well; however, our performance testers are concerned that it uses up to 75% CPU.
Can anyone suggest a more CPU-efficient method of splitting a dataset into multiple subsets?
Many thanks
------------------------------
Gail Sinclair
Hargreaves Lansdown PLC
------------------------------