Data360 Analyze

  • 1.  Efficient method of splitting datasets

    Posted 09-21-2023 05:34

    We have some large datasets that we want to send to DQ+.  The Publish to DQ+ node fails with a memory error because of their size, so we plan to split the data into many smaller datasets and load those instead.
    I've used the hash node to do this (the idea is sketched below), which seems to work quite well - however, our performance testers are concerned that it uses up to 75% CPU.
    Can anyone suggest a more CPU-efficient method of splitting a dataset into multiple subsets?
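    For what it's worth, the hash-based split boils down to something like the following (a generic Python sketch of the idea, not the hash node's actual implementation; the key field and subset count are assumptions):

        import hashlib

        NUM_SUBSETS = 10  # assumed number of smaller datasets

        def subset_for(key):
            # Stable hash of the record's key field, reduced modulo the
            # subset count, so the same key always lands in the same subset.
            digest = hashlib.md5(str(key).encode()).hexdigest()
            return int(digest, 16) % NUM_SUBSETS

        # e.g. subset_for("customer-42") -> a value in 0..9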

    Many thanks



    ------------------------------
    Gail Sinclair
    Hargreaves Lansdown PLC
    ------------------------------


  • 2.  RE: Efficient method of splitting datasets

    Employee
    Posted 09-22-2023 09:25

    I don't know your performance requirements, nor the performance characteristics of the Analyze nodes. I can think of other ways to split up the data, but they would probably run more slowly.

    You could add outputs to a Transform node and use the Python modulo operator on the value of node.execcount and the number of outputs to determine which output pin each record is routed to (a sketch of the idea follows).
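    As a rough illustration of that modulo routing (plain Python, not the Transform node's actual per-record script; the four-output count and the record source are assumptions):

        NUM_OUTPUTS = 4  # assumed number of output pins added to the node

        def output_pin_for(record_index):
            # Record 0 -> pin 0, record 1 -> pin 1, ...,
            # record 4 -> pin 0 again, giving a round-robin split.
            return record_index % NUM_OUTPUTS

        records = range(10)  # stand-in for the incoming dataset
        buckets = [[] for _ in range(NUM_OUTPUTS)]
        for i, rec in enumerate(records):
            buckets[output_pin_for(i)].append(rec)
        # buckets[0] == [0, 4, 8], buckets[1] == [1, 5, 9], etc.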

    Maybe instead you want to limit the number of records that are processed at a time; for that you need the Looping node.  You could make it extract a chunk of, say, 100,000 records for upload, then the next 100,000 on the next iteration, and so on (sketched below).
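    A minimal sketch of that chunking pattern (generic Python; upload_chunk is a hypothetical placeholder for whatever performs the Publish to DQ+ step):

        CHUNK_SIZE = 100_000

        def chunks(records, size=CHUNK_SIZE):
            # Yield successive slices of `size` records until the
            # dataset is exhausted.
            for start in range(0, len(records), size):
                yield records[start:start + size]

        def upload_all(records):
            for iteration, chunk in enumerate(chunks(records), start=1):
                upload_chunk(chunk)  # hypothetical: one upload per loop iteration
                print(f"iteration {iteration}: uploaded {len(chunk)} records")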



    ------------------------------
    Ernest Jones
    Precisely Software Inc.
    PEARL RIVER NY
    ------------------------------



  • 3.  RE: Efficient method of splitting datasets

    Posted 09-22-2023 09:40

    Thanks, Ernest

    The Analyze engineering team have advised that this CPU spike is not a concern, as Analyze will use as much CPU as is available if nothing else is running.  If other processes are running, the node will use less but take longer.

    I will keep your suggestions in mind though.



    ------------------------------
    Gail Sinclair
    Hargreaves Lansdown PLC
    ------------------------------