Data360 Analyze

 View Only
  • 1.  Performance questions

    Posted 11-17-2021 07:03

    Hi Infogix

    Please can you help with the below performance-related questions:

    1) As I understand it, Analyze automatically implements a cache of data for each output pin of an Analyze node. So if I have an input file of 1GB in size that I load, say, using a CSV file loader, and I then connect that to 10 separate Transform nodes, all of which just pass the data on, would I essentially have 11 copies of the same file in the cache, with a total of 11GB of RAM taken up by this file caching?

    2) Also, are files that are written to the temp folder (e.g. <Data360Analyze data directory>/data-<port number>/executions) stored in the cache?

    3) If I am writing a CSV file out to a location other than the temp folder but I want it to be cached, do I just write it with a Checkpoint node? And then, elsewhere in my graph, can I use a separate Checkpoint node (pointing at the same file location) to retrieve it from the cache, or do I need to connect up to the original Checkpoint node?

    4) For in-container node execution, how does D3S determine how many nodes to put in one JVM? Does it just run as many as it can that are all connected together (and are Java under the hood)? Is the Python node also included in this, as under the hood it is Jython?

    thanks

    Scott

     




  • 2.  RE: Performance questions

    Employee
    Posted 11-17-2021 10:36

    Re 1) The data associated with each output pin of a node is stored in a temporary data file (a .brd file). These files are stored on disk, not held in memory, so they do not consume RAM - but they will consume disk space. The .brd format is not CSV, and the stored file may be somewhat smaller than the input .csv file, but, essentially, you will be consuming GBs of disk space if a large data file is processed through multiple Transform nodes. The temporary data files generated (implicitly) by the system are purged according to the system's data retention settings, which are configured separately for ad-hoc data flow execution runs and for scheduled executions of data flows.
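To make the disk-space implication concrete, here is a back-of-the-envelope estimate for the scenario in the original question (the assumption that each pin's .brd file is roughly the size of the input is a simplification - as noted above, .brd files may be somewhat smaller than the source CSV):

```python
# Rough disk-usage estimate for implicit per-pin .brd caching.
# Simplifying assumption: each output pin's .brd file is about the
# same size as the input data.
input_gb = 1.0          # size of the loaded CSV file
transform_nodes = 10    # pass-through Transform nodes downstream

# One output pin on the CSV loader plus one per Transform node.
output_pins = 1 + transform_nodes

# Disk (not RAM) consumed by the implicit temporary data files.
est_disk_gb = output_pins * input_gb
print(output_pins, est_disk_gb)  # 11 pins, ~11 GB of disk
```

So the "11 copies" intuition is broadly right, but the cost is disk space rather than RAM.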

    Re 2) The system's temporary data directory (<Data360Analyze data directory>/data-<port number>/executions) contains one or more directories for each user of the system (only 'admin' on a Desktop instance). On a Server instance the user's id (e.g. 'bob') is used as the directory name. The temporary data generated by a user's data flow will be stored in a sub-directory of their main directory, e.g. executions/bob/<data_flow_name>/
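If you want to keep an eye on how much temporary data a given user's data flows are accumulating, a small script like the following can help. The directory path is illustrative - substitute your actual data directory and user name:

```python
import os

def dir_size_bytes(path):
    """Total size in bytes of all files under `path`, e.g. a per-user
    executions directory. Useful for monitoring temp data growth."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

# Example (path is illustrative):
# print(dir_size_bytes(
#     "<Data360Analyze data directory>/data-<port number>/executions/bob"))
```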

    The executions directory also contains a 'cache' directory. The contents of the executions/cache directory should NOT be deleted as this can cause problems. This directory contains (among other things) compiled versions of the java code for nodes that have been executed. The compiled node components are used for subsequent execution of a node to improve performance.

    Re 3) You can use a Checkpoint node to write the data, but you would need to explicitly define the CheckpointFilename property rather than using the default {{^handle^}} textual substitution value of the property, as this is different for each node. A Checkpoint node will not run if its input pin is not connected, so you cannot use it without an upstream node to trigger its execution.

    The Checkpoint node is a composite node that leverages nodes that write and read .brd files. It may be more transparent (from a data flow logic perspective) to simply use an Output BRD node to write the imported data and then use separate BRD File nodes elsewhere in your data flow to read the data from the explicitly defined .brd file. You should consider implementing logic to delete explicitly created .brd files, as the system will not include them in the temporary data purge mechanism.
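Since the system's retention mechanism will not purge explicitly created .brd files, some housekeeping is needed. Here is one possible sketch of such cleanup logic (the directory, age threshold, and function name are all illustrative, not part of the product):

```python
import os
import time

def purge_old_brd_files(directory, max_age_days=7, dry_run=True):
    """Delete explicitly written .brd files in `directory` that are
    older than `max_age_days`. Returns the list of affected paths.
    With dry_run=True it only reports what would be deleted."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.endswith(".brd") and os.path.isfile(path):
            if os.path.getmtime(path) < cutoff:
                removed.append(path)
                if not dry_run:
                    os.remove(path)
    return removed
```

Running with dry_run=True first and reviewing the returned list before deleting anything is a sensible precaution.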

    4) There is a system property that defines the maximum number of threads available in the container. Only Java-based nodes are eligible to run in-container; however, not all Java-based nodes run in-container (e.g. the JDBC Query node). Nodes do not have to be connected to run in-container. When an eligible node is ready to run, if the limit on the maximum number of concurrently running nodes has not been reached, the system will attempt to run it in the container. If the limit has been met, the node will run in its own process space.

    The thread limit can be configured in your cust.prop via the property:  ls.brain.server.nodeContainer.inContainerPoolSize
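For example, a cust.prop entry would look like the following (the value 8 here is purely illustrative, not a recommendation - see the tuning discussion later in this thread):

```properties
ls.brain.server.nodeContainer.inContainerPoolSize=8
```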

    In practice, the decision to run an eligible node in-container depends on a number of factors, including the type of node being run, the number of fields in the input data and the number of records in the data. Some nodes that have a large number of data 'cells' in their input data will be run in their own process space.

    The Python node does not use Jython (and it is not Java-based) - the Python node leverages CPython, allowing it to execute third-party Python packages that are written in C/C++. The Python node always runs in its own process space.



  • 3.  RE: Performance questions

    Posted 11-17-2021 11:33

    Thanks Adrian for the detailed explanation, it is very helpful.

    One more question regarding the setting for the thread limit, which by default is set to 4 and governs how many nodes are able to execute concurrently for a given execution.

    In your experience, would you recommend increasing this to align with the number of CPU cores, or perhaps the number of cores x 4, assuming each core has 4 virtual threads?

    I appreciate we would only see performance improvements based on the statement below, taken from the Infogix help.

    "Increasing the number of controller farm threads will only have an impact on performance if there are enough parallel paths (branches) in your data flow to allow more nodes to be run concurrently. That is, this setting will have no affect on nodes whose execution depends on a previous node due to a connection or run dependency; in these cases the dependent nodes in a branch will still be run sequentially regardless of any change to this setting."

     

    thanks

    Scott

     



  • 4.  RE: Performance questions

    Employee
    Posted 11-23-2021 11:10

    There is no specific rule about how many threads you should run; however, creating an excessive number of threads will cause problems. This is particularly true for the process threads used for nodes that are run outside of the node container - you should not exceed a maximum of 4 threads per CPU (as seen by the Linux kernel). For in-container node execution, the Java threads are more lightweight than Linux process threads, but it is still prudent to be conservative with the number of concurrent nodes running in-container. You may want to increase the number of in-container threads towards, say, 20 in a number of stages, checking the performance and memory utilization of the system at each stage (the system needs to be restarted for a new value to be recognised).
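Putting the two guidelines together (at most 4 process threads per CPU, and stepping in-container threads towards roughly 20), a conservative starting point could be computed like this - the formula and the cap of 20 are illustrative, not product defaults:

```python
import os

# Conservative starting point for thread settings, following the
# guideline above of at most 4 threads per CPU seen by the kernel.
cpus = os.cpu_count() or 1
max_process_threads = 4 * cpus

# Cap the in-container pool at the suggested ballpark of 20, or lower
# if the CPU-based limit is smaller; tune in stages from here while
# monitoring performance and memory utilization.
suggested_in_container = min(20, max_process_threads)
print(cpus, max_process_threads, suggested_in_container)
```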

    As nodes running in-container are fast to start up and relatively short-lived, a lot of nodes can be processed even with a moderate number of in-container threads. As more nodes run simultaneously in-container, the Java heap space used by the container will increase, and you should monitor its RAM usage.

    The node container has no maximum heap size set by default. You can configure the maximum amount of heap memory that the node container can use by adding the property ls.brain.server.nodeContainer.javaMaxHeapSize to your cust.prop. For example, to set the container to use a maximum of 2GB, add the following line to your cust.prop file:

    ls.brain.server.nodeContainer.javaMaxHeapSize=2048m