Data360 Analyze

Performance Questions Data360

1. Performance Questions Data360

Mark Bergsma
Posted 01-15-2021 10:21

Hi,

I am currently testing the Data360-app with the aim to migrate my graphs from LAE to Data360.

In general, I am facing serious performance issues in Data360 compared to LAE. Please see below further information on my HW/SW-specifications, and the issues I am having:

- Machine specifications: Intel Core i7 (4-core CPU), 16 GB internal memory, 1 TB SSD (90% free disk space), Windows 10, Google Chrome v87.0.4280.141

- LAE version 6.1.40

- Data360 version: Windows desktop version 3.6.7.6658

- Complexity of dataflows: the dataflows are relatively simple (no clocks, Cartesian products, fuzzy matching, etc.) and mainly contain data transformation and aggregation steps.

- I run max. 4 nodes at the same time within a dataflow

- Amount of data to be processed is relatively large: for most of my dataflows/graphs the raw input data contains approximately 300 million records.

FINDINGS:

- Overall, D360 requires significantly more processing time than LAE (approximately a 20% increase in processing time for dataflows that run from 2 up to 7 hours)

- Whereas in LAE I was able to perform tasks in other applications, running the flows in D360 consumes ~100% of CPU and internal storage capacity

- Regardless of the size of the dataflow and its related data sources, the D360 app is very slow (a lot of "loading" waiting time) for the following actions, and this gets even worse when a flow is actually running:

- Starting-up D360

- Browsing and toggling thru the Directory User Interface

- Opening a dataflow

- Browsing and toggling through a dataflow (e.g. entering and exiting composite nodes)

- Connecting nodes and/or bending connectors

- Toggling through the Properties panel

- Opening data in DataViewer

QUESTIONS:

I have gone through all the materials on performance issues that can be found in the documentation and the community, but unfortunately I did not find anything that could help me tackle these issues. Therefore:

- What can I do to improve the overall performance of D360?

- Is there an option similar to the LAE "Run in aggressive mode" within D360 to help improve processing speed?

Thank you in advance for your reply and support.

Kind Regards,

Mark
2. RE: Performance Questions Data360

Employee

Adrian Williams
Posted 01-21-2021 07:13

Hi Mark,

We are sorry to hear about the performance issues you are experiencing.

Regarding the execution performance, you may want to investigate which nodes account for the longest processing times. You may be able to replace existing nodes with some of the new nodes available in Analyze, as these provide improved performance on large data sets, e.g.:

Sort: 2-3x

Lookup: 2-9x

Transform: ~1.3x

Merge: (vs old X-Ref) 3-4x

Aggregate: (vs old Agg-Ex) ~1.3x

Join: 3-4x

Tail: ~10x

CSV input: 2-5x
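As a rough illustration of the advice above, the following Python sketch ranks nodes by processing time and estimates the total runtime after replacing only the slowest ones. It is a back-of-the-envelope calculation, not an Analyze feature: the node names and timings are hypothetical, the speedup factors are the lower bounds of the ranges quoted above, and the estimate simply sums node times, so it ignores nodes that run in parallel.

```python
# Lower bounds of the speedup ranges quoted in this post (new node vs. legacy node).
QUOTED_SPEEDUP = {
    "Sort": 2.0,
    "Lookup": 2.0,
    "Transform": 1.3,
    "Merge": 3.0,
    "Aggregate": 1.3,
    "Join": 3.0,
    "Tail": 10.0,
    "CSV input": 2.0,
}


def rank_bottlenecks(timings):
    """Return (node_name, node_type, seconds) tuples, slowest first."""
    return sorted(timings, key=lambda t: t[2], reverse=True)


def estimate_total(timings, replace_top_n):
    """Estimate the summed runtime if the N slowest nodes are replaced."""
    total = 0.0
    for i, (_name, node_type, seconds) in enumerate(rank_bottlenecks(timings)):
        if i < replace_top_n:
            # Unknown node types are assumed to gain nothing.
            seconds = seconds / QUOTED_SPEEDUP.get(node_type, 1.0)
        total += seconds
    return total


# Hypothetical per-node timings in seconds, e.g. noted down after a run.
timings = [
    ("load_raw", "CSV input", 1200.0),
    ("sort_by_key", "Sort", 3600.0),
    ("xref_customers", "Merge", 2400.0),
    ("clean_fields", "Transform", 600.0),
]

baseline = estimate_total(timings, 0)
after_two = estimate_total(timings, 2)
print(f"baseline: {baseline:.0f}s, top 2 replaced: {after_two:.0f}s")
```

With these made-up numbers, replacing only the two slowest nodes already accounts for most of the estimated gain, which is the reason for looking at bottleneck nodes first.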

Re. design-time performance, if there are many nodes in the data flow you may have better performance if you use Composites to group related nodes so they are displayed on different 'planes' in the UI. However, if you are clocking nodes together then you should avoid the implicit many-to-many node clocking that results from clocking one composite to another as discussed here https://doc.infogixsaas.com/analyze/Default.htm#j-admin/composite-performance.htm

Can you confirm what you mean by "Run in aggressive mode"? Are you referring to streaming execution? If so, this mode of operation is not supported by Analyze.

There are upcoming enhancements to Analyze which will improve the design-time performance. These enhancements will be delivered in upcoming releases.
3. RE: Performance Questions Data360

Mark Bergsma
Posted 01-22-2021 02:29

Hi Adrian,

Thank you very much for your reply and assistance. Just to confirm:

- In order to improve the processing times, you recommend "manually" replacing the different superseded nodes with the new nodes?

- By "Run by Aggressive" I mean an option in the LAE functionality whereby the "in-between" steps of the analysis flows (i.e. the outputs of the individual nodes) are not saved, and only the output of the last node executed is saved. Is this also supported by Data360?

In addition: looking at the Task Manager of my operating system, I noticed that it is the "Open JDK platform binary" process that consumes almost all my CPU capacity when I start Data360 and run nodes. Even when idle (Data360 open, but no dataflow opened), it takes almost 30% of the CPU. Is this something you have encountered before, and do you have any suggestions to fix it?

Looking forward to your reaction!

Regards,

Mark
4. RE: Performance Questions Data360

Employee

Adrian Williams
Posted 01-22-2021 05:00

There are no tools to automatically replace legacy LAE nodes that use BRAINScript, so it would be necessary to manually replace the nodes with the equivalent Analyze node (e.g. using a Transform node in place of a legacy Transform (Superseded) [aka Filter] node). While it would be 'cleaner' to refactor an entire data flow to use the new nodes, it may be that the majority of the performance improvements could be gained by replacing a few 'bottleneck' nodes.

I do not know of any configuration options in Analyze that provide an equivalent of the 'Run by aggressive' mode, but I will confer internally to confirm the situation.

When Analyze is started up, the web application runs in Tomcat. This will be the Open JDK platform binary that has the biggest memory footprint. It can consume a large amount of CPU time when it initially starts and continues to consume CPU time more sporadically as it 'settles down', but eventually it should reduce to a low level (as an example, see below for the current Task Manager view of my PC performance). Which anti-virus software are you using? There may be an interaction with it when the application is running and writing cache files, etc., that is impacting performance.
5. RE: Performance Questions Data360

Mark Bergsma
Posted 01-29-2021 01:59

Hi, I am still facing a lot of performance problems. It seems that the status of the app/workflow is continuously "Updating Document".

So far i have done the following:

- Optimised the dataflows by replacing superseded nodes with Python-based nodes

- Excluded Data360.exe from real-time virus detection

- Excluded OpenJdkPlatform.exe from real-time virus detection

I'm getting quite desperate here: my laptop is working fine and I only face these problems when using Data360. Every action I perform in Data360 literally takes minutes to complete.

Do you have any other suggestions on how to improve performance?

Mark

P.S.

I work on Google Chrome version 88.0.4324.104 (Official build) (64-bit) with Data360 Analyze v3.6.7.6658
6. RE: Performance Questions Data360

Employee

Gerard Cafaro
Posted 01-29-2021 05:44

The architecture of the Analyze Desktop product is very different from the LAE Desktop product.

When you install the Analyze desktop product, the application footprint consists of the Tomcat web application server, the Analyze Server, and an H2 file-based database, whereas LAE Desktop consists of the BRE thick client and the LAE Server.

This means that from a general high level point of view the Analyze product does require and use more system resources than that of LAE Desktop.

*** What can I do to improve the overall performance of D360? ***

In terms of execution processing - a lot of work has been done in Analyze to improve the execution performance – we have rewritten a lot of nodes and these have therefore superseded the old nodes.

For example:
- Agg Ex - renamed to "Agg Ex (Superseded)" and replaced by Aggregate
- Filter - renamed to "Filter (Superseded)" and replaced by Filter for basic filtering, and by Transform for scripted transformations (now in Python and not BRAINScript)
- All the Join nodes have been superseded and replaced with new nodes
- Sort - renamed to "Sort (Superseded)" and replaced by Sort

Therefore a Data Flow in Analyze using the new nodes should perform better than the equivalent Graph using the old nodes in BRE.

It would be interesting to see your results as you start to migrate your Graphs from .brg files into Analyze and start swapping out the old nodes for the new ones.

*** Is there an option similar to the LAE "Run in aggressive mode" within D360 to help improve processing speed? ***

300 million records seems quite a lot to process on a Desktop, especially in a larger Data Flow, and given that, the answer to this is currently "no". In Analyze for ad-hoc runs where you go in and run nodes interactively, interim data is always written to file and only cleaned up when you re-run or clear the nodes.
There is a general setting you can use to determine how long to keep ad-hoc runs for, although that is just in <number of days>.
For scheduled runs, there are settings for when to delete temporary data, but again this is on completed runs rather than as the run is running.

- Starting-up D360 Analyze
Due to the difference in the product architecture, this will generally be slower

- Browsing and toggling through the Directory User Interface
This is something we know about, and will be improving in a future release.

- Opening a dataflow / Browsing and toggling through a dataflow (e.g. entering and exiting composite nodes) / Connecting nodes and/or bending connectors / Toggling through the Properties panel
These are things that we know about and are currently working on. You should see major improvements here once we move into the 3.8 series of Analyze releases.

- Opening data in DataViewer

This is something we know about. Generally, larger and wider datasets will take longer to load. Again, it is something we will be looking at in the future.

In summary…
- Execution: should be faster than in BRE if you replace the superseded / deprecated nodes.
- Design-time performance issues: we know about these, are actively working on them, and expect major improvements in coming releases.
- Resource usage: due to the architectural changes, however, the overall resource usage will be higher than in LAE.
7. RE: Performance Questions Data360

Mark Bergsma
Posted 01-29-2021 06:41

Hello Christina.

Thank you for your quick response.

Regarding Runtime of LAE vs D360:

- I have used a relatively simple dataflow (processing and transforming 2 datasets of approximately 5 million records, with an "x-ref" to join both sets at the end of the flow)

- For all the superseded nodes I created an alternative node using the Data360/Python alternative

- First I did a node-by-node comparison of the processing times of the superseded nodes vs. the alternative nodes. In some cases the alternative nodes were faster (especially the x-ref); in other cases the superseded nodes were faster (especially for standardizing and normalizing fields using the old "filter" vs. the new "transform" nodes, as well as the filter/split nodes)

- Then I optimized the dataflow by taking the fastest node for each step

- Finally I compared the processing time in LAE with the processing time in Data360 using the optimised dataflow. Unfortunately, the LAE graph was processed significantly faster.

- Furthermore: I realise that the earlier example I gave, processing 300M records, is quite a lot for a desktop. However, although it takes some hours using LAE, it has never given me any problems and I was always able to perform other tasks in other apps (or graphs) while the graph was running. I would expect similar performance from D360, even though the "aggressive mode" is not available
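The node-by-node comparison described in the steps above can be sketched in Python as follows. This is only an illustration of the selection logic: the step names, node variants and timings are hypothetical placeholders, not the measurements from this test.

```python
def pick_fastest(measurements):
    """Map each step to the (variant, seconds) pair with the lowest time."""
    best = {}
    for step, variant, seconds in measurements:
        if step not in best or seconds < best[step][1]:
            best[step] = (variant, seconds)
    return best


# Hypothetical timings in seconds: one row per (step, node variant) pair.
measurements = [
    ("join", "X-Ref (Superseded)", 300.0),
    ("join", "Merge", 90.0),
    ("normalize", "Filter (Superseded)", 120.0),
    ("normalize", "Transform", 150.0),
]

best = pick_fastest(measurements)
optimised_total = sum(seconds for _variant, seconds in best.values())

for step, (variant, seconds) in best.items():
    print(f"{step}: use {variant} ({seconds:.0f}s)")
print(f"optimised total: {optimised_total:.0f}s")
```

Summing the per-step minimum gives a lower bound for the mixed dataflow, which can then be compared with the end-to-end time measured in LAE.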

Regarding design-time performance issues: understood. Do you have any idea when the 3.8 series is going to be released?

Best regards,

Mark
