Thanks for the help Adrian, the solution works.
I ran a test with three different data sets (each with more than 10^6 rows), and it turned out that the solution I first posted was 18.2% faster than your version (taking the mean of the three tests).
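For reference, the comparison can be reproduced outside the node with plain Python. A minimal timing sketch of the two approaches (the row dict here is a made-up stand-in for the record object the Transform node passes in):

import hashlib
import timeit

# Made-up stand-in for the node's record object
row = {"id": 12345, "name": "example", "amount": 99.95}

def hash_loop(in1):
    # Original approach: repeated string concatenation
    strRow = ""
    for strCol in in1:
        strRow = strRow + str(in1[strCol])
    return hashlib.md5(repr(strRow).encode('utf-8')).hexdigest()

def hash_join(in1):
    # Suggested approach: single-pass join over the field values
    strRow = "".join(str(in1[strCol]) for strCol in in1)
    return hashlib.md5(repr(strRow).encode('utf-8')).hexdigest()

print(timeit.timeit(lambda: hash_loop(row), number=10**6))
print(timeit.timeit(lambda: hash_join(row), number=10**6))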
Original Message:
Sent: 05-30-2024 03:25
From: Adrian Williams
Subject: Row Hash optimization, Data comparison/validation test
Ah, my apologies, there was an element missing in the original logic, which I have also corrected in my original message above. The new statement hashes the values of the fields and now produces the same results as your code.
strRow = "".join(str(in1[strCol]) for strCol in in1)
------------------------------
Adrian Williams
Precisely Software Inc.
Original Message:
Sent: 05-30-2024 02:20
From: Henrik B
Subject: Row Hash optimization, Data comparison/validation test
Thanks for the answer Adrian.
The problem with the new script compared to the old one is that I don't get unique values in the RowHash output; all the values are the same.
With the old script I get a unique RowHash value for each row in my test dataset.
Modified script:
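(A sketch of the modified ProcessRecords body, assuming the rest of the script matches your snippet and the output field is named RowHash:)

out1 += in1
strRow = "".join(str(strCol) for strCol in in1)
hashId = hashlib.md5()
hashId.update(repr(strRow).encode('utf-8'))
out1.RowHash = hashId.hexdigest()  # output field name assumed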

------------------------------
Henrik B
E.ON Sverige
Original Message:
Sent: 05-29-2024 11:48
From: Adrian Williams
Subject: Row Hash optimization, Data comparison/validation test
The Calculate Fields node is not really suitable for this type of processing as there are multiple steps, each of which needs the output of the previous step to create the final result. The Transform node is more appropriate and provides a more transparent view of the processing being performed on the data. Both nodes are Java based, so there is no intrinsic benefit to using a Calculate Fields node; it may actually perform worse due to the need to write out each intermediate step.
You could improve the performance of the code in the Transform node by replacing the lines in the ProcessRecords script that concatenate the field string values with the following, which uses the join() method:
out1 += in1  # copy the input record to the output

# Previous approach: repeated string concatenation
#intCol = 0
#strRow = ""
#for strCol in in1:
#    strRow = strRow + str(in1[intCol])
#    intCol = intCol + 1

# Single-pass concatenation using join()
strRow = "".join(str(strCol) for strCol in in1)

hashId = hashlib.md5()  # assumes hashlib is imported in the node script
hashId.update(repr(strRow).encode('utf-8'))
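The digest would then be assigned to the output field as before, for example (the RowHash field name is an assumption based on the other posts in this thread):

out1.RowHash = hashId.hexdigest()  # hex string of the MD5 digest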
------------------------------
Adrian Williams
Precisely Software Inc.
Original Message:
Sent: 05-29-2024 04:58
From: Henrik B
Subject: Row Hash optimization, Data comparison/validation test
Hello, this is a great solution I'm using to generate a hash value for an entire row, for any number of columns and any volume of data.
I use this for data comparison/data validation tests to see if two datasets are exactly the same: I run both datasets through the Transform node below and then do a join/merge to see if I get a 100% match. Note that the order of the columns has to be the same in both data sets.
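A sketch of the ProcessRecords script (the hashlib import and the RowHash output field name are assumptions):

import hashlib  # assumed to be imported in the node script

out1 += in1  # copy the input record to the output

# Concatenate the string value of every field in the row
intCol = 0
strRow = ""
for strCol in in1:
    strRow = strRow + str(in1[intCol])
    intCol = intCol + 1

# MD5 the concatenated string and emit it as the row hash
hashId = hashlib.md5()
hashId.update(repr(strRow).encode('utf-8'))
out1.RowHash = hashId.hexdigest()  # output field name assumed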


This is something I use frequently, and the question I have is:
Would this be possible to do in a "Calculate Fields" node? I assume the performance would be much better in that node for this type of operation.
I also hope others can benefit from this solution.
Best Regards
------------------------------
Henrik B
E.ON Sverige
------------------------------