Data360 Analyze

  • 1.  Row Hash optimization, Data comparison/validation test

    Posted 05-29-2024 04:58

    Hello, this is a great solution I'm using to generate a hash value for an entire row with any number of columns and any amount of data.

    I use this for data comparison/data validation tests to check whether two datasets are exactly the same: I run both datasets through the Transform node script below and then join/merge on the hash to see if I get a 100% match. Note that the order of the columns has to be the same in both datasets.
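
    Roughly, the ProcessRecords script looks like this (a minimal sketch; the RowHash output field name is just what I use, and it assumes a RowHash field has been declared on the output):

    import hashlib

    out1 += in1  # copy every input field to the output

    # Build one string from the string value of every field in the row
    intCol = 0
    strRow = ""
    for strCol in in1:
        strRow = strRow + str(in1[intCol])
        intCol = intCol + 1

    # Hash the concatenated row string
    hashId = hashlib.md5()
    hashId.update(repr(strRow).encode('utf-8'))
    out1.RowHash = hashId.hexdigest()  # write the digest to the output field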

    This is something I use frequently, and the question I have is: would it be possible to do this in a "Calculate Fields" node? I assume the performance would be much better in that node for this type of operation.

    I also hope others can benefit from this solution.

    Best Regards



    ------------------------------
    Henrik B
    E.ON Sverige
    ------------------------------


  • 2.  RE: Row Hash optimization, Data comparison/validation test

    Employee
    Posted 05-29-2024 11:49
    Edited by Adrian Williams 05-30-2024 03:21

    The Calculate Fields node is not really suitable for this type of treatment, as there are multiple steps, each of which needs the output of the previous step to create the final result. The Transform node is more appropriate and provides a more transparent view of the processing being performed on the data. Both nodes are Java based, so there is no intrinsic benefit to using a Calculate Fields node; it may actually deliver poorer performance due to the need to write out each intermediate step.

    You could improve the performance of the code in the Transform node by replacing the lines in the ProcessRecords script that concatenate the field string values with the following, which uses the join() method:

    out1 += in1  # copy every input field to the output

    # Previous approach: build the row string by repeated concatenation
    #intCol = 0
    #strRow = ""
    #for strCol in in1:
    #    strRow = strRow + str(in1[intCol])
    #    intCol = intCol + 1

    # Faster: join the string values of all fields in a single pass
    strRow = "".join(str(in1[strCol]) for strCol in in1)

    # Hash the concatenated row string (hashlib must be imported, as before)
    hashId = hashlib.md5()
    hashId.update(repr(strRow).encode('utf-8'))



    ------------------------------
    Adrian Williams
    Precisely Software Inc.
    ------------------------------



  • 3.  RE: Row Hash optimization, Data comparison/validation test

    Posted 05-30-2024 02:20
    Edited by Henrik B 05-30-2024 03:26

    Thanks for the answer, Adrian.

    The problem with the new script compared to the old one is that I don't get unique values for the RowHash output; all values are the same.

    With the old script, I get a unique value per row for RowHash in my test dataset.

    Modified script:



    ------------------------------
    Henrik B
    E.ON Sverige
    ------------------------------



  • 4.  RE: Row Hash optimization, Data comparison/validation test
    Best Answer

    Employee
    Posted 05-30-2024 03:25

    Ah, my apologies, there was an element missing in the original logic, which I have also corrected in my original message above: the joined string was being built from the field names alone rather than from their values, so the result was identical for every row. The statement below updates the logic to hash the values of the fields and now produces the same results as your code.

    strRow = "".join(str(in1[strCol]) for strCol in in1)


    ------------------------------
    Adrian Williams
    Precisely Software Inc.
    ------------------------------



  • 5.  RE: Row Hash optimization, Data comparison/validation test

    Posted 05-30-2024 04:57

    Thanks for the help, Adrian, the solution works.

    I did a test with 3 different datasets (rows > 10^6), and it turned out that the solution I first posted was 18.2% faster than your version (taking the mean of the 3 tests).
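
    For anyone who wants to repeat the comparison outside of Data360 Analyze, a rough stand-alone timing sketch along these lines can be used (plain Python, with a dict standing in for the in1 record, so the numbers will not match the node exactly):

    import hashlib
    import timeit

    # A plain dict stands in for the Transform node's in1 record
    row = {"col%d" % i: "value%d" % i for i in range(50)}

    def hash_concat(in1):
        # loop-and-concatenate version
        strRow = ""
        for strCol in in1:
            strRow = strRow + str(in1[strCol])
        return hashlib.md5(repr(strRow).encode('utf-8')).hexdigest()

    def hash_join(in1):
        # join() version
        strRow = "".join(str(in1[strCol]) for strCol in in1)
        return hashlib.md5(repr(strRow).encode('utf-8')).hexdigest()

    assert hash_concat(row) == hash_join(row)  # both variants must agree

    print("concat:", timeit.timeit(lambda: hash_concat(row), number=100000))
    print("join:  ", timeit.timeit(lambda: hash_join(row), number=100000))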



    ------------------------------
    Henrik B
    E.ON Sverige
    ------------------------------