Data360 Analyze

  • 1.  Fuzzy Xref Fuzzy Threshold

    Employee
    Posted 10-09-2019 09:00

    Note: This was originally posted by an inactive account. Content was preserved by moving under an admin account.

    Hello, I'm comparing two data sets that both contain an entity name, and I want to match them by name using the fuzzy Soundex algorithm. What I can't figure out is how to determine the threshold value. Is it numeric? How is 1 different from, say, 4? The node does not run if the value is left empty. It would be great to run it, look at the scores in the fuzzy score output field, and set my threshold based on those. Is there a way to do that? Or how does the score value work?

    Also, I have two lists: one has 13,000 records and the other about 800. The node has been running for a while with no results yet. Can I speed this up? It's just two lists of names being compared, which seems very simple. I'm currently using 1 as the threshold value.

    Thank you. 



  • 2.  RE: Fuzzy Xref Fuzzy Threshold

    Employee
    Posted 10-09-2019 09:06

    Is there a better way to compare two strings?



  • 3.  RE: Fuzzy Xref Fuzzy Threshold

    Posted 10-17-2019 01:32

    You first need to select the fuzzy algorithm.

    The node help will give you all the information you need:

    http://localhost:8080/docs/dist/help/Content/e-node-help/Correlation/fuzzy-join.html

     

    A few months ago I tried to use the Levenshtein distance algorithm with both Lavastorm and Data3sixty to compare millions of lines against hundreds of thousands of records. Neither Lavastorm nor Data3sixty was able to run such a huge number of operations on my computer.

    We wrote a Python script instead and ran it on a 16-core server.
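As a rough illustration only (this is not the original script; the function names `levenshtein` and `best_matches` and the `max_distance` cutoff are assumptions made up for the example), a plain-Python version of such a comparison could look like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def best_matches(left, right, max_distance=2):
    """For each name in left, keep the closest name in right if it is
    within max_distance edits (hypothetical cutoff, tune it on your data)."""
    if not right:
        return []
    results = []
    for name in left:
        # Brute force: every name in left is compared with every name in right.
        scored = [(levenshtein(name.lower(), other.lower()), other)
                  for other in right]
        distance, match = min(scored)
        if distance <= max_distance:
            results.append((name, match, distance))
    return results


if __name__ == "__main__":
    print(best_matches(["Acme Corp", "Zeta"], ["ACME Corp.", "Globex"]))
```

Because every pair is compared, the work grows with the product of the two list sizes, which is why the choice of implementation and hardware matters so much on large inputs.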

    How you write the Python code has a huge influence on the processing time, but 13,000 x 800 comparisons (about 10.4 million pairs) should be done within a few minutes. Which algorithm have you selected?
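To illustrate how much the way the code is written can matter, here is a minimal sketch that avoids comparing every pair when the matching rule is Soundex. It assumes the standard American Soundex rules, which may differ from what the Fuzzy Join node does internally; `soundex` and `match_by_soundex` are names invented for this example. The smaller list is indexed by Soundex code once, and each name in the larger list is then looked up in that index.

```python
import string
from collections import defaultdict

# Consonant groups of standard American Soundex.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if name has no ASCII letters."""
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    code = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != last:
            code += digit
            if len(code) == 4:
                break
        last = digit  # vowels reset `last`, so a repeated code counts again
    return code.ljust(4, "0")


def match_by_soundex(small, large):
    """Pair each name in `large` with the names in `small` sharing its code."""
    index = defaultdict(list)
    for name in small:
        key = soundex(name)
        if key:
            index[key].append(name)
    pairs = []
    for name in large:
        for candidate in index.get(soundex(name), []):
            pairs.append((name, candidate))
    return pairs


if __name__ == "__main__":
    print(match_by_soundex(["Robert", "Smith"], ["Rupert", "Smyth", "Jones"]))
```

In this sketch two names either share a Soundex code or they do not, so there is no graded score to tune. The code is also only four characters long, so long entity names are effectively compared on their beginning only.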