Hi Juan,
I would suggest that this is more of a master data management question than an analyze question. Analyze will do anything that you can algorithmically formulate. So you just need to be clear what your business rule is for the task before you code it.
My opinion is that there is no perfect way to do this with the kind of sample you suggest, there will always be no way for example to know if Mike Langdon, is in fact not Michael Landon. Or to be a little clearer, maybe Michael Langdon is a typo of Michaela Langdon. Or Mike Langdon of 234 University Land may have moved to 233 University Lane to move in with his friends, but it might be just a completely unrelated person with a similar name. That said, you might be able to improve your efforts with more data, e.g. Date of Birth or social security number.
If you accept that and want to do it just like you said above anyway, I would add some score columns with a transform for the length of the fields. Decide what columns you consider the minimum primary key and aggregate that into a list then join that data back to the main data set. In the groups that arise from that, you need to select max (score). I think you can avoid the iteration with that approach and just operate with sets.
As an example, here is a one key Oracle sql deduplication based on the above, this has a tie for the result in this case, and more criteria are needed. You can of course do it iteratively if you want, it would just be harder.
with originaldata as (
select 1 as ID, 'Origin1' as SourceSystem, 'Daniel Brown' as TheName, 90210 Thepostcode, 'California' as Thestate, 'Beverly Hills' as Add2, '123 Davinci St' as Add1, 'dan@brown.com' as email from dual union all
select 2, 'Origin5', 'Daniel Brown', 90210, 'California', 'Beverly Hills', '123 Davinci St', 'dan@brown.com' from dual union all
select 3, 'Origin2', 'Daniel Ernest Brown', 90210, 'California', 'Beverly Hills', '123 Davinci St', 'danny1234@brown.com' from dual union all
select 4, 'Origin3', 'Dan Brown', 90210, 'California', 'Beverly Hills', '123 Davinci St', 'danmeister@brown.com' from dual union all
select 5, 'Origin1', 'Michael Langdon', 12345, 'California', 'Beverly Hills', '234 University Lane', 'mlangdon@brown.com' from dual union all
select 6, 'Origin3', 'Mike Langdon', 12345, 'California', 'Beverly Hills', NULL, NULL from dual union all
select 7, 'Origin4', 'Michael Langdon', 12345, 'California', 'Beverly Hills', '234 University Lane', 'mlangdon@brown.com' from dual union all
select 8, 'Origin3', 'Mike Langdon', 12345, 'California', 'Beverly Hills', NULL, NULL from dual
),
scored as
(select originaldata.*, length(originaldata.thename)+1000 thescore from originaldata
),
thekey as
(select thepostcode,thestate,add2,add1, rank() over (order by thepostcode,thestate,add2,add1) as keyid from originaldata group by thepostcode,thestate,add2,add1 order by 1,2,3,4),
rejoined as (
select * from thekey join scored
on thekey.thepostcode = scored.thepostcode
and thekey.thestate=scored.thestate
and thekey.add1=scored.add1
and thekey.add2=scored.add2
),
maxscores as (
select keyid, max(thescore) themax from rejoined
group by keyid
)
select maxscores.*,rejoined.* from maxscores
inner join rejoined
on maxscores.keyid = rejoined.keyid
and maxscores.themax = rejoined.thescore
------------------------------
Peter Sykes
Vontobel Holding AG
ZUERICH
------------------------------
Original Message:
Sent: 01-10-2023 10:01
From: Juan Vidal
Subject: Iteration
Hello everyone, I need some help on how to solve a specific use case.
We have an approx 90k record set loaded with customer data. Each row consists of at least these columns:
RowID, DataSource, CustomerName, Zip, State, City, Address, Email, Phone, Fax, Mail, ....... (plus lots of other columns)
We have cases where the Name contains other initials, as in:
RowID, DataSource, CustomerName, Zip, State, City, Address, Email, Phone, Fax, Mail, ....... (plus lots of other columns)
1, Origin1, Daniel Brown, 90210, California, Beverly Hills, 123 Davinci St, dan@brown.com ...
2, Origin5, Daniel Brown, 90210, California, Beverly Hills, 123 Davinci St, dan@brown.com ...
3, Origin2, Daniel Ernest Brown, 90210, California, Beverly Hills, 123 Davinci St, danny1234@brown.com ...
4, Origin3, Dan Brown, 90210, California, Beverly Hills, 123 Davinci St, danmeister@brown.com ...
5, Origin1, Michael Langdon, 12345, California, Beverly Hills, 234 University Lane, mlangdon@brown.com ...
6, Origin3, Mike Langdon, 12345, California, Beverly Hills, NULL, NULL...
7, Origin4, Michael Langdon, 12345, California, Beverly Hills, 234 University Lane, mlangdon@brown.com ...
8, Origin3, Mike Langdon, 12345, California, Beverly Hills, NULL, NULL...
First of all, we have to eliminate the duplicated records.
According to this information and from what we learned from the client, the first 4 records, should be considered as THE SAME PERSON and their data should be merged into a single record, that should contain the data of all 3 adding whatever data these other records have (if applicable), for example, the "email" column should contain all 3 mails. The same principle applies to the last 4 records in the example as well.
For the most part, this is working so far, however, the problem arises when trying to evaluate the "CustomerName" column, since it contains different "names" and thus we cannot eliminate dupes based on the zip, state, city, and address columns.
What we believe could be a solution, would be to read the 1st record, and evaluate the next N... then, if the columns "zip, state, city, address" match then we should pick the longest "CustomerName" and copy that value to the 1st record.
How we believe this could be done is with 2 loops, the "outer loop" going from 1 to N and the "inner" loop evaluating a fixed number of iterations (say 3).
On each inner iteration, we would evaluate the data, decide if it's "the same person" and then move the name to its longest form until the last iteration.
We cannot seem to find a way to do this, neither with a transform node via python nor with the "standard" nodes.
Any and all help will be appreciated.
Thanks in advance.
------------------------------
Juan Vidal
Mutual of America Life Insurance Company
------------------------------