Hi Juan,
I would suggest that this is more of a master data management question than an analyze question. Analyze will do anything that you can algorithmically formulate. So you just need to be clear what your business rule is for the task before you code it.
My opinion is that there is no perfect way to do this with the kind of sample you suggest, there will always be no way for example to know if Mike Langdon, is in fact not Michael Landon. Or to be a little clearer, maybe Michael Langdon is a typo of Michaela Langdon. Or Mike Langdon of 234 University Land may have moved to 233 University Lane to move in with his friends, but it might be just a completely unrelated person with a similar name. That said, you might be able to improve your efforts with more data, e.g. Date of Birth or social security number.
If you accept that and want to do it just like you said above anyway, I would add some score columns with a transform for the length of the fields. Decide what columns you consider the minimum primary key and aggregate that into a list then join that data back to the main data set. In the groups that arise from that, you need to select max (score). I think you can avoid the iteration with that approach and just operate with sets.
As an example, here is a one key Oracle sql deduplication based on the above, this has a tie for the result in this case, and more criteria are needed. You can of course do it iteratively if you want, it would just be harder.
with originaldata as (
select 1 as ID, 'Origin1' as SourceSystem, 'Daniel Brown' as TheName, 90210 Thepostcode, 'California' as Thestate, 'Beverly Hills' as Add2, '123 Davinci St' as Add1, 'dan@brown.com' as email from dual union all
select 2, 'Origin5', 'Daniel Brown', 90210, 'California', 'Beverly Hills', '123 Davinci St', 'dan@brown.com' from dual union all
select 3, 'Origin2', 'Daniel Ernest Brown', 90210, 'California', 'Beverly Hills', '123 Davinci St', 'danny1234@brown.com' from dual union all
select 4, 'Origin3', 'Dan Brown', 90210, 'California', 'Beverly Hills', '123 Davinci St', 'danmeister@brown.com' from dual union all
select 5, 'Origin1', 'Michael Langdon', 12345, 'California', 'Beverly Hills', '234 University Lane', 'mlangdon@brown.com' from dual union all
select 6, 'Origin3', 'Mike Langdon', 12345, 'California', 'Beverly Hills', NULL, NULL from dual union all
select 7, 'Origin4', 'Michael Langdon', 12345, 'California', 'Beverly Hills', '234 University Lane', 'mlangdon@brown.com' from dual union all
select 8, 'Origin3', 'Mike Langdon', 12345, 'California', 'Beverly Hills', NULL, NULL from dual
),
scored as
(select originaldata.*, length(originaldata.thename)+1000 thescore from originaldata
),
thekey as
(select thepostcode,thestate,add2,add1, rank() over (order by thepostcode,thestate,add2,add1) as keyid from originaldata group by thepostcode,thestate,add2,add1 order by 1,2,3,4),
rejoined as (
select * from thekey join scored
on thekey.thepostcode = scored.thepostcode
and thekey.thestate=scored.thestate
and thekey.add1=scored.add1
and thekey.add2=scored.add2
),
maxscores as (
select keyid, max(thescore) themax from rejoined
group by keyid
)
select maxscores.*,rejoined.* from maxscores
inner join rejoined
on maxscores.keyid = rejoined.keyid
and maxscores.themax = rejoined.thescore
------------------------------
Peter Sykes
Vontobel Holding AG
ZUERICH
------------------------------