Spectrum Technology Platform

  • 1.  Search Index v/s Data Hub

    Employee
    Posted 08-20-2019 02:16
    Edited by Akshay Pandita 08-22-2019 01:50
    Hi Team -

    I am working with a customer who wants to understand the major differences between our Elasticsearch-based Search Index and Data Hub in terms of performance and scalability. For example, if they need to do matching using fuzzy logic, should they go with the Search Index method (with 24+ algorithms), since it can handle complexity to a great extent, or use Data Hub, which has limited fuzzy matching (I need to understand more about the matching capability of Data Hub) and a limited set of predefined operators such as contains, like, and not like? Another concern they have is that when they scale the hub horizontally (add more nodes) or vertically (add more power), it does not perform as expected.

    I wanted to share this with our community so that anyone who has seen such a scenario, or has some insights, can share them with everyone. My basic understanding is that SI is scalable and Hub is also scalable (but to a lesser extent), and that as the data load increases, write complexity in Hub grows more than it does in SI. The end objective is better performance and no data lag.


    Thanks,
    Akshay.

    ------------------------------
    Akshay Pandita
    Advisory Consultant - Professional Services

    ------------------------------


  • 2.  RE: Search Index v/s Data Hub

    Employee
    Posted 08-22-2019 14:15

    Hi Akshay,

    I guess let's start by clarifying some terms. I suspect you are not actually matching using Search Index or Data Hub, but rather querying one of those to retrieve candidates to be passed to the matching engine. You correctly identify that the Search Index, being an implementation of the Elasticsearch search engine, provides robust indexing and searching that scales across the cluster. The Data Hub is a graph implementation that supports similar indexing and searching, including range, substring (contains), starts-with, ends-with, etc. Because every node has a complete copy of the graph, any node can service any query, so queries scale well across the cluster. In practice you could use either the Search Index or the Data Hub as a source of candidates for matching. The implementation would be slightly different: you would use the Candidate Finder stage to query the Search Index, and the Query Hub stage to query the Data Hub.
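
    To make the "retrieve candidates, then match" split concrete, here is a minimal Python sketch. It does not use any Spectrum API: the records, the prefix-based `find_candidates` lookup, and the `difflib` similarity score are all stand-ins I am assuming for illustration. The point is only that the index (Search Index or Data Hub) narrows the set, and a separate matching step does the fuzzy scoring.

```python
from difflib import SequenceMatcher

# Hypothetical records standing in for an indexed data set.
RECORDS = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jon Smyth"},
    {"id": 3, "name": "Joan Smithers"},
    {"id": 4, "name": "Maria Gonzalez"},
]


def find_candidates(records, query, prefix_len=2):
    """Cheap retrieval step (assumed): keep names sharing a prefix with the query."""
    key = query[:prefix_len].lower()
    return [r for r in records if r["name"].lower().startswith(key)]


def score(a, b):
    """Fuzzy similarity in [0, 1]; a stand-in for a real matching algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(records, query, threshold=0.7):
    """Retrieve candidates first, then score only those candidates."""
    candidates = find_candidates(records, query)
    scored = [(r["id"], score(query, r["name"])) for r in candidates]
    kept = [s for s in scored if s[1] >= threshold]
    return sorted(kept, key=lambda s: s[1], reverse=True)


if __name__ == "__main__":
    for record_id, similarity in match(RECORDS, "Jon Smith"):
        print(record_id, round(similarity, 3))
```

    Whichever store supplies the candidates, the expensive fuzzy comparison only runs on the short list, so the retrieval step mainly affects how many candidates reach the matcher.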

    Since you mention write complexity in the hub in talking about scalability, I'm going to assume you're talking about the performance of Write to Hub stage operations in a cluster vs a single server implementation (scaling out vs scaling up). One of the options for writing to the data hub locks the record(s) on all nodes when performing a write or update, and this takes longer with more nodes than with fewer. There are write strategies that avoid this although they aren't appropriate to all use cases.
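
    As a rough illustration of why a lock-everywhere write gets slower as nodes are added while reads get faster, here is a toy model in Python. Every number and both formulas are assumptions made up for the sketch (sequential lock round trips, a fixed per-node query rate); they are not measurements or a description of how the Data Hub is actually implemented.

```python
def locked_write_latency_ms(nodes, local_write_ms=2.0, lock_round_trip_ms=1.5):
    """Toy model (assumed): the writer locks each other node in turn, then writes."""
    return local_write_ms + (nodes - 1) * lock_round_trip_ms


def read_throughput_qps(nodes, per_node_qps=500):
    """Toy model (assumed): every node holds the full graph, so reads spread evenly."""
    return nodes * per_node_qps


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, locked_write_latency_ms(n), read_throughput_qps(n))
```

    Under these assumptions, adding nodes raises query capacity but also raises the cost of each locked write, which is the trade-off to weigh when choosing a write strategy.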

    You can contact me directly by email if you want to talk more about the particular client you are working with.



    ------------------------------
    Brad Stengel
    PITNEY BOWES SOFTWARE, INC
    Miami Lakes FL
    ------------------------------