MapInfo Pro

 View Only

MapBasic Monday: Cluster Analysis

  • 1.  MapBasic Monday: Cluster Analysis

    Employee
    Posted 06-02-2025 03:53
      |   view attached

    Last week, I was asked a question by @yogesh pawar from our Technical Support team:

    A customer was seeking a method to divide points within a polygon into groups. The groups should consist of the same number of points, around 10 in this case. They were seeking a way to automate the redistricting process.

    This is often also referred to as Cluster Analysis, where you want to create groups of records. This can be used for sales district optimization, for school districts, and for a long range of other use cases.

    RouteFinder for MapInfo from Routeware has a dedicated tool for this that supports both straight-line analysis and routing analysis, where you take a road network into consideration.

    One of the methods for performing this analysis is k-means clustering. You can find Python implementations of this in the Python module scikit-learn. I'll come back to that in a later article.

    In this article, I'll look at a very basic approach I took when trying to divide points into a given number of clusters.

    Happy #MapBasicMonday.

    Finding the Starting Point

    For my basic approach to cluster analysis, I decided to use the extent of the data as a starting point. Around a dataset, you can draw a minimum bounding rectangle (MBR). I will let the tool assign a location on the MBR as the starting point. This could be the center of the MBR, the four corners, the center of the four sides, or one of the corners.

    For the four sides or the four corners, it will change for each cluster. This means that the starting point will change as I go through finding points for a cluster.

    If I use the center or a specific corner, I will use the same every time. But this may also vary a bit. The MBR will change as it will be based on the points that still haven't been assigned to a cluster.

    From the selected starting point, I create a buffer to find nearby points.

    The size of the buffer is also related to the size of the MBR. I have set it to be 1/20 of the width of the MBR.

    The buffer size is increased until at least one point is found. This point, or these points, are the starting points for the cluster.

    The source code for the starting points looks like this
    Function FindFirstForCluster(ByVal nClusterID As Integer) As Integer
    
    Dim	oPoint, oBuffer As Object,
    	i As Integer,
    	fWidth As Float
    
    OnError GoTo ErrorOccured
    
    	FindFirstForCluster = 0
    
    	If QueryRecordsToCluster("__to_cluster") = 0 Then
    		Call TABClose("__to_cluster")
    		Exit Function
    	End If
    
    	Do Case msStartMethod
    		Case "Corners"
    			Do Case (nClusterID Mod 4)
    				Case 0	'SW
    					oPoint = CreatePoint(mfMinX, mfMinY)
    				Case 1	'NW
    					oPoint = CreatePoint(mfMinX, mfMaxY)
    				Case 2	'NE
    					oPoint = CreatePoint(mfMaxX, mfMaxY)
    				Case 3	'SE
    					oPoint = CreatePoint(mfMaxX, mfMinY)
    			End Case
    		Case "Sides"
    			Do Case (nClusterID Mod 4)
    				Case 0	'Left
    					oPoint = CreatePoint(mfMinX, mfCenterY)
    				Case 1	'Top
    					oPoint = CreatePoint(mfCenterX, mfMaxY)
    				Case 2	'Right
    					oPoint = CreatePoint(mfMaxX, mfCenterY)
    				Case 3	'Bottom
    					oPoint = CreatePoint(mfCenterX, mfMinY)
    			End Case
    		Case "Center"
    				oPoint = CreatePoint(mfCenterX, mfCenterY)
    		Case "SW"
    				oPoint = CreatePoint(mfMinX, mfMinY)
    		Case "NW"
    				oPoint = CreatePoint(mfMinX, mfMaxY)
    		Case "NE"
    				oPoint = CreatePoint(mfMaxX, mfMaxY)
    		Case "SE"
    				oPoint = CreatePoint(mfMaxX, mfMinY)
    	End Case
    
    	FindFirstForCluster = AddToCluster(nClusterID, oPoint, "__to_cluster")
    
    	Call TABClose("__curr_clusters")
    	Call TABClose("__to_cluster")
    
    	Exit Function
    '-------------------------
    ErrorOccured:
    	Print Err() & " " & Error$() & ": FindFirstForCluster: " & nClusterID
    
    End Function

    Most of the code above is assigning a location to the analysis based on the start method chosen and the current cluster number.

    After this, the function AddToCluster is used to find and assign points to the current cluster.

    Function AddToCluster(	  ByVal nClusterID As Integer
    					, ByVal oInput As Object
    					, ByVal sQuery As String
    					) As Integer
    
    Dim	i, nNumRows, nRow, nRows, nRowID As Integer,
    	fWidth As Float,
    	oBuffer As Object
    
    OnError GoTo ErrorOccured
    
    	fWidth 	= Maximum(mfWidth / mnStartSplit, 1)
    
    	For i = 1 to mnStartSplit
    		oBuffer	= Buffer(oInput, 12, i * fWidth, "m")
    
    		Select t.ClusterID, ObjectDistance(oInput, t.OBJ, "m") "Distance_m"
    			From __to_cluster As "t"
    			Where t.OBJ Intersects oBuffer
    			And ClusterID = 0
    			Order By Distance_m Asc
    			Into __next_for_cluster NoSelect Hide
    
    		nNumRows = TableInfo(__next_for_cluster, TAB_INFO_NROWS)
    		Print Time(24) & "   Search Distance " & Round(fWidth * i, 0.1) & " m -> " & nNumRows & " rows"
    		If nNumRows > 0 Then
    			If nNumRows + mnCountCurrentCluster <= mnMinimumRowCount Then
     				'**Adding all selected to cluser
    				Update __next_for_cluster
    					Set ClusterID = nClusterID
    				AddToCluster = nNumRows		'Returning the number of rows added
    			Else
    				Fetch First From __next_for_cluster
    				Do Until EOT(__next_for_cluster)
    
    					nRowID = __next_for_cluster.ROWID
    					Update __next_for_cluster
    						Set ClusterID = nClusterID
    						Where ROWID = nRowID
    
    					nRows = nRows + 1
    					If (mnCountCurrentCluster + nRows) >= mnMinimumRowCount Then
    						Exit Do
    					End If
    					Fetch Next From __next_for_cluster
    				Loop
    				AddToCluster = nRows
    			End If
    			Exit For
    		End If
    	Next
    
    	Exit Function
    '-------------------------
    ErrorOccured:
    	Print Err() & " " & Error$() & ": FindNextForCluster: " & nClusterID
    
    End Function

    This function is called when finding the first points for a cluster and finding the next points too. The difference lies in the input variable oInput, which changes from a location on the MBR to a single input point or a multipoint based on all the points in the current cluster.

    Finding the Next Points

    Once the initial points for the cluster have been found, the analysis starts using these locations. A buffer is created around the points that make up the cluster. The buffer size is again increased until at least one point is found.
    When more points have been assigned to the cluster, these points are also included in the buffer analysis.
    Until the minimum number of points has been assigned to the current cluster.
    Now the algorithm moves onto the next corner of the MBR. The MBR is also recalculated. This means that not only do we move to a new corner, that corner may also shift a bit as only the points not already assigned to a cluster as used when calculating the MBR.
    If you had configured the analysis to use the same point every time, the only change would come from a potential change in the MBR. I did realize that using the center as a starting point wasn't that good an idea. This would often only assign points at the center to a cluster, and so not affect any change to the MBR. This is also to some extent true when using another single starting location. I think you will get a better result by using multiple starting locations. Maybe two would also work. This could be the two corners on the left, on the right, on the top, or at the bottom that you were switching between.
    This continues until the tool has found points for all the clusters.
    The source code for finding the next points is also quite simple. It starts by creating a multipoint for all the points currently in the cluster. And then it calls the function AddToCluster just like finding the first point(s).
    Function FindNextForCluster(ByVal nClusterID As Integer) As Integer
    
    Dim	oPoints As Object,
    	fWidth As Float
    
    OnError GoTo ErrorOccured
    
    	FindNextForCluster = 0
    
    	If QueryRecordsToCluster("__to_cluster") = 0 Then
    		Call TABClose("__to_cluster")
    		Exit Function
    	End If
    
    	Select ClusterID, AggregateCombine(obj)
    		From msTabToCluster
    		Where ClusterID = nClusterID
    		Group By ClusterID
    		Into __curr_cluster NoSelect Hide
    	Fetch First From __curr_cluster
    	oPoints	= __curr_cluster.OBJ
    	Call TABClose("__curr_cluster")
    
    	FindNextForCluster = AddToCluster(nClusterID, oPoints, "__to_cluster")
    
    	Call TABClose("__next_for_cluster")
    	Call TABClose("__to_cluster")
    
    	Exit Function
    '-------------------------
    ErrorOccured:
    	Print Err() & " " & Error$() & ": FindNextForCluster: " & nClusterID
    
    End Function

    Finalizing the Process

    After looping over the desired number of clusters, there may still be some points that haven't been assigned to a cluster. These points are often scattered across the dataset as smaller groups of points.
    To get these assigned, I find the nearest cluster for each of the points and assign this cluster to these points. This can mean that some clusters grow above the size you were looking for.
    I have created the function FindNearestClusterID that returns the nearest Cluster ID for a given point.
    Function FindNearestClusterID(ByVal oPoint As Object) As Integer
    
    Dim	oBuffer As Object,
    	i As Integer,
    	fWidth As Float
    
    OnError GoTo ErrorOccured
    
    	FindNearestClusterID = 0
    
    	If QueryRecordsClustered("__been_clustered") = 0 Then
    		Call TABClose("__been_clustered")
    		Exit Function
    	End If
    
    	fWidth 	= Maximum(mfWidth / mnStartSplit, 1)
    
    	For i = 1 to mnStartSplit
    		oBuffer	= Buffer(oPoint, 12, i * fWidth, "m")
    
    		Select t.ClusterID, ObjectDistance(oPoint, t.OBJ, "m") "Distance_m"
    			From __been_clustered As "t"
    			Where t.OBJ Intersects oBuffer
    			Order By Distance_m Asc
    			Into __nearest_cluster NoSelect Hide
    
    		If TableInfo(__nearest_cluster, TAB_INFO_NROWS) > 0 Then
    			Fetch First From __nearest_cluster
    			FindNearestClusterID = __nearest_cluster.ClusterID
    			Exit For
    		End If
    	Next
    
    	Call TABClose("__nearest_cluster")
    	Call TABClose("__been_clustered")
    
    	Exit Function
    '-------------------------
    ErrorOccured:
    	Print Err() & " " & Error$() & ": FindNearestClusterID"
    
    End Function
    I can use this function to assign a cluster to the points that haven't been assigned one:
    Update msTabToCluster
    	Set ClusterID = FindNearestClusterID(OBJ)
    	Where ClusterID = 0

    Improvement Ideas

    This is a Proof of Concept in an early stage. There are a few things that still need improvement.
    1. Ask the user for input on the number of clusters or the number of points in each cluster
    2. Ask the user for input on the initial size of the buffers. Performance-wise, it may be better to use a bigger initial buffer width to limit the number of queries.
    3. Change the function AddToCluster to use recursive calls instead of a loop. This would mean that the function would call itself until it found some points.
    4. Try to implement k-means using MapBasic

    What would you do to improve this cluster analysis?

    I have included the source code and a compiled version of the tool in the zip file below.



    ------------------------------
    Peter Horsbøll Møller
    Principal Presales Consultant | Distinguished Engineer
    Precisely | Trust in Data
    ------------------------------

    Attachment(s)

    zip
    ClusterAnalysis.zip   6 KB 1 version