MapInfo Pro

View Only

Back to discussions

Expand all | Collapse all

MapBasic Monday: Cluster Analysis

1. MapBasic Monday: Cluster Analysis

Like
Employee

Peter Møller
Posted 06-02-2025 03:53
| view attached

Reply Reply Privately
Last week, I was asked a question by @yogesh pawar from our Technical Support team:

A customer was seeking a method to divide points within a polygon into groups. The groups should consist of the same number of points, around 10 in this case. They were seeking a way to automate the redistricting process.

This is often also referred to as Cluster Analysis, where you want to create groups of records. This can be used for sales district optimization, for school districts, and for a long range of other use cases.

RouteFinder for MapInfo from Routeware has a dedicated tool for this that supports both straight-line analysis and routing analysis, where you take a road network into consideration.

One of the methods for performing this analysis is k-means clustering. You can find Python implementations of this in the Python module scikit-learn. I'll come back to that in a later article.

In this article, I'll look at a very basic approach I took when trying to divide points into a given number of clusters.

Happy #MapBasicMonday.

Finding the Starting Point

For my basic approach to cluster analysis, I decided to use the extent of the data as a starting point. Around a dataset, you can draw a minimum bounding rectangle (MBR). I will let the tool assign a location on the MBR as the starting point. This could be the center of the MBR, the four corners, the center of the four sides, or one of the corners.

For the four sides or the four corners, it will change for each cluster. This means that the starting point will change as I go through finding points for a cluster.

If I use the center or a specific corner, I will use the same every time. But this may also vary a bit. The MBR will change as it will be based on the points that still haven't been assigned to a cluster.

From the selected starting point, I create a buffer to find nearby points.

The size of the buffer is also related to the size of the MBR. I have set it to be 1/20 of the width of the MBR.

The buffer size is increased until at least one point is found. This point, or these points, are the starting points for the cluster.

The source code for the starting points looks like this

Function FindFirstForCluster(ByVal nClusterID As Integer) As Integer Dim oPoint, oBuffer As Object, i As Integer, fWidth As Float OnError GoTo ErrorOccured FindFirstForCluster = 0 If QueryRecordsToCluster("__to_cluster") = 0 Then Call TABClose("__to_cluster") Exit Function End If Do Case msStartMethod Case "Corners" Do Case (nClusterID Mod 4) Case 0 'SW oPoint = CreatePoint(mfMinX, mfMinY) Case 1 'NW oPoint = CreatePoint(mfMinX, mfMaxY) Case 2 'NE oPoint = CreatePoint(mfMaxX, mfMaxY) Case 3 'SE oPoint = CreatePoint(mfMaxX, mfMinY) End Case Case "Sides" Do Case (nClusterID Mod 4) Case 0 'Left oPoint = CreatePoint(mfMinX, mfCenterY) Case 1 'Top oPoint = CreatePoint(mfCenterX, mfMaxY) Case 2 'Right oPoint = CreatePoint(mfMaxX, mfCenterY) Case 3 'Bottom oPoint = CreatePoint(mfCenterX, mfMinY) End Case Case "Center" oPoint = CreatePoint(mfCenterX, mfCenterY) Case "SW" oPoint = CreatePoint(mfMinX, mfMinY) Case "NW" oPoint = CreatePoint(mfMinX, mfMaxY) Case "NE" oPoint = CreatePoint(mfMaxX, mfMaxY) Case "SE" oPoint = CreatePoint(mfMaxX, mfMinY) End Case FindFirstForCluster = AddToCluster(nClusterID, oPoint, "__to_cluster") Call TABClose("__curr_clusters") Call TABClose("__to_cluster") Exit Function '------------------------- ErrorOccured: Print Err() & " " & Error$() & ": FindFirstForCluster: " & nClusterID End Function

Most of the code above is assigning a location to the analysis based on the start method chosen and the current cluster number.

After this, the function AddToCluster is used to find and assign points to the current cluster.

Function AddToCluster( ByVal nClusterID As Integer , ByVal oInput As Object , ByVal sQuery As String ) As Integer Dim i, nNumRows, nRow, nRows, nRowID As Integer, fWidth As Float, oBuffer As Object OnError GoTo ErrorOccured fWidth = Maximum(mfWidth / mnStartSplit, 1) For i = 1 to mnStartSplit oBuffer = Buffer(oInput, 12, i * fWidth, "m") Select t.ClusterID, ObjectDistance(oInput, t.OBJ, "m") "Distance_m" From __to_cluster As "t" Where t.OBJ Intersects oBuffer And ClusterID = 0 Order By Distance_m Asc Into __next_for_cluster NoSelect Hide nNumRows = TableInfo(__next_for_cluster, TAB_INFO_NROWS) Print Time(24) & " Search Distance " & Round(fWidth * i, 0.1) & " m -> " & nNumRows & " rows" If nNumRows > 0 Then If nNumRows + mnCountCurrentCluster <= mnMinimumRowCount Then '**Adding all selected to cluser Update __next_for_cluster Set ClusterID = nClusterID AddToCluster = nNumRows 'Returning the number of rows added Else Fetch First From __next_for_cluster Do Until EOT(__next_for_cluster) nRowID = __next_for_cluster.ROWID Update __next_for_cluster Set ClusterID = nClusterID Where ROWID = nRowID nRows = nRows + 1 If (mnCountCurrentCluster + nRows) >= mnMinimumRowCount Then Exit Do End If Fetch Next From __next_for_cluster Loop AddToCluster = nRows End If Exit For End If Next Exit Function '------------------------- ErrorOccured: Print Err() & " " & Error$() & ": FindNextForCluster: " & nClusterID End Function

This function is called when finding the first points for a cluster and finding the next points too. The difference lies in the input variable oInput, which changes from a location on the MBR to a single input point or a multipoint based on all the points in the current cluster.

Finding the Next Points

Once the initial points for the cluster have been found, the analysis starts using these locations. A buffer is created around the points that make up the cluster. The buffer size is again increased until at least one point is found.

When more points have been assigned to the cluster, these points are also included in the buffer analysis.

Until the minimum number of points has been assigned to the current cluster.

Now the algorithm moves onto the next corner of the MBR. The MBR is also recalculated. This means that not only do we move to a new corner, that corner may also shift a bit as only the points not already assigned to a cluster as used when calculating the MBR.

If you had configured the analysis to use the same point every time, the only change would come from a potential change in the MBR. I did realize that using the center as a starting point wasn't that good an idea. This would often only assign points at the center to a cluster, and so not affect any change to the MBR. This is also to some extent true when using another single starting location. I think you will get a better result by using multiple starting locations. Maybe two would also work. This could be the two corners on the left, on the right, on the top, or at the bottom that you were switching between.

This continues until the tool has found points for all the clusters.

The source code for finding the next points is also quite simple. It starts by creating a multipoint for all the points currently in the cluster. And then it calls the function AddToCluster just like finding the first point(s).

Function FindNextForCluster(ByVal nClusterID As Integer) As Integer Dim oPoints As Object, fWidth As Float OnError GoTo ErrorOccured FindNextForCluster = 0 If QueryRecordsToCluster("__to_cluster") = 0 Then Call TABClose("__to_cluster") Exit Function End If Select ClusterID, AggregateCombine(obj) From msTabToCluster Where ClusterID = nClusterID Group By ClusterID Into __curr_cluster NoSelect Hide Fetch First From __curr_cluster oPoints = __curr_cluster.OBJ Call TABClose("__curr_cluster") FindNextForCluster = AddToCluster(nClusterID, oPoints, "__to_cluster") Call TABClose("__next_for_cluster") Call TABClose("__to_cluster") Exit Function '------------------------- ErrorOccured: Print Err() & " " & Error$() & ": FindNextForCluster: " & nClusterID End Function

Finalizing the Process

After looping over the desired number of clusters, there may still be some points that haven't been assigned to a cluster. These points are often scattered across the dataset as smaller groups of points.

To get these assigned, I find the nearest cluster for each of the points and assign this cluster to these points. This can mean that some clusters grow above the size you were looking for.

I have created the function FindNearestClusterID that returns the nearest Cluster ID for a given point.

Function FindNearestClusterID(ByVal oPoint As Object) As Integer Dim oBuffer As Object, i As Integer, fWidth As Float OnError GoTo ErrorOccured FindNearestClusterID = 0 If QueryRecordsClustered("__been_clustered") = 0 Then Call TABClose("__been_clustered") Exit Function End If fWidth = Maximum(mfWidth / mnStartSplit, 1) For i = 1 to mnStartSplit oBuffer = Buffer(oPoint, 12, i * fWidth, "m") Select t.ClusterID, ObjectDistance(oPoint, t.OBJ, "m") "Distance_m" From __been_clustered As "t" Where t.OBJ Intersects oBuffer Order By Distance_m Asc Into __nearest_cluster NoSelect Hide If TableInfo(__nearest_cluster, TAB_INFO_NROWS) > 0 Then Fetch First From __nearest_cluster FindNearestClusterID = __nearest_cluster.ClusterID Exit For End If Next Call TABClose("__nearest_cluster") Call TABClose("__been_clustered") Exit Function '------------------------- ErrorOccured: Print Err() & " " & Error$() & ": FindNearestClusterID" End Function

I can use this function to assign a cluster to the points that haven't been assigned one:

Update msTabToCluster Set ClusterID = FindNearestClusterID(OBJ) Where ClusterID = 0

Improvement Ideas

This is a Proof of Concept in an early stage. There are a few things that still need improvement.

Ask the user for input on the number of clusters or the number of points in each cluster

Ask the user for input on the initial size of the buffers. Performance-wise, it may be better to use a bigger initial buffer width to limit the number of queries.

Change the function AddToCluster to use recursive calls instead of a loop. This would mean that the function would call itself until it found some points.

Try to implement k-means using MapBasic

What would you do to improve this cluster analysis?

I have included the source code and a compiled version of the tool in the zip file below.

------------------------------
Peter Horsbøll Møller
Principal Presales Consultant | Distinguished Engineer
Precisely | Trust in Data
------------------------------

Attachment(s)

ClusterAnalysis.zip 6 KB 1 version

MapInfo Pro

MapBasic Monday: Cluster Analysis

1. MapBasic Monday: Cluster Analysis

Finding the Starting Point

Finding the Next Points

Finalizing the Process

Improvement Ideas

About Precisely

Customer Support

Copyright ©2025 Precisely. All rights reserved worldwide.