In part 1 of this article I will outline some of the issues you ought to consider when gridding point data. In part 2 I will look at how the different gridding methods perform when trying to reconstruct a raster from a point dataset.
You have a TAB file with point data, or a collection of LAS or LAZ files, or one or more ASCII files with delimited point data – and you want to grid this data to create a raster. To get a good result from a gridding operation you need to consider the following issues:
Source data arrangement
There are six typical source data configurations.
Points are arranged in a grid pattern with regular or semi-regular spacing in X and Y. Consider the sample spacing to be the smaller of the average distance between samples in the X direction and the Y direction.
Points are collected at a regular sample spacing along parallel lines that are regularly spaced. Consider the sample spacing to be the distance between the lines. Generally, measurements along the lines will be oversampled.
Points are collected at a regular sample spacing along a network of interconnected non-parallel lines (for example following a road network). Consider the sample spacing to be the average distance between lines, which you may only be able to judge by visual inspection.
Points are randomly located in space. A sample density analysis can help you to compute the average sample spacing and histogram statistics can give you a sense of the sample spacing distribution.
Points are randomly located but occur in clusters where point density is elevated. A sample density analysis can help you to compute the average sample spacing and histogram statistics can give you a sense of the sample spacing distribution.
A mixture of two or more of the source data arrangements described above. These are the hardest kinds of datasets to grid, and are often considered multi-scale.
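For the random and clustered cases, one simple way to estimate the average sample spacing is the mean nearest-neighbour distance. Here is a minimal Python sketch of that idea – the function name and the brute-force search are my own, not a Pro feature:

```python
import math

def average_sample_spacing(points):
    """Estimate average sample spacing as the mean nearest-neighbour
    distance (brute force; fine for small datasets)."""
    spacings = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        spacings.append(nearest)
    return sum(spacings) / len(spacings)
```

Collecting the individual nearest-neighbour distances, rather than just their mean, also gives you the histogram that reveals the spacing distribution.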
Source data observations are almost always point samples. Note that MapInfo also supports vector samples (as polylines and polygons) which are decomposed into point samples that are regularly spaced along the polyline (or polygon exterior boundary).
Source data size
I classify any dataset with less than one million samples as "small", a dataset with up to 100 million samples "large" and a dataset with over 100 million samples "huge". Some methods are better than others at handling large or huge datasets and this factor alone may dictate how you grid your data. Note that some methods have been used on datasets of more than one billion points.
I classify a raster as small if it contains up to 100 million cells, large if it contains up to 10 billion cells and huge if it contains more than that. At PB we have created rasters that contain in excess of one trillion cells.
The performance of a gridding method may be primarily related to the number of data samples or the number of raster cells, or both.
As a general rule, for methods that require point data to be searched at each grid cell use small source datasets and generate small rasters. This includes Heat map, Nearest neighbour, Natural neighbour and Inverse Distance Weighted. I recommend that you only generate small rasters using Minimum curvature. Triangulation and Stamp can be used to generate large rasters from huge datasets.
There are four kinds of gridding operation available –
Density – Generates an estimate of the density of data samples or events. These methods don't tell you what value your data has across the raster – they tell you how many source data samples you have spatially across the raster, or how many times some event has occurred over the raster space.
Assignment – Reflects the exact source data values without interpolation. These methods provide you with a rasterised picture of the exact source data values which you can use for exploring your source data or for further analysis.
Interpolation – Estimates source data values across the raster by interpolation without exceeding the statistical range of the source data values. These methods give you an estimate of a discontinuous measured parameter across the raster space.
Smooth interpolation – Estimates source data values smoothly across the raster by interpolation, possibly exceeding the statistical range of the source data values. These methods give you an estimate of a continuous measured parameter or field across the raster space.
To these we could add analytical methods like the "Distance" operation, which populates the raster cells with the distance to the closest data sample. In a future version of Pro we will have a variety of LiDAR analysis operations to study tree canopy (coverage, density and height).
Having determined the style of gridding required, you now need to determine exactly which method suits your needs.
Heat map – Estimate
Returns a representation of the spatial density of samples.
For example, you may have a dataset that records the location of car accidents in a city. A heat map will show you where the highest concentration of accidents occur and help you identify traffic "black spots" for remedial action.
Heat map – Weighted estimate
Returns a representation of the spatial density of samples where a data value at the sample location records the number of events at that location and this is used to weight the density.
For example, you may have a dataset that records the location of car accidents in a city along with the number of people killed and injured in those accidents. A weighted heat map will show you where the highest concentrations of human casualties are. Or, you may have a dataset that records the location of robberies in a city and at each sample you have an estimate of the value of the lost property. A weighted heat map will provide a representation of where the most financial damage is being inflicted by the criminal activity.
Heat map – Sample count
Returns an exact count of the number of samples within the search ellipse centred on each raster cell.
This measure may be more useful from a statistical analysis perspective than the smooth estimates of density returned by other methods.
Heat map – Sample density
Returns the exact spatial density of samples within the search ellipse centred on each raster cell.
This is the same as the sample count method, except that the count is converted to a spatial density (based on the horizontal units of the raster).
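To make the count-versus-density distinction concrete, here is a Python sketch for a single raster cell. I have assumed density is normalised by the area of the search ellipse, which is one plausible normalisation; the names are mine, not Pro's:

```python
import math

def sample_count_and_density(cx, cy, points, rx, ry):
    """Count samples inside a search ellipse centred on (cx, cy) with
    semi-axes rx and ry, then convert the count to a spatial density
    (samples per square horizontal unit of the raster)."""
    count = sum(
        1 for (x, y) in points
        if ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0
    )
    area = math.pi * rx * ry          # area of the search ellipse
    return count, count / area
```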
Triangulation
Generates a Delaunay triangulation from the sample points and populates raster cells by linear interpolation across the triangle that overlies the raster cell centre. If the cell size is small compared to the triangle size, the flat triangle surfaces will be readily visible in the raster. Smoothing can be achieved by filtering the output raster.
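The linear interpolation across a triangle can be written with barycentric weights. This is the standard textbook formulation, not Pro's internal code, and the function name is mine:

```python
def interpolate_in_triangle(p, a, b, c, za, zb, zc):
    """Linear (barycentric) interpolation of a value at point p inside
    triangle (a, b, c) with vertex values za, zb, zc."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    wc = 1.0 - wa - wb                # weights sum to one
    return wa * za + wb * zb + wc * zc
```

At a vertex the weight of that vertex is one, which is why this method honours the sample values exactly.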
Minimum curvature
Fits a hypothetical smooth surface that minimises overall curvature through the sample locations and returns the value of this surface at each raster cell. The raster will honour the value at sample locations closely but, to maintain maximum smoothness, may produce undesired local overshoots or undershoots and may exceed the minimum and maximum sample values.
Output raster design
The most important design issue is cell size. If you want the raster to closely honour the input sample values, you will need a cell size 4 – 5 times smaller than the sample spacing. Balance this against the dependence of the method performance on the number of raster cells.
For minimum curvature (and triangulation to a lesser extent) the cell size is critical to success. If the cell size is too small then minimum curvature will fail to interpolate smoothly between samples. Follow the 4-5 cells per sample spacing rule.
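The 4-5 cells per sample spacing rule is simple arithmetic; a purely illustrative helper (the name and default are my own):

```python
def recommended_cell_size(sample_spacing, cells_per_spacing=4):
    """Apply the rule of thumb of 4-5 raster cells per sample spacing."""
    return sample_spacing / cells_per_spacing
```

For example, samples spaced 20 metres apart suggest a cell size of 4 to 5 metres.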
Sometimes source data samples in a regular anisotropic arrangement are located on lines that run east-west or north-south in a local grid coordinate system but are inclined in a map projection. For some methods (like minimum curvature or bi-directional splining), it can be better to grid the data in the local coordinate system. Thereafter, the raster will need to be warped to be displayed in a map coordinate system. This can be done by running a warp operation or using a virtual raster that implements a warp on-the-fly.
Tips for optimal results from each method
Heat map
Algorithm performance is determined by the raster cell size and the size of the search ellipse. For high performance, choose a larger cell size and a smaller search ellipse. A small search ellipse will capture more detail, but it may not be able to interpolate between samples, leading to gaps in the raster coverage. Aim for a search ellipse approximately two times the sample spacing.
To achieve smoothness, you need to select a kernel weighting function that tapers to zero. See Wikipedia for details - https://en.wikipedia.org/wiki/Kernel_(statistics). Use either "Quartic" or "Triweight". In our 2019 major release, we will add "Tricube" and also introduce a new kernel called "Gaussian Sharpened" that will allow you to use a larger search ellipse to ensure coverage and smoothness, whilst still capturing detail.
To compute an estimate, select a tapered kernel and turn off "use input column as frequency". To compute a weighted estimate, select the column and turn on "use input column as frequency". To compute a sample count select the "Data count" kernel option and to compute sample density turn on "normalize by area".
In general, you should not need to use smoothing or coincident point processing.
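To make the kernel behaviour concrete, here is a sketch of the Quartic and Triweight kernels (standard textbook forms, per the Wikipedia page linked above) and a single-cell estimate with optional frequency weighting. The function names are mine and this is not Pro's implementation:

```python
import math

def quartic(u):
    """Quartic (biweight) kernel; tapers smoothly to zero at u = 1."""
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def triweight(u):
    """Triweight kernel; tapers to zero even more gently than quartic."""
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) < 1.0 else 0.0

def heat_map_value(cx, cy, points, radius, kernel, weights=None):
    """Kernel-weighted estimate at one raster cell centre.  If `weights`
    is given, each sample contributes its weight (mimicking the "use
    input column as frequency" option); otherwise each sample counts
    as 1."""
    total = 0.0
    for i, (x, y) in enumerate(points):
        u = math.hypot(x - cx, y - cy) / radius
        w = 1.0 if weights is None else weights[i]
        total += w * kernel(u)
    return total
```

Because both kernels reach zero exactly at the edge of the search distance, the resulting raster has no abrupt steps where samples enter or leave the search ellipse.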
Stamp
Algorithm performance is determined by the size of the input dataset and less so by the raster size. You can choose a small cell size (to reflect the actual sample values) or a large cell size (to integrate sample values over a cell). The method is missing an option to count the number of samples in a cell (this will be added in the 2019 major release).
In general, you should not use smoothing or coincident point processing.
Nearest neighbour
Algorithm performance is determined by the raster cell size and the size of the search radius. Raster cell coverage is determined by the search radius and the goal is to find the smallest search radius that provides you with the coverage that you require. Start small, and work upwards.
To contain the raster on the edges of the sample space, you may want to use a clipping polygon.
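The "start small, and work upwards" advice can be automated with a brute-force search; the names and the growth factor below are my own choices:

```python
import math

def smallest_covering_radius(cell_centres, points, start=1.0, factor=1.5):
    """Grow the search radius from `start` until every raster cell
    centre has at least one sample within the radius."""
    def covered(radius):
        return all(
            any(math.hypot(x - cx, y - cy) <= radius for (x, y) in points)
            for (cx, cy) in cell_centres
        )
    radius = start
    while not covered(radius):
        radius *= factor
    return radius
```

In practice you would test against a subsample of cells, since checking every cell against every point is expensive.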
Triangulation
Algorithm performance is determined by the number of input samples (which must be triangulated) and to a lesser extent by the raster cell size.
You are required to set the "Maximum triangle size". Triangles with a side length longer than this will be excluded. This will prevent triangulation across large voids where there are no source data points. Also, it can prevent large triangles from being generated along the edges of the data set. If you set a very large triangle size (as large as the sample space) then you will generate a triangulation that extends to the convex hull of the samples for small datasets. For large or huge source datasets the convex hull is not guaranteed.
Coincident point processing is recommended.
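The maximum triangle size test amounts to a filter over triangle side lengths. In this sketch the triangulation itself (a list of vertex-index triples) is assumed to come from elsewhere, and the names are mine:

```python
import math

def filter_triangles(points, triangles, max_side):
    """Drop any triangle whose longest side exceeds max_side, which
    prevents interpolation across voids and suppresses long slivers
    along the edges of the dataset."""
    kept = []
    for (i, j, k) in triangles:
        (ax, ay), (bx, by), (cx, cy) = points[i], points[j], points[k]
        longest = max(
            math.hypot(ax - bx, ay - by),
            math.hypot(bx - cx, by - cy),
            math.hypot(cx - ax, cy - ay),
        )
        if longest <= max_side:
            kept.append((i, j, k))
    return kept
```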
Natural neighbour
Because this implementation discretises the polygon area contributions by the raster cell size, the result is often not perfectly smooth. Some smoothing is recommended to counter this. Coincident point processing is recommended.
Inverse Distance Weighted
The search ellipse can be carefully designed in both shape and orientation and it can be divided into sectors from which samples will be selected. This creates a powerful sample selection tool but it presupposes that the same selection rules will be suitable across the entire sample space. This method is susceptible to sample density variations and regions of dense sampling will overly influence the estimations around them. Using search sectors is a partial solution to this.
The linear weighting model does not produce smooth rasters but is easy to control. Both the power and exponential models depend on the scale of the distance, which makes them difficult to work with. In the 2019 major release we will introduce a new scaling parameter to improve the utility of these models. The Gaussian model can produce smooth rasters as long as the range parameter is carefully chosen. In all cases, experimentation is required and it may take many trial runs to tune the parameters appropriately. In the 2019 release we will offer new weighting functions that will produce smoother and more pleasing results.
Coincident point processing and a clipping polygon are recommended. Smoothing may be necessary.
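For reference, here is a minimal sketch of the power weighting model within a circular search radius. The names are mine, and the linear, exponential and Gaussian models mentioned above are not shown:

```python
import math

def idw(cx, cy, samples, radius, power=2.0):
    """Inverse-distance-weighted estimate at (cx, cy) from the samples
    inside the search radius; `power` controls how quickly a sample's
    influence falls off with distance."""
    num = den = 0.0
    for (x, y, z) in samples:
        d = math.hypot(x - cx, y - cy)
        if d > radius:
            continue                  # outside the search radius
        if d == 0.0:
            return z                  # coincident sample: honour it exactly
        w = 1.0 / d ** power
        num += w * z
        den += w
    return num / den if den else None # None where no samples are in range
```

Because the weights are always positive and sum to one after normalisation, the estimate can never exceed the range of the contributing sample values.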
Minimum curvature
Algorithm performance is determined by the raster cell size and the number of iterations performed. Critically, once the raster is so large that it cannot be held in memory and needs to be swapped out to file, performance will worsen dramatically.
A higher number of iterations will lead to a higher quality raster but iterations will cease once the change from one iteration to the next is lower than the "percentage change" value. To ensure iterations are performed, make this a very small number.
"Spline tension" ranges from zero to one and defaults to zero. For many kinds of data it is appropriate to introduce a small amount of tension (for example 0.25). This will help control the raster where it is unconstrained.
A clipping polygon is recommended. Smoothing is not recommended.
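To show how the iterations and the "percentage change" stopping rule interact, here is a deliberately simplified sketch. It relaxes each free cell towards the mean of its neighbours, which is a Laplacian stand-in for the true biharmonic minimum curvature operator, and all names are mine:

```python
def relax(grid, fixed, max_iterations=1000, percent_change=0.01):
    """Iteratively replace each free interior cell of `grid` (a list of
    row lists) with the mean of its four neighbours.  Cells listed in
    `fixed` (a set of (row, col) tuples) are sample-constrained and
    left untouched.  Stop when the largest per-cell change in one
    iteration falls below `percent_change` (a percentage)."""
    rows, cols = len(grid), len(grid[0])
    for _ in range(max_iterations):
        biggest = 0.0
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                if (r, c) in fixed:
                    continue          # honour the sample-constrained cells
                new = (grid[r - 1][c] + grid[r + 1][c] +
                       grid[r][c - 1] + grid[r][c + 1]) / 4.0
                if grid[r][c] != 0.0:
                    change = abs(new - grid[r][c]) / abs(grid[r][c]) * 100.0
                else:
                    change = abs(new) * 100.0
                biggest = max(biggest, change)
                grid[r][c] = new
        if biggest < percent_change:
            break                     # converged: change is negligible
    return grid
```

Making `percent_change` very small forces more iterations, matching the advice above.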
What is a good result?
If you have chosen an assignment method then a good result is largely determined by the design of your output raster. Primarily, make sure the cell size is appropriate to represent the source data.
If you have chosen a heat map method then a good raster will strike some kind of balance between detail and smoothness. Where that balance lies is up to you. In general, smoothness will be achieved at the expense of detail.
If you have chosen an interpolation method then a good raster will strike a balance between honouring the input data samples (meaning that if you query the value of the raster at a sample data location, it ought to return a result as close as possible to the source data value) and smoothness. Where that balance lies is up to you and depends on how much confidence you have in the sample value correctness (or perhaps how much you value that correctness). You might value smoothness because a smooth raster tends to be easier to visualise and interpret.
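Honouring can be checked by querying the raster at each sample location and inspecting the residuals. A minimal sketch, where the function names and the RMS summary are my own:

```python
import math

def honouring_residuals(samples, raster_value):
    """Residual at each sample: the source value minus the raster value
    queried at the same location.  `raster_value` is any callable
    (x, y) -> z standing in for a raster query."""
    return [z - raster_value(x, y) for (x, y, z) in samples]

def rms(residuals):
    """Root-mean-square residual: a single number summarising how
    closely the raster honours the input samples."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))
```

A small RMS means the raster honours the data closely; a deliberately smoothed raster will generally show a larger RMS.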
This table tells you if the performance of a gridding method is related to the number of source data samples or the number of raster cells, or both.
Search radius critical
Cell size hardly matters
High performance interpolator
Difficult to multi-thread, memory management critical
This table tells you whether a gridding method is suitable for small, large or huge sample data collections.
Control the search ellipse
Limit the maximum search distance
Carefully choose the cell size
This table tells you what kinds of sample arrangements are handled well by each gridding method. It should be taken as a guide only.
These tables provide commentary on each sample arrangement for each gridding method.
Sample density is uniform, unlikely use case.
Locate the centre of the raster cells on the sample locations.
Simulates a stamped raster.
Triangles can flip orientation randomly.
Simulates a smoothed stamped raster.
Search distance can be isotropic and minimised.
Choose a cell size ¼ the sample separation.
Sample density is regular, unlikely use case.
Choose a cell size suitable for the along line sample spacing.
Tessellation between lines tends to be unsightly.
Long triangles between lines tend to be unsightly.
Orient the search ellipse to straddle lines.
Choose a cell size ¼ the line separation.
Difficult to obtain a meaningful result.
Tessellation can change rapidly and tends to be unsightly.
Triangles can change rapidly and tend to be unsightly.
Difficult to design an appropriate search ellipse.
Difficult to choose an appropriate cell size.
Choose a weighting function that tapers to create a smooth raster.
Fill voids with a large search distance, but watch runtime performance.
Control or create voids or the concave hull with maximum triangle size.
Choose a cell size based on the smallest sample separation along lines.
MapInfo Pro Advanced provides a selection of gridding methods that are suitable for a wide variety of users. Some specialist methods are not provided – for example, bi-directional splining is fast and effective for line-based potential field surveys, and Kriging in its various forms is popular in the geostatistics community.
If you are working with demographic data you will probably use the Heat Map (Estimate and Weighted Estimate) methods.
If you are working with LiDAR data you will probably use Triangulation, especially if the source datasets are large.
If you are estimating a field (like temperature or a magnetic field for example) then Minimum curvature is likely to provide the best results and consider using some spline tension.
If you are working with highly variable data where you want to get a sense of trends without seeing the high-frequency variations, then Inverse Distance Weighted may be a suitable choice.
And don't forget that if you find a bug or want to see a feature implemented then let us know about it!