Computing Geographic Distance with Snowflake
In this article, I will introduce how to implement a UDF (User-Defined Function) that computes geographic distances in Snowflake.
User-Defined Functions (UDFs)
In the context of databases, functions are reusable pieces of code that take some input, perform an action, and return the result. Snowflake, like many other database systems, provides a range of built-in functions. However, sometimes the built-in functions are not enough to meet the specific requirements of complex data processing tasks. This is where UDFs come into play.
UDFs allow users to define their own functions in Snowflake. These functions can then be reused across multiple queries and applications, just like the built-in functions. UDFs can be particularly useful for encapsulating a piece of logic that is used repeatedly.
Haversine Formula
The Haversine Formula is a mathematical equation used to calculate the great-circle distance between two points on the surface of a sphere, given their longitudes and latitudes. This formula is crucial when it comes to measuring distances on Earth, which can be approximated as a sphere for short distances.
The formula is given as:
Where:
are the latitude of point 1 and latitude of point 2 (in radians),\phi_1, \phi_2 is the difference of latitudes (in radians),\Delta \phi is the difference in longitudes (in radians),\Delta \lambda is the radius of the Earth (mean radius = 6,371 km),R is the great-circle distance between the two points (along the surface of the sphere).d
This formula breaks down the calculation into three parts:
Implementing a UDF in Snowflake
I will walk through the steps of implementing the Haversine formula as UDF in Snowflake.
Creating the Haversine Distance UDF
Let's write the SQL query to create the UDF that implements the Haversine formula. In this UDF, we'll pass four parameters - the latitudes and longitudes of two points. The function should return the distance between these points in kilometers.
Here’s the corrected SQL code to create this UDF:
CREATE OR REPLACE FUNCTION HAVERSINE_DISTANCE(
lat1 FLOAT,
lon1 FLOAT,
lat2 FLOAT,
lon2 FLOAT
)
RETURNS FLOAT
LANGUAGE SQL
AS
$$
-- Earth's radius in km
6371.0 * 2 * ATAN2(
SQRT(
SIN(RADIANS(lat2 - lat1) / 2) * SIN(RADIANS(lat2 - lat1) / 2) +
COS(RADIANS(lat1)) * COS(RADIANS(lat2)) *
SIN(RADIANS(lon2 - lon1) / 2) * SIN(RADIANS(lon2 - lon1) / 2)
),
SQRT(
1 - (
SIN(RADIANS(lat2 - lat1) / 2) * SIN(RADIANS(lat2 - lat1) / 2) +
COS(RADIANS(lat1)) * COS(RADIANS(lat2)) *
SIN(RADIANS(lon2 - lon1) / 2) * SIN(RADIANS(lon2 - lon1) / 2)
)
)
)
$$;
This code directly embeds the calculations from the Haversine formula. It uses built-in functions like SIN
, COS
, ATAN2
, and RADIANS
to perform the necessary mathematical operations.
Using the Haversine Distance UDF
With the UDF created, you can now use it in SQL queries to calculate the distance between any two geographic points. For example, let’s calculate the distance between Tokyo (35.6895N, 139.6917E) and Los Angeles (34.0522N, -118.2437E):
SELECT HAVERSINE_DISTANCE(35.6895, 139.6917, 34.0522, -118.2437) AS distance_in_km;
DISTANCE_IN_KM | |
---|---|
1 | 8815.473355809 |
This query uses the HAVERSINE_DISTANCE
function we created, passing in the latitude and longitude of Tokyo and Los Angeles, and returns the distance between these two cities in kilometers.