2023-05-29

Computing Geographic Distance with Snowflake UDF

Computing Geographic Distance with Snowflake

In this article, I will introduce how to implement a UDF (User-Defined Function) that computes geographic distances in Snowflake.

User-Defined Functions (UDFs)

In the context of databases, functions are reusable pieces of code that take some input, perform an action, and return the result. Snowflake, like many other database systems, provides a range of built-in functions. However, sometimes the built-in functions are not enough to meet the specific requirements of complex data processing tasks. This is where UDFs come into play.

UDFs allow users to define their own functions in Snowflake. These functions can then be reused across multiple queries and applications, just like the built-in functions. UDFs can be particularly useful for encapsulating a piece of logic that is used repeatedly.

Haversine Formula

The Haversine Formula is a mathematical equation used to calculate the great-circle distance between two points on the surface of a sphere, given their longitudes and latitudes. This formula is crucial when it comes to measuring distances on Earth, which can be approximated as a sphere for short distances.

The formula is given as:

a = \sin^2\left(\frac{\Delta \phi}{2}\right) + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2\left(\frac{\Delta \lambda}{2}\right)
c = 2 \cdot \text{atan2}\left(\sqrt{a}, \sqrt{1-a}\right)
d = R \cdot c

Where:

  • \phi_1, \phi_2 are the latitude of point 1 and latitude of point 2 (in radians),
  • \Delta \phi is the difference of latitudes (in radians),
  • \Delta \lambda is the difference in longitudes (in radians),
  • R is the radius of the Earth (mean radius = 6,371 km),
  • d is the great-circle distance between the two points (along the surface of the sphere).

This formula breaks down the calculation into three parts: a calculates the square of half the chord length between the points; c represents the angular distance in radians, and d computes the actual distance using the radius of the Earth.

Implementing a UDF in Snowflake

I will walk through the steps of implementing the Haversine formula as UDF in Snowflake.

Creating the Haversine Distance UDF

Let's write the SQL query to create the UDF that implements the Haversine formula. In this UDF, we'll pass four parameters - the latitudes and longitudes of two points. The function should return the distance between these points in kilometers.

Here’s the corrected SQL code to create this UDF:

sql
CREATE OR REPLACE FUNCTION HAVERSINE_DISTANCE(
  lat1 FLOAT,
  lon1 FLOAT,
  lat2 FLOAT,
  lon2 FLOAT
)
RETURNS FLOAT
LANGUAGE SQL
AS
$$
  -- Earth's radius in km
  6371.0 * 2 * ATAN2(
    SQRT(
      SIN(RADIANS(lat2 - lat1) / 2) * SIN(RADIANS(lat2 - lat1) / 2) +
      COS(RADIANS(lat1)) * COS(RADIANS(lat2)) *
      SIN(RADIANS(lon2 - lon1) / 2) * SIN(RADIANS(lon2 - lon1) / 2)
    ),
    SQRT(
      1 - (
        SIN(RADIANS(lat2 - lat1) / 2) * SIN(RADIANS(lat2 - lat1) / 2) +
        COS(RADIANS(lat1)) * COS(RADIANS(lat2)) *
        SIN(RADIANS(lon2 - lon1) / 2) * SIN(RADIANS(lon2 - lon1) / 2)
      )
    )
  )
$$;

This code directly embeds the calculations from the Haversine formula. It uses built-in functions like SIN, COS, ATAN2, and RADIANS to perform the necessary mathematical operations.

Using the Haversine Distance UDF

With the UDF created, you can now use it in SQL queries to calculate the distance between any two geographic points. For example, let’s calculate the distance between Tokyo (35.6895N, 139.6917E) and Los Angeles (34.0522N, -118.2437E):

sql
SELECT HAVERSINE_DISTANCE(35.6895, 139.6917, 34.0522, -118.2437) AS distance_in_km;
DISTANCE_IN_KM
1 8815.473355809

This query uses the HAVERSINE_DISTANCE function we created, passing in the latitude and longitude of Tokyo and Los Angeles, and returns the distance between these two cities in kilometers.

Ryusei Kakujo

researchgatelinkedingithub

Focusing on data science for mobility

Bench Press 100kg!