2023-03-19

SQL Tuning Techniques

This article introduces some fundamental strategies to improve the speed and efficiency of SQL queries.

Utilizing Indexes

Indexing is a crucial concept in SQL as it significantly improves the performance of database queries.

Improving SQL Performance by Creating Indexes

Creating an index on the columns used in the WHERE clause can turn a full table scan into an index scan, thereby improving query speed. Note, however, that an index tends to pay off only when the filter selects a relatively small subset of the rows.

sql
-- Before optimization
SELECT * FROM books WHERE author = 'Hemingway';

-- After optimization
CREATE INDEX idx_books_author ON books(author);
SELECT * FROM books WHERE author = 'Hemingway';

Creating Compound Indexes

A compound index includes multiple columns. It is particularly effective when you filter on multiple columns in your WHERE clause. Bear in mind that the order of columns in the index matters: a B-tree compound index can only be used when the query filters on a leftmost prefix of the indexed columns.

sql
-- Before optimization
SELECT * FROM students WHERE firstName = 'John' AND lastName = 'Doe';

-- After optimization
CREATE INDEX idx_students_first_last_name ON students(firstName, lastName);
SELECT * FROM students WHERE firstName = 'John' AND lastName = 'Doe';

Creating Indexes for Each Column Instead of Compound Indexes

In certain cases, separate indexes on each column work better than a single compound index. This is particularly true for OR conditions: a compound index cannot efficiently serve an OR across different columns, whereas some optimizers (for example, MySQL's index merge) can combine separate single-column indexes. Separate indexes also help when you frequently filter on each column individually.

sql
-- Before optimization
SELECT * FROM orders WHERE customerID = 123 OR productID = 456;

-- After optimization
CREATE INDEX idx_orders_customerID ON orders(customerID);
CREATE INDEX idx_orders_productID ON orders(productID);
SELECT * FROM orders WHERE customerID = 123 OR productID = 456;

Creating Indexes for Columns used in ORDER BY and GROUP BY

Columns used in ORDER BY and GROUP BY clauses can also benefit from indexing, which lets the database read rows in the required order instead of sorting them. If multiple columns are involved, consider creating a compound index covering them in the same order as in the clause.

sql
-- Before optimization
SELECT category, COUNT(*) FROM products GROUP BY category;

-- After optimization
CREATE INDEX idx_products_category ON products(category);
SELECT category, COUNT(*) FROM products GROUP BY category;

sql
-- Before optimization
SELECT * FROM customers ORDER BY lastName, firstName;

-- After optimization
CREATE INDEX idx_customers_last_first_name ON customers(lastName, firstName);
SELECT * FROM customers ORDER BY lastName, firstName;

Avoiding Meaningless Sorting

Database query performance can suffer significantly from unnecessary sorting. Operations such as ORDER BY and GROUP BY can cause processing time to balloon when they are applied to more data than necessary.

Avoiding Unnecessary ORDER BY and GROUP BY

ORDER BY and GROUP BY operations can consume a significant amount of processing time, so use them only when necessary. When they are needed, apply them after the data set has been filtered down as far as possible.

sql
-- Before optimization
SELECT * FROM users ORDER BY name;

-- After optimization
SELECT * FROM users WHERE active = 1 ORDER BY name;

In the optimized version, the ORDER BY clause is applied after filtering inactive users, which reduces the number of records to be sorted.

Speeding up MAX, MIN Calculation with Indexes

Calculations such as MIN and MAX can be optimized using indexes. With an index on the column involved, the database can read the smallest and largest values directly from the ends of the index instead of scanning the whole table.

sql
-- Before optimization
SELECT MAX(score), MIN(score) FROM students;

-- After optimization
CREATE INDEX idx_students_score ON students(score);
SELECT MAX(score), MIN(score) FROM students;

Using EXISTS instead of DISTINCT

The DISTINCT clause, which eliminates duplicate rows, can be a costly operation because the database typically has to sort or hash the entire result set. When the duplicates are introduced by a join, rewriting the query with EXISTS avoids producing them in the first place.

sql
-- Before optimization
SELECT DISTINCT c.course_id FROM courses AS c INNER JOIN enrollments AS e ON c.course_id = e.course_id;

-- After optimization
SELECT c.course_id FROM courses AS c WHERE EXISTS (SELECT 1 FROM enrollments AS e WHERE e.course_id = c.course_id);

Using UNION ALL instead of UNION

UNION can be resource-intensive because it must sort or hash the combined result to eliminate duplicate records. If you know the combined rows are already unique, or duplicates are acceptable, use UNION ALL instead; it simply concatenates the results without any deduplication work.

sql
-- Before optimization
SELECT name FROM employees UNION SELECT name FROM managers;

-- After optimization
SELECT name FROM employees UNION ALL SELECT name FROM managers;

Avoiding Repetitive Processing

In SQL tuning, the key principle to embrace is to minimize repetitive processing as much as possible. This can be accomplished by reducing the number of table accesses, preferring JOINs over subqueries within SELECT, and avoiding unnecessary repeated filtering.

Reducing the Number of Table Accesses

A fundamental part of SQL tuning is decreasing how often tables are accessed. If everything you need from a table can be obtained in a single pass, retrieve it in one query rather than accessing the table repeatedly.
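
For example, suppose you want both the number of active and inactive users. Instead of querying the table twice, a single pass with conditional aggregation returns both counts (a minimal sketch; the users table and its active flag are borrowed from the earlier ORDER BY example):

sql
-- Before optimization: two table accesses
SELECT COUNT(*) FROM users WHERE active = 1;
SELECT COUNT(*) FROM users WHERE active = 0;

-- After optimization: one table access with conditional aggregation
SELECT
  SUM(CASE WHEN active = 1 THEN 1 ELSE 0 END) AS active_count,
  SUM(CASE WHEN active = 0 THEN 1 ELSE 0 END) AS inactive_count
FROM users;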

Use JOIN Instead of Subqueries within SELECT

When a subquery is used in the select list of a SELECT statement, it may be executed once for each row, which increases processing time. Using a JOIN for the same association produces the same result in a single pass.

Consider the following code, where we are executing a subquery within the SELECT statement for each employee row to get the name of the department.

sql
SELECT
  emp.*,
  (SELECT dp.name FROM departments AS dp WHERE dp.id = emp.department_id) AS department_name
FROM employees AS emp;

By using a LEFT JOIN, we can achieve the same result without having to execute the subquery multiple times, as shown in the modified code below.

sql
SELECT
  emp.*,
  dp.name AS department_name
FROM employees AS emp
LEFT JOIN departments AS dp
  ON emp.department_id = dp.id;

Avoid Unnecessary Repeated Filtering

Applying the same WHERE filter more than once within a query also adds processing time. Restructuring the query so the filter is applied only once eliminates the redundant work.

In the following example, the same date-range filter appears twice: once inside the subquery and once in the outer query.

sql
SELECT o1.*, o2.daily_summary
FROM orders AS o1
LEFT JOIN (
   SELECT order_date, SUM(order_amount * order_price) AS daily_summary
   FROM orders
   WHERE order_date BETWEEN '2023-01-01' AND '2023-01-31'
   GROUP BY order_date
) AS o2
ON o1.order_date = o2.order_date
WHERE o1.order_date BETWEEN '2023-01-01' AND '2023-01-31';

By using window functions, we can reduce it to a single filter and a single aggregation as shown in the modified code below.

sql
SELECT
  o1.*,
  SUM(o1.order_amount * o1.order_price) OVER (PARTITION BY o1.order_date) AS daily_summary
FROM orders AS o1
WHERE o1.order_date BETWEEN '2023-01-01' AND '2023-01-31';

Reducing the Number of SQL Executions

When optimizing SQL performance, reducing the number of SQL executions can make your queries considerably more efficient, because each execution incurs a round trip between the application server and the database server. Let's explore some techniques for reducing the number of executions.

Reducing the Number of INSERT Commands

When inserting multiple records into a table, consider using multi-insert. This allows you to insert multiple records in a single SQL execution.

For example, if you want to insert two records into a projects table, you might typically do something like this:

sql
INSERT INTO projects(name, start_date, end_date) VALUES ('Database','2023-02-01','2023-08-31');
INSERT INTO projects(name, start_date, end_date) VALUES ('WebApp','2023-04-05','2023-12-23');

However, these two separate insertions can be combined into one operation as follows:

sql
INSERT INTO projects(name, start_date, end_date)
VALUES
  ('Database', '2023-02-01', '2023-08-31'),
  ('WebApp', '2023-04-05', '2023-12-23');

Removing Index Before Bulk INSERT and Adding It Afterwards

When inserting records into a table that has an index, the system must also update the index, which can be a time-consuming operation. When inserting tens of thousands of records or more, it's worth considering removing the index before the insert operation, and then re-creating it afterwards.
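
A minimal sketch of the pattern, reusing the orders index from the earlier example (the DROP INDEX syntax shown is MySQL-style; other systems such as PostgreSQL use DROP INDEX index_name):

sql
-- Drop the index so the bulk load does not maintain it row by row
DROP INDEX idx_orders_customerID ON orders;

-- ... perform the bulk INSERT of tens of thousands of rows here ...

-- Re-create the index once, after all rows are loaded
CREATE INDEX idx_orders_customerID ON orders(customerID);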

Using LIMIT for Query Execution

If you just want to inspect a few records in a table, you can use the LIMIT clause to reduce processing time. Note, however, that aggregate functions such as COUNT(*) are not sped up by a LIMIT clause, since the aggregate must still scan all qualifying rows before producing its single-row result.
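
For example (LIMIT is the MySQL/PostgreSQL syntax; other systems use TOP or FETCH FIRST):

sql
-- Before optimization: returns every row
SELECT * FROM orders;

-- After optimization: stops after returning 10 rows
SELECT * FROM orders LIMIT 10;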

Using TRUNCATE Instead of DELETE for Total Record Removal

When you need to remove all records from a table, the DELETE operation can be slow because it logs the deletion of each individual row. If you do not need to keep these logs and need to clear a table completely, consider using the TRUNCATE command, which is typically much faster.
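
For example (note that TRUNCATE usually cannot be rolled back and, depending on the database system, resets auto-increment counters):

sql
-- Before optimization: logs the deletion of each row
DELETE FROM orders;

-- After optimization: deallocates the table's data in one operation
TRUNCATE TABLE orders;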

Ryusei Kakujo
