SQL DISTINCT: Selecting Unique Values in SQL
Introduction
The DISTINCT keyword in SQL is used to ensure that the results of a query contain only unique values.
It eliminates duplicate rows from the result set, allowing you to focus on distinct entries.
This is especially useful when dealing with large datasets where you need to retrieve unique records based on one or more columns.
1. What is the DISTINCT Keyword?
The DISTINCT keyword is placed directly after SELECT and applies to the entire select list, not to individual columns.
When used in a query, it ensures that each row in the result set is unique based on the combination of selected columns.
Syntax:
SELECT DISTINCT column1, column2, ...
FROM table_name;
Where:
- SELECT DISTINCT column1, column2, ...: Specifies the columns for which unique values are to be retrieved.
- FROM table_name: Indicates the table from which to retrieve data.
2. How DISTINCT Works
When DISTINCT is used, SQL processes the result set and removes any duplicate rows. The uniqueness is determined based on the combination of columns specified in the SELECT clause.
Example:
Retrieve unique departments from the employees table:
SELECT DISTINCT department_id
FROM employees;
In this example:
- The query returns a list of unique department_id values from the employees table, removing any duplicates.
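To see the effect on concrete data, here is a minimal sketch that runs the query against an in-memory SQLite database through Python's built-in sqlite3 module. The table layout and the five sample rows are assumptions made up for illustration, not data from a real system.

```python
import sqlite3

# Hypothetical sample data: five employees spread over three departments
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, department_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (4, 30), (5, 20)],
)

# Without DISTINCT: one row per employee, so department IDs repeat
all_rows = conn.execute("SELECT department_id FROM employees").fetchall()

# With DISTINCT: each department ID appears once
unique_rows = conn.execute("SELECT DISTINCT department_id FROM employees").fetchall()

# Sort in Python, because SQL guarantees no row order without ORDER BY
unique_departments = sorted(row[0] for row in unique_rows)

print(len(all_rows))        # 5
print(unique_departments)   # [10, 20, 30]
conn.close()
```

DISTINCT itself does not promise any ordering, so add ORDER BY when the order of the unique values matters.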
3. Using DISTINCT with Multiple Columns
You can use DISTINCT with multiple columns to get unique combinations of values across those columns.
Example:
Retrieve unique combinations of department_id and job_title from the employees table:
SELECT DISTINCT department_id, job_title
FROM employees;
In this example:
- The result set includes unique pairs of department_id and job_title. Rows are treated as duplicates only when they match on both columns, and each such group is collapsed into a single row.
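The pair-wise behavior is easiest to check on a small table. The sketch below uses Python's sqlite3 module with hypothetical sample rows: department 10 has two engineers and one manager, so the two engineer rows collapse into one while the manager row survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, department_id INTEGER, job_title TEXT)"
)
# Hypothetical rows chosen so that some (department_id, job_title) pairs repeat
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, 10, "Engineer"),
        (2, 10, "Engineer"),
        (3, 10, "Manager"),
        (4, 20, "Engineer"),
        (5, 20, "Engineer"),
    ],
)

# Uniqueness is judged on the whole (department_id, job_title) pair
pairs = conn.execute(
    "SELECT DISTINCT department_id, job_title "
    "FROM employees "
    "ORDER BY department_id, job_title"
).fetchall()

print(pairs)  # [(10, 'Engineer'), (10, 'Manager'), (20, 'Engineer')]
conn.close()
```

Note that 'Engineer' still appears twice in the output, because it is paired with two different departments.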
4. Common Use Cases
4.1. Retrieving Unique Values
Fetch a list of unique job titles from the employees table:
SELECT DISTINCT job_title
FROM employees;
4.2. Retrieving Unique Combinations
Find unique combinations of departments and job titles from the employees table:
SELECT DISTINCT department_id, job_title
FROM employees;
4.3. Filtering Unique Records Based on Criteria
Retrieve unique department IDs for employees who are managers:
SELECT DISTINCT department_id
FROM employees
WHERE job_title = 'Manager';
4.4. Combining DISTINCT with Aggregates
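Get each department and its total count of employees. Because GROUP BY already returns one row per department_id, DISTINCT is redundant in this query and can be dropped without changing the result: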
SELECT DISTINCT department_id, COUNT(employee_id) AS total_employees
FROM employees
GROUP BY department_id;
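Since GROUP BY already produces one row per department, the DISTINCT in the query above has no effect on its output. A place where DISTINCT does change an aggregate is inside the function call, as in COUNT(DISTINCT column). The sketch below illustrates both points with Python's sqlite3 module; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, department_id INTEGER, job_title TEXT)"
)
# Hypothetical rows: department 10 has three employees but only two job titles
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, "Engineer"), (2, 10, "Engineer"), (3, 10, "Manager"), (4, 20, "Engineer")],
)

grouped_sql = (
    "SELECT {distinct} department_id, COUNT(employee_id) AS total_employees "
    "FROM employees GROUP BY department_id ORDER BY department_id"
)
with_distinct = conn.execute(grouped_sql.format(distinct="DISTINCT")).fetchall()
without_distinct = conn.execute(grouped_sql.format(distinct="")).fetchall()

# DISTINCT inside the aggregate counts each job title once per department
titles_per_department = conn.execute(
    "SELECT department_id, COUNT(DISTINCT job_title) AS unique_titles "
    "FROM employees GROUP BY department_id ORDER BY department_id"
).fetchall()

print(with_distinct == without_distinct)  # True
print(with_distinct)                      # [(10, 3), (20, 1)]
print(titles_per_department)              # [(10, 2), (20, 1)]
conn.close()
```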
4.5. Filtering Out Duplicate Entries in a Join
Get unique employee names and their corresponding department names after joining with the departments table:
SELECT DISTINCT e.employee_name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;
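As a rough illustration of where duplicates come from in this join, the sketch below uses Python's sqlite3 module with hypothetical data in which two different employees share the same name and department. The plain join returns one row per employee, while DISTINCT collapses the identical name and department pairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department_name TEXT)")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, employee_name TEXT, department_id INTEGER)"
)
# Hypothetical data: employees 1 and 2 are different people with the same name
conn.executemany("INSERT INTO departments VALUES (?, ?)", [(10, "Sales"), (20, "IT")])
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ana", 10), (2, "Ana", 10), (3, "Ben", 20)],
)

join_sql = (
    "SELECT {distinct} e.employee_name, d.department_name "
    "FROM employees e "
    "JOIN departments d ON e.department_id = d.department_id "
    "ORDER BY e.employee_name"
)
joined_rows = conn.execute(join_sql.format(distinct="")).fetchall()
unique_rows = conn.execute(join_sql.format(distinct="DISTINCT")).fetchall()

print(len(joined_rows))  # 3
print(unique_rows)       # [('Ana', 'Sales'), ('Ben', 'IT')]
conn.close()
```

If the two employees named Ana need to stay separate, include employee_id in the select list instead of relying on DISTINCT.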
5. Handling Duplicates without DISTINCT
In some cases, you may want to handle duplicates in other ways besides using DISTINCT.
5.1. Using GROUP BY
Instead of DISTINCT, you can use GROUP BY to group rows and aggregate results.
Example:
SELECT department_id, COUNT(*)
FROM employees
GROUP BY department_id;
In this example:
- The GROUP BY clause groups rows by department_id and counts the number of employees in each department.
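The relationship between the two approaches can be checked directly. In the sketch below (Python's sqlite3 module, hypothetical sample rows), GROUP BY without an aggregate returns the same unique values as DISTINCT, and adding COUNT(*) attaches a per-group count that DISTINCT alone cannot provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, department_id INTEGER)"
)
# Hypothetical rows: two employees in department 10, one in department 20
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])

via_distinct = conn.execute(
    "SELECT DISTINCT department_id FROM employees ORDER BY department_id"
).fetchall()
via_group_by = conn.execute(
    "SELECT department_id FROM employees GROUP BY department_id ORDER BY department_id"
).fetchall()

# GROUP BY can also carry aggregates for each group
counts = conn.execute(
    "SELECT department_id, COUNT(*) FROM employees "
    "GROUP BY department_id ORDER BY department_id"
).fetchall()

print(via_distinct == via_group_by)  # True
print(counts)                        # [(10, 2), (20, 1)]
conn.close()
```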
5.2. Using ROW_NUMBER() for More Control
For more control over duplicates, you can use window functions like ROW_NUMBER() to filter out duplicates based on specific criteria.
Example:
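-- The alias is row_num rather than rank, because RANK is a reserved word
-- in some databases (for example, MySQL 8.0).
WITH ranked_employees AS (
    SELECT employee_id,
           department_id,
           ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS row_num
    FROM employees
)
SELECT employee_id, department_id
FROM ranked_employees
WHERE row_num = 1;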
In this example:
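- The ROW_NUMBER() function assigns a sequential number to each employee within their department, ordered by salary from highest to lowest.
- The outer query keeps only the first row of each partition, which is the highest-paid employee in each department. If two employees tie on salary, ROW_NUMBER() still keeps just one of them.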
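A minimal sketch of this pattern, run through Python's sqlite3 module, is shown below. It assumes the bundled SQLite library is version 3.25 or newer (the first release with window functions), and the salary figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, department_id INTEGER, salary INTEGER)"
)
# Hypothetical salaries: employee 2 leads department 10, employee 4 leads department 20
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, 5000), (2, 10, 7000), (3, 20, 4000), (4, 20, 6000)],
)

# Number the rows within each department by descending salary, then keep row 1
top_paid = conn.execute(
    """
    WITH ranked_employees AS (
        SELECT employee_id,
               department_id,
               ROW_NUMBER() OVER (
                   PARTITION BY department_id ORDER BY salary DESC
               ) AS row_num
        FROM employees
    )
    SELECT employee_id, department_id
    FROM ranked_employees
    WHERE row_num = 1
    ORDER BY department_id
    """
).fetchall()

print(top_paid)  # [(2, 10), (4, 20)]
conn.close()
```

Unlike DISTINCT, this approach lets you decide which row of each group survives by changing the ORDER BY inside the window.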
6. Performance Considerations
- Efficiency: Using DISTINCT can be resource-intensive, especially with large datasets. The database typically has to sort or hash the rows to find and remove duplicates, which can impact query performance.
- Indexes: Ensure that appropriate indexes exist on the columns used with DISTINCT. An index can speed up the process of finding and eliminating duplicate values.
- Query Optimization: Select only the columns that need to be unique, and consider restructuring queries (for example, filtering rows before deduplicating them) to minimize the performance impact.
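One way to check whether an index helps is to compare query plans before and after creating it. The sketch below does this in SQLite via Python's sqlite3 module, using EXPLAIN QUERY PLAN. The index name idx_emp_dept and the generated rows are hypothetical, the plan wording depends on the SQLite version, and other databases have their own EXPLAIN syntax, so no expected output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, department_id INTEGER)"
)
# Hypothetical data: 100 employees spread over 5 departments
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)", [(i, i % 5) for i in range(1, 101)]
)

query = "SELECT DISTINCT department_id FROM employees"


def plan_details(connection, sql):
    """Return the human-readable detail column of each EXPLAIN QUERY PLAN row."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


plan_without_index = plan_details(conn, query)

# Hypothetical index on the column used with DISTINCT
conn.execute("CREATE INDEX idx_emp_dept ON employees (department_id)")
plan_with_index = plan_details(conn, query)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
conn.close()
```

Comparing the two printed plans shows whether the engine switches from scanning the table to reading the index for this query.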
Conclusion
The DISTINCT keyword is a powerful tool for retrieving unique values in SQL queries. By removing duplicate rows from the result set, it helps ensure that the data retrieved is precise and relevant. Understanding how to use DISTINCT effectively, along with considering performance implications and alternative methods for handling duplicates, enhances your ability to work with SQL databases and obtain meaningful results.