SQL DISTINCT: Selecting Unique Values in SQL



Introduction

The DISTINCT keyword in SQL is used to ensure that the results of a query contain only unique values. It eliminates duplicate rows from the result set, allowing you to focus on distinct entries. This is especially useful when dealing with large datasets where you need to retrieve unique records based on one or more columns.

1. What is the DISTINCT Keyword?

The DISTINCT keyword follows SELECT and applies to the entire select list, not to an individual column. It removes duplicate rows from the result, so each returned row is unique across the selected columns.

Syntax:

SELECT DISTINCT column1, column2, ...
FROM table_name;

Where:

  • SELECT DISTINCT column1, column2, ...: Lists the columns to return; the combination of their values must be unique in the result.
  • FROM table_name: Indicates the table from which to retrieve data.

2. How DISTINCT Works

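When DISTINCT is used, the database builds the result set and then removes duplicate rows. Uniqueness is determined by the combination of all columns in the SELECT clause. For this comparison, NULL values are treated as equal to each other, so rows that differ only by having NULL in the same columns collapse into one.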

Example:

Retrieve unique departments from the employees table:

SELECT DISTINCT department_id
FROM employees;

In this example:

  • The query returns a list of unique department_id values from the employees table, removing any duplicates.
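To make the behavior concrete, here is a minimal sketch with sample data. The table name employees_demo and its rows are assumptions made for illustration (the article does not give the schema of employees), and the statements are written to run on PostgreSQL, MySQL 8+, or SQLite.

```sql
-- Hypothetical demo table; not the article's real employees table.
DROP TABLE IF EXISTS employees_demo;
CREATE TABLE employees_demo (
    employee_id   INTEGER PRIMARY KEY,
    department_id INTEGER            -- left nullable on purpose
);

INSERT INTO employees_demo (employee_id, department_id) VALUES
    (1, 10),
    (2, 10),
    (3, 20),
    (4, NULL),
    (5, NULL);

-- Five input rows collapse to three: 10, 20, and a single NULL.
-- Row order is not guaranteed unless you add ORDER BY.
SELECT DISTINCT department_id
FROM employees_demo;
```

The two NULL rows come back as one row, which shows that DISTINCT treats NULLs as duplicates of each other.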

3. Using DISTINCT with Multiple Columns

You can use DISTINCT with multiple columns to get unique combinations of values across those columns.

Example:

Retrieve unique combinations of department_id and job_title from the employees table:

SELECT DISTINCT department_id, job_title
FROM employees;

In this example:

  • The result set contains each (department_id, job_title) pair once. Two rows count as duplicates only when both columns match; rows that share a department_id but differ in job_title are both kept.
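The sketch below shows the pair-wise comparison on sample data and one way to count the unique combinations. The employees_demo table, its rows, and the aliases unique_pairs and pairs are assumptions for illustration only.

```sql
-- Hypothetical demo table; not the article's real employees table.
DROP TABLE IF EXISTS employees_demo;
CREATE TABLE employees_demo (
    employee_id   INTEGER PRIMARY KEY,
    department_id INTEGER,
    job_title     VARCHAR(50)
);

INSERT INTO employees_demo (employee_id, department_id, job_title) VALUES
    (1, 10, 'Manager'),
    (2, 10, 'Analyst'),
    (3, 10, 'Analyst'),   -- duplicate of row 2 on both columns
    (4, 20, 'Analyst');   -- same title as row 2, different department

-- Returns three pairs: (10, Manager), (10, Analyst), (20, Analyst).
SELECT DISTINCT department_id, job_title
FROM employees_demo;

-- Counts the unique pairs by wrapping the DISTINCT query in a subquery: 3.
SELECT COUNT(*) AS unique_pairs
FROM (
    SELECT DISTINCT department_id, job_title
    FROM employees_demo
) AS pairs;
```

The subquery form is used because COUNT(DISTINCT a, b) with more than one column is not supported by every database.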

4. Common Use Cases

4.1. Retrieving Unique Values

Fetch a list of unique job titles from the employees table:

SELECT DISTINCT job_title
FROM employees;

4.2. Retrieving Unique Combinations

Find unique combinations of departments and job titles from the employees table:

SELECT DISTINCT department_id, job_title
FROM employees;

4.3. Filtering Unique Records Based on Criteria

Retrieve unique department IDs for employees who are managers:

SELECT DISTINCT department_id
FROM employees
WHERE job_title = 'Manager';

4.4. Combining DISTINCT with Aggregates

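Count the employees in each department together with the number of distinct job titles they hold. DISTINCT goes inside the aggregate here; GROUP BY already returns one row per department, so adding SELECT DISTINCT would be redundant:

SELECT department_id,
       COUNT(employee_id) AS total_employees,
       COUNT(DISTINCT job_title) AS unique_job_titles
FROM employees
GROUP BY department_id;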

4.5. Filtering Out Duplicate Entries in a Join

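Get unique pairs of employee name and department name after joining with the departments table. Note that two different employees who share a name and a department collapse into one row; add e.employee_id to the select list if each person must appear separately: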

SELECT DISTINCT e.employee_name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;

5. Handling Duplicates without DISTINCT

In some cases, you may want to handle duplicates in other ways besides using DISTINCT.

5.1. Using GROUP BY

Instead of DISTINCT, you can use GROUP BY to group rows and aggregate results.

Example:

SELECT department_id, COUNT(*)
FROM employees
GROUP BY department_id;

In this example:

  • The GROUP BY clause groups rows by department_id and counts the number of employees in each department.
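GROUP BY can also stand in for DISTINCT directly when no aggregate is involved. The following sketch, using a hypothetical employees_demo table with made-up rows, shows the two forms returning the same set of values.

```sql
-- Hypothetical demo table; not the article's real employees table.
DROP TABLE IF EXISTS employees_demo;
CREATE TABLE employees_demo (
    employee_id   INTEGER PRIMARY KEY,
    department_id INTEGER
);

INSERT INTO employees_demo (employee_id, department_id) VALUES
    (1, 10),
    (2, 10),
    (3, 20);

-- Grouping without an aggregate yields one row per department: 10 and 20.
SELECT department_id
FROM employees_demo
GROUP BY department_id;

-- The DISTINCT form returns the same two values.
SELECT DISTINCT department_id
FROM employees_demo;
```

DISTINCT states the intent more directly when you only need unique values; GROUP BY becomes the better fit once you also want a count or another aggregate per group.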

5.2. Using ROW_NUMBER() for More Control

For more control over duplicates, you can use window functions like ROW_NUMBER() to filter out duplicates based on specific criteria.

Example:

WITH ranked_employees AS (
    SELECT employee_id,
           department_id,
           -- Aliased as rn because RANK is a reserved word in some databases (e.g. MySQL 8).
           ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT employee_id, department_id
FROM ranked_employees
WHERE rn = 1;

In this example:

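  • The ROW_NUMBER() function numbers the employees within each department, starting at 1 for the highest salary.
  • The outer query keeps only the first row per department, that is, one highest-paid employee. If several employees tie for the top salary, only one of them is returned, and which one is arbitrary.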
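The same pattern can remove true duplicates while letting you choose which copy survives, something DISTINCT cannot do. This is a sketch under assumptions: the employees_demo table, its rows, and the rule "keep the lowest employee_id" are invented for illustration, and window functions require a database that supports them (for example PostgreSQL, MySQL 8+, or SQLite 3.25+).

```sql
-- Hypothetical demo table; not the article's real employees table.
DROP TABLE IF EXISTS employees_demo;
CREATE TABLE employees_demo (
    employee_id   INTEGER PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INTEGER
);

INSERT INTO employees_demo (employee_id, employee_name, department_id) VALUES
    (1, 'Ana', 10),
    (2, 'Ana', 10),   -- same name and department as row 1
    (3, 'Ben', 20);

-- Number the rows within each (name, department) group, oldest id first.
WITH numbered AS (
    SELECT employee_id,
           employee_name,
           department_id,
           ROW_NUMBER() OVER (
               PARTITION BY employee_name, department_id
               ORDER BY employee_id
           ) AS rn
    FROM employees_demo
)
-- Keep one row per group: employee_id 1 (Ana) and employee_id 3 (Ben).
SELECT employee_id, employee_name, department_id
FROM numbered
WHERE rn = 1;
```

Unlike SELECT DISTINCT employee_name, department_id, this version still returns employee_id, because the duplicate check and the returned columns are decoupled.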

6. Performance Considerations

  • Efficiency
    Using DISTINCT can be resource-intensive, especially on large datasets. To remove duplicates, the database typically has to sort or hash the whole result set, which adds CPU and memory cost on top of the base query.
  • Indexes
    An index on the columns used with DISTINCT can let the database read values in sorted order, or from the index alone, instead of sorting the full table. Whether the optimizer actually uses the index depends on the database and the query.
  • Query Optimization
    Select only the columns you need, since every extra column widens the comparison. If DISTINCT is only hiding duplicates produced by a join, consider restructuring the query so the duplicates are never generated.
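The following sketch illustrates two of these points: adding an index on the column being deduplicated, and restructuring a join so that duplicates are never produced. The tables employees_demo and departments_demo, the index name idx_employees_demo_department, and the sample rows are all assumptions; whether either change helps depends on your database and data, so compare execution plans (for example with EXPLAIN) before relying on it.

```sql
-- Hypothetical demo tables; not the article's real schema.
DROP TABLE IF EXISTS employees_demo;
DROP TABLE IF EXISTS departments_demo;

CREATE TABLE departments_demo (
    department_id   INTEGER PRIMARY KEY,
    department_name VARCHAR(100)
);

CREATE TABLE employees_demo (
    employee_id   INTEGER PRIMARY KEY,
    department_id INTEGER
);

INSERT INTO departments_demo (department_id, department_name) VALUES
    (10, 'Sales'),
    (20, 'Support'),
    (30, 'Legal');        -- no employees

INSERT INTO employees_demo (employee_id, department_id) VALUES
    (1, 10),
    (2, 10),
    (3, 20);

-- An index on the deduplicated / joined column (index name is illustrative).
CREATE INDEX idx_employees_demo_department
    ON employees_demo (department_id);

-- Join plus DISTINCT: the join yields Sales twice, then DISTINCT removes the repeat.
SELECT DISTINCT d.department_name
FROM departments_demo d
JOIN employees_demo e ON e.department_id = d.department_id;

-- EXISTS: each department is tested once, so no duplicates arise. Returns Sales and Support.
SELECT d.department_name
FROM departments_demo d
WHERE EXISTS (
    SELECT 1
    FROM employees_demo e
    WHERE e.department_id = d.department_id
);
```

The two queries return the same departments here because department names are unique in the sample data; if two departments shared a name, the DISTINCT version would merge them and the EXISTS version would not.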

Conclusion

The DISTINCT keyword is a simple way to retrieve unique values in SQL queries: it removes duplicate rows from the result set so that each combination of selected values appears once. Using it well means knowing that it applies to the whole select list, keeping its performance cost in mind, and reaching for alternatives such as GROUP BY or ROW_NUMBER() when you need aggregates or control over which duplicate is kept.

© 2024 Java Tutorial Online. All rights reserved.