Removing Duplicate Lines from a File Using a Shell Script

Duplicate lines can clutter a file and make it difficult to work with, especially in large datasets. Fortunately, a simple shell script lets you remove them and streamline your data-processing tasks. In this article, we’ll walk through creating a shell script that takes a file name as an argument and removes all duplicate lines from it.

Prerequisites:

To follow along with this tutorial, you’ll need:

  1. Basic knowledge of the Linux/Unix command line.
  2. A text editor to write and edit shell scripts.
  3. A terminal to execute the shell script.

Creating the Shell Script:

Let’s start by creating a shell script named remove_duplicates.sh. Open your preferred text editor and create a new file with the following content:

#!/bin/bash
# Check if a file name is provided as an argument
if [ $# -ne 1 ]; then
    echo "Usage: $0 <file>"
    exit 1
fi
# Check if the file exists
if [ ! -f "$1" ]; then
    echo "File '$1' not found!"
    exit 1
fi
# Remove duplicate lines from the file
awk '!seen[$0]++' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
echo "Duplicate lines removed successfully from $1"

Save the file and make it executable by running the following command in your terminal:

chmod +x remove_duplicates.sh

Understanding the Script:

  • #!/bin/bash: This line specifies the interpreter to be used, which is bash in this case.
  • if [ $# -ne 1 ]; then: Checks if the number of arguments provided to the script is not equal to 1. If not, it displays a usage message and exits.
  • if [ ! -f "$1" ]; then: Checks if the file specified as an argument exists. If not, it displays an error message and exits.
  • awk '!seen[$0]++' "$1" > "$1.tmp" && mv "$1.tmp" "$1": This line removes duplicate lines from the file using awk. The associative array seen counts how many times each line ($0) has been encountered, and the pattern !seen[$0]++ is true only the first time a line appears, so awk prints each line exactly once while preserving the original order. The unique lines are redirected to a temporary file, which then replaces the original file (see the short demonstration after this list).
  • echo "Duplicate lines removed successfully from $1": Displays a success message indicating that duplicate lines have been removed from the file.
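
To see the awk idiom in isolation, you can try it on a small throwaway file; the file name sample.txt and its contents here are purely illustrative. Note that sort -u would also remove duplicates, but it sorts the file in the process, whereas the awk approach keeps the first occurrence of each line in its original position.

# Create a small test file (illustrative data)
printf 'apple\nbanana\napple\ncherry\nbanana\n' > sample.txt

# Print each line only the first time it is seen, preserving order
awk '!seen[$0]++' sample.txt
# Expected output:
# apple
# banana
# cherry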

Usage:

To use the script, simply provide the file name as an argument when executing the script. For example:

./remove_duplicates.sh myfile.txt
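
For a quick end-to-end check, you could run the script against a small test file of your own; the file name names.txt and its contents below are purely illustrative.

# Create a test file containing duplicates (illustrative data)
printf 'alice\nbob\nalice\ncarol\nbob\n' > names.txt

./remove_duplicates.sh names.txt
# Duplicate lines removed successfully from names.txt

cat names.txt
# alice
# bob
# carol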
