uniq.py: find unique elements

Finding the unique elements in a column of one or more files is a very common task.

For that reason, installing the bio package also installs a script called uniq.py, which prints the unique elements of a column.

Using uniq.py

If file_1.txt contains:

A
B
C
A
B

then the command:

uniq.py file_1.txt

will print:

A
B
C

Adding the -c flag also reports how many times each element occurs. The command:

uniq.py -c file_1.txt

will print:

2           A
2           B
1           C
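
The two behaviors above can be sketched in a few lines of Python. This is a minimal illustration, not the actual uniq.py source: the function names unique_elements and count_elements are hypothetical, and the output order (first-seen for unique elements, most frequent first for counts) is an assumption that happens to match the example output shown here.

```python
from collections import Counter


def unique_elements(values):
    """Return the distinct values, keeping the order of first appearance (assumed)."""
    # dict keys preserve insertion order, so duplicates collapse in place
    return list(dict.fromkeys(values))


def count_elements(values):
    """Return (count, value) pairs, most frequent first (assumed ordering)."""
    return [(count, value) for value, count in Counter(values).most_common()]


if __name__ == "__main__":
    values = ["A", "B", "C", "A", "B"]  # the contents of file_1.txt

    # Equivalent of: uniq.py file_1.txt
    print("\n".join(unique_elements(values)))

    # Equivalent of: uniq.py -c file_1.txt (count and element separated by a tab)
    for count, value in count_elements(values):
        print(f"{count}\t{value}")
```

Writing the count and the element with a tab between them keeps the output easy to pass to other column-oriented tools.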

uniq.py can also read from standard input:

cat file_1.txt | uniq.py -c

Why does uniq.py exist?

We could use the UNIX construct:

sort | uniq -c | sort -rn

The problem with the above is that the columns it prints are not tab-separated. We could also use the Entrez Direct tool called:

sort-uniq-count-rank

but that requires Entrez Direct to be installed.

Additional utility

uniq.py can read any column of a file, and the delimiter can be changed as well. For example, to read the second column of three comma-separated files:

uniq.py -c -d , -f 2 file1 file2 file3
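
As a rough sketch of what field and delimiter selection involves, the Python below extracts a 1-based column from several inputs and counts the values. It is not the uniq.py implementation: read_field and count_field are hypothetical names, and two behaviors are assumptions made for illustration, namely that counts are pooled across all files and that lines with too few columns are skipped.

```python
from collections import Counter


def read_field(lines, field=1, delim="\t"):
    """Yield the 1-based `field` of each non-empty line, split on `delim`."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        parts = line.split(delim)
        if len(parts) >= field:  # assumption: skip lines that are too short
            yield parts[field - 1]


def count_field(streams, field=1, delim="\t"):
    """Count the values of one column pooled over several inputs (assumed behavior)."""
    counts = Counter()
    for stream in streams:
        counts.update(read_field(stream, field=field, delim=delim))
    return counts.most_common()


if __name__ == "__main__":
    # In-memory stand-ins for two comma-separated files
    file1 = ["x,A,1\n", "y,B,2\n"]
    file2 = ["z,A,3\n"]
    for value, count in count_field([file1, file2], field=2, delim=","):
        print(f"{count}\t{value}")
```

Any iterable of lines works as input, so open file handles or sys.stdin can be passed in place of the lists.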

I don’t usually advocate rewriting UNIX tools, but in this case writing a better uniq makes a lot of sense.

Usage

uniq.py -h
usage: uniq.py [-h] [-f 1] [-d ''] [-c] [fnames [fnames ...]]

positional arguments:
  fnames             file names

optional arguments:
  -h, --help         show this help message and exit
  -f 1, --field 1    field index (1 by default)
  -d '', --delim ''  delimiter (tab by default)
  -c, --count        produce counts