bio uniq: find unique elements

The need to find unique elements within columns of different files is very common.

Using uniq.py

If file_1.txt contains:

A
B
C
A
B

then the command:

bio uniq file_1.txt

will print:

A
B
C

The -c flag, used as:

bio uniq -c file_1.txt

will print:

2           A
2           B
1           C

uniq.py can also read from standard input:

cat file_1.txt | bio uniq -c

Why does uniq.py exist?

We could use the UNIX construct:

sort | uniq -c | sort -rn

The problem with the above is that the columns it prints are not tab separated. We may also use the Entrez Direct tool called:

sort-uniq-count-rank

but that requires entrez-direct to be installed.
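To illustrate the formatting problem, the typical output of uniq -c pads each count with leading spaces and separates it from the value with a single space, so it has to be reshaped before it can be treated as a two-column table. The following is a minimal sketch of that reshaping, not part of bio; the function name uniq_c_to_tsv is hypothetical, and the padded input layout is assumed from common uniq implementations.

```python
def uniq_c_to_tsv(lines):
    """Turn `uniq -c` style lines ('      2 A') into 'count<TAB>value' strings.

    Hypothetical helper for illustration only; it is not part of bio.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Skip empty lines in the input.
            continue
        # Drop the left padding, then split on the first space only,
        # so that values containing spaces are kept intact.
        count, _, value = line.lstrip().partition(" ")
        out.append(f"{count}\t{value}")
    return out
```

Each returned string can be printed directly, giving one tab-separated count and value per line.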

In addition, bio uniq can read different columns of a file, and the delimiter may be changed as well. To find the unique elements listed in the second column of three comma-separated files:

bio uniq -c -d , -f 2 file1 file2 file3
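The idea behind the command above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation of uniq.py: the function name count_column and its parameters are hypothetical, and the most-frequent-first ordering is assumed from the example output shown earlier.

```python
import csv
from collections import Counter


def count_column(streams, field=1, delimiter=","):
    """Count the values found in one column across several inputs.

    streams   -- iterable of line iterables (open files, lists of strings)
    field     -- 1-based column index, mirroring the -f flag
    delimiter -- column separator, "," for CSV or "\t" for tab-delimited
    Hypothetical sketch; not the code used by bio uniq.
    """
    counts = Counter()
    for stream in streams:
        for row in csv.reader(stream, delimiter=delimiter):
            # Ignore rows that are too short to contain the requested column.
            if len(row) >= field:
                counts[row[field - 1]] += 1
    # Most frequent first; ties keep the order of first appearance.
    return counts.most_common()
```

Printing each (value, count) pair as f"{count}\t{value}" yields tab-separated output, which is what the plain sort | uniq -c pipeline does not provide.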

I don’t usually advocate rewriting UNIX tools, but in this case writing a better uniq makes a lot of sense.

Usage

bio uniq -h
usage: bio [-h] [-f 1] [-c] [-t] [fnames [fnames ...]]

positional arguments:
  fnames           file names

optional arguments:
  -h, --help       show this help message and exit
  -f 1, --field 1  field index (1 by default)
  -c, --count      produce counts
  -t, --tab        tab delimited (default is csv)