bio uniq: find unique elements
The need to find unique elements within columns of different files is very common.
If file 1 contains:
A
B
C
A
B
then the command:
bio uniq file_1.txt
prints:
A
B
C
The -c flag also produces counts. Used as:
bio uniq -c file_1.txt
it prints:
2 A
2 B
1 C
bio uniq can also read from standard input:
cat file_1.txt | bio uniq -c
We could use the UNIX construct:
sort | uniq -c | sort -rn
The problem with the above is that the columns it prints are not tab separated. We could also use a tool from Entrez Direct, but for that entrez-direct must be installed.
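For comparison, tab-separated counts take only a few lines of Python. This is a minimal sketch and not the bio uniq implementation; the function name count_lines is made up for illustration, and ties are listed in first-seen order, which may differ from the order the UNIX pipeline produces.

```python
from collections import Counter


def count_lines(lines):
    """Return 'count<TAB>value' rows, most frequent value first."""
    # Strip line endings and skip blank lines.
    values = [line.strip() for line in lines if line.strip()]
    # most_common() sorts by descending count; ties keep first-seen order.
    return [f"{count}\t{value}" for value, count in Counter(values).most_common()]


# The contents of file_1.txt from the example above, one element per line.
rows = count_lines(["A\n", "B\n", "C\n", "A\n", "B\n"])
print("\n".join(rows))
```

Because each row joins the count and the value with a single tab, the output can be passed straight to tools that expect tab-delimited columns.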
bio uniq can read different columns of a file, and the delimiter may be changed as well. To find the unique elements listed in the second column of three comma-separated files:
bio uniq -c -d , -f 2 file1 file2 file3
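What a command like this computes can be sketched in Python. The sketch below is an illustration under assumptions, not the code behind bio uniq: the function count_field and its parameters are hypothetical, and in-memory streams stand in for the three files.

```python
import csv
import io
from collections import Counter


def count_field(streams, field=2, delimiter=","):
    """Count the values found in one column (1-based) across several streams."""
    counts = Counter()
    for stream in streams:
        for row in csv.reader(stream, delimiter=delimiter):
            # Skip rows too short to contain the requested column.
            if len(row) >= field:
                counts[row[field - 1].strip()] += 1
    # Most frequent value first; ties keep first-seen order.
    return counts.most_common()


# Three in-memory "files" stand in for file1, file2 and file3.
files = [io.StringIO("x,A\ny,B\n"), io.StringIO("z,A\n"), io.StringIO("w,C\nv,B\n")]
for value, count in count_field(files):
    print(f"{count}\t{value}")
```

With real files, each stream would be opened with open(name, newline="") as the csv module recommends. Counting across all streams in a single Counter is what makes the result a combined tally rather than one tally per file.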
I don’t usually advocate rewriting UNIX tools, but in this case writing a better uniq makes a lot of sense.
bio uniq -h
usage: bio [-h] [-f 1] [-c] [-t] [fnames [fnames ...]]

positional arguments:
  fnames           file names

optional arguments:
  -h, --help       show this help message and exit
  -f 1, --field 1  field index (1 by default)
  -c, --count      produce counts
  -t, --tab        tab delimited (default is csv)