Help! - Compare two columns in text files and extract whole row if they match.

GavinC

Registered
Hey,

As part of my research in genetics, I need to work with large text files that displease excel and so I am trying to work out how to do it via terminal. Unfortunately, my terminal skills are nigh on zero and I don't have time to wait around while I learn it from scratch so I wondered if anyone might be able to help...

Essentially, I have two files. The first has only one column (a list of genes). The second has many columns (of which the third column is also a list of genes) and many thousands of rows.

What I want to do is compare column 1 in file 1 with column 3 in file 2 and extract the whole row corresponding to items that match into a new file.

By way of example...

if I had the following two files...

File 1:
gene1
gene5
gene3

File 2:
ant banana gene2 bee honey horse
red green gene1 purple yellow gold
one three gene3 seven two two
elf gnome gene7 fairy ork wizard
beans chips gene10 steak sausage eggs

then the output I would want would be...

red green gene1 purple yellow gold
one three gene3 seven two two


If anybody can tell me how to do that I would be very grateful...
 
Try

Code:
$ grep -F -i -w -f file1 file2

or if the letter case does not matter, use

Code:
$ grep -F -w -f file1 file2

Grep command is used in searching files. Normally the search text is given in the command as an argument, but -f file1 makes the file1 as a list of searched text.

The "re" is grep name stands for regular expression, i.e. a expression that matches different text (for example Ar[0-9][a-c] matches both Ar9a and Ar4c). Option -F makes the file1 lines to act as normal text.

Finally, option -w makes the command to search words, instead of part of texts (in the example file2, line

beans chips gene10 steak sausage eggs

is not found).
 
Take a look at the join utility. In Terminal do man join.

Code:
JOIN(1)                   BSD General Commands Manual                  JOIN(1)

NAME
     join -- relational database operator

SYNOPSIS
     join [-a file_number | -v file_number] [-e string] [-o list] [-t char]
          [-1 field] [-2 field] file1 file2

DESCRIPTION
     The join utility performs an ``equality join'' on the specified files and
     writes the result to the standard output.  The ``join field'' is the
     field in each file by which the files are compared.  The first field in
     each line is used by default.  There is one line in the output for each
     pair of lines in file1 and file2 which have identical join fields.  Each
     output line consists of the join field, the remaining fields from file1
     and then the remaining fields from file2.
...

You want something like

Code:
join -1 1 -2 3 file1 file2

and you will have to experiment with the -o option to get the output you want.
 
Did you try it? Join needs the files to be sorted based on the key field. So, you have to first sort the files

Code:
sort -k1 < file1 > file1_s
sort -k3 < file2 > file2_s

and then use the join

Code:
join -1 1 -2 3 file1_s file2_s

(or use the grep, as I wrote). I guess join is better in memory usage, since the file1 is not read into memory.
 
Thanks, guys! The grep thing works like a treat for me - bringing back all the rows where the gene matches rather than just the first one, which is perfect as the same gene may appear multiple times but with different information in the rest of the row and I needed to extract all of those rows. So perfect!

Since that seemed easy for you, I thought I might venture a harder question.

I actually have 17 of these big files with the gene in the 2nd column (I know I said 3rd before, but it's actually the second). Some how I need to generate a lists of extracted rows for items in the gene column (i.e. the rest of the row can be different so long as the gene matches) that are found in all 17 files, then another list for those found in 16 out of 17 files, then another for 15 out of 17 files, then another for 14 of 17 files ect ect.

I realise this is probably a much bigger task, but if anyone could just give me an idea about how they would go about it...?
 
Back
Top