SQLTeam.com | Weblogs | Forums

SQL advice needed

I need a tool that will look up all the lines from file1 in file2 (something like grep -f file1 file2), but it needs to be really fast for big files. I expect file1 to change for every lookup and to have up to a few million lines; file2 will stay mostly the same (updated from time to time) and may contain a few billion lines (around 150GB file size).

file1 format: item
file2 format: item:value

I heard a program with a hashmap would be the fastest option, but apparently I'd need a lot of RAM for it. So I'd like it done with a NoSQL database like Apache Cassandra. I'd need a table with two columns, item and value, and a program using the API that would take all the lines of file1, look them up in the table, and return item:value for every match.
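For what it's worth, the hashmap approach only needs RAM for file1 (the smaller, changing file), not for the 150GB file2, which can be streamed once per lookup batch. A minimal sketch of that idea, assuming the formats described above (file1: one item per line; file2: item:value):

```python
# Sketch: keep file1 in a hash set, then stream file2 once and
# emit every item:value line whose item appears in the set.
# RAM cost is proportional to file1 only (a few million short keys),
# not to the 150GB file2.

def lookup(file1_path, file2_path, out_path):
    with open(file1_path) as f1:
        wanted = {line.rstrip("\n") for line in f1}
    with open(file2_path) as f2, open(out_path, "w") as out:
        for line in f2:
            item, sep, _value = line.partition(":")
            if sep and item in wanted:
                out.write(line)
```

Each run is a single sequential read of file2, so throughput is bounded mostly by disk read speed.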

I need it to be really fast, so I'd like to know what speeds I can expect. Is disk speed the only important factor?

I've tried doing it with plain MySQL using two tables (a file1 table and a file2 table) and matching them with a JOIN, but the speed was not satisfactory.
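A common reason that JOIN is slow is a missing index on file2's join column, so the engine falls back to scanning the big table. A minimal sketch of the indexed version (shown with SQLite via Python's stdlib sqlite3 so it's self-contained; the same principle applies in MySQL):

```python
import sqlite3

# Sketch: JOIN lookup with an index on file2's item column.
# Without idx_file2_item the engine may scan all of file2 per lookup;
# with it, each match is an index probe.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE file1 (item TEXT);
    CREATE TABLE file2 (item TEXT, value TEXT);
    CREATE INDEX idx_file2_item ON file2 (item);
""")
con.executemany("INSERT INTO file1 VALUES (?)", [("apple",), ("cherry",)])
con.executemany("INSERT INTO file2 VALUES (?, ?)",
                [("apple", "1"), ("banana", "2"), ("cherry", "3")])
rows = con.execute("""
    SELECT f2.item || ':' || f2.value
    FROM file1 AS f1
    JOIN file2 AS f2 ON f1.item = f2.item
""").fetchall()
```

In MySQL you would also check EXPLAIN output to confirm the index is actually being used for the join.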

Before you get too deep into a Hadoop/Cassandra solution, check this out:

https://livefreeordichotomize.com/2019/06/04/using_awk_and_r_to_parse_25tb/

Another oldie but goodie in the same vein, regarding fast grep: ripgrep is apparently the fastest, or in many cases at least faster than regular grep:

https://blog.burntsushi.net/ripgrep/

It's hard to say how well that will scale to comparing files when one of them is 150GB, but there are a lot of efficiencies in the classic text utilities that Hadoop/Spark/etc. do not have (the first two links describe these pretty well).
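In the same spirit as those links: since file2 is mostly static, sorting it once up front (e.g. with LC_ALL=C sort) turns each lookup batch into a single sequential merge pass with near-zero RAM. A rough sketch, assuming both inputs are byte-sorted and file2 lines are item:value:

```python
# Sketch: merge-join two sorted files with two cursors.
# file1: sorted items, one per line. file2: sorted item:value lines.
# Both must be sorted in the same (byte) order, e.g. LC_ALL=C sort.

def merge_lookup(sorted_file1, sorted_file2, out_path):
    with open(sorted_file1) as f1, open(sorted_file2) as f2, \
         open(out_path, "w") as out:
        line2 = f2.readline()
        for line1 in f1:
            item = line1.rstrip("\n")
            # Advance file2 until its key is >= the wanted item.
            while line2 and line2.split(":", 1)[0] < item:
                line2 = f2.readline()
            if line2 and line2.split(":", 1)[0] == item:
                out.write(line2)
```

This is essentially what the join(1) utility does; the one-time cost is sorting file2, after which every lookup batch is a single streaming read.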