Using Awk and R to parse 25 TB of DNA sequencing data [I’m not the author]
In theory this would have made each query quicker, because the desired SNP data would reside in only a few of the Parquet chunks within a given region. I ran a query on our already-loaded data to get the list of the unique SNPs, their positions, and their chromosomes, then joined the resulting SNP-bin table back onto the raw data and re-wrote it partitioned by bin:

    # Join the raw data with the snp bins
    data_w_bin <- raw_data %>%
      inner_join(sdf_broadcast(snp_bins), by = 'snp_name') %>%
      group_by(chr_bin) %>%
      arrange(Position) %>%
      spark_write_parquet(
        path = DUMP_LOC,
        mode = 'overwrite',
        partition_by = c('chr_bin')
      )

Notice the use of sdf_broadcast(): this lets Spark know it should send this dataframe to all nodes.
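For context (this sketch is not from the original post), here is one way the chr_bin-partitioned dump could be queried afterwards with sparklyr. The names are assumptions: `sc` is an open Spark connection, `snp_bins` is a local lookup table mapping snp_name to chr_bin as above, and 'rs1234' is a placeholder SNP. Because chr_bin is a partition column, filtering on it lets Spark prune to the matching chr_bin directories instead of scanning the whole dump.

    library(dplyr)
    library(sparklyr)

    # hypothetical example: find which bin the SNP of interest was assigned to
    target_bin <- snp_bins %>%
      filter(snp_name == 'rs1234') %>%
      pull(chr_bin)

    # read the partitioned Parquet dump; filtering on the partition column
    # (chr_bin) allows partition pruning, so only a few chunks are touched
    snp_data <- spark_read_parquet(sc, name = 'binned_data', path = DUMP_LOC) %>%
      filter(chr_bin == !!target_bin, snp_name == 'rs1234') %>%
      collect()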
Source: livefreeordichotomize.com