Investigating the 900GB “Collection #1-5” password leaks with R

In simple terms, Spark parallelizes data processing tasks over a cluster, i.e. a network of computers that each process part of the work (preferably in memory, which makes it quite fast). In Spark, a “job” is the general term for a set of data transformations that end in a so-called “action”, e.g. writing to disk or sending the (condensed) results of a query to a frontend like R. The important point is to process as much data as possible in Spark and load only what’s necessary into R, because R can only handle as much data as fits into memory and does not process data in parallel by default.
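To make the split between transformations and actions concrete, here is a minimal sketch using the sparklyr and dplyr packages. It assumes a local Spark installation and uses three made-up sample lines in an assumed `email:password` format. The file, the table name `leaks`, and the variable names are purely illustrative and not taken from the actual dumps.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via sparklyr::spark_install()).
sc <- spark_connect(master = "local")

# Made-up sample lines standing in for the real dump files (illustrative only).
sample_file <- tempfile(fileext = ".txt")
writeLines(
  c("alice@example.com:hunter2",
    "bob@example.com:123456",
    "carol@example.org:password"),
  sample_file
)

# Register the text file as a Spark table with a single column named `line`.
# memory = FALSE avoids caching the whole file in Spark's memory.
leaks <- spark_read_text(sc, name = "leaks", path = sample_file, memory = FALSE)

# Transformations only: this builds a query but does not run it yet.
# regexp_extract() is a Spark SQL function that is passed through to Spark.
domain_counts <- leaks %>%
  mutate(domain = regexp_extract(line, "@([^:;]+)", 1L)) %>%
  filter(domain != "") %>%
  count(domain) %>%
  arrange(desc(n)) %>%
  head(20)

# Action: Spark runs the job and only the condensed result is loaded into R.
top_domains <- collect(domain_counts)

spark_disconnect(sc)
```

Everything up to `collect()` is executed by Spark. Only the small summary table ends up in R's memory, which is the pattern the paragraph above argues for.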

Source: timogrossenbacher.ch