Extracting campaign finance data from gnarly PDFs using deep learning

I’ve just completed an experiment in using deep learning to extract information from TV station political advertising disclosure forms. The resulting model achieves 90% accuracy at extracting total spending from the PDFs in the held-out test set, which shows that deep learning can generalize surprisingly well to previously unseen form types. This is a difficult data set that is very relevant to journalism, and improvements in technique will be immediately useful to campaign finance reporting.
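To make the accuracy figure concrete, here is a minimal sketch of how a score like this could be computed on a held-out set. It assumes accuracy means an exact match between the extracted total and the labeled total after normalizing the dollar formatting; that definition is my assumption for illustration, and the names `parse_amount` and `extraction_accuracy` are hypothetical, not taken from the experiment's code.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Normalize a dollar string such as '$1,250.00' to a Decimal.

    Returns None when the text is missing or is not a valid number.
    """
    if text is None:
        return None
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def extraction_accuracy(predicted, actual):
    """Fraction of documents whose predicted total equals the labeled total.

    Assumption: a prediction counts as correct only on an exact match
    after normalization. An unparseable prediction counts as wrong.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = 0
    for pred, truth in zip(predicted, actual):
        pred_value = parse_amount(pred)
        if pred_value is not None and pred_value == parse_amount(truth):
            correct += 1
    return correct / len(actual)


# Hypothetical held-out examples, invented purely for illustration.
predicted_totals = ["$1,250.00", "980", "", "4,000"]
labeled_totals = ["1250", "$980.00", "300", "4,100"]
score = extraction_accuracy(predicted_totals, labeled_totals)
```

Comparing normalized `Decimal` values rather than raw strings keeps formatting differences such as `$1,250.00` versus `1250` from being counted as extraction errors.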

Source: jonathanstray.com