Characterizing secret leakage in public GitHub repositories
There are (at least) two good sources of information for secret detection: the GitHub search API and the GitHub public dataset maintained in Google BigQuery. The first phase of the process is to query these sources for candidate files that may contain secrets, using a carefully crafted set of search terms.
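As a rough sketch of that first phase, the snippet below builds request URLs for GitHub's code search endpoint from a list of search terms. The terms in `SEARCH_TERMS` and the helper name `build_search_url` are assumptions made for illustration, not the original list; actually sending the requests would also need an authenticated token and some care around rate limits, which are left out here.

```python
from urllib.parse import urlencode

# GitHub REST API endpoint for code search.
SEARCH_ENDPOINT = "https://api.github.com/search/code"

# Illustrative search terms only (an assumption, not the original list).
SEARCH_TERMS = ["api_key", "secret_key", "access_token", "client_secret"]


def build_search_url(term, per_page=100, page=1):
    """Return the code search URL for one search term and result page."""
    params = {"q": term, "per_page": per_page, "page": page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# One URL per search term; each response would list candidate files.
candidate_queries = [build_search_url(term) for term in SEARCH_TERMS]
```

Each URL in `candidate_queries` corresponds to one query; the files returned by those queries form the candidate set for the next phase.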
Given a set of candidate files, the next thing you’re going to need is a set of regular expressions for popular key formats.
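To make this concrete, here is a small, hedged set of patterns for a few widely documented key formats. The dictionary name `KEY_PATTERNS` and the choice of formats are assumptions for illustration; they are not the original list, and real key formats change over time, so treat these as a starting point.

```python
import re

# Illustrative patterns for a few commonly cited key formats (assumed set).
KEY_PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    # Google API keys: "AIza" followed by 35 URL-safe characters.
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    # PEM private key header line, with an optional algorithm label.
    "pem_private_key": re.compile(
        r"-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"
    ),
}
```

Formats with a distinctive fixed prefix, like the ones above, are the easiest to match with few false positives; keys that are just random strings with no prefix are much harder to pick out by pattern alone.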
The regular expressions can then be used to scan the candidate files from the first phase, with any matches considered “candidate secrets”.
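The scanning step could look something like the following minimal sketch. The names `CandidateSecret`, `PATTERNS`, and `scan_text` are hypothetical, and a single pattern is used only to keep the example short; it assumes the file contents have already been downloaded as text.

```python
import re
from typing import NamedTuple


class CandidateSecret(NamedTuple):
    pattern_name: str   # which key format matched
    line_number: int    # 1-based line in the scanned text
    value: str          # the matched string


# A single illustrative pattern (assumption); a real scan would use many.
PATTERNS = {"aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}")}


def scan_text(text, patterns=PATTERNS):
    """Return every regex match in `text` as a candidate secret."""
    found = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in patterns.items():
            for match in pattern.finditer(line):
                found.append(CandidateSecret(name, line_number, match.group(0)))
    return found
```

For example, calling `scan_text` on a file's contents returns a list of `CandidateSecret` records. A match is only a candidate at this point: a string can fit a key format and still be a placeholder or documentation example.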