Finding AWS keys with Google BigQuery on GitHub data

Theory: AWS Keys Not In Your Code

We have been told thousands of times not to leave our AWS keys in source code.
I have also heard that AWS searches repositories itself and, when it finds keys, tells the owners to remove them.

Practice: How many are there?

What do you think: how many keys are still on GitHub?
For an estimate we can run an experiment ourselves; it takes just a couple of minutes.

Tempting: Github Repos on Google BigQuery

I became aware that Google BigQuery hosts the full data set from GitHub through a Twitter message from @github. Thanks to Markus!

Howto: Throw together a query

With a little SQL and the BigQuery examples I arrived at the following query.
(Not by the most extreme stretch of the imagination would I call myself a data scientist.)

SELECT
  UNIQUE(REGEXP_EXTRACT(line, r'(AKIA[A-Z0-9]{16})'))
FROM (
  SELECT
    SPLIT(content, '\n') line,
    id
  FROM
    [bigquery-public-data:github_repos.sample_contents]
  WHERE
    NOT binary
    AND content CONTAINS 'aws'
  HAVING line LIKE '%AKIA%')

It searches for public AWS keys (access key IDs), which have an obvious signature: they begin with AKIA, followed by 16 uppercase letters or digits.
My reasoning is that where a public key can be found, a secret key may be close by.
(As my Grandma said: Where there is smoke, there is also fire.)
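To try the same signature on your own files before touching BigQuery, here is a minimal Python sketch. It reuses the regular expression from the query; the function name `find_access_key_ids` and the sample text are made up for illustration, and the two keys in the sample are the EXAMPLE placeholders listed in this post.

```python
import re

# Same signature the query uses: "AKIA" followed by 16 uppercase letters or digits.
ACCESS_KEY_ID = re.compile(r"AKIA[A-Z0-9]{16}")


def find_access_key_ids(text):
    """Return the unique access key IDs in text, in order of first appearance."""
    seen = []
    for match in ACCESS_KEY_ID.findall(text):
        if match not in seen:  # mirrors UNIQUE(...) in the query
            seen.append(match)
    return seen


# Hypothetical file content; only the EXAMPLE placeholder keys are used.
sample = """
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
backup_key = AKIAIOSFODNN7EXAMPLE
other = AKIAI44QH8DHBEXAMPLE
not_a_key = AKIA-too-short
"""

print(find_access_key_ids(sample))
```

The duplicate is reported once and the too-short string is ignored, which is roughly what the UNIQUE and REGEXP_EXTRACT pair does in the query.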

Run: Seeing is Believing

Samples can be run with the BigQuery console without any fuss within the free monthly quota.
The run finishes in seconds and delivers 35 results, most of which look like actual working keys.
Very few of them are obviously edited and thus non-functional; those are the only ones I dare list here:

....
AKIAI44QH8DHBEXAMPLE
AKIAIOSFODNN7EXAMPLE
...33 more...

The query was run on the sample data set only.
Extrapolating to the full data suggests there are more than 500 unique keys on GitHub.
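As a rough sanity check on that extrapolation, the two figures can be related with simple arithmetic. The sketch below assumes the count of unique keys scales linearly with the amount of data, which is my assumption and may well not hold (the same key can appear in many files). The function name is hypothetical; the post gives only the two counts and no sampling ratio.

```python
def implied_max_sample_fraction(unique_in_sample, claimed_total):
    """Largest sample fraction consistent with the claim, under linear scaling.

    If the sample holds `unique_in_sample` keys and the full data holds at
    least `claimed_total`, the sample can cover at most this share of it.
    """
    return unique_in_sample / claimed_total


# Figures from the post: 35 keys in the sample, more than 500 estimated overall.
fraction = implied_max_sample_fraction(35, 500)
print(f"{fraction:.0%}")
```

In other words, "more than 500" corresponds to the sample covering at most about 7% of the full data set, if linear scaling is a fair approximation.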

Caveats

  • The query works only for public keys. Secret keys have no easily distinguishable signature, apart from their length and character set.
  • I used the sample data set to satisfy my own curiosity. Running across the full 3 TB+ dataset would probably have exceeded my free monthly quota.
  • This cannot be counted as a proven vulnerability: no attempt was made to retrieve any secret key, let alone to try one out, and I do not recommend doing either.
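To illustrate the first caveat, here is a small Python sketch of what matching on length and character set alone looks like when checking your own files. It assumes a secret key is a run of exactly 40 characters drawn from letters, digits, `/`, `+` and `=`; that format is my assumption and is not stated in this post, and the names `CANDIDATE` and `secret_key_candidates` are hypothetical.

```python
import hashlib
import re

# Assumed shape of a secret key: exactly 40 base64-style characters,
# not embedded in a longer run of the same characters.
CHARSET = r"[A-Za-z0-9/+=]"
CANDIDATE = re.compile(rf"(?<!{CHARSET}){CHARSET}{{40}}(?!{CHARSET})")


def secret_key_candidates(text):
    """Return every 40-character run that merely looks like a secret key."""
    return CANDIDATE.findall(text)


# A SHA-1 digest in hex is also 40 characters from the same character set,
# so an ordinary commit hash is reported as a candidate.
digest = hashlib.sha1(b"hello").hexdigest()
line = f"commit {digest}"

print(secret_key_candidates(line) == [digest])
```

The false positive on a plain hash shows why such a pattern is far less useful than the AKIA prefix: it flags any 40-character token, so each hit still needs a manual look.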

Conclusion

Stay tuned: