hakopako

Full-stack Engineer's blog

Import lots of data into DynamoDB from an S3 bucket


Overview

In this post, I build a DynamoDB reference table containing about 42 million rows using the import function. I tuned the job so that it finishes within 24 hours.

Official document: AWS Data Pipeline

Let's get started!!!!

Generate the import file

Output your import files in the format below, then put them into a directory in an S3 bucket (a small generation sketch follows the sample). The format is now JSON; before that, it was a text format that contained control characters.

Format sample:

{"id":{"s":"a"},"name":{"s":"apple"},"price":{"n":"1"}}
{"id":{"s":"b"},"name":{"s":"book"},"price":{"n":"5"}}
{"id":{"s":"c"},"name":{"s":"cup"},"price":{"n":"4"}}

Note:
You can NOT select a specific file for import, only a directory (strictly speaking, the "directory" is an S3 key prefix). So put nothing other than import files in that directory. AWS doesn't check file extensions, so files without an extension are also fine.

How does it work in the background?

 ・ AWS Data Pipeline is required for import/export.
 ・ EMR runs in the background.
 ・ You set what percentage of DynamoDB write/read capacity may be consumed during import/export (see the sketch below).
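
To make the moving parts concrete, here is a hedged sketch of driving the pipeline from Python with boto3 instead of the console. The definition file and the parameter ids (myDDBTableName, myInputS3Loc, myDDBWriteThroughputRatio) are assumptions modeled on the stock import template, so verify them against the template you actually use:

import json
import boto3

client = boto3.client("datapipeline")

# Register an empty pipeline shell.
created = client.create_pipeline(name="ddb-import", uniqueId="ddb-import-001")
pipeline_id = created["pipelineId"]

# Hypothetical file holding the import template's objects, already
# converted to the API's pipelineObjects/parameterObjects shape.
with open("pipeline-definition.json") as f:
    definition = json.load(f)

client.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=definition["pipelineObjects"],
    parameterObjects=definition.get("parameterObjects", []),
    parameterValues=[
        {"id": "myDDBTableName", "stringValue": "reference-table"},
        {"id": "myInputS3Loc", "stringValue": "s3://my-bucket/import/"},
        # Fraction (0.0-1.0) of the table's write capacity EMR may consume.
        {"id": "myDDBWriteThroughputRatio", "stringValue": "1.0"},
    ],
)

client.activate_pipeline(pipelineId=pipeline_id)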

What I actually tried

Cost

As the official document says, one m1.small and one m1.xlarge EMR instance launch in the background by default. And the upper limit of DynamoDB write capacity is 10,000.

Estimated time    DynamoDB           EMR      Total
24h               $8.6 (W: 500)      $15.7    $24.3
4h                $9.8 (W: 3,000)    $2.6     $12.4
2h                $9.9 (W: 6,000)    $1.3     $11.2

(W: provisioned write capacity units)

Note:
To import 42 million lines in 24h, you need 42,000,000 / (24 × 60 × 60) = 486.11... writes per second, which is why the 24h plan provisions W: 500. I went with the 24h plan this time because EMR performance was still an unknown.
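
The same back-of-the-envelope math for the other plans, as a quick Python check (the results line up with the W values in the cost table):

rows = 42_000_000  # lines to import

for hours in (24, 4, 2):
    writes_per_sec = rows / (hours * 60 * 60)
    print(f"{hours:>2}h plan -> {writes_per_sec:,.0f} writes/sec")

# 24h plan -> 486 writes/sec    (provision ~500 WCU)
#  4h plan -> 2,917 writes/sec  (provision ~3,000 WCU)
#  2h plan -> 5,833 writes/sec  (provision ~6,000 WCU)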

Speed

  • In the end, the sample import file was 6 GB, 42M lines.
  • 10 minutes to upload the file to S3.
  • 20 minutes to launch Data Pipeline and EMR.
  • 24 hours for the import (matching my calculation).

Performance

  • DynamoDB capacity was fully consumed for the whole run.
  • With the 24h plan, CPU usage held constant at about 2%.
  • The 2h plan would probably work just as well.

Conclusion

I know there are other ways, such as using the SDK. But it's hard to fully utilize DynamoDB capacity through the SDK. Especially when writing lots of data, I'm sure it leads to dismal consequences: you end up having to control how many processes run in parallel, which depends on server spec and network, and so on.
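
For contrast, here is a minimal sketch of the SDK route with boto3 (the table and file names are hypothetical). batch_writer groups puts into BatchWriteItem requests of up to 25 items and retries unprocessed ones, but a single process like this won't come close to saturating thousands of write capacity units, which is exactly the parallelism headache described above:

import json
import boto3

table = boto3.resource("dynamodb").Table("reference-table")  # hypothetical name

# batch_writer batches puts into BatchWriteItem calls (up to 25 items
# each) and automatically resends unprocessed items.
with table.batch_writer() as batch:
    with open("import-data") as f:
        for line in f:
            item = json.loads(line)
            batch.put_item(Item={
                "id": item["id"]["s"],
                "name": item["name"]["s"],
                "price": int(item["price"]["n"]),
            })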

Thus, I highly recommend the DynamoDB import function.
Have a nice DynamoDB life!