Make your analysis:
Each piece of ETL should be isolated and independent
Extract is just moving the data into a common place
CSV and JSON are popular landing formats since they do not enforce a schema (CSV values are all untyped text). A relational db is also popular but requires a strict schema
python pandas
Transform: most dev time is spent on this
Transform on a per-item basis, 1 in, 1 out. Then parallelize it
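A minimal sketch of the per-item idea, assuming records are plain dicts with hypothetical `name` and `age` fields. Because `transform` takes one record and returns one record with no shared state, it can be handed to a worker pool unchanged. Threads are used here for simplicity; for CPU-heavy transforms a process pool would probably be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record: dict) -> dict:
    """1 in, 1 out: one raw record in, one cleaned record out, no shared state."""
    return {
        "name": record["name"].strip().title(),  # normalize whitespace and casing
        "age": int(record["age"]),               # cast untyped text to an int
    }


def transform_all(records, max_workers=4):
    # pool.map preserves input order, so result[i] corresponds to records[i].
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transform, records))


# Hypothetical raw rows, as they might come out of a CSV extract.
raw = [
    {"name": " ada lovelace ", "age": "36"},
    {"name": "alan turing", "age": "41"},
]
cleaned = transform_all(raw)
```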
After it is transformed, store it in a datatype-opinionated place (one that enforces types). Which one depends on the final load place
Idempotence: the same thing happens every time. It should be ok to rerun with the same data, and a rerun should not add duplicate records. For updating records, make the extract only get new data
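One way to get an idempotent load, sketched with the standard library's `sqlite3`. The `users` table and its columns are made up for illustration. Keying the table on a natural identifier and using SQLite's `INSERT OR REPLACE` means a rerun overwrites rows instead of duplicating them. Other databases have their own upsert syntax.

```python
import sqlite3


def load(conn, rows):
    """Load rows so that rerunning with the same data leaves the table unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (email TEXT PRIMARY KEY, name TEXT)"
    )
    # INSERT OR REPLACE is SQLite-specific: a row whose primary key already
    # exists is replaced rather than inserted a second time.
    conn.executemany(
        "INSERT OR REPLACE INTO users (email, name) VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
batch = [("ada@example.com", "Ada"), ("alan@example.com", "Alan")]

load(conn, batch)
load(conn, batch)  # rerun with the same data

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

After both runs `row_count` is still 2, which is the property the note is asking for.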
You can't do relationship discovery until you load the data into a relational db, since you don't know the primary keys yet. You can have null records that fill in when the related data comes in, or null foreign keys that you update periodically.
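A small sketch of the null-foreign-key approach, again with `sqlite3`. The `customers` and `orders` tables, and the choice of email as the matching field, are assumptions for illustration. The order is loaded with a null `customer_id`, and a periodic backfill step resolves it once the related customer row exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT UNIQUE
);
CREATE TABLE orders (
    id             INTEGER PRIMARY KEY,
    customer_email TEXT,                              -- value carried in the source data
    customer_id    INTEGER REFERENCES customers(id)   -- null until resolved
);
""")


def backfill_customer_ids(conn):
    """Periodic job: fill in foreign keys for orders whose customer now exists."""
    conn.execute("""
        UPDATE orders
        SET customer_id = (
            SELECT c.id FROM customers c WHERE c.email = orders.customer_email
        )
        WHERE customer_id IS NULL
    """)
    conn.commit()


# The order arrives before its customer, so the foreign key starts out null.
conn.execute("INSERT INTO orders (customer_email) VALUES (?)", ("ada@example.com",))
backfill_customer_ids(conn)
before = conn.execute("SELECT customer_id FROM orders").fetchone()[0]

# The related customer is loaded later; the next backfill resolves the key.
conn.execute("INSERT INTO customers (email) VALUES (?)", ("ada@example.com",))
backfill_customer_ids(conn)
after = conn.execute("SELECT customer_id FROM orders").fetchone()[0]
```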
Surrogate key: represents an object in the database itself. It is generated internally by the system and is invisible to the user or application.
Natural/business key: has meaning outside the db
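To make the distinction concrete, here is a sketch with a hypothetical `products` table: `id` is a surrogate key generated by the database, while `sku` is a natural key that means something outside it. Other tables would reference `id`, so the business identifier can change without breaking relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id   INTEGER PRIMARY KEY,    -- surrogate key: assigned by the database
        sku  TEXT NOT NULL UNIQUE,   -- natural/business key: meaningful outside the db
        name TEXT
    )
""")

# The caller supplies only business data; the database picks the surrogate id.
cur = conn.execute(
    "INSERT INTO products (sku, name) VALUES (?, ?)", ("SKU-1001", "Widget")
)
surrogate_id = cur.lastrowid

# The business renumbers the product; the surrogate key stays the same.
conn.execute("UPDATE products SET sku = ? WHERE id = ?", ("SKU-2001", surrogate_id))
row = conn.execute("SELECT id, sku FROM products").fetchone()
```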