ETL

Make your analysis:

  • Reproducible
    • analysis code
      • use SCM
    • data sources
      • exact data source files are stored somewhere they can be used again
    • algorithmic randomness
      • set RNG seed in SCM so it is more reproducible
  • Consistent
    • check smaller datasets into SCM, or can keep a small sample in SCM
  • Productionizable
    • common data format
    • shared data formats and ETL code between dev and prod

Each piece of ETL should be isolated and independent

Extract

is just moving the data into a common place

CSV and json are popular formats since they are datatype agnostic. RDB is also popular but requires a strict schema

python pandas

Transform

most dev time spent on this

  • datatype conversion
  • unit conversion
  • derived calculations (from multiple source attributes)
  • enriching and hydrating

Transform per item basis, 1 in 1 out. then parallelize it

After it is transformed, store it in an datatype opinionated place. depends on final load place

Load

Idempotence: same thing happens every time. you want it to be ok if you rerun it with the same data, it should not add duplicate records. for updating records, make the extract only get new data

Relationship discovery

do can’t do relationship discovery until you load the data into a relation db, since you don’t know primary keys yet. you can have null records that fill in when the related data come in, or null foreign keys that you update periodically.

surrogate key: A surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or application. natural/business key: has meaning outside db