Ever since it was open-sourced by Airbnb back in 2015, Apache Airflow has established itself as the de facto standard for orchestration within the data space. Thanks to its feature-rich User Interface (UI), its ability to manage a wide range of operations, and particularly its no-nonsense and intuitive approach to organizing workflows via DAGs and tasks, it quickly eclipsed existing orchestrators like Spotify's Luigi (which was open-sourced in 2012) and the Hadoop ecosystem's Oozie.

Airflow is used today by data engineering teams around the world for an ever-expanding list of use cases, supported by custom operators, in-house abstractions, and a myriad of hacks to leverage some of its aging features.

On the other hand, the way we interact with data today is very different from how things were seven years ago:

We no longer just want to run Spark jobs. Instead, we think about the quality, state, and lineage of our heterogeneous data assets.

We can no longer tolerate weeks-long development cycles to generate new data assets. Instead, data pipelines should be written, tested, and deployed as efficiently and as fast as possible.

We can no longer rely on a small centralized data engineering team that builds and maintains all the DAGs. Instead, we aim for self-service capabilities and automation that would allow a larger set of contributors to build data assets and push them to production.
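To make the "DAG of tasks" abstraction mentioned above concrete, here is a minimal sketch of an Airflow pipeline in the classic Airflow 2.x style. The DAG id, schedule, and the extract/transform/load task names are illustrative assumptions, not taken from any particular production pipeline:

```python
# Minimal sketch of an Airflow DAG: three Python tasks wired into a linear dependency chain.
# All names (example_etl, extract/transform/load) are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder for pulling raw data from a source system.
    return {"rows": 42}


def transform():
    # Placeholder for cleaning / reshaping the extracted data.
    pass


def load():
    # Placeholder for writing the result to a warehouse or data lake.
    pass


with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies are declared explicitly, which is what gives Airflow
    # its directed acyclic graph of tasks.
    t_extract >> t_transform >> t_load
```

Note that the workflow is expressed purely in terms of tasks and their ordering; the data assets those tasks produce are implicit, which is exactly the limitation the "quality, state, and lineage" point above is getting at.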