data-engineer-roadmap's Introduction

Modern Data Engineer Roadmap 2021

Roadmap to becoming a data engineer in 2021

This roadmap aims to give a complete picture of the modern data engineering landscape and serve as a study guide for aspiring data engineers.


Note to beginners

Beginners shouldn't feel overwhelmed by the vast number of tools and frameworks listed here. A typical data engineer would master a subset of these tools over several years, depending on their company and career choices.


🔥 We just launched Data Stack Jobs, a clean and simple job site for Data Stack Engineers!

Text version for visually impaired users

Data Engineer Roadmap

Nice to have 😎

Text version for visually impaired users

Data Engineer Roadmap Extras

Contributions are welcome 💜

Please raise an issue to discuss your suggestions or open a Pull Request to request improvements.

Reviewers 🔎

Huge thank you to @whydidithavetobebugs, @sawidis, @marclamberti and @mpyeager for reviewing this roadmap.

About us 👋🏼

datastack.tv is the learning platform for the modern data stack. We create concise screencast video tutorials for data engineers. Browse our courses here!

License 🗞

Copyright © 2021 Alexandra Abbas - [email protected]

data-engineer-roadmap's People

Contributors

alexandraabbas, dwsmith1983, mkvorwerck

data-engineer-roadmap's Issues

Add GitLab Pipelines to CI/CD

Hey there,

Thanks for putting together this awesome resource! I'd strongly suggest adding GitLab Pipelines to the CI/CD section. It's an extremely useful platform and, as far as I know, competition from it is what prompted GitHub Actions to emerge.

Hope this helps!

digdag

digdag is a nice workflow scheduler, and much easier to set up than Airflow.

Which Mind Map app?

Can I ask which mind map app you used to create this beautiful outline? It's simply awesome! <3
BTW the structure and content of your course is refreshing to say the least. Just purchased your annual membership. Looking forward to more such great content.

Agile ways of working

Not really a technical skill, but considering that most tech companies have adopted agile methodologies, I think having some knowledge of how Scrum or Kanban works is also an important skill for any data engineer.

Cloud specific versions

Would there be interest in creating cloud-specific versions of the roadmap that go into more detail on each product choice?

I work at Google Cloud so would be happy to contribute towards that.

I think this is a great way to show all the options, and I now use it as a reference for the wider ecosystem when people ask, so thank you for creating this.

Cloud Basics

As we are in the cloud era, it's good to have knowledge of basic cloud architecture and common services from AWS/Azure etc.

I see you did mention a few cloud-based tools, but it's good to have a basic understanding of cloud services and how they relate to each other. Maybe a section named "Cloud fundamentals".

What do you think?

Cheers!

Encryption in Transit and at Rest

Hello,

I would suggest including encryption in transit (SSL/TLS) in the networking part and refining the data security & privacy part with encryption at rest.
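
To make the encryption-in-transit suggestion concrete, here is a minimal sketch of a client-side TLS configuration using Python's standard `ssl` module. The helper name `make_client_context` and the TLS 1.2 floor are illustrative assumptions, not something the roadmap prescribes.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context with certificate verification on."""
    # create_default_context() turns on certificate and hostname checks.
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2 (an illustrative policy choice).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

A context like this could then be passed to a socket or HTTP client so that data leaving the pipeline is encrypted and the server's identity is checked.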

Supporting reading material

I am a complete beginner who decided to follow the roadmap a couple of months ago. Sharing a few books that helped me get started.

  1. How computers work: Code: The Hidden Language of Computer Hardware and Software Link
  2. How the internet works: Introduction to Networking Link
  3. API : An Introduction to APIs by Brian Cooksey, Stephanie Briones (Illustrator), Danny Schreiber Link

I am a self-learner who is looking forward to receiving further support on next steps.

Suggestion: Data Modeling & Execution Engine

Hi everyone,
I LOVE THIS. Thanks!
I would humbly add, from my experience, 3 domains:

  1. Data modeling techniques.
    These are critical to really own DWHing and data engineering in general, and would include:
    • Kimball/Inmon classical approaches as a base.
    • Snowflake/Star as more advanced views.
    • Data Vault 2.0
  2. The subdomain of database administration/understanding that is execution engine tweaking, which would cover:
    • Execution plans and hints
    • Indexes and their different types
    • The top 3-5 DB vendors' proprietary tricks for this domain.
  3. ETL & ELT concepts at different scales and velocities: when to use which and why, etc.
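
As a rough illustration of two of the points above (a star-style fact/dimension layout, and how an index changes an execution plan), here is a small sketch using Python's built-in `sqlite3` module. The table names, columns, and sample rows are invented for the example, and SQLite stands in for whichever engine you actually use; plan output differs between vendors and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A tiny star-style layout: one dimension table and one fact table (hypothetical names).
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount REAL
);
INSERT INTO dim_customer VALUES (1, 'DE'), (2, 'FR');
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

QUERY = "SELECT SUM(amount) FROM fact_sales WHERE customer_id = ?"


def plan(connection, sql, params):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn, QUERY, (1,))   # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_sales_customer ON fact_sales(customer_id)")
plan_after = plan(conn, QUERY, (1,))    # the planner can now use the index
total = conn.execute(QUERY, (1,)).fetchone()[0]

print(plan_before)
print(plan_after)
print(total)
```

Comparing the two printed plans shows the kind of before/after reading that execution-plan work involves.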

Concurrency models are missing

For a modern data engineer, knowledge of concurrency models is important.

  1. A data engineer should know the difference between concurrency and parallelism.
  2. A data engineer should know the difference between task parallelism and data parallelism.
  3. Threads vs. processes. Example in Python: libraries threading vs multiprocessing, what are the differences, and what problems does Python have with threading.
  4. A pretty typical scenario for modern data integration: call n APIs every x seconds / minutes / hours. How can that be done with good performance? One way would be to use asynchronous programming.
  5. Actor model might be good to know as well.
  6. DAG (example: Apache Airflow) vs state machines (example: Amazon Step Functions) vs ... . This is actually covered by 'Data structures and algorithms', but it might be good to mention it as an example of how knowing them can help a data engineer.
  7. Parallel programming using techniques like CUDA on GPU.
  8. Functional programming is also 'nice to have' (but not obligatory).

If you agree on at least some of the points, I can prepare the text.
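
The "call n APIs" scenario in point 4 can be sketched with Python's `asyncio`. This is a minimal illustration only: `fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the source names and delays are made up.

```python
import asyncio


async def fetch(source: str, delay: float) -> str:
    """Stand-in for an HTTP call; asyncio.sleep simulates network latency."""
    await asyncio.sleep(delay)
    return f"{source}: ok"


async def poll_all(sources: dict[str, float]) -> list[str]:
    # gather() runs the coroutines concurrently on one thread and returns
    # results in the order the awaitables were passed in.
    return await asyncio.gather(
        *(fetch(name, delay) for name, delay in sources.items())
    )


results = asyncio.run(poll_all({"orders": 0.02, "users": 0.01, "prices": 0.03}))
print(results)
```

Because the waits overlap, the total runtime is close to the slowest call rather than the sum of all calls, which is the main appeal of this model for I/O-bound integration work.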

Refine Knowledge of Algorithms

I think it would be interesting to refine the knowledge of algorithms with Big O and Big Theta notation and code complexity.
"Algorithms" on its own seems vague. Moreover, things like SOLID and clean code would help as well.
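
As a small illustration of why complexity notation matters in practice, the sketch below counts comparisons for a linear search, O(n), and a binary search, O(log n), over the same sorted list. The function names and the input size are arbitrary choices for the example.

```python
def linear_search_steps(items, target):
    """Count comparisons made by a left-to-right scan."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count comparisons made by binary search on a sorted list."""
    steps = 0
    low, high = 0, len(items) - 1
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps


data = list(range(1_000_000))
linear = linear_search_steps(data, 999_999)   # worst case: scans every element
binary = binary_search_steps(data, 999_999)   # halves the search range each step
print(linear, binary)
```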

List of Storage systems

As a data engineer, one works with a lot of storage services, which can be block storage or object storage. So maybe you can mention storage systems like:

  • HDFS
  • S3
  • Azure Blob
  • Google Cloud Storage

What is the license for these images?

Love this graphic, and would love to use it (with attribution obviously). Could you clarify the license under which these images are distributed?

DynamoDB categorization

Hi,

First and foremost, nice job on characterizing concepts and the fields. I really liked the picture.

On the issue itself: Why do you characterize DynamoDB as Key-Value and not as Wide-Column?
If I were asked to characterize the difference, I would say that a key-value store (like Redis or RocksDB) is something where you know nothing about the value part (except maybe its datatype), whereas a wide-column store is still a key-value store, since you always need a primary key (aka partition key), but one where you can split the value into multiple sub-columns and have secondary indexes (aka sort key).

At least someone on Wikipedia agrees with me.
Am I missing something?
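
The distinction drawn above can be sketched with plain Python dictionaries. This is a toy model under stated assumptions, not DynamoDB's API or storage format: `kv_store`, `wide_store`, `put_row`, and `query_range` are hypothetical names, and the keys and values are invented.

```python
# Key-value model: the store treats the value as an opaque blob.
kv_store: dict[str, bytes] = {}
kv_store["user#42"] = b'{"name": "Ada", "plan": "pro"}'

# Wide-column style model: partition key -> sort key -> named columns,
# so the store can address rows and columns inside a partition.
wide_store: dict[str, dict[str, dict[str, str]]] = {}


def put_row(partition_key, sort_key, columns):
    wide_store.setdefault(partition_key, {})[sort_key] = columns


def query_range(partition_key, start, end):
    """Return rows of one partition whose sort key lies in [start, end], in key order."""
    rows = wide_store.get(partition_key, {})
    return [(key, rows[key]) for key in sorted(rows) if start <= key <= end]


put_row("user#42", "order#2021-01-05", {"total": "30"})
put_row("user#42", "order#2021-03-17", {"total": "12"})
put_row("user#42", "order#2021-02-11", {"total": "8"})

matches = query_range("user#42", "order#2021-01-01", "order#2021-02-28")
print(matches)
```

In the first model the only operation is "get the blob for this key"; in the second, a range query within a partition is possible because the store understands the structure of the value.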

Thanks

Pub/Sub messaging

Hello,

Not sure how to edit a png to add a new pull request, so I'm creating this issue.

I believe GCP's Pub/Sub messaging system deserves to be under the "Messaging" section too.

What about modeling / simulation?

Right now this is more an invitation to discussion than a request.

What modeling techniques does a data engineer need and for what use cases? Does anybody do simulation before actually designing a system / solution? If yes: what are the tools / approaches?

The following potential use cases came to my mind:

  • logical planning: if one wants to build an app which solves real-world problems, one wants to understand those problems. To avoid forgetting some aspects of the problems, or to discover non-obvious ones, one can do real-world simulation. In Python there is a library called simpy. Does anybody have experience using it? Also, diagrams (e.g. UML) can be used to do logical modeling for almost everything: state diagrams, data flow, components etc.
  • behaviour of distributed clusters (databases): I saw the following tool https://github.com/domclick/tuchanka, which imitates failures of a cluster node, waits for recovery, fixes the failed node and then continues testing in a cycle. Is anybody doing something similar?
  • communication, networking, latency: I don't have much experience working with real-time environments. What are the typical techniques for simulating / modeling real-time connectivity issues? Does one measure performance with a small dataset and then extrapolate the results, or are there other approaches?
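
To give the simulation question something concrete to react to, here is a minimal sketch of a single-server queue simulation using only the standard library (it does not use simpy). The function name, the exponential arrival and service assumptions, and all the rates are illustrative choices, not a recommendation.

```python
import random


def simulate_queue(arrival_rate, service_rate, n_jobs, seed=0):
    """Single-server FIFO queue; returns the mean time jobs wait before service."""
    rng = random.Random(seed)
    clock = 0.0            # arrival time of the current job
    server_free_at = 0.0   # time at which the server finishes its current job
    total_wait = 0.0
    for _ in range(n_jobs):
        clock += rng.expovariate(arrival_rate)            # next arrival
        start = max(clock, server_free_at)                # wait if the server is busy
        total_wait += start - clock
        server_free_at = start + rng.expovariate(service_rate)
    return total_wait / n_jobs


light = simulate_queue(arrival_rate=1.0, service_rate=2.0, n_jobs=10_000)
heavy = simulate_queue(arrival_rate=1.8, service_rate=2.0, n_jobs=10_000)
print(light, heavy)
```

Running the same model at different load levels shows how waiting time grows as arrivals approach the service capacity, which is the kind of question one might want answered before sizing a real system.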

Need to add Apache Ozone

Ozone moved into GA late last year and has seen some adoption since then.

It can handle billions of objects, so it is hailed as the replacement for HDFS, which struggles at around 400 million files.

Created a PR with updates to the markdown doc but someone else will need to fix the image.

Russian translation

Excellent, it would be great to have this in Russian! Nice.

Math and Statistics should be added

I know this is a Data Engineer roadmap and not a Data Science roadmap. However, I still think that Maths and Statistics should have their own box in the roadmap (even though they could be included under "Fundamentals").

Being a Data Engineer without enough math and statistics knowledge falls into the "Danger Zone" in Drew Conway's diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Other than that, your roadmap is awesome @alexandraabbas, thanks a lot.

Which BPM would you suggest

All applications have business processes. I saw you mentioned workflow scheduling, but can that also be used for a BPM kind of system?

Why Infrastructure as a code

This is an interesting and somewhat controversial point, but why infrastructure as code (IAAC)?
I am very active in this area, but I have found that it is still not that well established, and not all cloud engineers and infrastructure teams follow it.

Tech Stack too Overwhelming

Looking at the roadmap, it's overwhelming to see so many frameworks and technologies to learn.

My suggestion is to divide the technologies horizontally and vertically by years of experience. This would narrow down the roadmap, or at least give a clearer roadmap than just listing the tech stack. Along with the division, mention projects to build for the corresponding years of experience.

The above is solely my opinion.
