Roadmap to becoming a data engineer in 2021

data-engineer-roadmap's Introduction

This roadmap aims to give a complete picture of the modern data engineering landscape and serve as a study guide for aspiring data engineers.

Note to beginners

Beginners shouldn't feel overwhelmed by the vast number of tools and frameworks listed here. A typical data engineer masters a subset of these tools over several years, depending on their company and career choices.

🔥 We just launched Data Stack Jobs — a clean and simple job site for Data Stack Engineers!

Text version for visually impaired users

Nice to have 😎

Text version for visually impaired users

Contributions are welcome 💜

Please raise an issue to discuss your suggestions or open a Pull Request to request improvements.

Reviewers 🔎

Huge thank you to @whydidithavetobebugs, @sawidis, @marclamberti and @mpyeager for reviewing this roadmap.

About us 👋🏼

datastack.tv is the learning platform for the modern data stack. We create concise screencast video tutorials for data engineers. Browse our courses here!

License 🗞

Copyright © 2021 Alexandra Abbas

data-engineer-roadmap's Issues

Add GitLab Pipelines to CI/CD

Hey there,

Thanks for putting together this awesome resource! I'd strongly suggest adding GitLab Pipelines to the CI/CD section. It's an extremely useful platform and, as far as I know, competition from it is what prompted GitHub Actions to emerge.

Hope this helps!

Add Microsoft Excel/OpenOffice Calc

These tools are very useful in everyday work.

Test.

//URl

digdag

Digdag is a nice workflow scheduler, and much easier to set up than Airflow.

AWS CloudFormation as a general recommendation

In my opinion, AWS CloudFormation as a general recommendation doesn't make sense. I see Terraform with much more usage and a bigger community, plus it is cloud-agnostic.

Missing Data Governance knowledge

Which Mind Map app?

Can I ask which mind map app you used to create this beautiful outline? It's simply awesome! <3
BTW the structure and content of your course is refreshing to say the least. Just purchased your annual membership. Looking forward to more such great content.

Data science

Engineer

Agile ways of working

Not really a technical skill, but considering that most tech companies have adopted agile methodologies, I think having some knowledge of how Scrum or Kanban works is also an important skill for any data engineer.

Cloud specific versions

Would there be interest in creating cloud-specific versions of the roadmap that go into more detail for each product choice?

I work at Google Cloud so would be happy to contribute towards that.

I think this is a great way to show all the options, and I now use it as a reference for the wider ecosystem when people ask, so thank you for creating it.

Cloud Basics

As we are in the cloud era, it's good to have knowledge of basic cloud architecture and common services from AWS, Azure, etc.

I see you did mention a few cloud-based tools, but it's good to have a basic understanding of cloud services and how they relate to each other. Maybe a section named "Cloud fundamentals".

What do you think?

Cheers!

Encryption in Transit and at Rest

Hello,

I would suggest including encryption in transit (SSL/TLS) in the networking part and refining the data security & privacy part with encryption at rest.
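To illustrate the "in transit" half of this suggestion, here is a minimal sketch using only Python's standard `ssl` module. The helper name `make_client_tls_context` is hypothetical, and requiring TLS 1.2 as the floor is an assumption rather than something the roadmap prescribes. Encryption at rest usually needs a third-party library or a storage-level feature, so it is not shown.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context with certificate checks enabled.

    Hypothetical helper for illustration only.
    """
    # create_default_context() turns on certificate verification and
    # hostname checking for client connections.
    ctx = ssl.create_default_context()
    # Assumption: refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_tls_context()
```

A context like this can be handed to standard-library clients that accept an `ssl.SSLContext`, so that data sent over the network is encrypted and the server's identity is verified.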

Supporting reading material

I am a complete beginner who decided to follow the roadmap a couple of months ago. Sharing a few books that helped me get started.

How computers work: Code: The Hidden Language of Computer Hardware and Software
How the internet works: Introduction to Networking
APIs: An Introduction to APIs by Brian Cooksey, Stephanie Briones (Illustrator), Danny Schreiber

I am a self-learner who is looking forward to receiving further support on next steps.

Tasks manager

Hello, I haven't seen frameworks like Celery https://docs.celeryproject.org/en/stable/index.html
or Spring Cloud Data Flow https://spring.io/projects/spring-cloud-dataflow#overview

In my personal experience, I had to create many batch pipelines using these. Now, with Airflow, I'm planning to move some of them. But there is still some legacy code I can't change :) so the knowledge to maintain it is necessary.

Suggestion: Data Modeling & Execution Engine

Hi everyone,
I LOVE THIS. Thanks!
I would humbly add, from my experience, 3 domains:

Data modeling techniques.
These are critical to really own DWHing and data eng in general and would include:

Kimball/Inmon classical approaches as a base.
Snowflake/star schemas as more advanced views.
Data Vault 2.0

The subdomain of database admin/understanding that is Execution Engine tweaking, that would hold:

Execution plans and hints
Indexes and their different types
The top 3-5 DB vendors' proprietary tricks for this domain.

ETL & ELT concepts at different scales and velocities. When to use which and why, etc.
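As a rough illustration of the ETL vs. ELT distinction mentioned in the last point, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for a warehouse. The table names, the sample rows, and the helper functions are all hypothetical. In ETL the cleaning happens in application code before loading. In ELT the raw rows are loaded first and cleaned with SQL inside the database.

```python
import sqlite3

# Hypothetical raw input: order id and an amount that arrives as messy text.
RAW_ORDERS = [("a1", " 10.50 "), ("a2", "3"), ("a3", "bad")]


def parse_amount(text):
    """Return the amount as a float, or None if it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def run_etl(conn):
    """ETL: transform in application code, then load only the clean rows."""
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
    clean = []
    for order_id, raw_amount in RAW_ORDERS:
        amount = parse_amount(raw_amount)
        if amount is not None:
            clean.append((order_id, amount))
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", clean)


def run_elt(conn):
    """ELT: load the raw rows as-is, then transform with SQL in the database."""
    conn.execute("CREATE TABLE orders_raw (order_id TEXT, amount_text TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", RAW_ORDERS)
    # Keep only rows whose trimmed text starts with a digit, cast to REAL.
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT order_id, CAST(TRIM(amount_text) AS REAL) AS amount "
        "FROM orders_raw WHERE TRIM(amount_text) GLOB '[0-9]*'"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
```

Both paths end with the same clean table here. The practical difference is where the compute runs and whether the raw data stays available in the warehouse for later re-transformation.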

data_engineer_roadmap

OpenNebula?

Where do tools like OpenNebula fit in?
https://opennebula.io/

ClickHouse should be added to the "Data warehouses" section

https://github.com/ClickHouse/ClickHouse

Kibana in Visualize data

It'd be great if you could add Kibana under data visualization. It's one of the popular components of the ELK stack.

https://github.com/elastic/kibana

Concurrency models are missing

For a modern data engineer knowledge of concurrency models is important.

A data engineer should know the difference between concurrency and parallelism.
A data engineer should know the difference between task parallelism and data parallelism.
Threads vs. processes. Example in Python: libraries threading vs multiprocessing, what are the differences, and what problems does Python have with threading.
A pretty typical scenario for modern data integration: call n APIs each x sec / min / hours. How to do that with a good performance? One of the ways would be to use asynchronous programming.
Actor model might be good to know as well.
DAGs (example: Apache Airflow) vs. state machines (example: AWS Step Functions) vs. others. This is actually covered by 'Data structures and algorithms', but it might be good to mention it as an example of how that knowledge helps a data engineer.
Parallel programming using techniques like CUDA on GPU.
Functional programming is also 'nice to have' (but not obligatory).
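As an illustration of the "call n APIs" scenario from the list above, here is a minimal asynchronous sketch using only the standard library. The API names and the `fetch` and `poll_all` helpers are hypothetical, and `asyncio.sleep` stands in for a real network call, so this shows the concurrency pattern rather than a production client.

```python
import asyncio
import time


async def fetch(api_name: str, delay: float) -> str:
    """Pretend to call one API; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{api_name}: ok"


async def poll_all(api_names, delay: float = 0.2):
    """Start all calls concurrently and wait for every result."""
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(fetch(name, delay) for name in api_names))


def run_demo():
    """Run three simulated calls and report the results and elapsed time."""
    start = time.perf_counter()
    results = asyncio.run(poll_all(["api-a", "api-b", "api-c"]))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = run_demo()
```

Because the three waits overlap on a single thread, the total time is close to one call's latency rather than the sum of all three. This is the case where asynchronous I/O helps; CPU-bound work would still need processes or another form of parallelism.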

If you agree on at least some of the points, I can prepare the text.

Refine Knowledge of Algorithms

I think it would be interesting to refine the knowledge of algorithms with Big O and Big Theta notation and code complexity.
"Algorithms" on its own seems vague. Moreover, things like SOLID and clean code would help as well.
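To make the Big O point concrete, here is a small sketch that counts the loop iterations of a linear search, which is O(n), and a binary search, which is O(log n), on the same sorted list. The function names and the choice of 1024 elements are arbitrary and only for illustration.

```python
def linear_search_steps(items, target):
    """Count how many elements are inspected by a left-to-right scan."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Count how many midpoints are inspected by a binary search."""
    steps = 0
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return steps


data = list(range(1024))
linear_steps = linear_search_steps(data, 1023)
binary_steps = binary_search_steps(data, 1023)
```

Searching for the last element, the linear scan inspects every one of the 1024 items, while the binary search needs only on the order of log2(1024) = 10 probes. Doubling the input doubles the first count but adds only about one step to the second.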

MLflow

Data Processing Architectures

Hello! Congratulations on such an awesome roadmap. I think a data engineer should know about the lambda and kappa architectures. I think those are the base architectures from which to start building custom data processing architectures for specific problems. Here are some resources:

.

link to datastack.tv has expired cert

The certificate for datastack.tv expired on 11/08/2021

If the site is no longer in use the link should be removed.

Infographic has text as graphics and is inaccessible for visually impaired users

Could you publish a text version? Even as an outline.

List of Storage systems

As a data engineer, one works with a lot of storage services, which can be block storage or object storage. So maybe you can mention storage systems like:

HDFS
S3
Azure Blob
Google Cloud Storage

What is the license for these images?

Love this graphic, and would love to use it (with attribution obviously). Could you clarify the license under which these images are distributed?

DynamoDB categorization

Hi,

First and foremost, nice job on characterizing concepts and the fields. I really liked the picture.

On the issue itself: why do you characterize DynamoDB as key-value and not as wide-column?
If I were asked to characterize the difference, I would say that a key-value store (like Redis or RocksDB) is one where you know nothing about the value part (except maybe its datatype), whereas a wide-column store is still a key-value store, since you always need a primary key (aka partition key), but one where you can break the value into multiple sub-columns and have secondary indexes (aka sort key).
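The distinction drawn above can be sketched with plain Python dictionaries. This is a toy model under stated assumptions, not the DynamoDB API: the `PartitionedTable` class, its `put` and `query` methods, and the sample keys are all hypothetical.

```python
# Opaque key-value store: the store knows nothing about the value's structure.
kv_store = {}
kv_store["user#42"] = b'{"name": "Ada", "plan": "pro"}'


class PartitionedTable:
    """Toy table addressed by a partition key plus a sort key.

    Each item is a set of named attributes (sub-columns), and items within
    one partition can be queried in sort-key order.
    """

    def __init__(self):
        # partition key -> {sort key -> attribute dict}
        self._partitions = {}

    def put(self, partition_key, sort_key, attributes):
        """Insert or overwrite one item."""
        self._partitions.setdefault(partition_key, {})[sort_key] = dict(attributes)

    def query(self, partition_key, sort_key_prefix=""):
        """Return (sort_key, attributes) pairs in sort-key order."""
        items = self._partitions.get(partition_key, {})
        return [
            (sort_key, items[sort_key])
            for sort_key in sorted(items)
            if sort_key.startswith(sort_key_prefix)
        ]


table = PartitionedTable()
table.put("user#42", "profile", {"name": "Ada", "plan": "pro"})
table.put("user#42", "order#2021-02", {"total": 30})
table.put("user#42", "order#2021-01", {"total": 12})
orders = table.query("user#42", "order#")
```

In the first store, reading one field means fetching and decoding the whole blob. In the second, the store itself understands the key structure well enough to answer a range-style query such as "all orders for this user".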

At least someone in Wikipedia agrees with me.
Am I missing something?

Thanks

Meaning of CS Fundamentals?

Does "CS Fundamentals" mean Cloud Storage?

Lightweight markup languages

Somewhere early in the tree, maybe mention Markdown? It's useful for documentation, GitHub issues, and Jupyter.

Missing Microsoft SQL Server, Oracle DB and IBM DB2

I believe these are very old as well as very mature relational database solutions and should be added to this roadmap.

Data engineer

Pub/Sub messaging

Hello,

Not sure how to edit a png to add a new pull request, so I'm creating this issue.

I believe GCP's Pub/Sub messaging system deserves to be under the "Messaging" section too.

Suggestion: replace Pulumi with the AWS CDK

Great project! I would suggest replacing Pulumi with the AWS CDK https://github.com/aws/aws-cdk. Its variants, cdk8s and CDK for Terraform, already offer incredible utility given how short a time the projects have existed, and in my opinion the CDK is the dominant player in the Infrastructure as Code space.

Data way

What about modeling / simulation?

Right now this is more an invitation to discussion than a request.

What modeling techniques does a data engineer need and for what use cases? Does anybody do simulation before actually designing a system / solution? If yes: what are the tools / approaches?

The following potential use cases came to mind:

Logical planning: if one wants to build an app that solves real-world problems, one wants to understand those problems. To avoid forgetting some aspects of the problems, or to discover non-obvious ones, one can do real-world simulation. In Python there is a library called simpy. Does anybody have experience using it? Also, diagrams (e.g. UML) can be used to do logical modeling for almost everything: state diagrams, data flow, components, etc.
Behaviour of distributed clusters (databases): I saw the following tool, https://github.com/domclick/tuchanka, which imitates failures of a cluster node, waits for recovery, fixes the failed node, and continues testing in a cycle. Is anybody doing something similar?
Communication, networking, latency: I don't have much experience working with real-time environments. What are the typical techniques for simulating or modeling real-time connectivity issues? Does one measure performance with a small dataset and then extrapolate the results, or are there other approaches?

Power BI should be included, right?

CS

Need to add Apache Ozone

Ozone moved into GA late last year and has seen some adoption since then.

It can handle billions of objects, so it is hailed as a replacement for HDFS, which struggles at around 400 million files.

Created a PR with updates to the markdown doc but someone else will need to fix the image.

Russian translation

Excellent, it would be great to have this in Russian! Nice.

How about Azure Synapse Analytics

Hi,

How about including Azure Synapse Analytics among the modern data warehouse solutions?

Cheers,
Kostas

Math and Statistics should be added

I know this is a Data Engineer roadmap and not a Data Science roadmap. However, I still think that Maths and Statistics should have their own box in the roadmap (even though they could be included in "Fundamentals").

Being a Data Engineer without enough math and statistics knowledge falls into the "Danger Zone" in Drew Conway’s diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Other than that, your roadmap is awesome @alexandraabbas, thanks a lot.

Which BPM would you suggest?

All applications have business processes. I saw you mentioned workflow scheduling, but can that also be used for a BPM kind of system?

Why Infrastructure as Code?

This is an interesting and somewhat controversial point, but why IaC?
I am very active in this area, but I have found that it is still not that well established, and not every cloud engineer or infrastructure team follows it.

Tech Stack too Overwhelming

Looking at the roadmap, it's overwhelming to see so many frameworks and technologies to learn.

My suggestion is to divide the technologies horizontally and vertically with years of experience. This would narrow down the roadmap or give a clear road map than just mentioning the tech stack. Along with the division, mention the projects to be made for the corresponding years of experience.

The above is solely my opinion.